1. tf.feature_column.input_layer()

tf.feature_column.input_layer(
    features,
    feature_columns,
    weight_collections=None,
    trainable=True,
    cols_to_vars=None,
    cols_to_output_tensors=None
)

features: a dict. Each key must match the key of the corresponding column in feature_columns.

feature_columns: every column passed here must inherit from DenseColumn: numeric_column, embedding_column, bucketized_column, or indicator_column. If a feature is categorical, it must first be wrapped with an embedding_column or an indicator_column. Note the type requirement on the feature columns here.
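A minimal sketch of these requirements, assuming hypothetical 'age' and 'color' features (the vocabulary-list column is just one way to build a categorical column):

import tensorflow.compat.v1 as tf
tf.disable_v2_behavior()
sess = tf.Session()
features = {
    'age': [[23.0], [35.0]],        # keys must match the column keys below
    'color': [['red'], ['blue']]
}
age = tf.feature_column.numeric_column('age')   # already a DenseColumn
color = tf.feature_column.categorical_column_with_vocabulary_list(
    'color', ['red', 'blue', 'green'])          # categorical: not dense yet
color = tf.feature_column.indicator_column(color)  # wrap before input_layer
inputs = tf.feature_column.input_layer(features, [age, color])
sess.run(tf.global_variables_initializer())
sess.run(tf.tables_initializer())  # the vocabulary lookup uses a table
print(sess.run(inputs))  # each row: [age, one-hot color]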

2. tf.feature_column.indicator_column(categorical_column)

Converts a categorical column into a one-hot (multi-hot for multivalent features) representation.

This function can only wrap columns of the categorical_column_with_*, crossed_column, and bucketized_column types; indicator_column accepts any categorical_column_with_* column.

However, when the number of buckets/unique values (i.e. the one-hot dimension) is large, embedding_column is recommended instead.
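A minimal sketch of the embedding_column alternative, assuming a hypothetical high-cardinality 'store_id' feature:

import tensorflow.compat.v1 as tf
tf.disable_v2_behavior()
sess = tf.Session()
features = {'store_id': [['s_001'], ['s_982'], ['s_417']]}
store = tf.feature_column.categorical_column_with_hash_bucket(
    'store_id', hash_bucket_size=10000)
# Instead of a 10000-dim one-hot, learn a dense 8-dim embedding.
store_emb = tf.feature_column.embedding_column(store, dimension=8)
inputs = tf.feature_column.input_layer(features, [store_emb])
sess.run(tf.global_variables_initializer())  # initializes the embedding table
print(sess.run(inputs).shape)  # (3, 8)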

3. tf.feature_column.numeric_column: represents continuous (numeric) features.

tf.feature_column.numeric_column(
    key,
    shape=(1,),
    default_value=None,
    dtype=tf.float32,
    normalizer_fn=None
)

key: the name of the feature, i.e. the corresponding column name.
shape: the shape of the feature under this key. The default is (1,), but for multi-valued (e.g. one-hot style) features it is the actual dimension: shape=[n,], where n must match the width of every sample. In short, it is the dimension corresponding to key, and is not necessarily 1.
default_value: the value used when the feature is missing.
normalizer_fn: a function applied to every value under this feature; it can be customized for numeric transformations.
example:

import tensorflow.compat.v1 as tf
tf.disable_v2_behavior()
sess=tf.Session()
features = {
    'price': [[2.0], [30.0], [5.0], [100.0]]
}
item1_price = tf.feature_column.numeric_column("price")
columns = [
    item1_price
]
inputs = tf.feature_column.input_layer(features, columns)

init = tf.global_variables_initializer()
sess.run(tf.tables_initializer())
sess.run(init)
v=sess.run(inputs)
print(v)

output:
[[  2.]
 [ 30.]
 [  5.]
 [100.]]

shape must be set consistently with the data; with two values per sample:

features = {'price': [[2.0, 30.0], [30.0, 12.0], [2.0, 5.0], [3.0, 100.0]]}
item1_price = tf.feature_column.numeric_column("price", shape=[2,])

output:
[[  2.  30.]
 [ 30.  12.]
 [  2.   5.]
 [  3. 100.]]
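The normalizer_fn parameter described above is easy to demonstrate as well; a minimal sketch, assuming a hypothetical max-scaling transform:

import tensorflow.compat.v1 as tf
tf.disable_v2_behavior()
sess = tf.Session()
features = {'price': [[2.0], [30.0], [5.0], [100.0]]}
# Hypothetical normalization: scale by a known maximum of 100.
item1_price = tf.feature_column.numeric_column(
    "price", normalizer_fn=lambda x: x / 100.0)
inputs = tf.feature_column.input_layer(features, [item1_price])
sess.run(tf.global_variables_initializer())
print(sess.run(inputs))  # [[0.02] [0.3 ] [0.05] [1.  ]]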

4. tf.feature_column.bucketized_column: discretizes a continuous feature.

tf.feature_column.bucketized_column(source_column, boundaries)

source_column: must be a numeric_column.
boundaries: the bucket boundaries. boundaries=[0, 1, 2] produces the buckets (-inf, 0), [0, 1), [1, 2), [2, +inf), with ids 0, 1, 2, 3 respectively, i.e. four buckets in total.
example1: input is two-dimensional

import tensorflow.compat.v1 as tf
tf.disable_v2_behavior()
sess = tf.Session()
features = {'price': [[2.0, 30.0], [30.0, 12.0], [2.0, 5.0], [3.0, 100.0]]}
item1_price = tf.feature_column.bucketized_column(
    tf.feature_column.numeric_column("price", shape=[2,]),
    boundaries=[1, 10, 100])
columns = [item1_price]
# Input layer (data, feature column)
inputs = tf.feature_column.input_layer(features, columns)
# Initialize and run
init = tf.global_variables_initializer()
#sess.run(tf.tables_initializer())
sess.run(init)
v=sess.run(inputs)
print(v)

output:
[[0. 1. 0. 0. 0. 0. 1. 0.]
 [0. 0. 1. 0. 0. 0. 1. 0.]
 [0. 1. 0. 0. 0. 1. 0. 0.]
 [0. 1. 0. 0. 0. 0. 0. 1.]]

example2: input is one-dimensional

import tensorflow.compat.v1 as tf
tf.disable_v2_behavior()
sess = tf.Session()
features = {'price': [[2.0], [30.0], [5.0], [100.0]]}
item1_price = tf.feature_column.bucketized_column(
    tf.feature_column.numeric_column("price", shape=[1,]),
    boundaries=[1, 10, 100])
columns = [item1_price]
# Input layer (data, feature column)
inputs = tf.feature_column.input_layer(features, columns)
# Initialize and run
init = tf.global_variables_initializer()
#sess.run(tf.tables_initializer())
sess.run(init)
v=sess.run(inputs)
print(v)

output:
[[0. 1. 0. 0.]
 [0. 0. 1. 0.]
 [0. 1. 0. 0.]
 [0. 0. 0. 1.]]


5. tf.feature_column.categorical_column_with_identity: categorical identity column

tf.feature_column.categorical_column_with_identity("in", num_buckets=10, default_value=0)

The categorical identity column can be regarded as a special case of the bucketized column. In a generic bucketized column, each bucket represents a range of values (1960 to 1979, for example); in a categorical identity column, each bucket represents a single, unique integer. For example, if there are very many store ids but they have already been encoded from 0 upwards, this column can be used. The value range of this feature column is [0, num_buckets).
example1:
import tensorflow.compat.v1 as tf
tf.disable_v2_behavior()
sess=tf.Session()
features = {
    'b': [[0], [1], [9], [4]]  # with num_buckets=10, every value in column b must be less than num_buckets
}
b = tf.feature_column.categorical_column_with_identity("b", num_buckets=10, default_value=0)
b = tf.feature_column.indicator_column(b)
columns = [
    b
]
# Input layer (data, feature column)
inputs = tf.feature_column.input_layer(features, columns)
# Initialize and run
init = tf.global_variables_initializer()
sess.run(init)
v=sess.run(inputs)
print(v)

output:
[[1. 0. 0. 0. 0. 0. 0. 0. 0. 0.]
 [0. 1. 0. 0. 0. 0. 0. 0. 0. 0.]
 [0. 0. 0. 0. 0. 0. 0. 0. 0. 1.]
 [0. 0. 0. 0. 1. 0. 0. 0. 0. 0.]]
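When default_value is set, ids outside [0, num_buckets) are replaced by it instead of raising an error; a minimal sketch with hypothetical out-of-range data:

import tensorflow.compat.v1 as tf
tf.disable_v2_behavior()
sess = tf.Session()
features = {'b': [[0], [1], [42]]}  # 42 is out of range for num_buckets=10
b = tf.feature_column.categorical_column_with_identity(
    "b", num_buckets=10, default_value=0)
inputs = tf.feature_column.input_layer(
    features, [tf.feature_column.indicator_column(b)])
sess.run(tf.global_variables_initializer())
print(sess.run(inputs))  # the row for 42 gets a 1 at index 0 (default_value)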

6. tf.feature_column.categorical_column_with_hash_bucket

hashed_feature_column = tf.feature_column.categorical_column_with_hash_bucket(
    key="some_feature",
    hash_bucket_size=100)  # the number of hash buckets (categories)



1. When the number of categories is very large, it is not feasible to set up a separate category for every term or integer, or to provide a full vocabulary for the categorical feature; with a huge vocabulary this is tedious and memory-hungry. In such cases, the user can instead specify the total number of buckets and let a hash function produce the final category id.

2. Hash collisions will inevitably occur, as many different categories can map to the same category id, so hash_bucket_size should have sufficient headroom. Practice shows that hash collisions do not hurt neural network models much: the hashed buckets still give the model some separation, and the model can further distinguish examples through other features.

import tensorflow.compat.v1 as tf
tf.disable_v2_behavior()
sess=tf.Session()
features = {
    'title': [["asdc"], ["thkj"], ["y0bn"], ["12gb"]]  
}
b = tf.feature_column.categorical_column_with_hash_bucket("title", 20,dtype=tf.string)
b = tf.feature_column.indicator_column(b)
columns = [
    b
]
# Input layer (data, feature column)
inputs = tf.feature_column.input_layer(features, columns)
# Initialize and run
init = tf.global_variables_initializer()
sess.run(init)
v=sess.run(inputs)
print(v)

output:
[[1. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.]
 [0. 0. 0. 0. 0. 0. 0. 0. 1. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.]
 [0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 1. 0. 0.]
 [0. 0. 0. 0. 0. 0. 0. 0. 0. 1. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.]]
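The positions of the 1s are just Hash(string) % hash_bucket_size. A sketch of checking the bucket ids directly, assuming tf.strings.to_hash_bucket_fast matches the hash used inside the column (treat this correspondence as an assumption, not a guarantee):

import tensorflow.compat.v1 as tf
tf.disable_v2_behavior()
sess = tf.Session()
titles = tf.constant(["asdc", "thkj", "y0bn", "12gb"])
# Assumed to mirror the column's internal hashing; illustrative only.
bucket_ids = tf.strings.to_hash_bucket_fast(titles, num_buckets=20)
print(sess.run(bucket_ids))  # positions of the 1s in the one-hot rows above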