A

Accuracy

The fraction of predictions that a classification model got right. In multi-class classification, accuracy is defined as:

Accuracy = Correct Predictions / Total Number of Examples

In binary classification, accuracy is defined as:

Accuracy = (True Positives + True Negatives) / Total Number of Examples

Activation Function

A function (e.g. ReLU or sigmoid) that takes the weighted sum of all of the inputs from the previous layer, applies a nonlinear transformation, and then passes the resulting value to the next layer.
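For illustration only, here is a minimal NumPy sketch (with made-up weights and inputs) of the idea: a weighted sum from the previous layer is passed through a nonlinearity such as ReLU or sigmoid before moving to the next layer.

```python
import numpy as np

def relu(z):
    # ReLU: 0 for negative inputs, identity for positive inputs.
    return np.maximum(0.0, z)

def sigmoid(z):
    # Sigmoid: squashes any real value into the interval (0, 1).
    return 1.0 / (1.0 + np.exp(-z))

# Hypothetical activations from the previous layer and weights for one neuron.
prev_layer = np.array([0.5, -1.2, 3.0])
weights = np.array([0.4, 0.1, -0.7])
bias = 0.2

z = np.dot(weights, prev_layer) + bias   # weighted sum of previous-layer outputs
print(relu(z), sigmoid(z))               # the value passed on to the next layer
```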


AdaGrad

A sophisticated gradient descent algorithm that rescales the gradients of each parameter, effectively giving each parameter an independent learning rate.
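Purely as a sketch (not the actual TensorFlow implementation), the following NumPy loop with made-up gradients shows the core AdaGrad idea: accumulating squared gradients per parameter so that each parameter effectively gets its own learning rate.

```python
import numpy as np

learning_rate = 0.1
epsilon = 1e-8                       # avoids division by zero
params = np.array([1.0, 1.0])
accum = np.zeros_like(params)        # running sum of squared gradients, per parameter

for step in range(3):
    grads = np.array([0.9, 0.01])    # made-up gradients; the first parameter gets large updates
    accum += grads ** 2
    # Each parameter's step is scaled by its own gradient history,
    # so frequently updated parameters take smaller steps.
    params -= learning_rate * grads / (np.sqrt(accum) + epsilon)

print(params)
```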


AUC (Area under the curve)

An evaluation metric that considers all possible classification thresholds. The area under the ROC curve is the probability that a classifier is more confident that a randomly chosen positive example is actually positive than that a randomly chosen negative example is positive.


B

Backpropagation

The primary algorithm for performing gradient descent on neural networks. First, the output value of each node is calculated in a forward pass. Then, the partial derivative of the error with respect to each parameter is calculated in a backward pass.


Baseline

A simple model or heuristic used as a reference point for comparing models. Baselines help model developers quantify the minimal expected performance of a model on a particular problem.


batch

The sample set used in an iteration (i.e. a gradient update) in model training.


Batch size

The number of samples in a batch. For example, the batch size of SGD is 1, while the batch size of a mini-batch is usually between 10 and 1000. Batch size is usually fixed during training and inference; however, TensorFlow does permit dynamic batch sizes.


Bias

The intercept or offset from the origin. Bias (also called the bias term) is referred to as b or w_0 in machine learning models. For example, bias is the b in the following formula: y' = b + w_1x_1 + w_2x_2 + … + w_nx_n.

Not to be confused with prediction bias.


Binary classification

A classification task that outputs one of two mutually exclusive (disjoint) categories. For example, a machine learning model that evaluates mail messages and outputs “spam” or “non-spam” is a binary classifier.


binning/bucketing

Converting a continuous feature into multiple binary features called buckets or bins, based on value ranges. For example, instead of representing temperature as a single continuous floating-point feature, you can chop ranges of temperatures into discrete bins. Given temperature data sensitive to a tenth of a degree, temperatures between 0.0 and 15.0 degrees can be put into one bin, 15.1 to 30.0 degrees into a second bin, and 30.1 to 45.0 degrees into a third bin.
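As an illustration of the temperature example above, here is a minimal NumPy sketch (values are made up) that buckets a continuous temperature feature into three bins and represents each bucket as a binary feature:

```python
import numpy as np

# Bin edges separating the three buckets described above.
edges = np.array([15.0, 30.0])

temperatures = np.array([3.7, 15.1, 29.9, 42.0])   # made-up readings
bucket_index = np.digitize(temperatures, edges)     # 0, 1, 1, 2

# One binary feature per bucket instead of one continuous feature.
binary_buckets = np.eye(3)[bucket_index]
print(binary_buckets)
```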


C

Calibration Layer

A post-prediction adjustment, typically used to account for prediction bias. The adjusted predictions and probabilities should match the distribution of an observed set of labels.


Candidate sampling

A training-time optimization in which a probability is calculated for all of the positive labels (using, for example, softmax) but only for a random sample of the negative labels. For example, for a sample labeled “beagle” and “dog,” candidate sampling computes the predicted probabilities and corresponding loss terms for the “beagle” and “dog” class outputs, in addition to a random subset of the remaining classes (such as “cat,” “lollipop,” and “fence”). The idea is that the negative classes can learn from less frequent negative reinforcement as long as the positive classes always get proper positive reinforcement, and this is indeed observed empirically. The motivation for candidate sampling is a computational efficiency win from not computing predictions for all of the negative classes.


Checkpoint

Data that captures the state of a model's variables at a particular time. Checkpoints enable exporting model weights, as well as performing training across multiple sessions. Checkpoints also enable training to continue past errors (for example, job preemption). Note that the graph itself is not included in a checkpoint.


Category (class)

One of a set of enumerated target values for a label. For example, in a binary classification model that detects spam, the two classes are spam and non-spam. A multi-class classification model that identifies dog breeds would distinguish classes such as poodle, beagle, pug, and so on.


Class Imbalanced Data Sets

A binary classification problem in which the labels of the two classes have significantly different frequencies. For example, a disease dataset in which 0.01% of the samples have positive labels and 99.99% have negative labels is a class-imbalanced dataset. By contrast, a football match predictor dataset in which 51% of the samples mark one team as winning and 49% mark the other team as winning is not class-imbalanced.


Classification Model

A machine learning model that separates data into two or more discrete categories. For example, a natural language processing classification model can classify a sentence as French, Spanish, or Italian. The classification model is compared with the regression model.


Classification Threshold

A scalar value criterion applied to the prediction score of a model to separate positive and negative categories. Classification thresholds are used when logistic regression results need to be mapped to binary classification models. For example, consider a Logistic regression model that determines the probability of a given message being spam. If the classification threshold is 0.9, then a logistic regression value above 0.9 is classified as spam and a logistic regression value below 0.9 is classified as non-spam.


Confusion Matrix

An N×N table that summarizes how well the predictions of a classification model match the actual labels. One axis of the confusion matrix lists the labels the model predicted, and the other axis lists the actual labels. N is the number of classes; in a binary classification model, N = 2. For example, the following is a simple confusion matrix for a binary classification problem:

                         Tumor (predicted)    Non-tumor (predicted)
  Tumor (actual)                 18                       1
  Non-tumor (actual)              6                     452

The confusion matrix above shows that of the 19 samples that actually were tumors, the model correctly classified 18 (18 true positives) and incorrectly classified 1 as non-tumor (1 false negative). Similarly, of the 458 samples that actually were non-tumors, 452 were correctly classified (452 true negatives) and 6 were incorrectly classified (6 false positives).


A multi-class confusion matrix can help identify patterns of mistakes. For example, a confusion matrix could reveal that a model trained to recognize handwritten digits tends to mistake 4s for 9s, or 7s for 1s. The confusion matrix contains enough information to calculate a variety of performance metrics, including precision and recall.


Continuous feature

A floating point feature with an infinite number of value points. This is the opposite of discrete feature.


Convergence

A state reached during training in which the training loss and validation loss change very little or not at all with each iteration after a certain number of iterations. In other words, a model is said to have converged when additional training on the current data can no longer improve it. In deep learning, loss values sometimes stay constant or nearly constant for many iterations before finally descending, which can temporarily produce a false sense of convergence.


Convex function

A function roughly shaped like a letter U or bowl. In the degenerate case, however, the convex function is shaped like a line. For example, the following functions are convex:


  • L2 loss function

  • Log loss function

  • L1 regularization function

  • L2 regularized function


Convex functions are popular loss functions, because when a function has a single minimum (the usual case), variations of gradient descent are guaranteed to find a point close to the minimum of the function. Similarly, variations of stochastic gradient descent have a high probability (though not a guarantee) of finding a point close to the minimum.


Two convex functions added together (e.g., L2 loss function +L1 regularization function) are still convex.


Deep models are usually nonconvex. Surprisingly, algorithms designed for convex optimization tend to work reasonably well on deep networks anyway, even though they rarely find anything close to a global minimum.


Cost (cost)

Synonym for loss.


Cross entropy

A generalization of the log loss function to multi-class classification problems. Cross entropy quantifies the difference between two probability distributions. See also perplexity.
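For illustration, a minimal NumPy sketch (with a made-up three-class example) of cross entropy between a one-hot true distribution and a model's predicted distribution:

```python
import numpy as np

def cross_entropy(true_dist, predicted_dist):
    # Cross entropy between two discrete probability distributions.
    eps = 1e-12                              # avoids log(0)
    return -np.sum(true_dist * np.log(predicted_dist + eps))

# Hypothetical 3-class example: the true label is class 0 (one-hot encoded),
# and the model predicts a probability distribution over the classes.
true_dist = np.array([1.0, 0.0, 0.0])
predicted = np.array([0.7, 0.2, 0.1])

print(cross_entropy(true_dist, predicted))   # ~0.357; lower is better
```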


D

Data set

Collection of samples.


Decision boundary

Separators between classes of model learning in a binary or multi-class classification problem. For example, the figure below shows a binary classification problem where the decision boundary is the boundary between the orange and blue dot classes.




Deep model

A neural network with multiple hidden layers. Deep models rely on trainable nonlinearities. Contrast with the wide model.


Dense feature

A feature where most values are non-zero, usually denoted by a tensor that takes floating-point values. This is the opposite of sparse feature.


Derived feature

A synonym of synthetic feature.


Discrete feature

A characteristic that has only a finite number of possible values. For example, a value that includes only animal, vegetable, or mineral features is a discrete (or categorical) feature. Compare with the continuous feature.


Dropout Regularization (Dropout regularization)

A useful regularization method for training neural networks. Dropout regularization works by removing a random selection of a fixed number of units in a network layer for a single gradient step. The more units dropped out, the stronger the regularization.
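Here is a minimal NumPy sketch of the idea (not TensorFlow's actual implementation): a random mask zeroes out a fraction of a layer's units for one gradient step, and the surviving units are rescaled so their expected sum is unchanged ("inverted dropout").

```python
import numpy as np

def dropout(activations, drop_rate, rng):
    # Randomly zero out a fraction of units for this gradient step,
    # then rescale the survivors so the expected activation is unchanged.
    keep_prob = 1.0 - drop_rate
    mask = rng.random(activations.shape) < keep_prob
    return activations * mask / keep_prob

rng = np.random.default_rng(0)
layer_output = np.ones(10)               # hypothetical activations of one layer
print(dropout(layer_output, drop_rate=0.5, rng=rng))
```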


Dynamic Model

A model that is trained online, in a continuously updated fashion; that is, data is continuously entering the model.


E

Early Stopping

A regularization method that ends model training before the training loss has finished decreasing. With early stopping, training stops when the loss on a validation dataset starts to increase, that is, when generalization performance worsens.


Embeddings

A categorical feature represented as a continuous-valued feature. Typically, an embedding is a translation of a high-dimensional vector into a lower-dimensional space. For example, the words in an English sentence can be represented in either of the following ways:


  • A sparse vector with millions of (high dimensional) elements in which all elements are integers. Each unit of the vector represents a single English word, and the number in the unit represents the number of times that word occurs in a sentence. Since there are usually no more than 50 words in a sentence, almost all units in the vector are zeros. A small number of non-zero units will take a small integer value (usually 1) to indicate the number of occurrences of a word in a sentence.

  • A dense vector with hundreds of (low dimensional) elements, each of which takes a floating point number between 0 and 1.


In TensorFlow, embedding is trained by back propagation loss, just like other parameters of the neural network.


Empirical Risk Minimization (ERM)

The process of choosing the model function that minimizes loss on the training data. Contrast with structural risk minimization.


Ensemble

A merger of the predictions of multiple models. You can create an ensemble via one or more of the following:


  • Different initializations;

  • Different hyperparameters;

  • Different overall structure.


Deep and wide models are a kind of ensemble.


Estimator

An instance of the tf.Estimator class, which encapsulates the logic that builds a TensorFlow graph and runs a TensorFlow session. You can create your own custom Estimators as described at https://www.tensorflow.org/extend/estimators


Example

One row of a dataset. An example contains one or more features, and possibly a label. See also labeled example and unlabeled example.


F

False negative class (FN)

Samples incorrectly predicted by the model to be negative. For example, the model may infer that a message is non-spam (negative class) when in fact it is spam.


False positive class (FP)

Samples incorrectly predicted to be positive by the model. For example, the model may infer that a message is spam (positive class) when in fact it is non-spam.


False positive rate (FP rate)

The x-axis of the ROC curve. The FP rate is defined as: False Positive Rate = False Positives / (False Positives + True Negatives)


Feature (s)

Input variables for making predictions.


Feature Columns (FeatureColumns)

A collection of related characteristics, such as a collection of all possible countries in which a user might live. A sample may have one or more features in a feature column.


Feature columns in TensorFlow also encapsulate metadata such as:

  • The data type of the feature;

  • Whether the feature is fixed-length or should be converted to an embedding.

A feature column can contain a single feature. "Feature column" is Google-specific terminology; in the VW system (Yahoo/Microsoft), the equivalent concept is called a "namespace", or a field.


Feature cross

A synthetic feature formed by crossing (multiplying, or taking the Cartesian product of) individual features. Feature crosses help represent nonlinear relationships.
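As a purely illustrative sketch (the feature names are hypothetical), the cross of two bucketized features is simply one categorical value drawn from their Cartesian product, which lets a linear model learn a separate weight for each combination:

```python
# Hypothetical bucketized features for one housing example.
latitude_bucket = "lat_30_40"
bedrooms_bucket = "bedrooms_3"

# The feature cross is a single categorical value from the Cartesian product
# of the two buckets; a linear model can then learn one weight per combination,
# which is how feature crosses capture nonlinear relationships.
feature_cross = f"{latitude_bucket}_x_{bedrooms_bucket}"
print(feature_cross)   # "lat_30_40_x_bedrooms_3"
```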


Feature Engineering

The process of determining which features are useful when training a model, and then converting raw data from log files and other sources into those features. In TensorFlow, feature engineering usually means entering raw log entries into tf.Example protocol buffers. See also tf.Transform. Feature engineering is sometimes called feature extraction.


Feature Set

The group of features used to train a machine learning model. For example, zip code, property size, and property condition might comprise a simple feature set for a model that predicts housing prices.


Feature Spec

Describes the information needed to extract feature data from the tf.Example protocol buffer. Because the tf.Example protocol buffer is just a container for data, you must specify the following:


  • The data to be extracted (that is, the keys for the features)

  • Data type (e.g., floating point versus integer)

  • Data length (fixed or variable)


The Estimator API provides facilities for generating a feature spec from a list of FeatureColumns.


Full Softmax

See softmax. Compare with candidate samples.


G

Generalization.

The ability of a model to make correct predictions using new, unseen data rather than data used for training.


Generalized Linear Model

A generalization of least squares regression models, which are based on Gaussian noise, to other types of models based on other types of noise, such as Poisson noise or categorical noise. Examples of generalized linear models include:


  • Logistic regression

  • Multiclassification regression

  • Least squares regression


The parameters of the generalized linear model can be obtained by convex optimization, which has the following properties:

  • The average prediction result of the optimal least-squares regression model is equal to the average label of the training data.

  • The average probability prediction result of the optimal logistic regression model is equal to the average label of the training data.


The power of a generalized linear model is limited by its features. Unlike a deep model, a generalized linear model cannot "learn new features."


Gradient

The vector of partial derivatives with respect to all of the independent variables. In machine learning, the gradient is the vector of partial derivatives of the model function. The gradient points in the direction of steepest ascent.


Gradient clipping

Capping gradient values before applying them. Gradient clipping helps ensure numerical stability and prevents exploding gradients.


Gradient descent

A technique to minimize loss by computing the gradients of the loss with respect to the model's parameters, conditioned on the training data. Gradient descent iteratively adjusts the parameters, gradually finding the best combination of weights and bias to minimize the loss function.
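For illustration, a minimal sketch of batch gradient descent fitting a one-feature linear model y = w·x + b with squared loss on made-up data (all names and values are hypothetical):

```python
import numpy as np

# Made-up training data that roughly follows y = 3x + 1.
x = np.array([0.0, 1.0, 2.0, 3.0, 4.0])
y = np.array([1.1, 3.9, 7.2, 9.8, 13.1])

w, b = 0.0, 0.0
learning_rate = 0.05

for step in range(500):
    predictions = w * x + b
    error = predictions - y
    # Partial derivatives of the mean squared loss with respect to w and b.
    grad_w = 2.0 * np.mean(error * x)
    grad_b = 2.0 * np.mean(error)
    # Step against the gradient to reduce the loss.
    w -= learning_rate * grad_w
    b -= learning_rate * grad_b

print(w, b)   # approaches roughly 3 and 1
```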


Graph (graph)

A computation specification in TensorFlow. Nodes in the graph represent operations. Edges are directed and pass the result of one operation (a tensor) as an operand to another operation. Use TensorBoard to visualize a graph.


H

Heuristic

A practical and nonoptimal solution to a problem, but one that is sufficient for making progress or for learning from.


Hidden layer

A synthetic layer in a neural network between an input layer (feature) and an output layer (prediction). A neural network contains one or more hidden layers.


Hinge Loss

A family of loss functions for classification designed to find the decision boundary as far as possible from each training example, thus maximizing the margin between the examples and the boundary. KSVMs use hinge loss (or a related function, such as squared hinge loss). For binary classification, the hinge loss function is defined as follows:

loss = max(0, 1 - (y' * y))

where y' is the raw output of the classifier model:

y' = b + w_1x_1 + w_2x_2 + … + w_nx_n

and y is the true label, either -1 or +1. Consequently, hinge loss is zero whenever y' * y is at least 1, and grows linearly as y' * y falls below 1.
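A minimal NumPy sketch for illustration, computing hinge loss for a few made-up raw scores and labels:

```python
import numpy as np

def hinge_loss(raw_output, label):
    # label must be -1 or +1; raw_output is the classifier's raw score y'.
    return np.maximum(0.0, 1.0 - raw_output * label)

# Hypothetical raw scores and true labels.
raw_scores = np.array([2.3, 0.4, -1.5])
labels = np.array([1, 1, 1])

print(hinge_loss(raw_scores, labels))   # [0.   0.6  2.5]
```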



Holdout data

Samples intentionally not used ("held out") during training. The validation dataset and the test dataset are two examples of holdout data. Holdout data helps assess a model's ability to generalize to data other than the data it was trained on. The loss on the holdout set provides a better estimate of the loss on an unseen dataset than the loss on the training set does.


Hyperparameter

A "knob" that you tweak during successive runs of training a model. For example, the learning rate is a hyperparameter, in contrast to the parameters that the model updates automatically. Contrast with parameter.


I

Independently and identically distributed, I.I.D

Data drawn from a distribution that does not change, where each value drawn does not depend on values drawn previously. I.I.D. is the ideal case for machine learning: a mathematical construct that is useful but almost never exactly found in the real world. For example, the distribution of visitors to a web page might be I.I.D. over a brief window of time; that is, the distribution does not change during that window, and one person's visit is generally independent of another's. However, if you expand the time window, seasonal differences in the page's visitors may appear.


Inference

In machine learning, it usually refers to the process of applying training models to unlabeled samples for prediction. In statistics, inference refers to the process of fitting distribution parameters based on observed data.


Input Layer

The first layer of the neural network (receiving input data).


Inter-rater Agreement

A measure of agreement among human raters on a task. If raters disagree, the task instructions may need to be improved. Sometimes also called inter-annotator agreement or inter-rater reliability.


K

Kernel Support Vector Machines (KSVM)

A classification algorithm that seeks to maximize the margin between the positive and negative classes by mapping input data vectors to a higher-dimensional space. For example, consider a classification problem in which the input dataset has a hundred features. To maximize the margin between the positive and negative classes, a KSVM could internally map those features into a million-dimensional space. The loss function used by KSVMs is called hinge loss.


L

L1 Loss Function (L1 Loss)

The loss function is defined based on the absolute value of the difference between the model’s predicted value and the true value of the tag. L1 loss function is less sensitive to outliers than L2 loss function.


L1 Regularization (L1 regularization)

A type of regularization that penalizes weights in proportion to the sum of the absolute values of the weights. In models that rely on sparse features, L1 regularization helps drive the weights of irrelevant or barely relevant features to exactly 0, which removes those features from the model.


L2 Loss (L2 Loss)

See squared loss.


L2 regularization (L2 regularization)

A type of regularization that penalizes weights in proportion to the sum of the squares of the weights. L2 regularization helps push outlier weights closer to 0 but not all the way to 0. (Contrast with L1 regularization.) L2 regularization generally improves the generalization of linear models.


Label

In supervised learning, the "answer" or "result" portion of a sample. Each sample in a labeled dataset contains one or more features and a label. For example, in a housing dataset, the features might include the number of bedrooms, the number of bathrooms, and the age of the house, while the label might be the house's price. In a spam detection dataset, the features might include the subject line, the sender, and the email message itself, while the label might be "spam" or "non-spam."


Labeled examples

Samples containing features and tags. In supervised training, models learn from labeled samples.


lambda

A synonym for regularization rates. This term has several meanings. Here, we focus on definitions in regularization.


Layer (layer)

A set of neurons in a neural network that process a set of input features, or the output of those neurons.

Layer is also an abstraction in TensorFlow. Layers are Python functions that take tensors and configuration options as input and produce other tensors as output. Once the necessary tensors have been composed, the user can convert the result into an Estimator via a model function.


Learning rate

A scalar used to train a model via gradient descent. During each iteration, the gradient descent algorithm multiplies the learning rate by the gradient; the resulting product is called the gradient step.

Learning rate is an important hyperparameter.


Least squares regression

Linear regression models trained by L2 loss minimization.


Linear regression

A regression model in which linear connections of input features produce continuous values.


Logistic regression

A model that generates a probability for each possible discrete label value in classification problems by applying a sigmoid function to a linear prediction. Although logistic regression is often used for binary classification problems, it can also be used for multi-class classification problems (in which case it is called multi-class logistic regression or multinomial regression).


Log Loss function

Loss functions used in binary logistic regression models.


loss

A measure of how far a model's predictions are from its labels; in other words, a measure of how bad the model is. To determine this value, the model must define a loss function. For example, linear regression models typically use mean squared error as the loss function, while logistic regression models use the log loss function.


M

Machine Learning

A project or system that uses input data to construct (train) predictive models. The system uses learning models to make useful predictions for new data with the same distribution as training data. Machine learning also refers to areas of research related to these projects or systems.


Mean Squared Error (MSE)

Average squared loss per sample. MSE can be calculated by dividing the squared loss by the number of samples. The TensorFlow Playground shows that the values of “training loss” and “test loss” are MSE.
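For illustration, a minimal NumPy sketch of MSE (the mean of the squared differences between predictions and labels) on made-up values:

```python
import numpy as np

def mean_squared_error(predictions, labels):
    # Average of the squared differences between predictions and labels.
    return np.mean((predictions - labels) ** 2)

predictions = np.array([2.5, 0.0, 2.1])   # made-up model outputs
labels = np.array([3.0, -0.5, 2.0])       # corresponding true values

print(mean_squared_error(predictions, labels))   # 0.17
```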


Mini-batch

A small, randomly selected subset of the entire batch of samples run in an iteration of training or inference. Small batches are usually between 10 and 1000 in size. It is much more efficient to calculate losses on small batches of data than on total training data.


Mini-batch stochastic gradient descent

A gradient descent algorithm that uses mini-batches. In other words, mini-batch SGD estimates the gradient based on a small subset of the training data. Vanilla SGD uses a mini-batch of size 1.


Model

Representation of what a machine learning system learns from training data. The term has several meanings, including the following two related meanings:


  • The TensorFlow graph that expresses the structure of how a prediction will be computed.

  • The particular weights and biases of that TensorFlow graph, which are determined by training.


Model Training

The process of determining the best model.


Momentum

A complex gradient descent algorithm in which the learning step depends not only on the derivative of the current step, but also on the previous step. Momentum consists of an exponentially weighted moving average of a gradient calculated over time, similar to momentum in physics. Momentum can sometimes prevent learning from falling into local minima.


Multi-class

Classification problems that classify in more than two categories. For example, there are about 128 species of maple trees, so the model for classifying maple species is multicategory. Conversely, a model that divides E-mail into two categories (spam and non-spam) is a binary classifier model.


N

NaN trap

During training, if one number in the model becomes a NaN, many or all of the other numbers in the model eventually become NaN. NaN is short for "Not a Number."


Negative class

In a binary classification, one category is positive and the other is negative. The positive class is what we’re looking for, and the negative class is another possibility. For example, a negative class in a medical test might be “non-tumor,” and a negative class in an E-mail classifier might be “non-spam.”


Neural network

A model that, taking inspiration from the brain, is composed of multiple layers (at least one of which is hidden), each consisting of simple connected units or neurons followed by nonlinearities.


Neuron

Nodes in a neural network typically input multiple values and produce one output value. The neuron calculates the output by applying the activation function (nonlinear transformation) to the weighted sum of the input values.


Normalization

The process of converting the actual interval of a value into a standard interval, usually -1 to +1 or 0 to 1. For example, suppose the natural range of a feature is 800 to 6000. By subtracting and dividing, you can normalize those values to the interval of -1 to +1. See scaling.
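A minimal sketch for illustration, mapping the hypothetical 800-to-6000 natural range mentioned above into the standard interval -1 to +1 by subtracting and dividing:

```python
import numpy as np

def normalize_to_unit_interval(values, low, high):
    # Map values from the natural range [low, high] into [-1, +1]
    # by subtracting and dividing.
    return 2.0 * (values - low) / (high - low) - 1.0

feature_values = np.array([800.0, 1200.0, 3500.0, 6000.0])   # hypothetical values
print(normalize_to_unit_interval(feature_values, low=800.0, high=6000.0))
# [-1.  -0.846...  0.038...  1. ]
```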


numpy

An open source math library in Python that provides efficient array operations. Pandas is built on numpy.


O

Objective

The objective function that the algorithm tries to optimize.


Offline inference

Generate a set of predictions and store them, then retrieve those predictions as needed. Read it against online inferences.


One-hot Encoding (One-Hot Encoding)

A sparse vector where:


  • One element is set to 1.

  • All other elements are set to 0.


One-hot encoding is commonly used to represent strings or identifiers that have a finite set of possible values. For example, consider a plant dataset recording 15,000 different species, each denoted by a unique string identifier. As part of feature engineering, you might one-hot encode those string identifiers, giving each vector a size of 15,000.
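A minimal sketch for illustration, using a tiny hypothetical vocabulary in place of the 15,000 species: one element of the vector is set to 1 and all others to 0.

```python
import numpy as np

# Hypothetical vocabulary of species identifiers (tiny for illustration).
species = ["quercus_alba", "acer_rubrum", "pinus_strobus"]
index = {name: i for i, name in enumerate(species)}

def one_hot(name):
    # A sparse vector: one element set to 1, all other elements set to 0.
    vector = np.zeros(len(species))
    vector[index[name]] = 1.0
    return vector

print(one_hot("acer_rubrum"))   # [0. 1. 0.]
```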


One-versus-all

Given a classification problem with N possible solutions, the one-to-many solution consists of N independent binary classifiers — one binary classifier for each possible outcome. For example, if a model divides samples into animals, vegetables, or minerals, a one-to-many solution will provide the following three independent binary classifiers:


  • Animals and non-animals

  • Vegetables and non-vegetables

  • Mineral and non-mineral


Online inference

Generate forecasts on demand. Can be read against offline inference.


Operation (Operation/op)

A node in a TensorFlow graph. In TensorFlow, any procedure that creates, manipulates, or destroys a tensor is an operation. For example, matrix multiplication is an operation that takes two tensors as input and produces one tensor as output.


Optimizer

A specific implementation of the gradient descent algorithm. TensorFlow's base class for optimizers is tf.train.Optimizer. Different optimizers (subclasses of tf.train.Optimizer) account for concepts such as:


  • Momentum

  • Update frequency (AdaGrad = ADAptive GRADient descent; Adam = ADAptive with Momentum; RMSProp)

  • Sparsity/Regularization (Ftrl)

  • More complex mathematics (Proximal and beyond)


You can even imagine nn-driven Optimizer.


Outliers

A value that is very different from most values. In machine learning, the following are outliers:


  • Weights with a high absolute value.

  • Predicted values relatively far from the actual values.

  • Input data whose values are more than roughly 3 standard deviations from the mean.


Outliers often cause problems in model training.


Output layer

The “last” layer of a neural network. This layer contains the answers sought by the whole model.


Overfitting

Creating a model that matches the training data so closely that the model fails to make correct predictions on new data.


P

pandas

A column-based data analysis API. Many machine learning frameworks, including TensorFlow, support pandas data structures as input. See the PANDAS documentation.


Parameter

A variable of a model that the machine learning system trains on its own. For example, weights are parameters whose values the machine learning system gradually learns through successive training iterations. Contrast with hyperparameter.


Parameter Server/PS

A job that keeps track of a model's parameters in a distributed setting.


Parameter update

The operation of adjusting model parameters during training is usually carried out in a single iteration of gradient descent.


Partial derivative

A derivative of a multivariable function with respect to one of its variables, with the others held constant. For example, the partial derivative of f(x, y) with respect to x is the derivative of f considered as a function of x alone (that is, keeping y constant). The partial derivative with respect to x focuses only on how x is changing and ignores everything else in the formula.


Partitioning Strategy

An algorithm for dividing variables in multiple parameter servers.


Performance

It has many meanings:


  • Traditional meaning in software engineering: How fast/efficient is the software?

  • Implications for machine learning: How accurate is the model? That is, how well does the model predict?


Perplexity

A measure of how well a model is accomplishing its task. For example, suppose your task is to read the first few letters of a word a user is typing on a smartphone keyboard and to offer a list of possible completion words. Perplexity, P, for this task is approximately the number of guesses you need to offer in order for your list to contain the actual word the user is trying to type.

Perplexity is related to cross entropy as follows:

P = 2^(cross entropy)

Pipeline

The infrastructure of machine learning algorithms. The pipeline includes collecting data, putting it into a training data file, training one or more models, and finally outputting the model.


Positive class

In binary classification, there are two categories: positive and negative. Positive classes are the target of our testing. (Admittedly, we test both results, but one is beside the point.) For example, a positive class in a medical test might be “tumor,” or a positive class in an E-mail classifier might be “spam.” It can be read against the negative category.


Precision

A metric for classification models. Precision identifies the frequency with which a model was correct when predicting the positive class. That is:

Precision = True Positives / (True Positives + False Positives)


Prediction

A model's output when provided with an input sample.


Prediction Bias

A value indicating how far apart the average of predictions is from the average of labels in the dataset.


Pre-made Estimator

An Estimator that someone has already built. TensorFlow provides several pre-made Estimators, including DNNClassifier, DNNRegressor, and LinearClassifier. You can build your own custom Estimators by following the guide at https://www.tensorflow.org/extend/estimators


Pre-trained model

Already trained models or model components (such as embedding). Sometimes, you feed a pre-training embed into a neural network. Other times, your model trains the embeddings itself, rather than relying on pre-trained embeddings.


Prior belief

Your belief in the data before the training begins. For example, L2 regularization relies on the belief that the weight values are small and normally distributed around 0.


Q

Queue

A TensorFlow operation that implements a queue data structure. Typically used for input/output (I/O).


R

Rank (rank)

A term in the field of machine learning with multiple meanings:


  • The number of dimensions in a tensor. For example, a scalar has rank 0, a vector has rank 1, and a matrix has rank 2. (Note: the concept of "rank" in this glossary is not the same as "rank" in linear algebra; for example, an invertible 3×3 matrix has rank 3 in the linear-algebra sense.)

  • The ordinal position of a class in a machine learning problem, classifying the class in order from highest to lowest. For example, a behavioral ranking system could rank a dog’s rewards from high (steak) to low (kale).


Rater

The person who provides the label for a sample is sometimes called a tagger.


Recall rate

A metric for classification models that answers the question: out of all the possible positive labels, how many did the model correctly identify? That is:

Recall = True Positives / (True Positives + False Negatives)
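For illustration, a minimal NumPy sketch computing both recall and the precision metric defined earlier, on made-up binary predictions and labels:

```python
import numpy as np

def precision_recall(predictions, labels):
    # predictions and labels are arrays of 0 (negative class) and 1 (positive class).
    tp = np.sum((predictions == 1) & (labels == 1))
    fp = np.sum((predictions == 1) & (labels == 0))
    fn = np.sum((predictions == 0) & (labels == 1))
    precision = tp / (tp + fp)   # of predicted positives, how many were right
    recall = tp / (tp + fn)      # of actual positives, how many were found
    return precision, recall

labels      = np.array([1, 1, 1, 0, 0, 0])   # made-up ground truth
predictions = np.array([1, 1, 0, 1, 0, 0])   # made-up model output

print(precision_recall(predictions, labels))   # (0.666..., 0.666...)
```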



Rectified Linear Unit (ReLU)

An activation function with the following rules:


  • If the input is negative or zero, the output is 0.

  • If the input is positive, the output is the same as the input.


Regression Model

A model that outputs continuous values, typically floating-point numbers. Compare with classification models, which output discrete values such as "day lily" or "tiger lily."


Regularization (regularization)

Penalty for model complexity. Regularization helps prevent overfitting. Regularization includes different types:


  • L1 regularization

  • L2 regularization

  • Dropout regularization

  • Stopping early (this is not a formal regularization method, but can effectively limit overfitting)


Regularization rate

A scalar value, represented as lambda, specifying the relative importance of the regularization function. The regularization rate appears in the following simplified loss formula:


Minimize (loss function + λ(regularization function))


Increasing regularization rate can reduce overfitting, but may reduce model accuracy.


Representation

The process of mapping data to useful features.


Receiver Operating Characteristic /ROC Curve

A curve of the true positive rate versus the false positive rate at different classification thresholds. See also AUC.


Root directory

The directory you specify for hosting subdirectories of the TensorFlow checkpoint files and event files for multiple models.


Root Mean Squared Error/RMSE

Square root of the mean square error.


S

Saver

The TensorFlow object that is responsible for storing model checkpoint files.


Scaling

A commonly used operation in feature engineering to tame a feature's range of values to match the range of other features in the dataset. For example, suppose you want all floating-point features in the dataset to have a range of 0 to 1. Given a feature with a range of 0 to 500, you could scale that feature by dividing each value by 500. See also normalization.


scikit-learn

A popular open source machine learning platform. Website: www.scikit-learn.org


Sequence Model

A model whose inputs have a sequential dependence. For example, predicting the next video watched from a sequence of previously watched videos.


Session

Maintains the state (for example, variables) of a TensorFlow program.


Sigmoid function

A function that maps logistic or multinomial regression output (log odds) to probabilities, returning a value between 0 and 1. The sigmoid function has the following formula:

y = 1 / (1 + e^(-σ))

where σ in logistic regression problems is simply:

σ = b + w_1x_1 + w_2x_2 + … + w_nx_n

In some neural networks, the sigmoid function also serves as the activation function.


softmax

A function that provides probabilities for each possible class in a multi-class classification model. The probabilities add up to exactly 1.0. For example, softmax might determine that the probability of a particular image being a dog is 0.9, a cat 0.08, and a horse 0.02. (Also called full softmax.)
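A minimal NumPy sketch for illustration, with hypothetical raw scores (logits) for the dog/cat/horse example above:

```python
import numpy as np

def softmax(logits):
    # Subtracting the max logit improves numerical stability
    # without changing the resulting probabilities.
    exps = np.exp(logits - np.max(logits))
    return exps / np.sum(exps)

logits = np.array([3.2, 0.8, -0.5])   # hypothetical raw scores for dog, cat, horse
probs = softmax(logits)

print(probs, probs.sum())             # probabilities that add up to 1.0
```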


Sparse Feature

A feature vector whose values are predominantly zero or empty. For example, a vector containing a single 1 and a million 0s is sparse. As another example, the words in a search query are also a sparse feature: there are many possible words in a given language, but only a few of them occur in a given query.

Contrast with dense feature.


Squared loss

The loss function used in linear regression (also called L2 loss). This function computes the square of the difference between a labeled sample's predicted value and the actual value of its label. Squaring amplifies the influence of bad predictions; that is, squared loss reacts more strongly to outliers than L1 loss.


Static Model

A model that is trained offline.


Steady-state (stationarity)

A property of data in a dataset, in which the data distribution stays constant across one or more dimensions. Most commonly, that dimension is time, meaning that data exhibiting stationarity does not change over time. For example, data that exhibits stationarity does not change from September to December.


Step (step)

Forward and backward evaluation in a batch.


Step size

The learning rate multiplied by the partial derivative; this product is the step taken in gradient descent.


Stochastic gradient Descent (SGD)

Gradient descent algorithm with batch size 1. That is, SGD relies on a randomly and uniformly selected sample from the data set to assess the gradient at each step.


Structural Risk Minimization (SRM)

This algorithm balances two objectives:


  • Build the most predictive model (e.g., minimum losses).

  • Keep the model as simple as possible (e.g., strong regularization).


For example, a model function that minimizes loss + regularization on the training set is a structural risk minimization algorithm. For more information, see http://www.svms.org/srm/. Contrast with empirical risk minimization.


Summary

In TensorFlow, a value or set of values calculated at a particular step, usually used for tracking model metrics during training.


Supervised machine Learning

Training a model using input data and its corresponding labels. Supervised machine learning is analogous to a student learning a subject by studying a set of questions and their corresponding answers. After mastering the mapping between questions and answers, the student can then provide answers to new questions on the same topic. Contrast with unsupervised machine learning.


Synthetic feature

A feature not present among the input features, but derived from one or more of them. Kinds of synthetic features include:


  • Features are multiplied by themselves or other features (called feature crossing).

  • Dividing one feature by another.

  • Bucketing a continuous feature into range bins.


Features created by normalizing or scaling alone are not considered synthetic features.


T

Tensor



The primary data structure in TensorFlow programs. Tensors are N-dimensional (where N could be very large) data structures, most commonly scalars, vectors, or matrices. The elements of a tensor can hold integer, floating-point, or string values.


Tensor Processing Unit

An application-specific integrated circuit (ASIC) that optimizes the performance of TensorFlow programs.


Tensor shape

The number of elements a tensor contains in each dimension. For example, a [5, 10] tensor has a shape of 5 in one dimension and 10 in another.


Tensor size

The total number of scalars contained in a tensor. For example, the magnitude of the [5, 10] tensor is 50.


TensorBoard

A dashboard that displays the summary data saved during the execution of one or more TensorFlow programs.


TensorFlow

A large-scale, distributed machine learning platform. The term also refers to the base API layer in the TensorFlow stack, which supports general computation on dataflow graphs.


Although TensorFlow is primarily used for machine learning, it is also suitable for non-machine learning tasks that require numerical calculations using data flow diagrams.


TensorFlow Playground

A program that visualizes how different hyperparameters influence the training of models (primarily neural networks). Visit http://playground.tensorflow.org to experiment with TensorFlow Playground.


TensorFlow Serving

A platform for deploying trained models in production.


Test Set

A subset of the data set. After the model has been initially tested with the validation set, the model is tested with the test set. It can be read against the training set and validation set.


tf.Example

A standard protocol buffer used to describe input data for machine learning model training or inference.


Training

The process of determining the ideal parameters that make up a model.


Training Set

A subset of data sets used to train the model. Read against validation sets and test sets.


True negative, TN

Samples that are correctly predicted to be negative by the model. For example, the model deduces that an E-mail is not spam, and then the E-mail really is not spam.


True positive (TP)

Samples that are correctly predicted to be positive by the model. For example, the model predicts that an E-mail message is spam, which turns out to be spam.


True Positive rate (TP rate)

A synonym for recall. That is:

True Positive Rate = True Positives / (True Positives + False Negatives)

The true positive rate is the y-axis of the ROC curve.


U

Unlabeled examples

Samples containing features but without labels. The unlabeled sample is the input for inference. Unlabeled samples are usually used in the training of semi-supervised and unsupervised learning.


Unsupervised machine Learning

Train a model to find patterns in a dataset (usually an unlabeled dataset).


Unsupervised machine learning is most commonly used to divide data into groups of similar samples. For example, unsupervised machine learning algorithms can cluster data based on various properties of music. The data collected in this way can serve as input to other machine learning algorithms, such as music recommendation services. Clustering is useful in situations where it is difficult to get real labels. For example, in anti-fraud and anti-abuse scenarios, clustering can help humans better understand data.


Another example of unsupervised machine learning is principal Component Analysis (PCA). For example, when PCA is applied to data sets containing the contents of millions of shopping carts, it is possible to find that shopping carts with lemons also tend to have antacids. Can be compared with supervised machine learning.


V

Validation Sets

A subset of the data set (as distinct from the training set) that can be used to adjust hyperparameters. Read against the training set and test set.


W

Weight (weight)

A coefficient for a feature in a linear model, or an edge in a deep network. The goal of training a linear model is to determine the ideal weight for each feature. If a weight is 0, its corresponding feature contributes nothing to the model.


Wide Model

A linear model that typically has many sparse input features. We call it a "wide" model because it is a special type of neural network with a large number of inputs that connect directly to the output node. Wide models are often easier to debug and inspect than deep models. Although wide models cannot express nonlinearities through hidden layers, they can model nonlinearities in different ways using transformations such as feature crossing and bucketization. Contrast with the deep model.


via  https://developers.google.com/machine-learning/glossary