0. Introduction

What exactly is feature engineering? As the name implies, it is essentially an engineering activity designed to extract features from raw data to the maximum extent possible for use by algorithms and models.

I have written the following quick introductions to AI basics. This article is part 2 of the feature engineering basics: numeric feature processing.

Already published:

AI Basics: Easy to get started with Python

AI Basics: Numpy easy to get started

AI: Pandas

AI: Scipy(Scientific Computing Library) easy to get started

AI Basics: An easy introduction to Data Visualization (Matplotlib and Seaborn)

AI Fundamentals: Feature Engineering – Category Features

More updates to follow.


[1] Book address: www.oreilly.com/library/vie…

[2] Translation source: github.com/apachecn

[3] Translated by @Coboe:


Code modification and reorganization: Huang Haiguang. The original text was converted into Jupyter Notebook format, and some code was added and modified. All of the code has been tested, and all datasets are available for download on Baidu Cloud.

The code can be downloaded at Github:


Datasets on Baidu Cloud:

Link: pan.baidu.com/s/1uDXt5jWU… Extraction code: 8P5D

2. Fancy tricks with simple numbers

Before delving into complex data types such as text and images, let’s start with the simplest: numeric data. It can come from a variety of sources: the geolocation of a place or a person, the price of a purchase, measurements from a sensor, traffic counts, and so on. Numeric data is already in a format that mathematical models can readily ingest. This does not mean that feature engineering is no longer needed. Good features not only represent the salient aspects of the data, but also conform to the assumptions of the model, so transformations are often necessary. Numeric feature engineering techniques are fundamental: they can be applied whenever raw data is converted into numeric features.

The first sanity check for numeric data is whether the magnitude matters. Do we just need to know whether it is positive or negative? Or do we only need the magnitude at a very coarse granularity? This sanity check is particularly important for automatically accrued numbers such as counts: the number of daily visits to a website, the number of reviews a restaurant has received, and so on.

Next, consider the scale of the features. What are the largest and smallest values? Do they span several orders of magnitude? Models that are smooth functions of their input features are sensitive to the scale of the input. For example, 3x + 1 is a simple linear function of the input x, and the scale of its output depends directly on the scale of the input. Other examples include k-means clustering, nearest neighbor methods, RBF kernels, and anything that uses the Euclidean distance. For these models and modeling components, it is usually a good idea to normalize the features so that the output stays on an expected scale.
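To make the scale sensitivity of Euclidean distance concrete, here is a minimal sketch in plain Python. The four (age, income) records are made up for illustration, and min-max scaling is only one of several possible ways to normalize features.

```python
import math

def euclidean(a, b):
    """Euclidean distance between two equal-length sequences."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def min_max_scale(column):
    """Rescale a list of numbers to the [0, 1] range."""
    lo, hi = min(column), max(column)
    return [(v - lo) / (hi - lo) for v in column]

# Hypothetical records: (age in years, yearly income in dollars)
people = [(25, 50_000), (60, 51_000), (26, 53_000), (40, 90_000)]

# On the raw scale, income dominates the distance: person 0 looks closer
# to person 1 (35 years older) than to person 2 (1 year older).
raw_01 = euclidean(people[0], people[1])
raw_02 = euclidean(people[0], people[2])

# After min-max scaling each feature, both features contribute comparably.
ages = min_max_scale([p[0] for p in people])
incomes = min_max_scale([p[1] for p in people])
scaled = list(zip(ages, incomes))
scaled_01 = euclidean(scaled[0], scaled[1])
scaled_02 = euclidean(scaled[0], scaled[2])

print("raw:    d(0,1) = %.1f, d(0,2) = %.1f" % (raw_01, raw_02))
print("scaled: d(0,1) = %.3f, d(0,2) = %.3f" % (scaled_01, scaled_02))
```

On the raw data the nearest neighbor of person 0 is decided almost entirely by income; after scaling, the person of nearly the same age and similar income becomes the nearest one.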

Logical functions, on the other hand, are not sensitive to the scale of input features. Whatever the input, their output is binary. For example, the logical AND takes any two variables and outputs 1 if and only if both inputs are true. Another example of a logical function is the step function (e.g., “is input x greater than 5?”). Decision tree models consist of step functions of input features. Hence, models based on space-partitioning trees (decision trees, gradient boosted machines, random forests) are not sensitive to scale. The only exception is when the scale of the input grows over time, which is the case if the feature is an accumulated count of some sort. Eventually it will grow outside of the range that the tree was trained on. If this might be the case, it may be necessary to rescale the input periodically. Another solution is the bin-counting method discussed in Chapter 5.

It is also important to consider the distribution of numeric features. The distribution summarizes the probability of the feature taking on a particular value. The distribution of the input features matters more to some models than to others. For example, the training process of a linear regression model assumes that the prediction errors are distributed like a Gaussian. This is usually fine, except when the prediction target spreads out over several orders of magnitude. In this case, the Gaussian error assumption likely no longer holds. One way to deal with this is to transform the output target in order to tame the magnitude of the growth. (Strictly speaking, this would be target engineering, not feature engineering.) The log transform, which is a type of power transform, brings the distribution of the variable closer to Gaussian.

In addition to tailoring features to the assumptions of the model or the training process, multiple features can be combined into more complex features. The hope is that complex features can capture important information in the raw data more succinctly. By making the input features more “eloquent,” the model itself can be simpler, easier to train and evaluate, and make better predictions. As an extreme, complex features may themselves be the output of statistical models. This is a concept known as model stacking, which we will discuss in more detail in Chapters 7 and 8. In this chapter, we give the simplest example of complex features: interaction features.

Interaction features are simple to formulate, but combining features produces many more inputs to the model. To reduce the computational expense, it is often necessary to prune the input features using automatic feature selection.
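As a rough illustration of why interaction features inflate the input, the sketch below builds pairwise products of the features in one row. It assumes the common reading of "interaction feature" as the product of two features; the helper name `pairwise_interactions` is made up for this example.

```python
from itertools import combinations

def pairwise_interactions(row):
    """Return the products of every pair of features in one data row."""
    return [a * b for a, b in combinations(row, 2)]

row = [2.0, 3.0, 5.0]
augmented = row + pairwise_interactions(row)
print(augmented)  # [2.0, 3.0, 5.0, 6.0, 10.0, 15.0]

# The number of pairwise terms grows quadratically: n * (n - 1) / 2.
n_features = 100
n_pairs = n_features * (n_features - 1) // 2
print(n_pairs)  # 4950
```

With 100 original features there are already 4,950 pairwise products, which is why pruning with feature selection becomes attractive.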

We will start with the basic concepts of scalars, vectors, and spaces, and then discuss scale, distribution, interaction features, and feature selection.

Scalars, vectors, and spaces

Before we begin, we need to define some basic concepts that underlie the rest of the book. A single numeric feature is also known as a scalar. An ordered list of scalars is known as a vector. Vectors sit within a vector space. In the vast majority of machine learning applications, the input to a model is represented as a numeric vector. The rest of the book discusses best-practice strategies for converting raw data into numeric vectors.

A vector can be visualized as a point in space. Sometimes people draw a line and an arrow from the origin to that point; in this book, we will mostly use just the point. For example, suppose we have a two-dimensional vector, that is, a vector containing two numbers: in the first direction the vector has a value of 1, and in the second direction it has a value of 2. We can plot it in two dimensions.

Figure 2-1. A single vector.

Abstract vectors and their feature dimensions take on actual meaning in the world of data. For example, a vector can represent a person’s preference for songs. Each song is a feature, where a value of 1 corresponds to a thumbs-up and a value of -1 to a thumbs-down. Suppose the vector represents the preferences of a listener, Bob. Bob likes Bob Dylan’s “Blowin’ in the Wind” and Lady Gaga’s “Poker Face.” Other people may have different preferences. Collectively, a dataset can be visualized as a point cloud in feature space.

Conversely, a song can be represented by the individual preferences of a group of people. Suppose there are only two listeners, Alice and Bob. Alice likes “Poker Face,” “Blowin’ in the Wind,” and Leonard Cohen’s “Hallelujah,” but hates Katy Perry’s “Roar” and Radiohead’s “Creep.” Bob likes “Roar,” “Hallelujah,” and “Blowin’ in the Wind,” but hates “Poker Face” and “Creep.” Each song is a point in the space of listeners. Just as we can visualize data in feature space, we can visualize features in data space. Figure 2-2 shows this example.
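The two views can be sketched with a small preference table for Alice and Bob. The encoding of a like as 1 and a dislike as -1 is an assumption made for this illustration; the data-space view is simply the transpose of the feature-space view.

```python
songs = ["Poker Face", "Blowin' in the Wind", "Hallelujah", "Roar", "Creep"]
listeners = ["Alice", "Bob"]

# Rows are listeners (data points); columns are songs (features).
# 1 = likes the song, -1 = hates it (assumed encoding).
preferences = [
    [1, 1, 1, -1, -1],   # Alice
    [-1, 1, 1, 1, -1],   # Bob
]

# Feature space: each listener is a point with one coordinate per song.
listener_points = dict(zip(listeners, preferences))

# Data space: transpose, so each song is a point with one coordinate
# per listener.
song_points = {song: list(col) for song, col in zip(songs, zip(*preferences))}

print(listener_points["Bob"])     # [-1, 1, 1, 1, -1]
print(song_points["Poker Face"])  # [1, -1]
print(song_points["Creep"])       # [-1, -1]
```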

Figure 2-2. Illustration of feature space vs. data space.

Dealing with counts

In the age of big data, counts can quickly accumulate without bound. A user might put a song or a movie on infinite replay, or use a script to repeatedly check the availability of tickets for a popular show, causing the play count or website visit count to rise rapidly. When data can be produced at high volume and velocity, it is very likely to contain a few extreme values. It is a good idea to check the scale and decide whether to keep the data as raw numbers, convert it into binary values to indicate presence, or bin it into coarser granularity.


The Echo Nest Taste Profile subset of the Million Song Dataset contains the full music listening histories of one million users on the Echo Nest. Here are some relevant statistics about the dataset.

Statistics of the Echo Nest Taste Profile dataset

  • There are more than 48 million triples of user ID, song ID, and listen count.
  • The complete dataset contains 1,019,318 unique users and 384,546 unique songs.
  • Citation: The Echo Nest Taste Profile Subset, the official user dataset of the Million Song Dataset, available from: labrosa.ee.columbia.edu/millionsong…

Suppose the task is to build a recommender to recommend songs to users. One component of the recommender might predict how much a user will enjoy a particular song. Since the data contains actual listen counts, should they be the target of the prediction? That would be the right thing to do if a large listen count meant the user really liked the song and a low listen count meant they were not interested in it. However, the data shows that while 99% of the listen counts are 24 or lower, some listen counts are in the thousands, with a maximum of 9,667. (As Figure 2-3 shows, the histogram peaks in the bin closest to 0, but more than 10,000 triples have larger counts, with a few in the thousands.) These values are anomalously large; if we tried to predict the actual listen counts, the model would be pulled off course by them.

Figure 2-3. Histogram of listen counts in the user taste profile of the Million Song Dataset. Note that the y-axis is on a log scale.

In the Million Song dataset, the raw listen count is not a robust measure of user taste. (In statistical terms, robustness means that a method works under a wide variety of conditions.) Users have different listening habits. Some people might put their favorite songs on infinite loop, while others might savor them only on special occasions. We cannot necessarily say that someone who listens to a song 20 times likes it twice as much as someone who listens to it 10 times.

A more robust representation of user preference is to binarize the count, clipping all counts greater than 1 to 1. In other words, if the user has listened to a song at least once, we count it as the user liking the song. This way, the model does not need to spend cycles predicting the minute differences between raw counts. The binary target is a simple and robust measure of user preference.

Example 2-1: Binarize the listening count in the Million Song dataset

import pandas as pd
listen_count = pd.read_csv(
    'data/train_triplets.txt.zip', header=None, delimiter='\t')
# The table contains user-song-count triplets. Only non-zero counts are
# included. Hence to binarize the count, we just need to set the entire
# count column to 1.
listen_count[2] = 1

This is an example where we engineer the target variable of the model. Strictly speaking, the target is not a feature because it is not an input. But sometimes we do need to modify the target in order to solve the right problem.
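For readers without the dataset at hand, the same binarization can be sketched on a few in-memory triples. The user and song identifiers below are made up; only the idea of clipping any positive count to 1 comes from the text.

```python
# Hypothetical (user, song, listen_count) triples.
triples = [
    ("user_a", "song_1", 1),
    ("user_a", "song_2", 9667),
    ("user_b", "song_1", 24),
]

def binarize(count):
    """Clip a non-negative count to 0 or 1."""
    return 1 if count > 0 else 0

binary_triples = [(user, song, binarize(count)) for user, song, count in triples]
print(binary_triples)
```

A listen count of 9,667 and a listen count of 1 both become 1, so the extreme value no longer pulls on the model.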

Quantization or binning

For this exercise, we take data from round 6 of the Yelp dataset challenge and create a smaller classification dataset. The Yelp dataset contains user reviews of businesses from ten cities across North America and Europe. Each business is labeled with zero or more categories. The following are relevant statistics about the dataset.

Statistics of the Yelp dataset (round 6)

  • There are 782 business categories.
  • The complete dataset contains 1,569,264 reviews (about 1.6M) and 61,184 businesses (61K).
  • “Restaurants” (990,627 reviews) and “Nightlife” (210,028 reviews) are the most popular categories by review count.
  • No business is categorized as both a restaurant and a nightlife venue, so there is no overlap between the two groups of reviews.

Each business has a review count. Suppose our task is to use collaborative filtering to predict the rating a user might give to a business. The review count might be a useful input feature because there is usually a strong correlation between popularity and good ratings. The question now is: should we use the raw review count or process it further? Figure 2-4 shows the histogram of all business review counts. We see the same pattern as in the music listen counts: most of the counts are small, but some businesses have reviews in the thousands.

Example 2-2: Visualize business review counts in the Yelp dataset

import pandas as pd
import json
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline

### Load the data about businesses (one JSON object per line)
with open('data/yelp_academic_dataset_business.json') as biz_file:
    biz_df = pd.DataFrame([json.loads(x) for x in biz_file])

### Plot the histogram of the review counts
fig, ax = plt.subplots()
biz_df['review_count'].hist(ax=ax, bins=100)
# Use a log scale on the y-axis so the sparse high-count bins stay visible
ax.set_yscale('log')
ax.set_xlabel('Review Count', fontsize=14)
ax.set_ylabel('Occurrence', fontsize=14)

Figure 2-4. Histogram of business review counts in the Yelp reviews dataset. The y-axis is on a log-scale.

Raw counts that span several orders of magnitude are problematic for many models. In a linear model, the same linear coefficient has to work for all possible values of the count. Large counts can also wreak havoc in unsupervised learning methods such as k-means clustering, which uses the Euclidean distance as a similarity function to measure the similarity between data points. A large count in one element of the data vector will outweigh the similarity in all other elements, which can throw off the entire similarity measurement.

One solution is to contain the scale by quantizing the count. In other words, we group the counts into bins and get rid of the actual count values. Quantization maps a continuous number to a discrete one. We can think of the discretized numbers as an ordered sequence of bins that represent a measure of intensity.

To quantize data, we have to decide how wide each bin should be. The solutions fall into two categories: fixed-width or adaptive. We will give an example of each type.

Fixed-width binning

With fixed-width binning, each bin contains a specific numeric range. The ranges can be custom designed or automatically segmented, and they can be linearly scaled or exponentially scaled. For example, we can group people’s ages into decades: 0-9 years old in bin 1, 10-19 years old in bin 2, and so on. To map from the count to the bin, simply divide by the width of the bin and take the integer part.

It is also common to see custom designed age ranges more appropriate to life stages:

  • 0-12 years old
  • 12-17 years old
  • 18-24 years old
  • 25-34 years old
  • 35-44 years old
  • 45-54 years old
  • 55-64 years old
  • 65-74 years old
  • 75 years or older
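Both kinds of fixed-width bins can be sketched with the standard library. The helper names are made up for this example. Because age 12 appears in two of the listed ranges, the sketch assigns it to the 12-17 bin, which is an assumption. Note that integer division yields zero-based bin indices, whereas the prose above numbers bins from 1.

```python
from bisect import bisect_right

# Lower edge of each custom life-stage bin, in years.
lower_edges = [0, 12, 18, 25, 35, 45, 55, 65, 75]
labels = ["0-12", "12-17", "18-24", "25-34", "35-44",
          "45-54", "55-64", "65-74", "75+"]

def age_bin(age):
    """Index of the custom bin that a non-negative age falls into."""
    return bisect_right(lower_edges, age) - 1

def decade_bin(age):
    """Linear fixed-width bins: divide by the width, keep the integer part."""
    return age // 10

for age in (7, 12, 30, 80):
    print(age, labels[age_bin(age)], decade_bin(age))
```

Custom edges need a lookup (here a binary search over the lower edges), while evenly spaced bins need only a division.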

When the numbers span multiple orders of magnitude, it may be better to group them by powers of 10 (or powers of any constant): 0-9, 10-99, 100-999, 1000-9999, and so on. The bin widths grow exponentially, from O(10) to O(100), O(1000), and beyond. To map from the count to the bin, take the log of the count. Exponential-width binning is closely related to the log transformation, which we discuss in “Logarithmic transformation.”

Example 2-3: Quantizing counts with fixed-width bins

import numpy as np

### Generate 20 random integers uniformly between 0 and 99
small_counts = np.random.randint(0, 100, 20)

### Map to evenly spaced bins 0-9 by division
np.floor_divide(small_counts, 10)

### An array of counts that span several magnitudes
large_counts = [296, 8286, 64011, 80, 3, 725, 867, 2215, 7689, 11495, 91897,
                44, 28, 7971, 926, 122, 22222]

### Map to exponential-width bins via the log function
np.floor(np.log10(large_counts))

Quantile binning

Fixed-width binning is easy to compute. But if there are large gaps in the counts, there will be many empty bins with no data. This problem can be solved by adaptively positioning the bins based on the distribution of the data. This can be done using the quantiles of the distribution.

Quantiles are values that divide the data into equal portions. For example, the median divides the data in halves: half of the data points are smaller than the median and half are larger. The quartiles divide the data into quarters, the deciles into tenths, and so on. Example 2-4 demonstrates how to compute the deciles of the Yelp business review counts, and Figure 2-5 overlays the deciles on the histogram. This gives a much clearer picture of the skew toward smaller counts.
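The contrast between the two binning schemes can be sketched on a small, made-up list of skewed counts. The sketch uses `statistics.quantiles` from the Python standard library (Python 3.8+); its default interpolation rule differs from the one pandas uses, so the cut points may not match pandas exactly.

```python
from bisect import bisect_left
from collections import Counter
from statistics import quantiles

# Hypothetical skewed counts: many small values and one very large one.
counts = [1, 2, 2, 3, 3, 4, 5, 8, 13, 21, 40, 1000]

# Fixed-width bins of width 100: nearly everything lands in bin 0,
# and bins 1 through 9 stay empty.
fixed = Counter(c // 100 for c in counts)

# Quartile cut points are computed from the data itself.
cuts = quantiles(counts, n=4)

# Assign each count to the number of cut points strictly below it (0..3).
quartile_bins = Counter(bisect_left(cuts, c) for c in counts)

print("fixed-width bin sizes:", dict(fixed))
print("quartile cut points:  ", cuts)
print("quartile bin sizes:   ", dict(quartile_bins))
```

The fixed-width scheme puts eleven of the twelve values in a single bin, whereas the quartile bins each receive three values.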

Example 2-4: Calculate the deciles of Yelp business reviews

deciles = biz_df['review_count'].quantile([.1, .2, .3, .4, .5, .6, .7, .8, .9])
deciles

0.1     3.0
0.2     3.0
0.3     4.0
0.4     5.0
0.5     6.0
0.6     8.0
0.7    12.0
0.8    23.0
0.9    50.0
Name: review_count, dtype: float64

### Visualize the deciles on the histogram
fig, ax = plt.subplots()
biz_df['review_count'].hist(ax=ax, bins=100)
for pos in deciles:
    handle = plt.axvline(pos, color='r')
ax.legend([handle], ['deciles'], fontsize=14)
# Both axes are on a log scale, as noted in the caption of Figure 2-5
ax.set_yscale('log')
ax.set_xscale('log')
ax.set_xlabel('Review Count', fontsize=14)
ax.set_ylabel('Occurrence', fontsize=14)

Figure 2-5. Deciles of the review counts in the Yelp reviews dataset. Note that both x- and y-axes are in log scale.

To compute the quantiles and map data into quantile bins, we can use the Pandas library. pandas.DataFrame.quantile and pandas.Series.quantile compute the quantiles. pandas.qcut maps data into a desired number of quantiles.

Example 2-5: Binning counts by quantiles

### Continue Example 2-3 with large_counts
import pandas as pd

### Map the counts to quartiles
pd.qcut(large_counts, 4, labels=False)

array([1, 2, 3, 0, 0, 1, 1, 2, 2, 3, 3, 0, 0, 2, 1, 0, 3], dtype=int64)

### Compute the quantiles themselves
large_counts_series = pd.Series(large_counts)
large_counts_series.quantile([0.25, 0.5, 0.75])

0.25     122.0
0.50     926.0
0.75    8286.0
dtype: float64

Logarithmic transformation

In “Quantization or binning,” we briefly introduced the notion of taking the logarithm of the count to map the data to exponential-width bins. Let’s take a closer look at it now.

The log function is the inverse of the exponential function. It is defined such that $\log_a(a^x) = x$, where $a$ is a positive constant; its argument can be any positive number. Since $a^0 = 1$, we have $\log_a(1) = 0$. This means that the log function maps the small range of numbers in $(0, 1)$ to the entire range of negative numbers $(-\infty, 0)$. The function $\log_{10}(x)$ maps $[1, 10]$ to $[0, 1]$, $[10, 100]$ to $[1, 2]$, and so on. In other words, the log function compresses the range of large numbers and expands the range of small numbers. The larger $x$ is, the more slowly $\log(x)$ grows. This is easier to digest by looking at the graph of the log function (see Figure 2-6).
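A few evaluations of the base-10 logarithm make the compression and expansion visible. The sample points below are arbitrary.

```python
import math

# Values in (0, 1) map to negative numbers; each decade above 1 maps
# to an interval of length 1.
for x in (0.001, 0.1, 1, 10, 100, 1000):
    print(x, math.log10(x))

# The same absolute step of 9 produces a much smaller change in the log
# when x is large.
small_step = math.log10(10) - math.log10(1)        # from 1 to 10
large_step = math.log10(1009) - math.log10(1000)   # from 1000 to 1009
print(small_step, large_step)
```

The step from 1 to 10 moves the log by a full unit, while the step from 1000 to 1009 moves it by less than a hundredth of a unit.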

Figure 2-6. The log function compresses the high numeric range and expands the low range. Note how the horizontal x values from 100 to 1000 got compressed into just 2.0 to 3.0 in the vertical y range, while the tiny horizontal portion of x values less than 100 are mapped to the rest of the vertical range.

The log transformation is a powerful tool for dealing with positive numbers that have a heavy-tailed distribution. (A heavy-tailed distribution places more probability mass in the tail range than a Gaussian distribution.) It compresses the long tail at the high end of the distribution into a shorter tail and expands the low end into a longer head. Figure 2-7 compares the histograms of Yelp business review counts before and after the log transformation. The y-axes are now both on a normal (linear) scale. The increased spacing between bars at the low end of the bottom plot is due to there being only 10 possible integer counts between 1 and 10. Notice that the original review counts are heavily concentrated in the low-count region, with outliers stretching out above 4,000. After the log transformation, the histogram is less concentrated at the low end and more spread out along the x-axis.

Example 2-6: Visualize the distribution of review counts before and after the log transformation

fig, (ax1, ax2) = plt.subplots(2, 1)
fig.tight_layout(pad=0, w_pad=4.0, h_pad=4.0)
biz_df['review_count'].hist(ax=ax1, bins=100)
ax1.set_xlabel('review_count', fontsize=14)
ax1.set_ylabel('Occurrence', fontsize=14)
# Add 1 before taking the base-10 log so that a zero count does not map to -inf
biz_df['log_review_count'] = np.log10(biz_df['review_count'] + 1)
biz_df['log_review_count'].hist(ax=ax2, bins=100)
ax2.set_xlabel('log10(review_count)', fontsize=14)
ax2.set_ylabel('Occurrence', fontsize=14)

Figure 2-7. Comparison of Yelp business review counts before (top) and after (bottom) log transformation.

Another example is the Online News Popularity dataset from the UC Irvine Machine Learning Repository. The following are relevant statistics about the dataset.

Statistics of the Online News Popularity dataset

  • The dataset includes 60 features of a set of 39,797 news articles published by Mashable over a period of two years.
  • Citation: K. Fernandes, P. Vinagre, and P. Cortez. A Proactive Intelligent Decision Support System for Predicting the Popularity of Online News. Proceedings of the 17th EPIA 2015 - Portuguese Conference on Artificial Intelligence, September, Coimbra, Portugal.

The goal is to use these features to predict the popularity of the articles in terms of the number of shares on social media. In this example, we will focus on only one feature: the number of words in the article. Figure 2-8 shows the histograms of the feature before and after the log transformation. Notice that the distribution looks much more Gaussian after the log transformation, with the exception of the burst of articles of length zero (no content).

df = pd.read_csv('data/OnlineNewsPopularity.csv', delimiter=', ')
df['log_n_tokens_content'] = np.log10(df['n_tokens_content'] + 1)

Example 2-7: Visualize the distribution of news article popularity with and without logarithm transformation

fig, (ax1, ax2) = plt.subplots(2, 1)
fig.tight_layout(pad=0, w_pad=4.0, h_pad=4.0)
df['n_tokens_content'].hist(ax=ax1, bins=100)
ax1.set_xlabel('Number of Words in Article', fontsize=14)
ax1.set_ylabel('Number of Articles', fontsize=14)

df['log_n_tokens_content'].hist(ax=ax2, bins=100)
ax2.set_xlabel('Log of Number of Words', fontsize=14)
ax2.set_ylabel('Number of Articles', fontsize=14)

Figure 2-8. Comparison of word counts in Mashable news articles before (top) and after (bottom) log transformation.

Log transformation in action

Let’s see how the log transformation performs in a supervised learning setting. We will use both of the datasets above. For the Yelp reviews dataset, we will use the number of reviews to predict the average rating of a business. For the Mashable news articles, we will use the number of words in an article to predict its popularity. Since the outputs are continuous numbers, we will use simple linear regression as the model. We use scikit-learn to perform 10-fold cross-validation of linear regression on the feature with and without the log transformation. The models are evaluated by the R-squared score, which measures how well a trained regression model predicts new data. Good models have high R-squared scores. A perfect model gets the maximum score of 1. The score can be negative, and a bad model can get an arbitrarily low negative score. Using cross-validation, we obtain not only an estimate of the score but also a variance, which helps us gauge whether the difference between the two models is meaningful.
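To see why the R-squared score tops out at 1 and can go negative, here is a minimal sketch using the usual definition of the coefficient of determination, one minus the ratio of the residual sum of squares to the total sum of squares. The four target values are made up.

```python
def r_squared(y_true, y_pred):
    """Coefficient of determination: 1 - SS_residual / SS_total."""
    mean_y = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_y) ** 2 for t in y_true)
    return 1 - ss_res / ss_tot

y = [1.0, 2.0, 3.0, 4.0]

perfect = r_squared(y, y)                       # predictions match exactly
mean_only = r_squared(y, [2.5] * 4)             # always predict the mean
reversed_pred = r_squared(y, [4.0, 3.0, 2.0, 1.0])  # worse than the mean

print(perfect, mean_only, reversed_pred)  # 1.0 0.0 -3.0
```

A model that always predicts the mean of the target scores 0, so a score near 0 (as in the examples that follow) means the feature adds almost nothing beyond that baseline.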

Example 2-8: Using the log-transformed Yelp review count to predict the average business rating

import pandas as pd
import numpy as np
import json
from sklearn import linear_model
from sklearn.model_selection import cross_val_score

## Using the previously loaded Yelp reviews dataframe,
## compute the log transform of the Yelp review count.
## Note that we add 1 to the raw count to prevent the logarithm from
## exploding into negative infinity in case the count is zero.
biz_df['log_review_count'] = np.log10(biz_df['review_count'] + 1)
biz_df.head()

index | business_id | categories | city | full_address | latitude | longitude | name | neighborhoods | open | review_count | stars | state | type | log_review_count
0 | rncjoVoEFUJGCUoC1JgnUA | [Accountants, Professional Services, Tax Servi… | Peoria | 8466 W Peoria Ave\nSte 6\nPeoria, AZ 85345 | 33.581867 | 112.241596 | Peoria Income Tax Service | [] | True | 3 | 5.0 | AZ | business | 0.602060
1 | 0FNFSzCFP_rGUoJx8W7tJg | [Sporting Goods, Bikes, Shopping] | Phoenix | 2149 W Wood Dr\nPhoenix, AZ 85029 | 33.604054 | 112.105933 | Bike Doctor | [] | True | 5 | 5.0 | AZ | business | 0.778151
2 | 3f_lyB6vFK48ukH6ScvLHg | [] | Phoenix | 1134 N Central Ave\nPhoenix, AZ 85004 | 33.460526 | 112.073933 | Valley Permaculture Alliance | [] | True | 4 | 5.0 | AZ | business | 0.698970
3 | usAsSV36QmUej8–yvN-dg | [Food, Grocery] | Phoenix | 845 W Southern Ave\nPhoenix, AZ 85041 | 33.392210 | 112.085377 | Food City | [] | True | 5 | 3.5 | AZ | business | 0.778151
4 | PzOqRohWw7F7YEPBz6AubA | [Food, Bagels, Delis, Restaurants] | Glendale Az | 6520 W Happy Valley Rd\nSte 101\nGlendale Az, … | 33.712797 | 112.200264 | Hot Bagels & Deli | [] | True | 14 | 3.5 | AZ | business | 1.176091

## Train linear regression models to predict the average stars rating of a business,
## using the review_count feature with and without log transformation
## Compare the 10-fold cross validation score of the two models
m_orig = linear_model.LinearRegression()
scores_orig = cross_val_score(
    m_orig, biz_df[['review_count']], biz_df['stars'], cv=10)
m_log = linear_model.LinearRegression()
scores_log = cross_val_score(
    m_log, biz_df[['log_review_count']], biz_df['stars'], cv=10)
print("R-squared score without log transform: %0.5f (+/- %0.5f)" %
      (scores_orig.mean(), scores_orig.std() * 2))
print("R-squared score with log transform: %0.5f (+/- %0.5f)" %
      (scores_log.mean(), scores_log.std() * 2))

R-squared score without log transform: 0.00215 (+/- 0.00329)
R-squared score with log transform: 0.00136 (+/- 0.00328)

Judging by the experimental results, the two simple models (with and without the log transformation) are equally bad at predicting the target, with the log-transformed feature performing slightly worse. How disappointing! It is not surprising that neither of them is very good, given that they both use just one feature. But one would have hoped that the log-transformed feature would perform better.

Let’s see how the log transformation does on the Online News Popularity dataset.

Example 2-9: Predicting article popularity using the log-transformed word count in the Online News Popularity dataset

## Download the Online News Popularity dataset from UCI, then use
## Pandas to load the file into a dataframe
df = pd.read_csv('data/OnlineNewsPopularity.csv', delimiter=', ')

## Take the log transform of the 'n_tokens_content' feature, which
## represents the number of words (tokens) in a news article.
df['log_n_tokens_content'] = np.log10(df['n_tokens_content'] + 1)

## Train two linear regression models to predict the number of shares
## of an article, one using the original feature and the other the
## log transformed version.
m_orig = linear_model.LinearRegression()
scores_orig = cross_val_score(
    m_orig, df[['n_tokens_content']], df['shares'], cv=10)
m_log = linear_model.LinearRegression()
scores_log = cross_val_score(
    m_log, df[['log_n_tokens_content']], df['shares'], cv=10)
print("R-squared score without log transform: %0.5f (+/- %0.5f)" %
      (scores_orig.mean(), scores_orig.std() * 2))
print("R-squared score with log transform: %0.5f (+/- %0.5f)" %
      (scores_log.mean(), scores_log.std() * 2))

R-squared score without log transform: -0.00242 (+/- 0.00509)
R-squared score with log transform: -0.00114 (+/- 0.00418)

The confidence intervals still overlap, but the model with the log-transformed feature does better than the one without. Why is the log transformation more successful on this dataset? We can get a clue by looking at scatter plots of the input feature and the target values. As shown in the bottom panel of Figure 2-9, the log transformation reshapes the x-axis, pulling the articles with large outliers in the target value (more than 200,000 shares) further out toward the right-hand side of the axis. This gives the linear model more “breathing room” at the low end of the input feature space. Without the log transformation (top panel), the model is under more pressure to fit very different target values under very small changes in the input.

Example 2-10: Visualizing the correlation between input and output in the news popularity prediction problem

fig2, (ax1, ax2) = plt.subplots(2, 1, figsize=(10, 4))
# Adjust the layout of this figure (fig2), not the one from the earlier example
fig2.tight_layout(pad=0.4, w_pad=4.0, h_pad=6.0)
ax1.scatter(df['n_tokens_content'], df['shares'])
ax1.set_xlabel('Number of Words in Article', fontsize=14)
ax1.set_ylabel('Number of Shares', fontsize=14)

ax2.scatter(df['log_n_tokens_content'], df['shares'])
ax2.set_xlabel('Log of the Number of Words in Article', fontsize=14)
ax2.set_ylabel('Number of Shares', fontsize=14)

Figure 2-9. Scatter plot of number of words (input) vs. number of shares (target) in the Online News dataset. The top plot visualizes the original feature, and the bottom plot shows the scatter plot after log transformation.

Compare this with the same scatter plot applied to the Yelp reviews dataset. Figure 2-10 looks very different from Figure 2-9. The average star rating is discretized in increments of half a star, ranging from 1 to 5. High review counts (roughly more than 2,500 reviews) do correlate with higher average star ratings, but the relationship is far from linear. There is no clear way to draw a line that predicts the average star rating based on either input. Essentially, the plot shows that the review count and its logarithm are both bad linear predictors of the average star rating.

Example 2-11: Visualizing the correlation between input and output in Yelp business review prediction

fig, (ax1, ax2) = plt.subplots(2, 1)
fig.tight_layout(pad=0, w_pad=4.0, h_pad=4.0)
ax1.scatter(biz_df['review_count'], biz_df['stars'])
ax1.set_xlabel('Review Count', fontsize=14)
ax1.set_ylabel('Average Star Rating', fontsize=14)

ax2.scatter(biz_df['log_review_count'], biz_df['stars'])
ax2.set_xlabel('Log of Review Count', fontsize=14)
ax2.set_ylabel('Average Star Rating', fontsize=14)

Figure 2-10. Scatter plot of review counts (input) vs. average star ratings (target) in the Yelp Reviews dataset. The top panel plots the original review count, and the bottom panel plots the review count after log transformation.

The importance of data visualization

The comparison of the effect of the log transformation on two different datasets illustrates the importance of visualizing the data. Here, we intentionally kept the input and target variables simple so that we can easily visualize the relationship between them. Plots like those in Figure 2-10 immediately reveal that the chosen model (linear) cannot possibly represent the relationship between the chosen input and target. On the other hand, one could plausibly model the distribution of review counts given the average star rating. When building models, it is a good idea to visually inspect the relationships between input and output, and between different input features.

Power transformation: a generalization of logarithmic transformation

The log transformation is a specific example of a family of transformations known as power transforms. In statistical terms, these are variance-stabilizing transformations. To understand why variance stabilization is good, consider the Poisson distribution. This is a heavy-tailed distribution with a variance that is equal to its mean: hence, the larger its center of mass, the larger its variance, and the heavier the tail. Power transforms change the distribution of the variable so that the variance no longer depends on the mean. For example, suppose a random variable $X$ has the Poisson distribution. If we transform $X$ by taking its square root, the variance of $\tilde{X} = \sqrt{X}$ is roughly constant, instead of being equal to the mean.
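The variance-stabilizing effect of the square root can be checked with a small simulation. This is a sketch under stated assumptions: the Python standard library has no Poisson sampler, so the example draws samples with Knuth's multiplication method, and the three mean values and the sample size are arbitrary choices.

```python
import math
import random
from statistics import pvariance

def sample_poisson(lam, rng):
    """Draw one Poisson(lam) sample with Knuth's multiplication method."""
    threshold = math.exp(-lam)
    k, product = 0, rng.random()
    while product > threshold:
        k += 1
        product *= rng.random()
    return k

rng = random.Random(0)  # fixed seed so that the run is repeatable
results = {}
for lam in (10, 40, 160):
    xs = [sample_poisson(lam, rng) for _ in range(5000)]
    roots = [math.sqrt(x) for x in xs]
    # Store (variance of raw samples, variance of square-rooted samples).
    results[lam] = (pvariance(xs), pvariance(roots))
    print(lam, round(results[lam][0], 1), round(results[lam][1], 3))
```

The variance of the raw samples tracks the mean (roughly 10, 40, and 160), while the variance after the square root stays close to 1/4 for all three.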

Figure 2-11. The rough illustration of the Poisson distribution increases, not only does the mode of of the distribution shift to the right, but the mass spreads out and the variance becomes larger. The Poisson distribution is an example distribution where the variance increases along with the mean.

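A simple generalization of both the square root transform and the log transform is known as the Box-Cox transform:

$$\tilde{x} = \begin{cases} \dfrac{x^{\lambda} - 1}{\lambda} & \text{if } \lambda \neq 0, \\ \ln(x) & \text{if } \lambda = 0. \end{cases}$$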

Figure 2-12 shows the Box-Cox transform for several values of $\lambda$, including $\lambda = 0$ (the log transform) and $\lambda = 0.5$ (a scaled and shifted version of the square root). Setting $\lambda$ to a value less than 1 compresses the higher values, and setting $\lambda$ greater than 1 has the opposite effect.

Figure 2-12. Box-Cox transforms for different values of $\lambda$.
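To make the effect of the parameter concrete, here is a small hand-written version of the Box-Cox transform. It is only a sketch under stated assumptions: `box_cox` is a hypothetical helper written for this illustration (it is not the SciPy implementation), and the input values and the three parameter values are made up.

```python
import numpy as np

def box_cox(x, lmbda):
    """Box-Cox transform of strictly positive values for a given lambda."""
    x = np.asarray(x, dtype=float)
    if np.any(x <= 0):
        raise ValueError("Box-Cox is only defined for positive data")
    if lmbda == 0:
        return np.log(x)               # the lambda = 0 case is the log transform
    return (x ** lmbda - 1.0) / lmbda

counts = np.array([1.0, 10.0, 100.0, 1000.0])
transformed = {lmbda: box_cox(counts, lmbda) for lmbda in (0, 0.5, 1.5)}
for lmbda, values in transformed.items():
    print(lmbda, np.round(values, 2))
```

With a parameter below 1 the largest input is pulled in sharply, while a parameter above 1 stretches it further out.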

The Box-Cox formula only works when the data is positive. For nonpositive data, one can shift the values by adding a fixed constant. When applying the Box-Cox transform or a more general power transform, we have to determine a value for the parameter $\lambda$. This may be done via maximum likelihood (finding the $\lambda$ that maximizes the Gaussian likelihood of the resulting transformed signal) or via Bayesian methods. A full treatment of the usage of Box-Cox and general power transforms is outside the scope of this book. Interested readers may find more information on power transforms in Econometric Methods by Jack Johnston and John DiNardo (McGraw Hill). Fortunately, the SciPy stats package contains an implementation of the Box-Cox transform that includes finding the optimal transform parameter.
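The shift for nonpositive data can be sketched as follows. This is an illustration under stated assumptions: the data is synthetic (a geometric count feature that contains zeros), and shifting so that the minimum becomes 1 is just one simple choice of constant, not a recommendation from the original text.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)
# Hypothetical count feature that contains zeros, so Box-Cox cannot be applied directly.
raw = (rng.geometric(p=0.05, size=5_000) - 1).astype(float)

shift = 1.0 - raw.min()      # constant that moves the smallest value to 1
shifted = raw + shift

# With lmbda left unset, scipy estimates it by maximum likelihood.
transformed, fitted_lambda = stats.boxcox(shifted)

skew_before = stats.skew(shifted)
skew_after = stats.skew(transformed)
print("shift:", shift, "fitted lambda:", round(float(fitted_lambda), 3))
print("skewness before:", round(float(skew_before), 3), "after:", round(float(skew_after), 3))
```

The choice of constant affects the fitted parameter, so it is worth trying more than one shift when the data contains many zeros.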

Example 2-12: Box-Cox transformation of the number of Yelp business reviews

from scipy import stats

# Continuing from the previous example, assume biz_df contains
# the Yelp business reviews data.
# The Box-Cox transform assumes that input data is positive.
# Check the min to make sure.
biz_df['review_count'].min()

# Setting input parameter lmbda to 0 gives us the log transform (without constant offset)
rc_log = stats.boxcox(biz_df['review_count'], lmbda=0)

# By default, the scipy implementation of Box-Cox transform finds the lmbda parameter
# that will make the output the closest to a normal distribution
rc_bc, bc_params = stats.boxcox(biz_df['review_count'])

# Store both results as columns so that the plotting code below can use them
biz_df['rc_log'] = rc_log
biz_df['rc_bc'] = rc_bc

Figure 2-13 provides a visual comparison of the distributions of the original and transformed review counts.

Example 2-13: Histograms of the raw, log-transformed, and Box-Cox-transformed review counts

fig, (ax1, ax2, ax3) = plt.subplots(3, 1)
fig.tight_layout(pad=0, w_pad=4.0, h_pad=4.0)
# original review count histogram
biz_df['review_count'].hist(ax=ax1, bins=100)
ax1.set_title('Review Counts Histogram', fontsize=14)
ax1.set_xlabel(' ')
ax1.set_ylabel('Occurrence', fontsize=14)

# review count after log transform
biz_df['rc_log'].hist(ax=ax2, bins=100)
ax2.set_title('Log Transformed Counts Histogram', fontsize=14)
ax2.set_xlabel(' ')
ax2.set_ylabel('Occurrence', fontsize=14)

# review count after optimal Box-Cox transform
biz_df['rc_bc'].hist(ax=ax3, bins=100)
ax3.set_title('Box-Cox Transformed Counts Histogram', fontsize=14)
ax3.set_xlabel(' ')
ax3.set_ylabel('Occurrence', fontsize=14)

Figure 2-13. Box-Cox transformation of Yelp business review counts.

Probability plots (probplots) are an easy way to visually compare an empirical distribution of data against a theoretical distribution. A probplot is essentially a scatter plot of observed versus theoretical quantiles. Figure 2-14 shows the probplots of the original and transformed Yelp review counts against the normal distribution. Since the observed data is strictly positive and the Gaussian can be negative, the quantiles can never match up on the negative end. So we focus on the positive side. There, the original review counts are clearly much more heavy-tailed than a normal distribution. (The ordered values go up to 4,000, whereas the theoretical quantiles only stretch to 4.) Both the plain log transform and the optimal Box-Cox transform bring the positive tail closer to normal. The optimal Box-Cox transform deflates the tail more than the log transform, as is evident from the fact that the tail flattens out under the red diagonal equivalence line.

Example 2-14: Probability plots of original and transformed review counts against the normal distribution

fig2, (ax1, ax2, ax3) = plt.subplots(3, 1, figsize=(8, 6))
# fig.tight_layout(pad=4, w_pad=5.0, h_pad=0.0)
prob1 = stats.probplot(biz_df['review_count'], dist=stats.norm, plot=ax1)
ax1.set_xlabel(' ')
ax1.set_title('Probplot against normal distribution')
prob2 = stats.probplot(biz_df['rc_log'], dist=stats.norm, plot=ax2)
ax2.set_xlabel(' ')
ax2.set_title('Probplot after log transform')
prob3 = stats.probplot(biz_df['rc_bc'], dist=stats.norm, plot=ax3)
ax3.set_xlabel('Theoretical quantiles')
ax3.set_title('Probplot after Box-Cox transform')

Figure 2-14. Comparing the distribution of raw and transformed review counts against the Normal distribution.
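The same comparison can be made without drawing anything, because `stats.probplot` also returns the ordered values, the theoretical quantiles, and a least-squares fit. The sketch below assumes a synthetic log-normal sample as a stand-in for a heavy-tailed count feature; `normal_fit_r` is a hypothetical helper name introduced only for this illustration.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
# Stand-in for a heavy-tailed feature (log-normal, so its log is Gaussian).
sample = rng.lognormal(mean=3.0, sigma=1.0, size=2_000)

def normal_fit_r(values):
    """Correlation between ordered values and theoretical normal quantiles."""
    (theoretical_q, ordered), (slope, intercept, r) = stats.probplot(values, dist="norm")
    return r

r_raw = normal_fit_r(sample)
r_log = normal_fit_r(np.log(sample))
print(f"raw r = {r_raw:.3f}, log r = {r_log:.3f}")
```

A correlation close to 1 means the points lie near a straight line, which is the numerical counterpart of a probplot that hugs the diagonal.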

Feature scaling or normalization

Some features, such as latitude or longitude, are bounded in value. Other numeric features, such as counts, may increase without bound. Models that are smooth functions of the input, such as linear regression, logistic regression, or anything that involves a matrix, are affected by the scale of the input. Tree-based models, on the other hand, care much less about it. If your model is sensitive to the scale of input features, feature scaling could help. As the name suggests, feature scaling changes the scale of the feature. It is sometimes also called feature normalization. Feature scaling is usually done individually for each feature. There are several common scaling operations, each resulting in a different distribution of feature values.

Min-max scaling

Let $x$ be an individual feature value (that is, the value of the feature in some data point), and let $\min(x)$ and $\max(x)$ be the minimum and maximum values of this feature over the entire dataset, respectively. Min-max scaling squeezes (or stretches) all feature values to lie within the range $[0, 1]$. Figure 2-15 illustrates this concept. The formula for min-max scaling is

$$\tilde{x} = \frac{x - \min(x)}{\max(x) - \min(x)}$$

Figure 2-15. Min-max scaling
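Min-max scaling is short enough to write by hand. The following is a minimal sketch, assuming a one-dimensional feature that is not constant (otherwise the denominator would be zero); `min_max_scale` is a hypothetical helper and the word counts are illustrative values.

```python
import numpy as np

def min_max_scale(x):
    """Scale a 1-D feature to [0, 1]; assumes the feature is not constant."""
    x = np.asarray(x, dtype=float)
    lo, hi = x.min(), x.max()
    return (x - lo) / (hi - lo)

word_counts = np.array([219.0, 255.0, 211.0, 442.0, 682.0, 157.0])
scaled = min_max_scale(word_counts)
print(scaled)
```

The smallest value maps to 0, the largest to 1, and the ordering of the values is unchanged.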

Standardization (variance scaling)

Feature standardization is defined as:

$$\tilde{x} = \frac{x - \operatorname{mean}(x)}{\sqrt{\operatorname{var}(x)}}$$

It subtracts the mean of the feature (over all data points) and divides by the standard deviation, that is, the square root of the variance. Hence, it can also be called variance scaling. The resulting scaled feature has a mean of 0 and a variance of 1. If the original feature has a Gaussian distribution, then the scaled feature is a standard Gaussian. Figure 2-16 illustrates standardization.

Figure 2-16. Illustration of feature standardization
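As a quick check, standardization can be written directly with NumPy. This is a sketch, not a library implementation: `standardize` is a hypothetical helper, the values are illustrative, and the division is by the standard deviation (the square root of the variance), which is what produces unit variance.

```python
import numpy as np

def standardize(x):
    """Subtract the mean and divide by the population standard deviation."""
    x = np.asarray(x, dtype=float)
    return (x - x.mean()) / x.std()   # x.std() uses ddof=0 by default

values = np.array([219.0, 255.0, 211.0, 442.0, 682.0, 157.0])
z = standardize(values)
print("mean:", z.mean(), "variance:", z.var())
```

Values below the original mean come out negative, which is why standardized features always contain both signs unless the feature is constant.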

Don't center sparse data

Min-max scaling and standardization both subtract a quantity from the original feature value. For min-max scaling, the shift is the minimum over all values of the current feature; for standardization, it is the mean. If the shift is not zero, then these two transforms can turn a sparse feature vector, where most values are zero, into a dense one. This in turn can create a huge computational burden for the classifier, depending on how it is implemented. Bag-of-words is a sparse representation, and most classification libraries optimize for sparse inputs. It would be horrible if the representation now contained an entry for every word that doesn't appear in a document. Use caution when performing min-max scaling and standardization on sparse features.
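The loss of sparsity is easy to see by counting nonzero entries. The sketch below assumes a synthetic bag-of-words-like matrix (the sizes and the Poisson rate are made up) and uses scikit-learn's `StandardScaler` with `with_mean=False`, which scales without centering.

```python
import numpy as np
from scipy import sparse
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
# Hypothetical bag-of-words-like matrix: 1,000 documents, 50 terms, mostly zeros.
dense = rng.poisson(lam=0.05, size=(1_000, 50)).astype(float)
X = sparse.csr_matrix(dense)

# Scaling without centering keeps zeros as zeros.
scaled_only = StandardScaler(with_mean=False).fit_transform(X)

# Centering subtracts a nonzero mean from every entry, zeros included.
centered = dense - dense.mean(axis=0)

print("nonzeros, original:       ", X.nnz)
print("nonzeros, scaled only:    ", scaled_only.nnz)
print("nonzeros, after centering:", np.count_nonzero(centered))
```

After centering, every entry of the matrix is nonzero, so a sparse storage format no longer saves anything.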

L2 normalization

This technique normalizes (divides) the original feature value by what's known as the L2 norm, also known as the Euclidean norm:

$$\tilde{x} = \frac{x}{\lVert x \rVert_2}$$

The L2 norm measures the length of a vector in coordinate space. The definition can be derived from the Pythagorean theorem, which gives the length of the hypotenuse of a right triangle given the lengths of the other two sides:

$$\lVert x \rVert_2 = \sqrt{x_1^2 + x_2^2 + \cdots + x_m^2}$$

The L2 norm sums the squares of the values of the feature across all data points, then takes the square root. After L2 normalization, the feature column has norm 1. This is also sometimes called L2 scaling. (Loosely speaking, scaling means multiplying by a constant, whereas normalization could involve a number of operations.) Figure 2-17 illustrates L2 normalization.

Figure 2-17. Illustration of L2 feature normalization
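A hand-written version makes the definition concrete. This is a minimal sketch assuming a nonzero feature column; `l2_normalize` is a hypothetical helper and the three values are chosen so that the norm is a round number.

```python
import numpy as np

def l2_normalize(x):
    """Divide a feature column by its Euclidean norm; assumes a nonzero column."""
    x = np.asarray(x, dtype=float)
    norm = np.sqrt(np.sum(x ** 2))   # square, sum, then take the square root
    return x / norm

column = np.array([3.0, 4.0, 12.0])   # norm is sqrt(9 + 16 + 144) = 13
unit = l2_normalize(column)
print(unit, np.linalg.norm(unit))
```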

Data space and feature space

Note that the illustration in Figure 2-17 is in data space, not feature space. One can also apply L2 normalization to each data point instead of each feature, which results in data vectors with unit norm (norm of 1). (See the discussion of the complementary nature of data vectors and feature vectors in bag-of-words.) No matter the scaling method, feature scaling always divides the feature by a constant (known as the normalization constant). Therefore, it does not change the shape of the single-feature distribution. We'll illustrate this with the token (word) counts of online news articles.
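The difference between normalizing features and normalizing data points comes down to the axis argument. The following sketch uses scikit-learn's `normalize` on a tiny made-up matrix in which rows are data points and columns are features.

```python
import numpy as np
from sklearn.preprocessing import normalize

# Rows are data points, columns are features (made-up values).
X = np.array([[1.0, 200.0],
              [2.0, 100.0],
              [2.0, 400.0]])

by_feature = normalize(X, norm="l2", axis=0)   # each column ends up with norm 1
by_point = normalize(X, norm="l2", axis=1)     # each row ends up with norm 1

print(np.linalg.norm(by_feature, axis=0))
print(np.linalg.norm(by_point, axis=1))
```

Normalizing per data point mixes features together, so it is a different operation from the per-feature scaling discussed in this section.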

Example 2-15: Example of feature scaling.

import pandas as pd
import sklearn.preprocessing as preproc

# Load the online news popularity dataset
df = pd.read_csv('data/OnlineNewsPopularity.csv', delimiter=', ')

# Look at the original data - the number of words in an article
df['n_tokens_content'].values
# array([219., 255., 211., ..., 442., 682., 157.])

# Min-max scaling
df['minmax'] = preproc.minmax_scale(df[['n_tokens_content']])
df['minmax'].values
# array([0.02584376, 0.03009205, 0.02489969, ..., 0.05215955, 0.08048147, 0.01852726])

# Standardization - note that by definition, some outputs will be negative
df['standardized'] = preproc.StandardScaler().fit_transform(df[['n_tokens_content']])
df['standardized'].values
# array([-0.69521045, -0.61879381, -0.71219192, ..., -0.2218518, ...])

# L2-normalization
df['l2_normalized'] = preproc.normalize(df[['n_tokens_content']], axis=0)
df['l2_normalized'].values
# array([0.00152439, 0.00177498, 0.00146871, ..., 0.00307663, 0.0047472, ...])

We can also visualize the distribution of the data under the different feature scaling methods. As Figure 2-18 shows, unlike the log transform, feature scaling does not change the shape of the distribution; only the scale of the data changes.

Example 2-16: Drawing histograms of raw and scaled data.

fig, (ax1, ax2, ax3, ax4) = plt.subplots(4, 1)
fig.tight_layout(pad=0, w_pad=1.0, h_pad=2.0)
# fig.tight_layout()

df['n_tokens_content'].hist(ax=ax1, bins=100)
ax1.set_xlabel('Article word count', fontsize=14)
# ax1.set_ylabel('Number of articles', fontsize=14)

df['minmax'].hist(ax=ax2, bins=100)
ax2.set_xlabel('Min-max scaled word count', fontsize=14)
ax2.set_ylabel('Number of articles', fontsize=14)

df['standardized'].hist(ax=ax3, bins=100)
ax3.set_xlabel('Standardized word count', fontsize=14)
# ax3.set_ylabel('Number of articles', fontsize=14)

df['l2_normalized'].hist(ax=ax4, bins=100)
ax4.set_xlabel('L2-normalized word count', fontsize=14)
ax4.set_ylabel('Number of articles', fontsize=14)

Figure 2-18. Original and scaled news article word counts. Note that only the scale of the x-axis changes; the shape of the distribution stays the same with feature scaling.
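One way to confirm that the shape is preserved without looking at a plot is to compare skewness, which is unaffected by shifting and positive rescaling. The sketch below assumes a synthetic right-skewed feature as a stand-in for article word counts; it is not the news dataset itself.

```python
import numpy as np
from scipy import stats
from sklearn import preprocessing as preproc

rng = np.random.default_rng(1)
# Stand-in for a right-skewed count feature such as article word counts.
x = rng.lognormal(mean=6.0, sigma=0.8, size=10_000).reshape(-1, 1)

skews = {
    "raw": stats.skew(x.ravel()),
    "minmax": stats.skew(preproc.minmax_scale(x).ravel()),
    "standardized": stats.skew(preproc.StandardScaler().fit_transform(x).ravel()),
    "l2": stats.skew(preproc.normalize(x, axis=0).ravel()),
    "log": stats.skew(np.log(x).ravel()),
}
for name, value in skews.items():
    print(f"{name:>12}: {value:.3f}")
```

The three scaled versions report the same skewness as the raw feature, while the log transform changes it substantially.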

Feature scaling is useful in situations where a set of input features differs wildly in scale. For instance, the number of daily visitors to a popular e-commerce site might be 100,000, while the actual number of sales might be in the thousands. If both of those features are thrown into a model, then the model will need to balance their scales while figuring out what to do. Drastically varying scales in input features can lead to numeric stability issues for the model training algorithm. In those situations, it is a good idea to standardize the features. Chapter 4 goes into detail on feature scaling in the context of handling natural text, including usage examples.
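One rough indicator of numeric stability is the condition number of the matrix that a least-squares solver has to work with. The sketch below is an illustration under stated assumptions: the two features are synthetic stand-ins for daily visitors and daily sales, and the condition number of $X^T X$ is used as a proxy for how delicate the fit is.

```python
import numpy as np

rng = np.random.default_rng(0)
# Hypothetical daily site metrics on very different scales.
visitors = rng.normal(loc=100_000, scale=10_000, size=365)
sales = rng.normal(loc=3_000, scale=500, size=365)
X = np.column_stack([visitors, sales])

# Standardize each column: subtract its mean, divide by its standard deviation.
X_std = (X - X.mean(axis=0)) / X.std(axis=0)

# Smaller condition numbers mean a better-behaved least-squares problem.
cond_raw = np.linalg.cond(X.T @ X)
cond_std = np.linalg.cond(X_std.T @ X_std)
print(f"raw: {cond_raw:.3e}   standardized: {cond_std:.3e}")
```

Part of the improvement comes from centering and part from rescaling; standardization does both.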

Interaction features

A simple pairwise interaction feature is the product of two features. The analogy is the logical AND. It expresses the outcome in terms of pairs of conditions: "the purchase is coming from zip code 98121" AND "the user is between 18 and 35 years old." Decision tree-based models capture such conjunctions on their own, so explicit interaction features add little for them, but they are often very helpful for generalized linear models.

A simple linear model uses a linear combination of the individual input features $x_1, x_2, \ldots, x_n$ to predict the outcome $y$:

$$y = w_1 x_1 + w_2 x_2 + \cdots + w_n x_n$$

An easy way to extend the linear model is to include combinations of pairs of input features, like so:

$$y = w_1 x_1 + w_2 x_2 + \cdots + w_n x_n + w_{1,1} x_1 x_1 + w_{1,2} x_1 x_2 + w_{1,3} x_1 x_3 + \cdots$$

This allows us to capture interactions between features, hence these pairs are called interaction features. If $x_1$ and $x_2$ are binary, then their product $x_1 x_2$ is the logical function $x_1$ AND $x_2$. Suppose the problem is to predict a customer's preference based on their profile information. In that case, instead of making predictions based solely on the age or location of the user, interaction features allow the model to make predictions based on the user being of a certain age AND at a particular location.
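For binary features, the equivalence between a product and a logical AND can be verified directly. The sketch below uses two made-up indicator columns for four hypothetical users, and then builds the same interaction with scikit-learn's `PolynomialFeatures` using `interaction_only=True`, which keeps the cross term but leaves out the squared terms.

```python
import numpy as np
from sklearn.preprocessing import PolynomialFeatures

# Two made-up binary indicators for four users.
in_zip_98121 = np.array([1, 1, 0, 0])
age_18_to_35 = np.array([1, 0, 1, 0])

product = in_zip_98121 * age_18_to_35
logical_and = np.logical_and(in_zip_98121, age_18_to_35).astype(int)
print(product, logical_and)

# interaction_only=True keeps x1, x2 and x1*x2 but drops the squares.
X = np.column_stack([in_zip_98121, age_18_to_35])
poly = PolynomialFeatures(degree=2, interaction_only=True, include_bias=False)
X_inter = poly.fit_transform(X)
print(X_inter)
```

Only the first user satisfies both conditions, so only the first row has a nonzero interaction column.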

In Example 2-17, we use pairwise interaction features from the UCI Online News Popularity dataset to predict the number of shares for each news article. As the results show, interaction features yield a small lift in accuracy over singleton features. Both perform better than Example 2-9, which used the number of words in the body of the article as a single predictor (with or without a log transform).

Example 2-17: Example of an interaction feature for prediction

from sklearn import linear_model
from sklearn.model_selection import train_test_split
import sklearn.preprocessing as preproc

### Assume df is a Pandas dataframe containing the UCI online news dataset
### Select the content-based features as singleton features in the model,
### skipping over the derived features
features = ['n_tokens_title', 'n_tokens_content', 'n_unique_tokens',
            'n_non_stop_words', 'n_non_stop_unique_tokens', 'num_hrefs',
            'num_self_hrefs', 'num_imgs', 'num_videos', 'average_token_length',
            'num_keywords', 'data_channel_is_lifestyle',
            'data_channel_is_entertainment', 'data_channel_is_bus',
            'data_channel_is_socmed', 'data_channel_is_tech',
            'data_channel_is_world']

X = df[features]
y = df[['shares']]
### Create pairwise interaction features, skipping the constant bias term
X2 = preproc.PolynomialFeatures(include_bias=False).fit_transform(X)
### Create train/test sets for both feature sets
X1_train, X1_test, X2_train, X2_test, y_train, y_test = train_test_split(X, X2, y, test_size=0.3, random_state=123)

def evaluate_feature(X_train, X_test, y_train, y_test):
    # Fit a linear regression model on the training set and score on the test set
    model = linear_model.LinearRegression().fit(X_train, y_train)
    r_score = model.score(X_test, y_test)
    return (model, r_score)
### Train models and compare score on the two feature sets
(m1, r1) = evaluate_feature(X1_train, X1_test, y_train, y_test)
(m2, r2) = evaluate_feature(X2_train, X2_test, y_train, y_test)
print("R-squared score with singleton features: %0.5f" % r1)
print("R-squared score with pairwise features: %0.10f" % r2)
# Output:
# R-squared score with singleton features: 0.00924
# R-squared score with pairwise features: 0.0113280910

Interaction features are very simple to formulate, but they are expensive to use. The training and scoring time of a linear model with pairwise interaction features would go from $O(n)$ to $O(n^2)$, where $n$ is the number of singleton features.
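The quadratic growth can be made concrete by counting columns. The sketch below assumes degree-2 features without a bias column; `degree2_feature_count` is a hypothetical helper written for this illustration, and the result for 17 inputs is cross-checked against scikit-learn's `PolynomialFeatures`.

```python
from math import comb

import numpy as np
from sklearn.preprocessing import PolynomialFeatures

def degree2_feature_count(n, interaction_only=False):
    """Number of output columns for n inputs, excluding the bias column."""
    pairs = comb(n, 2)                      # products x_i * x_j with i < j
    squares = 0 if interaction_only else n  # terms x_i * x_i
    return n + pairs + squares

for n in (17, 100, 1000):
    print(n, degree2_feature_count(n))

# Cross-check against scikit-learn for 17 singleton features.
poly = PolynomialFeatures(degree=2, include_bias=False).fit(np.zeros((1, 17)))
n_out = poly.n_output_features_
print("scikit-learn reports", n_out, "output features for 17 inputs")
```

With 17 singleton features, as in the feature list of Example 2-17, the degree-2 expansion already has 170 columns.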

There are a few ways around the computational expense of higher-order interaction features. One can perform feature selection on top of all of the interaction features to select the top few. Alternatively, one can more carefully craft a smaller number of complex features. Both strategies have their advantages and disadvantages. Feature selection employs computational means to select the best features for a problem. (This technique is not limited to interaction features.) However, some feature selection techniques still require training multiple models with a large number of features.

Handcrafted complex features can be expressive enough that only a small number of them are needed, which reduces the training time of the model. However, the features themselves may be expensive to compute, which increases the computational cost of the model scoring stage. Good examples of handcrafted (or machine-learned) complex features can be found in Chapter 8 on image features. Now let's look at some feature selection techniques.

Feature selection

Feature selection techniques prune away nonuseful features in order to reduce the complexity of the resulting model. The end goal is a parsimonious model that is quicker to compute, with little or no degradation in predictive accuracy. In order to arrive at such a model, some feature selection techniques require training more than one candidate model. In other words, feature selection is not about reducing training time (in fact, some techniques increase overall training time) but about reducing model scoring time.

Roughly speaking, feature selection techniques fall into three categories.

  • Filtering: Filtering techniques preprocess features to remove ones that are unlikely to be useful for the model. For example, one could compute the correlation or mutual information between each feature and the response variable, and filter out the features that fall below a threshold. Chapter 3 discusses examples of these techniques for text features. Filtering techniques are much cheaper than the wrapper techniques described next, but they do not take into account the model being employed. Hence, they may not be able to select the right features for the model. It is best to prefilter conservatively, so as not to inadvertently eliminate useful features before they even make it to the model training step.
  • Wrapper methods: These techniques are expensive, but they allow you to try out subsets of features, which means you won't accidentally prune away features that are uninformative by themselves but useful when taken in combination. The wrapper method treats the model as a black box that provides a quality score for a proposed subset of features. A separate method iteratively refines the subset.
  • Embedded methods: These methods perform feature selection as part of the model training process. For example, a decision tree inherently performs feature selection because it selects one feature on which to split the tree at each training step. Another example is the L1 regularizer, which can be added to the training objective of any linear model. The L1 regularizer encourages models that use a few features as opposed to many, so it is also known as a sparsity constraint on the model. Embedded methods are not as powerful as wrapper methods, but they are nowhere near as expensive. Compared to filtering, embedded methods select features that are specific to the model. In this sense, embedded methods strike a balance between computational expense and quality of results.
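The three families can be sketched side by side with scikit-learn. Everything here is an assumption made for illustration: the data is synthetic, with only the first three of eight features influencing the target, and the specific choices (`f_regression` as the filter statistic, recursive feature elimination as the wrapper, and a Lasso with `alpha=0.1` as the embedded method) are just one representative of each family.

```python
import numpy as np
from sklearn.feature_selection import RFE, SelectKBest, f_regression
from sklearn.linear_model import Lasso, LinearRegression

rng = np.random.default_rng(0)
# Synthetic data: only the first three of eight features influence the target.
X = rng.normal(size=(300, 8))
y = 3.0 * X[:, 0] - 2.0 * X[:, 1] + 1.5 * X[:, 2] + 0.1 * rng.normal(size=300)

# Filter: rank features by a univariate statistic, independent of any model.
selector = SelectKBest(f_regression, k=3).fit(X, y)
filter_idx = set(selector.get_support(indices=True))

# Wrapper: repeatedly fit a model and drop the weakest remaining feature.
wrapper = RFE(LinearRegression(), n_features_to_select=3).fit(X, y)
wrapper_idx = set(np.flatnonzero(wrapper.support_))

# Embedded: the L1 penalty drives some coefficients to exactly zero during training.
lasso = Lasso(alpha=0.1).fit(X, y)
embedded_idx = set(np.flatnonzero(lasso.coef_))

print(filter_idx, wrapper_idx, embedded_idx)
```

On such clean data all three approaches agree; on real data they often do not, which is where the trade-offs described above start to matter.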

A comprehensive treatment of feature selection is beyond the scope of this book. Interested readers are referred to the survey paper "An Introduction to Variable and Feature Selection" by Isabelle Guyon and André Elisseeff.


This article discussed a number of common numeric feature engineering techniques: quantization, scaling (also known as normalization), log transforms (a type of power transform), and interaction features, along with a brief summary of the feature selection techniques needed to handle large numbers of interaction features. In statistical machine learning, all data eventually boils down to numeric features. Therefore, all roads lead to some kind of numeric feature engineering technique in the end. Keep these tools handy for the end game of feature engineering!


Guyon, Isabelle, and André Elisseeff. 2003. "An Introduction to Variable and Feature Selection." Journal of Machine Learning Research Special Issue on Variable and Feature Selection 3(Mar): 1157–1182.

Johnston, Jack, and John DiNardo. 1997. Econometric Methods (Fourth Edition). New York: McGraw Hill.
