Hello everyone! Tomorrow is New Year's Day and I can already feel the holiday spirit, so let me wish you all a happy New Year in advance.

Well, without further ado, today let's talk about how to pick good features, and what makes a feature good when we use it to tell things apart.

In this article we will use a machine learning classifier as the running example, because a classifier can only give us good results after we have fed it good features. In other words, finding good features is one of the important prerequisites for machine learning to work. So the questions are: what are good features, and how do you know that a feature is good? Let's answer them next.

We use features to describe an object. For example, suppose the objects in a category have two properties, length and color; we then use these features to describe the category. Good features make it easier to identify which category they represent, while bad features confuse our judgment and waste our analytical and computational resources on useless analysis.

OK, now let's find an example closer to everyday life. Think of all the cute dogs we keep at home, such as golden retrievers and Chihuahuas. They have a lot of features: eye color, coat color, weight, height, body length, and so on. To keep things simple, we will focus on just two attributes, color and height, and we will assume that both breeds come in only two colors: yellowish and whitish.

So let's start by comparing colors. How many yellowish and whitish golden retrievers are there in our virtual world? It turns out they are split almost half and half. When we look at the Chihuahuas, their colors are also split about evenly. We can tabulate this data by the two attributes, yellowish and whitish, and for each color record the proportion of golden retrievers and Chihuahuas. Once the data is laid out as a table or chart, we find that within either color the ratio of Chihuahuas to golden retrievers is roughly the same. If I hand you a yellowish dog, you cannot tell from color alone whether it is a Chihuahua or a golden retriever. So color is not an appropriate way to tell the breeds apart: it plays no real role in distinguishing them, and for this task it is meaningless information.
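To make the color comparison concrete, here is a minimal sketch that tabulates, for each color, what share of the dogs belongs to each breed. The counts are made up for illustration (the article only says the split is roughly half and half), and the helper name breed_share_by_color is hypothetical.

```python
# Hypothetical counts for the virtual world: roughly half of each breed
# is yellowish and half is whitish. These numbers are invented.
counts = {
    "golden retriever": {"yellowish": 205, "whitish": 195},
    "chihuahua": {"yellowish": 198, "whitish": 202},
}


def breed_share_by_color(counts):
    """For each color, return the share of each breed among dogs of that color."""
    colors = {color for per_breed in counts.values() for color in per_breed}
    shares = {}
    for color in sorted(colors):
        # Total number of dogs of this color, across all breeds.
        total = sum(per_breed.get(color, 0) for per_breed in counts.values())
        shares[color] = {
            breed: per_breed.get(color, 0) / total
            for breed, per_breed in counts.items()
        }
    return shares


shares = breed_share_by_color(counts)
for color, per_breed in shares.items():
    for breed, share in per_breed.items():
        print(f"{color:>9} | {breed:<16} | {share:.2f}")
```

With counts like these, every share comes out close to 0.5, which is another way of saying that knowing the color tells us almost nothing about the breed.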

Since color is not a useful feature, could height be a good one? Height is just a number and rather abstract on its own, but we can visualize the data to get a feel for it. For that we can use Python.

First we import the Python modules matplotlib and NumPy. Then we give golden retrievers and Chihuahuas two short names, gold and chihh, and define 400 samples of each. Next we generate height data for each breed (the Chihuahuas average about 25 cm). Because no two dogs are exactly the same height, we add a random offset to each dog; the random spread can be larger for golden retrievers and smaller for Chihuahuas. Finally, we draw a histogram showing how many golden retrievers and how many Chihuahuas there are at each height: blue represents Chihuahuas and red represents golden retrievers.
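The steps just described could look roughly like the sketch below. It is one possible version, not necessarily the original script: the golden retriever average of 40 cm and the spreads of 10 cm and 6 cm are assumptions chosen for illustration, as are the fixed random seed and the variable names gold_height and chihh_height. Only the 400 samples per breed, the 25 cm Chihuahua average, and the colors come from the text.

```python
import matplotlib.pyplot as plt
import numpy as np

rng = np.random.default_rng(0)  # fixed seed so the figure is reproducible

gold, chihh = 400, 400  # 400 samples of each breed

# Assumed parameters: golden retrievers centered at 40 cm with a spread of
# 10 cm; Chihuahuas centered at 25 cm with a smaller spread of 6 cm.
gold_height = 40 + 10 * rng.standard_normal(gold)
chihh_height = 25 + 6 * rng.standard_normal(chihh)

# Stacked histogram: red for golden retrievers, blue for Chihuahuas.
fig, ax = plt.subplots()
ax.hist(
    [gold_height, chihh_height],
    bins=20,
    stacked=True,
    color=["red", "blue"],
    label=["golden retriever", "Chihuahua"],
)
ax.set_xlabel("height (cm)")
ax.set_ylabel("number of dogs")
ax.legend()
```

Call plt.show() afterwards to display the figure in a window, or fig.savefig("heights.png") to write it to a file.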


If a dog is 50 cm tall, we can be fairly sure it is a golden retriever, and the same goes for anything taller than 50 cm. If it is 20 cm tall, we can be fairly confident it is a Chihuahua. But between 30 and 40 cm we can't say for sure which breed it is: in this range the numbers of the two breeds are roughly comparable, so height alone does not determine the breed well. No single feature is perfect, which is why machine learning usually involves using more than one feature and deciding how to process them.
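One way to check this reading of the histogram is to count, band by band, what share of the dogs are golden retrievers. The sketch below regenerates the data under assumed parameters (golden retrievers around 40 cm with a 10 cm spread, Chihuahuas around 25 cm with a 6 cm spread), so the exact shares depend on those assumptions; golden_share is a hypothetical helper name.

```python
import numpy as np

rng = np.random.default_rng(0)  # fixed seed for reproducibility

# Assumed height distributions (in cm), 400 dogs per breed.
gold_height = 40 + 10 * rng.standard_normal(400)
chihh_height = 25 + 6 * rng.standard_normal(400)


def golden_share(low, high):
    """Share of golden retrievers among dogs with low <= height < high.

    Returns None if no dog of either breed falls in the band.
    """
    n_gold = np.count_nonzero((gold_height >= low) & (gold_height < high))
    n_chihh = np.count_nonzero((chihh_height >= low) & (chihh_height < high))
    total = n_gold + n_chihh
    return n_gold / total if total else None


bands = [(-np.inf, 20), (20, 30), (30, 40), (40, 50), (50, np.inf)]
for low, high in bands:
    share = golden_share(low, high)
    label = "no dogs" if share is None else f"{share:.2f}"
    print(f"height in [{low}, {high}): golden retriever share = {label}")
```

A share near 0 or 1 means height settles the question in that band, while a share somewhere in the middle means both breeds are well represented and height alone is not enough.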

When we want more information, we should first rule out features that have no power to distinguish, such as color, which we just saw is not useful, and keep those that do, such as height. We should then look for additional features to make up for what height cannot tell us: the dog's weight, the color of its eyes, the shape of its ears, and so on. Combined, these features can often compensate for what any single one lacks.

Sometimes our data also contains features that carry the same meaning even though their values differ, for example a distance recorded both in kilometers and in another unit. The numbers are not identical, but they say the same thing. In machine learning more features are generally helpful, but a feature that merely restates another adds nothing for the model, so we should avoid such repetitive information.

We could also describe each place by its latitude and longitude, but representing the distance between two places with latitudes and longitudes is clearly more troublesome than simply giving it in kilometers. So when selecting features we should also avoid complex information: the simpler the relationship between the features and the result, the more easily a model can learn it.
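Both points can be illustrated with a short sketch. The first part builds the same distances in two units and shows they are perfectly correlated, so the second column adds no information; using miles as the second unit is an assumption made for the example. The second part shows how much computation hides behind latitude and longitude: turning two coordinate pairs into a distance takes the haversine formula, whereas a distance in kilometers is usable as it is. The data is random and purely illustrative.

```python
import math

import numpy as np

rng = np.random.default_rng(0)

# Repetitive information: the same distances in kilometers and in miles.
distance_km = rng.uniform(1, 100, size=50)
distance_miles = distance_km / 1.609344  # 1 mile = 1.609344 km
corr = np.corrcoef(distance_km, distance_miles)[0, 1]
print(f"correlation between km and miles: {corr:.3f}")


# Complex information: recovering a distance from latitude and longitude.
def haversine_km(lat1, lon1, lat2, lon2, radius_km=6371.0):
    """Great-circle distance in km between two points given in degrees.

    Uses the haversine formula on a sphere with the mean Earth radius.
    """
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlambda = math.radians(lon2 - lon1)
    a = (
        math.sin(dphi / 2) ** 2
        + math.cos(phi1) * math.cos(phi2) * math.sin(dlambda / 2) ** 2
    )
    return 2 * radius_km * math.asin(math.sqrt(a))


# One degree of longitude along the equator.
print(f"{haversine_km(0.0, 0.0, 0.0, 1.0):.1f} km")
```

A model given four raw coordinates would have to learn something like haversine_km by itself, while a model given the distance directly can use it right away.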

So when choosing features, always keep these three points in mind: avoid meaningless information, repetitive information, and complex information.

OK, that's everything for this article. If you have better suggestions or ideas, please feel free to share them with me; I'd love to hear them.