Preface

The previous article introduced data mining in general and showed some of its appeal. As beginners, though, we are still a long way from doing real data mining, so we should proceed one step at a time. If the last article was about looking up at the stars, from this article on we keep our heads down and move forward. Today's topic is understanding data. We live in an era of data explosion, but do we really understand data? There seems to be a veil between us and the data: we can clearly feel that it exists, but we cannot see what it looks like. Let us lift that veil.

Attributes of data

An attribute is a data field that represents a characteristic of a data object, such as gender or age. Readers with experience in object-oriented programming can think of a record as an object, and of the attributes of the data as the attributes of that object. The classification of attributes is shown in the figure below.


Nominal attributes

The values of a nominal attribute are symbols or names of things. Each value represents a state or category, so nominal attributes are also called categorical attributes.

For example, hair color is an attribute that describes a person. Its values can be black, yellow, gold, and so on, so hair color is a nominal attribute.

Binary attributes

A binary attribute is a special nominal attribute that has only two categories or states.

Binary attributes can be symmetric or asymmetric. A binary attribute is symmetric when its two states are equally valuable and carry the same weight, such as gender. It is asymmetric when the two states are not equally important. For example, for an HIV test with a positive or negative result, the positive result is usually rarer and more important.

Ordinal attributes

Ordinal attributes are attributes whose values have a meaningful order, although the size of the gap between adjacent values is not known. For example, at Starbucks the cup sizes are tall, grande, and venti.
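To make the idea of "ordered but not measured" concrete, here is a minimal Python sketch. The `CupSize` class and the integer codes 1 to 3 are assumptions made for illustration. The codes only record the order of the sizes, not how much bigger one size is than another.

```python
from enum import IntEnum


class CupSize(IntEnum):
    # Hypothetical encoding: the integers only capture the order of the
    # categories; the gap between codes has no meaning.
    TALL = 1
    GRANDE = 2
    VENTI = 3


orders = [CupSize.VENTI, CupSize.TALL, CupSize.GRANDE, CupSize.TALL]

# Sorting and taking the maximum make sense for an ordinal attribute.
sorted_orders = sorted(orders)
largest = max(orders)

print([size.name for size in sorted_orders])
print(largest.name)
```

Comparisons such as `CupSize.TALL < CupSize.GRANDE` are valid here, but arithmetic on the codes (for example averaging them) would not have a clear meaning.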

Numerical attributes

A numeric attribute is quantitative: it is a measurable quantity, expressed as an integer or real value.

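Numeric attributes can be measured on an interval scale or a ratio scale. Temperature in degrees Celsius, for example, is on an interval scale: with 20 degrees today and 15 degrees tomorrow, we can say that today is 5 degrees warmer, but the zero point is arbitrary, so ratios of values are not meaningful. A ratio scale has a true zero point, so one value can be described as a multiple of another.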
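The difference between the two scales can be sketched with the temperatures from the text. The example assumes the temperatures are in degrees Celsius. The helper `celsius_to_kelvin` is introduced here only for illustration.

```python
def celsius_to_kelvin(celsius):
    """Convert Celsius (interval scale) to Kelvin (ratio scale, true zero)."""
    return celsius + 273.15


today_c, tomorrow_c = 20.0, 15.0

# Differences are meaningful on an interval scale.
difference = today_c - tomorrow_c

# A ratio of Celsius values is arithmetic on an arbitrary zero point,
# so it does not mean "today is 1.33 times as hot".
naive_ratio = today_c / tomorrow_c

# On the Kelvin scale, which has a true zero, the ratio is meaningful.
kelvin_ratio = celsius_to_kelvin(today_c) / celsius_to_kelvin(tomorrow_c)

print(difference, round(naive_ratio, 3), round(kelvin_ratio, 3))
```

The Kelvin ratio comes out close to 1, which matches intuition much better than the naive Celsius ratio.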

Discrete and continuous attributes

In the machine learning field, attributes are usually classified as discrete or continuous.

Measures of data

With the classification of data attributes covered, let’s look at measures of data, which describe certain characteristics of the data.

Measures of central tendency

A measure of central tendency describes the middle or center of the data: simply put, where most of the values of a given attribute fall. The measures of central tendency include the mean, median, mode, and midrange.

The mean

You have probably known the idea of the mean since you were a kid. There is also the weighted mean, in which every value has a weight, and the trimmed mean, which drops the extreme values at the high and low ends (for example, the maximum and the minimum) before computing the mean.
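The three kinds of mean can be sketched in a few lines of Python. The function names `weighted_mean` and `trimmed_mean`, the parameter `k`, and the sample values are all assumptions made up for this illustration.

```python
from statistics import mean


def weighted_mean(values, weights):
    """Sum of value * weight divided by the sum of the weights."""
    return sum(v * w for v, w in zip(values, weights)) / sum(weights)


def trimmed_mean(values, k=1):
    """Drop the k smallest and k largest values, then average the rest."""
    ordered = sorted(values)
    if 2 * k >= len(ordered):
        raise ValueError("k is too large for this data set")
    return mean(ordered[k:len(ordered) - k])


# Made-up sample data with one unusually large value (110).
data = [30, 36, 47, 50, 52, 52, 56, 60, 63, 70, 70, 110]

plain = mean(data)               # every value counts equally
trimmed = trimmed_mean(data)     # maximum and minimum removed first
weighted = weighted_mean([80, 90], [1, 3])  # second value counts 3 times

print(plain, trimmed, weighted)
```

The trimmed mean is lower than the plain mean here because removing 110 takes away the pull of the single large value.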

The median

The median is the middle value of the sorted data, or the mean of the two middle values when the number of values is even. Sorting is expensive when the data set is large, so there is a method for approximating the median instead. Divide the data into intervals and record the frequency of each interval. The median is then approximated by the following formula, where L1 is the lower bound of the interval containing the median, N is the number of values in the whole data set, freq1 is the frequency of the interval containing the median, freq2 is the sum of the frequencies of all intervals below the median interval, and width is the width of the median interval.

Median = L1 + [(N/2 - freq2) / freq1] × width
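The interval-based estimate above can be sketched as follows. The function name `approximate_median`, the `(lower, upper, frequency)` tuple layout, and the sample intervals are assumptions for illustration. The intervals are assumed to be sorted and contiguous.

```python
def approximate_median(intervals):
    """Estimate the median from grouped data.

    intervals: list of (lower, upper, frequency) tuples, sorted by lower
    bound and contiguous. This layout is an assumption of this sketch.
    """
    n = sum(freq for _, _, freq in intervals)
    if n == 0:
        raise ValueError("no data")

    below = 0  # sum of frequencies of intervals below the current one
    for lower, upper, freq in intervals:
        if below + freq >= n / 2:
            # This is the median interval: interpolate inside it.
            width = upper - lower
            return lower + (n / 2 - below) / freq * width
        below += freq


# Made-up grouped data: 5 values in [0, 10), 10 in [10, 20), 5 in [20, 30).
groups = [(0, 10, 5), (10, 20, 10), (20, 30, 5)]
estimate = approximate_median(groups)
print(estimate)  # 15.0
```

The estimate only needs one pass over the interval counts, so no sorting of the individual values is required.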

Mode and midrange

The mode is the value that occurs most often, so I won’t say much about it. A data set with one mode is called unimodal, and a data set with several modes is called multimodal. The midrange is the average of the maximum and minimum values in the data.
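A short Python sketch of both measures, using `statistics.multimode` from the standard library (Python 3.8 or later). The helper `midrange` and the sample values are made up for illustration.

```python
from statistics import multimode


def midrange(values):
    """Average of the smallest and largest value."""
    return (min(values) + max(values)) / 2


# Made-up sample data in which 52 and 70 both occur twice.
data = [30, 36, 47, 50, 52, 52, 56, 60, 63, 70, 70, 110]

modes = multimode(data)   # all values that share the highest frequency
mid = midrange(data)

print(modes)  # [52, 70]
print(mid)    # 70.0
```

Because two values tie for the highest frequency, this sample has two modes.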

Measures of data dispersion


Range

The range is the difference between the maximum and the minimum of a data set.

Quartiles and interquartile range

The quartiles are three values. If you sort the data set, the value at the 25th percentile is the first quartile Q1, the value at the 50th percentile is the second quartile Q2 (the median), and the value at the 75th percentile is the third quartile Q3.

For example, after 12 values are sorted, the third, sixth, and ninth values can be taken as Q1, Q2, and Q3 of the data set, respectively.

The interquartile range is Q3 - Q1, denoted IQR. A common rule for detecting outliers is to flag values that lie at least 1.5 × IQR above the third quartile or below the first quartile. A box plot is a popular way to display a distribution, structured as follows. It is built from the quartiles and shows the spread of the data nicely.
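The 1.5 × IQR rule can be sketched with `statistics.quantiles` from the Python standard library. Note that this function interpolates between neighboring values, so its quartiles can differ slightly from the simple "third, sixth, ninth value" convention used in the example above. The sample values and the names `lower_fence` and `upper_fence` are assumptions for illustration.

```python
from statistics import quantiles

# Made-up sample data with one suspiciously large value.
data = [30, 36, 47, 50, 52, 52, 56, 60, 63, 70, 70, 110]

# n=4 splits the data into four parts and returns the three cut points.
q1, q2, q3 = quantiles(data, n=4)
iqr = q3 - q1

# Values at least 1.5 * IQR beyond the quartiles are flagged as outliers.
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr
outliers = [x for x in data if x <= lower_fence or x >= upper_fence]

print(q1, q2, q3, iqr)
print(outliers)  # [110]
```

The same five numbers that a box plot draws (minimum, Q1, median, Q3, maximum) can be read from `min(data)`, `q1`, `q2`, `q3`, and `max(data)`.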

Variance and standard deviation

The larger the variance and standard deviation, the more spread out the data. The smaller they are, the more concentrated the data is around the mean.
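As a small illustration, the two made-up data sets below have the same mean but very different spread. The sketch uses `pvariance` and `pstdev` from the Python standard library, which treat the data as a whole population. For a sample, `variance` and `stdev` would be used instead.

```python
from statistics import mean, pstdev, pvariance

# Two made-up data sets with the same mean (50).
tight = [48, 49, 50, 51, 52]
wide = [10, 30, 50, 70, 90]

for name, values in (("tight", tight), ("wide", wide)):
    # Variance is the average squared distance from the mean;
    # the standard deviation is its square root.
    print(name, mean(values), pvariance(values), round(pstdev(values), 3))
```

The `wide` data set has a much larger variance and standard deviation even though both means are equal.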

Finally

A like is the biggest support. For more articles and information, follow the WeChat official account QStack.