Definition: The Gini index (Gini impurity) is the probability that a sample selected at random from the sample set is misclassified.

Note: The smaller the Gini index, the lower the probability that a sample selected from the set is misclassified, that is, the higher the purity of the set. Conversely, the larger the index, the more impure the set.

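The Gini index (Gini impurity) = sum over all classes of (probability that a sample of that class is selected * probability that it is misclassified). With p_k the proportion of samples in class k:

Gini = sum_k p_k * (1 - p_k) = 1 - sum_k p_k^2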

The Gini index has the same properties as information entropy: it measures the uncertainty of a random variable. The larger G is, the higher the uncertainty of the data; the smaller G is, the lower the uncertainty. When G = 0, all samples in the data set belong to the same category.
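The properties above can be checked with a minimal sketch that computes the index as 1 minus the sum of squared class proportions. The helper name `gini_from_labels` and the sample labels are assumptions introduced here for illustration; they are not part of the original code.

```python
from collections import Counter


def gini_from_labels(labels):
    """Gini index of a list of class labels: 1 - sum_k p_k^2."""
    total = len(labels)
    if total == 0:
        return 0.0
    counts = Counter(labels)  # label -> number of occurrences
    return 1.0 - sum((n / total) ** 2 for n in counts.values())


pure = gini_from_labels(["a", "a", "a", "a"])   # one class only
mixed = gini_from_labels(["a", "a", "b", "b"])  # two classes, evenly split
print(pure, mixed)
```

A set with a single class gives 0 (no uncertainty), while an even two-class split gives 0.5, the largest value possible with two classes.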

Code:

def cal_gini_index(data, label_len):
    """Print the label counts and the Gini index of each column.

    :param data: (list) data set, one tuple per row
    :param label_len: (int) number of labels per row in the data set
    :return: (list of float) Gini index of each column
    """
    total_sample = len(data)
    if total_sample == 0:
        return []

    ginis = []
    for a in range(label_len):
        # Count the occurrences of each distinct label in column a
        label_counts = label_uniq_cnt(data, a)
        print(label_counts)

        # Calculate the Gini index of the column: 1 - sum(count^2) / total^2
        gini = 0
        for label in label_counts:
            gini = gini + pow(label_counts[label], 2)

        gini = 1 - float(gini) / pow(total_sample, 2)
        print(gini)
        ginis.append(gini)
    return ginis


def label_uniq_cnt(data, a):
    """Count how many times each label occurs in column a.

    :param data: (list) raw data, one tuple per row
    :param a: (int) index of the column to count
    :return: label_uniq_cnts (dict) mapping each label to its count
    """

    label_uniq_cnts = {}
    for x in data:
        label = x[a]  # Label of this row in column a
        if label not in label_uniq_cnts:
            label_uniq_cnts[label] = 0
        label_uniq_cnts[label] += 1
    return label_uniq_cnts


if __name__ == '__main__':
    data = [
        ('with', '有', 'is'),
        ('with', '有', 'is'),
        ('with', 'no', 'no'),
        ('不用', '有', 'no'),
        ('不用', '有', 'no'),
    ]
    cal_gini_index(data, len(data[0]))


The results:

{'with': 3, '不用': 2}
0.48
{'有': 4, 'no': 1}
0.31999999999999995
{'is': 2, 'no': 3}
0.48


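Each value is the probability that a randomly chosen row is assigned the wrong label when labels are drawn at random according to that column's label distribution. The higher the probability, the more mixed the labels are, and the less reasonable the split of the data.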
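The three printed values can be verified by hand with exact fractions. The following is a small check sketch; the helper name `exact_gini` is hypothetical and only the label counts (3 and 2, 4 and 1, 2 and 3) come from the output above.

```python
from fractions import Fraction


def exact_gini(counts):
    """Exact Gini index from a list of per-class counts."""
    total = sum(counts)
    return 1 - sum(Fraction(n, total) ** 2 for n in counts)


first = exact_gini([3, 2])   # first column: 3 vs 2
second = exact_gini([4, 1])  # second column: 4 vs 1
third = exact_gini([2, 3])   # third column: 2 vs 3
print(first, second, third)
```

The exact results are 12/25 = 0.48, 8/25 = 0.32 and 12/25 = 0.48, so the printed 0.31999999999999995 is 0.32 up to floating-point rounding.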