AUC calculation and ROC curve plotting, GAUC calculation, and random negative sampling: Python scripts commonly used by algorithm engineers


As an algorithm engineer, being able to write Python scripts for everyday data processing should be a basic skill. The ideal relationship with a language is this: you know which approach to use for a task, and although you may not remember every implementation detail, the key points are clear in your mind, so with a quick Baidu search for syntax you can complete a self-contained function.

AUC is often regarded as the most commonly used evaluation metric for machine learning models. It is a global metric that reflects how well the model ranks the entire sample set.

However, the AUC metric is often distorted in practice. Ad ranking is personalized per user, and the ranking results of different users are not directly comparable, so the global AUC can be misleading in highly personalized scenarios.

In computational advertising, what we really want to measure is the model's ability to rank different ads for the same user, rather than its ability to rank ads across different users. This leads to GAUC (group AUC): the samples are grouped by user, an AUC is calculated over the ads shown to each user, and the per-user values are combined into a metric that reflects personalized ranking quality.
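To make the distortion concrete, here is a minimal sketch on made-up data. The helper name `pairwise_auc` and the toy rows are assumptions for illustration only; the helper uses the pairwise definition of AUC (the fraction of positive/negative pairs ranked correctly), not the trapezoid method used by the scripts in this article.

```python
from collections import defaultdict

def pairwise_auc(labels, scores):
    """AUC as the fraction of (positive, negative) pairs ranked correctly; ties count 0.5."""
    pos = [s for l, s in zip(labels, scores) if l == 1]
    neg = [s for l, s in zip(labels, scores) if l == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

# Hypothetical toy data: (user, label, score). Each user's clicked ad outranks
# that user's unclicked ad, but user "b" receives lower scores overall.
rows = [
    ("a", 1, 0.9), ("a", 0, 0.6),
    ("b", 1, 0.4), ("b", 0, 0.2),
]

# Global AUC mixes both users' samples into one ranking.
global_auc = pairwise_auc([r[1] for r in rows], [r[2] for r in rows])

# Group samples by user: user -> (labels, scores)
by_user = defaultdict(lambda: ([], []))
for user, label, score in rows:
    by_user[user][0].append(label)
    by_user[user][1].append(score)

# Sample-count-weighted average of the per-user AUC values.
total = sum(len(ls) for ls, _ in by_user.values())
gauc = sum(len(ls) * pairwise_auc(ls, ss) for ls, ss in by_user.values()) / total

print(global_auc, gauc)  # 0.75 1.0
```

Every user is ranked perfectly here, yet the global AUC is only 0.75, because user "b"'s clicked ad scores below user "a"'s unclicked ad. The grouped metric is not affected by that cross-user comparison.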


Below are a few small Python scripts with detailed comments that can be pasted and used as-is, covering:

(1) AUC calculation and the corresponding ROC curve plotting method

(2) GAUC calculation method

(3) Random negative sampling

1. AUC calculation and the corresponding ROC curve plotting method

As we all know, AUC (Area Under Curve) is defined as the area enclosed by the ROC curve and the coordinate axes. The horizontal axis of the ROC curve is the false positive rate (FPR) and the vertical axis is the true positive rate (TPR). The concepts of true and false positives are not repeated here.

A typical ROC curve runs from (0, 0) to (1, 1) inside the unit square, so this area obviously cannot be greater than 1.

A feasible way to calculate AUC borrows the idea of integration: cut the region between the ROC curve and the horizontal axis into small right trapezoids.

Each small right trapezoid has its height along the X axis: its left and right x-coordinates are FP_pre and FP, so the height is (FP - FP_pre). Its two parallel sides (the upper and lower bases) are vertical, with lengths TP_pre and TP on the Y axis.

We learned in primary school that the area of a trapezoid is:

Area = (upper base + lower base) * height / 2

That is:

Auc = Sum{ (TP_pre + TP) * (FP - FP_pre) / 2 }

Adding up all the small trapezoid areas gives the area of the whole region, which is our AUC.
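The trapezoid sum can be checked by hand on a tiny case. The sketch below uses a made-up ROC path with 2 positives and 2 negatives, measured in raw counts and then rescaled to rates at the end; the `points` list is a hypothetical example, not data from this article.

```python
# Hypothetical ROC path in raw counts (FP, TP) for 2 positives and 2 negatives,
# visited in descending score order: positive, negative, positive, negative.
points = [(0, 0), (0, 1), (1, 1), (1, 2), (2, 2)]

area = 0.0
for (fp_pre, tp_pre), (fp, tp) in zip(points, points[1:]):
    # (upper base + lower base) * height / 2; vertical steps have zero height
    area += (tp_pre + tp) * (fp - fp_pre) / 2.0

P, N = 2, 2            # total numbers of positive and negative samples
auc = area / (P * N)   # rescale both axes from counts to rates
print(auc)             # 0.75
```

Only the two horizontal steps contribute area (1 and 2 count-units), giving 3 / (2 * 2) = 0.75.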

Note: since calculating AUC requires TPR and FPR, we need to know whether each sample is positive or negative, so both the sample label and the predicted score must be passed in.

Talk is cheap, show me the code!


#!/usr/bin/env python3
# Welcome to follow the author's official account: Algorithm Full Stack Road
# filename: calAuc.py

import sys


def calAUC(labels, probs):
    # Sort sample indices by predicted score, from high to low.
    i_sorted = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)

    auc_sum = 0.0  # running sum of the small trapezoid areas
    TP = 0.0       # positive count at the current point of the ROC curve (vertical axis)
    TP_pre = 0.0   # positive count at the previous point
    FP = 0.0       # negative count at the current point (horizontal axis)
    FP_pre = 0.0   # negative count at the previous point

    # Start from a value no sample can have, so the first sample opens a new segment.
    last_prob = probs[i_sorted[0]] + 1.0
    for i in range(len(probs)):
        # Walk up and to the right from the origin, one sample at a time.
        # Samples with the same prob are merged into one segment, so each
        # distinct prob contributes its area only once.
        if last_prob != probs[i_sorted[i]]:
            # (upper base + lower base) * height / 2
            auc_sum += (TP + TP_pre) * (FP - FP_pre) / 2.0
            TP_pre = TP  # up
            FP_pre = FP  # to the right
            last_prob = probs[i_sorted[i]]

        if labels[i_sorted[i]] == 1:
            # Positive sample: take one step up.
            TP = TP + 1
        else:
            # Negative sample: take one step to the right.
            FP = FP + 1

    # The loop has ended; add the last small trapezoid.
    auc_sum += (TP + TP_pre) * (FP - FP_pre) / 2.0
    # Everything above is measured in sample counts, so both axes must be rescaled:
    # one unit is 1/TP on the vertical axis and 1/FP on the horizontal axis, where
    # TP and FP now hold the total numbers of positive and negative samples.
    # The two unit lengths generally differ because TP and FP are not equal.
    # Raises ZeroDivisionError when only one class is present.
    auc = auc_sum / (TP * FP)
    return auc


def read_Auc_file():
    labels = []
    probs = []
    for line in sys.stdin:
        # Each input line is "label<TAB>prob".
        sp_line = line.strip().split("\t")
        labels.append(int(sp_line[0]))
        # Clamp very small scores to 1e-8.
        if float(sp_line[1]) < 1e-8:
            probs.append(1e-8)
        else:
            probs.append(float(sp_line[1]))
    return labels, probs


if __name__ == "__main__":
    labels, probs = read_Auc_file()
    auc = calAUC(labels, probs)
    print("AUC:", auc)


Note: the samples are sorted by prob from high to low, which is the same order used when drawing the ROC curve.

Picking up from above: the ROC curve is drawn in the same way as the AUC is calculated, so here is the corresponding ROC curve plotting method.

The ROC curve is drawn as follows:

(1) Count the positive and negative samples from the sample labels; call the number of positive samples P and the number of negative samples N.

(2) Set the tick interval of the horizontal axis to 1/N and that of the vertical axis to 1/P.

(3) Sort the samples by the prediction probability output by the model, from high to low.

(4) Traverse the samples in order and draw the ROC curve starting from the origin. For each positive sample, draw one tick interval along the vertical axis; for each negative sample, draw one tick interval along the horizontal axis.

(5) After all samples have been traversed, the curve ends at (1, 1), and the whole ROC curve is drawn.
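Steps (1) to (5) can be sketched as follows. The function names `roc_points` and `plot_roc` are hypothetical, and `plot_roc` assumes matplotlib is installed. Unlike `calAUC`, this sketch takes one step per sample and does not merge samples that share the same score.

```python
def roc_points(labels, probs):
    """Return the ROC curve as a list of (FPR, TPR) points, one per sample."""
    # Step (1): count positives and negatives.
    P = sum(1 for label in labels if label == 1)
    N = len(labels) - P
    # Step (3): sort sample indices by predicted probability, high to low.
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    tp, fp = 0, 0
    points = [(0.0, 0.0)]  # step (4): start from the origin
    for i in order:
        if labels[i] == 1:
            tp += 1        # positive: one tick (1/P) up
        else:
            fp += 1        # negative: one tick (1/N) to the right
        # Step (2): dividing by N and P applies the tick intervals 1/N and 1/P.
        points.append((fp / N, tp / P))
    return points          # step (5): the last point is (1.0, 1.0)


def plot_roc(points):
    # Assumption: matplotlib is available; imported here so roc_points works without it.
    import matplotlib.pyplot as plt
    xs, ys = zip(*points)
    plt.plot(xs, ys)
    plt.plot([0, 1], [0, 1], linestyle="--")  # diagonal of a random ranker
    plt.xlabel("FPR")
    plt.ylabel("TPR")
    plt.show()


points = roc_points([1, 0, 1, 0], [0.9, 0.6, 0.4, 0.2])
print(points)
```

Calling `plot_roc(points)` would then draw the step-shaped curve.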

Both the AUC calculation and the ROC curve plotting method above take the sample label and the probs score predicted by the model as input, and the columns in the source data file are separated by "\t".


2. GAUC calculation

The introduction described the scenarios where AUC and GAUC apply. In many scenarios GAUC is a very effective offline metric for machine learning models.

The GAuc calculation script is shown below.


#!/usr/bin/env python3
# filename: calGAuc.py

import sys

# The AUC function from section 1; assumes calAuc.py is in the same directory.
from calAuc import calAUC


def calGauc(group_list, score_list, label_list):
    # Dict holding all data: group -> (scores, labels)
    all_data = {}
    all_auc = 0.0
    # Group all samples by the group attribute, which is the IMEI here,
    # so that each user's samples form their own group.
    for i in range(len(group_list)):
        if group_list[i] in all_data:
            all_data[group_list[i]][0].append(score_list[i])
            all_data[group_list[i]][1].append(label_list[i])
        else:
            all_data[group_list[i]] = ([score_list[i]], [label_list[i]])

    # Number of samples that contribute to the result
    all_size = 0
    # Calculate the AUC separately for each user's samples.
    for imei, value in all_data.items():
        score = value[0]
        label = value[1]
        try:
            auc = calAUC(label, score)
            # Weight each user's AUC by that user's number of sample rows.
            all_size += len(score)
            all_auc += len(score) * auc
        except ZeroDivisionError:
            # Users with only positive or only negative samples have no AUC; skip them.
            pass

    # Divide the weighted sum by the total sample count to get the overall GAUC.
    return all_auc / all_size


def read_Gauc_file():
    labels = []
    probs = []
    groups = []
    for line in sys.stdin:
        # Each input line is "label prob imei"; skip malformed lines.
        sp_line = line.split()
        if len(sp_line) != 3:
            continue
        labels.append(int(sp_line[0]))
        if float(sp_line[1]) < 1e-8:
            probs.append(1e-8)
        else:
            probs.append(float(sp_line[1]))
        # The third column (IMEI) is the group key.
        groups.append(sp_line[2])
    return labels, probs, groups


if __name__ == "__main__":
    labels, probs, groups = read_Gauc_file()
    gauc = calGauc(groups, probs, labels)
    print("GAUC:", gauc)

The code above calls the AUC function from section 1. The input source data has one extra column holding the IMEI values.

As the code shows, the GAUC logic groups the samples by IMEI. Each user's (IMEI's) AUC is weighted by that user's number of samples, and the weighted sum is divided by the total number of samples to obtain an average AUC. In principle this is more reasonable than the global AUC in personalized scenarios, because ranking quality is evaluated within each user.
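Written as a formula, the weighted average computed by calGauc looks roughly like this. The symbols U, n_u and AUC_u are notation introduced here for illustration and do not appear in the script.

```latex
\mathrm{GAUC} = \frac{\sum_{u \in U} n_u \cdot \mathrm{AUC}_u}{\sum_{u \in U} n_u}
```

Here U is the set of users (IMEI groups) whose samples contain both a positive and a negative label, n_u is the number of samples of user u, and AUC_u is the AUC calculated on that user's samples alone. Users with only one class make calAUC divide by zero, so the try/except skips them and they count in neither the numerator nor the denominator.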


3. Random negative sampling of samples with a Python script

The article "Enterprise Machine Learning Pipeline - Sample Processing" introduced random negative sampling of samples with Spark. Besides that, we can also write a separate script that performs negative sampling while keeping the sample data intact. Python is used here to implement this function.

The code is as follows:


# filename: sample.py

import random
import sys

# Negative sampling rate; 1.0 keeps every negative sample.
ns_sr = 1.0

if len(sys.argv) == 2:
    ns_sr = float(sys.argv[1])

for line in sys.stdin:
    try:
        ss = line.strip().split("\t")
        label = ss[0]
        value = ss[1]

        # Check that the label is one of the expected values; drop the line otherwise.
        if "0" != label and "1" != label:
            continue

        # Keep only a fraction ns_sr of the negative samples; drop the rest.
        if "1" != label and random.random() > ns_sr:
            continue

        # Samples that were not filtered out are passed through unchanged.
        print(label + "\t" + value)
    except Exception:
        continue

The above script can be used with the following command:

hadoop fs -text /user/app/samples/20210701/* | python sample.py 0.2 > sampled_data.txt

This sets the negative sampling rate to 0.2. For details on calibration after sampling, follow the author's official account and see the "Pipeline - Sample Processing" article.
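For orientation only, here is a sketch of one commonly used correction for scores from a model trained on negatively down-sampled data. This is an assumption on my part and not necessarily the calibration described in the referenced article; the function name `recalibrate` is hypothetical. Keeping negatives at rate ns_sr divides the negative count by 1/ns_sr, so the odds of a positive are inflated by that factor, and the formula below undoes it.

```python
def recalibrate(p, ns_sr):
    """Map a score p from a model trained with negative sampling rate ns_sr
    back to the scale of the unsampled data (assumed correction, see lead-in)."""
    return p / (p + (1.0 - p) / ns_sr)


# With ns_sr = 0.2, a predicted 0.5 corresponds to about 0.167 on the original data.
print(recalibrate(0.5, 0.2))
# With ns_sr = 1.0 (no sampling) the score is left essentially unchanged.
print(recalibrate(0.3, 1.0))
```

The correction is monotonic, so it changes the predicted values but not the ranking; AUC and GAUC are unaffected by it.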


That concludes the introduction to AUC calculation and the corresponding ROC curve plotting method, GAUC calculation, and random negative sampling of samples.
