In parametric estimation, the sample data are assumed to come from a population with a known form of probability density function, whereas in non-parametric estimation the probability distribution of the sample data is unknown. In that case, to model the sample data we need to estimate its probability density function, and kernel density estimation is one method for doing so.

Introduction

In statistics, kernel density estimation is used to infer the distribution of the population from a finite number of samples. The result of kernel density estimation is therefore an estimate of the probability density function of the sample. From the estimated probability density function we can obtain some properties of the data distribution, such as where the data cluster.

Start with the histogram

The histogram was proposed by Karl Pearson to represent the distribution of sample data and to help analyze the mode and median of sample data. The horizontal axis represents the value range of variables, and the vertical axis represents the ratio between the frequency of data occurrence and the length of the interval.

The U.S. Census Bureau surveyed about 124 million people on their commute times to work; the data are shown below (frequencies are in thousands of people):

| Interval start (min) | Interval width (min) | Frequency (thousands) | Frequency / width | Frequency / width / total |
| --- | --- | --- | --- | --- |
| 0 | 5 | 4180 | 836 | 0.0067 |
| 5 | 5 | 13687 | 2737 | 0.0221 |
| 10 | 5 | 18618 | 3723 | 0.03 |
| 15 | 5 | 19634 | 3926 | 0.0316 |
| 20 | 5 | 17981 | 3596 | 0.029 |
| 25 | 5 | 7190 | 1438 | 0.0116 |
| 30 | 5 | 16369 | 3273 | 0.0264 |
| 35 | 5 | 3212 | 642 | 0.0052 |
| 40 | 5 | 4122 | 824 | 0.0066 |
| 45 | 15 | 9200 | 613 | 0.0049 |
| 60 | 30 | 6461 | 215 | 0.0017 |
| 90 | 60 | 3435 | 57 | 0.0005 |

Data visualization using histograms is as follows:

This histogram uses the number of people per unit interval (frequency / interval width) as the height of each rectangle, so the area of each rectangle represents the number of people in that interval, and the total area of all rectangles represents the total number of people surveyed, about 124 million.

When frequency / interval width / total is used instead as the height of each rectangle, the data visualization is as follows:

In this case, the area of each rectangle represents the proportion of the data falling in that interval, and the total area of the rectangles is 1. Such a histogram is also known as a frequency histogram.

The frequency histogram has the following characteristics:

  1. The area of each rectangle is the frequency (proportion) of the corresponding interval;
  2. The height of each rectangle is the average frequency density of the corresponding interval.
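
To make the two rectangle heights concrete, here is a minimal JavaScript sketch (not from the original article) that recomputes the last two columns of the table above and checks that the relative-frequency areas sum to 1; the frequencies are assumed to be in thousands of people.

// Commute-time bins copied from the table above; freq assumed to be in thousands.
const bins = [
  { start: 0, width: 5, freq: 4180 },
  { start: 5, width: 5, freq: 13687 },
  { start: 10, width: 5, freq: 18618 },
  { start: 15, width: 5, freq: 19634 },
  { start: 20, width: 5, freq: 17981 },
  { start: 25, width: 5, freq: 7190 },
  { start: 30, width: 5, freq: 16369 },
  { start: 35, width: 5, freq: 3212 },
  { start: 40, width: 5, freq: 4122 },
  { start: 45, width: 15, freq: 9200 },
  { start: 60, width: 30, freq: 6461 },
  { start: 90, width: 60, freq: 3435 },
];

const total = bins.reduce((sum, b) => sum + b.freq, 0); // 124089 (thousands)

bins.forEach((b) => {
  const density = b.freq / b.width;       // frequency / interval width
  const relDensity = density / total;     // frequency / interval width / total
  console.log(b.start, density, relDensity.toFixed(4));
});

// The rectangles of the frequency histogram have total area 1:
const area = bins.reduce((sum, b) => sum + (b.freq / b.width / total) * b.width, 0);
console.log(area); // 1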

Probability density function

Limiting idea: using the idea of differentials, we reduce the interval width of the frequency histogram step by step. As the interval width decreases, the rectangles become narrower and narrower, and in the limit the frequency histogram turns into a curve: the probability density curve.

For the probability density curve, the probability that the random variable falls in a given interval is the integral of the probability density function over that interval (see probability density function), namely: $P(a < x \leq b) = \int_a^b f(x)dx$.

Denoting the cumulative distribution function by $F(x)$, the definition above gives $F(x) = \int_{-\infty}^{x} f(x)dx$.

According to the idea of differentiation, then:


$$f(x_0) = F'(x_0) = \lim_{h \to 0}\frac{F(x_0+h)-F(x_0-h)}{2h}$$

Kernel density estimation

From the analysis above, it should be clear that the purpose of kernel density estimation is to estimate the probability density function of the given sample data.

The formulas

Consider one-dimensional data consisting of $n$ samples: $x_1, x_2, x_3, \ldots, x_n$.

Assume the cumulative distribution function of the sample data is $F(x)$ and the probability density function is $f(x)$. Then:


$$P(x_{i-1} < x < x_i) = \int_{x_{i-1}}^{x_i} f(x)dx$$


$$f(x_i) = \lim_{h \to 0}\frac{F(x_i+h)-F(x_i-h)}{2h}$$

The empirical distribution function is introduced to approximate the cumulative distribution function:


$$F_n(t) = \frac{1}{n}\sum_{i=1}^{n} 1_{x_i \leq t}$$

That is, $P(x \leq t)$ is approximated by the proportion of the $n$ observations with $x_i \leq t$.
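
As an illustration only, here is a minimal JavaScript sketch of the empirical distribution function; the sample values are made up.

// F_n(t) = (number of samples with x_i <= t) / n
function empiricalCDF(samples) {
  const n = samples.length;
  return (t) => samples.filter((x) => x <= t).length / n;
}

// Toy usage with an assumed sample:
const Fn = empiricalCDF([1, 2, 2, 3, 5]);
console.log(Fn(2));   // 0.6
console.log(Fn(4.9)); // 0.8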

Substituting this function into the expression for $f(x_i)$:


$$f(x_i) = \lim_{h \to 0}\frac{1}{2nh}\sum_{j=1}^{n} 1_{x_i-h \leq x_j \leq x_i+h}$$

In practice a value of $h$ must be chosen, and it should be neither too large nor too small: too large violates the condition $h \to 0$, while too small leaves too few sample points in the window, so the error becomes very large. The choice of $h$, known as the bandwidth in kernel density estimation, has therefore been studied extensively.

Once the bandwidth is determined, we can write the expression for $f(x)$:


$$f(x) = \frac{1}{2nh}\sum_{i=1}^{n} 1_{x-h \leq x_i \leq x+h}$$
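
A minimal JavaScript sketch of this estimator; the toy sample and bandwidth are assumptions for illustration, not data from the article.

// f(x) = (1 / 2nh) * #{ i : x - h <= x_i <= x + h }
function naiveDensity(samples, h) {
  const n = samples.length;
  return (x) =>
    samples.filter((xi) => xi >= x - h && xi <= x + h).length / (2 * n * h);
}

// Toy usage:
const f = naiveDensity([1, 2, 2, 3, 5], 1);
console.log(f(2)); // 0.4: four points fall in [1, 3], so 4 / (2 * 5 * 1)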

Kernel function

The expression for $f(x)$ can be rewritten as:


$$f(x) = \frac{1}{2nh}\sum_{i=1}^{n} 1_{x-h \leq x_i \leq x+h} = \frac{1}{2nh}\sum_{i=1}^{n} K\left(\frac{|x-x_i|}{h}\right)$$

where $t = \frac{|x-x_i|}{h}$ and $K(t) = 1$ when $0 \leq t \leq 1$.

And:


$$\int f(x)dx = \int \frac{1}{2nh}\sum_{i=1}^{n} K\left(\frac{|x-x_i|}{h}\right)dx = \int \frac{1}{2n}\sum_{i=1}^{n} K(t)dt = \int \frac{1}{2} K(t)dt$$

Note that $\sum_{i=1}^{n}$ here runs over the $n$ sample points, not over an accumulating quantity: each term contributes the same integral, so the sum evaluates to $n$ times it, cancelling the $\frac{1}{n}$.

Let $K_0(t) = \frac{1}{2}K(t)$. By the definition of a probability density function, we have:


$$\int K_0(t)dt = 1$$

where $K_0(t) = \frac{1}{2}$ when $0 \leq t \leq 1$.

In this case, $K_0(t)$ is called the kernel function. Common kernel functions include: uniform, biweight, triweight, Epanechnikov, normal (Gaussian), and so on.
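
For reference, here is a sketch of a few of these kernels in JavaScript, written in the standard symmetric form on [-1, 1]; the normalizing constants are the textbook values and are not taken from this article.

// Each kernel integrates to 1 over its support.
const kernels = {
  // Uniform: 1/2 on [-1, 1]
  uniform: (t) => (Math.abs(t) <= 1 ? 0.5 : 0),
  // Epanechnikov: (3/4)(1 - t^2) on [-1, 1]
  epanechnikov: (t) => (Math.abs(t) <= 1 ? 0.75 * (1 - t * t) : 0),
  // Normal (Gaussian): unbounded support
  gaussian: (t) => Math.exp(-0.5 * t * t) / Math.sqrt(2 * Math.PI),
};

console.log(kernels.uniform(0.5));           // 0.5
console.log(kernels.epanechnikov(0));        // 0.75
console.log(kernels.gaussian(0).toFixed(4)); // 0.3989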

Then $f(x)$ becomes:


$$f(x) = \frac{1}{nh}\sum_{i=1}^{n} K_0\left(\frac{|x-x_i|}{h}\right)$$
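
A minimal JavaScript sketch of this one-dimensional estimator with the kernel passed in as a parameter; the Epanechnikov kernel, the toy data and the bandwidth are assumptions for illustration.

// f(x) = (1 / nh) * sum K0(|x - x_i| / h)
function kde1d(samples, h, K0) {
  const n = samples.length;
  return (x) =>
    samples.reduce((sum, xi) => sum + K0(Math.abs(x - xi) / h), 0) / (n * h);
}

// Toy usage with the Epanechnikov kernel:
const epanechnikov = (t) => (Math.abs(t) <= 1 ? 0.75 * (1 - t * t) : 0);
const density = kde1d([1, 2, 2, 3, 5], 1, epanechnikov);
console.log(density(2)); // 0.3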

For two-dimensional data, $f(x, y)$ is:


$$f(x, y) = \frac{1}{nh^2}\sum_{i=1}^{n} K_0\left(\frac{dist((x, y), (x_i, y_i))}{h}\right)$$

Experiment: kernel density analysis of POI points

Technology selection

  • Raster data visualization: Canvas
  • KFC POI data crawling: POIKit

Kernel function and bandwidth selection

We use the bandwidth selection scheme described in the ArcGIS documentation; the kernel function is:


$$K_0(t) = \frac{3}{\pi}(1-t^2)^2$$

The probability density estimation function is:

$$f(x, y) = \frac{1}{n h^2}\sum_{i=1}^{n} pop_i\, K_0\left(\frac{dist_i}{h}\right)$$

where $pop_i$ is the value of the given weight field for point $i$ (1 if the data does not contain such a field) and $n$ is the number of POI points.
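
The experiment code later in this article effectively uses $pop_i = 1$, since the crawled KFC points carry no weight field. If a weight field were present, each point's contribution would simply be scaled by it, roughly as in this sketch; the field name pop, the Euclidean distance and the toy data are assumptions.

// Hypothetical weighted variant: `pop` is an assumed per-point weight field,
// defaulting to 1 when absent; K0 is the quartic kernel above.
function weightedDensityAt(points, h, x, y) {
  const K0 = (t) => (3 / Math.PI) * Math.pow(1 - t * t, 2);
  let sum = 0;
  for (const p of points) {
    const dist = Math.hypot(x - p.x, y - p.y); // Euclidean distance (assumption)
    if (dist < h) {
      const pop = p.pop !== undefined ? p.pop : 1;
      sum += pop * K0(dist / h);
    }
  }
  return sum / (points.length * h * h);
}

// Toy usage with made-up points and weights:
console.log(weightedDensityAt([{ x: 0, y: 0, pop: 2 }, { x: 3, y: 4 }], 10, 1, 1));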

The bandwidth is:


$$h = 0.9 \cdot \min\left(A,\ \sqrt{\frac{1}{\ln(2)}} \cdot D_m\right) \cdot n^{-0.2}$$

Parameter Description:

  • Mean center: the mean center of the $n$ POI points, that is, the longitudes and latitudes averaged separately;

  • Weighted mean center: the weighted mean center of the $n$ POI points, that is, the longitudes and latitudes are multiplied by the weights and then averaged;

  • Standard distance calculation formula:


    $$SD = \sqrt{\frac{\sum_{i=1}^n(x_i-\bar X)^2}{n}+\frac{\sum_{i=1}^n(y_i-\bar Y)^2}{n}+\frac{\sum_{i=1}^n(z_i-\bar Z)^2}{n}}$$

    Among them:

    • $x_i, y_i, z_i$ are the coordinates of POI point $i$
    • $\bar X, \bar Y, \bar Z$ is the mean center
    • $n$ is the total number of POI points
  • Weighted standard distance calculation formula:


    $$SD_w = \sqrt{\frac{\sum_{i=1}^n w_i(x_i-\bar X_w)^2}{\sum_{i=1}^{n}w_i}+\frac{\sum_{i=1}^n w_i(y_i-\bar Y_w)^2}{\sum_{i=1}^{n}w_i}+\frac{\sum_{i=1}^n w_i(z_i-\bar Z_w)^2}{\sum_{i=1}^{n}w_i}}$$

    Among them:

    • $w_i$ is the weight of point $i$
    • $\bar X_w, \bar Y_w, \bar Z_w$ is the weighted mean center
    • $n$ is the total number of POI points
  • If the POI data does not contain a weight field, then $D_m$ is the median distance from the mean center, $n$ is the number of POI points, and $A$ is the standard distance $SD$.

  • If the POI data contains a weight field, then $D_m$ is the median distance from the weighted mean center, $n$ is the sum of the weight field values, and $A$ is the weighted standard distance $SD_w$.

Data crawl

Using POIKit, crawl the KFC POI data of Henan Province with the following parameters:

A total of 219 pieces of data were obtained by crawling, as shown in the figure below:

Writing the program

The kernel function is:

/**
 * Kernel function
 * @param {Number} t variable
 * @returns {Number}
 */
function kernel(t) {
  return (3 / Math.PI) * Math.pow(1 - t * t, 2);
}

Bandwidth:

/**
 * Bandwidth
 * @param {Point[]} pts all POI points
 * @param {Point} avePt mean center
 * @returns {Number}
 */
function h(pts, avePt) {
  const SD_ = SD(pts, avePt);
  const Dm_ = Dm(pts, avePt);
  if (SD_ > Math.sqrt(1 / Math.log(2)) * Dm_) {
    return 0.9 * Math.sqrt(1 / Math.log(2)) * Dm_ * Math.pow(pts.length, -0.2);
  } else {
    return 0.9 * SD_ * Math.pow(pts.length, -0.2);
  }
}

Mean center:

/**
 * Mean center
 * @param {Point[]} pts all POI points
 * @returns {Point}
 */
function ave(pts) {
  let lon = 0,
    lat = 0;
  pts.forEach((pt) => {
    lon += pt.lon;
    lat += pt.lat;
  });
  return new Point(lon / pts.length, lat / pts.length);
}

Standard distance SD:

/**
 * Standard distance
 * @param {Point[]} pts all POI points
 * @param {Point} avePt mean center
 * @returns {Number}
 */
function SD(pts, avePt) {
  let SDx = 0,
    SDy = 0;
  pts.forEach((pt) => {
    SDx += Math.pow(pt.lon - avePt.lon, 2);
    SDy += Math.pow(pt.lat - avePt.lat, 2);
  });
  return Math.sqrt(SDx / pts.length + SDy / pts.length);
}

Dm (median distance to the mean center):

/**
 * Dm (median distance to the mean center)
 * @param {Point[]} pts all POI points
 * @param {Point} avePt mean center
 * @returns {Number}
 */
function Dm(pts, avePt) {
  const distance = [];
  pts.forEach((pt) => {
    distance.push(Gauss(pt, avePt));
  });
  // Sort numerically (the default sort compares values as strings)
  distance.sort((a, b) => a - b);
  return distance[Math.floor(distance.length / 2)];
}

Kernel density estimation:

/**
 * Kernel density estimation
 * @param {Point[]} pts all POI points
 * @param {Rect} rect POI point boundary
 * @param {Number} width raster image width
 * @param {Number} height raster image height
 */
function kde(pts, rect, width, height) {
  const estimate = new Array(height)
    .fill(0)
    .map(() => new Array(width).fill(0));
  let min = Infinity,
    max = -Infinity;
  const avePt = ave(pts);
  const bandWidth = h(pts, avePt);
  // Expand the boundary by one bandwidth on each side
  rect.top += bandWidth;
  rect.bottom -= bandWidth;
  rect.left -= bandWidth;
  rect.right += bandWidth;
  const itemW = (rect.right - rect.left) / width;
  const itemH = (rect.top - rect.bottom) / height;
  for (let x = 0; x < width; x++) {
    const itemX = rect.left + itemW * x;
    for (let y = 0; y < height; y++) {
      const itemY = rect.bottom + itemH * y;
      let fEstimate = 0;
      for (let m = 0; m < pts.length; m++) {
        const distance = Gauss(pts[m], new Point(itemX, itemY));
        // Only points within one bandwidth of the cell contribute
        if (distance < bandWidth) {
          fEstimate += kernel(distance / bandWidth);
        }
      }
      fEstimate = fEstimate / (pts.length * Math.pow(bandWidth, 2));
      min = Math.min(min, fEstimate);
      max = Math.max(max, fEstimate);
      estimate[height - y - 1][x] = fEstimate;
    }
  }
  return { estimate, min, max };
}

Then use HTML5 Canvas for visualization:

/**
 * Draw the kernel density estimation result as a raster image
 * @param {CanvasRenderingContext2D} ctx canvas context
 * @param {{estimate: Number[][], min: Number, max: Number}} result kernel density estimates with their minimum and maximum
 */
function draw(ctx, { estimate, min, max }) {
  const height = estimate.length,
    width = estimate[0].length;
  for (let x = 0; x < width; x++) {
    for (let y = 0; y < height; y++) {
      // Normalize the estimate to [0, 1] before mapping it to a color
      const val = (estimate[y][x] - min) / (max - min);
      if (val > 0 && val < 0.4) {
        ctx.fillStyle = "rgb(228, 249, 245)";
        ctx.fillRect(x, y, 1, 1);
      } else if (val >= 0.4 && val < 0.7) {
        ctx.fillStyle = "rgb(48, 227, 202)";
        ctx.fillRect(x, y, 1, 1);
      } else if (val >= 0.7 && val < 0.9) {
        ctx.fillStyle = "rgb(17, 153, 158)";
        ctx.fillRect(x, y, 1, 1);
      } else if (val >= 0.9 && val <= 1) {
        ctx.fillStyle = "rgb(64, 81, 78)";
        ctx.fillRect(x, y, 1, 1);
      }
    }
  }
}
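
Finally, a hedged sketch of how the pieces above might be wired together; pts, rect and the canvas id are assumptions standing in for the crawled data and the page markup.

// Assuming `pts` holds the 219 crawled KFC POI Point objects and `rect` their
// boundary ({ top, bottom, left, right }); "kde-canvas" is a hypothetical id.
const canvas = document.getElementById("kde-canvas");
const ctx = canvas.getContext("2d");
const result = kde(pts, rect, canvas.width, canvas.height);
draw(ctx, result);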

Kernel density estimation results

The following figure is the kernel density estimation result:

This figure shows that the kernel density estimation result clearly highlights the hot spots in the data.

All the code

Follow the WeChat public account Guyueyue Sanmu and look up "density analysis" to get all the code and data.