
This article belongs to the Greedy NLP boot camp study-notes series. The earlier Python basics and NumPy content sits at 60-90 in the videos, and there is another part about writing crawlers in Python; I'll skip the rest of that.

I don't know which teacher taught this chapter. The PPT isn't in a standard typeface but something close to handwriting, and I can't even make out the probability-and-statistics terms.

2. Sampling

Purposes of sampling:

  • Estimating statistics
  • Approximate inference
  • Visualization

Any probability can be expressed as the expectation of a function.
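A one-line sketch of that claim (my own illustration, not from the slides): the probability of an event is the expectation of its indicator function, so estimating probabilities reduces to estimating expectations.

```latex
P(x \in A) = \int \mathbb{1}[x \in A]\, p(x)\, dx = \mathbb{E}_{x \sim p}\big[\mathbb{1}[x \in A]\big]
```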

Pros: simple and versatile.

Cons: slow, and it can be hard to get "correct" samples.

3. The Monte Carlo Method

Sampling methods are the core of approximate inference algorithms, also known as Monte Carlo methods. The basic goal of a sampling method is to compute the expected value of a function $f(x)$ under some particular probability distribution $p(x)$:

$$\mathbb{E}[f] = \int f(x)\,p(x)\,dx$$

But the dimension of $x$ can be very high, so this expected value is often very hard to compute directly. To simplify things, $N$ points are drawn from the probability distribution $p(x)$ to form a sample set $\{x^{(1)}, \dots, x^{(N)}\}$ (it is worth noting that the statistical properties of these points follow the probability distribution).

So the question becomes: if we can easily draw $N$ i.i.d. points from $p(x)$, we can estimate this expectation as

$$\mathbb{E}[f] \approx \frac{1}{N} \sum_{n=1}^{N} f\big(x^{(n)}\big)$$

(i.i.d. means the sample points are independent and identically distributed). The PPT is written in this kind of shorthand, which makes it hard going for beginners.

The Monte Carlo method has the advantage of being an unbiased estimate (the expectation of the resulting sample mean equals the true expected value under the distribution).

The encyclopedia's introduction to unbiased estimation:

Unbiased estimation means estimating a population parameter from sample statistics without systematic bias. If the mathematical expectation of an estimator equals the true value of the parameter being estimated, the estimator is called an unbiased estimator of that parameter. The significance of an unbiased estimator is that, over many repetitions, its average approaches the true value of the parameter. Unbiased estimation is often used in test-score statistics.
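Here is a minimal sketch of what unbiasedness means in the Monte Carlo setting (the distribution and function are my own toy choices, not from the lecture): any single sample mean is noisy, but averaged over many repetitions it converges to the true expectation.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy setup: p is a standard normal and f(x) = x**2, so E[f] = 1 exactly.
def f(x):
    return x ** 2

n_points = 100      # sample size N of each Monte Carlo estimate
n_repeats = 10_000  # number of repeated experiments

# Each row is one experiment: draw N points from p, average f over them.
samples = rng.standard_normal((n_repeats, n_points))
estimates = f(samples).mean(axis=1)

# Any single estimate is noisy, but the mean over repetitions is close to
# the true value 1.0 -- the estimator is unbiased.
print("one estimate:         ", estimates[0])
print("mean over repetitions:", estimates.mean())  # ~= 1.0
```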

Unfortunately, the teacher didn't derive the formula by hand and walk through the derivation the way Li Wenzhe did; I can't follow it.

A classic application of Monte Carlo is computing areas. The teacher used the **Mandelbrot set** as the example.

Some background knowledge on fractal mathematics:

The Mandelbrot set is the set of complex numbers $c$ for which the iteration

$$z_{n+1} = z_n^2 + c, \qquad z_0 = 0$$

does not diverge: starting from $z_0 = 0$, the sequence $\{z_n\}$ obtained by iterating the formula remains bounded.

For example, when $c = 0$, the sequence is obviously always 0 and does not diverge, so $c = 0$ belongs to the Mandelbrot set.

On the two-dimensional plane, all points that belong to the Mandelbrot set are marked black, and all points outside it are given different colors according to how quickly they diverge, which is the classic picture in the upper-right corner of the figure.

The real theoretical derivation is complicated; see the answer on Zhihu by a PhD student specializing in this branch of mathematics: www.zhihu.com/question/26…

In the screenshot above, points are sampled uniformly over the $(-2, 2) \times (-2, 2)$ square; by the Monte Carlo formula above, the probability of a point landing in the set is the area of the black region divided by the area of the square, so the set's area can be estimated from the fraction of hits.
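Here is a minimal sketch of that estimate in Python (the sample count and iteration cap are my own choices; the lecture's exact settings aren't visible): sample points uniformly on the square, test membership by iterating $z \to z^2 + c$, and multiply the hit fraction by the square's area.

```python
import numpy as np

rng = np.random.default_rng(0)

def in_mandelbrot(c, max_iter=100):
    """Crude membership test: treat c as 'in' the set if |z| stays bounded
    after max_iter iterations of z -> z**2 + c starting from z = 0."""
    z = 0j
    for _ in range(max_iter):
        z = z * z + c
        if abs(z) > 2:  # once |z| > 2 the sequence is guaranteed to diverge
            return False
    return True

n = 100_000
# Uniform samples on the square (-2, 2) x (-2, 2), which has area 16.
cs = rng.uniform(-2, 2, n) + 1j * rng.uniform(-2, 2, n)
hits = sum(in_mandelbrot(c) for c in cs)

# Monte Carlo estimate: (fraction of hits) * (area of the square).
print("estimated area:", 16 * hits / n)  # the true value is about 1.506
```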

4. Importance Sampling

Suppose we cannot easily sample from $p(z)$ but still want the expected value. One naive alternative is to divide the space of $z$ into many small grid cells $z^{(l)}$ and compute

$$\mathbb{E}[f] \approx \sum_{l=1}^{L} p\big(z^{(l)}\big)\, f\big(z^{(l)}\big)$$

The screenshot shows a fair (uniform) distribution on the left and a biased distribution on the right.

As for the mathematical formula in the lower-right corner of the screenshot: it isn't clear to me without a hand derivation; readers who understand it are welcome to leave a comment explaining it.

Here is an article I found on Zhihu: zhuanlan.zhihu.com/p/62730810

Such a grid calculation is not exact, and its precision improves only as the mesh is refined. Moreover, the number of summation terms grows exponentially with the dimension of the variable $z$. Also, for a typical probability distribution, $p(z)$ is large on only a small region of $z$, and most of $z$ contributes almost nothing to the overall expectation. We would like to focus on the high-probability points and ignore the low-probability ones, and that is the motivation for importance sampling.

Here $q(z)$ is the proposal distribution, chosen to be very easy to sample from. Drawing sample points $z^{(l)} \sim q(z)$, we rewrite the expectation as

$$\mathbb{E}[f] = \int f(z)\,\frac{p(z)}{q(z)}\,q(z)\,dz \approx \frac{1}{L} \sum_{l=1}^{L} \frac{p\big(z^{(l)}\big)}{q\big(z^{(l)}\big)}\, f\big(z^{(l)}\big)$$

and we call $r_l = p\big(z^{(l)}\big)/q\big(z^{(l)}\big)$ the importance weight.
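A minimal sketch of the idea (the target, proposal, and $f$ here are my own toy choices, not from the slides): draw from an easy proposal $q$, weight each sample by $p/q$, and the weighted average still estimates the expectation under $p$.

```python
import numpy as np

rng = np.random.default_rng(0)

def normal_pdf(x, mu, sigma):
    """Density of N(mu, sigma**2), written out to keep the sketch self-contained."""
    return np.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * np.sqrt(2 * np.pi))

# Target p = N(0, 1); proposal q = N(0, 2), which we pretend is the easy one.
# With f(z) = z**2, the true expectation under p is exactly 1.
z = rng.normal(0.0, 2.0, size=50_000)  # samples come from q, not p

weights = normal_pdf(z, 0.0, 1.0) / normal_pdf(z, 0.0, 2.0)  # r_l = p/q

estimate = np.mean(weights * z ** 2)
print("importance-sampling estimate:", estimate)  # ~= 1.0
```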

5. Negative Sampling

Most of what is written about negative sampling on the web is tied to word2vec. (I'd also complain that the teacher talked about negative sampling directly without introducing word2vec; for reference, the skip-gram model uses the center word to predict its context words.)

word2vec is a huge neural network, and the skip-gram model has three optimization strategies:

  • 1. Treat common phrases as single words.
  • 2. Subsample very frequent words (e.g., "a", "the").
  • 3. Modify the optimization objective with "Negative Sampling", so that each training sample updates only a small fraction of the model's weights (see the sketch after this list).
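Here is a minimal sketch of strategy 3's sampler (the unigram-to-the-3/4 noise distribution is the one reported in the word2vec paper; the tiny vocabulary and counts are made up): negatives are drawn from the word-frequency distribution raised to the 0.75 power, which boosts rare words relative to raw frequency.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy vocabulary with raw corpus counts (made up for illustration).
vocab = ["the", "cat", "sat", "mat", "quantum"]
counts = np.array([1000, 50, 30, 20, 1], dtype=float)

# word2vec draws negatives from the unigram distribution raised to 3/4,
# which flattens it: rare words get a relative boost.
probs = counts ** 0.75
probs /= probs.sum()

def sample_negatives(k, positive_idx):
    """Draw k negative word indices, skipping the positive target word."""
    negatives = []
    while len(negatives) < k:
        idx = rng.choice(len(vocab), p=probs)
        if idx != positive_idx:
            negatives.append(idx)
    return negatives

# E.g., 5 negatives for a training pair whose target word is "cat" (index 1):
print([vocab[i] for i in sample_negatives(5, positive_idx=1)])
```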

The teacher's example is the same. The screenshot shows the curve of the sampling-rate formula.
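For reference, I believe the curve is the standard word2vec subsampling-rate formula (taken from the word2vec implementation; I can't confirm it from the slide): the probability of keeping a word $w_i$ that makes up a fraction $z(w_i)$ of the corpus is

```latex
P(w_i) = \left( \sqrt{\frac{z(w_i)}{0.001}} + 1 \right) \cdot \frac{0.001}{z(w_i)}
```

so very frequent words are mostly discarded, while rare words are almost always kept.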

kexue.fm/archives/56…

And a Zhihu explanation: www.zhihu.com/question/50…

Using NCE for probabilistic parameter estimation amounts to solving the parameter estimation of an equivalent problem (replacing the original normalization constant with a sampling-based method), thereby bypassing the direct computation of the normalization term. It transforms a complex problem into an equivalent, simpler, computable one.
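In symbols, one common way to state that equivalent problem (this is the standard NCE objective from the literature, not the lecture's own notation): with $k$ noise samples per data point drawn from a known noise distribution $q$, NCE fits a binary classifier separating data from noise by maximizing

```latex
J(\theta) = \mathbb{E}_{x \sim p_{\text{data}}}\!\left[ \log \frac{p_\theta(x)}{p_\theta(x) + k\,q(x)} \right] + k\,\mathbb{E}_{\bar{x} \sim q}\!\left[ \log \frac{k\,q(\bar{x})}{p_\theta(\bar{x}) + k\,q(\bar{x})} \right]
```

Here $p_\theta$ may be left unnormalized, with the normalizer treated as just another learnable parameter, which is exactly how NCE sidesteps computing the normalization term.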

I don't have the background to understand the rest. Leaving myself a TODO: come back and sort this out once the foundations are in place.