1 Introduction

The theme of this book is automated learning, or, as we more commonly call it, machine learning. That is, we want to program computers so that they can “learn” from available input. Roughly speaking, learning is the process of converting experience into expertise or knowledge. The input to a learning algorithm is training data, representing experience, and the output is some expertise, usually in the form of another computer program that can perform some task. Seeking a formal mathematical understanding of this concept, we must be more explicit about what we mean by each of these terms: What training data will our programs access? How can the process of learning be automated? How do we measure the success of such a process (i.e., the quality of the output of the learning program)?

1.1 What is learning?

Let’s start by considering a few examples of naturally occurring animal learning. Some of the most fundamental problems in machine learning have already arisen in this context, and we are all familiar with them.

Bait shyness: rats learning to avoid poisonous baits. When rats encounter food with a novel appearance or smell, they first eat only small amounts, and subsequent feeding depends on the flavor of the food and its physiological effect. If the food produces an ill effect, the novel food tends to be associated with the illness, and subsequently the rats will not eat it. Clearly, there is a learning mechanism at work here: the animals use past experience with certain foods to gain expertise in judging the safety of those foods. If past experience with a food was negatively labeled, the animal predicts that future encounters with it will also have a negative effect.

Inspired by this example of successful learning, let’s demonstrate a typical machine learning task. Suppose we want to write a program that lets a machine learn how to filter out spam. A naive solution would be seemingly similar to the way rats learn to avoid poisonous baits: the machine simply memorizes all e-mails previously flagged as spam by human users. When a new e-mail arrives, the machine searches for it in the previous spam collection. If it matches one of them, it is discarded; otherwise, it is moved to the user’s inbox folder.

While this “learning by rote” approach is sometimes useful, it lacks an important aspect of a learning system: the ability to label e-mails that have never been seen before. A successful learner should be able to progress from individual examples to broader generalization. This is also known as inductive reasoning or inductive inference. In the bait shyness example, after the rats encounter an example of a certain kind of food, they apply their attitude toward it to new, unfamiliar examples of food with a similar smell and taste. To generalize in the spam filtering task, the learner can scan previously seen e-mail messages and extract a set of words whose occurrence in an e-mail message is indicative of spam. Then, when a new e-mail arrives, the machine can check for suspicious words and predict its label accordingly. Such a system would plausibly be able to correctly predict the labels of e-mail messages it has never seen before.

However, inductive reasoning can lead us to false conclusions. To illustrate this point, let’s consider another example from animal learning.

Pigeon superstition: In an experiment conducted by the psychologist B. F. Skinner, a group of hungry pigeons was placed in a cage. The cage was fitted with an automatic device that delivered food to the pigeons at regular intervals, regardless of their behavior. The hungry pigeons moved about the cage, and when food was first delivered, each pigeon happened to be engaged in some activity (pecking, turning its head, etc.). The arrival of food reinforced each bird’s specific action, so each pigeon tended to spend more time performing that same action. That, in turn, increased the chance that the next random food delivery would find each bird engaged in that activity again. The result was a chain of events that reinforced the pigeons’ association between the delivery of food and whatever chance actions they had been performing when it was first delivered; they then went on diligently performing those same actions.

What distinguishes a learning mechanism that leads to superstition from useful learning? This question is crucial to the development of automated learning machines.
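To make the two spam-filtering approaches above concrete, here is a minimal Python sketch of our own; the function names, word list, and threshold are illustrative assumptions, not anything prescribed by the text.

```python
# A minimal sketch contrasting the "rote" filter with a generalizing one.
# The suspicious-word list and the threshold are hypothetical choices.

def rote_filter(new_email: str, known_spam: set[str]) -> bool:
    """Flag an e-mail as spam only if it exactly matches previously seen spam."""
    return new_email in known_spam

def generalizing_filter(new_email: str, suspicious_words: set[str],
                        threshold: int = 2) -> bool:
    """Flag an e-mail as spam if it contains enough words previously
    extracted from human-labeled spam messages."""
    hits = sum(1 for word in new_email.lower().split() if word in suspicious_words)
    return hits >= threshold

known_spam = {"win a free prize now", "cheap meds online"}
suspicious_words = {"free", "prize", "cheap", "win"}

email = "win a free cruise prize"
print(rote_filter(email, known_spam))                 # False: never seen verbatim
print(generalizing_filter(email, suspicious_words))   # True: generalizes to unseen mail
```

The rote filter fails on any message it has not memorized, while the word-based filter generalizes, at the risk, as the pigeon example shows, of latching onto spurious patterns.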
While human learners can rely on common sense to filter out random, meaningless learning conclusions, once we delegate learning tasks to machines, we must provide well-defined, crisp principles to ensure that programs do not reach senseless or useless conclusions. The development of such principles is a central goal of machine learning theory. So what made the rats more successful learners than the pigeons? As a first step toward answering this question, let’s take a closer look at the bait shyness phenomenon in rats.

Bait shyness revisited: rats fail to acquire conditioning between food and electric shock or between sound and nausea. The bait shyness mechanism in rats is more complex than one might think. Experiments carried out by Garcia (Garcia & Koelling 1996) showed that if the unpleasant stimulus following food consumption is replaced by an electric shock (rather than nausea), no conditioning occurs. Even after repeated trials in which rats received an unpleasant electric shock after eating a certain food, they did not tend to avoid that food. A similar failure of conditioning occurs when the food feature that implies nausea, such as taste or smell, is replaced by a vocal signal. The rats seem to have some “built-in” prior knowledge telling them that while the temporal correlation between food and nausea may be causal, it is unlikely that there is a causal relationship between eating food and an electric shock, or between sound and nausea.

We conclude that a distinguishing feature between the bait shyness learning and the pigeon superstition is the incorporation of prior knowledge that biases the learning mechanism. This is also called inductive bias. The pigeons in the experiment were willing to adopt any explanation for the presence of food. However, the rats “knew” that food could not cause an electric shock, and that the co-occurrence of a noise with some food was unlikely to affect the nutritional value of that food. The learning process in rats was biased toward detecting certain kinds of patterns while ignoring other temporal correlations between events.

It turns out that the incorporation of prior knowledge, biasing the learning process, is necessary for the success of learning algorithms (this is formally stated and proved as the “No-Free-Lunch theorem” in Chapter 5). Developing tools for expressing domain expertise, translating it into a learning bias, and quantifying the effect of such a bias on the success of learning is a central theme of machine learning theory. Roughly speaking, the stronger the prior knowledge (or prior assumptions) with which one starts the learning process, the easier it is to learn from further examples. However, the stronger these prior assumptions are, the less flexible the learning is: it is bound, a priori, by these assumptions. We will discuss these issues explicitly in Chapter 5.
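The role of inductive bias can be made concrete with a toy sketch of our own (not from the text), under the assumption that the true concept is a single threshold on a scalar feature: a bias-free memorizer fits the training data but, like Skinner’s pigeons, learns nothing usable about unseen points, while a learner restricted to threshold rules generalizes.

```python
# Toy illustration of inductive bias. Both learners fit the same training
# data, but only the learner with a restricted hypothesis class extrapolates.

train = [(1.0, 0), (2.0, 0), (6.0, 1), (7.0, 1)]  # (feature, label) pairs

def memorizer(train):
    """Unbiased learner: remembers training points, guesses 0 everywhere else."""
    table = dict(train)
    return lambda x: table.get(x, 0)

def threshold_learner(train):
    """Biased learner: assumes the label flips once at some threshold."""
    # Place the threshold halfway between the largest 0-labeled point
    # and the smallest 1-labeled point.
    top_zero = max(x for x, y in train if y == 0)
    bottom_one = min(x for x, y in train if y == 1)
    t = (top_zero + bottom_one) / 2
    return lambda x: int(x >= t)

h_memo, h_thresh = memorizer(train), threshold_learner(train)
print(h_memo(5.5), h_thresh(5.5))  # 0 1 -- only the biased learner extrapolates
```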

1.2 When do we need machine learning?

When do we need machine learning rather than directly programming computers to perform the task at hand? Two aspects of a given problem may call for the use of programs that learn and improve on the basis of their “experience”: the problem’s complexity and the need for adaptivity.

Tasks too complex to program

  • Tasks performed by animals/humans: We humans often perform many tasks, but our reflection on how to perform them is not refined enough to extract clearly defined procedures. Examples of such tasks include driving, speech recognition, and image understanding. In all of these tasks, the most advanced machine learning programs, the ones that “learn from their experiences,” achieve fairly satisfactory results once they are exposed to enough training samples.
  • Tasks beyond human capabilities: Another broad category of tasks that benefit from machine learning techniques relates to analyzing very large and complex data sets: astronomical data, turning medical archives into medical knowledge, weather forecasting, analysis of genomic data, web search engines, and electronic commerce. As more and more digitally recorded data becomes available, it is becoming clear that there is a wealth of meaningful information buried in data archives that are far too large and complex for humans to make sense of. Learning to detect meaningful patterns in large and complex data sets is a promising field in which the combination of learning programs with the almost unlimited memory capacity and ever-increasing processing speed of computers opens up new horizons.

Adaptivity

One limiting feature of programmed tools is their rigidity: once the program is written and installed, it stays unchanged. However, many tasks change over time or from one user to another. Machine learning tools, programs whose behavior adapts to their input data, offer a solution to such problems; by nature, they can adapt to changes in the environment with which they interact. Typical successful applications of machine learning to this kind of problem include programs that decode handwritten text, where a fixed program can accommodate differences between different users’ handwriting; spam detection programs, which adapt automatically to changes in the nature of spam e-mails; and speech recognition programs.

1.3 Types of learning

Learning is of course a very broad field. As a result, the field of machine learning has branched off into several subfields that deal with different types of learning tasks. We present a rough taxonomy of learning paradigms designed to provide some perspective on where the content of this book fits into the broader field of machine learning. We describe four parameters according to which learning paradigms can be classified.

Supervised and unsupervised learning Since learning involves an interaction between the learner and the environment, learning tasks can be divided according to the nature of that interaction. The first distinction to note is between supervised and unsupervised learning. As an illustrative example, consider the task of learning to detect spam versus the task of anomaly detection. For the spam detection task, consider a setting in which the learner receives e-mails labeled spam/not-spam for training purposes. Based on this training, the learner should figure out a rule for labeling newly arriving e-mail. In contrast, for the anomaly detection task, all the learner gets during training is a large body of e-mail messages (without labels), and the learner’s job is to detect “unusual” messages.

More abstractly, viewing learning as a process of “using experience to gain expertise,” supervised learning describes a scenario in which the “experience,” a training example, contains significant information (such as the spam/not-spam label) that is missing from the unseen “test examples” to which the learned expertise is to be applied. In this setting, the acquired expertise is aimed at predicting the missing information for the test data. In such cases, we can think of the environment as a teacher that “supervises” the learner by providing the extra information (labels). In unsupervised learning, however, there is no distinction between training data and test data in the amount of information they carry. The learner processes the input data with the goal of coming up with a summary or compressed version of that data. Clustering a data set into subsets of similar objects is a typical example of such a task.

There is also an intermediate learning setting in which, although the training examples contain more information than the test examples, the learner is required to predict even more information for the test examples. For example, one may try to learn a value function that describes, for each position on a chess board, the degree to which White’s position is better than Black’s. Yet the only information available to the learner during training is positions that occurred in actual chess games, labeled by who eventually won the game. Such learning frameworks are mainly studied under the title of reinforcement learning.
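Schematically, the three settings differ in the shape of the training experience and of the required output. The following toy sketch of our own (none of these names or data come from the text) records that difference.

```python
# Schematic data shapes for the three settings (hypothetical toy data).

# Supervised: training pairs carry labels that test instances lack.
supervised_train = [("win a free prize", "spam"), ("meeting at noon", "not-spam")]
test_instance = "free prize inside"          # its label must be predicted

# Unsupervised: no labels anywhere; the goal is a summary such as a clustering.
unsupervised_train = ["win a free prize", "meeting at noon", "cheap meds online"]

# Intermediate / reinforcement-style: each training game is a sequence of
# positions labeled only by the final winner, yet the learner must output
# a value for every individual position.
chess_train = [(["pos1", "pos2", "pos3"], +1),   # White eventually won
               (["pos4", "pos5"], -1)]           # Black eventually won
```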

Active and passive learners Learning paradigms can also vary by the role played by the learner. We distinguish between “active” and “passive” learners. An active learner interacts with the environment during training, for example by posing queries or performing experiments, while a passive learner only observes the information provided by the environment (or the teacher) without influencing or directing it. Note that the learner of a spam filter is usually passive, waiting for users to mark the e-mails they receive. In an active setting, one could imagine asking users to label specific e-mail messages chosen by the learner, or even composed by the learner, to enhance its understanding of what spam is.

Helpfulness of the teacher When one thinks of human learning, such as an infant at home or a student at school, the process usually involves a helpful teacher who tries to provide the learner with the information most useful for achieving the learning goal. In contrast, when a scientist learns about nature, the environment, playing the role of the teacher, can at best be thought of as passive: apples drop, stars shine, and the rain falls without regard to the needs of the learner. We model such learning scenarios by postulating that the training data (or the learner’s experience) is generated by some random process. This is the basic building block of the branch of “statistical learning.” Finally, learning also occurs when the learner’s input is generated by an adversarial “teacher.” This may be the case in the spam filtering example (if the spammer tries to mislead the designer of the spam filter) or in learning to detect fraud. One also uses an adversarial teacher model as a worst-case scenario when no milder setting can be safely assumed: if you can learn against an adversarial teacher, you are guaranteed to succeed when interacting with any odd teacher.

Online versus batch learning protocol The last parameter we mention is the distinction between situations in which the learner must respond online, throughout the learning process, and situations in which the learner has to engage the acquired expertise only after having had a chance to process large amounts of data. For example, a stockbroker must make daily decisions based on the experience gathered so far. Over time, he may become an expert, but he may make costly mistakes in the process. In contrast, in many data mining settings, the learner, the data miner, has large amounts of training data to process before having to output conclusions.
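The difference between the two protocols can be sketched as follows; the MajorityLearner and the predict/update/fit interface are illustrative assumptions of our own, not anything prescribed by the text.

```python
# A minimal sketch contrasting the online and batch protocols.

class MajorityLearner:
    """Toy learner: predicts whichever label it has seen most often
    (it ignores the instance x; a real learner would use it)."""
    def __init__(self):
        self.counts = {}

    def predict(self, x):
        return max(self.counts, key=self.counts.get) if self.counts else 0

    def update(self, x, y):
        self.counts[y] = self.counts.get(y, 0) + 1

    def fit(self, data):
        for x, y in data:
            self.update(x, y)

def online_protocol(learner, stream):
    """Online: predict each example as it arrives, then learn from the
    revealed label; mistakes made along the way are part of the cost."""
    mistakes = 0
    for x, y in stream:
        if learner.predict(x) != y:
            mistakes += 1
        learner.update(x, y)
    return mistakes

def batch_protocol(learner, training_data, test_xs):
    """Batch: digest all training data first, output predictions only after."""
    learner.fit(training_data)
    return [learner.predict(x) for x in test_xs]

data = [("day1", 1), ("day2", 1), ("day3", 0), ("day4", 1)]
print(online_protocol(MajorityLearner(), data))           # mistakes incurred en route
print(batch_protocol(MajorityLearner(), data, ["day5"]))  # predictions after the fact
```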

In this book, we will discuss only a subset of the possible learning paradigms. Our main focus is on supervised statistical batch learning with a passive learner (for example, trying to learn how to generate patient prognoses based on large archives of patient records that were collected independently and are already labeled by the fate of the recorded patients). We will also briefly discuss online learning and batch unsupervised learning (in particular, clustering).

1.4 Relationship with other fields

As an interdisciplinary field, machine learning shares common threads with mathematical fields such as statistics, information theory, game theory, and optimization. It is naturally a branch of computer science, because our goal is to program machines to learn. In a sense, machine learning can be seen as a branch of AI (artificial intelligence), because, after all, the ability to turn experience into expertise or to detect meaningful patterns in complex sensory data is a cornerstone of human (and animal) intelligence. However, one should note that, in contrast to traditional AI, machine learning does not attempt to build automated imitations of intelligent behavior, but rather to use the strengths and special abilities of computers to complement human intelligence, often performing tasks far beyond human capabilities. For example, the ability to scan and process huge databases allows machine learning programs to detect patterns outside the range of human perception.

The experience, or training, component of machine learning usually refers to randomly generated data. The task of the learner is to process these randomly generated samples and draw conclusions that hold for the environment from which the samples are picked. This description of machine learning highlights its close relationship with statistics. In fact, the two disciplines have a lot in common in terms of both goals and the techniques used. However, there are some significant differences of emphasis. If a doctor hypothesizes a correlation between smoking and heart disease, the statistician’s role is to look at samples of patients and check the validity of that hypothesis (the common statistical task of hypothesis testing). Machine learning, by contrast, aims to use data collected from samples of patients to come up with a description of the causes of heart disease. The hope is that automated techniques will be able to spot meaningful patterns (or hypotheses) that human observers might miss.

Compared with traditional statistics, in machine learning in general, and in this book in particular, algorithmic considerations play a major role. Machine learning is about the execution of learning by computers; hence algorithmic issues are pivotal. We develop algorithms to perform the learning tasks and focus on their computational efficiency. Another difference is that while statistics is generally interested in asymptotic behavior (such as the convergence of sample-based statistical estimates as the sample size grows to infinity), machine learning theory focuses on finite sample bounds. That is, given the size of the available sample, machine learning theory aims to figure out the degree of accuracy that a learner can expect on the basis of that sample.

There are further differences between the two disciplines, but we will mention just one more here. While in statistics it is common to work under certain predefined data model assumptions (such as the normality of the data-generating distribution, or the linearity of functional dependencies), in machine learning the emphasis is on working under a “distribution-free” setting, in which the learner assumes as little as possible about the nature of the data distribution and allows the learning algorithm to figure out which models best approximate the data-generating process. A precise discussion of this issue requires some technical preliminaries, and we will return to it later in the book, particularly in Chapter 5.

1.5 Notation

Most of the notation we use throughout the book is either standard or defined on the spot. In this section, we describe our main conventions and provide a table summarizing our notation (Table 1.1). If some of the notation is unclear while reading the book, the reader is encouraged to skip this section and come back to it.

We use lowercase letters such as $x$ and $\lambda$ to denote scalars and abstract objects. When we wish to emphasize that an object is a vector, we use boldface letters (for example, $\mathbf{x}$ and $\boldsymbol{\lambda}$). The $i$'th element of a vector $\mathbf{x}$ is denoted by $x_i$. We use uppercase letters to denote matrices, sets, and sequences; the meaning should be clear from the context.

As we will see, the input of a learning algorithm is a sequence of training examples. We denote an abstract example by $z$, and a sequence of $m$ examples by $S = z_1,\dots,z_m$. Historically, $S$ is often referred to as a training set; however, we always assume that $S$ is a sequence rather than a set. A sequence of $m$ vectors is denoted by $\mathbf{x}_1,\dots,\mathbf{x}_m$. The $i$'th element of $\mathbf{x}_t$ is denoted by $x_{t,i}$.

Throughout the book, we make use of basic notions from probability. We denote by $\mathcal{D}$ a distribution over some set, for example, $Z$. We use the notation $z \sim \mathcal{D}$ to denote that $z$ is sampled according to $\mathcal{D}$. Given a random variable $f : Z \to \mathbb{R}$, its expected value is denoted by $\mathbb{E}_{z\sim\mathcal{D}}[f(z)]$. We sometimes use the shorthand $\mathbb{E}[f]$ when the dependence on $z$ is clear from the context. For $f : Z \to \{\text{true}, \text{false}\}$ we also use $\mathbb{P}_{z\sim\mathcal{D}}[f(z)]$ to denote $\mathcal{D}(\{z : f(z) = \text{true}\})$. In the next chapter, we will also introduce the notation $\mathcal{D}^m$ to denote the probability over $Z^m$ induced by sampling $(z_1,\dots,z_m)$, where each point $z_i$ is sampled from $\mathcal{D}$ independently of the other points.
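As a quick worked instance of this notation (an example of our own, not from the text): let $Z = \{1,\dots,6\}$ be the outcomes of a fair die, so that $\mathcal{D}$ is uniform over $Z$, and let $f(z)$ be true iff $z$ is even. Then

$$\mathbb{P}_{z\sim\mathcal{D}}[f(z)] = \mathcal{D}(\{2,4,6\}) = \frac{3}{6} = \frac{1}{2}, \qquad \mathbb{E}_{z\sim\mathcal{D}}[g(z)] = \sum_{z=1}^{6} \frac{1}{6}\, z = \frac{7}{2}$$

for the random variable $g(z) = z$.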


To be mathematically precise, $\mathcal{D}$ should be defined over some $\sigma$-algebra of subsets of $Z$. Readers unfamiliar with measure theory can skip the few footnotes and remarks regarding more formal definitions and measurability assumptions.


In general, we have tried to avoid asymptotic notation. However, we occasionally use it to clarify the main results. In particular, given $f : \mathbb{R} \to \mathbb{R}_+$ and $g : \mathbb{R} \to \mathbb{R}_+$, we write $f = O(g)$ if there exist $x_0, \alpha \in \mathbb{R}_+$ such that for all $x > x_0$ we have $f(x) \le \alpha g(x)$. We write $f = o(g)$ if for every $\alpha > 0$ there exists $x_0$ such that for all $x > x_0$ we have $f(x) \le \alpha g(x)$. We write $f = \Omega(g)$ if there exist $x_0, \alpha \in \mathbb{R}_+$ such that for all $x > x_0$ we have $f(x) \ge \alpha g(x)$. The notation $f = \omega(g)$ is defined analogously. The notation $f = \Theta(g)$ means that $f = O(g)$ and $g = O(f)$. Finally, the notation $f = \tilde{O}(g)$ means that there exists $k \in \mathbb{N}$ such that $f(x) = O(g(x)\log^k(g(x)))$.

The inner product between vectors $\mathbf{x}$ and $\mathbf{w}$ is denoted by $\langle\mathbf{x},\mathbf{w}\rangle$. Whenever we do not specify the vector space, we assume it is the $d$-dimensional Euclidean space, and then $\langle\mathbf{x},\mathbf{w}\rangle = \sum_{i=1}^{d} x_i w_i$. The Euclidean (or $\ell_2$) norm of a vector $\mathbf{w}$ is $\lVert\mathbf{w}\rVert_2 = \sqrt{\langle\mathbf{w},\mathbf{w}\rangle}$. We omit the subscript from the $\ell_2$ norm when it is clear from the context. We also use other $\ell_p$ norms, $\lVert\mathbf{w}\rVert_p = \left(\sum_i \lvert w_i\rvert^p\right)^{1/p}$, and in particular $\lVert\mathbf{w}\rVert_1 = \sum_i \lvert w_i\rvert$ and $\lVert\mathbf{w}\rVert_\infty = \max_i \lvert w_i\rvert$.

We use the notation $\min_{x\in C} f(x)$ to denote the minimum value of the set $\{f(x) : x \in C\}$. To be mathematically more precise, we should use $\inf_{x\in C} f(x)$ whenever the minimum is not attained. However, in the context of this book, the distinction between infimum and minimum is often of little interest. Hence, to simplify the presentation, we sometimes use the $\min$ notation even when $\inf$ is more appropriate. An analogous remark applies to $\max$ versus $\sup$.
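As a concrete check of these definitions (our own example): for $f(x) = 3x^2 + 5$ we have $f = O(x^2)$, since choosing $x_0 = 1$ and $\alpha = 8$ gives

$$3x^2 + 5 \le 3x^2 + 5x^2 = 8x^2 \quad \text{for all } x > 1,$$

and also $f = o(x^3)$, since for every $\alpha > 0$ the inequality $3x^2 + 5 \le \alpha x^3$ holds for all sufficiently large $x$.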

Table 1.1 Summary of notation

| Symbol | Meaning |
| --- | --- |
| $\mathbb{R}$ | the set of real numbers |
| $\mathbb{R}^d$ | the set of $d$-dimensional vectors over $\mathbb{R}$ |
| $\mathbb{R}_+$ | the set of non-negative real numbers |
| $\mathbb{N}$ | the set of natural numbers |
| $O, o, \Theta, \omega, \Omega, \tilde{O}$ | asymptotic notation (see text) |
| $\mathbb{1}_{[\text{Boolean expression}]}$ | indicator function (equals $1$ if the expression is true and $0$ otherwise) |
| $[a]_+$ | $=\max\{0,a\}$ |
| $[n]$ | the set $\{1,\dots,n\}$ (for $n\in\mathbb{N}$) |
| $\mathbf{x},\mathbf{v},\mathbf{w}$ | (column) vectors |
| $x_i,v_i,w_i$ | the $i$'th element of a vector |
| $\langle\mathbf{x},\mathbf{v}\rangle$ | $=\sum_{i=1}^{d}x_i v_i$ (inner product) |
| $\lVert\mathbf{x}\rVert_2$ or $\lVert\mathbf{x}\rVert$ | $=\sqrt{\langle\mathbf{x},\mathbf{x}\rangle}$ (the $\ell_2$ norm of $\mathbf{x}$) |
| $\lVert\mathbf{x}\rVert_1$ | $=\sum_{i=1}^{d}\lvert x_i\rvert$ (the $\ell_1$ norm of $\mathbf{x}$) |
| $\lVert\mathbf{x}\rVert_\infty$ | $=\max_i\lvert x_i\rvert$ (the $\ell_\infty$ norm of $\mathbf{x}$) |
| $\lVert\mathbf{x}\rVert_0$ | the number of nonzero elements of $\mathbf{x}$ |
| $A\in\mathbb{R}^{d,k}$ | a $d\times k$ matrix over $\mathbb{R}$ |
| $A^\top$ | the transpose of $A$ |
| $A_{i,j}$ | the $(i,j)$ element of $A$ |
| $\mathbf{x}\mathbf{x}^\top$ | the $d\times d$ matrix $A$ such that $A_{i,j}=x_i x_j$ (where $\mathbf{x}\in\mathbb{R}^d$) |
| $\mathbf{x}_1,\dots,\mathbf{x}_m$ | a sequence of $m$ vectors |
| $x_{i,j}$ | the $j$'th element of the $i$'th vector in the sequence |
| $\mathbf{w}^{(1)},\dots,\mathbf{w}^{(T)}$ | the values of a vector $\mathbf{w}$ during an iterative algorithm |
| $w_i^{(t)}$ | the $i$'th element of the vector $\mathbf{w}^{(t)}$ |
| $\mathcal{X}$ | instances domain (a set) |
| $\mathcal{Y}$ | labels domain (a set) |
| $Z$ | examples domain (a set) |
| $\mathcal{H}$ | hypothesis class (a set) |
| $\ell:\mathcal{H}\times Z\to\mathbb{R}_+$ | loss function |
| $\mathcal{D}$ | a distribution over some set (usually over $Z$ or over $\mathcal{X}$) |
| $\mathcal{D}(A)$ | the probability of a set $A\subseteq Z$ according to $\mathcal{D}$ |
| $z\sim\mathcal{D}$ | sampling $z$ according to $\mathcal{D}$ |
| $S=z_1,\dots,z_m$ | a sequence of $m$ examples |
| $S\sim\mathcal{D}^m$ | sampling $S=z_1,\dots,z_m$ i.i.d. according to $\mathcal{D}$ |
| $\mathbb{P},\mathbb{E}$ | probability and expectation of a random variable |
| $\mathbb{P}_{z\sim\mathcal{D}}[f(z)]$ | $=\mathcal{D}(\{z:f(z)=\text{true}\})$ for $f:Z\to\{\text{true},\text{false}\}$ |
| $\mathbb{E}_{z\sim\mathcal{D}}[f(z)]$ | expectation of the random variable $f:Z\to\mathbb{R}$ |
| $N(\boldsymbol{\mu},C)$ | Gaussian distribution with expectation $\boldsymbol{\mu}$ and covariance $C$ |
| $f'(x)$ | the derivative of a function $f:\mathbb{R}\to\mathbb{R}$ at $x$ |
| $f''(x)$ | the second derivative of a function $f:\mathbb{R}\to\mathbb{R}$ at $x$ |
| $\frac{\partial f(\mathbf{w})}{\partial w_i}$ | the partial derivative of $f:\mathbb{R}^d\to\mathbb{R}$ at $\mathbf{w}$ with respect to $w_i$ |
| $\nabla f(\mathbf{w})$ | the gradient of $f:\mathbb{R}^d\to\mathbb{R}$ at $\mathbf{w}$ |
| $\partial f(\mathbf{w})$ | the differential set of $f:\mathbb{R}^d\to\mathbb{R}$ at $\mathbf{w}$ |
| $\min_{x\in C}f(x)$ | $=\min\{f(x):x\in C\}$ (minimal value of $f$ over $C$) |
| $\max_{x\in C}f(x)$ | $=\max\{f(x):x\in C\}$ (maximal value of $f$ over $C$) |
| $\operatorname{argmin}_{x\in C}f(x)$ | the set $\{x\in C:f(x)=\min_{z\in C}f(z)\}$ |
| $\operatorname{argmax}_{x\in C}f(x)$ | the set $\{x\in C:f(x)=\max_{z\in C}f(z)\}$ |
| $\log$ | the natural logarithm |