Original link: tecdat.cn/?p=5354

 

The curse of dimensionality is the phenomenon in which each additional dimension in a data set requires exponentially more data to produce a representative sample of that data set. To combat the curse of dimensionality, many linear and nonlinear dimensionality reduction techniques have been developed. These techniques aim to reduce the number of dimensions (variables) in a data set through feature selection or feature extraction without significant loss of information. Feature extraction transforms the original data set into a new data set with fewer dimensions. Two well-known and closely related feature extraction techniques are principal component analysis (PCA) and the self-organizing map (SOM). One can think of dimensionality reduction as building an aqueduct system to channel the river of data.

 

Principal Component Analysis (PCA)

Principal component analysis (PCA) is a statistical algorithm that transforms a set of possibly correlated variables into a set of uncorrelated linear combinations of those variables, called principal components. In short, a principal component, Y_j, is a linear combination of the variables in our data set, X, where the weights, e_j ∈ R^m, are eigenvectors of the covariance matrix or correlation matrix of the data set.

The first principal component is the line that minimizes the sum of squared distances to the data points; it is the least-squares approximation of the data set by a single line. The first principal component therefore explains the largest share of the variation in the data set. The residuals are then extracted from the data set and the next principal component is calculated, so each successive component explains less variance than the one before it. In this way the m variables of X are reduced to k principal components, with k < m. There are some challenges when using PCA. First, the algorithm is sensitive to the scale of the variables in the data set, so it is recommended to standardize the variables, which amounts to mean-centering X and working with the correlation matrix rather than the covariance matrix. Another challenge is that PCA is linear in nature; nonlinear adaptations of PCA include nonlinear PCA and kernel PCA.
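To make the recipe above concrete, here is a minimal Python sketch (not part of the original article) that computes principal components by eigendecomposition of the correlation matrix of a small synthetic data set; the variable names and random data are illustrative assumptions.

```python
import numpy as np

# Illustrative synthetic data set: n = 200 observations of m = 5 variables
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))
X[:, 1] = 0.8 * X[:, 0] + 0.2 * X[:, 1]   # introduce some correlation between variables

# Standardize (mean-center and scale), so PCA is based on the correlation matrix
Z = (X - X.mean(axis=0)) / X.std(axis=0)

# Eigendecomposition of the correlation matrix
R = np.corrcoef(Z, rowvar=False)
eigvals, eigvecs = np.linalg.eigh(R)       # eigh: symmetric matrix, ascending eigenvalues
order = np.argsort(eigvals)[::-1]          # sort components by explained variance
eigvals, eigvecs = eigvals[order], eigvecs[:, order]

# Keep the first k components and project the data onto them
k = 2
Y = Z @ eigvecs[:, :k]                     # scores: the new, uncorrelated variables
explained = eigvals[:k] / eigvals.sum()
print("variance explained by first", k, "components:", explained.round(3))
```

The same result can be obtained with scikit-learn's PCA applied to standardized data; the explicit eigendecomposition is shown here only to mirror the description above.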

Self-Organizing Maps (SOM)

The self-organizing map (SOM) was originally invented by Kohonen in the early 1980s and is sometimes referred to as a Kohonen network. A SOM is a multidimensional scaling technique that builds an approximation of the probability density function of an underlying data set, X, while also preserving the topology of that data set.

This is done by mapping each input vector, x_i, in the data set, X, to a weight vector, w_j (a neuron), in the feature map. Preserving the topology simply means that if two input vectors are close together in X, then the neurons to which those input vectors map will also be close together in the feature map. This is the defining characteristic of a SOM.

 

If the number of neurons in the SOM is less than the number of patterns in the data set, then we reduce the dimension of the data set… but not the dimension of the input or weight vectors. The kind of dimensionality reduction performed by a SOM is therefore different from the kind performed by PCA; a SOM is actually closer in spirit to clustering algorithms such as k-means.

However, a SOM differs from clustering in that a clustering of the data set will (generally) retain the probability density function of the data set but not its topology. This makes SOMs especially useful for visualization: by defining a function that converts a given weight vector to a color, we can visualize the topology, similarity, and probability density function of the underlying data set in a lower dimension (usually two, because of the grid).
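As a rough illustration of the training loop (not from the original article), the following minimal sketch implements a small SOM in plain NumPy: each input is assigned to its best-matching unit, and that unit and its grid neighbors are nudged toward the input, which is what preserves the topology. The grid size, learning schedule, and synthetic data are assumptions made for the example.

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.random((500, 3))                        # synthetic data: 500 points in 3-D (e.g. RGB colors)

rows, cols, dim = 10, 10, X.shape[1]            # 10 x 10 grid of neurons
W = rng.random((rows, cols, dim))               # one weight vector per neuron
grid = np.stack(np.meshgrid(np.arange(rows), np.arange(cols), indexing="ij"), axis=-1)

n_iter, lr0, sigma0 = 2000, 0.5, max(rows, cols) / 2

for t in range(n_iter):
    x = X[rng.integers(len(X))]                 # pick a random input vector
    lr = lr0 * np.exp(-t / n_iter)              # decaying learning rate
    sigma = sigma0 * np.exp(-t / n_iter)        # decaying neighborhood radius

    # Best-matching unit: the neuron whose weight vector is closest to x
    dists = np.linalg.norm(W - x, axis=-1)
    bmu = np.unravel_index(np.argmin(dists), dists.shape)

    # Neighborhood function on the 2-D grid (this is what preserves the topology)
    grid_dist = np.linalg.norm(grid - np.array(bmu), axis=-1)
    h = np.exp(-(grid_dist ** 2) / (2 * sigma ** 2))

    # Move the BMU and its neighbors toward the input
    W += lr * h[..., None] * (x - W)

# Since each weight vector here is 3-D, the trained map can be displayed directly
# as an image of colors, e.g. plt.imshow(W) with matplotlib.
```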

An Application of PCA

“Weka is a collection of machine learning algorithms for data mining tasks that can be applied directly to data sets or called from your own Java code. Weka includes tools for data preprocessing, classification, regression, clustering, association rules, and visualization, and is also well suited to developing new machine learning solutions.” [Source]

One feature in WEKA is a tool for attribute selection and dimensionality reduction. One of the supported algorithms is principal component analysis. This example applies PCA to a CSV file containing 12 correlated technical indicators. Redundancy among variables is one of the data qualities that leads to overfitting, especially in machine learning models.

 

Figure: correlation matrix of the technical indicators

If we load the file into WEKA, we see some basic descriptive statistics of the data set, including a histogram of each variable (technical indicator) as well as its minimum, maximum, sample mean, and sample standard deviation.
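For readers working outside the WEKA GUI, a rough Python equivalent of this inspection step might look like the snippet below; the file name is a hypothetical stand-in for the article's data set.

```python
import pandas as pd

# Hypothetical file name; the article's 12-indicator data set is not included here
df = pd.read_csv("technical_indicators.csv")

# Per-variable count, mean, standard deviation, min, max, and quartiles
print(df.describe())

# Histograms of each variable, similar to WEKA's preprocessing panel
df.hist(figsize=(12, 8))
```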

 

In the Select attributes tab, choose the PrincipalComponents attribute evaluator; WEKA will automatically select the Ranker search method.

 

After clicking Start, WEKA extracts the first five principal components. The correlation coefficients between the first three principal components and the closing price are 0.6224, 0.3660, and 0.1643 respectively. Knowing how PCA works, these three components are uncorrelated with one another and should, in theory, each carry different information about the movement of the index.
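The same analysis can be sketched outside WEKA. The snippet below (an illustration, not the article's actual workflow) standardizes hypothetical indicator columns, extracts the first five principal components with scikit-learn, and reports each component's correlation with the closing price; the file name and the "close" column name are assumptions.

```python
import pandas as pd
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA

# Hypothetical file: 12 technical indicator columns plus a "close" column
df = pd.read_csv("technical_indicators.csv")
indicators = df.drop(columns=["close"])

# Standardize so PCA effectively works on the correlation matrix
Z = StandardScaler().fit_transform(indicators)

# Extract the first five principal components, as in the WEKA example
pca = PCA(n_components=5)
scores = pca.fit_transform(Z)

# Correlation of each component with the closing price
for i in range(scores.shape[1]):
    r = pd.Series(scores[:, i]).corr(df["close"])
    print(f"PC{i + 1}: explained variance ratio = {pca.explained_variance_ratio_[i]:.3f}, "
          f"correlation with close = {r:.4f}")
```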