Deep learning is becoming an increasingly important part of enterprise computing because artificial neural networks are highly effective in areas such as natural language processing, image recognition, fraud detection, and recommendation systems. The huge increase in computing power and the availability of vast amounts of data over the past five to ten years have spurred interest in applying deep learning algorithms to these problems.

At the same time, enterprise business systems are largely built on SQL infrastructure, with heavy investment in software and employee training. The major innovations in deep learning, however, occur outside the SQL world, so enterprises that adopt deep learning algorithms have typically had to build separate deep learning infrastructure. Building a deep learning system outside the traditional SQL architecture means accepting not only additional cost and effort but also the risk of creating new data silos. In addition, moving large data sets between systems is inefficient. If enterprises can execute deep learning algorithms in an MPP relational database using popular deep learning frameworks such as Keras and TensorFlow, they can leverage their existing investments in SQL to make deep learning easier and more approachable.

Another consideration is that many of today's data science problems require applying multiple models. Data scientists often spend a great deal of time on data analysis and feature engineering, and they frequently use several methods to attack a problem, so the result of an analysis is usually a combination of multiple models. In that situation, it is more efficient to run all of the computations on the same computing engine than to run them separately on different systems and then combine the results. A set of machine learning and analysis functions built into the database allows these computations to execute where the data lives, reducing or even eliminating data movement between computing environments and greatly improving computational efficiency.

GPU-accelerated deep learning on Greenplum

The following figure (Figure 1) shows the architecture of Greenplum plus GPUs. Standard deep learning libraries such as Keras [1] and TensorFlow [2] are deployed on the segment nodes of the Greenplum cluster, and GPUs are installed on the segment nodes as well. The segments on each node share that node's GPU computing resources.

Figure 1: Greenplum architecture for deep learning

This architecture is designed to eliminate transfer delays between segments and GPUs. Under this architecture, each segment processes only its local data to produce a partial result, and Apache MADlib [3], the open source machine learning library integrated with Greenplum, is responsible for merging the models from each segment to produce the final model. This approach takes full advantage of MPP's horizontal scaling.

Programming with MADlib

Using MADlib is as simple as calling an Apache MADlib function from SQL. In the following example, we use the deep learning functions provided by MADlib to train a model on the CIFAR-10 [4] image dataset. The specific SQL is as follows:
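A minimal sketch of such a call, using MADlib's Keras integration function madlib_keras_fit; the table names cifar10_train_packed, cifar10_model, and model_arch_library are illustrative, and we assume the CIFAR-10 images have already been loaded and preprocessed for training, and that the JSON architecture of the CNN has been registered in model_arch_library beforehand (for example with madlib.load_keras_model):

    SELECT madlib.madlib_keras_fit(
        'cifar10_train_packed',    -- source table with the preprocessed training images
        'cifar10_model',           -- output table that will hold the trained model
        'model_arch_library',      -- table holding the JSON model architectures
        1,                         -- id of the CNN architecture to train
        $$ loss='categorical_crossentropy', optimizer='adam', metrics=['accuracy'] $$,
        $$ batch_size=256, epochs=1 $$,
        10,                        -- number of training iterations
        1                          -- gpus_per_host: GPUs used per segment node (0 = CPU only)
    );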

After this SQL runs, the trained model is stored in the specified output table, while the table model_arch_library holds the JSON representation of the convolutional neural network (CNN) being trained. A CNN is a special kind of neural network that is very good at image classification [5]. In the SQL above, there is a useful parameter, gpus_per_host (the number of GPUs per node), which specifies how many GPUs on each segment node are used to train the model. Specifying 0 means training on the CPU instead of the GPU, which makes it easy to debug shallow neural networks on smaller data sets; a trial run can even be done on PostgreSQL. Once the trial run passes, the workload can be moved to a Greenplum cluster equipped with expensive GPUs to train deep neural networks on the full data set.

After model training completes, we can use the trained model to classify new images. The specific SQL is as follows:
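Again a minimal sketch, assuming MADlib's madlib_keras_predict function and the same illustrative table names as above; the id column and image column of the test table are likewise assumptions:

    SELECT madlib.madlib_keras_predict(
        'cifar10_model',           -- model table produced by madlib_keras_fit
        'cifar10_test',            -- table of new images to classify
        'id',                      -- id column of the test table
        'x',                       -- column holding the image data
        'cifar10_predictions',     -- output table for the predicted classes
        'response',                -- return the predicted class label
        0                          -- gpus_per_host: 0 = run inference on the CPU
    );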

Performance and scalability

Modern GPUs have high memory bandwidth and as many as 200 times more processing units per chip than CPUs, because they are optimized for data-parallel computations such as matrix operations. CPUs, by contrast, are designed to be more versatile and to perform a wider variety of tasks. The performance improvement gained by training deep neural networks on GPUs is therefore well known. Figure 2 shows the performance difference for a simple deep CNN [6] between a conventional CPU-only Greenplum cluster and a GPU-accelerated Greenplum cluster. In this test, we used a small Greenplum cluster (with 4 segments) to measure training time on the CIFAR-10 dataset, as shown in the figure below:

Figure 2: Training performance of Greenplum database GPU versus CPU *

Using the CPU alone, the Greenplum cluster required more than 30 minutes of training to reach 75% accuracy on the test set, while the GPU-accelerated Greenplum cluster reached 75% accuracy in less than 15 minutes. CIFAR-10 images are only 32×32 RGB, so the GPU speedup is smaller than it is for higher-resolution images: for the Places dataset with 256×256 RGB images [7], we found that training the model with GPU acceleration was 6 times faster than using the CPU alone [8].

The key benefit of GPU-accelerated model training is reduced training time, which means data scientists can iterate on models faster and deploy newly trained models into production more quickly. In fraud detection, for example, the immediate benefit of faster training and deployment of new models is a reduction in financial losses.

Inference refers to using a trained model to classify new data. An MPP database like Greenplum is well suited to batch inference: throughput increases linearly with the size of the database cluster. For example, using the CNN model trained above, Table 1 shows the time required to perform batch inference on 50,000 new 32×32 RGB images.

Table 1: Batch classification test results on the Greenplum database cluster

Future work

As part of the Apache MADlib project, the MADlib community plans to add new deep learning features with each release. For example, a common data science workflow is hyperparameter selection and tuning. Tuning covers not only the model's hyperparameters but also its structure, such as the number and composition of network layers. This typically involves training dozens to hundreds of model combinations to find the one with the best accuracy/training-cost profile. Under such a heavy training load, an MPP database like Greenplum, with its parallel computing capability, can greatly improve the efficiency of model training, as sketched below.
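Until dedicated model-selection functions land in MADlib, one plausible way to run such a sweep is simply to call the fit function once per candidate architecture registered in the architecture table and compare the metrics each run records in its summary table. A minimal sketch, reusing the illustrative table names from the training example above (the architecture ids 1 and 2 are assumptions):

    -- Train two candidate architectures into separate model tables.
    SELECT madlib.madlib_keras_fit(
        'cifar10_train_packed', 'cifar10_model_arch1', 'model_arch_library', 1,
        $$ loss='categorical_crossentropy', optimizer='adam', metrics=['accuracy'] $$,
        $$ batch_size=256, epochs=1 $$,
        10, 1);

    SELECT madlib.madlib_keras_fit(
        'cifar10_train_packed', 'cifar10_model_arch2', 'model_arch_library', 2,
        $$ loss='categorical_crossentropy', optimizer='adam', metrics=['accuracy'] $$,
        $$ batch_size=256, epochs=1 $$,
        10, 1);

    -- Each run writes its training metrics to <model table>_summary,
    -- so the best candidate can be picked by comparing those tables.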

The test environment used in this article

Base Platform: Google Cloud Platform

Greenplum version: Greenplum 5

Segment Node configuration: 32 vCPUs, 150 GB memory

Segment Node GPU configuration: NVIDIA Tesla P100 GPU


References

[1]  keras.io/

[2]  www.tensorflow.org/

[3]  madlib.apache.org/

[4] CIFAR-10 dataset, www.cs.toronto.edu/~kriz/cifar…

[5] Le Cun, Denker, Henderson, Howard, Hubbard and Jackel, Handwritten digit recognition with a back-propagation network, in: Proceedings of the Advances in Neural Information Processing Systems (NIPS), 1989, pp. 396–404.