On October 21, 2017, Gao Yang, co-founder of Shenqi Wisdom, delivered a speech titled “Large Data Warehouse Practice Based on Greenplum and PostgreSQL” at the PostgreSQL 2017 China Technology Conference. As the exclusive video partner, IT Dakashuo (WeChat ID: Itdakashuo) is authorized to release the video with the review and approval of the host and the speaker.

Word count: 4263 | Reading time: 11 minutes

Video replay and slides of the talk:
t.cn/RgcE3V6


Abstract

In the era of big data, has traditional data warehouse technology become obsolete? We will discuss how to design a very large data warehouse platform on the basis of the traditional data warehouse, and detail the position and practice of Greenplum and PostgreSQL in large data warehouses.

Are traditional data warehouses obsolete?

Traditional data warehouse architecture

A traditional data warehouse is composed of the source systems, the ODS, the EDW, and the data marts. The source systems are the business and production systems, the ODS is the operational data store, the EDW is the enterprise data warehouse, and the data marts are department-level subsets of the warehouse.

The source system

Production systems, financial systems, human resources systems, and 12306’s booking system are all source systems, and the primary function of a source system is to generate data. Traditional industries mostly store this data in Oracle and DB2, while the Internet industry mostly chooses open source databases.

ODS

ODS stands for Operational Data Store; in projects it is also called the intermediate or staging library. Before data moves from the business systems into the real data warehouse, it passes through this intermediate layer, the ODS.

As an intermediate layer, ODS has the following characteristics.

  • Integrates heterogeneous data, such as business data held in MySQL or Oracle, into the data warehouse through the intermediate library.

  • Offloads part of the detailed-record query load from the business systems; running such queries directly against the production Oracle or DB2 databases has certain limitations.

  • Standardizes data encodings during the transformation.

  • Data in the DW is static, whereas data in the ODS is dynamic and updatable.

  • The data contents differ: the ODS stores current or recent data, while the DW stores historical data.

  • The ODS holds a relatively small volume of data, while the DW holds a large one.

The above is the traditional definition of ODS, but our understanding of the ODS is no longer limited to it. The ODS now stores not only text but also pictures and videos; it has become a general middle tier, and the technology involved is no longer just relational databases but also systems such as NoSQL stores or Redis. When the volume of front-end data collection is very large, a relational database may not withstand the write pressure, but with Redis you can cache the data in memory and then flush it to the relational library in batches.
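A minimal sketch of that buffering pattern in Python, assuming a local Redis instance, the redis and psycopg2 client libraries, and a hypothetical sensor_readings table:

```python
import json

import psycopg2
import redis
from psycopg2.extras import execute_values

# Hypothetical names: the buffer key and target table are illustrative only.
BUFFER_KEY = "ods:sensor_buffer"
BATCH_SIZE = 1000

r = redis.Redis(host="localhost", port=6379)
conn = psycopg2.connect("dbname=ods user=etl")

def collect(reading: dict) -> None:
    """Front end: push each incoming reading into the in-memory buffer."""
    r.rpush(BUFFER_KEY, json.dumps(reading))

def flush_batch() -> None:
    """Periodic job: drain up to BATCH_SIZE readings into the relational library."""
    rows = []
    for _ in range(BATCH_SIZE):
        raw = r.lpop(BUFFER_KEY)
        if raw is None:
            break
        item = json.loads(raw)
        rows.append((item["sensor_id"], item["ts"], item["value"]))
    if rows:
        with conn, conn.cursor() as cur:
            execute_values(
                cur,
                "INSERT INTO sensor_readings (sensor_id, ts, value) VALUES %s",
                rows,
            )
```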

Introducing EDW

EDW stands for enterprise data warehouse. Here are some of its features.

  • Subject oriented: operational databases organize data around processing tasks, with each business system kept separate, whereas data in the warehouse is organized by subject domain, for example parties, agreements, institutions, finance, events, and products.

  • Integrated: data in the warehouse is extracted and cleaned from the originally scattered databases, then systematically processed, summarized, and organized. Inconsistencies in the source data must be eliminated so that the warehouse holds consistent, enterprise-wide information.

  • Relatively stable: warehouse data mainly serves enterprise decision analysis, so data manipulation is mostly querying. Once data enters the warehouse it is retained for a long time; a data warehouse typically sees a large number of queries but few modifications or deletions, and usually only needs periodic loads and refreshes.

  • Reflects historical change: warehouse data usually contains historical information, recording each stage of the enterprise from some point in the past (for example, when the warehouse went live) to the present. With this information, the enterprise’s development and future trends can be quantitatively analyzed and predicted.

Whether in the traditional setting or in the big data era, the functions an EDW provides are much the same: ad hoc queries, fixed reports, and data mining, with big-data-scale warehouses leaning more toward data mining.

Introducing DM

A data mart (DM) is a small, department-level data warehouse that serves department-level business and covers only a specific subject. It is an analytical environment built to meet the needs of a specific (typically departmental) group of users; the investment is smaller, and the focus is on building complex business rules into the data to support powerful analytics.

Big data is conceptually the opposite of a data mart: a mart derives a subset from a superset, whereas big data gathers business data into one collection and mines it for correlations and value. We therefore believe the data mart concept is outdated in the big data era, which is why few people have discussed data marts in recent years.

The figure above shows what we consider the correct architecture, with the final data mart layer replaced by a DV (visualization/result library). The results computed in the EDW are ultimately stored in the DV and then displayed or visualized from there.

Are PG/GP obsolete?

As noted above, the traditional data warehouse has many limitations. Next we will reorganize and redesign it so that it can adapt to the changes of the big data era.

Source system design



The picture above shows SCADA (Supervisory Control And Data Acquisition), a system with tens of thousands of sensors that can generate a huge amount of data at any given moment. In the past, SCADA stored the collected data in memory, but because the volume was too large and no value had yet been found in the data, it was purged periodically.

In recent years, as big data has developed, the value of this data has gradually become apparent, so it now needs to be stored on the back end. For the database, many people prefer PG, mainly because SCADA systems are sold bundled with a database: bundling MySQL carries certain licensing risks, while PG raises no such concerns.

All of this is to say that PG is probably the better choice for source systems.

Intermediate library (ODS) design

Intermediate libraries are often designed as database clusters rather than standalone databases. The interface database shown below is actually an intermediate library: a cluster of multiple machines, each running multiple MySQL or PG instances, so that data is spread across different machines and the interface library itself forms a cluster. This is not a cluster in the traditional sense; the intermediate library should be a loose MySQL or PG cluster, and when the data volume demands it, a Redis cluster is also an option.
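A minimal sketch of that loose-cluster idea, assuming a hypothetical list of independent PG instances and deterministic hash routing by source key:

```python
import hashlib

import psycopg2

# Hypothetical DSNs: each entry is an independent PG instance in the loose cluster.
NODES = [
    "host=ods1 dbname=ods user=etl",
    "host=ods2 dbname=ods user=etl",
    "host=ods3 dbname=ods user=etl",
]
conns = [psycopg2.connect(dsn) for dsn in NODES]

def route(source_key: str):
    """Hash the source key so the same source always lands on the same node."""
    digest = int(hashlib.md5(source_key.encode()).hexdigest(), 16)
    return conns[digest % len(conns)]

def write(source_key: str, payload: str) -> None:
    conn = route(source_key)
    with conn, conn.cursor() as cur:
        # Illustrative table; real schemas vary per source system.
        cur.execute(
            "INSERT INTO raw_events (source_key, payload) VALUES (%s, %s)",
            (source_key, payload),
        )
```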


EDW design

Now that we are talking about data warehouse design, let’s first look back at the traditional approach: the Oracle-based data warehouse.

The traditional data warehouse has the following characteristics. First, it uses partitioning to accelerate data access: in Oracle, a large table can be split into several partitions, with each partition placed on a different physical disk. This design does improve performance considerably, but it is not enough in the big data era. Second is application cluster technology: multiple servers at the front end provide compute capacity, with shared storage, and thus shared I/O, at the back end. Architecturally this is a shared-disk design, and once an I/O bottleneck appears the whole application cluster suffers, so this architecture is also unsuitable for data warehouses. Third is Oracle’s Exadata, which is widely used on trading platforms but rarely in data warehousing. Oracle’s BIEE is also hard pressed to keep up from a visualization perspective.
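As an illustration of the partitioning idea in the PG world, a minimal sketch using PostgreSQL 10 declarative partitioning (the fact table and date ranges are hypothetical), issued through psycopg2:

```python
import psycopg2

conn = psycopg2.connect("dbname=edw user=etl")
with conn, conn.cursor() as cur:
    # Hypothetical fact table, range-partitioned by month so queries
    # only scan the partitions their date predicates actually touch.
    cur.execute("""
        CREATE TABLE fact_sales (
            sale_id   bigint,
            sale_date date NOT NULL,
            amount    numeric
        ) PARTITION BY RANGE (sale_date)
    """)
    cur.execute("""
        CREATE TABLE fact_sales_2017_09 PARTITION OF fact_sales
            FOR VALUES FROM ('2017-09-01') TO ('2017-10-01')
    """)
    cur.execute("""
        CREATE TABLE fact_sales_2017_10 PARTITION OF fact_sales
            FOR VALUES FROM ('2017-10-01') TO ('2017-11-01')
    """)
```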

To sum up, it can be said that traditional data warehouse technologies, while not completely obsolete, are slowly fading away.

When we have a lot of data, we face a data warehouse selection problem: Oracle, DB2, the PG ecosystem, or the Hadoop ecosystem. Given that Oracle and DB2 are definitely excluded, the choice is between the PG and Hadoop ecosystems; if the PG ecosystem is chosen, there are two options: PostgreSQL and Greenplum. PostgreSQL is clearly the choice for online transaction systems, and Greenplum for a real data warehouse.

Greenplum architecture

Greenplum is a cluster consisting of control nodes (masters) and data nodes (segment hosts).

The first reason we chose Greenplum is its high performance.

High performance shows up first in large-table distribution: in Greenplum, a large table’s data is evenly distributed across multiple nodes, laying the foundation for parallel execution. Second is parallel execution itself, which in Greenplum covers external-table data loading, queries, index creation and use, statistics collection, table joins, and so on. Third is column storage and data compression: column orientation is more efficient when typical queries fetch only a few of a table’s columns, while row orientation is more efficient when queries touch many columns.
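A minimal sketch of those features in Greenplum DDL (the table and columns are hypothetical); DISTRIBUTED BY and the append-only, column-oriented storage options are standard Greenplum syntax:

```python
import psycopg2

conn = psycopg2.connect("host=gp-master dbname=edw user=etl")
with conn, conn.cursor() as cur:
    # Rows are hash-distributed across segments by sensor_id, so scans,
    # joins, and aggregations on this table run in parallel on every node.
    cur.execute("""
        CREATE TABLE fact_readings (
            sensor_id bigint,
            ts        timestamp,
            value     double precision
        )
        WITH (appendonly=true, orientation=column, compresstype=zlib)
        DISTRIBUTED BY (sensor_id)
    """)
```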

The second reason for choosing Greenplum is the maturity of the product. As mentioned earlier, Greenplum consists of multiple nodes, each of which is actually a PostgreSQL instance. Development of PostgreSQL began in 1986; the first version was built in 1987 and released to the public in 1988. After so many years of development, PG is a very mature product.

The third reason is the disaster recovery mechanism. Greenplum can have two master nodes: when one of them goes down, the other continues to accept connections, and the catalog and transaction logs of the two nodes are synchronized in real time.

The fourth reason is linear scalability. Greenplum adopts the common MPP (massively parallel processing) architecture, in which adding nodes linearly increases the system’s storage and processing capacity. Greenplum makes it easy to add nodes and redistribute data in a short time. This linear scalability provides a technical guarantee for future expansion of the data analysis system, and users can grow capacity and performance as implementation needs dictate.

The final reason is the familiar development environment. Greenplum is based on PostgreSQL and is syntactically almost identical to it, allowing traditional Java developers to transition smoothly to Greenplum.

Introducing Hadoop

Greenplum handles traditional SQL queries with ease, but that is clearly not enough for machine learning. Although Greenplum’s MADlib supports machine learning, real-world cases are not common. Therefore, the Hadoop ecosystem should be brought into the EDW to meet machine learning needs.

We introduce the Hadoop ecosystem with Mesos and YARN as the resource management layer, HDFS as the distributed storage layer, and Spark Core as the processing engine layer. For machine learning there are then two options: MapReduce plus Mahout, or Spark Core plus MLlib. MapReduce does not perform as well, so we generally prefer the second.

Ultimately, data enters the Hadoop ecosystem via Greenplum, and where it is stored is then decided by development capability and application needs. Greenplum serves both as the data source for machine learning and for SQL-based queries once the data is in Hadoop.
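A minimal PySpark sketch of that flow, assuming Greenplum is reachable through the standard PostgreSQL JDBC driver and reusing the hypothetical fact_readings table from above; it pulls the table over JDBC and fits an MLlib k-means model:

```python
from pyspark.ml.clustering import KMeans
from pyspark.ml.feature import VectorAssembler
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("gp-mllib-sketch").getOrCreate()

# Read the (hypothetical) Greenplum table through the PostgreSQL JDBC driver.
df = (
    spark.read.format("jdbc")
    .option("url", "jdbc:postgresql://gp-master:5432/edw")
    .option("dbtable", "fact_readings")
    .option("user", "etl")
    .option("password", "secret")
    .load()
)

# Assemble numeric columns (an illustrative choice) into a feature vector
# and cluster the readings with MLlib's k-means.
features = VectorAssembler(
    inputCols=["sensor_id", "value"], outputCol="features"
).transform(df)
model = KMeans(k=3, featuresCol="features").fit(features)
model.transform(features).show(5)
```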

Another point worth noting: the computed results of the data warehouse or big data platform are generally stored in PG, because PG handles large tables better than MySQL.
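A minimal sketch of landing computed results in the PG result library (the DV layer); the table and sample rows are hypothetical:

```python
import psycopg2
from psycopg2.extras import execute_values

# Hypothetical aggregates computed upstream in the EDW.
results = [("2017-09", 1284.5), ("2017-10", 1391.2)]

conn = psycopg2.connect("host=dv dbname=results user=etl")
with conn, conn.cursor() as cur:
    cur.execute("""
        CREATE TABLE IF NOT EXISTS monthly_summary (
            month text PRIMARY KEY,
            total numeric
        )
    """)
    # Upsert the result rows; visualization tools read this table directly.
    execute_values(
        cur,
        """INSERT INTO monthly_summary (month, total) VALUES %s
           ON CONFLICT (month) DO UPDATE SET total = EXCLUDED.total""",
        results,
    )
```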

Conclusion

Finally, let us walk the whole architecture in reverse: PG for the bottom DV layer, Greenplum plus Hadoop for the EDW, and PG again for the ODS layer, in order to avoid too many heterogeneous databases in the project and to make developers’ work easier.