As I understand it, the middle platform is about the sinking of capability: data processing capability sinks into a processing platform, and data processing results sink into data assets. So can data governance sink as well? And what would sink out of it?

— Lu Shanwei, head of the CreditEase Data Middle Platform

Source: transcript of a talk by Lu Shanwei, head of the CreditEase Data Middle Platform, at the "Digital Middle Platform Innovation" salon hosted by the Yiou Industrial Internet Channel


On October 24, the Yiou Industrial Internet Channel held its "Digital Middle Platform Innovation" salon at InnoSpace in Shanghai. The speakers included Luo Yiqun, e-commerce technology center director; Hang Yufeng, information technology director of Aiways Auto; Lu Shanwei, head of the CreditEase Data Middle Platform; and Wang Jian, chief consultant at ThoughtWorks and author of a GeekTime column on middle platforms. Miao Guocheng, Yiou's East China director, Huang Zhilei, deputy editor of the Yiou Industrial Internet Channel, and Gong Chenxia of the Yiou Industrial Internet Channel also participated, and the group held an in-depth discussion on the topic of the digital middle platform.

Founded in 2006, CreditEase is a fintech company engaged in inclusive finance and wealth management. In 2018, building on four open-source platforms and middleware technologies, CreditEase began developing its data middle platform and promoting its use internally. At present, CreditEase's middle platform department is divided into two major sections: the data middle platform and the AI middle platform.

The following are the main points of Lu Shanwei's speech:

1. The guiding philosophy of the CreditEase data middle platform: unified construction and agile development

2. From open source to the middle platform: the key word is self-service

3. Data governance: does it depend more on human governance or on autonomy?

The following is an edited transcript of the speech, compiled by the Yiou Industrial Internet Channel for reference.

Good afternoon, everyone. My name is Lu Shanwei, and I am from CreditEase. I just listened to Mr. Luo introduce the concept and application of the middle platform from a strategic perspective, and I benefited a great deal. My sharing will differ in two ways. First, mine carries the qualifier "data": Mr. Luo covered the business middle platform, the organizational middle platform, and the technology middle platform, whereas I work on data, so I will only talk about the data middle platform. Second, I personally come from a purely technical background, so what I share will be more concrete.

The topic I am sharing today is the trilogy of CreditEase's data middle platform construction. The content follows the timeline of its development: "Agile Messenger", the ABD era (2015-2018); "Self-Service Adventure", the ADX era (2018-2019); and "The Return of Autonomy", the ADG era (2019-).

Before joining CreditEase in 2015, I worked at eBay's R&D center in Zhangjiang, Shanghai. My main direction there was big data architecture and development, in the big data group for paid advertising. Since my personal interest leans toward platform technology, I had always wanted to build framework- and platform-level things in the technical field.

1. The guiding philosophy of the CreditEase data middle platform: unified construction and agile development

In 2015, CreditEase approached me, told me the company had no data platform, and said they hoped I could lead its construction, so I joined CreditEase.

Actually, saying there was "no data platform" is not accurate; it would be more precise to say there was "no unified data platform", because many business lines had their own so-called data platforms. Some were done fairly well, while others were purely customized and not really platforms at all. The company was very large and much of this had been built bottom-up, unlike a bank, where such efforts are driven from the top down. At that time there was no concept of a data middle platform yet, and facing the task of building a good data platform I felt somewhat at a loss and quite challenged, so I started doing a lot of internal research and interviews, and summarized the situation at the time in several diagrams.

The top-left diagram represents the silos between business lines, running from the front end to the business development side, to the data side, and even down to the database. Normally a good data middle platform needs the cooperation of a good business middle platform; with business lines so heavily siloed, achieving integration at the data layer is very difficult.

The lower-left corner expresses the two "slow" problems many enterprises faced in 2015: data arrived slowly, and requirements were delivered slowly.

On the one hand, T+1 batch processing was still the mainstream at that time, and many enterprises did not have a complete streaming platform, unlike today, when there are many mature options. Generally speaking, data needs could only be met with T+1 latency.

On the other hand, because there was no good self-service platform for everyone to use, requirements were handed to a dedicated BI team, which took them on one by one. When there were too many requirements, the team had to schedule them, and it could take one to two months or more to respond to and handle a request.

Some business departments were more capable, had many big data engineers, and used a wide range of technologies, such as MongoDB, ES, HBase, Cassandra, Phoenix, Presto, Spark, Hive, Impala, and so on, with no unified standard for technology selection. The company's demands were also diverse, as shown on the upper right: self-service queries, 360-degree panoramic views, multidimensional analysis, real-time processing, data lakes, and more. All this made the big data architecture increasingly complex and bloated, and harder and harder to build and maintain. Add the cross-cutting concerns at the bottom of the figure, such as data management, data quality, and data security, and that was the complicated situation we faced at the time.

Facing this situation, I had to think through the whole problem and find a solution. I personally advocate agile development, although agile practice is mostly found in application development. Big data is cumbersome, so how do we make the elephant run? I believed we needed to apply agile thinking to building the data platform. After investigation and reflection, we formed a framework of agile big data thinking, practices, and methodology. More importantly, we needed to implement middleware to drive those agile practices.

Next, we developed four middleware platforms in succession: DBus, Wormhole, Moonbox, and Davinci. Since there were so many technology choices and it was difficult to quickly unify them into one stack, while we still needed unified control at key nodes, the best approach was to use middleware to adapt to the existing choices and thereby simplify the overall architecture.

The figure below shows the whole data processing link from left to right, divided into the data source layer, data integration layer, data bus layer, data processing layer, data storage layer, data service layer, and data application layer. At the data source layer, the variety of technologies is natural; it is what the business needs. At the data storage layer, there are many choices for different purposes, and this cannot be unified quickly either; in fact it is hard to find a single big data store that solves all storage and computation problems, so we had to face the problem of integrating multiple storage and compute engines.

On the application side, requirements and scenarios are likewise hard to consolidate. What can be unified are the data integration layer, data bus layer, data processing layer, and data service layer. After sorting out the whole data link, the result is an "open + unified" architecture: some layers are open and inclusive, while others are unified and closed.

Of course, the cross-cutting topics shown in gray in the figure also need attention and support, but since our strategy at the time was to build the four middleware tools DBus, Wormhole, Moonbox, and Davinci, we did not pay too much attention to them.

These middleware tools are described below:

  • DBus extracts data in real time and connects to multiple databases and logs. It extracts incremental data in real time, also supports full extraction, and maintains an ID system consistent with the incremental data to support idempotent writes downstream.
  • Wormhole is for streaming job development and management. It supports real-time synchronization and on-stream processing logic through configuration and SQL, without writing code. This reflects agility in two ways: first, the middleware implements the common technology once, so it does not have to be redeveloped repeatedly; second, it keeps lowering the cost of delivering data projects, letting implementers focus as much as possible on the business logic itself, so that after simple training they can complete projects on their own. For example, if you want incremental data from Oracle written to MySQL in real time, a simple configuration is enough; if you also need some on-stream processing logic, such as looking up Redis for each incremental record, you only need to write a SQL statement. In addition, because we build middleware rather than reinventing engines, Wormhole sits on top of the mainstream streaming engines Spark and Flink, and users can choose their own compute engine. Since Flink supports CEP, Wormhole also supports CEP rule configuration.
  • Moonbox is a federated computation service over heterogeneous systems. Suppose your data is stored in different places for various reasons, but you want to query across them; you can use Moonbox as a "virtual database". For example, table A is in Oracle, table B is in MongoDB, and table C is in ES: send one complete SQL statement to Moonbox and it automatically computes across the stores and returns the result. At the same time, Moonbox makes effective use of each store's computing strengths by pushing as many operators down as possible, improving overall performance (a sketch of this idea follows this list).
  • Davinci is the visualization platform. It has essentially all the capabilities of a general-purpose visualization platform and supports rich visualization applications and system integration, trying to solve the "last ten kilometers" of big data.
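
To make the Moonbox idea above more concrete, here is a minimal sketch of the "virtual database" pattern, written against plain Spark SQL rather than Moonbox's own API (which is not shown in this talk). The connection strings, table names, and the use of two JDBC sources in place of Oracle/MongoDB/ES are placeholder assumptions; real use would need the relevant drivers and connectors on the classpath.

```scala
// Sketch only: one SQL statement over tables that physically live in different stores.
import org.apache.spark.sql.SparkSession

object FederatedQuerySketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("federated-query").getOrCreate()

    // Table A lives in Oracle (connection details are placeholders).
    spark.read.format("jdbc")
      .option("url", "jdbc:oracle:thin:@//oracle-host:1521/ORCL")
      .option("dbtable", "CRM.CUSTOMER")
      .option("user", "reader").option("password", "secret")
      .load()
      .createOrReplaceTempView("customer")

    // Table B lives in MySQL here, standing in for MongoDB/ES, which need their own connectors.
    spark.read.format("jdbc")
      .option("url", "jdbc:mysql://mysql-host:3306/risk")
      .option("dbtable", "loan_score")
      .option("user", "reader").option("password", "secret")
      .load()
      .createOrReplaceTempView("loan_score")

    // One SQL over both stores; the engine decides what to push down to each source.
    spark.sql(
      """SELECT c.customer_id, c.name, s.score
        |FROM customer c JOIN loan_score s ON c.customer_id = s.customer_id
        |WHERE s.score < 600""".stripMargin
    ).show()
  }
}
```

The point is the user-facing experience: the person writing the query sees one logical database, while the work of splitting the query and pushing operators down to each store is the platform's job.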

What were the effects of this middleware? For example, two or three data-related staff on a business line, who knew the business very well but had no background in big data development, could, after one or two weeks of training, independently and quickly finish end-to-end projects such as real-time data warehouses, real-time reports, and real-time applications. This was unthinkable before: in the past, a real-time project required a team with a big data development background to support it, whereas now even people without an IT background can be trained to do it.

Let’s take a closer look at Wormhole.

In addition to the benefits of configuration- and SQL-based streaming development mentioned above, from the perspective of internal technical implementation, many typical problems of streaming development are also masked by the middleware and handled transparently for users.

  • Idempotent sink: incremental data on the stream is not guaranteed to be strongly ordered, but eventual consistency must be achieved when it lands in the sink. Wormhole has this processing logic built in, so users only need to write the logical SQL on the stream.
  • This problem is well known, and Wormhole has a built-in solution.
  • Multi-Flow support is a feature unique to us. If you have done Spark development, you know that a Spark Streaming program always occupies a fixed amount of memory and runs one job. We think, however, that Spark Streaming should just be a physical resource pipe, and that the logic running on the stream should be decoupled from the physical resources. So we designed and developed the concept of a Flow. A Flow is defined by where the data comes from, where it goes, and what processing logic is applied on the way. The effect of this decoupling is that one Spark Streaming physical pipe can run multiple logical Flows. For example, if a company has 10,000 tables that need to be synchronized to 20,000 targets, the older style of development might require starting 20,000 Spark Streaming jobs; with Flows, one Spark Streaming job with, say, 50 GB of memory can run all 20,000 synchronization Flows inside it, which amounts to pipeline support at the logical layer (a rough sketch of this idea follows this list). This is quite original, and as far as we know we are the only ones doing it.
  • Dynamic directives: this is about operations. We do not want to restart the stream every time the on-stream processing logic changes; changes should be applied online and take effect in real time.
  • Business-time policy: by default, Spark Streaming computes based on processing time. Streaming engines are now mature and support computing on event time inside the engine, but Spark Streaming did not support this at the time, so we added support for it as well.
  • Flow drift: this is also operations-related. Suppose we set up five physical Spark Streaming pipes, each running ten Flows, and one day the incremental data of one business line surges so that one Stream runs short of resources; the Flow drift capability lets that logical Flow drift to another idle Spark Streaming physical pipe. All of this is meant to lower the bar for streaming development and operations and to be as agile as possible. In other words, I can write a small program that automatically detects which Spark Streaming pipes are short on resources and which are idle, and then automatically drifts a Flow, achieving automated operation of streaming processing. We are still exploring this topic. Batch jobs are comparatively lucky: if something goes wrong they can simply be restarted, but streaming is much harder to operate, including sizing resources, restarting from the right offset, and so on, and we have done a lot of work on these. So we did not just wrap Spark; we did a lot of deep work.
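
As promised above, here is a rough sketch of the multi-Flow idea, assuming Spark Structured Streaming and Kafka. It is an illustration of the concept, not Wormhole's implementation; the topic name, namespaces, and sink paths are made up. Each Flow is plain configuration, so many logical Flows share one physical streaming job.

```scala
// Sketch only: one physical streaming pipe, many logical Flows routed inside it.
import org.apache.spark.sql.{DataFrame, SparkSession}

// A logical Flow: where the data comes from, what is done to it, where it goes.
case class Flow(sourceNamespace: String, transformSql: String, sinkPath: String)

object MultiFlowPipe {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("multi-flow-pipe").getOrCreate()

    // Flows are plain data; adding one does not require starting a new physical job.
    val flows = Seq(
      Flow("oracle.crm.orders", "SELECT namespace, upper(value) AS value FROM events", "/sinks/orders"),
      Flow("mysql.risk.scores", "SELECT value FROM events", "/sinks/scores")
    )

    // The single physical pipe: one Kafka reader shared by all Flows (placeholders).
    val stream = spark.readStream
      .format("kafka")
      .option("kafka.bootstrap.servers", "localhost:9092")
      .option("subscribe", "ums_topic")
      .load()
      .selectExpr("CAST(key AS STRING) AS namespace", "CAST(value AS STRING) AS value")

    // For each micro-batch, route records to every Flow, apply its SQL, write to its sink.
    val dispatch: (DataFrame, Long) => Unit = (batch, _) => {
      flows.foreach { flow =>
        batch.filter(batch("namespace") === flow.sourceNamespace)
          .createOrReplaceTempView("events")
        batch.sparkSession.sql(flow.transformSql)
          .write.mode("append").parquet(flow.sinkPath)
      }
    }

    stream.writeStream
      .option("checkpointLocation", "/tmp/multi-flow-checkpoint")
      .foreachBatch(dispatch)
      .start()
      .awaitTermination()
  }
}
```

In this simplified form the Flow list is fixed at startup; the dynamic directives and Flow drift described above would amount to reloading that configuration at runtime instead of hard-coding it.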

As for open source: I used to work at eBay, where several Apache top-level open-source projects had a big influence on us. So when I designed these four tools at CreditEase, I aimed from the start to make them general-purpose open-source tools. I do not know whether you have heard of them, but Davinci, for one, is very popular in the community and is used by many companies.

By this point, the work of the first stage had stabilized and solved many problems within the company. The several open-source tools were applied well not only inside the company but also in the technical community, enabling many other enterprises.

The second phase began late last year. In 2017, before the term "middle platform" became popular, I attended the Computing Conference in Hangzhou and heard Alibaba share their data middle platform. By early 2018 I was convinced that a data middle platform was what the company needed, so I proposed it to the CTO, who was very supportive. Within a few months the data middle platform concept started to take off, so we caught the wave in time.

2. From open source to the middle platform: the key word is self-service

The ABD era went well, so why did we need a data middle platform? Besides the well-known problems mentioned above, such as many business lines, many technology choices, and many requirements, from the perspective of data management there were also topics, such as data governance and data assets, that had not been given much consideration. Because of the open-source work, I had offline exchanges with some communities and companies, and they all said roughly the same thing: "Your open-source tools are well done, but measured against our business needs they still lack an intermediate layer." That missing piece is, in fact, something like a data middle platform.

However the data middle platform is defined, what enterprises need is a platform that can empower their business more directly. So we could raise things another level between business requirements and the middleware tools, and build an integrated, standardized, one-stop self-service platform.

This brought us into the second era of the agile data middle platform, ADX. The blue triangle inside the large triangle below is the data platform engine. Technically, we first needed to build a convenient self-service platform on top of the earlier open-source tools. However, a good self-service data platform is not the same as a data middle platform. After consulting many articles and definitions, we concluded that the data middle platform should also include three other parts.

  • One is the data asset system. Data assets are the precipitation and reuse of good data. A data middle platform must include data asset construction, for example solidifying the data modeling methodology into the system, so that the accumulation of data assets is supported in a more standardized and regulated way.
  • However, even with data assets and a good platform, if the pot is big but the spout is small, that is, if the bandwidth through which data value reaches the business is insufficient, business departments may feel that all they can do is look at reports, and the ability to empower the business with data falls short. Therefore, the business front end needs not just reports but also data products, data APIs, self-service analysis, and so on, which empower the business better.
  • With all of that, the data middle platform can really run, but it also depends on the company's processes and operating mechanisms. For example, I may have good data assets, but without a data operations mechanism to guarantee them, other business teams will not dare to reuse them, because whoever reuses them has to answer for them. These are all data operations considerations. Only when these aspects are done well can the data middle platform be built and operated well.

The value of the data middle platform is shown on the right side of the figure. Simply put, it is "cheaper, faster, and more accurate", or in other words "reducing cost, increasing efficiency, and improving quality", which is the essence of its value.

Below is an overview of the ADX experience. With a self-service data middle platform, the entire platform R&D team becomes the IT team behind IT. Users no longer have to deal with us directly: on the platform they can apply for resources, create databases and tables, develop on their own, operate on their own, check monitoring, set alarms, diagnose problems, and bring things online and offline by themselves, while we just design, develop, and operate the platform. This is the effect we wanted: more thorough self-service.

The data middle platform is built on the idea of modularization. It is divided into many sub-modules whose relationships are layered and composable. For example, unified data collection, data processing, data models, and monitoring and alerting are ideas similar to other companies'; on the right, data management and platform management address the cross-cutting topics; the sections above are modules closer to business use. There are many modules, and I will not cover them all here.

It is worth mentioning that the core modules were not developed from scratch but built by integrating the ABD open-source tools. So ADX does not overthrow ABD; it is a superstructure on top of ABD that is more abstract, more modular, and more business-facing.

Now, in the ADX era, the diagram below has changed. DataHub integrates the data integration and data bus layers: whereas DBus only supported streaming collection and distribution, DataHub supports both streaming and batch. The same holds for DataWorks relative to Wormhole; both are extensions of the ABD functionality.

There are also corresponding modules that address the lower-level cross-cutting topics. ADX is therefore more platform-like: unlike before, when we built several good open-source tools and combined them ourselves to solve various scenarios, there is now a one-stop self-service platform on which users can complete their daily data work.

As an aside, we did not think much of the DataHub module when we planned it, but once it was built everyone found it genuinely convenient and powerful.

Viewed as a black box from the outside, DataHub can give you whatever form of data you want: if I want a daily T+1 snapshot of a table, it returns that; if I want an accurate snapshot of that table at any historical moment, it returns that too; if I want a real-time stream of that table, it returns that as well. We can do this because we land the full plus incremental data of all tables into the data lake in real time, and on top of the integrated ABD open-source tools we provide whatever data forms are needed, so at the data level you can in theory ask for anything and DataHub can provide it (a small sketch of the snapshot idea follows). We have also seen similar data integration solutions in the community, but most of them provide purely tool-level functionality without a built-in real-time data lake. DataHub contains a data lake in which all of the company's data is collected and maintained in real time, and every data user can get back what they want, which makes for a very convenient and thorough experience.
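
To illustrate the "any snapshot you want" behavior, here is a tiny, self-contained sketch of how a point-in-time snapshot can be rebuilt from a full load plus incremental change events. The field names and operation codes are hypothetical; this is not DataHub's actual interface, just the underlying idea.

```scala
// Sketch only: rebuild the state of a table at any moment from full + incremental data.
case class Record(pk: Int, payload: String)
case class Change(pk: Int, op: String, payload: String, ts: Long) // op: "i" insert, "u" update, "d" delete

object SnapshotAt {
  /** Replay all changes with ts <= asOf on top of the full load,
    * keeping the latest state per primary key; deletes remove the key. */
  def snapshot(full: Seq[Record], changes: Seq[Change], asOf: Long): Map[Int, Record] = {
    val base = full.map(r => r.pk -> r).toMap
    changes.filter(_.ts <= asOf).sortBy(_.ts).foldLeft(base) { (state, c) =>
      c.op match {
        case "d" => state - c.pk
        case _   => state + (c.pk -> Record(c.pk, c.payload))
      }
    }
  }

  def main(args: Array[String]): Unit = {
    val full    = Seq(Record(1, "a"), Record(2, "b"))
    val changes = Seq(Change(2, "u", "b2", 10L), Change(3, "i", "c", 20L), Change(1, "d", "", 30L))
    // Snapshot as of ts = 25: pk 2 is updated, pk 3 is inserted, the delete at ts = 30 is not yet applied.
    println(snapshot(full, changes, asOf = 25L))
  }
}
```

A T+1 snapshot is just this replay with `asOf` set to the end of the day, and a real-time stream is the change events themselves, which is why one real-time lake of full plus incremental data can serve all three forms.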

The second era, ADX, took more than a year from development to launch to large-scale application. In the third era, we pay more attention to building up data assets and data governance capabilities. Without data assets there is no data middle platform, and data governance is an important guarantee that data assets are effectively accumulated and the business effectively empowered.

The subject of data governance has corresponding potential problems at every layer of the data link. Some of them can be solved at the system level, but most depend on people and organizations, and none of them are easy to solve perfectly. We are still thinking about and exploring this subject, so what follows is only a discussion.

3. Data governance: does it depend more on human governance or on autonomy?

Here are some thoughts. "Autonomy" here has two meanings: automated governance and self-service governance.

As I understand it, the middle platform is about the sinking of capability: data processing capability sinks into a processing platform, and data processing results sink into data assets. So can data governance sink as well? And what would sink out of it?

One category is sinking platform tools, such as metadata management and data quality management, which can be made very general and tool-like. The other is sinking a systematized methodology, such as Alibaba's OneData, a methodology refined internally and then implemented as a system. That system and methodology are not necessarily suitable for every company, but I think the ideas are worth borrowing: refine a methodology that fits the enterprise's own business, then systematize it, so as to better constrain and standardize data governance and data asset construction within the enterprise.

For "automated" data governance, the two categories above still cannot cover every problem. For example, enterprises have many legacy systems and processes that cannot be transformed or migrated on a large scale in a short time; how should those be controlled and governed? That remains a hard question. RPA is a relatively new idea that handles legacy-system problems well, and it may intersect nicely with data governance: the idea of process orchestration and automatic execution could be used to deal with the governance problems of some legacy systems and legacy environments.

Regarding "self-service" data governance: data governance is not quite the same as data processing. Something like streaming processing is an intuitive need; any business will have a strong demand for it. Data governance is different. From a business perspective, although in the long run it brings a solid positive impact on the whole enterprise and on business development, in the short term it may constrain the pace of rapid business growth, so business parties may not be very motivated to actively support and cooperate with it.

Some organizations enforce data governance top-down, which requires awareness and commitment from management. Our company is different: data governance has to compromise with rapid business iteration and rapidly changing requirements. It cannot be pushed purely from the top down, but it cannot be dropped either. For example, a business line can build its own private data assets; if it wants to promote them into public data assets, it can apply for certification, which brings benefits to the business line and can be tied to KPIs. In this way, data asset operation can take root and people will actively participate in data governance together. This kind of flexible, incremental data governance may be more effective, and it is what we are trying to do.

The diagram above is only a rough conceptual architecture; it is not very mature and is something we are still thinking through. If we can bring all of the company's metadata together into an enterprise metadata panorama, we have data knowledge; because we have Moonbox, we have all kinds of data operation capabilities; and on top of data knowledge and data operation capabilities, governance actions can be orchestrated visually according to governance experience, rules, and the current state of the data, finally forming an automated data governance system and framework. A small sketch of that rule-driven idea follows.
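
As a rough illustration of what "rules plus metadata produce governance actions" could look like, here is a small, hypothetical sketch. The rule set, metadata fields, and actions are all assumptions for the sake of the example, not our production system.

```scala
// Sketch only: governance rules run over an enterprise metadata catalog and emit actions,
// instead of relying on people to notice problems.
case class TableMeta(name: String, owner: Option[String], lastAccessDays: Int,
                     hasPii: Boolean, encrypted: Boolean)
case class Action(table: String, issue: String)

object GovernanceRules {
  // Each rule inspects one table's metadata and optionally raises an action.
  val rules: Seq[TableMeta => Option[Action]] = Seq(
    t => if (t.owner.isEmpty) Some(Action(t.name, "no owner registered")) else None,
    t => if (t.lastAccessDays > 180) Some(Action(t.name, "not accessed for 180+ days, candidate for archiving")) else None,
    t => if (t.hasPii && !t.encrypted) Some(Action(t.name, "PII stored unencrypted")) else None
  )

  def run(catalog: Seq[TableMeta]): Seq[Action] =
    catalog.flatMap(t => rules.flatMap(rule => rule(t)))

  def main(args: Array[String]): Unit = {
    val catalog = Seq(
      TableMeta("dw.orders", Some("ops"), 3, hasPii = false, encrypted = true),
      TableMeta("stg.tmp_users", None, 400, hasPii = true, encrypted = false)
    )
    run(catalog).foreach(println)
  }
}
```

In a real setup the catalog would come from the metadata panorama and the resulting actions would be executed through the data operation layer; the point of the sketch is only that governance checks can be expressed as data-driven rules rather than manual review.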

If data governance relies purely on people, there are too many uncertainties. Relatively speaking, I believe in tools: through continuous abstraction, sinking, and verification, I believe we can find a more systematic process and supporting tools to do this better.

That is the path we have traveled through the three eras of data middle platform construction over the past four years. There is still a long way to go, and we still need to explore and accumulate. I hope to exchange more with all of you, and thank you for listening!