In March 2021, Transwarp Technology released version 8.0 of TDH, its high-speed big data platform, and many users are interested in what is new. This series of articles walks through the new features and technological innovations of TDH 8.0, helping users of enterprise-level data platforms gain a more comprehensive and in-depth understanding of cutting-edge big data technology and make better technology choices.

You can also watch our videos on our official video account, our service account, Bilibili, Tencent Video, and other platforms.

Previous Content:

TDH 8.0 Required Reading: Why You Need a Decoupled Multi-Model Data Management Platform

TDH 8.0 Required Reading 2: Full Support for 10 Data Models; the Future Belongs to the Multi-Model Big Data Platform

Let’s talk about TDH’s product mission

Let’s start with the origin of the name TDH. TDH stands for Transwarp Data Hub, and the Data Hub is, simply put, the hub of big data that we want to become.

Since Transwarp was founded in 2013, we have wanted to provide a big data platform and a set of tools that let users bring all of their data together, work with it through those tools, and help enterprise customers create value. To do this, the platform needs to meet several requirements:

First, this is enterprise software, composed of many submodules and therefore relatively complex. Second, it has to meet one-stop data-processing needs and help users complete a full data-processing pipeline. Third, it has to handle multiple data models: structured, graph, text, and so on. Finally, it needs strong storage and computing capabilities to help customers find value in massive volumes of data.

Realizing an enterprise-level, one-stop, multi-model big data platform is actually quite difficult, and the Transwarp big data platform has had to overcome many technical problems along the way. Today, our topic focuses on the multi-model big data platform.

I remember when Transwarp was founded in 2013. Big data technology was very popular at the time, new technologies emerged one after another, and the market was still broadly in an exploratory phase. Many software companies built on big data chose to combine relatively mature open-source products directly into their own solutions, on the grounds that many Internet companies at home and abroad had already proven the technology reliable, so there was no need to reinvent the wheel.

To this day, I do not think this is the right approach from a technical point of view, especially for foundational software.

We are dealing with complex systems inside enterprises, and we need to acknowledge the complexity of the problems we face. Solutions assembled directly from open-source products can handle certain specific scenarios, but carving a business up into those scenarios requires fairly specialized knowledge. More importantly, our enterprise customers’ businesses have a long history of development, far longer than that of the Internet companies and of big data technology itself.

Measured against the breadth of that business, big data technology can solve some pain points, but it is not systematic enough, and users cannot keep evolving for long with only one or two products. There are two reasons. One is that open-source big data technologies offer limited functionality. The other is that most open-source communities are still driven mainly by engineers abroad, so the problems found in domestic scenarios receive little consideration.

This is completely different from Internet companies, which have no legacy business and can grow their business together with the technology. We therefore cannot assume that an open-source technology proven at Internet companies can be applied as-is to traditional enterprises.

Of course, big data technology has by now proven itself applicable to critical production systems in enterprises, and that is what Transwarp insists on. But how to build a good product that integrates these technologies while supporting complex enterprise scenarios has been a headache for me and my team.

TDH architecture design principle – user first, efficiency second

The first issue is cost. As a startup, especially in its first few years, we did not have enough R&D staff, and it was impossible to take every open-source product on the market and study it thoroughly. So the approach we chose was to focus on the core of big data technology, develop as much of the product code independently as possible, and make iterative improvements to some of that technology along the way.

With independent development, the initial build may be slow and quality hard to guarantee, but once it is done, subsequent iteration is fast. The reason is simple: you are familiar with your own product architecture, you know exactly where it can be extended and where it can be refactored, and the evolution and iteration of the code stay within reasonable planning and control. To quote a colleague of mine: it is all code we wrote ourselves, so what is there we cannot achieve?

Because manpower was limited and the platform needed a relatively large number of functions, TDH’s overall architecture was designed to be modular from the very beginning. Each developer could focus on his or her own module, which was more efficient and easier to test. Experienced R&D leads made the interface definitions extensible, and we also took future iterations of requirements into account.

So on the one hand, external factors meant we faced complex enterprise scenarios; on the other hand, internal factors meant we wanted to build an autonomous and controllable big data platform in an efficient way. Combining the two, we finally settled on an architecture that abstracts a unified distributed computing engine and a unified distributed storage engine, on top of which each product team implements its own storage structures to meet customers’ business needs. This design also laid the foundation for the multi-model big data platform we have today.
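To make this layering concrete, here is a minimal sketch in Java of how such an architecture could be expressed. Every interface and type name below is a hypothetical illustration for this article, not an actual TDH API.

import java.util.Iterator;
import java.util.function.Predicate;

// Sketch of the split between unified engines and per-model storage formats.
public class UnifiedEngineSketch {

    // Placeholder types so the sketch is self-contained.
    interface Record {}
    interface Tablet {}        // a shard-level handle owned by the storage layer
    interface LogicalPlan {}   // produced by a shared SQL/graph/text front end
    interface ResultSet {}

    // Unified distributed storage engine: one layer owns replication,
    // transactions, scaling, and disk recovery for every data model.
    interface StorageEngine {
        Tablet openTablet(String tableId);
    }

    // Each product team plugs in its own physical layout behind this
    // interface: row store, column store, adjacency lists, inverted index, ...
    interface StorageFormat {
        void write(Tablet tablet, Record record);
        Iterator<Record> scan(Tablet tablet, Predicate<Record> filter);
    }

    // Unified distributed computing engine: one scheduler and optimizer that
    // executes plans against whichever format a data model has registered.
    interface ComputeEngine {
        ResultSet execute(LogicalPlan plan, StorageFormat format);
    }
}

The point of the split is that replication, transactions, and scheduling are written once in the two unified engines, while each data model only supplies its own physical layout and the operators that read it.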

In the subsequent evolution of the architecture, customer requirements have continued to confirm that this design was the right one.

Here is an example from a graph database deployment. While building a graph, we found a vertex with a particularly large deviation, thousands of times the graph’s average. Curious, we went back to the original data: through a hot session configuration we switched the engine used by the graph database into SQL mode, and found that the data and the schema did not match, which had produced a large amount of incorrect data.
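The general idea of that check, expressed as a small JDBC sketch, is simply to query the graph’s underlying edge data relationally and look for degree outliers. The table name, column names, and connection string below are all hypothetical placeholders, not TDH’s actual interfaces or syntax.

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

public class DegreeOutlierCheck {
    public static void main(String[] args) throws Exception {
        // Placeholder JDBC URL; substitute the real SQL endpoint of the platform.
        try (Connection conn = DriverManager.getConnection("jdbc:example://host:port/db");
             Statement stmt = conn.createStatement()) {

            // Find vertices whose out-degree dwarfs the rest of the graph,
            // i.e. the kind of outlier described above.
            String sql =
                "SELECT src_id, COUNT(*) AS out_degree " +
                "FROM edge_table " +
                "GROUP BY src_id " +
                "ORDER BY out_degree DESC " +
                "LIMIT 10";

            try (ResultSet rs = stmt.executeQuery(sql)) {
                while (rs.next()) {
                    System.out.printf("%s -> %d%n",
                            rs.getString("src_id"), rs.getLong("out_degree"));
                }
            }
        }
    }
}

Once the suspicious vertices are visible as plain rows, mismatches between the data and the schema tend to be easy to spot.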

This kind of workflow is one benefit of the unified compute engine. A unified storage engine brings similar benefits: whether you are scaling capacity or dealing with a damaged disk, the operations procedures and commands are the same regardless of the data model, so there is no need to learn a separate operations playbook for every component, let alone the distinct distributed designs of systems like ElasticSearch, whose different command systems take time to learn.

Of course, Transwarp’s multi-model big data platform has some other very useful capabilities: processing for multiple models can run inside one process or in independent processes, which makes resource utilization and deployment more flexible; good SQL support reduces business migration costs; and unified operations methods and concepts make operations and maintenance easier.

Eight years of the team’s accumulated work: where the TDH architecture’s advantages show

We can illustrate this by making some concrete comparisons:

1. Integrated vs. assembled

Open-source software is often aimed at one or a few specific scenarios, so to support enterprise-level requirements, an open-source big data platform has to be assembled from many components. Compared with the open-source big data software stack, Transwarp’s big data platform is more capable and has much lower architectural complexity than the Hadoop ecosystem; at the same level of functional complexity, the number of components and modules in Transwarp’s solution is far smaller than in solutions assembled from open-source products, and that is an advantage.

Because of this simplicity, unnecessary interactions are eliminated. Of course, in some scenarios with only a single functional requirement, our big data platform can still feel somewhat heavyweight; as the software matures, we will keep slimming it down through modularization and other means, and do that well for these smaller scenarios.

2. Traditional enterprise scenarios vs. Internet scenarios

This topic came up earlier, but let’s go into a bit more detail here. Traditional enterprises have a long history. Take banking as an example: its business is actually already highly mature. When we talk about creating new environments and new value, the first thing we have to think about is compatibility. We cannot bypass the old business to create a new one; that is not practical. So how the existing business can be migrated smoothly onto TDH is the first question we consider.

I think the problems of the Internet and the problems of traditional enterprises are two different kinds of problems. In solving them, technologies can borrow from each other, but we cannot say which is more advanced or more useful. It is a bit like pitting Guan Gong against Qin Qiong: comparing two things that do not belong in the same arena.

When choosing technical routes, TDH likes to try new technologies, but it does not chase novelty blindly; it pursues applicability. A new, valuable technology must be deployable in enterprise applications, and whether a technology can land in production is one of the most important indicators in our technical choices. So TDH adopts new big data technology while staying very grounded, iterating continuously around customer needs. That is healthy development, and it gradually forms the product’s core competitiveness.

3. JVM vs. C++

Our friends in tech often face this choice, so let me get straight to our view: Java is easy to pick up but hard to master, while native languages have a higher ceiling. Transwarp’s unified computing engine is JVM-based, while the storage engine is written in C++. This combination fits current customer needs well. The storage engine is stable: we use C++ to build a solid memory model and transaction management, and capabilities such as disaster recovery and capacity expansion keep improving with each version. The computing engine is powerful: in our programming we pay close attention to adapting to the JVM’s GC model and JIT, so that we can quickly develop a computing engine that is strong in both performance and functionality.
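As a rough sketch of what this split can look like at the code level, the Java side might hold only an opaque handle into the native storage engine, with memory management, transactions, and durability living entirely on the C++ side. The library name and method signatures below are invented for illustration; they are not TDH’s actual native interface.

// JVM-side view of a C++ storage engine, reached through JNI.
public class NativeStoreClient {

    static {
        // Load the (hypothetical) C++ storage engine library.
        System.loadLibrary("storage_engine");
    }

    // The JVM only keeps an opaque handle; all memory and transaction
    // management happens in the native code behind these declarations.
    private native long openTable(String tableName);
    private native byte[] get(long tableHandle, byte[] key);
    private native void put(long tableHandle, byte[] key, byte[] value);
    private native void closeTable(long tableHandle);
}

The benefit of this boundary is that the JVM side stays free to iterate quickly on computation and query logic, while the C++ side keeps tight, predictable control over memory and persistence.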

Difficulties · Attempts · Goals · We’re waiting for you

Over the past year or so, our team has kept trying to break through on several key features. From the moment we decided on this architecture to actually building it, it was not smooth sailing; frankly, it was rather bumpy. Development was a process of stepping into one pit after another, and what left the deepest impression was solving problems in low-level runtime components such as the operating system and the JVM. The most classic battles, of course, were with the GC, but those are so routine there is nothing worth telling. Today let’s tell a slightly less common story, one about the JIT.

The JIT is the key to the performance of a running Java program: how fast a piece of Java code ultimately runs depends on the C2 compiler. We ran into a lot of performance degradation at runtime; simply put, the longer the program ran, the slower it got. By reading the assembly emitted by the JIT, we found some of the key problems.

Later, we designed our engineering framework with particular attention to how code performs after JIT compilation. Without solving these issues, we would not have been able to put such complex functionality into the same JVM to support so many data models.
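As a self-contained illustration of the kind of “the longer it runs, the slower it gets” behavior described above (this is a toy program, not TDH code), the following Java snippet makes a hot call site megamorphic: once C2 has seen several receiver types at the same call site, it gives up inlining there and falls back to virtual dispatch, and the very same loop slows down.

import java.util.function.LongUnaryOperator;

public class MegamorphicDemo {

    // The hot loop; op.applyAsLong(i) is the call site whose fate we care about.
    static long drive(LongUnaryOperator op, long rounds) {
        long acc = 0;
        for (long i = 0; i < rounds; i++) {
            acc += op.applyAsLong(i);
        }
        return acc;
    }

    interface LongTask { long run(); }

    static void time(String label, LongTask task) {
        long t0 = System.nanoTime();
        long result = task.run();
        System.out.printf("%s: %.1f ms (result=%d)%n",
                label, (System.nanoTime() - t0) / 1e6, result);
    }

    public static void main(String[] args) {
        LongUnaryOperator a = x -> x + 1;
        LongUnaryOperator b = x -> x * 3;
        LongUnaryOperator c = x -> x ^ 7;

        // Phase 1: a single receiver type at the call site, so the JIT can inline it.
        time("monomorphic", () -> drive(a, 200_000_000L));

        // Phase 2: three receiver types at the same call site; inlining is
        // abandoned and the same loop becomes noticeably slower per iteration.
        time("megamorphic", () ->
                drive(a, 70_000_000L) + drive(b, 70_000_000L) + drive(c, 70_000_000L));
    }
}

Running it with -XX:+PrintCompilation -XX:+UnlockDiagnosticVMOptions -XX:+PrintInlining shows whether the hot call is being inlined in each phase; dumping the generated machine code additionally needs -XX:+PrintAssembly plus the hsdis disassembler plugin. This is only a crude demonstration, so any real measurement should go through a proper harness such as JMH.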

Domestic foundational software has had only a short time to develop, and we still have a lot of work to do. We will keep focusing on the usability, stability, and performance of the platform, as well as on building more features. We hope TDH can help customers create greater value.

If you would like to join us in developing system software, please contact us at [email protected].