Planning & editing | Natalie
Interview & writing | Natalie
Interviewee | Du Junping
Editing | Debra
AI Front introduction: A few days ago, the release of Apache Hadoop 2.8.4, a new version of the open source big data platform whose release was led by Tencent, caught the author's attention. More than 10 years have passed since Hadoop was born at Yahoo. During that time, and especially in recent years, Chinese engineers have served as Release Manager for new versions many times, but the companies behind them were American: Yahoo, Microsoft, Hortonworks, Cloudera, and others. This is the first release led by a Chinese company, which is of course an important encouragement to the domestic open source community, and it shows that Chinese developers and organizations are fully capable of breaking through barriers and playing a more influential role in popular open source communities. It also means that Tencent's long-standing support for open source has paid off, and the company is starting to reap the benefits of the open source community.

For the author, an even more curious question is this: why does Tencent still spend so much energy leading an open source Hadoop release amid all the negative talk about Hadoop at home and abroad?


Hadoop was created in 2006 and became an Apache top-level project in 2008. Although only a few giants at home and abroad tried Hadoop at first, it did not take long for it to become the standard for big data computing in the Internet industry, and Hadoop quickly became one of the flagship projects of the Apache Software Foundation. It also spawned a series of well-known Apache top-level projects, including HBase, Hive, and ZooKeeper, all of which started out as Apache Hadoop subprojects before becoming widely known to developers.

Hadoop is now 12 years old, a long life cycle for any piece of software. Since 2016, critical voices have been raised at home and abroad. Although Hadoop remains an indispensable part of big data computing for many enterprises, many people are pessimistic about its future. Hortonworks, the largest platform vendor behind Hadoop, is itself moving toward a cloud-centric world.


Last September, Gartner dropped Hadoop distributions from its hype cycle for data management, as many organizations had begun to reconsider their role in the information infrastructure given the complexity and usability problems of the full Hadoop stack. This year's data science and machine learning tools survey by KDnuggets also showed a drop in Hadoop usage, reinforcing the impression that Hadoop is showing its age.


The 2018 Data Science and Machine Learning Tools Survey reported a 35% drop in Hadoop usage

Against this backdrop, why is Tencent putting so much effort into leading an open source Hadoop release?

Du Junping, the expert researcher at Tencent Cloud who led the release, told AI Front that what is really "old" is the commercial Hadoop distribution, not the technology itself, which remains the core and de facto standard of big data platforms at home and abroad. What needs to change is how Hadoop technology is used and distributed: in the future, more and more users will migrate from offline Hadoop distributions to data lakes in the cloud (object storage + Hadoop).

Why Tencent chose Hadoop: balancing platform stability with technological advancement

Tencent's big data platform includes many products and components that are optimized, or even rebuilt, for its own scenarios, but a large share of them are based on open source Hadoop ecosystem components.

Tencent's big data platform currently uses many Hadoop ecosystem components. Take the Elastic MapReduce service on Tencent Cloud as an example: Tencent provides component services such as Hadoop, HBase, Spark, Hive, Presto, Storm, Flink, and Sqoop. Each serves a different purpose: Hadoop handles data storage and computing resource scheduling, Sqoop is used for data import, HBase provides a NoSQL database service, MapReduce, Spark, and Hive handle offline data processing, and streaming data processing is provided by Storm, Spark Streaming, Flink, and others.

Du Junping said that Tencent's overall principle for selecting components in the Hadoop ecosystem is to balance platform stability with technological advancement. On one hand, you need to understand the scenarios each component fits and its capability boundaries; on the other, you need to understand each component's stability and operational complexity through testing and operations practice. Take the Hadoop-based data warehouse as an example: newer versions of Hive add the LLAP component to improve interactive query performance, but in practice it has not yet proven stable, so Tencent has postponed introducing it into production. Hive mostly serves offline computing scenarios, while interactive queries are handled by the more stable SparkSQL and Presto.

Tencent is not alone: in the big data platforms of many enterprises at home and abroad, Hadoop ecosystem components account for a considerable share. Everyone needs Hadoop, but perhaps it is so common that it gets less attention. As a member of the Hadoop PMC, Du Junping said that Hadoop's status as the core and de facto standard of big data platforms is much the same at home and abroad, although the maturity of Hadoop applications varies by industry. Hadoop was adopted first, and is most mature, in Internet companies; the financial industry, with many successful deployments, comes next. The current hot spots for Hadoop big data platforms are government and security applications and IoT/industrial Internet platforms. These new hot spots bring new demands and will drive the continued evolution of Hadoop's technology and ecosystem.

Hadoop technology is not old, but the way it is used and distributed needs to change

As for Gartner's removal of Hadoop from its hype cycle, Du pointed out that the report was about commercial Hadoop distributions, not the technology itself.

The problems with Hadoop distributions mentioned in the report, such as their high complexity and the inclusion of many unnecessary components, are real, based on user feedback. Many commercial distributions, such as CDH or HDP, bundle dozens of components, which provides flexibility but also causes users plenty of usage and operations headaches. Worse, the problem has not eased in recent years but is getting more serious. The way Hadoop technology is used and distributed therefore needs to change: in the future, more and more users will migrate from offline Hadoop distributions to data lakes in the cloud (object storage + Hadoop).

Du acknowledges that the Hadoop ecosystem has shortcomings. It is very complex, with each component a separate module, developed and released by its own open source community, a mode we can call loosely coupled. Its advantages are flexibility, wide applicability, and a controllable development cycle; its disadvantages are low component maturity, serious version conflicts, and difficult integration testing. This also makes things harder for users, because a single scenario can involve configuring many components.

While streaming computing is becoming more and more important for big data processing, not providing it natively is not a fatal problem for Hadoop. Hadoop itself does not offer streaming computing services, but the major streaming components, such as Storm, Spark Streaming, and Flink, are part of the Hadoop ecosystem, so this is not much of an issue.

Hadoop ecosystem components compete fiercely: Spark has the edge, and MapReduce has entered maintenance mode

Some developers have told AI Front that Hadoop is mostly being held back by MapReduce, while HDFS and YARN are fine. Du Junping thinks it is not accurate to call MapReduce a drag on Hadoop. MapReduce is still used, though in an ever narrower set of scenarios; it remains well suited to some very large-scale batch processing tasks, which it runs very stably. Moreover, the Hadoop community has deliberately placed MapReduce in maintenance mode, pursuing no new features or performance work, so that resources can be invested in newer computing frameworks such as Spark and Tez to help them mature.
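The batch model MapReduce popularized can be sketched in a few lines of plain Python. This is only a conceptual illustration of the three phases (map, shuffle, reduce), not Hadoop's actual Java API; the word-count example and all function names here are hypothetical.

```python
from collections import defaultdict

def map_phase(records):
    """Map: emit a (word, 1) pair for every word in every input record."""
    for record in records:
        for word in record.split():
            yield word, 1

def shuffle(pairs):
    """Shuffle: group all emitted values by key, as the framework
    does between the map and reduce phases."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(groups):
    """Reduce: aggregate the values for each key (here, a sum)."""
    return {key: sum(values) for key, values in groups.items()}

lines = ["hadoop stores data", "spark and hadoop process data"]
counts = reduce_phase(shuffle(map_phase(lines)))
print(counts["hadoop"])  # 2
```

The real framework distributes each phase across machines and persists intermediate results, which is exactly why MapReduce stays stable on very large batch jobs but pays a latency cost that newer engines such as Spark avoid.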

HDFS and YARN are the de facto standards for distributed storage and resource scheduling in big data, but they face challenges too. For HDFS, in the public cloud more and more big data applications choose to skip it and use cloud object storage directly, which makes it easier to separate compute from storage and increases resource elasticity. YARN faces a strong challenge from Kubernetes, particularly its native Docker support, better isolation, and the richer ecosystem built on top of it. However, Kubernetes is still playing catch-up in big data and has much room for improvement in its resource scheduler and in its support for the various computing frameworks.

Spark is now essentially the dominant computing framework, MapReduce is mostly legacy applications, and Tez is more like Hive's dedicated execution engine. On the streaming side, the early engine Storm is being retired, while Spark Streaming and Flink currently lead the way: the former is ecologically strong, the latter architecturally strong. Interestingly, adoption of the two differs markedly between China and abroad: a large number of Chinese companies have begun building their streaming platforms on Flink, while Spark Streaming still dominates the US market. There are also promising newer streaming frameworks such as Kafka Streams.
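What all of these streaming engines share is the idea of computing over an unbounded event stream by slicing it into windows. As a rough illustration of that idea, here is a hypothetical tumbling-window counter in plain Python; it is not the API of Storm, Spark Streaming, or Flink, and the event data is invented.

```python
from collections import Counter

def tumbling_window_counts(events, window_size):
    """Group a timestamped event stream into fixed, non-overlapping
    (tumbling) windows and count events per key in each window.

    events: iterable of (timestamp, key) pairs
    window_size: window length, in the same unit as the timestamps
    """
    windows = {}
    for ts, key in events:
        # Every event falls into exactly one window, identified by its start.
        window_start = (ts // window_size) * window_size
        windows.setdefault(window_start, Counter())[key] += 1
    return windows

# Page-view events at timestamps in seconds, counted in 10-second windows.
events = [(1, "home"), (4, "home"), (9, "cart"), (12, "home"), (19, "cart")]
result = tumbling_window_counts(events, 10)
print(result[0]["home"])  # 2 "home" events in the window [0, 10)
```

Real engines layer much more on top of this, such as event-time versus processing-time semantics, watermarks for late data, and fault-tolerant state, and their different answers to those problems are a large part of the Spark Streaming versus Flink architectural debate.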

In terms of big data SQL engines, Hive, SparkSQL, Presto, and Impala still have their own strengths.

Hive, first developed by Facebook, was the most widely used big data SQL engine in the early years. Like MapReduce, Hive is known for being slow but stable. It has also been the conscience of the industry, selflessly providing many common components for other engines to use, such as the Hive Metastore, the query optimizer Calcite, and the ORC columnar format. Hive has developed rapidly in recent years, adding CBO for query optimization, Tez to replace MapReduce as the execution engine, LLAP for caching query results, and continued evolution of ORC storage. However, these new features are not yet mature and stable in terms of market adoption, and a large number of users still regard Hive as a reliable ETL tool rather than a real-time query product.

SparkSQL has grown dramatically in the past two years, especially since Spark entered the 2.x era. Its excellent SQL compatibility (it is the only open source big data SQL engine to pass all 99 TPC-DS queries), strong performance, large and active community, and complete ecosystem (machine learning, graph computing, stream processing, and so on) make it stand out among open source products, and it is widely used in markets at home and abroad.

Presto has also seen wide adoption in the past two years. This in-memory MPP engine handles small volumes of data very quickly but struggles at larger scales. Impala also performs very well, but its ecosystem is relatively closed, its community evolves slowly, its SQL compatibility is poor, and its user base is relatively small.

The Hadoop ecosystem is bound to move to the cloud, and IoT deserves long-term attention

Hadoop is now 12 years old. Where will the ecosystem go from here? Du Junping said that the Hadoop ecosystem will develop toward the cloud: simplified, or even zero, operations is both a user demand and a strength of cloud vendors. More and more data is generated, stored, and consumed in the cloud, closing the loop of the data life cycle there and forming the data lake. Data security and privacy protection technology in the cloud is therefore very important.

In addition, deploying and running Hadoop on hybrid cloud will be an important trend, although the technology and architecture here are not yet mature and need continued innovation. In this context, traditional Hadoop distribution vendors will have less say at the technical and business level, while cloud vendors will have more. Another trend is that the Hadoop ecosystem will keep growing toward data applications, with the emphasis shifting from data processing to data governance; more convenient ETL tools, metadata management, and data governance tools will gradually mature. Finally, the Hadoop ecosystem will evolve from a pure big data platform into an integrated data and machine learning platform, serving many AI application scenarios in the future.

Du Junping told AI Front that among the important future directions of big data, IoT deserves long-term attention. In the history of big data, this line of business has had a short development cycle; many of its technologies are not very mature, and its standards are not yet unified. There is also room for technological change in cloud big data products, such as cross-data-center/cross-cloud solutions, automated migration of critical data workloads, data privacy protection, and automated machine learning. More innovative products will emerge to impress users and attract them to the cloud.

Tencent Cloud will focus on the core pain points of cloud big data users and set its technology and product roadmap accordingly. For the underlying platform architecture, Tencent Cloud will pay more attention to serverless approaches and to the balance between performance and cost; improving resource utilization will be a long-term direction. The Hadoop ecosystem will continue to play an important role as the market embraces open and open source products and solutions, and Tencent Cloud will keep contributing to the open source community, working with it to build better and newer technologies for the future.

Conclusion

Hadoop has taken 12 years to grow from an emerging open source project into a standard big data platform. That components inside the Hadoop ecosystem now face competitive pressure from many younger open source projects is simply normal evolution; no open source platform is perfect. With its existing advantages, the Hadoop ecosystem remains very strong, but whether it will find new vitality or fade away as everything moves to the cloud is still an open question.