There has been a lot of chatter about Hadoop lately. This article uses the rise and fall of Hadoop as a lens to discuss the current state and future trends of big data analytics.

The 15-second version:

Hadoop

  • It’s past its peak and is becoming a legacy system
  • Hadoop and distributed databases are on the same track, and Hadoop currently has no advantage on that track

Big data

  • The big data market is the SQL market, which is the distributed database market
  • Basic analytics such as BI and interactive query have matured
  • Advanced analytics (machine learning) is sinking into databases as embedded analytics
  • The main problem in advanced analytics (machine learning) is not the analysis but the data itself

1. Hadoop passed its peak years ago and is becoming a legacy system

Since 2015, Hadoop’s problems have been increasingly exposed. Analysts at Gartner, IDG, and elsewhere, Hadoop users, and insiders in the Hadoop and big data communities have reported a growing list of issues.

The reasons are mainly as follows:

  • The Hadoop stack is too complex, with many components that are hard to integrate and expensive to operate
  • Hadoop’s slow pace of innovation and its lack of a unified philosophy and governance make integrating its many components very complex
  • Under the impact of cloud technology, S3-style object storage in particular provides cheaper, easier-to-use, and more scalable storage than HDFS, undercutting HDFS, the foundation of Hadoop
  • Expectations for Hadoop are too high: it grew out of cheap storage and batch processing, yet people expected it to solve every big data problem, so satisfaction is low
  • Talent is expensive and scarce

This article does not attempt to argue that Hadoop is past its peak; it treats that as an industry fact. Interested readers can consult the many commentaries online; below is a selection of articles I found valuable or relevant (the titles alone convey a strongly bleak mood):

  • Does Hadoop have a future? History and future direction of Hadoop
  • Hadoop is done: Escape complexity and embrace cloud computing
  • Beyond cloud Computing: Thoughts on the future of database management systems
  • Big Data Is Still Hard. Here’s Why
  • Big Data Will Get By (but only with a little help from its friends)
  • Cloudera and Hortonworks Merger Means Hadoop’s Influence Is Declining
  • From data ingestion to insight prediction: Google Cloud smart analytics accelerates your business transformation
  • Hadoop is Dead. Long Live Hadoop
  • Hadoop Has Failed Us, Tech Experts Say
  • Hadoop Past, Present, and Future
  • Hadoop: Past, Present and Future
  • Hadoop runs out of gas
  • Hadoop Struggles and BI Deals: What’s Going On?
  • Hitting the Reset Button on Hadoop
  • Is Hadoop officially dead
  • Mike Olson on Zoo Animals, Object Stores, and the Future of Cloudera
  • More turbulence is coming to the big-data analytics market in 2019
  • Object and Scale-Out File Systems Fill Hadoop Storage Void
  • The Decline of HADOOP and Ushering An Era of Cloud
  • The elephant’s dilemma: What does the future of databases really look like?
  • The Future of Database Management Systems is Cloud!
  • The history of Hadoop
  • Why is Hadoop dying?

OK, if you’re like me and have read all the articles above, you’re really interested in this subject. Send me an email ([email protected]) and I’ll invite you out for a drink and a chat.

Can Hadoop regain its momentum? To return to the center of big data, Hadoop needs confidence and time, and those are exactly what it now lacks. The industry has given Hadoop more than a decade; for whatever reason, Hadoop has not properly solved the big data problem, or even the basic problems of big data, and it is hard to believe another ten years would change that. As its problems became widely known, industry confidence in Hadoop declined significantly. Just as important, there is now a variety of big data solutions (especially open source ones) to choose from, whereas a decade or so ago there were no alternatives.

This does not mean that Hadoop will disappear, however. After more than a decade of development, many Hadoop clusters are now deployed around the world, and these legacy assets and their derivative needs will persist for quite some time. HDFS, the foundation of Hadoop, is challenged by object storage and has already lost in the public cloud; it can hold its position in the enterprise for now, but as cloud vendors push into the enterprise market it will soon face serious challenges there as well. Hadoop itself is also moving toward object storage and may become one candidate among object storage solutions in the future, but it is safe to say that Hadoop is no longer the center of the discussion.

Hortonworks co-founder and Cloudera Chief Product Officer Arun C. Murthy posted on September 10, 2019:

The old way of thinking about Hadoop is dead — done, and dusted. Hadoop as a philosophy to drive an ever-evolving ecosystem of open source technologies and open data standards that empower people to turn data into insights is alive and enduring.


A formless philosophy is just a network of ideas; without a concrete product as its carrier, it amounts to armchair talk.

2. The Hadoop market is the data warehouse market, and Hadoop is not dominant in that market right now

Let’s start by outlining the evolution of several major components of Hadoop.

  • Apache Nutch is an open source web crawler originally written by Hadoop creator Doug Cutting. To store the vast number of crawled web pages, Nutch needed a distributed storage layer. Inspired by the Google GFS paper, Doug designed an open source implementation of GFS that became HDFS. HDFS provided inexpensive, highly reliable, and scalable storage compared to the expensive disk arrays and SANs of the time.
  • With the distributed storage layer solved, Nutch needed a parallel computing model suited to a distributed environment. Inspired by the Google MapReduce paper, Doug designed an open source version of MapReduce. Together, HDFS and MapReduce solved the storage and computation problems of big data and were eagerly adopted by large Internet companies struggling with data volume. Hadoop soon attracted a large number of developers and became a top-level Apache project.
  • With storage and batch computation solved, it quickly became clear that MapReduce was so complex that even a company with Facebook’s engineering strength could not easily write efficient and correct MapReduce programs. Beyond batch jobs, people also needed Hadoop to handle interactive query tasks. To that end, Facebook developed Hive, which quickly became popular and still has many users; at Facebook, 95% of users wrote Hive instead of hand-written MapReduce (see the sketch after this list).
  • Because Hadoop was not designed for interactive processing, Hive is inefficient and offers low concurrency. In addition, Hive did not support standard SQL, making integration with other products difficult. Cloudera developed Impala to address this. Impala is essentially a distributed MPP (massively parallel processing) database.
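To make the Hive-versus-MapReduce contrast concrete, here is a minimal sketch, assuming a hypothetical docs table with a single line STRING column of raw text: the classic word count, which takes a page or more of hand-written Java MapReduce, reduces to a few lines of HiveQL:

    -- Hypothetical table: docs(line STRING), one line of raw text per row.
    -- Split each line on whitespace, flatten the words, and count them.
    SELECT word, COUNT(*) AS freq
    FROM docs
    LATERAL VIEW explode(split(line, '\\s+')) w AS word
    GROUP BY word
    ORDER BY freq DESC;

The same job in raw MapReduce requires writing, configuring, and chaining mapper and reducer classes by hand, which is exactly the complexity Hive was built to hide.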

From this history we can clearly see that Hadoop started from distributed storage and a parallel computing model and gradually evolved into an MPP database, a category that has been maturing as a data warehouse solution for more than 30 years. The Hadoop market is mainly a SQL market. However, compared with classic MPP databases, Hive and Impala have no advantage in performance, SQL compatibility, or scalability. In Gartner’s 2019 data analytics market rankings, all three Hadoop distributors sit outside the top 10 (Teradata, Oracle, and Greenplum are the top three).

The market backs this up: Cloudera officially says 75% of its revenue comes from SQL products, and its recent (September 4, 2019) acquisition of Arcadia Data, an AI-powered cloud-native BI vendor, shows which way the Hadoop market leader is heading. The aforementioned Cloudera CPO has also publicly noted: “For several years now, Cloudera has stopped marketing itself as a Hadoop company, but instead as an enterprise data company.”

3. The big data analytics market is currently a SQL market

Big data analytics consists of two levels: the first is basic analytics, the second is advanced analytics.

At the basic level, the main applications and scenarios are BI, interactive query, and visualization. The mainstream core technology in these scenarios is SQL; the basic formula of products like BI tools is SQL plus a graphical user interface (GUI). The main SQL features involved are GROUP BY and aggregation, window functions, data cubes, and so on. The computation behind these SQL features is essentially the addition, subtraction, multiplication, and division of elementary school math: most of what looks like fancy “big data analytics” is elementary arithmetic. Of course, performing that grouped arithmetic over massive data volumes while preserving ACID properties is challenging, and distributed MPP databases such as Greenplum and Vertica have solved these problems well.
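As a hedged illustration (the sales table and its columns are hypothetical), this is the kind of SQL behind a typical BI dashboard: a grouped aggregation, a window function computing each region’s share of the total, and a CUBE producing subtotals, all of it just sums and division:

    -- Hypothetical table: sales(region, product, amount).
    -- Per-region revenue and its share of the grand total.
    SELECT region,
           SUM(amount) AS revenue,
           SUM(amount) / SUM(SUM(amount)) OVER () AS revenue_share
    FROM sales
    GROUP BY region;

    -- CUBE: subtotals for every combination of region and product.
    SELECT region, product, SUM(amount) AS revenue
    FROM sales
    GROUP BY CUBE (region, product);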

The advanced level involves complex algorithms such as machine learning, pattern recognition, and AI. This layer is tending to sink into the database. Apache MADlib is the first mature product leading this trend, and in 2018 Google released BigQuery ML, likewise a SQL-based advanced analytics solution. Interested readers can refer to the introductory article on in-database analytics.
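As a minimal sketch of what in-database advanced analytics looks like, assuming a hypothetical houses table on Greenplum or PostgreSQL with MADlib installed, training and scoring a linear regression never leaves SQL:

    -- Hypothetical table: houses(id, size, age, price).
    -- Train a linear regression model inside the database.
    SELECT madlib.linregr_train(
        'houses',              -- source table
        'houses_model',        -- output table holding the coefficients
        'price',               -- dependent variable
        'ARRAY[1, size, age]'  -- independent variables (1 = intercept)
    );

    -- Score rows with the stored coefficients, still in SQL.
    SELECT h.id,
           madlib.linregr_predict(m.coef, ARRAY[1, h.size, h.age]) AS predicted_price
    FROM houses h, houses_model m;

BigQuery ML follows the same pattern, with CREATE MODEL and ML.PREDICT statements.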

At both levels of big data analytics, the core is SQL. For more on the evolution of data processing platforms and their driving forces, see Greenplum’s “The Evolution of Data Processing Platforms” and “Big Data ≈ Distributed Databases” (greenplum.cn/2019/06/17/…).

4. The difficulty of advanced data analysis lies not in the analysis but in the data itself

If you have enough clean data, then advanced data analysis is not a problem for you.

“Sufficient” does not necessarily mean petabytes of data; it means enough data to meet the requirement, which varies by scenario from megabytes to gigabytes to petabytes. Advanced analytics does not necessarily require big data: widely used commercial analytics products such as SAS and SPSS are single-node, yet they handle substantial data volumes.

A large body of research has also shown that, even with the algorithm unchanged, more data yields more accurate models and results. Using as much data as possible, full data rather than samples, has therefore become the primary means of improving accuracy.

“Clean” means that the data is standardized and accurate. Reality is far from that, and inaccurate data can seriously bias the results of advanced analytics.

Data engineers and data scientists face a series of hard problems: data discovery, data integration, and data cleansing. Solving them means data scientists spend most of their time organizing data rather than analyzing it. Numerous reports indicate that data scientists spend at least 70% of their time on data discovery, integration, and cleansing. One data scientist at iRobot even said, “I spend 90 percent of my time discovering and cleaning data, and 90 percent of the remaining time correcting errors introduced during cleaning.” That may be an exaggeration, but it gives a sense of how data scientists actually spend their days. Improving the efficiency of data workers is a very active area of investment worldwide.
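As a toy illustration of why cleansing eats so much time (the table, columns, and rules are all hypothetical), even the simplest normalization and deduplication takes deliberate work, here in PostgreSQL-dialect SQL:

    -- Hypothetical raw feed: customers_raw(id, email, signup_date).
    -- Normalize emails, drop obviously malformed rows, and keep only
    -- the most recent record per normalized email address.
    CREATE TABLE customers_clean AS
    SELECT DISTINCT ON (email_norm)
           id,
           email_norm AS email,
           signup_date
    FROM (
        SELECT id,
               LOWER(TRIM(email)) AS email_norm,
               signup_date
        FROM customers_raw
        WHERE email IS NOT NULL
          AND POSITION('@' IN email) > 1
    ) t
    ORDER BY email_norm, signup_date DESC;

Real pipelines multiply this by hundreds of columns, inconsistent source systems, and rules that are themselves discovered only by inspecting the data.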

Conclusion

To sum up, Hadoop, the first generation of big data solutions, has passed its peak, and big data has entered its second generation: distributed databases.

Distributed databases, especially MPP databases, have solved the basic analytics problems of big data well, and they will continue to evolve toward being easier to use and faster.

Advanced analytics is tending to sink into the database. Its difficulty lies not in the analysis itself but in the quantity and quality of the data. Expect more innovation in this area.