Crisis and Opportunity for Spark: In the Future, AI Frameworks Will Force the Hand of Data Processing Frameworks

Author | William Zhu, who focuses on big data and machine learning and is currently a senior data architect at DXY

Source | Reprinted with authorization from Jianshu

AI Front introduction: Last week, at the Spark+AI Summit, Matei Zaharia, a key author of Spark and Mesos and chief technologist at Databricks, announced the launch of MLflow, an open source machine learning platform. The new platform covers the entire machine learning process, from data preparation to model training to final deployment. It aims to simplify the complex process of building, testing, and deploying machine learning models for data scientists. According to Matei, the research effort revolves around “ideas on how to provide developers with similar benefits to platforms like Google TFX and Facebook FBLearner Flow, but in an open way: not only open in the sense of open source, but open in the sense of being able to use any tools and algorithms.” AI Front has covered the platform in detail in “The Spark Team’s New Open Source Project: MLflow, a Full-Process Machine Learning Platform”.

This new machine learning platform raises all sorts of questions. Where does MLflow fit in? How does it relate to TensorFlow? Today we bring you William Zhu’s views on MLflow for reference.

Follow the WeChat official account “AI Front” (ID: AI-front)

MLflow

Last week I published an article, “What Problems Can MLflow, the Spark Team’s New Project, Solve?” (https://www.jianshu.com/p/2ed60a1dc764), describing my views on MLflow. Thinking about it again, the Spark team is very smart. AI practitioners have their own communities and ecosystem. Spark has great influence in the engineering community but little appeal in the AI field, so it has no way of winning AI practitioners over with something disruptive. MLflow does not change their existing habits and workflows. Instead, it provides some auxiliary tools and standards, solves some pain points, and permeates slowly, bringing about change step by step. Of course, it is also entirely possible that in the end it makes no waves at all.

The challenges behind Spark’s success

The first is the rise of the AI wave, which is both a crisis and an opportunity for Spark. Databricks calls itself an AI company these days, and you might wonder why Databricks, the company behind Spark, is pushing into AI instead of sticking to its data-processing strengths. Leaving aside the capital markets and the wider technology wave, the biggest problem is that in the future AI frameworks will force the hand of data processing frameworks. AI frameworks are likely to grow their own data processing capabilities; TensorFlow, for example, has greatly enhanced tf.data to facilitate data processing. If Spark is not proactive now, it will be on the defensive later.

The second is streaming. The streaming era has arrived, yet Spark has been slow to move in this field. While strengthening its advantages in batch processing, Spark has lost its edge in streaming, and many companies (especially cloud companies such as Alibaba Cloud and Huawei) have turned to Flink. Back in 2016 I was already stressing the importance of stream computing, for example in the article “Data Is Inherently Streaming” (https://www.jianshu.com/p/9574e359ce35), and I set up a dedicated project for this purpose. As a result, Spark now faces new challenges in traditional data processing as well.

Getting heavier and heavier

Spark has also proposed Project Hydrogen, a design that lets Spark integrate better with deep learning frameworks. In some ways this was a forced reaction to the situation, but in fact it is a necessary response given the shift toward AI.

Still the king

Spark is still the best tool I’ve ever used and still has the best ecosystem. It’s easy to do a lot of things based on it.

Looking ahead

In fact, I think that adapting to AI does not necessarily mean integrating AI frameworks. As mentioned earlier, in the future AI frameworks will force the hand of data processing frameworks. As long as Spark can do better data preprocessing for AI, become the de facto standard, and adapt to the mainstream AI frameworks, it will surely have a new moat. To take the simplest example, Spark 2.3 already supports image processing, but the support is still quite problematic and could be better. Could it support tensors as well? In short, the best strategy is to squeeze the boundaries of the AI frameworks and secure Spark’s absolute dominance in data processing. In actual use, I have found that a lot of data preprocessing is inconvenient to do in Spark, so we have to fall back on the functions in AI algorithm libraries.

Of course, the other priority is to accelerate the development of streaming and to increase publicity and investment in this area, so that Spark stays ahead in the next stage of data processing.

Original link:

https://www.jianshu.com/p/2dc96dfc89c8
