About the author: Hu Xi

Hu Xi holds a master's degree in computer science from Beihang University and currently works at an Internet finance company. An open source technology enthusiast, he previously worked at IBM, Sogou, Weibo, and other companies. He has a deep understanding of Kafka and other open source stream processing technologies and frameworks, is a senior Kafka code contributor in China, and has studied Kafka's principles, internals, and application development in depth.

  • Introduction

The purpose of this article is to lay out a clear learning route for big data beginners and help them start their learning journey. Given the sheer variety of technologies in the big data field, every beginner should have a learning path suited to his or her own situation.

There are many definitions of big data; the most authoritative is probably IBM's, which readers can look up for themselves, so I will not repeat it here. Since this article focuses on how to learn big data, it is important to first define the different roles in the field, so that you can plan your learning based on your own situation.

  • Roles

In my humble opinion, there are two types of roles in the current big data industry.

  1. Big data engineering

  2. Big data analysis

These two roles are interdependent but operate independently. What does that mean? Without big data engineering, big data analysis is out of the question; but without big data analysis, there is no reason for big data engineering to exist. It is a bit like love and marriage: as the saying goes, dating without the intention of getting married is just fooling around.

Specifically, big data engineering solves the problems of defining, collecting, computing, and storing data. When designing and deploying such systems, big data engineers mainly consider high data availability: the engineering system needs to provide reliable, real-time data services to downstream business or analysis systems. Big data analysis, on the other hand, is concerned with how to use the data: it takes the data produced by the engineering systems and turns it into analytical output for the enterprise or organization, output that genuinely helps the company improve its business or level of service. For big data analysts, the primary problem is therefore to discover and exploit the value of the data, which may include trend analysis, modeling, and predictive analytics.

To summarize, the big data engineering role is concerned with the collection, computation (or processing), and storage of data, while the big data analyst role performs higher-level computations on that data.

  • Which role do we belong to?

Now that we know how roles in the big data field are classified, we naturally need to find our own seat and determine our position, so that we can learn big data in a targeted way. Two factors need to be considered here.

  1. Professional Knowledge Background

  2. Experience in industry

Professional knowledge background here does not mean your degree or the university you attended, but your grasp of certain IT technologies. Even Dennis Ritchie, the father of C, would not look down on you for not being a computer science major. So the expertise I have in mind really comes down to the following two areas.

  1. Computer expertise, such as operating systems, programming languages, how computers work, etc.

  2. Mathematics, by which I mean advanced mathematics: calculus, probability and statistics, linear algebra, and discrete mathematics, not equations like x×x + y×y = 1.

Industry experience refers to your work experience in a relevant field, which can be divided into three categories.

  1. Amateurs.

  2. An engineer with some experience.

  3. Senior experts, who in big data circles these days go by a cooler name, data scientists, such as Dr. Andrew Ng, former chief scientist at Baidu.

OK, now we can position ourselves according to the categories above. For example, I would describe myself as follows: "an engineer with a computer science background, with some knowledge of mathematics (especially calculus and linear algebra), but mathematical statistics and probability theory are not my strong points." Also, there is no need to pretend to be something you are not: if you have no prior experience, it is perfectly fine to admit you are a rookie. The key is to find your own position.

Once you’ve defined your position, you need to be able to fit into a specific big data role. Here are some basic rules.

  1. If you have a solid programming foundation and a good understanding of how computer systems and the underlying technologies of the Internet work, but not a deep grasp of mathematics and statistics, then big data engineering may be the direction for you.

  2. If you have a programming background (in some high-level language, such as Python) and a strong mathematical background, then big data analysis may be the direction to work toward.

  • Learning path

Whichever role you choose, there is a body of big data theory that you must master, including but not limited to the following.

  • Data sharding and routing: pick a typical partitioning algorithm to learn, such as consistent hashing (https://en.wikipedia.org/wiki/Consistent_hashing); a toy sketch appears right after this list.

  • Backup (replication) mechanisms and consistency.

√ The CAP theorem (https://en.wikipedia.org/wiki/CAP_theorem), regarded as "the bible" both in China and abroad.

√ Idempotence (idempotent operations), the cornerstone of state management in many distributed systems (https://mortoray.com/2014/09/05/what-is-an-idempotent-function/).

√ Various consistency models: strong consistency, weak consistency, and eventual consistency.

√ Replication: the master-slave model has fallen out of fashion; the trendier term these days is leader-follower.

√ Consensus protocols: learn the widely used Paxos and Raft protocols.

  • Algorithms and data structures.

√ LSM trees: learn how they differ from B+ trees and what their advantages are.

√ Compression algorithms: understand a mainstream compression algorithm such as Snappy or LZ4. In addition, Facebook recently open sourced a new compression algorithm, Zstandard, which claims to outperform the major existing algorithms.

√ Bloom filter: a constant-time, space-efficient filter for big data scenarios.
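
To make the consistent hashing item above concrete, here is a toy Java sketch (the class name, the use of CRC32 as the hash, and the virtual-node count are my own choices for illustration, not a production implementation): nodes are hashed onto a ring, and a key is routed to the first node clockwise from its hash, so adding or removing a node only remaps keys in that node's neighborhood.

```java
import java.nio.charset.StandardCharsets;
import java.util.SortedMap;
import java.util.TreeMap;
import java.util.zip.CRC32;

// A toy consistent-hash ring: keys go to the first node clockwise from their hash.
public class ConsistentHashRing {
    private final TreeMap<Long, String> ring = new TreeMap<>();
    private static final int VIRTUAL_NODES = 100;   // virtual nodes smooth out the distribution

    private long hash(String s) {
        CRC32 crc = new CRC32();                     // CRC32 stands in for a real hash function
        crc.update(s.getBytes(StandardCharsets.UTF_8));
        return crc.getValue();
    }

    public void addNode(String node) {
        for (int i = 0; i < VIRTUAL_NODES; i++) {
            ring.put(hash(node + "#" + i), node);    // place each virtual node on the ring
        }
    }

    public String route(String key) {
        SortedMap<Long, String> tail = ring.tailMap(hash(key));
        // Wrap around to the beginning of the ring if we fall past the last node.
        return tail.isEmpty() ? ring.firstEntry().getValue() : tail.get(tail.firstKey());
    }

    public static void main(String[] args) {
        ConsistentHashRing r = new ConsistentHashRing();
        r.addNode("node-A");
        r.addNode("node-B");
        r.addNode("node-C");
        System.out.println("user42 -> " + r.route("user42"));
    }
}
```

Real implementations usually use a stronger hash function and tune the number of virtual nodes per physical node, but the routing idea is the same.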

This theoretical knowledge is essential for both big data engineering and big data analysis, as it underpins the design of many distributed systems. Below, we lay out different learning routes for the two roles.

 

  • Big data Engineer

As a big data engineer, you should have at least the following skills.

  • A JVM language: JVM languages account for a very large share of the current big data ecosystem; to some extent they hold a near monopoly. I recommend learning Java or Scala; Clojure is comparatively difficult to pick up. Also, in an age where "the mother is honored through the child," a successful big data framework tends to make its implementation language popular, as Docker did for Go and Kafka did for Scala. So I recommend becoming proficient in at least one JVM language. It is worth emphasizing that you must understand that language's threading model and memory model: the processing model of many big data frameworks resembles multi-threaded processing at the language level, extended to the distributed, multi-machine setting.

My advice: Learn Java or Scala.
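
To illustrate the point above about language-level multi-threading resembling distributed data processing, here is a minimal Java sketch (the class name and data are made up purely for illustration): each task processes one "partition" of the input on its own thread, and the partial results are then gathered, much as a big data framework assigns partitions to worker nodes and merges their outputs.

```java
import java.util.Arrays;
import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.stream.Collectors;

public class ParallelWordCount {
    public static void main(String[] args) throws Exception {
        List<String> partitions = Arrays.asList("a b a", "b c", "a c c");
        ExecutorService pool = Executors.newFixedThreadPool(3);

        // Each "partition" (here, one line) is handled by an independent task,
        // analogous to a framework assigning data partitions to workers.
        List<Future<Long>> futures = partitions.stream()
                .map(line -> pool.submit(() -> (long) line.split("\\s+").length))
                .collect(Collectors.toList());

        long total = 0;
        for (Future<Long> f : futures) {
            total += f.get();   // gather partial results, akin to a reduce step
        }
        System.out.println("total words = " + total);
        pool.shutdown();
    }
}
```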

  • Computation/processing frameworks: strictly speaking, these split into offline batch processing and stream processing. Stream processing is the future trend and I strongly suggest learning it; offline batch processing is gradually becoming outdated, since its batch-oriented model cannot handle unbounded data sets, so its applicability keeps shrinking. In fact, Google has abandoned MapReduce-style offline processing internally. Therefore, if you want to learn big data engineering, it is necessary to master a real-time stream processing framework. Current mainstream choices include Apache Samza, Apache Storm, Apache Spark Streaming, and Apache Flink, which has been gaining momentum over the past year. Apache Kafka has also launched its own stream processing library, Kafka Streams.

My advice: Learn one of Flink, Spark Streaming, or Kafka Streams. Also, read Google's "The World Beyond Batch: Streaming 101" at https://www.oreilly.com/ideas/the-world-beyond-batch-streaming-101.
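
For a taste of what stream processing code looks like, here is a minimal Kafka Streams sketch (the application id and topic names are placeholders I made up): it reads records from an input topic, upper-cases each value, and writes the results to an output topic.

```java
import java.util.Properties;
import org.apache.kafka.common.serialization.Serdes;
import org.apache.kafka.streams.KafkaStreams;
import org.apache.kafka.streams.StreamsBuilder;
import org.apache.kafka.streams.StreamsConfig;
import org.apache.kafka.streams.kstream.KStream;

public class UppercaseApp {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put(StreamsConfig.APPLICATION_ID_CONFIG, "uppercase-demo");     // placeholder app id
        props.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");  // placeholder broker
        props.put(StreamsConfig.DEFAULT_KEY_SERDE_CLASS_CONFIG, Serdes.String().getClass());
        props.put(StreamsConfig.DEFAULT_VALUE_SERDE_CLASS_CONFIG, Serdes.String().getClass());

        StreamsBuilder builder = new StreamsBuilder();
        KStream<String, String> input = builder.stream("input-topic");        // placeholder topic
        input.mapValues(v -> v.toUpperCase()).to("output-topic");             // transform and forward

        KafkaStreams streams = new KafkaStreams(builder.build(), props);
        streams.start();                                                      // runs until shutdown
        Runtime.getRuntime().addShutdownHook(new Thread(streams::close));
    }
}
```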

  • Distributed storage frameworks: while MapReduce is somewhat dated, Hadoop's other cornerstone, HDFS, is still going strong and remains the most popular distributed storage framework in the open source community, so it is definitely worth your time. If you want to dig deeper, the Google GFS paper is a must-read (https://static.googleusercontent.com/media/research.google.com/en//archive/gfs-sosp2003.pdf). Of course, there are many other distributed storage frameworks in the open source world; Alibaba's OceanBase is an excellent example from China.

My advice: Learn HDFS.
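
To get a feel for the HDFS programming model, here is a minimal sketch using Hadoop's Java FileSystem API (the NameNode address and file path are placeholders): it writes a small file and then checks that it exists.

```java
import java.nio.charset.StandardCharsets;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsQuickStart {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        conf.set("fs.defaultFS", "hdfs://localhost:9000");    // placeholder NameNode address

        try (FileSystem fs = FileSystem.get(conf)) {
            Path path = new Path("/demo/hello.txt");           // placeholder path
            // Write a small file; HDFS splits large files into blocks and replicates them.
            try (FSDataOutputStream out = fs.create(path, true)) {
                out.write("hello hdfs".getBytes(StandardCharsets.UTF_8));
            }
            System.out.println("exists: " + fs.exists(path));
        }
    }
}
```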

  • Resource scheduling frameworks: Docker has been on fire for the past year or two, and various companies are building container solutions on top of it. The best-known open source container scheduling framework is Kubernetes, but there are also Hadoop YARN and Apache Mesos; the latter two can schedule not only container clusters but also non-container workloads, which makes them well worth learning.

My advice: Learn YARN.

  • Distributed coordination frameworks: all major distributed big data frameworks need a common set of facilities such as service discovery, leader election, distributed locks, and KV storage, and this demand has given rise to distributed coordination frameworks. The oldest and most famous is Apache ZooKeeper; newer options include Consul and etcd. If you are learning big data engineering, you cannot get around distributed coordination frameworks, and to some extent you should understand one of them deeply.

My advice: Learn ZooKeeper — many big data frameworks require it, such as Kafka, Storm, HBase, etc.
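
As a small illustration of the ZooKeeper client API, here is a sketch (the connection string and znode path are placeholders) that connects to a server and creates an ephemeral sequential znode, the primitive on which leader election and service discovery are typically built: the znode disappears automatically when the client's session ends.

```java
import java.util.concurrent.CountDownLatch;
import org.apache.zookeeper.CreateMode;
import org.apache.zookeeper.ZooDefs;
import org.apache.zookeeper.ZooKeeper;

public class ZkEphemeralDemo {
    public static void main(String[] args) throws Exception {
        CountDownLatch connected = new CountDownLatch(1);
        // Placeholder connection string; 15-second session timeout.
        ZooKeeper zk = new ZooKeeper("localhost:2181", 15000,
                event -> connected.countDown());
        connected.await();   // wait until the session is established

        // An ephemeral sequential znode vanishes when the session ends, which is
        // exactly what leader election and service registration rely on.
        String path = zk.create("/worker-", "payload".getBytes(),
                ZooDefs.Ids.OPEN_ACL_UNSAFE, CreateMode.EPHEMERAL_SEQUENTIAL);
        System.out.println("created " + path);
        zk.close();
    }
}
```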

  • KV stores: typical examples are Memcached and Redis, with Redis in particular growing fast. Redis's simple API design and high throughput have won it an ever larger following. Even if you are not learning big data, learning Redis is worthwhile.

My advice: Learn Redis. If your C fundamentals are good, it is best to read the source code as well; there is not a lot of it.
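
To show how simple the Redis API is in practice, here is a sketch using the Jedis Java client (the host, port, and key names are placeholders and assume a local Redis server is running).

```java
import redis.clients.jedis.Jedis;

public class RedisQuickStart {
    public static void main(String[] args) {
        // Placeholder host/port; assumes a Redis server is listening locally.
        try (Jedis jedis = new Jedis("localhost", 6379)) {
            jedis.set("page:home:views", "0");        // plain string key/value
            jedis.incr("page:home:views");            // atomic counter increment
            jedis.expire("page:home:views", 3600);    // TTL in seconds
            System.out.println(jedis.get("page:home:views"));   // prints "1"
        }
    }
}
```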

  • Column-oriented databases: I have spent a lot of time working with Oracle, but I have to admit that relational databases are losing ground, and there are now many alternatives to the RDBMS. Column-oriented storage was developed to address the fact that row-oriented storage is poorly suited to ad-hoc queries over big data, and the classic open source column-oriented database is HBase. The concept actually originated in Google's BigTable paper, which is well worth reading if you are interested (https://static.googleusercontent.com/media/research.google.com/en//archive/bigtable-osdi06.pdf).

My advice: Learn HBase, which is the most widely used open source column storage.
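
Here is a minimal sketch of the HBase Java client API (the table, column family, and qualifier names are placeholders, and the table is assumed to already exist): it writes one cell by row key and reads it back.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.client.Get;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.Table;
import org.apache.hadoop.hbase.util.Bytes;

public class HBaseQuickStart {
    public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create();   // reads hbase-site.xml from the classpath
        try (Connection conn = ConnectionFactory.createConnection(conf);
             Table table = conn.getTable(TableName.valueOf("user_profile"))) {  // placeholder table

            // Write one cell: row key "u001", column family "info", qualifier "city".
            Put put = new Put(Bytes.toBytes("u001"));
            put.addColumn(Bytes.toBytes("info"), Bytes.toBytes("city"), Bytes.toBytes("Beijing"));
            table.put(put);

            // Read it back by row key.
            Result result = table.get(new Get(Bytes.toBytes("u001")));
            byte[] city = result.getValue(Bytes.toBytes("info"), Bytes.toBytes("city"));
            System.out.println(Bytes.toString(city));
        }
    }
}
```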

  • Message queues: as the main tool for "shaving peaks and filling valleys," that is, smoothing out traffic spikes, a message queue is indispensable in big data engineering. There are many options in this space, including ActiveMQ and Kafka; in China, Alibaba has also open sourced RocketMQ. The leader of the pack is Apache Kafka, many of whose design ideas fit distributed stream processing particularly well. No wonder Jay Kreps, Kafka's original author, has become one of the leading figures in real-time stream processing today.

My advice: Learn Kafka, not only because it helps you find a job (almost every big data job posting asks for Kafka :-), but also to deepen your understanding of the data processing paradigm built on replicated commit logs.
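
To show how little code it takes to publish data to Kafka, here is a minimal producer sketch (the broker address and topic name are placeholders).

```java
import java.util.Properties;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.Producer;
import org.apache.kafka.clients.producer.ProducerRecord;

public class KafkaProducerQuickStart {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092");   // placeholder broker address
        props.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer");
        props.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer");
        props.put("acks", "all");   // wait for all in-sync replicas, trading latency for durability

        try (Producer<String, String> producer = new KafkaProducer<>(props)) {
            for (int i = 0; i < 10; i++) {
                // Records with the same key land in the same partition, preserving their order.
                producer.send(new ProducerRecord<>("events", "user-" + (i % 3), "event-" + i));
            }
        }   // close() flushes any records still buffered in memory
    }
}
```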

  • Big data analyst or data scientist

To become a data scientist, you need to have at least the following skills.

  • Math skills: calculus is a must. You do not need to master multivariable calculus, but you must be comfortable with and able to apply single-variable calculus. Linear algebra is also essential, especially matrix operations, vector spaces, rank, and related concepts. Many computations in today's machine learning frameworks boil down to matrix multiplication, transposition, or inversion; even though frameworks provide these operations out of the box, you should at least understand the underlying principles, such as how to determine efficiently whether a matrix is invertible and how to compute its inverse.

My advice: review the Tongji University edition of Advanced Mathematics, and if you can, take the University of Pennsylvania calculus course on Coursera. For linear algebra I recommend Gilbert Strang's Introduction to Linear Algebra, one of the most classic textbooks.
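
As a refresher on the invertibility question mentioned above, here are the standard textbook facts in formula form (nothing here is specific to any particular framework):

```latex
% A square matrix A is invertible iff its determinant is nonzero:
%   A \in \mathbb{R}^{n \times n} \text{ is invertible} \iff \det(A) \neq 0
%
% In the 2x2 case the inverse has a closed form:
A = \begin{pmatrix} a & b \\ c & d \end{pmatrix}, \qquad
\det(A) = ad - bc, \qquad
A^{-1} = \frac{1}{ad - bc} \begin{pmatrix} d & -b \\ -c & a \end{pmatrix},
\quad ad - bc \neq 0
```

In practice, libraries rarely test the determinant directly; they usually rely on a factorization such as LU to decide invertibility and to solve linear systems efficiently.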

  • Probability and statistics: you should have a working knowledge of probability theory and common statistical methods, for example how to compute a Bayesian probability, or what a probability distribution is. Mastery is not required, but an understanding of the background and terminology is essential.

My advice: find a copy of a probability theory textbook and work through it again.
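
Since the text asks how to calculate a Bayesian probability, here is the core formula with a tiny worked example (the numbers are invented purely for illustration):

```latex
% Bayes' theorem:
P(A \mid B) = \frac{P(B \mid A)\, P(A)}{P(B)}

% Hypothetical example: a test detects a condition with
% P(+ \mid \text{sick}) = 0.99, \quad P(+ \mid \text{healthy}) = 0.05, \quad P(\text{sick}) = 0.01.
% The probability of actually being sick given a positive result is then
P(\text{sick} \mid +)
  = \frac{0.99 \times 0.01}{0.99 \times 0.01 + 0.05 \times 0.99}
  \approx 0.167
```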

  • Interactive data analysis frameworks: I do not mean plain SQL or database queries here, but interactive analysis frameworks such as Apache Hive and Apache Kylin. There are many frameworks of this kind in the open source community that let you apply traditional data analysis and data mining methods to big data. The ones I have experience with are Hive and Kylin. Hive, especially Hive 1.x, runs on MapReduce, so its performance is not particularly good. Kylin combines the data cube concept with a star schema to deliver very low latency analysis, and as the first Apache incubator project driven by a Chinese development team it is attracting more and more attention.

My advice: Learn Hive first, and if you have time, learn about Kylin and the data mining ideas behind it.
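
Hive is queried with HiveQL, which reads like ordinary SQL; here is a minimal sketch of issuing a query from Java over Hive's JDBC interface (the connection URL, credentials, table, and columns are placeholders, and a running HiveServer2 is assumed).

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

public class HiveQueryQuickStart {
    public static void main(String[] args) throws Exception {
        // Requires the Hive JDBC driver jar on the classpath.
        Class.forName("org.apache.hive.jdbc.HiveDriver");
        // Placeholder HiveServer2 URL, user, and password.
        String url = "jdbc:hive2://localhost:10000/default";
        try (Connection conn = DriverManager.getConnection(url, "hive", "");
             Statement stmt = conn.createStatement();
             // Hypothetical table: count page views per day.
             ResultSet rs = stmt.executeQuery(
                     "SELECT dt, COUNT(*) AS pv FROM page_view GROUP BY dt")) {
            while (rs.next()) {
                System.out.println(rs.getString("dt") + "\t" + rs.getLong("pv"));
            }
        }
    }
}
```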

  • Machine learning frameworks: machine learning is extremely hot right now and everyone talks about machine learning and AI, but I keep thinking that machine learning today is like cloud computing a few years ago: very hot, yet with few projects actually landed, and it may take a few more years to mature. Still, it never hurts to build up your machine learning knowledge now. As for frameworks, there are many familiar names, such as TensorFlow, Caffe, Keras, CNTK, and Torch7, with TensorFlow the clear leader.

My advice: pick one machine learning framework to learn, but since most of these frameworks merely package up machine learning algorithms for you to call, it is better to start from the principles of the algorithms themselves. For example:

√ Udacity's very approachable introductory machine learning course (https://classroom.udacity.com/courses/ud120).

√ The most outstanding introductory course in the field, Dr. Andrew Ng's Machine Learning.

√ Python's scikit-learn library (http://scikit-learn.org/stable/).

√ A domestic textbook such as Zhou Zhihua's Machine Learning.

√ Dr. Ng's forthcoming book on machine learning, which is worth looking out for.

√ Finally, once you feel you have got the hang of it, try Kaggle (https://www.kaggle.com).

  • Conclusion

These are my thoughts and suggestions on a big data learning route; I hope they are helpful to readers.

— — —

This article is from the WeChat public account "Big Data Kafka Technical Analysis" (account ID: GH_3AF7B8516EEC). Please check out the author's upcoming book, Apache Kafka in Action. Based on Apache Kafka 1.0.0 and written by a Kafka contributor, the book starts from Kafka's basic concepts and features and then covers deployment, development, operations, monitoring, debugging, and tuning in detail, along with the design principles of its key components, illustrated with detailed cases. It serves both as an introduction to Kafka and as a reference for system architects and front-line development engineers. If you still have doubts about the value of big data, you can read Data-Driven: From Method to Practice to see how companies and individuals can achieve decisive growth in a data-driven future.

  • Apache Kafka in Action is a practical guide and reference covering all aspects of Apache Kafka. Drawing on typical usage scenarios, the author explains Kafka's entire technical stack comprehensively, so that readers can relate it to their own systems and apply it directly in practice, and also discusses the design principles of Kafka and its stream processing component in depth, with detailed case studies. The book has 10 chapters: Chapter 1 introduces message engine systems and Kafka's basic concepts and features, quickly bringing readers into the world of Kafka; Chapter 2 briefly reviews the history of Apache Kafka; Chapter 3 explains how to set up a Kafka cluster environment in detail; Chapters 4 and 5 discuss the use of the Kafka clients in depth; Chapter 6 walks readers through Kafka's internal design principles; Chapters 7 to 9 explain Kafka cluster management, monitoring, and tuning with examples; Chapter 10 introduces Kafka's new stream processing component. The book is suitable for anyone interested in cloud computing and big data processing, and especially for readers interested in message engines and stream processing technologies and frameworks.
