NO.1 How do I get started with big data from zero?

A: Learning anything is the same: there is a hurdle at the beginning. I love reading books, especially ones that make it easy to get started. For big data, my own research direction is machine learning applied to large-scale data, so I first had to master the following basics:

* Calculus (derivatives, extreme values, limits)
* Linear algebra (matrix representation, matrix computation, eigenvalues, eigenvectors)
* Probability theory and statistics (much of data-analysis modeling is built on statistical models), statistical inference, stochastic processes
* Linear programming and convex optimization, nonlinear programming, etc.
* Numerical computation, numerical linear algebra, etc.

At first, calculus, linear algebra, and introductory probability theory are enough to begin machine learning. I strongly recommend a few books. They don’t need to be read cover to cover; you only need to master the important knowledge points (differentiation, matrix computation, conditional probability) well enough to use them as tools, which lays the groundwork for later machine learning and big data study: “The Big Data Era”, “Simple Data Analysis”, “Simple Statistics”.
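To make one of those knowledge points concrete, here is conditional probability in the form of Bayes’ rule, the identity behind much statistical modeling (my own illustrative example, not taken from the books above):

```latex
% Bayes' rule: the conditional-probability identity behind much of
% statistical modeling (e.g., naive Bayes classifiers).
P(A \mid B) = \frac{P(B \mid A)\, P(A)}{P(B)},
\qquad \text{where} \qquad
P(B) = \sum_{i} P(B \mid A_i)\, P(A_i).
```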

Big data tools

The best programming language for learning big data is certainly Java, though Scala works as well. Back then I laid out the following plan for myself, which you can use for reference.

Initial big data study. Content: Hadoop distributed systems. Objectives: Hadoop, HDFS, MapReduce, YARN. Completed projects: real-time movie box-office statistics; anti-crawler protection for a J2EE website.

Big data databases. Content: Hive + HBase. Objectives: Hive, HBase. Completed project: analyzing customer risk levels from user-behavior data with Hive and HBase.

Real-time data collection and processing. Objectives: Flume, Kafka, Storm, data visualization. Completed project: real-time computation with visualized output.

Spark data analysis. Objectives: Scala, Spark. Completed project: real-time box-office calculation.

After all of the above training, you will be ready to make your own study plan. For machine learning courses, I strongly recommend Professor Zhou Zhihua’s “watermelon book”, “Machine Learning”; I also read Professor Li Hang’s “Statistical Learning Methods”. Neither book is difficult, yet together they cover most industrial application scenarios, and they are excellent Chinese-language textbooks, well worth reading. With learning, the best time to start was ten years ago; the second-best time is now. Never feel it is too late.

NO.2 What is big data mainly used for?

A: Big data is used very widely, especially now that cloud computing and big data centers are in place. Using big data instead of manual judgment can better solve many practical problems. Its main uses are in fields everyone can see: health care, transportation, finance, social networking, and so on. Some examples:

1. E-commerce. The e-commerce industry was the first to use big data for precision marketing. It plans production materials and logistics in advance according to customers’ consumption habits, which supports finer-grained mass production. Because e-commerce data is concentrated, large in volume, and rich in type, its future applications leave great room for imagination: predicting trends, consumer behavior and regional characteristics, correlations among customers’ consumption habits, all kinds of consumer behavior, consumption hot spots, the key factors that influence consumption, and so on.

2. Finance. Big data is widely used in the financial industry, mostly in trading. Many equity trades are now driven by big data algorithms, which increasingly take social media and website news into account when deciding whether to buy or sell in the next few seconds.

3. Medical institutions. Whether in pathology reports, treatment plans, or drugs, the medical industry generates relatively large amounts of data. Since many viruses and tumor cells are constantly evolving, reaching a definite diagnosis and treatment is difficult. In the future, we can use big data platforms to collect different cases and treatment plans, together with patients’ basic characteristics, and build databases keyed to the characteristics of each disease.

4. Agriculture, animal husbandry, and fisheries. Applied to these fields, big data can help agriculture reduce the chance that a glut of cheap vegetables hurts farmers, accurately predict weather changes, help farmers guard against natural disasters, and improve yield per unit of planted area. Herders can plan grazing ranges based on big data analysis, using pastures effectively and reducing livestock losses. Fishermen can use big data to schedule fishing bans, locate catches, and reduce casualties.

5. Biotechnology. Gene technology is an important weapon in the future fight against disease. With big data technology, scientists can speed up the study of our own genes and those of other animals, which will be one of the important weapons for conquering disease. In the future, genetic technology may not only improve crops but also be used to grow human organs and eliminate pests.

Big data is also being used to improve the cities we live in every day, for example by optimizing traffic flows based on real-time urban traffic information, social network data, and weather data. Many cities are currently running big data analysis pilot projects.

Big data is now also widely used in security and law enforcement. As everyone knows, the NSA uses big data to fight terrorism and even to monitor people’s daily lives. Companies use big data technology to defend against cyber attacks, police use big data tools to catch criminals, and credit card companies use big data tools to stop fraudulent transactions.

Big data will also play a big role in traditional fields: helping agriculture farm with ultra-fine precision based on environment, climate, soil, and crop conditions; grasping the balance of supply and demand in industrial production and finding innovative growth points; assisting drivers intelligently, and eventually driving for them, in transportation, making traffic jams and accidents history; and giving the energy industry accurate forecasting and real-time production control.

Every aspect of an individual’s life will be captured as real-time data: diet, health, travel, home, shopping, social activity. Big data services will be widely used to dramatically raise customers’ quality of life. Every service will be delivered in a personally tailored form: for each “you”, every action will be met with an intelligent decision based on historical and real-time data.

NO.3 Big data and fintech?

Answer one: First of all, how is your math? If you are good at math, have an engineering background, and think in terms of data, this major is a very good choice, for data has become the new engine of economic growth, much as oil and gas were for the past century. Any job is stressful. Big data and fintech are not purely programming disciplines; they are primarily about data processing, and the ability to collect and analyze data is central. If you have the ability to enter this major, my personal advice is to seize the opportunity.

Answer two: In the fintech 2.0 era there are three magic weapons, a troika: big data, artificial intelligence, and blockchain technology. Let me introduce them in turn:

  1. Big data mainly comes from the Internet of Things and the Internet. The typical method is to use big data to build user portraits, divided into static life portraits and dynamic behavior portraits, in order to find target customer groups; then to use big data to reshape the credit-scoring system, identifying low-risk customers through big data credit scoring; and finally to carry out precision marketing.

  2. Artificial intelligence mainly uses machine learning to do work beyond the human brain’s reach. For example, searching the entire history of thousands of stocks for a candlestick (K-line) trend similar to the stock you care about; for this, you can refer to the app “Ulibao”.

  3. Blockchain technology is the most formidable of the three. Bitcoin is just one of its applications. The most important goal is to realize “smart contracts” that are open, supervisable, and self-enforcing.

Big data is only one aspect of fintech. I recommend studying blockchain technology as well; it has even broader applications than AI.

NO.4 Does Hadoop have any advantages over Spark?

Here are three answers from different perspectives. I think they are good, so I have collected them together.

Answer one: Hadoop

Hadoop solves the problem of reliable storage and processing of big data. Hadoop currently consists of two main frameworks:

HDFS: provides highly reliable file storage on clusters of commodity PCs. It saves each block as multiple copies to guard against server or disk failures, stores data with low power consumption and high performance, and is optimized for big data types and read throughput.

YARN (the computation resource layer): can host any number of application frameworks. The original framework is MR (MapReduce), which, through the Mapper and Reducer abstractions, provides a programming model that processes large data sets concurrently and in a distributed way on unreliable clusters of one to hundreds of PCs, while hiding details such as concurrency, distribution, and fault recovery.
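To make the replication idea concrete, here is a minimal client-side sketch (my own illustration; the NameNode address, path, and file contents are placeholders) that writes a file with a replication factor of 3 through the HDFS API, called from Scala:

```scala
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.{FileSystem, Path}

object HdfsReplicationDemo {
  def main(args: Array[String]): Unit = {
    val conf = new Configuration()
    conf.set("fs.defaultFS", "hdfs://namenode:8020") // placeholder address

    val fs   = FileSystem.get(conf)
    val path = new Path("/demo/box-office.txt")      // placeholder path

    // Ask HDFS to keep 3 copies of each block, so the file survives the
    // failure of a single DataNode server or disk.
    val out = fs.create(path, 3.toShort)
    out.writeBytes("movie,tickets\nExample,1000000\n")
    out.close()

    println(s"replication = ${fs.getFileStatus(path).getReplication}")
    fs.close()
  }
}
```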

Limitations of Hadoop MapReduce:

1. The level of abstraction is low; everything must be hand-coded, so it is hard to get started.
2. It provides only two operations, Map and Reduce, and lacks expressive power.
3. A Job has only two phases, Map and Reduce. Complex computations require a large number of Jobs, and the dependencies between Jobs must be managed by the developers themselves.
4. The processing logic is hidden in code details; there is no overall view of the logic.
5. Intermediate results are also stored in the HDFS file system.
6. A ReduceTask must wait for all MapTasks to complete, which can stretch out latency.
7. It is suited only to batch processing; support for interactive and real-time data processing is insufficient, and performance on iterative data processing is poor.

Spark

Apache Spark is an emerging engine for big data processing. It provides a distributed memory abstraction over a cluster to support applications that need to reuse working sets of data.

That abstraction is the Resilient Distributed Dataset (RDD). An RDD is an immutable, partitioned collection of records, and it is also the programming model of Spark. Spark provides two kinds of operations on RDDs: transformations and actions. Transformations define a new RDD and include map, flatMap, filter, union, sample, join, groupByKey, cogroup, reduceByKey, cross, sortByKey, mapValues, and so on; actions return a result and include collect, reduce, count, save, and lookup. In Spark, all RDD transformations are lazily evaluated: a transformation produces a new RDD whose data depends on the data of the original RDD, and every RDD consists of multiple partitions. A program, then, essentially builds a directed acyclic graph (DAG) of interdependent RDDs, and performing an action on an RDD submits this DAG to Spark as a Job.
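As a minimal sketch of this model (spark-shell style, where `sc` is the predefined SparkContext; the input strings are made up), the classic word count chains three transformations and triggers them with one action:

```scala
// Transformations (flatMap, map, reduceByKey) only define new RDDs;
// nothing executes yet, because evaluation is lazy.
val lines = sc.parallelize(Seq("spark is fast", "hadoop is reliable"))
val counts = lines
  .flatMap(_.split(" "))   // split lines into words
  .map(word => (word, 1))  // pair each word with a count of 1
  .reduceByKey(_ + _)      // sum counts per word (wide dependency)

// The action submits the DAG built above to Spark as one Job.
counts.collect().foreach(println)
```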

Spark schedules the DAG of a Job: it determines stages, partitions, pipelining, tasks, and caching, performs the necessary optimizations, and then runs the Job on the Spark cluster. Dependencies between RDDs are classified as wide dependencies (a partition depends on multiple parent partitions) and narrow dependencies (a partition depends on only one parent partition). Wide dependencies determine where stage boundaries fall; within a stage, tasks are divided by partition.
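A quick way to see where Spark draws the stage boundaries is `toDebugString`, which prints an RDD’s lineage. A sketch reusing the `counts` RDD from the word-count example above (the output shown in comments is indicative, not exact):

```scala
// map/flatMap are narrow dependencies (each output partition reads one
// input partition), so they are pipelined into a single stage;
// reduceByKey is a wide dependency, so it starts a new stage.
println(counts.toDebugString)
// Indicative output (each indentation level is a stage):
//   (2) ShuffledRDD[3] at reduceByKey ...
//    +-(2) MapPartitionsRDD[2] at map ...
//          MapPartitionsRDD[1] at flatMap ...
```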

Spark supports two different fault recovery methods: lineage, which consults the derivation history of the data and re-executes the earlier processing steps; and checkpoint, which stores the data set in persistent storage. Spark also provides better support for iterative data processing, since the data for each iteration can be kept in memory rather than written to files.
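To make the two recovery paths concrete, a small sketch continuing from the `counts` RDD above (the checkpoint directory and the iterative update formula are placeholders of my own):

```scala
// Two recovery options, sketched:
// - lineage: lost partitions are recomputed from their parents;
// - checkpoint: data is saved to reliable storage, truncating lineage.
sc.setCheckpointDir("/tmp/spark-checkpoints")    // placeholder directory

var ranks = counts.mapValues(_.toDouble).cache() // keep iterations in memory
for (_ <- 1 to 10) {
  ranks = ranks.mapValues(v => v * 0.85 + 0.15)  // stand-in iterative update
}
ranks.checkpoint() // request persistent storage for the final RDD
ranks.count()      // the action forces the computation and the checkpoint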

How Spark addresses those shortcomings of Hadoop MapReduce, point by point (a short code sketch follows the list):

1. Hadoop’s level of abstraction is low, and everything must be hand-coded, which makes it hard to use. => Spark programs are written against the RDD abstraction described above, which is far more concise.

2. It provides only two operations, Map and Reduce, and lacks expressive power. => Spark provides many transformations and actions; many basic operations, such as join and groupBy, are already implemented as RDD transformations and actions.

3. A Job has only two phases, Map and Reduce; complex computations require a large number of Jobs, whose dependencies developers must manage themselves. => A Spark Job can contain a whole DAG of RDD transformations split into many stages, and as long as the RDD partitioning is unchanged across consecutive map-like operations, they can be placed in the same Task.

4. The processing logic is hidden in code details, with no overall view. => In Scala, RDD transformations support a fluent API with anonymous functions and higher-order functions that gives a holistic view of the processing logic; the code does not contain the implementation details of each operation, so the logic is clearer.

5. Intermediate results are stored in the HDFS file system. => In Spark, intermediate results are kept in memory; if they do not fit, they are written to local disk rather than to HDFS.

6. A ReduceTask can start only after all MapTasks are completed. => Transformations on the same partition form a pipeline and run within one Task; only transformations across partitions require a Shuffle and are divided into different stages, which must wait for the preceding stages to complete.

7. Support for interactive and real-time data processing is insufficient. => Spark provides Discretized Streams, which process streaming data by splitting the stream into small batches.
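To illustrate points 1 through 5 in one place, here is a hedged sketch (spark-shell style; the data and names are made up): an aggregation plus a join, which would take several chained MapReduce jobs in Hadoop, is a few lines over RDDs, and the intermediate results never touch HDFS:

```scala
val users  = sc.parallelize(Seq((1, "ann"), (2, "bob")))          // (id, name)
val orders = sc.parallelize(Seq((1, 25.0), (1, 10.0), (2, 40.0))) // (id, amount)

// reduceByKey and join each imply a shuffle (so, several stages), but it
// is all one Job, and intermediate results stay in memory, not in HDFS.
val spendByName = orders
  .reduceByKey(_ + _)  // total spend per user id
  .join(users)         // (id, (total, name))
  .map { case (_, (total, name)) => (name, total) }

spendByName.collect().foreach(println)
```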

Answer two: First of all, why did Hadoop become popular? Because it solved a real pain point: the rapid accumulation of data from the Internet and the mobile Internet, which traditional data-analysis solutions could no longer handle.

In my article “Field Training Methods – On the Ideas and Adaptability of Software Architecture Design”, I argued that as information technology develops, many of the specific technologies used in software architecture become outdated or die out, but the patterns and ideas of software architecture design do not. Hadoop, as a specific technology for the pain points of big data analysis, cannot escape this fate either. When did Hadoop come along? It dates back to 2006. When did Spark come out? 2009. Obviously, Spark is an innovation built on the shoulders of the Hadoop giant: it can solve the problems Hadoop solves and, naturally, it compensates for Hadoop’s shortcomings. In this respect, therefore, Hadoop has no advantage over Spark.

Does Hadoop have any advantage over Spark at all? The answer is yes. After the rapid development of the Internet, and especially the mobile Internet and the Internet of Things, data volumes have grown exponentially, and the trend continues. This real demand keeps big data analysis on the rise, so Hadoop is also still on the rise. Since both Hadoop and Spark are rising, Hadoop, as the first to emerge, enjoys a first-mover advantage: its practical application in industry is wider, its ecosystem develops faster and more completely, and new complementary solutions keep filling its gaps. You can see this among Apache’s top-level projects, where close to 20 relate to the Hadoop ecosystem, compared with three or four related to Spark.

Enterprises select technology on the premise of solving their current business pain points. So: choose a mature, well-rounded system that solves today’s pain points, or choose a relatively incomplete solution and invest extra resources in it? The answer is obvious, and that is Hadoop’s advantage over Spark.

Whether this advantage outweighs the Spark architecture’s ability to handle more data-analysis scenarios depends on the actual requirements an enterprise faces at the moment. Much of the time, using the two together is the better solution.


Answer three: Both Hadoop and Spark are big data frameworks that provide tools for performing common big data tasks. But they do not perform exactly the same tasks, nor are they mutually exclusive.

While Spark claims to be up to 100 times faster than Hadoop in certain cases, it does not have a distributed storage system of its own, and distributed storage is the foundation of many big data projects today: it can store petabyte-scale datasets across an almost unlimited number of ordinary hard disks, and it scales well, since you simply add disks as the datasets grow.

Therefore, Spark needs a third-party distributed storage layer. It is for this reason that many big data projects install Spark on top of Hadoop: that way, Spark’s advanced analytics applications can use the data stored in HDFS.

Spark’s real advantage over Hadoop is speed. Most of Spark’s operations happen in memory, whereas Hadoop’s MapReduce system writes all data back to physical storage after each operation. That write-back guarantees full recovery if something goes wrong, but Spark’s resilient distributed datasets achieve the same recoverability.

In addition, Spark outperforms Hadoop in advanced data processing such as real-time streaming and machine learning. In Bernard’s opinion, this, together with its speed advantage, is the real reason for Spark’s growing popularity. Real-time processing means data can be fed to an analytical application the moment it is captured, with immediate feedback. This kind of processing is used in more and more big data applications, for example recommendation engines at retailers and performance monitoring of industrial machinery in manufacturing. The Spark platform’s speed and streaming capability are also well suited to machine learning algorithms.

Such algorithms can learn and improve themselves until they find an ideal solution to a problem. This technology is at the heart of state-of-the-art manufacturing systems (for example, predicting when parts will fail) and of driverless cars. Spark has its own machine learning library, MLlib, while Hadoop systems rely on third-party machine learning libraries, such as Apache Mahout.
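As a hedged taste of what MLlib code looks like (spark-shell style, where the SparkSession `spark` and its implicits are predefined; the training set is a made-up toy example), fitting a logistic regression with the spark.ml API takes only a few lines:

```scala
import org.apache.spark.ml.classification.LogisticRegression
import org.apache.spark.ml.linalg.Vectors

// Toy training set: a label plus a 2-dimensional feature vector per row.
val training = Seq(
  (0.0, Vectors.dense(0.1, 0.2)),
  (0.0, Vectors.dense(0.2, 0.1)),
  (1.0, Vectors.dense(0.9, 0.8)),
  (1.0, Vectors.dense(0.8, 0.9))
).toDF("label", "features")

// Fit the model and inspect the learned coefficients.
val model = new LogisticRegression().setMaxIter(10).fit(training)
println(s"coefficients: ${model.coefficients}")
```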

In fact, while Spark and Hadoop do overlap in some functionality, neither is a commercial product, so there is no real competition between them, and the companies that make money supporting these free systems tend to offer both. Cloudera, for example, offers both Spark and Hadoop and recommends whichever best fits the customer’s needs.

While Spark is growing rapidly, Bernard believes it is still in its infancy, with its security and technical-support infrastructure underdeveloped. In his view, the increase in Spark activity in the open source community shows that enterprise users are looking for innovative uses for their stored data.

NO.5 Can Spark replace Hadoop for big data?

Judging by its past and present technology path, Spark is not intended to replace Hadoop; it exists and develops as an important member of the Hadoop ecosystem (Hadoop in the broad sense).

First of all, we know that Hadoop (in the narrow sense) has several key technologies: HDFS, MR (MapReduce), and YARN, corresponding to a distributed file system (storage), a distributed computing framework (computation), and a distributed resource-scheduling framework (resource scheduling).

Now let’s look at Spark’s technology stack, which is mainly divided into the following parts:

  • Spark Core: provides the core framework and common APIs, such as basic data structures like the RDD;

  • Spark SQL: a distributed SQL-like query engine providing structured data processing capabilities (see the sketch after this list);

  • Spark Streaming: provides stream data processing capability;

  • MLlib: provides a package of common distributed machine learning algorithms;

  • GraphX: provides graph computation capability.
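As promised above, a short Spark SQL sketch (spark-shell style, where `spark` is the predefined SparkSession and its implicits are pre-imported; the JSON file is hypothetical):

```scala
val df = spark.read.json("people.json") // schema inferred from the JSON
df.createOrReplaceTempView("people")    // register as a SQL-queryable view

// The same query, expressed as SQL text and via the DataFrame API:
spark.sql("SELECT name, age FROM people WHERE age > 30").show()
df.filter($"age" > 30).select("name", "age").show()
```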

As you can see, Spark is primarily designed to provide a variety of data computation capabilities (officially, a full-stack computing framework) and does not itself go deep into the storage or scheduling layers (although it does ship a scheduler of its own); instead, it is designed to plug into popular storage and scheduling layers. In other words, Spark’s storage layer can connect to both Hadoop HDFS and Amazon S3, and its scheduling layer can connect to both Hadoop YARN and Apache Mesos.
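A sketch of what this pluggability means in practice (all URIs and paths are placeholders): the program is the same whether the bytes live in HDFS or S3, and the scheduler is chosen at submit time rather than in code:

```scala
// Same RDD program, different storage back ends (paths are made up).
val onHdfs = sc.textFile("hdfs://namenode:8020/logs/2019/*.log")
val onS3   = sc.textFile("s3a://my-bucket/logs/2019/*.log") // needs hadoop-aws

// The scheduler is selected when submitting the application, e.g.:
//   spark-submit --master yarn              app.jar
//   spark-submit --master mesos://host:5050 app.jar
```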

Thus, we can say Spark is more a complement to Hadoop MR’s batch-only computing power than a complete replacement for Hadoop.

NO.6 Hadoop, Spark, SaaS, PaaS, IaaS, and cloud computing: what do these concepts mean?

Hadoop & Spark

First of all, neither of the two is a single product; each is better understood as an ecosystem, or, as some people call it, a “general-purpose big data processing platform”.

Hadoop is a distributed system infrastructure developed by the Apache Foundation

Hadoop mainly includes:

HDFS (Hadoop Distributed File System): a distributed, block-oriented, write-once (non-updatable), highly scalable file system that runs on ordinary hard disks across a cluster

MapReduce framework: A basic distributed computing framework that executes on a standard set of hardware in a cluster

YARN: default resource manager in the Hadoop ecological cluster

Hive: an SQL-like query engine built on the MapReduce framework

HBase: a key/value store built on HDFS that gives Hadoop online transaction processing (OLTP) capability (see the sketch after this list)
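As flagged in the list above, a hedged HBase sketch (the table, column family, and row key are made up) showing the key/value write and point-read pattern that gives Hadoop its OLTP-style access:

```scala
import org.apache.hadoop.hbase.{HBaseConfiguration, TableName}
import org.apache.hadoop.hbase.client.{ConnectionFactory, Get, Put}
import org.apache.hadoop.hbase.util.Bytes

// Connect using whatever hbase-site.xml is on the classpath.
val conn  = ConnectionFactory.createConnection(HBaseConfiguration.create())
val table = conn.getTable(TableName.valueOf("user_behavior")) // made-up table

// Key/value write: row key -> column family "d", qualifier "risk".
val put = new Put(Bytes.toBytes("user:1001"))
put.addColumn(Bytes.toBytes("d"), Bytes.toBytes("risk"), Bytes.toBytes("low"))
table.put(put)

// Point read by row key: the OLTP-style access pattern noted above.
val result = table.get(new Get(Bytes.toBytes("user:1001")))
println(Bytes.toString(result.getValue(Bytes.toBytes("d"), Bytes.toBytes("risk"))))

table.close(); conn.close()
```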

Spark is a fast, general-purpose computing engine designed for large-scale data processing

Spark includes:

Spark Core: Engine for general distributed data processing

Spark SQL: SQL queries running on Spark, supporting a range of SQL functions as well as HiveQL

Spark Streaming: a Spark-based micro-batch engine (see the sketch after this list)

MLlib: a machine learning library built on top of Spark
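As flagged in the list above, a minimal Spark Streaming sketch (spark-shell style; the socket source and port are placeholders): the stream is cut into 5-second micro-batches, each processed with ordinary RDD-style operations:

```scala
import org.apache.spark.streaming.{Seconds, StreamingContext}

// Build a streaming context over the existing SparkContext (`sc`);
// each micro-batch covers 5 seconds of incoming data.
val ssc = new StreamingContext(sc, Seconds(5))

// Hypothetical source: lines of text arriving on a TCP socket.
val lines = ssc.socketTextStream("localhost", 9999)
val wordCounts = lines.flatMap(_.split(" ")).map((_, 1)).reduceByKey(_ + _)
wordCounts.print() // emit the counts computed for each micro-batch

ssc.start()            // start receiving and processing
ssc.awaitTermination() // run until the job is stopped
```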

IaaS, PaaS, SaaS

These are the three layered services of cloud computing:

At the bottom, infrastructure: Infrastructure-as-a-Service (IaaS)

In the middle, the platform: Platform-as-a-Service (PaaS)

At the top, software: Software-as-a-Service (SaaS)

IaaS: Infrastructure as a Service

IaaS provides computing infrastructure (servers, networking, storage, and data center space) to customers as a service, including the operating systems and virtualization technology that manage the resources. Consumers obtain the services of a fully built-out computing infrastructure over the Internet.

PaaS: Platform as a Service

PaaS means offering a software development platform as a service: on top of the infrastructure, the supplier provides a complete software development and runtime environment, delivered to users in the SaaS model. PaaS is therefore itself an application of the SaaS model, but its emergence can in turn accelerate the development of SaaS, especially the development of SaaS applications.

SaaS: Software as a Service

SaaS is a delivery model in which an application is hosted as a service and made available to users over the Internet. It helps customers better manage their IT projects and services, ensure the quality and performance of their IT applications, and monitor their online business.

Cloud Computing

When I need water, I turn the tap and the water comes; I just have to worry about paying the bill!

When you need an app, you don’t have to go to the computer market; open the app store and it downloads. You just have to pay.

When you want to read the news, you don’t have to run to the newsstand; just open a news app and the headlines are there.

When you want to read a book, you don’t have to go to the bookstore; just open a reading app, find the book, and read it on your phone.

When you want to listen to music, you don’t have to go to the record store for the CD; open a music app and you can listen.

Cloud computing is like water companies opening in different regions, with no geographic restrictions: excellent cloud software providers deliver services to every corner of the world, just like the clouds in the sky. No matter where you are, as long as you look up, you can see them!

Five features of “cloud computing”:

1. Large scale and distributed
2. Virtualization
3. High availability
4. Scalability
5. Security

“Cloud computing” is deeply embedded in every bit of our lives. The apps and websites we use daily are basically inseparable from the powerful services and technical support of “cloud computing”, whether Taobao and JD.com, which shoppers love and hate, or WeChat and Weibo, which social media fans are glued to. Meanwhile, more and more enterprises and even government departments are adopting cloud-based platform services, and life is being revolutionized by “cloud computing”!
