Definition of big data

Big data refers to data sets whose contents cannot be captured, managed, and processed with conventional software tools within an acceptable time frame.

The concept of big data: 4V plus extended V's

  • 1. Volume: large volume of data
  • 2. Variety: many types of data
  • 3. Velocity: fast processing speed and high timeliness
  • 4. Value: low value density
  • Variability
  • Veracity

The three phases of big data generation

  1. Operational systems phase: management information systems and business application systems
  2. User-generated content phase: Web 2.0, Weibo, WeChat, and so on
  3. Perceptual systems phase: sensors and the Internet of Things

The impact of big data on scientific research

  1. The first paradigm: experimental science
  2. The second paradigm: theoretical science
  3. The third paradigm: computational science
  4. The fourth paradigm: data-intensive science

The impact of big data on ways of thinking

  1. Full data sets rather than sampling;
  2. Efficiency rather than absolute accuracy;
  3. Correlation rather than causation.

Big data computing modes

  1. Batch computing: MapReduce (a minimal word-count sketch follows this list)
  2. Stream computing: Storm, Flink, Spark Streaming
  3. Graph computing: Pregel, Spark GraphX
  4. Query and analysis computing: Dremel, Hive, Impala
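
To make the batch-computing mode concrete, here is a minimal word-count sketch written against the Hadoop MapReduce Java API. It is essentially the canonical example from the Hadoop documentation; the input and output paths are assumed to arrive as command-line arguments.

```java
import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {

  // Map phase: emit (word, 1) for every word in the input split.
  public static class TokenizerMapper extends Mapper<Object, Text, Text, IntWritable> {
    private static final IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();

    @Override
    public void map(Object key, Text value, Context context)
        throws IOException, InterruptedException {
      StringTokenizer itr = new StringTokenizer(value.toString());
      while (itr.hasMoreTokens()) {
        word.set(itr.nextToken());
        context.write(word, ONE);
      }
    }
  }

  // Reduce phase: sum the counts emitted for each word.
  public static class IntSumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
    private final IntWritable result = new IntWritable();

    @Override
    public void reduce(Text key, Iterable<IntWritable> values, Context context)
        throws IOException, InterruptedException {
      int sum = 0;
      for (IntWritable val : values) {
        sum += val.get();
      }
      result.set(sum);
      context.write(key, result);
    }
  }

  public static void main(String[] args) throws Exception {
    Job job = Job.getInstance(new Configuration(), "word count");
    job.setJarByClass(WordCount.class);
    job.setMapperClass(TokenizerMapper.class);
    job.setCombinerClass(IntSumReducer.class); // local pre-aggregation on each mapper
    job.setReducerClass(IntSumReducer.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);
    FileInputFormat.addInputPath(job, new Path(args[0]));   // HDFS input directory
    FileOutputFormat.setOutputPath(job, new Path(args[1])); // HDFS output directory
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}
```

Packaged into a jar, it would be submitted with something like `hadoop jar wordcount.jar WordCount /input /output`, where the two HDFS paths are placeholders.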

The definition of Hadoop

Hadoop is a distributed storage system and distributed computing framework developed under the Apache Software Foundation; it stores, computes on, and analyzes big data across large clusters of ordinary commodity servers.

Hadoop 2.0 consists of three parts

  • The distributed file system HDFS (a minimal client sketch follows this list)
  • The resource management and scheduling system YARN
  • The distributed computing framework MapReduce
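
The following sketch shows how a client might write and then read back a file on HDFS through the Java FileSystem API; the NameNode address and the file path are placeholder assumptions, not values from the original notes.

```java
import java.nio.charset.StandardCharsets;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IOUtils;

public class HdfsHello {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    conf.set("fs.defaultFS", "hdfs://localhost:9000"); // placeholder NameNode address

    try (FileSystem fs = FileSystem.get(conf)) {
      Path file = new Path("/tmp/hello.txt"); // hypothetical path

      // Write: the client streams the bytes to DataNodes chosen by the NameNode.
      try (FSDataOutputStream out = fs.create(file, true)) {
        out.write("hello hdfs".getBytes(StandardCharsets.UTF_8));
      }

      // Read the file back and copy it to stdout.
      try (FSDataInputStream in = fs.open(file)) {
        IOUtils.copyBytes(in, System.out, 4096, false);
      }
    }
  }
}
```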

Hadoop and Google

Hadoop grew directly out of Google's published designs: HDFS follows the Google File System (GFS) paper, and Hadoop MapReduce follows Google's MapReduce paper.

The characteristics of Hadoop

  1. Scalable: Hadoop can reliably store and process petabytes of data.
  2. Economical: Data can be distributed and processed across clusters of ordinary machines; these clusters can scale to thousands of nodes.
  3. Efficient: By distributing the data, Hadoop processes it in parallel on the nodes where it resides, which makes processing very fast.
  4. Reliable: Hadoop automatically maintains multiple copies of the data and automatically redeploys failed computing tasks (a small sketch follows this list).
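
Replication is what the reliability point rests on: HDFS keeps several copies of every block, so losing a node does not lose data. The sketch below sets and then reads back the replication factor of a file; the path and the factor of 3 are illustrative assumptions.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class ReplicationDemo {
  public static void main(String[] args) throws Exception {
    try (FileSystem fs = FileSystem.get(new Configuration())) {
      Path file = new Path("/tmp/hello.txt"); // hypothetical existing file

      // Ask HDFS to keep three copies of this file's blocks.
      fs.setReplication(file, (short) 3);

      // Inspect the replication factor the NameNode now reports for the file.
      FileStatus status = fs.getFileStatus(file);
      System.out.println("replication = " + status.getReplication());
    }
  }
}
```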