Definition of big data

Big data refers to data sets whose contents cannot be captured, managed, and processed with conventional software tools within an acceptable time frame.

The concept of big data: 4V plus extended V's

  • 1. Volume: large volume of data
  • 2. Variety: many types of data
  • 3. Velocity: fast processing speed and high timeliness
  • 4. Value: low value density
  • Variability
  • Veracity

The three phases of big data generation

  1. Operational systems phase: management information systems and business application systems
  2. User-generated content phase: Web 2.0, Weibo, WeChat, and so on
  3. Perceptual systems phase: sensors and the Internet of Things

The impact of big data on scientific research

  1. The first paradigm: experimental science
  2. The second paradigm: theoretical science
  3. The third paradigm: computational science
  4. The fourth paradigm: data-intensive science

The impact of big data on ways of thinking

  1. Full data sets rather than sampling;
  2. Efficiency rather than absolute accuracy;
  3. Correlation rather than causation.

Big data computing modes

  1. Batch computing: MapReduce (a minimal word-count sketch follows this list)
  2. Stream computing: Storm, Flink, Spark Streaming
  3. Graph computing: Pregel, Spark GraphX
  4. Query and analysis computing: Dremel, Hive, Impala
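
To make the batch-computing mode concrete, here is a minimal word-count sketch written against the Hadoop MapReduce Java API. It is essentially the canonical example from the Hadoop documentation; the input and output paths are assumed to arrive as command-line arguments.

```java
import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {

  // Map phase: emit (word, 1) for every word in the input split.
  public static class TokenizerMapper extends Mapper<Object, Text, Text, IntWritable> {
    private static final IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();

    @Override
    public void map(Object key, Text value, Context context)
        throws IOException, InterruptedException {
      StringTokenizer itr = new StringTokenizer(value.toString());
      while (itr.hasMoreTokens()) {
        word.set(itr.nextToken());
        context.write(word, ONE);
      }
    }
  }

  // Reduce phase: sum the counts emitted for each word.
  public static class IntSumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
    private final IntWritable result = new IntWritable();

    @Override
    public void reduce(Text key, Iterable<IntWritable> values, Context context)
        throws IOException, InterruptedException {
      int sum = 0;
      for (IntWritable val : values) {
        sum += val.get();
      }
      result.set(sum);
      context.write(key, result);
    }
  }

  public static void main(String[] args) throws Exception {
    Job job = Job.getInstance(new Configuration(), "word count");
    job.setJarByClass(WordCount.class);
    job.setMapperClass(TokenizerMapper.class);
    job.setCombinerClass(IntSumReducer.class); // local pre-aggregation on each mapper
    job.setReducerClass(IntSumReducer.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);
    FileInputFormat.addInputPath(job, new Path(args[0]));   // HDFS input directory
    FileOutputFormat.setOutputPath(job, new Path(args[1])); // HDFS output directory
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}
```

Packaged into a jar, it would be submitted with something like `hadoop jar wordcount.jar WordCount /input /output`, where the two HDFS paths are placeholders.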

The definition of Hadoop

Hadoop is a distributed storage system and distributed computing framework developed under the Apache Software Foundation; it stores, computes on, and analyzes big data across large clusters of ordinary commodity servers.

Hadoop 2.0 consists of three parts

  • The distributed file system HDFS (a minimal client sketch follows this list)
  • The resource management and scheduling system YARN
  • The distributed computing framework MapReduce
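
The following sketch shows how a client might write and then read back a file on HDFS through the Java FileSystem API; the NameNode address and the file path are placeholder assumptions, not values from the original notes.

```java
import java.nio.charset.StandardCharsets;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IOUtils;

public class HdfsHello {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    conf.set("fs.defaultFS", "hdfs://localhost:9000"); // placeholder NameNode address

    try (FileSystem fs = FileSystem.get(conf)) {
      Path file = new Path("/tmp/hello.txt"); // hypothetical path

      // Write: the client streams the bytes to DataNodes chosen by the NameNode.
      try (FSDataOutputStream out = fs.create(file, true)) {
        out.write("hello hdfs".getBytes(StandardCharsets.UTF_8));
      }

      // Read the file back and copy it to stdout.
      try (FSDataInputStream in = fs.open(file)) {
        IOUtils.copyBytes(in, System.out, 4096, false);
      }
    }
  }
}
```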

Hadoop and Google

Hadoop grew directly out of Google's published designs: HDFS follows the Google File System (GFS) paper, and Hadoop MapReduce follows Google's MapReduce paper.

The characteristics of Hadoop

  1. Scalable: Hadoop can reliably store and process petabytes of data.
  2. Economical: Data can be distributed and processed across clusters of ordinary machines; these clusters can scale to thousands of nodes.
  3. Efficient: By distributing the data, Hadoop processes it in parallel on the nodes where it resides, which makes processing very fast.
  4. Reliable: Hadoop automatically maintains multiple copies of the data and automatically redeploys failed computing tasks (a small sketch follows this list).
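
Replication is what the reliability point rests on: HDFS keeps several copies of every block, so losing a node does not lose data. The sketch below sets and then reads back the replication factor of a file; the path and the factor of 3 are illustrative assumptions.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class ReplicationDemo {
  public static void main(String[] args) throws Exception {
    try (FileSystem fs = FileSystem.get(new Configuration())) {
      Path file = new Path("/tmp/hello.txt"); // hypothetical existing file

      // Ask HDFS to keep three copies of this file's blocks.
      fs.setReplication(file, (short) 3);

      // Inspect the replication factor the NameNode now reports for the file.
      FileStatus status = fs.getFileStatus(file);
      System.out.println("replication = " + status.getReplication());
    }
  }
}
```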