This column is continuously updated and not yet complete, so please be patient. Some of the content draws on other books or online literature; the original sources will be credited.


The 21st century is bound to be the era of big data and the golden age of intelligent information processing.

The data volumes of the BAT companies (Baidu, Alibaba, Tencent) around 2013 were roughly as follows:

  • According to Baidu’s 2013 technical reports, Baidu’s total data volume was close to 1000PB, with hundreds of billions of web pages, updated billions of times per year and queried billions of times per day.
  • According to Tencent’s 2013 technical report, Tencent had about 800 million users and 400 million mobile users; its total stored data, after compression, was around 100PB, growing by 200TB to 300TB per day and by about 10% per month.
  • According to Alibaba’s 2013 technical reports, its total data volume was 100PB, with daily active data exceeding 50TB, 400 million product listings, more than 200 million registered users, and more than 40 million visits per day.

In order to collect, store and analyze big data, Internet companies have worked hard to develop big data technologies. Among the many technical solutions, open-source systems such as Hadoop, Spark and Elasticsearch have become the most widely used; because of their huge user bases, they have effectively become de facto standards for big data technology.

This column, “Exploration of Big Data Processing Practice”, records Python-based big data processing case studies and tries to combine big data with machine learning to generate new practical ideas. Most big data blog posts on the web are based on Java or Scala; the purpose of this column is to bring big data tools (PySpark, Elasticsearch, scikit-learn, …) together in Python, combining data development with data analysis, and to offer some practical guidance along the way. Finally, the column shares high-frequency written-test and interview questions, which I hope will help you when looking for a job.

Github address: big_data_repo


Framework and platform introduction

Big data technology tries to mine valuable information from massive data through distributed techniques and ultimately deliver it to users, generating practical and commercial value. Because of the diversity of the data itself and of data analysis needs, the big data technology ecosystem is very complex, involving many components and modules.

To give readers a clear top-level view of big data, this part first tries to outline the overall framework of big data technology.

Cloud platform

I’ve been working with engineers in North America who treat AWS as basic infrastructure; if you know nothing about cloud computing or any cloud platform, you are out of the loop. The best way to learn about cloud computing or a cloud platform is through its documentation. Domestic cloud platforms are springing up like bamboo shoots after a spring rain, but the most worthwhile reference is still the pioneer, AWS.

  • Python interacts with AWS (see the sketch after this list)
  • AWS EC2: configure FTP ---- using VSFTP
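As a minimal sketch of how Python can interact with AWS (assuming the boto3 SDK is installed and AWS credentials are already configured locally; the bucket name below is hypothetical):

```python
import boto3

# Assumes AWS credentials are already configured (e.g. via `aws configure`).
s3 = boto3.client("s3")

# List all buckets visible to the configured account.
for bucket in s3.list_buckets()["Buckets"]:
    print(bucket["Name"])

# Upload a local file to a (hypothetical) bucket.
s3.upload_file("local_data.csv", "my-example-bucket", "raw/local_data.csv")
```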

Installation and setup

This section mainly covers setting up the development environment and the cluster environment.

  • Basic environment construction: use PySpark in Jupyter Notebook (see the sketch after this list)

  • Configure a local Scala 2.12 + Spark 3.0.2 development environment in IDEA 2021

  • Use Python Fabric to set up and partially optimize an RHEL 7.2 big data infrastructure environment

  • CDH cluster installation & test summary

  • CDH 5.x cluster installation and uninstallation
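As a rough sketch of one common way to use PySpark inside Jupyter Notebook (assuming Spark is installed locally and the findspark package is available; the master setting and sample data are illustrative only):

```python
import findspark
findspark.init()  # locate the local Spark installation via SPARK_HOME

from pyspark.sql import SparkSession

# Start a local SparkSession from within the notebook.
spark = (SparkSession.builder
         .master("local[*]")
         .appName("jupyter-pyspark-demo")
         .getOrCreate())

df = spark.createDataFrame([(1, "a"), (2, "b")], ["id", "label"])
df.show()

spark.stop()
```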

Big data search framework Elasticsearch

Elasticsearch is a real-time distributed search and analysis engine that lets you search data at scale and at speed. It is commonly used for full-text search, structured search, analytics, or a combination of all three. Wikipedia, The Guardian, Stack Overflow, and GitHub all use Elasticsearch for their own search.

  • Research on big data processing practices ---- Elasticsearch
  • Book Report — Getting Started with Elasticsearch —-
  • Book Report – Getting Started with Elasticsearch —-
  • Elasticsearch-based search autocorrect
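As a small example of the full-text search described above, here is a hedged sketch using the official Python client (assuming a local Elasticsearch instance on port 9200 and a recent, 8.x-style client; the index name and document are made up for illustration):

```python
from elasticsearch import Elasticsearch

# Connect to a local Elasticsearch instance (adjust the URL as needed).
es = Elasticsearch("http://localhost:9200")

# Index a sample document into a hypothetical "articles" index.
es.index(index="articles", id=1,
         document={"title": "Getting started with Elasticsearch",
                   "body": "Full text search and analytics engine."})

# Make the document searchable immediately, then run a full-text match query.
es.indices.refresh(index="articles")
resp = es.search(index="articles", query={"match": {"body": "full text search"}})
for hit in resp["hits"]["hits"]:
    print(hit["_score"], hit["_source"]["title"])
```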

Big data framework Spark

Spark, originally developed in UC Berkeley’s AMP Lab, is a fast, general-purpose engine for large-scale data processing. Since entering the Apache Incubator in 2013, Spark has grown rapidly and is now one of the Apache Software Foundation’s three most important open-source projects for distributed computing (Hadoop, Spark, and Storm).

Spark was originally designed to make data analysis faster: not only fast to run, but also fast and easy to program. To make programs run faster, Spark provides in-memory computation, reducing I/O overhead during iterative computation. To make programs easier to write, Spark uses the concise and elegant Scala language and offers an interactive programming experience. Although Hadoop has become the de facto standard for big data, its MapReduce computing model still has many shortcomings. Spark retains the advantages of Hadoop MapReduce while addressing its drawbacks, and with its unified architecture and diverse functionality it is gradually becoming the most popular computing platform in the big data field.
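To make the point about in-memory computation concrete, here is a small hedged PySpark sketch: caching a dataset keeps it in memory across repeated actions instead of re-reading it from disk each time (the input path and keywords are illustrative):

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("cache-demo").getOrCreate()

# Hypothetical input path; any large text file works.
lines = spark.read.text("hdfs:///data/logs/*.log")

# Cache the filtered dataset in memory so repeated actions avoid re-reading from disk.
errors = lines.filter(lines.value.contains("ERROR")).cache()

# Iterative-style workload: each action reuses the in-memory data after the first pass.
for keyword in ["timeout", "refused", "oom"]:
    count = errors.filter(errors.value.contains(keyword)).count()
    print(keyword, count)

spark.stop()
```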

  • Basic environment construction: use PySpark and Scala in Jupyter Notebook / JupyterHub

  • Spark reads Elasticsearch array-type fields

  • Run Jupyter and submit PySpark tasks on Kubeflow / Kubernetes

WSL (Windows Subsystem for Linux) is a Linux environment that runs under Windows. Since it is difficult to install PySpark directly on Windows and run the many machine learning libraries that assume Linux, is there a way to save time, effort, and resources? Compared with a virtual machine, whose memory usage is around 8GB, WSL is certainly a good choice.

  • A simple practice of PySpark + XGBoost classification + feature importance using WSL (see the sketch below)
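One simple way to combine PySpark with XGBoost and feature importance, shown here only as a hedged sketch: it collects the data to the driver with toPandas(), so it only suits datasets that fit in memory, and it assumes the xgboost and pandas packages are installed (the input path and column names are hypothetical):

```python
import xgboost as xgb
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("xgb-demo").getOrCreate()

# Hypothetical feature table with a binary "label" column.
sdf = spark.read.parquet("/data/features.parquet")
pdf = sdf.toPandas()  # only reasonable when the data fits on the driver

X = pdf.drop(columns=["label"])
y = pdf["label"]

# Train a simple XGBoost classifier and inspect feature importance.
model = xgb.XGBClassifier(n_estimators=100, max_depth=4)
model.fit(X, y)

for name, score in sorted(zip(X.columns, model.feature_importances_),
                          key=lambda t: -t[1]):
    print(name, round(float(score), 4))

spark.stop()
```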

Data processing

Data access

  • Unified data access practice sharing

Data cleaning

  • Purpose and method of data cleaning

ETL

  • A brief discussion of big data ETL experience with pandas and PySpark
  • Big data processing practice exploration (3) ---- PySpark big data ETL tools (a sketch follows below)
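As a hedged illustration of the kind of ETL step discussed in those posts: extract raw CSV, apply light cleaning transforms, and load the result as partitioned Parquet (the paths and column names are made up):

```python
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("etl-demo").getOrCreate()

# Extract: read raw CSV files with header and inferred schema (hypothetical path).
raw = spark.read.csv("/data/raw/orders/*.csv", header=True, inferSchema=True)

# Transform: drop duplicates, fill missing amounts, derive a date column.
clean = (raw.dropDuplicates(["order_id"])
            .fillna({"amount": 0.0})
            .withColumn("order_date", F.to_date("created_at")))

# Load: write partitioned Parquet for downstream analysis.
clean.write.mode("overwrite").partitionBy("order_date").parquet("/data/warehouse/orders")

spark.stop()
```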

EDA

  • Python pandas-profiling: one line of code for exploratory data analysis (EDA) (see the sketch below)
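The “one line of code” refers to pandas-profiling (newer releases are published as ydata-profiling). A minimal sketch, assuming the package is installed and `data.csv` is any tabular file:

```python
import pandas as pd
from pandas_profiling import ProfileReport  # newer releases: from ydata_profiling import ProfileReport

df = pd.read_csv("data.csv")  # hypothetical input file

# Generate an exploratory data analysis report in a single call.
ProfileReport(df, title="EDA Report").to_file("eda_report.html")
```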

Feature engineering

  • points
  • Dimensionality reduction (PCA for unsupervised learning & t-SNE for manifold learning; see the sketch below)
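A small hedged scikit-learn sketch of both techniques mentioned above, run on synthetic data so it is self-contained:

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.manifold import TSNE

rng = np.random.default_rng(42)
X = rng.normal(size=(500, 20))  # synthetic high-dimensional data

# PCA: linear dimensionality reduction to 2 components.
X_pca = PCA(n_components=2).fit_transform(X)

# t-SNE: nonlinear manifold learning, also down to 2 dimensions.
X_tsne = TSNE(n_components=2, perplexity=30, init="pca").fit_transform(X)

print(X_pca.shape, X_tsne.shape)
```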

Big data machine learning

Machine learning is almost everywhere in big data; even when we do not refer to it explicitly, it is used in big data applications such as search, recommendation, prediction, and data mining. With the rapid development of the Internet, data volume has grown explosively and data dimensions have become increasingly rich, providing fertile ground for the development and application of machine learning. In turn, advances in machine learning let data produce more value and become truly “big” data; the two complement and promote each other, making data ever more intelligent.

  • Machine learning best practices based on big data

Principles of Algorithms

  • Cluster testing ---- Intel-Hadoop/HiBench process analysis ---- taking the Bayesian algorithm as an example

SQL optimization

SQL optimization is everywhere; its core can be summarized as follows (a Python sketch of point 2 follows the list):

  1. Efficient use of indexes
  2. Continuous optimization according to the query plan
  3. Build efficient SQL statements
  • PostgreSQL provides built-in partitioning
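As a hedged sketch of point 2, here is one way to inspect a PostgreSQL query plan from Python and check whether an index is actually being used (assumes psycopg2 is installed; the connection parameters and table are placeholders):

```python
import psycopg2

# Hypothetical connection parameters.
conn = psycopg2.connect(host="localhost", dbname="demo", user="demo", password="demo")

with conn, conn.cursor() as cur:
    # EXPLAIN ANALYZE shows whether the planner uses an index scan or a sequential scan.
    cur.execute("EXPLAIN ANALYZE SELECT * FROM orders WHERE order_id = %s", (12345,))
    for (line,) in cur.fetchall():
        print(line)

conn.close()
```

If the plan shows a sequential scan on a large table for a selective predicate like this one, adding an index on the filtered column is usually the first optimization to try.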

Big data visualization

A picture is worth a thousand words: the bandwidth of visual information is far greater than that of text.

  • Technical research ---- BI tool comparison, and Superset Docker installation and visualization

  • Kibana tip


Practice Case

Data processing based on big data

  • (1) ---- Python and Oracle database import and export

  • Use Python to import and export data across databases and cloud platforms: Oracle, AWS, ES (see the sketch below)
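A hedged sketch of the Oracle-to-Elasticsearch direction (assuming SQLAlchemy with the cx_Oracle driver plus the elasticsearch package; the connection string, table, and index names are placeholders):

```python
import pandas as pd
from sqlalchemy import create_engine
from elasticsearch import Elasticsearch, helpers

# Export: read a table from Oracle into pandas (placeholder connection string).
engine = create_engine("oracle+cx_oracle://user:password@db-host:1521/?service_name=orcl")
df = pd.read_sql("SELECT * FROM customers", engine)

# Import: bulk-load the rows into an Elasticsearch index.
es = Elasticsearch("http://localhost:9200")
actions = ({"_index": "customers", "_source": row._asdict()}
           for row in df.itertuples(index=False))
helpers.bulk(es, actions)
```

In practice, date and numeric types from Oracle may need explicit conversion before indexing.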

Data analysis based on big data

Modeling and analysis of the Kaggle competition dataset “Give Me Some Credit” using PySpark

  • 1. Data preparation and EDA
  • 2.1 Data Cleaning
  • 2.2 Feature Engineering

Written interview

What is the core of reviewing written-test and interview questions? In a word: fundamentals + principles.

  • Big data basics Q&A

  • Written interview high-frequency questions ---- basic knowledge

  • Written interview high-frequency questions ---- Hadoop

  • Written interview high-frequency questions ---- YARN basics

  • Written interview high-frequency questions ---- Spark basic tuning

  • Written interview high-frequency questions ---- Spark basics

  • Written interview high-frequency questions ---- Hive basics

  • Written interview high-frequency questions ---- Elasticsearch


Other

  • Zookeeper Cluster Deployment in Docker Environment (1)
  • Zookeeper Cluster Deployment in Docker Environment (2)

Reference

Spark Introduction (Python)