Introduction

Hello everyone, I am ChinaManor, which literally translates to "Chinese code farmer". I hope to become a pathfinder on the road to national rejuvenation, a ploughman in the field of big data, and an ordinary person who is unwilling to be mediocre.

Before going to an interview, we need to prepare a set of core talking points for each big data project and know them by heart, so that we can win without drawing a sword! (Knocks on blackboard) This interview guide was written by a real lecturer with 10 years of development experience; I have simply posted it on my blog.

=== The following takes a smart logistics big data platform project as an example: ===

The first sword, “General Formula”: function overview (about three sentences, concise and to the point)

The business data in this project covers logistics links such as orders, transportation, warehousing, handling, and loading and unloading. After years of accumulation, with a huge user base and tens of millions of orders per day, traditional data processing technology can no longer meet the enterprise's needs. Big data analysis can improve transportation and distribution efficiency, reduce logistics costs, and meet customer service requirements more effectively; based on the analysis results, solutions with meso-level guiding significance can be proposed.

The second sword, “Sword-breaking”: project cycle (development duration and staffing)

Development Duration:

About six months

Stage division:

  • Requirements research and review (4 weeks)
  • Architecture design (1 week)
  • Coding and integration (12 weeks)
  • Testing (2 weeks)
  • Online deployment, trial operation, and tuning (3 weeks)

Staffing

Division of responsibilities:

  • Front end (JavaWeb + front end): 2 people
  • Big data development: 3 people
  • Operation and maintenance: 1 person

The third sword, “Saber-breaking”: technical framework (technology choices and framework versions)

  1. Business system data is stored in Oracle and MySQL databases. For example, CRM system data is stored in MySQL and OMS system data is stored in Oracle.
  2. OGG incrementally synchronizes data from the Oracle databases, and Canal incrementally synchronizes data from the MySQL databases.
  3. The data incrementally extracted by OGG and Canal is written to the Kafka cluster for consumption by the real-time analysis program.

Real-time analysis

  4. The real-time analysis program consumes the data from Kafka and performs ETL on it.
  5. To make it easy for business departments to query all kinds of business documents, StructuredStreaming writes the ETL-processed data into Elasticsearch indexes.
  6. StructuredStreaming also writes the processed data to ClickHouse, which the Java Web backend queries directly for display; for example, the GPS positions of transport vehicles are shown on the GIS map in real time.
  7. StructuredStreaming synchronizes the real-time ETL output into Kudu for near-real-time analysis and query; Impala runs ad hoc analysis and queries on the Kudu data.
  8. Front-end applications present the data visually, for example through data service interfaces or a big screen that refreshes in real time.
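The ETL step in the pipeline above can be sketched in plain Python. This is only an illustration of the idea, not the project's StructuredStreaming job: the field names (`database`, `table`, `type`, `isDdl`, `ts`, `data`) are assumed to follow Canal's flat JSON message layout, and the sample table `crm.tbl_customer` and its columns are hypothetical.

```python
import json


def flatten_change_event(raw):
    """Turn one Canal-style change message into a list of flat row dicts.

    Assumption: the message is a JSON object with `database`, `table`,
    `type`, `isDdl`, `ts` and a `data` array of row objects.
    """
    event = json.loads(raw)
    if event.get("isDdl"):
        return []  # schema changes carry no row data to load
    op = str(event.get("type", "")).upper()
    if op not in {"INSERT", "UPDATE", "DELETE"}:
        return []  # ignore message types this sketch does not handle
    rows = []
    for row in event.get("data") or []:
        # Drop null columns, then attach metadata for the downstream sinks.
        record = {k: v for k, v in row.items() if v is not None}
        record["_table"] = f'{event["database"]}.{event["table"]}'
        record["_op"] = op
        record["_ts"] = event.get("ts")
        rows.append(record)
    return rows


# Hypothetical sample message, built here so the sketch is self-contained.
sample = json.dumps({
    "database": "crm",
    "table": "tbl_customer",
    "type": "INSERT",
    "isDdl": False,
    "ts": 1600000000000,
    "data": [{"id": "1", "name": "Zhang", "tel": None}],
})
print(flatten_change_event(sample))
```

In the real pipeline the same parse-filter-flatten logic would run inside the streaming job, and the flattened rows would then be written to Elasticsearch, ClickHouse, and Kudu.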

The fourth sword, “Spear-breaking”: cluster scale (business data volume, server configuration and count)

The data volume here should be estimated based on actual requirements.

How do I confirm the cluster size? (Assume: 8TB disk and 128GB memory per server)

Assume 1 million daily active users, each generating about 100 log entries per day: 1 million × 100 = 100 million entries per day.

Each log entry is about 1 KB, so 100 million entries per day come to 100,000,000 / 1024 / 1024 ≈ 100 GB per day. If no servers are added for six months: 100 GB × 180 days ≈ 18 TB.

Keeping 3 replicas: 18 TB × 3 = 54 TB. Reserving a 20%-30% buffer: 54 TB / 0.7 ≈ 77 TB. Therefore: 77 TB / 8 TB per server ≈ 10 servers.
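The estimate above is easy to redo with different inputs if it is written as a small function. The following is a minimal sketch; the function and parameter names are made up for illustration, and it follows the article's rounding by treating 1 TB as 1000 GB.

```python
import math


def daily_log_volume_gb(daily_active_users, logs_per_user, log_size_kb):
    """Daily log volume in GB (KB -> MB -> GB, dividing by 1024 twice)."""
    return daily_active_users * logs_per_user * log_size_kb / 1024 / 1024


def servers_needed(daily_gb, retention_days, replicas, reserve_fraction, disk_tb_per_server):
    """Servers required so that replicated data fits with a reserved buffer."""
    raw_tb = daily_gb * retention_days / 1000            # article rounds 1 TB to 1000 GB
    replicated_tb = raw_tb * replicas                    # e.g. 3 replicas
    provisioned_tb = replicated_tb / (1 - reserve_fraction)  # keep 20%-30% free
    return math.ceil(provisioned_tb / disk_tb_per_server)


exact_daily_gb = daily_log_volume_gb(1_000_000, 100, 1)  # slightly under 100 GB
print(round(exact_daily_gb, 1))
# Using the article's rounded 100 GB per day, 180 days, 3 replicas, 30% buffer, 8 TB disks:
print(servers_needed(100, 180, 3, 0.3, 8))
```

Rounding up with `math.ceil` is a deliberate choice here: a fractional server has to be bought as a whole machine.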

What if data warehouse layering is also considered? Then the servers need to be expanded by a further 1-2 times.

Does the server use a physical machine or a cloud host?

  • Machine cost, physical machine: with 128 GB memory, a 20-core physical CPU with 40 threads, an 8 TB HDD and a 2 TB SSD, the quote is just over 40,000 (4W) per machine, and the cost of server hosting also needs to be considered. A physical machine has a service life of about 5 years.
  • Operation and maintenance cost, taking Alibaba Cloud as the cloud host example:
    • Physical machine: dedicated professional operation and maintenance staff are required.
    • Cloud host: much of the operation and maintenance work is done by Alibaba Cloud, so operation and maintenance is relatively easy.

The fifth sword, “Whip-breaking”: data sources and data collection

The sixth sword, “Rope-breaking”: data ETL (may be offline, may be real-time)

The seventh sword, “Palm-breaking”: business report analysis (offline reports, real-time reports)

  • First point: traditional report analysis, with a report for each subject
    • Data skew, joins between large tables, OOM (out-of-memory) errors, etc.
  • Second point: Impala ad hoc queries with SQL statements
  • Third point: ClickHouse real-time OLAP analysis

The eighth sword, “Arrow-breaking”: data analysis engines (Hive, Impala, Es, Spark, Flink, etc.)

  • Hive: runs on the MapReduce framework underneath; stable
  • SparkSQL: integrates with Hive or Kudu to analyze data; StructuredStreaming is also used
  • Impala, ClickHouse: real-time OLAP analysis frameworks

The ninth sword, “Qi-breaking”: project issues (data skew, OOM, performance optimization, etc.)

  • Raise a problem and explain how you solved it (one you solved yourself)
  • Hive performance optimization, Spark performance optimization (the fundamentals)

Example: how do I avoid Spark data skew? To avoid data skew in Spark, choose appropriate keys, define your own partitioner, or split hot keys by adding a salt or hash value so that the data is spread across different partitions. The following operators trigger a shuffle and are the key points where data skew may occur: groupByKey, reduceByKey, aggregateByKey, join, and cogroup.
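The salting idea can be shown without a cluster. The sketch below is a plain-Python simulation, not Spark code: `partition_of` is a stand-in hash partitioner, `sum_by_key` stands in for reduceByKey, and the partition count, salt count, and sample keys are all hypothetical. It aggregates in two stages: first per salted key, then per original key after the salt is stripped.

```python
import random
import zlib
from collections import Counter, defaultdict

NUM_PARTITIONS = 4  # hypothetical partition count
NUM_SALTS = 8       # hypothetical number of salt values per key


def partition_of(key):
    """Deterministic stand-in for a hash partitioner (not Spark's own)."""
    return zlib.crc32(key.encode("utf-8")) % NUM_PARTITIONS


def partition_loads(keys):
    """Count how many records land in each partition."""
    loads = Counter(partition_of(k) for k in keys)
    return [loads.get(p, 0) for p in range(NUM_PARTITIONS)]


def add_salt(records, seed=0):
    """Append a random salt so that one hot key becomes several distinct keys."""
    rng = random.Random(seed)
    return [(f"{key}#{rng.randrange(NUM_SALTS)}", value) for key, value in records]


def sum_by_key(records):
    """Plain per-key sum, standing in for reduceByKey."""
    totals = defaultdict(int)
    for key, value in records:
        totals[key] += value
    return dict(totals)


def salted_sum(records, seed=0):
    # Stage 1: partial sums per salted key, so the hot key's work is split up.
    partial = sum_by_key(add_salt(records, seed))
    # Stage 2: strip the salt and merge the much smaller partial results.
    unsalted = [(key.rsplit("#", 1)[0], subtotal) for key, subtotal in partial.items()]
    return sum_by_key(unsalted)


records = [("hot", 1)] * 1000 + [("a", 1)] * 10 + [("b", 1)] * 10
print(partition_loads([k for k, _ in records]))            # every "hot" record shares one partition
print(partition_loads([k for k, _ in add_salt(records)]))  # loads after the hot key is salted
print(salted_sum(records) == sum_by_key(records))          # both approaches give the same totals
```

The two-stage shape only works for aggregations that can be merged from partial results, such as sums and counts; for joins, the other side of the join would have to be expanded with every salt value as well.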

Conclusion

The above is the interviewer's personal “Dugu Nine Swords” for big data interviews, distilled from ten years of project experience ~

I hope you gained something from reading this. If you did, you might as well give it a “one-click triple” of support ~