The official account Five Minutes to Learn Big Data has launched a big data interview series, Five-Minute Mini Interviews. This series digs into real written-test and interview questions from major companies and expands on the related knowledge points, to help everyone land an offer at a big company!

The questions come from a first-round Meituan Youxuan interview shared by a user on Nowcoder (Niuke). The interview lasted one hour. About the interviewee: an overseas master's graduate who has worked at a small third-tier company for two years. The reference answers are provided by this official account; if there are mistakes, corrections are welcome!

The following questions were asked during the interview. The position is big data development; judging from the content of the questions, the role leans toward data warehousing.

  1. Introduce yourself
  2. Are you willing to work in Beijing?
  3. SQL question: given a table of cities and transaction volumes, and a table mapping cities to provinces, extract the names of provinces whose total transaction volume is greater than 500
  4. SQL question, based on the previous one: count the number of provinces whose total transaction volume falls into each of the three ranges [0,500), [500,1000), [1000,+∞)
  5. SQL question, still based on the previous one: get each city's accumulated transaction volume, ordered from smallest to largest; window functions may be used
  6. Split fields by a delimiter; a regular expression can be used
  7. Which products were you mainly responsible for?
  8. How do you tell how much value product metrics bring?
  9. How do you make metrics accurate?
  10. Describe the attributes of an entity mentioned in your project
  11. How would you model it if you were doing it?
  12. Which data sources are extracted, and which tools are used?
  13. Describe the internal logic of the extraction. How is it implemented?
  14. Do you read any books outside of work?
  15. What do you think are the differences between Flink and Spark Streaming?
  16. What advantages does Spark Streaming have over Flink?
  17. Do you have any questions?

Answer:

1. Introduce yourself

The first thing an interviewer says in an interview is, “Tell me about yourself.”

Many people, quite sincerely, just say: "My name is XX, I'm from XX, and I'm XX years old." Then comes silence.

A self-introduction like that is even less impressive than the meme from a while back: "My name is XX, and I like singing, dancing, rap, and basketball." At least that one lets the interviewer know what you are good at.

So you need to impress the interviewer in the shortest possible time: highlight your strengths and make them convincing.

Here are the rules: don't recite your resume in big chunks; don't ramble through a long laundry list; try to keep your introduction to about one minute.

So how do you satisfy all three of these in the shortest time?

Luo Zhenyu once said, "Clumsy people reason things out, while smart people tell stories."

So drop the usual answers and learn to tell a story. To tell the story, remember the following four elements and fit your own experience or thinking into them:

  • Goals: What I want to do, what I want to be.

  • Obstacles: what makes this difficult, and which points get "questioned";

  • Efforts: What I have done to overcome difficulties and “doubts”;

  • Results: What I have achieved through my efforts.

When I have time I will cover self-introductions in more detail. This part really matters, because it is your chance to "sell" yourself, your moment to advertise yourself.

2. Are you willing to work in Beijing?

The company is definitely looking for someone solid and stable, not someone who shows up for a few days or leaves after a short while. The interviewer asks this question to judge how long you are likely to stay with the company and whether it is worth the company's effort to train or hire you.

So be sure to say that you can definitely come to Beijing and plan to stay long term (even if you don't), and use your own imagination for the specifics.

3. SQL question: given a table of cities and transaction volumes, and a table mapping cities to provinces, extract the names of provinces whose total transaction volume is greater than 500

To facilitate understanding, we first construct these two tables according to the above questions:

City transaction volume table: business_table

city_num: city id; gmv: transaction volume

city_num gmv
1001 210
1001 90
1002 250
1003 200
1004 700
1005 350
1005 150
1006 250
1007 150

City-to-province mapping table: province_table

province_num: province id; province_name: province name; city_num: city id

province_num province_name city_num
11 a 1001
11 a 1005
12 b 1002
12 b 1003
13 c 1004
13 c 1006
13 c 1007
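
If you want to run the examples yourself, here is a minimal sketch for creating and populating the two tables (MySQL-style DDL; the schema is simply inferred from the sample data above):

-- Assumed schema, inferred from the sample data above
CREATE TABLE business_table (
    city_num INT,   -- city id
    gmv      INT    -- transaction volume of a single record
);

CREATE TABLE province_table (
    province_num  INT,          -- province id
    province_name VARCHAR(50),  -- province name
    city_num      INT           -- city id
);

INSERT INTO business_table VALUES
    (1001, 210), (1001, 90), (1002, 250), (1003, 200), (1004, 700),
    (1005, 350), (1005, 150), (1006, 250), (1007, 150);

INSERT INTO province_table VALUES
    (11, 'a', 1001), (11, 'a', 1005),
    (12, 'b', 1002), (12, 'b', 1003),
    (13, 'c', 1004), (13, 'c', 1006), (13, 'c', 1007);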

Based on the above table, the SQL statement is as follows.

SELECT MAX(tmp.province_name)
FROM (
	SELECT bt.city_num, bt.gmv, pt.province_num, pt.province_name
	FROM business_table bt
		LEFT JOIN province_table pt ON bt.city_num = pt.city_num
) tmp
GROUP BY tmp.province_num
HAVING SUM(tmp.gmv) > 500;

4. SQL question, based on the previous one: count the number of provinces whose total transaction volume falls into each of the three ranges [0,500), [500,1000), [1000,+∞)

The SQL statement is as follows:

SELECT 
  COUNT(CASE 
	  WHEN tmp2.pro_gmv >= 0
	  AND tmp2.pro_gmv < 500 THEN tmp2.pro_name
	  ELSE NULL END) AS gmv_0_500,
  COUNT(CASE 
	  WHEN tmp2.pro_gmv >= 500
	  AND tmp2.pro_gmv < 1000 THEN tmp2.pro_name
	  ELSE NULL END) AS gmv_500_1000, 
  COUNT(CASE 
	  WHEN tmp2.pro_gmv >= 1000 THEN tmp2.pro_name
	  ELSE NULL END) AS gmv_1000_
FROM (
	SELECT MAX(tmp.province_name) AS pro_name, SUM(tmp.gmv) AS pro_gmv
	FROM (
		SELECT bt.city_num, bt.gmv, pt.province_num, pt.province_name
		FROM business_table bt
			LEFT JOIN province_table pt ON bt.city_num = pt.city_num
	) tmp
	GROUP BY tmp.province_num
) tmp2;

5. SQL question, based on the previous one: get each city's accumulated transaction volume, ordered from smallest to largest; window functions may be used

The SQL statement is as follows:

# If the interviewer asks you to use a window function, then go ahead and use one
SELECT city_num, gmv
FROM (
	SELECT DISTINCT city_num, SUM(gmv) OVER(PARTITION BY city_num) AS gmv
	FROM business_table
) tmp
ORDER BY gmv;
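
If the interviewer instead means a running (cumulative) total across cities, i.e. each row carries the sum of its own total plus all smaller city totals, a sketch with an ORDER BY window could look like this (same tables as above; which interpretation is wanted should be confirmed with the interviewer):

-- Running total across cities, ordered by each city's total GMV
SELECT city_num,
       city_gmv,
       SUM(city_gmv) OVER (ORDER BY city_gmv, city_num) AS cumulative_gmv
FROM (
    SELECT city_num, SUM(gmv) AS city_gmv
    FROM business_table
    GROUP BY city_num
) t
ORDER BY city_gmv;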

6. Split fields by a delimiter; a regular expression can be used

# java
String address = "Shanghai|Shanghai Minhang District|Wuzhong Road";
String[] splitAddress = address.split("\\|");  // When the separator is a vertical bar, split() needs the double backslash \\ to escape it

# sql
hive> select split('abcdef', 'c') from test;
["ab", "def"]

7. Which products were you mainly responsible for?

This can be based on your resume or your own situation.

8. How do you tell how much value product metrics bring?

There is no standard answer to this question, just answer according to your own understanding.

The following is for reference only:

The book Lean Analytics presents two commonly used methodologies for building a metric system. One of them is the well-known Pirate Metrics framework, the AARRR pirate model we often hear about. The pirate model is a classic model for user analysis; it reflects the fact that growth runs systematically through every phase of the user lifecycle: Acquisition, Activation, Retention, Revenue, Referral.

Why bring up this model? Because from some of the key metrics in this model, we can infer the value that the product's metrics bring.

AARRR model:

A (Acquisition): acquire target users through various promotion channels and methods, evaluate the effectiveness of each marketing channel, and keep optimizing the investment strategy to lower the cost of acquiring customers. Key metrics such as new registrations, activations, conversion rate, retention, downloads, and installs reflect how well target users are being acquired.

A (Activation): activated users actually start using the value the product provides. We need to capture users' behavioral data and monitor the health of the product. This module mainly reflects how users behave once inside the product and is the core of the product experience. It involves key metrics such as DAU/MAU, daily usage duration, app session duration, and number of app launches, which reflect user activity.

R (Retention): measures user engagement and quality. It involves key metrics such as retention rate and churn rate, which reflect how well users are retained.

R (Revenue): mainly measures the commercial value of the product. It involves key metrics such as lifetime value (LTV), average order value, and GMV, which reflect the product's commercial value.

R (Referral): measures how much users spread the product on their own and its word of mouth. It involves key metrics such as invitation rate and viral (fission) coefficient.
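
As a small illustration of the Retention metrics above, next-day retention can be computed with a self-join; this is only a sketch against a hypothetical user_login(user_id, login_date) table:

-- Hypothetical table: user_login(user_id, login_date)
-- Next-day retention = users active again the following day / users active on the given day
SELECT t1.login_date,
       COUNT(DISTINCT t1.user_id) AS active_users,
       COUNT(DISTINCT t2.user_id) AS retained_users,
       COUNT(DISTINCT t2.user_id) / COUNT(DISTINCT t1.user_id) AS next_day_retention
FROM user_login t1
LEFT JOIN user_login t2
       ON t1.user_id = t2.user_id
      AND t2.login_date = DATE_ADD(t1.login_date, 1)  -- Hive syntax; MySQL uses DATE_ADD(..., INTERVAL 1 DAY)
GROUP BY t1.login_date;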

9. How do you make metrics accurate?

Same question. There’s no right answer.

For reference only:

To make metrics accurate, you need to use a scientific method to select them.

Common methods for selecting metrics include the indicator classification method and the OSM model.

1. Indicator classification method: indicator classification is mainly vertical thinking about the content of indicators. Based on the company's strategic goals, organization, and business processes, indicators are classified top-down and broken down layer by layer, mainly into three tiers: T1, T2, and T3.

  • T1 indicators: company strategy-level indicators, used to measure the achievement of the company's overall goals. They are mainly decision-making indicators and usually serve the company's strategic decision makers.

  • T2 indicators: business strategy-level indicators. To hit the T1 targets, the company breaks those targets down to business lines or business groups and designs a series of targeted operating strategies. T2 indicators reflect the results of those strategies; they are supporting indicators and also the core indicators of each business line or business group. T2 indicators are a vertical decomposition of T1 indicators along that path, which makes it easier to locate problems in T1 indicators. They are usually used by business lines or business groups.

  • T3 indicators: business execution-level indicators. T3 indicators are a breakdown of T2 indicators and are used to locate problems in T2 indicators. They are also typically the most common indicators in business processes; each functional department focuses on different indicators depending on its goals. T3 indicators usually guide the day-to-day work of front-line operations staff or analysts; they are process-oriented and can quickly guide front-line staff to take the corresponding actions.

2. OSM model (Objective, Strategy, Measurement): an important method for pinning down the core during the construction of a metric system. It covers business objectives, business strategies, and business measurements, and is horizontal thinking about the content of indicators.

O: What does the user want to accomplish with the product? What needs does the product meet? Objectives are determined mainly from the user and business perspectives; they should be realistic, easy to understand, actionable, and positive.

S: What strategies have I adopted to achieve these goals?

M: What are the subsequent changes in data indicators brought about by these strategies?

10. Describe the attributes of an entity mentioned in the project

This is a hands-on project question; answer according to what is described in your resume.

A reminder here: you must be very familiar with the projects written on your resume, across their whole life cycle, everything before, during, and after development. You can go into more detail than the resume, but do not contradict it.

11. How would you model it if you were doing it?

Detailed modeling can be found in this article: Kimball dimensional modeling, the most commonly used model in data warehouse construction

The following is an excerpt from the above article

When it comes to modeling, keep in mind the four steps of dimensional modeling. How to build the model is explained around the following four steps:

1. Select the business process. Dimensional modeling stays close to the business, so modeling must be based on the business. Selecting the business process, as the name implies, means choosing which business in the overall business process we need to model, based on the requirements provided by operations and on future scalability.

2. Declare the grain. Start by focusing on atomic-grain data, which can withstand unexpected user queries. Rolled-up summary grains are, however, very important for query performance, so we build rolled-up summary grains for data with clear requirements and keep atomic grains for data whose requirements are not yet clear.

3. Identify the dimensions. Dimension tables are known as the "soul" of the data warehouse, because they serve as the entry points and descriptive identifiers for business analysis. Once the grain is declared, you need to determine which attributes are dimensions. How to decide which attributes are dimension attributes is not detailed here; see the article linked above.

4. Identify the facts. One of the core principles of dimensional modeling is that all measures in the same fact table must have the same grain, which guarantees that measures are not double counted. Sometimes it is not obvious whether a column is a fact or a dimension attribute; remember that the most useful facts are numeric and additive.

There is a lot of material here. After walking through the four steps above, you can go on to talk about how the data warehouse is layered, what data each layer stores, and so on. For details, see the article: building an offline data warehouse in combination with the company's business analysis.
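
To make the four steps concrete with the earlier GMV example, here is a minimal star-schema sketch (hypothetical table and column names; the declared grain of the fact table is one transaction record per city per day):

-- Hypothetical dimension table: one row per city, with its province attributes
CREATE TABLE dim_city (
    city_num      INT,          -- key of the city
    city_name     VARCHAR(50),
    province_num  INT,
    province_name VARCHAR(50)
);

-- Hypothetical fact table, grain: one transaction record per city per day
CREATE TABLE fact_gmv (
    date_key INT,               -- e.g. 20240101, joins to a date dimension
    city_num INT,               -- foreign key to dim_city
    gmv      DECIMAL(16, 2)     -- additive measure
);

-- Typical analysis query: total GMV per province
SELECT d.province_name, SUM(f.gmv) AS total_gmv
FROM fact_gmv f
JOIN dim_city d ON f.city_num = d.city_num
GROUP BY d.province_name;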

12. Which data sources are extracted and which tools are used

Answer the questions according to the resume.

The following extracted data sources are for reference only:

  1. Business database data, extracted with Sqoop
  2. Traffic log data, collected in real time with Flume
  3. Third-party company data, collected through public interfaces

13. Describe the internal logic of extraction and how it is implemented

Answer the questions according to the resume.

The following is for reference only:

Before you can start building an extraction system, you need a logical data map that describes the relationship between the original source fields and the final target fields in the tables delivered to the front end. This document runs through the entire ETL system.

Design logic:

  1. Have a plan
  2. Identify candidate data sources
  3. Analyze the source systems with a data profiling and analysis tool
  4. Receive a walk-through of data lineage and business rules
  5. Have a good understanding of data warehouse data models
  6. Verify the validity of the calculation and formula

A logical data map consists of: target table name, table type, SCD (slowly changing dimension) type, source database, source table name, source column name, and transformation.

The map must describe the steps involved in the transformation clearly, leaving no room for doubt.

The table type gives us the order of the data loading process: dimension tables first, then fact tables. Along with the table type, it is important to understand each dimension table's SCD type before development, so we know which columns need to retain historical information and what strategy is required to capture it.

A complete logical data map cannot exist until the source systems have been validated and analyzed. Source system analysis is usually divided into two main phases: data discovery and anomaly detection.

Data discovery phase: the ETL team needs to drill down into the data requirements, identify every source system, table, and attribute that must be loaded into the data warehouse, and determine the appropriate source, the system of record, for each element; this is a challenge that must be evaluated carefully.

Anomaly detection phase: check every foreign key in the source database for NULL values. If NULLs are present, the tables must be joined with an outer join. If the NULL is not in a foreign key but in an ordinary column, there must be a business rule for handling the NULL data. Whenever possible, the warehouse should load a default value instead of NULL.
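
As a small illustration of the anomaly detection step, the NULL foreign-key check and the default-value rule might look like this in SQL (a sketch reusing the business_table/province_table from the SQL questions above):

-- How many transaction rows have no matching province (NULL foreign key after the join)?
SELECT COUNT(*) AS orphan_rows
FROM business_table bt
LEFT JOIN province_table pt ON bt.city_num = pt.city_num
WHERE pt.province_num IS NULL;

-- When loading, apply the business rule: substitute a default value instead of propagating NULL
SELECT bt.city_num,
       bt.gmv,
       COALESCE(pt.province_name, 'UNKNOWN') AS province_name
FROM business_table bt
LEFT JOIN province_table pt ON bt.city_num = pt.city_num;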

14. Do you read any books outside of work?

For reference only:

Some time ago I read The Data Warehouse Toolkit: The Definitive Guide to Dimensional Modeling and benefited a great deal; it gave me a clear understanding of dimensional modeling. Dimensional modeling is an approach that always aims for simplicity, being business-driven, being understandable to users, and good query performance.

Currently I am reading Big Data Daily Record: Architecture and Algorithms, which covers a lot of ground and comprehensively summarizes the technologies for big data storage and processing. Reading books helps me learn big data technology in a more systematic way.

Note: electronic versions of the two books above can be obtained from the backend of the Five Minutes to Learn Big Data official account; reply with the keyword Data Warehouse Toolkit or Big Data Daily Record.

15. What do you think is the difference between Flink and Spark Streaming

This is a very broad question, because the two frameworks differ in many ways. But one important point to make in an interview is: Flink is a true real-time, event-driven stream processing engine, while Spark Streaming uses a micro-batch model.

Here are the main differences between the two frameworks:

  1. Architectural model:

    • The main roles of Spark Streaming at runtime include: Master, Worker, Driver, and Executor.
    • The main roles of Flink at runtime are: JobManager, TaskManager, and Slot.
  2. Task scheduling:

    • Spark Streaming continuously generates micro-batches of data and builds a directed acyclic graph (DAG) from them. Spark Streaming creates the DStreamGraph, JobGenerator, and JobScheduler in turn.
    • Flink generates StreamGraph from user-submitted code, optimizes it to generate JobGraph, and then submits it to JobManager for processing. JobManager generates ExecutionGraph from JobGraph. ExecutionGraph is the core data structure of Flink scheduling. JobManager schedules jobs based on ExecutionGraph.
  3. Time mechanism:

    • Spark Streaming supports a limited time mechanism, only processing time.
    • Flink supports three notions of time for streaming programs: processing time, event time, and ingestion time. It also supports watermarks for handling late data.
  4. Fault tolerance mechanism:

    • For Spark Streaming tasks we can set a checkpoint, and after a failure and restart the job can recover from the last checkpoint. However, this only ensures that data is not lost; records may be processed more than once, so it provides at-least-once rather than exactly-once processing semantics.
    • Flink uses the two-phase commit protocol to solve this problem.

Flink's two-phase commit protocol is explained in detail in this article: Eight diagrams to understand Flink's end-to-end exactly-once processing semantics.

16. What are the advantages of Spark Streaming over Flink

We usually talk about Flink's advantages over Spark, but here the question is the other way around.

Advantages of microbatch processing:

Spark Streaming’s microbatch processing is not as real-time as Flink’s, but it has great advantages for tasks with low real-time requirements.

  • For example, suppose 100,000+ records need to be written into MySQL. If Flink is used for real-time processing, since Flink is event-driven, every record triggers an insert or update against the database, which is obviously not workable because the database cannot sustain that load. If batching is added in the Flink sink, performance improves, but if the last batch never reaches the batch-size threshold, that data is not flushed out, which can lead to data loss.
  • Flink is state-based computation, so doing join-like operations across multiple windows is very difficult without putting all the state in memory, and once the state exceeds memory, the job simply runs out of memory. Spark is based on RDDs and can take advantage of them even when data does not fit in memory, so at coarse time granularity Spark Streaming outperforms Flink in peak throughput.

Language advantages:

  • Both Flink and Spark are implemented with a mix of Scala and Java: Spark's core logic is written in Scala, while Flink's core logic is mainly written in Java. Spark has broader support for third-party languages; it supports Scala, Java, Python, and R almost seamlessly.

17. Do you have any questions

Interviews are a process of getting to know each other, so it’s polite to ask if you have any questions at the end of the interview. The interviewer also wants to see how much you know about their company and how interested you are.

So please do not answer, "I have no more questions." Treat this question as your last chance to speak: ask questions about the company and show your interest and attention.

Here’s what you can ask:

  1. What are the company’s expectations of this position? What do I need to work on specifically?
  2. Will a senior colleague mentor newcomers, and will newcomers get opportunities to take on real work?
  3. The company emphasizes teamwork. What personal traits are the company looking for in this team?
  4. Could you introduce the working environment for me?

Here’s what not to ask:

  1. Salary
  2. Overly deep questions
  3. Questions that go beyond the position you are applying for