Just as there are a thousand Hamlets in a thousand readers' eyes, every interviewer has their own idea of what makes a good big data interview question. These are the questions I think are worth using when interviewing big data programmers.

The questions are not tied to any particular company or business, and I don't claim to answer them all well; I have simply pulled out the questions I think are good from the ones I have collected.

Interview questions are collected from three sources:

  1. Questions the author collected and organized while preparing for interviews.
  2. New questions the author thought of while preparing for interviews.
  3. Good questions the author encountered during interviews.

Enough preamble; let's get to the questions.


1. Spark RDD generation process


The Spark task scheduling process consists of four steps

1) RDD objects

In the preparation stage, the RDDs and their dependencies are organized into a rough DAG of RDDs. The DAG is a directed acyclic graph.

2) DAGScheduler

When computation is triggered (an action operator is executed), the partition dependencies in the DAG are examined to determine which are wide and which are narrow, a more detailed DAG is generated, and each stage is packaged as a TaskSet and submitted to the cluster.

3) TaskScheduler

Receives the TaskSets, decides which task should run on which worker, and sends the tasks to the workers for execution.

4) Worker execution phase

The worker receives the tasks, fetches data from the blocks on the cluster nodes through the Spark block manager, and starts an Executor to complete the computation.
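To make the four steps concrete, here is a minimal sketch using Spark's Java RDD API (a hypothetical local job, not taken from the original article): map-style transformations create narrow dependencies, reduceByKey introduces a wide (shuffle) dependency, and the collect() action is what triggers the DAGScheduler to build the stages and submit the TaskSets.

import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaPairRDD;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;
import scala.Tuple2;

import java.util.Arrays;

public class DagDemo {
    public static void main(String[] args) {
        SparkConf conf = new SparkConf().setAppName("DagDemo").setMaster("local[*]");
        JavaSparkContext sc = new JavaSparkContext(conf);

        JavaRDD<String> words = sc.parallelize(Arrays.asList("a", "b", "a", "c"));
        // Narrow dependency: each output partition depends on exactly one input partition.
        JavaPairRDD<String, Integer> ones = words.mapToPair(w -> new Tuple2<>(w, 1));
        // Wide dependency: reduceByKey shuffles data, so the DAGScheduler cuts a stage boundary here.
        JavaPairRDD<String, Integer> counts = ones.reduceByKey(Integer::sum);
        // Action: only now are stages built and TaskSets submitted to the TaskScheduler.
        System.out.println(counts.collect());
        sc.stop();
    }
}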

2. Spark task submission process

1. The program is submitted with the spark-submit command, which launches the Driver process for the application.

2. The Driver creates the SparkContext object; its main components are the DAGScheduler and the TaskScheduler.

3. Through the SparkContext, the Driver registers the Application information with the Master, and the Master starts Executors on the Worker nodes based on that information.

4. Each Executor creates an internal thread pool for running tasks and then reverse-registers itself with the Driver.

5. DAGScheduler: responsible for turning a Spark job into a directed acyclic graph (DAG) of stages. Stages are split according to wide and narrow dependencies, then packaged into TaskSets and handed to the TaskScheduler. The DAGScheduler also handles failures caused by lost shuffle data.

6. TaskScheduler: maintains all TaskSets, distributes tasks to the Executors on each node (based on data-locality policies), monitors the running status of tasks, and retries failed tasks.

7. After all tasks are complete, the SparkContext unregisters from the Master to release the resources.
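A hedged sketch of the Driver side of this flow using the Java API; the master URL and the resource settings are illustrative values, not recommendations from the original article.

import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaSparkContext;

public class SubmitDemo {
    public static void main(String[] args) {
        SparkConf conf = new SparkConf()
                .setAppName("SubmitDemo")
                .setMaster("spark://master-host:7077")   // standalone Master (hypothetical host)
                .set("spark.executor.memory", "2g")      // resources the Master uses when starting Executors
                .set("spark.executor.cores", "2");
        // Creating the SparkContext registers the application with the Master,
        // which then asks the Workers to start the Executors.
        JavaSparkContext sc = new JavaSparkContext(conf);
        // ... define RDDs and trigger actions here ...
        sc.stop();  // unregisters the application and releases its resources
    }
}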

3. Spark SQL creates a partition table

spark.sql("use oracledb")

spark.sql("CREATE TABLE IF NOT EXISTS " + tablename + " (OBUID STRING, BUS_ID STRING, REVTIME STRING, OBUTIME STRING, LONGITUDE STRING, LATITUDE STRING, \
GPSKEY STRING, DIRECTION STRING, SPEED STRING, RUNNING_NO STRING, DATA_SERIAL STRING, GPS_MILEAGE STRING, SATELLITE_COUNT STRING, ROUTE_CODE STRING, SERVICE STRING) \
PARTITIONED BY (area STRING, obudate STRING) ROW FORMAT DELIMITED FIELDS TERMINATED BY ',' ")

# set parameters
# hive> set hive.exec.dynamic.partition.mode=nonstrict;
# hive> set hive.exec.dynamic.partition=true;
spark.sql("set hive.exec.dynamic.partition.mode = nonstrict")
spark.sql("set hive.exec.dynamic.partition = true")

# print("create database complete")

if addoroverwrite:
    # append
    spark.sql("INSERT INTO TABLE " + tablename + " PARTITION(area, obudate) SELECT OBUID, BUS_ID, REVTIME, OBUTIME, LONGITUDE, LATITUDE, GPSKEY, DIRECTION, SPEED, \
    RUNNING_NO, DATA_SERIAL, GPS_MILEAGE, SATELLITE_COUNT, ROUTE_CODE, SERVICE, 'gz' AS area, SUBSTR(OBUTIME,1,10) AS obudate FROM " + tablename + "_tmp")

4. What are the Java synchronization locks

Mainly the synchronized keyword (intrinsic monitor locks on objects or classes); in addition, the java.util.concurrent.locks package provides explicit locks such as ReentrantLock and ReentrantReadWriteLock.
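A minimal sketch of both kinds of lock (class and field names are made up for illustration):

import java.util.concurrent.locks.ReentrantLock;

public class Counters {
    private int a;
    private int b;
    private final ReentrantLock lock = new ReentrantLock();

    // Intrinsic (monitor) lock: synchronized locks on "this".
    public synchronized void incA() {
        a++;
    }

    // Explicit lock: must be released in a finally block.
    public void incB() {
        lock.lock();
        try {
            b++;
        } finally {
            lock.unlock();
        }
    }
}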

5. Can ArrayList store null

Yes. ArrayList stores elements as object references, and null is a valid reference value, so it can be added to the list.
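A quick sketch to confirm the behaviour (a hypothetical snippet, not from the original article):

import java.util.ArrayList;
import java.util.List;

public class NullDemo {
    public static void main(String[] args) {
        List<String> list = new ArrayList<>();
        list.add(null);                   // allowed: elements are just object references
        System.out.println(list.size());  // 1
        System.out.println(list.get(0));  // null
    }
}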

6. Spring Cloud control permissions

How should permissions be managed for microservices under Spring Cloud, and what is a reasonable design? At a high level this is service permission control, which is divided into three parts: user authentication, user authorization, and service verification.

7. HashSet contains method

The contains method is used to determine whether the Set collection contains the specified object.

Syntax: boolean contains(Object o)

Return value: true if the set contains the specified object, otherwise false.
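A small sketch (names are illustrative); contains uses the element's hashCode() to locate the bucket and equals() to confirm the match, so both must be implemented consistently for custom element types.

import java.util.HashSet;
import java.util.Set;

public class ContainsDemo {
    public static void main(String[] args) {
        Set<String> set = new HashSet<>();
        set.add("spark");
        System.out.println(set.contains("spark"));   // true
        System.out.println(set.contains("hadoop"));  // false
    }
}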

8. Spark Streaming data block size

// The shuffle write buffer is 32 KB by default
sparkConf.set("spark.shuffle.file.buffer", "64k")

// The reduce side pulls data in chunks of 48 MB by default
sparkConf.set("spark.reducer.maxSizeInFlight", "96m")

spark.shuffle.file.buffer

Default value: 32 KB

Parameter description: this parameter sets the buffer size of the BufferedOutputStream used by shuffle write tasks. Data is written into the buffer first and only spilled to the disk file once the buffer is full.

Tuning suggestion: if the job has sufficient memory, this value can be increased (for example to 64 KB) to reduce the number of spills to disk during shuffle write and therefore the number of disk I/O operations, improving performance. In practice, tuning this parameter properly improves performance by about 1%~5%.

spark.reducer.maxSizeInFlight

Default value: 48 MB

Parameter Description: This parameter sets the size of the Shuffle Read task’s buffer, which determines how much data can be pulled at a time.

Tuning advice: If the job has sufficient memory resources, increase the size of this parameter appropriately (such as 96 MB) to reduce the number of pulls, thus reducing the number of network transfers, and thus improving performance. It is found in practice that the performance can be improved by 1%~5% when the parameters are properly adjusted.

Error: Reduce OOM

A reduce task aggregates data while it pulls it from the map side, and the reduce side only has a limited aggregation memory pool (executor memory * 0.2).

Solutions:

1. Increase the fraction of memory used for reduce-side aggregation: raise spark.shuffle.memoryFraction.

2. Increase the executor memory: --executor-memory 5G.

3. Reduce the amount of data each reduce task pulls at a time: set spark.reducer.maxSizeInFlight to 24m.
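As a quick illustration, a sketch of applying the settings discussed above through SparkConf (Java API); the values are the examples from the text, not universal recommendations.

import org.apache.spark.SparkConf;

public class ShuffleTuning {
    public static void main(String[] args) {
        SparkConf conf = new SparkConf()
                .setAppName("ShuffleTuning")
                .set("spark.shuffle.file.buffer", "64k")       // default is 32k
                .set("spark.reducer.maxSizeInFlight", "96m")   // default is 48m
                .set("spark.executor.memory", "5g");           // or pass --executor-memory 5g to spark-submit
        // Build the SparkContext / SparkSession from this conf as usual.
    }
}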

9. GC

The Java garbage collection (GC) mechanism is one of the major differences between Java and C++/C. In general you do not need to write explicit memory-management or garbage-collection code in Java, because the Java virtual machine provides automatic memory management and garbage collection.

10. How to ensure data integrity of Flume push and pull

The channel provides persistence: with a durable channel (for example the file channel) and Flume's transactional handoff between source and channel and between channel and sink, events are not lost once they have been committed to the channel.

11. Java 1/0.0 infinity

In floating-point arithmetic we sometimes end up with a divisor of 0. How does Java handle this?

In integer arithmetic the divisor cannot be zero, otherwise an ArithmeticException is thrown at run time. Floating-point arithmetic, however, introduces infinity; let's look at the definitions in Double and Float.

1. Multiplying infinity by 0 gives NaN

System.out.println(Float.POSITIVE_INFINITY * 0); // output: NaN

System.out.println(Float.NEGATIVE_INFINITY * 0); // output: NaN

2. Divide infinity by 0 and the result remains the same

System.out.println((Float.POSITIVE_INFINITY / 0) == Float.POSITIVE_INFINITY); // output: true

System.out.println((Float.NEGATIVE_INFINITY / 0) == Float.NEGATIVE_INFINITY); // output: true

3. Any operation on infinity other than multiplying by 0 (adding, subtracting, multiplying, or dividing by a finite non-zero number) still yields an infinity

System.out.println(Float.POSITIVE_INFINITY == (Float.POSITIVE_INFINITY + 10000)); // output: true

System.out.println(Float.POSITIVE_INFINITY == (Float.POSITIVE_INFINITY - 10000)); // output: true

System.out.println(Float.POSITIVE_INFINITY == (Float.POSITIVE_INFINITY * 10000)); // output: true

System.out.println(Float.POSITIVE_INFINITY == (Float.POSITIVE_INFINITY / 10000)); // output: true

To determine whether a floating point number is INFINITY, the isInfinite method is used

System.out.println(Double.isInfinite(Float.POSITIVE_INFINITY)); // output: true

12. (int) (char) (byte) -1 = 65535

public class T {

    public static void main(String[] args) {
        new T().toInt(-1);
        new T().toByte((byte) -1);
        new T().toChar((char) (byte) -1);
        new T().toInt((int) (char) (byte) -1);
    }

    // Print the 8 bits of a byte.
    void toByte(byte b) {
        for (int i = 7; i >= 0; i--) {
            System.out.print((b >> i) & 0x01);
        }
        System.out.println();
    }

    // Print the 32 bits of an int.
    void toInt(int b) {
        for (int i = 31; i >= 0; i--) {
            System.out.print((b >> i) & 0x01);
        }
        System.out.println();
    }

    // Print the 16 bits of a char.
    void toChar(char b) {
        for (int i = 15; i >= 0; i--) {
            System.out.print((b >> i) & 0x01);
        }
        System.out.println();
    }
}

Output:

11111111111111111111111111111111
11111111
1111111111111111
00000000000000001111111111111111

The byte value -1 is all ones; casting it to char sign-extends it to an int and then truncates to 16 bits, giving 0xFFFF. Because char is unsigned, the final cast to int zero-extends it to 0x0000FFFF, which is 65535.

13. Is Spark shuffle data stored on disk

Yes. Shuffle write tasks write (spill) their output to local disk files, which the reduce-side tasks then fetch.

14. Hive functions

For example, the conditional expression CASE WHEN ... THEN ... ELSE ... END.

15. How many times does the Hadoop shuffle sort

16. Where does Shuffle take place

MapReduce is the core of Hadoop, and Shuffle is in turn the core of MapReduce. Shuffle takes place between the end of the map phase and the start of the reduce phase, spanning the partition step on the map side, the copy (fetch) phase, and the sort phase on the reduce side.

17. How does Spark kill a submitted task

18. What parameters can be set when submitting the Spark task

19. What do Zookeeper's three processes do

Zookeeper provides configuration management, name service, distributed synchronization, and cluster management.

20. What are the advantages and disadvantages of Kafka's three message delivery semantics

1. At-most-once: the client automatically commits the offset after receiving a message but before processing it. Kafka then assumes the consumer has already consumed the message and advances the offset, so a message that fails during processing is never redelivered.

2. At-least-once: the client receives a message, processes it, and only then commits the offset. If the network fails or the application crashes after the message has been processed but before the commit, Kafka assumes the message has not been consumed and delivers it again, causing repeated processing.

3. Exactly-once: message processing and the offset commit happen in the same transaction, i.e. atomically.

A sketch of how the first two modes differ on the consumer side follows.
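Here is a hedged sketch, using the Kafka Java client, of the consumer-side difference between at-most-once and at-least-once; the topic name, group id, and broker address are made up.

import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;

import java.time.Duration;
import java.util.Collections;
import java.util.Properties;

public class DeliverySemantics {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092");
        props.put("group.id", "demo-group");
        props.put("key.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");
        props.put("value.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");
        // At-least-once: disable auto-commit and commit only after processing succeeds.
        // (enable.auto.commit=true, or committing before processing, gives at-most-once instead.)
        props.put("enable.auto.commit", "false");

        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
            consumer.subscribe(Collections.singletonList("demo-topic"));
            while (true) {
                ConsumerRecords<String, String> records = consumer.poll(Duration.ofMillis(500));
                for (ConsumerRecord<String, String> record : records) {
                    process(record); // if this throws, the offset is never committed and the message is redelivered
                }
                consumer.commitSync(); // committing after processing = at-least-once
            }
        }
    }

    private static void process(ConsumerRecord<String, String> record) {
        System.out.println(record.key() + " -> " + record.value());
    }
}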

21. How to write a Flume interceptor plug-in

Create a Maven project, add the flume-ng-core dependency, implement the org.apache.flume.interceptor.Interceptor interface together with its nested Builder, and configure the interceptor on the source; see the sketch below.
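A hedged sketch of such an interceptor (assuming the flume-ng-core dependency); it just adds a fixed header to every event, and the class and header names are made up.

import org.apache.flume.Context;
import org.apache.flume.Event;
import org.apache.flume.interceptor.Interceptor;

import java.util.List;

public class TagInterceptor implements Interceptor {

    @Override
    public void initialize() {
    }

    @Override
    public Event intercept(Event event) {
        event.getHeaders().put("source-tag", "demo"); // enrich the event; return null to drop it
        return event;
    }

    @Override
    public List<Event> intercept(List<Event> events) {
        for (Event e : events) {
            intercept(e);
        }
        return events;
    }

    @Override
    public void close() {
    }

    // Flume instantiates interceptors through a nested Builder configured on the source.
    public static class Builder implements Interceptor.Builder {
        @Override
        public Interceptor build() {
            return new TagInterceptor();
        }

        @Override
        public void configure(Context context) {
        }
    }
}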

22. How to implement small file aggregation in Hadoop

Hadoop itself provides several mechanisms for the small-files problem, including HAR (Hadoop Archives), SequenceFile, and CombineFileInputFormat.

23. Does Spark RDD store data?

An RDD does not store the actual data; it only records how to obtain it: getPartitions describes the partitions of the underlying data, and compute describes how to compute a single partition.

24. How to implement thread synchronization for a Map

There are two ways to implement synchronization:

1. Sync code block:

Syntax: synchronized (lockObject) { ... }. It is used when N threads access the same data at the same time: only the thread holding the monitor can enter the block.

2. Synchronization method:

public synchronized void methodName() { ... }

A method modified with synchronized is called a synchronized method. A synchronized method needs no explicit monitor: its monitor is this, the object on which the method is called. Using synchronized methods is a convenient way to turn a class into a thread-safe class, which has the following characteristics:

1. Objects of this class can be safely accessed by multiple threads.

2. Every thread that calls any method on this object will get the correct result.

3. After any thread has called any method of the object, the object's state remains consistent.

Note: the synchronized keyword can modify methods and code blocks, but not constructors or fields.

Keep the trade-off in mind when implementing synchronization: it gives high safety but lower performance and is meant for multithreaded code; omitting it gives high performance but low safety and is only suitable for single-threaded code.

1. Do not synchronize all methods of a thread-safe class, only those methods that change the shared resource.

2. If a mutable class may run in both single-threaded and multithreaded environments, provide two versions of it: a thread-unsafe version (without synchronized methods or blocks) to be used in single-threaded environments for performance, and a thread-safe version to be used in multithreaded environments.
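For the Map case specifically, a hedged sketch of the two styles applied to a shared map (in practice ConcurrentHashMap or Collections.synchronizedMap is usually preferable; class and field names are illustrative):

import java.util.HashMap;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

public class SyncMapDemo {
    private final Map<String, Integer> plainMap = new HashMap<>();
    private final Map<String, Integer> concurrentMap = new ConcurrentHashMap<>();

    // Synchronized method: the monitor is "this".
    // (In real code, pick one consistent monitor for the same shared data.)
    public synchronized void putWithSyncMethod(String key, int value) {
        plainMap.put(key, value);
    }

    // Synchronized block: lock only around the access to the shared resource.
    public void putWithSyncBlock(String key, int value) {
        synchronized (plainMap) {
            plainMap.put(key, value);
        }
    }

    // The concurrent collection handles its own synchronization.
    public void putConcurrent(String key, int value) {
        concurrentMap.put(key, value);
    }
}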

25. What should be paid attention to when using the Combiner component

Because the Combiner may or may not be invoked during a MapReduce job, and may be invoked once or many times, its execution can be neither predicted nor controlled.

Therefore, the principle for using a Combiner is: its presence or absence must not affect the business logic, i.e. the final reducer result has to be the same with or without it, and the Combiner's output KV types must match the reducer's input KV types. An ill-chosen Combiner corrupts the statistics, in which case it is better not to use one at all. A classic example is taking the average of a set of numbers (a combiner-safe sketch follows the arithmetic below):

Combiner used on the mapper side:

3, 5, 7 -> (3 + 5 + 7) / 3 = 5

2, 6 -> (2 + 6) / 2 = 4

Reducer:

(5 + 4) / 2 = 9/2, which is not equal to (3 + 5 + 7 + 2 + 6) / 5 = 23/5
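As referenced above, a hedged sketch of the combiner-safe pattern for averages: the mapper emits "sum,count" pairs, the combiner only pre-aggregates sums and counts, and the reducer divides once at the end. The class name and the "sum,count" text encoding are illustrative, and the job driver and mapper are omitted.

import java.io.IOException;

import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

// Safe to run zero, one, or many times, because summing sums and counts is associative.
public class AvgCombiner extends Reducer<Text, Text, Text, Text> {
    @Override
    protected void reduce(Text key, Iterable<Text> values, Context context)
            throws IOException, InterruptedException {
        long sum = 0;
        long count = 0;
        for (Text v : values) {
            String[] parts = v.toString().split(",");
            sum += Long.parseLong(parts[0]);
            count += Long.parseLong(parts[1]);
        }
        // Emit partial aggregates, not an average, so the reducer can still compute sum / count.
        context.write(key, new Text(sum + "," + count));
    }
}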

26. Contrast between Storm and Sparkstreaming

Storm

1) Real time processing. (Real-time)

2) Implementing some complex functionality, such as sliding windows, is troublesome. (Ease of use)

Native API: Spout and Bolt (a topology is typically Spout -> Bolt -> Bolt)

The Trident framework: a little harder to use.

3) It does not have a complete ecosystem.

SparkStreaming

1) It has a batch-processing feel: each small batch of data is processed quickly in memory, so it is close to real time (quasi real time). (Real-time)

2) It encapsulates many high-level APIs, so complex functionality is easy for users to implement. (Ease of use)

3) It has a complete ecosystem: SparkCore, SparkSQL, MLlib, GraphX and so on can be used together and switched between seamlessly.

Here’s an analogy to illustrate the difference between the two:

Storm is like an escalator in a supermarket, running all the time;

SparkStreaming is like an elevator in a supermarket, picking up a few people at a time.
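To illustrate the micro-batch ("elevator") model, here is a minimal sketch of a Spark Streaming word count with the Java API; the socket source on localhost:9999 and the 5-second batch interval are arbitrary example choices.

import java.util.Arrays;

import org.apache.spark.SparkConf;
import org.apache.spark.streaming.Durations;
import org.apache.spark.streaming.api.java.JavaDStream;
import org.apache.spark.streaming.api.java.JavaPairDStream;
import org.apache.spark.streaming.api.java.JavaStreamingContext;
import scala.Tuple2;

public class StreamingWordCount {
    public static void main(String[] args) throws InterruptedException {
        SparkConf conf = new SparkConf().setAppName("StreamingWordCount").setMaster("local[2]");
        // Each 5-second micro-batch is one "elevator ride".
        JavaStreamingContext jssc = new JavaStreamingContext(conf, Durations.seconds(5));

        JavaDStream<String> lines = jssc.socketTextStream("localhost", 9999);
        JavaDStream<String> words = lines.flatMap(line -> Arrays.asList(line.split(" ")).iterator());
        JavaPairDStream<String, Integer> counts =
                words.mapToPair(w -> new Tuple2<>(w, 1)).reduceByKey(Integer::sum);

        counts.print();          // the high-level DStream API does the heavy lifting
        jssc.start();
        jssc.awaitTermination();
    }
}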

27. What problems may occur during the Hdp disaster recovery test

Ambari Server is a single point of failure. If the Server machine goes down, the Ambari Server's data cannot be recovered, which means the cluster can no longer be managed through Ambari.

28. What makes Kafka’s data stream read faster, and why Kafka was chosen over other messaging middleware

Producer (writing data)

The producer is responsible for submitting data to Kafka. Let’s look at that first.

Kafka writes incoming messages to the hard disk, so it never loses data. To optimize write speed, Kafka uses two techniques: sequential writes and memory-mapped files (mmap).

Sequential writes

Because hard disks are mechanical, every read or write involves a seek followed by the actual transfer, and the seek is a "mechanical action" that consumes the most time. Hard disks therefore hate random I/O and prefer sequential I/O, so to read and write the disk quickly, Kafka uses sequential I/O.

29. What are Spark’s advantages compared with Hadoop

1) Spark’s advantages over Hadoop in processing model

First of all, Spark abandons MapReduce's strict map-then-reduce model and implements a more general directed acyclic graph (DAG) of operators.

In addition, an MR-based compute engine writes intermediate results to disk during shuffle for storage and fault tolerance, and HDFS gets its reliability by keeping three replicas of every file. Spark abstracts the execution model into a general DAG execution plan and triggers computation only at the final step, so multi-stage tasks can run in series or in parallel without writing the intermediate results of each stage to HDFS. Since disk I/O is far slower than memory, Hadoop runs less efficiently than Spark.

2) Data format and memory layout

MR's way of reading and processing data incurs high overhead. Spark abstracts data as the resilient distributed dataset (RDD): RDDs support only coarse-grained writes, but reads can be as fine-grained as a single record, which makes an RDD usable as a distributed index. Spark also lets developers control how data is partitioned across nodes, and users can define custom partitioning strategies such as hash partitioning.

3) Execution strategy

MR spends a lot of time sorting before the data shuffle, and Spark can reduce this overhead: Spark does not sort during shuffle in every scenario and also supports hash-based distributed aggregation. A more general task execution plan (the DAG) is used for scheduling, and the output of each round is cached in memory.

4) Cost of task scheduling

Traditional MR systems were designed for batch jobs that run for hours, and in some extreme cases the latency of submitting a task is very high. Spark uses the event-driven Akka library to launch tasks and reuses threads from a thread pool, avoiding the overhead of starting and switching processes or threads.

5) Expansion of memory computing capability

Spark’s elastic distributed data set (RDD) abstraction allows developers to persistently store any point on the processing pipeline in memory across cluster nodes, ensuring that subsequent steps that require the same data set do not have to be recalculated or loaded from disk, greatly improving performance. This feature makes Spark ideal for algorithms that involve a large number of iterations, traversing the same data set multiple times, as well as reactive applications that scan large amounts of in-memory data and quickly respond to user queries.

6) Improvement of development speed

The biggest bottleneck in building data applications is not CPU, disk, or network, but analyst productivity. So Spark accelerates the development process by integrating the entire pipeline of pre-processing to model evaluation into one programming environment. The Spark programming model is expressive and wraps a set of analysis libraries under the REPL, eliminating the overhead of multiple round trips to the IDE. These costs are unavoidable for frameworks such as MapReduce. Spark also avoids the problems of sampling and rolling data back and forth from HDFS that frameworks like R often encounter. The faster analysts can experiment with the data, the more likely they are to be able to extract value from it.

7) Powerful

As a general-purpose computing engine, Spark's core API provides a powerful foundation for data transformation that is independent of any particular functionality from statistics, machine learning, or matrix algebra. Its Scala and Python APIs let you write programs in expressive general-purpose programming languages and access existing libraries.

Spark’s memory cache makes it suitable for iterative computation at both the micro and macro levels. Machine learning algorithms need to traverse the training set many times and can cache the training set in memory. When exploring and getting to know the data set, the data scientist can save on disk access by keeping the data set in memory while running queries and easily caching the converted version.

30. How to resolve Java Hash conflicts

1) Open addressing; 2) separate chaining (the chained-address method); 3) rehashing with a second hash function; 4) establishing a public overflow area.
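A minimal sketch of option 2), separate chaining, which is also the approach java.util.HashMap takes internally (with tree bins added in newer JDKs); the class is illustrative, with a fixed bucket count and no resizing.

import java.util.LinkedList;

public class ChainedHashTable<K, V> {
    private static final int BUCKETS = 16;

    private static final class Entry<K, V> {
        final K key;
        V value;
        Entry(K key, V value) {
            this.key = key;
            this.value = value;
        }
    }

    @SuppressWarnings("unchecked")
    private final LinkedList<Entry<K, V>>[] table = new LinkedList[BUCKETS];

    private int indexFor(K key) {
        return (key.hashCode() & 0x7fffffff) % BUCKETS;  // colliding hashes map to the same bucket
    }

    public void put(K key, V value) {
        int i = indexFor(key);
        if (table[i] == null) {
            table[i] = new LinkedList<>();
        }
        for (Entry<K, V> e : table[i]) {
            if (e.key.equals(key)) {   // same key: update in place
                e.value = value;
                return;
            }
        }
        table[i].add(new Entry<>(key, value));  // collision with a different key: chain it
    }

    public V get(K key) {
        int i = indexFor(key);
        if (table[i] == null) {
            return null;
        }
        for (Entry<K, V> e : table[i]) {
            if (e.key.equals(key)) {
                return e.value;
            }
        }
        return null;
    }
}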