Preface

Hello everyone, my name is ChinaManor, which literally means "Chinese code farmer". I hope to be a passer-by on the road to national rejuvenation, a ploughman in the field of big data, an ordinary person unwilling to be mediocre.

After a year, you have finally learned all of the mainstream big data components. The day the learning is done is the day the teacher arrives, and the teacher will now test how well you have learned:

Question 1: How is the Rowkey designed and what are the design rules?

  • Business rule: match the business, and make the most frequently queried field the prefix of the RowKey
  • Uniqueness rule: each RowKey uniquely identifies one piece of data
  • Combination rule: combine the commonly used query conditions into the RowKey
  • Hash rule: RowKeys must not be generated contiguously, so that writes do not hot-spot on a single region (see the sketch below)
  • Length rule: the shorter the better, as long as the business requirements are met
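
To illustrate the hash and length rules together, here is a minimal sketch of a salted RowKey builder. Everything in it is hypothetical (the field names, the two-character MD5 salt, the reversed timestamp); it only shows one common way to keep sequential keys from piling up on one region:

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

// Hypothetical RowKey builder: a short hash prefix (salt) spreads otherwise sequential
// user IDs across regions (hash rule), the most-queried field comes right after the
// salt (business rule), and the overall key stays short (length rule).
public class RowKeyBuilder {
    public static String build(String userId, long eventTimeMs) throws NoSuchAlgorithmException {
        byte[] digest = MessageDigest.getInstance("MD5").digest(userId.getBytes(StandardCharsets.UTF_8));
        String salt = String.format("%02x", digest[0] & 0xff); // two hex chars are usually enough
        long reversedTs = Long.MAX_VALUE - eventTimeMs;        // newest data for a user sorts first
        return salt + "_" + userId + "_" + reversedTs;
    }

    public static void main(String[] args) throws NoSuchAlgorithmException {
        System.out.println(build("user0001", System.currentTimeMillis()));
    }
}
```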

Question 2: Briefly describe the HBase data write process

  • Step1: Obtain metadata

    • The client requests ZooKeeper for the address of the RegionServer that hosts the meta table
    • Read the meta table data to obtain the region metadata of the tables
  • Step2: Find the corresponding Region

    • Locate all of the target table's Regions based on the metadata read from the meta table
    • Determine the Region to write to based on each Region's key range and the RowKey
    • Request the RegionServer that hosts that Region based on the Region's address
  • Step3: Write the data

    • Request the RegionServer to write to the corresponding Region: the RegionServer locates the Region to be written by its Region name

    • Determine which specific Store to write to based on the column family

      • The WAL (HLog write-ahead log) is written first
      • The data is then written into the MemStore of the corresponding Store (the Store's in-memory write buffer)
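
For reference, here is a minimal client-side write sketch using the HBase Java API. The ZooKeeper quorum, table name, column family and RowKey are all hypothetical; the comments map the calls back to the steps above:

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.client.Table;
import org.apache.hadoop.hbase.util.Bytes;

public class HBaseWriteSketch {
    public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create();
        // Step1: the client asks ZooKeeper (hypothetical quorum) where the meta table lives
        conf.set("hbase.zookeeper.quorum", "node1,node2,node3");

        try (Connection conn = ConnectionFactory.createConnection(conf);
             Table table = conn.getTable(TableName.valueOf("user_behavior"))) {
            // Step2: the client locates the target Region from the meta table based on this RowKey
            Put put = new Put(Bytes.toBytes("a1_user0001_9223370000000000000"));
            put.addColumn(Bytes.toBytes("info"), Bytes.toBytes("action"), Bytes.toBytes("click"));
            // Step3: the Put is sent to the RegionServer hosting that Region,
            // which writes the WAL (HLog) first and then the Store's MemStore
            table.put(put);
        }
    }
}
```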

Question 3: What is a coprocessor? How many coprocessors are provided in Hbase?

  • A coprocessor is functionality that you custom-develop against the development interfaces provided by HBase and then integrate into HBase
  • It is similar to a UDF in Hive: if a function is not available out of the box, you can develop it yourself as a coprocessor so that HBase supports it
  • Coprocessors fall into two categories
    • Observer: observer type, implemented like a listener that intercepts events
    • Endpoint: endpoint type, implemented like a stored procedure
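
Below is a minimal Observer sketch, assuming the HBase 2.x coprocessor API. The class name and the (empty) logic inside prePut are hypothetical; a real coprocessor would also have to be packaged and attached to a table or to the cluster configuration:

```java
import java.io.IOException;
import java.util.Optional;
import org.apache.hadoop.hbase.client.Durability;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.coprocessor.ObserverContext;
import org.apache.hadoop.hbase.coprocessor.RegionCoprocessor;
import org.apache.hadoop.hbase.coprocessor.RegionCoprocessorEnvironment;
import org.apache.hadoop.hbase.coprocessor.RegionObserver;
import org.apache.hadoop.hbase.wal.WALEdit;

// Hypothetical Observer coprocessor: like a listener, it intercepts a write before it
// reaches the region, e.g. to validate the Put or maintain a secondary index.
public class AuditObserver implements RegionCoprocessor, RegionObserver {

    @Override
    public Optional<RegionObserver> getRegionObserver() {
        return Optional.of(this);
    }

    @Override
    public void prePut(ObserverContext<RegionCoprocessorEnvironment> ctx,
                       Put put, WALEdit edit, Durability durability) throws IOException {
        // custom logic runs here, before the Put is applied to the region
    }
}
```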

The questions above come from the previously published HBase column (HBase column link)

Question 4: Why are Kafka reads and writes fast?

  • Fast writes
    • Uses the operating system's PageCache page-caching mechanism
    • Writes to disk sequentially (append-only)
  • Fast reads
    • Reads preferentially from PageCache (memory) and uses the zero-copy mechanism
    • Within a partition, every record is read in order by offset (see the consumer sketch below)
    • Data is split into Segment files
    • Index files are built for the Segments
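
To make the "read in order by offset" point concrete, here is a small consumer sketch using the Kafka Java client. The broker address, group id and topic name are made up; within each partition, the records returned by poll() come back in increasing offset order:

```java
import java.time.Duration;
import java.util.Collections;
import java.util.Properties;
import org.apache.kafka.clients.consumer.ConsumerConfig;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.common.serialization.StringDeserializer;

public class OffsetOrderedReader {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put(ConsumerConfig.BOOTSTRAP_SERVERS_CONFIG, "node1:9092"); // hypothetical broker
        props.put(ConsumerConfig.GROUP_ID_CONFIG, "review-demo");
        props.put(ConsumerConfig.KEY_DESERIALIZER_CLASS_CONFIG, StringDeserializer.class.getName());
        props.put(ConsumerConfig.VALUE_DESERIALIZER_CLASS_CONFIG, StringDeserializer.class.getName());

        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
            consumer.subscribe(Collections.singletonList("events")); // hypothetical topic
            ConsumerRecords<String, String> records = consumer.poll(Duration.ofSeconds(1));
            for (ConsumerRecord<String, String> record : records) {
                // within one partition, offsets are strictly increasing
                System.out.printf("partition=%d offset=%d value=%s%n",
                        record.partition(), record.offset(), record.value());
            }
        }
    }
}
```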

Question 5: How does Kafka ensure that produced data is not lost?

  • ACK mechanism: when the receiver gets the data, it returns an ACK to confirm that the data was received
  • The producer sends data to Kafka, and Kafka returns an ACK according to the configured requirement
    • acks=0: the producer sends the next message regardless of whether Kafka has received the previous one
      • Advantage: fast
      • Disadvantage: data loss is easy to cause, and relatively likely
    • acks=1: the producer sends data to Kafka; Kafka waits for the partition's leader replica to finish writing, returns an ACK, and the producer sends the next message
      • Advantage: balances performance and safety
      • Disadvantage: data loss is still possible, but the probability is relatively small
    • acks=all / -1: the producer sends data to Kafka; Kafka waits for all replicas of the partition to finish writing, returns an ACK, and the producer sends the next message
      • Advantage: the safest for data
      • Disadvantage: slower
      • When acks=all is used, it can be combined with the min.insync.replicas parameter to improve efficiency
        • min.insync.replicas: the ACK is returned once at least this minimum number of in-sync replicas have written the data
  • If the producer does not receive the ACK, it uses the retry mechanism to resend the previous message until an ACK is received (see the producer sketch below)
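
For reference, here is a minimal producer configuration sketch using the Kafka Java client with acks=all and retries enabled. The broker address, topic and record contents are made up; note that min.insync.replicas is a topic/broker-side setting rather than a producer property:

```java
import java.util.Properties;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerConfig;
import org.apache.kafka.clients.producer.ProducerRecord;
import org.apache.kafka.common.serialization.StringSerializer;

public class SafeProducerSketch {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put(ProducerConfig.BOOTSTRAP_SERVERS_CONFIG, "node1:9092"); // hypothetical broker
        props.put(ProducerConfig.KEY_SERIALIZER_CLASS_CONFIG, StringSerializer.class.getName());
        props.put(ProducerConfig.VALUE_SERIALIZER_CLASS_CONFIG, StringSerializer.class.getName());
        props.put(ProducerConfig.ACKS_CONFIG, "all"); // wait for all in-sync replicas
        props.put(ProducerConfig.RETRIES_CONFIG, 3);  // resend if no ACK is received
        // min.insync.replicas would be set on the topic/broker, e.g. to 2, so that
        // acks=all only needs 2 in-sync replicas instead of the full replica set.

        try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
            producer.send(new ProducerRecord<>("events", "user0001", "click"),
                    (metadata, exception) -> {
                        if (exception != null) {
                            exception.printStackTrace(); // send failed even after retries
                        }
                    });
            producer.flush();
        }
    }
}
```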

Question 6: What are the partitioning rules for producers in Kafka? How do I customize the partitioning rules?

  • If a partition is specified: write to the specified partition
  • If no partition is specified, check whether a key is specified
    • If a key is specified: partition by hashing the key
    • If no key is specified: use sticky partitioning
  • Custom partitioning (see the sketch below)
    • Develop a class that implements the Partitioner interface
    • Implement the partition() method
    • Configure the partitioner class on the producer
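
Here is a minimal custom partitioner sketch against the Kafka Java client's Partitioner interface. The routing rule (keys starting with "vip_" always go to partition 0) is purely hypothetical:

```java
import java.util.Map;
import org.apache.kafka.clients.producer.Partitioner;
import org.apache.kafka.common.Cluster;
import org.apache.kafka.common.utils.Utils;

// Hypothetical partitioner: "vip_" keys go to partition 0, everything else is hashed.
public class VipPartitioner implements Partitioner {

    @Override
    public int partition(String topic, Object key, byte[] keyBytes,
                         Object value, byte[] valueBytes, Cluster cluster) {
        int numPartitions = cluster.partitionsForTopic(topic).size();
        if (keyBytes == null) {
            return 0; // no key: a real implementation might mimic sticky behaviour instead
        }
        if (key.toString().startsWith("vip_")) {
            return 0;
        }
        return Utils.toPositive(Utils.murmur2(keyBytes)) % numPartitions;
    }

    @Override
    public void close() {}

    @Override
    public void configure(Map<String, ?> configs) {}
}
```

It would then be registered on the producer with props.put(ProducerConfig.PARTITIONER_CLASS_CONFIG, VipPartitioner.class.getName()).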

The questions above come from the previously published Kafka column (Kafka column link)

Question 7: Describe the Spark on YARN job submission process (yarn-cluster mode)

1. The client that submits the task communicates with ResourceManager to apply for starting the ApplicationMaster.
2. ResourceManager allocates a Container and starts the ApplicationMaster on an appropriate NodeManager; in cluster mode the ApplicationMaster is the Driver.
3. After the Driver starts, it applies to ResourceManager for Executor resources.
4. ResourceManager allocates Containers after receiving the ApplicationMaster's resource request, and the ApplicationMaster starts the Executor processes on the assigned NodeManagers.
5. After the Executor processes start, they register with the Driver in reverse (reverse registration).
6. After all Executors have registered, the Driver executes main(); when an action operator is reached, a job is triggered and stages are divided according to wide dependencies. Each stage generates a TaskSet, and the tasks are then distributed to the Executors for execution.

Question 8: Describe the Spark on YARN job submission process (yarn-client mode)

1. The Driver runs on the local machine where the task is submitted; after the Driver starts, it communicates with ResourceManager to apply for starting the ApplicationMaster.
2. ResourceManager allocates a Container and starts the ApplicationMaster on an appropriate NodeManager; in client mode the ApplicationMaster only functions as an ExecutorLauncher and only applies to ResourceManager for Executor resources.
3. ResourceManager allocates Containers after receiving the ApplicationMaster's resource request, and the ApplicationMaster starts the Executor processes on the NodeManagers designated by the resource allocation.
4. After the Executor processes start, they register with the Driver in reverse (reverse registration).
5. After all Executors have registered, the Driver executes main(); when an action operator is executed, a job is triggered and stages are divided according to wide dependencies. Each stage generates a TaskSet, and the tasks are then distributed to the Executors for execution.

Question 9: What are the relationship and the differences between repartition and coalesce?

1) Relationship:

Both are used to change the number of partitions of an RDD. Under the hood, repartition simply calls coalesce with shuffling enabled: coalesce(numPartitions, shuffle = true)

2) Differences:

repartition always triggers a shuffle; coalesce decides whether to shuffle based on its input parameter (shuffle defaults to false)

In general, repartition is used to increase the number of partitions of an RDD, and coalesce is used to reduce it
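
A minimal local sketch of the difference, using Spark's Java API (the data and partition counts are arbitrary):

```java
import java.util.Arrays;
import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;

public class RepartitionVsCoalesce {
    public static void main(String[] args) {
        SparkConf conf = new SparkConf().setAppName("repartition-vs-coalesce").setMaster("local[*]");
        try (JavaSparkContext sc = new JavaSparkContext(conf)) {
            JavaRDD<Integer> rdd = sc.parallelize(Arrays.asList(1, 2, 3, 4, 5, 6), 3);

            // repartition always shuffles; typically used to increase the partition count
            JavaRDD<Integer> more = rdd.repartition(6);

            // coalesce (shuffle = false by default) just merges partitions;
            // typically used to reduce the partition count, e.g. before writing output
            JavaRDD<Integer> fewer = rdd.coalesce(1);

            System.out.println(more.getNumPartitions());  // 6
            System.out.println(fewer.getNumPartitions()); // 1
        }
    }
}
```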

Question 10: What is the difference between cache and persist?

Both cache and persist are used to cache an RDD so that it does not need to be recomputed when it is reused later, which can greatly reduce running time

1) cache has only the one default storage level, MEMORY_ONLY, and cache itself calls persist; persist can be set to other storage levels as needed;

2) By default (under Spark's legacy static memory management), roughly 60% of executor memory is reserved for caching and 40% for task computation. persist is the underlying, lowest-level function; cache is built on top of it.
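
A minimal sketch of the two calls using Spark's Java API (the data is arbitrary, and MEMORY_AND_DISK is just one example of a non-default storage level):

```java
import java.util.Arrays;
import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;
import org.apache.spark.storage.StorageLevel;

public class CacheVsPersist {
    public static void main(String[] args) {
        SparkConf conf = new SparkConf().setAppName("cache-vs-persist").setMaster("local[*]");
        try (JavaSparkContext sc = new JavaSparkContext(conf)) {
            // cache() is just persist() with the default MEMORY_ONLY storage level
            JavaRDD<Integer> cached = sc.parallelize(Arrays.asList(1, 2, 3, 4)).cache();

            // persist() lets you choose another storage level, e.g. spill to disk
            JavaRDD<Integer> persisted = sc.parallelize(Arrays.asList(5, 6, 7, 8))
                                           .persist(StorageLevel.MEMORY_AND_DISK());

            // reusing either RDD now reads the cached data instead of recomputing it
            System.out.println(cached.count() + persisted.count());
        }
    }
}
```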

The questions above come from the previously published Spark column (Spark column link)

Question 11: What is the watermark mechanism in Flink?

1. First, what is a watermark? A watermark is a special timestamp attached to the data. 2. How is the watermark calculated? Watermark = the maximum event time seen in the current window − the maximum allowed lateness (out-of-order time). 3. What does the watermark do? It triggers the window computation when both of the following hold (see the sketch below):

  • 1. There is data in the window
  • 2. Watermark >= the window end time
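
For reference, here is a minimal sketch using Flink's WatermarkStrategy API. The input values (which double as event-time timestamps in milliseconds), the 3-second out-of-orderness bound and the 5-second window are all made up:

```java
import java.time.Duration;
import org.apache.flink.api.common.eventtime.WatermarkStrategy;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.flink.streaming.api.windowing.assigners.TumblingEventTimeWindows;
import org.apache.flink.streaming.api.windowing.time.Time;

public class WatermarkSketch {
    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

        env.fromElements(1000L, 1500L, 4000L, 2500L) // hypothetical events; value = event time (ms)
           // watermark = max event time seen so far - 3 seconds of allowed out-of-orderness
           .assignTimestampsAndWatermarks(
                   WatermarkStrategy.<Long>forBoundedOutOfOrderness(Duration.ofSeconds(3))
                           .withTimestampAssigner((value, recordTs) -> value))
           .keyBy(v -> "all")
           // a 5-second event-time window fires once the watermark passes its end time
           .window(TumblingEventTimeWindows.of(Time.seconds(5)))
           .reduce((a, b) -> a + b)
           .print();

        env.execute("watermark-sketch");
    }
}
```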

Question 12: What are the four cornerstones of Flink?

Checkpoint, State, Time, Window

Question 13: What are Flink's restart strategies?

Fixed Delay Restart Strategy

Failure Rate Restart Strategy

Fallback Restart Strategy (fall back to the cluster-level default strategy)

No Restart Strategy
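
A minimal configuration sketch using Flink's Java API (the retry counts and delays are arbitrary):

```java
import java.util.concurrent.TimeUnit;
import org.apache.flink.api.common.restartstrategy.RestartStrategies;
import org.apache.flink.api.common.time.Time;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

public class RestartStrategySketch {
    public static void main(String[] args) {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

        // fixed delay: retry at most 3 times, waiting 10 seconds between attempts
        env.setRestartStrategy(RestartStrategies.fixedDelayRestart(3, Time.of(10, TimeUnit.SECONDS)));

        // alternatives (pick one):
        // env.setRestartStrategy(RestartStrategies.failureRateRestart(
        //         3, Time.minutes(5), Time.seconds(10))); // at most 3 failures within 5 minutes
        // env.setRestartStrategy(RestartStrategies.noRestart());
        // not calling setRestartStrategy at all falls back to the strategy
        // configured for the cluster in flink-conf.yaml
    }
}
```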

Question 14: Describe Flink's dual-stream join

Flink's dual-stream joins come in only two types: Window Join and Interval Join.

  • Window Join can be divided into three types according to the type of Window:

Tumbling Window Join, Sliding Window Join, Session Window Join

A window-based join uses the window mechanism: data is first cached in window state, and the join is performed when the window triggers its computation.

  • Interval Join also uses state to cache data for later processing. The difference is the expiration mechanism for the data in state: cleanup is driven by the data itself (its timestamps), not by a window firing. A sketch follows below.
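
Below is a minimal Interval Join sketch. The two streams (orders and payments), their keys and timestamps, and the ±5 second interval are all hypothetical:

```java
import org.apache.flink.api.common.eventtime.WatermarkStrategy;
import org.apache.flink.api.java.tuple.Tuple2;
import org.apache.flink.streaming.api.datastream.DataStream;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.flink.streaming.api.functions.co.ProcessJoinFunction;
import org.apache.flink.streaming.api.windowing.time.Time;
import org.apache.flink.util.Collector;

public class IntervalJoinSketch {
    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

        // hypothetical (key, event-time ms) streams, e.g. orders and payments per user
        DataStream<Tuple2<String, Long>> orders = env
                .fromElements(Tuple2.of("u1", 1000L), Tuple2.of("u2", 2000L))
                .assignTimestampsAndWatermarks(WatermarkStrategy
                        .<Tuple2<String, Long>>forMonotonousTimestamps()
                        .withTimestampAssigner((e, ts) -> e.f1));
        DataStream<Tuple2<String, Long>> payments = env
                .fromElements(Tuple2.of("u1", 3000L), Tuple2.of("u2", 9000L))
                .assignTimestampsAndWatermarks(WatermarkStrategy
                        .<Tuple2<String, Long>>forMonotonousTimestamps()
                        .withTimestampAssigner((e, ts) -> e.f1));

        orders.keyBy(t -> t.f0)
              .intervalJoin(payments.keyBy(t -> t.f0))
              // join an order with payments whose timestamps fall in [order - 5s, order + 5s];
              // state outside this interval is cleaned up as the watermark advances
              .between(Time.seconds(-5), Time.seconds(5))
              .process(new ProcessJoinFunction<Tuple2<String, Long>, Tuple2<String, Long>, String>() {
                  @Override
                  public void processElement(Tuple2<String, Long> order, Tuple2<String, Long> payment,
                                             Context ctx, Collector<String> out) {
                      out.collect(order.f0 + " order@" + order.f1 + " joined payment@" + payment.f1);
                  }
              })
              .print();

        env.execute("interval-join-sketch");
    }
}
```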

Question 15: The two modes in which Flink on YARN executes tasks

The first is yarn-session mode (start a long-running Flink cluster on YARN). In this mode you first start a cluster and then submit jobs to it. The session applies to YARN for a fixed amount of resources up front, and those resources never change: if they are used up, the next job cannot be submitted, and it can only be submitted after one of the running jobs finishes and releases its resources. Resources are therefore limited to the size of the session and cannot be exceeded, which makes this mode suitable for a specific, fixed environment or for test environments.

The second is flink run mode, which submits a single Flink job directly to YARN (run a Flink job on YARN). In this mode each submission corresponds to one YARN application; the job applies to YARN for resources on its own, according to its needs, until the job finishes. One job does not affect the next, unless YARN itself has no resources left. Production environments generally run jobs this way, and it requires the cluster to have sufficient resources.

The questions above come from the previously published Flink column (Flink column link)

Conclusion

That is the full set of 15 questions in this comprehensive big data interview review. Have you mastered them all?

I hope you take something away from this read. If you did, feel free to give it the one-click triple support, and we'll see you next time 👋