1. How to install and configure open source Apache Hadoop

1. Log in as the root user

2. Configure the IP address

3. Change the hostname

4. Configure passwordless SSH login

5. Disable the firewall

6. Install the JDK

7. Unpack the Hadoop installation package

8. Configure the core Hadoop configuration files: hadoop-env.sh, core-site.xml, mapred-site.xml, yarn-site.xml, hdfs-site.xml

9. Configure the Hadoop environment variables

10. Format the NameNode: hadoop namenode -format

11. Start all nodes with start-all.sh

2. List the processes that need to be started in a Hadoop cluster. What are their functions?

NameNode: manages the metadata of HDFS files, responds to client requests, balances file blocks across DataNodes, and maintains the replica count.

SecondaryNameNode: mainly responsible for the checkpoint operation; it can also act as a cold backup, taking snapshots of data within a certain range.

DataNode: stores data blocks and handles client I/O requests for those blocks.

JobTracker: manages jobs and assigns tasks to TaskTrackers.

TaskTracker: executes the tasks assigned by the JobTracker.

In a Hadoop 2 (YARN) cluster there are also: ResourceManager, NodeManager, JournalNode, ZooKeeper, ZKFC.

3. Describe the working principles of MapReduce.

Map output is first written to an in-memory buffer. When the buffer reaches a certain threshold, the data is spilled to disk; before each spill the data is partitioned, sorted, and (if configured) combined. After multiple spills there are multiple files on disk, which are merged into a single map output file. On the reduce side, shuffle fetches the data files from the map nodes; each map contributes one piece of data. When memory usage reaches a threshold, the in-memory data is merged and written out to a file on disk; data that does not fit in memory is written directly to disk, one file per map's data. When the number of files reaches a threshold, the disk files are merged into one. Finally, a global merge of the in-memory data and the on-disk files produces the final file that becomes the reduce input. The merge step also sorts and, if configured, combines.

4. Differences between internal and external tables in Hive

Internal table: Data is stored in the Hive data warehouse directory. When deleting a table, metadata and actual table files are deleted.

External table: Data is not stored in the Hive data warehouse directory. When deleting a table, metadata is deleted but actual table files are not deleted.

5. Differences between Combiner and Partitioner in MapReduce

Combiner performs a reduce operation on the Map side first, which cuts down the data transferred from the Map side to the Reduce side, saving network bandwidth and improving execution efficiency.

Partitioner partitions the map output by key and sends each partition to a different reduce task so they can run in parallel, improving efficiency.
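
As a concrete illustration, here is a minimal word-count sketch, assuming the newer org.apache.hadoop.mapreduce API, showing where a Combiner and a custom Partitioner are plugged into a job. The class names, the first-character partitioning rule, and the argument paths are all made up for the example.

```java
// Minimal sketch: Combiner = local reduce on the map side; Partitioner routes keys to reducers.
import java.io.IOException;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Partitioner;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCountWithCombiner {

    public static class WordMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
        private static final IntWritable ONE = new IntWritable(1);
        @Override
        protected void map(LongWritable offset, Text line, Context ctx)
                throws IOException, InterruptedException {
            for (String w : line.toString().split("\\s+")) {
                if (!w.isEmpty()) {
                    ctx.write(new Text(w), ONE);
                }
            }
        }
    }

    public static class SumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
        @Override
        protected void reduce(Text word, Iterable<IntWritable> counts, Context ctx)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable c : counts) {
                sum += c.get();
            }
            ctx.write(word, new IntWritable(sum));
        }
    }

    // Custom Partitioner: route words to reduce tasks by their first character
    // instead of the default hash partitioning (illustrative rule only).
    public static class FirstCharPartitioner extends Partitioner<Text, IntWritable> {
        @Override
        public int getPartition(Text key, IntWritable value, int numPartitions) {
            int firstChar = key.toString().isEmpty() ? 0 : key.toString().charAt(0);
            return (firstChar & Integer.MAX_VALUE) % numPartitions;
        }
    }

    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance();
        job.setJarByClass(WordCountWithCombiner.class);
        job.setMapperClass(WordMapper.class);
        job.setReducerClass(SumReducer.class);
        job.setCombinerClass(SumReducer.class);          // reduce locally on the map side
        job.setPartitionerClass(FirstCharPartitioner.class);
        job.setNumReduceTasks(4);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
```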

6. Interview questions:

1. Project experience:

The questions are very detailed. You may be given paper and pen to draw your company's Hadoop project architecture and to describe, for a few pieces of business data, what they look like after passing through the platform.

2. Java:

Common classes of the I/O (input/output) stream hierarchy, plus web services and thread-related knowledge.

3. Linux:

Questions about the jps command, the kill command, awk, sed, and some common Hadoop commands.

4. Hadoop:

The map, shuffle, and reduce processes in Hadoop 1 are discussed; details about spilling on the Map and Reduce sides are asked.

Hive external tables, and how Hive's physical model differs from that of traditional databases, also come up.

7. Questions for beginners:

1. How much is your salary? How many years have you worked?

** 13K. I worked on JavaWeb for nearly three years, started learning Hadoop in April 2014, and have now been in the Hadoop job for a little over a month.

2. How did you learn Flume, Kafka, and Storm? Have you done any optimization?

** Read the official documentation, build the environment first, then write Java code against their interfaces to get familiar with the APIs. If there are video resources, they are also worth watching.

3. Do you use Hadoop 1 or 2 now?

** Hadoop 2.

4. How long have you been doing Hadoop?

** I said nearly two years; you must claim Hadoop experience in the interview.

5. Did you know Storm and Python before, or did you teach yourself after joining the company?

** All of it was self-taught after joining the company.

6. Are you using a paid or a free version of Hadoop?

** Currently the free version.

7. Have you ever built a cluster? Was it stressful at the beginning?

** I built the cluster myself. There was a lot of pressure, but things worked out.

8. Is MapReduce used for detecting advertising fraud?

** We use Storm for that, as real-time processing.

9. Can a cluster be built on ordinary LAN machines?

** Yes. At first I tested on my own computer using virtual machines, and later the company bought servers.

10. Is there anything profound about Flume?

** I don't think there is anything particularly deep; we just haven't gone very far into it. As long as you use it and test it enough, it's just a piece of software.

11. Do you read the source code now?

** I do look at the source code, but I don't think you should get locked into it; we mainly do application work. If you have the energy, you can also study it module by module.

12. How many servers does the company have now?

** There are 10: four that I use for Storm, Kafka, and Flume; four for Hadoop and Hive; and two for machine learning.

13. Can you do Hadoop without Java?

** No, you have to know Java.

14. Do interviews make you write code from memory on the spot?

** I have not encountered that (it differs from company to company).

15. Did you learn by yourself and solve problems by yourself?

** I solve problems by myself when I can; when I can't, colleagues help out.

16. Do you use HBase for your database?

** Not at the moment; we mainly use MongoDB, MySQL, and Redis (though many companies use Hive and HBase).

17. Is it hard to find a job with a junior college degree?

Not really. I have a colleague who also has a junior college degree, but when job hunting she said she had a bachelor's degree (because that position required one). After passing the interview she phoned HR and explained that she actually had a junior college degree and had only claimed a bachelor's to get the interview opportunity. HR said it was fine. That company is ×××, and she now works there. (This is a special case: if the company is stricter, the chance of rejection is very high unless you are genuinely strong. Finding a job with a junior college degree is a perfectly normal thing.)

** This is shared only for your reference; take the essence and discard the dregs.

18. What level are you at with Hadoop now? Is knowing the basic framework enough to use it?

** Yes. I am familiar with the basic framework and the cluster environment, including the APIs that are called.

19. Is the Hadoop direction good? I am now at 15K and considering whether to switch.

** I think it depends on the future development of your current industry. If there is a bottleneck, then yes, you can consider switching.

20. Can you understand the official documents in English?

** Reading documentation is not a problem; my writing and speaking are not good. I am making plans for how to improve that.

21. Where do you see yourself in the IT industry? Will you always be in big data?

** Right now I plan to become a data mining and machine learning engineer.

22. How much Python do you need to know?

** In Internet companies, Python and shell are indispensable tools. I think you mainly need to be proficient in one of them; for Python, being able to read and modify other people's code is enough. Personally I prefer Python, which is more powerful than shell and simpler than Java.

23. Were the 3 weeks of learning done full time, or outside of work?

** During the learning process I usually studied in the evenings, quite obsessively. Maybe it was because I wanted to switch quickly and get out of the misery of the company I was at, haha.

24. People say you had not even built a cluster, yet you took such a job without hesitation. Where did the confidence come from?

** At the time I was also very worried, but after I joined, people told me not to feel too much pressure and that if problems came up they would find someone to help me solve them, so I let go of the worry.

25. To what extent do you need shell to work comfortably?

** For shell, mainly learn awk and sed well; of course you also need the basics, such as network configuration and basic operations.

8. What do you understand about wide tables (in HBase)?

1) A wide table has fewer rows and more columns; if a single row holds too much data, it may not fit in one HFile, but wide tables have the advantage of row-level atomicity. A tall table has many rows but few columns; since HBase can only split (shard) by row, tall tables have an advantage there. The choice should be made comprehensively according to the business scenario.

2) It is better not to define too many column families; in general, one column family per table is good. Flushing and compaction are done per region, so when the data stored in one column family reaches the flush threshold, the other column families in the table are flushed as well even though they may hold little data, which causes a lot of unnecessary I/O. The more column families there are, the greater the impact on performance. In addition, the amount of data stored in different column families of the same table should not differ too much; otherwise some data ends up scattered across too many regions, hurting retrieval efficiency.
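
As a small sketch of the "one column family per table" advice, here is table creation with the HBase 1.x client API (these classes are deprecated in HBase 2.x, where TableDescriptorBuilder is used instead); the table name, family name, and version count are made up:

```java
// Sketch (HBase 1.x client API): create a table with a single column family.
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.HColumnDescriptor;
import org.apache.hadoop.hbase.HTableDescriptor;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Admin;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;

public class CreateTableExample {
    public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create();
        try (Connection conn = ConnectionFactory.createConnection(conf);
             Admin admin = conn.getAdmin()) {
            HTableDescriptor desc = new HTableDescriptor(TableName.valueOf("user_profile"));
            HColumnDescriptor family = new HColumnDescriptor("cf");
            family.setMaxVersions(3);        // keep at most 3 versions per cell
            desc.addFamily(family);          // a single column family for the whole table
            if (!admin.tableExists(desc.getTableName())) {
                admin.createTable(desc);
            }
        }
    }
}
```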

9. HBase rowkey design principles

General rule: consider the business scenario and HBase's storage and access characteristics.

A few simple rules: rowkeys should be unique, of consistent length, and as short as possible.

Then consider a few questions:

1) Is it easy to read?

i. Put the search criteria into the rowkey as much as possible.

ii. Data that is accessed together should have contiguous rowkeys as much as possible, so that a scan with start and end rowkeys can access it directly.

2) Does it improve write efficiency?

i. Evaluate the business scenario and pre-split regions according to the data distribution to improve concurrency.

ii. In some cases, you can prefix the rowkey with a hash (salt) to spread writes across RegionServers and avoid hot-spotting a single node, as sketched below.
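
A minimal sketch of that salting idea, assuming a pre-split table; the bucket count, field layout, and the userId/timestamp inputs are illustrative only:

```java
// Sketch: salt the rowkey with a short hash prefix so writes spread across RegionServers.
import java.nio.charset.StandardCharsets;

public class RowKeyUtil {
    private static final int BUCKETS = 16;   // assumed number of pre-split buckets

    // e.g. userId = "u12345", ts = event time; result: "07|u12345|0000001234567890"
    public static byte[] buildRowKey(String userId, long ts) {
        int bucket = (userId.hashCode() & Integer.MAX_VALUE) % BUCKETS;
        // Fixed-width fields keep the rowkey length consistent and preserve scan order within a bucket.
        String key = String.format("%02d|%s|%016d", bucket, userId, ts);
        return key.getBytes(StandardCharsets.UTF_8);
    }
}
```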

10. What are the join methods in MapReduce?

Database.51cto.com/art/201410/…

My.oschina.net/leejun2005/…

If there are two files File1 and File2, the contents of the files are as follows:

File1: (student code, student name, gender)

zs Zhang San male

...

File2: (student code, elective course, score)

zs c1 80

zs c2 90

...

1) Reduce-side join

Applies when both tables are large.

In the Map phase, each output value is tagged to indicate whether it came from File1 or File2. In the Reduce phase, the values for a key are split into two groups according to their source and the Cartesian product of the two groups is emitted.

Disadvantages:

i. In the map phase the data is not reduced at all, so the network transfer and sorting in the shuffle are expensive.

ii. The reduce side computes the Cartesian product of the two sets, which consumes a lot of memory and can lead to OOM.

Example:

In the map phase, the map function reads both File1 and File2 and tags each record to indicate whether it came from File1 or File2. The join field is used as the key; the other fields plus the tag form the value.

In the Shuffle phase, the join field is the key, and the map outputs for that key are grouped into a list as the value.

In the Reduce phase, for each key the values are split into a left-table group and a right-table group according to the tag, and the Cartesian join is emitted.

For example, left table = { "Zhang San male" }

right table = { "c1 80", "c2 90" }

Two nested for loops then produce the Cartesian join output:

Zhang San male c1 80

Zhang San male c2 90
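
A sketch of this reduce-side join, assuming the new MapReduce API; the file-name test, field splitting, and tag characters are illustrative choices, not a fixed recipe:

```java
// Sketch: the mapper tags each record with its source file; the reducer splits values
// by tag and emits the Cartesian product of the two groups.
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileSplit;

public class ReduceJoin {

    public static class TagMapper extends Mapper<LongWritable, Text, Text, Text> {
        @Override
        protected void map(LongWritable offset, Text line, Context ctx)
                throws IOException, InterruptedException {
            String fileName = ((FileSplit) ctx.getInputSplit()).getPath().getName();
            String tag = fileName.startsWith("File1") ? "L" : "R";   // left/right tag
            String[] fields = line.toString().split("\\s+", 2);
            // key = student code, value = tag + rest of the record
            ctx.write(new Text(fields[0]), new Text(tag + ":" + fields[1]));
        }
    }

    public static class JoinReducer extends Reducer<Text, Text, Text, Text> {
        @Override
        protected void reduce(Text key, Iterable<Text> values, Context ctx)
                throws IOException, InterruptedException {
            List<String> left = new ArrayList<>();
            List<String> right = new ArrayList<>();
            for (Text v : values) {
                String s = v.toString();
                (s.startsWith("L:") ? left : right).add(s.substring(2));
            }
            // Cartesian product of the two groups: this is the step that can blow up memory (OOM).
            for (String l : left) {
                for (String r : right) {
                    ctx.write(key, new Text(l + "\t" + r));
                }
            }
        }
    }
}
```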

2) Map-side join

Suitable when one table is small and the other is large.

The small file is loaded into memory, and the large file is used as the map input; each map task joins its input records against the in-memory small table and outputs the results grouped by key. This avoids most of the sorting and network transfer cost of the shuffle phase.

Example:

Assume that File1 is a small table and File2 is a large table.

i. Place the small table file File1 in the job's DistributedCache.

ii. In the setup function, read File1 from the DistributedCache into memory, e.g. into a hash map such as:

{ zs : "Zhang San male" }

iii. In the map function, scan File2 and check whether each key from File2 is in the hash map. If it is, output (key + the value for that key in the hash map + the other fields from File2), such as:

zs Zhang San male c1 80
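
A sketch of this map-side join with the new MapReduce API, assuming the driver has already added File1 with job.addCacheFile(...) and that the runtime symlinks cached files into the task working directory (the default in MRv2); file layout and field splitting are illustrative:

```java
// Sketch: File1 is read from the DistributedCache into a HashMap in setup(),
// and each File2 record is joined against it in map(). No reduce phase is needed.
import java.io.BufferedReader;
import java.io.FileReader;
import java.io.IOException;
import java.net.URI;
import java.nio.file.Paths;
import java.util.HashMap;
import java.util.Map;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class MapJoinMapper extends Mapper<LongWritable, Text, Text, NullWritable> {

    private final Map<String, String> students = new HashMap<>();

    @Override
    protected void setup(Context ctx) throws IOException {
        if (ctx.getCacheFiles() == null) {
            return;                                      // nothing was cached
        }
        for (URI cached : ctx.getCacheFiles()) {
            // Read the localized copy by its base name from the working directory.
            String localName = Paths.get(cached.getPath()).getFileName().toString();
            try (BufferedReader in = new BufferedReader(new FileReader(localName))) {
                String line;
                while ((line = in.readLine()) != null) {
                    String[] f = line.split("\\s+", 2);  // code -> "name gender"
                    students.put(f[0], f[1]);
                }
            }
        }
    }

    @Override
    protected void map(LongWritable offset, Text line, Context ctx)
            throws IOException, InterruptedException {
        String[] f = line.toString().split("\\s+", 2);   // File2: code, "course score"
        String info = students.get(f[0]);
        if (info != null) {                              // join hit
            ctx.write(new Text(f[0] + "\t" + info + "\t" + f[1]), NullWritable.get());
        }
    }
}
```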

3) Semi join

A variant of the reduce-side join. Extract the join keys from File1 into a File3, distribute File3 to the relevant nodes via the DistributedCache, and load it into memory, e.g. into a hash set. In the Map phase, scan the join table, filter out the records whose key is not in the set, tag the records that do participate in the join, and send them to the Reduce side. The rest of the procedure is the same as for the reduce-side join.

4) Reduce join + Bloom filter

If the keys extracted for a semi join do not fit in memory, consider putting them into a Bloom filter. Use the Bloom filter to filter out the records that do not need to participate in the join, tag the records that do participate, and send them through the shuffle to the Reduce side. The rest is the same as for the reduce-side join. A Bloom filter records data as binary bits (e.g. 0101), so it takes very little space.

11. MR data skew: causes and solutions

Data skew is the situation where keys are distributed unevenly across reduce tasks: a few reduce tasks receive far more data than the others, so most reduce tasks finish while the heavily loaded ones are still running.

Blog.sina.com.cn/s/blog_9402…

www.cnblogs.com/datacloud/p…

Here’s why:

1) Uneven distribution of keys

2) Characteristics of business data itself

Solution:

Assume tables A and B are joined, and A contains keys that cause data skew.

1) First sample table A, and store the records of the keys that cause data skew in a temporary table TMP1.

2) In general there are not many keys that cause data skew, so TMP1 will be a small table. Map-join TMP1 with table B to generate TMP2, and distribute TMP2 to the distributed file cache.

3) The map tasks read tables A and B. If a record comes from table A, check whether its key is in TMP2: if so, output it to a local file a; if not, emit a normal <key, value> pair for the reduce join.

4) Merge local file a with the reduce output from step 3 and write the result to HDFS.

In popular terms:

1) Map data skew is usually caused by too many input files with different sizes. You can merge files.

2) Data skewing on the Reduce end is generally a default partition problem. You can customize partitions according to key features and distribution of data to evenly distribute them to Reduce as much as possible.

3) Set Combine to aggregate and simplify data.

4) Join the two tables. If the key records causing data skew account for a small proportion of the total data, the data can be divided into skew and non-skew parts. In this way, the skew part will be small files and can be processed by Map Join, while the non-skew part can be processed by normal Reduce.

If the statistics are done in Hive, do the following:

1. Adjust Hive configuration parameters:

i. hive.map.aggr = true — partial aggregation on the Map side, equivalent to a Combiner.

ii. hive.groupby.skewindata = true — when data is skewed, the query plan generates two MR jobs: the first distributes keys randomly and pre-aggregates to reduce the data volume; the second performs the real group-by-key aggregation.

2. Adjust the SQL:

i. Small table joined with a large table

Use a map join to load the small table into memory and complete the join on the Map side.

ii. Large table joined with a large table

If null values cause the skew, turn the null into a string plus a random number so that the skewed data is spread across different reduce tasks.

iii. COUNT DISTINCT over a large number of identical special values (such as nulls)

The null values do not need to be processed in the COUNT DISTINCT; just add one to the final count. Alternatively, handle the nulls separately and union the results back at the end.

iv. Joining columns of different data types

The default hash partitioning is computed on one of the types, which can send all values of the other type to the same reduce task; cast both join columns to the same type.

RDBMS: the three normal forms

1NF: fields are atomic (indivisible). For example, org_id may store only the organization code; it cannot store organization code + user code.

2NF: no redundancy in the primary key. org_id + kpi_code can serve as the primary key; org_id + org_name + kpi_code cannot.

3NF: non-key attributes must not introduce dependencies. (org_id, kpi_code, kpi_value) is a good table; (org_id, org_name, kpi_code, kpi_value) is not, because of the dependency involving the non-key attribute org_name.

OLTP can fully comply with 3NF, but OLAP only needs to comply with 2NF.

12. How does Hadoop work?

The NameNode manages the namespace of the file system and maintains the metadata of all files and directories in the file system tree. DataNodes store the actual data and report their storage to the NameNode. In short, data storage is provided by HDFS, and data computation and processing are provided by MR (MapReduce).

13. What is the principle of MapReduce?

A fundamental idea is “divide and conquer”. A large task is divided into many smaller tasks, map is executed in parallel, and Reduce merges the results.

14. How does MapReduce work?

Map output is first written to an in-memory buffer. When the buffer reaches a certain threshold, the data is spilled to disk; before each spill the data is partitioned, sorted, and (if configured) combined. After multiple spills there are multiple files on disk, which are merged into a single map output file. On the reduce side, shuffle fetches the data files from the map nodes; each map contributes one piece of data. When memory usage reaches a threshold, the in-memory data is merged and written out to a file on disk; data that does not fit in memory is written directly to disk, one file per map's data. When the number of files reaches a threshold, the disk files are merged into one. Finally, a global merge of the in-memory data and the on-disk files produces the final file that becomes the reduce input. The merge step also sorts and, if configured, combines.

15. What is the storage mechanism of HDFS?

Namenode is responsible for maintaining metadata of all data directories and files, and Datanode is responsible for actual data storage.

When a client writes data to HDFS, it first splits the data into blocks and communicates with the NameNode, which tells the client the addresses of the DataNodes to write to. After the first DataNode receives the data it forwards it to the second DataNode, and so on, until all replicas are written; then the next block is written.

With rack awareness enabled, the placement policy stores one replica locally, one on another node in the same rack, and one on a node in a different rack.

16. The roles of Combiner and Partitioner in MapReduce

Combiner performs a reduce operation on the Map side first, which cuts down the data transferred from the Map side to the Reduce side, saving network bandwidth and improving execution efficiency.

Partitioner partitions the map output by key and sends each partition to a different reduce task so they can run in parallel, improving efficiency.

17. In what ways can Hive store its metadata, and what are the characteristics of each?

1) Embedded mode: Metadata is stored in a local embedded Derby database, which can only access one data file at a time, meaning that it does not support multi-session connections.

2). Local mode: Metadata is stored in a local independent database (typically mysql), which can support multi-session connections.

3). Remote mode: Store metadata in a remote independent mysql database to avoid installing mysql database on each client.

18. Hadoop rack awareness

Blog.csdn.net/l1028386804…

The value of topology.script.file.name specifies an executable program, usually a shell script, that takes a parameter (an IP address) and outputs a value (the rack location).

IP, host name, and rack location can be configured in a configuration file. The script then reads the configuration file, resolves the rack location corresponding to the incoming IP, and outputs it. It can also be implemented using Java classes.
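
A sketch of the Java-class variant: a class implementing org.apache.hadoop.net.DNSToSwitchMapping, registered via the node-switch mapping property (net.topology.node.switch.mapping.impl in newer Hadoop versions; the exact property name and the set of reload methods vary by version). The IP-to-rack rule below is a made-up example; a real implementation would read a configuration file.

```java
// Sketch: rack awareness implemented as a Java class instead of a shell script.
import java.util.ArrayList;
import java.util.List;
import org.apache.hadoop.net.DNSToSwitchMapping;

public class SimpleRackMapping implements DNSToSwitchMapping {

    @Override
    public List<String> resolve(List<String> names) {
        List<String> racks = new ArrayList<>(names.size());
        for (String name : names) {
            // Example rule: hosts 192.168.1.* are rack1, everything else rack2.
            if (name.startsWith("192.168.1.")) {
                racks.add("/dc1/rack1");
            } else {
                racks.add("/dc1/rack2");
            }
        }
        return racks;
    }

    // Reload hooks required by Hadoop 2.x+ interfaces; nothing is cached in this static example.
    public void reloadCachedMappings() {
    }

    public void reloadCachedMappings(List<String> names) {
    }
}
```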

The HDFS placement policy stores one replica of the data locally, one on another node in the same rack, and one on a node in a different rack. If the local copy is corrupted during computation, data can be fetched from a neighbouring node in the same rack faster than from a different rack; and if the network of the entire rack fails, the data can still be obtained from another rack. Rack awareness is required to implement this placement strategy.

19. HDFS data compression algorithms

www.tuicool.com/articles/eI…

Compression formats: gzip, bzip2, lzo, deflate.

In terms of compression ratio: bzip2 > gzip > lzo

In terms of compression speed: lzo > gzip > bzip2

In addition, bzip2 and lzo both support splitting the file, but gzip does not.

All compression algorithms are trade-offs of time and space. When choosing which compression format to use, we should choose according to our business needs.

20. What schedulers does Hadoop have, and how do they work?

www.mamicode.com/info-detail…

www.cnblogs.com/xing901022/…

My.oschina.net/ssrs2202/bl…

www.tuicool.com/articles/M3…

Lxw1234.com/archives/20…

The FIFO scheduler, which has only one queue, allocates resources to jobs on a first-in, first-out basis.

The Capacity scheduler can set multiple queues and set the resource ratio for each queue. For example, if there are three queues, the resource ratio can be set to 30%, 30%, and 40%. Queues can share resources. When the resources in a queue are not needed, they can be shared with other queues. When the cluster is busy and some tasks are completed, idle resources are allocated to queues with low resource utilization first. In this way, resources are allocated according to queue capacity. Jobs in the queue are selected according to the FIFO rule.

Of course, you can set a maximum resource usage for each queue to ensure that no single queue consumes the resources of the entire cluster.

The Fair scheduler can also have multiple queues, each with a minimum share, a weight, and other settings. For example, in a cluster with 100G of memory and three queues, the minimum shares might be 40G, 30G, and 20G and the weights 2, 3, and 4 (following the principle that whoever shares out more gets more back, i.e. the minimum share is inversely proportional to the weight). Queues can share resources: when a queue does not need its resources, they can be lent to other queues. When the cluster is busy and tasks finish and release resources, the idle resources are first given to queues that are "starving" (those with the largest gap between the resources they are using and their minimum share); once no queue is starving, resources are handed out according to used-resources/weight, with the queue having the lower score getting the next allocation. The overall effect is a "fair" distribution of resources according to minimum share and weight. Jobs inside a queue can be selected by FIFO or by the Fair policy (which considers the gap between the resources a job is using and the resources it needs, plus the submission time).

The Capacity and Fair schedulers also support resource preemption.

21. What is the difference between the RCFile, TextFile, and SequenceFile formats in Hive?

Blog.csdn.net/yfkiss/arti…

TextFile: the default Hive format. Data is not compressed, so disk and network overhead are high. It can be combined with gzip or bzip2 compression, but then Hive cannot split the data and therefore cannot process it in parallel.

SequenceFile: a binary format supported by the Hadoop API. It is easy to use, splittable, and compressible. SequenceFile supports three compression options: NONE, RECORD, and BLOCK. RECORD-level compression has a low compression ratio; BLOCK compression is recommended.

RCFile: a combination of row and column storage. First, the data is divided into row groups, which guarantees that a record stays within one block and avoids having to read multiple blocks for one record. Second, within each row group the data is stored by column, which benefits compression.

Conclusion: compared with TextFile and SequenceFile, RCFile costs more at data-loading time because of its columnar layout, but it gives a better compression ratio and query response. Data warehouses are written once and read many times, so RCFile has a clear advantage over the other two formats.

22. What are the differences and uses of internal, external, partitioned, and bucketed tables in Hive?

Internal table: Data is stored in the Hive data warehouse directory. When deleting a table, metadata and actual table files are deleted.

External table: Data is not stored in the Hive data warehouse directory. When deleting a table, metadata is deleted but actual table files are not deleted.

Partitioned table: similar to the partitioning concept in an RDBMS, the table's data is split into multiple directories according to the partitioning rule. Queries that specify a partition can then be answered much faster.

Bucketed table: on top of a table or partition, records are placed into buckets according to the value of a column, one file per bucket, which effectively turns a large table into smaller ones. When a join is involved, buckets can be joined against buckets, which greatly reduces the amount of data joined and improves execution efficiency.

23. What does a Kafka message contain?

A Kafka message consists of a fixed-length header and a variable-length body. The header contains a one-byte magic (format version) and a four-byte CRC32 (used to verify that the body is intact). When magic is 1, there is an extra byte between magic and CRC32: attributes (storing things such as whether compression is enabled and the compression format); when magic is 0 there is no attributes byte.

The body is an N-byte payload containing the actual key/value message.

24. How do you get the current Kafka offset?

For versions 0.9 and above, the new consumer client can be used: consumer.seekToEnd() / consumer.position() give the current latest offset:
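
For example, a sketch assuming a Kafka 0.10+ Java client (where seekToEnd takes a collection); the broker address, topic, and partition are made up:

```java
// Sketch: read the latest offset of one partition with seekToEnd() and position().
import java.util.Collections;
import java.util.Properties;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.common.TopicPartition;

public class LatestOffsetExample {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "broker1:9092");
        props.put("group.id", "offset-checker");
        props.put("key.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");
        props.put("value.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");

        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
            TopicPartition tp = new TopicPartition("my-topic", 0);
            consumer.assign(Collections.singletonList(tp));     // manual assignment, no rebalance
            consumer.seekToEnd(Collections.singletonList(tp));  // jump to the log end
            long latestOffset = consumer.position(tp);          // offset of the next record to be written
            System.out.println("latest offset of my-topic-0 = " + latestOffset);
        }
    }
}
```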

25. The Hadoop shuffle process

Map-side shuffle

The Map side processes the input data and produces intermediate results, which are written to local disk instead of HDFS. The output of each Map is written to the memory buffer first. When the written data reaches a preset threshold, the system starts a thread to write the buffer data to the disk, a process called spill.

Before the spill is written, the data is sorted by the partition it belongs to, and within each partition it is sorted by key. The purpose of partitioning is to divide the records among different Reducers to achieve load balancing; each Reducer later reads its own partition. Then the combiner runs (if one is set). A combiner is essentially a local reducer: it performs a reduce over the data about to be written to disk, reducing the amount written. Finally the data is written to local disk as a spill file; spill files are stored in the directory specified by {mapred.local.dir} and are deleted after the Map task completes.

Finally, each Map task may generate multiple spill files. Before each Map task is completed, a multi-path merging algorithm is used to merge these spill files into one file. At this point, the Map shuffle process is complete.

Reduce-side shuffle

Shuffle on Reduce consists of three phases: Copy, sort(Merge), and Reduce.

First, the output files generated by the Map end should be copied to the Reduce end, but how does each Reducer know which data it should process? Because when the Map server performs partitions, it actually specifies all data to be processed by the Reducer (the partitions correspond to the Reducer), the Reducer only needs to copy data from its corresponding partitions. Each Reducer handles one or more partitions, but must first copy the data of its own partitions from the output of each Map.

Next comes the sort stage, also known as the merge stage, because the main job of this stage is to perform merge sort. Data copied from the Map end to the Reduce end is ordered, so it is suitable for merge sort. Eventually, a large file is generated on the Reduce side as input to Reduce.

Finally, there is the Reduce process, where the final output is produced and written to HDFS.

26. Spark cluster computing modes

Spark has several deployment modes, such as Standalone, on YARN, on Mesos, and on Cloud. Standalone mode is sufficient for most cases, and if the enterprise already has a YARN or Mesos environment, deploying on those is also convenient.

Standalone (cluster mode): the typical master/slave architecture, which also means the Master is a single point of failure; Spark supports ZooKeeper to implement HA.

On YARN (cluster mode) : Runs on the YARN resource manager framework. Yarn manages resources, and Spark schedules and calculates tasks

On MESOS (Cluster mode) : Runs on the MESOS resource manager framework. Mesos is responsible for resource management, and Spark is responsible for task scheduling and computing

On Cloud (cluster mode) : For example, AWS EC2, which makes it easy to access Amazon S3; Spark supports multiple distributed storage systems, such as HDFS and S3

27. The process of reading and writing data in HDFS

Read:

1. Communicate with Namenode to query metadata and find the Datanode server where the file block resides

2. Select a DataNode (nearest first, then random) and request that a socket stream be established

3. The DataNode starts sending data (reading it from disk into the stream, verifying it packet by packet)

4. The client receives the data packet by packet, caches it locally, and then writes it to the target file

Write:

1. The client sends a request to the NameNode to upload a file; the NameNode checks whether the target file already exists and whether the parent directory exists

2. Namenode returns whether the file can be uploaded

3. The client asks which DataNode servers the first block should be transferred to

4. The NameNode returns three DataNode servers: A, B, and C

5. The client asks one of the three DataNodes, A, to receive the data (essentially an RPC call that establishes a pipeline); A then calls B, and B calls C, completing the real pipeline, and the acknowledgement is returned to the client step by step

6. The client starts uploading the first block to A (reading data from disk into a local memory cache); each packet A passes on is placed in a reply queue to wait for acknowledgement

7. When one block has finished transferring, the client asks the NameNode again for the DataNodes to receive the next block.
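
From the client's point of view, the read and write flows above reduce to the FileSystem API; below is a minimal sketch in which the NameNode URI and the file path are assumptions:

```java
// Sketch: HDFS write (NameNode picks the DataNode pipeline) and read (client reads
// from the nearest DataNode holding the block), using the FileSystem API.
import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.nio.charset.StandardCharsets;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsReadWrite {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        conf.set("fs.defaultFS", "hdfs://namenode:8020");   // assumed NameNode address
        try (FileSystem fs = FileSystem.get(conf)) {
            Path file = new Path("/tmp/demo.txt");

            // Write: the client only streams bytes; block placement is decided by the NameNode.
            try (FSDataOutputStream out = fs.create(file, true)) {
                out.write("hello hdfs\n".getBytes(StandardCharsets.UTF_8));
            }

            // Read: the NameNode returns block locations; data comes from a DataNode.
            try (FSDataInputStream in = fs.open(file);
                 BufferedReader reader =
                         new BufferedReader(new InputStreamReader(in, StandardCharsets.UTF_8))) {
                System.out.println(reader.readLine());
            }
        }
    }
}
```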

28. Which is better for RDDs, reduceByKey or groupByKey? Why?

ReduceByKey: reduceByKey will merge each Mapper locally before sending results to reducer, similar to combiner in MapReduce. In this way, after a reduce operation is performed on the Map end, the amount of data is greatly reduced, reducing transmission and ensuring faster result calculation on the Reduce end.

GroupByKey: groupByKey gathers all the values for each key into an Iterator. This grouping happens on the reduce side, so all the data is transmitted over the network, which is wasteful; with very large data volumes it may also cause an OutOfMemoryError.

Based on the above comparison, it can be found that reduceByKey is recommended when performing reduce operations with a large amount of data. Not only does this increase speed, but it also prevents memory overruns caused by using groupByKey.
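
A small sketch with the Spark 2.x Java API: a word count written with reduceByKey, where values are combined on the map side before the shuffle. The input path and local master setting are illustrative.

```java
// Sketch: reduceByKey combines per-partition before shuffling, unlike groupByKey.
import java.util.Arrays;
import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaPairRDD;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;
import scala.Tuple2;

public class ReduceByKeyExample {
    public static void main(String[] args) {
        SparkConf conf = new SparkConf().setAppName("reduceByKey-demo").setMaster("local[*]");
        try (JavaSparkContext sc = new JavaSparkContext(conf)) {
            JavaRDD<String> lines = sc.textFile("hdfs:///tmp/words.txt");
            JavaPairRDD<String, Integer> pairs = lines
                    .flatMap(line -> Arrays.asList(line.split("\\s+")).iterator())
                    .mapToPair(word -> new Tuple2<>(word, 1));

            // Map-side combine happens here, so far less data crosses the network
            // than with groupByKey(...).mapValues(...).
            JavaPairRDD<String, Integer> counts = pairs.reduceByKey(Integer::sum);
            counts.take(10).forEach(t -> System.out.println(t._1() + " -> " + t._2()));
        }
    }
}
```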

29. How does Spark SQL obtain the difference set of two datasets?

It does not seem to be supported directly.

30. Your understanding of Spark 2.0

Simpler: ANSI SQL and more reasonable APIs

Faster: Use Spark as the compiler

More intelligent: Structured Streaming

31. How are wide and narrow dependencies distinguished in RDDs?

Wide dependency: a partition of the parent RDD is used by multiple partitions of the child RDD. Operations such as groupByKey, reduceByKey, and sortByKey produce wide dependencies and cause a shuffle.

Narrow dependency: each partition of the parent RDD is used by at most one partition of the child RDD. Operations such as map, filter, and union produce narrow dependencies.

Spark Streaming: Two ways to read Kafka data

The two methods are:

Receiver-based

This is done using Kafka’s high-level Consumer API. The data the Receiver gets from Kafka is stored in the Spark Executor’s memory, and the Spark Streaming started job processes that data. However, with the default configuration, this approach can result in data loss due to underlying failures. To enable high reliability and zero data loss, Spark Streaming must enable Write Ahead Log (WAL). This mechanism synchronously writes the received Kafka data to a write-ahead log on a distributed file system such as HDFS. So, even if the underlying node fails, you can use data from the pre-write log to recover.

Direct

In Spark 1.3, the Direct approach was introduced to replace the Receiver for receiving data. It periodically queries Kafka to obtain the latest offset of each topic+partition and defines the offset range of each batch. When the data-processing job starts, Kafka's simple consumer API is used to fetch the data in the specified offset ranges.

32. Is Kafka data kept in memory or on disk?

At the heart of Kafka is the use of the disk, not memory. Everyone assumes memory is faster than disk, and I was no exception; but after studying Kafka's design, reviewing the data, and running my own tests, I found that sequential disk reads and writes can match memory speed.

Linux also optimizes disk reads and writes, with read-ahead, write-behind, and the page cache. Java objects carry a high memory overhead, and Java GC times become very long as the heap grows, so using the disk has several advantages:

The disk cache is maintained by the Linux system, which saves a lot of programmer work.

The sequential disk read/write speed exceeds the random memory read/write speed.

The JVM’s GC is inefficient and has a large memory footprint. Using disks can avoid this problem.

The disk cache is still available after the system is cold booted.

33. How to prevent Kafka data loss

The producer side:

At a high level, to ensure data reliability and safety, you must back up data according to the number of partitions by setting an appropriate number of replicas.

The broker side:

Give each topic multiple partitions, sized to the machines that host them; the number of partitions should be greater than the number of brokers so that the partitions are spread evenly across the brokers.

A partition is Kafka's unit of parallel reads and writes and is the key to Kafka's speed.

The consumer side:

The typical cause of message loss on the consumer side is simple: if the offset is committed before message processing is complete, data can be lost. Since the Kafka consumer automatically commits offsets by default, it is important that the message has been fully processed before the offset is committed in the background. Heavy processing logic is therefore not recommended; if processing takes a long time, it is better to put that logic in another thread. To avoid data loss, two suggestions:

enable.auto.commit=false — disable automatic offset commits

Manually commit the offset after the message has been fully processed
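
A sketch of those two suggestions, assuming a Kafka 2.x Java client (poll(Duration) and commitSync()); the broker, group, and topic names are made up, and process() stands in for the real business logic:

```java
// Sketch: disable auto-commit and commit offsets only after processing succeeds (at-least-once).
import java.time.Duration;
import java.util.Collections;
import java.util.Properties;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;

public class AtLeastOnceConsumer {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "broker1:9092");
        props.put("group.id", "order-processor");
        props.put("enable.auto.commit", "false");            // never commit before processing
        props.put("key.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");
        props.put("value.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");

        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
            consumer.subscribe(Collections.singletonList("orders"));
            while (true) {
                ConsumerRecords<String, String> records = consumer.poll(Duration.ofMillis(500));
                for (ConsumerRecord<String, String> record : records) {
                    process(record);                          // business logic first
                }
                consumer.commitSync();                        // commit only after processing succeeds
            }
        }
    }

    private static void process(ConsumerRecord<String, String> record) {
        System.out.printf("offset=%d value=%s%n", record.offset(), record.value());
    }
}
```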

34. Given two files a and b, each containing 5 billion URLs, each URL occupying 64 bytes, with a memory limit of 4 GB, how do you find the URLs common to a and b?

Each file can be estimated at 5G × 64 bytes = 320 GB, far larger than the 4 GB memory limit, so the files cannot be loaded into memory for processing; a divide-and-conquer approach can be used instead.

Step1: iterate over file a, get hash(url)%1000 for each url, and store urls into 1000 small files (denoted as a0,a1… ,a999, each small file is about 300M);

Step2: traverse file b and store urls into 1000 small files in the same way as a (denoted as b0,b1… ,b999);

The neat thing: after this processing, any URLs that could possibly be equal end up in the corresponding pair of small files (a0 vs b0, a1 vs b1, …, a999 vs b999); different pairs cannot contain the same URL. So we only need to find the common URLs within each of the 1000 pairs of small files.

Step 3: to find the common URLs in a pair of small files ai and bi, store the URLs of ai in a hash_set/hash_map, then iterate over each URL of bi and check whether it is in the hash_set just built; if it is, it is a common URL, and we write it to an output file.

(Sketch: decompose a on the left, decompose b on the right, find the common URLs in the middle.)

35. There is a 1 GB file in which each line contains a word no longer than 16 bytes; the memory limit is 1 MB. Return the 100 most frequent words.

Step1: read the file sequentially, hash(x)%5000 for each word x, then save the value to 5000 small files (denoted f0,f1… ,f4999), so each file is about 200K. If some of the files are more than 1M in size, you can continue to divide them in a similar way until the size of the decomposed small files is less than 1M;

Step2: For each small file, count the words appearing in each file and the corresponding frequency (trie tree /hash_map can be used), and take out the 100 words with the largest frequency (the minimum heap containing 100 nodes can be used), and store 100 words and the corresponding frequency into the file, so as to obtain another 5000 files;

Step3: Merge the 5000 files (similar to merge sort);

(Sketch: partition the big problem, solve the small problems, merge.)

36. Massive log data is stored in one very large file that cannot be read directly into memory. Extract the IP address that visited Baidu the most times in one day.

Step 1: extract the IPs that accessed Baidu that day from the log data and write them, one by one, into a large file.

Step 2: note that an IP address is 32 bits, so there are at most 2^32 distinct IPs. Again a mapping method, such as taking the value modulo 1000, can map the whole large file into 1000 small files.

Step 3: find the most frequent IP in each small file (a hash_map can be used for the frequency statistics) together with its frequency.

Step 4: among these 1000 most-frequent IPs, find the one with the highest frequency; that is the answer.

37. What are the disadvantages of LVS compared with HAProxy?

I had used LVS to load-balance a MySQL cluster as well as HAProxy, but had never tried to really understand the two. This question came up in an interview, and the answer the interviewer gave was that LVS configuration is quite complex. Afterwards I looked up the relevant material and gained a better understanding of both load-balancing solutions. LVS load-balancing performance reaches about 60% of the hardware load balancer F5, while HAProxy and Nginx load balancing reach roughly 10% of hardware load balancing. So the complex configuration does bring a correspondingly clear benefit. While digging through the material I also tried to understand the 10 scheduling algorithms of LVS. Although 10 sounds like a lot, the differences among some of them are small. Of the 10 scheduling algorithms, four are static and six are dynamic.

Static scheduling algorithm:

①RR polling scheduling algorithm

This scheduling algorithm does not consider the state of the servers, so it is stateless, and it does not consider their performance either. For example, with servers 1 to N and N requests, the first request goes to the first server, the second request to the second server, and the Nth request to the Nth server, and so on.

② Weighted polling

This scheduling algorithm takes server performance into account, and you can weight requests based on the performance of different servers.

③ Hash based on the destination address

This scheduling algorithm, like hashing on the source address, is used to maintain sessions. Hashing on the destination address remembers the destination of a request and sends requests for the same destination to the same server; in short, requests for a given destination address always go to the same server. Hashing on the source address means that all requests from the same source address are sent to the same server.

④ Hash based on the source address

Covered above; not repeated here.

Dynamic scheduling

① Minimum connection scheduling algorithm

This scheduling algorithm records the number of connections established on the server responding to the request, increases the number of connections established on the server by one for each received request, and allocates the incoming request to the machine with the least number of connections.

② Weighted least connection scheduling algorithm

This scheduling algorithm adds server performance on top of the least-connection algorithm. That is reasonable: if the servers have the same specification, then the more connections a server has established, the higher its load, so scheduling purely by least connections achieves sensible load balancing. But what if the servers differ in capacity? Suppose one server can handle at most 10 connections and currently has 3 established, while another can handle at most 1000 and currently has 5. By the plain least-connection algorithm the first server would be chosen, yet it already has 30% of its capacity in use while the second has not even used 1%. Does that make sense? Obviously not, and that is why adding weights makes sense. The corresponding formula is fairly simple: active*256/weight.

③ Shortest expected scheduling algorithm

This algorithm avoids a special case of the weighted least-connection algorithm above, in which the scheduler cannot tell the servers apart even though weights are set. For example:

Suppose there are three servers A, B, and C, whose current connection counts are 1, 2, and 3 and whose weights are 1, 2, and 3. Under the weighted least-connection algorithm the scores are:

A: 1*256/1 = 256

B: 2*256/2 = 256

C: 3*256/3 = 256

We find that A, B, and C come out the same after the calculation, even with weights, so the scheduler would pick one of A, B, and C with no way to tell them apart.

The shortest-expected-delay algorithm improves the active*256/weight formula to (active+1)*256/weight.

So, for the same example:

A: (1+1)*256/1 = 2*256 = 512

B: (2+1)*256/2 = 1.5*256 = 384

C: (3+1)*256/3 ≈ 1.33*256 ≈ 341

Clearly C is chosen.

④ Never queue algorithm

Send the request to the server with the current number of connections 0.

⑤ Locality-based least connection scheduling algorithm

This scheduling algorithm is used in cache systems and maintains a mapping from a request to a server. Think about it for a moment: considering only the server's state and performance ignores the fact that a request is better served by the server that has handled it before and already has the relevant data cached, rather than by some idle server that has never seen it. So the locality-based least-connection algorithm works like this: when a request arrives, if the server it maps to is not overloaded, the request is handed to that "old partner". If that server does not exist, or it is overloaded while some other server is working at only half load, a server is chosen from the rest of the cluster by the least-connection rule and the request is assigned to it.

⑥ Locality-based least connection scheduling with replication

This algorithm is also used in cache systems, but it maintains a mapping from a request to a group of servers rather than to a single server. When a new request arrives, a server is chosen from the mapped group by the least-connection rule; if it is not overloaded, it handles the request. If it is overloaded, a server from outside the group is added to the group, again chosen from the cluster by the least-connection rule. If the server group has not been modified for a period of time, the busiest server is removed from the group.

38. How is Sqoop to use?

Sqoop1 and Sqoop2 are quite different in architecture. Sqoop2 has improved significantly in terms of data type, security permissions and password exposure. At the same time, compared with other heterogeneous data synchronization tools, such as Taobao DataX or Kettle, Sqoop is quite good both in terms of data import efficiency and in terms of the richness of the plugins supported.

39. What are the roles in ZooKeeper, and how does ZooKeeper work?

Sure enough, memory has a decay curve. When the interviewer asked this, I could only name two of the roles (leader and follower), and the working principle had become so vague that I had forgotten it. ZooKeeper actually involves four roles: leader, follower, observer, and client. The leader is mainly responsible for decisions and scheduling. The only difference between a follower and an observer is that the observer does not take part in writes/voting; both forward client requests to the leader. The observer exists to cope with situations where the voting pressure would otherwise be too high. The client is what initiates requests. The distributed consistency work in ZooKeeper includes electing the leader, a bit like whoever obtains the artifact becomes the chieftain, or whoever holds the imperial seal becomes the emperor: each node generates an ID under the configured node path according to the file you configured, then uses the getChildren() function to read the IDs generated under that path, and the one with the smallest ID becomes the leader. If the leader dies or steps down, the node with the next smallest ID takes over. When configuring the zookeeper file, you write entries of the form server.x=AAAA:BBBB:CCCC, where x is the node number, AAAA is that zookeeper machine's IP address, BBBB is the port used for follower-to-leader communication, and CCCC is the port used to re-elect the leader.

40. What is the difference between HBase Insert and Update?

This question comes from a recent project in which the three operations used to interact with HBase were insert, delete, and update. Since it was an integration project, the partner and I agreed not to fold update into insert; but as far as that project itself goes, overwriting via insert amounts to an indirect update. In essence, and in terms of the underlying put that gets called, there is no real difference. That is only from the perspective of that project's code; at the HBase shell level, when data with the same rowkey is inserted, each write carries its own timestamp, and the maximum number of versions kept can be set in the configuration file.

41. Briefly describe how big data results are presented.

1) Reports

Data reports based on data mining, including data tables, matrices, graphs and reports in custom formats, are easy to use and flexible in design.

2) Graphical presentation

Provide curve, pie chart, accumulation chart, dashboard, fish bone analysis chart and other graphical forms to display the distribution of model data in a macro way, so as to facilitate decision-making.

3) KPIs

Provide tabular KPI lists and customizable ways of viewing performance, such as data tables or charts, so that managers can quickly evaluate progress against measurable goals.

4) Query and display

Based on the data query conditions and query content, the system summarizes query results in data tables, provides detailed query functions, and enables users to drill up, drill down, and rotate data tables.

42. Examples of big data around us.

i. QQ, Weibo, and other social software

ii. Data generated by e-commerce sites such as Tmall and JD

iii. All kinds of data on the Internet

43. Briefly introduce big data management methods.

A: Data such as images, videos, URLs, and geographic locations is hard to describe in the traditional structured way, so a column-oriented data management system built on multidimensional tables is needed to organize and manage it. That is, the data is sorted by row and stored by column, with data of the same field grouped into a column family. Different column families correspond to different attributes of the data and can be added dynamically as needed. A distributed real-time column database of this kind stores and manages the data in a uniform structured way and avoids the join queries of traditional data storage.

44. What is big data?

A: Big data is data whose content cannot be captured, managed, and processed with conventional software tools within an acceptable time.

45. From massive log data, extract the IP address that accessed Baidu the most times in one day.

First, extract the IPs that accessed Baidu that day from the logs and write them one by one into a large file. Note that an IP address is 32 bits, so there are at most 2^32 distinct IPs. As before, a mapping method such as taking the value modulo 1000 can map the whole large file into 1000 small files; then find the most frequent IP in each small file (a hash_map can be used for the frequency statistics) together with its frequency. Finally, among those 1000 candidate IPs, find the one with the highest frequency; that is the answer.

Or as follows (by "Eagle of the Snow"):

Algorithm idea: divide and conquer +Hash

1) There are at most 2^32 ≈ 4G IP addresses, so they cannot all be loaded into memory for processing;

2) Use the idea of divide and conquer: store the massive IP log entries into 1024 small files according to Hash(IP)%1024, so that each small file contains at most about 4M distinct IP addresses;

3) For each small file, build a hash map with the IP as the key and the number of occurrences as the value, and record the IP that appears most frequently;

4) This yields the most frequent IP of each of the 1024 small files; a conventional sort (or scan) over these candidates then gives the overall most frequent IP.

46. The search engine records all search strings used by the user in each search through a log file. Each query string is 1-255 bytes in length.

Suppose there are 10 million records (the query strings are highly repetitive: the total is 10 million, but no more than 3 million once duplicates are removed; the more often a query string repeats, the more users issued it and thus the more popular it is). Count the 10 most popular query strings, using no more than 1 GB of memory.

This is the typical Top-K problem. The final algorithm is as follows:

Step 1: preprocess the massive data, completing the frequency statistics with a hash table in O(N) time.

Step 2: using a heap data structure, find the Top K; the time complexity is N'·logK.

That is, with a heap we can look up and adjust/move in O(logK) time. So we maintain a min-heap of size K (10 in this problem) and iterate over the 3 million distinct queries, comparing each with the root element. The final time complexity is O(N) + N'·O(logK), where N = 10 million and N' = 3 million.

Alternatively, use a trie in which each node's keyword field stores the number of occurrences of that query string (0 if it never appears), and finally use a min-heap of 10 elements to pick out the queries with the highest frequencies.
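
A small sketch of the min-heap step, assuming the frequency statistics from step 1 are already available as an in-memory map (the map and method names are illustrative):

```java
// Sketch: keep the K most frequent queries with a min-heap; the root is always the
// smallest count among the current top K, so anything larger evicts it.
import java.util.Map;
import java.util.PriorityQueue;

public class TopK {
    public static PriorityQueue<Map.Entry<String, Long>> topK(Map<String, Long> counts, int k) {
        PriorityQueue<Map.Entry<String, Long>> heap =
                new PriorityQueue<>(k, (a, b) -> Long.compare(a.getValue(), b.getValue()));
        for (Map.Entry<String, Long> e : counts.entrySet()) {
            if (heap.size() < k) {
                heap.offer(e);
            } else if (e.getValue() > heap.peek().getValue()) {
                heap.poll();        // evict the current minimum of the top K
                heap.offer(e);
            }
        }
        return heap;                // K most frequent queries, O(N log K) overall
    }
}
```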

47. There is a 1 GB file in which each line is a word no longer than 16 bytes; the memory limit is 1 MB. Return the 100 most frequent words.

For each word x, compute hash(x)%5000 and write the word into one of 5000 small files (x0, x1, …, x4999) accordingly; each file is then about 200 KB.

If some of the files are larger than 1 MB, you can continue to divide them in a similar way until no smaller files are larger than 1 MB.

For each small file, count the words that appear in each file and their frequency (trie tree /hash_map, etc.), take the 100 words that occur most frequently (a minimum heap of 100 nodes can be used), and store the 100 words and their frequency into the file, thus generating another 5000 files. The next step is to merge (sort and merge) the 5000 files.

48. There are 10 files of 1 GB each; every line of every file is a user query, and the queries of each file may repeat. Sort the queries by frequency.

Again a typical Top-K style problem. The solutions are as follows:

Plan 1:

Read the 10 files sequentially and write each query into one of another 10 files (denoted …) according to hash(query)%10. Each newly generated file is also about 1 GB (assuming the hash function is random).

For each of these files, use a hash_map(query, query_count) to count the occurrences of each query, sort by count using quick/heap/merge sort, and output the sorted queries and their query_count to a file. This yields 10 sorted files (denoted …).

Merge sort (a combination of inner sort and outer sort) on the 10 files.

Scheme 2:

The total number of queries is limited, but the number of repetitions is quite large. It is possible to add all queries to memory at once. We can use trie tree /hash_map to count occurrences of each query directly, and then do a quick/heap/merge sort by occurrences.

Solution 3:

Similar to scheme 1, but after the hash is done, multiple files can be distributed to multiple files for processing, and a distributed architecture (such as MapReduce) can be used for processing, and then merged.

49. Given two files a and b, each containing 5 billion URLs, each 64 bytes long, with a memory limit of 4 GB, find the URLs common to a and b.

Scheme 1: each file can be estimated at 5G × 64 = 320 GB, far larger than the 4 GB memory limit, so it is impossible to load everything into memory for processing. Consider a divide-and-conquer approach.

Iterate over file A, get hash(URL)%1000 for each URL, and store each URL into 1000 small files based on the value obtained (denoted a0, A1… A999). So each small file is about 300M.

Iterate through file B and store urls into 1000 small files in the same way as A (denoted as B0, B1… , b999). After this processing, all possible urls are in the corresponding small file (a0VSB0,a1vsb1,… ,a999vsb999), different small files cannot have the same URL. Then we just want to find the same URL in 1000 pairs of small files.

To find the same URL in each pair of small files, you can store the URL of one of the small files in a hash_set. We then iterate over each URL of the other small file to see if it’s in the hash_set we just built, and if it is, it’s a common URL to save to the file.

Option 2: if a certain error rate is allowed, a Bloom filter can be used; 4 GB of memory can represent roughly 34 billion bits. Map the URLs of one file onto these 34 billion bits with the Bloom filter, then read the URLs of the other file one by one and test each against the Bloom filter; any URL that tests positive is probably a common URL (note that there is a false-positive rate).

Bloom filters will be discussed in detail in a later post on this blog.

50. Find the non-repeating integers in the 250 million integers. Note that memory is not enough to hold the 250 million integers.

Scheme 1: use a 2-bitmap (2 bits per number: 00 means absent, 01 means seen once, 10 means seen multiple times, 11 is unused). The total memory needed is 2^32 × 2 bits = 1 GB, which is acceptable. Scan the 250 million integers and update the corresponding bits in the bitmap: 00 becomes 01, 01 becomes 10, and 10 stays the same. When the scan is done, walk the bitmap and output every integer whose bits are 01.

Scheme 2: As in earlier questions, split the data into small files, find the non-repeating integers within each small file and sort them, then merge the results while removing duplicates.

51. Given 4 billion unsorted unsigned int integers and then one more number, how do you quickly determine whether that number is among the 4 billion?

Similar to problem 6 above, my first response was quicksort + binary search. Here are some better ways:

Scheme 1: Request 512 MB of memory and let one bit represent each unsigned int value. Read the 4 billion numbers and set the corresponding bits. Then read the number to be queried and check whether its bit is 1: 1 means the number exists, 0 means it does not.

Scheme 2 (credited to dizengrong): this problem is described nicely in "Programming Pearls"; the idea is as follows:

And since 2^32 is more than 4 billion, a given number may or may not be in it;

So we’re going to represent each of these four billion numbers in 32-bit binary

Let’s say those four billion numbers start out in a file.

Then divide those four billion numbers into two categories:

1. The highest bit is 0

2. The highest bit is 1

Write the two classes into two separate files; one contains at most 2 billion numbers and the other at least 2 billion (so the data is roughly cut in half).

Compare the highest bit of the number we are looking for, and continue the search in the corresponding file.

Then divide the file into two categories:

1. The second highest bit is 0

2. The second highest bit is 1

Write the two classes into two files, one of which has <=1 billion and the other >=1 billion (this is half).

Compare with the next highest bit of the number to be searched and then enter the corresponding file to search again.

And so on, until after at most 32 rounds (one per bit) the remaining file is small enough to check directly.

Note: Here, a brief introduction to bitmap methods:

Use bitmap to determine if there are duplicates in an integer array

Determining the existence of duplication in a set is one of the common programming tasks. When the set has a large amount of data, we usually want to perform fewer scans, and then the double loop method is not desirable.

The bitmap method suits this situation. It creates a new array of length max + 1 (where max is the largest element in the set), then scans the original array once: when a number x is encountered, position x of the new array is set to 1. If, say, the value 5 is encountered again later, we find that the sixth element (index 5) of the new array is already 1, which means the value repeats an earlier one. The method is called the bitmap method because of its similarity to bitmap processing. At worst it takes 2N operations. If the maximum value is known in advance, the new array's length can be set ahead of time, roughly doubling efficiency.
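A minimal Java sketch of the bitmap duplicate check just described, using BitSet as the bit array:

// Mark each value's slot; report a duplicate as soon as a slot is already marked.
// Assumes non-negative ints no larger than max.
public static boolean hasDuplicate(int[] data, int max) {
    java.util.BitSet seen = new java.util.BitSet(max + 1);
    for (int x : data) {
        if (seen.get(x)) {
            return true;      // x was marked before, so it is repeated
        }
        seen.set(x);
    }
    return false;             // one pass, at most 2N bit operations
}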

52. How do you find the item that is repeated the most times in a mass of data?

Scheme 1: First hash the data and map each item to a small file by the hash modulus; find the most frequently repeated item in each small file and record its count, then find the most frequent among those candidates (refer to the previous problem).

53. Tens of millions or hundreds of millions of data items (with duplicates): find the N items that appear most frequently.

Scheme 1: Tens of millions or hundreds of millions of items should fit in the memory of a modern machine, so consider counting occurrences with a hash_map / binary search tree / red-black tree, etc. Then retrieve the top N items by count, which can be done with the heap mechanism mentioned in question 2.

54. A text file has about ten thousand lines, one word per line. Count the 10 most frequent words; give the idea and the time complexity analysis.

Scheme 1: Considering time efficiency, use a trie to count the occurrences of each word, which takes O(n·le) time (le is the average word length). Then find the 10 most frequent words, which can be done with a heap as in the previous problem in O(n·lg10). The total time is the larger of O(n·le) and O(n·lg10).

55. Find the largest 100 of 1,000,000 numbers.

Scheme 1: As mentioned in the previous problem, use a min-heap of 100 elements. The complexity is O(1,000,000 × lg100).

Scheme 2: Use the quicksort (quick-select) idea: after each partition, keep only the part larger than the pivot; once that part contains only slightly more than 100 elements, sort it with a conventional algorithm and take the first 100. The complexity is O(1,000,000 × 100).

Scheme 3: Use local elimination. Take the first 100 elements and sort them; call this sequence L. Then scan the remaining elements one by one: compare each element x with the smallest element of L; if x is larger, remove that smallest element and insert x into L using insertion sort. Loop until all elements have been scanned. The complexity is O(1,000,000 × 100).
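A minimal Java sketch of the min-heap approach from Scheme 1 (class and method names are illustrative):

import java.util.PriorityQueue;

// Keep a heap of the 100 largest values seen so far; each new value
// only has to beat the current minimum of that heap.
public class Top100 {
    public static int[] top100(int[] numbers) {
        PriorityQueue<Integer> heap = new PriorityQueue<>(100);   // natural order = min-heap
        for (int x : numbers) {
            if (heap.size() < 100) {
                heap.offer(x);
            } else if (x > heap.peek()) {
                heap.poll();          // drop the smallest of the current top 100
                heap.offer(x);
            }
        }
        // the 100 largest values, in no particular order
        return heap.stream().mapToInt(Integer::intValue).toArray();
    }
}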

56. Summary of methods for processing massive data

OK, after so many interview questions you may be a little dizzy. Yes, a summary is needed. Below is a brief overview of common approaches to massive-data problems, which are described in more detail elsewhere on this blog.

57. Bloom filter

Scope of application: implementing a data dictionary, judging whether data is duplicated, or computing set intersections

Basic principles and key points:

The principle is simple: a bit array plus k independent hash functions. To insert a keyword, set the bits chosen by the hash functions to 1; to query, check whether all k bits are 1, keeping in mind that a positive answer is not 100% certain. Deleting an inserted keyword is not supported either, because its bits may be shared by other keywords. A simple improvement, the counting Bloom filter, replaces the bit array with an array of counters and thereby supports deletion.

There is another important question: how to determine the bit-array size m and the number of hash functions k from the number of input elements n. The error rate is minimized when k = (ln2)·(m/n). To keep the error rate no greater than E, m must be at least n·lg(1/E) to represent any set of n elements. In practice m should be even larger, to keep at least half of the bit array zero: m should be >= n·lg(1/E)·lg(e), roughly 1.44 times n·lg(1/E) (lg is the base-2 logarithm).

For example, assuming an error rate of 0.01, m should be about 13 times as large as N. So k is about eight.

Note that m and n use different units: m is measured in bits, while n is the number of (distinct) elements. Since a single element usually occupies many bits, using a Bloom filter is normally a memory saving.
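To make the sizing rule concrete, here is a minimal, illustrative Bloom filter sketch in Java; the double-hashing trick used to derive the k hash functions is an assumption of this sketch, not something prescribed above:

import java.util.BitSet;

// For a 1% error rate, m is roughly 13n bits and k is about 8 (see above).
public class SimpleBloomFilter {
    private final BitSet bits;
    private final int m;      // number of bits
    private final int k;      // number of hash functions

    public SimpleBloomFilter(int expectedElements) {
        this.m = expectedElements * 13;
        this.k = 8;
        this.bits = new BitSet(m);
    }

    private int index(String s, int i) {
        int h1 = s.hashCode();
        int h2 = h1 >>> 16;                        // derive a second hash from the first
        return Math.abs((h1 + i * h2) % m);
    }

    public void add(String s) {
        for (int i = 0; i < k; i++) bits.set(index(s, i));
    }

    public boolean mightContain(String s) {
        for (int i = 0; i < k; i++) {
            if (!bits.get(index(s, i))) return false;   // definitely absent
        }
        return true;                                     // possibly present (false positives allowed)
    }
}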

Extension:

A Bloom filter maps the elements of a set into a bit array and uses k bits (k being the number of hash functions) to indicate whether an element is in the set. A counting Bloom filter (CBF) extends each bit into a counter and thereby supports deletion. A spectral Bloom filter (SBF) associates the counters with the occurrence counts of set elements; SBF uses the minimum value among an element's counters to approximate its frequency.

Problem example: I give you two files A and B, each containing 5 billion urls. Each URL occupies 64 bytes and the memory limit is 4 gb. Let you find the common URL of A and B files. What about three or even n files?

Based on this question: with an error rate of 0.01 we would need about 65 billion bits, while we have roughly 34 billion available, so the error rate will rise somewhat. In addition, if the URLs map one-to-one to IPs, they can be converted to IPs first, which is much simpler.

58. Hashing

Scope of application: a basic data structure for quick lookup and deletion; usually requires that the total amount of data fit in memory

Basic Principles and Key Points:

Choose the hash function according to the data: strings, integers, and permutations each have their corresponding hashing methods.

Collision handling: one approach is open hashing, also known as chaining (the "zipper" method); the other is closed hashing, also known as open addressing.

Extension:

d-left hashing: the "d" means multiple; let's simplify and look at 2-left hashing. 2-left hashing divides a hash table into two halves of equal length, T1 and T2, and gives them separate hash functions h1 and h2. When storing a new key, both hash functions are computed, giving two addresses h1[key] and h2[key]. We then check which of position h1[key] in T1 and position h2[key] in T2 already holds more (collided) keys, and store the new key in the less loaded position. If both sides are equally loaded (for example, both empty or both holding one key), the new key is stored in the left table T1, which is where "2-left" comes from. When looking up a key, both positions must be hashed and checked.

Examples of problems:

1). Massive log data, extract the IP that visits Baidu the most times in a certain day.

The number of IP addresses is still limited, up to 2^32, so consider using hash to store IP addresses directly into memory for statistics.

59. Bitmap

Scope of application: quickly finding, deduplicating, and deleting data; generally the data range is within about 10 times the int range

The basics and essentials: Use an array of bits to indicate the presence or absence of certain elements, such as 8-digit telephone numbers

Extension: BloomFilter can be seen as an extension of bitmap

Examples of problems:

1) A file contains some telephone numbers, each number is 8 digits, and the number of different numbers is counted.

8 digits go up to 99,999,999, so about 100 million bits are needed, which is a little over 10 MB (roughly 12 MB) of memory.

2) Find the number of non-repeating integers in the 250 million integers. There is not enough memory space for these 250 million integers.

Extend the bitmap by using 2 bits per number: 00 means no occurrence, 01 one occurrence, and 10 two or more occurrences. Alternatively, instead of 2-bit fields, two separate ordinary bitmaps can simulate the 2-bitmap.

60. Heap

Scope of application: finding the top n of a massive data set where n is small, so the heap fits in memory

Basic principles and key points: use a max-heap for the n smallest elements and a min-heap for the n largest. For example, to find the n smallest, compare each element with the largest element in the max-heap; if it is smaller, replace that largest element. The n elements left at the end are the n smallest. This works well when the data volume is large but n is small, because all top-n elements can be obtained in a single scan.

Extension: Double heap, a maximum heap combined with a minimum heap, can be used to maintain the median.

Examples of problems:

1) Find the largest number of 100w numbers.

Use a minimum heap of 100 elements.

61. Double-bucket partitioning: in essence the idea of divide and conquer, with the emphasis on how to "divide"!

Scope of application: the KTH largest, median, non-repeating or repeating number

Basic principles and key points: because the range of elements is too large for a direct-address table, the target range is narrowed gradually through multiple rounds of partitioning until it becomes acceptable. The range can be reduced several times; two levels ("double bucket") is just one example.

Extension:

Examples of problems:

1). Find the number of non-repeating integers in 250 million integers. There is not enough memory space for these 250 million integers.

This is a bit like the pigeonhole principle. There are 2^32 possible integers, so we can divide them into 2^8 regions (for example, one file per region), distribute the data into the regions, and then solve each region directly with a bitmap. As long as there is enough disk space, this is easy to do.

2). Find the median of 500 million ints.

This example is even more obvious than the previous one. First divide the int range into 2^16 regions, then read the data and count how many numbers fall into each region. From these counts we can determine which region contains the median and what rank the median has within that region. In a second scan we only count the numbers that fall into that region.

In fact, if the values were int64 rather than int, three such rounds of partitioning would reduce the problem to an acceptable size: divide int64 into 2^24 regions, determine which region holds the target, divide that region into 2^20 subregions, determine which subregion holds the target, and then the subregion contains only 2^20 values, which can be counted directly with a direct-address table.
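A minimal Java sketch of the first pass of this bucketing idea (it assumes the 500 million ints arrive as an int[] for simplicity; in practice they would be streamed from disk):

public class MedianByBuckets {
    // Count how many values fall into each of the 2^16 buckets keyed by the high
    // 16 bits, and report which bucket holds the median. A second scan over just
    // that bucket (at most 2^16 distinct low values) finishes the job.
    public static int medianBucket(int[] data) {
        long[] bucketCounts = new long[1 << 16];
        for (int x : data) {
            int u = x ^ 0x80000000;              // flip sign bit: signed order -> unsigned order
            bucketCounts[u >>> 16]++;
        }
        long target = (data.length + 1L) / 2;    // rank of the median
        long seen = 0;
        for (int b = 0; b < bucketCounts.length; b++) {
            if (seen + bucketCounts[b] >= target) {
                return b;                        // median lies in bucket b, at rank (target - seen) within it
            }
            seen += bucketCounts[b];
        }
        return -1;                               // unreachable for non-empty input
    }
}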

62. Database indexes

Scope of application: insert, delete, update, and query operations on large amounts of data

Basic principles and key points: use database design and implementation techniques to handle the insertion, deletion, update, and query of massive data.

63. Inverted index

Scope of application: search engine, keyword query

Basic principles and key points: why is it called an inverted index? It is an indexing method that, for full-text search, stores a mapping from each word to its locations in a document or group of documents.

In English, for example, here is the text to be indexed:

T0 = "it is what it is"

T1 = "what is it"

T2 = "it is a banana"

We get the following reverse file index:

“A” : {2}

“Banana” : {2}

"Is" : {0, 1, 2}

"It" : {0, 1, 2}

“What” : {0, 1}

A search for "what", "is", and "it" corresponds to the intersection of these sets: {0, 1} ∩ {0, 1, 2} ∩ {0, 1, 2} = {0, 1}.

A forward index, by contrast, stores the list of words for each document. Forward-index queries suit ordered, frequent full-text queries over each document and verification of the words in a checked document. In a forward index the document is central: each document points to the sequence of index terms it contains, i.e., the document points to the words it contains, whereas the inverted index points from a word to the documents containing it, hence the "inverted" relationship.
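A minimal Java sketch that builds the inverted index for the three example texts above and intersects the postings for "what", "is", and "it":

import java.util.*;

// Map each word to the set of document ids that contain it.
public class InvertedIndexDemo {
    public static void main(String[] args) {
        String[] docs = { "it is what it is", "what is it", "it is a banana" };
        Map<String, TreeSet<Integer>> index = new TreeMap<>();
        for (int id = 0; id < docs.length; id++) {
            for (String word : docs[id].split(" ")) {
                index.computeIfAbsent(word, w -> new TreeSet<>()).add(id);
            }
        }
        index.forEach((word, ids) -> System.out.println(word + " : " + ids));

        // Intersection of the postings for "what", "is", "it" -> documents {0, 1}
        Set<Integer> result = new TreeSet<>(index.get("what"));
        result.retainAll(index.get("is"));
        result.retainAll(index.get("it"));
        System.out.println("what AND is AND it -> " + result);
    }
}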

Extension:

Example problem: a document retrieval system that looks for files that contain a certain word, such as a keyword search for a common academic paper.

64. External sorting

Scope of application: big data sorting, deduplication

Basic principles and key points: the merge step of external sorting, the replacement-selection loser tree, and the optimal merge tree.

Extension:

Examples of problems:

1). There is a 1 GB file in which each line is a word of at most 16 bytes, and the memory limit is 1 MB. Return the 100 most frequent words.

The data has an obvious characteristic: each word is at most 16 bytes, but 1 MB of memory is not enough to hold a hash table, so external sorting can be used instead, with the memory serving as an input buffer.

65. Trie trees

Scope of application: large amounts of highly repetitive data, where the set of distinct items is small enough to fit in memory

Basic principle and key points: implementation mode, node child representation mode

Extension: Compression implementation.

Examples of problems:

1). There are 10 files, 1G per file. Each line of each file stores the user’s Query, and the query of each file may be repeated. Ask you to sort by the frequency of queries.

2).10 million strings, some of which are the same (duplicate), need to remove all the repeated strings, keep the non-repeated strings. How to design and implement?

3). Search for hot queries: the query string has a high degree of repetition, although the total number is 10 million, but if you remove the repetition, no more than 3 million, each no more than 255 bytes.

66. Distributed processing with MapReduce

Scope of application: a large amount of data, but the number of distinct kinds is small enough to fit in memory

Basic principles and key points: hand the data to different machines for processing, partition the data, and reduce the results.

Extension:

Examples of problems:

1). The canonical example application of MapReduce is a process to count the appearances of each different word in a set of documents.

2). Massive data is distributed among 100 computers. Find a way to efficiently collect the TOP10 of this batch of data.

3). There are N machines with N numbers on each machine. Each machine can store at most O(N) numbers and operate on them. How do I find a median of N^2 numbers?

67. Analysis of a classic problem

Tens of millions or hundreds of millions of data items (with duplicates): find the top N items that occur most frequently. Consider both the case where the data can be read into memory at once and the case where it cannot.

Available ideas: Trie tree + heap, database index, separate subset statistics, Hash, distributed computing, approximate statistics, external sorting

The phrase "can be read into memory" should actually refer to the amount of data after removing duplicates. If the deduplicated data fits in memory, we can build a dictionary for the data, for example a map, HashMap, or trie, and then count directly. While updating each item's count we can also use a heap to maintain the top N most frequent items, although this adds maintenance overhead and is less efficient than computing the top N after counting is complete.

If the data can’t fit into memory. On the one hand, we can consider whether the dictionary method above can be modified to accommodate this situation. The change we can make is to store the dictionary on hard disk instead of memory, which can be referred to the database storage method.

Of course, a better approach is distributed computing, essentially the MapReduce process: first partition the data across different machines by value range or by the hash (e.g., MD5) of each item, ideally so that each partition fits in memory at once; different machines then handle different ranges of values, which is essentially the map step. After obtaining the results, each machine only needs to output its top N most frequent items, and these are then merged to select the overall top N, which is essentially the reduce step.

You might want to simply spread the data evenly across machines, but that does not yield the correct answer, because one item may be spread evenly over the machines while another is concentrated entirely on one machine, even though both occur the same number of times. For example, suppose we are looking for the top 100 and spread 10 million items over 10 machines, then take the top 100 from each machine. This does not guarantee we find the true 100th item: that item may occur 10,000 times overall but, split across 10 machines, only 1,000 times on each, while items that occur slightly more often on a single machine squeeze it out of that machine's local top list. Even if each machine picked its 1,000 most frequent items and we merged them, the result could still be wrong, because many such borderline items may be clustered together. Therefore data cannot be partitioned across machines arbitrarily; it must be mapped to machines by hash value so that each machine handles a distinct range of values.

The external sorting approach, on the other hand, consumes a lot of I/O and is not very efficient. The distributed approach above can also be used on a single machine: split the full data set into several files by value range, process them one by one, and finally merge the words and their frequencies, which is essentially an external merge sort.

Another option is approximation, where we can make this scale fit into memory by combining natural language properties and using as a dictionary only those words that actually occur most frequently.

68. Write a WordCount program with MR, Spark, and Spark SQL

[Spark version]

import org.apache.spark.{SparkConf, SparkContext}

val conf = new SparkConf().setAppName("wd").setMaster("local[1]")
val sc = new SparkContext(conf)
// load
val lines = sc.textFile("...")  // input path
val pairs = lines.flatMap(line => line.split("^A"))
val words = pairs.map((_, 1))
val result = words.reduceByKey(_ + _).sortBy(x => x._1, false)
// print
result.foreach(wds => {
  println("word: " + wds._1 + " number: " + wds._2)
})
sc.stop()

[Spark SQL version]

import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.{Row, SQLContext}
import org.apache.spark.sql.types.{DataTypes, StructField, StructType}

val conf = new SparkConf().setAppName("sqlWd").setMaster("local[1]")
val sc = new SparkContext(conf)
val sqlContext = new SQLContext(sc)
// load
val lines = sc.textFile("E:/idea15/createRecommeder/data/words.txt")
val rows = lines.flatMap(x => x.split(" ")).map(y => Row(y))
val structType = StructType(Array(StructField("name", DataTypes.StringType, true)))
val df = sqlContext.createDataFrame(rows, structType)
df.registerTempTable("t_word_count")
sqlContext.udf.register("num_word", (name: String) => 1)
sqlContext.sql("select name, num_word(name) from t_word_count").groupBy("name").count().show()
sc.stop()

69 (2) Hive usage, internal and external table differences, partitioning, UDFs, and Hive optimization

(1) Hive use: warehouse and tools

(2) Internal vs. external tables: dropping an internal table permanently deletes its data; dropping an external table removes only the metadata, so the underlying data remains accessible to others

(3) Partition function: to prevent data skew

(4)UDF function: user-defined function (mainly to solve format, calculation problems), need to inherit UDF class

Java code implementation

class TestUDFHive extends UDF {

    public String evaluate(String str) {
        try {
            return "hello" + str;
        } catch (Exception e) {
            return str + "error";
        }
    }
}

(5) Hive optimization: think of it in terms of the underlying MapReduce processing

A. Sorting optimization: sort by is more efficient than order by

B. Partitioning: use static partitions (e.g., statu_date="20160516", location="beijin"); each partition corresponds to a directory on HDFS

C. Reduce the number of jobs and tasks: use table joins

D. Resolve group-by data skew: set hive.groupby.skewindata to true so that Hive automatically balances the load

E. Merge small files into large files (e.g., for table join operations)

F. Use UDF or UDAF functions (see the post "Writing and using UDTF in Hive" by ggjucheng on cnblogs)

3 HBase rowkey design and HBase optimization

A. Rowkey: one of the keys in HBase's three-dimensional storage (row key, column key (family + qualifier), and timestamp)

Rowkeys are sorted lexicographically; the shorter the better

Examples: ID + time, e.g., 9527+20160517; hashed prefix, e.g., dsakjKdFUwDSf+9527+20160518

In applications, the rowkey is an integer multiple of 8 bytes, ranging from 10 to 100bytes, which improves OS performance

B. HBase optimization

The RegionSplit() method NUMREGIONS=9

Keep the number of column families at three or fewer

Hard disk configuration facilitates regionServer management and data backup and recovery

Allocate appropriate memory to regionServer

Other:

Hbase query

(1) get

(2) scan

Use startRow and endRow limits

4 Common Linux operations

A. awk:

awk -F: 'BEGIN{print "name ip"} {print $1,$7} END{print "end"}' /etc/passwd

last | head -5 | awk 'BEGIN{print "name ip"} {print $1,$3} END{print "over"}'

B. sed

5 Java threads: two implementation approaches, design patterns, linked-list operations, sorting

(1) Two ways to implement a thread

A. Extend the Thread class

TestCL th = new TestCL();   // the class TestCL extends Thread
th.start();

B. Implement the Runnable interface

Thread th = new Thread(new Runnable() {
    public void run() {
        // implementation
    }
});
th.start();

(2) Design patterns, divided into four categories

A. Creational patterns: for example, the factory and singleton patterns (a small singleton sketch follows this list)

B. Structural patterns: e.g., the proxy pattern

C. Behavioral patterns: e.g., the observer pattern

D. The thread pool pattern
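As a tiny illustration of the creational category (item A above), a thread-safe lazy singleton sketch in Java (the class name is made up):

// Double-checked locking singleton.
public final class ConfigHolder {
    private static volatile ConfigHolder instance;

    private ConfigHolder() { }               // private constructor: no outside instantiation

    public static ConfigHolder getInstance() {
        if (instance == null) {               // first check without locking
            synchronized (ConfigHolder.class) {
                if (instance == null) {       // second check inside the lock
                    instance = new ConfigHolder();
                }
            }
        }
        return instance;
    }
}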

6. Introduction of a project you are most familiar with, architecture diagram, technology used, which part are you responsible for?

7 CDH cluster monitoring

(1) database monitoring (2) host monitoring (3) service monitoring (4) activity monitoring

Working principle of computer network

Connect the scattered machines through the principle of data communication to achieve sharing!

Hadoop ecosystem

HDFS, MapReduce, Hive, HBase, ZooKeeper, Flume

HDFS principles and the function of each module; MapReduce principles; MapReduce data-skew optimization

11 System maintenance: Hadoop upgrade Datanodes

12 [Explained key points of the project: amount of data, number of people, division of labor, operation time, machine, algorithm and technology used in the project]

13. Learn to ask questions

JVM running mechanism and memory principle

Run:

I. Load the .class files

II manages and allocates memory

III Garbage Collection

Memory principle:

I. The JVM loads the environment and configuration

II loads and initializes the JVM.dll

IV. Process the class files

15 Optimize HDFS and YARN parameters

mapreduce.job.jvm.numtasks

The default is 1; set it to -1 to reuse the JVM without limit.

16 Principles and usage of hbase, Hive, Impala, ZooKeeper, Storm, and Spark, and their architecture diagrams

How do I set the number of Mappers for a Hadoop task

The answer:

Setting it manually with job.setNumMapTasks(int n) is not reliable

The official documentation says "Note: This is only a hint to the framework", which shows that this method is only a hint, not decisive

It’s actually calculated using the formula:

computeSplitSize() = max(minSplitSize, min(maxSplitSize, blockSize))
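A simplified Java rendering of that formula (mirroring, in spirit, what FileInputFormat's computeSplitSize does):

// splitSize = max(minSplitSize, min(maxSplitSize, blockSize))
public static long computeSplitSize(long blockSize, long minSize, long maxSize) {
    return Math.max(minSize, Math.min(maxSize, blockSize));
}
// With minSize = 1 and maxSize = Long.MAX_VALUE, the split size equals the block size,
// which is why the number of mappers normally follows the number of blocks.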

Can a Hadoop job write its output to multiple directories? If so, how?

Answer: yes, use the MultipleOutputs class (available in 1.x and later)

Source:

MultipleOutputs.addNamedOutput(conf, "text2", TextOutputFormat.class, Long.class, String.class);

MultipleOutputs.addNamedOutput(conf, "text3", TextOutputFormat.class, Long.class, String.class);


73 How do I set the number of reducer to be created for a Hadoop job

Answer: job.setNumReduceTasks(int n)

Or adjust the default value of mapred.tasktracker.reduce.tasks.maximum in mapred-site.xml

74 Of the major common InputFormats defined in Hadoop, which is the default:

(A) TextInputFormat

(B) KeyValueInputFormat

(C) SequenceFileInputFormat

Answer: A,

What is the difference between the two classes TextInputFormat and KeyValueTextInputFormat?

The answer:

Both are subclasses of FileInputFormat:

TextInputFormat(default type, key is LongWritable, value is Text, key is the offset of the current line in the file, value is the current line itself);

KeyValueTextInputFormat (suitable for lines that already contain a key and a value; any delimiter can be specified; the default delimiter is the tab character).

Source:

String sepStr = job.get("mapreduce.input.keyvaluelinerecordreader.key.value.separator", "\t");

Note: The FileInputFormat parent class is inherited when you customize the input format

What is an InputSplit in a running Hadoop job?

Answer: An InputSplit is the unit of input that MapReduce processes and computes over. It is only a logical concept: an InputSplit does not actually split the file; it merely records the location of the data to be processed (the file path and hosts) and the length (determined by start and length). By default its size equals that of a block.

Expansion: After defining InputSplit, explain the principle of MapReduce

How is file splitting invoked in the Hadoop framework?

Answer: An instance of the configured InputFormat is created and its getSplits() method is called to split the input files into FileSplits, which serve as the input to the map tasks; a map task is created for each split and added to the task queue.

Source code shows the number of split

long goalSize = totalSize / (numSplits == 0 ? 1 : numSplits);
long minSize = Math.max(job.getLong(org.apache.hadoop.mapreduce.lib.input.
    FileInputFormat.SPLIT_MINSIZE, 1), minSplitSize);   // minSplitSize defaults to 1

When is combiner used and when is combiner not used?

Answer: Combiner is suitable for scenarios where records are summarized (such as summing), but not for scenarios where averages are taken
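For example, a summing reducer can double as the combiner, as in this minimal word-count style sketch (class and field names are illustrative):

import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

// Summing is associative, so the same class can serve as both combiner and reducer
// (job.setCombinerClass(SumReducer.class)). Averaging cannot be combined this way,
// because an average of partial averages is not the overall average.
public class SumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
    private final IntWritable result = new IntWritable();

    @Override
    protected void reduce(Text key, Iterable<IntWritable> values, Context context)
            throws IOException, InterruptedException {
        int sum = 0;
        for (IntWritable v : values) {
            sum += v.get();
        }
        result.set(sum);
        context.write(key, result);
    }
}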

What is the difference between jobs and Tasks in Hadoop?

The answer:

A job is an entry point to a job that controls, tracks, and manages tasks. It is also a process

Includes MapTask and Reducetask

A task is a map or reduce step used to complete the job; tasks run as threads

Hadoop implements parallel computing by splitting tasks into multiple nodes. However, slow running of some nodes slows down the whole task. What mechanism does Hadoop adopt to deal with this situation?

Answer: The mechanism Hadoop provides for this is speculative execution: a slow task is launched again on another node and the first copy to finish wins. If monitoring logs show that the slowness is caused by data skew, the following measures also help:

Solution:

(1) Adjust the number of split Mapper (partition number)

(2) Increase the JVM memory

(3) Appropriately increase the number of reduce

81 What feature of Hadoop Streaming provides the flexibility to implement MapReduce jobs in different languages, such as Perl, Ruby, or awk?

Answer: Use executable files as Mapper and Reducer, accept standard input and output standard output

82 Refer to the M/R system scenario below:

— the HDFS block size is 64MB

— The input type is FileInputFormat

— There are three input files of sizes 64KB, 65MB, and 127MB

How many chunks will the Hadoop framework split these files into?

The answer:

64KB ----> one block

65MB ----> two blocks: one 64MB block and one 1MB block

127MB ----> two blocks: one 64MB block and one 63MB block

What is the role of RecordReader in Hadoop?

Answer: it sits between the InputSplit and the mapper.

It converts the data described by an InputSplit into records and feeds them to the mapper as key-value pairs.

84 After the map phase, the Hadoop framework performs partitioning, shuffle, and sort. What happens during this phase?

The answer:

MR consists of four stages: split, map, shuffle, and reduce. After the map function runs, its output can be partitioned.

Partitioning: determines which reducer will process (aggregate) each record.

Sort: Sort in each partition, by default in lexicographical order.

Group: Grouping after sorting

85 If no Partitioner is defined, how is the data partitioned before being sent to the Reducer?

The answer:

The Partitioner is called when the map function executes context.write().

Users can implement a custom Partitioner to control which keys are assigned to which reducer.

Check the source code to know:

If no Partitioner is defined, the default HashPartitioner is used:

public class HashPartitioner<K, V> extends Partitioner<K, V> {

    /** Use {@link Object#hashCode()} to partition. */
    public int getPartition(K key, V value, int numReduceTasks) {
        return (key.hashCode() & Integer.MAX_VALUE) % numReduceTasks;
    }
}

What is Combiner?

Answer: This is a Hadoop optimization step that occurs between Map and Reduce

Purpose: to reduce the mapper's output, which lowers network pressure and mitigates data skew.

The source information is as follows:

public void reduce(Text key, Iterator<LongWritable> values,
                   OutputCollector<Text, LongWritable> output, Reporter reporter)
        throws IOException {

    LongWritable maxValue = null;

    while (values.hasNext()) {
        LongWritable value = values.next();
        if (maxValue == null) {
            maxValue = value;
        } else if (value.compareTo(maxValue) > 0) {
            maxValue = value;
        }
    }

    output.collect(key, maxValue);
}

In the collector implementation class there is a method like this:

public synchronized void collect(K key, V value) throws IOException {
    outCounter.increment(1);
    writer.append(key, value);
    if ((outCounter.getValue() % progressBar) == 0) {
        progressable.progress();
    }
}

Which of the following programs is responsible for HDFS data storage? Answer: C (DataNode)

a)NameNode

b)Jobtracker

c)Datanode

d)secondaryNameNode

e)tasktracker

How many copies of a block does HDFS save by default? Answer: A (3)

a) 3 copies

b) 2

c) 1

d) not sure

Which of the following programs is usually started on the same node as the NameNode? Answer: D (JobTracker)

a)SecondaryNameNode

b)DataNode

c)TaskTracker

d)Jobtracker

Analysis of this problem:

Hadoop clusters use a master/slave architecture. The NameNode and JobTracker belong to the master; DataNodes and TaskTrackers belong to the slaves. There is only one master, while there are multiple slaves. The SecondaryNameNode's memory requirement is on the same order of magnitude as the NameNode's, so the SecondaryNameNode (running on a separate physical machine) and the NameNode are usually run on different machines.

The JobTracker and TaskTracker

JobTracker corresponds to NameNode

TaskTracker corresponds to DataNode

Datanodes and Namenodes are for data storage

JobTracker and TaskTracker are for MapReduce execution

There are several main concepts in MapReduce; as a whole it can be divided into several lines of execution: JobClient, JobTracker, and TaskTracker.

1) JobClient will package application configuration parameters into JAR files and store them in HDFS through the JobClient class on the client side. Submit the path to Jobtracker, which then creates each Task(i.e., MapTask and ReduceTask) and distributes them to each TaskTracker service for execution.

2) JobTracker is a master service. After the software starts, JobTracker receives the Job, schedules each sub-task of the Job to run on TaskTracker, monitors it, and re-runs it if any task fails. In general, JobTracker should be deployed on a separate machine.

TaskTracker is a slave service that runs on multiple nodes. It actively communicates with the JobTracker, receives tasks, and is responsible for executing each task directly. TaskTrackers need to run on the DataNodes of HDFS.

Who created Hadoop? Answer: C (Doug Cutting)

a)Martin Fowler

b)Kent Beck

c)Doug cutting

What is the default HDFS Block Size? Answer: B (64MB in Hadoop 1.x)

a)32MB

b)64MB

c)128MB

(Because the version changes quickly, the answer here is for reference only)

Which of the following is usually the most important bottleneck in a cluster? Answer: C Disks

a)CPU

B) network

C) disk IO

D) memory

The answer to this question is:

The first goal of clustering is to save costs, replacing minicomputers and mainframes with cheap PCS. What are the features of minicomputers and mainframes?

1. Strong CPU processing capability

2. Memory is large enough

So the bottleneck of the cluster cannot be A and D

3. The network is a scarce resource, but it is not the main bottleneck.

4. Because big data means massive volumes, data must be read and written via disk I/O, and then stored redundantly (Hadoop typically keeps three copies), so disk I/O comes under the most pressure.

Which is true about SecondaryNameNode? The answer C

A) It is hot standby for NameNode

B) It has no memory requirements

C) Its purpose is to help NameNode merge edit logs and reduce NameNode startup time

D)SecondaryNameNode and NameNode should be deployed on the same node.

Multiple choice:

Which of the following can be used for cluster management? Answer: ABD

a)Puppet

b)Pdsh

c)Cloudera Manager

d)Zookeeper

95. Which of the following is correct in configuring rack awareness: answer ABC

A) If a rack fails, data read and write will not be affected

B) Data will be written to datanodes on different racks

C)MapReduce obtains network data close to it based on the rack

Which of the following is correct when a Client uploads a file? The answer B

A) Data is transmitted to DataNode through NameNode

B) The Client splits the file into blocks and uploads the file in sequence

C) The Client uploads data to only one DataNode, and NameNode is responsible for Block replication

Analysis of the problem:

The Client initiates a file write request to the NameNode.

The NameNode returns information about datanodes managed by the Client based on the file size and file block configuration.

The Client divides the file into multiple blocks and writes them to each DataNode Block in sequence based on the DataNode address information.

11. Which of the following are operating modes of Hadoop? Answer: ABC

A) stand-alone version

B) pseudo-distributed

C) distributed

What methods does Cloudera offer to install CDH? Answer: the ABCD

a)Cloudera manager

b)Tarball

c)Yum

d)Rpm

True or false:

Ganglia can not only monitor but also alert. (correct)

Analysis: Strictly speaking, yes. Ganglia, one of the most commonly used monitoring tools in Linux environments, excels at collecting data from nodes at low cost to the user, but it is not very good at warning users or notifying them of events. The latest versions of Ganglia have gained some of this functionality, but Nagios is better at alerting: it is software specialized in warning and notification. By combining the two, using Ganglia as the data source for Nagios and Nagios as the alerting layer, you get a complete monitoring and management system.

99. Block Size cannot be modified. (error)

Analysis: The base Hadoop configuration file is hadoop-default.xml. By default, when a Job Config is created, it first reads the hadoop-default.xml configuration and then reads hadoop-site.xml (which is initially empty); hadoop-site.xml holds the system-level settings, such as the block size, that should override hadoop-default.xml.

Nagios cannot monitor a Hadoop cluster because it does not provide Hadoop support. (error)

Analysis: Nagios is a cluster monitoring tool and one of the top three cloud computing tools

16. If NameNode terminates unexpectedly, SecondaryNameNode takes over to keep the cluster working. (error)

SecondaryNameNode is a recovery aid, not a replacement.

  1. Cloudera CDH is available for a fee. (error)

Analysis: The first paid product, Cloudera Enterprise, was unveiled at the Hadoop Summit in California. Cloudera Enterprise enhances Hadoop with several proprietary management, monitoring, and operational tools. It is sold as a subscription contract, with pricing that varies with the size of the Hadoop cluster.

  1. Hadoop is developed in Java, so MapReduce can only be written in Java. (error)

Analysis: RHadoop is developed in R language, MapReduce is a framework, can be understood as an idea, can be developed in other languages.

  1. Hadoop supports random reads and writes of data. (wrong)

Analysis: Lucene supports random reads and writes, while HDFS only supports random reads (not random writes). But HBase can help: it provides random read and write services to address what Hadoop alone cannot do. HBase has focused on scalability from its underlying design: tables can be "tall", with billions of rows, or "wide", with millions of columns, and are horizontally partitioned and automatically replicated across thousands of commodity nodes. The table schema directly reflects the physical storage, allowing the system to serialize, store, and retrieve data structures efficiently.

  1. The NameNode manages metadata. The NameNode reads or writes metadata information from the disk and sends it back to the client for each read/write request. (error)

Analysis of this problem:

The NameNode does not need to read metadata from disk. All data is in memory, and only serialized results are read on disk each time the NameNode is started.

1) File writing

The Client initiates a file write request to the NameNode.

The NameNode returns information about datanodes managed by the Client based on the file size and file block configuration.

The Client divides the file into multiple blocks and writes them to each DataNode Block in sequence based on the DataNode address information.

2) File reading

Client initiates a file read request to NameNode.

  1. The NameNode local disk holds the location of the Block. (Personal opinion is correct, other opinions are welcome)

Analysis: Datanodes are basic units of file storage. Datanodes store blocks in local file systems, store meta-data of blocks, and periodically send all existing Block information to NameNode. NameNode returns the DataNode information stored in the file.

The Client reads file information.

  1. DataNodes communicate with the NameNode over long-lived connections. (disputed)

There is some disagreement on this point, and authoritative confirmation is still being sought. The following information is offered for reference.

First, to clarify the concept:

(1). The long connection

A communication connection is established between the Client and Server. After the connection is established, packets are sent and received continuously. This mode is often used for point-to-point communication because the communication connection always exists.

(2). A short connection

The Client and Server communicate with each other only after the transaction is complete. This mode is commonly used for point-to-multipoint communication, for example, multiple clients connect to a Server.

  1. Hadoop has strict permission management and security measures to ensure normal cluster running. (error)

Analysis: Hadoop can only stop good people from doing bad things, but not bad people from doing bad things

  1. The Slave node stores data, so the larger the disk, the better. (error)

Analysis: Once the Slave node is down, data recovery is a problem

  1. The hadoop dfsadmin -report command is used to detect HDFS corrupted blocks. (error)

Analysis: dfsadmin -report reports overall HDFS status and capacity; hadoop fsck is the command that checks for corrupted or missing blocks.

  2. Hadoop's default scheduler policy is FIFO. (correct)

111. RAID should be configured for each node in the cluster to prevent the damage of a single disk from affecting the operation of the entire node. (error)

Analysis: To understand what RAID is, refer to disk array. What is wrong with this sentence is that it is too absolute. Questions are not important, knowledge is the most important. Because Hadoop is inherently redundant, you don’t need RAID if you’re not too strict. Please refer to question 2 for details.

112. Because HDFS has multiple copies, NameNode does not have a single point of problem. (error)

113. Each map slot is a thread. (error)

Analysis: First understand what a map slot is. A map slot is a logical value (org.apache.hadoop.mapred.TaskTracker.TaskLauncher.numFreeSlots), not a thread or a process.

  1. Mapreduce’s input split is a block. (error)

  2. NameNode's Web UI port is 50030, and it launches the web service through Jetty. (error)

Analysis: 50030 is the JobTracker web UI port; the NameNode web UI listens on 50070 (the web service is indeed served by Jetty).

  3. The Hadoop environment variable HADOOP_HEAPSIZE is used to set the memory of all Hadoop daemons. It defaults to 200 GB. (error)

Analysis: the memory for all Hadoop daemons (namenode, secondarynamenode, jobtracker, datanode, tasktracker) is set uniformly in hadoop-env.sh via the HADOOP_HEAPSIZE parameter, whose default value is 1000 MB.

  1. When DataNodes join the cluster for the first time, if an incompatible file version is reported in the log, the NameNode needs to run "hadoop namenode -format" to format the disk. (error)

Analysis:

What is ClusterID

ClusterID. A new identifier ClusterID was added to identify all nodes in the cluster. When formatting a Namenode, this identifier needs to be provided or generated automatically. This ID can be used to format other Namenodes that join the cluster.

I hope this collection is helpful to everyone.