1. Let’s talk about the first project
  1. Shuffle optimization in Hive

    1. Compression reduces the amount of data stored on disk and improves query speed by reducing I/O.

      Enable compression for the intermediate output of the series of MR jobs that Hive generates

      set hive.exec.compress.intermediate=true;
      set mapred.map.output.compression.codec=org.apache.hadoop.io.compress.SnappyCodec;

      Compression of final output (files written to HDFS, local disk)

      set hive.exec.compress.output=true;
      set mapred.output.compression.codec=org.apache.hadoop.io.compress.SnappyCodec;
    2. The join optimization

      1. Map join

        By default, if one of the two tables in a join query is small enough, Hive converts the join into a map join and loads the small table into memory.

        hive.mapjoin.smalltable.filesize=25000000 (default threshold for what counts as a small table)
        hive.auto.convert.join=true (enabled by default)

        If map join is not used, for example:

        ```sql
        SELECT * FROM STORE_SALES JOIN TIME_DIM ON (ss_sold_time_sk = t_time_sk);
        ```

      2. SMB join (sort-merge-bucket join) (a hedged configuration sketch follows at the end of this join section)

      3. Skew join

        ```xml
        <!-- hive.optimize.skewjoin: whether to create a separate execution plan for skewed keys in a join. It is based on the skewed keys stored in the metadata. At compile time, Hive generates separate query plans for the skewed keys and for the other key values. -->
        <property>
          <name>hive.optimize.skewjoin</name>
          <value>true</value>
        </property>
        <!-- hive.skewjoin.key: determines how a skewed key is detected during a join. If the number of rows with the same key value exceeds this threshold, the key is treated as a skew join key. -->
        <property>
          <name>hive.skewjoin.key</name>
          <value>100000</value>
        </property>
        <!-- hive.skewjoin.mapjoin.map.tasks: the number of map tasks used for the map join that handles skewed keys. Use it together with hive.skewjoin.mapjoin.min.split for fine-grained control. -->
        <property>
          <name>hive.skewjoin.mapjoin.map.tasks</name>
          <value>10000</value>
        </property>
        <!-- hive.skewjoin.mapjoin.min.split: the minimum split size, which together with hive.skewjoin.mapjoin.map.tasks determines the number of map tasks for the skew map join. -->
        <property>
          <name>hive.skewjoin.mapjoin.min.split</name>
          <value>33554432</value>
        </property>
        ```
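      The SMB join item above has no example, so here is a minimal sketch, assuming both tables are bucketed and sorted on the join key (the table names, column names, and bucket count are illustrative, not from the original notes):

      ```sql
      -- Settings that enable sort-merge-bucket (SMB) map join
      set hive.auto.convert.sortmerge.join=true;
      set hive.optimize.bucketmapjoin=true;
      set hive.optimize.bucketmapjoin.sortedmerge=true;

      -- Both tables are bucketed and sorted on the join key, with the same bucket count
      CREATE TABLE orders_bucketed (order_id INT, user_id INT)
      CLUSTERED BY (user_id) SORTED BY (user_id) INTO 16 BUCKETS;

      CREATE TABLE users_bucketed (user_id INT, name STRING)
      CLUSTERED BY (user_id) SORTED BY (user_id) INTO 16 BUCKETS;

      -- The join can then be executed bucket by bucket without a full shuffle
      SELECT o.order_id, u.name
      FROM orders_bucketed o
      JOIN users_bucketed u ON o.user_id = u.user_id;
      ```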
  1. How does Hive resolve data skew during aggregation (GROUP BY)?

Root cause: keys are unevenly distributed

Partial aggregation on the map side, equivalent to a combiner

hive.map.aggr=true

Load balancing is performed when there is data skew

hive.groupby.skewindata=true

If this is set to true, the generated query plan has two MR jobs. In the first MR job, the map output is randomly distributed across the reducers, and each reducer performs a partial aggregation and outputs its result; because rows with the same Group By key may end up in different reducers, this achieves load balancing. The second MR job then distributes the pre-aggregated results to the reducers by the Group By key (which guarantees that the same Group By key goes to the same reducer) and performs the final aggregation (a short sketch of using both settings follows).
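A minimal sketch of using both settings together (the table and column names are illustrative):

```sql
-- Enable map-side partial aggregation and the two-job skew-handling plan
set hive.map.aggr=true;
set hive.groupby.skewindata=true;

-- A skewed aggregation that benefits from the settings above
select user_id, count(*) as pv
from access_log
group by user_id;
```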

  1. How do you import all the tables of a database with Sqoop? Which parameters are used? What about incremental import?

Full import

[hadoop@linux03 sqoop-1.4.5-cdh5.3.6]$ bin/sqoop import \
  --connect jdbc:mysql://linux03.ibf.com:3306/mydb \
  --username root \
  --password 123456 \
  --table user

Incremental import

bin/sqoop import \
  --connect jdbc:mysql://linux03.ibf.com:3306/mydb \
  --username root \
  --password 123456 \
  --table user \
  --fields-terminated-by '\t' \
  --target-dir /sqoop/incremental \
  -m 1 \
  --direct \
  --check-column id \
  --incremental append \
  --last-value 3
  1. Which operations in Hive can cause data skew? –> group/join keys that are not uniformly distributed, and a large number of NULL values.

    Operations that aggregate or join on keys with an uneven distribution can lead to data skew, such as GROUP BY key

    ORDER BY performs a global sort, so all the data ends up in a single reducer, which results in skew

    A large number of NULL values

    NULL is sometimes required in Hive:

    1) An INSERT statement in Hive must match the number of columns; writing only some of the columns is not supported, so NULL must be used as a placeholder for columns without values.

    2) In the data files of a Hive table, columns are separated by a delimiter, and empty columns store NULL (\N) so that column positions are preserved. However, if an external table loads data with too few columns, for example a table with 13 columns loading a file that has only 2 columns, the remaining trailing columns have no corresponding data and are automatically shown as NULL.

    NULL can be stored as an empty string instead, which saves disk space. This can be configured in several ways:

    A. In the CREATE TABLE statement, use ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe' WITH SERDEPROPERTIES ('serialization.null.format'=''); note that the two must be used together, for example:
    CREATE TABLE hive_tb (id int, name STRING)
    PARTITIONED BY (`day` string, `type` tinyint COMMENT '0 as bid, 1 as win, 2 as ck', `hour` tinyint)
    ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe'
    WITH SERDEPROPERTIES ('field.delim'='\t', 'escape.delim'='\\', 'serialization.null.format'='')
    STORED AS TEXTFILE;
    B. Or use ROW FORMAT DELIMITED NULL DEFINED AS '':
    CREATE TABLE hive_tb (id int,name STRING)
    PARTITIONED BY ( `day` string,`type` tinyint COMMENT '0 as bid, 1 as win, 2 as ck', `hour` tinyint)
    ROW FORMAT DELIMITED 
            NULL DEFINED AS '' 
    STORED AS TEXTFILE;

    C. Or modify an existing table:

    alter table hive_tb set serdeproperties('serialization.null.format' = '');
  2. How do you add a column to a Hive table?

    hive> alter table log_messages add columns( app_name string comment 'Application name', session_id bigint comment 'The current session id' );
    The new column is added after the last existing column of the table and before the partition columns.

    If a column new_column has just been added, it is not possible to insert data for new_column alone directly into the table (a hedged backfill sketch follows the partition example below).

    If the new column is a partition column, you can add data under that partition:

    insert into table clear partition(date='20150828',hour='18') select id,url,guid from tracklogs where date='20150828' and  hour='18';
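    A hedged sketch of the backfill approach mentioned above: rewrite a whole partition with INSERT OVERWRITE so the newly added columns get values. The staging table and the exact column list are hypothetical, not from the original notes:

    ```sql
    INSERT OVERWRITE TABLE log_messages PARTITION (date='20150828', hour='18')
    SELECT id, url, guid,
           app_name,    -- newly added column
           session_id   -- newly added column
    FROM log_messages_staging          -- hypothetical source with the new columns populated
    WHERE date='20150828' AND hour='18';
    ```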
  3. Run the spark
  4. Have you handled JSON in Hive? Which functions are available?

    1. When creating the table, specify the JAR package to handle the JSON data

      1. First add the JAR package
        ADD JAR hcatalog/share/hcatalog/hive-hcatalog-core-1.1.0-cdh5.14.2.jar;

      2. Create the table

        ```
        hive (default)> ADD JAR hcatalog/share/hcatalog/hive-hcatalog-core-1.1.0-cdh5.14.2.jar;
        Added [hcatalog/share/hcatalog/hive-hcatalog-core-1.1.0-cdh5.14.2.jar] to class path
        Added resources: [hcatalog/share/hcatalog/hive-hcatalog-core-1.1.0-cdh5.14.2.jar]
        hive (default)> create table spark_people_json(
                      > `name` string,
                      > `age` int)
                      > ROW FORMAT SERDE 'org.apache.hive.hcatalog.data.JsonSerDe'
                      > STORED AS TEXTFILE;
        OK
        Time taken: 4.445 seconds
        ```

    2. If you only have a JSON string field and want to get values out of it:

      1. get_json_object() can only extract one field at a time

        ```sql
        select get_json_object('{"shop":{"book":[{"price":43.3,"type":"art"},{"price":30,"type":"technology"}],"thanks":{"price":19.951,"type":"shirt"}},"name":"jane","age":"23"}', '$.shop.book[0].type');
        ```

      2. json_tuple() can extract multiple fields

        ```sql
        select json_tuple('{"name":"jack","server":"www.qq.com"}', 'server', 'name');
        ```

      3. Write your own UDF
  1. How do you stop a running Spark Streaming program? How do you stop it safely? How do you switch between the running code and updated code?

    Upgrade the application code

    If you need to upgrade your running Spark Streaming application with the new application code, there are two possible mechanisms.

    1. The upgraded Spark Streaming application is started and runs in parallel with the existing application. Once the new application (which receives the same data as the old one) has warmed up and is ready, the old one can be brought down. Note that this requires data sources that can send data to two destinations, i.e. to both the old and the upgraded application.
    2. The existing application is shut down gracefully (see StreamingContext.stop(...) or JavaStreamingContext.stop(...) for the graceful shutdown options), which ensures that data that has already been received is fully processed before shutdown. The upgraded application can then be started and will begin processing from the point where the earlier application stopped. Note that this only works with input sources that support source-side buffering, such as Kafka and Flume, because data needs to be buffered while the previous application is down and the upgraded application has not yet started. Also, you cannot restart the upgraded code from an earlier checkpoint: checkpoint information contains serialized Scala/Java/Python objects, and trying to deserialize them with the new, modified classes can result in errors. In that case, either start the upgraded application with a different checkpoint directory or remove the previous checkpoint directory.
  1. How is precise one-time data consumption in SparkStreaming and Kafka integration implemented?

    Use the direct connection approach (manage the Kafka offsets manually and commit them only after the data has been processed successfully)

  2. How many message semantics are there?

    1. At least once — messages will never be lost, but may be repeated
    2. At most once — A message may be lost, but it will never be repeated
    3. Exactly once — every message is delivered once and only once; this is usually what users want
  3. How do Kafka consumers ensure exactly-once consumption?
  1. How many ways are there for SparkStreaming and Kafka integration?

    1. Receiver-based Approach
    2. Direct Approach (No Receivers) native Offsets
  2. How do I implement SparkStreaming?

    Initialize StreamingContext

    Define input sources by creating input DStreams.

    Streaming computation is defined by applying conversion and output operations to a DStream.

    Start receiving data and processing it with streamingContext.start().

    Wait for the processing to stop (either manually or due to an error) with streamingContext.awaitTermination().

    The processing can be stopped manually with streamingContext.stop().

  3. How many people are there on the project?
  4. Storage architecture in your project?
  5. What does the development tool use? (What tools to use for what situations /xshell/idea)
  6. How does the code manage it (git)?
  7. Analysis functions in Hive?

    1. Windowing functions: LEAD, LAG, FIRST_VALUE, LAST_VALUE (a short LEAD/LAG usage sketch follows at the end of this answer)

      LEAD can optionally take the number of rows to lead; if it is not specified, the lead is one row. It returns NULL when the lead of the current row goes beyond the end of the window.

      ```
      hive (default)> desc function lead;
      OK
      tab_name
      LEAD (scalar_expression [,offset] [,default]) OVER ([query_partition_clause] order_by_clause);
      The LEAD function is used to return data from the next row.
      ```

      LAG can optionally take the number of rows to lag; if it is not specified, the lag is one row. It returns NULL when the lag of the current row extends before the start of the window.

      ```
      hive (default)> desc function lag;
      OK
      tab_name
      LAG (scalar_expression [,offset] [,default]) OVER ([query_partition_clause] order_by_clause);
      The LAG function is used to access data from a previous row.
      ```

      FIRST_VALUE takes at most two parameters. The first is the column for which you want the first value; the second (optional) is a Boolean that defaults to false. If it is set to true, NULL values are skipped.

      LAST_VALUE takes at most two parameters. The first is the column for which you want the last value; the second (optional) is a Boolean that defaults to false. If it is set to true, NULL values are skipped.

    2. The OVER clause

      Standard aggregations: COUNT, SUM, MIN, MAX, AVG.

      PARTITION BY with one or more partitioning columns of any primitive data type.

      PARTITION BY and ORDER BY with one or more partitioning and/or ordering columns of any data type.

      An OVER specification with a window. Windows can be defined separately in a WINDOW clause. The window specification supports the following formats:

      ```
      (ROWS | RANGE) BETWEEN (UNBOUNDED | [num]) PRECEDING AND ([num] PRECEDING | CURRENT ROW | (UNBOUNDED | [num]) FOLLOWING)
      (ROWS | RANGE) BETWEEN CURRENT ROW AND (CURRENT ROW | (UNBOUNDED | [num]) FOLLOWING)
      (ROWS | RANGE) BETWEEN [num] FOLLOWING AND (UNBOUNDED | [num]) FOLLOWING
      ```

      When ORDER BY is specified without a WINDOW clause, the window defaults to RANGE BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW. When both the ORDER BY and WINDOW clauses are missing, the window defaults to ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING.

      The OVER clause supports the following functions, but it does not support a window with them (see HIVE-4797): ranking functions (RANK, NTILE, DENSE_RANK, CUME_DIST, PERCENT_RANK) and the LEAD and LAG functions.

    3. Analytics functions: RANK, ROW_NUMBER, DENSE_RANK, CUME_DIST, PERCENT_RANK, NTILE

      ```sql
      select s_id,
             percent_rank() over(partition by c_id order by s_score),
             ntile(2) over(partition by c_id order by s_score)
      from score;
      ```

    4. Distinct in aggregate functions (Hive 2.1.0 and later, see HIVE-9534)

      Aggregation functions support DISTINCT, including SUM, COUNT, and AVG, which aggregate over the distinct values within each partition. The current implementation has the following limitation: for performance reasons, ORDER BY or a window specification cannot be combined with the partitioning clause. The supported syntax is:

      ```sql
      COUNT(DISTINCT a) OVER (PARTITION BY c)
      ```

      ORDER BY and a window specification are supported for distinct aggregation from Hive 2.2.0 (see HIVE-13453), for example:

      ```sql
      COUNT(DISTINCT a) OVER (PARTITION BY c ORDER BY d ROWS BETWEEN 1 PRECEDING AND 1 FOLLOWING)
      ```

    5. Aggregate functions in the OVER clause (Hive 2.1.0 and later, see HIVE-13475)

      Support was added for referencing aggregate functions within the OVER clause. For example, the SUM aggregation function can be used in the OVER clause as shown below.

      ```sql
      SELECT rank() OVER (ORDER BY sum(b)) FROM T GROUP BY a;
      ```
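    A short usage sketch of LEAD and LAG over the score(s_id, c_id, s_score) table used elsewhere in these notes (the column roles are assumed from that example):

    ```sql
    select s_id, c_id, s_score,
           lag(s_score, 1)  over (partition by c_id order by s_score) as prev_score,
           lead(s_score, 1) over (partition by c_id order by s_score) as next_score,
           row_number()     over (partition by c_id order by s_score desc) as rn
    from score;
    ```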
  1. What are the commonly used string functions? (a usage sketch follows the list below)

    1. String length function: length
    2. The string reversal function: reverse
    3. String concatenation function: concat
    4. The delimited string concatenation function: concat_ws
    hive (practice)> select concat_ws('|','abc','def','gh');
    abc|def|gh
    1. Substring function: substr, substring

      substr(string A, int start),substring(string A, int start)

    2. Substring function with a length argument: substr, substring

      substr(string A, int start, int len),substring(string A, int start, int len)

    3. String to uppercase function: upper,ucase
    4. String to lowercase function: lower,lcase
    5. Trim whitespace: trim
    6. Trim leading whitespace: ltrim
    7. Trim trailing whitespace: rtrim
    8. Regular expression replacement function: regexp_replace

      regexp_replace(string A, string B, string C)

      Return value: String

      Note: Replace the part of string A that matches the Java regular expression B with C. Note that there are situations where escape characters are used, similar to the regexp_replace function in Oracle.

    9. Regular expression parsing function: regexp_extract

      Syntax: regexp_extract(string subject, string pattern, int index)

      Return value: String

      Note: Matches the string subject against the pattern regular expression and returns the capture group specified by index.

    10. URL resolution function: parse_url
    11. JSON parsing function: get_json_object
    12. The space string function: space
    13. Repeat string function: repeat
    hive (practice)> select repeat('abc',5);
    abcabcabcabcabc
    1. ASCII value of the first character: ascii
    2. Left-pad function: lpad
    3. Right-pad function: rpad
    4. Split string function: split
    5. Set lookup function: find_in_set
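    A quick sketch exercising a few of the functions above; the expected results are shown as comments (the input values are illustrative):

    ```sql
    select length('abcde'),                                  -- 5
           reverse('abc'),                                   -- cba
           substr('abcde', 2, 3),                            -- bcd
           lpad('7', 3, '0'),                                -- 007
           regexp_replace('foo123bar', '\\d+', '#'),         -- foo#bar
           regexp_extract('foo123bar', '([a-z]+)(\\d+)', 2), -- 123
           split('a,b,c', ','),                              -- ["a","b","c"]
           find_in_set('b', 'a,b,c');                        -- 2
    ```
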
  1. How do you calculate PV in Hive for each Monday and for the first day of each month?

    Get the first day of the year or of the month for a given date (a combined PV sketch follows at the end of this answer)

    select trunc('2019-02-24', 'YYYY');
    
    select trunc('2019-02-24', 'MM');
    

    Get the next occurrence of the specified day of the week after the given date

    select next_day('2019-02-24', 'TU');
    

    Returns the date several months after the specified date in the specified format

    select add_months('2019-02-28', 1);
    
    select add_months('2019-02-24 21:15:16', 2, 'YYYY-MM-dd HH:mm:ss');
    

    select count(guid) from table group by trunc(date, 'MM');

    select count(guid) from table group by next_day('2019-06-08', 'MONDAY');
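    Putting it together, a sketch of computing PV per calendar month and per week starting on Monday; the tracklogs table name comes from the earlier example, while the visit_date and guid columns are assumptions:

    ```sql
    -- PV per calendar month
    select trunc(visit_date, 'MM') as month_start, count(guid) as pv
    from tracklogs
    group by trunc(visit_date, 'MM');

    -- PV per week, keyed by the Monday that starts the week
    select date_sub(next_day(visit_date, 'MONDAY'), 7) as week_start, count(guid) as pv
    from tracklogs
    group by date_sub(next_day(visit_date, 'MONDAY'), 7);
    ```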

  2. What’s the difference between a DataSet and a DataFrame?

    Since Spark 2.x the DataFrame and Dataset APIs have been unified; a DataFrame is simply a Dataset whose elements are of type Row.

    The difference is that a Dataset is strongly typed while a DataFrame is untyped.

  3. Where is the Hive metadata stored in a project?

    1. If no external database is configured for the metastore, the metadata is stored by default in the Derby database that ships with Hive; this is the embedded mode of installing Hive
    2. If an external database is specified during installation, the metadata is stored in that database; both the local and remote modes store it this way
  4. How do you keep the metadata safe?

    Change the user name and password used to access the metastore database:

    <property>
        <name>javax.jdo.option.ConnectionUserName</name>
        <value>root</value>
    </property>
    <property>
        <name>javax.jdo.option.ConnectionPassword</name>
        <value>123456</value>
    </property> 

    Restrict access to the metastore database on the MySQL side, for example:
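    A minimal sketch of what that could look like in MySQL (the metastore database name, host, and credentials are illustrative):

    ```sql
    -- Create a dedicated account and grant it access only to the metastore database
    CREATE USER 'hive'@'centos01' IDENTIFIED BY '123456';
    GRANT ALL PRIVILEGES ON metastore.* TO 'hive'@'centos01';
    FLUSH PRIVILEGES;
    ```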

  5. How many ways can SQOOP import and export? Incremental export?

    The import

    Full import

    ```
    [hadoop@linux03 sqoop-1.4.5-cdh5.3.6]$ bin/sqoop import \
      --connect jdbc:mysql://linux03.ibf.com:3306/test_db \
      --username root \
      --password root \
      --table toHdfs \
      --target-dir /toHdfs \
      --direct \
      --delete-target-dir \
      --fields-terminated-by '\t' \
      -m 1
    ```

    Incremental import, append mode

    ```
    bin/sqoop import \
      --connect jdbc:mysql://linux03.ibf.com:3306/mydb \
      --username root \
      --password 123456 \
      --table user \
      --fields-terminated-by '\t' \
      --target-dir /sqoop/incremental \
      -m 1 \
      --direct \
      --check-column id \
      --incremental append \
      --last-value 3
    ```

    Incremental import, lastmodified mode (the table must have a column that records the modification time)

    ```
    sqoop import \
      --connect jdbc:mysql://master:3306/test \
      --username hive \
      --password 123456 \
      --table customertest \
      --check-column last_mod \
      --incremental lastmodified \
      --last-value "2016-12-15 15:47:29" \
      -m 1 \
      --append
    ```

    export

    By default, sqoop export appends new rows to the table: each input record is converted into an INSERT statement that adds a row to the target database table. If the table has constraints (for example, a primary key column whose values must be unique) and already contains data, care must be taken not to insert records that violate these constraints; if an INSERT statement fails, the export fails. This mode is mainly used to export records into an empty table that will receive the results. If the --update-key parameter is specified, Sqoop instead modifies existing rows in the table: each input record is converted into an UPDATE statement, and the rows to modify are determined by the column name given in --update-key; if a matching row does not exist in the table, nothing is inserted. Depending on the target database, if you want to update rows that already exist and insert rows that do not, specify allowinsert mode with --update-mode.
  1. Spark Streaming writes each small batch to HDFS and produces too many small files. How do you reduce them?

    Use a window: make the window long enough that the amount of data written at once is large (ideally close to one HDFS block size), then use foreachRDD to write the data out to HDFS

  2. What are the commonly used data types in Hive?

    1. numeric

      TINYINT

      SMALLINT

      INT/INTEGER

      BIGINT

      FLOAT

      DOUBLE

      DECIMAL

    2. character

      string

      varchar

      char

    3. The date type

      TIMESTAMP

      DATE

    4. The complex type

      ARRAY<data_type>

      create table hive_array_test (name string, stu_id_list array<INT>)
      ROW FORMAT DELIMITED
      FIELDS TERMINATED BY ','
      COLLECTION ITEMS TERMINATED BY ':';
      -- 'FIELDS TERMINATED BY': delimiter between fields
      -- 'COLLECTION ITEMS TERMINATED BY': delimiter between the items of a collection

      [chen@centos01 ~]$ vi hive_array.txt
      0601,1:2:3:4
      0602,5:6
      0603,7:8:9:10
      0604,11:12

      load data local inpath '/home/chen/hive_array.txt' into table hive_array_test;

      hive (default)> select * from hive_array_test;
      OK
      hive_array_test.name  hive_array_test.stu_id_list
      0601  [1,2,3,4]
      0602  [5,6]
      0603  [7,8,9,10]
      0604  [11,12]
      Time taken: 0.9 seconds, Fetched: 4 row(s)

      MAP<primitive_type, data_type>

      create table hive_map_test (id int, unit map<string, int>)
      ROW FORMAT DELIMITED
      FIELDS TERMINATED BY '\t'
      COLLECTION ITEMS TERMINATED BY ','
      MAP KEYS TERMINATED BY ':';
      -- 'MAP KEYS TERMINATED BY': delimiter between a key and its value

      [chen@centos01 ~]$ vi hive_map.txt
      0	Chinese:100,English:80,math:59
      1	Chinese:80,English:90
      2	Chinese:100,English:100,math:60

      load data local inpath '/home/chen/hive_map.txt' into table hive_map_test;

      hive (default)> select * from hive_map_test;
      OK
      hive_map_test.id  hive_map_test.unit
      0  {"Chinese":100,"English":80,"math":59}
      1  {"Chinese":80,"English":90}
      2  {"Chinese":100,"English":100,"math":60}
      Time taken: 0.204 seconds, Fetched: 3 row(s)

      hive (default)> select id, unit['math'] from hive_map_test;
      OK
      id  _c1
      0  59
      1  NULL
      2  60
      Time taken: 0.554 seconds, Fetched: 3 row(s)

      STRUCT<col_name : data_type [COMMENT col_comment], … >

      create table hive_struct_test(id int, info struct<name:string, age:int, height:float>)
      ROW FORMAT DELIMITED
      FIELDS TERMINATED BY ','
      COLLECTION ITEMS TERMINATED BY ':';
      
      [chen@centos01 ~]$ vi hive_struct.txt
      0,zhao:18:178
      1,qian:30:173
      2,sun:20:180
      3,li:23:183
      
      load data local inpath '/home/chen/hive_struct.txt' into table hive_struct_test;
      
      hive (default)> select * from hive_struct_test;
      OK
      hive_struct_test.id  hive_struct_test.info
      0  {"name":"zhao","age":18,"height":178.0}
      1  {"name":"qian","age":30,"height":173.0}
      2  {"name":"sun","age":20,"height":180.0}
      3  {"name":"li","age":23,"height":183.0}
      Time taken: 0.153 seconds, Fetched: 4 row(s)

      hive (default)> select id, info.name from hive_struct_test;
      0  zhao
      1  qian
      2  sun
      3  li
      Time taken: 0.133 seconds, Fetched: 4 row(s)
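      A short sketch of querying the complex types defined above (the tables are the ones created in this section):

      ```sql
      -- Flatten the array with a lateral view
      select name, stu_id
      from hive_array_test lateral view explode(stu_id_list) t as stu_id;

      -- Access map values by key and struct fields by name
      select id, unit['Chinese'] as chinese_score from hive_map_test;
      select id, info.name, info.age, info.height from hive_struct_test;
      ```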
  1. How do you import and export data in Hive? How is it loaded and where is it kept?

    load data [local] inpath 'path' [overwrite] into table table_name;

    insert overwrite [local] directory '/home/hadoop/data' select * from emp_p;

    LOCAL: with LOCAL the data is loaded from the local file system; without LOCAL it is loaded from HDFS.
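    A few other common ways to move data in and out of Hive, as a hedged sketch (the paths and table names are illustrative):

    ```sql
    insert into table emp_copy select * from emp_p;        -- table to table
    create table emp_backup as select * from emp_p;        -- CTAS
    export table emp_p to '/tmp/emp_p_export';             -- copy data and metadata to HDFS
    import table emp_p_new from '/tmp/emp_p_export';       -- recreate a table from the export
    ```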

  2. How is RDD created?

    1. parallelize(): parallelize a collection that exists on the driver program, not on the executors
    scala> var data = Array(1, 2, 3, 4, 5)
    data: Array[Int] = Array(1, 2, 3, 4, 5)
    
    scala> val rdd = sc.parallelize(data)
    rdd: org.apache.spark.rdd.RDD[Int] = ParallelCollectionRDD[1] at parallelize at <console>:26
    1. References or reads data sets from an external storage system, such as HDFS, HBase, or any subclass of Hadoop InputFormat
    scala> sc.textFile("student.log")
    res0: org.apache.spark.rdd.RDD[String] = student.log MapPartitionsRDD[1] at textFile at <console>:25
    1. From an existing RDD, the Transformation operator is called to generate a new child RDD
  3. How to handle driver memory overflow while running?
  • An application is running. How do you modify the offsets it consumes from without stopping it?
  • The compressed format of Hadoop

    bin/hadoop checknative  -a
    [chen@centos01 hadoop-2.6.0-cdh5.14.2]$ bin/hadoop checknative -a
    19/06/05 19:15:45 INFO bzip2.Bzip2Factory: Successfully loaded & initialized native-bzip2 library system-native
    19/06/05 19:15:45 INFO zlib.ZlibFactory: Successfully loaded & initialized native-zlib library
    Native library checking:
    hadoop:  true /opt/modules/hadoop-2.6.0-cdh5.14.2/lib/native/libhadoop.so.1.0.0
    zlib:    true /lib64/libz.so.1
    snappy:  true /opt/modules/hadoop-2.6.0-cdh5.14.2/lib/native/libsnappy.so.1
    lz4:     true revision:10301
    bzip2:   true /lib64/libbz2.so.1
    openssl: true /usr/lib64/libcrypto.so
  • SparkSQL needs to write the resulting DataFrame to a database. How do you do this?

    spark
        .read
        .table("mydb.emp")
        .write
        .mode(SaveMode.Ignore)
        .jdbc("jdbc:mysql://centos01:3306/mydb", "emp", prop)
  • Optimization of Shuffle in MapReduce?

    Shuffle process on the map side: ring buffer (spills when 80% full) -> spill to disk (partition, sort) -> combiner -> compression -> the reduce side pulls the data. The combiner can optionally be enabled for map-side pre-aggregation, and map output compression can optionally be enabled to reduce shuffle I/O. During partitioning, the default HashPartitioner sends identical keys to the same reducer, so an uneven key distribution leads to data skew; refer to the data skew optimization discussed earlier.
  • Secondary sorting (sorting on multiple columns) in Hive?

    ORDER BY: global ordering; all data ends up in a single reducer, so it is not recommended

    SORT BY: local ordering; data is ordered within each reducer

    distribute by

    DISTRIBUTE BY is usually used together with SORT BY: DISTRIBUTE BY sets the partitioning and SORT BY sets the ordering within each partition

    CLUSTER BY: used when DISTRIBUTE BY and SORT BY operate on the same columns

    When sorting by columns A, B, C, Hive sorts by A first; rows with equal A are ordered by B, and rows with equal B are ordered by C

    select * from score order by score.s_id asc, score.s_score desc;
    score.s_id  score.c_id  score.s_score
    01  03  99
    01  02  90
    01  01  80
    02  03  80
    02  01  70
    02  02  60
    03  03  80
    03  02  80
    03  01  80
    04  01  50
    04  02  30
    04  03  20
    05  02  87
    05  03  34
    06  01  31
    07      98
    07  02  89
    Time taken: 96.333 seconds, Fetched: 18 row(s)

    The time taken shows why ORDER BY is not recommended (see the DISTRIBUTE BY / SORT BY sketch below)
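    As a contrast to the ORDER BY example above, a sketch of the recommended DISTRIBUTE BY / SORT BY (or CLUSTER BY) form on the same score table; the reducer count is illustrative:

    ```sql
    set mapreduce.job.reduces=3;

    -- Rows with the same s_id go to the same reducer and are sorted within it
    select * from score
    distribute by s_id
    sort by s_id asc, s_score desc;

    -- When distributing and sorting on the same column in ascending order, CLUSTER BY is equivalent
    select * from score cluster by s_id;
    ```
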

  • How did you design the rowkey, and why?

    General policy: avoid data hotspots. Because HBase stores rows sorted by rowkey (prefix matching), a poorly designed rowkey can cause all data to land on one or a few RegionServers

    Rowkey:

    Must be unique

    It is not recommended to use random numbers as rowkeys. Rowkeys should be designed according to actual business requirements

    Should not be too long, otherwise storage space and index size become large

  • When the Spark Streaming window functions open a window, how long should the window be?

    To open the window function, you need to specify two parameters:

    - The window length: the duration of the window. - The sliding interval: the interval at which the window operation is performed.
    // Sliding window (the concrete durations are illustrative):
    val wc = res.reduceByKeyAndWindow(
      (a: Int, b: Int) => a + b,
      Seconds(15),  // window length: how much data each window covers
      Seconds(10))  // sliding interval: how often the windowed result is computed
  • How are the three parameters of the SparkStreaming window function designed?

    Generally the window length is greater than the sliding interval, and both must be integer multiples of the batch interval

    By increasing the window length and operating on the data of an entire window at once, Spark Streaming can effectively process the stream in larger batches