Data pre-partitioning

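A newly created table has a single region by default, so all writes initially land on one region server. Under a heavy write load that server is overwhelmed and write performance suffers. Pre-splitting the table when it is created spreads its regions, and therefore its writes, across region servers from the start: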

create 'test', {NAME => 'cf', COMPRESSION => 'SNAPPY'}, SPLITS => ['10','20','30']
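The split points in the shell command above can also be generated rather than typed by hand. The following is a minimal sketch, assuming rowkeys start with a two-digit numeric prefix; the class name `SplitPoints` and the method `splitPoints` are hypothetical helpers, not part of HBase.

```java
import java.util.ArrayList;
import java.util.List;

class SplitPoints {
    // Returns numRegions - 1 boundaries, spaced `step` apart and
    // zero-padded to two digits so they sort as strings.
    static List<String> splitPoints(int numRegions, int step) {
        List<String> splits = new ArrayList<>();
        for (int i = 1; i < numRegions; i++) {
            splits.add(String.format("%02d", i * step));
        }
        return splits;
    }

    public static void main(String[] args) {
        // Four regions with a step of 10 reproduce the SPLITS list above.
        System.out.println(splitPoints(4, 10)); // prints [10, 20, 30]
    }
}
```

Three split points produce four regions: keys below '10', '10' up to '20', '20' up to '30', and '30' and above.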

Rowkey design

  1. Know your query conditions

    HBase is essentially a key-value store: data is looked up by rowkey. The fields you query on should therefore be encoded into the rowkey.

  2. Hotspotting

    There are three main ways to avoid hot spots: reversing, salting, and hashing.

    All three change the order in which the raw data is stored, so choose the one that suits the queries in your scenario.

  • Reverse: Reverse rowkeys that have a fixed length or a numeric format. This covers both reversing ordinary values and reversing timestamps; the latter is more common. Disadvantage: it works well for Get but poorly for Scan, because the natural ordering of the original rowkeys is lost. For example, to store a user's operation records sorted newest first, design the rowkey as [reversed userId][Long.MAX_VALUE - timestamp]. To query all operation records of a user, scan by the reversed userId prefix: startRow is [reversed userId] and stopRow is the first key after that prefix. To query the records of a certain period, remember that newer records sort first: startRow is [reversed userId][Long.MAX_VALUE - end time] and stopRow is [reversed userId][Long.MAX_VALUE - start time].
  • Salt: Prefix the rowkey with a random number of fixed length, so that consecutive rowkeys start differently and land in different regions.
  • Hash: Hash all or part of the rowkey, then replace all or part of the original rowkey prefix with the hashed value. It’s like salting, except that the prefix is predictable instead of random, so a reader can recompute it from the original key.
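The three techniques can be sketched with plain string manipulation. This is a minimal illustration using only the Java standard library. The class `RowKeys`, its method names, the two-digit salt, and the 4-character MD5 prefix are all assumptions made for the example, and a real table would normally use byte arrays rather than strings.

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.concurrent.ThreadLocalRandom;

class RowKeys {
    // Reverse: [reversed userId][Long.MAX_VALUE - timestamp].
    // The timestamp part is zero-padded to 19 digits so keys sort newest first.
    static String reversedKey(String userId, long timestampMillis) {
        String reversedId = new StringBuilder(userId).reverse().toString();
        return reversedId + String.format("%019d", Long.MAX_VALUE - timestampMillis);
    }

    // Scan bounds for one user and a time window. Newer records sort first,
    // so the end time yields startRow and the start time yields stopRow.
    // If stopRow is exclusive, a record at exactly startMillis is left out.
    static String[] timeRange(String userId, long startMillis, long endMillis) {
        return new String[] {
            reversedKey(userId, endMillis), reversedKey(userId, startMillis)
        };
    }

    // Salt: a random two-digit bucket prefix (assumes buckets <= 100).
    // Reads must then check every bucket.
    static String saltedKey(String rowKey, int buckets) {
        int bucket = ThreadLocalRandom.current().nextInt(buckets);
        return String.format("%02d", bucket) + rowKey;
    }

    // Hash: a deterministic prefix taken from the first two MD5 bytes.
    static String hashedKey(String rowKey) {
        try {
            byte[] d = MessageDigest.getInstance("MD5")
                    .digest(rowKey.getBytes(StandardCharsets.UTF_8));
            return String.format("%02x%02x", d[0] & 0xff, d[1] & 0xff) + rowKey;
        } catch (NoSuchAlgorithmException e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(reversedKey("12345", 1000L)); // prints 543219223372036854774807
        System.out.println(saltedKey("user1", 10));
        System.out.println(hashedKey("user1"));
    }
}
```

Because the hashed prefix depends only on the key, a Get can rebuild the full rowkey directly, whereas a salted key has to be looked up under every possible prefix.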

Code optimization

  1. Batch read/write

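    • For reads, pass a List of Get objects to a single Table.get(List<Get>) call instead of issuing one call per row; this reduces the number of RPC calls. Writes can be batched the same way with Table.put(List<Put>).

    • For a Scan, increase the scanner caching, which is the number of rows fetched per RPC (the default is 100 in some HBase versions and differs in others), so that each RPC retrieves more data: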

    scan.setCaching(500)

  2. Fetch only the columns you need

    When rows have a large number of columns, request only the ones you need, for example with Scan.addColumn(family, qualifier) or a QualifierFilter, to reduce the amount of data returned to the client.
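The effect of batching can be estimated with simple arithmetic. The sketch below uses only the Java standard library and does not call HBase. `BatchSketch`, `partition`, and `scanRoundTrips` are hypothetical names, and the round-trip figure is an approximation that ignores size-based limits on scan responses.

```java
import java.util.ArrayList;
import java.util.List;

class BatchSketch {
    // Splits row keys into chunks; each chunk would become one List<Get>,
    // and therefore one batched call instead of one call per row.
    static <T> List<List<T>> partition(List<T> items, int batchSize) {
        List<List<T>> batches = new ArrayList<>();
        for (int i = 0; i < items.size(); i += batchSize) {
            int end = Math.min(i + batchSize, items.size());
            batches.add(new ArrayList<>(items.subList(i, end)));
        }
        return batches;
    }

    // Approximate number of RPCs needed to scan `rows` rows when each RPC
    // returns at most `caching` rows (ceiling division).
    static long scanRoundTrips(long rows, int caching) {
        return (rows + caching - 1) / caching;
    }

    public static void main(String[] args) {
        System.out.println(scanRoundTrips(10_000, 100)); // prints 100
        System.out.println(scanRoundTrips(10_000, 500)); // prints 20
        System.out.println(partition(List.of("a", "b", "c", "d", "e"), 2)); // prints [[a, b], [c, d], [e]]
    }
}
```

Under these assumptions, raising the caching from 100 to 500 cuts the round trips for a 10,000-row scan from 100 to 20, at the cost of more memory per response on both client and server.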

Disk usage

  1. Keep rowkeys, column family names, and column names as short as possible. Every cell stores all three, so shorter names save a significant amount of disk space when data volumes are large.
  2. Enable Snappy compression when creating the table, as in the create statement under "Data pre-partitioning".
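To see why name lengths matter, here is a rough per-cell size estimate. It is a sketch under stated assumptions: it assumes the classic KeyValue layout (4-byte key length, 4-byte value length, 2-byte row length, 1-byte family length, 8-byte timestamp, 1-byte type) with no tags, and measures size before compression or data block encoding. `CellSize` and `cellBytes` are hypothetical names.

```java
class CellSize {
    // Fixed bytes per cell under the assumed layout:
    // key length + value length + row length + family length + timestamp + type.
    static final int FIXED_OVERHEAD = 4 + 4 + 2 + 1 + 8 + 1;

    // Uncompressed size of one cell; the row, family, and qualifier bytes
    // are repeated in every cell.
    static long cellBytes(int rowLen, int familyLen, int qualifierLen, int valueLen) {
        return FIXED_OVERHEAD + rowLen + familyLen + qualifierLen + valueLen;
    }

    public static void main(String[] args) {
        // Same 8-byte value, long names versus short names.
        System.out.println(cellBytes(40, 12, 20, 8)); // prints 100
        System.out.println(cellBytes(16, 2, 4, 8));   // prints 50
    }
}
```

In this example, shortening the names halves the uncompressed footprint of each cell even though the stored value is unchanged. Compression narrows the gap on disk but does not remove it.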