This is the fourth day of my participation in the August Text Challenge.More challenges in August

To review, learn some bottom-up logic for DB

  • The Disk Manager cares about physical storage
  • The Buffer pool Manager cares about memory storage
  • Access Methods care about data lookup
  • What does operator execution care about
  • How does Query Planning care about designing the execution sequence

1. hash table

  • How to choose FUNC: Speed versus collision rate;
  • How to design scheme: How to handle conflicting keys;

Static hash table

Knowing the number of elements in advance, each key is independent and the hash function can uniquely map the key to the value without causing conflicts.

Linear probe hashing

Linear probe resolves conflicts by searching for the next free slot when conflicts occur.

To correctly locate the target element, record both the key and the value in the stored slot.

For non-unique keys, there are two solutions:

  • Each key is associated with a list;
  • Store keys and values in slots.

Robin hood hashing

Try to balance the number of backward linear probes for keys that are out of alignment due to conflict.

Use the following example to illustrate. The following elements are written to table in ABCDE order: hash(A) = 2; hash(A) = 2; hash(B) = 0; hash(C) = 2; hash(D) = 3; Hash (E) = 2;

First of all, A is where it is now, it hasn’t moved, so I’ll call it 0; C is also hash to the position of A, resulting in A conflict, moving back 1 bit, called 1; D is hashed to where C is now, moved 1 bit back, so it’s also called 1; E is hashed to A at index = 2, where 0 = 0 and A remains the same. Index = 3 (index = 3, index = 3, index = 3, index = 3, index = 3) If E goes further, the distance of E becomes 2, but the distance of D in the next grid is 1, D[1] < E[2], so D is pushed back and E remains in the grid with index = 4, which becomes the situation in the second graph:

Cuckoo hashing

Cuckoo hash; Mapping multiple hash functions to multiple hash tables at the same time; Write every time a target location in any table is found available; If the target location of each table is occupied, select one of them and hash again to provide the grid.

Dynamic hash table

The above are static hash tables, which require the DBMS to know the number of elements. For example, the way of linear probe, no matter how many slots each element moves, the total number of data to be inserted into the table is the same as the size of the table. Dynamic hash tables can be resized on demand. Three methods are introduced:

chained hashing

Buckets are used for storage, and each hash key corresponds to a linked list. Equivalent to the zipper method for storage;

extendible hashing

Extend the granularity of the hash function mapping to refine what is stored in the bucket. When the coarse-grained hash function maps to a target bucket that is full, the hash function expands once:

Initially, global=2, means selects the target bucket by mapping the leftmost 2 bits of the binary key. Notice that bucket number 2 is already full, so if a conflict occurs when another 10 is inserted, expand global to 3, as shown in Figure 2.

Linear hashing

When a bucket overflows, add a bucket, chain increment, and then generate a new hash function hash2 that evaluates all buckets before the split pointer using hash2 once.

2. To summarize

This section introduces several ideas for designing hash tables. There are many kinds of data structures used in a DBMS. Different data structures, used for different functions.

  • Internal metadata: Maintains database information and system state. Metadata is usually stored with the actual data and is in the form of a data table.
  • Core data storage: The actual data in the database. Storage described in Section 3 and 4 is used to store data.
  • Temporary data structures: When processing some Queries, it may be necessary to create temporary data structures to serve as intermediate tables, etc., to speed up and optimize execution. The Hash table mentioned in this section can be used to generate intermediate tables from join results.
  • Table index: Used for the retrieval of data in data tables. The tree index section mentioned in sections 7 and 8 is used to build the index.