The harder you work, the more lucky you are. This article has been collected in GitHub JavaCommunity, which has an interview to share, source code analysis series articles. Welcome to collect, like github.com/Ccww-lx/Jav…

In actual development, Redis will be used frequently, so how should we choose the right data type during the use process? Which data types apply to which scenarios. It is also common to be asked questions about Redis data structure in interviews:

  • Why is Redis fast?
  • Why are queries slowing down?
  • Redis Hash Rehash process
  • Why use a hash table as an index of Redis?

When we analyze and understand Redis data structure, we can make the correct choice of data type when we use Redis and improve system performance.

RedisUnderlying data structure

Redis is a memory key-value key-value database, and key-value pair data is stored in memory, so Redis based on memory data operation, its efficiency is high, fast;

Redis supports String, List, Hash, Set, Sorted Set, BitMap, and other value types. Redis can be widely applied to many business scenarios based on its diverse types of values.

The data type of Redis Value is based on the redisObject system, which is customized for Redis.

typedef struct redisObject{
    / / type
    unsigned type:4;
    / / code
    unsigned encoding:4;
    // A pointer to the underlying implementation data structure
    void*ptr; ... . }Copy the code

In addition to the actual data, a redisObject requires additional memory space to record metadata information such as data length and space usage. The redisObject contains 8-byte metadata and an 8-byte pointer that points to the location of the actual data of a specific data type:

The pointer points to the location where data is stored by Redis’s underlying data structure: SDS, bidirectional linked list, hop list, hash table, compressed list and integer set.

So how does Redis implement the underlying data structure?

Redis underlying data structure implementation

Let’s take a look at Redis’ simpler **SDS, bidirectional linked lists, integer sets **.

SDS, bidirectional linked lists, and integer sets

SDS, which uses len to record the number of bytes used, reduces the complexity of retrieving string length to O(1), and SDS is lazy to free space. You free space, and the system records the data for use next time it wants to use it. Don’t apply for new space.

Integer set, allocate a piece of contiguous address space in memory, data elements will be stored next to each other, no need for additional Pointers to bring space overhead, it is characterized by compact memory saving memory space, query complexity O(1) high efficiency, other operations complexity O(N);

Bidirectionally linked lists, which in memory can be non-contiguous, non-sequential Spaces, concatenate the order between elements with an extra pointer overhead.

It is characterized by high efficiency and O(N) query complexity as well as O(1) section insert/update data complexity.

HashHash table

A hash table, in fact, is similar to an array, each element of the array is called a hash bucket, each hash bucket holds the key-value pair data, and the elements in the hash bucket usedictEntryThe structure,

Therefore, the hash bucket element does not store the key-value pair itself, but rather the pointer to the specific Value. ** Therefore, the storage of each key-value pair costs at least 24 bytes. ** Especially for String key-value pairs, each key-value pair costs 24 bytes of space. When saving data is small and the extra overhead is larger than the data, then to save space, consider changing the data structure.

Let’s have a look at the global hash table:

Although hash table operations are fast, there is a potential risk as Redis data gets larger: hash table conflicts and rehash overhead, which explains why hash table operations are slow?

When more data is written into the hash table, hash conflicts are inevitable. Redis solves hash conflicts by using chained hash. Multiple elements in the same hash bucket are stored in a linked list, and they are connected by Pointers in turn, as shown in the figure:

When there are more and more hash conflicts, it will lead to the excessively long chain of some hash conflicts, which will lead to the time-consuming and inefficient element search on the chain.

To solve the problem of long hashing chains, rehash the existing hash bucket to spread out the number of elements in a bucket. So how does the rehash process work?

Rehash

To make the rehash operation more efficient, two global hash tables are used: hash table 1 and hash table 2, as follows:

  • Hash table 2 into a larger space,
  • Remap and copy the data from hash table 1 into hash table 2;
  • Free space for hash table 1

However, due to the large amount of data in table 1 and Table 2 during remapping and replication, if all the data in table 1 are migrated at one time, the Redis thread will be blocked and unable to serve other requests.

To avoid this problem and ensure that Redis can properly process client requests, Redis uses progressive Rehash.

When processing a request, copy all entries in index positions from hash table 1 to hash table 2 in turn, allocating the cost of a large number of copies to multiple processing requests, avoiding time-consuming operations and ensuring fast data access.

After understanding Hash tables, take a look at unusual compressed lists and hop tables.

Compressed lists and hops

List of compressionOn the basis of array, there are three fields zlBytes, ZLTail and zllen in the header of the compressed list, representing the length of the list, the offset at the end of the list and the number of entries in the list respectively. The compressed list also has a Zlend at the end of the table, indicating the end of the list.

Advantages: Compact memory saves memory space. In memory, a piece of contiguous address space is allocated, and data elements are stored next to each other. There is no need for extra Pointers to bring space overhead. The first and last elements can be located directly by the length of the first three fields in the table. The complexity is O(1).

Skip table, on the basis of linked list, added multi-level index, through several jumps of index position, to achieve fast location of data, as shown in the following figure:

For example, query 33

Features: When the data volume is large, the search complexity of the hop table is O(logN).

To sum up, we can know the time complexity of the underlying data structure:

Data structure type Time complexity
Hash table O(1)
An integer array O(N)
Two-way linked list O(N)
List of compression O(N)
Jump table O(logN)

The Redis custom object system type is the data type of Redis Value. The data type of Redis is based on the underlying data structure. What are the data types?

Redis data type

String, List, Hash, Sorted Set, and Set are common types. Their mappings to underlying data structures are as follows:

The data type The data structure
String SDS(Simple Dynamic String)
List Two-way linked list

List of compression
Hash List of compression

Hash table
Sorted Set List of compression

Jump table
Set Hash table

An integer array

The corresponding characteristics of data types are similar to the underlying data structures implemented by them, and the properties are the same

String, based on SDS implementation, suitable for simple key-value storage, setnx key value implementation distributed lock, counter (atomicity), distributed global unique ID.

List, sorted by the order in which the elements enter the List, follows the FIFO rule and is commonly used in sorting statistics and simple message queues.

Hash is a mapping between the string key and the string value. It is very suitable for representing an object. The complexity of adding and deleting operations is O(1).

A Set is an unordered Set of String elements whose members are unique, which means that no duplicate data can occur in the Set. Hash table based implementation, so add, delete, search complexity is O(1).

Sorted Set is an upgrade of the Set type. The difference is that each element is associated with a score of type double.

So let’s look at these data types, Redis Geo, HyperLogLog, BitMap?

Redis Geo considers the earth as approximately a sphere and converts two-dimensional latitude and longitude into strings based on GeoHash to achieve position division and query of specified distances. Features are commonly used in location-dependent applications.

HyperLogLog is a probabilistic data structure that uses probabilistic algorithms to calculate approximate cardinality of sets with an error rate of about 0.81%. When the set has a large number of elements, the space it needs to calculate the cardinality is always fixed and small enough to be used for UV statistics.

BitMap, with a bit to map the state of an element, only 0 and 1 two states, very typical binary state, and its own is a statistical binary state data type with String type as the underlying data structure, advantages of saving a lot of memory space, but used in binary statistics scenarios.

With this knowledge in mind, what are the strategies for selecting Redis data types for application scenarios?

Choose the right oneRedisData type policy

In practical development applications, Redis can be applied to many business scenarios, but how do we choose data type storage?

Based primarily on time/space complexity, the following points can be considered in practical development:

  • Data volume, the size of the data itself
  • Collection type statistical mode
  • Supports single point query/range query
  • Special Application Scenario

Data volume, the size of the data itself

When the amount of data is large and the data itself is small, using **String** will use extra space greatly increased, because using hash tables to save key-value pairs and using dictEntry structure will cause the overhead of saving three Pointers to dictEntry for each key-value pair. As a result, the data itself is less than the overhead of the extra space, and ultimately the data size of the storage space is much larger than the original data storage size.

List, Hash, and Sorted sets can be implemented using integer arrays and compressed lists, because both allocate a contiguous space in memory, and then place the elements of the collection one by one. It is very compact, and there is no need to concatenate the elements through extra Pointers. This avoids the space overhead of extra Pointers. In addition, when using the set type, a key corresponds to a set of data, which can save a lot of data, but also only a dictEntry, so as to save memory.

Collection type statistical mode

The common Redis collection type statistical patterns are:

  • Aggregate statistics (intersection, difference, and union statistics) : This parameter can be selected for aggregate calculation of multiple setsSet;
  • Collation statistics (requires collection types to preserve order for elements) :RedisIn theListSorted SetIs an ordered set,ListEnter by elementListIn the order of,Sorted SetYou can sort by the weight of the elements;
  • Binary state statistics (set elements have only 0 and 1 values) :BitmapItself is to useStringBitmap uses BITCOUNT to count the number of 1 after BITOP operations by bit and, or, xOR.
  • Cardinality statistics (counting the number of non-repeating elements in a set) :HyperLogLogIt is a kind of data set type used for statistical cardinality. The statistical result has certain error, the standard error rate is 0.81%. For precise statistics, use Set or Hash.

The Set type is applicable to the aggregation operation of users/friends/followers/fans/interested people, for example

  • Count the number of new users of mobile APP every day
  • Mutual friends of two users

In Redis, List and Sorted Set are ordered sets, which are used to sort collection elements, such as

  • List of recent comments
  • list

Bitmap binary status statistics, applicable to statistics that have a large amount of data and can be represented by binary status. For example:

  • Check in, the number of users check in that day
  • Weekly User Activity
  • User Online Status

HyperLogLog is a type of data set used to count cardinality, the number of non-repeating elements in a set, for example

  • Count the UV of a web page. Multiple visits by a user in a day only count as one

Supports single point query/range query

In Redis, List and Sorted Set are ordered collections that support range query, but Hash does not support range query

Special Application Scenario

Message queue, using Redis as the realization of message queue, to the basic requirements of message sequence preservation, processing repeated messages and ensure message reliability, the scheme is as follows:

  • List-based message queue solution
  • A Stream-based message queue solution
Based on the List Based on the Strems
Message order preserving useLPUSH/RPOP useXADD/XREAD
Block read useBRPOP useXREAD block
Duplicate message processing The producer implements its own globally unique ID Streams automatically generates globally unique ids
Message reliability useBRPOPLPUSH usePENDINGList Automatically saves messages
Applicable scenario Small amount of messages The total number of messages is large and data needs to be read in the form of consumer groups

Location-based LBS service is implemented with Redis specific GEO data type. GEO can record the geographical location information in the form of latitude and longitude, which is widely used in LBS service. For example: how taxi hailing apps offer location-based services.

conclusion

The reason why Redis is so fast is that its memory-based data operation and the use of Hash table as index are efficient and fast. Moreover, due to the diversity of its underlying data, Redis can be applied to many scenarios. Selecting the right data type in different scenarios can improve its query performance.

Finally, wechat search “Ccww Technology Blog” to watch more articles, but also welcome to pay attention to a wave.