“This is the third day of my participation in the First Challenge 2022. For details: First Challenge 2022”

Hello, I’m looking at the mountains.

The originator of short chain service is TinyURL, which is the earliest website providing short chain service. At present, there are many short chain services in China: Sina (T.cn), Baidu (Dwz.cn), Tencent (URu.cn) and so on.

I have to ask, why a short chain? Another implication of this question is, do short-chain services need to exist?

To apply the balm answer: existence is reasonable.

The rationality of the existence of short chain service

Let’s start with the rationale for the existence of short-chain services.

The only advantage of a short chain is that it is short.

Early users of weibo knew that each post was limited to 140 characters and that if you wanted to share a link, you needed to reduce the description.

Similarly, if you want to include a link in a marketing text message, you need to consider the cost. If you’re on an earlier phone, you also need to consider the possibility that the user might receive three broken SMS messages, which severely affects the reach and click of SMS messages.

In this case, if the link is short enough, the rest of the content can be much richer. However, we might define links of different lengths depending on the business, and add parameters to normal links to meet other requirements (for example, statistical marketing data). So short links are created, by redirecting a jump, by replacing a link with a very short link, like a 20-character link like t.cn/A6ULvJho, You can redirect to the original link, which is 146 characters in length www.howardliu.cn/how-to-use-…

The above two examples demonstrate the value of short chains. Let’s summarize some additional uses of short chains:

  1. Send marketing messages, more money: the link is shorter, the length of the message is smaller, the need to pay SMS costs are reduced, such as the above short chain 20 characters, 146 characters of the original link, the difference is money ah.

  2. For example, the following two two-dimensional code pictures, the same size, because the number of content is different, the density of cells will be different

  3. Flexible configuration, because the short link to the original link has been redirected once, if there is a problem in the original link at some point, or you need to redirect to another place, you can change the destination address of the redirect. This is very beneficial for offline material release. For example, two-dimensional code materials have been put in, and when we find that we want to jump to other websites or activities, we only need to modify the target address of the short chain, instead of replacing all the materials already put in.

The principle of short chains

In fact, as mentioned earlier, the short chain is implemented by redirecting the server to the original link. Curl -i http://t.cn/A6ULvJho = curl -i http://t.cn/A6ULvJho = curl -i http://t.cn/A6ULvJho

HTTP/1.1 302 Found Date: Thu, 30 Jul 2020 13:59:13 GMT Content-Type: text/ HTML; charset=UTF-8 Content-Length: 328 Connection: keep-alive Set-Cookie: aliyungf_tc=AQAAAJuaDFpOdQYARlNadFi502DO2kaj; Path=/; HttpOnly Server: nginx Location: https://www.howardliu.cn/how-to-use-branch-efficiently-in-git/index.html??spm=5176.12825654.gzwmvexct.d118.e9392c4aP1UUd V&scm = 20140722.2007.2.1989<HTML>
<HEAD>
<TITLE>Moved Temporarily</TITLE>
</HEAD>
<BODY BGCOLOR="#FFFFFF" TEXT="# 000000">
<H1>Moved Temporarily</H1>
The document has moved <A HREF="Https://www.howardliu.cn/how-to-use-branch-efficiently-in-git/index.html??spm=5176.12825654.gzwmvexct.d118.e9392c4aP1UU Dv&scm = 20140722.2007.2.1989">here</A>.
</BODY>
</HTML>
Copy the code

As you can see from the above information, Sina has made a 302 jump and also returned HTML content for manual adjustment for compatibility. The whole interaction process is as follows:

Short chain generation

According to the statistics of the number of web pages, there are currently 5.8 billion web pages in the world. The int value in Java is at most 2^32 = 4294967296 < 4.3 billion < 5.8 billion, and the long value is 2^64 > 5.8 billion. So if you use numbers, int can barely support (after all, not all sites will call the short chain service to create a short chain), using long is safer, but will cause space waste, the specific use of which type depends on the business.

Sina Weibo uses an 8-bit string to represent the original link, which can be understood as the 62-digit representation of the number, 62^8 = 3521614606208 > 352.1 billion > 5.8 billion, that is, can solve the world’s currently known web addresses. Base 62 is a number consisting of 10 digits + (A-z)26 lower-case letters + (A-Z)26 upper-case letters.

Generation method 1: Hash

Taking the Hash value of the original link is a relatively simple idea. There are many existing algorithms that can be implemented, but there is an unavoidable problem: Hash collision, so it is important to choose an algorithm with low collision rate.

The Recommended Algorithm is the MurmurHash algorithm, which is an unencrypted hash function suitable for general hash retrieval operations. Redis, Memcached, Cassandra, HBase, and Lucene all use this algorithm.

With MurmurHash in Guava:

final String url = "Https://www.howardliu.cn/how-to-use-branch-efficiently-in-git/index.html?spm=5176.12825654.gzwmvexct.d118.e9392c4aP1UUd V&scm = 20140722.2007.2.1989";
final HashFunction hf = Hashing.murmur3_128();
final HashCode hashCode = hf.newHasher().putString(url, Charsets.UTF_8).hash();
final int hashCodeAsInt = hashCode.asInt();// Select return int or return long
System.out.println(hashCodeAsInt);// The output result is: 1810437348, converted to base 62:1Ywpso
Copy the code

The simplest way to think about collisions is to append a special string to the original URL if a collision occurs until the collision is avoided. The specific operation is shown below:

Generation mode 2: Unified number transmitter

This is to assign an ID to whatever comes in through a centralized unified sender. This ID is the content of the short chain. For example, the first one comes in is tinyurl.com/1, the second one is HTTP… The value is a character string in base 62.

  1. Redis autoincrement: Redis has good performance and can support 10W+ requests on a single device. If it is used as a number sender, Redis persistence and disaster recovery need to be considered.
  2. MySQL auto-increment primary key: this scheme is similar to Redis scheme. It uses the reminder of database auto-increment primary key to ensure that ids are not repeated and are automatically created continuously.
  3. Snowflake: Meituan’s Leaf is an update to the popular ID sequence generation algorithm. However, this algorithm depends on the server clock, and if there is a clock callback, there may be ID conflicts. (Some people would argue that the value of the sequence in milliseconds is the bottleneck of this algorithm. On the other hand, this algorithm only provides the idea that if the sequence length is not enough, you can add it yourself, but is there really so much service in millions per second?)
  4. Wait…

There will be a separate article about the unified transmitter later, and I will modify it and attach the link. Or you can follow me (wechat id: Mountain Hut) to get first-hand information.

Another problem that needs to be solved with the unified sender approach is: if the same original link is used, should it return the same short chain or a different short chain?

The answer is to return different short links from the same original link, based on the dimensions of user, location, etc. If the dimensions are all the same, the same short chain is returned. The advantage of this is that we can do statistics according to the short chain of click, request information. With short chains, we sacrifice some storage and computation, but the information collected is invaluable.

Storage short chain

Generally, this data is stored in no more than two kinds: relational database or NoSQL database. With the creation logic above, storage comes naturally. MySQL > create table ();

CREATE TABLE IF NOT EXISTS tiny_url
(
    sid                INT AUTO_INCREMENT PRIMARY KEY,
    create_time        DATETIME  DEFAULT CURRENT_TIMESTAMP NULL,
    update_time        TIMESTAMP DEFAULT CURRENT_TIMESTAMP NULL ON UPDATE CURRENT_TIMESTAMP,
    version            INT       DEFAULT 0                 NULL COMMENT 'Version number',
    tiny_url           VARCHAR(10)                         NULL COMMENT 'short chain',
    original_url       TEXT                                NOT NULL COMMENT 'Original link', # Additional information creator_IPINT       DEFAULT 0                 NOT NULL,
    creator_user_agent TEXT                                NOT NULLFor this data, as long as the necessary fields that affect the creation of the short chain are stored, the rest can be sent directly to the data service instance_idINT       DEFAULT 0                 NOT NULLID state TINYINTDEFAULT 1                 NULL COMMENT '-1 invalid 1 valid '
);
Copy the code

Again, storage needs to consider the magnitude of data, planning in advance whether the need to separate tables and libraries.

Short chain request

Once the storage is complete, it’s time to use it.

The usual practice is to find the data from the store, based on the short string requested, and then return HTTP to redirect to the original address. If the storage uses a relational database, indexes are typically created for short-linked fields, and caching is paved in front of the database to avoid the database becoming a bottleneck. In order to improve the reasonable use of cache, LRU algorithm is generally used to eliminate the non-hot short link data. The process is as follows:

The Bloom filter in the figure is used to prevent cache breakdown, causing excessive server pressure.

Here is another question: is the HTTP return encoding 301 or 302? Why does Sina Weibo return 302 instead of the more semantically correct 301 redirect? (For those of you who are not familiar with HTTP status codes, you can get more information from HTTP Status Code Summary.)

  • 301 stands for permanent redirect. In other words, after the first request, the browser obtains the redirect address directly from the browser cache, and does not request the short link service. This can effectively reduce the number of service requests and reduce the load on the server, but because the browser does not send requests to the back end, it does not get the true number of hits.
  • 302 stands for temporary redirect. That is, every time the browser requests a new address, it puts pressure on the server, but in today’s hardware glut, that pressure is nothing compared to data. Therefore, 302 redirection is the first choice for short chain services.

conclusion

In fact, the short chain service is relatively simple, without too much business logic, mainly to investigate the understanding of distributed system commonly used design, is also often used in the interview process of a question. This is just to give you some design ideas, but I don’t go too far into the distributed ID, Bloom filter, MurmurHash, etc., because each one is not easy to understand in a few words.

Recommended reading

  • What are microservices?
  • Microservices programming paradigm
  • Infrastructure for microservices
  • Feasible solutions for service registration and discovery in microservices
  • From singleton architecture to microservice architecture
  • What other options do we have besides microservices?
  • How to effectively use Git to manage code in microservices teams?
  • Summary of data consistency in microservice systems
  • Implementing DevOps in three steps
  • System Design Series how to Design a Short Chain Service
  • System design series of task queues
  • Software Architecture – Caching technology
  • Software Architecture – Event-driven architecture