
What is a short link

A short link is, as the name suggests, a short link (yes, I'm stating the obvious).

mp.weixin.qq.com/s?biz=MzU5M…

That's long (and yes, it's a link to my WeChat official account, feel free to take a look). If we need to send a link to others in an article or a promotion, something that long looks terrible. Short links exist precisely for this: a short URL stands in for the long one, and when a user visits the short link, they are redirected to the original link. Like this:

sourl.cn/CsTkky

If you pay attention, you'll notice that the links in all kinds of business text messages are really short:

This extremely short URL is a short link.

Why do we need URL short links

URL shortening creates shorter aliases for long URLs; we call these shortened aliases "short links". When a user clicks a short link, they are redirected to the original URL. Short links save a lot of space when they are displayed, printed, or sent in messages.

For example, if we shorten the following URL with Sourl:

juejin.cn/user/211951…

We can get a short link:

sourl.cn/R99fbj

The shortened URL is almost one-third the size of the actual URL.

URL shortening is often used to make links fit on different devices, track individual links to analyze an audience, measure the performance of advertising campaigns, or hide the original URL.

If you haven't used Sourl before, try creating a new short link and spend some time browsing the options the service offers. It will help you follow this article better.

System requirements and objectives

When building a feature or a system, it is good practice to first pin down the system's requirements and goals. This gives you a clearer picture of the design and development process.

Our short link system should meet the following requirements:

Functional requirements:

  • Given a URL, our service should generate a short and unique alias for it, called a short link, which should be short enough to be easy to copy and paste into the application;
  • When a user accesses a short link, our service should redirect them to the original link;
  • Users should optionally be able to choose a custom short link for their URL;
  • Links can expire after a specified time span, and users should be able to specify an expiration time.

Non-functional requirements:

  • The system must be highly available. If our service shuts down, all URL redirects will start to fail.
  • URL redirection should happen with minimal latency;
  • Short links should be unguessable.

Extension requirements:

  • Support analysis and statistics, such as the number of visits to short links;
  • Other services should also be able to access ours through a REST API.

Capacity requirements and limitations

There will be a lot of traffic to our system. There are read requests for short links and write requests to create short links. Assume the read/write ratio is 100:1.

Estimated traffic volume:

Assuming we have 500 million new short links per month and a 100:1 read/write ratio, we can expect 50 billion redirects over the same period:

100 x 500 million => 50 billion

What is the QPS (queries per second) of our system? New short links per second:

500 million/(30 days * 24 hours * 3600 seconds) ≈ 200 URLs/s

Considering the 100:1 read/write ratio, URL redirections per second would be:

100 * 200 URLs/s = 20,000 URLs/s

Storage estimation:

Suppose we store each URL shortening request (and the associated short link) for five years. Since we expect 500 million new URLs per month, the total number of stored objects will be 30 billion:

500 million * 5 years * 12 months = 30 billion

Assume each stored object is about 500 bytes (a rough estimate). We will need 15 TB of total storage:

30 billion * 500 bytes ≈ 15 TB

Bandwidth estimation:

For write requests, since we expect 200 new short links to be created per second, the total incoming data for our service is 100 KB per second:

200 * 500 bytes ≈ 100 KB/s

For read requests, we expect about 20,000 URL redirects per second, so the total outgoing data for our service will be 10MB per second:

20,000 * 500 bytes ≈ 10 MB/s

Memory estimation:

How much memory do we need to cache the hot URLs in order to speed up access? If we follow the 80/20 rule (20% of URLs generate 80% of the traffic), we want to cache that top 20% of URLs.

Since we have 20,000 requests per second, we receive 1.7 billion requests per day:

20000 * 24 * 3600 ≈ 1.7 billion

To cache 20% of these requests, we need 170 GB of memory:

1.7 billion * 0.2 * 500 bytes ≈ 170 GB

Note that since there will be many repeated requests for the same URL, actual memory usage may well stay below 170 GB.

Overall, assuming 500 million new URLs per month and a 100:1 read/write ratio, our estimates look like this:

Type                           Estimate
New short links                200/s
Short link redirections        20,000/s
Incoming data                  100 KB/s
Outgoing data                  10 MB/s
Storage capacity for 5 years   15 TB
Cache memory (20% hot URLs)    170 GB
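These back-of-the-envelope numbers are easy to reproduce. Here is a quick Python sketch (Python is my choice for examples throughout; the inputs are the assumptions stated above):

```python
# Back-of-the-envelope estimates, using the assumptions from this section.
NEW_URLS_PER_MONTH = 500_000_000   # assumption from the text
READ_WRITE_RATIO = 100             # 100 reads per write
OBJECT_SIZE_BYTES = 500            # rough size of one stored record
SECONDS_PER_MONTH = 30 * 24 * 3600

writes_per_sec = NEW_URLS_PER_MONTH / SECONDS_PER_MONTH             # ~200 URLs/s
reads_per_sec = writes_per_sec * READ_WRITE_RATIO                   # ~20,000 redirects/s
storage_bytes_5y = NEW_URLS_PER_MONTH * 12 * 5 * OBJECT_SIZE_BYTES  # ~15 TB
incoming_bps = writes_per_sec * OBJECT_SIZE_BYTES                   # ~100 KB/s
outgoing_bps = reads_per_sec * OBJECT_SIZE_BYTES                    # ~10 MB/s
requests_per_day = reads_per_sec * 24 * 3600                        # ~1.7 billion
cache_bytes = requests_per_day * 0.2 * OBJECT_SIZE_BYTES            # ~170 GB

print(f"writes/s ≈ {writes_per_sec:.0f}, reads/s ≈ {reads_per_sec:.0f}")
print(f"storage (5y) ≈ {storage_bytes_5y / 1e12:.1f} TB, cache ≈ {cache_bytes / 1e9:.0f} GB")
```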

System API design

Once the requirements are finalized, we can define the system's API, which pins down exactly what services our system provides.

We can use REST APIs to expose the functionality of our service. Here are the definitions of the APIs for creating and deleting short links:

Create a short link interface

String createURL(api_dev_key, original_url, custom_alias=None, user_name=None, expire_date=None)

Parameter list:

api_dev_key: the developer key assigned to a registered user; it can be used to throttle the number of short links a user may create;

original_url: the original URL for which a short link is to be generated;

custom_alias: an optional custom name for the short link;

user_name: an optional user name that can be used in the encoding;

expire_date: an optional expiration time for the short link.

Return value:

On success, returns the short link URL; otherwise, returns an error code.

Delete the short link interface

String deleteURL(api_dev_key, url_key)

Where url_key is the key of the short link to be deleted; a successful deletion returns "Delete Success".

How to detect and prevent abuse of short links?

In the current design, a malicious user could use up all the URL keys. To prevent abuse, we can throttle users by their api_dev_key: each api_dev_key is limited to a certain number of short link creations and redirections per time period (different limits can be set for different developer keys).
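As a sketch of how that throttling might look, here is a fixed-window counter per api_dev_key; the limit and window are hypothetical values, and a production system would keep the counters in a shared store such as Redis rather than in local memory:

```python
import time
from collections import defaultdict

class RateLimiter:
    """Fixed-window counter per api_dev_key (a minimal sketch; the limits
    here are made-up values, not ones prescribed by the design)."""

    def __init__(self, max_requests=100, window_seconds=3600):
        self.max_requests = max_requests
        self.window = window_seconds
        self.counters = defaultdict(lambda: [0, 0.0])  # key -> [count, window_start]

    def allow(self, api_dev_key: str) -> bool:
        count, start = self.counters[api_dev_key]
        now = time.time()
        if now - start >= self.window:           # new window: reset the counter
            self.counters[api_dev_key] = [1, now]
            return True
        if count < self.max_requests:            # still within quota
            self.counters[api_dev_key][0] += 1
            return True
        return False                             # quota exhausted: reject
```

createURL would check limiter.allow(api_dev_key) before issuing a new short link.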

Data model design

Completing the design of the data model prior to development will help you understand the flow of data between the various components.

The data in our short link service system has the following characteristics:

  • Billions of records to store;
  • Each object stored is small (less than 1K);
  • There is no relationship between the records other than to store which user created the URL;
  • Our service will have a large number of read requests.

We need two tables: one for short link data and one for user data.

What database should be used?

Because we expect to store billions of rows and don't need relationships between objects, a NoSQL store such as MongoDB or Cassandra is the better choice. NoSQL is also easier to scale.
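The two tables could look something like this; the field names are illustrative rather than a fixed schema:

```python
from dataclasses import dataclass
from datetime import datetime

@dataclass
class ShortLink:
    """One record per short link."""
    url_key: str            # the 6-character short key (primary key)
    original_url: str       # the long URL to redirect to
    creation_date: datetime
    expire_date: datetime
    user_id: str            # who created it: the only cross-record relation

@dataclass
class User:
    """One record per registered user."""
    user_id: str
    name: str
    email: str
    creation_date: datetime
    last_login: datetime
```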

Basic system design and algorithm

The problem is how to generate a short and unique key for a given URL. There are two main solutions:

  • Encode the original URL
  • Generate keys offline in advance

Encode the original URL

We can compute a hash of the given URL (with MD5 or SHA-256, for example) and then encode the hash for display. The encoding could be Base36 ([a-z, 0-9]) or Base62 ([A-Z, a-z, 0-9]); if we add '+' and '/', we can use Base64. One question to consider: how long should a short link be? 6, 8, or 10 characters?

With Base64 encoding, a 6-character key yields 64^6 ≈ 68.7 billion possible strings; an 8-character key yields 64^8 ≈ 281 trillion possible strings.

Based on our earlier estimates, 68.7 billion keys are more than enough for us, so six characters will do.

If we use MD5 as our hash function, it produces a 128-bit hash value. After Base64 encoding we get a string of more than 21 characters (since each Base64 character encodes 6 bits of the hash).

Since we only keep 6 (or 8) characters per short link, how do we choose the key?

We could take the first six (or eight) characters as the key, but that would produce duplicate keys; to mitigate this, we can pick some other characters out of the encoded string, or swap some characters around.

Our solution has the following problems:

  • If multiple users enter the same URL, they can get the same short link, which should not be allowed;
  • What if parts of the URL are themselves URL-encoded? For example, https://juejin.cn/user/2119514147536087 and https://juejin.cn/user/2119514147536087%3Fid%3D are identical apart from the URL encoding.

Solutions:

We can append an increasing sequence number to each input URL to make it unique, and then generate the hash. We would not need to store this sequence number in the database. One possible problem is that an ever-increasing sequence number can overflow; appending it also affects the performance of the service.

Another solution could be to append the user ID to the input URL. However, if the user is not logged in, we would have to ask them to choose a unique key, and even then conflicts can occur, so we would have to keep generating until we get a unique key.
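Putting the pieces together, here is a minimal sketch of this encoding approach: hash the URL with MD5, Base64-encode the digest, keep the first six characters, and on collision perturb the input with a sequence number and retry. The in-memory `taken` set stands in for a real database uniqueness check.

```python
import base64
import hashlib

def shorten(original_url: str, taken: set) -> str:
    """Hash the URL, Base64-encode it, take 6 characters; on collision,
    append a sequence number to the input and rehash (a sketch only)."""
    candidate = original_url
    seq = 0
    while True:
        digest = hashlib.md5(candidate.encode()).digest()     # 128-bit hash
        encoded = base64.urlsafe_b64encode(digest).decode()   # URL-safe Base64
        key = encoded[:6]
        if key not in taken:       # in production: a database uniqueness check
            taken.add(key)
            return key
        seq += 1
        candidate = f"{original_url}#{seq}"   # perturb the input and retry
```

Note that urlsafe_b64encode uses '-' and '_' instead of '+' and '/', which keeps the key safe to embed directly in a URL.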

Generate keys offline

We can have a standalone Key Generation Service (KGS) that pre-generates random six-character strings and stores them in a database. Whenever we want to create a short link, we simply fetch an already generated key from KGS and use it. This approach is much simpler and faster: we don't need to encode the URL, and we don't have to worry about duplicates or conflicts, because KGS guarantees that all keys inserted into its database are unique.

Is there a concurrency problem?

Once a key has been used, it should be marked in the database to ensure it is never used again. But if multiple servers read keys at the same time, two or more of them may try to take the same key from the database. How do we solve this concurrency problem?

KGS can use two tables to store keys: one for unused keys and one for all used keys.

Once KGS hands keys to one of the servers, it can move them to the used-keys table; it can also keep some keys in memory at all times so that it can supply servers quickly on demand.

For simplicity, as soon as KGS loads some keys into memory, it can move them to the used-keys table. This guarantees that each server gets unique keys.

If KGS restarts or dies before handing all the loaded keys to servers, those keys are wasted, which is acceptable given how many keys we have.

KGS must also make sure it never hands the same key to two servers, so the steps of loading keys into memory and moving them to the used-keys table must be synchronized (or locked) before keys are supplied to a server.
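Here is a sketch of that synchronization idea, with in-memory collections standing in for the unused/used key tables; the lock makes the load-and-dispense step atomic, so two application servers can never receive the same key:

```python
import threading

class KeyGenerationService:
    """Hands out pre-generated keys; a lock makes load-and-dispense atomic.
    (In-memory stand-ins for the unused/used key tables described above.)"""

    def __init__(self, pregenerated_keys):
        self.unused = list(pregenerated_keys)   # "unused keys" table
        self.used = set()                       # "used keys" table
        self.buffer = []                        # keys currently loaded in memory
        self.lock = threading.Lock()

    def get_key(self) -> str:
        with self.lock:                         # synchronized, as described above
            if not self.buffer:
                # Load a batch and immediately mark it used: a crash now only
                # wastes these keys instead of ever double-issuing them.
                batch, self.unused = self.unused[:1000], self.unused[1000:]
                self.used.update(batch)
                self.buffer = batch
            return self.buffer.pop()
```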

Is there a single point of failure in KGS?

To remove this single point of failure, we can run a standby replica of KGS. When the primary server crashes, the standby takes over generating and supplying keys.

Can each application server have some keys instead?

Yes, this reduces the load on KGS, although in this case, if an application server dies before using all of its keys, those keys are lost. Since we have so many keys, that is acceptable.

How is the key lookup done?

We look the key up in the database to get the full URL. If it exists, the server returns an "HTTP 302 Redirect" status to the browser, placing the stored URL in the Location header of the response. If the key is not in our system, the server returns an "HTTP 404 Not Found" status, or redirects the user to the home page.
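As a sketch, the redirect path could look like this with Flask (the framework is my choice for illustration; the design above does not prescribe one):

```python
from flask import Flask, abort, redirect

app = Flask(__name__)
url_store = {}  # url_key -> original_url; stand-in for the database lookup

@app.route("/<url_key>")
def follow_short_link(url_key):
    original_url = url_store.get(url_key)
    if original_url is None:
        abort(404)                           # HTTP 404 Not Found
    return redirect(original_url, code=302)  # HTTP 302 with Location header
```

Using 302 rather than 301 means browsers will not cache the redirect, so every visit flows through our service; that is also what makes the visit statistics from the extension requirements possible.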

Data partitioning and replication

Because we need to store billions of URL records, a single database node may not be able to hold them all, and a single node may also not sustain our read throughput.

Therefore, we need a partitioning scheme that divides the data across different database servers.

Range-based partitioning:

We can store URLs in different partitions based on the first character of the short link key: all keys starting with 'A'/'a' go in one partition, keys starting with 'B'/'b' in another, and so on. This is called range-based partitioning. We can even merge the less frequent starting letters into a single partition.

Hash based partitioning:

In this scheme, we hash the object we want to store and use the result to decide which partition it goes to. In our case, we can hash the short link key to determine the partition in which the data object is stored.

The hash function spreads URLs randomly across partitions (for example, it can map any key to a number in [1..256], which identifies the partition where we store the object).

This approach can still overload some partitions, which can be addressed with consistent hashing.
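In sketch form, the hash-based scheme with the 256 partitions from the example looks like this; the consistent-hashing refinement is only noted in a comment:

```python
import hashlib

NUM_PARTITIONS = 256  # the [1..256] range from the example above

def partition_for(url_key: str) -> int:
    """Map a short link key to one of 256 partitions by hashing it.
    Plain modulo can overload some partitions and reshuffles almost every
    key when partitions are added; consistent hashing addresses both."""
    digest = hashlib.md5(url_key.encode()).digest()
    return int.from_bytes(digest[:4], "big") % NUM_PARTITIONS + 1
```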

Caching

We can cache the hot URLs that are accessed frequently, using an off-the-shelf solution such as Memcached or Redis, so that the application server can quickly check the cache for the desired URL before querying the database.

How do we decide on the cache capacity?

We can start with 20% of daily traffic and adjust the number of cache servers based on observed usage patterns. As estimated above, we need 170 GB of memory to cache 20% of daily traffic; a few modest servers are enough to hold all of these hot URLs.

Which eviction policy should we choose?

The eviction policy answers this question: when the cache is full and a hotter URL comes along, which entry do we replace?

Least Recently Used (LRU) is a reasonable policy for our system: we discard the URL that has gone unused the longest. We can use a hash map or similar data structure to store the URLs and their access order, keyed by the short link (or its hash).
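Python's OrderedDict gives a compact LRU sketch (the capacity here is arbitrary):

```python
from collections import OrderedDict

class LRUCache:
    """Least-recently-used cache of url_key -> original_url mappings."""

    def __init__(self, capacity: int = 1000):
        self.capacity = capacity
        self.data = OrderedDict()

    def get(self, url_key: str):
        if url_key not in self.data:
            return None                      # cache miss: caller hits the DB
        self.data.move_to_end(url_key)       # mark as most recently used
        return self.data[url_key]

    def put(self, url_key: str, original_url: str):
        if url_key in self.data:
            self.data.move_to_end(url_key)
        self.data[url_key] = original_url
        if len(self.data) > self.capacity:
            self.data.popitem(last=False)    # evict the least recently used
```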

How do we update the cache?

Whenever there is a cache miss, our server hits the back-end database. When that happens, we can update the cache and propagate the new entry to all cache replicas. Each replica adds the new entry to its own cache; if it already has the entry, it simply ignores it.

Load balancing

Load balancing layers can be added at three locations in the system:

  • Between the client and the application server;
  • Between the application server and the database server;
  • Between the application server and the cache server.

A simple round-robin approach can distribute incoming requests evenly across back-end servers. It is easy to implement and introduces no overhead. A further benefit is that if a server crashes, the load balancer can take it out of rotation and stop sending it traffic.

One problem with round robin is that it ignores server load: if a server is overloaded or slow, new requests keep flowing to it. To address this, a smarter load balancer can periodically query each back-end server's load and adjust the traffic it sends accordingly.
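A round-robin picker is only a few lines; "taking a server out of rotation" amounts to skipping unhealthy servers (the health checking itself is stubbed out here):

```python
import itertools

class RoundRobinBalancer:
    """Even request distribution with basic out-of-rotation handling."""

    def __init__(self, servers):
        self.servers = list(servers)
        self.healthy = set(servers)          # maintained by a health checker (not shown)
        self._cycle = itertools.cycle(self.servers)

    def next_server(self) -> str:
        for _ in range(len(self.servers)):   # skip servers taken out of rotation
            server = next(self._cycle)
            if server in self.healthy:
                return server
        raise RuntimeError("no healthy back-end server available")
```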

Data cleanup policy

Should data be kept forever, or should it be purged? And what happens to a short link once its user-specified expiration time is reached? There are two options:

  • Continuously scan the database and remove expired data;
  • Use a lazy deletion strategy.

Continuously querying for expired links to delete would put a lot of pressure on the database. Instead, we can remove stale links slowly, with lazy cleanup: the service guarantees that only expired links are deleted, and although some expired links may linger in the database, they are never returned to users.

  • Whenever a user tries to access an expired link, we can delete the link and return an error to the user (see the sketch after this list);
  • A separate cleanup service can run periodically to remove expired links from storage and caches;
  • This service should be very lightweight and planned to run only when expected user traffic is low;
  • We can set a default expiration time for each short link (for example, two years);
  • After removing the expired link, we can put the key back into KGS’s database for reuse.
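A sketch of the lazy path, reusing the ShortLink record from the data model sketch above; the store and the KGS key list are in-memory stand-ins:

```python
from datetime import datetime

def resolve(url_key, store, kgs_unused_keys):
    """Lazy deletion: an expired link is removed at the moment it is accessed."""
    link = store.get(url_key)                 # a ShortLink record, or None
    if link is None:
        return None                           # unknown key: caller returns 404
    if link.expire_date <= datetime.now():
        del store[url_key]                    # remove the expired link...
        kgs_unused_keys.append(url_key)       # ...and return its key to KGS for reuse
        return None                           # caller returns an error to the user
    return link.original_url
```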

Conclusion

That covers all the main aspects of building a short link service. There may be things Xiao Hei has overlooked; feel free to discuss them in the comments! And if this helped you even a little, please give it a like.

I am Xiao Hei, a programmer "drifting" around the Internet industry.

Water does not strive to be first, yet it flows on without end.