preface

For system design columns, I usually follow the three-part analysis below. For the third part, if the system is of moderate size, I usually give myself 3-4 months to build it 🙈, and I will attach plenty of links, leaving it to readers in the community to correct and supplement them. If you feel you can't do it alone or the system is too broad, I hope community members will reach out so we can talk it over, divide up the work, and see how to design this system together ~

Disclaimer: the following analysis is based on Designing a URL Shortening service like TinyURL, with some of my own understanding mixed in. If you are interested in the original, you can read it directly

Why do we need a short URL system

A short URL system maps an original long link to a much shorter one, which saves a lot of space in display and in messages. Take the following link, for example:

https://www.google.com.hk/search?q=
%E6%AF%94%E8%BE%83%E9%95%BF%E7%9A%84%E9%93%BE%E6%8E%A5+%26+%E6%AF%94%E8%BE%83%E7%9F%A
D%E7%9A%84%E8%BF%9E%E6%8E%A5&newwindow=1&ei=q2vZYa3sNt6F4t4Pj7GAkAw&ved=0ahUKEwjt7b6Z_a
H1AhXegtgFHY8YAMIQ4dUDCA4&uact=5&oq=%E6%AF%94%E8%BE%83%E9%95%BF%E7%9A%84%E9%93%BE%E6%8E
%A5+%26+%E6%AF%94%E8%BE%83%E7%9F%AD%E7%9A%84%E8%BF%9E%E6%8E%A5&gs_lcp=Cgdnd3Mtd2l6EAM6B
QghEKABOgkIIRAKEKABECo6BwghEAoQoAFKBAhBGAFKBAhGGABQuBFY3jBgiDpoAnAAeAGAAZ0CiAHHJZIBBjAu
MjAuNpgBAKABAcABAQ&sclient=gws-wiz

The resulting short link is:

https://tinyurl.com/bdzftrfb

Short URL system design process

Requirements analysis

Requirements analysis for a given problem is divided into functional and non-functional requirements. Functional requirements are capabilities the system must provide, while non-functional requirements are qualities that are desirable but not strictly necessary

1. Functional requirements

  • Users can customize their own short links
  • Users can attach an expiration time to a generated short link
  • Users can convert a long link into a short link through our website
  • When a user opens a short link in a browser, they are redirected to the original address

2. Non-functional requirements

  • The system should be highly available
  • Generating short links should be fast
  • The generated short links should be unpredictable
  • Statistical analysis can be performed on the source domains for which short links are generated

Volume forecast

Estimating the system's throughput and storage consumption in advance makes the later technology choices better informed

Note: There are approximations in the following calculations

  1. QPS

    The system is read-heavy and write-light. Assume 5 × 10^8 (500 million) new URLs are created every month (isn't that a bit big 😏) and a read:write ratio of 100:1. The write QPS is then 5 × 10^8 / (30 × 24 × 3600) ≈ 200/s, so the read QPS is about 20,000/s

  2. storage

    Assuming the same 5 × 10^8 new URLs per month, over 5 years the total number of URLs generated is 5 × 10^8 × 12 × 5 = 3 × 10^10. If each URL takes 500 bytes, we end up needing about 1.5 × 10^13 bytes, i.e. roughly 15 TB of storage

  3. bandwidth

    For write requests, with 200 URLs written per second and each URL taking 500 bytes, the write bandwidth is 200 × 500 = 100 KB/s. Read requests are 100 times the write requests, so the read bandwidth is about 10 MB/s

  4. Memory estimate

    If we want to cache frequently visited URLs, we can follow the 80-20 rule (20% of the URLs account for 80% of the traffic) and cache that hot 20%. Continuing with the volume above, we serve 20,000 reads per second, which is 2 × 10^4 × 3600 × 24 ≈ 1.7 × 10^9 requests per day. Caching 20% of them at 500 bytes each gives 1.7 × 10^9 × 500 × 0.2 = 170 GB. Since many requests hit the same URLs, the actual cache will be smaller than 170 GB
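A minimal sketch, in Go, that reproduces the back-of-envelope numbers above (all constants are just the assumptions from this section, not measured values):

```go
package main

import "fmt"

func main() {
	// Assumptions from the estimates above: 5e8 new URLs/month,
	// 100:1 read:write ratio, 500 bytes per URL, 5-year horizon,
	// and the 80-20 rule for caching.
	newURLsPerMonth := 5e8
	writeQPS := newURLsPerMonth / (30 * 24 * 3600) // ≈ 193/s, rounded to 200/s
	readQPS := writeQPS * 100                      // ≈ 20,000/s
	totalURLs := newURLsPerMonth * 12 * 5          // 3e10 URLs over 5 years
	storageTB := totalURLs * 500 / 1e12            // ≈ 15 TB
	writeBandwidthKB := 200.0 * 500 / 1e3          // 100 KB/s
	requestsPerDay := 2e4 * 3600 * 24              // ≈ 1.7e9 reads per day
	cacheGB := requestsPerDay * 500 * 0.2 / 1e9    // ≈ 170 GB upper bound

	fmt.Printf("write QPS ≈ %.0f/s, read QPS ≈ %.0f/s\n", writeQPS, readQPS)
	fmt.Printf("storage ≈ %.0f TB, write bandwidth ≈ %.0f KB/s, cache ≤ %.0f GB\n",
		storageTB, writeBandwidthKB, cacheGB)
}
```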

System API design

Once we have identified the requirements, we can define the system API, which gives a clear description of the system's functionality

createURL creates a new short link
// uuid => the unique identifier of the registered user; it can be used to limit certain user actions, e.g. the number of short links generated
// originalUrl => the original long URL
// customAlias => [optional] a user-defined string; the generated short link is the short-link domain + customAlias
// userName => [optional] the user's name
// expireDate => [optional] how long the short link remains valid
func createURL(uuid string, originalUrl string, customAlias string, userName string, expireDate int64) string
// deleteURL deletes a short link
// uuid => same as above
// urlKey => the string that identifies the short link
func deleteURL(uuid string, urlKey string)
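To make the API above concrete, here is a minimal sketch of how createURL might hang together. The in-memory keyPool and store maps stand in for the key-generation service and the database, the error return is added for illustration, and everything except the parameter list is an assumption of this sketch:

```go
package main

import (
	"errors"
	"fmt"
)

// keyPool and store stand in for the offline key service and the database.
var (
	keyPool = []string{"bdzftrfb", "a1b2c3d4"}
	store   = map[string]string{} // urlKey -> originalUrl
)

// createURL uses the custom alias when given, otherwise draws a pre-generated
// key; it fails if the alias is already taken. Expiry handling is omitted.
func createURL(uuid, originalUrl, customAlias, userName string, expireDate int64) (string, error) {
	key := customAlias
	if key == "" {
		if len(keyPool) == 0 {
			return "", errors.New("no keys available")
		}
		key, keyPool = keyPool[0], keyPool[1:] // atomically "take" a key in a real system
	}
	if _, taken := store[key]; taken {
		return "", errors.New("alias already in use")
	}
	store[key] = originalUrl
	return "https://tinyurl.com/" + key, nil
}

func main() {
	short, err := createURL("u-1", "https://www.google.com.hk/search?q=...", "", "alice", 0)
	fmt.Println(short, err)
}
```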

Database selection

Our storage needs to satisfy the following conditions:

  1. We need to store a lot of data
  2. Each record is small
  3. There are few relationships between the records
  4. Our service has heavy read requests

Schema design

url_key

| Field | Type | Notes |
| --- | --- | --- |
| id | int(10) | |
| url_key | varchar(16) | |
| state | int(10) | 1 = available, 2 = unavailable |

url_transfer

| Field | Type |
| --- | --- |
| url_key_id | int(10) |
| original_url | varchar(1024) |
| uuid | varchar(256) |
| create_date | datetime |
| expire_date | datetime |

user

| Field | Type |
| --- | --- |
| uuid | int(10) |
| name | varchar(256) |
| email | varchar(256) |
| create_date | datetime |
| last_login | datetime |

After designing the table structure, we need to decide between a relational and a non-relational database. Since we need to store massive amounts of data and do not need to maintain relationships between the entities, a non-relational database is a reasonable choice

Implementation details

1. Generation of urlKey

The key problem to solve here is how to generate a short, unique key for each URL. In the example above, the generated short link is https://tinyurl.com/bdzftrfb, so what we need to discuss is the rule that produces bdzftrfb

Here we explore two ways of doing this. (PS: I learned something from making this comparison, and I hope you learn something from reading this article.)

  • Encode the original URL

    This approach tends to generate the key in real time: for the original URL provided by the user, we use some algorithm to produce a string of a specific length (the urlKey) that uniquely identifies the original URL

    For example, we could take the MD5 hash of the original URL to get a 128-bit value, and then Base64-encode it (6 bits per character) to get a string of about 21 characters, when all we really need is the first 6 or 8 characters

    The approach described above raises the following questions:

    1. If we always cut the first 6 or 8 characters from the 21-character string, will there be duplicates?

      This can be worked around by swapping characters until the key is unique, or by similar tricks

    2. If only the original URL is encoded, the same original URL produces the same result for different users

      For this problem, we can mix the user's unique identifier (the uuid mentioned above) into the encoding. Of course, unregistered users have no uuid, and that case still cannot be handled well; we could only force users to register an account, which is not great for the user experience

  • Generate unique identifiers offline

    Since generating the URL's unique identifier online raises a series of problems, we can consider generating the urlKey offline: we create a separate service whose only job is to generate unique identifiers for URLs

    When users encode original URLs, there will be many concurrent requests to this key-generation service. To keep the data consistent under concurrency, we can soft-delete a urlKey as it is handed out (as one atomic operation) to guarantee that the key is never reused

    Moving on to some other properties of this service, such as the storage it needs: suppose we generate urlKeys of length 6 and encode them in Base64, so each position has 64 possible characters. The total number of urlKeys is then 64^6 ≈ 68.7 billion. At 4 bytes per character, the storage required is 4 × 6 × 64^6 bytes ≈ 1536 GB. To make later horizontal scaling easier, it is better to partition the data: hash each stored object (keeping the distribution uniform) and take the hash modulo the number of partitions to decide where the data is finally stored
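Stepping back to the first option, the hash-based encoding can be sketched like this; shortKey, the uuid salt, and the fixed 8-character cut are illustrative choices of this sketch, not part of the original design:

```go
package main

import (
	"crypto/md5"
	"encoding/base64"
	"fmt"
)

// shortKey hashes the original URL (salted with the user's uuid so that the
// same URL yields different keys per user), Base64-encodes the 128-bit digest
// into ~22 characters, and keeps the first n. Truncation means collisions
// remain possible and must still be checked against storage.
func shortKey(originalURL, uuid string, n int) string {
	sum := md5.Sum([]byte(originalURL + uuid)) // 128-bit MD5 digest
	encoded := base64.RawURLEncoding.EncodeToString(sum[:])
	return encoded[:n]
}

func main() {
	fmt.Println(shortKey("https://example.com/some/long/path", "user-42", 8))
}
```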

At this point I am not sure whether this resonates with the reader, but while generating urlKeys online takes a lot of effort to reason about edge cases, the offline approach is actually easier to implement and to scale horizontally

2. Performance optimization

There are two points that can be optimized to improve throughput

  1. When a client obtains urlKeys from the key-generation service, it can fetch a batch of keys at a time and cache them locally; the records for those keys are marked unavailable at the same time. The client then takes urlKeys from its local batch first, and only requests the key-generation service again when the batch runs out

  2. The service itself reads far more than it writes. We can cache frequently accessed URLs: each time a client requests a short link, we first look up the original URL in the cache; on a miss we read it from the database and then update the cache

    Of course, since short links have an expiration time, the cache must store it as well. When a request hits the cache but the link has expired, we return 404, delete the cache entry, delete the record in the database, and release the urlKey

Building your own short URL system

To be implemented (Deadline: 5.30)