Preface

This series of system design practice articles uses the "Grokking the System Design Interview" course as its template, walking through dozens of common system design problems.

Design goals

Design a URL shortening service like TinyURL. The service will provide shorter aliases that redirect to the original long URLs.

Why do we need URL shortening

URL shortening is used to create shorter aliases for long URLs. We call these shortened aliases short links. When users click on these short links, they are redirected to the original URL. Short links save a lot of space when displayed, printed, messaged, or tweeted, and are easier for users to type manually.

For example, if we shorten this URL with TinyURL:

www.educative.io/collection/…

We can get:

tinyurl.com/jlg8zpc

The shortened URL is nearly one-third the size of the actual URL.

URL shortening is used to optimize links across devices, to track individual links for analyzing audience and campaign performance, and to hide the associated original URLs.

If you haven't used Tinyurl.com before, try creating a new short URL and spend some time browsing the various service options it offers; it will help you a lot.

Requirements and objectives of the system

You should clarify the requirements at the very beginning of the interview. Be sure to ask questions to find out the exact scope of the system the interviewer has in mind.

Our url shortening system should meet the following requirements:

Functional requirements:
  • Given a URL, our service should generate a shorter and unique alias.
  • When a user accesses a short link, our service should redirect them to the original link.
  • Users should be able to select a custom short link for their URL.
  • Links will expire after a standard default interval. Users should also be able to specify a custom expiration time.
Non-functional requirements:
  • The system should be highly available. This is necessary because if our service stops, all URL redirects will start to fail.
  • URL redirection should occur in real time with minimal latency.
  • Shortened links should not be guessable.
Extended requirements:
  • Analytics: for example, how many times did a redirection happen?
  • Our service should also be accessible to other services through REST APIs.

Capacity estimation and constraints

In terms of read and write volume, this system will be heavily used. Compared with shortening a URL, accessing a short link generates a much larger number of redirection requests. We can assume a read/write ratio of 100:1.

Traffic estimation

Assuming we have 500 million new URL shortenings per month, with a read/write ratio of 100:1 we can expect 50 billion redirections over the same period:

100 * 500M => 50B

What are the queries per second (QPS) of our system?

500 million / (30 days * 24 hours * 3600 seconds) = ~200 URLs/s

Given a 100:1 read/write ratio, URL redirection per second would be:

100 * 200 URLs/s = 20K/s

Storage estimates

Suppose we store every original URL (and its associated short link) for five years. Since we expect 500 million new URLs per month, the total number of stored objects will reach 30 billion:

500 million * 5 years * 12 months = 30 billion

Let's assume each stored object is about 500 bytes (this is just a rough estimate; we will dig deeper later). We will then need 15TB of storage in total:

30 billion * 500 bytes = 15 TB

Bandwidth estimation

For write requests, since we expect 200 new URLs per second, the total incoming data for our service will be 100KB per second:

200 * 500 bytes = 100 KB/s

For read requests, since we expect approximately 20K URL redirections per second, the total outgoing data for our service will be 10MB per second:

20K * 500 bytes = ~10 MB/s

Memory estimates

If we want to cache some frequently accessed hot urls, how much memory do we need to store them?

If we follow the 80-20 rule, where 20% of URLs generate 80% of the traffic, we would like to cache those 20% hot URLs.

Since we have 20,000 requests per second, we will receive 1.7 billion requests per day.

20K * 3600 seconds * 24 hours = ~1.7 billion

To cache 20% of the requests, we need 170GB of memory.

0.2 * 1.7 billion * 500 bytes = ~170GB

One thing to note here is that our actual memory usage will be less than 170GB due to many repeated requests (the same URL).

Summary of estimates

Assuming 500 million new URLs per month and a read to write ratio of 100:1, the following is a summary of our service's capacity estimates:

  • Short link creation: ~200/s
  • Short link redirections: ~20K/s
  • Incoming data: ~100KB/s
  • Outgoing data: ~10MB/s
  • Storage for five years: 15TB
  • Cache memory: 170GB
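
For readers who want to check these numbers, the following is a minimal Python sketch that reproduces the estimates above; all inputs (500 million new URLs per month, a 100:1 read/write ratio, 500-byte objects, five-year retention, and the 80-20 caching rule) come straight from this section.

    # Back-of-the-envelope capacity estimates for the URL shortener.
    new_urls_per_month = 500_000_000      # 500M new short links per month
    read_write_ratio   = 100              # 100 reads (redirects) per write
    object_size_bytes  = 500              # rough size of one stored URL record
    retention_years    = 5
    seconds_per_month  = 30 * 24 * 3600

    writes_per_sec = new_urls_per_month / seconds_per_month        # ~200/s
    reads_per_sec  = writes_per_sec * read_write_ratio             # ~20K/s

    total_objects  = new_urls_per_month * 12 * retention_years     # 30 billion
    storage_bytes  = total_objects * object_size_bytes             # ~15 TB

    ingress_bps = writes_per_sec * object_size_bytes               # ~100 KB/s
    egress_bps  = reads_per_sec * object_size_bytes                # ~10 MB/s

    requests_per_day = reads_per_sec * 24 * 3600                   # ~1.7 billion
    cache_bytes      = 0.2 * requests_per_day * object_size_bytes  # ~170 GB

    print(f"writes/s={writes_per_sec:.0f}, reads/s={reads_per_sec:.0f}, "
          f"storage={storage_bytes/1e12:.1f} TB, cache={cache_bytes/1e9:.0f} GB")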

System API design

Once we have identified the requirements, it is always a good idea to define the system API. At this point, it should be clear what the system is expected to do.

We can use SOAP or REST APIs to expose the functionality of our service. Below are the definitions of the APIs for creating and deleting URLs.

createURL(api_dev_key, original_url, custom_alias=None, user_name=None, expire_date=None)

Parameters
  • api_dev_key (string): The API developer key of a registered account. Among other things, this will be used to throttle users based on their allocated quota.
  • original_url (string): The original URL to be shortened.
  • custom_alias (string): Optional custom key for the URL.
  • user_name (string): Optional user name to be used in the encoding.
  • expire_date (string): Optional expiration date for the shortened URL.

Returns

On success, it returns the shortened URL; otherwise, it returns an error code.

deleteURL(api_dev_key, url_key)

Here url_key is a string representing the shortened URL to be deleted. A successful deletion returns "URL Removed".
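
To make the API shape more concrete, here is a minimal, self-contained Python sketch of the two calls defined above. The in-memory dictionaries and the quota counter are illustrative stand-ins for the real database and rate-limiting layers, not part of the original design.

    # Minimal sketch of the createURL / deleteURL API described above.
    # The in-memory structures stand in for the real database and quota layers.
    import secrets
    import string

    store = {}                       # short key -> (original_url, user, expire_date)
    quota_left = {"dev-key-123": 5}  # illustrative per-api_dev_key creation quota

    def _generate_key(length=6):
        alphabet = string.ascii_letters + string.digits
        return "".join(secrets.choice(alphabet) for _ in range(length))

    def createURL(api_dev_key, original_url, custom_alias=None,
                  user_name=None, expire_date=None):
        if quota_left.get(api_dev_key, 0) <= 0:       # throttle per developer key
            return {"error": "quota exceeded"}
        key = custom_alias or _generate_key()
        if key in store:
            return {"error": "alias already taken"}
        store[key] = (original_url, user_name, expire_date)
        quota_left[api_dev_key] -= 1
        return {"short_url": "http://tinyurl.com/" + key}

    def deleteURL(api_dev_key, url_key):
        if store.pop(url_key, None) is not None:
            return {"status": "URL Removed"}
        return {"error": "not found"}

    print(createURL("dev-key-123", "https://www.educative.io/collection/..."))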

How do we detect and prevent abuse?

A malicious user could grab all of the URL keys in the system, preventing our service from creating new short links. To prevent abuse, we can throttle users via their api_dev_key: each api_dev_key can be limited to a certain number of URL creations and redirections per time period.

Database design

Defining the DB schema at an early stage will help us understand the flow of data between different components and help us deal with data partitioning later.

Some observations about the nature of the data we’re going to store

  • We need to store billions of records
  • Each object we store is small (less than 1K).
  • There is no relationship between the records other than to store which user created the URL.
  • Our service has a high volume of read requests.
Database model

We need two tables: one to store information about the URL mappings, and one to store data about the users who created the short links.

URL
  [PK] Hash: varchar(16)
  OriginalURL: varchar(512)
  CreationDate: datetime
  ExpirationDate: datetime

User
  [PK] UserID: int
  Name: varchar(20)
  Email: varchar(20)
  CreationDate: datetime
  LastLoginDate: datetime
What database should we use?

Because we expect to store billions of rows and we don't need relationships between objects, a NoSQL key-value store like DynamoDB, Cassandra or Riak is a better choice. NoSQL is also easier to scale. See SQL vs NoSQL for more details.

Basic system design and algorithm

The problem we tackle here is how to generate a short and unique key for a given URL.

In the example from the "Why do we need URL shortening" section, the shortened URL is http://tinyurl.com/jlg8zpc. The last six characters of this URL form the short key we want to generate.

We’ll explore two solutions here.

Approach 1: Encode the URL

We can calculate a unique hash for a given URL (for example, MD5 or SHA256, etc.). The hash can then be encoded for display.

The encoding could be base36 ([a-z, 0-9]) or base62 ([A-Z, a-z, 0-9]); if we also add '-' and '.', we can use Base64 encoding. So the question is, how long should the short key be?

Using Base64 encoding, a 6-letter long key will produce 64^6 = ~ 68.7 billion possible strings, and an 8-letter long key will produce 64^8 = ~281 trillion possible strings.

68.7 billion unique strings are sufficient for our system, so we can settle on six-letter keys.

If we use the MD5 algorithm as our hash function, it will produce a 128-bit hash value. After Base64 encoding, we get a string of more than 21 characters (since each Base64 character encodes 6 bits of the hash). Since we only have room for six (or eight) characters per short key, how do we choose the key? We can take the first six (or eight) letters as the key. However, this can lead to duplicate keys, in which case we can choose some other characters from the encoded string or swap some characters.
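
As a rough illustration of this approach (assuming MD5 and URL-safe Base64, and simply truncating to the first six characters), a Python sketch might look like this; the truncation step is exactly where the duplicate-key risk comes from.

    # Sketch of Approach 1: hash the URL, Base64-encode the digest, keep 6 chars.
    import base64
    import hashlib

    def shorten(url: str, length: int = 6) -> str:
        digest = hashlib.md5(url.encode("utf-8")).digest()    # 128-bit hash
        encoded = base64.urlsafe_b64encode(digest).decode()   # ~22 Base64 characters
        return encoded[:length]                               # truncation -> possible collisions

    print(shorten("https://www.educative.io/collection/..."))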

What problems does this approach have?

There are several problems with this encoding scheme:

  1. If multiple users enter the same URL, they will get the same shortened URL, which is not acceptable.

  2. What if parts of the URL are URL-encoded? For example, www.education.io/distributed… and www.education.io/distributed… could be identical except for the URL encoding.

The solution

We can append an increasing sequence number to each input URL to make it unique, and then generate its hash. We don't even need to store this sequence number in the database. Possible problems with this approach are that the ever-increasing sequence number can overflow, and that appending it also hurts the performance of the service.

Another solution is to append the user ID (which should be unique) to the input URL. However, if the user is not logged in, we would have to ask the user to choose a unique key. Even then, if there is still a conflict, we have to keep generating keys until we get a unique one.
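
A minimal sketch of the first workaround, appending an increasing sequence number before hashing; the in-process counter here is only illustrative, since a real service would need a shared, distributed sequence.

    # Sketch of the uniqueness workaround: append an increasing sequence number
    # to the URL before hashing, so identical URLs still produce distinct keys.
    import base64
    import hashlib
    import itertools

    _seq = itertools.count(1)   # illustrative stand-in for a service-wide sequence

    def shorten_unique(url: str, length: int = 6) -> str:
        salted = f"{url}#{next(_seq)}"
        digest = hashlib.md5(salted.encode("utf-8")).digest()
        return base64.urlsafe_b64encode(digest).decode()[:length]

    # The same URL shortened twice now yields two different keys.
    print(shorten_unique("https://example.com/a"), shorten_unique("https://example.com/a"))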

Approach 2: Generate keys offline

We can have a standalone Key Generation Service (KGS) that generates random six-letter strings beforehand and stores them in a database (let's call it key-DB). Whenever we want to shorten a URL, we simply take one of the already-generated keys and use it. This approach makes things very simple and fast: not only do we avoid encoding the URL, we also don't have to worry about duplication or collisions. KGS makes sure all the keys inserted into key-DB are unique.

Concurrency problem

Once a key is used, it should be marked in the database to ensure it is not used again. If multiple servers are reading keys concurrently, we may hit a scenario where two or more servers try to read the same key from the database. How do we solve this concurrency problem?

Servers can use KGS to read/mark keys in the database. KGS can use two tables to store keys: one for keys that have not been used yet, and one for all the used keys. As soon as KGS gives keys to one of the servers, it can move them to the used keys table. KGS can always keep some keys in memory so that it can quickly provide them whenever a server needs them.

For simplicity, as soon as KGS loads some keys into memory, it can move them to the used keys table. This ensures each server gets unique keys. If KGS dies before assigning all the loaded keys to some server, those keys will be wasted, which is acceptable given the large number of keys we have.

KGS must also make sure it does not give the same key to multiple servers. For that, it must synchronize (or get a lock on) the data structure holding the keys before removing keys from it and handing them to a server.
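
A minimal sketch of the KGS idea under these assumptions: keys are pre-generated into an "unused" set, and a lock guards the move to the "used" set so that two servers can never receive the same key. The class and its sizes are illustrative, not part of the original design.

    # Sketch of the Key Generation Service (KGS): keys are pre-generated, kept in
    # an "unused" set, and moved to a "used" set under a lock when handed out.
    import secrets
    import string
    import threading

    class KeyGenerationService:
        def __init__(self, pregenerate=1000, length=6):
            alphabet = string.ascii_letters + string.digits
            self._unused = {"".join(secrets.choice(alphabet) for _ in range(length))
                            for _ in range(pregenerate)}
            self._used = set()                 # stands in for the "used keys" table
            self._lock = threading.Lock()      # prevents two servers getting the same key

        def get_key(self) -> str:
            with self._lock:
                key = self._unused.pop()       # raises KeyError if we ever run out
                self._used.add(key)
                return key

    kgs = KeyGenerationService()
    print(kgs.get_key())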

What is the key database size?

Using Base64 encoding, we can generate 68.7 billion unique six-letter keys. If we need one byte to store an alphanumeric character, we can store all these keys in 412GB:

6 (characters per key) * 68.7B (unique keys) = 412 GB

Isn’t KGS a single point of failure?

Yes, it is. To solve this, we can have a standby replica of KGS that takes over key generation and provisioning when the primary server dies.

Can each application server cache some keys from key-DB?

Yes, and this can speed up responses. However, in this case, if an application server dies before using all of its cached keys, we will end up losing those keys. This is acceptable, since we have 68 billion unique six-letter keys.

How do we perform a key lookup?

We can look up the key in our database or key-value store to get the full URL. If it is present, we issue an "HTTP 302 Redirect" status to the browser and pass the stored URL in the "Location" field of the response. If the key is not in our system, we issue an "HTTP 404 Not Found" status or redirect the user back to the home page.
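
A minimal sketch of this lookup-and-redirect step, using only the Python standard library; the in-memory URL_TABLE stands in for the database or cache lookup.

    # Sketch of the redirect path: look up the key, reply 302 with a Location
    # header if found, otherwise 404. URL_TABLE stands in for the real store.
    from http.server import BaseHTTPRequestHandler, HTTPServer

    URL_TABLE = {"jlg8zpc": "https://www.educative.io/collection/..."}

    class RedirectHandler(BaseHTTPRequestHandler):
        def do_GET(self):
            key = self.path.lstrip("/")
            original = URL_TABLE.get(key)
            if original:
                self.send_response(302)              # HTTP 302 redirect
                self.send_header("Location", original)
                self.end_headers()
            else:
                self.send_error(404, "Not Found")    # unknown key

    if __name__ == "__main__":
        HTTPServer(("localhost", 8080), RedirectHandler).serve_forever()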

Should we impose size limits on custom aliases?

Our service supports custom aliases. Users can pick any key they like, but providing a custom alias is not mandatory. However, it is reasonable (and often desirable) to impose a size limit on custom aliases to ensure we have a consistent URL database. Let's assume users can specify at most 16 characters per custom key (as reflected in the database schema above).

Data partitioning and backup

To scale our database, we need to partition it so that it can store information about billions of URLs. We need to come up with a partitioning scheme that divides our data and stores it on different DB servers.

Range-based partitioning

We can store URLs in separate partitions based on the first letter of the hash key. So we store all URLs starting with the letter 'A' in one partition, those starting with 'B' in another, and so on. This approach is called range-based partitioning. We can even combine some of the less-frequent letters into one database partition. We should use a static partitioning scheme so that we can always store/find a URL in a predictable way.

The main problem with this approach is that it can lead to unbalanced DB servers. For example, suppose we decide to put all URLs starting with the letter 'E' into one DB partition, and then realize that there are far too many URLs starting with 'E'.

Hash-based partitioning

In this scheme, we take a hash of the object we are storing and then calculate which partition to use based on that hash. In our case, we can take the hash of the key or of the actual URL to determine the partition in which the data object is stored. Our hash function will randomly distribute URLs into different partitions (for example, it can always map any key to a number in [1..256]), and this number will represent the partition in which we store the object.

This approach can still lead to overloaded partitions, a problem that can be solved with consistent hashing.
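
A minimal sketch of hash-based partition selection under the [1..256] assumption above; a real deployment would swap the plain modulo for consistent hashing so that adding or removing a partition only moves a fraction of the keys.

    # Sketch of hash-based partitioning: hash the short key and map it onto one
    # of 256 partitions. Consistent hashing would replace the simple modulo.
    import hashlib

    NUM_PARTITIONS = 256

    def partition_for(key: str) -> int:
        digest = hashlib.md5(key.encode("utf-8")).digest()
        return int.from_bytes(digest[:4], "big") % NUM_PARTITIONS + 1   # 1..256

    print(partition_for("jlg8zpc"))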

Cache

We can cache URLs that are frequently accessed. Off-the-shelf solutions such as Memcached are available, which can store full URLs with their respective hashes. Before hitting the backend storage, application servers can quickly check whether the cache has the desired URL.

How big should the cache be?

We can start by caching 20% of daily traffic and, based on client usage patterns, adjust how many cache servers we need. As estimated above, we need 170GB of memory to cache 20% of daily traffic. Since a modern server can have 256GB of memory, we can easily fit all the cache into one machine. Alternatively, we can use a couple of smaller servers to store all these hot URLs.

Which cache eviction policy best suits our needs?

When the cache is full and we want to replace a link with a newer/hotter URL, how do we choose? Least Recently Used (LRU) would be a reasonable policy: we discard the least recently used URL first. We can use a LinkedHashMap or a similar data structure to store our URLs and hashes, which also keeps track of the URLs that have been accessed recently.
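
Since the text mentions LinkedHashMap, here is the equivalent idea sketched in Python with an OrderedDict: every access moves an entry to the "most recently used" end, and the least recently used entry is evicted once the cache is full. The capacity used here is illustrative.

    # Sketch of an LRU cache for hot URLs, built on collections.OrderedDict.
    from collections import OrderedDict

    class LRUCache:
        def __init__(self, capacity: int = 1000):
            self.capacity = capacity
            self._data = OrderedDict()          # short key -> original URL

        def get(self, key):
            if key not in self._data:
                return None
            self._data.move_to_end(key)         # mark as most recently used
            return self._data[key]

        def put(self, key, url):
            self._data[key] = url
            self._data.move_to_end(key)
            if len(self._data) > self.capacity:
                self._data.popitem(last=False)  # evict the least recently used entry

    cache = LRUCache(capacity=2)
    cache.put("jlg8zpc", "https://www.educative.io/collection/...")
    print(cache.get("jlg8zpc"))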

To further improve efficiency, we can replicate the cache servers to distribute the load between them.

How do we update each cache replica?

Whenever there is a cache miss, our servers hit the backend database. Whenever this happens, we can update the cache and pass the new entry to all the cache replicas. Each replica can update its cache by adding the new entry; if a replica already has that entry, it can simply ignore it.

Load balancing

There are three places we can add load balancing to the system

  1. Between clients and application servers
  2. Between application servers and database servers
  3. Between application servers and cache servers

Initially, we can use a simple Round Robin approach that distributes incoming requests evenly among the backend servers. This LB is simple to implement and introduces no overhead. Another benefit of this approach is that if a server dies, the LB will take it out of rotation and stop sending traffic to it.

A problem with Round Robin LB is that it does not take server load into account: the LB will not stop sending new requests to a server that has become overloaded or slow. To handle this, a smarter LB solution can be put in place that periodically queries the backend servers about their load and adjusts traffic accordingly.
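
A minimal sketch of the two selection policies discussed here, plain Round Robin versus a simple load-aware variant; the server names and load numbers are illustrative only.

    # Sketch of load-balancer server selection: Round Robin vs. least-loaded.
    import itertools

    SERVERS = ["app-1", "app-2", "app-3"]           # hypothetical backend servers

    _rr = itertools.cycle(SERVERS)

    def round_robin() -> str:
        return next(_rr)                            # ignores how busy each server is

    def least_loaded(load: dict) -> str:
        return min(SERVERS, key=lambda s: load.get(s, 0.0))   # needs periodic load reports

    print(round_robin(), least_loaded({"app-1": 0.9, "app-2": 0.2, "app-3": 0.5}))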

Data cleanup

Should short links be kept forever, or should they be purged? What should happen to a link when its user-specified expiration time is reached?

If we chose to actively search for expired links in order to remove them, it would put a lot of pressure on our database. Instead, we can slowly remove expired links with a lazy cleanup. Our service will make sure that only expired links are deleted; although some expired links may live longer, they will never be returned to users. (A minimal sketch of this lazy check appears after the list below.)

  • Whenever a user tries to access an expired link, we can delete the link and return an error to the user.
  • A separate cleanup service can run periodically to remove expired links from our storage and cache. This service should be very lightweight and scheduled to run only when user traffic is expected to be low.
  • We can set a default expiration time (for example, two years) for each link.
  • After removing an expired link, we can put its key back into key-DB for reuse.
  • Should we remove links that haven't been visited in, say, six months? This could be tricky. Since storage is getting cheaper, we may decide to keep links forever.
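
A minimal sketch of the lazy-expiration check described above: a link is removed only when someone tries to access it after its expiration date. The in-memory table and the two-year default TTL are illustrative.

    # Sketch of lazy cleanup: expired links are removed only when accessed.
    from datetime import datetime, timedelta

    DEFAULT_TTL = timedelta(days=2 * 365)           # default expiration: ~two years
    URL_TABLE = {
        "jlg8zpc": ("https://www.educative.io/collection/...",
                    datetime(2020, 1, 1)),          # (original URL, expiration date)
    }

    def resolve(key: str):
        entry = URL_TABLE.get(key)
        if entry is None:
            return None                             # unknown key -> 404
        original, expires_at = entry
        if datetime.utcnow() > expires_at:
            del URL_TABLE[key]                      # remove lazily; key can return to key-DB
            return None                             # treat as expired -> error to the user
        return original

    print(resolve("jlg8zpc"))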

Telemetry

How many times has a short URL been used, what were the user locations, and so on? How would we store these statistics? If they are part of a DB row that is updated on each view, what happens when a hot URL is hit by a large number of concurrent requests?

Some statistics worth tracking: the country of the visitor, the date and time of access, the web page that referred the click, and the browser or platform from which the page was accessed.

Security and permissions

Can users create private URLs, or allow only a particular set of users to access a URL?

We can store the permission level (public/private) with each URL in the database. We can also create a separate table to store the user IDs that are authorized to see a specific URL. If a user without permission tries to access a URL, we can send back an error (HTTP 401). Given that we are storing our data in a NoSQL wide-column database like Cassandra, the key for the table storing permissions would be the hash (or the KGS-generated key), and the columns would store the user IDs of the users who are allowed to view the URL.