Rate limit

In areas such as web servers, TCP communication, and API interaction, a rate limit is generally used to control the rate of traffic based on parameters such as the number of requests or the volume of data. It is sometimes also called flow control.

When talking about flow control, two aspects are generally considered:

  1. Constant Bit Rate (CBR)
  2. Variable Bit Rate (VBR)

This is not a concept exclusive to TCP communication. In fact, rate control is a universal concept that appears across many disciplines. For example, in audio and video playback (especially streaming media), the decoding rate also needs to be controlled: if the hardware video buffer cannot keep up, decoding has to wait; if the stream data arrives too slowly, decoding will also stall, but at that point playback must not pause (or at least the audio must not pause). Flow control is involved here as well.

Besides the rate itself, one of the most important design and coding considerations of a rate limiter is whether the input traffic may be discarded.

For TCP's underlying bandwidth control and flow control, traffic is not allowed to be discarded: TCP requires packet reachability, so it retransmits packets that fail to arrive.

In a microservice network, flow control may be required to be non-discarding in order to provide a smooth and stable stream of requests to downstream services; in that case the flow controller spreads peak requests over a longer time slice, reducing the instantaneous pressure on downstream services.

However, vendors that provide public API services usually implement rate-limiting logic in the service gateway. If you make high-frequency requests, the portion of those requests that exceeds the limit will be discarded by the gateway and will never be seen by the downstream services behind it.

So flow control and rate limiting are related but not identical, and you need to distinguish them clearly.

A rate limiter can provide both behaviors in a single software component, as long as the designer makes the distinction consciously.

Implementation schemes

For the sake of the following argument, we use 100 requests per second (100 r/s) as an indicator of the rate limit.

Counter method

The first approach that comes to mind is the counter method. It is straightforward: keep a counter that starts at 100 and is decremented by 1 on every request until it reaches zero, then reset it to the initial value once a condition such as "more than 1s has elapsed" is met.

type counter struct {
	Maximal int
	Period  time.Duration
	count   int
	tick    int64 // in nanosecond
}

func (s *counter) Take(count int) bool {
	if time.Now().UnixNano() > s.tick {
		// the period has elapsed: reset the counter first
		s.count = 0
		s.tick = time.Now().Add(s.Period).UnixNano()
	}

	s.count += count            // not atomic; the small race is considered acceptable here, even in an HPC scene
	return s.count <= s.Maximal // ditto
}

The striking weakness of the counter method is its lack of uniformity: if a large burst of requests arrives within one timing cycle, Take() may still succeed for all of them (as long as the counter is not exhausted), even though that batch of requests is clearly far too frequent.

Within a single timing period our limiting rule is still technically respected. But if a burst of requests arrives at the end of one timing period and another at the beginning of the next, and both are passed, the rule is effectively broken: over a window that starts at the midpoint of one timing cycle and ends at the midpoint of the next, the number of requests passed may be up to twice the limit.

The counter method is sometimes described as a Fixed Window method, as opposed to a Sliding Window method. The time range used by the counter method (1s in our case) can be called the statistical window, the calculation window, or the observation window.

Sliding Log

Sliding Log is sometimes listed as an algorithm.

At first glance it looks like a different approach from the sliding window: each consumer request is recorded as a timestamp, and the sequence is retained. When a new request arrives, the timestamps falling in the preceding window are counted, and that count divided by the corresponding duration gives the request frequency over that time. If the frequency is too high the request is rejected; otherwise it is passed. To keep memory bounded, the system only retains timestamps for a limited period.

But in our view, the sliding log is really just a specific implementation of the sliding window.
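
For concreteness, a sliding log limiter might be sketched roughly like this. This is our own minimal, non-thread-safe illustration, not code from any of the libraries referenced below, and it assumes the usual time import:

// slidingLog keeps the timestamps of admitted requests and counts how many
// of them fall inside the window that ends at the incoming request.
type slidingLog struct {
	Maximal int           // max requests allowed per window
	Window  time.Duration // observation window, e.g. 1s
	stamps  []time.Time   // timestamps of previously admitted requests
}

func (s *slidingLog) Take(count int) bool {
	now := time.Now()
	edge := now.Add(-s.Window)

	// drop timestamps that have slid out of the window, bounding memory usage
	i := 0
	for i < len(s.stamps) && s.stamps[i].Before(edge) {
		i++
	}
	s.stamps = s.stamps[i:]

	if len(s.stamps)+count > s.Maximal {
		return false // too many requests inside the trailing window
	}
	for j := 0; j < count; j++ {
		s.stamps = append(s.stamps, now)
	}
	return true
}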

Sliding window method

As mentioned above, we need finer-grained time slices to smooth out the counter method. But this is not easy. On the one hand, high-precision clocks depend on the underlying hardware, and the level of support varies greatly. On the other hand, how fine is fine enough, 1ms or 1ns, is itself a question, and different scenarios may answer it differently.

So we need to change perspective: instead of chasing an ever finer fixed time slice, we use a sliding window to reach, indirectly, any effective slice size.

The so-called sliding window method means that when a request arrives, we look back over a window of fixed length (such as 1s) ending at that request: if the number of requests already accepted within the window has not exceeded the limit, the request is passed; otherwise it is rejected. In this way, no matter what rate requests come in at, the rate-limit check is always performed over the one window ending at the request point. Because this window moves with the request point, we call it a sliding window.

The sliding window is a general concept widely used in computer science: for example, the string matching in the LZ77 family of compression algorithms, flow control in TCP communication, and many other scenarios all use sliding windows.

The idea behind the sliding window method is easy to understand.

However, it is not easy to implement programmatically in rate-limiting scenarios, because of the amount of bookkeeping required to track and count the number of authorized requests in the window preceding any given point.

We already know that we can store a sliding log of every request and then sum backwards along the clock to decide whether a new request crosses the threshold. That is one way to do it, but if you are building a high-throughput API server, the storage and computation costs are probably not trivial.

Another method is to divide the window into, say, 8 slots and record the number of authorized requests in each slot separately. When a new request arrives and needs to be checked against the sliding window, we only have to sum the 8 slots covering the window. This significantly reduces computation and memory consumption, at the cost of being a coarser version of the sliding window; it is still much finer than the plain counter method. A rough sketch of this segmented idea follows.
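
Here is our own rough, non-thread-safe sketch of that segmented window; the 8-slot split and the names are illustrative assumptions, not library code:

// slidingWindow splits the observation window into 8 slots and keeps only a
// per-slot admitted count, so the backward sum is cheap but slightly coarse.
type slidingWindow struct {
	Maximal  int
	Window   time.Duration
	slots    [8]int    // admitted count per slot
	current  int       // index of the slot the "current" time falls into
	slotTick time.Time // start time of the current slot
}

func newSlidingWindow(maximal int, window time.Duration) *slidingWindow {
	return &slidingWindow{Maximal: maximal, Window: window, slotTick: time.Now()}
}

func (s *slidingWindow) Take(count int) bool {
	now := time.Now()
	slotLen := s.Window / 8

	if now.Sub(s.slotTick) >= s.Window {
		// idle for a whole window: everything has slid out, reset all slots
		s.slots = [8]int{}
		s.slotTick = now
	}
	for now.Sub(s.slotTick) >= slotLen {
		// advance slot by slot, clearing counts that slid out of the window
		s.current = (s.current + 1) % 8
		s.slots[s.current] = 0
		s.slotTick = s.slotTick.Add(slotLen)
	}

	sum := 0
	for _, c := range s.slots {
		sum += c
	}
	if sum+count > s.Maximal {
		return false
	}
	s.slots[s.current] += count
	return true
}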

Spending a large amount of CPU or memory on the rate-limiting task itself is not acceptable in high-frequency or HPC settings, so production systems do not necessarily implement the sliding window approach literally.

Leaky Bucket

Basic principle

Since the number of already-authorized requests in an arbitrary window is hard to obtain cheaply, we can take the opposite tack: imagine a bucket of fixed capacity (say 100 drops). Water (requests) pours in from the tap at whatever rate requests arrive, while a small hole at the bottom lets water out at a constant rate, roughly one drop every 10ms. So no matter how many requests come in, they are constrained by the bucket, and at most as many requests as the bucket can hold will ever get through.

If too many requests arrive, the tap pours too fast, the bucket overflows, and the excess requests are discarded.

If you do not want to discard the overflow, you typically wait (spin or block) while the bucket is full and keep submitting until every request has been passed; that is, you pass them at the cost of time. In other words, you find an extra bucket big enough to catch the overflow, and later pour it back under the tap.

Thus we arrive at the bucket-shaped version of rate limiting: the leaky bucket method.

(Figure from: dev.to/swyx/networ…)

Depending on the outflow strategy, non-uniform incoming requests can either be released as soon as they are permitted, or be reshaped and released at a constant rate (e.g., one request every 10ms).

Obviously, if you do not want to discard the excess requests, you must ensure that the total number of requests over a period of time does not exceed the rate limit; otherwise the blocked requests will pile up into a bottleneck.

Open source implementation

In many open source implementations, the leaky bucket method is restricted to the classical mode, i.e. the uniform-outflow mode (also known as the leaky bucket algorithm with traffic shaping).

A typical implementation (e.g. kevinms/leakybucket-go) might look like this:

// Makes it easy to test time based things.
var now = time.Now

// LeakyBucket represents a bucket that leaks at a constant rate.
type LeakyBucket struct {
	// The identifying key, used for map lookups.
	key string

	// How large the bucket is.
	capacity int64

	// Amount the bucket leaks per second.
	rate float64

	// The priority of the bucket in a min-heap priority queue, where p is the
	// exact time the bucket will have leaked enough to be empty. Buckets that
	// are empty or will be the soonest are at the top of the heap. This allows
	// for quick pruning of empty buckets that scales very well. p is adjusted
	// any time an amount is added to the Queue().
	p time.Time

	// The index is maintained by the heap.Interface methods.
	index int
}

func NewLeakyBucket(rate float64, capacity int64) *LeakyBucket {
	return &LeakyBucket{
		rate:     rate,
		capacity: capacity,
		p:        now(),
	}
}

func (b *LeakyBucket) Add(amount int64) int64 {
	count := b.Count()
	if count >= b.capacity {
		// The bucket is full.
		return 0
	}

	if !now().Before(b.p) {
		// The bucket needs to be reset.
		b.p = now()
	}
	remaining := b.capacity - count
	if amount > remaining {
		amount = remaining
	}
	t := time.Duration(float64(time.Second) * (float64(amount) / b.rate))
	b.p = b.p.Add(t)

	return amount
}

In the above implementation, Add() returns 0 to indicate that the bucket is full and that the request has failed.

Its implementation is modeled slightly differently from the intuitive picture of a bucket, which can take a moment to get used to.
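
Going only by the excerpt above, a usage sketch might look like this. It is illustrative; the real package normally hands buckets out through its collector types:

b := NewLeakyBucket(100, 100) // leaks ~100 units per second, capacity 100
if added := b.Add(1); added == 0 {
	// the bucket is full: this request is rejected
} else {
	// `added` units were accepted and will drain out over added/rate seconds
}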

In addition, uber-go/ratelimit implements a complete uniform-outflow leaky bucket. To achieve a uniform outflow it blocks inside Take(), not returning until the next 10ms boundary is reached, thus producing a precisely uniform output. Since several pieces of state are involved for each request, Uber updates them atomically through an unsafe.Pointer, which makes the code less intuitive, so we do not list it here.

Our version

This kind of classical algorithm is not always the best choice. The problem is that requests block inside Take() as they come in, so they may pile up at the entrance, and unwanted requests cannot be discarded effectively; as a result the anti-DDoS capability is poor. The benefit, of course, is that when the production scenario requires that incoming requests must not be dropped, they pile up harmlessly until downstream services eventually consume them, which is exactly what you want for peak shaving in microservices.

Non-blocking implementation

Based on these observations, we implemented a leaky bucket algorithm (full repo) that allows a non-blocking entry:

func New(maxCount int64, d time.Duration) *leakyBucket {
	return (&leakyBucket{
		int64(maxCount),
		make(chan struct{}),
		int64(d) / int64(maxCount),
		time.Now().UnixNano(),
		0,
	}).start(d)
}

type leakyBucket struct {
	Maximal     int64
	exitCh      chan struct{}
	rate        int64
	refreshTime int64 // in nanoseconds
	count       int64
}

func (s *leakyBucket) Count() int64     { return atomic.LoadInt64(&s.count) }
func (s *leakyBucket) Available() int64 { return int64(s.count) }
func (s *leakyBucket) Capacity() int64  { return int64(s.Maximal) }

func (s *leakyBucket) Close() {
	close(s.exitCh)
}

func (s *leakyBucket) start(d time.Duration) *leakyBucket {
	if s.rate < 1000 {
		log.Errorf("the rate cannot be less than 1000us, it's %v", s.rate)
		return nil
	}

	// fmt.Printf("rate: %v\n", time.Duration(s.rate))

	// go s.looper(d)
	return s
}

func (s *leakyBucket) looper(d time.Duration) {
	// nothing to do
}

func (s *leakyBucket) max(a, b int64) int64 {
	if a < b {
		return b
	}
	return a
}

func (s *leakyBucket) take(count int) (requestAt time.Time, ok bool) {
	requestAt = time.Now()

	s.count = s.max(0, s.count-(requestAt.UnixNano()-s.refreshTime)/s.rate*int64(count))
	s.refreshTime = requestAt.UnixNano()

	if s.count < s.Maximal {
		s.count += int64(count)
		ok = true
	}

	return
}

func (s *leakyBucket) Take(count int) (ok bool) {
	_, ok = s.take(count)
	return
}

func (s *leakyBucket) TakeBlocked(count int) (requestAt time.Time) {
	var ok bool
	requestAt, ok = s.take(count)
	for !ok {
		time.Sleep(time.Duration(s.rate - (1000 - 1)))
		_, ok = s.take(count)
	}
	time.Sleep(time.Duration(s.rate-int64(time.Now().Sub(requestAt))) - time.Millisecond)
	return
}

In this implementation, taking advantage of the non-blocking effect of Take(), you have the following customization options:

  1. First, try to get permission with a non-blocking Take().
  2. If not approved:
    1. If you don't want to discard the request, maintain a queue of your own to cache it and retry for permission later (see the sketch after this list).
    2. If you do want to discard out-of-limit requests, simply do nothing more.
  3. For approved requests, the outflow after Take() returns is not reshaped to a uniform rate.
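
For instance, option 2.1 might look like this caller-side sketch; the retry count and delay are our own illustrative choices, not part of the component:

// takeWithRetry does not discard a rejected request; instead it parks briefly
// and retries a few times before finally giving up and letting the caller
// decide what to do (drop it, return 429, queue it elsewhere, ...).
func takeWithRetry(l *leakyBucket, attempts int, delay time.Duration) bool {
	for i := 0; i < attempts; i++ {
		if l.Take(1) {
			return true // approved
		}
		time.Sleep(delay) // wait a little instead of dropping the request
	}
	return false // still over the limit after all retries
}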
Compatible with uber-go/ratelimit

However, if you are building rate limiter middleware for peak-shaving purposes, you might want to do what Uber does: never discard unapproved requests and let them out at a constant rate. That is fine too: although a little rough, we provide a TakeBlocked() variant that lets approved requests flow out uniformly. Take the test function as an example:

import "github.com/hedzr/rate/leakybucket"

func TestLeakyBucketLimiter(b *testing.T) {
	var counter int
	l := leakybucket.New(100, time.Second, false) // one req per 10ms
	defer l.Close()
	time.Sleep(300 * time.Millisecond)
	prev := time.Now()
	for i := 0; i < 20; i++ {
		ok := l.TakeBlocked(1)
		now := time.Now()
		counter++
		fmt.Println(i, now.Sub(prev))
		prev = now
		time.Sleep(1 * time.Millisecond)
	}
	b.Logf("%v requests allowed.", counter)
}

The result looks like this:

$ go test -v -race -test.run='^TestLeakyBucketLimiter$' ./leakybucket
=== RUN   TestLeakyBucketLimiter
0 9.468477ms
1 10.47512ms
2 12.589327ms
3 10.364861ms
4 11.820381ms
5 10.353834ms
6 12.504309ms
7 10.167151ms
8 11.623466ms
9 10.487361ms
10 10.72498ms
11 12.435839ms
12 11.73813ms
13 10.956558ms
14 11.893156ms
15 11.964504ms
16 11.856545ms
17 11.094344ms
18 10.978074ms
19 10.632583ms
    lb_test.go:36: 20 requests allowed.
--- PASS: TestLeakyBucketLimiter (0.53s)
PASS
ok      github.com/hedzr/rate/leakybucket       0.957s

If a Take fails, the request blocks until the next attempt succeeds; because of the rate = 10ms (1000ms / 100 requests) constraint, the observed intervals do not stray far from that boundary.

It’s not very precise, it’s a little clunky, but we got one, right?

Public API-oriented – Simplified Gin middleware

In a public API scenario, out-of-limit requests (especially malicious ones) must be discarded, so Uber's ratelimit version is not appropriate; the non-blocking version is the right choice.

A simplified gin middleware can be written like this:

func (r *Limiter) getRL(ctx *gin.Context) (rateapi.Limiter, error) {
	if r.limiter == nil {
		r.limiter = leakybucket.New(100, time.Second, false)
	}
	return r.limiter, nil
}

func (r *Limiter) SimpleMiddleware() gin.HandlerFunc {
	return func(ctx *gin.Context) {
		limiter, err := r.getRL(ctx)
		if err != nil {
			ctx.AbortWithError(429, err)
		} else if limiter != nil {
			if limiter.Take(1) {
				ctx.Writer.Header().Set("X-RateLimit-Remaining", fmt.Sprintf("%d", limiter.Available()))
				ctx.Writer.Header().Set("X-RateLimit-Limit", fmt.Sprintf("%d", limiter.Capacity()))
				ctx.Next()
			} else {
				err = errors.New("Too many requests")
				ctx.Writer.Header().Set("X-RateLimit-Remaining", fmt.Sprintf("%d", limiter.Available()))
				ctx.Writer.Header().Set("X-RateLimit-Limit", fmt.Sprintf("%d", limiter.Capacity()))
				ctx.AbortWithError(429, err)
			}
		} else {
			ctx.Next()
		}
	}
}

By convention, a rejected request is notified with a Response header:
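
For example, a rejected request might come back roughly like this (illustrative; the status code and header names match what the middleware above sets):

HTTP/1.1 429 Too Many Requests
X-RateLimit-Limit: 100
X-RateLimit-Remaining: 0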

Gin middleware

As a refined version of the simplified code above, we provide a Gin rate-limit middleware that you can use directly:

import "github.com/hedzr/rate/middleware"

func engine() *gin.Engine {
	r := gin.Default()
	r.Use(middleware.NewLimiterForGin(time.Second, 100, func(ctx *gin.Context) (string, error) {
		key := ctx.Request.Header.Get("X-API-KEY")
		if key != "" {
			return key, nil
		}
		ctx.JSON(403, gin.H{"code": 2901, "message": "API key is missing"})
		return "", errors.New("API key is missing")
	}))
	return r
}

In the example, you provide a keygenFunc to the middleware. This function should return a unique key, such as the API key presented by an API client, and the middleware maintains a separate rate limiter for each key.

Depending on your business scenario you can choose a more appropriate keygenFunc. For example, instead of checking the X-API-KEY header, you can check an X-User-Token header, or use the IP address / GeoIP as the key, and thereby apply rate-limiting policies per user, per IP address, or per region.
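
As a purely illustrative variant, an IP-keyed keygenFunc might look like this; gin's ClientIP() helper is used, and everything else is unchanged from the example above:

keyByIP := func(ctx *gin.Context) (string, error) {
	// one independent rate limiter per client IP
	return ctx.ClientIP(), nil
}
r.Use(middleware.NewLimiterForGin(time.Second, 100, keyByIP))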

Of course, in an HPC scenario you may need to extend the middleware we provide further. For example, with a very large set of API keys you will need measures such as sharding across nodes, to prevent the memory blow-up caused by keeping a large number of rate limiters on a single node.

Once you adopt a multi-node, distributed strategy, you also need to adopt a distributed rate-limiting core algorithm; the full repo we provide does not directly support distributed scenarios.

Token Bucket

The leaky bucket method shapes incoming requests (the excess is removed and only the approved ones leak through). Although it is a fresh way to sidestep the sliding window, which is hard to code efficiently, it is still somewhat removed from the sliding-window way of thinking.

By comparison, the token bucket method is closer to being a tuned version of the sliding window method.

In general, we assume that a producer produces tokens at a constant rate (one per 10ms), and that a request consumes one token when granted. As a result, it is no longer a matter of how many requests have been authorized per unit of time, but of whether tokens are available in the token bucket.

(Figure from: dev.to/swyx/networ…)

Since tokens are always manufactured at a uniform rate, incoming requests are curbed uniformly (in each 10ms period, all requests beyond the available tokens are refused), which gives the token approach a clear advantage over the previous methods in this respect.

If, on the other hand, the entry request cannot be discarded, the token method incurs additional blocking overhead: failed requests need to be retried in a slower queue (or simply blocked in place) until the token is obtained and passed.

So we have implementation code like this:

func New(maxCount int64, d time.Duration) *tokenBucket {
	return (&tokenBucket{
		int32(maxCount),
		d,
		int64(d) / int64(maxCount),
		make(chan struct{}),
		int32(maxCount),
	}).start(d)
}

type tokenBucket struct {
	Maximal int32
	period  time.Duration
	rate    int64
	exitCh  chan struct{}
	count   int32
}

func (s *tokenBucket) Count() int32     { return atomic.LoadInt32(&s.count) }
func (s *tokenBucket) Available() int64 { return int64(s.count) }
func (s *tokenBucket) Capacity() int64  { return int64(s.Maximal) }

func (s *tokenBucket) Close() {
	close(s.exitCh)
}

func (s *tokenBucket) start(d time.Duration) *tokenBucket {
	if s.rate < 1000 {
		log.Errorf("the rate cannot be less than 1000us, it's %v", s.rate)
		return nil
	}

	go s.looper(d)
	return s
}

func (s *tokenBucket) looper(d time.Duration) {
	ticker := time.NewTicker(d / time.Duration(s.Maximal))
	// fmt.Printf("token building speed is: 1req/%v\n", d/time.Duration(s.Maximal))
	defer func() {
		ticker.Stop()
	}()
	for {
		select {
		case <-s.exitCh:
			return
		case <-ticker.C:
			vn := atomic.AddInt32(&s.count, 1)
			if vn < s.Maximal {
				continue
			}

			vn %= s.Maximal
			if vn > 0 {
				atomic.StoreInt32(&s.count, s.Maximal)
			}
		}
	}
}

func (s *tokenBucket) take(count int) bool {
	if vn := atomic.AddInt32(&s.count, -1*int32(count)); vn >= 0 {
		return true
	}
	atomic.StoreInt32(&s.count, 0)
	return false
}

func (s *tokenBucket) Take(count int) bool {
	ok := s.take(count)
	return ok
}

func (s *tokenBucket) TakeBlocked(count int) (requestAt time.Time) {
	requestAt = time.Now().UTC()
	ok := s.take(count)
	for !ok {
		time.Sleep(time.Duration(s.rate - (1000 - 1)))
		ok = s.take(count)
	}
	// time.Sleep(time.Duration(s.rate-int64(time.Now().Sub(requestAt))) - time.Millisecond)
	return requestAt
}

In our implementation, the advanced features described for the leaky bucket are supported here as well, such as the non-blocking and blocking variants. The trailing sleep in TakeBlocked() is currently commented out because, for the time being, we do not want the token bucket to do extra work for a uniform outflow: since the tokens themselves are produced uniformly, the outflow is already fairly uniform in most cases apart from some boundary conditions. That sleep would only make it marginally more uniform, which we do not think is necessary.

Other features

In our token bucket implementation, the features already described for the leaky bucket carry over easily, thanks to the shared algorithm abstraction. Concretely, our rate component is used like this:

package main

import (
	"fmt"
	"github.com/hedzr/rate"
	"time"
)

func main() {
  // rate.LeakyBucket, rate.TokenBucket, ...
	l := rate.New(rate.LeakyBucket, 100, time.Second)
	for i := 0; i < 120; i++ {
		ok := l.Take(1)
		if !ok {
			fmt.Printf("#%d Take() returns not ok, counter: %v\n", i, rate.CountOf(l))
			time.Sleep(50 * time.Millisecond)
		}
	}
}

Similarly, using our Rate component as middleware in the Web Server framework:

import (
  "github.com/hedzr/rate"
  "github.com/hedzr/rate/middleware"
)

func engine() *gin.Engine {
	config := &middleware.Config{
		Name:          "A",
		Description:   "A",
		Algorithm:     string(rate.TokenBucket),
		Interval:      time.Second,
		MaxRequests:   1000,
		HeaderKeyName: "X-API-TOKEN",
		ExceptionKeys: nil,
	}
	r := gin.Default()
  rg := r.Group("/api")
  rg.Use(middleware.ForGin(config))
  // ...
  
  config2 := ...
  r.Post("/login", middleware.ForGin(config2), loginHandler)
	return r
}

If you are using our CMDR, middleware.LoadConfig("server.rate-limits") or middleware.LoadConfigForGin("server.rate-limits", rg) can simplify your code further.

Enhancements to Looper

In the looper() implementation, we use a constant-rate producer to manufacture tokens.

Considering other potential rate-limiting strategies, a variable-speed producer can also be used here. Some typical possibilities are:

  1. A logN or ln based acceleration scheme.
  2. Smoothed constant rate: a time slice is used as the statistical unit, and the rate is adjusted continuously, so that after a traffic burst has consumed most of the tokens, the token-issuing speed is reduced for a while. Google Guava calls this SmoothBursty.
  3. Warm-up (SmoothWarmingUp in Guava): in the first part of a time slice more tokens are issued at a faster speed, with the issuing speed decelerating; the remaining tokens are then released at a constant rate for the rest of the slice. A rough sketch of this idea follows the list.
  4. More: depending on the specific production scenario, you can customize the looper algorithm to provide additional policies (not implemented in the sample code).
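
As a very rough illustration of the warm-up idea, here is our own toy sketch built on the tokenBucket above; the ramp and step count are simplifications of ours, not Guava's algorithm:

// looperWarmUp issues the first `steps` tokens at a decreasing speed: the
// interval ramps from steady/steps up to the steady interval of one token per
// s.rate nanoseconds, after which it falls back to the constant-rate looper().
func (s *tokenBucket) looperWarmUp(steps int) {
	steady := time.Duration(s.rate)
	for i := 1; i <= steps; i++ {
		// early tokens are issued faster; the interval grows step by step
		t := time.NewTimer(steady * time.Duration(i) / time.Duration(steps))
		select {
		case <-s.exitCh:
			t.Stop()
			return
		case <-t.C:
			if vn := atomic.AddInt32(&s.count, 1); vn > s.Maximal {
				atomic.StoreInt32(&s.count, s.Maximal) // clamp at capacity
			}
		}
	}
	// warm-up finished: continue producing at the constant steady rate
	s.looper(s.period)
}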

It is worth noting that the token method is not only easy to understand but also extremely easy to implement, and among these approaches it has the lowest overhead.

Among open source implementations, juju/ratelimit implements the token bucket algorithm in much the same way.

Comparison

In previous reflections, we have already mentioned some characteristics of each algorithm.

To close this article, let's briefly summarize the similarities and differences of these typical algorithms:

  • Counter method: the easiest to implement and very efficient, but the worst at handling bursts.
  • Sliding window method: easy to understand in theory (shrink the measurement granularity as far as possible to smooth out the counter method), but hard to implement efficiently in code.
  • Leaky bucket method: a better choice; it shapes the input flow and (usually) drains it at a uniform rate.
  • Token bucket method: a near-optimal trade-off; it shapes the outflow and releases it at an approximately uniform rate (usually, or optionally), and it is very easy to extend with custom shaping for specific business scenarios.

Depending on whether the input traffic may be discarded, take care to choose the appropriate mode when implementing a rate limiter:

  1. Decide whether to use an asynchronous mode (the limiter discards traffic itself) or a non-blocking mode (discarding is left to the caller)
    1. The asynchronous mode is not well suited for use as an API rate limiter or as web server middleware
    2. Asynchronous and non-blocking modes give the best performance for accepting incoming traffic
  2. Then choose between the non-blocking mode and the blocking mode

In our implementation at github.com/hedzr/rate, the blocking mode has not been tuned, so the outflow is not precise. If you are interested in this part of the code, the best source references are:

  1. Bucket: github.com/uber-go/rat…
  2. Token: github.com/juju/rateli…

TODO

This article only touches briefly on the rate limit problem in HPC; a detailed discussion will have to wait for another time.

Here is a brief preview:

  1. In an HPC scenario, rate limiting needs to be implemented in a distributed way
  2. In certain distributed scenarios, the distributed rate limit can be lifted to the entry point (usually a load balancer or a microservice API gateway) and handled uniformly there.
  3. Even when it is handled by the API gateway, it is still a distributed rate limiter.

References

  • github.com/uber-go/rat…
  • github.com/juju/rateli…
  • github.com/kevinms/lea…
  • github.com/hedzr/rate
  • en.wikipedia.org/wiki/Bandwi…
  • en.wikipedia.org/wiki/Token_…
  • en.wikipedia.org/wiki/Leaky_…
  • Networking Essentials: Rate Limiting and Traffic Shaping – DEV Community
  • Scaling your API with rate limiters
  • How To Design A Scalable Rate Limiting Algorithm
  • NGINX Rate Limiting