Wobble Bug in Github's Rate Limiter

Summary#

When github implemented their rate limiter using redis they saw some pretty weird behavior. In this blog we discuss about what happened and how they finally fixed it.

In this article we’ll discuss about

Understanding what is rate limiting
How Github implemented it
Weird behavior that the engineers observed
The fix

Understanding what is rate limiting#

Rate limiting is a technique used in computer systems, particularly in APIs, networks, and applications, to control the rate at which requests are processed or accepted. It is a crucial mechanism for maintaining the performance, reliability, and security of systems by preventing excessive usage or abuse.

How Github implemented it#

Initial rate limiter setup of github was built using Memcached. They originally had a single memcached server which powered their caching use cases along with their rate limiter use cases.

Now one of the key problems that engineers in github faces was when the data usage sometimes got too big the memcached server sometimes accidentally removed rate limiting keys. This lead to a lot of discrepancies and inconsistencies in the user experience.

Another problem that they faced was.. when they started there data was present in a single data-center but as they grew their data got distributed to multiple data centers. Now they wanted a separate memcached cluster per data center, hence, sharding was required and they decided to re-architect their rate limiter with redis.

The main reasons for choosing redis were:

Simple sharding and replication setup.
Redis TTL for auto expiration of key.
Support for LUA.

Weird behavior that they observed#

The issue that happened was “wobbling”. So If you hit the following curl on api.github.com ..

curl -I https://api.github.com/users/saksham-sharma-99

the response headers you’ll get are as follows

...
access-control-allow-origin: *
strict-transport-security: max-age=31536000; includeSubdomains; preload
x-frame-options: deny
x-content-type-options: nosniff
x-xss-protection: 0
referrer-policy: origin-when-cross-origin, strict-origin-when-cross-origin
content-security-policy: default-src 'none'
server: github.com
x-ratelimit-limit: 60
x-ratelimit-remaining: 59
x-ratelimit-reset: 1724064530
x-ratelimit-resource: core
x-ratelimit-used: 1
accept-ranges: bytes
content-length: 1502
...

headers which are important for us are

x-ratelimit-limit: maximum number of requests i am allowed to make until the next reset time
x-ratelimit-remaining: number of request left with me or which i can make until the next reset time
x-ratelimit-reset: the epoch time at which the rate-limit will be reset.

Now the problem that occurred was wobbling of the header x-ratelimit-reset as it sometimes returned 1724064530 and other times it returned 1724064531.

Now this happened because of a very minor edge case that the engineers at github sort of overlooked.
The way this is header is formed is..
(Step 1) Request hits the API server ⤵
(Step 2) API Server asks Redis Cluster for TTL ⤵
(Step 3) Redis cluster executes some LUA scripts and returns TTL ⤵
(Step 4) API server returns Time.now() + TTL in "x-ratelimit-reset" header

This all seems pretty direct and straight in the first glance but it neglects a minor time period of processing time. Lets understand it with 2 scenarios keeping the ttl as 5s.

1. When the first request enters the server at timestamp 1000#

lets say request enters the server at time timestamp 1000.
(Step 2) and (Step 3) takes somewhere around 200ms to be executed.
Hence (Step 4) starts executing at 1000.2 and returns x-ratelimit-reset headers as 1005.2, which is ~ 1005.

2. When the second request enters the server at timestamp 1001.9#

now since one second has passed, TTL is now 4s
(Step 2) and (Step 3) takes somewhere around 200ms to be executed.
Hence (Step 4) starts executing at 1002.1 and returns x-ratelimit-reset headers as 1006.2, which is ~ 1006.

As you can see clearly the x-ratelimit-reset wobbled from 1005 to 1006.

The fix#

There were multiple fixes available for this

Cut down the TTL to milliseconds instead of seconds. But this would’ve reduced the wobbling instead of eradicating it
Instead of computing the x-ratelimit-reset at API server, if redis could send it directly then problem could be solved using redis time command. Unfortunately it wasn’t possible on versions 5 and below (which was required by some of their other components).

So they took the final option

Store the TTL in the database (at every reset time) and directly send it in response instead of calculating it at API server.

Thats all in this one folks.. let me know where i messed up in writing this :)