Uber System Design

Functional Requirements

  • rider searches for a cab (with source & destination)
  • Uber shows ETA and fare estimate for different cab types (Go, Moto, Sedan, etc.)
  • rider books a cab
  • Uber matches nearby drivers
  • Uber sends the request to nearby drivers -> sends notifications
  • driver accepts/denies the ride request
  • if the driver accepts, they navigate to the pickup location
  • after pickup (verified with an OTP), the driver navigates to the drop location
  • driver ends the trip
  • rider makes the payment

Out of Scope

  • profile creation
  • ratings (driver and rider)
  • notification service
  • monitoring
  • analytics

Non-Functional Requirements

  • CAP theorem - highly available for ETA, fare estimates, etc.
  • highly consistent for driver matching - no double booking (one ride assigned to two drivers, or one driver assigned to two rides)
  • latency - as low as possible
  • durable - ride details must be persisted
  • scalable - for peak hours and for new users

Entities

  • rider
  • driver
  • estimate
  • ride
  • location
  • cab

APIs

get fare estimate

GET /estimates?src_lat={}&src_long={}&dest_lat={}&dest_long={}

    -> estimateId, estimate, ETA for different cab types

book a ride

POST /rides

{
    estimateId,
    cabType
}

location updates

POST /location/update

{ 
  latitude,
  longitude
}

driver accepts/denies ride request

PATCH /rides/driver/accept

{ 
    rideId,
    accept (true/false)
}

driver updates the ride for pickup and dropoff

PATCH /rides/driver/update

{
    rideId,
    update (pickedUp/droppedOff)
}

make payment

POST /payments

{
    rideId,
    paymentInfo {}
}

Scale

  • 100 million total riders
  • 10 million daily active riders - 2 rides per rider (on average)
  • 1 million drivers

ride requests = 20 million per day ≈ 20 × 10^6 / 10^5 s = 200 TPS (approximating a day as ~10^5 seconds)

peak TPS = 3x = 600 TPS

location updates every 5 s from 1 million drivers -> 10^6 / 5 = 200k TPS
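A minimal back-of-envelope sketch of the arithmetic above (the 10^5-seconds-per-day approximation and 3x peak factor are the same assumptions used in the text):

    # Back-of-envelope estimates (a day is approximated as ~10^5 seconds, as above).
    daily_active_riders = 10_000_000
    rides_per_rider = 2
    seconds_per_day = 100_000
    peak_factor = 3

    ride_requests_per_day = daily_active_riders * rides_per_rider   # 20 million
    avg_ride_tps = ride_requests_per_day / seconds_per_day          # ~200 TPS
    peak_ride_tps = avg_ride_tps * peak_factor                      # ~600 TPS

    drivers = 1_000_000
    location_update_interval_s = 5
    location_update_tps = drivers / location_update_interval_s      # 200k TPS

    print(avg_ride_tps, peak_ride_tps, location_update_tps)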

High Level Design

[High-level design diagram]

Why do we need a WebSocket connection?

  • frequent location updates
  • the server needs to push ride requests to the drivers
  • the server needs to push real-time ride status updates (going through a notification service might add delay)

Why not use SSE?

  • SSE is not well supported on all mobile devices
  • WebSocket has lower latency
  • a bidirectional connection is required (SSE is server-to-client only)

So we use a bidirectional connection like WebSocket.
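A minimal sketch of a driver client pushing location updates over such a connection; it assumes the third-party Python `websockets` package, and the endpoint URL, message schema, and `get_current_location` callback are hypothetical:

    import asyncio
    import json
    import websockets  # third-party package: pip install websockets

    async def send_location_updates(driver_id, get_current_location):
        # Hypothetical endpoint; the real URL and message format are assumptions.
        async with websockets.connect("wss://api.example.com/driver/ws") as ws:
            while True:
                lat, lon = get_current_location()
                await ws.send(json.dumps({
                    "type": "location_update",
                    "driverId": driver_id,
                    "latitude": lat,
                    "longitude": lon,
                }))
                # The same socket lets the server push ride requests back to the driver:
                #   msg = await ws.recv()
                await asyncio.sleep(5)  # every 5 seconds, as estimated above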

Handle frequent driver location updates

  • 1 million daily active drivers

  • each sends an update every 5 s -> 10^6 / 5 = 200k TPS

  • a single-node SQL database typically supports at most ~10k TPS

  • DynamoDB supports 100k+ TPS

  • Redis supports up to ~1 million TPS

  • can we batch updates to the database to reduce write overhead?

  • batching introduces delay in driver location updates -> driver matching might use stale positions

Driver Matching Logic

The rider sends the source location; let's say the service will find drivers within a 2 km radius.

[diagram: drivers within a 2 km radius of the pickup location]

GeoHashing

[diagram: geohashing]

Quad Tree

[diagrams: quad tree spatial indexing]

  • the map service uses geohashing / quad tree indexing
  • using the ride's source location, we get the source chunk from the map service
  • we get the adjacent chunks from the map service
  • a list of drivers belonging to those chunks is fetched from the location DB (ordered by distance, limited to x)
  • the driver matching service sends requests to the drivers one by one
  • if a driver rejects, move on to the next one
  • if a driver accepts, ask the ride service to update the ride details with the driver and change the ride status to accepted
  • send a notification to the rider

We have driver locations, but how do we know which chunks they belong to? Again, take help from the map service!
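A rough sketch of this sequential matching loop; every helper here (`get_source_chunk`, `get_adjacent_chunks`, `get_drivers_in_chunks`, `offer_ride`, `assign_driver`) is a hypothetical placeholder for the map service, location DB, and ride service calls described above:

    def match_driver(ride_id, src_lat, src_lon):
        # 1. Ask the map service which chunk the pickup point falls in,
        #    plus its adjacent chunks.
        chunk = get_source_chunk(src_lat, src_lon)          # map service (hypothetical)
        chunks = [chunk] + get_adjacent_chunks(chunk)       # map service (hypothetical)

        # 2. Fetch candidate drivers in those chunks from the location DB,
        #    ordered by distance and capped at the closest few.
        candidates = get_drivers_in_chunks(chunks, order_by="distance", limit=10)

        # 3. Offer the ride to drivers one by one; stop at the first accept.
        for driver in candidates:
            if offer_ride(driver.id, ride_id):              # push request over WebSocket
                assign_driver(ride_id, driver.id)           # ride service: status -> accepted
                return driver.id
        return None                                          # no driver accepted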

How to do proximity search for drivers without using the map service?

  • there is a lot of overhead in the previous driver matching approach

  • let's use a database with geospatial indexing capabilities - options: Postgres with PostGIS, Redis

Can we use Redis?

  • we just need to store the latest location of each driver in a cache

driver_id, lat, long -> 3 × 8 bytes ≈ 25 bytes per driver; for 1 million drivers -> ~25 × 10^6 bytes = 25 MB

We can store the latest driver locations in Redis -> a distributed cache

If we also want to persist the raw location update events, we can use Cassandra/DynamoDB.

Redis supports up to ~1 million TPS

  • GEOADD, GEOSEARCH -> search within a radius

[diagram: Redis GEOADD / GEOSEARCH]
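A minimal sketch of these two Redis commands, using redis-py's generic `execute_command` so the raw commands stay visible; the key name, member format, and coordinates are illustrative assumptions:

    import redis

    r = redis.Redis(host="localhost", port=6379)

    # On every location update: upsert the driver's latest position.
    # GEOADD driver_locations <longitude> <latitude> <member>
    r.execute_command("GEOADD", "driver_locations", 77.5946, 12.9716, "driver:42")

    # On a ride request: find drivers within 2 km of the pickup point,
    # closest first, capped at 10 candidates.
    # GEOSEARCH driver_locations FROMLONLAT <lon> <lat> BYRADIUS 2 km ASC COUNT 10
    nearby = r.execute_command(
        "GEOSEARCH", "driver_locations",
        "FROMLONLAT", 77.60, 12.97,
        "BYRADIUS", 2, "km",
        "ASC", "COUNT", 10,
    )
    print(nearby)  # e.g. [b"driver:42", ...]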

What about a Redis failure? It's a distributed cache, so a full outage is rare -> have robust monitoring and failover strategies.

  • the system recovers quickly after a failure, since every driver sends a location update every 5 seconds and the cache repopulates itself

Driver Matching Consistency (No double booking)

One ride, multiple drivers?

  • we scale the driver matching service based on rideId (consistent hashing; see the sketch below)
  • only one server handles a given ride
  • that server sends requests to drivers sequentially
  • if one denies, it sends the request to the next driver
  • so no two drivers can be booked at the same time for one ride
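A minimal sketch of routing rides to matching-service instances with a consistent hash ring; the instance names and virtual-node count are arbitrary, and it only illustrates that a given rideId always lands on the same instance:

    import bisect
    import hashlib

    class ConsistentHashRing:
        def __init__(self, instances, vnodes=100):
            # Place each instance at many points ("virtual nodes") on the hash ring.
            self.ring = sorted(
                (self._hash(f"{inst}#{i}"), inst)
                for inst in instances
                for i in range(vnodes)
            )
            self.keys = [h for h, _ in self.ring]

        @staticmethod
        def _hash(key):
            return int(hashlib.md5(key.encode()).hexdigest(), 16)

        def get_instance(self, ride_id):
            # Walk clockwise to the first virtual node at or after the key's hash.
            idx = bisect.bisect(self.keys, self._hash(ride_id)) % len(self.keys)
            return self.ring[idx][1]

    ring = ConsistentHashRing(["matcher-1", "matcher-2", "matcher-3"])
    print(ring.get_instance("ride-123"))  # the same rideId always maps to the same instance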

One driver, multiple rides?

  • we scale the driver matching service based on rideId

  • two ride requests go to the same service instance

  • both get nearby driver lists

  • it's possible that one driver is present in both lists

  • we don't want to send ride requests for two different rides to one driver simultaneously; if the driver accepts both => double booking :(

Can we have a service-instance-level lock? One service instance does not send requests to one driver for multiple rides simultaneously.

  • two ride requests go to different service instances
  • both get nearby driver lists
  • it's possible that one driver is present in both lists
  • we don't want to send ride requests for two different rides to one driver simultaneously; if the driver accepts both => double booking :(

Service-instance-level lock on a driver with TTL?

  • multiple service instances can still lock the same driver simultaneously - so what's the use of this lock?
  • the lock only works within a single service instance

Database update for driver status plus an application-level lock with TTL?

  • update the driver status to outstanding_request
  • manage the lock in the application; if the driver accepts, update the driver status to accepted or in_ride
  • if the driver denies, release the lock and update the status to available

What if the service instance crashes? The driver status remains stuck in outstanding_request.

We would need to run a cron job to find such stuck driver statuses and clear them out. Extra overhead!

Distributed lock in Redis with TTL

  • the driver id is locked for, say, 10 s; if the service instance crashes, the TTL releases the lock automatically

[diagram: distributed driver lock in Redis with TTL]
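A minimal sketch of this driver lock using Redis `SET NX EX` via redis-py; the 10 s TTL matches the value assumed above, and the key names are illustrative:

    import redis

    r = redis.Redis(host="localhost", port=6379)

    def try_lock_driver(driver_id, ride_id, ttl_s=10):
        # SET key value NX EX ttl -> succeeds only if no other instance holds the lock.
        # If the matching service crashes, the TTL releases the lock automatically.
        return bool(r.set(f"driver_lock:{driver_id}", ride_id, nx=True, ex=ttl_s))

    def release_driver(driver_id, ride_id):
        # Release only if we still own the lock (e.g. the driver denied the request).
        # Note: this check-then-delete is not atomic; a real system would use a Lua script.
        if r.get(f"driver_lock:{driver_id}") == ride_id.encode():
            r.delete(f"driver_lock:{driver_id}")

The lock plus TTL keeps two matcher instances from offering different rides to the same driver at the same time, and the TTL handles instance crashes without a cleanup cron job.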

How do we scale the Ride Service?

  • make it stateless and put a queue in front of the matching service

How frequently do drivers need to send location updates?

  • the client can use an adaptive strategy (based on on-device sensors and a client-side algorithm), as sketched below
  • if the driver is not moving, don't send updates
  • if moving slowly, send less frequent updates
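A minimal client-side sketch of that adaptive strategy; the speed thresholds and intervals are arbitrary assumptions:

    def next_update_interval_s(speed_kmph):
        # Decide how long the driver app waits before the next location update.
        if speed_kmph < 1:       # effectively stationary -> skip updates for now
            return None
        if speed_kmph < 15:      # moving slowly -> less frequent updates
            return 15
        return 5                 # normal driving -> update every 5 seconds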

How to scale geographically and for peak hours?

  • shard the system geographically

  • geo-sharding, consistent hashing

  • copies of the system deployed in multiple data centres, with data replicated worldwide
