Scaling Whitepages With Redis on Flash

Stay Aware Engineering September 13, 2018

As a global leader in digital identity verification, online contact directories, and fraud screening, we at Whitepages have a ton of data to wrangle. We recently switched to Redis on Flash to handle that volume of data without sacrificing response time or throughput for our millions of users.

Varun, a Principal Architect at Whitepages, gives his take on the switch to Redis, with a technical behind-the-scenes look at how it keeps our data flowing smoothly.


At Whitepages we have a Global Identity Graph (IGraph) powering almost all of our products. Whether it is a user search for “John Smith in Seattle,” or an e-commerce client querying if a transaction is fraudulent, every query goes through a series of calls ending in IGraph. IGraph is a service backed by a data-store which contains all entities including Person, Business, Phone, Address, Email, URL and their corresponding attributes and links. IGraph Retrieval Service (IRS) fronts the data-store to serve hundreds of thousands of requests per second.

Key Requirements
  • Sub-millisecond latency
  • Scalability to over 200K requests/sec
  • Linearly scalable storage
  • Fault tolerance: No single point of failure, instant failover

IGraph: A Little History

Whitepages started as an online directory of people, phone numbers, and addresses. PostgreSQL was sufficient to power all the queries; the volume of calls was limited, and low latency was not a major consideration for consumer queries. As the company grew, we soon realized that Postgres could not keep up with the growing customer base, storage, and feature set we wanted to support. What we needed was a graph-friendly data-store: something distributed, linearly scalable, and able to handle high QPS.

Enter the graph database: an idea not unique to Whitepages, but rather a reflection of a growing architecture choice across the industry. We were answering graph-like queries over an ever-growing set of features. We wanted to stay close to open source and fundamental computer science, so a convincing choice was Titan backed by Cassandra. Titan-Cassandra helped us grow our data and feature set; however, we were not satisfied with its latency or reliability.

With the continuous increase in query volume and data, our product performance became slow. Titan would often slip into a bad, non-responsive state, bringing down all our products. It was extremely difficult to debug the root cause, and recovery often required a full restart or a reload of the data. We needed more reliability.

There were several learnings from our Titan journey, the most important being our need to simplify. Graph databases come with a wide variety of generic, rich features, each of which adds complexity and weighs on the performance of the overall architecture. Tuning the underlying hardware, memory management, queries, and the overall data model was non-trivial and required a significant time investment.

We decided to move to a simpler implementation. Our solution was to lay the graph out as an adjacency list and store it in Redis, a fast key-value store. The key is a simple UUID representing the entity, and the value is a JSON document containing the entity's attributes and links:

"8f4c5736-5af0-4eaf-b10c-3ac25dd3a30f": {
  "first_name": "John",
  "last_name": "Smith",
  "DOB": "27-02-1970",
  "links": {
    "person": ["fbff57cb-a1ed-4764-9102-79e3fe98e4ea"],
    "address": ["44c64738-9a2f-49d8-89c5-39d64c59f834", "8a1804ca-8d64-4b23-aa26-97f426421295"],
    "phone": ["552a071d-ae3e-43d4-b6c1-0fd7d58ca42e"]
  }
}
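
For illustration, here is a minimal sketch of how such a record could be written and read with a plain Redis client. This is not our production code: it assumes the redis-py library and a locally reachable Redis endpoint, and simply mirrors the JSON layout shown above.

import json
import redis

# Assumption: a Redis endpoint on localhost; in production this would be the IRS data-store.
r = redis.Redis(host="localhost", port=6379)

entity_id = "8f4c5736-5af0-4eaf-b10c-3ac25dd3a30f"
entity = {
    "first_name": "John",
    "last_name": "Smith",
    "DOB": "27-02-1970",
    "links": {
        "person": ["fbff57cb-a1ed-4764-9102-79e3fe98e4ea"],
        "address": ["44c64738-9a2f-49d8-89c5-39d64c59f834",
                    "8a1804ca-8d64-4b23-aa26-97f426421295"],
        "phone": ["552a071d-ae3e-43d4-b6c1-0fd7d58ca42e"],
    },
}

# Store the entity as a JSON string under its UUID key.
r.set(entity_id, json.dumps(entity))

# A lookup is a single GET followed by a JSON decode.
record = json.loads(r.get(entity_id))
print(record["links"]["phone"])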

Our features, and hence the queries needed to support them, require calls no more than three levels deep. Hence the worst-case latency of the service is roughly three times the latency of the underlying store plus the network round-trip time. It was critical to have a low-latency data-store.

Here is an example. We want to get the telephone numbers of all the people currently living at a particular address:

Input: Address

Expected Output: Set of telephone numbers.

Flow:

Call 1: A Redis call with the UUID of the address. The result is the set of UUIDs of the people living at that address.

"44c64738-9a2f-49d8-89c5-39d64c59f834": {
  "street": "…",
  "country": "…",
  "links": {
    "person": ["8f4c5736-5af0-4eaf-b10c-3ac25dd3a30f", "fbff57cb-a1ed-4764-9102-79e3fe98e4ea"]
  }
}

Call 2: For all the person UUIDs returned in the above call, we want to get the telephone UUIDs associated with them. We sent a batch Redis request with the UUIDs: ["8f4c5736-5af0-4eaf-b10c-3ac25dd3a30f", "fbff57cb-a1ed-4764-9102-79e3fe98e4ea"]

We received a response something like:

["8f4c5736-5af0-4eaf-b10c-3ac25dd3a30f", "fbff57cb-a1ed-4764-9102-79e3fe98e4ea"]: {
  "first_name": "…",
  "last_name": "…",
  "links": {
    "phone": ["6d455736-5af0-4eaf-b10c-3ac25dd3a76c", "d34ce7cb-a1ed-4764-9102-79e3fe98e54d"]
  }
}

Call 3: For all the telephone UUIDs returned in the above call, we want to get the details. We sent a batch Redis request with the UUIDs: ["6d455736-5af0-4eaf-b10c-3ac25dd3a76c", "d34ce7cb-a1ed-4764-9102-79e3fe98e54d"]

We received a response something like:

["6d455736-5af0-4eaf-b10c-3ac25dd3a76c", "d34ce7cb-a1ed-4764-9102-79e3fe98e54d"]: {
  "phone_number": ["+1-206-323-4343", "+1425-4355-3344"],
  "carrier": ["…"],
  "line_type": ["…"]
}
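
Putting the three calls together, a minimal sketch of the traversal might look like the following. As before, this assumes redis-py and the JSON layout above; phones_for_address and mget_records are hypothetical helper names, not part of IRS, and the phone record is assumed to carry its number under "phone_number".

import json
import redis

r = redis.Redis(host="localhost", port=6379)

def mget_records(uuids):
    # One batch MGET for a list of UUID keys; JSON-decode the hits and skip misses.
    return [json.loads(v) for v in r.mget(uuids) if v is not None]

def phones_for_address(address_uuid):
    # Call 1: the address record lists the UUIDs of the people linked to it.
    address = json.loads(r.get(address_uuid))
    person_uuids = address["links"].get("person", [])

    # Call 2: batch-fetch the person records and collect their phone links.
    phone_uuids = []
    for person in mget_records(person_uuids):
        phone_uuids.extend(person["links"].get("phone", []))

    # Call 3: batch-fetch the phone records themselves.
    return [phone["phone_number"] for phone in mget_records(phone_uuids)]

print(phones_for_address("44c64738-9a2f-49d8-89c5-39d64c59f834"))

Whatever the fan-out, a depth-three query costs only three Redis round trips, which is where the worst-case latency bound above comes from.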

Redis served us just right for this fast key-value lookup.

Alternative Options

AWS ElastiCache clustered Redis was a sure bet if nothing else worked out, but it was going to be very costly. We would need a full 15 r3.8xlarge nodes to hold our data, and for fail-safety we needed at least one replica, which meant 30 r3.8xlarge nodes. Another big concern was our growing data with global expansion: every time we add a new country, we add several terabytes of data to our cluster, and ElastiCache Redis has a hard limit of 6 TB. Thus, we started exploring cheaper and more scalable options. Another limiting aspect of ElastiCache Redis was its standby-doing-nothing replicas: half of our nodes would sit idle. If a primary node failed, it took around one minute to fail over to the secondary, which meant all the keys in that partition would fail for a minute. This was not pleasant.

For potential alternatives, we started with the most popular stores: MongoDB, Cassandra, and Couchbase. MongoDB and Cassandra scaled well, but it was difficult to get sub-millisecond latency; with our load and data size, we saw p95 latencies as high as 10-20 ms.

We soon reached the conclusion that we needed some sort of in-memory storage to meet our latency needs. Couchbase was a good bet: it performed very well under our load conditions, and we were getting much better latencies. However, Couchbase requires storing all keys and metadata in RAM, which forced us to run a minimum number of EC2 nodes just to provide sufficient RAM. We did shortlist Couchbase as an acceptable alternative: the overall cost was going to be lower than ElastiCache, and we could expand beyond 6 TB.

What we wanted was a hybrid solution where hot elements sit in RAM and spillover goes to SSDs. Wishful thinking: should we build one ourselves? Aha! We stumbled upon Redis Labs' Redis on Flash. With the advent of NVMe volumes, SSD speeds have become much better and closer to RAM. Redis on Flash followed exactly the architecture we wanted. We started prototyping and load testing and got great results: with a mere 30% of values in RAM and 70% on Flash, we were getting 1-2 ms p95 latency. We expanded our test data and made our tests longer, and saw consistent behavior with a rising RAM hit rate. With the increased RAM hit rate, our p95 came down further to under 1 ms. Cost-wise, Redis on Flash was much cheaper than any alternative with comparable latency. We had our winner!

AWS ElastiCache Redis vs Redis Labs Redis on Flash: A Comparison

  • Lower operating cost: we needed around 60% fewer nodes (RAM) to serve the same amount of data.
  • ElastiCache, being pure RAM, wins on latency, but the difference was not huge. With a near-0% RAM hit ratio we were getting around 2 ms latency, and with a RAM hit ratio above 95% we were around 0.2 ms. ElastiCache gave us a consistent 0.1 ms latency.
  • ElastiCache uses its replicas only as a failsafe, not for active reads. The Redis Labs sharding model splits data into multiple shards, with each node holding a mix of primary and replica shards, so every node serves traffic. Reads can go to a replica shard whenever the primary is not responding, which gives near-zero downtime on node failure, whereas ElastiCache takes around a minute to fail over.

The major win was a lower running cost with comparable latency. We lost around 0.1 ms per call compared to standard Redis hosted on AWS ElastiCache, but we would save millions per year. This was an acceptable and welcome compromise!

Key Learnings

  • Hot Pockets of Data: Soon after we moved to production, with our cluster taking an average of 200K requests/sec, we observed an interesting pattern. With a mere 30% of values in RAM, we started seeing around a 97% RAM hit rate after 24-30 hours. We knew we had a very low query repeat rate, but further down the call stack we were hitting the same keys. Redis on Flash promotes frequently accessed keys to RAM (standard LRU), so after a good amount of traffic we reached a stable state of over 97% RAM hit rate. This brought our average latency down to 0.1-0.2 ms, almost the same as standard RAM-only Redis.
  • Compression: We realized that smaller values pay off on multiple fronts: less storage space and hence lower overall cost, lower network traffic, and faster GETs from Redis. Standard compression would not work well because we have many small, disjoint key-value pairs rather than big blobs of data. We tried Facebook's Zstandard compression with a custom-tailored dictionary and got over a 50% compression ratio, reducing our overall storage needs by 50%. We were able to fit over 11 TB of data in a 6 TB cluster. A minimal sketch of this approach follows the list.
  • Clustered vs Standalone Redis: The multiple-shards-per-node architecture let us talk to Redis as if it were a standalone instance even though the deployment is clustered. This is more stable, faster, and easier to work with than a standard clustered setup.
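
As a rough illustration of the compression approach, here is a sketch using the zstandard Python package: a shared dictionary is trained on a sample of serialized records and then used to compress each value before it is written to Redis. The sample data, dictionary size, and field names are placeholders, not our production dictionary.

import json
import zstandard

# Assumption: a few thousand serialized entity records sampled from the data set.
samples = [
    json.dumps({"first_name": f"name{i}", "last_name": f"surname{i}",
                "links": {"phone": [f"uuid-{i}"]}}).encode()
    for i in range(2000)
]

# Train a small shared dictionary; many small, similar JSON values benefit the most.
dictionary = zstandard.train_dictionary(4096, samples)

compressor = zstandard.ZstdCompressor(dict_data=dictionary)
decompressor = zstandard.ZstdDecompressor(dict_data=dictionary)

value = json.dumps({"first_name": "John", "links": {"phone": ["..."]}}).encode()
compressed = compressor.compress(value)          # this is what gets stored in Redis
original = decompressor.decompress(compressed)   # decoded again on read

print(len(value), "->", len(compressed), "bytes")

Because every value shares the same trained dictionary, each small JSON document compresses far better than it would on its own, which is what made the roughly 50% reduction possible.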

In April, Jason, a Software Developer on our Data Services Team at Whitepages, was asked to present at the Redis Labs Conference. Redis Labs then paid a visit to our Seattle Headquarters to record an informational video on how we use their product: 

Watch The Video