Towards 3B time-series data points per day: Why DNSFilter replaced InfluxDB with TimescaleDB
Our results: 10x better resource utilization, even with 30% more requests
In the early days of DNSFilter, we spent some cycles thinking about the future scalability of the business. I’m an advocate of early-stage startups not over-optimizing, but we also needed to understand the unit economics of our business. That meant understanding how performant we could be, along with what bandwidth, storage, and reporting costs would be.
A large component of our architecture is logging our customers’ requests. We are a usage-based service, and we provide analytics to our customers. This means we need to account for each DNS request, because any lost request is lost revenue, and lost data for our customers.
Back in late 2015 when we started the company, and into early 2016 as we began internally testing our MVP, we researched options in the market for storing and analyzing large volumes of data. A new buzzword at the time was ‘time-series databases’. Our data seemed to fit the pattern, and I began working with the hot new startup behind InfluxDB.
Challenges with InfluxDB
After spending months experimenting, studying their performance tests, integrating InfluxDB into query-processor, and setting up a fast and powerful server to run it, there were still more hurdles. I lost weeks to the learning curve: understanding how InfluxDB differed from traditional SQL, and the limitations of its schema, query language, indexing, and more. I emailed Influx, read forums and GitHub issues, and battled through issue after issue. But finally we had a working solution. For a while.
Unfortunately over time we ran into a number of issues with InfluxDB 1.3.
- Cardinality. This was one of our biggest challenges. In order to provide customers with statistics about which domains are visited, we needed to ‘tag’ (index) this data. At the same time, there are millions of different domains out there, many of which our customers access each day. InfluxDB was designed to handle only a limited number of unique series, and we quickly exceeded it. To survive, we had to begin shaving off the outliers: we would only roll up entries with 20 or more accesses within a 15-minute period, reducing the overall number of unique domains.
- Rollup tables. In order to have aggregate stats which could be queried from our customer analytics dashboard, we needed to create rollup tables: 1-minute summaries, 15-minute, 1-hour, 1-day, and so on. Some rollup tables depended on the completion of other rollup tables; some were intermediary tables that existed only to feed a higher-level rollup. A lot of storage and table creation went into these. I have a five-stage, three-page document outlining how to create our rollup tables.
- Ingestion performance vs. query performance. Despite the touted performance of InfluxDB, we found that ingestion performance dropped significantly once you were also querying against that same data. At only a few hundred queries per second during our daily peak, the logs would start alerting us to dropped queries. Again, this directly translated into lost revenue. We began planning to shard customer accounts across InfluxDB instances, further adding to the complexity of the solution.
- Resource utilization. At one point we ran a server with 512 GB of RAM just to try to work around the cardinality problem, and InfluxDB still ground to a halt. We re-architected the entire schema with more rollup tables and shorter retention periods, and were able to run it for a few months on a 64 GB server. It still used almost all the RAM and had a constant CPU load of over 50%. It felt like we had little room to grow.
While in NYC meeting up with one of our vendors, Packet.net, I searched through LinkedIn and saw that an old friend, Ajay Kulkarni, was now in New York and working on an IoT startup. As we reconnected, I learned that they were in the process of pivoting the business around a component they’d developed to process time-series data and store it in PostgreSQL. As I shared some of my challenges with InfluxDB, they shared their progress toward releasing TimescaleDB. I left New York that week excited and rejuvenated that maybe there was another option that could fit our needs.
Switching was not easy. Not because of TimescaleDB — but because we’d chosen to integrate so tightly with InfluxDB. Our query-processor code was directly sending data to our InfluxDB instance. We explored the option of sending data to both InfluxDB and TimescaleDB directly, but in the end realized an architecture utilizing Kafka would be far superior.
After learning the ropes of Kafka, setting it up, updating query-processor to send data to Kafka, and writing our own Go Kafka consumer to process the queue and write to both InfluxDB and TimescaleDB, we were finally ready to compare the two solutions.
What a difference! From the get-go, we noticed marked improvements:
- Ease of change. A schema change with InfluxDB often meant deleting everything and re-creating it from scratch. TimescaleDB supports all standard PostgreSQL ALTER TABLE commands, CREATE INDEX, and more, enabling fast changes as we improve or add features.
- Performance. We saw more efficient analytics querying and very low CPU usage on ingest. Finally it felt like we had room to breathe and grow.
- No missing data. Running TimescaleDB and InfluxDB side by side, we realized we had been losing data. InfluxDB relied on precisely timed execution of rollup commands to process the last X minutes of data into rollups; combined with our series of dependent rollups, some slow queries were causing us to lose data. The TimescaleDB data had 1–5% more entries! We also no longer had to deal with cardinality issues, and could show our customers every last DNS request, even at a monthly rollup.
With the upcoming GDPR regulations, TimescaleDB also seems better able to handle compliance than InfluxDB. With InfluxDB, it is not possible to delete individual records, only to drop data based on a retention policy. TimescaleDB, in comparison, offers an efficient drop_chunks() function for retention, but also allows for the deletion of individual records. This makes it possible to comply with requests to delete a user’s data.
The one area where TimescaleDB natively fell short was storage: it required far more disk space than InfluxDB. For us, at our volume, this was not a big deal.
The current solution to this is filesystem-level compression; the TimescaleDB folks recommend ZFS. After setting it up, we get about a 4x compression ratio on our production server, which runs on a 2 TB NVMe drive.
Our results: 10x better resource utilization, even with 30% more requests
Here’s a snapshot of system CPU load with InfluxDB from October 2017, compared to February 2018 with TimescaleDB. Also keep in mind that daily processed requests increased 30% between these time periods, and TimescaleDB is run on a slightly slower server.¹
As you can see from the charts, even while processing 30% more requests, we’re using 96% less CPU, with far lower maximum load spikes. If this scales linearly, and the NVMe can keep up, we could potentially handle over 3B requests/day (CPU-wise; we’d need to add a 2 TB NVMe drive for every additional 1B requests/day). This is relevant and helpful for cost-estimating, as we recently quoted a potential customer who would send us a little over 3B requests a day.
The most surprising improvement in switching to TimescaleDB was in recovered development cycles. Because our backend developer and I were already very familiar with standard SQL, it has become faster to add new features: it’s very clear-cut how to add columns, add indexes, and make the changes necessary to roll new features into production. We use a master-master architecture, with Kafka streaming to two production TimescaleDB instances. We can shift our API traffic between the instances when doing any database changes, so we can be sure we’re not impacting customers. Many PostgreSQL operations are non-blocking, so we wouldn’t even need to do this, but it’s also just great to have a hot failover.
We went from InfluxDB’s performance as a daily worry and development headache, to being able to treat TimescaleDB as a reliable component of our architecture which is a joy to work with.
I’m eternally thankful to the team at TimescaleDB for their help over the last year. Our business would not be the same today if it were not for our switch to TimescaleDB, which they helped make very painless. We’re doing 150M queries a day at less than 5% CPU usage. Our global anycast network currently has the capacity to process 20B queries a day if the load were equally distributed. I feel confident we can scale to 3B queries a day or more without much change, and I look forward to continuing to grow with TimescaleDB.
¹ These are two different dedicated servers with two different providers. Both have high-performance NVMe drives.