Generated by Codex with GPT-5
What happened
Cloudflare’s official engineering blog published Our billing pipeline was suddenly slow. The culprit was a hidden bottleneck in ClickHouse, a post about a production performance regression in a petabyte-scale ClickHouse deployment and the upstream database changes Cloudflare made to fix it.
The setting is unusually concrete. Cloudflare uses ClickHouse to run millions of daily analytical queries that determine customer usage, support billing for hundreds of millions of dollars in revenue, and feed fraud systems and other operational workflows. The affected platform, Ready-Analytics, lets internal teams stream data into a shared ClickHouse table instead of hand-designing separate schemas. Records are distinguished by namespace, sorted within each namespace by an indexID, and ordered by timestamp, giving the table a primary key shaped around tenant-specific query patterns.
That shared-table design had already grown very large. Cloudflare says the broader ClickHouse estate holds more than 100 PB across a few dozen clusters, while Ready-Analytics itself had passed 2 PiB by late 2024 and was ingesting millions of rows per second. The trouble started with a retention requirement. Ready-Analytics originally partitioned data only by day, so Cloudflare’s retention system could cheaply drop old daily partitions but could not give one namespace a few days of retention and another several years.
Cloudflare considered splitting namespaces into separate tables, but that would have required a large automation layer for thousands of dynamically managed tables. Instead, it changed the partitioning key from day to (namespace, day). The decision was reasonable on paper: every query already filtered by namespace, so the set of parts read by any one query should remain roughly stable, while retention could become namespace-specific. The new scheme also supported a storage-management layer based on max-min fairness, letting namespaces borrow unused capacity while the cluster targeted high disk utilization.
The hidden cost was not in query execution. It was in planning. As the new partitioning scheme increased the total number of ClickHouse parts, billing aggregation jobs became progressively slower even though the obvious metrics stayed clean. I/O, memory, scanned rows, and parts read by individual queries did not explain the regression. Only after plotting query duration against total part count did Cloudflare find the system-level correlation: queries were not reading all the extra parts, but their mere existence made planning slower.
The investigation turned on ClickHouse’s built-in trace_log, which Cloudflare used to generate flame graphs for leaf SELECT queries. A CPU flame graph first showed a large share of sampled time in filterPartsByPartition, the code path that prunes the table’s parts before execution. Cloudflare tried changing heuristic order there and got a small improvement, but the real breakthrough came from switching from CPU traces to real-time traces that include threads waiting on locks.
That second flame graph showed the system was dominated by lock contention. More than half of query duration was spent waiting on a single mutex protecting the MergeTreeData parts list. Each planning thread acquired an exclusive lock, copied the full list of table parts, released the lock, and then filtered the copy down to the parts relevant to the query. With tens of thousands of parts and hundreds of concurrent queries, the planners had become serialized behind a lock even though they were only reading metadata.
Cloudflare fixed the issue in three steps. First, it changed planning to use a shared lock because query planners do not mutate the parts list. That allowed concurrent planners into the critical section and removed the immediate lock convoy. Second, it stopped copying the full vector of all parts for every query. Instead, ClickHouse could maintain a shared read-only copy for planners and regenerate it only when inserts or other write-side operations changed the parts set, so planners copied only the smaller filtered list they actually needed.
Those first two changes solved the acute billing issue and were later merged upstream in ClickHouse PR #85535, becoming available in ClickHouse 25.11. But Cloudflare later saw that query duration still grew with part count, just more slowly. The remaining bottleneck was the linear scan used during part filtering. Because the parts vector is sorted by partition key and namespace is the first component of that key, Cloudflare added a binary-search pass over the namespace portion of the partition ID. That narrowed the candidate range before applying the existing filters, cut query duration by about 50% after deployment in March 2026, and finally broke the correlation between query latency and total part count for this workload.
Why it matters
The important lesson is that metadata scale can become the hot path even when data scale looks controlled. Cloudflare’s original assumption focused on execution: if each query still reads the same namespace-specific data, then changing the global partition layout should not materially affect the query. That was incomplete because the planner had to reason over a much larger metadata surface before execution could begin. The system did not become slow because ClickHouse was scanning more billing data. It became slow because every query had to queue, copy, and filter an expanding catalog of parts.
That distinction matters for any shared analytical platform. Multi-tenant designs often collapse many workloads into one table, topic, index, bucket, or namespace to reduce onboarding friction and operational sprawl. The first-order benefit is real: teams avoid custom schemas, service owners avoid thousands of bespoke resources, and platform engineers can centralize retention, storage, and ingestion policy. But the global control structures behind the abstraction still grow. Locks, caches, planners, schedulers, metadata stores, and background compaction systems may see every tenant even when each user query sees only one tenant.
The post is also a good debugging case study because Cloudflare had to change what it was measuring. The usual query metrics made the migration look innocent, and CPU flame graphs initially pointed at active work rather than waiting. Real-time flame graphs exposed the missing dimension: serialized planner threads spending wall-clock time on a lock. That is a useful reminder that performance regressions in production data systems are often not visible in the metrics teams first reach for. A system can look underloaded in CPU, memory, and I/O while user-visible latency is being burned in coordination.
The fixes are instructive because they move from immediate contention removal to deeper algorithmic scaling. Replacing an exclusive lock with a shared lock fixed the correctness of the concurrency model: read-only planning should not serialize read-only planning. Avoiding full-vector copies reduced unnecessary per-query metadata work. Adding binary search changed the complexity of tenant pruning by exploiting the existing partition ordering. Each patch addressed a different layer of the same bottleneck: synchronization, allocation and copying, then search strategy.
There is a broader architectural caution in the ending. Cloudflare explicitly says the new partitioning scheme bought breathing room but may not be the ideal long-term design. The same part-count growth also stressed ZooKeeper metadata, and the article hints at a very large ZooKeeper cluster as another consequence. That humility is important. Local optimizations can make a problematic design viable for a long time, but they do not erase the underlying tradeoff between a flexible shared table and the metadata footprint required to make that table behave like many tenant-specific datasets.
Takeaway
Cloudflare’s ClickHouse post is valuable because it shows how a sensible product requirement, per-namespace retention, turned into a database-internals problem far away from the feature itself. The partitioning change preserved the execution shape of individual queries but changed the planning and metadata shape of the whole cluster.
The broader engineering takeaway is to treat metadata paths as production paths. If a platform design increases object count, partition count, tenant count, file count, shard count, or table-part count, then planning, locking, listing, pruning, and cache invalidation need the same scaling scrutiny as the data scans themselves. The fastest query is still slow if it waits behind an expanding control-plane bottleneck before it can start.