Motiejus Jakštys Public Record

Scaling Btrfs in an Enterprise

2026-03-27

This was first published in The New Stack as "Scaling Btrfs to petabytes in production: a 74% cost reduction story". I am replicating a late draft of the post here for archival reasons.

We saved 74% of our storage costs by moving petabytes of time-series data from ext4 to Btrfs. We are partnering with Google to bring Btrfs to all GCP customers, and you can experimentally run Btrfs on GCP today. Here is how we did that without blowing up production. This is an expanded version of a conference talk given at FOSDEM 2026.

What is Btrfs?

Linux is an operating system that supports multiple file systems. A filesystem is the part of the operating system that governs how files are organized and accessed. Btrfs is one of the filesystems supported in Linux. Beyond basic file organization and access, Btrfs has features less commonly found in other file systems, such as transparent checksumming, transparent compression, copy-on-write and others. We will be discussing the Btrfs file system in this blog post.

Step 1: savings from compression

To understand how compression works in Btrfs, let’s go through an example. The first 10⁹ (one billion) bytes of an English Wikipedia dump (enwik9) take 1000MB (953MiB), but, once transparently compressed by Btrfs, take only 340MiB on disk:

$ compsize enwik9
Processed 1 file, 7630 regular extents (7630 refs), 0 inline.
Type       Perc     Disk Usage   Uncompressed Referenced
TOTAL       35%      340M         953M         953M
zstd        35%      340M         953M         953M

The file on disk takes 35% of its original size! To generalize: if we stored mostly English Wikipedia dumps on this drive, we would need an approximately 3x smaller disk for the same amount of data. Although enwik9 is compressed on disk, we can look for useful content in it without explicitly decompressing it:

$ grep FOSDEM enwik9 | head -1
Cox was the recipient of the [[Free Software Foundation]]'s [[2003]] [[FSF
Award for the Advancement of Free Software|Award for the Advancement of Free
Software]] at the [[FOSDEM]] conference in [[Brussels]].
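
For reference, the experiment can be reproduced on any scratch Btrfs filesystem. A minimal sketch, assuming /dev/sdX is a spare device and enwik9 has already been downloaded:

$ mkfs.btrfs -f /dev/sdX                       # format a scratch device
$ mount -o compress=zstd:1 /dev/sdX /mnt/test  # mount with transparent zstd compression
$ cp enwik9 /mnt/test/ && sync                 # copy the file and flush it to disk
$ compsize /mnt/test/enwik9                    # report compressed vs. uncompressed size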

At Chronosphere we are storing lots of time-series metrics. Time-series metrics generally consist of two parts:

  1. Labels. E.g. service=roar_labels, environment=production. These compress well, since those are text and often repeated (e.g. every metric has an environment label).
  2. The actual metric values, i.e. floats or integers. They are already well compressed in the application by float/integer encoding.

We downloaded a subset of our internal cluster data and did a compsize check similar to the above. The potential disk savings were in the 65-70% ballpark. Since we store petabytes of data on disks, and disks were the single biggest expense in our cloud bill at the time, we decided Btrfs was worth a careful evaluation.

Our disks could be shrunk similarly if we also compressed from within the application itself. However, transparent filesystem compression was advantageous for us, because:

  1. The compression ratio would be similar if we compressed the data directly in the application (i.e. no significantly higher space savings from application-level compression).
  2. We heavily use mmap for reading; changing this to use compressed data is doable, but would require significant changes in the database.
  3. Change of the file format is disruptive, especially if all historical data needs to be read and written — that’s quite a lot of tooling to prepare and CPU cycles to process!

Given that Btrfs showed significant savings potential and required almost no changes to the database itself, we opted to evaluate it thoroughly.

Btrfs reputation

Once we saw the compression ratios and the potential mind-boggling savings, we started looking deeper into Btrfs (and got asked about it quite a bit). Btrfs was merged into the Linux kernel in 2008, but had a poor reliability reputation from the early days; the author can attest to that from personal experience. On the bright side, over the last decade Btrfs development was seriously picked up by large companies running it in production, and there are plenty of good stories about Btrfs being used at scale. For me and many others at Chronosphere, Josef Bacik’s talk “Btrfs at Facebook” from the 2020 Open Source Summit captured the current state of affairs and helped change the perception of Btrfs inside the organization.

Collage illustrating the gap between Btrfs’s online reputation and real-world production use: a light-gray background is filled with red-outlined Reddit and search snippets warning about data loss, instability, and power-failure corruption, while a green-highlighted article card in the foreground reads ‘Btrfs at Facebook.’ The composition suggests that widespread fear and forum folklore around Btrfs are being challenged by evidence of successful operation at very large scale.

Btrfs reputation: bad and good parts

Step 2: Google/k8s/database support

Chronosphere runs on Google-managed Kubernetes on GCP. At the time of writing, GCP supported only ext4 and xfs on Linux. Running a filesystem that is not on the official list requires a separate Container Storage Interface (CSI) driver deployed on the hosts and registered with the k8s control plane.

We used two approaches to get a Btrfs volume for the database:

  1. Initially, we took a shortcut and managed the block device ourselves. On startup the database would pick up a raw block device, format it, mount it and use it (see the sketch after this list).
  2. Once we were comfortable with the prototype, we forked GCP’s CSI driver, added Btrfs support there, and deployed the fork on our fleet as a separate provisioner.
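
As an illustration of the first approach, the database startup logic boiled down to something like the following sketch (the device and mount point are placeholders; the real logic lived in the database’s startup code):

$ blkid /dev/sdb || mkfs.btrfs /dev/sdb                    # format only if no filesystem exists yet
$ mount -o discard=async,compress=zstd:1 /dev/sdb /data    # mount with our standard options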

Once moved to compressed Btrfs, the database did not blow up. We technically had something that could work. The compression ratio on the database was consistent with the original compsize experiments: 65-70%.

Filesystem conversion

Once the CSI driver was deployed, we needed a way to convert the existing disks to Btrfs. Initially, we considered btrfs-convert, but it has the following warning in its documentation:

Always consider if a mkfs and file copy would not be a better option than the in-place conversion given what was said above.

To convert a filesystem, we used a workflow that:

  1. Creates a new target volume with the same parameters as the source, except fsType: btrfs.
  2. Copies all files from source to target.
  3. Shuts down the database.
  4. Synchronizes all files between source and target.
  5. Swaps source and target, removes the “old” one.

In fact, we already had a very similar workflow to shrink the database disks (because neither ext4 nor Google’s block device storage supports online disk shrinking). We re-used the disk shrink workflow to convert between the file systems; it was a relatively straightforward change.
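
In spirit, the data-moving part of that workflow is equivalent to the following sketch (/mnt/src and /mnt/dst are placeholders for the source and target volume mounts):

$ rsync -aHAX /mnt/src/ /mnt/dst/             # step 2: bulk copy while the database still runs
$ # ... stop the database (step 3) ...
$ rsync -aHAX --delete /mnt/src/ /mnt/dst/    # step 4: final catch-up sync
$ # step 5: swap the volumes and remove the old one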

Step 2a: risks and potential issues

Once we had the infrastructure ready and had proved it worked on a real database, a few possible risks were raised:

  1. Poor reliability reputation from the early days. For those to whom this was a concern, it was mostly addressed by watching Josef Bacik’s talk from 2020.
  2. Different IO behavior may have performance implications. We had a decade of experience running the database on ext4, but zero experience on anything else. Early performance testing did not show significant performance changes, and we planned a gradual rollout so that we could observe performance changes along the way.
  3. Compression/decompression will cost extra CPU cycles, which will then not be available to the database for other tasks.
  4. Handling of remaining disk space. Btrfs tracks available disk space differently from other filesystems, and statfs cannot be trusted to the same degree as, say, on ext4. We absolutely cannot ever run out of disk space, as that would cause data loss. This was a major concern for us, so we had to be extra conservative.
  5. Google does not support Btrfs. Why? All of the following are plausible:
    1. A Google customer tried Btrfs; it behaved poorly with Google’s block device offering, so it was not added to the “officially supported” list.
    2. Google’s kernel team advised against it due to its poor reliability track record.
    3. Nobody ever seriously asked for Btrfs support, so Google just never did it.
  6. There were surely more unknown unknowns.

Before detailing those unknown unknowns, let’s walk through our configuration.

Our configuration

We use Btrfs with the following settings:

  1. discard=async
  2. compress=zstd:1
  3. btrfs-allocation-data-bg_reclaim_threshold=90
  4. btrfs-bdi-read_ahead_kb=128

discard=async and compress=zstd:1 are documented in the btrfs manual. The other two settings are implemented in the CSI driver: it writes the configured values to /sys/fs/btrfs/<UUID>/allocation/data/bg_reclaim_threshold and /sys/fs/btrfs/<UUID>/bdi/read_ahead_kb respectively.
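
Put together, the effect of this configuration is roughly what the following manual sequence would achieve (the device and mount point are placeholders):

$ mount -o discard=async,compress=zstd:1 /dev/sdX /data
$ UUID=$(findmnt -no UUID /data)
$ echo 90  > /sys/fs/btrfs/$UUID/allocation/data/bg_reclaim_threshold
$ echo 128 > /sys/fs/btrfs/$UUID/bdi/read_ahead_kb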

We set bg_reclaim_threshold to 90 because that makes free space accounting easy. This value is aggressive: with it, the difference between Device unallocated and statfs (via btrfs-filesystem-usage) is less than 2%, which lets us account for free space safely.
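
To see how close the two views of free space are on a given volume, the statfs-based numbers can be compared with what Btrfs itself reports (the mount point is a placeholder):

$ df -h /data                        # statfs view, what most tooling sees
$ btrfs filesystem usage -T /data    # Btrfs view, including "Device unallocated"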

The unknown unknowns

Even with careful planning, we encountered surprises.

Surprise 1: Disk Snapshot Costs Ballooned

Our first major surprise came from an unexpected direction: backup costs. We rely on block-level snapshots provided by GCP. With ext4, these incremental snapshots were small and cheap relative to the disk size. After moving to Btrfs, our snapshot costs exploded.

Stacked bar chart titled ‘Snapshot costs ballooned’ comparing storage cost split between ext4 and btrfs with disk snapshots. The ext4 bar totals 100, with 94 in blue labeled ‘Disk’ and 6 in red labeled ‘Backups.’ The btrfs bar totals 61, with 21 in blue for disk and 40 in red for backups. The chart shows that btrfs greatly reduces disk cost, but backup cost becomes a much larger share of the total.

Right after the migration to Btrfs, disk snapshots dominated the cost

While Btrfs reduced disk costs by >50%, snapshot costs grew by more than 6x. After most of the clients had migrated, we were paying more for disk snapshots than for the actual disks!

We never narrowed down the culprit; instead, we sped up a different project. Over the course of a few months we transitioned from snapshot-based backups to file-based backups, making the backups (and backup costs) completely filesystem-agnostic; that brought the cost of backups down to below even pre-Btrfs levels:

Stacked bar chart titled ‘Disk and Backups’ comparing total storage costs for three setups: ext4 with disk snapshots, btrfs with disk snapshots, and btrfs with file snapshots. The ext4 plus disk snapshots bar totals 100, made up of 94 in blue for disk and 6 in red for backups. The btrfs plus disk snapshots bar totals 61, with 21 for disk and 40 for backups. The btrfs plus file snapshots bar totals 24, with 21 for disk and 3 for backups. The chart shows that btrfs cuts overall cost substantially, and switching from disk snapshots to file snapshots reduces backup cost dramatically.

Once snapshots were gone, total costs went down too

Surprise 2: Reclaim Causing Massive IO on Large Deletes

A more alarming issue surfaced on a production tenant. We noticed that deleting a large volume of files would trigger a massive IO storm.

Screenshot of a dark-themed chat message from Faustus (EET) in July saying, “something bad is happening,” with an attached image. The image is a line chart titled “Commit Log Queue Length” showing many multicolored spikes between about 15:35 and 16:33, followed by a sharp sustained surge near 16:28 that rises to roughly 7 million.

This became a production incident. The cause was traced back to Btrfs’s background space reclamation logic.

Screenshot of a chat message warning not to deploy or restart m3db until approval is given. The message says the downsamples-1h namespace was removed that day, and after restarting a repo they discovered that deleting a large amount of files on btrfs can consume a lot of I/O and impact ingestion.

Guidance right after the incident

As can be seen in the screenshots, some downsampled data was deleted. When many files are deleted on Btrfs, reclamation kicks in: Btrfs shuffles data around on disk, doing work roughly proportional to the number of bytes removed. Today we configure this mechanism via bg_reclaim_threshold, which is set quite aggressively.

To mitigate IO storms during reclaim, we are eagerly waiting for the dynamic_reclaim tunable, which was merged into Linux v6.11, to propagate to LTS kernels (where we can turn it on). In the meantime, we will throttle the deletes on the application side.
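
Throttling on the application side can be as simple as deleting in small batches with pauses in between; a hypothetical shell equivalent (the path, batch size and sleep interval are made up for illustration):

$ find /data/expired -type f -print0 | xargs -0 -n 100 sh -c 'rm -- "$@"; sleep 1' _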

Surprise 3: Read Ahead!

One day, we hit a performance wall with our read workloads. We saw p99 latency spikes that correlated directly with the disk’s read throughput being completely saturated.

Screenshot of a monitoring dashboard chart titled “Node IO throughput — Read.” Multiple colored lines show read throughput over time, starting near zero around 15:00, then rising into bursty activity and a long plateau near 570 MiB/s from roughly 15:40 to 17:10, followed by a sharp drop and intermittent spikes afterward.

The issue was a subtle but critical kernel setting: read-ahead. There are two different read-ahead settings that matter: one for the generic block device and one specifically for the Btrfs filesystem via its Backing Device Info (BDI) interface.

  • /sys/block/<...>/queue/read_ahead_kb (default: 128KB)
  • /sys/fs/btrfs/<...>/bdi/read_ahead_kb (default: 4096KB)

The Btrfs-specific read-ahead was 32 times larger than the block device default. This setting is more intelligent than the block-device one—it understands the logical layout of files, not just the physical block layout. However, for our workload, which involves many random-like reads, this massive up-to-4MB read-ahead was causing extreme read amplification, pulling huge amounts of data into memory that we would never use.
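
Both knobs can be inspected and, if needed, overridden at runtime (the device name and filesystem UUID are placeholders; the values shown are the defaults listed above):

$ cat /sys/block/sdX/queue/read_ahead_kb
128
$ cat /sys/fs/btrfs/$UUID/bdi/read_ahead_kb
4096
$ echo 128 > /sys/fs/btrfs/$UUID/bdi/read_ahead_kb    # align Btrfs read-ahead with the block device default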

Running the project & timeline

We created a single “Btrfs master plan” in which we put all the information for interested parties: savings potential, rough timeline, necessary development to productionize it, migration path for existing databases, and risks. The “master plan” was very helpful in showing others that we see and acknowledge the risks, and are taking steps to de-risk them. Rough project timeline:

  1. 2024Q2: the compsize test on our data, plus ballpark calculations on how much it could save, was presented in a team offsite. This step was documented in the previous section.
  2. 2024Q3: hacked together the database support just enough to run on a development cluster. This proved the database did not fall over using our standard synthetic benchmarks.
    1. Success of this step promoted the project from “a team experiment” to something the wider organization started keeping an eye on.
    2. Discussions started about Btrfs reliability, long-term support and maintenance.
  3. 2024Q4: infrastructure team productionized Btrfs support.
  4. 2025Q1: moved the smallest internal production cluster (meta-meta) to Btrfs.
  5. 2025Q2: moved the first production cluster to Btrfs. Mass migration commenced.
  6. 2025Q4:
    1. moved the last production cluster to Btrfs.
    2. Google lands the last Btrfs-enabling patch to their CSI driver. Btrfs is in production on the GCP side for everyone.

Where Are We Now?

Today, all the time-series databases are running Btrfs. That’s petabytes worth of storage (after compression!). This journey required significant engineering effort, and our infrastructure is forked from the standard GCP offerings: we maintain our own builds of Google’s Container Storage Interface (CSI) driver to include Btrfs support. We have been working with Google to upstream our changes, and they have landed: Btrfs has been enabled since Container-Optimized OS 125, and all of our suggested changes to the CSI driver have been merged upstream.

As it stands today, you should be able to fully utilize it from Google Kubernetes Engine 1.35 or later.
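
We have not reproduced the exact manifests here, but a Btrfs-backed volume on GKE would be requested through a StorageClass roughly like this sketch (the class name and disk type are arbitrary; the fstype parameter follows the upstream GCP Persistent Disk CSI driver convention):

$ kubectl apply -f - <<'EOF'
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: btrfs-ssd                     # arbitrary name
provisioner: pd.csi.storage.gke.io    # GCP Persistent Disk CSI driver
parameters:
  type: pd-ssd
  csi.storage.k8s.io/fstype: btrfs    # ask the driver to format volumes as Btrfs
volumeBindingMode: WaitForFirstConsumer
EOF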

As a result of the migration, we saved 74% of disk costs due to compression alone. Now that we are fully on Btrfs, we are removing some in-application checksums, which, we believe, will bring additional cost savings for compute (CPUs).

Takeaways

Our experience provides a few key lessons. First, Btrfs is a viable and trustworthy filesystem for large-scale, single-volume enterprise deployments. Second, it is not only a good filesystem: thanks to transparent compression, it shaved 74% off our disk costs, and it can save on your disks too if you store lots of uncompressed data that is tricky to compress in the application.

Errata

If you watched the FOSDEM talk, there was one last question towards the end of the presentation:

Do you use deduplication?

On stage I thought the question was about replication. The answer is still “no”, but for a different reason: we do not use deduplication because we have decent means to deduplicate shared data in the application, without major changes and with barely any CPU or IO overhead.