Oxen cuts data versioning commit time from 50 minutes to 2.5 minutes by reducing lock contention

Oxen's engineering team identified that 90% of commit time was spent waiting for RocksDB locks, not doing actual work. By eliminating unnecessary database calls and object cloning across code layers, they achieved a 20x performance improvement on million-file repositories.

Oxen.ai reduced commit times on million-file repositories from over 50 minutes to around 2.5 minutes by fixing a database lock contention issue that consumed 90% of execution time.

The problem surfaced during routine benchmarking. Oxen's add command processes a million files in roughly one minute, yet commit took over 50 minutes, even though the commit algorithm is O(n) in the number of directories rather than the number of files. For context, Oxen has demonstrated it can version ImageNet's 1M+ images in 90 minutes compared to Git-LFS's 20 hours, so the commit bottleneck stood out.

Profiling with samply revealed parallel workers were fighting for locks on the staging RocksDB instance. The root cause: clean separation of concerns across code layers led to repeated database opens and data cloning. Each thread independently fetched the same metadata, creating contention on a database designed for parallel writes, not reads.
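The article doesn't reproduce Oxen's code, but the shape of the problem is easy to sketch. In the minimal Rust example below, a Mutex-wrapped map stands in for the staging RocksDB instance, and every worker re-acquires the lock to fetch metadata it could have been handed up front. All names here are hypothetical, not Oxen's actual API; samply, for reference, is a sampling profiler that is typically run as `samply record <command>`.

```rust
use std::collections::HashMap;
use std::sync::{Arc, Mutex};
use std::thread;

// Hypothetical stand-in for the staging RocksDB instance: every read
// goes through a single lock, so parallel readers serialize on it.
type StagingDb = Arc<Mutex<HashMap<String, String>>>;

// Anti-pattern: each worker re-acquires the lock and re-fetches the
// same metadata for every item it processes.
fn process_dir(db: &StagingDb, dir: &str) {
    // Lock held for every lookup, even though the value never changes.
    let meta = db.lock().unwrap().get(dir).cloned();
    // ... the actual (cheap) work would happen here ...
    let _ = meta;
}

fn main() {
    let db: StagingDb = Arc::new(Mutex::new(HashMap::from([
        ("images/".to_string(), "node-hash-abc".to_string()),
    ])));

    let handles: Vec<_> = (0..8)
        .map(|_| {
            let db = Arc::clone(&db);
            thread::spawn(move || {
                for _ in 0..1_000 {
                    process_dir(&db, "images/"); // all threads contend here
                }
            })
        })
        .collect();

    for h in handles {
        h.join().unwrap();
    }
}
```

In a shape like this, a sampling profiler tends to attribute most wall-clock time to lock acquisition rather than to the real work inside the worker, which matches the 90% figure Oxen reported.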

The fix involved passing data down through the layers instead of having each layer re-fetch it from the database. The resulting pull request was small, the kind of change that looks obvious in hindsight.
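A sketch of the corresponding fix, under the same hypothetical names as the example above: the top layer reads the metadata once, then hands it down as cheap, immutable shared data, so workers never touch the database or a lock.

```rust
use std::collections::HashMap;
use std::sync::Arc;
use std::thread;

// Same hypothetical metadata, now fetched once up front and shared as
// immutable data; workers receive it instead of re-fetching it.
#[derive(Clone)]
struct DirMeta {
    node_hash: Arc<String>,
}

fn process_dir(meta: &DirMeta) {
    // Work proceeds with the data it was handed; no lock, no DB call.
    let _ = &meta.node_hash;
}

fn main() {
    // One read at the top layer (stand-in for a single RocksDB get).
    let staged: HashMap<String, String> =
        HashMap::from([("images/".to_string(), "node-hash-abc".to_string())]);
    let meta = DirMeta {
        node_hash: Arc::new(staged["images/"].clone()),
    };

    let handles: Vec<_> = (0..8)
        .map(|_| {
            let meta = meta.clone(); // a pointer copy, not a DB call
            thread::spawn(move || {
                for _ in 0..1_000 {
                    process_dir(&meta);
                }
            })
        })
        .collect();

    for h in handles {
        h.join().unwrap();
    }
}
```

Cloning an Arc is a pointer copy plus a reference-count increment, so passing context down this way costs almost nothing compared to a lock-guarded database read per item.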

What This Means

This is a useful reminder that architectural best practices (clean separation, clear interfaces) can create performance issues when threading and shared resources enter the picture. The Oxen team notes RocksDB isn't optimal for their parallel read pattern since it's optimized for parallel writes, suggesting they may need to revisit storage layer choices as they scale.

For teams building performance-critical Rust applications, the lesson is straightforward: profiling beats intuition. The actual bottleneck was lock contention, not the commit algorithm. Clean architecture didn't prevent the issue; it arguably created it by encouraging each layer to retrieve its own context in isolation.

Oxen continues positioning itself as the fastest data versioning tool for large-scale ML datasets, though questions about production readiness relative to competitors like lakeFS remain. Speed matters less if reliability isn't proven at scale.