posts/0003.md · 2026-05-10
from 4 minutes to 200 ms — fixing create_collection contention
A scale-bench tenant-creation workload (1000 collections, eight workers in parallel) hung for more than four minutes before producing any throughput at all. CPU was idle. The p99 for a single `create_collection` call was 6.25 seconds.
The stack trace pointed at the engine-wide `RwLock<HashMap>` that guards the collection map. Every call held the write lock for the entire `BTreeCollection::open` path, including disk I/O: touching the directory, reading the existing `.btree` if any, building the empty B-tree image, and fsync'ing it into place. While that ran, every other worker waited.
The fix mirrors what `get_or_create_collection` already did: a read-check, then an `open` performed with no lock held, then a short write lock with a race-loser recheck. If two workers race to create the same collection, the loser still opens the disk image, then discards it when the recheck shows the winner's entry is already in the map.
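The shape of that pattern, as a minimal sketch rather than the actual OxiDB code: `Engine`, `Collection`, the `opens` counter, and `run_demo` are hypothetical stand-ins, and the `sleep` only simulates the disk I/O that the real `BTreeCollection::open` performs.

```rust
use std::collections::HashMap;
use std::sync::atomic::{AtomicUsize, Ordering};
use std::sync::{Arc, RwLock};
use std::thread;
use std::time::Duration;

/// Hypothetical stand-in for the on-disk collection (not OxiDB's real type).
pub struct Collection {
    pub name: String,
}

impl Collection {
    /// Simulates the slow open path (directory touch, image build, fsync).
    fn open(name: &str, opens: &AtomicUsize) -> Collection {
        opens.fetch_add(1, Ordering::SeqCst);
        thread::sleep(Duration::from_millis(2)); // pretend disk I/O
        Collection { name: name.to_string() }
    }
}

#[derive(Default)]
pub struct Engine {
    collections: RwLock<HashMap<String, Arc<Collection>>>,
    /// Counts calls to `open`, only so the demo can observe race losers.
    pub opens: AtomicUsize,
}

impl Engine {
    pub fn create_collection(&self, name: &str) -> Arc<Collection> {
        // 1. Read-check: shared lock, released at the end of this block.
        {
            let map = self.collections.read().unwrap();
            if let Some(existing) = map.get(name) {
                return Arc::clone(existing);
            }
        }
        // 2. Slow open with no lock held, so other workers keep running.
        let opened = Arc::new(Collection::open(name, &self.opens));
        // 3. Short write lock: recheck, then insert.
        let mut map = self.collections.write().unwrap();
        if let Some(winner) = map.get(name) {
            // Race loser: return the winner's entry; `opened` is dropped.
            return Arc::clone(winner);
        }
        map.insert(name.to_string(), Arc::clone(&opened));
        opened
    }

    pub fn collection_count(&self) -> usize {
        self.collections.read().unwrap().len()
    }
}

/// Every worker tries to create the same names, forcing races.
/// Returns (entries in the map, total calls to `open`).
pub fn run_demo(workers: usize, names: usize) -> (usize, usize) {
    let engine = Arc::new(Engine::default());
    let handles: Vec<_> = (0..workers)
        .map(|_| {
            let engine = Arc::clone(&engine);
            thread::spawn(move || {
                for i in 0..names {
                    engine.create_collection(&format!("tenant_{i}"));
                }
            })
        })
        .collect();
    for handle in handles {
        handle.join().unwrap();
    }
    (engine.collection_count(), engine.opens.load(Ordering::SeqCst))
}
```

Calling `run_demo(8, 20)` always leaves exactly 20 entries in the map, while the `open` count can exceed 20 whenever a race produces a loser. The trade is deliberate: a loser wastes one open, but the write lock is held only for a map lookup and an insert.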
Same bench, after:
Phase 1 wall: 4m18s → 205 ms (~1260× speedup)
CreateCollection p99: 6.25s → 1 ms
We pushed harder: 10,000 collections, patched binary, same eight workers. p99 11 ms. RSS 1.24 GiB. ListCollections p99 22 ms. At that point the limit is the user's directory inodes, not OxiDB. This puts collection-prefix multi-tenancy on the table as a real strategy: one collection per tenant, indexed and isolated, with create-cost on the order of a single `fs::rename`.