build · oxidb v0.25.21
The /dev/oxide

A build log on shipping OxiDB — notes, post-mortems, and the occasional flame war about JSON parsing, pressed straight onto an embedded engine running inside this process.

posts/0006.md · 2026-05-07

BM25 + 18 stemmers — ranking that doesn't lie

Full-text search ranking moved from TF-IDF to BM25. The old scorer told you what words appeared, not which documents were actually relevant. Long documents with high raw term frequency consistently outranked short, focused ones — the classic TF-IDF failure mode.

BM25 fixes this in two ways:

* **Length normalization**: long docs no longer outrank short ones at the same TF. The b parameter controls how strongly document length is penalized.

* **TF saturation**: 10× the term frequency does not give 10× the score. The k1 parameter controls how fast scoring saturates. The Lucene/Elasticsearch defaults (k1=1.2, b=0.75) ship out of the box, and both are tunable per-process via `OXIDB_FTS_K1` and `OXIDB_FTS_B`.
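To make the two effects concrete, here is a minimal sketch of the textbook BM25 per-term score with a Lucene-style IDF. It is not OxiDB's actual scorer: the `Bm25Params` type, the function names, and the "fall back to defaults on a bad value" policy are all assumptions for illustration. Only the env var names and the default values come from the post.

```rust
use std::env;

/// BM25 parameters. Defaults follow the post: k1 = 1.2, b = 0.75.
#[derive(Debug, Clone, Copy, PartialEq)]
pub struct Bm25Params {
    pub k1: f64,
    pub b: f64,
}

impl Default for Bm25Params {
    fn default() -> Self {
        Self { k1: 1.2, b: 0.75 }
    }
}

impl Bm25Params {
    /// Reads OXIDB_FTS_K1 / OXIDB_FTS_B. Falling back to the defaults when a
    /// variable is unset, unparsable, or negative is an assumption of this sketch.
    pub fn from_env() -> Self {
        let d = Self::default();
        let read = |name: &str, fallback: f64| -> f64 {
            env::var(name)
                .ok()
                .and_then(|v| v.trim().parse::<f64>().ok())
                .filter(|v| v.is_finite() && *v >= 0.0)
                .unwrap_or(fallback)
        };
        Self {
            k1: read("OXIDB_FTS_K1", d.k1),
            b: read("OXIDB_FTS_B", d.b),
        }
    }
}

/// Lucene-style IDF: ln(1 + (N - n + 0.5) / (n + 0.5)).
pub fn idf(doc_count: u64, docs_with_term: u64) -> f64 {
    let n = doc_count as f64;
    let df = docs_with_term as f64;
    (1.0 + (n - df + 0.5) / (df + 0.5)).ln()
}

/// BM25 contribution of one query term to one document.
pub fn bm25_term_score(
    p: Bm25Params,
    idf: f64,
    tf: f64,
    doc_len: f64,
    avg_doc_len: f64,
) -> f64 {
    // Length normalization: b = 0 ignores length, b = 1 normalizes fully.
    let norm = 1.0 - p.b + p.b * (doc_len / avg_doc_len);
    // TF saturation: the ratio approaches (k1 + 1) as tf grows.
    idf * (tf * (p.k1 + 1.0)) / (tf + p.k1 * norm)
}
```

With the defaults and a document of average length, going from `tf = 1` to `tf = 10` raises the score but by far less than 10×, and at equal `tf` a document half the average length scores higher than one five times the average.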

Migration is lazy. An `_fts/index.json` written by an older binary is missing the `total_term_count` BM25 needs — it's backfilled on first open, no rebuild required.
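A rough sketch of what a backfill-on-open could look like. Everything here except the `total_term_count` name is hypothetical: the `FtsIndexMeta` struct, the per-document length map, and the idea that the total is derived by summing those lengths are assumptions, not the engine's real on-disk layout.

```rust
use std::collections::HashMap;

/// Hypothetical in-memory view of `_fts/index.json`. Every field name except
/// `total_term_count` is an assumption made for this sketch.
#[derive(Debug, Default)]
pub struct FtsIndexMeta {
    /// Token count per document id (assumed to exist in older files too).
    pub doc_lengths: HashMap<u64, u64>,
    /// Absent in files written by pre-BM25 binaries.
    pub total_term_count: Option<u64>,
}

impl FtsIndexMeta {
    /// Fills in `total_term_count` if it is missing. Returns true when the
    /// value was backfilled, so the caller knows the file should be rewritten.
    pub fn backfill_on_open(&mut self) -> bool {
        if self.total_term_count.is_some() {
            return false;
        }
        self.total_term_count = Some(self.doc_lengths.values().sum());
        true
    }

    /// Average document length, the quantity BM25 length normalization needs.
    pub fn avg_doc_len(&self) -> Option<f64> {
        let total = self.total_term_count?;
        if self.doc_lengths.is_empty() {
            return None;
        }
        Some(total as f64 / self.doc_lengths.len() as f64)
    }
}
```

The point of the shape is that the backfill is a single pass over data the index already holds, which is why no rebuild is needed.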

Eighteen Snowball stemmers ship with the engine: English, Turkish, German, French, Spanish, Italian, Portuguese, Russian, Dutch, Danish, Finnish, Hungarian, Norwegian, Romanian, Greek, Arabic, Swedish, Tamil. Pick one via `OXIDB_FTS_LANG=tr`, and `kitap`, `kitaplar`, and `kitaplarda` all collapse to the same stem. The analyzer is cached per-process via `OnceLock`, so language selection adds no per-query cost in the hot path.
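The per-process caching pattern looks roughly like this. `Analyzer`, `analyzer()`, and the fallback to `"en"` when the variable is unset are stand-ins invented for the sketch; only `OnceLock` and the `OXIDB_FTS_LANG` name come from the post.

```rust
use std::env;
use std::sync::OnceLock;

/// Hypothetical stand-in for the engine's analyzer; only the language code
/// is modeled here, not the stemmer itself.
#[derive(Debug)]
pub struct Analyzer {
    pub lang: String,
}

impl Analyzer {
    fn from_env() -> Self {
        // Falling back to "en" when the variable is unset is an assumption.
        let lang = env::var("OXIDB_FTS_LANG").unwrap_or_else(|_| "en".to_string());
        Analyzer {
            lang: lang.trim().to_ascii_lowercase(),
        }
    }
}

static ANALYZER: OnceLock<Analyzer> = OnceLock::new();

/// Returns the process-wide analyzer, building it on first use only.
/// Later calls skip the env lookup and return the same reference.
pub fn analyzer() -> &'static Analyzer {
    ANALYZER.get_or_init(Analyzer::from_env)
}
```

One consequence of this design: changing the environment variable after the first call has no effect until the process restarts.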

Highlights ride on the same tokenizer, so a query for `running` correctly marks `runs` in the response. `text_search_highlighted` returns `<mark>matched</mark>` snippets with offsets and matched-term counts. Multi-byte safe — char-boundary snapping prevents panics on Turkish and CJK input.
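Char-boundary snapping can be sketched with nothing but `str::is_char_boundary`. The helper names below (`snap_left`, `snap_right`, `mark`) are hypothetical and this is not the code behind `text_search_highlighted`. It only shows why slicing at a snapped offset cannot panic.

```rust
/// Moves `idx` left to the nearest UTF-8 char boundary.
/// Index 0 is always a boundary, so the loop cannot underflow.
pub fn snap_left(s: &str, idx: usize) -> usize {
    let mut i = idx.min(s.len());
    while !s.is_char_boundary(i) {
        i -= 1;
    }
    i
}

/// Moves `idx` right to the nearest UTF-8 char boundary.
/// `s.len()` is always a boundary, so the loop terminates.
pub fn snap_right(s: &str, idx: usize) -> usize {
    let mut i = idx.min(s.len());
    while !s.is_char_boundary(i) {
        i += 1;
    }
    i
}

/// Wraps the byte range in <mark> tags, widening it to char boundaries
/// so the slices below never split a multi-byte character.
pub fn mark(s: &str, start: usize, end: usize) -> String {
    let a = snap_left(s, start);
    let b = snap_right(s, end).max(a);
    format!("{}<mark>{}</mark>{}", &s[..a], &s[a..b], &s[b..])
}
```

For example, `mark("çığ kitap", 1, 3)` lands in the middle of two 2-byte characters and returns `<mark>çı</mark>ğ kitap`, where a raw `&s[1..3]` would panic.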

Indexing itself runs on a parallel worker pool. `FtsDispatcher` round-robins extraction jobs across N workers, using `try_send` with a fallback to a blocking send, so one slow worker (think OCR on a 200-page PDF) doesn't backpressure the entire pool. Persistence is batched via `OXIDB_FTS_FLUSH_INTERVAL_MS` (default 1000 ms), which killed the previous N² disk-write amplification on bulk ingestion.
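One plausible reading of the dispatch policy, sketched with std bounded channels: offer the job to each worker in round-robin order without blocking, and only block when every queue is full. The post names `FtsDispatcher` but not its internals, so the `Dispatcher` type, its fields, and the exact fallback order below are assumptions.

```rust
use std::sync::mpsc::{sync_channel, Receiver, SyncSender, TrySendError};

/// Hypothetical dispatcher over bounded per-worker queues.
pub struct Dispatcher<T> {
    workers: Vec<SyncSender<T>>,
    next: usize,
}

impl<T> Dispatcher<T> {
    /// Creates `n` bounded queues and returns the receiving ends,
    /// one per worker thread.
    pub fn new(n: usize, capacity: usize) -> (Self, Vec<Receiver<T>>) {
        assert!(n > 0, "need at least one worker");
        let (txs, rxs): (Vec<SyncSender<T>>, Vec<Receiver<T>>) =
            (0..n).map(|_| sync_channel(capacity)).unzip();
        (Self { workers: txs, next: 0 }, rxs)
    }

    /// Offers the job to each worker in round-robin order without blocking.
    /// If every queue is full, blocks on the original round-robin target.
    /// Returns the index of the worker that took the job, or the job itself
    /// if it could not be delivered.
    pub fn dispatch(&mut self, mut job: T) -> Result<usize, T> {
        let n = self.workers.len();
        let start = self.next;
        self.next = (start + 1) % n;
        for step in 0..n {
            let i = (start + step) % n;
            match self.workers[i].try_send(job) {
                Ok(()) => return Ok(i),
                // Queue full or worker gone: take the job back and move on.
                Err(TrySendError::Full(j)) | Err(TrySendError::Disconnected(j)) => job = j,
            }
        }
        self.workers[start].send(job).map(|_| start).map_err(|e| e.0)
    }
}
```

With this shape a worker stuck on a large job only fills its own queue, and new jobs flow to its neighbours until the whole pool is saturated.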