posts/0006.md · 2026-05-07
BM25 + 18 stemmers — ranking that doesn't lie
Full-text search ranking moved from TF-IDF to BM25. The old scorer told you what words appeared, not which documents were actually relevant. Long documents with high raw term frequency consistently outranked short, focused ones — the classic TF-IDF failure mode.
BM25 fixes this with two mechanisms:
* **Length normalization** — long docs no longer outrank short ones at the same TF.
* **TF saturation** — 10× the term frequency does not give 10× the score. The `k1` parameter controls how fast scoring saturates, and `b` controls how strongly document length is normalized. The Lucene/Elasticsearch defaults (`k1=1.2`, `b=0.75`) ship out of the box, and both are tunable per process via `OXIDB_FTS_K1` and `OXIDB_FTS_B`.
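To make the two effects concrete, here is a minimal sketch of the textbook BM25 per-term score. The function names and the exact IDF variant (the Lucene-style smoothed form) are assumptions for illustration; the engine's internal scorer may differ in detail.

```rust
/// Defaults quoted in the post (the Lucene/Elasticsearch values).
const K1: f64 = 1.2;
const B: f64 = 0.75;

/// Smoothed inverse document frequency (assumed variant).
/// `n_docs` = documents in the index, `doc_freq` = documents containing the term.
fn idf(n_docs: u64, doc_freq: u64) -> f64 {
    let n = n_docs as f64;
    let df = doc_freq as f64;
    (1.0 + (n - df + 0.5) / (df + 0.5)).ln()
}

/// BM25 contribution of a single term to a single document's score.
fn bm25_term(tf: f64, doc_len: f64, avg_doc_len: f64, idf: f64, k1: f64, b: f64) -> f64 {
    // Length normalization: documents longer than average get a larger
    // denominator, so the same tf is worth less.
    let norm = 1.0 - b + b * (doc_len / avg_doc_len);
    // Saturation: as tf grows the fraction approaches (k1 + 1), so the
    // score is bounded by idf * (k1 + 1) instead of growing linearly.
    idf * (tf * (k1 + 1.0)) / (tf + k1 * norm)
}
```

With the defaults and an average-length document, going from `tf = 1` to `tf = 10` less than doubles the term score, and a document ten times longer than average scores well below an average-length one at the same `tf`.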
Migration is lazy. An `_fts/index.json` written by an older binary is missing the `total_term_count` BM25 needs — it's backfilled on first open, no rebuild required.
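A rough sketch of what a backfill-on-open could look like. Everything here except the `total_term_count` name is hypothetical: the struct, the `doc_lengths` field, and the assumption that per-document token counts are already stored in the old index (which is what would make a rebuild unnecessary).

```rust
use std::collections::HashMap;

/// Hypothetical in-memory view of `_fts/index.json`.
struct FtsIndex {
    /// Token count per document id (assumed to exist in old indexes too).
    doc_lengths: HashMap<u64, u32>,
    /// `None` when the file was written by a pre-BM25 binary.
    total_term_count: Option<u64>,
}

impl FtsIndex {
    /// Fill in `total_term_count` if it is missing. Returns `true` when a
    /// backfill happened, so the caller knows to persist the index again.
    fn backfill_total_term_count(&mut self) -> bool {
        if self.total_term_count.is_some() {
            return false;
        }
        let total: u64 = self.doc_lengths.values().map(|&len| u64::from(len)).sum();
        self.total_term_count = Some(total);
        true
    }

    /// Average document length, the quantity BM25 length normalization needs.
    fn avg_doc_len(&self) -> f64 {
        let docs = self.doc_lengths.len();
        if docs == 0 {
            return 0.0;
        }
        self.total_term_count.unwrap_or(0) as f64 / docs as f64
    }
}
```

The backfill is a single pass over data already in memory, which is why it can run on first open without re-tokenizing any documents.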
Eighteen Snowball stemmers ship with the engine: English, Turkish, German, French, Spanish, Italian, Portuguese, Russian, Dutch, Danish, Finnish, Hungarian, Norwegian, Romanian, Greek, Arabic, Swedish, Tamil. Pick one via `OXIDB_FTS_LANG=tr` and `kitap`, `kitaplar`, and `kitaplarda` all collapse to the same stem. The analyzer is cached per process via `OnceLock`, so it is built once and the hot path pays no construction cost.
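The per-process caching pattern can be sketched with the standard library's `OnceLock`. The `Analyzer` struct, the `analyzer()` accessor, and the `"en"` fallback are assumptions for illustration; only `OXIDB_FTS_LANG` and the use of `OnceLock` come from the description above.

```rust
use std::sync::OnceLock;

/// Hypothetical analyzer handle; the real type presumably wraps a
/// Snowball stemmer for the selected language.
#[derive(Debug)]
struct Analyzer {
    lang: String,
}

static ANALYZER: OnceLock<Analyzer> = OnceLock::new();

/// Returns the process-wide analyzer. The environment variable is read
/// only on the first call; later calls return the cached instance.
fn analyzer() -> &'static Analyzer {
    ANALYZER.get_or_init(|| {
        // Falling back to "en" is an assumption made for this sketch.
        let lang = std::env::var("OXIDB_FTS_LANG").unwrap_or_else(|_| "en".to_string());
        Analyzer { lang }
    })
}
```

One consequence of this design: changing `OXIDB_FTS_LANG` after the first call has no effect until the process restarts.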
Highlights ride on the same tokenizer, so a query for `running` correctly marks `runs` in the response. `text_search_highlighted` returns `<mark>matched</mark>` snippets with offsets and matched-term counts. Multi-byte safe — char-boundary snapping prevents panics on Turkish and CJK input.
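Char-boundary snapping can be sketched with `str::is_char_boundary` from the standard library. The helper names below are hypothetical, not the engine's API; the point is that a snippet's byte offsets are nudged outward until they land between characters, so slicing never panics.

```rust
/// Move `idx` down to the nearest UTF-8 char boundary at or below it.
fn snap_down(s: &str, idx: usize) -> usize {
    let mut i = idx.min(s.len());
    // Index 0 is always a boundary, so this loop terminates.
    while !s.is_char_boundary(i) {
        i -= 1;
    }
    i
}

/// Move `idx` up to the nearest UTF-8 char boundary at or above it.
fn snap_up(s: &str, idx: usize) -> usize {
    let mut i = idx.min(s.len());
    // `s.len()` is always a boundary, so this loop terminates.
    while !s.is_char_boundary(i) {
        i += 1;
    }
    i
}

/// Slice a snippet around a byte range without splitting a multi-byte char.
fn snippet(s: &str, start: usize, end: usize) -> &str {
    let lo = snap_down(s, start);
    let hi = snap_up(s, end).max(lo);
    &s[lo..hi]
}
```

For example, in `"kitaplığı"` the bytes 7 and 9 both fall inside two-byte characters; a naive `&s[7..9]` would panic, while `snippet(s, 7, 9)` widens the range to the enclosing characters.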
Indexing itself runs on a parallel worker pool: `FtsDispatcher` round-robins extraction jobs across N workers, using `try_send` with a fallback to a blocking send, so one slow worker (think OCR on a 200-page PDF) doesn't backpressure the entire pool. Persistence is batched via `OXIDB_FTS_FLUSH_INTERVAL_MS` (default 1000 ms), which eliminated the previous N² disk-write amplification on bulk ingestion.
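One way to read the dispatch strategy, sketched with bounded `std::sync::mpsc` channels. This `Dispatcher` is a hypothetical stand-in for `FtsDispatcher`, and the exact policy (try each worker without blocking, starting at the round-robin target, and block only when every queue is full) is an assumption about how the fallback is arranged.

```rust
use std::sync::mpsc::{sync_channel, Receiver, SyncSender, TrySendError};

/// Stand-in for an extraction job (here just a document id).
type Job = u64;

struct Dispatcher {
    workers: Vec<SyncSender<Job>>,
    next: usize,
}

impl Dispatcher {
    /// Create `n` bounded queues; the receivers go to the worker threads.
    fn new(n: usize, capacity: usize) -> (Self, Vec<Receiver<Job>>) {
        assert!(n > 0, "need at least one worker");
        let (txs, rxs): (Vec<SyncSender<Job>>, Vec<Receiver<Job>>) =
            (0..n).map(|_| sync_channel(capacity)).unzip();
        (Dispatcher { workers: txs, next: 0 }, rxs)
    }

    /// Returns the index of the worker that took the job, or gives the
    /// job back if the blocking fallback finds its worker gone.
    fn dispatch(&mut self, mut job: Job) -> Result<usize, Job> {
        let n = self.workers.len();
        let start = self.next;
        self.next = (self.next + 1) % n;
        // Non-blocking pass: a full (slow) worker is skipped, not waited on.
        for offset in 0..n {
            let i = (start + offset) % n;
            match self.workers[i].try_send(job) {
                Ok(()) => return Ok(i),
                Err(TrySendError::Full(j)) | Err(TrySendError::Disconnected(j)) => job = j,
            }
        }
        // Every queue was full or closed: block on the round-robin target.
        self.workers[start].send(job).map(|()| start).map_err(|e| e.0)
    }
}
```

With two workers and a queue capacity of 1, a job aimed at a worker whose queue is still full lands on the other worker instead of stalling the caller.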