posts/0012.md · 2026-05-01
aggregation pipeline — $match, $group, $sort, $dateHistogram, $percentile
The aggregation surface is MongoDB-style — same operator names, same input/output shape, same `db.aggregate(coll, [stages...])` API. Stages execute as an iterator chain so intermediate results never need to fit in memory.
Stages implemented today: `$match`, `$group`, `$sort`, `$skip`, `$limit`, `$project`, `$count`, `$unwind`, `$addFields`, `$lookup`, `$dateHistogram`, `$percentile`. Group accumulators: `$sum`, `$avg`, `$min`, `$max`, `$count`, `$push`, `$addToSet`, `$first`, `$last`, plus `$percentile` with exact linear interpolation over the full input.
**$dateHistogram** buckets a date field by interval and applies accumulators per bucket. Intervals: `Ns`/`Nm`/`Nh`/`Nd`/`Nw` (fixed-width) or `1M`/`1y` (calendar). `min_doc_count: 0` fills empty buckets between observed min and max — emitted as a synthetic `$group` chain.
**Index-backed $sort** when the sort field is indexed: the pipeline iterates the BTreeMap directly instead of collecting everything and sorting. `O(limit)` for `$sort + $limit`, no `O(n log n)` price.
**Index-backed $group** when the group key is indexed: buckets are read off the index in key order. The accumulator doesn't need a hash map.
Aggregation is where embedded mode shines hardest. The stages are Rust functions hitting in-process state. No network round-trip per stage, no serialization between them. A four-stage pipeline on 100K docs returns in tens of milliseconds on a laptop.