Default n-gram tokenizer keeps JSON delimiters " : { }#4
Conversation
Structured/JSON greps like "role": were un-prunable because the tokenizer
dropped all punctuation, so "role" and "system" matched as bare words in
every block. Keep the JSON delimiters " : { } inside n-gram runs so n-grams
span them and stay selective.
Measured on WildChat (3.2M LLM logs, full 13 GB source): index grows just
1.04x (1.20 to 1.24 GB), versus 2.6x for keeping all punctuation, while JSON
greps see order-of-magnitude wins. End-to-end to the first 100 results,
"role": drops from 14.7 s / 678 MB scanned to 0.35 s / 20 MB (42x), and
"type": "object" from 79 s to 2.2 s (35x). Code, URL, and prose greps are
unchanged, with no regressions.
The kept-character set is recorded per index in kv metadata
(hypgrep.ngram_chars) and read back at query time so query and index tokenize
identically. An index built before this change has no such key and is read as
the empty set (plain alphanumeric), so it keeps working unchanged, with no
index format-version bump.
Code Review: structural (JSON-delimiter) default tokenizerSolid, well-scoped change with good test coverage and careful backward-compat reasoning. Correctnesspraise: the regex character-class escaping is correct and complete ( praise: backward AND forward compatible without a version bump, and that's justified. A pre-structural index (no praise: query/index tokenizer agreement is enforced by construction. Metadata round-trips, including the empty-string case (the Nits
Validation note
Overall: ship it. Optionally add a CHANGELOG note on the default-param behavior change. |
Structured and JSON greps like
"role":were un-prunable: the tokenizer dropped all punctuation, soroleandsystemmatched as bare words in every block and the index could not narrow the candidate set at all.This makes the default tokenizer keep the JSON delimiters
" : { }inside n-gram runs, so n-grams span them and stay selective on structured text.Results (WildChat, 3.2M LLM logs, full 13 GB source)
Index grows just 1.04x (1.20 to 1.24 GB), while JSON greps see order-of-magnitude wins. Code, URL, and prose greps are unchanged, with no regressions.
End-to-end to the first 100 results, alphanumeric vs structural:
"role":"type": "object""temperature": 0"content":{"name":system.out.printlnself.assertequalschopenhauerWe picked the kept-character set empirically: keeping only
" : { }recovers essentially all the selectivity of keeping every punctuation character, but at 1.04x the index size instead of 2.6x, and without the one query regression that the keep-everything approach showed.Backward compatibility
The kept-character set is recorded per index in kv metadata (
hypgrep.ngram_chars) and read back at query time, so query and index always tokenize identically. An index built before this change has no such key and is read as the empty set (plain alphanumeric), so existing indexes keep working unchanged. No index format-version bump.Tests
New unit and round-trip tests cover the structural default, the recorded character set, the regex-escaping of kept characters, and the pre-structural compatibility path. Full suite passes (95 tests), lint clean.