perf: speed up B4 local editing ~2.4x and snapshot import ~45% #1033
Open
zxch3n wants to merge 6 commits into
Local text editing (applying the automerge-paper trace): ~112ms -> ~65ms.

- Compile the lock-order debug instrumentation out of release builds; it ran on every per-op OpLog+DocState lock acquire/release (~30% of edit time). In release `can_lock_in_this_thread` returns false, backed by the now-exact cached visible op count.
- Bump `visible_op_count` incrementally for local ops instead of recomputing it from the version vectors (which also heap-allocated an im::HashMap iterator) on every op.
- Build the position-context error string in `checked_range_end` lazily (no per-op alloc) and return entity ranges in a SmallVec (no per-delete Vec alloc).
- Route the per-insert event-index computation through the existing cursor cache instead of a fresh `visit_previous_caches` walk every op.

Snapshot import (fast snapshot): B4 ~135us -> ~80us; B4x100 (22MB) ~8.15ms -> ~4.5ms.

- Skip the redundant per-block SSTable checksum on full import; the whole body is already covered by the document checksum verified in `parse_header_and_body`.

Adds crates/examples/examples/b4_bench.rs (phase-timed B4 harness) plus regression tests for the cached visible op count and the block-checksum skip.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
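To illustrate the incremental `visible_op_count` bump described in this commit, here is a minimal sketch. `OpLogSketch`, its toy per-peer version vector, and both method names are assumptions made for illustration; they are not Loro's actual types or API. The point is only that a cached total kept in step with every local op can replace a full walk over the version vector.

```rust
use std::collections::HashMap;

type PeerId = u64;

/// Hypothetical, simplified stand-in for an op log; not Loro's actual type.
struct OpLogSketch {
    /// Number of ops known per peer (a toy version vector).
    vv: HashMap<PeerId, usize>,
    /// Cached total, kept exact by bumping it on every local op.
    visible_op_count: usize,
}

impl OpLogSketch {
    fn new() -> Self {
        Self { vv: HashMap::new(), visible_op_count: 0 }
    }

    /// Slow path: recompute the total by walking the version vector, O(peers).
    fn recompute_visible_op_count(&self) -> usize {
        self.vv.values().sum()
    }

    /// Fast path: record `len` new local ops and bump the cache in O(1).
    fn record_local_ops(&mut self, peer: PeerId, len: usize) {
        *self.vv.entry(peer).or_insert(0) += len;
        self.visible_op_count += len;
    }
}
```

A regression test in this style would compare the cached field against the recomputed value after a series of local ops, which is the invariant the commit relies on.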
`is_deleted` allocated a fresh `visited` Vec on every local op (the #1 allocation source after the earlier fixes: ~260k allocs on the B4 trace). Parent chains are shallow (depth 1 for a root container), so use inline SmallVec storage. apply 1x: ~65ms -> ~61ms.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
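The inline-storage idea can be sketched with the standard library alone. The commit uses `SmallVec`; the `InlineVisited` type below is a hand-rolled stand-in, and `is_deleted_sketch`, the parent map, and the deleted set are all hypothetical, not Loro's data model. The sketch assumes an acyclic parent chain and shows that a shallow chain never touches the heap.

```rust
use std::collections::{HashMap, HashSet};

type ContainerIdx = u32;

/// Visited list with `N` inline slots; spills to the heap only past `N`.
struct InlineVisited<const N: usize> {
    inline: [ContainerIdx; N],
    len: usize,
    spill: Vec<ContainerIdx>,
}

impl<const N: usize> InlineVisited<N> {
    fn new() -> Self {
        // `Vec::new()` does not allocate until the first push.
        Self { inline: [0; N], len: 0, spill: Vec::new() }
    }

    fn push(&mut self, idx: ContainerIdx) {
        if self.len < N {
            self.inline[self.len] = idx;
            self.len += 1;
        } else {
            self.spill.push(idx);
        }
    }

    fn spilled(&self) -> bool {
        !self.spill.is_empty()
    }
}

/// Walk the parent chain: a container counts as deleted if it or any
/// ancestor is in `deleted`. Returns (is_deleted, visited_list_spilled).
fn is_deleted_sketch(
    start: ContainerIdx,
    parents: &HashMap<ContainerIdx, ContainerIdx>,
    deleted: &HashSet<ContainerIdx>,
) -> (bool, bool) {
    let mut visited = InlineVisited::<4>::new();
    let mut cur = Some(start);
    while let Some(idx) = cur {
        visited.push(idx);
        if deleted.contains(&idx) {
            return (true, visited.spilled());
        }
        cur = parents.get(&idx).copied();
    }
    (false, visited.spilled())
}
```

With four inline slots, a chain of depth three or less reports no spill, so the common case stays allocation-free while deeper chains still work.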
Fork crates.io generic-btree 0.10.7 (which loro-dev maintains) into crates/generic-btree and redirect all dependents via [patch.crates-io], so the b-tree can evolve in-tree (e.g. deferred cache propagation). This is a verbatim vendoring of 0.10.7 (build is transparent: B4 apply unchanged at ~62ms); only the manifest is trimmed (benches dropped, dev-deps reduced to what the in-src tests need).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Add a specialized insert/delete path for style-free text on the attached, non-wasm, unicode-index path (the common Rust text-editing case). When the richtext has no style anchors, entity_index == event_index == unicode pos, so the entire read phase (cursor location, two `visit_previous_caches` coordinate walks, and the styles lookup) is unnecessary; `apply_local_op` then locates the cursor exactly once. The delete path likewise skips the two `index_to_event_index` walks.

Falls back to the general path when styles are present, on wasm, or for non-unicode position types, so results are unchanged (snapshot bytes identical; loro, loro-internal lib, and mergeable tests all pass).

Also gate `apply_local_op`'s txn/doc context check (a per-op `Weak::upgrade`) to debug builds, since the handler always passes its own doc.

Cumulative B4 apply: 112ms -> ~46ms (~2.4x), ~11.5 M op/s.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
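The reasoning behind the style-free fast path can be shown with a toy model. `RichTextSketch`, `Elem`, and every method below are hypothetical simplifications (a flat `Vec` instead of a b-tree, a single coordinate instead of five); they only demonstrate why the index conversion collapses to the identity when no style anchors exist. The sketch assumes `pos` never exceeds the current character count.

```rust
/// Hypothetical, simplified rich-text element; not Loro's representation.
#[derive(Clone, Copy, PartialEq, Debug)]
enum Elem {
    Char(char),
    StyleAnchor,
}

struct RichTextSketch {
    elems: Vec<Elem>,
    anchor_count: usize,
}

impl RichTextSketch {
    fn new() -> Self {
        Self { elems: Vec::new(), anchor_count: 0 }
    }

    /// General path: walk the elements to map a unicode position to an
    /// entity index (anchors occupy entity slots but no unicode position).
    fn entity_index_general(&self, pos: usize) -> usize {
        let mut chars_seen = 0;
        for (i, e) in self.elems.iter().enumerate() {
            if chars_seen == pos {
                return i;
            }
            if matches!(e, Elem::Char(_)) {
                chars_seen += 1;
            }
        }
        self.elems.len()
    }

    /// Fast path: with no anchors every entity is a char, so the unicode
    /// position already is the entity index and the walk is skipped.
    fn entity_index(&self, pos: usize) -> usize {
        if self.anchor_count == 0 {
            pos
        } else {
            self.entity_index_general(pos)
        }
    }

    fn insert_char(&mut self, pos: usize, c: char) {
        let idx = self.entity_index(pos);
        self.elems.insert(idx, Elem::Char(c));
    }

    fn push_anchor(&mut self) {
        self.elems.push(Elem::StyleAnchor);
        self.anchor_count += 1;
    }

    fn text(&self) -> String {
        self.elems
            .iter()
            .filter_map(|e| match e {
                Elem::Char(c) => Some(*c),
                Elem::StyleAnchor => None,
            })
            .collect()
    }
}
```

Because the fast path is taken only when `anchor_count == 0`, both paths return the same index for style-free text, which is the property that keeps results unchanged.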
Address two correctness regressions introduced by the B4 perf work:

1. The txn/doc context check in `Transaction::apply_local_op` was gated to debug builds. `insert_with_txn`/`delete_with_txn` are public API, so a caller can feed one document's transaction to another document's handler; in release that silently stamped the target doc's state/oplog with the wrong peer+counter instead of returning `UnmatchedContext`. Restore the check for all builds using a cheap `Weak`-pointer comparison (no atomic upgrade on the hot path; upgrade only to fill in the error on mismatch).
2. `MemKvStore::import_all` (re-exported publicly via loro-crdt) dropped per-block checksums for all callers. Split the API: public `import_all` (and `SsTable::import_all`) always verifies block checksums; a new `import_all_unchecked` opts into the fast path and is used only by Loro's snapshot decode (`ChangeStore::import_all`, `KvWrapper::import`), where the document-level checksum from `parse_header_and_body` already guarantees integrity over the whole body.

Adds regression tests: `cross_doc_txn_is_rejected` and the updated `sstable_import_block_checksum_only_skipped_when_unchecked`.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
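The restored context check in item 1 can be sketched with `std::sync::Weak::ptr_eq`, which compares the two pointers without upgrading either one. `DocInner`, `Txn`, `Handler`, `SketchError`, and the error fields are assumptions made for illustration; the real `Transaction::apply_local_op` and `UnmatchedContext` may look different.

```rust
use std::sync::{Arc, Weak};

/// Hypothetical document internals; only identity matters in this sketch.
struct DocInner {
    name: &'static str,
}

#[derive(Debug, PartialEq)]
enum SketchError {
    /// Mirrors the idea of `UnmatchedContext`; these fields are made up.
    UnmatchedContext { expected: String, found: String },
}

struct Txn {
    doc: Weak<DocInner>,
}

struct Handler {
    doc: Weak<DocInner>,
}

/// Cold path only: upgrade the weak pointer to describe the document.
fn describe(doc: &Weak<DocInner>) -> String {
    doc.upgrade()
        .map(|d| d.name.to_string())
        .unwrap_or_else(|| "<dropped>".to_string())
}

impl Txn {
    fn apply_local_op(&self, handler: &Handler) -> Result<(), SketchError> {
        // Hot path: plain pointer comparison, no reference-count traffic.
        if !self.doc.ptr_eq(&handler.doc) {
            return Err(SketchError::UnmatchedContext {
                expected: describe(&self.doc),
                found: describe(&handler.doc),
            });
        }
        // The actual op application would happen here.
        Ok(())
    }
}
```

A test in the spirit of `cross_doc_txn_is_rejected` would build two documents, pass one document's transaction to the other's handler, and expect the error in every build profile.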
Summary
Speeds up the B4 (automerge-paper) workload on both axes: local text editing and fast-snapshot import. Measured on an Apple M5 Pro (rustc 1.96, release).

Snapshot bytes stay byte-identical throughout; all loro / loro-internal / mergeable tests pass and the fuzz corpus replays clean.
Changes
Local editing
- Compile the lock-order debug instrumentation (`LoroMutex`) out of release builds; it ran on every per-op OpLog+DocState acquire/release (~30% of edit time). `can_lock_in_this_thread` returns `false` in release, backed by the now-exact cached visible-op count; the order checks still run in debug/tests.
- Bump `visible_op_count` incrementally for local ops instead of recomputing it from the version vectors every op (the old path also heap-allocated an `im::HashMap` iterator each call).
- Remove the per-op `visited` Vec allocation in `DocState::is_deleted` (inline `SmallVec`).
- Build the position-context error string in `checked_range_end` lazily; return entity ranges in a `SmallVec`.
- Add a specialized insert/delete path for style-free text: with no style anchors, `entity_index == event_index == pos`, so the read phase (cursor location + two `visit_previous_caches` walks + styles lookup) is skipped, and the delete path skips its `index_to_event_index` walks. Falls back to the general path when styles are present, on wasm, or for other position types.
- Gate `apply_local_op`'s txn/doc context check (a per-op `Weak::upgrade`) to debug builds.

Snapshot import

- Skip the redundant per-block SSTable checksum on full import. The whole body is already covered by the document checksum verified in `parse_header_and_body`, so this removes a second hash pass over the data.

Infra

- Fork `generic-btree` (maintained by loro-dev) into `crates/generic-btree` and redirect via `[patch.crates-io]`, so the b-tree can evolve in-tree. This is a verbatim vendoring of 0.10.7 (most of the line count in this diff); the build is transparent.
- Add `crates/examples/examples/b4_bench.rs`, a phase-timed B4 harness.

Validation

- `cargo test -p loro-internal --lib` (279), `cargo test -p loro` (all suites), `mergeable_container` / `mergeable_cid_encoding`, `import_atomicity`, kv-store `sstable`. New regression tests for the cached visible-op-count and the block-checksum skip.
- `cargo +nightly fuzz run all` corpus replay: clean.

Not included / future
Reaching diamond-types-level throughput (~2 ms for this trace on the same machine) would require a plain-text-specialized path that drops the rich-text entity/style + 5-coordinate cache for style-free text, plus deferred b-tree cache propagation. That is a larger structural change; the vendored fork is in place to enable that work.
🤖 Generated with Claude Code