
perf: speed up B4 local editing ~2.4x and snapshot import ~45%#1033

Open
zxch3n wants to merge 6 commits into main from perf/b4-local-edit-and-import

@zxch3n zxch3n commented Jun 25, 2026

Member

Summary

Speeds up the B4 (automerge-paper) workload on both axes: local text editing and fast-snapshot import. Measured on an Apple M5 Pro (rustc 1.96, release):

| | before | after | Δ |
|---|---|---|---|
| apply 1× (local edit, 260K actions) | 112 ms | ~46 ms | −59% (2.4×), 11.5 M op/s |
| apply 100× | 11.5 s | ~5.25 s | −54% |
| import (B4 snapshot) | 135 µs | ~78 µs | −42% |
| import (B4×100, 22 MB) | 8.15 ms | ~4.4 ms | −46% |

Snapshot bytes stay byte-identical throughout; all loro / loro-internal / mergeable tests pass and the fuzz corpus replays clean.

Changes

Local editing

  • Compile the lock-order debug instrumentation (LoroMutex) out of release builds — it ran on every per-op OpLog+DocState acquire/release (~30% of edit time). can_lock_in_this_thread returns false in release, backed by the now-exact cached visible-op count; the order checks still run in debug/tests.
  • Bump visible_op_count incrementally for local ops instead of recomputing it from the version vectors every op (the old path also heap-allocated an im::HashMap iterator each call).
  • Avoid the per-op visited Vec allocation in DocState::is_deleted (inline SmallVec).
  • Build the position-context error string in checked_range_end lazily; return entity ranges in a SmallVec.
  • Route the per-insert event-index computation through the existing cursor cache.
  • Plain-text fast path: for style-free text (non-wasm, unicode index) entity_index == event_index == pos, so the read phase (cursor location + two visit_previous_caches walks + styles lookup) is skipped, and the delete path skips its index_to_event_index walks. Falls back to the general path when styles are present, on wasm, or for other position types.
  • Gate apply_local_op's txn/doc context check (a per-op Weak::upgrade) to debug builds.
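The lazy error-string item in the list above can be illustrated with a small sketch. The functions below are hypothetical stand-ins with invented signatures, not Loro's actual `checked_range_end`; they only show the general idea of moving `format!` onto the failure branch so the success path allocates nothing.

```rust
/// Hypothetical error type for this sketch (not Loro's).
#[derive(Debug, PartialEq)]
pub struct OutOfBound {
    pub context: String,
}

/// Eager variant: the context string is formatted (and heap-allocated)
/// on every call, even when the range turns out to be valid.
pub fn checked_range_end_eager(pos: usize, len: usize, total: usize) -> Result<usize, OutOfBound> {
    let context = format!("pos={pos} len={len} total={total}");
    match pos.checked_add(len) {
        Some(end) if end <= total => Ok(end),
        _ => Err(OutOfBound { context }),
    }
}

/// Lazy variant: the closure passed to `ok_or_else` runs only on the
/// error branch, so a valid range costs no allocation.
pub fn checked_range_end_lazy(pos: usize, len: usize, total: usize) -> Result<usize, OutOfBound> {
    pos.checked_add(len)
        .filter(|&end| end <= total)
        .ok_or_else(|| OutOfBound {
            context: format!("pos={pos} len={len} total={total}"),
        })
}
```

Both variants return the same values; only the cost of the common, successful case differs.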

Snapshot import

  • Skip the redundant per-block SSTable checksum on full import — the whole snapshot body is already covered by the document-level checksum verified in parse_header_and_body, so this removes a second hash pass over the data.

Infra

  • Vendor generic-btree (maintained by loro-dev) into crates/generic-btree and redirect via [patch.crates-io], so the b-tree can evolve in-tree. This is a verbatim vendoring of 0.10.7, which accounts for most of the line count in this diff; it is performance-neutral (B4 apply is unchanged at ~62 ms).
  • Add crates/examples/examples/b4_bench.rs, a phase-timed B4 harness.
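As a rough idea of what a phase-timed harness looks like, here is a minimal sketch built on `std::time::Instant`. `PhaseTimer` and its methods are hypothetical names for illustration; the actual structure of `b4_bench.rs` is not shown in this PR description and may differ.

```rust
use std::time::{Duration, Instant};

/// Records how long each named phase of a benchmark run takes.
pub struct PhaseTimer {
    phases: Vec<(&'static str, Duration)>,
}

impl PhaseTimer {
    pub fn new() -> Self {
        Self { phases: Vec::new() }
    }

    /// Run `f`, record its wall-clock duration under `name`, and pass
    /// its result through so phases can feed each other.
    pub fn run<T>(&mut self, name: &'static str, f: impl FnOnce() -> T) -> T {
        let start = Instant::now();
        let out = f();
        self.phases.push((name, start.elapsed()));
        out
    }

    /// Phase names in the order they were run.
    pub fn names(&self) -> Vec<&'static str> {
        self.phases.iter().map(|(n, _)| *n).collect()
    }

    /// One line per phase, in milliseconds.
    pub fn report(&self) -> String {
        self.phases
            .iter()
            .map(|(n, d)| format!("{n}: {:.3} ms", d.as_secs_f64() * 1e3))
            .collect::<Vec<_>>()
            .join("\n")
    }
}
```

A harness would wrap each stage (for example applying the trace, exporting, importing) in `run` and print `report()` at the end.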

Validation

  • cargo test -p loro-internal --lib (279), cargo test -p loro (all suites), mergeable_container / mergeable_cid_encoding, import_atomicity, kv-store sstable. New regression tests for the cached visible-op-count and the block-checksum skip.
  • cargo +nightly fuzz run all corpus replay: clean.

Not included / future

Reaching diamond-types-level throughput (~2 ms for this trace on the same machine) would require a plain-text-specialized path that drops the rich-text entity/style + 5-coordinate cache for style-free text, plus deferred b-tree cache propagation — a larger structural change. The vendored fork is in place to enable that work.

🤖 Generated with Claude Code

zxch3n and others added 4 commits June 25, 2026 12:23
Local text editing (applying the automerge-paper trace): ~112ms -> ~65ms.
- Compile the lock-order debug instrumentation out of release builds; it ran on
  every per-op OpLog+DocState lock acquire/release (~30% of edit time). In
  release `can_lock_in_this_thread` returns false, backed by the now-exact
  cached visible op count.
- Bump `visible_op_count` incrementally for local ops instead of recomputing it
  from the version vectors (which also heap-allocated an im::HashMap iterator)
  on every op.
- Build the position-context error string in `checked_range_end` lazily (no
  per-op alloc) and return entity ranges in a SmallVec (no per-delete Vec alloc).
- Route the per-insert event-index computation through the existing cursor cache
  instead of a fresh `visit_previous_caches` walk every op.
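The incremental `visible_op_count` change above follows a common cached-aggregate pattern. The sketch below uses a toy `ToyOpLog` (a hypothetical type, with a plain `HashMap` standing in for the version vectors) and assumes the count is simply the sum of per-peer op counts; Loro's real definition may be more involved.

```rust
use std::collections::HashMap;

pub struct ToyOpLog {
    /// peer id -> number of ops seen from that peer.
    vv: HashMap<u64, u64>,
    /// Cached sum of all entries in `vv`, kept exact on every mutation.
    cached_count: u64,
}

impl ToyOpLog {
    pub fn new() -> Self {
        Self { vv: HashMap::new(), cached_count: 0 }
    }

    /// O(peers) recomputation: the per-op work the cache replaces.
    pub fn recompute_count(&self) -> u64 {
        self.vv.values().sum()
    }

    /// Record `len` new local ops from `peer` and bump the cache in O(1).
    pub fn push_local(&mut self, peer: u64, len: u64) {
        *self.vv.entry(peer).or_insert(0) += len;
        self.cached_count += len;
    }

    /// Debug builds cross-check the cache against the slow path.
    pub fn visible_op_count(&self) -> u64 {
        debug_assert_eq!(self.cached_count, self.recompute_count());
        self.cached_count
    }
}
```

The `debug_assert_eq!` keeps the slow path around as an oracle in debug and test builds while release builds pay only for the field read.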

Snapshot import (fast snapshot): B4 ~135us -> ~80us; B4x100 (22MB) ~8.15ms -> ~4.5ms.
- Skip the redundant per-block SSTable checksum on full import; the whole body
  is already covered by the document checksum verified in parse_header_and_body.
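To make the redundancy argument concrete, here is a toy model of a body made of checksummed blocks plus one document-level checksum. All names (`Block`, `import`, `body_bytes`) and the FNV-1a hash are assumptions for illustration; they do not reflect Loro's SSTable layout or its actual checksum.

```rust
/// FNV-1a, used here only as a stand-in checksum.
pub fn checksum(bytes: &[u8]) -> u32 {
    let mut h: u32 = 0x811c_9dc5;
    for &b in bytes {
        h ^= b as u32;
        h = h.wrapping_mul(0x0100_0193);
    }
    h
}

pub struct Block {
    pub data: Vec<u8>,
    pub checksum: u32,
}

#[derive(Debug, PartialEq)]
pub enum ImportError {
    BodyChecksum,
    BlockChecksum(usize),
}

/// The bytes covered by the document-level checksum: every block's
/// data followed by its stored block checksum.
pub fn body_bytes(blocks: &[Block]) -> Vec<u8> {
    let mut out = Vec::new();
    for b in blocks {
        out.extend_from_slice(&b.data);
        out.extend_from_slice(&b.checksum.to_le_bytes());
    }
    out
}

/// Returns the number of payload bytes imported.
pub fn import(blocks: &[Block], body_checksum: u32, verify_blocks: bool) -> Result<usize, ImportError> {
    // First pass: the document-level checksum covers the whole body.
    if checksum(&body_bytes(blocks)) != body_checksum {
        return Err(ImportError::BodyChecksum);
    }
    // Optional second pass: re-hash each block individually.
    if verify_blocks {
        for (i, b) in blocks.iter().enumerate() {
            if checksum(&b.data) != b.checksum {
                return Err(ImportError::BlockChecksum(i));
            }
        }
    }
    Ok(blocks.iter().map(|b| b.data.len()).sum())
}
```

In this model, any corruption of the body after export is already caught by the first pass, so the second pass only adds a full extra hash over the data.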

Adds crates/examples/examples/b4_bench.rs (phase-timed B4 harness) plus
regression tests for the cached visible op count and the block-checksum skip.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
`is_deleted` allocated a fresh `visited` Vec on every local op (the #1
allocation source after the earlier fixes: ~260k allocs on the B4 trace).
Parent chains are shallow (depth 1 for a root container), so use inline
SmallVec storage. apply 1x: ~65ms -> ~61ms.
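The inline-storage idea can be sketched without the `smallvec` crate. `InlineIds` below is a hand-rolled, simplified stand-in for `SmallVec`, and `Tree::is_deleted` is a toy parent-chain walk that assumes the `visited` list is used to memoize the answer; neither is Loro's actual code.

```rust
use std::collections::HashMap;

/// Up to N ids live inline; a heap Vec is touched only for deeper chains.
pub struct InlineIds<const N: usize> {
    inline: [u32; N],
    len: usize,
    spill: Vec<u32>, // `Vec::new()` does not allocate until first push
}

impl<const N: usize> InlineIds<N> {
    pub fn new() -> Self {
        Self { inline: [0; N], len: 0, spill: Vec::new() }
    }
    pub fn push(&mut self, id: u32) {
        if self.len < N {
            self.inline[self.len] = id;
            self.len += 1;
        } else {
            self.spill.push(id);
        }
    }
    pub fn spilled(&self) -> bool {
        !self.spill.is_empty()
    }
    pub fn iter(&self) -> impl Iterator<Item = u32> + '_ {
        self.inline[..self.len].iter().copied().chain(self.spill.iter().copied())
    }
}

pub struct Tree {
    /// child id -> parent id; roots have no entry. Assumed acyclic.
    pub parent: HashMap<u32, u32>,
    /// Memoized answers, e.g. removed containers seeded as `true`.
    pub known: HashMap<u32, bool>,
}

impl Tree {
    pub fn is_deleted(&mut self, id: u32) -> bool {
        let mut visited: InlineIds<4> = InlineIds::new();
        let mut cur = id;
        let answer = loop {
            if let Some(&d) = self.known.get(&cur) {
                break d;
            }
            visited.push(cur);
            match self.parent.get(&cur) {
                Some(&p) => cur = p,
                None => break false, // an unknown root counts as alive
            }
        };
        // Memoize the result for every node walked on the way up.
        for v in visited.iter() {
            self.known.insert(v, answer);
        }
        answer
    }
}
```

With shallow parent chains, the walk stays within the inline array and the per-call `Vec` allocation disappears.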

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Fork crates.io generic-btree 0.10.7 (which loro-dev maintains) into
crates/generic-btree and redirect all dependents via [patch.crates-io], so the
b-tree can evolve in-tree (e.g. deferred cache propagation). This is a verbatim
vendoring of 0.10.7 (build is transparent: B4 apply unchanged at ~62ms); only
the manifest is trimmed (benches dropped, dev-deps reduced to what the in-src
tests need).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Add a specialized insert/delete path for style-free text on the attached,
non-wasm, unicode-index path (the common Rust text-editing case). When the
richtext has no style anchors, entity_index == event_index == unicode pos, so
the entire read phase -- cursor location, two `visit_previous_caches` coordinate
walks, and the styles lookup -- is unnecessary; `apply_local_op` then locates the
cursor exactly once. The delete path likewise skips the two `index_to_event_index`
walks. Falls back to the general path when styles are present, on wasm, or for
non-unicode position types, so results are unchanged (snapshot bytes identical;
loro, loro-internal lib, and mergeable tests all pass).
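The index identity behind the fast path can be shown on a toy model. `ToyRichText` below is a hypothetical structure in which entity indices count every element and unicode positions count only characters; it assumes `pos` is within bounds and is not how Loro's richtext state is laid out.

```rust
#[derive(Clone, Copy, PartialEq)]
pub enum Elem {
    Char(char),
    StyleAnchor,
}

pub struct ToyRichText {
    elems: Vec<Elem>,
    anchor_count: usize,
}

impl ToyRichText {
    pub fn new() -> Self {
        Self { elems: Vec::new(), anchor_count: 0 }
    }

    pub fn push(&mut self, e: Elem) {
        if e == Elem::StyleAnchor {
            self.anchor_count += 1;
        }
        self.elems.push(e);
    }

    /// General path: scan to translate a unicode position into an
    /// entity index (the index of the `pos`-th character, or the end).
    pub fn entity_index_general(&self, pos: usize) -> usize {
        let mut chars = 0;
        for (i, e) in self.elems.iter().enumerate() {
            if let Elem::Char(_) = e {
                if chars == pos {
                    return i;
                }
                chars += 1;
            }
        }
        self.elems.len()
    }

    /// Fast path: with no style anchors, every element is a character,
    /// so the entity index equals the unicode position and no scan is needed.
    pub fn entity_index(&self, pos: usize) -> usize {
        if self.anchor_count == 0 {
            pos
        } else {
            self.entity_index_general(pos)
        }
    }
}
```

The fallback keeps results identical to the general path whenever an anchor is present, which mirrors why the change leaves snapshot bytes unchanged.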

Also gate `apply_local_op`'s txn/doc context check (a per-op `Weak::upgrade`) to
debug builds, since the handler always passes its own doc.

Cumulative B4 apply: 112ms -> ~46ms (~2.4x), ~11.5 M op/s.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
github-actions Bot commented Jun 25, 2026

Contributor

WASM Size Report

  • Original size: 3029.43 KB
  • Gzipped size: 999.71 KB
  • Brotli size: 701.20 KB

zxch3n and others added 2 commits June 25, 2026 15:49
Address two correctness regressions introduced by the B4 perf work:

1. The txn/doc context check in `Transaction::apply_local_op` was gated to
   debug builds. `insert_with_txn`/`delete_with_txn` are public API, so a
   caller can feed one document's transaction to another document's handler;
   in release that silently stamped the target doc's state/oplog with the
   wrong peer+counter instead of returning `UnmatchedContext`. Restore the
   check for all builds using a cheap `Weak`-pointer comparison (no atomic
   upgrade on the hot path; upgrade only to fill in the error on mismatch).

2. `MemKvStore::import_all` (re-exported publicly via loro-crdt) dropped
   per-block checksums for all callers. Split the API: public `import_all`
   (and `SsTable::import_all`) always verifies block checksums; a new
   `import_all_unchecked` opts into the fast path and is used only by Loro's
   snapshot decode (`ChangeStore::import_all`, `KvWrapper::import`), where the
   document-level checksum from `parse_header_and_body` already guarantees
   integrity over the whole body.
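The check described in item 1 can be sketched with `std::sync::Weak::ptr_eq`, which compares the pointed-to allocations without touching the reference counts. The types below (`DocInner`, `Txn`, `Handler`, `TxnError`) are simplified, hypothetical stand-ins, and the error payload is invented for the example.

```rust
use std::sync::{Arc, Weak};

pub struct DocInner {
    pub name: &'static str,
}

pub struct Txn {
    pub doc: Weak<DocInner>,
}

pub struct Handler {
    pub doc: Weak<DocInner>,
}

#[derive(Debug, PartialEq)]
pub enum TxnError {
    UnmatchedContext {
        expected: Option<&'static str>,
        found: Option<&'static str>,
    },
}

impl Handler {
    pub fn insert_with_txn(&self, txn: &Txn) -> Result<(), TxnError> {
        // Hot path: a pointer comparison only, no atomic upgrade.
        if !Weak::ptr_eq(&self.doc, &txn.doc) {
            // Cold path: upgrade just to fill in the error details.
            return Err(TxnError::UnmatchedContext {
                expected: self.doc.upgrade().map(|d| d.name),
                found: txn.doc.upgrade().map(|d| d.name),
            });
        }
        // ... the actual insert would be applied here ...
        Ok(())
    }
}
```

Because the comparison runs in every build profile, a transaction from another document is rejected in release as well as in debug.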

Adds regression tests: `cross_doc_txn_is_rejected` and the updated
`sstable_import_block_checksum_only_skipped_when_unchecked`.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>