Harden the dual-era stream loop's era-lock and rejection semantics#3040
Conversation
Post-merge review follow-up to #3038, fixing the cases where a failed request could still lock the connection's era: - The modern era now locks only when a request SUCCEEDS, mirroring the legacy side's lock-on-success rule. Previously the lock committed at classification time, so a first request with a well-formed envelope but malformed content - or subscriptions/listen, or an unknown method - locked the connection modern after failing; the client's initialize fallback then got -32022, the one code auto-negotiating clients do not fall back from, stranding the connection for its lifetime (a whole subprocess on stdio). - initialize is legacy-distinctive by definition (the method does not exist at modern versions), so it always takes the handshake path now, even when a confused client stamps the envelope triple on it. Previously it routed modern, failed METHOD_NOT_FOUND, and locked the era modern - bricking the handshake the same way. - A bare server/discover on a legacy-locked connection falls through to the loop runner's per-version surface validation again, restoring the byte-identical METHOD_NOT_FOUND a handshake-only server produced; only envelope-bearing frames get the INVALID_REQUEST rejection (a conforming legacy client can never send the reserved triple). - classify_inbound_request rejects a present-but-non-string protocol version as a descriptive INVALID_PARAMS instead of raising ValidationError out of its own -32022 payload construction (`requested` is a str field); the dispatcher's exception ladder happened to mask that as a generic -32602. - _NoServerRequestsDispatchContext masks transport.can_send_request so the per-message transport metadata agrees with the wrapper's denial, matching the modern HTTP entry's context. - NotifyOnlyOutbound inherits _NoChannelOutbound's refusal instead of duplicating it, and the has_standalone_channel / Raises contracts now document that a modern connection refuses requests on a real channel. - Docs: the migration guide and protocol-versions page note the discover probe is transport-independent; the subscriptions page documents the streamable-HTTP-only limitation of subscriptions/listen; serve_loop's stale shared-recipe claim is corrected. Also replaces two pyright suppressions in the stdio test doubles with typeshed-compatible Buffer signatures.
📚 Documentation preview
|
There was a problem hiding this comment.
1 issue found across 9 files
Reply with feedback, questions, or to request a fix.
Fix all with cubic | Re-trigger cubic
The dual-era loop's lock-on-success rule approximated "client-visible success" with "the handler returned", leaving windows where a request the client saw fail still moved the era: a pipelined envelope-bearing request completing after an inline initialize overwrote the committed legacy lock (stranding the freshly-handshaked client), and a peer cancel landing during the handler's shielded teardown locked the era while the client received "Request cancelled". Era commits now run through a single era_settles predicate: monotone (the first success wins, a straggler from the other era can never move a committed lock) and skipped when a pending cancel is about to replace the response. Two modern-path divergences between the stream loop and the HTTP entry close by sharing one pipeline: Connection.from_envelope takes the classifier's raw envelope values and owns the tolerant coercion (custom methods now degrade mis-shaped clientInfo identically everywhere), and a shared modern_error_data boundary maps unmapped handler exceptions to a generic INTERNAL_ERROR instead of the legacy code-0 catch-all putting str(e) on the wire. The classifier's non-string protocol-version guard moves below the header rung, so the HTTP wire is untouched whenever the version header is present, while the guard still backstops the header-absent null-version corner that previously escaped as a ValidationError.
There was a problem hiding this comment.
1 issue found across 6 files (changes from recent commits).
Reply with feedback, questions, or to request a fix.
Fix all with cubic | Re-trigger cubic
A null body protocolVersion slipped the header equality check (None == None), so the absent-header outcome depended on the body value: an int gave HEADER_MISMATCH while null fell through to the string guard's INVALID_PARAMS. The rung now checks presence explicitly, making an absent version header uniformly HEADER_MISMATCH and the string guard reachable only on header-less transports. Also updates the public Server.run docstring to the first-success era-lock wording.
There was a problem hiding this comment.
No new issues found on this pass — the round-2 commits (era_settles monotone lock, explicit version-header presence check, shared modern_error_data boundary, Server.run docstring) resolve the concerns I raised earlier, each with a pinning regression test. That said, this changes the era-lock and rejection semantics of the dual-era negotiation path, so it warrants a human maintainer's review rather than auto-approval.
Extended reasoning...
Overview
This PR hardens the dual-era stream loop introduced in #3038: the era lock now commits only on a client-visible success (monotone, first-success-wins, cancel-aware via era_settles), initialize always takes the handshake path even when envelope-stamped, a bare server/discover on a legacy-locked connection restores the byte-identical METHOD_NOT_FOUND, unmapped modern handler exceptions are sanitized identically across transports via a shared modern_error_data boundary, envelope coercion is centralized in Connection.from_envelope, and the classifier gains an explicit non-string protocol-version guard placed after a now presence-checking header rung. Touched files span runner.py, inbound.py, connection.py, the modern HTTP entry, docs, and substantial new tests.
Security risks
No auth, crypto, or permission code is touched. The changes affect protocol negotiation and error-mapping semantics; the main risk class is compatibility/availability (a stranded stdio subprocess or a wrong wire error code for buggy clients) rather than data exposure. The exception-boundary change actually improves information hygiene by ensuring handler internals never reach the wire on the modern stream path.
Level of scrutiny
High. This is core connection-negotiation logic with subtle concurrency (pipelined requests, cancellation races) and wire-compatibility guarantees that released auto-negotiating clients depend on. Even with no findings on this pass, design-level decisions here (lock-on-success semantics, error-code choices, the classifier rung ordering) deserve maintainer sign-off.
Other factors
The prior review findings — the era-overwrite race, the missing version-header presence check, and the stale Server.run docstring — are all addressed in the follow-up commits, each with a regression test that reproduces the original scenario. The suite maintains 100% coverage and the PR description reports live stdio-subprocess validation of the fallback and race scenarios. Existing reviewer threads (cubic and my earlier comments) are marked resolved.
…ack-channel A server/discover probe that outlives the client's 10s timeout can still succeed on a slow-starting server, locking the connection modern before the pipelined fallback initialize arrives; that initialize then fails with -32022, which negotiate_auto treated as terminal - the default mode='auto' could not connect at all. The -32022 is itself positive modern evidence (it names the server's versions), so the policy now re-probes once at a mutual version, bounded by the existing two-attempt loop. modern_on_request now wraps its dispatch context in the server-requests denial like the other modern entries: a handler's request-scoped elicitation no longer delivers the protocol-forbidden frame to the in-process client (which answered a misleading "Method not found") and instead refuses server-side with the no-back-channel contract.
There was a problem hiding this comment.
1 issue found across 5 files (changes from recent commits).
Prompt for AI agents (unresolved issues)
Check if these issues are valid — if so, understand the root cause of each and fix them. If appropriate, use sub-agents to investigate and fix each issue separately.
<file name="src/mcp/client/_probe.py">
<violation number="1" location="src/mcp/client/_probe.py:85">
P2: The handshake recovery rejects supported-only `-32022` data, even though this codebase emits that shape for initialize-after-modern-lock errors. Parse `supported` from the handshake error without requiring `requested`, otherwise auto mode can still fail instead of re-probing.</violation>
</file>
Tip: Review your code locally with the cubic CLI to iterate faster.
Fix all with cubic | Re-trigger cubic
| # the connection modern before this initialize arrived. Re-probe | ||
| # once at a version the server names; the era is already | ||
| # settled, so the second probe answers without the slow start. | ||
| supported = _parse_supported(handshake_exc.error.data) |
There was a problem hiding this comment.
P2: The handshake recovery rejects supported-only -32022 data, even though this codebase emits that shape for initialize-after-modern-lock errors. Parse supported from the handshake error without requiring requested, otherwise auto mode can still fail instead of re-probing.
Prompt for AI agents
Check if this issue is valid — if so, understand the root cause and fix it. At src/mcp/client/_probe.py, line 85:
<comment>The handshake recovery rejects supported-only `-32022` data, even though this codebase emits that shape for initialize-after-modern-lock errors. Parse `supported` from the handshake error without requiring `requested`, otherwise auto mode can still fail instead of re-probing.</comment>
<file context>
@@ -65,7 +72,22 @@ async def negotiate_auto(session: ClientSession) -> None:
+ # the connection modern before this initialize arrived. Re-probe
+ # once at a version the server names; the era is already
+ # settled, so the second probe answers without the slow start.
+ supported = _parse_supported(handshake_exc.error.data)
+ mutual = [v for v in MODERN_PROTOCOL_VERSIONS if v in (supported or ())]
+ if not mutual:
</file context>
| supported = _parse_supported(handshake_exc.error.data) | |
| data = handshake_exc.error.data | |
| supported = _parse_supported(data) | |
| if supported is None and isinstance(data, dict) and isinstance(data.get("supported"), list): | |
| supported = [v for v in data["supported"] if isinstance(v, str)] |
There was a problem hiding this comment.
The follow-ups to my earlier comments (the era-lock overwrite guard, the header-presence check, and the Server.run docstring) all look addressed in the latest commits, and I found no new issues in this pass — but this PR reworks core era-negotiation and client auto-negotiation semantics across three rounds of changes, so it warrants a human maintainer's review before merge.
Extended reasoning...
Overview
This PR (rounds 1–3) hardens the dual-era stream loop's era-lock semantics in src/mcp/server/runner.py (lock-only-on-client-visible-success via the era_settles predicate, monotone first-success commits, shared modern_error_data exception boundary), moves envelope coercion into Connection.from_envelope, adds a non-string protocol-version guard and explicit header-presence check to the classifier ladder in src/mcp/shared/inbound.py, changes the client-side negotiate_auto policy in src/mcp/client/_probe.py to treat a -32022 fallback-handshake error as modern evidence and re-probe once, and wraps the in-process modern entry in the server-requests denial. Docs and extensive pinning/regression tests accompany each change.
Security risks
No direct security-sensitive surfaces (auth, crypto, permissions) are touched. The changes do affect error-message hygiene positively: unmapped handler exceptions on the modern stream path are now sanitized to a generic INTERNAL_ERROR instead of leaking str(e) to the wire, matching the HTTP entry. The classifier changes alter which error code a malformed client receives but do not weaken validation.
Level of scrutiny
High. This is production-critical protocol-negotiation logic: the era-lock state machine determines whether real clients can connect at all over stdio (where a stranded connection means restarting the subprocess), and the client-side auto-negotiation policy affects every default-mode connection. The changes involve subtle concurrency reasoning (pipelined inline vs. spawned dispatch, cancellation at the response-write checkpoint) and wire-compatibility guarantees (byte-identical legacy fall-through, -32022 semantics). Even with strong test coverage, the design decisions here (what counts as "success", the re-probe policy on -32022) are ones a maintainer should sign off on.
Other factors
The prior review comments — cubic's atomicity finding and my comments on the era-overwrite race, the header-rung placement, and the stale Server.run docstring — are all addressed in the round-2/3 commits, each with a regression test that reproduces the original race deterministically. The bug hunting system found no new issues on the current revision, and the repository requires 100% coverage which the PR claims to maintain. The remaining reason to defer is scope and criticality, not any specific defect.
Rebasing onto main picked up #3040, which hardened the era semantics: a server-initiated request on a 2026-07-28 connection is now refused by the SERVER, as MCPError: Cannot send 'elicitation/create': this transport context has no back-channel for server-initiated requests. instead of reaching the client and being rejected there as "Method not found". Two of this branch's docs tests failed on the rebase -- which is the point of testing every example -- and the troubleshooting page keyed its biggest elicitation entry on the old string. So the two entries merge into the one that owns the surviving string. "Method not found" is now short and generic (an era mismatch: a method one protocol revision has and the other does not), and says explicitly that ctx.elicit() at 2026-07-28 no longer produces it. The "Cannot send 'elicitation/create' ..." entry becomes the single home for "your handler reached back and nothing can carry it", with its two real triggers -- any 2026-07-28 connection, and a legacy connection on a stateless_http=True server -- both shown from tested examples, and the one fix (a resolver). The tests pin the new behaviour, the same way #3040 itself updated tests/docs_src/test_client_callbacks.py.
Post-merge review follow-up to #3038. An adversarial review pass over the dual-era driver found a family of cases where a request that failed could still lock the connection's era — and on stdio a connection is a whole subprocess, so a locked-then-stranded connection has no recovery short of a restart.
The era-lock fixes
The invariant the driver documents is: released auto-negotiating clients fall back on any error code except -32022, so a failed probe must leave the legacy handshake available. Three openers violated it:
clientInfo: 42): classification checks key presence only, so the era locked modern before pydantic rejected the content. The request failed with a fallback-eligible -32602, but the fallback then hit -32022 forever. The lock now commits only after the request succeeds, mirroring the legacy side's existing "lock only on success" rule.initializestamped with the envelope triple: routed modern (the triple won), failed METHOD_NOT_FOUND (initializedoesn't exist at 2026-07-28), and locked the era anyway.initializeis legacy-distinctive by definition, so it now always takes the handshake path regardless of decoration.subscriptions/listenas the opener: the not-served-on-this-transport rejection used to land after the lock; now, like every failed request, it locks nothing.No conforming client — released SDK or otherwise — produces any of these frames; the fixes are robustness against buggy third-party modern clients.
Compatibility restoration
A bare
server/discoveron a legacy-locked connection now falls through to the loop runner's per-version validation, restoring the byte-identical METHOD_NOT_FOUND (-32601, same message and data) a handshake-only server produced — #3038 had it rejecting with -32600. Only envelope-bearing frames get the cross-era INVALID_REQUEST; a conforming legacy client can never send the reserved triple, so legacy traffic is byte-identical again in full.Note for anyone bisecting client behavior: v2.0.0b1 ships the -32600 variant for this one frame shape (it carries #3038 but not this fix), so a legacy-locked bare
server/discoveranswers -32600 on b1 and -32601 before #3038 and from this PR onward.Smaller fixes
classify_inbound_requestrejects a present-but-non-string protocol version with a descriptive INVALID_PARAMS instead of raisingValidationErrorout of its own -32022 payload construction (requestedis astrfield). The dispatcher's exception ladder happened to mask that as a generic -32602 — same code, now deliberate, with a message that names the defect. Unreachable over HTTP (the header rung fires first), so the wire delta is stdio-only.transport.can_send_requestso per-message transport metadata agrees with its denial, matching the modern HTTP entry's context — closing a latent inconsistency before anything consumes the field.NotifyOnlyOutboundinherits the no-channel refusal instead of duplicating it; thehas_standalone_channel/Raisescontracts now document that a modern connection refuses requests on a real channel.subscriptions/listen; a staleserve_loopdocstring claim is corrected. Two pyright suppressions in the stdio test doubles are replaced with typeshed-compatibleBuffersignatures.Validation
Every fix has a pinning test (rejection codes, lock neutrality of each failing opener, the byte-identical legacy fall-through, the classifier guard, the transport mask). Beyond the suite (100% coverage), each formerly-stranding opener was driven against a live stdio subprocess and shown to leave the legacy handshake working on the same pipe, with the full original probe matrix (auto/legacy/pinned negotiation, pipelined probe ordering, -32022 round-trip) passing unchanged.
Round 2 — review follow-ups
Review found that the round-1 lock-on-success rule still approximated "client-visible success" with "the handler returned". Two windows violated the invariant, plus two cross-transport divergences on the modern path:
initializeoverwrote the committed legacy lock — bricking the freshly-handshaked client — and the mirror direction (a legacy commit clobbering a modern lock) was possible too. The era cell is now monotone: oneera_settlessettlement predicate with site-local atomic commits, first success wins, and a straggler from the other era can never move a committed lock. Its response still stands — it was already earned."Request cancelled", but the era had already locked. The settlement predicate now skips the lock when the request's cancel event is set, tying "success" to what the client actually observes. (A cancel racing a genuinely blocked response write can still slip through; that residue belongs to the dispatcher's existing respond-after-cancel divergence and dissolves with it.)str(e)on the wire — while the identical request over modern HTTP produced the generic-32603 Internal server error. The code-0 pin exists for v1 compat, which the modern era doesn't have. Both modern entries now share onemodern_error_databoundary; debugraise_exceptionsstill re-raises.clientInfo/clientCapabilitiesto not-supplied while the stream path rejected them atConnection.from_envelopeconstruction. For spec methods both transports already rejected identically (the kernel's per-version params surface types the reserved_metakeys strictly); custom methods diverged.from_envelopenow takes the classifier's raw values and owns the coercion, so every modern entry — HTTP, stream loop, andmodern_on_request— builds connection identity identically.None == Noneequality and make the absent-header outcome depend on the body value. An absent or disagreeing version header is uniformlyHEADER_MISMATCH, the HTTP wire is untouched wherever it was well-defined, and the string guard is reachable only on header-less transports, making the round-1 "stdio-only delta" claim above exactly true. (Through the shipped HTTP manager the era routing is header-only, so these cells harden the modern entry's own contract.) The publicServer.rundocstring also picked up the first-success wording.Each behavioral fix carries a regression test that failed on the round-1 code (the overwrite and cancelled-away tests reproduce the races deterministically), and the scenarios were re-driven against a live stdio subprocess: the pipelined-race brick, the sanitized internal error, the failed-opener fallback, and the cancel-never-locks path all behave as documented, with auto/legacy negotiation unchanged.
Round 3 — the timeout brick and the third modern entry
Further review surfaced the last member of the brick family, plus a pre-existing gap in the parity story:
mode='auto'(reproduced live): aserver/discoverprobe that outlives the client's 10s timeout can still succeed on a slow-starting server — the response was written, the server cannot know the client stopped waiting, and discover is inline so even a courtesy cancel could not land before the commit. The connection locks modern, the pipelined fallbackinitializegets -32022, andnegotiate_autotreated that as terminal: the defaultmode='auto'could not connect to any stdio server that takes >10s to its first response. No server-side change can close this; the client-side fix is that -32022 from the fallback handshake is itself positive modern evidence (it names the server's versions), so the policy now re-probes once at a mutual version — bounded by the existing two-attempt loop, so a pathologically slow server still terminates with a genuine error. Covered by policy unit tests and a deterministic end-to-end test (a relay holds the probe frame until the post-timeoutinitializeis sent — no sleeps).Client(server)in modern mode delivered the protocol-forbiddenelicitation/createframe to the client, which answered a misleadingMethod not found. The entry's dispatch context is now wrapped in the same server-requests denial, so the call refuses server-side with the no-back-channel contract — which is also the exact wording the client-callbacks docs page already used to describe this path.AI Disclaimer