Skip to content

Fix master CLI connection slot leak on client disconnect#15

Open
alexanderstephan wants to merge 3 commits into
masterfrom
fix/cli-master-socket-leak
Open

Fix master CLI connection slot leak on client disconnect#15
alexanderstephan wants to merge 3 commits into
masterfrom
fix/cli-master-socket-leak

Conversation

@alexanderstephan

Copy link
Copy Markdown
Collaborator

No description provided.

…nect

In master-worker mode the master CLI proxy (mworker_proxy) has a
hardcoded maxconn of 10. When a client connects to the master CLI
socket and issues a command that gets forwarded to an unresponsive
worker (e.g. one that is stuck or very slow), the connection hangs
waiting for the worker's response. If the client then disconnects
(timeout, Ctrl-C, etc.), the connection slot is never released because
the client-side FIN is never propagated to tear down the backend.

After 10 such leaked slots the master CLI socket becomes completely
unreachable, returning "Resource temporarily unavailable" to any new
connection attempt.

The fix has three parts:

1) Remove sc_schedule_shutdown(s->scb) after command forwarding.
   The worker doesn't need the TCP FIN to know when a command ends -
   CLI commands are newline-delimited. Without this shutdown, the
   backend never enters half-close during normal command processing,
   so timeout server-fin is never implicitly armed by process_stream().

2) In the AN_RES_WAIT_CLI early-return path, call sc_set_hcto(s->scb)
   only when the client has disconnected (SC_FL_EOS on scf) and there
   is no more pending request data. This arms the 1s server-fin timer
   exclusively in the stuck-worker scenario.

3) Call channel_dont_close(req) to prevent process_stream() from
   auto-forwarding the client's FIN to the backend via CF_AUTO_CLOSE.

A 1s timeout server-fin is configured on mworker_proxy. It is only
armed after the client disconnects cleanly, so it never fires during
normal command processing. It ensures a stuck backend releases its
connection slot promptly once the client is gone.

Locally-handled commands (master applet) are unaffected because they
complete synchronously before the client has reason to disconnect.
The scb->ioto reset (TICK_ETERNITY) at end-of-transaction in
pcli_wait_for_response() prevents any timer leakage between commands.

This fixes GH issue haproxy#3351.
This should be backported to all stable branches.
@alexanderstephan alexanderstephan force-pushed the fix/cli-master-socket-leak branch from 9081d8a to f218e22 Compare May 8, 2026 09:16
@alexanderstephan alexanderstephan self-assigned this May 15, 2026
Bash-based risk validation harness used to verify the master CLI slot
leak fix (haproxy#3351) does not regress nearby scenarios.  Tests three areas:

  #1 pipelined commands (incl. hard mid-stream disconnect with no FIN
     and the actual leak trigger: frozen worker + 10 SIGKILL'd clients)
  #2 reload during in-flight CLI traffic
  #3 long-running command cut-off

Designed to run against any haproxy binary; pass/fail counts make
baseline-vs-patched comparison straightforward.  Used to validate both
the upstream fix (f218e22 on 3.4) and a 3.0.19 backport.

Not invoked by "make reg-tests" — invoke manually:

    ./reg-tests/mcli/run_risk_tests.sh /path/to/haproxy [label]
…dge case

Diagnostic-only artifacts (a bash script and a .vtc) for the concern
that timeout server-fin might fail to fire when a worker is frozen at
the OS level before any byte flows back through the master->worker
sockpair (lra would still be TICK_ETERNITY, so tick_add(lra, ioto)
also stays TICK_ETERNITY).

Empirical result: the concern does NOT manifest.  lra is updated when
the backend stconn enters EST during sockpair setup — independently
of worker reads — so SIGSTOP'ing the worker beforehand does not
prevent the timer from firing.  11th connection succeeds in ~0.5s
even with the worker still SIGSTOPped.

These files are NOT recommended for upstream submission: the existing
mcli_master_socket_leak.vtc explicitly avoids SIGSTOP/SIGCONT for CI
reliability reasons.  They are kept here as evidence of the
investigation behind the fix.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant