Fix deadlock in take_event and set_on_new_event_callback#890
Fix deadlock in take_event and set_on_new_event_callback#890thomasmoore-torc wants to merge 2 commits into
Conversation
|
@Mergifyio backport kilted jazzy humble |
☑️ Command disallowed due to command restrictions in the Mergify configuration.Details
|
Signed-off-by: Thomas Moore <thomas.moore@torc.ai>
…ent deadlock Signed-off-by: Thomas Moore <thomas.moore@torc.ai>
cff650c to
5a77dcc
Compare
|
Result for the events test does not show lock order inversion: |
|
@MiguelCompany - that test is insufficient to trigger the AB-BA deadlock. I've created a test in a branch of |
|
@thomasmoore-torc Great! Please open a PR on |
|
@MiguelCompany - I've created the PR at ros2/rmw_implementation#278 |
fujitatomoya
left a comment
There was a problem hiding this comment.
lgtm with reproducible test in rmw test.
|
Pulls: #890, ros2/rmw_implementation#278 |
Description
Using the
EventsExecutortogether with a subscription QoS-status event callback (e.g. amessage_lost_callback) can produce an AB-BA deadlock between Fast DDS andrmw_fastrtps:on_sample_lost), which calls intormw_fastrtpsand blocks taking the subscription's event mutex (on_new_event_m_).EventsExecutorthread holds that event mutex insidetake_event()and calls back into the DataReader (get_sample_lost_status), blocking on the reader mutex.Each thread holds the mutex the other needs, in the opposite order, so they deadlock — wedging the reader mutex and stalling the receive path.
The same lock-ordering applies to every QoS-status event (liveliness, deadline, requested/offered-incompatible-qos, matched, sample/message-lost), in both
take_event()andset_on_new_event_callback(), on the subscription and publication event paths. This PR breaks the chain uniformly:on_new_event_m_is no longer held across calls into the DataReader/DataWriter status getters — each status is copied into a local with the lock released, then the lock is re-acquired and the cached value adopted only if a listener update did not arrive in the meantime.get_inconsistent_topic_status()is intentionally left under the lock, since it operates on the Topic and does not take the reader/writer mutex.Is this user-facing behavior change?
No public API or contract change. The only observable difference is that the AB-BA deadlock described above no longer occurs; event-status delivery semantics are preserved (the lock is released only around the DDS status query, and a concurrent listener update is never discarded).
Did you use Generative AI?
Yes. The root-cause analysis and the code changes were developed with Claude Code (Claude Opus 4.8); the modifications to
take_event()andset_on_new_event_callback()incustom_subscriber_info.cppandcustom_publisher_info.cppwere authored with its assistance and reviewed and validated by myself and my coleagues.Additional Information
This was validated by exercising the affected path — an
EventsExecutorwith a message-lost event callback under sustained sample loss — which reliably reproduced the deadlock before this change and showed no recurrence afterward.Note for reviewers: the diff is broader than the message-lost path that was observed to deadlock. The same fix is applied to all status getters and to both the subscription and publication event classes, since they share the identical lock-ordering pattern.