UCP/EP: fix discarding from pending on failed lane #6933
yosefe merged 2 commits into openucx:master
844990b to 4e65d55
if (ucp_worker_is_uct_ep_discarding(worker, uct_ep)) {
    ucs_debug("UCT EP %p is being discarded on UCP Worker %p",
              uct_ep, worker);
    uct_ep_pending_purge(uct_ep, ucp_ep_err_pending_purge,
Why is this needed?
The UCT EP should be purged as part of the discarding procedure.
The discard itself can be on pending.
discard itself can be on pending
But why do we need to remove the discard from pending?
It should be removed by the discarding procedure itself.
It does not remove it. ucp_ep_err_pending_purge calls ucp_request_send_state_ff, which posts the flush cancel again; that is the fix.
Please add a comment describing why it's needed.
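The exchange above can be illustrated with a small toy model. This is only a sketch: every type and function name below (`toy_lane_t`, `toy_pending_purge`, `toy_purge_cb`, and so on) is hypothetical and is not part of UCX or UCT. It assumes that a flush-cancel request which could not be posted is parked on the lane's pending queue, and that purging the queue hands each request to a callback that re-posts the flush-cancel and, as a further assumption made for illustration, completes any other request with an error.

```c
#include <assert.h>
#include <stddef.h>

/* Toy model only: none of these types or functions are UCX/UCT APIs. */
typedef enum { REQ_SEND, REQ_FLUSH_CANCEL } toy_req_type_t;

typedef struct toy_req {
    toy_req_type_t  type;
    struct toy_req *next;
} toy_req_t;

typedef struct {
    toy_req_t *pending_head;  /* requests that could not be posted yet */
    toy_req_t *pending_tail;
    int        flush_posted;  /* flush-cancel operations posted to the lane */
    int        sends_aborted; /* regular sends completed with an error */
} toy_lane_t;

/* Queue a request that could not be posted right away. */
static void toy_pending_add(toy_lane_t *lane, toy_req_t *req)
{
    req->next = NULL;
    if (lane->pending_tail != NULL) {
        lane->pending_tail->next = req;
    } else {
        lane->pending_head = req;
    }
    lane->pending_tail = req;
}

/* Purge callback: abort regular sends, but re-post a queued flush-cancel. */
static void toy_purge_cb(toy_lane_t *lane, toy_req_t *req)
{
    if (req->type == REQ_FLUSH_CANCEL) {
        lane->flush_posted++;
    } else {
        lane->sends_aborted++;
    }
}

/* Drain the pending queue, handing every request to the callback. */
static void toy_pending_purge(toy_lane_t *lane)
{
    toy_req_t *req = lane->pending_head;

    lane->pending_head = lane->pending_tail = NULL;
    while (req != NULL) {
        toy_req_t *next = req->next;
        toy_purge_cb(lane, req);
        req = next;
    }
    assert(lane->pending_head == NULL); /* the callback does not re-queue */
}
```

In this model, purging a queue that holds one send and one parked flush-cancel leaves the queue empty, with the send aborted and the flush-cancel posted once, which is the behavior the thread describes for the discarding case.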
src/ucp/core/ucp_worker.c
Outdated
 * UCS_ERR_NO_RESOURCES, so need to purge the queue to resubmit the
 * operation */
I think it's "abort the operation", not "resubmit", right?
In the discard case, ucp_request_send_state_ff has to resubmit the operation to avoid reordering: the flush cancel must complete last, otherwise we can get an error WQE when the lanes are destroyed.
Add to comment:
/* We need to resubmit the FLUSH_CANCEL operation on the same failed lane, in order to make sure all previous outstanding operations are completed before destroying the failed endpoint */
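The ordering argument in the suggested comment can be sketched with another toy model. All names here (`toy_fifo_lane_t`, `toy_post`, `toy_progress`, `toy_outstanding`) are hypothetical and are not UCX or UCT APIs. The sketch assumes, as a simplification, that a lane completes operations strictly in the order they were posted, so a flush posted on the same lane cannot complete before the operations posted ahead of it.

```c
#include <assert.h>
#include <stdbool.h>

/* Toy model only: hypothetical names, not UCX/UCT APIs. */
#define TOY_MAX_OPS 8

typedef struct {
    bool is_flush[TOY_MAX_OPS]; /* operations in posting order */
    int  posted;                /* number of operations posted */
    int  completed;             /* operations complete strictly in order */
} toy_fifo_lane_t;

/* Post an operation; returns false when the lane is full. */
static bool toy_post(toy_fifo_lane_t *lane, bool is_flush)
{
    if (lane->posted == TOY_MAX_OPS) {
        return false;
    }
    lane->is_flush[lane->posted++] = is_flush;
    return true;
}

/* Complete the oldest outstanding operation; true if it was the flush. */
static bool toy_progress(toy_fifo_lane_t *lane)
{
    assert(lane->completed < lane->posted);
    return lane->is_flush[lane->completed++];
}

/* Number of operations still outstanding on the lane. */
static int toy_outstanding(const toy_fifo_lane_t *lane)
{
    return lane->posted - lane->completed;
}
```

If two sends and then a flush are posted, progressing the lane until `toy_progress` reports the flush leaves `toy_outstanding` at zero. Under the in-order assumption, that is the point at which destroying the endpoint would no longer race with earlier operations.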
+ reduce TL timeouts and related refactoring
8ba56db to 8bfb149
What
fix discarding from pending on failed lane
Why ?
bugfix
How ?