Add resilient ib plugin #44
base: main
{
WARN("NET/IB : Got async event : %s event type: %d deviceid:%d device name:%s", str, event.event_type, d, context->device->name);
}
}
Shall we keep the original code and print the device name for all warnings, not only for the ones emitted when resilient mode is enabled?
NCCL_PARAM(IbQpsPerConn, "IB_QPS_PER_CONNECTION", 1);

std::unordered_set<std::string> disabledIbPeer;
void disableIb(std::string peerAddr)
}
}

bool getIbDisableStatus(std::string peerAddr)
if (err != cudaSuccess) {
WARN("%ld is not a valid pointer", rComm->remFifo.elems[slot][id].addr);
continue;
}
Just curious: if this addr is not a valid pointer, should we break out of the current loop and return an error? Why do we simply log it and continue?
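To make the two options concrete, here is a minimal sketch of "log and continue" versus "stop at the first invalid entry". It is not the plugin's code: `InvalidPolicy`, `ValidateResult`, `validateAddrs` and the `isValid` callback are hypothetical names, and the callback stands in for whatever pointer check the PR performs.

```cpp
#include <cstdint>
#include <functional>
#include <vector>

// Hypothetical policy switch: what to do when an address fails validation.
enum class InvalidPolicy { SkipEntry, FailFast };

// Hypothetical summary of one pass over the FIFO entries.
struct ValidateResult {
  int accepted = 0;  // entries that passed validation
  int skipped = 0;   // invalid entries that were logged and skipped
  bool ok = true;    // false if the pass was aborted on an invalid entry
};

// Walks the addresses once. isValid is an assumed stand-in for the real
// pointer check; it is supplied by the caller so the sketch needs no CUDA.
ValidateResult validateAddrs(const std::vector<std::uint64_t>& addrs,
                             const std::function<bool(std::uint64_t)>& isValid,
                             InvalidPolicy policy) {
  ValidateResult result;
  for (std::uint64_t addr : addrs) {
    if (isValid(addr)) {
      ++result.accepted;
      continue;
    }
    if (policy == InvalidPolicy::FailFast) {
      result.ok = false;  // surface the error to the caller and stop
      break;
    }
    ++result.skipped;  // current behavior in the diff: log and move on
  }
  return result;
}
```

With `SkipEntry` the caller never learns that an entry was dropped unless it inspects `skipped`; with `FailFast` the remaining entries are not processed. Which one is right depends on whether a bad address can be safely ignored by the peer.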
{
return nRet;
}
disableIb(comm->addr);
Question: it looks like we process all of the QPs here and may disable comm->addr multiple times. Why not disable the addr at the point where the call to post_recv fails? Also, we still call ncclIbPostFifo even when post_recv fails and the device is disabled. Is this the same as the original logic?
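A rough sketch of what disabling at the point of failure could look like. The names here (`disabledPeers`, `disablePeer`, `isPeerDisabled`, `postRecvOnAllQps`) are hypothetical stand-ins for the PR's `disabledIbPeer`, `disableIb` and `getIbDisableStatus`, the mutex is an assumption about how the set might be shared between threads, and the vector of return codes only simulates per-QP `post_recv` results.

```cpp
#include <mutex>
#include <string>
#include <unordered_set>
#include <vector>

// Hypothetical global registry of peers whose IB path has been disabled.
static std::unordered_set<std::string> disabledPeers;
static std::mutex disabledPeersMutex;  // assumption: the set is shared across threads

// Marks a peer as disabled. Returns true only the first time, so repeated
// calls for the same address are harmless no-ops.
bool disablePeer(const std::string& peerAddr) {
  std::lock_guard<std::mutex> lock(disabledPeersMutex);
  return disabledPeers.insert(peerAddr).second;
}

bool isPeerDisabled(const std::string& peerAddr) {
  std::lock_guard<std::mutex> lock(disabledPeersMutex);
  return disabledPeers.count(peerAddr) != 0;
}

// postResults[i] simulates the return code of post_recv on QP i (0 = success).
// The peer is disabled at the first failure and the loop stops there, so no
// further work is issued for a peer that is already disabled.
// Returns the number of QPs that were attempted.
int postRecvOnAllQps(const std::string& peerAddr,
                     const std::vector<int>& postResults) {
  int attempted = 0;
  for (int rc : postResults) {
    ++attempted;
    if (rc != 0) {
      disablePeer(peerAddr);
      break;
    }
  }
  return attempted;
}
```

In this shape the caller can check `isPeerDisabled(peerAddr)` after the loop and decide whether the follow-up step (the `ncclIbPostFifo` call in the diff) should still run.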
No description provided.