Skip to content

Conversation

@Andyli1007
Copy link
Contributor

No description provided.

@Andyli1007 Andyli1007 requested a review from yzygitzh April 24, 2024 11:12
@Andyli1007 Andyli1007 requested a review from hippogr May 8, 2024 07:19
{
WARN("NET/IB : Got async event : %s event type: %d deviceid:%d device name:%s", str, event.event_type, d, context->device->name);
}
}
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

shall we keep the original code and print the device name for all the warnings instead of the resilient enabled ones?

NCCL_PARAM(IbQpsPerConn, "IB_QPS_PER_CONNECTION", 1);

std::unordered_set<std::string> disabledIbPeer;
void disableIb(std::string peerAddr)
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

(std::string

const std::string& should be better :-)

}
}

bool getIbDisableStatus(std::string peerAddr)
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

std::string

same as above

if (err != cudaSuccess) {
WARN("%ld is not a valid pointer", rComm->remFifo.elems[slot][id].addr);
continue;
}
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

just curious, if this addr is not a valid pointer, should we break the current loop and throw some errors? Why do we simply log it and continue?

{
return nRet;
}
disableIb(comm->addr);
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

disableIb(comm->addr);

Question here: looks like we will process all of the QPs here and may disable the comm->addr multiple times, why not disable the addr when there is error happening when calling post_recv? Besides, when we will also call ncclIbPostFifo even calling post_recv fails and device is disabled, is this logic as same as the original logic?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants