Skip to content

UCP/WIREUP: Implement support for SHM/multi-lane in case of CM#5696

Merged
yosefe merged 1 commit intoopenucx:masterfrom
dmitrygx:topic/ucp/cm_mlane
Oct 15, 2020
Merged

UCP/WIREUP: Implement support for SHM/multi-lane in case of CM#5696
yosefe merged 1 commit intoopenucx:masterfrom
dmitrygx:topic/ucp/cm_mlane

Conversation

@dmitrygx
Copy link
Member

What

Implement support for SHM/multi-lane in case of CM

Why ?

If the connection was established thru CM (i.e. RDMACM or TCPCM), UCP EP config has lanes configured on the same device where CM is running on.
It means that SHM/CUDA/multi-lane couldn't be used in the current implementation.
This PR addresses this.

How ?

After CM flow, UCP EPs start WIREUP MSG protocol to exchange by all addresses and try to select lanes with the best available TLs.
WIREUP MSGs are sent through WIREUP lane (UCT EP that was created on the same device as CM). If after reconfiguration UCT EP for WIREUP lane isn't selected, change WIREUP lane to CM lane and hide UCT EP from WIREUP lane to the CM_WIREUP_EP/AUX_EP and use it for WIREUP MSGs.

@dmitrygx dmitrygx force-pushed the topic/ucp/cm_mlane branch 13 times, most recently from f5b69ba to 9831a93 Compare September 19, 2020 18:49
@dmitrygx
Copy link
Member Author

/azp run

@azure-pipelines
Copy link

Azure Pipelines successfully started running 1 pipeline(s).

@dmitrygx
Copy link
Member Author

@brminich @alinask @evgeny-leksikov could you review pls?

sizeof(ucp_address_packed_iface_attr_t);
(sizeof(ucp_address_packed_iface_attr_t) +
((flags & UCP_ADDRESS_PACK_FLAG_TL_RSC_IDX) ?
sizeof(ucp_rsc_index_t) : 0));
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

uint8_t since it's going to be on wire

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

done

Comment on lines +434 to +435
*(ucp_rsc_index_t*)ptr = rsc_index;
packed_len += sizeof(rsc_index);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

uint8_t for both lines and everywhere below

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

done


UCP_ADDRESS_PACK_FLAG_LAST,

UCP_ADDRESS_PACK_FLAG_TL_RSC_IDX = UCS_BIT(5), /* Pack TL resource index */
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

why it goes after UCP_ADDRESS_PACK_FLAG_LAST?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

we don't want to pack resource index always to not increase the length of the worker address
we want to pack it only during exchanging by WIREUP MSGs

}

static ucp_lane_index_t
ucp_wireup_ep_config_find_reused_lane(ucp_ep_h ep,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

the same, ep is not needed here

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

it gets config from the ep

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

i mean can pass key instead of EP

key.dst_md_cmpts = ucs_alloca(sizeof(*key.dst_md_cmpts) * UCP_MAX_MDS);
ucp_wireup_get_reachable_mds(ep, remote_address, &key);

if (ep->cfg_index != UCP_WORKER_CFG_INDEX_NULL) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

it casts long func to long long func :) need to break up
BTW, can it be united with check at line 1245?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

it couldn't be united with the check below, since we need to generate the config key index first
wrapped new code into the function

[reuse_lane].rsc_index;
/* reset the UCT EP from the previous WIREUP lane to not
* destroy it, since it's not needed anymore in the new
* configuratiob, but will be used for WIREUP MSG */
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

configuration

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

fixed

}

static inline int
ucp_ep_config_key_get_cm_lane(const ucp_ep_config_key_t *config_key)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

is it really needed? config_key->cm_lane looks shorter..

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

removed

unsigned pack_flags,
int enable_atomics)
{
int packed_len = 0;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

minor: no need to initialize

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

removed

packed_len = sizeof(*packed);

if (pack_flags & UCP_ADDRESS_PACK_FLAG_TL_RSC_IDX) {
ptr = UCS_PTR_BYTE_OFFSET(ptr, packed_len);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

minor: could do packed + 1

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

done

[select_info->addr_index].md_index;
return ucp_wireup_add_lane_desc(select_info, dst_md_index, lane_type,
is_proxy, select_ctx);
dst_md_index = addr_entry->md_index;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

minor: can get rid of it as well and pass addr_entry->md_index to ucp_wireup_add_lane_desc directly

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

done

/* mark the endpoint as connected to remote */
if (!ucp_ep_config(ep)->p2p_lanes) {
/* don't mark as connected to remote now in case of CM, since it destroys
* CM wireup EP (if it is hidden in the CM lane) thas is used for sending
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

that

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

fixed

/* set the correct WIREUP MSG lane in case of CM */
if ((key.wireup_lane != UCP_NULL_LANE) &&
(UCP_NULL_LANE !=
(reuse_lane = ucp_wireup_ep_config_find_reused_lane(
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

i'd define reuse_lane at first for readability

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

unfortunately, it depends on key.wireup_lane
so, we can't do it

Comment on lines +785 to +786
ucs_diag("ep %p: UCT EP %p for CM lane doesn't exist, it"
"has already been discarded", ucp_ep, uct_cm_ep);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

plz add whitespace between it and has

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

thanks
fixed

@dmitrygx
Copy link
Member Author

@evgeny-leksikov @brminich your comments were fixed, could pls review?

return (/* compare resource indices, if both are specified */
(key1->lanes[lane1].dst_rsc_index != UCP_NULL_RESOURCE) ==
(key2->lanes[lane2].dst_rsc_index != UCP_NULL_RESOURCE)) &&
(key1->lanes[lane1].dst_rsc_index != key2->lanes[lane2].dst_rsc_index);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

why !=?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

fixed
it works since ucp_ep_config_lane_is_same_dst_rsc_index() was called without negation

ucp_lane_index_t lane2)
{
return (/* compare resource indices, if both are specified */
(key1->lanes[lane1].dst_rsc_index != UCP_NULL_RESOURCE) ==
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

why need to check it separately?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

fixed

}

ucp_wireup_remote_connected(ucp_ep);
/* TCP don't have CONNECT_TO_EP support and has internal connection
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

minor: doesn't

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

done

uct_cm_ep);
UCS_ASYNC_UNBLOCK(&worker->async);

if (!discard_uct_ep) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

why !?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

fixed

Comment on lines +1047 to +1053
(proxy_lane1 != UCP_NULL_RESOURCE) &&
(proxy_lane2 != UCP_NULL_RESOURCE) &&
ucp_ep_config_lane_is_same_peer(key1, proxy_lane1,
key2, proxy_lane2)) ||
(/* both lanes are not proxies */
(proxy_lane1 == UCP_NULL_RESOURCE) &&
(proxy_lane2 == UCP_NULL_RESOURCE));
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

maybe move

(/* both lanes are not proxies */
            (proxy_lane1 == UCP_NULL_RESOURCE) &&
            (proxy_lane2 == UCP_NULL_RESOURCE));

prior /* both lanes are proxies for the same peer */
and move

            (proxy_lane1 != UCP_NULL_RESOURCE) &&
            (proxy_lane2 != UCP_NULL_RESOURCE) &&

inside ucp_ep_config_lane_is_same_peer like parameters validation?
then it could look limpler

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

fixed

return UCS_ERR_UNREACHABLE;
}

static ucp_lane_index_t
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

according to name *can_reuse_lane I would expect return true/false, pls rename, also it's not clear that lane is lane index of key1

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

done

Comment on lines +1122 to +1124
ucp_wireup_ep_config_find_intersect_lane(const ucp_ep_config_key_t *old_key,
const ucp_lane_index_t *intersect_map,
ucp_lane_index_t new_lane)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

pls abstract away from usage, no need new/old prefixes, you just do value lookup in array

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

done


ucp_wireup_ep_config_intersect(old_key, new_key, reuse_lane_map);

if (ucp_ep_config_key_has_cm_lane(new_key)) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

is this func called for non-CM wireup as well?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yes

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

but we did not need it before, what's changed?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't see any problems to check config intersection for non-CM wireup
Also, I think we could support EP reconfiguration now - just need to remove the check for reconfiguration

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I have no objections, I'm trying to understand side effects of this. WRT ep reconf - it would be great

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm trying to understand side effects of this

ic, I guess the one side effect is time that is wasted for searching intersections (and trying to re-use lanes)

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ok then since it's local and number of lanes is limited -

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ok then since it's local and number of lanes is limited -

yep

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

if not cm -> skip ucp_wireup_check_config_intersect altogether

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

done

@dmitrygx
Copy link
Member Author

@brminich @evgeny-leksikov can you pls review?

Comment on lines +296 to +298
ucp_wireup_ep_configs_find_reuse_lane(const ucp_ep_config_key_t *key1,
ucp_lane_index_t lane1,
const ucp_ep_config_key_t *key2)
Copy link
Contributor

@evgeny-leksikov evgeny-leksikov Sep 23, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

seems like we need typedef for currently anonymous struct inside key, then it could be like

static ucp_lane_index_t
ucp_ep_config_key_lane_by_attrs(const ucp_ep_config_key_t *key,
                                const ucp_ep_config_lane_attrs_t *attrs)

i guess it could simplify other places

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

do you mean to replace ucp_wireup_ep_configs_find_reuse_lane() by ucp_ep_config_key_lane_by_attrs(), but I don't understand how will it work

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

for example from your code if i don't mess indexes...

        intersect_map[lane1_idx] =
            ucp_ep_config_key_lane_by_attrs(key2, &key1->lanes[lane1_idx]);

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

unfortunately, it's not possible ucp_ep_config_find_match_lane() has to know the lane1 number to report it as a result in case of CM or sockaddr stub lane
and ucp_ep_config_lane_is_same_peer() requires two keys ant lanes

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

actually, with new typedef
ucp_ep_config_lane_is_same_peer could be ucp_ep_config_lane_attrs_peer_is_equal(lane_attrs1, lane_attrs2)
and CM/sockaddr stub lane cases have to be handled separately

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

still it will be impossible, since we have ucp_ep_config_lane_is_proxy_equal() that requires key1/key2 to check proxy lane
I'd remain it as is, and evaluate it then

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

just fixed the names of the functions as you suggested

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

still it will be impossible, since we have ucp_ep_config_lane_is_proxy_equal() that requires key1/key2 to check proxy lane
I'd remain it as is, and evaluate it then

As I see proxy lane is part of "arrt" key2->lanes[lane2].proxy_lane

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yes, but we won't be able to get attributes for proxy_lane if we pass only attributes for the original lane

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

to check proxy attrs it could be separate call with proxy lanes attrs in args

static void
ucp_wireup_ep_config_intersect(const ucp_ep_config_key_t *old_key,
const ucp_ep_config_key_t *new_key,
ucp_wireup_ep_config_intersect(const ucp_ep_config_key_t *key1,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ucp_ep_config_lanes_intersect

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

fixed


static ucp_lane_index_t
ucp_wireup_ep_config_find_intersect_lane(const ucp_ep_config_key_t *old_key,
ucp_wireup_ep_config_find_intersect_lane(const ucp_ep_config_key_t *key,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

maybe ucp_ep_config_key_match_lane(key, lane, lane_indexes)?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

fixed

@brminich
Copy link
Contributor

Is it relevant?

2020-09-24T05:45:05.0389441Z [     INFO ] server listening on fe80::ba59:9f03:52:940c:36132
2020-09-24T05:45:05.0594529Z [swx-rdmz-ucx-new-02:3467 :0:3467]    ucp_ep.inl:152  Assertion `ep->flags & UCP_EP_FLAG_FLUSH_STATE_VALID' failed
2020-09-24T05:45:05.5064438Z 
2020-09-24T05:45:05.5067064Z /scrap/azure/agent-05/AZP_WORKSPACE/2/s/contrib/../src/ucp/core/ucp_ep.inl: [ ucp_ep_flush_state() ]
2020-09-24T05:45:05.5068033Z       ...
2020-09-24T05:45:05.5068855Z       149 static UCS_F_ALWAYS_INLINE ucp_ep_flush_state_t* ucp_ep_flush_state(ucp_ep_h ep)
2020-09-24T05:45:05.5069665Z       150 {
2020-09-24T05:45:05.5070687Z       151     ucs_assert(ep->flags & UCP_EP_FLAG_FLUSH_STATE_VALID);
2020-09-24T05:45:05.5071907Z ==>   152     ucs_assert(!(ep->flags & UCP_EP_FLAG_ON_MATCH_CTX));
2020-09-24T05:45:05.5073325Z       153     ucs_assert(!(ep->flags & UCP_EP_FLAG_LISTENER));
2020-09-24T05:45:05.5074522Z       154     ucs_assert(!(ep->flags & UCP_EP_FLAG_CLOSE_REQ_VALID));
2020-09-24T05:45:05.5075795Z       155     return &ucp_ep_ext_gen(ep)->flush_state;
2020-09-24T05:45:05.5076536Z 
2020-09-24T05:45:05.9371661Z ==== backtrace (tid:   3467) ====
2020-09-24T05:45:05.9374317Z  0 0x000000000005c038 ucs_debug_print_backtrace()  /scrap/azure/agent-05/AZP_WORKSPACE/2/s/contrib/../src/ucs/debug/debug.c:656
2020-09-24T05:45:05.9375886Z  1 0x000000000004e956 ucp_ep_flush_state()  /scrap/azure/agent-05/AZP_WORKSPACE/2/s/contrib/../src/ucp/core/ucp_ep.inl:152
2020-09-24T05:45:05.9377065Z  2 0x000000000004f3e8 ucp_ep_flush_internal()  /scrap/azure/agent-05/AZP_WORKSPACE/2/s/contrib/../src/ucp/rma/flush.c:335
2020-09-24T05:45:05.9378190Z  3 0x000000000004f3e8 ucp_ep_flush_internal()  /scrap/azure/agent-05/AZP_WORKSPACE/2/s/contrib/../src/ucp/rma/flush.c:337
2020-09-24T05:45:05.9379836Z  4 0x00000000000260f9 ucp_ep_close_nbx()  /scrap/azure/agent-05/AZP_WORKSPACE/2/s/contrib/../src/ucp/core/ucp_ep.c:984
2020-09-24T05:45:05.9380952Z  5 0x0000000000026306 ucp_ep_close_nb()  /scrap/azure/agent-05/AZP_WORKSPACE/2/s/contrib/../src/ucp/core/ucp_ep.c:963
2020-09-24T05:45:05.9382206Z  6 0x0000000000896602 ucp_test_base::entity::disconnect_nb()  /scrap/azure/agent-05/AZP_WORKSPACE/2/s/contrib/../test/gtest/ucp/ucp_test.cc:602
2020-09-24T05:45:05.9383719Z  7 0x000000000089983e ucp_test_base::entity::close_all_eps()  /scrap/azure/agent-05/AZP_WORKSPACE/2/s/contrib/../test/gtest/ucp/ucp_test.cc:633
2020-09-24T05:45:05.9384991Z  8 0x000000000089983e ucp_test_base::entity::get_num_eps()  /scrap/azure/agent-05/AZP_WORKSPACE/2/s/contrib/../test/gtest/ucp/ucp_test.cc:772
2020-09-24T05:45:05.9386242Z  9 0x000000000089983e ucp_test_base::entity::close_all_eps()  /scrap/azure/agent-05/AZP_WORKSPACE/2/s/contrib/../test/gtest/ucp/ucp_test.cc:632
2020-09-24T05:45:05.9387459Z 10 0x0000000000899edd ucp_test::disconnect()  /scrap/azure/agent-05/AZP_WORKSPACE/2/s/contrib/../test/gtest/ucp/ucp_test.cc:189
2020-09-24T05:45:05.9388659Z 11 0x0000000000899edd ucp_test_base::entity::get_num_workers()  /scrap/azure/agent-05/AZP_WORKSPACE/2/s/contrib/../test/gtest/ucp/ucp_test.cc:768
2020-09-24T05:45:05.9389874Z 12 0x0000000000899edd ucp_test::disconnect()  /scrap/azure/agent-05/AZP_WORKSPACE/2/s/contrib/../test/gtest/ucp/ucp_test.cc:179
2020-09-24T05:45:05.9391044Z 13 0x0000000000899f38 ucp_test::cleanup()  /scrap/azure/agent-05/AZP_WORKSPACE/2/s/contrib/../test/gtest/ucp/ucp_test.cc:63
2020-09-24T05:45:05.9392296Z 14 0x00000000005baa10 ucs::test_base::TearDownProxy()  /scrap/azure/agent-05/AZP_WORKSPACE/2/s/contrib/../test/gtest/common/test.cc:282
2020-09-24T05:45:05.9393618Z 15 0x000000000059ea09 HandleSehExceptionsInMethodIfSupported<testing::Test, void>()  /scrap/azure/agent-05/AZP_WORKSPACE/2/s/contrib/../test/gtest/common/gtest-all.cc:3562
2020-09-24T05:45:05.9394941Z 16 0x0000000000592c1c testing::Test::Run()  /scrap/azure/agent-05/AZP_WORKSPACE/2/s/contrib/../test/gtest/common/gtest-all.cc:3643
2020-09-24T05:45:05.9396160Z 17 0x0000000000592d25 testing::TestInfo::Run()  /scrap/azure/agent-05/AZP_WORKSPACE/2/s/contrib/../test/gtest/common/gtest-all.cc:3812
2020-09-24T05:45:05.9397380Z 18 0x0000000000592e8f testing::TestCase::Run()  /scrap/azure/agent-05/AZP_WORKSPACE/2/s/contrib/../test/gtest/common/gtest-all.cc:3930
2020-09-24T05:45:05.9398653Z 19 0x000000000059763d testing::internal::UnitTestImpl::RunAllTests()  /scrap/azure/agent-05/AZP_WORKSPACE/2/s/contrib/../test/gtest/common/gtest-all.cc:5808
2020-09-24T05:45:05.9399952Z 20 0x0000000000597920 testing::internal::UnitTestImpl::RunAllTests()  /scrap/azure/agent-05/AZP_WORKSPACE/2/s/contrib/../test/gtest/common/gtest-all.cc:5725
2020-09-24T05:45:05.9401150Z 21 0x0000000000530be8 RUN_ALL_TESTS()  /scrap/azure/agent-05/AZP_WORKSPACE/2/s/contrib/../test/gtest/common/gtest.h:20059
2020-09-24T05:45:05.9402413Z 22 0x0000000000530be8 main()  /scrap/azure/agent-05/AZP_WORKSPACE/2/s/contrib/../test/gtest/common/main.cc:102
2020-09-24T05:45:05.9403159Z 23 0x00000000000223d5 __libc_start_main()  ???:0
2020-09-24T05:45:05.9403731Z 24 0x000000000057cfbf _start()  ???:0
2020-09-24T05:45:05.9404200Z =================================
2020-09-24T05:45:05.9404915Z [swx-rdmz-ucx-new-02:3467 :0:3467] Process frozen...
2020-09-24T08:09:47.2541611Z make: *** [test] Terminated
2020-09-24T08:09:47.2578679Z 

@dmitrygx
Copy link
Member Author

Is it relevant?

@brminich, no
it is #5506

(/* both lanes are proxies for themselves */
(proxy_lane1 == lane1) && (proxy_lane2 == lane2)) ||
/* both lanes are proxies for the same peer */
ucp_ep_config_lane_is_peer_equal(key1, proxy_lane1,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

why does ucp_ep_config_lane_is_peer_equal calls ucp_ep_config_lane_is_proxy_equal?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

in order to check that lanes are proxies for the same lanes

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

waiting for the #5688 being merged

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

done

Comment on lines +1179 to +1146
/* Get all reachable MDs from full remote address list and join with
* current ep configuration
*/
key.dst_md_cmpts = ucs_alloca(sizeof(*key.dst_md_cmpts) * UCP_MAX_MDS);
ucp_wireup_get_reachable_mds(ep, remote_address, &key);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

duplication of the below

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

fixed

Comment on lines +797 to +798
/* UCT EP has already been allocated for this EP (it could be
* done only during CM connection establishment phase) */
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

we should not call ucp_wireup_connect_lane_to_iface for reused ep's in the first place
so maybe return lane_map to ucp_wireup_init_lanes() function, from ucp_wireup_check_config_intersect

}
}

if (!ucp_ep_has_cm_lane(ep) &&
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

but we exit this function immediately if it's not CM..


/* check that all TL resources in the TL bitmap have the same dev_index */
#if ENABLE_ASSERT
#ifdef ENABLE_ASSERT
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can remove ifdef?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

unhandled

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I've changed to #if and then removed, so it is now initial code
done


ucp_wireup_ep_config_intersect(old_key, new_key, reuse_lane_map);

if (ucp_ep_config_key_has_cm_lane(new_key)) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

if not cm -> skip ucp_wireup_check_config_intersect altogether

Comment on lines +1131 to +1140
ucp_wireup_check_config_intersect(ep, &key, reuse_lane_map,
&replay_pending_queue);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can we just return a bitmap of which lanes need to connect (e.g which lanes are new)?
calculate the map inside ucp_wireup_check_config_intersect as part of lane reuse logic

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

done


/* check that all TL resources in the TL bitmap have the same dev_index */
#if ENABLE_ASSERT
#ifdef ENABLE_ASSERT
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

unhandled

uint64_t tl_bitmap = local_tl_bitmap &
worker->context->tl_bitmap;
ucp_rsc_index_t cm_idx = UCP_NULL_RESOURCE;
ucp_lane_index_t connect_lane_bitmap = UINT8_MAX;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this should be initialized in ucp_wireup_check_config_intersect to ucs_mask(num_lanes)

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

done

ucp_lane_index_t lane, unsigned path_index,
ucp_rsc_index_t rsc_index,
ucp_worker_iface_t *wiface,
ucp_rsc_index_t rsc_index,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

let's restore arg order?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

done

Copy link
Contributor

@yosefe yosefe left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

pls squash

@dmitrygx
Copy link
Member Author

pls squash

done

@yosefe
Copy link
Contributor

yosefe commented Oct 12, 2020

@dmitrygx seems there is a build issue: need to change ucs_mask to UCS_MASK

@dmitrygx
Copy link
Member Author

@dmitrygx seems there is a build issue: need to change ucs_mask to UCS_MASK

fixed

@dmitrygx
Copy link
Member Author

/azp run

@azure-pipelines
Copy link

Azure Pipelines successfully started running 1 pipeline(s).

@yosefe
Copy link
Contributor

yosefe commented Oct 13, 2020

issue seems relevant from discard uct ep:

 5 0x0000000000082cea _int_malloc()  malloc.c:0
 6 0x0000000000085b1c __GI___libc_malloc()  :0
 7 0x0000000000061ca8 ucs_malloc()  /scrap/azure/agent-07/AZP_WORKSPACE/2/s/contrib/../src/ucs/debug/memtrack.c:185
 8 0x00000000000548db ucs_mpool_hugetlb_malloc()  /scrap/azure/agent-07/AZP_WORKSPACE/2/s/contrib/../src/ucs/datastruct/mpool.c:317
 9 0x0000000000054351 ucs_mpool_grow()  /scrap/azure/agent-07/AZP_WORKSPACE/2/s/contrib/../src/ucs/datastruct/mpool.c:189
10 0x00000000000545c8 ucs_mpool_get_grow()  /scrap/azure/agent-07/AZP_WORKSPACE/2/s/contrib/../src/ucs/datastruct/mpool.c:237
11 0x000000000003c694 ucs_mpool_get_inline()  /scrap/azure/agent-07/AZP_WORKSPACE/2/s/contrib/../src/ucs/datastruct/mpool.inl:23
12 0x000000000003c694 ucp_worker_discard_tl_uct_ep()  /scrap/azure/agent-07/AZP_WORKSPACE/2/s/contrib/../src/ucp/core/ucp_worker.c:2636
13 0x000000000003c694 ucp_worker_discard_uct_ep()  /scrap/azure/agent-07/AZP_WORKSPACE/2/s/contrib/../src/ucp/core/ucp_worker.c:2746
14 0x000000000003c7c2 ucp_worker_iface_err_handle_progress()  /scrap/azure/agent-07/AZP_WORKSPACE/2/s/contrib/../src/ucp/core/ucp_worker.c:486
15 0x0000000000050e6b ucs_callbackq_slow_proxy()  /scrap/azure/agent-07/AZP_WORKSPACE/2/s/contrib/../src/ucs/datastruct/callbackq.c:400
16 0x0000000000039f2a ucs_callbackq_dispatch()  /scrap/azure/agent-07/AZP_WORKSPACE/2/s/contrib/../src/ucs/datastruct/callbackq.h:211
17 0x0000000000039f2a uct_worker_progress()  /scrap/azure/agent-07/AZP_WORKSPACE/2/s/contrib/../src/uct/api/uct.h:2405
18 0x0000000000039f2a ucp_worker_progress()  /scrap/azure/agent-07/AZP_WORKSPACE/2/s/contrib/../src/ucp/core/ucp_worker.c:2285
19 0x000000000086a56b ucp_test_base::entity::progress()  /scrap/azure/agent-07/AZP_WORKSPACE/2/s/contrib/../test/gtest/ucp/ucp_test.cc:764
20 0x000000000086a7a7 ucp_test::progress()  /scrap/azure/agent-07/AZP_WORKSPACE/2/s/contrib/../test/gtest/ucp/ucp_test.cc:147
21 0x000000000086a7a7 ucp_test::short_progress_loop()  /scrap/azure/agent-07/AZP_WORKSPACE/2/s/contrib/../test/gtest/ucp/ucp_test.cc:155
22 0x000000000083ffda ucp_test::wait_for_flag<unsigned long const>()  /scrap/azure/agent-07/AZP_WORKSPACE/2/s/contrib/../test/gtest/ucp/ucp_test.h:240

@dmitrygx
Copy link
Member Author

dmitrygx commented Oct 14, 2020

@yosefe seems that discard just detected corruption inside malloc() for UCP request
I tried to reproduce the issue (all/test_ucp_sockaddr.err_handle/4 test on swx-rdmz-ucx-althca), but no luck

@dmitrygx
Copy link
Member Author

/azp run

@azure-pipelines
Copy link

Azure Pipelines successfully started running 1 pipeline(s).

@dmitrygx
Copy link
Member Author

dmitrygx commented Oct 14, 2020

@yosefe do we need restart one more time?
I tried reproduce the issue with valgrind, it did > 60k iterations, currently no luck to reproduce

@yosefe
Copy link
Contributor

yosefe commented Oct 14, 2020

/azp run

@azure-pipelines
Copy link

Azure Pipelines successfully started running 1 pipeline(s).

1 similar comment
@azure-pipelines
Copy link

Azure Pipelines successfully started running 1 pipeline(s).

@dmitrygx
Copy link
Member Author

/azp run

@azure-pipelines
Copy link

Azure Pipelines successfully started running 1 pipeline(s).

@dmitrygx
Copy link
Member Author

locally test did >300k repetitions with Valgrind, and the problem wasn't reproduced

@changchengx
Copy link
Contributor

Ported to yosefe#223

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

5 participants