Skip to content

Comments

ovn sci adaptions#91

Open
notandy wants to merge 5 commits intostable/2024.1-m3from
ovn-sci-adaptions
Open

ovn sci adaptions#91
notandy wants to merge 5 commits intostable/2024.1-m3from
ovn-sci-adaptions

Conversation

@notandy
Copy link
Collaborator

@notandy notandy commented Jun 17, 2025

This commit removes the hardcoded ovn l3 plugin in the ovn-db-sync-util.
In upstream neutron, the ovn l3 is enabled by defaut but not needed in
our setup. removing the l3 plugin just syncs everyting else, like ports,
QOS, security groups and networks.

Furthermore, the changes introduce some bug fixes in the ovn_client,
that sometimes accessd a None field or created a none entry in the OVN
schema - which is unsupported.

At last, I've skipped some parts of the mechanism driver initialisation
when running inside rpc-server, since rpc-wrokers are not handling OVS
callbacks.
  • [OVN] don't set explicit down status for ports that just got created
  • [OVN] disable OVN l3, improve startup time, fix OVN db-sync
  • [OVN] disable add_gw_port_info_to_logical_router_port maintenance task

if L3 OVN service plugin is not used. It would cause an exception since the l3
router object is not populated with `external_gateways`.
Will be removed anyway with 2025.1 release - Can be skipped for
backporting to 2025.1 or later.

also fix exception logging for check_baremetal_ports_dhcp_options
@notandy notandy requested review from a team as code owners June 17, 2025 21:55
@notandy notandy force-pushed the ovn-sci-adaptions branch 6 times, most recently from b05dd87 to 3ebc418 Compare June 18, 2025 15:17
toanju
toanju previously approved these changes Jun 18, 2025
Copy link
Collaborator

@sebageek sebageek left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The last commit will most likely be hard to backport, as it fixes many things at once. It would be nice if you could split it up a bit more, but if not we can probably do that on the next upgrade ourselves.

Comment on lines +350 to +353
router = router.get(l3.EXTERNAL_GW_INFO, {})
if not router:
return True
return router.get('enable_snat', True)
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks interesting, do you have a stacktrace for this? I'd like to see where this comes from, maybe that's also something we can fix in our custom l3 driver if that's the cause.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

$ neutron-ovn-db-sync-util --config... --ovn-neutron_sync_mode repair:

...
2025-06-20 12:55:17 63836 DEBUG neutron.plugins.ml2.drivers.ovn.mech_driver.ovsdb.ovn_db_sync [req-148219c3-0f54-4025-9a44-c2cbfe38e929 gNone - - - - - -] OVN-NB Sync routers and router ports started @ 2025-06-20 12:55:17.573351 sync_routers_and_rports /Users/d072895/Workspace/2024.1/neutron/neutron/plugins/ml2/drivers/ovn/mech_driver/ovsdb/ovn_db_sync.py:516
2025-06-20 12:55:44 63836 CRITICAL neutron_ovn_db_sync_util [req-148219c3-0f54-4025-9a44-c2cbfe38e929 gNone - - - - - -] Unhandled error: AttributeError: 'NoneType' object has no attribute 'get'
2025-06-20 12:55:44 63836 ERROR neutron_ovn_db_sync_util Traceback (most recent call last):
2025-06-20 12:55:44 63836 ERROR neutron_ovn_db_sync_util   File "/Users/d072895/.virtualenvs/caracal_python3.10.12/bin/neutron-ovn-db-sync-util", line 8, in <module>
2025-06-20 12:55:44 63836 ERROR neutron_ovn_db_sync_util     sys.exit(main())
2025-06-20 12:55:44 63836 ERROR neutron_ovn_db_sync_util   File "/Users/d072895/Workspace/2024.1/neutron/neutron/cmd/ovn/neutron_ovn_db_sync_util.py", line 241, in main
2025-06-20 12:55:44 63836 ERROR neutron_ovn_db_sync_util     synchronizer.do_sync()
2025-06-20 12:55:44 63836 ERROR neutron_ovn_db_sync_util   File "/Users/d072895/Workspace/2024.1/neutron/neutron/plugins/ml2/drivers/ovn/mech_driver/ovsdb/ovn_db_sync.py", line 110, in do_sync
2025-06-20 12:55:44 63836 ERROR neutron_ovn_db_sync_util     self.sync_routers_and_rports(ctx)
2025-06-20 12:55:44 63836 ERROR neutron_ovn_db_sync_util   File "/Users/d072895/Workspace/2024.1/neutron/neutron/plugins/ml2/drivers/ovn/mech_driver/ovsdb/ovn_db_sync.py", line 542, in sync_routers_and_rports
2025-06-20 12:55:44 63836 ERROR neutron_ovn_db_sync_util     if gw_info.router_ip and utils.is_snat_enabled(router):
2025-06-20 12:55:44 63836 ERROR neutron_ovn_db_sync_util   File "/Users/d072895/Workspace/2024.1/neutron/neutron/common/ovn/utils.py", line 350, in is_snat_enabled
2025-06-20 12:55:44 63836 ERROR neutron_ovn_db_sync_util     return router.get(l3.EXTERNAL_GW_INFO, {}).get('enable_snat', True)
2025-06-20 12:55:44 63836 ERROR neutron_ovn_db_sync_util AttributeError: 'NoneType' object has no attribute 'get'
2025-06-20 12:55:44 63836 ERROR neutron_ovn_db_sync_util 

Process finished with exit code 1

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

the router in this stacktrace is b375ffbd-599d-4fff-a2aa-eb75988ba821 in qa-de-1

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Aye, this is a weird case. Apparently the router has no external gw and no router ports associated, but there seems to be one port in the db having the router as device id. I see 13 such routers worldwide. We should investigate that at some point in the future.

Proposed for commit message: This might happen e.g. due to an inconsistency in between the router/routerports and the ports table (where ports have the router as device_id). Until we have figured out where that inconsistency comes from we need to keep this patch.

I wonder what still calls this code. sync_routers_and_rports() should exit early if the l3_plugin is not ovn (is_ovn_l3()). When we deactivate ovn l3 earlie, shouldn't we just exit early here or is the is_ovn_l3() not working?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Proposed for commit message: This might happen e.g. due to an inconsistency in between the router/routerports and the ports table (where ports have the router as device_id). Until we have figured out where that inconsistency comes from we need to keep this patch.

Amended the commit message.

I wonder what still calls this code. sync_routers_and_rports() should exit early if the l3_plugin is not ovn (is_ovn_l3()). When we deactivate ovn l3 earlie, shouldn't we just exit early here or is the is_ovn_l3() not working?

Hrm, It might be so obvious anymore since it's hard to keep track which patch was first. is_snat_enabled is used quite often in the ovs_client itself, like router update or router delete. So I think it's safer to use the patch for now.

Comment on lines 773 to 775
# SCI: If the port does not have options, it is not ovn l3
if not hasattr(lrp, 'options'):
raise periodics.NeverAgain()
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So we are aborting after seeing the first router that is not ovn l3, right? Shouldn't this either be a continue instead of raise (if we ever want to support ovn l3) or could we skip this method altogether? If the best way to figure out that we're not ovn l3 is to check for a specific router port we can also keep it this way for now.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I've changed the check to utils.is_ovn_l3(self._ovn_client._l3_plugin): which makes it clearer that the code expects ovn managed routers instead on relying on missing attributes.

I am very unoptimistic that ovn would be able to handle multiple l3 plugins at the same time. So it seems safer to disable it as long as we have another l3 implementation.

ovn_const.OVN_REV_NUM_EXT_ID_KEY: str(utils.get_revision_number(
port, ovn_const.TYPE_PORTS)),
ovn_const.OVN_PORT_VNIC_TYPE_KEY: port_info.vnic_type,
ovn_const.OVN_PORT_VNIC_TYPE_KEY: port_info.vnic_type or '',
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I just found baremetal and normal as vnic_type in our db - do you have any information / a stack trace when this is None? Or maybe even better: do you have a port with which this has happened?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

010a396e-943d-407b-9127-e1b091b8fd65 in qa-de-1

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This port is one of these "that shouldn't happen but still happens" cases.

When a port only has binding profiles with status INACTIVE it looks like this gets returned as the port not having a vnic_type. This is something that should not happen, but still can happen due to botched migrations. Therefore we need this workaround until we can make sure we always have a binding in status ACTIVE for our ports.

Could you please add something like this (or even this) as a comment into the commit message? Just so we know when we can remove the workaround.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

amended

Comment on lines +2067 to +2070
ovn_const.OVN_NETTYPE_EXT_ID_KEY: network.get(pnet.NETWORK_TYPE,
const.TYPE_VLAN),
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There are a couple of other places where we also access the NETWORK_TYPE of a network. Do we need to fix that there as well or have some other workaround? Or can the other places be ignored?

ralonsohs note suggests that this part here does not work with multiple segments and this seems to hold true when looking at where NETWORK_TYPE is actually set - it doesn't get set, as we're using hierarchical portbinding and multiple segments are present for the network.

Also, could you please fixup that to the previous commit? It will be easier to backport along the line.

Copy link
Collaborator Author

@notandy notandy Jun 20, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, it's a hack. OVN was never tested with hierarchical portbinding, which makes sense since it's usually used as a overlay network.

I am mocking/hardcoding a provider vlan for every network. The vlan information is already correctly synced by ovn, e.g. for cc-demo in bb085:

[qa-de-1:monsoon3][0.500s][~]$ ovn-nbctl list Logical_Switch_Port 1cc2050b-e1eb-4c06-8441-9cef9b9d33f8
_uuid               : 1cc2050b-e1eb-4c06-8441-9cef9b9d33f8
addresses           : [unknown]
dhcpv4_options      : []
dhcpv6_options      : []
dynamic_addresses   : []
enabled             : []
external_ids        : {}
ha_chassis_group    : []
mirror_rules        : []
name                : provnet-5eef7f5b-2d5d-4861-9c18-c11b506cbd7e
options             : {localnet_learn_fdb="false", mcast_flood="false", mcast_flood_reports="true", network_name=bb085}
parent_name         : []
port_security       : []
tag                 : 2099
tag_request         : []
type                : localnet
up                  : []

This way the ovn-controller will create a local gateway connection to the ovs integration bridge for every network instead of using a geneve overlay.

Comment on lines +79 to +83
# SCI: only needed for ovn l3 plugin
# if not self.pf_plugin:
# self.pf_plugin = (
# manager.NeutronManager.load_class_for_provider(
# 'neutron.service_plugins', 'port_forwarding')())
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why do we need to remove this? Would it help if we'd support the port forwarding extension in our asr1k driver?

Copy link
Collaborator Author

@notandy notandy Jun 20, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's a requirement of the OVN l3 plugin, what does it help if you support it in asr1k?
the ovn_db_sync hardcodes the ovn l3 plugin just for the manual sync - and thus expects the port_forwarding plugin for syncing attributes. That alone would be fine with me if it didn't had code that modifies existing router/gateway ports.
So I want to skip the sync of routers/fips completely for now to be sure it won't mess with the db.

We could discuss at a different time to use ovn-l3 instead of the asr1k mech plugin and use the ovsdb as a source to configure the asr1ks, though.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's a requirement of the OVN l3 plugin, what does it help if you support it in asr1k?

My thought was this: You tried run ovn in our qa env, where we have the asr1k driver running with no port forwarding. If the part of the code you're working on is requiring a "port forwarding enabled l3 plugin", then adding port forwarding support (or at least support for the APIs) to the asr1k driver would give you just that. If the problem is on the other hand that you need the l3 ovn port forwarding support, then of course my offering wouldn't help you at all.

We could discuss at a different time to use ovn-l3 instead of the asr1k mech plugin and use the ovsdb as a source to configure the asr1ks, though.

I am not sure this will work, especially with our special cases, but we can evaluate that at some point in the future.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

My thought was this: You tried run ovn in our qa env, where we have the asr1k driver running with no port forwarding. If the part of the code you're working on is requiring a "port forwarding enabled l3 plugin", then adding port forwarding support (or at least support for the APIs) to the asr1k driver would give you just that. If the problem is on the other hand that you need the l3 ovn port forwarding support, then of course my offering wouldn't help you at all.

PF is only required by the OVN l3 implementation, thus I've disabled it since it's just depending on other stuff.

I am not sure this will work, especially with our special cases, but we can evaluate that at some point in the future.

Sure!

@notandy notandy force-pushed the ovn-sci-adaptions branch from a1b8e7d to d2af24d Compare June 20, 2025 18:29
@notandy
Copy link
Collaborator Author

notandy commented Jun 20, 2025

The last commit will most likely be hard to backport, as it fixes many things at once. It would be nice if you could split it up a bit more, but if not we can probably do that on the next upgrade ourselves.

not sure if there is much to be backported. Most changes are because of us using a custom L3 plugin and OVN with HPB.
I think I could split out the log bugfix and maybe some of the defaulting of values/NONE.

@notandy notandy force-pushed the ovn-sci-adaptions branch from d2af24d to bedc5f3 Compare June 20, 2025 18:44
@notandy notandy requested review from sebageek and toanju June 20, 2025 18:45
Copy link
Collaborator

@sebageek sebageek left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

not sure if there is much to be backported. Most changes are because of us using a custom L3 plugin and OVN with HPB.

What I mean by "backporting" is our neutron upgrade process of switching to a new release and then "backporting" or "forwardporting" all of our patches onto the new branch. We go through them, one by one, see if they apply and see if they are still needed. When we have one big commit doing multiple things it's harder to port, especially in cases where we only need half of the patch in the next release.

ovn_const.OVN_REV_NUM_EXT_ID_KEY: str(utils.get_revision_number(
port, ovn_const.TYPE_PORTS)),
ovn_const.OVN_PORT_VNIC_TYPE_KEY: port_info.vnic_type,
ovn_const.OVN_PORT_VNIC_TYPE_KEY: port_info.vnic_type or '',
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This port is one of these "that shouldn't happen but still happens" cases.

When a port only has binding profiles with status INACTIVE it looks like this gets returned as the port not having a vnic_type. This is something that should not happen, but still can happen due to botched migrations. Therefore we need this workaround until we can make sure we always have a binding in status ACTIVE for our ports.

Could you please add something like this (or even this) as a comment into the commit message? Just so we know when we can remove the workaround.

Comment on lines +350 to +353
router = router.get(l3.EXTERNAL_GW_INFO, {})
if not router:
return True
return router.get('enable_snat', True)
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Aye, this is a weird case. Apparently the router has no external gw and no router ports associated, but there seems to be one port in the db having the router as device id. I see 13 such routers worldwide. We should investigate that at some point in the future.

Proposed for commit message: This might happen e.g. due to an inconsistency in between the router/routerports and the ports table (where ports have the router as device_id). Until we have figured out where that inconsistency comes from we need to keep this patch.

I wonder what still calls this code. sync_routers_and_rports() should exit early if the l3_plugin is not ovn (is_ovn_l3()). When we deactivate ovn l3 earlie, shouldn't we just exit early here or is the is_ovn_l3() not working?

Comment on lines +79 to +83
# SCI: only needed for ovn l3 plugin
# if not self.pf_plugin:
# self.pf_plugin = (
# manager.NeutronManager.load_class_for_provider(
# 'neutron.service_plugins', 'port_forwarding')())
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's a requirement of the OVN l3 plugin, what does it help if you support it in asr1k?

My thought was this: You tried run ovn in our qa env, where we have the asr1k driver running with no port forwarding. If the part of the code you're working on is requiring a "port forwarding enabled l3 plugin", then adding port forwarding support (or at least support for the APIs) to the asr1k driver would give you just that. If the problem is on the other hand that you need the l3 ovn port forwarding support, then of course my offering wouldn't help you at all.

We could discuss at a different time to use ovn-l3 instead of the asr1k mech plugin and use the ovsdb as a source to configure the asr1ks, though.

I am not sure this will work, especially with our special cases, but we can evaluate that at some point in the future.

This commit removes the hardcoded ovn l3 plugin in the ovn-db-sync-util.
In upstream neutron, the ovn l3 is enabled by defaut but not needed in
our setup. removing the l3 plugin just syncs everyting else, like ports,
QOS, security groups and networks.

Disable maintenance tasks for routers if the ovn l3 plugins is not used.

ovn/utils.py @is_snat_enabled:
`external_gateway_info` can be None, this might happen e.g. due to an
inconsistency in between the router/routerports and the ports table
(where ports have the router as device_id). Until we have figured out
where that inconsistency comes from we need to keep this patch.

use default TYPE_VLAN for networks without attr `provider:network_type`.
This hardcodes vlan gateway termination on every ovn node - since OVN
deson't really support HPB.
@notandy notandy requested a review from sebageek June 30, 2025 16:43
@notandy notandy force-pushed the ovn-sci-adaptions branch 2 times, most recently from 1866a01 to e54f098 Compare June 30, 2025 17:19
notandy added 3 commits June 30, 2025 13:29
OVN expects more or less to be the only driver responsible for ports.

This patch removes the explicit port_status_down for a newly created
logical_switch_port in OVSDB-NB. OVN driver will sync all network ports
to the ovsdb, regardless if they are bound against phyiscal network with
ovn-controller or not. So this patch leaves the status intact for ports
that are not managed by OVN.

status=false is anyway the default for new ports.
We encountered neutron port's with **None** `vnic_type`, which cannot
modeled in OVSDB, thus need to be converted to a empty string.

When a port only has binding profiles with status INACTIVE it looks like
this gets returned as the port not having a vnic_type. This is something
that should not happen, but still can happen due to botched migrations.
Therefore we need this workaround until we can make sure we always have
a binding in status ACTIVE for our ports.
…orts_and_dhcp_opts

The log line uses `lport` to interpolate the logical port id,
but was injected as `seg` instead causing an attribute exception.
@notandy notandy force-pushed the ovn-sci-adaptions branch from e54f098 to 363a3d3 Compare June 30, 2025 17:29
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants