Conversation
if L3 OVN service plugin is not used. It would cause an exception since the l3 router object is not populated with `external_gateways`. Will be removed anyway with 2025.1 release - Can be skipped for backporting to 2025.1 or later. also fix exception logging for check_baremetal_ports_dhcp_options
b05dd87 to
3ebc418
Compare
3ebc418 to
a1b8e7d
Compare
sebageek
left a comment
There was a problem hiding this comment.
The last commit will most likely be hard to backport, as it fixes many things at once. It would be nice if you could split it up a bit more, but if not we can probably do that on the next upgrade ourselves.
| router = router.get(l3.EXTERNAL_GW_INFO, {}) | ||
| if not router: | ||
| return True | ||
| return router.get('enable_snat', True) |
There was a problem hiding this comment.
Looks interesting, do you have a stacktrace for this? I'd like to see where this comes from, maybe that's also something we can fix in our custom l3 driver if that's the cause.
There was a problem hiding this comment.
$ neutron-ovn-db-sync-util --config... --ovn-neutron_sync_mode repair:
...
2025-06-20 12:55:17 63836 DEBUG neutron.plugins.ml2.drivers.ovn.mech_driver.ovsdb.ovn_db_sync [req-148219c3-0f54-4025-9a44-c2cbfe38e929 gNone - - - - - -] OVN-NB Sync routers and router ports started @ 2025-06-20 12:55:17.573351 sync_routers_and_rports /Users/d072895/Workspace/2024.1/neutron/neutron/plugins/ml2/drivers/ovn/mech_driver/ovsdb/ovn_db_sync.py:516
2025-06-20 12:55:44 63836 CRITICAL neutron_ovn_db_sync_util [req-148219c3-0f54-4025-9a44-c2cbfe38e929 gNone - - - - - -] Unhandled error: AttributeError: 'NoneType' object has no attribute 'get'
2025-06-20 12:55:44 63836 ERROR neutron_ovn_db_sync_util Traceback (most recent call last):
2025-06-20 12:55:44 63836 ERROR neutron_ovn_db_sync_util File "/Users/d072895/.virtualenvs/caracal_python3.10.12/bin/neutron-ovn-db-sync-util", line 8, in <module>
2025-06-20 12:55:44 63836 ERROR neutron_ovn_db_sync_util sys.exit(main())
2025-06-20 12:55:44 63836 ERROR neutron_ovn_db_sync_util File "/Users/d072895/Workspace/2024.1/neutron/neutron/cmd/ovn/neutron_ovn_db_sync_util.py", line 241, in main
2025-06-20 12:55:44 63836 ERROR neutron_ovn_db_sync_util synchronizer.do_sync()
2025-06-20 12:55:44 63836 ERROR neutron_ovn_db_sync_util File "/Users/d072895/Workspace/2024.1/neutron/neutron/plugins/ml2/drivers/ovn/mech_driver/ovsdb/ovn_db_sync.py", line 110, in do_sync
2025-06-20 12:55:44 63836 ERROR neutron_ovn_db_sync_util self.sync_routers_and_rports(ctx)
2025-06-20 12:55:44 63836 ERROR neutron_ovn_db_sync_util File "/Users/d072895/Workspace/2024.1/neutron/neutron/plugins/ml2/drivers/ovn/mech_driver/ovsdb/ovn_db_sync.py", line 542, in sync_routers_and_rports
2025-06-20 12:55:44 63836 ERROR neutron_ovn_db_sync_util if gw_info.router_ip and utils.is_snat_enabled(router):
2025-06-20 12:55:44 63836 ERROR neutron_ovn_db_sync_util File "/Users/d072895/Workspace/2024.1/neutron/neutron/common/ovn/utils.py", line 350, in is_snat_enabled
2025-06-20 12:55:44 63836 ERROR neutron_ovn_db_sync_util return router.get(l3.EXTERNAL_GW_INFO, {}).get('enable_snat', True)
2025-06-20 12:55:44 63836 ERROR neutron_ovn_db_sync_util AttributeError: 'NoneType' object has no attribute 'get'
2025-06-20 12:55:44 63836 ERROR neutron_ovn_db_sync_util
Process finished with exit code 1There was a problem hiding this comment.
the router in this stacktrace is b375ffbd-599d-4fff-a2aa-eb75988ba821 in qa-de-1
There was a problem hiding this comment.
Aye, this is a weird case. Apparently the router has no external gw and no router ports associated, but there seems to be one port in the db having the router as device id. I see 13 such routers worldwide. We should investigate that at some point in the future.
Proposed for commit message: This might happen e.g. due to an inconsistency in between the router/routerports and the ports table (where ports have the router as device_id). Until we have figured out where that inconsistency comes from we need to keep this patch.
I wonder what still calls this code. sync_routers_and_rports() should exit early if the l3_plugin is not ovn (is_ovn_l3()). When we deactivate ovn l3 earlie, shouldn't we just exit early here or is the is_ovn_l3() not working?
There was a problem hiding this comment.
Proposed for commit message: This might happen e.g. due to an inconsistency in between the router/routerports and the ports table (where ports have the router as device_id). Until we have figured out where that inconsistency comes from we need to keep this patch.
Amended the commit message.
I wonder what still calls this code.
sync_routers_and_rports()should exit early if thel3_pluginis not ovn (is_ovn_l3()). When we deactivate ovn l3 earlie, shouldn't we just exit early here or is theis_ovn_l3()not working?
Hrm, It might be so obvious anymore since it's hard to keep track which patch was first. is_snat_enabled is used quite often in the ovs_client itself, like router update or router delete. So I think it's safer to use the patch for now.
| # SCI: If the port does not have options, it is not ovn l3 | ||
| if not hasattr(lrp, 'options'): | ||
| raise periodics.NeverAgain() |
There was a problem hiding this comment.
So we are aborting after seeing the first router that is not ovn l3, right? Shouldn't this either be a continue instead of raise (if we ever want to support ovn l3) or could we skip this method altogether? If the best way to figure out that we're not ovn l3 is to check for a specific router port we can also keep it this way for now.
There was a problem hiding this comment.
I've changed the check to utils.is_ovn_l3(self._ovn_client._l3_plugin): which makes it clearer that the code expects ovn managed routers instead on relying on missing attributes.
I am very unoptimistic that ovn would be able to handle multiple l3 plugins at the same time. So it seems safer to disable it as long as we have another l3 implementation.
| ovn_const.OVN_REV_NUM_EXT_ID_KEY: str(utils.get_revision_number( | ||
| port, ovn_const.TYPE_PORTS)), | ||
| ovn_const.OVN_PORT_VNIC_TYPE_KEY: port_info.vnic_type, | ||
| ovn_const.OVN_PORT_VNIC_TYPE_KEY: port_info.vnic_type or '', |
There was a problem hiding this comment.
I just found baremetal and normal as vnic_type in our db - do you have any information / a stack trace when this is None? Or maybe even better: do you have a port with which this has happened?
There was a problem hiding this comment.
010a396e-943d-407b-9127-e1b091b8fd65 in qa-de-1
There was a problem hiding this comment.
This port is one of these "that shouldn't happen but still happens" cases.
When a port only has binding profiles with status INACTIVE it looks like this gets returned as the port not having a vnic_type. This is something that should not happen, but still can happen due to botched migrations. Therefore we need this workaround until we can make sure we always have a binding in status ACTIVE for our ports.
Could you please add something like this (or even this) as a comment into the commit message? Just so we know when we can remove the workaround.
| ovn_const.OVN_NETTYPE_EXT_ID_KEY: network.get(pnet.NETWORK_TYPE, | ||
| const.TYPE_VLAN), |
There was a problem hiding this comment.
There are a couple of other places where we also access the NETWORK_TYPE of a network. Do we need to fix that there as well or have some other workaround? Or can the other places be ignored?
ralonsohs note suggests that this part here does not work with multiple segments and this seems to hold true when looking at where NETWORK_TYPE is actually set - it doesn't get set, as we're using hierarchical portbinding and multiple segments are present for the network.
Also, could you please fixup that to the previous commit? It will be easier to backport along the line.
There was a problem hiding this comment.
Yeah, it's a hack. OVN was never tested with hierarchical portbinding, which makes sense since it's usually used as a overlay network.
I am mocking/hardcoding a provider vlan for every network. The vlan information is already correctly synced by ovn, e.g. for cc-demo in bb085:
[qa-de-1:monsoon3][0.500s][~]$ ovn-nbctl list Logical_Switch_Port 1cc2050b-e1eb-4c06-8441-9cef9b9d33f8
_uuid : 1cc2050b-e1eb-4c06-8441-9cef9b9d33f8
addresses : [unknown]
dhcpv4_options : []
dhcpv6_options : []
dynamic_addresses : []
enabled : []
external_ids : {}
ha_chassis_group : []
mirror_rules : []
name : provnet-5eef7f5b-2d5d-4861-9c18-c11b506cbd7e
options : {localnet_learn_fdb="false", mcast_flood="false", mcast_flood_reports="true", network_name=bb085}
parent_name : []
port_security : []
tag : 2099
tag_request : []
type : localnet
up : []
This way the ovn-controller will create a local gateway connection to the ovs integration bridge for every network instead of using a geneve overlay.
| # SCI: only needed for ovn l3 plugin | ||
| # if not self.pf_plugin: | ||
| # self.pf_plugin = ( | ||
| # manager.NeutronManager.load_class_for_provider( | ||
| # 'neutron.service_plugins', 'port_forwarding')()) |
There was a problem hiding this comment.
Why do we need to remove this? Would it help if we'd support the port forwarding extension in our asr1k driver?
There was a problem hiding this comment.
It's a requirement of the OVN l3 plugin, what does it help if you support it in asr1k?
the ovn_db_sync hardcodes the ovn l3 plugin just for the manual sync - and thus expects the port_forwarding plugin for syncing attributes. That alone would be fine with me if it didn't had code that modifies existing router/gateway ports.
So I want to skip the sync of routers/fips completely for now to be sure it won't mess with the db.
We could discuss at a different time to use ovn-l3 instead of the asr1k mech plugin and use the ovsdb as a source to configure the asr1ks, though.
There was a problem hiding this comment.
It's a requirement of the OVN l3 plugin, what does it help if you support it in asr1k?
My thought was this: You tried run ovn in our qa env, where we have the asr1k driver running with no port forwarding. If the part of the code you're working on is requiring a "port forwarding enabled l3 plugin", then adding port forwarding support (or at least support for the APIs) to the asr1k driver would give you just that. If the problem is on the other hand that you need the l3 ovn port forwarding support, then of course my offering wouldn't help you at all.
We could discuss at a different time to use ovn-l3 instead of the asr1k mech plugin and use the ovsdb as a source to configure the asr1ks, though.
I am not sure this will work, especially with our special cases, but we can evaluate that at some point in the future.
There was a problem hiding this comment.
My thought was this: You tried run ovn in our qa env, where we have the asr1k driver running with no port forwarding. If the part of the code you're working on is requiring a "port forwarding enabled l3 plugin", then adding port forwarding support (or at least support for the APIs) to the asr1k driver would give you just that. If the problem is on the other hand that you need the l3 ovn port forwarding support, then of course my offering wouldn't help you at all.
PF is only required by the OVN l3 implementation, thus I've disabled it since it's just depending on other stuff.
I am not sure this will work, especially with our special cases, but we can evaluate that at some point in the future.
Sure!
a1b8e7d to
d2af24d
Compare
not sure if there is much to be backported. Most changes are because of us using a custom L3 plugin and OVN with HPB. |
d2af24d to
bedc5f3
Compare
sebageek
left a comment
There was a problem hiding this comment.
not sure if there is much to be backported. Most changes are because of us using a custom L3 plugin and OVN with HPB.
What I mean by "backporting" is our neutron upgrade process of switching to a new release and then "backporting" or "forwardporting" all of our patches onto the new branch. We go through them, one by one, see if they apply and see if they are still needed. When we have one big commit doing multiple things it's harder to port, especially in cases where we only need half of the patch in the next release.
| ovn_const.OVN_REV_NUM_EXT_ID_KEY: str(utils.get_revision_number( | ||
| port, ovn_const.TYPE_PORTS)), | ||
| ovn_const.OVN_PORT_VNIC_TYPE_KEY: port_info.vnic_type, | ||
| ovn_const.OVN_PORT_VNIC_TYPE_KEY: port_info.vnic_type or '', |
There was a problem hiding this comment.
This port is one of these "that shouldn't happen but still happens" cases.
When a port only has binding profiles with status INACTIVE it looks like this gets returned as the port not having a vnic_type. This is something that should not happen, but still can happen due to botched migrations. Therefore we need this workaround until we can make sure we always have a binding in status ACTIVE for our ports.
Could you please add something like this (or even this) as a comment into the commit message? Just so we know when we can remove the workaround.
| router = router.get(l3.EXTERNAL_GW_INFO, {}) | ||
| if not router: | ||
| return True | ||
| return router.get('enable_snat', True) |
There was a problem hiding this comment.
Aye, this is a weird case. Apparently the router has no external gw and no router ports associated, but there seems to be one port in the db having the router as device id. I see 13 such routers worldwide. We should investigate that at some point in the future.
Proposed for commit message: This might happen e.g. due to an inconsistency in between the router/routerports and the ports table (where ports have the router as device_id). Until we have figured out where that inconsistency comes from we need to keep this patch.
I wonder what still calls this code. sync_routers_and_rports() should exit early if the l3_plugin is not ovn (is_ovn_l3()). When we deactivate ovn l3 earlie, shouldn't we just exit early here or is the is_ovn_l3() not working?
| # SCI: only needed for ovn l3 plugin | ||
| # if not self.pf_plugin: | ||
| # self.pf_plugin = ( | ||
| # manager.NeutronManager.load_class_for_provider( | ||
| # 'neutron.service_plugins', 'port_forwarding')()) |
There was a problem hiding this comment.
It's a requirement of the OVN l3 plugin, what does it help if you support it in asr1k?
My thought was this: You tried run ovn in our qa env, where we have the asr1k driver running with no port forwarding. If the part of the code you're working on is requiring a "port forwarding enabled l3 plugin", then adding port forwarding support (or at least support for the APIs) to the asr1k driver would give you just that. If the problem is on the other hand that you need the l3 ovn port forwarding support, then of course my offering wouldn't help you at all.
We could discuss at a different time to use ovn-l3 instead of the asr1k mech plugin and use the ovsdb as a source to configure the asr1ks, though.
I am not sure this will work, especially with our special cases, but we can evaluate that at some point in the future.
This commit removes the hardcoded ovn l3 plugin in the ovn-db-sync-util. In upstream neutron, the ovn l3 is enabled by defaut but not needed in our setup. removing the l3 plugin just syncs everyting else, like ports, QOS, security groups and networks. Disable maintenance tasks for routers if the ovn l3 plugins is not used. ovn/utils.py @is_snat_enabled: `external_gateway_info` can be None, this might happen e.g. due to an inconsistency in between the router/routerports and the ports table (where ports have the router as device_id). Until we have figured out where that inconsistency comes from we need to keep this patch. use default TYPE_VLAN for networks without attr `provider:network_type`. This hardcodes vlan gateway termination on every ovn node - since OVN deson't really support HPB.
1866a01 to
e54f098
Compare
OVN expects more or less to be the only driver responsible for ports. This patch removes the explicit port_status_down for a newly created logical_switch_port in OVSDB-NB. OVN driver will sync all network ports to the ovsdb, regardless if they are bound against phyiscal network with ovn-controller or not. So this patch leaves the status intact for ports that are not managed by OVN. status=false is anyway the default for new ports.
We encountered neutron port's with **None** `vnic_type`, which cannot modeled in OVSDB, thus need to be converted to a empty string. When a port only has binding profiles with status INACTIVE it looks like this gets returned as the port not having a vnic_type. This is something that should not happen, but still can happen due to botched migrations. Therefore we need this workaround until we can make sure we always have a binding in status ACTIVE for our ports.
…orts_and_dhcp_opts The log line uses `lport` to interpolate the logical port id, but was injected as `seg` instead causing an attribute exception.
e54f098 to
363a3d3
Compare
Uh oh!
There was an error while loading. Please reload this page.