Netdev Archive mirror
 help / color / mirror / Atom feed
* [PATCH net-next 0/5] Add layer 2 miss indication and filtering
@ 2023-05-18 11:33 Ido Schimmel
  2023-05-18 11:33 ` [PATCH net-next 1/5] skbuff: bridge: Add layer 2 miss indication Ido Schimmel
                   ` (4 more replies)
  0 siblings, 5 replies; 20+ messages in thread
From: Ido Schimmel @ 2023-05-18 11:33 UTC (permalink / raw
  To: netdev, bridge
  Cc: davem, kuba, pabeni, edumazet, razor, roopa, taras.chornyi,
	saeedm, leon, petrm, vladimir.oltean, claudiu.manoil,
	alexandre.belloni, UNGLinuxDriver, jhs, xiyou.wangcong, jiri,
	taspelund, Ido Schimmel

tl;dr
=====

This patchset adds a single bit to the skb to indicate that a packet
encountered a layer 2 miss in the bridge and extends flower to match on
this metadata. This is required for non-DF (Designated Forwarder)
filtering in EVPN multi-homing which prevents decapsulated BUM packets
from being forwarded multiple times to the same multi-homed host.

Background
==========

In a typical EVPN multi-homing setup each host is multi-homed using a
set of links called ES (Ethernet Segment, i.e., LAG) to multiple leaf
switches in a rack. These switches act as VTEPs and are not directly
connected (as opposed to MLAG), but can communicate with each other (as
well as with VTEPs in remote racks) via spine switches over L3.

When a host sends a BUM packet over ES1 to VTEP1, the VTEP will flood it
to other VTEPs in the network, including those connected to the host
over ES1. The receiving VTEPs must drop the packet and not forward it
back to the host. This is called "split-horizon filtering" (SPH) [1].

FRR configures SPH filtering using two tc filters. The first, an ingress
filter that matches on packets received from VTEP1 and marks them using
a fwmark (firewall mark). The second, an egress filter configured on the
LAG interface connected to the host that matches on the fwmark and drops
the packets. Example:

 # tc filter add dev vxlan0 ingress pref 1 proto all flower enc_src_ip $VTEP1_IP action skbedit mark 101
 # tc filter add dev bond0 egress pref 1 handle 101 fw action drop

Motivation
==========

For each ES, only one VTEP is elected by the control plane as the DF.
The DF is responsible for forwarding decapsulated BUM traffic to the
host over the ES. The non-DF VTEPs must drop such traffic as otherwise
the host will receive multiple copies of BUM traffic. This is called
"non-DF filtering" [2].

Filtering of multicast and broadcast traffic can be achieved using the
following flower filter:

 # tc filter add dev bond0 egress pref 1 proto all flower indev vxlan0 dst_mac 01:00:00:00:00:00/01:00:00:00:00:00 action drop

Unlike broadcast and multicast traffic, it is not currently possible to
filter unknown unicast traffic. The classification into unknown unicast
is performed by the bridge driver, but is not visible to other layers.

Implementation
==============

The proposed solution is to add a single bit to the skb that is set by
the bridge for packets that encountered an FDB/MDB miss. The flower
classifier is extended to be able to match on this new metadata bit in a
similar fashion to existing metadata options such as 'indev'.

A bit that is set for every flooded packet would also work, but it does
not allow us to differentiate between registered and unregistered
multicast traffic which might be useful in the future.

A relatively generic name is chosen for this bit - 'l2_miss' - to allow
its use to be extended to other layer 2 devices such as VXLAN, should a
use case arise.

With the above, the control plane can implement a non-DF filter using
the following tc filters:

 # tc filter add dev bond0 egress pref 1 proto all flower indev vxlan0 dst_mac 01:00:00:00:00:00/01:00:00:00:00:00 action drop
 # tc filter add dev bond0 egress pref 2 proto all flower indev vxlan0 l2_miss true action drop

The first drops broadcast and multicast traffic and the second drops
unknown unicast traffic.

Testing
=======

A test exercising the different permutations of the 'l2_miss' bit is
added in patch #5.

Patchset overview
=================

Patch #1 adds the new bit to the skb and sets it in the bridge driver
for packets that encountered a miss. The new bit is added in an existing
hole in the skb in order not to inflate this data structure.

Patch #2 extends the flower classifier to be able to match on the new
layer 2 miss metadata.

Patch #3 rejects matching on the new metadata in drivers that already
support the 'FLOW_DISSECTOR_KEY_META' key.

Patch #4 extends mlxsw to be able to match on layer 2 miss.

Patch #5 adds a selftest.

iproute2 patches can be found here [3].

Changelog
=========

Since RFC [4]:

No changes.

[1] https://datatracker.ietf.org/doc/html/rfc7432#section-8.3
[2] https://datatracker.ietf.org/doc/html/rfc7432#section-8.5
[3] https://github.com/idosch/iproute2/tree/submit/non_df_filter_v1
[4] https://lore.kernel.org/netdev/20230509070446.246088-1-idosch@nvidia.com/

Ido Schimmel (5):
  skbuff: bridge: Add layer 2 miss indication
  net/sched: flower: Allow matching on layer 2 miss
  flow_offload: Reject matching on layer 2 miss
  mlxsw: spectrum_flower: Add ability to match on layer 2 miss
  selftests: forwarding: Add layer 2 miss test cases

 .../marvell/prestera/prestera_flower.c        |   6 +
 .../net/ethernet/mellanox/mlx5/core/en_tc.c   |   6 +
 .../mellanox/mlxsw/core_acl_flex_keys.c       |   1 +
 .../mellanox/mlxsw/core_acl_flex_keys.h       |   3 +-
 .../mellanox/mlxsw/spectrum_acl_flex_keys.c   |   5 +
 .../ethernet/mellanox/mlxsw/spectrum_flower.c |  16 +
 drivers/net/ethernet/mscc/ocelot_flower.c     |  10 +
 include/linux/skbuff.h                        |   4 +
 include/net/flow_dissector.h                  |   2 +
 include/uapi/linux/pkt_cls.h                  |   2 +
 net/bridge/br_device.c                        |   1 +
 net/bridge/br_forward.c                       |   3 +
 net/bridge/br_input.c                         |   1 +
 net/core/flow_dissector.c                     |   3 +
 net/sched/cls_flower.c                        |  14 +-
 .../testing/selftests/net/forwarding/Makefile |   1 +
 .../net/forwarding/tc_flower_l2_miss.sh       | 343 ++++++++++++++++++
 17 files changed, 418 insertions(+), 3 deletions(-)
 create mode 100755 tools/testing/selftests/net/forwarding/tc_flower_l2_miss.sh

-- 
2.40.1


^ permalink raw reply	[flat|nested] 20+ messages in thread

* [PATCH net-next 1/5] skbuff: bridge: Add layer 2 miss indication
  2023-05-18 11:33 [PATCH net-next 0/5] Add layer 2 miss indication and filtering Ido Schimmel
@ 2023-05-18 11:33 ` Ido Schimmel
  2023-05-18 16:08   ` Nikolay Aleksandrov
  2023-05-18 11:33 ` [PATCH net-next 2/5] net/sched: flower: Allow matching on layer 2 miss Ido Schimmel
                   ` (3 subsequent siblings)
  4 siblings, 1 reply; 20+ messages in thread
From: Ido Schimmel @ 2023-05-18 11:33 UTC (permalink / raw
  To: netdev, bridge
  Cc: davem, kuba, pabeni, edumazet, razor, roopa, taras.chornyi,
	saeedm, leon, petrm, vladimir.oltean, claudiu.manoil,
	alexandre.belloni, UNGLinuxDriver, jhs, xiyou.wangcong, jiri,
	taspelund, Ido Schimmel

Allow the bridge driver to mark packets that did not match a layer 2
entry during forwarding by adding a 'l2_miss' bit to the skb.

Clear the bit whenever a packet enters the bridge (received from a
bridge port or transmitted via the bridge) and set it if the packet did
not match an FDB/MDB entry.

Subsequent patches will allow the flower classifier to match on this
bit. The motivating use case in non-DF (Designated Forwarder) filtering
where we would like to prevent decapsulated packets from being flooded
to a multi-homed host.

Do not allocate the bit if the kernel was not compiled with bridge
support and place it after the two bit fields in accordance with commit
4c60d04c2888 ("net: skbuff: push nf_trace down the bitfield"). The bit
does not increase the size of the structure as it is placed at an
existing hole. Layout with allmodconfig:

struct sk_buff {
[...]
			__u8       csum_not_inet:1;      /*   132: 3  1 */
			__u8       l2_miss:1;            /*   132: 4  1 */

			/* XXX 3 bits hole, try to pack */
			/* XXX 1 byte hole, try to pack */

			__u16      tc_index;             /*   134     2 */
			u16        alloc_cpu;            /*   136     2 */
[...]
} __attribute__((__aligned__(8)));

Signed-off-by: Ido Schimmel <idosch@nvidia.com>
---
 include/linux/skbuff.h  | 4 ++++
 net/bridge/br_device.c  | 1 +
 net/bridge/br_forward.c | 3 +++
 net/bridge/br_input.c   | 1 +
 4 files changed, 9 insertions(+)

diff --git a/include/linux/skbuff.h b/include/linux/skbuff.h
index 8cff3d817131..b64dc3f62c5c 100644
--- a/include/linux/skbuff.h
+++ b/include/linux/skbuff.h
@@ -801,6 +801,7 @@ typedef unsigned char *sk_buff_data_t;
  *	@encap_hdr_csum: software checksum is needed
  *	@csum_valid: checksum is already valid
  *	@csum_not_inet: use CRC32c to resolve CHECKSUM_PARTIAL
+ *	@l2_miss: Packet did not match an L2 entry during forwarding
  *	@csum_complete_sw: checksum was completed by software
  *	@csum_level: indicates the number of consecutive checksums found in
  *		the packet minus one that have been verified as
@@ -991,6 +992,9 @@ struct sk_buff {
 #if IS_ENABLED(CONFIG_IP_SCTP)
 	__u8			csum_not_inet:1;
 #endif
+#if IS_ENABLED(CONFIG_BRIDGE)
+	__u8			l2_miss:1;
+#endif
 
 #ifdef CONFIG_NET_SCHED
 	__u16			tc_index;	/* traffic control index */
diff --git a/net/bridge/br_device.c b/net/bridge/br_device.c
index 8eca8a5c80c6..91dbdae4afd4 100644
--- a/net/bridge/br_device.c
+++ b/net/bridge/br_device.c
@@ -39,6 +39,7 @@ netdev_tx_t br_dev_xmit(struct sk_buff *skb, struct net_device *dev)
 	u16 vid = 0;
 
 	memset(skb->cb, 0, sizeof(struct br_input_skb_cb));
+	skb->l2_miss = 0;
 
 	rcu_read_lock();
 	nf_ops = rcu_dereference(nf_br_ops);
diff --git a/net/bridge/br_forward.c b/net/bridge/br_forward.c
index 84d6dd5e5b1a..8cf5a51489ce 100644
--- a/net/bridge/br_forward.c
+++ b/net/bridge/br_forward.c
@@ -203,6 +203,8 @@ void br_flood(struct net_bridge *br, struct sk_buff *skb,
 	struct net_bridge_port *prev = NULL;
 	struct net_bridge_port *p;
 
+	skb->l2_miss = 1;
+
 	list_for_each_entry_rcu(p, &br->port_list, list) {
 		/* Do not flood unicast traffic to ports that turn it off, nor
 		 * other traffic if flood off, except for traffic we originate
@@ -295,6 +297,7 @@ void br_multicast_flood(struct net_bridge_mdb_entry *mdst,
 			allow_mode_include = false;
 	} else {
 		p = NULL;
+		skb->l2_miss = 1;
 	}
 
 	while (p || rp) {
diff --git a/net/bridge/br_input.c b/net/bridge/br_input.c
index fc17b9fd93e6..d8ab5890cbe6 100644
--- a/net/bridge/br_input.c
+++ b/net/bridge/br_input.c
@@ -334,6 +334,7 @@ static rx_handler_result_t br_handle_frame(struct sk_buff **pskb)
 		return RX_HANDLER_CONSUMED;
 
 	memset(skb->cb, 0, sizeof(struct br_input_skb_cb));
+	skb->l2_miss = 0;
 
 	p = br_port_get_rcu(skb->dev);
 	if (p->flags & BR_VLAN_TUNNEL)
-- 
2.40.1


^ permalink raw reply related	[flat|nested] 20+ messages in thread

* [PATCH net-next 2/5] net/sched: flower: Allow matching on layer 2 miss
  2023-05-18 11:33 [PATCH net-next 0/5] Add layer 2 miss indication and filtering Ido Schimmel
  2023-05-18 11:33 ` [PATCH net-next 1/5] skbuff: bridge: Add layer 2 miss indication Ido Schimmel
@ 2023-05-18 11:33 ` Ido Schimmel
  2023-05-19 11:27   ` Simon Horman
  2023-05-18 11:33 ` [PATCH net-next 3/5] flow_offload: Reject " Ido Schimmel
                   ` (2 subsequent siblings)
  4 siblings, 1 reply; 20+ messages in thread
From: Ido Schimmel @ 2023-05-18 11:33 UTC (permalink / raw
  To: netdev, bridge
  Cc: davem, kuba, pabeni, edumazet, razor, roopa, taras.chornyi,
	saeedm, leon, petrm, vladimir.oltean, claudiu.manoil,
	alexandre.belloni, UNGLinuxDriver, jhs, xiyou.wangcong, jiri,
	taspelund, Ido Schimmel

Add the 'TCA_FLOWER_L2_MISS' netlink attribute that allows user space to
match on packets that encountered a layer 2 miss. The miss indication is
set as metadata in the skb by the bridge driver upon FDB/MDB lookup
miss.

Signed-off-by: Ido Schimmel <idosch@nvidia.com>
---
 include/net/flow_dissector.h |  2 ++
 include/uapi/linux/pkt_cls.h |  2 ++
 net/core/flow_dissector.c    |  3 +++
 net/sched/cls_flower.c       | 14 ++++++++++++--
 4 files changed, 19 insertions(+), 2 deletions(-)

diff --git a/include/net/flow_dissector.h b/include/net/flow_dissector.h
index 85b2281576ed..8b41668c77fc 100644
--- a/include/net/flow_dissector.h
+++ b/include/net/flow_dissector.h
@@ -243,10 +243,12 @@ struct flow_dissector_key_ip {
  * struct flow_dissector_key_meta:
  * @ingress_ifindex: ingress ifindex
  * @ingress_iftype: ingress interface type
+ * @l2_miss: packet did not match an L2 entry during forwarding
  */
 struct flow_dissector_key_meta {
 	int ingress_ifindex;
 	u16 ingress_iftype;
+	u8 l2_miss;
 };
 
 /**
diff --git a/include/uapi/linux/pkt_cls.h b/include/uapi/linux/pkt_cls.h
index 648a82f32666..00933dda7b10 100644
--- a/include/uapi/linux/pkt_cls.h
+++ b/include/uapi/linux/pkt_cls.h
@@ -594,6 +594,8 @@ enum {
 
 	TCA_FLOWER_KEY_L2TPV3_SID,	/* be32 */
 
+	TCA_FLOWER_L2_MISS,		/* u8 */
+
 	__TCA_FLOWER_MAX,
 };
 
diff --git a/net/core/flow_dissector.c b/net/core/flow_dissector.c
index 25fb0bbc310f..3776c7bdd228 100644
--- a/net/core/flow_dissector.c
+++ b/net/core/flow_dissector.c
@@ -241,6 +241,9 @@ void skb_flow_dissect_meta(const struct sk_buff *skb,
 					 FLOW_DISSECTOR_KEY_META,
 					 target_container);
 	meta->ingress_ifindex = skb->skb_iif;
+#if IS_ENABLED(CONFIG_BRIDGE)
+	meta->l2_miss = skb->l2_miss;
+#endif
 }
 EXPORT_SYMBOL(skb_flow_dissect_meta);
 
diff --git a/net/sched/cls_flower.c b/net/sched/cls_flower.c
index 9dbc43388e57..4eb06c6367fc 100644
--- a/net/sched/cls_flower.c
+++ b/net/sched/cls_flower.c
@@ -615,7 +615,8 @@ static void *fl_get(struct tcf_proto *tp, u32 handle)
 }
 
 static const struct nla_policy fl_policy[TCA_FLOWER_MAX + 1] = {
-	[TCA_FLOWER_UNSPEC]		= { .type = NLA_UNSPEC },
+	[TCA_FLOWER_UNSPEC]		= { .strict_start_type =
+						TCA_FLOWER_L2_MISS },
 	[TCA_FLOWER_CLASSID]		= { .type = NLA_U32 },
 	[TCA_FLOWER_INDEV]		= { .type = NLA_STRING,
 					    .len = IFNAMSIZ },
@@ -720,7 +721,7 @@ static const struct nla_policy fl_policy[TCA_FLOWER_MAX + 1] = {
 	[TCA_FLOWER_KEY_PPPOE_SID]	= { .type = NLA_U16 },
 	[TCA_FLOWER_KEY_PPP_PROTO]	= { .type = NLA_U16 },
 	[TCA_FLOWER_KEY_L2TPV3_SID]	= { .type = NLA_U32 },
-
+	[TCA_FLOWER_L2_MISS]		= NLA_POLICY_MAX(NLA_U8, 1),
 };
 
 static const struct nla_policy
@@ -1668,6 +1669,10 @@ static int fl_set_key(struct net *net, struct nlattr **tb,
 		mask->meta.ingress_ifindex = 0xffffffff;
 	}
 
+	fl_set_key_val(tb, &key->meta.l2_miss, TCA_FLOWER_L2_MISS,
+		       &mask->meta.l2_miss, TCA_FLOWER_UNSPEC,
+		       sizeof(key->meta.l2_miss));
+
 	fl_set_key_val(tb, key->eth.dst, TCA_FLOWER_KEY_ETH_DST,
 		       mask->eth.dst, TCA_FLOWER_KEY_ETH_DST_MASK,
 		       sizeof(key->eth.dst));
@@ -3074,6 +3079,11 @@ static int fl_dump_key(struct sk_buff *skb, struct net *net,
 			goto nla_put_failure;
 	}
 
+	if (fl_dump_key_val(skb, &key->meta.l2_miss,
+			    TCA_FLOWER_L2_MISS, &mask->meta.l2_miss,
+			    TCA_FLOWER_UNSPEC, sizeof(key->meta.l2_miss)))
+		goto nla_put_failure;
+
 	if (fl_dump_key_val(skb, key->eth.dst, TCA_FLOWER_KEY_ETH_DST,
 			    mask->eth.dst, TCA_FLOWER_KEY_ETH_DST_MASK,
 			    sizeof(key->eth.dst)) ||
-- 
2.40.1


^ permalink raw reply related	[flat|nested] 20+ messages in thread

* [PATCH net-next 3/5] flow_offload: Reject matching on layer 2 miss
  2023-05-18 11:33 [PATCH net-next 0/5] Add layer 2 miss indication and filtering Ido Schimmel
  2023-05-18 11:33 ` [PATCH net-next 1/5] skbuff: bridge: Add layer 2 miss indication Ido Schimmel
  2023-05-18 11:33 ` [PATCH net-next 2/5] net/sched: flower: Allow matching on layer 2 miss Ido Schimmel
@ 2023-05-18 11:33 ` Ido Schimmel
  2023-05-19 11:33   ` Simon Horman
  2023-05-18 11:33 ` [PATCH net-next 4/5] mlxsw: spectrum_flower: Add ability to match " Ido Schimmel
  2023-05-18 11:33 ` [PATCH net-next 5/5] selftests: forwarding: Add layer 2 miss test cases Ido Schimmel
  4 siblings, 1 reply; 20+ messages in thread
From: Ido Schimmel @ 2023-05-18 11:33 UTC (permalink / raw
  To: netdev, bridge
  Cc: davem, kuba, pabeni, edumazet, razor, roopa, taras.chornyi,
	saeedm, leon, petrm, vladimir.oltean, claudiu.manoil,
	alexandre.belloni, UNGLinuxDriver, jhs, xiyou.wangcong, jiri,
	taspelund, Ido Schimmel

Adjust drivers that support the 'FLOW_DISSECTOR_KEY_META' key to reject
filters that try to match on the newly added layer 2 miss option. Add an
extack message to clearly communicate the failure reason to user space.

Example:

 # tc filter add dev swp1 egress pref 1 proto all flower skip_sw l2_miss true action drop
 Error: mlxsw_spectrum: Can't match on "l2_miss".
 We have an error talking to the kernel

Acked-by: Elad Nachman <enachman@marvell.com>
Signed-off-by: Ido Schimmel <idosch@nvidia.com>
---
 .../net/ethernet/marvell/prestera/prestera_flower.c    |  6 ++++++
 drivers/net/ethernet/mellanox/mlx5/core/en_tc.c        |  6 ++++++
 drivers/net/ethernet/mellanox/mlxsw/spectrum_flower.c  |  6 ++++++
 drivers/net/ethernet/mscc/ocelot_flower.c              | 10 ++++++++++
 4 files changed, 28 insertions(+)

diff --git a/drivers/net/ethernet/marvell/prestera/prestera_flower.c b/drivers/net/ethernet/marvell/prestera/prestera_flower.c
index 91a478b75cbf..3e20e71b0f81 100644
--- a/drivers/net/ethernet/marvell/prestera/prestera_flower.c
+++ b/drivers/net/ethernet/marvell/prestera/prestera_flower.c
@@ -148,6 +148,12 @@ static int prestera_flower_parse_meta(struct prestera_acl_rule *rule,
 	__be16 key, mask;
 
 	flow_rule_match_meta(f_rule, &match);
+
+	if (match.mask->l2_miss) {
+		NL_SET_ERR_MSG_MOD(f->common.extack, "Can't match on \"l2_miss\"");
+		return -EOPNOTSUPP;
+	}
+
 	if (match.mask->ingress_ifindex != 0xFFFFFFFF) {
 		NL_SET_ERR_MSG_MOD(f->common.extack,
 				   "Unsupported ingress ifindex mask");
diff --git a/drivers/net/ethernet/mellanox/mlx5/core/en_tc.c b/drivers/net/ethernet/mellanox/mlx5/core/en_tc.c
index 728b82ce4031..516653568330 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/en_tc.c
+++ b/drivers/net/ethernet/mellanox/mlx5/core/en_tc.c
@@ -2586,6 +2586,12 @@ static int mlx5e_flower_parse_meta(struct net_device *filter_dev,
 		return 0;
 
 	flow_rule_match_meta(rule, &match);
+
+	if (match.mask->l2_miss) {
+		NL_SET_ERR_MSG_MOD(f->common.extack, "Can't match on \"l2_miss\"");
+		return -EOPNOTSUPP;
+	}
+
 	if (!match.mask->ingress_ifindex)
 		return 0;
 
diff --git a/drivers/net/ethernet/mellanox/mlxsw/spectrum_flower.c b/drivers/net/ethernet/mellanox/mlxsw/spectrum_flower.c
index 594cdcb90b3d..6fec9223250b 100644
--- a/drivers/net/ethernet/mellanox/mlxsw/spectrum_flower.c
+++ b/drivers/net/ethernet/mellanox/mlxsw/spectrum_flower.c
@@ -294,6 +294,12 @@ static int mlxsw_sp_flower_parse_meta(struct mlxsw_sp_acl_rule_info *rulei,
 		return 0;
 
 	flow_rule_match_meta(rule, &match);
+
+	if (match.mask->l2_miss) {
+		NL_SET_ERR_MSG_MOD(f->common.extack, "Can't match on \"l2_miss\"");
+		return -EOPNOTSUPP;
+	}
+
 	if (match.mask->ingress_ifindex != 0xFFFFFFFF) {
 		NL_SET_ERR_MSG_MOD(f->common.extack, "Unsupported ingress ifindex mask");
 		return -EINVAL;
diff --git a/drivers/net/ethernet/mscc/ocelot_flower.c b/drivers/net/ethernet/mscc/ocelot_flower.c
index ee052404eb55..e0916afcddfb 100644
--- a/drivers/net/ethernet/mscc/ocelot_flower.c
+++ b/drivers/net/ethernet/mscc/ocelot_flower.c
@@ -592,6 +592,16 @@ ocelot_flower_parse_key(struct ocelot *ocelot, int port, bool ingress,
 		return -EOPNOTSUPP;
 	}
 
+	if (flow_rule_match_key(rule, FLOW_DISSECTOR_KEY_META)) {
+		struct flow_match_meta match;
+
+		flow_rule_match_meta(rule, &match);
+		if (match.mask->l2_miss) {
+			NL_SET_ERR_MSG_MOD(extack, "Can't match on \"l2_miss\"");
+			return -EOPNOTSUPP;
+		}
+	}
+
 	/* For VCAP ES0 (egress rewriter) we can match on the ingress port */
 	if (!ingress) {
 		ret = ocelot_flower_parse_indev(ocelot, port, f, filter);
-- 
2.40.1


^ permalink raw reply related	[flat|nested] 20+ messages in thread

* [PATCH net-next 4/5] mlxsw: spectrum_flower: Add ability to match on layer 2 miss
  2023-05-18 11:33 [PATCH net-next 0/5] Add layer 2 miss indication and filtering Ido Schimmel
                   ` (2 preceding siblings ...)
  2023-05-18 11:33 ` [PATCH net-next 3/5] flow_offload: Reject " Ido Schimmel
@ 2023-05-18 11:33 ` Ido Schimmel
  2023-05-19 11:35   ` Simon Horman
  2023-05-18 11:33 ` [PATCH net-next 5/5] selftests: forwarding: Add layer 2 miss test cases Ido Schimmel
  4 siblings, 1 reply; 20+ messages in thread
From: Ido Schimmel @ 2023-05-18 11:33 UTC (permalink / raw
  To: netdev, bridge
  Cc: davem, kuba, pabeni, edumazet, razor, roopa, taras.chornyi,
	saeedm, leon, petrm, vladimir.oltean, claudiu.manoil,
	alexandre.belloni, UNGLinuxDriver, jhs, xiyou.wangcong, jiri,
	taspelund, Ido Schimmel

Add the 'dmac_type' key element to supported key blocks and make use of
it to match on layer 2 miss.

This is a two bits key in hardware with the following values:
00b - Known multicast.
01b - Broadcast.
10b - Known unicast.
11b - Unknown unicast or unregistered multicast.

When 'l2_miss' is set we need to match on 01b or 11b. Therefore, only
match on the LSB in order to differentiate between both cases of
'l2_miss'.

Tested on Spectrum-{1,2,3,4}.

Signed-off-by: Ido Schimmel <idosch@nvidia.com>
---
 .../mellanox/mlxsw/core_acl_flex_keys.c       |  1 +
 .../mellanox/mlxsw/core_acl_flex_keys.h       |  3 ++-
 .../mellanox/mlxsw/spectrum_acl_flex_keys.c   |  5 +++++
 .../ethernet/mellanox/mlxsw/spectrum_flower.c | 20 ++++++++++++++-----
 4 files changed, 23 insertions(+), 6 deletions(-)

diff --git a/drivers/net/ethernet/mellanox/mlxsw/core_acl_flex_keys.c b/drivers/net/ethernet/mellanox/mlxsw/core_acl_flex_keys.c
index bd1a51a0a540..81af0b9a4329 100644
--- a/drivers/net/ethernet/mellanox/mlxsw/core_acl_flex_keys.c
+++ b/drivers/net/ethernet/mellanox/mlxsw/core_acl_flex_keys.c
@@ -42,6 +42,7 @@ static const struct mlxsw_afk_element_info mlxsw_afk_element_infos[] = {
 	MLXSW_AFK_ELEMENT_INFO_BUF(DST_IP_64_95, 0x34, 4),
 	MLXSW_AFK_ELEMENT_INFO_BUF(DST_IP_32_63, 0x38, 4),
 	MLXSW_AFK_ELEMENT_INFO_BUF(DST_IP_0_31, 0x3C, 4),
+	MLXSW_AFK_ELEMENT_INFO_U32(DMAC_TYPE, 0x40, 0, 2),
 };
 
 struct mlxsw_afk {
diff --git a/drivers/net/ethernet/mellanox/mlxsw/core_acl_flex_keys.h b/drivers/net/ethernet/mellanox/mlxsw/core_acl_flex_keys.h
index 3a037fe47211..6f1649cfa4cb 100644
--- a/drivers/net/ethernet/mellanox/mlxsw/core_acl_flex_keys.h
+++ b/drivers/net/ethernet/mellanox/mlxsw/core_acl_flex_keys.h
@@ -35,6 +35,7 @@ enum mlxsw_afk_element {
 	MLXSW_AFK_ELEMENT_IP_DSCP,
 	MLXSW_AFK_ELEMENT_VIRT_ROUTER_MSB,
 	MLXSW_AFK_ELEMENT_VIRT_ROUTER_LSB,
+	MLXSW_AFK_ELEMENT_DMAC_TYPE,
 	MLXSW_AFK_ELEMENT_MAX,
 };
 
@@ -69,7 +70,7 @@ struct mlxsw_afk_element_info {
 	MLXSW_AFK_ELEMENT_INFO(MLXSW_AFK_ELEMENT_TYPE_BUF,			\
 			       _element, _offset, 0, _size)
 
-#define MLXSW_AFK_ELEMENT_STORAGE_SIZE 0x40
+#define MLXSW_AFK_ELEMENT_STORAGE_SIZE 0x44
 
 struct mlxsw_afk_element_inst { /* element instance in actual block */
 	enum mlxsw_afk_element element;
diff --git a/drivers/net/ethernet/mellanox/mlxsw/spectrum_acl_flex_keys.c b/drivers/net/ethernet/mellanox/mlxsw/spectrum_acl_flex_keys.c
index 00c32320f891..18a968cded36 100644
--- a/drivers/net/ethernet/mellanox/mlxsw/spectrum_acl_flex_keys.c
+++ b/drivers/net/ethernet/mellanox/mlxsw/spectrum_acl_flex_keys.c
@@ -26,6 +26,7 @@ static struct mlxsw_afk_element_inst mlxsw_sp_afk_element_info_l2_smac[] = {
 static struct mlxsw_afk_element_inst mlxsw_sp_afk_element_info_l2_smac_ex[] = {
 	MLXSW_AFK_ELEMENT_INST_BUF(SMAC_32_47, 0x02, 2),
 	MLXSW_AFK_ELEMENT_INST_BUF(SMAC_0_31, 0x04, 4),
+	MLXSW_AFK_ELEMENT_INST_U32(DMAC_TYPE, 0x08, 0, 2),
 	MLXSW_AFK_ELEMENT_INST_U32(ETHERTYPE, 0x0C, 0, 16),
 };
 
@@ -50,6 +51,7 @@ static struct mlxsw_afk_element_inst mlxsw_sp_afk_element_info_ipv4[] = {
 };
 
 static struct mlxsw_afk_element_inst mlxsw_sp_afk_element_info_ipv4_ex[] = {
+	MLXSW_AFK_ELEMENT_INST_U32(DMAC_TYPE, 0x00, 24, 2),
 	MLXSW_AFK_ELEMENT_INST_U32(VID, 0x00, 0, 12),
 	MLXSW_AFK_ELEMENT_INST_U32(PCP, 0x08, 29, 3),
 	MLXSW_AFK_ELEMENT_INST_U32(SRC_L4_PORT, 0x08, 0, 16),
@@ -78,6 +80,7 @@ static struct mlxsw_afk_element_inst mlxsw_sp_afk_element_info_ipv6_sip_ex[] = {
 };
 
 static struct mlxsw_afk_element_inst mlxsw_sp_afk_element_info_packet_type[] = {
+	MLXSW_AFK_ELEMENT_INST_U32(DMAC_TYPE, 0x00, 30, 2),
 	MLXSW_AFK_ELEMENT_INST_U32(ETHERTYPE, 0x00, 0, 16),
 };
 
@@ -123,6 +126,7 @@ const struct mlxsw_afk_ops mlxsw_sp1_afk_ops = {
 };
 
 static struct mlxsw_afk_element_inst mlxsw_sp_afk_element_info_mac_0[] = {
+	MLXSW_AFK_ELEMENT_INST_U32(DMAC_TYPE, 0x00, 0, 2),
 	MLXSW_AFK_ELEMENT_INST_BUF(DMAC_0_31, 0x04, 4),
 };
 
@@ -313,6 +317,7 @@ const struct mlxsw_afk_ops mlxsw_sp2_afk_ops = {
 };
 
 static struct mlxsw_afk_element_inst mlxsw_sp_afk_element_info_mac_5b[] = {
+	MLXSW_AFK_ELEMENT_INST_U32(DMAC_TYPE, 0x00, 2, 2),
 	MLXSW_AFK_ELEMENT_INST_U32(VID, 0x04, 18, 12),
 	MLXSW_AFK_ELEMENT_INST_EXT_U32(SRC_SYS_PORT, 0x04, 0, 9, -1, true), /* RX_ACL_SYSTEM_PORT */
 };
diff --git a/drivers/net/ethernet/mellanox/mlxsw/spectrum_flower.c b/drivers/net/ethernet/mellanox/mlxsw/spectrum_flower.c
index 6fec9223250b..170a07f35897 100644
--- a/drivers/net/ethernet/mellanox/mlxsw/spectrum_flower.c
+++ b/drivers/net/ethernet/mellanox/mlxsw/spectrum_flower.c
@@ -295,11 +295,6 @@ static int mlxsw_sp_flower_parse_meta(struct mlxsw_sp_acl_rule_info *rulei,
 
 	flow_rule_match_meta(rule, &match);
 
-	if (match.mask->l2_miss) {
-		NL_SET_ERR_MSG_MOD(f->common.extack, "Can't match on \"l2_miss\"");
-		return -EOPNOTSUPP;
-	}
-
 	if (match.mask->ingress_ifindex != 0xFFFFFFFF) {
 		NL_SET_ERR_MSG_MOD(f->common.extack, "Unsupported ingress ifindex mask");
 		return -EINVAL;
@@ -327,6 +322,21 @@ static int mlxsw_sp_flower_parse_meta(struct mlxsw_sp_acl_rule_info *rulei,
 				       MLXSW_AFK_ELEMENT_SRC_SYS_PORT,
 				       mlxsw_sp_port->local_port,
 				       0xFFFFFFFF);
+
+	/* This is a two bits key in hardware with the following values:
+	 * 00b - Known multicast.
+	 * 01b - Broadcast.
+	 * 10b - Known unicast.
+	 * 11b - Unknown unicast or unregistered multicast.
+	 *
+	 * When 'l2_miss' is set we need to match on 01b or 11b. Therefore,
+	 * only match on the LSB in order to differentiate between both cases
+	 * of 'l2_miss'.
+	 */
+	mlxsw_sp_acl_rulei_keymask_u32(rulei, MLXSW_AFK_ELEMENT_DMAC_TYPE,
+				       match.key->l2_miss,
+				       match.mask->l2_miss & BIT(0));
+
 	return 0;
 }
 
-- 
2.40.1


^ permalink raw reply related	[flat|nested] 20+ messages in thread

* [PATCH net-next 5/5] selftests: forwarding: Add layer 2 miss test cases
  2023-05-18 11:33 [PATCH net-next 0/5] Add layer 2 miss indication and filtering Ido Schimmel
                   ` (3 preceding siblings ...)
  2023-05-18 11:33 ` [PATCH net-next 4/5] mlxsw: spectrum_flower: Add ability to match " Ido Schimmel
@ 2023-05-18 11:33 ` Ido Schimmel
  4 siblings, 0 replies; 20+ messages in thread
From: Ido Schimmel @ 2023-05-18 11:33 UTC (permalink / raw
  To: netdev, bridge
  Cc: davem, kuba, pabeni, edumazet, razor, roopa, taras.chornyi,
	saeedm, leon, petrm, vladimir.oltean, claudiu.manoil,
	alexandre.belloni, UNGLinuxDriver, jhs, xiyou.wangcong, jiri,
	taspelund, Ido Schimmel

Add test cases to verify that the bridge driver correctly marks layer 2
misses only when it should and that the flower classifier can match on
this metadata.

Example output:

 # ./tc_flower_l2_miss.sh
 TEST: L2 miss - Unicast                                             [ OK ]
 TEST: L2 miss - Multicast (IPv4)                                    [ OK ]
 TEST: L2 miss - Multicast (IPv6)                                    [ OK ]
 TEST: L2 miss - Link-local multicast (IPv4)                         [ OK ]
 TEST: L2 miss - Link-local multicast (IPv6)                         [ OK ]
 TEST: L2 miss - Broadcast                                           [ OK ]

Signed-off-by: Ido Schimmel <idosch@nvidia.com>
---
 .../testing/selftests/net/forwarding/Makefile |   1 +
 .../net/forwarding/tc_flower_l2_miss.sh       | 343 ++++++++++++++++++
 2 files changed, 344 insertions(+)
 create mode 100755 tools/testing/selftests/net/forwarding/tc_flower_l2_miss.sh

diff --git a/tools/testing/selftests/net/forwarding/Makefile b/tools/testing/selftests/net/forwarding/Makefile
index a474c60fe348..9d0062b542e5 100644
--- a/tools/testing/selftests/net/forwarding/Makefile
+++ b/tools/testing/selftests/net/forwarding/Makefile
@@ -83,6 +83,7 @@ TEST_PROGS = bridge_igmp.sh \
 	tc_chains.sh \
 	tc_flower_router.sh \
 	tc_flower.sh \
+	tc_flower_l2_miss.sh \
 	tc_mpls_l2vpn.sh \
 	tc_police.sh \
 	tc_shblocks.sh \
diff --git a/tools/testing/selftests/net/forwarding/tc_flower_l2_miss.sh b/tools/testing/selftests/net/forwarding/tc_flower_l2_miss.sh
new file mode 100755
index 000000000000..fbf0a960b2c8
--- /dev/null
+++ b/tools/testing/selftests/net/forwarding/tc_flower_l2_miss.sh
@@ -0,0 +1,343 @@
+#!/bin/bash
+# SPDX-License-Identifier: GPL-2.0
+
+# +-----------------------+                             +----------------------+
+# | H1 (vrf)              |                             | H2 (vrf)             |
+# |    + $h1              |                             |              $h2 +   |
+# |    | 192.0.2.1/28     |                             |     192.0.2.2/28 |   |
+# |    | 2001:db8:1::1/64 |                             | 2001:db8:1::2/64 |   |
+# +----|------------------+                             +------------------|---+
+#      |                                                                   |
+# +----|-------------------------------------------------------------------|---+
+# | SW |                                                                   |   |
+# |  +-|-------------------------------------------------------------------|-+ |
+# |  | + $swp1                       BR                              $swp2 + | |
+# |  +-----------------------------------------------------------------------+ |
+# +----------------------------------------------------------------------------+
+
+ALL_TESTS="
+	test_l2_miss_unicast
+	test_l2_miss_multicast
+	test_l2_miss_ll_multicast
+	test_l2_miss_broadcast
+"
+
+NUM_NETIFS=4
+source lib.sh
+source tc_common.sh
+
+h1_create()
+{
+	simple_if_init $h1 192.0.2.1/28 2001:db8:1::1/64
+}
+
+h1_destroy()
+{
+	simple_if_fini $h1 192.0.2.1/28 2001:db8:1::1/64
+}
+
+h2_create()
+{
+	simple_if_init $h2 192.0.2.2/28 2001:db8:1::2/64
+}
+
+h2_destroy()
+{
+	simple_if_fini $h2 192.0.2.2/28 2001:db8:1::2/64
+}
+
+switch_create()
+{
+	ip link add name br1 up type bridge
+	ip link set dev $swp1 master br1
+	ip link set dev $swp1 up
+	ip link set dev $swp2 master br1
+	ip link set dev $swp2 up
+
+	tc qdisc add dev $swp2 clsact
+}
+
+switch_destroy()
+{
+	tc qdisc del dev $swp2 clsact
+
+	ip link set dev $swp2 down
+	ip link set dev $swp2 nomaster
+	ip link set dev $swp1 down
+	ip link set dev $swp1 nomaster
+	ip link del dev br1
+}
+
+test_l2_miss_unicast()
+{
+	local dmac=00:01:02:03:04:05
+	local dip=192.0.2.2
+	local sip=192.0.2.1
+
+	RET=0
+
+	# Unknown unicast.
+	tc filter add dev $swp2 egress protocol ipv4 handle 101 pref 1 \
+	   flower indev $swp1 l2_miss true dst_mac $dmac src_ip $sip \
+	   dst_ip $dip action pass
+	# Known unicast.
+	tc filter add dev $swp2 egress protocol ipv4 handle 102 pref 1 \
+	   flower indev $swp1 l2_miss false dst_mac $dmac src_ip $sip \
+	   dst_ip $dip action pass
+
+	# Before adding FDB entry.
+	$MZ $h1 -a own -b $dmac -t ip -A $sip -B $dip -c 1 -p 100 -q
+
+	tc_check_packets "dev $swp2 egress" 101 1
+	check_err $? "Unknown unicast filter was not hit before adding FDB entry"
+
+	tc_check_packets "dev $swp2 egress" 102 0
+	check_err $? "Known unicast filter was hit before adding FDB entry"
+
+	# Adding FDB entry.
+	bridge fdb replace $dmac dev $swp2 master static
+
+	$MZ $h1 -a own -b $dmac -t ip -A $sip -B $dip -c 1 -p 100 -q
+
+	tc_check_packets "dev $swp2 egress" 101 1
+	check_err $? "Unknown unicast filter was hit after adding FDB entry"
+
+	tc_check_packets "dev $swp2 egress" 102 1
+	check_err $? "Known unicast filter was not hit after adding FDB entry"
+
+	# Deleting FDB entry.
+	bridge fdb del $dmac dev $swp2 master static
+
+	$MZ $h1 -a own -b $dmac -t ip -A $sip -B $dip -c 1 -p 100 -q
+
+	tc_check_packets "dev $swp2 egress" 101 2
+	check_err $? "Unknown unicast filter was not hit after deleting FDB entry"
+
+	tc_check_packets "dev $swp2 egress" 102 1
+	check_err $? "Known unicast filter was hit after deleting FDB entry"
+
+	tc filter del dev $swp2 egress protocol ipv4 pref 1 handle 102 flower
+	tc filter del dev $swp2 egress protocol ipv4 pref 1 handle 101 flower
+
+	log_test "L2 miss - Unicast"
+}
+
+test_l2_miss_multicast_common()
+{
+	local proto=$1; shift
+	local sip=$1; shift
+	local dip=$1; shift
+	local mode=$1; shift
+	local name=$1; shift
+
+	RET=0
+
+	# Unregistered multicast.
+	tc filter add dev $swp2 egress protocol $proto handle 101 pref 1 \
+	   flower indev $swp1 l2_miss true src_ip $sip dst_ip $dip \
+	   action pass
+	# Registered multicast.
+	tc filter add dev $swp2 egress protocol $proto handle 102 pref 1 \
+	   flower indev $swp1 l2_miss false src_ip $sip dst_ip $dip \
+	   action pass
+
+	# Before adding MDB entry.
+	$MZ $mode $h1 -t ip -A $sip -B $dip -c 1 -p 100 -q
+
+	tc_check_packets "dev $swp2 egress" 101 1
+	check_err $? "Unregistered multicast filter was not hit before adding MDB entry"
+
+	tc_check_packets "dev $swp2 egress" 102 0
+	check_err $? "Registered multicast filter was hit before adding MDB entry"
+
+	# Adding MDB entry.
+	bridge mdb replace dev br1 port $swp2 grp $dip permanent
+
+	$MZ $mode $h1 -t ip -A $sip -B $dip -c 1 -p 100 -q
+
+	tc_check_packets "dev $swp2 egress" 101 1
+	check_err $? "Unregistered multicast filter was hit after adding MDB entry"
+
+	tc_check_packets "dev $swp2 egress" 102 1
+	check_err $? "Registered multicast filter was not hit after adding MDB entry"
+
+	# Deleting MDB entry.
+	bridge mdb del dev br1 port $swp2 grp $dip
+
+	$MZ $mode $h1 -t ip -A $sip -B $dip -c 1 -p 100 -q
+
+	tc_check_packets "dev $swp2 egress" 101 2
+	check_err $? "Unregistered multicast filter was not hit after deleting MDB entry"
+
+	tc_check_packets "dev $swp2 egress" 102 1
+	check_err $? "Registered multicast filter was hit after deleting MDB entry"
+
+	tc filter del dev $swp2 egress protocol $proto pref 1 handle 102 flower
+	tc filter del dev $swp2 egress protocol $proto pref 1 handle 101 flower
+
+	log_test "L2 miss - Multicast ($name)"
+}
+
+test_l2_miss_multicast_ipv4()
+{
+	local proto="ipv4"
+	local sip=192.0.2.1
+	local dip=239.1.1.1
+	local mode="-4"
+	local name="IPv4"
+
+	test_l2_miss_multicast_common $proto $sip $dip $mode $name
+}
+
+test_l2_miss_multicast_ipv6()
+{
+	local proto="ipv6"
+	local sip=2001:db8:1::1
+	local dip=ff0e::1
+	local mode="-6"
+	local name="IPv6"
+
+	test_l2_miss_multicast_common $proto $sip $dip $mode $name
+}
+
+test_l2_miss_multicast()
+{
+	# Configure $swp2 as a multicast router port so that it will forward
+	# both registered and unregistered multicast traffic.
+	bridge link set dev $swp2 mcast_router 2
+
+	# Forwarding according to MDB entries only takes place when the bridge
+	# detects that there is a valid querier in the network. Set the bridge
+	# as the querier and assign it a valid IPv6 link-local address to be
+	# used as the source address for MLD queries.
+	ip link set dev br1 type bridge mcast_querier 1
+	ip -6 address add fe80::1/64 nodad dev br1
+	# Wait the default Query Response Interval (10 seconds) for the bridge
+	# to determine that there are no other queriers in the network.
+	sleep 10
+
+	test_l2_miss_multicast_ipv4
+	test_l2_miss_multicast_ipv6
+
+	ip -6 address del fe80::1/64 dev br1
+	ip link set dev br1 type bridge mcast_querier 0
+	bridge link set dev $swp2 mcast_router 1
+}
+
+test_l2_miss_multicast_common2()
+{
+	local name=$1; shift
+	local dmac=$1; shift
+	local dip=224.0.0.1
+	local sip=192.0.2.1
+
+}
+
+test_l2_miss_ll_multicast_common()
+{
+	local proto=$1; shift
+	local dmac=$1; shift
+	local sip=$1; shift
+	local dip=$1; shift
+	local mode=$1; shift
+	local name=$1; shift
+
+	RET=0
+
+	tc filter add dev $swp2 egress protocol $proto handle 101 pref 1 \
+	   flower indev $swp1 l2_miss true dst_mac $dmac src_ip $sip \
+	   dst_ip $dip action pass
+
+	$MZ $mode $h1 -a own -b $dmac -t ip -A $sip -B $dip -c 1 -p 100 -q
+
+	tc_check_packets "dev $swp2 egress" 101 1
+	check_err $? "Filter was not hit"
+
+	tc filter del dev $swp2 egress protocol $proto pref 1 handle 101 flower
+
+	log_test "L2 miss - Link-local multicast ($name)"
+}
+
+test_l2_miss_ll_multicast_ipv4()
+{
+	local proto=ipv4
+	local dmac=01:00:5e:00:00:01
+	local sip=192.0.2.1
+	local dip=224.0.0.1
+	local mode="-4"
+	local name="IPv4"
+
+	test_l2_miss_ll_multicast_common $proto $dmac $sip $dip $mode $name
+}
+
+test_l2_miss_ll_multicast_ipv6()
+{
+	local proto=ipv6
+	local dmac=33:33:00:00:00:01
+	local sip=2001:db8:1::1
+	local dip=ff02::1
+	local mode="-6"
+	local name="IPv6"
+
+	test_l2_miss_ll_multicast_common $proto $dmac $sip $dip $mode $name
+}
+
+test_l2_miss_ll_multicast()
+{
+	test_l2_miss_ll_multicast_ipv4
+	test_l2_miss_ll_multicast_ipv6
+}
+
+test_l2_miss_broadcast()
+{
+	local dmac=ff:ff:ff:ff:ff:ff
+	local smac=00:01:02:03:04:05
+
+	RET=0
+
+	tc filter add dev $swp2 egress protocol all handle 101 pref 1 \
+	   flower indev $swp1 l2_miss true dst_mac $dmac src_mac $smac \
+	   action pass
+
+	$MZ $h1 -a $smac -b $dmac -c 1 -p 100 -q
+
+	tc_check_packets "dev $swp2 egress" 101 1
+	check_err $? "Filter was not hit"
+
+	tc filter del dev $swp2 egress protocol all pref 1 handle 101 flower
+
+	log_test "L2 miss - Broadcast"
+}
+
+setup_prepare()
+{
+	h1=${NETIFS[p1]}
+	swp1=${NETIFS[p2]}
+
+	swp2=${NETIFS[p3]}
+	h2=${NETIFS[p4]}
+
+	vrf_prepare
+	h1_create
+	h2_create
+	switch_create
+}
+
+cleanup()
+{
+	pre_cleanup
+
+	switch_destroy
+	h2_destroy
+	h1_destroy
+	vrf_cleanup
+}
+
+trap cleanup EXIT
+
+setup_prepare
+setup_wait
+
+tests_run
+
+exit $EXIT_STATUS
-- 
2.40.1


^ permalink raw reply related	[flat|nested] 20+ messages in thread

* Re: [PATCH net-next 1/5] skbuff: bridge: Add layer 2 miss indication
  2023-05-18 11:33 ` [PATCH net-next 1/5] skbuff: bridge: Add layer 2 miss indication Ido Schimmel
@ 2023-05-18 16:08   ` Nikolay Aleksandrov
  2023-05-19 13:51     ` Ido Schimmel
  0 siblings, 1 reply; 20+ messages in thread
From: Nikolay Aleksandrov @ 2023-05-18 16:08 UTC (permalink / raw
  To: Ido Schimmel, netdev, bridge
  Cc: davem, kuba, pabeni, edumazet, roopa, taras.chornyi, saeedm, leon,
	petrm, vladimir.oltean, claudiu.manoil, alexandre.belloni,
	UNGLinuxDriver, jhs, xiyou.wangcong, jiri, taspelund

On 18/05/2023 14:33, Ido Schimmel wrote:
> Allow the bridge driver to mark packets that did not match a layer 2
> entry during forwarding by adding a 'l2_miss' bit to the skb.
> 
> Clear the bit whenever a packet enters the bridge (received from a
> bridge port or transmitted via the bridge) and set it if the packet did
> not match an FDB/MDB entry.
> 
> Subsequent patches will allow the flower classifier to match on this
> bit. The motivating use case in non-DF (Designated Forwarder) filtering
> where we would like to prevent decapsulated packets from being flooded
> to a multi-homed host.
> 
> Do not allocate the bit if the kernel was not compiled with bridge
> support and place it after the two bit fields in accordance with commit
> 4c60d04c2888 ("net: skbuff: push nf_trace down the bitfield"). The bit
> does not increase the size of the structure as it is placed at an
> existing hole. Layout with allmodconfig:
> 
> struct sk_buff {
> [...]
> 			__u8       csum_not_inet:1;      /*   132: 3  1 */
> 			__u8       l2_miss:1;            /*   132: 4  1 */
> 
> 			/* XXX 3 bits hole, try to pack */
> 			/* XXX 1 byte hole, try to pack */
> 
> 			__u16      tc_index;             /*   134     2 */
> 			u16        alloc_cpu;            /*   136     2 */
> [...]
> } __attribute__((__aligned__(8)));
> 
> Signed-off-by: Ido Schimmel <idosch@nvidia.com>
> ---
>  include/linux/skbuff.h  | 4 ++++
>  net/bridge/br_device.c  | 1 +
>  net/bridge/br_forward.c | 3 +++
>  net/bridge/br_input.c   | 1 +
>  4 files changed, 9 insertions(+)
> 
[snip]
>  	while (p || rp) {
> diff --git a/net/bridge/br_input.c b/net/bridge/br_input.c
> index fc17b9fd93e6..d8ab5890cbe6 100644
> --- a/net/bridge/br_input.c
> +++ b/net/bridge/br_input.c
> @@ -334,6 +334,7 @@ static rx_handler_result_t br_handle_frame(struct sk_buff **pskb)
>  		return RX_HANDLER_CONSUMED;
>  
>  	memset(skb->cb, 0, sizeof(struct br_input_skb_cb));
> +	skb->l2_miss = 0;
>  
>  	p = br_port_get_rcu(skb->dev);
>  	if (p->flags & BR_VLAN_TUNNEL)

Overall looks good, only this part is a bit worrisome and needs some additional
investigation because now we'll unconditionally dirty a cache line for every
packet that is forwarded. Could you please check the effect with perf?

Thanks,
 Nik


^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [PATCH net-next 2/5] net/sched: flower: Allow matching on layer 2 miss
  2023-05-18 11:33 ` [PATCH net-next 2/5] net/sched: flower: Allow matching on layer 2 miss Ido Schimmel
@ 2023-05-19 11:27   ` Simon Horman
  0 siblings, 0 replies; 20+ messages in thread
From: Simon Horman @ 2023-05-19 11:27 UTC (permalink / raw
  To: Ido Schimmel
  Cc: netdev, bridge, davem, kuba, pabeni, edumazet, razor, roopa,
	taras.chornyi, saeedm, leon, petrm, vladimir.oltean,
	claudiu.manoil, alexandre.belloni, UNGLinuxDriver, jhs,
	xiyou.wangcong, jiri, taspelund

On Thu, May 18, 2023 at 02:33:25PM +0300, Ido Schimmel wrote:
> Add the 'TCA_FLOWER_L2_MISS' netlink attribute that allows user space to
> match on packets that encountered a layer 2 miss. The miss indication is
> set as metadata in the skb by the bridge driver upon FDB/MDB lookup
> miss.
> 
> Signed-off-by: Ido Schimmel <idosch@nvidia.com>

Hi Ido,

these changes look good to me.

If I was doing this I would split the flow_dissector and cls_flower
changes into separate patches: they are different, though related, things.
But I don't feel strongly about that.

Reviewed-by: Simon Horman <simon.horman@corigine.com>

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [PATCH net-next 3/5] flow_offload: Reject matching on layer 2 miss
  2023-05-18 11:33 ` [PATCH net-next 3/5] flow_offload: Reject " Ido Schimmel
@ 2023-05-19 11:33   ` Simon Horman
  2023-05-19 14:10     ` Ido Schimmel
  0 siblings, 1 reply; 20+ messages in thread
From: Simon Horman @ 2023-05-19 11:33 UTC (permalink / raw
  To: Ido Schimmel
  Cc: netdev, bridge, davem, kuba, pabeni, edumazet, razor, roopa,
	taras.chornyi, saeedm, leon, petrm, vladimir.oltean,
	claudiu.manoil, alexandre.belloni, UNGLinuxDriver, jhs,
	xiyou.wangcong, jiri, taspelund

On Thu, May 18, 2023 at 02:33:26PM +0300, Ido Schimmel wrote:
> Adjust drivers that support the 'FLOW_DISSECTOR_KEY_META' key to reject
> filters that try to match on the newly added layer 2 miss option. Add an
> extack message to clearly communicate the failure reason to user space.

Hi Ido,

FLOW_DISSECTOR_KEY_META is also used in the following.
Perhaps they don't need updating. But perhaps it is worth mentioning why.

 * drivers/net/ethernet/mediatek/mtk_ppe_offload.c
 * drivers/net/ethernet/netronome/nfp/flower/conntrack.c

> 
> Example:
> 
>  # tc filter add dev swp1 egress pref 1 proto all flower skip_sw l2_miss true action drop
>  Error: mlxsw_spectrum: Can't match on "l2_miss".
>  We have an error talking to the kernel
> 
> Acked-by: Elad Nachman <enachman@marvell.com>
> Signed-off-by: Ido Schimmel <idosch@nvidia.com>

...

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [PATCH net-next 4/5] mlxsw: spectrum_flower: Add ability to match on layer 2 miss
  2023-05-18 11:33 ` [PATCH net-next 4/5] mlxsw: spectrum_flower: Add ability to match " Ido Schimmel
@ 2023-05-19 11:35   ` Simon Horman
  0 siblings, 0 replies; 20+ messages in thread
From: Simon Horman @ 2023-05-19 11:35 UTC (permalink / raw
  To: Ido Schimmel
  Cc: netdev, bridge, davem, kuba, pabeni, edumazet, razor, roopa,
	taras.chornyi, saeedm, leon, petrm, vladimir.oltean,
	claudiu.manoil, alexandre.belloni, UNGLinuxDriver, jhs,
	xiyou.wangcong, jiri, taspelund

On Thu, May 18, 2023 at 02:33:27PM +0300, Ido Schimmel wrote:
> Add the 'dmac_type' key element to supported key blocks and make use of
> it to match on layer 2 miss.
> 
> This is a two bits key in hardware with the following values:
> 00b - Known multicast.
> 01b - Broadcast.
> 10b - Known unicast.
> 11b - Unknown unicast or unregistered multicast.
> 
> When 'l2_miss' is set we need to match on 01b or 11b. Therefore, only
> match on the LSB in order to differentiate between both cases of
> 'l2_miss'.
> 
> Tested on Spectrum-{1,2,3,4}.
> 
> Signed-off-by: Ido Schimmel <idosch@nvidia.com>

Reviewed-by: Simon Horman <simon.horman@corigine.com>


^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [PATCH net-next 1/5] skbuff: bridge: Add layer 2 miss indication
  2023-05-18 16:08   ` Nikolay Aleksandrov
@ 2023-05-19 13:51     ` Ido Schimmel
  2023-05-19 21:52       ` Jakub Kicinski
  2023-05-21  7:34       ` Nikolay Aleksandrov
  0 siblings, 2 replies; 20+ messages in thread
From: Ido Schimmel @ 2023-05-19 13:51 UTC (permalink / raw
  To: Nikolay Aleksandrov
  Cc: netdev, bridge, davem, kuba, pabeni, edumazet, roopa,
	taras.chornyi, saeedm, leon, petrm, vladimir.oltean,
	claudiu.manoil, alexandre.belloni, UNGLinuxDriver, jhs,
	xiyou.wangcong, jiri, taspelund

On Thu, May 18, 2023 at 07:08:47PM +0300, Nikolay Aleksandrov wrote:
> On 18/05/2023 14:33, Ido Schimmel wrote:
> > diff --git a/net/bridge/br_input.c b/net/bridge/br_input.c
> > index fc17b9fd93e6..d8ab5890cbe6 100644
> > --- a/net/bridge/br_input.c
> > +++ b/net/bridge/br_input.c
> > @@ -334,6 +334,7 @@ static rx_handler_result_t br_handle_frame(struct sk_buff **pskb)
> >  		return RX_HANDLER_CONSUMED;
> >  
> >  	memset(skb->cb, 0, sizeof(struct br_input_skb_cb));
> > +	skb->l2_miss = 0;
> >  
> >  	p = br_port_get_rcu(skb->dev);
> >  	if (p->flags & BR_VLAN_TUNNEL)
> 
> Overall looks good, only this part is a bit worrisome and needs some additional
> investigation because now we'll unconditionally dirty a cache line for every
> packet that is forwarded. Could you please check the effect with perf?

To eliminate it I tried the approach we discussed yesterday:

First, add the miss indication to the bridge's control block which is
zeroed for every skb entering the bridge:

diff --git a/net/bridge/br_private.h b/net/bridge/br_private.h
index 2119729ded2b..bd5c18286a40 100644
--- a/net/bridge/br_private.h
+++ b/net/bridge/br_private.h
@@ -581,6 +581,7 @@ struct br_input_skb_cb {
 #endif
        u8 proxyarp_replied:1;
        u8 src_port_isolated:1;
+       u8 miss:1;      /* FDB or MDB lookup miss */
 #ifdef CONFIG_BRIDGE_VLAN_FILTERING
        u8 vlan_filtered:1;
 #endif

And set this bit upon misses instead of skb->l2_miss:

@@ -203,6 +205,8 @@ void br_flood(struct net_bridge *br, struct sk_buff *skb,
        struct net_bridge_port *prev = NULL;
        struct net_bridge_port *p;
 
+       BR_INPUT_SKB_CB(skb)->miss = 1;
+
        list_for_each_entry_rcu(p, &br->port_list, list) {
                /* Do not flood unicast traffic to ports that turn it off, nor
                 * other traffic if flood off, except for traffic we originate
@@ -295,6 +299,7 @@ void br_multicast_flood(struct net_bridge_mdb_entry *mdst,
                        allow_mode_include = false;
        } else {
                p = NULL;
+               BR_INPUT_SKB_CB(skb)->miss = 1;
        }
 
        while (p || rp) {

Then copy it to skb->l2_miss at the very end where the cache line
containing this field is already written to:

diff --git a/net/bridge/br_forward.c b/net/bridge/br_forward.c
index 84d6dd5e5b1a..89f65564e338 100644
--- a/net/bridge/br_forward.c
+++ b/net/bridge/br_forward.c
@@ -50,6 +50,8 @@ int br_dev_queue_push_xmit(struct net *net, struct sock *sk, struct sk_buff *skb
 
        br_switchdev_frame_set_offload_fwd_mark(skb);
 
+       skb->l2_miss = BR_INPUT_SKB_CB(skb)->miss;
+
        dev_queue_xmit(skb);
 
        return 0;

Also for locally received packets:

diff --git a/net/bridge/br_input.c b/net/bridge/br_input.c
index fc17b9fd93e6..274e55455b15 100644
--- a/net/bridge/br_input.c
+++ b/net/bridge/br_input.c
@@ -46,6 +46,8 @@ static int br_pass_frame_up(struct sk_buff *skb)
         */
        br_switchdev_frame_unmark(skb);
 
+       skb->l2_miss = BR_INPUT_SKB_CB(skb)->miss;
+
        /* Bridge is just like any other port.  Make sure the
         * packet is allowed except in promisc mode when someone
         * may be running packet capture.

Ran these changes through the selftest and it seems to work.

WDYT?

^ permalink raw reply related	[flat|nested] 20+ messages in thread

* Re: [PATCH net-next 3/5] flow_offload: Reject matching on layer 2 miss
  2023-05-19 11:33   ` Simon Horman
@ 2023-05-19 14:10     ` Ido Schimmel
  2023-05-19 14:15       ` Simon Horman
  0 siblings, 1 reply; 20+ messages in thread
From: Ido Schimmel @ 2023-05-19 14:10 UTC (permalink / raw
  To: Simon Horman, nbd
  Cc: netdev, bridge, davem, kuba, pabeni, edumazet, razor, roopa,
	taras.chornyi, saeedm, leon, petrm, vladimir.oltean,
	claudiu.manoil, alexandre.belloni, UNGLinuxDriver, jhs,
	xiyou.wangcong, jiri, taspelund

On Fri, May 19, 2023 at 01:33:00PM +0200, Simon Horman wrote:
> On Thu, May 18, 2023 at 02:33:26PM +0300, Ido Schimmel wrote:
> > Adjust drivers that support the 'FLOW_DISSECTOR_KEY_META' key to reject
> > filters that try to match on the newly added layer 2 miss option. Add an
> > extack message to clearly communicate the failure reason to user space.
> 
> Hi Ido,
> 
> FLOW_DISSECTOR_KEY_META is also used in the following.
> Perhaps they don't need updating. But perhaps it is worth mentioning why.

Good point.

> 
>  * drivers/net/ethernet/mediatek/mtk_ppe_offload.c

This driver does not seem to do anything with this key. TBH, I'm not
sure what is the purpose of this hunk:

if (flow_rule_match_key(rule, FLOW_DISSECTOR_KEY_META)) {
	struct flow_match_meta match;

	flow_rule_match_meta(rule, &match);
} else {
	return -EOPNOTSUPP;
}

Felix, can you comment?
Original patch:
https://lore.kernel.org/netdev/20230518113328.1952135-4-idosch@nvidia.com/

>  * drivers/net/ethernet/netronome/nfp/flower/conntrack.c

My understanding is that this code is for netfilter offload (not tc)
which does not use the new bit. Adding a check would therefore be dead
code. I don't mind adding a check or mentioning in the commit message
why I didn't add one. Let me know what you prefer.

Thanks

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [PATCH net-next 3/5] flow_offload: Reject matching on layer 2 miss
  2023-05-19 14:10     ` Ido Schimmel
@ 2023-05-19 14:15       ` Simon Horman
  0 siblings, 0 replies; 20+ messages in thread
From: Simon Horman @ 2023-05-19 14:15 UTC (permalink / raw
  To: Ido Schimmel
  Cc: nbd, netdev, bridge, davem, kuba, pabeni, edumazet, razor, roopa,
	taras.chornyi, saeedm, leon, petrm, vladimir.oltean,
	claudiu.manoil, alexandre.belloni, UNGLinuxDriver, jhs,
	xiyou.wangcong, jiri, taspelund

On Fri, May 19, 2023 at 05:10:58PM +0300, Ido Schimmel wrote:
> On Fri, May 19, 2023 at 01:33:00PM +0200, Simon Horman wrote:
> > On Thu, May 18, 2023 at 02:33:26PM +0300, Ido Schimmel wrote:
> > > Adjust drivers that support the 'FLOW_DISSECTOR_KEY_META' key to reject
> > > filters that try to match on the newly added layer 2 miss option. Add an
> > > extack message to clearly communicate the failure reason to user space.
> > 
> > Hi Ido,
> > 
> > FLOW_DISSECTOR_KEY_META is also used in the following.
> > Perhaps they don't need updating. But perhaps it is worth mentioning why.
> 
> Good point.
> 
> > 
> >  * drivers/net/ethernet/mediatek/mtk_ppe_offload.c
> 
> This driver does not seem to do anything with this key. TBH, I'm not
> sure what is the purpose of this hunk:
> 
> if (flow_rule_match_key(rule, FLOW_DISSECTOR_KEY_META)) {
> 	struct flow_match_meta match;
> 
> 	flow_rule_match_meta(rule, &match);
> } else {
> 	return -EOPNOTSUPP;
> }
> 
> Felix, can you comment?
> Original patch:
> https://lore.kernel.org/netdev/20230518113328.1952135-4-idosch@nvidia.com/

FWIIW, I agree with your analysis here.

> >  * drivers/net/ethernet/netronome/nfp/flower/conntrack.c
> 
> My understanding is that this code is for netfilter offload (not tc)
> which does not use the new bit. Adding a check would therefore be dead
> code. I don't mind adding a check or mentioning in the commit message
> why I didn't add one. Let me know what you prefer.

Let's go with a comment in the commit message.
Thanks.

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [PATCH net-next 1/5] skbuff: bridge: Add layer 2 miss indication
  2023-05-19 13:51     ` Ido Schimmel
@ 2023-05-19 21:52       ` Jakub Kicinski
  2023-05-23  8:10         ` Ido Schimmel
  2023-05-21  7:34       ` Nikolay Aleksandrov
  1 sibling, 1 reply; 20+ messages in thread
From: Jakub Kicinski @ 2023-05-19 21:52 UTC (permalink / raw
  To: Ido Schimmel
  Cc: Nikolay Aleksandrov, netdev, bridge, davem, pabeni, edumazet,
	roopa, taras.chornyi, saeedm, leon, petrm, vladimir.oltean,
	claudiu.manoil, alexandre.belloni, UNGLinuxDriver, jhs,
	xiyou.wangcong, jiri, taspelund

On Fri, 19 May 2023 16:51:48 +0300 Ido Schimmel wrote:
> diff --git a/net/bridge/br_input.c b/net/bridge/br_input.c
> index fc17b9fd93e6..274e55455b15 100644
> --- a/net/bridge/br_input.c
> +++ b/net/bridge/br_input.c
> @@ -46,6 +46,8 @@ static int br_pass_frame_up(struct sk_buff *skb)
>          */
>         br_switchdev_frame_unmark(skb);
>  
> +       skb->l2_miss = BR_INPUT_SKB_CB(skb)->miss;
> +
>         /* Bridge is just like any other port.  Make sure the
>          * packet is allowed except in promisc mode when someone
>          * may be running packet capture.
> 
> Ran these changes through the selftest and it seems to work.

Can we possibly put the new field at the end of the CB and then have TC
look at it in the CB? We already do a bit of such CB juggling in strp
(first member of struct sk_skb_cb).

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [PATCH net-next 1/5] skbuff: bridge: Add layer 2 miss indication
  2023-05-19 13:51     ` Ido Schimmel
  2023-05-19 21:52       ` Jakub Kicinski
@ 2023-05-21  7:34       ` Nikolay Aleksandrov
  1 sibling, 0 replies; 20+ messages in thread
From: Nikolay Aleksandrov @ 2023-05-21  7:34 UTC (permalink / raw
  To: Ido Schimmel
  Cc: netdev, bridge, davem, kuba, pabeni, edumazet, roopa,
	taras.chornyi, saeedm, leon, petrm, vladimir.oltean,
	claudiu.manoil, alexandre.belloni, UNGLinuxDriver, jhs,
	xiyou.wangcong, jiri, taspelund

On 19/05/2023 16:51, Ido Schimmel wrote:
> On Thu, May 18, 2023 at 07:08:47PM +0300, Nikolay Aleksandrov wrote:
>> On 18/05/2023 14:33, Ido Schimmel wrote:
>>> diff --git a/net/bridge/br_input.c b/net/bridge/br_input.c
>>> index fc17b9fd93e6..d8ab5890cbe6 100644
>>> --- a/net/bridge/br_input.c
>>> +++ b/net/bridge/br_input.c
>>> @@ -334,6 +334,7 @@ static rx_handler_result_t br_handle_frame(struct sk_buff **pskb)
>>>  		return RX_HANDLER_CONSUMED;
>>>  
>>>  	memset(skb->cb, 0, sizeof(struct br_input_skb_cb));
>>> +	skb->l2_miss = 0;
>>>  
>>>  	p = br_port_get_rcu(skb->dev);
>>>  	if (p->flags & BR_VLAN_TUNNEL)
>>
>> Overall looks good, only this part is a bit worrisome and needs some additional
>> investigation because now we'll unconditionally dirty a cache line for every
>> packet that is forwarded. Could you please check the effect with perf?
> 
> To eliminate it I tried the approach we discussed yesterday:
> 
> First, add the miss indication to the bridge's control block which is
> zeroed for every skb entering the bridge:
> 
> diff --git a/net/bridge/br_private.h b/net/bridge/br_private.h
> index 2119729ded2b..bd5c18286a40 100644
> --- a/net/bridge/br_private.h
> +++ b/net/bridge/br_private.h
> @@ -581,6 +581,7 @@ struct br_input_skb_cb {
>  #endif
>         u8 proxyarp_replied:1;
>         u8 src_port_isolated:1;
> +       u8 miss:1;      /* FDB or MDB lookup miss */
>  #ifdef CONFIG_BRIDGE_VLAN_FILTERING
>         u8 vlan_filtered:1;
>  #endif
> 
> And set this bit upon misses instead of skb->l2_miss:
> 
> @@ -203,6 +205,8 @@ void br_flood(struct net_bridge *br, struct sk_buff *skb,
>         struct net_bridge_port *prev = NULL;
>         struct net_bridge_port *p;
>  
> +       BR_INPUT_SKB_CB(skb)->miss = 1;
> +
>         list_for_each_entry_rcu(p, &br->port_list, list) {
>                 /* Do not flood unicast traffic to ports that turn it off, nor
>                  * other traffic if flood off, except for traffic we originate
> @@ -295,6 +299,7 @@ void br_multicast_flood(struct net_bridge_mdb_entry *mdst,
>                         allow_mode_include = false;
>         } else {
>                 p = NULL;
> +               BR_INPUT_SKB_CB(skb)->miss = 1;
>         }
>  
>         while (p || rp) {
> 
> Then copy it to skb->l2_miss at the very end where the cache line
> containing this field is already written to:
> 
> diff --git a/net/bridge/br_forward.c b/net/bridge/br_forward.c
> index 84d6dd5e5b1a..89f65564e338 100644
> --- a/net/bridge/br_forward.c
> +++ b/net/bridge/br_forward.c
> @@ -50,6 +50,8 @@ int br_dev_queue_push_xmit(struct net *net, struct sock *sk, struct sk_buff *skb
>  
>         br_switchdev_frame_set_offload_fwd_mark(skb);
>  
> +       skb->l2_miss = BR_INPUT_SKB_CB(skb)->miss;
> +
>         dev_queue_xmit(skb);
>  
>         return 0;
> 
> Also for locally received packets:
> 
> diff --git a/net/bridge/br_input.c b/net/bridge/br_input.c
> index fc17b9fd93e6..274e55455b15 100644
> --- a/net/bridge/br_input.c
> +++ b/net/bridge/br_input.c
> @@ -46,6 +46,8 @@ static int br_pass_frame_up(struct sk_buff *skb)
>          */
>         br_switchdev_frame_unmark(skb);
>  
> +       skb->l2_miss = BR_INPUT_SKB_CB(skb)->miss;
> +
>         /* Bridge is just like any other port.  Make sure the
>          * packet is allowed except in promisc mode when someone
>          * may be running packet capture.
> 
> Ran these changes through the selftest and it seems to work.
> 
> WDYT?

Looks good to me, this is what I had in mind wrt cache line dirtying.
The swdev mark already does it, so putting them together is nice. From
bridge POV this is good.

Thanks,
 Nik



^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [PATCH net-next 1/5] skbuff: bridge: Add layer 2 miss indication
  2023-05-19 21:52       ` Jakub Kicinski
@ 2023-05-23  8:10         ` Ido Schimmel
  2023-05-23  9:04           ` Paolo Abeni
                             ` (2 more replies)
  0 siblings, 3 replies; 20+ messages in thread
From: Ido Schimmel @ 2023-05-23  8:10 UTC (permalink / raw
  To: Jakub Kicinski, razor
  Cc: netdev, bridge, davem, pabeni, edumazet, roopa, taras.chornyi,
	saeedm, leon, petrm, vladimir.oltean, claudiu.manoil,
	alexandre.belloni, UNGLinuxDriver, jhs, xiyou.wangcong, jiri,
	taspelund

On Fri, May 19, 2023 at 02:52:18PM -0700, Jakub Kicinski wrote:
> On Fri, 19 May 2023 16:51:48 +0300 Ido Schimmel wrote:
> > diff --git a/net/bridge/br_input.c b/net/bridge/br_input.c
> > index fc17b9fd93e6..274e55455b15 100644
> > --- a/net/bridge/br_input.c
> > +++ b/net/bridge/br_input.c
> > @@ -46,6 +46,8 @@ static int br_pass_frame_up(struct sk_buff *skb)
> >          */
> >         br_switchdev_frame_unmark(skb);
> >  
> > +       skb->l2_miss = BR_INPUT_SKB_CB(skb)->miss;
> > +
> >         /* Bridge is just like any other port.  Make sure the
> >          * packet is allowed except in promisc mode when someone
> >          * may be running packet capture.
> > 
> > Ran these changes through the selftest and it seems to work.
> 
> Can we possibly put the new field at the end of the CB and then have TC
> look at it in the CB? We already do a bit of such CB juggling in strp
> (first member of struct sk_skb_cb).

Using the CB between different layers is very fragile and I would like
to avoid it. Note that the skb can pass various layers until hitting the
classifier, each of which can decide to memset() the CB.

Anyway, I think I have a better alternative. I added the 'l2_miss' bit
to the tc skb extension and adjusted the bridge to mark packets via this
extension. The entire thing is protected by the existing 'tc_skb_ext_tc'
static key, so overhead is kept to a minimum when feature is disabled.
Extended flower to enable / disable this key when filters that match on
'l2_miss' are added / removed.

bridge change to mark the packet:
https://github.com/idosch/linux/commit/3fab206492fcad9177f2340680f02ced1b9a0dec.patch

flow_dissector change to dissect the info from the extension:
https://github.com/idosch/linux/commit/1533c078b02586547817a4e63989a0db62aa5315.patch

flower change to enable / disable the key:
https://github.com/idosch/linux/commit/cf84b277511ec80fe565c41271abc6b2e2f629af.patch

Advantages compared to the previous approach are that we do not need a
new bit in the skb and that overhead is kept to a minimum when feature
is disabled. Disadvantage is that overhead is higher when feature is
enabled.

WDYT?

To be clear, merely asking for feedback on the general approach, not
code review.

Thanks

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [PATCH net-next 1/5] skbuff: bridge: Add layer 2 miss indication
  2023-05-23  8:10         ` Ido Schimmel
@ 2023-05-23  9:04           ` Paolo Abeni
  2023-05-23 11:34             ` Ido Schimmel
  2023-05-23 13:03           ` Nikolay Aleksandrov
  2023-05-23 20:29           ` Jakub Kicinski
  2 siblings, 1 reply; 20+ messages in thread
From: Paolo Abeni @ 2023-05-23  9:04 UTC (permalink / raw
  To: Ido Schimmel, Jakub Kicinski, razor
  Cc: netdev, bridge, davem, edumazet, roopa, taras.chornyi, saeedm,
	leon, petrm, vladimir.oltean, claudiu.manoil, alexandre.belloni,
	UNGLinuxDriver, jhs, xiyou.wangcong, jiri, taspelund

On Tue, 2023-05-23 at 11:10 +0300, Ido Schimmel wrote:
> On Fri, May 19, 2023 at 02:52:18PM -0700, Jakub Kicinski wrote:
> > On Fri, 19 May 2023 16:51:48 +0300 Ido Schimmel wrote:
> > > diff --git a/net/bridge/br_input.c b/net/bridge/br_input.c
> > > index fc17b9fd93e6..274e55455b15 100644
> > > --- a/net/bridge/br_input.c
> > > +++ b/net/bridge/br_input.c
> > > @@ -46,6 +46,8 @@ static int br_pass_frame_up(struct sk_buff *skb)
> > >          */
> > >         br_switchdev_frame_unmark(skb);
> > >  
> > > +       skb->l2_miss = BR_INPUT_SKB_CB(skb)->miss;
> > > +
> > >         /* Bridge is just like any other port.  Make sure the
> > >          * packet is allowed except in promisc mode when someone
> > >          * may be running packet capture.
> > > 
> > > Ran these changes through the selftest and it seems to work.
> > 
> > Can we possibly put the new field at the end of the CB and then have TC
> > look at it in the CB? We already do a bit of such CB juggling in strp
> > (first member of struct sk_skb_cb).
> 
> Using the CB between different layers is very fragile and I would like
> to avoid it. Note that the skb can pass various layers until hitting the
> classifier, each of which can decide to memset() the CB.
> 
> Anyway, I think I have a better alternative. I added the 'l2_miss' bit
> to the tc skb extension and adjusted the bridge to mark packets via this
> extension. The entire thing is protected by the existing 'tc_skb_ext_tc'
> static key, so overhead is kept to a minimum when feature is disabled.
> Extended flower to enable / disable this key when filters that match on
> 'l2_miss' are added / removed.
> 
> bridge change to mark the packet:
> https://github.com/idosch/linux/commit/3fab206492fcad9177f2340680f02ced1b9a0dec.patch
> 
> flow_dissector change to dissect the info from the extension:
> https://github.com/idosch/linux/commit/1533c078b02586547817a4e63989a0db62aa5315.patch
> 
> flower change to enable / disable the key:
> https://github.com/idosch/linux/commit/cf84b277511ec80fe565c41271abc6b2e2f629af.patch
> 
> Advantages compared to the previous approach are that we do not need a
> new bit in the skb and that overhead is kept to a minimum when feature
> is disabled. Disadvantage is that overhead is higher when feature is
> enabled.
> 
> WDYT?

Looks good to me.

I think you would only need to set/add the extension when l2_miss is
true, right? (with no extension l2 hit is assumed). That will avoid
unneeded overhead for br_dev_xmit().

All the others involved paths look like slow(er) one, so the occasional
skb extension overhead should not be a problem.

Cheers,

Paolo


^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [PATCH net-next 1/5] skbuff: bridge: Add layer 2 miss indication
  2023-05-23  9:04           ` Paolo Abeni
@ 2023-05-23 11:34             ` Ido Schimmel
  0 siblings, 0 replies; 20+ messages in thread
From: Ido Schimmel @ 2023-05-23 11:34 UTC (permalink / raw
  To: Paolo Abeni
  Cc: Jakub Kicinski, razor, netdev, bridge, davem, edumazet, roopa,
	taras.chornyi, saeedm, leon, petrm, vladimir.oltean,
	claudiu.manoil, alexandre.belloni, UNGLinuxDriver, jhs,
	xiyou.wangcong, jiri, taspelund

On Tue, May 23, 2023 at 11:04:27AM +0200, Paolo Abeni wrote:
> I think you would only need to set/add the extension when l2_miss is
> true, right? (with no extension l2 hit is assumed). That will avoid
> unneeded overhead for br_dev_xmit().

If an extension is already present (possibly with 'l2_miss' being 'true'
because the packet was flooded by a different bridge earlier in the
pipeline), then we need to clear it when the packet enters the bridge.
IMO, this is quite unlikely. However, if the extension is missing, then
you are correct and there is no point in allocating one.

IOW, I can squash the following diff to the first patch:

diff --git a/net/bridge/br_private.h b/net/bridge/br_private.h
index fb6525553a8a..32115d76a6de 100644
--- a/net/bridge/br_private.h
+++ b/net/bridge/br_private.h
@@ -764,10 +764,16 @@ static inline void br_tc_skb_miss_set(struct sk_buff *skb, bool miss)
                return;
 
        ext = skb_ext_find(skb, TC_SKB_EXT);
-       if (!ext)
-               ext = tc_skb_ext_alloc(skb);
-       if (ext)
+       if (ext) {
                ext->l2_miss = miss;
+               return;
+       }
+       if (!miss)
+               return;
+       ext = tc_skb_ext_alloc(skb);
+       if (!ext)
+               return;
+       ext->l2_miss = miss;
 }
 #else
 static inline void br_tc_skb_miss_set(struct sk_buff *skb, bool miss)

Thanks

^ permalink raw reply related	[flat|nested] 20+ messages in thread

* Re: [PATCH net-next 1/5] skbuff: bridge: Add layer 2 miss indication
  2023-05-23  8:10         ` Ido Schimmel
  2023-05-23  9:04           ` Paolo Abeni
@ 2023-05-23 13:03           ` Nikolay Aleksandrov
  2023-05-23 20:29           ` Jakub Kicinski
  2 siblings, 0 replies; 20+ messages in thread
From: Nikolay Aleksandrov @ 2023-05-23 13:03 UTC (permalink / raw
  To: Ido Schimmel, Jakub Kicinski
  Cc: netdev, bridge, davem, pabeni, edumazet, roopa, taras.chornyi,
	saeedm, leon, petrm, vladimir.oltean, claudiu.manoil,
	alexandre.belloni, UNGLinuxDriver, jhs, xiyou.wangcong, jiri,
	taspelund

On 23/05/2023 11:10, Ido Schimmel wrote:
> On Fri, May 19, 2023 at 02:52:18PM -0700, Jakub Kicinski wrote:
>> On Fri, 19 May 2023 16:51:48 +0300 Ido Schimmel wrote:
>>> diff --git a/net/bridge/br_input.c b/net/bridge/br_input.c
>>> index fc17b9fd93e6..274e55455b15 100644
>>> --- a/net/bridge/br_input.c
>>> +++ b/net/bridge/br_input.c
>>> @@ -46,6 +46,8 @@ static int br_pass_frame_up(struct sk_buff *skb)
>>>          */
>>>         br_switchdev_frame_unmark(skb);
>>>  
>>> +       skb->l2_miss = BR_INPUT_SKB_CB(skb)->miss;
>>> +
>>>         /* Bridge is just like any other port.  Make sure the
>>>          * packet is allowed except in promisc mode when someone
>>>          * may be running packet capture.
>>>
>>> Ran these changes through the selftest and it seems to work.
>>
>> Can we possibly put the new field at the end of the CB and then have TC
>> look at it in the CB? We already do a bit of such CB juggling in strp
>> (first member of struct sk_skb_cb).
> 
> Using the CB between different layers is very fragile and I would like
> to avoid it. Note that the skb can pass various layers until hitting the
> classifier, each of which can decide to memset() the CB.
> 
> Anyway, I think I have a better alternative. I added the 'l2_miss' bit
> to the tc skb extension and adjusted the bridge to mark packets via this
> extension. The entire thing is protected by the existing 'tc_skb_ext_tc'
> static key, so overhead is kept to a minimum when feature is disabled.
> Extended flower to enable / disable this key when filters that match on
> 'l2_miss' are added / removed.
> 
> bridge change to mark the packet:
> https://github.com/idosch/linux/commit/3fab206492fcad9177f2340680f02ced1b9a0dec.patch
> 
> flow_dissector change to dissect the info from the extension:
> https://github.com/idosch/linux/commit/1533c078b02586547817a4e63989a0db62aa5315.patch
> 
> flower change to enable / disable the key:
> https://github.com/idosch/linux/commit/cf84b277511ec80fe565c41271abc6b2e2f629af.patch
> 
> Advantages compared to the previous approach are that we do not need a
> new bit in the skb and that overhead is kept to a minimum when feature
> is disabled. Disadvantage is that overhead is higher when feature is
> enabled.
> 
> WDYT?
> 
> To be clear, merely asking for feedback on the general approach, not
> code review.
> 
> Thanks

TBH, I like this approach much better for obvious reasons. :)
Thanks for working on it.

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [PATCH net-next 1/5] skbuff: bridge: Add layer 2 miss indication
  2023-05-23  8:10         ` Ido Schimmel
  2023-05-23  9:04           ` Paolo Abeni
  2023-05-23 13:03           ` Nikolay Aleksandrov
@ 2023-05-23 20:29           ` Jakub Kicinski
  2 siblings, 0 replies; 20+ messages in thread
From: Jakub Kicinski @ 2023-05-23 20:29 UTC (permalink / raw
  To: Ido Schimmel
  Cc: razor, netdev, bridge, davem, pabeni, edumazet, roopa,
	taras.chornyi, saeedm, leon, petrm, vladimir.oltean,
	claudiu.manoil, alexandre.belloni, UNGLinuxDriver, jhs,
	xiyou.wangcong, jiri, taspelund

On Tue, 23 May 2023 11:10:38 +0300 Ido Schimmel wrote:
> > Can we possibly put the new field at the end of the CB and then have TC
> > look at it in the CB? We already do a bit of such CB juggling in strp
> > (first member of struct sk_skb_cb).  
> 
> Using the CB between different layers is very fragile and I would like
> to avoid it. Note that the skb can pass various layers until hitting the
> classifier, each of which can decide to memset() the CB.
> 
> Anyway, I think I have a better alternative. I added the 'l2_miss' bit
> to the tc skb extension and adjusted the bridge to mark packets via this
> extension. The entire thing is protected by the existing 'tc_skb_ext_tc'
> static key, so overhead is kept to a minimum when feature is disabled.
> Extended flower to enable / disable this key when filters that match on
> 'l2_miss' are added / removed.
> 
> bridge change to mark the packet:
> https://github.com/idosch/linux/commit/3fab206492fcad9177f2340680f02ced1b9a0dec.patch
> 
> flow_dissector change to dissect the info from the extension:
> https://github.com/idosch/linux/commit/1533c078b02586547817a4e63989a0db62aa5315.patch
> 
> flower change to enable / disable the key:
> https://github.com/idosch/linux/commit/cf84b277511ec80fe565c41271abc6b2e2f629af.patch
> 
> Advantages compared to the previous approach are that we do not need a
> new bit in the skb and that overhead is kept to a minimum when feature
> is disabled. Disadvantage is that overhead is higher when feature is
> enabled.
> 
> WDYT?

Sounds good, yup. Thanks!

^ permalink raw reply	[flat|nested] 20+ messages in thread

end of thread, other threads:[~2023-05-23 20:29 UTC | newest]

Thread overview: 20+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2023-05-18 11:33 [PATCH net-next 0/5] Add layer 2 miss indication and filtering Ido Schimmel
2023-05-18 11:33 ` [PATCH net-next 1/5] skbuff: bridge: Add layer 2 miss indication Ido Schimmel
2023-05-18 16:08   ` Nikolay Aleksandrov
2023-05-19 13:51     ` Ido Schimmel
2023-05-19 21:52       ` Jakub Kicinski
2023-05-23  8:10         ` Ido Schimmel
2023-05-23  9:04           ` Paolo Abeni
2023-05-23 11:34             ` Ido Schimmel
2023-05-23 13:03           ` Nikolay Aleksandrov
2023-05-23 20:29           ` Jakub Kicinski
2023-05-21  7:34       ` Nikolay Aleksandrov
2023-05-18 11:33 ` [PATCH net-next 2/5] net/sched: flower: Allow matching on layer 2 miss Ido Schimmel
2023-05-19 11:27   ` Simon Horman
2023-05-18 11:33 ` [PATCH net-next 3/5] flow_offload: Reject " Ido Schimmel
2023-05-19 11:33   ` Simon Horman
2023-05-19 14:10     ` Ido Schimmel
2023-05-19 14:15       ` Simon Horman
2023-05-18 11:33 ` [PATCH net-next 4/5] mlxsw: spectrum_flower: Add ability to match " Ido Schimmel
2023-05-19 11:35   ` Simon Horman
2023-05-18 11:33 ` [PATCH net-next 5/5] selftests: forwarding: Add layer 2 miss test cases Ido Schimmel

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for read-only IMAP folder(s) and NNTP newsgroup(s).