There is a use case where one would like to enable multicast snooping
on a bridge but disable multicast flooding on all bridge ports so that
registered multicast traffic will only reach the intended recipients and
unregistered multicast traffic will be dropped. However, with existing
bridge ports' mcast_flood flag implementation, it doesn't work as desired.
This patchset aims to make multicast snooping work even when multicast
flooding is disabled on the bridge ports, without changing the semantic of
the mcast_flood flag too much. Patches 1 to 4 attempt to address this issue.
Also, in a network where more than one multicast snooping capable bridges
are interconnected without multicast routers being present, multicast
snooping fails if:
1. The source is not directly attached to the Querier
2. The listener is beyond the mrouter port of the bridge where the
source is directly attached
3. A hardware offloading switch is involved
When all of the conditions are met, the listener will not receive any
multicast packets from the source. Patches 5 to 10 attempt to address this
issue. Specifically, patches 5 to 8 set up the infrastructure, patch 9
handles unregistered multicast packets forwarding, and patch 10 handles
registered multicast packets forwarding to the mrouter port.
The patches were developed against 5.15, and forward-ported to 6.8.
Tests were done on a Pi 4B + Marvell 6393X Eval board with a single
switch chip with no VLAN.
V1 -> V2:
- Moved the bulk of the change from the bridge to the mv88e6xxx driver.
- Added more patches (specifically 3 and 4) to workaround some more
issues with multicast flooding being disabled.
v1 here:
https://patchwork.kernel.org/project/netdevbpf/cover/[email protected]/
Joseph Huang (10):
net: bridge: Flood Queries even when mc flood is disabled
net: bridge: Always multicast_flood Reports
net: bridge: Always flood local subnet mc packets
net: dsa: mv88e6xxx: Add all hosts mc addr to ATU
net: dsa: Add support for PORT_MROUTER attribute
net: dsa: mv88e6xxx: Track soft bridge objects
net: dsa: mv88e6xxx: Track bridge mdb objects
net: dsa: mv88e6xxx: Convert MAB to use bit flags
net: dsa: mv88e6xxx: Enable mc flood for mrouter port
net: dsa: mv88e6xxx: Offload mrouter port
drivers/net/dsa/mv88e6xxx/Kconfig | 12 +
drivers/net/dsa/mv88e6xxx/chip.c | 439 +++++++++++++++++++++++-
drivers/net/dsa/mv88e6xxx/chip.h | 16 +-
drivers/net/dsa/mv88e6xxx/global1_atu.c | 3 +-
include/net/dsa.h | 2 +
net/bridge/br_device.c | 5 +-
net/bridge/br_forward.c | 3 +-
net/bridge/br_input.c | 5 +-
net/bridge/br_multicast.c | 21 +-
net/bridge/br_private.h | 6 +
net/dsa/port.c | 11 +
net/dsa/port.h | 2 +
net/dsa/user.c | 6 +
13 files changed, 502 insertions(+), 29 deletions(-)
--
2.17.1
Add local network all hosts multicast address (224.0.0.1) and link-local
all nodes multicast address (ff02::1) to the ATU so that IGMP/MLD
Queries can be forwarded even when multicast flooding is disabled on a
port.
Signed-off-by: Joseph Huang <[email protected]>
---
drivers/net/dsa/mv88e6xxx/Kconfig | 12 ++++++++
drivers/net/dsa/mv88e6xxx/chip.c | 47 +++++++++++++++++++++++++++++++
2 files changed, 59 insertions(+)
diff --git a/drivers/net/dsa/mv88e6xxx/Kconfig b/drivers/net/dsa/mv88e6xxx/Kconfig
index e3181d5471df..ef7798bf50d7 100644
--- a/drivers/net/dsa/mv88e6xxx/Kconfig
+++ b/drivers/net/dsa/mv88e6xxx/Kconfig
@@ -17,3 +17,15 @@ config NET_DSA_MV88E6XXX_PTP
help
Say Y to enable PTP hardware timestamping on Marvell 88E6xxx switch
chips that support it.
+
+config NET_DSA_MV88E6XXX_ALWAYS_FLOOD_LOCAL_ALL_HOSTS_ADDRESS
+ bool "Always flood local all hosts multicast packets"
+ depends on NET_DSA_MV88E6XXX
+ help
+ When set to Y, always flood multicast packets destined for
+ 224.0.0.1 (Local Network All Hosts multicast address) and
+ ff02::1 (Link-Local All Nodes multicast address), even when
+ multicast flooding is disabled for a port. This is so that
+ multicast snooping can continue to function even when
+ multicast flooding is disabled.
+ If in doubt, say N.
diff --git a/drivers/net/dsa/mv88e6xxx/chip.c b/drivers/net/dsa/mv88e6xxx/chip.c
index 9ed1821184ec..fddcb596c421 100644
--- a/drivers/net/dsa/mv88e6xxx/chip.c
+++ b/drivers/net/dsa/mv88e6xxx/chip.c
@@ -2488,6 +2488,41 @@ static int mv88e6xxx_broadcast_setup(struct mv88e6xxx_chip *chip, u16 vid)
return 0;
}
+#if IS_ENABLED(CONFIG_NET_DSA_MV88E6XXX_ALWAYS_FLOOD_LOCAL_ALL_HOSTS_ADDRESS)
+static int mv88e6xxx_port_add_multicast(struct mv88e6xxx_chip *chip, int port,
+ u16 vid)
+{
+ u8 state = MV88E6XXX_G1_ATU_DATA_STATE_MC_STATIC;
+ const char multicast[][ETH_ALEN] = {
+ { 0x01, 0x00, 0x5e, 0x00, 0x00, 0x01 },
+ { 0x33, 0x33, 0x00, 0x00, 0x00, 0x01 }
+ };
+ int i, err;
+
+ for (i = 0; i < ARRAY_SIZE(multicast); i++) {
+ err = mv88e6xxx_port_db_load_purge(chip, port, multicast[i], vid, state);
+ if (err)
+ return err;
+ }
+
+ return 0;
+}
+
+static int mv88e6xxx_multicast_setup(struct mv88e6xxx_chip *chip, u16 vid)
+{
+ int port;
+ int err;
+
+ for (port = 0; port < mv88e6xxx_num_ports(chip); port++) {
+ err = mv88e6xxx_port_add_multicast(chip, port, vid);
+ if (err)
+ return err;
+ }
+
+ return 0;
+}
+#endif
+
struct mv88e6xxx_port_broadcast_sync_ctx {
int port;
bool flood;
@@ -2572,6 +2607,12 @@ static int mv88e6xxx_port_vlan_join(struct mv88e6xxx_chip *chip, int port,
err = mv88e6xxx_broadcast_setup(chip, vlan.vid);
if (err)
return err;
+
+#if IS_ENABLED(CONFIG_NET_DSA_MV88E6XXX_ALWAYS_FLOOD_LOCAL_ALL_HOSTS_ADDRESS)
+ err = mv88e6xxx_multicast_setup(chip, vlan.vid);
+ if (err)
+ return err;
+#endif
} else if (vlan.member[port] != member) {
vlan.member[port] = member;
@@ -3930,6 +3971,12 @@ static int mv88e6xxx_setup(struct dsa_switch *ds)
if (err)
goto unlock;
+#if IS_ENABLED(CONFIG_NET_DSA_MV88E6XXX_ALWAYS_FLOOD_LOCAL_ALL_HOSTS_ADDRESS)
+ err = mv88e6xxx_multicast_setup(chip, 0);
+ if (err)
+ goto unlock;
+#endif
+
err = mv88e6xxx_pot_setup(chip);
if (err)
goto unlock;
--
2.17.1
Add support for delivering SWITCHDEV_ATTR_ID_PORT_MROUTER event to DSA
subsystem. This is essentially 08cc83c ("net: dsa: add support for
BRIDGE_MROUTER attribute") repurposed for PORT_MROUTER.
Signed-off-by: Joseph Huang <[email protected]>
---
include/net/dsa.h | 2 ++
net/dsa/port.c | 11 +++++++++++
net/dsa/port.h | 2 ++
net/dsa/user.c | 6 ++++++
4 files changed, 21 insertions(+)
diff --git a/include/net/dsa.h b/include/net/dsa.h
index 7c0da9effe4e..5dc5838caa9c 100644
--- a/include/net/dsa.h
+++ b/include/net/dsa.h
@@ -1037,6 +1037,8 @@ struct dsa_switch_ops {
int (*port_bridge_flags)(struct dsa_switch *ds, int port,
struct switchdev_brport_flags flags,
struct netlink_ext_ack *extack);
+ int (*port_mrouter)(struct dsa_switch *ds, int port, bool mrouter,
+ struct netlink_ext_ack *extack);
void (*port_set_host_flood)(struct dsa_switch *ds, int port,
bool uc, bool mc);
diff --git a/net/dsa/port.c b/net/dsa/port.c
index c42dac87671b..3f689cb713aa 100644
--- a/net/dsa/port.c
+++ b/net/dsa/port.c
@@ -948,6 +948,17 @@ int dsa_port_bridge_flags(struct dsa_port *dp,
return 0;
}
+int dsa_port_mrouter(struct dsa_port *dp, bool mrouter,
+ struct netlink_ext_ack *extack)
+{
+ struct dsa_switch *ds = dp->ds;
+
+ if (!ds->ops->port_mrouter)
+ return -EOPNOTSUPP;
+
+ return ds->ops->port_mrouter(ds, dp->index, mrouter, extack);
+}
+
void dsa_port_set_host_flood(struct dsa_port *dp, bool uc, bool mc)
{
struct dsa_switch *ds = dp->ds;
diff --git a/net/dsa/port.h b/net/dsa/port.h
index 6bc3291573c0..85102e1659ae 100644
--- a/net/dsa/port.h
+++ b/net/dsa/port.h
@@ -81,6 +81,8 @@ int dsa_port_pre_bridge_flags(const struct dsa_port *dp,
int dsa_port_bridge_flags(struct dsa_port *dp,
struct switchdev_brport_flags flags,
struct netlink_ext_ack *extack);
+int dsa_port_mrouter(struct dsa_port *dp, bool mrouter,
+ struct netlink_ext_ack *extack);
int dsa_port_vlan_add(struct dsa_port *dp,
const struct switchdev_obj_port_vlan *vlan,
struct netlink_ext_ack *extack);
diff --git a/net/dsa/user.c b/net/dsa/user.c
index 16d395bb1a1f..f69c4df143f7 100644
--- a/net/dsa/user.c
+++ b/net/dsa/user.c
@@ -647,6 +647,12 @@ static int dsa_user_port_attr_set(struct net_device *dev, const void *ctx,
ret = dsa_port_bridge_flags(dp, attr->u.brport_flags, extack);
break;
+ case SWITCHDEV_ATTR_ID_PORT_MROUTER:
+ if (!dsa_port_offloads_bridge_port(dp, attr->orig_dev))
+ return -EOPNOTSUPP;
+
+ ret = dsa_port_mrouter(dp, attr->u.mrouter, extack);
+ break;
case SWITCHDEV_ATTR_ID_VLAN_MSTI:
if (!dsa_port_offloads_bridge_dev(dp, attr->orig_dev))
return -EOPNOTSUPP;
--
2.17.1
Convert MAB (MacAuth Bypass) flag to use bitmap.
A new port flag will be added with a subsequent patch. Convert the
'mab' member to a bit in a bitmask to save space.
Signed-off-by: Joseph Huang <[email protected]>
---
drivers/net/dsa/mv88e6xxx/chip.h | 12 ++++++++----
drivers/net/dsa/mv88e6xxx/global1_atu.c | 3 ++-
2 files changed, 10 insertions(+), 5 deletions(-)
diff --git a/drivers/net/dsa/mv88e6xxx/chip.h b/drivers/net/dsa/mv88e6xxx/chip.h
index a32e4564eb3d..205f6777c2ac 100644
--- a/drivers/net/dsa/mv88e6xxx/chip.h
+++ b/drivers/net/dsa/mv88e6xxx/chip.h
@@ -272,6 +272,9 @@ struct mv88e6xxx_vlan {
bool valid;
};
+/* MacAuth Bypass Control Flag */
+#define MV88E6XXX_PORT_FLAG_MAB BIT(0)
+
struct mv88e6xxx_port {
struct mv88e6xxx_chip *chip;
int port;
@@ -288,9 +291,7 @@ struct mv88e6xxx_port {
bool mirror_egress;
struct devlink_region *region;
void *pcs_private;
-
- /* MacAuth Bypass control flag */
- bool mab;
+ unsigned int flags;
};
enum mv88e6xxx_region_id {
@@ -802,7 +803,10 @@ static inline bool mv88e6xxx_is_invalid_port(struct mv88e6xxx_chip *chip, int po
static inline void mv88e6xxx_port_set_mab(struct mv88e6xxx_chip *chip,
int port, bool mab)
{
- chip->ports[port].mab = mab;
+ if (mab)
+ chip->ports[port].flags |= MV88E6XXX_PORT_FLAG_MAB;
+ else
+ chip->ports[port].flags &= ~MV88E6XXX_PORT_FLAG_MAB;
}
int mv88e6xxx_read(struct mv88e6xxx_chip *chip, int addr, int reg, u16 *val);
diff --git a/drivers/net/dsa/mv88e6xxx/global1_atu.c b/drivers/net/dsa/mv88e6xxx/global1_atu.c
index ce3b3690c3c0..06d0c526e33d 100644
--- a/drivers/net/dsa/mv88e6xxx/global1_atu.c
+++ b/drivers/net/dsa/mv88e6xxx/global1_atu.c
@@ -445,7 +445,8 @@ static irqreturn_t mv88e6xxx_g1_atu_prob_irq_thread_fn(int irq, void *dev_id)
fid);
chip->ports[spid].atu_miss_violation++;
- if (fid != MV88E6XXX_FID_STANDALONE && chip->ports[spid].mab) {
+ if (fid != MV88E6XXX_FID_STANDALONE &&
+ !!(chip->ports[spid].flags & MV88E6XXX_PORT_FLAG_MAB)) {
err = mv88e6xxx_handle_miss_violation(chip, spid,
&entry, fid);
if (err)
--
2.17.1
Modify the forwarding path so that IGMPv1/v2/MLDv1 Reports are always
flooded by br_multicast_flood(), regardless of the check done
by br_multicast_querier_exists().
This patch fixes the problems where shortly after a system boots up,
the first couple of Reports are not handled properly in that:
1) The Report from the Host is being flooded (via br_flood) to all
bridge ports, and
2) If the mrouter port's multicast flooding is disabled, the Reports
received from other hosts will not be forwarded to the Querier.
Fixes: b00589af3b04 ("bridge: disable snooping if there is no querier")
Signed-off-by: Joseph Huang <[email protected]>
---
net/bridge/br_device.c | 5 +++--
net/bridge/br_input.c | 5 +++--
net/bridge/br_multicast.c | 5 +++++
net/bridge/br_private.h | 3 +++
4 files changed, 14 insertions(+), 4 deletions(-)
diff --git a/net/bridge/br_device.c b/net/bridge/br_device.c
index c366ccc8b3db..5c09b9dd61dc 100644
--- a/net/bridge/br_device.c
+++ b/net/bridge/br_device.c
@@ -93,8 +93,9 @@ netdev_tx_t br_dev_xmit(struct sk_buff *skb, struct net_device *dev)
}
mdst = br_mdb_entry_skb_get(brmctx, skb, vid);
- if ((mdst || BR_INPUT_SKB_CB_MROUTERS_ONLY(skb)) &&
- br_multicast_querier_exists(brmctx, eth_hdr(skb), mdst))
+ if (((mdst || BR_INPUT_SKB_CB_MROUTERS_ONLY(skb)) &&
+ br_multicast_querier_exists(brmctx, eth_hdr(skb), mdst)) ||
+ BR_INPUT_SKB_CB_FORCE_MC_FLOOD(skb))
br_multicast_flood(mdst, skb, brmctx, false, true);
else
br_flood(br, skb, BR_PKT_MULTICAST, false, true, vid);
diff --git a/net/bridge/br_input.c b/net/bridge/br_input.c
index f21097e73482..8e614ab20966 100644
--- a/net/bridge/br_input.c
+++ b/net/bridge/br_input.c
@@ -176,8 +176,9 @@ int br_handle_frame_finish(struct net *net, struct sock *sk, struct sk_buff *skb
switch (pkt_type) {
case BR_PKT_MULTICAST:
mdst = br_mdb_entry_skb_get(brmctx, skb, vid);
- if ((mdst || BR_INPUT_SKB_CB_MROUTERS_ONLY(skb)) &&
- br_multicast_querier_exists(brmctx, eth_hdr(skb), mdst)) {
+ if (((mdst || BR_INPUT_SKB_CB_MROUTERS_ONLY(skb)) &&
+ br_multicast_querier_exists(brmctx, eth_hdr(skb), mdst)) ||
+ BR_INPUT_SKB_CB_FORCE_MC_FLOOD(skb)) {
if ((mdst && mdst->host_joined) ||
br_multicast_is_router(brmctx, skb)) {
local_rcv = true;
diff --git a/net/bridge/br_multicast.c b/net/bridge/br_multicast.c
index 42d900549227..8531f0e03f41 100644
--- a/net/bridge/br_multicast.c
+++ b/net/bridge/br_multicast.c
@@ -3844,6 +3844,7 @@ static int br_multicast_ipv4_rcv(struct net_bridge_mcast *brmctx,
case IGMP_HOST_MEMBERSHIP_REPORT:
case IGMPV2_HOST_MEMBERSHIP_REPORT:
BR_INPUT_SKB_CB(skb)->mrouters_only = 1;
+ BR_INPUT_SKB_CB(skb)->force_mc_flood = 1;
err = br_ip4_multicast_add_group(brmctx, pmctx, ih->group, vid,
src, true);
break;
@@ -3855,6 +3856,7 @@ static int br_multicast_ipv4_rcv(struct net_bridge_mcast *brmctx,
br_ip4_multicast_query(brmctx, pmctx, skb, vid);
break;
case IGMP_HOST_LEAVE_MESSAGE:
+ BR_INPUT_SKB_CB(skb)->force_mc_flood = 1;
br_ip4_multicast_leave_group(brmctx, pmctx, ih->group, vid, src);
break;
}
@@ -3910,6 +3912,7 @@ static int br_multicast_ipv6_rcv(struct net_bridge_mcast *brmctx,
case ICMPV6_MGM_REPORT:
src = eth_hdr(skb)->h_source;
BR_INPUT_SKB_CB(skb)->mrouters_only = 1;
+ BR_INPUT_SKB_CB(skb)->force_mc_flood = 1;
err = br_ip6_multicast_add_group(brmctx, pmctx, &mld->mld_mca,
vid, src, true);
break;
@@ -3922,6 +3925,7 @@ static int br_multicast_ipv6_rcv(struct net_bridge_mcast *brmctx,
break;
case ICMPV6_MGM_REDUCTION:
src = eth_hdr(skb)->h_source;
+ BR_INPUT_SKB_CB(skb)->force_mc_flood = 1;
br_ip6_multicast_leave_group(brmctx, pmctx, &mld->mld_mca, vid,
src);
break;
@@ -3944,6 +3948,7 @@ int br_multicast_rcv(struct net_bridge_mcast **brmctx,
BR_INPUT_SKB_CB(skb)->igmp = 0;
BR_INPUT_SKB_CB(skb)->mrouters_only = 0;
BR_INPUT_SKB_CB(skb)->force_flood = 0;
+ BR_INPUT_SKB_CB(skb)->force_mc_flood = 0;
if (!br_opt_get((*brmctx)->br, BROPT_MULTICAST_ENABLED))
return 0;
diff --git a/net/bridge/br_private.h b/net/bridge/br_private.h
index c28e0cd0855c..d72a632a1ad2 100644
--- a/net/bridge/br_private.h
+++ b/net/bridge/br_private.h
@@ -587,6 +587,7 @@ struct br_input_skb_cb {
u8 igmp;
u8 mrouters_only:1;
u8 force_flood:1;
+ u8 force_mc_flood:1;
#endif
u8 proxyarp_replied:1;
u8 src_port_isolated:1;
@@ -622,9 +623,11 @@ struct br_input_skb_cb {
#ifdef CONFIG_BRIDGE_IGMP_SNOOPING
# define BR_INPUT_SKB_CB_MROUTERS_ONLY(__skb) (BR_INPUT_SKB_CB(__skb)->mrouters_only)
# define BR_INPUT_SKB_CB_FORCE_FLOOD(__skb) (BR_INPUT_SKB_CB(__skb)->force_flood)
+# define BR_INPUT_SKB_CB_FORCE_MC_FLOOD(__skb) (BR_INPUT_SKB_CB(__skb)->force_mc_flood)
#else
# define BR_INPUT_SKB_CB_MROUTERS_ONLY(__skb) (0)
# define BR_INPUT_SKB_CB_FORCE_FLOOD(__skb) (0)
+# define BR_INPUT_SKB_CB_FORCE_MC_FLOOD(__skb) (0)
#endif
#define br_printk(level, br, format, args...) \
--
2.17.1
Keep track of bridge mdb objects in the driver.
Similar to the previous patch, since the driver doesn't get explicit
notifications about mdb group creation or destruction, just create
the mdb group when the first port joins the group via
mv88e6xxx_port_mdb_add(), and destroys the group when the last port left
the group via mv88e6xxx_port_mdb_del().
Use the group's L2 address together with the VLAN ID as the key to the list.
Port membership is again stored in a bitmask.
Signed-off-by: Joseph Huang <[email protected]>
---
drivers/net/dsa/mv88e6xxx/chip.c | 117 +++++++++++++++++++++++++++++++
1 file changed, 117 insertions(+)
diff --git a/drivers/net/dsa/mv88e6xxx/chip.c b/drivers/net/dsa/mv88e6xxx/chip.c
index f66ddde484dc..32a613c965b1 100644
--- a/drivers/net/dsa/mv88e6xxx/chip.c
+++ b/drivers/net/dsa/mv88e6xxx/chip.c
@@ -47,6 +47,14 @@ struct mv88e6xxx_bridge {
struct list_head head;
struct net_device *br_dev;
u16 ports;
+ struct list_head br_mdb_list;
+};
+
+struct mv88e6xxx_br_mdb {
+ struct list_head head;
+ unsigned char addr[ETH_ALEN];
+ u16 vid;
+ u16 portvec;
};
static void assert_reg_lock(struct mv88e6xxx_chip *chip)
@@ -2974,6 +2982,7 @@ mv88e6xxx_bridge_create(struct mv88e6xxx_chip *chip, struct net_device *br_dev)
return ERR_PTR(-ENOMEM);
mv_bridge->br_dev = br_dev;
+ INIT_LIST_HEAD(&mv_bridge->br_mdb_list);
list_add(&mv_bridge->head, &chip->bridge_list);
return mv_bridge;
@@ -2984,6 +2993,7 @@ static void mv88e6xxx_bridge_destroy(struct mv88e6xxx_bridge *mv_bridge)
list_del(&mv_bridge->head);
WARN_ON(mv_bridge->ports);
+ WARN_ON(!list_empty(&mv_bridge->br_mdb_list));
kfree(mv_bridge);
}
@@ -6583,16 +6593,101 @@ static int mv88e6xxx_change_tag_protocol(struct dsa_switch *ds,
return err;
}
+static struct mv88e6xxx_br_mdb *
+mv88e6xxx_br_mdb_create(struct mv88e6xxx_bridge *mv_bridge,
+ const struct switchdev_obj_port_mdb *mdb)
+{
+ struct mv88e6xxx_br_mdb *mv_br_mdb;
+
+ mv_br_mdb = kzalloc(sizeof(*mv_br_mdb), GFP_KERNEL);
+ if (!mv_br_mdb)
+ return ERR_PTR(-ENOMEM);
+
+ ether_addr_copy(mv_br_mdb->addr, mdb->addr);
+ mv_br_mdb->vid = mdb->vid;
+ list_add(&mv_br_mdb->head, &mv_bridge->br_mdb_list);
+
+ return mv_br_mdb;
+}
+
+static void mv88e6xxx_br_mdb_destroy(struct mv88e6xxx_br_mdb *mv_br_mdb)
+{
+ list_del(&mv_br_mdb->head);
+
+ WARN_ON(mv_br_mdb->portvec);
+ kfree(mv_br_mdb);
+}
+
+static struct mv88e6xxx_br_mdb *
+mv88e6xxx_br_mdb_find(struct mv88e6xxx_bridge *mv_bridge,
+ const struct switchdev_obj_port_mdb *mdb)
+{
+ struct mv88e6xxx_br_mdb *mv_br_mdb;
+
+ list_for_each_entry(mv_br_mdb, &mv_bridge->br_mdb_list, head)
+ if (ether_addr_equal(mv_br_mdb->addr, mdb->addr) &&
+ mv_br_mdb->vid == mdb->vid)
+ return mv_br_mdb;
+
+ return NULL;
+}
+
+static struct mv88e6xxx_br_mdb *
+mv88e6xxx_br_mdb_get(struct mv88e6xxx_bridge *mv_bridge,
+ const struct switchdev_obj_port_mdb *mdb)
+{
+ struct mv88e6xxx_br_mdb *mv_br_mdb;
+
+ mv_br_mdb = mv88e6xxx_br_mdb_find(mv_bridge, mdb);
+ if (!mv_br_mdb)
+ mv_br_mdb = mv88e6xxx_br_mdb_create(mv_bridge, mdb);
+
+ return mv_br_mdb;
+}
+
+static void mv88e6xxx_br_mdb_put(struct mv88e6xxx_br_mdb *mv_br_mdb)
+{
+ if (!mv_br_mdb->portvec)
+ mv88e6xxx_br_mdb_destroy(mv_br_mdb);
+}
+
static int mv88e6xxx_port_mdb_add(struct dsa_switch *ds, int port,
const struct switchdev_obj_port_mdb *mdb,
struct dsa_db db)
{
struct mv88e6xxx_chip *chip = ds->priv;
+ struct mv88e6xxx_bridge *mv_bridge;
+ struct mv88e6xxx_br_mdb *mv_br_mdb;
+ struct net_device *orig_dev;
+ struct net_device *br_dev;
int err;
+ orig_dev = mdb->obj.orig_dev;
+ br_dev = netdev_master_upper_dev_get(orig_dev);
+ if (!br_dev)
+ br_dev = orig_dev;
+
+ mv_bridge = mv88e6xxx_bridge_by_dev(chip, br_dev);
+ if (!mv_bridge)
+ return -EINVAL;
+
+ mv_br_mdb = mv88e6xxx_br_mdb_get(mv_bridge, mdb);
+ if (IS_ERR(mv_br_mdb))
+ return PTR_ERR(mv_br_mdb);
+
+ if (mv_br_mdb->portvec & BIT(port))
+ return -EEXIST;
+
mv88e6xxx_reg_lock(chip);
err = mv88e6xxx_port_db_load_purge(chip, port, mdb->addr, mdb->vid,
MV88E6XXX_G1_ATU_DATA_STATE_MC_STATIC);
+
+ if (err)
+ goto out;
+
+ mv_br_mdb->portvec |= BIT(port);
+
+out:
mv88e6xxx_reg_unlock(chip);
return err;
@@ -6603,10 +6698,32 @@ static int mv88e6xxx_port_mdb_del(struct dsa_switch *ds, int port,
struct dsa_db db)
{
struct mv88e6xxx_chip *chip = ds->priv;
+ struct mv88e6xxx_bridge *mv_bridge;
+ struct mv88e6xxx_br_mdb *mv_br_mdb;
+ struct net_device *orig_dev;
+ struct net_device *br_dev;
int err;
+ orig_dev = mdb->obj.orig_dev;
+ br_dev = netdev_master_upper_dev_get(orig_dev);
+ if (!br_dev)
+ br_dev = orig_dev;
+
+ mv_bridge = mv88e6xxx_bridge_by_dev(chip, br_dev);
+ if (!mv_bridge)
+ return -EINVAL;
+
+ mv_br_mdb = mv88e6xxx_br_mdb_find(mv_bridge, mdb);
+ if (!mv_br_mdb)
+ return -ENOENT;
+
+ if (!(mv_br_mdb->portvec & BIT(port)))
+ return -ENOENT;
+
mv88e6xxx_reg_lock(chip);
err = mv88e6xxx_port_db_load_purge(chip, port, mdb->addr, mdb->vid, 0);
+ mv_br_mdb->portvec &= ~BIT(port);
+ mv88e6xxx_br_mdb_put(mv_br_mdb);
mv88e6xxx_reg_unlock(chip);
return err;
--
2.17.1
Keep track of soft bridge objects in the driver.
Since the driver doesn't get explicit notifications about bridge creation
or destruction, just create the bridge data structure when the first port
joins the bridge via mv88e6xxx_port_bridge_join(). Similarly, destroy
the bridge after the last port left the bridge via
mv88e6xxx_port_bridge_leave().
Use the bridge's net_device pointer as the key to the list. Port
information is stored in a bitmask.
Signed-off-by: Joseph Huang <[email protected]>
---
drivers/net/dsa/mv88e6xxx/chip.c | 85 ++++++++++++++++++++++++++++++++
drivers/net/dsa/mv88e6xxx/chip.h | 3 ++
2 files changed, 88 insertions(+)
diff --git a/drivers/net/dsa/mv88e6xxx/chip.c b/drivers/net/dsa/mv88e6xxx/chip.c
index fddcb596c421..f66ddde484dc 100644
--- a/drivers/net/dsa/mv88e6xxx/chip.c
+++ b/drivers/net/dsa/mv88e6xxx/chip.c
@@ -43,6 +43,12 @@
#include "serdes.h"
#include "smi.h"
+struct mv88e6xxx_bridge {
+ struct list_head head;
+ struct net_device *br_dev;
+ u16 ports;
+};
+
static void assert_reg_lock(struct mv88e6xxx_chip *chip)
{
if (unlikely(!mutex_is_locked(&chip->reg_lock))) {
@@ -2958,6 +2964,60 @@ static int mv88e6xxx_port_fdb_dump(struct dsa_switch *ds, int port,
return err;
}
+static struct mv88e6xxx_bridge *
+mv88e6xxx_bridge_create(struct mv88e6xxx_chip *chip, struct net_device *br_dev)
+{
+ struct mv88e6xxx_bridge *mv_bridge;
+
+ mv_bridge = kzalloc(sizeof(*mv_bridge), GFP_KERNEL);
+ if (!mv_bridge)
+ return ERR_PTR(-ENOMEM);
+
+ mv_bridge->br_dev = br_dev;
+ list_add(&mv_bridge->head, &chip->bridge_list);
+
+ return mv_bridge;
+}
+
+static void mv88e6xxx_bridge_destroy(struct mv88e6xxx_bridge *mv_bridge)
+{
+ list_del(&mv_bridge->head);
+
+ WARN_ON(mv_bridge->ports);
+ kfree(mv_bridge);
+}
+
+static
+struct mv88e6xxx_bridge *mv88e6xxx_bridge_by_dev(struct mv88e6xxx_chip *chip,
+ const struct net_device *br_dev)
+{
+ struct mv88e6xxx_bridge *mv_bridge;
+
+ list_for_each_entry(mv_bridge, &chip->bridge_list, head)
+ if (mv_bridge->br_dev == br_dev)
+ return mv_bridge;
+
+ return NULL;
+}
+
+static struct mv88e6xxx_bridge *
+mv88e6xxx_bridge_get(struct mv88e6xxx_chip *chip, struct net_device *br_dev)
+{
+ struct mv88e6xxx_bridge *mv_bridge;
+
+ mv_bridge = mv88e6xxx_bridge_by_dev(chip, br_dev);
+ if (!mv_bridge)
+ mv_bridge = mv88e6xxx_bridge_create(chip, br_dev);
+
+ return mv_bridge;
+}
+
+static void mv88e6xxx_bridge_put(struct mv88e6xxx_bridge *mv_bridge)
+{
+ if (!mv_bridge->ports)
+ mv88e6xxx_bridge_destroy(mv_bridge);
+}
+
static int mv88e6xxx_bridge_map(struct mv88e6xxx_chip *chip,
struct dsa_bridge bridge)
{
@@ -3009,8 +3069,16 @@ static int mv88e6xxx_port_bridge_join(struct dsa_switch *ds, int port,
struct netlink_ext_ack *extack)
{
struct mv88e6xxx_chip *chip = ds->priv;
+ struct mv88e6xxx_bridge *mv_bridge;
int err;
+ mv_bridge = mv88e6xxx_bridge_get(chip, bridge.dev);
+ if (IS_ERR(mv_bridge))
+ return PTR_ERR(mv_bridge);
+
+ if (mv_bridge->ports & BIT(port))
+ return -EEXIST;
+
mv88e6xxx_reg_lock(chip);
err = mv88e6xxx_bridge_map(chip, bridge);
@@ -3033,6 +3101,8 @@ static int mv88e6xxx_port_bridge_join(struct dsa_switch *ds, int port,
*tx_fwd_offload = true;
}
+ mv_bridge->ports |= BIT(port);
+
unlock:
mv88e6xxx_reg_unlock(chip);
@@ -3043,8 +3113,19 @@ static void mv88e6xxx_port_bridge_leave(struct dsa_switch *ds, int port,
struct dsa_bridge bridge)
{
struct mv88e6xxx_chip *chip = ds->priv;
+ struct mv88e6xxx_bridge *mv_bridge;
int err;
+ mv_bridge = mv88e6xxx_bridge_by_dev(chip, bridge.dev);
+ if (!mv_bridge)
+ return;
+
+ if (!(mv_bridge->ports & BIT(port)))
+ return;
+
+ mv_bridge->ports &= ~BIT(port);
+ mv88e6xxx_bridge_put(mv_bridge);
+
mv88e6xxx_reg_lock(chip);
if (bridge.tx_fwd_offload &&
@@ -6436,6 +6517,7 @@ static struct mv88e6xxx_chip *mv88e6xxx_alloc_chip(struct device *dev)
INIT_LIST_HEAD(&chip->mdios);
idr_init(&chip->policies);
INIT_LIST_HEAD(&chip->msts);
+ INIT_LIST_HEAD(&chip->bridge_list);
return chip;
}
@@ -7272,6 +7354,9 @@ static void mv88e6xxx_remove(struct mdio_device *mdiodev)
mv88e6xxx_g1_irq_free(chip);
else
mv88e6xxx_irq_poll_free(chip);
+
+ WARN_ON(!list_empty(&chip->bridge_list));
+
}
static void mv88e6xxx_shutdown(struct mdio_device *mdiodev)
diff --git a/drivers/net/dsa/mv88e6xxx/chip.h b/drivers/net/dsa/mv88e6xxx/chip.h
index 85eb293381a7..a32e4564eb3d 100644
--- a/drivers/net/dsa/mv88e6xxx/chip.h
+++ b/drivers/net/dsa/mv88e6xxx/chip.h
@@ -432,6 +432,9 @@ struct mv88e6xxx_chip {
/* Bridge MST to SID mappings */
struct list_head msts;
+
+ /* software bridges */
+ struct list_head bridge_list;
};
struct mv88e6xxx_bus_ops {
--
2.17.1
When a port turns into an mrouter port, enable multicast flooding
on that port even if multicast flooding is disabled by user config. This
is necessary so that in a distributed system, the multicast packets
can be forwarded to the Querier when the multicast source is attached
to a Non-Querier bridge.
Consider the following scenario:
+--------------------+
| |
| Snooping | +------------+
| Bridge 1 |----| Listener 1 |
| (Querier) | +------------+
| |
+--------------------+
|
|
+--------------------+
| | mrouter | |
+-----------+ | +---------+ |
| MC Source |----| Snooping |
+-----------| | Bridge 2 |
| (Non-Querier) |
+--------------------+
In this scenario, Listener 1 will never receive multicast traffic
from MC Source if multicast flooding is disabled on the mrouter port on
Snooping Bridge 2.
Signed-off-by: Joseph Huang <[email protected]>
---
drivers/net/dsa/mv88e6xxx/chip.c | 86 ++++++++++++++++++++++++++++++--
drivers/net/dsa/mv88e6xxx/chip.h | 1 +
2 files changed, 83 insertions(+), 4 deletions(-)
diff --git a/drivers/net/dsa/mv88e6xxx/chip.c b/drivers/net/dsa/mv88e6xxx/chip.c
index 32a613c965b1..9831aa370921 100644
--- a/drivers/net/dsa/mv88e6xxx/chip.c
+++ b/drivers/net/dsa/mv88e6xxx/chip.c
@@ -47,6 +47,7 @@ struct mv88e6xxx_bridge {
struct list_head head;
struct net_device *br_dev;
u16 ports;
+ u16 mrouter_ports;
struct list_head br_mdb_list;
};
@@ -2993,6 +2994,7 @@ static void mv88e6xxx_bridge_destroy(struct mv88e6xxx_bridge *mv_bridge)
list_del(&mv_bridge->head);
WARN_ON(mv_bridge->ports);
+ WARN_ON(mv_bridge->mrouter_ports);
WARN_ON(!list_empty(&mv_bridge->br_mdb_list));
kfree(mv_bridge);
}
@@ -3010,6 +3012,19 @@ struct mv88e6xxx_bridge *mv88e6xxx_bridge_by_dev(struct mv88e6xxx_chip *chip,
return NULL;
}
+static
+struct mv88e6xxx_bridge *mv88e6xxx_bridge_by_port(struct mv88e6xxx_chip *chip,
+ int port)
+{
+ struct mv88e6xxx_bridge *mv_bridge;
+
+ list_for_each_entry(mv_bridge, &chip->bridge_list, head)
+ if (mv_bridge->ports & BIT(port))
+ return mv_bridge;
+
+ return NULL;
+}
+
static struct mv88e6xxx_bridge *
mv88e6xxx_bridge_get(struct mv88e6xxx_chip *chip, struct net_device *br_dev)
{
@@ -6849,11 +6864,28 @@ static int mv88e6xxx_port_bridge_flags(struct dsa_switch *ds, int port,
if (flags.mask & BR_MCAST_FLOOD) {
bool multicast = !!(flags.val & BR_MCAST_FLOOD);
+ struct mv88e6xxx_bridge *mv_bridge;
+ struct mv88e6xxx_port *p;
+ bool mrouter;
- err = chip->info->ops->port_set_mcast_flood(chip, port,
- multicast);
- if (err)
- goto out;
+ mv_bridge = mv88e6xxx_bridge_by_port(chip, port);
+ if (!mv_bridge)
+ return -EINVAL;
+
+ p = &chip->ports[port];
+ mrouter = !!(mv_bridge->mrouter_ports & BIT(port));
+
+ if (!mrouter) {
+ err = chip->info->ops->port_set_mcast_flood(chip, port,
+ multicast);
+ if (err)
+ goto out;
+ }
+
+ if (multicast)
+ p->flags |= MV88E6XXX_PORT_FLAG_MC_FLOOD;
+ else
+ p->flags &= ~MV88E6XXX_PORT_FLAG_MC_FLOOD;
}
if (flags.mask & BR_BCAST_FLOOD) {
@@ -6883,6 +6915,51 @@ static int mv88e6xxx_port_bridge_flags(struct dsa_switch *ds, int port,
return err;
}
+static int mv88e6xxx_port_mrouter(struct dsa_switch *ds, int port,
+ bool mrouter,
+ struct netlink_ext_ack *extack)
+{
+ struct mv88e6xxx_chip *chip = ds->priv;
+ struct mv88e6xxx_bridge *mv_bridge;
+ struct mv88e6xxx_port *p;
+ bool old_mrouter;
+ bool mc_flood;
+ int err;
+
+ if (!chip->info->ops->port_set_mcast_flood)
+ return -EOPNOTSUPP;
+
+ mv_bridge = mv88e6xxx_bridge_by_port(chip, port);
+ if (!mv_bridge)
+ return -EINVAL;
+
+ old_mrouter = !!(mv_bridge->mrouter_ports & BIT(port));
+ if (mrouter == old_mrouter)
+ return 0;
+
+ p = &chip->ports[port];
+ mc_flood = !!(p->flags & MV88E6XXX_PORT_FLAG_MC_FLOOD);
+
+ mv88e6xxx_reg_lock(chip);
+
+ if (!mc_flood) {
+ err = chip->info->ops->port_set_mcast_flood(chip, port,
+ mrouter);
+ if (err)
+ goto out;
+ }
+
+ if (mrouter)
+ mv_bridge->mrouter_ports |= BIT(port);
+ else
+ mv_bridge->mrouter_ports &= ~BIT(port);
+
+out:
+ mv88e6xxx_reg_unlock(chip);
+
+ return err;
+}
+
static bool mv88e6xxx_lag_can_offload(struct dsa_switch *ds,
struct dsa_lag lag,
struct netdev_lag_upper_info *info,
@@ -7199,6 +7276,7 @@ static const struct dsa_switch_ops mv88e6xxx_switch_ops = {
.port_bridge_leave = mv88e6xxx_port_bridge_leave,
.port_pre_bridge_flags = mv88e6xxx_port_pre_bridge_flags,
.port_bridge_flags = mv88e6xxx_port_bridge_flags,
+ .port_mrouter = mv88e6xxx_port_mrouter,
.port_stp_state_set = mv88e6xxx_port_stp_state_set,
.port_mst_state_set = mv88e6xxx_port_mst_state_set,
.port_fast_age = mv88e6xxx_port_fast_age,
diff --git a/drivers/net/dsa/mv88e6xxx/chip.h b/drivers/net/dsa/mv88e6xxx/chip.h
index 205f6777c2ac..47e056dc7925 100644
--- a/drivers/net/dsa/mv88e6xxx/chip.h
+++ b/drivers/net/dsa/mv88e6xxx/chip.h
@@ -274,6 +274,7 @@ struct mv88e6xxx_vlan {
/* MacAuth Bypass Control Flag */
#define MV88E6XXX_PORT_FLAG_MAB BIT(0)
+#define MV88E6XXX_PORT_FLAG_MC_FLOOD BIT(1)
struct mv88e6xxx_port {
struct mv88e6xxx_chip *chip;
--
2.17.1
Offload mrouter port forwarding to mv88e6xxx.
Currently multicast snooping fails to forward traffic in some cases
where there are multiple hardware-offloading bridges involved.
Consider the following scenario:
+--------------------+
| |
| Snooping +--| +------------+
| Bridge 1 |P1|----| Listener 1 |
| (Querier) +--| +------------+
| |
+--------------------+
|
|
+--------------------+
| | mrouter | |
+-----------+ | +---------+ +--| +------------+
| MC Source |----| Snooping |P2|----| Listener 2 |
+-----------| | Bridge 2 +--| +------------+
| (Non-Querier) |
+--------------------+
In this scenario, Listener 2 is able to receive multicast traffic from
MC Source while Listener 1 is not. The reason is that on Snooping
Bridge 2, when the (soft) bridge attempts to forward a packet to
the mrouter port via br_multicast_flood(), the effort is blocked by
nbp_switchdev_allowed_egress(), since offload_fwd_mark indicates that
the packet should have been handled by the hardware already. Listener 2
would receive the packets without any problem since P2 is programmed unto
the switch chip as a member of the group; however, the mrouter port
would not since the mrouter port would normally not be a member of any
group, and thus will not be added to the address database on the switch
chip of Snooping Bridge 2.
Even if nbp_switchdev_allowed_egress() did not block the forwarding,
it would still be better to offload the forwarding to the switch
rather than letting the bridge handle the forwarding in software.
Before this patch, mv88e6xxx programming matches exactly with mdb:
+-----+
| mdb |
+-----+
|
+----------------------------------------------+
| | +--------------------------------+ |
| | | both in mdb and mv88e6xxx | |
| | | +------+ +------+ +------+ | |
| +--------|-| port |---| port |---| port | | |
| | +------+ +------+ +------+ | |
| mv88e6xxx +--------------------------------+ |
+----------------------------------------------+
After this patch, some entries will only exist in mv88e6xxx and not
in mdb:
+-----+
| mdb |
+-----+
|
+---------------------------------------------------------------------+
| | +--------------------------------++---------------------+ |
| | | both in mdb and mv88e6xxx || only in mv88e6xxx | |
| | | +------+ +------+ +------+ || +------+ +------+ | |
| +--------|-| port |---| port |---| port |-||-| mr |---| mr | | |
| | +------+ +------+ +------+ || +------+ +------+ | |
| mv88e6xxx +--------------------------------++---------------------+ |
+---------------------------------------------------------------------+
Signed-off-by: Joseph Huang <[email protected]>
---
drivers/net/dsa/mv88e6xxx/chip.c | 104 ++++++++++++++++++++++++++++---
1 file changed, 94 insertions(+), 10 deletions(-)
diff --git a/drivers/net/dsa/mv88e6xxx/chip.c b/drivers/net/dsa/mv88e6xxx/chip.c
index 9831aa370921..ab519e4d9e4f 100644
--- a/drivers/net/dsa/mv88e6xxx/chip.c
+++ b/drivers/net/dsa/mv88e6xxx/chip.c
@@ -2194,13 +2194,10 @@ mv88e6xxx_port_vlan_prepare(struct dsa_switch *ds, int port,
return err;
}
-static int mv88e6xxx_port_db_load_purge(struct mv88e6xxx_chip *chip, int port,
- const unsigned char *addr, u16 vid,
- u8 state)
+static int mv88e6xxx_fid_from_vid(struct mv88e6xxx_chip *chip, u16 vid,
+ u16 *fid)
{
- struct mv88e6xxx_atu_entry entry;
struct mv88e6xxx_vtu_entry vlan;
- u16 fid;
int err;
/* Ports have two private address databases: one for when the port is
@@ -2211,7 +2208,7 @@ static int mv88e6xxx_port_db_load_purge(struct mv88e6xxx_chip *chip, int port,
* VLAN ID into the port's database used for VLAN-unaware bridging.
*/
if (vid == 0) {
- fid = MV88E6XXX_FID_BRIDGED;
+ *fid = MV88E6XXX_FID_BRIDGED;
} else {
err = mv88e6xxx_vtu_get(chip, vid, &vlan);
if (err)
@@ -2221,9 +2218,24 @@ static int mv88e6xxx_port_db_load_purge(struct mv88e6xxx_chip *chip, int port,
if (!vlan.valid)
return -EOPNOTSUPP;
- fid = vlan.fid;
+ *fid = vlan.fid;
}
+ return 0;
+}
+
+static int mv88e6xxx_port_db_load_purge(struct mv88e6xxx_chip *chip, int port,
+ const unsigned char *addr, u16 vid,
+ u8 state)
+{
+ struct mv88e6xxx_atu_entry entry;
+ u16 fid;
+ int err;
+
+ err = mv88e6xxx_fid_from_vid(chip, vid, &fid);
+ if (err)
+ return err;
+
entry.state = 0;
ether_addr_copy(entry.mac, addr);
eth_addr_dec(entry.mac);
@@ -2255,6 +2267,30 @@ static int mv88e6xxx_port_db_load_purge(struct mv88e6xxx_chip *chip, int port,
return mv88e6xxx_g1_atu_loadpurge(chip, fid, &entry);
}
+static int mv88e6xxx_port_mdb_load_purge(struct mv88e6xxx_chip *chip,
+ int portvec,
+ const unsigned char *addr,
+ u16 vid)
+{
+ struct mv88e6xxx_atu_entry entry;
+ u16 fid;
+ int err;
+
+ err = mv88e6xxx_fid_from_vid(chip, vid, &fid);
+ if (err)
+ return err;
+
+ memset(&entry, 0, sizeof(entry));
+ ether_addr_copy(entry.mac, addr);
+ entry.portvec = portvec;
+ if (!entry.portvec)
+ entry.state = 0;
+ else
+ entry.state = MV88E6XXX_G1_ATU_DATA_STATE_MC_STATIC;
+
+ return mv88e6xxx_g1_atu_loadpurge(chip, fid, &entry);
+}
+
static int mv88e6xxx_policy_apply(struct mv88e6xxx_chip *chip, int port,
const struct mv88e6xxx_policy *policy)
{
@@ -6666,6 +6702,12 @@ static void mv88e6xxx_br_mdb_put(struct mv88e6xxx_br_mdb *mv_br_mdb)
mv88e6xxx_br_mdb_destroy(mv_br_mdb);
}
+static u16 mv88e6xxx_br_mdb_portvec_fixup(struct mv88e6xxx_bridge *mv_bridge,
+ u16 portvec)
+{
+ return portvec ? (portvec | mv_bridge->mrouter_ports) : portvec;
+}
+
static int mv88e6xxx_port_mdb_add(struct dsa_switch *ds, int port,
const struct switchdev_obj_port_mdb *mdb,
struct dsa_db db)
@@ -6675,6 +6717,7 @@ static int mv88e6xxx_port_mdb_add(struct dsa_switch *ds, int port,
struct mv88e6xxx_br_mdb *mv_br_mdb;
struct net_device *orig_dev;
struct net_device *br_dev;
+ u16 portvec;
int err;
orig_dev = mdb->obj.orig_dev;
@@ -6693,9 +6736,11 @@ static int mv88e6xxx_port_mdb_add(struct dsa_switch *ds, int port,
if (mv_br_mdb->portvec & BIT(port))
return -EEXIST;
+ portvec = mv_br_mdb->portvec | BIT(port);
+ portvec = mv88e6xxx_br_mdb_portvec_fixup(mv_bridge, portvec);
+
mv88e6xxx_reg_lock(chip);
- err = mv88e6xxx_port_db_load_purge(chip, port, mdb->addr, mdb->vid,
- MV88E6XXX_G1_ATU_DATA_STATE_MC_STATIC);
+ err = mv88e6xxx_port_mdb_load_purge(chip, portvec, mdb->addr, mdb->vid);
if (err)
goto out;
@@ -6717,6 +6762,7 @@ static int mv88e6xxx_port_mdb_del(struct dsa_switch *ds, int port,
struct mv88e6xxx_br_mdb *mv_br_mdb;
struct net_device *orig_dev;
struct net_device *br_dev;
+ u16 portvec;
int err;
orig_dev = mdb->obj.orig_dev;
@@ -6735,8 +6781,11 @@ static int mv88e6xxx_port_mdb_del(struct dsa_switch *ds, int port,
if (!(mv_br_mdb->portvec & BIT(port)))
return -ENOENT;
+ portvec = mv_br_mdb->portvec & ~BIT(port);
+ portvec = mv88e6xxx_br_mdb_portvec_fixup(mv_bridge, portvec);
+
mv88e6xxx_reg_lock(chip);
- err = mv88e6xxx_port_db_load_purge(chip, port, mdb->addr, mdb->vid, 0);
+ err = mv88e6xxx_port_mdb_load_purge(chip, portvec, mdb->addr, mdb->vid);
mv_br_mdb->portvec &= ~BIT(port);
mv88e6xxx_br_mdb_put(mv_br_mdb);
mv88e6xxx_reg_unlock(chip);
@@ -6921,9 +6970,11 @@ static int mv88e6xxx_port_mrouter(struct dsa_switch *ds, int port,
{
struct mv88e6xxx_chip *chip = ds->priv;
struct mv88e6xxx_bridge *mv_bridge;
+ struct mv88e6xxx_br_mdb *mv_br_mdb;
struct mv88e6xxx_port *p;
bool old_mrouter;
bool mc_flood;
+ u16 portvec;
int err;
if (!chip->info->ops->port_set_mcast_flood)
@@ -6949,11 +7000,44 @@ static int mv88e6xxx_port_mrouter(struct dsa_switch *ds, int port,
goto out;
}
+ list_for_each_entry(mv_br_mdb, &mv_bridge->br_mdb_list, head) {
+ portvec = mv_br_mdb->portvec;
+ portvec = mv88e6xxx_br_mdb_portvec_fixup(mv_bridge, portvec);
+
+ if (mrouter)
+ portvec |= BIT(port);
+ else
+ portvec &= ~BIT(port);
+
+ err = mv88e6xxx_port_mdb_load_purge(chip, portvec,
+ mv_br_mdb->addr,
+ mv_br_mdb->vid);
+ if (err)
+ goto out_port_mdb_load_purge;
+ }
+
if (mrouter)
mv_bridge->mrouter_ports |= BIT(port);
else
mv_bridge->mrouter_ports &= ~BIT(port);
+ mv88e6xxx_reg_unlock(chip);
+
+ return 0;
+
+out_port_mdb_load_purge:
+ list_for_each_entry_continue_reverse(mv_br_mdb,
+ &mv_bridge->br_mdb_list,
+ head) {
+ portvec = mv_br_mdb->portvec;
+ portvec = mv88e6xxx_br_mdb_portvec_fixup(mv_bridge, portvec);
+ mv88e6xxx_port_mdb_load_purge(chip, portvec,
+ mv_br_mdb->addr,
+ mv_br_mdb->vid);
+ }
+
+ if (!mc_flood)
+ chip->info->ops->port_set_mcast_flood(chip, port, old_mrouter);
out:
mv88e6xxx_reg_unlock(chip);
--
2.17.1
On Mon, Apr 01, 2024 at 08:11:06PM -0400, Joseph Huang wrote:
> Keep track of bridge mdb objects in the driver.
>
> Similar to the previous patch, since the driver doesn't get explicit
> notifications about mdb group creation or destruction, just create
> the mdb group when the first port joins the group via
> mv88e6xxx_port_mdb_add(), and destroys the group when the last port left
> the group via mv88e6xxx_port_mdb_del().
>
> Use the group's L2 address together with the VLAN ID as the key to the list.
> Port membership is again stored in a bitmask.
>
> Signed-off-by: Joseph Huang <[email protected]>
> ---
Can you comment on the feasibility/infeasibility of Tobias' proposal of:
"The bridge could just provide some MDB iterator to save us from having
to cache all the configured groups."?
https://lore.kernel.org/netdev/[email protected]/
What is done here will have to be scaled to many drivers - potentially
all existing DSA ones, as far as I'm aware.
On 4/2/24 03:10, Joseph Huang wrote:
> There is a use case where one would like to enable multicast snooping
> on a bridge but disable multicast flooding on all bridge ports so that
> registered multicast traffic will only reach the intended recipients and
> unregistered multicast traffic will be dropped. However, with existing
> bridge ports' mcast_flood flag implementation, it doesn't work as desired.
>
> This patchset aims to make multicast snooping work even when multicast
> flooding is disabled on the bridge ports, without changing the semantic of
> the mcast_flood flag too much. Patches 1 to 4 attempt to address this issue.
>
> Also, in a network where more than one multicast snooping capable bridges
> are interconnected without multicast routers being present, multicast
> snooping fails if:
>
> 1. The source is not directly attached to the Querier
> 2. The listener is beyond the mrouter port of the bridge where the
> source is directly attached
> 3. A hardware offloading switch is involved
>
> When all of the conditions are met, the listener will not receive any
> multicast packets from the source. Patches 5 to 10 attempt to address this
> issue. Specifically, patches 5 to 8 set up the infrastructure, patch 9
> handles unregistered multicast packets forwarding, and patch 10 handles
> registered multicast packets forwarding to the mrouter port.
>
> The patches were developed against 5.15, and forward-ported to 6.8.
> Tests were done on a Pi 4B + Marvell 6393X Eval board with a single
> switch chip with no VLAN.
>
> V1 -> V2:
> - Moved the bulk of the change from the bridge to the mv88e6xxx driver.
> - Added more patches (specifically 3 and 4) to workaround some more
> issues with multicast flooding being disabled.
>
> v1 here:
> https://patchwork.kernel.org/project/netdevbpf/cover/[email protected]/
>
For the bridge patches:
Nacked-by: Nikolay Aleksandrov <[email protected]>
You cannot break the multicast flood flag to add support for a custom
use-case. This is unacceptable. The current bridge behaviour is correct
your patch 02 doesn't fix anything, you should configure the bridge
properly to avoid all those problems, not break protocols.
Your special use case can easily be solved by a user-space helper or
eBPF and nftables. You can set the mcast flood flag and bypass the
bridge for these packets. I basically said the same in 2021, if this is
going to be in the bridge it should be hidden behind an option that is
default off. But in my opinion adding an option to solve such special
cases is undesirable, they can be easily solved with what's currently
available.
On Mon, Apr 01, 2024 at 08:10:59PM -0400, Joseph Huang wrote:
> There is a use case where one would like to enable multicast snooping
> on a bridge but disable multicast flooding on all bridge ports so that
> registered multicast traffic will only reach the intended recipients and
> unregistered multicast traffic will be dropped. However, with existing
> bridge ports' mcast_flood flag implementation, it doesn't work as desired.
>
> This patchset aims to make multicast snooping work even when multicast
> flooding is disabled on the bridge ports, without changing the semantic of
> the mcast_flood flag too much. Patches 1 to 4 attempt to address this issue.
>
> Also, in a network where more than one multicast snooping capable bridges
> are interconnected without multicast routers being present, multicast
> snooping fails if:
>
> 1. The source is not directly attached to the Querier
> 2. The listener is beyond the mrouter port of the bridge where the
> source is directly attached
> 3. A hardware offloading switch is involved
I've not studied the details here, but that last point makes me think
the offload driver is broken. There should not be any difference
between software bridging and hardware bridging. The whole idea is
that you take what Linux can do in software and accelerate it by
offloading to hardware. Doing acceleration should not change the
behaviour.
> The patches were developed against 5.15, and forward-ported to 6.8.
> Tests were done on a Pi 4B + Marvell 6393X Eval board with a single
> switch chip with no VLAN.
So what is the mv88e6xxx driver doing wrong?
Andrew
Hi Nikolai,
On Tue, Apr 02, 2024 at 12:28:38PM +0300, Nikolay Aleksandrov wrote:
> For the bridge patches:
> Nacked-by: Nikolay Aleksandrov <[email protected]>
>
> You cannot break the multicast flood flag to add support for a custom
> use-case. This is unacceptable. The current bridge behaviour is correct
> your patch 02 doesn't fix anything, you should configure the bridge
> properly to avoid all those problems, not break protocols.
>
> Your special use case can easily be solved by a user-space helper or
> eBPF and nftables. You can set the mcast flood flag and bypass the
> bridge for these packets. I basically said the same in 2021, if this is
> going to be in the bridge it should be hidden behind an option that is
> default off. But in my opinion adding an option to solve such special
> cases is undesirable, they can be easily solved with what's currently
> available.
I appreciate your time is limited, but could you please translate your
suggestion, and detail your proposed alternative a bit, for those of us
who are not very familiar with IP multicast snooping?
Bypass the bridge for which packets? General IGMP/MLD queries? Wouldn't
that break snooping? And then do what with the packets, forward them in
another software layer than the bridge?
I also don't quite understand the suggestion of turning on mcast flooding:
isn't Joseph saying that he wants it off for the unregistered multicast
data traffic?
On Mon, Apr 01, 2024 at 08:11:03PM -0400, Joseph Huang wrote:
> Add local network all hosts multicast address (224.0.0.1) and link-local
> all nodes multicast address (ff02::1) to the ATU so that IGMP/MLD
> Queries can be forwarded even when multicast flooding is disabled on a
> port.
>
> Signed-off-by: Joseph Huang <[email protected]>
> ---
> drivers/net/dsa/mv88e6xxx/Kconfig | 12 ++++++++
> drivers/net/dsa/mv88e6xxx/chip.c | 47 +++++++++++++++++++++++++++++++
> 2 files changed, 59 insertions(+)
>
> diff --git a/drivers/net/dsa/mv88e6xxx/Kconfig b/drivers/net/dsa/mv88e6xxx/Kconfig
> index e3181d5471df..ef7798bf50d7 100644
> --- a/drivers/net/dsa/mv88e6xxx/Kconfig
> +++ b/drivers/net/dsa/mv88e6xxx/Kconfig
> @@ -17,3 +17,15 @@ config NET_DSA_MV88E6XXX_PTP
> help
> Say Y to enable PTP hardware timestamping on Marvell 88E6xxx switch
> chips that support it.
> +
> +config NET_DSA_MV88E6XXX_ALWAYS_FLOOD_LOCAL_ALL_HOSTS_ADDRESS
> + bool "Always flood local all hosts multicast packets"
> + depends on NET_DSA_MV88E6XXX
> + help
> + When set to Y, always flood multicast packets destined for
> + 224.0.0.1 (Local Network All Hosts multicast address) and
> + ff02::1 (Link-Local All Nodes multicast address), even when
> + multicast flooding is disabled for a port. This is so that
> + multicast snooping can continue to function even when
> + multicast flooding is disabled.
> + If in doubt, say N.
In its current form, this will never be accepted. The mainline Linux
kernel is not a purpose-built project like a bootloader (and also like
some other uses of the Linux kernel). It gets picked up by
distributions, and the same kernel image is supposed to be used on
multiple platforms. So, customizing behavior at compilation time is not
a viable option. If there is any behavior that needs to be different on
some platforms than on others for some reason (this needs a
justification in its own right), it is handled through dedicated user
space API. When others say "hide custom behavior X behind an option", a
compile-time option is not what they mean.
As for the substance of the change itself, I am far from an authority
on multicast, I think you should try to push for something else, which
should be more palatable for everybody.
Some switches I've been working with have explicit flooding controls for:
- L2 multicast
- IPv4 control multicast (224.0.0.x)
- IPv4 data multicast
- IPv6 control multicast (FF02::/16)
- IPv6 data multicast
Whereas the bridge has a single "mcast_flood" control.
It seems adequate to attempt adding more netlink attributes to control
all of the above, individually. Then you could describe your use case,
in a standard way to the kernel, as "ip link set swp0 type bridge_slave
mcast_ipv4_data_flood off mcast_ipv4_ctrl_flood on". For compatibility,
"mcast_flood" could still be interpreted as a global enable for all
multicast.
The trouble seems to be actually offloading this config to Marvell DSA,
they don't seem to have a classifier that distinguishes between kinds of
multicast traffic.
How many link-local IPv4 and IPv6 addresses are there in actual use?
Would it be feasible to actually add ATU addresses for all of them, like
you did for 224.0.0.1 and ff02::1, and say that the port destinations of
_those_ are the mcast_ipv4_ctrl_flood ports?
> diff --git a/drivers/net/dsa/mv88e6xxx/chip.c b/drivers/net/dsa/mv88e6xxx/chip.c
> index 9ed1821184ec..fddcb596c421 100644
> --- a/drivers/net/dsa/mv88e6xxx/chip.c
> +++ b/drivers/net/dsa/mv88e6xxx/chip.c
> @@ -2488,6 +2488,41 @@ static int mv88e6xxx_broadcast_setup(struct mv88e6xxx_chip *chip, u16 vid)
> return 0;
> }
>
> +#if IS_ENABLED(CONFIG_NET_DSA_MV88E6XXX_ALWAYS_FLOOD_LOCAL_ALL_HOSTS_ADDRESS)
> +static int mv88e6xxx_port_add_multicast(struct mv88e6xxx_chip *chip, int port,
> + u16 vid)
> +{
> + u8 state = MV88E6XXX_G1_ATU_DATA_STATE_MC_STATIC;
> + const char multicast[][ETH_ALEN] = {
> + { 0x01, 0x00, 0x5e, 0x00, 0x00, 0x01 },
> + { 0x33, 0x33, 0x00, 0x00, 0x00, 0x01 }
> + };
> + int i, err;
> +
> + for (i = 0; i < ARRAY_SIZE(multicast); i++) {
> + err = mv88e6xxx_port_db_load_purge(chip, port, multicast[i], vid, state);
> + if (err)
> + return err;
> + }
> +
> + return 0;
> +}
> +
> +static int mv88e6xxx_multicast_setup(struct mv88e6xxx_chip *chip, u16 vid)
> +{
> + int port;
> + int err;
> +
> + for (port = 0; port < mv88e6xxx_num_ports(chip); port++) {
> + err = mv88e6xxx_port_add_multicast(chip, port, vid);
> + if (err)
> + return err;
> + }
> +
> + return 0;
> +}
> +#endif
> +
> struct mv88e6xxx_port_broadcast_sync_ctx {
> int port;
> bool flood;
> @@ -2572,6 +2607,12 @@ static int mv88e6xxx_port_vlan_join(struct mv88e6xxx_chip *chip, int port,
> err = mv88e6xxx_broadcast_setup(chip, vlan.vid);
> if (err)
> return err;
> +
> +#if IS_ENABLED(CONFIG_NET_DSA_MV88E6XXX_ALWAYS_FLOOD_LOCAL_ALL_HOSTS_ADDRESS)
> + err = mv88e6xxx_multicast_setup(chip, vlan.vid);
> + if (err)
> + return err;
> +#endif
> } else if (vlan.member[port] != member) {
> vlan.member[port] = member;
>
> @@ -3930,6 +3971,12 @@ static int mv88e6xxx_setup(struct dsa_switch *ds)
> if (err)
> goto unlock;
>
> +#if IS_ENABLED(CONFIG_NET_DSA_MV88E6XXX_ALWAYS_FLOOD_LOCAL_ALL_HOSTS_ADDRESS)
> + err = mv88e6xxx_multicast_setup(chip, 0);
This is the danger of developing on an old kernel and then just blindly
forward-porting. It will compile and smile and look pretty, but it won't
work. You need to request your board provider to do something and give
you access to mainline code.
In newer kernels, VID 0 is MV88E6XXX_VID_STANDALONE, aka a completely
separate address database compared to what the bridge is in charge of.
So, applied to the mainline kernel as of today, your change does nothing
useful, because when under a VLAN-unaware bridge, packets now get
classified to MV88E6XXX_VID_BRIDGED (4095).
For that matter, the port database (MV88E6XXX_VID_STANDALONE) should be
controlled through dev_mc_add(), and "ip maddr" shows you the link-local
multicast addresses offloaded to the device's RX filter.
mv88e6xxx today does not pass the requirements for dsa_switch_supports_mc_filtering(),
so dev_mc_add() does not actually program anything into hardware, but if
it did, it would have been the port database (VID 0), and this is what
makes your change wrong. You can read more about address databases in
Documentation/networking/dsa/dsa.rst.
> + if (err)
> + goto unlock;
> +#endif
> +
> err = mv88e6xxx_pot_setup(chip);
> if (err)
> goto unlock;
> --
> 2.17.1
>
On Tue, Apr 02, 2024 at 09:50:51PM +0300, Nikolay Aleksandrov wrote:
> On 4/2/24 20:43, Vladimir Oltean wrote:
> > Hi Nikolai,
> >
> > On Tue, Apr 02, 2024 at 12:28:38PM +0300, Nikolay Aleksandrov wrote:
> > > For the bridge patches:
> > > Nacked-by: Nikolay Aleksandrov <[email protected]>
> > >
> > > You cannot break the multicast flood flag to add support for a custom
> > > use-case. This is unacceptable. The current bridge behaviour is correct
> > > your patch 02 doesn't fix anything, you should configure the bridge
> > > properly to avoid all those problems, not break protocols.
> > >
> > > Your special use case can easily be solved by a user-space helper or
> > > eBPF and nftables. You can set the mcast flood flag and bypass the
> > > bridge for these packets. I basically said the same in 2021, if this is
> > > going to be in the bridge it should be hidden behind an option that is
> > > default off. But in my opinion adding an option to solve such special
> > > cases is undesirable, they can be easily solved with what's currently
> > > available.
> >
> > I appreciate your time is limited, but could you please translate your
> > suggestion, and detail your proposed alternative a bit, for those of us
> > who are not very familiar with IP multicast snooping?
> >
>
> My suggestion is not related to snooping really, but to the goal of
> patches 01-03. The bridge patches in this set are trying to forward
> traffic that is not supposed to be forwarded with the proposed
> configuration,
Correct up to a point. Reinterpreting the given user space configuration
and trying to make it do something else seems like a mistake, but in
principle one could also look at alternative bridge configurations like
the one I described here:
https://lore.kernel.org/netdev/20240402180805.yhhwj2f52sdc4dl2@skbuf/
> so that can be done by a user-space helper that installs
> rules to bypass the bridge specifically for those packets while
> monitoring the bridge state to implement a policy and manage these rules
> in order to keep snooping working.
>
> > Bypass the bridge for which packets? General IGMP/MLD queries? Wouldn't
> > that break snooping? And then do what with the packets, forward them in
> > another software layer than the bridge?
> >
>
> The ones that are not supposed to be forwarded in the proposed config
> and are needed for this use case (control traffic and link-local). Obviously
> to have proper snooping you'd need to manage these bypass
> rules and use them only while needed.
I think Joseph will end up in a situation where he needs IGMP control
messages both in the bridge data path and outside of it :)
Also, your proposal eliminates the possibility of cooperating with a
hardware accelerator which can forward the IGMP messages where they need
to go.
As far as I understand, I don't think Joseph has a very "special" use case.
Disabling flooding of unregistered multicast in the data plane sounds
reasonable. There seems to be a gap in the bridge API, in that this
operation also affects the control plane, which he is trying to fix with
this "force flooding", because of insufficiently fine grained control.
> > I also don't quite understand the suggestion of turning on mcast flooding:
> > isn't Joseph saying that he wants it off for the unregistered multicast
> > data traffic?
>
> Ah my bad, I meant to turn off flooding and bypass the bridge for those
> packets and ports while necessary, under necessary can be any policy
> that the user-space helper wants to implement.
>
> In any case, if this is going to be yet another kernel solution then it
> must be a new option that is default off, and doesn't break current mcast
> flood flag behaviour.
Yeah, maybe something like this, simple and with clear offload
semantics, as seen in existing hardware (not Marvell though):
mcast_flood == off:
- mcast_ipv4_ctrl_flood: don't care (maybe can force to "off")
- mcast_ipv4_data_flood: don't care
- mcast_ipv6_ctrl_flood: don't care
- mcast_ipv6_data_flood: don't care
- mcast_l2_flood: don't care
mcast_flood == on:
- Flood 224.0.0.x according to mcast_ipv4_ctrl_flood
- Flood all other IPv4 multicast according to mcast_ipv4_data_flood
- Flood ff02::/16 according to mcast_ipv6_ctrl_flood
- Flood all other IPv6 multicast according to mcast_ipv6_data_flood
- Flood L2 according to mcast_l2_flood
On 4/2/24 23:46, Vladimir Oltean wrote:
> On Tue, Apr 02, 2024 at 09:50:51PM +0300, Nikolay Aleksandrov wrote:
>> On 4/2/24 20:43, Vladimir Oltean wrote:
>>> Hi Nikolai,
>>>
>>> On Tue, Apr 02, 2024 at 12:28:38PM +0300, Nikolay Aleksandrov wrote:
>>>> For the bridge patches:
>>>> Nacked-by: Nikolay Aleksandrov <[email protected]>
>>>>
>>>> You cannot break the multicast flood flag to add support for a custom
>>>> use-case. This is unacceptable. The current bridge behaviour is correct
>>>> your patch 02 doesn't fix anything, you should configure the bridge
>>>> properly to avoid all those problems, not break protocols.
>>>>
>>>> Your special use case can easily be solved by a user-space helper or
>>>> eBPF and nftables. You can set the mcast flood flag and bypass the
>>>> bridge for these packets. I basically said the same in 2021, if this is
>>>> going to be in the bridge it should be hidden behind an option that is
>>>> default off. But in my opinion adding an option to solve such special
>>>> cases is undesirable, they can be easily solved with what's currently
>>>> available.
>>>
>>> I appreciate your time is limited, but could you please translate your
>>> suggestion, and detail your proposed alternative a bit, for those of us
>>> who are not very familiar with IP multicast snooping?
>>>
>>
>> My suggestion is not related to snooping really, but to the goal of
>> patches 01-03. The bridge patches in this set are trying to forward
>> traffic that is not supposed to be forwarded with the proposed
>> configuration,
>
> Correct up to a point. Reinterpreting the given user space configuration
> and trying to make it do something else seems like a mistake, but in
> principle one could also look at alternative bridge configurations like
> the one I described here:
> https://lore.kernel.org/netdev/20240402180805.yhhwj2f52sdc4dl2@skbuf/
>
>> so that can be done by a user-space helper that installs
>> rules to bypass the bridge specifically for those packets while
>> monitoring the bridge state to implement a policy and manage these rules
>> in order to keep snooping working.
>>
>>> Bypass the bridge for which packets? General IGMP/MLD queries? Wouldn't
>>> that break snooping? And then do what with the packets, forward them in
>>> another software layer than the bridge?
>>>
>>
>> The ones that are not supposed to be forwarded in the proposed config
>> and are needed for this use case (control traffic and link-local). Obviously
>> to have proper snooping you'd need to manage these bypass
>> rules and use them only while needed.
>
> I think Joseph will end up in a situation where he needs IGMP control
> messages both in the bridge data path and outside of it :)
>
My solution does not exclude such scenario. With all unregistered mcast
disabled it will be handled the same as with this proposed solution.
> Also, your proposal eliminates the possibility of cooperating with a
> hardware accelerator which can forward the IGMP messages where they need
> to go.
>
> As far as I understand, I don't think Joseph has a very "special" use case.
> Disabling flooding of unregistered multicast in the data plane sounds
> reasonable. There seems to be a gap in the bridge API, in that this
This we already have, but..
> operation also affects the control plane, which he is trying to fix with
> this "force flooding", because of insufficiently fine grained control.
>
Right, this is the part that is more special, I'm not saying it's
unreasonable. The proposition to make it optional and break it down to
type of traffic sounds good to me.
>>> I also don't quite understand the suggestion of turning on mcast flooding:
>>> isn't Joseph saying that he wants it off for the unregistered multicast
>>> data traffic?
>>
>> Ah my bad, I meant to turn off flooding and bypass the bridge for those
>> packets and ports while necessary, under necessary can be any policy
>> that the user-space helper wants to implement.
>>
>> In any case, if this is going to be yet another kernel solution then it
>> must be a new option that is default off, and doesn't break current mcast
>> flood flag behaviour.
>
> Yeah, maybe something like this, simple and with clear offload
> semantics, as seen in existing hardware (not Marvell though):
>
> mcast_flood == off:
> - mcast_ipv4_ctrl_flood: don't care (maybe can force to "off")
> - mcast_ipv4_data_flood: don't care
> - mcast_ipv6_ctrl_flood: don't care
> - mcast_ipv6_data_flood: don't care
> - mcast_l2_flood: don't care
> mcast_flood == on:
> - Flood 224.0.0.x according to mcast_ipv4_ctrl_flood
> - Flood all other IPv4 multicast according to mcast_ipv4_data_flood
> - Flood ff02::/16 according to mcast_ipv6_ctrl_flood
> - Flood all other IPv6 multicast according to mcast_ipv6_data_flood
> - Flood L2 according to mcast_l2_flood
Yep, sounds good to me. I was thinking about something in these lines
as well if doing a kernel solution in order to make it simpler and more
generic. The ctrl flood bits need to be handled more carefully to make
sure they match only control traffic and not link-local data.
I think the old option can be converted to use this fine-grained
control, that is if anyone sets the old flood on/off then the flood
mask gets set properly so we can do just 1 & in the fast path and avoid
adding more tests. It will also make it symmetric - if it can override
the on case, then it will be able to override the off case. And to be
more explicit you can pass a mask variable to br_multicast_rcv() which
will get populated and then you can pass it down to br_flood(). That
will also avoid adding new bits to the skb's bridge CB.
On 4/2/24 20:43, Vladimir Oltean wrote:
> Hi Nikolai,
>
> On Tue, Apr 02, 2024 at 12:28:38PM +0300, Nikolay Aleksandrov wrote:
>> For the bridge patches:
>> Nacked-by: Nikolay Aleksandrov <[email protected]>
>>
>> You cannot break the multicast flood flag to add support for a custom
>> use-case. This is unacceptable. The current bridge behaviour is correct
>> your patch 02 doesn't fix anything, you should configure the bridge
>> properly to avoid all those problems, not break protocols.
>>
>> Your special use case can easily be solved by a user-space helper or
>> eBPF and nftables. You can set the mcast flood flag and bypass the
>> bridge for these packets. I basically said the same in 2021, if this is
>> going to be in the bridge it should be hidden behind an option that is
>> default off. But in my opinion adding an option to solve such special
>> cases is undesirable, they can be easily solved with what's currently
>> available.
>
> I appreciate your time is limited, but could you please translate your
> suggestion, and detail your proposed alternative a bit, for those of us
> who are not very familiar with IP multicast snooping?
>
My suggestion is not related to snooping really, but to the goal of
patches 01-03. The bridge patches in this set are trying to forward
traffic that is not supposed to be forwarded with the proposed
configuration, so that can be done by a user-space helper that installs
rules to bypass the bridge specifically for those packets while
monitoring the bridge state to implement a policy and manage these rules
in order to keep snooping working.
> Bypass the bridge for which packets? General IGMP/MLD queries? Wouldn't
> that break snooping? And then do what with the packets, forward them in
> another software layer than the bridge?
>
The ones that are not supposed to be forwarded in the proposed config
and are needed for this use case (control traffic and link-local).
Obviously to have proper snooping you'd need to manage these bypass
rules and use them only while needed.
> I also don't quite understand the suggestion of turning on mcast flooding:
> isn't Joseph saying that he wants it off for the unregistered multicast
> data traffic?
Ah my bad, I meant to turn off flooding and bypass the bridge for those
packets and ports while necessary, under necessary can be any policy
that the user-space helper wants to implement.
In any case, if this is going to be yet another kernel solution then it
must be a new option that is default off, and doesn't break current
mcast flood flag behaviour.
In general my opinion is that the whole snooping control must be in
user-space and only have the dataplane in the kernel, but that is beyond
the scope of this set.
On Mon, Apr 01, 2024 at 08:11:08PM -0400, Joseph Huang wrote:
> When a port turns into an mrouter port, enable multicast flooding
> on that port even if multicast flooding is disabled by user config. This
> is necessary so that in a distributed system, the multicast packets
> can be forwarded to the Querier when the multicast source is attached
> to a Non-Querier bridge.
>
> Consider the following scenario:
>
> +--------------------+
> | |
> | Snooping | +------------+
> | Bridge 1 |----| Listener 1 |
> | (Querier) | +------------+
> | |
> +--------------------+
> |
> |
> +--------------------+
> | | mrouter | |
> +-----------+ | +---------+ |
> | MC Source |----| Snooping |
> +-----------| | Bridge 2 |
> | (Non-Querier) |
> +--------------------+
>
> In this scenario, Listener 1 will never receive multicast traffic
> from MC Source if multicast flooding is disabled on the mrouter port on
> Snooping Bridge 2.
>
> Signed-off-by: Joseph Huang <[email protected]>
..
> @@ -6849,11 +6864,28 @@ static int mv88e6xxx_port_bridge_flags(struct dsa_switch *ds, int port,
>
> if (flags.mask & BR_MCAST_FLOOD) {
> bool multicast = !!(flags.val & BR_MCAST_FLOOD);
> + struct mv88e6xxx_bridge *mv_bridge;
> + struct mv88e6xxx_port *p;
> + bool mrouter;
>
> - err = chip->info->ops->port_set_mcast_flood(chip, port,
> - multicast);
> - if (err)
> - goto out;
> + mv_bridge = mv88e6xxx_bridge_by_port(chip, port);
> + if (!mv_bridge)
> + return -EINVAL;
I think that mv88e6xxx_reg_unlock(chip) is needed here.
So perhaps (completely untested!):
if (!mv_bridge) {
err = -EINVAL;
goto out;
}
Flagged by Smatch
> +
> + p = &chip->ports[port];
> + mrouter = !!(mv_bridge->mrouter_ports & BIT(port));
> +
> + if (!mrouter) {
> + err = chip->info->ops->port_set_mcast_flood(chip, port,
> + multicast);
> + if (err)
> + goto out;
> + }
> +
> + if (multicast)
> + p->flags |= MV88E6XXX_PORT_FLAG_MC_FLOOD;
> + else
> + p->flags &= ~MV88E6XXX_PORT_FLAG_MC_FLOOD;
> }
>
> if (flags.mask & BR_BCAST_FLOOD) {
> @@ -6883,6 +6915,51 @@ static int mv88e6xxx_port_bridge_flags(struct dsa_switch *ds, int port,
> return err;
> }
>
> +static int mv88e6xxx_port_mrouter(struct dsa_switch *ds, int port,
> + bool mrouter,
> + struct netlink_ext_ack *extack)
> +{
> + struct mv88e6xxx_chip *chip = ds->priv;
> + struct mv88e6xxx_bridge *mv_bridge;
> + struct mv88e6xxx_port *p;
> + bool old_mrouter;
> + bool mc_flood;
> + int err;
> +
> + if (!chip->info->ops->port_set_mcast_flood)
> + return -EOPNOTSUPP;
> +
> + mv_bridge = mv88e6xxx_bridge_by_port(chip, port);
> + if (!mv_bridge)
> + return -EINVAL;
> +
> + old_mrouter = !!(mv_bridge->mrouter_ports & BIT(port));
> + if (mrouter == old_mrouter)
> + return 0;
> +
> + p = &chip->ports[port];
> + mc_flood = !!(p->flags & MV88E6XXX_PORT_FLAG_MC_FLOOD);
> +
> + mv88e6xxx_reg_lock(chip);
> +
> + if (!mc_flood) {
> + err = chip->info->ops->port_set_mcast_flood(chip, port,
> + mrouter);
> + if (err)
> + goto out;
> + }
> +
> + if (mrouter)
> + mv_bridge->mrouter_ports |= BIT(port);
> + else
> + mv_bridge->mrouter_ports &= ~BIT(port);
> +
> +out:
> + mv88e6xxx_reg_unlock(chip);
If mc_flood is true then err is uninitialised here.
Flagged by clang-17 W=1 build, and Smatch.
> +
> + return err;
> +}
> +
> static bool mv88e6xxx_lag_can_offload(struct dsa_switch *ds,
> struct dsa_lag lag,
> struct netdev_lag_upper_info *info,
..
On Mon, Apr 01, 2024 at 08:11:01PM -0400, Joseph Huang wrote:
> Modify the forwarding path so that IGMPv1/v2/MLDv1 Reports are always
> flooded by br_multicast_flood(), regardless of the check done
> by br_multicast_querier_exists().
>
> This patch fixes the problems where shortly after a system boots up,
> the first couple of Reports are not handled properly in that:
>
> 1) The Report from the Host is being flooded (via br_flood) to all
> bridge ports, and
> 2) If the mrouter port's multicast flooding is disabled, the Reports
> received from other hosts will not be forwarded to the Querier.
>
> Fixes: b00589af3b04 ("bridge: disable snooping if there is no querier")
>
There should be no blank line between Fixes and other tags.
If this is a fix, it should be targeted against net,
and likely broken out of this series.
If it is not a fix, then it should not have a Fixes tag.
Rather, you can reference a commit in the commit message text
(not tags) using something like.
Introduced by commit b00589af3b04 ("bridge: disable snooping if there is no querier")
Link: https://www.kernel.org/doc/html/next/process/maintainer-netdev.html
> Signed-off-by: Joseph Huang <[email protected]>
..
Hi Vladimir,
On 4/2/2024 8:23 AM, Vladimir Oltean wrote:
> Can you comment on the feasibility/infeasibility of Tobias' proposal of:
> "The bridge could just provide some MDB iterator to save us from having
> to cache all the configured groups."?
> https://lore.kernel.org/netdev/[email protected]/
>
> What is done here will have to be scaled to many drivers - potentially
> all existing DSA ones, as far as I'm aware.
>
I thought about implementing an MDB iterator as suggested by Tobias, but
I'm a bit concerned about the coherence of these MDB objects. In theory,
when the device driver is trying to act on an event, the source of the
trigger may have changed its state in the bridge already. If, upon
receiving an event in the device driver, we iterate over what the bridge
has at that instant, the differences between the worlds as seen by the
bridge and the device driver might lead to some unexpected results.
However, if we cache the MDB objects in the device driver, at least the
order in which the events took place will be coherent and at any give
time the state of the MDB objects in the device driver can be guaranteed
to be sane. This is also the approach the prestera device driver took.
Thanks,
Joseph
Hi Andrew,
On 4/2/2024 8:43 AM, Andrew Lunn wrote:
> On Mon, Apr 01, 2024 at 08:10:59PM -0400, Joseph Huang wrote:
>> There is a use case where one would like to enable multicast snooping
>> on a bridge but disable multicast flooding on all bridge ports so that
>> registered multicast traffic will only reach the intended recipients and
>> unregistered multicast traffic will be dropped. However, with existing
>> bridge ports' mcast_flood flag implementation, it doesn't work as desired.
>>
>> This patchset aims to make multicast snooping work even when multicast
>> flooding is disabled on the bridge ports, without changing the semantic of
>> the mcast_flood flag too much. Patches 1 to 4 attempt to address this issue.
>>
>> Also, in a network where more than one multicast snooping capable bridges
>> are interconnected without multicast routers being present, multicast
>> snooping fails if:
>>
>> 1. The source is not directly attached to the Querier
>> 2. The listener is beyond the mrouter port of the bridge where the
>> source is directly attached
>> 3. A hardware offloading switch is involved
>
> I've not studied the details here, but that last point makes me think
> the offload driver is broken. There should not be any difference
> between software bridging and hardware bridging. The whole idea is
> that you take what Linux can do in software and accelerate it by
> offloading to hardware. Doing acceleration should not change the
> behaviour.
>
In patch 10 I gave a little more detail about the fix, but basically
this is what happened.
Assuming we have a soft bridge like the following:
bp1 +------------+
Querier <---- | bridge |
+------------+
bp2 | | bp3
| |
v v
MC Source MC Listener
Here bp1 is the mrouter port, bp2 is connected to the multicast source,
and bp3 is connected to the multicast listener who wishes to receive
multicast traffic for that group.
After some Query/Report exchange, the snooping code in the bridge is
going to learn about the Listener from bp3, and is going to create an
MDB group which includes bp3 as the sole member. When the bridge
receives a multicast packet for that group from bp2, the bridge will do
a look up to find the members of that group (in this case, bp3) and
forward the packet to every single member in that group. At the same
time, the bridge will also forward the packet to every mrouter port so
that listeners beyond mrouter ports can receive that multicast packet as
well.
Now consider the same scenario, but with a hardware-offloaded switch:
+------------+
| bridge |
+------------+
^
|
| p6 (Host CPU port)
p1/bp1 +------------+
Querier <---- | sw |
+------------+
p2/bp2 | | p3/bp3
| |
v v
MC Source MC Listener
Same Query/Report exchange, same MDB group, except that this time around
the MDB group will be offloaded to the switch as well. So in the
switch's ATU we will now have an entry for the multicast group and with
p3 being the only member of that ATU. When the multicast packet arrives
at the switch from p2, the switch will do an ATU lookup, and forward the
packet to p3 only. This means that the Host CPU (p6) will not get a copy
of the packet, and so the soft bridge will not have the opportunity to
forward that packet to the mrouter port. This is what patch 10 attempts
to address.
One possible solution of course is to add p6 to every MDB group, however
that's probably not very desirable. Besides, it will be more efficient
if the packet is forwarded to the mrouter port by the switch in hardware
(i.e., offload mrouter forwarding), vs. being forwarded in the bridge by
software.
>> The patches were developed against 5.15, and forward-ported to 6.8.
>> Tests were done on a Pi 4B + Marvell 6393X Eval board with a single
>> switch chip with no VLAN.
>
> So what is the mv88e6xxx driver doing wrong?
>
> Andrew
>
The mv88e6xxx driver did nothing differently than the other DSA drivers.
I chose the mv88e6xxx driver to implement the fix simply because that's
the only platform I have at hand to verify the solution.
On 4/2/2024 5:59 PM, Nikolay Aleksandrov wrote:
> On 4/2/24 23:46, Vladimir Oltean wrote:
>> On Tue, Apr 02, 2024 at 09:50:51PM +0300, Nikolay Aleksandrov wrote:
>>> On 4/2/24 20:43, Vladimir Oltean wrote:
>>>> Hi Nikolai,
>>>>
>>>> On Tue, Apr 02, 2024 at 12:28:38PM +0300, Nikolay Aleksandrov wrote:
>>>>> For the bridge patches:
>>>>> Nacked-by: Nikolay Aleksandrov <[email protected]>
>>>>>
>>>>> You cannot break the multicast flood flag to add support for a custom
>>>>> use-case. This is unacceptable. The current bridge behaviour is correct
>>>>> your patch 02 doesn't fix anything, you should configure the bridge
>>>>> properly to avoid all those problems, not break protocols.
>>>>>
>>>>> Your special use case can easily be solved by a user-space helper or
>>>>> eBPF and nftables. You can set the mcast flood flag and bypass the
>>>>> bridge for these packets. I basically said the same in 2021, if this is
>>>>> going to be in the bridge it should be hidden behind an option that is
>>>>> default off. But in my opinion adding an option to solve such special
>>>>> cases is undesirable, they can be easily solved with what's currently
>>>>> available.
>>>>
>>>> I appreciate your time is limited, but could you please translate your
>>>> suggestion, and detail your proposed alternative a bit, for those of us
>>>> who are not very familiar with IP multicast snooping?
>>>>
>>>
>>> My suggestion is not related to snooping really, but to the goal of
>>> patches 01-03. The bridge patches in this set are trying to forward
>>> traffic that is not supposed to be forwarded with the proposed
>>> configuration,
>>
>> Correct up to a point. Reinterpreting the given user space configuration
>> and trying to make it do something else seems like a mistake, but in
>> principle one could also look at alternative bridge configurations like
>> the one I described here:
>> https://lore.kernel.org/netdev/20240402180805.yhhwj2f52sdc4dl2@skbuf/
>>
>>> so that can be done by a user-space helper that installs
>>> rules to bypass the bridge specifically for those packets while
>>> monitoring the bridge state to implement a policy and manage these rules
>>> in order to keep snooping working.
>>>
>>>> Bypass the bridge for which packets? General IGMP/MLD queries? Wouldn't
>>>> that break snooping? And then do what with the packets, forward them in
>>>> another software layer than the bridge?
>>>>
>>>
>>> The ones that are not supposed to be forwarded in the proposed config
>>> and are needed for this use case (control traffic and link-local). Obviously
>>> to have proper snooping you'd need to manage these bypass
>>> rules and use them only while needed.
>>
>> I think Joseph will end up in a situation where he needs IGMP control
>> messages both in the bridge data path and outside of it :)
>>
>
> My solution does not exclude such scenario. With all unregistered mcast
> disabled it will be handled the same as with this proposed solution.
>
>> Also, your proposal eliminates the possibility of cooperating with a
>> hardware accelerator which can forward the IGMP messages where they need
>> to go.
>>
>> As far as I understand, I don't think Joseph has a very "special" use case.
>> Disabling flooding of unregistered multicast in the data plane sounds
>> reasonable. There seems to be a gap in the bridge API, in that this
>
> This we already have, but..
>
>> operation also affects the control plane, which he is trying to fix with
>> this "force flooding", because of insufficiently fine grained control.
>>
>
> Right, this is the part that is more special, I'm not saying it's
> unreasonable. The proposition to make it optional and break it down to
> type of traffic sounds good to me.
>
>>>> I also don't quite understand the suggestion of turning on mcast flooding:
>>>> isn't Joseph saying that he wants it off for the unregistered multicast
>>>> data traffic?
>>>
>>> Ah my bad, I meant to turn off flooding and bypass the bridge for those
>>> packets and ports while necessary, under necessary can be any policy
>>> that the user-space helper wants to implement.
>>>
>>> In any case, if this is going to be yet another kernel solution then it
>>> must be a new option that is default off, and doesn't break current mcast
>>> flood flag behaviour.
>>
>> Yeah, maybe something like this, simple and with clear offload
>> semantics, as seen in existing hardware (not Marvell though):
>>
>> mcast_flood == off:
>> - mcast_ipv4_ctrl_flood: don't care (maybe can force to "off")
>> - mcast_ipv4_data_flood: don't care
>> - mcast_ipv6_ctrl_flood: don't care
>> - mcast_ipv6_data_flood: don't care
>> - mcast_l2_flood: don't care
>> mcast_flood == on:
>> - Flood 224.0.0.x according to mcast_ipv4_ctrl_flood
>> - Flood all other IPv4 multicast according to mcast_ipv4_data_flood
>> - Flood ff02::/16 according to mcast_ipv6_ctrl_flood
>> - Flood all other IPv6 multicast according to mcast_ipv6_data_flood
>> - Flood L2 according to mcast_l2_flood
Did you mean
if mcast_flood == on (meaning mcast_flood is ENABLED)
- mcast_ipv4_ctrl_flood: don't care (since 224.0.0.x will be flooded anyway)
..
if mcast_flood == off (meaning mcast_flood is DISABLED)
- Flood 224.0.0.x according to mcast_ipv4_ctrl_flood
..
? Otherwise the problem is still not solved when mcast_flood is disabled.
>
> Yep, sounds good to me. I was thinking about something in these lines
> as well if doing a kernel solution in order to make it simpler and more
> generic. The ctrl flood bits need to be handled more carefully to make
> sure they match only control traffic and not link-local data.
Do we consider 224.0.0.251 (mDNS) to be control or data? What qualifies
as control I guess that's my question.
> I think the old option can be converted to use this fine-grained
> control, that is if anyone sets the old flood on/off then the flood
> mask gets set properly so we can do just 1 & in the fast path and avoid
> adding more tests. It will also make it symmetric - if it can override
> the on case, then it will be able to override the off case. And to be
> more explicit you can pass a mask variable to br_multicast_rcv() which
> will get populated and then you can pass it down to br_flood(). That
> will also avoid adding new bits to the skb's bridge CB.
>
>
On Thu, Apr 04, 2024 at 05:35:41PM -0400, Joseph Huang wrote:
> Hi Andrew,
>
> On 4/2/2024 8:43 AM, Andrew Lunn wrote:
> > On Mon, Apr 01, 2024 at 08:10:59PM -0400, Joseph Huang wrote:
> > > There is a use case where one would like to enable multicast snooping
> > > on a bridge but disable multicast flooding on all bridge ports so that
> > > registered multicast traffic will only reach the intended recipients and
> > > unregistered multicast traffic will be dropped. However, with existing
> > > bridge ports' mcast_flood flag implementation, it doesn't work as desired.
> > >
> > > This patchset aims to make multicast snooping work even when multicast
> > > flooding is disabled on the bridge ports, without changing the semantic of
> > > the mcast_flood flag too much. Patches 1 to 4 attempt to address this issue.
> > >
> > > Also, in a network where more than one multicast snooping capable bridges
> > > are interconnected without multicast routers being present, multicast
> > > snooping fails if:
> > >
> > > 1. The source is not directly attached to the Querier
> > > 2. The listener is beyond the mrouter port of the bridge where the
> > > source is directly attached
> > > 3. A hardware offloading switch is involved
> >
> > I've not studied the details here, but that last point makes me think
> > the offload driver is broken. There should not be any difference
> > between software bridging and hardware bridging. The whole idea is
> > that you take what Linux can do in software and accelerate it by
> > offloading to hardware. Doing acceleration should not change the
> > behaviour.
> >
>
> In patch 10 I gave a little more detail about the fix, but basically this is
> what happened.
>
> Assuming we have a soft bridge like the following:
>
> bp1 +------------+
> Querier <---- | bridge |
> +------------+
> bp2 | | bp3
> | |
> v v
> MC Source MC Listener
>
> Here bp1 is the mrouter port, bp2 is connected to the multicast source, and
> bp3 is connected to the multicast listener who wishes to receive multicast
> traffic for that group.
>
> After some Query/Report exchange, the snooping code in the bridge is going
> to learn about the Listener from bp3, and is going to create an MDB group
> which includes bp3 as the sole member. When the bridge receives a multicast
> packet for that group from bp2, the bridge will do a look up to find the
> members of that group (in this case, bp3) and forward the packet to every
> single member in that group. At the same time, the bridge will also forward
> the packet to every mrouter port so that listeners beyond mrouter ports can
> receive that multicast packet as well.
>
> Now consider the same scenario, but with a hardware-offloaded switch:
>
> +------------+
> | bridge |
> +------------+
> ^
> |
> | p6 (Host CPU port)
> p1/bp1 +------------+
> Querier <---- | sw |
> +------------+
> p2/bp2 | | p3/bp3
> | |
> v v
> MC Source MC Listener
>
> Same Query/Report exchange, same MDB group, except that this time around the
> MDB group will be offloaded to the switch as well. So in the switch's ATU we
> will now have an entry for the multicast group and with p3 being the only
> member of that ATU. When the multicast packet arrives at the switch from p2,
> the switch will do an ATU lookup, and forward the packet to p3 only. This
> means that the Host CPU (p6) will not get a copy of the packet, and so the
> soft bridge will not have the opportunity to forward that packet to the
> mrouter port. This is what patch 10 attempts to address.
>
> One possible solution of course is to add p6 to every MDB group, however
> that's probably not very desirable. Besides, it will be more efficient if
> the packet is forwarded to the mrouter port by the switch in hardware (i.e.,
> offload mrouter forwarding), vs. being forwarded in the bridge by software.
Thanks for the explanation. So i think the key part which you said
above is:
At the same time, the bridge will also forward the packet to every
mrouter port so that listeners beyond mrouter ports can receive that
multicast packet as well.
How does the bridge know about the mrouter port? It seems like the
bridge needs to pass that information down to the switch. Is the
mrouter itself performing a join on the group when it has members of
the group on the other side of it?
Andrew
Hi Andrew,
On 4/4/2024 6:11 PM, Andrew Lunn wrote:
> On Thu, Apr 04, 2024 at 05:35:41PM -0400, Joseph Huang wrote:
>> Hi Andrew,
>>
>> On 4/2/2024 8:43 AM, Andrew Lunn wrote:
>>> On Mon, Apr 01, 2024 at 08:10:59PM -0400, Joseph Huang wrote:
>>>> There is a use case where one would like to enable multicast snooping
>>>> on a bridge but disable multicast flooding on all bridge ports so that
>>>> registered multicast traffic will only reach the intended recipients and
>>>> unregistered multicast traffic will be dropped. However, with existing
>>>> bridge ports' mcast_flood flag implementation, it doesn't work as desired.
>>>>
>>>> This patchset aims to make multicast snooping work even when multicast
>>>> flooding is disabled on the bridge ports, without changing the semantic of
>>>> the mcast_flood flag too much. Patches 1 to 4 attempt to address this issue.
>>>>
>>>> Also, in a network where more than one multicast snooping capable bridges
>>>> are interconnected without multicast routers being present, multicast
>>>> snooping fails if:
>>>>
>>>> 1. The source is not directly attached to the Querier
>>>> 2. The listener is beyond the mrouter port of the bridge where the
>>>> source is directly attached
>>>> 3. A hardware offloading switch is involved
>>>
>>> I've not studied the details here, but that last point makes me think
>>> the offload driver is broken. There should not be any difference
>>> between software bridging and hardware bridging. The whole idea is
>>> that you take what Linux can do in software and accelerate it by
>>> offloading to hardware. Doing acceleration should not change the
>>> behaviour.
>>>
>>
>> In patch 10 I gave a little more detail about the fix, but basically this is
>> what happened.
>>
>> Assuming we have a soft bridge like the following:
>>
>> bp1 +------------+
>> Querier <---- | bridge |
>> +------------+
>> bp2 | | bp3
>> | |
>> v v
>> MC Source MC Listener
>>
>> Here bp1 is the mrouter port, bp2 is connected to the multicast source, and
>> bp3 is connected to the multicast listener who wishes to receive multicast
>> traffic for that group.
>>
>> After some Query/Report exchange, the snooping code in the bridge is going
>> to learn about the Listener from bp3, and is going to create an MDB group
>> which includes bp3 as the sole member. When the bridge receives a multicast
>> packet for that group from bp2, the bridge will do a look up to find the
>> members of that group (in this case, bp3) and forward the packet to every
>> single member in that group. At the same time, the bridge will also forward
>> the packet to every mrouter port so that listeners beyond mrouter ports can
>> receive that multicast packet as well.
>>
>> Now consider the same scenario, but with a hardware-offloaded switch:
>>
>> +------------+
>> | bridge |
>> +------------+
>> ^
>> |
>> | p6 (Host CPU port)
>> p1/bp1 +------------+
>> Querier <---- | sw |
>> +------------+
>> p2/bp2 | | p3/bp3
>> | |
>> v v
>> MC Source MC Listener
>>
>> Same Query/Report exchange, same MDB group, except that this time around the
>> MDB group will be offloaded to the switch as well. So in the switch's ATU we
>> will now have an entry for the multicast group and with p3 being the only
>> member of that ATU. When the multicast packet arrives at the switch from p2,
>> the switch will do an ATU lookup, and forward the packet to p3 only. This
>> means that the Host CPU (p6) will not get a copy of the packet, and so the
>> soft bridge will not have the opportunity to forward that packet to the
>> mrouter port. This is what patch 10 attempts to address.
>>
>> One possible solution of course is to add p6 to every MDB group, however
>> that's probably not very desirable. Besides, it will be more efficient if
>> the packet is forwarded to the mrouter port by the switch in hardware (i.e.,
>> offload mrouter forwarding), vs. being forwarded in the bridge by software.
>
> Thanks for the explanation. So i think the key part which you said
> above is:
>
> At the same time, the bridge will also forward the packet to every
> mrouter port so that listeners beyond mrouter ports can receive that
> multicast packet as well.
>
> How does the bridge know about the mrouter port? It seems like the
The bridge learns about the existence of the Querier by the reception of
Queries. The bridge will mark the port which it received Queries from as
the mrouter port.
> bridge needs to pass that information down to the switch. Is the
The bridge does pass that information down to switchdev. Patch 5 adds
DSA handling of that event as well. Patches 9 and 10 adds the support in
the mv88e6xxx driver.
> mrouter itself performing a join on the group when it has members of
> the group on the other side of it?
You hit a key point here. The Querier does not send Report (not with
IGMPv2 anyway), so the bridge will never add the mrouter port to any MDB
group. That's why patch 10 is needed. Prestera driver does something
similar (which is what patches 6,7, and 10 are modeled after).
>
> Andrew
On Thu, Apr 04, 2024 at 06:16:12PM -0400, Joseph Huang wrote:
> > > mcast_flood == off:
> > > - mcast_ipv4_ctrl_flood: don't care (maybe can force to "off")
> > > - mcast_ipv4_data_flood: don't care
> > > - mcast_ipv6_ctrl_flood: don't care
> > > - mcast_ipv6_data_flood: don't care
> > > - mcast_l2_flood: don't care
> > > mcast_flood == on:
> > > - Flood 224.0.0.x according to mcast_ipv4_ctrl_flood
> > > - Flood all other IPv4 multicast according to mcast_ipv4_data_flood
> > > - Flood ff02::/16 according to mcast_ipv6_ctrl_flood
> > > - Flood all other IPv6 multicast according to mcast_ipv6_data_flood
> > > - Flood L2 according to mcast_l2_flood
>
> Did you mean
>
> if mcast_flood == on (meaning mcast_flood is ENABLED)
> - mcast_ipv4_ctrl_flood: don't care (since 224.0.0.x will be flooded anyway)
> ...
>
> if mcast_flood == off (meaning mcast_flood is DISABLED)
> - Flood 224.0.0.x according to mcast_ipv4_ctrl_flood
> ...
>
> ? Otherwise the problem is still not solved when mcast_flood is disabled.
No, I mean exactly as I said. My goal was not to "solve the problem"
when mcast_flood is disabled, but to give you an option to configure the
bridge to achieve what you want, in a way which I think is more acceptable.
AFAIU, there is not really any "problem" - the bridge behaves exactly as
instructed given the limited language available to instruct it ("mcast_flood"
covers all multicast). So the other knobs have the role of fine-tuning
what gets flooded when mcast_flood is on. Like "yes, but..."
You can't "solve the problem" when it involves changing an established
behavior that somebody probably depended on to be just like that.
> > Yep, sounds good to me. I was thinking about something in these lines
> > as well if doing a kernel solution in order to make it simpler and more
> > generic. The ctrl flood bits need to be handled more carefully to make
> > sure they match only control traffic and not link-local data.
>
> Do we consider 224.0.0.251 (mDNS) to be control or data? What qualifies as
> control I guess that's my question.
Well, as I said, I'm proposing that 224.0.0.x qualifies as control and
the rest of IPv4 multicast as data. Which means that, applied to your
case, "mcast_flood on mcast_ipv4_ctrl_flood on mcast_ipv4_data_flood off"
will "force flood" mDNS just like the IGMP traffic from your patches.
I'm not aware if this could be considered problematic (I don't think so).
The reason behind this proposal is that, AFAIU, endpoints may choose to
join IGMP groups in the 224.0.0.x range or not, but RFC4541 says that
switches shouldn't prune the destinations towards endpoints that don't
join this range anyway: https://www.rfc-editor.org/rfc/rfc4541#page-6
Whereas for IP multicast traffic towards an address outside 224.0.0.x,
pruning will happen as per the IGMP join tracking mechanism.
On 4/5/24 13:20, Vladimir Oltean wrote:
> On Thu, Apr 04, 2024 at 06:16:12PM -0400, Joseph Huang wrote:
>>>> mcast_flood == off:
>>>> - mcast_ipv4_ctrl_flood: don't care (maybe can force to "off")
>>>> - mcast_ipv4_data_flood: don't care
>>>> - mcast_ipv6_ctrl_flood: don't care
>>>> - mcast_ipv6_data_flood: don't care
>>>> - mcast_l2_flood: don't care
>>>> mcast_flood == on:
>>>> - Flood 224.0.0.x according to mcast_ipv4_ctrl_flood
>>>> - Flood all other IPv4 multicast according to mcast_ipv4_data_flood
>>>> - Flood ff02::/16 according to mcast_ipv6_ctrl_flood
>>>> - Flood all other IPv6 multicast according to mcast_ipv6_data_flood
>>>> - Flood L2 according to mcast_l2_flood
>>
>> Did you mean
>>
>> if mcast_flood == on (meaning mcast_flood is ENABLED)
>> - mcast_ipv4_ctrl_flood: don't care (since 224.0.0.x will be flooded anyway)
>> ...
>>
>> if mcast_flood == off (meaning mcast_flood is DISABLED)
>> - Flood 224.0.0.x according to mcast_ipv4_ctrl_flood
>> ...
>>
>> ? Otherwise the problem is still not solved when mcast_flood is disabled.
>
> No, I mean exactly as I said. My goal was not to "solve the problem"
> when mcast_flood is disabled, but to give you an option to configure the
> bridge to achieve what you want, in a way which I think is more acceptable.
>
> AFAIU, there is not really any "problem" - the bridge behaves exactly as
> instructed given the limited language available to instruct it ("mcast_flood"
> covers all multicast). So the other knobs have the role of fine-tuning
> what gets flooded when mcast_flood is on. Like "yes, but..."
>
> You can't "solve the problem" when it involves changing an established
> behavior that somebody probably depended on to be just like that.
>
>>> Yep, sounds good to me. I was thinking about something in these lines
>>> as well if doing a kernel solution in order to make it simpler and more
>>> generic. The ctrl flood bits need to be handled more carefully to make
>>> sure they match only control traffic and not link-local data.
>>
>> Do we consider 224.0.0.251 (mDNS) to be control or data? What qualifies as
>> control I guess that's my question.
>
> Well, as I said, I'm proposing that 224.0.0.x qualifies as control and
> the rest of IPv4 multicast as data. Which means that, applied to your
> case, "mcast_flood on mcast_ipv4_ctrl_flood on mcast_ipv4_data_flood off"
> will "force flood" mDNS just like the IGMP traffic from your patches.
> I'm not aware if this could be considered problematic (I don't think so).
>
> The reason behind this proposal is that, AFAIU, endpoints may choose to
> join IGMP groups in the 224.0.0.x range or not, but RFC4541 says that
> switches shouldn't prune the destinations towards endpoints that don't
> join this range anyway: https://www.rfc-editor.org/rfc/rfc4541#page-6
>
> Whereas for IP multicast traffic towards an address outside 224.0.0.x,
> pruning will happen as per the IGMP join tracking mechanism.
+1, non-IGMP traffic to 224.0.0.x must be flooded to all anyway
so this should allow for a better control over it, but perhaps
the naming should be link_local instead because control usually
means IGMP, maybe something like mcast_ip_link_local_flood
On Thu, Apr 04, 2024 at 04:43:38PM -0400, Joseph Huang wrote:
> Hi Vladimir,
>
> On 4/2/2024 8:23 AM, Vladimir Oltean wrote:
> > Can you comment on the feasibility/infeasibility of Tobias' proposal of:
> > "The bridge could just provide some MDB iterator to save us from having
> > to cache all the configured groups."?
> > https://lore.kernel.org/netdev/[email protected]/
> >
> > What is done here will have to be scaled to many drivers - potentially
> > all existing DSA ones, as far as I'm aware.
> >
>
> I thought about implementing an MDB iterator as suggested by Tobias, but I'm
> a bit concerned about the coherence of these MDB objects. In theory, when
> the device driver is trying to act on an event, the source of the trigger
> may have changed its state in the bridge already.
Yes, this is the result of SWITCHDEV_F_DEFER, used by both
SWITCHDEV_ATTR_ID_PORT_MROUTER and SWITCHDEV_OBJ_ID_PORT_MDB.
> If, upon receiving an event in the device driver, we iterate over what
> the bridge has at that instant, the differences between the worlds as
> seen by the bridge and the device driver might lead to some unexpected
> results.
Translated: iterating over bridge MDB objects needs to be serialized
with new switchdev events by acquiring rtnl_lock(). Then, once switchdev
events are temporarily blocked, the pending ones need to be flushed
using switchdev_deferred_process(), so resync the bridge state with the
driver state. Once the resync is done, the iteration is safe until
rtnl_unlock().
Applied to our case, the MDB iterator is needed in mv88e6xxx_port_mrouter().
This is already called with rtnl_lock() acquired. The resync procedure
will indirectly call mv88e6xxx_port_mdb_add()/mv88e6xxx_port_mdb_del()
through switchdev_deferred_process(), and then the walk is consistent
for the remainder of the mv88e6xxx_port_mrouter() function.
A helper which does this is what would be required - an iterator
function which calls an int (*cb)(struct net_device *brport, const struct switchdev_obj_port_mdb *mdb)
for each MDB entry. The DSA core could then offer some post-processing
services over this API, to recover the struct dsa_port associated with
the bridge port (in the LAG case they aren't the same) and the address
database associated with the bridge.
Do you think there would be unexpected results even if we did this?
br_switchdev_mdb_replay() needs to handle a similarly complicated
situation of synchronizing with deferred MDB events.
> However, if we cache the MDB objects in the device driver, at least
> the order in which the events took place will be coherent and at any
> give time the state of the MDB objects in the device driver can be
> guaranteed to be sane. This is also the approach the prestera device
> driver took.
Not contesting this, but I wouldn't like to see MDBs cached in each
device driver just for this. Switchdev is not very high on the list of
APIs which are easy to use, and making MDB caching a requirement
(for the common case that MDB entry destinations need software fixups
with the mrouter ports) isn't exactly going to make that any better.
Others' opinion may differ, but mine is that core offload APIs need to
consider what hardware is available in the real world, make the common
case easy, and the advanced cases possible. Rather than make every case
"advanced" :)
> > Thanks for the explanation. So i think the key part which you said
> > above is:
> >
> > At the same time, the bridge will also forward the packet to every
> > mrouter port so that listeners beyond mrouter ports can receive that
> > multicast packet as well.
> >
> > How does the bridge know about the mrouter port? It seems like the
>
> The bridge learns about the existence of the Querier by the reception of
> Queries. The bridge will mark the port which it received Queries from as the
> mrouter port.
>
> > bridge needs to pass that information down to the switch. Is the
>
> The bridge does pass that information down to switchdev. Patch 5 adds DSA
> handling of that event as well. Patches 9 and 10 adds the support in the
> mv88e6xxx driver.
I've not been looking at the details too much for this patchset. It
does however seem that some parts are controversial, and others just
seem like a bug fix. Does it make sense to split this into two
patchsets?
Andrew
Hi Vladimir,
On 4/5/2024 7:07 AM, Vladimir Oltean wrote:
> On Thu, Apr 04, 2024 at 04:43:38PM -0400, Joseph Huang wrote:
>> Hi Vladimir,
>>
>> On 4/2/2024 8:23 AM, Vladimir Oltean wrote:
>>> Can you comment on the feasibility/infeasibility of Tobias' proposal of:
>>> "The bridge could just provide some MDB iterator to save us from having
>>> to cache all the configured groups."?
>>> https://lore.kernel.org/netdev/[email protected]/
>>>
>>> What is done here will have to be scaled to many drivers - potentially
>>> all existing DSA ones, as far as I'm aware.
>>>
>>
>> I thought about implementing an MDB iterator as suggested by Tobias, but I'm
>> a bit concerned about the coherence of these MDB objects. In theory, when
>> the device driver is trying to act on an event, the source of the trigger
>> may have changed its state in the bridge already.
>
> Yes, this is the result of SWITCHDEV_F_DEFER, used by both
> SWITCHDEV_ATTR_ID_PORT_MROUTER and SWITCHDEV_OBJ_ID_PORT_MDB.
>
>> If, upon receiving an event in the device driver, we iterate over what
>> the bridge has at that instant, the differences between the worlds as
>> seen by the bridge and the device driver might lead to some unexpected
>> results.
>
> Translated: iterating over bridge MDB objects needs to be serialized
> with new switchdev events by acquiring rtnl_lock(). Then, once switchdev
> events are temporarily blocked, the pending ones need to be flushed
> using switchdev_deferred_process(), so resync the bridge state with the
> driver state. Once the resync is done, the iteration is safe until
> rtnl_unlock().
>
> Applied to our case, the MDB iterator is needed in mv88e6xxx_port_mrouter().
> This is already called with rtnl_lock() acquired. The resync procedure
> will indirectly call mv88e6xxx_port_mdb_add()/mv88e6xxx_port_mdb_del()
> through switchdev_deferred_process(), and then the walk is consistent
> for the remainder of the mv88e6xxx_port_mrouter() function.
>
> A helper which does this is what would be required - an iterator
> function which calls an int (*cb)(struct net_device *brport, const struct switchdev_obj_port_mdb *mdb)
> for each MDB entry. The DSA core could then offer some post-processing
> services over this API, to recover the struct dsa_port associated with
> the bridge port (in the LAG case they aren't the same) and the address
> database associated with the bridge.
>
> Do you think there would be unexpected results even if we did this?
> br_switchdev_mdb_replay() needs to handle a similarly complicated
> situation of synchronizing with deferred MDB events.
> >> However, if we cache the MDB objects in the device driver, at least
>> the order in which the events took place will be coherent and at any
>> give time the state of the MDB objects in the device driver can be
>> guaranteed to be sane. This is also the approach the prestera device
>> driver took.
>
> Not contesting this, but I wouldn't like to see MDBs cached in each
> device driver just for this. Switchdev is not very high on the list of
> APIs which are easy to use, and making MDB caching a requirement
> (for the common case that MDB entry destinations need software fixups
> with the mrouter ports) isn't exactly going to make that any better.
> Others' opinion may differ, but mine is that core offload APIs need to
> consider what hardware is available in the real world, make the common
> case easy, and the advanced cases possible. Rather than make every case
> "advanced" :)
Just throwing some random ideas out there. Do you think it would make
more sense if this whole solution (rtnl_lock, iterator cb,...etc.) is
moved up to DSA so that other DSA drivers could benefit from it? I
thought about implement it (not the iterator, the current form) in DSA
at first, but I'm not sure how other drivers would behave, so I did it
with mv instead.
I guess the question is, is the current limitation (mrouter not properly
offloaded) an issue specific to mv or is it a limitation of hardware
offloading in general? I tend to think it's the latter.
But then again, if we move it to DSA, we would lose the benefit of the
optimization of consolidating multiple register writes into just one (as
done in patch 10 currently), unless we add a new switch op which takes a
portvec instead of a port when modifying mdb's.
On 4/5/2024 7:00 AM, Nikolay Aleksandrov wrote:
> On 4/5/24 13:20, Vladimir Oltean wrote:
>> On Thu, Apr 04, 2024 at 06:16:12PM -0400, Joseph Huang wrote:
>>>>> mcast_flood == off:
>>>>> - mcast_ipv4_ctrl_flood: don't care (maybe can force to "off")
>>>>> - mcast_ipv4_data_flood: don't care
>>>>> - mcast_ipv6_ctrl_flood: don't care
>>>>> - mcast_ipv6_data_flood: don't care
>>>>> - mcast_l2_flood: don't care
>>>>> mcast_flood == on:
>>>>> - Flood 224.0.0.x according to mcast_ipv4_ctrl_flood
>>>>> - Flood all other IPv4 multicast according to mcast_ipv4_data_flood
>>>>> - Flood ff02::/16 according to mcast_ipv6_ctrl_flood
>>>>> - Flood all other IPv6 multicast according to mcast_ipv6_data_flood
>>>>> - Flood L2 according to mcast_l2_flood
>>>
>>> Did you mean
>>>
>>> if mcast_flood == on (meaning mcast_flood is ENABLED)
>>> - mcast_ipv4_ctrl_flood: don't care (since 224.0.0.x will be flooded
>>> anyway)
>>> ...
>>>
>>> if mcast_flood == off (meaning mcast_flood is DISABLED)
>>> - Flood 224.0.0.x according to mcast_ipv4_ctrl_flood
>>> ...
>>>
>>> ? Otherwise the problem is still not solved when mcast_flood is
>>> disabled.
>>
>> No, I mean exactly as I said. My goal was not to "solve the problem"
>> when mcast_flood is disabled, but to give you an option to configure the
>> bridge to achieve what you want, in a way which I think is more
>> acceptable.
>>
>> AFAIU, there is not really any "problem" - the bridge behaves exactly as
>> instructed given the limited language available to instruct it
>> ("mcast_flood"
>> covers all multicast). So the other knobs have the role of fine-tuning
>> what gets flooded when mcast_flood is on. Like "yes, but..."
>>
>> You can't "solve the problem" when it involves changing an established
>> behavior that somebody probably depended on to be just like that.
>>
>>>> Yep, sounds good to me. I was thinking about something in these lines
>>>> as well if doing a kernel solution in order to make it simpler and more
>>>> generic. The ctrl flood bits need to be handled more carefully to make
>>>> sure they match only control traffic and not link-local data.
>>>
>>> Do we consider 224.0.0.251 (mDNS) to be control or data? What
>>> qualifies as
>>> control I guess that's my question.
>>
>> Well, as I said, I'm proposing that 224.0.0.x qualifies as control and
>> the rest of IPv4 multicast as data. Which means that, applied to your
>> case, "mcast_flood on mcast_ipv4_ctrl_flood on mcast_ipv4_data_flood off"
>> will "force flood" mDNS just like the IGMP traffic from your patches.
>> I'm not aware if this could be considered problematic (I don't think so).
>>
>> The reason behind this proposal is that, AFAIU, endpoints may choose to
>> join IGMP groups in the 224.0.0.x range or not, but RFC4541 says that
>> switches shouldn't prune the destinations towards endpoints that don't
>> join this range anyway: https://www.rfc-editor.org/rfc/rfc4541#page-6
>>
>> Whereas for IP multicast traffic towards an address outside 224.0.0.x,
>> pruning will happen as per the IGMP join tracking mechanism.
>
> +1, non-IGMP traffic to 224.0.0.x must be flooded to all anyway
> so this should allow for a better control over it, but perhaps
> the naming should be link_local instead because control usually
> means IGMP, maybe something like mcast_ip_link_local_flood
>
Like this?
bridge link set dev swp0 mcast_flood off
- all flooding disabled
bridge link set dev swp0 mcast_flood on
- all flooding enabled
bridge link set dev swp0 mcast_flood on mcast_ipv4_data_flood off
mcast_ipv6_data_flood off
- IPv4 data packets flooding disabled, IPv6 data packets flooding
disabled, everything else floods (that is to say, only allow IPv4 local
subnet and IPv6 link-local to flood)
?
The syntax seems to be counterintuitive.
Or like this?
bridge link set dev swp0 mcast_flood on mcast_ipv4_ctrl_flood on
- only allow IPv4 local subnet to flood, everything else off
?
So basically the question is, what should the behavior be when something
is omitted from the command line?
On Fri, Apr 05, 2024 at 04:22:43PM -0400, Joseph Huang wrote:
> Like this?
>
> bridge link set dev swp0 mcast_flood off
> - all flooding disabled
>
> bridge link set dev swp0 mcast_flood on
> - all flooding enabled
>
> bridge link set dev swp0 mcast_flood on mcast_ipv4_data_flood off
> mcast_ipv6_data_flood off
> - IPv4 data packets flooding disabled, IPv6 data packets flooding
> disabled, everything else floods (that is to say, only allow IPv4 local
> subnet and IPv6 link-local to flood)
>
> ?
Yeah.
> The syntax seems to be counterintuitive.
>
> Or like this?
>
> bridge link set dev swp0 mcast_flood on mcast_ipv4_ctrl_flood on
> - only allow IPv4 local subnet to flood, everything else off
>
> ?
Nope.
> So basically the question is, what should the behavior be when something is
> omitted from the command line?
The answer is always: "new options should default to behaving exactly
like before". It's not just about the command line arguments, but also
about the actual netlink attributes that iproute2 (and other tooling)
creates when communicating with the kernel. Old user space has no idea
about the existence of mcast_ipv4_ctrl_flood et. al. So, if netlink
attributes specifying their value are not sent by user space, their
value in the kernel must mimic the value of mcast_flood.
On 4/5/2024 5:15 PM, Vladimir Oltean wrote:
> On Fri, Apr 05, 2024 at 04:22:43PM -0400, Joseph Huang wrote:
>> Like this?
>>
>> bridge link set dev swp0 mcast_flood off
>> - all flooding disabled
>>
>> bridge link set dev swp0 mcast_flood on
>> - all flooding enabled
>>
>> bridge link set dev swp0 mcast_flood on mcast_ipv4_data_flood off
>> mcast_ipv6_data_flood off
>> - IPv4 data packets flooding disabled, IPv6 data packets flooding
>> disabled, everything else floods (that is to say, only allow IPv4 local
>> subnet and IPv6 link-local to flood)
>>
>> ?
>
> Yeah.
>
>> The syntax seems to be counterintuitive.
>>
>> Or like this?
>>
>> bridge link set dev swp0 mcast_flood on mcast_ipv4_ctrl_flood on
>> - only allow IPv4 local subnet to flood, everything else off
>>
>> ?
>
> Nope.
>
>> So basically the question is, what should the behavior be when something is
>> omitted from the command line?
>
> The answer is always: "new options should default to behaving exactly
> like before". It's not just about the command line arguments, but also
> about the actual netlink attributes that iproute2 (and other tooling)
> creates when communicating with the kernel. Old user space has no idea
> about the existence of mcast_ipv4_ctrl_flood et. al. So, if netlink
> attributes specifying their value are not sent by user space, their
> value in the kernel must mimic the value of mcast_flood.
How about the following syntax? I think it satisfies all the "not
breaking existing behavior" requirements (new option defaults to off,
and missing user space netlink attributes does not change the existing
behavior):
mcast_flood off
all off
mcast_flood off mcast_flood_rfc4541 off
all off
mcast_flood off mcast_flood_rfc4541 on
224.0.0.X and ff02::1 on, the rest off
mcast_flood on
all on
mcast_flood on mcast_flood_rfc4541 off
all on (mcast_flood on overrides mcast_flood_rfc4541)
mcast_flood on mcast_flood_rfc4541 on
all on
mcast_flood_rfc4541 off
invalid (mcast_flood_rfc4541 is only valid if mcast_flood [on | off]
is specified first)
mcast_flood_rfc4541 on
invalid (mcast_flood_rfc4541 is only valid if mcast_flood [on | off]
is specified first)
Think of mcast_flood_rfc4541 like a pet door if you will.
On 4/5/2024 2:58 PM, Joseph Huang wrote:
> Hi Vladimir,
>
> On 4/5/2024 7:07 AM, Vladimir Oltean wrote:
>> On Thu, Apr 04, 2024 at 04:43:38PM -0400, Joseph Huang wrote:
>>> Hi Vladimir,
>>>
>>> On 4/2/2024 8:23 AM, Vladimir Oltean wrote:
>>>> Can you comment on the feasibility/infeasibility of Tobias' proposal
>>>> of:
>>>> "The bridge could just provide some MDB iterator to save us from having
>>>> to cache all the configured groups."?
>>>> https://lore.kernel.org/netdev/[email protected]/
>>>>
>>>> What is done here will have to be scaled to many drivers - potentially
>>>> all existing DSA ones, as far as I'm aware.
>>>>
>>>
>>> I thought about implementing an MDB iterator as suggested by Tobias,
>>> but I'm
>>> a bit concerned about the coherence of these MDB objects. In theory,
>>> when
>>> the device driver is trying to act on an event, the source of the
>>> trigger
>>> may have changed its state in the bridge already.
>>
>> Yes, this is the result of SWITCHDEV_F_DEFER, used by both
>> SWITCHDEV_ATTR_ID_PORT_MROUTER and SWITCHDEV_OBJ_ID_PORT_MDB.
>>
>>> If, upon receiving an event in the device driver, we iterate over what
>>> the bridge has at that instant, the differences between the worlds as
>>> seen by the bridge and the device driver might lead to some unexpected
>>> results.
>>
>> Translated: iterating over bridge MDB objects needs to be serialized
>> with new switchdev events by acquiring rtnl_lock(). Then, once switchdev
>> events are temporarily blocked, the pending ones need to be flushed
>> using switchdev_deferred_process(), so resync the bridge state with the
>> driver state. Once the resync is done, the iteration is safe until
>> rtnl_unlock().
>>
>> Applied to our case, the MDB iterator is needed in
>> mv88e6xxx_port_mrouter().
>> This is already called with rtnl_lock() acquired. The resync procedure
>> will indirectly call mv88e6xxx_port_mdb_add()/mv88e6xxx_port_mdb_del()
>> through switchdev_deferred_process(), and then the walk is consistent
>> for the remainder of the mv88e6xxx_port_mrouter() function.
>>
>> A helper which does this is what would be required - an iterator
>> function which calls an int (*cb)(struct net_device *brport, const
>> struct switchdev_obj_port_mdb *mdb)
>> for each MDB entry. The DSA core could then offer some post-processing
>> services over this API, to recover the struct dsa_port associated with
>> the bridge port (in the LAG case they aren't the same) and the address
>> database associated with the bridge.
Something like this (some layers omitted for brevity)?
+br_iterator
| for each mdb
| _br_switchdev_mdb_notify
rtnl_lock | without F_DEFER flag
| | |
+switchdev_port_attr_set_deferred | +switchdev_port_obj_notify
| | |
+dsa_port_mrouter | +dsa_user_port_obj_a/d
| | |
+mv88e6xxx_port_mrouter----------+ +mv88e6xxx_port_obj_a/d
|
+--------------------------------------+
|
rtnl_unlock
Note that on the system I tested, each register read/write takes about
100us to complete. For 100's of mdb groups, this would mean that we will
be holding rtnl lock for 10's of ms. I don't know if it's considered too
long.
>>
>> Do you think there would be unexpected results even if we did this?
>> br_switchdev_mdb_replay() needs to handle a similarly complicated
>> situation of synchronizing with deferred MDB events.
>> >> However, if we cache the MDB objects in the device driver, at least
>>> the order in which the events took place will be coherent and at any
>>> give time the state of the MDB objects in the device driver can be
>>> guaranteed to be sane. This is also the approach the prestera device
>>> driver took.
>>
>> Not contesting this, but I wouldn't like to see MDBs cached in each
>> device driver just for this. Switchdev is not very high on the list of
>> APIs which are easy to use, and making MDB caching a requirement
>> (for the common case that MDB entry destinations need software fixups
>> with the mrouter ports) isn't exactly going to make that any better.
>> Others' opinion may differ, but mine is that core offload APIs need to
>> consider what hardware is available in the real world, make the common
>> case easy, and the advanced cases possible. Rather than make every case
>> "advanced" :)
>
> Just throwing some random ideas out there. Do you think it would make
> more sense if this whole solution (rtnl_lock, iterator cb,...etc.) is
> moved up to DSA so that other DSA drivers could benefit from it? I
> thought about implement it (not the iterator, the current form) in DSA
> at first, but I'm not sure how other drivers would behave, so I did it
> with mv instead.
>
> I guess the question is, is the current limitation (mrouter not properly
> offloaded) an issue specific to mv or is it a limitation of hardware
> offloading in general? I tend to think it's the latter.
>
> But then again, if we move it to DSA, we would lose the benefit of the
> optimization of consolidating multiple register writes into just one (as
> done in patch 10 currently), unless we add a new switch op which takes a
> portvec instead of a port when modifying mdb's.
On Mon, Apr 29, 2024 at 06:07:25PM -0400, Joseph Huang wrote:
> Something like this (some layers omitted for brevity)?
>
> +br_iterator
> | for each mdb
> | _br_switchdev_mdb_notify
> rtnl_lock | without F_DEFER flag
> | | |
> +switchdev_port_attr_set_deferred | +switchdev_port_obj_notify
> | | |
> +dsa_port_mrouter | +dsa_user_port_obj_a/d
> | | |
> +mv88e6xxx_port_mrouter----------+ +mv88e6xxx_port_obj_a/d
> |
> +--------------------------------------+
> |
> rtnl_unlock
At a _very_ superficial glance, I don't think you are properly
accounting for the fact that even with rtnl_lock() held, there are still
SWITCHDEV_OBJ_ID_PORT_MDB events which may be pending on the switchdev
chain. Without a switchdev_deferred_process() flush call, you won't be
getting rid of them, so when you rtnl_unlock(), they will still run.
Even worse, holding rtnl_lock() will not stop the bridge multicast layer
from modifying its br->mdb_list; only br->multicast_lock will.
So you may be better off also acquiring br->multicast_lock, and
notifying the MDB entries to the switchdev chain _with_the F_DEFER flag.
> Note that on the system I tested, each register read/write takes about 100us
> to complete. For 100's of mdb groups, this would mean that we will be
> holding rtnl lock for 10's of ms. I don't know if it's considered too long.
Not sure how this is going to be any better if the iteration over MDB
entries is done 100% in the driver, though - since its hook,
dsa_port_mrouter(), runs entirely under rtnl_lock().
Anyway, with the SWITCHDEV_F_DEFER flag, maybe the mdb object
notifications can be made to run by switchdev only a few at a time, to
give the network stack time to do other things as well.
On Mon, Apr 29, 2024 at 04:14:03PM -0400, Joseph Huang wrote:
> How about the following syntax? I think it satisfies all the "not breaking
> existing behavior" requirements (new option defaults to off, and missing
> user space netlink attributes does not change the existing behavior):
>
> mcast_flood off
> all off
> mcast_flood off mcast_flood_rfc4541 off
> all off
> mcast_flood off mcast_flood_rfc4541 on
> 224.0.0.X and ff02::1 on, the rest off
> mcast_flood on
> all on
> mcast_flood on mcast_flood_rfc4541 off
> all on (mcast_flood on overrides mcast_flood_rfc4541)
> mcast_flood on mcast_flood_rfc4541 on
> all on
> mcast_flood_rfc4541 off
> invalid (mcast_flood_rfc4541 is only valid if mcast_flood [on | off] is
> specified first)
> mcast_flood_rfc4541 on
> invalid (mcast_flood_rfc4541 is only valid if mcast_flood [on | off] is
> specified first)
A bridge port defaults to having BR_MCAST_FLOOD set - see new_nbp().
Netlink attributes are only there to _change_ the state of properties in
the kernel. They don't need to be specified by user space if there's
nothing to be changed. "Only valid if another netlink attribute comes
first" makes no sense. You can alter 2 bridge port flags as part of the
same netlink message, or as part of different netlink messages (sent
over sockets of other processes).
>
> Think of mcast_flood_rfc4541 like a pet door if you will.
Ultimately, as far as I see it, both the OR-based and the AND-based UAPI
addition could be made to work in a way that's perhaps similarly backwards
compatible. It needs to be worked out with the bridge maintainers. Given
that I'm not doing great with my spare time, I will take a back seat on
that.
However, what I don't quite understand in your proposal is how many IPv4
addresses lie beyond the "224.0.0.X" notation? 256? Explain why there is
such a large discrepancy in the number of IPv4 addresses you are in
control of (256), vs only 1 IPv6 address with the new rfc4541 flag?
On 4/29/2024 8:59 PM, Vladimir Oltean wrote:
> On Mon, Apr 29, 2024 at 06:07:25PM -0400, Joseph Huang wrote:
>> Something like this (some layers omitted for brevity)?
>>
>> +br_iterator
>> | for each mdb
>> | _br_switchdev_mdb_notify
>> rtnl_lock | without F_DEFER flag
>> | | |
>> +switchdev_port_attr_set_deferred | +switchdev_port_obj_notify
>> | | |
>> +dsa_port_mrouter | +dsa_user_port_obj_a/d
>> | | |
>> +mv88e6xxx_port_mrouter----------+ +mv88e6xxx_port_obj_a/d
>> |
>> +--------------------------------------+
>> |
>> rtnl_unlock
>
> At a _very_ superficial glance, I don't think you are properly
> accounting for the fact that even with rtnl_lock() held, there are still
> SWITCHDEV_OBJ_ID_PORT_MDB events which may be pending on the switchdev
> chain. Without a switchdev_deferred_process() flush call, you won't be
> getting rid of them, so when you rtnl_unlock(), they will still run.
>
> Even worse, holding rtnl_lock() will not stop the bridge multicast layer
> from modifying its br->mdb_list; only br->multicast_lock will.
>
> So you may be better off also acquiring br->multicast_lock, and
> notifying the MDB entries to the switchdev chain _with_the F_DEFER flag.
Like this?
+br_iterator(dsa_cb)
| lock br->multicask_lock
| for each mdb
| br_switchdev_mdb_notify
rtnl_lock | |
| | +switchdev_port_obj_._defer
+switchdev_port_attr_set_deferred | unlock br->multicast_lock
| |
+dsa_port_mrouter |
| |
+mv88e6xxx_port_mrouter----------+
|
+--------------------------------------+
|
rtnl_unlock
(potential task change)
rtnl_lock
|
+switchdev_deferred_process
| flush all queued dfitems in queuing order
|
rtnl_unlock
I'm not that familiar with the bridge code, but is there any concern
with potential deadlock here (bewteen rtnl_lock and br->multicast_lock)?
>
>> Note that on the system I tested, each register read/write takes about 100us
>> to complete. For 100's of mdb groups, this would mean that we will be
>> holding rtnl lock for 10's of ms. I don't know if it's considered too long.
>
> Not sure how this is going to be any better if the iteration over MDB
> entries is done 100% in the driver, though - since its hook,
> dsa_port_mrouter(), runs entirely under rtnl_lock(). >
> Anyway, with the SWITCHDEV_F_DEFER flag, maybe the mdb object
> notifications can be made to run by switchdev only a few at a time, to
> give the network stack time to do other things as well.
On 4/29/2024 9:21 PM, Vladimir Oltean wrote:
> On Mon, Apr 29, 2024 at 04:14:03PM -0400, Joseph Huang wrote:
>> How about the following syntax? I think it satisfies all the "not breaking
>> existing behavior" requirements (new option defaults to off, and missing
>> user space netlink attributes does not change the existing behavior):
>>
>> mcast_flood off
>> all off
>> mcast_flood off mcast_flood_rfc4541 off
>> all off
>> mcast_flood off mcast_flood_rfc4541 on
>> 224.0.0.X and ff02::1 on, the rest off
>> mcast_flood on
>> all on
>> mcast_flood on mcast_flood_rfc4541 off
>> all on (mcast_flood on overrides mcast_flood_rfc4541)
>> mcast_flood on mcast_flood_rfc4541 on
>> all on
>> mcast_flood_rfc4541 off
>> invalid (mcast_flood_rfc4541 is only valid if mcast_flood [on | off] is
>> specified first)
>> mcast_flood_rfc4541 on
>> invalid (mcast_flood_rfc4541 is only valid if mcast_flood [on | off] is
>> specified first)
>
> A bridge port defaults to having BR_MCAST_FLOOD set - see new_nbp().
> Netlink attributes are only there to _change_ the state of properties in
> the kernel. They don't need to be specified by user space if there's
> nothing to be changed. "Only valid if another netlink attribute comes
> first" makes no sense. You can alter 2 bridge port flags as part of the
> same netlink message, or as part of different netlink messages (sent
> over sockets of other processes).
>
>>
>> Think of mcast_flood_rfc4541 like a pet door if you will.
>
> Ultimately, as far as I see it, both the OR-based and the AND-based UAPI
> addition could be made to work in a way that's perhaps similarly backwards
> compatible. It needs to be worked out with the bridge maintainers. Given
> that I'm not doing great with my spare time, I will take a back seat on
> that.
Nik, do you have any objection to the following proposal?
mcast_flood -> default/ off on
(existing flag) missing (specified/ (specified/
(on) nlmsg) nlmsg)
mcast_flood_rfc4541
(proposed new flag)
|
v
default/ flood all no flood flood all
missing
(off)
off flood all no flood flood all
(specified/nlmsg)
on flood all flood 4541 flood all
(specified/nlmsg) ^^^^^^^^^^
only behavior change
Basically the attributes are OR'ed together to form the final flooding
decision.
>
> However, what I don't quite understand in your proposal is how many IPv4
> addresses lie beyond the "224.0.0.X" notation? 256? Explain why there is
> such a large discrepancy in the number of IPv4 addresses you are in
> control of (256), vs only 1 IPv6 address with the new rfc4541 flag?
That's straight out of RFC-4541 ("coincidentally", these are also the IP
addresses for which the bridge will not create mdb's):
2.1.2.
2) Packets with a destination IP (DIP) address in the 224.0.0.X range
which are not IGMP must be forwarded on all ports.
This recommendation is based on the fact that many host systems do
not send Join IP multicast addresses in this range before sending
or listening to IP multicast packets. Furthermore, since the
224.0.0.X address range is defined as link-local (not to be
routed), it seems unnecessary to keep the state for each address
in this range. Additionally, some routers operate in the
224.0.0.X address range without issuing IGMP Joins, and these
applications would break if the switch were to prune them due to
not having seen a Join Group message from the router.
and
3.
In IPv6, the data forwarding rules are more straight forward because
MLD is mandated for addresses with scope 2 (link-scope) or greater.
The only exception is the address FF02::1 which is the all hosts
link-scope address for which MLD messages are never sent. Packets
with the all hosts link-scope address should be forwarded on all
ports.
On 30/04/2024 20:01, Joseph Huang wrote:
> On 4/29/2024 9:21 PM, Vladimir Oltean wrote:
>> On Mon, Apr 29, 2024 at 04:14:03PM -0400, Joseph Huang wrote:
>>> How about the following syntax? I think it satisfies all the "not breaking
>>> existing behavior" requirements (new option defaults to off, and missing
>>> user space netlink attributes does not change the existing behavior):
>>>
>>> mcast_flood off
>>> all off
>>> mcast_flood off mcast_flood_rfc4541 off
>>> all off
>>> mcast_flood off mcast_flood_rfc4541 on
>>> 224.0.0.X and ff02::1 on, the rest off
>>> mcast_flood on
>>> all on
>>> mcast_flood on mcast_flood_rfc4541 off
>>> all on (mcast_flood on overrides mcast_flood_rfc4541)
>>> mcast_flood on mcast_flood_rfc4541 on
>>> all on
>>> mcast_flood_rfc4541 off
>>> invalid (mcast_flood_rfc4541 is only valid if mcast_flood [on | off] is
>>> specified first)
>>> mcast_flood_rfc4541 on
>>> invalid (mcast_flood_rfc4541 is only valid if mcast_flood [on | off] is
>>> specified first)
>>
>> A bridge port defaults to having BR_MCAST_FLOOD set - see new_nbp().
>> Netlink attributes are only there to _change_ the state of properties in
>> the kernel. They don't need to be specified by user space if there's
>> nothing to be changed. "Only valid if another netlink attribute comes
>> first" makes no sense. You can alter 2 bridge port flags as part of the
>> same netlink message, or as part of different netlink messages (sent
>> over sockets of other processes).
>>
>>>
>>> Think of mcast_flood_rfc4541 like a pet door if you will.
>>
>> Ultimately, as far as I see it, both the OR-based and the AND-based UAPI
>> addition could be made to work in a way that's perhaps similarly backwards
>> compatible. It needs to be worked out with the bridge maintainers. Given
>> that I'm not doing great with my spare time, I will take a back seat on
>> that.
>
> Nik, do you have any objection to the following proposal?
>
> mcast_flood -> default/ off on
> (existing flag) missing (specified/ (specified/
> (on) nlmsg) nlmsg)
>
> mcast_flood_rfc4541
> (proposed new flag)
> |
> v
> default/ flood all no flood flood all
> missing
> (off)
>
> off flood all no flood flood all
> (specified/nlmsg)
>
> on flood all flood 4541 flood all
> (specified/nlmsg) ^^^^^^^^^^
> only behavior change
>
>
> Basically the attributes are OR'ed together to form the final flooding decision.
>
>
Looks good to me. Please make use of the boolopt uapi to avoid adding new
nl attributes.
Thanks,
Nik
On 4/30/2024 12:27 PM, Joseph Huang wrote:
> On 4/29/2024 8:59 PM, Vladimir Oltean wrote:
>> On Mon, Apr 29, 2024 at 06:07:25PM -0400, Joseph Huang wrote:
>>> Something like this (some layers omitted for brevity)?
>>>
>>> +br_iterator
>>> | for each mdb
>>> | _br_switchdev_mdb_notify
>>> rtnl_lock | without F_DEFER flag
>>> | | |
>>> +switchdev_port_attr_set_deferred | +switchdev_port_obj_notify
>>> | | |
>>> +dsa_port_mrouter | +dsa_user_port_obj_a/d
>>> | | |
>>> +mv88e6xxx_port_mrouter----------+
>>> +mv88e6xxx_port_obj_a/d
>>> |
>>> +--------------------------------------+
>>> |
>>> rtnl_unlock
>>
>> At a _very_ superficial glance, I don't think you are properly
>> accounting for the fact that even with rtnl_lock() held, there are still
>> SWITCHDEV_OBJ_ID_PORT_MDB events which may be pending on the switchdev
>> chain. Without a switchdev_deferred_process() flush call, you won't be
>> getting rid of them, so when you rtnl_unlock(), they will still run.
>>
>> Even worse, holding rtnl_lock() will not stop the bridge multicast layer
>> from modifying its br->mdb_list; only br->multicast_lock will.
>>
>> So you may be better off also acquiring br->multicast_lock, and
>> notifying the MDB entries to the switchdev chain _with_the F_DEFER flag.
>
> Like this?
>
> +br_iterator(dsa_cb)
> | lock br->multicask_lock
> | for each mdb
> | br_switchdev_mdb_notify
> rtnl_lock | |
> | | +switchdev_port_obj_._defer
> +switchdev_port_attr_set_deferred | unlock br->multicast_lock
> | |
> +dsa_port_mrouter |
> | |
> +mv88e6xxx_port_mrouter----------+
> |
> +--------------------------------------+
> |
> rtnl_unlock
>
> (potential task change)
>
> rtnl_lock
> |
> +switchdev_deferred_process
> | flush all queued dfitems in queuing order
> |
> rtnl_unlock
>
> I'm not that familiar with the bridge code, but is there any concern
> with potential deadlock here (between rtnl_lock and br->multicast_lock)?
Hi Nik,
Do you know if it's safe to acquire rtnl_lock and br->multicast_lock in
the following sequence? Is there any potential possibility for a deadlock?
rtnl_lock
spin_lock(br->multicast_lock)
spin_unlock(br->multicast_lock)
rtnl_unlock
If the operation is safe, the next question is should it be spin_lock or
spin_lock_bh?
Thanks,
Joseph