This series of patches contains the following fixes:
1. In a distributed system with multiple hardware-offloading bridges,
if a multicast source is attached to a Non-Querier bridge, the bridge
will not forward any multicast packets from that source to the Querier.
+--------------------+
| |
| Snooping | +------------+
| Bridge 1 |----| Listener 1 |
| (Querier) | +------------+
| |
+--------------------+
|
|
+----+---------+-----+
| | mrouter | |
+-----------+ | +---------+ | +------------+
| MC Source |----| Snooping |----| Listener 2 |
+-----------| | Bridge 2 | +------------+
| (Non-Querier) |
+--------------------+
In this scenario, Listener 1 will never receive multicast traffic
from MC Source since Snooping Bridge 2 does not forward multicast
packets to the mrouter port. Patches 0001, 0002, and 0003 address
this issue.
2. If mcast_flood is disabled on a bridge port, some of the snooping
functions stop working properly.
a. Consider the following scenario:
+--------------------+
| |
| Snooping | +------------+
| Bridge 1 |----| Listener 1 |
| (Querier) | +------------+
| |
+--------------------+
|
|
+--------------------+
| | mrouter | |
+-----------+ | +---------+ |
| MC Source |----| Snooping |
+-----------| | Bridge 2 |
| (Non-Querier) |
+--------------------+
In this scenario, Listener 1 will never receive multicast traffic
from MC Source if mcast_flood is disabled on the mrouter port on
Snooping Bridge 2. Patch 0004 addresses this issue.
b. For a Non-Querier bridge, if mcast_flood is disabled on a bridge
port, Queries received from other Querier will not be forwarded
out of that bridge port. Patch 0005 addresses this issue.
3. After a system boots up, the first couple Reports are not handled
properly:
1) the Report from the Host is being flooded (via br_flood) to all
bridge ports, and
2) if the mrouter port's mcast_flood is disabled, the Reports received
from other hosts will not be forwarded to the Querier.
Patch 0006 addresses this issue.
These patches were developed and verified initially against 5.4 kernel
(due to hardware platform limitation) and forward-patched to 5.12.
Snooping code introduced between 5.4 and 5.12 are not extensively tested
(only IGMPv2/MLDv1 were tested). The hardware platform used were two
bridges utilizing a single Marvell 88E6352 Ethernet switch chip (i.e.,
no cross-chip bridging involved).
Joseph Huang (6):
bridge: Refactor br_mdb_notify
bridge: Offload mrouter port forwarding to switchdev
bridge: Avoid traffic disruption when Querier state changes
bridge: Force mcast_flooding for mrouter ports
bridge: Flood Queries even when mcast_flood is disabled
bridge: Always multicast_flood Reports
net/bridge/br_device.c | 5 +-
net/bridge/br_forward.c | 3 +-
net/bridge/br_input.c | 5 +-
net/bridge/br_mdb.c | 70 +++++++++++++---------
net/bridge/br_multicast.c | 121 ++++++++++++++++++++++++++++++++++----
net/bridge/br_private.h | 11 +++-
6 files changed, 169 insertions(+), 46 deletions(-)
base-commit: 5e321ded302da4d8c5d5dd953423d9b748ab3775
--
2.17.1
On 04/05/2021 21:22, Joseph Huang wrote:
> This series of patches contains the following fixes:
>
> 1. In a distributed system with multiple hardware-offloading bridges,
> if a multicast source is attached to a Non-Querier bridge, the bridge
> will not forward any multicast packets from that source to the Querier.
>
> +--------------------+
> | |
> | Snooping | +------------+
> | Bridge 1 |----| Listener 1 |
> | (Querier) | +------------+
> | |
> +--------------------+
> |
> |
> +----+---------+-----+
> | | mrouter | |
> +-----------+ | +---------+ | +------------+
> | MC Source |----| Snooping |----| Listener 2 |
> +-----------| | Bridge 2 | +------------+
> | (Non-Querier) |
> +--------------------+
>
> In this scenario, Listener 1 will never receive multicast traffic
> from MC Source since Snooping Bridge 2 does not forward multicast
> packets to the mrouter port. Patches 0001, 0002, and 0003 address
> this issue.
>
> 2. If mcast_flood is disabled on a bridge port, some of the snooping
> functions stop working properly.
>
> a. Consider the following scenario:
>
> +--------------------+
> | |
> | Snooping | +------------+
> | Bridge 1 |----| Listener 1 |
> | (Querier) | +------------+
> | |
> +--------------------+
> |
> |
> +--------------------+
> | | mrouter | |
> +-----------+ | +---------+ |
> | MC Source |----| Snooping |
> +-----------| | Bridge 2 |
> | (Non-Querier) |
> +--------------------+
>
> In this scenario, Listener 1 will never receive multicast traffic
> from MC Source if mcast_flood is disabled on the mrouter port on
> Snooping Bridge 2. Patch 0004 addresses this issue.
>
> b. For a Non-Querier bridge, if mcast_flood is disabled on a bridge
> port, Queries received from other Querier will not be forwarded
> out of that bridge port. Patch 0005 addresses this issue.
>
> 3. After a system boots up, the first couple Reports are not handled
> properly:
>
> 1) the Report from the Host is being flooded (via br_flood) to all
> bridge ports, and
> 2) if the mrouter port's mcast_flood is disabled, the Reports received
> from other hosts will not be forwarded to the Querier.
>
> Patch 0006 addresses this issue.
>
> These patches were developed and verified initially against 5.4 kernel
> (due to hardware platform limitation) and forward-patched to 5.12.
> Snooping code introduced between 5.4 and 5.12 are not extensively tested
> (only IGMPv2/MLDv1 were tested). The hardware platform used were two
> bridges utilizing a single Marvell 88E6352 Ethernet switch chip (i.e.,
> no cross-chip bridging involved).
>
> Joseph Huang (6):
> bridge: Refactor br_mdb_notify
> bridge: Offload mrouter port forwarding to switchdev
> bridge: Avoid traffic disruption when Querier state changes
> bridge: Force mcast_flooding for mrouter ports
> bridge: Flood Queries even when mcast_flood is disabled
> bridge: Always multicast_flood Reports
>
> net/bridge/br_device.c | 5 +-
> net/bridge/br_forward.c | 3 +-
> net/bridge/br_input.c | 5 +-
> net/bridge/br_mdb.c | 70 +++++++++++++---------
> net/bridge/br_multicast.c | 121 ++++++++++++++++++++++++++++++++++----
> net/bridge/br_private.h | 11 +++-
> 6 files changed, 169 insertions(+), 46 deletions(-)
>
>
> base-commit: 5e321ded302da4d8c5d5dd953423d9b748ab3775
>
Hi,
This patch-set is inappropriate for -net, if at all. It's quite late over here
and I'll review the rest later, but I can say from a quick peek that patch 02
is unacceptable for it increases the complexity with 1 order of magnitude of all
add/del call paths and some of them can be invoked on user packets. A lot of this
functionality should be "hidden" in the driver or done by a user-space daemon/helper.
Most of the flooding behaviour changes must be hidden behind some new option
otherwise they'll break user setups that rely on the current. I'll review the patches
in detail over the following few days, net-next is closed anyway.
Cheers,
Nik
> Hi,
> This patch-set is inappropriate for -net, if at all. It's quite late over here and I'll
> review the rest later, but I can say from a quick peek that patch 02 is
> unacceptable for it increases the complexity with 1 order of magnitude of all
> add/del call paths and some of them can be invoked on user packets. A lot of
> this functionality should be "hidden" in the driver or done by a user-space
> daemon/helper.
> Most of the flooding behaviour changes must be hidden behind some new
> option otherwise they'll break user setups that rely on the current. I'll review
> the patches in detail over the following few days, net-next is closed anyway.
>
> Cheers,
> Nik
Hi Nik,
Thanks for your quick response!
Once you have a chance to review the set, please let me know how I can improve them to make them acceptable. These are real problems and we do need to fix them.
Thanks,
Joseph
Modify br_mdb_notify so that switchdev mdb purge events are never sent
for mrouter ports, and when a non-mrouter port turns into an mrouter
port, all port groups associated with that port are deleted immediately.
Consider the following scenario:
+--------------------+
| |
+-----------+ | Snooping +--| +------------+
| MC Source |----| Bridge 1 |P1|----| Listener 1 |
+-----------| | +--+ +--| +------------+
| |P2| |
+--------------------+
|
|
+--------------------+
| |P3| |
| +--+ +--| +---------------+
| Snooping |P4|----| Listener 2 |
| Bridge 2 +--| | Joins Group A |
| | +---------------+
+--------------------+
Assuming initially Snooping Bridge 1 is the Querier, and Snooping Bridge
2 is a Non-Querier. After some Query/Report exchange, Snooping Bridge 1
would create an mdb group A and add P2 to the group, and starts a timer
on the port group A/P2.
Let's say Snooping Bridge 2 becomes the Querier for some reason (e.g.,
Snooping Bridge 2 rebooted) before the port group A/P2 expires. With
the patch 'bridge: Offload mrouter port forwarding to switchdev',
Snooping Bridge 1 detects that P2 has now become an mrouter port, and
will add it to the address database on the hardware switch chip (even
though it's already there when the port group A/P2 was added). This is
all fine until the timer on port group A/P2 expires, and then Snooping
Bridge 1 will purge P2 from the address database on the switch chip.
Now Listener 2 will not be able to receive multicast traffic from MC
Source anymore.
With this patch, immediately after a bridge port turns into an mrouter
port, the port's membership information is removed from the bridge' mdb,
but remains programmed in the address database on the hardware chip,
just to be consistent with the database/programming state as before
the Querier role change. The hardware programming will be cleaned up
when the group expires (via br_multicast_group_change).
Signed-off-by: Joseph Huang <[email protected]>
---
net/bridge/br_mdb.c | 17 +++++----
net/bridge/br_multicast.c | 78 ++++++++++++++++++++++++++++++++-------
net/bridge/br_private.h | 3 +-
3 files changed, 75 insertions(+), 23 deletions(-)
diff --git a/net/bridge/br_mdb.c b/net/bridge/br_mdb.c
index e8684d798ec3..c121b780450b 100644
--- a/net/bridge/br_mdb.c
+++ b/net/bridge/br_mdb.c
@@ -708,16 +708,17 @@ void br_mdb_switchdev_port(struct net_bridge_mdb_entry *mp,
void br_mdb_notify(struct net_device *dev,
struct net_bridge_mdb_entry *mp,
struct net_bridge_port_group *pg,
- int type)
+ int type, bool swdev_notify)
{
struct net *net = dev_net(dev);
struct sk_buff *skb;
int err = -ENOBUFS;
- if (pg) {
- br_mdb_switchdev_port(mp, pg->key.port, type);
- } else {
- br_mdb_switchdev_host(dev, mp, type);
+ if (swdev_notify) {
+ if (pg)
+ br_mdb_switchdev_port(mp, pg->key.port, type);
+ else
+ br_mdb_switchdev_host(dev, mp, type);
}
skb = nlmsg_new(rtnl_mdb_nlmsg_size(pg), GFP_ATOMIC);
@@ -1011,7 +1012,7 @@ static int br_mdb_add_group(struct net_bridge *br, struct net_bridge_port *port,
}
br_multicast_host_join(mp, false);
- br_mdb_notify(br->dev, mp, NULL, RTM_NEWMDB);
+ br_mdb_notify(br->dev, mp, NULL, RTM_NEWMDB, true);
return 0;
}
@@ -1042,7 +1043,7 @@ static int br_mdb_add_group(struct net_bridge *br, struct net_bridge_port *port,
rcu_assign_pointer(*pp, p);
if (entry->state == MDB_TEMPORARY)
mod_timer(&p->timer, now + br->multicast_membership_interval);
- br_mdb_notify(br->dev, mp, p, RTM_NEWMDB);
+ br_mdb_notify(br->dev, mp, p, RTM_NEWMDB, true);
/* if we are adding a new EXCLUDE port group (*,G) it needs to be also
* added to all S,G entries for proper replication, if we are adding
* a new INCLUDE port (S,G) then all of *,G EXCLUDE ports need to be
@@ -1176,7 +1177,7 @@ static int __br_mdb_del(struct net_bridge *br, struct br_mdb_entry *entry,
if (entry->ifindex == mp->br->dev->ifindex && mp->host_joined) {
br_multicast_host_leave(mp, false);
err = 0;
- br_mdb_notify(br->dev, mp, NULL, RTM_DELMDB);
+ br_mdb_notify(br->dev, mp, NULL, RTM_DELMDB, true);
if (!mp->ports && netif_running(br->dev))
mod_timer(&mp->timer, jiffies);
goto unlock;
diff --git a/net/bridge/br_multicast.c b/net/bridge/br_multicast.c
index 5ed0d5efef09..d7fbe1f3af18 100644
--- a/net/bridge/br_multicast.c
+++ b/net/bridge/br_multicast.c
@@ -506,7 +506,7 @@ static void br_multicast_fwd_src_handle(struct net_bridge_group_src *src)
sg_mp = br_mdb_ip_get(src->br, &sg_key.addr);
if (!sg_mp)
return;
- br_mdb_notify(src->br->dev, sg_mp, sg, RTM_NEWMDB);
+ br_mdb_notify(src->br->dev, sg_mp, sg, RTM_NEWMDB, true);
}
}
@@ -617,7 +617,12 @@ void br_multicast_del_pg(struct net_bridge_mdb_entry *mp,
br_multicast_eht_clean_sets(pg);
hlist_for_each_entry_safe(ent, tmp, &pg->src_list, node)
br_multicast_del_group_src(ent, false);
- br_mdb_notify(br->dev, mp, pg, RTM_DELMDB);
+ /* don't notify switchdev if mrouter port
+ * switchdev will be notified when group expires via
+ * br_multicast_group_change
+ */
+ br_mdb_notify(br->dev, mp, pg, RTM_DELMDB,
+ hlist_unhashed(&pg->key.port->rlist));
if (!br_multicast_is_star_g(&mp->addr)) {
rhashtable_remove_fast(&br->sg_port_tbl, &pg->rhnode,
br_sg_port_rht_params);
@@ -688,7 +693,7 @@ static void br_multicast_port_group_expired(struct timer_list *t)
if (WARN_ON(!mp))
goto out;
- br_mdb_notify(br->dev, mp, pg, RTM_NEWMDB);
+ br_mdb_notify(br->dev, mp, pg, RTM_NEWMDB, true);
}
out:
spin_unlock(&br->multicast_lock);
@@ -1228,7 +1233,7 @@ void br_multicast_host_join(struct net_bridge_mdb_entry *mp, bool notify)
if (br_multicast_is_star_g(&mp->addr))
br_multicast_star_g_host_state(mp);
if (notify)
- br_mdb_notify(mp->br->dev, mp, NULL, RTM_NEWMDB);
+ br_mdb_notify(mp->br->dev, mp, NULL, RTM_NEWMDB, true);
}
if (br_group_is_l2(&mp->addr))
@@ -1246,7 +1251,7 @@ void br_multicast_host_leave(struct net_bridge_mdb_entry *mp, bool notify)
if (br_multicast_is_star_g(&mp->addr))
br_multicast_star_g_host_state(mp);
if (notify)
- br_mdb_notify(mp->br->dev, mp, NULL, RTM_DELMDB);
+ br_mdb_notify(mp->br->dev, mp, NULL, RTM_DELMDB, true);
}
static struct net_bridge_port_group *
@@ -1294,7 +1299,7 @@ __br_multicast_add_group(struct net_bridge *br,
rcu_assign_pointer(*pp, p);
if (blocked)
p->flags |= MDB_PG_FLAGS_BLOCKED;
- br_mdb_notify(br->dev, mp, p, RTM_NEWMDB);
+ br_mdb_notify(br->dev, mp, p, RTM_NEWMDB, true);
found:
if (igmpv2_mldv1)
@@ -2436,7 +2441,7 @@ static int br_ip4_multicast_igmp3_report(struct net_bridge *br,
break;
}
if (changed)
- br_mdb_notify(br->dev, mdst, pg, RTM_NEWMDB);
+ br_mdb_notify(br->dev, mdst, pg, RTM_NEWMDB, true);
unlock_continue:
spin_unlock_bh(&br->multicast_lock);
}
@@ -2575,7 +2580,7 @@ static int br_ip6_multicast_mld2_report(struct net_bridge *br,
break;
}
if (changed)
- br_mdb_notify(br->dev, mdst, pg, RTM_NEWMDB);
+ br_mdb_notify(br->dev, mdst, pg, RTM_NEWMDB, true);
unlock_continue:
spin_unlock_bh(&br->multicast_lock);
}
@@ -2660,26 +2665,71 @@ br_multicast_update_query_timer(struct net_bridge *br,
mod_timer(&query->timer, jiffies + br->multicast_querier_interval);
}
-static void br_port_mc_router_state_change(struct net_bridge_port *p,
+static void br_port_mc_router_state_change(struct net_bridge_port *port,
bool is_mc_router)
{
struct switchdev_attr attr = {
- .orig_dev = p->dev,
+ .orig_dev = port->dev,
.id = SWITCHDEV_ATTR_ID_PORT_MROUTER,
.flags = SWITCHDEV_F_DEFER,
.u.mrouter = is_mc_router,
};
struct net_bridge_mdb_entry *mp;
+ struct net_bridge *br = port->br;
struct hlist_node *n;
- switchdev_port_attr_set(p->dev, &attr, NULL);
+ switchdev_port_attr_set(port->dev, &attr, NULL);
/* Add/delete the router port to/from all multicast group
* called whle br->multicast_lock is held
*/
- hlist_for_each_entry_safe(mp, n, &p->br->mdb_list, mdb_node) {
- br_mdb_switchdev_port(mp, p, is_mc_router ?
- RTM_NEWMDB : RTM_DELMDB);
+ hlist_for_each_entry_safe(mp, n, &br->mdb_list, mdb_node) {
+ struct net_bridge_port_group __rcu **pp;
+ struct net_bridge_port_group *p;
+ int port_group_exists = 0;
+
+ if (is_mc_router) {
+ for (pp = &mp->ports;
+ (p = mlock_dereference(*pp, br)) != NULL;
+ pp = &p->next) {
+ if (p->key.port == port) {
+ port_group_exists = 1;
+ if (!(p->flags & MDB_PG_FLAGS_PERMANENT))
+ br_multicast_del_pg(mp, p, pp);
+ }
+
+ if ((unsigned long)p->key.port < (unsigned long)port)
+ break;
+ }
+
+ if (port_group_exists)
+ continue;
+
+ br_mdb_switchdev_port(mp, port, RTM_NEWMDB);
+ } else {
+ for (pp = &mp->ports;
+ (p = mlock_dereference(*pp, br)) != NULL;
+ pp = &p->next) {
+ if (p->key.port == port) {
+ port_group_exists = 1;
+ break;
+ }
+
+ if ((unsigned long)p->key.port < (unsigned long)port)
+ break;
+ }
+
+ if (port_group_exists)
+ continue;
+
+ p = br_multicast_new_port_group(port, &mp->addr, *pp, 0,
+ NULL, MCAST_EXCLUDE, RTPROT_KERNEL);
+ if (unlikely(!p))
+ continue;
+ rcu_assign_pointer(*pp, p);
+ br_mdb_notify(br->dev, mp, p, RTM_NEWMDB, false);
+ mod_timer(&p->timer, jiffies + br->multicast_membership_interval);
+ }
}
}
diff --git a/net/bridge/br_private.h b/net/bridge/br_private.h
index 5cba9d228b9c..9aa51508ba83 100644
--- a/net/bridge/br_private.h
+++ b/net/bridge/br_private.h
@@ -832,7 +832,8 @@ void br_mdb_hash_fini(struct net_bridge *br);
void br_mdb_switchdev_port(struct net_bridge_mdb_entry *mp,
struct net_bridge_port *p, int type);
void br_mdb_notify(struct net_device *dev, struct net_bridge_mdb_entry *mp,
- struct net_bridge_port_group *pg, int type);
+ struct net_bridge_port_group *pg, int type,
+ bool swdev_notify);
void br_rtr_notify(struct net_device *dev, struct net_bridge_port *port,
int type);
void br_multicast_del_pg(struct net_bridge_mdb_entry *mp,
--
2.17.1
Separate out switchdev notification to its own function in preparation
for the patch "bridge: Offload mrouter port forwarding to switchdev".
Signed-off-by: Joseph Huang <[email protected]>
---
net/bridge/br_mdb.c | 57 ++++++++++++++++++++++++-----------------
net/bridge/br_private.h | 2 ++
2 files changed, 36 insertions(+), 23 deletions(-)
diff --git a/net/bridge/br_mdb.c b/net/bridge/br_mdb.c
index 95fa4af0e8dd..e8684d798ec3 100644
--- a/net/bridge/br_mdb.c
+++ b/net/bridge/br_mdb.c
@@ -669,10 +669,9 @@ static void br_mdb_switchdev_host(struct net_device *dev,
br_mdb_switchdev_host_port(dev, lower_dev, mp, type);
}
-void br_mdb_notify(struct net_device *dev,
- struct net_bridge_mdb_entry *mp,
- struct net_bridge_port_group *pg,
- int type)
+void br_mdb_switchdev_port(struct net_bridge_mdb_entry *mp,
+ struct net_bridge_port *p,
+ int type)
{
struct br_mdb_complete_info *complete_info;
struct switchdev_obj_port_mdb mdb = {
@@ -681,30 +680,42 @@ void br_mdb_notify(struct net_device *dev,
.flags = SWITCHDEV_F_DEFER,
},
};
+
+ if (!p)
+ return;
+
+ br_switchdev_mdb_populate(&mdb, mp);
+
+ mdb.obj.orig_dev = p->dev;
+ switch (type) {
+ case RTM_NEWMDB:
+ complete_info = kmalloc(sizeof(*complete_info), GFP_ATOMIC);
+ if (!complete_info)
+ break;
+ complete_info->port = p;
+ complete_info->ip = mp->addr;
+ mdb.obj.complete_priv = complete_info;
+ mdb.obj.complete = br_mdb_complete;
+ if (switchdev_port_obj_add(p->dev, &mdb.obj, NULL))
+ kfree(complete_info);
+ break;
+ case RTM_DELMDB:
+ switchdev_port_obj_del(p->dev, &mdb.obj);
+ break;
+ }
+}
+
+void br_mdb_notify(struct net_device *dev,
+ struct net_bridge_mdb_entry *mp,
+ struct net_bridge_port_group *pg,
+ int type)
+{
struct net *net = dev_net(dev);
struct sk_buff *skb;
int err = -ENOBUFS;
if (pg) {
- br_switchdev_mdb_populate(&mdb, mp);
-
- mdb.obj.orig_dev = pg->key.port->dev;
- switch (type) {
- case RTM_NEWMDB:
- complete_info = kmalloc(sizeof(*complete_info), GFP_ATOMIC);
- if (!complete_info)
- break;
- complete_info->port = pg->key.port;
- complete_info->ip = mp->addr;
- mdb.obj.complete_priv = complete_info;
- mdb.obj.complete = br_mdb_complete;
- if (switchdev_port_obj_add(pg->key.port->dev, &mdb.obj, NULL))
- kfree(complete_info);
- break;
- case RTM_DELMDB:
- switchdev_port_obj_del(pg->key.port->dev, &mdb.obj);
- break;
- }
+ br_mdb_switchdev_port(mp, pg->key.port, type);
} else {
br_mdb_switchdev_host(dev, mp, type);
}
diff --git a/net/bridge/br_private.h b/net/bridge/br_private.h
index 7ce8a77cc6b6..5cba9d228b9c 100644
--- a/net/bridge/br_private.h
+++ b/net/bridge/br_private.h
@@ -829,6 +829,8 @@ br_multicast_new_port_group(struct net_bridge_port *port, struct br_ip *group,
u8 filter_mode, u8 rt_protocol);
int br_mdb_hash_init(struct net_bridge *br);
void br_mdb_hash_fini(struct net_bridge *br);
+void br_mdb_switchdev_port(struct net_bridge_mdb_entry *mp,
+ struct net_bridge_port *p, int type);
void br_mdb_notify(struct net_device *dev, struct net_bridge_mdb_entry *mp,
struct net_bridge_port_group *pg, int type);
void br_rtr_notify(struct net_device *dev, struct net_bridge_port *port,
--
2.17.1
On Tue, May 04, 2021 at 20:37, "Huang, Joseph" <[email protected]> wrote:
>> Hi,
>> This patch-set is inappropriate for -net, if at all. It's quite late over here and I'll
>> review the rest later, but I can say from a quick peek that patch 02 is
>> unacceptable for it increases the complexity with 1 order of magnitude of all
>> add/del call paths and some of them can be invoked on user packets. A lot of
>> this functionality should be "hidden" in the driver or done by a user-space
>> daemon/helper.
>> Most of the flooding behaviour changes must be hidden behind some new
>> option otherwise they'll break user setups that rely on the current. I'll review
>> the patches in detail over the following few days, net-next is closed anyway.
>>
>> Cheers,
>> Nik
>
> Hi Nik,
>
> Thanks for your quick response!
> Once you have a chance to review the set, please let me know how I can improve them to make them acceptable. These are real problems and we do need to fix them.
If I may make a suggestion: I also work with mv88e6xxx systems, and we
have the same issues with known multicast not being flooded to router
ports. Knowing that chipset, I see what you are trying to do.
But other chips may work differently. Imagine for example a switch where
there is a separate vector of router ports that the hardware can OR in
after looking up the group in the ATU. This implementation would render
the performance gains possible on that device useless. As another
example, you could imagine a device where an ATU operation exists that
sets a bit in the vector of every group in a particular database;
instead of having to update each entry individually.
I think we (mv88e6xxx) will have to accept that we need to add the
proper scaffolding to manage this on the driver side. That way the
bridge can stay generic. The bridge could just provide some MDB iterator
to save us from having to cache all the configured groups.
So basically:
- In mv88e6xxx, maintain a per-switch vector of router ports.
- When a ports router state is toggled:
1. Update the vector.
2. Ask the bridge to iterate through all applicable groups and update
the corresponding ATU entries.
- When a new MDB entry is updated, make sure to also OR in the current
vector of router ports in the DPV of the ATU entry.
I would be happy to help out with testing of this!
> If I may make a suggestion: I also work with mv88e6xxx systems, and we
> have the same issues with known multicast not being flooded to router
> ports. Knowing that chipset, I see what you are trying to do.
>
> But other chips may work differently. Imagine for example a switch where
> there is a separate vector of router ports that the hardware can OR in after
> looking up the group in the ATU. This implementation would render the
> performance gains possible on that device useless. As another example, you
> could imagine a device where an ATU operation exists that sets a bit in the
> vector of every group in a particular database; instead of having to update
> each entry individually.
>
> I think we (mv88e6xxx) will have to accept that we need to add the proper
> scaffolding to manage this on the driver side. That way the bridge can stay
> generic. The bridge could just provide some MDB iterator to save us from
> having to cache all the configured groups.
>
> So basically:
>
> - In mv88e6xxx, maintain a per-switch vector of router ports.
>
> - When a ports router state is toggled:
> 1. Update the vector.
> 2. Ask the bridge to iterate through all applicable groups and update
> the corresponding ATU entries.
>
> - When a new MDB entry is updated, make sure to also OR in the current
> vector of router ports in the DPV of the ATU entry.
>
>
> I would be happy to help out with testing of this!
Thanks for the suggestion/offer!
What patch 0002 does is that:
- When an mrouter port is added/deleted, it iterates over the list of mdb's
to add/delete that port to/from the group in the hardware (I think this is
what your bullet #2 does as well, except that one is done in the bridge,
and the other is done in the driver)
- When a group is added/deleted, it iterates over the list of mrouter ports
to add/delete the switchdev programming
I think what Nik is objecting to is that with this approach, there's now
a for-loop in the call paths (thus it "increases the complexity with 1 order
of magnitude), however I can't think of a way to avoid the looping (whether
done inside the bridge or in the driver) but still achieve the same result
(for Marvell at least).
I suspect that other SOHO switches might have this problem as well (Broadcom
comes to mind).
Thanks,
Joseph
On 05/05/2021 02:26, Huang, Joseph wrote:
>> If I may make a suggestion: I also work with mv88e6xxx systems, and we
>> have the same issues with known multicast not being flooded to router
>> ports. Knowing that chipset, I see what you are trying to do.
>>
>> But other chips may work differently. Imagine for example a switch where
>> there is a separate vector of router ports that the hardware can OR in after
>> looking up the group in the ATU. This implementation would render the
>> performance gains possible on that device useless. As another example, you
>> could imagine a device where an ATU operation exists that sets a bit in the
>> vector of every group in a particular database; instead of having to update
>> each entry individually.
>>
>> I think we (mv88e6xxx) will have to accept that we need to add the proper
>> scaffolding to manage this on the driver side. That way the bridge can stay
>> generic. The bridge could just provide some MDB iterator to save us from
>> having to cache all the configured groups.
>>
>> So basically:
>>
>> - In mv88e6xxx, maintain a per-switch vector of router ports.
>>
>> - When a ports router state is toggled:
>> 1. Update the vector.
>> 2. Ask the bridge to iterate through all applicable groups and update
>> the corresponding ATU entries.
>>
>> - When a new MDB entry is updated, make sure to also OR in the current
>> vector of router ports in the DPV of the ATU entry.
>>
>>
>> I would be happy to help out with testing of this!
>
> Thanks for the suggestion/offer!
>
> What patch 0002 does is that:
>
> - When an mrouter port is added/deleted, it iterates over the list of mdb's
> to add/delete that port to/from the group in the hardware (I think this is
> what your bullet #2 does as well, except that one is done in the bridge,
> and the other is done in the driver)
>
> - When a group is added/deleted, it iterates over the list of mrouter ports
> to add/delete the switchdev programming
>
> I think what Nik is objecting to is that with this approach, there's now
> a for-loop in the call paths (thus it "increases the complexity with 1 order
> of magnitude), however I can't think of a way to avoid the looping (whether
> done inside the bridge or in the driver) but still achieve the same result
> (for Marvell at least).
>
Note that I did not say to avoid it in the switchdev driver. :)
I said it should be in the driver or in some user-space helper, but it mustn't
affect non-switchdev software use cases so much.
You can check how mlxsw[1] deals with mdbs and router ports.
[1] drivers/net/ethernet/mellanox/mlxsw/spectrum_switchdev.c
> I suspect that other SOHO switches might have this problem as well (Broadcom
> comes to mind).
>
> Thanks,
> Joseph
>
On Tue, May 04, 2021 at 23:26, "Huang, Joseph" <[email protected]> wrote:
>> If I may make a suggestion: I also work with mv88e6xxx systems, and we
>> have the same issues with known multicast not being flooded to router
>> ports. Knowing that chipset, I see what you are trying to do.
>>
>> But other chips may work differently. Imagine for example a switch where
>> there is a separate vector of router ports that the hardware can OR in after
>> looking up the group in the ATU. This implementation would render the
>> performance gains possible on that device useless. As another example, you
>> could imagine a device where an ATU operation exists that sets a bit in the
>> vector of every group in a particular database; instead of having to update
>> each entry individually.
>>
>> I think we (mv88e6xxx) will have to accept that we need to add the proper
>> scaffolding to manage this on the driver side. That way the bridge can stay
>> generic. The bridge could just provide some MDB iterator to save us from
>> having to cache all the configured groups.
>>
>> So basically:
>>
>> - In mv88e6xxx, maintain a per-switch vector of router ports.
>>
>> - When a ports router state is toggled:
>> 1. Update the vector.
>> 2. Ask the bridge to iterate through all applicable groups and update
>> the corresponding ATU entries.
>>
>> - When a new MDB entry is updated, make sure to also OR in the current
>> vector of router ports in the DPV of the ATU entry.
>>
>>
>> I would be happy to help out with testing of this!
>
> Thanks for the suggestion/offer!
>
> What patch 0002 does is that:
>
> - When an mrouter port is added/deleted, it iterates over the list of mdb's
> to add/delete that port to/from the group in the hardware (I think this is
> what your bullet #2 does as well, except that one is done in the bridge,
> and the other is done in the driver)
>
> - When a group is added/deleted, it iterates over the list of mrouter ports
> to add/delete the switchdev programming
>
> I think what Nik is objecting to is that with this approach, there's now
> a for-loop in the call paths (thus it "increases the complexity with 1 order
> of magnitude), however I can't think of a way to avoid the looping (whether
> done inside the bridge or in the driver) but still achieve the same result
> (for Marvell at least).
(I will stop trying to read Nikolay's mind and go forward with my own
opinions now :))
The problem with solving this at the bridge layer is that you miss out
on optimizations that are available at lower layers. As an example:
br0
/ | \
swp0 swp1 swp2
(R) (R)
With two router ports, any new group added/removed to/from swp0 would
incur 3 individual ATU operations: First adding swp0, then each router
port individually in your loop. If you have the vector prepared in the
driver, you can batch them together in one operation.
This also atomically transitions the group from unknown to known without
disrupting any streams towards a router. In the bridge-layer solution,
flows will still be blocked in the (admittedly small) window between
adding swp0 and swp{1,2}.
> I suspect that other SOHO switches might have this problem as well (Broadcom
> comes to mind).
I suspect you are right. That is why I suggested implementing the
iterator in the bridge that can be reused by all drivers. Something
along the lines of br_fdb_replay. The rest should mostly be hardware
specific anyway.