2022-07-08 18:54:58

by Jonathan Toppins

[permalink] [raw]
Subject: [PATCH net-next v4 1/2] bond: add mac filter option for balance-xor

Implement a MAC filter that prevents duplicate frame delivery when
handling BUM traffic. This attempts to partially replicate OvS SLB
Bonding[1] like functionality without requiring significant change
in the Linux bridging code.

A typical network setup for this feature would be:

.--------------------------------------------.
| .--------------------. |
| | | |
.-------------------. | |
| | Bond 0 | | | |
| .--'---. .---'--. | | |
.----|-| eth0 |-| eth1 |-|----. .-----+----. .----+------.
| | '------' '------' | | | Switch 1 | | Switch 2 |
| '---,---------------' | | +---+ |
| / | '----+-----' '----+------'
| .---'---. .------. | | |
| | br0 |----| VM 1 | | ~~~~~~~~~~~~~~~~~~~~~
| '-------' '------' | ( )
| | .------. | ( Rest of Network )
| '--------| VM # | | (_____________________)
| '------' |
| Host 1 |
'-----------------------------'

Where 'VM1' and 'VM#' are hosts connected to a Linux bridge, br0, with
bond0 and its associated links, eth0 & eth1, provide ingress/egress. One
can assume bond0, br1, and hosts VM1 to VM# are all contained in a
single box, as depicted. Interfaces eth0 and eth1 provide redundant
connections to the data center with the requirement to use all bandwidth
when the system is functioning normally. Switch 1 and Switch 2 are
physical switches that do not implement any advanced L2 management
features such as MLAG, Cisco's VPC, or LACP.

Combining this feature with vlan+srcmac hash policy allows a user to
create an access network without the need to use expensive switches that
support features like Cisco's VCP.

[1] https://docs.openvswitch.org/en/latest/topics/bonding/#slb-bonding

Co-developed-by: Long Xin <[email protected]>
Signed-off-by: Long Xin <[email protected]>
Signed-off-by: Jonathan Toppins <[email protected]>
---

Notes:
v2:
* dropped needless abstraction functions and put code in module init
* renamed variable "rc" to "ret" to stay consistent with most of the
code
* fixed parameter setting management, when arp-monitor is turned on
this feature will be turned off similar to how miimon and arp-monitor
interact
* renamed bond_xor_recv to bond_mac_filter_recv for a little more
clarity
* it appears the implied default return code for any bonding recv probe
must be `RX_HANDLER_ANOTHER`. Changed the default return code of
bond_mac_filter_recv to use this return value to not break skb
processing when the skb dev is switched to the bond dev:
`skb->dev = bond->dev`

v3: Nik's comments
* clarified documentation
* fixed inline and basic reverse Christmas tree formatting
* zero'ed entry in mac_create
* removed read_lock taking in bond_mac_filter_recv
* made has_expired() atomic and removed critical sections
surrounding calls to has_expired(), this also removed the
use-after-free that would have occurred:
spin_lock_irqsave(&entry->lock, flags);
if (has_expired(bond, entry))
mac_delete(bond, entry);
spin_unlock_irqrestore(&entry->lock, flags); <---
* moved init/destroy of mac_filter_tbl to bond_open/bond_close
this removed the complex option dependencies, the only behavioural
change the user will see is if the bond is up and mac_filter is
enabled if they try and set arp_interval they will receive -EBUSY
* in bond_changelink moved processing of mac_filter option just below
mode processing

v4:
* rebase to latest net-next
* removed rcu_read_lock() call in bond_mac_insert()
* used specific spin_lock_{}() calls instead of the irqsave version
* Outstanding comments from Nik:
https://lore.kernel.org/netdev/[email protected]/
- old version of the patch still under discussion
https://lore.kernel.org/netdev/[email protected]/
* response: it has been over a month now and no new comments have come in
so I am assuming there is nothing left to the discussion
- What if anyone decides at some point 5 seconds are not enough or too much?
* response: I think making that configurable at a later time should not
prevent the inclusion of the initial feature.
- This bit is pointless, you still have races even though not critical
* response: if there are races please point them out simply making a
comment about an enum doesn't show the race.
- This is not scalable at all, you get the lock even to update the
expire on *every* packet, most users go for L3 or L3+L4 and will hit
this with different flows on different CPUs.
* response: we take the lock to update the value, so that receive
traffic is not dropped. Further any implementation, bpf or nft,
will also have to take locks to update MAC entries. If there is a
specific locking order that will be less bad, I would appreciate
that discussion. Concerning L3 or L3+L4 hashing this is not the
data I have, in fact the most utilized hash is layer2.

Documentation/networking/bonding.rst | 20 +++
drivers/net/bonding/Makefile | 2 +-
drivers/net/bonding/bond_mac_filter.c | 202 ++++++++++++++++++++++++++
drivers/net/bonding/bond_mac_filter.h | 37 +++++
drivers/net/bonding/bond_main.c | 30 ++++
drivers/net/bonding/bond_netlink.c | 14 ++
drivers/net/bonding/bond_options.c | 81 +++++++++--
drivers/net/bonding/bonding_priv.h | 1 +
include/net/bond_options.h | 1 +
include/net/bonding.h | 3 +
include/uapi/linux/if_link.h | 1 +
11 files changed, 375 insertions(+), 17 deletions(-)
create mode 100644 drivers/net/bonding/bond_mac_filter.c
create mode 100644 drivers/net/bonding/bond_mac_filter.h

diff --git a/Documentation/networking/bonding.rst b/Documentation/networking/bonding.rst
index 53a18ff7cf23..430d1360bd20 100644
--- a/Documentation/networking/bonding.rst
+++ b/Documentation/networking/bonding.rst
@@ -550,6 +550,26 @@ lacp_rate

The default is slow.

+mac_filter
+
+ Tells the bonding device to drop frames received who's source MAC
+ address matches entries in a filter table. The filter table is
+ populated when the bond transmits frames. This is similar in
+ concept to the MAC learning table implemented in the bridge code.
+
+ This filtering is only enabled for the balance-xor bonding mode.
+
+ off or 0
+ Turns the feature off
+
+ number
+ A number greater than zero turns the feature on and sets
+ the maximum number of MAC addresses to store in the hash
+ table. This value is used as the exponent in a 2^N calculation
+ to determine the actual size of the hashtable.
+
+ The default is off.
+
max_bonds

Specifies the number of bonding devices to create for this
diff --git a/drivers/net/bonding/Makefile b/drivers/net/bonding/Makefile
index 30e8ae3da2da..5dbc360a8522 100644
--- a/drivers/net/bonding/Makefile
+++ b/drivers/net/bonding/Makefile
@@ -5,7 +5,7 @@

obj-$(CONFIG_BONDING) += bonding.o

-bonding-objs := bond_main.o bond_3ad.o bond_alb.o bond_sysfs.o bond_sysfs_slave.o bond_debugfs.o bond_netlink.o bond_options.o
+bonding-objs := bond_main.o bond_3ad.o bond_alb.o bond_sysfs.o bond_sysfs_slave.o bond_debugfs.o bond_netlink.o bond_options.o bond_mac_filter.o

proc-$(CONFIG_PROC_FS) += bond_procfs.o
bonding-objs += $(proc-y)
diff --git a/drivers/net/bonding/bond_mac_filter.c b/drivers/net/bonding/bond_mac_filter.c
new file mode 100644
index 000000000000..9fed467fda16
--- /dev/null
+++ b/drivers/net/bonding/bond_mac_filter.c
@@ -0,0 +1,202 @@
+// SPDX-License-Identifier: GPL-2.0-only
+/*
+ * Filter received frames based on MAC addresses "behind" the bond.
+ */
+
+#include "bonding_priv.h"
+
+static const struct rhashtable_params bond_rht_params = {
+ .head_offset = offsetof(struct bond_mac_cache_entry, rhnode),
+ .key_offset = offsetof(struct bond_mac_cache_entry, key),
+ .key_len = sizeof(struct mac_addr),
+ .automatic_shrinking = true,
+};
+
+static unsigned long hold_time(void)
+{
+ return msecs_to_jiffies(5000);
+}
+
+static bool has_expired(struct bond_mac_cache_entry *entry)
+{
+ return time_before(entry->expired, jiffies);
+}
+
+static bool has_expired_sync(struct bond_mac_cache_entry *entry)
+{
+ bool ret;
+
+ spin_lock(&entry->lock);
+ ret = has_expired(entry);
+ spin_unlock(&entry->lock);
+
+ return ret;
+}
+
+static void mac_delete_rcu(struct callback_head *head)
+{
+ kmem_cache_free(bond_mac_cache,
+ container_of(head, struct bond_mac_cache_entry, rcu));
+}
+
+static int mac_remove(struct bonding *bond,
+ struct bond_mac_cache_entry *entry)
+{
+ set_bit(BOND_MAC_DEAD, &entry->flags);
+ return rhashtable_remove_fast(bond->mac_filter_tbl,
+ &entry->rhnode,
+ bond->mac_filter_tbl->p);
+}
+
+void bond_mac_hash_release_entries(struct work_struct *work)
+{
+ struct bonding *bond = container_of(work, struct bonding,
+ mac_work.work);
+ struct bond_mac_cache_entry *entry;
+ struct rhashtable_iter iter;
+
+ rhashtable_walk_enter(bond->mac_filter_tbl, &iter);
+ rhashtable_walk_start(&iter);
+ while ((entry = rhashtable_walk_next(&iter)) != NULL) {
+ if (IS_ERR(entry))
+ continue;
+
+ spin_lock_bh(&entry->lock);
+ if (!has_expired(entry)) {
+ spin_unlock_bh(&entry->lock);
+ continue;
+ }
+
+ mac_remove(bond, entry);
+ spin_unlock_bh(&entry->lock);
+ call_rcu(&entry->rcu, mac_delete_rcu);
+ }
+ rhashtable_walk_stop(&iter);
+ rhashtable_walk_exit(&iter);
+ queue_delayed_work(bond->wq, &bond->mac_work,
+ msecs_to_jiffies(5 * 60 * 1000));
+}
+
+int bond_mac_hash_init(struct bonding *bond)
+{
+ int ret;
+
+ bond->mac_filter_tbl = kzalloc(sizeof(*bond->mac_filter_tbl),
+ GFP_KERNEL);
+ if (!bond->mac_filter_tbl)
+ return -ENOMEM;
+
+ ret = rhashtable_init(bond->mac_filter_tbl, &bond_rht_params);
+ if (ret) {
+ kfree(bond->mac_filter_tbl);
+ bond->mac_filter_tbl = NULL;
+ }
+
+ bond->mac_filter_tbl->p.max_size = 1 << bond->params.mac_filter;
+ netdev_dbg(bond->dev, "mac_filter hash table size: %d\n",
+ bond->mac_filter_tbl->p.max_size);
+ return ret;
+}
+
+static void bond_mac_free_entry(void *entry, void *ctx)
+{
+ kmem_cache_free((struct kmem_cache *)ctx, entry);
+}
+
+void bond_mac_hash_destroy(struct bonding *bond)
+{
+ if (bond->mac_filter_tbl) {
+ rhashtable_free_and_destroy(bond->mac_filter_tbl,
+ bond_mac_free_entry,
+ bond_mac_cache);
+ kfree(bond->mac_filter_tbl);
+ bond->mac_filter_tbl = NULL;
+ }
+}
+
+static void mac_update(struct bond_mac_cache_entry *entry)
+{
+ entry->expired = jiffies + hold_time();
+}
+
+static int mac_create(struct bonding *bond, const u8 *addr)
+{
+ struct bond_mac_cache_entry *entry;
+ int ret;
+
+ entry = kmem_cache_alloc(bond_mac_cache, GFP_ATOMIC);
+ if (!entry)
+ return -ENOMEM;
+
+ memset(entry, 0, sizeof(*entry));
+ spin_lock_init(&entry->lock);
+ memcpy(&entry->key, addr, sizeof(entry->key));
+ mac_update(entry);
+ ret = rhashtable_lookup_insert_fast(bond->mac_filter_tbl,
+ &entry->rhnode,
+ bond->mac_filter_tbl->p);
+ if (ret) {
+ kmem_cache_free(bond_mac_cache, entry);
+ if (ret == -EEXIST)
+ return 0;
+ netdev_dbg(bond->dev, "Failed to insert mac entry %d\n", ret);
+ }
+ return ret;
+}
+
+static struct bond_mac_cache_entry *mac_find(struct bonding *bond,
+ const u8 *addr)
+{
+ struct mac_addr key;
+
+ memcpy(&key, addr, sizeof(key));
+ return rhashtable_lookup(bond->mac_filter_tbl, &key,
+ bond->mac_filter_tbl->p);
+}
+
+int bond_mac_insert(struct bonding *bond, const u8 *addr)
+{
+ struct bond_mac_cache_entry *entry;
+ int ret = 0;
+
+ if (!is_valid_ether_addr(addr))
+ return -EINVAL;
+
+ entry = mac_find(bond, addr);
+ if (entry) {
+ spin_lock(&entry->lock);
+ if (!test_bit(BOND_MAC_DEAD, &entry->flags)) {
+ mac_update(entry);
+ spin_unlock(&entry->lock);
+ goto out;
+ }
+ spin_unlock(&entry->lock);
+ }
+
+ ret = mac_create(bond, addr);
+
+out:
+ return ret;
+}
+
+int bond_mac_filter_recv(const struct sk_buff *skb, struct bonding *bond,
+ struct slave *slave)
+{
+ struct bond_mac_cache_entry *entry;
+ const struct ethhdr *mac_hdr;
+ int ret = RX_HANDLER_ANOTHER;
+
+ mac_hdr = (struct ethhdr *)skb_mac_header(skb);
+ if (is_multicast_ether_addr(mac_hdr->h_dest) &&
+ slave != rcu_dereference(bond->curr_active_slave)) {
+ ret = RX_HANDLER_CONSUMED;
+ goto out;
+ }
+
+ entry = mac_find(bond, mac_hdr->h_source);
+ if (entry && !has_expired_sync(entry))
+ ret = RX_HANDLER_CONSUMED;
+
+out:
+ return ret;
+}
diff --git a/drivers/net/bonding/bond_mac_filter.h b/drivers/net/bonding/bond_mac_filter.h
new file mode 100644
index 000000000000..4ebec63e41e7
--- /dev/null
+++ b/drivers/net/bonding/bond_mac_filter.h
@@ -0,0 +1,37 @@
+/* SPDX-License-Identifier: GPL-2.0-only
+ *
+ * Filter received frames based on MAC addresses "behind" the bond.
+ */
+
+#ifndef _BOND_MAC_FILTER_H
+#define _BOND_MAC_FILTER_H
+#include <net/bonding.h>
+#include <linux/spinlock.h>
+#include <linux/rhashtable.h>
+
+enum {
+ BOND_MAC_DEAD,
+};
+
+struct bond_mac_cache_entry {
+ struct rhash_head rhnode;
+ struct mac_addr key;
+
+ spinlock_t lock; /* protects used member */
+ unsigned long flags;
+ unsigned long expired;
+ struct rcu_head rcu;
+};
+
+extern struct kmem_cache *bond_mac_cache;
+
+void bond_mac_hash_release_entries(struct work_struct *work);
+int bond_mac_hash_init(struct bonding *bond);
+void bond_mac_hash_destroy(struct bonding *bond);
+
+int bond_mac_insert(struct bonding *bond, const u8 *addr);
+int bond_mac_filter_recv(const struct sk_buff *skb,
+ struct bonding *bond,
+ struct slave *slave);
+
+#endif
diff --git a/drivers/net/bonding/bond_main.c b/drivers/net/bonding/bond_main.c
index e75acb14d066..82e716e848f8 100644
--- a/drivers/net/bonding/bond_main.c
+++ b/drivers/net/bonding/bond_main.c
@@ -207,6 +207,7 @@ MODULE_PARM_DESC(lp_interval, "The number of seconds between instances where "
atomic_t netpoll_block_tx = ATOMIC_INIT(0);
#endif

+struct kmem_cache *bond_mac_cache __read_mostly;
unsigned int bond_net_id __read_mostly;

static const struct flow_dissector_key flow_keys_bonding_keys[] = {
@@ -4151,6 +4152,7 @@ void bond_work_init_all(struct bonding *bond)
INIT_DELAYED_WORK(&bond->arp_work, bond_arp_monitor);
INIT_DELAYED_WORK(&bond->ad_work, bond_3ad_state_machine_handler);
INIT_DELAYED_WORK(&bond->slave_arr_work, bond_slave_arr_handler);
+ INIT_DELAYED_WORK(&bond->mac_work, bond_mac_hash_release_entries);
}

static void bond_work_cancel_all(struct bonding *bond)
@@ -4161,6 +4163,7 @@ static void bond_work_cancel_all(struct bonding *bond)
cancel_delayed_work_sync(&bond->ad_work);
cancel_delayed_work_sync(&bond->mcast_work);
cancel_delayed_work_sync(&bond->slave_arr_work);
+ cancel_delayed_work_sync(&bond->mac_work);
}

static int bond_open(struct net_device *bond_dev)
@@ -4208,6 +4211,15 @@ static int bond_open(struct net_device *bond_dev)
bond_3ad_initiate_agg_selection(bond, 1);
}

+ if (BOND_MODE(bond) == BOND_MODE_XOR && bond->params.mac_filter) {
+ int ret = bond_mac_hash_init(bond);
+
+ if (ret)
+ return ret;
+ bond->recv_probe = bond_mac_filter_recv;
+ queue_delayed_work(bond->wq, &bond->mac_work, 0);
+ }
+
if (bond_mode_can_use_xmit_hash(bond))
bond_update_slave_arr(bond, NULL);

@@ -4223,6 +4235,7 @@ static int bond_close(struct net_device *bond_dev)
if (bond_is_lb(bond))
bond_alb_deinitialize(bond);
bond->recv_probe = NULL;
+ bond_mac_hash_destroy(bond);

return 0;
}
@@ -5077,6 +5090,13 @@ static struct slave *bond_xmit_3ad_xor_slave_get(struct bonding *bond,
if (unlikely(!count))
return NULL;

+ if (BOND_MODE(bond) == BOND_MODE_XOR && bond->params.mac_filter) {
+ const struct ethhdr *mac_hdr;
+
+ mac_hdr = (struct ethhdr *)skb_mac_header(skb);
+ if (bond_mac_insert(bond, mac_hdr->h_source))
+ return NULL;
+ }
slave = slaves->arr[hash % count];
return slave;
}
@@ -6158,6 +6178,7 @@ static int bond_check_params(struct bond_params *params)
params->downdelay = downdelay;
params->peer_notif_delay = 0;
params->use_carrier = use_carrier;
+ params->mac_filter = 0;
params->lacp_active = 1;
params->lacp_fast = lacp_fast;
params->primary[0] = 0;
@@ -6350,6 +6371,14 @@ static int __init bonding_init(void)
goto err;
}

+ bond_mac_cache = kmem_cache_create("bond_mac_cache",
+ sizeof(struct bond_mac_cache_entry),
+ 0, SLAB_HWCACHE_ALIGN, NULL);
+ if (!bond_mac_cache) {
+ res = -ENOMEM;
+ goto err;
+ }
+
skb_flow_dissector_init(&flow_keys_bonding,
flow_keys_bonding_keys,
ARRAY_SIZE(flow_keys_bonding_keys));
@@ -6379,6 +6408,7 @@ static void __exit bonding_exit(void)
/* Make sure we don't have an imbalance on our netpoll blocking */
WARN_ON(atomic_read(&netpoll_block_tx));
#endif
+ kmem_cache_destroy(bond_mac_cache);
}

module_init(bonding_init);
diff --git a/drivers/net/bonding/bond_netlink.c b/drivers/net/bonding/bond_netlink.c
index c2d080fc4fc4..d166630fc835 100644
--- a/drivers/net/bonding/bond_netlink.c
+++ b/drivers/net/bonding/bond_netlink.c
@@ -117,6 +117,7 @@ static const struct nla_policy bond_policy[IFLA_BOND_MAX + 1] = {
[IFLA_BOND_PEER_NOTIF_DELAY] = { .type = NLA_U32 },
[IFLA_BOND_MISSED_MAX] = { .type = NLA_U8 },
[IFLA_BOND_NS_IP6_TARGET] = { .type = NLA_NESTED },
+ [IFLA_BOND_MAC_FILTER] = { .type = NLA_U8 },
};

static const struct nla_policy bond_slave_policy[IFLA_BOND_SLAVE_MAX + 1] = {
@@ -196,6 +197,15 @@ static int bond_changelink(struct net_device *bond_dev, struct nlattr *tb[],
if (err)
return err;
}
+ if (data[IFLA_BOND_MAC_FILTER]) {
+ u8 mac_filter = nla_get_u8(data[IFLA_BOND_MAC_FILTER]);
+
+ bond_opt_initval(&newval, mac_filter);
+ err = __bond_opt_set(bond, BOND_OPT_MAC_FILTER, &newval,
+ data[IFLA_BOND_MAC_FILTER], extack);
+ if (err)
+ return err;
+ }
if (data[IFLA_BOND_ACTIVE_SLAVE]) {
int ifindex = nla_get_u32(data[IFLA_BOND_ACTIVE_SLAVE]);
struct net_device *slave_dev;
@@ -610,6 +620,7 @@ static size_t bond_get_size(const struct net_device *bond_dev)
/* IFLA_BOND_NS_IP6_TARGET */
nla_total_size(sizeof(struct nlattr)) +
nla_total_size(sizeof(struct in6_addr)) * BOND_MAX_NS_TARGETS +
+ nla_total_size(sizeof(u8)) + /* IFLA_BOND_MAC_FILTER */
0;
}

@@ -768,6 +779,9 @@ static int bond_fill_info(struct sk_buff *skb,
if (nla_put_u8(skb, IFLA_BOND_MISSED_MAX,
bond->params.missed_max))
goto nla_put_failure;
+ if (nla_put_u8(skb, IFLA_BOND_MAC_FILTER,
+ bond->params.mac_filter))
+ goto nla_put_failure;

if (BOND_MODE(bond) == BOND_MODE_8023AD) {
struct ad_info info;
diff --git a/drivers/net/bonding/bond_options.c b/drivers/net/bonding/bond_options.c
index 3498db1c1b3c..bd791775fd6b 100644
--- a/drivers/net/bonding/bond_options.c
+++ b/drivers/net/bonding/bond_options.c
@@ -15,6 +15,7 @@
#include <linux/sched/signal.h>

#include <net/bonding.h>
+#include "bonding_priv.h"

static int bond_option_active_slave_set(struct bonding *bond,
const struct bond_opt_value *newval);
@@ -84,7 +85,8 @@ static int bond_option_ad_user_port_key_set(struct bonding *bond,
const struct bond_opt_value *newval);
static int bond_option_missed_max_set(struct bonding *bond,
const struct bond_opt_value *newval);
-
+static int bond_option_mac_filter_set(struct bonding *bond,
+ const struct bond_opt_value *newval);

static const struct bond_opt_value bond_mode_tbl[] = {
{ "balance-rr", BOND_MODE_ROUNDROBIN, BOND_VALFLAG_DEFAULT},
@@ -226,6 +228,12 @@ static const struct bond_opt_value bond_missed_max_tbl[] = {
{ NULL, -1, 0},
};

+static const struct bond_opt_value bond_mac_filter_tbl[] = {
+ { "off", 0, BOND_VALFLAG_MIN | BOND_VALFLAG_DEFAULT},
+ { "maxval", 18, BOND_VALFLAG_MAX},
+ { NULL, -1, 0}
+};
+
static const struct bond_option bond_opts[BOND_OPT_LAST] = {
[BOND_OPT_MODE] = {
.id = BOND_OPT_MODE,
@@ -490,7 +498,16 @@ static const struct bond_option bond_opts[BOND_OPT_LAST] = {
.desc = "Delay between each peer notification on failover event, in milliseconds",
.values = bond_intmax_tbl,
.set = bond_option_peer_notif_delay_set
- }
+ },
+ [BOND_OPT_MAC_FILTER] = {
+ .id = BOND_OPT_MAC_FILTER,
+ .name = "mac_filter",
+ .unsuppmodes = BOND_MODE_ALL_EX(BIT(BOND_MODE_XOR)),
+ .desc = "filter received frames based on MAC addresses that have transmitted from the bond, number of MAC addresses to track",
+ .flags = BOND_OPTFLAG_IFDOWN,
+ .values = bond_mac_filter_tbl,
+ .set = bond_option_mac_filter_set
+ },
};

/* Searches for an option by name */
@@ -855,25 +872,38 @@ static bool bond_set_tls_features(struct bonding *bond)
return true;
}

+static void disable_arp_enable_mii(struct bonding *bond, const char *feature,
+ const char *suffix)
+{
+ if (bond->params.arp_interval) {
+ netdev_dbg(bond->dev, "%s%s is incompatible with arp monitoring, start mii monitoring\n",
+ feature, suffix);
+ /* disable arp monitoring */
+ bond->params.arp_interval = 0;
+ }
+
+ if (!bond->params.miimon) {
+ /* set miimon to default value */
+ bond->params.miimon = BOND_DEFAULT_MIIMON;
+ netdev_dbg(bond->dev, "Setting MII monitoring interval to %d\n",
+ bond->params.miimon);
+ }
+}
+
static int bond_option_mode_set(struct bonding *bond,
const struct bond_opt_value *newval)
{
- if (!bond_mode_uses_arp(newval->value)) {
- if (bond->params.arp_interval) {
- netdev_dbg(bond->dev, "%s mode is incompatible with arp monitoring, start mii monitoring\n",
- newval->string);
- /* disable arp monitoring */
- bond->params.arp_interval = 0;
- }
-
- if (!bond->params.miimon) {
- /* set miimon to default value */
- bond->params.miimon = BOND_DEFAULT_MIIMON;
- netdev_dbg(bond->dev, "Setting MII monitoring interval to %d\n",
- bond->params.miimon);
- }
+ if (bond->params.mac_filter && newval->value != BOND_MODE_XOR) {
+ netdev_dbg(bond->dev, "%s mode is incompatible with mac filtering, disabling\n",
+ newval->string);
+ bond->params.mac_filter = 0;
}

+ if (!bond_mode_uses_arp(newval->value))
+ disable_arp_enable_mii(bond, newval->string, " mode");
+ else if (bond->params.mac_filter && bond->params.arp_interval)
+ disable_arp_enable_mii(bond, "MAC filtering", "");
+
if (newval->value == BOND_MODE_ALB)
bond->params.tlb_dynamic_lb = 1;

@@ -1061,6 +1091,17 @@ static int bond_option_use_carrier_set(struct bonding *bond,
return 0;
}

+static int bond_option_mac_filter_set(struct bonding *bond,
+ const struct bond_opt_value *newval)
+{
+ if (newval->value && bond->params.arp_interval)
+ disable_arp_enable_mii(bond, "MAC filtering", "");
+
+ netdev_dbg(bond->dev, "Setting mac_filter to %llu\n", newval->value);
+ bond->params.mac_filter = newval->value;
+ return 0;
+}
+
/* There are two tricky bits here. First, if ARP monitoring is activated, then
* we must disable MII monitoring. Second, if the ARP timer isn't running,
* we must start it.
@@ -1068,6 +1109,14 @@ static int bond_option_use_carrier_set(struct bonding *bond,
static int bond_option_arp_interval_set(struct bonding *bond,
const struct bond_opt_value *newval)
{
+ if (newval->value && bond->params.mac_filter) {
+ if (bond->dev->flags & IFF_UP)
+ return -EBUSY;
+
+ netdev_dbg(bond->dev, "MAC filtering cannot be used with ARP monitoring. Disabling MAC filtering\n");
+ bond->params.mac_filter = 0;
+ }
+
netdev_dbg(bond->dev, "Setting ARP monitoring interval to %llu\n",
newval->value);
bond->params.arp_interval = newval->value;
diff --git a/drivers/net/bonding/bonding_priv.h b/drivers/net/bonding/bonding_priv.h
index 48cdf3a49a7d..0299f8bcb5fd 100644
--- a/drivers/net/bonding/bonding_priv.h
+++ b/drivers/net/bonding/bonding_priv.h
@@ -15,6 +15,7 @@
#ifndef _BONDING_PRIV_H
#define _BONDING_PRIV_H
#include <generated/utsrelease.h>
+#include "bond_mac_filter.h"

#define DRV_NAME "bonding"
#define DRV_DESCRIPTION "Ethernet Channel Bonding Driver"
diff --git a/include/net/bond_options.h b/include/net/bond_options.h
index d2aea5cf1e41..25a9ef30b047 100644
--- a/include/net/bond_options.h
+++ b/include/net/bond_options.h
@@ -68,6 +68,7 @@ enum {
BOND_OPT_MISSED_MAX,
BOND_OPT_NS_TARGETS,
BOND_OPT_PRIO,
+ BOND_OPT_MAC_FILTER,
BOND_OPT_LAST
};

diff --git a/include/net/bonding.h b/include/net/bonding.h
index 6e78d657aa05..ad5b9c43b1e9 100644
--- a/include/net/bonding.h
+++ b/include/net/bonding.h
@@ -125,6 +125,7 @@ struct bond_params {
int miimon;
u8 num_peer_notif;
u8 missed_max;
+ u8 mac_filter;
int arp_interval;
int arp_validate;
int arp_all_targets;
@@ -251,6 +252,7 @@ struct bonding {
struct delayed_work alb_work;
struct delayed_work ad_work;
struct delayed_work mcast_work;
+ struct delayed_work mac_work;
struct delayed_work slave_arr_work;
#ifdef CONFIG_DEBUG_FS
/* debugging support via debugfs */
@@ -263,6 +265,7 @@ struct bonding {
spinlock_t ipsec_lock;
#endif /* CONFIG_XFRM_OFFLOAD */
struct bpf_prog *xdp_prog;
+ struct rhashtable *mac_filter_tbl;
};

#define bond_slave_get_rcu(dev) \
diff --git a/include/uapi/linux/if_link.h b/include/uapi/linux/if_link.h
index e36d9d2c65a7..c510e8be8479 100644
--- a/include/uapi/linux/if_link.h
+++ b/include/uapi/linux/if_link.h
@@ -936,6 +936,7 @@ enum {
IFLA_BOND_AD_LACP_ACTIVE,
IFLA_BOND_MISSED_MAX,
IFLA_BOND_NS_IP6_TARGET,
+ IFLA_BOND_MAC_FILTER,
__IFLA_BOND_MAX,
};

--
2.31.1


2022-07-09 03:45:20

by Jakub Kicinski

[permalink] [raw]
Subject: Re: [PATCH net-next v4 1/2] bond: add mac filter option for balance-xor

On Fri, 8 Jul 2022 14:41:56 -0400 Jonathan Toppins wrote:
> Implement a MAC filter that prevents duplicate frame delivery when
> handling BUM traffic. This attempts to partially replicate OvS SLB
> Bonding[1] like functionality without requiring significant change
> in the Linux bridging code.

You can't post the user space patches in the same series, patchwork
will think they are supposed to be applied to the same tree. Post them
separately please, they'll land on the list next to each other if you
send them at the same time, that's good enough.

https://www.kernel.org/doc/html/latest/process/maintainer-netdev.html#how-do-i-post-corresponding-changes-to-user-space-components

2022-07-13 04:19:39

by Nikolay Aleksandrov

[permalink] [raw]
Subject: Re: [RESEND PATCH net-next v4] bond: add mac filter option for balance-xor

On 7/11/22 15:07, Jonathan Toppins wrote:
> Implement a MAC filter that prevents duplicate frame delivery when
> handling BUM traffic. This attempts to partially replicate OvS SLB
> Bonding[1] like functionality without requiring significant change
> in the Linux bridging code.
>
> A typical network setup for this feature would be:
>
> .--------------------------------------------.
> | .--------------------. |
> | | | |
> .-------------------. | |
> | | Bond 0 | | | |
> | .--'---. .---'--. | | |
> .----|-| eth0 |-| eth1 |-|----. .-----+----. .----+------.
> | | '------' '------' | | | Switch 1 | | Switch 2 |
> | '---,---------------' | | +---+ |
> | / | '----+-----' '----+------'
> | .---'---. .------. | | |
> | | br0 |----| VM 1 | | ~~~~~~~~~~~~~~~~~~~~~
> | '-------' '------' | ( )
> | | .------. | ( Rest of Network )
> | '--------| VM # | | (_____________________)
> | '------' |
> | Host 1 |
> '-----------------------------'
>
> Where 'VM1' and 'VM#' are hosts connected to a Linux bridge, br0, with
> bond0 and its associated links, eth0 & eth1, provide ingress/egress. One
> can assume bond0, br1, and hosts VM1 to VM# are all contained in a
> single box, as depicted. Interfaces eth0 and eth1 provide redundant
> connections to the data center with the requirement to use all bandwidth
> when the system is functioning normally. Switch 1 and Switch 2 are
> physical switches that do not implement any advanced L2 management
> features such as MLAG, Cisco's VPC, or LACP.
>
> Combining this feature with vlan+srcmac hash policy allows a user to
> create an access network without the need to use expensive switches that
> support features like Cisco's VCP.
>
> [1] https://docs.openvswitch.org/en/latest/topics/bonding/#slb-bonding
>
> Co-developed-by: Long Xin <[email protected]>
> Signed-off-by: Long Xin <[email protected]>
> Signed-off-by: Jonathan Toppins <[email protected]>
> ---
>
> Notes:
> v2:
> * dropped needless abstraction functions and put code in module init
> * renamed variable "rc" to "ret" to stay consistent with most of the
> code
> * fixed parameter setting management, when arp-monitor is turned on
> this feature will be turned off similar to how miimon and arp-monitor
> interact
> * renamed bond_xor_recv to bond_mac_filter_recv for a little more
> clarity
> * it appears the implied default return code for any bonding recv probe
> must be `RX_HANDLER_ANOTHER`. Changed the default return code of
> bond_mac_filter_recv to use this return value to not break skb
> processing when the skb dev is switched to the bond dev:
> `skb->dev = bond->dev`
>
> v3: Nik's comments
> * clarified documentation
> * fixed inline and basic reverse Christmas tree formatting
> * zero'ed entry in mac_create
> * removed read_lock taking in bond_mac_filter_recv
> * made has_expired() atomic and removed critical sections
> surrounding calls to has_expired(), this also removed the
> use-after-free that would have occurred:
> spin_lock_irqsave(&entry->lock, flags);
> if (has_expired(bond, entry))
> mac_delete(bond, entry);
> spin_unlock_irqrestore(&entry->lock, flags); <---
> * moved init/destroy of mac_filter_tbl to bond_open/bond_close
> this removed the complex option dependencies, the only behavioural
> change the user will see is if the bond is up and mac_filter is
> enabled if they try and set arp_interval they will receive -EBUSY
> * in bond_changelink moved processing of mac_filter option just below
> mode processing
>
> v4:
> * rebase to latest net-next
> * removed rcu_read_lock() call in bond_mac_insert()
> * used specific spin_lock_{}() calls instead of the irqsave version
> * Outstanding comments from Nik:
> https://lore.kernel.org/netdev/[email protected]/
> - old version of the patch still under discussion
> https://lore.kernel.org/netdev/[email protected]/
> * response: it has been over a month now and no new comments have come in
> so I am assuming there is nothing left to the discussion

Not really, I didn't have time to adjust my solution, but it's not hard
to solve the multicast ingress. More below.

> - What if anyone decides at some point 5 seconds are not enough or too much?
> * response: I think making that configurable at a later time should not
> prevent the inclusion of the initial feature. > - This bit is pointless, you still have races even though not
critical
> * response: if there are races please point them out simply making a
> comment about an enum doesn't show the race.

You had race conditions about the flag, now that you always take the
lock you no longer have them, but also the flag is useless.

> - This is not scalable at all, you get the lock even to update the
> expire on *every* packet, most users go for L3 or L3+L4 and will hit
> this with different flows on different CPUs.
> * response: we take the lock to update the value, so that receive
> traffic is not dropped. Further any implementation, bpf or nft,
> will also have to take locks to update MAC entries. If there is a
> specific locking order that will be less bad, I would appreciate
> that discussion. Concerning L3 or L3+L4 hashing this is not the
> data I have, in fact the most utilized hash is layer2.
>

Yes, there are more scalable alternatives that are already implemented
in nft and eBPF and you can get them for free if you go with some of the
options below.

To the point, tbh I'm amazed we're still discussing applying this change :)

anyway here are my suggestions:

a) if you insist on adding some new code then at least add only
what is missing, i.e. add a xor mode option to limit ingress mcast only
to the active slave, then use the nftables mac filtering solution that I
gave you and you'll have the best of both worlds, you'll be able to dump
the current macs, edit them if needed, control the filtering time and
have the full nft flexibility, also it will scale much better than this

b) write a 3-4 line bash script that allows mcast only through the
active slave, and again use the nft solution I gave you for the rest

c) extend the eBPF program I wrote to do mcast filtering, again it'll
scale much better

(tentative) d) figure out a way to solve the mcast ingress entirely with
nft, I suspect it's possible but I don't have time to waste to prove it

I don't plan on reviewing the patch in this form because I don't think
it should be applied at all since all of this can be easily implemented
with the available tools, but I won't officially nack it either.
That would be the maintainers' decision.

Cheers,
Nik