This patchset fixes several bugs that are related to nesting
device infrastructure.
Current nesting infrastructure code doesn't limit the depth level of
devices. nested devices could be handled recursively. at that moment,
it needs huge memory and stack overflow could occur.
Below devices type have same bug.
VLAN, BONDING, TEAM, MACSEC, MACVLAN, IPVLAN, VIRT_WIFI and VXLAN.
But I couldn't test all interface types so there could be more device
types which have similar problems.
Maybe qmi_wwan.c code could have same problem.
So, I would appreciate if someone test qmi_wwan.c and other modules.
Test commands:
ip link add dummy0 type dummy
ip link add vlan1 link dummy0 type vlan id 1
for i in {2..100}
do
let A=$i-1
ip link add name vlan$i link vlan$A type vlan id $i
done
ip link del dummy0
1st patch actually fixes the root cause.
It adds new common variables {upper/lower}_level that represent
depth level. upper_level variable is depth of upper devices.
lower_level variable is depth of lower devices.
[U][L] [U][L]
vlan1 1 5 vlan4 1 4
vlan2 2 4 vlan5 2 3
vlan3 3 3 |
| |
+------------+
|
vlan6 4 2
dummy0 5 1
After this patch, the nesting infrastructure code uses this variable to
check the depth level.
2, 4, 5, 6, 7 patches fix lockdep related problem.
Before this patch, devices use static lockdep map.
So, if devices that are same type is nested, lockdep will warn about
recursive situation.
These patches make these devices use dynamic lockdep key instead of
static lock or subclass.
3rd patch fixes unexpected IFF_BONDING bit unset.
8th patch fixes a refcnt leak in the macsec module.
9th patch adds ignore flag to an adjacent structure.
In order to exchange an adjacent node safely, ignore flag is needed.
10th patch makes vxlan add an adjacent link to limit depth level.
11th patch removes unnecessary variables and callback.
12th patch fix refcnt leaks in the virt_wifi module
v3 -> v4 :
- Add new 12th patch to fix refcnt leaks in the virt_wifi module
- Fix wrong usage netdev_upper_dev_link() in the vxlan.c
- Preserve reverse christmas tree variable ordering in the vxlan.c
- Add missing static keyword in the dev.c
- Expose netdev_adjacent_change_{prepare/commit/abort} instead of
netdev_adjacent_dev_{enable/disable}
v2 -> v3 :
- Modify nesting infrastructure code to use iterator instead of recursive.
v1 -> v2 :
- Make the 3rd patch do not add a new priv_flag.
Taehee Yoo (12):
net: core: limit nested device depth
vlan: use dynamic lockdep key instead of subclass
bonding: fix unexpected IFF_BONDING bit unset
bonding: use dynamic lockdep key instead of subclass
team: use dynamic lockdep key instead of static key
macsec: use dynamic lockdep key instead of subclass
macvlan: use dynamic lockdep key instead of subclass
macsec: fix refcnt leak in module exit routine
net: core: add ignore flag to netdev_adjacent structure
vxlan: add adjacent link to limit depth level
net: remove unnecessary variables and callback
virt_wifi: fix refcnt leak in module exit routine
drivers/net/bonding/bond_alb.c | 2 +-
drivers/net/bonding/bond_main.c | 81 ++-
.../net/ethernet/mellanox/mlx5/core/en_tc.c | 2 +-
drivers/net/macsec.c | 50 +-
drivers/net/macvlan.c | 36 +-
drivers/net/team/team.c | 61 ++-
drivers/net/vxlan.c | 52 +-
drivers/net/wireless/virt_wifi.c | 51 +-
include/linux/if_macvlan.h | 3 +-
include/linux/if_team.h | 5 +
include/linux/if_vlan.h | 13 +-
include/linux/netdevice.h | 26 +-
include/net/bonding.h | 4 +-
include/net/vxlan.h | 1 +
net/8021q/vlan.c | 1 -
net/8021q/vlan_dev.c | 32 +-
net/core/dev.c | 508 +++++++++++++++---
net/core/dev_addr_lists.c | 12 +-
net/smc/smc_core.c | 2 +-
net/smc/smc_pnet.c | 2 +-
20 files changed, 752 insertions(+), 192 deletions(-)
--
2.17.1
Current code doesn't limit the number of nested devices.
Nested devices would be handled recursively and this needs huge stack
memory. So, unlimited nested devices could make stack overflow.
This patch adds upper_level and lower_level, they are common variables
and represent maximum lower/upper depth.
When upper/lower device is attached or dettached,
{lower/upper}_level are updated. and if maximum depth is bigger than 8,
attach routine fails and returns -EMLINK.
In addition, this patch converts recursive routine of
netdev_walk_all_{lower/upper} to iterator routine.
Test commands:
ip link add dummy0 type dummy
ip link add link dummy0 name vlan1 type vlan id 1
ip link set vlan1 up
for i in {2..200}
do
let A=$i-1
ip link add vlan$i link vlan$A type vlan id $i
done
ip link del vlan1
Splat looks like:
[ 923.102992] Thread overran stack, or stack corrupted
[ 923.103471] Oops: 0000 [#1] SMP DEBUG_PAGEALLOC KASAN PTI
[ 923.104086] CPU: 0 PID: 1597 Comm: ip Not tainted 5.3.0+ #3
[ 923.104771] Hardware name: innotek GmbH VirtualBox/VirtualBox, BIOS VirtualBox 12/01/2006
[ 923.108837] RIP: 0010:stack_depot_fetch+0x10/0x30
[ 923.109470] Code: 00 75 10 48 8b 73 18 48 89 ef 5b 5d e9 79 b1 83 ff 0f 0b e8 92 96 97 ff eb e9 89 f8 c1 ef 11 25 ff 0
[ 923.111775] RSP: 0018:ffff8880541ceb78 EFLAGS: 00010006
[ 923.112452] RAX: 00000000001fffff RBX: ffff8880541cee88 RCX: 0000000000000000
[ 923.113399] RDX: 000000000000001d RSI: ffff8880541ceb80 RDI: 0000000000003ff0
[ 923.114284] RBP: ffffea0001507380 R08: ffffed100d8fdf23 R09: ffffed100d8fdf23
[ 923.115183] R10: 0000000000000001 R11: ffffed100d8fdf22 R12: ffff88806c240880
[ 923.115986] R13: ffff8880541cec98 R14: ffff8880541cee88 R15: ffff8880541ced20
[ 923.120477] FS: 00007ff38ab4f0c0(0000) GS:ffff88806c600000(0000) knlGS:0000000000000000
[ 923.121486] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 923.122451] CR2: ffffffffa5be5658 CR3: 0000000053532004 CR4: 00000000000606f0
[ 923.123303] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[ 923.128422] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
[ 923.129399] Call Trace:
[ 923.129710] Modules linked in: 8021q dummy ip_tables x_tables
[ 923.130518] CR2: ffffffffa5be5658
[ 923.130909] ---[ end trace 9568b7d36ab26094 ]---
[ 923.131457] RIP: 0010:stack_depot_fetch+0x10/0x30
[ 923.132006] Code: 00 75 10 48 8b 73 18 48 89 ef 5b 5d e9 79 b1 83 ff 0f 0b e8 92 96 97 ff eb e9 89 f8 c1 ef 11 25 ff 0
[ 923.134219] RSP: 0018:ffff8880541ceb78 EFLAGS: 00010006
[ 923.134834] RAX: 00000000001fffff RBX: ffff8880541cee88 RCX: 0000000000000000
[ 923.135664] RDX: 000000000000001d RSI: ffff8880541ceb80 RDI: 0000000000003ff0
[ 923.136514] RBP: ffffea0001507380 R08: ffffed100d8fdf23 R09: ffffed100d8fdf23
[ 923.137276] R10: 0000000000000001 R11: ffffed100d8fdf22 R12: ffff88806c240880
[ 923.138025] R13: ffff8880541cec98 R14: ffff8880541cee88 R15: ffff8880541ced20
[ 923.138773] FS: 00007ff38ab4f0c0(0000) GS:ffff88806c600000(0000) knlGS:0000000000000000
[ 923.140099] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 923.140763] CR2: ffffffffa5be5658 CR3: 0000000053532004 CR4: 00000000000606f0
[ 923.141539] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[ 923.144930] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
[ 923.145942] Kernel panic - not syncing: Fatal exception
Signed-off-by: Taehee Yoo <[email protected]>
---
v3 -> v4 :
- This patch is not changed
v2 -> v3 :
- Modify nesting infra code to use iterator instead of recursive
v1 -> v2 :
- This patch is not changed
include/linux/netdevice.h | 4 +
net/core/dev.c | 286 ++++++++++++++++++++++++++++++++------
2 files changed, 245 insertions(+), 45 deletions(-)
diff --git a/include/linux/netdevice.h b/include/linux/netdevice.h
index 9eda1c31d1f7..613007aa5986 100644
--- a/include/linux/netdevice.h
+++ b/include/linux/netdevice.h
@@ -1637,6 +1637,8 @@ enum netdev_priv_flags {
* @type: Interface hardware type
* @hard_header_len: Maximum hardware header length.
* @min_header_len: Minimum hardware header length
+ * @upper_level: Maximum depth level of upper devices.
+ * @lower_level: Maximum depth level of lower devices.
*
* @needed_headroom: Extra headroom the hardware may need, but not in all
* cases can this be guaranteed
@@ -1867,6 +1869,8 @@ struct net_device {
unsigned short type;
unsigned short hard_header_len;
unsigned char min_header_len;
+ unsigned char upper_level;
+ unsigned char lower_level;
unsigned short needed_headroom;
unsigned short needed_tailroom;
diff --git a/net/core/dev.c b/net/core/dev.c
index bf3ed413abaf..13cb646fb98f 100644
--- a/net/core/dev.c
+++ b/net/core/dev.c
@@ -146,6 +146,7 @@
#include "net-sysfs.h"
#define MAX_GRO_SKBS 8
+#define MAX_NEST_DEV 8
/* This should be increased if a protocol with a bigger head is added. */
#define GRO_MAX_HEAD (MAX_HEADER + 128)
@@ -6644,6 +6645,21 @@ struct net_device *netdev_upper_get_next_dev_rcu(struct net_device *dev,
}
EXPORT_SYMBOL(netdev_upper_get_next_dev_rcu);
+static struct net_device *netdev_next_upper_dev(struct net_device *dev,
+ struct list_head **iter)
+{
+ struct netdev_adjacent *upper;
+
+ upper = list_entry((*iter)->next, struct netdev_adjacent, list);
+
+ if (&upper->list == &dev->adj_list.upper)
+ return NULL;
+
+ *iter = &upper->list;
+
+ return upper->dev;
+}
+
static struct net_device *netdev_next_upper_dev_rcu(struct net_device *dev,
struct list_head **iter)
{
@@ -6661,31 +6677,103 @@ static struct net_device *netdev_next_upper_dev_rcu(struct net_device *dev,
return upper->dev;
}
+int netdev_walk_all_upper_dev(struct net_device *dev,
+ int (*fn)(struct net_device *dev,
+ void *data),
+ void *data)
+{
+ struct net_device *udev, *next, *now, *dev_stack[MAX_NEST_DEV + 1];
+ struct list_head *niter, *iter, *iter_stack[MAX_NEST_DEV + 1];
+ int ret, cur = 0;
+
+ now = dev;
+ iter = &dev->adj_list.upper;
+
+ while (1) {
+ if (now != dev) {
+ ret = fn(now, data);
+ if (ret)
+ return ret;
+ }
+
+ next = NULL;
+ while (1) {
+ udev = netdev_next_upper_dev(now, &iter);
+ if (!udev)
+ break;
+
+ if (!next) {
+ next = udev;
+ niter = &udev->adj_list.upper;
+ } else {
+ dev_stack[cur] = udev;
+ iter_stack[cur++] = &udev->adj_list.upper;
+ break;
+ }
+ }
+
+ if (!next) {
+ if (!cur)
+ return 0;
+ next = dev_stack[--cur];
+ niter = iter_stack[cur];
+ }
+
+ now = next;
+ iter = niter;
+ }
+
+ return 0;
+}
+
int netdev_walk_all_upper_dev_rcu(struct net_device *dev,
int (*fn)(struct net_device *dev,
void *data),
void *data)
{
- struct net_device *udev;
- struct list_head *iter;
- int ret;
+ struct net_device *udev, *next, *now, *dev_stack[MAX_NEST_DEV + 1];
+ struct list_head *niter, *iter, *iter_stack[MAX_NEST_DEV + 1];
+ int ret, cur = 0;
- for (iter = &dev->adj_list.upper,
- udev = netdev_next_upper_dev_rcu(dev, &iter);
- udev;
- udev = netdev_next_upper_dev_rcu(dev, &iter)) {
- /* first is the upper device itself */
- ret = fn(udev, data);
- if (ret)
- return ret;
+ now = dev;
+ iter = &dev->adj_list.upper;
- /* then look at all of its upper devices */
- ret = netdev_walk_all_upper_dev_rcu(udev, fn, data);
- if (ret)
- return ret;
+ while (1) {
+ if (now != dev) {
+ ret = fn(now, data);
+ if (ret)
+ return ret;
+ }
+
+ next = NULL;
+ while (1) {
+ udev = netdev_next_upper_dev_rcu(now, &iter);
+ if (!udev)
+ break;
+
+ if (!next) {
+ next = udev;
+ niter = &udev->adj_list.upper;
+ } else {
+ dev_stack[cur] = udev;
+ iter_stack[cur++] = &udev->adj_list.upper;
+ break;
+ }
+ }
+
+ if (!next) {
+ if (!cur)
+ return 0;
+ next = dev_stack[--cur];
+ niter = iter_stack[cur];
+ }
+
+ now = next;
+ iter = niter;
}
return 0;
+
}
EXPORT_SYMBOL_GPL(netdev_walk_all_upper_dev_rcu);
@@ -6790,23 +6878,45 @@ int netdev_walk_all_lower_dev(struct net_device *dev,
void *data),
void *data)
{
- struct net_device *ldev;
- struct list_head *iter;
- int ret;
+ struct net_device *ldev, *next, *now, *dev_stack[MAX_NEST_DEV + 1];
+ struct list_head *niter, *iter, *iter_stack[MAX_NEST_DEV + 1];
+ int ret, cur = 0;
- for (iter = &dev->adj_list.lower,
- ldev = netdev_next_lower_dev(dev, &iter);
- ldev;
- ldev = netdev_next_lower_dev(dev, &iter)) {
- /* first is the lower device itself */
- ret = fn(ldev, data);
- if (ret)
- return ret;
+ now = dev;
+ iter = &dev->adj_list.lower;
- /* then look at all of its lower devices */
- ret = netdev_walk_all_lower_dev(ldev, fn, data);
- if (ret)
- return ret;
+ while (1) {
+ if (now != dev) {
+ ret = fn(now, data);
+ if (ret)
+ return ret;
+ }
+
+ next = NULL;
+ while (1) {
+ ldev = netdev_next_lower_dev(now, &iter);
+ if (!ldev)
+ break;
+
+ if (!next) {
+ next = ldev;
+ niter = &ldev->adj_list.lower;
+ } else {
+ dev_stack[cur] = ldev;
+ iter_stack[cur++] = &ldev->adj_list.lower;
+ break;
+ }
+ }
+
+ if (!next) {
+ if (!cur)
+ return 0;
+ next = dev_stack[--cur];
+ niter = iter_stack[cur];
+ }
+
+ now = next;
+ iter = niter;
}
return 0;
@@ -6827,31 +6937,100 @@ static struct net_device *netdev_next_lower_dev_rcu(struct net_device *dev,
return lower->dev;
}
-int netdev_walk_all_lower_dev_rcu(struct net_device *dev,
- int (*fn)(struct net_device *dev,
- void *data),
- void *data)
+static u8 __netdev_upper_depth(struct net_device *dev)
+{
+ struct net_device *udev;
+ struct list_head *iter;
+ u8 max_depth = 0;
+
+ for (iter = &dev->adj_list.upper,
+ udev = netdev_next_upper_dev(dev, &iter);
+ udev;
+ udev = netdev_next_upper_dev(dev, &iter)) {
+ if (max_depth < udev->upper_level)
+ max_depth = udev->upper_level;
+ }
+
+ return max_depth;
+}
+
+static u8 __netdev_lower_depth(struct net_device *dev)
{
struct net_device *ldev;
struct list_head *iter;
- int ret;
+ u8 max_depth = 0;
for (iter = &dev->adj_list.lower,
- ldev = netdev_next_lower_dev_rcu(dev, &iter);
+ ldev = netdev_next_lower_dev(dev, &iter);
ldev;
- ldev = netdev_next_lower_dev_rcu(dev, &iter)) {
- /* first is the lower device itself */
- ret = fn(ldev, data);
- if (ret)
- return ret;
+ ldev = netdev_next_lower_dev(dev, &iter)) {
+ if (max_depth < ldev->lower_level)
+ max_depth = ldev->lower_level;
+ }
- /* then look at all of its lower devices */
- ret = netdev_walk_all_lower_dev_rcu(ldev, fn, data);
- if (ret)
- return ret;
+ return max_depth;
+}
+
+static int __netdev_update_upper_level(struct net_device *dev, void *data)
+{
+ dev->upper_level = __netdev_upper_depth(dev) + 1;
+ return 0;
+}
+
+static int __netdev_update_lower_level(struct net_device *dev, void *data)
+{
+ dev->lower_level = __netdev_lower_depth(dev) + 1;
+ return 0;
+}
+
+int netdev_walk_all_lower_dev_rcu(struct net_device *dev,
+ int (*fn)(struct net_device *dev,
+ void *data),
+ void *data)
+{
+ struct net_device *ldev, *next, *now, *dev_stack[MAX_NEST_DEV + 1];
+ struct list_head *niter, *iter, *iter_stack[MAX_NEST_DEV + 1];
+ int ret, cur = 0;
+
+ now = dev;
+ iter = &dev->adj_list.lower;
+
+ while (1) {
+ if (now != dev) {
+ ret = fn(now, data);
+ if (ret)
+ return ret;
+ }
+
+ next = NULL;
+ while (1) {
+ ldev = netdev_next_lower_dev_rcu(now, &iter);
+ if (!ldev)
+ break;
+
+ if (!next) {
+ next = ldev;
+ niter = &ldev->adj_list.lower;
+ } else {
+ dev_stack[cur] = ldev;
+ iter_stack[cur++] = &ldev->adj_list.lower;
+ break;
+ }
+ }
+
+ if (!next) {
+ if (!cur)
+ return 0;
+ next = dev_stack[--cur];
+ niter = iter_stack[cur];
+ }
+
+ now = next;
+ iter = niter;
}
return 0;
+
}
EXPORT_SYMBOL_GPL(netdev_walk_all_lower_dev_rcu);
@@ -7105,6 +7284,9 @@ static int __netdev_upper_dev_link(struct net_device *dev,
if (netdev_has_upper_dev(upper_dev, dev))
return -EBUSY;
+ if ((dev->lower_level + upper_dev->upper_level) > MAX_NEST_DEV)
+ return -EMLINK;
+
if (!master) {
if (netdev_has_upper_dev(dev, upper_dev))
return -EEXIST;
@@ -7131,6 +7313,12 @@ static int __netdev_upper_dev_link(struct net_device *dev,
if (ret)
goto rollback;
+ __netdev_update_upper_level(dev, NULL);
+ netdev_walk_all_lower_dev(dev, __netdev_update_upper_level, NULL);
+
+ __netdev_update_lower_level(upper_dev, NULL);
+ netdev_walk_all_upper_dev(upper_dev, __netdev_update_lower_level, NULL);
+
return 0;
rollback:
@@ -7213,6 +7401,12 @@ void netdev_upper_dev_unlink(struct net_device *dev,
call_netdevice_notifiers_info(NETDEV_CHANGEUPPER,
&changeupper_info.info);
+
+ __netdev_update_upper_level(dev, NULL);
+ netdev_walk_all_lower_dev(dev, __netdev_update_upper_level, NULL);
+
+ __netdev_update_lower_level(upper_dev, NULL);
+ netdev_walk_all_upper_dev(upper_dev, __netdev_update_lower_level, NULL);
}
EXPORT_SYMBOL(netdev_upper_dev_unlink);
@@ -9212,6 +9406,8 @@ struct net_device *alloc_netdev_mqs(int sizeof_priv, const char *name,
dev->gso_max_size = GSO_MAX_SIZE;
dev->gso_max_segs = GSO_MAX_SEGS;
+ dev->upper_level = 1;
+ dev->lower_level = 1;
INIT_LIST_HEAD(&dev->napi_list);
INIT_LIST_HEAD(&dev->unreg_list);
--
2.17.1
All VLAN device has same lockdep key and subclass is initialized with
nest_level.
But actual nest_level value can be changed when a lower device is attached.
And at this moment, the subclass should be updated but it seems to be
unsafe.
So this patch makes VLAN use dynamic lockdep key instead of the subclass.
Test commands:
ip link add dummy0 type dummy
ip link set dummy0 up
ip link add bond0 type bond
ip link add vlan_dummy1 link dummy0 type vlan id 1
ip link add vlan_bond1 link bond0 type vlan id 2
ip link set vlan_dummy1 master bond0
ip link set bond0 up
ip link set vlan_dummy1 up
ip link set vlan_bond1 up
Both vlan_dummy1 and vlan_bond1 have the same subclass and it makes
unnecessary deadlock warning message.
Splat looks like:
[ 75.879233] WARNING: possible recursive locking detected
[ 75.879881] 5.3.0+ #3 Not tainted
[ 75.880285] --------------------------------------------
[ 75.880933] ip/634 is trying to acquire lock:
[ 75.881463] ffff8880673c2558 (&vlan_netdev_addr_lock_key/1){+...}, at: dev_uc_sync_multiple+0xfa/0x1a0
[ 75.882714]
[ 75.882714] but task is already holding lock:
[ 75.883502] ffff8880645193f8 (&vlan_netdev_addr_lock_key/1){+...}, at: dev_set_rx_mode+0x19/0x30
[ 75.884707]
[ 75.884707] other info that might help us debug this:
[ 75.885742] Possible unsafe locking scenario:
[ 75.885742]
[ 75.887013] CPU0
[ 75.887415] ----
[ 75.887723] lock(&vlan_netdev_addr_lock_key/1);
[ 75.888280] lock(&vlan_netdev_addr_lock_key/1);
[ 75.888852]
[ 75.888852] *** DEADLOCK ***
[ 75.888852]
[ 75.889569] May be due to missing lock nesting notation
[ 75.889569]
[ 75.890453] 4 locks held by ip/634:
[ 75.890992] #0: ffffffff96ec7a30 (rtnl_mutex){+.+.}, at: rtnetlink_rcv_msg+0x466/0x8a0
[ 75.892021] #1: ffff8880645193f8 (&vlan_netdev_addr_lock_key/1){+...}, at: dev_set_rx_mode+0x19/0x30
[ 75.893387] #2: ffff8880694c4558 (&dev_addr_list_lock_key/3){+...}, at: dev_mc_sync+0xfa/0x1a0
[ 75.894545] #3: ffffffff96b22780 (rcu_read_lock){....}, at: bond_set_rx_mode+0x5/0x3c0 [bonding]
[ 75.895558]
[ 75.895558] stack backtrace:
[ 75.896003] CPU: 0 PID: 634 Comm: ip Not tainted 5.3.0+ #3
[ 75.896566] Hardware name: innotek GmbH VirtualBox/VirtualBox, BIOS VirtualBox 12/01/2006
[ 75.897549] Call Trace:
[ 75.897916] dump_stack+0x7c/0xbb
[ 75.898287] __lock_acquire+0x26a9/0x3df0
[ 75.898664] ? register_lock_class+0x14d0/0x14d0
[ 75.899255] lock_acquire+0x164/0x3b0
[ 75.899718] ? dev_uc_sync_multiple+0xfa/0x1a0
[ 75.900245] ? rcu_read_lock_held+0x90/0xa0
[ 75.900707] _raw_spin_lock_nested+0x2e/0x60
[ 75.901149] ? dev_uc_sync_multiple+0xfa/0x1a0
[ 75.901629] dev_uc_sync_multiple+0xfa/0x1a0
[ 75.902116] bond_set_rx_mode+0x269/0x3c0 [bonding]
[ 75.903135] ? bond_init+0x6f0/0x6f0 [bonding]
[ 75.903696] dev_mc_sync+0x15a/0x1a0
[ ... ]
Fixes: 0fe1e567d0b4 ("[VLAN]: nested VLAN: fix lockdep's recursive locking warning")
Signed-off-by: Taehee Yoo <[email protected]>
---
v1 -> v4 :
- This patch is not changed
include/linux/if_vlan.h | 3 +++
net/8021q/vlan_dev.c | 28 +++++++++++++++-------------
2 files changed, 18 insertions(+), 13 deletions(-)
diff --git a/include/linux/if_vlan.h b/include/linux/if_vlan.h
index 244278d5c222..1aed9f613e90 100644
--- a/include/linux/if_vlan.h
+++ b/include/linux/if_vlan.h
@@ -183,6 +183,9 @@ struct vlan_dev_priv {
struct netpoll *netpoll;
#endif
unsigned int nest_level;
+
+ struct lock_class_key xmit_lock_key;
+ struct lock_class_key addr_lock_key;
};
static inline struct vlan_dev_priv *vlan_dev_priv(const struct net_device *dev)
diff --git a/net/8021q/vlan_dev.c b/net/8021q/vlan_dev.c
index 93eadf179123..12bc80650087 100644
--- a/net/8021q/vlan_dev.c
+++ b/net/8021q/vlan_dev.c
@@ -494,24 +494,24 @@ static void vlan_dev_set_rx_mode(struct net_device *vlan_dev)
* "super class" of normal network devices; split their locks off into a
* separate class since they always nest.
*/
-static struct lock_class_key vlan_netdev_xmit_lock_key;
-static struct lock_class_key vlan_netdev_addr_lock_key;
-
static void vlan_dev_set_lockdep_one(struct net_device *dev,
struct netdev_queue *txq,
- void *_subclass)
+ void *_unused)
{
- lockdep_set_class_and_subclass(&txq->_xmit_lock,
- &vlan_netdev_xmit_lock_key,
- *(int *)_subclass);
+ struct vlan_dev_priv *vlan = vlan_dev_priv(dev);
+
+ lockdep_set_class(&txq->_xmit_lock, &vlan->xmit_lock_key);
}
-static void vlan_dev_set_lockdep_class(struct net_device *dev, int subclass)
+static void vlan_dev_set_lockdep_class(struct net_device *dev)
{
- lockdep_set_class_and_subclass(&dev->addr_list_lock,
- &vlan_netdev_addr_lock_key,
- subclass);
- netdev_for_each_tx_queue(dev, vlan_dev_set_lockdep_one, &subclass);
+ struct vlan_dev_priv *vlan = vlan_dev_priv(dev);
+
+ lockdep_register_key(&vlan->addr_lock_key);
+ lockdep_set_class(&dev->addr_list_lock, &vlan->addr_lock_key);
+
+ lockdep_register_key(&vlan->xmit_lock_key);
+ netdev_for_each_tx_queue(dev, vlan_dev_set_lockdep_one, NULL);
}
static int vlan_dev_get_lock_subclass(struct net_device *dev)
@@ -609,7 +609,7 @@ static int vlan_dev_init(struct net_device *dev)
SET_NETDEV_DEVTYPE(dev, &vlan_type);
- vlan_dev_set_lockdep_class(dev, vlan_dev_get_lock_subclass(dev));
+ vlan_dev_set_lockdep_class(dev);
vlan->vlan_pcpu_stats = netdev_alloc_pcpu_stats(struct vlan_pcpu_stats);
if (!vlan->vlan_pcpu_stats)
@@ -630,6 +630,8 @@ static void vlan_dev_uninit(struct net_device *dev)
kfree(pm);
}
}
+ lockdep_unregister_key(&vlan->addr_lock_key);
+ lockdep_unregister_key(&vlan->xmit_lock_key);
}
static netdev_features_t vlan_dev_fix_features(struct net_device *dev,
--
2.17.1
The IFF_BONDING means bonding master or bonding slave device.
->ndo_add_slave() sets IFF_BONDING flag and ->ndo_del_slave() unsets
IFF_BONDING flag.
bond0<--bond1
Both bond0 and bond1 are bonding device and these should keep having
IFF_BONDING flag until they are removed.
But bond1 would lose IFF_BONDING at ->ndo_del_slave() because that routine
do not check whether the slave device is the bonding type or not.
This patch adds the interface type check routine before removing
IFF_BONDING flag.
Test commands:
ip link add bond0 type bond
ip link add bond1 type bond
ip link set bond1 master bond0
ip link set bond1 nomaster
ip link del bond1 type bond
ip link add bond1 type bond
Splat looks like:
[ 38.843933] proc_dir_entry 'bonding/bond1' already registered
[ 38.844741] WARNING: CPU: 1 PID: 631 at fs/proc/generic.c:361 proc_register+0x2a9/0x3e0
[ 38.845741] Modules linked in: bonding ip_tables x_tables
[ 38.846432] CPU: 1 PID: 631 Comm: ip Not tainted 5.3.0+ #3
[ 38.847234] Hardware name: innotek GmbH VirtualBox/VirtualBox, BIOS VirtualBox 12/01/2006
[ 38.848489] RIP: 0010:proc_register+0x2a9/0x3e0
[ 38.849164] Code: 89 fa 48 c1 ea 03 80 3c 02 00 0f 85 39 01 00 00 48 8b 04 24 48 89 ea 48 c7 c7 e0 2b 34 b3 48 8b b0 e
0 00 00 00 e8 c7 b6 89 ff <0f> 0b 48 c7 c7 40 3d c5 b3 e8 99 7a 38 01 48 8b 4c 24 10 48 b8 00
[ 38.851317] RSP: 0018:ffff888061527078 EFLAGS: 00010282
[ 38.851902] RAX: dffffc0000000008 RBX: ffff888064dc8cb0 RCX: ffffffffb1d252a2
[ 38.852684] RDX: 0000000000000001 RSI: 0000000000000008 RDI: ffff88806cbf6b8c
[ 38.853464] RBP: ffff888064dc8f33 R08: ffffed100d980019 R09: ffffed100d980019
[ 38.854242] R10: 0000000000000001 R11: ffffed100d980018 R12: ffff888064dc8e48
[ 38.855929] R13: ffff888064dc8f32 R14: dffffc0000000000 R15: ffffed100c9b91e6
[ 38.856695] FS: 00007fc9fcc230c0(0000) GS:ffff88806ca00000(0000) knlGS:0000000000000000
[ 38.857541] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 38.858150] CR2: 000055948b91c118 CR3: 0000000057110006 CR4: 00000000000606e0
[ 38.858957] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[ 38.859785] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
[ 38.860700] Call Trace:
[ 38.861004] proc_create_seq_private+0xb3/0xf0
[ 38.861460] bond_create_proc_entry+0x1b3/0x3f0 [bonding]
[ 38.862113] bond_netdev_event+0x433/0x970 [bonding]
[ 38.862762] ? __module_text_address+0x13/0x140
[ 38.867678] notifier_call_chain+0x90/0x160
[ 38.868257] register_netdevice+0x9b3/0xd80
[ 38.868791] ? alloc_netdev_mqs+0x854/0xc10
[ 38.869335] ? netdev_change_features+0xa0/0xa0
[ 38.869852] ? rtnl_create_link+0x2ed/0xad0
[ 38.870423] bond_newlink+0x2a/0x60 [bonding]
[ 38.870935] __rtnl_newlink+0xb9f/0x11b0
[ ... ]
Fixes: 0b680e753724 ("[PATCH] bonding: Add priv_flag to avoid event mishandling")
Signed-off-by: Taehee Yoo <[email protected]>
---
v2 -> v4 :
- This patch is not changed
v1 -> v2 :
- Do not add a new priv_flag.
drivers/net/bonding/bond_main.c | 6 ++++--
1 file changed, 4 insertions(+), 2 deletions(-)
diff --git a/drivers/net/bonding/bond_main.c b/drivers/net/bonding/bond_main.c
index 931d9d935686..0db12fcfc953 100644
--- a/drivers/net/bonding/bond_main.c
+++ b/drivers/net/bonding/bond_main.c
@@ -1816,7 +1816,8 @@ int bond_enslave(struct net_device *bond_dev, struct net_device *slave_dev,
slave_disable_netpoll(new_slave);
err_close:
- slave_dev->priv_flags &= ~IFF_BONDING;
+ if (!netif_is_bond_master(slave_dev))
+ slave_dev->priv_flags &= ~IFF_BONDING;
dev_close(slave_dev);
err_restore_mac:
@@ -2017,7 +2018,8 @@ static int __bond_release_one(struct net_device *bond_dev,
else
dev_set_mtu(slave_dev, slave->original_mtu);
- slave_dev->priv_flags &= ~IFF_BONDING;
+ if (!netif_is_bond_master(slave_dev))
+ slave_dev->priv_flags &= ~IFF_BONDING;
bond_free_slave(slave);
--
2.17.1
All bonding device has same lockdep key and subclass is initialized with
nest_level.
But actual nest_level value can be changed when a lower device is attached.
And at this moment, the subclass should be updated but it seems to be
unsafe.
So this patch makes bonding use dynamic lockdep key instead of the
subclass.
Test commands:
ip link add bond0 type bond
for i in {1..5}
do
let A=$i-1
ip link add bond$i type bond
ip link set bond$i master bond$A
done
ip link set bond5 master bond0
Splat looks like:
[ 29.858108] WARNING: possible recursive locking detected
[ 29.858630] 5.3.0+ #3 Not tainted
[ 29.858946] --------------------------------------------
[ 29.859501] ip/629 is trying to acquire lock:
[ 29.860591] ffff88806801cf00 (&(&bond->stats_lock)->rlock#2/2){+.+.}, at: bond_get_stats+0xb8/0x500 [bonding]
[ 29.861677]
[ 29.861677] but task is already holding lock:
[ 29.862307] ffff88806801ada0 (&(&bond->stats_lock)->rlock#2/2){+.+.}, at: bond_get_stats+0xb8/0x500 [bonding]
[ 29.863406]
[ 29.863406] other info that might help us debug this:
[ 29.864092] Possible unsafe locking scenario:
[ 29.864092]
[ 29.864715] CPU0
[ 29.864968] ----
[ 29.865225] lock(&(&bond->stats_lock)->rlock#2/2);
[ 29.865731] lock(&(&bond->stats_lock)->rlock#2/2);
[ 29.866235]
[ 29.866235] *** DEADLOCK ***
[ 29.866235]
[ 29.866829] May be due to missing lock nesting notation
[ 29.866829]
[ 29.867632] 3 locks held by ip/629:
[ 29.868077] #0: ffffffffb4ec7a30 (rtnl_mutex){+.+.}, at: rtnetlink_rcv_msg+0x466/0x8a0
[ 29.869141] #1: ffff88806801ada0 (&(&bond->stats_lock)->rlock#2/2){+.+.}, at: bond_get_stats+0xb8/0x500 [bonding]
[ 29.870504] #2: ffffffffb4b22780 (rcu_read_lock){....}, at: bond_get_stats+0x9f/0x500 [bonding]
[ 29.875917]
[ 29.875917] stack backtrace:
[ 29.876533] CPU: 0 PID: 629 Comm: ip Not tainted 5.3.0+ #3
[ 29.877254] Hardware name: innotek GmbH VirtualBox/VirtualBox, BIOS VirtualBox 12/01/2006
[ 29.878344] Call Trace:
[ 29.878697] dump_stack+0x7c/0xbb
[ 29.879167] __lock_acquire+0x26a9/0x3df0
[ 29.879660] ? register_lock_class+0x14d0/0x14d0
[ 29.880067] lock_acquire+0x164/0x3b0
[ 29.880402] ? bond_get_stats+0xb8/0x500 [bonding]
[ 29.880826] _raw_spin_lock_nested+0x2e/0x60
[ 29.881206] ? bond_get_stats+0xb8/0x500 [bonding]
[ 29.881725] bond_get_stats+0xb8/0x500 [bonding]
[ ... ]
Fixes: d3fff6c443fe ("net: add netdev_lockdep_set_classes() helper")
Signed-off-by: Taehee Yoo <[email protected]>
---
v1 -> v4 :
- This patch is not changed
drivers/net/bonding/bond_main.c | 61 ++++++++++++++++++++++++++++++---
include/net/bonding.h | 3 ++
2 files changed, 59 insertions(+), 5 deletions(-)
diff --git a/drivers/net/bonding/bond_main.c b/drivers/net/bonding/bond_main.c
index 0db12fcfc953..7f574e74ed78 100644
--- a/drivers/net/bonding/bond_main.c
+++ b/drivers/net/bonding/bond_main.c
@@ -1857,6 +1857,32 @@ int bond_enslave(struct net_device *bond_dev, struct net_device *slave_dev,
return res;
}
+static void bond_dev_set_lockdep_one(struct net_device *dev,
+ struct netdev_queue *txq,
+ void *_unused)
+{
+ struct bonding *bond = netdev_priv(dev);
+
+ lockdep_set_class(&txq->_xmit_lock, &bond->xmit_lock_key);
+}
+
+static void bond_update_lock_key(struct net_device *dev)
+{
+ struct bonding *bond = netdev_priv(dev);
+
+ lockdep_unregister_key(&bond->stats_lock_key);
+ lockdep_unregister_key(&bond->addr_lock_key);
+ lockdep_unregister_key(&bond->xmit_lock_key);
+
+ lockdep_register_key(&bond->stats_lock_key);
+ lockdep_register_key(&bond->addr_lock_key);
+ lockdep_register_key(&bond->xmit_lock_key);
+
+ lockdep_set_class(&bond->stats_lock, &bond->stats_lock_key);
+ lockdep_set_class(&dev->addr_list_lock, &bond->addr_lock_key);
+ netdev_for_each_tx_queue(dev, bond_dev_set_lockdep_one, NULL);
+}
+
/* Try to release the slave device <slave> from the bond device <master>
* It is legal to access curr_active_slave without a lock because all the function
* is RTNL-locked. If "all" is true it means that the function is being called
@@ -2022,6 +2048,8 @@ static int __bond_release_one(struct net_device *bond_dev,
slave_dev->priv_flags &= ~IFF_BONDING;
bond_free_slave(slave);
+ if (netif_is_bond_master(slave_dev))
+ bond_update_lock_key(slave_dev);
return 0;
}
@@ -3459,7 +3487,7 @@ static void bond_get_stats(struct net_device *bond_dev,
struct list_head *iter;
struct slave *slave;
- spin_lock_nested(&bond->stats_lock, bond_get_nest_level(bond_dev));
+ spin_lock(&bond->stats_lock);
memcpy(stats, &bond->bond_stats, sizeof(*stats));
rcu_read_lock();
@@ -4297,8 +4325,6 @@ void bond_setup(struct net_device *bond_dev)
{
struct bonding *bond = netdev_priv(bond_dev);
- spin_lock_init(&bond->mode_lock);
- spin_lock_init(&bond->stats_lock);
bond->params = bonding_defaults;
/* Initialize pointers */
@@ -4367,6 +4393,9 @@ static void bond_uninit(struct net_device *bond_dev)
list_del(&bond->bond_list);
+ lockdep_unregister_key(&bond->stats_lock_key);
+ lockdep_unregister_key(&bond->addr_lock_key);
+ lockdep_unregister_key(&bond->xmit_lock_key);
bond_debug_unregister(bond);
}
@@ -4758,6 +4787,29 @@ static int bond_check_params(struct bond_params *params)
return 0;
}
+static struct lock_class_key qdisc_tx_busylock_key;
+static struct lock_class_key qdisc_running_key;
+
+static void bond_dev_set_lockdep_class(struct net_device *dev)
+{
+ struct bonding *bond = netdev_priv(dev);
+
+ dev->qdisc_tx_busylock = &qdisc_tx_busylock_key;
+ dev->qdisc_running_key = &qdisc_running_key;
+
+ spin_lock_init(&bond->mode_lock);
+
+ spin_lock_init(&bond->stats_lock);
+ lockdep_register_key(&bond->stats_lock_key);
+ lockdep_set_class(&bond->stats_lock, &bond->stats_lock_key);
+
+ lockdep_register_key(&bond->addr_lock_key);
+ lockdep_set_class(&dev->addr_list_lock, &bond->addr_lock_key);
+
+ lockdep_register_key(&bond->xmit_lock_key);
+ netdev_for_each_tx_queue(dev, bond_dev_set_lockdep_one, NULL);
+}
+
/* Called from registration process */
static int bond_init(struct net_device *bond_dev)
{
@@ -4771,8 +4823,7 @@ static int bond_init(struct net_device *bond_dev)
return -ENOMEM;
bond->nest_level = SINGLE_DEPTH_NESTING;
- netdev_lockdep_set_classes(bond_dev);
-
+ bond_dev_set_lockdep_class(bond_dev);
list_add_tail(&bond->bond_list, &bn->dev_list);
bond_prepare_sysfs_group(bond);
diff --git a/include/net/bonding.h b/include/net/bonding.h
index f7fe45689142..c39ac7061e41 100644
--- a/include/net/bonding.h
+++ b/include/net/bonding.h
@@ -239,6 +239,9 @@ struct bonding {
struct dentry *debug_dir;
#endif /* CONFIG_DEBUG_FS */
struct rtnl_link_stats64 bond_stats;
+ struct lock_class_key stats_lock_key;
+ struct lock_class_key xmit_lock_key;
+ struct lock_class_key addr_lock_key;
};
#define bond_slave_get_rcu(dev) \
--
2.17.1
In the current code, all team devices have same static lockdep key
and team devices could be nested so that it makes unnecessary
lockdep warning.
Test commands:
ip link add team0 type team
for i in {1..7}
do
let A=$i-1
ip link add team$i type team
ip link set team$i master team$A
done
ip link del team0
Splat looks like:
[ 32.862645] WARNING: possible recursive locking detected
[ 32.863304] 5.3.0+ #3 Not tainted
[ 32.863700] --------------------------------------------
[ 32.864358] ip/647 is trying to acquire lock:
[ 32.864968] ffff8880666a6ad8 (&dev_addr_list_lock_key/1){+...}, at: dev_uc_sync_multiple+0xfa/0x1a0
[ 32.866047]
[ 32.866047] but task is already holding lock:
[ 32.866744] ffff888067402558 (&dev_addr_list_lock_key/1){+...}, at: dev_uc_unsync+0x10c/0x1b0
[ 32.867774]
[ 32.867774] other info that might help us debug this:
[ 32.868513] Possible unsafe locking scenario:
[ 32.868513]
[ 32.869180] CPU0
[ 32.872973] ----
[ 32.876717] lock(&dev_addr_list_lock_key/1);
[ 32.877130] lock(&dev_addr_list_lock_key/1);
[ 32.877621]
[ 32.877621] *** DEADLOCK ***
[ 32.877621]
[ 32.878284] May be due to missing lock nesting notation
[ 32.878284]
[ 32.878999] 5 locks held by ip/647:
[ 32.879382] #0: ffffffff8fec7a30 (rtnl_mutex){+.+.}, at: rtnetlink_rcv_msg+0x466/0x8a0
[ 32.880110] #1: ffff888068d5e300 (&team->lock){+.+.}, at: team_uninit+0x3a/0x1a0 [team]
[ 32.880889] #2: ffff888068d5d978 (&dev_addr_list_lock_key){+...}, at: dev_uc_unsync+0x98/0x1b0
[ 32.881660] #3: ffff888067402558 (&dev_addr_list_lock_key/1){+...}, at: dev_uc_unsync+0x10c/0x1b0
[ 32.882451] #4: ffffffff8fb22780 (rcu_read_lock){....}, at: team_set_rx_mode+0x5/0x1d0 [team]
[ 32.883209]
[ 32.883209] stack backtrace:
[ 32.883605] CPU: 0 PID: 647 Comm: ip Not tainted 5.3.0+ #3
[ 32.884144] Hardware name: innotek GmbH VirtualBox/VirtualBox, BIOS VirtualBox 12/01/2006
[ 32.884926] Call Trace:
[ 32.885151] dump_stack+0x7c/0xbb
[ 32.885460] __lock_acquire+0x26a9/0x3df0
[ 32.885964] ? register_lock_class+0x14d0/0x14d0
[ 32.886522] ? register_lock_class+0x14d0/0x14d0
[ 32.887114] lock_acquire+0x164/0x3b0
[ 32.887578] ? dev_uc_sync_multiple+0xfa/0x1a0
[ 32.888130] _raw_spin_lock_nested+0x2e/0x60
[ 32.888725] ? dev_uc_sync_multiple+0xfa/0x1a0
[ 32.889264] dev_uc_sync_multiple+0xfa/0x1a0
[ 32.889779] team_set_rx_mode+0xa9/0x1d0 [team]
[ 32.892841] dev_uc_unsync+0x151/0x1b0
[ ... ]
Fixes: 3d249d4ca7d0 ("net: introduce ethernet teaming device")
Signed-off-by: Taehee Yoo <[email protected]>
---
v1 -> v4 :
- This patch is not changed
drivers/net/team/team.c | 61 ++++++++++++++++++++++++++++++++++++++---
include/linux/if_team.h | 5 ++++
2 files changed, 62 insertions(+), 4 deletions(-)
diff --git a/drivers/net/team/team.c b/drivers/net/team/team.c
index e8089def5a46..bfcd6ed57493 100644
--- a/drivers/net/team/team.c
+++ b/drivers/net/team/team.c
@@ -1607,6 +1607,34 @@ static const struct team_option team_options[] = {
},
};
+static void team_dev_set_lockdep_one(struct net_device *dev,
+ struct netdev_queue *txq,
+ void *_unused)
+{
+ struct team *team = netdev_priv(dev);
+
+ lockdep_set_class(&txq->_xmit_lock, &team->xmit_lock_key);
+}
+
+static struct lock_class_key qdisc_tx_busylock_key;
+static struct lock_class_key qdisc_running_key;
+
+static void team_dev_set_lockdep_class(struct net_device *dev)
+{
+ struct team *team = netdev_priv(dev);
+
+ dev->qdisc_tx_busylock = &qdisc_tx_busylock_key;
+ dev->qdisc_running_key = &qdisc_running_key;
+
+ lockdep_register_key(&team->team_lock_key);
+ __mutex_init(&team->lock, "team->team_lock_key", &team->team_lock_key);
+
+ lockdep_register_key(&team->addr_lock_key);
+ lockdep_set_class(&dev->addr_list_lock, &team->addr_lock_key);
+
+ lockdep_register_key(&team->xmit_lock_key);
+ netdev_for_each_tx_queue(dev, team_dev_set_lockdep_one, NULL);
+}
static int team_init(struct net_device *dev)
{
@@ -1615,7 +1643,6 @@ static int team_init(struct net_device *dev)
int err;
team->dev = dev;
- mutex_init(&team->lock);
team_set_no_mode(team);
team->pcpu_stats = netdev_alloc_pcpu_stats(struct team_pcpu_stats);
@@ -1642,7 +1669,7 @@ static int team_init(struct net_device *dev)
goto err_options_register;
netif_carrier_off(dev);
- netdev_lockdep_set_classes(dev);
+ team_dev_set_lockdep_class(dev);
return 0;
@@ -1673,6 +1700,11 @@ static void team_uninit(struct net_device *dev)
team_queue_override_fini(team);
mutex_unlock(&team->lock);
netdev_change_features(dev);
+
+ lockdep_unregister_key(&team->team_lock_key);
+ lockdep_unregister_key(&team->addr_lock_key);
+ lockdep_unregister_key(&team->xmit_lock_key);
+
}
static void team_destructor(struct net_device *dev)
@@ -1967,6 +1999,23 @@ static int team_add_slave(struct net_device *dev, struct net_device *port_dev,
return err;
}
+static void team_update_lock_key(struct net_device *dev)
+{
+ struct team *team = netdev_priv(dev);
+
+ lockdep_unregister_key(&team->team_lock_key);
+ lockdep_unregister_key(&team->addr_lock_key);
+ lockdep_unregister_key(&team->xmit_lock_key);
+
+ lockdep_register_key(&team->team_lock_key);
+ lockdep_register_key(&team->addr_lock_key);
+ lockdep_register_key(&team->xmit_lock_key);
+
+ lockdep_set_class(&team->lock, &team->team_lock_key);
+ lockdep_set_class(&dev->addr_list_lock, &team->addr_lock_key);
+ netdev_for_each_tx_queue(dev, team_dev_set_lockdep_one, NULL);
+}
+
static int team_del_slave(struct net_device *dev, struct net_device *port_dev)
{
struct team *team = netdev_priv(dev);
@@ -1976,8 +2025,12 @@ static int team_del_slave(struct net_device *dev, struct net_device *port_dev)
err = team_port_del(team, port_dev);
mutex_unlock(&team->lock);
- if (!err)
- netdev_change_features(dev);
+ if (err)
+ return err;
+
+ if (netif_is_team_master(port_dev))
+ team_update_lock_key(port_dev);
+ netdev_change_features(dev);
return err;
}
diff --git a/include/linux/if_team.h b/include/linux/if_team.h
index 06faa066496f..9c97bb19ed34 100644
--- a/include/linux/if_team.h
+++ b/include/linux/if_team.h
@@ -223,6 +223,11 @@ struct team {
atomic_t count_pending;
struct delayed_work dw;
} mcast_rejoin;
+
+ struct lock_class_key team_lock_key;
+ struct lock_class_key xmit_lock_key;
+ struct lock_class_key addr_lock_key;
+
long mode_priv[TEAM_MODE_PRIV_LONGS];
};
--
2.17.1
All macsec device has same lockdep key and subclass is initialized with
nest_level.
But actual nest_level value can be changed when a lower device is attached.
And at this moment, the subclass should be updated but it seems to be
unsafe.
So this patch makes macsec use dynamic lockdep key instead of the subclass.
Test commands:
ip link add bond0 type bond
ip link add dummy0 type dummy
ip link add macsec0 link bond0 type macsec
ip link add macsec1 link dummy0 type macsec
ip link set bond0 mtu 1000
ip link set macsec1 master bond0
ip link set bond0 up
ip link set macsec0 up
ip link set dummy0 up
ip link set macsec1 up
Splat looks like:
[ 29.758606] WARNING: possible recursive locking detected
[ 29.759626] 5.3.0+ #3 Not tainted
[ 29.760670] --------------------------------------------
[ 29.761385] ip/639 is trying to acquire lock:
[ 29.761938] ffff888067680298 (&macsec_netdev_addr_lock_key/1){+...}, at: dev_uc_sync_multiple+0xfa/0x1a0
[ 29.763073]
[ 29.763073] but task is already holding lock:
[ 29.763840] ffff888060148298 (&macsec_netdev_addr_lock_key/1){+...}, at: dev_set_rx_mode+0x19/0x30
[ 29.764931]
[ 29.764931] other info that might help us debug this:
[ 29.765721] Possible unsafe locking scenario:
[ 29.765721]
[ 29.766615] CPU0
[ 29.766914] ----
[ 29.767256] lock(&macsec_netdev_addr_lock_key/1);
[ 29.767847] lock(&macsec_netdev_addr_lock_key/1);
[ 29.768441]
[ 29.768441] *** DEADLOCK ***
[ 29.768441]
[ 29.769158] May be due to missing lock nesting notation
[ 29.769158]
[ 29.770083] 4 locks held by ip/639:
[ 29.770908] #0: ffffffff93ec7a30 (rtnl_mutex){+.+.}, at: rtnetlink_rcv_msg+0x466/0x8a0
[ 29.771970] #1: ffff888060148298 (&macsec_netdev_addr_lock_key/1){+...}, at: dev_set_rx_mode+0x19/0x30
[ 29.773216] #2: ffff888063e58298 (&dev_addr_list_lock_key/3){+...}, at: dev_mc_sync+0xfa/0x1a0
[ 29.774324] #3: ffffffff93b22780 (rcu_read_lock){....}, at: bond_set_rx_mode+0x5/0x3c0 [bonding]
[ 29.775459]
[ 29.775459] stack backtrace:
[ 29.775986] CPU: 0 PID: 639 Comm: ip Not tainted 5.3.0+ #3
[ 29.776719] Hardware name: innotek GmbH VirtualBox/VirtualBox, BIOS VirtualBox 12/01/2006
[ 29.777707] Call Trace:
[ 29.778012] dump_stack+0x7c/0xbb
[ 29.778434] __lock_acquire+0x26a9/0x3df0
[ 29.778920] ? register_lock_class+0x14d0/0x14d0
[ 29.779537] lock_acquire+0x164/0x3b0
[ 29.779981] ? dev_uc_sync_multiple+0xfa/0x1a0
[ 29.780523] ? rcu_read_lock_held+0x90/0xa0
[ 29.781028] _raw_spin_lock_nested+0x2e/0x60
[ 29.781550] ? dev_uc_sync_multiple+0xfa/0x1a0
[ 29.782311] dev_uc_sync_multiple+0xfa/0x1a0
[ 29.782832] bond_set_rx_mode+0x269/0x3c0 [bonding]
[ ... ]
Fixes: e20038724552 ("macsec: fix lockdep splats when nesting devices")
Signed-off-by: Taehee Yoo <[email protected]>
---
v1 -> v4 :
- This patch is not changed
drivers/net/macsec.c | 37 ++++++++++++++++++++++++++++++++-----
1 file changed, 32 insertions(+), 5 deletions(-)
diff --git a/drivers/net/macsec.c b/drivers/net/macsec.c
index cb7637364b40..c4a41b90c846 100644
--- a/drivers/net/macsec.c
+++ b/drivers/net/macsec.c
@@ -267,6 +267,8 @@ struct macsec_dev {
struct pcpu_secy_stats __percpu *stats;
struct list_head secys;
struct gro_cells gro_cells;
+ struct lock_class_key xmit_lock_key;
+ struct lock_class_key addr_lock_key;
unsigned int nest_level;
};
@@ -2750,7 +2752,32 @@ static netdev_tx_t macsec_start_xmit(struct sk_buff *skb,
#define MACSEC_FEATURES \
(NETIF_F_SG | NETIF_F_HIGHDMA | NETIF_F_FRAGLIST)
-static struct lock_class_key macsec_netdev_addr_lock_key;
+
+static void macsec_dev_set_lockdep_one(struct net_device *dev,
+ struct netdev_queue *txq,
+ void *_unused)
+{
+ struct macsec_dev *macsec = macsec_priv(dev);
+
+ lockdep_set_class(&txq->_xmit_lock, &macsec->xmit_lock_key);
+}
+
+static struct lock_class_key qdisc_tx_busylock_key;
+static struct lock_class_key qdisc_running_key;
+
+static void macsec_dev_set_lockdep_class(struct net_device *dev)
+{
+ struct macsec_dev *macsec = macsec_priv(dev);
+
+ dev->qdisc_tx_busylock = &qdisc_tx_busylock_key;
+ dev->qdisc_running_key = &qdisc_running_key;
+
+ lockdep_register_key(&macsec->addr_lock_key);
+ lockdep_set_class(&dev->addr_list_lock, &macsec->addr_lock_key);
+
+ lockdep_register_key(&macsec->xmit_lock_key);
+ netdev_for_each_tx_queue(dev, macsec_dev_set_lockdep_one, NULL);
+}
static int macsec_dev_init(struct net_device *dev)
{
@@ -2781,6 +2808,7 @@ static int macsec_dev_init(struct net_device *dev)
if (is_zero_ether_addr(dev->broadcast))
memcpy(dev->broadcast, real_dev->broadcast, dev->addr_len);
+ macsec_dev_set_lockdep_class(dev);
return 0;
}
@@ -2790,6 +2818,9 @@ static void macsec_dev_uninit(struct net_device *dev)
gro_cells_destroy(&macsec->gro_cells);
free_percpu(dev->tstats);
+
+ lockdep_unregister_key(&macsec->addr_lock_key);
+ lockdep_unregister_key(&macsec->xmit_lock_key);
}
static netdev_features_t macsec_fix_features(struct net_device *dev,
@@ -3264,10 +3295,6 @@ static int macsec_newlink(struct net *net, struct net_device *dev,
dev_hold(real_dev);
macsec->nest_level = dev_get_nest_level(real_dev) + 1;
- netdev_lockdep_set_classes(dev);
- lockdep_set_class_and_subclass(&dev->addr_list_lock,
- &macsec_netdev_addr_lock_key,
- macsec_get_nest_level(dev));
err = netdev_upper_dev_link(real_dev, dev, extack);
if (err < 0)
--
2.17.1
This patch removes variables and callback these are related to the nested
device structure.
devices that can be nested have their own nest_level variable that
represents the depth of nested devices.
In the previous patch, new {lower/upper}_level variables are added and
they replace old private nest_level variable.
So, this patch removes all 'nest_level' variables.
In order to avoid lockdep warning, ->ndo_get_lock_subclass() was added
to get lockdep subclass value, which is actually lower nested depth value.
But now, they use the dynamic lockdep key to avoid lockdep warning instead
of the subclass.
So, this patch removes ->ndo_get_lock_subclass() callback.
Signed-off-by: Taehee Yoo <[email protected]>
---
v1 -> v4 :
- This patch is not changed
drivers/net/bonding/bond_alb.c | 2 +-
drivers/net/bonding/bond_main.c | 14 -------------
.../net/ethernet/mellanox/mlx5/core/en_tc.c | 2 +-
drivers/net/macsec.c | 9 ---------
drivers/net/macvlan.c | 7 -------
include/linux/if_macvlan.h | 1 -
include/linux/if_vlan.h | 12 -----------
include/linux/netdevice.h | 12 -----------
include/net/bonding.h | 1 -
net/8021q/vlan.c | 1 -
net/8021q/vlan_dev.c | 6 ------
net/core/dev.c | 20 -------------------
net/core/dev_addr_lists.c | 12 +++++------
net/smc/smc_core.c | 2 +-
net/smc/smc_pnet.c | 2 +-
15 files changed, 10 insertions(+), 93 deletions(-)
diff --git a/drivers/net/bonding/bond_alb.c b/drivers/net/bonding/bond_alb.c
index 8c79bad2a9a5..4f2e6910c623 100644
--- a/drivers/net/bonding/bond_alb.c
+++ b/drivers/net/bonding/bond_alb.c
@@ -952,7 +952,7 @@ static int alb_upper_dev_walk(struct net_device *upper, void *_data)
struct bond_vlan_tag *tags;
if (is_vlan_dev(upper) &&
- bond->nest_level == vlan_get_encap_level(upper) - 1) {
+ bond->dev->lower_level == upper->lower_level - 1) {
if (upper->addr_assign_type == NET_ADDR_STOLEN) {
alb_send_lp_vid(slave, mac_addr,
vlan_dev_vlan_proto(upper),
diff --git a/drivers/net/bonding/bond_main.c b/drivers/net/bonding/bond_main.c
index 7f574e74ed78..69eb61466fbe 100644
--- a/drivers/net/bonding/bond_main.c
+++ b/drivers/net/bonding/bond_main.c
@@ -1733,8 +1733,6 @@ int bond_enslave(struct net_device *bond_dev, struct net_device *slave_dev,
goto err_upper_unlink;
}
- bond->nest_level = dev_get_nest_level(bond_dev) + 1;
-
/* If the mode uses primary, then the following is handled by
* bond_change_active_slave().
*/
@@ -1983,9 +1981,6 @@ static int __bond_release_one(struct net_device *bond_dev,
if (!bond_has_slaves(bond)) {
bond_set_carrier(bond);
eth_hw_addr_random(bond_dev);
- bond->nest_level = SINGLE_DEPTH_NESTING;
- } else {
- bond->nest_level = dev_get_nest_level(bond_dev) + 1;
}
unblock_netpoll_tx();
@@ -3472,13 +3467,6 @@ static void bond_fold_stats(struct rtnl_link_stats64 *_res,
}
}
-static int bond_get_nest_level(struct net_device *bond_dev)
-{
- struct bonding *bond = netdev_priv(bond_dev);
-
- return bond->nest_level;
-}
-
static void bond_get_stats(struct net_device *bond_dev,
struct rtnl_link_stats64 *stats)
{
@@ -4298,7 +4286,6 @@ static const struct net_device_ops bond_netdev_ops = {
.ndo_neigh_setup = bond_neigh_setup,
.ndo_vlan_rx_add_vid = bond_vlan_rx_add_vid,
.ndo_vlan_rx_kill_vid = bond_vlan_rx_kill_vid,
- .ndo_get_lock_subclass = bond_get_nest_level,
#ifdef CONFIG_NET_POLL_CONTROLLER
.ndo_netpoll_setup = bond_netpoll_setup,
.ndo_netpoll_cleanup = bond_netpoll_cleanup,
@@ -4822,7 +4809,6 @@ static int bond_init(struct net_device *bond_dev)
if (!bond->wq)
return -ENOMEM;
- bond->nest_level = SINGLE_DEPTH_NESTING;
bond_dev_set_lockdep_class(bond_dev);
list_add_tail(&bond->bond_list, &bn->dev_list);
diff --git a/drivers/net/ethernet/mellanox/mlx5/core/en_tc.c b/drivers/net/ethernet/mellanox/mlx5/core/en_tc.c
index 3e78a727f3e6..c4c59d2e676e 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/en_tc.c
+++ b/drivers/net/ethernet/mellanox/mlx5/core/en_tc.c
@@ -3160,7 +3160,7 @@ static int add_vlan_pop_action(struct mlx5e_priv *priv,
struct mlx5_esw_flow_attr *attr,
u32 *action)
{
- int nest_level = vlan_get_encap_level(attr->parse_attr->filter_dev);
+ int nest_level = attr->parse_attr->filter_dev->lower_level;
struct flow_action_entry vlan_act = {
.id = FLOW_ACTION_VLAN_POP,
};
diff --git a/drivers/net/macsec.c b/drivers/net/macsec.c
index 28972da4a0b3..647aeead644d 100644
--- a/drivers/net/macsec.c
+++ b/drivers/net/macsec.c
@@ -269,7 +269,6 @@ struct macsec_dev {
struct gro_cells gro_cells;
struct lock_class_key xmit_lock_key;
struct lock_class_key addr_lock_key;
- unsigned int nest_level;
};
/**
@@ -2989,11 +2988,6 @@ static int macsec_get_iflink(const struct net_device *dev)
return macsec_priv(dev)->real_dev->ifindex;
}
-static int macsec_get_nest_level(struct net_device *dev)
-{
- return macsec_priv(dev)->nest_level;
-}
-
static const struct net_device_ops macsec_netdev_ops = {
.ndo_init = macsec_dev_init,
.ndo_uninit = macsec_dev_uninit,
@@ -3007,7 +3001,6 @@ static const struct net_device_ops macsec_netdev_ops = {
.ndo_start_xmit = macsec_start_xmit,
.ndo_get_stats64 = macsec_get_stats64,
.ndo_get_iflink = macsec_get_iflink,
- .ndo_get_lock_subclass = macsec_get_nest_level,
};
static const struct device_type macsec_type = {
@@ -3290,8 +3283,6 @@ static int macsec_newlink(struct net *net, struct net_device *dev,
if (err < 0)
return err;
- macsec->nest_level = dev_get_nest_level(real_dev) + 1;
-
err = netdev_upper_dev_link(real_dev, dev, extack);
if (err < 0)
goto unregister;
diff --git a/drivers/net/macvlan.c b/drivers/net/macvlan.c
index dae368a2e8d1..2c14bc606514 100644
--- a/drivers/net/macvlan.c
+++ b/drivers/net/macvlan.c
@@ -867,11 +867,6 @@ static int macvlan_do_ioctl(struct net_device *dev, struct ifreq *ifr, int cmd)
#define MACVLAN_STATE_MASK \
((1<<__LINK_STATE_NOCARRIER) | (1<<__LINK_STATE_DORMANT))
-static int macvlan_get_nest_level(struct net_device *dev)
-{
- return ((struct macvlan_dev *)netdev_priv(dev))->nest_level;
-}
-
static void macvlan_dev_set_lockdep_one(struct net_device *dev,
struct netdev_queue *txq,
void *_unused)
@@ -1180,7 +1175,6 @@ static const struct net_device_ops macvlan_netdev_ops = {
.ndo_fdb_add = macvlan_fdb_add,
.ndo_fdb_del = macvlan_fdb_del,
.ndo_fdb_dump = ndo_dflt_fdb_dump,
- .ndo_get_lock_subclass = macvlan_get_nest_level,
#ifdef CONFIG_NET_POLL_CONTROLLER
.ndo_poll_controller = macvlan_dev_poll_controller,
.ndo_netpoll_setup = macvlan_dev_netpoll_setup,
@@ -1464,7 +1458,6 @@ int macvlan_common_newlink(struct net *src_net, struct net_device *dev,
vlan->dev = dev;
vlan->port = port;
vlan->set_features = MACVLAN_FEATURES;
- vlan->nest_level = dev_get_nest_level(lowerdev) + 1;
vlan->mode = MACVLAN_MODE_VEPA;
if (data && data[IFLA_MACVLAN_MODE])
diff --git a/include/linux/if_macvlan.h b/include/linux/if_macvlan.h
index ea5b41823287..e9202edcf101 100644
--- a/include/linux/if_macvlan.h
+++ b/include/linux/if_macvlan.h
@@ -29,7 +29,6 @@ struct macvlan_dev {
netdev_features_t set_features;
enum macvlan_mode mode;
u16 flags;
- int nest_level;
unsigned int macaddr_count;
struct lock_class_key xmit_lock_key;
struct lock_class_key addr_lock_key;
diff --git a/include/linux/if_vlan.h b/include/linux/if_vlan.h
index 1aed9f613e90..6f30284a58e5 100644
--- a/include/linux/if_vlan.h
+++ b/include/linux/if_vlan.h
@@ -182,8 +182,6 @@ struct vlan_dev_priv {
#ifdef CONFIG_NET_POLL_CONTROLLER
struct netpoll *netpoll;
#endif
- unsigned int nest_level;
-
struct lock_class_key xmit_lock_key;
struct lock_class_key addr_lock_key;
};
@@ -224,11 +222,6 @@ extern void vlan_vids_del_by_dev(struct net_device *dev,
extern bool vlan_uses_dev(const struct net_device *dev);
-static inline int vlan_get_encap_level(struct net_device *dev)
-{
- BUG_ON(!is_vlan_dev(dev));
- return vlan_dev_priv(dev)->nest_level;
-}
#else
static inline struct net_device *
__vlan_find_dev_deep_rcu(struct net_device *real_dev,
@@ -298,11 +291,6 @@ static inline bool vlan_uses_dev(const struct net_device *dev)
{
return false;
}
-static inline int vlan_get_encap_level(struct net_device *dev)
-{
- BUG();
- return 0;
-}
#endif
/**
diff --git a/include/linux/netdevice.h b/include/linux/netdevice.h
index d1f99d4f41bb..4133db060593 100644
--- a/include/linux/netdevice.h
+++ b/include/linux/netdevice.h
@@ -1421,7 +1421,6 @@ struct net_device_ops {
void (*ndo_dfwd_del_station)(struct net_device *pdev,
void *priv);
- int (*ndo_get_lock_subclass)(struct net_device *dev);
int (*ndo_set_tx_maxrate)(struct net_device *dev,
int queue_index,
u32 maxrate);
@@ -4060,16 +4059,6 @@ static inline void netif_addr_lock(struct net_device *dev)
spin_lock(&dev->addr_list_lock);
}
-static inline void netif_addr_lock_nested(struct net_device *dev)
-{
- int subclass = SINGLE_DEPTH_NESTING;
-
- if (dev->netdev_ops->ndo_get_lock_subclass)
- subclass = dev->netdev_ops->ndo_get_lock_subclass(dev);
-
- spin_lock_nested(&dev->addr_list_lock, subclass);
-}
-
static inline void netif_addr_lock_bh(struct net_device *dev)
{
spin_lock_bh(&dev->addr_list_lock);
@@ -4354,7 +4343,6 @@ void netdev_lower_state_changed(struct net_device *lower_dev,
extern u8 netdev_rss_key[NETDEV_RSS_KEY_LEN] __read_mostly;
void netdev_rss_key_fill(void *buffer, size_t len);
-int dev_get_nest_level(struct net_device *dev);
int skb_checksum_help(struct sk_buff *skb);
int skb_crc32c_csum_help(struct sk_buff *skb);
int skb_csum_hwoffload_help(struct sk_buff *skb,
diff --git a/include/net/bonding.h b/include/net/bonding.h
index c39ac7061e41..74f41dd73866 100644
--- a/include/net/bonding.h
+++ b/include/net/bonding.h
@@ -203,7 +203,6 @@ struct bonding {
struct slave __rcu *primary_slave;
struct bond_up_slave __rcu *slave_arr; /* Array of usable slaves */
bool force_primary;
- u32 nest_level;
s32 slave_cnt; /* never change this value outside the attach/detach wrappers */
int (*recv_probe)(const struct sk_buff *, struct bonding *,
struct slave *);
diff --git a/net/8021q/vlan.c b/net/8021q/vlan.c
index 54728d2eda18..d4bcfd8f95bf 100644
--- a/net/8021q/vlan.c
+++ b/net/8021q/vlan.c
@@ -172,7 +172,6 @@ int register_vlan_dev(struct net_device *dev, struct netlink_ext_ack *extack)
if (err < 0)
goto out_uninit_mvrp;
- vlan->nest_level = dev_get_nest_level(real_dev) + 1;
err = register_netdevice(dev);
if (err < 0)
goto out_uninit_mvrp;
diff --git a/net/8021q/vlan_dev.c b/net/8021q/vlan_dev.c
index 12bc80650087..e8707827540c 100644
--- a/net/8021q/vlan_dev.c
+++ b/net/8021q/vlan_dev.c
@@ -514,11 +514,6 @@ static void vlan_dev_set_lockdep_class(struct net_device *dev)
netdev_for_each_tx_queue(dev, vlan_dev_set_lockdep_one, NULL);
}
-static int vlan_dev_get_lock_subclass(struct net_device *dev)
-{
- return vlan_dev_priv(dev)->nest_level;
-}
-
static const struct header_ops vlan_header_ops = {
.create = vlan_dev_hard_header,
.parse = eth_header_parse,
@@ -814,7 +809,6 @@ static const struct net_device_ops vlan_netdev_ops = {
.ndo_netpoll_cleanup = vlan_dev_netpoll_cleanup,
#endif
.ndo_fix_features = vlan_dev_fix_features,
- .ndo_get_lock_subclass = vlan_dev_get_lock_subclass,
.ndo_get_iflink = vlan_dev_get_iflink,
};
diff --git a/net/core/dev.c b/net/core/dev.c
index 0b60bcd5033e..3fbd42eb75d1 100644
--- a/net/core/dev.c
+++ b/net/core/dev.c
@@ -7712,26 +7712,6 @@ void *netdev_lower_dev_get_private(struct net_device *dev,
}
EXPORT_SYMBOL(netdev_lower_dev_get_private);
-
-int dev_get_nest_level(struct net_device *dev)
-{
- struct net_device *lower = NULL;
- struct list_head *iter;
- int max_nest = -1;
- int nest;
-
- ASSERT_RTNL();
-
- netdev_for_each_lower_dev(dev, lower, iter) {
- nest = dev_get_nest_level(lower);
- if (max_nest < nest)
- max_nest = nest;
- }
-
- return max_nest + 1;
-}
-EXPORT_SYMBOL(dev_get_nest_level);
-
/**
* netdev_lower_change - Dispatch event about lower device state change
* @lower_dev: device
diff --git a/net/core/dev_addr_lists.c b/net/core/dev_addr_lists.c
index 6393ba930097..2f949b5a1eb9 100644
--- a/net/core/dev_addr_lists.c
+++ b/net/core/dev_addr_lists.c
@@ -637,7 +637,7 @@ int dev_uc_sync(struct net_device *to, struct net_device *from)
if (to->addr_len != from->addr_len)
return -EINVAL;
- netif_addr_lock_nested(to);
+ netif_addr_lock(to);
err = __hw_addr_sync(&to->uc, &from->uc, to->addr_len);
if (!err)
__dev_set_rx_mode(to);
@@ -667,7 +667,7 @@ int dev_uc_sync_multiple(struct net_device *to, struct net_device *from)
if (to->addr_len != from->addr_len)
return -EINVAL;
- netif_addr_lock_nested(to);
+ netif_addr_lock(to);
err = __hw_addr_sync_multiple(&to->uc, &from->uc, to->addr_len);
if (!err)
__dev_set_rx_mode(to);
@@ -691,7 +691,7 @@ void dev_uc_unsync(struct net_device *to, struct net_device *from)
return;
netif_addr_lock_bh(from);
- netif_addr_lock_nested(to);
+ netif_addr_lock(to);
__hw_addr_unsync(&to->uc, &from->uc, to->addr_len);
__dev_set_rx_mode(to);
netif_addr_unlock(to);
@@ -858,7 +858,7 @@ int dev_mc_sync(struct net_device *to, struct net_device *from)
if (to->addr_len != from->addr_len)
return -EINVAL;
- netif_addr_lock_nested(to);
+ netif_addr_lock(to);
err = __hw_addr_sync(&to->mc, &from->mc, to->addr_len);
if (!err)
__dev_set_rx_mode(to);
@@ -888,7 +888,7 @@ int dev_mc_sync_multiple(struct net_device *to, struct net_device *from)
if (to->addr_len != from->addr_len)
return -EINVAL;
- netif_addr_lock_nested(to);
+ netif_addr_lock(to);
err = __hw_addr_sync_multiple(&to->mc, &from->mc, to->addr_len);
if (!err)
__dev_set_rx_mode(to);
@@ -912,7 +912,7 @@ void dev_mc_unsync(struct net_device *to, struct net_device *from)
return;
netif_addr_lock_bh(from);
- netif_addr_lock_nested(to);
+ netif_addr_lock(to);
__hw_addr_unsync(&to->mc, &from->mc, to->addr_len);
__dev_set_rx_mode(to);
netif_addr_unlock(to);
diff --git a/net/smc/smc_core.c b/net/smc/smc_core.c
index 4ca50ddf8d16..a2e91b8d04b3 100644
--- a/net/smc/smc_core.c
+++ b/net/smc/smc_core.c
@@ -558,7 +558,7 @@ int smc_vlan_by_tcpsk(struct socket *clcsock, struct smc_init_info *ini)
}
rtnl_lock();
- nest_lvl = dev_get_nest_level(ndev);
+ nest_lvl = ndev->lower_level;
for (i = 0; i < nest_lvl; i++) {
struct list_head *lower = &ndev->adj_list.lower;
diff --git a/net/smc/smc_pnet.c b/net/smc/smc_pnet.c
index bab2da8cf17a..2920b006f65c 100644
--- a/net/smc/smc_pnet.c
+++ b/net/smc/smc_pnet.c
@@ -718,7 +718,7 @@ static struct net_device *pnet_find_base_ndev(struct net_device *ndev)
int i, nest_lvl;
rtnl_lock();
- nest_lvl = dev_get_nest_level(ndev);
+ nest_lvl = ndev->lower_level;
for (i = 0; i < nest_lvl; i++) {
struct list_head *lower = &ndev->adj_list.lower;
--
2.17.1
All macvlan device has same lockdep key and subclass is initialized with
nest_level.
But actual nest_level value can be changed when a lower device is attached.
And at this moment, the subclass should be updated but it seems to be
unsafe.
So this patch makes macvlan use dynamic lockdep key instead of the
subclass.
Test commands:
ip link add bond0 type bond
ip link add dummy0 type dummy
ip link add macvlan0 link bond0 type macvlan mode bridge
ip link add macvlan1 link dummy0 type macvlan mode bridge
ip link set bond0 mtu 1000
ip link set macvlan1 master bond0
ip link set bond0 up
ip link set macvlan0 up
ip link set dummy0 up
ip link set macvlan1 up
Splat looks like:
[ 30.281866] WARNING: possible recursive locking detected
[ 30.282374] 5.3.0+ #3 Not tainted
[ 30.282673] --------------------------------------------
[ 30.283138] ip/643 is trying to acquire lock:
[ 30.283522] ffff88806750c818 (&macvlan_netdev_addr_lock_key/1){+...}, at: dev_uc_sync_multiple+0xfa/0x1a0
[ 30.284363]
[ 30.284363] but task is already holding lock:
[ 30.284878] ffff88806853ead8 (&macvlan_netdev_addr_lock_key/1){+...}, at: dev_set_rx_mode+0x19/0x30
[ 30.285680]
[ 30.285680] other info that might help us debug this:
[ 30.286274] Possible unsafe locking scenario:
[ 30.286274]
[ 30.286903] CPU0
[ 30.287192] ----
[ 30.287475] lock(&macvlan_netdev_addr_lock_key/1);
[ 30.288121] lock(&macvlan_netdev_addr_lock_key/1);
[ 30.288818]
[ 30.288818] *** DEADLOCK ***
[ 30.288818]
[ 30.294651] May be due to missing lock nesting notation
[ 30.294651]
[ 30.295660] 4 locks held by ip/643:
[ 30.296076] #0: ffffffff93ec7a30 (rtnl_mutex){+.+.}, at: rtnetlink_rcv_msg+0x466/0x8a0
[ 30.297030] #1: ffff88806853ead8 (&macvlan_netdev_addr_lock_key/1){+...}, at: dev_set_rx_mode+0x19/0x30
[ 30.298749] #2: ffff888063b8a3f8 (&dev_addr_list_lock_key/3){+...}, at: dev_uc_sync+0xfa/0x1a0
[ 30.299727] #3: ffffffff93b22780 (rcu_read_lock){....}, at: bond_set_rx_mode+0x5/0x3c0 [bonding]
[ 30.302803]
[ 30.302803] stack backtrace:
[ 30.303254] CPU: 1 PID: 643 Comm: ip Not tainted 5.3.0+ #3
[ 30.303907] Hardware name: innotek GmbH VirtualBox/VirtualBox, BIOS VirtualBox 12/01/2006
[ 30.310458] Call Trace:
[ 30.310694] dump_stack+0x7c/0xbb
[ 30.311016] __lock_acquire+0x26a9/0x3df0
[ 30.311390] ? register_lock_class+0x14d0/0x14d0
[ 30.311815] lock_acquire+0x164/0x3b0
[ 30.312237] ? dev_uc_sync_multiple+0xfa/0x1a0
[ 30.312776] ? rcu_read_lock_held+0x90/0xa0
[ 30.313293] _raw_spin_lock_nested+0x2e/0x60
[ 30.313819] ? dev_uc_sync_multiple+0xfa/0x1a0
[ 30.314429] dev_uc_sync_multiple+0xfa/0x1a0
[ 30.314950] bond_set_rx_mode+0x269/0x3c0 [bonding]
[ 30.315541] ? bond_init+0x6f0/0x6f0 [bonding]
[ 30.316075] dev_uc_sync+0x15a/0x1a0
[ ... ]
Fixes: c674ac30c549 ("macvlan: Fix lockdep warnings with stacked macvlan devices")
Signed-off-by: Taehee Yoo <[email protected]>
---
v1 -> v4 :
- This patch is not changed
drivers/net/macvlan.c | 35 +++++++++++++++++++++++++++--------
include/linux/if_macvlan.h | 2 ++
2 files changed, 29 insertions(+), 8 deletions(-)
diff --git a/drivers/net/macvlan.c b/drivers/net/macvlan.c
index 940192c057b6..dae368a2e8d1 100644
--- a/drivers/net/macvlan.c
+++ b/drivers/net/macvlan.c
@@ -852,8 +852,6 @@ static int macvlan_do_ioctl(struct net_device *dev, struct ifreq *ifr, int cmd)
* "super class" of normal network devices; split their locks off into a
* separate class since they always nest.
*/
-static struct lock_class_key macvlan_netdev_addr_lock_key;
-
#define ALWAYS_ON_OFFLOADS \
(NETIF_F_SG | NETIF_F_HW_CSUM | NETIF_F_GSO_SOFTWARE | \
NETIF_F_GSO_ROBUST | NETIF_F_GSO_ENCAP_ALL)
@@ -874,12 +872,30 @@ static int macvlan_get_nest_level(struct net_device *dev)
return ((struct macvlan_dev *)netdev_priv(dev))->nest_level;
}
-static void macvlan_set_lockdep_class(struct net_device *dev)
+static void macvlan_dev_set_lockdep_one(struct net_device *dev,
+ struct netdev_queue *txq,
+ void *_unused)
+{
+ struct macvlan_dev *macvlan = netdev_priv(dev);
+
+ lockdep_set_class(&txq->_xmit_lock, &macvlan->xmit_lock_key);
+}
+
+static struct lock_class_key qdisc_tx_busylock_key;
+static struct lock_class_key qdisc_running_key;
+
+static void macvlan_dev_set_lockdep_class(struct net_device *dev)
{
- netdev_lockdep_set_classes(dev);
- lockdep_set_class_and_subclass(&dev->addr_list_lock,
- &macvlan_netdev_addr_lock_key,
- macvlan_get_nest_level(dev));
+ struct macvlan_dev *macvlan = netdev_priv(dev);
+
+ dev->qdisc_tx_busylock = &qdisc_tx_busylock_key;
+ dev->qdisc_running_key = &qdisc_running_key;
+
+ lockdep_register_key(&macvlan->addr_lock_key);
+ lockdep_set_class(&dev->addr_list_lock, &macvlan->addr_lock_key);
+
+ lockdep_register_key(&macvlan->xmit_lock_key);
+ netdev_for_each_tx_queue(dev, macvlan_dev_set_lockdep_one, NULL);
}
static int macvlan_init(struct net_device *dev)
@@ -900,7 +916,7 @@ static int macvlan_init(struct net_device *dev)
dev->gso_max_segs = lowerdev->gso_max_segs;
dev->hard_header_len = lowerdev->hard_header_len;
- macvlan_set_lockdep_class(dev);
+ macvlan_dev_set_lockdep_class(dev);
vlan->pcpu_stats = netdev_alloc_pcpu_stats(struct vlan_pcpu_stats);
if (!vlan->pcpu_stats)
@@ -922,6 +938,9 @@ static void macvlan_uninit(struct net_device *dev)
port->count -= 1;
if (!port->count)
macvlan_port_destroy(port->dev);
+
+ lockdep_unregister_key(&vlan->addr_lock_key);
+ lockdep_unregister_key(&vlan->xmit_lock_key);
}
static void macvlan_dev_get_stats64(struct net_device *dev,
diff --git a/include/linux/if_macvlan.h b/include/linux/if_macvlan.h
index 2e55e4cdbd8a..ea5b41823287 100644
--- a/include/linux/if_macvlan.h
+++ b/include/linux/if_macvlan.h
@@ -31,6 +31,8 @@ struct macvlan_dev {
u16 flags;
int nest_level;
unsigned int macaddr_count;
+ struct lock_class_key xmit_lock_key;
+ struct lock_class_key addr_lock_key;
#ifdef CONFIG_NET_POLL_CONTROLLER
struct netpoll *netpoll;
#endif
--
2.17.1
Current vxlan code doesn't limit the number of nested devices.
Nested devices would be handled recursively and this routine needs
huge stack memory. So, unlimited nested devices could make
stack overflow.
In order to fix this issue, this patch adds adjacent links.
The adjacent link APIs internally check the depth level.
Test commands:
ip link add dummy0 type dummy
ip link add vxlan0 type vxlan id 0 group 239.1.1.1 dev dummy0 \
dstport 4789
for i in {1..100}
do
let A=$i-1
ip link add vxlan$i type vxlan id $i group 239.1.1.1 \
dev vxlan$A dstport 4789
done
ip link del dummy0
The top upper link is vxlan100 and the lowest link is vxlan0.
When vxlan0 is deleting, the upper devices will be deleted recursively.
It needs huge stack memory so it makes stack overflow.
Splat looks like:
[ 229.628477] =============================================================================
[ 229.629785] BUG page->ptl (Not tainted): Padding overwritten. 0x0000000026abf214-0x0000000091f6abb2
[ 229.629785] -----------------------------------------------------------------------------
[ 229.629785]
[ 229.655439] ==================================================================
[ 229.629785] INFO: Slab 0x00000000ff7cfda8 objects=19 used=19 fp=0x00000000fe33776c flags=0x200000000010200
[ 229.655688] BUG: KASAN: stack-out-of-bounds in unmap_single_vma+0x25a/0x2e0
[ 229.655688] Read of size 8 at addr ffff888113076928 by task vlan-network-in/2334
[ 229.655688]
[ 229.629785] Padding 0000000026abf214: 00 80 14 0d 81 88 ff ff 68 91 81 14 81 88 ff ff ........h.......
[ 229.629785] Padding 0000000001e24790: 38 91 81 14 81 88 ff ff 68 91 81 14 81 88 ff ff 8.......h.......
[ 229.629785] Padding 00000000b39397c8: 33 30 62 a7 ff ff ff ff ff eb 60 22 10 f1 ff 1f 30b.......`"....
[ 229.629785] Padding 00000000bc98f53a: 80 60 07 13 81 88 ff ff 00 80 14 0d 81 88 ff ff .`..............
[ 229.629785] Padding 000000002aa8123d: 68 91 81 14 81 88 ff ff f7 21 17 a7 ff ff ff ff h........!......
[ 229.629785] Padding 000000001c8c2369: 08 81 14 0d 81 88 ff ff 03 02 00 00 00 00 00 00 ................
[ 229.629785] Padding 000000004e290c5d: 21 90 a2 21 10 ed ff ff 00 00 00 00 00 fc ff df !..!............
[ 229.629785] Padding 000000000e25d731: 18 60 07 13 81 88 ff ff c0 8b 13 05 81 88 ff ff .`..............
[ 229.629785] Padding 000000007adc7ab3: b3 8a b5 41 00 00 00 00 ...A....
[ 229.629785] FIX page->ptl: Restoring 0x0000000026abf214-0x0000000091f6abb2=0x5a
[ ... ]
Fixes: acaf4e70997f ("net: vxlan: when lower dev unregisters remove vxlan dev as well")
Signed-off-by: Taehee Yoo <[email protected]>
---
v3 -> v4 :
- Fix wrong usage netdev_upper_dev_link() in the vxlan.c
- Preserve reverse christmas tree variable ordering in the vxlan.c
v1 -> v3 :
- This patch is not changed
drivers/net/vxlan.c | 52 ++++++++++++++++++++++++++++++++++++---------
include/net/vxlan.h | 1 +
2 files changed, 43 insertions(+), 10 deletions(-)
diff --git a/drivers/net/vxlan.c b/drivers/net/vxlan.c
index 3d9bcc957f7d..5537998d6137 100644
--- a/drivers/net/vxlan.c
+++ b/drivers/net/vxlan.c
@@ -3566,10 +3566,13 @@ static int __vxlan_dev_create(struct net *net, struct net_device *dev,
{
struct vxlan_net *vn = net_generic(net, vxlan_net_id);
struct vxlan_dev *vxlan = netdev_priv(dev);
+ struct net_device *remote_dev = NULL;
struct vxlan_fdb *f = NULL;
bool unregister = false;
+ struct vxlan_rdst *dst;
int err;
+ dst = &vxlan->default_dst;
err = vxlan_dev_configure(net, dev, conf, false, extack);
if (err)
return err;
@@ -3577,14 +3580,14 @@ static int __vxlan_dev_create(struct net *net, struct net_device *dev,
dev->ethtool_ops = &vxlan_ethtool_ops;
/* create an fdb entry for a valid default destination */
- if (!vxlan_addr_any(&vxlan->default_dst.remote_ip)) {
+ if (!vxlan_addr_any(&dst->remote_ip)) {
err = vxlan_fdb_create(vxlan, all_zeros_mac,
- &vxlan->default_dst.remote_ip,
+ &dst->remote_ip,
NUD_REACHABLE | NUD_PERMANENT,
vxlan->cfg.dst_port,
- vxlan->default_dst.remote_vni,
- vxlan->default_dst.remote_vni,
- vxlan->default_dst.remote_ifindex,
+ dst->remote_vni,
+ dst->remote_vni,
+ dst->remote_ifindex,
NTF_SELF, &f);
if (err)
return err;
@@ -3595,26 +3598,41 @@ static int __vxlan_dev_create(struct net *net, struct net_device *dev,
goto errout;
unregister = true;
+ if (dst->remote_ifindex) {
+ remote_dev = __dev_get_by_index(net, dst->remote_ifindex);
+ if (!remote_dev)
+ goto errout;
+
+ err = netdev_upper_dev_link(remote_dev, dev, extack);
+ if (err)
+ goto errout;
+ }
+
err = rtnl_configure_link(dev, NULL);
if (err)
- goto errout;
+ goto unlink;
if (f) {
- vxlan_fdb_insert(vxlan, all_zeros_mac,
- vxlan->default_dst.remote_vni, f);
+ vxlan_fdb_insert(vxlan, all_zeros_mac, dst->remote_vni, f);
/* notify default fdb entry */
err = vxlan_fdb_notify(vxlan, f, first_remote_rtnl(f),
RTM_NEWNEIGH, true, extack);
if (err) {
vxlan_fdb_destroy(vxlan, f, false, false);
+ if (remote_dev)
+ netdev_upper_dev_unlink(remote_dev, dev);
goto unregister;
}
}
list_add(&vxlan->next, &vn->vxlan_list);
+ if (remote_dev)
+ dst->remote_dev = remote_dev;
return 0;
-
+unlink:
+ if (remote_dev)
+ netdev_upper_dev_unlink(remote_dev, dev);
errout:
/* unregister_netdevice() destroys the default FDB entry with deletion
* notification. But the addition notification was not sent yet, so
@@ -3932,11 +3950,12 @@ static int vxlan_changelink(struct net_device *dev, struct nlattr *tb[],
struct netlink_ext_ack *extack)
{
struct vxlan_dev *vxlan = netdev_priv(dev);
- struct vxlan_rdst *dst = &vxlan->default_dst;
struct net_device *lowerdev;
struct vxlan_config conf;
+ struct vxlan_rdst *dst;
int err;
+ dst = &vxlan->default_dst;
err = vxlan_nl2conf(tb, data, dev, &conf, true, extack);
if (err)
return err;
@@ -3946,6 +3965,11 @@ static int vxlan_changelink(struct net_device *dev, struct nlattr *tb[],
if (err)
return err;
+ err = netdev_adjacent_change_prepare(dst->remote_dev, lowerdev, dev,
+ extack);
+ if (err)
+ return err;
+
/* handle default dst entry */
if (!vxlan_addr_equal(&conf.remote_ip, &dst->remote_ip)) {
u32 hash_index = fdb_head_index(vxlan, all_zeros_mac, conf.vni);
@@ -3962,6 +3986,8 @@ static int vxlan_changelink(struct net_device *dev, struct nlattr *tb[],
NTF_SELF, true, extack);
if (err) {
spin_unlock_bh(&vxlan->hash_lock[hash_index]);
+ netdev_adjacent_change_abort(dst->remote_dev,
+ lowerdev, dev);
return err;
}
}
@@ -3979,6 +4005,10 @@ static int vxlan_changelink(struct net_device *dev, struct nlattr *tb[],
if (conf.age_interval != vxlan->cfg.age_interval)
mod_timer(&vxlan->age_timer, jiffies);
+ netdev_adjacent_change_commit(dst->remote_dev, lowerdev, dev);
+ if (lowerdev && lowerdev != dst->remote_dev)
+ dst->remote_dev = lowerdev;
+
vxlan_config_apply(dev, &conf, lowerdev, vxlan->net, true);
return 0;
}
@@ -3991,6 +4021,8 @@ static void vxlan_dellink(struct net_device *dev, struct list_head *head)
list_del(&vxlan->next);
unregister_netdevice_queue(dev, head);
+ if (vxlan->default_dst.remote_dev)
+ netdev_upper_dev_unlink(vxlan->default_dst.remote_dev, dev);
}
static size_t vxlan_get_size(const struct net_device *dev)
diff --git a/include/net/vxlan.h b/include/net/vxlan.h
index 335283dbe9b3..373aadcfea21 100644
--- a/include/net/vxlan.h
+++ b/include/net/vxlan.h
@@ -197,6 +197,7 @@ struct vxlan_rdst {
u8 offloaded:1;
__be32 remote_vni;
u32 remote_ifindex;
+ struct net_device *remote_dev;
struct list_head list;
struct rcu_head rcu;
struct dst_cache dst_cache;
--
2.17.1
virt_wifi_newlink() calls netdev_upper_dev_link() and it internally
holds reference count of lower interface.
Current code does not release a reference count of the lower interface
when the lower interface is being deleted.
So, reference count leaks occur.
Test commands:
ip link add dummy0 type dummy
ip link add vw1 link dummy0 type virt_wifi
Splat looks like:
[ 182.001918][ T1333] WARNING: CPU: 0 PID: 1333 at net/core/dev.c:8638 rollback_registered_many+0x75d/0xda0
[ 182.002724][ T1333] Modules linked in: virt_wifi cfg80211 dummy veth openvswitch nsh nf_conncount nf_nat nf_conntrack6
[ 182.002724][ T1333] CPU: 0 PID: 1333 Comm: ip Not tainted 5.3.0+ #370
[ 182.002724][ T1333] RIP: 0010:rollback_registered_many+0x75d/0xda0
[ 182.002724][ T1333] Code: 0c 00 00 48 89 de 4c 89 ff e8 6f 5a 04 00 48 89 df e8 c7 26 fd ff 84 c0 0f 84 a5 fd ff ff 40
[ 182.002724][ T1333] RSP: 0018:ffff88810900f348 EFLAGS: 00010286
[ 182.002724][ T1333] RAX: 0000000000000024 RBX: ffff88811361d700 RCX: 0000000000000000
[ 182.002724][ T1333] RDX: 0000000000000024 RSI: 0000000000000008 RDI: ffffed1021201e5f
[ 182.002724][ T1333] RBP: ffff88810900f4e0 R08: ffffed1022c3ff71 R09: ffffed1022c3ff71
[ 182.002724][ T1333] R10: 0000000000000001 R11: ffffed1022c3ff70 R12: dffffc0000000000
[ 182.002724][ T1333] R13: ffff88810900f460 R14: ffff88810900f420 R15: ffff8881075f1940
[ 182.002724][ T1333] FS: 00007f77c42240c0(0000) GS:ffff888116000000(0000) knlGS:0000000000000000
[ 182.002724][ T1333] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 182.002724][ T1333] CR2: 00007f77c3706240 CR3: 000000011139e000 CR4: 00000000001006f0
[ 182.002724][ T1333] Call Trace:
[ 182.002724][ T1333] ? generic_xdp_install+0x310/0x310
[ 182.002724][ T1333] ? check_chain_key+0x236/0x5d0
[ 182.002724][ T1333] ? __nla_validate_parse+0x98/0x1ad0
[ 182.002724][ T1333] unregister_netdevice_many.part.123+0x13/0x1b0
[ 182.002724][ T1333] rtnl_delete_link+0xbc/0x100
[ 182.002724][ T1333] ? rtnl_af_register+0xc0/0xc0
[ 182.002724][ T1333] rtnl_dellink+0x2e7/0x870
[ ... ]
[ 192.874736][ T1333] unregister_netdevice: waiting for dummy0 to become free. Usage count = 1
This patch adds notifier routine to delete upper interface before deleting
lower interface.
Fixes: c7cdba31ed8b ("mac80211-next: rtnetlink wifi simulation device")
Signed-off-by: Taehee Yoo <[email protected]>
---
v4:
- Add this new patch to fix refcnt leaks in the virt_wifi module
drivers/net/wireless/virt_wifi.c | 51 ++++++++++++++++++++++++++++++--
1 file changed, 49 insertions(+), 2 deletions(-)
diff --git a/drivers/net/wireless/virt_wifi.c b/drivers/net/wireless/virt_wifi.c
index be92e1220284..aadbacb01c8d 100644
--- a/drivers/net/wireless/virt_wifi.c
+++ b/drivers/net/wireless/virt_wifi.c
@@ -590,6 +590,42 @@ static struct rtnl_link_ops virt_wifi_link_ops = {
.priv_size = sizeof(struct virt_wifi_netdev_priv),
};
+static inline bool netif_is_virt_wifi_dev(const struct net_device *dev)
+{
+ return rcu_access_pointer(dev->rx_handler) == virt_wifi_rx_handler;
+}
+
+static int virt_wifi_event(struct notifier_block *this, unsigned long event,
+ void *ptr)
+{
+ struct net_device *lower_dev = netdev_notifier_info_to_dev(ptr);
+ struct virt_wifi_netdev_priv *priv;
+ struct net_device *upper_dev;
+ LIST_HEAD(list_kill);
+
+ if (!netif_is_virt_wifi_dev(lower_dev))
+ return NOTIFY_DONE;
+
+ switch (event) {
+ case NETDEV_UNREGISTER:
+ priv = rtnl_dereference(lower_dev->rx_handler_data);
+ if (!priv)
+ return NOTIFY_DONE;
+
+ upper_dev = priv->upperdev;
+
+ upper_dev->rtnl_link_ops->dellink(upper_dev, &list_kill);
+ unregister_netdevice_many(&list_kill);
+ break;
+ }
+
+ return NOTIFY_DONE;
+}
+
+static struct notifier_block virt_wifi_notifier = {
+ .notifier_call = virt_wifi_event,
+};
+
/* Acquires and releases the rtnl lock. */
static int __init virt_wifi_init_module(void)
{
@@ -598,14 +634,24 @@ static int __init virt_wifi_init_module(void)
/* Guaranteed to be locallly-administered and not multicast. */
eth_random_addr(fake_router_bssid);
+ err = register_netdevice_notifier(&virt_wifi_notifier);
+ if (err)
+ return err;
+
common_wiphy = virt_wifi_make_wiphy();
if (!common_wiphy)
- return -ENOMEM;
+ goto notifier;
err = rtnl_link_register(&virt_wifi_link_ops);
if (err)
- virt_wifi_destroy_wiphy(common_wiphy);
+ goto destroy_wiphy;
+ return 0;
+
+destroy_wiphy:
+ virt_wifi_destroy_wiphy(common_wiphy);
+notifier:
+ unregister_netdevice_notifier(&virt_wifi_notifier);
return err;
}
@@ -615,6 +661,7 @@ static void __exit virt_wifi_cleanup_module(void)
/* Will delete any devices that depend on the wiphy. */
rtnl_link_unregister(&virt_wifi_link_ops);
virt_wifi_destroy_wiphy(common_wiphy);
+ unregister_netdevice_notifier(&virt_wifi_notifier);
}
module_init(virt_wifi_init_module);
--
2.17.1
On Sat, 2019-09-28 at 16:48 +0000, Taehee Yoo wrote:
> virt_wifi_newlink() calls netdev_upper_dev_link() and it internally
> holds reference count of lower interface.
[...]
> This patch adds notifier routine to delete upper interface before deleting
> lower interface.
Good catch, thanks!
For now I'll assume this will go in through net together with the whole
series (once ready), shout if you want something else.
johannes
Hi,
I hadn't seen the previous patchsets of this, and looking briefly in the
archives didn't really seem to say anything about this.
However, I'm wondering now: patches 2-7 of this patchset look basically
all identical in a way:
* you set the addr_list_lock's class to a newly registered key inside
the netdev (or rather the private struct, but doesn't make a big
difference)
* you set each TX queue's _xmit_lock's class similarly
* you set the qdisc_tx_busylock/qdisc_running_key
The first two of these look pretty much completely identical.
Would it perhaps make sense to just do that for *every* netdev? Many of
those netdevs won't ever nest so it wouldn't really be useful, but I'm
not convinced it would put that much more strain on lockdep - if
anything, people are probably creating more VLANs than regular PF/VF
netdevs anyway?
I didn't see any discussion on this, but perhaps I missed it? The cost
would be a bigger netdev struct (when lockdep is enabled), but we
already have that for all the VLANs etc. it's just in the private data,
so it's not a _huge_ difference really I'd think, and this is quite a
bit of code for each device type now.
Alternatively, maybe there could just be some common helper code:
struct nested_netdev_lockdep {
struct lock_class_key xmit_lock_key;
struct lock_class_key addr_lock_key;
};
void netdev_init_nested_lockdep(struct net_device *dev,
struct netsted_netdev_lockdep *l)
{
/* ... */
}
so you just have to embed a "struct nested_netdev_lockdep" in your
private data structure and call the common code.
Or maybe make that
void netdev_init_nested_lockdep(
struct net_device *dev,
struct
netsted_netdev_lockdep *l,
struct lock_class_key
*qdisc_tx_busylock_key,
struct lock_class_key *qdisc_running_key)
so you can't really get that part wrong either?
> @@ -922,6 +938,9 @@ static void macvlan_uninit(struct net_device *dev)
> port->count -= 1;
> if (!port->count)
> macvlan_port_destroy(port->dev);
> +
> + lockdep_unregister_key(&vlan->addr_lock_key);
> + lockdep_unregister_key(&vlan->xmit_lock_key);
> }
OK, so I guess you need an equivalent "deinit" function too -
netdev_deinit_nested_lockdep() or so.
What's not really clear to me is why the qdisc locks can actually stay
the same at all levels? Can they just never nest? But then why are they
different per device type?
Thanks,
johannes
> VLAN, BONDING, TEAM, MACSEC, MACVLAN, IPVLAN, VIRT_WIFI and VXLAN.
> But I couldn't test all interface types so there could be more device
> types which have similar problems.
Did you test virt_wifi? I don't see how it *doesn't* have the nesting
problem, and you didn't change it?
No, I see. You're limiting the nesting generally now in patch 1, and the
others are just lockdep fixups (I guess it's surprising virt_wifi
doesn't do this at all?).
FWIW I don't think virt_wifi really benefits at all from stacking, so we
could just do something like
--- a/drivers/net/wireless/virt_wifi.c
+++ b/drivers/net/wireless/virt_wifi.c
@@ -508,6 +508,9 @@ static int virt_wifi_newlink(struct net *src_net, struct net_device *dev,
else if (dev->mtu > priv->lowerdev->mtu)
return -EINVAL;
+ if (priv->lowerdev->ieee80211_ptr)
+ return -EINVAL;
+
err = netdev_rx_handler_register(priv->lowerdev, virt_wifi_rx_handler,
priv);
if (err) {
IMHO, but of course generally limiting the stack depth is needed anyway
and solves the problem well enough for virt_wifi.
johannes
Hi,
> int netdev_walk_all_upper_dev_rcu(struct net_device *dev,
> int (*fn)(struct net_device *dev,
> void *data),
> void *data)
> {
[...]
> }
>
> return 0;
> +
> }
that seems like an oversight, probably from editing the patch in
different versions?
> +static int __netdev_update_upper_level(struct net_device *dev, void *data)
> +{
> + dev->upper_level = __netdev_upper_depth(dev) + 1;
> + return 0;
> +}
> +
> +static int __netdev_update_lower_level(struct net_device *dev, void *data)
> +{
> + dev->lower_level = __netdev_lower_depth(dev) + 1;
> + return 0;
> +}
Is there any point in the return value here? You don't really use it,
afaict? I guess I might see the point if it was used for tail-call
optimisation or such?
Also, I dunno, I guess netdevs aren't as much under pressure as SKBs :-)
but do we actually gain much from storing the nesting level at all? You
have to maintain it all the time anyway when adding/removing and that's
the only place where you also check it, so perhaps it wouldn't be that
bad to just count at that time?
But then again the counting would probably be recursive again ...
> return 0;
> +
> }
> EXPORT_SYMBOL_GPL(netdev_walk_all_lower_dev_rcu);
same nit as above
> + __netdev_update_upper_level(dev, NULL);
> + netdev_walk_all_lower_dev(dev, __netdev_update_upper_level, NULL);
> +
> + __netdev_update_lower_level(upper_dev, NULL);
> + netdev_walk_all_upper_dev(upper_dev, __netdev_update_lower_level, NULL);
Actually, if I'm reading this correctly you already walk all the levels
anyway? Then couldn't you calculate the depth at this time and return
it, instead of storing it? Though, if it actually overflowed then you'd
have to walk *again* to undo that?
Hmm, actually, if you don't store the value you don't even need to walk
here I guess, or at least you would only have to do it to verify you
*can* attach, but wouldn't have to in detach?
So it looks to me like on attach (i.e. this code, quoted from
__netdev_upper_dev_link) you're already walking the entire graph to
update the level values, and could probably instead calculate the
nesting depth to validate it?
And then on netdev_upper_dev_unlink() you wouldn't even have to walk the
graph at all, since you only need that to update the values that you
stored.
But maybe I'm misinterpreting this completely?
Thanks,
johannes
On Sat, 2019-09-28 at 16:48 +0000, Taehee Yoo wrote:
> This patch removes variables and callback these are related to the nested
> device structure.
> devices that can be nested have their own nest_level variable that
> represents the depth of nested devices.
> In the previous patch, new {lower/upper}_level variables are added and
> they replace old private nest_level variable.
> So, this patch removes all 'nest_level' variables.
Ah, well, I see at least this patch also needs the nesting level tracked
in the netdev, at least the "lower_level".
johannes
On Sun, 29 Sep 2019 at 04:15, Johannes Berg <[email protected]> wrote:
>
> Hi,
>
Hi Johannes,
Thank you so much for the detailed reviews!
> I hadn't seen the previous patchsets of this, and looking briefly in the
> archives didn't really seem to say anything about this.
>
>
> However, I'm wondering now: patches 2-7 of this patchset look basically
> all identical in a way:
> * you set the addr_list_lock's class to a newly registered key inside
> the netdev (or rather the private struct, but doesn't make a big
> difference)
> * you set each TX queue's _xmit_lock's class similarly
> * you set the qdisc_tx_busylock/qdisc_running_key
>
> The first two of these look pretty much completely identical.
>
> Would it perhaps make sense to just do that for *every* netdev? Many of
> those netdevs won't ever nest so it wouldn't really be useful, but I'm
> not convinced it would put that much more strain on lockdep - if
> anything, people are probably creating more VLANs than regular PF/VF
> netdevs anyway?
>
> I didn't see any discussion on this, but perhaps I missed it? The cost
> would be a bigger netdev struct (when lockdep is enabled), but we
> already have that for all the VLANs etc. it's just in the private data,
> so it's not a _huge_ difference really I'd think, and this is quite a
> bit of code for each device type now.
>
Actually I agree with your opinion.
The benefits of this way are to be able to make common helper functions.
That would reduce duplicate codes and we can maintain this more easily.
But I'm not sure about the overhead of this way. So I would like to ask
maintainers and more reviewers about this.
> Alternatively, maybe there could just be some common helper code:
>
> struct nested_netdev_lockdep {
> struct lock_class_key xmit_lock_key;
> struct lock_class_key addr_lock_key;
> };
>
> void netdev_init_nested_lockdep(struct net_device *dev,
> struct netsted_netdev_lockdep *l)
> {
> /* ... */
> }
>
> so you just have to embed a "struct nested_netdev_lockdep" in your
> private data structure and call the common code.
>
> Or maybe make that
>
> void netdev_init_nested_lockdep(
> struct net_device *dev,
> struct
> netsted_netdev_lockdep *l,
> struct lock_class_key
> *qdisc_tx_busylock_key,
> struct lock_class_key *qdisc_running_key)
>
> so you can't really get that part wrong either?
>
>
> > @@ -922,6 +938,9 @@ static void macvlan_uninit(struct net_device *dev)
> > port->count -= 1;
> > if (!port->count)
> > macvlan_port_destroy(port->dev);
> > +
> > + lockdep_unregister_key(&vlan->addr_lock_key);
> > + lockdep_unregister_key(&vlan->xmit_lock_key);
> > }
>
> OK, so I guess you need an equivalent "deinit" function too -
> netdev_deinit_nested_lockdep() or so.
>
Using "struct nested_netdev_lockdep" looks really good.
I will make common codes such as "struct nested_netdev_lockdep"
and "netdev_devinit_nested_lockdep" and others in a v5 patch.
>
> What's not really clear to me is why the qdisc locks can actually stay
> the same at all levels? Can they just never nest? But then why are they
> different per device type?
I didn't test about qdisc so I didn't modify code related to qdisc code.
If someone reviews this, I would really appreciate.
Thank you
>
> Thanks,
> johannes
>
On Sun, 29 Sep 2019 at 04:20, Johannes Berg <[email protected]> wrote:
>
>
> > VLAN, BONDING, TEAM, MACSEC, MACVLAN, IPVLAN, VIRT_WIFI and VXLAN.
> > But I couldn't test all interface types so there could be more device
> > types which have similar problems.
>
> Did you test virt_wifi? I don't see how it *doesn't* have the nesting
> problem, and you didn't change it?
>
> No, I see. You're limiting the nesting generally now in patch 1, and the
> others are just lockdep fixups (I guess it's surprising virt_wifi
> doesn't do this at all?).
virt_wifi case is a little bit different case.
I add the last patch that is to fix refcnt leaks in the virt_wifi module.
The way to fix this is to add notifier routine.
The notifier routine could delete lower device before deleting
virt_wifi device.
If virt_wifi devices are nested, notifier would work recursively.
At that time, it would make stack memory overflow.
Actually, before this patch, virt_wifi doesn't have the same problem.
So, I will update a comment in a v5 patch.
>
> FWIW I don't think virt_wifi really benefits at all from stacking, so we
> could just do something like
>
> --- a/drivers/net/wireless/virt_wifi.c
> +++ b/drivers/net/wireless/virt_wifi.c
> @@ -508,6 +508,9 @@ static int virt_wifi_newlink(struct net *src_net, struct net_device *dev,
> else if (dev->mtu > priv->lowerdev->mtu)
> return -EINVAL;
>
> + if (priv->lowerdev->ieee80211_ptr)
> + return -EINVAL;
> +
> err = netdev_rx_handler_register(priv->lowerdev, virt_wifi_rx_handler,
> priv);
> if (err) {
>
Many other devices use this way to avoid wrong nesting configuration.
And I think it's a good way.
But we should think about the below configuration.
vlan5
|
virt_wifi4
|
vlan3
|
virt_wifi2
|
vlan1
|
dummy0
That code wouldn't avoid this configuration.
And all devices couldn't avoid this config.
I have been considering this case, but I couldn't make a decision yet.
Maybe common netdev function is needed to find the same device type
in their graph.
>
>
> IMHO, but of course generally limiting the stack depth is needed anyway
> and solves the problem well enough for virt_wifi.
>
>
This is a little bit different question for you.
I found another bug in virt_wifi after my last patch.
Please test below commands
ip link add dummy0 type dummy
ip link add vw1 link dummy0 type virt_wifi
ip link add vw2 link vw1 type virt_wifi
modprobe -rv virt_wifi
Then, you can see the warning messages.
If SET_NETDEV_DEV() is deleted in the virt_wifi_newlink(),
you can avoid that warning message.
But I'm not sure about it's safe to remove that.
I would really appreciate it if you let me know about that.
> johannes
>
On Sun, 29 Sep 2019 at 04:36, Johannes Berg <[email protected]> wrote:
>
> Hi,
>
> > int netdev_walk_all_upper_dev_rcu(struct net_device *dev,
> > int (*fn)(struct net_device *dev,
> > void *data),
> > void *data)
> > {
> [...]
> > }
> >
> > return 0;
> > +
> > }
>
> that seems like an oversight, probably from editing the patch in
> different versions?
>
I will fix this in a v5 patch.
> > +static int __netdev_update_upper_level(struct net_device *dev, void *data)
> > +{
> > + dev->upper_level = __netdev_upper_depth(dev) + 1;
> > + return 0;
> > +}
> > +
> > +static int __netdev_update_lower_level(struct net_device *dev, void *data)
> > +{
> > + dev->lower_level = __netdev_lower_depth(dev) + 1;
> > + return 0;
> > +}
>
> Is there any point in the return value here? You don't really use it,
> afaict? I guess I might see the point if it was used for tail-call
> optimisation or such?
>
These functions are used as a callback function of
netdev_walk_all_{upper/lower}_dev(). So these return types are needed.
>
> Also, I dunno, I guess netdevs aren't as much under pressure as SKBs :-)
> but do we actually gain much from storing the nesting level at all? You
> have to maintain it all the time anyway when adding/removing and that's
> the only place where you also check it, so perhaps it wouldn't be that
> bad to just count at that time?
>
> But then again the counting would probably be recursive again ...
>
> > return 0;
> > +
> > }
> > EXPORT_SYMBOL_GPL(netdev_walk_all_lower_dev_rcu);
>
> same nit as above
>
I will fix this in a v5 patch too.
> > + __netdev_update_upper_level(dev, NULL);
> > + netdev_walk_all_lower_dev(dev, __netdev_update_upper_level, NULL);
> > +
> > + __netdev_update_lower_level(upper_dev, NULL);
> > + netdev_walk_all_upper_dev(upper_dev, __netdev_update_lower_level, NULL);
>
> Actually, if I'm reading this correctly you already walk all the levels
> anyway? Then couldn't you calculate the depth at this time and return
> it, instead of storing it? Though, if it actually overflowed then you'd
> have to walk *again* to undo that?
>
> Hmm, actually, if you don't store the value you don't even need to walk
> here I guess, or at least you would only have to do it to verify you
> *can* attach, but wouldn't have to in detach?
>
> So it looks to me like on attach (i.e. this code, quoted from
> __netdev_upper_dev_link) you're already walking the entire graph to
> update the level values, and could probably instead calculate the
> nesting depth to validate it?
> And then on netdev_upper_dev_unlink() you wouldn't even have to walk the
> graph at all, since you only need that to update the values that you
> stored.
>
> But maybe I'm misinterpreting this completely?
>
Without storing level storing, a walking graph routine is needed only
once. The routine would work as a nesting depth validator.
So that the detach routine doesn't need to walk the graph.
Whereas, in this patch, both attach and detach routine need to
walk graph. So, storing nesting variable way is slower than without
storing nesting variable way because of the detach routine's updating
upper and lower level routine.
But I'm sure that storing nesting variables is useful because other
modules already using nesting level values.
Please look at vlan_get_encap_level() and usecases.
If we don't provide nesting level variables, they should calculate
every time when they need it and this way is easier way to get a
nesting level. There are use-cases of lower_level variable
in the 11th patch.
Thank you
> Thanks,
> johannes
>
>
Taehee Yoo <[email protected]> wrote:
>The IFF_BONDING means bonding master or bonding slave device.
>->ndo_add_slave() sets IFF_BONDING flag and ->ndo_del_slave() unsets
>IFF_BONDING flag.
>
>bond0<--bond1
>
>Both bond0 and bond1 are bonding device and these should keep having
>IFF_BONDING flag until they are removed.
>But bond1 would lose IFF_BONDING at ->ndo_del_slave() because that routine
>do not check whether the slave device is the bonding type or not.
>This patch adds the interface type check routine before removing
>IFF_BONDING flag.
>
>Test commands:
> ip link add bond0 type bond
> ip link add bond1 type bond
> ip link set bond1 master bond0
> ip link set bond1 nomaster
> ip link del bond1 type bond
> ip link add bond1 type bond
>
>Splat looks like:
>[ 38.843933] proc_dir_entry 'bonding/bond1' already registered
>[ 38.844741] WARNING: CPU: 1 PID: 631 at fs/proc/generic.c:361 proc_register+0x2a9/0x3e0
>[ 38.845741] Modules linked in: bonding ip_tables x_tables
>[ 38.846432] CPU: 1 PID: 631 Comm: ip Not tainted 5.3.0+ #3
>[ 38.847234] Hardware name: innotek GmbH VirtualBox/VirtualBox, BIOS VirtualBox 12/01/2006
>[ 38.848489] RIP: 0010:proc_register+0x2a9/0x3e0
>[ 38.849164] Code: 89 fa 48 c1 ea 03 80 3c 02 00 0f 85 39 01 00 00 48 8b 04 24 48 89 ea 48 c7 c7 e0 2b 34 b3 48 8b b0 e
>0 00 00 00 e8 c7 b6 89 ff <0f> 0b 48 c7 c7 40 3d c5 b3 e8 99 7a 38 01 48 8b 4c 24 10 48 b8 00
>[ 38.851317] RSP: 0018:ffff888061527078 EFLAGS: 00010282
>[ 38.851902] RAX: dffffc0000000008 RBX: ffff888064dc8cb0 RCX: ffffffffb1d252a2
>[ 38.852684] RDX: 0000000000000001 RSI: 0000000000000008 RDI: ffff88806cbf6b8c
>[ 38.853464] RBP: ffff888064dc8f33 R08: ffffed100d980019 R09: ffffed100d980019
>[ 38.854242] R10: 0000000000000001 R11: ffffed100d980018 R12: ffff888064dc8e48
>[ 38.855929] R13: ffff888064dc8f32 R14: dffffc0000000000 R15: ffffed100c9b91e6
>[ 38.856695] FS: 00007fc9fcc230c0(0000) GS:ffff88806ca00000(0000) knlGS:0000000000000000
>[ 38.857541] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
>[ 38.858150] CR2: 000055948b91c118 CR3: 0000000057110006 CR4: 00000000000606e0
>[ 38.858957] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
>[ 38.859785] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
>[ 38.860700] Call Trace:
>[ 38.861004] proc_create_seq_private+0xb3/0xf0
>[ 38.861460] bond_create_proc_entry+0x1b3/0x3f0 [bonding]
>[ 38.862113] bond_netdev_event+0x433/0x970 [bonding]
>[ 38.862762] ? __module_text_address+0x13/0x140
>[ 38.867678] notifier_call_chain+0x90/0x160
>[ 38.868257] register_netdevice+0x9b3/0xd80
>[ 38.868791] ? alloc_netdev_mqs+0x854/0xc10
>[ 38.869335] ? netdev_change_features+0xa0/0xa0
>[ 38.869852] ? rtnl_create_link+0x2ed/0xad0
>[ 38.870423] bond_newlink+0x2a/0x60 [bonding]
>[ 38.870935] __rtnl_newlink+0xb9f/0x11b0
>[ ... ]
>Fixes: 0b680e753724 ("[PATCH] bonding: Add priv_flag to avoid event mishandling")
>Signed-off-by: Taehee Yoo <[email protected]>
Signed-off-by: Jay Vosburgh <[email protected]>
>---
>
>v2 -> v4 :
> - This patch is not changed
>v1 -> v2 :
> - Do not add a new priv_flag.
>
> drivers/net/bonding/bond_main.c | 6 ++++--
> 1 file changed, 4 insertions(+), 2 deletions(-)
>
>diff --git a/drivers/net/bonding/bond_main.c b/drivers/net/bonding/bond_main.c
>index 931d9d935686..0db12fcfc953 100644
>--- a/drivers/net/bonding/bond_main.c
>+++ b/drivers/net/bonding/bond_main.c
>@@ -1816,7 +1816,8 @@ int bond_enslave(struct net_device *bond_dev, struct net_device *slave_dev,
> slave_disable_netpoll(new_slave);
>
> err_close:
>- slave_dev->priv_flags &= ~IFF_BONDING;
>+ if (!netif_is_bond_master(slave_dev))
>+ slave_dev->priv_flags &= ~IFF_BONDING;
> dev_close(slave_dev);
>
> err_restore_mac:
>@@ -2017,7 +2018,8 @@ static int __bond_release_one(struct net_device *bond_dev,
> else
> dev_set_mtu(slave_dev, slave->original_mtu);
>
>- slave_dev->priv_flags &= ~IFF_BONDING;
>+ if (!netif_is_bond_master(slave_dev))
>+ slave_dev->priv_flags &= ~IFF_BONDING;
>
> bond_free_slave(slave);
>
>--
>2.17.1
>
Hi,
Sorry for the delay.
> These functions are used as a callback function of
> netdev_walk_all_{upper/lower}_dev(). So these return types are needed.
Ah yes, I missed that, sorry.
> Without storing level storing, a walking graph routine is needed only
> once. The routine would work as a nesting depth validator.
> So that the detach routine doesn't need to walk the graph.
> Whereas, in this patch, both attach and detach routine need to
> walk graph. So, storing nesting variable way is slower than without
> storing nesting variable way because of the detach routine's updating
> upper and lower level routine.
Right, that's what I thought.
> But I'm sure that storing nesting variables is useful because other
> modules already using nesting level values.
> Please look at vlan_get_encap_level() and usecases.
Indeed, I noticed that later.
> If we don't provide nesting level variables, they should calculate
> every time when they need it and this way is easier way to get a
> nesting level. There are use-cases of lower_level variable
> in the 11th patch.
Yes, makes sense, agree. One could argue that you only ever need the
"lower_level" stored, not the "upper_level", but I guess that doesn't
really make a difference.
Placing these in a better position in the struct might make sense - a
cursory look suggested that they weren't filling any of the many holes
there, did you pay attention to that or was the placement more or less
random?
johannes
Hi,
> > I didn't see any discussion on this, but perhaps I missed it? The cost
> > would be a bigger netdev struct (when lockdep is enabled), but we
> > already have that for all the VLANs etc. it's just in the private data,
> > so it's not a _huge_ difference really I'd think, and this is quite a
> > bit of code for each device type now.
>
> Actually I agree with your opinion.
> The benefits of this way are to be able to make common helper functions.
> That would reduce duplicate codes and we can maintain this more easily.
> But I'm not sure about the overhead of this way. So I would like to ask
> maintainers and more reviewers about this.
:-)
> Using "struct nested_netdev_lockdep" looks really good.
> I will make common codes such as "struct nested_netdev_lockdep"
> and "netdev_devinit_nested_lockdep" and others in a v5 patch.
That makes *sense*, but it seems to me that for example in virt_wifi we
just missed this part completely, so addressing it in the generic code
would still reduce overall code and complexity?
Actually, looking at net-next, we already have
netdev_lockdep_set_classes() as a macro there that handles all this. I
guess having it as a macro makes some sense so it "evaporates" when
lockdep isn't enabled.
I'd probably try that but maybe somebody else can chime in and say what
they think about applying that to *every* netdev instead though.
> > What's not really clear to me is why the qdisc locks can actually stay
> > the same at all levels? Can they just never nest? But then why are they
> > different per device type?
>
> I didn't test about qdisc so I didn't modify code related to qdisc code.
> If someone reviews this, I would really appreciate.
I didn't really think hard about it when I wrote this ...
But it seems to me the whole nesting also has to be applied here?
__dev_xmit_skb:
* qdisc_run_begin()
* sch_direct_xmit()
* HARD_TX_LOCK(dev, txq, smp_processor_id());
* dev_hard_start_xmit() // say this is VLAN
* dev_queue_xmit() // on real_dev
* __dev_xmit_skb // recursion on another netdev
Now if you have VLAN-in-VLAN the whole thing will recurse right?
johannes
On Sun, 2019-09-29 at 17:31 +0900, Taehee Yoo wrote:
> virt_wifi case is a little bit different case.
Well, arguably, it was also just missing this - it just looks different
:)
> I add the last patch that is to fix refcnt leaks in the virt_wifi module.
> The way to fix this is to add notifier routine.
> The notifier routine could delete lower device before deleting
> virt_wifi device.
> If virt_wifi devices are nested, notifier would work recursively.
> At that time, it would make stack memory overflow.
>
> Actually, before this patch, virt_wifi doesn't have the same problem.
> So, I will update a comment in a v5 patch.
OK, sure.
> Many other devices use this way to avoid wrong nesting configuration.
> And I think it's a good way.
> But we should think about the below configuration.
>
> vlan5
> |
> virt_wifi4
> |
> vlan3
> |
> virt_wifi2
> |
> vlan1
> |
> dummy0
>
> That code wouldn't avoid this configuration.
> And all devices couldn't avoid this config.
Good point, so then really that isn't useful to check - most people
won't try to set it up that way (since it's completely useless) and if
they do anyway too much nesting would be caught by your patchset here.
> I have been considering this case, but I couldn't make a decision yet.
> Maybe common netdev function is needed to find the same device type
> in their graph.
I don't think it's worthwhile just to prevent somebody from making a
configuration that we think now is nonsense. Perhaps they do have some
kind of useful use-case for it ...
> This is a little bit different question for you.
> I found another bug in virt_wifi after my last patch.
> Please test below commands
> ip link add dummy0 type dummy
> ip link add vw1 link dummy0 type virt_wifi
> ip link add vw2 link vw1 type virt_wifi
> modprobe -rv virt_wifi
>
> Then, you can see the warning messages.
> If SET_NETDEV_DEV() is deleted in the virt_wifi_newlink(),
> you can avoid that warning message.
> But I'm not sure about it's safe to remove that.
> I would really appreciate it if you let me know about that.
Hmm, I don't see any warnings. SET_NETDEV_DEV() should be there though.
Do you see the same if you stack it with something else inbetween? If
not, I guess preventing virt_wifi from stacking on top of itself would
be sufficient ...
johannes
On Tue, 1 Oct 2019 at 16:11, Johannes Berg <[email protected]> wrote:
>
> Hi,
>
Hi!
> Sorry for the delay.
>
> > These functions are used as a callback function of
> > netdev_walk_all_{upper/lower}_dev(). So these return types are needed.
>
> Ah yes, I missed that, sorry.
>
> > Without storing level storing, a walking graph routine is needed only
> > once. The routine would work as a nesting depth validator.
> > So that the detach routine doesn't need to walk the graph.
> > Whereas, in this patch, both attach and detach routine need to
> > walk graph. So, storing nesting variable way is slower than without
> > storing nesting variable way because of the detach routine's updating
> > upper and lower level routine.
>
> Right, that's what I thought.
>
> > But I'm sure that storing nesting variables is useful because other
> > modules already using nesting level values.
> > Please look at vlan_get_encap_level() and usecases.
>
> Indeed, I noticed that later.
>
> > If we don't provide nesting level variables, they should calculate
> > every time when they need it and this way is easier way to get a
> > nesting level. There are use-cases of lower_level variable
> > in the 11th patch.
>
> Yes, makes sense, agree. One could argue that you only ever need the
> "lower_level" stored, not the "upper_level", but I guess that doesn't
> really make a difference.
>
> Placing these in a better position in the struct might make sense - a
> cursory look suggested that they weren't filling any of the many holes
> there, did you pay attention to that or was the placement more or less
> random?
>
If I understand correctly, you said about the alignment of
"lower_level" and "upper_level".
I thought this place is a fine position for variables as regards the
alignment and I didn't try to put each variable in different places.
If I misunderstood your mention, please let me know.
Thank you
> johannes
>
Hi,
(jumping out now, forgive me for being so brief)
> If I understand correctly, you said about the alignment of
> "lower_level" and "upper_level".
> I thought this place is a fine position for variables as regards the
> alignment and I didn't try to put each variable in different places.
>
> If I misunderstood your mention, please let me know.
Not sure what you mean, alignment doesn't matter for them (they're u8).
I was thinking of the packing for the overall struct, we have:
unsigned int max_mtu;
unsigned short type;
unsigned short hard_header_len;
unsigned char min_header_len;
+ unsigned char upper_level, lower_level;
unsigned short needed_headroom;
unsigned short needed_tailroom;
Previously, there was a one byte hole at that spot due to a single
"unsigned char" (after something aligned at least 4 bytes) followed by
"unsigned short" - now you push that out a bit.
If you place the variables a bit lower, below "name_assign_type", you
probably fill a hole instead.
Check out the 'pahole' tool.
johannes
On Tue, 1 Oct 2019 at 22:57, Johannes Berg <[email protected]> wrote:
>
> Hi,
>
Hi!
> (jumping out now, forgive me for being so brief)
>
> > If I understand correctly, you said about the alignment of
> > "lower_level" and "upper_level".
> > I thought this place is a fine position for variables as regards the
> > alignment and I didn't try to put each variable in different places.
> >
> > If I misunderstood your mention, please let me know.
>
> Not sure what you mean, alignment doesn't matter for them (they're u8).
>
> I was thinking of the packing for the overall struct, we have:
>
> unsigned int max_mtu;
> unsigned short type;
> unsigned short hard_header_len;
> unsigned char min_header_len;
>
> + unsigned char upper_level, lower_level;
>
> unsigned short needed_headroom;
> unsigned short needed_tailroom;
>
>
> Previously, there was a one byte hole at that spot due to a single
> "unsigned char" (after something aligned at least 4 bytes) followed by
> "unsigned short" - now you push that out a bit.
>
> If you place the variables a bit lower, below "name_assign_type", you
> probably fill a hole instead.
>
> Check out the 'pahole' tool.
>
Thank you for the detailed explanation.
I tested the pahole and found holes.
$ pahole ./vmlinux.o -C net_device
unsigned char addr_assign_type; /* 598 1 */
unsigned char addr_len; /* 599 1 */
short unsigned int neigh_priv_len; /* 600 2 */
short unsigned int dev_id; /* 602 2 */
short unsigned int dev_port; /* 604 2 */
/* XXX 2 bytes hole, try to pack */
I will place the variables here.
> johannes
>
Thank you so much!
Taehee
On Tue, 1 Oct 2019 at 16:25, Johannes Berg <[email protected]> wrote:
>
> Hi,
>
Hi,
> > > I didn't see any discussion on this, but perhaps I missed it? The cost
> > > would be a bigger netdev struct (when lockdep is enabled), but we
> > > already have that for all the VLANs etc. it's just in the private data,
> > > so it's not a _huge_ difference really I'd think, and this is quite a
> > > bit of code for each device type now.
> >
> > Actually I agree with your opinion.
> > The benefits of this way are to be able to make common helper functions.
> > That would reduce duplicate codes and we can maintain this more easily.
> > But I'm not sure about the overhead of this way. So I would like to ask
> > maintainers and more reviewers about this.
>
> :-)
>
> > Using "struct nested_netdev_lockdep" looks really good.
> > I will make common codes such as "struct nested_netdev_lockdep"
> > and "netdev_devinit_nested_lockdep" and others in a v5 patch.
>
> That makes *sense*, but it seems to me that for example in virt_wifi we
> just missed this part completely, so addressing it in the generic code
> would still reduce overall code and complexity?
>
Yes, you're right,
Virt_wifi has the same problem. I will fix this in a v5 patch!
> Actually, looking at net-next, we already have
> netdev_lockdep_set_classes() as a macro there that handles all this. I
> guess having it as a macro makes some sense so it "evaporates" when
> lockdep isn't enabled.
>
>
> I'd probably try that but maybe somebody else can chime in and say what
> they think about applying that to *every* netdev instead though.
>
If we place lockdep keys into "struct net_device", this macro would be a
little bit modified and reused. And driver code shape will not be huge
changed. I think this way is better than this v4 way.
So I will try it.
>
> > > What's not really clear to me is why the qdisc locks can actually stay
> > > the same at all levels? Can they just never nest? But then why are they
> > > different per device type?
> >
> > I didn't test about qdisc so I didn't modify code related to qdisc code.
> > If someone reviews this, I would really appreciate.
>
> I didn't really think hard about it when I wrote this ...
>
> But it seems to me the whole nesting also has to be applied here?
>
> __dev_xmit_skb:
> * qdisc_run_begin()
> * sch_direct_xmit()
> * HARD_TX_LOCK(dev, txq, smp_processor_id());
> * dev_hard_start_xmit() // say this is VLAN
> * dev_queue_xmit() // on real_dev
> * __dev_xmit_skb // recursion on another netdev
>
> Now if you have VLAN-in-VLAN the whole thing will recurse right?
>
I have checked on this routine.
Only xmit_lock(HARD_TX_LOCK) could be nested. other
qdisc locks(runinng, busylock) will not be nested. This patch already
handles the _xmit_lock key. so I think there is no problem.
But I would like to place four lockdep keys(busylock, address,
running, _xmit_lock) into "struct net_device" because of code complexity.
Let me know if I misunderstood anything.
> johannes
>
Thank you,
Taehee
On Tue, 1 Oct 2019 at 16:39, Johannes Berg <[email protected]> wrote:
>
Hi,
> On Sun, 2019-09-29 at 17:31 +0900, Taehee Yoo wrote:
>
> > virt_wifi case is a little bit different case.
>
> Well, arguably, it was also just missing this - it just looks different
> :)
>
> > I add the last patch that is to fix refcnt leaks in the virt_wifi module.
> > The way to fix this is to add notifier routine.
> > The notifier routine could delete lower device before deleting
> > virt_wifi device.
> > If virt_wifi devices are nested, notifier would work recursively.
> > At that time, it would make stack memory overflow.
> >
> > Actually, before this patch, virt_wifi doesn't have the same problem.
> > So, I will update a comment in a v5 patch.
>
> OK, sure.
>
> > Many other devices use this way to avoid wrong nesting configuration.
> > And I think it's a good way.
> > But we should think about the below configuration.
> >
> > vlan5
> > |
> > virt_wifi4
> > |
> > vlan3
> > |
> > virt_wifi2
> > |
> > vlan1
> > |
> > dummy0
> >
> > That code wouldn't avoid this configuration.
> > And all devices couldn't avoid this config.
>
> Good point, so then really that isn't useful to check - most people
> won't try to set it up that way (since it's completely useless) and if
> they do anyway too much nesting would be caught by your patchset here.
>
Yes, Thanks!
> > I have been considering this case, but I couldn't make a decision yet.
> > Maybe common netdev function is needed to find the same device type
> > in their graph.
>
> I don't think it's worthwhile just to prevent somebody from making a
> configuration that we think now is nonsense. Perhaps they do have some
> kind of useful use-case for it ...
>
I agree with your opinion.
> > This is a little bit different question for you.
> > I found another bug in virt_wifi after my last patch.
> > Please test below commands
> > ip link add dummy0 type dummy
> > ip link add vw1 link dummy0 type virt_wifi
> > ip link add vw2 link vw1 type virt_wifi
> > modprobe -rv virt_wifi
> >
> > Then, you can see the warning messages.
> > If SET_NETDEV_DEV() is deleted in the virt_wifi_newlink(),
> > you can avoid that warning message.
> > But I'm not sure about it's safe to remove that.
> > I would really appreciate it if you let me know about that.
>
> Hmm, I don't see any warnings. SET_NETDEV_DEV() should be there though.
Okay, thanks. I will do not remove SET_NETDEV_DEV() in a v5 patch.
> Do you see the same if you stack it with something else inbetween? If
> not, I guess preventing virt_wifi from stacking on top of itself would
> be sufficient ...
>
Yes, the below test commands will make warning messages.
So, I will add a new patch for this without removing SET_NETDEV_DEV().
Reproducer :
ip link add dummy0 type dummy
ip link add vw1 link dummy0 type virt_wifi
ip link add vlan2 link vw1 type vlan id 1
ip link add vw3 link vlan2 type virt_wifi
modprobe -rv virt_wifi
Messages:
[12734.236946] sysfs group 'byte_queue_limits' not found for kobject 'tx-0'
[12734.238862] WARNING: CPU: 1 PID: 19710 at fs/sysfs/group.c:280
sysfs_remove_group+0x11b/0x170
[ ... ]
12734.256132] Call Trace:
[12734.256430] netdev_queue_update_kobjects+0x1f5/0x340
[12734.257025] netdev_unregister_kobject+0x142/0x1d0
[12734.257580] rollback_registered_many+0x618/0xc80
[12734.258175] ? notifier_call_chain+0x90/0x160
[12734.258688] ? generic_xdp_install+0x310/0x310
[12734.259208] ? netdev_upper_dev_unlink+0x114/0x180
[12734.259791] unregister_netdevice_many.part.126+0x13/0x1b0
[12734.260434] __rtnl_link_unregister+0x156/0x320
[12734.260967] ? rtnl_unregister_all+0x120/0x120
[ ... ]
[12734.283395] sysfs group 'power' not found for kobject 'vw3'
[12734.284081] WARNING: CPU: 1 PID: 19710 at fs/sysfs/group.c:280
sysfs_remove_group+0x11b/0x170
[ ... ]
[12734.337509] sysfs group 'statistics' not found for kobject 'vw3'
[12734.338375] WARNING: CPU: 1 PID: 19710 at fs/sysfs/group.c:280
sysfs_remove_group+0x11b/0x170
[ ... ]
[12734.391687] sysfs group 'wireless' not found for kobject 'vw3'
[12734.392525] WARNING: CPU: 1 PID: 19710 at fs/sysfs/group.c:280
sysfs_remove_group+0x11b/0x170
[ ... ]
> johannes
>
Thanks,
Taehee Yoo
2019-09-28, 16:48:43 +0000, Taehee Yoo wrote:
> virt_wifi_newlink() calls netdev_upper_dev_link() and it internally
> holds reference count of lower interface.
>
> Current code does not release a reference count of the lower interface
> when the lower interface is being deleted.
> So, reference count leaks occur.
>
> Test commands:
> ip link add dummy0 type dummy
> ip link add vw1 link dummy0 type virt_wifi
There should also be "ip link del dummy0" in this reproducer, right?
[...]
> @@ -598,14 +634,24 @@ static int __init virt_wifi_init_module(void)
> /* Guaranteed to be locallly-administered and not multicast. */
> eth_random_addr(fake_router_bssid);
>
> + err = register_netdevice_notifier(&virt_wifi_notifier);
> + if (err)
> + return err;
> +
Here err is 0.
> common_wiphy = virt_wifi_make_wiphy();
> if (!common_wiphy)
> - return -ENOMEM;
> + goto notifier;
err is still 0 when we jump...
> err = rtnl_link_register(&virt_wifi_link_ops);
> if (err)
> - virt_wifi_destroy_wiphy(common_wiphy);
> + goto destroy_wiphy;
>
> + return 0;
> +
> +destroy_wiphy:
> + virt_wifi_destroy_wiphy(common_wiphy);
> +notifier:
> + unregister_netdevice_notifier(&virt_wifi_notifier);
> return err;
> }
... so now we return 0 on failure. Can you add an "err = -ENOMEM"
before "common_wiphy = ..."?
Thanks.
--
Sabrina
On Sat, 2019-10-05 at 18:13 +0900, Taehee Yoo wrote:
>
> If we place lockdep keys into "struct net_device", this macro would be a
> little bit modified and reused. And driver code shape will not be huge
> changed. I think this way is better than this v4 way.
> So I will try it.
What I was thinking was that if we can do this for every VLAN netdev,
why shouldn't we do it for *every* netdev unconditionally? Some code
could perhaps even be simplified if this was just a general part of
netdev allocation.
> > But it seems to me the whole nesting also has to be applied here?
> >
> > __dev_xmit_skb:
> > * qdisc_run_begin()
> > * sch_direct_xmit()
> > * HARD_TX_LOCK(dev, txq, smp_processor_id());
> > * dev_hard_start_xmit() // say this is VLAN
> > * dev_queue_xmit() // on real_dev
> > * __dev_xmit_skb // recursion on another netdev
> >
> > Now if you have VLAN-in-VLAN the whole thing will recurse right?
> >
>
> I have checked on this routine.
> Only xmit_lock(HARD_TX_LOCK) could be nested. other
> qdisc locks(runinng, busylock) will not be nested.
OK, I still didn't check it too closely I guess, or got confused which
lock I should look at.
> This patch already
> handles the _xmit_lock key. so I think there is no problem.
Right
> But I would like to place four lockdep keys(busylock, address,
> running, _xmit_lock) into "struct net_device" because of code complexity.
>
> Let me know if I misunderstood anything.
Nothing to misunderstand - I was just asking/wondering why the qdisc
locks were not treated the same way.
johannes
On Mon, 7 Oct 2019 at 20:22, Sabrina Dubroca <[email protected]> wrote:
>
Hi Sabrina,
Thank you for the review!
> 2019-09-28, 16:48:43 +0000, Taehee Yoo wrote:
> > virt_wifi_newlink() calls netdev_upper_dev_link() and it internally
> > holds reference count of lower interface.
> >
> > Current code does not release a reference count of the lower interface
> > when the lower interface is being deleted.
> > So, reference count leaks occur.
> >
> > Test commands:
> > ip link add dummy0 type dummy
> > ip link add vw1 link dummy0 type virt_wifi
>
> There should also be "ip link del dummy0" in this reproducer, right?
>
> [...]
>
> > @@ -598,14 +634,24 @@ static int __init virt_wifi_init_module(void)
> > /* Guaranteed to be locallly-administered and not multicast. */
> > eth_random_addr(fake_router_bssid);
> >
> > + err = register_netdevice_notifier(&virt_wifi_notifier);
> > + if (err)
> > + return err;
> > +
>
> Here err is 0.
>
> > common_wiphy = virt_wifi_make_wiphy();
> > if (!common_wiphy)
> > - return -ENOMEM;
> > + goto notifier;
>
> err is still 0 when we jump...
>
> > err = rtnl_link_register(&virt_wifi_link_ops);
> > if (err)
> > - virt_wifi_destroy_wiphy(common_wiphy);
> > + goto destroy_wiphy;
> >
> > + return 0;
> > +
> > +destroy_wiphy:
> > + virt_wifi_destroy_wiphy(common_wiphy);
> > +notifier:
> > + unregister_netdevice_notifier(&virt_wifi_notifier);
> > return err;
> > }
>
> ... so now we return 0 on failure. Can you add an "err = -ENOMEM"
> before "common_wiphy = ..."?
>
You're right, I will fix this in a v5 patch!
Thanks!
> Thanks.
>
> --
> Sabrina
On Mon, 7 Oct 2019 at 20:41, Johannes Berg <[email protected]> wrote:
>
Hi Johannes,
> On Sat, 2019-10-05 at 18:13 +0900, Taehee Yoo wrote:
> >
> > If we place lockdep keys into "struct net_device", this macro would be a
> > little bit modified and reused. And driver code shape will not be huge
> > changed. I think this way is better than this v4 way.
> > So I will try it.
>
> What I was thinking was that if we can do this for every VLAN netdev,
> why shouldn't we do it for *every* netdev unconditionally? Some code
> could perhaps even be simplified if this was just a general part of
> netdev allocation.
>
Your opinion makes sense.
I think there is no critical reason that every netdev shouldn't have
own lockdep keys. By comparison, the benefits are obvious.
> > > But it seems to me the whole nesting also has to be applied here?
> > >
> > > __dev_xmit_skb:
> > > * qdisc_run_begin()
> > > * sch_direct_xmit()
> > > * HARD_TX_LOCK(dev, txq, smp_processor_id());
> > > * dev_hard_start_xmit() // say this is VLAN
> > > * dev_queue_xmit() // on real_dev
> > > * __dev_xmit_skb // recursion on another netdev
> > >
> > > Now if you have VLAN-in-VLAN the whole thing will recurse right?
> > >
> >
> > I have checked on this routine.
> > Only xmit_lock(HARD_TX_LOCK) could be nested. other
> > qdisc locks(runinng, busylock) will not be nested.
>
> OK, I still didn't check it too closely I guess, or got confused which
> lock I should look at.
>
> > This patch already
> > handles the _xmit_lock key. so I think there is no problem.
>
> Right
>
> > But I would like to place four lockdep keys(busylock, address,
> > running, _xmit_lock) into "struct net_device" because of code complexity.
> >
> > Let me know if I misunderstood anything.
>
> Nothing to misunderstand - I was just asking/wondering why the qdisc
> locks were not treated the same way.
>
I'm always thankful for your detailed and careful reviews.
> johannes
>
Thank you,
Taehee Yoo
2019-09-28, 16:48:32 +0000, Taehee Yoo wrote:
> @@ -6790,23 +6878,45 @@ int netdev_walk_all_lower_dev(struct net_device *dev,
> void *data),
> void *data)
> {
> - struct net_device *ldev;
> - struct list_head *iter;
> - int ret;
> + struct net_device *ldev, *next, *now, *dev_stack[MAX_NEST_DEV + 1];
> + struct list_head *niter, *iter, *iter_stack[MAX_NEST_DEV + 1];
> + int ret, cur = 0;
>
> - for (iter = &dev->adj_list.lower,
> - ldev = netdev_next_lower_dev(dev, &iter);
> - ldev;
> - ldev = netdev_next_lower_dev(dev, &iter)) {
> - /* first is the lower device itself */
> - ret = fn(ldev, data);
> - if (ret)
> - return ret;
> + now = dev;
> + iter = &dev->adj_list.lower;
>
> - /* then look at all of its lower devices */
> - ret = netdev_walk_all_lower_dev(ldev, fn, data);
> - if (ret)
> - return ret;
> + while (1) {
> + if (now != dev) {
> + ret = fn(now, data);
> + if (ret)
> + return ret;
> + }
> +
> + next = NULL;
> + while (1) {
> + ldev = netdev_next_lower_dev(now, &iter);
> + if (!ldev)
> + break;
> +
> + if (!next) {
> + next = ldev;
> + niter = &ldev->adj_list.lower;
> + } else {
> + dev_stack[cur] = ldev;
> + iter_stack[cur++] = &ldev->adj_list.lower;
> + break;
> + }
> + }
> +
> + if (!next) {
> + if (!cur)
> + return 0;
Hmm, I don't think this condition is correct.
If we have this topology:
bridge0
/ | \
/ | \
/ | \
dummy0 vlan1 vlan2
| \
dummy1 dummy2
We end up with the expected lower/upper levels for all devices:
| device | upper | lower |
|---------+-------+-------|
| dummy0 | 2 | 1 |
| dummy1 | 3 | 1 |
| dummy2 | 3 | 1 |
| vlan1 | 2 | 2 |
| vlan2 | 2 | 2 |
| bridge0 | 1 | 3 |
If we then add macvlan0 on top of bridge0:
macvlan0
|
|
bridge0
/ | \
/ | \
/ | \
dummy0 vlan1 vlan2
| \
dummy1 dummy2
we can observe that __netdev_update_upper_level is only called for
some of the devices under bridge0. I added a perf probe:
# perf probe -a '__netdev_update_upper_level dev->name:string'
which gets hit for bridge0 (called directly by
__netdev_upper_dev_link) and then dummy0, vlan1, dummy1. It is never
called for vlan2 and dummy2.
After this, we have the following levels (*):
| device | upper | lower |
|----------+-------+-------|
| dummy0 | 3 | 1 |
| dummy1 | 4 | 1 |
| dummy2 | 3 | 1 |
| vlan1 | 3 | 2 |
| vlan2 | 2 | 2 |
| bridge0 | 2 | 3 |
| macvlan0 | 1 | 4 |
For dummy0, dummy1, vlan1, the upper level has increased by 1, as
expected. For dummy2 and vlan2, it's still the same, which is wrong.
(*) observed easily by adding another probe:
# perf probe -a 'dev_get_stats dev->name:string dev->upper_level dev->lower_level'
and running "ip link"
Or you can just add prints and recompile, of course :)
> + next = dev_stack[--cur];
> + niter = iter_stack[cur];
> + }
> +
> + now = next;
> + iter = niter;
> }
>
> return 0;
--
Sabrina
On Thu, 10 Oct 2019 at 19:19, Sabrina Dubroca <[email protected]> wrote:
>
Hi Sabrina,
Thank you for review and testing!
> 2019-09-28, 16:48:32 +0000, Taehee Yoo wrote:
> > @@ -6790,23 +6878,45 @@ int netdev_walk_all_lower_dev(struct net_device *dev,
> > void *data),
> > void *data)
> > {
> > - struct net_device *ldev;
> > - struct list_head *iter;
> > - int ret;
> > + struct net_device *ldev, *next, *now, *dev_stack[MAX_NEST_DEV + 1];
> > + struct list_head *niter, *iter, *iter_stack[MAX_NEST_DEV + 1];
> > + int ret, cur = 0;
> >
> > - for (iter = &dev->adj_list.lower,
> > - ldev = netdev_next_lower_dev(dev, &iter);
> > - ldev;
> > - ldev = netdev_next_lower_dev(dev, &iter)) {
> > - /* first is the lower device itself */
> > - ret = fn(ldev, data);
> > - if (ret)
> > - return ret;
> > + now = dev;
> > + iter = &dev->adj_list.lower;
> >
> > - /* then look at all of its lower devices */
> > - ret = netdev_walk_all_lower_dev(ldev, fn, data);
> > - if (ret)
> > - return ret;
> > + while (1) {
> > + if (now != dev) {
> > + ret = fn(now, data);
> > + if (ret)
> > + return ret;
> > + }
> > +
> > + next = NULL;
> > + while (1) {
> > + ldev = netdev_next_lower_dev(now, &iter);
> > + if (!ldev)
> > + break;
> > +
> > + if (!next) {
> > + next = ldev;
> > + niter = &ldev->adj_list.lower;
> > + } else {
> > + dev_stack[cur] = ldev;
> > + iter_stack[cur++] = &ldev->adj_list.lower;
> > + break;
> > + }
> > + }
> > +
> > + if (!next) {
> > + if (!cur)
> > + return 0;
>
> Hmm, I don't think this condition is correct.
>
> If we have this topology:
>
>
> bridge0
> / | \
> / | \
> / | \
> dummy0 vlan1 vlan2
> | \
> dummy1 dummy2
>
> We end up with the expected lower/upper levels for all devices:
>
> | device | upper | lower |
> |---------+-------+-------|
> | dummy0 | 2 | 1 |
> | dummy1 | 3 | 1 |
> | dummy2 | 3 | 1 |
> | vlan1 | 2 | 2 |
> | vlan2 | 2 | 2 |
> | bridge0 | 1 | 3 |
>
>
> If we then add macvlan0 on top of bridge0:
>
>
> macvlan0
> |
> |
> bridge0
> / | \
> / | \
> / | \
> dummy0 vlan1 vlan2
> | \
> dummy1 dummy2
>
>
> we can observe that __netdev_update_upper_level is only called for
> some of the devices under bridge0. I added a perf probe:
>
> # perf probe -a '__netdev_update_upper_level dev->name:string'
>
> which gets hit for bridge0 (called directly by
> __netdev_upper_dev_link) and then dummy0, vlan1, dummy1. It is never
> called for vlan2 and dummy2.
>
> After this, we have the following levels (*):
>
> | device | upper | lower |
> |----------+-------+-------|
> | dummy0 | 3 | 1 |
> | dummy1 | 4 | 1 |
> | dummy2 | 3 | 1 |
> | vlan1 | 3 | 2 |
> | vlan2 | 2 | 2 |
> | bridge0 | 2 | 3 |
> | macvlan0 | 1 | 4 |
>
> For dummy0, dummy1, vlan1, the upper level has increased by 1, as
> expected. For dummy2 and vlan2, it's still the same, which is wrong.
>
>
> (*) observed easily by adding another probe:
>
> # perf probe -a 'dev_get_stats dev->name:string dev->upper_level dev->lower_level'
>
> and running "ip link"
>
> Or you can just add prints and recompile, of course :)
>
Thank you so much, I found a bug very easily with your test config.
I will fix this bug in a v5 patch.
> > + next = dev_stack[--cur];
> > + niter = iter_stack[cur];
> > + }
> > +
> > + now = next;
> > + iter = niter;
> > }
> >
> > return 0;
>
> --
> Sabrina
Thank you,
Taehee Yoo
On Mon, 7 Oct 2019 at 20:41, Johannes Berg <[email protected]> wrote:
>
Hi Johannes,
> On Sat, 2019-10-05 at 18:13 +0900, Taehee Yoo wrote:
> >
> > If we place lockdep keys into "struct net_device", this macro would be a
> > little bit modified and reused. And driver code shape will not be huge
> > changed. I think this way is better than this v4 way.
> > So I will try it.
>
> What I was thinking was that if we can do this for every VLAN netdev,
> why shouldn't we do it for *every* netdev unconditionally? Some code
> could perhaps even be simplified if this was just a general part of
> netdev allocation.
>
> > > But it seems to me the whole nesting also has to be applied here?
> > >
> > > __dev_xmit_skb:
> > > * qdisc_run_begin()
> > > * sch_direct_xmit()
> > > * HARD_TX_LOCK(dev, txq, smp_processor_id());
> > > * dev_hard_start_xmit() // say this is VLAN
> > > * dev_queue_xmit() // on real_dev
> > > * __dev_xmit_skb // recursion on another netdev
> > >
> > > Now if you have VLAN-in-VLAN the whole thing will recurse right?
> > >
> >
> > I have checked on this routine.
> > Only xmit_lock(HARD_TX_LOCK) could be nested. other
> > qdisc locks(runinng, busylock) will not be nested.
>
"I have checked on this routine.
Only xmit_lock(HARD_TX_LOCK) could be nested. other
qdisc locks(runinng, busylock) will not be nested."
I'm so sorry, I think it's not true.
running lock could be nested.
But lockdep warning doesn't occur because of below code.
seqcount_acquire(&qdisc->running.dep_map, 0, 1, _RET_IP_);
The third argument means trylock.
If trylock is set, lockdep doesn't make lockdep chain.
So, running could be nested but lockdep warning doesn't occur even
these have the same lockdep key.
You can check on /proc/lockdep and /proc/lockdep_chain
> OK, I still didn't check it too closely I guess, or got confused which
> lock I should look at.
>
> > This patch already
> > handles the _xmit_lock key. so I think there is no problem.
>
> Right
>
> > But I would like to place four lockdep keys(busylock, address,
> > running, _xmit_lock) into "struct net_device" because of code complexity.
> >
> > Let me know if I misunderstood anything.
>
> Nothing to misunderstand - I was just asking/wondering why the qdisc
> locks were not treated the same way.
>
> johannes
>
Thank you
Taehee