2012-05-22 09:50:37

by kun.jiang

[permalink] [raw]
Subject: [PATCH] ipv4: fix the rcu race between free_fib_info and ip_route_output_slow

From: Yanmin Zhang <[email protected]>

We hit a kernel OOPS.

<3>[23898.789643] BUG: sleeping function called from invalid context at
/data/buildbot/workdir/ics/hardware/intel/linux-2.6/arch/x86/mm/fault.c:1103
<3>[23898.862215] in_atomic(): 0, irqs_disabled(): 0, pid: 10526, name:
Thread-6683
<4>[23898.967805] HSU serial 0000:00:05.1: 0000:00:05.2:HSU serial prevented me
to suspend...
<4>[23899.258526] Pid: 10526, comm: Thread-6683 Tainted: G W
3.0.8-137685-ge7742f9 #1
<4>[23899.357404] HSU serial 0000:00:05.1: 0000:00:05.2:HSU serial prevented me
to suspend...
<4>[23899.904225] Call Trace:
<4>[23899.989209] [<c1227f50>] ? pgtable_bad+0x130/0x130
<4>[23900.000416] [<c1238c2a>] __might_sleep+0x10a/0x110
<4>[23900.007357] [<c1228021>] do_page_fault+0xd1/0x3c0
<4>[23900.013764] [<c18e9ba9>] ? restore_all+0xf/0xf
<4>[23900.024024] [<c17c007b>] ? napi_complete+0x8b/0x690
<4>[23900.029297] [<c1227f50>] ? pgtable_bad+0x130/0x130
<4>[23900.123739] [<c1227f50>] ? pgtable_bad+0x130/0x130
<4>[23900.128955] [<c18ea0c3>] error_code+0x5f/0x64
<4>[23900.133466] [<c1227f50>] ? pgtable_bad+0x130/0x130
<4>[23900.138450] [<c17f6298>] ? __ip_route_output_key+0x698/0x7c0
<4>[23900.144312] [<c17f5f8d>] ? __ip_route_output_key+0x38d/0x7c0
<4>[23900.150730] [<c17f63df>] ip_route_output_flow+0x1f/0x60
<4>[23900.156261] [<c181de58>] ip4_datagram_connect+0x188/0x2b0
<4>[23900.161960] [<c18e981f>] ? _raw_spin_unlock_bh+0x1f/0x30
<4>[23900.167834] [<c18298d6>] inet_dgram_connect+0x36/0x80
<4>[23900.173224] [<c14f9e88>] ? _copy_from_user+0x48/0x140
<4>[23900.178817] [<c17ab9da>] sys_connect+0x9a/0xd0
<4>[23900.183538] [<c132e93c>] ? alloc_file+0xdc/0x240
<4>[23900.189111] [<c123925d>] ? sub_preempt_count+0x3d/0x50

In function free_fib_info, we don't reset nexthop_nh->nh_dev to NULL before releasing
nh_dev. kfree_rcu(fi, rcu) would release the whole area.

Signed-off-by: Yanmin Zhang <[email protected]>
Signed-off-by: Kun Jiang <[email protected]>
---
net/ipv4/fib_semantics.c | 1 -
1 files changed, 0 insertions(+), 1 deletions(-)

diff --git a/net/ipv4/fib_semantics.c b/net/ipv4/fib_semantics.c
index 5063fa3..68ea013 100644
--- a/net/ipv4/fib_semantics.c
+++ b/net/ipv4/fib_semantics.c
@@ -159,7 +159,6 @@ void free_fib_info(struct fib_info *fi)
change_nexthops(fi) {
if (nexthop_nh->nh_dev)
dev_put(nexthop_nh->nh_dev);
- nexthop_nh->nh_dev = NULL;
} endfor_nexthops(fi);
fib_info_cnt--;
release_net(fi->fib_net);
--
1.7.1


2012-05-22 19:15:59

by David Miller

[permalink] [raw]
Subject: Re: [PATCH] ipv4: fix the rcu race between free_fib_info and ip_route_output_slow

From: "kun.jiang" <[email protected]>
Date: Tue, 22 May 2012 17:48:53 +0800

> From: Yanmin Zhang <[email protected]>
>
> We hit a kernel OOPS.
...
> In function free_fib_info, we don't reset nexthop_nh->nh_dev to NULL before releasing
> nh_dev. kfree_rcu(fi, rcu) would release the whole area.
>
> Signed-off-by: Yanmin Zhang <[email protected]>
> Signed-off-by: Kun Jiang <[email protected]>

This isn't a fix. You're keeping around a pointer to a completely
released object, which is therefore illegal to dereference.

That's why we must set it to NULL, to catch such illegal accesses.

2012-05-23 03:01:14

by Yanmin Zhang

[permalink] [raw]
Subject: Re: [PATCH] ipv4: fix the rcu race between free_fib_info and ip_route_output_slow

On Tue, 2012-05-22 at 15:15 -0400, David Miller wrote:
> From: "kun.jiang" <[email protected]>
> Date: Tue, 22 May 2012 17:48:53 +0800
>
> > From: Yanmin Zhang <[email protected]>
> >
> > We hit a kernel OOPS.
> ...
> > In function free_fib_info, we don't reset nexthop_nh->nh_dev to NULL before releasing
> > nh_dev. kfree_rcu(fi, rcu) would release the whole area.
> >
> > Signed-off-by: Yanmin Zhang <[email protected]>
> > Signed-off-by: Kun Jiang <[email protected]>
>
> This isn't a fix. You're keeping around a pointer to a completely
> released object, which is therefore illegal to dereference.
>
> That's why we must set it to NULL, to catch such illegal accesses.
Thanks for the comments. The network stack in kernel is complicated.

Questions:
1) Why does free_fib_info call call_rcu instead of releasing fi directly?
I assume other cpu might be accessing it. nexthop_nh->nh_dev is in fi.
If other cpu are accessing it, here resetting to NULL would cause other
cpu panic.

void free_fib_info(struct fib_info *fi)
{
if (fi->fib_dead == 0) {
pr_warn("Freeing alive fib_info %p\n", fi);
return;
}
change_nexthops(fi) {
if (nexthop_nh->nh_dev)
dev_put(nexthop_nh->nh_dev);
nexthop_nh->nh_dev = NULL;
} endfor_nexthops(fi);
fib_info_cnt--;
release_net(fi->fib_net);
call_rcu(&fi->rcu, free_fib_info_rcu); <=====rcu
}
2) dev_put is to decrease the counter and doesn't release nh_dev.

If 2) is incorrect, how about below solution? It delays the resetting
in rcu.

================

We hit a kernel OOPS.

<3>[23898.789643] BUG: sleeping function called from invalid context at
/data/buildbot/workdir/ics/hardware/intel/linux-2.6/arch/x86/mm/fault.c:1103
<3>[23898.862215] in_atomic(): 0, irqs_disabled(): 0, pid: 10526, name:
Thread-6683
<4>[23898.967805] HSU serial 0000:00:05.1: 0000:00:05.2:HSU serial prevented me
to suspend...
<4>[23899.258526] Pid: 10526, comm: Thread-6683 Tainted: G W
3.0.8-137685-ge7742f9 #1
<4>[23899.357404] HSU serial 0000:00:05.1: 0000:00:05.2:HSU serial prevented me
to suspend...
<4>[23899.904225] Call Trace:
<4>[23899.989209] [<c1227f50>] ? pgtable_bad+0x130/0x130
<4>[23900.000416] [<c1238c2a>] __might_sleep+0x10a/0x110
<4>[23900.007357] [<c1228021>] do_page_fault+0xd1/0x3c0
<4>[23900.013764] [<c18e9ba9>] ? restore_all+0xf/0xf
<4>[23900.024024] [<c17c007b>] ? napi_complete+0x8b/0x690
<4>[23900.029297] [<c1227f50>] ? pgtable_bad+0x130/0x130
<4>[23900.123739] [<c1227f50>] ? pgtable_bad+0x130/0x130
<4>[23900.128955] [<c18ea0c3>] error_code+0x5f/0x64
<4>[23900.133466] [<c1227f50>] ? pgtable_bad+0x130/0x130
<4>[23900.138450] [<c17f6298>] ? __ip_route_output_key+0x698/0x7c0
<4>[23900.144312] [<c17f5f8d>] ? __ip_route_output_key+0x38d/0x7c0
<4>[23900.150730] [<c17f63df>] ip_route_output_flow+0x1f/0x60
<4>[23900.156261] [<c181de58>] ip4_datagram_connect+0x188/0x2b0
<4>[23900.161960] [<c18e981f>] ? _raw_spin_unlock_bh+0x1f/0x30
<4>[23900.167834] [<c18298d6>] inet_dgram_connect+0x36/0x80
<4>[23900.173224] [<c14f9e88>] ? _copy_from_user+0x48/0x140
<4>[23900.178817] [<c17ab9da>] sys_connect+0x9a/0xd0
<4>[23900.183538] [<c132e93c>] ? alloc_file+0xdc/0x240
<4>[23900.189111] [<c123925d>] ? sub_preempt_count+0x3d/0x50

Function free_fib_info resets nexthop_nh->nh_dev to NULL before releasing
fi. Other cpu might be accessing fi. Fixing it by delaying the resetting.

Signed-off-by: Yanmin Zhang <[email protected]>
Signed-off-by: Kun Jiang <[email protected]>
---
net/ipv4/fib_semantics.c | 15 ++++++++-------
1 files changed, 8 insertions(+), 7 deletions(-)

diff --git a/net/ipv4/fib_semantics.c b/net/ipv4/fib_semantics.c
index 5063fa3..56d8a97 100644
--- a/net/ipv4/fib_semantics.c
+++ b/net/ipv4/fib_semantics.c
@@ -145,6 +145,14 @@ static void free_fib_info_rcu(struct rcu_head *head)
{
struct fib_info *fi = container_of(head, struct fib_info, rcu);

+ change_nexthops(fi) {
+ if (nexthop_nh->nh_dev)
+ dev_put(nexthop_nh->nh_dev);
+ nexthop_nh->nh_dev = NULL;
+ } endfor_nexthops(fi);
+
+ fib_info_cnt--;
+ release_net(fi->fib_net);
if (fi->fib_metrics != (u32 *) dst_default_metrics)
kfree(fi->fib_metrics);
kfree(fi);
@@ -156,13 +164,6 @@ void free_fib_info(struct fib_info *fi)
pr_warn("Freeing alive fib_info %p\n", fi);
return;
}
- change_nexthops(fi) {
- if (nexthop_nh->nh_dev)
- dev_put(nexthop_nh->nh_dev);
- nexthop_nh->nh_dev = NULL;
- } endfor_nexthops(fi);
- fib_info_cnt--;
- release_net(fi->fib_net);
call_rcu(&fi->rcu, free_fib_info_rcu);
}

--
1.7.5.4


2012-05-23 03:23:17

by David Miller

[permalink] [raw]
Subject: Re: [PATCH] ipv4: fix the rcu race between free_fib_info and ip_route_output_slow

From: Yanmin Zhang <[email protected]>
Date: Wed, 23 May 2012 11:02:03 +0800

> 1) Why does free_fib_info call call_rcu instead of releasing fi directly?
> I assume other cpu might be accessing it. nexthop_nh->nh_dev is in fi.
> If other cpu are accessing it, here resetting to NULL would cause other
> cpu panic.

Because fib trie lookups are done with RCU locking, therefore we must
use RCU freeing to release the object.

What I was trying to impart to you is that removing the NULL
assignment is wrong and that an alternative fix is warranted (hint:
consider moving something into the RCU release).

2012-05-23 03:29:52

by Yanmin Zhang

[permalink] [raw]
Subject: Re: [PATCH] ipv4: fix the rcu race between free_fib_info and ip_route_output_slow

On Tue, 2012-05-22 at 23:23 -0400, David Miller wrote:
> From: Yanmin Zhang <[email protected]>
> Date: Wed, 23 May 2012 11:02:03 +0800
>
> > 1) Why does free_fib_info call call_rcu instead of releasing fi directly?
> > I assume other cpu might be accessing it. nexthop_nh->nh_dev is in fi.
> > If other cpu are accessing it, here resetting to NULL would cause other
> > cpu panic.
>
> Because fib trie lookups are done with RCU locking, therefore we must
> use RCU freeing to release the object.
>
> What I was trying to impart to you is that removing the NULL
> assignment is wrong and that an alternative fix is warranted (hint:
> consider moving something into the RCU release).
Thanks for the explanation.

How about the new patch posted in the end of previous reply? It does move the
the resetting to RCU release.
https://lkml.org/lkml/2012/5/22/558?

2012-05-23 03:49:27

by David Miller

[permalink] [raw]
Subject: Re: [PATCH] ipv4: fix the rcu race between free_fib_info and ip_route_output_slow

From: Yanmin Zhang <[email protected]>
Date: Wed, 23 May 2012 11:30:41 +0800

> How about the new patch posted in the end of previous reply? It does move the
> the resetting to RCU release.
> https://lkml.org/lkml/2012/5/22/558?

If you test it and submit it formally I'll probably apply it.

2012-05-23 04:37:18

by Eric Dumazet

[permalink] [raw]
Subject: Re: [PATCH] ipv4: fix the rcu race between free_fib_info and ip_route_output_slow

On Tue, 2012-05-22 at 23:23 -0400, David Miller wrote:
> From: Yanmin Zhang <[email protected]>
> Date: Wed, 23 May 2012 11:02:03 +0800
>
> > 1) Why does free_fib_info call call_rcu instead of releasing fi directly?
> > I assume other cpu might be accessing it. nexthop_nh->nh_dev is in fi.
> > If other cpu are accessing it, here resetting to NULL would cause other
> > cpu panic.
>
> Because fib trie lookups are done with RCU locking, therefore we must
> use RCU freeing to release the object.
>
> What I was trying to impart to you is that removing the NULL
> assignment is wrong and that an alternative fix is warranted (hint:
> consider moving something into the RCU release).
> --

Its more than that I'm afraid.

fi->fib_dev (aka fib_nh[0].nh_dev) need full RCU protection.

Also the "fib_info_cnt--;" must stay in free_fib_info() (so that it is
protected by RTNL)


2012-05-23 04:40:20

by Yanmin Zhang

[permalink] [raw]
Subject: Re: [PATCH] ipv4: fix the rcu race between free_fib_info and ip_route_output_slow

On Tue, 2012-05-22 at 23:49 -0400, David Miller wrote:
> From: Yanmin Zhang <[email protected]>
> Date: Wed, 23 May 2012 11:30:41 +0800
>
> > How about the new patch posted in the end of previous reply? It does move the
> > the resetting to RCU release.
> > https://lkml.org/lkml/2012/5/22/558?
>
> If you test it and submit it formally I'll probably apply it.
We did test it. But it's hard to reproduce it. We hit it the issue
for a couple of times when running MTBF testing on Android smart phone.
Mostly, it happens after MTBF runs for more than 12 hours. To releasing
the product, MTBF testing should run for hundreds of hours. This bug
blocks it.

We would submit it formally and also run more testing.

Sorry for taking you too much time.

2012-05-23 04:54:09

by Yanmin Zhang

[permalink] [raw]
Subject: Re: [PATCH] ipv4: fix the rcu race between free_fib_info and ip_route_output_slow

On Wed, 2012-05-23 at 06:37 +0200, Eric Dumazet wrote:
> On Tue, 2012-05-22 at 23:23 -0400, David Miller wrote:
> > From: Yanmin Zhang <[email protected]>
> > Date: Wed, 23 May 2012 11:02:03 +0800
> >
> > > 1) Why does free_fib_info call call_rcu instead of releasing fi directly?
> > > I assume other cpu might be accessing it. nexthop_nh->nh_dev is in fi.
> > > If other cpu are accessing it, here resetting to NULL would cause other
> > > cpu panic.
> >
> > Because fib trie lookups are done with RCU locking, therefore we must
> > use RCU freeing to release the object.
> >
> > What I was trying to impart to you is that removing the NULL
> > assignment is wrong and that an alternative fix is warranted (hint:
> > consider moving something into the RCU release).
> > --
>
> Its more than that I'm afraid.
>
> fi->fib_dev (aka fib_nh[0].nh_dev) need full RCU protection.
The new patch posted at https://lkml.org/lkml/2012/5/22/558 does move the
resetting to RCU protection.

>
> Also the "fib_info_cnt--;" must stay in free_fib_info() (so that it is
> protected by RTNL)
We would move "fib_info_cnt--;" back to free_fib_info.

Thanks,
Yanmin

2012-05-23 05:02:57

by Eric Dumazet

[permalink] [raw]
Subject: Re: [PATCH] ipv4: fix the rcu race between free_fib_info and ip_route_output_slow

On Wed, 2012-05-23 at 12:54 +0800, Yanmin Zhang wrote:

> > fi->fib_dev (aka fib_nh[0].nh_dev) need full RCU protection.
> The new patch posted at https://lkml.org/lkml/2012/5/22/558 does move the
> resetting to RCU protection.

Its not enough.

We must take care that all users are in a RCU protected region.

They might be already, but a full check is needed.

For example

net/ipv4/fib_trie.c:2563: fi->fib_dev ? fi->fib_dev->name : "*"

looks to be safe (because already in a rcu_read_lock())

But its not.

Right thing would be to do :

struct net_device *ndev = rcu_dereference(fi->fib_dev)

...
ndev ? ndev->name : "*"



2012-05-23 05:08:46

by Eric Dumazet

[permalink] [raw]
Subject: Re: [PATCH] ipv4: fix the rcu race between free_fib_info and ip_route_output_slow

On Wed, 2012-05-23 at 12:41 +0800, Yanmin Zhang wrote:
> We did test it. But it's hard to reproduce it. We hit it the issue
> for a couple of times when running MTBF testing on Android smart phone.
> Mostly, it happens after MTBF runs for more than 12 hours. To releasing
> the product, MTBF testing should run for hundreds of hours. This bug
> blocks it.
>

I have an idea how to trigger this in my test machine.

> We would submit it formally and also run more testing.
>
> Sorry for taking you too much time.

Hey, you fix a bug, take your time and dont worry.

Thanks

2012-05-23 06:15:09

by Yanmin Zhang

[permalink] [raw]
Subject: Re: [PATCH] ipv4: fix the rcu race between free_fib_info and ip_route_output_slow

On Wed, 2012-05-23 at 07:02 +0200, Eric Dumazet wrote:
> On Wed, 2012-05-23 at 12:54 +0800, Yanmin Zhang wrote:
>
> > > fi->fib_dev (aka fib_nh[0].nh_dev) need full RCU protection.
> > The new patch posted at https://lkml.org/lkml/2012/5/22/558 does move the
> > resetting to RCU protection.
>
> Its not enough.
>
> We must take care that all users are in a RCU protected region.
>
> They might be already, but a full check is needed.
>
> For example
>
> net/ipv4/fib_trie.c:2563: fi->fib_dev ? fi->fib_dev->name : "*"
>
> looks to be safe (because already in a rcu_read_lock())
>
> But its not.
>
> Right thing would be to do :
>
> struct net_device *ndev = rcu_dereference(fi->fib_dev)
>
> ...
> ndev ? ndev->name : "*"
Thanks for the pointer.

Besides fi->fib_dev here, we need check fi->fib_hash, fi->fib_lhash,
and nexthop_nh->nh_hash.

Yanmin

2012-05-23 06:27:48

by Eric Dumazet

[permalink] [raw]
Subject: Re: [PATCH] ipv4: fix the rcu race between free_fib_info and ip_route_output_slow

On Wed, 2012-05-23 at 14:15 +0800, Yanmin Zhang wrote:

>
> Besides fi->fib_dev here, we need check fi->fib_hash, fi->fib_lhash,
> and nexthop_nh->nh_hash.

No, its not needed.

I am sending a patch, because I feel this area is too complex for non
netdev guys.


2012-05-23 06:37:32

by Eric Dumazet

[permalink] [raw]
Subject: Re: [PATCH] ipv4: fix the rcu race between free_fib_info and ip_route_output_slow

From: Eric Dumazet <[email protected]>

I am testing following patch, could you test it too ?

Thanks

[PATCH] ipv4: nh_dev should be rcu protected

Yanmin Zhang reported an OOPS and provided a patch.

<3>[23898.789643] BUG: sleeping function called from invalid context at
/data/buildbot/workdir/ics/hardware/intel/linux-2.6/arch/x86/mm/fault.c:1103
<3>[23898.862215] in_atomic(): 0, irqs_disabled(): 0, pid: 10526, name:
Thread-6683
<4>[23898.967805] HSU serial 0000:00:05.1: 0000:00:05.2:HSU serial
prevented me
to suspend...
<4>[23899.258526] Pid: 10526, comm: Thread-6683 Tainted: G W
3.0.8-137685-ge7742f9 #1
<4>[23899.357404] HSU serial 0000:00:05.1: 0000:00:05.2:HSU serial
prevented me
to suspend...
<4>[23899.904225] Call Trace:
<4>[23899.989209] [<c1227f50>] ? pgtable_bad+0x130/0x130
<4>[23900.000416] [<c1238c2a>] __might_sleep+0x10a/0x110
<4>[23900.007357] [<c1228021>] do_page_fault+0xd1/0x3c0
<4>[23900.013764] [<c18e9ba9>] ? restore_all+0xf/0xf
<4>[23900.024024] [<c17c007b>] ? napi_complete+0x8b/0x690
<4>[23900.029297] [<c1227f50>] ? pgtable_bad+0x130/0x130
<4>[23900.123739] [<c1227f50>] ? pgtable_bad+0x130/0x130
<4>[23900.128955] [<c18ea0c3>] error_code+0x5f/0x64
<4>[23900.133466] [<c1227f50>] ? pgtable_bad+0x130/0x130
<4>[23900.138450] [<c17f6298>] ? __ip_route_output_key+0x698/0x7c0
<4>[23900.144312] [<c17f5f8d>] ? __ip_route_output_key+0x38d/0x7c0
<4>[23900.150730] [<c17f63df>] ip_route_output_flow+0x1f/0x60
<4>[23900.156261] [<c181de58>] ip4_datagram_connect+0x188/0x2b0
<4>[23900.161960] [<c18e981f>] ? _raw_spin_unlock_bh+0x1f/0x30
<4>[23900.167834] [<c18298d6>] inet_dgram_connect+0x36/0x80
<4>[23900.173224] [<c14f9e88>] ? _copy_from_user+0x48/0x140
<4>[23900.178817] [<c17ab9da>] sys_connect+0x9a/0xd0
<4>[23900.183538] [<c132e93c>] ? alloc_file+0xdc/0x240
<4>[23900.189111] [<c123925d>] ? sub_preempt_count+0x3d/0x50

But real fix is to provide appropriate RCU protection to nh_dev/fib_dev,
with appropriate __rcu annotation.

tested with CONFIG_PROVE_RCU and CONFIG_SPARSE_RCU_POINTER

Reported-by: Yanmin Zhang <[email protected]>
Reported-by: Kun Jiang <[email protected]>
Signed-off-by: Eric Dumazet <[email protected]>
---
include/linux/inetdevice.h | 8 ++-
include/net/ip_fib.h | 2
net/ipv4/devinet.c | 11 ++--
net/ipv4/fib_frontend.c | 6 +-
net/ipv4/fib_semantics.c | 66 ++++++++++++++++++----------
net/ipv4/fib_trie.c | 11 ++--
net/ipv4/netfilter/ipt_rpfilter.c | 4 -
net/ipv4/route.c | 4 -
8 files changed, 71 insertions(+), 41 deletions(-)

diff --git a/include/linux/inetdevice.h b/include/linux/inetdevice.h
index 597f4a9..8cfa569 100644
--- a/include/linux/inetdevice.h
+++ b/include/linux/inetdevice.h
@@ -172,7 +172,13 @@ extern int inet_addr_onlink(struct in_device *in_dev, __be32 a, __be32 b);
extern int devinet_ioctl(struct net *net, unsigned int cmd, void __user *);
extern void devinet_init(void);
extern struct in_device *inetdev_by_index(struct net *, int);
-extern __be32 inet_select_addr(const struct net_device *dev, __be32 dst, int scope);
+
+extern __be32 __inet_select_addr(struct net *net,
+ const struct net_device *dev,
+ __be32 dst, int scope);
+#define inet_select_addr(dev, dst, scope) \
+ __inet_select_addr(dev_net(dev), (dev), (dst), (scope))
+
extern __be32 inet_confirm_addr(struct in_device *in_dev, __be32 dst, __be32 local, int scope);
extern struct in_ifaddr *inet_ifa_byprefix(struct in_device *in_dev, __be32 prefix, __be32 mask);

diff --git a/include/net/ip_fib.h b/include/net/ip_fib.h
index 78df0866..9d49426 100644
--- a/include/net/ip_fib.h
+++ b/include/net/ip_fib.h
@@ -46,7 +46,7 @@ struct fib_config {
struct fib_info;

struct fib_nh {
- struct net_device *nh_dev;
+ struct net_device __rcu *nh_dev;
struct hlist_node nh_hash;
struct fib_info *nh_parent;
unsigned int nh_flags;
diff --git a/net/ipv4/devinet.c b/net/ipv4/devinet.c
index 10e15a1..be1e75c 100644
--- a/net/ipv4/devinet.c
+++ b/net/ipv4/devinet.c
@@ -164,7 +164,7 @@ struct net_device *__ip_dev_find(struct net *net, __be32 addr, bool devref)
if (local &&
!fib_table_lookup(local, &fl4, &res, FIB_LOOKUP_NOREF) &&
res.type == RTN_LOCAL)
- result = FIB_RES_DEV(res);
+ result = rcu_dereference(FIB_RES_DEV(res));
}
if (result && devref)
dev_hold(result);
@@ -960,14 +960,15 @@ out:
return done;
}

-__be32 inet_select_addr(const struct net_device *dev, __be32 dst, int scope)
+__be32 __inet_select_addr(struct net *net,
+ const struct net_device *dev,
+ __be32 dst, int scope)
{
__be32 addr = 0;
struct in_device *in_dev;
- struct net *net = dev_net(dev);

rcu_read_lock();
- in_dev = __in_dev_get_rcu(dev);
+ in_dev = dev ? __in_dev_get_rcu(dev) : NULL;
if (!in_dev)
goto no_in_dev;

@@ -1007,7 +1008,7 @@ out_unlock:
rcu_read_unlock();
return addr;
}
-EXPORT_SYMBOL(inet_select_addr);
+EXPORT_SYMBOL(__inet_select_addr);

static __be32 confirm_addr_indev(struct in_device *in_dev, __be32 dst,
__be32 local, int scope)
diff --git a/net/ipv4/fib_frontend.c b/net/ipv4/fib_frontend.c
index 3854411..2479884 100644
--- a/net/ipv4/fib_frontend.c
+++ b/net/ipv4/fib_frontend.c
@@ -159,7 +159,7 @@ static inline unsigned int __inet_dev_addr_type(struct net *net,
ret = RTN_UNICAST;
rcu_read_lock();
if (!fib_table_lookup(local_table, &fl4, &res, FIB_LOOKUP_NOREF)) {
- if (!dev || dev == res.fi->fib_dev)
+ if (!dev || dev == rcu_dereference(res.fi->fib_dev))
ret = res.type;
}
rcu_read_unlock();
@@ -237,13 +237,13 @@ int fib_validate_source(struct sk_buff *skb, __be32 src, __be32 dst, u8 tos,
for (ret = 0; ret < res.fi->fib_nhs; ret++) {
struct fib_nh *nh = &res.fi->fib_nh[ret];

- if (nh->nh_dev == dev) {
+ if (rcu_dereference(nh->nh_dev) == dev) {
dev_match = true;
break;
}
}
#else
- if (FIB_RES_DEV(res) == dev)
+ if (rcu_dereference(FIB_RES_DEV(res)) == dev)
dev_match = true;
#endif
if (dev_match) {
diff --git a/net/ipv4/fib_semantics.c b/net/ipv4/fib_semantics.c
index a8bdf74..a09f055 100644
--- a/net/ipv4/fib_semantics.c
+++ b/net/ipv4/fib_semantics.c
@@ -157,9 +157,13 @@ void free_fib_info(struct fib_info *fi)
return;
}
change_nexthops(fi) {
- if (nexthop_nh->nh_dev)
- dev_put(nexthop_nh->nh_dev);
- nexthop_nh->nh_dev = NULL;
+ struct net_device *ndev;
+
+ ndev = rtnl_dereference(nexthop_nh->nh_dev);
+ if (ndev) {
+ dev_put(ndev);
+ RCU_INIT_POINTER(nexthop_nh->nh_dev, NULL);
+ }
} endfor_nexthops(fi);
fib_info_cnt--;
release_net(fi->fib_net);
@@ -273,7 +277,7 @@ int ip_fib_check_default(__be32 gw, struct net_device *dev)
hash = fib_devindex_hashfn(dev->ifindex);
head = &fib_info_devhash[hash];
hlist_for_each_entry(nh, node, head, nh_hash) {
- if (nh->nh_dev == dev &&
+ if (rcu_access_pointer(nh->nh_dev) == dev &&
nh->nh_gw == gw &&
!(nh->nh_flags & RTNH_F_DEAD)) {
spin_unlock(&fib_info_lock);
@@ -360,13 +364,17 @@ struct fib_alias *fib_find_alias(struct list_head *fah, u8 tos, u32 prio)
return NULL;
}

+/* called in rcu_read_lock() section */
int fib_detect_death(struct fib_info *fi, int order,
struct fib_info **last_resort, int *last_idx, int dflt)
{
- struct neighbour *n;
+ struct neighbour *n = NULL;
int state = NUD_NONE;
+ struct net_device *ndev = rcu_dereference(fi->fib_dev);
+
+ if (ndev)
+ n = neigh_lookup(&arp_tbl, &fi->fib_nh[0].nh_gw, ndev);

- n = neigh_lookup(&arp_tbl, &fi->fib_nh[0].nh_gw, fi->fib_dev);
if (n) {
state = n->nud_state;
neigh_release(n);
@@ -551,7 +559,7 @@ static int fib_check_nh(struct fib_config *cfg, struct fib_info *fi,
return -ENODEV;
if (!(dev->flags & IFF_UP))
return -ENETDOWN;
- nh->nh_dev = dev;
+ rcu_assign_pointer(nh->nh_dev, dev);
dev_hold(dev);
nh->nh_scope = RT_SCOPE_LINK;
return 0;
@@ -578,7 +586,8 @@ static int fib_check_nh(struct fib_config *cfg, struct fib_info *fi,
goto out;
nh->nh_scope = res.scope;
nh->nh_oif = FIB_RES_OIF(res);
- nh->nh_dev = dev = FIB_RES_DEV(res);
+ dev = rcu_dereference(FIB_RES_DEV(res));
+ rcu_assign_pointer(nh->nh_dev, dev);
if (!dev)
goto out;
dev_hold(dev);
@@ -597,8 +606,8 @@ static int fib_check_nh(struct fib_config *cfg, struct fib_info *fi,
err = -ENETDOWN;
if (!(in_dev->dev->flags & IFF_UP))
goto out;
- nh->nh_dev = in_dev->dev;
- dev_hold(nh->nh_dev);
+ rcu_assign_pointer(nh->nh_dev, in_dev->dev);
+ dev_hold(in_dev->dev);
nh->nh_scope = RT_SCOPE_HOST;
err = 0;
}
@@ -695,9 +704,14 @@ static void fib_info_hash_move(struct hlist_head *new_info_hash,

__be32 fib_info_update_nh_saddr(struct net *net, struct fib_nh *nh)
{
- nh->nh_saddr = inet_select_addr(nh->nh_dev,
- nh->nh_gw,
- nh->nh_parent->fib_scope);
+ struct net_device *ndev;
+
+ rcu_read_lock();
+ ndev = rcu_dereference(nh->nh_dev);
+ nh->nh_saddr = __inet_select_addr(net, ndev,
+ nh->nh_gw,
+ nh->nh_parent->fib_scope);
+ rcu_read_unlock();
nh->nh_saddr_genid = atomic_read(&net->ipv4.dev_addr_genid);

return nh->nh_saddr;
@@ -843,7 +857,8 @@ struct fib_info *fib_create_info(struct fib_config *cfg)
if (nhs != 1 || nh->nh_gw)
goto err_inval;
nh->nh_scope = RT_SCOPE_NOWHERE;
- nh->nh_dev = dev_get_by_index(net, fi->fib_nh->nh_oif);
+ RCU_INIT_POINTER(nh->nh_dev,
+ dev_get_by_index(net, fi->fib_nh->nh_oif));
err = -ENODEV;
if (nh->nh_dev == NULL)
goto failure;
@@ -889,10 +904,11 @@ link_it:
change_nexthops(fi) {
struct hlist_head *head;
unsigned int hash;
+ struct net_device *ndev = rtnl_dereference(nexthop_nh->nh_dev);

- if (!nexthop_nh->nh_dev)
+ if (!ndev)
continue;
- hash = fib_devindex_hashfn(nexthop_nh->nh_dev->ifindex);
+ hash = fib_devindex_hashfn(ndev->ifindex);
head = &fib_info_devhash[hash];
hlist_add_head(&nexthop_nh->nh_hash, head);
} endfor_nexthops(fi)
@@ -1049,14 +1065,14 @@ int fib_sync_down_dev(struct net_device *dev, int force)
int dead;

BUG_ON(!fi->fib_nhs);
- if (nh->nh_dev != dev || fi == prev_fi)
+ if (rtnl_dereference(nh->nh_dev) != dev || fi == prev_fi)
continue;
prev_fi = fi;
dead = 0;
change_nexthops(fi) {
if (nexthop_nh->nh_flags & RTNH_F_DEAD)
dead++;
- else if (nexthop_nh->nh_dev == dev &&
+ else if (rtnl_dereference(nexthop_nh->nh_dev) == dev &&
nexthop_nh->nh_scope != scope) {
nexthop_nh->nh_flags |= RTNH_F_DEAD;
#ifdef CONFIG_IP_ROUTE_MULTIPATH
@@ -1068,7 +1084,8 @@ int fib_sync_down_dev(struct net_device *dev, int force)
dead++;
}
#ifdef CONFIG_IP_ROUTE_MULTIPATH
- if (force > 1 && nexthop_nh->nh_dev == dev) {
+ if (force > 1 &&
+ rtnl_dereference(nexthop_nh->nh_dev) == dev) {
dead = fi->fib_nhs;
break;
}
@@ -1167,20 +1184,23 @@ int fib_sync_up(struct net_device *dev)
int alive;

BUG_ON(!fi->fib_nhs);
- if (nh->nh_dev != dev || fi == prev_fi)
+ if (rtnl_dereference(nh->nh_dev) != dev || fi == prev_fi)
continue;

prev_fi = fi;
alive = 0;
change_nexthops(fi) {
+ struct net_device *ndev;
+
if (!(nexthop_nh->nh_flags & RTNH_F_DEAD)) {
alive++;
continue;
}
- if (nexthop_nh->nh_dev == NULL ||
- !(nexthop_nh->nh_dev->flags & IFF_UP))
+ ndev = rtnl_dereference(nexthop_nh->nh_dev);
+ if (ndev == NULL ||
+ !(ndev->flags & IFF_UP))
continue;
- if (nexthop_nh->nh_dev != dev ||
+ if (ndev != dev ||
!__in_dev_get_rtnl(dev))
continue;
alive++;
diff --git a/net/ipv4/fib_trie.c b/net/ipv4/fib_trie.c
index 30b88d7..3861ba0 100644
--- a/net/ipv4/fib_trie.c
+++ b/net/ipv4/fib_trie.c
@@ -2556,11 +2556,14 @@ static int fib_route_seq_show(struct seq_file *seq, void *v)
|| fa->fa_type == RTN_MULTICAST)
continue;

- if (fi)
+ if (fi) {
+ struct net_device *ndev;
+
+ ndev = rcu_dereference(fi->fib_dev);
seq_printf(seq,
"%s\t%08X\t%08X\t%04X\t%d\t%u\t"
"%d\t%08X\t%d\t%u\t%u%n",
- fi->fib_dev ? fi->fib_dev->name : "*",
+ ndev ? ndev->name : "*",
prefix,
fi->fib_nh->nh_gw, flags, 0, 0,
fi->fib_priority,
@@ -2569,13 +2572,13 @@ static int fib_route_seq_show(struct seq_file *seq, void *v)
fi->fib_advmss + 40 : 0),
fi->fib_window,
fi->fib_rtt >> 3, &len);
- else
+ } else {
seq_printf(seq,
"*\t%08X\t%08X\t%04X\t%d\t%u\t"
"%d\t%08X\t%d\t%u\t%u%n",
prefix, 0, flags, 0, 0, 0,
mask, 0, 0, 0, &len);
-
+ }
seq_printf(seq, "%*s\n", 127 - len, "");
}
}
diff --git a/net/ipv4/netfilter/ipt_rpfilter.c b/net/ipv4/netfilter/ipt_rpfilter.c
index 31371be..bdc9393 100644
--- a/net/ipv4/netfilter/ipt_rpfilter.c
+++ b/net/ipv4/netfilter/ipt_rpfilter.c
@@ -52,13 +52,13 @@ static bool rpfilter_lookup_reverse(struct flowi4 *fl4,
for (ret = 0; ret < res.fi->fib_nhs; ret++) {
struct fib_nh *nh = &res.fi->fib_nh[ret];

- if (nh->nh_dev == dev) {
+ if (rcu_dereference(nh->nh_dev) == dev) {
dev_match = true;
break;
}
}
#else
- if (FIB_RES_DEV(res) == dev)
+ if (rcu_dereference(FIB_RES_DEV(res)) == dev)
dev_match = true;
#endif
if (dev_match || flags & XT_RPFILTER_LOOSE)
diff --git a/net/ipv4/route.c b/net/ipv4/route.c
index ffcb3b0..b56b91e 100644
--- a/net/ipv4/route.c
+++ b/net/ipv4/route.c
@@ -2127,7 +2127,7 @@ static int __mkroute_input(struct sk_buff *skb,
u32 itag;

/* get a working reference to the output device */
- out_dev = __in_dev_get_rcu(FIB_RES_DEV(*res));
+ out_dev = __in_dev_get_rcu(rcu_dereference(FIB_RES_DEV(*res)));
if (out_dev == NULL) {
net_crit_ratelimited("Bug in ip_route_input_slow(). Please report.\n");
return -EINVAL;
@@ -2786,7 +2786,7 @@ static struct rtable *ip_route_output_slow(struct net *net, struct flowi4 *fl4)
if (!fl4->saddr)
fl4->saddr = FIB_RES_PREFSRC(net, res);

- dev_out = FIB_RES_DEV(res);
+ dev_out = rcu_dereference(FIB_RES_DEV(res));
fl4->flowi4_oif = dev_out->ifindex;



2012-05-23 06:43:42

by Eric Dumazet

[permalink] [raw]
Subject: Re: [PATCH] ipv4: fix the rcu race between free_fib_info and ip_route_output_slow

On Wed, 2012-05-23 at 08:37 +0200, Eric Dumazet wrote:
> From: Eric Dumazet <[email protected]>

> diff --git a/net/ipv4/route.c b/net/ipv4/route.c
> index ffcb3b0..b56b91e 100644
> --- a/net/ipv4/route.c
> +++ b/net/ipv4/route.c
> @@ -2127,7 +2127,7 @@ static int __mkroute_input(struct sk_buff *skb,
> u32 itag;
>
> /* get a working reference to the output device */
> - out_dev = __in_dev_get_rcu(FIB_RES_DEV(*res));
> + out_dev = __in_dev_get_rcu(rcu_dereference(FIB_RES_DEV(*res)));

This part might need additional fix (if FIB_RES_DEV(*res) is NULL),
because __in_dev_get_rcu() could crash dereferencing NULL pointer.






2012-05-23 06:46:15

by Yanmin Zhang

[permalink] [raw]
Subject: Re: [PATCH] ipv4: fix the rcu race between free_fib_info and ip_route_output_slow

On Wed, 2012-05-23 at 08:27 +0200, Eric Dumazet wrote:
> On Wed, 2012-05-23 at 14:15 +0800, Yanmin Zhang wrote:
>
> >
> > Besides fi->fib_dev here, we need check fi->fib_hash, fi->fib_lhash,
> > and nexthop_nh->nh_hash.
>
> No, its not needed.
I am checking it now.
fib_create_info=>fib_find_info, but fib_find_info is not protected by
fib_info_lock. See the codes:


link_it:
ofi = fib_find_info(fi);
if (ofi) {
fi->fib_dead = 1;
free_fib_info(fi);
ofi->fib_treeref++;
return ofi;
}

>
> I am sending a patch, because I feel this area is too complex for non
> netdev guys.
Indeed, it's complicated. I will test your patches.

2012-05-23 06:55:46

by Eric Dumazet

[permalink] [raw]
Subject: Re: [PATCH] ipv4: fix the rcu race between free_fib_info and ip_route_output_slow

On Wed, 2012-05-23 at 14:47 +0800, Yanmin Zhang wrote:
> On Wed, 2012-05-23 at 08:27 +0200, Eric Dumazet wrote:
> > On Wed, 2012-05-23 at 14:15 +0800, Yanmin Zhang wrote:
> >
> > >
> > > Besides fi->fib_dev here, we need check fi->fib_hash, fi->fib_lhash,
> > > and nexthop_nh->nh_hash.
> >
> > No, its not needed.
> I am checking it now.
> fib_create_info=>fib_find_info, but fib_find_info is not protected by
> fib_info_lock. See the codes:
>
>
> link_it:
> ofi = fib_find_info(fi);
> if (ofi) {
> fi->fib_dead = 1;
> free_fib_info(fi);
> ofi->fib_treeref++;
> return ofi;
> }
>
> >
> > I am sending a patch, because I feel this area is too complex for non
> > netdev guys.
> Indeed, it's complicated. I will test your patches.

Please hold on, I'll send a v2


2012-05-23 07:14:07

by Eric Dumazet

[permalink] [raw]
Subject: Re: [PATCH] ipv4: fix the rcu race between free_fib_info and ip_route_output_slow

On Wed, 2012-05-23 at 08:55 +0200, Eric Dumazet wrote:

> Please hold on, I'll send a v2

I believe your patch should be fine, if you move back the
fib_info_cnt--;

So only do the dev_put() in free_fib_info_rcu().

No need to clear nh_dev to NULL since we are freeing fi at the end of
function.


2012-05-23 07:23:33

by Yanmin Zhang

[permalink] [raw]
Subject: Re: [PATCH] ipv4: fix the rcu race between free_fib_info and ip_route_output_slow

On Wed, 2012-05-23 at 09:13 +0200, Eric Dumazet wrote:
> On Wed, 2012-05-23 at 08:55 +0200, Eric Dumazet wrote:
>
> > Please hold on, I'll send a v2
>
> I believe your patch should be fine, if you move back the
> fib_info_cnt--;
>
> So only do the dev_put() in free_fib_info_rcu().
We would do so in a new patch.

>
> No need to clear nh_dev to NULL since we are freeing fi at the end of
> function.
David suggests to reset it to NULL to detect other potential
race conditions.

Besides above suggestions, how do you think about:

fib_create_info=>fib_find_info, but fib_find_info is not protected by
fib_info_lock. See the codes:

fib_create_info()
{
...
link_it:
ofi = fib_find_info(fi);
if (ofi) {
fi->fib_dead = 1;
free_fib_info(fi);
ofi->fib_treeref++;
return ofi;
}
fi->fib_treeref++;
atomic_inc(&fi->fib_clntref);
spin_lock_bh(&fib_info_lock);

...
}

I plan to change it to hold fib_info_lock before calling fib_find_info. Is
it ok for you?

Thanks for the direct speaking.

Yanmin

2012-05-23 07:39:41

by Eric Dumazet

[permalink] [raw]
Subject: Re: [PATCH] ipv4: fix the rcu race between free_fib_info and ip_route_output_slow

On Wed, 2012-05-23 at 15:24 +0800, Yanmin Zhang wrote:
> On Wed, 2012-05-23 at 09:13 +0200, Eric Dumazet wrote:
> > On Wed, 2012-05-23 at 08:55 +0200, Eric Dumazet wrote:
> >
> > > Please hold on, I'll send a v2
> >
> > I believe your patch should be fine, if you move back the
> > fib_info_cnt--;
> >
> > So only do the dev_put() in free_fib_info_rcu().
> We would do so in a new patch.
>
> >
> > No need to clear nh_dev to NULL since we are freeing fi at the end of
> > function.
> David suggests to reset it to NULL to detect other potential
> race conditions.
>

Yes but no.

Users are in a RCU read lock and we should not change nh_dev to NULL
while they are running.

Thats what I tried to do (David message gave me this wrong hint) but it
came to a dead issue.

Only after a grace period we can :
dev_put() all involved net_devices
kfree(fi)

> Besides above suggestions, how do you think about:
>
> fib_create_info=>fib_find_info, but fib_find_info is not protected by
> fib_info_lock. See the codes:
>
> fib_create_info()
> {
> ...
> link_it:
> ofi = fib_find_info(fi);
> if (ofi) {
> fi->fib_dead = 1;
> free_fib_info(fi);
> ofi->fib_treeref++;
> return ofi;
> }
> fi->fib_treeref++;
> atomic_inc(&fi->fib_clntref);
> spin_lock_bh(&fib_info_lock);
>
> ...
> }
>
> I plan to change it to hold fib_info_lock before calling fib_find_info. Is
> it ok for you?

Its not needed we hold RTNL mutex.

spinlock is needed mainly because of ip_fib_check_default(), but this
could be changed to use RCU as well.

(readers use RCU, writers already hold RTNL -> no more fib_info_lock )


2012-05-23 07:41:43

by Eric Dumazet

[permalink] [raw]
Subject: Re: [PATCH] ipv4: fix the rcu race between free_fib_info and ip_route_output_slow

On Wed, 2012-05-23 at 09:39 +0200, Eric Dumazet wrote:

> spinlock is needed mainly because of ip_fib_check_default(), but this
> could be changed to use RCU as well.
>
> (readers use RCU, writers already hold RTNL -> no more fib_info_lock )

Since its not a fast path and hash table can be resized, its probably
not worth the pain.

2012-05-23 07:47:10

by Yanmin Zhang

[permalink] [raw]
Subject: Re: [PATCH] ipv4: fix the rcu race between free_fib_info and ip_route_output_slow

On Wed, 2012-05-23 at 09:39 +0200, Eric Dumazet wrote:
> On Wed, 2012-05-23 at 15:24 +0800, Yanmin Zhang wrote:
> > On Wed, 2012-05-23 at 09:13 +0200, Eric Dumazet wrote:
> > > On Wed, 2012-05-23 at 08:55 +0200, Eric Dumazet wrote:
> > >
> > > > Please hold on, I'll send a v2
> > >
> > > I believe your patch should be fine, if you move back the
> > > fib_info_cnt--;
> > >
> > > So only do the dev_put() in free_fib_info_rcu().
> > We would do so in a new patch.
> >
> > >
> > > No need to clear nh_dev to NULL since we are freeing fi at the end of
> > > function.
> > David suggests to reset it to NULL to detect other potential
> > race conditions.
> >
>
> Yes but no.
>
> Users are in a RCU read lock and we should not change nh_dev to NULL
> while they are running.
>
> Thats what I tried to do (David message gave me this wrong hint) but it
> came to a dead issue.
>
> Only after a grace period we can :
> dev_put() all involved net_devices
> kfree(fi)
>
> > Besides above suggestions, how do you think about:
> >
> > fib_create_info=>fib_find_info, but fib_find_info is not protected by
> > fib_info_lock. See the codes:
> >
> > fib_create_info()
> > {
> > ...
> > link_it:
> > ofi = fib_find_info(fi);
> > if (ofi) {
> > fi->fib_dead = 1;
> > free_fib_info(fi);
> > ofi->fib_treeref++;
> > return ofi;
> > }
> > fi->fib_treeref++;
> > atomic_inc(&fi->fib_clntref);
> > spin_lock_bh(&fib_info_lock);
> >
> > ...
> > }
> >
> > I plan to change it to hold fib_info_lock before calling fib_find_info. Is
> > it ok for you?
>
> Its not needed we hold RTNL mutex.
Thanks. We would work out a new patch and post it here after running MTBF
testing.