Sequence counters write paths are critical sections that must never be
preempted, and blocking, even for CONFIG_PREEMPTION=n, is not allowed.
Commit 5dbe7c178d3f ("net: fix kernel deadlock with interface rename and
netdev name retrieval.") handled a deadlock, observed with
CONFIG_PREEMPTION=n, where the devnet_rename seqcount read side was
infinitely spinning: it got scheduled after the seqcount write side
blocked inside its own critical section.
To fix that deadlock, among other issues, the commit added a
cond_resched() inside the read side section. While this will get the
non-preemptible kernel eventually unstuck, the seqcount reader is fully
exhausting its slice just spinning -- until TIF_NEED_RESCHED is set.
The fix is also still broken: if the seqcount reader belongs to a
real-time scheduling policy, it can spin forever and the kernel will
livelock.
Disabling preemption over the seqcount write side critical section will
not work: inside it are a number of GFP_KERNEL allocations and mutex
locking through the drivers/base/ :: device_rename() call chain.
From all the above, replace the seqcount with a rwsem.
Fixes: 5dbe7c178d3f (net: fix kernel deadlock with interface rename and netdev name retrieval.)
Fixes: 30e6c9fa93cf (net: devnet_rename_seq should be a seqcount)
Fixes: c91f6df2db49 (sockopt: Change getsockopt() of SO_BINDTODEVICE to return an interface name)
Cc: <[email protected]>
Signed-off-by: Ahmed S. Darwish <[email protected]>
Reviewed-by: Sebastian Andrzej Siewior <[email protected]>
---
net/core/dev.c | 30 ++++++++++++------------------
1 file changed, 12 insertions(+), 18 deletions(-)
diff --git a/net/core/dev.c b/net/core/dev.c
index 522288177bbd..e18a4c23df0e 100644
--- a/net/core/dev.c
+++ b/net/core/dev.c
@@ -79,6 +79,7 @@
#include <linux/sched.h>
#include <linux/sched/mm.h>
#include <linux/mutex.h>
+#include <linux/rwsem.h>
#include <linux/string.h>
#include <linux/mm.h>
#include <linux/socket.h>
@@ -194,7 +195,7 @@ static DEFINE_SPINLOCK(napi_hash_lock);
static unsigned int napi_gen_id = NR_CPUS;
static DEFINE_READ_MOSTLY_HASHTABLE(napi_hash, 8);
-static seqcount_t devnet_rename_seq;
+static DECLARE_RWSEM(devnet_rename_sem);
static inline void dev_base_seq_inc(struct net *net)
{
@@ -930,18 +931,13 @@ EXPORT_SYMBOL(dev_get_by_napi_id);
* @net: network namespace
* @name: a pointer to the buffer where the name will be stored.
* @ifindex: the ifindex of the interface to get the name from.
- *
- * The use of raw_seqcount_begin() and cond_resched() before
- * retrying is required as we want to give the writers a chance
- * to complete when CONFIG_PREEMPTION is not set.
*/
int netdev_get_name(struct net *net, char *name, int ifindex)
{
struct net_device *dev;
- unsigned int seq;
-retry:
- seq = raw_seqcount_begin(&devnet_rename_seq);
+ down_read(&devnet_rename_sem);
+
rcu_read_lock();
dev = dev_get_by_index_rcu(net, ifindex);
if (!dev) {
@@ -951,10 +947,8 @@ int netdev_get_name(struct net *net, char *name, int ifindex)
strcpy(name, dev->name);
rcu_read_unlock();
- if (read_seqcount_retry(&devnet_rename_seq, seq)) {
- cond_resched();
- goto retry;
- }
+
+ up_read(&devnet_rename_sem);
return 0;
}
@@ -1228,10 +1222,10 @@ int dev_change_name(struct net_device *dev, const char *newname)
likely(!(dev->priv_flags & IFF_LIVE_RENAME_OK)))
return -EBUSY;
- write_seqcount_begin(&devnet_rename_seq);
+ down_write(&devnet_rename_sem);
if (strncmp(newname, dev->name, IFNAMSIZ) == 0) {
- write_seqcount_end(&devnet_rename_seq);
+ up_write(&devnet_rename_sem);
return 0;
}
@@ -1239,7 +1233,7 @@ int dev_change_name(struct net_device *dev, const char *newname)
err = dev_get_valid_name(net, dev, newname);
if (err < 0) {
- write_seqcount_end(&devnet_rename_seq);
+ up_write(&devnet_rename_sem);
return err;
}
@@ -1254,11 +1248,11 @@ int dev_change_name(struct net_device *dev, const char *newname)
if (ret) {
memcpy(dev->name, oldname, IFNAMSIZ);
dev->name_assign_type = old_assign_type;
- write_seqcount_end(&devnet_rename_seq);
+ up_write(&devnet_rename_sem);
return ret;
}
- write_seqcount_end(&devnet_rename_seq);
+ up_write(&devnet_rename_sem);
netdev_adjacent_rename_links(dev, oldname);
@@ -1279,7 +1273,7 @@ int dev_change_name(struct net_device *dev, const char *newname)
/* err >= 0 after dev_alloc_name() or stores the first errno */
if (err >= 0) {
err = ret;
- write_seqcount_begin(&devnet_rename_seq);
+ down_write(&devnet_rename_sem);
memcpy(dev->name, oldname, IFNAMSIZ);
memcpy(oldname, newname, IFNAMSIZ);
dev->name_assign_type = old_assign_type;
--
2.20.1
On Tue, 19 May 2020 23:45:23 +0200
"Ahmed S. Darwish" <[email protected]> wrote:
> Sequence counters write paths are critical sections that must never be
> preempted, and blocking, even for CONFIG_PREEMPTION=n, is not allowed.
>
> Commit 5dbe7c178d3f ("net: fix kernel deadlock with interface rename and
> netdev name retrieval.") handled a deadlock, observed with
> CONFIG_PREEMPTION=n, where the devnet_rename seqcount read side was
> infinitely spinning: it got scheduled after the seqcount write side
> blocked inside its own critical section.
>
> To fix that deadlock, among other issues, the commit added a
> cond_resched() inside the read side section. While this will get the
> non-preemptible kernel eventually unstuck, the seqcount reader is fully
> exhausting its slice just spinning -- until TIF_NEED_RESCHED is set.
>
> The fix is also still broken: if the seqcount reader belongs to a
> real-time scheduling policy, it can spin forever and the kernel will
> livelock.
>
> Disabling preemption over the seqcount write side critical section will
> not work: inside it are a number of GFP_KERNEL allocations and mutex
> locking through the drivers/base/ :: device_rename() call chain.
>
> From all the above, replace the seqcount with a rwsem.
>
> Fixes: 5dbe7c178d3f (net: fix kernel deadlock with interface rename and netdev name retrieval.)
> Fixes: 30e6c9fa93cf (net: devnet_rename_seq should be a seqcount)
> Fixes: c91f6df2db49 (sockopt: Change getsockopt() of SO_BINDTODEVICE to return an interface name)
> Cc: <[email protected]>
> Signed-off-by: Ahmed S. Darwish <[email protected]>
> Reviewed-by: Sebastian Andrzej Siewior <[email protected]>
Have your performance tested this with 1000's of network devices?
The reason seqcount logic is was done here was to achieve scaleablity
and a semaphore does not scale as well.
Stephen Hemminger <[email protected]> writes:
> On Tue, 19 May 2020 23:45:23 +0200
> "Ahmed S. Darwish" <[email protected]> wrote:
>
>> Sequence counters write paths are critical sections that must never be
>> preempted, and blocking, even for CONFIG_PREEMPTION=n, is not allowed.
>>
>> Commit 5dbe7c178d3f ("net: fix kernel deadlock with interface rename and
>> netdev name retrieval.") handled a deadlock, observed with
>> CONFIG_PREEMPTION=n, where the devnet_rename seqcount read side was
>> infinitely spinning: it got scheduled after the seqcount write side
>> blocked inside its own critical section.
>>
>> To fix that deadlock, among other issues, the commit added a
>> cond_resched() inside the read side section. While this will get the
>> non-preemptible kernel eventually unstuck, the seqcount reader is fully
>> exhausting its slice just spinning -- until TIF_NEED_RESCHED is set.
>>
>> The fix is also still broken: if the seqcount reader belongs to a
>> real-time scheduling policy, it can spin forever and the kernel will
>> livelock.
>>
>> Disabling preemption over the seqcount write side critical section will
>> not work: inside it are a number of GFP_KERNEL allocations and mutex
>> locking through the drivers/base/ :: device_rename() call chain.
>>
>> From all the above, replace the seqcount with a rwsem.
>>
>> Fixes: 5dbe7c178d3f (net: fix kernel deadlock with interface rename and netdev name retrieval.)
>> Fixes: 30e6c9fa93cf (net: devnet_rename_seq should be a seqcount)
>> Fixes: c91f6df2db49 (sockopt: Change getsockopt() of SO_BINDTODEVICE to return an interface name)
>> Cc: <[email protected]>
>> Signed-off-by: Ahmed S. Darwish <[email protected]>
>> Reviewed-by: Sebastian Andrzej Siewior <[email protected]>
>
> Have your performance tested this with 1000's of network devices?
No. We did not. -ENOTESTCASE
> The reason seqcount logic is was done here was to achieve scaleablity
> and a semaphore does not scale as well.
That still does not make the livelock magically going away. Just make a
reader with real-time priority preempt the writer and the system stops
dead. The net result is perfomance <= 0.
This was observed on RT kernels without a special 1000's of network
devices test case.
Just for the record: This is not a RT specific problem. You can
reproduce that w/o an RT kernel as well. Just run the reader with
real-time scheduling policy.
As much as you hate it from a performance POV the only sane rule of
programming is: Correctness first.
And this code clearly violates that rule.
Thanks,
tglx
On Wed, 20 May 2020 00:23:48 +0200
Thomas Gleixner <[email protected]> wrote:
> Stephen Hemminger <[email protected]> writes:
> > On Tue, 19 May 2020 23:45:23 +0200
> > "Ahmed S. Darwish" <[email protected]> wrote:
> >
> >> Sequence counters write paths are critical sections that must never be
> >> preempted, and blocking, even for CONFIG_PREEMPTION=n, is not allowed.
> >>
> >> Commit 5dbe7c178d3f ("net: fix kernel deadlock with interface rename and
> >> netdev name retrieval.") handled a deadlock, observed with
> >> CONFIG_PREEMPTION=n, where the devnet_rename seqcount read side was
> >> infinitely spinning: it got scheduled after the seqcount write side
> >> blocked inside its own critical section.
> >>
> >> To fix that deadlock, among other issues, the commit added a
> >> cond_resched() inside the read side section. While this will get the
> >> non-preemptible kernel eventually unstuck, the seqcount reader is fully
> >> exhausting its slice just spinning -- until TIF_NEED_RESCHED is set.
> >>
> >> The fix is also still broken: if the seqcount reader belongs to a
> >> real-time scheduling policy, it can spin forever and the kernel will
> >> livelock.
> >>
> >> Disabling preemption over the seqcount write side critical section will
> >> not work: inside it are a number of GFP_KERNEL allocations and mutex
> >> locking through the drivers/base/ :: device_rename() call chain.
> >>
> >> From all the above, replace the seqcount with a rwsem.
> >>
> >> Fixes: 5dbe7c178d3f (net: fix kernel deadlock with interface rename and netdev name retrieval.)
> >> Fixes: 30e6c9fa93cf (net: devnet_rename_seq should be a seqcount)
> >> Fixes: c91f6df2db49 (sockopt: Change getsockopt() of SO_BINDTODEVICE to return an interface name)
> >> Cc: <[email protected]>
> >> Signed-off-by: Ahmed S. Darwish <[email protected]>
> >> Reviewed-by: Sebastian Andrzej Siewior <[email protected]>
> >
> > Have your performance tested this with 1000's of network devices?
>
> No. We did not. -ENOTESTCASE
Please try, it isn't that hard..
# time for ((i=0;i<1000;i++)); do ip li add dev dummy$i type dummy; done
real 0m17.002s
user 0m1.064s
sys 0m0.375s
Stephen Hemminger <[email protected]> writes:
> On Wed, 20 May 2020 00:23:48 +0200
> Thomas Gleixner <[email protected]> wrote:
>> No. We did not. -ENOTESTCASE
>
> Please try, it isn't that hard..
>
> # time for ((i=0;i<1000;i++)); do ip li add dev dummy$i type dummy; done
>
> real 0m17.002s
> user 0m1.064s
> sys 0m0.375s
And that solves the incorrectness of the current code in which way?
On Wed, 20 May 2020 01:42:30 +0200
Thomas Gleixner <[email protected]> wrote:
> Stephen Hemminger <[email protected]> writes:
> > On Wed, 20 May 2020 00:23:48 +0200
> > Thomas Gleixner <[email protected]> wrote:
> >> No. We did not. -ENOTESTCASE
> >
> > Please try, it isn't that hard..
> >
> > # time for ((i=0;i<1000;i++)); do ip li add dev dummy$i type dummy; done
> >
> > real 0m17.002s
> > user 0m1.064s
> > sys 0m0.375s
>
> And that solves the incorrectness of the current code in which way?
Agree that the current code is has evolved over time to a state where it is not
correct in the case of Preempt-RT. The motivation for the changes to seqcount
goes back many years when there were ISP's that were concerned about scaling of tunnels, vlans etc.
Is it too much to ask for a simple before/after test of your patch as part
of the submission. You probably measure latency changes to the nanosecond.
Getting it correct without causing user complaints.
Stephen Hemminger <[email protected]> writes:
> On Wed, 20 May 2020 01:42:30 +0200
> Thomas Gleixner <[email protected]> wrote:
>
>> Stephen Hemminger <[email protected]> writes:
>> > On Wed, 20 May 2020 00:23:48 +0200
>> > Thomas Gleixner <[email protected]> wrote:
>> >> No. We did not. -ENOTESTCASE
>> >
>> > Please try, it isn't that hard..
>> >
>> > # time for ((i=0;i<1000;i++)); do ip li add dev dummy$i type dummy; done
>> >
>> > real 0m17.002s
>> > user 0m1.064s
>> > sys 0m0.375s
>>
>> And that solves the incorrectness of the current code in which way?
>
> Agree that the current code is has evolved over time to a state where it is not
> correct in the case of Preempt-RT.
That's not a RT problem as explained in great length in the changelog
and as I pointed out in my previous reply.
Realtime scheduling classes are available on stock kernels and all
those attempts to "fix" the livelock problem are ignoring that fact.
Just because you or whoever involved are not using them or do not care
is not making the code more correct.
> The motivation for the changes to seqcount goes back many years when
> there were ISP's that were concerned about scaling of tunnels, vlans
> etc.
I completely understand where this comes from, but that is not a
justification for incorrect code at all.
> Is it too much to ask for a simple before/after test of your patch as part
> of the submission. You probably measure latency changes to the
> nanosecond.
It's not too much to ask and I'm happy to provide the numbers.
But before I waste my time and produce them, can you please explain how
any numbers provided are going to change the fact that the code is
incorrect?
A bug, is a bug no matter what the numbers are.
I don't have a insta reproducer at hand for the problem which made that
code go belly up, but the net result is simply:
Before: After:
real INFINTE 0mxx.yyys
And the 'Before' comes with the extra benefit of stall warnings (if
enabled in the config).
If you insist I surely can go the extra mile and write up the insta
reproducer and stick it into a bugzilla for you.
Thanks,
tglx
On 5/19/20 2:45 PM, Ahmed S. Darwish wrote:
> Sequence counters write paths are critical sections that must never be
> preempted, and blocking, even for CONFIG_PREEMPTION=n, is not allowed.
>
> Commit 5dbe7c178d3f ("net: fix kernel deadlock with interface rename and
> netdev name retrieval.") handled a deadlock, observed with
> CONFIG_PREEMPTION=n, where the devnet_rename seqcount read side was
> infinitely spinning: it got scheduled after the seqcount write side
> blocked inside its own critical section.
>
> To fix that deadlock, among other issues, the commit added a
> cond_resched() inside the read side section. While this will get the
> non-preemptible kernel eventually unstuck, the seqcount reader is fully
> exhausting its slice just spinning -- until TIF_NEED_RESCHED is set.
>
> The fix is also still broken: if the seqcount reader belongs to a
> real-time scheduling policy, it can spin forever and the kernel will
> livelock.
>
> Disabling preemption over the seqcount write side critical section will
> not work: inside it are a number of GFP_KERNEL allocations and mutex
> locking through the drivers/base/ :: device_rename() call chain.
>
> From all the above, replace the seqcount with a rwsem.
>
> Fixes: 5dbe7c178d3f (net: fix kernel deadlock with interface rename and netdev name retrieval.)
> Fixes: 30e6c9fa93cf (net: devnet_rename_seq should be a seqcount)
> Fixes: c91f6df2db49 (sockopt: Change getsockopt() of SO_BINDTODEVICE to return an interface name)
> Cc: <[email protected]>
> Signed-off-by: Ahmed S. Darwish <[email protected]>
> Reviewed-by: Sebastian Andrzej Siewior <[email protected]>
> ---
> net/core/dev.c | 30 ++++++++++++------------------
> 1 file changed, 12 insertions(+), 18 deletions(-)
>
Seems fine to me, assuming rwsem prevent starvation of the writer.
(Presumably this could be a per ndevice rwsem, or per netns, to provide some isolation)
Alternative would be to convert ndev->name from char array to a pointer (rcu protected),
but this looks quite invasive change, certainly not for stable branches.
Reviewed-by: Eric Dumazet <[email protected]>
From: Thomas Gleixner <[email protected]>
Date: Wed, 20 May 2020 01:42:30 +0200
> Stephen Hemminger <[email protected]> writes:
>> On Wed, 20 May 2020 00:23:48 +0200
>> Thomas Gleixner <[email protected]> wrote:
>>> No. We did not. -ENOTESTCASE
>>
>> Please try, it isn't that hard..
>>
>> # time for ((i=0;i<1000;i++)); do ip li add dev dummy$i type dummy; done
>>
>> real 0m17.002s
>> user 0m1.064s
>> sys 0m0.375s
>
> And that solves the incorrectness of the current code in which way?
You mentioned that there wasn't a test case, he gave you one to try.
On 5/19/20 7:57 PM, David Miller wrote:
> From: Thomas Gleixner <[email protected]>
> Date: Wed, 20 May 2020 01:42:30 +0200
>
>> Stephen Hemminger <[email protected]> writes:
>>> On Wed, 20 May 2020 00:23:48 +0200
>>> Thomas Gleixner <[email protected]> wrote:
>>>> No. We did not. -ENOTESTCASE
>>>
>>> Please try, it isn't that hard..
>>>
>>> # time for ((i=0;i<1000;i++)); do ip li add dev dummy$i type dummy; done
>>>
>>> real 0m17.002s
>>> user 0m1.064s
>>> sys 0m0.375s
>>
>> And that solves the incorrectness of the current code in which way?
>
> You mentioned that there wasn't a test case, he gave you one to try.
>
I do not think this would ever use device rename, nor netdev_get_name()
None of this stuff is fast path really.
# time for ((i=1;i<1000;i++)); do ip li add dev dummy$i type dummy; done
real 0m1.127s
user 0m0.270s
sys 0m1.039s
On Tue, 19 May 2020 20:18:19 -0700
Eric Dumazet <[email protected]> wrote:
> On 5/19/20 7:57 PM, David Miller wrote:
> > From: Thomas Gleixner <[email protected]>
> > Date: Wed, 20 May 2020 01:42:30 +0200
> >
> >> Stephen Hemminger <[email protected]> writes:
> >>> On Wed, 20 May 2020 00:23:48 +0200
> >>> Thomas Gleixner <[email protected]> wrote:
> >>>> No. We did not. -ENOTESTCASE
> >>>
> >>> Please try, it isn't that hard..
> >>>
> >>> # time for ((i=0;i<1000;i++)); do ip li add dev dummy$i type dummy; done
> >>>
> >>> real 0m17.002s
> >>> user 0m1.064s
> >>> sys 0m0.375s
> >>
> >> And that solves the incorrectness of the current code in which way?
> >
> > You mentioned that there wasn't a test case, he gave you one to try.
> >
>
> I do not think this would ever use device rename, nor netdev_get_name()
>
> None of this stuff is fast path really.
>
> # time for ((i=1;i<1000;i++)); do ip li add dev dummy$i type dummy; done
>
> real 0m1.127s
> user 0m0.270s
> sys 0m1.039s
Your right it is a weak test, and most of the overhead is in the syscall
and all netlink events that happen.
It does end up looking up the new name, so would exercise that.
Better test is to use %d syntax or create 1000 dummy's then rename every one.
This is more of a stress test
# for ((i=0;i<1000;i++)); do echo link add dev dummy%d type dummy; done | time ip -batch -
0.00user 0.29system 0:02.11elapsed 13%CPU (0avgtext+0avgdata 2544maxresident)k
0inputs+0outputs (0major+148minor)pagefaults 0swaps
# for ((i=999;i>=0;i--)); do echo link set dummy$i name dummy$((i+1)); done | time ip -batch -
0.00user 0.26system 0:54.98elapsed 0%CPU (0avgtext+0avgdata 2508maxresident)k
0inputs+0outputs (0major+145minor)pagefaults 0swaps
Hello Eric,
On Tue, May 19, 2020 at 07:01:38PM -0700, Eric Dumazet wrote:
>
> On 5/19/20 2:45 PM, Ahmed S. Darwish wrote:
> > Sequence counters write paths are critical sections that must never be
> > preempted, and blocking, even for CONFIG_PREEMPTION=n, is not allowed.
> >
> > Commit 5dbe7c178d3f ("net: fix kernel deadlock with interface rename and
> > netdev name retrieval.") handled a deadlock, observed with
> > CONFIG_PREEMPTION=n, where the devnet_rename seqcount read side was
> > infinitely spinning: it got scheduled after the seqcount write side
> > blocked inside its own critical section.
> >
> > To fix that deadlock, among other issues, the commit added a
> > cond_resched() inside the read side section. While this will get the
> > non-preemptible kernel eventually unstuck, the seqcount reader is fully
> > exhausting its slice just spinning -- until TIF_NEED_RESCHED is set.
> >
> > The fix is also still broken: if the seqcount reader belongs to a
> > real-time scheduling policy, it can spin forever and the kernel will
> > livelock.
> >
> > Disabling preemption over the seqcount write side critical section will
> > not work: inside it are a number of GFP_KERNEL allocations and mutex
> > locking through the drivers/base/ :: device_rename() call chain.
> >
> > From all the above, replace the seqcount with a rwsem.
> >
> > Fixes: 5dbe7c178d3f (net: fix kernel deadlock with interface rename and netdev name retrieval.)
> > Fixes: 30e6c9fa93cf (net: devnet_rename_seq should be a seqcount)
> > Fixes: c91f6df2db49 (sockopt: Change getsockopt() of SO_BINDTODEVICE to return an interface name)
> > Cc: <[email protected]>
> > Signed-off-by: Ahmed S. Darwish <[email protected]>
> > Reviewed-by: Sebastian Andrzej Siewior <[email protected]>
> > ---
> > net/core/dev.c | 30 ++++++++++++------------------
> > 1 file changed, 12 insertions(+), 18 deletions(-)
> >
>
> Seems fine to me, assuming rwsem prevent starvation of the writer.
>
Thanks for the review.
AFAIK, due to 5cfd92e12e13 ("locking/rwsem: Adaptive disabling of reader
optimistic spinning"), using a rwsem shouldn't lead to writer starvation
in the contended case.
--
Ahmed S. Darwish
Linutronix GmbH
On 5/19/20 11:42 PM, Ahmed S. Darwish wrote:
> Hello Eric,
>
> On Tue, May 19, 2020 at 07:01:38PM -0700, Eric Dumazet wrote:
>>
>> On 5/19/20 2:45 PM, Ahmed S. Darwish wrote:
>>> Sequence counters write paths are critical sections that must never be
>>> preempted, and blocking, even for CONFIG_PREEMPTION=n, is not allowed.
>>>
>>> Commit 5dbe7c178d3f ("net: fix kernel deadlock with interface rename and
>>> netdev name retrieval.") handled a deadlock, observed with
>>> CONFIG_PREEMPTION=n, where the devnet_rename seqcount read side was
>>> infinitely spinning: it got scheduled after the seqcount write side
>>> blocked inside its own critical section.
>>>
>>> To fix that deadlock, among other issues, the commit added a
>>> cond_resched() inside the read side section. While this will get the
>>> non-preemptible kernel eventually unstuck, the seqcount reader is fully
>>> exhausting its slice just spinning -- until TIF_NEED_RESCHED is set.
>>>
>>> The fix is also still broken: if the seqcount reader belongs to a
>>> real-time scheduling policy, it can spin forever and the kernel will
>>> livelock.
>>>
>>> Disabling preemption over the seqcount write side critical section will
>>> not work: inside it are a number of GFP_KERNEL allocations and mutex
>>> locking through the drivers/base/ :: device_rename() call chain.
>>>
>>> From all the above, replace the seqcount with a rwsem.
>>>
>>> Fixes: 5dbe7c178d3f (net: fix kernel deadlock with interface rename and netdev name retrieval.)
>>> Fixes: 30e6c9fa93cf (net: devnet_rename_seq should be a seqcount)
>>> Fixes: c91f6df2db49 (sockopt: Change getsockopt() of SO_BINDTODEVICE to return an interface name)
>>> Cc: <[email protected]>
>>> Signed-off-by: Ahmed S. Darwish <[email protected]>
>>> Reviewed-by: Sebastian Andrzej Siewior <[email protected]>
>>> ---
>>> net/core/dev.c | 30 ++++++++++++------------------
>>> 1 file changed, 12 insertions(+), 18 deletions(-)
>>>
>>
>> Seems fine to me, assuming rwsem prevent starvation of the writer.
>>
>
> Thanks for the review.
>
> AFAIK, due to 5cfd92e12e13 ("locking/rwsem: Adaptive disabling of reader
> optimistic spinning"), using a rwsem shouldn't lead to writer starvation
> in the contended case.
Hmm this was in linux-5.3, so very recent stuff.
Has this patch been backported to stable releases ?
With all the Fixes: tags you added, stable teams will backport this networking patch to
all stable versions.
Do we have a way to tune a dedicare rwsem to 'give preference to the (unique in this case) writer" over
a myriad of potential readers ?
Thanks.
Hi "Ahmed,
Thank you for the patch! Perhaps something to improve:
[auto build test WARNING on tip/locking/core]
[also build test WARNING on nf-next/master nf/master tip/timers/core linus/master v5.7-rc6 next-20200519]
[if your patch is applied to the wrong git tree, please drop us a note to help
improve the system. BTW, we also suggest to use '--base' option to specify the
base tree in git format-patch, please see https://stackoverflow.com/a/37406982]
url: https://github.com/0day-ci/linux/commits/Ahmed-S-Darwish/seqlock-Extend-seqcount-API-with-associated-locks/20200520-055145
base: https://git.kernel.org/pub/scm/linux/kernel/git/tip/tip.git 23b5ae2e8e1326c91b5dfdbb6ebcd5a6820074ae
config: x86_64-defconfig (attached as .config)
If you fix the issue, kindly add following tag as appropriate
Reported-by: kbuild test robot <[email protected]>
Reported-by: Dan Carpenter <[email protected]>
smatch warnings:
net/core/dev.c:953 netdev_get_name() warn: inconsistent returns 'devnet_rename_sem'.
# https://github.com/0day-ci/linux/commit/2354e271ada778bbb935d7b20113693710905cff
git remote add linux-review https://github.com/0day-ci/linux
git remote update linux-review
git checkout 2354e271ada778bbb935d7b20113693710905cff
vim +/devnet_rename_sem +953 net/core/dev.c
5dbe7c178d3f0a4 Nicolas Schichan 2013-06-26 935 int netdev_get_name(struct net *net, char *name, int ifindex)
5dbe7c178d3f0a4 Nicolas Schichan 2013-06-26 936 {
5dbe7c178d3f0a4 Nicolas Schichan 2013-06-26 937 struct net_device *dev;
5dbe7c178d3f0a4 Nicolas Schichan 2013-06-26 938
2354e271ada778b Ahmed S. Darwish 2020-05-19 939 down_read(&devnet_rename_sem);
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
2354e271ada778b Ahmed S. Darwish 2020-05-19 940
5dbe7c178d3f0a4 Nicolas Schichan 2013-06-26 941 rcu_read_lock();
5dbe7c178d3f0a4 Nicolas Schichan 2013-06-26 942 dev = dev_get_by_index_rcu(net, ifindex);
5dbe7c178d3f0a4 Nicolas Schichan 2013-06-26 943 if (!dev) {
5dbe7c178d3f0a4 Nicolas Schichan 2013-06-26 944 rcu_read_unlock();
5dbe7c178d3f0a4 Nicolas Schichan 2013-06-26 945 return -ENODEV;
^^^^^^^^^^^^^^
We need to drop the new semaphore on error.
5dbe7c178d3f0a4 Nicolas Schichan 2013-06-26 946 }
5dbe7c178d3f0a4 Nicolas Schichan 2013-06-26 947
5dbe7c178d3f0a4 Nicolas Schichan 2013-06-26 948 strcpy(name, dev->name);
5dbe7c178d3f0a4 Nicolas Schichan 2013-06-26 949 rcu_read_unlock();
2354e271ada778b Ahmed S. Darwish 2020-05-19 950
2354e271ada778b Ahmed S. Darwish 2020-05-19 951 up_read(&devnet_rename_sem);
5dbe7c178d3f0a4 Nicolas Schichan 2013-06-26 952
5dbe7c178d3f0a4 Nicolas Schichan 2013-06-26 @953 return 0;
5dbe7c178d3f0a4 Nicolas Schichan 2013-06-26 954 }
---
0-DAY CI Kernel Test Service, Intel Corporation
https://lists.01.org/hyperkitty/list/[email protected]
David Miller <[email protected]> writes:
> From: Thomas Gleixner <[email protected]>
> Date: Wed, 20 May 2020 01:42:30 +0200
>>> Please try, it isn't that hard..
>>>
>>> # time for ((i=0;i<1000;i++)); do ip li add dev dummy$i type dummy; done
>>>
>>> real 0m17.002s
>>> user 0m1.064s
>>> sys 0m0.375s
>>
>> And that solves the incorrectness of the current code in which way?
>
> You mentioned that there wasn't a test case, he gave you one to try.
If it makes you happy to compare incorrrect code with correct code, here
you go:
5 runs of 1000 device add, 1000 device rename and 1000 device del
CONFIG_PREEMPT_NONE=y
Base rwsem
add 0:05.01 0:05.28
0:05.93 0:06.11
0:06.52 0:06.26
0:06.06 0:05.74
0:05.71 0:06.07
rename 0:32.57 0:33.04
0:32.91 0:32.45
0:32.72 0:32.53
0:39.65 0:34.18
0:34.52 0:32.50
delete 3:48.65 3:48.91
3:49.66 3:49.13
3:45.29 3:48.26
3:47.56 3:46.60
3:50.01 3:48.06
-------------------------
CONFIG_PREEMPT_VOLUNTARY=y
Base rwsem
add 0:06.80 0:06.42
0:04.77 0:05.03
0:05.74 0:04.62
0:05.87 0:04.34
0:04.20 0:04.12
rename 0:33.33 0:42.02
0:42.36 0:32.55
0:39.58 0:31.60
0:33.69 0:35.08
0:34.24 0:33.97
delete 3:47.82 3:44.00
3:47.42 3:51.00
3:48.52 3:48.88
3:48.50 3:48.09
3:50.03 3:46.56
-------------------------
CONFIG_PREEMPT=y
Base rwsem
add 0:07.89 0:07.72
0:07.25 0:06.72
0:07.42 0:06.51
0:06.92 0:06.38
0:06.20 0:06.72
rename 0:41.77 0:32.39
0:44.29 0:33.29
0:36.19 0:34.86
0:33.19 0:35.06
0:37.00 0:34.78
delete 2:36.96 2:39.97
2:37.80 2:42.19
2:44.66 2:48.40
2:39.75 2:41.02
2:40.77 2:38.36
The runtime variation is rather large and when running the same in a VM
I got complete random numbers for both base and rwsem. The most amazing
was delete where the time varies from 30s to 6m20s.
Btw, Sebastian noticed that rename spams dmesg:
netdev_info(dev, "renamed from %s\n", oldname);
which eats about 50% of the Rename run time.
Base netdev_info() removed
Rename 0:34.84 0:17.48
That number at least makes tons of sense
Thanks,
tglx
On Wed, 20 May 2020 21:37:11 +0200
Thomas Gleixner <[email protected]> wrote:
> David Miller <[email protected]> writes:
> > From: Thomas Gleixner <[email protected]>
> > Date: Wed, 20 May 2020 01:42:30 +0200
> >>> Please try, it isn't that hard..
> >>>
> >>> # time for ((i=0;i<1000;i++)); do ip li add dev dummy$i type dummy; done
> >>>
> >>> real 0m17.002s
> >>> user 0m1.064s
> >>> sys 0m0.375s
> >>
> >> And that solves the incorrectness of the current code in which way?
> >
> > You mentioned that there wasn't a test case, he gave you one to try.
>
> If it makes you happy to compare incorrrect code with correct code, here
> you go:
>
> 5 runs of 1000 device add, 1000 device rename and 1000 device del
>
> CONFIG_PREEMPT_NONE=y
>
> Base rwsem
> add 0:05.01 0:05.28
> 0:05.93 0:06.11
> 0:06.52 0:06.26
> 0:06.06 0:05.74
> 0:05.71 0:06.07
>
> rename 0:32.57 0:33.04
> 0:32.91 0:32.45
> 0:32.72 0:32.53
> 0:39.65 0:34.18
> 0:34.52 0:32.50
>
> delete 3:48.65 3:48.91
> 3:49.66 3:49.13
> 3:45.29 3:48.26
> 3:47.56 3:46.60
> 3:50.01 3:48.06
>
> -------------------------
>
> CONFIG_PREEMPT_VOLUNTARY=y
>
> Base rwsem
> add 0:06.80 0:06.42
> 0:04.77 0:05.03
> 0:05.74 0:04.62
> 0:05.87 0:04.34
> 0:04.20 0:04.12
>
> rename 0:33.33 0:42.02
> 0:42.36 0:32.55
> 0:39.58 0:31.60
> 0:33.69 0:35.08
> 0:34.24 0:33.97
>
> delete 3:47.82 3:44.00
> 3:47.42 3:51.00
> 3:48.52 3:48.88
> 3:48.50 3:48.09
> 3:50.03 3:46.56
>
> -------------------------
>
> CONFIG_PREEMPT=y
>
> Base rwsem
>
> add 0:07.89 0:07.72
> 0:07.25 0:06.72
> 0:07.42 0:06.51
> 0:06.92 0:06.38
> 0:06.20 0:06.72
>
> rename 0:41.77 0:32.39
> 0:44.29 0:33.29
> 0:36.19 0:34.86
> 0:33.19 0:35.06
> 0:37.00 0:34.78
>
> delete 2:36.96 2:39.97
> 2:37.80 2:42.19
> 2:44.66 2:48.40
> 2:39.75 2:41.02
> 2:40.77 2:38.36
>
> The runtime variation is rather large and when running the same in a VM
> I got complete random numbers for both base and rwsem. The most amazing
> was delete where the time varies from 30s to 6m20s.
>
> Btw, Sebastian noticed that rename spams dmesg:
>
> netdev_info(dev, "renamed from %s\n", oldname);
>
> which eats about 50% of the Rename run time.
>
> Base netdev_info() removed
>
> Rename 0:34.84 0:17.48
>
> That number at least makes tons of sense
>
> Thanks,
>
> tglx
Looks good thanks for following through.
On Wed, May 20, 2020 at 05:37:07PM +0300, Dan Carpenter wrote:
...
>
> smatch warnings:
> net/core/dev.c:953 netdev_get_name() warn: inconsistent returns 'devnet_rename_sem'.
>
...
>
> 5dbe7c178d3f0a4 Nicolas Schichan 2013-06-26 935 int netdev_get_name(struct net *net, char *name, int ifindex)
> 5dbe7c178d3f0a4 Nicolas Schichan 2013-06-26 936 {
> 5dbe7c178d3f0a4 Nicolas Schichan 2013-06-26 937 struct net_device *dev;
> 5dbe7c178d3f0a4 Nicolas Schichan 2013-06-26 938
> 2354e271ada778b Ahmed S. Darwish 2020-05-19 939 down_read(&devnet_rename_sem);
> ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
>
> 2354e271ada778b Ahmed S. Darwish 2020-05-19 940
> 5dbe7c178d3f0a4 Nicolas Schichan 2013-06-26 941 rcu_read_lock();
> 5dbe7c178d3f0a4 Nicolas Schichan 2013-06-26 942 dev = dev_get_by_index_rcu(net, ifindex);
> 5dbe7c178d3f0a4 Nicolas Schichan 2013-06-26 943 if (!dev) {
> 5dbe7c178d3f0a4 Nicolas Schichan 2013-06-26 944 rcu_read_unlock();
> 5dbe7c178d3f0a4 Nicolas Schichan 2013-06-26 945 return -ENODEV;
> ^^^^^^^^^^^^^^
Oh, shouldn't have missed that. Will fix in v2.
Thanks,
On Wed, May 20, 2020 at 05:51:27AM -0700, Eric Dumazet wrote:
>
> On 5/19/20 11:42 PM, Ahmed S. Darwish wrote:
> > Hello Eric,
> >
> > On Tue, May 19, 2020 at 07:01:38PM -0700, Eric Dumazet wrote:
> >>
> >> On 5/19/20 2:45 PM, Ahmed S. Darwish wrote:
> >>> Sequence counters write paths are critical sections that must never be
> >>> preempted, and blocking, even for CONFIG_PREEMPTION=n, is not allowed.
> >>>
> >>> Commit 5dbe7c178d3f ("net: fix kernel deadlock with interface rename and
> >>> netdev name retrieval.") handled a deadlock, observed with
> >>> CONFIG_PREEMPTION=n, where the devnet_rename seqcount read side was
> >>> infinitely spinning: it got scheduled after the seqcount write side
> >>> blocked inside its own critical section.
> >>>
> >>> To fix that deadlock, among other issues, the commit added a
> >>> cond_resched() inside the read side section. While this will get the
> >>> non-preemptible kernel eventually unstuck, the seqcount reader is fully
> >>> exhausting its slice just spinning -- until TIF_NEED_RESCHED is set.
> >>>
> >>> The fix is also still broken: if the seqcount reader belongs to a
> >>> real-time scheduling policy, it can spin forever and the kernel will
> >>> livelock.
> >>>
> >>> Disabling preemption over the seqcount write side critical section will
> >>> not work: inside it are a number of GFP_KERNEL allocations and mutex
> >>> locking through the drivers/base/ :: device_rename() call chain.
> >>>
> >>> From all the above, replace the seqcount with a rwsem.
> >>>
> >>> Fixes: 5dbe7c178d3f (net: fix kernel deadlock with interface rename and netdev name retrieval.)
> >>> Fixes: 30e6c9fa93cf (net: devnet_rename_seq should be a seqcount)
> >>> Fixes: c91f6df2db49 (sockopt: Change getsockopt() of SO_BINDTODEVICE to return an interface name)
> >>> Cc: <[email protected]>
> >>> Signed-off-by: Ahmed S. Darwish <[email protected]>
> >>> Reviewed-by: Sebastian Andrzej Siewior <[email protected]>
> >>> ---
> >>> net/core/dev.c | 30 ++++++++++++------------------
> >>> 1 file changed, 12 insertions(+), 18 deletions(-)
> >>>
> >>
> >> Seems fine to me, assuming rwsem prevent starvation of the writer.
> >>
> >
> > Thanks for the review.
> >
> > AFAIK, due to 5cfd92e12e13 ("locking/rwsem: Adaptive disabling of reader
> > optimistic spinning"), using a rwsem shouldn't lead to writer starvation
> > in the contended case.
>
> Hmm this was in linux-5.3, so very recent stuff.
>
> Has this patch been backported to stable releases ?
>
> With all the Fixes: tags you added, stable teams will backport this
> networking patch to all stable versions.
>
> Do we have a way to tune a dedicare rwsem to 'give preference to the
> (unique in this case) writer" over a myriad of potential readers ?
>
I was wrong in referencing the commit 5cfd92e12e13 above.
Before and after that commit, once a rwsem writer is blocking, all
subsequent readers will block until that writer makes progress.
Given that behavior, and that the read section is already quite short, I
don't think there's any danger incurred on writers here.
(a v2 will be sent shortly, fixing the error found Dan/kbuild-bot.)
Thanks,
--
Ahmed S. Darwish
Linutronix GmbH