2015-11-02 19:48:43

by Chris Worley

[permalink] [raw]
Subject: Re: timer code oops when calling mod_delayed_work

On Sat, Oct 31, 2015 at 11:54 AM, Jeff Layton <[email protected]> wrote:
> On Sat, 31 Oct 2015 17:31:07 -0400
> Tejun Heo <[email protected]> wrote:
...
>>
>> > I have asked Chris and Michael to see if they can bisect it down, but
>> > it may be a bit before they can get that done. Any insight you might
>> > have in the meantime would helpful.
>>
>> Yeah, I'd love to find out how reproducible the issue is. If the
>> problem is rarely reproducible, it might make sense to try
>> instrumentation before trying bisection as it *could* be a latent bug
>> which has been there all along and bisecting to the commit introducing
>> the code wouldn't help us too much.
>>
>
> It seems fairly reproducible, at least on v4.3-rc7 kernels:
>
> This came about when I asked them to perf test some nfsd patches that I
> have queued up. I patched a Fedora 4.3-rc7 kernel and wanted to see
> what the perf delta was (with NFSv3, fwiw):
>
> Patched kernels here: http://koji.fedoraproject.org/koji/taskinfo?taskID=11598089
>
> Unpatched kernels here: http://koji.fedoraproject.org/koji/buildinfo?buildID=694377
>
> Michael was using the SPEC SFS VDI workload to test, and was able to
> get the same panic on both kernels. So it does seem to be reproducible.
> It might even be possible to tune the VM to make the shrinker fire more
> often, which may help tickle this more.
>
> In any case, I've asked them to try something v4.2-ish and see if it's
> reproducible there, and then try v4.1 if it is. I figure anything
> earlier is probably not worth testing if it still fails on v4.1. If it
> turns out not to be reproducible on those earlier kernels then we can
> bisect from there to track it down.

The trick seems to be the NFS thread count: I initially though this
was SFS/VDI specific, but when I ratcheted up the thread count to what
Michael was using, 256 threads oopses on fio (throughput) benchmarks
too.

In bisecting kernels, it appeared between 4.2.3-200 and 4.2.5-200 (all
the 4.2.4 kernels were bad).

Jeff has a lead on this...

Chris


2015-11-02 19:56:38

by Jeff Layton

[permalink] [raw]
Subject: Re: timer code oops when calling mod_delayed_work

On Mon, 2 Nov 2015 09:48:41 -1000
Chris Worley <[email protected]> wrote:

> On Sat, Oct 31, 2015 at 11:54 AM, Jeff Layton <[email protected]> wrote:
> > On Sat, 31 Oct 2015 17:31:07 -0400
> > Tejun Heo <[email protected]> wrote:
> ...
> >>
> >> > I have asked Chris and Michael to see if they can bisect it down, but
> >> > it may be a bit before they can get that done. Any insight you might
> >> > have in the meantime would helpful.
> >>
> >> Yeah, I'd love to find out how reproducible the issue is. If the
> >> problem is rarely reproducible, it might make sense to try
> >> instrumentation before trying bisection as it *could* be a latent bug
> >> which has been there all along and bisecting to the commit introducing
> >> the code wouldn't help us too much.
> >>
> >
> > It seems fairly reproducible, at least on v4.3-rc7 kernels:
> >
> > This came about when I asked them to perf test some nfsd patches that I
> > have queued up. I patched a Fedora 4.3-rc7 kernel and wanted to see
> > what the perf delta was (with NFSv3, fwiw):
> >
> > Patched kernels here: http://koji.fedoraproject.org/koji/taskinfo?taskID=11598089
> >
> > Unpatched kernels here: http://koji.fedoraproject.org/koji/buildinfo?buildID=694377
> >
> > Michael was using the SPEC SFS VDI workload to test, and was able to
> > get the same panic on both kernels. So it does seem to be reproducible.
> > It might even be possible to tune the VM to make the shrinker fire more
> > often, which may help tickle this more.
> >
> > In any case, I've asked them to try something v4.2-ish and see if it's
> > reproducible there, and then try v4.1 if it is. I figure anything
> > earlier is probably not worth testing if it still fails on v4.1. If it
> > turns out not to be reproducible on those earlier kernels then we can
> > bisect from there to track it down.
>
> The trick seems to be the NFS thread count: I initially though this
> was SFS/VDI specific, but when I ratcheted up the thread count to what
> Michael was using, 256 threads oopses on fio (throughput) benchmarks
> too.
>
> In bisecting kernels, it appeared between 4.2.3-200 and 4.2.5-200 (all
> the 4.2.4 kernels were bad).
>
> Jeff has a lead on this...
>
> Chris

Thanks Chris. This commit went in between those kernels:

commit a7c571f2e3ff9243ba41c242486f53dbca37d864
Author: Shaohua Li <[email protected]>
Date: Wed Sep 30 09:05:30 2015 -0700

workqueue: make sure delayed work run in local cpu

commit 874bbfe600a660cba9c776b3957b1ce393151b76 upstream.


...and I suspect that it's the culprit. That commit causes this code to
always use add_timer_on, which seems to have different semantics from
add_timer.

I'm going to build a v4.2.5 kernel with that patch reverted to confirm
it, but it seems likely...

--
Jeff Layton <[email protected]>

2015-11-03 01:33:46

by Jeff Layton

[permalink] [raw]
Subject: Re: timer code oops when calling mod_delayed_work

On Mon, 2 Nov 2015 14:56:33 -0500
Jeff Layton <[email protected]> wrote:

> On Mon, 2 Nov 2015 09:48:41 -1000
> Chris Worley <[email protected]> wrote:
>
> > On Sat, Oct 31, 2015 at 11:54 AM, Jeff Layton <[email protected]> wrote:
> > > On Sat, 31 Oct 2015 17:31:07 -0400
> > > Tejun Heo <[email protected]> wrote:
> > ...
> > >>
> > >> > I have asked Chris and Michael to see if they can bisect it down, but
> > >> > it may be a bit before they can get that done. Any insight you might
> > >> > have in the meantime would helpful.
> > >>
> > >> Yeah, I'd love to find out how reproducible the issue is. If the
> > >> problem is rarely reproducible, it might make sense to try
> > >> instrumentation before trying bisection as it *could* be a latent bug
> > >> which has been there all along and bisecting to the commit introducing
> > >> the code wouldn't help us too much.
> > >>
> > >
> > > It seems fairly reproducible, at least on v4.3-rc7 kernels:
> > >
> > > This came about when I asked them to perf test some nfsd patches that I
> > > have queued up. I patched a Fedora 4.3-rc7 kernel and wanted to see
> > > what the perf delta was (with NFSv3, fwiw):
> > >
> > > Patched kernels here: http://koji.fedoraproject.org/koji/taskinfo?taskID=11598089
> > >
> > > Unpatched kernels here: http://koji.fedoraproject.org/koji/buildinfo?buildID=694377
> > >
> > > Michael was using the SPEC SFS VDI workload to test, and was able to
> > > get the same panic on both kernels. So it does seem to be reproducible.
> > > It might even be possible to tune the VM to make the shrinker fire more
> > > often, which may help tickle this more.
> > >
> > > In any case, I've asked them to try something v4.2-ish and see if it's
> > > reproducible there, and then try v4.1 if it is. I figure anything
> > > earlier is probably not worth testing if it still fails on v4.1. If it
> > > turns out not to be reproducible on those earlier kernels then we can
> > > bisect from there to track it down.
> >
> > The trick seems to be the NFS thread count: I initially though this
> > was SFS/VDI specific, but when I ratcheted up the thread count to what
> > Michael was using, 256 threads oopses on fio (throughput) benchmarks
> > too.
> >
> > In bisecting kernels, it appeared between 4.2.3-200 and 4.2.5-200 (all
> > the 4.2.4 kernels were bad).
> >
> > Jeff has a lead on this...
> >
> > Chris
>
> Thanks Chris. This commit went in between those kernels:
>
> commit a7c571f2e3ff9243ba41c242486f53dbca37d864
> Author: Shaohua Li <[email protected]>
> Date: Wed Sep 30 09:05:30 2015 -0700
>
> workqueue: make sure delayed work run in local cpu
>
> commit 874bbfe600a660cba9c776b3957b1ce393151b76 upstream.
>
>
> ...and I suspect that it's the culprit. That commit causes this code to
> always use add_timer_on, which seems to have different semantics from
> add_timer.
>
> I'm going to build a v4.2.5 kernel with that patch reverted to confirm
> it, but it seems likely...
>

(dropping Lai's address, and adding Shaohua Li's)

Ok, I built a kernel with that patch reverted and that seems to fix the
problem.

Looking at the patch, I guess the main difference is that we're no
longer using add_timer for unbound workqueue tasks. That said, we
should have possession of the PENDING bit before calling add_timer_on.

So...I'm a little stumped. Any thoughts as to where the race could be?
--
Jeff Layton <[email protected]>

2015-11-03 17:55:13

by Jeff Layton

[permalink] [raw]
Subject: Re: timer code oops when calling mod_delayed_work

On Mon, 2 Nov 2015 20:33:39 -0500
Jeff Layton <[email protected]> wrote:

> On Mon, 2 Nov 2015 14:56:33 -0500
> Jeff Layton <[email protected]> wrote:
>
> > On Mon, 2 Nov 2015 09:48:41 -1000
> > Chris Worley <[email protected]> wrote:
> >
> > > On Sat, Oct 31, 2015 at 11:54 AM, Jeff Layton <[email protected]> wrote:
> > > > On Sat, 31 Oct 2015 17:31:07 -0400
> > > > Tejun Heo <[email protected]> wrote:
> > > ...
> > > >>
> > > >> > I have asked Chris and Michael to see if they can bisect it down, but
> > > >> > it may be a bit before they can get that done. Any insight you might
> > > >> > have in the meantime would helpful.
> > > >>
> > > >> Yeah, I'd love to find out how reproducible the issue is. If the
> > > >> problem is rarely reproducible, it might make sense to try
> > > >> instrumentation before trying bisection as it *could* be a latent bug
> > > >> which has been there all along and bisecting to the commit introducing
> > > >> the code wouldn't help us too much.
> > > >>
> > > >
> > > > It seems fairly reproducible, at least on v4.3-rc7 kernels:
> > > >
> > > > This came about when I asked them to perf test some nfsd patches that I
> > > > have queued up. I patched a Fedora 4.3-rc7 kernel and wanted to see
> > > > what the perf delta was (with NFSv3, fwiw):
> > > >
> > > > Patched kernels here: http://koji.fedoraproject.org/koji/taskinfo?taskID=11598089
> > > >
> > > > Unpatched kernels here: http://koji.fedoraproject.org/koji/buildinfo?buildID=694377
> > > >
> > > > Michael was using the SPEC SFS VDI workload to test, and was able to
> > > > get the same panic on both kernels. So it does seem to be reproducible.
> > > > It might even be possible to tune the VM to make the shrinker fire more
> > > > often, which may help tickle this more.
> > > >
> > > > In any case, I've asked them to try something v4.2-ish and see if it's
> > > > reproducible there, and then try v4.1 if it is. I figure anything
> > > > earlier is probably not worth testing if it still fails on v4.1. If it
> > > > turns out not to be reproducible on those earlier kernels then we can
> > > > bisect from there to track it down.
> > >
> > > The trick seems to be the NFS thread count: I initially though this
> > > was SFS/VDI specific, but when I ratcheted up the thread count to what
> > > Michael was using, 256 threads oopses on fio (throughput) benchmarks
> > > too.
> > >
> > > In bisecting kernels, it appeared between 4.2.3-200 and 4.2.5-200 (all
> > > the 4.2.4 kernels were bad).
> > >
> > > Jeff has a lead on this...
> > >
> > > Chris
> >
> > Thanks Chris. This commit went in between those kernels:
> >
> > commit a7c571f2e3ff9243ba41c242486f53dbca37d864
> > Author: Shaohua Li <[email protected]>
> > Date: Wed Sep 30 09:05:30 2015 -0700
> >
> > workqueue: make sure delayed work run in local cpu
> >
> > commit 874bbfe600a660cba9c776b3957b1ce393151b76 upstream.
> >
> >
> > ...and I suspect that it's the culprit. That commit causes this code to
> > always use add_timer_on, which seems to have different semantics from
> > add_timer.
> >
> > I'm going to build a v4.2.5 kernel with that patch reverted to confirm
> > it, but it seems likely...
> >
>
> (dropping Lai's address, and adding Shaohua Li's)
>
> Ok, I built a kernel with that patch reverted and that seems to fix the
> problem.
>
> Looking at the patch, I guess the main difference is that we're no
> longer using add_timer for unbound workqueue tasks. That said, we
> should have possession of the PENDING bit before calling add_timer_on.
>
> So...I'm a little stumped. Any thoughts as to where the race could be?

Ok, I've been able to reproduce this with a small module that I wrote
as a reproducer:

[ 197.502139] BUG: unable to handle kernel NULL pointer dereference at (null)
[ 197.503073] IP: [<ffffffff8112d67c>] detach_if_pending+0x6c/0x190
[ 197.503752] PGD 0
[ 197.504061] Oops: 0002 [#1] SMP
[ 197.504449] Modules linked in: wqthrash(OE-) snd_hda_codec_generic snd_hda_intel snd_hda_codec snd_hda_core xfs snd_hwdep snd_seq libcrc32c snd_seq_device ppdev snd_pcm snd_timer snd soundcore parport_pc e1000 i2c_piix4 parport joydev virtio_balloon pvpanic acpi_cpufreq nfsd nfs_acl lockd auth_rpcgss grace sunrpc virtio_blk virtio_console qxl drm_kms_helper ttm virtio_pci virtio_ring serio_raw virtio drm ata_generic pata_acpi
[ 197.509358] CPU: 1 PID: 794 Comm: kworker/u8:9 Tainted: G OE 4.3.0-0.rc7.git2.1.fc24.x86_64 #1
[ 197.510767] Hardware name: Bochs Bochs, BIOS Bochs 01/01/2011
[ 197.511850] Workqueue: wqthrash wqthrash_workfunc [wqthrash]
[ 197.513141] task: ffff88020f918000 ti: ffff8802143a8000 task.ti: ffff8802143a8000
[ 197.513958] RIP: 0010:[<ffffffff8112d67c>] [<ffffffff8112d67c>] detach_if_pending+0x6c/0x190
[ 197.514927] RSP: 0018:ffff8802143abc68 EFLAGS: 00010086
[ 197.515539] RAX: dead000000000200 RBX: ffffffffa022e050 RCX: 000000000000000f
[ 197.516552] RDX: 0000000000000000 RSI: 0000000000000000 RDI: 0000000000000046
[ 197.517374] RBP: ffff8802143abc88 R08: 0000000000000001 R09: 0000000000000001
[ 197.518194] R10: 0000000000000000 R11: 0000000000000001 R12: 0000000000000000
[ 197.518992] R13: ffff880216fce2c0 R14: 0000000000000001 R15: 00000000000003e8
[ 197.519817] FS: 0000000000000000(0000) GS:ffff880217000000(0000) knlGS:0000000000000000
[ 197.520738] CS: 0010 DS: 0000 ES: 0000 CR0: 000000008005003b
[ 197.521410] CR2: 0000000000000000 CR3: 00000000daa9b000 CR4: 00000000000006e0
[ 197.522234] Stack:
[ 197.522584] ffffffffa022e050 0000000000000000 ffff880216fce2c0 ffff88021688f800
[ 197.524102] ffff8802143abcc0 ffffffff8112db72 0000000000000086 00000000ba809f75
[ 197.525594] ffffffffa022e000 ffff8802143abd00 0000000000000001 ffff8802143abcf0
[ 197.527082] Call Trace:
[ 197.527581] [<ffffffff8112db72>] del_timer+0x62/0xa0
[ 197.532565] [<ffffffff810c913e>] try_to_grab_pending+0xce/0x190
[ 197.533253] [<ffffffff810c9242>] mod_delayed_work_on+0x42/0xb0
[ 197.533965] [<ffffffffa022c081>] wqthrash_workfunc+0x61/0x90 [wqthrash]
[ 197.534766] [<ffffffff810c8470>] process_one_work+0x230/0x6a0
[ 197.535683] [<ffffffff810c83d9>] ? process_one_work+0x199/0x6a0
[ 197.536378] [<ffffffff810c892e>] worker_thread+0x4e/0x450
[ 197.537004] [<ffffffff810c88e0>] ? process_one_work+0x6a0/0x6a0
[ 197.537724] [<ffffffff810cf981>] kthread+0x101/0x120
[ 197.538662] [<ffffffff81104fa9>] ? trace_hardirqs_on_caller+0x129/0x1b0
[ 197.539926] [<ffffffff810cf880>] ? kthread_create_on_node+0x250/0x250
[ 197.541160] [<ffffffff81864fef>] ret_from_fork+0x3f/0x70
[ 197.542187] [<ffffffff810cf880>] ? kthread_create_on_node+0x250/0x250
[ 197.543416] Code: 3b 89 e4 00 e8 06 64 ff ff 85 c0 74 0d 80 3d 45 61 e4 00 00 0f 84 c4 00 00 00 65 ff 0d de e7 ed 7e 48 8b 03 48 8b 53 08 48 85 c0 <48> 89 02 74 04 48 89 50 08 45 84 f6 74 08 48 c7 43 08 00 00 00
[ 197.547527] RIP [<ffffffff8112d67c>] detach_if_pending+0x6c/0x190
[ 197.548481] RSP <ffff8802143abc68>
[ 197.549220] CR2: 0000000000000000
[ 197.549883] ---[ end trace 10dec6677e9e3a68 ]---

Apply this patch, build a wqthrash.ko module, plug it in and watch it
explode. It basically just spawns a bunch of (immediate) workqueue jobs
that attempt to schedule and modify a single piece of delayed_work.

I still don't have a lead on the actual bug, but this does seem to
tickle it fairly easily.

-----------------------8<------------------------------

[PATCH] wqthrash: add module that will thrash the workqueue code

...let's see if we can tickle the races in the delayed_work
infrastructure.

Signed-off-by: Jeff Layton <[email protected]>
---
samples/Makefile | 3 +-
samples/wqthrash/Makefile | 1 +
samples/wqthrash/wqthrash.c | 94 +++++++++++++++++++++++++++++++++++++++++++++
3 files changed, 97 insertions(+), 1 deletion(-)
create mode 100644 samples/wqthrash/Makefile
create mode 100644 samples/wqthrash/wqthrash.c

diff --git a/samples/Makefile b/samples/Makefile
index f00257bcc5a7..7e92dc91879d 100644
--- a/samples/Makefile
+++ b/samples/Makefile
@@ -1,4 +1,5 @@
# Makefile for Linux samples code

obj-$(CONFIG_SAMPLES) += kobject/ kprobes/ trace_events/ livepatch/ \
- hw_breakpoint/ kfifo/ kdb/ hidraw/ rpmsg/ seccomp/
+ hw_breakpoint/ kfifo/ kdb/ hidraw/ rpmsg/ \
+ seccomp/ wqthrash/
diff --git a/samples/wqthrash/Makefile b/samples/wqthrash/Makefile
new file mode 100644
index 000000000000..da4c36a2c7d5
--- /dev/null
+++ b/samples/wqthrash/Makefile
@@ -0,0 +1 @@
+obj-m += wqthrash.o
diff --git a/samples/wqthrash/wqthrash.c b/samples/wqthrash/wqthrash.c
new file mode 100644
index 000000000000..b1c52c20fb8f
--- /dev/null
+++ b/samples/wqthrash/wqthrash.c
@@ -0,0 +1,94 @@
+/*
+ * wqthrash.c -- thrash the delayed workqueue rescheduling code
+ *
+ * This program is free software; you can redistribute it and/or
+ * modify it under the terms of the GNU General Public License
+ * as published by the Free Software Foundation; either version 2
+ * of the License, or (at your option) any later version.
+ *
+ * This program is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
+ * GNU General Public License for more details.
+ *
+ * You should have received a copy of the GNU General Public License
+ * along with this program; if not, write to the Free Software
+ * Foundation, Inc., 51 Franklin Street, Fifth Floor,
+ * Boston, MA 02110-1301, USA.
+ */
+
+#include <linux/init.h>
+#include <linux/module.h>
+#include <linux/kernel.h>
+#include <linux/slab.h>
+#include <linux/gfp.h>
+#include <linux/workqueue.h>
+#include <linux/delay.h>
+#include <linux/sched.h>
+
+#define ITERATIONS (102400)
+#define DELAY (HZ)
+#define THRASHERS (256)
+
+static void
+dummy_work(struct work_struct *unused)
+{
+ schedule_timeout_uninterruptible(1);
+}
+
+static DECLARE_DELAYED_WORK(wqthrash_delayed_work, dummy_work);
+
+static void
+wqthrash_workfunc(struct work_struct *unused)
+{
+ int i;
+
+ for (i = 0; i < ITERATIONS; ++i) {
+ schedule_delayed_work(&wqthrash_delayed_work, DELAY);
+ cond_resched();
+ mod_delayed_work(system_wq, &wqthrash_delayed_work, DELAY);
+ cond_resched();
+ }
+}
+
+static struct workqueue_struct *wq;
+static struct work_struct *tw;
+
+static int
+wqthrash_init(void)
+{
+ int i;
+
+ wq = alloc_workqueue("wqthrash", WQ_UNBOUND, 0);
+ if (!wq)
+ return -ENOMEM;
+
+ tw = kcalloc(THRASHERS, sizeof(*tw), GFP_KERNEL);
+ if (!tw) {
+ destroy_workqueue(wq);
+ return -ENOMEM;
+ }
+
+ for (i = 0; i < THRASHERS; ++i) {
+ INIT_WORK(&tw[i], wqthrash_workfunc);
+ queue_work(wq, &tw[i]);
+ }
+ return 0;
+}
+
+static void
+wqthrash_exit(void)
+{
+ int i;
+
+ for (i = 0; i < THRASHERS; ++i)
+ flush_work(&tw[i]);
+
+ kfree(tw);
+ destroy_workqueue(wq);
+ cancel_delayed_work_sync(&wqthrash_delayed_work);
+}
+
+module_init(wqthrash_init);
+module_exit(wqthrash_exit);
+MODULE_LICENSE("GPL");
--
2.4.3

2015-11-03 22:54:10

by Tejun Heo

[permalink] [raw]
Subject: Re: timer code oops when calling mod_delayed_work

Hello, Jeff.

On Tue, Nov 03, 2015 at 12:55:04PM -0500, Jeff Layton wrote:
> > Ok, I built a kernel with that patch reverted and that seems to fix the
> > problem.
> >
> > Looking at the patch, I guess the main difference is that we're no
> > longer using add_timer for unbound workqueue tasks. That said, we
> > should have possession of the PENDING bit before calling add_timer_on.
> >
> > So...I'm a little stumped. Any thoughts as to where the race could be?
>
> Ok, I've been able to reproduce this with a small module that I wrote
> as a reproducer:

I think I know what's going on. Looks like a bug in timer code when
timers are queued on a different cpu than before. Will experiment
with it a bit and report back.

Thanks.

--
tejun

2015-11-04 00:07:04

by Tejun Heo

[permalink] [raw]
Subject: Re: timer code oops when calling mod_delayed_work

Hello, Jeff.

Can you please verify whether the following patch fixes the issue?

Thanks.

diff --git a/kernel/time/timer.c b/kernel/time/timer.c
index 84190f0..566a282 100644
--- a/kernel/time/timer.c
+++ b/kernel/time/timer.c
@@ -970,12 +970,21 @@ EXPORT_SYMBOL(add_timer);
*/
void add_timer_on(struct timer_list *timer, int cpu)
{
- struct tvec_base *base = per_cpu_ptr(&tvec_bases, cpu);
+ struct tvec_base *new_base = per_cpu_ptr(&tvec_bases, cpu);
+ struct tvec_base *base;
unsigned long flags;

timer_stats_timer_set_start_info(timer);
BUG_ON(timer_pending(timer) || !timer->function);
- spin_lock_irqsave(&base->lock, flags);
+
+ base = lock_timer_base(timer, &flags);
+ if (base != new_base) {
+ timer->flags |= TIMER_MIGRATING;
+ spin_unlock(&base->lock);
+ base = new_base;
+ spin_lock(&base->lock);
+ }
+
timer->flags = (timer->flags & ~TIMER_BASEMASK) | cpu;
debug_activate(timer, timer->expires);
internal_add_timer(base, timer);

2015-11-04 11:48:42

by Jeff Layton

[permalink] [raw]
Subject: Re: timer code oops when calling mod_delayed_work

On Tue, 3 Nov 2015 19:06:58 -0500
Tejun Heo <[email protected]> wrote:

> Hello, Jeff.
>
> Can you please verify whether the following patch fixes the issue?
>
> Thanks.
>
> diff --git a/kernel/time/timer.c b/kernel/time/timer.c
> index 84190f0..566a282 100644
> --- a/kernel/time/timer.c
> +++ b/kernel/time/timer.c
> @@ -970,12 +970,21 @@ EXPORT_SYMBOL(add_timer);
> */
> void add_timer_on(struct timer_list *timer, int cpu)
> {
> - struct tvec_base *base = per_cpu_ptr(&tvec_bases, cpu);
> + struct tvec_base *new_base = per_cpu_ptr(&tvec_bases, cpu);
> + struct tvec_base *base;
> unsigned long flags;
>
> timer_stats_timer_set_start_info(timer);
> BUG_ON(timer_pending(timer) || !timer->function);
> - spin_lock_irqsave(&base->lock, flags);
> +
> + base = lock_timer_base(timer, &flags);
> + if (base != new_base) {
> + timer->flags |= TIMER_MIGRATING;
> + spin_unlock(&base->lock);
> + base = new_base;
> + spin_lock(&base->lock);
> + }
> +
> timer->flags = (timer->flags & ~TIMER_BASEMASK) | cpu;
> debug_activate(timer, timer->expires);
> internal_add_timer(base, timer);

Yes, thanks! That fixes my reproducer and Chris also said he was unable
to reproduce the bug after applying that patch. You can add this if you
like:

Tested-by: Jeff Layton <[email protected]>

2015-11-04 17:15:39

by Tejun Heo

[permalink] [raw]
Subject: [PATCH] timer: add_timer_on() should perform proper migration

Regardless of the previous CPU a timer was on, add_timer_on()
currently simply sets timer->flags to the new CPU. As the caller must
be seeing the timer as idle, this is locally fine, but the timer
leaving the old base while unlocked can lead to race conditions as
follows.

Let's say timer was on cpu 0.

cpu 0 cpu 1
-----------------------------------------------------------------------------
del_timer(timer) succeeds
del_timer(timer)
lock_timer_base(timer) locks cpu_0_base
add_timer_on(timer, 1)
spin_lock(&cpu_1_base->lock)
timer->flags set to cpu_1_base
operates on @timer operates on @timer

This triggered with mod_delayed_work_on() which contains
"if (del_timer()) add_timer_on()" sequence eventually leading to the
following oops.

BUG: unable to handle kernel NULL pointer dereference at (null)
IP: [<ffffffff810ca6e9>] detach_if_pending+0x69/0x1a0
...
Workqueue: wqthrash wqthrash_workfunc [wqthrash]
task: ffff8800172ca680 ti: ffff8800172d0000 task.ti: ffff8800172d0000
RIP: 0010:[<ffffffff810ca6e9>] [<ffffffff810ca6e9>] detach_if_pending+0x69/0x1a0
...
Call Trace:
[<ffffffff810cb0b4>] del_timer+0x44/0x60
[<ffffffff8106e836>] try_to_grab_pending+0xb6/0x160
[<ffffffff8106e913>] mod_delayed_work_on+0x33/0x80
[<ffffffffa0000081>] wqthrash_workfunc+0x61/0x90 [wqthrash]
[<ffffffff8106dba8>] process_one_work+0x1e8/0x650
[<ffffffff8106e05e>] worker_thread+0x4e/0x450
[<ffffffff810746af>] kthread+0xef/0x110
[<ffffffff8185980f>] ret_from_fork+0x3f/0x70

Fix it by updating add_timer_on() to perform proper migration as
__mod_timer() does.

Signed-off-by: Tejun Heo <[email protected]>
Reported-and-tested-by: Jeff Layton <[email protected]>
Cc: [email protected]
Link: http://lkml.kernel.org/g/[email protected]
---
kernel/time/timer.c | 22 +++++++++++++++++++---
1 file changed, 19 insertions(+), 3 deletions(-)

--- a/kernel/time/timer.c
+++ b/kernel/time/timer.c
@@ -970,13 +970,29 @@ EXPORT_SYMBOL(add_timer);
*/
void add_timer_on(struct timer_list *timer, int cpu)
{
- struct tvec_base *base = per_cpu_ptr(&tvec_bases, cpu);
+ struct tvec_base *new_base = per_cpu_ptr(&tvec_bases, cpu);
+ struct tvec_base *base;
unsigned long flags;

timer_stats_timer_set_start_info(timer);
BUG_ON(timer_pending(timer) || !timer->function);
- spin_lock_irqsave(&base->lock, flags);
- timer->flags = (timer->flags & ~TIMER_BASEMASK) | cpu;
+
+ /*
+ * If @timer was on a different CPU, it should be migrated with the
+ * old base locked to prevent other operations proceeding with the
+ * wrong base locked. See lock_timer_base().
+ */
+ base = lock_timer_base(timer, &flags);
+ if (base != new_base) {
+ timer->flags |= TIMER_MIGRATING;
+
+ spin_unlock(&base->lock);
+ base = new_base;
+ spin_lock(&base->lock);
+ WRITE_ONCE(timer->flags,
+ (timer->flags & ~TIMER_BASEMASK) | cpu);
+ }
+
debug_activate(timer, timer->expires);
internal_add_timer(base, timer);
spin_unlock_irqrestore(&base->lock, flags);

Subject: [tip:timers/urgent] timers: Use proper base migration in add_timer_on()

Commit-ID: 22b886dd1018093920c4250dee2a9a3cb7cff7b8
Gitweb: http://git.kernel.org/tip/22b886dd1018093920c4250dee2a9a3cb7cff7b8
Author: Tejun Heo <[email protected]>
AuthorDate: Wed, 4 Nov 2015 12:15:33 -0500
Committer: Thomas Gleixner <[email protected]>
CommitDate: Wed, 4 Nov 2015 20:23:19 +0100

timers: Use proper base migration in add_timer_on()

Regardless of the previous CPU a timer was on, add_timer_on()
currently simply sets timer->flags to the new CPU. As the caller must
be seeing the timer as idle, this is locally fine, but the timer
leaving the old base while unlocked can lead to race conditions as
follows.

Let's say timer was on cpu 0.

cpu 0 cpu 1
-----------------------------------------------------------------------------
del_timer(timer) succeeds
del_timer(timer)
lock_timer_base(timer) locks cpu_0_base
add_timer_on(timer, 1)
spin_lock(&cpu_1_base->lock)
timer->flags set to cpu_1_base
operates on @timer operates on @timer

This triggered with mod_delayed_work_on() which contains
"if (del_timer()) add_timer_on()" sequence eventually leading to the
following oops.

BUG: unable to handle kernel NULL pointer dereference at (null)
IP: [<ffffffff810ca6e9>] detach_if_pending+0x69/0x1a0
...
Workqueue: wqthrash wqthrash_workfunc [wqthrash]
task: ffff8800172ca680 ti: ffff8800172d0000 task.ti: ffff8800172d0000
RIP: 0010:[<ffffffff810ca6e9>] [<ffffffff810ca6e9>] detach_if_pending+0x69/0x1a0
...
Call Trace:
[<ffffffff810cb0b4>] del_timer+0x44/0x60
[<ffffffff8106e836>] try_to_grab_pending+0xb6/0x160
[<ffffffff8106e913>] mod_delayed_work_on+0x33/0x80
[<ffffffffa0000081>] wqthrash_workfunc+0x61/0x90 [wqthrash]
[<ffffffff8106dba8>] process_one_work+0x1e8/0x650
[<ffffffff8106e05e>] worker_thread+0x4e/0x450
[<ffffffff810746af>] kthread+0xef/0x110
[<ffffffff8185980f>] ret_from_fork+0x3f/0x70

Fix it by updating add_timer_on() to perform proper migration as
__mod_timer() does.

Reported-and-tested-by: Jeff Layton <[email protected]>
Signed-off-by: Tejun Heo <[email protected]>
Cc: Chris Worley <[email protected]>
Cc: [email protected]
Cc: Michael Skralivetsky <[email protected]>
Cc: Trond Myklebust <[email protected]>
Cc: Shaohua Li <[email protected]>
Cc: Jeff Layton <[email protected]>
Cc: [email protected]
Cc: [email protected]
Link: http://lkml.kernel.org/r/[email protected]
Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: Thomas Gleixner <[email protected]>
---
kernel/time/timer.c | 22 +++++++++++++++++++---
1 file changed, 19 insertions(+), 3 deletions(-)

diff --git a/kernel/time/timer.c b/kernel/time/timer.c
index 74591ba..bbc5d11 100644
--- a/kernel/time/timer.c
+++ b/kernel/time/timer.c
@@ -977,13 +977,29 @@ EXPORT_SYMBOL(add_timer);
*/
void add_timer_on(struct timer_list *timer, int cpu)
{
- struct tvec_base *base = per_cpu_ptr(&tvec_bases, cpu);
+ struct tvec_base *new_base = per_cpu_ptr(&tvec_bases, cpu);
+ struct tvec_base *base;
unsigned long flags;

timer_stats_timer_set_start_info(timer);
BUG_ON(timer_pending(timer) || !timer->function);
- spin_lock_irqsave(&base->lock, flags);
- timer->flags = (timer->flags & ~TIMER_BASEMASK) | cpu;
+
+ /*
+ * If @timer was on a different CPU, it should be migrated with the
+ * old base locked to prevent other operations proceeding with the
+ * wrong base locked. See lock_timer_base().
+ */
+ base = lock_timer_base(timer, &flags);
+ if (base != new_base) {
+ timer->flags |= TIMER_MIGRATING;
+
+ spin_unlock(&base->lock);
+ base = new_base;
+ spin_lock(&base->lock);
+ WRITE_ONCE(timer->flags,
+ (timer->flags & ~TIMER_BASEMASK) | cpu);
+ }
+
debug_activate(timer, timer->expires);
internal_add_timer(base, timer);
spin_unlock_irqrestore(&base->lock, flags);

2015-11-04 19:35:47

by Thomas Gleixner

[permalink] [raw]
Subject: Re: [PATCH] timer: add_timer_on() should perform proper migration

Tejun,

On Wed, 4 Nov 2015, Tejun Heo wrote:

> Regardless of the previous CPU a timer was on, add_timer_on()
> currently simply sets timer->flags to the new CPU. As the caller must
> be seeing the timer as idle, this is locally fine, but the timer
> leaving the old base while unlocked can lead to race conditions as
> follows.

nice detective work. This has been there forever. I really wonder why
nobody ever triggered this before.

@stable:

The patch does only apply to kernels >= 4.2. Backport for older
kernels is below.

Thanks,

tglx

----------->

--- a/kernel/time/timer.c
+++ b/kernel/time/timer.c
@@ -966,13 +966,26 @@ EXPORT_SYMBOL(add_timer);
*/
void add_timer_on(struct timer_list *timer, int cpu)
{
- struct tvec_base *base = per_cpu(tvec_bases, cpu);
+ struct tvec_base *new_base = per_cpu(tvec_bases, cpu);
+ struct tvec_base *base;
unsigned long flags;

timer_stats_timer_set_start_info(timer);
BUG_ON(timer_pending(timer) || !timer->function);
- spin_lock_irqsave(&base->lock, flags);
- timer_set_base(timer, base);
+
+ /*
+ * If @timer was on a different CPU, it should be migrated with the
+ * old base locked to prevent other operations proceeding with the
+ * wrong base locked. See lock_timer_base().
+ */
+ base = lock_timer_base(timer, &flags);
+ if (base != new_base) {
+ timer_set_base(timer, NULL);
+ spin_unlock(&base->lock);
+ base = new_base;
+ spin_lock(&base->lock);
+ timer_set_base(timer, base);
+ }
debug_activate(timer, timer->expires);
internal_add_timer(base, timer);
spin_unlock_irqrestore(&base->lock, flags);

2015-11-04 19:43:51

by Tejun Heo

[permalink] [raw]
Subject: Re: [PATCH] timer: add_timer_on() should perform proper migration

Hello, Thomas.

On Wed, Nov 04, 2015 at 08:35:00PM +0100, Thomas Gleixner wrote:
> nice detective work. This has been there forever. I really wonder why
> nobody ever triggered this before.

I guess there weren't enough concurrent del/add_timer_on() combos.
Workqueue recently switched to always use add_timer_on() from its
timer function making that a lot more frequent for heavy
mod_delayed_work_on() users and that's when it started surfacing.

Thanks.

--
tejun