Date: Tue, 3 Nov 2015 12:55:04 -0500
From: Jeff Layton
To: Chris Worley
Cc: Tejun Heo, linux-kernel@vger.kernel.org, bfields@fieldses.org,
	Michael Skralivetsky, Trond Myklebust, Shaohua Li
Subject: Re: timer code oops when calling mod_delayed_work
Message-ID: <20151103125504.6649138f@tlielax.poochiereds.net>
In-Reply-To: <20151102203339.7ed8f2bb@synchrony.poochiereds.net>
References: <20151029103113.2f893924@tlielax.poochiereds.net>
	<20151029135836.02ad9000@synchrony.poochiereds.net>
	<20151031020012.GH3582@mtj.duckdns.org>
	<20151031073400.2cf05d77@tlielax.poochiereds.net>
	<20151031213107.GA23841@mtj.duckdns.org>
	<20151031175404.3c57a17a@tlielax.poochiereds.net>
	<20151102145633.5329f3da@tlielax.poochiereds.net>
	<20151102203339.7ed8f2bb@synchrony.poochiereds.net>

On Mon, 2 Nov 2015 20:33:39 -0500
Jeff Layton wrote:

> On Mon, 2 Nov 2015 14:56:33 -0500
> Jeff Layton wrote:
>
> > On Mon, 2 Nov 2015 09:48:41 -1000
> > Chris Worley wrote:
> >
> > > On Sat, Oct 31, 2015 at 11:54 AM, Jeff Layton wrote:
> > > > On Sat, 31 Oct 2015 17:31:07 -0400
> > > > Tejun Heo wrote:
> > > ...
> > > >>
> > > >> > I have asked Chris and Michael to see if they can bisect it down, but
> > > >> > it may be a bit before they can get that done. Any insight you might
> > > >> > have in the meantime would be helpful.
> > > >>
> > > >> Yeah, I'd love to find out how reproducible the issue is. If the
> > > >> problem is rarely reproducible, it might make sense to try
> > > >> instrumentation before trying bisection as it *could* be a latent bug
> > > >> which has been there all along and bisecting to the commit introducing
> > > >> the code wouldn't help us too much.
> > > >>
> > > >
> > > > It seems fairly reproducible, at least on v4.3-rc7 kernels:
> > > >
> > > > This came about when I asked them to perf test some nfsd patches that I
> > > > have queued up. I patched a Fedora 4.3-rc7 kernel and wanted to see
> > > > what the perf delta was (with NFSv3, fwiw):
> > > >
> > > > Patched kernels here: http://koji.fedoraproject.org/koji/taskinfo?taskID=11598089
> > > >
> > > > Unpatched kernels here: http://koji.fedoraproject.org/koji/buildinfo?buildID=694377
> > > >
> > > > Michael was using the SPEC SFS VDI workload to test, and was able to
> > > > get the same panic on both kernels. So it does seem to be reproducible.
> > > > It might even be possible to tune the VM to make the shrinker fire more
> > > > often, which may help tickle this more.
> > > >
> > > > In any case, I've asked them to try something v4.2-ish and see if it's
> > > > reproducible there, and then try v4.1 if it is. I figure anything
> > > > earlier is probably not worth testing if it still fails on v4.1. If it
> > > > turns out not to be reproducible on those earlier kernels then we can
> > > > bisect from there to track it down.
> > >
> > > The trick seems to be the NFS thread count: I initially thought this
> > > was SFS/VDI specific, but when I ratcheted up the thread count to what
> > > Michael was using, 256 threads, it oopses on fio (throughput)
> > > benchmarks too.
> > >
> > > In bisecting kernels, it appeared between 4.2.3-200 and 4.2.5-200 (all
> > > the 4.2.4 kernels were bad).
> > >
> > > Jeff has a lead on this...
> > >
> > > Chris
> >
> > Thanks Chris. This commit went in between those kernels:
> >
> > commit a7c571f2e3ff9243ba41c242486f53dbca37d864
> > Author: Shaohua Li
> > Date:   Wed Sep 30 09:05:30 2015 -0700
> >
> >     workqueue: make sure delayed work run in local cpu
> >
> >     commit 874bbfe600a660cba9c776b3957b1ce393151b76 upstream.
> >
> > ...and I suspect that it's the culprit. That commit causes this code to
> > always use add_timer_on, which seems to have different semantics from
> > add_timer.
> >
> > I'm going to build a v4.2.5 kernel with that patch reverted to confirm
> > it, but it seems likely...
> >
> > (dropping Lai's address, and adding Shaohua Li's)
>
> Ok, I built a kernel with that patch reverted and that seems to fix the
> problem.
>
> Looking at the patch, I guess the main difference is that we're no
> longer using add_timer for unbound workqueue tasks. That said, we
> should have possession of the PENDING bit before calling add_timer_on.
>
> So...I'm a little stumped. Any thoughts as to where the race could be?
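
For reference, the part of that commit I'm suspicious of is the tail end
of __queue_delayed_work() in kernel/workqueue.c. Paraphrasing from memory
rather than quoting the exact hunk, the timer-arming logic changes roughly
like this:

	/* before 874bbfe600a6: unbound (WORK_CPU_UNBOUND) delayed work
	 * was armed with a plain add_timer() */
	if (unlikely(cpu != WORK_CPU_UNBOUND))
		add_timer_on(timer, cpu);
	else
		add_timer(timer);

	/* after 874bbfe600a6: everything goes through add_timer_on(),
	 * with unbound work pinned to whatever CPU the caller happens
	 * to be running on */
	if (cpu == WORK_CPU_UNBOUND)
		cpu = raw_smp_processor_id();
	add_timer_on(timer, cpu);

In other words, an unbound delayed work item now always gets its timer
added on an explicit per-CPU base instead of letting add_timer() pick
one, which is where the semantic difference I mentioned above comes in.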

Ok, I've been able to reproduce this with a small module that I wrote as
a reproducer:

[ 197.502139] BUG: unable to handle kernel NULL pointer dereference at (null)
[ 197.503073] IP: [] detach_if_pending+0x6c/0x190
[ 197.503752] PGD 0
[ 197.504061] Oops: 0002 [#1] SMP
[ 197.504449] Modules linked in: wqthrash(OE-) snd_hda_codec_generic snd_hda_intel snd_hda_codec snd_hda_core xfs snd_hwdep snd_seq libcrc32c snd_seq_device ppdev snd_pcm snd_timer snd soundcore parport_pc e1000 i2c_piix4 parport joydev virtio_balloon pvpanic acpi_cpufreq nfsd nfs_acl lockd auth_rpcgss grace sunrpc virtio_blk virtio_console qxl drm_kms_helper ttm virtio_pci virtio_ring serio_raw virtio drm ata_generic pata_acpi
[ 197.509358] CPU: 1 PID: 794 Comm: kworker/u8:9 Tainted: G OE 4.3.0-0.rc7.git2.1.fc24.x86_64 #1
[ 197.510767] Hardware name: Bochs Bochs, BIOS Bochs 01/01/2011
[ 197.511850] Workqueue: wqthrash wqthrash_workfunc [wqthrash]
[ 197.513141] task: ffff88020f918000 ti: ffff8802143a8000 task.ti: ffff8802143a8000
[ 197.513958] RIP: 0010:[] [] detach_if_pending+0x6c/0x190
[ 197.514927] RSP: 0018:ffff8802143abc68 EFLAGS: 00010086
[ 197.515539] RAX: dead000000000200 RBX: ffffffffa022e050 RCX: 000000000000000f
[ 197.516552] RDX: 0000000000000000 RSI: 0000000000000000 RDI: 0000000000000046
[ 197.517374] RBP: ffff8802143abc88 R08: 0000000000000001 R09: 0000000000000001
[ 197.518194] R10: 0000000000000000 R11: 0000000000000001 R12: 0000000000000000
[ 197.518992] R13: ffff880216fce2c0 R14: 0000000000000001 R15: 00000000000003e8
[ 197.519817] FS: 0000000000000000(0000) GS:ffff880217000000(0000) knlGS:0000000000000000
[ 197.520738] CS: 0010 DS: 0000 ES: 0000 CR0: 000000008005003b
[ 197.521410] CR2: 0000000000000000 CR3: 00000000daa9b000 CR4: 00000000000006e0
[ 197.522234] Stack:
[ 197.522584]  ffffffffa022e050 0000000000000000 ffff880216fce2c0 ffff88021688f800
[ 197.524102]  ffff8802143abcc0 ffffffff8112db72 0000000000000086 00000000ba809f75
[ 197.525594]  ffffffffa022e000 ffff8802143abd00 0000000000000001 ffff8802143abcf0
[ 197.527082] Call Trace:
[ 197.527581]  [] del_timer+0x62/0xa0
[ 197.532565]  [] try_to_grab_pending+0xce/0x190
[ 197.533253]  [] mod_delayed_work_on+0x42/0xb0
[ 197.533965]  [] wqthrash_workfunc+0x61/0x90 [wqthrash]
[ 197.534766]  [] process_one_work+0x230/0x6a0
[ 197.535683]  [] ? process_one_work+0x199/0x6a0
[ 197.536378]  [] worker_thread+0x4e/0x450
[ 197.537004]  [] ? process_one_work+0x6a0/0x6a0
[ 197.537724]  [] kthread+0x101/0x120
[ 197.538662]  [] ? trace_hardirqs_on_caller+0x129/0x1b0
[ 197.539926]  [] ? kthread_create_on_node+0x250/0x250
[ 197.541160]  [] ret_from_fork+0x3f/0x70
[ 197.542187]  [] ? kthread_create_on_node+0x250/0x250
[ 197.543416] Code: 3b 89 e4 00 e8 06 64 ff ff 85 c0 74 0d 80 3d 45 61 e4 00 00 0f 84 c4 00 00 00 65 ff 0d de e7 ed 7e 48 8b 03 48 8b 53 08 48 85 c0 <48> 89 02 74 04 48 89 50 08 45 84 f6 74 08 48 c7 43 08 00 00 00
[ 197.547527] RIP  [] detach_if_pending+0x6c/0x190
[ 197.548481]  RSP
[ 197.549220] CR2: 0000000000000000
[ 197.549883] ---[ end trace 10dec6677e9e3a68 ]---

Apply this patch, build a wqthrash.ko module, plug it in and watch it
explode. It basically just spawns a bunch of (immediate) workqueue jobs
that attempt to schedule and modify a single piece of delayed_work.

I still don't have a lead on the actual bug, but this does seem to
tickle it fairly easily.

-----------------------8<------------------------------
[PATCH] wqthrash: add module that will thrash the workqueue code

...let's see if we can tickle the races in the delayed_work
infrastructure.

Signed-off-by: Jeff Layton
---
 samples/Makefile            |  3 +-
 samples/wqthrash/Makefile   |  1 +
 samples/wqthrash/wqthrash.c | 94 +++++++++++++++++++++++++++++++++++++++++++++
 3 files changed, 97 insertions(+), 1 deletion(-)
 create mode 100644 samples/wqthrash/Makefile
 create mode 100644 samples/wqthrash/wqthrash.c

diff --git a/samples/Makefile b/samples/Makefile
index f00257bcc5a7..7e92dc91879d 100644
--- a/samples/Makefile
+++ b/samples/Makefile
@@ -1,4 +1,5 @@
 # Makefile for Linux samples code
 
 obj-$(CONFIG_SAMPLES)	+= kobject/ kprobes/ trace_events/ livepatch/ \
-			   hw_breakpoint/ kfifo/ kdb/ hidraw/ rpmsg/ seccomp/
+			   hw_breakpoint/ kfifo/ kdb/ hidraw/ rpmsg/ \
+			   seccomp/ wqthrash/
diff --git a/samples/wqthrash/Makefile b/samples/wqthrash/Makefile
new file mode 100644
index 000000000000..da4c36a2c7d5
--- /dev/null
+++ b/samples/wqthrash/Makefile
@@ -0,0 +1 @@
+obj-m += wqthrash.o
diff --git a/samples/wqthrash/wqthrash.c b/samples/wqthrash/wqthrash.c
new file mode 100644
index 000000000000..b1c52c20fb8f
--- /dev/null
+++ b/samples/wqthrash/wqthrash.c
@@ -0,0 +1,94 @@
+/*
+ * wqthrash.c -- thrash the delayed workqueue rescheduling code
+ *
+ * This program is free software; you can redistribute it and/or
+ * modify it under the terms of the GNU General Public License
+ * as published by the Free Software Foundation; either version 2
+ * of the License, or (at your option) any later version.
+ *
+ * This program is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
+ * GNU General Public License for more details.
+ *
+ * You should have received a copy of the GNU General Public License
+ * along with this program; if not, write to the Free Software
+ * Foundation, Inc., 51 Franklin Street, Fifth Floor,
+ * Boston, MA 02110-1301, USA.
+ */
+
+#include <linux/kernel.h>
+#include <linux/module.h>
+#include <linux/init.h>
+#include <linux/moduleparam.h>
+#include <linux/workqueue.h>
+#include <linux/sched.h>
+#include <linux/slab.h>
+#include <linux/delay.h>
+
+#define ITERATIONS	(102400)
+#define DELAY		(HZ)
+#define THRASHERS	(256)
+
+static void
+dummy_work(struct work_struct *unused)
+{
+	schedule_timeout_uninterruptible(1);
+}
+
+static DECLARE_DELAYED_WORK(wqthrash_delayed_work, dummy_work);
+
+static void
+wqthrash_workfunc(struct work_struct *unused)
+{
+	int i;
+
+	for (i = 0; i < ITERATIONS; ++i) {
+		schedule_delayed_work(&wqthrash_delayed_work, DELAY);
+		cond_resched();
+		mod_delayed_work(system_wq, &wqthrash_delayed_work, DELAY);
+		cond_resched();
+	}
+}
+
+static struct workqueue_struct *wq;
+static struct work_struct *tw;
+
+static int
+wqthrash_init(void)
+{
+	int i;
+
+	wq = alloc_workqueue("wqthrash", WQ_UNBOUND, 0);
+	if (!wq)
+		return -ENOMEM;
+
+	tw = kcalloc(THRASHERS, sizeof(*tw), GFP_KERNEL);
+	if (!tw) {
+		destroy_workqueue(wq);
+		return -ENOMEM;
+	}
+
+	for (i = 0; i < THRASHERS; ++i) {
+		INIT_WORK(&tw[i], wqthrash_workfunc);
+		queue_work(wq, &tw[i]);
+	}
+	return 0;
+}
+
+static void
+wqthrash_exit(void)
+{
+	int i;
+
+	for (i = 0; i < THRASHERS; ++i)
+		flush_work(&tw[i]);
+
+	kfree(tw);
+	destroy_workqueue(wq);
+	cancel_delayed_work_sync(&wqthrash_delayed_work);
+}
+
+module_init(wqthrash_init);
+module_exit(wqthrash_exit);
+MODULE_LICENSE("GPL");
-- 
2.4.3