To: Tejun Heo
Cc: "Linux-Kernel@Vger. Kernel. Org", SiteGround Operations
From: Nikolay Borisov
Subject: corruption causing crash in __queue_work
Message-ID: <566819D8.5090804@kyup.com>
Date: Wed, 9 Dec 2015 14:08:56 +0200
X-Mailing-List: linux-kernel@vger.kernel.org

Hello Tejun,

I've been observing the following crashes on kernel 4.2.6:

[73309.529940] BUG: unable to handle kernel NULL pointer dereference at (null)
[73309.530238] IP: [] __queue_work+0xb3/0x390
[73309.530466] PGD 0
[73309.530681] Oops: 0000 [#1] SMP
[73309.530947] Modules linked in: dm_snapshot dm_thin_pool dm_bio_prison dm_persistent_data dm_bufio libcrc32c ipv6 xt_multiport iptable_filter xt_nat iptable_nat nf_conntrack_ipv4 nf_defrag_ipv4 nf_nat_ipv4 nf_nat nf_conntrack ip_tables ext2 dm_mirror dm_region_hash dm_log iTCO_wdt iTCO_vendor_support sb_edac edac_core i2c_i801 igb i2c_algo_bit i2c_core lpc_ich mfd_core ipmi_devintf ipmi_si ipmi_msghandler ioatdma dca
[73309.533556] CPU: 19 PID: 0 Comm: swapper/19 Not tainted 4.2.6-wbpatch-qib #1
[73309.533734] Hardware name: Supermicro X9DRD-iF/LF/X9DRD-iF, BIOS 3.0b 12/05/2013
[73309.533911] task: ffff880276501b80 ti: ffff880276510000 task.ti: ffff880276510000
[73309.534093] RIP: 0010:[] [] __queue_work+0xb3/0x390
[73309.534321] RSP: 0018:ffff88047fce3d58 EFLAGS: 00010086
[73309.534495] RAX: ffff880277812400 RBX: ffff8801e53e24c0 RCX: 00000000000100f0
[73309.534672] RDX: 0000000000000000 RSI: 0000000000000030 RDI: ffff8801e53e24c0
[73309.534849] RBP: ffff88047fce3de8 R08: 000042ad628a3480 R09: 0000000000000000
[73309.535023] R10: ffffffff816099d5 R11: 0000000000000000 R12: ffffffff8106b940
[73309.535196] R13: 0000000000000013 R14: ffff8803df464c00 R15: 0000000000000013
[73309.535370] FS: 0000000000000000(0000) GS:ffff88047fce0000(0000) knlGS:0000000000000000
[73309.535544] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[73309.535714] CR2: 0000000000000000 CR3: 0000000001a0e000 CR4: 00000000000406e0
[73309.535886] Stack:
[73309.536049] ffff88047fcefcd8 0000000000000092 0000000000000000 ffff8803df464d10
[73309.536415] 0000000000000032 00000000000100f0 0000000000000000 ffff88047fcf4a00
[73309.536785] ffff88047fcf4a00 0000000000000013 0000000000000000 ffff880276501b80
[73309.537152] Call Trace:
[73309.537319]
[73309.537373] [] ? __queue_work+0x390/0x390
[73309.537714] [] delayed_work_timer_fn+0x18/0x20
[73309.537891] [] call_timer_fn+0x47/0x110
[73309.538071] [] ? tick_sched_timer+0x52/0xa0
[73309.538249] [] run_timer_softirq+0x17f/0x2b0
[73309.538425] [] ? __queue_work+0x390/0x390
[73309.538604] [] __do_softirq+0xe0/0x290
[73309.538778] [] irq_exit+0xa6/0xb0
[73309.538952] [] smp_apic_timer_interrupt+0x4a/0x59
[73309.539128] [] apic_timer_interrupt+0x6b/0x70
[73309.539300]
[73309.539355] [] ? cpuidle_enter_state+0x136/0x290
[73309.539694] [] ? cpuidle_enter_state+0x12d/0x290
[73309.539870] [] ? __schedule+0x37d/0x840
[73309.540045] [] cpuidle_enter+0x17/0x20
[73309.540222] [] cpuidle_idle_call+0x95/0x140
[73309.540398] [] ? atomic_notifier_call_chain+0x16/0x20
[73309.540574] [] cpu_idle_loop+0x145/0x200
[73309.540748] [] ? cpu_startup_entry+0x1b/0x70
[73309.540924] [] ? get_random_bytes+0x48/0x90
[73309.541098] [] cpu_startup_entry+0x5f/0x70
[73309.541274] [] start_secondary+0xc2/0xd0
[73309.541446] Code: 49 8b 96 08 01 00 00 49 63 c7 48 03 14 c5 e0 af ab 81 48 89 55 80 48 89 df e8 0a ee ff ff 48 8b 55 80 48 85 c0 0f 84 3e 01 00 00 <48> 8b 3a 48 39 f8 0f 84 35 01 00 00 48 89 c7 48 89 85 78 ff ff
[73309.545008] RIP [] __queue_work+0xb3/0x390
[73309.545231] RSP
[73309.545399] CR2: 0000000000000000

The gist is that this fails on the following line:

if (last_pool && last_pool != pwq->pool) {

The pointer 'pwq' (held in %rdx) is wrong: in this case it is
0000000000000000. Looking at the function's source, pwq should not be
loaded by per_cpu_ptr, since the if (!(wq->flags & WQ_UNBOUND)) check
should evaluate to false. So pwq is loaded as the result of
unbound_pwq_by_node(wq, cpu_to_node(cpu));

Here are the flags of the workqueue:

crash> struct workqueue_struct.flags 0xffff8803df464c00
  flags = 131082

(0xffff8803df464c00 is indeed the pointer to the workqueue struct,
so the flags aren't bogus.)

Reading numa_pwq_tbl, it seems to be uninitialised:

crash> struct workqueue_struct.numa_pwq_tbl 0xffff8803df464c00
  numa_pwq_tbl = 0xffff8803df464d10

crash> rd -64 0xffff8803df464d10 3
ffff8803df464d10:  0000000000000000 0000000000000000   ................
ffff8803df464d20:  0000000000000000                    ........

The machine where the crash occurred has a single NUMA node, so at the
very least I would have expected a valid pointer there rather than NULL.
Also, this crash is not isolated: I have observed it on multiple other
nodes running vanilla 4.2.5/4.2.6 kernels.

Any advice on how to debug this further?
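For reference, the flags value printed by crash (131082, i.e. 0x2000a) can be decoded by hand. The following is only a minimal userspace sketch: the bit positions are copied from include/linux/workqueue.h as found in 4.2-era trees and should be checked against the tree actually in use, and the table wq_flag_names and the helper wq_decode_flags() are hypothetical names made up for this illustration, not kernel APIs.

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>

/* Assumption: bit positions as in include/linux/workqueue.h (4.2-era). */
struct wq_flag_name {
    unsigned int bit;
    const char *name;
};

static const struct wq_flag_name wq_flag_names[] = {
    { 1u << 1,  "WQ_UNBOUND" },
    { 1u << 2,  "WQ_FREEZABLE" },
    { 1u << 3,  "WQ_MEM_RECLAIM" },
    { 1u << 4,  "WQ_HIGHPRI" },
    { 1u << 5,  "WQ_CPU_INTENSIVE" },
    { 1u << 6,  "WQ_SYSFS" },
    { 1u << 7,  "WQ_POWER_EFFICIENT" },
    { 1u << 16, "__WQ_DRAINING" },
    { 1u << 17, "__WQ_ORDERED" },
};

/*
 * Hypothetical helper: write the names of all set flags into buf,
 * separated by '|'. Bits not in the table are appended as one hex value.
 * Returns 0 on success, -1 if buf is too small.
 */
static int wq_decode_flags(unsigned int flags, char *buf, size_t len)
{
    size_t used = 0;
    unsigned int left = flags;
    size_t i;

    assert(buf != NULL && len > 0);
    buf[0] = '\0';

    for (i = 0; i < sizeof(wq_flag_names) / sizeof(wq_flag_names[0]); i++) {
        int n;

        if (!(flags & wq_flag_names[i].bit))
            continue;
        n = snprintf(buf + used, len - used, "%s%s",
                     used ? "|" : "", wq_flag_names[i].name);
        if (n < 0 || (size_t)n >= len - used)
            return -1;
        used += (size_t)n;
        left &= ~wq_flag_names[i].bit;
    }

    if (left) {
        int n = snprintf(buf + used, len - used, "%s0x%x",
                         used ? "|" : "", left);
        if (n < 0 || (size_t)n >= len - used)
            return -1;
    }
    return 0;
}
```

Calling wq_decode_flags(131082, buf, sizeof(buf)) with a 128-byte buffer leaves "WQ_UNBOUND|WQ_MEM_RECLAIM|__WQ_ORDERED" in buf. Under the stated assumption about the bit layout, this agrees with the reasoning above: WQ_UNBOUND is set, so __queue_work takes the unbound_pwq_by_node() path rather than per_cpu_ptr.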