Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S965498AbcJXUqr (ORCPT ); Mon, 24 Oct 2016 16:46:47 -0400 Received: from mail-oi0-f54.google.com ([209.85.218.54]:34680 "EHLO mail-oi0-f54.google.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S965180AbcJXUqp (ORCPT ); Mon, 24 Oct 2016 16:46:45 -0400 MIME-Version: 1.0 In-Reply-To: References: <20161018234248.GB93792@clm-mbp.masoncoding.com> <332c8e94-a969-093f-1fb4-30d89be8993e@kernel.org> <20161020225028.czodw54tjbiwwv3o@codemonkey.org.uk> <20161020230341.jsxpia2sy53xn5l5@codemonkey.org.uk> <20161021200245.kahjzgqzdfyoe3uz@codemonkey.org.uk> <20161022152033.gkmm3l75kqjzsije@codemonkey.org.uk> <20161024044051.onmh4h6sc2bjxzzc@codemonkey.org.uk> From: Linus Torvalds Date: Mon, 24 Oct 2016 13:46:42 -0700 X-Google-Sender-Auth: aCpNyf15oFA5AzDzmr_QY_74npo Message-ID: Subject: Re: bio linked list corruption. To: Andy Lutomirski Cc: Dave Jones , Chris Mason , Andy Lutomirski , Jens Axboe , Al Viro , Josef Bacik , David Sterba , linux-btrfs , Linux Kernel Content-Type: text/plain; charset=UTF-8 Sender: linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org Content-Length: 6379 Lines: 124 On Mon, Oct 24, 2016 at 1:06 PM, Andy Lutomirski wrote: >> >> [69943.450108] Oops: 0003 [#1] PREEMPT SMP DEBUG_PAGEALLOC > > This is an unhandled kernel page fault. The string "Oops" is so helpful :-/ I think there was a line above it that DaveJ just didn't include. > >> [69943.454452] CPU: 1 PID: 21558 Comm: trinity-c60 Not tainted 4.9.0-rc1-think+ #11 >> [69943.463510] task: ffff8804f8dd3740 task.stack: ffffc9000b108000 >> [69943.468077] RIP: 0010:[] >> [69943.472704] [] __lock_acquire.isra.32+0x6b/0x8c0 >> [69943.477489] RSP: 0018:ffffc9000b10b9e8 EFLAGS: 00010086 >> [69943.482368] RAX: ffffffff81789b90 RBX: ffff8804f8dd3740 RCX: 0000000000000000 >> [69943.487410] RDX: 0000000000000000 RSI: 0000000000000000 RDI: 0000000000000000 >> [69943.492515] RBP: ffffc9000b10ba18 R08: 0000000000000001 R09: 0000000000000000 >> [69943.497666] R10: 0000000000000001 R11: 00003f9cfa7f4e73 R12: 0000000000000000 >> [69943.502880] R13: 0000000000000000 R14: ffffc9000af7bd48 R15: ffff8804f8dd3740 >> [69943.508163] FS: 00007f64904a2b40(0000) GS:ffff880507a00000(0000) knlGS:0000000000000000 >> [69943.513591] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 >> [69943.518917] CR2: ffffffff81789d28 CR3: 00000004a8f16000 CR4: 00000000001406e0 >> [69943.524253] DR0: 00007f5b97fd4000 DR1: 0000000000000000 DR2: 0000000000000000 >> [69943.529488] DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000600 >> [69943.534771] Stack: >> [69943.540023] ffff880507bd74c0 >> [69943.545317] ffff8804f8dd3740 0000000000000046 0000000000000286[69943.545456] ffffc9000af7bd08 >> [69943.550930] 0000000000000100 ffffc9000b10ba50 ffffffff810c4b68[69943.551069] ffffffff810ba40c >> [69943.556657] ffff880400000000 0000000000000000 ffffc9000af7bd48[69943.556796] Call Trace: >> [69943.562465] [] lock_acquire+0x58/0x70 >> [69943.568354] [] ? finish_wait+0x3c/0x70 >> [69943.574306] [] _raw_spin_lock_irqsave+0x42/0x80 >> [69943.580335] [] ? finish_wait+0x3c/0x70 >> [69943.586237] [] finish_wait+0x3c/0x70 >> [69943.591992] [] shmem_fault+0x167/0x1b0 >> [69943.597807] [] ? prepare_to_wait_event+0x100/0x100 >> [69943.603741] [] __do_fault+0x6d/0x1b0 >> [69943.609743] [] handle_mm_fault+0xc58/0x1170 >> [69943.615822] [] ? handle_mm_fault+0x43/0x1170 >> [69943.621971] [] __do_page_fault+0x172/0x4e0 >> [69943.628184] [] do_page_fault+0x20/0x70 >> [69943.634449] [] ? debug_smp_processor_id+0x17/0x20 >> [69943.640784] [] page_fault+0x1f/0x30 >> [69943.647170] [] ? strncpy_from_user+0x5c/0x170 >> [69943.653480] [] ? strncpy_from_user+0x46/0x170 >> [69943.659632] [] setxattr+0x57/0x170 >> [69943.665846] [] ? debug_smp_processor_id+0x17/0x20 >> [69943.672172] [] ? get_lock_stats+0x19/0x50 >> [69943.678558] [] ? sched_clock_cpu+0xb6/0xd0 >> [69943.685007] [] ? __lock_acquire.isra.32+0x1cf/0x8c0 >> [69943.691542] [] ? __this_cpu_preempt_check+0x13/0x20 >> [69943.698130] [] ? preempt_count_add+0x7c/0xc0 >> [69943.704791] [] ? __mnt_want_write+0x61/0x90 >> [69943.711519] [] SyS_fsetxattr+0x78/0xa0 >> [69943.718300] [] do_syscall_64+0x5c/0x170 >> [69943.724949] [] entry_SYSCALL64_slow_path+0x25/0x25 >> [69943.731521] Code: >> [69943.738124] 00 83 fe 01 0f 86 0e 03 00 00 31 d2 4c 89 f7 44 89 45 d0 89 4d d4 e8 75 e7 ff ff 8b 4d d4 48 85 c0 44 8b 45 d0 0f 84 d8 02 00 00 ff 80 98 01 00 00 8b 15 e0 21 8f 01 45 8b 8f 50 08 00 00 85 > > That's lock incl 0x198(%rax). I think this is: > > atomic_inc((atomic_t *)&class->ops); > > I suppose this could be stack corruption at work, but after a fair > amount of staring, I still haven't found anything in the vmap_stack > code that would cause stack corruption. Well, it is intriguing that what faults is this: finish_wait(shmem_falloc_waitq, &shmem_fault_wait); where 'shmem_fault_wait' is a on-stack wait queue. So it really looks very much like stack corruption. What strikes me is that "finish_wait()" does this optimistic "has my entry been removed" without holding the waitqueue lock (and uses list_empty_careful() to make sure it does that "safely"). It has that big comment too: /* * shmem_falloc_waitq points into the shmem_fallocate() * stack of the hole-punching task: shmem_falloc_waitq * is usually invalid by the time we reach here, but * finish_wait() does not dereference it in that case; * though i_lock needed lest racing with wake_up_all(). */ the stack it comes from is the wait queue head from shmem_fallocate(), which will do "wake_up_all()" under the inode lock. On the face of it, the inode lock should make that safe and serialize everything. And yes, finish_wait() does not touch the unsafe stuff if the wait-queue (in the local stack) is empty, which wake_up_all() *should* have guaranteed. It's just a regular wait-queue entry (that DEFINE_WAIT() does that), so it uses the normal autoremove_wake_function() that removes things on successful wakeup: int autoremove_wake_function(wait_queue_t *wait, unsigned mode, int sync, void *key) { int ret = default_wake_function(wait, mode, sync, key); if (ret) list_del_init(&wait->task_list); return ret; } So the only issue is "did default_wake_function() return true"? That's try_to_wake_up(TASK_NORMAL, 0), and I note that it can return zero (and thus *not* remove the entry - leavign the invalid entry tghere) if if (!(p->state & state)) goto out; but "prepare_to_wait()" (which also ran with the inode->i_lock held, and also takes the wait-queue lock) did set p->state to TASK_UNINTERRUPTIBLE. So this is all some really subtle code, but I'm not seeing that it would be wrong. Linus