Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S967043AbbBEAQ5 (ORCPT ); Wed, 4 Feb 2015 19:16:57 -0500 Received: from mail-ie0-f176.google.com ([209.85.223.176]:38052 "EHLO mail-ie0-f176.google.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S965444AbbBEAQy (ORCPT ); Wed, 4 Feb 2015 19:16:54 -0500 MIME-Version: 1.0 In-Reply-To: <54D2AA16.6030706@oracle.com> References: <54D2AA16.6030706@oracle.com> Date: Wed, 4 Feb 2015 16:16:54 -0800 X-Google-Sender-Auth: 87kf48Z-GkxIi_-gOyr1Me00BSg Message-ID: Subject: Re: sched: memory corruption on completing completions From: Linus Torvalds To: Sasha Levin , Waiman Long Cc: Peter Zijlstra , Ingo Molnar , Andrew Morton , Andrey Ryabinin , Dave Jones , LKML Content-Type: text/plain; charset=UTF-8 Sender: linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org Content-Length: 3552 Lines: 93 On Wed, Feb 4, 2015 at 3:24 PM, Sasha Levin wrote: > > I now have a theory for why it happens: > > Thread A Thread B > ---------------------------------------------------------- > > [Enter function] > DECLARE_COMPLETION_ONSTACK(x) > wait_for_completion(x) > complete(x) > [In complete(x):] > spin_lock_irqsave(&x->wait.lock, flags); > x->done++; > __wake_up_locked(&x->wait, TASK_NORMAL, 1); > [Done waiting, wakes up] > [Exit function] > spin_unlock_irqrestore(&x->wait.lock, flags); > > So the spin_unlock_irqrestore() at the end of complete() would proceed to corruption > the stack of thread A. We have indeed had bugs like that, but it *shouldn't be the case any more. The hard rule for spinlocks is: - the last thing *unlock* does is to release the lock with a single store. After that has happened, it will not touch it, and before that has happened, nobody else can get the spinlock which means that you cannot actually get the case you are talking about (because the "done waiting, wakes up" in thread A happens with the spinlock held). That said, as mentioned, we have had bugs in spinlock debugging code in particular, where the last unlock does other things after it releases the low-level lock. And you are clearly not using normal spinlocks, since your error trace has this: do_raw_spin_unlock+0x417/0x4f0 which means that "do_raw_spin_unlock() is 1250+ bytes in size. A "normal" do_raw_spin_unlock()" is inlined because it's a single store operation. Looks like the debug version of do_raw_spin_unlock() is slightly larger than that ;) Anyway, apparently that version of do_raw_spin_unlock isn't just huge, it's also buggy.. Although I thought we fixed that bug some time ago, and looking at the code it does void do_raw_spin_unlock(raw_spinlock_t *lock) { debug_spin_unlock(lock); arch_spin_unlock(&lock->raw_lock); } so it *looks* ok in the generic code. So off to look at the arch code... And looking at the arch version, I think the paravirtualized code is crap. It does: prev = *lock; add_smp(&lock->tickets.head, TICKET_LOCK_INC); /* add_smp() is a full mb() */ if (unlikely(lock->tickets.tail & TICKET_SLOWPATH_FLAG)) __ticket_unlock_slowpath(lock, prev); which is *exactly* the kind of things you cannot do with spinlocks, because after you've done the "add_smp()" and released the spinlock for the fast-path, you can't access the spinlock any more. Exactly because a fast-path lock migth come in, and release the whole data structure. As usual, the paravirt code is a horribly buggy heap of crud. Film at 11. Why did I think we had this bug but already fixed it ? Maybe it's one of those things that Waiman fixed in his long delayed qspinlock series? Waiman? Or maybe I just remember the fixes where we changed from a mutex to a spinlock, which fixed a very similar case for non-paravirtualized cases. Linus -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/