Subject: Re: frequent lockups in 3.18rc4
From: Linus Torvalds
Date: Mon, 15 Dec 2014 15:46:41 -0800
To: Dave Jones, Linus Torvalds, Chris Mason, Mike Galbraith, Ingo Molnar, Peter Zijlstra, Dâniel Fraga, Sasha Levin, "Paul E. McKenney", Linux Kernel Mailing List, Suresh Siddha, Oleg Nesterov, Peter Anvin
References: <20141211145408.GB16800@redhat.com> <20141212185454.GB4716@redhat.com>
 <20141213165915.GA12756@redhat.com> <20141213223616.GA22559@redhat.com>
 <20141214234654.GA396@redhat.com> <20141215055707.GA26225@redhat.com>

On Mon, Dec 15, 2014 at 10:21 AM, Linus Torvalds wrote:
>
> So let's just fix it. Here's a completely untested patch.

So after looking at this more, I'm actually really convinced that this
was a pretty nasty bug.

I'm *not* convinced that it's necessarily *your* bug, but I still think
it could be.

I cleaned up the patch a bit, split it up into two to clarify it, and
have committed it to my tree.

I'm not marking the patches for stable, because while I'm convinced
it's a bug, I'm also not sure why even if it triggers it doesn't
eventually recover when the IO completes. So I'd mark them for stable
only if they are actually confirmed to fix anything in the wild, and
after they've gotten some testing in general.

The patches *look* straightforward, they remove more lines than they
add, and I think the code is more understandable too, but maybe I just
screwed up. Whatever. Some care is warranted, but this is the first
time I feel like I actually fixed something that matched at least one
of your lockup symptoms.

Anyway, it's there as

  26178ec11ef3 ("x86: mm: consolidate VM_FAULT_RETRY handling")
  7fb08eca4527 ("x86: mm: move mmap_sem unlock from mm_fault_error() to caller")

and I'll continue to look at the page fault patch. I still have a
slight worry that it's something along the lines of corrupted page
tables or some core VM issue, but apart from my general nervousness
about the auto-numa code (which will be cleaned up eventually through
the pte_protnone patches), I can't actually see how you'd get into
endless page faults any other way. So I'm really hoping that the buggy
VM_FAULT_RETRY handling explains it.

But me not seeing any other bug clearly doesn't mean it doesn't exist.

                     Linus