Return-Path:
Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1752034AbaKYFgI (ORCPT ); Tue, 25 Nov 2014 00:36:08 -0500
Received: from cantor2.suse.de ([195.135.220.15]:59690 "EHLO mx2.suse.de" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1751990AbaKYFgG (ORCPT ); Tue, 25 Nov 2014 00:36:06 -0500
Message-ID: <54741540.3010607@suse.com>
Date: Tue, 25 Nov 2014 06:36:00 +0100
From: Jürgen Groß
User-Agent: Mozilla/5.0 (X11; Linux x86_64; rv:31.0) Gecko/20100101 Thunderbird/31.2.0
MIME-Version: 1.0
To: torvalds@linux-foundation.org
CC: Konrad Rzeszutek Wilk , Josh Boyer , Andy Lutomirski , Linus Torvalds , Steven Rostedt , Tejun Heo , "linux-kernel@vger.kernel.org" , Thomas Gleixner , Peter Zijlstra , Frederic Weisbecker , Don Zickus , Dave Jones , the arch/x86 maintainers
Subject: Re: frequent lockups in 3.18rc4
References: <20141121142301.564f7eb7@gandalf.local.home> <20141124184856.GA9349@laptop.dumpdata.com>
In-Reply-To: <20141124184856.GA9349@laptop.dumpdata.com>
Content-Type: text/plain; charset=windows-1252; format=flowed
Content-Transfer-Encoding: 7bit
Sender: linux-kernel-owner@vger.kernel.org
List-ID:
X-Mailing-List: linux-kernel@vger.kernel.org

On 11/24/2014 07:48 PM, Konrad Rzeszutek Wilk wrote:
> On Fri, Nov 21, 2014 at 03:23:13PM -0500, Josh Boyer wrote:
>> On Fri, Nov 21, 2014 at 3:16 PM, Andy Lutomirski wrote:
>>> On Fri, Nov 21, 2014 at 12:14 PM, Josh Boyer wrote:
>>>> On Fri, Nov 21, 2014 at 2:52 PM, Andy Lutomirski wrote:
>>>>> On Fri, Nov 21, 2014 at 11:46 AM, Linus Torvalds
>>>>> wrote:
>>>>>> On Fri, Nov 21, 2014 at 11:34 AM, Linus Torvalds
>>>>>> wrote:
>>>>>>>
>>>>>>> So I kind of agree, but it wouldn't be my primary worry. My primary
>>>>>>> worry is actually paravirt doing something insane.
>>>>>>
>>>>>> Btw, on that tangent, does anybody actually care about paravirt any more?
>>>>>>

Funny, during testing some patches related to Xen I hit the lockup issue.
It looked a little bit different, but a variation of your patch solved
my problem.

The difference from the original report might be due to the rather low
system load during my test, so the system was still responsive when the
first lockup messages appeared. I could see that the hanging cpus were
spinning in pmd_lock() called during __handle_mm_fault().

Without the patch below I could reliably reproduce the issue within a
few minutes. With it the machine has survived 12 hours and is still
running.

Why my test would trigger the problem so fast I have no idea. I saw it
only on a rather huge machine (128GB memory, 120 cpus), which is quite
understandable. My test remapped some pages via the hypervisor and
removed those mappings again. Perhaps the TLB flushing involved in these
operations is triggering the problem.


diff --git a/arch/x86/mm/fault.c b/arch/x86/mm/fault.c
index d973e61..b847ff7 100644
--- a/arch/x86/mm/fault.c
+++ b/arch/x86/mm/fault.c
@@ -377,7 +377,7 @@ static noinline int vmalloc_fault(unsigned long address)
 	 * happen within a race in page table update. In the later
 	 * case just flush:
 	 */
-	pgd = pgd_offset(current->active_mm, address);
+	pgd = (pgd_t *)__va(read_cr3()) + pgd_index(address);
 	pgd_ref = pgd_offset_k(address);
 	if (pgd_none(*pgd_ref))
 		return -1;


Juergen
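
For readers who do not have the page table helpers in their head, here is a small userspace sketch of what the changed line computes. It is not kernel code. All sketch_* names are made up for illustration, and the constants assume x86-64 with 4-level paging (PGDIR_SHIFT of 39, 512 entries per PGD). The only point is that both variants use the same index and differ solely in which top-level table they start from: the one recorded in the mm, or the one the CPU currently has loaded in CR3.

```c
#include <assert.h>
#include <stdint.h>

/* Assumption: x86-64, 4-level paging. */
#define SKETCH_PGDIR_SHIFT  39
#define SKETCH_PTRS_PER_PGD 512

/* Hypothetical stand-ins for pgd_t, an mm and a CPU. */
typedef struct { uint64_t pgd; } sketch_pgd_t;
struct sketch_mm  { sketch_pgd_t *pgd; };        /* table the mm records   */
struct sketch_cpu { sketch_pgd_t *cr3_table; };  /* table the CPU has loaded */

/* Top-level index covering a virtual address (models pgd_index()). */
static inline unsigned int sketch_pgd_index(uint64_t address)
{
	return (address >> SKETCH_PGDIR_SHIFT) & (SKETCH_PTRS_PER_PGD - 1);
}

/* Old line: pgd_offset(current->active_mm, address). */
static inline sketch_pgd_t *sketch_pgd_from_mm(const struct sketch_mm *mm,
					       uint64_t address)
{
	return mm->pgd + sketch_pgd_index(address);
}

/* New line: (pgd_t *)__va(read_cr3()) + pgd_index(address). */
static inline sketch_pgd_t *sketch_pgd_from_cr3(const struct sketch_cpu *cpu,
						uint64_t address)
{
	return cpu->cr3_table + sketch_pgd_index(address);
}
```

As long as the mm's table and the loaded table are the same, both helpers return the same slot. If they ever disagree, for whatever reason, only the CR3-based lookup touches the table the hardware is really walking. Whether that is what happened in my test is an assumption, not something I have verified.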