In-Reply-To: <1520548101.2693.106.camel@hpe.com>
References: <87a7vi1f3h.fsf@kerf.amer.corp.natinst.com> <1520548101.2693.106.camel@hpe.com>
From: Andy Lutomirski
Date: Thu, 8 Mar 2018 22:38:32 +0000
Subject: Re: Kernel page fault in vmalloc_fault() after a preempted ioremap
To: "Kani, Toshi"
Cc: "linux-kernel@vger.kernel.org", "linux-mm@kvack.org",
    "gratian.crisan@ni.com", "mingo@kernel.org", "peterz@infradead.org",
    "julia.cartwright@ni.com", "torvalds@linux-foundation.org",
    "tglx@linutronix.de", "bp@suse.de", "akpm@linux-foundation.org",
    "hpa@zytor.com", "brgerst@gmail.com", "luto@kernel.org",
    "dave.hansen@intel.com", "dvlasenk@redhat.com", "gratian@gmail.com"
X-Mailing-List: linux-kernel@vger.kernel.org

On Thu, Mar 8, 2018 at 9:43 PM, Kani, Toshi wrote:
> On Thu, 2018-03-08 at 14:34 -0600, Gratian Crisan wrote:
>> Hi all,
>>
>> We are seeing kernel page faults happening on module loads with certain
>> drivers like the i915 video driver[1]. This was initially discovered on
>> a 4.9 PREEMPT_RT kernel. It takes 5 days on average to reproduce using a
>> simple reboot loop test. Looking at the code paths involved I believe
>> the issue is still present in the latest vanilla kernel.
>>
>> Some relevant points are:
>>
>>   * x86_64 CPU: Intel Atom E3940
>>
>>   * CONFIG_HUGETLBFS is not set (which also gates CONFIG_HUGETLB_PAGE)
>>
>> Based on function traces I was able to gather the sequence of events is:
>>
>>   1. Driver starts a ioremap operation for a region that is PMD_SIZE in
>>   size (or PUD_SIZE).
>>
>>   2. The ioremap() operation is preempted while it's in the middle of
>>   setting up the page mappings:
>>   ioremap_page_range->...->ioremap_pmd_range->pmd_set_huge <>
>>
>>   3. Unrelated tasks run. Traces also include some cross core scheduling
>>   IPI calls.
>>
>>   4. Driver resumes execution finishes the ioremap operation and tries to
>>   access the newly mapped IO region. This triggers a vmalloc fault.
>>
>>   5. The vmalloc_fault() function hits a kernel page fault when trying to
>>   dereference a non-existent *pte_ref.
>>
>> The reason this happens is the code paths called from ioremap_page_range()
>> make different assumptions about when a large page (pud/pmd) mapping can be
>> used versus the code paths in vmalloc_fault().
>>
>> Using the PMD sized ioremap case as an example (the PUD case is similar):
>> ioremap_pmd_range() calls ioremap_pmd_enabled() which is gated by
>> CONFIG_HAVE_ARCH_HUGE_VMAP. On x86_64 this will return true unless the
>> "nohugeiomap" kernel boot parameter is passed in.
>>
>> On the other hand, in the rare case when a page fault happens in the
>> ioremap'ed region, vmalloc_fault() calls the pmd_huge() function to check
>> if a PMD page is marked huge or if it should go on and get a reference to
>> the PTE. However pmd_huge() is conditionally compiled based on the user
>> configured CONFIG_HUGETLB_PAGE selected by CONFIG_HUGETLBFS. If the
>> CONFIG_HUGETLBFS option is not enabled pmd_huge() is always defined to be
>> 0.
>>
>> The end result is an OOPS in vmalloc_fault() when the non-existent pte_ref
>> is dereferenced because the test for pmd_huge() failed.
>>
>> Commit f4eafd8bcd52 ("x86/mm: Fix vmalloc_fault() to handle large pages
>> properly") attempted to fix the mismatch between ioremap() and
>> vmalloc_fault() with regards to huge page handling but it missed this use
>> case.
>>
>> I am working on a simpler reproducing case however so far I've been
>> unsuccessful in re-creating the conditions that trigger the vmalloc fault
>> in the first place. Adding explicit scheduling points in
>> ioremap_pmd_range/pmd_set_huge doesn't seem to be sufficient. Ideas
>> appreciated.
>>
>> Any thoughts on what a correct fix would look like? Should the ioremap
>> code paths respect the HUGETLBFS config or would it be better for the
>> vmalloc fault code paths to match the tests used in ioremap and not rely
>> on the HUGETLBFS option being enabled?
>
> Thanks for the report and analysis!  I believe pud_large() and
> pmd_large() should have been used here.  I will try to reproduce the
> issue and verify the fix.

Indeed.  I find myself wondering why pud_huge() exists at all.

While you're at it, I think there may be more bugs in there.
Specifically, the code walks the reference and current tables at the
same time without any synchronization and without READ_ONCE()
protection.  I think that all of the BUG() calls below the comment:

    /*
     * Below here mismatches are bugs because these lower tables
     * are shared:
     */

are bogus and could be hit due to races.  I also think they're
pointless -- we've already asserted that the reference and loaded
tables are literally the same pointers.

I think the right fix is to remove pud_ref, pmd_ref and pte_ref
entirely and to get rid of those BUG() calls.

What do you think?
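
For illustration, the pmd_huge() vs. pmd_large() mismatch described in
the quoted report can be modelled in user space.  This is only a sketch,
not kernel code: the model_* names and MODEL_PAGE_* constants are
hypothetical stand-ins, the pmd_huge() model is simplified to a PSE-bit
test, and the CONFIG_HUGETLB_PAGE compile-time gate is turned into a
runtime flag so that both configurations can be compared side by side.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical stand-ins for the x86 page-table bits (present = bit 0,
 * PSE / "large page" = bit 7).  Not the kernel's own definitions. */
#define MODEL_PAGE_PRESENT (1ULL << 0)
#define MODEL_PAGE_PSE     (1ULL << 7)

typedef struct { uint64_t val; } model_pmd_t;

enum model_walk { MODEL_STOP_AT_PMD, MODEL_DESCEND_TO_PTE };

/* Build a PMD entry that maps a large page, as a huge ioremap would. */
static model_pmd_t model_make_large_pmd(void)
{
    model_pmd_t pmd = { MODEL_PAGE_PRESENT | MODEL_PAGE_PSE };
    return pmd;
}

/* Models pmd_large(): looks only at the entry, no config dependency. */
static bool model_pmd_large(model_pmd_t pmd)
{
    return (pmd.val & MODEL_PAGE_PSE) != 0;
}

/* Models pmd_huge(): always false when hugetlb support is configured
 * out (assumption: the real check is simplified to a PSE test here). */
static bool model_pmd_huge(model_pmd_t pmd, bool hugetlb_page_enabled)
{
    if (!hugetlb_page_enabled)
        return false;
    return (pmd.val & MODEL_PAGE_PSE) != 0;
}

/* Fault-handler decision using the config-gated test. */
static enum model_walk model_walk_with_huge(model_pmd_t pmd,
                                            bool hugetlb_page_enabled)
{
    return model_pmd_huge(pmd, hugetlb_page_enabled)
               ? MODEL_STOP_AT_PMD
               : MODEL_DESCEND_TO_PTE;
}

/* Fault-handler decision using the config-independent test. */
static enum model_walk model_walk_with_large(model_pmd_t pmd)
{
    return model_pmd_large(pmd) ? MODEL_STOP_AT_PMD
                                : MODEL_DESCEND_TO_PTE;
}
```

With the hugetlb flag off, model_walk_with_huge() descends to the PTE
level even though the entry maps a large page, which corresponds to the
bogus pte_ref dereference in the report.  model_walk_with_large() stops
at the PMD regardless of that flag.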