From: Wanpeng Li
Date: Tue, 20 Aug 2019 15:03:55 +0800
Subject: Re: [PATCH v2] mm: hwpoison: disable memory error handling on 1GB hugepage
In-Reply-To: <20190610235045.GB30991@hori.linux.bs1.fc.nec.co.jp>
To: Naoya Horiguchi
Cc: Mike Kravetz, Michael Ellerman, Andrew Morton, Punit Agrawal, "linux-mm@kvack.org", Michal Hocko, "Aneesh Kumar K.V", Anshuman Khandual, "linux-kernel@vger.kernel.org", Benjamin Herrenschmidt, "linuxppc-dev@lists.ozlabs.org", kvm, Paolo Bonzini, Xiao Guangrong, "lidongchen@tencent.com", "yongkaiwu@tencent.com", Mel Gorman, "Kirill A.
 Shutemov", "Hansen, Dave", Hugh Dickins

Cc Mel Gorman, Kirill, Dave Hansen,

On Tue, 11 Jun 2019 at 07:51, Naoya Horiguchi wrote:
>
> On Wed, May 29, 2019 at 04:31:01PM -0700, Mike Kravetz wrote:
> > On 5/28/19 2:49 AM, Wanpeng Li wrote:
> > > Cc Paolo,
> > > Hi all,
> > > On Wed, 14 Feb 2018 at 06:34, Mike Kravetz wrote:
> > >>
> > >> On 02/12/2018 06:48 PM, Michael Ellerman wrote:
> > >>> Andrew Morton writes:
> > >>>
> > >>>> On Thu, 08 Feb 2018 12:30:45 +0000 Punit Agrawal wrote:
> > >>>>
> > >>>>>>
> > >>>>>> So I don't think that the above test result means that errors are properly
> > >>>>>> handled, and the proposed patch should help for arm64.
> > >>>>>
> > >>>>> Although, the deviation of pud_huge() avoids a kernel crash the code
> > >>>>> would be easier to maintain and reason about if arm64 helpers are
> > >>>>> consistent with expectations by core code.
> > >>>>>
> > >>>>> I'll look to update the arm64 helpers once this patch gets merged. But
> > >>>>> it would be helpful if there was a clear expression of semantics for
> > >>>>> pud_huge() for various cases. Is there any version that can be used as
> > >>>>> reference?
> > >>>>
> > >>>> Is that an ack or tested-by?
> > >>>>
> > >>>> Mike keeps plaintively asking the powerpc developers to take a look,
> > >>>> but they remain steadfastly in hiding.
> > >>>
> > >>> Cc'ing linuxppc-dev is always a good idea :)
> > >>>
> > >>
> > >> Thanks Michael,
> > >>
> > >> I was mostly concerned about use cases for soft/hard offline of huge pages
> > >> larger than PMD_SIZE on powerpc. I know that powerpc supports PGD_SIZE
> > >> huge pages, and soft/hard offline support was specifically added for this.
> > >> See, 94310cbcaa3c "mm/madvise: enable (soft|hard) offline of HugeTLB pages
> > >> at PGD level"
> > >>
> > >> This patch will disable that functionality.
> > >> So, at a minimum this is a 'heads up'. If there are actual use cases
> > >> that depend on this, then more work/discussions will need to happen.
> > >> From the e-mail thread on PGD_SIZE support, I can not tell if there is
> > >> a real use case or this is just a 'nice to have'.
> > >
> > > 1GB hugetlbfs pages are used by DPDK and VMs in cloud deployment, we
> > > encounter gup_pud_range() panic several times in product environment.
> > > Is there any plan to reenable and fix arch codes?
> >
> > I too am aware of slightly more interest in 1G huge pages. Suspect that as
> > Intel MMU capacity increases to handle more TLB entries there will be more
> > and more interest.
> >
> > Personally, I am not looking at this issue. Perhaps Naoya will comment as
> > he knows most about this code.
>
> Thanks for forwarding this to me, I'm feeling that memory error handling
> on 1GB hugepage is demanded as real use case.
>
> > > In addition, https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/arch/x86/kvm/mmu.c#n3213
> > > The memory in guest can be 1GB/2MB/4K, though the host-backed memory
> > > are 1GB hugetlbfs pages, after above PUD panic is fixed,
> > > try_to_unmap() which is called in MCA recovery path will mark the PUD
> > > hwpoison entry. The guest will vmexit and retry endlessly when
> > > accessing any memory in the guest which is backed by this 1GB poisoned
> > > hugetlbfs page. We have a plan to split this 1GB hugetlbfs page by 2MB
> > > hugetlbfs pages/4KB pages, maybe file remap to a virtual address range
> > > which is 2MB/4KB page granularity, also split the KVM MMU 1GB SPTE
> > > into 2MB/4KB and mark the offensive SPTE w/ a hwpoison flag, a sigbus
> > > will be delivered to VM at page fault next time for the offensive
> > > SPTE. Is this proposal acceptable?
> >
> > I am not sure of the error handling design, but this does sound reasonable.
>
> I agree that that's better.
>
> > That block of code which potentially dissolves a huge page on memory error
> > is hard to understand and I'm not sure if that is even the 'normal'
> > functionality. Certainly, we would hate to waste/poison an entire 1G page
> > for an error on a small subsection.
>
> Yes, that's not practical, so we need to first establish the code base for
> 2GB hugetlb splitting and then extend it to 1GB next.

I found it is not easy to split: there is a single hugetlb page size
associated with each mounted hugetlbfs filesystem, so remapping the file
at 2MB/4KB granularity would break that. How about hard offlining the 1GB
hugetlb page the same way soft offline already does: replace the corrupted
1GB page with a new 1GB page through page migration. The offending/corrupted
area of the original 1GB page does not need to be copied into the new page;
that area in the new page can stay zero-filled, just as it is cleared during
hugetlb page fault, and the other sub-pages of the original 1GB page can be
freed back to the buddy system. A SIGBUS is sent to userspace with the
offending/corrupted virtual address and signal code, and userspace should
take care of it.

Regards,
Wanpeng Li
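P.S. The hard-offline-by-migration flow described above could be sketched
roughly as follows. This is pure pseudocode: every function name in it is
hypothetical, chosen for illustration only, and none of them are existing
mm/ or hugetlb interfaces.

```
/* Hypothetical sketch: hard offline of a 1GB hugetlb page by migrating
 * it to a fresh 1GB page, skipping the corrupted region. */
int hard_offline_1gb_hugetlb(struct page *old, unsigned long bad_pfn)
{
        struct page *new = alloc_fresh_1gb_hugetlb_page();
        if (!new)
                return -ENOMEM;

        /* Copy every 4KB sub-page except the corrupted one; the matching
         * area in the new page stays zero-filled, just as it would be
         * after a fresh hugetlb page fault. */
        for each 4KB sub-page i of old:
                if (pfn_of(old, i) != bad_pfn)
                        copy_subpage(new, old, i);

        migrate_mappings(old, new);        /* repoint file/PTEs at new page */
        free_good_subpages_to_buddy(old);  /* poisoned 4KB page stays isolated */

        /* Tell userspace which virtual address was lost. */
        send_sigbus(BUS_MCEERR_AR, vaddr_of(bad_pfn));
        return 0;
}
```

The point of the sketch is that only the single corrupted 4KB region is
lost; the rest of the 1GB mapping survives the migration, and no splitting
of the hugetlb page size is required.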