References: <20230511182426.1898675-1-axelrasmussen@google.com>
From: Axel Rasmussen
Date: Thu, 18 May 2023 13:38:09 -0700
Subject: Re: [PATCH 1/3] mm: userfaultfd: add new UFFDIO_SIGBUS ioctl
To: Peter Xu
Cc: Jiaqi Yan, James Houghton, Alexander Viro, Andrew Morton, Christian Brauner, David Hildenbrand, Hongchen Zhang, Huang Ying, "Liam R.
Howlett", Miaohe Lin, "Mike Rapoport (IBM)", Nadav Amit, Naoya Horiguchi, Shuah Khan, ZhangPeng, linux-fsdevel@vger.kernel.org, linux-kernel@vger.kernel.org, linux-mm@kvack.org, linux-kselftest@vger.kernel.org, Anish Moorthy
X-Mailing-List: linux-kernel@vger.kernel.org

On Thu, May 18, 2023 at 9:05 AM Peter Xu wrote:
>
> On Wed, May 17, 2023 at 05:43:53PM -0700, Jiaqi Yan wrote:
> > On Wed, May 17, 2023 at 3:29 PM Axel Rasmussen wrote:
> > >
> > > On Wed, May 17, 2023 at 3:20 PM Peter Xu wrote:
> > > >
> > > > On Wed, May 17, 2023 at 06:12:33PM -0400, Peter Xu wrote:
> > > > > On Thu, May 11, 2023 at 03:00:09PM -0700, James Houghton wrote:
> > > > > > On Thu, May 11, 2023 at 11:24 AM Axel Rasmussen
> > > > > > wrote:
> > > > > > >
> > > > > > > So the basic way to use this new feature is:
> > > > > > >
> > > > > > > - On the new host, the guest's memory is registered with userfaultfd, in
> > > > > > >   either MISSING or MINOR mode (doesn't really matter for this purpose).
> > > > > > > - On any first access, we get a userfaultfd event. At this point we can
> > > > > > >   communicate with the old host to find out if the page was poisoned.
> > > > > > > - If so, we can respond with a UFFDIO_SIGBUS - this places a swap marker
> > > > > > >   so any future accesses will SIGBUS. Because the pte is now "present",
> > > > > > >   future accesses won't generate more userfaultfd events, they'll just
> > > > > > >   SIGBUS directly.
> > > > > >
> > > > > > I want to clarify the SIGBUS mechanism here when KVM is involved,
> > > > > > keeping in mind that we need to be able to inject an MCE into the
> > > > > > guest for this to be useful.
> > > > > >
> > > > > > 1. vCPU gets an EPT violation --> KVM attempts GUP.
> > > > > > 2. GUP finds a PTE_MARKER_UFFD_SIGBUS and returns VM_FAULT_SIGBUS.
> > > > > > 3. KVM finds that GUP failed and returns -EFAULT.
> > > > > >
> > > > > > This is different than if GUP found poison, in which case KVM will
> > > > > > actually queue up a SIGBUS *containing the address of the fault*, and
> > > > > > userspace can use it to inject an appropriate MCE into the guest. With
> > > > > > UFFDIO_SIGBUS, we are missing the address!
> > > > > >
> > > > > > I see three options:
> > > > > > 1. Make KVM_RUN queue up a signal for any VM_FAULT_SIGBUS. I think
> > > > > > this is pointless.
> > > > > > 2. Don't have UFFDIO_SIGBUS install a PTE entry, but instead have a
> > > > > > UFFDIO_WAKE_MODE_SIGBUS, where upon waking, we return VM_FAULT_SIGBUS
> > > > > > instead of VM_FAULT_RETRY. We will keep getting userfaults on repeated
> > > > > > accesses, just like how we get repeated signals for real poison.
> > > > > > 3. Use this in conjunction with the additional KVM EFAULT info that
> > > > > > Anish proposed (the first part of [1]).
> > > > > >
> > > > > > I think option 3 is fine. :)
> > > > >
> > > > > Or... option 4) just to use either MADV_HWPOISON or hwpoison-inject? :)
> > > >
> > > > I just remember Axel mentioned this in the commit message, and just in case
> > > > this is why option 4) was ruled out:
> > > >
> > > >         They expect that once poisoned, pages can never become
> > > >         "un-poisoned". So, when we live migrate the VM, we need to preserve
> > > >         the poisoned status of these pages.
> > > >
> > > > Just to supplement on this point: we do have unpoison (echoing to
> > > > "debug/hwpoison/hwpoison_unpoison"), or am I wrong?
> >
> > If I read unpoison_memory() correctly, once there is a real hardware
> > memory corruption (hw_memory_failure will be set), unpoison will stop
> > working and return EOPNOTSUPP.
> >
> > I know some cloud providers evacuating VMs once a single memory error
> > happens, so not supporting unpoison is probably not a big deal for
> > them. BUT others do keep VM running until more errors show up later,
> > which could be long after the 1st error.
>
> We're talking about postcopy migrating a VM has poisoned page on src,
> rather than on dst host, am I right? IOW, the dest hwpoison should be
> fake.
>
> If so, then I would assume that's the case where all the pages on the dest
> host is still all good (so hw_memory_failure not yet set, or I doubt the
> judgement of being a migration target after all)?
>
> The other thing is even if dest host has hw poisoned page, I'm not sure
> whether hw_memory_failure is the only way to solve this.
>
> I saw that this is something got worked on before from Zhenwei, David used
> to have some reasoning on why it was suggested like using a global knob:
>
> https://lore.kernel.org/all/d7927214-e433-c26d-7a9c-a291ced81887@redhat.com/
>
> Two major issues here afaics:
>
>   - Zhenwei's approach only considered x86 hwpoison - it relies on kpte
>     having !present in entries but that's x86 specific rather than generic
>     to memory_failure.c.
>
>   - It is _assumed_ that hwpoison injection is for debugging only.
>
> I'm not sure whether you can fix 1) by some other ways, e.g., what if the
> host just remember all the hardware poisoned pfns (or remember
> soft-poisoned ones, but then here we need to be careful on removing them
> from the list when it's hwpoisoned for real)? It sounds like there's
> opportunity on providing a generic solution rather than relying on
> !pte_present().
>
> For 2) IMHO that's not a big issue, you can declare it'll be used in !debug
> but production systems so as to boost the feature importance with a real
> use case.
>
> So far I'd say it'll be great to leverage what it's already there in linux
> and make it as generic as possible. The only issue is probably
> CAP_ADMIN... not sure whether we can have some way to provide !ADMIN
> somehow, or you can simply work around this issue.

As you mention below, I think the key distinction is the scope - I think
MADV_HWPOISON affects the whole system, including other processes. For
our purposes, we really just want to "poison" this particular virtual
address (the HVA, from the VM's perspective), not even other mappings
of the same shared memory. I think that behavior is different from
MADV_HWPOISON, at least.

>
> >
> > > >
> > > > >
> > > > > Besides what James mentioned on "missing addr", I didn't quickly see what's
> > > > > the major difference comparing to the old hwpoison injection methods even
> > > > > without the addr requirement. If we want the addr for MCE then it's more of
> > > > > a question to ask.
> > > > >
> > > > > I also didn't quickly see why for whatever new way to inject a pte error we
> > > > > need to have it registered with uffd. Could it be something like
> > > > > MADV_PGERR (even if MADV_HWPOISON won't suffice) so you can inject even
> > > > > without an userfault context (but still usable when uffd registered)?
> > > > >
> > > > > And it'll be always nice to have a cover letter too (if there'll be a new
> > > > > version) explaining the bits.
> > >
> > > I do plan a v2, if for no other reason than to update the
> > > documentation. Happy to add a cover letter with it as well.
> > >
> > > +Jiaqi back to CC, this is one piece of a larger memory poisoning /
> > > recovery design Jiaqi is working on, so he may have some ideas why
> > > MADV_HWPOISON or MADV_PGERR will or won't work.
> >
> > Per https://man7.org/linux/man-pages/man2/madvise.2.html,
> > MADV_HWPOISON "is available only for privileged (CAP_SYS_ADMIN)
> > processes." So for a non-root VMM, MADV_HWPOISON is out of option.
>
> It makes sense to me especially when the page can be shared with other
> tasks.
>
> >
> > Another issue with MADV_HWPOISON is, it requires to first successfully
> > get_user_pages_fast(). I don't think it will work if memory is not
> > mapped yet.
>
> Fair point, so probably current MADV_HWPOISON got ruled out.
> hwpoison-inject seems fine where only the PFN is needed rather than the
> pte. But same issue on CAP_ADMIN indeed.
>
> >
> > With the UFFDIO_SIGBUS feature introduced in this patchset, it may
> > even be possible to free the emulated-hwpoison page back to the kernel
> > so we don't lose a 4K page.
> >
> > I didn't find any ref/doc for MADV_PGERR. Is it something you suggest
> > to build, Peter?
>
> That's something I made up just to show my question on why such an
> interface (even if wanted) needs to be bound to userfaultfd, e.g. a
> madvise() seems working if someone solely want to install a poisoned pte.

I look at it a bit differently... Even existing UFFDIO_* operations
could technically be separated from userfaultfd. You could imagine a
MADV_MAP_PAGE instead of UFFDIO_CONTINUE. UFFDIO_COPY is a bit trickier
since it takes an argument, but it could be done with
process_madvise().

(Granted, I'm not sure this would be useful... But this is equally true
for UFFDIO_SIGBUS; it seems non-live-migration use cases could use
MADV_HWPOISON, and for live migration use cases we will be using UFFD.)

We've sort of set up a convention with userfaultfd where at a high
level users are supposed to:

1. Receive events from the uffd
2. Resolve those events with UFFDIO_* ioctls
3. Wake up with UFFDIO_WAKE to retry the fault that generated the
   original event (can be combined with step 2 of course)

So for me, even if MADV_PGERR or similar existed, I would be tempted to
add a UFFDIO_SIGBUS as well, even if it just calls the same underlying
function to do the same thing, if only for consistency (with the idea
"UFFD events are resolved by UFFD ioctls") from the user's perspective.

>
> IIUC even with an madvise one may not need CAP_ADMIN since we can limit the
> op to current mm only, I assume it's safe.
>
> Here you'd want to return VM_FAULT_HWPOISON for whatever swap pte you'd
> like to install (in do_swap_page) with whatever new interface (assuming
> still a new madvise). As James mentioned, I think KVM liked that to
> recognize -EHWPOISON from -EFAULT. I'd say we can even consider reusing
> PTE_MARKER_SWAPIN_ERROR to let it just return VM_FAULT_HWPOISON directly if
> so.
>
> Thanks,
>
> --
> Peter Xu
>
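
To illustrate the "UFFD events are resolved by UFFD ioctls" convention and the per-mapping scope argued for earlier in this message, here is a rough toy model in Python. It is only a sketch under stated assumptions, not kernel code and not the real userfaultfd API: `Pte`, `Mapping`, `resolve_continue`, `resolve_sigbus` and `handle_events` are hypothetical names made up for illustration, and `resolve_sigbus` merely stands in for the UFFDIO_SIGBUS ioctl proposed in this series.

```python
from enum import Enum, auto


class Pte(Enum):
    MISSING = auto()        # nothing installed yet; access queues a uffd event
    PRESENT = auto()        # page mapped; access succeeds
    SIGBUS_MARKER = auto()  # marker installed; access gets SIGBUS


class Mapping:
    """Toy model of one uffd-registered mapping (hypothetical, not kernel code)."""

    def __init__(self, num_pages):
        self.ptes = [Pte.MISSING] * num_pages
        self.events = []  # page indexes queued as userfaultfd events

    def access(self, page):
        state = self.ptes[page]
        if state is Pte.MISSING:
            self.events.append(page)  # step 1: the monitor receives an event
            return "blocked"
        if state is Pte.SIGBUS_MARKER:
            return "SIGBUS"           # no new event is queued for a marker
        return "ok"

    def resolve_continue(self, page):
        """Stands in for UFFDIO_CONTINUE / UFFDIO_COPY: map the page."""
        self.ptes[page] = Pte.PRESENT

    def resolve_sigbus(self, page):
        """Stands in for the proposed UFFDIO_SIGBUS: install a marker."""
        self.ptes[page] = Pte.SIGBUS_MARKER


def handle_events(mapping, poisoned_on_source):
    """Steps 2 and 3: resolve each queued event, then let the fault retry."""
    results = {}
    while mapping.events:
        page = mapping.events.pop(0)
        if page in poisoned_on_source:
            mapping.resolve_sigbus(page)
        else:
            mapping.resolve_continue(page)
        results[page] = mapping.access(page)  # the retried fault
    return results


# Two mappings of the same shared memory; only `guest` gets the marker.
guest, other = Mapping(4), Mapping(4)
guest.access(0)
guest.access(2)
outcome = handle_events(guest, poisoned_on_source={2})
```

In this model the marker lives in one mapping's page-table state, so `other` is untouched, and a second access to the marked page in `guest` gets SIGBUS without queueing another event. Whether the real implementation behaves exactly this way depends on the final patch.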