Date: Thu, 18 May 2023 12:05:08 -0400
From: Peter Xu
To: Jiaqi Yan
Cc: Axel Rasmussen, James Houghton, Alexander Viro, Andrew Morton,
    Christian Brauner, David Hildenbrand, Hongchen Zhang, Huang Ying,
    "Liam R. Howlett", Miaohe Lin, "Mike Rapoport (IBM)", Nadav Amit,
    Naoya Horiguchi, Shuah Khan, ZhangPeng,
    linux-fsdevel@vger.kernel.org, linux-kernel@vger.kernel.org,
    linux-mm@kvack.org, linux-kselftest@vger.kernel.org, Anish Moorthy
Subject: Re: [PATCH 1/3] mm: userfaultfd: add new UFFDIO_SIGBUS ioctl
References: <20230511182426.1898675-1-axelrasmussen@google.com>
X-Mailing-List: linux-kernel@vger.kernel.org

On Wed, May 17, 2023 at 05:43:53PM -0700, Jiaqi Yan wrote:
> On Wed, May 17, 2023 at 3:29 PM Axel Rasmussen wrote:
> >
> > On Wed, May 17, 2023 at 3:20 PM Peter Xu wrote:
> > >
> > > On Wed, May 17, 2023 at 06:12:33PM -0400, Peter Xu wrote:
> > > > On Thu, May 11, 2023 at 03:00:09PM -0700, James Houghton wrote:
> > > > > On Thu, May 11, 2023 at 11:24 AM Axel Rasmussen wrote:
> > > > > >
> > > > > > So the basic way to use this new feature is:
> > > > > >
> > > > > > - On the new host, the guest's memory is registered with userfaultfd, in
> > > > > >   either MISSING or MINOR mode (doesn't really matter for this purpose).
> > > > > > - On any first access, we get a userfaultfd event. At this point we can
> > > > > >   communicate with the old host to find out if the page was poisoned.
> > > > > > - If so, we can respond with a UFFDIO_SIGBUS - this places a swap marker
> > > > > >   so any future accesses will SIGBUS. Because the pte is now "present",
> > > > > >   future accesses won't generate more userfaultfd events, they'll just
> > > > > >   SIGBUS directly.
> > > > >
> > > > > I want to clarify the SIGBUS mechanism here when KVM is involved,
> > > > > keeping in mind that we need to be able to inject an MCE into the
> > > > > guest for this to be useful.
> > > > >
> > > > > 1. vCPU gets an EPT violation --> KVM attempts GUP.
> > > > > 2. GUP finds a PTE_MARKER_UFFD_SIGBUS and returns VM_FAULT_SIGBUS.
> > > > > 3. KVM finds that GUP failed and returns -EFAULT.
> > > > >
> > > > > This is different than if GUP found poison, in which case KVM will
> > > > > actually queue up a SIGBUS *containing the address of the fault*, and
> > > > > userspace can use it to inject an appropriate MCE into the guest. With
> > > > > UFFDIO_SIGBUS, we are missing the address!
> > > > >
> > > > > I see three options:
> > > > > 1. Make KVM_RUN queue up a signal for any VM_FAULT_SIGBUS. I think
> > > > >    this is pointless.
> > > > > 2. Don't have UFFDIO_SIGBUS install a PTE entry, but instead have a
> > > > >    UFFDIO_WAKE_MODE_SIGBUS, where upon waking, we return VM_FAULT_SIGBUS
> > > > >    instead of VM_FAULT_RETRY. We will keep getting userfaults on repeated
> > > > >    accesses, just like how we get repeated signals for real poison.
> > > > > 3. Use this in conjunction with the additional KVM EFAULT info that
> > > > >    Anish proposed (the first part of [1]).
> > > > >
> > > > > I think option 3 is fine. :)
> > > >
> > > > Or... option 4) just to use either MADV_HWPOISON or hwpoison-inject? :)
> > >
> > > I just remember Axel mentioned this in the commit message, and just in case
> > > this is why option 4) was ruled out:
> > >
> > >         They expect that once poisoned, pages can never become
> > >         "un-poisoned". So, when we live migrate the VM, we need to preserve
> > >         the poisoned status of these pages.
> > >
> > > Just to supplement on this point: we do have unpoison (echoing to
> > > "debug/hwpoison/hwpoison_unpoison"), or am I wrong?
>
> If I read unpoison_memory() correctly, once there is a real hardware
> memory corruption (hw_memory_failure will be set), unpoison will stop
> working and return EOPNOTSUPP.
>
> I know some cloud providers evacuating VMs once a single memory error
> happens, so not supporting unpoison is probably not a big deal for
> them. BUT others do keep VM running until more errors show up later,
> which could be long after the 1st error.

We're talking about postcopy migrating a VM that has a poisoned page on
the src host, rather than on the dst host, am I right?  IOW, the dest
hwpoison should be fake.

If so, then I would assume that's the case where all the pages on the dest
host are still good (so hw_memory_failure is not yet set; otherwise I'd
doubt the choice of that host as a migration target in the first place)?

The other thing is, even if the dest host has a hw-poisoned page, I'm not
sure whether hw_memory_failure is the only way to solve this.

I saw that Zhenwei worked on this before, and David gave some reasoning on
why something like a global knob was suggested:

https://lore.kernel.org/all/d7927214-e433-c26d-7a9c-a291ced81887@redhat.com/

Two major issues here afaics:

  1) Zhenwei's approach only considered x86 hwpoison - it relies on the
     kpte being !present in the entries, but that's x86 specific rather
     than generic to memory_failure.c.

  2) It is _assumed_ that hwpoison injection is for debugging only.

I'm not sure whether you can fix 1) in some other way, e.g., what if the
host just remembers all the hardware-poisoned pfns (or remembers the
soft-poisoned ones, but then we need to be careful to remove them from the
list when a page is hwpoisoned for real)?  It sounds like there's an
opportunity to provide a generic solution rather than relying on
!pte_present().

For 2) IMHO that's not a big issue: you can declare that it'll be used on
production (!debug) systems, which boosts the importance of the feature
with a real use case.
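
To make the bookkeeping idea for 1) a bit more concrete, below is a rough
userspace C model, not kernel code.  All names in it (struct poison_set,
poison_mark_soft(), poison_mark_hard(), poison_unpoison(), ...) are made up
for illustration, and the fixed-size table is only an assumption to keep the
sketch short; a real version would sit in memory_failure.c and use a proper
kernel data structure.  It only tries to show the one subtle rule: a pfn
that was soft-poisoned must stop being "soft" once it is hwpoisoned for
real, so that unpoison can never clear a real error.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical model: which pfns are poisoned, and why. */
enum poison_kind { POISON_NONE = 0, POISON_SOFT, POISON_HARD };

#define POISON_SLOTS 64	/* tiny fixed table, illustration only */

struct poison_set {
	unsigned long pfn[POISON_SLOTS];
	enum poison_kind kind[POISON_SLOTS];
};

/* Return the slot tracking pfn, else a free slot, else -1 if full. */
static int poison_find_slot(const struct poison_set *s, unsigned long pfn)
{
	int free_slot = -1;

	assert(s != NULL);
	for (int i = 0; i < POISON_SLOTS; i++) {
		if (s->kind[i] != POISON_NONE && s->pfn[i] == pfn)
			return i;
		if (s->kind[i] == POISON_NONE && free_slot < 0)
			free_slot = i;
	}
	return free_slot;
}

/* Report how pfn is currently tracked (POISON_NONE if it is not). */
static enum poison_kind poison_lookup(const struct poison_set *s,
				      unsigned long pfn)
{
	int i = poison_find_slot(s, pfn);

	return i < 0 ? POISON_NONE : s->kind[i];
}

/* Record an injected (software) poison; never downgrades a real error. */
static bool poison_mark_soft(struct poison_set *s, unsigned long pfn)
{
	int i = poison_find_slot(s, pfn);

	if (i < 0 || s->kind[i] == POISON_HARD)
		return false;
	s->pfn[i] = pfn;
	s->kind[i] = POISON_SOFT;
	return true;
}

/* Record a real hardware error; this replaces any soft entry for pfn. */
static bool poison_mark_hard(struct poison_set *s, unsigned long pfn)
{
	int i = poison_find_slot(s, pfn);

	if (i < 0)
		return false;
	s->pfn[i] = pfn;
	s->kind[i] = POISON_HARD;
	return true;
}

/* Unpoison is allowed only for entries that are still soft. */
static bool poison_unpoison(struct poison_set *s, unsigned long pfn)
{
	int i = poison_find_slot(s, pfn);

	if (i < 0 || s->kind[i] != POISON_SOFT)
		return false;
	s->kind[i] = POISON_NONE;
	return true;
}
```

With something like this, unpoison would be decided per pfn instead of
being switched off globally once hw_memory_failure is set, and nothing in
it depends on !pte_present().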

So far I'd say it'll be great to leverage what's already there in Linux and
make it as generic as possible.  The only issue is probably
CAP_SYS_ADMIN... not sure whether we can have some way to provide a
!ADMIN path somehow, or you can simply work around this issue.

> > >
> > > > Besides what James mentioned on "missing addr", I didn't quickly see what's
> > > > the major difference comparing to the old hwpoison injection methods even
> > > > without the addr requirement. If we want the addr for MCE then it's more of
> > > > a question to ask.
> > > >
> > > > I also didn't quickly see why for whatever new way to inject a pte error we
> > > > need to have it registered with uffd.  Could it be something like
> > > > MADV_PGERR (even if MADV_HWPOISON won't suffice) so you can inject even
> > > > without an userfault context (but still usable when uffd registered)?
> > > >
> > > > And it'll be always nice to have a cover letter too (if there'll be a new
> > > > version) explaining the bits.
> >
> > I do plan a v2, if for no other reason than to update the
> > documentation. Happy to add a cover letter with it as well.
> >
> > +Jiaqi back to CC, this is one piece of a larger memory poisoning /
> > recovery design Jiaqi is working on, so he may have some ideas why
> > MADV_HWPOISON or MADV_PGERR will or won't work.
>
> Per https://man7.org/linux/man-pages/man2/madvise.2.html,
> MADV_HWPOISON "is available only for privileged (CAP_SYS_ADMIN)
> processes." So for a non-root VMM, MADV_HWPOISON is out of option.

It makes sense to me, especially when the page can be shared with other
tasks.

>
> Another issue with MADV_HWPOISON is, it requires to first successfully
> get_user_pages_fast(). I don't think it will work if memory is not
> mapped yet.

Fair point, so the current MADV_HWPOISON is probably ruled out.
hwpoison-inject seems fine, since it only needs the PFN rather than the
pte.  But it has the same CAP_SYS_ADMIN issue indeed.
>
> With the UFFDIO_SIGBUS feature introduced in this patchset, it may
> even be possible to free the emulated-hwpoison page back to the kernel
> so we don't lose a 4K page.
>
> I didn't find any ref/doc for MADV_PGERR. Is it something you suggest
> to build, Peter?

That's something I made up just to illustrate my question on why such an
interface (even if wanted) needs to be bound to userfaultfd, e.g. a
madvise() seems to work if someone solely wants to install a poisoned pte.

IIUC even with a madvise one may not need CAP_SYS_ADMIN, since we can
limit the op to the current mm only; I assume that's safe.

Here you'd want to return VM_FAULT_HWPOISON for whatever swap pte you'd
like to install (in do_swap_page) with whatever new interface (assuming
still a new madvise).  As James mentioned, I think KVM would like that, so
it can tell -EHWPOISON apart from -EFAULT.  I'd say we can even consider
reusing PTE_MARKER_SWAPIN_ERROR to let it just return VM_FAULT_HWPOISON
directly if so.

Thanks,

-- 
Peter Xu