From: Jue Wang
Date: Fri, 24 Sep 2021 13:22:19 -0700
Subject: Re: [PATCH 1/3] userfaultfd/selftests: fix feature support detection
To: Peter Xu
Cc: James Houghton, Axel Rasmussen, Andrew Morton, Shuah Khan, Linux MM, Linuxkselftest, LKML
X-Mailing-List: linux-kernel@vger.kernel.org

On Fri, Sep 24, 2021 at 1:09 PM Peter Xu wrote:
>
> On Wed, Sep 22, 2021 at 10:43:40PM -0700, Jue Wang wrote:
> > [...]
> >
> > > > Could I know what's the workaround?  Normally if the workaround works solidly,
> > > > then there's less need to introduce a kernel interface for that.  Otherwise I'm
> > > > glad to look into such a formal proposal.
> > >
> > > The workaround is, for the region that you want to zap, run through
> > > this sequence of syscalls: munmap, mmap, and re-register with
> > > userfaultfd if it was registered before. If we're using tmpfs, we can
> > > use madvise(DONTNEED) instead, but this is kind of an abuse of the
> > > API. I don't think there's a guarantee that the PTEs will get zapped,
> > > but currently they will always get zapped if we're using tmpfs. I
> > > really like the idea of adding a new madvise() mode that is guaranteed
> > > to zap the PTEs.
>
> I see.
>
> > >
> > > >
> > > > > It's also useful for memory poisoning, I think, if the host
> > > > > decides some page(s) are "bad" and wants to intercept any future guest
> > > > > accesses to those page(s).
> > > >
> > > > Curious: doesn't hwpoison information come from MCEs; or say, the host kernel side?
> > > > Then I thought the host kernel will have full control of it already.
> > > >
> > > > Or is there another way that the host can try to detect that some pages are going
> > > > to be rotten?  So the userspace can do something before the kernel handles those
> > > > exceptions?
> > >
> > > Here's a general idea of how we would like to use userfaultfd to support MPR:
> > >
> > > If a guest accesses a poisoned page for the first time, we will get an
> > > MCE through the host kernel and send an MCE to the guest. The guest
> > > will now no longer be able to access this page, and we have to enforce
> > > this. After a live migration, the pages that were poisoned before
> > > probably won't still be poisoned (from the host's perspective), so we
> > > can't rely on the host kernel's MCE handling path. This is where
> > > userfaultfd and this new madvise mode come in: we can just
> > > madvise(MADV_ZAP) the poisoned page(s) on the target during a
> > > migration. Now all accesses will be routed to the VMM and we can
> > > inject an MCE. We don't *need* the new madvise mode, as we can also
> > > use fallocate(PUNCH_HOLE) (works for tmpfs and hugetlbfs), but it
> > > would be more convenient if we didn't have to use fallocate.
> > >
> > > Jue Wang can provide more context here, so I've cc'd him. There may be
> > > some things I'm wrong about, so Jue feel free to correct me.
> > >
> > James is right.
> >
> > The page is marked PG_HWPoison in the source VM host's kernel. The need
> > to intercept guest accesses to it exists on the target VM host, where
> > the same physical page is no longer poisoned.
> >
> > On the target host, the hypervisor needs to intercept all guest accesses
> > to pages poisoned from the source VM host.
>
> Thanks for this information, James, Jue, Axel.  I'm not familiar with memory
> failures yet, so please bear with me with a few naive questions.
>
> So now I can understand that hw-poisoned pages on src host do not mean these
> pages will be hw-poisoned on dest host too, but I may have missed the reason
> why dest host needs to trap it with pgtable removed.
>
> AFAIU after pages got hw-poisoned on src, and after vmm injects MCEs into the
> guest, the guest shouldn't be accessing these pages any more, am I right?  Then

We also hope the guest behaves this way, but there is no guarantee of
guest behavior.

The goal here is to have the hypervisor provide consistent behavior
aligned with native hardware, i.e., a guest page with "memory errors"
stays persistently in that state, whether on the source or the target.

> after migration completes, IIUC the guest shouldn't be accessing these pages
> too.  My current understanding is, instead of trapping these pages on dest, we
> should just (somehow, which I have no real idea...) un-hw-poison these pages
> after migration because these pages are very possibly normal pages there.  When
> there's real hw-poisoned pages reported on dst host, we should re-inject MCE
> errors to guest with another set of pages.
>
> Could you tell me what I missed?
>
> Thanks,
>
> --
> Peter Xu
>
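
For reference, here is a minimal sketch of the two existing mechanisms
mentioned above, on a tmpfs-backed mapping. It is only an illustration,
not code from this series: zap_demo() is a hypothetical helper, a memfd
stands in for guest memory, and the userfaultfd registration that would
route the resulting faults to the VMM is left out. It assumes Linux with
a glibc that provides memfd_create(). As James said, the zapping done by
madvise(MADV_DONTNEED) on tmpfs is current behavior rather than a
guarantee.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <fcntl.h>        /* fallocate() */
#include <linux/falloc.h> /* FALLOC_FL_PUNCH_HOLE, FALLOC_FL_KEEP_SIZE */
#include <string.h>
#include <sys/mman.h>     /* memfd_create(), mmap(), madvise() */
#include <unistd.h>

/*
 * Hypothetical helper: returns 0 if both steps behaved as described
 * in the comments below, -1 on any failure or unexpected result.
 */
int zap_demo(void)
{
    long page = sysconf(_SC_PAGESIZE);
    volatile unsigned char *p;
    void *map;
    int fd, ret = -1;

    assert(page > 0);

    /* memfd is tmpfs-backed; it stands in for guest memory here. */
    fd = memfd_create("zap-demo", 0);
    if (fd < 0)
        return -1;
    if (ftruncate(fd, page) != 0)
        goto out_close;

    map = mmap(NULL, page, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
    if (map == MAP_FAILED)
        goto out_close;
    p = map;

    memset(map, 0xAB, page);

    /*
     * Step 1: madvise(MADV_DONTNEED) on a shared tmpfs mapping drops
     * the mapping's pages, but the data stays in the page cache, so
     * the next access faults the same contents back in.
     */
    if (madvise(map, page, MADV_DONTNEED) != 0)
        goto out_unmap;
    if (p[0] != 0xAB)
        goto out_unmap;

    /*
     * Step 2: fallocate(PUNCH_HOLE) removes the backing page as well,
     * so the next access sees a fresh zero page.
     */
    if (fallocate(fd, FALLOC_FL_PUNCH_HOLE | FALLOC_FL_KEEP_SIZE,
                  0, page) != 0)
        goto out_unmap;
    if (p[0] != 0)
        goto out_unmap;

    ret = 0;
out_unmap:
    munmap(map, page);
out_close:
    close(fd);
    return ret;
}
```

Calling zap_demo() from a small main() and checking for a 0 return is
enough to try it. The difference between the two steps is the point:
DONTNEED keeps the contents, PUNCH_HOLE discards them, and neither is a
dedicated "zap the PTEs" interface.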