Date: Fri, 15 Mar 2024 10:59:05 -0700
References: <9e604f99-5b63-44d7-8476-00859dae1dc4@amd.com> <93df19f9-6dab-41fc-bbcd-b108e52ff50b@amd.com>
Subject: Re: [PATCH v11 0/8] KVM: allow mapping non-refcounted pages
From: Sean Christopherson
To: David Stevens
Cc: Paolo Bonzini, Yu Zhang, Isaku Yamahata, Zhi Wang, Maxim Levitsky, kvmarm@lists.linux.dev,
linux-kernel@vger.kernel.org, kvm@vger.kernel.org, Axel Rasmussen
Content-Type: text/plain; charset="us-ascii"

On Thu, Mar 14, 2024, Sean Christopherson wrote:
> +Alex, who is looking at the huge-VM_PFNMAP angle in particular.

Oof, *Axel*. Sorry Axel.

> On Thu, Mar 14, 2024, Sean Christopherson wrote:
> > -Christ{oph,ian} to avoid creating more noise...
> >
> > On Thu, Mar 14, 2024, David Stevens wrote:
> > > Because of that, the specific type of pfns that don't work right now are
> > > pfn_valid() && !PG_Reserved && !page_ref_count() - what I called the
> > > non-refcounted pages in a bad choice of words. If that's correct, then
> > > perhaps this series should go a little bit further in modifying
> > > hva_to_pfn_remapped, but it isn't fundamentally wrong.
> >
> > Loosely related to all of this, I have a mildly ambitious idea. Well, one mildly
> > ambitious idea, and one crazy ambitious idea. Crazy ambitious idea first...
> >
> > Something we (GCE side of Google) have been eyeballing is adding support for huge
> > VM_PFNMAP memory, e.g. for mapping large amounts of device (a.k.a. GPU) memory
> > into guests using hugepages. One of the hiccups is that follow_pte() doesn't play
> > nice with hugepages, at all, e.g. it even has a "VM_BUG_ON(pmd_trans_huge(*pmd))".
> > Teaching follow_pte() to play nice with hugepages is probably doable, but making
> > sure all existing users are aware, maybe not so much.
> >
> > My first (half-baked, crazy ambitious) idea is to move away from follow_pte() and
> > get_user_page_fast_only() for mmu_notifier-aware lookups, i.e. lookups that don't
> > need to grab references, and replace them with a new converged API that locklessly
> > walks host userspace page tables and grabs the hugepage size along the way, e.g.
> > so that arch code wouldn't have to do a second walk of the page tables just to get
> > the hugepage size.
> >
> > In other words, for the common case (mmu_notifier integration, no reference
> > needed), route hva_to_pfn_fast() into the new API and walk the userspace page
> > tables (probably only for write faults, to avoid CoW complications) before doing
> > anything else.
> >
> > Uses of hva_to_pfn() that need to get a reference to the struct page couldn't be
> > converted, e.g. when stuffing physical addresses into the VMCS for nested
> > virtualization. But for everything else, grabbing a reference is a non-goal, i.e.
> > actually "getting" a user page is wasted effort and actively gets in the way.
> >
> > I was initially hoping we could go super simple and use something like x86's
> > host_pfn_mapping_level(), but there are too many edge cases in gup() that need to
> > be respected, e.g. to avoid mapping memfd_secret pages into KVM guests. I.e. the
> > API would need to be a formal mm-owned thing, not some homebrewed KVM
> > implementation.
> >
> > I can't tell if the payoff would be big enough to justify the effort involved,
> > i.e. having a single unified API for grabbing PFNs from the primary MMU might
> > just be a pie-in-the-sky type idea.
> >
> > My second, less ambitious idea: the previously linked LWN[*] article about the
> > writeback issues reminded me of something that has bugged me for a long time.
> > IIUC, getting a writable mapping from the primary MMU marks the page/folio dirty,
> > and that page/folio stays dirty until the data is written back and the mapping is
> > made read-only. And because KVM is tapped into the mmu_notifiers, KVM will be
> > notified *before* the RW=>RO conversion completes, i.e. before the page/folio is
> > marked clean.
> >
> > I _think_ that means that calling kvm_set_page_dirty() when zapping a SPTE (or
> > dropping any mmu_notifier-aware mapping) is completely unnecessary.
> > If that is the case, _and_ we can weasel our way out of calling
> > kvm_set_page_accessed() too, then with FOLL_GET plumbed into hva_to_pfn(), we can:
> >
> >  - Drop kvm_{set,release}_pfn_{accessed,dirty}(), because all callers of
> >    hva_to_pfn() that aren't tied into mmu_notifiers, i.e. aren't guaranteed to
> >    drop mappings before the page/folio is cleaned, will *know* that they hold a
> >    refcounted struct page.
> >
> >  - Skip "KVM: x86/mmu: Track if sptes refer to refcounted pages" entirely,
> >    because KVM never needs to know if a SPTE points at a refcounted page.
> >
> > In other words, double down on immediately doing put_page() after gup() if
> > FOLL_GET isn't specified, and naturally make all KVM MMUs compatible with
> > pfn_valid() PFNs that are acquired by follow_pte().
> >
> > I suspect we can simply mark pages as accessed when a page is retrieved from the
> > primary MMU, as marking a page accessed when it's *removed* from the guest is
> > rather nonsensical. E.g. if a page is mapped into the guest for a long time and
> > it gets swapped out, marking the page accessed when KVM drops its SPTEs in
> > response to the swap adds no value. And through the mmu_notifiers, KVM already
> > plays nice with setups that use idle page tracking to make reclaim decisions.
> >
> > [*] https://lwn.net/Articles/930667