Date: Fri, 21 Oct 2022 16:53:15 +0000
From: Sean Christopherson
To: Chao Peng
Cc: Vishal Annapurve, "Kirill A . Shutemov", "Gupta, Pankaj",
    Vlastimil Babka, kvm@vger.kernel.org, linux-kernel@vger.kernel.org,
    linux-mm@kvack.org, linux-fsdevel@vger.kernel.org,
    linux-api@vger.kernel.org, linux-doc@vger.kernel.org,
    qemu-devel@nongnu.org, Paolo Bonzini, Jonathan Corbet,
    Vitaly Kuznetsov, Wanpeng Li, Jim Mattson, Joerg Roedel,
    Thomas Gleixner, Ingo Molnar, Borislav Petkov, x86@kernel.org,
    "H . Peter Anvin", Hugh Dickins, Jeff Layton, "J . Bruce Fields",
    Andrew Morton, Shuah Khan, Mike Rapoport, Steven Price,
    "Maciej S . Szmigiero", Yu Zhang, luto@kernel.org,
    jun.nakajima@intel.com, dave.hansen@intel.com, ak@linux.intel.com,
    david@redhat.com, aarcange@redhat.com, ddutile@redhat.com,
    dhildenb@redhat.com, Quentin Perret, Michael Roth, mhocko@suse.com,
    Muchun Song, wei.w.wang@intel.com
Subject: Re: [PATCH v8 1/8] mm/memfd: Introduce userspace inaccessible memfd
References: <20220915142913.2213336-1-chao.p.peng@linux.intel.com>
    <20220915142913.2213336-2-chao.p.peng@linux.intel.com>
    <20221017161955.t4gditaztbwijgcn@box.shutemov.name>
    <20221017215640.hobzcz47es7dq2bi@box.shutemov.name>
    <20221019153225.njvg45glehlnjgc7@box.shutemov.name>
    <20221021135434.GB3607894@chaop.bj.intel.com>
In-Reply-To: <20221021135434.GB3607894@chaop.bj.intel.com>
X-Mailing-List: linux-kernel@vger.kernel.org

On Fri, Oct 21, 2022, Chao Peng wrote:
> On Thu, Oct 20, 2022 at 04:20:58PM +0530, Vishal Annapurve wrote:
> > On Wed, Oct 19, 2022 at 9:02 PM Kirill A . Shutemov wrote:
> > >
> > > On Tue, Oct 18, 2022 at 07:12:10PM +0530, Vishal Annapurve wrote:
> > > > I think moving this notifier_invalidate before fallocate may not solve
> > > > the problem completely. Is it possible that between invalidate and
> > > > fallocate, KVM tries to handle the page fault for the guest VM from
> > > > another vcpu and uses the pages to be freed to back gpa ranges? Should
> > > > hole punching here also update mem_attr first to say that KVM should
> > > > consider the corresponding gpa ranges to be no more backed by
> > > > inaccessible memfd?
> > >
> > > We rely on external synchronization to prevent this. See code around
> > > mmu_invalidate_retry_hva().
> > >
> > > --
> > >   Kiryl Shutsemau / Kirill A. Shutemov
> >
> > IIUC, mmu_invalidate_retry_hva/gfn ensures that page faults on gfn
> > ranges that are being invalidated are retried till invalidation is
> > complete. In this case, is it possible that KVM tries to serve the
> > page fault after inaccessible_notifier_invalidate is complete but
> > before fallocate could punch hole into the files?

It's not just the page fault edge case.  In the more straightforward
scenario where the memory is already mapped into the guest, freeing pages
back to the kernel before they are removed from the guest will lead to
use-after-free.

> > e.g.
> > inaccessible_notifier_invalidate(...)
> > ... (system event preempting this control flow, giving a window for
> > the guest to retry accessing the gfn range which was invalidated)
> > fallocate(.., PUNCH_HOLE..)
>
> Looks this is something can happen.
> And sounds to me the solution needs
> just follow the mmu_notifier's way of using a invalidate_start/end pair.
>
>   invalidate_start()  --> kvm->mmu_invalidate_in_progress++;
>                           zap KVM page table entries;
>   fallocate()
>   invalidate_end()    --> kvm->mmu_invalidate_in_progress--;
>
> Then during invalidate_start/end time window mmu_invalidate_retry_gfn
> checks 'mmu_invalidate_in_progress' and prevent repopulating the same
> page in KVM page table.

Yes, if it's not safe to invalidate after making the change (fallocate()),
then the change needs to be bookended by a start+end pair.  The
mmu_notifier's unpaired invalidate() hook works by zapping the primary MMU's
PTEs before invalidate(), but frees the underlying physical page _after_
invalidate().

And the only reason the unpaired invalidate() exists is because there are
secondary MMUs that reuse the primary MMU's page tables, e.g. shared virtual
addressing, in which case bookending doesn't work because the secondary MMU
can't remove PTEs, it can only flush its TLBs.

For this case, the whole point is to not create PTEs in the primary MMU, so
there should never be a use case that _needs_ an unpaired invalidate().

TL;DR: a start+end pair is likely the simplest solution.
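
For illustration only, here is a rough single-threaded userspace model of the
start+end bookending discussed above.  It is a sketch under stated
assumptions, not kernel code: every toy_* name and the struct layout are
hypothetical, there is no locking or memory ordering, and the sequence
counter is an assumption modeled on how KVM's existing mmu_notifier retry
logic is commonly described.  It only shows why a fault that lands inside the
window, or that took its snapshot before the window closed, has to retry
instead of mapping the page being freed.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical toy state; NOT the real struct kvm layout. */
struct toy_kvm {
	unsigned long invalidate_in_progress; /* models kvm->mmu_invalidate_in_progress */
	unsigned long invalidate_seq;         /* assumed: bumped when a window closes */
	bool guest_mapped;                    /* page currently mapped into the guest */
	bool page_allocated;                  /* backing file still holds the page */
};

/* invalidate_start(): open the window, then zap the guest mapping. */
static void toy_invalidate_start(struct toy_kvm *kvm)
{
	kvm->invalidate_in_progress++;
	kvm->guest_mapped = false;
}

/* Stand-in for fallocate(PUNCH_HOLE): frees the backing page. */
static void toy_punch_hole(struct toy_kvm *kvm)
{
	/* Freeing a page the guest still maps would be the use-after-free. */
	assert(!kvm->guest_mapped);
	kvm->page_allocated = false;
}

/* invalidate_end(): bump the sequence, then close the window. */
static void toy_invalidate_end(struct toy_kvm *kvm)
{
	assert(kvm->invalidate_in_progress > 0);
	kvm->invalidate_seq++;
	kvm->invalidate_in_progress--;
}

/* Fault path, step 1: snapshot the sequence before looking up the page. */
static unsigned long toy_fault_begin(const struct toy_kvm *kvm)
{
	return kvm->invalidate_seq;
}

/*
 * Fault path, step 2: install the mapping only if no invalidation is in
 * flight and none completed since the snapshot.  Returns false when the
 * caller must retry the fault.
 */
static bool toy_fault_commit(struct toy_kvm *kvm, unsigned long seq)
{
	if (kvm->invalidate_in_progress || seq != kvm->invalidate_seq)
		return false;

	kvm->page_allocated = true; /* a fault on a hole allocates a fresh page */
	kvm->guest_mapped = true;
	return true;
}
```

In this model, a caller would run toy_invalidate_start(), toy_punch_hole(),
toy_invalidate_end() in that order; any toy_fault_commit() attempted between
start and end, or with a snapshot taken before end, returns false and the
fault is retried with a fresh snapshot.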