Received: by 2002:a05:6358:d09b:b0:dc:cd0c:909e with SMTP id jc27csp7725386rwb; Tue, 6 Dec 2022 09:06:18 -0800 (PST) X-Google-Smtp-Source: AA0mqf415cG1YXM8Lbx+KEpB8ysSfsN/qL2hBwSD6sQ2tFgm5lOL8TloxvImtKV0zZyub6BeOy53 X-Received: by 2002:a17:906:248c:b0:7c0:fe68:9c2e with SMTP id e12-20020a170906248c00b007c0fe689c2emr6175145ejb.278.1670346378611; Tue, 06 Dec 2022 09:06:18 -0800 (PST) ARC-Seal: i=1; a=rsa-sha256; t=1670346378; cv=none; d=google.com; s=arc-20160816; b=qxk4X5c7A09rvK06y2dw5EgETCkzRCctNIKaqkXzHiCPOAV/c3Hw5EaHN9lgreFJ1n 5sXNyxO1DRww00gOmbEp2Quw0YpSWASeE7LVt7+yQAiuHPExLo8r0unLOi/cHLk4Y1+P YL/GEMBwLDEV9gJRans1JvZaG3e43zieCA/N0aJxfDM27/JNP3R/z14pAlYkRegqmX4q AwFF8lKZnv+K+LJzRH7VxFGlOZwpxT68KNh2fYhGfPtLVzRFLwGxIeQxgw1z0Ea9CliB up1p8SdaOSXZBu60pyQRparVUoqi7h3p44gSc9djeu2RWLTZ637OvGakhPTnneYMFPlz tjDQ== ARC-Message-Signature: i=1; a=rsa-sha256; c=relaxed/relaxed; d=google.com; s=arc-20160816; h=list-id:precedence:content-transfer-encoding:in-reply-to:subject :organization:from:references:cc:to:content-language:user-agent :mime-version:date:message-id:dkim-signature; bh=ftN5vrqIh393VC3x0YgzupFdfd4d2nuxsXC4NEbHZHY=; b=pbMYxl90bSlg2lxFDDfaQ7mcS2STNPeVwcIDiuQbR3dgtrvVM7Ogfu/Gjz646x/sb/ Y33y22RtX4vNfKFkKtDiu/KGc42mfbrrf/QGWptZE5n0YEmdY1J4ixJgs3zBYGnznbC7 dhWIjfOIQXlnMDuKR9E2tl1QqpeXQIyJWflDSWSkrK6rbW6mI0vw6octwnUVPmb8PTpz lPkpzVvx9hcReIqGZunqIG9yKc7RFMzbqzdrDqqRyEuZSIaMFZA4b+EjOG6CBOF+qa/W XfldPSX8yWlhzGNfyUv3Hf01lfPHv+TL8ZB5ripmBFEf9RQidFjwLJD4aHa61S5JxTbq 5TEg== ARC-Authentication-Results: i=1; mx.google.com; dkim=pass header.i=@redhat.com header.s=mimecast20190719 header.b="gOwrLy/F"; spf=pass (google.com: domain of linux-kernel-owner@vger.kernel.org designates 2620:137:e000::1:20 as permitted sender) smtp.mailfrom=linux-kernel-owner@vger.kernel.org; dmarc=pass (p=NONE sp=NONE dis=NONE) header.from=redhat.com Return-Path: Received: from out1.vger.email (out1.vger.email. [2620:137:e000::1:20]) by mx.google.com with ESMTP id hr12-20020a1709073f8c00b0078b96068bc0si9942648ejc.79.2022.12.06.09.05.57; Tue, 06 Dec 2022 09:06:18 -0800 (PST) Received-SPF: pass (google.com: domain of linux-kernel-owner@vger.kernel.org designates 2620:137:e000::1:20 as permitted sender) client-ip=2620:137:e000::1:20; Authentication-Results: mx.google.com; dkim=pass header.i=@redhat.com header.s=mimecast20190719 header.b="gOwrLy/F"; spf=pass (google.com: domain of linux-kernel-owner@vger.kernel.org designates 2620:137:e000::1:20 as permitted sender) smtp.mailfrom=linux-kernel-owner@vger.kernel.org; dmarc=pass (p=NONE sp=NONE dis=NONE) header.from=redhat.com Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S234965AbiLFQ3I (ORCPT + 78 others); Tue, 6 Dec 2022 11:29:08 -0500 Received: from lindbergh.monkeyblade.net ([23.128.96.19]:44178 "EHLO lindbergh.monkeyblade.net" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S232324AbiLFQ3G (ORCPT ); Tue, 6 Dec 2022 11:29:06 -0500 Received: from us-smtp-delivery-124.mimecast.com (us-smtp-delivery-124.mimecast.com [170.10.133.124]) by lindbergh.monkeyblade.net (Postfix) with ESMTPS id E5396BC99 for ; Tue, 6 Dec 2022 08:28:12 -0800 (PST) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=redhat.com; s=mimecast20190719; t=1670344092; h=from:from:reply-to:subject:subject:date:date:message-id:message-id: to:to:cc:cc:mime-version:mime-version:content-type:content-type: content-transfer-encoding:content-transfer-encoding: in-reply-to:in-reply-to:references:references; bh=ftN5vrqIh393VC3x0YgzupFdfd4d2nuxsXC4NEbHZHY=; b=gOwrLy/FTvUaCP/rIXbAiT27i+a6Qs+osGed97tL+agrpuNQJOztdw6JxxWyz3LqbVJcmw zYi1FzFz0tIDPc78Obuamr/zuzfr8JZUm61J/89ULJb7qheh5t+qLdnFUnX55EPkqgMPEb v/Hgdx055PFt6t0DQcYj1DooKSyKOpg= Received: from mail-wr1-f69.google.com (mail-wr1-f69.google.com [209.85.221.69]) by relay.mimecast.com with ESMTP with STARTTLS (version=TLSv1.3, cipher=TLS_AES_128_GCM_SHA256) id us-mta-654-DyxuGLBBN6yGALfYU6C4ug-1; Tue, 06 Dec 2022 11:28:10 -0500 X-MC-Unique: DyxuGLBBN6yGALfYU6C4ug-1 Received: by mail-wr1-f69.google.com with SMTP id h15-20020adfa4cf000000b0024276cca31cso637188wrb.0 for ; Tue, 06 Dec 2022 08:28:10 -0800 (PST) X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=1e100.net; s=20210112; h=content-transfer-encoding:in-reply-to:subject:organization:from :references:cc:to:content-language:user-agent:mime-version:date :message-id:x-gm-message-state:from:to:cc:subject:date:message-id :reply-to; bh=ftN5vrqIh393VC3x0YgzupFdfd4d2nuxsXC4NEbHZHY=; b=NNCa/bDRzzbt7675gQqVJrh9Tbp2HNXU+sED/FrRQdxYDsm/xzdZLOlR32DOCATY4l BF/199BoW7qXJvwH9YLF7G0uFEy6JyEhKF/raUDALT4NgAontBbGhMj63WEEYQEJ5NYu Kab0G8oxBy0oaq6pfApCLYEnjvVE+DTKlS8M0sgkflnjxALrGduKRuV9rEaDvWHWHu2A l+Izb2j08Ki8HlKy5Cw0tR6n2pIN471X286ZvxZ3ybxpq2RT7SqbiteY6ceGHWwrtyPh sIVks+7gmhULBrPPYSJYCgCQaisvkfSa10Jrnoi17BdfZIy42snCQ5OpNn5HL8HKBcYq z/GQ== X-Gm-Message-State: ANoB5pnWs356GJ67Uc/nPZ1VEcXtjldHRa1M51SaoBGAEsJXquEa6O5I lk37QYtit2kTSHzVAMblOcYHnM/Mkz+H1zD6kKSugh0dviSTMUt+r7h1W2MtK/Mlp3uJYqI4OyP C3mDXjDNModksq5Nb2cIIScn5 X-Received: by 2002:a5d:4985:0:b0:242:4c61:271f with SMTP id r5-20020a5d4985000000b002424c61271fmr9501082wrq.236.1670344089232; Tue, 06 Dec 2022 08:28:09 -0800 (PST) X-Received: by 2002:a5d:4985:0:b0:242:4c61:271f with SMTP id r5-20020a5d4985000000b002424c61271fmr9501065wrq.236.1670344088866; Tue, 06 Dec 2022 08:28:08 -0800 (PST) Received: from ?IPV6:2003:cb:c705:4f00:41f1:185d:4f9f:d1c2? (p200300cbc7054f0041f1185d4f9fd1c2.dip0.t-ipconnect.de. [2003:cb:c705:4f00:41f1:185d:4f9f:d1c2]) by smtp.gmail.com with ESMTPSA id i21-20020a1c5415000000b003d1f2c3e571sm125988wmb.33.2022.12.06.08.28.07 (version=TLS1_3 cipher=TLS_AES_128_GCM_SHA256 bits=128/128); Tue, 06 Dec 2022 08:28:08 -0800 (PST) Message-ID: <92173bad-caa3-6b43-9d1e-9a471fdbc184@redhat.com> Date: Tue, 6 Dec 2022 17:28:07 +0100 MIME-Version: 1.0 User-Agent: Mozilla/5.0 (X11; Linux x86_64; rv:102.0) Gecko/20100101 Thunderbird/102.5.0 Content-Language: en-US To: Peter Xu Cc: linux-kernel@vger.kernel.org, linux-mm@kvack.org, Ives van Hoorne , stable@vger.kernel.org, Andrew Morton , Hugh Dickins , Alistair Popple , Mike Rapoport , Nadav Amit , Andrea Arcangeli References: <20221202122748.113774-1-david@redhat.com> <690afe0f-c9a0-9631-b365-d11d98fdf56f@redhat.com> <19800718-9cb6-9355-da1c-c7961b01e922@redhat.com> From: David Hildenbrand Organization: Red Hat Subject: Re: [PATCH RFC] mm/userfaultfd: enable writenotify while userfaultfd-wp is enabled for a VMA In-Reply-To: Content-Type: text/plain; charset=UTF-8; format=flowed Content-Transfer-Encoding: 7bit X-Spam-Status: No, score=-2.4 required=5.0 tests=BAYES_00,DKIMWL_WL_HIGH, DKIM_SIGNED,DKIM_VALID,DKIM_VALID_AU,DKIM_VALID_EF,NICE_REPLY_A, RCVD_IN_DNSWL_NONE,RCVD_IN_MSPIKE_H2,SPF_HELO_NONE,SPF_NONE autolearn=ham autolearn_force=no version=3.4.6 X-Spam-Checker-Version: SpamAssassin 3.4.6 (2021-04-09) on lindbergh.monkeyblade.net Precedence: bulk List-ID: X-Mailing-List: linux-kernel@vger.kernel.org >>>> I think fundamentally the core is uffd-wp is pte-based, so the information >>>> resides in pte not vma. I'm not strongly objecting this patch, especially >>>> you mentioned auto-numa so I need to have a closer look later there. >>>> However I do think uffd-wp is slightly special because we always need to >>>> consider pte information anyway, so a per-vma information doesn't hugely >>>> help, IMHO. >>> >>> That's the same as softdirty tracking, IMHO. > > Soft-dirty doesn't have a bit in the pte showing whether the page is > protected. If the page is already softdirty (PTE bit), we don't need write faults and consequently can map the page writable. If the page is not softdirty, we need !write as default. But let's talk in detail about vma->vm_page_prot and interaction with other wrprotect features below. > > One wr-protects in soft-dirty with either ALL or NONE. That's per-vma. > > One wr-protects in uffd-wp by wr-protect specific page or range of pages. > That's per-page. > Let me try to explain clearer what the issue with uffd-wp on shmem is right now and how to proceed from here. maybe that makes it clearer what the vma->vm_page_prot semantics are. vma->vm_page_prot defines the default PTE/PMD/... protection. This protection is always *safe*, meaning, if you blindly use that protection when (re)mapping a page, there won't be correctness issues. No need to manually wrprotect afterwards. That's why migration, mprotect(), NUMA faults that all rely on vma->vm_page_prot work as expected for now. When calculating vma->vm_page_prot, we considers three things: (1) VMA protection. For example, no VM_WRITE? then it won't include write permissions. (2) Private vs. shared: If the mapping is private, we will never have write permissions in vma->vm_page_prot, because we might have to COW individual PTEs. We really want write faults as default. (3) Write-notify: Some mechanism (softdirty tracking, file system, uffd- wp) *might* require default write-protection, such that we always properly trigger a write fault instead where we can handle these. This only applies to shared mappings, because private mappings always default to !write (2). As vma->vm_page_prot is safe, it is always valid to apply them when mapping a PTE/PMD/ ... *in addition* we can use other information, to detect if we still can map the PTE writable (if they were removed due to (2) or (3)). In migration code, we use the migration type (writable migration entries). In NUMA-hinting code we used the stored savedwrite value. Absence of these bits does not imply that we have to map the page wrprotected. We never had to remove write permissions so far from the vma->vm_page_prot default. We always only added permissions. Now, uffd-wp on shmem as of now violates these semantics. vma->vm_page_prot cannot always be used as a safe default, because we might have to wrprotect individual PTEs. Note that for uffd-wp on anonymous memory this was not an issue, because we default to !write in vma->vm_page_prot. The two possible ways to approach this for uffd-wp on shmem are: (1) Obey existing vma->vm_page_prot semantics. Default to !write and optimize the relevant cases to *add* the write bit. This is essentially what this patch does, minus can_change_pte_writable() optimizations on relevant code paths where we'd want to maintain the write bit. For example, when removing uffd-wp protection we might want to restore the write bit directly. (2) Default to write permissions and check each and every code location where we remap for uffd-wp ptes, to manuall wrprotect -- *remove* the write bit. Alternatively, as you said, always wrprotect when setting the PTE bit, which might work as well. My claim is that (1) is less error prone, because in the worst case we forget to optimize one code location -- instead to resulting in a BUG if we forget to wrprotect (what we have now). But I am not going to fight for it, because I can see that (2) can be made to work as well, as you outline in your patch. You seem to have decided on (2). Fair enough, you know uffd-wp best. We just have to fix it properly and make the logic consistent whenever we remap a page. Is anything I said fundamentally wrong? >>> >>> [...] >>> >>>>> Running the mprotect() reproducer [1] without this commit: >>>>> $ ./uffd-wp-mprotect >>>>> FAIL: uffd-wp did not fire >>>>> Running the mprotect() reproducer with this commit: >>>>> $ ./uffd-wp-mprotect >>>>> PASS: uffd-wp fired >>>>> >>>>> [1] https://lore.kernel.org/all/222fc0b2-6ec0-98e7-833f-ea868b248446@redhat.com/T/#u >>>> >>>> I still hope for a formal patch (non-rfc) we can have a reproducer outside >>>> mprotect(). IMHO mprotect() is really ambiguously here being used with >>>> uffd-wp, so not a good example IMO as I explained in the other thread [1]. >>> >>> I took the low hanging fruit to showcase that this is a more generic problem. >>> The reproducer is IMHO nice because it's simple and race-free. > > If no one is using mprotect() with uffd-wp like that, then the reproducer > may not be valid - the reproducer is defining how it should work, but does > that really stand? That's why I said it's ambiguous, because the > definition in this case is unclear. There are interesting variations like: mmap(PROT_READ, MAP_POPULATE|MAP_SHARED) uffd_wp() mprotect(PROT_READ|PROT_WRITE) Where we start out with all-write permissions before we enable selective write permissions. But I'm not going to argue about whats valid and whats not as long as it's documented ;). I primarily wanted to showcase that the same logic based on vma->vm_page_prot is used elsewhere, and that migration PTE restoration is not particularly special. > > I think numa has the problem too which I agree with you. If you attach a > numa reproducer it'll be nicer. But again I'm not convinced uffd-wp is a > per-vma thing, which seems to be what this patch is based upon. I might be able to give NUMA hinting a shot later this week, but I have other stuff to focus on. > > Now I really wonder whether I should just simply wr-protect pte for > pte_mkuffd_wp() always, attached. I didn't do that from the start because > I wanted to keep the helpers operate on one bit only. But I found that > it's actually common technique to use in pgtable arch code, and it really > doesn't make sense to not wr-protect a pte if uffd-wp is set on a present > entry. It's much safer. Indeed, that would be much safer. > >>> >>>> >>>> I'll need to off-work most of the rest of today, but maybe I can also have >>>> a look in the weekend or Monday more on the numa paths. Before that, can >>>> we first reach a consensus that we have the mm/migrate patch there to be >>>> merged first? These are two issues, IMHO. >>>> >>>> I know you're against me for some reason, but until now I sincerely don't >>>> know why. That patch sololy recovers write bit status (by removing it for >>>> read-only) for a migration entry and that definitely makes sense to me. As >>>> I also mentioned in the old version of that thread, we can rework migration >>>> entries and merge READ|WRITE entries into a GENERIC entry one day if you >>>> think proper, but that's for later. >>> >>> I'm not against you. I'm against changing well-working, common code >>> when it doesn't make any sense to me to change it. > > This goes back to the original question of whether we should remove the > write bit for read migration entry. Well, let's just focus on others; > we're all tired of this one. I hope my lengthy explanation above was helpful and correct. > >>> And now we have proof that >>> mprotect() just behaves exactly the same way, using the basic rules of vma->vm_page_prot. >>> >>> Yes, there is broken sparc64 (below), but that shouldn't dictate our implementation. > > I doubt whether sparc64 is broken if it has been like that anyway, because > I know little on sparc64 so I guess I'd not speak on that. > I'd wish some of the sparc64 maintainers would speak up finally. Anyhow, most probably I'll write a reproducer for some broken case and send it to the sparc64 list. Yeah, I need to get Linux for sparc64 running inside a VM ... :/ [...] > > Yes, but now I'm prone to the patch I attached which should just cover all > pte_mkuffd_wp(). We should probably do the same for the pmd variants, and clean up the existing wrprotect like: pmde = pmd_wrprotect(pmd_mkuffd_wp(pmde)); But that's of course, not a fix. > > Side note: since looking at the numa code, I found that after the recent > rework of removing savedwrite for numa, cdb205f9e220 ("mm/autonuma: use > can_change_(pte|pmd)_writable() to replace savedwrite"), I think it can > happen that after numa balancing one pte under uffd-wp vma (but not > wr-protected) can have its write bit lost if the migration failed during > recovering, because vma_wants_manual_pte_write_upgrade() will return false > for such case. Is it true? Yes, you are correct. I added that to the patch description: " Note that we don't optimize for the actual migration case: (1) When migration succeeds the new PTE will not be writable because the source PTE was not writable (protnone); in the future we might just optimize that case similarly by reusing can_change_pte_writable()/can_change_pmd_writable() when removing migration PTEs. (2) When migration fails, we'd have to recalculate the "writable" flag because we temporarily dropped the PT lock; for now keep it simple and set "writable=false". " Case (1) would, with your current patch, always lose the write bit during migration, even if vma->vm_page_prot included it. We most might want to optimize that in the future. Case (2) is rather a corner case, and unless people complain about it being a real performance issue, it felt cleaner (less code) to not optimize for that now. Again Peter, I am not against you, not at all. Sorry if I gave you the impression. I highly appreciate your work and this discussion. -- Thanks, David / dhildenb