Date: Fri, 2 Dec 2022 11:33:45 -0500
From: Peter Xu
To: David Hildenbrand
Cc: linux-kernel@vger.kernel.org, linux-mm@kvack.org, Ives van Hoorne,
 stable@vger.kernel.org, Andrew Morton, Hugh Dickins, Alistair Popple,
 Mike Rapoport, Nadav Amit, Andrea Arcangeli
Subject: Re: [PATCH RFC] mm/userfaultfd: enable writenotify while
 userfaultfd-wp is enabled for a VMA
References: <20221202122748.113774-1-david@redhat.com>
In-Reply-To: <20221202122748.113774-1-david@redhat.com>

On Fri, Dec 02, 2022 at 01:27:48PM +0100, David Hildenbrand wrote:
> Currently, we don't enable writenotify when enabling userfaultfd-wp on
> a shared writable mapping (for now we only support SHMEM). The consequence

and hugetlbfs

> is that vma->vm_page_prot will still include write permissions, to be set
> as default for all PTEs that get remapped (e.g., mprotect(), NUMA hinting,
> page migration, ...).

The thing is, by default I think we want the write bit.  A simple
example: (1) register UFFD_WP on a writable shmem mapping, (2) write to
a page.  Here we didn't wr-protect anything, so we want the write bit
there.  The other example is UFFDIO_COPY with flags==0, even with
VM_UFFD_WP set: we definitely want the write bit.  We only don't want
the write bit when uffd-wp is explicitly set.
I think the fundamental point is that uffd-wp is pte-based, so the
information resides in the pte, not the vma.  I'm not strongly objecting
to this patch, especially since you mentioned auto-numa, so I need to
have a closer look there later.  However, I do think uffd-wp is slightly
special because we always need to consider pte information anyway, so
per-vma information doesn't hugely help, IMHO.

> This is problematic for uffd-wp: we'd have to manually check for
> a uffd-wp PTE and manually write-protect that PTE, which is error prone
> and the logic is the wrong way around. Prone to such issues is any code
> that uses vma->vm_page_prot to set PTE permissions: primarily pte_modify()
> and mk_pte(), but there might be more (move_pte() looked suspicious at
> first but the protection parameter is essentially unused).
>
> Instead, let's enable writenotify -- just like we do for softdirty
> tracking -- such that PTEs will be mapped write-protected as default
> and we will only allow selected PTEs that are definitely safe to be
> mapped without write-protection (see can_change_pte_writable()) to be
> writable. This reverses the logic and implicitly fixes and prevents any
> such uffd-wp issues.
>
> Note that when enabling userfaultfd-wp, there is no need to walk page
> tables to enforce the new default protection for the PTEs: we know that
> they cannot be uffd-wp'ed yet, because that can only happen afterwards.
>
> For example, this fixes page migration and mprotect() to not map a
> uffd-wp'ed PTE writable. In theory, this should also fix when NUMA-hinting
> remaps pages in such (shmem) mappings -- if NUMA-hinting is applicable to
> shmem with uffd as well.
>
> Running the mprotect() reproducer [1] without this commit:
>  $ ./uffd-wp-mprotect
>  FAIL: uffd-wp did not fire
> Running the mprotect() reproducer with this commit:
>  $ ./uffd-wp-mprotect
>  PASS: uffd-wp fired
>
> [1] https://lore.kernel.org/all/222fc0b2-6ec0-98e7-833f-ea868b248446@redhat.com/T/#u

I still hope that for a formal (non-RFC) patch we can have a reproducer
outside mprotect().  IMHO mprotect() is really ambiguous here when used
together with uffd-wp, so it's not a good example, as I explained in the
other thread [1].

I'll need to be off work for most of the rest of today, but maybe I can
have a closer look at the numa paths over the weekend or on Monday.

Before that, can we first reach a consensus that the mm/migrate patch
should be merged first?  These are two separate issues, IMHO.  I know
you're against me for some reason, but until now I sincerely don't know
why.  That patch solely recovers the write bit status (by removing it
for read-only) for a migration entry, and that definitely makes sense to
me.  As I also mentioned in the old version of that thread, we can
rework migration entries and merge READ|WRITE entries into a GENERIC
entry one day if you think that's proper, but that's for later.

Let me provide another example of why I think recovering the write bit
may matter outside uffd too, so it's common and it's always good to
check it explicitly.  If you still remember, on sparc64 (I think you're
also in the loop) pte_mkdirty will also apply the write bit; even though
I don't know why, it has worked like that for years.  Consider the
sequence below:

  - map a page writable, write to the page (fault in, set dirty)
  - mprotect(RO) on the page (dirty bit kept, vma/pte becomes RO)
  - migrate the page
  - mk_pte returns with the WRITE bit cleared (which is correct)
  - pte_mkdirty sets the dirty and write bits (because dirty used to be set)

Without the previous mm/migrate patch [1], IIUC this would allow the pte
to become writable even without VM_WRITE after migration.
I'm not sure whether I missed something, nor can I write a reproducer
because I don't have sparc64 systems at hand, not to mention the time.
But hopefully I explained why I think it's safer to always double-check
that the write bit is the same as it was before migration, irrespective
of uffd-wp, vma pgprot, or mk_pte().

For this patch itself, I'll rethink again when I read more on numa.

Thanks,

> Reported-by: Ives van Hoorne
> Debugged-by: Peter Xu
> Fixes: b1f9e876862d ("mm/uffd: enable write protection for shmem & hugetlbfs")
> Cc: stable@vger.kernel.org
> Cc: Andrew Morton
> Cc: Hugh Dickins
> Cc: Alistair Popple
> Cc: Mike Rapoport
> Cc: Nadav Amit
> Cc: Andrea Arcangeli
> Signed-off-by: David Hildenbrand
> ---
>
> Based on latest upstream. userfaultfd selftests seem to pass.
>
> ---
>  fs/userfaultfd.c | 28 ++++++++++++++++++++++------
>  mm/mmap.c        |  4 ++++
>  2 files changed, 26 insertions(+), 6 deletions(-)
>
> diff --git a/fs/userfaultfd.c b/fs/userfaultfd.c
> index 98ac37e34e3d..fb0733f2e623 100644
> --- a/fs/userfaultfd.c
> +++ b/fs/userfaultfd.c
> @@ -108,6 +108,21 @@ static bool userfaultfd_is_initialized(struct userfaultfd_ctx *ctx)
>  	return ctx->features & UFFD_FEATURE_INITIALIZED;
>  }
>  
> +static void userfaultfd_set_vm_flags(struct vm_area_struct *vma,
> +				     vm_flags_t flags)
> +{
> +	const bool uffd_wp = !!((vma->vm_flags | flags) & VM_UFFD_WP);
> +
> +	vma->vm_flags = flags;
> +	/*
> +	 * For shared mappings, we want to enable writenotify while
> +	 * userfaultfd-wp is enabled (see vma_wants_writenotify()). We'll simply
> +	 * recalculate vma->vm_page_prot whenever userfaultfd-wp is involved.
> +	 */
> +	if ((vma->vm_flags & VM_SHARED) && uffd_wp)
> +		vma_set_page_prot(vma);
> +}
> +
>  static int userfaultfd_wake_function(wait_queue_entry_t *wq, unsigned mode,
>  				     int wake_flags, void *key)
>  {
> @@ -618,7 +633,8 @@ static void userfaultfd_event_wait_completion(struct userfaultfd_ctx *ctx,
>  	for_each_vma(vmi, vma) {
>  		if (vma->vm_userfaultfd_ctx.ctx == release_new_ctx) {
>  			vma->vm_userfaultfd_ctx = NULL_VM_UFFD_CTX;
> -			vma->vm_flags &= ~__VM_UFFD_FLAGS;
> +			userfaultfd_set_vm_flags(vma,
> +						 vma->vm_flags & ~__VM_UFFD_FLAGS);
>  		}
>  	}
>  	mmap_write_unlock(mm);
> @@ -652,7 +668,7 @@ int dup_userfaultfd(struct vm_area_struct *vma, struct list_head *fcs)
>  	octx = vma->vm_userfaultfd_ctx.ctx;
>  	if (!octx || !(octx->features & UFFD_FEATURE_EVENT_FORK)) {
>  		vma->vm_userfaultfd_ctx = NULL_VM_UFFD_CTX;
> -		vma->vm_flags &= ~__VM_UFFD_FLAGS;
> +		userfaultfd_set_vm_flags(vma, vma->vm_flags & ~__VM_UFFD_FLAGS);
>  		return 0;
>  	}
>  
> @@ -733,7 +749,7 @@ void mremap_userfaultfd_prep(struct vm_area_struct *vma,
>  	} else {
>  		/* Drop uffd context if remap feature not enabled */
>  		vma->vm_userfaultfd_ctx = NULL_VM_UFFD_CTX;
> -		vma->vm_flags &= ~__VM_UFFD_FLAGS;
> +		userfaultfd_set_vm_flags(vma, vma->vm_flags & ~__VM_UFFD_FLAGS);
>  	}
>  }
>  
> @@ -895,7 +911,7 @@ static int userfaultfd_release(struct inode *inode, struct file *file)
>  			prev = vma;
>  		}
>  
> -		vma->vm_flags = new_flags;
> +		userfaultfd_set_vm_flags(vma, new_flags);
>  		vma->vm_userfaultfd_ctx = NULL_VM_UFFD_CTX;
>  	}
>  	mmap_write_unlock(mm);
> @@ -1463,7 +1479,7 @@ static int userfaultfd_register(struct userfaultfd_ctx *ctx,
>  			 * the next vma was merged into the current one and
>  			 * the current one has not been updated yet.
>  			 */
> -			vma->vm_flags = new_flags;
> +			userfaultfd_set_vm_flags(vma, new_flags);
>  			vma->vm_userfaultfd_ctx.ctx = ctx;
>  
>  			if (is_vm_hugetlb_page(vma) && uffd_disable_huge_pmd_share(vma))
> @@ -1651,7 +1667,7 @@ static int userfaultfd_unregister(struct userfaultfd_ctx *ctx,
>  			 * the next vma was merged into the current one and
>  			 * the current one has not been updated yet.
>  			 */
> -			vma->vm_flags = new_flags;
> +			userfaultfd_set_vm_flags(vma, new_flags);
>  			vma->vm_userfaultfd_ctx = NULL_VM_UFFD_CTX;
>  
>  	skip:
> diff --git a/mm/mmap.c b/mm/mmap.c
> index 74a84eb33b90..ce7526aa5d61 100644
> --- a/mm/mmap.c
> +++ b/mm/mmap.c
> @@ -1525,6 +1525,10 @@ int vma_wants_writenotify(struct vm_area_struct *vma, pgprot_t vm_page_prot)
>  	if (vma_soft_dirty_enabled(vma) && !is_vm_hugetlb_page(vma))
>  		return 1;
>  
> +	/* Do we need write faults for uffd-wp tracking? */
> +	if (userfaultfd_wp(vma))
> +		return 1;
> +
>  	/* Specialty mapping? */
>  	if (vm_flags & VM_PFNMAP)
>  		return 0;
>
> base-commit: a4412fdd49dc011bcc2c0d81ac4cab7457092650
> -- 
> 2.38.1
> 

-- 
Peter Xu