Date: Wed, 18 Jan 2023 11:54:59 -0500
From: Peter Xu <peterx@redhat.com>
To: Muhammad Usama Anjum <usama.anjum@collabora.com>
Cc: David Hildenbrand, Andrew Morton, Michał Mirosław, Andrei Vagin,
    Danylo Mocherniuk, Paul Gofman, Cyrill Gorcunov, Alexander Viro,
    Shuah Khan, Christian Brauner, Yang Shi, Vlastimil Babka,
    Liam R. Howlett, Yun Zhou, Suren Baghdasaryan, Alex Sierra,
    Matthew Wilcox, Pasha Tatashin, Mike Rapoport, Nadav Amit,
    Axel Rasmussen, Gustavo A. R. Silva, Dan Williams,
    linux-kernel@vger.kernel.org, linux-fsdevel@vger.kernel.org,
    linux-mm@kvack.org, linux-kselftest@vger.kernel.org, Greg KH,
    kernel@collabora.com
Subject: Re: [PATCH v7 1/4] userfaultfd: Add UFFD WP Async support
References: <20230109064519.3555250-1-usama.anjum@collabora.com>
    <20230109064519.3555250-2-usama.anjum@collabora.com>
In-Reply-To: <20230109064519.3555250-2-usama.anjum@collabora.com>

Hi, Muhammad,

On Mon, Jan 09, 2023 at 11:45:16AM +0500, Muhammad Usama Anjum wrote:
> Add a new WP Async mode (UFFDIO_WRITEPROTECT_MODE_ASYNC_WP) which
> resolves the page faults on its own.  It can be used to track which
> pages have been written to from the time the pages were write
> protected.  It is a very efficient way to track changes, as uffd is by
> nature pte/pmd based.
>
> UFFD WP (UFFDIO_WRITEPROTECT_MODE_WP) sends the page faults to
> userspace, where the pages which have been written to can be tracked.
> But it is not efficient.  This is why this async version is being
> added.  After setting WP Async, the pages which have been written to
> can be found in the pagemap file, or the information can be obtained
> from the PAGEMAP_IOCTL (see next patches).
>
> Signed-off-by: Muhammad Usama Anjum <usama.anjum@collabora.com>
> ---
>  fs/userfaultfd.c                 | 150 +++++++++++++++++--------------
>  include/uapi/linux/userfaultfd.h |   6 ++
>  2 files changed, 90 insertions(+), 66 deletions(-)
>
> diff --git a/fs/userfaultfd.c b/fs/userfaultfd.c
> index 15a5bf765d43..be5e10d15058 100644
> --- a/fs/userfaultfd.c
> +++ b/fs/userfaultfd.c
> @@ -69,6 +69,7 @@ struct userfaultfd_ctx {
>  	unsigned int features;
>  	/* released */
>  	bool released;
> +	bool async;

Let's just make it a feature flag, UFFD_FEATURE_WP_ASYNC.
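To be concrete, a strawman of what I mean (completely untested; (1<<13)
just assumes that's the next free feature bit by the time this lands):

  /* include/uapi/linux/userfaultfd.h */
  #define UFFD_FEATURE_WP_ASYNC		(1<<13)

  /* fs/userfaultfd.c - plus a trivial helper to test it */
  static inline bool userfaultfd_wp_async_ctx(struct userfaultfd_ctx *ctx)
  {
          return ctx->features & UFFD_FEATURE_WP_ASYNC;
  }

It'll also need to be added to UFFD_API_FEATURES so it can be negotiated
once with UFFDIO_API like every other feature bit (more on the intended
semantics below).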
>  	/* memory mappings are changing because of non-cooperative event */
>  	atomic_t mmap_changing;
>  	/* mm with one ore more vmas attached to this userfaultfd_ctx */
> @@ -497,80 +498,93 @@ vm_fault_t handle_userfault(struct vm_fault *vmf, unsigned long reason)
>
>  	/* take the reference before dropping the mmap_lock */
>  	userfaultfd_ctx_get(ctx);
> +	if (ctx->async) {

Firstly, please consider not touching the existing code/indent as much as
this patch does.  Hopefully we can keep the major part of the sync uffd
path as it is, together with its git log; it also helps with reviewing
your code.  You can add the async block before it, handle the fault, and
return earlier.

Also, I think this is a bit too late, because we're going to return with
VM_FAULT_RETRY here, while we may not need to retry at all: we're going
to resolve the page fault immediately.

I assume you added this here because you wanted userfaultfd_ctx_get() to
make sure the uffd context won't go away from under us, but that's not
needed as long as we're still holding the mmap read lock.  I'd expect
that for async mode we don't need to release it at all.

> +		// Resolve page fault of this page

Please use "/* ... */", as that's the common commenting pattern in the
Linux kernel, at least what I see in mm/.

> +		unsigned long addr = (ctx->features & UFFD_FEATURE_EXACT_ADDRESS) ?
> +				     vmf->real_address : vmf->address;
> +		struct vm_area_struct *dst_vma = find_vma(ctx->mm, addr);
> +		size_t s = PAGE_SIZE;

This is weird - if we want async uffd-wp, let's consider huge pages from
the 1st day.

> +
> +		if (dst_vma->vm_flags & VM_HUGEPAGE) {

VM_HUGEPAGE is only a hint.  It doesn't mean this page is always a huge
page.  For anon, we can have a thp wr-protected as a whole; that doesn't
happen for !anon, because there we'll have split already.

For anon, if a write happens to a thp being uffd-wp-ed, we keep that pmd
wr-protected and report the uffd message.  The pmd split happens when the
user invokes UFFDIO_WRITEPROTECT on the small page.  I think that will
stop working for async uffd-wp, because we're going to resolve the page
faults right away.

So for async uffd-wp (note: this will be different from hugetlb), you may
want to consider a pre-requisite patch changing the wp_huge_pmd()
behavior: rather than calling handle_userfault(), IIUC you can just fall
back to the split path right below it (__split_huge_pmd), so the thp will
split now, even before the uffd message would have been generated.

I think it should be transparent to the user, and it'll make async
uffd-wp work here, because it means that by the time we reach
handle_userfault() there cannot be a thp at all - they should have all
split up.
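Completely untested, but I'd expect the wp_huge_pmd() change to look
roughly like below.  userfaultfd_wp_async(vma) is hypothetical - some
helper that can tell whether the vma's uffd ctx has the async feature
set:

  static inline vm_fault_t wp_huge_pmd(struct vm_fault *vmf)
  {
          const bool unshare = vmf->flags & FAULT_FLAG_UNSHARE;

          if (vma_is_anonymous(vmf->vma)) {
                  if (likely(!unshare) &&
                      userfaultfd_huge_pmd_wp(vmf->vma, vmf->orig_pmd)) {
                          /*
                           * Async uffd-wp: no message to deliver, so
                           * split right away and let the fault be
                           * resolved on the pte level below.
                           */
                          if (userfaultfd_wp_async(vmf->vma))
                                  goto split;
                          return handle_userfault(vmf, VM_UFFD_WP);
                  }
                  return do_huge_pmd_wp_page(vmf);
          }

          /* ... the !anon huge_fault path stays as-is ... */

  split:
          /* COW or write-notify handled on pte level: split pmd. */
          __split_huge_pmd(vmf->vma, vmf->pmd, vmf->address, false, NULL);
          return VM_FAULT_FALLBACK;
  }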
> +			s = HPAGE_SIZE;
> +			addr &= HPAGE_MASK;
> +		}
>
> -	init_waitqueue_func_entry(&uwq.wq, userfaultfd_wake_function);
> -	uwq.wq.private = current;
> -	uwq.msg = userfault_msg(vmf->address, vmf->real_address, vmf->flags,
> -				reason, ctx->features);
> -	uwq.ctx = ctx;
> -	uwq.waken = false;
> -
> -	blocking_state = userfaultfd_get_blocking_state(vmf->flags);
> +		ret = mwriteprotect_range(ctx->mm, addr, s, false, &ctx->mmap_changing);

This is overkill - we're pretty sure it's a single page; no need to call
a range function here.
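Overall, for the async path I'd expect handle_userfault() to grow an
early block shaped something like this (again completely untested;
uffd_wp_async_resolve_one() is a made-up name for whatever single
pte/pmd resolve helper we end up with):

          if (userfaultfd_wp_async_ctx(ctx)) {
                  unsigned long addr = (ctx->features & UFFD_FEATURE_EXACT_ADDRESS) ?
                                       vmf->real_address : vmf->address;

                  /*
                   * Resolve the fault here and now on this one page:
                   * clear the uffd-wp bit and let the access be
                   * retried.  With the wp_huge_pmd() change above, no
                   * anon thp can reach here, so a single-pte helper
                   * (hypothetical) is enough.
                   */
                  uffd_wp_async_resolve_one(ctx->mm, addr);

                  /*
                   * No message to queue, no VM_FAULT_RETRY, and no
                   * userfaultfd_ctx_get(): we never drop the mmap read
                   * lock, so the ctx cannot go away from under us.
                   */
                  return 0;
          }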
> +	} else {
> +		init_waitqueue_func_entry(&uwq.wq, userfaultfd_wake_function);
> +		uwq.wq.private = current;
> +		uwq.msg = userfault_msg(vmf->address, vmf->real_address, vmf->flags,
> +					reason, ctx->features);
> +		uwq.ctx = ctx;
> +		uwq.waken = false;
>
> -	/*
> -	 * Take the vma lock now, in order to safely call
> -	 * userfaultfd_huge_must_wait() later. Since acquiring the
> -	 * (sleepable) vma lock can modify the current task state, that
> -	 * must be before explicitly calling set_current_state().
> -	 */
> -	if (is_vm_hugetlb_page(vma))
> -		hugetlb_vma_lock_read(vma);
> +		blocking_state = userfaultfd_get_blocking_state(vmf->flags);
>
> -	spin_lock_irq(&ctx->fault_pending_wqh.lock);
> -	/*
> -	 * After the __add_wait_queue the uwq is visible to userland
> -	 * through poll/read().
> -	 */
> -	__add_wait_queue(&ctx->fault_pending_wqh, &uwq.wq);
> -	/*
> -	 * The smp_mb() after __set_current_state prevents the reads
> -	 * following the spin_unlock to happen before the list_add in
> -	 * __add_wait_queue.
> -	 */
> -	set_current_state(blocking_state);
> -	spin_unlock_irq(&ctx->fault_pending_wqh.lock);
> +		/*
> +		 * Take the vma lock now, in order to safely call
> +		 * userfaultfd_huge_must_wait() later. Since acquiring the
> +		 * (sleepable) vma lock can modify the current task state, that
> +		 * must be before explicitly calling set_current_state().
> +		 */
> +		if (is_vm_hugetlb_page(vma))
> +			hugetlb_vma_lock_read(vma);
>
> -	if (!is_vm_hugetlb_page(vma))
> -		must_wait = userfaultfd_must_wait(ctx, vmf->address, vmf->flags,
> -						  reason);
> -	else
> -		must_wait = userfaultfd_huge_must_wait(ctx, vma,
> -						       vmf->address,
> -						       vmf->flags, reason);
> -	if (is_vm_hugetlb_page(vma))
> -		hugetlb_vma_unlock_read(vma);
> -	mmap_read_unlock(mm);
> +		spin_lock_irq(&ctx->fault_pending_wqh.lock);
> +		/*
> +		 * After the __add_wait_queue the uwq is visible to userland
> +		 * through poll/read().
> +		 */
> +		__add_wait_queue(&ctx->fault_pending_wqh, &uwq.wq);
> +		/*
> +		 * The smp_mb() after __set_current_state prevents the reads
> +		 * following the spin_unlock to happen before the list_add in
> +		 * __add_wait_queue.
> +		 */
> +		set_current_state(blocking_state);
> +		spin_unlock_irq(&ctx->fault_pending_wqh.lock);
>
> -	if (likely(must_wait && !READ_ONCE(ctx->released))) {
> -		wake_up_poll(&ctx->fd_wqh, EPOLLIN);
> -		schedule();
> -	}
> +		if (!is_vm_hugetlb_page(vma))
> +			must_wait = userfaultfd_must_wait(ctx, vmf->address, vmf->flags,
> +							  reason);
> +		else
> +			must_wait = userfaultfd_huge_must_wait(ctx, vma,
> +							       vmf->address,
> +							       vmf->flags, reason);
> +		if (is_vm_hugetlb_page(vma))
> +			hugetlb_vma_unlock_read(vma);
> +		mmap_read_unlock(mm);
> +
> +		if (likely(must_wait && !READ_ONCE(ctx->released))) {
> +			wake_up_poll(&ctx->fd_wqh, EPOLLIN);
> +			schedule();
> +		}
>
> -	__set_current_state(TASK_RUNNING);
> +		__set_current_state(TASK_RUNNING);
>
> -	/*
> -	 * Here we race with the list_del; list_add in
> -	 * userfaultfd_ctx_read(), however because we don't ever run
> -	 * list_del_init() to refile across the two lists, the prev
> -	 * and next pointers will never point to self. list_add also
> -	 * would never let any of the two pointers to point to
> -	 * self. So list_empty_careful won't risk to see both pointers
> -	 * pointing to self at any time during the list refile. The
> -	 * only case where list_del_init() is called is the full
> -	 * removal in the wake function and there we don't re-list_add
> -	 * and it's fine not to block on the spinlock. The uwq on this
> -	 * kernel stack can be released after the list_del_init.
> -	 */
> -	if (!list_empty_careful(&uwq.wq.entry)) {
> -		spin_lock_irq(&ctx->fault_pending_wqh.lock);
>  		/*
> -		 * No need of list_del_init(), the uwq on the stack
> -		 * will be freed shortly anyway.
> +		 * Here we race with the list_del; list_add in
> +		 * userfaultfd_ctx_read(), however because we don't ever run
> +		 * list_del_init() to refile across the two lists, the prev
> +		 * and next pointers will never point to self. list_add also
> +		 * would never let any of the two pointers to point to
> +		 * self. So list_empty_careful won't risk to see both pointers
> +		 * pointing to self at any time during the list refile. The
> +		 * only case where list_del_init() is called is the full
> +		 * removal in the wake function and there we don't re-list_add
> +		 * and it's fine not to block on the spinlock. The uwq on this
> +		 * kernel stack can be released after the list_del_init.
>  		 */
> -		list_del(&uwq.wq.entry);
> -		spin_unlock_irq(&ctx->fault_pending_wqh.lock);
> +		if (!list_empty_careful(&uwq.wq.entry)) {
> +			spin_lock_irq(&ctx->fault_pending_wqh.lock);
> +			/*
> +			 * No need of list_del_init(), the uwq on the stack
> +			 * will be freed shortly anyway.
> +			 */
> +			list_del(&uwq.wq.entry);
> +			spin_unlock_irq(&ctx->fault_pending_wqh.lock);
> +		}
>  	}
> -
>  	/*
>  	 * ctx may go away after this if the userfault pseudo fd is
>  	 * already released.
> @@ -1861,11 +1875,14 @@ static int userfaultfd_writeprotect(struct userfaultfd_ctx *ctx,
>  		return ret;
>
>  	if (uffdio_wp.mode & ~(UFFDIO_WRITEPROTECT_MODE_DONTWAKE |
> -			       UFFDIO_WRITEPROTECT_MODE_WP))
> +			       UFFDIO_WRITEPROTECT_MODE_WP |
> +			       UFFDIO_WRITEPROTECT_MODE_ASYNC_WP))
>  		return -EINVAL;
>
> -	mode_wp = uffdio_wp.mode & UFFDIO_WRITEPROTECT_MODE_WP;
> +	mode_wp = uffdio_wp.mode & (UFFDIO_WRITEPROTECT_MODE_WP |
> +				    UFFDIO_WRITEPROTECT_MODE_ASYNC_WP);
>  	mode_dontwake = uffdio_wp.mode & UFFDIO_WRITEPROTECT_MODE_DONTWAKE;
> +	ctx->async = uffdio_wp.mode & UFFDIO_WRITEPROTECT_MODE_ASYNC_WP;

Please no.. ctx attributes shouldn't be changed that easily by a single
ioctl.  I suggest we use a new feature bit as I mentioned above (say,
UFFD_FEATURE_WP_ASYNC), set it once with UFFDIO_API, and have it apply
to the whole lifecycle of this uffd handle.

That flag should (something I can quickly think of):

  - Have effect only if the uffd is registered with WP mode (of course),
    and be ignored in any other mode,

  - Fail any attempt of UFFDIO_WRITEPROTECT with wp=false on this uffd
    handle, because with async faults no page fault resolution is needed
    from userspace,

  - Apply to any region registered with this uffd ctx, so it's mutually
    exclusive with sync uffd-wp mode.

Then when the app wants to wr-protect in async mode, it simply goes ahead
with UFFDIO_WRITEPROTECT(wp=true); it'll happen exactly the same way as
in sync mode.  It's only the page fault handling procedure that's
different (along with how the fault is reported - rather than as a
message, it'll be consolidated into the soft-dirty bit).
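From the app side, the whole flow would then be the below (userspace
sketch, untested, assuming the UFFD_FEATURE_WP_ASYNC bit proposed
above):

  #include <fcntl.h>
  #include <sys/ioctl.h>
  #include <sys/syscall.h>
  #include <unistd.h>
  #include <linux/userfaultfd.h>

  /* Start one tracking round over [area, area + len). */
  static int start_wp_async(void *area, unsigned long len)
  {
          int uffd = syscall(__NR_userfaultfd, O_CLOEXEC | O_NONBLOCK);

          struct uffdio_api api = {
                  .api = UFFD_API,
                  .features = UFFD_FEATURE_WP_ASYNC,	/* proposed bit */
          };
          if (uffd < 0 || ioctl(uffd, UFFDIO_API, &api))
                  return -1;

          struct uffdio_register reg = {
                  .range = { .start = (unsigned long)area, .len = len },
                  .mode = UFFDIO_REGISTER_MODE_WP,
          };
          if (ioctl(uffd, UFFDIO_REGISTER, &reg))
                  return -1;

          /* Exactly the same call as in sync mode. */
          struct uffdio_writeprotect wp = {
                  .range = { .start = (unsigned long)area, .len = len },
                  .mode = UFFDIO_WRITEPROTECT_MODE_WP,
          };
          if (ioctl(uffd, UFFDIO_WRITEPROTECT, &wp))
                  return -1;

          /* Writes now resolve in-kernel; later, read back which pages
           * were written via pagemap / the new PAGEMAP_IOCTL. */
          return uffd;
  }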
>
>  	if (mode_wp && mode_dontwake)
>  		return -EINVAL;
> @@ -2126,6 +2143,7 @@ static int new_userfaultfd(int flags)
>  	ctx->flags = flags;
>  	ctx->features = 0;
>  	ctx->released = false;
> +	ctx->async = false;
>  	atomic_set(&ctx->mmap_changing, 0);
>  	ctx->mm = current->mm;
>  	/* prevent the mm struct to be freed */
> diff --git a/include/uapi/linux/userfaultfd.h b/include/uapi/linux/userfaultfd.h
> index 005e5e306266..b89665653861 100644
> --- a/include/uapi/linux/userfaultfd.h
> +++ b/include/uapi/linux/userfaultfd.h
> @@ -284,6 +284,11 @@ struct uffdio_writeprotect {
>   * UFFDIO_WRITEPROTECT_MODE_DONTWAKE: set the flag to avoid waking up
>   * any wait thread after the operation succeeds.
>   *
> + * UFFDIO_WRITEPROTECT_MODE_ASYNC_WP: set the flag to write protect a
> + * range; the flag is unset automatically when the page is written.
> + * This is used to track which pages have been written to from the
> + * time the memory was write protected.
> + *
>   * NOTE: Write protecting a region (WP=1) is unrelated to page faults,
>   * therefore DONTWAKE flag is meaningless with WP=1. Removing write
>   * protection (WP=0) in response to a page fault wakes the faulting
> @@ -291,6 +296,7 @@ struct uffdio_writeprotect {
>   */
>  #define UFFDIO_WRITEPROTECT_MODE_WP		((__u64)1<<0)
>  #define UFFDIO_WRITEPROTECT_MODE_DONTWAKE	((__u64)1<<1)
> +#define UFFDIO_WRITEPROTECT_MODE_ASYNC_WP	((__u64)1<<2)
>  	__u64 mode;
>  };
>
> --
> 2.30.2
>

--
Peter Xu