Message-ID: <5f234d04-684e-9338-44b6-df23f7166578@collabora.com>
Date: Fri, 17 Feb 2023 15:59:25 +0500
Cc: Muhammad Usama Anjum, Paul Gofman,
 david@redhat.com, Andrew Morton, kernel@collabora.com,
 linux-mm@kvack.org, linux-kernel@vger.kernel.org
Subject: Re: [PATCH v2 1/2] mm/userfaultfd: Support WP on multiple VMAs
To: Peter Xu
References: <20230213163124.2850816-1-usama.anjum@collabora.com>
 <9f0278d7-54f1-960e-ffdf-eeb2572ff6d1@collabora.com>
 <0549bd0e-85c4-1547-3eaa-16c8a8883837@collabora.com>
From: Muhammad Usama Anjum

On 2/16/23 9:41 PM, Peter Xu wrote:
> On Thu, Feb 16, 2023 at 11:25:51AM +0500, Muhammad Usama Anjum wrote:
>> On 2/16/23 2:45 AM, Peter Xu wrote:
>>> On Wed, Feb 15, 2023 at 12:08:11PM +0500, Muhammad Usama Anjum wrote:
>>>> Hi Peter,
>>>>
>>>> Thank you for your review!
>>>>
>>>> On 2/15/23 2:50 AM, Peter Xu wrote:
>>>>> On Tue, Feb 14, 2023 at 01:49:50PM +0500, Muhammad Usama Anjum wrote:
>>>>>> On 2/14/23 2:11 AM, Peter Xu wrote:
>>>>>>> On Mon, Feb 13, 2023 at 10:50:39PM +0500, Muhammad Usama Anjum wrote:
>>>>>>>> On 2/13/23 9:54 PM, Peter Xu wrote:
>>>>>>>>> On Mon, Feb 13, 2023 at 09:31:23PM +0500, Muhammad Usama Anjum wrote:
>>>>>>>>>> mwriteprotect_range() errors out if [start, end) doesn't fall in one
>>>>>>>>>> VMA. We are facing a use case where multiple VMAs are present in one
>>>>>>>>>> range of interest. For example, the following pseudocode reproduces the
>>>>>>>>>> error which we are trying to fix:
>>>>>>>>>>
>>>>>>>>>> - Allocate memory of size 16 pages with PROT_NONE with mmap
>>>>>>>>>> - Register userfaultfd
>>>>>>>>>> - Change protection of the first half (1 to 8 pages) of memory to
>>>>>>>>>>   PROT_READ | PROT_WRITE. This breaks the memory area in two VMAs.
>>>>>>>>>> - Now UFFDIO_WRITEPROTECT_MODE_WP on the whole memory of 16 pages errors
>>>>>>>>>>   out.
>>>>>>>>>>
>>>>>>>>>> This is a simple use case where user may or may not know if the memory
>>>>>>>>>> area has been divided into multiple VMAs.
>>>>>>>>>>
>>>>>>>>>> Reported-by: Paul Gofman
>>>>>>>>>> Signed-off-by: Muhammad Usama Anjum
>>>>>>>>>> ---
>>>>>>>>>> Changes since v1:
>>>>>>>>>> - Correct the start and ending values passed to uffd_wp_range()
>>>>>>>>>> ---
>>>>>>>>>>  mm/userfaultfd.c | 38 ++++++++++++++++++++++----------------
>>>>>>>>>>  1 file changed, 22 insertions(+), 16 deletions(-)
>>>>>>>>>>
>>>>>>>>>> diff --git a/mm/userfaultfd.c b/mm/userfaultfd.c
>>>>>>>>>> index 65ad172add27..bccea08005a8 100644
>>>>>>>>>> --- a/mm/userfaultfd.c
>>>>>>>>>> +++ b/mm/userfaultfd.c
>>>>>>>>>> @@ -738,9 +738,12 @@ int mwriteprotect_range(struct mm_struct *dst_mm, unsigned long start,
>>>>>>>>>>  			unsigned long len, bool enable_wp,
>>>>>>>>>>  			atomic_t *mmap_changing)
>>>>>>>>>>  {
>>>>>>>>>> +	unsigned long end = start + len;
>>>>>>>>>> +	unsigned long _start, _end;
>>>>>>>>>>  	struct vm_area_struct *dst_vma;
>>>>>>>>>>  	unsigned long page_mask;
>>>>>>>>>>  	int err;
>>>>>>>>>
>>>>>>>>> I think this needs to be initialized or it can return anything when range
>>>>>>>>> not mapped.
>>>>>>>> It is being initialized to -EAGAIN already. It is not visible in this patch.
>>>>>>>
>>>>>>> I see, though -EAGAIN doesn't look suitable at all. The old retcode for
>>>>>>> !vma case is -ENOENT, so I think we'd better keep using it if we want to
>>>>>>> have this patch.
>>>>>> I'll update in next version.
>>>>>>
>>>>>>>
>>>>>>>>
>>>>>>>>>
>>>>>>>>>> +	VMA_ITERATOR(vmi, dst_mm, start);
>>>>>>>>>>
>>>>>>>>>>  	/*
>>>>>>>>>>  	 * Sanitize the command parameters:
>>>>>>>>>> @@ -762,26 +765,29 @@ int mwriteprotect_range(struct mm_struct *dst_mm, unsigned long start,
>>>>>>>>>>  	if (mmap_changing && atomic_read(mmap_changing))
>>>>>>>>>>  		goto out_unlock;
>>>>>>>>>>
>>>>>>>>>> -	err = -ENOENT;
>>>>>>>>>> -	dst_vma = find_dst_vma(dst_mm, start, len);
>>>>>>>>>> +	for_each_vma_range(vmi, dst_vma, end) {
>>>>>>>>>> +		err = -ENOENT;
>>>>>>>>>>
>>>>>>>>>> -	if (!dst_vma)
>>>>>>>>>> -		goto out_unlock;
>>>>>>>>>> -	if (!userfaultfd_wp(dst_vma))
>>>>>>>>>> -		goto out_unlock;
>>>>>>>>>> -	if (!vma_can_userfault(dst_vma, dst_vma->vm_flags))
>>>>>>>>>> -		goto out_unlock;
>>>>>>>>>> +		if (!dst_vma->vm_userfaultfd_ctx.ctx)
>>>>>>>>>> +			break;
>>>>>>>>>> +		if (!userfaultfd_wp(dst_vma))
>>>>>>>>>> +			break;
>>>>>>>>>> +		if (!vma_can_userfault(dst_vma, dst_vma->vm_flags))
>>>>>>>>>> +			break;
>>>>>>>>>>
>>>>>>>>>> -	if (is_vm_hugetlb_page(dst_vma)) {
>>>>>>>>>> -		err = -EINVAL;
>>>>>>>>>> -		page_mask = vma_kernel_pagesize(dst_vma) - 1;
>>>>>>>>>> -		if ((start & page_mask) || (len & page_mask))
>>>>>>>>>> -			goto out_unlock;
>>>>>>>>>> -	}
>>>>>>>>>> +		if (is_vm_hugetlb_page(dst_vma)) {
>>>>>>>>>> +			err = -EINVAL;
>>>>>>>>>> +			page_mask = vma_kernel_pagesize(dst_vma) - 1;
>>>>>>>>>> +			if ((start & page_mask) || (len & page_mask))
>>>>>>>>>> +				break;
>>>>>>>>>> +		}
>>>>>>>>>>
>>>>>>>>>> -	uffd_wp_range(dst_mm, dst_vma, start, len, enable_wp);
>>>>>>>>>> +		_start = (dst_vma->vm_start > start) ? dst_vma->vm_start : start;
>>>>>>>>>> +		_end = (dst_vma->vm_end < end) ?
>>>>>>>>>> +			dst_vma->vm_end : end;
>>>>>>>>>>
>>>>>>>>>> -	err = 0;
>>>>>>>>>> +		uffd_wp_range(dst_mm, dst_vma, _start, _end - _start, enable_wp);
>>>>>>>>>> +		err = 0;
>>>>>>>>>> +	}
>>>>>>>>>>  out_unlock:
>>>>>>>>>>  	mmap_read_unlock(dst_mm);
>>>>>>>>>>  	return err;
>>>>>>>>>
>>>>>>>>> This whole patch also changes the abi, so I'm worried whether there can be
>>>>>>>>> an app that relies on the existing behavior.
>>>>>>>> Even if an app is dependent on it, this change would just not return an error
>>>>>>>> if there are multiple VMAs under the hood and handle them correctly. Most
>>>>>>>> apps wouldn't care about VMAs anyways. I don't know if there would be any
>>>>>>>> drastic behavior change, other than the behavior becoming nicer.
>>>>>>>
>>>>>>> So this logic existed since the initial version of uffd-wp. It has a good
>>>>>>> thing that it strictly checks everything and it makes sense since uffd-wp
>>>>>>> is per-vma attribute. In short, the old code fails clearly.
>>>>>>>
>>>>>>> While the new proposal is not: if -ENOENT we really have no idea what
>>>>>>> happened at all; some ranges can be wr-protected but we don't know where
>>>>>>> starts to go wrong.
>>>>>> The error codes can be returned in a somewhat better way. The
>>>>>> return error codes shouldn't block a correct functionality enhancement patch.
>>>>>>
>>>>>>>
>>>>>>> Now I'm looking at the original problem..
>>>>>>>
>>>>>>> - Allocate memory of size 16 pages with PROT_NONE with mmap
>>>>>>> - Register userfaultfd
>>>>>>> - Change protection of the first half (1 to 8 pages) of memory to
>>>>>>>   PROT_READ | PROT_WRITE. This breaks the memory area in two VMAs.
>>>>>>> - Now UFFDIO_WRITEPROTECT_MODE_WP on the whole memory of 16 pages errors
>>>>>>>   out.
>>>>>>>
>>>>>>> Why should the user app wr-protect 16 pages at all?
>>>>>> Taking arguments from Paul here.
>>>>>>
>>>>>> The app is free to insert guard pages inside the range (with PROT_NONE) and
>>>>>> change the protection of memory freely.
>>>>>> Not sure why it is needed to
>>>>>> complicate things by denying any flexibility. We should never restrict what
>>>>>> is possible and what not. All of these different access attributes and
>>>>>> any combination of their interaction _must_ work without question. The
>>>>>> application should be free to change protection on any sub-range and it
>>>>>> shouldn't break the PAGE_IS_WRITTEN + UFFD_WRITE_PROTECT promise which
>>>>>> PAGEMAP_IOCTL (patches are in progress) and UFFD make.
>>>>>
>>>>> Because uffd-wp has limitations, e.g. it cannot nest so far. I'm fine
>>>>> with allowing mprotect() happening, but just to mention so far it cannot do
>>>>> "any combinations" yet.
>>>>>
>>>>>>
>>>>>>>
>>>>>>> If so, uffd_wp_range() will be run upon a PROT_NONE range which doesn't
>>>>>>> make sense at all, no matter whether the user is aware of vma concept or
>>>>>>> not... because it's destined that it's a vain effort.
>>>>>> It is not a vain effort. The user wants to watch/find the dirty pages of a
>>>>>> range while working with it: reserve and watch at once while Write
>>>>>> protecting or un-protecting as needed. There may be several different use
>>>>>> cases. Inserting guard pages to catch out of range access, map something
>>>>>> only when it is needed; unmap or PROT_NONE pages when they are set free in
>>>>>> the app etc.
>>>>>
>>>>> Fair enough.
>>>>>
>>>>>>
>>>>>>>
>>>>>>> So IMHO it's the user app that needs fixing here, not the interface? I think
>>>>>>> it's the matter of whether the monitor is aware of mprotect() being
>>>>>>> invoked.
>>>>>> No. The common practice is to allocate a big memory chunk at once and have
>>>>>> a custom allocator over it (whether it is some specific allocator in a game
>>>>>> or a .net allocator with garbage collector). From the usage point of view it
>>>>>> is very limiting to demand constant memory attributes for the whole range.
>>>>>>
>>>>>> That said, if we do have a way to do exactly what we want with reset
>>>>>> through the pagemap fd, and it is just that the UFFD ioctl works differently,
>>>>>> it is not a blocker of course, just weird api design.
>>>>>
>>>>> Do you mean you'll disable ENGAGE_WP && !GET in your other series? Yes, if
>>>>> this will serve your goal, it'll be perfect to remove that interface.
>>>> No, we cannot remove it.
>>>
>>> If this patch can land, I assume ioctl(UFFDIO_WP) can start to serve the
>>> dirty tracking purpose, then why do you still need "ENGAGE_WP && !GET"?
>> We don't need it. We need the following operations only:
>> 1 GET
>> 2 ENGAGE_WP + GET
>> Since we have these two operations, we added the following as well:
>> 3 ENGAGE_WP + !GET or only ENGAGE_WP
>> This (3) can be removed from ioctl(PAGEMAP_IOCTL) if reviewers ask. I can
>> remove it in favour of this ioctl(UFFDIO_WP) patch for sure.
>
> Yes I prefer not having it if this works, because then they'll be
> completely duplicated.
I'll remove it in the next version.

>
>>
>>>
>>> Note, I'm not asking to drop ENGAGE_WP entirely, only when !GET.
>>>
>>>>
>>>>>
>>>>>>
>>>>>>>
>>>>>>> In short, I hope we're working on things that help at least someone, and
>>>>>>> we should avoid working on things that do not have clear benefit yet.
>>>>>>> With the WP_ENGAGE new interface being proposed, I just didn't see any
>>>>>>> benefit of changing the current interface, especially if the change can
>>>>>>> bring uncertainties itself (e.g., should we fail upon !uffd-wp vmas, or
>>>>>>> should we skip?).
>>>>>> We can work on solving uncertainties in case of error conditions. Fail if
>>>>>> !uffd-wp vma comes.
>>>>>
>>>>> Let me try to double check with you here:
>>>>>
>>>>> I assume you want to skip any vma that is not mapped at all, as the loop
>>>>> already does so. So it'll succeed if there're memory holes.
>>>>>
>>>>> You also want to explicitly fail if some vma is not registered with uffd-wp
>>>>> when walking the vma list, am I right? IOW, the tracee _won't_ ever have a
>>>>> chance to unregister uffd-wp itself, right?
>>>> Yes, fail if any VMA doesn't have uffd-wp. This fail means the
>>>> write-protection or un-protection failed on a region of memory with error
>>>> -ENOENT. This is already happening in this current patch. The unregister
>>>> code would remain the same. The register and unregister ioctls are already
>>>> going over all the VMAs in a range. I'm not rigid on anything. Let me
>>>> define the interface below.
>>>>
>>>>>
>>>>>>
>>>>>>>
>>>>>>>>
>>>>>>>>>
>>>>>>>>> Is this for the new pagemap effort? Can this just be done in the new
>>>>>>>>> interface rather than changing the old?
>>>>>>>> We found this bug while working on pagemap patches. It is already being
>>>>>>>> handled in the new interface. We just thought that this use case can happen
>>>>>>>> pretty easily and unknowingly. So the support should be added.
>>>>>>>
>>>>>>> Thanks. My understanding is that it would have been reported if it
>>>>>>> affected any existing uffd-wp user.
>>>>>> I would consider the UFFD WP a recent functionality and it may not be
>>>>>> used in a wide range of app scenarios.
>>>>>
>>>>> Yes I think so.
>>>>>
>>>>> Existing users should in most cases be applying the ioctl upon valid vmas
>>>>> somehow. I think the chance is low that someone relies on the errcode to
>>>>> make other decisions, but I just cannot really tell because the user app
>>>>> can do many weird things.
>>>> Correct. The user can use any combination of operations
>>>> (mmap/mprotect/uffd). They must work in harmony.
>>>
>>> No uffd - that's exactly what I'm saying: mprotect is fine here, but uffd
>>> is probably not, not in a nested way.
>>> When you try to UFFDIO_REGISTER upon
>>> some range that has already been registered (by the tracee), it'll fail for
>>> you immediately:
>>>
>>>	/*
>>>	 * Check that this vma isn't already owned by a
>>>	 * different userfaultfd. We can't allow more than one
>>>	 * userfaultfd to own a single vma simultaneously or we
>>>	 * wouldn't know which one to deliver the userfaults to.
>>>	 */
>>>	ret = -EBUSY;
>>>	if (cur->vm_userfaultfd_ctx.ctx &&
>>>	    cur->vm_userfaultfd_ctx.ctx != ctx)
>>>		goto out_unlock;
>>>
>>> So if this won't work for you, then AFAICT uffd-wp won't work for you (just
>>> like soft-dirty but in another way, sorry), at least not until someone
>>> starts to work on the nested.
>> I was referring to a case where user registers the WP on multiple VMAs and
>> all the VMAs haven't been registered before. It would work. Right?
>
> But what if the user wants to do that while you're tracing it using
> userfaultfd? Can it happen?
Sorry, I've not tested tracing.

>
> Thanks,
>
--
BR,
Muhammad Usama Anjum
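
For readers following the thread, the per-VMA clamping in the quoted patch (the `_start`/`_end` computation inside `for_each_vma_range()`) can be modelled in plain userspace C. This is only a sketch under stated assumptions, not kernel code: `struct toy_vma`, `struct toy_range` and `toy_wp_range()` are hypothetical names invented here, the VMA list is a sorted array instead of the maple tree, the hugetlb alignment check is left out, and `err` starts at -ENOENT as Peter suggested rather than the -EAGAIN of the posted v2.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical userspace stand-in for the few vm_area_struct fields used. */
struct toy_vma {
	unsigned long vm_start;	/* first byte covered (inclusive) */
	unsigned long vm_end;	/* first byte past the VMA (exclusive) */
	bool uffd_wp;		/* models userfaultfd_wp(vma) */
};

/* One sub-range that the kernel loop would hand to uffd_wp_range(). */
struct toy_range {
	unsigned long start;
	unsigned long len;
};

/*
 * Model of the patched loop. `vmas` must be sorted by address and
 * non-overlapping. Returns 0 if every VMA intersecting
 * [start, start + len) is uffd-wp registered, -ENOENT if no VMA
 * intersects or an unregistered VMA is met, -ENOSPC if `out` is full.
 * *nr_out is the number of sub-ranges recorded before stopping.
 */
static int toy_wp_range(const struct toy_vma *vmas, size_t nr_vmas,
			unsigned long start, unsigned long len,
			struct toy_range *out, size_t out_cap, size_t *nr_out)
{
	unsigned long end = start + len;
	int err = -ENOENT;	/* assumption: Peter's suggested default */
	size_t n = 0;

	assert(len > 0 && end > start);

	for (size_t i = 0; i < nr_vmas; i++) {
		const struct toy_vma *vma = &vmas[i];

		if (vma->vm_end <= start)
			continue;	/* entirely below the range */
		if (vma->vm_start >= end)
			break;		/* entirely above the range */

		if (!vma->uffd_wp) {
			err = -ENOENT;	/* fail on a !uffd-wp VMA */
			break;
		}
		if (n == out_cap) {
			err = -ENOSPC;
			break;
		}

		/* Clamp the request to this VMA, as _start/_end do. */
		unsigned long s = vma->vm_start > start ? vma->vm_start : start;
		unsigned long e = vma->vm_end < end ? vma->vm_end : end;

		out[n].start = s;
		out[n].len = e - s;
		n++;
		err = 0;
	}

	*nr_out = n;
	return err;
}
```

With a 16-page region at 0x1000 that is split into two VMAs at 0x9000, a request over the whole region yields two sub-ranges of 0x8000 bytes each. If the second VMA is marked as not registered, the call returns -ENOENT after the first sub-range has already been recorded, which is the partial-application concern raised earlier in the thread; the model exposes the count through `nr_out`, whereas the patch as posted returns only the error code.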