Message-ID: <9c167d01-ef09-ec4e-b4a1-2fff62bf01fe@redhat.com>
Date: Wed, 9 Nov 2022 11:34:43 +0100
To: Muhammad Usama Anjum, Michał Mirosław, Andrei Vagin,
 Danylo Mocherniuk, Alexander Viro, Andrew Morton, Suren Baghdasaryan,
 Greg KH, Christian Brauner, Peter Xu, Yang Shi, Vlastimil Babka,
 Zach O'Keefe, "Matthew Wilcox (Oracle)", "Gustavo A. R. Silva",
 Dan Williams, kernel@collabora.com, Gabriel Krisman Bertazi,
 Peter Enderborg, "open list : KERNEL SELFTEST FRAMEWORK", Shuah Khan,
 open list, "open list : PROC FILESYSTEM",
 "open list : MEMORY MANAGEMENT", Paul Gofman
References: <20221109102303.851281-1-usama.anjum@collabora.com>
From: David Hildenbrand
Organization: Red Hat
Subject: Re: [PATCH v6 0/3] Implement IOCTL to get and/or the clear info about PTEs
In-Reply-To: <20221109102303.851281-1-usama.anjum@collabora.com>
X-Mailing-List: linux-kernel@vger.kernel.org

On 09.11.22 11:23, Muhammad Usama Anjum wrote:
> Changes in v6:
> - Updated the interface and made cosmetic changes
>
> Original Cover Letter in v5:
> Hello,
>
> This patch series implements IOCTL on the pagemap procfs file to get the
> information about the
> page table entries (PTEs). The following operations
> are supported in this ioctl:
> - Get the information if the pages are soft-dirty, file mapped, present
>   or swapped.
> - Clear the soft-dirty PTE bit of the pages.
> - Get and clear the soft-dirty PTE bit of the pages atomically.
>
> Soft-dirty PTE bit of the memory pages can be read by using the pagemap
> procfs file. The soft-dirty PTE bit for the whole memory range of the
> process can be cleared by writing to the clear_refs file. There are other
> methods to mimic this information entirely in userspace with poor
> performance:
> - The mprotect syscall and SIGSEGV handler for bookkeeping
> - The userfaultfd syscall with the handler for bookkeeping
> Some benchmarks can be seen here[1]. This series adds features that weren't
> present earlier:
> - There is no atomic get soft-dirty PTE bit status and clear operation
>   possible.
> - The soft-dirty PTE bit of only a part of memory cannot be cleared.
>
> Historically, soft-dirty PTE bit tracking has been used in the CRIU
> project. The procfs interface is enough for finding the soft-dirty bit
> status and clearing the soft-dirty bit of all the pages of a process.
> We have the use case where we need to track the soft-dirty PTE bit for
> only specific pages on demand. We need this tracking and clear mechanism
> of a region of memory while the process is running to emulate the
> getWriteWatch() syscall of Windows. This syscall is used by games to
> keep track of dirty pages to process only the dirty pages.
>
> The information related to pages if the page is file mapped, present and
> swapped is required for the CRIU project[2][3]. The addition of the
> required mask, any mask, excluded mask and return masks are also required
> for the CRIU project[2].
>
> The IOCTL returns the addresses of the pages which match the specific masks.
> The page addresses are returned in struct page_region in a compact form.
> The max_pages is needed to support a use case where user only wants to get
> a specific number of pages. So there is no need to find all the pages of
> interest in the range when max_pages is specified. The IOCTL returns when
> the maximum number of the pages are found. The max_pages is optional. If
> max_pages is specified, it must be equal or greater than the vec_size.
> This restriction is needed to handle worse case when one page_region only
> contains info of one page and it cannot be compacted. This is needed to
> emulate the Windows getWriteWatch() syscall.
>
> Some non-dirty pages get marked as dirty because of the kernel's
> internal activity (such as VMA merging as soft-dirty bit difference isn't
> considered while deciding to merge VMAs). The dirty bit of the pages is
> stored in the VMA flags and in the per page flags. If any of these two bits
> are set, the page is considered to be soft dirty. Suppose you have cleared
> the soft dirty bit of half of VMA which will be done by splitting the VMA
> and clearing soft dirty bit flag in the half VMA and the pages in it. Now
> kernel may decide to merge the VMAs again. So the half VMA becomes dirty
> again. This splitting/merging costs performance. The application receives
> a lot of pages which aren't dirty in reality but marked as dirty.
> Performance is lost again here. Also sometimes user doesn't want the newly
> allocated memory to be marked as dirty. PAGEMAP_NO_REUSED_REGIONS flag
> solves both the problems. It is used to not depend on the soft dirty flag
> in the VMA flags. So VMA splitting and merging doesn't happen. It only
> depends on the soft dirty bit of the individual pages. Thus by using this
> flag, there may be a scenerio such that the new memory regions which are
> just created, doesn't look dirty when seen with the IOCTL, but look dirty
> when seen from procfs. This seems okay as the user of this flag know the
> implication of using it.

Please separate that part out from the other changes; I am still not
convinced that we want this, or what the semantic implications are.

Let's take a look at an example: can_change_pte_writable()

	/* Do we need write faults for softdirty tracking? */
	if (vma_soft_dirty_enabled(vma) && !pte_soft_dirty(pte))
		return false;

We care about PTE softdirty tracking only if it is enabled for the VMA.
Tracking is enabled if: vma_soft_dirty_enabled()

	/*
	 * Soft-dirty is kind of special: its tracking is enabled when
	 * the vma flags not set.
	 */
	return !(vma->vm_flags & VM_SOFTDIRTY);

Consequently, if VM_SOFTDIRTY is set on the VMA, we do not consider the
soft-dirty PTE bits at all.

I'd suggest moving forward without this controversial
PAGEMAP_NO_REUSED_REGIONS functionality for now, and preparing it as a
clear add-on we can discuss separately.

-- 
Thanks,

David / dhildenb
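
For readers who want to poke at the two checks quoted above, here is a minimal userspace sketch of them. It is only a model: the toy_ types, the TOY_VM_SOFTDIRTY value and the helper names are made up for illustration and are not the kernel's definitions, and it covers only the soft-dirty condition, not the rest of can_change_pte_writable().

```c
#include <assert.h>
#include <stdbool.h>

/* Placeholder flag value for the model; not the kernel's VM_SOFTDIRTY. */
#define TOY_VM_SOFTDIRTY 0x1UL

/* Hypothetical stand-ins holding only the state the two checks look at. */
struct toy_vma {
	unsigned long vm_flags;
};

struct toy_pte {
	bool soft_dirty;
};

/* Models the quoted check: tracking is enabled when the VMA flag is NOT set. */
static bool toy_vma_soft_dirty_enabled(const struct toy_vma *vma)
{
	return !(vma->vm_flags & TOY_VM_SOFTDIRTY);
}

/*
 * Models only the soft-dirty condition quoted from
 * can_change_pte_writable(): a write fault is wanted when tracking is
 * enabled for the VMA and the PTE is not yet soft-dirty.
 */
static bool toy_needs_write_fault(const struct toy_vma *vma,
				  const struct toy_pte *pte)
{
	return toy_vma_soft_dirty_enabled(vma) && !pte->soft_dirty;
}

/* Walks all four combinations of VMA flag and PTE bit. */
static void toy_softdirty_selfcheck(void)
{
	struct toy_vma tracked = { .vm_flags = 0 };
	struct toy_vma flagged = { .vm_flags = TOY_VM_SOFTDIRTY };
	struct toy_pte clean = { .soft_dirty = false };
	struct toy_pte dirty = { .soft_dirty = true };

	/* Flag clear: the PTE bit decides whether a fault is wanted. */
	assert(toy_needs_write_fault(&tracked, &clean));
	assert(!toy_needs_write_fault(&tracked, &dirty));

	/* Flag set: the PTE bit is never consulted. */
	assert(!toy_needs_write_fault(&flagged, &clean));
	assert(!toy_needs_write_fault(&flagged, &dirty));
}
```

Calling toy_softdirty_selfcheck() from a small main() exercises all four cases. The two "flagged" cases are the ones relevant here: once the VMA flag is set, the model never asks for a write fault, whatever the PTE bit says.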