From: Gabriel Krisman Bertazi
To: David Hildenbrand
Cc: Muhammad Usama Anjum, Jonathan Corbet, Andy Lutomirski,
    Thomas Gleixner, Ingo Molnar, Borislav Petkov, Dave Hansen,
    "maintainer:X86 ARCHITECTURE (32-BIT AND 64-BIT)", "H. Peter Anvin",
    Arnd Bergmann, Andrew Morton, Peter Zijlstra,
    Arnaldo Carvalho de Melo, Mark Rutland, Alexander Shishkin,
    Jiri Olsa, Namhyung Kim, Shuah Khan, "open list:DOCUMENTATION",
    open list, "open list:PROC FILESYSTEM", "open list:ABI/API",
    "open list:GENERIC INCLUDE/ASM HEADER FILES",
    "open list:MEMORY MANAGEMENT",
    "open list:PERFORMANCE EVENTS SUBSYSTEM",
    "open list:KERNEL SELFTEST FRAMEWORK", kernel@collabora.com
Subject: Re: [PATCH 0/5] Add process_memwatch syscall
Date: Wed, 10 Aug 2022 13:05:13 -0400
Message-ID: <87pmh8ghbq.fsf@collabora.com>
In-Reply-To: <95ed1a81-ff8e-2c48-8838-4b3995af51b7@redhat.com>
    (David Hildenbrand's message of "Wed, 10 Aug 2022 11:03:11 +0200")
References: <20220726161854.276359-1-usama.anjum@collabora.com>
    <95ed1a81-ff8e-2c48-8838-4b3995af51b7@redhat.com>
X-Mailing-List: linux-kernel@vger.kernel.org

David Hildenbrand writes:

> On 26.07.22 18:18, Muhammad Usama Anjum wrote:
>> Hello,
>
> Hi,
>
>>
>> This patch series implements a new syscall, process_memwatch. Currently,
>> only the support to watch soft-dirty PTE bit is added. This syscall is
>> generic to watch the memory of the process.
>> There is enough room to add
>> more operations like this to watch memory in the future.
>>
>> Soft-dirty PTE bit of the memory pages can be viewed by using pagemap
>> procfs file. The soft-dirty PTE bit for the memory in a process can be
>> cleared by writing to the clear_refs file. This series adds features that
>> weren't possible through the Proc FS interface.
>> - There is no atomic get soft-dirty PTE bit status and clear operation
>>   possible.
>
> Such an interface might be easy to add, no?
>
>> - The soft-dirty PTE bit of only a part of memory cannot be cleared.
>
> Same.
>
> So I'm curious why we need a new syscall for that.

Hi David,

Yes, sure. Though it has to be through an ioctl, since we need both
input and output in the same call to keep the operation atomic.

I answered Peter Enderborg about our concerns with turning this into an
ioctl. But they are possible to overcome.

>> project. The Proc FS interface is enough for that as I think the process
>> is frozen. We have the use case where we need to track the soft-dirty
>> PTE bit for running processes. We need this tracking and clear mechanism
>> of a region of memory while the process is running to emulate the
>> getWriteWatch() syscall of Windows. This syscall is used by games to keep
>> track of dirty pages and keep processing only the dirty pages. This
>> syscall can be used by the CRIU project and other applications which
>> require soft-dirty PTE bit information.
>>
>> As in the current kernel there is no way to clear a part of memory (instead
>> of clearing the Soft-Dirty bits for the entire process) and get+clear
>> operation cannot be performed atomically, there are other methods to mimic
>> this information entirely in userspace with poor performance:
>> - The mprotect syscall and SIGSEGV handler for bookkeeping
>> - The userfaultfd syscall with the handler for bookkeeping
>
> You write "poor performance". Did you actually implement a prototype
> using userfaultfd-wp?
> Can you share numbers for comparison?

Yes, we did. I think Usama can share some numbers.

The problem with userfaultfd, as far as I understand, is that it will
require a second userspace process to be called in order to handle the
annotation that a page was touched, before remapping the page to make it
accessible to the originating process, every time a page is touched.
This context switch is prohibitively expensive for our use case, where
Windows applications might invoke it quite often. The soft-dirty bit,
by contrast, allows the page tracking to be done entirely in
kernelspace.

If I understand correctly, userfaultfd is useful for VM/container
migration, where the cost of the context switch is not a real concern,
since there are much bigger costs from the migration itself.

Maybe we're missing some feature of userfaultfd that would allow us to
avoid the cost, but from our observations we didn't find a way to
overcome it.

>> long process_memwatch(int pidfd, unsigned long start, int len,
>>                       unsigned int flags, void *vec, int vec_len);
>>
>> This syscall can be used by the CRIU project and other applications which
>> require soft-dirty PTE bit information. The following operations are
>> supported in this syscall:
>> - Get the pages that are soft-dirty.
>> - Clear the pages which are soft-dirty.
>> - The optional flag to ignore the VM_SOFTDIRTY and only track per page
>>   soft-dirty PTE bit
>
> Huh, why? VM_SOFTDIRTY is an internal implementation detail and should
> remain such.
>
> VM_SOFTDIRTY translates to "all pages in this VMA are soft-dirty".

That is something very specific about our use case, and we should
explain it a bit better. The problem is that VM_SOFTDIRTY modifications
introduce the overhead of the mm write lock acquisition, which is very
visible in our benchmarks of Windows games running over Wine.
Since the main reason for VM_SOFTDIRTY to exist, as far as we understand
it, is to track VMA remapping, and this is a use case we don't need to
worry about when implementing Windows semantics, we'd like to be able to
avoid this extra overhead, optionally, iff userspace knows it can be
done safely.

VM_SOFTDIRTY is indeed an internal interface, which is why we are
proposing to expose the feature in terms of tracking VMA reuse.

Thanks,

--
Gabriel Krisman Bertazi
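
For readers unfamiliar with the existing procfs interface discussed in
this thread, here is a minimal sketch in C of the two-step "get, then
clear" sequence. It assumes a kernel built with soft-dirty tracking and
relies only on the documented pagemap layout (bit 55 is the soft-dirty
flag) and on writing "4" to clear_refs. The helper names are made up for
illustration and are not part of the patch series.

```c
#define _XOPEN_SOURCE 700
#include <assert.h>
#include <fcntl.h>
#include <stdint.h>
#include <unistd.h>

/* Bit 55 of a 64-bit pagemap entry is the soft-dirty flag. */
#define PM_SOFT_DIRTY (1ULL << 55)

/* Each virtual page owns one 64-bit entry in /proc/<pid>/pagemap. */
off_t pagemap_offset(uintptr_t addr, unsigned long page_size)
{
    assert(page_size > 0);
    return (off_t)(addr / page_size) * (off_t)sizeof(uint64_t);
}

int pagemap_entry_is_soft_dirty(uint64_t entry)
{
    return (entry & PM_SOFT_DIRTY) != 0;
}

/* Step 1 (get): 1 if the page is soft-dirty, 0 if clean, -1 on error. */
int page_is_soft_dirty(const void *addr)
{
    uint64_t entry;
    long page_size = sysconf(_SC_PAGESIZE);
    if (page_size <= 0)
        return -1;

    int fd = open("/proc/self/pagemap", O_RDONLY);
    if (fd < 0)
        return -1;

    ssize_t n = pread(fd, &entry, sizeof(entry),
                      pagemap_offset((uintptr_t)addr,
                                     (unsigned long)page_size));
    close(fd);
    if (n != (ssize_t)sizeof(entry))
        return -1;
    return pagemap_entry_is_soft_dirty(entry);
}

/* Step 2 (clear): writing "4" clears soft-dirty for the WHOLE process. */
int clear_all_soft_dirty(void)
{
    int fd = open("/proc/self/clear_refs", O_WRONLY);
    if (fd < 0)
        return -1;

    ssize_t n = write(fd, "4", 1);
    close(fd);
    return n == 1 ? 0 : -1;
}
```

A caller would run page_is_soft_dirty() over the pages of interest and
then call clear_all_soft_dirty(). The two steps are separate system
calls, so a write that lands between them is lost, and the clear cannot
be limited to a sub-range. Those are the two gaps the quoted cover
letter lists.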