Received: by 2002:a05:6a10:c604:0:0:0:0 with SMTP id y4csp4588534pxt; Wed, 11 Aug 2021 09:17:51 -0700 (PDT) X-Google-Smtp-Source: ABdhPJyXvsXEMxiPbczH/sxioYr43stbGZG7mYod+xj9/RcMoB05DtaPj+VEgvQPM2TAsqGeg16p X-Received: by 2002:a17:906:d541:: with SMTP id cr1mr4416733ejc.81.1628698670776; Wed, 11 Aug 2021 09:17:50 -0700 (PDT) ARC-Seal: i=1; a=rsa-sha256; t=1628698670; cv=none; d=google.com; s=arc-20160816; b=OXcBTYAQS0aIMdwc+8sKlOWvoS3ttgVvzuwL0JT7bhFbcOmZKgosuGKzyW744IzhRt +wJ/GV4vmbpr4Ydmu6XJMKYyKr68jdD5ccjFX90dYfBVkFzgEj1vaZnC03NWYYWrsKPE hBr/qEYtRa2r/p4k1OwQiywu6HHeOVKekTxKyz+9NgLDcKuDcW0lDBG25mWjIlD+AgiV bOnfsb3uM8NjdGcBXRDCNtENPZaug8GdynRvZPYOIuZry0Ut7cIJhNnOxUWQeQCFTSSz s6IseIEcynj+YrlO6QvxLKX0/mbNrX9IAGXF5wSIixF8Wys9RiVF72rRnvtJyLbfcDYA Xbcg== ARC-Message-Signature: i=1; a=rsa-sha256; c=relaxed/relaxed; d=google.com; s=arc-20160816; h=list-id:precedence:content-transfer-encoding:content-language :in-reply-to:mime-version:user-agent:date:message-id:subject :organization:from:references:cc:to:dkim-signature; bh=31qE1bVII2HIf93HYG2MbHWEfiiik6GenrXLfQ/a8MA=; b=he73gYfClqpvRW7bafz6vVMrZ1OPehXQn3h19S8WWFsLiqNFQf2o5hIeeMQxUJvpA/ LTX/Vc9snpBksPZuvC+UGwDQ/sk5qi5WbblHCM4kRAgzYyvzbLE3Y9yD3+ej3A7OCpT0 PPSRkf54QzBF/peowWh6Y3WL1mfjKmVoGBij0a0Q/y2b6ao0Koypz1xdpejdysil59z+ 8dsKK1XNGs6jSzTrvUqmer/MVmwCk9ZQsvBjDrXThhFOXB+W5zCDfdFNg6/jEGTmgSo1 JUiM/jz9R9788cDOZF5AbQFl+kb+MM0TRzieyOJ9Y51BMxhE6bk3InSaQ/tVOJ6eSeCL 1t+A== ARC-Authentication-Results: i=1; mx.google.com; dkim=pass header.i=@redhat.com header.s=mimecast20190719 header.b=GRW9TSeh; spf=pass (google.com: domain of linux-kernel-owner@vger.kernel.org designates 23.128.96.18 as permitted sender) smtp.mailfrom=linux-kernel-owner@vger.kernel.org; dmarc=pass (p=NONE sp=NONE dis=NONE) header.from=redhat.com Return-Path: Received: from vger.kernel.org (vger.kernel.org. [23.128.96.18]) by mx.google.com with ESMTP id d12si4466836ejj.278.2021.08.11.09.17.25; Wed, 11 Aug 2021 09:17:50 -0700 (PDT) Received-SPF: pass (google.com: domain of linux-kernel-owner@vger.kernel.org designates 23.128.96.18 as permitted sender) client-ip=23.128.96.18; Authentication-Results: mx.google.com; dkim=pass header.i=@redhat.com header.s=mimecast20190719 header.b=GRW9TSeh; spf=pass (google.com: domain of linux-kernel-owner@vger.kernel.org designates 23.128.96.18 as permitted sender) smtp.mailfrom=linux-kernel-owner@vger.kernel.org; dmarc=pass (p=NONE sp=NONE dis=NONE) header.from=redhat.com Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S229697AbhHKQQH (ORCPT + 99 others); Wed, 11 Aug 2021 12:16:07 -0400 Received: from us-smtp-delivery-124.mimecast.com ([170.10.133.124]:44389 "EHLO us-smtp-delivery-124.mimecast.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S229508AbhHKQQG (ORCPT ); Wed, 11 Aug 2021 12:16:06 -0400 DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=redhat.com; s=mimecast20190719; t=1628698542; h=from:from:reply-to:subject:subject:date:date:message-id:message-id: to:to:cc:cc:mime-version:mime-version:content-type:content-type: content-transfer-encoding:content-transfer-encoding: in-reply-to:in-reply-to:references:references; bh=31qE1bVII2HIf93HYG2MbHWEfiiik6GenrXLfQ/a8MA=; b=GRW9TSehoBoJvDRrXN6rprznS+uivAJuEQjvTRV7DQNwCKlLmiTJqXGwsBe2eDI/uSeos/ nCLyWmGfLpVGjivIiUs2GR++eJbzrWGKSfCcbGfaYTKbx8RKrBX0WqJ2ckYpW1PMkFpITj 85oEKcc+yfWdgx3vuuqzjRGd7YgR8+Q= Received: from mail-wm1-f71.google.com (mail-wm1-f71.google.com [209.85.128.71]) (Using TLS) by relay.mimecast.com with ESMTP id us-mta-500-sfg7NxnLOYi5LBPXf_QuzQ-1; Wed, 11 Aug 2021 12:15:41 -0400 X-MC-Unique: sfg7NxnLOYi5LBPXf_QuzQ-1 Received: by mail-wm1-f71.google.com with SMTP id z15-20020a7bc7cf0000b02902e6a68ffd3cso618567wmk.2 for ; Wed, 11 Aug 2021 09:15:40 -0700 (PDT) X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=1e100.net; s=20161025; h=x-gm-message-state:to:cc:references:from:organization:subject :message-id:date:user-agent:mime-version:in-reply-to :content-language:content-transfer-encoding; bh=31qE1bVII2HIf93HYG2MbHWEfiiik6GenrXLfQ/a8MA=; b=MyXWLyIOimXyD16POTMdzjAl8XGVfYjNSTz82KMcrsBbvepwStSFMVEK6cmluVF1vF G2F4Td4gAJlzpKxMssptlsR/g7QrGzfjc4HcC5BqX7Uh6iX6ZtBdYh37vAAJNeQTGOBt lOEjG9/yE7A2iwKhbmVoH5OgLdae9SkNA1TZZpVIr5VScN1u7DHMzp1EQGZrbFCcSgTZ kGyjQsjpNVwGwALprbI2te+oNr+bVfUDSgj5LoUssD7ImLpbR3GYV12YaWKeYjKUHhaG YYc05H3sDBCsusJO6VtYs92ISj3L1uBfSfsg58FUim9JsFtcVwf5Dp7IV4SgNbwefe9A xIwA== X-Gm-Message-State: AOAM53208KoYWe1uK3W4WwZ0zDngCcW+byqLI8cARhPNrpAMR92ecwxN fcSNpD8es++LwCBufeScqXGHTercUtrPa/Yj+u4lq3U8yxkKCrxT8oA1jZGcGMJTmMmk/wwlx// IrffYKRxyYow2WU3csVZ4vTWv X-Received: by 2002:a5d:42c9:: with SMTP id t9mr3932941wrr.356.1628698539896; Wed, 11 Aug 2021 09:15:39 -0700 (PDT) X-Received: by 2002:a5d:42c9:: with SMTP id t9mr3932915wrr.356.1628698539636; Wed, 11 Aug 2021 09:15:39 -0700 (PDT) Received: from [192.168.3.132] (p5b0c64a0.dip0.t-ipconnect.de. [91.12.100.160]) by smtp.gmail.com with ESMTPSA id n8sm26832112wrx.46.2021.08.11.09.15.38 (version=TLS1_3 cipher=TLS_AES_128_GCM_SHA256 bits=128/128); Wed, 11 Aug 2021 09:15:39 -0700 (PDT) To: Peter Xu Cc: Tiberiu A Georgescu , akpm@linux-foundation.org, viro@zeniv.linux.org.uk, christian.brauner@ubuntu.com, ebiederm@xmission.com, adobriyan@gmail.com, songmuchun@bytedance.com, axboe@kernel.dk, vincenzo.frascino@arm.com, catalin.marinas@arm.com, peterz@infradead.org, chinwen.chang@mediatek.com, linmiaohe@huawei.com, jannh@google.com, apopple@nvidia.com, linux-kernel@vger.kernel.org, linux-fsdevel@vger.kernel.org, linux-mm@kvack.org, ivan.teterevkov@nutanix.com, florian.schmidt@nutanix.com, carl.waldspurger@nutanix.com, jonathan.davies@nutanix.com References: <20210730160826.63785-1-tiberiu.georgescu@nutanix.com> <839e82f7-2c54-d1ef-8371-0a332a4cb447@redhat.com> From: David Hildenbrand Organization: Red Hat Subject: Re: [PATCH 0/1] pagemap: swap location for shared pages Message-ID: <0beb1386-d670-aab1-6291-5c3cb0d661e0@redhat.com> Date: Wed, 11 Aug 2021 18:15:37 +0200 User-Agent: Mozilla/5.0 (X11; Linux x86_64; rv:78.0) Gecko/20100101 Thunderbird/78.11.0 MIME-Version: 1.0 In-Reply-To: Content-Type: text/plain; charset=utf-8; format=flowed Content-Language: en-US Content-Transfer-Encoding: 8bit Precedence: bulk List-ID: X-Mailing-List: linux-kernel@vger.kernel.org On 04.08.21 21:17, Peter Xu wrote: > On Wed, Aug 04, 2021 at 08:49:14PM +0200, David Hildenbrand wrote: >> TBH, I tend to really dislike the PTE marker idea. IMHO, we shouldn't store >> any state information regarding shared memory in per-process page tables: it >> just doesn't make too much sense. >> >> And this is similar to SOFTDIRTY or UFFD_WP bits: this information actually >> belongs to the shared file ("did *someone* write to this page", "is >> *someone* interested into changes to that page", "is there something"). I >> know, that screams for a completely different design in respect to these >> features. >> >> I guess we start learning the hard way that shared memory is just different >> and requires different interfaces than per-process page table interfaces we >> have (pagemap, userfaultfd). >> >> I didn't have time to explore any alternatives yet, but I wonder if tracking >> such stuff per an actual fd/memfd and not via process page tables is >> actually the right and clean approach. There are certainly many issues to >> solve, but conceptually to me it feels more natural to have these shared >> memory features not mangled into process page tables. > > Yes, we can explore all the possibilities, I'm totally fine with it. > > I just want to say I still don't think when there's page cache then we must put > all the page-relevant things into the page cache. [sorry for the late reply] Right, but for the case of shared, swapped out pages, the information is already there, in the page cache :) > > They're shared by processes, but process can still have its own way to describe > the relationship to that page in the cache, to me it's as simple as "we allow > process A to write to page cache P", while "we don't allow process B to write > to the same page" like the write bit. The issue I'm having uffd-wp as it was proposed for shared memory is that there is hardly a sane use case where we would *want* it to work that way. A UFFD-WP flag in a page table for shared memory means "please notify once this process modifies the shared memory (via page tables, not via any other fd modification)". Do we have an example application where these semantics makes sense and don't over-complicate the whole approach? I don't know any, thus I'm asking dumb questions :) For background snapshots in QEMU the flow would currently be like this, assuming all processes have the shared guest memory mapped. 1. Background snapshot preparation: QEMU requests all processes to uffd-wp the range a) All processes register a uffd handler on guest RAM b) All processes fault in all guest memory (essentially populating all memory): with a uffd-WP extensions we might be able to get rid of that, I remember you were working on that. c) All processes uffd-WP the range to set the bit in their page table 2. Background snapshot runs: a) A process either receives a UFFD-WP event and forwards it to QEMU or QEMU polls all other processes for UFFD events. b) QEMU writes the to-be-changed page to the migration stream. c) QEMU triggers all processes to un-protect the page and wake up any waiters. All processes clear the uffd-WP bit in their page tables. 3. Background snapshot completes: a) All processes unregister the uffd handler Now imagine something like this: 1. Background snapshot preparation: a) QEMU registers a UFFD-WP handler on a *memfd file* that corresponds to guest memory. b) QEMU uffd-wp's the whole file 2. Background snapshot runs: a) QEMU receives a UFFD-WP event. b) QEMU writes the to-be-changed page to the migration stream. c) QEMU un-protect the page and wake up any waiters. 3. Background snapshot completes: a) QEMU unregister the uffd handler Wouldn't that be much nicer and much easier to handle? Yes, it is much harder to implement because such an infrastructure does not exist yet, and it most probably wouldn't be called uffd anymore, because we are dealing with file access. But this way, it would actually be super easy to use the feature across multiple processes and eventually to even catch other file modifications. Again, I am not sure if uffd-wp or softdirty make too much sense in general when applied to shmem. But I'm happy to learn more. -- Thanks, David / dhildenb