From: Axel Rasmussen
Date: Tue, 12 Jan 2021 09:37:03 -0800
Subject: Re: [RFC PATCH 0/2] userfaultfd: handle minor faults, add UFFDIO_CONTINUE
In-Reply-To: <20210112014934.GB588752@xz-x1>
References: <20210107190453.3051110-1-axelrasmussen@google.com>
 <48f4f43f-eadd-f37d-bd8f-bddba03a7d39@oracle.com>
 <20210111230848.GA588752@xz-x1>
 <2b31c1ad-2b61-32e7-e3e5-63a3041eabfd@oracle.com>
 <20210112014934.GB588752@xz-x1>
To: Peter Xu
Cc: Mike Kravetz, Alexander Viro, Alexey Dobriyan, Andrea Arcangeli,
 Andrew Morton, Anshuman Khandual, Catalin Marinas, Chinwen Chang,
 Huang Ying, Ingo Molnar, Jann Horn, Jerome Glisse, Lokesh Gidra,
 "Matthew Wilcox (Oracle)", Michael Ellerman, Michal Koutný,
 Michel Lespinasse, Mike Rapoport, Nicholas Piggin, Shaohua Li,
 Shawn Anastasio, Steven Rostedt, Steven Price, Vlastimil Babka,
 LKML, linux-fsdevel@vger.kernel.org, Linux MM, Adam Ruprecht,
 Cannon Matthews, "Dr.
 David Alan Gilbert", David Rientjes, Oliver Upton
X-Mailing-List: linux-kernel@vger.kernel.org

On Mon, Jan 11, 2021 at 5:49 PM Peter Xu wrote:
>
> On Mon, Jan 11, 2021 at 04:13:41PM -0800, Mike Kravetz wrote:
> > On 1/11/21 3:08 PM, Peter Xu wrote:
> > > On Mon, Jan 11, 2021 at 02:42:48PM -0800, Mike Kravetz wrote:
> > >> On 1/7/21 11:04 AM, Axel Rasmussen wrote:
> > >>> Overview
> > >>> ========
> > >>>
> > >>> This series adds a new userfaultfd registration mode,
> > >>> UFFDIO_REGISTER_MODE_MINOR. This allows userspace to intercept
> > >>> "minor" faults. By "minor" fault, I mean the following situation:
> > >>>
> > >>> Let there exist two mappings (i.e., VMAs) of the same page(s)
> > >>> (shared memory). One of the mappings is registered with userfaultfd
> > >>> (in minor mode), and the other is not. Via the non-UFFD mapping,
> > >>> the underlying pages have already been allocated & filled with some
> > >>> contents. The UFFD mapping has not yet been faulted in; when it is
> > >>> touched for the first time, this results in what I'm calling a
> > >>> "minor" fault. As a concrete example, when working with hugetlbfs,
> > >>> we have huge_pte_none(), but find_lock_page() finds an existing
> > >>> page.
> > >>>
> > >>> We also add a new ioctl to resolve such faults: UFFDIO_CONTINUE.
> > >>> The idea is, userspace resolves the fault by either a) doing
> > >>> nothing if the contents are already correct, or b) updating the
> > >>> underlying contents using the second, non-UFFD mapping (via
> > >>> memcpy/memset or similar, or something fancier like RDMA, etc.).
> > >>> In either case, userspace issues UFFDIO_CONTINUE to tell the
> > >>> kernel "I have ensured the page contents are correct, carry on
> > >>> setting up the mapping".
> > >>
> > >> One quick thought.
> > >>
> > >> This is not going to work as expected with hugetlbfs pmd sharing.
> > >> If you are not familiar with hugetlbfs pmd sharing, you are not
> > >> alone. :)
> > >>
> > >> pmd sharing is enabled for the x86 and arm64 architectures. If
> > >> there are multiple shared mappings of the same underlying hugetlbfs
> > >> file or shared memory segment that are 'suitably aligned', then the
> > >> PMD pages associated with those regions are shared by all the
> > >> mappings. Suitably aligned means 'on a 1GB boundary' and 1GB in
> > >> size.
> > >>
> > >> When pmds are shared, your mappings will never see a 'minor fault'.
> > >> This is because the PMD (the page of page table entries) is shared.
> > >
> > > Thanks for raising this, Mike.
> > >
> > > I've got a few patches that plan to disable huge pmd sharing for
> > > uffd in general, e.g.:
> > >
> > > https://github.com/xzpeter/linux/commit/f9123e803d9bdd91bf6ef23b028087676bed1540
> > > https://github.com/xzpeter/linux/commit/aa9aeb5c4222a2fdb48793cdbc22902288454a31
> > >
> > > I believe we don't want pmd sharing for missing mode either, but it's
> > > just not extremely important for missing mode yet, because in missing
> > > mode we normally monitor all the processes that will be using the
> > > registered mm range. For example, in QEMU postcopy migration with
> > > vhost-user hugetlbfs files as backends, we'll monitor both the QEMU
> > > process and the DPDK program, so either of the programs will trigger
> > > a missing fault even with pmds shared between them. However, again,
> > > I think it's not ideal, since uffd (even in missing mode) is
> > > pgtable-based, so sharing could always be too tricky.
> > >
> > > The patches haven't been posted publicly yet, since they're part of
> > > uffd-wp support for hugetlbfs (along with shmem). I'm just raising
> > > this to avoid potential duplicated work before I post the patchset.
> > >
> > > (Will read into the details soon; probably too many things piled
> > > up...)
> >
> > Thanks for the heads up about this, Peter.
> >
> > I know Oracle DB really wants shared pmds -and- UFFD.
> > I need to get details of their exact usage model. I know they
> > primarily use SIGBUS, but use MISSING_HUGETLBFS as well. We may need
> > to be more selective about when to disable sharing.
>
> After a second thought, indeed it's possible to use it that way with
> pmd sharing. Actually, we don't need to generate the fault for every
> page if what we want to do is simply "initialize the pages using some
> data" on the registered ranges. That should also be the case even for
> qemu+dpdk, because if e.g. qemu faulted in a page, it'll be nicer if
> dpdk can avoid faulting it in again (so with huge pmd sharing enabled
> we can even avoid the page fault to install the pte, if the page cache
> already existed). It should be similarly beneficial if the other
> process is not faulting in but proactively filling the holes using
> UFFDIO_COPY, either for the current process or for itself; that sounds
> like a valid scenario for Google too, when a VM migrates.

Exactly right, but I'm a little unsure how to get it to work. There are
two different cases:

- Allocate + populate a page in the background (not on demand) during
  postcopy (i.e., after the VM has started executing on the migration
  target). In this case, we can be certain that the page contents are up
  to date, since execution on the source was already paused. Here PMD
  sharing would actually be nice, because it would mean the VM never
  faults on this page going forward.

- Allocate + populate a page during precopy (i.e., while the VM is still
  executing on the migration source). In this case, we *don't* want PMD
  sharing, because we need to intercept the first time this page is
  touched, verify it's up to date, and copy over the updated data if
  not.

Another related situation to consider: at some point on the target
machine, we'll receive the "dirty map" indicating which pages are out of
date.
My original thinking was, when the VM faults on any of these pages, from
this point forward we'd just look at the map and then UFFDIO_CONTINUE if
things were up to date. But you're right that a possible optimization
is, once we receive the map, to immediately "enable PMD sharing" on the
up-to-date pages, so the VM will never fault on them.

But this is all kind of speculative. I don't know of any existing API
for *userspace* to take an existing shared memory mapping without PMD
sharing and "turn on" PMD sharing for particular page(s).

For now, I'll plan on disabling PMD sharing for MINOR registered ranges.

Thanks, Peter and Mike!

> I've modified my local tree to only disable pmd sharing for uffd-wp but
> keep missing mode as-is [1]. A new helper uffd_disable_huge_pmd_share()
> is introduced in the patch "hugetlb/userfaultfd: Forbid huge pmd
> sharing when uffd enabled", so it should be easier if we would like to
> add minor mode too.
>
> Thanks!
>
> [1] https://github.com/xzpeter/linux/commits/uffd-wp-shmem-hugetlbfs
>
> --
> Peter Xu