Date: Mon, 11 Jan 2021 18:08:48 -0500
From: Peter Xu
To: Mike Kravetz
Cc: Axel Rasmussen, Alexander Viro, Alexey Dobriyan, Andrea Arcangeli,
    Andrew Morton, Anshuman Khandual, Catalin Marinas, Chinwen Chang,
    Huang Ying, Ingo Molnar, Jann Horn, Jerome Glisse, Lokesh Gidra,
    "Matthew Wilcox (Oracle)", Michael Ellerman, Michal Koutný,
    Michel Lespinasse, Mike Rapoport, Nicholas Piggin, Shaohua Li,
    Shawn Anastasio, Steven Rostedt, Steven Price, Vlastimil Babka,
    linux-kernel@vger.kernel.org, linux-fsdevel@vger.kernel.org,
    linux-mm@kvack.org, Adam Ruprecht, Cannon Matthews,
    "Dr . David Alan Gilbert", David Rientjes, Oliver Upton
Subject: Re: [RFC PATCH 0/2] userfaultfd: handle minor faults, add UFFDIO_CONTINUE
Message-ID: <20210111230848.GA588752@xz-x1>
References: <20210107190453.3051110-1-axelrasmussen@google.com> <48f4f43f-eadd-f37d-bd8f-bddba03a7d39@oracle.com>
In-Reply-To: <48f4f43f-eadd-f37d-bd8f-bddba03a7d39@oracle.com>
X-Mailing-List: linux-kernel@vger.kernel.org

On Mon, Jan 11, 2021 at 02:42:48PM -0800, Mike Kravetz wrote:
> On 1/7/21 11:04 AM, Axel Rasmussen wrote:
> > Overview
> > ========
> >
> > This series adds a new userfaultfd registration mode,
> > UFFDIO_REGISTER_MODE_MINOR. This allows userspace to intercept "minor" faults.
> > By "minor" fault, I mean the following situation:
> >
> > Let there exist two mappings (i.e., VMAs) to the same page(s) (shared memory).
> > One of the mappings is registered with userfaultfd (in minor mode), and the
> > other is not. Via the non-UFFD mapping, the underlying pages have already been
> > allocated & filled with some contents. The UFFD mapping has not yet been
> > faulted in; when it is touched for the first time, this results in what I'm
> > calling a "minor" fault. As a concrete example, when working with hugetlbfs, we
> > have huge_pte_none(), but find_lock_page() finds an existing page.
> >
> > We also add a new ioctl to resolve such faults: UFFDIO_CONTINUE. The idea is,
> > userspace resolves the fault by either a) doing nothing if the contents are
> > already correct, or b) updating the underlying contents using the second,
> > non-UFFD mapping (via memcpy/memset or similar, or something fancier like RDMA,
> > or etc...). In either case, userspace issues UFFDIO_CONTINUE to tell the kernel
> > "I have ensured the page contents are correct, carry on setting up the mapping".
> >
>
> One quick thought.
>
> This is not going to work as expected with hugetlbfs pmd sharing.
> If you are not familiar with hugetlbfs pmd sharing, you are not alone. :)
>
> pmd sharing is enabled for x86 and arm64 architectures. If there are multiple
> shared mappings of the same underlying hugetlbfs file or shared memory segment
> that are 'suitably aligned', then the PMD pages associated with those regions
> are shared by all the mappings. Suitably aligned means 'on a 1GB boundary'
> and 1GB in size.
>
> When pmds are shared, your mappings will never see a 'minor fault'. This
> is because the PMD (page table entries) is shared.

Thanks for raising this, Mike.

I have a few patches that disable huge pmd sharing for uffd in general, e.g.:

https://github.com/xzpeter/linux/commit/f9123e803d9bdd91bf6ef23b028087676bed1540
https://github.com/xzpeter/linux/commit/aa9aeb5c4222a2fdb48793cdbc22902288454a31

I believe we don't want pmd sharing for missing mode either; it just hasn't
been critical there so far, because in missing mode we normally monitor all
the processes that will be using the registered mm range. For example, in
QEMU postcopy migration with vhost-user hugetlbfs files as backends, we
monitor both the QEMU process and the DPDK program, so either program will
trigger a missing fault even if the pmd is shared between them. Still, I
think that's not ideal: uffd (even in missing mode) is pgtable-based, so
sharing will always be tricky.

The patches are not posted publicly yet since they are part of the uffd-wp
support for hugetlbfs (along with shmem). I'm raising this now to avoid
potential duplicated work before I post the patchset.

(Will read into the details soon; too many things have piled up...)

Thanks,

-- 
Peter Xu
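
The 'suitably aligned' condition quoted above (on a 1GB boundary and 1GB in
size) comes down to address arithmetic. Below is a minimal sketch in Python,
assuming the 1 GiB span mentioned in the thread. The names shareable_spans and
may_share_pmd are hypothetical, and this models only the alignment condition,
not the kernel's actual sharing check.

```python
# Span covered by one shared PMD page, taken from the thread ("1GB").
# Assumption: this is the relevant size for the huge page size in use.
GIB = 1 << 30


def shareable_spans(start, length, span=GIB):
    """Return (begin, end) ranges inside [start, start + length) that start
    on a `span` boundary and are exactly `span` bytes long."""
    end = start + length
    first = -(-start // span) * span  # round start up to the next boundary
    spans = []
    while first + span <= end:
        spans.append((first, first + span))
        first += span
    return spans


def may_share_pmd(start, length, span=GIB):
    """True if the mapping contains at least one aligned, full-size span."""
    return bool(shareable_spans(start, length, span))
```

Under this model, a 1 GiB mapping that starts 2 MiB past a 1 GiB boundary
contains no such span, so only a suitably placed mapping would be a candidate
for sharing and thus miss the minor fault.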