Received: by 2002:a05:6a10:8c0a:0:0:0:0 with SMTP id go10csp3255401pxb; Fri, 12 Feb 2021 13:23:57 -0800 (PST) X-Google-Smtp-Source: ABdhPJxL378IJABs0bCpHKd92qo6lpqUif5pl8mRZvjEJteUAI0ZhGiz2lw3HB33ImT8SAX+K7LF X-Received: by 2002:a17:906:858b:: with SMTP id v11mr2324749ejx.179.1613165037061; Fri, 12 Feb 2021 13:23:57 -0800 (PST) ARC-Seal: i=1; a=rsa-sha256; t=1613165037; cv=none; d=google.com; s=arc-20160816; b=WmFamRYPR0Yjasp8twzs9oLacxHhOY4y8SxYcRod+N8ZyAk+tz9x1krZOg7XAAQ1YO frpnijf+/tHut7gSBxAxQYeHRLOg+RMyba+C6KXzs9X5n7btkm/0eg413OdV8MOKR2jl IJ8MGgybPQnoJ7Efn+VYYQws8rfHjwDmgiXH075a2S5q+3ha3/joaFH/fLR75TSp+DYB lRVEAcln8cfIVKWgQ8U3a2XgYrmm7wAmJoAQosM9mFQb8/MrEALB20WPT74nEgOLQMj3 f7ATw+8gogQWrKKXdHErbatk/kTYnK3VNcO4HG5hAYTYbihSdh1yRL/FWcM5l7PHagR7 mPcA== ARC-Message-Signature: i=1; a=rsa-sha256; c=relaxed/relaxed; d=google.com; s=arc-20160816; h=list-id:precedence:in-reply-to:content-disposition:mime-version :references:message-id:subject:cc:to:from:date:dkim-signature; bh=ECWAuKMicsl2BzJlbKNKyjnZw633xrb05WUgP1ZpmLg=; b=z+M3rtn0pjZ1DSJU/6XDFH+7Lu/7VMa9q8zDpc65Sxk9wViWu6rewgZu8EgqJBt2z4 y4t/bQo7D+MEsv1Fmw6fbdWgqjDPQ2HNpjEA2Zc8eVcgPVGPZ71tQZg5wfIIdsAz/cMn ic7jui6lDzhuvLaZgoSg9fp46G/uZYkufOUko0ip/tUSetBBWdzG/Vqw9iLYPiVei0f0 2r241e14mfe+Lbj/scsIzIjJa5p5MoIKZVVhG+0RJXIC7u4H3QWUeSgwUfegwlX6Oypn UZMVRp0uy4OTHQIjgvWh6vKml+/TdHYqYdN8GF37LjtzELPshSHPLRlFkmR20TkSmaPH tlxA== ARC-Authentication-Results: i=1; mx.google.com; dkim=pass header.i=@redhat.com header.s=mimecast20190719 header.b=YokTitCM; spf=pass (google.com: domain of linux-kernel-owner@vger.kernel.org designates 23.128.96.18 as permitted sender) smtp.mailfrom=linux-kernel-owner@vger.kernel.org; dmarc=pass (p=NONE sp=NONE dis=NONE) header.from=redhat.com Return-Path: Received: from vger.kernel.org (vger.kernel.org. [23.128.96.18]) by mx.google.com with ESMTP id o5si6928489edz.447.2021.02.12.13.23.33; Fri, 12 Feb 2021 13:23:57 -0800 (PST) Received-SPF: pass (google.com: domain of linux-kernel-owner@vger.kernel.org designates 23.128.96.18 as permitted sender) client-ip=23.128.96.18; Authentication-Results: mx.google.com; dkim=pass header.i=@redhat.com header.s=mimecast20190719 header.b=YokTitCM; spf=pass (google.com: domain of linux-kernel-owner@vger.kernel.org designates 23.128.96.18 as permitted sender) smtp.mailfrom=linux-kernel-owner@vger.kernel.org; dmarc=pass (p=NONE sp=NONE dis=NONE) header.from=redhat.com Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S232052AbhBLVUg (ORCPT + 99 others); Fri, 12 Feb 2021 16:20:36 -0500 Received: from us-smtp-delivery-124.mimecast.com ([63.128.21.124]:24848 "EHLO us-smtp-delivery-124.mimecast.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S231532AbhBLVU3 (ORCPT ); Fri, 12 Feb 2021 16:20:29 -0500 DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=redhat.com; s=mimecast20190719; t=1613164742; h=from:from:reply-to:subject:subject:date:date:message-id:message-id: to:to:cc:cc:mime-version:mime-version:content-type:content-type: in-reply-to:in-reply-to:references:references; bh=ECWAuKMicsl2BzJlbKNKyjnZw633xrb05WUgP1ZpmLg=; b=YokTitCMxHrt4O/x6kDTcodUrQreQuscU/272JR8Cdysze2/J831fD1PUFbgNoVhHYTzEv qafT/aySznEuNfKOAW2fsG40MaMJcYG8MX8mC021wMXmBQsS7yUZ/S15T6hQEcoshaM6bW ZWAOCvBlVDhrYd60kmddHTOAbkjKp+Q= Received: from mail-qv1-f70.google.com (mail-qv1-f70.google.com [209.85.219.70]) (Using TLS) by relay.mimecast.com with ESMTP id us-mta-84-dGY14fcCN3ORlaUpi3Y3WQ-1; Fri, 12 Feb 2021 16:19:00 -0500 X-MC-Unique: dGY14fcCN3ORlaUpi3Y3WQ-1 Received: by mail-qv1-f70.google.com with SMTP id y16so482533qvs.12 for ; Fri, 12 Feb 2021 13:19:00 -0800 (PST) X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=1e100.net; s=20161025; h=x-gm-message-state:date:from:to:cc:subject:message-id:references :mime-version:content-disposition:in-reply-to; bh=ECWAuKMicsl2BzJlbKNKyjnZw633xrb05WUgP1ZpmLg=; b=O9IWKjK8buFO47A0CQs5U3tb5Db31V97Ks8oVaLFGjmVt8Zg1tIVtkBFIkBOvw077U hZF+wZEjvHzG4BJ60XAgPgM2R5FDQJ1B3kbemQPHjUogWJ16wokjwzkMNM562s1deEhp X3RR7Cb5JqkVYYz7qL38SYCtI3RWPc06cyUlkqHYepDyHYfzQyzwTPc4zJl5PmaEJpVs mjCuRatvLx1xKH3QvshiLwWoCG1VohNEa019cHQhwJ6EP4q8Cy7OJwwZFxt0oAhH8LVC XQWwQcJlhua8nByQbuF+d9LR85A4n+OadqmaUuOIpDXL3YKwo+sJ+Eh3n7B5FGvANsuO 4LLA== X-Gm-Message-State: AOAM533rGoxTBnADIndMGt4ZvNefU1M8iYj94MVbpSP5oTnmOR27L2ZE 7lMUYaoP29eh4L4LQ8oJIcGvC1zNlX15ul6b2m6RYKKoZgoIK0jSt3BDJwaRuPaq7WDulsb4rVu liMYeAYfkt5HiViggnSHtBYDY X-Received: by 2002:ac8:7293:: with SMTP id v19mr4170935qto.161.1613164739642; Fri, 12 Feb 2021 13:18:59 -0800 (PST) X-Received: by 2002:ac8:7293:: with SMTP id v19mr4170895qto.161.1613164739311; Fri, 12 Feb 2021 13:18:59 -0800 (PST) Received: from xz-x1 (bras-vprn-toroon474qw-lp130-20-174-93-89-182.dsl.bell.ca. [174.93.89.182]) by smtp.gmail.com with ESMTPSA id h11sm6971410qkj.135.2021.02.12.13.18.57 (version=TLS1_3 cipher=TLS_AES_256_GCM_SHA384 bits=256/256); Fri, 12 Feb 2021 13:18:58 -0800 (PST) Date: Fri, 12 Feb 2021 16:18:56 -0500 From: Peter Xu To: Mike Kravetz Cc: Axel Rasmussen , Alexander Viro , Alexey Dobriyan , Andrea Arcangeli , Andrew Morton , Anshuman Khandual , Catalin Marinas , Chinwen Chang , Huang Ying , Ingo Molnar , Jann Horn , Jerome Glisse , Lokesh Gidra , "Matthew Wilcox (Oracle)" , Michael Ellerman , Michal =?utf-8?Q?Koutn=C3=BD?= , Michel Lespinasse , Mike Rapoport , Nicholas Piggin , Shaohua Li , Shawn Anastasio , Steven Rostedt , Steven Price , Vlastimil Babka , linux-kernel@vger.kernel.org, linux-fsdevel@vger.kernel.org, linux-mm@kvack.org, Adam Ruprecht , Cannon Matthews , "Dr . David Alan Gilbert" , David Rientjes , Mina Almasry , Oliver Upton Subject: Re: [PATCH v5 04/10] hugetlb/userfaultfd: Unshare all pmds for hugetlbfs when register wp Message-ID: <20210212211856.GD3171@xz-x1> References: <20210210212200.1097784-1-axelrasmussen@google.com> <20210210212200.1097784-5-axelrasmussen@google.com> <517f3477-cb80-6dc9-bda0-b147dea68f95@oracle.com> MIME-Version: 1.0 Content-Type: text/plain; charset=utf-8 Content-Disposition: inline In-Reply-To: <517f3477-cb80-6dc9-bda0-b147dea68f95@oracle.com> Precedence: bulk List-ID: X-Mailing-List: linux-kernel@vger.kernel.org On Fri, Feb 12, 2021 at 10:11:39AM -0800, Mike Kravetz wrote: > On 2/10/21 1:21 PM, Axel Rasmussen wrote: > > From: Peter Xu > > > > Huge pmd sharing for hugetlbfs is racy with userfaultfd-wp because > > userfaultfd-wp is always based on pgtable entries, so they cannot be shared. > > > > Walk the hugetlb range and unshare all such mappings if there is, right before > > UFFDIO_REGISTER will succeed and return to userspace. > > > > This will pair with want_pmd_share() in hugetlb code so that huge pmd sharing > > is completely disabled for userfaultfd-wp registered range. > > > > Signed-off-by: Peter Xu > > Signed-off-by: Axel Rasmussen > > --- > > fs/userfaultfd.c | 48 ++++++++++++++++++++++++++++++++++++ > > include/linux/mmu_notifier.h | 1 + > > 2 files changed, 49 insertions(+) > > > > diff --git a/fs/userfaultfd.c b/fs/userfaultfd.c > > index 0be8cdd4425a..1f4a34b1a1e7 100644 > > --- a/fs/userfaultfd.c > > +++ b/fs/userfaultfd.c > > @@ -15,6 +15,7 @@ > > #include > > #include > > #include > > +#include > > #include > > #include > > #include > > @@ -1191,6 +1192,50 @@ static ssize_t userfaultfd_read(struct file *file, char __user *buf, > > } > > } > > > > +/* > > + * This function will unconditionally remove all the shared pmd pgtable entries > > + * within the specific vma for a hugetlbfs memory range. > > + */ > > +static void hugetlb_unshare_all_pmds(struct vm_area_struct *vma) > > +{ > > +#ifdef CONFIG_HUGETLB_PAGE > > + struct hstate *h = hstate_vma(vma); > > + unsigned long sz = huge_page_size(h); > > + struct mm_struct *mm = vma->vm_mm; > > + struct mmu_notifier_range range; > > + unsigned long address; > > + spinlock_t *ptl; > > + pte_t *ptep; > > + > > + if (!(vma->vm_flags & VM_MAYSHARE)) > > + return; > > + > > + /* > > + * No need to call adjust_range_if_pmd_sharing_possible(), because > > + * we're going to operate on the whole vma > > + */ > > This code will certainly work as intended. However, I wonder if we should > try to optimize and only flush and call huge_pmd_unshare for addresses where > sharing is possible. Consider this worst case example: > > vm_start = 8G + 2M > vm_end = 11G - 2M > The vma is 'almost' 3G in size, yet only the range 9G to 10G is possibly > shared. This routine will potentially call lock/unlock ptl and call > huge_pmd_share for every huge page in the range. Ideally, we should only > make one call to huge_pmd_share with address 9G. If the unshare is > successful or not, we are done. The subtle manipulation of &address in > huge_pmd_unshare will result in only one call if the unshare is successful, > but if unsuccessful we will unnecessarily call huge_pmd_unshare for each > address in the range. > > Maybe we start by rounding up vm_start by PUD_SIZE and rounding down > vm_end by PUD_SIZE. I didn't think that lot since it's slow path, but yeah if that's faster and without extra logic, then why not. :) > > > + mmu_notifier_range_init(&range, MMU_NOTIFY_HUGETLB_UNSHARE, > > + 0, vma, mm, vma->vm_start, vma->vm_end); > > + mmu_notifier_invalidate_range_start(&range); > > + i_mmap_lock_write(vma->vm_file->f_mapping); > > + for (address = vma->vm_start; address < vma->vm_end; address += sz) { > > Then, change the loop increment to PUD_SIZE. And, also ignore the &address > manipulation done by huge_pmd_unshare. Will do! > > > + ptep = huge_pte_offset(mm, address, sz); > > + if (!ptep) > > + continue; > > + ptl = huge_pte_lock(h, mm, ptep); > > + huge_pmd_unshare(mm, vma, &address, ptep); > > + spin_unlock(ptl); > > + } > > + flush_hugetlb_tlb_range(vma, vma->vm_start, vma->vm_end); > > + i_mmap_unlock_write(vma->vm_file->f_mapping); > > + /* > > + * No need to call mmu_notifier_invalidate_range(), see > > + * Documentation/vm/mmu_notifier.rst. > > + */ > > + mmu_notifier_invalidate_range_end(&range); > > +#endif > > +} > > + > > static void __wake_userfault(struct userfaultfd_ctx *ctx, > > struct userfaultfd_wake_range *range) > > { > > @@ -1449,6 +1494,9 @@ static int userfaultfd_register(struct userfaultfd_ctx *ctx, > > vma->vm_flags = new_flags; > > vma->vm_userfaultfd_ctx.ctx = ctx; > > > > + if (is_vm_hugetlb_page(vma) && uffd_disable_huge_pmd_share(vma)) > > + hugetlb_unshare_all_pmds(vma); > > + > > skip: > > prev = vma; > > start = vma->vm_end; > > diff --git a/include/linux/mmu_notifier.h b/include/linux/mmu_notifier.h > > index b8200782dede..ff50c8528113 100644 > > --- a/include/linux/mmu_notifier.h > > +++ b/include/linux/mmu_notifier.h > > @@ -51,6 +51,7 @@ enum mmu_notifier_event { > > MMU_NOTIFY_SOFT_DIRTY, > > MMU_NOTIFY_RELEASE, > > MMU_NOTIFY_MIGRATE, > > + MMU_NOTIFY_HUGETLB_UNSHARE, > > I don't claim to know much about mmu notifiers. Currently, we use other > event notifiers such as MMU_NOTIFY_CLEAR. I guess we do 'clear' page table > entries if we unshare. More than happy to have a MMU_NOTIFY_HUGETLB_UNSHARE > event, but will consumers of the notifications know what this new event type > means? And, if we introduce this should we use this other places where > huge_pmd_unshare is called? Yeah AFAICT that is a new feature to mmu notifiers and it's not really used a lot by consumers yet. Hmm... is there really any consumer at all? I simply grepped MMU_NOTIFY_UNMAP and see no hook took special care of that. So it's some extra information that the upper layer would like to deliever to the notifiers, it's just not vastly used so far. So far I didn't worry too much on that either. MMU_NOTIFY_HUGETLB_UNSHARE is introduced here simply because I tried to make it explicit, then it's easy to be overwritten one day if we think huge pmd unshare is not worth a standalone flag but reuse some other common one. But I think at least I owe one documentation of that new enum. :) I'm not extremely willing to touch the rest callers of huge pmd unshare yet, unless I've a solid reason. E.g., one day maybe one mmu notifier hook would start to monitor some events, then that's a better place imho to change them. Otherwise any of my future change could be vague, imho. For this patch - how about I simply go back to use MMU_NOTIFIER_CLEAR instead? Thanks, -- Peter Xu