Received: by 2002:ad5:4acb:0:0:0:0:0 with SMTP id n11csp679280imw; Fri, 15 Jul 2022 10:46:29 -0700 (PDT) X-Google-Smtp-Source: AGRyM1uWh+kyJ5oV/WfhvEM4L86f6q2L/MyQUaw7Y1Uk96h/5zVYrPqVL4AWy6RJz+ssn+95XkxT X-Received: by 2002:a17:906:a402:b0:72b:8e6e:64ea with SMTP id l2-20020a170906a40200b0072b8e6e64eamr15177809ejz.469.1657907188890; Fri, 15 Jul 2022 10:46:28 -0700 (PDT) ARC-Seal: i=1; a=rsa-sha256; t=1657907188; cv=none; d=google.com; s=arc-20160816; b=BLVKAsW1uH0gAD4HOi1HTMAvmE7M+uiwq8Wzi+Z9ashwiBDL5j5hMlnSY8zS31xneh NY7qH3xUPBZG8VUjyNabBHe3sWa2YBDdeDBxewWPqeCcpEybWLPB12AKxWFUUFGDq8xt 29qB8iYiX6mVApMLW/HXOEBfX4Hy6qB4oUA1VYzQsrNOtPrW/PrFok9qBezpYsO/apxa a5oK2PhbPfZ+74YRkvzk4lcr+k6iLmaZBWdxStFyuebhCQdF4OYOGzagbrDtqTWWzKGw RGxOPx1ODZhB/Z8hzC00srN9/YnpYhO+Vn255w7Tf9jO6+Gr4o+L5goZHNK/zpPlAGgg KXlQ== ARC-Message-Signature: i=1; a=rsa-sha256; c=relaxed/relaxed; d=google.com; s=arc-20160816; h=list-id:precedence:in-reply-to:content-disposition:mime-version :references:message-id:subject:cc:to:from:date:dkim-signature; bh=PTK4goQcbNMUtE6aJiq+7EgmXb2x5IDyZNvU93q4GVI=; b=xeqTt70BZornv8N6Xj2LniwqDOL6wgbPaRsfWxzKzXlPoc/27Yt3Ea3oLgUNknnG+B jyEr7RjqLYRFIynmn6BUcLVTLKHivSo2wRGqzVO5p8qEOPpXMiEi/J3ciXbEF9rRnM9W VTgvLmY+unxh76S/oh6uFA1h3BVkUBt4vsL9MYRXeg7woiLHCoqyNVSebdahvn6AZH1n PB0842+vo3USEWTCRymAJwdxWw9t4Mdmq8nYKnyg5yLgKZQp0e8e1XfwThn/VawsMukh kSTSjegSc65SHJc7B1ZOFjYgzt4SwP5QrsnQbYHonrmlunyD+Bu7tx+rVyvSxOiCTORK /ANQ== ARC-Authentication-Results: i=1; mx.google.com; dkim=pass header.i=@redhat.com header.s=mimecast20190719 header.b=WED9835j; spf=pass (google.com: domain of linux-kernel-owner@vger.kernel.org designates 2620:137:e000::1:20 as permitted sender) smtp.mailfrom=linux-kernel-owner@vger.kernel.org; dmarc=pass (p=NONE sp=NONE dis=NONE) header.from=redhat.com Return-Path: Received: from out1.vger.email (out1.vger.email. [2620:137:e000::1:20]) by mx.google.com with ESMTP id qw33-20020a1709066a2100b00722df7631c5si5885520ejc.749.2022.07.15.10.46.03; Fri, 15 Jul 2022 10:46:28 -0700 (PDT) Received-SPF: pass (google.com: domain of linux-kernel-owner@vger.kernel.org designates 2620:137:e000::1:20 as permitted sender) client-ip=2620:137:e000::1:20; Authentication-Results: mx.google.com; dkim=pass header.i=@redhat.com header.s=mimecast20190719 header.b=WED9835j; spf=pass (google.com: domain of linux-kernel-owner@vger.kernel.org designates 2620:137:e000::1:20 as permitted sender) smtp.mailfrom=linux-kernel-owner@vger.kernel.org; dmarc=pass (p=NONE sp=NONE dis=NONE) header.from=redhat.com Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S235598AbiGORUm (ORCPT + 99 others); Fri, 15 Jul 2022 13:20:42 -0400 Received: from lindbergh.monkeyblade.net ([23.128.96.19]:56690 "EHLO lindbergh.monkeyblade.net" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S234994AbiGORUk (ORCPT ); Fri, 15 Jul 2022 13:20:40 -0400 Received: from us-smtp-delivery-124.mimecast.com (us-smtp-delivery-124.mimecast.com [170.10.129.124]) by lindbergh.monkeyblade.net (Postfix) with ESMTP id 03CBB79682 for ; Fri, 15 Jul 2022 10:20:39 -0700 (PDT) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=redhat.com; s=mimecast20190719; t=1657905639; h=from:from:reply-to:subject:subject:date:date:message-id:message-id: to:to:cc:cc:mime-version:mime-version:content-type:content-type: in-reply-to:in-reply-to:references:references; bh=PTK4goQcbNMUtE6aJiq+7EgmXb2x5IDyZNvU93q4GVI=; b=WED9835j/4K4uHSalJd4jiuJSAzC0WhxBrvpxP4yyxaDn8gku9BfAoKkBDeNNAXR/T8tpR tOInADxIOwq5JcJSGmG7ba29j/mAIrtcUG0FEfFbJObCBXCsZCatuV+Y/5038qmvhQNflB rxb9XZOln11gs7PZSNLiPGBS1xmN1Cg= Received: from mail-qk1-f200.google.com (mail-qk1-f200.google.com [209.85.222.200]) by relay.mimecast.com with ESMTP with STARTTLS (version=TLSv1.2, cipher=TLS_ECDHE_RSA_WITH_AES_256_GCM_SHA384) id us-mta-587-8ijHpdezNc6eEvZr5NSlhg-1; Fri, 15 Jul 2022 13:20:37 -0400 X-MC-Unique: 8ijHpdezNc6eEvZr5NSlhg-1 Received: by mail-qk1-f200.google.com with SMTP id x22-20020a05620a259600b006b552a69231so3869647qko.18 for ; Fri, 15 Jul 2022 10:20:37 -0700 (PDT) X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=1e100.net; s=20210112; h=x-gm-message-state:date:from:to:cc:subject:message-id:references :mime-version:content-disposition:in-reply-to; bh=PTK4goQcbNMUtE6aJiq+7EgmXb2x5IDyZNvU93q4GVI=; b=sKDUlf5gLGZzmox7/g4rPGbqu/SEBoTjDvSQoVUcVUxLaQpwr7mQgoT9+oqhnEPMJV ILo8xYOAIipPAIHy/rD0yBVQGUndj5Yz4saH8aIGeQIbrQgiGkztndZvqYU35ql1QWqS jy2ARV8YURUwsXX8VV1Elflc6tRPU18W1E9cTmPHrGb2D0uT5W+35iz6KokOo+ayah5N 9uraXyTsj2HTyPCKFwX79yl0ZLwSx5ki9wPoX9kL64d7Zs+jvcfJxLtVSli+HVBJwIJX q5q30cAD8oob8iYTfVoc5mFB/VM04jw8gUBlIcCVxvW7Kqg2x0ZyvAJnKXD087cEMFZn CrVg== X-Gm-Message-State: AJIora/gQwZECU9qpZ2Y4vOCYRvcTQXJU11rg9rJwOkFEwU4oxHbEW3e l0vR/PyGqQFovx/Mrmppp4RZg+S5uqDhDGGoZc2vZBxYADS0STneFnm2KooW2UXhsduf6FnboWx wztus6NYBrF+oAmNro9AT+Z9w X-Received: by 2002:a05:622a:313:b0:31e:bb0b:4748 with SMTP id q19-20020a05622a031300b0031ebb0b4748mr13265064qtw.439.1657905637150; Fri, 15 Jul 2022 10:20:37 -0700 (PDT) X-Received: by 2002:a05:622a:313:b0:31e:bb0b:4748 with SMTP id q19-20020a05622a031300b0031ebb0b4748mr13265042qtw.439.1657905636896; Fri, 15 Jul 2022 10:20:36 -0700 (PDT) Received: from xz-m1.local (bras-base-aurron9127w-grc-37-74-12-30-48.dsl.bell.ca. [74.12.30.48]) by smtp.gmail.com with ESMTPSA id bp13-20020a05622a1b8d00b0031e9fa40c2esm4208168qtb.27.2022.07.15.10.20.35 (version=TLS1_3 cipher=TLS_AES_256_GCM_SHA384 bits=256/256); Fri, 15 Jul 2022 10:20:36 -0700 (PDT) Date: Fri, 15 Jul 2022 13:20:35 -0400 From: Peter Xu To: James Houghton Cc: Mike Kravetz , Muchun Song , David Hildenbrand , David Rientjes , Axel Rasmussen , Mina Almasry , Jue Wang , Manish Mishra , "Dr . David Alan Gilbert" , linux-mm@kvack.org, linux-kernel@vger.kernel.org Subject: Re: [RFC PATCH 20/26] hugetlb: add support for high-granularity UFFDIO_CONTINUE Message-ID: References: <20220624173656.2033256-1-jthoughton@google.com> <20220624173656.2033256-21-jthoughton@google.com> MIME-Version: 1.0 Content-Type: text/plain; charset=utf-8 Content-Disposition: inline In-Reply-To: X-Spam-Status: No, score=-3.4 required=5.0 tests=BAYES_00,DKIMWL_WL_HIGH, DKIM_SIGNED,DKIM_VALID,DKIM_VALID_AU,DKIM_VALID_EF,RCVD_IN_DNSWL_LOW, SPF_HELO_NONE,SPF_NONE autolearn=ham autolearn_force=no version=3.4.6 X-Spam-Checker-Version: SpamAssassin 3.4.6 (2021-04-09) on lindbergh.monkeyblade.net Precedence: bulk List-ID: X-Mailing-List: linux-kernel@vger.kernel.org On Fri, Jul 15, 2022 at 09:58:10AM -0700, James Houghton wrote: > On Fri, Jul 15, 2022 at 9:21 AM Peter Xu wrote: > > > > On Fri, Jun 24, 2022 at 05:36:50PM +0000, James Houghton wrote: > > > The changes here are very similar to the changes made to > > > hugetlb_no_page, where we do a high-granularity page table walk and > > > do accounting slightly differently because we are mapping only a piece > > > of a page. > > > > > > Signed-off-by: James Houghton > > > --- > > > fs/userfaultfd.c | 3 +++ > > > include/linux/hugetlb.h | 6 +++-- > > > mm/hugetlb.c | 54 +++++++++++++++++++++----------------- > > > mm/userfaultfd.c | 57 +++++++++++++++++++++++++++++++---------- > > > 4 files changed, 82 insertions(+), 38 deletions(-) > > > > > > diff --git a/fs/userfaultfd.c b/fs/userfaultfd.c > > > index e943370107d0..77c1b8a7d0b9 100644 > > > --- a/fs/userfaultfd.c > > > +++ b/fs/userfaultfd.c > > > @@ -245,6 +245,9 @@ static inline bool userfaultfd_huge_must_wait(struct userfaultfd_ctx *ctx, > > > if (!ptep) > > > goto out; > > > > > > + if (hugetlb_hgm_enabled(vma)) > > > + goto out; > > > + > > > > This is weird. It means we'll never wait for sub-page mapping enabled > > vmas. Why? > > > > `ret` is true in this case, so we're actually *always* waiting. Aha! Then I think that's another problem, sorry. :) See Below. > > > Not to mention hugetlb_hgm_enabled() currently is simply VM_SHARED, so it > > means we'll stop waiting for all shared hugetlbfs uffd page faults.. > > > > I'd expect in the in-house postcopy tests you should see vcpu threads > > spinning on the page faults until it's serviced. > > > > IMO we still need to properly wait when the pgtable doesn't have the > > faulted address covered. For sub-page mapping it'll probably need to walk > > into sub-page levels. > > Ok, SGTM. I'll do that for the next version. I'm not sure of the > consequences of returning `true` here when we should be returning > `false`. We've put ourselves onto the wait queue, if another concurrent UFFDIO_CONTINUE happened and pte is already installed, I think this thread could be waiting forever on the next schedule(). The solution should be the same - walking the sub-page pgtable would work, afaict. [...] > > > @@ -6239,14 +6241,16 @@ int hugetlb_mcopy_atomic_pte(struct mm_struct *dst_mm, > > > * registered, we firstly wr-protect a none pte which has no page cache > > > * page backing it, then access the page. > > > */ > > > - if (!huge_pte_none_mostly(huge_ptep_get(dst_pte))) > > > + if (!hugetlb_pte_none_mostly(dst_hpte)) > > > goto out_release_unlock; > > > > > > - if (vm_shared) { > > > - page_dup_file_rmap(page, true); > > > - } else { > > > - ClearHPageRestoreReserve(page); > > > - hugepage_add_new_anon_rmap(page, dst_vma, dst_addr); > > > + if (new_mapping) { > > > > IIUC you wanted to avoid the mapcount accountings when it's the sub-page > > that was going to be mapped. > > > > Is it a must we get this only from the caller? Can we know we're doing > > sub-page mapping already here and make a decision with e.g. dst_hpte? > > > > It looks weird to me to pass this explicitly from the caller, especially > > that's when we don't really have the pgtable lock so I'm wondering about > > possible race conditions too on having stale new_mapping values. > > The only way to know what the correct value for `new_mapping` should > be is to know if we had to change the hstate-level P*D to non-none to > service this UFFDIO_CONTINUE request. I'll see if there is a nice way > to do that check in `hugetlb_mcopy_atomic_pte`. > Right now there is no Would "new_mapping = dest_hpte->shift != huge_page_shift(hstate)" work (or something alike)? > race, because we synchronize on the per-hpage mutex. Yeah not familiar with that mutex enough to tell, as long as that mutex guarantees no pgtable update (hmm, then why we need the pgtable lock here???) then it looks fine. [...] > > > @@ -335,12 +337,16 @@ static __always_inline ssize_t __mcopy_atomic_hugetlb(struct mm_struct *dst_mm, > > > copied = 0; > > > page = NULL; > > > vma_hpagesize = vma_kernel_pagesize(dst_vma); > > > + if (use_hgm) > > > + vma_altpagesize = PAGE_SIZE; > > > > Do we need to check the "len" to know whether we should use sub-page > > mapping or original hpage size? E.g. any old UFFDIO_CONTINUE code will > > still want the old behavior I think. > > I think that's a fair point; however, if we enable HGM and the address > and len happen to be hstate-aligned The address can, but len (note! not "end" here) cannot? > , we basically do the same thing as > if HGM wasn't enabled. It could be a minor performance optimization to > do `vma_altpagesize=vma_hpagesize` in that case, but in terms of how > the page tables are set up, the end result would be the same. Thanks, -- Peter Xu