Message-ID: <5993c198-0d27-46c3-b757-3a02c2aacfc9@arm.com>
Date: Mon, 30 Oct 2023 11:43:13 +0000
Subject: Re: [PATCH v6 5/9] mm: thp: Extend THP to allocate anonymous large folios
To: John Hubbard, Andrew Morton, Matthew Wilcox, Yin Fengwei, David Hildenbrand, Yu Zhao, Catalin Marinas, Anshuman Khandual, Yang Shi, "Huang, Ying", Zi Yan, Luis Chamberlain, Itaru Kitayama, "Kirill A. Shutemov", David Rientjes, Vlastimil Babka, Hugh Dickins
Cc: linux-mm@kvack.org, linux-kernel@vger.kernel.org, linux-arm-kernel@lists.infradead.org
References: <20230929114421.3761121-1-ryan.roberts@arm.com> <20230929114421.3761121-6-ryan.roberts@arm.com> <8a72da61-b2ef-48ad-ae59-0bae7ac2ce10@nvidia.com>
From: Ryan Roberts
In-Reply-To: <8a72da61-b2ef-48ad-ae59-0bae7ac2ce10@nvidia.com>

On 28/10/2023 00:04, John Hubbard wrote:
> On 9/29/23 04:44, Ryan Roberts wrote:
>
> Hi Ryan,
>
> A few clarifying questions below.

Excellent - keep them coming!

>
> ...
>> diff --git a/include/linux/huge_mm.h b/include/linux/huge_mm.h
>> index 2e7c338229a6..c4860476a1f5 100644
>> --- a/include/linux/huge_mm.h
>> +++ b/include/linux/huge_mm.h
>> @@ -68,9 +68,11 @@ extern struct kobj_attribute shmem_enabled_attr;
>>   #define HPAGE_PMD_NR (1<<HPAGE_PMD_ORDER)
>>
>>   /*
>> - * Mask of all large folio orders supported for anonymous THP.
>> + * Mask of all large folio orders supported for anonymous THP; all orders up to
>> + * and including PMD_ORDER, except order-0 (which is not "huge") and order-1
>> + * (which is a limitation of the THP implementation).
>>    */
>> -#define THP_ORDERS_ALL_ANON    BIT(PMD_ORDER)
>> +#define THP_ORDERS_ALL_ANON    ((BIT(PMD_ORDER + 1) - 1) & ~(BIT(0) | BIT(1)))
>>
>>   /*
>>    * Mask of all large folio orders supported for file THP.
>> diff --git a/mm/memory.c b/mm/memory.c
>> index b5b82fc8e164..92ed9c782dc9 100644
>> --- a/mm/memory.c
>> +++ b/mm/memory.c
>> @@ -4059,6 +4059,87 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
>>       return ret;
>>   }
>>
>> +static bool vmf_pte_range_changed(struct vm_fault *vmf, int nr_pages)
>> +{
>> +    int i;
>> +
>> +    if (nr_pages == 1)
>> +        return vmf_pte_changed(vmf);
>> +
>> +    for (i = 0; i < nr_pages; i++) {
>> +        if (!pte_none(ptep_get_lockless(vmf->pte + i)))
>> +            return true;
>
> This seems like something different than the function name implies.
> It's really confusing: for a single page case, return true if the
> pte in the page tables has changed, yes that is very clear.
>
> But then for multiple page cases, which is really the main
> focus here--for that, claim that the range has changed if any
> pte is present (!pte_none). Can you please help me understand
> what this means?

Yes, I understand your confusion. Although I'm confident that the code is
correct, it's a bad name - I'll make the excuse that this has evolved through
rebasing to cope with additions to UFFD. Perhaps something like
vmf_is_large_folio_suitable() is a better name.

It used to be that we would only take the do_anonymous_page() path if the pte
was none; i.e. this is the first time we are faulting on an address covered by
an anon VMA and we need to allocate some memory. But more recently we also end
up here if the pte is a uffd_wp marker. So for a single pte, instead of
checking none, we can check if the pte has changed from our original check
(where we determined it was a uffd_wp marker or none). But for multiple ptes,
we don't have storage for all the original ptes from the first check.

Fortunately, if uffd is in use for a vma, then we don't want to use a large
folio anyway (this would break uffd semantics because we would no longer get
a fault for every page). So we only care about the "same but not none" case
for nr_pages=1.

Would changing the name to vmf_is_large_folio_suitable() help here?
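
To make the two cases concrete, here's a standalone userspace sketch of the
semantics I'm describing (illustrative only: pte_t is modelled as a plain
integer where 0 means "none", and the helpers are stand-ins for the real
kernel code, so don't read this as the literal implementation):

#include <stdbool.h>
#include <stdio.h>

typedef unsigned long pte_t;	/* toy model only: 0 == pte_none */

static bool pte_none(pte_t pte)
{
	return pte == 0;
}

/*
 * nr_pages == 1: the fault may have originally seen pte_none *or* a
 * uffd_wp marker, so compare against the originally-observed value.
 * nr_pages > 1: we only consider large folios when uffd is not armed,
 * so every pte in the range must still be none.
 */
static bool pte_range_changed(const pte_t *ptes, pte_t orig, int nr_pages)
{
	int i;

	if (nr_pages == 1)
		return ptes[0] != orig;

	for (i = 0; i < nr_pages; i++) {
		if (!pte_none(ptes[i]))
			return true;
	}
	return false;
}

int main(void)
{
	pte_t range[4] = { 0, 0, 42, 0 };	/* pte 2 populated by a racing fault */

	printf("changed: %d\n", pte_range_changed(range, 0, 4));	/* prints 1 */
	return 0;
}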
>
>> +    }
>> +
>> +    return false;
>> +}
>> +
>> +#ifdef CONFIG_TRANSPARENT_HUGEPAGE
>> +static struct folio *alloc_anon_folio(struct vm_fault *vmf)
>> +{
>> +    gfp_t gfp;
>> +    pte_t *pte;
>> +    unsigned long addr;
>> +    struct folio *folio;
>> +    struct vm_area_struct *vma = vmf->vma;
>> +    unsigned int orders;
>> +    int order;
>> +
>> +    /*
>> +     * If uffd is active for the vma we need per-page fault fidelity to
>> +     * maintain the uffd semantics.
>> +     */
>> +    if (userfaultfd_armed(vma))
>> +        goto fallback;
>> +
>> +    /*
>> +     * Get a list of all the (large) orders below PMD_ORDER that are enabled
>> +     * for this vma. Then filter out the orders that can't be allocated over
>> +     * the faulting address and still be fully contained in the vma.
>> +     */
>> +    orders = hugepage_vma_check(vma, vma->vm_flags, false, true, true,
>> +                    BIT(PMD_ORDER) - 1);
>> +    orders = transhuge_vma_suitable(vma, vmf->address, orders);
>> +
>> +    if (!orders)
>> +        goto fallback;
>> +
>> +    pte = pte_offset_map(vmf->pmd, vmf->address & PMD_MASK);
>> +    if (!pte)
>> +        return ERR_PTR(-EAGAIN);
>
> pte_offset_map() can only fail due to:
>
>     a) Wrong pmd type. These include:
>         pmd_none
>         pmd_bad
>         pmd migration entry
>         pmd_trans_huge
>         pmd_devmap
>
>     b) __pte_map() failure
>
> For (a), why is it that -EAGAIN is used here? I see that that
> will lead to a re-fault, I got that far, but am missing something
> still.
>
> For (b), same question, actually. I'm not completely sure why
> a retry is going to fix a __pte_map() failure?

I'm not going to claim to understand all the details of this. But this is due
to a change that Hugh introduced, and we concluded at [1] that it's always
correct to return EAGAIN here to rerun the fault. In fact, with the current
implementation pte_offset_map() should never fail for anon IIUC, but the view
was that EAGAIN makes it safe for tomorrow, and because this would only fail
due to a race, retrying is correct.

[1] https://lore.kernel.org/linux-mm/8bdfd8d8-5662-4615-86dc-d60259bd16d@google.com/

>
>> +
>> +    order = first_order(orders);
>> +    while (orders) {
>> +        addr = ALIGN_DOWN(vmf->address, PAGE_SIZE << order);
>> +        vmf->pte = pte + pte_index(addr);
>> +        if (!vmf_pte_range_changed(vmf, 1 << order))
>> +            break;
>> +        order = next_order(&orders, order);
>> +    }
>> +
>> +    vmf->pte = NULL;
>> +    pte_unmap(pte);
>> +
>> +    gfp = vma_thp_gfp_mask(vma);
>> +
>> +    while (orders) {
>> +        addr = ALIGN_DOWN(vmf->address, PAGE_SIZE << order);
>> +        folio = vma_alloc_folio(gfp, order, vma, addr, true);
>> +        if (folio) {
>> +            clear_huge_page(&folio->page, addr, 1 << order);
>> +            return folio;
>> +        }
>> +        order = next_order(&orders, order);
>> +    }
>
> And finally: is it accurate to say that there are *no* special
> page flags being set, for PTE-mapped THPs? I don't see any here,
> but want to confirm.

The flags are coming from 'gfp = vma_thp_gfp_mask(vma)', which pulls in the
correct allocation (gfp) flags based on the transparent_hugepage/defrag sysfs
file.

>
>
> thanks,
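
P.S. In case it helps to see the order-selection flow outside the kernel,
below is a standalone sketch. Assumptions: PMD_ORDER == 9 (4K base pages, 2M
PMD), and first_order()/next_order() behaving as "highest set bit" / "clear
current bit, take next highest" - that's my reading of those helpers, they
aren't quoted in this hunk. The allocation is faked so the fallback path is
visible.

#include <stdbool.h>
#include <stdio.h>

#define BIT(n)		(1UL << (n))
#define PMD_ORDER	9	/* assumption: 4K base pages, 2M PMD */

/* All orders up to and including PMD_ORDER, minus order-0 and order-1. */
#define THP_ORDERS_ALL_ANON	((BIT(PMD_ORDER + 1) - 1) & ~(BIT(0) | BIT(1)))

/* Assumed definitions - the real helpers aren't quoted in this hunk. */
static int first_order(unsigned long orders)
{
	/* Highest set bit == largest enabled order. */
	return (int)(sizeof(orders) * 8) - 1 - __builtin_clzl(orders);
}

static int next_order(unsigned long *orders, int order)
{
	*orders &= ~BIT(order);			/* give up on this order */
	return *orders ? first_order(*orders) : -1;
}

static bool fake_alloc(int order)
{
	return order <= 4;	/* pretend only <= 64K allocations succeed */
}

int main(void)
{
	/* Orders strictly below PMD_ORDER, as passed to hugepage_vma_check(). */
	unsigned long orders = THP_ORDERS_ALL_ANON & (BIT(PMD_ORDER) - 1);
	int order = first_order(orders);

	while (orders) {
		if (fake_alloc(order)) {
			printf("allocated order-%d folio\n", order);	/* order-4 */
			return 0;
		}
		order = next_order(&orders, order);	/* allocation failed: fall back */
	}
	printf("fell back to order-0\n");
	return 0;
}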