Message-ID: <841178ee-d29d-43e3-b977-5796c90bd7ad@arm.com>
Date: Thu, 7 Dec 2023 10:56:16 +0000
Subject: Re: [PATCH v8 03/10] mm: thp: Introduce multi-size THP sysfs interface
From: Ryan Roberts
To: David Hildenbrand, Andrew Morton, Matthew Wilcox, Yin Fengwei, Yu Zhao,
 Catalin Marinas, Anshuman Khandual, Yang Shi, "Huang, Ying", Zi Yan,
 Luis Chamberlain, Itaru Kitayama, "Kirill A. Shutemov", John Hubbard,
 David Rientjes, Vlastimil Babka, Hugh Dickins, Kefeng Wang,
 Barry Song <21cnbao@gmail.com>, Alistair Popple
Cc: linux-mm@kvack.org, linux-arm-kernel@lists.infradead.org,
 linux-kernel@vger.kernel.org
References: <20231204102027.57185-1-ryan.roberts@arm.com>
 <20231204102027.57185-4-ryan.roberts@arm.com>
 <004ed23d-5571-4474-b7fe-7bc08817c165@redhat.com>

On 06/12/2023 13:18, Ryan Roberts wrote:
> On 05/12/2023 16:57, David Hildenbrand wrote:
>> On 04.12.23 11:20, Ryan Roberts wrote:
>>> In preparation for adding support for anonymous multi-size THP,
>>> introduce new sysfs structure that will be used to control the new
>>> behaviours.
>>> A new directory is added under transparent_hugepage for each
>>> supported THP size, and contains an `enabled` file, which can be set to
>>> "inherit" (to inherit the global setting), "always", "madvise" or
>>> "never". For now, the kernel still only supports PMD-sized anonymous
>>> THP, so only 1 directory is populated.
>>>
>>> The first half of the change converts transhuge_vma_suitable() and
>>> hugepage_vma_check() so that they take a bitfield of orders for which
>>> the user wants to determine support, and the functions filter out all
>>> the orders that can't be supported, given the current sysfs
>>> configuration and the VMA dimensions. If there is only 1 order set in
>>> the input then the output can continue to be treated like a boolean;
>>> this is the case for most call sites. The resulting functions are
>>> renamed to thp_vma_suitable_orders() and thp_vma_allowable_orders()
>>> respectively.
>>>
>>> The second half of the change implements the new sysfs interface. It has
>>> been done so that each supported THP size has a `struct thpsize`, which
>>> describes the relevant metadata and is itself a kobject. This is pretty
>>> minimal for now, but should make it easy to add new per-thpsize files to
>>> the interface if needed in future (e.g. per-size defrag). Rather than
>>> keep the `enabled` state directly in the struct thpsize, I've elected to
>>> directly encode it into huge_anon_orders_[always|madvise|inherit]
>>> bitfields since this reduces the amount of work required in
>>> thp_vma_allowable_orders() which is called for every page fault.
>>>
>>> See Documentation/admin-guide/mm/transhuge.rst, as modified by this
>>> commit, for details of how the new sysfs interface works.
>>>
>>> Signed-off-by: Ryan Roberts
>>> ---
>>
>> Some comments mostly regarding thp_vma_allowable_orders and friends. In general,
>> LGTM. I'll have to go over the order logic once again, I got a bit lost once we
>> started mixing anon and file orders.
>>
>> [...]
>>
>> Doc updates all looked good to me, skimming over them.
>>
>>> diff --git a/include/linux/huge_mm.h b/include/linux/huge_mm.h
>>> index fa0350b0812a..bd0eadd3befb 100644
>>> --- a/include/linux/huge_mm.h
>>> +++ b/include/linux/huge_mm.h
>>
>> [...]
>>
>>> +static inline unsigned long thp_vma_suitable_orders(struct vm_area_struct *vma,
>>> +        unsigned long addr, unsigned long orders)
>>> +{
>>> +    int order;
>>> +
>>> +    /*
>>> +     * Iterate over orders, highest to lowest, removing orders that don't
>>> +     * meet alignment requirements from the set. Exit loop at first order
>>> +     * that meets requirements, since all lower orders must also meet
>>> +     * requirements.
>>> +     */
>>> +
>>> +    order = first_order(orders);
>>
>> nit: "highest_order" or "largest_order" would be more expressive regarding the
>> actual semantics.
>
> Yep, will call it "highest_order".
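For the record, the renamed helper pair I have in mind is roughly the below; a
minimal sketch only (fls_long()-based, exact placement in huge_mm.h still TBC):

/* Return the highest order set in the bitfield, or -1 if the set is empty. */
static inline int highest_order(unsigned long orders)
{
    return fls_long(orders) - 1;
}

/* Remove @prev from the set and return the next highest remaining order. */
static inline int next_order(unsigned long *orders, int prev)
{
    *orders &= ~BIT(prev);
    return highest_order(*orders);
}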
>
>>
>>> +
>>> +    while (orders) {
>>> +        unsigned long hpage_size = PAGE_SIZE << order;
>>> +        unsigned long haddr = ALIGN_DOWN(addr, hpage_size);
>>> +
>>> +        if (haddr >= vma->vm_start &&
>>> +            haddr + hpage_size <= vma->vm_end) {
>>> +            if (!vma_is_anonymous(vma)) {
>>> +                if (IS_ALIGNED((vma->vm_start >> PAGE_SHIFT) -
>>> +                        vma->vm_pgoff,
>>> +                        hpage_size >> PAGE_SHIFT))
>>> +                    break;
>>> +            } else
>>> +                break;
>>
>> Comment: Coding style wants you to use if () {} else {}
>>
>> But I'd recommend for the conditions:
>>
>> if (haddr < vma->vm_start ||
>>     haddr + hpage_size > vma->vm_end)
>>     continue;
>> /* Don't have to check pgoff for anonymous vma */
>> if (!vma_is_anonymous(vma))
>>     break;
>> if (IS_ALIGNED((...
>>     break;
>
> OK I'll take this structure.

FYI I ended up NOT taking this, because I now have thp_vma_suitable_order() and
thp_vma_suitable_orders(). The former is essentially what was there before
(except I pass the order and derive the nr_pages, alignment, etc. from that
rather than using the HPAGE_* macros). The latter is just a loop that calls the
former for each order in the bitfield.

>
>>
>> [...]
>>
>>
>>> +/**
>>> + * thp_vma_allowable_orders - determine hugepage orders that are allowed for vma
>>> + * @vma:  the vm area to check
>>> + * @vm_flags: use these vm_flags instead of vma->vm_flags
>>> + * @smaps: whether answer will be used for smaps file
>>> + * @in_pf: whether answer will be used by page fault handler
>>> + * @enforce_sysfs: whether sysfs config should be taken into account
>>> + * @orders: bitfield of all orders to consider
>>> + *
>>> + * Calculates the intersection of the requested hugepage orders and the allowed
>>> + * hugepage orders for the provided vma. Permitted orders are encoded as a set
>>> + * bit at the corresponding bit position (bit-2 corresponds to order-2, bit-3
>>> + * corresponds to order-3, etc). Order-0 is never considered a hugepage order.
>>> + *
>>> + * Return: bitfield of orders allowed for hugepage in the vma. 0 if no hugepage
>>> + * orders are allowed.
>>> + */
>>> +unsigned long thp_vma_allowable_orders(struct vm_area_struct *vma,
>>> +                       unsigned long vm_flags, bool smaps,
>>> +                       bool in_pf, bool enforce_sysfs,
>>> +                       unsigned long orders)
>>> +{
>>> +    /* Check the intersection of requested and supported orders. */
>>> +    orders &= vma_is_anonymous(vma) ?
>>> +            THP_ORDERS_ALL_ANON : THP_ORDERS_ALL_FILE;
>>> +    if (!orders)
>>> +        return 0;
>>
>> Comment: if this is called from some hot path, we might want to move as much as
>> possible into a header, so we can avoid this function call here when e.g., THP
>> are completely disabled etc.
>
> If THP is completely disabled (compiled out) then thp_vma_allowable_orders() is
> defined as a header inline that returns 0. I'm not sure there are any paths in
> practice which are likely to ask for a set of orders which are never supported
> (i.e. where this specific check would return 0). And the "are they run time
> enabled" check is further down and fairly involved, so not sure that's ideal for
> an inline.
>
> I haven't changed the pattern from how it was previously, so I don't think it
> should be any more expensive. Which parts exactly do you want to move to a
> header?

As per my response against the next patch (#4), I have now implemented
thp_vma_allowable_orders() so that the order check is in the header with an
early exit if THP is completely disabled.
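Roughly along these lines; a sketch only, and the __thp_vma_allowable_orders()
name for the out-of-line remainder is just illustrative:

/* huge_mm.h: do the cheap filtering inline so callers can early-exit. */
static inline unsigned long thp_vma_allowable_orders(struct vm_area_struct *vma,
                       unsigned long vm_flags, bool smaps,
                       bool in_pf, bool enforce_sysfs,
                       unsigned long orders)
{
    /* Intersect the requested orders with what this vma type supports. */
    orders &= vma_is_anonymous(vma) ?
            THP_ORDERS_ALL_ANON : THP_ORDERS_ALL_FILE;
    if (!orders)
        return 0;

    /* Only pay for the function call if some order is still in play. */
    return __thp_vma_allowable_orders(vma, vm_flags, smaps, in_pf,
                      enforce_sysfs, orders);
}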
>
>
>>
>>> +
>>>       if (!vma->vm_mm)        /* vdso */
>>> -        return false;
>>> +        return 0;
>>>
>>>       /*
>>>        * Explicitly disabled through madvise or prctl, or some
>>> @@ -88,16 +141,16 @@ bool hugepage_vma_check(struct vm_area_struct *vma,
>>> unsigned long vm_flags,
>>>        * */
>>>       if ((vm_flags & VM_NOHUGEPAGE) ||
>>>           test_bit(MMF_DISABLE_THP, &vma->vm_mm->flags))
>>> -        return false;
>>> +        return 0;
>>>       /*
>>>        * If the hardware/firmware marked hugepage support disabled.
>>>        */
>>>       if (transparent_hugepage_flags & (1 << TRANSPARENT_HUGEPAGE_UNSUPPORTED))
>>> -        return false;
>>> +        return 0;
>>>
>>>       /* khugepaged doesn't collapse DAX vma, but page fault is fine. */
>>>       if (vma_is_dax(vma))
>>> -        return in_pf;
>>> +        return in_pf ? orders : 0;
>>>
>>>       /*
>>>        * khugepaged special VMA and hugetlb VMA.
>>> @@ -105,17 +158,29 @@ bool hugepage_vma_check(struct vm_area_struct *vma,
>>> unsigned long vm_flags,
>>>        * VM_MIXEDMAP set.
>>>        */
>>>       if (!in_pf && !smaps && (vm_flags & VM_NO_KHUGEPAGED))
>>> -        return false;
>>> +        return 0;
>>>
>>>       /*
>>> -     * Check alignment for file vma and size for both file and anon vma.
>>> +     * Check alignment for file vma and size for both file and anon vma by
>>> +     * filtering out the unsuitable orders.
>>>        *
>>>        * Skip the check for page fault. Huge fault does the check in fault
>>> -     * handlers. And this check is not suitable for huge PUD fault.
>>> +     * handlers.
>>>        */
>>> -    if (!in_pf &&
>>> -        !transhuge_vma_suitable(vma, (vma->vm_end - HPAGE_PMD_SIZE)))
>>> -        return false;
>>> +    if (!in_pf) {
>>> +        int order = first_order(orders);
>>> +        unsigned long addr;
>>> +
>>> +        while (orders) {
>>> +            addr = vma->vm_end - (PAGE_SIZE << order);
>>> +            if (thp_vma_suitable_orders(vma, addr, BIT(order)))
>>> +                break;
>>
>> Comment: you'd want a "thp_vma_suitable_order" helper here. But maybe the
>> compiler is smart enough to optimize the loop and everything else out.
>
> I'm happy to refactor so that thp_vma_suitable_order() is the basic primitive,
> then make thp_vma_suitable_orders() a loop that calls thp_vma_suitable_order()
> (that's basically how it is laid out already, just all in one function). Is that
> what you are requesting?
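For completeness, the loop wrapper built on that primitive ends up looking
roughly like this (sketch only; it assumes the highest_order()/next_order()
helpers sketched earlier and a thp_vma_suitable_order(vma, addr, order)
primitive with the existing per-order logic):

static inline unsigned long thp_vma_suitable_orders(struct vm_area_struct *vma,
        unsigned long addr, unsigned long orders)
{
    int order = highest_order(orders);

    /*
     * Drop orders, highest to lowest, until one fits the vma; all lower
     * orders then fit too, so whatever remains in the set is suitable.
     */
    while (orders) {
        if (thp_vma_suitable_order(vma, addr, order))
            break;
        order = next_order(&orders, order);
    }

    return orders;
}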
>
>>
>> [...]
>>
>>> +
>>> +static ssize_t thpsize_enabled_store(struct kobject *kobj,
>>> +                     struct kobj_attribute *attr,
>>> +                     const char *buf, size_t count)
>>> +{
>>> +    int order = to_thpsize(kobj)->order;
>>> +    ssize_t ret = count;
>>> +
>>> +    if (sysfs_streq(buf, "always")) {
>>> +        set_bit(order, &huge_anon_orders_always);
>>> +        clear_bit(order, &huge_anon_orders_inherit);
>>> +        clear_bit(order, &huge_anon_orders_madvise);
>>> +    } else if (sysfs_streq(buf, "inherit")) {
>>> +        set_bit(order, &huge_anon_orders_inherit);
>>> +        clear_bit(order, &huge_anon_orders_always);
>>> +        clear_bit(order, &huge_anon_orders_madvise);
>>> +    } else if (sysfs_streq(buf, "madvise")) {
>>> +        set_bit(order, &huge_anon_orders_madvise);
>>> +        clear_bit(order, &huge_anon_orders_always);
>>> +        clear_bit(order, &huge_anon_orders_inherit);
>>> +    } else if (sysfs_streq(buf, "never")) {
>>> +        clear_bit(order, &huge_anon_orders_always);
>>> +        clear_bit(order, &huge_anon_orders_inherit);
>>> +        clear_bit(order, &huge_anon_orders_madvise);
>>
>> Note: I was wondering for a second if some concurrent cases could lead to an
>> inconsistent state. I think in the worst case we'll simply end up with "never"
>> on races.
>
> You mean if different threads try to write different values to this file
> concurrently? Or if there is a concurrent fault that tries to read the flags
> while they are being modified?
>
> I thought about this for a long time too and wasn't sure what was best. The
> existing global enabled store impl clears the bits first then sets the bit. With
> this approach you can end up with multiple bits set if there is a race to set
> different values, and you can end up with a faulting thread seeing "never" if it
> reads the bits after they have been cleared but before setting them.
>
> I decided to set the new bit before clearing the old bits, which is different; a
> racing fault will never see "never" but, as you say, a race to set the file
> could result in "never" being set.
>
> On reflection, it's probably best to set the bit *last* like the global control
> does?
>
>>
>> [...]
>>
>>>   static int __init hugepage_init_sysfs(struct kobject **hugepage_kobj)
>>>   {
>>>       int err;
>>> +    struct thpsize *thpsize;
>>> +    unsigned long orders;
>>> +    int order;
>>> +
>>> +    /*
>>> +     * Default to setting PMD-sized THP to inherit the global setting and
>>> +     * disable all other sizes. powerpc's PMD_ORDER isn't a compile-time
>>> +     * constant so we have to do this here.
>>> +     */
>>> +    huge_anon_orders_inherit = BIT(PMD_ORDER);
>>>
>>>       *hugepage_kobj = kobject_create_and_add("transparent_hugepage", mm_kobj);
>>>       if (unlikely(!*hugepage_kobj)) {
>>> @@ -434,8 +631,24 @@ static int __init hugepage_init_sysfs(struct kobject
>>> **hugepage_kobj)
>>>           goto remove_hp_group;
>>>       }
>>>
>>> +    orders = THP_ORDERS_ALL_ANON;
>>> +    order = first_order(orders);
>>> +    while (orders) {
>>> +        thpsize = thpsize_create(order, *hugepage_kobj);
>>> +        if (IS_ERR(thpsize)) {
>>> +            pr_err("failed to create thpsize for order %d\n", order);
>>> +            err = PTR_ERR(thpsize);
>>> +            goto remove_all;
>>> +        }
>>> +        list_add(&thpsize->node, &thpsize_list);
>>> +        order = next_order(&orders, order);
>>> +    }
>>> +
>>>       return 0;
>>>
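For context on the thpsize objects used above: the struct behind to_thpsize()
and thpsize_create() is intentionally minimal, roughly this (sketch only):

struct thpsize {
    struct kobject kobj;      /* backs the per-size sysfs directory */
    struct list_head node;    /* entry in the global thpsize_list */
    int order;                /* hugepage order this directory controls */
};

#define to_thpsize(kobj) container_of(kobj, struct thpsize, kobj)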
>>
>> [...]
>>
>>>       page = compound_head(page);
>>> @@ -5116,7 +5116,8 @@ static vm_fault_t __handle_mm_fault(struct
>>> vm_area_struct *vma,
>>>           return VM_FAULT_OOM;
>>>   retry_pud:
>>>       if (pud_none(*vmf.pud) &&
>>> -        hugepage_vma_check(vma, vm_flags, false, true, true)) {
>>> +        thp_vma_allowable_orders(vma, vm_flags, false, true, true,
>>> +                     BIT(PUD_ORDER))) {
>>>           ret = create_huge_pud(&vmf);
>>>           if (!(ret & VM_FAULT_FALLBACK))
>>>               return ret;
>>> @@ -5150,7 +5151,8 @@ static vm_fault_t __handle_mm_fault(struct
>>> vm_area_struct *vma,
>>>           goto retry_pud;
>>>
>>>       if (pmd_none(*vmf.pmd) &&
>>> -        hugepage_vma_check(vma, vm_flags, false, true, true)) {
>>> +        thp_vma_allowable_orders(vma, vm_flags, false, true, true,
>>> +                     BIT(PMD_ORDER))) {
>>
>> Comment: A helper like "thp_vma_allowable_order(vma, PMD_ORDER)" might make this
>> easier to read -- and the implementation will be faster.
>
> I'm happy to do this and use it to improve readability:
>
> #define thp_vma_allowable_order(..., order) \
>     thp_vma_allowable_orders(..., BIT(order))
>
> This wouldn't make the implementation any faster though; are you suggesting a
> completely separate impl? Even then, I don't think there is much scope to make
> it faster for the case where there is only 1 order in the bitfield.
>
>>
>>>           ret = create_huge_pmd(&vmf);
>>>           if (!(ret & VM_FAULT_FALLBACK))
>>>               return ret;
>>> diff --git a/mm/page_vma_mapped.c b/mm/page_vma_mapped.c
>>> index e0b368e545ed..64da127cc267 100644
>>> --- a/mm/page_vma_mapped.c
>>> +++ b/mm/page_vma_mapped.c
>>> @@ -268,7 +268,8 @@ bool page_vma_mapped_walk(struct page_vma_mapped_walk *pvmw)
>>>                * cleared *pmd but not decremented compound_mapcount().
>>>                */
>>>               if ((pvmw->flags & PVMW_SYNC) &&
>>> -                transhuge_vma_suitable(vma, pvmw->address) &&
>>> +                thp_vma_suitable_orders(vma, pvmw->address,
>>> +                            BIT(PMD_ORDER)) &&
>>
>> Comment: Similarly, a helper like "thp_vma_suitable_order(vma, PMD_ORDER)" might
>> make this easier to read.
>
> Yep, will do this.
>
>>
>>>                   (pvmw->nr_pages >= HPAGE_PMD_NR)) {
>>>                   spinlock_t *ptl = pmd_lock(mm, pvmw->pmd);
>>>
>>
>
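To close out the two readability requests: spelled out with the current
thp_vma_allowable_orders() parameter list, the single-order convenience would be
roughly (sketch only):

#define thp_vma_allowable_order(vma, vm_flags, smaps, in_pf, enforce_sysfs, order) \
    thp_vma_allowable_orders(vma, vm_flags, smaps, in_pf, enforce_sysfs, BIT(order))

so the fault-handler sites above read e.g.
thp_vma_allowable_order(vma, vm_flags, false, true, true, PMD_ORDER), and the
page_vma_mapped.c site can use thp_vma_suitable_order(vma, pvmw->address,
PMD_ORDER) directly.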