Subject: Re: [PATCH v8 03/10] mm: thp: Introduce multi-size THP sysfs interface
From: Ryan Roberts
Date: Thu, 7 Dec 2023 11:44:03 +0000
To: David Hildenbrand, Andrew Morton, Matthew Wilcox, Yin Fengwei, Yu Zhao,
 Catalin Marinas, Anshuman Khandual, Yang Shi, "Huang, Ying", Zi Yan,
 Luis Chamberlain, Itaru Kitayama, "Kirill A. Shutemov", John Hubbard,
 David Rientjes, Vlastimil Babka, Hugh Dickins, Kefeng Wang,
 Barry Song <21cnbao@gmail.com>, Alistair Popple
Cc: linux-mm@kvack.org, linux-arm-kernel@lists.infradead.org,
 linux-kernel@vger.kernel.org
Message-ID: <856f4317-43d9-41bb-8096-1eef69c86d3b@arm.com>
In-Reply-To: <44fdd46b-ad46-4ae2-a20f-20374acdf464@redhat.com>
References: <20231204102027.57185-1-ryan.roberts@arm.com>
 <20231204102027.57185-4-ryan.roberts@arm.com>
 <004ed23d-5571-4474-b7fe-7bc08817c165@redhat.com>
 <378afc6b-f93a-48ad-8aa6-ab171f3b9613@redhat.com>
 <4aa520f0-7c84-4e93-88bf-aee6d8d3ea70@arm.com>
 <44fdd46b-ad46-4ae2-a20f-20374acdf464@redhat.com>

On 07/12/2023 11:25, David Hildenbrand wrote:
> On 07.12.23 12:22, Ryan Roberts wrote:
>> On 07/12/2023 11:13, David Hildenbrand wrote:
>>>>>
>>>>>> +
>>>>>>         if (!vma->vm_mm)        /* vdso */
>>>>>> -        return false;
>>>>>> +        return 0;
>>>>>>
>>>>>>         /*
>>>>>>          * Explicitly disabled through madvise or prctl, or some
>>>>>> @@ -88,16 +141,16 @@ bool hugepage_vma_check(struct vm_area_struct *vma,
>>>>>> unsigned long vm_flags,
>>>>>>          * */
>>>>>>         if ((vm_flags & VM_NOHUGEPAGE) ||
>>>>>>             test_bit(MMF_DISABLE_THP, &vma->vm_mm->flags))
>>>>>> -        return false;
>>>>>> +        return 0;
>>>>>>
>>>>>>         /*
>>>>>>          * If the hardware/firmware marked hugepage support disabled.
>>>>>>          */
>>>>>>         if (transparent_hugepage_flags & (1 <<
>>>>>> TRANSPARENT_HUGEPAGE_UNSUPPORTED))
>>>>>> -        return false;
>>>>>> +        return 0;
>>>>>>
>>>>>>         /* khugepaged doesn't collapse DAX vma, but page fault is fine. */
>>>>>>         if (vma_is_dax(vma))
>>>>>> -        return in_pf;
>>>>>> +        return in_pf ? orders : 0;
>>>>>>
>>>>>>         /*
>>>>>>          * khugepaged special VMA and hugetlb VMA.
>>>>>> @@ -105,17 +158,29 @@ bool hugepage_vma_check(struct vm_area_struct *vma,
>>>>>> unsigned long vm_flags,
>>>>>>          * VM_MIXEDMAP set.
>>>>>>          */
>>>>>>         if (!in_pf && !smaps && (vm_flags & VM_NO_KHUGEPAGED))
>>>>>> -        return false;
>>>>>> +        return 0;
>>>>>>
>>>>>>         /*
>>>>>> -     * Check alignment for file vma and size for both file and anon vma.
>>>>>> +     * Check alignment for file vma and size for both file and anon vma by
>>>>>> +     * filtering out the unsuitable orders.
>>>>>>          *
>>>>>>          * Skip the check for page fault. Huge fault does the check in fault
>>>>>> -     * handlers.
>>>>>>          */
>>>>>> -    if (!in_pf &&
>>>>>> -        !transhuge_vma_suitable(vma, (vma->vm_end - HPAGE_PMD_SIZE)))
>>>>>> -        return false;
>>>>>> +    if (!in_pf) {
>>>>>> +        int order = first_order(orders);
>>>>>> +        unsigned long addr;
>>>>>> +
>>>>>> +        while (orders) {
>>>>>> +            addr = vma->vm_end - (PAGE_SIZE << order);
>>>>>> +            if (thp_vma_suitable_orders(vma, addr, BIT(order)))
>>>>>> +                break;
>>>>>
>>>>> Comment: you'd want a "thp_vma_suitable_order" helper here. But maybe the
>>>>> compiler is smart enough to optimize the loop and everything else out.
>>>>
>>>> I'm happy to refactor so that thp_vma_suitable_order() is the basic primitive,
>>>> then make thp_vma_suitable_orders() a loop that calls thp_vma_suitable_order()
>>>> (that's basically how it is laid out already, just all in one function). Is
>>>> that what you are requesting?
>>>
>>> You got the spirit, yes.
>>>
>>>>>
>>>>> [...]
>>>>>
>>>>>> +
>>>>>> +static ssize_t thpsize_enabled_store(struct kobject *kobj,
>>>>>> +                     struct kobj_attribute *attr,
>>>>>> +                     const char *buf, size_t count)
>>>>>> +{
>>>>>> +    int order = to_thpsize(kobj)->order;
>>>>>> +    ssize_t ret = count;
>>>>>> +
>>>>>> +    if (sysfs_streq(buf, "always")) {
>>>>>> +        set_bit(order, &huge_anon_orders_always);
>>>>>> +        clear_bit(order, &huge_anon_orders_inherit);
>>>>>> +        clear_bit(order, &huge_anon_orders_madvise);
>>>>>> +    } else if (sysfs_streq(buf, "inherit")) {
>>>>>> +        set_bit(order, &huge_anon_orders_inherit);
>>>>>> +        clear_bit(order, &huge_anon_orders_always);
>>>>>> +        clear_bit(order, &huge_anon_orders_madvise);
>>>>>> +    } else if (sysfs_streq(buf, "madvise")) {
>>>>>> +        set_bit(order, &huge_anon_orders_madvise);
>>>>>> +        clear_bit(order, &huge_anon_orders_always);
>>>>>> +        clear_bit(order, &huge_anon_orders_inherit);
>>>>>> +    } else if (sysfs_streq(buf, "never")) {
>>>>>> +        clear_bit(order, &huge_anon_orders_always);
>>>>>> +        clear_bit(order, &huge_anon_orders_inherit);
>>>>>> +        clear_bit(order, &huge_anon_orders_madvise);
>>>>>
>>>>> Note: I was wondering for a second if some concurrent cases could lead
>>>>> to an inconsistent state. I think in the worst case we'll simply end up
>>>>> with "never" on races.
>>>>
>>>> You mean if different threads try to write different values to this file
>>>> concurrently? Or if there is a concurrent fault that tries to read the
>>>> flags while they are being modified?
>>>
>>> I thought about what you said first, but what you said last might also
>>> apply. As long as "nothing breaks", all good.
>>>
>>>>
>>>> I thought about this for a long time too and wasn't sure what was best.
>>>> The existing global enabled store impl clears the bits first then sets
>>>> the bit. With this approach you can end up with multiple bits set if
>>>> there is a race to set different values, and you can end up with a
>>>> faulting thread seeing "never" if it reads the bits after they have been
>>>> cleared but before setting them.
>>>
>>> Right, but user space is playing stupid games and can win stupid prizes.
>>> As long as nothing breaks, we're good.
>>>
>>>>
>>>> I decided to set the new bit before clearing the old bits, which is
>>>> different; a racing fault will never see "never" but, as you say, a race
>>>> to set the file could result in "never" being set.
>>>>
>>>> On reflection, it's probably best to set the bit *last* like the global
>>>> control does?
>>>
>>> Probably. Might just slap a simple spinlock in there, so at least the
>>> writer side is completely serialized. Then you can just set the bit last.
>>> It's unlikely that readers will actually run into issues, and if they
>>> ever would, we could use some rcu magic to let them read a consistent
>>> state.
>>
>> I'd prefer to leave it as it is now; clear first, set last, without any
>> explicit serialization. I've convinced myself that nothing breaks and it's
>> the same pattern used by the global control, so it's consistent. Unless
>> you're insisting on the spin lock?
>
> No, not at all. But it would certainly remove any possible concerns :)

OK fine, you win :). I'll add a spin lock on the writer side.
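
Something like the below, perhaps (rough, untested sketch of the shape I have
in mind; the -EINVAL fallback stands in for the unquoted tail of the function
above):

/* Serialize writers to the huge_anon_orders_* bitfields. */
static DEFINE_SPINLOCK(huge_anon_orders_lock);

static ssize_t thpsize_enabled_store(struct kobject *kobj,
                                     struct kobj_attribute *attr,
                                     const char *buf, size_t count)
{
        int order = to_thpsize(kobj)->order;
        ssize_t ret = count;

        if (sysfs_streq(buf, "always")) {
                spin_lock(&huge_anon_orders_lock);
                /* Clear the other policies first; set the new bit last. */
                clear_bit(order, &huge_anon_orders_inherit);
                clear_bit(order, &huge_anon_orders_madvise);
                set_bit(order, &huge_anon_orders_always);
                spin_unlock(&huge_anon_orders_lock);
        } else if (sysfs_streq(buf, "inherit")) {
                spin_lock(&huge_anon_orders_lock);
                clear_bit(order, &huge_anon_orders_always);
                clear_bit(order, &huge_anon_orders_madvise);
                set_bit(order, &huge_anon_orders_inherit);
                spin_unlock(&huge_anon_orders_lock);
        } else if (sysfs_streq(buf, "madvise")) {
                spin_lock(&huge_anon_orders_lock);
                clear_bit(order, &huge_anon_orders_always);
                clear_bit(order, &huge_anon_orders_inherit);
                set_bit(order, &huge_anon_orders_madvise);
                spin_unlock(&huge_anon_orders_lock);
        } else if (sysfs_streq(buf, "never")) {
                spin_lock(&huge_anon_orders_lock);
                clear_bit(order, &huge_anon_orders_always);
                clear_bit(order, &huge_anon_orders_inherit);
                clear_bit(order, &huge_anon_orders_madvise);
                spin_unlock(&huge_anon_orders_lock);
        } else
                ret = -EINVAL;

        return ret;
}

Readers stay lockless; the lock just stops two writers interleaving and
leaving a persistent mixed state. A racing reader can still transiently see
"never" mid-update, which we agreed above is benign.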
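
And for the thp_vma_suitable_order() refactor further up, roughly this shape
(again just a sketch to show the split; the real primitive would also need
the file-offset alignment check that transhuge_vma_suitable() does today,
which I've omitted here):

/* True if a naturally-aligned range of this order fits at addr. */
static inline bool thp_vma_suitable_order(struct vm_area_struct *vma,
                                          unsigned long addr, int order)
{
        unsigned long hpage_size = PAGE_SIZE << order;
        unsigned long haddr = ALIGN_DOWN(addr, hpage_size);

        return haddr >= vma->vm_start && haddr + hpage_size <= vma->vm_end;
}

/* Filter the input bitfield down to the orders that are suitable at addr. */
static inline unsigned long thp_vma_suitable_orders(struct vm_area_struct *vma,
                                                    unsigned long addr,
                                                    unsigned long orders)
{
        unsigned long suitable = 0;
        int order;

        for_each_set_bit(order, &orders, BITS_PER_LONG)
                if (thp_vma_suitable_order(vma, addr, order))
                        suitable |= BIT(order);

        return suitable;
}

That keeps the single-order check as the primitive, as you suggested.

Thanks,
Ryan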