Message-ID: <37a3e0d0-329f-576f-2498-986ff260c07a@linux.alibaba.com>
Date: Thu, 26 Oct 2023 20:32:42 +0800
Subject: Re: [PATCH] arm64: mm: drop tlb flush operation when clearing the access bit
From: Baolin Wang <baolin.wang@linux.alibaba.com>
To: Anshuman Khandual, Barry Song <21cnbao@gmail.com>
Cc: catalin.marinas@arm.com, will@kernel.org, akpm@linux-foundation.org, v-songbaohua@oppo.com, yuzhao@google.com, linux-mm@kvack.org, linux-arm-kernel@lists.infradead.org, linux-kernel@vger.kernel.org
References: <2f55f62b-cae2-4eee-8572-1b662a170880@arm.com>

On 10/26/2023 2:01 PM, Anshuman Khandual wrote:
>
> On 10/26/23 11:24, Barry Song wrote:
>> On Thu, Oct 26, 2023 at 12:55 PM Anshuman Khandual wrote:
>>>
>>> On 10/24/23 18:26, Baolin Wang wrote:
>>>> Now ptep_clear_flush_young() is only called by folio_referenced() to
>>>> check if the folio was referenced, and it currently performs a TLB
>>>> flush on the ARM64 architecture. However, the TLB flush can be
>>>> expensive on ARM64 servers, especially on systems with a large number
>>>> of CPUs.
>>>
>>> A TLB flush would be expensive on *any* platform with a large number
>>> of CPUs?

Perhaps yes, but I did not measure it on other platforms.

>>>> Similar to the x86 architecture, the comments below apply equally to
>>>> the ARM64 architecture, so we can drop the TLB flush operation in
>>>> ptep_clear_flush_young() on ARM64 to improve performance.
>>>> "
>>>> /* Clearing the accessed bit without a TLB flush
>>>>  * doesn't cause data corruption. [ It could cause incorrect
>>>>  * page aging and the (mistaken) reclaim of hot pages, but the
>>>>  * chance of that should be relatively low. ]
>>>>  *
>>>>  * So as a performance optimization don't flush the TLB when
>>>>  * clearing the accessed bit, it will eventually be flushed by
>>>>  * a context switch or a VM operation anyway. [ In the rare
>>>>  * event of it not getting flushed for a long time the delay
>>>>  * shouldn't really matter because there's no real memory
>>>>  * pressure for swapout to react to. ]
>>>>  */
>>>> "
>>>
>>> If this is always true, it sounds generic enough for all platforms;
>>> why only x86 and arm64?

I am not sure this is always true for every architecture.

>>>> Running thpscale shows some obvious improvements in compaction
>>>> latency with this patch:
>>>>                                    base               patched
>>>> Amean  fault-both-1    1093.19 (   0.00%)   1084.57 *   0.79%*
>>>> Amean  fault-both-3    2566.22 (   0.00%)   2228.45 *  13.16%*
>>>> Amean  fault-both-5    3591.22 (   0.00%)   3146.73 *  12.38%*
>>>> Amean  fault-both-7    4157.26 (   0.00%)   4113.67 *   1.05%*
>>>> Amean  fault-both-12   6184.79 (   0.00%)   5218.70 *  15.62%*
>>>> Amean  fault-both-18   9103.70 (   0.00%)   7739.71 *  14.98%*
>>>> Amean  fault-both-24  12341.73 (   0.00%)  10684.23 *  13.43%*
>>>> Amean  fault-both-30  15519.00 (   0.00%)  13695.14 *  11.75%*
>>>> Amean  fault-both-32  16189.15 (   0.00%)  14365.73 *  11.26%*
>>>>
>>>>                              base     patched
>>>> Duration User              167.78      161.03
>>>> Duration System           1836.66     1673.01
>>>> Duration Elapsed          2074.58     2059.75
>>>
>>> Could you please point to the test repo you are running?

The test is based on the v6.5 kernel.

>>>> Barry Song submitted a similar patch [1] before, which replaces
>>>> ptep_clear_flush_young_notify() with ptep_clear_young_notify() in
>>>> folio_referenced_one().
>>>> However, I'm not sure if removing the TLB flush
>>>> operation is applicable to every architecture in the kernel, so
>>>> dropping the TLB flush only for ARM64 seems a sensible change.
>>>
>>> The reasoning provided here sounds generic when true, hence there
>>> seems to be no justification to keep it limited to just arm64 and x86.

Right, but I cannot ensure that this will not break other architectures.

>>> Also, what about pmdp_clear_flush_young_notify() when THP is enabled?
>>> Should that also not do a TLB flush after clearing the access bit?

Yes, I think so.

>>> Although arm64 does not enable __HAVE_ARCH_PMDP_CLEAR_YOUNG_FLUSH, it
>>> rather depends on the generic pmdp_clear_flush_young(), which also
>>> does a TLB flush via flush_pmd_tlb_range() while clearing the access
>>> bit.
>>>
>>>> Note: I am okay with both approaches. If someone can help to ensure
>>>> that all architectures do not need the TLB flush when clearing the
>>>> accessed bit, then I also think Barry's patch is better (hope Barry
>>>> can resend his patch).
>>>
>>> This paragraph belongs after the '----' below and is not part of the
>>> commit message.

OK.
>>>> [1] https://lore.kernel.org/lkml/20220617070555.344368-1-21cnbao@gmail.com/
>>>>
>>>> Signed-off-by: Baolin Wang
>>>> ---
>>>>  arch/arm64/include/asm/pgtable.h | 31 ++++++++++++++++-------------
>>>>  1 file changed, 16 insertions(+), 15 deletions(-)
>>>>
>>>> diff --git a/arch/arm64/include/asm/pgtable.h b/arch/arm64/include/asm/pgtable.h
>>>> index 0bd18de9fd97..2979d796ba9d 100644
>>>> --- a/arch/arm64/include/asm/pgtable.h
>>>> +++ b/arch/arm64/include/asm/pgtable.h
>>>> @@ -905,21 +905,22 @@ static inline int ptep_test_and_clear_young(struct vm_area_struct *vma,
>>>>  static inline int ptep_clear_flush_young(struct vm_area_struct *vma,
>>>>  					 unsigned long address, pte_t *ptep)
>>>>  {
>>>> -	int young = ptep_test_and_clear_young(vma, address, ptep);
>>>> -
>>>> -	if (young) {
>>>> -		/*
>>>> -		 * We can elide the trailing DSB here since the worst that can
>>>> -		 * happen is that a CPU continues to use the young entry in its
>>>> -		 * TLB and we mistakenly reclaim the associated page. The
>>>> -		 * window for such an event is bounded by the next
>>>> -		 * context-switch, which provides a DSB to complete the TLB
>>>> -		 * invalidation.
>>>> -		 */
>>>> -		flush_tlb_page_nosync(vma, address);
>>>> -	}
>>>> -
>>>> -	return young;
>>>> +	/*
>>>> +	 * This comment is borrowed from x86, but applies equally to ARM64:
>>>> +	 *
>>>> +	 * Clearing the accessed bit without a TLB flush doesn't cause
>>>> +	 * data corruption. [ It could cause incorrect page aging and
>>>> +	 * the (mistaken) reclaim of hot pages, but the chance of that
>>>> +	 * should be relatively low. ]
>>>> +	 *
>>>> +	 * So as a performance optimization don't flush the TLB when
>>>> +	 * clearing the accessed bit, it will eventually be flushed by
>>>> +	 * a context switch or a VM operation anyway. [ In the rare
>>>> +	 * event of it not getting flushed for a long time the delay
>>>> +	 * shouldn't really matter because there's no real memory
>>>> +	 * pressure for swapout to react to. ]
>>>> +	 */
>>>> +	return ptep_test_and_clear_young(vma, address, ptep);
>>>>  }
>>>>
>>>>  #ifdef CONFIG_TRANSPARENT_HUGEPAGE
>>>
>>> There are three distinct concerns here:
>>>
>>> 1) What are the chances of this misleading the existing hot page
>>>    reclaim process?
>>> 2) How do secondary MMUs such as the SMMU adapt to a change in
>>>    mappings without a flush?
>>> 3) Could this break the architecture rule requiring a TLB flush after
>>>    the access bit is cleared on a page table entry?
>>
>> In terms of all of the above concerns, though 2 is different, being an
>> issue between CPU and non-CPU agents, I feel the kernel has actually
>> already dropped the TLB flush, at least for MGLRU; there is no flush in
>> lru_gen_look_around():
>>
>> static bool folio_referenced_one(struct folio *folio,
>> 		struct vm_area_struct *vma, unsigned long address, void *arg)
>> {
>> 	...
>>
>> 	if (pvmw.pte) {
>> 		if (lru_gen_enabled() &&
>> 		    pte_young(ptep_get(pvmw.pte))) {
>> 			lru_gen_look_around(&pvmw);
>> 			referenced++;
>> 		}
>>
>> 		if (ptep_clear_flush_young_notify(vma, address,
>> 						  pvmw.pte))
>> 			referenced++;
>> 	}
>>
>> 	return true;
>> }
>>
>> and so does walk_pte_range() in vmscan. Linux has been surviving with
>> all of the above concerns for a while, believe it or not :-)
>
> Although the first two concerns could be worked on in software, the
> kernel surviving after explicitly breaking arch rules is not a correct
> state to be in, IMHO.

I'm not sure what "not a correct state" means; at least we (Alibaba) have
not found that this causes any issues so far when using MGLRU on x86 and
ARM64 platforms.