Date: Wed, 25 Oct 2023 10:43:21 +0800
Subject:
Re: [PATCH] arm64: mm: drop tlb flush operation when clearing the access bit
To: Alistair Popple, Barry Song <21cnbao@gmail.com>
Cc: catalin.marinas@arm.com, will@kernel.org, akpm@linux-foundation.org,
 v-songbaohua@oppo.com, yuzhao@google.com, linux-mm@kvack.org,
 linux-arm-kernel@lists.infradead.org, linux-kernel@vger.kernel.org
References: <87y1frqz2u.fsf@nvdebian.thelocal> <87ttqfqw8f.fsf@nvdebian.thelocal>
From: Baolin Wang
In-Reply-To: <87ttqfqw8f.fsf@nvdebian.thelocal>
X-Mailing-List: linux-kernel@vger.kernel.org

On 10/25/2023 9:58 AM, Alistair Popple wrote:
>
> Barry Song <21cnbao@gmail.com> writes:
>
>> On Wed, Oct 25, 2023 at 9:18 AM Alistair Popple wrote:
>>>
>>>
>>> Barry Song <21cnbao@gmail.com> writes:
>>>
>>>> On Wed, Oct 25, 2023 at 7:16 AM Barry Song <21cnbao@gmail.com> wrote:
>>>>>
>>>>> On Tue, Oct 24, 2023 at 8:57 PM Baolin Wang
>>>>> wrote:
>
> [...]
>
>>>>> (A). Constant flush cost vs. (B). very very occasional reclaimed hot
>>>>> page, B might
>>>>> be a correct choice.
>>>>
>>>> Plus, I doubt B is really going to happen. as after a page is promoted to
>>>> the head of lru list or new generation, it needs a long time to slide back
>>>> to the inactive list tail or to the candidate generation of mglru. the time
>>>> should have been large enough for tlb to be flushed.
>>>> If the page is really
>>>> hot, the hardware will get second, third, fourth etc opportunity to set an
>>>> access flag in the long time in which the page is re-moved to the tail
>>>> as the page can be accessed multiple times if it is really hot.
>>>
>>> This might not be true if you have external hardware sharing the page
>>> tables with software through either HMM or hardware supported ATS
>>> though.
>>>
>>> In those cases I think it's much more likely hardware can still be
>>> accessing the page even after a context switch on the CPU say. So those
>>> pages will tend to get reclaimed even though hardware is still actively
>>> using them which would be quite expensive and I guess could lead to
>>> thrashing as each page is reclaimed and then immediately faulted back
>>> in.

That's possible, but the chance should be relatively low. At least on
x86, which already clears the accessed bit without a TLB flush, I have
not heard of this issue being reported.

>> i am not quite sure i got your point. has the external hardware sharing cpu's
>> pagetable the ability to set access flag in page table entries by
>> itself? if yes,
>> I don't see how our approach will hurt as folio_referenced can notify the
>> hardware driver and the driver can flush its own tlb. If no, i don't see
>> either as the external hardware can't set access flags, that means we
>> have ignored its reference and only knows cpu's access even in the current
>> mainline code. so we are not getting worse.
>>
>> so the external hardware can also see cpu's TLB? or cpu's tlb flush can
>> also broadcast to external hardware, then external hardware sees the
>> cleared access flag, thus, it can set access flag in page table when the
>> hardware access it? If this is the case, I feel what you said is true.
>
> Perhaps it would help if I gave a concrete example. Take for example the
> ARM SMMU. It has it's own TLB. Invalidating this TLB is done in one of
> two ways depending on the specific HW implementation.
>
> If broadcast TLB maintenance (BTM) is supported it will snoop CPU TLB
> invalidations. If BTM is not supported it relies on SW to explicitly
> forward TLB invalidations via MMU notifiers.

On our ARM64 hardware, we rely on BTM to maintain TLB coherency.

> Now consider the case where some external device is accessing mappings
> via the SMMU. The access flag will be cached in the SMMU TLB. If we
> clear the access flag without a TLB invalidate the access flag in the
> CPU page table will not get updated because it's already set in the SMMU
> TLB.
>
> As an aside access flag updates happen in one of two ways. If the SMMU
> HW supports hardware translation table updates (HTTU) then hardware will
> manage updating access/dirty flags as required. If this is not supported
> then SW is relied on to update these flags which in practice means
> taking a minor fault. But I don't think that is relevant here - in
> either case without a TLB invalidate neither of those things will
> happen.
>
> I suppose drivers could implement the clear_flush_young() MMU notifier
> callback (none do at the moment AFAICT) but then won't that just lead to
> the opposite problem - that every page ever used by an external device
> remains active and unavailable for reclaim because the access flag never
> gets cleared? I suppose they could do the flush then which would ensure

Yes, I think so too. I suspect the reason we have not seen a problem so
far is that there are no real use cases yet. At least in Alibaba's
fleet, the SMMU and the CPU MMU do not currently share page tables.

> the page is marked inactive if it's not referenced between the two
> folio_referenced calls().
>
> But that requires changes to those drivers. SMMU from memory doesn't
> even register for notifiers if BTM is supported.
>
>  - Alistair
>
>>>
>>> Of course TLB flushes are equally (perhaps even more) expensive for this
>>> kind of external HW so reducing them would still be beneficial.
>>> I wonder
>>> if there's some way they could be deferred until the page is moved to
>>> the inactive list say?
>>>
>>>>>
>>>>>> [1] https://lore.kernel.org/lkml/20220617070555.344368-1-21cnbao@gmail.com/
>>>>>> Signed-off-by: Baolin Wang
>>>>>> ---
>>>>>>  arch/arm64/include/asm/pgtable.h | 31 ++++++++++++++++---------------
>>>>>>  1 file changed, 16 insertions(+), 15 deletions(-)
>>>>>>
>>>>>> diff --git a/arch/arm64/include/asm/pgtable.h b/arch/arm64/include/asm/pgtable.h
>>>>>> index 0bd18de9fd97..2979d796ba9d 100644
>>>>>> --- a/arch/arm64/include/asm/pgtable.h
>>>>>> +++ b/arch/arm64/include/asm/pgtable.h
>>>>>> @@ -905,21 +905,22 @@ static inline int ptep_test_and_clear_young(struct vm_area_struct *vma,
>>>>>>  static inline int ptep_clear_flush_young(struct vm_area_struct *vma,
>>>>>>  					 unsigned long address, pte_t *ptep)
>>>>>>  {
>>>>>> -	int young = ptep_test_and_clear_young(vma, address, ptep);
>>>>>> -
>>>>>> -	if (young) {
>>>>>> -		/*
>>>>>> -		 * We can elide the trailing DSB here since the worst that can
>>>>>> -		 * happen is that a CPU continues to use the young entry in its
>>>>>> -		 * TLB and we mistakenly reclaim the associated page. The
>>>>>> -		 * window for such an event is bounded by the next
>>>>>> -		 * context-switch, which provides a DSB to complete the TLB
>>>>>> -		 * invalidation.
>>>>>> -		 */
>>>>>> -		flush_tlb_page_nosync(vma, address);
>>>>>> -	}
>>>>>> -
>>>>>> -	return young;
>>>>>> +	/*
>>>>>> +	 * This comment is borrowed from x86, but applies equally to ARM64:
>>>>>> +	 *
>>>>>> +	 * Clearing the accessed bit without a TLB flush doesn't cause
>>>>>> +	 * data corruption. [ It could cause incorrect page aging and
>>>>>> +	 * the (mistaken) reclaim of hot pages, but the chance of that
>>>>>> +	 * should be relatively low. ]
>>>>>> +	 *
>>>>>> +	 * So as a performance optimization don't flush the TLB when
>>>>>> +	 * clearing the accessed bit, it will eventually be flushed by
>>>>>> +	 * a context switch or a VM operation anyway. [ In the rare
>>>>>> +	 * event of it not getting flushed for a long time the delay
>>>>>> +	 * shouldn't really matter because there's no real memory
>>>>>> +	 * pressure for swapout to react to. ]
>>>>>> +	 */
>>>>>> +	return ptep_test_and_clear_young(vma, address, ptep);
>>>>>>  }
>>>>>>
>>>>>>  #ifdef CONFIG_TRANSPARENT_HUGEPAGE
>>>>>> --
>>>>>> 2.39.3
>>>>>>
>>>>>
>>>>> Thanks
>>>>> Barry
>>>
>> Thanks
>> Barry
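
To make the stale-entry scenario discussed above concrete, here is a
minimal user-space sketch in C. It is a toy model under simplifying
assumptions, not kernel code: there is one page, one PTE and a
single-entry TLB, and a TLB hit is assumed never to touch the young bit
in the page table. All toy_* names are hypothetical and exist only for
this illustration.

```c
#include <stdbool.h>

/* Toy model only: one page, one PTE, one single-entry TLB. */
struct toy_pte { bool young; };
struct toy_tlb { bool valid; };

/*
 * Assumption of this model: an access that hits the TLB does not walk
 * the page table, so the young bit in the PTE is left untouched. Only
 * a miss walks the table, sets young and fills the TLB.
 */
static void toy_access(struct toy_pte *pte, struct toy_tlb *tlb)
{
	if (tlb->valid)
		return;
	pte->young = true;
	tlb->valid = true;
}

/* Clear the young bit and report its old value; the TLB is not touched. */
static bool toy_test_and_clear_young(struct toy_pte *pte)
{
	bool young = pte->young;

	pte->young = false;
	return young;
}

/* Same as above, but also drop the cached translation if it was young. */
static bool toy_clear_flush_young(struct toy_pte *pte, struct toy_tlb *tlb)
{
	bool young = toy_test_and_clear_young(pte);

	if (young)
		tlb->valid = false;
	return young;
}

/*
 * Two aging passes with one access before each pass and nothing else
 * (no context switch, no other invalidation) in between. Returns what
 * the second pass observes.
 */
static bool toy_second_pass_sees_young(bool flush)
{
	struct toy_pte pte = { .young = false };
	struct toy_tlb tlb = { .valid = false };

	toy_access(&pte, &tlb);			/* first use of the page */
	if (flush)
		toy_clear_flush_young(&pte, &tlb);
	else
		toy_test_and_clear_young(&pte);
	toy_access(&pte, &tlb);			/* page is used again */

	return toy_test_and_clear_young(&pte);
}
```

In this model toy_second_pass_sees_young(true) returns true, while
toy_second_pass_sees_young(false) returns false because the second
access hits the stale entry. On a real system the entry would normally
be evicted or invalidated by some other event before the second pass,
which is the argument for why the window should be small on the CPU
side; the open question in this thread is whether that also holds for a
device TLB.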