Date: Mon, 27 Nov 2023 11:11:25 +0000
Subject: Re: [PATCH v2 00/14] Transparent Contiguous PTEs for User Mappings
To: Barry Song <21cnbao@gmail.com>
Cc:
 akpm@linux-foundation.org, andreyknvl@gmail.com, anshuman.khandual@arm.com,
 ardb@kernel.org, catalin.marinas@arm.com, david@redhat.com,
 dvyukov@google.com, glider@google.com, james.morse@arm.com,
 jhubbard@nvidia.com, linux-arm-kernel@lists.infradead.org,
 linux-kernel@vger.kernel.org, linux-mm@kvack.org, mark.rutland@arm.com,
 maz@kernel.org, oliver.upton@linux.dev, ryabinin.a.a@gmail.com,
 suzuki.poulose@arm.com, vincenzo.frascino@arm.com,
 wangkefeng.wang@huawei.com, will@kernel.org, willy@infradead.org,
 yuzenghui@huawei.com, yuzhao@google.com, ziy@nvidia.com
References: <20231115163018.1303287-1-ryan.roberts@arm.com>
 <20231127031813.5576-1-v-songbaohua@oppo.com>
 <234021ba-73c2-474a-82f9-91e1604d5bb5@arm.com>
From: Ryan Roberts
X-Mailing-List: linux-kernel@vger.kernel.org

On 27/11/2023 10:35, Barry Song wrote:
> On Mon, Nov 27, 2023 at 10:15 PM Ryan Roberts wrote:
>>
>> On 27/11/2023 03:18, Barry Song wrote:
>>>> Ryan Roberts (14):
>>>>   mm: Batch-copy PTE ranges during fork()
>>>>   arm64/mm: set_pte(): New layer to manage contig bit
>>>>   arm64/mm: set_ptes()/set_pte_at(): New layer to manage contig bit
>>>>   arm64/mm: pte_clear(): New layer to manage contig bit
>>>>   arm64/mm: ptep_get_and_clear(): New layer to manage contig bit
>>>>   arm64/mm: ptep_test_and_clear_young(): New layer to manage contig bit
>>>>   arm64/mm: ptep_clear_flush_young(): New layer to manage contig bit
>>>>   arm64/mm: ptep_set_wrprotect(): New layer to manage
>>>>     contig bit
>>>>   arm64/mm: ptep_set_access_flags(): New layer to manage contig bit
>>>>   arm64/mm: ptep_get(): New layer to manage contig bit
>>>>   arm64/mm: Split __flush_tlb_range() to elide trailing DSB
>>>>   arm64/mm: Wire up PTE_CONT for user mappings
>>>>   arm64/mm: Implement ptep_set_wrprotects() to optimize fork()
>>>>   arm64/mm: Add ptep_get_and_clear_full() to optimize process teardown
>>>
>>> Hi Ryan,
>>> Not quite sure if I missed something, are we splitting/unfolding CONTPTES
>>> in the below cases
>>
>> The general idea is that the core-mm sets the individual ptes (one at a time if
>> it likes with set_pte_at(), or in a block with set_ptes()), modifies their
>> permissions (ptep_set_wrprotect(), ptep_set_access_flags()) and clears them
>> (ptep_clear(), etc). This is exactly the same interface as previously.
>>
>> BUT, the arm64 implementation of those interfaces will now detect when a set of
>> adjacent PTEs (a contpte block - so 16 naturally aligned entries when using 4K
>> base pages) are all appropriate for having the CONT_PTE bit set; in this case
>> the block is "folded". And it will detect when the first PTE in the block
>> changes such that the CONT_PTE bit must now be unset ("unfolded"). One of the
>> requirements for folding a contpte block is that all the pages must belong to
>> the *same* folio (that means it's safe to only track access/dirty for the contpte
>> block as a whole rather than for each individual pte).
>>
>> (there are a couple of optimizations that make the reality slightly more
>> complicated than what I've just explained, but you get the idea).
>>
>> On that basis, I believe the specific cases you describe below are all
>> covered and safe - please let me know if you think there is a hole here!
>>
>>>
>>> 1. madvise(MADV_DONTNEED) on a part of basepages on a CONTPTE large folio
>>
>> The page will first be unmapped (e.g. ptep_clear() or ptep_get_and_clear(), or
>> whatever).
>> The implementation of that will cause an unfold and the CONT_PTE bit
>> is removed from the whole contpte block. If there is then a subsequent
>> set_pte_at() to set a swap entry, the implementation will see that it's not
>> appropriate to re-fold, so the range will remain unfolded.
>>
>>>
>>> 2. vma split in a large folio due to various reasons such as mprotect,
>>> munmap, mlock etc.
>>
>> I'm not sure if PTEs are explicitly unmapped/remapped when splitting a VMA? I
>> suspect not, so if the VMA is split in the middle of a currently folded contpte
>> block, it will remain folded. But this is safe and continues to work correctly.
>> The VMA arrangement is not important; it is just important that a single folio
>> is mapped contiguously across the whole block.
>
> I don't think it is safe to keep CONTPTE folded in a split_vma case, as
> otherwise, copy_ptes in your other patch might only copy a part
> of CONTPTEs.
> For example, if page0-page4 and page5-page15 are split in split_vma,
> in fork, while copying pte for the first VMA, we are copying page0-page4,
> this will immediately cause inconsistent CONTPTE, as we have to
> make sure all CONTPTEs are atomically mapped in a PTL.

No, that's not how it works. The CONT_PTE bit is not blindly copied from parent
to child. It is explicitly managed by the arch code and set when appropriate. In
the case above, we will end up calling set_ptes() for page0-page4 in the child.
set_ptes() will notice that there are only 5 contiguous pages, so it will map
them without the CONT_PTE bit.

>
>>
>>>
>>> 3. try_to_unmap_one() to reclaim a folio, ptes are scanned one by one
>>> rather than being as a whole.
>>
>> Yes, as per 1; the arm64 implementation will notice when the first entry is
>> cleared and unfold the contpte block.
>>
>>>
>>> In hardware, we need to make sure CONTPTE follow the rule - always 16
>>> contiguous physical address with CONTPTE set.
>>> If one of them runs away
>>> from the 16 ptes group and PTEs become inconsistent, some terrible
>>> errors/faults can happen in HW. For example
>>
>> Yes, the implementation obeys all these rules; see contpte_try_fold() and
>> contpte_try_unfold(). The fold/unfold operation is only done when all
>> requirements are met, and we perform it in a manner that is conformant to the
>> architecture requirements (see contpte_fold() - being renamed to
>> contpte_convert() in the next version).
>>
>> Thanks for the review!
>>
>> Thanks,
>> Ryan
>>
>>>
>>> case 0:
>>> addr0 PTE - has no CONTPTE
>>> addr0+4kb PTE - has CONTPTE
>>> ....
>>> addr0+60kb PTE - has CONTPTE
>>>
>>> case 1:
>>> addr0 PTE - has no CONTPTE
>>> addr0+4kb PTE - has CONTPTE
>>> ....
>>> addr0+60kb PTE - has swap
>>>
>>> Inconsistent 16 PTEs will lead to crash even in the firmware based on
>>> our observation.
>>>
>
> Thanks
> Barry
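
To make the folding rule discussed in this thread concrete, here is a toy
userspace model in C. It is only a sketch, not the kernel implementation:
struct toy_pte, the toy_*() helpers and the integer folio id are hypothetical
names invented for this illustration. The real logic lives in
contpte_try_fold() and contpte_try_unfold(), whose bodies are not shown in this
message. The model assumes a block of 16 entries (4K base pages) and folds only
when every entry is valid, belongs to the same folio, and the physical frames
are contiguous and aligned to the block size.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define TOY_CONT_PTES 16  /* entries per contpte block with 4K base pages */

/* Toy PTE: NOT the arm64 layout, just the fields the rules depend on. */
struct toy_pte {
    bool valid;          /* maps a page (not none, not a swap entry) */
    bool cont;           /* stands in for the CONT_PTE bit */
    unsigned long pfn;   /* physical frame number */
    int folio;           /* hypothetical id of the owning folio */
};

/* "Unfold": clear the cont bit on every entry of the block. */
static void toy_unfold(struct toy_pte *blk)
{
    for (size_t i = 0; i < TOY_CONT_PTES; i++)
        blk[i].cont = false;
}

/* "Fold": set the cont bit on all 16 entries, or on none of them. */
static void toy_try_fold(struct toy_pte *blk)
{
    if (!blk[0].valid || blk[0].pfn % TOY_CONT_PTES != 0)
        return;  /* first frame must be present and block-aligned */

    for (size_t i = 0; i < TOY_CONT_PTES; i++) {
        if (!blk[i].valid || blk[i].folio != blk[0].folio ||
            blk[i].pfn != blk[0].pfn + i)
            return;  /* hole, foreign folio or non-contiguous frame */
    }

    for (size_t i = 0; i < TOY_CONT_PTES; i++)
        blk[i].cont = true;
}

/* Map nr entries starting at idx; the cont bit is derived, never copied. */
static void toy_set_ptes(struct toy_pte *blk, size_t idx, size_t nr,
                         unsigned long pfn, int folio)
{
    assert(idx + nr <= TOY_CONT_PTES);
    toy_unfold(blk);  /* the block is changing, so drop cont first */
    for (size_t i = 0; i < nr; i++)
        blk[idx + i] = (struct toy_pte){ .valid = true, .pfn = pfn + i,
                                         .folio = folio };
    toy_try_fold(blk);
}

/* Clear one entry; the whole block is unfolded before it changes. */
static void toy_clear_pte(struct toy_pte *blk, size_t idx)
{
    assert(idx < TOY_CONT_PTES);
    toy_unfold(blk);
    blk[idx] = (struct toy_pte){ 0 };
}

/* Helper for inspection: how many entries currently carry the cont bit. */
static size_t toy_count_cont(const struct toy_pte *blk)
{
    size_t n = 0;
    for (size_t i = 0; i < TOY_CONT_PTES; i++)
        n += blk[i].cont;
    return n;
}
```

In this model, mapping only page0-page4 of an empty block (the split-VMA fork
case above) leaves no entry with the cont bit set; mapping page5-page15 of the
same folio afterwards folds all 16 entries; clearing any single entry (cases 1
and 3) unfolds the whole block again. The cont bit is therefore either on all
16 entries or on none, which is the invariant the hardware requires.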