Subject: Re: [PATCH 2/3] arm64: mm: HVO: support BBM of vmemmap pgtable safely
Date: Fri, 15 Dec 2023 14:32:20 +0800
From: Kefeng Wang
To: Muchun Song, Nanyong Sun
X-Mailing-List: linux-kernel@vger.kernel.org
References: <20231214073912.1938330-1-sunnanyong@huawei.com>
 <20231214073912.1938330-3-sunnanyong@huawei.com>
 <50f082a8-b805-5d66-45f4-2af58e99a67f@linux.dev>
In-Reply-To: <50f082a8-b805-5d66-45f4-2af58e99a67f@linux.dev>

On 2023/12/15 12:36, Muchun Song wrote:
>
>
> On 2023/12/14 15:39, Nanyong Sun wrote:
>> Implement vmemmap_update_pmd and vmemmap_update_pte on arm64 to do the
>> BBM (break-before-make) logic when changing the page table of a vmemmap
>> address; both run under the init_mm.page_table_lock.
>> If a translation fault on a vmemmap address happens concurrently after
>> the pte/pmd has been cleared, the vmemmap page fault handler acquires
>> the init_mm.page_table_lock to wait for the vmemmap update to complete;
>> by then the virtual address is valid again, so the page fault can
>> return and the access can continue.
>> In all other cases, take the traditional kernel fault path.
>
> Yes. BTW, we already use the same scheme to support arm64
> in our internal production. So the whole approach LGTM.
>
>> Implement flush_tlb_vmemmap_all and flush_tlb_vmemmap_range on arm64
>> as no-ops, because the TLB is already flushed in every single BBM.
>>
>> Signed-off-by: Nanyong Sun
>> ---
>>   arch/arm64/include/asm/esr.h |  4 ++
>>   arch/arm64/include/asm/mmu.h | 20 ++++++++
>>   arch/arm64/mm/fault.c        | 94 ++++++++++++++++++++++++++++++++++++
>>   arch/arm64/mm/mmu.c          | 28 +++++++++++
>>   4 files changed, 146 insertions(+)
>>
>> diff --git a/arch/arm64/include/asm/esr.h b/arch/arm64/include/asm/esr.h
>> index ae35939f395b..1c63256efd25 100644
>> --- a/arch/arm64/include/asm/esr.h
>> +++ b/arch/arm64/include/asm/esr.h
>> @@ -116,6 +116,10 @@
>>   #define ESR_ELx_FSC_SERROR    (0x11)
>>   #define ESR_ELx_FSC_ACCESS    (0x08)
>>   #define ESR_ELx_FSC_FAULT    (0x04)
>> +#define ESR_ELx_FSC_FAULT_L0    (0x04)
>> +#define ESR_ELx_FSC_FAULT_L1    (0x05)
>> +#define ESR_ELx_FSC_FAULT_L2    (0x06)
>> +#define ESR_ELx_FSC_FAULT_L3    (0x07)
>>   #define ESR_ELx_FSC_PERM    (0x0C)
>>   #define ESR_ELx_FSC_SEA_TTW0    (0x14)
>>   #define ESR_ELx_FSC_SEA_TTW1    (0x15)
>> diff --git a/arch/arm64/include/asm/mmu.h b/arch/arm64/include/asm/mmu.h
>> index 2fcf51231d6e..fcec5827f54f 100644
>> --- a/arch/arm64/include/asm/mmu.h
>> +++ b/arch/arm64/include/asm/mmu.h
>> @@ -76,5 +76,25 @@ extern bool kaslr_requires_kpti(void);
>>   #define INIT_MM_CONTEXT(name)    \
>>       .pgd = init_pg_dir,
>> +#ifdef CONFIG_HUGETLB_PAGE_OPTIMIZE_VMEMMAP
>> +void vmemmap_update_pmd(unsigned long start, pmd_t *pmd, pte_t *pgtable);
>> +#define vmemmap_update_pmd vmemmap_update_pmd
>> +void vmemmap_update_pte(unsigned long addr, pte_t *pte, pte_t entry);
>> +#define vmemmap_update_pte vmemmap_update_pte
>> +
>> +static inline void flush_tlb_vmemmap_all(void)
>> +{
>> +    /* do nothing, already flushed tlb in every single BBM */
>> +}
>> +#define flush_tlb_vmemmap_all flush_tlb_vmemmap_all
>> +
>> +static inline void flush_tlb_vmemmap_range(unsigned long start,
>> +                       unsigned long end)
>> +{
>> +    /* do nothing, already flushed tlb in every single BBM */
>> +}
>> +#define flush_tlb_vmemmap_range flush_tlb_vmemmap_range
>> +#endif
>> +
>>   #endif    /* !__ASSEMBLY__ */
>>   #endif
>> diff --git a/arch/arm64/mm/fault.c b/arch/arm64/mm/fault.c
>> index 460d799e1296..7066a273c1e0 100644
>> --- a/arch/arm64/mm/fault.c
>> +++ b/arch/arm64/mm/fault.c
>> @@ -368,6 +368,97 @@ static bool is_el1_mte_sync_tag_check_fault(unsigned long esr)
>>       return false;
>>   }
>> +#ifdef CONFIG_HUGETLB_PAGE_OPTIMIZE_VMEMMAP
>> +static inline bool is_vmemmap_address(unsigned long addr)
>> +{
>> +    return (addr >= VMEMMAP_START) && (addr < VMEMMAP_END);
>> +}
>> +
>> +static inline bool vmemmap_fault_may_fixup(unsigned long addr,
>> +                       unsigned long esr)
>> +{
>> +    if (!is_vmemmap_address(addr))
>> +        return false;
>> +

Squash the vmemmap range check into this function.
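i.e. drop the is_vmemmap_address() helper and open-code the range check
here, roughly like the following (untested sketch, only to illustrate
the idea; the level check is unchanged from the patch):

static inline bool vmemmap_fault_may_fixup(unsigned long addr,
					   unsigned long esr)
{
	/* only vmemmap addresses are candidates for the fixup */
	if (addr < VMEMMAP_START || addr >= VMEMMAP_END)
		return false;

	/*
	 * Only try to handle translation fault level 2 or level 3,
	 * because hugetlb vmemmap optimize only clear pmd or pte.
	 */
	switch (esr & ESR_ELx_FSC) {
	case ESR_ELx_FSC_FAULT_L2:
	case ESR_ELx_FSC_FAULT_L3:
		return true;
	default:
		return false;
	}
}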
>> +    /*
>> +     * Only try to handle translation fault level 2 or level 3,
>> +     * because hugetlb vmemmap optimize only clear pmd or pte.
>> +     */
>> +    switch (esr & ESR_ELx_FSC) {
>> +    case ESR_ELx_FSC_FAULT_L2:
>> +    case ESR_ELx_FSC_FAULT_L3:
>> +        return true;
>> +    default:
>> +        return false;
>> +    }
>> +}
>> +
>> +/*
>> + * PMD mapped vmemmap should has been split as PTE mapped
>> + * by HVO now, here we only check this case, other cases
>> + * should fail.
>> + * Also should check the addr is healthy enough that will not cause
>> + * a level2 or level3 translation fault again after page fault
>> + * handled with success, so we need check both bits[1:0] of PMD and
>> + * PTE as ARM Spec mentioned below:
>> + * A Translation fault is generated if bits[1:0] of a translation
>> + * table descriptor identify the descriptor as either a Fault
>> + * encoding or a reserved encoding.
>> + */
>> +static inline bool vmemmap_addr_healthy(unsigned long addr)
>> +{
>> +    pgd_t *pgdp;
>> +    p4d_t *p4dp;
>> +    pud_t *pudp, pud;
>> +    pmd_t *pmdp, pmd;
>> +    pte_t *ptep, pte;
>> +
>> +    pgdp = pgd_offset_k(addr);
>> +    if (pgd_none(READ_ONCE(*pgdp)))
>> +        return false;
>> +
>> +    p4dp = p4d_offset(pgdp, addr);
>> +    if (p4d_none(READ_ONCE(*p4dp)))
>> +        return false;
>> +
>> +    pudp = pud_offset(p4dp, addr);
>> +    pud = READ_ONCE(*pudp);
>> +    if (pud_none(pud))
>> +        return false;
>> +
>> +    pmdp = pmd_offset(pudp, addr);
>
> We already make sure it is a translation fault of level 2 or 3
> here, so we could use pmd_offset_k() macro to simplify the code
> a little. Right?
>
>> +    pmd = READ_ONCE(*pmdp);
>> +    if (!pmd_table(pmd))
>> +        return false;
>> +
>> +    ptep = pte_offset_kernel(pmdp, addr);
>> +    pte = READ_ONCE(*ptep);
>
> Please use ptep_get (which is supposed to do this) to access the
> raw pte, see commit c33c794828f21217.
>
>> +    return (pte_val(pte) & PTE_TYPE_MASK) == PTE_TYPE_PAGE;
>> +}
>> +
>> +static bool vmemmap_handle_page_fault(unsigned long addr,
>> +                      unsigned long esr)
>> +{
>> +    bool ret = false;
>> +
>> +    if (likely(!vmemmap_fault_may_fixup(addr, esr)))
>> +        return false;
>> +
>> +    spin_lock(&init_mm.page_table_lock);
>> +    if (vmemmap_addr_healthy(addr))
>> +        ret = true;
>
> It is better to assign the return value to ret directly. Like:
>
>         ret = vmemmap_addr_healthy(addr);
>
> Then the initialization of ret can also be dropped.
>
>> +    spin_unlock(&init_mm.page_table_lock);
>> +
>> +    return ret;
>> +}
>> +#else
>> +static inline bool vmemmap_handle_page_fault(unsigned long addr,
>> +                         unsigned long esr)
>> +{
>> +    return false;
>> +}
>> +#endif /*CONFIG_HUGETLB_PAGE_OPTIMIZE_VMEMMAP */
>           ^
> Missing a blank between "*" and "C" here.
>
> Thanks.
>
>> +
>>   static bool is_translation_fault(unsigned long esr)
>>   {
>>       return (esr & ESR_ELx_FSC_TYPE) == ESR_ELx_FSC_FAULT;
>> @@ -409,6 +500,9 @@ static void __do_kernel_fault(unsigned long addr, unsigned long esr,
>>               kfence_handle_page_fault(addr, esr & ESR_ELx_WNR, regs))
>>               return;
>> +        if (vmemmap_handle_page_fault(addr, esr))
>> +            return;
>> +

Better to move this under is_translation_fault()? Something like:
	if (is_translation_fault(esr)) {
		if (kfence_handle_page_fault(addr, esr & ESR_ELx_WNR, regs))
			return;
		if (vmemmap_handle_page_fault(addr, esr))
			return;
	}

>>           msg = "paging request";
>>       }
>> diff --git a/arch/arm64/mm/mmu.c b/arch/arm64/mm/mmu.c
>> index 15f6347d23b6..81a600ccac7c 100644
>> --- a/arch/arm64/mm/mmu.c
>> +++ b/arch/arm64/mm/mmu.c
>> @@ -1146,6 +1146,34 @@ int __meminit vmemmap_check_pmd(pmd_t *pmdp, int node,
>>       return 1;
>>   }
>> +#ifdef CONFIG_HUGETLB_PAGE_OPTIMIZE_VMEMMAP
>> +/*
>> + * In the window between the page table entry is cleared and filled
>> + * with a new value, other threads have the opportunity to concurrently
>> + * access the vmemmap area then page translation fault occur.
>> + * Therefore, we need to ensure that the init_mm.page_table_lock is held
>> + * to synchronize the vmemmap page fault handling which will wait for
>> + * this lock to be released to ensure that the page table entry has been
>> + * refreshed with a new valid value.
>> + */
>> +void vmemmap_update_pmd(unsigned long start, pmd_t *pmd, pte_t *pgtable)
>> +{
>> +    lockdep_assert_held(&init_mm.page_table_lock);
>> +    pmd_clear(pmd);
>> +    flush_tlb_kernel_range(start, start + PMD_SIZE);
>> +    pmd_populate_kernel(&init_mm, pmd, pgtable);
>> +}
>> +
>> +void vmemmap_update_pte(unsigned long addr, pte_t *pte, pte_t entry)
>> +{
>> +    spin_lock(&init_mm.page_table_lock);
>> +    pte_clear(&init_mm, addr, pte);
>> +    flush_tlb_kernel_range(addr, addr + PAGE_SIZE);
>> +    set_pte_at(&init_mm, addr, pte, entry);
>> +    spin_unlock(&init_mm.page_table_lock);
>> +}
>> +#endif
>> +
>>   int __meminit vmemmap_populate(unsigned long start, unsigned long end, int node,
>>           struct vmem_altmap *altmap)
>>   {
>
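By the way, with the return value assigned directly as Muchun suggests
above, vmemmap_handle_page_fault() would end up roughly like this
(untested sketch based on the patch plus that comment):

static bool vmemmap_handle_page_fault(unsigned long addr,
				      unsigned long esr)
{
	bool ret;

	if (likely(!vmemmap_fault_may_fixup(addr, esr)))
		return false;

	/* wait for a concurrent vmemmap update (BBM) to finish */
	spin_lock(&init_mm.page_table_lock);
	ret = vmemmap_addr_healthy(addr);
	spin_unlock(&init_mm.page_table_lock);

	return ret;
}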