Date: Tue, 31 Aug 2021 13:14:16 +0300
From: Mike Rapoport
To: Rick Edgecombe
Cc: dave.hansen@intel.com, luto@kernel.org, peterz@infradead.org, x86@kernel.org,
    akpm@linux-foundation.org, keescook@chromium.org, shakeelb@google.com,
    vbabka@suse.cz, linux-mm@kvack.org, linux-hardening@vger.kernel.org,
    kernel-hardening@lists.openwall.com, ira.weiny@intel.com,
    dan.j.williams@intel.com, linux-kernel@vger.kernel.org
Subject: Re: [RFC PATCH v2 17/19] x86/mm/cpa: PKS protect direct map page tables
Message-ID:
References: <20210830235927.6443-1-rick.p.edgecombe@intel.com> <20210830235927.6443-18-rick.p.edgecombe@intel.com>
In-Reply-To: <20210830235927.6443-18-rick.p.edgecombe@intel.com>

On Mon, Aug 30, 2021 at 04:59:25PM -0700, Rick Edgecombe wrote:
> Protecting direct map page tables is a bit more difficult because a page
> table may be needed for a page split as part of setting the PKS
> permission on the new page table. So in the case of an empty cache of page
> tables, the page table allocator could get into a situation where it cannot
> create any more page tables.
> 
> Several solutions were looked at:
> 
> 1. Break the direct map with pages allocated from the large page being
> converted to PKS. This would result in a window where the table could be
> written to right before it was linked into the page tables. It also
> depends on high order pages being available, and so would regress from
> the unprotected behavior in that respect.
> 2. Hold some page tables in reserve to be able to break the large page
> for a new 2MB page, but if there are no 2MB pages available we may need
> to add a single page to the cache, in which case we would use up the
> reserve of page tables needed to break a new page, but not get enough
> page tables back to replenish the reserve.
> 3. Always map the direct map at 4k when protecting page tables so that
> pages don't need to be broken to map them with a PKS key. This would have
> an undesirable performance cost.
> 
> 4. Lastly, the strategy employed in this patch: have a separate cache of
> page tables just used for the direct map. Early in boot, squirrel away
> enough page tables to map the direct map at 4k. This comes with the same
> memory overhead as mapping the direct map at 4k, but gets the other
> benefits of mapping the direct map as large pages.
> 
> There is also the problem of protecting page tables that are allocated
> during boot. Instead of recording the tables to protect later, create a
> page table traversing infrastructure to walk every page table in init_mm
> and apply protection. This also covers non-direct map odds-and-ends page
> tables that are allocated during boot. The existing page table traversing
> in pagewalk.c cannot be used for this purpose because there are no actual
> vmas for all of the kernel address space.
> 
> The algorithm for protecting the direct map page table cache, while also
> allocating from it for direct map splits, is described in the comments of
> init_pks_dmap_tables().
> 
> Signed-off-by: Rick Edgecombe
> ---
>  arch/x86/include/asm/set_memory.h |   2 +
>  arch/x86/mm/init.c                |  89 ++++++++++
>  arch/x86/mm/pat/set_memory.c      | 263 +++++++++++++++++++++++++++++-
>  3 files changed, 350 insertions(+), 4 deletions(-)
> 
> diff --git a/arch/x86/include/asm/set_memory.h b/arch/x86/include/asm/set_memory.h
> index 1ba2fb45ed05..9f8d0d0ae063 100644
> --- a/arch/x86/include/asm/set_memory.h
> +++ b/arch/x86/include/asm/set_memory.h
> @@ -90,6 +90,8 @@ bool kernel_page_present(struct page *page);
>  
>  extern int kernel_set_to_readonly;
>  
> +void add_dmap_table(unsigned long addr);
> +
>  #ifdef CONFIG_X86_64
>  /*
>   * Prevent speculative access to the page by either unmapping
> diff --git a/arch/x86/mm/init.c b/arch/x86/mm/init.c
> index c8933c6d5efd..a91696e3da96 100644
> --- a/arch/x86/mm/init.c
> +++ b/arch/x86/mm/init.c
> @@ -6,6 +6,7 @@
>  #include
>  #include
>  #include
> +#include
>  #include
>  
>  #include
> @@ -26,6 +27,7 @@
>  #include
>  #include
>  #include
> +#include
>  
>  /*
>   * We need to define the tracepoints somewhere, and tlb.c
> @@ -119,6 +121,17 @@ __ref void *alloc_low_pages(unsigned int num)
>  	if (after_bootmem) {
>  		unsigned int order;
>  
> +		if (cpu_feature_enabled(X86_FEATURE_PKS_TABLES)) {
> +			struct page *page;
> +
> +			/* 64 bit only allocates order 0 pages */
> +			WARN_ON(num != 1);
> +
> +			page = alloc_table(GFP_ATOMIC | __GFP_ZERO);
> +			if (!page)
> +				return NULL;
> +			return (void *)page_address(page);
> +		}
>  		order = get_order((unsigned long)num << PAGE_SHIFT);
>  		return (void *)__get_free_pages(GFP_ATOMIC | __GFP_ZERO, order);
>  	}
> @@ -504,6 +517,79 @@ bool pfn_range_is_mapped(unsigned long start_pfn, unsigned long end_pfn)
>  	return false;
>  }
>  
> +#ifdef CONFIG_PKS_PG_TABLES
> +/* Page tables needed in bytes */
> +static u64 calc_tables_needed(unsigned int size)
> +{
> +	unsigned int puds = size >> PUD_SHIFT;
> +	unsigned int pmds = size >> PMD_SHIFT;
> +
> +	/*
> +	 * Catch if direct map ever might need more page tables to split
> +	 * down to 4k.
> +	 */
> +	BUILD_BUG_ON(p4d_huge(foo));
> +	BUILD_BUG_ON(pgd_huge(foo));
> +
> +	return (puds + pmds) << PAGE_SHIFT;
> +}
> +
> +/*
> + * If pre boot, reserve large pages from memory that will be mapped. It's ok that this is not
> + * mapped as PKS, other init code in CPA will handle the conversion.
> + */
> +static unsigned int __init reserve_pre_boot(u64 start, u64 end)
> +{
> +	u64 cur = memblock_find_in_range(start, end, HPAGE_SIZE, HPAGE_SIZE);
> +	int i;

Please use memblock_phys_alloc_range() here. Besides, it seems these
reserved pages are not accessed until late_initcall time, so there is no
need to limit the allocation to already mapped areas; memblock_alloc_raw()
would suffice.

> +
> +	if (!cur)
> +		return 0;
> +	memblock_reserve(cur, HPAGE_SIZE);
> +	for (i = 0; i < HPAGE_SIZE; i += PAGE_SIZE)
> +		add_dmap_table((unsigned long)__va(cur + i));
> +	return HPAGE_SIZE;
> +}
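Something along these lines, perhaps (completely untested; the rest of the
function is kept as in the patch, only the allocation call changes):

static unsigned int __init reserve_pre_boot(u64 start, u64 end)
{
	/* memblock_phys_alloc_range() finds *and* reserves the range */
	u64 cur = memblock_phys_alloc_range(HPAGE_SIZE, HPAGE_SIZE, start, end);
	int i;

	if (!cur)
		return 0;

	for (i = 0; i < HPAGE_SIZE; i += PAGE_SIZE)
		add_dmap_table((unsigned long)__va(cur + i));

	return HPAGE_SIZE;
}

And if you go with memblock_alloc_raw() instead, it returns a virtual
address, so the __va() above would go away as well.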
> +
> +/* If post boot, memblock is not available. Just reserve from other memory regions */
> +static unsigned int __init reserve_post_boot(void)
> +{
> +	struct page *page = alloc_table(GFP_KERNEL);
> +
> +	if (!page)
> +		return 0;
> +
> +	add_dmap_table((unsigned long)page_address(page));

add_dmap_table() calls use casting everywhere, maybe make it
add_dmap_table(void *)?

> +
> +	return PAGE_SIZE;
> +}
> +
> +static void __init reserve_page_tables(u64 start, u64 end)
> +{
> +	u64 reserve_size = calc_tables_needed(end - start);
> +	u64 reserved = 0;
> +	u64 cur_reserved;
> +
> +	while (reserved < reserve_size) {
> +		if (after_bootmem)
> +			cur_reserved = reserve_post_boot();
> +		else
> +			cur_reserved = reserve_pre_boot(start, end);
> +
> +		if (!cur_reserved) {
> +			WARN(1, "Could not reserve direct map page tables %llu/%llu\n",
> +			     reserved,
> +			     reserve_size);
> +			return;
> +		}
> +
> +		reserved += cur_reserved;
> +	}
> +}
> +#else
> +static inline void reserve_page_tables(u64 start, u64 end) { }
> +#endif
> +
>  /*
>   * Setup the direct mapping of the physical memory at PAGE_OFFSET.
>   * This runs before bootmem is initialized and gets pages directly from
> @@ -529,6 +615,9 @@ unsigned long __ref init_memory_mapping(unsigned long start,
>  
>  	add_pfn_range_mapped(start >> PAGE_SHIFT, ret >> PAGE_SHIFT);
>  
> +	if (cpu_feature_enabled(X86_FEATURE_PKS_TABLES))
> +		reserve_page_tables(start, end);
> +
>  	return ret >> PAGE_SHIFT;
>  }
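For the add_dmap_table() signature, what I had in mind is roughly this
(untested, parameter name aside, and wherever add_dmap_table() is defined
would need the matching adjustment):

-void add_dmap_table(unsigned long addr);
+void add_dmap_table(void *table);

so the callers above simply lose their casts:

-	add_dmap_table((unsigned long)__va(cur + i));
+	add_dmap_table(__va(cur + i));

-	add_dmap_table((unsigned long)page_address(page));
+	add_dmap_table(page_address(page));

-- 
Sincerely yours,
Mike.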