2019-09-26 08:50:22

by Steve Wahl

Subject: [PATCH v3 1/2] x86/boot/64: Make level2_kernel_pgt pages invalid outside kernel area.

Our hardware (UV aka Superdome Flex) has address ranges marked
reserved by the BIOS. Access to these ranges is caught as an error,
causing the BIOS to halt the system.

Initial page tables mapped a large range of physical addresses that
were not checked against the list of BIOS reserved addresses, and
sometimes included reserved addresses in part of the mapped range.
Including the reserved range in the map allowed processor speculative
accesses to the reserved range, triggering a BIOS halt.

Used early in booting, the page table level2_kernel_pgt addresses 1
GiB divided into 2 MiB pages, and it was set up to linearly map a full
1 GiB of physical addresses that included the physical address range
of the kernel image, as chosen by KASLR. But this also included a
large range of unused addresses on either side of the kernel image.
And unlike the kernel image's physical address range, this extra
mapped space was not checked against the BIOS tables of usable RAM
addresses. So there were times when the addresses chosen by KASLR
would result in processor accessible mappings of BIOS reserved
physical addresses.

The kernel code did not directly access any of this extra mapped
space, but having it mapped allowed the processor to issue speculative
accesses into reserved memory, causing system halts.

This was encountered somewhat rarely on a normal system boot, and much
more often when starting the crash kernel if "crashkernel=512M,high"
was specified on the command line (this heavily restricts the physical
address of the crash kernel, in our case usually within 1 GiB of
reserved space).

The solution is to invalidate the pages of this table outside the
kernel image's space before the page table is activated. This patch
has been validated to fix this problem on our hardware.

Signed-off-by: Steve Wahl <[email protected]>
Cc: [email protected]
---
Changes since v1:
* Added comment.
* Reworked changelog text.
Changes since v2:
* Added further inline comments.

 arch/x86/kernel/head64.c | 22 ++++++++++++++++++++--
 1 file changed, 20 insertions(+), 2 deletions(-)

diff --git a/arch/x86/kernel/head64.c b/arch/x86/kernel/head64.c
index 29ffa495bd1c..282054025dcf 100644
--- a/arch/x86/kernel/head64.c
+++ b/arch/x86/kernel/head64.c
@@ -222,13 +222,31 @@ unsigned long __head __startup_64(unsigned long physaddr,
 	 * we might write invalid pmds, when the kernel is relocated
 	 * cleanup_highmap() fixes this up along with the mappings
 	 * beyond _end.
+	 *
+	 * Only the region occupied by the kernel image has so far
+	 * been checked against the table of usable memory regions
+	 * provided by the firmware, so invalidate pages outside that
+	 * region. A page table entry that maps to a reserved area of
+	 * memory would allow processor speculation into that area,
+	 * and on some hardware (particularly the UV platform) even
+	 * speculative access to some reserved areas is caught as an
+	 * error, causing the BIOS to halt the system.
 	 */
 
 	pmd = fixup_pointer(level2_kernel_pgt, physaddr);
-	for (i = 0; i < PTRS_PER_PMD; i++) {
+
+	/* invalidate pages before the kernel image */
+	for (i = 0; i < pmd_index((unsigned long)_text); i++)
+		pmd[i] &= ~_PAGE_PRESENT;
+
+	/* fixup pages that are part of the kernel image */
+	for (; i <= pmd_index((unsigned long)_end); i++)
 		if (pmd[i] & _PAGE_PRESENT)
 			pmd[i] += load_delta;
-	}
+
+	/* invalidate pages after the kernel image */
+	for (; i < PTRS_PER_PMD; i++)
+		pmd[i] &= ~_PAGE_PRESENT;
 
 	/*
 	 * Fixup phys_base - remove the memory encryption mask to obtain
--
2.21.0


--
Steve Wahl, Hewlett Packard Enterprise
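
The three loops in the hunk above can be modelled in userspace to see which entries stay present. This is only a sketch and not the kernel code: every sketch_* name is hypothetical, the table is a plain array of uint64_t, and the first/last arguments stand in for pmd_index((unsigned long)_text) and pmd_index((unsigned long)_end).

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Illustrative stand-ins for the kernel's constants (assumed values). */
#define SKETCH_PTRS_PER_PMD 512
#define SKETCH_PAGE_PRESENT ((uint64_t)1)	/* bit 0 of an entry */

/*
 * Hypothetical userspace model of the patched loop. "first" and "last"
 * play the role of pmd_index(_text) and pmd_index(_end); "last" is
 * inclusive, matching the "<=" in the patch.
 */
static void sketch_fixup_pmd(uint64_t *pmd, size_t first, size_t last,
			     uint64_t load_delta)
{
	size_t i;

	/* precondition of this sketch: a valid, ordered index range */
	assert(first <= last && last < SKETCH_PTRS_PER_PMD);

	/* invalidate entries before the kernel image */
	for (i = 0; i < first; i++)
		pmd[i] &= ~SKETCH_PAGE_PRESENT;

	/* relocate entries that are part of the kernel image */
	for (; i <= last; i++)
		if (pmd[i] & SKETCH_PAGE_PRESENT)
			pmd[i] += load_delta;

	/* invalidate entries after the kernel image */
	for (; i < SKETCH_PTRS_PER_PMD; i++)
		pmd[i] &= ~SKETCH_PAGE_PRESENT;
}

/* Count the entries that still have the present bit set. */
static size_t sketch_count_present(const uint64_t *pmd)
{
	size_t i, n = 0;

	for (i = 0; i < SKETCH_PTRS_PER_PMD; i++)
		if (pmd[i] & SKETCH_PAGE_PRESENT)
			n++;
	return n;
}
```

Starting from a table in which all 512 entries are present, a call such as sketch_fixup_pmd(pmd, 8, 20, delta) leaves only entries 8 through 20 present, so nothing outside the image's own range remains mapped.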


2019-09-26 10:38:39

by Kirill A. Shutemov

Subject: Re: [PATCH v3 1/2] x86/boot/64: Make level2_kernel_pgt pages invalid outside kernel area.

On Tue, Sep 24, 2019 at 04:03:55PM -0500, Steve Wahl wrote:
> Our hardware (UV aka Superdome Flex) has address ranges marked
> reserved by the BIOS. Access to these ranges is caught as an error,
> causing the BIOS to halt the system.
>
> Initial page tables mapped a large range of physical addresses that
> were not checked against the list of BIOS reserved addresses, and
> sometimes included reserved addresses in part of the mapped range.
> Including the reserved range in the map allowed processor speculative
> accesses to the reserved range, triggering a BIOS halt.
>
> Used early in booting, the page table level2_kernel_pgt addresses 1
> GiB divided into 2 MiB pages, and it was set up to linearly map a full
> 1 GiB of physical addresses that included the physical address range
> of the kernel image, as chosen by KASLR. But this also included a
> large range of unused addresses on either side of the kernel image.
> And unlike the kernel image's physical address range, this extra
> mapped space was not checked against the BIOS tables of usable RAM
> addresses. So there were times when the addresses chosen by KASLR
> would result in processor accessible mappings of BIOS reserved
> physical addresses.
>
> The kernel code did not directly access any of this extra mapped
> space, but having it mapped allowed the processor to issue speculative
> accesses into reserved memory, causing system halts.
>
> This was encountered somewhat rarely on a normal system boot, and much
> more often when starting the crash kernel if "crashkernel=512M,high"
> was specified on the command line (this heavily restricts the physical
> address of the crash kernel, in our case usually within 1 GiB of
> reserved space).
>
> The solution is to invalidate the pages of this table outside the
> kernel image's space before the page table is activated. This patch
> has been validated to fix this problem on our hardware.
>
> Signed-off-by: Steve Wahl <[email protected]>
> Cc: [email protected]

Acked-by: Kirill A. Shutemov <[email protected]>

--
Kirill A. Shutemov
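
The "1 GiB divided into 2 MiB pages" figure quoted above follows from the table geometry: 512 entries of 2 MiB each. A minimal sketch of that arithmetic follows, assuming x86-64 with 2 MiB PMD pages. The sketch_* names are hypothetical; the index calculation is meant to mirror what the kernel's pmd_index() does, not to reproduce its source.

```c
#include <stdint.h>

#define SKETCH_PMD_SHIFT    21	/* each entry maps 1 << 21 bytes = 2 MiB */
#define SKETCH_PTRS_PER_PMD 512	/* entries in one PMD page */

/* Index of the entry that covers addr within its PMD table. */
static unsigned int sketch_pmd_index(uint64_t addr)
{
	return (unsigned int)((addr >> SKETCH_PMD_SHIFT) &
			      (SKETCH_PTRS_PER_PMD - 1));
}

/* Bytes covered by one full PMD table: 512 * 2 MiB = 1 GiB. */
static uint64_t sketch_pmd_table_span(void)
{
	return (uint64_t)SKETCH_PTRS_PER_PMD << SKETCH_PMD_SHIFT;
}
```

For example, the default non-KASLR kernel text address 0xffffffff81000000 lands in entry 8. A kernel image of a few tens of MiB therefore occupies only a handful of the 512 entries, and the rest is the extra mapped space the changelog describes.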

2019-10-11 16:03:58

by Dave Hansen

Subject: Re: [PATCH v3 1/2] x86/boot/64: Make level2_kernel_pgt pages invalid outside kernel area.

On 9/24/19 2:03 PM, Steve Wahl wrote:
> The solution is to invalidate the pages of this table outside the
> kernel image's space before the page table is activated. This patch
> has been validated to fix this problem on our hardware.

Looks good, thanks for the changes!

For both patches:

Acked-by: Dave Hansen <[email protected]>

Subject: [tip: x86/urgent] x86/boot/64: Make level2_kernel_pgt pages invalid outside kernel area

The following commit has been merged into the x86/urgent branch of tip:

Commit-ID: 2aa85f246c181b1fa89f27e8e20c5636426be624
Gitweb: https://git.kernel.org/tip/2aa85f246c181b1fa89f27e8e20c5636426be624
Author: Steve Wahl <[email protected]>
AuthorDate: Tue, 24 Sep 2019 16:03:55 -05:00
Committer: Borislav Petkov <[email protected]>
CommitterDate: Fri, 11 Oct 2019 18:38:15 +02:00

x86/boot/64: Make level2_kernel_pgt pages invalid outside kernel area

Our hardware (UV aka Superdome Flex) has address ranges marked
reserved by the BIOS. Access to these ranges is caught as an error,
causing the BIOS to halt the system.

Initial page tables mapped a large range of physical addresses that
were not checked against the list of BIOS reserved addresses, and
sometimes included reserved addresses in part of the mapped range.
Including the reserved range in the map allowed processor speculative
accesses to the reserved range, triggering a BIOS halt.

Used early in booting, the page table level2_kernel_pgt addresses 1
GiB divided into 2 MiB pages, and it was set up to linearly map a full
1 GiB of physical addresses that included the physical address range
of the kernel image, as chosen by KASLR. But this also included a
large range of unused addresses on either side of the kernel image.
And unlike the kernel image's physical address range, this extra
mapped space was not checked against the BIOS tables of usable RAM
addresses. So there were times when the addresses chosen by KASLR
would result in processor accessible mappings of BIOS reserved
physical addresses.

The kernel code did not directly access any of this extra mapped
space, but having it mapped allowed the processor to issue speculative
accesses into reserved memory, causing system halts.

This was encountered somewhat rarely on a normal system boot, and much
more often when starting the crash kernel if "crashkernel=512M,high"
was specified on the command line (this heavily restricts the physical
address of the crash kernel, in our case usually within 1 GiB of
reserved space).

The solution is to invalidate the pages of this table outside the kernel
image's space before the page table is activated. It fixes this problem
on our hardware.

[ bp: Touchups. ]

Signed-off-by: Steve Wahl <[email protected]>
Signed-off-by: Borislav Petkov <[email protected]>
Acked-by: Dave Hansen <[email protected]>
Acked-by: Kirill A. Shutemov <[email protected]>
Cc: Baoquan He <[email protected]>
Cc: Brijesh Singh <[email protected]>
Cc: [email protected]
Cc: Feng Tang <[email protected]>
Cc: "H. Peter Anvin" <[email protected]>
Cc: Ingo Molnar <[email protected]>
Cc: Jordan Borgner <[email protected]>
Cc: Juergen Gross <[email protected]>
Cc: [email protected]
Cc: [email protected]
Cc: [email protected]
Cc: Thomas Gleixner <[email protected]>
Cc: x86-ml <[email protected]>
Cc: Zhenzhong Duan <[email protected]>
Link: https://lkml.kernel.org/r/9c011ee51b081534a7a15065b1681d200298b530.1569358539.git.steve.wahl@hpe.com
---
arch/x86/kernel/head64.c | 22 ++++++++++++++++++++--
1 file changed, 20 insertions(+), 2 deletions(-)

diff --git a/arch/x86/kernel/head64.c b/arch/x86/kernel/head64.c
index 29ffa49..206a4b6 100644
--- a/arch/x86/kernel/head64.c
+++ b/arch/x86/kernel/head64.c
@@ -222,13 +222,31 @@ unsigned long __head __startup_64(unsigned long physaddr,
 	 * we might write invalid pmds, when the kernel is relocated
 	 * cleanup_highmap() fixes this up along with the mappings
 	 * beyond _end.
+	 *
+	 * Only the region occupied by the kernel image has so far
+	 * been checked against the table of usable memory regions
+	 * provided by the firmware, so invalidate pages outside that
+	 * region. A page table entry that maps to a reserved area of
+	 * memory would allow processor speculation into that area,
+	 * and on some hardware (particularly the UV platform) even
+	 * speculative access to some reserved areas is caught as an
+	 * error, causing the BIOS to halt the system.
 	 */
 
 	pmd = fixup_pointer(level2_kernel_pgt, physaddr);
-	for (i = 0; i < PTRS_PER_PMD; i++) {
+
+	/* invalidate pages before the kernel image */
+	for (i = 0; i < pmd_index((unsigned long)_text); i++)
+		pmd[i] &= ~_PAGE_PRESENT;
+
+	/* fixup pages that are part of the kernel image */
+	for (; i <= pmd_index((unsigned long)_end); i++)
 		if (pmd[i] & _PAGE_PRESENT)
 			pmd[i] += load_delta;
-	}
+
+	/* invalidate pages after the kernel image */
+	for (; i < PTRS_PER_PMD; i++)
+		pmd[i] &= ~_PAGE_PRESENT;
 
 	/*
 	 * Fixup phys_base - remove the memory encryption mask to obtain
