2022-06-20 05:59:55

by Vasily Averin

Subject: "Bad pagetable: 000c" crashes and errata "Not-Present Page Faults May Set the RSVD Flag in the Error Code"

Some (old?) Intel CPUs have an erratum:
"Not-Present Page Faults May Set the RSVD Flag in the Error Code

Problem:
An attempt to access a page that is not marked present causes a page fault.
Such a page fault delivers an error code in which both the P flag (bit 0)
and the RSVD flag (bit 3) are 0. Due to this erratum, not-present page faults
may deliver an error code in which the P flag is 0 but the RSVD flag is 1.

Implication:
Software may erroneously infer that a page fault was due to a reserved-bit
violation when it was actually due to an attempt to access a not-present page.
Intel has not observed this erratum with any commercially available software.

Workaround: Page-fault handlers should ignore the RSVD flag in the error code
if the P flag is 0."

I think Steve Sipes, an OpenVz7 customer, has encountered this problem.
He reported "Bad pagetable: 000c" host crashes observed on several nodes.
(https://bugs.openvz.org/browse/OVZ-7348)

[1190695.900880] httpd: Corrupted page table at address 7f62d5b48e68
PGD 80000002e92bf067 PUD 1c99c5067 PMD 195015067 PTE 7fffffffb78b680
Bad pagetable: 000c [#1] SMP

CPU: 5 PID: 627609 Comm: httpd ve: 053a3e76-fa58-4116-9567-97028be293c5
Kdump: loaded Not tainted 3.10.0-1160.53.1.vz7.185.3 #1 185.3
Hardware name: Dell Inc. PowerEdge 2950/0H268G, BIOS 2.7.0 10/30/2010
task: ffff8bfd19c72000 ti: ffff8bfcbc26c000 task.ti: ffff8bfcbc26c000
RIP: 0033:[<00007f62d5888d28>] [<00007f62d5888d28>] 0x7f62d5888d28
RSP: 002b:00007f62c6eb2c68 EFLAGS: 00010206
RAX: fffffffffffffff5 RBX: 00005575a0197080 RCX: 00007f62d5888d1b
RDX: 0000000000000001 RSI: 00007f62d61ac106 RDI: 0000000000008029
RBP: 00007f62d61ac106 R08: 00007f62c6eb2c60 R09: 0000000000000000
R10: 0000000000000000 R11: 0000000000000202 R12: 00007f62c6eb2cac
R13: 0000000000000001 R14: 0000000000000000 R15: 0000000000000001
FS: 00007f62c6eb3700(0000) GS:ffff8c03ffd40000(0000) knlGS:0000000000000000
CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 00007f62d5b48e68 CR3: 0000000199880000 CR4: 00000000000407e0

The crash utility's vtop command confirms these values and shows that the
corresponding memory page had been moved to swap:

crash> vtop 7f62d5b48e68
VIRTUAL PHYSICAL
7f62d5b48e68 (not mapped)

PGD: 1998807f0 => 80000002e92bf067
PUD: 2e92bfc58 => 1c99c5067
PMD: 1c99c5568 => 195015067
PTE: 195015a40 => 7fffffffb78b680

PTE SWAP OFFSET
7fffffffb78b680 /dev/dm-1 148388

VMA START END FLAGS FILE
ffff8bfd194132a0 7f62d5b45000 7f62d5b49000 8100071 /usr/lib64/libc-2.28.so

SWAP: /dev/dm-1 OFFSET: 148388
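
The vtop result can be cross-checked by hand. The following is a minimal userspace sketch, assuming this kernel uses the inverted-offset swap PTE layout that x86_64 kernels adopted with the L1TF mitigation (5 type bits at the top, offset stored inverted starting at bit 9). The macro values and helper names are assumptions for illustration and are not taken from this thread.

```c
#include <assert.h>
#include <stdint.h>

/* Assumed x86_64 swap PTE layout (post-L1TF); not quoted from the thread. */
#define SWP_TYPE_BITS		5
#define SWP_OFFSET_FIRST_BIT	9	/* assumed: _PAGE_BIT_PROTNONE (8) + 1 */
#define SWP_OFFSET_SHIFT	(SWP_OFFSET_FIRST_BIT + SWP_TYPE_BITS)

static_assert(SWP_OFFSET_SHIFT == 14, "offset field starts 14 bits up after the left shift");

/* Bit 0 of a PTE is the hardware Present flag. */
static inline int pte_present_bit(uint64_t pte)
{
	return (int)(pte & 1);
}

/* The swap type (index of the swap device) lives in the top 5 bits. */
static inline uint64_t swp_type(uint64_t pte)
{
	return pte >> (64 - SWP_TYPE_BITS);
}

/* The swap offset is stored inverted; undo the inversion, then drop the type and low bits. */
static inline uint64_t swp_offset(uint64_t pte)
{
	return (~pte << SWP_TYPE_BITS) >> SWP_OFFSET_SHIFT;
}
```

Under that assumption, swp_offset(0x7fffffffb78b680) evaluates to 148388 and swp_type() to 0, which agrees with the "PTE SWAP OFFSET" line printed by crash, and pte_present_bit() is 0, so the entry is an ordinary not-present swap PTE rather than a corrupted one.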

Initially I assumed this was a single rare incident; however, Steve Sipes
provided several other similar crash reports:

[104586.499948] systemd: Corrupted page table at address 55cab6d28eb8
[104586.500011] PGD 80000000353fe067 PUD 352dc067 PMD 3536f067 PTE 7fffffffe775680
[104586.500011] Bad pagetable: 000c [#1] SMP
...
[453049.102377] gitlab-runner: Corrupted page table at address c00028b008
[453049.102590] PGD 8000000776349067 PUD 776103067 PMD 6e852067 PTE 7fffffffb3ad280
[453049.102858] Bad pagetable: 000c [#1] SMP
...
[ 3373.495631] httpd: Corrupted page table at address 55792c055298
[ 3373.495828] PGD 8000000eca9c6067 PUD eca9c7067 PMD eca9c8067 PTE 7ffffffff5c4c80
[ 3373.496084] Bad pagetable: 000c [#1] SMP

... and a few more.

One of these crashes happened in kernel space, on access to a swapped-out user-space page:

[147450.815301] httpd: Corrupted page table at address 7f82ecac1c60
[147450.815503] PGD 800000005f00a067 PUD 5e55b067 PMD 8c63c2067 PTE 7fffffff395fa80
[147450.815776] Bad pagetable: 000a [#1] SMP

[147450.816058] CPU: 5 PID: 106583 Comm: httpd ve: 053a3e76-fa58-4116-9567-97028be293c5
Kdump: loaded Not tainted 3.10.0-1160.53.1.vz7.185.3 #1 185.3
[147450.816058] Hardware name: Dell Inc. PowerEdge 2950/0H268G, BIOS 2.7.0 10/30/2010
[147450.816058] task: ffff9b83781ea000 ti: ffff9b831e7b0000 task.ti: ffff9b831e7b0000
[147450.816058] RIP: 0010:[<ffffffff8b9c64d0>] [<ffffffff8b9c64d0>] copy_user_generic_string+0x30/0x40
[147450.816058] RSP: 0000:ffff9b831e7b3e90 EFLAGS: 00010246
[147450.816058] RAX: ffff9b831e7b0000 RBX: 0000000000000000 RCX: 0000000000000002
[147450.816058] RDX: 0000000000000000 RSI: ffff9b831e7b3ea8 RDI: 00007f82ecac1c60
[147450.816058] RBP: ffff9b831e7b3ee0 R08: ffff9b831e7b4000 R09: 0000000000000001
[147450.816058] R10: ffffffff8c3693e0 R11: 0000000000000246 R12: ffff9b831e7b3ef8
[147450.816058] R13: 00007f82ecac1c60 R14: 0000000000000001 R15: 0000000000000000
[147450.816058] FS: 00007f82ecac2700(0000) GS:ffff9b842fd40000(0000) knlGS:0000000000000000
[147450.816058] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[147450.816058] CR2: 00007f82ecac1c60 CR3: 00000000c63d0000 CR4: 00000000000407e0
[147450.816058] Call Trace:
[147450.816058] [<ffffffff8b88ab93>] ? poll_select_copy_remaining+0x113/0x180
[147450.816058] [<ffffffff8b88ba2c>] SyS_select+0xdc/0x120
[147450.816058] [<ffffffff8bdd4052>] system_call_fastpath+0x25/0x2a
[147450.816058] [<ffffffff8bdd3f95>] ? system_call_after_swapgs+0xa2/0x13a
[147450.816058] Code: 74 30 83 fa 08 72 27 89 f9 83 e1 07 74 15 83 e9 08 f7 d9 29 ca 8a 06
88 07 48 ff c6 48 ff c7 ff c9 75 f2 89 d1 c1 e9 03 83 e2 07
<f3> 48 a5 89 d1 f3 a4 31 c0 66 66 90 c3 0f 1f 00 66 66 90 21 d2

In all these cases the error code has the "Reserved" bit 3 set and the "Present" bit 0 clear.
The affected nodes use old Xeon E5400 and E5450 processors. According to Intel documentation,
all of them are affected by this erratum; it was not fixed in microcode, and only the
workaround quoted above is available.
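
To make the two observed codes concrete, here is a small sketch that decodes the low error-code bits. The bit positions follow the x86 page-fault error code format (P = bit 0, W/R = bit 1, U/S = bit 2, RSVD = bit 3); the PF_* names and helper functions are hypothetical and only mirror the kernel's X86_PF_* flags.

```c
#include <assert.h>
#include <stdbool.h>

/* Low page-fault error code bits; names are illustrative. */
enum {
	PF_PROT  = 1 << 0,	/* P:    0 = not-present page, 1 = protection fault */
	PF_WRITE = 1 << 1,	/* W/R:  1 = the access was a write */
	PF_USER  = 1 << 2,	/* U/S:  1 = the access came from user mode */
	PF_RSVD  = 1 << 3,	/* RSVD: reserved bit set in a paging-structure entry */
};

static_assert(PF_RSVD == 0x8, "RSVD is bit 3 of the error code");

static inline bool pf_is_write(unsigned long error_code)
{
	return (error_code & PF_WRITE) != 0;
}

static inline bool pf_is_user(unsigned long error_code)
{
	return (error_code & PF_USER) != 0;
}

/* True for the erratum signature: RSVD is set although P is clear. */
static inline bool pf_rsvd_without_present(unsigned long error_code)
{
	return (error_code & PF_RSVD) && !(error_code & PF_PROT);
}
```

With these helpers, 0x000c decodes as a user-mode read with RSVD set and P clear, and 0x000a as a kernel-mode write with RSVD set and P clear, which fits the copy_user_generic_string trace above. Both match pf_rsvd_without_present(), whereas a genuine reserved-bit fault such as 0x9 does not.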

This happened on an old, stable OpenVZ7 kernel based on an even more rock-stable RHEL7 kernel.
I cannot explain why it showed up only now on these nodes, or why no one else has
reported it before. I've checked Intel documentation and found that a few other CPUs
are affected too. Finally, this issue is described in a recent version of the
Intel® 64 and IA-32 Architectures Software Developer’s Manual Volume 3A:
System Programming Guide, Part 1
"
4.7 PAGE-FAULT EXCEPTIONS

RSVD flag (bit 3).
This flag is 1 if there is no translation for the linear address because a
reserved bit was set in one of the paging-structure entries used to translate
that address. (Because reserved bits are not checked in a paging-structure
entry whose P flag is 0, bit 3 of the error code can be set only if bit 0
is also set.[1])

[1] Some past processors had errata for some page faults that occur when
there is no translation for the linear address because the P flag was 0
in one of the paging-structure entries used to translate that address.
Due to these errata, some such page faults produced error codes that
cleared bit 0 (P flag) and set bit 3 (RSVD flag).
"

Currently this case is handled in arch/x86/mm/fault.c::do_user_addr_fault()

	/*
	 * Reserved bits are never expected to be set on
	 * entries in the user portion of the page tables.
	 */
	if (unlikely(error_code & X86_PF_RSVD))
		pgtable_bad(regs, error_code, address);


As you can see, the kernel does not check the Present bit 0 before reporting the
problem, which calls die() and crashes the host.

I'm not sure that the described issue can really be reproduced on current upstream.
The OpenVz7 kernel, like the original RHEL7 kernel, lacks a few recent changes,
for example commit eee4818baac0 ("mm: x86: move _PAGE_SWP_SOFT_DIRTY from
bit 7 to bit 1").

However, as far as I understand, the reported problem may also occur on current upstream.
Therefore, after some hesitation, I decided to report the problem here.

Perhaps we should improve the above check by also testing the X86_PF_PROT bit?

if (unlikely((error_code & (X86_PF_RSVD | X86_PF_PROT)) == (X86_PF_RSVD | X86_PF_PROT)))
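
As a rough illustration of how this stricter test differs from the existing one, the two predicates can be compared in userspace. This is only a sketch: the X86_PF_* values are redefined locally with the bit positions given in the erratum text, and the function names are made up for the comparison.

```c
#include <assert.h>
#include <stdbool.h>

/* Local copies of the two flags involved (bit 0 and bit 3 of the error code). */
#define X86_PF_PROT	(1UL << 0)
#define X86_PF_RSVD	(1UL << 3)

static_assert((X86_PF_RSVD | X86_PF_PROT) == 0x9, "RSVD plus P is 0x9");

/* Existing check: any error code with RSVD set is treated as a bad page table. */
static inline bool rsvd_check_current(unsigned long error_code)
{
	return (error_code & X86_PF_RSVD) != 0;
}

/* Suggested check: report a bad page table only if RSVD and P are both set. */
static inline bool rsvd_check_proposed(unsigned long error_code)
{
	unsigned long mask = X86_PF_RSVD | X86_PF_PROT;

	return (error_code & mask) == mask;
}
```

For the reported codes 0xc and 0xa, rsvd_check_current() is true while rsvd_check_proposed() is false, so the fault would continue down the normal not-present path. A code with both bits set, such as 0x9, still triggers both checks.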

I've prepared a test kernel with a similar patch and provided it to the affected
customer. It looks like this resolved the problem; at least he has not reported any
new crashes.

Thank you,
Vasily Averin


2022-06-20 08:04:40

by Dave Hansen

Subject: Re: "Bad pagetable: 000c" crashes and errata "Not-Present Page Faults May Set the RSVD Flag in the Error Code"

On 6/19/22 22:29, Vasily Averin wrote:
> Some (old?) Intel CPU's have errata:
> "Not-Present Page Faults May Set the RSVD Flag in the Error Code

Do you happen to have a link to the actual erratum document? It's
usually a "Specification Update" for some model or another.

2022-06-20 08:44:01

by Vasily Averin

Subject: Re: "Bad pagetable: 000c" crashes and errata "Not-Present Page Faults May Set the RSVD Flag in the Error Code"

On 6/20/22 10:18, Dave Hansen wrote:
> On 6/19/22 22:29, Vasily Averin wrote:
>> Some (old?) Intel CPU's have errata:
>> "Not-Present Page Faults May Set the RSVD Flag in the Error Code
>
> Do you happen to have a link to the actual erratum document? It's
> usually a "Specification Update" for some model or another.

For the Intel Xeon E5400 used on the affected node:
https://www.intel.com/content/dam/www/public/us/en/documents/specification-updates/xeon-5400-spec-update.pdf
Intel® Xeon® Processor 5400 Series
Specification Update
August 2013
Document Number: 318585-021

page 39
AX74. Not-Present Page Faults May Set the RSVD Flag in the Error Code

Later I found similar entries for other processors too.

Thank you,
Vasily Averin

2022-06-30 06:12:09

by Vasily Averin

Subject: [PATCH] x86/fault: ignore RSVD flag in error code if P flag is 0

Some older Intel CPUs have errata:
"Not-Present Page Faults May Set the RSVD Flag in the Error Code

Problem:
An attempt to access a page that is not marked present causes a page
fault. Such a page fault delivers an error code in which both the
P flag (bit 0) and the RSVD flag (bit 3) are 0. Due to this erratum,
not-present page faults may deliver an error code in which the P flag
is 0 but the RSVD flag is 1.

Implication:
Software may erroneously infer that a page fault was due to a
reserved-bit violation when it was actually due to an attempt
to access a not-present page.

Workaround: Page-fault handlers should ignore the RSVD flag in the error
code if the P flag is 0."

This issue was observed on several nodes, which crashed with messages like:
httpd: Corrupted page table at address 7f62d5b48e68
PGD 80000002e92bf067 PUD 1c99c5067 PMD 195015067 PTE 7fffffffb78b680
Bad pagetable: 000c [#1] SMP

Let's follow the recommendation and ignore the RSVD flag in the
error code if the P flag is 0.

Link: https://lore.kernel.org/all/[email protected]
Signed-off-by: Vasily Averin <[email protected]>
---
arch/x86/mm/fault.c | 9 +++++++++
1 file changed, 9 insertions(+)

diff --git a/arch/x86/mm/fault.c b/arch/x86/mm/fault.c
index fe10c6d76bac..ffc6d6bd2a22 100644
--- a/arch/x86/mm/fault.c
+++ b/arch/x86/mm/fault.c
@@ -1481,6 +1481,15 @@ handle_page_fault(struct pt_regs *regs, unsigned long error_code,
 	if (unlikely(kmmio_fault(regs, address)))
 		return;

+	/*
+	 * Some older Intel CPUs have errata
+	 * "Not-Present Page Faults May Set the RSVD Flag in the Error Code"
+	 * It is recommended to ignore the RSVD flag (bit 3) in the error code
+	 * if the P flag (bit 0) is 0.
+	 */
+	if (unlikely((error_code & X86_PF_RSVD) && !(error_code & X86_PF_PROT)))
+		error_code &= ~X86_PF_RSVD;
+
 	/* Was the fault on kernel-controlled part of the address space? */
 	if (unlikely(fault_in_kernel_space(address))) {
 		do_kern_addr_fault(regs, error_code, address);
--
2.36.1

2022-06-30 14:59:16

by Dave Hansen

Subject: Re: [PATCH] x86/fault: ignore RSVD flag in error code if P flag is 0

On 6/29/22 22:58, Vasily Averin wrote:
> Some older Intel CPUs have errata:
> "Not-Present Page Faults May Set the RSVD Flag in the Error Code

Please include a link to the documentation when you cite things like
this. For example, this is very helpful:

Several older Intel CPUs have this or a similar erratum. For
instance, the "Intel Xeon Processor 5400 Series Specification
Update" has "AX74 ... Not-Present Page Faults May Set the RSVD
Flag in the Error Code".

https://www.intel.com/content/dam/www/public/us/en/documents/specification-updates/xeon-5400-spec-update.pdf

That makes it a *LOT* easier to find the actual erratum and its text. I
honestly also wouldn't mind if you just copy a chunk of the problem text
verbatim into the changelog. Intel does have a habit of updating text
in documents like that and it's quite handy to have a snapshot of what
you were reading when you wrote the patch.

2022-07-01 00:04:42

by Vasily Averin

Subject: [PATCH v2] x86/fault: ignore RSVD flag in error code if P flag is 0

Several older Intel CPUs have this or a similar erratum.
For instance, the "Intel Xeon Processor 5400 Series
Specification Update" [1] has

"AX74. Not-Present Page Faults May Set the RSVD Flag in the Error Code

Problem:
An attempt to access a page that is not marked present causes
a page fault. Such a page fault delivers an error code in which
both the P flag (bit 0) and the RSVD flag (bit 3) are 0.
Due to this erratum, not-present page faults may deliver
an error code in which the P flag is 0 but the RSVD flag is 1.

Implication:
Software may erroneously infer that a page fault was due to
a reserved-bit violation when it was actually due to an attempt
to access a not-present page. Intel has not observed this erratum
with any commercially available software.

Workaround:
Page-fault handlers should ignore the RSVD flag in the error
code if the P flag is 0"

This problem has been observed several times on several nodes using
Intel Xeon E5450 processors. These nodes crashed with
"Bad pagetable: 000c" messages like this:

Corrupted page table at address 7f62d5b48e68
PGD 80000002e92bf067 PUD 1c99c5067 PMD 195015067 PTE 7fffffffb78b680
Bad pagetable: 000c [#1] SMP

The error code here is 0xc: the RSVD flag (bit 3) is set, while the P flag
(bit 0) is clear.

Let's follow the recommendations and ignore the RSVD flag in the cases
described.

Link: [1] https://www.intel.com/content/dam/www/public/us/en/documents/specification-updates/xeon-5400-spec-update.pdf
Link: https://lore.kernel.org/all/[email protected]
Reported-by: Steve Sipes <[email protected]>
Signed-off-by: Vasily Averin <[email protected]>
---
v2: added original reporter
improved patch description, added link to CPU spec update
---
arch/x86/mm/fault.c | 9 +++++++++
1 file changed, 9 insertions(+)

diff --git a/arch/x86/mm/fault.c b/arch/x86/mm/fault.c
index fe10c6d76bac..ffc6d6bd2a22 100644
--- a/arch/x86/mm/fault.c
+++ b/arch/x86/mm/fault.c
@@ -1481,6 +1481,15 @@ handle_page_fault(struct pt_regs *regs, unsigned long error_code,
 	if (unlikely(kmmio_fault(regs, address)))
 		return;

+	/*
+	 * Some older Intel CPUs have errata
+	 * "Not-Present Page Faults May Set the RSVD Flag in the Error Code"
+	 * It is recommended to ignore the RSVD flag (bit 3) in the error code
+	 * if the P flag (bit 0) is 0.
+	 */
+	if (unlikely((error_code & X86_PF_RSVD) && !(error_code & X86_PF_PROT)))
+		error_code &= ~X86_PF_RSVD;
+
 	/* Was the fault on kernel-controlled part of the address space? */
 	if (unlikely(fault_in_kernel_space(address))) {
 		do_kern_addr_fault(regs, error_code, address);
--
2.36.1
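
The effect of the added hunk can be modelled outside the kernel. The sketch below assumes only the two flag values named in the patch; pf_filter_rsvd() is a hypothetical stand-in for the in-place masking done in handle_page_fault(), not a function that exists in the kernel.

```c
#include <assert.h>

/* Local copies of the flags used by the patch (bit 0 and bit 3). */
#define X86_PF_PROT	(1UL << 0)
#define X86_PF_RSVD	(1UL << 3)

static_assert(X86_PF_PROT == 0x1 && X86_PF_RSVD == 0x8, "P is bit 0, RSVD is bit 3");

/*
 * Drop RSVD when P is clear, as the erratum workaround recommends.
 * All other bits of the error code are passed through unchanged.
 */
static inline unsigned long pf_filter_rsvd(unsigned long error_code)
{
	if ((error_code & X86_PF_RSVD) && !(error_code & X86_PF_PROT))
		error_code &= ~X86_PF_RSVD;

	return error_code;
}
```

Under this model 0xc becomes 0x4 and 0xa becomes 0x2, so the later X86_PF_RSVD tests no longer fire while the write and user bits survive. A code with P set, such as 0x9, is returned unchanged and would still be reported as a bad page table.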

2022-07-01 01:36:50

by H. Peter Anvin

Subject: Re: [PATCH] x86/fault: ignore RSVD flag in error code if P flag is 0

On June 29, 2022 10:58:36 PM PDT, Vasily Averin <[email protected]> wrote:
>Some older Intel CPUs have errata:
>"Not-Present Page Faults May Set the RSVD Flag in the Error Code
>
>Problem:
>An attempt to access a page that is not marked present causes a page
>fault. Such a page fault delivers an error code in which both the
>P flag (bit 0) and the RSVD flag (bit 3) are 0. Due to this erratum,
>not-present page faults may deliver an error code in which the P flag
>is 0 but the RSVD flag is 1.
>
>Implication:
>Software may erroneously infer that a page fault was due to a
>reserved-bit violation when it was actually due to an attempt
>to access a not-present page.
>
>Workaround: Page-fault handlers should ignore the RSVD flag in the error
>code if the P flag is 0."
>
>This issues was observed on several nodes crashed with messages
>httpd: Corrupted page table at address 7f62d5b48e68
>PGD 80000002e92bf067 PUD 1c99c5067 PMD 195015067 PTE 7fffffffb78b680
>Bad pagetable: 000c [#1] SMP
>
>Let's follow the recommendation and will ignore the RSVD flag in the
>error code if the P flag is 0
>
>Link: https://lore.kernel.org/all/[email protected]
>Signed-off-by: Vasily Averin <[email protected]>
>---
> arch/x86/mm/fault.c | 9 +++++++++
> 1 file changed, 9 insertions(+)
>
>diff --git a/arch/x86/mm/fault.c b/arch/x86/mm/fault.c
>index fe10c6d76bac..ffc6d6bd2a22 100644
>--- a/arch/x86/mm/fault.c
>+++ b/arch/x86/mm/fault.c
>@@ -1481,6 +1481,15 @@ handle_page_fault(struct pt_regs *regs, unsigned long error_code,
> if (unlikely(kmmio_fault(regs, address)))
> return;
>
>+ /*
>+ * Some older Intel CPUs have errata
>+ * "Not-Present Page Faults May Set the RSVD Flag in the Error Code"
>+ * It is recommended to ignore the RSVD flag (bit 3) in the error code
>+ * if the P flag (bit 0) is 0.
>+ */
>+ if (unlikely((error_code & X86_PF_RSVD) && !(error_code & X86_PF_PROT)))
>+ error_code &= ~X86_PF_RSVD;
>+
> /* Was the fault on kernel-controlled part of the address space? */
> if (unlikely(fault_in_kernel_space(address))) {
> do_kern_addr_fault(regs, error_code, address);

Are there other bits we could/should mask out in the case P = 0? The only bits that should be able to appear are ones that are independent of the PTE content.

2022-07-01 05:05:34

by Vasily Averin

Subject: Re: [PATCH] x86/fault: ignore RSVD flag in error code if P flag is 0

On 7/1/22 03:42, H. Peter Anvin wrote:
> On June 29, 2022 10:58:36 PM PDT, Vasily Averin <[email protected]> wrote:
>> Some older Intel CPUs have errata:
>> "Not-Present Page Faults May Set the RSVD Flag in the Error Code
>>
>> Problem:
>> An attempt to access a page that is not marked present causes a page
>> fault. Such a page fault delivers an error code in which both the
>> P flag (bit 0) and the RSVD flag (bit 3) are 0. Due to this erratum,
>> not-present page faults may deliver an error code in which the P flag
>> is 0 but the RSVD flag is 1.
>>
>> Implication:
>> Software may erroneously infer that a page fault was due to a
>> reserved-bit violation when it was actually due to an attempt
>> to access a not-present page.
>>
>> Workaround: Page-fault handlers should ignore the RSVD flag in the error
>> code if the P flag is 0."
>>
>> This issues was observed on several nodes crashed with messages
>> httpd: Corrupted page table at address 7f62d5b48e68
>> PGD 80000002e92bf067 PUD 1c99c5067 PMD 195015067 PTE 7fffffffb78b680
>> Bad pagetable: 000c [#1] SMP
>>
>> Let's follow the recommendation and will ignore the RSVD flag in the
>> error code if the P flag is 0
>>
>> Link: https://lore.kernel.org/all/[email protected]
>> Signed-off-by: Vasily Averin <[email protected]>
>> ---
>> arch/x86/mm/fault.c | 9 +++++++++
>> 1 file changed, 9 insertions(+)
>>
>> diff --git a/arch/x86/mm/fault.c b/arch/x86/mm/fault.c
>> index fe10c6d76bac..ffc6d6bd2a22 100644
>> --- a/arch/x86/mm/fault.c
>> +++ b/arch/x86/mm/fault.c
>> @@ -1481,6 +1481,15 @@ handle_page_fault(struct pt_regs *regs, unsigned long error_code,
>> if (unlikely(kmmio_fault(regs, address)))
>> return;
>>
>> + /*
>> + * Some older Intel CPUs have errata
>> + * "Not-Present Page Faults May Set the RSVD Flag in the Error Code"
>> + * It is recommended to ignore the RSVD flag (bit 3) in the error code
>> + * if the P flag (bit 0) is 0.
>> + */
>> + if (unlikely((error_code & X86_PF_RSVD) && !(error_code & X86_PF_PROT)))
>> + error_code &= ~X86_PF_RSVD;
>> +
>> /* Was the fault on kernel-controlled part of the address space? */
>> if (unlikely(fault_in_kernel_space(address))) {
>> do_kern_addr_fault(regs, error_code, address);
>
> Are there other bits we could/should mask.out in the case P = 0? The
> only bits that should be able to appear are ones that are independent
> of the PTE content.

According to the "Intel® 64 and IA-32 Architectures Software Developer’s
Manual Volume 3A: System Programming Guide, Part 1", there are several other
similar bits:
http://www.intel.com/content/dam/www/public/us/en/documents/manuals/64-ia-32-architectures-software-developer-manual-325462.pdf

"4.7 PAGE-FAULT EXCEPTIONS
...
• HLAT (bit 7).
This flag is 1 if there is no translation for the linear address using HLAT
paging because, in one of the paging structure entries used to translate that
address, either the P flag was 0 or a reserved bit was set. An error code will
set this flag only if it clears bit 0 or sets bit 3. This flag will not be set
by a page fault resulting from a violation of access rights, nor for one
encountered during ordinary paging, including the case in which there has been
a restart of HLAT paging.

• SGX flag (bit 15).
This flag is 1 if the exception is unrelated to paging and resulted from
violation of SGX-specific access-control requirements. Because such a violation
can occur only if there is no ordinary page fault, this flag is set only if
the P flag (bit 0) is 1 and the RSVD flag (bit 3) and the PK flag (bit 5)
are both 0."

However, only the RSVD flag has errata in real processors,
so I don't think any other bits need to be masked.
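
For reference, the dependencies quoted above can be written down as simple consistency predicates. This is a sketch only: the PF_* names and the helper functions are hypothetical, and the bit positions are the ones given in the SDM text (P = 0, RSVD = 3, PK = 5, HLAT = 7, SGX = 15).

```c
#include <assert.h>
#include <stdbool.h>

/* Error code bits mentioned in the quoted SDM text; names are illustrative. */
#define PF_P	(1UL << 0)
#define PF_RSVD	(1UL << 3)
#define PF_PK	(1UL << 5)
#define PF_HLAT	(1UL << 7)
#define PF_SGX	(1UL << 15)

static_assert(PF_SGX == 0x8000, "SGX is bit 15 of the error code");

/* Architecturally, RSVD can be set only if P is also set. */
static inline bool pf_rsvd_consistent(unsigned long code)
{
	return !(code & PF_RSVD) || (code & PF_P);
}

/* HLAT is set only if the error code clears P or sets RSVD. */
static inline bool pf_hlat_consistent(unsigned long code)
{
	return !(code & PF_HLAT) || !(code & PF_P) || (code & PF_RSVD);
}

/* SGX is set only if P is 1 and both RSVD and PK are 0. */
static inline bool pf_sgx_consistent(unsigned long code)
{
	return !(code & PF_SGX) || ((code & PF_P) && !(code & (PF_RSVD | PF_PK)));
}
```

The crash codes 0xc and 0xa fail pf_rsvd_consistent(), which is the erratum case. The HLAT and SGX predicates describe the documented rules only; nothing in this thread suggests that real processors violate them.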

Thank you,
Vasily Averin