2018-09-10 21:57:31

by Guenter Roeck

[permalink] [raw]
Subject: Random crashes with i386 and efi boots

Hi folks,

even after commit eeb89e2bb1ac ("x86/efi: Load fixmap GDT in
efi_call_phys_epilog()"), my i386/efi qemu boot tests still crash randomly
(roughly 5-10% of the time). As before, I don't see much useful output in
the qemu log (this time it doesn't even complain about a triple fault).

Debugging shows that the crash happens in efi_call_phys_epilog().
A sample log from a crashed test run is attached below. It appears that
the crash happens if there is an interrupt at a critical section of the
code.

While playing with the code, I found a possible fix.

diff --git a/arch/x86/platform/efi/efi_32.c b/arch/x86/platform/efi/efi_32.c
index 05ca14222463..9959657127f4 100644
--- a/arch/x86/platform/efi/efi_32.c
+++ b/arch/x86/platform/efi/efi_32.c
@@ -85,10 +85,9 @@ pgd_t * __init efi_call_phys_prolog(void)

void __init efi_call_phys_epilog(pgd_t *save_pgd)
{
+ load_fixmap_gdt(0);
load_cr3(save_pgd);
__flush_tlb_all();
-
- load_fixmap_gdt(0);
}

This restores the execution order prior to commit eeb89e2bb1ac.
I have no real idea what I am doing, so this change is to be taken with
a grain of salt. All I can say is that 100 boots with the above change
were successful while the current upstream code (v4.19-rc3) crashes on
a regular basis (in a controlled test I observed 6 failures out of 100
boots).

It would be great if someone with a bit more experience can have another
look and figure out the underlying problem. Please let me know if I can
provide additional information.

Thanks,
Guenter

----------------
IN:
# efi_call_phys_epilog(save_pgd);
0xd8f9c12c: 8b 45 f0 movl -0x10(%ebp), %eax
0xd8f9c12f: e8 49 01 00 00 calll 0xd8f9c27d

----------------
IN:
# efi_call_phys_epilog():
# load_cr3();
0xd8f9c27d: 55 pushl %ebp
0xd8f9c27e: 05 00 00 00 40 addl $0x40000000, %eax
0xd8f9c283: 89 e5 movl %esp, %ebp
0xd8f9c285: 0f 22 d8 movl %eax, %cr3

CR3 update: CR3=1904e000
----------------
IN:
# __flush_tlb_all();
0xd8f9c288: e8 c8 5e 2b ff calll 0xd8252155

EAX=1904e000 EBX=00000000 ECX=d8f9c126 EDX=d8eafd60
ESI=1f09a000 EDI=00000030 EBP=d8e99f3c ESP=d8e99f3c
EIP=d8f9c288 EFL=00200207 [-----PC] CPL=0 II=0 A20=1 SMM=0 HLT=0
ES =007b 00000000 ffffffff 00cff300 DPL=3 DS [-WA]
CS =0060 00000000 ffffffff 00cf9a00 DPL=0 CS32 [-R-]
SS =0068 00000000 ffffffff 00cf9300 DPL=0 DS [-WA]
DS =007b 00000000 ffffffff 00cff300 DPL=3 DS [-WA]
FS =00d8 06c92000 ffffffff 008f9300 DPL=0 DS16 [-WA]
GS =00e0 dfcd09c0 00000018 00409100 DPL=0 DS [--A]
LDT=0000 00000000 00000000 00008200 DPL=0 LDT
TR =0080 ff803000 0000206b 00008900 DPL=0 TSS32-avl
GDT= 1fcc0000 000000ff
IDT= ff800000 000007ff
CR0=80050033 CR2=ffd17000 CR3=1904e000 CR4=00040690
DR0=00000000 DR1=00000000 DR2=00000000 DR3=00000000
DR6=ffff0ff0 DR7=00000400
CCS=40000000 CCD=1904e000 CCO=ADDL
EFER=0000000000000000
Servicing hardware INT=0x30
508: v=30 e=0000 i=0 cpl=0 IP=0060:d8f9c288 pc=d8f9c288 SP=0068:d8e99f3c env->regs[R_EAX]=1904e000
X=1904e000 EBX=00000000 ECX=d8f9c126 EDX=d8eafd60
ESI=1f09a000 EDI=00000030 EBP=d8e99f3c ESP=d8e99f3c
EIP=d8f9c288 EFL=00200207 [-----PC] CPL=0 II=0 A20=1 SMM=0 HLT=0
ES =007b 00000000 ffffffff 00cff300 DPL=3 DS [-WA]
CS =0060 00000000 ffffffff 00cf9a00 DPL=0 CS32 [-R-]
SS =0068 00000000 ffffffff 00cf9300 DPL=0 DS [-WA]
DS =007b 00000000 ffffffff 00cff300 DPL=3 DS [-WA]
FS =00d8 06c92000 ffffffff 008f9300 DPL=0 DS16 [-WA]
GS =00e0 dfcd09c0 00000018 00409100 DPL=0 DS [--A]
LDT=0000 00000000 00000000 00008200 DPL=0 LDT
TR =0080 ff803000 0000206b 00008900 DPL=0 TSS32-avl
GDT= 1fcc0000 000000ff
IDT= ff800000 000007ff
CR0=80050033 CR2=ffd17000 CR3=1904e000 CR4=00040690
DR0=00000000 DR1=00000000 DR2=00000000 DR3=00000000
DR6=ffff0ff0 DR7=00000400
CCS=00000005 CCD=1904e000 CCO=EFLAGS
EFER=0000000000000000
check_exception old: 0xffffffff new 0xe
509: v=0e e=0000 i=0 cpl=0 IP=0060:d8f9c288 pc=d8f9c288 SP=0068:d8e99f3c CR2=1fcc0060
EAX=1904e000 EBX=00000000 ECX=d8f9c126 EDX=d8eafd60
ESI=1f09a000 EDI=00000030 EBP=d8e99f3c ESP=d8e99f3c
EIP=d8f9c288 EFL=00200207 [-----PC] CPL=0 II=0 A20=1 SMM=0 HLT=0
ES =007b 00000000 ffffffff 00cff300 DPL=3 DS [-WA]
CS =0060 00000000 ffffffff 00cf9a00 DPL=0 CS32 [-R-]
SS =0068 00000000 ffffffff 00cf9300 DPL=0 DS [-WA]
DS =007b 00000000 ffffffff 00cff300 DPL=3 DS [-WA]
FS =00d8 06c92000 ffffffff 008f9300 DPL=0 DS16 [-WA]
GS =00e0 dfcd09c0 00000018 00409100 DPL=0 DS [--A]
LDT=0000 00000000 00000000 00008200 DPL=0 LDT
TR =0080 ff803000 0000206b 00008900 DPL=0 TSS32-avl
GDT= 1fcc0000 000000ff
IDT= ff800000 000007ff
CR0=80050033 CR2=1fcc0060 CR3=1904e000 CR4=00040690
DR0=00000000 DR1=00000000 DR2=00000000 DR3=00000000
DR6=ffff0ff0 DR7=00000400
CCS=00000005 CCD=1904e000 CCO=EFLAGS
EFER=0000000000000000
check_exception old: 0xe new 0xe
510: v=08 e=0000 i=0 cpl=0 IP=0060:d8f9c288 pc=d8f9c288 SP=0068:d8e99f3c env->regs[R_EAX]=1904e000
EAX=1904e000 EBX=00000000 ECX=d8f9c126 EDX=d8eafd60
ESI=1f09a000 EDI=00000030 EBP=d8e99f3c ESP=d8e99f3c
EIP=d8f9c288 EFL=00200207 [-----PC] CPL=0 II=0 A20=1 SMM=0 HLT=0
ES =007b 00000000 ffffffff 00cff300 DPL=3 DS [-WA]
CS =0060 00000000 ffffffff 00cf9a00 DPL=0 CS32 [-R-]
SS =0068 00000000 ffffffff 00cf9300 DPL=0 DS [-WA]
DS =007b 00000000 ffffffff 00cff300 DPL=3 DS [-WA]
FS =00d8 06c92000 ffffffff 008f9300 DPL=0 DS16 [-WA]
GS =00e0 dfcd09c0 00000018 00409100 DPL=0 DS [--A]
LDT=0000 00000000 00000000 00008200 DPL=0 LDT
TR =0080 ff803000 0000206b 00008900 DPL=0 TSS32-avl
GDT= 1fcc0000 000000ff
IDT= ff800000 000007ff
CR0=80050033 CR2=1fcc0060 CR3=1904e000 CR4=00040690
DR0=00000000 DR1=00000000 DR2=00000000 DR3=00000000
DR6=ffff0ff0 DR7=00000400
CCS=00000005 CCD=1904e000 CCO=EFLAGS
EFER=0000000000000000
check_exception old: 0x8 new 0xe

# qemu dies silently here

---

In another log, the code proceeeds a little further:

CR3 update: CR3=13668000
----------------
IN:
0xd35b6412: e8 ae cb 29 ff calll 0xd2852fc5

----------------
IN:
0xd35b6417: 31 c0 xorl %eax, %eax
0xd35b6419: e8 12 2e 27 ff calll 0xd2829230

----------------
IN:
0xd2829230: 55 pushl %ebp
0xd2829231: 89 e5 movl %esp, %ebp
0xd2829233: 83 ec 08 subl $8, %esp
0xd2829236: e8 75 6c 02 00 calll 0xd284feb0

Servicing hardware INT=0x30

... subsequent log messages and crash are as before.


2018-09-11 11:53:59

by Andy Lutomirski

[permalink] [raw]
Subject: Re: Random crashes with i386 and efi boots



> On Sep 10, 2018, at 2:56 PM, Guenter Roeck <[email protected]> wrote:
>
> Hi folks,
>
> even after commit eeb89e2bb1ac ("x86/efi: Load fixmap GDT in
> efi_call_phys_epilog()"), my i386/efi qemu boot tests still crash randomly
> (roughly 5-10% of the time). As before, I don't see much useful output in
> the qemu log (this time it doesn't even complain about a triple fault).
>
> Debugging shows that the crash happens in efi_call_phys_epilog().
> A sample log from a crashed test run is attached below. It appears that
> the crash happens if there is an interrupt at a critical section of the
> code.
>
> While playing with the code, I found a possible fix.
>
> diff --git a/arch/x86/platform/efi/efi_32.c b/arch/x86/platform/efi/efi_32.c
> index 05ca14222463..9959657127f4 100644
> --- a/arch/x86/platform/efi/efi_32.c
> +++ b/arch/x86/platform/efi/efi_32.c
> @@ -85,10 +85,9 @@ pgd_t * __init efi_call_phys_prolog(void)
>
> void __init efi_call_phys_epilog(pgd_t *save_pgd)
> {
> + load_fixmap_gdt(0);
> load_cr3(save_pgd);
> __flush_tlb_all();
> -
> - load_fixmap_gdt(0);
> }

We have IRQs on here? It seems plausible that we’re in a window where the EFI pgd doesn’t have cpu_entry_area mapped. Also, the hard coded CPU 0 is suspicious.

Maybe try instrumenting the code to check whether the clone_pgd_range calls in setup_percpu.c have happened yet?

Your patch may well be correct, but, if we have IRQs on, we should really have cpu_entry_area mapped in both pgds.

Or we could turn off IRQs. Why on Earth are IRQs on in a context where the fixmap gdt is unusable?

2018-09-11 13:31:37

by Guenter Roeck

[permalink] [raw]
Subject: Re: Random crashes with i386 and efi boots

On 09/11/2018 04:52 AM, Andy Lutomirski wrote:
>
>
>> On Sep 10, 2018, at 2:56 PM, Guenter Roeck <[email protected]> wrote:
>>
>> Hi folks,
>>
>> even after commit eeb89e2bb1ac ("x86/efi: Load fixmap GDT in
>> efi_call_phys_epilog()"), my i386/efi qemu boot tests still crash randomly
>> (roughly 5-10% of the time). As before, I don't see much useful output in
>> the qemu log (this time it doesn't even complain about a triple fault).
>>
>> Debugging shows that the crash happens in efi_call_phys_epilog().
>> A sample log from a crashed test run is attached below. It appears that
>> the crash happens if there is an interrupt at a critical section of the
>> code.
>>
>> While playing with the code, I found a possible fix.
>>
>> diff --git a/arch/x86/platform/efi/efi_32.c b/arch/x86/platform/efi/efi_32.c
>> index 05ca14222463..9959657127f4 100644
>> --- a/arch/x86/platform/efi/efi_32.c
>> +++ b/arch/x86/platform/efi/efi_32.c
>> @@ -85,10 +85,9 @@ pgd_t * __init efi_call_phys_prolog(void)
>>
>> void __init efi_call_phys_epilog(pgd_t *save_pgd)
>> {
>> + load_fixmap_gdt(0);
>> load_cr3(save_pgd);
>> __flush_tlb_all();
>> -
>> - load_fixmap_gdt(0);
>> }
>
> We have IRQs on here? It seems plausible that we’re in a window where the EFI pgd doesn’t have cpu_entry_area mapped. Also, the hard coded CPU 0 is suspicious.
>
The hard coded CPU 0 was always there. The call is ultimately from
efi_enter_virtual_mode(), which is called from start_kernel().
so presumably it is guaranteed to run on CPU 0.

> Maybe try instrumenting the code to check whether the clone_pgd_range calls in setup_percpu.c have happened yet?
>
The crash is seen late in the boot process, so I am quite sure it happened,
but I can add a check if needed. I think that might be a different problem,
though.

> Your patch may well be correct, but, if we have IRQs on, we should really have cpu_entry_area mapped in both pgds.
>
> Or we could turn off IRQs. Why on Earth are IRQs on in a context where the fixmap gdt is unusable?
>

From arch/x86/platform/efi/efi.c:phys_efi_set_virtual_address_map():

save_pgd = efi_call_phys_prolog();
local_irq_save(flags);
status = efi_call_phys(...);
local_irq_restore(flags);

efi_call_phys_epilog(save_pgd);

So, yes, interrupts are very much enabled.

I ran several additional test sequences. With above patch, no failures with
500 boots. Without it, failure rate (long term average) across 500 boots
is around 10%. Another data point: Moving load_fixmap_gdt(0); after
load_cr3(save_pgd); does not help; it has to come first.

Guenter

2018-09-11 16:37:51

by Andy Lutomirski

[permalink] [raw]
Subject: Re: Random crashes with i386 and efi boots



> On Sep 11, 2018, at 6:30 AM, Guenter Roeck <[email protected]> wrote:
>
> On 09/11/2018 04:52 AM, Andy Lutomirski wrote:
>>> On Sep 10, 2018, at 2:56 PM, Guenter Roeck <[email protected]> wrote:
>>>
>>> Hi folks,
>>>
>>> even after commit eeb89e2bb1ac ("x86/efi: Load fixmap GDT in
>>> efi_call_phys_epilog()"), my i386/efi qemu boot tests still crash randomly
>>> (roughly 5-10% of the time). As before, I don't see much useful output in
>>> the qemu log (this time it doesn't even complain about a triple fault).
>>>
>>> Debugging shows that the crash happens in efi_call_phys_epilog().
>>> A sample log from a crashed test run is attached below. It appears that
>>> the crash happens if there is an interrupt at a critical section of the
>>> code.
>>>
>>> While playing with the code, I found a possible fix.
>>>
>>> diff --git a/arch/x86/platform/efi/efi_32.c b/arch/x86/platform/efi/efi_32.c
>>> index 05ca14222463..9959657127f4 100644
>>> --- a/arch/x86/platform/efi/efi_32.c
>>> +++ b/arch/x86/platform/efi/efi_32.c
>>> @@ -85,10 +85,9 @@ pgd_t * __init efi_call_phys_prolog(void)
>>>
>>> void __init efi_call_phys_epilog(pgd_t *save_pgd)
>>> {
>>> + load_fixmap_gdt(0);
>>> load_cr3(save_pgd);
>>> __flush_tlb_all();
>>> -
>>> - load_fixmap_gdt(0);
>>> }
>> We have IRQs on here? It seems plausible that we’re in a window where the EFI pgd doesn’t have cpu_entry_area mapped. Also, the hard coded CPU 0 is suspicious.
> The hard coded CPU 0 was always there. The call is ultimately from
> efi_enter_virtual_mode(), which is called from start_kernel().
> so presumably it is guaranteed to run on CPU 0.
>
>> Maybe try instrumenting the code to check whether the clone_pgd_range calls in setup_percpu.c have happened yet?
> The crash is seen late in the boot process, so I am quite sure it happened,
> but I can add a check if needed. I think that might be a different problem,
> though.
>
>> Your patch may well be correct, but, if we have IRQs on, we should really have cpu_entry_area mapped in both pgds.
>> Or we could turn off IRQs. Why on Earth are IRQs on in a context where the fixmap gdt is unusable?
>
> From arch/x86/platform/efi/efi.c:phys_efi_set_virtual_address_map():
>
> save_pgd = efi_call_phys_prolog();
> local_irq_save(flags);
> status = efi_call_phys(...);
> local_irq_restore(flags);
>
> efi_call_phys_epilog(save_pgd);
>
> So, yes, interrupts are very much enabled.

Does fixing that solve the problem? It seems more robust.

>
> I ran several additional test sequences. With above patch, no failures with
> 500 boots. Without it, failure rate (long term average) across 500 boots
> is around 10%. Another data point: Moving load_fixmap_gdt(0); after
> load_cr3(save_pgd); does not help; it has to come first.
>
> Guenter

2018-09-11 17:42:31

by Jörg Rödel

[permalink] [raw]
Subject: Re: Random crashes with i386 and efi boots

On Tue, Sep 11, 2018 at 09:36:51AM -0700, Andy Lutomirski wrote:
> > save_pgd = efi_call_phys_prolog();
> > local_irq_save(flags);
> > status = efi_call_phys(...);
> > local_irq_restore(flags);
> >
> > efi_call_phys_epilog(save_pgd);
> >
> > So, yes, interrupts are very much enabled.
>
> Does fixing that solve the problem? It seems more robust.

The problem is still that in efi_call_phys_prolog() we load the gdt with
its physical address, and when we reload the %cr3 in _epilog from
initial_page_table to swapper_pg_dir again the gdt is no longer mapped.
Blocking interrupts is more robust, but we can't block NMIs that way
that would also trigger the issue, no?

So I am in favor of changing the order in efi_call_phys_epilog() too.

Regards,

Joerg

2018-09-11 18:00:29

by Guenter Roeck

[permalink] [raw]
Subject: Re: Random crashes with i386 and efi boots

On Tue, Sep 11, 2018 at 07:41:58PM +0200, Joerg Roedel wrote:
> On Tue, Sep 11, 2018 at 09:36:51AM -0700, Andy Lutomirski wrote:
> > > save_pgd = efi_call_phys_prolog();
> > > local_irq_save(flags);
> > > status = efi_call_phys(...);
> > > local_irq_restore(flags);
> > >
> > > efi_call_phys_epilog(save_pgd);
> > >
> > > So, yes, interrupts are very much enabled.
> >
> > Does fixing that solve the problem? It seems more robust.
>
> The problem is still that in efi_call_phys_prolog() we load the gdt with
> its physical address, and when we reload the %cr3 in _epilog from
> initial_page_table to swapper_pg_dir again the gdt is no longer mapped.
> Blocking interrupts is more robust, but we can't block NMIs that way
> that would also trigger the issue, no?
>
> So I am in favor of changing the order in efi_call_phys_epilog() too.
>
Blocking the interrupts across efi_call_phys_prolog() / efi_call_phys() /
efi_call_phys_epilog() does solve the issue, but I am concerned about its
negative effect on 64-bit operation (not sure if blocking interrupts
in the prolog/epilog functions there is ok). Adding another set of
local_irq_save() / local_irq_restore() to efi_call_phys_epilog() seems
wasteful if the problem can be solved by reordering instructions.

So, yes, I also prefer changing the order in efi_call_phys_epilog().
I'll submit a patch along that line.

Thanks,
Guenter

2018-09-11 18:01:31

by Linus Torvalds

[permalink] [raw]
Subject: Re: Random crashes with i386 and efi boots

On Tue, Sep 11, 2018 at 7:42 AM Joerg Roedel <[email protected]> wrote:
>
> So I am in favor of changing the order in efi_call_phys_epilog() too.

Yeah, the only reason the order was changed was that it seemed more
logical to use the new cr3 to load the GDT. But that was clearly
wrong.

So ack on just changing the order, and my bad for suggesting the
change in the first place.

Linus

2018-09-11 18:06:25

by Andy Lutomirski

[permalink] [raw]
Subject: Re: Random crashes with i386 and efi boots



> On Sep 11, 2018, at 10:41 AM, Joerg Roedel <[email protected]> wrote:
>
> On Tue, Sep 11, 2018 at 09:36:51AM -0700, Andy Lutomirski wrote:
>>> save_pgd = efi_call_phys_prolog();
>>> local_irq_save(flags);
>>> status = efi_call_phys(...);
>>> local_irq_restore(flags);
>>>
>>> efi_call_phys_epilog(save_pgd);
>>>
>>> So, yes, interrupts are very much enabled.
>>
>> Does fixing that solve the problem? It seems more robust.
>
> The problem is still that in efi_call_phys_prolog() we load the gdt with
> its physical address, and when we reload the %cr3 in _epilog from
> initial_page_table to swapper_pg_dir again the gdt is no longer mapped.
> Blocking interrupts is more robust, but we can't block NMIs that way
> that would also trigger the issue, no?
>
> So I am in favor of changing the order in efi_call_phys_epilog() too.
>

I’m rather confused here. We’re loading CR3 with page tables that don’t have the fixmap mapped? With interrupts on? And we expect it to work? This is *nuts*.

There are IMO only three sane fixes here:

1. Load the fixmap, cpu_entry_area, etc into the EFI page table. Drop the GDT reload entirely.

2. Do this whole virtual map dance earlier so we don’t have IRQs and NMIs and such. Maybe while we’re still using the initial page table?

3. Just identity map all the EFI regions. Make EFI page tables that literally map them at their physical addresses *and* map the entire kernel, just like we do for normal user mms.

Sure, as a stopgap, turning off IRQs and applying Guenter’s patch seems okay, but this code is not okay.

2018-09-11 18:24:06

by Guenter Roeck

[permalink] [raw]
Subject: Re: Random crashes with i386 and efi boots

On Tue, Sep 11, 2018 at 11:05:25AM -0700, Andy Lutomirski wrote:
>
>
> > On Sep 11, 2018, at 10:41 AM, Joerg Roedel <[email protected]> wrote:
> >
> > On Tue, Sep 11, 2018 at 09:36:51AM -0700, Andy Lutomirski wrote:
> >>> save_pgd = efi_call_phys_prolog();
> >>> local_irq_save(flags);
> >>> status = efi_call_phys(...);
> >>> local_irq_restore(flags);
> >>>
> >>> efi_call_phys_epilog(save_pgd);
> >>>
> >>> So, yes, interrupts are very much enabled.
> >>
> >> Does fixing that solve the problem? It seems more robust.
> >
> > The problem is still that in efi_call_phys_prolog() we load the gdt with
> > its physical address, and when we reload the %cr3 in _epilog from
> > initial_page_table to swapper_pg_dir again the gdt is no longer mapped.
> > Blocking interrupts is more robust, but we can't block NMIs that way
> > that would also trigger the issue, no?
> >
> > So I am in favor of changing the order in efi_call_phys_epilog() too.
> >
>
> I’m rather confused here. We’re loading CR3 with page tables that don’t have the fixmap mapped? With interrupts on? And we expect it to work? This is *nuts*.
>
> There are IMO only three sane fixes here:
>
> 1. Load the fixmap, cpu_entry_area, etc into the EFI page table. Drop the GDT reload entirely.
>
> 2. Do this whole virtual map dance earlier so we don’t have IRQs and NMIs and such. Maybe while we’re still using the initial page table?
>
> 3. Just identity map all the EFI regions. Make EFI page tables that literally map them at their physical addresses *and* map the entire kernel, just like we do for normal user mms.
>
> Sure, as a stopgap, turning off IRQs and applying Guenter’s patch seems okay, but this code is not okay.

I submitted a patch with the diff I suggested above; it seems to be the
least invasive solution and addresses the immediate problem.

I am way out of league regarding the other suggested changes. I'll be happy
to test the code if someone is willing to rearrange the code accordingly,
but I don't think it would make sense to even try doing it myself.

Thanks,
Guenter