LinuxLists.cc - [BUG] x86/efi: MMRs no longer properly mapped after switch to isolated page table

2016-04-27 15:52:12

Subject: [BUG] x86/efi: MMRs no longer properly mapped after switch to isolated page table

Hey guys,

We've hit an issue that we haven't seen before on recent community
kernels. Hoping that you can provide some insight.

Late last week I hit this bad paging request in uv_system_init on one of
our small UV systems:

8<---
[ 0.355998] UV: Found UV2000/3000 hub
[ 0.360088] BUG: unable to handle kernel paging request at ffff8800fb600000
[ 0.367882] IP: [<ffffffff81b6e906>] uv_system_init+0x1f8/0x1051
[ 0.374604] PGD 1f83067 PUD 0
[ 0.378030] Oops: 0000 [#1] SMP
[ 0.381652] Modules linked in:
[ 0.385069] CPU: 0 PID: 1 Comm: swapper/0 Not tainted 4.6.0-rc3-community-4.6-rc2+ #454
[ 0.394003] Hardware name: SGI UV3000/UV3000, BIOS SGI UV 3000 series BIOS 01/15/2015
[ 0.402743] task: ffff88086ee18000 ti: ffff88086ee1c000 task.ti: ffff88086ee1c000
[ 0.411097] RIP: 0010:[<ffffffff81b6e906>] [<ffffffff81b6e906>] uv_system_init+0x1f8/0x1051
[ 0.420530] RSP: 0000:ffff88086ee1fdb0 EFLAGS: 00010286
[ 0.426458] RAX: ffff8800fb600000 RBX: 0000000000000000 RCX: 000000000000c780
[ 0.434421] RDX: 00000000fa000000 RSI: 0000000000000246 RDI: 0000000000000246
[ 0.442385] RBP: ffff88086ee1fe48 R08: 00000000fffffffe R09: 0000000000000000
[ 0.450348] R10: 0000000000000005 R11: 0000000000000146 R12: 000000000000c780
[ 0.458312] R13: 000000000000a0f0 R14: 0000000000000037 R15: 0000000000000038
[ 0.466277] FS: 0000000000000000(0000) GS:ffff880870c00000(0000) knlGS:0000000000000000
[ 0.475307] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 0.481718] CR2: ffff8800fb600000 CR3: 0000000001a06000 CR4: 00000000001406f0
[ 0.489682] Stack:
[ 0.491924] 0000000000000037 0000000000000038 ffff88086ee1fdd0 ffffffff810bef29
[ 0.500219] ffff88086ee1fe28 ffffffff8116426d ffff880800000010 ffff88086ee1fe38
[ 0.508517] ffff88086ee1fdf8 ffffffff8116426d 0000000000000002 0000000000000001
[ 0.516810] Call Trace:
[ 0.519542] [<ffffffff810bef29>] ? vprintk_default+0x29/0x40
[ 0.525957] [<ffffffff8116426d>] ? printk+0x4d/0x4f
[ 0.531497] [<ffffffff8116426d>] ? printk+0x4d/0x4f
[ 0.537041] [<ffffffff81b6a2d2>] native_smp_prepare_cpus+0x299/0x2e4
[ 0.544232] [<ffffffff81b5b19b>] kernel_init_freeable+0xc3/0x220
[ 0.551035] [<ffffffff815b236e>] kernel_init+0xe/0x110
[ 0.556869] [<ffffffff815be102>] ret_from_fork+0x22/0x40
[ 0.562893] [<ffffffff815b2360>] ? rest_init+0x80/0x80
[ 0.568723] Code: 65 48 03 05 1d b8 49 7e 80 78 14 02 ba 00 00 00 fa 76 0b 48 89 c8 65 48 03 05 07 b8 49 7e 48 b8 00 00 60 01 00 88 ff ff 48 09 d0 <48> 8b 00 88 c3 48 c1 e8 06 41 bd 01 00 00 00 88 c1 83 e3 3f 4c
[ 0.590530] RIP [<ffffffff81b6e906>] uv_system_init+0x1f8/0x1051
[ 0.597341] RSP <ffff88086ee1fdb0>
[ 0.601230] CR2: ffff8800fb600000
[ 0.604933] ---[ end trace 06918ff61c5d4cab ]---
[ 0.610091] Kernel panic - not syncing: Attempted to kill init! exitcode=0x00000009
--->8

A bit of digging will tell us that this is the failing line:

m_n_config.v = uv_read_local_mmr(UVH_RH_GAM_CONFIG_MMR );

A bisect (with a bit of extra digging into a merge commit) led me to
commit 67a9108ed4313 ("x86/efi: Build our own page table structures") as
being the first bad commit.

After a bit of poking around, I found that the part of this change that
is actually breaking things for us is the fact that the EFI memory
regions are no longer being mapped into the trampoline_pgd in
__map_region.

Just as a quick test, I made this small change and it got my UV to boot:

8<---
diff --git a/arch/x86/platform/efi/efi_64.c b/arch/x86/platform/efi/efi_64.c
index 49e4dd4..15bf5df 100644
--- a/arch/x86/platform/efi/efi_64.c
+++ b/arch/x86/platform/efi/efi_64.c
@@ -296,6 +296,7 @@ static void __init __map_region(efi_memory_desc_t *md, u64 va)
{
unsigned long flags = _PAGE_RW;
unsigned long pfn;
+ pgd_t *tramp_pgd = (pgd_t *)__va(real_mode_header->trampoline_pgd);
pgd_t *pgd = efi_pgd;

if (!(md->attribute & EFI_MEMORY_WB))
@@ -305,6 +306,10 @@ static void __init __map_region(efi_memory_desc_t *md, u64 va)
if (kernel_map_pages_in_pgd(pgd, pfn, va, md->num_pages, flags))
pr_warn("Error mapping PA 0x%llx -> VA 0x%llx!\n",
md->phys_addr, va);
+
+ if (kernel_map_pages_in_pgd(tramp_pgd, pfn, va, md->num_pages, flags))
+ pr_warn("Error mapping in trampoline_pgd PA 0x%llx -> VA 0x%llx!\n",
+ md->phys_addr, va);
}

void __init efi_map_region(efi_memory_desc_t *md)
--->8

>From what I know, this works because the first PGD entry (the one
containing the identity mappings) of the trampoline_pgd is shared with
the main kernel PGD (init_level4_pgt), so when __map_region maps this
stuff into the trampoline_pgd, we get it in the kernel PGD as well.

The way I understand it, the motivation behind this change is to isolate
the EFI runtime services code/data from the regular kernel page tables,
but it appears that we've isolated a bit more than that. I do
understand that if we were doing an actual EFI callback, we'd switch
over to the efi_pgd, where we'd have this stuff mapped in, but, until
now, we've been able to read these MMRs using the regular kernel page
table without issues.

Please let me know what you guys think about this issue, and if you have
any suggestions for possible solutions. Any input is greatly
appreciated!

- Alex

2016-04-27 18:23:56

by Alex Thorlton

[permalink] [raw]

Subject: Re: [BUG] x86/efi: MMRs no longer properly mapped after switch to isolated page table

On Wed, Apr 27, 2016 at 10:41:32AM -0500, Alex Thorlton wrote:

Adding Boris to Cc.

> Hey guys,
>
> We've hit an issue that we haven't seen before on recent community
> kernels. Hoping that you can provide some insight.
>
> Late last week I hit this bad paging request in uv_system_init on one of
> our small UV systems:
>
> 8<---
> [ 0.355998] UV: Found UV2000/3000 hub
> [ 0.360088] BUG: unable to handle kernel paging request at ffff8800fb600000
> [ 0.367882] IP: [<ffffffff81b6e906>] uv_system_init+0x1f8/0x1051
> [ 0.374604] PGD 1f83067 PUD 0
> [ 0.378030] Oops: 0000 [#1] SMP
> [ 0.381652] Modules linked in:
> [ 0.385069] CPU: 0 PID: 1 Comm: swapper/0 Not tainted 4.6.0-rc3-community-4.6-rc2+ #454
> [ 0.394003] Hardware name: SGI UV3000/UV3000, BIOS SGI UV 3000 series BIOS 01/15/2015
> [ 0.402743] task: ffff88086ee18000 ti: ffff88086ee1c000 task.ti: ffff88086ee1c000
> [ 0.411097] RIP: 0010:[<ffffffff81b6e906>] [<ffffffff81b6e906>] uv_system_init+0x1f8/0x1051
> [ 0.420530] RSP: 0000:ffff88086ee1fdb0 EFLAGS: 00010286
> [ 0.426458] RAX: ffff8800fb600000 RBX: 0000000000000000 RCX: 000000000000c780
> [ 0.434421] RDX: 00000000fa000000 RSI: 0000000000000246 RDI: 0000000000000246
> [ 0.442385] RBP: ffff88086ee1fe48 R08: 00000000fffffffe R09: 0000000000000000
> [ 0.450348] R10: 0000000000000005 R11: 0000000000000146 R12: 000000000000c780
> [ 0.458312] R13: 000000000000a0f0 R14: 0000000000000037 R15: 0000000000000038
> [ 0.466277] FS: 0000000000000000(0000) GS:ffff880870c00000(0000) knlGS:0000000000000000
> [ 0.475307] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> [ 0.481718] CR2: ffff8800fb600000 CR3: 0000000001a06000 CR4: 00000000001406f0
> [ 0.489682] Stack:
> [ 0.491924] 0000000000000037 0000000000000038 ffff88086ee1fdd0 ffffffff810bef29
> [ 0.500219] ffff88086ee1fe28 ffffffff8116426d ffff880800000010 ffff88086ee1fe38
> [ 0.508517] ffff88086ee1fdf8 ffffffff8116426d 0000000000000002 0000000000000001
> [ 0.516810] Call Trace:
> [ 0.519542] [<ffffffff810bef29>] ? vprintk_default+0x29/0x40
> [ 0.525957] [<ffffffff8116426d>] ? printk+0x4d/0x4f
> [ 0.531497] [<ffffffff8116426d>] ? printk+0x4d/0x4f
> [ 0.537041] [<ffffffff81b6a2d2>] native_smp_prepare_cpus+0x299/0x2e4
> [ 0.544232] [<ffffffff81b5b19b>] kernel_init_freeable+0xc3/0x220
> [ 0.551035] [<ffffffff815b236e>] kernel_init+0xe/0x110
> [ 0.556869] [<ffffffff815be102>] ret_from_fork+0x22/0x40
> [ 0.562893] [<ffffffff815b2360>] ? rest_init+0x80/0x80
> [ 0.568723] Code: 65 48 03 05 1d b8 49 7e 80 78 14 02 ba 00 00 00 fa 76 0b 48 89 c8 65 48 03 05 07 b8 49 7e 48 b8 00 00 60 01 00 88 ff ff 48 09 d0 <48> 8b 00 88 c3 48 c1 e8 06 41 bd 01 00 00 00 88 c1 83 e3 3f 4c
> [ 0.590530] RIP [<ffffffff81b6e906>] uv_system_init+0x1f8/0x1051
> [ 0.597341] RSP <ffff88086ee1fdb0>
> [ 0.601230] CR2: ffff8800fb600000
> [ 0.604933] ---[ end trace 06918ff61c5d4cab ]---
> [ 0.610091] Kernel panic - not syncing: Attempted to kill init! exitcode=0x00000009
> --->8
>
> A bit of digging will tell us that this is the failing line:
>
> m_n_config.v = uv_read_local_mmr(UVH_RH_GAM_CONFIG_MMR );
>
> A bisect (with a bit of extra digging into a merge commit) led me to
> commit 67a9108ed4313 ("x86/efi: Build our own page table structures") as
> being the first bad commit.
>
> After a bit of poking around, I found that the part of this change that
> is actually breaking things for us is the fact that the EFI memory
> regions are no longer being mapped into the trampoline_pgd in
> __map_region.
>
> Just as a quick test, I made this small change and it got my UV to boot:
>
> 8<---
> diff --git a/arch/x86/platform/efi/efi_64.c b/arch/x86/platform/efi/efi_64.c
> index 49e4dd4..15bf5df 100644
> --- a/arch/x86/platform/efi/efi_64.c
> +++ b/arch/x86/platform/efi/efi_64.c
> @@ -296,6 +296,7 @@ static void __init __map_region(efi_memory_desc_t *md, u64 va)
> {
> unsigned long flags = _PAGE_RW;
> unsigned long pfn;
> + pgd_t *tramp_pgd = (pgd_t *)__va(real_mode_header->trampoline_pgd);
> pgd_t *pgd = efi_pgd;
>
> if (!(md->attribute & EFI_MEMORY_WB))
> @@ -305,6 +306,10 @@ static void __init __map_region(efi_memory_desc_t *md, u64 va)
> if (kernel_map_pages_in_pgd(pgd, pfn, va, md->num_pages, flags))
> pr_warn("Error mapping PA 0x%llx -> VA 0x%llx!\n",
> md->phys_addr, va);
> +
> + if (kernel_map_pages_in_pgd(tramp_pgd, pfn, va, md->num_pages, flags))
> + pr_warn("Error mapping in trampoline_pgd PA 0x%llx -> VA 0x%llx!\n",
> + md->phys_addr, va);
> }
>
> void __init efi_map_region(efi_memory_desc_t *md)
> --->8
>
> From what I know, this works because the first PGD entry (the one
> containing the identity mappings) of the trampoline_pgd is shared with
> the main kernel PGD (init_level4_pgt), so when __map_region maps this
> stuff into the trampoline_pgd, we get it in the kernel PGD as well.
>
> The way I understand it, the motivation behind this change is to isolate
> the EFI runtime services code/data from the regular kernel page tables,
> but it appears that we've isolated a bit more than that. I do
> understand that if we were doing an actual EFI callback, we'd switch
> over to the efi_pgd, where we'd have this stuff mapped in, but, until
> now, we've been able to read these MMRs using the regular kernel page
> table without issues.
>
> Please let me know what you guys think about this issue, and if you have
> any suggestions for possible solutions. Any input is greatly
> appreciated!
>
> - Alex

2016-04-27 22:51:32

by Borislav Petkov

[permalink] [raw]

Subject: Re: [BUG] x86/efi: MMRs no longer properly mapped after switch to isolated page table

On Wed, Apr 27, 2016 at 10:41:32AM -0500, Alex Thorlton wrote:
> A bit of digging will tell us that this is the failing line:
>
> m_n_config.v = uv_read_local_mmr(UVH_RH_GAM_CONFIG_MMR );

That looks like

All code
========
0: 65 48 03 05 1d b8 49 add %gs:0x7e49b81d(%rip),%rax # 0x7e49b825
7: 7e
8: 80 78 14 02 cmpb $0x2,0x14(%rax)
c: ba 00 00 00 fa mov $0xfa000000,%edx
11: 76 0b jbe 0x1e
13: 48 89 c8 mov %rcx,%rax
16: 65 48 03 05 07 b8 49 add %gs:0x7e49b807(%rip),%rax # 0x7e49b825
1d: 7e
1e: 48 b8 00 00 60 01 00 movabs $0xffff880001600000,%rax
25: 88 ff ff
28: 48 09 d0 or %rdx,%rax
2b:* 48 8b 00 mov (%rax),%rax <-- trapping instruction
2e: 88 c3 mov %al,%bl
30: 48 c1 e8 06 shr $0x6,%rax
34: 41 bd 01 00 00 00 mov $0x1,%r13d
3a: 88 c1 mov %al,%cl
3c: 83 e3 3f and $0x3f,%ebx

but why does this have anything to do with the EFI pagetable, at all?
The MMRs should be mapped in the normal kernel page table, right?

And your dirty fix of mapping into trampoline_pgd doesn't make any
sense...

How do the MMRs get mapped on that box exactly? And why aren't they
mapped in the normal kernel page table all of a sudden?

/me is confused and goes to bed.

--
Regards/Gruss,
Boris.

SUSE Linux GmbH, GF: Felix Imendörffer, Jane Smithard, Graham Norton, HRB 21284 (AG Nürnberg)
--

2016-04-28 01:41:32

by Alex Thorlton

[permalink] [raw]

Subject: Re: [BUG] x86/efi: MMRs no longer properly mapped after switch to isolated page table

On Thu, Apr 28, 2016 at 12:51:22AM +0200, Borislav Petkov wrote:
> On Wed, Apr 27, 2016 at 10:41:32AM -0500, Alex Thorlton wrote:
> > A bit of digging will tell us that this is the failing line:
> >
> > m_n_config.v = uv_read_local_mmr(UVH_RH_GAM_CONFIG_MMR );
>
> That looks like
>
> All code
> ========
> 0: 65 48 03 05 1d b8 49 add %gs:0x7e49b81d(%rip),%rax # 0x7e49b825
> 7: 7e
> 8: 80 78 14 02 cmpb $0x2,0x14(%rax)
> c: ba 00 00 00 fa mov $0xfa000000,%edx
> 11: 76 0b jbe 0x1e
> 13: 48 89 c8 mov %rcx,%rax
> 16: 65 48 03 05 07 b8 49 add %gs:0x7e49b807(%rip),%rax # 0x7e49b825
> 1d: 7e
> 1e: 48 b8 00 00 60 01 00 movabs $0xffff880001600000,%rax
> 25: 88 ff ff
> 28: 48 09 d0 or %rdx,%rax
> 2b:* 48 8b 00 mov (%rax),%rax <-- trapping instruction
> 2e: 88 c3 mov %al,%bl
> 30: 48 c1 e8 06 shr $0x6,%rax
> 34: 41 bd 01 00 00 00 mov $0x1,%r13d
> 3a: 88 c1 mov %al,%cl
> 3c: 83 e3 3f and $0x3f,%ebx
>
> but why does this have anything to do with the EFI pagetable, at all?

In this particular instance, it's not using the EFI page table - it's
showing that the register isn't mapped into the main kernel page table,
via the bad paging request. The issue isn't that there's something
wrong with the EFI page table, but that something appears to be missing
from the kernel table.

> The MMRs should be mapped in the normal kernel page table, right?

Well, quite a while back, these MMRs got mapped in using
init_extra_mapping_uc in map_low_mmrs. Back then, yes, they were mapped
straight into the kernel page table.

After d2f7cbe7b26a74 ("x86/efi: Runtime services virtual mapping") was
introduced (I'm sure you remember the BIOS issue we had a while back) we
had to fall back to using EFI_OLD_MEMMAP, so for a while we relied on
efi_ioremap.

Eventually we got a BIOS fix that allowed us to start using the new
memmap scheme, at which point we removed the init_extra_mapping_uc()s,
since the efi_map_region code appeared to be doing what we needed.

So, short answer is, they were mapped in and working up until now.

> And your dirty fix of mapping into trampoline_pgd doesn't make any
> sense...

I *think* it does? This one took me a while to figure out - I wrote the
fix at midnight last night, so please inform me if my logic sounds
insane, haha.

In efi_map_region we first do the __map_region for the physical address,
which will get mapped into the lowest PGD entry of the trampoline_pgd.
I believe the lowest PGD entry in the trampoline_pgd is actually the
same as init_mm.pgd + pgd_index(PAGE_OFFSET) (see setup_real_mode), so
you're effectively mapping into the kernel 1:1 space when you map into
pgd_index 0 of the trampoline_pgd.

By adding the mappings to the trampoline_pgd back in, I think I've also
added them back into the kernel.

> How do the MMRs get mapped on that box exactly?

Believe I covered this above. Let me know if you have questions!

> And why aren't they
> mapped in the normal kernel page table all of a sudden?

It looks to me like there might be a bit of a complicated chain of
events that led us to this point. I thought I was close to having this
worked out, but after looking back over it I'm starting to think, "how
was it working in the first place?!"

At this point, considering the fact that my fix definitely gets the UV
to boot, I'm going to stick to my claim that adding the maps back into
the trampoline_pgd fixes at least part of the issue - not saying it's
the right way, but it's an intersting data point.

Despite the fact that my fix might help remedy the situation, I will
admit that my understanding of the mechanics behind said fix could be
flawed :)

I'm going to have to think about this some more tomorrow. Let me know
if you have any ideas!

- Alex

2016-04-28 12:57:22

by Borislav Petkov

[permalink] [raw]

Subject: Re: [BUG] x86/efi: MMRs no longer properly mapped after switch to isolated page table

On Wed, Apr 27, 2016 at 08:41:28PM -0500, Alex Thorlton wrote:
> In this particular instance, it's not using the EFI page table - it's
> showing that the register isn't mapped into the main kernel page table,
> via the bad paging request. The issue isn't that there's something
> wrong with the EFI page table, but that something appears to be missing
> from the kernel table.

So my question stands: you said 67a9108ed4313 is causing this and this
commit creates an EFI-specific page table and that shouldn't have
anything to do with how the MMR stuff is mapped, should it?

> Well, quite a while back, these MMRs got mapped in using
> init_extra_mapping_uc in map_low_mmrs. Back then, yes, they were mapped
> straight into the kernel page table.
>
> After d2f7cbe7b26a74 ("x86/efi: Runtime services virtual mapping") was
> introduced (I'm sure you remember the BIOS issue we had a while back) we
> had to fall back to using EFI_OLD_MEMMAP, so for a while we relied on
> efi_ioremap.

Hmm, but you see my confusion, right? Why is init_extra_mapping_uc()
influenced by the EFI changes? It uses __init_extra_mapping() and it
maps into init_mm's pgd.

> Eventually we got a BIOS fix that allowed us to start using the new
> memmap scheme, at which point we removed the init_extra_mapping_uc()s,
> since the efi_map_region code appeared to be doing what we needed.

Why would you even do that? Why are you even mapping MMRs using EFI
facilities?

I'm more confused today.

So what I see from here is this:

* MMRs and EFI shouldn't have anything in common. Imagine there were an
UV box without EFI (you probably are going to say there's no such thing
but imagine anyway): how are you going to map the MMR space then?

* I think you should restore the old case where you mapped the MMRs
using init_extra_mapping_uc().

And I think

d394f2d9d8e1 ("x86/platform/UV: Remove EFI memmap quirk for UV2+")

was wrong in doing

+ if (is_uv1_hub())
+ map_low_mmrs();

for the simple fact that MMRs mapping and EFI shouldn't depend on one
another.

So the proper fix would be to remove the if- check.

Or am I missing something here and MMRs need EFI?

(However, UV1 apparently works fine without it).

Right?

--
Regards/Gruss,
Boris.

SUSE Linux GmbH, GF: Felix Imendörffer, Jane Smithard, Graham Norton, HRB 21284 (AG Nürnberg)
--

2016-04-29 09:01:23

by Matt Fleming

[permalink] [raw]

Subject: Re: [BUG] x86/efi: MMRs no longer properly mapped after switch to isolated page table

On Wed, 27 Apr, at 10:41:32AM, Alex Thorlton wrote:
>
> From what I know, this works because the first PGD entry (the one
> containing the identity mappings) of the trampoline_pgd is shared with
> the main kernel PGD (init_level4_pgt), so when __map_region maps this
> stuff into the trampoline_pgd, we get it in the kernel PGD as well.

Correct.

> The way I understand it, the motivation behind this change is to isolate
> the EFI runtime services code/data from the regular kernel page tables,

Yep.

> but it appears that we've isolated a bit more than that. I do
> understand that if we were doing an actual EFI callback, we'd switch
> over to the efi_pgd, where we'd have this stuff mapped in, but, until
> now, we've been able to read these MMRs using the regular kernel page
> table without issues.
>
> Please let me know what you guys think about this issue, and if you have
> any suggestions for possible solutions. Any input is greatly
> appreciated!

The issue is that you're relying on the EFI mapping code to map these
regions for you, and that's a bug. These regions have nothing to do
with EFI.

Arguably it's always been a bug. You're not alone though, there were
also other pieces of code piggy-backing on the EFI mapping functions,
and EFI code piggy-backing on the regular kernel mapping functions,
see 452308de6105 ("x86/efi: Fix boot crash by always mapping boot
service regions into new EFI page tables").

I agree with Boris' suggestion that removing the "if (is_uv1_hub())"
check in uv_system_init() is probably the way to go, since it should
always be the responsibility of uv_system_init() to setup these
mappings.

2016-04-29 15:41:25

by Alex Thorlton

[permalink] [raw]

Subject: Re: [BUG] x86/efi: MMRs no longer properly mapped after switch to isolated page table

On Thu, Apr 28, 2016 at 02:57:11PM +0200, Borislav Petkov wrote:
> > After d2f7cbe7b26a74 ("x86/efi: Runtime services virtual mapping") was
> > introduced (I'm sure you remember the BIOS issue we had a while back) we
> > had to fall back to using EFI_OLD_MEMMAP, so for a while we relied on
> > efi_ioremap.
>
> Hmm, but you see my confusion, right? Why is init_extra_mapping_uc()
> influenced by the EFI changes? It uses __init_extra_mapping() and it
> maps into init_mm's pgd.

I think the answer here is that the init_extra_mapping wasn't
necessarily affected by the EFI changes. What happened is that the new
EFI changes in d2f7cbe7b26a74 mapped in the register for us, before
uv_system_init came along and tried to do it with the
init_extra_mapping. When uv_system_init tried to do the extra mappings,
they fail because the register was already mapped by the EFI code.

> > Eventually we got a BIOS fix that allowed us to start using the new
> > memmap scheme, at which point we removed the init_extra_mapping_uc()s,
> > since the efi_map_region code appeared to be doing what we needed.
>
> Why would you even do that? Why are you even mapping MMRs using EFI
> facilities?

I believe this was unintentional behavior. Here's the order that I
think all this got mixed up in (the first step being what was actually
wrong in the first place):

1) From the very beginning, we've had the address space containing those
registers tucked inside and EFI memory descriptor. Initially this
wasn't an issue, I believe because the efi_ioremap was putting them
in ioremap space, so there was no collision when we went to do the
init_extra_mapping.
2) EFI code changed in d2f7cbe7b26a74 to do stuff with efi_map_region,
which then comes in and maps the register into the kernel 1:1 space,
where our init_extra_mapping used to place it.
* This caused our original issue, because the init_extra_mapping
would blow up in uv_system_init when the registers were already
mapped.
* Note that we had one BIOS bug that initally prevented us from using
the new mapping scheme at all. We *thought* we had this cleared
up, but it appears we may have missed something.
3) I removed the init_extra_mapping in d394f2d9d8e1 becausem, as
described above, it was causing a crash during uv_system_init when we
switched over to the new EFI memmap scheme.
* Note that UV1 did not receive a BIOS fix, so it will eternally use
EFI_OLD_MEMMAP.
4) The latest change came along, and pulled the 1:1 mappings out of the
trampoline_pgd (and therefore out of the kernel page table) and here
we are today.

> I'm more confused today.

It took me a about 3 entire days to wrap my head around all this :)

> So what I see from here is this:
>
> * MMRs and EFI shouldn't have anything in common. Imagine there were an
> UV box without EFI (you probably are going to say there's no such thing
> but imagine anyway): how are you going to map the MMR space then?

Right - I'm beginning to see where we went wrong here :)

> * I think you should restore the old case where you mapped the MMRs
> using init_extra_mapping_uc().
>
> And I think
>
> d394f2d9d8e1 ("x86/platform/UV: Remove EFI memmap quirk for UV2+")
>
> was wrong in doing
>
> + if (is_uv1_hub())
> + map_low_mmrs();
>
> for the simple fact that MMRs mapping and EFI shouldn't depend on one
> another.

I agree here.

> So the proper fix would be to remove the if- check.
>
> Or am I missing something here and MMRs need EFI?

I think this is partially correct, but in doing that, we find that we're
still missing something. Watch what happens when I make this small
tweak to my kernel:

8<---
diff --git a/arch/x86/kernel/apic/x2apic_uv_x.c
b/arch/x86/kernel/apic/x2apic_uv_x.c
index 624db005..91ac029 100644
--- a/arch/x86/kernel/apic/x2apic_uv_x.c
+++ b/arch/x86/kernel/apic/x2apic_uv_x.c
@@ -891,7 +891,7 @@ void __init uv_system_init(void)
pr_info("UV: Found %s hub\n", hub);

/* We now only need to map the MMRs on UV1 */
- if (is_uv1_hub())
+ //if (is_uv1_hub())
map_low_mmrs();

m_n_config.v = uv_read_local_mmr(UVH_RH_GAM_CONFIG_MMR );
--->8

Here's the result:

8<---
[ 5.353656] BUG: unable to handle kernel paging request at ffff88006a1ab938
[ 5.361448] IP: [<ffff88006a1ab938>] 0xffff88006a1ab938
[ 5.367290] PGD 1f81067 PUD 87ffff067 PMD 87fff8067 PTE 0
[ 5.373356] Oops: 0010 [#1] SMP
[ 5.376977] Modules linked in:
[ 5.380395] CPU: 0 PID: 1 Comm: swapper/0 Not tainted 4.5.0-rc2-uv4-comm-debug-fix+ #538
[ 5.389428] Hardware name: SGI UV3000/UV3000, BIOS SGI UV 3000 series BIOS 01/15/2015
[ 5.398169] task: ffff880867ec4040 ti: ffff880867ec8000 task.ti: ffff880867ec8000
[ 5.406522] RIP: 0010:[<ffff88006a1ab938>] [<ffff88006a1ab938>] 0xffff88006a1ab938
[ 5.415080] RSP: 0000:ffff880867ecbc88 EFLAGS: 00010086
[ 5.421006] RAX: 0000000000000000 RBX: 0000000000000282 RCX: 0000000000000001
[ 5.428971] RDX: 0000000000000000 RSI: 0000000000000001 RDI: ffff88006a1ab938
[ 5.436935] RBP: ffff880867ecbd58 R08: ffff880867ecbd68 R09: ffff880867ecbd70
[ 5.444900] R10: ffffffffffffffda R11: 000000006a1ab938 R12: 0000000000000000
[ 5.452864] R13: ffffffff81dcf0b8 R14: ffffffff81dcf0c0 R15: ffffffff81dcf0a0
[ 5.460829] FS: 0000000000000000(0000) GS:ffff880878c00000(0000) knlGS:0000000000000000
[ 5.469861] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 5.476274] CR2: ffff88006a1ab938 CR3: 0000000001a0a000 CR4: 00000000001406f0
[ 5.484240] Stack:
[ 5.486483] ffffffff8105d7f8 0000000000000000 0000000000000006 0000000000000006
[ 5.494777] 000000000000001e 0000000000000000 0000000000000000 ffff880867ecbd38
[ 5.503074] 0000000080050033 0000000000000000 0000000000000000 0000000000000000
[ 5.511368] Call Trace:
[ 5.514098] [<ffffffff8105d7f8>] ? efi_call+0x58/0x90
[ 5.519834] [<ffffffff8106033d>] ? uv_bios_call_irqsave+0x5d/0x80
[ 5.526733] [<ffffffff810603a0>] uv_bios_get_sn_info+0x40/0xb0
[ 5.533344] [<ffffffff81b6f824>] uv_system_init+0x772/0x104d
[ 5.539751] [<ffffffff810bd479>] ? vprintk_default+0x29/0x40
[ 5.546159] [<ffffffff81161cf8>] ? printk+0x4d/0x4f
[ 5.551692] [<ffffffff81b6ac75>] native_smp_prepare_cpus+0x299/0x2e4
[ 5.558884] [<ffffffff81b5c18e>] kernel_init_freeable+0xc3/0x21b
[ 5.565680] [<ffffffff815acd00>] ? rest_init+0x80/0x80
[ 5.571502] [<ffffffff815acd0e>] kernel_init+0xe/0xf0
[ 5.577238] [<ffffffff815b87cf>] ret_from_fork+0x3f/0x70
[ 5.583264] [<ffffffff815acd00>] ? rest_init+0x80/0x80
[ 5.589093] Code: Bad RIP value.
[ 5.592812] RIP [<ffff88006a1ab938>] 0xffff88006a1ab938
[ 5.598748] RSP <ffff880867ecbc88>
[ 5.602638] CR2: ffff88006a1ab938
[ 5.606339] ---[ end trace 3abaacb020c74a50 ]---
[ 5.611487] Kernel panic - not syncing: Attempted to kill init! exitcode=0x00000009
--->8

You can see here that we've made it past the MMR read in uv_system_init,
but we die inside of our first EFI callback. In this example, it looks
like we're using the kernel page table at the time of the failure, and I
believe that the failing address is somewhere in our EFI runtime code:

[ 0.803396] efi: ATHORLTON EFI md dump:
[ 0.803396] type: 5
[ 0.803396] pad: 0
[ 0.803396] phys_addr: 6a0a6000
[ 0.803396] virt_addr: 0
[ 0.803396] num_pages: 184
[ 0.803396] attribute: 800000000000000f

So it looks like we're trying to read from EFI runtime space while using
the kernel page table, which fails, presumably because the space is also
not mapped into the kernel page table. While I understand *why* it
fails, and why the address isn't mapped, I don't fully know how we
should handle fixing it.

It would appear after a good amount of investigation that we've been
skating by on some (perhaps not so subtle) undefined/unintentional
behvaior inside the EFI memory mapping code.

I will need to talk with our BIOS team about this to see what they
think. Any input you guys might have on how we might remedy this
situation would be really valuable. I think we have a good
understanding of what's going wrong here, but now we need to work out a
way to fix it.

Thanks for you help so far!

- Alex

2016-04-29 15:45:20

by Alex Thorlton

[permalink] [raw]

Subject: Re: [BUG] x86/efi: MMRs no longer properly mapped after switch to isolated page table

On Fri, Apr 29, 2016 at 10:01:15AM +0100, Matt Fleming wrote:
> On Wed, 27 Apr, at 10:41:32AM, Alex Thorlton wrote:
> >
> > From what I know, this works because the first PGD entry (the one
> > containing the identity mappings) of the trampoline_pgd is shared with
> > the main kernel PGD (init_level4_pgt), so when __map_region maps this
> > stuff into the trampoline_pgd, we get it in the kernel PGD as well.
>
> Correct.
>
> > The way I understand it, the motivation behind this change is to isolate
> > the EFI runtime services code/data from the regular kernel page tables,
>
> Yep.

Thanks for confirming my suspicions. This one was really bothering me.

I'd rather have broken code and know why it's broken, than have working
code and not know why it works :)

> > but it appears that we've isolated a bit more than that. I do
> > understand that if we were doing an actual EFI callback, we'd switch
> > over to the efi_pgd, where we'd have this stuff mapped in, but, until
> > now, we've been able to read these MMRs using the regular kernel page
> > table without issues.
> >
> > Please let me know what you guys think about this issue, and if you have
> > any suggestions for possible solutions. Any input is greatly
> > appreciated!
>
> The issue is that you're relying on the EFI mapping code to map these
> regions for you, and that's a bug. These regions have nothing to do
> with EFI.

Understood, and agreed.

> Arguably it's always been a bug. You're not alone though, there were
> also other pieces of code piggy-backing on the EFI mapping functions,
> and EFI code piggy-backing on the regular kernel mapping functions,
> see 452308de6105 ("x86/efi: Fix boot crash by always mapping boot
> service regions into new EFI page tables").
>
> I agree with Boris' suggestion that removing the "if (is_uv1_hub())"
> check in uv_system_init() is probably the way to go, since it should
> always be the responsibility of uv_system_init() to setup these
> mappings.

I have tried this, but there are some issues there as well (see my
response to Boris's last message). I think we're getting closer to
having this all figured out, but there are still some pieces missing
from the puzzle.

Thanks for the input, Matt. Please let me know if you have any ideas
about the remaining issue that I described!

- Alex

2016-04-30 22:12:15

by Matt Fleming

[permalink] [raw]

Subject: Re: [BUG] x86/efi: MMRs no longer properly mapped after switch to isolated page table

On Fri, 29 Apr, at 10:41:19AM, Alex Thorlton wrote:
>
> You can see here that we've made it past the MMR read in uv_system_init,
> but we die inside of our first EFI callback. In this example, it looks
> like we're using the kernel page table at the time of the failure, and I
> believe that the failing address is somewhere in our EFI runtime code:
>
> [ 0.803396] efi: ATHORLTON EFI md dump:
> [ 0.803396] type: 5
> [ 0.803396] pad: 0
> [ 0.803396] phys_addr: 6a0a6000
> [ 0.803396] virt_addr: 0
> [ 0.803396] num_pages: 184
> [ 0.803396] attribute: 800000000000000f
>
> So it looks like we're trying to read from EFI runtime space while using
> the kernel page table, which fails, presumably because the space is also
> not mapped into the kernel page table. While I understand *why* it
> fails, and why the address isn't mapped, I don't fully know how we
> should handle fixing it.

How come you're not using the new EFI page tables while making EFI
runtime calls?

Who owns the MMR space and what is it used for? Do both the kernel and
the firmware need access to it? My SGI UV knowledge is zero, so I'm
happy to be educated! I can't think of any analogous memory regions on
x86 where the EFI services require the kernel to map them, other than
the EFI regions themselves.

Runtime EFI regions should be opaque, isolated and self-contained.
This is why there are two phases of execution for firmware; before and
after ExitBootServices(). Once the kernel takes control after
ExitBootServices() firmware can no longer provide timer services, for
example, and doesn't care where the kernel maps the LAPIC because it
never tries to access it.

The fact that the UV firmware does care where the MMR space is mapped
makes me suspect that there's a lack of isolation.