2012-06-20 08:25:01

by Robin Holt

Subject: [PATCH] phys_efi_set_virtual_address_map needs va, no pa.

The kernel allocated memmap may end up being beyond the first
512GB of memory. That early range is identity mapped, while the
remainder of memory is not. The net result is the memmap allocated by
efi_enter_virtual_mode will not be accessible via its __pa as is currently
passed back to EFI.

Since EFI is going to have to parse the passed in table, I believe the
EFI documentation is wrong.

I asked one of our BIOS engineers to look at the Intel reference code
and he said it was obvious that the address would have to be a virtually
accessible address as we are in virtual mode while EFI is handling the
callback.

Signed-off-by: Robin Holt <[email protected]>
Cc: Matthew Garrett <[email protected]>
Cc: H. Peter Anvin <[email protected]>
---
arch/x86/platform/efi/efi.c | 2 +-
1 files changed, 1 insertions(+), 1 deletions(-)

diff --git a/arch/x86/platform/efi/efi.c b/arch/x86/platform/efi/efi.c
index 92660ed..ea4317a 100644
--- a/arch/x86/platform/efi/efi.c
+++ b/arch/x86/platform/efi/efi.c
@@ -869,7 +869,7 @@ void __init efi_enter_virtual_mode(void)
 		memmap.desc_size * count,
 		memmap.desc_size,
 		memmap.desc_version,
-		(efi_memory_desc_t *)__pa(new_memmap));
+		new_memmap);

 	if (status != EFI_SUCCESS) {
 		pr_alert("Unable to switch EFI into virtual mode "
--
1.7.0.4
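
As a rough illustration of the 512GB boundary described in the patch text: with x86-64 4-level paging, one top-level (PGD) entry spans 2^39 bytes, which is exactly 512GB. The sketch below assumes the early identity map covers only that first entry; the helper names are hypothetical and are not kernel functions.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* x86-64 4-level paging: one top-level (PGD) entry spans 2^39 bytes = 512GB. */
#define PGDIR_SHIFT  39
#define PTRS_PER_PGD 512

/* Index of the top-level entry that translates this address. */
unsigned pgd_index_of(uint64_t addr)
{
    return (unsigned)((addr >> PGDIR_SHIFT) & (PTRS_PER_PGD - 1));
}

/*
 * Hypothetical helper: true if [phys, phys + len) lies entirely inside
 * the first 512GB, i.e. inside the range assumed to be identity mapped.
 */
bool reachable_via_identity_map(uint64_t phys, uint64_t len)
{
    const uint64_t limit = UINT64_C(1) << PGDIR_SHIFT;

    assert(len > 0);
    /* Written this way so that phys + len cannot overflow. */
    return len <= limit && phys <= limit - len;
}
```

A memmap buffer whose physical address fails this check is one that cannot be reached through its __pa under the assumption above.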


2012-06-20 12:07:09

by Matthew Garrett

Subject: Re: [PATCH] phys_efi_set_virtual_address_map needs va, no pa.

On Wed, Jun 20, 2012 at 03:24:57AM -0500, Robin Holt wrote:
> The kernel allocated memmap may end up being beyond the first
> 512GB of memory. That early range is identity mapped, while the
> remainder of memory is not. The net result is the memmap allocated by
> efi_enter_virtual_mode will not be accessible via its __pa as is currently
> passed back to EFI.
>
> Since EFI is going to have to parse the passed in table, I believe the
> EFI documentation is wrong.
>
> I asked one of our BIOS engineers to look at the Intel reference code
> and he said it was obvious that the address would have to be a virtually
> accessible address as we are in virtual mode while EFI is handling the
> callback.

No, that's completely wrong. UEFI can't be called in virtual mode until
*after* SetVirtualAddressMap(). The UEFI spec indicates that all
physical memory must have an identity mapping at this stage (section
2.3.4), so if we don't then that's a bug that needs to be fixed.
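
To make the requirement concrete, here is a minimal sketch that walks a memory map the way it is handed to SetVirtualAddressMap() (entries desc_size bytes apart) and reports the end of the highest runtime region. The struct is a simplified stand-in modelled on the kernel's efi_memory_desc_t, and runtime_phys_end() is a hypothetical helper, not an existing kernel or firmware function.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define EFI_PAGE_SHIFT     12
#define EFI_MEMORY_RUNTIME UINT64_C(0x8000000000000000)

/* Simplified stand-in for the kernel's efi_memory_desc_t. */
struct mem_desc {
    uint32_t type;
    uint32_t pad;
    uint64_t phys_addr;
    uint64_t virt_addr;
    uint64_t num_pages;
    uint64_t attribute;
};

/*
 * Return the physical end address (exclusive) of the highest region
 * marked EFI_MEMORY_RUNTIME, or 0 if there is none. Entries are
 * desc_size bytes apart, which may exceed sizeof(struct mem_desc).
 */
uint64_t runtime_phys_end(const void *map, size_t map_size, size_t desc_size)
{
    uint64_t end = 0;

    assert(desc_size >= sizeof(struct mem_desc));
    for (size_t off = 0; off + desc_size <= map_size; off += desc_size) {
        const struct mem_desc *md =
            (const struct mem_desc *)((const char *)map + off);
        uint64_t md_end;

        if (!(md->attribute & EFI_MEMORY_RUNTIME))
            continue;
        md_end = md->phys_addr + (md->num_pages << EFI_PAGE_SHIFT);
        if (md_end > end)
            end = md_end;
    }
    return end;
}
```

Everything below the returned address, plus the buffer holding the map itself, would have to be covered by the identity mapping at the time of the call.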

--
Matthew Garrett | [email protected]

2012-06-20 20:42:03

by H. Peter Anvin

Subject: Re: [PATCH] phys_efi_set_virtual_address_map needs va, no pa.

On 06/20/2012 05:07 AM, Matthew Garrett wrote:
>
> No, that's completely wrong. UEFI can't be called in virtual mode until
> *after* SetVirtualAddressMap(). The UEFI spec indicates that all
> physical memory must have an identity mapping at this stage (section
> 2.3.4), so if we don't then that's a bug that needs to be fixed.
>

I think it is a bug, and with the trampoline work in 3.4 we should
finally have a proper platform to fix it.

In particular, we should keep a full 1:1 page map around, and it should
be the one that is in the trampoline (real_mode_header->trampoline_pgd)
as we need the page directory to be 32-bit addressable.

The right thing to do is to sync the pgds in the 1:1 area, both for 64
bit and for legacy 32 bit (PAE 32 bit doesn't need it, since all the
kernel maps are shared.) This is currently done ad hoc (and
differently!) on both 32 and 64 bits and that really should be fixed.

Once that is properly fixed, we have a usable identity mapping.

On that subject, I have been thinking about the kexec use case. I'm
thinking that if we indeed can use neither physical mode nor a
zero-offset virtual mode, then the sanest thing to do is probably to
use a fixed offset of 2^46 and still use a (pseudo-)1:1 map.
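
A minimal sketch of what such a fixed-offset map would mean arithmetically, assuming 48-bit virtual addresses so that anything below 2^47 stays in the lower canonical half. The macro and function names are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical name for the proposed fixed offset of 2^46. */
#define EFI_FIXED_OFFSET (UINT64_C(1) << 46)

/*
 * Pseudo-1:1 translation: virtual = physical + 2^46.
 * Physical addresses below 2^46 yield virtual addresses below 2^47.
 */
uint64_t efi_phys_to_virt(uint64_t phys)
{
    assert(phys < EFI_FIXED_OFFSET);
    return phys + EFI_FIXED_OFFSET;
}

/* Inverse of efi_phys_to_virt(). */
uint64_t efi_virt_to_phys(uint64_t virt)
{
    assert(virt >= EFI_FIXED_OFFSET);
    return virt - EFI_FIXED_OFFSET;
}
```

Because the offset is a constant, the same virtual addresses could be handed to the firmware on every boot, which is what would matter for kexec.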

Do we have any data at all on machines that supposedly can't use
identity-mapped EFI?

-hpa

2012-06-21 00:27:45

by Robin Holt

Subject: Re: [PATCH] phys_efi_set_virtual_address_map needs va, no pa.

On Wed, Jun 20, 2012 at 01:41:54PM -0700, H. Peter Anvin wrote:
> On 06/20/2012 05:07 AM, Matthew Garrett wrote:
> >
> > No, that's completely wrong. UEFI can't be called in virtual mode until
> > *after* SetVirtualAddressMap(). The UEFI spec indicates that all
> > physical memory must have an identity mapping at this stage (section
> > 2.3.4), so if we don't then that's a bug that needs to be fixed.
> >
>
> I think it is a bug, and with the trampoline work in 3.4 we should
> finally have a proper platform to fix it.
>
> In particular, we should keep a full 1:1 page map around, and it should
> be the one that is in the trampoline (real_mode_header->trampoline_pgd)
> as we need the page directory to be 32-bit addressable.
>
> The right thing to do is to sync the pgds in the 1:1 area, both for 64
> bit and for legacy 32 bit (PAE 32 bit doesn't need it, since all the
> kernel maps are shared.) This is currently done ad hoc (and
> differently!) on both 32 and 64 bits and that really should be fixed.

What do you need from me? If you want me to help with this, I have a
_WHOLE_ lot of learning to do. Can you give me any pointers?

We are trying to get this finally fixed. We have had work-around code
in SLES11 SP1, SLES11 SP2, and RHEL 6.x. I would love to get this fixed
for future distro snaps.

Thanks,
Robin

2012-06-21 00:46:51

by H. Peter Anvin

Subject: Re: [PATCH] phys_efi_set_virtual_address_map needs va, no pa.

On 06/20/2012 05:27 PM, Robin Holt wrote:
>
> What do you need from me? If you want me to help with this, I have a
> _WHOLE_ lot of learning to do. Can you give me any pointers?
>
> We are trying to get this finally fixed. We have had work-around code
> in SLES11 SP1, SLES11 SP2, and RHEL 6.x. I would love to get this fixed
> for future distro snaps.
>

If you want to tackle it, the task is basically that when we modify the
pgds in 32-bit legacy (non-PAE) mode, we should make the corresponding
modifications to initial_page_table, and in 64-bit mode to
real_mode_header->trampoline_pgd. It might be worthwhile to introduce a
common pointer for both, obviously.

This is currently handled via something called the pgd_list (when we
update the top level kernel address space we walk pgd_list and update
them all), but there are two issues:

1. Obviously, in the case of the 1:1 map, we don't just need to maintain
the kernel area, but the "user space" part of the address space should
contain a copy, as well.

2. To complicate things, there is code in there to grab an mm lock for
the benefit of Xen. The 1:1 map doesn't have an mm associated with it,
so I'm not quite sure how that is to be handled. Perhaps Xen just plain
won't need it and we can just bypass it, but I have no bloody idea.
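
To illustrate the pgd_list walk in the abstract, here is a toy model, not the kernel's actual code: each table is an array of 512 top-level entries, tables are chained in a list, and a changed kernel-half entry is copied into every table on the list. The type and function names are made up, locking is omitted, and the kernel half is simplified to start at slot 256.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define PTRS_PER_PGD     512
#define UPPER_HALF_START 256   /* simplification: kernel half of the table */

/* Toy top-level page table, chained into a list like pgd_list. */
struct toy_pgd {
    uint64_t entry[PTRS_PER_PGD];
    struct toy_pgd *next;
};

/* Copy one kernel-half entry from the reference table into every listed table. */
void sync_kernel_entry(struct toy_pgd *list, const struct toy_pgd *ref,
                       unsigned idx)
{
    assert(idx >= UPPER_HALF_START && idx < PTRS_PER_PGD);
    for (struct toy_pgd *p = list; p != NULL; p = p->next)
        p->entry[idx] = ref->entry[idx];
}
```

A table that is not on the list never sees the update, and nothing here touches the lower ("user space") half; those are the gaps behind issue 1 above.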

It is also a bit "cute" how we seem to make a function call to indirect
through a pointer (why on Earth is pgd_page_get_mm() not an inline?!),
and then grab a lock unconditionally, regardless of if we are affected
by Xen or not.

-hpa

2012-06-21 16:52:23

by Robin Holt

Subject: Re: [PATCH] phys_efi_set_virtual_address_map needs va, no pa.

On Wed, Jun 20, 2012 at 05:46:49PM -0700, H. Peter Anvin wrote:
> On 06/20/2012 05:27 PM, Robin Holt wrote:
> >
> > What do you need from me? If you want me to help with this, I have a
> > _WHOLE_ lot of learning to do. Can you give me any pointers?
> >
> > We are trying to get this finally fixed. We have had work-around code
> > in SLES11 SP1, SLES11 SP2, and RHEL 6.x. I would love to get this fixed
> > for future distro snaps.
> >
>
> If you want to tackle it, the task is basically that when we modify the
> pgds in 32-bit legacy (non-PAE) mode, we should make the corresponding
> modifications to initial_page_table, and in 64-bit mode to
> real_mode_header->trampoline_pgd. It might be worthwhile to introduce a
> common pointer for both, obviously.

I am completely lost as to what should be done. How do we know
which identity maps need to be created? Do we just add them as we are
scanning the e820/EFI memory maps and include the reserved, etc ranges?
Do we look at the table handed to us by EFI at the beginning of boot and
use that as the basis? Or do we simply wait until the kernel's memory
initialization is complete and cover all of physical memory from zero
up to the highest physical address?

> This is currently handled via something called the pgd_list (when we
> update the top level kernel address space we walk pgd_list and update
> them all), but there are two issues:
>
> 1. Obviously, in the case of the 1:1 map, we don't just need to maintain
> the kernel area, but the "user space" part of the address space should
> contain a copy, as well.
>
> 2. To complicate things, there is code in there to grab an mm lock for
> the benefit of Xen. The 1:1 map doesn't have an mm associated with it,
> so I'm not quite sure how that is to be handled. Perhaps Xen just plain
> won't need it and we can just bypass it, but I have no bloody idea.
>
> It is also a bit "cute" how we seem to make a function call to indirect
> through a pointer (why on Earth is pgd_page_get_mm() not an inline?!),
> and then grab a lock unconditionally, regardless of if we are affected
> by Xen or not.
>
> -hpa

2012-06-21 19:23:52

by Konrad Rzeszutek Wilk

Subject: Re: [PATCH] phys_efi_set_virtual_address_map needs va, no pa.

On Wed, Jun 20, 2012 at 05:46:49PM -0700, H. Peter Anvin wrote:
> On 06/20/2012 05:27 PM, Robin Holt wrote:
> >
> > What do you need from me? If you want me to help with this, I have a
> > _WHOLE_ lot of learning to do. Can you give me any pointers?
> >
> > We are trying to get this finally fixed. We have had work-around code
> > in SLES11 SP1, SLES11 SP2, and RHEL 6.x. I would love to get this fixed
> > for future distro snaps.
> >
>
> If you want to tackle it, the task is basically that when we modify the
> pgds in 32-bit legacy (non-PAE) mode, we should make the corresponding
> modifications to initial_page_table, and in 64-bit mode to
> real_mode_header->trampoline_pgd. It might be worthwhile to introduce a
> common pointer for both, obviously.
>
> This is currently handled via something called the pgd_list (when we
> update the top level kernel address space we walk pgd_list and update
> them all), but there are two issues:
>
> 1. Obviously, in the case of the 1:1 map, we don't just need to maintain
> the kernel area, but the "user space" part of the address space should
> contain a copy, as well.
>
> 2. To complicate things, there is code in there to grab an mm lock for
> the benefit of Xen. The 1:1 map doesn't have an mm associated with it,
> so I'm not quite sure how that is to be handled. Perhaps Xen just plain
> won't need it and we can just bypass it, but I have no bloody idea.

You mean this?

a79e53d8 (Andrea Arcangeli 2011-02-16 15:45:22 -0800 127) spin_lock(&pgd_lock);
4f76cd38 (Jeremy Fitzhardinge 2008-03-17 16:36:55 -0700 128) pgd_list_del(pgd);
a79e53d8 (Andrea Arcangeli 2011-02-16 15:45:22 -0800 129) spin_unlock(&pgd_lock);

which says:
x86/mm: Fix pgd_lock deadlock

It's forbidden to take the page_table_lock with the irq disabled
or if there's contention the IPIs (for tlb flushes) sent with
the page_table_lock held will never run leading to a deadlock.

Nobody takes the pgd_lock from irq context so the _irqsave can be
removed.

Looking before that git commit I see Jeremy's 4f76cd38 unification of
the 32-bit and 64-bit pgtable, and before that:

1da177e4c3f41524e886b7f1b8a0c1fc7321cac2
Linux-2.6.12-rc2

I am not really convinced that the lock was put there for Xen, as the
git history seems to point to, well, way ancient stuff.

Or are you referring to something else?

>
> It is also a bit "cute" how we seem to make a function call to indirect
> through a pointer (why on Earth is pgd_page_get_mm() not an inline?!),
> and then grab a lock unconditionally, regardless of if we are affected
> by Xen or not.
>
> -hpa

2012-06-22 00:36:06

by H. Peter Anvin

Subject: Re: [PATCH] phys_efi_set_virtual_address_map needs va, no pa.

On 06/21/2012 09:52 AM, Robin Holt wrote:
>
> I am completely lost as to what should be done. How do we know
> which identity maps need to be created? Do we just add them as we are
> scanning the e820/EFI memory maps and include the reserved, etc ranges?
> Do we look at the table handed to us by EFI at the beginning of boot and
> use that as the basis? Or do we simply wait until the kernel's memory
> initialization is complete and cover all of physical memory from zero
> up to the highest physical address?
>

Robin, we already create the 1:1 maps. Right now there is some
weirdness with some of the issues that you mention, but that is
orthogonal to this.

The 1:1 map created for the kernel is created at a specific offset,
__PAGE_OFFSET, and is propagated into every vm context created by the
kernel. There are two problems:

1. The "initial" (32 bit) or "trampoline" (64 bit) maps aren't on the
list of vm contexts created by the kernel (pgd_list), so they never get
updated after a particular point in the boot.

2. The initial/trampoline maps need these mappings not just at address
__PAGE_OFFSET, but also at address zero (identity mapping), which means
that just adding it to the pgd_list is insufficient.

Note that i386-PAE is unaffected, simply because the contents of the top
(3rd) level is always fixed.
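
A rough sketch of problem 2 for the 64-bit case, using plain arrays as stand-ins for top-level page tables. It assumes a direct-map base of 0xffff880000000000, which is aligned to a 512GB top-level slot, so the same entry can be reused at slot i to map the same physical range at address i * 512GB. All names are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

#define PTRS_PER_PGD    512
#define PGDIR_SHIFT     39
/* Assumed direct-map base (__PAGE_OFFSET); aligned to one 512GB slot. */
#define TOY_PAGE_OFFSET UINT64_C(0xffff880000000000)

/* Index of the top-level entry that translates this address. */
unsigned toy_pgd_index(uint64_t addr)
{
    return (unsigned)((addr >> PGDIR_SHIFT) & (PTRS_PER_PGD - 1));
}

/*
 * Mirror the first n direct-map entries of the kernel table into the
 * trampoline table twice: at the direct-map slots and at slots 0..n-1,
 * so the same physical memory is also visible at its physical address.
 */
void sync_identity_pgd(uint64_t *tramp, const uint64_t *kernel, unsigned n)
{
    unsigned base = toy_pgd_index(TOY_PAGE_OFFSET);

    assert(base + n <= PTRS_PER_PGD);
    for (unsigned i = 0; i < n; i++) {
        tramp[base + i] = kernel[base + i];   /* copy at __PAGE_OFFSET */
        tramp[i] = kernel[base + i];          /* identity alias at zero */
    }
}
```

Adding the trampoline table to pgd_list would only give the first of the two stores; the second one is the extra step the identity mapping needs.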

-hpa