Some systems have ACPI tables that don't include everything that needs
to be mapped for a successful kexec. These systems rely on identity
maps that include the full gigabyte surrounding any smaller region
requested for kexec success. Without this, they fail to kexec and end
up doing a full firmware reboot.
So, reduce the use of GB pages only on systems where this is known to
be necessary (specifically, UV systems).
Signed-off-by: Steve Wahl <[email protected]>
Fixes: d794734c9bbf ("x86/mm/ident_map: Use gbpages only where full GB page should be mapped.")
Reported-by: Pavin Joseph <[email protected]>
Closes: https://lore.kernel.org/all/[email protected]/
Tested-by: Pavin Joseph <[email protected]>
Tested-by: Eric Hagberg <[email protected]>
Tested-by: Sarah Brofeldt <[email protected]>
---
arch/x86/include/asm/init.h | 1 +
arch/x86/kernel/machine_kexec_64.c | 3 +++
arch/x86/mm/ident_map.c | 13 +++++++------
3 files changed, 11 insertions(+), 6 deletions(-)
diff --git a/arch/x86/include/asm/init.h b/arch/x86/include/asm/init.h
index cc9ccf61b6bd..4ae843e8fefb 100644
--- a/arch/x86/include/asm/init.h
+++ b/arch/x86/include/asm/init.h
@@ -10,6 +10,7 @@ struct x86_mapping_info {
 	unsigned long page_flag;	 /* page flag for PMD or PUD entry */
 	unsigned long offset;		 /* ident mapping offset */
 	bool direct_gbpages;		 /* PUD level 1GB page support */
+	bool direct_gbpages_always;	 /* use 1GB pages exclusively */
 	unsigned long kernpg_flag;	 /* kernel pagetable flag override */
 };
diff --git a/arch/x86/kernel/machine_kexec_64.c b/arch/x86/kernel/machine_kexec_64.c
index b180d8e497c3..1e1c6633bbec 100644
--- a/arch/x86/kernel/machine_kexec_64.c
+++ b/arch/x86/kernel/machine_kexec_64.c
@@ -28,6 +28,7 @@
 #include <asm/setup.h>
 #include <asm/set_memory.h>
 #include <asm/cpu.h>
+#include <asm/uv/uv.h>
 
 #ifdef CONFIG_ACPI
 /*
@@ -212,6 +213,8 @@ static int init_pgtable(struct kimage *image, unsigned long start_pgtable)
 	if (direct_gbpages)
 		info.direct_gbpages = true;
 
+	if (!is_uv_system())
+		info.direct_gbpages_always = true;
 
 	for (i = 0; i < nr_pfn_mapped; i++) {
 		mstart = pfn_mapped[i].start << PAGE_SHIFT;
diff --git a/arch/x86/mm/ident_map.c b/arch/x86/mm/ident_map.c
index a204a332c71f..8039498b9713 100644
--- a/arch/x86/mm/ident_map.c
+++ b/arch/x86/mm/ident_map.c
@@ -39,12 +39,13 @@ static int ident_pud_init(struct x86_mapping_info *info, pud_t *pud_page,
 
 		/* Is using a gbpage allowed? */
 		use_gbpage = info->direct_gbpages;
 
-		/* Don't use gbpage if it maps more than the requested region. */
-		/* at the begining: */
-		use_gbpage &= ((addr & ~PUD_MASK) == 0);
-		/* ... or at the end: */
-		use_gbpage &= ((next & ~PUD_MASK) == 0);
-
+		if (!info->direct_gbpages_always) {
+			/* Don't use gbpage if it maps more than the requested region. */
+			/* at the beginning: */
+			use_gbpage &= ((addr & ~PUD_MASK) == 0);
+			/* ... or at the end: */
+			use_gbpage &= ((next & ~PUD_MASK) == 0);
+		}
 		/* Never overwrite existing mappings */
 		use_gbpage &= !pud_present(*pud);
--
2.26.2
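For illustration, the check the ident_map.c hunk wraps is pure alignment
arithmetic. A minimal stand-alone sketch, assuming x86_64's PUD_SHIFT of
30, with invented example addresses:

    #include <stdio.h>

    #define PUD_SHIFT 30
    #define PUD_SIZE  (1UL << PUD_SHIFT)
    #define PUD_MASK  (~(PUD_SIZE - 1))

    /* Strict rule from the patch: a gbpage is allowed only when it maps
     * nothing beyond the requested [addr, next) region, i.e. both ends
     * sit on a 1 GiB boundary. */
    static int gbpage_exact(unsigned long addr, unsigned long next)
    {
            return ((addr & ~PUD_MASK) == 0) && ((next & ~PUD_MASK) == 0);
    }

    int main(void)
    {
            printf("%d\n", gbpage_exact(0x40000000UL, 0x80000000UL)); /* 1 */
            printf("%d\n", gbpage_exact(0x40000000UL, 0x7c000000UL)); /* 0 */
            return 0;
    }

With info->direct_gbpages_always set, the patch skips this test entirely,
restoring the old rounded-out behavior on non-UV systems.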
On 3/22/24 09:21, Steve Wahl wrote:
> Some systems have ACPI tables that don't include everything that needs
> to be mapped for a successful kexec. These systems rely on identity
> maps that include the full gigabyte surrounding any smaller region
> requested for kexec success. Without this, they fail to kexec and end
> up doing a full firmware reboot.
>
> So, reduce the use of GB pages only on systems where this is known to
> be necessary (specifically, UV systems).
Isn't this called "buggy firmware"?
I'd much rather add synthetic entries to the memory maps that have this
information than hack around it by assuming that things are within a
gigabyte.
On 3/22/24 10:31, Eric W. Biederman wrote:
>> I'd much rather add synthetic entries to the memory maps that have this
>> information than hack around it by assuming that things are within a
>> gigabyte.
> So this change is a partial revert of a change that broke kexec in
> existing configurations. To fix a regression that breaks kexec.
Let's back up for a second:
* Mapping extra memory on UV systems causes halts[1]
* Mapping extra memory on UV systems breaks kexec (this thread)
So we're in a pickle. I understand your concern for kexec. But I'm
concerned that fixing the kexec problem will re-expose us to the [1]
problem.
Steve, can you explain a bit why this patch doesn't re-expose the kernel
to the [1] bug?
1. https://lore.kernel.org/all/[email protected]/
On 3/22/24 10:40, Dave Hansen wrote:
> * Mapping extra memory on UV systems causes halts[1]
> * Mapping extra memory on UV systems breaks kexec (this thread)
Sorry, I said that second one backwards:
* _Not_ mapping extra memory on UV systems breaks kexec
On Fri, Mar 22, 2024 at 10:43:36AM -0700, Dave Hansen wrote:
> On 3/22/24 10:40, Dave Hansen wrote:
> > * Mapping extra memory on UV systems causes halts[1]
> > * Mapping extra memory on UV systems breaks kexec (this thread)
>
> Sorry, I said that second one backwards:
>
> * _Not_ mapping extra memory on UV systems breaks kexec
Not quite. This one is:
* _Not_ mapping extra memory on _non-UV_ systems breaks kexec.
Thanks,
--> Steve
--
Steve Wahl, Hewlett Packard Enterprise
Dave Hansen <[email protected]> writes:
> On 3/22/24 09:21, Steve Wahl wrote:
>> Some systems have ACPI tables that don't include everything that needs
>> to be mapped for a successful kexec. These systems rely on identity
>> maps that include the full gigabyte surrounding any smaller region
>> requested for kexec success. Without this, they fail to kexec and end
>> up doing a full firmware reboot.
>>
>> So, reduce the use of GB pages only on systems where this is known to
>> be necessary (specifically, UV systems).
>
> Isn't this called "buggy firmware"?
>
> I'd much rather add synthetic entries to the memory maps that have this
> information than hack around it by assuming that things are within a
> gigabyte.
So this change is a partial revert of a change that broke kexec in
existing configurations. To fix a regression that breaks kexec.
I don't have enough information to know which systems broke, so I don't
know the difficulty in fixing those systems.
Dave, do you know what synthetic entries need to be added to the
memory maps to fix the regression? If not, we should go with this,
and if we care enough we can add the synthetic entries later.
Eric
On Fri, Mar 22, 2024 at 10:40:37AM -0700, Dave Hansen wrote:
> On 3/22/24 10:31, Eric W. Biederman wrote:
> >> I'd much rather add synthetic entries to the memory maps that have this
> >> information than hack around it by assuming that things are within a
> >> gigabyte.
> > So this change is a partial revert of a change that broke kexec in
> > existing configurations. To fix a regression that breaks kexec.
Hi, Dave!
> Let's back up for a second:
>
> * Mapping extra memory on UV systems causes halts[1]
> * Mapping extra memory on UV systems breaks kexec (this thread)
These are the same. The most reliable way to create the problem[1] on
UV is a kexec to a kdump kernel, because of the typical placement of
the kdump kernel active region with respect to the reserved addresses
that cause the halts. (The distros we typically run place the
crashkernel just below the highest reserved region, where a gbpage can
include both.)
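To illustrate the placement described above with invented numbers (a
sketch, not actual UV addresses): a crashkernel that ends just below a
reserved region shares a gigabyte with it, so rounding the mapping out
to a full gbpage pulls the reserved addresses into the identity map:

    #include <stdio.h>

    int main(void)
    {
            unsigned long gb = 1UL << 30;
            /* Invented layout: reserved region at the top of a node,
             * crashkernel placed immediately below it. */
            unsigned long resv_start  = 0x1fc000000UL;
            unsigned long crash_start = 0x1f0000000UL;
            /* Rounding the crashkernel mapping out to a gbpage ... */
            unsigned long gb_start = crash_start & ~(gb - 1);
            unsigned long gb_end   = gb_start + gb;
            /* ... covers the reserved region as well. */
            printf("gbpage [%#lx, %#lx) covers %#lx: %s\n",
                   gb_start, gb_end, resv_start,
                   resv_start >= gb_start && resv_start < gb_end ?
                   "yes" : "no");
            return 0;
    }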
What you didn't state here is the third bullet that this patch addresses:
* Neglecting to map extra memory on some (firmware buggy?) non-UV
systems breaks kexec.
> So we're in a pickle. I understand your concern for kexec. But I'm
> concerned that fixing the kexec problem will re-expose us to the [1]
> problem.
> Steve, can you explain a bit why this patch doesn't re-expose the kernel
> to the [1] bug?
>
> 1. https://lore.kernel.org/all/[email protected]/
This patch still has UV systems avoid gbpages that go far outside
actual requested regions, but allows the full gb pages on other
systems. On UV systems, the new gbpage algorithm is followed. On
non-UV systems, gbpages are allowed even for requests that don't cover
a complete gbpage -- essentially the former algorithm but using the
new code.
Hope that makes sense.
I would probably consider this buggy firmware, but got enough reports
of this regression (from Pavin Joseph, Eric Hagberg, and Sarah
Brofeldt, all of whom tested the patch to see if it cured the
regression) that it seemed everyone would want it fixed quickly and
point fingers later.
In the private debugging exchanges with Pavin, I got some printks of
the regions that were mapped, and did one exchange where I hard-coded
the regions not covered on his particular system back into the table;
there were four regions left out. I added all four in one patch. I
could have dug in further to diagnose which of the missing region(s)
were actually necessary to get kexec to succeed, but couldn't see what
I would do with that information once I had it, as I don't see a way
to generalize this to other platforms exhibiting the problem.
Thanks,
--> Steve
--
Steve Wahl, Hewlett Packard Enterprise
On 3/22/24 09:21, Steve Wahl wrote:
> Some systems have ACPI tables that don't include everything that needs
> to be mapped for a successful kexec. These systems rely on identity
> maps that include the full gigabyte surrounding any smaller region
> requested for kexec success. Without this, they fail to kexec and end
> up doing a full firmware reboot.
I'm still missing something here. Which ACPI tables are we talking
about? What don't they map? I normally don't think of ACPI _tables_ as
"mapping" things.
It seems like there's a theory that some ACPI table isn't mapped, but
looking through the discussion so far I don't see a smoking gun. Let's
say the kernel has a bug and the kernel was actively not mapping
something that it should have mapped. The oversized 1GB mappings made
the bug harder to hit. If that's the case, we'll just be adding a hack
which papers over the bug instead of fixing it properly.
I'm kind of leaning to say that we should just revert d794734c9bbf and
have the UV folks go back to the nogbpages until we get this properly
sorted.
> @@ -10,6 +10,7 @@ struct x86_mapping_info {
> unsigned long page_flag; /* page flag for PMD or PUD entry */
> unsigned long offset; /* ident mapping offset */
> bool direct_gbpages; /* PUD level 1GB page support */
> + bool direct_gbpages_always; /* use 1GB pages exclusively */
> unsigned long kernpg_flag; /* kernel pagetable flag override */
> };
But let's at least talk about this patch in case we decide to go forward
with it. We've really got two things:
1. Can the system use gbpages in the first place?
2. Do the gbpages need to be exact (UV) or sloppy (everything else)?
I wouldn't refer to this at all as "always" use gbpages. It's really a
be-sloppy-and-paper-over-bugs mode. They might be kernel bugs or
firmware bugs, but they're bugs _somewhere_ right?
Dave Hansen <[email protected]> writes:
> On 3/22/24 09:21, Steve Wahl wrote:
>> Some systems have ACPI tables that don't include everything that needs
>> to be mapped for a successful kexec. These systems rely on identity
>> maps that include the full gigabyte surrounding any smaller region
>> requested for kexec success. Without this, they fail to kexec and end
>> up doing a full firmware reboot.
>
> I'm still missing something here. Which ACPI tables are we talking
> about? What don't they map? I normally don't think of ACPI _tables_ as
> "mapping" things.
Either E820 or ACPI lists which areas of memory are present in a
machine. Those tables are used to build the identity memory mappings.
Those identity mapped page tables not built with GB pages cause kexec to
fail for at least 3 people. Presumably because something using those
page tables accesses memory that is not mapped.
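For reference, the construction being described is the loop visible in
the diff context earlier in the thread. A from-memory sketch of that
part of init_pgtable() (names as in arch/x86/kernel/machine_kexec_64.c
and arch/x86/mm/ident_map.c; details may differ from the exact tree):

    /* Identity-map every range the kernel has recorded as mapped,
     * applying the info->direct_gbpages policy discussed here. */
    for (i = 0; i < nr_pfn_mapped; i++) {
            mstart = pfn_mapped[i].start << PAGE_SHIFT;
            mend   = pfn_mapped[i].end << PAGE_SHIFT;

            result = kernel_ident_mapping_init(&info, level4p,
                                               mstart, mend);
            if (result)
                    return result;
    }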
> It seems like there's a theory that some ACPI table isn't mapped, but
> looking through the discussion so far I don't see a smoking gun. Let's
> say the kernel has a bug and the kernel was actively not mapping
> something that it should have mapped. The oversized 1GB mappings made
> the bug harder to hit. If that's the case, we'll just be adding a hack
> which papers over the bug instead of fixing it properly.
>
> I'm kind of leaning to say that we should just revert d794734c9bbf and
> have the UV folks go back to the nogbpages until we get this properly
> sorted.
That is exactly what this patch does. It reverts the change except
on UV systems.
>> @@ -10,6 +10,7 @@ struct x86_mapping_info {
>> unsigned long page_flag; /* page flag for PMD or PUD entry */
>> unsigned long offset; /* ident mapping offset */
>> bool direct_gbpages; /* PUD level 1GB page support */
>> + bool direct_gbpages_always; /* use 1GB pages exclusively */
>> unsigned long kernpg_flag; /* kernel pagetable flag override */
>> };
>
> But let's at least talk about this patch in case we decide to go forward
> with it. We've really got two things:
>
> 1. Can the system use gbpages in the first place?
> 2. Do the gbpages need to be exact (UV) or sloppy (everything else)?
>
> I wouldn't refer to this at all as "always" use gbpages. It's really a
> be-sloppy-and-paper-over-bugs mode. They might be kernel bugs or
> firmware bugs, but they're bugs _somewhere_ right?
Is it?
As far as I can tell the UV mode is a "be exact and avoid cpu bugs" mode.
My sense is that using GB pages for everything (when we want an identity
mapping) should be much cheaper TLB wise, so we probably want to use GB
pages for everything if we can.
Personally I'd rather turn off the page tables entirely for kexec, but
that is not an option on x86_64.
Eric
* Steve Wahl <[email protected]> wrote:
> Some systems have ACPI tables that don't include everything that needs
> to be mapped for a successful kexec. These systems rely on identity
> maps that include the full gigabyte surrounding any smaller region
> requested for kexec success. Without this, they fail to kexec and end
> up doing a full firmware reboot.
>
> So, reduce the use of GB pages only on systems where this is known to
> be necessary (specifically, UV systems).
>
> Signed-off-by: Steve Wahl <[email protected]>
> Fixes: d794734c9bbf ("x86/mm/ident_map: Use gbpages only where full GB page should be mapped.")
> Reported-by: Pavin Joseph <[email protected]>
Sigh, why was d794734c9bbf marked for a -stable backport? The commit
never explains ...
If it's broken, it should be reverted - instead of trying to partially
revert and then maybe break some other systems.
When there's boot breakage with new patches, we back out the bad patch
and re-try in 99.9% of the cases.
Thanks,
Ingo
On 3/23/24 21:45, Eric W. Biederman wrote:
> Dave Hansen <[email protected]> writes:
>> On 3/22/24 09:21, Steve Wahl wrote:
>>> Some systems have ACPI tables that don't include everything that needs
>>> to be mapped for a successful kexec. These systems rely on identity
>>> maps that include the full gigabyte surrounding any smaller region
>>> requested for kexec success. Without this, they fail to kexec and end
>>> up doing a full firmware reboot.
>>
>> I'm still missing something here. Which ACPI tables are we talking
>> about? What don't they map? I normally don't think of ACPI _tables_ as
>> "mapping" things.
>
> Either E820 or ACPI lists which areas of memory are present in a
> machine. Those tables are used to build the identity memory mappings.
>
> Those identity mapped page tables not built with GB pages cause kexec to
> fail for at least 3 people. Presumably because something using those
> page tables accesses memory that is not mapped.
But why is it not mapped? Are the firmware-provided memory maps
inaccurate? Or did the kernel read those maps and then forget to map
something.
Using GB pages could paper over either class of bug.
>> It seems like there's a theory that some ACPI table isn't mapped, but
>> looking through the discussion so far I don't see a smoking gun. Let's
>> say the kernel has a bug and the kernel was actively not mapping
>> something that it should have mapped. The oversized 1GB mappings made
>> the bug harder to hit. If that's the case, we'll just be adding a hack
>> which papers over the bug instead of fixing it properly.
>>
>> I'm kind of leaning to say that we should just revert d794734c9bbf and
>> have the UV folks go back to the nogbpages until we get this properly
>> sorted.
>
> That is exactly what this patch does. It reverts the change except
> on UV systems.
Maybe it's splitting hairs, but I see a difference between reverting the
_commit_ and adding new code that tries to revert the commit's behavior.
I think reverting the commit is more conservative and that's what I was
referring to.
>>> @@ -10,6 +10,7 @@ struct x86_mapping_info {
>>> unsigned long page_flag; /* page flag for PMD or PUD entry */
>>> unsigned long offset; /* ident mapping offset */
>>> bool direct_gbpages; /* PUD level 1GB page support */
>>> + bool direct_gbpages_always; /* use 1GB pages exclusively */
>>> unsigned long kernpg_flag; /* kernel pagetable flag override */
>>> };
>>
>> But let's at least talk about this patch in case we decide to go forward
>> with it. We've really got two things:
>>
>> 1. Can the system use gbpages in the first place?
>> 2. Do the gbpages need to be exact (UV) or sloppy (everything else)?
>>
>> I wouldn't refer to this at all as "always" use gbpages. It's really a
>> be-sloppy-and-paper-over-bugs mode. They might be kernel bugs or
>> firmware bugs, but they're bugs _somewhere_ right?
>
> Is it?
>
> As far as I can tell the UV mode is a "be exact and avoid cpu bugs" mode.
The fact is that there are parts of the physical address space that have
read side effects. If you want to have them mapped, you need to use a
mapping type where speculative accesses won't occur (like UC).
I don't really think these are CPU bugs. They're just a fact of life.
> My sense is that using GB pages for everything (when we want an identity
> mapping) should be much cheaper TLB wise, so we probably want to use GB
> pages for everything if we can.
Sure. But the "if we can" situation is where the physical address space
is uniform underneath that GB page.
It's not at all uncommon to have those goofy, undesirable read
side-effects. We've had several issues around them over the years. You
really can't just map random physical memory and hope for the best.
That means that you are limited to mapping memory that you *know* is
uniform, like "all RAM" or "all PMEM".
On Sun, Mar 24, 2024 at 11:31:39AM +0100, Ingo Molnar wrote:
>
> * Steve Wahl <[email protected]> wrote:
>
> > Some systems have ACPI tables that don't include everything that needs
> > to be mapped for a successful kexec. These systems rely on identity
> > maps that include the full gigabyte surrounding any smaller region
> > requested for kexec success. Without this, they fail to kexec and end
> > up doing a full firmware reboot.
> >
> > So, reduce the use of GB pages only on systems where this is known to
> > be necessary (specifically, UV systems).
> >
> > Signed-off-by: Steve Wahl <[email protected]>
> > Fixes: d794734c9bbf ("x86/mm/ident_map: Use gbpages only where full GB page should be mapped.")
> > Reported-by: Pavin Joseph <[email protected]>
>
> Sigh, why was d794734c9bbf marked for a -stable backport? The commit
> never explains ...
I will try to explain, since Steve is offline. That commit fixes a
legitimate bug where more address range is mapped (1G) than the
requested address range. The fix avoids the issue of the cpu speculatively
loading beyond the requested range, which includes speculative loads
from reserved memory. That is why it was marked for -stable.
> If it's broken, it should be reverted - instead of trying to partially
> revert and then maybe break some other systems.
Three people reported that mapping only the correct address range
caused problems on their platforms. https://lore.kernel.org/all/[email protected]/
Steve and several people helped debug the issue. The commit itself
looks correct but the correct behavior causes some side effect on
a few platforms. Some memory ends up not being mapped, but it is not
clear if it is due to some other bug, such as the BIOS not accurately
providing the right memory map or some other kernel code path not
mapping what it should. The 1G mapping covers up that type of issue.
Steve's second patch was to not break those platforms while leaving the
fix in place on the platform that detected the original mapping problem
(the UV platform).
> When there's boot breakage with new patches, we back out the bad patch
> and re-try in 99.9% of the cases.
Steve can certainly merge his two patches and resubmit, to replace the
reverted original patch. He should be on in the morning to speak for
himself.
Thanks
--
Russ Anderson, SuperDome Flex Linux Kernel Group Manager
HPE - Hewlett Packard Enterprise (formerly SGI) [email protected]
* Russ Anderson <[email protected]> wrote:
> On Sun, Mar 24, 2024 at 11:31:39AM +0100, Ingo Molnar wrote:
> >
> > * Steve Wahl <[email protected]> wrote:
> >
> > > Some systems have ACPI tables that don't include everything that needs
> > > to be mapped for a successful kexec. These systems rely on identity
> > > maps that include the full gigabyte surrounding any smaller region
> > > requested for kexec success. Without this, they fail to kexec and end
> > > up doing a full firmware reboot.
> > >
> > > So, reduce the use of GB pages only on systems where this is known to
> > > be necessary (specifically, UV systems).
> > >
> > > Signed-off-by: Steve Wahl <[email protected]>
> > > Fixes: d794734c9bbf ("x86/mm/ident_map: Use gbpages only where full GB page should be mapped.")
> > > Reported-by: Pavin Joseph <[email protected]>
> >
> > Sigh, why was d794734c9bbf marked for a -stable backport? The commit
> > never explains ...
>
> I will try to explain, since Steve is offline. That commit fixes a
> legitimate bug where more address range is mapped (1G) than the
> requested address range.
If a change regresses on certain machines then it's not a bug fix anymore,
it's a regression. End of story.
> The fix avoids the issue of the cpu speculatively
> loading beyond the requested range, which includes speculative loads
> from reserved memory. That is why it was marked for -stable.
And this regression is why more complicated fixes in this area should not
be forwarded to -stable before it's been merged upstream and exposed a bit
more. Please keep that in mind for future iterations.
> > If it's broken, it should be reverted - instead of trying to partially
> > revert and then maybe break some other systems.
>
> Three people reported that mapping only the correct address range
> caused problems on their platforms. https://lore.kernel.org/all/[email protected]/
> Steve and several people helped debug the issue. The commit itself
> looks correct but the correct behavior causes some side effect on
> a few platforms.
That's all fine and the effort is much appreciated - but we should not try
to whitewash a regression: if there's a couple of reports in such a short
time already, then the regression is significant.
Anyway, I've reverted this in tip:x86/urgent:
c567f2948f57 Revert "x86/mm/ident_map: Use gbpages only where full GB page should be mapped."
we can iterate from there again. Please post future patches against that
tree.
Note that this is just the regular development process: regressions happen,
and this is how we handle them a lot of the time in this area - we back out
the breakage, then try again.
> Some memory ends up not being mapped, but it is not clear if it is due to
> some other bug, such as the BIOS not accurately providing the right memory
> map or some other kernel code path not mapping what it should. The 1G
> mapping covers up that type of issue.
>
> Steve's second patch was to not break those platforms while leaving the
> fix in place on the platform that detected the original mapping problem
> (the UV platform).
>
> > When there's boot breakage with new patches, we back out the bad patch
> > and re-try in 99.9% of the cases.
>
> Steve can certainly merge his two patches and resubmit, to replace the
> reverted original patch. He should be on in the morning to speak for
> himself.
Thank you!
Ingo
On Mon, Mar 25, 2024 at 10:04:41AM -0500, Eric W. Biederman wrote:
> Russ Anderson <[email protected]> writes:
> > Steve can certainly merge his two patches and resubmit, to replace the
> > reverted original patch. He should be on in the morning to speak for
> > himself.
>
> I am going to push back and suggest that this is perhaps a bug in the
> HPE UV systems firmware not setting up the cpus memory type range
> registers correctly.
>
> Unless those systems are using new fangled cpus that don't have 16bit
> and 32bit support, and don't implement memory type range registers,
> I don't see how something that only affects HPE UV systems could be
> anything except an HPE UV specific bug.
Eric,
I took the time to communicate with others in the company who know
this stuff better than I do before replying on this.
One of the problems with using the MTRRs for this is that there are
simply not enough of them. The MTRRs size/alignment requirements mean
that more than one entry would be required per reserved region, and we
need one reserved region per socket on systems that currently can go
up to 32 sockets. (In case you would think to ask, the reserved
regions also cannot be made contiguous.)
So MTRRs will not work to keep speculation out of our reserved memory
regions.
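To make the arithmetic concrete: variable MTRRs must be power-of-two
sized and size-aligned, so one oddly sized region decomposes into
several entries. A small sketch with invented numbers (not actual UV
addresses):

    #include <stdio.h>

    /* Count the variable MTRR entries one [base, base+size) region
     * needs, given the power-of-two size/alignment rules. */
    static int mtrrs_needed(unsigned long long base, unsigned long long size)
    {
            int count = 0;

            while (size) {
                    /* Largest chunk the base's alignment allows ... */
                    unsigned long long align = base ? base & -base : ~0ULL;
                    /* ... capped by the largest power of two <= size. */
                    unsigned long long chunk =
                            1ULL << (63 - __builtin_clzll(size));

                    if (chunk > align)
                            chunk = align;
                    base += chunk;
                    size -= chunk;
                    count++;
            }
            return count;
    }

    int main(void)
    {
            /* e.g. a hypothetical 768 MB region needs two entries
             * (512 MB + 256 MB); at one such region per socket, 32
             * sockets would need 64 entries, far more variable MTRRs
             * than cpus provide. */
            printf("%d\n", mtrrs_needed(0x70000000ULL, 0x30000000ULL));
            return 0;
    }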
Let me know if you need more information from us on this.
Thanks.
--> Steve Wahl
--
Steve Wahl, Hewlett Packard Enterprise
Russ Anderson <[email protected]> writes:
> On Sun, Mar 24, 2024 at 11:31:39AM +0100, Ingo Molnar wrote:
>>
>> * Steve Wahl <[email protected]> wrote:
>>
>> > Some systems have ACPI tables that don't include everything that needs
>> > to be mapped for a successful kexec. These systems rely on identity
>> > maps that include the full gigabyte surrounding any smaller region
>> > requested for kexec success. Without this, they fail to kexec and end
>> > up doing a full firmware reboot.
>> >
>> > So, reduce the use of GB pages only on systems where this is known to
>> > be necessary (specifically, UV systems).
>> >
>> > Signed-off-by: Steve Wahl <[email protected]>
>> > Fixes: d794734c9bbf ("x86/mm/ident_map: Use gbpages only where full GB page should be mapped.")
>> > Reported-by: Pavin Joseph <[email protected]>
>>
>> Sigh, why was d794734c9bbf marked for a -stable backport? The commit
>> never explains ...
>
> I will try to explain, since Steve is offline. That commit fixes a
> legitimate bug where more address range is mapped (1G) than the
> requested address range. The fix avoids the issue of the cpu speculatively
> loading beyond the requested range, which includes speculative loads
> from reserved memory. That is why it was marked for -stable.
To call that a bug presumes that the memory type range registers
were not set up properly by the boot firmware.
I think I saw something suggesting that the existence of memory type
range registers is changing/has changed in recent cpus, but
historically it has been the job of the memory type range registers to
ensure that the attributes of specific addresses are correct.
The memory attributes should guide the speculation.
To depend upon page tables to ensure the attributes are correct would
presumably require a cpu that does not have support for disabling page
tables in 32bit mode and does not have 16bit mode.
On older systems (I haven't looked lately) I have seen all kinds of
oddities in the descriptions of memory. Like not describing the memory
at address 0 where the real mode IDT lives. So I am not at all certain
any firmware information can be depended upon or reasonably expected to
be complete. For a while there was no concept of firmware memory areas
so on some older systems it was actually required for there to be gaps
in the description of memory provided to the system, so that operating
systems would not touch memory used by the firmware.
Which definitely means in the case of kexec there are legitimate reasons
to access memory areas that are well known but have not always been
described by the boot firmware. So the assertion that it is necessarily
a firmware bug to not describe all of memory is at least historically
incorrect on x86_64.
There may be different requirements for the kexec identity map and the
ordinary kernel boot memory map, and that is something that can
reasonably be explored as we look at solutions.
> Some memory ends up not being mapped, but it is not
> clear if it is due to some other bug, such as the BIOS not accurately
> providing the right memory map or some other kernel code path
> not mapping what it should.
> The 1G mapping covers up that type of issue.
I have seen this assertion repeated several times, and at least
historically on x86_64 it is most definitely false. The E820 map which
was the primary information source for a long time could not describe
all of memory so depending upon it to be complete is erroneous.
>> When there's boot breakage with new patches, we back out the bad patch
>> and re-try in 99.9% of the cases.
>
> Steve can certainly merge his two patches and resubmit, to replace the
> reverted original patch. He should be on in the morning to speak for
> himself.
I am going to push back and suggest that this is perhaps a bug in the
HPE UV systems firmware not setting up the cpus memory type range
registers correctly.
Unless those systems are using new fangled cpus that don't have 16bit
and 32bit support, and don't implement memory type range registers,
I don't see how something that only affects HPE UV systems could be
anything except an HPE UV specific bug.
Eric
I understand the original has been reverted, and that my likely way
forward is to combine the original patch and the fix into a single
patch (possibly a set). I still think some of this conversation will
be valuable to the creation of that patch, so I am continuing to
reply. More below.
On Fri, Mar 22, 2024 at 04:29:27PM -0700, Dave Hansen wrote:
> On 3/22/24 09:21, Steve Wahl wrote:
> > Some systems have ACPI tables that don't include everything that needs
> > to be mapped for a successful kexec. These systems rely on identity
> > maps that include the full gigabyte surrounding any smaller region
> > requested for kexec success. Without this, they fail to kexec and end
> > up doing a full firmware reboot.
>
> I'm still missing something here. Which ACPI tables are we talking
> about? What don't they map? I normally don't think of ACPI _tables_ as
> "mapping" things.
I'm referring to the memory areas that are mapped in machine_kexec_64.c
in the function map_acpi_tables(). These appear to be e820 table
entries that have a type of E820_TYPE_ACPI, which the kernel marks as
IORES_DESC_ACPI_TABLES, or a type of E820_TYPE_NVS, which the kernel
marks as IORES_DESC_ACPI_NV_STORAGE.
The name of the function that maps them is why I refer to them as ACPI
tables. Sorry if that is inaccurate.
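For reference, a from-memory sketch of that function (as I understand
it from the kernel source; details may differ from the exact tree):

    /* map_acpi_tables(): identity-map the IORES_DESC_ACPI_TABLES and
     * IORES_DESC_ACPI_NV_STORAGE resources, i.e. the E820_TYPE_ACPI
     * and E820_TYPE_NVS e820 entries mentioned above. */
    static int map_acpi_tables(struct x86_mapping_info *info, pgd_t *level4p)
    {
            unsigned long flags = IORESOURCE_MEM | IORESOURCE_BUSY;
            struct init_pgtable_data data = {
                    .info = info,
                    .level4p = level4p,
            };
            int ret;

            ret = walk_iomem_res_desc(IORES_DESC_ACPI_TABLES, flags,
                                      0, -1, &data, mem_region_callback);
            if (ret && ret != -EINVAL)
                    return ret;

            /* Tables can also live in ACPI non-volatile storage. */
            ret = walk_iomem_res_desc(IORES_DESC_ACPI_NV_STORAGE, flags,
                                      0, -1, &data, mem_region_callback);
            return (ret && ret != -EINVAL) ? ret : 0;
    }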
> It seems like there's a theory that some ACPI table isn't mapped, but
> looking through the discussion so far I don't see a smoking gun.
I think I'm saying more that the ACPI table doesn't list everything
that needs to be mapped, not that the table itself isn't mapped. Not
sure if that changes your picture or not.
My debugging exchanges with Pavin showed that the regions mapped within
the map_acpi_tables function were the ones that left uncovered holes
in the identity map if you don't overshoot what's requested by using
full gbpages for everything.
For his system only, I manually added hardcoded regions corresponding
to the holes that got left by using 2M pages instead of GB pages, and
kexec succeeded.
Having the list of holes-not-covered (IIRC, four of them), I could
have pursued which particular holes cause kexec to fail, but I did not
because I couldn't think of a way to make use of that information.
Even knowing which additional addresses need coverage for this
particular machine, I have no way of knowing what is in those regions,
nor how to generalize to what is needed on other machines.
> Let's say the kernel has a bug and the kernel was actively not
> mapping something that it should have mapped. The oversized 1GB
> mappings made the bug harder to hit. If that's the case, we'll just
> be adding a hack which papers over the bug instead of fixing it
> properly.
I hope you agree that by reverting, we have now papered over that bug,
just in a different way.
If a patch that leaves this papered over except for UV systems won't
be acceptable -- that's what I intend to do, just combining my two
patches -- please let me know what my way forward should be.
> I'm kind of leaning to say that we should just revert d794734c9bbf and
> have the UV folks go back to the nogbpages until we get this properly
> sorted.
Being larger-memory systems (for example, 32-socket Sapphire Rapids
systems with a full set of RAM on each socket), UV systems probably
suffer the most from needing an extra 4K page-table page per GiB to
create the identity map.
> > @@ -10,6 +10,7 @@ struct x86_mapping_info {
> > unsigned long page_flag; /* page flag for PMD or PUD entry */
> > unsigned long offset; /* ident mapping offset */
> > bool direct_gbpages; /* PUD level 1GB page support */
> > + bool direct_gbpages_always; /* use 1GB pages exclusively */
> > unsigned long kernpg_flag; /* kernel pagetable flag override */
> > };
>
> But let's at least talk about this patch in case we decide to go forward
> with it. We've really got two things:
>
> 1. Can the system use gbpages in the first place?
> 2. Do the gbpages need to be exact (UV) or sloppy (everything else)?
>
> I wouldn't refer to this at all as "always" use gbpages. It's really a
> be-sloppy-and-paper-over-bugs mode. They might be kernel bugs or
> firmware bugs, but they're bugs _somewhere_ right?
Do you have a concise suggestion of what you'd call it? I could use
*_sloppy if you'd like, but I don't care much for the way that
reads.
Thanks for your time,
--> Steve Wahl
--
Steve Wahl, Hewlett Packard Enterprise
On Sun, Mar 24, 2024 at 11:31:39AM +0100, Ingo Molnar wrote:
>
> * Steve Wahl <[email protected]> wrote:
>
> > Some systems have ACPI tables that don't include everything that needs
> > to be mapped for a successful kexec. These systems rely on identity
> > maps that include the full gigabyte surrounding any smaller region
> > requested for kexec success. Without this, they fail to kexec and end
> > up doing a full firmware reboot.
> >
> > So, reduce the use of GB pages only on systems where this is known to
> > be necessary (specifically, UV systems).
> >
> > Signed-off-by: Steve Wahl <[email protected]>
> > Fixes: d794734c9bbf ("x86/mm/ident_map: Use gbpages only where full GB page should be mapped.")
> > Reported-by: Pavin Joseph <[email protected]>
>
> Sigh, why was d794734c9bbf marked for a -stable backport? The commit
> never explains ...
>
> If it's broken, it should be reverted - instead of trying to partially
> revert and then maybe break some other systems.
>
> When there's boot breakage with new patches, we back out the bad patch
> and re-try in 99.9% of the cases.
Fine with me, especially as you've already done the revert. :-)
I will create a new patch that combines the two. If you have any
specific actions you'd like to be sure I do for this, let me know.
Thanks,
--> Steve Wahl
--
Steve Wahl, Hewlett Packard Enterprise
Steve Wahl <[email protected]> writes:
> On Mon, Mar 25, 2024 at 10:04:41AM -0500, Eric W. Biederman wrote:
>> Russ Anderson <[email protected]> writes:
>> > Steve can certainly merge his two patches and resubmit, to replace the
>> > reverted original patch. He should be on in the morning to speak for
>> > himself.
>>
>> I am going to push back and suggest that this is perhaps a bug in the
>> HPE UV systems firmware not setting up the cpus memory type range
>> registers correctly.
>>
>> Unless those systems are using new fangled cpus that don't have 16bit
>> and 32bit support, and don't implement memory type range registers,
>> I don't see how something that only affects HPE UV systems could be
>> anything except an HPE UV specific bug.
>
> Eric,
>
> I took the time to communicate with others in the company who know
> this stuff better than I do before replying on this.
>
> One of the problems with using the MTRRs for this is that there are
> simply not enough of them. The MTRRs size/alignment requirements mean
> that more than one entry would be required per reserved region, and we
> need one reserved region per socket on systems that currently can go
> up to 32 sockets. (In case you would think to ask, the reserved
> regions also cannot be made contiguous.)
>
> So MTRRs will not work to keep speculation out of our reserved memory
> regions.
>
> Let me know if you need more information from us on this.
Thanks for this.
Do you know if there are enough MTRRs for the first 4GB?
I am curious if kexec should even consider going into 32bit mode without
page tables or even into 16bit mode on such a system. Or if such a
system will always require using page tables.
If you don't have enough MTRRs on a big NUMA system I think it is
perfectly understandable, to need to use the page tables.
Please include the fact that splitting GB pages is necessary because
of a lack of MTRRs in the change description.
Given that it is the lack of MTRRs on a large NUMA system that makes
the change necessary, this goes from a pure bug fix to a change to
accommodate systems without enough MTRRs.
That information makes it more understandable why older systems (at
least in the case of kexec) might not be ok with the change. As for
older systems their MTRRs are sufficient and thus they can use fewer
page table entries. Allowing for use of larger TLB entries.
Eric
On Wed, Mar 27, 2024 at 07:57:52AM -0500, Eric W. Biederman wrote:
> Steve Wahl <[email protected]> writes:
>
> > On Mon, Mar 25, 2024 at 10:04:41AM -0500, Eric W. Biederman wrote:
> >> Russ Anderson <[email protected]> writes:
> >> > Steve can certainly merge his two patches and resubmit, to replace the
> >> > reverted original patch. He should be on in the morning to speak for
> >> > himself.
> >>
> >> I am going to push back and suggest that this is perhaps a bug in the
> >> HPE UV systems firmware not setting up the cpus memory type range
> >> registers correctly.
> >>
> >> Unless those systems are using new fangled cpus that don't have 16bit
> >> and 32bit support, and don't implement memory type range registers,
> >> I don't see how something that only affects HPE UV systems could be
> >> anything except an HPE UV specific bug.
> >
> > Eric,
> >
> > I took the time to communicate with others in the company who know
> > this stuff better than I do before replying on this.
> >
> > One of the problems with using the MTRRs for this is that there are
> > simply not enough of them. The MTRRs size/alignment requirements mean
> > that more than one entry would be required per reserved region, and we
> > need one reserved region per socket on systems that currently can go
> > up to 32 sockets. (In case you would think to ask, the reserved
> > regions also cannot be made contiguous.)
> >
> > So MTRRs will not work to keep speculation out of our reserved memory
> > regions.
> >
> > Let me know if you need more information from us on this.
>
> Thanks for this.
>
> Do you know if there are enough MTRRs for the first 4GB?
I don't personally know all the details of how BIOS chooses to place
things, but I suspect that might be true. The restricted spaces
usually end up at the end of the address range for a particular node,
and 4GB would be in the early part of node 0. If the conversation
develops further along these lines, I can find out more definitively.
> I am curious if kexec should even consider going into 32bit mode without
> page tables or even into 16bit mode on such a system. Or if such a
> system will always require using page tables.
Unless I'm mistaken, wouldn't that put a pretty heavy restriction on
where the kdump kernel could be located? Or the target region for
KASLR?
> If you don't have enough MTRRs on a big NUMA system I think it is
> perfectly understandable, to need to use the page tables.
>
> Please include the fact that splitting GB pages is necessary because
> of a lack of MTRRs in the change description.
OK.
> Given that it is the lack of MTRRs on a large NUMA system that makes
> the change necessary, this goes from a pure bug fix to a change to
> accommodate systems without enough MTRRs.
>
> That information makes it more understandable why older systems (at
> least in the case of kexec) might not be ok with the change. As for
> older systems their MTRRs are sufficient and thus they can use fewer
> page table entries. Allowing for use of larger TLB entries.
That last paragraph doesn't match what I think is happening.
At least from my point of view, that some systems aren't OK with the
change has nothing to do with MTRRs or TLB page size. They simply
require the extra "slop" of GB pages, implicitly adding a full GB of
space around any smaller space requested by map_acpi_tables().
The systems that failed with my original change also failed on earlier
kernels when nogbpages was added to the kernel command line. That
creates the identity map using 2M pages for everything, with no GB
page "slop". I'm pretty sure these systems will continue to fail with
"nogbpages" enabled.
For one debug-kernel cycle on Pavin's system I added in hard-coded
requests to explicitly add back in the areas that not being sloppy had
excluded, and that brought kexec back to functioning; which further
proves my point.
I wanted to be sure you understood this in case it has any effect on
what you think should be done.
--> Steve
--
Steve Wahl, Hewlett Packard Enterprise
Steve Wahl <[email protected]> writes:
> On Wed, Mar 27, 2024 at 07:57:52AM -0500, Eric W. Biederman wrote:
>> Steve Wahl <[email protected]> writes:
>>
>> > On Mon, Mar 25, 2024 at 10:04:41AM -0500, Eric W. Biederman wrote:
>> >> Russ Anderson <[email protected]> writes:
>> >> > Steve can certainly merge his two patches and resubmit, to replace the
>> >> > reverted original patch. He should be on in the morning to speak for
>> >> > himself.
>> >>
>> >> I am going to push back and suggest that this is perhaps a bug in the
>> >> HPE UV systems firmware not setting up the cpus memory type range
>> >> registers correctly.
>> >>
>> >> Unless those systems are using new fangled cpus that don't have 16bit
>> >> and 32bit support, and don't implement memory type range registers,
>> >> I don't see how something that only affects HPE UV systems could be
>> >> anything except an HPE UV specific bug.
>> >
>> > Eric,
>> >
>> > I took the time to communicate with others in the company who know
>> > this stuff better than I do before replying on this.
>> >
>> > One of the problems with using the MTRRs for this is that there are
>> > simply not enough of them. The MTRRs size/alignment requirements mean
>> > that more than one entry would be required per reserved region, and we
>> > need one reserved region per socket on systems that currently can go
>> > up to 32 sockets. (In case you would think to ask, the reserved
>> > regions also cannot be made contiguous.)
>> >
>> > So MTRRs will not work to keep speculation out of our reserved memory
>> > regions.
>> >
>> > Let me know if you need more information from us on this.
>>
>> Thanks for this.
>>
>> Do you know if there are enough MTRRs for the first 4GB?
>
> I don't personally know all the details of how BIOS chooses to place
> things, but I suspect that might be true. The restricted spaces
> usually end up at the end of the address range for a particular node,
> and 4GB would be in the early part of node 0. If the conversation
> develops further along these lines, I can find out more definitively.
>
>> I am curious if kexec should even consider going into 32bit mode without
>> page tables or even into 16bit mode on such a system. Or if such a
>> system will always require using page tables.
>
> Unless I'm mistaken, wouldn't that put a pretty heavy restriction on
> where the kdump kernel could be located?
If you are coming from 64bit EFI it adds restrictions.
Most of my experience involves systems using a real mode BIOS and
folks thought I was strange for wanting to be able to load the kernel
above 4GB.
Having that experience, I am stuck wondering how all of the weird
backwards compatibility cases are going to work. Hmm.
There is one concrete case where it matters that I think still exists.
x86_64 processors startup in 16bit real mode, then have to transition
through 32bit protected mode, before transitioning to 64bit protected
mode. Only in 64bit protected mode are page tables enabled.
All this happens during early kernel startup when the bootstrap
processor sends STARTUP IPIs to all of the secondary processors.
The startup IPI lets you pick where in the first 1MiB the secondary
processors will start.
Assuming there isn't a new processor startup sequence on your cpus,
speculation before the processor loads its first page table is a
legitimate concern.
> Or the target region for KASLR?
As I recall the kernel is limited to the last 2GB of the virtual
address space, as parts of the instruction set encoding assume kernel
addresses fit in sign-extended 32 bits.
>> If you don't have enough MTRRs on a big NUMA system I think it is
>> perfectly understandable, to need to use the page tables.
>>
>> Please include the fact that splitting GB pages is necessary because
>> of a lack of MTRRs in the change description.
>
> OK.
>
>> Given that it is the lack of MTRRs on a large NUMA system that makes
>> the change necessary, this goes from a pure bug fix to a change to
>> accommodate systems without enough MTRRs.
>>
>> That information makes it more understandable why older systems (at
>> least in the case of kexec) might not be ok with the change. As for
>> older systems their MTRRs are sufficient and thus they can use fewer
>> page table entries. Allowing for use of larger TLB entries.
>
> That last paragraph doesn't match what I think is happening.
>
> At least from my point of view, that some systems aren't OK with the
> change has nothing to do with MTRRs or TLB page size. They simply
> require the extra "slop" of GB pages, implicitly adding a full GB of
> space around any smaller space requested by map_acpi_tables().
>
> The systems that failed with my original change also failed on earlier
> kernels when nogbpages was added to the kernel command line. That
> creates the identity map using 2M pages for everything, with no GB
> page "slop". I'm pretty sure these systems will continue to fail with
> "nogbpages" enabled.
>
> For one debug-kernel cycle on Pavin's system I added in hard-coded
> requests to explicitly add back in the areas that not being sloppy had
> excluded, and that brought kexec back to functioning; which further
> proves my point.
>
> I wanted to be sure you understood this in case it has any effect on
> what you think should be done.
Sort of.
What kexec wants of an identity mapped page table really is to simulate
disabling paging altogether. There isn't enough memory in most systems
to identity map the entire 48bit or 52bit physical address space so some
compromises have to be made. I seem to recall only mapping up to
maxpfn, and using 1GB pages when I originally wrote the code. It was
later refactored to share the identity map page table building code with
the rest of the kernel.
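In that original spirit, the whole job reduces to something like the
following hypothetical sketch: one request covering [0, max_pfn) with
gbpages allowed throughout, rather than per-region requests:

    /* Hypothetical sketch of the old "simulate no paging" approach:
     * identity-map everything up to max_pfn in one call, letting the
     * builder use 1GB pages for the lot. */
    info.direct_gbpages = true;
    result = kernel_ident_mapping_init(&info, level4p,
                                       0, max_pfn << PAGE_SHIFT);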
When you changed the page tables not to map everything, strictly
speaking you created an ABI break of the kexec ABI.
Which is a long way of saying it isn't being sloppy it is deliberate,
and that the problem from my perspective is that things have become too
fine grained, too optimized.
Pavin's testing definitely proves the issue was not mapping enough
pages; it is nice that we have that confirmation.
From my perspective the entire reason for wanting to be fine grained and
precise in the kernel memory map is because the UV systems don't have
enough MTRRs. So you have to depend upon the cache-ability attributes
for specific addresses of memory coming from the page tables instead of
from the MTRRs.
If you had enough MTRRs, defining the page tables to be precisely
what is necessary would simply be an exercise in reducing kernel
performance, because it is more efficient in both page table size and
TLB usage to use 1GB pages instead of whatever smaller pages you have
to use for oddball regions.
For systems without enough MTRRs the small performance hit in paging
performance is the necessary trade off.
At least that is my perspective. Does that make sense?
Eric
On Thu, Mar 28, 2024 at 12:05:02AM -0500, Eric W. Biederman wrote:
> Steve Wahl <[email protected]> writes:
>
> > On Wed, Mar 27, 2024 at 07:57:52AM -0500, Eric W. Biederman wrote:
> >> Steve Wahl <[email protected]> writes:
> >>
> >> > On Mon, Mar 25, 2024 at 10:04:41AM -0500, Eric W. Biederman wrote:
> >> >> Russ Anderson <[email protected]> writes:
> >> >> > Steve can certainly merge his two patches and resubmit, to replace the
> >> >> > reverted original patch. He should be on in the morning to speak for
> >> >> > himself.
> >> >>
> >> >> I am going to push back and suggest that this is perhaps a bug in the
> >> >> HPE UV systems firmware not setting up the cpus memory type range
> >> >> registers correctly.
> >> >>
> >> >> Unless those systems are using new fangled cpus that don't have 16bit
> >> >> and 32bit support, and don't implement memory type range registers,
> >> >> I don't see how something that only affects HPE UV systems could be
> >> >> anything except an HPE UV specific bug.
> >> >
> >> > Eric,
> >> >
> >> > I took the time to communicate with others in the company who know
> >> > this stuff better than I do before replying on this.
> >> >
> >> > One of the problems with using the MTRRs for this is that there are
> >> > simply not enough of them. The MTRRs size/alignment requirements mean
> >> > that more than one entry would be required per reserved region, and we
> >> > need one reserved region per socket on systems that currently can go
> >> > up to 32 sockets. (In case you would think to ask, the reserved
> >> > regions also cannot be made contiguous.)
> >> >
> >> > So MTRRs will not work to keep speculation out of our reserved memory
> >> > regions.
> >> >
> >> > Let me know if you need more information from us on this.
> >>
> >> Thanks for this.
> >>
> >> Do you know if there are enough MTRRs for the first 4GB?
> >
> > I don't personally know all the details of how BIOS chooses to place
> > things, but I suspect that might be true. The restricted spaces
> > usually end up at the end of the address range for a particular node,
> > and 4GB would be in the early part of node 0. If the conversation
> > develops further along these lines, I can find out more definitively.
> >
> >> I am curious if kexec should even consider going into 32bit mode without
> >> page tables or even into 16bit mode on such a system. Or if such a
> >> system will always require using page tables.
> >
> > Unless I'm mistaken, wouldn't that put a pretty heavy restriction on
> > where the kdump kernel could be located?
>
> If you are coming from 64bit EFI it adds restrictions.
We are. :-)
> Most of my experience involves systems using a real mode BIOS and
> folks thought I was strange for wanting to be able to load the kernel
> above 4GB.
>
> Having that experience, I am stuck wondering how all of the weird
> backwards compatibility cases are going to work. Hmm.
>
> There is one concrete case where it matters that I think still exists.
>
> x86_64 processors start up in 16bit real mode, then have to transition
> through 32bit protected mode, before transitioning to 64bit protected
> mode. Only in 64bit protected mode are page tables enabled.
>
> All this happens during early kernel startup when the bootstrap
> processor sends STARTUP IPIs to all of the secondary processors.
>
> The startup IPI lets you pick where in the first 1MiB the secondary
> processors will start.
>
> Assuming there isn't a new processor startup sequence on your cpus,
> speculation before the processor loads its first page table is a
> legitimate concern.
I believe the reserved memory that is problematic is at the end of
each socket's (NUMA node's) address space. You have to get to 64 bit
execution before you can reach addresses outside of the first 4GB of
space I think. External hardware uses this RAM, the processors are
not to access it at all. MTRRs don't exactly have an entry type to
match this, at least from the document skimming I've done. (I have a
limited understanding, but I think this reserved space is used by our
hardware to keep track of cache line ownership for the rest of the
ram, so letting any other entity take even a read claim on these
addresses is a problem, in a catch-22 or circular reference sort of
way.)
> > Or the target region for KASLR?
>
> As I recall the kernel is limited to the last 2GB of the virtual
> address space, as parts of the instruction set encoding assume kernel
> addresses fit in sign-extended 32 bits.
From what I recall, KASLR varies both the virtual and physical addresses,
and it's the physical that's of concern here.
arch/x86/boot/compressed/kaslr.c: "In theory, KASLR can put the kernel
anywhere in the range of [16M, MAXMEM) on 64-bit..."
I had to make a change in that area a few years ago for similar
reasons:
1869dbe87cb94d x86/boot/64: Round memory hole size up to next PMD page
> >> If you don't have enough MTRRs on a big NUMA system I think it is
> >> perfectly understandable, to need to use the page tables.
> >>
> >> Please include the fact that splitting GB pages is necessary because
> >> of a lack of MTRRs in the change description.
> >
> > OK.
> >
> >> Given that it is the lack of MTRRs on a large NUMA system that makes
> >> the change necessary, this goes from a pure bug fix to a change to
> >> accommodate systems without enough MTRRs.
> >>
> >> That information makes it more understandable why older systems (at
> >> least in the case of kexec) might not be ok with the change. As for
> >> older systems their MTRRs are sufficient and thus they can use fewer
> >> page table entries. Allowing for use of larger TLB entries.
> >
> > That last paragraph doesn't match what I think is happening.
> >
> > At least from my point of view, that some systems aren't OK with the
> > change has nothing to do with MTRRs or TLB page size. They simply
> > require the extra "slop" of GB pages, implicitly adding a full GB of
> > space around any smaller space requested by map_acpi_tables().
> >
> > The systems that failed with my original change also failed on earlier
> > kernels when nogbpages was added to the kernel command line. That
> > creates the identity map using 2M pages for everything, with no GB
> > page "slop". I'm pretty sure these systems will continue to fail with
> > "nogbpages" enabled.
> >
> > For one debug-kernel cycle on Pavin's system I added in hard-coded
> > requests to explicitly add back in the areas that not being sloppy had
> > excluded, and that brought kexec back to functioning; which further
> > proves my point.
> >
> > I wanted to be sure you understood this in case it has any effect on
> > what you think should be done.
>
> Sort of.
>
> What kexec wants of an identity mapped page table really is to simulate
> disabling paging altogether. There isn't enough memory in most systems
> to identity map the entire 48bit or 52bit physical address space so some
> compromises have to be made. I seem to recall only mapping up to
> maxpfn, and using 1GB pages when I originally wrote the code. It was
> later refactored to share the identity map page table building code with
> the rest of the kernel.
>
> When you changed the page tables not to map everything, strictly
> speaking you created an ABI break of the kexec ABI.
>
> Which is a long way of saying it isn't being sloppy it is deliberate,
> and that the problem from my perspective is that things have become too
> fine grained, too optimized.
>
> Pavin's testing definitely proves the issue was not mapping enough
> pages; it is nice that we have that confirmation.
>
> From my perspective the entire reason for wanting to be fine grained and
> precise in the kernel memory map is because the UV systems don't have
> enough MTRRs. So you have to depend upon the cache-ability attributes
> for specific addresses of memory coming from the page tables instead of
> from the MTRRs.
It would be more accurate to say we depend upon the addresses not
being listed in the page tables at all. We'd be OK with mapped but
not accessed, if it weren't for processor speculation. There's no "no
access" setting within the existing MTRR definitions, though there may
be a setting that would rein in processor speculation enough to make
do.
> If you had enough MTRRs, defining the page tables to be precisely
> what is necessary would simply be an exercise in reducing kernel
> performance, because it is more efficient in both page table size and
> TLB usage to use 1GB pages instead of whatever smaller pages you have
> to use for oddball regions.
>
> For systems without enough MTRRs, the small hit in paging
> performance is the necessary trade-off.
>
> At least that is my perspective. Does that make sense?
I think I'm beginning to get your perspective. From your point of
view, is kexec failing with "nogbpages" set a bug? My point of view
is it likely is. I think your view would say it isn't?
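As a concrete illustration of the gbpage "slop" described above, here
is a minimal user-space sketch (the example addresses are made up for
illustration; PUD_SHIFT matches the x86-64 value):

    /*
     * Sketch only: mapping a small region with a 1GB page implicitly
     * maps the whole surrounding gigabyte.
     */
    #include <stdio.h>

    #define PUD_SHIFT 30
    #define PUD_SIZE  (1UL << PUD_SHIFT)
    #define PUD_MASK  (~(PUD_SIZE - 1))

    int main(void)
    {
            unsigned long start = 0x6d5ff000UL; /* e.g. a small ACPI data region */
            unsigned long end   = 0x6ffff000UL;

            printf("requested : 0x%lx-0x%lx\n", start, end - 1);
            printf("gbpage map: 0x%lx-0x%lx\n",
                   start & PUD_MASK,
                   ((end + PUD_SIZE - 1) & PUD_MASK) - 1);
            return 0;
    }

This prints a requested range of 0x6d5ff000-0x6fffefff but a gbpage
mapping of 0x40000000-0x7fffffff, pulling in everything else that
happens to live in that gigabyte.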
--> Steve
--
Steve Wahl, Hewlett Packard Enterprise
Steve Wahl <[email protected]> writes:
> On Thu, Mar 28, 2024 at 12:05:02AM -0500, Eric W. Biederman wrote:
>>
>> From my perspective the entire reason for wanting to be fine grained and
>> precise in the kernel memory map is because the UV systems don't have
>> enough MTRRs. So you have to depend upon the cache-ability attributes
>> for specific addresses of memory coming from the page tables instead of
>> from the MTRRs.
>
> It would be more accurate to say we depend upon the addresses not
> being listed in the page tables at all. We'd be OK with mapped but
> not accessed, if it weren't for processor speculation. There's no "no
> access" setting within the existing MTRR definitions, though there may
> be a setting that would rein in processor speculation enough to make
> do.
The uncached setting and the write-combining settings that are used for
I/O are required to disable speculation for any regions so marked. Any
reads or writes to a memory mapped I/O region can result in the hardware
processing it as a command, which as I understand it is exactly the
problem with UV systems.
Frankly, not mapping an I/O region (in an identity mapped page table),
instead of properly mapping it as it would need to be mapped for
performing I/O, seems like a bit of a bug.
>> If you had enough MTRRs, defining the page tables to be precisely
>> what is necessary would simply be an exercise in reducing kernel
>> performance, because it is more efficient in both page table size and
>> in TLB usage to use 1GB pages instead of whatever smaller pages you have
>> to use for oddball regions.
>>
>> For systems without enough MTRRs, the small hit in paging
>> performance is the necessary trade-off.
>>
>> At least that is my perspective. Does that make sense?
>
> I think I'm beginning to get your perspective. From your point of
> view, is kexec failing with "nogbpages" set a bug? My point of view
> is it likely is. I think your view would say it isn't?
I would say it is a bug.
Part of the bug is someone yet again taking something simple that
kexec is doing and reworking it to use generic code, then changing
the generic code to do something different from what kexec needs
and then being surprised that kexec stops working.
The interface kexec wants to provide to whatever is being loaded is not
having to think about page tables until that software is up far enough
to enable its own page tables.
People being clever and enabling just enough pages in the page tables
to work, based upon the results of some buggy (they are always buggy; some
are just less so than others) boot-up firmware, is where I get concerned.
Said another way, the point is to build an identity mapped page table.
Skipping some parts of the physical<->virtual identity because we seem
to think no one will use it is likely a bug.
I really don't see any point in putting holes in such a page table for
any address below the highest address that is good for something. Given
that on some systems the MTRRs are insufficient to do their job, it
definitely makes sense to not enable caching on areas that we don't
think are memory.
Eric
On Sat, Mar 30, 2024 at 10:46:21PM -0500, Eric W. Biederman wrote:
> Steve Wahl <[email protected]> writes:
>
> > On Thu, Mar 28, 2024 at 12:05:02AM -0500, Eric W. Biederman wrote:
> >>
> >> From my perspective the entire reason for wanting to be fine grained and
> >> precise in the kernel memory map is because the UV systems don't have
> >> enough MTRRs. So you have to depend upon the cache-ability attributes
> >> for specific addresses of memory coming from the page tables instead of
> >> from the MTRRs.
> >
> > It would be more accurate to say we depend upon the addresses not
> > being listed in the page tables at all. We'd be OK with mapped but
> > not accessed, if it weren't for processor speculation. There's no "no
> > access" setting within the existing MTRR definitions, though there may
> > be a setting that would rein in processor speculation enough to make
> > do.
>
> The uncached setting and the write-combining settings that are used for
> I/O are required to disable speculation for any regions so marked. Any
> reads or writes to a memory mapped I/O region can result in the hardware
> processing it as a command, which as I understand it is exactly the
> problem with UV systems.
>
> Frankly, not mapping an I/O region (in an identity mapped page table),
> instead of properly mapping it as it would need to be mapped for
> performing I/O, seems like a bit of a bug.
>
> >> If you had enough MTRRs, defining the page tables to be precisely
> >> what is necessary would simply be an exercise in reducing kernel
> >> performance, because it is more efficient in both page table size and
> >> in TLB usage to use 1GB pages instead of whatever smaller pages you have
> >> to use for oddball regions.
> >>
> >> For systems without enough MTRRs, the small hit in paging
> >> performance is the necessary trade-off.
> >>
> >> At least that is my perspective. Does that make sense?
> >
> > I think I'm beginning to get your perspective. From your point of
> > view, is kexec failing with "nogbpages" set a bug? My point of view
> > is it likely is. I think your view would say it isn't?
>
> I would say it is a bug.
>
> Part of the bug is someone yet again taking something simple that
> kexec is doing and reworking it to use generic code, then changing
> the generic code to do something different from what kexec needs
> and then being surprised that kexec stops working.
>
> The interface kexec wants to provide to whatever is being loaded is not
> having to think about page tables until that software is up far enough
> to enable its own page tables.
>
> People being clever and enabling just enough pages in the page tables
> to work, based upon the results of some buggy (they are always buggy; some
> are just less so than others) boot-up firmware, is where I get concerned.
>
> Said another way, the point is to build an identity mapped page table.
> Skipping some parts of the physical<->virtual identity because we seem
> to think no one will use it is likely a bug.
Hmm. I would think what's needed for kexec is to create, as nearly as
possible, identical conditions to what the BIOS / bootloader provides
when jumping to the kernel entry point. Whatever agreements are set
on entry to the kernel, kexec needs to match.
And I think you want a completely identity mapped table to match those
entry point requirements; that's why, on other platforms, the condition
is the MMU being turned off.
From that point of view, it does make sense to special case UV systems
for this. The restricted areas we're talking about are not in the map
when the bootloader is started on the UV platform.
> I really don't see any point in putting holes in such a page table for
> any address below the highest address that is good for something. Given
> that on some systems the MTRRs are insufficient to do their job, it
> definitely makes sense to not enable caching on areas that we don't
> think are memory.
Well, on the UV platform, these addresses are *not* good for
something, at least from any processor's point of view, nor any IO
device (they are not allowed to appear in any DMA or PCI bus master
transaction, either). A hardware ASIC is using this portion of local
RAM to hold some tables that are too large to put directly on the
ASIC. Things turn ugly if anyone else tries to access these
addresses.
In another message, Pavin thanked you for your work on kexec. I'd like
to express my appreciation also. In my current job, I'm mostly
focused on its use for kdump kernels. I've been dealing with kernel
crash dumps since running Unix on i386 machines, and always had to
deal with "OK, but what if the kernel state gets corrupt enough that
the disk driver won't work, or the network if you're trying to do a
remote dump?" The use of kexec to start a fresh instance of the kernel
is an excellent way to solve that problem, in my opinion. And a couple
of jobs ago we were able to use it to restart a SAN switch after a
software upgrade, without needing to stop forwarding traffic, which
wouldn't have been possible without kexec.
Thanks,
--> Steve Wahl
--
Steve Wahl, Hewlett Packard Enterprise
On 4/1/24 08:15, Steve Wahl wrote:
> From that point of view, it does make sense to special case UV systems
> for this. The restricted areas we're talking about are not in the map
> when the bootloader is started on the UV platform.
Just to be clear what I'm looking for here: Special casing UV systems is
theoretically OK. What I don't like is tying that to whether GB pages are
used or not.
It would be much nicer to have specific, precise information about what
UV needs done. For instance, do we know where the special address range
is? Is it fixed? If so, I'd much rather have code that says: "Whoa,
don't map this range with *any* identity map page tables" versus
something targeted specifically at gbpages.
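Roughly along these lines, hypothetically (ident_range_excluded() and
the excluded[] table are made-up names for illustration, not an
existing kernel interface):

    #include <linux/types.h>

    /*
     * Hypothetical sketch: platform code (e.g. UV) would register
     * ranges that must never appear in a kexec identity map, and the
     * page table builder would skip any entry that overlaps one.
     */
    struct ident_exclusion {
            unsigned long start;
            unsigned long end;
    };

    static struct ident_exclusion excluded[8]; /* filled in by platform code */
    static int nr_excluded;

    static bool ident_range_excluded(unsigned long addr, unsigned long next)
    {
            int i;

            for (i = 0; i < nr_excluded; i++) {
                    if (addr < excluded[i].end && next > excluded[i].start)
                            return true;    /* overlaps a forbidden range */
            }
            return false;
    }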
On Mon, Apr 01, 2024 at 11:03:20AM -0700, Dave Hansen wrote:
> On 4/1/24 08:15, Steve Wahl wrote:
> > From that point of view, it does make sense to special case UV systems
> > for this. The restricted areas we're talking about are not in the map
> > when the bootloader is started on the UV platform.
>
> Just to be clear what I'm looking for here: Special casing UV systems is
> theoretically OK. What I don't like is tying that to whether GB pages are
> used or not.
>
> It would be much nicer to have specific, precise information about what
> UV needs done. For instance, do we know where the special address range
> is? Is it fixed? If so, I'd much rather have code that says: "Whoa,
> don't map this range with *any* identity map page tables" versus
> something targeted specifically at gbpages.
The area is not fixed. There is some reserved memory in each NUMA node;
the address and amount vary depending on the amount of memory in the
system.
We've kept the memory range marked as reserved in the e820 memory
tables, and our experience is the kernel respects that in all other
aspects. Even creation of the identity maps is, on the surface,
respecting those areas being listed as reserved, but using GB pages
"swings wide" and includes the reserved areas in the identity map.
So, I don't fully understand your hesitation in being more selective
in the use of gbpages on UV, or what I might suggest in its place.
Here's partial kernel output showing the memory map, from a randomly
selected system in case it helps illustrate it for you:
[Tue Mar 21 09:40:00 2023] BIOS-provided physical RAM map:
[Tue Mar 21 09:40:00 2023] BIOS-e820: [mem 0x0000000000000000-0x000000000009ffff] usable
[Tue Mar 21 09:40:00 2023] BIOS-e820: [mem 0x00000000000a0000-0x00000000000fffff] reserved
[Tue Mar 21 09:40:00 2023] BIOS-e820: [mem 0x0000000000100000-0x0000000068285fff] usable
[Tue Mar 21 09:40:00 2023] BIOS-e820: [mem 0x0000000068286000-0x0000000068286fff] ACPI NVS
[Tue Mar 21 09:40:00 2023] BIOS-e820: [mem 0x0000000068287000-0x0000000068baefff] usable
[Tue Mar 21 09:40:00 2023] BIOS-e820: [mem 0x0000000068baf000-0x0000000068bb1fff] reserved
[Tue Mar 21 09:40:00 2023] BIOS-e820: [mem 0x0000000068bb2000-0x0000000068c3ffff] usable
[Tue Mar 21 09:40:00 2023] BIOS-e820: [mem 0x0000000068c40000-0x000000006a33ffff] reserved
[Tue Mar 21 09:40:00 2023] BIOS-e820: [mem 0x000000006a340000-0x000000006d5fefff] ACPI NVS
[Tue Mar 21 09:40:00 2023] BIOS-e820: [mem 0x000000006d5ff000-0x000000006fffefff] ACPI data
[Tue Mar 21 09:40:00 2023] BIOS-e820: [mem 0x000000006ffff000-0x000000006fffffff] usable
[Tue Mar 21 09:40:00 2023] BIOS-e820: [mem 0x0000000070000000-0x000000008fffffff] reserved
[Tue Mar 21 09:40:00 2023] BIOS-e820: [mem 0x00000000f8000000-0x00000000fbffffff] reserved
[Tue Mar 21 09:40:00 2023] BIOS-e820: [mem 0x00000000fe010000-0x00000000fe010fff] reserved
[Tue Mar 21 09:40:00 2023] BIOS-e820: [mem 0x0000000100000000-0x0000003f7fffffff] usable
[Tue Mar 21 09:40:00 2023] BIOS-e820: [mem 0x0000003f80000000-0x000000407fffffff] reserved
[Tue Mar 21 09:40:00 2023] BIOS-e820: [mem 0x0000004080000000-0x0000007f7fffffff] usable
[Tue Mar 21 09:40:00 2023] BIOS-e820: [mem 0x0000007f80000000-0x000000807fffffff] reserved
[Tue Mar 21 09:40:00 2023] BIOS-e820: [mem 0x0000008080000000-0x000000bf7fffffff] usable
[Tue Mar 21 09:40:00 2023] BIOS-e820: [mem 0x000000bf80000000-0x000000c07fffffff] reserved
[Tue Mar 21 09:40:00 2023] BIOS-e820: [mem 0x000000c080000000-0x000000ff7fffffff] usable
[Tue Mar 21 09:40:00 2023] BIOS-e820: [mem 0x000000ff80000000-0x000001007fffffff] reserved
[Tue Mar 21 09:40:00 2023] BIOS-e820: [mem 0x0000010080000000-0x0000013f7fffffff] usable
[Tue Mar 21 09:40:00 2023] BIOS-e820: [mem 0x0000013f80000000-0x000001407fffffff] reserved
[Tue Mar 21 09:40:00 2023] BIOS-e820: [mem 0x0000014080000000-0x0000017f7fffffff] usable
[Tue Mar 21 09:40:00 2023] BIOS-e820: [mem 0x0000017f80000000-0x000001807fffffff] reserved
[Tue Mar 21 09:40:00 2023] BIOS-e820: [mem 0x0000018080000000-0x000001bf7fffffff] usable
[Tue Mar 21 09:40:00 2023] BIOS-e820: [mem 0x000001bf80000000-0x000001c07fffffff] reserved
[Tue Mar 21 09:40:00 2023] BIOS-e820: [mem 0x000001c080000000-0x000001ff7fffffff] usable
[Tue Mar 21 09:40:00 2023] BIOS-e820: [mem 0x000001ff80000000-0x000002007fffffff] reserved
[Tue Mar 21 09:40:00 2023] BIOS-e820: [mem 0x0000020080000000-0x0000023f7fffffff] usable
[Tue Mar 21 09:40:00 2023] BIOS-e820: [mem 0x0000023f80000000-0x000002407fffffff] reserved
[Tue Mar 21 09:40:00 2023] BIOS-e820: [mem 0x0000024080000000-0x0000027f7fffffff] usable
[Tue Mar 21 09:40:00 2023] BIOS-e820: [mem 0x0000027f80000000-0x000002807fffffff] reserved
[Tue Mar 21 09:40:00 2023] BIOS-e820: [mem 0x0000028080000000-0x000002bf7fffffff] usable
[Tue Mar 21 09:40:00 2023] BIOS-e820: [mem 0x000002bf80000000-0x000002c07fffffff] reserved
[Tue Mar 21 09:40:00 2023] BIOS-e820: [mem 0x000002c080000000-0x000002ff7fffffff] usable
[Tue Mar 21 09:40:00 2023] BIOS-e820: [mem 0x000002ff80000000-0x000003007fffffff] reserved
[Tue Mar 21 09:40:00 2023] BIOS-e820: [mem 0x0000030080000000-0x0000033f7fffffff] usable
[Tue Mar 21 09:40:00 2023] BIOS-e820: [mem 0x0000033f80000000-0x000003407fffffff] reserved
[Tue Mar 21 09:40:00 2023] BIOS-e820: [mem 0x0000034080000000-0x0000037f7fffffff] usable
[Tue Mar 21 09:40:00 2023] BIOS-e820: [mem 0x0000037f80000000-0x000003807fffffff] reserved
[Tue Mar 21 09:40:00 2023] BIOS-e820: [mem 0x0000038080000000-0x000003bf7fffffff] usable
[Tue Mar 21 09:40:00 2023] BIOS-e820: [mem 0x000003bf80000000-0x000003c07fffffff] reserved
[Tue Mar 21 09:40:00 2023] BIOS-e820: [mem 0x000003c080000000-0x000003ff7fffffff] usable
[Tue Mar 21 09:40:00 2023] BIOS-e820: [mem 0x000003ff80000000-0x000004007fffffff] reserved
[Tue Mar 21 09:40:00 2023] BIOS-e820: [mem 0x00001ffe00000000-0x00001ffeffffffff] reserved
[Tue Mar 21 09:40:00 2023] BIOS-e820: [mem 0x00001fffc0000000-0x00001fffdfffffff] reserved
[Tue Mar 21 09:40:00 2023] printk: bootconsole [earlyser0] enabled
[Tue Mar 21 09:40:00 2023] NX (Execute Disable) protection: active
[Tue Mar 21 09:40:00 2023] e820: update [mem 0x3d279018-0x3d2b3a57] usable ==> usable
[Tue Mar 21 09:40:00 2023] e820: update [mem 0x3d279018-0x3d2b3a57] usable ==> usable
[Tue Mar 21 09:40:00 2023] e820: update [mem 0x3d23e018-0x3d278a57] usable ==> usable
[Tue Mar 21 09:40:00 2023] e820: update [mem 0x3d23e018-0x3d278a57] usable ==> usable
[Tue Mar 21 09:40:00 2023] extended physical RAM map:
[Tue Mar 21 09:40:00 2023] reserve setup_data: [mem 0x0000000000000000-0x000000000009ffff] usable
[Tue Mar 21 09:40:00 2023] reserve setup_data: [mem 0x00000000000a0000-0x00000000000fffff] reserved
[Tue Mar 21 09:40:00 2023] reserve setup_data: [mem 0x0000000000100000-0x000000003d23e017] usable
[Tue Mar 21 09:40:00 2023] reserve setup_data: [mem 0x000000003d23e018-0x000000003d278a57] usable
[Tue Mar 21 09:40:00 2023] reserve setup_data: [mem 0x000000003d278a58-0x000000003d279017] usable
[Tue Mar 21 09:40:00 2023] reserve setup_data: [mem 0x000000003d279018-0x000000003d2b3a57] usable
[Tue Mar 21 09:40:00 2023] reserve setup_data: [mem 0x000000003d2b3a58-0x0000000068285fff] usable
[Tue Mar 21 09:40:00 2023] reserve setup_data: [mem 0x0000000068286000-0x0000000068286fff] ACPI NVS
[Tue Mar 21 09:40:00 2023] reserve setup_data: [mem 0x0000000068287000-0x0000000068baefff] usable
[Tue Mar 21 09:40:00 2023] reserve setup_data: [mem 0x0000000068baf000-0x0000000068bb1fff] reserved
[Tue Mar 21 09:40:00 2023] reserve setup_data: [mem 0x0000000068bb2000-0x0000000068c3ffff] usable
[Tue Mar 21 09:40:00 2023] reserve setup_data: [mem 0x0000000068c40000-0x000000006a33ffff] reserved
[Tue Mar 21 09:40:00 2023] reserve setup_data: [mem 0x000000006a340000-0x000000006d5fefff] ACPI NVS
[Tue Mar 21 09:40:00 2023] reserve setup_data: [mem 0x000000006d5ff000-0x000000006fffefff] ACPI data
[Tue Mar 21 09:40:00 2023] reserve setup_data: [mem 0x000000006ffff000-0x000000006fffffff] usable
[Tue Mar 21 09:40:00 2023] reserve setup_data: [mem 0x0000000070000000-0x000000008fffffff] reserved
[Tue Mar 21 09:40:00 2023] reserve setup_data: [mem 0x00000000f8000000-0x00000000fbffffff] reserved
[Tue Mar 21 09:40:00 2023] reserve setup_data: [mem 0x00000000fe010000-0x00000000fe010fff] reserved
[Tue Mar 21 09:40:00 2023] reserve setup_data: [mem 0x0000000100000000-0x0000003f7fffffff] usable
[Tue Mar 21 09:40:00 2023] reserve setup_data: [mem 0x0000003f80000000-0x000000407fffffff] reserved
[Tue Mar 21 09:40:00 2023] reserve setup_data: [mem 0x0000004080000000-0x0000007f7fffffff] usable
[Tue Mar 21 09:40:00 2023] reserve setup_data: [mem 0x0000007f80000000-0x000000807fffffff] reserved
[Tue Mar 21 09:40:00 2023] reserve setup_data: [mem 0x0000008080000000-0x000000bf7fffffff] usable
[Tue Mar 21 09:40:00 2023] reserve setup_data: [mem 0x000000bf80000000-0x000000c07fffffff] reserved
[Tue Mar 21 09:40:00 2023] reserve setup_data: [mem 0x000000c080000000-0x000000ff7fffffff] usable
[Tue Mar 21 09:40:00 2023] reserve setup_data: [mem 0x000000ff80000000-0x000001007fffffff] reserved
[Tue Mar 21 09:40:00 2023] reserve setup_data: [mem 0x0000010080000000-0x0000013f7fffffff] usable
[Tue Mar 21 09:40:00 2023] reserve setup_data: [mem 0x0000013f80000000-0x000001407fffffff] reserved
[Tue Mar 21 09:40:00 2023] reserve setup_data: [mem 0x0000014080000000-0x0000017f7fffffff] usable
[Tue Mar 21 09:40:00 2023] reserve setup_data: [mem 0x0000017f80000000-0x000001807fffffff] reserved
[Tue Mar 21 09:40:00 2023] reserve setup_data: [mem 0x0000018080000000-0x000001bf7fffffff] usable
[Tue Mar 21 09:40:00 2023] reserve setup_data: [mem 0x000001bf80000000-0x000001c07fffffff] reserved
[Tue Mar 21 09:40:00 2023] reserve setup_data: [mem 0x000001c080000000-0x000001ff7fffffff] usable
[Tue Mar 21 09:40:00 2023] reserve setup_data: [mem 0x000001ff80000000-0x000002007fffffff] reserved
[Tue Mar 21 09:40:00 2023] reserve setup_data: [mem 0x0000020080000000-0x0000023f7fffffff] usable
[Tue Mar 21 09:40:00 2023] reserve setup_data: [mem 0x0000023f80000000-0x000002407fffffff] reserved
[Tue Mar 21 09:40:00 2023] reserve setup_data: [mem 0x0000024080000000-0x0000027f7fffffff] usable
[Tue Mar 21 09:40:00 2023] reserve setup_data: [mem 0x0000027f80000000-0x000002807fffffff] reserved
[Tue Mar 21 09:40:00 2023] reserve setup_data: [mem 0x0000028080000000-0x000002bf7fffffff] usable
[Tue Mar 21 09:40:00 2023] reserve setup_data: [mem 0x000002bf80000000-0x000002c07fffffff] reserved
[Tue Mar 21 09:40:00 2023] reserve setup_data: [mem 0x000002c080000000-0x000002ff7fffffff] usable
[Tue Mar 21 09:40:00 2023] reserve setup_data: [mem 0x000002ff80000000-0x000003007fffffff] reserved
[Tue Mar 21 09:40:00 2023] reserve setup_data: [mem 0x0000030080000000-0x0000033f7fffffff] usable
[Tue Mar 21 09:40:00 2023] reserve setup_data: [mem 0x0000033f80000000-0x000003407fffffff] reserved
[Tue Mar 21 09:40:00 2023] reserve setup_data: [mem 0x0000034080000000-0x0000037f7fffffff] usable
[Tue Mar 21 09:40:00 2023] reserve setup_data: [mem 0x0000037f80000000-0x000003807fffffff] reserved
[Tue Mar 21 09:40:00 2023] reserve setup_data: [mem 0x0000038080000000-0x000003bf7fffffff] usable
[Tue Mar 21 09:40:00 2023] reserve setup_data: [mem 0x000003bf80000000-0x000003c07fffffff] reserved
[Tue Mar 21 09:40:00 2023] reserve setup_data: [mem 0x000003c080000000-0x000003ff7fffffff] usable
[Tue Mar 21 09:40:00 2023] reserve setup_data: [mem 0x000003ff80000000-0x000004007fffffff] reserved
[Tue Mar 21 09:40:00 2023] reserve setup_data: [mem 0x00001ffe00000000-0x00001ffeffffffff] reserved
[Tue Mar 21 09:40:00 2023] reserve setup_data: [mem 0x00001fffc0000000-0x00001fffdfffffff] reserved
[Tue Mar 21 09:40:00 2023] efi: EFI v2.7 by HPE
Thanks,
--> Steve Wahl
--
Steve Wahl, Hewlett Packard Enterprise
On Mon, Apr 01, 2024 at 01:49:13PM -0500, Steve Wahl wrote:
> On Mon, Apr 01, 2024 at 11:03:20AM -0700, Dave Hansen wrote:
> > On 4/1/24 08:15, Steve Wahl wrote:
> > > From that point of view, it does make sense to special case UV systems
> > > for this. The restricted areas we're talking about are not in the map
> > > when the bootloader is started on the UV platform.
> >
> > Just to be clear what I'm looking for here: Special casing UV systems is
> > theoretically OK. What I don't like is doing that in using GB pages or not.
> >
> > It would be much nicer to have specific, precise information about what
> > UV needs done. For instance, do we know where the special address range
> > is? Is it fixed? If so, I'd much rather have code that says: "Whoa,
> > don't map this range with *any* identity map page tables" versus
> > something targeted specifically at gbpages.
>
> The area is not fixed. There is some reserved memory in each NUMA node;
> the address and amount vary depending on the amount of memory in the
> system.
>
> We've kept the memory range marked as reserved in the e820 memory
> tables, and our experience is the kernel respects that in all other
> aspects. Even creation of the identity maps is, on the surface,
> respecting those areas being listed as reserved, but using GB pages
> "swings wide" and includes the reserved areas in the identity map.
>
> So, I don't fully understand your hesitation in being more selective
> in the use of gbpages on UV, or what I might suggest in its place.
Dave, given the above, can you explain more precisely why the current
approach in the patch is not acceptable to you, and what you'd like to
see different?
Or is it maybe ok as it stands?
Thanks,
--> Steve Wahl
--
Steve Wahl, Hewlett Packard Enterprise
On Mon, Mar 25, 2024 at 6:58 AM Ingo Molnar <[email protected]> wrote:
> Anyway, I've reverted this in tip:x86/urgent:
>
> c567f2948f57 Revert "x86/mm/ident_map: Use gbpages only where full GB page should be mapped."
I see that this hasn't been reverted in the longterm branches it made
it into already (6.1.x and 6.6.x, for example) - is it expected to be
reverted there as well? I'd think it should be, until this is all
sorted out.
On Fri, Apr 05, 2024 at 09:13:36AM -0400, Eric Hagberg wrote:
> On Mon, Mar 25, 2024 at 6:58 AM Ingo Molnar <[email protected]> wrote:
> > Anyway, I've reverted this in tip:x86/urgent:
> >
> > c567f2948f57 Revert "x86/mm/ident_map: Use gbpages only where full GB page should be mapped."
>
> I see that this hasn't been reverted in the longterm branches it made
> it into already (6.1.x and 6.6.x, for example) - is it expected to be
> reverted there as well? I'd think it should be, until this is all
> sorted out.
>
The revert is queued up for the next round of stable updates.
thanks,
greg k-h