Instead of using gbpages for all memory regions, use them only when
map creation requests include the full GB page of space; descend to
using smaller 2M pages when only portions of a GB page are requested.
When gbpages are used exclusively to create identity maps, large
ranges of addresses not actually requested can be included in the
resulting table. On UV systems, this ends up including regions that
will cause hardware to halt the system if accessed (these are marked
"reserved" by BIOS). Even though code does not actually make
references to these addresses, including them in an active map allows
processor speculation into this region, which is enough to trigger the
system halt.
The kernel option "nogbpages" will disallow use of gbpages entirely
and avoid this problem, but uses a lot of extra memory for page tables
that are not really needed.
Signed-off-by: Steve Wahl <[email protected]>
---
Please ignore previous send with internal message. Thanks.
arch/x86/mm/ident_map.c | 18 +++++++++++++-----
1 file changed, 13 insertions(+), 5 deletions(-)
diff --git a/arch/x86/mm/ident_map.c b/arch/x86/mm/ident_map.c
index 968d7005f4a7..b63a1ffcfe9f 100644
--- a/arch/x86/mm/ident_map.c
+++ b/arch/x86/mm/ident_map.c
@@ -31,18 +31,26 @@ static int ident_pud_init(struct x86_mapping_info *info, pud_t *pud_page,
if (next > end)
next = end;
- if (info->direct_gbpages) {
+ /*
+ * if gbpages allowed, this entry not yet present, and
+ * the full gbpage range is requested (both ends are
+ * correctly aligned), create a gbpage.
+ */
+ if (info->direct_gbpages
+ && !pud_present(*pud)
+ && !(addr & ~PUD_MASK)
+ && !(next & ~PUD_MASK)) {
pud_t pudval;
- if (pud_present(*pud))
- continue;
-
- addr &= PUD_MASK;
pudval = __pud((addr - info->offset) | info->page_flag);
set_pud(pud, pudval);
continue;
}
+ /* if this is already a gbpage, this portion is already mapped */
+ if (pud_large(*pud))
+ continue;
+
if (pud_present(*pud)) {
pmd = pmd_offset(pud, 0);
ident_pmd_init(info, pmd, addr, next);
--
2.26.2
On 10/31/23 12:50, Steve Wahl wrote:
> Instead of using gbpages for all memory regions, use them only when
> map creation requests include the full GB page of space; descend to
> using smaller 2M pages when only portions of a GB page are requested.
...
The logic here is sound: we obviously don't want systems rebooting
because speculation went wonky.
> diff --git a/arch/x86/mm/ident_map.c b/arch/x86/mm/ident_map.c
> index 968d7005f4a7..b63a1ffcfe9f 100644
> --- a/arch/x86/mm/ident_map.c
> +++ b/arch/x86/mm/ident_map.c
> @@ -31,18 +31,26 @@ static int ident_pud_init(struct x86_mapping_info *info, pud_t *pud_page,
> if (next > end)
> next = end;
>
> - if (info->direct_gbpages) {
> + /*
> + * if gbpages allowed, this entry not yet present, and
> + * the full gbpage range is requested (both ends are
> + * correctly aligned), create a gbpage.
> + */
> + if (info->direct_gbpages
> + && !pud_present(*pud)
> + && !(addr & ~PUD_MASK)
> + && !(next & ~PUD_MASK)) {
This is a _bit_ too subtle for my taste.
Let's say someone asks for mapping of 2MB@5G, then later asks for 1G@5G.
The first call in here will do the 2MB mapping with small (pud)
entries. The second call will see the new pud_present() check and
*also* skip large pud creation.
That's a regression. It might not matter _much_, but it's a place where
the old code creates large PUDs and the new code creates small ones.
It's the kind of behavior change that at least needs to be noted in the
changelog.
Off the top of my head, I can't think of why we'd get overlapping
requests in here, though. Did you think through that case? Is it common?
> pud_t pudval;
>
> - if (pud_present(*pud))
> - continue;
> -
> - addr &= PUD_MASK;
> pudval = __pud((addr - info->offset) | info->page_flag);
> set_pud(pud, pudval);
> continue;
> }
>
> + /* if this is already a gbpage, this portion is already mapped */
> + if (pud_large(*pud))
> + continue;
I'd probably move this check up to above the large PUD creation code.
It would make it more obvious that any PUD that's encountered down below
is a small PUD.
> if (pud_present(*pud)) {
> pmd = pmd_offset(pud, 0);
> ident_pmd_init(info, pmd, addr, next);
That would end up looking something like this:
bool do_gbpages = true;
...
// Is the whole range already mapped?
if (pud_large(*pud))
continue;
/* PUD is either empty or small */
// GB pages allowed?
do_gbpages &= info->direct_gbpages;
// Addresses aligned for GB pages?
do_gbpages &= ~(addr & ~PUD_MASK);
do_gbpages &= ~(next & ~PUD_MASK);
// Check for existing mapping using a small PUD
do_gbpages &= !pud_present(*pud);
if (do_gbpages) {
set_pud(pud, __pud((addr - info->offset) |
info->page_flag));
continue
}
Dave,
Thank you for taking the time to review my patch. (More below.)
On Thu, Nov 02, 2023 at 09:02:44AM -0700, Dave Hansen wrote:
> On 10/31/23 12:50, Steve Wahl wrote:
> > Instead of using gbpages for all memory regions, use them only when
> > map creation requests include the full GB page of space; descend to
> > using smaller 2M pages when only portions of a GB page are requested.
> ...
>
> The logic here is sound: we obviously don't want systems rebooting
> because speculation went wonky.
>
> > diff --git a/arch/x86/mm/ident_map.c b/arch/x86/mm/ident_map.c
> > index 968d7005f4a7..b63a1ffcfe9f 100644
> > --- a/arch/x86/mm/ident_map.c
> > +++ b/arch/x86/mm/ident_map.c
> > @@ -31,18 +31,26 @@ static int ident_pud_init(struct x86_mapping_info *info, pud_t *pud_page,
> > if (next > end)
> > next = end;
> >
> > - if (info->direct_gbpages) {
> > + /*
> > + * if gbpages allowed, this entry not yet present, and
> > + * the full gbpage range is requested (both ends are
> > + * correctly aligned), create a gbpage.
> > + */
> > + if (info->direct_gbpages
> > + && !pud_present(*pud)
> > + && !(addr & ~PUD_MASK)
> > + && !(next & ~PUD_MASK)) {
>
> This is a _bit_ too subtle for my taste.
>
> Let's say someone asks for mapping of 2MB@5G, then later asks for 1G@5G.
> The first call in here will do the 2MB mapping with small (pud)
> entries. The second call will see the new pud_present() check and
> *also* skip large pud creation.
>
> That's a regression. It might not matter _much_, but it's a place where
> the old code creates large PUDs and the new code creates small ones.
> It's the kind of behavior change that at least needs to be noted in the
> changelog.
I will add a note that requests that create a small page mapping will
have that small mapping persist in this range, even if subsequent
requests enlarge the mapped area to cover the whole 1G segment.
> Off the top of my head, I can't think of why we'd get overlapping
> requests in here, though. Did you think through that case? Is it common?
Yes, I had thought about the overlaps. Of the choices I had here,
continuing to use the already allocated PMD page and fill the map in
with 2M pages seemed the best option.
Existing usage (the kernel decompression stub and kexec) start with
huge tracts of memory and then add smaller pieces that may or may not
already reside in the map created so far. (See, for example, the
comment around line 231 in arch/x86/kernel/machine_kexec_64.c.)
My early private versions with printks reflected this, but this was
limited to testing on UV systems.
In short, with current usage overlap is expected, but it would be rare
for small pieces that requrie PMD mapping to be followed by large
pieces that include the whole PUD level region.
> > pud_t pudval;
> >
> > - if (pud_present(*pud))
> > - continue;
> > -
> > - addr &= PUD_MASK;
> > pudval = __pud((addr - info->offset) | info->page_flag);
> > set_pud(pud, pudval);
> > continue;
> > }
> >
> > + /* if this is already a gbpage, this portion is already mapped */
> > + if (pud_large(*pud))
> > + continue;
>
> I'd probably move this check up to above the large PUD creation code.
> It would make it more obvious that any PUD that's encountered down below
> is a small PUD.
That makes sense. I will change this.
> > if (pud_present(*pud)) {
> > pmd = pmd_offset(pud, 0);
> > ident_pmd_init(info, pmd, addr, next);
>
> That would end up looking something like this:
>
> bool do_gbpages = true;
> ...
>
> // Is the whole range already mapped?
> if (pud_large(*pud))
> continue;
>
> /* PUD is either empty or small */
>
> // GB pages allowed?
> do_gbpages &= info->direct_gbpages;
> // Addresses aligned for GB pages?
> do_gbpages &= ~(addr & ~PUD_MASK);
> do_gbpages &= ~(next & ~PUD_MASK);
> // Check for existing mapping using a small PUD
> do_gbpages &= !pud_present(*pud);
>
> if (do_gbpages) {
> set_pud(pud, __pud((addr - info->offset) |
> info->page_flag));
> continue
> }
I tried coding it up with the bool instead of the single if statement,
and to me it did not look as easy to read and understand as the single
if statement version. So unless you firmly object, I'm leaving it the
original way, but improving the comment above the if statement to have
one-for-one explanations for each condition.
Thanks,
--> Steve Wahl
--
Steve Wahl, Hewlett Packard Enterprise