2019-08-28 19:43:58

by Steve Wahl

Subject: Purgatory compile flag changes apparently causing Kexec relocation overflows

Please CC me on responses to this.

I normally would do more diligence on this, but the timing is such
that I think it's better to get this out sooner.

With the tip of the tree from https://github.com/torvalds/linux.git (a
few days old, most recent commit fetched is
bb7ba8069de933d69cb45dd0a5806b61033796a3), I'm seeing "kexec: Overflow
in relocation type 11 value 0x11fffd000" when I try to load a crash
kernel with kdump. This seems to be caused by commit
059f801a937d164e03b33c1848bb3dca67c0b04, which changed the compiler
flags used to compile purgatory.ro, apparently creating 32 bit
relocations for things that aren't necessarily reachable with a 32 bit
reference. My guess is this only occurs when the crash kernel is
located outside 32-bit addressable physical space.

I have so far verified that the problem occurs with that commit, and
does not occur with the previous commit. For this commit, Thomas
Gleixner mentioned a few of the changed flags should have been looked
at twice. I have not gone so far as to figure out which flags cause
the problem.

The hardware in use is an HPE Superdome Flex with 48 * 32 GiB DIMMs
(total 1536 GiB).

One example of the exact error messages seen:

2019-08-28T13:42:39.308110-05:00 uv4test14 kernel: [ 45.137743] kexec: Overflow in relocation type 11 value 0x17f7affd000
2019-08-28T13:42:39.308123-05:00 uv4test14 kernel: [ 45.137749] kexec-bzImage64: Loading purgatory failed
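
For context, relocation type 11 on x86-64 is R_X86_64_32S in the psABI numbering, a field that holds a sign-extended 32-bit value. The following is a minimal sketch, assuming the check behind the message amounts to "the 64-bit value must round-trip through a signed 32-bit field"; fits_s32 is a hypothetical helper name, not a kernel function. It tests the two values quoted in this report against that range:

```shell
# Sketch only: fits_s32 is a hypothetical helper, not kernel code.
# $1 is a value in decimal or 0x-prefixed hex; shell arithmetic is
# assumed to be 64-bit, as it is in common sh and bash builds.
fits_s32() {
    if [ "$(( ($1) >= -2147483648 && ($1) <= 2147483647 ))" -eq 1 ]; then
        echo fits
    else
        echo overflow
    fi
}

# The largest positive s32 value, then the two values from the report.
for value in 0x7fffffff 0x11fffd000 0x17f7affd000; do
    printf '%-15s %s\n' "$value" "$(fits_s32 "$value")"
done
```

Both reported values lie above 4 GiB (0x100000000), so neither can be stored in a 32-bit field, which is consistent with the guess that the failure depends on where the crash kernel is placed.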

--> Steve Wahl
--
Steve Wahl, Hewlett Packard Enterprise


2019-08-28 21:54:12

by Nick Desaulniers

Subject: Re: Purgatory compile flag changes apparently causing Kexec relocation overflows

On Wed, Aug 28, 2019 at 12:42 PM Steve Wahl <[email protected]> wrote:
>
> Please CC me on responses to this.
>
> I normally would do more diligence on this, but the timing is such
> that I think it's better to get this out sooner.
>
> With the tip of the tree from https://github.com/torvalds/linux.git (a
> few days old, most recent commit fetched is
> bb7ba8069de933d69cb45dd0a5806b61033796a3), I'm seeing "kexec: Overflow
> in relocation type 11 value 0x11fffd000" when I try to load a crash
> kernel with kdump. This seems to be caused by commit
> 059f801a937d164e03b33c1848bb3dca67c0b04, which changed the compiler
> flags used to compile purgatory.ro, apparently creating 32 bit
> relocations for things that aren't necessarily reachable with a 32 bit
> reference. My guess is this only occurs when the crash kernel is
> located outside 32-bit addressable physical space.
>
> I have so far verified that the problem occurs with that commit, and
> does not occur with the previous commit. For this commit, Thomas
> Gleixner mentioned a few of the changed flags should have been looked
> at twice. I have not gone so far as to figure out which flags cause
> the problem.
>
> The hardware in use is a HPE Superdome Flex with 48 * 32GiB dimms
> (total 1536 GiB).
>
> One example of the exact error messages seen:
>
> 2019-08-28T13:42:39.308110-05:00 uv4test14 kernel: [ 45.137743] kexec: Overflow in relocation type 11 value 0x17f7affd000
> 2019-08-28T13:42:39.308123-05:00 uv4test14 kernel: [ 45.137749] kexec-bzImage64: Loading purgatory failed

Thanks for the report and sorry for the breakage. Can you please send
me more information on how to reproduce the issue precisely? I'm
happy to look into fixing it.

Let me go dig up the different listed flags. Steve, it may be fastest
for you to test re-adding them in your setup to see which one is
important.

Tglx, if you want to revert the above patches, I'm OK with that. It's
important that we eventually fix the issue my patches were meant to
address, but precisely *when* it's solved isn't critical; in the worst
case, our kernels can carry out-of-tree patches until the issue is
completely resolved.
--
Thanks,
~Nick Desaulniers

2019-08-28 22:10:16

by Nick Desaulniers

Subject: Re: Purgatory compile flag changes apparently causing Kexec relocation overflows

On Wed, Aug 28, 2019 at 2:51 PM Nick Desaulniers
<[email protected]> wrote:
>
> On Wed, Aug 28, 2019 at 12:42 PM Steve Wahl <[email protected]> wrote:
> >
> > Please CC me on responses to this.
> >
> > I normally would do more diligence on this, but the timing is such
> > that I think it's better to get this out sooner.
> >
> > With the tip of the tree from https://github.com/torvalds/linux.git (a
> > few days old, most recent commit fetched is
> > bb7ba8069de933d69cb45dd0a5806b61033796a3), I'm seeing "kexec: Overflow
> > in relocation type 11 value 0x11fffd000" when I try to load a crash
> > kernel with kdump. This seems to be caused by commit
> > 059f801a937d164e03b33c1848bb3dca67c0b04, which changed the compiler

Is this the correct SHA from mainline? I assume you meant
commit b059f801a937 ("x86/purgatory: Use CFLAGS_REMOVE rather than
reset KBUILD_CFLAGS")

> > flags used to compile purgatory.ro, apparently creating 32 bit
> > relocations for things that aren't necessarily reachable with a 32 bit
> > reference. My guess is this only occurs when the crash kernel is
> > located outside 32-bit addressable physical space.
> >
> > I have so far verified that the problem occurs with that commit, and
> > does not occur with the previous commit. For this commit, Thomas
> > Gleixner mentioned a few of the changed flags should have been looked
> > at twice. I have not gone so far as to figure out which flags cause
> > the problem.
> >
> > The hardware in use is a HPE Superdome Flex with 48 * 32GiB dimms
> > (total 1536 GiB).
> >
> > One example of the exact error messages seen:
> >
> > 2019-08-28T13:42:39.308110-05:00 uv4test14 kernel: [ 45.137743] kexec: Overflow in relocation type 11 value 0x17f7affd000
> > 2019-08-28T13:42:39.308123-05:00 uv4test14 kernel: [ 45.137749] kexec-bzImage64: Loading purgatory failed
>
> Thanks for the report and sorry for the breakage. Can you please send
> me more information for how to precisely reproduce the issue? I'm
> happy to look into fixing it.
>
> Let me go dig up the different listed flags. Steve, it may be fastest
> for you to test re-adding them in your setup to see which one is
> important.

https://lkml.org/lkml/2019/7/26/198 was the list. The "ratpoutine"
flags were added in the final version of the patch that landed. It's
not immediately clear to me which of those 4 changed flags would
result in the error that you're observing, but if you could test them
quickly to see which restores working behavior, we could triple check
it on our end and submit it.

>
> Tglx, if you want to revert the above patches, I'm ok with that. It's
> important that we fix the issue eventually that my patches were meant
> to address, but precisely *when* it's solved isn't critical; our
> kernels can carry out of tree patches for now until the issue is
> completely resolved worst case.

--
Thanks,
~Nick Desaulniers

2019-08-28 22:16:07

by Steve Wahl

Subject: Re: Purgatory compile flag changes apparently causing Kexec relocation overflows

On Wed, Aug 28, 2019 at 02:51:21PM -0700, Nick Desaulniers wrote:
> On Wed, Aug 28, 2019 at 12:42 PM Steve Wahl <[email protected]> wrote:
> >
> > Please CC me on responses to this.
> >
> > I normally would do more diligence on this, but the timing is such
> > that I think it's better to get this out sooner.
> >
> > With the tip of the tree from https://github.com/torvalds/linux.git (a
> > few days old, most recent commit fetched is
> > bb7ba8069de933d69cb45dd0a5806b61033796a3), I'm seeing "kexec: Overflow
> > in relocation type 11 value 0x11fffd000" when I try to load a crash
> > kernel with kdump. This seems to be caused by commit
> > 059f801a937d164e03b33c1848bb3dca67c0b04, which changed the compiler
> > flags used to compile purgatory.ro, apparently creating 32 bit
> > relocations for things that aren't necessarily reachable with a 32 bit
> > reference. My guess is this only occurs when the crash kernel is
> > located outside 32-bit addressable physical space.
> >
> > I have so far verified that the problem occurs with that commit, and
> > does not occur with the previous commit. For this commit, Thomas
> > Gleixner mentioned a few of the changed flags should have been looked
> > at twice. I have not gone so far as to figure out which flags cause
> > the problem.
> >
> > The hardware in use is a HPE Superdome Flex with 48 * 32GiB dimms
> > (total 1536 GiB).
> >
> > One example of the exact error messages seen:
> >
> > 2019-08-28T13:42:39.308110-05:00 uv4test14 kernel: [ 45.137743] kexec: Overflow in relocation type 11 value 0x17f7affd000
> > 2019-08-28T13:42:39.308123-05:00 uv4test14 kernel: [ 45.137749] kexec-bzImage64: Loading purgatory failed
>
> Thanks for the report and sorry for the breakage. Can you please send
> me more information for how to precisely reproduce the issue? I'm
> happy to look into fixing it.

Here are the details that I think might be important:

Since this appears to be a problem with the result of a relocation not
fitting within 32 bits, I think the location chosen to place the crash
kernel needs to be above 4GiB; so you need a machine with more memory
than that.
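
Where the reservation actually landed can be checked from /proc/iomem, which lists reserved ranges including any "Crash kernel" region. The following is a rough sketch, assuming the usual `start-end : name` line format; crash_regions is a hypothetical helper, and the sample addresses are made up for illustration rather than taken from this machine:

```shell
# Sketch only: crash_regions is a hypothetical helper, not part of kdump.
# Reads /proc/iomem-style lines ("start-end : name") on stdin and reports,
# for each "Crash kernel" region, whether it starts at or above 4 GiB.
crash_regions() {
    while IFS= read -r line; do
        case "$line" in
            *"Crash kernel"*) ;;
            *) continue ;;
        esac
        trimmed=${line#"${line%%[! ]*}"}   # strip leading indentation
        start=${trimmed%%-*}               # hex start address, no 0x prefix
        if [ "$(( 0x$start >= 0x100000000 ))" -eq 1 ]; then
            echo "0x$start above-4G"
        else
            echo "0x$start below-4G"
        fi
    done
}

# Made-up sample input; real addresses will differ.
crash_regions <<'EOF'
00100000-7fffffff : System RAM
  6b000000-7affffff : Crash kernel
100000000-17fffffffff : System RAM
  17f5b000000-17f7affffff : Crash kernel
EOF
```

On a live system this would be run as `crash_regions < /proc/iomem`; reading real addresses there generally requires root.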

At the moment I'm running SLES 12 sp 4 as the rest of the
environment. rpm says kdump is kdump-0.8.16-9.2.x86_64. I've fetched
the kernel sources and compiled directly on this system. I believe I
copied the kernel config from the SLES kernel and did a make
olddefconfig for configuration. Made and installed the kernel from
the kernel tree.

crashkernel=512M,high is set on the command line.

As the system boots, and systemd initializes kdump, it tries to load
the crash kernel, I believe through
/usr/lib/systemd/system/kdump.service running /lib/kdump/load.sh
--update.

Once that completes, 'systemctl status kdump' indicates a failure, and
dmesg | grep kexec shows the error messages mentioned above.
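
To pull the relocation type and value out of the log, something like the following may help. It is a sketch that assumes the message format shown in the original report; kexec_overflows is a hypothetical helper name, and the type names come from the x86-64 psABI numbering (10 is R_X86_64_32, 11 is R_X86_64_32S).

```shell
# Sketch only: kexec_overflows is a hypothetical helper.
# Reads kernel log lines on stdin and prints "<relocation name> <value>"
# for every kexec relocation overflow message it finds.
kexec_overflows() {
    sed -n 's/.*kexec: Overflow in relocation type \([0-9][0-9]*\) value \(0x[0-9a-fA-F]*\).*/\1 \2/p' |
    while read -r type value; do
        case "$type" in
            10) name=R_X86_64_32 ;;
            11) name=R_X86_64_32S ;;
            *)  name="type-$type" ;;   # any other type is left unnamed
        esac
        echo "$name $value"
    done
}

# Example use against the running kernel's log:
# dmesg | kexec_overflows
```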

> Let me go dig up the different listed flags. Steve, it may be fastest
> for you to test re-adding them in your setup to see which one is
> important.

I will work through that tomorrow and let you know what I find.

> Tglx, if you want to revert the above patches, I'm ok with that. It's
> important that we fix the issue eventually that my patches were meant
> to address, but precisely *when* it's solved isn't critical; our
> kernels can carry out of tree patches for now until the issue is
> completely resolved worst case.
> --
> Thanks,
> ~Nick Desaulniers

Thank you!

--> Steve Wahl

--
Steve Wahl, Hewlett Packard Enterprise

2019-08-28 22:25:04

by Nick Desaulniers

Subject: Re: Purgatory compile flag changes apparently causing Kexec relocation overflows

On Wed, Aug 28, 2019 at 3:14 PM Steve Wahl <[email protected]> wrote:
>
> On Wed, Aug 28, 2019 at 02:51:21PM -0700, Nick Desaulniers wrote:
> > On Wed, Aug 28, 2019 at 12:42 PM Steve Wahl <[email protected]> wrote:
> > >
> > > Please CC me on responses to this.
> > >
> > > I normally would do more diligence on this, but the timing is such
> > > that I think it's better to get this out sooner.
> > >
> > > With the tip of the tree from https://github.com/torvalds/linux.git (a
> > > few days old, most recent commit fetched is
> > > bb7ba8069de933d69cb45dd0a5806b61033796a3), I'm seeing "kexec: Overflow
> > > in relocation type 11 value 0x11fffd000" when I try to load a crash
> > > kernel with kdump. This seems to be caused by commit
> > > 059f801a937d164e03b33c1848bb3dca67c0b04, which changed the compiler
> > > flags used to compile purgatory.ro, apparently creating 32 bit
> > > relocations for things that aren't necessarily reachable with a 32 bit
> > > reference. My guess is this only occurs when the crash kernel is
> > > located outside 32-bit addressable physical space.
> > >
> > > I have so far verified that the problem occurs with that commit, and
> > > does not occur with the previous commit. For this commit, Thomas
> > > Gleixner mentioned a few of the changed flags should have been looked
> > > at twice. I have not gone so far as to figure out which flags cause
> > > the problem.
> > >
> > > The hardware in use is a HPE Superdome Flex with 48 * 32GiB dimms
> > > (total 1536 GiB).
> > >
> > > One example of the exact error messages seen:
> > >
> > > 2019-08-28T13:42:39.308110-05:00 uv4test14 kernel: [ 45.137743] kexec: Overflow in relocation type 11 value 0x17f7affd000
> > > 2019-08-28T13:42:39.308123-05:00 uv4test14 kernel: [ 45.137749] kexec-bzImage64: Loading purgatory failed
> >
> > Thanks for the report and sorry for the breakage. Can you please send
> > me more information for how to precisely reproduce the issue? I'm
> > happy to look into fixing it.
>
> Here's the details I know might be important:
>
> Since this appears to be a problem with the result of a relocation not
> fitting within 32 bits, I think the location chosen to place the crash
> kernel needs to be above 4GiB; so you need a machine with more memory
> than that.
>
> At the moment I'm running SLES 12 sp 4 as the rest of the
> environment. rpm says kdump is kdump-0.8.16-9.2.x86_64. I've fetched
> the kernel sources and compiled directly on this system. I believe I
> copied the kernel config from the SLES kernel and did a make
> olddefconfig for configuration. Made and installed the kernel from
> the kernel tree.
>
> crashkernel=512M,high is set on the command line.
>
> As the system boots, and systemd initializes kdump, it tries to load
> the crash kernel, I believe through
> /usr/lib/systemd/system/kdump.service running /lib/kdump/load.sh
> --update.
>
> Once that completes, 'systemctl status kdump' indicates a failure, and
> dmesg | grep kexec shows the error messages mentioned above.
>
> > Let me go dig up the different listed flags. Steve, it may be fastest
> > for you to test re-adding them in your setup to see which one is
> > important.
>
> I will work through that tomorrow and let you know what I find.
>
> > Tglx, if you want to revert the above patches, I'm ok with that. It's
> > important that we fix the issue eventually that my patches were meant
> > to address, but precisely *when* it's solved isn't critical; our
> > kernels can carry out of tree patches for now until the issue is
> > completely resolved worst case.

One question that might be more useful to answer first: is a revert of

commit b059f801a937 ("x86/purgatory: Use CFLAGS_REMOVE rather than
reset KBUILD_CFLAGS")

good enough, or must:

commit 4ce97317f41d ("x86/purgatory: Do not use __builtin_memcpy and
__builtin_memset")

be reverted as well? They were part of a two-patch series. I would
prefer that tglx revert as few patches as possible (to avoid "revert
of revert" soup), and I doubt the latter patch needs to be reverted.
(A fix with no reverts would be preferable still, but whichever.)
--
Thanks,
~Nick Desaulniers

2019-08-29 14:36:37

by Steve Wahl

Subject: Re: Purgatory compile flag changes apparently causing Kexec relocation overflows

On Wed, Aug 28, 2019 at 03:22:13PM -0700, Nick Desaulniers wrote:
>
> One point that might be more useful first would be, is a revert of:
>
> commit b059f801a937 ("x86/purgatory: Use CFLAGS_REMOVE rather than
> reset KBUILD_CFLAGS")
>
> good enough, or must:
>
> commit 4ce97317f41d ("x86/purgatory: Do not use __builtin_memcpy and
> __builtin_memset")
>
> be reverted additionally? They were part of a 2 patch patchset. I
> would prefer tglx to revert as few patches as necessary if possible
> (to avoid "revert of revert" soup), and I doubt the latter patch needs
> to be reverted. (Even more preferential would be a fix, with no
> reverts, but whichever).

A revert of the single commit is sufficient. Previously I have
checked out and compiled the tree at commit b059f801a937 and
b059f801a937^ (with caret, the previous commit). It worked with the
previous commit, but not with b059f801a937.

4ce97317f41d *is* the previous commit to b059f801a937, so it was in
both kernels that I tested:

$ git log -1 --oneline b059f801a937^ | cat
4ce97317f41d x86/purgatory: Do not use __builtin_memcpy and __builtin_memset
$

And, I also did an exploratory 'git revert b059f801a937' at the tip of
the tree. That corrects the problem as well.

So both say that it's only the single commit that would need to be
reverted *if* that's the route taken.

Now, on to seeing if we can narrow this down to a fix with no reverts
instead.

--> Steve
--
Steve Wahl, Hewlett Packard Enterprise