2017-03-01 18:58:28

by Sven Schmidt

Subject: Re: nios2 crash/hang in mainline due to 'lib: update LZ4 compressor module'

Hi Guenter, Tobias and Sandra,

thanks for your effort here.

On Tue, Feb 28, 2017 at 10:14:13AM -0800, Guenter Roeck wrote:
> On Tue, Feb 28, 2017 at 10:53:56AM -0700, Sandra Loosemore wrote:
> > On 02/28/2017 08:53 AM, Tobias Klauser wrote:
> > >(adding Sandra Loosemore to Cc due to possible relation to gcc/binutils
> > >for nios2)
> > >
> > >On 2017-02-26 at 22:03:38 +0100, Guenter Roeck <[email protected]> wrote:
> > >>Hi Sven,
> > >>
> > >>my qemu test for nios2 started failing with commit 4e1a33b105dd ("lib:
> > >>update LZ4 compressor module"). The test hangs early during boot before
> > >>any console output is seen. Reverting the offending patch as well as the
> > >>subsequent lz4 related patches fixes the problem. Disabling CONFIG_RD_LZ4
> > >>and with it other LZ4 options also fixes it (as does adding "return -EINVAL;"
> > >>at the top of the LZ4 decompression code). For reference, bisect log
> > >>is attached.
> > >>
> > >>I tried with buildroot toolchains using gcc 6.1.0 as well as 6.3.0
> > >>and binutils 2.26.1. Scripts used to run the tests are available at
> > >>https://github.com/groeck/linux-build-test/tree/master/rootfs/nios2.
> > >>Qemu is from qemu mainline or qemu v2.8 with nios2 patches applied.
> > >
> > >Looks like this is somehow related to gcc/binutils. Using GCC 4.8.3 and
> > >binutils 2.24.51 (both from Sourcery CodeBench Lite 2014.05) I can
> > >get a kernel booting on latest master branch. AFAICT, none of the
> > >LZ4_decompress_* functions are called during boot.
> > >

It seems a bit strange that code which is not actually called causes problems like that.

Please let me know if and how I may help you figure out what's happening, especially
regarding the differences between the previous LZ4 and the current implementation.

> > >However, using a self-built GCC 7.0 (20161127) and binutils 2.27 I can
> > >reproduce the problem you see using the instructions Guenter provided in
> > >the reply to Sven.
> > >
> > >I'll try to dig a bit deeper from here on. Any suggestions on what to
> > >look out for wrt the differences between the gcc/binutils version are
> > >welcome of course.
> >
> > This message doesn't give me enough context to know what is going on,
> > especially without seeing the rest of the thread. Generally speaking,
> > Mentor recommends you use one of our stable releases instead of trying to
> > roll your own from mainline sources. As an upstream binutils and gcc
> > maintainer I do try my best to look at bug reports for those components, but
> > I need a reproducible standalone testcase and specific versions of the
> > different components involved.
> >
> The problem is also seen with Sourcery CodeBench Lite 2016.11-32 (gcc 6.2.0,
> binutils 2.26.51). I can provide additional details if needed, but we don't
> have a well enough understanding of the problem to be able to provide a
> reduced size test case. The test used to reproduce the problem is available
> at https://github.com/groeck/linux-build-test/tree/master/rootfs/nios2,
> run on the ToT linux kernel.
>
> Guenter

Regards,

Sven


2017-03-01 20:05:05

by Guenter Roeck

Subject: Re: nios2 crash/hang in mainline due to 'lib: update LZ4 compressor module'

On Wed, Mar 01, 2017 at 07:58:17PM +0100, Sven Schmidt wrote:
> Hi Guenter, Tobias and Sandra,
>
> thanks for your effort here.
>
> On Tue, Feb 28, 2017 at 10:14:13AM -0800, Guenter Roeck wrote:
> > On Tue, Feb 28, 2017 at 10:53:56AM -0700, Sandra Loosemore wrote:
> > > On 02/28/2017 08:53 AM, Tobias Klauser wrote:
> > > >(adding Sandra Loosemore to Cc due to possible relation to gcc/binutils
> > > >for nios2)
> > > >
> > > >On 2017-02-26 at 22:03:38 +0100, Guenter Roeck <[email protected]> wrote:
> > > >>Hi Sven,
> > > >>
> > > >>my qemu test for nios2 started failing with commit 4e1a33b105dd ("lib:
> > > >>update LZ4 compressor module"). The test hangs early during boot before
> > > >>any console output is seen. Reverting the offending patch as well as the
> > > >>subsequent lz4 related patches fixes the problem. Disabling CONFIG_RD_LZ4
> > > >>and with it other LZ4 options also fixes it (as does adding "return -EINVAL;"
> > > >>at the top of the LZ4 decompression code). For reference, bisect log
> > > >>is attached.
> > > >>
> > > >>I tried with buildroot toolchains using gcc 6.1.0 as well as 6.3.0
> > > >>and binutils 2.26.1. Scripts used to run the tests are available at
> > > >>https://github.com/groeck/linux-build-test/tree/master/rootfs/nios2.
> > > >>Qemu is from qemu mainline or qemu v2.8 with nios2 patches applied.
> > > >
> > > >Looks like this is somehow related to gcc/binutils. Using GCC 4.8.3 and
> > > >binutils 2.24.51 (both from Sourcery CodeBench Lite 2014.05) I can
> > > >get a kernel booting on latest master branch. AFAICT, none of the
> > > >LZ4_decompress_* functions are called during boot.
> > > >
>
> It seems a bit strange that code which is not actually called causes problems like that.
>
Yes, it is, though it is always possible. The code isn't exactly easy to
understand; there may be some hidden caveats such as global variables. It may
also be that some jump target exceeds its range (though why that would only
be seen with the LZ4 code is another question), or that the compiler gets
confused by the forced inlines (disabling that didn't make a difference,
though, nor did disabling -O3).

> Please let me know if and how I may help you figure out what's happening, especially
> regarding the differences between the previous LZ4 and the current implementation.
>

For my part I am all but clueless. Unless someone has an idea, we may have to
disable LZ4 support for nios2 for the time being. Does anyone have thoughts
on that? Of course, that would not help if the problem also affects
recent gcc/binutils versions on other architectures.

Thanks,
Guenter

> > > >However, using a self-built GCC 7.0 (20161127) and binutils 2.27 I can
> > > >reproduce the problem you see using the instructions Guenter provided in
> > > >the reply to Sven.
> > > >
> > > >I'll try to dig a bit deeper from here on. Any suggestions on what to
> > > >look out for wrt the differences between the gcc/binutils version are
> > > >welcome of course.
> > >
> > > This message doesn't give me enough context to know what is going on,
> > > especially without seeing the rest of the thread. Generally speaking,
> > > Mentor recommends you use one of our stable releases instead of trying to
> > > roll your own from mainline sources. As an upstream binutils and gcc
> > > maintainer I do try my best to look at bug reports for those components, but
> > > I need a reproducible standalone testcase and specific versions of the
> > > different components involved.
> > >
> > The problem is also seen with Sourcery CodeBench Lite 2016.11-32 (gcc 6.2.0,
> > binutils 2.26.51). I can provide additional details if needed, but we don't
> > have a well enough understanding of the problem to be able to provide a
> > reduced size test case. The test used to reproduce the problem is available
> > at https://github.com/groeck/linux-build-test/tree/master/rootfs/nios2,
> > run on the ToT linux kernel.
> >
> > Guenter
>
> Regards,
>
> Sven

2017-03-02 00:52:58

by Sandra Loosemore

Subject: Re: nios2 crash/hang in mainline due to 'lib: update LZ4 compressor module'

On 03/01/2017 11:58 AM, Sven Schmidt wrote:
> Hi Guenter, Tobias and Sandra,
>
> thanks for your effort here.
>
> On Tue, Feb 28, 2017 at 10:14:13AM -0800, Guenter Roeck wrote:
>> On Tue, Feb 28, 2017 at 10:53:56AM -0700, Sandra Loosemore wrote:
>>> On 02/28/2017 08:53 AM, Tobias Klauser wrote:
>>>> (adding Sandra Loosemore to Cc due to possible relation to gcc/binutils
>>>> for nios2)
>>>>
>>>> On 2017-02-26 at 22:03:38 +0100, Guenter Roeck <[email protected]> wrote:
>>>>> Hi Sven,
>>>>>
>>>>> my qemu test for nios2 started failing with commit 4e1a33b105dd ("lib:
>>>>> update LZ4 compressor module"). The test hangs early during boot before
>>>>> any console output is seen. Reverting the offending patch as well as the
>>>>> subsequent lz4 related patches fixes the problem. Disabling CONFIG_RD_LZ4
>>>>> and with it other LZ4 options also fixes it (as does adding "return -EINVAL;"
>>>>> at the top of the LZ4 decompression code). For reference, bisect log
>>>>> is attached.
>>>>>
>>>>> I tried with buildroot toolchains using gcc 6.1.0 as well as 6.3.0
>>>>> and binutils 2.26.1. Scripts used to run the tests are available at
>>>>> https://github.com/groeck/linux-build-test/tree/master/rootfs/nios2.
>>>>> Qemu is from qemu mainline or qemu v2.8 with nios2 patches applied.
>>>>
>>>> Looks like this is somehow related to gcc/binutils. Using GCC 4.8.3 and
>>>> binutils 2.24.51 (both from Sourcery CodeBench Lite 2014.05) I can
>>>> get a kernel booting on latest master branch. AFAICT, none of the
>>>> LZ4_decompress_* functions are called during boot.
>>>>
>
> It seems a bit strange that code which is not actually called causes problems like that.
>
> Please let me know if and how I may help you figure out what's happening, especially
> regarding the differences between the previous LZ4 and the current implementation.
>
>>>> However, using a self-built GCC 7.0 (20161127) and binutils 2.27 I can
>>>> reproduce the problem you see using the instructions Guenter provided in
>>>> the reply to Sven.
>>>>
>>>> I'll try to dig a bit deeper from here on. Any suggestions on what to
>>>> look out for wrt the differences between the gcc/binutils version are
>>>> welcome of course.
>>>
>>> This message doesn't give me enough context to know what is going on,
>>> especially without seeing the rest of the thread. Generally speaking,
>>> Mentor recommends you use one of our stable releases instead of trying to
>>> roll your own from mainline sources. As an upstream binutils and gcc
>>> maintainer I do try my best to look at bug reports for those components, but
>>> I need a reproducible standalone testcase and specific versions of the
>>> different components involved.
>>>
>> The problem is also seen with Sourcery CodeBench Lite 2016.11-32 (gcc 6.2.0,
>> binutils 2.26.51). I can provide additional details if needed, but we don't
>> have a well enough understanding of the problem to be able to provide a
>> reduced size test case. The test used to reproduce the problem is available
>> at https://github.com/groeck/linux-build-test/tree/master/rootfs/nios2,
>> run on the ToT linux kernel.

Just a suggestion: can you try binutils trunk, too? Alan Modra and I
just tracked down and fixed a bug with the linker creating bad
executables that the kernel's ELF loader couldn't properly map into
memory. IIUC it only affected programs that use dynamic libraries, but
maybe there was more to it than that. In any case it would be good to
know if the problem has already been fixed before investigating further.

-Sandra

2017-03-02 13:31:13

by Tobias Klauser

Subject: Re: nios2 crash/hang in mainline due to 'lib: update LZ4 compressor module'

On 2017-03-01 at 23:50:03 +0100, Sandra Loosemore wrote:
> On 03/01/2017 11:58 AM, Sven Schmidt wrote:
> >Hi Guenter, Tobias and Sandra,
> >
> >thanks for your effort here.
> >
> >On Tue, Feb 28, 2017 at 10:14:13AM -0800, Guenter Roeck wrote:
> >>On Tue, Feb 28, 2017 at 10:53:56AM -0700, Sandra Loosemore wrote:
> >>>On 02/28/2017 08:53 AM, Tobias Klauser wrote:
> >>>>(adding Sandra Loosemore to Cc due to possible relation to gcc/binutils
> >>>>for nios2)
> >>>>
> >>>>On 2017-02-26 at 22:03:38 +0100, Guenter Roeck <[email protected]> wrote:
> >>>>>Hi Sven,
> >>>>>
> >>>>>my qemu test for nios2 started failing with commit 4e1a33b105dd ("lib:
> >>>>>update LZ4 compressor module"). The test hangs early during boot before
> >>>>>any console output is seen. Reverting the offending patch as well as the
> >>>>>subsequent lz4 related patches fixes the problem. Disabling CONFIG_RD_LZ4
> >>>>>and with it other LZ4 options also fixes it (as does adding "return -EINVAL;"
> >>>>>at the top of the LZ4 decompression code). For reference, bisect log
> >>>>>is attached.
> >>>>>
> >>>>>I tried with buildroot toolchains using gcc 6.1.0 as well as 6.3.0
> >>>>>and binutils 2.26.1. Scripts used to run the tests are available at
> >>>>>https://github.com/groeck/linux-build-test/tree/master/rootfs/nios2.
> >>>>>Qemu is from qemu mainline or qemu v2.8 with nios2 patches applied.
> >>>>
> >>>>Looks like this is somehow related to gcc/binutils. Using GCC 4.8.3 and
> >>>>binutils 2.24.51 (both from Sourcery CodeBench Lite 2014.05) I can
> >>>>get a kernel booting on latest master branch. AFAICT, none of the
> >>>>LZ4_decompress_* functions are called during boot.
> >>>>
> >
> >It seems a bit strange that code which is not actually called causes problems like that.
> >
> >Please let me know if and how I may help you figure out what's happening, especially
> >regarding the differences between the previous LZ4 and the current implementation.
> >
> >>>>However, using a self-built GCC 7.0 (20161127) and binutils 2.27 I can
> >>>>reproduce the problem you see using the instructions Guenter provided in
> >>>>the reply to Sven.
> >>>>
> >>>>I'll try to dig a bit deeper from here on. Any suggestions on what to
> >>>>look out for wrt the differences between the gcc/binutils version are
> >>>>welcome of course.
> >>>
> >>>This message doesn't give me enough context to know what is going on,
> >>>especially without seeing the rest of the thread. Generally speaking,
> >>>Mentor recommends you use one of our stable releases instead of trying to
> >>>roll your own from mainline sources. As an upstream binutils and gcc
> >>>maintainer I do try my best to look at bug reports for those components, but
> >>>I need a reproducible standalone testcase and specific versions of the
> >>>different components involved.
> >>>
> >>The problem is also seen with Sourcery CodeBench Lite 2016.11-32 (gcc 6.2.0,
> >>binutils 2.26.51). I can provide additional details if needed, but we don't
> >>have a well enough understanding of the problem to be able to provide a
> >>reduced size test case. The test used to reproduce the problem is available
> >>at https://github.com/groeck/linux-build-test/tree/master/rootfs/nios2,
> >>run on the ToT linux kernel.
>
> Just a suggestion: can you try binutils trunk, too? Alan Modra and
> I just tracked down and fixed a bug with the linker creating bad
> executables that the kernel's ELF loader couldn't properly map into
> memory. IIUC it only affected programs that use dynamic libraries,
> but maybe there was more to it than that. In any case it would be
> good to know if the problem has already been fixed before
> investigating further.

Thanks for the suggestion.

Just tried it with a kernel compiled with binutils trunk as of today
(2.28.51.20170302) and latest gcc snapshot (7.0.1 20170226).
Unfortunately, the issue still persists.

Tobias

2017-03-02 16:47:30

by Tobias Klauser

Subject: Re: nios2 crash/hang in mainline due to 'lib: update LZ4 compressor module'

On 2017-03-01 at 20:45:21 +0100, Guenter Roeck wrote:
> On Wed, Mar 01, 2017 at 07:58:17PM +0100, Sven Schmidt wrote:
> > Hi Guenter, Tobias and Sandra,
> >
> > thanks for your effort here.
> >
> > On Tue, Feb 28, 2017 at 10:14:13AM -0800, Guenter Roeck wrote:
> > > On Tue, Feb 28, 2017 at 10:53:56AM -0700, Sandra Loosemore wrote:
> > > > On 02/28/2017 08:53 AM, Tobias Klauser wrote:
> > > > >(adding Sandra Loosemore to Cc due to possible relation to gcc/binutils
> > > > >for nios2)
> > > > >
> > > > >On 2017-02-26 at 22:03:38 +0100, Guenter Roeck <[email protected]> wrote:
> > > > >>Hi Sven,
> > > > >>
> > > > >>my qemu test for nios2 started failing with commit 4e1a33b105dd ("lib:
> > > > >>update LZ4 compressor module"). The test hangs early during boot before
> > > > >>any console output is seen. Reverting the offending patch as well as the
> > > > >>subsequent lz4 related patches fixes the problem. Disabling CONFIG_RD_LZ4
> > > > >>and with it other LZ4 options also fixes it (as does adding "return -EINVAL;"
> > > > >>at the top of the LZ4 decompression code). For reference, bisect log
> > > > >>is attached.
> > > > >>
> > > > >>I tried with buildroot toolchains using gcc 6.1.0 as well as 6.3.0
> > > > >>and binutils 2.26.1. Scripts used to run the tests are available at
> > > > >>https://github.com/groeck/linux-build-test/tree/master/rootfs/nios2.
> > > > >>Qemu is from qemu mainline or qemu v2.8 with nios2 patches applied.
> > > > >
> > > > >Looks like this is somehow related to gcc/binutils. Using GCC 4.8.3 and
> > > > >binutils 2.24.51 (both from Sourcery CodeBench Lite 2014.05) I can
> > > > >get a kernel booting on latest master branch. AFAICT, none of the
> > > > >LZ4_decompress_* functions are called during boot.
> > > > >
> >
> > It seems a bit strange that code which is not actually called causes problems like that.
> >
> Yes, it is, though it is always possible. The code isn't exactly easy to
> understand; there may be some hidden caveats such as global variables. It may
> also be that some jump target exceeds its range (though why that would only
> be seen with the LZ4 code is another question), or that the compiler gets
> confused by the forced inlines (disabling that didn't make a difference,
> though, nor did disabling -O3).
>
> > Please let me know if and how I may help you figure out what's happening, especially
> > regarding the differences between the previous LZ4 and the current implementation.
> >
>
> For my part I am all but clueless. Unless someone has an idea, we may have to
> disable LZ4 support for nios2 for the time being. Does anyone have thoughts
> on that? Of course, that would not help if the problem also affects
> recent gcc/binutils versions on other architectures.

After some further investigations, I'd say this isn't "caused" by LZ4
specifically but by a more general problem with one of the nios2 arch
specific tools involved.

I manually enabled random additional CONFIG_* options and in some cases
I got the kernel to boot (with CONFIG_RD_LZ4 enabled and no return
-EINVAL in place) while in others I didn't. So I'd rather suspect this
problem to be connected to the size or structure of the generated vmlinux
image.

Or could this even be a problem with qemu? Did anyone already verify
this on the 10m50 devboard? (Unfortunately I don't have any nios2
devboard available right now, otherwise I would have done this...)

Other than that I'm also becoming all but clueless... One option I
thought of was using the QEMU monitor to dump the CPU state after the
hang but so far I didn't manage to get it to work (hints appreciated ;)

Thanks
Tobias

2017-03-03 03:05:50

by Guenter Roeck

Subject: Re: nios2 crash/hang in mainline due to 'lib: update LZ4 compressor module'

On 03/02/2017 08:38 AM, Tobias Klauser wrote:
> On 2017-03-01 at 20:45:21 +0100, Guenter Roeck <[email protected]> wrote:
>> On Wed, Mar 01, 2017 at 07:58:17PM +0100, Sven Schmidt wrote:
>>> Hi Guenter, Tobias and Sandra,
>>>
>>> thanks for your effort here.
>>>
>>> On Tue, Feb 28, 2017 at 10:14:13AM -0800, Guenter Roeck wrote:
>>>> On Tue, Feb 28, 2017 at 10:53:56AM -0700, Sandra Loosemore wrote:
>>>>> On 02/28/2017 08:53 AM, Tobias Klauser wrote:
>>>>>> (adding Sandra Loosemore to Cc due to possible relation to gcc/binutils
>>>>>> for nios2)
>>>>>>
>>>>>> On 2017-02-26 at 22:03:38 +0100, Guenter Roeck <[email protected]> wrote:
>>>>>>> Hi Sven,
>>>>>>>
>>>>>>> my qemu test for nios2 started failing with commit 4e1a33b105dd ("lib:
>>>>>>> update LZ4 compressor module"). The test hangs early during boot before
>>>>>>> any console output is seen. Reverting the offending patch as well as the
>>>>>>> subsequent lz4 related patches fixes the problem. Disabling CONFIG_RD_LZ4
>>>>>>> and with it other LZ4 options also fixes it (as does adding "return -EINVAL;"
>>>>>>> at the top of the LZ4 decompression code). For reference, bisect log
>>>>>>> is attached.
>>>>>>>
>>>>>>> I tried with buildroot toolchains using gcc 6.1.0 as well as 6.3.0
>>>>>>> and binutils 2.26.1. Scripts used to run the tests are available at
>>>>>>> https://github.com/groeck/linux-build-test/tree/master/rootfs/nios2.
>>>>>>> Qemu is from qemu mainline or qemu v2.8 with nios2 patches applied.
>>>>>>
>>>>>> Looks like this is somehow related to gcc/binutils. Using GCC 4.8.3 and
>>>>>> binutils 2.24.51 (both from Sourcery CodeBench Lite 2014.05) I can
>>>>>> get a kernel booting on latest master branch. AFAICT, none of the
>>>>>> LZ4_decompress_* functions are called during boot.
>>>>>>
>>>
>>> It seems a bit strange that code which is not actually called causes problems like that.
>>>
>> Yes, it is, though it is always possible. The code isn't exactly easy to
>> understand; there may be some hidden caveats such as global variables. It may
>> also be that some jump target exceeds its range (though why that would only
>> be seen with the LZ4 code is another question), or that the compiler gets
>> confused by the forced inlines (disabling that didn't make a difference,
>> though, nor did disabling -O3).
>>
>>> Please let me know if and how I may help you figure out what's happening, especially
>>> regarding the differences between the previous LZ4 and the current implementation.
>>>
>>
>> For my part I am all but clueless. Unless someone has an idea, we may have to
>> disable LZ4 support for nios2 for the time being. Does anyone have thoughts
>> on that? Of course, that would not help if the problem also affects
>> recent gcc/binutils versions on other architectures.
>
> After some further investigations, I'd say this isn't "caused" by LZ4
> specifically but by a more general problem with one of the nios2 arch
> specific tools involved.
>
> I manually enabled random additional CONFIG_* options and in some cases
> I got the kernel to boot (with CONFIG_RD_LZ4 enabled and no return
> -EINVAL in place) while in others I didn't. So I'd rather suspect this
> problem to be connected to the size or structure of the generated vmlinux
> image.
>
> Or could this even be a problem with qemu? Did anyone already verify
> this on the 10m50 devboard? (Unfortunately I don't have any nios2
> devboard available right now, otherwise I would have done this...)
>

That is of course always possible.

> Other than that I'm also becoming all but clueless... One option I
> thought of was using the QEMU monitor to dump the CPU state after the
> hang but so far I didn't manage to get it to work (hints appreciated ;)
>

Something like

qemu-system-nios2 -M 10m50-ghrd -kernel vmlinux -no-reboot \
-dtb arch/nios2/boot/dts/10m50_devboard.dtb \
--append "rdinit=/sbin/init" -initrd busybox-nios2.cpio

gives you a qemu monitor window. Use "info registers" to see registers.
Looks like it is stuck in init_bootmem_core, or at least that is what it
shows for me.

Guenter

2017-03-07 12:47:17

by Tobias Klauser

Subject: Re: nios2 crash/hang in mainline due to 'lib: update LZ4 compressor module'

On 2017-03-03 at 04:04:41 +0100, Guenter Roeck wrote:
> On 03/02/2017 08:38 AM, Tobias Klauser wrote:
> >On 2017-03-01 at 20:45:21 +0100, Guenter Roeck <[email protected]> wrote:
> >>On Wed, Mar 01, 2017 at 07:58:17PM +0100, Sven Schmidt wrote:
> >>>Hi Guenter, Tobias and Sandra,
> >>>
> >>>thanks for your effort here.
> >>>
> >>>On Tue, Feb 28, 2017 at 10:14:13AM -0800, Guenter Roeck wrote:
> >>>>On Tue, Feb 28, 2017 at 10:53:56AM -0700, Sandra Loosemore wrote:
> >>>>>On 02/28/2017 08:53 AM, Tobias Klauser wrote:
> >>>>>>(adding Sandra Loosemore to Cc due to possible relation to gcc/binutils
> >>>>>>for nios2)
> >>>>>>
> >>>>>>On 2017-02-26 at 22:03:38 +0100, Guenter Roeck <[email protected]> wrote:
> >>>>>>>Hi Sven,
> >>>>>>>
> >>>>>>>my qemu test for nios2 started failing with commit 4e1a33b105dd ("lib:
> >>>>>>>update LZ4 compressor module"). The test hangs early during boot before
> >>>>>>>any console output is seen. Reverting the offending patch as well as the
> >>>>>>>subsequent lz4 related patches fixes the problem. Disabling CONFIG_RD_LZ4
> >>>>>>>and with it other LZ4 options also fixes it (as does adding "return -EINVAL;"
> >>>>>>>at the top of the LZ4 decompression code). For reference, bisect log
> >>>>>>>is attached.
> >>>>>>>
> >>>>>>>I tried with buildroot toolchains using gcc 6.1.0 as well as 6.3.0
> >>>>>>>and binutils 2.26.1. Scripts used to run the tests are available at
> >>>>>>>https://github.com/groeck/linux-build-test/tree/master/rootfs/nios2.
> >>>>>>>Qemu is from qemu mainline or qemu v2.8 with nios2 patches applied.
> >>>>>>
> >>>>>>Looks like this is somehow related to gcc/binutils. Using GCC 4.8.3 and
> >>>>>>binutils 2.24.51 (both from Sourcery CodeBench Lite 2014.05) I can
> >>>>>>get a kernel booting on latest master branch. AFAICT, none of the
> >>>>>>LZ4_decompress_* functions are called during boot.
> >>>>>>
> >>>
> >>>It seems a bit strange that code which is not actually called causes problems like that.
> >>>
> >>Yes, it is, though it is always possible. The code isn't exactly easy to
> >>understand; there may be some hidden caveats such as global variables. It may
> >>also be that some jump target exceeds its range (though why that would only
> >>be seen with the LZ4 code is another question), or that the compiler gets
> >>confused by the forced inlines (disabling that didn't make a difference,
> >>though, nor did disabling -O3).
> >>
> >>>Please let me know if and how I may help you figure out what's happening, especially
> >>>regarding the differences between the previous LZ4 and the current implementation.
> >>>
> >>
> >>For my part I am all but clueless. Unless someone has an idea, we may have to
> >>disable LZ4 support for nios2 for the time being. Does anyone have thoughts
> >>on that? Of course, that would not help if the problem also affects
> >>recent gcc/binutils versions on other architectures.
> >
> >After some further investigations, I'd say this isn't "caused" by LZ4
> >specifically but by a more general problem with one of the nios2 arch
> >specific tools involved.
> >
> >I manually enabled random additional CONFIG_* options and in some cases
> >I got the kernel to boot (with CONFIG_RD_LZ4 enabled and no return
> >-EINVAL in place) while in others I didn't. So I'd rather suspect this
> >problem to be connected to the size or structure of the generated vmlinux
> >image.
> >
> >Or could this even be a problem with qemu? Did anyone already verify
> >this on the 10m50 devboard? (Unfortunately I don't have any nios2
> >devboard available right now, otherwise I would have done this...)
> >
>
> That is of course always possible.
>
> >Other than that I'm also becoming all but clueless... One option I
> >thought of was using the QEMU monitor to dump the CPU state after the
> >hang but so far I didn't manage to get it to work (hints appreciated ;)
> >
>
> Something like
>
> qemu-system-nios2 -M 10m50-ghrd -kernel vmlinux -no-reboot \
> -dtb arch/nios2/boot/dts/10m50_devboard.dtb \
> --append "rdinit=/sbin/init" -initrd busybox-nios2.cpio
>
> gives you a qemu monitor window. Use "info registers" to see registers.
> Looks like it is stuck in init_bootmem_core, or at least that is what it
> shows for me.

Thanks a lot for the hint, this worked perfectly. I'm not all that
familiar with qemu :-/

Using the qemu gdbserver I can indeed confirm that it seems to be stuck
in init_bootmem_core:

(gdb) file vmlinux
Reading symbols from vmlinux...done.
(gdb) target remote localhost:1234
Remote debugging using localhost:1234
link_bootmem (bdata=<optimized out>) at mm/bootmem.c:80
80 if (bdata->node_min_pfn < ent->node_min_pfn) {

This looks like a very weird place for it to get stuck...

So I followed a different path and implemented early printk support for
the 8250/16650 serial console on nios2, so I could get debug output
earlier on (patch below, I'll also officially submit this later on).

Now I get the following output on boot:

Linux version 4.11.0-rc1-dirty (tobiask@ziws08) (gcc version 7.0.1 20170226 (experimental) (GCC) ) #46 Tue Mar 7 13:40:53 CET 2017
bootconsole [early0] enabled
Early console on uart16650 initialized at 0xf8001600
OF: fdt: Error -11 processing FDT
Kernel panic - not syncing: setup_cpuinfo: No CPU found in devicetree!

---[ end Kernel panic - not syncing: setup_cpuinfo: No CPU found in devicetree!

Looks like the in-memory device tree somehow gets corrupted. Not sure
yet why and how this is linked to the Kconfig options selected but at
least we now have a possibility to use debug messages earlier on.
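
One possible next step would be to sanity-check the header of the in-memory
blob before it is unflattened. In libfdt's numbering, error 11 is
FDT_ERR_BADSTRUCTURE, which suggests a damaged structure block rather than a
missing blob. The sketch below is a standalone illustration, not kernel code:
`fdt_header_looks_sane` and `be32_at` are hypothetical names, and it assumes
the standard 40-byte version-17 header layout. It only checks the magic and
that the structure and strings blocks lie inside `totalsize`, so it would not
catch corruption inside the structure block itself.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define FDT_MAGIC	0xd00dfeedu	/* stored big-endian in the blob */
#define FDT_HEADER_SIZE	40u		/* ten 32-bit fields (version 17) */

/* Read a 32-bit big-endian value regardless of host endianness. */
static uint32_t be32_at(const uint8_t *p)
{
	return ((uint32_t)p[0] << 24) | ((uint32_t)p[1] << 16) |
	       ((uint32_t)p[2] << 8) | (uint32_t)p[3];
}

/*
 * Hypothetical helper: returns true if the first bytes of 'blob' look like
 * a plausible flattened devicetree header. 'avail' is the number of bytes
 * known to be readable at 'blob'.
 */
static bool fdt_header_looks_sane(const uint8_t *blob, size_t avail)
{
	if (avail < FDT_HEADER_SIZE)
		return false;

	uint32_t magic        = be32_at(blob + 0);
	uint32_t totalsize    = be32_at(blob + 4);
	uint32_t off_struct   = be32_at(blob + 8);
	uint32_t off_strings  = be32_at(blob + 12);
	uint32_t size_strings = be32_at(blob + 32);
	uint32_t size_struct  = be32_at(blob + 36);

	if (magic != FDT_MAGIC)
		return false;
	if (totalsize < FDT_HEADER_SIZE || totalsize > avail)
		return false;

	/* Both blocks must fit inside totalsize (overflow-safe checks). */
	if (off_struct > totalsize || size_struct > totalsize - off_struct)
		return false;
	if (off_strings > totalsize || size_strings > totalsize - off_strings)
		return false;

	return true;
}
```

Calling such a check on the blob address early in boot, and again just before
the devicetree is unflattened, might help tell whether the blob is already
bad when the kernel starts or gets overwritten later.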

---%<---%<---

Patch for 8250/16650 early printk support on nios2 (make sure to select
CONFIG_EARLY_PRINTK):

diff --git a/arch/nios2/Kconfig.debug b/arch/nios2/Kconfig.debug
index 2fd08cbfdddb..35b5dd67b15a 100644
--- a/arch/nios2/Kconfig.debug
+++ b/arch/nios2/Kconfig.debug
@@ -18,7 +18,7 @@ config EARLY_PRINTK
 	bool "Activate early kernel debugging"
 	default y
 	select SERIAL_CORE_CONSOLE
-	depends on SERIAL_ALTERA_JTAGUART_CONSOLE || SERIAL_ALTERA_UART_CONSOLE
+	depends on SERIAL_ALTERA_JTAGUART_CONSOLE || SERIAL_ALTERA_UART_CONSOLE || SERIAL_8250_CONSOLE
 	help
 	  Enable early printk on console
 	  This is useful for kernel debugging when your machine crashes very
diff --git a/arch/nios2/kernel/early_printk.c b/arch/nios2/kernel/early_printk.c
index c08e4c1486fc..24b4506f4969 100644
--- a/arch/nios2/kernel/early_printk.c
+++ b/arch/nios2/kernel/early_printk.c
@@ -22,6 +22,8 @@ static unsigned long base_addr;
 
 #if defined(CONFIG_SERIAL_ALTERA_JTAGUART_CONSOLE)
 
+#define UART_NAME "altera_jtaguart"
+
 #define ALTERA_JTAGUART_DATA_REG 0
 #define ALTERA_JTAGUART_CONTROL_REG 4
 #define ALTERA_JTAGUART_CONTROL_WSPACE_MSK 0xFFFF0000
@@ -53,6 +55,8 @@ static void early_console_write(struct console *con, const char *s, unsigned n)
 
 #elif defined(CONFIG_SERIAL_ALTERA_UART_CONSOLE)
 
+#define UART_NAME "altera_uart"
+
 #define ALTERA_UART_TXDATA_REG 4
 #define ALTERA_UART_STATUS_REG 8
 #define ALTERA_UART_STATUS_TRDY 0x0040
@@ -80,9 +84,40 @@ static void early_console_write(struct console *con, const char *s, unsigned n)
 	}
 }
 
+#elif defined(CONFIG_SERIAL_8250_CONSOLE)
+
+#define UART_NAME "uart16650"
+
+#define UART_LSR_TEMT 0x40 /* Transmitter empty */
+#define UART_LSR_THRE 0x20 /* Transmit-hold-register empty */
+#define BOTH_EMPTY (UART_LSR_TEMT | UART_LSR_THRE)
+
+#define UART_GET_SR() \
+	__builtin_ldwio((void *)(base_addr + 0x14))
+#define UART_SET_TX(v) \
+	__builtin_stwio((void *)(base_addr), v)
+
+static void early_console_putc(char c)
+{
+	while (!((UART_GET_SR() & BOTH_EMPTY) == BOTH_EMPTY))
+		;
+
+	UART_SET_TX(c & 0xff);
+}
+
+static void early_console_write(struct console *con, const char *s, unsigned n)
+{
+	while (n-- && *s) {
+		early_console_putc(*s);
+		if (*s == '\n')
+			early_console_putc('\r');
+		s++;
+	}
+}
+
 #else
-# error Neither SERIAL_ALTERA_JTAGUART_CONSOLE nor SERIAL_ALTERA_UART_CONSOLE \
-selected
+# error Neither SERIAL_ALTERA_JTAGUART_CONSOLE, SERIAL_ALTERA_UART_CONSOLE, \
+	nor SERIAL_8250_CONSOLE selected
 #endif
 
 static struct console early_console_prom = {
@@ -95,7 +130,8 @@ static struct console early_console_prom = {
 void __init setup_early_printk(void)
 {
 #if defined(CONFIG_SERIAL_ALTERA_JTAGUART_CONSOLE) || \
-	defined(CONFIG_SERIAL_ALTERA_UART_CONSOLE)
+	defined(CONFIG_SERIAL_ALTERA_UART_CONSOLE) || \
+	defined(CONFIG_SERIAL_8250_CONSOLE)
 	base_addr = of_early_console();
 #else
 	base_addr = 0;
@@ -114,5 +150,5 @@ void __init setup_early_printk(void)
 
 	early_console = &early_console_prom;
 	register_console(early_console);
-	pr_info("early_console initialized at 0x%08lx\n", base_addr);
+	pr_info("Early console on %s initialized at 0x%08lx\n", UART_NAME, base_addr);
 }
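
For readers who want to try the polling logic of the 8250 branch outside the
kernel, here is a rough user-space sketch. It replaces the memory-mapped
registers with a hypothetical `struct fake_uart`; the names `fake_read_lsr`,
`fake_putc` and `fake_write` are made up for illustration and the "busy for
two polls" behaviour is an assumption, not how any particular UART behaves.
Only the LSR bit masks and the write loop are taken from the patch above.

```c
#include <stddef.h>
#include <stdint.h>

#define UART_LSR_TEMT	0x40	/* Transmitter empty */
#define UART_LSR_THRE	0x20	/* Transmit-hold-register empty */
#define BOTH_EMPTY	(UART_LSR_TEMT | UART_LSR_THRE)

/* Hypothetical stand-in for the memory-mapped UART. */
struct fake_uart {
	int busy_polls;		/* LSR reads left before TEMT is set again */
	char out[64];		/* characters "transmitted" so far */
	size_t len;
};

/* Simulated LSR read: report only THRE while the transmitter is busy. */
static uint32_t fake_read_lsr(struct fake_uart *u)
{
	if (u->busy_polls > 0) {
		u->busy_polls--;
		return UART_LSR_THRE;
	}
	return BOTH_EMPTY;
}

/* Same wait condition as early_console_putc() in the patch. */
static void fake_putc(struct fake_uart *u, char c)
{
	while ((fake_read_lsr(u) & BOTH_EMPTY) != BOTH_EMPTY)
		;

	if (u->len < sizeof(u->out) - 1)
		u->out[u->len++] = c;
	u->busy_polls = 2;	/* assumed: busy for two more polls */
}

/* Same loop as early_console_write(): '\r' is sent after each '\n'. */
static void fake_write(struct fake_uart *u, const char *s, unsigned n)
{
	while (n-- && *s) {
		fake_putc(u, *s);
		if (*s == '\n')
			fake_putc(u, '\r');
		s++;
	}
}
```

With a zero-initialized `struct fake_uart`, calling `fake_write(&u, "ok\n", 3)`
stores four characters, the last two being `'\n'` followed by `'\r'`, which
mirrors the ordering used by the patch.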

2017-03-08 05:07:43

by Guenter Roeck

Subject: Re: nios2 crash/hang in mainline due to 'lib: update LZ4 compressor module'

On 03/07/2017 04:46 AM, Tobias Klauser wrote:
> On 2017-03-03 at 04:04:41 +0100, Guenter Roeck <[email protected]> wrote:
>> On 03/02/2017 08:38 AM, Tobias Klauser wrote:
>>> On 2017-03-01 at 20:45:21 +0100, Guenter Roeck <[email protected]> wrote:
>>>> On Wed, Mar 01, 2017 at 07:58:17PM +0100, Sven Schmidt wrote:
>>>>> Hi Guenter, Tobias and Sandra,
>>>>>
>>>>> thanks for your effort here.
>>>>>
>>>>> On Tue, Feb 28, 2017 at 10:14:13AM -0800, Guenter Roeck wrote:
>>>>>> On Tue, Feb 28, 2017 at 10:53:56AM -0700, Sandra Loosemore wrote:
>>>>>>> On 02/28/2017 08:53 AM, Tobias Klauser wrote:
>>>>>>>> (adding Sandra Loosemore to Cc due to possible relation to gcc/binutils
>>>>>>>> for nios2)
>>>>>>>>
>>>>>>>> On 2017-02-26 at 22:03:38 +0100, Guenter Roeck <[email protected]> wrote:
>>>>>>>>> Hi Sven,
>>>>>>>>>
>>>>>>>>> my qemu test for nios2 started failing with commit 4e1a33b105dd ("lib:
>>>>>>>>> update LZ4 compressor module"). The test hangs early during boot before
>>>>>>>>> any console output is seen. Reverting the offending patch as well as the
>>>>>>>>> subsequent lz4 related patches fixes the problem. Disabling CONFIG_RD_LZ4
>>>>>>>>> and with it other LZ4 options also fixes it (as does adding "return -EINVAL;"
>>>>>>>>> at the top of the LZ4 decompression code). For reference, bisect log
>>>>>>>>> is attached.
>>>>>>>>>
>>>>>>>>> I tried with buildroot toolchains using gcc 6.1.0 as well as 6.3.0
>>>>>>>>> and binutils 2.26.1. Scripts used to run the tests are available at
>>>>>>>>> https://github.com/groeck/linux-build-test/tree/master/rootfs/nios2.
>>>>>>>>> Qemu is from qemu mainline or qemu v2.8 with nios2 patches applied.
>>>>>>>>
>>>>>>>> Looks like this is somehow related to gcc/binutils. Using GCC 4.8.3 and
>>>>>>>> binutils 2.24.51 (both from Sourcery CodeBench Lite 2014.05) I can
>>>>>>>> get a kernel booting on latest master branch. AFAICT, none of the
>>>>>>>> LZ4_decompress_* functions are called during boot.
>>>>>>>>
>>>>>
>>>>> It seems a bit strange that code which is not actually called causes problems like that.
>>>>>
>>>> Yes, it is, though it is always possible. The code isn't exactly easy to
>>>> understand; there may be some hidden caveats such as global variables. It may
>>>> also be that some jump target exceeds its range (though why that would only
>>>> be seen with the LZ4 code is another question), or that the compiler gets
>>>> confused by the forced inlines (disabling that didn't make a difference,
>>>> though, nor did disabling -O3).
>>>>
>>>>> Please let me know if and how I may help you figure out what's happening, especially
>>>>> regarding the differences between the previous LZ4 and the current implementation.
>>>>>
>>>>
>>>> For my part I am all but clueless. Unless someone has an idea, we may have to
>>>> disable LZ4 support for nios2 for the time being. Does anyone have thoughts
>>>> on that? Of course, that would not help if the problem also affects
>>>> recent gcc/binutils versions on other architectures.
>>>
>>> After some further investigations, I'd say this isn't "caused" by LZ4
>>> specifically but by a more general problem with one of the nios2 arch
>>> specific tools involved.
>>>
>>> I manually enabled random additional CONFIG_* options and in some cases
>>> I got the kernel to boot (with CONFIG_RD_LZ4 enabled and no return
>>> -EINVAL in place) while in others I didn't. So I'd rather suspect this
>>> problem to be connected to the size or structure of the generated vmlinux
>>> image.
>>>
>>> Or could this even be a problem with qemu? Did anyone already verify
>>> this on the 10m50 devboard? (Unfortunately I don't have any nios2
>>> devboard available right now, otherwise I would have done this...)
>>>
>>
>> That is of course always possible.
>>
>>> Other than that I'm also becoming all but clueless... One option I
>>> thought of was using the QEMU monitor to dump the CPU state after the
>>> hang but so far I didn't manage to get it to work (hints appreciated ;)
>>>
>>
>> Something like
>>
>> qemu-system-nios2 -M 10m50-ghrd -kernel vmlinux -no-reboot \
>> -dtb arch/nios2/boot/dts/10m50_devboard.dtb \
>> --append "rdinit=/sbin/init" -initrd busybox-nios2.cpio
>>
>> gives you a qemu monitor window. Use "info registers" to see registers.
>> Looks like it is stuck in init_bootmem_core, or at least that is what it
>> shows for me.
>
> Thanks a lot for the hint, this worked perfectly. I'm not all that
> familiar with qemu :-/
>
> Using the qemu gdbserver I can indeed confirm that it seems to be stuck
> in init_bootmem_core:
>
> (gdb) file vmlinux
> Reading symbols from vmlinux...done.
> (gdb) target remote localhost:1234
> Remote debugging using localhost:1234
> link_bootmem (bdata=<optimized out>) at mm/bootmem.c:80
> 80 if (bdata->node_min_pfn < ent->node_min_pfn) {
>
> This looks like a very weird place for it to get stuck...
>
> So I followed a different path and implemented early printk support for
> the 8250/16650 serial console on nios2, so I could get debug outputs
> earlier on (patch below, I'll also officially submit this later on).
>

That is great; I'll add that to my own tests to get some output.

> Now I get the following output on boot:
>
> Linux version 4.11.0-rc1-dirty (tobiask@ziws08) (gcc version 7.0.1 20170226 (experimental) (GCC) ) #46 Tue Mar 7 13:40:53 CET 2017
> bootconsole [early0] enabled
> Early console on uart16650 initialized at 0xf8001600
> OF: fdt: Error -11 processing FDT
> Kernel panic - not syncing: setup_cpuinfo: No CPU found in devicetree!
>
> ---[ end Kernel panic - not syncing: setup_cpuinfo: No CPU found in devicetree!
>
> Looks like the in-memory device tree somehow gets corrupted. Not sure
> yet why and how this is linked to the Kconfig options selected but at
> least we now have a possibility to use debug messages earlier on.
>
Interesting. I was able to confirm that the lz4 patch is not the root
cause. I was not able to reproduce the problem in v4.10, but after
adding more and more configuration options I get it to fail starting
with commit ac1820fb286b552 ("Merge tag 'for-next-dma_ops' of
git://git.kernel.org/pub/scm/linux/kernel/git/dledford/rdma").
No idea if that is the root cause either. Kernel configuration for that
case is attached.

Of course ac1820fb286b552 doesn't crash anymore with that after applying
your patch below, and v4.11-rc1 crashes without any output :-(.

I think I'll add some logging into qemu to see where it puts the dtb.

Guenter


> ---%<---%<---
>
> Patch for 8250/16650 early printk support on nios2 (make sure to select
> CONFIG_EARLY_PRINTK):
>
> diff --git a/arch/nios2/Kconfig.debug b/arch/nios2/Kconfig.debug
> index 2fd08cbfdddb..35b5dd67b15a 100644
> --- a/arch/nios2/Kconfig.debug
> +++ b/arch/nios2/Kconfig.debug
> @@ -18,7 +18,7 @@ config EARLY_PRINTK
> bool "Activate early kernel debugging"
> default y
> select SERIAL_CORE_CONSOLE
> - depends on SERIAL_ALTERA_JTAGUART_CONSOLE || SERIAL_ALTERA_UART_CONSOLE
> + depends on SERIAL_ALTERA_JTAGUART_CONSOLE || SERIAL_ALTERA_UART_CONSOLE || SERIAL_8250_CONSOLE
> help
> Enable early printk on console
> This is useful for kernel debugging when your machine crashes very
> diff --git a/arch/nios2/kernel/early_printk.c b/arch/nios2/kernel/early_printk.c
> index c08e4c1486fc..24b4506f4969 100644
> --- a/arch/nios2/kernel/early_printk.c
> +++ b/arch/nios2/kernel/early_printk.c
> @@ -22,6 +22,8 @@ static unsigned long base_addr;
>
> #if defined(CONFIG_SERIAL_ALTERA_JTAGUART_CONSOLE)
>
> +#define UART_NAME "altera_jtaguart"
> +
> #define ALTERA_JTAGUART_DATA_REG 0
> #define ALTERA_JTAGUART_CONTROL_REG 4
> #define ALTERA_JTAGUART_CONTROL_WSPACE_MSK 0xFFFF0000
> @@ -53,6 +55,8 @@ static void early_console_write(struct console *con, const char *s, unsigned n)
>
> #elif defined(CONFIG_SERIAL_ALTERA_UART_CONSOLE)
>
> +#define UART_NAME "altera_uart"
> +
> #define ALTERA_UART_TXDATA_REG 4
> #define ALTERA_UART_STATUS_REG 8
> #define ALTERA_UART_STATUS_TRDY 0x0040
> @@ -80,9 +84,40 @@ static void early_console_write(struct console *con, const char *s, unsigned n)
> }
> }
>
> +#elif defined(CONFIG_SERIAL_8250_CONSOLE)
> +
> +#define UART_NAME "uart16650"
> +
> +#define UART_LSR_TEMT 0x40 /* Transmitter empty */
> +#define UART_LSR_THRE 0x20 /* Transmit-hold-register empty */
> +#define BOTH_EMPTY (UART_LSR_TEMT | UART_LSR_THRE)
> +
> +#define UART_GET_SR() \
> + __builtin_ldwio((void *)(base_addr + 0x14))
> +#define UART_SET_TX(v) \
> + __builtin_stwio((void *)(base_addr), v)
> +
> +static void early_console_putc(char c)
> +{
> + while (!((UART_GET_SR() & BOTH_EMPTY) == BOTH_EMPTY))
> + ;
> +
> + UART_SET_TX(c & 0xff);
> +}
> +
> +static void early_console_write(struct console *con, const char *s, unsigned n)
> +{
> + while (n-- && *s) {
> + early_console_putc(*s);
> + if (*s == '\n')
> + early_console_putc('\r');
> + s++;
> + }
> +}
> +
> #else
> -# error Neither SERIAL_ALTERA_JTAGUART_CONSOLE nor SERIAL_ALTERA_UART_CONSOLE \
> -selected
> +# error Neither SERIAL_ALTERA_JTAGUART_CONSOLE, SERIAL_ALTERA_UART_CONSOLE, \
> + nor SERIAL_8250_CONSOLE selected
> #endif
>
> static struct console early_console_prom = {
> @@ -95,7 +130,8 @@ static struct console early_console_prom = {
> void __init setup_early_printk(void)
> {
> #if defined(CONFIG_SERIAL_ALTERA_JTAGUART_CONSOLE) || \
> - defined(CONFIG_SERIAL_ALTERA_UART_CONSOLE)
> + defined(CONFIG_SERIAL_ALTERA_UART_CONSOLE) || \
> + defined(CONFIG_SERIAL_8250_CONSOLE)
> base_addr = of_early_console();
> #else
> base_addr = 0;
> @@ -114,5 +150,5 @@ void __init setup_early_printk(void)
>
> early_console = &early_console_prom;
> register_console(early_console);
> - pr_info("early_console initialized at 0x%08lx\n", base_addr);
> + pr_info("Early console on %s initialized at 0x%08lx\n", UART_NAME, base_addr);
> }
>


Attachments:
test_defconfig (10.24 kB)

2017-03-09 13:21:07

by Guenter Roeck

Subject: Re: nios2 crash/hang in mainline due to 'lib: update LZ4 compressor module'

On 03/07/2017 04:46 AM, Tobias Klauser wrote:
[ ... ]

>
> Linux version 4.11.0-rc1-dirty (tobiask@ziws08) (gcc version 7.0.1 20170226 (experimental) (GCC) ) #46 Tue Mar 7 13:40:53 CET 2017
> bootconsole [early0] enabled
> Early console on uart16650 initialized at 0xf8001600
> OF: fdt: Error -11 processing FDT
> Kernel panic - not syncing: setup_cpuinfo: No CPU found in devicetree!
>
> ---[ end Kernel panic - not syncing: setup_cpuinfo: No CPU found in devicetree!
>
> Looks like the in-memory device tree somehow gets corrupted. Not sure
> yet why and how this is linked to the Kconfig options selected but at
> least we now have a possibility to use debug messages earlier on.
>

I think I found the problem. In unflatten_and_copy_device_tree(), with added
debug information:

OF: fdt: initial_boot_params=c861e400, dt=c861f000 size=28874 (0x70ca)

... and then initial_boot_params is copied to dt, which results in corrupted
fdt since the memory overlaps. Looks like the initial_boot_params memory
is not reserved and (re-)allocated by early_init_dt_alloc_memory_arch().

Guenter
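
The overlap can be checked from the logged values alone. The sketch below
is a stand-alone illustration, not kernel code; region_end() and
regions_overlap() are hypothetical helper names. It treats both buffers as
half-open address ranges of the logged size.

```c
#include <stdbool.h>
#include <stdint.h>

/* Exclusive end address of a region; 64-bit so the sum cannot wrap. */
static uint64_t region_end(uint32_t start, uint32_t size)
{
	return (uint64_t)start + size;
}

/* Two half-open regions overlap iff each one starts before the other ends. */
static bool regions_overlap(uint32_t a, uint32_t a_size,
			    uint32_t b, uint32_t b_size)
{
	return a < region_end(b, b_size) && b < region_end(a, a_size);
}
```

With initial_boot_params=0xc861e400 and size=0x70ca the source ends at
0xc86254ca, while dt=0xc861f000 starts only 0xc00 (3072) bytes into it, so
regions_overlap(0xc861e400, 0x70ca, 0xc861f000, 0x70ca) is true. Assuming a
forward copy, everything in the source past the first 3072 bytes is
overwritten before it is read.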

2017-03-09 14:44:26

by Tobias Klauser

Subject: Re: nios2 crash/hang in mainline due to 'lib: update LZ4 compressor module'

On 2017-03-09 at 14:20:51 +0100, Guenter Roeck <[email protected]> wrote:
> On 03/07/2017 04:46 AM, Tobias Klauser wrote:
> [ ... ]
>
> >
> >Linux version 4.11.0-rc1-dirty (tobiask@ziws08) (gcc version 7.0.1 20170226 (experimental) (GCC) ) #46 Tue Mar 7 13:40:53 CET 2017
> >bootconsole [early0] enabled
> >Early console on uart16650 initialized at 0xf8001600
> >OF: fdt: Error -11 processing FDT
> >Kernel panic - not syncing: setup_cpuinfo: No CPU found in devicetree!
> >
> >---[ end Kernel panic - not syncing: setup_cpuinfo: No CPU found in devicetree!
> >
> >Looks like the in-memory device tree somehow gets corrupted. Not sure
> >yet why and how this is linked to the Kconfig options selected but at
> >least we now have a possibility to use debug messages earlier on.
> >
>
> I think I found the problem. In unflatten_and_copy_device_tree(), with added
> debug information:
>
> OF: fdt: initial_boot_params=c861e400, dt=c861f000 size=28874 (0x70ca)
>
> ... and then initial_boot_params is copied to dt, which results in corrupted
> fdt since the memory overlaps. Looks like the initial_boot_params memory
> is not reserved and (re-)allocated by early_init_dt_alloc_memory_arch().

Thanks for the analysis. That certainly explains the issue. The
following patch solves the issue for me, though I'm not entirely sure
whether it is correct and whether it is all that is needed. Do we need to
retain the memory for initial_boot_params after bootmem is freed?

diff --git a/arch/nios2/kernel/prom.c b/arch/nios2/kernel/prom.c
index 099f5ce1f3cb..6869fe03f3ff 100644
--- a/arch/nios2/kernel/prom.c
+++ b/arch/nios2/kernel/prom.c
@@ -48,6 +48,13 @@ void * __init early_init_dt_alloc_memory_arch(u64 size, u64 align)
 	return alloc_bootmem_align(size, align);
 }
 
+int __init early_init_dt_reserve_memory_arch(phys_addr_t base,
+					     phys_addr_t size, bool nomap)
+{
+	reserve_bootmem(base, size, BOOTMEM_DEFAULT);
+	return 0;
+}
+
 void __init early_init_devtree(void *params)
 {
 	__be32 *dtb = (u32 *)__dtb_start;
diff --git a/arch/nios2/kernel/setup.c b/arch/nios2/kernel/setup.c
index 6e57ffa5db27..6044d9be28b4 100644
--- a/arch/nios2/kernel/setup.c
+++ b/arch/nios2/kernel/setup.c
@@ -201,6 +201,9 @@ void __init setup_arch(char **cmdline_p)
}
#endif /* CONFIG_BLK_DEV_INITRD */

+	early_init_fdt_reserve_self();
+	early_init_fdt_scan_reserved_mem();
+
 	unflatten_and_copy_device_tree();
 
 	setup_cpuinfo();
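
Why reserving the blob helps can be shown with a toy model. The sketch
below is not the kernel's bootmem allocator. It is a deliberately
simplified first-fit page bitmap, and toy_used, toy_reserve() and
toy_alloc() are hypothetical names. It only illustrates the idea behind
the patch: once the pages holding the device tree blob are marked as
reserved, a later allocation can no longer be placed on top of them.

```c
#include <stdbool.h>
#include <stddef.h>

#define TOY_PAGES 16

/* One flag per page: true means reserved or already allocated. */
static bool toy_used[TOY_PAGES];

/* Mark pages [first, first + count) as unavailable, like a reservation. */
static void toy_reserve(size_t first, size_t count)
{
	for (size_t i = first; i < first + count && i < TOY_PAGES; i++)
		toy_used[i] = true;
}

/*
 * First-fit allocation of count contiguous free pages.
 * Returns the index of the first page, or -1 if no run is free.
 */
static long toy_alloc(size_t count)
{
	for (size_t start = 0; start + count <= TOY_PAGES; start++) {
		size_t n = 0;

		while (n < count && !toy_used[start + n])
			n++;
		if (n == count) {
			toy_reserve(start, count);
			return (long)start;
		}
	}
	return -1;
}
```

If a blob occupies pages 0 to 2 and nothing reserves them, toy_alloc(2)
returns page 0 and the new buffer lands on the blob. After
toy_reserve(0, 3) the same request returns page 3 instead.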

2017-03-09 20:09:26

by Guenter Roeck

Subject: Re: nios2 crash/hang in mainline due to 'lib: update LZ4 compressor module'

On Thu, Mar 09, 2017 at 03:43:40PM +0100, Tobias Klauser wrote:
> On 2017-03-09 at 14:20:51 +0100, Guenter Roeck <[email protected]> wrote:
> > On 03/07/2017 04:46 AM, Tobias Klauser wrote:
> > [ ... ]
> >
> > >
> > >Linux version 4.11.0-rc1-dirty (tobiask@ziws08) (gcc version 7.0.1 20170226 (experimental) (GCC) ) #46 Tue Mar 7 13:40:53 CET 2017
> > >bootconsole [early0] enabled
> > >Early console on uart16650 initialized at 0xf8001600
> > >OF: fdt: Error -11 processing FDT
> > >Kernel panic - not syncing: setup_cpuinfo: No CPU found in devicetree!
> > >
> > >---[ end Kernel panic - not syncing: setup_cpuinfo: No CPU found in devicetree!
> > >
> > >Looks like the in-memory device tree somehow gets corrupted. Not sure
> > >yet why and how this is linked to the Kconfig options selected but at
> > >least we now have a possibility to use debug messages earlier on.
> > >
> >
> > I think I found the problem. In unflatten_and_copy_device_tree(), with added
> > debug information:
> >
> > OF: fdt: initial_boot_params=c861e400, dt=c861f000 size=28874 (0x70ca)
> >
> > ... and then initial_boot_params is copied to dt, which results in corrupted
> > fdt since the memory overlaps. Looks like the initial_boot_params memory
> > is not reserved and (re-)allocated by early_init_dt_alloc_memory_arch().
>
> Thanks for the analysis. That certainly explains the issue. The
> following patch solves the issue for me, though I'm not entirely sure
> whether it is correct and whether it is all that is needed. Do we need to
> retain the memory for initial_boot_params after bootmem is freed?
>

I don't know if it is correct either, but it matches what I came up with,
and it does work for me as well. Feel free to add

Tested-by: Guenter Roeck <[email protected]>

when you submit the patch for real.

Thanks,
Guenter

> diff --git a/arch/nios2/kernel/prom.c b/arch/nios2/kernel/prom.c
> index 099f5ce1f3cb..6869fe03f3ff 100644
> --- a/arch/nios2/kernel/prom.c
> +++ b/arch/nios2/kernel/prom.c
> @@ -48,6 +48,13 @@ void * __init early_init_dt_alloc_memory_arch(u64 size, u64 align)
> return alloc_bootmem_align(size, align);
> }
>
> +int __init early_init_dt_reserve_memory_arch(phys_addr_t base,
> + phys_addr_t size, bool nomap)
> +{
> + reserve_bootmem(base, size, BOOTMEM_DEFAULT);
> + return 0;
> +}
> +
> void __init early_init_devtree(void *params)
> {
> __be32 *dtb = (u32 *)__dtb_start;
> diff --git a/arch/nios2/kernel/setup.c b/arch/nios2/kernel/setup.c
> index 6e57ffa5db27..6044d9be28b4 100644
> --- a/arch/nios2/kernel/setup.c
> +++ b/arch/nios2/kernel/setup.c
> @@ -201,6 +201,9 @@ void __init setup_arch(char **cmdline_p)
> }
> #endif /* CONFIG_BLK_DEV_INITRD */
>
> + early_init_fdt_reserve_self();
> + early_init_fdt_scan_reserved_mem();
> +
> unflatten_and_copy_device_tree();
>
> setup_cpuinfo();