2009-10-09 14:18:30

by Michael Tokarev

Subject: wrong final bzImage build (regading #14270)

OK, the mystery is finally solved, after a week of
digging.

The original problem was titled "Cannot boot on
a PIII Celeron", and Rafael filed a bug #14270
for this.

In short, what I observed was that a new kernel
(2.6.31) fails to boot on a PIII Celeron machine.
But after changing just the CPU to a plain PIII, voila,
it works. I don't know why it behaved this
way, but I finally found where the problem was.

The problem is in the last stage of the build, when
building the bzImage.

make -f scripts/Makefile.build obj=arch/x86/boot/compressed arch/x86/boot/compressed/vmlinux
...
(cat arch/x86/boot/compressed/vmlinux.bin | lzma -9 && echo -ne \\x38\\xd6\\x37\\x00) > arch/x86/boot/compressed/vmlinux.bin.lzma
...

Note the echo command.

Now, Debian switched to dash as /bin/sh. And dash's
built-in echo does not understand the -e option, so it
prints the option and the escapes literally:

$ dash -c 'echo -ne \\x38\\xd6\\x37\\x00' | od -x
0000000 6e2d 2065 785c 3833 785c 3664 785c 3733
0000020 785c 3030 000a

$ bash -c 'echo -ne \\x38\\xd6\\x37\\x00' | od -x
0000000 d638 0037
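
For comparison, a minimal sketch of how the same four bytes could be emitted without relying on echo at all. It assumes only what POSIX specifies for printf, namely octal escapes (\ddd) in the format string; \x hex escapes are not guaranteed there either. The function name size_trailer is hypothetical and used only for illustration.

```shell
# Emit the bytes 0x38 0xd6 0x37 0x00 using octal escapes,
# which POSIX printf specifies (unlike \x hex escapes).
size_trailer() {
    printf '\070\326\067\000'
}

# Dump byte by byte so host endianness does not affect the display.
size_trailer | od -An -tx1
```

This should give the same four bytes under bash, dash, or any other POSIX shell, since no echo option or extension is involved.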

So the final size (the size of the uncompressed file)
becomes incorrect. Here's what mkpiggy outputs for
this (in arch/x86/boot/compressed/piggy.S):

z_output_len = 170930296

while it should be

z_output_len = 3659320

And with the former (wrong, larger) size, the whole
thing just reboots on a PIII Celeron. I've no idea
why, but the original problem is here.
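
Both numbers can be reproduced by hand, assuming the size is read from the last four bytes of the compressed file as a little-endian 32-bit value (an assumption, though the arithmetic matches both values exactly). With dash, the file ends in the literal text "\x00" plus a newline, so the last four bytes are 'x', '0', '0' and newline. The helper name le32 below is hypothetical.

```shell
# Decode four bytes, given as hex pairs in file order, as a
# little-endian 32-bit integer (lowest-order byte first).
le32() {
    printf '%s\n' "$(( 0x$1 | (0x$2 << 8) | (0x$3 << 16) | (0x$4 << 24) ))"
}

le32 38 d6 37 00   # trailer written by bash's echo   -> 3659320
le32 78 30 30 0a   # last four bytes of dash's output -> 170930296
```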

The same thing happens with the bzip2 algorithm, which is
not new, not only with lzma.

The whole thing looks quite hackish to me: mkpiggy
could get the size from the original image just fine,
instead of reading it from the end of the already
compressed file.

For now, the quick fix is to change echo to printf there.
The correct fix is to rewrite mkpiggy to look at the
original file for the size (IMHO anyway).
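
As a rough sketch of the printf direction (not the actual kbuild rule), the trailer could be derived from the uncompressed file's size and written with octal escapes only. The function name size_trailer_for and the 258-byte demo file are assumptions made for illustration; mktemp and dd are used only to set up the demo.

```shell
# Hypothetical helper: write the size of file "$1" as four
# little-endian bytes, using only printf octal escapes.
size_trailer_for() {
    size=$(wc -c < "$1")
    for n in 1 2 3 4; do
        # Emit the low byte as a three-digit octal escape, e.g. \070 for 0x38.
        printf "\\$(printf '%03o' "$(( size & 255 ))")"
        size=$(( size >> 8 ))
    done
}

# Demo on a 258-byte file (258 = 0x0102, so the bytes are 02 01 00 00).
tmp=$(mktemp)
dd if=/dev/zero of="$tmp" bs=258 count=1 2>/dev/null
size_trailer_for "$tmp" | od -An -tx1
rm -f "$tmp"
```

In a build rule the function's output would be appended after the compressor's output, in place of the echo shown earlier.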

And this is a very good candidate for -stable as well.
The bug is very difficult to find. And now that more
and more Debian users are switching to dash,
it will become more common.

Thanks!

/mjt


2009-10-09 14:26:47

by Michael Tokarev

Subject: Re: wrong final bzImage build (regading #14270)

And I forgot to mention: this IS a regression in 2.6.31.


2009-10-09 14:59:09

by Cyrill Gorcunov

Subject: Re: wrong final bzImage build (regading #14270)

Peter and Sam CC'ed


-- Cyrill

2009-10-09 17:04:59

by H. Peter Anvin

Subject: Re: wrong final bzImage build (regading #14270)

On 10/09/2009 07:58 AM, Cyrill Gorcunov wrote:
> Peter and Sam CC'ed
>
> [Michael Tokarev - Fri, Oct 09, 2009 at 06:17:50PM +0400]
>> Ok, finally the mystery solved. After a week of
>> digging.
>>
>> The original problem was titled "Cannot boot on
>> a PIII Celeron", and Rafael filed a bug #14270
>> for this.
>>
>> In short, what I observed was that a new kernel
>> (2.6.31) fails to boot on a PIII Celeron machine.
>> But changing just the CPU to plain PIII and voila,
>> it now works. I don't know why it behaved this
>> way, but I found where was the problem, finally.
>>

We should switch to printf here. Hexadecimal constants in echo aren't
guaranteed by POSIX.

-hpa

--
H. Peter Anvin, Intel Open Source Technology Center
I work for Intel. I don't speak on their behalf.

2009-10-09 17:14:47

by Michael Tokarev

Subject: Re: wrong final bzImage build (regading #14270)

H. Peter Anvin wrote:
> On 10/09/2009 07:58 AM, Cyrill Gorcunov wrote:
>> Peter and Sam CC'ed
>>
>> [Michael Tokarev - Fri, Oct 09, 2009 at 06:17:50PM +0400]
>>> Ok, finally the mystery solved. After a week of
>>> digging.
>>>
>>> The original problem was titled "Cannot boot on
>>> a PIII Celeron", and Rafael filed a bug #14270
>>> for this.
>>>
>>> In short, what I observed was that a new kernel
>>> (2.6.31) fails to boot on a PIII Celeron machine.
>>> But changing just the CPU to plain PIII and voila,
>>> it now works. I don't know why it behaved this
>>> way, but I found where was the problem, finally.
>
> We should switch to printf here. Hexadecimal constants in echo aren't
> guaranteed by POSIX.

That's what I initially proposed. However, as Scott Olson pointed
out, there's already a fix for this:

http://lkml.org/lkml/2009/8/19/84
http://patchwork.kernel.org/patch/42564/

which uses still-non-portable /bin/echo.

(I wish I had known about it a week ago; it wasn't a pleasant week for me.)

Still, an interesting result. I could understand it if it failed
on systems with smaller amounts of memory, but no: it fails
with a Celeron on a 64MB system, yet works on the same system
if I replace the CPU with a real PIII... Fun.

/mjt

2009-10-09 19:40:28

by Michael Tokarev

Subject: Re: wrong final bzImage build (regading #14270)

Ok, some more to this.

It turns out dash's built-in echo command interprets \nnn octal
sequences by default, and there's no way to turn that off. So,
for example, sed-zoffset command from arch/x86/boot/Makefile
(which includes \1 \2 etc substitutions for sed), when echoed
in verbose mode (V=1), produces... interesting characters (with
ASCII codes 1 and 2).
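
One way to print such a command string verbatim, independent of how the shell's echo treats backslashes, is to pass it as an argument to printf '%s'. This is only a sketch, not what the kernel build does for V=1; the function name show_cmd and the sample sed expression are hypothetical.

```shell
# Print a command string verbatim: printf '%s' does not interpret
# backslashes in its arguments, whichever shell is /bin/sh.
show_cmd() {
    printf '%s\n' "$1"
}

show_cmd 's/\(foo\)\(bar\)/\2\1/'
```

The \1 and \2 come out as two printable characters each, rather than as bytes 1 and 2.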

It's not practical to replace V=1's echo with /bin/echo I think.

So I'd say it's not a bug in the build system after all, but
a bug in dash. Well, at least this expanding-by-default didn't
trigger another very-difficult-to-find bug (hopefully), but it
has good potential.

I'll file a bug report against dash.

/mjt


2009-10-09 19:59:53

by Cyrill Gorcunov

Subject: Re: wrong final bzImage build (regading #14270)


OK, thanks Michael!

-- Cyrill

2009-10-09 20:03:33

by Arkadiusz Miskiewicz

Subject: Re: wrong final bzImage build (regading #14270)

On Friday 09 of October 2009, Michael Tokarev wrote:
> Ok, some more to this.
>
> It turns out dash's built-in echo command interprets \nnn octal
> sequences by default, and there's no way to turn that off. So,
> for example, sed-zoffset command from arch/x86/boot/Makefile
> (which includes \1 \2 etc substitutions for sed), when echoed
> in verbose mode (V=1), produces.. interesting characters (with
> ascii code 1 and 2).
>
> It's not practical to replace V=1's echo with /bin/echo I think.
>
> So I'd say it's not a bug in the build system after all, but
> a bug in dash.

It's still a bug in the build system if you consider /bin/sh to be a POSIX
shell. POSIX shells don't support \hex notation (see the Single UNIX
Specification).

I had exactly this problem a few weeks ago with pdksh as /bin/sh (and
reported the bug to the author of that change). As a workaround I used
/bin/echo, but using printf is more sane/portable.

--
Arkadiusz Miśkiewicz PLD/Linux Team
arekm / maven.pl http://ftp.pld-linux.org/

2009-10-09 20:05:54

by Michael Tokarev

Subject: Re: wrong final bzImage build (regading #14270)

Michael Tokarev wrote:
> Ok, some more to this.
>
> It turns out dash's built-in echo command interprets \nnn octal
> sequences by default, and there's no way to turn that off. So,
> for example, sed-zoffset command from arch/x86/boot/Makefile
> (which includes \1 \2 etc substitutions for sed), when echoed
> in verbose mode (V=1), produces.. interesting characters (with
> ascii code 1 and 2).
>
> It's not practical to replace V=1's echo with /bin/echo I think.
>
> So I'd say it's not a bug in the build system after all, but
> a bug in dash. Well, at least this expanding-by-default didn't
> trigger another very-difficult-to-find bug (hopefully), but it
> has good potential.
>
> I'll file a bug report against dash.

For reference: http://bugs.debian.org/cgi-bin/bugreport.cgi?bug=550399

> /mjt

2009-10-09 20:57:17

by H. Peter Anvin

Subject: Re: wrong final bzImage build (regading #14270)

On 10/09/2009 01:02 PM, Arkadiusz Miskiewicz wrote:
>
> I had exactly this problem a few weeks ago with pdksh as /bin/sh (and
> reported the bug to the author of that change). As a workaround I used
> /bin/echo, but using printf is more sane/portable.
>

Yes, using printf is the right thing to do.

A patch would be appreciated.

-hpa

2009-10-09 21:28:00

by Michael Tokarev

Subject: Re: wrong final bzImage build (regading #14270)

H. Peter Anvin wrote:
> On 10/09/2009 01:02 PM, Arkadiusz Miskiewicz wrote:
>> I had exactly this problem a few weeks ago with pdksh as /bin/sh (and
>> reported the bug to the author of that change). As a workaround I used
>> /bin/echo, but using printf is more sane/portable.
>>
>
> Yes, using printf is the right thing to do.
>
> A patch would be appreciated.

Come on, it's just a one-word change (s/echo/printf/ in
scripts/Makefile.lib).

But it should go to Sam's tree first I guess, which already
has s|echo|/bin/echo|, so it'll conflict.
It's easier to just make the change in whichever tree it
ends up in, without a complete patch.

/mjt

2009-10-09 21:30:32

by H. Peter Anvin

Subject: Re: wrong final bzImage build (regading #14270)

On 10/09/2009 02:27 PM, Michael Tokarev wrote:
> H. Peter Anvin wrote:
>> On 10/09/2009 01:02 PM, Arkadiusz Miskiewicz wrote:
>>> I had exactly this problem a few weeks ago with pdksh as /bin/sh (and
>>> reported the bug to the author of that change). As a workaround I used
>>> /bin/echo, but using printf is more sane/portable.
>>>
>>
>> Yes, using printf is the right thing to do.
>>
>> A patch would be appreciated.
>
> Come on, it's just a one-word change (s/echo/printf/ in
> scripts/Makefile.lib).

> But it should go to Sam's tree first I guess, which already
> has s|echo|/bin/echo| so it'll conflict.
> It's easier to just make the change in whichever tree it
> ends up in, without a complete patch.

So send a patch against Sam's tree.

-hpa