2020-01-22 17:53:37

by Rasmus Villemoes

Subject: vmlinux ELF header sometimes corrupt

I'm building for a ppc32 (mpc8309) target using Yocto, and I'm hitting a
very hard to debug problem that maybe someone else has encountered. This
doesn't happen always, perhaps 1 in 8 times or something like that.

The issue is that when the build gets to do "${CROSS}objcopy -O binary
... vmlinux", vmlinux is not (no longer) a proper ELF file, so naturally
that fails with

powerpc-oe-linux-objcopy:vmlinux: file format not recognized

So I hacked link-vmlinux.sh to stash copies of vmlinux before and after
sortextable vmlinux. Both of those are proper ELF files, and comparing
the corrupted vmlinux to vmlinux.after_sort they are identical after the
first 52 bytes; in vmlinux, those first 52 bytes are all 0.

I also saved stat(1) info to see if vmlinux is being replaced or
modified in-place.

$ cat vmlinux.stat.after_sort
File: 'vmlinux'
Size: 8608456 Blocks: 16696 IO Block: 4096 regular file
Device: 811h/2065d Inode: 21919132 Links: 1
Access: (0755/-rwxr-xr-x) Uid: ( 1000/ user) Gid: ( 1001/ user)
Access: 2020-01-22 10:52:38.946703081 +0000
Modify: 2020-01-22 10:52:38.954703105 +0000
Change: 2020-01-22 10:52:38.954703105 +0000

$ stat vmlinux
File: 'vmlinux'
Size: 8608456 Blocks: 16688 IO Block: 4096 regular file
Device: 811h/2065d Inode: 21919132 Links: 1
Access: (0755/-rwxr-xr-x) Uid: ( 1000/ user) Gid: ( 1001/ user)
Access: 2020-01-22 17:20:00.650379057 +0000
Modify: 2020-01-22 10:52:38.954703105 +0000
Change: 2020-01-22 10:52:38.954703105 +0000

So the inode number and mtime/ctime are exactly the same, but for some
reason Blocks: has changed? This is on an ext4 filesystem, but I don't
suspect the filesystem to be broken, because it's always just vmlinux
that ends up corrupt, and always in exactly this way with the first 52
bytes having been wiped.
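
The Blocks: difference can be put into bytes with a little arithmetic.
The sketch below assumes the Linux convention that stat(1) and
st_blocks count 512-byte units; the helper names are hypothetical and
only the two Blocks: values come from the output above. It shows the
size of the difference and does not say anything about its cause.

```python
import os

STAT_BLOCK_BYTES = 512  # unit of "Blocks:" / st_blocks on Linux (assumption stated above)


def allocated_bytes(blocks: int) -> int:
    """Convert a Blocks: count into allocated bytes."""
    return blocks * STAT_BLOCK_BYTES


def allocated_bytes_of(path: str) -> int:
    """Same figure for a file on disk (hypothetical helper, Unix only)."""
    return allocated_bytes(os.stat(path).st_blocks)


# Values copied from the two stat outputs.
after_sort = allocated_bytes(16696)
at_failure = allocated_bytes(16688)
difference = after_sort - at_failure  # 8 units of 512 bytes
```

Here the difference works out to 4096 bytes, the same figure stat
reports as IO Block, so the allocation shrank by what looks like a
single block.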

Any ideas?

Rasmus


2020-01-24 11:10:25

by Michael Ellerman

Subject: Re: vmlinux ELF header sometimes corrupt

Rasmus Villemoes <[email protected]> writes:
> I'm building for a ppc32 (mpc8309) target using Yocto, and I'm hitting a
> very hard to debug problem that maybe someone else has encountered. This
> doesn't happen always, perhaps 1 in 8 times or something like that.
>
> The issue is that when the build gets to do "${CROSS}objcopy -O binary
> ... vmlinux", vmlinux is not (no longer) a proper ELF file, so naturally
> that fails with
>
> powerpc-oe-linux-objcopy:vmlinux: file format not recognized
>
> So I hacked link-vmlinux.sh to stash copies of vmlinux before and after
> sortextable vmlinux. Both of those are proper ELF files, and comparing
> the corrupted vmlinux to vmlinux.after_sort they are identical after the
> first 52 bytes; in vmlinux, those first 52 bytes are all 0.
>
> I also saved stat(1) info to see if vmlinux is being replaced or
> modified in-place.
>
> $ cat vmlinux.stat.after_sort
> File: 'vmlinux'
> Size: 8608456 Blocks: 16696 IO Block: 4096 regular file
> Device: 811h/2065d Inode: 21919132 Links: 1
> Access: (0755/-rwxr-xr-x) Uid: ( 1000/ user) Gid: ( 1001/ user)
> Access: 2020-01-22 10:52:38.946703081 +0000
> Modify: 2020-01-22 10:52:38.954703105 +0000
> Change: 2020-01-22 10:52:38.954703105 +0000
>
> $ stat vmlinux
> File: 'vmlinux'
> Size: 8608456 Blocks: 16688 IO Block: 4096 regular file
> Device: 811h/2065d Inode: 21919132 Links: 1
> Access: (0755/-rwxr-xr-x) Uid: ( 1000/ user) Gid: ( 1001/ user)
> Access: 2020-01-22 17:20:00.650379057 +0000
> Modify: 2020-01-22 10:52:38.954703105 +0000
> Change: 2020-01-22 10:52:38.954703105 +0000
>
> So the inode number and mtime/ctime are exactly the same, but for some
> reason Blocks: has changed? This is on an ext4 filesystem, but I don't
> suspect the filesystem to be broken, because it's always just vmlinux
> that ends up corrupt, and always in exactly this way with the first 52
> bytes having been wiped.
>
> Any ideas?

Not really, sorry. I haven't seen or heard of that before.

Are you doing a parallel make? If so does -j 1 fix it?

If it seems like sortextable is at fault then strace'ing it would be my
next step.

cheers

2020-01-24 14:38:15

by Andreas Schwab

Subject: Re: vmlinux ELF header sometimes corrupt

On Jan 22 2020, Rasmus Villemoes wrote:

> So the inode number and mtime/ctime are exactly the same, but for some
> reason Blocks: has changed? This is on an ext4 filesystem, but I don't
> suspect the filesystem to be broken, because it's always just vmlinux
> that ends up corrupt, and always in exactly this way with the first 52
> bytes having been wiped.

Note that the size of the ELF header (Elf32_Ehdr) is 52 bytes.
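
The 52-byte figure can be checked by adding up the fields of
Elf32_Ehdr. The sketch below describes the layout with Python's struct
module; the field order follows the ELF specification, big-endian is
assumed because the target is ppc32, and the name ELF32_EHDR is made up
for the example.

```python
import struct

# Elf32_Ehdr, field by field. ">" means big-endian with no padding
# (assumed here for a ppc32 target).
ELF32_EHDR = struct.Struct(
    ">"
    "16s"  # e_ident
    "H"    # e_type
    "H"    # e_machine
    "I"    # e_version
    "I"    # e_entry
    "I"    # e_phoff
    "I"    # e_shoff
    "I"    # e_flags
    "H"    # e_ehsize
    "H"    # e_phentsize
    "H"    # e_phnum
    "H"    # e_shentsize
    "H"    # e_shnum
    "H"    # e_shstrndx
)

ELF32_EHDR_SIZE = ELF32_EHDR.size  # 16 + 2*2 + 5*4 + 6*2
```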

Andreas.

--
Andreas Schwab, [email protected]
GPG Key fingerprint = 7578 EB47 D4E5 4D69 2510 2552 DF73 E780 A9DA AEC1
"And now for something completely different."

2020-01-24 19:58:04

by Rasmus Villemoes

Subject: Re: vmlinux ELF header sometimes corrupt

On 24/01/2020 11.50, Michael Ellerman wrote:
> Rasmus Villemoes <[email protected]> writes:
>> I'm building for a ppc32 (mpc8309) target using Yocto, and I'm hitting a
>> very hard to debug problem that maybe someone else has encountered. This
>> doesn't happen always, perhaps 1 in 8 times or something like that.
>>
>> The issue is that when the build gets to do "${CROSS}objcopy -O binary
>> ... vmlinux", vmlinux is not (no longer) a proper ELF file, so naturally
>> that fails with
>>
>> powerpc-oe-linux-objcopy:vmlinux: file format not recognized
>>
>>
>> Any ideas?
>
> Not really, sorry. I haven't seen or heard of that before.
>
> Are you doing a parallel make? If so does -j 1 fix it?

Hard to say, I'll have to try that a number of times to see if it can be
reproduced with that setting.

> If it seems like sortextable is at fault then strace'ing it would be my
> next step.

I don't think sortextable is at fault; that was just my first guess,
since I know it at least pokes around in the ELF file. I do "cp vmlinux
vmlinux.before_sort" and "cp vmlinux vmlinux.after_sort", and both of
those copies are proper ELF files - and the .after_sort is identical to
the corrupt vmlinux apart from vmlinux ending up with its ELF header wiped.
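
A check along these lines could be scripted so it runs after each later
build step. This is only a sketch, not the commands actually used: the
function names are hypothetical, and the 52-byte default is the
Elf32_Ehdr size mentioned earlier in the thread.

```python
from pathlib import Path

ELF32_HEADER_SIZE = 52  # sizeof(Elf32_Ehdr), as noted in the thread


def compare_images(corrupt: bytes, good: bytes,
                   header_size: int = ELF32_HEADER_SIZE) -> dict:
    """Compare two file images, looking at the header and the rest separately."""
    return {
        "same_size": len(corrupt) == len(good),
        # True only if the first header_size bytes are all zero.
        "header_all_zero": corrupt[:header_size] == bytes(header_size),
        # The reference copy should still start with the ELF magic.
        "good_has_elf_magic": good[:4] == b"\x7fELF",
        "rest_identical": corrupt[header_size:] == good[header_size:],
    }


def compare_files(corrupt_path: str, good_path: str) -> dict:
    """Read both files fully and compare them (fine for a file of a few MB)."""
    return compare_images(Path(corrupt_path).read_bytes(),
                          Path(good_path).read_bytes())
```

Calling compare_files("vmlinux", "vmlinux.after_sort") after each step
would show the first step at which header_all_zero becomes true.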

So it's something that happens during some later build step (Yocto has a
lot of steps), perhaps "make modules" or "make modules_install" or
something ends up somehow deciding "hey, vmlinux isn't quite up to date,
let's nuke it". I'm not even sure it's a Kbuild problem, but I've seen
the same thing happen using another meta-build system called oe-lite,
which is why I'm not primarily suspecting the Yocto logic.

Rasmus