2017-06-01 16:39:17

by Larry Finger

[permalink] [raw]
Subject: Regression in kernel 4.12-rc1 for Powerpc 32 - partially bisected

On my Powerbook G4 aluminum, kernel 4.12-rc1 fails to boot. I cannot save a copy
of the early printk messages and I will need to summarize.

The kernel finds the hard driver and recognizes the various partitions. All
seems normal until after the "unused kernel memory is freed" and that "This
architecture does not have kernel memory protection", which are both normal. The
next batch of logged messages are

Loading, please wait...
udevd[64]: starting version 175
udevd[64]: Unable to receive ctrl message: Bad address.
modprobe: chdir(4.12-rc1): No such file or directory
udevd[64]: Unable to receive ctrl message: Bad address.
Begin: Loading essential drivers ... modprobe: chdir: chdir(4.12.0-rc1): No such
file or directory
done.
Begin: Running /scripts/init-premount ... done.
udevd[64]: Begin: Waiting for root file system ... [ 11.651175] random: faast
init done
done.
Gave up waiting for root device. Common problems:
- Boot args (cat /proc/cmdline)
- Check rootdelay= (did the system wait long enough?)
- Check root= (Did the system wait for the right device?)
- Missing modules (cat /proc/modules; ls /dev)
ALERT! /dev/disk/by-uuid/<UUID> does not exist. Dropping to a shell!
modprobe: chdir(4.12-rc1): No such file or directory <listed 6 times>

BusyBox v1.20.2 (Debian 1:1.20.0-7) built-in shell (ash)
Enter 'help' for a list of built-in commands.

/bin/sh: can't access tty: Job control turned off
(initramfs)

At that point, the system is dead. I have tried bisecting this issue; however, I
run into a second problem that crashes the bootstrap.

The bisection was normal with a bad commit at ec02268 ("alpha: switch to
RAW_COPY_USER") and a good commit at ec02268 ("alpha: switch to RAW_COPY_USER")
ec02268. The next trial at commit c1aad8d ("asm-generic: zero in __get_user(),
not __get_user_fn()") panics in the bootstrap. That panic has apparently been
fixed, but my attempts to find that fix have not yet been successful.

Sorry that I do not have a better description of where the problem happens.

As usual, any suggestions welcome.

Larry


2017-06-21 15:11:01

by Larry Finger

[permalink] [raw]
Subject: Re: Regression in kernel 4.12-rc1 for Powerpc 32 - bisected to commit 3448890c32c3

On 06/01/2017 11:39 AM, Larry Finger wrote:
> On my Powerbook G4 aluminum, kernel 4.12-rc1 fails to boot. I cannot save a copy
> of the early printk messages and I will need to summarize.
>
> The kernel finds the hard driver and recognizes the various partitions. All
> seems normal until after the "unused kernel memory is freed" and that "This
> architecture does not have kernel memory protection", which are both normal. The
> next batch of logged messages are
>
> Loading, please wait...
> udevd[64]: starting version 175
> udevd[64]: Unable to receive ctrl message: Bad address.
> modprobe: chdir(4.12-rc1): No such file or directory
> udevd[64]: Unable to receive ctrl message: Bad address.
> Begin: Loading essential drivers ... modprobe: chdir: chdir(4.12.0-rc1): No such
> file or directory
> done.
> Begin: Running /scripts/init-premount ... done.
> udevd[64]: Begin: Waiting for root file system ... [ 11.651175] random: faast
> init done
> done.
> Gave up waiting for root device. Common problems:
> - Boot args (cat /proc/cmdline)
> - Check rootdelay= (did the system wait long enough?)
> - Check root= (Did the system wait for the right device?)
> - Missing modules (cat /proc/modules; ls /dev)
> ALERT! /dev/disk/by-uuid/<UUID> does not exist. Dropping to a shell!
> modprobe: chdir(4.12-rc1): No such file or directory <listed 6 times>
>
> BusyBox v1.20.2 (Debian 1:1.20.0-7) built-in shell (ash)
> Enter 'help' for a list of built-in commands.
>
> /bin/sh: can't access tty: Job control turned off
> (initramfs)
>
> At that point, the system is dead. I have tried bisecting this issue; however, I
> run into a second problem that crashes the bootstrap.

I finally finished the bisection by patching each commit that was affected by
the bootstrap crash. The faulty change is commit
3448890c32c32c482c3ec20baa8fdd2ab4f94cc0 ("powerpc: get rid of zeroing, switch
to RAW_COPY_USER"). I am very confident in the bisection.

As I know nothing of assembly for ppc32, I have not been able to attempt to find
the problem with these patches.

Larry


2017-06-21 21:23:06

by Al Viro

[permalink] [raw]
Subject: Re: Regression in kernel 4.12-rc1 for Powerpc 32 - bisected to commit 3448890c32c3

On Wed, Jun 21, 2017 at 10:10:57AM -0500, Larry Finger wrote:

> I finally finished the bisection by patching each commit that was affected
> by the bootstrap crash. The faulty change is commit
> 3448890c32c32c482c3ec20baa8fdd2ab4f94cc0 ("powerpc: get rid of zeroing,
> switch to RAW_COPY_USER"). I am very confident in the bisection.
>
> As I know nothing of assembly for ppc32, I have not been able to attempt to
> find the problem with these patches.

How about the .config that works on parent of that commit?

2017-06-21 21:31:47

by Larry Finger

[permalink] [raw]
Subject: Re: Regression in kernel 4.12-rc1 for Powerpc 32 - bisected to commit 3448890c32c3

On 06/21/2017 04:22 PM, Al Viro wrote:
> On Wed, Jun 21, 2017 at 10:10:57AM -0500, Larry Finger wrote:
>
>> I finally finished the bisection by patching each commit that was affected
>> by the bootstrap crash. The faulty change is commit
>> 3448890c32c32c482c3ec20baa8fdd2ab4f94cc0 ("powerpc: get rid of zeroing,
>> switch to RAW_COPY_USER"). I am very confident in the bisection.
>>
>> As I know nothing of assembly for ppc32, I have not been able to attempt to
>> find the problem with these patches.
>
> How about the .config that works on parent of that commit?

Attached.

Laeey


Attachments:
config-4.11.0-rc1+ (93.50 kB)

2017-06-21 21:34:18

by Al Viro

[permalink] [raw]
Subject: Re: Regression in kernel 4.12-rc1 for Powerpc 32 - bisected to commit 3448890c32c3

On Wed, Jun 21, 2017 at 04:31:40PM -0500, Larry Finger wrote:
> On 06/21/2017 04:22 PM, Al Viro wrote:
> > On Wed, Jun 21, 2017 at 10:10:57AM -0500, Larry Finger wrote:
> >
> > > I finally finished the bisection by patching each commit that was affected
> > > by the bootstrap crash. The faulty change is commit
> > > 3448890c32c32c482c3ec20baa8fdd2ab4f94cc0 ("powerpc: get rid of zeroing,
> > > switch to RAW_COPY_USER"). I am very confident in the bisection.
> > >
> > > As I know nothing of assembly for ppc32, I have not been able to attempt to
> > > find the problem with these patches.
> >
> > How about the .config that works on parent of that commit?
>
> Attached.

OK... am I right assuming straight jessie/powerpc userland?

2017-06-21 21:49:49

by Larry Finger

[permalink] [raw]
Subject: Re: Regression in kernel 4.12-rc1 for Powerpc 32 - bisected to commit 3448890c32c3

On 06/21/2017 04:34 PM, Al Viro wrote:
> On Wed, Jun 21, 2017 at 04:31:40PM -0500, Larry Finger wrote:
>> On 06/21/2017 04:22 PM, Al Viro wrote:
>>> How about the .config that works on parent of that commit?
>>
>> Attached.
>
> OK... am I right assuming straight jessie/powerpc userland?
>

Actually Mint 12 with noting fancy. The machine mainly exists to test the
wireless drivers work on big-endian hardware.

'cat /etc/issue' reports "Debian GNU/Linux 7".

Larry

Larry




2017-06-22 14:12:07

by Al Viro

[permalink] [raw]
Subject: Re: Regression in kernel 4.12-rc1 for Powerpc 32 - bisected to commit 3448890c32c3

On Wed, Jun 21, 2017 at 04:49:46PM -0500, Larry Finger wrote:
> On 06/21/2017 04:34 PM, Al Viro wrote:
> > On Wed, Jun 21, 2017 at 04:31:40PM -0500, Larry Finger wrote:
> > > On 06/21/2017 04:22 PM, Al Viro wrote:
> > > > How about the .config that works on parent of that commit?
> > >
> > > Attached.
> >
> > OK... am I right assuming straight jessie/powerpc userland?
> >
>
> Actually Mint 12 with noting fancy. The machine mainly exists to test the
> wireless drivers work on big-endian hardware.
>
> 'cat /etc/issue' reports "Debian GNU/Linux 7".

Ugh... MintPPC appears to be dead. On KVM with Debian userland (either
jessie or wheezy - no difference in result) booting the commit in
question with your .config oopses as soon as pata_macio is initialized,
due to the bug in "treewide: Move dma_ops from struct dev_archdata into
struct device", and after cherry-picking your own fix for that (commit
46f401c4297a "powerpc/pmac: Fix crash in dma-mapping.h with NULL dma_ops")
the result boots just fine.

Again, that happens both for Debian 8 and Debian 7 userlands, so unless
Mint had been doing something very odd there, I would question the accuracy
of your bisect...

2017-06-22 14:20:01

by Larry Finger

[permalink] [raw]
Subject: Re: Regression in kernel 4.12-rc1 for Powerpc 32 - bisected to commit 3448890c32c3

On 06/22/2017 09:12 AM, Al Viro wrote:
> On Wed, Jun 21, 2017 at 04:49:46PM -0500, Larry Finger wrote:
>> On 06/21/2017 04:34 PM, Al Viro wrote:
>>> On Wed, Jun 21, 2017 at 04:31:40PM -0500, Larry Finger wrote:
>>>> On 06/21/2017 04:22 PM, Al Viro wrote:
>>>>> How about the .config that works on parent of that commit?
>>>>
>>>> Attached.
>>>
>>> OK... am I right assuming straight jessie/powerpc userland?
>>>
>>
>> Actually Mint 12 with noting fancy. The machine mainly exists to test the
>> wireless drivers work on big-endian hardware.
>>
>> 'cat /etc/issue' reports "Debian GNU/Linux 7".
>
> Ugh... MintPPC appears to be dead. On KVM with Debian userland (either
> jessie or wheezy - no difference in result) booting the commit in
> question with your .config oopses as soon as pata_macio is initialized,
> due to the bug in "treewide: Move dma_ops from struct dev_archdata into
> struct device", and after cherry-picking your own fix for that (commit
> 46f401c4297a "powerpc/pmac: Fix crash in dma-mapping.h with NULL dma_ops")
> the result boots just fine.
>
> Again, that happens both for Debian 8 and Debian 7 userlands, so unless
> Mint had been doing something very odd there, I would question the accuracy
> of your bisect...

Any chance that real hardware differs from KVM emulation? All I know at this
point is that commit f2ed8beb with 46f401c4 backported boots OK and commit
3448890c with the same backport fails.

I will try loading jessie and see what happens.

Thanks for investigating.

Larry


2017-06-22 19:25:19

by Al Viro

[permalink] [raw]
Subject: Re: Regression in kernel 4.12-rc1 for Powerpc 32 - bisected to commit 3448890c32c3

On Thu, Jun 22, 2017 at 09:19:58AM -0500, Larry Finger wrote:

> > Ugh... MintPPC appears to be dead. On KVM with Debian userland (either
> > jessie or wheezy - no difference in result) booting the commit in
> > question with your .config oopses as soon as pata_macio is initialized,
> > due to the bug in "treewide: Move dma_ops from struct dev_archdata into
> > struct device", and after cherry-picking your own fix for that (commit
> > 46f401c4297a "powerpc/pmac: Fix crash in dma-mapping.h with NULL dma_ops")
> > the result boots just fine.
> >
> > Again, that happens both for Debian 8 and Debian 7 userlands, so unless
> > Mint had been doing something very odd there, I would question the accuracy
> > of your bisect...

> Any chance that real hardware differs from KVM emulation?

For that one? Bloody unlikely; udev could, theoretically, hit different codepaths
due to different devices being observed, etc., but changes in that commit are
not in the areas that would be easy to get wrong in emulator.

> All I know at this
> point is that commit f2ed8beb with 46f401c4 backported boots OK and commit
> 3448890c with the same backport fails.
>
> I will try loading jessie and see what happens.

I would recheck which kernels are being booted - I had screwed that up during long
bisects often enough...

BTW, could you try to check what happens if you kill the
if (__builtin_constant_p(n) && (n <= 8))
bits in raw_copy_{to,from}_user()? The usefulness of those (in __copy_from_user()
originally) had always been dubious and the things are simpler without them.
If _that_ turns out to cure breakage, I would be very surprised, though.

2017-06-22 21:42:19

by Al Viro

[permalink] [raw]
Subject: Re: Regression in kernel 4.12-rc1 for Powerpc 32 - bisected to commit 3448890c32c3

On Thu, Jun 22, 2017 at 08:25:16PM +0100, Al Viro wrote:
> > All I know at this
> > point is that commit f2ed8beb with 46f401c4 backported boots OK and commit
> > 3448890c with the same backport fails.
> >
> > I will try loading jessie and see what happens.
>
> I would recheck which kernels are being booted - I had screwed that up during long
> bisects often enough...
>
> BTW, could you try to check what happens if you kill the
> if (__builtin_constant_p(n) && (n <= 8))
> bits in raw_copy_{to,from}_user()? The usefulness of those (in __copy_from_user()
> originally) had always been dubious and the things are simpler without them.
> If _that_ turns out to cure breakage, I would be very surprised, though.

FWIW, having dug through the __copy_tofrom_user() change in 3448890c, I don't see
anything that would be likely to cause that effect, be it on hardware or emulated.
Moreover, had that been fucked up, I would've expected lots and lots of folks
screaming by now - boot being broken since -rc1 tends to have such effect, even
if nobody had noticed that in -next last cycle.

What I can prove is that
* __copy_tofrom_user() return value is unchanged in all cases
* the only difference in its behaviour is that prior to that commit
some cases when it returns non-zero used to do memset(dest + something, 0,
retval) and now they do not. _All_ such cases must have stepped into a fault
on load from src + something.

And looking through arch/powerpc callers of all that bunch, I don't see any
candidates for being buggered by disappearing memset() on partial copy with
faulting read; note that copy_from_user() *will* memset() explicitly if
raw_copy_from_user() returns non-zero. I wondered if it could be a weird
case when copy_to_user() had been running into an unmapped area of *source*
and proceeded to zero the tail of destination, but I don't see anything
likely in arch/powerpc and anything in arch-independent code would've been
oopsing on that all along for some architectures...

2017-06-23 18:49:19

by Larry Finger

[permalink] [raw]
Subject: Re: Regression in kernel 4.12-rc1 for Powerpc 32 - bisected to commit 3448890c32c3

On 06/22/2017 02:25 PM, Al Viro wrote:
> On Thu, Jun 22, 2017 at 09:19:58AM -0500, Larry Finger wrote:
>
>>> Ugh... MintPPC appears to be dead. On KVM with Debian userland (either
>>> jessie or wheezy - no difference in result) booting the commit in
>>> question with your .config oopses as soon as pata_macio is initialized,
>>> due to the bug in "treewide: Move dma_ops from struct dev_archdata into
>>> struct device", and after cherry-picking your own fix for that (commit
>>> 46f401c4297a "powerpc/pmac: Fix crash in dma-mapping.h with NULL dma_ops")
>>> the result boots just fine.
>>>
>>> Again, that happens both for Debian 8 and Debian 7 userlands, so unless
>>> Mint had been doing something very odd there, I would question the accuracy
>>> of your bisect...
>
>> Any chance that real hardware differs from KVM emulation?
>
> For that one? Bloody unlikely; udev could, theoretically, hit different codepaths
> due to different devices being observed, etc., but changes in that commit are
> not in the areas that would be easy to get wrong in emulator.
>
>> All I know at this
>> point is that commit f2ed8beb with 46f401c4 backported boots OK and commit
>> 3448890c with the same backport fails.
>>
>> I will try loading jessie and see what happens.
>
> I would recheck which kernels are being booted - I had screwed that up during long
> bisects often enough...
>
> BTW, could you try to check what happens if you kill the
> if (__builtin_constant_p(n) && (n <= 8))
> bits in raw_copy_{to,from}_user()? The usefulness of those (in __copy_from_user()
> originally) had always been dubious and the things are simpler without them.
> If _that_ turns out to cure breakage, I would be very surprised, though.
>
Sorry I was gone so long. Installing jessie on this box resulted in a crash on
boot. Lubuntu 14.04 yielded a desktop with a functioning cursor, but nothing
else. Finally, Ubuntu 12.04 resulted in a working system. I hate Unity, but I
guess I'm stuck for now.

I know how easy it is to screw up a long bisection by booting the wrong kernel.
To help that problem and to work around the yaconf/yboot nonsense on the MAC, my
/etc/yaconf has always had generic kernel stanzas with only default, old, and
original kernels mentioned. From there I use a local script to finish a kernel
installation by moving the default links to the old ones and creating the new
default links pointing to the current kernel. With those long-tested scripts,
I'm sure that I am booting the one I want.

With the new installation, kernel 4.12-rc6 failed, as did 3448890c with the
backported 46f401c4 added.

Replacing "if (__builtin_constant_p(n) && (n <= 8))" with "if (0)" had no effect.

Larry

2017-06-23 20:29:48

by Al Viro

[permalink] [raw]
Subject: Re: Regression in kernel 4.12-rc1 for Powerpc 32 - bisected to commit 3448890c32c3

On Fri, Jun 23, 2017 at 01:49:16PM -0500, Larry Finger wrote:

> > BTW, could you try to check what happens if you kill the
> > if (__builtin_constant_p(n) && (n <= 8))
> > bits in raw_copy_{to,from}_user()? The usefulness of those (in __copy_from_user()
> > originally) had always been dubious and the things are simpler without them.
> > If _that_ turns out to cure breakage, I would be very surprised, though.
> >
> Sorry I was gone so long. Installing jessie on this box resulted in a crash
> on boot. Lubuntu 14.04 yielded a desktop with a functioning cursor, but
> nothing else. Finally, Ubuntu 12.04 resulted in a working system. I hate
> Unity, but I guess I'm stuck for now.

Ho-hum... Jessie is 3.16, so whatever is crashing there, it's something
different... Ubuntu 12.04 is what, 3.2?

> I know how easy it is to screw up a long bisection by booting the wrong
> kernel. To help that problem and to work around the yaconf/yboot nonsense on
> the MAC, my /etc/yaconf has always had generic kernel stanzas with only
> default, old, and original kernels mentioned. From there I use a local
> script to finish a kernel installation by moving the default links to the
> old ones and creating the new default links pointing to the current kernel.
> With those long-tested scripts, I'm sure that I am booting the one I want.
>
> With the new installation, kernel 4.12-rc6 failed, as did 3448890c with the
> backported 46f401c4 added.
>
> Replacing "if (__builtin_constant_p(n) && (n <= 8))" with "if (0)" had no effect.

OK, that simplifies things a bit. Just to make sure we are on the same page:

* f2ed8bebee69 + cherry-pick of 46f401c4 boots (Ubuntu 12.04 userland)
* 3448890c32c3 + cherry-pick of 46f401c4 fails (Ubuntu 12.04 userland), ditto
with removal of constant-size bits in raw_copy_..._user(). Failure appears
to be on udev getting EFAULT on some syscalls.
* straight Ubuntu 12.04 works
* jessie crashes on boot.

Could you post the boot logs of the first two?

2017-06-24 17:29:27

by Larry Finger

[permalink] [raw]
Subject: Re: Regression in kernel 4.12-rc1 for Powerpc 32 - bisected to commit 3448890c32c3

On 06/23/2017 03:29 PM, Al Viro wrote:
> On Fri, Jun 23, 2017 at 01:49:16PM -0500, Larry Finger wrote:
>
>>> BTW, could you try to check what happens if you kill the
>>> if (__builtin_constant_p(n) && (n <= 8))
>>> bits in raw_copy_{to,from}_user()? The usefulness of those (in __copy_from_user()
>>> originally) had always been dubious and the things are simpler without them.
>>> If _that_ turns out to cure breakage, I would be very surprised, though.
>>>
>> Sorry I was gone so long. Installing jessie on this box resulted in a crash
>> on boot. Lubuntu 14.04 yielded a desktop with a functioning cursor, but
>> nothing else. Finally, Ubuntu 12.04 resulted in a working system. I hate
>> Unity, but I guess I'm stuck for now.
>
> Ho-hum... Jessie is 3.16, so whatever is crashing there, it's something
> different... Ubuntu 12.04 is what, 3.2?
>
>> I know how easy it is to screw up a long bisection by booting the wrong
>> kernel. To help that problem and to work around the yaconf/yboot nonsense on
>> the MAC, my /etc/yaconf has always had generic kernel stanzas with only
>> default, old, and original kernels mentioned. From there I use a local
>> script to finish a kernel installation by moving the default links to the
>> old ones and creating the new default links pointing to the current kernel.
>> With those long-tested scripts, I'm sure that I am booting the one I want.
>>
>> With the new installation, kernel 4.12-rc6 failed, as did 3448890c with the
>> backported 46f401c4 added.
>>
>> Replacing "if (__builtin_constant_p(n) && (n <= 8))" with "if (0)" had no effect.
>
> OK, that simplifies things a bit. Just to make sure we are on the same page:
>
> * f2ed8bebee69 + cherry-pick of 46f401c4 boots (Ubuntu 12.04 userland)
> * 3448890c32c3 + cherry-pick of 46f401c4 fails (Ubuntu 12.04 userland), ditto
> with removal of constant-size bits in raw_copy_..._user(). Failure appears
> to be on udev getting EFAULT on some syscalls.
> * straight Ubuntu 12.04 works
> * jessie crashes on boot.

I made a break through. If I turn off inline copy to/from users for 32-bit ppc
with the following patch, then the system boots:

diff --git a/arch/powerpc/include/asm/uaccess.h b/arch/powerpc/include/asm/uaccess.h
index 5c0d8a8cdae5..1e6a8723f497 100644
--- a/arch/powerpc/include/asm/uaccess.h
+++ b/arch/powerpc/include/asm/uaccess.h
@@ -267,12 +267,7 @@ do {
\
extern unsigned long __copy_tofrom_user(void __user *to,
const void __user *from, unsigned long size);

-#ifndef __powerpc64__
-
-#define INLINE_COPY_FROM_USER
-#define INLINE_COPY_TO_USE
-
-#else /* __powerpc64__ */
+#ifdef __powerpc64__

static inline unsigned long
raw_copy_in_user(void __user *to, const void __user *from, unsigned long n)

It seems whatever problem I am seeing is in the inline version of
_copy_to_user() and _copy_from_user() on the 32-bit ppc. The only other
difference between the two versions is the placement of the __user macro, which
looks to be wrong in the non-inlined version of _copy_to_user() in
lib/usercopy.c, but that is the one that works.

To me, this looks like a compiler error. On the PowerBook, 'gcc --version'
reports "gcc (Ubuntu/Linaro 4.6.3-1ubuntu5) 4.6.3".

I will prepare a proper patch that I will send to you privately. If you agree
with it, it can be send through normal channels in time for the release of 4.12.

Larry

2017-06-25 09:54:38

by Al Viro

[permalink] [raw]
Subject: Re: Regression in kernel 4.12-rc1 for Powerpc 32 - bisected to commit 3448890c32c3

On Sat, Jun 24, 2017 at 12:29:23PM -0500, Larry Finger wrote:

> I made a break through. If I turn off inline copy to/from users for 32-bit
> ppc with the following patch, then the system boots:

OK... So it's 4.6.3 miscompiling something - it is hardware-independent,
reproduced in qemu. I'd like to get more self-contained example of
miscompile, though; should be done by tonight...

2017-06-25 11:14:08

by Al Viro

[permalink] [raw]
Subject: Re: Regression in kernel 4.12-rc1 for Powerpc 32 - bisected to commit 3448890c32c3

On Sun, Jun 25, 2017 at 10:53:58AM +0100, Al Viro wrote:
> On Sat, Jun 24, 2017 at 12:29:23PM -0500, Larry Finger wrote:
>
> > I made a break through. If I turn off inline copy to/from users for 32-bit
> > ppc with the following patch, then the system boots:
>
> OK... So it's 4.6.3 miscompiling something - it is hardware-independent,
> reproduced in qemu. I'd like to get more self-contained example of
> miscompile, though; should be done by tonight...

OK, it's the call in rw_copy_check_uvector(); with INLINE_COPY_FROM_USER
it's miscompiled by 4.6.3. I hadn't looked through the generated code
yet; will do that after I grab some sleep.

2017-06-25 20:53:30

by Al Viro

[permalink] [raw]
Subject: gcc 4.6.3 miscompile on ppc32 (was Re: Regression in kernel 4.12-rc1 for Powerpc 32 - bisected to commit 3448890c32c3)

On Sun, Jun 25, 2017 at 12:14:04PM +0100, Al Viro wrote:
> On Sun, Jun 25, 2017 at 10:53:58AM +0100, Al Viro wrote:
> > On Sat, Jun 24, 2017 at 12:29:23PM -0500, Larry Finger wrote:
> >
> > > I made a break through. If I turn off inline copy to/from users for 32-bit
> > > ppc with the following patch, then the system boots:
> >
> > OK... So it's 4.6.3 miscompiling something - it is hardware-independent,
> > reproduced in qemu. I'd like to get more self-contained example of
> > miscompile, though; should be done by tonight...
>
> OK, it's the call in rw_copy_check_uvector(); with INLINE_COPY_FROM_USER
> it's miscompiled by 4.6.3. I hadn't looked through the generated code
> yet; will do that after I grab some sleep.

Confirmed. It manages to bugger the loop immediately after the (successful)
copying of iovec array in rw_copy_check_uvector(); both with and without
INLINE_COPY_FROM_USER it has (just before the call of copy_from_user()) r27
set to nr_segs * sizeof(struct iovec). The call is made, we check that it
has succeeded and that's when it hits the fan: without INLINE_COPY_FROM_USER
we have (interleaved with unrelated insns)
addi 27,27,-8
srwi 27,27,3
addi 27,27,1
mtctr 27
Weird, but manages to pass nr_segs to mtctr. _With_ INLINE_COPY_FROM_USER we
get this:
lis 9,0x2000
mtctr 9
In other words, the loop will try to go through 8192 iterations. No idea where
that number has come from, but it sure as hell is wrong. That's where those
-EINVAL, etc. are coming from - we run into something negative in iov[seg].len,
after having run out of on-stack iovec array.

Assembler generated out of rw_copy_check_uvector() with and without
INLINE_COPY_FROM_USER is attached; it's a definite miscompile. Neither 4.4.5
nor 6.3.0 use mtctr/bdnz for that loop.

The bottom line is, ppc cross-toolchain on kernel.org happens to be
the version that miscompiles rw_copy_check_uvector() with INLINE_COPY_FROM_USER
and hell knows what else. Said that, I would rather have ppc32 drop the
INLINE_COPY_{TO,FROM}_USER anyway; that won't fix any other places where
the same 4.6.3 bug hits, but I seriously suspect that it will end up being
faster even on non^Wless buggy gcc versions. Could powerpc folks check
what does removing those two defines from arch/powerpc/include/asm/uaccess.h
do to performance? If there's no slowdown, I would strongly recommend just
removing those as in the patch Larry has posted upthread.

Fixing whatever it is in gcc 4.6.3 that triggers that behaviour is
IMO pointless - it might make sense to switch kernel.org cross-toolchain to
something more recent, but that's it.


Attachments:
(No filename) (2.60 kB)
inlined.s (3.26 kB)
rw_copy_check_uvector() with INLINE_COPY_FROM_USER
uninlined.s (2.80 kB)
the same without INLINE_COPY_FROM_USER
Download all attachments

2017-06-25 21:44:35

by Segher Boessenkool

[permalink] [raw]
Subject: Re: gcc 4.6.3 miscompile on ppc32 (was Re: Regression in kernel 4.12-rc1 for Powerpc 32 - bisected to commit 3448890c32c3)

On Sun, Jun 25, 2017 at 09:53:24PM +0100, Al Viro wrote:
> Confirmed. It manages to bugger the loop immediately after the (successful)
> copying of iovec array in rw_copy_check_uvector(); both with and without
> INLINE_COPY_FROM_USER it has (just before the call of copy_from_user()) r27
> set to nr_segs * sizeof(struct iovec). The call is made, we check that it
> has succeeded and that's when it hits the fan: without INLINE_COPY_FROM_USER
> we have (interleaved with unrelated insns)
> addi 27,27,-8
> srwi 27,27,3
> addi 27,27,1
> mtctr 27
> Weird, but manages to pass nr_segs to mtctr.

This weirdosity is https://gcc.gnu.org/PR67288 . Those three instructions
are not the same as just srwi 27,27,3 in case r27 is 0; GCC does not
figure out this cannot happen here.

> _With_ INLINE_COPY_FROM_USER we
> get this:
> lis 9,0x2000
> mtctr 9
> In other words, the loop will try to go through 8192 iterations. No idea where
> that number has come from, but it sure as hell is wrong.

8192*65535, even. This is as if r27 was 0 always.

Do you have a short stand-alone testcase? 4.6 is ancient, of course, but
the actual problem may still exist in more recent compilers (if it _is_
a compiler problem; if it's not, you *really* want to know :-) )


Segher

2017-06-25 22:21:14

by Al Viro

[permalink] [raw]
Subject: Re: gcc 4.6.3 miscompile on ppc32 (was Re: Regression in kernel 4.12-rc1 for Powerpc 32 - bisected to commit 3448890c32c3)

On Sun, Jun 25, 2017 at 04:44:09PM -0500, Segher Boessenkool wrote:

> Do you have a short stand-alone testcase? 4.6 is ancient, of course, but
> the actual problem may still exist in more recent compilers (if it _is_
> a compiler problem; if it's not, you *really* want to know :-) )

Enjoy. At least 6.3 doesn't step into that. Look for mtctr in the resulting
asm...

cat <<'EOF' >a.c
struct iovec
{
void *iov_base;
unsigned iov_len;
};

unsigned long v;

extern void * barf(void *,int,unsigned);

extern unsigned long bar(void *to, const void *from, unsigned long size);

static inline unsigned long __bar(void *to, const void *from, unsigned long n)
{
unsigned long res = n;
if (__builtin_expect(!!(((void)0, (((( unsigned long)(from)) <= v) && ((((n)) == 0) || ((((n)) - 1) <= (v - (( unsigned long)(from)))))))), 1))
res = bar(to, from, n);
if (res)
barf(to + (n - res), 0, res);
return res;
}

int foo(int type, const struct iovec * uvector,
unsigned long nr_segs, unsigned long fast_segs,
struct iovec *iov,
struct iovec **ret_pointer)
{
unsigned long seg;
int ret;
if (nr_segs == 0) {
ret = 0;
goto out;
}
if (nr_segs > 1024) {
ret = -22;
goto out;
}
if (__bar(iov, uvector, nr_segs*sizeof(*uvector))) {
ret = -14;
goto out;
}
ret = 0;
for (seg = 0; seg < nr_segs; seg++) {
void *buf = iov[seg].iov_base;
int len = (int)iov[seg].iov_len;
if (len < 0) {
ret = -22;
goto out;
}
if (type >= 0
&& __builtin_expect(!!(!((void)0, (((( unsigned long)(buf)) <= v) && ((((len)) == 0) || ((((len)) - 1) <= (v - (( unsigned long)(buf)))))))), 0)) {
ret = -14;
goto out;
}
ret += len;
}
out:
*ret_pointer = iov;
return ret;
}
EOF
powerpc-linux-gcc -m32 -fno-strict-aliasing -fno-common -std=gnu89 -fno-PIE -msoft-float -pipe -ffixed-r2 -mmultiple -mno-altivec -mno-vsx -mno-spe -mspe=no -funit-at-a-time -fno-dwarf2-cfi-asm -mno-string -mcpu=powerpc -Wa,-maltivec -mbig-endian -fno-delete-null-pointer-checks -Os -fno-stack-protector -Wno-unused-but-set-variable -fomit-frame-pointer -fno-var-tracking-assignments -femit-struct-debug-baseonly -fno-var-tracking -fno-strict-overflow -fconserve-stack -fverbose-asm -S a.c

2017-06-26 13:37:16

by Michael Ellerman

[permalink] [raw]
Subject: Re: gcc 4.6.3 miscompile on ppc32 (was Re: Regression in kernel 4.12-rc1 for Powerpc 32 - bisected to commit 3448890c32c3)

Al Viro <[email protected]> writes:

> On Sun, Jun 25, 2017 at 04:44:09PM -0500, Segher Boessenkool wrote:
>
>> Do you have a short stand-alone testcase? 4.6 is ancient, of course, but
>> the actual problem may still exist in more recent compilers (if it _is_
>> a compiler problem; if it's not, you *really* want to know :-) )
>
> Enjoy. At least 6.3 doesn't step into that. Look for mtctr in the resulting
> asm...
>
> cat <<'EOF' >a.c
...

I pointed creduce at that and got the version below, which I'm pretty
sure still exhibits the weird mtctr behaviour.

cheers

# cat input.c
struct {
void *iov_base;
unsigned iov_len;
} * c;
long v;
void *a;
int b;
unsigned bar();
foo(unsigned p1) {
unsigned d, e = p1;
if (p1 == 0)
goto out;
if (p1 > 4)
goto out;
if (__builtin_expect(!!(0, v && a), 1))
e = bar();
if (e)
barf(e);
if (e)
goto out;
d = 0;
for (; d < p1; d++) {
int f = c[d].iov_len;
if (__builtin_expect(c[d].iov_base && f, 0))
b = 4;
}
out:;
}

$ cat output.s
.file "input.c"

# rs6000/powerpc options: -mcpu=powerpc -msdata=data -G 8
# GNU C (GCC) version 4.6.3 (powerpc64-linux)
# compiled by GNU C version 4.3.2, GMP version 4.3.2, MPFR version 2.4.2, MPC version 0.8.2
# ...

# Compiler executable checksum: 4b51a6b899110d06c9e3310ac66ad26c

.section ".text"
.align 2
.globl foo
.type foo, @function
foo:
cmpwi 0,3,0 # tmp169, p1
stwu 1,-16(1) #,,
mflr 0 #,
stw 0,20(1) #,
beq- 0,.L9 #
cmplwi 7,3,4 #, tmp170, p1
bgt- 7,.L9 #
lis 9,v@ha # tmp172,
lwz 0,v@l(9) # v, v
cmpwi 7,0,0 #, tmp174, v
beq- 7,.L3 #
lis 9,a@ha # tmp176,
lwz 0,a@l(9) # a, a
cmpwi 7,0,0 #, tmp178, a
beq- 7,.L3 #
bl bar #
cmpwi 0,3,0 # tmp179, e
beq+ 0,.L4 #
.L3:
bl barf #
b .L9 #
.L4:
lis 8,0x2000 #,
lis 9,c@ha # tmp181,
mtctr 8 # tmp192,
lwz 11,c@l(9) # c, c.3
lis 10,b@ha # tmp190,
li 9,0 # ivtmp.12,
li 0,4 # tmp191,
.L6:
lwzx 7,11,9 # MEM[base: c.3_14, index: ivtmp.12_25, offset: 0B], MEM[base: c.3_14, index: ivtmp.12_25, offset: 0B]
add 8,11,9 # tmp182, c.3, ivtmp.12
lwz 8,4(8) # MEM[base: D.1310_21, offset: 4B], D.1287
cmpwi 7,7,0 #, tmp184, MEM[base: c.3_14, index: ivtmp.12_25, offset: 0B]
beq+ 7,.L5 #
cmpwi 7,8,0 #, tmp185, D.1287
beq+ 7,.L5 #
stw 0,b@l(10) # b, tmp191
.L5:
addi 9,9,8 # ivtmp.12, ivtmp.12,
bdnz .L6 #
.L2:
.L9:
lwz 0,20(1) #,
addi 1,1,16 #,,
mtlr 0 #,
blr #
.size foo,.-foo
.globl b
.globl a
.globl v
.globl c
.section .sbss,"aw",@nobits
.align 2
.type b, @object
.size b, 4
b:
.zero 4
.type a, @object
.size a, 4
a:
.zero 4
.type v, @object
.size v, 4
v:
.zero 4
.type c, @object
.size c, 4
c:
.zero 4
.ident "GCC: (GNU) 4.6.3"
.section .note.GNU-stack,"",@progbits

2017-06-26 13:40:24

by Michael Ellerman

[permalink] [raw]
Subject: Re: Regression in kernel 4.12-rc1 for Powerpc 32 - bisected to commit 3448890c32c3

Larry Finger <[email protected]> writes:

> On 06/23/2017 03:29 PM, Al Viro wrote:
>> On Fri, Jun 23, 2017 at 01:49:16PM -0500, Larry Finger wrote:
>>
>>>> BTW, could you try to check what happens if you kill the
>>>> if (__builtin_constant_p(n) && (n <= 8))
>>>> bits in raw_copy_{to,from}_user()? The usefulness of those (in __copy_from_user()
>>>> originally) had always been dubious and the things are simpler without them.
>>>> If _that_ turns out to cure breakage, I would be very surprised, though.
>>>>
>>> Sorry I was gone so long. Installing jessie on this box resulted in a crash
>>> on boot. Lubuntu 14.04 yielded a desktop with a functioning cursor, but
>>> nothing else. Finally, Ubuntu 12.04 resulted in a working system. I hate
>>> Unity, but I guess I'm stuck for now.
>>
>> Ho-hum... Jessie is 3.16, so whatever is crashing there, it's something
>> different... Ubuntu 12.04 is what, 3.2?
>>
>>> I know how easy it is to screw up a long bisection by booting the wrong
>>> kernel. To help that problem and to work around the yaconf/yboot nonsense on
>>> the MAC, my /etc/yaconf has always had generic kernel stanzas with only
>>> default, old, and original kernels mentioned. From there I use a local
>>> script to finish a kernel installation by moving the default links to the
>>> old ones and creating the new default links pointing to the current kernel.
>>> With those long-tested scripts, I'm sure that I am booting the one I want.
>>>
>>> With the new installation, kernel 4.12-rc6 failed, as did 3448890c with the
>>> backported 46f401c4 added.
>>>
>>> Replacing "if (__builtin_constant_p(n) && (n <= 8))" with "if (0)" had no effect.
>>
>> OK, that simplifies things a bit. Just to make sure we are on the same page:
>>
>> * f2ed8bebee69 + cherry-pick of 46f401c4 boots (Ubuntu 12.04 userland)
>> * 3448890c32c3 + cherry-pick of 46f401c4 fails (Ubuntu 12.04 userland), ditto
>> with removal of constant-size bits in raw_copy_..._user(). Failure appears
>> to be on udev getting EFAULT on some syscalls.
>> * straight Ubuntu 12.04 works
>> * jessie crashes on boot.
>
> I made a break through. If I turn off inline copy to/from users for 32-bit ppc
> with the following patch, then the system boots:
>
> diff --git a/arch/powerpc/include/asm/uaccess.h b/arch/powerpc/include/asm/uaccess.h
> index 5c0d8a8cdae5..1e6a8723f497 100644
> --- a/arch/powerpc/include/asm/uaccess.h
> +++ b/arch/powerpc/include/asm/uaccess.h
> @@ -267,12 +267,7 @@ do {
> \
> extern unsigned long __copy_tofrom_user(void __user *to,
> const void __user *from, unsigned long size);
>
> -#ifndef __powerpc64__
> -
> -#define INLINE_COPY_FROM_USER
> -#define INLINE_COPY_TO_USE
> -
> -#else /* __powerpc64__ */
> +#ifdef __powerpc64__
>
> static inline unsigned long
> raw_copy_in_user(void __user *to, const void __user *from, unsigned long n)

Thanks for debugging this.

I just sent a fix based on the above. Let me know if it doesn't work for
you.

cheers