2012-10-13 00:53:37

by Al Viro

[permalink] [raw]
Subject: [git pull] signals pile 3

The last bits of infrastructure for kernel_thread() et.al., with alpha/arm/x86
use of those. Plus sanitizing the asm glue and do_notify_resume() on alpha,
fixing the "disabled irq while running task_work stuff" breakage there.

At that point the rest of kernel_thread/kernel_execve/sys_execve work can
be done independently for different architectures. The only pending bits
that do depend on having all architectures converted are restrictred to
fs/* and kernel/* - that'll obviously have to wait for the next cycle.
I thought we'd have to wait for all of them done before we start eliminating
the longjump-style insanity in kernel_execve(), but it turned out there's
a very simple way to do that without flagday-style changes.

Please, pull from
git://git.kernel.org/pub/scm/linux/kernel/git/viro/signal for-linus

Shortlog:
Al Viro (12):
alpha: simplify TIF_NEED_RESCHED handling
alpha: take SIGPENDING/NOTIFY_RESUME loop into signal.c
alpha: don't bother passing switch_stack separately from regs
alpha: get rid of switch_stack argument of do_work_pending()
don't bother with kernel_thread/kernel_execve for launching linuxrc
ppc: eeh_event should just use kthread_run()
make sure that we always have a return path from kernel_execve()
make sure that kernel_thread() callbacks call do_exit() themselves
infrastructure for saner ret_from_kernel_thread semantics
x86, um: convert to saner kernel_execve() semantics
arm: switch to saner kernel_execve() semantics
alpha: switch to saner kernel_execve() semantics

Diffstat:
arch/Kconfig | 3 +
arch/alpha/Kconfig | 1 +
arch/alpha/include/asm/unistd.h | 1 -
arch/alpha/kernel/entry.S | 87 +++++++++-------------------
arch/alpha/kernel/signal.c | 48 +++++++++------
arch/arm/Kconfig | 1 +
arch/arm/include/asm/unistd.h | 1 -
arch/arm/kernel/entry-common.S | 29 +--------
arch/arm/kernel/process.c | 5 +-
arch/powerpc/platforms/pseries/eeh_event.c | 5 +-
arch/um/include/asm/processor-generic.h | 2 -
arch/um/include/shared/os.h | 1 -
arch/um/kernel/exec.c | 5 --
arch/um/kernel/process.c | 10 +--
arch/um/os-Linux/process.c | 13 ----
arch/x86/Kconfig | 1 +
arch/x86/include/asm/unistd.h | 1 -
arch/x86/kernel/entry_32.S | 31 ++++------
arch/x86/kernel/entry_64.S | 24 +------
arch/x86/um/Kconfig | 1 +
include/linux/syscalls.h | 8 +++
init/do_mounts_initrd.c | 41 +++++--------
init/main.c | 33 ++++++-----
kernel/kmod.c | 7 ++-
kernel/kthread.c | 1 +
25 files changed, 137 insertions(+), 223 deletions(-)


2012-10-14 15:35:34

by Daniel Mack

[permalink] [raw]
Subject: Re: [git pull] signals pile 3

Hi Al,

On 13.10.2012 02:53, Al Viro wrote:
> The last bits of infrastructure for kernel_thread() et.al., with alpha/arm/x86
> use of those. Plus sanitizing the asm glue and do_notify_resume() on alpha,
> fixing the "disabled irq while running task_work stuff" breakage there.
>
> At that point the rest of kernel_thread/kernel_execve/sys_execve work can
> be done independently for different architectures. The only pending bits
> that do depend on having all architectures converted are restrictred to
> fs/* and kernel/* - that'll obviously have to wait for the next cycle.
> I thought we'd have to wait for all of them done before we start eliminating
> the longjump-style insanity in kernel_execve(), but it turned out there's
> a very simple way to do that without flagday-style changes.
>
> Please, pull from
> git://git.kernel.org/pub/scm/linux/kernel/git/viro/signal for-linus
>
> Shortlog:
> Al Viro (12):

[...]

> arm: switch to saner kernel_execve() semantics

[...]

> Diffstat:

[...]

> arch/arm/include/asm/unistd.h | 1 -
> arch/arm/kernel/entry-common.S | 29 +--------
> arch/arm/kernel/process.c | 5 +-

I rebased my ARM development branch and figured that your patch 9fff2fa
("arm: switch to saner kernel_execve() semantics") breaks the boot on my
board right after init is invoked via NFS:

[ 4.682072] VFS: Mounted root (nfs filesystem) on device 0:12.
[ 4.690744] devtmpfs: mounted
[ 4.694395] Freeing init memory: 172K
[ 5.291417] Internal error: Oops - undefined instruction: 0 [#1] SMP
THUMB2
[ 5.298734] Modules linked in:
[ 5.301952] CPU: 0 Not tainted (3.6.0-11053-g56c8535 #128)
[ 5.308071] PC is at cpsw_probe+0x422/0x9ac
[ 5.312459] LR is at trace_hardirqs_on_caller+0x8f/0xfc
[ 5.317934] pc : [<c03493de>] lr : [<c005e81f>] psr: 60000113
[ 5.317934] sp : cf055fb0 ip : 00000000 fp : 00000000
[ 5.329944] r10: 00000000 r9 : 00000000 r8 : 00000000
[ 5.335413] r7 : 00000000 r6 : 00000000 r5 : c034458d r4 : 00000000
[ 5.342244] r3 : cf057a40 r2 : 00000000 r1 : 00000001 r0 : 00000000
[ 5.349078] Flags: nZCv IRQs on FIQs on Mode SVC_32 ISA ARM
Segment user
[ 5.356546] Control: 50c5387d Table: 8f434019 DAC: 00000015
[ 5.362562] Process init (pid: 1, stack limit = 0xcf054240)
[ 5.368395] Stack: (0xcf055fb0 to 0xcf056000)
[ 5.372961] 5fa0: 00000001
00000000 00000000 00000000
[ 5.381525] 5fc0: cf055fb0 c000d1a8 00000000 00000000 00000000
00000000 00000000 00000000
[ 5.390091] 5fe0: 00000000 bee83f10 00000000 b6fdedd0 00000010
00000000 aaaabfaf a8babbaa
[ 5.398664] Code: 2206a010 718ef508 0184f8da f8b1f65d (3070f8d8)
[ 5.405049] ---[ end trace f92e44d0ab15d037 ]---
[ 5.410424] Kernel panic - not syncing: Attempted to kill init!
exitcode=0x0000000b


Looking at the patch, I see it does two things:

a) kill the special treatment formerly done by ret_from_kernel_thread
b) switch over to generic execve for ARM

I know too little about the internals here, but reverting the latter
part fixes the boot for me. Find attached the patch I used locally for
that, on top of Linus' current merge tree (4d7127d).

FWIW, the config I'm using is here: http://pastebin.com/JPwAYmrD

I can test patches if you have any for me.


Thanks,
Daniel


Attachments:
arm-execve-revert.diff (1.48 kB)

2012-10-14 16:40:24

by Al Viro

[permalink] [raw]
Subject: [revert request for commit 9fff2fa] Re: [git pull] signals pile 3

On Sun, Oct 14, 2012 at 05:35:23PM +0200, Daniel Mack wrote:

> I rebased my ARM development branch and figured that your patch 9fff2fa
> ("arm: switch to saner kernel_execve() semantics") breaks the boot on my
> board right after init is invoked via NFS:

OK, revert it is, then. Nothing in the tree has dependencies on that sucker
and while it survives testing here, it's obviously not ready for mainline.
So, with abject apologies to everyone involved, please revert.

2012-10-14 17:26:47

by Al Viro

[permalink] [raw]
Subject: Re: [revert request for commit 9fff2fa] Re: [git pull] signals pile 3

On Sun, Oct 14, 2012 at 06:44:12PM +0200, Daniel Mack wrote:
> On Oct 14, 2012 6:40 PM, "Al Viro" <[email protected]> wrote:
> >
> > On Sun, Oct 14, 2012 at 05:35:23PM +0200, Daniel Mack wrote:
> >
> > > I rebased my ARM development branch and figured that your patch 9fff2fa
> > > ("arm: switch to saner kernel_execve() semantics") breaks the boot on my
> > > board right after init is invoked via NFS:
> >
> > OK, revert it is, then. Nothing in the tree has dependencies on that
> sucker
> > and while it survives testing here, it's obviously not ready for mainline.
> > So, with abject apologies to everyone involved, please revert.
>
> Reverting it is not straight forward, and half of this patch doesn't seem
> to cause issues.
>
> I can resend my patch with an S-o-b if you want me to.

Um... That's _really_ interesting. First of all, revert is absolutely
straightforward; the only change in Kconfig is "remove the damn select"
and it's not hard to resolve. But I actually wonder what the hell is
going on with that breakage - the *only* thing your revert changes is
that instead of letting the kernel_thread callback return all the way
to returning 0 to ret_from_kernel_thread() on do_execve() success you
have it do ret_from_kernel_execve() instead. Hmm...

Could you try to print current_pt_regs()->ARM_r0 in kernel_execve() before
calling ret_from_kernel_execve() with your patch applied? If that somehow
got non-zero, we'd see trouble, all right, but I don't see any places where
it could.

Wait a minute... I think I see what might be going on, but I don't
understand it at all. Look: arm start_thread() is
#define start_thread(regs,pc,sp) \
({ \
unsigned long *stack = (unsigned long *)sp; \
memset(regs->uregs, 0, sizeof(regs->uregs)); \
if (current->personality & ADDR_LIMIT_32BIT) \
regs->ARM_cpsr = USR_MODE; \
else \
regs->ARM_cpsr = USR26_MODE; \
if (elf_hwcap & HWCAP_THUMB && pc & 1) \
regs->ARM_cpsr |= PSR_T_BIT; \
regs->ARM_cpsr |= PSR_ENDSTATE; \
regs->ARM_pc = pc & ~1; /* pc */ \
regs->ARM_sp = sp; /* sp */ \
regs->ARM_r2 = stack[2]; /* r2 (envp) */ \
regs->ARM_r1 = stack[1]; /* r1 (argv) */ \
regs->ARM_r0 = stack[0]; /* r0 (argc) */ \
nommu_start_thread(regs); \
})
and the last 3 make no sense whatsoever. Note that on normal execve() we'll
be going through the syscall return, so the userland will see 0 in there,
no matter what do we do here. Theoretically, it might've been done for
ptrace sake (it will be able to observe the values in those registers before
the tracee reaches userland), but there's another oddity involved - "stack"
is a userland pointer here. Granted, it's been recently written to, so
we are not likely to hit a pagefault here, but... What happens if we are
under enough memory pressure to swap those pages out? PF in the kernel
mode with no exception table entries for those insns?

2012-10-14 17:55:46

by Al Viro

[permalink] [raw]
Subject: Re: [revert request for commit 9fff2fa] Re: [git pull] signals pile 3

On Sun, Oct 14, 2012 at 06:26:40PM +0100, Al Viro wrote:
> On Sun, Oct 14, 2012 at 06:44:12PM +0200, Daniel Mack wrote:
> > On Oct 14, 2012 6:40 PM, "Al Viro" <[email protected]> wrote:
> > >
> > > On Sun, Oct 14, 2012 at 05:35:23PM +0200, Daniel Mack wrote:
> > >
> > > > I rebased my ARM development branch and figured that your patch 9fff2fa
> > > > ("arm: switch to saner kernel_execve() semantics") breaks the boot on my
> > > > board right after init is invoked via NFS:
> > >
> > > OK, revert it is, then. Nothing in the tree has dependencies on that
> > sucker
> > > and while it survives testing here, it's obviously not ready for mainline.
> > > So, with abject apologies to everyone involved, please revert.
> >
> > Reverting it is not straight forward, and half of this patch doesn't seem
> > to cause issues.
> >
> > I can resend my patch with an S-o-b if you want me to.
>
> Um... That's _really_ interesting. First of all, revert is absolutely
> straightforward; the only change in Kconfig is "remove the damn select"
> and it's not hard to resolve. But I actually wonder what the hell is
> going on with that breakage - the *only* thing your revert changes is
> that instead of letting the kernel_thread callback return all the way
> to returning 0 to ret_from_kernel_thread() on do_execve() success you
> have it do ret_from_kernel_execve() instead. Hmm...
>
> Could you try to print current_pt_regs()->ARM_r0 in kernel_execve() before
> calling ret_from_kernel_execve() with your patch applied? If that somehow
> got non-zero, we'd see trouble, all right, but I don't see any places where
> it could.
>
> Wait a minute... I think I see what might be going on, but I don't
> understand it at all. Look: arm start_thread() is
> #define start_thread(regs,pc,sp) \
> ({ \
> unsigned long *stack = (unsigned long *)sp; \
> memset(regs->uregs, 0, sizeof(regs->uregs)); \
> if (current->personality & ADDR_LIMIT_32BIT) \
> regs->ARM_cpsr = USR_MODE; \
> else \
> regs->ARM_cpsr = USR26_MODE; \
> if (elf_hwcap & HWCAP_THUMB && pc & 1) \
> regs->ARM_cpsr |= PSR_T_BIT; \
> regs->ARM_cpsr |= PSR_ENDSTATE; \
> regs->ARM_pc = pc & ~1; /* pc */ \
> regs->ARM_sp = sp; /* sp */ \
> regs->ARM_r2 = stack[2]; /* r2 (envp) */ \
> regs->ARM_r1 = stack[1]; /* r1 (argv) */ \
> regs->ARM_r0 = stack[0]; /* r0 (argc) */ \
> nommu_start_thread(regs); \
> })
> and the last 3 make no sense whatsoever. Note that on normal execve() we'll
> be going through the syscall return, so the userland will see 0 in there,
> no matter what do we do here. Theoretically, it might've been done for
> ptrace sake (it will be able to observe the values in those registers before
> the tracee reaches userland), but there's another oddity involved - "stack"
> is a userland pointer here. Granted, it's been recently written to, so
> we are not likely to hit a pagefault here, but... What happens if we are
> under enough memory pressure to swap those pages out? PF in the kernel
> mode with no exception table entries for those insns?

FWIW, if you don't mind an experiment, try to take mainline (with that
commit not reverted) and add
strne r0, [sp, #S_R0]
right before
get_thread_info tsk
in ret_from_fork(). And see if that changes behaviour.

2012-10-14 18:22:03

by Daniel Mack

[permalink] [raw]
Subject: Re: [revert request for commit 9fff2fa] Re: [git pull] signals pile 3

On 14.10.2012 19:55, Al Viro wrote:
> On Sun, Oct 14, 2012 at 06:26:40PM +0100, Al Viro wrote:
>> On Sun, Oct 14, 2012 at 06:44:12PM +0200, Daniel Mack wrote:
>>> On Oct 14, 2012 6:40 PM, "Al Viro" <[email protected]> wrote:
>>>>
>>>> On Sun, Oct 14, 2012 at 05:35:23PM +0200, Daniel Mack wrote:
>>>>
>>>>> I rebased my ARM development branch and figured that your patch 9fff2fa
>>>>> ("arm: switch to saner kernel_execve() semantics") breaks the boot on my
>>>>> board right after init is invoked via NFS:
>>>>
>>>> OK, revert it is, then. Nothing in the tree has dependencies on that
>>> sucker
>>>> and while it survives testing here, it's obviously not ready for mainline.
>>>> So, with abject apologies to everyone involved, please revert.
>>>
>>> Reverting it is not straight forward, and half of this patch doesn't seem
>>> to cause issues.
>>>
>>> I can resend my patch with an S-o-b if you want me to.
>>
>> Um... That's _really_ interesting. First of all, revert is absolutely
>> straightforward; the only change in Kconfig is "remove the damn select"
>> and it's not hard to resolve. But I actually wonder what the hell is
>> going on with that breakage - the *only* thing your revert changes is
>> that instead of letting the kernel_thread callback return all the way
>> to returning 0 to ret_from_kernel_thread() on do_execve() success you
>> have it do ret_from_kernel_execve() instead. Hmm...
>>
>> Could you try to print current_pt_regs()->ARM_r0 in kernel_execve() before
>> calling ret_from_kernel_execve() with your patch applied? If that somehow
>> got non-zero, we'd see trouble, all right, but I don't see any places where
>> it could.
>>
>> Wait a minute... I think I see what might be going on, but I don't
>> understand it at all. Look: arm start_thread() is
>> #define start_thread(regs,pc,sp) \
>> ({ \
>> unsigned long *stack = (unsigned long *)sp; \
>> memset(regs->uregs, 0, sizeof(regs->uregs)); \
>> if (current->personality & ADDR_LIMIT_32BIT) \
>> regs->ARM_cpsr = USR_MODE; \
>> else \
>> regs->ARM_cpsr = USR26_MODE; \
>> if (elf_hwcap & HWCAP_THUMB && pc & 1) \
>> regs->ARM_cpsr |= PSR_T_BIT; \
>> regs->ARM_cpsr |= PSR_ENDSTATE; \
>> regs->ARM_pc = pc & ~1; /* pc */ \
>> regs->ARM_sp = sp; /* sp */ \
>> regs->ARM_r2 = stack[2]; /* r2 (envp) */ \
>> regs->ARM_r1 = stack[1]; /* r1 (argv) */ \
>> regs->ARM_r0 = stack[0]; /* r0 (argc) */ \
>> nommu_start_thread(regs); \
>> })
>> and the last 3 make no sense whatsoever. Note that on normal execve() we'll
>> be going through the syscall return, so the userland will see 0 in there,
>> no matter what do we do here. Theoretically, it might've been done for
>> ptrace sake (it will be able to observe the values in those registers before
>> the tracee reaches userland), but there's another oddity involved - "stack"
>> is a userland pointer here. Granted, it's been recently written to, so
>> we are not likely to hit a pagefault here, but... What happens if we are
>> under enough memory pressure to swap those pages out? PF in the kernel
>> mode with no exception table entries for those insns?
>
> FWIW, if you don't mind an experiment, try to take mainline (with that
> commit not reverted) and add
> strne r0, [sp, #S_R0]
> right before
> get_thread_info tsk
> in ret_from_fork(). And see if that changes behaviour.
>

I don't mind experiments at all :)

However, with that extra line in place as described, I'm still getting
the Oops below. If you want me to test anything else, please let me know.


[ 4.683182] VFS: Mounted root (nfs filesystem) on device 0:12.
[ 4.742007] devtmpfs: mounted
[ 4.745746] Freeing init memory: 172K
[ 5.038724] Internal error: Oops - undefined instruction: 0 [#1] SMP
THUMB2
[ 5.046044] Modules linked in:
[ 5.049263] CPU: 0 Not tainted (3.6.0-11053-g56c8535-dirty #136)
[ 5.055925] PC is at cpsw_probe+0x422/0x9ac
[ 5.060314] LR is at trace_hardirqs_on_caller+0x8f/0xfc
[ 5.065790] pc : [<c03493de>] lr : [<c005e81f>] psr: 60000113
[ 5.065790] sp : cf055fb0 ip : 00000000 fp : 00000000
[ 5.077800] r10: 00000000 r9 : 00000000 r8 : 00000000
[ 5.083270] r7 : 00000000 r6 : 00000000 r5 : c034458d r4 : 00000000
[ 5.090101] r3 : cf057a40 r2 : 00000000 r1 : 00000001 r0 : 00000000
[ 5.096936] Flags: nZCv IRQs on FIQs on Mode SVC_32 ISA ARM
Segment user
[ 5.104406] Control: 50c5387d Table: 8f434019 DAC: 00000015
[ 5.110422] Process init (pid: 1, stack limit = 0xcf054240)
[ 5.116257] Stack: (0xcf055fb0 to 0xcf056000)
[ 5.120824] 5fa0: 00000001
00000000 00000000 00000000
[ 5.129390] 5fc0: cf055fb0 c000d1a8 00000000 00000000 00000000
00000000 00000000 00000000
[ 5.137957] 5fe0: 00000000 becedf10 00000000 b6f81dd0 00000010
00000000 aaaabfaf a8babbaa
[ 5.146529] Code: 2206a010 718ef508 0184f8da f8b1f65d (3070f8d8)
[ 5.152915] ---[ end trace 7362bbe8e73e6b07 ]---
[ 5.158324] Kernel panic - not syncing: Attempted to kill init!
exitcode=0x0000000b
[ 5.158324]


2012-10-14 19:06:50

by Al Viro

[permalink] [raw]
Subject: Re: [revert request for commit 9fff2fa] Re: [git pull] signals pile 3

On Sun, Oct 14, 2012 at 08:21:53PM +0200, Daniel Mack wrote:
> On 14.10.2012 19:55, Al Viro wrote:
> > On Sun, Oct 14, 2012 at 06:26:40PM +0100, Al Viro wrote:
> >> On Sun, Oct 14, 2012 at 06:44:12PM +0200, Daniel Mack wrote:
> >>> On Oct 14, 2012 6:40 PM, "Al Viro" <[email protected]> wrote:
> >>>>
> >>>> On Sun, Oct 14, 2012 at 05:35:23PM +0200, Daniel Mack wrote:
> >>>>
> >>>>> I rebased my ARM development branch and figured that your patch 9fff2fa
> >>>>> ("arm: switch to saner kernel_execve() semantics") breaks the boot on my
> >>>>> board right after init is invoked via NFS:
> >>>>
> >>>> OK, revert it is, then. Nothing in the tree has dependencies on that
> >>> sucker
> >>>> and while it survives testing here, it's obviously not ready for mainline.
> >>>> So, with abject apologies to everyone involved, please revert.
> >>>
> >>> Reverting it is not straight forward, and half of this patch doesn't seem
> >>> to cause issues.
> >>>
> >>> I can resend my patch with an S-o-b if you want me to.
> >>
> >> Um... That's _really_ interesting. First of all, revert is absolutely
> >> straightforward; the only change in Kconfig is "remove the damn select"
> >> and it's not hard to resolve. But I actually wonder what the hell is
> >> going on with that breakage - the *only* thing your revert changes is
> >> that instead of letting the kernel_thread callback return all the way
> >> to returning 0 to ret_from_kernel_thread() on do_execve() success you
> >> have it do ret_from_kernel_execve() instead. Hmm...
> >>
> >> Could you try to print current_pt_regs()->ARM_r0 in kernel_execve() before
> >> calling ret_from_kernel_execve() with your patch applied? If that somehow
> >> got non-zero, we'd see trouble, all right, but I don't see any places where
> >> it could.
> >>
> >> Wait a minute... I think I see what might be going on, but I don't
> >> understand it at all. Look: arm start_thread() is
> >> #define start_thread(regs,pc,sp) \
> >> ({ \
> >> unsigned long *stack = (unsigned long *)sp; \
> >> memset(regs->uregs, 0, sizeof(regs->uregs)); \
> >> if (current->personality & ADDR_LIMIT_32BIT) \
> >> regs->ARM_cpsr = USR_MODE; \
> >> else \
> >> regs->ARM_cpsr = USR26_MODE; \
> >> if (elf_hwcap & HWCAP_THUMB && pc & 1) \
> >> regs->ARM_cpsr |= PSR_T_BIT; \
> >> regs->ARM_cpsr |= PSR_ENDSTATE; \
> >> regs->ARM_pc = pc & ~1; /* pc */ \
> >> regs->ARM_sp = sp; /* sp */ \
> >> regs->ARM_r2 = stack[2]; /* r2 (envp) */ \
> >> regs->ARM_r1 = stack[1]; /* r1 (argv) */ \
> >> regs->ARM_r0 = stack[0]; /* r0 (argc) */ \
> >> nommu_start_thread(regs); \
> >> })
> >> and the last 3 make no sense whatsoever. Note that on normal execve() we'll
> >> be going through the syscall return, so the userland will see 0 in there,
> >> no matter what do we do here. Theoretically, it might've been done for
> >> ptrace sake (it will be able to observe the values in those registers before
> >> the tracee reaches userland), but there's another oddity involved - "stack"
> >> is a userland pointer here. Granted, it's been recently written to, so
> >> we are not likely to hit a pagefault here, but... What happens if we are
> >> under enough memory pressure to swap those pages out? PF in the kernel
> >> mode with no exception table entries for those insns?
> >
> > FWIW, if you don't mind an experiment, try to take mainline (with that
> > commit not reverted) and add
> > strne r0, [sp, #S_R0]
> > right before
> > get_thread_info tsk
> > in ret_from_fork(). And see if that changes behaviour.
> >
>
> I don't mind experiments at all :)
>
> However, with that extra line in place as described, I'm still getting
> the Oops below. If you want me to test anything else, please let me know.
>
>
> [ 4.683182] VFS: Mounted root (nfs filesystem) on device 0:12.
> [ 4.742007] devtmpfs: mounted
> [ 4.745746] Freeing init memory: 172K
> [ 5.038724] Internal error: Oops - undefined instruction: 0 [#1] SMP
> THUMB2
> [ 5.046044] Modules linked in:
> [ 5.049263] CPU: 0 Not tainted (3.6.0-11053-g56c8535-dirty #136)
> [ 5.055925] PC is at cpsw_probe+0x422/0x9ac
> [ 5.060314] LR is at trace_hardirqs_on_caller+0x8f/0xfc
> [ 5.065790] pc : [<c03493de>] lr : [<c005e81f>] psr: 60000113
> [ 5.065790] sp : cf055fb0 ip : 00000000 fp : 00000000
> [ 5.077800] r10: 00000000 r9 : 00000000 r8 : 00000000
> [ 5.083270] r7 : 00000000 r6 : 00000000 r5 : c034458d r4 : 00000000
> [ 5.090101] r3 : cf057a40 r2 : 00000000 r1 : 00000001 r0 : 00000000
> [ 5.096936] Flags: nZCv IRQs on FIQs on Mode SVC_32 ISA ARM
> Segment user
> [ 5.104406] Control: 50c5387d Table: 8f434019 DAC: 00000015
> [ 5.110422] Process init (pid: 1, stack limit = 0xcf054240)
> [ 5.116257] Stack: (0xcf055fb0 to 0xcf056000)
> [ 5.120824] 5fa0: 00000001
> 00000000 00000000 00000000
> [ 5.129390] 5fc0: cf055fb0 c000d1a8 00000000 00000000 00000000
> 00000000 00000000 00000000
> [ 5.137957] 5fe0: 00000000 becedf10 00000000 b6f81dd0 00000010
> 00000000 aaaabfaf a8babbaa
> [ 5.146529] Code: 2206a010 718ef508 0184f8da f8b1f65d (3070f8d8)
> [ 5.152915] ---[ end trace 7362bbe8e73e6b07 ]---
> [ 5.158324] Kernel panic - not syncing: Attempted to kill init!
> exitcode=0x0000000b
> [ 5.158324]

Very interesting... So we have kernel_thread() payload called and we have
it reach kernel_execve() (otherwise your reverting kernel_execve() change
would've had no effect). Said payload returns, and sp value seems to be
sane. Buggered return address, perhaps? But that would be killing the
damn thing everywhere...

Just in case - print __builtin_return_address(0) in the beginning of
kernel_init(); it ought to point at the end of ret_from_fork...

And kill that strne along with assignment to ->ARM_r0 in start_thread().
I've missed the obvious problem with strne - flag values won't survive the
call of payload ;-/ IOW, it's still possible that we are getting bitten
by strange value left in there. Removing the assignment in start_thread()
would check that possiblity...

2012-10-14 19:24:08

by Al Viro

[permalink] [raw]
Subject: Re: [revert request for commit 9fff2fa] Re: [git pull] signals pile 3

On Sun, Oct 14, 2012 at 06:26:40PM +0100, Al Viro wrote:
> and the last 3 make no sense whatsoever. Note that on normal execve() we'll
> be going through the syscall return, so the userland will see 0 in there,
> no matter what do we do here. Theoretically, it might've been done for
> ptrace sake (it will be able to observe the values in those registers before
> the tracee reaches userland),

Except that it won't be able to see what start_thread() puts in r0 either;
on successful exceve(2) we will store return value of sys_execve() (i.e. 0)
in regs->ARM_r0 before we get to any of the places where it could have
examine the sucker. So what was that assignment for? And as far as I can
see, ARM ELF ABI says that general register values on process startup are
undefined, so r1 and r2 assignments also seem to be pointless. OTOH, they
predate the ELF conversion by quite a but - that code had been there since
1.x times, when we used to use a.out... In any case, they were *not* going
to be usable as main() arguments - zero argc would make userland rather
unhappy. I don't have arm libc sources from those times, but I'd expect
it to have all those suckers read from userland stack...

Russell, could you recall what those had been about? I'm not sure if that
had been oopsable that far back (again, oops scenario is userland stack
page getting swapped out before we get to start_thread(), leading to
direct read from an absent page in start_thread() by plain ldr, without
anything in exception table about that insn), but it looks very odd
regardless of that problem.

2012-10-14 19:56:16

by Al Viro

[permalink] [raw]
Subject: Re: [revert request for commit 9fff2fa] Re: [git pull] signals pile 3

On Sun, Oct 14, 2012 at 08:24:03PM +0100, Al Viro wrote:

> Russell, could you recall what those had been about? I'm not sure if that
> had been oopsable that far back (again, oops scenario is userland stack
> page getting swapped out before we get to start_thread(), leading to
> direct read from an absent page in start_thread() by plain ldr, without
> anything in exception table about that insn), but it looks very odd
> regardless of that problem.

BTW, arm64 has copied that logics, so it also seems to be unsafe and very
odd - there we definitely have only ELF to cope with. arm64 folks Cc'd...

2012-10-14 20:24:45

by Russell King - ARM Linux

[permalink] [raw]
Subject: Re: [git pull] signals pile 3

On Sun, Oct 14, 2012 at 05:35:23PM +0200, Daniel Mack wrote:
> I rebased my ARM development branch and figured that your patch 9fff2fa
> ("arm: switch to saner kernel_execve() semantics") breaks the boot on my
> board right after init is invoked via NFS:

Ok, I'm not going to assign blame to Al's commits (I never reviewed his
stuff before they hit mainline - patches never posted to the ARM mailing
list, and the development actually happened within the merge window,
all things we tell people not to do...) I _still_ haven't reviewed that
stuff yet.

But... nevertheless...

> [ 4.682072] VFS: Mounted root (nfs filesystem) on device 0:12.
> [ 4.690744] devtmpfs: mounted
> [ 4.694395] Freeing init memory: 172K
> [ 5.291417] Internal error: Oops - undefined instruction: 0 [#1] SMP
> THUMB2

Ok, so this tells us the kernel was built using Thumb2 ISA.

> [ 5.298734] Modules linked in:
> [ 5.301952] CPU: 0 Not tainted (3.6.0-11053-g56c8535 #128)
> [ 5.308071] PC is at cpsw_probe+0x422/0x9ac

PC is not word aligned, so it can't be running in the ARM ISA.

> [ 5.312459] LR is at trace_hardirqs_on_caller+0x8f/0xfc
> [ 5.317934] pc : [<c03493de>] lr : [<c005e81f>] psr: 60000113

Note that this reconfirms the above (well, it should do, it's the same
value.)

> [ 5.317934] sp : cf055fb0 ip : 00000000 fp : 00000000
> [ 5.329944] r10: 00000000 r9 : 00000000 r8 : 00000000
> [ 5.335413] r7 : 00000000 r6 : 00000000 r5 : c034458d r4 : 00000000
> [ 5.342244] r3 : cf057a40 r2 : 00000000 r1 : 00000001 r0 : 00000000
> [ 5.349078] Flags: nZCv IRQs on FIQs on Mode SVC_32 ISA ARM
> Segment user

And this tells us that we're running in ARM mode, not Thumb mode.

> [ 5.356546] Control: 50c5387d Table: 8f434019 DAC: 00000015
> [ 5.362562] Process init (pid: 1, stack limit = 0xcf054240)
> [ 5.368395] Stack: (0xcf055fb0 to 0xcf056000)
> [ 5.372961] 5fa0: 00000001
> 00000000 00000000 00000000
> [ 5.381525] 5fc0: cf055fb0 c000d1a8 00000000 00000000 00000000
> 00000000 00000000 00000000
> [ 5.390091] 5fe0: 00000000 bee83f10 00000000 b6fdedd0 00000010
> 00000000 aaaabfaf a8babbaa

No stack backtrace (and it's silent about why that is).

The other strange thing here is that the stack dump above is showing that
the stack is completely empty - which shouldn't be the case if we're in a
driver probe function - driver probe functions are called via the driver
model layers...

> [ 5.398664] Code: 2206a010 718ef508 0184f8da f8b1f65d (3070f8d8)

And now we come to the Code: line, which makes no sense as an ARM ISA:

0: 2206a010 andcs sl, r6, #16
4: 718ef508 orrvc pc, lr, r8, lsl #10
8: 0184f8da ldrdeq pc, [r4, sl]
c: f8b1f65d ; <UNDEFINED> instruction: 0xf8b1f65d
10: 3070f8d8 ldrsbtcc pc, [r0], #-136 ; 0xffffff78 ; <UNPREDICTABLE>

But as Thumb, it looks more reasonable:

0: a010 add r0, pc, #64 ; (adr r0, 44 <foo+0x44>)
2: 2206 movs r2, #6
4: f508 718e add.w r1, r8, #284 ; 0x11c
8: f8da 0184 ldr.w r0, [sl, #388] ; 0x184
c: f65d f8b1 bl ffe5d172 <foo+0xffe5d172>
10: f8d8 3070 ldr.w r3, [r8, #112] ; 0x70

I don't have any further comments to make on this yet, as I've no idea
what state stuff is in, but the above oops dump to me suggests that
we've randomly jumped into some part of the kernel which just happens
to be cpsw_probe().

Please send me (in private mail) your vmlinux file and a corresponding
oops dump from that same kernel, and I'll dig and try and work out
what's going on...

This kind of investigation reminds me of those I did back in the 1990s
when stuff was rather unstable and ARM was a young architecture. Now
all we need is for an ARM platform to dump its entire memory out the
ethernet port, bringing an university department network to a halt (I
did that once - back in the 1990s - sorry Tim!)

2012-10-14 22:25:04

by Russell King - ARM Linux

[permalink] [raw]
Subject: Re: [git pull] signals pile 3

Okay, here's the post-mortem diagnosis.

What's happening is as follows (I'm very certain of this.)

We come through the usual init, and issue (see init/main.c):

kernel_thread(kernel_init, NULL, CLONE_FS | CLONE_SIGHAND);

This creates a new thread, which falls through to the ret_from_fork
assembly, with r4 set NULL and r5 set to kernel_init. You can see
this in your oops dump register set - r5 is 0xc0344555, which is the
address of kernel_init plus 1 which marks the function as Thumb code.

Now, let's look at this code a little closer - this is what the
disassembly looks like:

c000d180 <ret_from_fork>:
c000d180: f03a fe08 bl c0047d94 <schedule_tail>
c000d184: 2d00 cmp r5, #0
c000d186: bf1e ittt ne
c000d188: 4620 movne r0, r4
c000d18a: 46fe movne lr, pc <-- XXXXXXX
c000d18c: 46af movne pc, r5
c000d18e: 46e9 mov r9, sp
c000d190: ea4f 3959 mov.w r9, r9, lsr #13
c000d194: ea4f 3949 mov.w r9, r9, lsl #13
c000d198: e7c8 b.n c000d12c <ret_to_user>
c000d19a: bf00 nop
c000d19c: f3af 8000 nop.w

I have marked one instruction, and it's the significant one.

Eventually, having had a successful call to kernel_execve(), kernel_init()
returns zero.

In returning, it uses the value in 'lr' which was set by the instruction
I marked above. Unfortunately, this causes lr to contain 0xc000d18e -
an even address. This switches the ISA to ARM on return but with a non
word aligned PC value.

So, what do we end up executing? Well, not the instructions above - yes
the opcodes, but they don't mean the same thing in ARM mode. In ARM mode,
it looks like this instead:

c000d18c: 46e946af strbtmi r4, [r9], pc, lsr #13
c000d190: 3959ea4f ldmdbcc r9, {r0, r1, r2, r3, r6, r9, fp, sp, lr, pc}^
c000d194: 3949ea4f stmdbcc r9, {r0, r1, r2, r3, r6, r9, fp, sp, lr, pc}^
c000d198: bf00e7c8 svclt 0x0000e7c8
c000d19c: 8000f3af andhi pc, r0, pc, lsr #7
c000d1a0: e88db092 stm sp, {r1, r4, r7, ip, sp, pc}
c000d1a4: 46e81fff ; <UNDEFINED> instruction: 0x46e81fff
c000d1a8: 8a00f3ef bhi 0xc004a16c
c000d1ac: 0a0cf08a beq 0xc03493dc

I have included more above, because it's relevant. The PSR flags which we
can see in the oops dump are nZCv, so Z and C are set.

All the above ARM instructions are not executed, except for two. c000d1a0,
which has no writeback, and writes below the current stack pointer (and
that data is lost when we take the next exception.) The other instruction
which is executed is c000d1ac, which takes us to... 0xc03493dc. However,
remember that bit 1 of the PC got set. So that makes it 0xc03493de.

And that value is the value we find in the oops dump for PC. What is the
instruction here when interpreted in ARM mode?

0: f71e150c ; <UNDEFINED> instruction: 0xf71e150c

and there we have our undefined instruction (remember that the 'never'
condition code, 0xf, has been deprecated and is now always executed.)

So, what we have above is a consistent and sane story for how we ended up
at such a strange place in the kernel with such an odd register dump - with
no unanswered questions about what happened to get us there.

In light of this, I'm 100% certain that the patch below will fix the issue
you're seeing - please test this and get back to me ASAP, thanks.

arch/arm/kernel/entry-common.S | 4 ++--
1 files changed, 2 insertions(+), 2 deletions(-)

diff --git a/arch/arm/kernel/entry-common.S b/arch/arm/kernel/entry-common.S
index 417bac1..3471175 100644
--- a/arch/arm/kernel/entry-common.S
+++ b/arch/arm/kernel/entry-common.S
@@ -88,9 +88,9 @@ ENTRY(ret_from_fork)
bl schedule_tail
cmp r5, #0
movne r0, r4
- movne lr, pc
+ adrne lr, BSYM(1f)
movne pc, r5
- get_thread_info tsk
+1: get_thread_info tsk
b ret_slow_syscall
ENDPROC(ret_from_fork)

2012-10-14 22:39:47

by Daniel Mack

[permalink] [raw]
Subject: Re: [git pull] signals pile 3

On 15.10.2012 00:24, Russell King - ARM Linux wrote:
> Okay, here's the post-mortem diagnosis.
>
> What's happening is as follows (I'm very certain of this.)
>
> We come through the usual init, and issue (see init/main.c):
>
> kernel_thread(kernel_init, NULL, CLONE_FS | CLONE_SIGHAND);
>
> This creates a new thread, which falls through to the ret_from_fork
> assembly, with r4 set NULL and r5 set to kernel_init. You can see
> this in your oops dump register set - r5 is 0xc0344555, which is the
> address of kernel_init plus 1 which marks the function as Thumb code.
>
> Now, let's look at this code a little closer - this is what the
> disassembly looks like:
>
> c000d180 <ret_from_fork>:
> c000d180: f03a fe08 bl c0047d94 <schedule_tail>
> c000d184: 2d00 cmp r5, #0
> c000d186: bf1e ittt ne
> c000d188: 4620 movne r0, r4
> c000d18a: 46fe movne lr, pc <-- XXXXXXX
> c000d18c: 46af movne pc, r5
> c000d18e: 46e9 mov r9, sp
> c000d190: ea4f 3959 mov.w r9, r9, lsr #13
> c000d194: ea4f 3949 mov.w r9, r9, lsl #13
> c000d198: e7c8 b.n c000d12c <ret_to_user>
> c000d19a: bf00 nop
> c000d19c: f3af 8000 nop.w
>
> I have marked one instruction, and it's the significant one.
>
> Eventually, having had a successful call to kernel_execve(), kernel_init()
> returns zero.
>
> In returning, it uses the value in 'lr' which was set by the instruction
> I marked above. Unfortunately, this causes lr to contain 0xc000d18e -
> an even address. This switches the ISA to ARM on return but with a non
> word aligned PC value.
>
> So, what do we end up executing? Well, not the instructions above - yes
> the opcodes, but they don't mean the same thing in ARM mode. In ARM mode,
> it looks like this instead:
>
> c000d18c: 46e946af strbtmi r4, [r9], pc, lsr #13
> c000d190: 3959ea4f ldmdbcc r9, {r0, r1, r2, r3, r6, r9, fp, sp, lr, pc}^
> c000d194: 3949ea4f stmdbcc r9, {r0, r1, r2, r3, r6, r9, fp, sp, lr, pc}^
> c000d198: bf00e7c8 svclt 0x0000e7c8
> c000d19c: 8000f3af andhi pc, r0, pc, lsr #7
> c000d1a0: e88db092 stm sp, {r1, r4, r7, ip, sp, pc}
> c000d1a4: 46e81fff ; <UNDEFINED> instruction: 0x46e81fff
> c000d1a8: 8a00f3ef bhi 0xc004a16c
> c000d1ac: 0a0cf08a beq 0xc03493dc
>
> I have included more above, because it's relevant. The PSR flags which we
> can see in the oops dump are nZCv, so Z and C are set.
>
> All the above ARM instructions are not executed, except for two. c000d1a0,
> which has no writeback, and writes below the current stack pointer (and
> that data is lost when we take the next exception.) The other instruction
> which is executed is c000d1ac, which takes us to... 0xc03493dc. However,
> remember that bit 1 of the PC got set. So that makes it 0xc03493de.
>
> And that value is the value we find in the oops dump for PC. What is the
> instruction here when interpreted in ARM mode?
>
> 0: f71e150c ; <UNDEFINED> instruction: 0xf71e150c
>
> and there we have our undefined instruction (remember that the 'never'
> condition code, 0xf, has been deprecated and is now always executed.)
>
> So, what we have above is a consistent and sane story for how we ended up
> at such a strange place in the kernel with such an odd register dump - with
> no unanswered questions about what happened to get us there.
>
> In light of this, I'm 100% certain that the patch below will fix the issue
> you're seeing - please test this and get back to me ASAP, thanks.

Quite impressive analysis :) And it seems you really spotted the reason
here, as your patch fixes the problem.

> arch/arm/kernel/entry-common.S | 4 ++--
> 1 files changed, 2 insertions(+), 2 deletions(-)
>
> diff --git a/arch/arm/kernel/entry-common.S b/arch/arm/kernel/entry-common.S
> index 417bac1..3471175 100644
> --- a/arch/arm/kernel/entry-common.S
> +++ b/arch/arm/kernel/entry-common.S
> @@ -88,9 +88,9 @@ ENTRY(ret_from_fork)
> bl schedule_tail
> cmp r5, #0
> movne r0, r4
> - movne lr, pc
> + adrne lr, BSYM(1f)
> movne pc, r5
> - get_thread_info tsk
> +1: get_thread_info tsk
> b ret_slow_syscall
> ENDPROC(ret_from_fork)

Tested-by: Daniel Mack <[email protected]>

Many thanks for the very prompt response!


Daniel

2012-10-14 23:17:03

by Russell King - ARM Linux

[permalink] [raw]
Subject: Re: [git pull] signals pile 3

On Mon, Oct 15, 2012 at 12:39:40AM +0200, Daniel Mack wrote:
> Tested-by: Daniel Mack <[email protected]>
>
> Many thanks for the very prompt response!

Thanks Daniel.

I've also tested this on my OMAP4430 board running in ARM mode, so that
still works - we've covered the possibilities between us here between
ARM mode and Thumb mode, so...

Linus, could you merge this patch please, thanks.

8<===
From: Russell King <[email protected]>
Subject: [PATCH] ARM: fix oops on initial entry to userspace with Thumb2 kernels

Daniel Mack reports an oops at boot with the latest kernels:

[ 4.896717] Internal error: Oops - undefined instruction: 0 [#1] SMP THUMB2
[ 4.904034] Modules linked in:
[ 4.907253] CPU: 0 Not tainted (3.6.0-11057-g584df1d #145)
[ 4.913372] PC is at cpsw_probe+0x45a/0x9ac
[ 4.917760] LR is at trace_hardirqs_on_caller+0x8f/0xfc
[ 4.923235] pc : [<c03493de>] lr : [<c005e81f>] psr: 60000113
[ 4.923235] sp : cf055fb0 ip : 00000000 fp : 00000000
[ 4.935246] r10: 00000000 r9 : 00000000 r8 : 00000000
[ 4.940715] r7 : 00000000 r6 : 00000000 r5 : c0344555 r4 : 00000000
[ 4.947548] r3 : cf057a40 r2 : 00000000 r1 : 00000001 r0 : 00000000
[ 4.954383] Flags: nZCv IRQs on FIQs on Mode SVC_32 ISA ARM Segment user
[ 4.961853] Control: 50c5387d Table: 8f3f4019 DAC: 00000015
[ 4.967868] Process init (pid: 1, stack limit = 0xcf054240)
[ 4.973702] Stack: (0xcf055fb0 to 0xcf056000)
[ 4.978269] 5fa0: 00000001 00000000 00000000 00000000
[ 4.986836] 5fc0: cf055fb0 c000d1a8 00000000 00000000 00000000 00000000 00000000 00000000
[ 4.995403] 5fe0: 00000000 be9b3f10 00000000 b6f6add0 00000010 00000000 aaaabfaf a8babbaa

The analysis of this is as follows. In init/main.c, we issue:

kernel_thread(kernel_init, NULL, CLONE_FS | CLONE_SIGHAND);

This creates a new thread, which falls through to the ret_from_fork
assembly, with r4 set NULL and r5 set to kernel_init. You can see
this in your oops dump register set - r5 is 0xc0344555, which is the
address of kernel_init plus 1 which marks the function as Thumb code.

Now, let's look at this code a little closer - this is what the
disassembly looks like:

c000d180 <ret_from_fork>:
c000d180: f03a fe08 bl c0047d94 <schedule_tail>
c000d184: 2d00 cmp r5, #0
c000d186: bf1e ittt ne
c000d188: 4620 movne r0, r4
c000d18a: 46fe movne lr, pc <-- XXXXXXX
c000d18c: 46af movne pc, r5
c000d18e: 46e9 mov r9, sp
c000d190: ea4f 3959 mov.w r9, r9, lsr #13
c000d194: ea4f 3949 mov.w r9, r9, lsl #13
c000d198: e7c8 b.n c000d12c <ret_to_user>
c000d19a: bf00 nop
c000d19c: f3af 8000 nop.w

This code was introduced in 9fff2fa0db911 (arm: switch to saner
kernel_execve() semantics). I have marked one instruction, and it's
the significant one - I'll come back to that later.

Eventually, having had a successful call to kernel_execve(), kernel_init()
returns zero.

In returning, it uses the value in 'lr' which was set by the instruction
I marked above. Unfortunately, this causes lr to contain 0xc000d18e -
an even address. This switches the ISA to ARM on return but with a non
word aligned PC value.

So, what do we end up executing? Well, not the instructions above - yes
the opcodes, but they don't mean the same thing in ARM mode. In ARM mode,
it looks like this instead:

c000d18c: 46e946af strbtmi r4, [r9], pc, lsr #13
c000d190: 3959ea4f ldmdbcc r9, {r0, r1, r2, r3, r6, r9, fp, sp, lr, pc}^
c000d194: 3949ea4f stmdbcc r9, {r0, r1, r2, r3, r6, r9, fp, sp, lr, pc}^
c000d198: bf00e7c8 svclt 0x0000e7c8
c000d19c: 8000f3af andhi pc, r0, pc, lsr #7
c000d1a0: e88db092 stm sp, {r1, r4, r7, ip, sp, pc}
c000d1a4: 46e81fff ; <UNDEFINED> instruction: 0x46e81fff
c000d1a8: 8a00f3ef bhi 0xc004a16c
c000d1ac: 0a0cf08a beq 0xc03493dc

I have included more above, because it's relevant. The PSR flags which we
can see in the oops dump are nZCv, so Z and C are set.

All the above ARM instructions are not executed, except for two. c000d1a0,
which has no writeback, and writes below the current stack pointer (and
that data is lost when we take the next exception.) The other instruction
which is executed is c000d1ac, which takes us to... 0xc03493dc. However,
remember that bit 1 of the PC got set. So that makes the PC value
0xc03493de.

And that value is the value we find in the oops dump for PC. What is the
instruction here when interpreted in ARM mode?

0: f71e150c ; <UNDEFINED> instruction: 0xf71e150c

and there we have our undefined instruction (remember that the 'never'
condition code, 0xf, has been deprecated and is now always executed as it
is now being used for additional instructions.)

This path also nicely explains the state of the stack we see in the oops
dump too.

The above is a consistent and sane story for how we got to the oops dump,
which all stems from the instruction at 0xc000d18a being wrong.

Reported-by: Daniel Mack <[email protected]>
Tested-by: Daniel Mack <[email protected]>
Signed-off-by: Russell King <[email protected]>
---
arch/arm/kernel/entry-common.S | 4 ++--
1 files changed, 2 insertions(+), 2 deletions(-)

diff --git a/arch/arm/kernel/entry-common.S b/arch/arm/kernel/entry-common.S
index 417bac1..3471175 100644
--- a/arch/arm/kernel/entry-common.S
+++ b/arch/arm/kernel/entry-common.S
@@ -88,9 +88,9 @@ ENTRY(ret_from_fork)
bl schedule_tail
cmp r5, #0
movne r0, r4
- movne lr, pc
+ adrne lr, BSYM(1f)
movne pc, r5
- get_thread_info tsk
+1: get_thread_info tsk
b ret_slow_syscall
ENDPROC(ret_from_fork)

2012-10-15 16:07:23

by Catalin Marinas

[permalink] [raw]
Subject: Re: [revert request for commit 9fff2fa] Re: [git pull] signals pile 3

On Sun, Oct 14, 2012 at 08:56:11PM +0100, Al Viro wrote:
> On Sun, Oct 14, 2012 at 08:24:03PM +0100, Al Viro wrote:
>
> > Russell, could you recall what those had been about? I'm not sure if that
> > had been oopsable that far back (again, oops scenario is userland stack
> > page getting swapped out before we get to start_thread(), leading to
> > direct read from an absent page in start_thread() by plain ldr, without
> > anything in exception table about that insn), but it looks very odd
> > regardless of that problem.
>
> BTW, arm64 has copied that logics, so it also seems to be unsafe and very
> odd - there we definitely have only ELF to cope with. arm64 folks Cc'd...

Good point. We don't need this on arm64 and probably neither on arm (at
least since EABI).

Setting x0 may cause other issues as well. The dynamic loader simply
ignores the startup registers but for static binaries the _start code in
glibc expects r0 to contain a function pointer to be registered with
atexit() in __libc_start_main() or NULL. Since we pass argc in there,
for static binaries the rtld_fini argument to __libc_start_main() is
neither NULL nor something meaningful.

Russell, do you know whether setting these registers is needed for OABI?

--
Catalin

2012-10-15 16:27:40

by Al Viro

[permalink] [raw]
Subject: Re: [revert request for commit 9fff2fa] Re: [git pull] signals pile 3

On Mon, Oct 15, 2012 at 05:07:10PM +0100, Catalin Marinas wrote:
> On Sun, Oct 14, 2012 at 08:56:11PM +0100, Al Viro wrote:
> > On Sun, Oct 14, 2012 at 08:24:03PM +0100, Al Viro wrote:
> >
> > > Russell, could you recall what those had been about? I'm not sure if that
> > > had been oopsable that far back (again, oops scenario is userland stack
> > > page getting swapped out before we get to start_thread(), leading to
> > > direct read from an absent page in start_thread() by plain ldr, without
> > > anything in exception table about that insn), but it looks very odd
> > > regardless of that problem.
> >
> > BTW, arm64 has copied that logics, so it also seems to be unsafe and very
> > odd - there we definitely have only ELF to cope with. arm64 folks Cc'd...
>
> Good point. We don't need this on arm64 and probably neither on arm (at
> least since EABI).
>
> Setting x0 may cause other issues as well. The dynamic loader simply
> ignores the startup registers but for static binaries the _start code in
> glibc expects r0 to contain a function pointer to be registered with
> atexit() in __libc_start_main() or NULL. Since we pass argc in there,
> for static binaries the rtld_fini argument to __libc_start_main() is
> neither NULL nor something meaningful.

The value left there by start_thread() will not reach the userland anyway...

2012-10-15 17:06:10

by Catalin Marinas

[permalink] [raw]
Subject: Re: [revert request for commit 9fff2fa] Re: [git pull] signals pile 3

On Mon, Oct 15, 2012 at 05:27:32PM +0100, Al Viro wrote:
> On Mon, Oct 15, 2012 at 05:07:10PM +0100, Catalin Marinas wrote:
> > On Sun, Oct 14, 2012 at 08:56:11PM +0100, Al Viro wrote:
> > > On Sun, Oct 14, 2012 at 08:24:03PM +0100, Al Viro wrote:
> > >
> > > > Russell, could you recall what those had been about? I'm not sure if that
> > > > had been oopsable that far back (again, oops scenario is userland stack
> > > > page getting swapped out before we get to start_thread(), leading to
> > > > direct read from an absent page in start_thread() by plain ldr, without
> > > > anything in exception table about that insn), but it looks very odd
> > > > regardless of that problem.
> > >
> > > BTW, arm64 has copied that logics, so it also seems to be unsafe and very
> > > odd - there we definitely have only ELF to cope with. arm64 folks Cc'd...
> >
> > Good point. We don't need this on arm64 and probably neither on arm (at
> > least since EABI).
> >
> > Setting x0 may cause other issues as well. The dynamic loader simply
> > ignores the startup registers but for static binaries the _start code in
> > glibc expects r0 to contain a function pointer to be registered with
> > atexit() in __libc_start_main() or NULL. Since we pass argc in there,
> > for static binaries the rtld_fini argument to __libc_start_main() is
> > neither NULL nor something meaningful.
>
> The value left there by start_thread() will not reach the userland anyway...

Ah, yes. So not causing any user issues (apart from the possible fault
in the kernel while accessing the user stack).

--
Catalin

2012-10-16 14:04:28

by Uwe Kleine-König

[permalink] [raw]
Subject: Re: [git pull] signals pile 3

Hello,

On Mon, Oct 15, 2012 at 12:16:49AM +0100, Russell King - ARM Linux wrote:
> On Mon, Oct 15, 2012 at 12:39:40AM +0200, Daniel Mack wrote:
> > Tested-by: Daniel Mack <[email protected]>
> >
> > Many thanks for the very prompt response!
>
> Thanks Daniel.
>
> I've also tested this on my OMAP4430 board running in ARM mode, so that
> still works - we've covered the possibilities between us here between
> ARM mode and Thumb mode, so...
>
> Linus, could you merge this patch please, thanks.
>
> 8<===
> From: Russell King <[email protected]>
> Subject: [PATCH] ARM: fix oops on initial entry to userspace with Thumb2 kernels
>
> Daniel Mack reports an oops at boot with the latest kernels:
>
> [ 4.896717] Internal error: Oops - undefined instruction: 0 [#1] SMP THUMB2
> [ 4.904034] Modules linked in:
> [ 4.907253] CPU: 0 Not tainted (3.6.0-11057-g584df1d #145)
> [ 4.913372] PC is at cpsw_probe+0x45a/0x9ac
> [ 4.917760] LR is at trace_hardirqs_on_caller+0x8f/0xfc
> [ 4.923235] pc : [<c03493de>] lr : [<c005e81f>] psr: 60000113
> [ 4.923235] sp : cf055fb0 ip : 00000000 fp : 00000000
> [ 4.935246] r10: 00000000 r9 : 00000000 r8 : 00000000
> [ 4.940715] r7 : 00000000 r6 : 00000000 r5 : c0344555 r4 : 00000000
> [ 4.947548] r3 : cf057a40 r2 : 00000000 r1 : 00000001 r0 : 00000000
> [ 4.954383] Flags: nZCv IRQs on FIQs on Mode SVC_32 ISA ARM Segment user
> [ 4.961853] Control: 50c5387d Table: 8f3f4019 DAC: 00000015
> [ 4.967868] Process init (pid: 1, stack limit = 0xcf054240)
> [ 4.973702] Stack: (0xcf055fb0 to 0xcf056000)
> [ 4.978269] 5fa0: 00000001 00000000 00000000 00000000
> [ 4.986836] 5fc0: cf055fb0 c000d1a8 00000000 00000000 00000000 00000000 00000000 00000000
> [ 4.995403] 5fe0: 00000000 be9b3f10 00000000 b6f6add0 00000010 00000000 aaaabfaf a8babbaa
>
> The analysis of this is as follows. In init/main.c, we issue:
>
> kernel_thread(kernel_init, NULL, CLONE_FS | CLONE_SIGHAND);
>
> This creates a new thread, which falls through to the ret_from_fork
> assembly, with r4 set NULL and r5 set to kernel_init. You can see
> this in your oops dump register set - r5 is 0xc0344555, which is the
> address of kernel_init plus 1 which marks the function as Thumb code.
>
> Now, let's look at this code a little closer - this is what the
> disassembly looks like:
>
> c000d180 <ret_from_fork>:
> c000d180: f03a fe08 bl c0047d94 <schedule_tail>
> c000d184: 2d00 cmp r5, #0
> c000d186: bf1e ittt ne
> c000d188: 4620 movne r0, r4
> c000d18a: 46fe movne lr, pc <-- XXXXXXX
> c000d18c: 46af movne pc, r5
> c000d18e: 46e9 mov r9, sp
> c000d190: ea4f 3959 mov.w r9, r9, lsr #13
> c000d194: ea4f 3949 mov.w r9, r9, lsl #13
> c000d198: e7c8 b.n c000d12c <ret_to_user>
> c000d19a: bf00 nop
> c000d19c: f3af 8000 nop.w
>
> This code was introduced in 9fff2fa0db911 (arm: switch to saner
> kernel_execve() semantics). I have marked one instruction, and it's
> the significant one - I'll come back to that later.
>
> Eventually, having had a successful call to kernel_execve(), kernel_init()
> returns zero.
>
> In returning, it uses the value in 'lr' which was set by the instruction
> I marked above. Unfortunately, this causes lr to contain 0xc000d18e -
> an even address. This switches the ISA to ARM on return but with a non
> word aligned PC value.
>
> So, what do we end up executing? Well, not the instructions above - yes
> the opcodes, but they don't mean the same thing in ARM mode. In ARM mode,
> it looks like this instead:
>
> c000d18c: 46e946af strbtmi r4, [r9], pc, lsr #13
> c000d190: 3959ea4f ldmdbcc r9, {r0, r1, r2, r3, r6, r9, fp, sp, lr, pc}^
> c000d194: 3949ea4f stmdbcc r9, {r0, r1, r2, r3, r6, r9, fp, sp, lr, pc}^
> c000d198: bf00e7c8 svclt 0x0000e7c8
> c000d19c: 8000f3af andhi pc, r0, pc, lsr #7
> c000d1a0: e88db092 stm sp, {r1, r4, r7, ip, sp, pc}
> c000d1a4: 46e81fff ; <UNDEFINED> instruction: 0x46e81fff
> c000d1a8: 8a00f3ef bhi 0xc004a16c
> c000d1ac: 0a0cf08a beq 0xc03493dc
>
> I have included more above, because it's relevant. The PSR flags which we
> can see in the oops dump are nZCv, so Z and C are set.
>
> All the above ARM instructions are not executed, except for two. c000d1a0,
> which has no writeback, and writes below the current stack pointer (and
> that data is lost when we take the next exception.) The other instruction
> which is executed is c000d1ac, which takes us to... 0xc03493dc. However,
> remember that bit 1 of the PC got set. So that makes the PC value
> 0xc03493de.
>
> And that value is the value we find in the oops dump for PC. What is the
> instruction here when interpreted in ARM mode?
>
> 0: f71e150c ; <UNDEFINED> instruction: 0xf71e150c
>
> and there we have our undefined instruction (remember that the 'never'
> condition code, 0xf, has been deprecated and is now always executed as it
> is now being used for additional instructions.)
>
> This path also nicely explains the state of the stack we see in the oops
> dump too.
>
> The above is a consistent and sane story for how we got to the oops dump,
> which all stems from the instruction at 0xc000d18a being wrong.
>
> Reported-by: Daniel Mack <[email protected]>
> Tested-by: Daniel Mack <[email protected]>
> Signed-off-by: Russell King <[email protected]>
This patch makes my Cortex-M3 boot again on v3.7-rc1. I did it slightly
different:

> ---
> arch/arm/kernel/entry-common.S | 4 ++--
> 1 files changed, 2 insertions(+), 2 deletions(-)
>
> diff --git a/arch/arm/kernel/entry-common.S b/arch/arm/kernel/entry-common.S
> index 417bac1..3471175 100644
> --- a/arch/arm/kernel/entry-common.S
> +++ b/arch/arm/kernel/entry-common.S
> @@ -88,9 +88,9 @@ ENTRY(ret_from_fork)
> bl schedule_tail
> cmp r5, #0
> movne r0, r4
> - movne lr, pc
> + adrne lr, BSYM(1f)
> movne pc, r5
> - get_thread_info tsk
> +1: get_thread_info tsk
> b ret_slow_syscall

I used:
movne r0, r4
- movne lr, pc
- movne pc, r5
+ blxne r5
get_thread_info tsk

but I assume Russell's patch is better. (Probably because blx doesn't
exist everywhere?!)

So if it's not too late yet:

Tested-by: Uwe Kleine-K?nig <[email protected]>

Thanks
Uwe

--
Pengutronix e.K. | Uwe Kleine-K?nig |
Industrial Linux Solutions | http://www.pengutronix.de/ |

2012-10-16 14:05:38

by Russell King - ARM Linux

[permalink] [raw]
Subject: Re: [git pull] signals pile 3

On Tue, Oct 16, 2012 at 04:04:14PM +0200, Uwe Kleine-K?nig wrote:
> I used:
> movne r0, r4
> - movne lr, pc
> - movne pc, r5
> + blxne r5
> get_thread_info tsk
>
> but I assume Russell's patch is better. (Probably because blx doesn't
> exist everywhere?!)

Correct.

> So if it's not too late yet:
>
> Tested-by: Uwe Kleine-K?nig <[email protected]>

It is, because it's already in Linus' tree.