LinuxLists.cc - A bug about system call on ARM

2013-05-29 06:52:44

Subject: A bug about system call on ARM

Hi all,

I am a new comer to this mailing list ,
I am happy to join this community .

I have a bug reported from our android phones which is caused by the system call .
It seems like kernel bugs from my view .

Crash in file arch\arm\kernel\ entry-common.S

/***************************************************************/

ENTRY(vector_swi)
sub sp, sp, #S_FRAME_SIZE
stmia sp, {r0 - r12} @ Calling r0 - r12
ARM( add r8, sp, #S_PC )
ARM( stmdb r8, {sp, lr}^ ) @ Calling sp, lr
THUMB( mov r8, sp )
THUMB( store_user_sp_lr r8, r10, S_SP ) @ calling sp, lr
mrs r8, spsr @ called from non-FIQ mode, so ok.
str lr, [sp, #S_PC] @ Save calling PC
str r8, [sp, #S_PSR] @ Save CPSR
str r0, [sp, #S_OLD_R0] @ Save OLD_R0
zero_fp

/*
* Get the system call number.
*/

#if defined(CONFIG_OABI_COMPAT)

/*
* If we have CONFIG_OABI_COMPAT then we need to look at the swi
* value to determine if it is an EABI or an old ABI call.
*/
#ifdef CONFIG_ARM_THUMB
tst r8, #PSR_T_BIT
movne r10, #0 @ no thumb OABI emulation
ldreq r10, [lr, #-4] @ get SWI instruction // crash at this instruction, when get SWI instruction
#else
ldr r10, [lr, #-4] @ get SWI instruction
A710( and ip, r10, #0x0f000000 @ check for SWI )
A710( teq ip, #0x0f000000 )
A710( bne .Larm710bug )
#endif
#ifdef CONFIG_CPU_ENDIAN_BE8
rev r10, r10 @ little endian instruction
#endif

/***************************************************************************************************/

Then reason why it will crash when get SWI instruction is maybe
This page is clear to aged by kernel,
But this MMU fault happpened in kernel,
So the kernel do_page_fault function will not clear this page to young,
So that will crash .

It should poll this page to make it present or the fault should be handled by fixup section ,
Anyway, this place should not crash by kernel .

The kernel version I used is 3.4.0
I have add the kernel log and the call stack recovered by trace32 tools
Pls have a look at it .

Thanks .

?
Sony Mobile Communications
Tel: +86 10 5966 9819
Phone: 18610323092
Address: No.16 Guangshun South Street, Chaoyang, Beijing, P.R.C.

sonymobile.com
??

Attachments:

kernel.log (128.00 kB)
kernel.log stack.txt (747.00 B)
stack.txt Download all attachments

2013-05-29 07:34:41

by Richard Weinberger

[permalink] [raw]

Subject: Re: A bug about system call on ARM

Hi!

On Wed, May 29, 2013 at 8:52 AM, Wang, Yalin <[email protected]> wrote:
> Hi all,
>
> I am a new comer to this mailing list ,
> I am happy to join this community .
>
> I have a bug reported from our android phones which is caused by the system call .
> It seems like kernel bugs from my view .

Is this a unmodified Linux kernel from kernel.org? In other works, no
(half broken) board support package
from your hardware vendor?
Did you try a more recent kernel? (At least 3.4.47).
Maybe your problem is already known and fixed...

> Crash in file arch\arm\kernel\ entry-common.S
>
> /***************************************************************/
>
> ENTRY(vector_swi)
> sub sp, sp, #S_FRAME_SIZE
> stmia sp, {r0 - r12} @ Calling r0 - r12
> ARM( add r8, sp, #S_PC )
> ARM( stmdb r8, {sp, lr}^ ) @ Calling sp, lr
> THUMB( mov r8, sp )
> THUMB( store_user_sp_lr r8, r10, S_SP ) @ calling sp, lr
> mrs r8, spsr @ called from non-FIQ mode, so ok.
> str lr, [sp, #S_PC] @ Save calling PC
> str r8, [sp, #S_PSR] @ Save CPSR
> str r0, [sp, #S_OLD_R0] @ Save OLD_R0
> zero_fp
>
> /*
> * Get the system call number.
> */
>
> #if defined(CONFIG_OABI_COMPAT)
>
> /*
> * If we have CONFIG_OABI_COMPAT then we need to look at the swi
> * value to determine if it is an EABI or an old ABI call.
> */
> #ifdef CONFIG_ARM_THUMB
> tst r8, #PSR_T_BIT
> movne r10, #0 @ no thumb OABI emulation
> ldreq r10, [lr, #-4] @ get SWI instruction // crash at this instruction, when get SWI instruction
> #else
> ldr r10, [lr, #-4] @ get SWI instruction
> A710( and ip, r10, #0x0f000000 @ check for SWI )
> A710( teq ip, #0x0f000000 )
> A710( bne .Larm710bug )
> #endif
> #ifdef CONFIG_CPU_ENDIAN_BE8
> rev r10, r10 @ little endian instruction
> #endif
>
> /***************************************************************************************************/
>
> Then reason why it will crash when get SWI instruction is maybe
> This page is clear to aged by kernel,
> But this MMU fault happpened in kernel,
> So the kernel do_page_fault function will not clear this page to young,
> So that will crash .
>
> It should poll this page to make it present or the fault should be handled by fixup section ,
> Anyway, this place should not crash by kernel .
>
> The kernel version I used is 3.4.0
> I have add the kernel log and the call stack recovered by trace32 tools
> Pls have a look at it .
>
>
> Thanks .
>
>
>
> Sony Mobile Communications
> Tel: +86 10 5966 9819
> Phone: 18610323092
> Address: No.16 Guangshun South Street, Chaoyang, Beijing, P.R.C.
>
> sonymobile.com
>
>
>

--
Thanks,
//richard

2013-05-29 07:37:55

by Wang, Yalin

[permalink] [raw]

Subject: RE: A bug about system call on ARM

Hi Richard,

Thanks for your reply ,
I will make a check for this .

-----Original Message-----
From: richard -rw- weinberger [mailto:[email protected]]
Sent: Wednesday, May 29, 2013 3:35 PM
To: Wang, Yalin
Cc: [email protected]; [email protected]
Subject: Re: A bug about system call on ARM

Hi!

On Wed, May 29, 2013 at 8:52 AM, Wang, Yalin <[email protected]> wrote:
> Hi all,
>
> I am a new comer to this mailing list , I am happy to join this
> community .
>
> I have a bug reported from our android phones which is caused by the system call .
> It seems like kernel bugs from my view .

Is this a unmodified Linux kernel from kernel.org? In other works, no (half broken) board support package from your hardware vendor?
Did you try a more recent kernel? (At least 3.4.47).
Maybe your problem is already known and fixed...

> Crash in file arch\arm\kernel\ entry-common.S
>
> /***************************************************************/
>
> ENTRY(vector_swi)
> sub sp, sp, #S_FRAME_SIZE
> stmia sp, {r0 - r12} @ Calling r0 - r12
> ARM( add r8, sp, #S_PC )
> ARM( stmdb r8, {sp, lr}^ ) @ Calling sp, lr
> THUMB( mov r8, sp )
> THUMB( store_user_sp_lr r8, r10, S_SP ) @ calling sp, lr
> mrs r8, spsr @ called from non-FIQ mode, so ok.
> str lr, [sp, #S_PC] @ Save calling PC
> str r8, [sp, #S_PSR] @ Save CPSR
> str r0, [sp, #S_OLD_R0] @ Save OLD_R0
> zero_fp
>
> /*
> * Get the system call number.
> */
>
> #if defined(CONFIG_OABI_COMPAT)
>
> /*
> * If we have CONFIG_OABI_COMPAT then we need to look at the swi
> * value to determine if it is an EABI or an old ABI call.
> */
> #ifdef CONFIG_ARM_THUMB
> tst r8, #PSR_T_BIT
> movne r10, #0 @ no thumb OABI emulation
> ldreq r10, [lr, #-4] @ get SWI instruction // crash at this instruction, when get SWI instruction
> #else
> ldr r10, [lr, #-4] @ get SWI instruction
> A710( and ip, r10, #0x0f000000 @ check for SWI )
> A710( teq ip, #0x0f000000 )
> A710( bne .Larm710bug )
> #endif
> #ifdef CONFIG_CPU_ENDIAN_BE8
> rev r10, r10 @ little endian instruction
> #endif
>
> /*********************************************************************
> ******************************/
>
> Then reason why it will crash when get SWI instruction is maybe This
> page is clear to aged by kernel, But this MMU fault happpened in
> kernel, So the kernel do_page_fault function will not clear this page
> to young, So that will crash .
>
> It should poll this page to make it present or the fault should be
> handled by fixup section , Anyway, this place should not crash by kernel .
>
> The kernel version I used is 3.4.0
> I have add the kernel log and the call stack recovered by trace32
> tools Pls have a look at it .
>
>
> Thanks .
>
>
>
> Sony Mobile Communications
> Tel: +86 10 5966 9819
> Phone: 18610323092
> Address: No.16 Guangshun South Street, Chaoyang, Beijing, P.R.C.
>
> sonymobile.com
>
>
>

--
Thanks,
//richard

2013-05-29 08:24:29

by Wang, Yalin

[permalink] [raw]

Subject: RE: A bug about system call on ARM

Hi

I have download the latest linux kernel code 3.9.4
And Compare with 3.4.0 kernel .

It seems there is no change for this part ,
So it will still happen .
Does anyone know who is responsible for arm arch part kernel code ?

Thanks

-----Original Message-----
From: Wang, Yalin
Sent: Wednesday, May 29, 2013 3:38 PM
To: 'richard -rw- weinberger'
Cc: [email protected]; [email protected]
Subject: RE: A bug about system call on ARM

Hi Richard,

Thanks for your reply ,
I will make a check for this .

-----Original Message-----
From: richard -rw- weinberger [mailto:[email protected]]
Sent: Wednesday, May 29, 2013 3:35 PM
To: Wang, Yalin
Cc: [email protected]; [email protected]
Subject: Re: A bug about system call on ARM

Hi!

On Wed, May 29, 2013 at 8:52 AM, Wang, Yalin <[email protected]> wrote:
> Hi all,
>
> I am a new comer to this mailing list , I am happy to join this
> community .
>
> I have a bug reported from our android phones which is caused by the system call .
> It seems like kernel bugs from my view .

Is this a unmodified Linux kernel from kernel.org? In other works, no (half broken) board support package from your hardware vendor?
Did you try a more recent kernel? (At least 3.4.47).
Maybe your problem is already known and fixed...

> Crash in file arch\arm\kernel\ entry-common.S
>
> /***************************************************************/
>
> ENTRY(vector_swi)
> sub sp, sp, #S_FRAME_SIZE
> stmia sp, {r0 - r12} @ Calling r0 - r12
> ARM( add r8, sp, #S_PC )
> ARM( stmdb r8, {sp, lr}^ ) @ Calling sp, lr
> THUMB( mov r8, sp )
> THUMB( store_user_sp_lr r8, r10, S_SP ) @ calling sp, lr
> mrs r8, spsr @ called from non-FIQ mode, so ok.
> str lr, [sp, #S_PC] @ Save calling PC
> str r8, [sp, #S_PSR] @ Save CPSR
> str r0, [sp, #S_OLD_R0] @ Save OLD_R0
> zero_fp
>
> /*
> * Get the system call number.
> */
>
> #if defined(CONFIG_OABI_COMPAT)
>
> /*
> * If we have CONFIG_OABI_COMPAT then we need to look at the swi
> * value to determine if it is an EABI or an old ABI call.
> */
> #ifdef CONFIG_ARM_THUMB
> tst r8, #PSR_T_BIT
> movne r10, #0 @ no thumb OABI emulation
> ldreq r10, [lr, #-4] @ get SWI instruction // crash at this instruction, when get SWI instruction
> #else
> ldr r10, [lr, #-4] @ get SWI instruction
> A710( and ip, r10, #0x0f000000 @ check for SWI )
> A710( teq ip, #0x0f000000 )
> A710( bne .Larm710bug )
> #endif
> #ifdef CONFIG_CPU_ENDIAN_BE8
> rev r10, r10 @ little endian instruction
> #endif
>
> /*********************************************************************
> ******************************/
>
> Then reason why it will crash when get SWI instruction is maybe This
> page is clear to aged by kernel, But this MMU fault happpened in
> kernel, So the kernel do_page_fault function will not clear this page
> to young, So that will crash .
>
> It should poll this page to make it present or the fault should be
> handled by fixup section , Anyway, this place should not crash by kernel .
>
> The kernel version I used is 3.4.0
> I have add the kernel log and the call stack recovered by trace32
> tools Pls have a look at it .
>
>
> Thanks .
>
>
>
> Sony Mobile Communications
> Tel: +86 10 5966 9819
> Phone: 18610323092
> Address: No.16 Guangshun South Street, Chaoyang, Beijing, P.R.C.
>
> sonymobile.com
>
>
>

--
Thanks,
//richard

2013-05-29 08:46:49

by Richard Weinberger

[permalink] [raw]

Subject: Re: A bug about system call on ARM

On Wed, May 29, 2013 at 10:24 AM, Wang, Yalin <[email protected]> wrote:
> Hi
>
> I have download the latest linux kernel code 3.9.4
> And Compare with 3.4.0 kernel .
>
> It seems there is no change for this part ,
> So it will still happen .
> Does anyone know who is responsible for arm arch part kernel code ?

See MAINTAINERS file.
CC'ing [email protected]

>
> Thanks
>
>
> -----Original Message-----
> From: Wang, Yalin
> Sent: Wednesday, May 29, 2013 3:38 PM
> To: 'richard -rw- weinberger'
> Cc: [email protected]; [email protected]
> Subject: RE: A bug about system call on ARM
>
> Hi Richard,
>
> Thanks for your reply ,
> I will make a check for this .
>
>
> -----Original Message-----
> From: richard -rw- weinberger [mailto:[email protected]]
> Sent: Wednesday, May 29, 2013 3:35 PM
> To: Wang, Yalin
> Cc: [email protected]; [email protected]
> Subject: Re: A bug about system call on ARM
>
> Hi!
>
> On Wed, May 29, 2013 at 8:52 AM, Wang, Yalin <[email protected]> wrote:
>> Hi all,
>>
>> I am a new comer to this mailing list , I am happy to join this
>> community .
>>
>> I have a bug reported from our android phones which is caused by the system call .
>> It seems like kernel bugs from my view .
>
> Is this a unmodified Linux kernel from kernel.org? In other works, no (half broken) board support package from your hardware vendor?
> Did you try a more recent kernel? (At least 3.4.47).
> Maybe your problem is already known and fixed...
>
>> Crash in file arch\arm\kernel\ entry-common.S
>>
>> /***************************************************************/
>>
>> ENTRY(vector_swi)
>> sub sp, sp, #S_FRAME_SIZE
>> stmia sp, {r0 - r12} @ Calling r0 - r12
>> ARM( add r8, sp, #S_PC )
>> ARM( stmdb r8, {sp, lr}^ ) @ Calling sp, lr
>> THUMB( mov r8, sp )
>> THUMB( store_user_sp_lr r8, r10, S_SP ) @ calling sp, lr
>> mrs r8, spsr @ called from non-FIQ mode, so ok.
>> str lr, [sp, #S_PC] @ Save calling PC
>> str r8, [sp, #S_PSR] @ Save CPSR
>> str r0, [sp, #S_OLD_R0] @ Save OLD_R0
>> zero_fp
>>
>> /*
>> * Get the system call number.
>> */
>>
>> #if defined(CONFIG_OABI_COMPAT)
>>
>> /*
>> * If we have CONFIG_OABI_COMPAT then we need to look at the swi
>> * value to determine if it is an EABI or an old ABI call.
>> */
>> #ifdef CONFIG_ARM_THUMB
>> tst r8, #PSR_T_BIT
>> movne r10, #0 @ no thumb OABI emulation
>> ldreq r10, [lr, #-4] @ get SWI instruction // crash at this instruction, when get SWI instruction
>> #else
>> ldr r10, [lr, #-4] @ get SWI instruction
>> A710( and ip, r10, #0x0f000000 @ check for SWI )
>> A710( teq ip, #0x0f000000 )
>> A710( bne .Larm710bug )
>> #endif
>> #ifdef CONFIG_CPU_ENDIAN_BE8
>> rev r10, r10 @ little endian instruction
>> #endif
>>
>> /*********************************************************************
>> ******************************/
>>
>> Then reason why it will crash when get SWI instruction is maybe This
>> page is clear to aged by kernel, But this MMU fault happpened in
>> kernel, So the kernel do_page_fault function will not clear this page
>> to young, So that will crash .
>>
>> It should poll this page to make it present or the fault should be
>> handled by fixup section , Anyway, this place should not crash by kernel .
>>
>> The kernel version I used is 3.4.0
>> I have add the kernel log and the call stack recovered by trace32
>> tools Pls have a look at it .
>>
>>
>> Thanks .
>>
>>
>>
>> Sony Mobile Communications
>> Tel: +86 10 5966 9819
>> Phone: 18610323092
>> Address: No.16 Guangshun South Street, Chaoyang, Beijing, P.R.C.
>>
>> sonymobile.com
>>
>>
>>
>
>
>
> --
> Thanks,
> //richard
> --
> To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
> the body of a message to [email protected]
> More majordomo info at http://vger.kernel.org/majordomo-info.html
> Please read the FAQ at http://www.tux.org/lkml/

--
Thanks,
//richard

2013-05-29 08:57:17

by Mikael Pettersson

[permalink] [raw]

Subject: RE: A bug about system call on ARM

Wang, Yalin writes:
> Hi
>
> I have download the latest linux kernel code 3.9.4
> And Compare with 3.4.0 kernel .
>
> It seems there is no change for this part ,
> So it will still happen .
> Does anyone know who is responsible for arm arch part kernel code ?

A quick look in MAINTAINNERS would have told you:

ARM PORT
...
L: [email protected] (moderated for non-subscribers)

Please direct ARM-specific questions to that list.

2013-05-29 09:49:03

by Will Deacon

[permalink] [raw]

Subject: Re: A bug about system call on ARM

Hello,

On Wed, May 29, 2013 at 09:46:42AM +0100, richard -rw- weinberger wrote:
> On Wed, May 29, 2013 at 10:24 AM, Wang, Yalin <[email protected]> wrote:
> > I have download the latest linux kernel code 3.9.4
> > And Compare with 3.4.0 kernel .
> >
> > It seems there is no change for this part ,
> > So it will still happen .
> > Does anyone know who is responsible for arm arch part kernel code ?
>
> See MAINTAINERS file.
> CC'ing [email protected]

Cheers for adding us to CC.

> >> #ifdef CONFIG_ARM_THUMB
> >> tst r8, #PSR_T_BIT
> >> movne r10, #0 @ no thumb OABI emulation
> >> ldreq r10, [lr, #-4] @ get SWI instruction // crash at this instruction, when get SWI instruction

Do you have the panic log please? Also, which SoC are you using and how are
you reproducing this?

> >> ldr r10, [lr, #-4] @ get SWI instruction
> >> A710( and ip, r10, #0x0f000000 @ check for SWI )
> >> A710( teq ip, #0x0f000000 )
> >> A710( bne .Larm710bug )
> >> #endif
> >> #ifdef CONFIG_CPU_ENDIAN_BE8
> >> rev r10, r10 @ little endian instruction
> >> #endif
> >>
> >> /*********************************************************************
> >> ******************************/
> >>
> >> Then reason why it will crash when get SWI instruction is maybe This
> >> page is clear to aged by kernel, But this MMU fault happpened in
> >> kernel, So the kernel do_page_fault function will not clear this page
> >> to young, So that will crash .

Sounds like we might need some USER annotations around the instruction
loads, but we should also rework the code so that we re-enable interrupts
first.

Will

2013-05-29 09:51:05

by Wang, Yalin

[permalink] [raw]

Subject: RE: A bug about system call on ARM

Hi

This is kernel.log and the stack which is recovered by Trace32 tools.
Please have a look at it .

Thanks

-----Original Message-----
From: Will Deacon [mailto:[email protected]]
Sent: Wednesday, May 29, 2013 5:48 PM
To: richard -rw- weinberger
Cc: Wang, Yalin; [email protected]; [email protected]; [email protected]
Subject: Re: A bug about system call on ARM

Hello,

On Wed, May 29, 2013 at 09:46:42AM +0100, richard -rw- weinberger wrote:
> On Wed, May 29, 2013 at 10:24 AM, Wang, Yalin <[email protected]> wrote:
> > I have download the latest linux kernel code 3.9.4 And Compare with
> > 3.4.0 kernel .
> >
> > It seems there is no change for this part , So it will still happen
> > .
> > Does anyone know who is responsible for arm arch part kernel code ?
>
> See MAINTAINERS file.
> CC'ing [email protected]

Cheers for adding us to CC.

> >> #ifdef CONFIG_ARM_THUMB
> >> tst r8, #PSR_T_BIT
> >> movne r10, #0 @ no thumb OABI emulation
> >> ldreq r10, [lr, #-4] @ get SWI instruction // crash at this instruction, when get SWI instruction

Do you have the panic log please? Also, which SoC are you using and how are you reproducing this?

> >> ldr r10, [lr, #-4] @ get SWI instruction
> >> A710( and ip, r10, #0x0f000000 @ check for SWI )
> >> A710( teq ip, #0x0f000000 )
> >> A710( bne .Larm710bug )
> >> #endif
> >> #ifdef CONFIG_CPU_ENDIAN_BE8
> >> rev r10, r10 @ little endian instruction
> >> #endif
> >>
> >> /******************************************************************
> >> ***
> >> ******************************/
> >>
> >> Then reason why it will crash when get SWI instruction is maybe
> >> This page is clear to aged by kernel, But this MMU fault happpened
> >> in kernel, So the kernel do_page_fault function will not clear this
> >> page to young, So that will crash .

Sounds like we might need some USER annotations around the instruction loads, but we should also rework the code so that we re-enable interrupts first.

Will

Attachments:

kernel.log (128.00 kB)
kernel.log stack.txt (747.00 B)
stack.txt Download all attachments

2013-05-30 01:41:56

by Wang, Yalin

[permalink] [raw]

Subject: RE: A bug about system call on ARM

Hi Will,

Have you received the log files?

And is there someone looking at this issue now ?
This issue happened on Qcom Scorpoin CPUs,
And it just happened in our stability test occasionally .

If you have some patch for this issue,
I can do the test for it .

Thanks for your help very much !

-----Original Message-----
From: Wang, Yalin
Sent: Wednesday, May 29, 2013 5:51 PM
To: 'Will Deacon'; richard -rw- weinberger
Cc: [email protected]; [email protected]; [email protected]
Subject: RE: A bug about system call on ARM

Hi

This is kernel.log and the stack which is recovered by Trace32 tools.
Please have a look at it .

Thanks

-----Original Message-----
From: Will Deacon [mailto:[email protected]]
Sent: Wednesday, May 29, 2013 5:48 PM
To: richard -rw- weinberger
Cc: Wang, Yalin; [email protected]; [email protected]; [email protected]
Subject: Re: A bug about system call on ARM

Hello,

On Wed, May 29, 2013 at 09:46:42AM +0100, richard -rw- weinberger wrote:
> On Wed, May 29, 2013 at 10:24 AM, Wang, Yalin <[email protected]> wrote:
> > I have download the latest linux kernel code 3.9.4 And Compare with
> > 3.4.0 kernel .
> >
> > It seems there is no change for this part , So it will still happen
> > .
> > Does anyone know who is responsible for arm arch part kernel code ?
>
> See MAINTAINERS file.
> CC'ing [email protected]

Cheers for adding us to CC.

> >> #ifdef CONFIG_ARM_THUMB
> >> tst r8, #PSR_T_BIT
> >> movne r10, #0 @ no thumb OABI emulation
> >> ldreq r10, [lr, #-4] @ get SWI instruction // crash at this instruction, when get SWI instruction

Do you have the panic log please? Also, which SoC are you using and how are you reproducing this?

> >> ldr r10, [lr, #-4] @ get SWI instruction
> >> A710( and ip, r10, #0x0f000000 @ check for SWI )
> >> A710( teq ip, #0x0f000000 )
> >> A710( bne .Larm710bug )
> >> #endif
> >> #ifdef CONFIG_CPU_ENDIAN_BE8
> >> rev r10, r10 @ little endian instruction
> >> #endif
> >>
> >> /******************************************************************
> >> ***
> >> ******************************/
> >>
> >> Then reason why it will crash when get SWI instruction is maybe
> >> This page is clear to aged by kernel, But this MMU fault happpened
> >> in kernel, So the kernel do_page_fault function will not clear this
> >> page to young, So that will crash .

Sounds like we might need some USER annotations around the instruction loads, but we should also rework the code so that we re-enable interrupts first.

Will

2013-05-30 09:11:15

by Will Deacon

[permalink] [raw]

Subject: Re: A bug about system call on ARM

On Thu, May 30, 2013 at 02:41:42AM +0100, Wang, Yalin wrote:
> Hi Will,

Hello,

> Have you received the log files?

Yep, and you seem to be completely correct: CPU0 ages the page from which
CPU1 just executed a system call, so we explode trying to load the swi
instruction in order to retrieve the immediate.

> And is there someone looking at this issue now ?

It's on my list, but I'm pretty busy right now and OABI-compat isn't high
priority. Are you actually running OABI binaries? If not, you can simply
turn that option off (in fact, a quick fix to this issue is to make that
depend on !SMP).

> This issue happened on Qcom Scorpoin CPUs,
> And it just happened in our stability test occasionally .
>
> If you have some patch for this issue,
> I can do the test for it .

I'll have a look at cooking something which uses an exception table entry
to rewind the PC and retry the system call. That's simpler than directly
injecting a user page fault from the system call path.

Will

2013-05-30 11:42:15

by Will Deacon

[permalink] [raw]

Subject: Re: A bug about system call on ARM

On Thu, May 30, 2013 at 10:09:49AM +0100, Will Deacon wrote:
> On Thu, May 30, 2013 at 02:41:42AM +0100, Wang, Yalin wrote:
> > If you have some patch for this issue,
> > I can do the test for it .
>
> I'll have a look at cooking something which uses an exception table entry
> to rewind the PC and retry the system call. That's simpler than directly
> injecting a user page fault from the system call path.

Ok, please can you try the following?

Will

--->8

diff --git a/arch/arm/kernel/entry-common.S b/arch/arm/kernel/entry-common.S
index bc5bc0a..855926e 100644
--- a/arch/arm/kernel/entry-common.S
+++ b/arch/arm/kernel/entry-common.S
@@ -361,6 +361,15 @@ ENTRY(vector_swi)
str r8, [sp, #S_PSR] @ Save CPSR
str r0, [sp, #S_OLD_R0] @ Save OLD_R0
zero_fp
+ enable_irq
+ ct_user_exit
+
+#ifdef CONFIG_ALIGNMENT_TRAP
+ ldr ip, __cr_alignment
+ ldr ip, [ip]
+ mcr p15, 0, ip, c1, c0 @ update control register
+#endif
+ get_thread_info tsk

/*
* Get the system call number.
@@ -375,9 +384,9 @@ ENTRY(vector_swi)
#ifdef CONFIG_ARM_THUMB
tst r8, #PSR_T_BIT
movne r10, #0 @ no thumb OABI emulation
- ldreq r10, [lr, #-4] @ get SWI instruction
+ USER( ldreq r10, [lr, #-4] ) @ get SWI instruction
#else
- ldr r10, [lr, #-4] @ get SWI instruction
+ USER( ldr r10, [lr, #-4] ) @ get SWI instruction
#endif
#ifdef CONFIG_CPU_ENDIAN_BE8
rev r10, r10 @ little endian instruction
@@ -392,22 +401,13 @@ ENTRY(vector_swi)
/* Legacy ABI only, possibly thumb mode. */
tst r8, #PSR_T_BIT @ this is SPSR from save_user_regs
addne scno, r7, #__NR_SYSCALL_BASE @ put OS number in
- ldreq scno, [lr, #-4]
+ USER( ldreq scno, [lr, #-4] )

#else
/* Legacy ABI only. */
- ldr scno, [lr, #-4] @ get SWI instruction
-#endif
-
-#ifdef CONFIG_ALIGNMENT_TRAP
- ldr ip, __cr_alignment
- ldr ip, [ip]
- mcr p15, 0, ip, c1, c0 @ update control register
+ USER( ldr scno, [lr, #-4] ) @ get SWI instruction
#endif
- enable_irq
- ct_user_exit

- get_thread_info tsk
adr tbl, sys_call_table @ load syscall table pointer

#if defined(CONFIG_OABI_COMPAT)
@@ -442,6 +442,18 @@ local_restart:
eor r0, scno, #__NR_SYSCALL_BASE @ put OS number back
bcs arm_syscall
b sys_ni_syscall @ not private func
+
+#if defined(CONFIG_OABI_COMPAT) || !defined(CONFIG_AEABI)
+ /*
+ * We may have faulted trying to load the SWI instruction due to
+ * concurrent page aging on another CPU. In this case, return
+ * back to the swi instruction and fault the page back.
+ */
+9001:
+ sub lr, lr, #4
+ str lr, [sp, #S_PC]
+ b ret_fast_syscall
+#endif
ENDPROC(vector_swi)

/*

2013-05-31 02:56:47

by Wang, Yalin

[permalink] [raw]

Subject: RE: A bug about system call on ARM

Hi Will,

Thanks for your patch ,

But I found I don't have ct_user_exit macro
In my arch/arm/kernel/entry-common.S

My kernel version is 3.4.0

I have add the file as attachment,

Could you make a patch for this file ?

Thank you !

-----Original Message-----
From: Will Deacon [mailto:[email protected]]
Sent: Thursday, May 30, 2013 7:41 PM
To: Wang, Yalin
Cc: 'richard -rw- weinberger'; '[email protected]'; '[email protected]'; '[email protected]'
Subject: Re: A bug about system call on ARM

On Thu, May 30, 2013 at 10:09:49AM +0100, Will Deacon wrote:
> On Thu, May 30, 2013 at 02:41:42AM +0100, Wang, Yalin wrote:
> > If you have some patch for this issue, I can do the test for it .
>
> I'll have a look at cooking something which uses an exception table
> entry to rewind the PC and retry the system call. That's simpler than
> directly injecting a user page fault from the system call path.

Ok, please can you try the following?

Will

--->8

diff --git a/arch/arm/kernel/entry-common.S b/arch/arm/kernel/entry-common.S index bc5bc0a..855926e 100644
--- a/arch/arm/kernel/entry-common.S
+++ b/arch/arm/kernel/entry-common.S
@@ -361,6 +361,15 @@ ENTRY(vector_swi)
str r8, [sp, #S_PSR] @ Save CPSR
str r0, [sp, #S_OLD_R0] @ Save OLD_R0
zero_fp
+ enable_irq
+ ct_user_exit
+
+#ifdef CONFIG_ALIGNMENT_TRAP
+ ldr ip, __cr_alignment
+ ldr ip, [ip]
+ mcr p15, 0, ip, c1, c0 @ update control register
+#endif
+ get_thread_info tsk

/*
* Get the system call number.
@@ -375,9 +384,9 @@ ENTRY(vector_swi)
#ifdef CONFIG_ARM_THUMB
tst r8, #PSR_T_BIT
movne r10, #0 @ no thumb OABI emulation
- ldreq r10, [lr, #-4] @ get SWI instruction
+ USER( ldreq r10, [lr, #-4] ) @ get SWI instruction
#else
- ldr r10, [lr, #-4] @ get SWI instruction
+ USER( ldr r10, [lr, #-4] ) @ get SWI instruction
#endif
#ifdef CONFIG_CPU_ENDIAN_BE8
rev r10, r10 @ little endian instruction
@@ -392,22 +401,13 @@ ENTRY(vector_swi)
/* Legacy ABI only, possibly thumb mode. */
tst r8, #PSR_T_BIT @ this is SPSR from save_user_regs
addne scno, r7, #__NR_SYSCALL_BASE @ put OS number in
- ldreq scno, [lr, #-4]
+ USER( ldreq scno, [lr, #-4] )

#else
/* Legacy ABI only. */
- ldr scno, [lr, #-4] @ get SWI instruction
-#endif
-
-#ifdef CONFIG_ALIGNMENT_TRAP
- ldr ip, __cr_alignment
- ldr ip, [ip]
- mcr p15, 0, ip, c1, c0 @ update control register
+ USER( ldr scno, [lr, #-4] ) @ get SWI instruction
#endif
- enable_irq
- ct_user_exit

- get_thread_info tsk
adr tbl, sys_call_table @ load syscall table pointer

#if defined(CONFIG_OABI_COMPAT)
@@ -442,6 +442,18 @@ local_restart:
eor r0, scno, #__NR_SYSCALL_BASE @ put OS number back
bcs arm_syscall
b sys_ni_syscall @ not private func
+
+#if defined(CONFIG_OABI_COMPAT) || !defined(CONFIG_AEABI)
+ /*
+ * We may have faulted trying to load the SWI instruction due to
+ * concurrent page aging on another CPU. In this case, return
+ * back to the swi instruction and fault the page back.
+ */
+9001:
+ sub lr, lr, #4
+ str lr, [sp, #S_PC]
+ b ret_fast_syscall
+#endif
ENDPROC(vector_swi)

/*

Attachments:

entry-common.S (14.72 kB)
entry-common.S

2013-05-31 03:55:03

by Nicolas Pitre

[permalink] [raw]

Subject: Re: A bug about system call on ARM

On Thu, 30 May 2013, Will Deacon wrote:

> On Thu, May 30, 2013 at 10:09:49AM +0100, Will Deacon wrote:
> > On Thu, May 30, 2013 at 02:41:42AM +0100, Wang, Yalin wrote:
> > > If you have some patch for this issue,
> > > I can do the test for it .
> >
> > I'll have a look at cooking something which uses an exception table entry
> > to rewind the PC and retry the system call. That's simpler than directly
> > injecting a user page fault from the system call path.
>
> Ok, please can you try the following?
>
> Will
>
> --->8
>
> diff --git a/arch/arm/kernel/entry-common.S b/arch/arm/kernel/entry-common.S
> index bc5bc0a..855926e 100644
> --- a/arch/arm/kernel/entry-common.S
> +++ b/arch/arm/kernel/entry-common.S
> @@ -361,6 +361,15 @@ ENTRY(vector_swi)
> str r8, [sp, #S_PSR] @ Save CPSR
> str r0, [sp, #S_OLD_R0] @ Save OLD_R0
> zero_fp
> + enable_irq
> + ct_user_exit
> +
> +#ifdef CONFIG_ALIGNMENT_TRAP
> + ldr ip, __cr_alignment
> + ldr ip, [ip]
> + mcr p15, 0, ip, c1, c0 @ update control register
> +#endif

This is wrong. you must set up the align bit in the control register
_before_ enabling IRQs or an IRQ handler might run without alignment
fixup.

Otherwise the patch looks good to me.

Nicolas

2013-05-31 08:45:47

by Will Deacon

[permalink] [raw]

Subject: Re: A bug about system call on ARM

On Fri, May 31, 2013 at 04:54:56AM +0100, Nicolas Pitre wrote:
> On Thu, 30 May 2013, Will Deacon wrote:
>
> > On Thu, May 30, 2013 at 10:09:49AM +0100, Will Deacon wrote:
> > > On Thu, May 30, 2013 at 02:41:42AM +0100, Wang, Yalin wrote:
> > > > If you have some patch for this issue,
> > > > I can do the test for it .
> > >
> > > I'll have a look at cooking something which uses an exception table entry
> > > to rewind the PC and retry the system call. That's simpler than directly
> > > injecting a user page fault from the system call path.
> >
> > Ok, please can you try the following?
> >
> > Will
> >
> > --->8
> >
> > diff --git a/arch/arm/kernel/entry-common.S b/arch/arm/kernel/entry-common.S
> > index bc5bc0a..855926e 100644
> > --- a/arch/arm/kernel/entry-common.S
> > +++ b/arch/arm/kernel/entry-common.S
> > @@ -361,6 +361,15 @@ ENTRY(vector_swi)
> > str r8, [sp, #S_PSR] @ Save CPSR
> > str r0, [sp, #S_OLD_R0] @ Save OLD_R0
> > zero_fp
> > + enable_irq
> > + ct_user_exit
> > +
> > +#ifdef CONFIG_ALIGNMENT_TRAP
> > + ldr ip, __cr_alignment
> > + ldr ip, [ip]
> > + mcr p15, 0, ip, c1, c0 @ update control register
> > +#endif
>
> This is wrong. you must set up the align bit in the control register
> _before_ enabling IRQs or an IRQ handler might run without alignment
> fixup.

Okey doke, I can fix that up. I thought it was only needed for the network
layer, but I suppose they have interrupts over there too :)

> Otherwise the patch looks good to me.

Thanks Nicolas.

Will

2013-05-31 08:47:28

by Will Deacon

[permalink] [raw]

Subject: Re: A bug about system call on ARM

On Fri, May 31, 2013 at 03:56:31AM +0100, Wang, Yalin wrote:
> Hi Will,
>
> Thanks for your patch ,
>
> But I found I don't have ct_user_exit macro
> In my arch/arm/kernel/entry-common.S
>
> My kernel version is 3.4.0

Well things have moved on this since then (we're approaching 3.10, so you
might consider an upgrade!).

For the purposes of this patch, you can just delete the ct_user_exit line.

Will

2013-05-31 11:02:59

by Wang, Yalin

[permalink] [raw]

Subject: RE: A bug about system call on ARM

Hi Will,

I have merge your code ,
But there is a different ,

+
+ ct_user_exit
+
+#ifdef CONFIG_ALIGNMENT_TRAP
+ ldr ip, __cr_alignment
+ ldr ip, [ip]
+ mcr p15, 0, ip, c1, c0 @ update control register
+#endif

+ enable_irq
+ get_thread_info tsk

Is this change ok ?

Thanks

-----Original Message-----
From: [email protected] [mailto:[email protected]] On Behalf Of Will Deacon
Sent: Friday, May 31, 2013 4:47 PM
To: Wang, Yalin
Cc: 'richard -rw- weinberger'; '[email protected]'; '[email protected]'; '[email protected]'
Subject: Re: A bug about system call on ARM

On Fri, May 31, 2013 at 03:56:31AM +0100, Wang, Yalin wrote:
> Hi Will,
>
> Thanks for your patch ,
>
> But I found I don't have ct_user_exit macro In my
> arch/arm/kernel/entry-common.S
>
> My kernel version is 3.4.0

Well things have moved on this since then (we're approaching 3.10, so you might consider an upgrade!).

For the purposes of this patch, you can just delete the ct_user_exit line.

Will

2013-05-31 11:13:46

by Will Deacon

[permalink] [raw]

Subject: Re: A bug about system call on ARM

On Fri, May 31, 2013 at 12:02:49PM +0100, Wang, Yalin wrote:
> Hi Will,
>
> I have merge your code ,
> But there is a different ,
>
> +
> + ct_user_exit

I thought you didn't have ct_user_exit? In which case, just delete this
line.

> +#ifdef CONFIG_ALIGNMENT_TRAP
> + ldr ip, __cr_alignment
> + ldr ip, [ip]
> + mcr p15, 0, ip, c1, c0 @ update control register
> +#endif
>
> + enable_irq
> + get_thread_info tsk

Hard to tell without context. You can take a look at my git tree if you
like (I fixed it up based on Nico's comment):

https://git.kernel.org/cgit/linux/kernel/git/will/linux.git/commit/?h=misc-patches

Will

2013-05-31 11:30:45

by Wang, Yalin

[permalink] [raw]

Subject: RE: A bug about system call on ARM

Hi Will,

I see,
I will make one more test .

Thanks for your clarification .

-----Original Message-----
From: Will Deacon [mailto:[email protected]]
Sent: Friday, May 31, 2013 7:13 PM
To: Wang, Yalin
Cc: 'richard -rw- weinberger'; '[email protected]'; '[email protected]'; '[email protected]'
Subject: Re: A bug about system call on ARM

On Fri, May 31, 2013 at 12:02:49PM +0100, Wang, Yalin wrote:
> Hi Will,
>
> I have merge your code ,
> But there is a different ,
>
> +
> + ct_user_exit

I thought you didn't have ct_user_exit? In which case, just delete this line.

> +#ifdef CONFIG_ALIGNMENT_TRAP
> + ldr ip, __cr_alignment
> + ldr ip, [ip]
> + mcr p15, 0, ip, c1, c0 @ update control register
> +#endif
>
> + enable_irq
> + get_thread_info tsk

Hard to tell without context. You can take a look at my git tree if you like (I fixed it up based on Nico's comment):

https://git.kernel.org/cgit/linux/kernel/git/will/linux.git/commit/?h=misc-patches

Will

2013-05-31 16:49:00

by Nicolas Pitre

[permalink] [raw]

Subject: Re: A bug about system call on ARM

On Fri, 31 May 2013, Will Deacon wrote:

> Hard to tell without context. You can take a look at my git tree if you
> like (I fixed it up based on Nico's comment):
>
> https://git.kernel.org/cgit/linux/kernel/git/will/linux.git/commit/?h=misc-patches

If you like, you may add "Reviewed-by: Nicolas Pitre <[email protected]>"
to it.

Nicolas

2013-05-31 16:53:40

by Will Deacon

[permalink] [raw]

Subject: Re: A bug about system call on ARM

On Fri, May 31, 2013 at 05:48:56PM +0100, Nicolas Pitre wrote:
> On Fri, 31 May 2013, Will Deacon wrote:
>
> > Hard to tell without context. You can take a look at my git tree if you
> > like (I fixed it up based on Nico's comment):
> >
> > https://git.kernel.org/cgit/linux/kernel/git/will/linux.git/commit/?h=misc-patches
>
> If you like, you may add "Reviewed-by: Nicolas Pitre <[email protected]>"
> to it.

Cheers Nicolas, will do.

Will

2013-06-03 05:25:36

by Wang, Yalin

[permalink] [raw]

Subject: RE: A bug about system call on ARM

Hi Will,

I have a question about this patch .

If the user space is thumb mode,
The PC should be rewind by 2 bytes,
So the fix_up code should be

Sub lr, lr, #2 .

Am I right ?

Thanks for your help .

-----Original Message-----
From: Wang, Yalin
Sent: Friday, May 31, 2013 7:31 PM
To: 'Will Deacon'
Cc: 'richard -rw- weinberger'; '[email protected]'; '[email protected]'; '[email protected]'
Subject: RE: A bug about system call on ARM

Hi Will,

I see,
I will make one more test .

Thanks for your clarification .

-----Original Message-----
From: Will Deacon [mailto:[email protected]]
Sent: Friday, May 31, 2013 7:13 PM
To: Wang, Yalin
Cc: 'richard -rw- weinberger'; '[email protected]'; '[email protected]'; '[email protected]'
Subject: Re: A bug about system call on ARM

On Fri, May 31, 2013 at 12:02:49PM +0100, Wang, Yalin wrote:
> Hi Will,
>
> I have merge your code ,
> But there is a different ,
>
> +
> + ct_user_exit

I thought you didn't have ct_user_exit? In which case, just delete this line.

> +#ifdef CONFIG_ALIGNMENT_TRAP
> + ldr ip, __cr_alignment
> + ldr ip, [ip]
> + mcr p15, 0, ip, c1, c0 @ update control register
> +#endif
>
> + enable_irq
> + get_thread_info tsk

Hard to tell without context. You can take a look at my git tree if you like (I fixed it up based on Nico's comment):

https://git.kernel.org/cgit/linux/kernel/git/will/linux.git/commit/?h=misc-patches

Will

2013-06-03 09:55:42

by Will Deacon

[permalink] [raw]

Subject: Re: A bug about system call on ARM

On Mon, Jun 03, 2013 at 06:25:26AM +0100, Wang, Yalin wrote:
> Hi Will,
>
> I have a question about this patch .
>
> If the user space is thumb mode,
> The PC should be rewind by 2 bytes,
> So the fix_up code should be
>
> Sub lr, lr, #2 .
>
>
> Am I right ?

No, because we don't have OABI-compat support for Thumb applications and
force everything down the EABI path instead.

Did you manage to test the patch?

Will

2013-06-03 09:58:26

by Wang, Yalin

[permalink] [raw]

Subject: RE: A bug about system call on ARM

Hi Will

Oh I see,
Thanks for your reply

Yes , we are testing for it ,
But need some time to wait for the result ,
Because The stability test need some time to reproduce
this issue , And this issue doesn't reproduce 100% .

-----Original Message-----
From: Will Deacon [mailto:[email protected]]
Sent: Monday, June 03, 2013 5:54 PM
To: Wang, Yalin
Cc: 'richard -rw- weinberger'; '[email protected]'; '[email protected]'; '[email protected]'
Subject: Re: A bug about system call on ARM

On Mon, Jun 03, 2013 at 06:25:26AM +0100, Wang, Yalin wrote:
> Hi Will,
>
> I have a question about this patch .
>
> If the user space is thumb mode,
> The PC should be rewind by 2 bytes,
> So the fix_up code should be
>
> Sub lr, lr, #2 .
>
>
> Am I right ?

No, because we don't have OABI-compat support for Thumb applications and force everything down the EABI path instead.

Did you manage to test the patch?

Will

2013-06-03 10:18:26

by Russell King - ARM Linux

[permalink] [raw]

Subject: Re: A bug about system call on ARM

On Thu, May 30, 2013 at 12:41:12PM +0100, Will Deacon wrote:
> +#if defined(CONFIG_OABI_COMPAT) || !defined(CONFIG_AEABI)
> + /*
> + * We may have faulted trying to load the SWI instruction due to
> + * concurrent page aging on another CPU. In this case, return
> + * back to the swi instruction and fault the page back.
> + */
> +9001:
> + sub lr, lr, #4
> + str lr, [sp, #S_PC]
> + b ret_fast_syscall
> +#endif

The comment is wrong. If we get here, it means that the fault from
trying to loading the instruction can't be fixed up. Arguably, that
should result in a SIGSEGV being sent immediately, but we'll get to
that when we then try to re-load the instruction.

What it means is that the page we were trying to execute has been
unmapped beneath us.

BTW, I notice that the kernel oops was never posted to the list, so it's
impossible for other people following this thread to see what the real
problem is...

2013-06-03 10:28:04

by Will Deacon

[permalink] [raw]

Subject: Re: A bug about system call on ARM

Hi Russell,

On Mon, Jun 03, 2013 at 11:18:09AM +0100, Russell King - ARM Linux wrote:
> On Thu, May 30, 2013 at 12:41:12PM +0100, Will Deacon wrote:
> > +#if defined(CONFIG_OABI_COMPAT) || !defined(CONFIG_AEABI)
> > + /*
> > + * We may have faulted trying to load the SWI instruction due to
> > + * concurrent page aging on another CPU. In this case, return
> > + * back to the swi instruction and fault the page back.
> > + */
> > +9001:
> > + sub lr, lr, #4
> > + str lr, [sp, #S_PC]
> > + b ret_fast_syscall
> > +#endif
>
> The comment is wrong. If we get here, it means that the fault from
> trying to loading the instruction can't be fixed up. Arguably, that
> should result in a SIGSEGV being sent immediately, but we'll get to
> that when we then try to re-load the instruction.

Why would we kill the application in this case? The reported problem is
where one CPU ages the page containing the swi instruction (mkold =>
clears L_PTE_YOUNG => write 0 to the pte) in between the other CPU executing
the swi and the kernel trying to read the immediate. The VMA is fine.

> What it means is that the page we were trying to execute has been
> unmapped beneath us.

Yes, as a result of the kernel aging it.

> BTW, I notice that the kernel oops was never posted to the list, so it's
> impossible for other people following this thread to see what the real
> problem is...

It was sent as an attachment I think, so I've pasted the log below (you can
see CPU0 unmapping the page from which CPU1 is assumedly executing).

Will

--->8

<1>[44330.850628] Unable to handle kernel paging request at virtual address 4020841c
<1>[44330.850750] pgd = c490c000
<1>[44330.850841] [4020841c] *pgd=84451831, *pte=bf05859d, *ppte=00000000
<0>[44330.851055] Internal error: Oops: 17 [#1] PREEMPT SMP ARM
<4>[44330.851146] Modules linked in: hid_sony(O)
<4>[44330.851330] CPU: 1 Tainted: G W O (3.4.0-perf-gf496dca-01162-gcbcc62b #1)
<4>[44330.851421] PC is at vector_swi+0x28/0x88
<4>[44330.851482] LR is at 0x40208420
<4>[44330.851604] pc : [<c000dfe8>] lr : [<40208420>] psr: 60000093
<4>[44330.851604] sp : e0601fb0 ip : 40092f78 fp : befcd6fc
<4>[44330.851757] r10: 4005a040 r9 : 4012fa70 r8 : 60000010
<4>[44330.851818] r7 : 00000107 r6 : 00000003 r5 : 4005a030 r4 : 5746d378
<4>[44330.851909] r3 : 5883fbf8 r2 : 00000000 r1 : befcd6d0 r0 : 00000001
<4>[44330.851971] Flags: nZCv IRQs off FIQs on Mode SVC_32 ISA ARM Segment user
<4>[44330.852093] Control: 10c5787d Table: 84b0c06a DAC: 00000015
<4>[44330.852154]
<4>[44330.852154] PC: 0xc000df68:
<4>[44330.852306] df68 e31100ff 1afffff0 e59d1040 e5bde03c e16ff001 f57ff01f e95d7fff e1a00000
<4>[44330.852825] df88 e28dd00c e1b0f00e eb025aba e1a096ad e1a09689 e5991000 e3a08001 e3110c03
<4>[44330.853405] dfa8 0affffec e1a0100d e3a00001 eb000911 eaffffe8 e320f000 e24dd048 e88d1fff
<4>[44330.853954] dfc8 e28d803c e9486000 e14f8000 e58de03c e58d8040 e58d0044 e3180020 13a0a000
<4>[44330.854534] dfe8 051ea004 e59fc0ac e59cc000 ee01cf10 f1080080 e1a096ad e1a09689 e28f809c
<4>[44330.855053] e008 e3daa4ff 122a7609 159f808c e599a000 e92d0030 e31a0c03 1a000008 e3570f5f
<4>[44330.855602] e028 e24fee13 3798f107 e28d1008 e3a08000 e357080f e2270000 2a000fac ea022fd4
<4>[44330.856152] e048 e1a02007 e28d1008 e3a00000 eb0008e9 e28fe014 e1a07000 e28d1008 e3570f5f
<4>[44330.856732]
<4>[44330.856732] SP: 0xe0601f30:
<4>[44330.856823] 1f30 00000000 00000000 c0d44f98 00000001 e18fce48 00000002 e565fa80 00000000
<4>[44330.857373] 1f50 c000dfe8 60000093 ffffffff e0601f9c 60000010 c0731518 00000001 befcd6d0
<4>[44330.857891] 1f70 00000000 5883fbf8 5746d378 4005a030 00000003 00000107 60000010 4012fa70
<4>[44330.858471] 1f90 4005a040 befcd6fc 40092f78 e0601fb0 40208420 c000dfe8 60000093 ffffffff
<4>[44330.859021] 1fb0 00000001 befcd6d0 00000000 5883fbf8 5746d378 4005a030 00000003 00000107
<4>[44330.859570] 1fd0 befcd6e8 4012fa70 4005a040 befcd6fc 40092f78 befcd6c8 4008aef9 40208420
<4>[44330.860089] 1ff0 60000010 00000001 ffe6e6e6 ffe6e6e6 00000000 00000000 00000000 00000000
<4>[44330.860669] 2010 00000000 00000000 00000000 00000000 00000000 00000000 00000000 00000000
<0>[44330.861218] Process ndroid.settings (pid: 25518, stack limit = 0xe06002f0)
<0>[44330.861279] Stack: (0xe0601fb0 to 0xe0602000)
<0>[44330.861401] 1fa0: 00000001 befcd6d0 00000000 5883fbf8
<0>[44330.861462] 1fc0: 5746d378 4005a030 00000003 00000107 befcd6e8 4012fa70 4005a040 befcd6fc
<0>[44330.861584] 1fe0: 40092f78 befcd6c8 4008aef9 40208420 60000010 00000001 ffe6e6e6 ffe6e6e6
<0>[44330.861707] Code: e58d8040 e58d0044 e3180020 13a0a000 (051ea004)
<4>[44330.861890] ---[ end trace da227214a82491c0 ]---
<0>[44330.862012] Kernel panic - not syncing: Fatal exception
<2>[44330.862073] CPU0: stopping
<4>[44330.862195] [<c0014df8>] (unwind_backtrace+0x0/0x11c) from [<c001339c>] (handle_IPI+0x110/0x224)
<4>[44330.862256] [<c001339c>] (handle_IPI+0x110/0x224) from [<c000868c>] (gic_handle_irq+0x104/0x110)
<4>[44330.862378] [<c000868c>] (gic_handle_irq+0x104/0x110) from [<c0731580>] (__irq_svc+0x40/0x70)
<4>[44330.862470] Exception stack(0xc7439b10 to 0xc7439b58)
<4>[44330.862531] 9b00: 00000001 bf05f000 00000004 1a55b000
<4>[44330.862653] 9b20: bf05f59d d2b4ac08 000bf05f ca5ca03c 00000000 c7439c24 4020f000 00000001
<4>[44330.862744] 9b40: c0f2ab50 c7439b58 c0123038 c0123040 00000113 ffffffff
<4>[44330.862836] [<c0731580>] (__irq_svc+0x40/0x70) from [<c0123040>] (memblock_is_memory+0x18/0x20)
<4>[44330.862958] [<c0123040>] (memblock_is_memory+0x18/0x20) from [<c00198b4>] (__sync_icache_dcache+0x40/0x9c)
<4>[44330.863049] [<c00198b4>] (__sync_icache_dcache+0x40/0x9c) from [<c0121a44>] (ptep_clear_flush_young+0x3c/0x60)
<4>[44330.863171] [<c0121a44>] (ptep_clear_flush_young+0x3c/0x60) from [<c011d3b0>] (page_referenced_one+0x6c/0xfc)
<4>[44330.863294] [<c011d3b0>] (page_referenced_one+0x6c/0xfc) from [<c011e9e4>] (page_referenced+0x1a8/0x200)
<4>[44330.863416] [<c011e9e4>] (page_referenced+0x1a8/0x200) from [<c010522c>] (shrink_active_list.isra.49+0x1ec/0x2e8)
<4>[44330.863477] [<c010522c>] (shrink_active_list.isra.49+0x1ec/0x2e8) from [<c0106554>] (shrink_mem_cgroup_zone+0x338/0x4c4)
<4>[44330.863599] [<c0106554>] (shrink_mem_cgroup_zone+0x338/0x4c4) from [<c010740c>] (try_to_free_pages+0x2a0/0x570)
<4>[44330.863690] [<c010740c>] (try_to_free_pages+0x2a0/0x570) from [<c00fbff0>] (__alloc_pages_nodemask+0x424/0x758)
<4>[44330.863812] [<c00fbff0>] (__alloc_pages_nodemask+0x424/0x758) from [<c0116dcc>] (handle_pte_fault+0x184/0x7c8)
<4>[44330.863934] [<c0116dcc>] (handle_pte_fault+0x184/0x7c8) from [<c011751c>] (handle_mm_fault+0x10c/0x128)
<4>[44330.864057] [<c011751c>] (handle_mm_fault+0x10c/0x128) from [<c0732cc0>] (do_page_fault+0x180/0x3c0)
<4>[44330.864148] [<c0732cc0>] (do_page_fault+0x180/0x3c0) from [<c000847c>] (do_DataAbort+0x134/0x1a8)
<4>[44330.864270] [<c000847c>] (do_DataAbort+0x134/0x1a8) from [<c07316f4>] (__dabt_usr+0x34/0x40)
<4>[44330.864331] Exception stack(0xc7439fb0 to 0xc7439ff8)
<4>[44330.864453] 9fa0: 40baa000 40fab040 0002d440 00000000
<4>[44330.864545] 9fc0: 41ed0f10 00000001 408d72c0 41ed0f2c 0000001c 00000000 40fab400 00000001
<4>[44330.864606] 9fe0: 00000380 59c70e10 00000400 40209364 20000010 ffffffff

2013-06-03 10:45:48

by Russell King - ARM Linux

[permalink] [raw]

Subject: Re: A bug about system call on ARM

On Mon, Jun 03, 2013 at 11:27:23AM +0100, Will Deacon wrote:
> Hi Russell,
>
> On Mon, Jun 03, 2013 at 11:18:09AM +0100, Russell King - ARM Linux wrote:
> > On Thu, May 30, 2013 at 12:41:12PM +0100, Will Deacon wrote:
> > > +#if defined(CONFIG_OABI_COMPAT) || !defined(CONFIG_AEABI)
> > > + /*
> > > + * We may have faulted trying to load the SWI instruction due to
> > > + * concurrent page aging on another CPU. In this case, return
> > > + * back to the swi instruction and fault the page back.
> > > + */
> > > +9001:
> > > + sub lr, lr, #4
> > > + str lr, [sp, #S_PC]
> > > + b ret_fast_syscall
> > > +#endif
> >
> > The comment is wrong. If we get here, it means that the fault from
> > trying to loading the instruction can't be fixed up. Arguably, that
> > should result in a SIGSEGV being sent immediately, but we'll get to
> > that when we then try to re-load the instruction.
>
> Why would we kill the application in this case? The reported problem is
> where one CPU ages the page containing the swi instruction (mkold =>
> clears L_PTE_YOUNG => write 0 to the pte) in between the other CPU executing
> the swi and the kernel trying to read the immediate. The VMA is fine.

If you mark the instruction was a user-accessing instruction, the kernel
will handle the resulting exception, trying to make the page accessible.
If it is successful, then execution resumes as normal at the faulting
instruction and continues as if nothing happened.

If it can't make the page accessible (eg, out of memory) the exception
handler path (your code above) will be called instead. Normal action in
that case would be for a system call to return -EFAULT, but in this case
we can't know what the syscall was, so we don't know if userspace will
even pay attention to the returned error code. In any case, if the page
is no longer accessible, it's going to end up being killed by a SEGV
when we eventually return to userspace anyway.

> > What it means is that the page we were trying to execute has been
> > unmapped beneath us.
>
> Yes, as a result of the kernel aging it.

No - see above. The exception path is for more serious conditions than
that.

2013-06-03 12:40:41

by Will Deacon

[permalink] [raw]

Subject: Re: A bug about system call on ARM

On Mon, Jun 03, 2013 at 11:45:34AM +0100, Russell King - ARM Linux wrote:
> On Mon, Jun 03, 2013 at 11:27:23AM +0100, Will Deacon wrote:
> > On Mon, Jun 03, 2013 at 11:18:09AM +0100, Russell King - ARM Linux wrote:
> > > On Thu, May 30, 2013 at 12:41:12PM +0100, Will Deacon wrote:
> > > > +#if defined(CONFIG_OABI_COMPAT) || !defined(CONFIG_AEABI)
> > > > + /*
> > > > + * We may have faulted trying to load the SWI instruction due to
> > > > + * concurrent page aging on another CPU. In this case, return
> > > > + * back to the swi instruction and fault the page back.
> > > > + */
> > > > +9001:
> > > > + sub lr, lr, #4
> > > > + str lr, [sp, #S_PC]
> > > > + b ret_fast_syscall
> > > > +#endif
> > >
> > > The comment is wrong. If we get here, it means that the fault from
> > > trying to loading the instruction can't be fixed up. Arguably, that
> > > should result in a SIGSEGV being sent immediately, but we'll get to
> > > that when we then try to re-load the instruction.
> >
> > Why would we kill the application in this case? The reported problem is
> > where one CPU ages the page containing the swi instruction (mkold =>
> > clears L_PTE_YOUNG => write 0 to the pte) in between the other CPU executing
> > the swi and the kernel trying to read the immediate. The VMA is fine.
>
> If you mark the instruction was a user-accessing instruction, the kernel
> will handle the resulting exception, trying to make the page accessible.
> If it is successful, then execution resumes as normal at the faulting
> instruction and continues as if nothing happened.
>
> If it can't make the page accessible (eg, out of memory) the exception
> handler path (your code above) will be called instead. Normal action in
> that case would be for a system call to return -EFAULT, but in this case
> we can't know what the syscall was, so we don't know if userspace will
> even pay attention to the returned error code. In any case, if the page
> is no longer accessible, it's going to end up being killed by a SEGV
> when we eventually return to userspace anyway.

Yes, of course, the fault handling will sort out non-fatal faults for us, so
I'll update the comment.

Thanks,

Will

2013-06-04 05:33:27

by Wang, Yalin

[permalink] [raw]

Subject: RE: A bug about system call on ARM

Hi Will,

Could I know what's your git branch is mainly used for ?

https://git.kernel.org/cgit/linux/kernel/git/will/linux.git

I mean if the branch is used for ARM arch maintenance ?
If yes, I think I can send future bugs about ARM to you directly,
And do not need ping-pang in the mail list .

Thanks for your help .

-----Original Message-----
From: Wang, Yalin
Sent: Monday, June 03, 2013 5:58 PM
To: 'Will Deacon'
Cc: 'richard -rw- weinberger'; '[email protected]'; '[email protected]'; '[email protected]'
Subject: RE: A bug about system call on ARM

Hi Will

Oh I see,
Thanks for your reply

Yes , we are testing for it ,
But need some time to wait for the result , Because The stability test need some time to reproduce this issue , And this issue doesn't reproduce 100% .

-----Original Message-----
From: Will Deacon [mailto:[email protected]]
Sent: Monday, June 03, 2013 5:54 PM
To: Wang, Yalin
Cc: 'richard -rw- weinberger'; '[email protected]'; '[email protected]'; '[email protected]'
Subject: Re: A bug about system call on ARM

On Mon, Jun 03, 2013 at 06:25:26AM +0100, Wang, Yalin wrote:
> Hi Will,
>
> I have a question about this patch .
>
> If the user space is thumb mode,
> The PC should be rewind by 2 bytes,
> So the fix_up code should be
>
> Sub lr, lr, #2 .
>
>
> Am I right ?

No, because we don't have OABI-compat support for Thumb applications and force everything down the EABI path instead.

Did you manage to test the patch?

Will

2013-06-04 08:49:54

by Will Deacon

[permalink] [raw]

Subject: Re: A bug about system call on ARM

On Tue, Jun 04, 2013 at 06:33:20AM +0100, Wang, Yalin wrote:
> Hi Will,

Hello,

> Could I know what's your git branch is mainly used for ?
>
> https://git.kernel.org/cgit/linux/kernel/git/will/linux.git
>
>
> I mean if the branch is used for ARM arch maintenance ?

I send most of my patches via Russell, and the tree above is where I keep my
patches whilst they're not yet in mainline.

> If yes, I think I can send future bugs about ARM to you directly,
> And do not need ping-pang in the mail list .

Quite the opposite! Having discussions on the mailing list is key to how the
kernel is developed, so please continue to send questions, bug reports and
patches there. As has been pointed out previously in this thread, you need
to choose the right list in the first place, rather than just sending to
LKML.

Of course, you can always CC me on arm/arm64 patches if you like.

Cheers,

Will

2013-06-04 09:30:31

by Wang, Yalin

[permalink] [raw]

Subject: RE: A bug about system call on ARM

Hi Will,

Thanks for your reply,
I see your meaning ,
But it seems my apply for joining into '[email protected]'
Is not approved ,
How to join in this mail-list ?

Thanks for your help .

-----Original Message-----
From: Will Deacon [mailto:[email protected]]
Sent: Tuesday, June 04, 2013 4:49 PM
To: Wang, Yalin
Cc: 'richard -rw- weinberger'; '[email protected]'; '[email protected]'; '[email protected]'
Subject: Re: A bug about system call on ARM

On Tue, Jun 04, 2013 at 06:33:20AM +0100, Wang, Yalin wrote:
> Hi Will,

Hello,

> Could I know what's your git branch is mainly used for ?
>
> https://git.kernel.org/cgit/linux/kernel/git/will/linux.git
>
>
> I mean if the branch is used for ARM arch maintenance ?

I send most of my patches via Russell, and the tree above is where I keep my patches whilst they're not yet in mainline.

> If yes, I think I can send future bugs about ARM to you directly, And
> do not need ping-pang in the mail list .

Quite the opposite! Having discussions on the mailing list is key to how the kernel is developed, so please continue to send questions, bug reports and patches there. As has been pointed out previously in this thread, you need to choose the right list in the first place, rather than just sending to LKML.

Of course, you can always CC me on arm/arm64 patches if you like.

Cheers,

Will

2013-06-04 11:29:19

by Will Deacon

[permalink] [raw]

Subject: Re: A bug about system call on ARM

On Tue, Jun 04, 2013 at 10:30:04AM +0100, Wang, Yalin wrote:
> Hi Will,
>
> Thanks for your reply,
> I see your meaning ,
> But it seems my apply for joining into '[email protected]'
> Is not approved ,
> How to join in this mail-list ?

There's a mailman frontend here:

http://lists.infradead.org/mailman/listinfo/linux-arm-kernel

Will

2013-06-14 06:53:21

by Wang, Yalin

[permalink] [raw]

Subject: RE: A bug about system call on ARM

Hi Will,

We have tested the patch,
It seems ok in the stability test .

We have merged it into our main branch .

Thanks for your patch !

-----Original Message-----
From: Will Deacon [mailto:[email protected]]
Sent: Monday, June 03, 2013 5:54 PM
To: Wang, Yalin
Cc: 'richard -rw- weinberger'; '[email protected]'; '[email protected]'; '[email protected]'
Subject: Re: A bug about system call on ARM

On Mon, Jun 03, 2013 at 06:25:26AM +0100, Wang, Yalin wrote:
> Hi Will,
>
> I have a question about this patch .
>
> If the user space is thumb mode,
> The PC should be rewind by 2 bytes,
> So the fix_up code should be
>
> Sub lr, lr, #2 .
>
>
> Am I right ?

No, because we don't have OABI-compat support for Thumb applications and force everything down the EABI path instead.

Did you manage to test the patch?

Will