LinuxLists.cc - Possible race in copy of fpu->state in copy

2016-04-13 03:12:31

Subject: Possible race in copy of fpu->state in copy_process against the exeve'ing parent?

Hi,

I encountered a panic on a Linux-3.2 kernel on a x86_64 machine, and
suspect it is a race condition. And I checked the current mainline
and found it was fixed unintendedly.

So I hope x86/fpu maintainer help verify this. Thanks verfy much.

The panic stack trace :

#0 [ffff88529d33f990] try_crashdump at ffffffff8105b8ca
#1 [ffff88529d33f9a0] dump_on_panic at ffffffff8105b965
#2 [ffff88529d33fa60] notifier_call_chain at ffffffff8139f784
#3 [ffff88529d33fac0] atomic_notifier_call_chain at ffffffff8139f81d
#4 [ffff88529d33fad0] panic at ffffffff8139971c
#5 [ffff88529d33fb50] oops_end at ffffffff8139d34a
#6 [ffff88529d33fb80] no_context at ffffffff81021569
#7 [ffff88529d33fbd0] __bad_area_nosemaphore at ffffffff81021730
#8 [ffff88529d33fc20] bad_area at ffffffff810217ac
#9 [ffff88529d33fc50] do_page_fault at ffffffff8139f509
#10 [ffff88529d33fd70] page_fault at ffffffff8139caef
[exception RIP: prepare_to_copy+35]
<------------------ PANIC !!!
RIP: ffffffff810013f4 RSP: ffff88529d33fe20 RFLAGS: 00010286
RAX: 00000000ffffffff RBX: 0000000001200011 RCX: ffff884fe73f6320
RDX: 00000000ffffffff RSI: 00007fff07d36bd0 RDI: 0000000000000000
RBP: ffff88529d33fe20 R8: 00007f5c4a209770 R9: 0000000000000000
R10: 00007f5c4a209770 R11: 0000000000000202 R12: 0000000000000000
R13: 0000000000000000 R14: ffff884fe73f6320 R15: 0000000000000001
ORIG_RAX: ffffffffffffffff CS: 0010 SS: 0018
#11 [ffff88529d33fe28] copy_process at ffffffff81038211
#12 [ffff88529d33fea8] do_fork at ffffffff810393ec
#13 [ffff88529d33ff38] sys_clone at ffffffff81009118
#14 [ffff88529d33ff48] stub_clone at ffffffff813a31d3

crash7> dis -r prepare_to_copy+35
0xffffffff810013d1 <prepare_to_copy>: push %rbp
0xffffffff810013d2 <prepare_to_copy+1>: mov %rsp,%rbp
0xffffffff810013d5 <prepare_to_copy+4>: nopl 0x0(%rax,%rax,1)
0xffffffff810013da <prepare_to_copy+9>: mov %rdi,%rcx
0xffffffff810013dd <prepare_to_copy+12>: cmpl $0x0,0x4d8(%rdi)
0xffffffff810013e4 <prepare_to_copy+19>: je
0xffffffff8100142e <prepare_to_copy+93>
0xffffffff810013e6 <prepare_to_copy+21>: mov 0x4e0(%rdi),%rdi
0xffffffff810013ed <prepare_to_copy+28>: xchg %ax,%ax
0xffffffff810013ef <prepare_to_copy+30>: or $0xffffffff,%eax
0xffffffff810013f2 <prepare_to_copy+33>: mov %eax,%edx
0xffffffff810013f4 <prepare_to_copy+35>: xsaveopt64 (%rdi)
<---- PANIC HERE

when panic the %rdi is 0x0000000000000000, which is fpu->state.

So I suspect there is a possible race:

Parent:

sys_execve
do_execve
do_execve_common
search_binary_handler
load_elf_binary
start_thread
start_thread_common
free_thread_xstate(current)
fpu_free
fpu->state = NULL

Child:

sys_clone
do_fork
copy_process
dup_task_struct
prepare_to_copy
unlazy_fpu
__save_init_fpu
fpu_save_init
fpu_xsave(fpu) <---- fpu->sate is NULL,
so cause a
NULL
dereference.

Scenario: Parent is still exeve'ing, and just set fpu->state to NULL,
and the a concurrent clone() forks a Child and in which fpu_xsave()
tries to fpu_xsave, when fpu->state is NULL.

The race window seems quite small, and I have checked the Parent's
'sum_exec_runtime' is 536920255(~0.53s).

I checked the mainline, and found commit 304bceda6a18(" x86, fpu: use
non-lazy fpu restore for processors supporting xsave") seems
unintendedly fix this?

Regards,
Jianyu Zhan

2016-04-13 03:20:11

by Jianyu Zhan

[permalink] [raw]

Subject: Re: Possible race in copy of fpu->state in copy_process against the exeve'ing parent?

On Wed, Apr 13, 2016 at 11:11 AM, Jianyu Zhan <[email protected]> wrote:
>
> So I suspect there is a possible race:
>
>
> Parent:
>
> sys_execve
> do_execve
> do_execve_common
> search_binary_handler
> load_elf_binary
> start_thread
> start_thread_common
> free_thread_xstate(current)
> fpu_free
> fpu->state = NULL
>
>
> Child:
>
> sys_clone
> do_fork
> copy_process
> dup_task_struct
> prepare_to_copy
> unlazy_fpu
> __save_init_fpu
> fpu_save_init
> fpu_xsave(fpu) <---- fpu->sate is NULL,
> so cause a
> NULL
> dereference.
>

Hmm, I am wrong, it is not Parent vs Child.

It is : Parent executes sys_execuve, and then right after that,
executes sys_clone.

Regards,
Jianyu Zhan

2016-04-13 06:09:57

by Ingo Molnar

[permalink] [raw]

Subject: Re: Possible race in copy of fpu->state in copy_process against the exeve'ing parent?

* Jianyu Zhan <[email protected]> wrote:

> Hi,
>
> I encountered a panic on a Linux-3.2 kernel on a x86_64 machine, and
> suspect it is a race condition. And I checked the current mainline
> and found it was fixed unintendedly.
>
> So I hope x86/fpu maintainer help verify this. Thanks verfy much.
>
>
> The panic stack trace :
>
> #0 [ffff88529d33f990] try_crashdump at ffffffff8105b8ca
> #1 [ffff88529d33f9a0] dump_on_panic at ffffffff8105b965
> #2 [ffff88529d33fa60] notifier_call_chain at ffffffff8139f784
> #3 [ffff88529d33fac0] atomic_notifier_call_chain at ffffffff8139f81d
> #4 [ffff88529d33fad0] panic at ffffffff8139971c
> #5 [ffff88529d33fb50] oops_end at ffffffff8139d34a
> #6 [ffff88529d33fb80] no_context at ffffffff81021569
> #7 [ffff88529d33fbd0] __bad_area_nosemaphore at ffffffff81021730
> #8 [ffff88529d33fc20] bad_area at ffffffff810217ac
> #9 [ffff88529d33fc50] do_page_fault at ffffffff8139f509
> #10 [ffff88529d33fd70] page_fault at ffffffff8139caef
> [exception RIP: prepare_to_copy+35]
> <------------------ PANIC !!!
> RIP: ffffffff810013f4 RSP: ffff88529d33fe20 RFLAGS: 00010286
> RAX: 00000000ffffffff RBX: 0000000001200011 RCX: ffff884fe73f6320
> RDX: 00000000ffffffff RSI: 00007fff07d36bd0 RDI: 0000000000000000
> RBP: ffff88529d33fe20 R8: 00007f5c4a209770 R9: 0000000000000000
> R10: 00007f5c4a209770 R11: 0000000000000202 R12: 0000000000000000
> R13: 0000000000000000 R14: ffff884fe73f6320 R15: 0000000000000001
> ORIG_RAX: ffffffffffffffff CS: 0010 SS: 0018
> #11 [ffff88529d33fe28] copy_process at ffffffff81038211
> #12 [ffff88529d33fea8] do_fork at ffffffff810393ec
> #13 [ffff88529d33ff38] sys_clone at ffffffff81009118
> #14 [ffff88529d33ff48] stub_clone at ffffffff813a31d3
>
>
> crash7> dis -r prepare_to_copy+35
> 0xffffffff810013d1 <prepare_to_copy>: push %rbp
> 0xffffffff810013d2 <prepare_to_copy+1>: mov %rsp,%rbp
> 0xffffffff810013d5 <prepare_to_copy+4>: nopl 0x0(%rax,%rax,1)
> 0xffffffff810013da <prepare_to_copy+9>: mov %rdi,%rcx
> 0xffffffff810013dd <prepare_to_copy+12>: cmpl $0x0,0x4d8(%rdi)
> 0xffffffff810013e4 <prepare_to_copy+19>: je
> 0xffffffff8100142e <prepare_to_copy+93>
> 0xffffffff810013e6 <prepare_to_copy+21>: mov 0x4e0(%rdi),%rdi
> 0xffffffff810013ed <prepare_to_copy+28>: xchg %ax,%ax
> 0xffffffff810013ef <prepare_to_copy+30>: or $0xffffffff,%eax
> 0xffffffff810013f2 <prepare_to_copy+33>: mov %eax,%edx
> 0xffffffff810013f4 <prepare_to_copy+35>: xsaveopt64 (%rdi)
> <---- PANIC HERE
>
> when panic the %rdi is 0x0000000000000000, which is fpu->state.
>
>
>
> So I suspect there is a possible race:
>
>
> Parent:
>
> sys_execve
> do_execve
> do_execve_common
> search_binary_handler
> load_elf_binary
> start_thread
> start_thread_common
> free_thread_xstate(current)
> fpu_free
> fpu->state = NULL
>
>
> Child:
>
> sys_clone
> do_fork
> copy_process
> dup_task_struct
> prepare_to_copy
> unlazy_fpu
> __save_init_fpu
> fpu_save_init
> fpu_xsave(fpu) <---- fpu->sate is NULL,
> so cause a
> NULL
> dereference.
>
> Scenario: Parent is still exeve'ing, and just set fpu->state to NULL,
> and the a concurrent clone() forks a Child and in which fpu_xsave()
> tries to fpu_xsave, when fpu->state is NULL.
>
> The race window seems quite small, and I have checked the Parent's
> 'sum_exec_runtime' is 536920255(~0.53s).
>
> I checked the mainline, and found commit 304bceda6a18(" x86, fpu: use
> non-lazy fpu restore for processors supporting xsave") seems
> unintendedly fix this?

So I'm not sure I understand the suggested race. Separate tasks have separate
fpu->state states, so a parallel execve() and clone() has no effect on each other.
There's no FPU state sharing.

Thanks,

Ingo

2016-04-13 07:24:18

by Jianyu Zhan

[permalink] [raw]

Subject: Re: Possible race in copy of fpu->state in copy_process against the exeve'ing parent?

On Wed, Apr 13, 2016 at 2:09 PM, Ingo Molnar <[email protected]> wrote:
> So I'm not sure I understand the suggested race. Separate tasks have separate
> fpu->state states, so a parallel execve() and clone() has no effect on each other.
> There's no FPU state sharing.

Hi, Ingo, thans for reply.

Let me try describe the situation clearly :

>From the panic stack trace, we can infer the call path before panic:

sys_clone
do_fork
copy_process
dup_task_struct(current)
prepare_to_copy(current)
unlazy_fpu(current)
__save_init_fpu(current)
fpu_save_init(current)
fpu_xsave(&current->thread.fpu) <---- PANIC

In this case , &thread.fpu.state is NULL, so it caused a write to
NULL address fault,
which of course is invalid and resulted in a panic.

After reviewing the code, I found only one place in kernel(v3.2.33) that
could make the fpu.state NULL is
from this call path:

sys_execve
do_execve
do_execve_common
search_binary_handler
load_elf_binary
start_thread
start_thread_common
free_thread_xstate(current)
fpu_free(&current->thread.fpu)
fpu->state = NULL

And I also learned that after the first time fpu is used, init_fpu will be
called to allocate a new fpu->state.

So I suspect if this is a problem : we call sys_clone right after
sys_execve, but right before init_fpu()
is called for the first time to allocate a struct fpu for current,
so a NULL fpu->state is seen.

And commit 304bceda6a18(" x86, fpu: use non-lazy fpu restore for
processors supporting xsave") seems
unintendedly fix this ?

Regards,
Jianyu Zhan

2016-04-13 11:27:34

by Oleg Nesterov

[permalink] [raw]

Subject: Re: Possible race in copy of fpu->state in copy_process against the exeve'ing parent?

On 04/13, Jianyu Zhan wrote:
>
> From the panic stack trace, we can infer the call path before panic:
>
>
> sys_clone
> do_fork
> copy_process
> dup_task_struct(current)
> prepare_to_copy(current)
> unlazy_fpu(current)
> __save_init_fpu(current)
> fpu_save_init(current)
> fpu_xsave(&current->thread.fpu) <---- PANIC
>
>
> In this case , &thread.fpu.state is NULL, so it caused a write to
> NULL address fault,

Yes, but iirc fpu.state == NULL is not a problem. The problem is that
TS_USEDFPU is set while it should not.

> sys_execve
> do_execve
> do_execve_common
> search_binary_handler
> load_elf_binary
> start_thread
> start_thread_common
> free_thread_xstate(current)
> fpu_free(&current->thread.fpu)
> fpu->state = NULL

Yes, but note that exec path also calls flush_thread() which clears TS_USEDFPU.

Yes, this is confusing, and we had a lot bugs in this area. To be honest I didn't
even try to recall how this (very old) code works, sorry... So I can't say what
exactly could explain the wrong TS_USEDFPU.

Oleg.