2006-10-11 09:15:53

by Chandru

[permalink] [raw]
Subject: [RFC]: Possible race condition on an SMP between proc_lookupfd and tasks on other cpus

Hi All,
I am running a RHEL5 distro kernel ( which seems to be quite close to
Vannilla kernel ) and am having a problem on one of my system
(PPC64). The system crashes ( goes in to xmon ) every now and then
while running TCP stress tests on the system. The following is the
backtrace and exception information ( from distro kernel, which might be
of very little help).

f:mon> e
cpu 0xf: Vector: 300 (Data Access) at [c0000000eaa1b490]
pc: c0000000001351e0: .tid_fd_revalidate+0x64/0x220
lr: c0000000001351cc: .tid_fd_revalidate+0x50/0x220
sp: c0000000eaa1b710
msr: 8000000000009032
dar: 6b6b6b6b6b6b6b6b
dsisr: 40000000
current = 0xc0000001182864f0
paca = 0xc000000000456300
pid = 24558, comm = netstat
f:mon> t
[c0000000eaa1b7b0] c000000000138118 .proc_lookupfd+0x17c/0x21c
[c0000000eaa1b860] c0000000000f359c .do_lookup+0x108/0x268
[c0000000eaa1b920] c0000000000f65f8 .__link_path_walk+0xc58/0x1364
[c0000000eaa1ba00] c0000000000f6da0 .link_path_walk+0x9c/0x184
[c0000000eaa1bb40] c0000000000f7364 .do_path_lookup+0x304/0x398
[c0000000eaa1bbf0] c0000000000f7db8 .__user_walk_fd+0x58/0x88
[c0000000eaa1bc90] c0000000000edcdc .sys_readlinkat+0x44/0x130
[c0000000eaa1bdc0] c000000000016784 .compat_sys_readlink+0x14/0x28
[c0000000eaa1be30] c00000000000871c syscall_exit+0x0/0x40


From code analysis ( vannilla and distro kernel), it looks like there
can exist a small time window between

spin_unlock(&files->file_lock) in proc_fd_instantiate()

and fcheck_files() in tid_fd_revalidate() during which the contents
of 'struct files_struct files' of a task could be released/cleared by
that task ( during an exec probably ).

i.e between....
static struct dentry *proc_fd_instantiate(struct inode *dir,....)
{
...
...
if (file->f_mode & 2)
inode->i_mode |= S_IWUSR | S_IXUSR;
spin_unlock(&files->file_lock); <------- 1.
put_files_struct(files);

inode->i_op = &proc_pid_link_inode_operations;
inode->i_size = 64;
ei->op.proc_get_link = proc_fd_link;
dentry->d_op = &tid_fd_dentry_operations;
d_add(dentry, inode);
/* Close the race of the process dying before we return the
dentry */
if (tid_fd_revalidate(dentry, NULL)) <-------- 2.
...
...
}


static int tid_fd_revalidate(struct dentry *dentry, struct nameidata *nd)
{
...
...
if (task) {
files = get_files_struct(task);
if (files) {
rcu_read_lock();
if (fcheck_files(files, fd)) {
<---------- 2
...
...
}

Could this code analysis be right? and can this race condition be fixed?.

Thanks,
Chandru


2006-10-12 19:30:59

by Eric W. Biederman

[permalink] [raw]
Subject: Re: [RFC]: Possible race condition on an SMP between proc_lookupfd and tasks on other cpus

Chandru <[email protected]> writes:

> Hi All,
> I am running a RHEL5 distro kernel ( which seems to be quite close to Vannilla
> kernel ) and am having a problem on one of my system (PPC64). The system
> crashes ( goes in to xmon ) every now and then while running TCP stress tests on
> the system. The following is the backtrace and exception information ( from
> distro kernel, which might be of very little help).
>
> f:mon> e
> cpu 0xf: Vector: 300 (Data Access) at [c0000000eaa1b490]
> pc: c0000000001351e0: .tid_fd_revalidate+0x64/0x220
> lr: c0000000001351cc: .tid_fd_revalidate+0x50/0x220
> sp: c0000000eaa1b710
> msr: 8000000000009032
> dar: 6b6b6b6b6b6b6b6b
> dsisr: 40000000
> current = 0xc0000001182864f0
> paca = 0xc000000000456300
> pid = 24558, comm = netstat
> f:mon> t
> [c0000000eaa1b7b0] c000000000138118 .proc_lookupfd+0x17c/0x21c
> [c0000000eaa1b860] c0000000000f359c .do_lookup+0x108/0x268
> [c0000000eaa1b920] c0000000000f65f8 .__link_path_walk+0xc58/0x1364
> [c0000000eaa1ba00] c0000000000f6da0 .link_path_walk+0x9c/0x184
> [c0000000eaa1bb40] c0000000000f7364 .do_path_lookup+0x304/0x398
> [c0000000eaa1bbf0] c0000000000f7db8 .__user_walk_fd+0x58/0x88
> [c0000000eaa1bc90] c0000000000edcdc .sys_readlinkat+0x44/0x130
> [c0000000eaa1bdc0] c000000000016784 .compat_sys_readlink+0x14/0x28
> [c0000000eaa1be30] c00000000000871c syscall_exit+0x0/0x40
>
>
> From code analysis ( vannilla and distro kernel), it looks like there can exist
> a small time window between
>
> spin_unlock(&files->file_lock) in proc_fd_instantiate()
>
> and fcheck_files() in tid_fd_revalidate() during which the contents of
> 'struct files_struct files' of a task could be released/cleared by that task (
> during an exec probably ).


> Could this code analysis be right? and can this race condition be fixed?.

The window you see exists, but it is there by design.

tid_fd_revalidate is designed to be called any time after prod_fd_instantiate()
runs. So it requires all of the locks it needs. It's purpose in life
is to verify the permissions.

We have earlier increased the reference count of everything when we grabbed
the dentry.

The final tid_fd_revalidate in proc_fd_instantiate was added recently
to ensure we have a consistent set of checks before returning a dentry
to a user.

So I think you have a legitimate problem but it isn't because we drop
and reaquire the locks.

There was a little recent work making some of the fdtable access
no-rcu. See commit ca99c1da080345e227cfb083c330a184d42e27f3. But
I don't think that applies here.

You certain seem to be in one of the proc stress conditions so this
may not be a unique bug.

Digging through the disassembly and figuring out which access you died
on would be interesting, so we could know with precision which part
of tid_fd_revalidate we are dying in. My ppc64 isn't good enough
especially without the matching binaries to figure that out though.
All I know is that you are about 25 instructions into
tid_fd_revalidate.

I don't have a clue where to start to dig into this.

Eric

2006-10-13 12:14:37

by Chandru

[permalink] [raw]
Subject: Re: [RFC]: Possible race condition on an SMP between proc_lookupfd and tasks on other cpus

Eric W. Biederman wrote:
> You certain seem to be in one of the proc stress conditions so this
> may not be a unique bug.
>
> Digging through the disassembly and figuring out which access you died
> on would be interesting, so we could know with precision which part
> of tid_fd_revalidate we are dying in. My ppc64 isn't good enough
> especially without the matching binaries to figure that out though.
> All I know is that you are about 25 instructions into
> tid_fd_revalidate.
>
From the disassembly in xmon, the code that failed while running was in
fcheck_files() called from tid_fd_revalidate()

f:mon> di c0000000001351e0
c0000000001351e0 800b0000 lwz r0,0(r11) <<<<<------ PC
c0000000001351e4 7f804840 cmplw cr7,r0,r9
c0000000001351e8 409d00e0 ble cr7,c0000000001352c8 #
.tid_fd_revalidate+0x14c/0x220
c0000000001351ec e96b0008 ld r11,8(r11)
c0000000001351f0 79291f24 rldicr r9,r9,3,60
c0000000001351f4 7c09582a ldx r0,r9,r11
c0000000001351f8 2fa00000 cmpdi cr7,r0,0
c0000000001351fc 419e00cc beq cr7,c0000000001352c8 #


r11 gets loaded with 'fdt' as in struct fdtable *fdt =
files_fdtable(files);

include/linux/file.h:97
b3c: e9 6a 00 08 ld r11,8(r10)

Following are the register contents...

f:mon> r
R00 = c0000000001351cc R16 = 00000000fae9f35a
R01 = c0000000eaa1b710 R17 = 0000000010020000
R02 = c00000000056cd20 R18 = 00000000fae9f15a
R03 = c00000018b5f89a0 R19 = 00000000fae9e966
R04 = c0000001e172cb80 R20 = 00000000fae9e95a
R05 = c0000001349e88f8 R21 = 0000000010030000
R06 = c00000018b5f8ca0 R22 = 00000000fae9e968
R07 = 000000007fffffff R23 = c0000000105e3920
R08 = 0000000000135075 R24 = c0000001349e88a8
R09 = 000000000000028c R25 = c00000023dca8628
R10 = c00000018b5f89a0 R26 = c0000001e172ca30
R11 = 6b6b6b6b6b6b6b6b R27 = c00000023dca8628
R12 = d0000000001bb3f8 R28 = 000000000000028c
R13 = c000000000456300 R29 = c0000001e172ca30
R14 = 00000000ffffffff R30 = c0000000004a9250
R15 = 0000000000000000 R31 = c0000001349e88c8
pc = c0000000001351e0 .tid_fd_revalidate+0x64/0x220
lr = c0000000001351cc .tid_fd_revalidate+0x50/0x220
msr = 8000000000009032 cr = 84000482
ctr = c00000000016a770 xer = 0000000000000000 trap = 300
dar = 6b6b6b6b6b6b6b6b dsisr = 40000000


Now r11=6b6b6b6b6b6b6b6b = 'fdt'
and an attempt is made to access fdt->max_fds as in 'if (fd < fdt->max_fds)' which fails

include/linux/file.h:99
b44: 80 0b 00 00 lwz r0,0(r11) <<<<<<------PC
b48: 7f 80 48 40 cmplw cr7,r0,r9

Further the contents of 'struct files_struct *files' were all filled
with '0x6b's. I tried to look in to for which task the lookup was being
done and the task seemed to be 'execve04'. Also noticed that another
cpu ( cpu #1) was running on an execve code path ( do not have the
state of the system right now to copy it's backtrace :( ).

Thanks,
Chandru

2006-10-13 16:30:54

by Eric W. Biederman

[permalink] [raw]
Subject: Re: [RFC]: Possible race condition on an SMP between proc_lookupfd and tasks on other cpus

Chandru <[email protected]> writes:

> Eric W. Biederman wrote:
>> You certain seem to be in one of the proc stress conditions so this
>> may not be a unique bug.
>>
>> Digging through the disassembly and figuring out which access you died
>> on would be interesting, so we could know with precision which part
>> of tid_fd_revalidate we are dying in. My ppc64 isn't good enough especially
>> without the matching binaries to figure that out though.
>> All I know is that you are about 25 instructions into
>> tid_fd_revalidate.
>>
> From the disassembly in xmon, the code that failed while running was in
> fcheck_files() called from tid_fd_revalidate()

> Further the contents of 'struct files_struct *files' were all filled with
> '0x6b's. I tried to look in to for which task the lookup was being done and the
> task seemed to be 'execve04'. Also noticed that another cpu ( cpu #1) was
> running on an execve code path ( do not have the state of the system right now
> to copy it's backtrace :( ).

Which kernel version did you say you were running.

This appears to be a case of calling put_files_struct without changing
the pointer in the task_struct. So we are reading a stale pointer
in get_files_struct.

It looks like the patch below added fixed a bug that has the same symptoms
as what you are describing. So I suspect this is fixed in linus's tree
already.

>commit 3b9b8ab65d8eed784b9164d03807cb2bda7b5cd6
>Author: Kirill Korotaev <[email protected]>
>Date: Fri Sep 29 02:00:05 2006 -0700
>
> [PATCH] Fix unserialized task->files changing
>
> Fixed race on put_files_struct on exec with proc. Restoring files on
> current on error path may lead to proc having a pointer to already kfree-d
> files_struct.
>
> ->files changing at exit.c and khtread.c are safe as exit_files() makes all
> things under lock.
>
> Found during OpenVZ stress testing.
>
> [[email protected]: add export]
> Signed-off-by: Pavel Emelianov <[email protected]>
> Signed-off-by: Kirill Korotaev <[email protected]>
> Signed-off-by: Andrew Morton <[email protected]>
> Signed-off-by: Linus Torvalds <[email protected]>


Eric