2005-01-21 00:34:05

by djwong

Subject: [PATCH] BUG in io_destroy (fs/aio.c:1248)

Hi all,

[Please cc me on any replies because I'm not subscribed to linux-aio or
linux-kernel.]

I was running a random system call generator against mainline the other
day and got this bug report about AIO in dmesg:

------------[ cut here ]------------
kernel BUG at fs/aio.c:1249!
invalid operand: 0000 [#1]
PREEMPT SMP
Modules linked in: 8250 serial_core isofs zlib_inflate ipt_limit
iptable_mangle ipt_LOG ipt_MASQUERADE iptable_nat ipt_TOS ipt_REJECT
ip_conntrack_irc ip_conntrack_ftp ipt_state ip_conntrack iptable_filter
ip_tables snd_intel8x0 snd_ac97_codec snd_pcm snd_timer snd soundcore
snd_page_alloc intel_agp agpgart evdev ehci_hcd uhci_hcd usbcore piix
ext2 ide_generic ide_cd ide_core cdrom
CPU: 0
EIP: 0060:[<c0170245>] Not tainted VLI
EFLAGS: 00010286 (2.6.10-elm3a74)
EIP is at io_destroy+0xb1/0xce
eax: ffffffff ebx: f6b2a300 ecx: 00000000 edx: cfca1000
esi: d0da5e80 edi: f6b2a488 ebp: cfca1fa4 esp: cfca1f94
ds: 007b es: 007b ss: 0068
Process io_destroy (pid: 6610, threadinfo=cfca1000 task=f6cdca20)
Stack: 00000000 08048008 fffffff2 fffffff2 cfca1fbc c01702fc d0da5e80
00000010
b7fd8c50 bffff3d4 cfca1000 c0102eb3 00000010 08048008 080482fd
b7fd8c50
bffff3d4 bffff3e8 000000f5 0000007b 0000007b 000000f5 b7f7c60d
00000073
Call Trace:
[<c0103d25>] show_stack+0x7a/0x90
[<c0103ea6>] show_registers+0x152/0x1ca
[<c01040b7>] die+0x100/0x184
[<c01044a8>] do_invalid_op+0xa3/0xad
[<c01039cf>] error_code+0x2b/0x30
[<c01702fc>] sys_io_setup+0x9a/0xa9
[<c0102eb3>] syscall_call+0x7/0xb
Code: 1c 8b 06 85 c0 78 24 83 c4 04 5b 5e 5f 5d c3 8b 0a 85 c9 2e 74 b5
8b 46 10 89 02 eb ae 83 c4 04 89 f0 5b 5e 5f 5d e9 9b ee ff ff <0f> 0b
e1 04 f5 f6 2b c0 eb d2 89 f0 e8 8a ee ff ff eb ab 0f 0b

This is a fairly run-of-the-mill P4 box with SCSI disks and a plain
vanilla 2.6.10 kernel on Debian. I've written a test case that exposes
this bug:

http://submarine.dyndns.org/~djwong/docs/io_destroy.c

The program takes as its only argument the address of a region of read
only memory. The libc mmap is a pretty good place for this, so you can
run the program thusly:

$ ./io_destroy `cat /proc/$$/maps | grep libc- | grep 'r-' | \
awk -F "-" '{print $1}'`

...and watch the program segfault. If you can't find an address,
8048000 seems to work in most cases.

I think I've found the cause of this bug. Each ioctx structure has a
"users" field that acts as a reference counter for the ioctx, and a
"dead" flag that seems to indicate that the ioctx isn't associated with
any particular list of IO requests.

The problem, then, lies in aio.c:1247. The io_destroy function checks
the (old) value of the dead flag--if it's false (i.e. the ioctx is
alive), then the function calls put_ioctx to decrease the reference
count on the assumption that the ioctx is no longer associated with any
requests. Later, it calls put_ioctx again, on the assumption that
someone called lookup_ioctx to perform some operation at some point.

This BUG is caused by the reference counts being off. The testcase that
I provided looks for a chunk of user memory that's read-only and passes
that to the sys_io_setup syscall. sys_io_setup checks that the pointer
is readable, creates the ioctx and then tries to write the ioctx handle
back to userland. This is where the problems start to surface.

Since the pointer points to a non-writable region of memory, the write
fails. The syscall handler then destroys the ioctx. The dead flag is
zero, so io_destroy calls put_ioctx...but wait! Nobody ever put the
ioctx into a request list. The ioctx is alive but not in a list, yet
the io_destroy code assumes that being alive implies being in a request
list somewhere. Hence, calling put_ioctx is bogus; the reference count
becomes 0, and the ioctx is freed. Worse yet, put_ioctx is called again
(on a freed pointer!) to clear up the lookup_ioctx that never happened.
put_ioctx sees that the reference count has become negative and BUGs.

The patch that I've provided calls aio_cancel_all before calling
io_destroy in this failure case. aio_cancel_all sets ioctx->dead = 1
and cancels all requests (there shouldn't be any in this case) in
progress. Since the dead flag is 1, io_destroy calls put_ioctx once to
zero the reference count and free the ioctx, and thus the BUG condition
doesn't get triggered. The userland program receives an error code
instead of a segfault.

This patch is against 2.6.10; the problem doesn't seem to be fixed in
2.6.11-rc1. A simpler version of this fix would simply say "ioctx->dead
= 1;" (or even call "get_ioctx(ioctx);" to inflate the refcounts
artificially), but as I'm not an AIO developer I don't want to be the
one making that call.

--Darrick

-----------------

Signed-off-by: Darrick Wong <[email protected]>

--- linux-2.6.10-a74/fs/aio.c 2004-12-24 13:34:44.000000000 -0800
+++ linux-2.6.10/fs/aio.c 2005-01-12 16:09:37.000000000 -0800
@@ -1285,6 +1285,7 @@
if (!ret)
return 0;

+ aio_cancel_all(ioctx);
io_destroy(ioctx);
}



2005-01-24 08:48:48

by Suparna Bhattacharya

Subject: Re: [PATCH] BUG in io_destroy (fs/aio.c:1248)

Hi Darrick,

As discussed earlier - good catch! :)

But I'd prefer not to use aio_cancel_all(); it's a bit of overkill,
especially as we invoke aio_cancel_all() again in io_destroy().

Having the ioctx start out as dead in ioctx_alloc() before it
is linked into the ioctx list, or taking a get_ioctx() before calling
io_destroy(), seem like more appropriate options to me.

Regards
Suparna

On Thu, Jan 20, 2005 at 04:31:47PM -0800, Darrick J. Wong wrote:
> [...]



--
Suparna Bhattacharya ([email protected])
Linux Technology Center
IBM Software Lab, India

2005-01-25 00:47:18

by djwong

Subject: Re: [PATCH] BUG in io_destroy (fs/aio.c:1248)

Andrew Morton wrote:

> So... Will someone be sending a new patch?

Here's a cheesy patch that simply marks the ioctx as dead before
destroying it. Though I'd prefer to have the ioctx start out dead until
it's actually put to use, I don't know enough about the code to make
that sort of invasive change.

--D

-----------------
Signed-off-by: Darrick Wong <[email protected]>

--- linux-2.6.10/fs/aio.c.old 2005-01-24 16:12:46.000000000 -0800
+++ linux-2.6.10/fs/aio.c 2005-01-24 16:30:53.000000000 -0800
@@ -1285,6 +1285,10 @@
if (!ret)
return 0;

+ spin_lock_irq(&ctx->ctx_lock);
+ ctx->dead = 1;
+ spin_unlock_irq(&ctx->ctx_lock);
+
io_destroy(ioctx);
}

2005-01-25 01:00:53

by Andrew Morton

Subject: Re: [PATCH] BUG in io_destroy (fs/aio.c:1248)

"Darrick J. Wong" <[email protected]> wrote:
>
> Andrew Morton wrote:
>
> > So... Will someone be sending a new patch?
>
> Here's a cheesy patch that simply marks the ioctx as dead before
> destroying it.

super-cheesy, given that `ctx' is an unsigned long.

> + spin_lock_irq(&ctx->ctx_lock);
> + ctx->dead = 1;
> + spin_unlock_irq(&ctx->ctx_lock);
> +

Even with this fixed up, the locking looks very odd.

Needs more work, please. Or we just run with the original patch which I
assume was tested. It's a rare error path and performance won't matter.

2005-01-25 01:57:47

by Daniel McNeil

Subject: Re: [PATCH] BUG in io_destroy (fs/aio.c:1248)

On Mon, 2005-01-24 at 16:58, Andrew Morton wrote:
> "Darrick J. Wong" <[email protected]> wrote:
> >
> > Andrew Morton wrote:
> >
> > > So... Will someone be sending a new patch?
> >
> > Here's a cheesy patch that simply marks the ioctx as dead before
> > destroying it.
>
> super-cheesy, given that `ctx' is an unsigned long.
>
> > + spin_lock_irq(&ctx->ctx_lock);
> > + ctx->dead = 1;
> > + spin_unlock_irq(&ctx->ctx_lock);
> > +
>
> Even with this fixed up, the locking looks very odd.
>
> Needs more work, please. Or we just run with the original patch which I
> assume was tested. It's a rare error path and performance won't matter.

The use of 'dead' looks very strange. It is set to 1 in
aio_cancel_all() while holding spin_lock_irq(&ctx->ctx_lock),
but it is also set to 1 in io_destroy() while holding
write_lock(&mm->ioctx_list_lock).

The io_destroy() comment says
"Protects against races with itself via ->dead."

I assume the race the comment is talking about is
multiple threads calling io_destroy() on the same
ctx. io_destroy() is only called from sys_io_destroy() or
from sys_io_setup() in the failure case that is trying
to be fixed.

aio_cancel_all() is only called from exit_aio() when the
mm is going away. So this path is using 'dead' for something
else since the mm cannot go away twice (and there cannot be
an io_destroy() in progress or the mm would not be going away).

The overloading of 'dead' is ugly and confusing.

The use of spin_lock_irq(&ctx->ctx_lock) in sys_io_setup()
does not do any good AFAICT.

Daniel

2005-01-25 03:14:16

by Suparna Bhattacharya

Subject: Re: [PATCH] BUG in io_destroy (fs/aio.c:1248)


Anything wrong with just a get_ioctx() instead ?


--- aio.c 2005-01-12 09:30:44.000000000 +0530
+++ aio.c.fix 2005-01-25 08:51:31.000000000 +0530
@@ -1284,7 +1284,7 @@
ret = put_user(ioctx->user_id, ctxp);
if (!ret)
return 0;
-
+ get_ioctx(ioctx); /* as io_destroy() expects us to hold a ref */
io_destroy(ioctx);
}



On Mon, Jan 24, 2005 at 04:43:21PM -0800, Darrick J. Wong wrote:
> [...]

--
Suparna Bhattacharya ([email protected])
Linux Technology Center
IBM Software Lab, India