2007-02-09 04:29:42

by Ananiev, Leonid I

[permalink] [raw]
Subject: [PATCH] aio: fix kernel bug when page is temporally busy

>From Leonid Ananiev

Fix "Kernel BUG at fs/aio.c:509". Return EIOCBRETRY but not EIO if page
is busy.

Signed-off-by: Leonid Ananiev <[email protected]>

"Kernel BUG at fs/aio.c:509"
http://marc.theaimsgroup.com/?l=linux-kernel&m=117031052517746&w=2
is present in 2.6.20 as well as 2.6.19.
The investigation shows that the bug's panic occurs when aio_complete()
is called with argument ret=EIO which is returned from
invalidate_inode_pages2_range() because the page is temporally busy.
Call Trace:
[<ffffffff802573c3>] invalidate_inode_pages2_range+0x236/0x26b
[<ffffffff802af77b>] ext3_direct_IO+0x16c/0x19e
[<ffffffff802adc1d>] ext3_get_block+0x0/0xe2
[<ffffffff802518c0>] generic_file_direct_IO+0xb9/0xcf
[<ffffffff8025193d>] generic_file_direct_write+0x67/0x10e
[<ffffffff80252308>] __generic_file_aio_write_nolock+0x2d6/0x3fe
[<ffffffff802082c0>] __switch_to+0x26f/0x27e
[<ffffffff8047b8e9>] thread_return+0x0/0xd8
[<ffffffff80252497>] generic_file_aio_write+0x67/0xc7
[<ffffffff802ab2cc>] ext3_file_write+0x0/0x8f
[<ffffffff802ab2e2>] ext3_file_write+0x16/0x8f
[<ffffffff802ab2cc>] ext3_file_write+0x0/0x8f
[<ffffffff80283f61>] aio_rw_vect_retry+0x72/0x14f
[<ffffffff80284ad9>] aio_run_iocb+0xe6/0x187
[<ffffffff802853e2>] io_submit_one+0x296/0x2e3
[<ffffffff80285979>] sys_io_submit+0x9b/0x108
[<ffffffff80229b49>] default_wake_function+0x0/0xe
[<ffffffff8020983e>] system_call+0x7e/0x83

As a result aio_complete() is called while it should not be.
When IO is finished aio_complete() is called once more
and iocb->ki_users becomes negative.

Taking into account aio.c lines in aio_run_iocb()
if (ret != -EIOCBRETRY && ret != -EIOCBQUEUED) {
aio_complete(iocb, ret, 0);

proposed patch makes function invalidate_inode_pages2_range()
to return EIOCBRETRY but not EIO if page is busy.
The patch is tested on 2.6.20.

--- linux-2.6.20/mm/truncate.c 2007-02-04 10:44:54.000000000 -0800
+++ linux-2.6.20p/mm/truncate.c 2007-02-08 11:38:11.000000000 -0800
@@ -366,7 +366,7 @@ static int do_launder_page(struct addres
* Any pages which are found to be mapped into pagetables are unmapped
prior to
* invalidation.
*
- * Returns -EIO if any pages could not be invalidated.
+ * Returns -EIOCBRETRY if any pages could not be invalidated.
*/
int invalidate_inode_pages2_range(struct address_space *mapping,
pgoff_t start, pgoff_t end)
@@ -423,7 +423,7 @@ int invalidate_inode_pages2_range(struct
}
ret = do_launder_page(mapping, page);
if (ret == 0 &&
!invalidate_complete_page2(mapping, page))
- ret = -EIO;
+ ret = -EIOCBRETRY;
unlock_page(page);
}
pagevec_release(&pvec);
@@ -440,7 +440,7 @@ EXPORT_SYMBOL_GPL(invalidate_inode_pages
* Any pages which are found to be mapped into pagetables are unmapped
prior to
* invalidation.
*
- * Returns -EIO if any pages could not be invalidated.
+ * Returns -EIOCBRETRY if any pages could not be invalidated.
*/
int invalidate_inode_pages2(struct address_space *mapping)
{


2007-02-09 04:35:30

by Andrew Morton

[permalink] [raw]
Subject: Re: [PATCH] aio: fix kernel bug when page is temporally busy

On Fri, 9 Feb 2007 07:29:31 +0300 "Ananiev, Leonid I" <[email protected]> wrote:

> >From Leonid Ananiev
>
> Fix "Kernel BUG at fs/aio.c:509". Return EIOCBRETRY but not EIO if page
> is busy.
>
> Signed-off-by: Leonid Ananiev <[email protected]>
>
> "Kernel BUG at fs/aio.c:509"
> http://marc.theaimsgroup.com/?l=linux-kernel&m=117031052517746&w=2
> is present in 2.6.20 as well as 2.6.19.
> The investigation shows that the bug's panic occurs when aio_complete()
> is called with argument ret=EIO which is returned from
> invalidate_inode_pages2_range() because the page is temporally busy.
> Call Trace:
> [<ffffffff802573c3>] invalidate_inode_pages2_range+0x236/0x26b
> [<ffffffff802af77b>] ext3_direct_IO+0x16c/0x19e
> [<ffffffff802adc1d>] ext3_get_block+0x0/0xe2
> [<ffffffff802518c0>] generic_file_direct_IO+0xb9/0xcf
> [<ffffffff8025193d>] generic_file_direct_write+0x67/0x10e
> [<ffffffff80252308>] __generic_file_aio_write_nolock+0x2d6/0x3fe
> [<ffffffff802082c0>] __switch_to+0x26f/0x27e
> [<ffffffff8047b8e9>] thread_return+0x0/0xd8
> [<ffffffff80252497>] generic_file_aio_write+0x67/0xc7
> [<ffffffff802ab2cc>] ext3_file_write+0x0/0x8f
> [<ffffffff802ab2e2>] ext3_file_write+0x16/0x8f
> [<ffffffff802ab2cc>] ext3_file_write+0x0/0x8f
> [<ffffffff80283f61>] aio_rw_vect_retry+0x72/0x14f
> [<ffffffff80284ad9>] aio_run_iocb+0xe6/0x187
> [<ffffffff802853e2>] io_submit_one+0x296/0x2e3
> [<ffffffff80285979>] sys_io_submit+0x9b/0x108
> [<ffffffff80229b49>] default_wake_function+0x0/0xe
> [<ffffffff8020983e>] system_call+0x7e/0x83
>
> As a result aio_complete() is called while it should not be.
> When IO is finished aio_complete() is called once more
> and iocb->ki_users becomes negative.
>
> Taking into account aio.c lines in aio_run_iocb()
> if (ret != -EIOCBRETRY && ret != -EIOCBQUEUED) {
> aio_complete(iocb, ret, 0);
>
> proposed patch makes function invalidate_inode_pages2_range()
> to return EIOCBRETRY but not EIO if page is busy.
> The patch is tested on 2.6.20.
>
> --- linux-2.6.20/mm/truncate.c 2007-02-04 10:44:54.000000000 -0800
> +++ linux-2.6.20p/mm/truncate.c 2007-02-08 11:38:11.000000000 -0800
> @@ -366,7 +366,7 @@ static int do_launder_page(struct addres
> * Any pages which are found to be mapped into pagetables are unmapped
> prior to
> * invalidation.
> *
> - * Returns -EIO if any pages could not be invalidated.
> + * Returns -EIOCBRETRY if any pages could not be invalidated.
> */
> int invalidate_inode_pages2_range(struct address_space *mapping,
> pgoff_t start, pgoff_t end)
> @@ -423,7 +423,7 @@ int invalidate_inode_pages2_range(struct
> }
> ret = do_launder_page(mapping, page);
> if (ret == 0 &&
> !invalidate_complete_page2(mapping, page))
> - ret = -EIO;
> + ret = -EIOCBRETRY;
> unlock_page(page);
> }
> pagevec_release(&pvec);
> @@ -440,7 +440,7 @@ EXPORT_SYMBOL_GPL(invalidate_inode_pages
> * Any pages which are found to be mapped into pagetables are unmapped
> prior to
> * invalidation.
> *
> - * Returns -EIO if any pages could not be invalidated.
> + * Returns -EIOCBRETRY if any pages could not be invalidated.
> */
> int invalidate_inode_pages2(struct address_space *mapping)
> {

The patch is wildly wordwrapped and has had it tabs replaced with spaces.

invalidate_inode_pages2() has other callers. I suspect with this change
we'll end up leaking EIOCBRETRY back to userspace.

2007-02-09 05:41:57

by Ananiev, Leonid I

[permalink] [raw]
Subject: RE: [PATCH] aio: fix kernel bug when page is temporally busy


> invalidate_inode_pages2() has other callers. I suspect with this
change
> we'll end up leaking EIOCBRETRY back to userspace.

EIOCBRETRY is used and caught already in do_sync_read() and
do_sync_readv_writev().

Below fixed patch against kernel 2.6.20.

>From Leonid Ananiev

Fix kernel bug when IO page is temporally busy:
invalidate_inode_pages2() returns EIOCBRETRY but not EIO..

Signed-off-by: Leonid Ananiev <[email protected]>
---
--- linux-2.6.20/mm/truncate.c 2007-02-04 10:44:54.000000000 -0800
+++ linux-2.6.20p/mm/truncate.c 2007-02-08 11:38:11.000000000 -0800
@@ -366,7 +366,7 @@ static int do_launder_page(struct addres
* Any pages which are found to be mapped into pagetables are unmapped
prior to
* invalidation.
*
- * Returns -EIO if any pages could not be invalidated.
+ * Returns -EIOCBRETRY if any pages could not be invalidated.
*/
int invalidate_inode_pages2_range(struct address_space *mapping,
pgoff_t start, pgoff_t end)
@@ -423,7 +423,7 @@ int invalidate_inode_pages2_range(struct
}
ret = do_launder_page(mapping, page);
if (ret == 0 &&
!invalidate_complete_page2(mapping, page))
- ret = -EIO;
+ ret = -EIOCBRETRY;
unlock_page(page);
}
pagevec_release(&pvec);
@@ -440,7 +440,7 @@ EXPORT_SYMBOL_GPL(invalidate_inode_pages
* Any pages which are found to be mapped into pagetables are unmapped
prior to
* invalidation.
*
- * Returns -EIO if any pages could not be invalidated.
+ * Returns -EIOCBRETRY if any pages could not be invalidated.
*/
int invalidate_inode_pages2(struct address_space *mapping)
{

2007-02-09 05:52:57

by Andrew Morton

[permalink] [raw]
Subject: Re: [PATCH] aio: fix kernel bug when page is temporally busy

On Fri, 9 Feb 2007 08:41:41 +0300 "Ananiev, Leonid I" <[email protected]> wrote:

>
> > invalidate_inode_pages2() has other callers. I suspect with this
> change
> > we'll end up leaking EIOCBRETRY back to userspace.
>
> EIOCBRETRY is used and caught already in do_sync_read() and
> do_sync_readv_writev().

To pick one example:

nfs_follow_link
->nfs_revalidate_mapping_nolock
->nfs_invalidate_mapping_nolock
->invalidate_inode_pages2

so that, I assume, affects open(), unlink(), etc.


> Below fixed patch against kernel 2.6.20.

The tab->spaces issue is fixed, but it's still all wordwrapped.

2007-02-09 07:10:42

by Suparna Bhattacharya

[permalink] [raw]
Subject: Re: [PATCH] aio: fix kernel bug when page is temporally busy

On Fri, Feb 09, 2007 at 08:41:41AM +0300, Ananiev, Leonid I wrote:
> > invalidate_inode_pages2() has other callers. I suspect with this
> change
> > we'll end up leaking EIOCBRETRY back to userspace.
>
> EIOCBRETRY is used and caught already in do_sync_read() and
> do_sync_readv_writev().
>
> Below fixed patch against kernel 2.6.20.

I agree with Andrew, this doesn't look like the right solution.

EIOCBRETRY isn't the same as EAGAIN or "try again later". EIOCBRETRY means
"I have queued up an action to notify (wakeup) the callback to retry the
operation once data is ready".

Regards
Suparna

>
> >From Leonid Ananiev
>
> Fix kernel bug when IO page is temporally busy:
> invalidate_inode_pages2() returns EIOCBRETRY but not EIO..
>
> Signed-off-by: Leonid Ananiev <[email protected]>
> ---
> --- linux-2.6.20/mm/truncate.c 2007-02-04 10:44:54.000000000 -0800
> +++ linux-2.6.20p/mm/truncate.c 2007-02-08 11:38:11.000000000 -0800
> @@ -366,7 +366,7 @@ static int do_launder_page(struct addres
> * Any pages which are found to be mapped into pagetables are unmapped
> prior to
> * invalidation.
> *
> - * Returns -EIO if any pages could not be invalidated.
> + * Returns -EIOCBRETRY if any pages could not be invalidated.
> */
> int invalidate_inode_pages2_range(struct address_space *mapping,
> pgoff_t start, pgoff_t end)
> @@ -423,7 +423,7 @@ int invalidate_inode_pages2_range(struct
> }
> ret = do_launder_page(mapping, page);
> if (ret == 0 &&
> !invalidate_complete_page2(mapping, page))
> - ret = -EIO;
> + ret = -EIOCBRETRY;
> unlock_page(page);
> }
> pagevec_release(&pvec);
> @@ -440,7 +440,7 @@ EXPORT_SYMBOL_GPL(invalidate_inode_pages
> * Any pages which are found to be mapped into pagetables are unmapped
> prior to
> * invalidation.
> *
> - * Returns -EIO if any pages could not be invalidated.
> + * Returns -EIOCBRETRY if any pages could not be invalidated.
> */
> int invalidate_inode_pages2(struct address_space *mapping)
> {
>
> --
> To unsubscribe, send a message with 'unsubscribe linux-aio' in
> the body to [email protected]. For more info on Linux AIO,
> see: http://www.kvack.org/aio/
> Don't email: <a href=mailto:"[email protected]">[email protected]</a>

--
Suparna Bhattacharya ([email protected])
Linux Technology Center
IBM Software Lab, India

2007-02-09 09:52:36

by Ananiev, Leonid I

[permalink] [raw]
Subject: RE: [PATCH] aio: fix kernel bug when page is temporally busy

I have used EIOCBRETRY in the patch to minimize source code modification
only.
It is notable that EIOCBRETRY is never set in kernel, but tested only.

EAGAIN - means that you may want to recall the same function with the
same argument. But user have not to recall it just now. He may want to
free or process some other his resources before recall for success.
EIOCBRETRY - means that you may not recall the same function with the
same argument because an argument has no sense already in cuurnt
context. The result is in processing already. You may want to recall
much higher level or external function to get the common result. It is
means that inernal caller should ignore or let through errno to its
caller but do not recall the function which has returned this errno.

A lot of errno's have different meaning in different functions or
contexts. EAGAIN could be used instated of EIOCBRETRY for irredundant
set.

Leonid

-----Original Message-----
From: Suparna Bhattacharya [mailto:[email protected]]
Sent: Friday, February 09, 2007 10:16 AM
To: Ananiev, Leonid I
Cc: Andrew Morton; [email protected]; linux-aio; Zach Brown;
Chris Mason; Badari Pulavarty
Subject: Re: [PATCH] aio: fix kernel bug when page is temporally busy

On Fri, Feb 09, 2007 at 08:41:41AM +0300, Ananiev, Leonid I wrote:
> > invalidate_inode_pages2() has other callers. I suspect with this
> change
> > we'll end up leaking EIOCBRETRY back to userspace.
>
> EIOCBRETRY is used and caught already in do_sync_read() and
> do_sync_readv_writev().
>
> Below fixed patch against kernel 2.6.20.

I agree with Andrew, this doesn't look like the right solution.

EIOCBRETRY isn't the same as EAGAIN or "try again later". EIOCBRETRY
means
"I have queued up an action to notify (wakeup) the callback to retry the
operation once data is ready".

Regards
Suparna

>
> >From Leonid Ananiev
>
> Fix kernel bug when IO page is temporally busy:
> invalidate_inode_pages2() returns EIOCBRETRY but not EIO..
>
> Signed-off-by: Leonid Ananiev <[email protected]>
> ---
> --- linux-2.6.20/mm/truncate.c 2007-02-04 10:44:54.000000000
-0800
> +++ linux-2.6.20p/mm/truncate.c 2007-02-08 11:38:11.000000000
-0800
> @@ -366,7 +366,7 @@ static int do_launder_page(struct addres
> * Any pages which are found to be mapped into pagetables are
unmapped
> prior to
> * invalidation.
> *
> - * Returns -EIO if any pages could not be invalidated.
> + * Returns -EIOCBRETRY if any pages could not be invalidated.
> */
> int invalidate_inode_pages2_range(struct address_space *mapping,
> pgoff_t start, pgoff_t end)
> @@ -423,7 +423,7 @@ int invalidate_inode_pages2_range(struct
> }
> ret = do_launder_page(mapping, page);
> if (ret == 0 &&
> !invalidate_complete_page2(mapping, page))
> - ret = -EIO;
> + ret = -EIOCBRETRY;
> unlock_page(page);
> }
> pagevec_release(&pvec);
> @@ -440,7 +440,7 @@ EXPORT_SYMBOL_GPL(invalidate_inode_pages
> * Any pages which are found to be mapped into pagetables are
unmapped
> prior to
> * invalidation.
> *
> - * Returns -EIO if any pages could not be invalidated.
> + * Returns -EIOCBRETRY if any pages could not be invalidated.
> */
> int invalidate_inode_pages2(struct address_space *mapping)
> {
>
> --
> To unsubscribe, send a message with 'unsubscribe linux-aio' in
> the body to [email protected]. For more info on Linux AIO,
> see: http://www.kvack.org/aio/
> Don't email: <a href=mailto:"[email protected]">[email protected]</a>

--
Suparna Bhattacharya ([email protected])
Linux Technology Center
IBM Software Lab, India

2007-02-09 09:56:05

by Jiri Kosina

[permalink] [raw]
Subject: Re: [PATCH] aio: fix kernel bug when page is temporally busy

On Fri, 9 Feb 2007, Ananiev, Leonid I wrote:

> Fix "Kernel BUG at fs/aio.c:509". Return EIOCBRETRY but not EIO if page
> is busy.

I am currently also hunting in this area, and I think there is something
strange in generic_file_aio_read(). This seems strange:

for (seg = 0; seg < nr_segs; seg++) {
read_descriptor_t desc;

desc.written = 0;
desc.arg.buf = iov[seg].iov_base;
desc.count = iov[seg].iov_len;
if (desc.count == 0)
continue;
desc.error = 0;

do_generic_file_read(filp,ppos,&desc,file_read_actor);
retval += desc.written;
if (desc.error) {
retval = retval ?: desc.error;
break;
}
}

So this code propagates desc.error back to caller _only if_ all of the
previous segments had desc.written == 0. Which is bogus - we need to
propagate -EIOCBRETRY as soon as any of the do_generic_file_read()
returned it, so that the caller is aware of requests being queued.

I think something like this would be appropriate

[PATCH] AIO: handle return value from do_generic_file_read() properly

generic_file_aio_read() doesn't always propagate error return value from
do_generic_file_read(). Namely handling of -EIOCBRETRY is important, because
aio_run_iocb() relies on this return value being always properly propagated
to it.

Signed-off-by: Jiri Kosina <[email protected]>

---

mm/filemap.c | 2 +-
1 files changed, 1 insertions(+), 1 deletions(-)

diff --git a/mm/filemap.c b/mm/filemap.c
index 8332c77..5ce76ce 100644
--- a/mm/filemap.c
+++ b/mm/filemap.c
@@ -1204,7 +1204,7 @@ generic_file_aio_read(struct kiocb *iocb, const struct iovec *iov,
do_generic_file_read(filp,ppos,&desc,file_read_actor);
retval += desc.written;
if (desc.error) {
- retval = retval ?: desc.error;
+ retval = desc.error;
break;
}
}

--
Jiri Kosina

2007-02-09 10:12:53

by Jiri Kosina

[permalink] [raw]
Subject: RE: [PATCH] aio: fix kernel bug when page is temporally busy

On Fri, 9 Feb 2007, Ananiev, Leonid I wrote:

> I have used EIOCBRETRY in the patch to minimize source code modification
> only. It is notable that EIOCBRETRY is never set in kernel, but tested
> only.

There is indeed something strange in aio in 2.6.20 (and maybe older) - for
example do_generic_mapping_read() gets stuck inside lock_page() in the
situation when the page is not up-to-date, instead of returning properl
error to aio_run_cb().

Or does it work in some other way and I get it wrong?

--
Jiri Kosina

2007-02-09 10:15:34

by Andrew Morton

[permalink] [raw]
Subject: Re: [PATCH] aio: fix kernel bug when page is temporally busy

On Fri, 9 Feb 2007 10:54:36 +0100 (CET) Jiri Kosina <[email protected]> wrote:

> @@ -1204,7 +1204,7 @@ generic_file_aio_read(struct kiocb *iocb, const struct iovec *iov,
> do_generic_file_read(filp,ppos,&desc,file_read_actor);
> retval += desc.written;
> if (desc.error) {
> - retval = retval ?: desc.error;
> + retval = desc.error;
> break;
> }

Nope. On error the read() syscall must return the number of bytes which
were successfully read.

2007-02-09 10:41:06

by Jiri Kosina

[permalink] [raw]
Subject: Re: [PATCH] aio: fix kernel bug when page is temporally busy

On Fri, 9 Feb 2007, Andrew Morton wrote:

> > @@ -1204,7 +1204,7 @@ generic_file_aio_read(struct kiocb *iocb, const struct iovec *iov,
> > do_generic_file_read(filp,ppos,&desc,file_read_actor);
> > retval += desc.written;
> > if (desc.error) {
> > - retval = retval ?: desc.error;
> > + retval = desc.error;
> > break;
> > }
> Nope. On error the read() syscall must return the number of bytes which
> were successfully read.

You are right.

In current mainline this even is not a problem, because noone seems to be
setting the error values to EIOCBRETRY. But it still stinks a bit, as
there are tests for EIOCBRETRY.

--
Jiri Kosina

2007-02-09 11:00:37

by Suparna Bhattacharya

[permalink] [raw]
Subject: Re: [PATCH] aio: fix kernel bug when page is temporally busy

On Fri, Feb 09, 2007 at 11:40:27AM +0100, Jiri Kosina wrote:
> On Fri, 9 Feb 2007, Andrew Morton wrote:
>
> > > @@ -1204,7 +1204,7 @@ generic_file_aio_read(struct kiocb *iocb, const struct iovec *iov,
> > > do_generic_file_read(filp,ppos,&desc,file_read_actor);
> > > retval += desc.written;
> > > if (desc.error) {
> > > - retval = retval ?: desc.error;
> > > + retval = desc.error;
> > > break;
> > > }
> > Nope. On error the read() syscall must return the number of bytes which
> > were successfully read.
>
> You are right.
>
> In current mainline this even is not a problem, because noone seems to be
> setting the error values to EIOCBRETRY. But it still stinks a bit, as
> there are tests for EIOCBRETRY.

We'd want retries to act only on the remaining bytes which haven't been
transferred as yet, so even in the EIOCBRETRY case the right thing to do
is to return the number of bytes that were successfully read without
blocking. The high level AIO code (see aio_rw_vect_rety) has the ability
to handle this.

So this is still correct.

Is there a real bug that you are seeing here ?

Regards
Suparna

>
> --
> Jiri Kosina
>
> --
> To unsubscribe, send a message with 'unsubscribe linux-aio' in
> the body to [email protected]. For more info on Linux AIO,
> see: http://www.kvack.org/aio/
> Don't email: <a href=mailto:"[email protected]">[email protected]</a>

--
Suparna Bhattacharya ([email protected])
Linux Technology Center
IBM Software Lab, India

2007-02-09 11:18:39

by Ananiev, Leonid I

[permalink] [raw]
Subject: RE: [PATCH] aio: fix kernel bug when page is temporally busy

>>noone seems to be setting the error values to EIOCBRETRY
> We'd want retries to act only on the remaining bytes which haven't
been
> transferred as yet, so even in the EIOCBRETRY...

OK. But any variable is not set to EIOCBRETRY in 2.6.14, 2.6.19,
2.6.20.
Is it correct?

Leonid

-----Original Message-----
From: Suparna Bhattacharya [mailto:[email protected]]
Sent: Friday, February 09, 2007 2:06 PM
To: Jiri Kosina
Cc: Andrew Morton; Ananiev, Leonid I; [email protected];
linux-aio; Zach Brown; Chris Mason; Badari Pulavarty; Jan Kara
Subject: Re: [PATCH] aio: fix kernel bug when page is temporally busy

On Fri, Feb 09, 2007 at 11:40:27AM +0100, Jiri Kosina wrote:
> On Fri, 9 Feb 2007, Andrew Morton wrote:
>
> > > @@ -1204,7 +1204,7 @@ generic_file_aio_read(struct kiocb *iocb,
const struct iovec *iov,
> > >
do_generic_file_read(filp,ppos,&desc,file_read_actor);
> > > retval += desc.written;
> > > if (desc.error) {
> > > - retval = retval ?: desc.error;
> > > + retval = desc.error;
> > > break;
> > > }
> > Nope. On error the read() syscall must return the number of bytes
which
> > were successfully read.
>
> You are right.
>
> In current mainline this even is not a problem, because noone seems to
be
> setting the error values to EIOCBRETRY. But it still stinks a bit, as
> there are tests for EIOCBRETRY.

We'd want retries to act only on the remaining bytes which haven't been
transferred as yet, so even in the EIOCBRETRY case the right thing to do
is to return the number of bytes that were successfully read without
blocking. The high level AIO code (see aio_rw_vect_rety) has the ability
to handle this.

So this is still correct.

Is there a real bug that you are seeing here ?

Regards
Suparna

>
> --
> Jiri Kosina
>
> --
> To unsubscribe, send a message with 'unsubscribe linux-aio' in
> the body to [email protected]. For more info on Linux AIO,
> see: http://www.kvack.org/aio/
> Don't email: <a href=mailto:"[email protected]">[email protected]</a>

--
Suparna Bhattacharya ([email protected])
Linux Technology Center
IBM Software Lab, India

2007-02-09 17:03:32

by Zach Brown

[permalink] [raw]
Subject: Re: [PATCH] aio: fix kernel bug when page is temporally busy


On Feb 9, 2007, at 6:05 AM, Suparna Bhattacharya wrote:

> On Fri, Feb 09, 2007 at 11:40:27AM +0100, Jiri Kosina wrote:
>> On Fri, 9 Feb 2007, Andrew Morton wrote:
>>
>>>> @@ -1204,7 +1204,7 @@ generic_file_aio_read(struct kiocb *iocb,
>>>> const struct iovec *iov,
>>>> do_generic_file_read(filp,ppos,&desc,file_read_actor);
>>>> retval += desc.written;
>>>> if (desc.error) {
>>>> - retval = retval ?: desc.error;
>>>> + retval = desc.error;

I was worried about this too.

> blocking. The high level AIO code (see aio_rw_vect_rety) has the
> ability
> to handle this.

I had missed this, and yeah, that's some level of comfort.

But I'm not convinced we can guarantee that's safe. The positive
return code that aio_rw_vect_retry() sees is telling it that some IO
has completed and, arguably, that no more IO is in flight. If we
return partial progress from generic_file_aio_read() while we have an
iocb in a wait queue then we are adding yet another invariant. That
while an iocb is pending from a previous call down the call chain, we
can't return a non-aio negative error. Doing so would cause fs/aio.c
to complete while there is still an iocb on a wait queue from a
previous retry attempt. Right?

I also noticed something just now while poking around these paths to
see if I could get the start of generic_file_aio_read() to fail when
it had previously succeeded. What's to stop another task from racing
to set O_DIRECT between retries?

That sounds like a pretty hilarious way to get a read retry to fail
due to buffer misalignment while a previously buffered instance of it
is still in flight. Hi-larious.

In thinking about this a discussing it with Chris a bit, I wonder if
the right fix isn't to refuse changing O_DIRECT via setfl() once any
IO paths have started on the filp. Something like:

filp->frozen_flags |= O_DIRECT

at the start of paths and check it in setfl()?

Are we similarly worried about O_APPEND?

- z

2007-02-10 18:05:35

by Ken Chen

[permalink] [raw]
Subject: Re: [PATCH] aio: fix kernel bug when page is temporally busy

On 2/9/07, Ananiev, Leonid I <[email protected]> wrote:
> I have used EIOCBRETRY in the patch to minimize source code modification
> only.
> [...]
> A lot of errno's have different meaning in different functions or
> contexts. EAGAIN could be used instated of EIOCBRETRY for irredundant
> set.

I also think the original patch is wrong. It might shut up kernel
panic by eliminate double calls to aio_complete(), but it will
silently introduce data corruption.

If invalidate_inode_pages2_range() says it can not invalidate pages,
while dio to the same file offset range is in flight, something is
really wrong there. In generic_file_direct_IO, the function
explicitly flushes all dirty pages and wait on them before submits
DIO.

So any error value returned from invalidate_inode_pages2_range() has
to be taken seriously in the direct IO submit path instead of dropping
it to the floor.

- Ken

2007-02-10 18:17:36

by Ananiev, Leonid I

[permalink] [raw]
Subject: RE: [PATCH] aio: fix kernel bug when page is temporally busy

But if page is busy as invalidate_inode_pages2_range() says because of
bh_count>0 then aio_complet() is not called from aio_run_iocb() and next
retry() will get the iocb result.

Leonid

-----Original Message-----
From: Ken Chen [mailto:[email protected]]
Sent: Saturday, February 10, 2007 9:05 PM
To: Ananiev, Leonid I
Cc: [email protected]; Andrew Morton; [email protected];
linux-aio; Zach Brown; Chris Mason; Badari Pulavarty
Subject: Re: [PATCH] aio: fix kernel bug when page is temporally busy

On 2/9/07, Ananiev, Leonid I <[email protected]> wrote:
> I have used EIOCBRETRY in the patch to minimize source code
modification
> only.
> [...]
> A lot of errno's have different meaning in different functions or
> contexts. EAGAIN could be used instated of EIOCBRETRY for irredundant
> set.

I also think the original patch is wrong. It might shut up kernel
panic by eliminate double calls to aio_complete(), but it will
silently introduce data corruption.

If invalidate_inode_pages2_range() says it can not invalidate pages,
while dio to the same file offset range is in flight, something is
really wrong there. In generic_file_direct_IO, the function
explicitly flushes all dirty pages and wait on them before submits
DIO.

So any error value returned from invalidate_inode_pages2_range() has
to be taken seriously in the direct IO submit path instead of dropping
it to the floor.

- Ken

2007-02-10 18:27:43

by Ananiev, Leonid I

[permalink] [raw]
Subject: RE: [PATCH] aio: fix kernel bug when page is temporally busy

> If invalidate_inode_pages2_range() says it can not invalidate pages,
> while dio to the same file offset range is in flight, something is
> really wrong there.

If invalidate_inode_pages2_range() says it can not invalidate pages
It means that soft_irq does completing now on other cpu.
Next retry() will see a IO well completed.
Leonid

-----Original Message-----
From: Ken Chen [mailto:[email protected]]
Sent: Saturday, February 10, 2007 9:05 PM
To: Ananiev, Leonid I
Cc: [email protected]; Andrew Morton; [email protected];
linux-aio; Zach Brown; Chris Mason; Badari Pulavarty
Subject: Re: [PATCH] aio: fix kernel bug when page is temporally busy

On 2/9/07, Ananiev, Leonid I <[email protected]> wrote:
> I have used EIOCBRETRY in the patch to minimize source code
modification
> only.
> [...]
> A lot of errno's have different meaning in different functions or
> contexts. EAGAIN could be used instated of EIOCBRETRY for irredundant
> set.

I also think the original patch is wrong. It might shut up kernel
panic by eliminate double calls to aio_complete(), but it will
silently introduce data corruption.

If invalidate_inode_pages2_range() says it can not invalidate pages,
while dio to the same file offset range is in flight, something is
really wrong there. In generic_file_direct_IO, the function
explicitly flushes all dirty pages and wait on them before submits
DIO.

So any error value returned from invalidate_inode_pages2_range() has
to be taken seriously in the direct IO submit path instead of dropping
it to the floor.

- Ken

2007-02-10 21:57:25

by Ananiev, Leonid I

[permalink] [raw]
Subject: RE: [PATCH] aio: fix kernel bug when page is temporally busy

Ken Chen writes:
> So any error value returned from invalidate_inode_pages2_range() has
> to be taken seriously in the direct IO submit path instead of dropping
> it to the floor.

If invalidate_inode_pages2_range() will return EIOCBRETRY as the patch
"aio: fix kernel bug when page is temporally busy"
serts then do_sync_read/write() will not drop IO submit but will retry
it:
for (;;) {
ret = filp->f_op->aio_read(&kiocb, &iov, 1,
kiocb.ki_pos);
if (ret != -EIOCBRETRY)
break;
wait_on_retry_sync_kiocb(&kiocb);
}

And do_sync_read/write() will not return EIO if page is busy as it does
now, befor patching.
Curretly do_sync_read/write() tests for EIOCBRETRY. But EIOCBRETRY is
not set ever.

Leonid

2007-02-12 22:53:06

by Ananiev, Leonid I

[permalink] [raw]
Subject: RE: [PATCH] aio: fix kernel bug when page is temporally busy

Andrew,
You wrote on Friday, February 09, 2007 8:53 AM
> invalidate_inode_pages2() has other callers. I suspect with this
change
> we'll end up leaking EIOCBRETRY back to userspace.

The path is modified so that invalidate_inode_pages2() returns EIO as
earlier.
could you consider modified patch
The patch against 2.6.20.

Long story: The kernel panic is happening after hours of AIO benchmark
running in mcp.
First of all it was found that the kernel panic happens if IO error is
reported.
But later it was found that the actual reason is not in real IO error
but in a busy page.
While the current CPU tests if IO is completed it happens that another
CPU
at the same time processes IO completion in soft_irq.
The considered buffer page is busy now by second CPU and
invalidate_inode_pages2_range() returns EIO in this case.
First CPU reports EIO to caller ; completes IO and frees control block
in aio_complete().
Second CPU frees the same control block once more.
The patch makes invalidate_inode_pages2_range() to return EIOCBRETRY
which is tested just in aio_run_iocb(). It retries IO competition check
if EIOCBRETRY is got.
EIOCBRETRY is tested in do_sync_read/write() functions as well.
And direct IO competition will be retested "instead of dropping it to
the floor".

>From Leonid Ananiev

Fix kernel bug when IO page is temporally busy:
invalidate_inode_pages2_range() returns EIOCBRETRY but not EIO.
invalidate_inode_pages2() returns EIO as earlier.

Signed-off-by: Leonid Ananiev <[email protected]>
---
--- linux-2.6.20/mm/truncate.c 2007-02-04 10:44:54.000000000 -0800
+++ linux-2.6.20p/mm/truncate.c 2007-02-08 22:56:52.000000000 -0800
@@ -366,7 +366,7 @@ static int do_launder_page(struct addres
* Any pages which are found to be mapped into pagetables are unmapped
prior to
* invalidation.
*
- * Returns -EIO if any pages could not be invalidated.
+ * Returns -EIOCBRETRY if any pages could not be invalidated.
*/
int invalidate_inode_pages2_range(struct address_space *mapping,
pgoff_t start, pgoff_t end)
@@ -423,7 +423,7 @@ int invalidate_inode_pages2_range(struct
}
ret = do_launder_page(mapping, page);
if (ret == 0 &&
!invalidate_complete_page2(mapping, page))
- ret = -EIO;
+ ret = -EIOCBRETRY;
unlock_page(page);
}
pagevec_release(&pvec);
@@ -444,6 +444,7 @@ EXPORT_SYMBOL_GPL(invalidate_inode_pages
*/
int invalidate_inode_pages2(struct address_space *mapping)
{
- return invalidate_inode_pages2_range(mapping, 0, -1);
+ int ret = invalidate_inode_pages2_range(mapping, 0, -1);
+ return (ret < 0)?-EIO:ret;
}
EXPORT_SYMBOL_GPL(invalidate_inode_pages2);

2007-02-12 23:21:30

by Ananiev, Leonid I

[permalink] [raw]
Subject: RE: [PATCH] aio: fix kernel bug when page is temporally busy

Andrew,
You wrote on Friday, February 09, 2007 8:53 AM
> invalidate_inode_pages2() has other callers. I suspect with this
change
> we'll end up leaking EIOCBRETRY back to userspace.

The path is modified so that invalidate_inode_pages2() returns EIO as
earlier.
could you consider modified patch
The patch against 2.6.20.

Long story: The kernel panic is happening after hours of AIO benchmark
running in mcp.
First of all it was found that the kernel panic happens if IO error is
reported.
But later it was found that the actual reason is not in real IO error
but in a busy page.
While the current CPU tests if IO is completed it happens that another
CPU
at the same time processes IO completion in soft_irq.
The considered buffer page is busy now by second CPU and
invalidate_inode_pages2_range() returns EIO in this case.
First CPU reports EIO to caller ; completes IO and frees control block
in aio_complete().
Second CPU frees the same control block once more.
The patch makes invalidate_inode_pages2_range() to return EIOCBRETRY
which is tested just in aio_run_iocb(). It retries IO competition check
if EIOCBRETRY is got.
EIOCBRETRY is tested in do_sync_read/write() functions as well.
And direct IO competition will be retested "instead of dropping it to
the floor".

>From Leonid Ananiev

Fix kernel bug when IO page is temporally busy:
invalidate_inode_pages2_range() returns EIOCBRETRY but not EIO.
invalidate_inode_pages2() returns EIO as earlier.

Signed-off-by: Leonid Ananiev <[email protected]>
---
--- linux-2.6.20/mm/truncate.c 2007-02-04 10:44:54.000000000 -0800
+++ linux-2.6.20p/mm/truncate.c 2007-02-08 22:56:52.000000000 -0800
@@ -366,7 +366,7 @@ static int do_launder_page(struct addres
* Any pages which are found to be mapped into pagetables are unmapped
prior to
* invalidation.
*
- * Returns -EIO if any pages could not be invalidated.
+ * Returns -EIOCBRETRY if any pages could not be invalidated.
*/
int invalidate_inode_pages2_range(struct address_space *mapping,
pgoff_t start, pgoff_t end)
@@ -423,7 +423,7 @@ int invalidate_inode_pages2_range(struct
}
ret = do_launder_page(mapping, page);
if (ret == 0 &&
!invalidate_complete_page2(mapping, page))
- ret = -EIO;
+ ret = -EIOCBRETRY;
unlock_page(page);
}
pagevec_release(&pvec);
@@ -444,6 +444,7 @@ EXPORT_SYMBOL_GPL(invalidate_inode_pages
*/
int invalidate_inode_pages2(struct address_space *mapping)
{
- return invalidate_inode_pages2_range(mapping, 0, -1);
+ int ret = invalidate_inode_pages2_range(mapping, 0, -1);
+ return (ret < 0)?-EIO:ret;
}
EXPORT_SYMBOL_GPL(invalidate_inode_pages2);

2007-02-15 09:18:10

by Ananiev, Leonid I

[permalink] [raw]
Subject: RE: [PATCH] aio: fix kernel bug when page is temporally busy

Ken Chen wrote
> It might shut up kernel
> panic by eliminate double calls to aio_complete(), but it will
> silently introduce data corruption.

I had got kernel panic after an hour of aiostress running.
After patching I have not got aiostress massage
"verify error, file %s offset %Lu contents (offset:bad:good):\n"
during 5 hour aiostress running with 'verify' option.
Looking closely into aiostress.c
ftp://ftp.suse.com/pub/people/mason/utils/aio-stress.c
we can see that this program may write in random mode THE SAME
contents to the same file offset asynchronously from different
buffers and do not cure about it. Does Ken consider that kernel
panic is the best way to prevent data corruption in such kind
of programs?

> So any error value returned from invalidate_inode_pages2_range() has
> to be taken seriously in the direct IO submit path instead of dropping
> it to the floor.
If invalidate_inode_pages2_range() will return EIOCBRETRY as the patch
"aio: fix kernel bug when page is temporally busy"
sets then do_sync_read/write() will not drop IO submit but will retry
it:
for (;;) {
ret = filp->f_op->aio_read(&kiocb, &iov, 1, kiocb.ki_pos);
if (ret != -EIOCBRETRY)
break;
wait_on_retry_sync_kiocb(&kiocb);
}
And do_sync_read/write() will not return EIO if page is busy
as it does now, before patching.

Ken Chen wrote:
> I also think the original patch is wrong.
What do you mean saying 'also'?

Leonid

2007-02-15 18:25:51

by Zach Brown

[permalink] [raw]
Subject: Re: [PATCH] aio: fix kernel bug when page is temporally busy

> If invalidate_inode_pages2_range() will return EIOCBRETRY as the patch
> "aio: fix kernel bug when page is temporally busy"

Sorry Leonid, this patch is not safe.

It returns -EIOCBRETRY without guaranteeing that kick_iocb() will be
called. This can lead to operations hanging, both AIO and calls that
come through do_sync_{read,write}.

It overwrites -EIOCBQUEUED, leading to an aio_complete() while a
retry is happening. This can lead to reference count confusion.
Double-frees, referencing freed memory, that kind of thing. This
isn't a new problem. The current code that overwrites with -EIO has
this problem. But moving to -EIOCBRETRY does introduce new behaviour
of aio_complete() and the retry path racing.

I'll have a candidate patch to address the problem of EIO being
raised on the way back up from a path which has returned -EIOCBQUEUED.

- z

2007-02-15 19:11:31

by Ananiev, Leonid I

[permalink] [raw]
Subject: RE: [PATCH] aio: fix kernel bug when page is temporally busy

> It returns -EIOCBRETRY without guaranteeing that kick_iocb() will be
> called. This can lead to operations hanging

If EIOCBRETRY then generic_file_aio_write() will be recalled for the
same iocb.

> It overwrites -EIOCBQUEUED, leading to an aio_complete() while a
> retry is happening.

EIOCBQUEUED or EIOCBRETRY does not lead to aio_complete() call:
if (ret != -EIOCBRETRY && ret != -EIOCBQUEUED) {
aio_complete(iocb, ret, 0);

> This can lead to reference count confusion.
But just reference count confusion was deleted by patch. Isn't it?

Leonid

-----Original Message-----
From: Zach Brown [mailto:[email protected]]
Sent: Thursday, February 15, 2007 9:25 PM
To: Ananiev, Leonid I
Cc: Ken Chen; [email protected]; Andrew Morton;
[email protected]; linux-aio; Chris Mason
Subject: Re: [PATCH] aio: fix kernel bug when page is temporally busy

> If invalidate_inode_pages2_range() will return EIOCBRETRY as the patch
> "aio: fix kernel bug when page is temporally busy"

Sorry Leonid, this patch is not safe.

It returns -EIOCBRETRY without guaranteeing that kick_iocb() will be
called. This can lead to operations hanging, both AIO and calls that
come through do_sync_{read,write}.

It overwrites -EIOCBQUEUED, leading to an aio_complete() while a
retry is happening. This can lead to reference count confusion.
Double-frees, referencing freed memory, that kind of thing. This
isn't a new problem. The current code that overwrites with -EIO has
this problem. But moving to -EIOCBRETRY does introduce new behaviour
of aio_complete() and the retry path racing.

I'll have a candidate patch to address the problem of EIO being
raised on the way back up from a path which has returned -EIOCBQUEUED.

- z

2007-02-15 19:23:38

by Zach Brown

[permalink] [raw]
Subject: Re: [PATCH] aio: fix kernel bug when page is temporally busy


On Feb 15, 2007, at 11:11 AM, Ananiev, Leonid I wrote:

>> It returns -EIOCBRETRY without guaranteeing that kick_iocb() will be
>> called. This can lead to operations hanging
>
> If EIOCBRETRY then generic_file_aio_write() will be recalled for the
> same iocb.

Only if kick_iocb() is called. It won't be called if i_i_p2_r() was
the only thing to return -EIOCBRETRY.

>
>> It overwrites -EIOCBQUEUED, leading to an aio_complete() while a
>> retry is happening.
>
> EIOCBQUEUED or EIOCBRETRY does not lead to aio_complete() call:

Not by fs/aio.c, but *by the place that originated -EIOCBQUEUED*.
Later. After IO has completed. see fs/direct-io.c:dio_bio_end_aio().

This is what -EIOCBQUEUED means! It's a promise to call aio_complete
() in the future.

Have you read the giant comment over the definition of struct kiocb
in include/linux/aio.h?

>> This can lead to reference count confusion.
> But just reference count confusion was deleted by patch. Isn't it?

Sorry, I don't understand what you're trying to ask here.

- z

2007-02-15 21:06:37

by Ananiev, Leonid I

[permalink] [raw]
Subject: RE: [PATCH] aio: fix kernel bug when page is temporally busy



-----Original Message-----
From: Zach Brown [mailto:[email protected]]
Sent: Thursday, February 15, 2007 10:23 PM
To: Ananiev, Leonid I
Cc: Ken Chen; [email protected]; Andrew Morton;
[email protected]; linux-aio; Chris Mason
Subject: Re: [PATCH] aio: fix kernel bug when page is temporally busy


On Feb 15, 2007, at 11:11 AM, Ananiev, Leonid I wrote:

>> It returns -EIOCBRETRY without guaranteeing that kick_iocb() will be
>> called. This can lead to operations hanging
>
> If EIOCBRETRY then generic_file_aio_write() will be recalled for the
> same iocb.

Only if kick_iocb() is called. It won't be called if i_i_p2_r() was
the only thing to return -EIOCBRETRY.

>
>> It overwrites -EIOCBQUEUED, leading to an aio_complete() while a
>> retry is happening.
>
> EIOCBQUEUED or EIOCBRETRY does not lead to aio_complete() call:

Not by fs/aio.c, but *by the place that originated -EIOCBQUEUED*.
Later. After IO has completed. see fs/direct-io.c:dio_bio_end_aio().

This is what -EIOCBQUEUED means! It's a promise to call aio_complete
() in the future.

Have you read the giant comment over the definition of struct kiocb
in include/linux/aio.h?

>> This can lead to reference count confusion.
> But just reference count confusion was deleted by patch. Isn't it?

Sorry, I don't understand what you're trying to ask here.

- z

2007-02-15 23:32:13

by Ananiev, Leonid I

[permalink] [raw]
Subject: RE: [PATCH] aio: fix kernel bug when page is temporally busy

>> If EIOCBRETRY then generic_file_aio_write() will be recalled for the
>> same iocb.
> Only if kick_iocb() is called. It won't be called if i_i_p2_r() was
> the only thing to return -EIOCBRETRY.
It is not need to call kick_iocb()
for generic_file_aio_write() calling.
It is recalled without any wakeup waiting:
for (;;) {
ret = filp->f_op->aio_write(&kiocb, &iov, 1,
kiocb.ki_pos);
if (ret != -EIOCBRETRY)
break;
wait_on_retry_sync_kiocb(&kiocb);
}
Note: wait_on_retry_sync_kiocb() does not wait.
That is for dio. For aio iocb generic_file_aio_write() call
is required from ki_run_list while next io_submit or
read_events() is called.
So when an IO hang may happen?

>> It overwrites -EIOCBQUEUED
Do you mean that there is one more kernel bug which
overwrites -EIOCBQUEUED by any errno or number of bytes and this
new value is returned to caller as an IO result
while IO is not finished yet.

The proposed patch does not crate this bug if any.
It actually fixes a kernel panic bag when iocb.users count becomes
incorrect. The bag " Kernel BUG at fs/aio.c:509" is there because
aio_run_iocb() have not a chance to differ real EIO and
EIO which is actually means EAGAYN or EIOCBRETRY.
I'me sure the patch changes source code in correct direction:
to differentiate that two kinds of EIOs.

> Have you read the giant comment over the definition of struct kiocb
> in include/linux/aio.h?
I have read. But compiler has not: it did not create an object code for
* If ki_retry returns -EIOCBRETRY ...

>>> This can lead to reference count confusion.
>> But just reference count confusion was deleted by patch. Isn't it?
>Sorry, I don't understand what you're trying to ask here.
One of reference count iocb.users confusion is deleted by the patch.
I'm not sure that there is other bag.

At least I have not see IO hang while testing.
It is interesting that I've not seen any EIOCBQUEUED returned
to aio_run_iocb() during 5 hours aiostress running.
Does it mean that EIOCBQUEUED is always reset and never returned?

Leonid

-----Original Message-----
From: Zach Brown [mailto:[email protected]]
Sent: Thursday, February 15, 2007 10:23 PM
To: Ananiev, Leonid I
Cc: Ken Chen; [email protected]; Andrew Morton;
[email protected]; linux-aio; Chris Mason
Subject: Re: [PATCH] aio: fix kernel bug when page is temporally busy


On Feb 15, 2007, at 11:11 AM, Ananiev, Leonid I wrote:

>> It returns -EIOCBRETRY without guaranteeing that kick_iocb() will be
>> called. This can lead to operations hanging
>
> If EIOCBRETRY then generic_file_aio_write() will be recalled for the
> same iocb.

Only if kick_iocb() is called. It won't be called if i_i_p2_r() was
the only thing to return -EIOCBRETRY.

>
>> It overwrites -EIOCBQUEUED, leading to an aio_complete() while a
>> retry is happening.
>
> EIOCBQUEUED or EIOCBRETRY does not lead to aio_complete() call:

Not by fs/aio.c, but *by the place that originated -EIOCBQUEUED*.
Later. After IO has completed. see fs/direct-io.c:dio_bio_end_aio().

This is what -EIOCBQUEUED means! It's a promise to call aio_complete
() in the future.

Have you read the giant comment over the definition of struct kiocb
in include/linux/aio.h?

>> This can lead to reference count confusion.
> But just reference count confusion was deleted by patch. Isn't it?

Sorry, I don't understand what you're trying to ask here.

- z

2007-02-16 00:01:35

by Zach Brown

[permalink] [raw]
Subject: Re: [PATCH] aio: fix kernel bug when page is temporally busy


On Feb 15, 2007, at 3:32 PM, Ananiev, Leonid I wrote:

>>> If EIOCBRETRY then generic_file_aio_write() will be recalled for the
>>> same iocb.
>> Only if kick_iocb() is called. It won't be called if i_i_p2_r() was
>> the only thing to return -EIOCBRETRY.
> It is not need to call kick_iocb()
> for generic_file_aio_write() calling.
> It is recalled without any wakeup waiting:
> for (;;) {
> ret = filp->f_op->aio_write(&kiocb, &iov, 1,
> kiocb.ki_pos);
> if (ret != -EIOCBRETRY)
> break;
> wait_on_retry_sync_kiocb(&kiocb);
> }
> Note: wait_on_retry_sync_kiocb() does not wait.

Yes it does. It will not return until kiocbSetKicked() is called,
and that is only called from kick_iocb().

>>> It overwrites -EIOCBQUEUED
> Do you mean that there is one more kernel bug which
> overwrites -EIOCBQUEUED by any errno or number of bytes and this
> new value is returned to caller as an IO result
> while IO is not finished yet.
>
> The proposed patch does not crate this bug if any.

Right, and I said that in the mail you're quoting.

> It actually fixes a kernel panic bag when iocb.users count becomes
> incorrect. The bag " Kernel BUG at fs/aio.c:509" is there because
> aio_run_iocb() have not a chance to differ real EIO and
> EIO which is actually means EAGAYN or EIOCBRETRY.

Yes, I understand the bug you're trying to fix. You're introducing
other bugs with the patch. It will not be merged.

> It is interesting that I've not seen any EIOCBQUEUED returned
> to aio_run_iocb() during 5 hours aiostress running.

What arguments are you running aio-stress with? -EIOCBQUEUED is only
used for O_DIRECT, and then only in certain circumstances.

- z

2007-02-16 12:20:22

by Ananiev, Leonid I

[permalink] [raw]
Subject: RE: [PATCH] aio: fix kernel bug when page is temporally busy

Zach Brown wrote:
> It will not return until kiocbSetKicked() is called,
> and that is only called from kick_iocb().
There is no test or wait of Kicked in considered
for (;;) aio_write() loop.

Zach Brown wrote:
>> The proposed patch does not crate this bug if any.
> Right, and I said that in the mail you're quoting.
...
> You're introducing other bugs with the patch.
Could you list other introduced bugs?

>> It is interesting that I've not seen any EIOCBQUEUED returned
>> to aio_run_iocb() during 5 hours aiostress running.
> What arguments are you running aio-stress with? -EIOCBQUEUED is only

> used for O_DIRECT
I wrote in the vary first mail that the panic is appearing in
random write O_DIRECT aio-stress running. Other aio-stress modes
where tested after patching as well.

> and then only in certain circumstances.
Looking closely into sources we can see that
EIOCBQUEUED never may be returned to aio_run_iocb().

include/linux/aio.h says
* If ki_retry returns -EIOCBRETRY ...
Could you point source line which "returns -EIOCBRETRY"?

Leonid
-----Original Message-----
From: Zach Brown [mailto:[email protected]]
Sent: Friday, February 16, 2007 3:01 AM
To: Ananiev, Leonid I
Cc: Ken Chen; [email protected]; Andrew Morton;
[email protected]; linux-aio; Chris Mason
Subject: Re: [PATCH] aio: fix kernel bug when page is temporally busy


On Feb 15, 2007, at 3:32 PM, Ananiev, Leonid I wrote:

>>> If EIOCBRETRY then generic_file_aio_write() will be recalled for the
>>> same iocb.
>> Only if kick_iocb() is called. It won't be called if i_i_p2_r() was
>> the only thing to return -EIOCBRETRY.
> It is not need to call kick_iocb()
> for generic_file_aio_write() calling.
> It is recalled without any wakeup waiting:
> for (;;) {
> ret = filp->f_op->aio_write(&kiocb, &iov, 1,
> kiocb.ki_pos);
> if (ret != -EIOCBRETRY)
> break;
> wait_on_retry_sync_kiocb(&kiocb);
> }
> Note: wait_on_retry_sync_kiocb() does not wait.

Yes it does. It will not return until kiocbSetKicked() is called,
and that is only called from kick_iocb().

>>> It overwrites -EIOCBQUEUED
> Do you mean that there is one more kernel bug which
> overwrites -EIOCBQUEUED by any errno or number of bytes and this
> new value is returned to caller as an IO result
> while IO is not finished yet.
>
> The proposed patch does not crate this bug if any.

Right, and I said that in the mail you're quoting.

> It actually fixes a kernel panic bag when iocb.users count becomes
> incorrect. The bag " Kernel BUG at fs/aio.c:509" is there because
> aio_run_iocb() have not a chance to differ real EIO and
> EIO which is actually means EAGAYN or EIOCBRETRY.

Yes, I understand the bug you're trying to fix. You're introducing
other bugs with the patch. It will not be merged.

> It is interesting that I've not seen any EIOCBQUEUED returned
> to aio_run_iocb() during 5 hours aiostress running.

What arguments are you running aio-stress with? -EIOCBQUEUED is only
used for O_DIRECT, and then only in certain circumstances.

- z