2006-03-19 09:27:49

by Chen, Kenneth W

[permalink] [raw]
Subject: [patch] bug fix in dio handling write error - v2

Referring to original posting:
http://marc.theaimsgroup.com/?t=113752710100001&r=1&w=2

Suparna pointed out that this fix has a potential race window. I think
the race condition also exists on the READ side currently. The fundamental
problem is that dio->result is overloaded with dual use: an indicator of
fall back path for partial dio write, and an error indicator used in the
I/O completion path. In the event of device error, the setting of -EIO
to dio->result clashes with value used to track partial write that activates
the fall back path.

One way to fix the race issue is to pull all the code that uses dio->result
up before I/O is submitted and then never look at dio->result once IO is
submitted. Or alternatively another independent variable can be introduce
to track fall back state, note that to set the tracking logic, kernel still
have to look at dio->result before I/O is submitted.

The following patch implements the first option. I would appreciate some
reviews. Thanks.



[patch] bug fix in dio handling write error

There is a bug in direct-io on propagating write error up to the
higher I/O layer. When performing an async ODIRECT write to a
block device, if a device error occurred (like media error or disk
is pulled), the error code is only propagated from device driver
to the DIO layer. The error code stops at finished_one_bio(). The
aysnc write, however, is supposedly have a corresponding AIO event
with appropriate return code (in this case -EIO). Application
which waits on the async write event, will hang forever since such
AIO event is lost forever (if such app did not use the timeout
option in io_getevents call. Regardless, an AIO event is lost).

The discovery of above bug leads to another discovery of potential
race window with dio->result. The fundamental problem is that
dio->result is overloaded with dual use: an indicator of fall back
path for partial dio write, and an error indicator used in the I/O
completion path. In the event of device error, the setting of -EIO
to dio->result clashes with value used to track partial write that
activates the fall back path.

Propose to fix the race issue by pulling all the code that uses
dio->result up before I/O is submitted and then never look at
dio->result once IO is submitted.

Signed-off-by: Ken Chen <[email protected]>

--- ./fs/direct-io.c.orig 2006-01-02 19:21:10.000000000 -0800
+++ ./fs/direct-io.c 2006-03-19 01:57:49.797391064 -0800
@@ -253,8 +253,7 @@ static void finished_one_bio(struct dio
dio_complete(dio, offset, transferred);

/* Complete AIO later if falling back to buffered i/o */
- if (dio->result == dio->size ||
- ((dio->rw == READ) && dio->result)) {
+ if (!dio->waiter) {
aio_complete(dio->iocb, transferred, 0);
kfree(dio);
return;
@@ -939,7 +938,7 @@ direct_io_worker(int rw, struct kiocb *i
struct dio *dio)
{
unsigned long user_addr;
- int seg;
+ int seg, should_wait = 0;
ssize_t ret = 0;
ssize_t ret2;
size_t bytes;
@@ -1031,6 +1030,16 @@ direct_io_worker(int rw, struct kiocb *i
}
} /* end iovec loop */

+ /*
+ * We'll have to wait for a partial write to drain before falling back
+ * to buffered. We check this before submitting io so that completion
+ * doesn't have a chance to overwrite dio->result with -EIO.
+ */
+ if (dio->result < dio->size && dio->is_async && rw == WRITE) {
+ dio->waiter = current;
+ should_wait = 1;
+ }
+
if (ret == -ENOTBLK && rw == WRITE) {
/*
* The remaining part of the request will be
@@ -1051,6 +1060,7 @@ direct_io_worker(int rw, struct kiocb *i
page_cache_release(dio->cur_page);
dio->cur_page = NULL;
}
+ ret2 = dio->result;
if (dio->bio)
dio_bio_submit(dio);

@@ -1073,14 +1083,8 @@ direct_io_worker(int rw, struct kiocb *i
* reflect the number of to-be-processed BIOs.
*/
if (dio->is_async) {
- int should_wait = 0;
-
- if (dio->result < dio->size && rw == WRITE) {
- dio->waiter = current;
- should_wait = 1;
- }
if (ret == 0)
- ret = dio->result;
+ ret = ret2;
finished_one_bio(dio); /* This can free the dio */
blk_run_address_space(inode->i_mapping);
if (should_wait) {





2006-03-19 11:55:00

by Suparna Bhattacharya

[permalink] [raw]
Subject: Re: [patch] bug fix in dio handling write error - v2

On Sun, Mar 19, 2006 at 01:27:33AM -0800, Chen, Kenneth W wrote:
> Referring to original posting:
> http://marc.theaimsgroup.com/?t=113752710100001&r=1&w=2
>
> Suparna pointed out that this fix has a potential race window. I think
> the race condition also exists on the READ side currently. The fundamental
> problem is that dio->result is overloaded with dual use: an indicator of
> fall back path for partial dio write, and an error indicator used in the
> I/O completion path. In the event of device error, the setting of -EIO
> to dio->result clashes with value used to track partial write that activates
> the fall back path.
>
> One way to fix the race issue is to pull all the code that uses dio->result
> up before I/O is submitted and then never look at dio->result once IO is
> submitted. Or alternatively another independent variable can be introduce
> to track fall back state, note that to set the tracking logic, kernel still
> have to look at dio->result before I/O is submitted.
>
> The following patch implements the first option. I would appreciate some
> reviews. Thanks.
>
>
>
> [patch] bug fix in dio handling write error
>
> There is a bug in direct-io on propagating write error up to the
> higher I/O layer. When performing an async ODIRECT write to a
> block device, if a device error occurred (like media error or disk
> is pulled), the error code is only propagated from device driver
> to the DIO layer. The error code stops at finished_one_bio(). The
> aysnc write, however, is supposedly have a corresponding AIO event
> with appropriate return code (in this case -EIO). Application
> which waits on the async write event, will hang forever since such
> AIO event is lost forever (if such app did not use the timeout
> option in io_getevents call. Regardless, an AIO event is lost).
>
> The discovery of above bug leads to another discovery of potential
> race window with dio->result. The fundamental problem is that
> dio->result is overloaded with dual use: an indicator of fall back
> path for partial dio write, and an error indicator used in the I/O
> completion path. In the event of device error, the setting of -EIO
> to dio->result clashes with value used to track partial write that
> activates the fall back path.
>
> Propose to fix the race issue by pulling all the code that uses
> dio->result up before I/O is submitted and then never look at
> dio->result once IO is submitted.
>
> Signed-off-by: Ken Chen <[email protected]>
>
> --- ./fs/direct-io.c.orig 2006-01-02 19:21:10.000000000 -0800
> +++ ./fs/direct-io.c 2006-03-19 01:57:49.797391064 -0800
> @@ -253,8 +253,7 @@ static void finished_one_bio(struct dio
> dio_complete(dio, offset, transferred);
>
> /* Complete AIO later if falling back to buffered i/o */
> - if (dio->result == dio->size ||
> - ((dio->rw == READ) && dio->result)) {
> + if (!dio->waiter) {
> aio_complete(dio->iocb, transferred, 0);
> kfree(dio);
> return;
> @@ -939,7 +938,7 @@ direct_io_worker(int rw, struct kiocb *i
> struct dio *dio)
> {
> unsigned long user_addr;
> - int seg;
> + int seg, should_wait = 0;
> ssize_t ret = 0;
> ssize_t ret2;
> size_t bytes;
> @@ -1031,6 +1030,16 @@ direct_io_worker(int rw, struct kiocb *i
> }
> } /* end iovec loop */
>
> + /*
> + * We'll have to wait for a partial write to drain before falling back
> + * to buffered. We check this before submitting io so that completion
> + * doesn't have a chance to overwrite dio->result with -EIO.
> + */
> + if (dio->result < dio->size && dio->is_async && rw == WRITE) {
> + dio->waiter = current;
> + should_wait = 1;
> + }
> +

Isn't there a possibility that part of the IO for the overall request
may already have been submitted at this point ? (i.e. within
do_direct_IO->submit_page_section ->dio_send_cur_page->dio_bio_submit)
This is what I was referring to in my earlier response to Zach's patch.

> if (ret == -ENOTBLK && rw == WRITE) {
> /*
> * The remaining part of the request will be
> @@ -1051,6 +1060,7 @@ direct_io_worker(int rw, struct kiocb *i
> page_cache_release(dio->cur_page);
> dio->cur_page = NULL;
> }
> + ret2 = dio->result;
> if (dio->bio)
> dio_bio_submit(dio);
>
> @@ -1073,14 +1083,8 @@ direct_io_worker(int rw, struct kiocb *i
> * reflect the number of to-be-processed BIOs.
> */
> if (dio->is_async) {
> - int should_wait = 0;
> -
> - if (dio->result < dio->size && rw == WRITE) {
> - dio->waiter = current;
> - should_wait = 1;
> - }
> if (ret == 0)
> - ret = dio->result;
> + ret = ret2;
> finished_one_bio(dio); /* This can free the dio */
> blk_run_address_space(inode->i_mapping);
> if (should_wait) {
>
>
>

Regards
Suparna

--
Suparna Bhattacharya ([email protected])
Linux Technology Center
IBM Software Lab, India

2006-03-19 22:27:42

by Chen, Kenneth W

[permalink] [raw]
Subject: RE: [patch] bug fix in dio handling write error - v2

Suparna Bhattacharya wrote on Sunday, March 19, 2006 3:55 AM
> On Sun, Mar 19, 2006 at 01:27:33AM -0800, Chen, Kenneth W wrote:
> > Referring to original posting:
> > http://marc.theaimsgroup.com/?t=113752710100001&r=1&w=2
> >
> > Suparna pointed out that this fix has a potential race window. I think
> > the race condition also exists on the READ side currently. The fundamental
> > problem is that dio->result is overloaded with dual use: an indicator of
> > fall back path for partial dio write, and an error indicator used in the
> > I/O completion path. In the event of device error, the setting of -EIO
> > to dio->result clashes with value used to track partial write that activates
> > the fall back path.
>
> Isn't there a possibility that part of the IO for the overall request
> may already have been submitted at this point ? (i.e. within
> do_direct_IO->submit_page_section ->dio_send_cur_page->dio_bio_submit)
> This is what I was referring to in my earlier response to Zach's patch.

I suppose it is possible there. What you are saying here is effectively
that it is impossible to use dio->result to track partial write and at the
same time to track error returned from device driver. Because direct_io_work
can only determines whether it is a partial write at the end of io submission
and in mid stream of those io submission, a return code could be coming back
from the driver. Thus messing up all the subsequent logic.

Taking one of your earlier idea, how about the following patch: separating
out IO completion code from partial IO tracking?


--- ./fs/direct-io.c.orig 2006-03-19 13:36:52.000000000 -0800
+++ ./fs/direct-io.c 2006-03-19 13:47:42.000000000 -0800
@@ -129,6 +129,7 @@ struct dio {
/* AIO related stuff */
struct kiocb *iocb; /* kiocb */
int is_async; /* is IO async ? */
+ int completion_code; /* IO completion code */
ssize_t result; /* IO result */
};

@@ -250,6 +251,10 @@ static void finished_one_bio(struct dio
((offset + transferred) > dio->i_size))
transferred = dio->i_size - offset;

+ /* check for error in completion path */
+ if (dio->completion_code)
+ transferred = dio->completion_code;
+
dio_complete(dio, offset, transferred);

/* Complete AIO later if falling back to buffered i/o */
@@ -406,7 +411,7 @@ static int dio_bio_complete(struct dio *
int page_no;

if (!uptodate)
- dio->result = -EIO;
+ dio->completion_code = -EIO;

if (dio->is_async && dio->rw == READ) {
bio_check_pages_dirty(bio); /* transfers ownership */
@@ -964,6 +969,7 @@ direct_io_worker(int rw, struct kiocb *i
dio->next_block_for_io = -1;

dio->page_errors = 0;
+ dio->completion_code = 0;
dio->result = 0;
dio->iocb = iocb;
dio->i_size = i_size_read(inode);


2006-03-20 21:28:39

by Zach Brown

[permalink] [raw]
Subject: Re: [patch] bug fix in dio handling write error - v2


> Taking one of your earlier idea, how about the following patch: separating
> out IO completion code from partial IO tracking?

I like that, and I think it closes the lost write error hole. Could we
call it something like io_errno, though? "completion_code" to me sounds
like -ve errno or +ve bytes read.

I think there are still bugs dealing with errors, but I guess we can
tackle them with seperate patches.

- z

2006-03-21 07:06:48

by Suparna Bhattacharya

[permalink] [raw]
Subject: Re: [patch] bug fix in dio handling write error - v2

On Sun, Mar 19, 2006 at 02:27:33PM -0800, Chen, Kenneth W wrote:
> Suparna Bhattacharya wrote on Sunday, March 19, 2006 3:55 AM
> > On Sun, Mar 19, 2006 at 01:27:33AM -0800, Chen, Kenneth W wrote:
> > > Referring to original posting:
> > > http://marc.theaimsgroup.com/?t=113752710100001&r=1&w=2
> > >
> > > Suparna pointed out that this fix has a potential race window. I think
> > > the race condition also exists on the READ side currently. The fundamental
> > > problem is that dio->result is overloaded with dual use: an indicator of
> > > fall back path for partial dio write, and an error indicator used in the
> > > I/O completion path. In the event of device error, the setting of -EIO
> > > to dio->result clashes with value used to track partial write that activates
> > > the fall back path.
> >
> > Isn't there a possibility that part of the IO for the overall request
> > may already have been submitted at this point ? (i.e. within
> > do_direct_IO->submit_page_section ->dio_send_cur_page->dio_bio_submit)
> > This is what I was referring to in my earlier response to Zach's patch.
>
> I suppose it is possible there. What you are saying here is effectively
> that it is impossible to use dio->result to track partial write and at the
> same time to track error returned from device driver. Because direct_io_work
> can only determines whether it is a partial write at the end of io submission
> and in mid stream of those io submission, a return code could be coming back
> from the driver. Thus messing up all the subsequent logic.
>
> Taking one of your earlier idea, how about the following patch: separating
> out IO completion code from partial IO tracking?

Yup, this looks like the right fix. Hopefully you've had a chance to pound
on it a bit with all the tests since verifying this code by mere eyeballs
alone is beyond mere mortals :)

Regards
Suparna

>
>
> --- ./fs/direct-io.c.orig 2006-03-19 13:36:52.000000000 -0800
> +++ ./fs/direct-io.c 2006-03-19 13:47:42.000000000 -0800
> @@ -129,6 +129,7 @@ struct dio {
> /* AIO related stuff */
> struct kiocb *iocb; /* kiocb */
> int is_async; /* is IO async ? */
> + int completion_code; /* IO completion code */
> ssize_t result; /* IO result */
> };
>
> @@ -250,6 +251,10 @@ static void finished_one_bio(struct dio
> ((offset + transferred) > dio->i_size))
> transferred = dio->i_size - offset;
>
> + /* check for error in completion path */
> + if (dio->completion_code)
> + transferred = dio->completion_code;
> +
> dio_complete(dio, offset, transferred);
>
> /* Complete AIO later if falling back to buffered i/o */
> @@ -406,7 +411,7 @@ static int dio_bio_complete(struct dio *
> int page_no;
>
> if (!uptodate)
> - dio->result = -EIO;
> + dio->completion_code = -EIO;
>
> if (dio->is_async && dio->rw == READ) {
> bio_check_pages_dirty(bio); /* transfers ownership */
> @@ -964,6 +969,7 @@ direct_io_worker(int rw, struct kiocb *i
> dio->next_block_for_io = -1;
>
> dio->page_errors = 0;
> + dio->completion_code = 0;
> dio->result = 0;
> dio->iocb = iocb;
> dio->i_size = i_size_read(inode);
>

--
Suparna Bhattacharya ([email protected])
Linux Technology Center
IBM Software Lab, India