2006-02-23 07:29:48

by Suparna Bhattacharya

[permalink] [raw]
Subject: [RFC][WIP] DIO simplification and AIO-DIO stability


DIO code complexity and stability concerns were discussed way back during
OLS and Kernel summit last year. Still, the lack of a solid alternative and
motivation to subject oneself to the test of courage and delicate balance
that fiddling with this code entails, has meant that gingerly applying fixes
and bandaids as and when bugs are found, and moving on thereafter,
continues to be the most palatable option.

A recent AIO-DIO bug reported by Kenneth Chen, came very close
to being the proverbial last straw for me. Hence, here is a rough attempt to
put together a (currently WIP) draft towards DIO code simplication, based
on suggestions that some of you have brought up at various times. Several
details, e.g. range locking implementation still need to be fleshed out
completely, ideas/comments/suggestions would be welcome.

http://www.kernel.org/pub/linux/kernel/people/suparna/DIO-simplify.txt
(also inlined below)

It would be quite pointless (and painful!), if the rewrite ends up becoming
just as tricky and error prone as before. Such a patch will need a very
close critical review by many sharp eyes, to avoid disrupting the current
state of stability. So before going any further with this, I'm looking
for feedback along the lines of:

- Is this a worthwhile thing to attempt ? Or is status quo good enough ?
- Does the approach make sense ? Is there a simpler way ?
- Is there any hidden complexity or performance overhead that you forsee ?
- Adding an extra tag to the radix-tree for locking a range of pages would
impact the size of the radix tree; would that be a concern ?

Regards
Suparna

------------------------------------------------------

Need for simplification
-----------------------
- The code has become tricky and hence difficult to understand and maintain
- Stability has been a concern, bugs never seem to stop cropping up (especially
with regard to error handling), and there still are some potential loopholes
ini the current implementation, esp AIO-DIO e.g.
- AIO-DIO has some potential races in handling of IO errors, e.g. EIO.
- For AIO-DIO writes, invalidate_inode_pages may be called before write
completes

Considerations for simplification
---------------------------------
- Reduce the number of special cases and conditions to check for
- Bring locking model closer to that followed for regular read/writes
- Unify logic as far as possible for conceptual simplicity
- Optimize for the most common situation, i.e. non-extending DIO to
already allocated blocks, with no concurrent buffered IO on the same
file, and no concurrent DIO to the same part of the file.

I. DIO read
-----------

Protection against buffered IO hole overwrite, where uninstantiated blocks
may be exposed until writeback completes. Buffered reads go through page
cache and hence never see stale data. Direct reads however go directly to
disk.

(a) Current Solution:
- i_alloc_sem held for read => protects against truncate
- i_mutex held => no new blocks get created until DIO getblocks
completes
- filemap_write_and_wait => ensures any existing uninstantiated blocks
complete writeback to disk

- AIO error handling paths differ for errors during submission
vs during IO - in one case aio_complete issued by higher layer,
in another case within the io completion path in finished_one_bio


(b) Alternative Solution
(Follow similar locking rules as buffered read)
- i_alloc_sem held for read => protects against truncate
- Lookup and lock pages in the range by tagging the radix tree
slots as "TAG_LOCKED", and locking pages that were present.
Modify find_get_page et al to return a dummy locked cache page
if TAG_LOCKED. (Alternatively put the check in add_to_page_cache,
instead)
=> no new writes or blocks instantiated until DIO getblocks
completes
- Issue equiv of filemap_write_and_wait_range that acts on already
locked pages (also likely to be useful for mpage_writepages)
=> ensures existing uninstantiated blocks written to disk
- On IO completion release lock on pages and the range, including
unlocking and releasing dummy cache page.
=> This will wake up readers and writers blocked on locked
pages. Since on wakeup most code paths check mapping once
again and also hold an extra ref count, this should be safe.
(hopefully !)

Optional:
- With retry based AIO-DIO => aio_complete can happen in the same
way during submission and during a retry following completion, thus
error handling takes the same path in both cases.

II. DIO write

Protection against buffered IO reads or DIO reads being exposed
to uninstantiated blocks during a DIO hole overwrite/extend. Holding
i_sem in DIO writes doesn't help protect buffered IO reads, since i_sem
is not acquired in read paths.

(a) Current Solution
- i_alloc_sem held for read => protects against truncate
- Hold i_mutex at least till completion of DIO getblocks and submission
- Force file extending DIO writes to be i_mutex holding until i_size
increase (i.e till DIO completes) to avoid exposing new blocks
- Fallback to buffered IO for hole overwrites, so that the default page
based locking protects buffered IO reads from seeing uninstantiated
blocks
- filemap_write_and_wait => ensures any pending writes complete
writeback to disk
- Perform DIO write
- invalidate pages => subsequent reads should see newly written data

- Force extending AIO-DIO writes to be synchronous
- Force AIO-DIO overwrites that cross holes to be synchronous for
issued DIO before fallback to buffered IO.
- AIO Error handling takes different paths for errors during submission
and during IO completion. Fallback to buffered adds another level
of complexity, as does forced synchronous behaviour for extends.

(b) Alternative Solution
(Follow similar locking rules as buffered IO)
- i_alloc_sem held for read => protects against truncate
- check if this is an extending write, and if not release i_mutex
- Lookup and lock pages in the range by tagging the radix tree
slots as "TAG_LOCKED", and locking pages that were present.
=> reads are blocked until DIO write completes, i.e
uninstantiated blocks not exposed
- Issue equiv of filemap_write_and_wait_range that acts on already
locked pages and then invalidate the pages.
=> subsequent reads should see newly written data
- Perform DIO write (including block allocations etc)
- On IO completion release lock on pages and unlock the range,
including unlocking and releasing dummy cache page, release
i_alloc_sem and i_mutex.

[i_mutex and i_alloc_sem handling can be moved entirely into
higher levels, which avoids DIO_LOCKING checks inside the DIO
code; Since i_mutex is held only for extending writes it should
not be a concurrency issue; Also in the AIO case, it is ok to
update i_size and release i_mutex (perhaps even i_alloc_sem) w/o
having to wait for IO completion since pages are still locked ]

Optional:
- With retry based AIO-DIO => aio_complete happens in the same
way during submission and during a retry following completion.
Unlocking the range and release of pages can likewise happen
in retry context (not in interrupt context).

--
Suparna Bhattacharya ([email protected])
Linux Technology Center
IBM Software Lab, India


2006-02-23 19:12:28

by Wendy Cheng

[permalink] [raw]
Subject: Re: [RFC][WIP] DIO simplification and AIO-DIO stability

--- linux-2.6.9-22.EL/include/linux/fs.h 2005-12-07 12:43:55.000000000 -0500
+++ linux.truncate/include/linux/fs.h 2005-12-02 00:25:22.000000000 -0500
@@ -1509,7 +1509,8 @@ ssize_t __blockdev_direct_IO(int rw, str
int lock_type);

enum {
- DIO_LOCKING = 1, /* need locking between buffered and direct access */
+ DIO_CLUSTER_LOCKING = 0, /* allow (cluster) fs handle its own locking */
+ DIO_LOCKING, /* need locking between buffered and direct access */
DIO_NO_LOCKING, /* bdev; no locking at all between buffered/direct */
DIO_OWN_LOCKING, /* filesystem locks buffered and direct internally */
};
@@ -1541,6 +1542,15 @@ static inline ssize_t blockdev_direct_IO
nr_segs, get_blocks, end_io, DIO_OWN_LOCKING);
}

+static inline ssize_t blockdev_direct_IO_cluster_locking(int rw, struct kiocb *iocb,
+ struct inode *inode, struct block_device *bdev, const struct iovec *iov,
+ loff_t offset, unsigned long nr_segs, get_blocks_t get_blocks,
+ dio_iodone_t end_io)
+{
+ return __blockdev_direct_IO(rw, iocb, inode, bdev, iov, offset,
+ nr_segs, get_blocks, end_io, DIO_CLUSTER_LOCKING);
+}
+
extern struct file_operations generic_ro_fops;

#define special_file(m) (S_ISCHR(m)||S_ISBLK(m)||S_ISFIFO(m)||S_ISSOCK(m))
--- linux-2.6.9-22.EL/fs/direct-io.c 2005-11-09 17:26:02.000000000 -0500
+++ linux.truncate/fs/direct-io.c 2005-12-07 12:27:17.000000000 -0500
@@ -515,7 +515,7 @@ static int get_more_blocks(struct dio *d
fs_count++;

create = dio->rw == WRITE;
- if (dio->lock_type == DIO_LOCKING) {
+ if ((dio->lock_type == DIO_LOCKING) || (dio->lock_type == DIO_CLUSTER_LOCKING)) {
if (dio->block_in_file < (i_size_read(dio->inode) >>
dio->blkbits))
create = 0;
@@ -1183,9 +1183,16 @@ __blockdev_direct_IO(int rw, struct kioc
* For regular files using DIO_OWN_LOCKING,
* neither readers nor writers take any locks here
* (i_sem is already held and release for writers here)
+ * The DIO_CLUSTER_LOCKING allows (cluster) filesystem manages its own
+ * locking (bypassing i_sem and i_alloc_sem handling within
+ * __blockdev_direct_IO()).
*/
+
dio->lock_type = dio_lock_type;
- if (dio_lock_type != DIO_NO_LOCKING) {
+ if (dio_lock_type == DIO_CLUSTER_LOCKING)
+ goto cluster_skip_locking;
+
+ if (dio_lock_type != DIO_NO_LOCKING) {
if (rw == READ) {
struct address_space *mapping;

@@ -1205,6 +1212,9 @@ __blockdev_direct_IO(int rw, struct kioc
if (dio_lock_type == DIO_LOCKING)
down_read(&inode->i_alloc_sem);
}
+
+cluster_skip_locking:
+
/*
* For file extending writes updating i_size before data
* writeouts complete can expose uninitialized blocks. So


Attachments:
linux-2.6.9-dio-gfs-locking.patch (2.59 kB)

2006-02-24 00:38:17

by Badari Pulavarty

[permalink] [raw]
Subject: Re: [RFC][WIP] DIO simplification and AIO-DIO stability

On Thu, 2006-02-23 at 12:59 +0530, Suparna Bhattacharya wrote:
> DIO code complexity and stability concerns were discussed way back during
> OLS and Kernel summit last year. Still, the lack of a solid alternative and
> motivation to subject oneself to the test of courage and delicate balance
> that fiddling with this code entails, has meant that gingerly applying fixes
> and bandaids as and when bugs are found, and moving on thereafter,
> continues to be the most palatable option.
>
> A recent AIO-DIO bug reported by Kenneth Chen, came very close
> to being the proverbial last straw for me. Hence, here is a rough attempt to
> put together a (currently WIP) draft towards DIO code simplication, based
> on suggestions that some of you have brought up at various times. Several
> details, e.g. range locking implementation still need to be fleshed out
> completely, ideas/comments/suggestions would be welcome.
>
> http://www.kernel.org/pub/linux/kernel/people/suparna/DIO-simplify.txt
> (also inlined below)
>
> It would be quite pointless (and painful!), if the rewrite ends up becoming
> just as tricky and error prone as before. Such a patch will need a very
> close critical review by many sharp eyes, to avoid disrupting the current
> state of stability. So before going any further with this, I'm looking
> for feedback along the lines of:
>
> - Is this a worthwhile thing to attempt ? Or is status quo good enough ?
> - Does the approach make sense ? Is there a simpler way ?
> - Is there any hidden complexity or performance overhead that you forsee ?
> - Adding an extra tag to the radix-tree for locking a range of pages would
> impact the size of the radix tree; would that be a concern ?

I am still trying to understand the whole proposal to give you better
feedback. But, my gut feeling is - its not going to be any more simpler
than what we have today :(

Andrew did an excellent job when he started (with the set of
requirements we had at that time). Then we started adding more
and more features/corner-case fixes/functionality/locking fixes/error
handling cases etc - which added all the complexity.

I think it deserves a cleanup, but afraid to touch it - since its
going to take few months to stabilize it and get it right. We need
to collect all the test cases before undertaking this.

Thanks,
Badari

2006-02-24 01:01:49

by Chris Mason

[permalink] [raw]
Subject: Re: [RFC][WIP] DIO simplification and AIO-DIO stability

On Thursday 23 February 2006 02:29, Suparna Bhattacharya wrote:
> DIO code complexity and stability concerns were discussed way back during
> OLS and Kernel summit last year. Still, the lack of a solid alternative and
> motivation to subject oneself to the test of courage and delicate balance
> that fiddling with this code entails, has meant that gingerly applying
> fixes and bandaids as and when bugs are found, and moving on thereafter,
> continues to be the most palatable option.
>
> A recent AIO-DIO bug reported by Kenneth Chen, came very close
> to being the proverbial last straw for me. Hence, here is a rough attempt
> to put together a (currently WIP) draft towards DIO code simplication,
> based on suggestions that some of you have brought up at various times.
> Several details, e.g. range locking implementation still need to be fleshed
> out completely, ideas/comments/suggestions would be welcome.

I'm really in favor of this, and had actually started an implementation a
while back. At the time, I posted a different version that added yet another
semaphore but simplified the rest of the locking (and held no locks during
the dio/aio).

I'll try to dig up my original radix tagging code. I'm not sure if I kept it,
but it did pass Daniel's dio vs buffer io racing tests at the time.

-chris

2006-02-24 01:14:56

by Andrew Morton

[permalink] [raw]
Subject: Re: [RFC][WIP] DIO simplification and AIO-DIO stability

Badari Pulavarty <[email protected]> wrote:
>
> I am still trying to understand the whole proposal to give you better
> feedback. But, my gut feeling is - its not going to be any more simpler
> than what we have today :(
>

Yes, that's my general reaction as well. That code's solving a complex and
messy problem, so it got complex and messy.

Of course, a reimplementation might certainly end up faster, cleaner,
better. A throw-away-and-reimplement exercise often has that result, but
mainly because on the second time the reimplementors understand the full
scope of the problem at the outset rather than at the end. So this time
around, as you imply, we'd need to get a full problem description and set
of testcases collected.

That code does a _lot_ of stuff. Fortunately, It's basically all in
direct-io.c and that file exports a single function. So it's possible that
a reimplmentation could tick along alongside the existing implementation and
ideally, it's just a matter of changing one entry in each filesystem's a_ops.

2006-02-24 01:21:18

by Zach Brown

[permalink] [raw]
Subject: Re: [RFC][WIP] DIO simplification and AIO-DIO stability

Suparna Bhattacharya wrote:

> A recent AIO-DIO bug reported by Kenneth Chen, came very close
> to being the proverbial last straw for me.

Me too, though I found out about it from a different path. Our QA guys
were pulling drives under load and it got stuck. Trying to fix that bug
(io error setting dio->result to -EIO stops finished_one_bio() from
calling aio_complete()) without introducing other regressions involved
an incredible amount of squinting and head scratching. In wandering
around I found what seem to be other additional bugs:

- errors that hit after dio->result is sampled in the buffered fallback
case are lost. dio->result should be checked again after waiting.

- a few paths try to do arithmetic with dio->result assuming it's the
number of bytes transferred when it could be -EIO.

- the AIO path seems to forget to check dio->page_errors, but I didn't
look very hard to see what that means.

- the AIO bio completion paths don't populate dio->bio_list so reaping
doesn't happen in the AIO issuing case.. maybe that's intentional?

> It would be quite pointless (and painful!), if the rewrite ends up becoming
> just as tricky and error prone as before. Such a patch will need a very
> close critical review by many sharp eyes, to avoid disrupting the current
> state of stability.

So, I'm all for wringing the current bugs and confusion out of the
current code. But the words "a patch" and "rewrite" terrify me. It
seems much more prudent to make progress with incremental patches that
can be tested and reviewed. Especially if that is tied to writing tests
as changes are made.

Let me think harder about the specific proposals..

- z

2006-02-24 09:37:42

by Suparna Bhattacharya

[permalink] [raw]
Subject: Re: [RFC][WIP] DIO simplification and AIO-DIO stability

On Thu, Feb 23, 2006 at 08:01:32PM -0500, Chris Mason wrote:
> On Thursday 23 February 2006 02:29, Suparna Bhattacharya wrote:
> > DIO code complexity and stability concerns were discussed way back during
> > OLS and Kernel summit last year. Still, the lack of a solid alternative and
> > motivation to subject oneself to the test of courage and delicate balance
> > that fiddling with this code entails, has meant that gingerly applying
> > fixes and bandaids as and when bugs are found, and moving on thereafter,
> > continues to be the most palatable option.
> >
> > A recent AIO-DIO bug reported by Kenneth Chen, came very close
> > to being the proverbial last straw for me. Hence, here is a rough attempt
> > to put together a (currently WIP) draft towards DIO code simplication,
> > based on suggestions that some of you have brought up at various times.
> > Several details, e.g. range locking implementation still need to be fleshed
> > out completely, ideas/comments/suggestions would be welcome.
>
> I'm really in favor of this, and had actually started an implementation a
> while back. At the time, I posted a different version that added yet another
> semaphore but simplified the rest of the locking (and held no locks during
> the dio/aio).

Yes I have saved that patch as a reference as well.
With range locking I'm hoping that would be able to avoid the need for
i_hole_sem. Also I wanted to push out all locking code out of the DIO
code to avoid the various locking mode checks.

>
> I'll try to dig up my original radix tagging code. I'm not sure if I kept it,
> but it did pass Daniel's dio vs buffer io racing tests at the time.

Cool - that would be great !

Regards
Suparna

>
> -chris
>
> --
> To unsubscribe, send a message with 'unsubscribe linux-aio' in
> the body to [email protected]. For more info on Linux AIO,
> see: http://www.kvack.org/aio/
> Don't email: <a href=mailto:"[email protected]">[email protected]</a>

--
Suparna Bhattacharya ([email protected])
Linux Technology Center
IBM Software Lab, India

2006-02-24 11:12:31

by Suparna Bhattacharya

[permalink] [raw]
Subject: Re: [RFC][WIP] DIO simplification and AIO-DIO stability

On Thu, Feb 23, 2006 at 05:21:08PM -0800, Zach Brown wrote:
> Suparna Bhattacharya wrote:
>
> > A recent AIO-DIO bug reported by Kenneth Chen, came very close
> > to being the proverbial last straw for me.
>
> Me too, though I found out about it from a different path. Our QA guys
> were pulling drives under load and it got stuck. Trying to fix that bug
> (io error setting dio->result to -EIO stops finished_one_bio() from
> calling aio_complete()) without introducing other regressions involved
> an incredible amount of squinting and head scratching. In wandering
> around I found what seem to be other additional bugs:
>
> - errors that hit after dio->result is sampled in the buffered fallback
> case are lost. dio->result should be checked again after waiting.
>
> - a few paths try to do arithmetic with dio->result assuming it's the
> number of bytes transferred when it could be -EIO.

Yes there is a race in the way dio->result is used both by completion
path and the post submission path.

>
> - the AIO path seems to forget to check dio->page_errors, but I didn't
> look very hard to see what that means.
>
> - the AIO bio completion paths don't populate dio->bio_list so reaping
> doesn't happen in the AIO issuing case.. maybe that's intentional?

It is intentional. The async case operates differently in that it
doesn't need/use the reaping logic at all. It just submits the entire
IO outright, without the pipelining sophistication of the original
synchronous DIO code. That's yet another point of divergence between
AIO and synchronous path, perhaps it would have been simpler if both
followed the same logic.

>
> > It would be quite pointless (and painful!), if the rewrite ends up becoming
> > just as tricky and error prone as before. Such a patch will need a very
> > close critical review by many sharp eyes, to avoid disrupting the current
> > state of stability.
>
> So, I'm all for wringing the current bugs and confusion out of the
> current code. But the words "a patch" and "rewrite" terrify me. It

Perhaps I shouldn't have used the term rewrite. The proposal retains
much of the current core logic, but mainly alters the way we
serialise vs concurrent buffered IO, and other pain points. But it
would certainly be more than incremental patches to fix individual
problems.

My concern is that incremental fixes seem to be adding to the complexity
over time, making subsequent modifications trickier, because they have
to conform to the current base which is a little messy. Can you
think of an incremental path towards greater simplicity ?

> seems much more prudent to make progress with incremental patches that
> can be tested and reviewed. Especially if that is tied to writing tests
> as changes are made.

Stephen Tweedie's original test and Daniel McNeil's set of tests already
cover some of the cases. I suspect that either way we'll need to keep
adding regression tests which can be run everytime someone makes a change
or bug fix to the code. It is easy to forget or miss the implication of
a subtle step during reviews because the code is so complex.

>
> Let me think harder about the specific proposals..

Thanks for taking a look at it, I'll wait for your inputs ...

Regards
Suparna

>
> - z
>
> --
> To unsubscribe, send a message with 'unsubscribe linux-aio' in
> the body to [email protected]. For more info on Linux AIO,
> see: http://www.kvack.org/aio/
> Don't email: <a href=mailto:"[email protected]">[email protected]</a>

--
Suparna Bhattacharya ([email protected])
Linux Technology Center
IBM Software Lab, India

2006-02-24 11:24:55

by Suparna Bhattacharya

[permalink] [raw]
Subject: Re: [RFC][WIP] DIO simplification and AIO-DIO stability

On Thu, Feb 23, 2006 at 05:13:36PM -0800, Andrew Morton wrote:
> Badari Pulavarty <[email protected]> wrote:
> >
> > I am still trying to understand the whole proposal to give you better
> > feedback. But, my gut feeling is - its not going to be any more simpler
> > than what we have today :(
> >
>
> Yes, that's my general reaction as well. That code's solving a complex and
> messy problem, so it got complex and messy.

True. As you say, we just understand extent of the problem better now.
I think Stephen put in considerable thought into the implementation for
2.4, but at that time he probably didn't have to contend with the
locking modes and AIO, which have exposed a lot more scenarios, especially
with regard to error handling.

>
> Of course, a reimplementation might certainly end up faster, cleaner,
> better. A throw-away-and-reimplement exercise often has that result, but
> mainly because on the second time the reimplementors understand the full
> scope of the problem at the outset rather than at the end. So this time
> around, as you imply, we'd need to get a full problem description and set
> of testcases collected.
>
> That code does a _lot_ of stuff. Fortunately, It's basically all in
> direct-io.c and that file exports a single function. So it's possible that
> a reimplmentation could tick along alongside the existing implementation and
> ideally, it's just a matter of changing one entry in each filesystem's a_ops.

Sounds like a good idea.

Regards
Suparna

>
> --
> To unsubscribe, send a message with 'unsubscribe linux-aio' in
> the body to [email protected]. For more info on Linux AIO,
> see: http://www.kvack.org/aio/
> Don't email: <a href=mailto:"[email protected]">[email protected]</a>

--
Suparna Bhattacharya ([email protected])
Linux Technology Center
IBM Software Lab, India

2006-02-24 11:53:09

by Suparna Bhattacharya

[permalink] [raw]
Subject: Re: [RFC][WIP] DIO simplification and AIO-DIO stability

Hi Wendy,

On Thu, Feb 23, 2006 at 02:12:04PM -0500, Wendy Cheng wrote:
> Suparna Bhattacharya wrote:
>
> >http://www.kernel.org/pub/linux/kernel/people/suparna/DIO-simplify.txt
> >(also inlined below)
> >
> >
> Hi, Suparna,
>
>
> It would be nice to ensure that the lock sequence will not cause issues
> for out-of-tree external kernel modules (e.g. cluster files System) that
> require extra locking for various purposes. We've found several
> deadlocks issues in Global File System (GFS) Direct IO path due to lock
> order enforced by VFS layer:
>
>
> 1) In sys_ftruncate()->do_truncate(), VFS layer grabs
> * i_sem
> * then i_alloc_sem (i_mutex)
> * then call filesystem's setattr().
>
>
> 2) In Direct IO read, VFS layer calls
> * filesystem's direct_IO()
> * grabs i_sem (i_mutex)
> * followed by i_alloc_sem.
>
> In our case, both gfs_setattr() and gfs_direct_IO() need its own
> (global) locks to synchronize inter-nodes (and inter-processes) control
> structures access but gfs_direct_IO later ends up in
> __blockdev_direct_IO path that deadlocks with i_sem (i_mutex) and
> i_alloc_sem.
>
>
> A new DIO flag is added into our distribution (2.6.9 based) to work
> around the problem by moving the inode semaphore acquiring within
> __blockdev_direct_IO() (patch attached) into GFS code path (so lock
> order can be re-arranged). The new lock granularity is not ideal but it
> gets us out of this deadlock.

Could you help me understand in a little more detail why DIO_OWN_LOCKING
does not work for you ? Is the releasing of i_sem during READ a problem ?
Doesn't holding i_sem for the entire duration of IO for read slow down
concurrent DIO reads to different parts of the file ?

>
> We havn't had a chance to go thru your mail (and patch) in details yet
> but would like bring up this issue earlier before it gets messy.
>
One of the things I wanted to achieve in the proposal was to avoid
the need for these various locking mode flag checks in the DIO code,
leaving it to the higher level to just select the right entry points,
i.e. the _nolock or lock versions of generic_file_aio_write et al.
Would appreciate your thoughts on this, once you've had a chance to
go throught it.

Regards
Suparna

>
> -- Wendy
>
>

> --- linux-2.6.9-22.EL/include/linux/fs.h 2005-12-07 12:43:55.000000000 -0500
> +++ linux.truncate/include/linux/fs.h 2005-12-02 00:25:22.000000000 -0500
> @@ -1509,7 +1509,8 @@ ssize_t __blockdev_direct_IO(int rw, str
> int lock_type);
>
> enum {
> - DIO_LOCKING = 1, /* need locking between buffered and direct access */
> + DIO_CLUSTER_LOCKING = 0, /* allow (cluster) fs handle its own locking */
> + DIO_LOCKING, /* need locking between buffered and direct access */
> DIO_NO_LOCKING, /* bdev; no locking at all between buffered/direct */
> DIO_OWN_LOCKING, /* filesystem locks buffered and direct internally */
> };
> @@ -1541,6 +1542,15 @@ static inline ssize_t blockdev_direct_IO
> nr_segs, get_blocks, end_io, DIO_OWN_LOCKING);
> }
>
> +static inline ssize_t blockdev_direct_IO_cluster_locking(int rw, struct kiocb *iocb,
> + struct inode *inode, struct block_device *bdev, const struct iovec *iov,
> + loff_t offset, unsigned long nr_segs, get_blocks_t get_blocks,
> + dio_iodone_t end_io)
> +{
> + return __blockdev_direct_IO(rw, iocb, inode, bdev, iov, offset,
> + nr_segs, get_blocks, end_io, DIO_CLUSTER_LOCKING);
> +}
> +
> extern struct file_operations generic_ro_fops;
>
> #define special_file(m) (S_ISCHR(m)||S_ISBLK(m)||S_ISFIFO(m)||S_ISSOCK(m))
> --- linux-2.6.9-22.EL/fs/direct-io.c 2005-11-09 17:26:02.000000000 -0500
> +++ linux.truncate/fs/direct-io.c 2005-12-07 12:27:17.000000000 -0500
> @@ -515,7 +515,7 @@ static int get_more_blocks(struct dio *d
> fs_count++;
>
> create = dio->rw == WRITE;
> - if (dio->lock_type == DIO_LOCKING) {
> + if ((dio->lock_type == DIO_LOCKING) || (dio->lock_type == DIO_CLUSTER_LOCKING)) {
> if (dio->block_in_file < (i_size_read(dio->inode) >>
> dio->blkbits))
> create = 0;
> @@ -1183,9 +1183,16 @@ __blockdev_direct_IO(int rw, struct kioc
> * For regular files using DIO_OWN_LOCKING,
> * neither readers nor writers take any locks here
> * (i_sem is already held and release for writers here)
> + * The DIO_CLUSTER_LOCKING allows (cluster) filesystem manages its own
> + * locking (bypassing i_sem and i_alloc_sem handling within
> + * __blockdev_direct_IO()).
> */
> +
> dio->lock_type = dio_lock_type;
> - if (dio_lock_type != DIO_NO_LOCKING) {
> + if (dio_lock_type == DIO_CLUSTER_LOCKING)
> + goto cluster_skip_locking;
> +
> + if (dio_lock_type != DIO_NO_LOCKING) {
> if (rw == READ) {
> struct address_space *mapping;
>
> @@ -1205,6 +1212,9 @@ __blockdev_direct_IO(int rw, struct kioc
> if (dio_lock_type == DIO_LOCKING)
> down_read(&inode->i_alloc_sem);
> }
> +
> +cluster_skip_locking:
> +
> /*
> * For file extending writes updating i_size before data
> * writeouts complete can expose uninitialized blocks. So


--
Suparna Bhattacharya ([email protected])
Linux Technology Center
IBM Software Lab, India

2006-02-24 15:51:46

by Wendy Cheng

[permalink] [raw]
Subject: Re: [RFC][WIP] DIO simplification and AIO-DIO stability

Suparna Bhattacharya wrote:

>>A new DIO flag is added into our distribution (2.6.9 based) to work
>>around the problem by moving the inode semaphore acquiring within
>>__blockdev_direct_IO() (patch attached) into GFS code path (so lock
>>order can be re-arranged). The new lock granularity is not ideal but it
>>gets us out of this deadlock.
>>
>>
>
>Could you help me understand in a little more detail why DIO_OWN_LOCKING
>does not work for you ? Is the releasing of i_sem during READ a problem ?
>Doesn't holding i_sem for the entire duration of IO for read slow down
>concurrent DIO reads to different parts of the file ?
>
>

We don't mess around with i_sem during read. So that piece of locking
code is ok (but I have to say it looks very messy). The thing we can't
work around is #518 in get_more_blocks() (from 2.6.9 base code so line
offset may be different but you can get the idea):

491 static int get_more_blocks(struct dio *dio)
492 {
493 int ret;
......
516
517 create = dio->rw == WRITE;
518 if ((dio->lock_type == DIO_LOCKING) ||
(dio->lock_type = = DIO_CLUSTER_LOCKING)) {
519 if (dio->block_in_file <
(i_size_read(dio->inode ) >>
520
dio->blkbits))
521 create = 0;
522 } else if (dio->lock_type == DIO_NO_LOCKING) {
523 create = 0;
524 }

>One of the things I wanted to achieve in the proposal was to avoid
>the need for these various locking mode flag checks in the DIO code,
>leaving it to the higher level to just select the right entry points,
>i.e. the _nolock or lock versions of generic_file_aio_write et al.
>
>
That's a good news - these DIO locking flags are all last-minute
bandaids anyway.

>Would appreciate your thoughts on this, once you've had a chance to
>go throught it.
>
>
>
On the road all next week but will get to it when I'm back to office. On
the other hand, thank you for looking into this.

-- Wendy

2006-02-24 18:08:30

by Badari Pulavarty

[permalink] [raw]
Subject: Re: [RFC][WIP] DIO simplification and AIO-DIO stability

On Fri, 2006-02-24 at 16:42 +0530, Suparna Bhattacharya wrote:
> On Thu, Feb 23, 2006 at 05:21:08PM -0800, Zach Brown wrote:
> > Suparna Bhattacharya wrote:
> >
> > > A recent AIO-DIO bug reported by Kenneth Chen, came very close
> > > to being the proverbial last straw for me.
> >
> > Me too, though I found out about it from a different path. Our QA guys
> > were pulling drives under load and it got stuck. Trying to fix that bug
> > (io error setting dio->result to -EIO stops finished_one_bio() from
> > calling aio_complete()) without introducing other regressions involved
> > an incredible amount of squinting and head scratching. In wandering
> > around I found what seem to be other additional bugs:
> >
> > - errors that hit after dio->result is sampled in the buffered fallback
> > case are lost. dio->result should be checked again after waiting.
> >
> > - a few paths try to do arithmetic with dio->result assuming it's the
> > number of bytes transferred when it could be -EIO.
>
> Yes there is a race in the way dio->result is used both by completion
> path and the post submission path.
>
> >
> > - the AIO path seems to forget to check dio->page_errors, but I didn't
> > look very hard to see what that means.
> >
> > - the AIO bio completion paths don't populate dio->bio_list so reaping
> > doesn't happen in the AIO issuing case.. maybe that's intentional?
>
> It is intentional. The async case operates differently in that it
> doesn't need/use the reaping logic at all. It just submits the entire
> IO outright, without the pipelining sophistication of the original
> synchronous DIO code. That's yet another point of divergence between
> AIO and synchronous path, perhaps it would have been simpler if both
> followed the same logic.
>
> >
> > > It would be quite pointless (and painful!), if the rewrite ends up becoming
> > > just as tricky and error prone as before. Such a patch will need a very
> > > close critical review by many sharp eyes, to avoid disrupting the current
> > > state of stability.
> >
> > So, I'm all for wringing the current bugs and confusion out of the
> > current code. But the words "a patch" and "rewrite" terrify me. It
>
> Perhaps I shouldn't have used the term rewrite. The proposal retains
> much of the current core logic, but mainly alters the way we
> serialise vs concurrent buffered IO, and other pain points. But it
> would certainly be more than incremental patches to fix individual
> problems.

Yes. locking and error handling desperately needs a re-write, especially
keeping AIO in mind. I would love to see "kicking back to buffered mode"
completely go away. If Ken and Zach are willing to provide help on
looking over & testing error handling cases (with pulling drives :)),
I have no problem with re-write :)

Thanks,
Badari