2007-01-18 01:27:14

by Chris Frost

Subject: block_device usage and incorrect block writes

We are working on a kernel module that uses the Linux block device
interface as part of a larger project. We are seeing unexpected block
write behavior with the noop scheduler and were wondering whether anyone
might have feedback on the behavior we describe below.

We would like to send block writes such that they are written to the
drive controller in fifo order, so we are using the noop scheduler.
However, a small percentage (1-5 of ~50,000) of block writes
end up with incorrect data on the disk. We have determined that for each
of these incorrect blocks, the last write for the block was issued while
a previous write to the block was still queued (that is, the bio end
function had not yet been called) and that the next to last issued
write (that is, the generic_make_request function call) for the block contains
the data that ends up on the disk.

Here are more details we have noticed. The behavior appears to be
timing sensitive; individual runs may or may not exhibit it, and
introducing slowdowns (e.g. write barriers or many printks) reliably
makes the problem disappear. About 2% of our block device writes go to
a block that still has an earlier write in the queue. About 0.3% of
these 2% result in incorrect data on the disk.

A possibly related (and unexpected) behavior we have noticed is that the
bio end function is not always called in the same order as our calls
to generic_make_request(). We are not sure whether this indicates that
the requests are being written to disk in the callback order, but would
like to fix this if so (since we want the writes made in the order of our
requests).

Below is the essence of our write code. We also make read requests, but do
not include the read code below.

struct block_device *dev;

/* bio completion callback; frees the data page and the bio */
int my_end(struct bio *bio, unsigned int done, int error)
{
	if (bio->bi_size)
		return 1; /* everyone else (in 2.6.12) returns in this case */
	__free_page(bio_iovec_idx(bio, 0)->bv_page);
	bio_iovec_idx(bio, 0)->bv_page = NULL;
	bio_iovec_idx(bio, 0)->bv_len = 0;
	bio_iovec_idx(bio, 0)->bv_offset = 0;
	bio_put(bio);
	return error;
}

void write_block(...)
{
	struct bio *bio = bio_alloc(GFP_KERNEL, 1);
	struct bio_vec *bv = bio_iovec_idx(bio, 0);

	/* copy the caller's data into a freshly allocated page */
	bv->bv_page = alloc_page(GFP_KERNEL | GFP_DMA);
	memcpy(page_address(bv->bv_page), ...);
	bv->bv_len = ...;
	bv->bv_offset = 0;

	bio->bi_idx = 0;
	bio->bi_vcnt = 1;
	bio->bi_sector = ...;
	bio->bi_size = ...;
	bio->bi_bdev = dev;
	bio->bi_rw = WRITE;
	bio->bi_end_io = my_end;

	generic_make_request(bio);
}

void init(const char *path)
{
	struct nameidata nd;

	/* resolve the path to a device node and claim the block device */
	path_lookup(path, LOOKUP_FOLLOW, &nd);
	dev = open_by_devnum(nd.dentry->d_inode->i_rdev, mode);
	bd_claim(dev, claimer);
}

thanks in advance for any feedback!
--
Chris Frost | <http://www.frostnet.net/chris/>
-------------+----------------------------------
Public PGP Key:
Email [email protected] with the subject "retrieve pgp key"
or visit <http://www.frostnet.net/chris/about/pgp_key.phtml>


2007-01-18 02:12:56

by Jens Axboe

Subject: Re: block_device usage and incorrect block writes

On Wed, Jan 17 2007, Chris Frost wrote:
> We are working on a kernel module that uses the Linux block device
> interface as part of a larger project. We are seeing unexpected block
> write behavior with the noop scheduler and were wondering whether
> anyone might have feedback on the behavior we describe below.
>
> We would like to send block writes such that they are written to the
> drive controller in fifo order, so we are using the noop scheduler.
> However, a small percentage (1-5 of ~50,000) of block writes end up
> with incorrect data on the disk. We have determined that for each of
> these incorrect blocks, the last write for the block was issued while
> a previous write to the block was still queued (that is, the bio end
> function had not yet been called) and that the next to last issued
> write (that is, the generic_make_request function call) for the block
> contains the data that ends up on the disk.

noop doesn't guarantee that IO will be queued with the device in the
order in which they are submitted, and it definitely doesn't guarantee
that the device will process them in the order in which they are
dispatched. noop being FIFO basically means that it will not sort
requests. You can still have reordering if one request gets merged with
another, for instance.

The block layer in general provides no guarantees about ordering of
requests, unless you use barriers. So if you require ordering across a
given write request, it needs to be a write barrier.
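
For example, a minimal sketch of submitting a barrier write (assuming the
same 2.6-era bio interface as in the code above; the queue/driver must
support barriers, otherwise the bio may fail with -EOPNOTSUPP):

	bio->bi_rw = WRITE | (1 << BIO_RW_BARRIER);	/* ordered (barrier) write */
	bio->bi_end_io = my_end;
	generic_make_request(bio);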

> A possibly related (and unexpected) behavior we have noticed is that
> the bio end function is not always called in the same order as our
> calls to generic_make_request(). We are not sure whether this
> indicates that the requests are being written to disk in the callback
> order, but would like to fix this if so (since we want the writes made
> in the order of our requests).

The drive could complete requests in any order it sees fit, within the
depth level of the drive. If write caching is enabled, it can reorder
writes easily.

--
Jens Axboe

2007-01-18 10:32:47

by Jan Engelhardt

Subject: Re: block_device usage and incorrect block writes



On Jan 18 2007 13:13, Jens Axboe wrote:
>
>noop doesn't guarantee that IO will be queued with the device in the
>order in which they are submitted, and it definitely doesn't guarantee
>that the device will process them in the order in which they are
>dispatched. noop being FIFO basically means that it will not sort
>requests. You can still have reordering if one request gets merged with
>another, for instance.

Would it make sense to have a fifo-iosched module that assumes write barriers
between every submission? (No, I am not related to that project.)


-`J'
--

2007-01-24 08:58:57

by Chris Frost

Subject: Re: block_device usage and incorrect block writes

On Thu, Jan 18, 2007 at 01:13:06PM +1100, Jens Axboe wrote:
> noop doesn't guarantee that IO will be queued with the device in the
> order in which they are submitted, and it definitely doesn't guarantee
> that the device will process them in the order in which they are
> dispatched. noop being FIFO basically means that it will not sort
> requests. You can still have reordering if one request gets merged with
> another, for instance.
>
> The block layer in general provides no guarantees about ordering of
> requests, unless you use barriers. So if you require ordering across a
> given write request, it needs to be a write barrier.

Thank you for explaining this aspect of the Linux block device layer design.
Earlier, we tried bio barriers (hard barriers) and found the slowdown
to be too great. After my previous email we looked further down the stack and
noticed that struct request also has a soft barrier option. For our tests,
soft barriers perform almost as well as no barriers, and our system
is ok (for now, at least) with the write reordering that devices can do.

As our code calls generic_make_request(), it does not have access to the
created struct request. We have modified block/ll_rw_blk.c:__make_request()
to 1) not merge requests and to 2) add the REQ_SOFTBARRIER flag to the new
request's cmd_flags field. Is there a more modular way for a function to
create a new request with a soft barrier?
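
For reference, our change amounts to roughly the following inside
__make_request() (a sketch only; the exact surrounding context depends on
the kernel version), with the merge path skipped so each bio gets its own
request:

	/* after the new request has been set up from the bio */
	req->cmd_flags |= REQ_SOFTBARRIER;	/* elevator must not reorder across this */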

thanks again,
--
Chris Frost | <http://www.frostnet.net/chris/>
-------------+----------------------------------
Public PGP Key:
Email [email protected] with the subject "retrieve pgp key"
or visit <http://www.frostnet.net/chris/about/pgp_key.phtml>

2007-01-24 09:21:57

by Jens Axboe

Subject: Re: block_device usage and incorrect block writes

On Wed, Jan 24 2007, Chris Frost wrote:
> On Thu, Jan 18, 2007 at 01:13:06PM +1100, Jens Axboe wrote:
> > noop doesn't guarantee that IO will be queued with the device in the
> > order in which they are submitted, and it definitely doesn't guarantee
> > that the device will process them in the order in which they are
> > dispatched. noop being FIFO basically means that it will not sort
> > requests. You can still have reordering if one request gets merged with
> > another, for instance.
> >
> > The block layer in general provides no guarantees about ordering of
> > requests, unless you use barriers. So if you require ordering across a
> > given write request, it needs to be a write barrier.
>
> Thank you for explaining this aspect of the Linux block device layer design.
> Earlier, we tried bio barriers (hard barriers) and found the slowdown
> to be too great. After my previous email we looked further down the stack and
> noticed that struct request also has a soft barrier option. For our tests,
> soft barriers perform almost as well as no barriers, and our system
> is ok (for now, at least) with the write reordering that devices can do.
>
> As our code calls generic_make_request(), it does not have access to the
> created struct request. We have modified block/ll_rw_blk.c:__make_request()
> to 1) not merge requests and to 2) add the REQ_SOFTBARRIER flag to the new
> request's cmd_flags field. Is there a more modular way for a function to
> create a new request with a soft barrier?

It would not be a problem to expose the soft/hard barrier difference at
the bio level as well, so you have direct access to it. I think it would
be quite useful in general.
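
As a purely hypothetical illustration (no such flag exists in mainline), a
soft-barrier bit exposed at the bio level might be used like this, with
__make_request() mapping it to REQ_SOFTBARRIER instead of REQ_HARDBARRIER:

	/* hypothetical flag, shown for illustration only */
	bio->bi_rw = WRITE | (1 << BIO_RW_SOFTBARRIER);
	generic_make_request(bio);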

--
Jens Axboe