2002-01-09 17:42:17

by Badari Pulavarty

[permalink] [raw]
Subject: [PATCH] PAGE_SIZE IO for RAW (RAW VARY)

Hi,

Here is a 2.4.17 patch for doing PAGE_SIZE IO on raw devices. Instead
of doing 512 byte buffer heads, the patch does 4K buffer heads. This
patch significantly reduced CPU overhead and increased IO throughput
in our database benchmark runs. (CPU went from 45% busy to 6% busy).

Could you please consider this for 2.4.18 release ? If you need the
patch on 2.4.18-preX, I can make one quickly.


NOTES:

1) wait_kio() no longer uses "size" argument. But I have not removed
the arg, to minimize the diffs.

2) I do not like adding a new routine "submit_bh_blknr()" which is
one line change from submit_bh(). Is there a better solution ?
I have version of the patch which sets b_rsector to 0xffffffff
in brw_kiovec() and check for this in submit_bh(). But I don't
really like that hack.

Thanks,
Badari

diff -Nur -X dontdiff linux/drivers/block/ll_rw_blk.c linux-2417vary/drivers/block/ll_rw_blk.c
--- linux/drivers/block/ll_rw_blk.c Mon Oct 29 12:11:17 2001
+++ linux-2417vary/drivers/block/ll_rw_blk.c Wed Jan 9 15:47:05 2002
@@ -915,6 +915,38 @@
}
}

+/*
+ * submit_bh_blknr() - same as submit_bh() except that b_rsector is
+ * set to b_blocknr. Used for RAW VARY.
+ */
+void submit_bh_blknr(int rw, struct buffer_head * bh)
+{
+ int count = bh->b_size >> 9;
+
+ if (!test_bit(BH_Lock, &bh->b_state))
+ BUG();
+
+ set_bit(BH_Req, &bh->b_state);
+
+ /*
+ * First step, 'identity mapping' - RAID or LVM might
+ * further remap this.
+ */
+ bh->b_rdev = bh->b_dev;
+ bh->b_rsector = bh->b_blocknr;
+
+ generic_make_request(rw, bh);
+
+ switch (rw) {
+ case WRITE:
+ kstat.pgpgout += count;
+ break;
+ default:
+ kstat.pgpgin += count;
+ break;
+ }
+}
+
/**
* ll_rw_block: low-level access to block devices
* @rw: whether to %READ or %WRITE or maybe %READA (readahead)
diff -Nur -X dontdiff linux/drivers/char/raw.c linux-2417vary/drivers/char/raw.c
--- linux/drivers/char/raw.c Sat Sep 22 20:35:43 2001
+++ linux-2417vary/drivers/char/raw.c Wed Jan 9 16:05:37 2002
@@ -350,8 +350,12 @@

for (i=0; i < blocks; i++)
iobuf->blocks[i] = blocknr++;
+
+ iobuf->dovary = 1;

err = brw_kiovec(rw, 1, &iobuf, dev, iobuf->blocks, sector_size);
+
+ iobuf->dovary = 0;

if (rw == READ && err > 0)
mark_dirty_kiobuf(iobuf, err);
diff -Nur -X dontdiff linux/fs/buffer.c linux-2417vary/fs/buffer.c
--- linux/fs/buffer.c Wed Jan 9 15:52:19 2002
+++ linux-2417vary/fs/buffer.c Wed Jan 9 16:28:56 2002
@@ -2071,11 +2071,11 @@
err = 0;

for (i = nr; --i >= 0; ) {
- iosize += size;
tmp = bh[i];
if (buffer_locked(tmp)) {
wait_on_buffer(tmp);
}
+ iosize += tmp->b_size;

if (!buffer_uptodate(tmp)) {
/* We are traversing bh'es in reverse order so
@@ -2118,6 +2118,7 @@
struct kiobuf * iobuf = NULL;
struct page * map;
struct buffer_head *tmp, **bhs = NULL;
+ int iosize = size;

if (!nr)
return 0;
@@ -2154,7 +2155,7 @@
}

while (length > 0) {
- blocknr = b[bufind++];
+ blocknr = b[bufind];
if (blocknr == -1UL) {
if (rw == READ) {
/* there was an hole in the filesystem */
@@ -2167,9 +2168,15 @@
} else
BUG();
}
+ if (iobuf->dovary) {
+ iosize = PAGE_SIZE - offset;
+ if (iosize > length)
+ iosize = length;
+ }
+ bufind += (iosize/size);
tmp = bhs[bhind++];

- tmp->b_size = size;
+ tmp->b_size = iosize;
set_bh_page(tmp, map, offset);
tmp->b_this_page = tmp;

@@ -2185,7 +2192,10 @@
set_bit(BH_Uptodate, &tmp->b_state);

atomic_inc(&iobuf->io_count);
- submit_bh(rw, tmp);
+ if (iobuf->dovary)
+ submit_bh_blknr(rw, tmp);
+ else
+ submit_bh(rw, tmp);
/*
* Wait for IO if we have got too much
*/
@@ -2200,8 +2210,8 @@
}

skip_block:
- length -= size;
- offset += size;
+ length -= iosize;
+ offset += iosize;

if (offset >= PAGE_SIZE) {
offset = 0;
diff -Nur -X dontdiff linux/include/linux/iobuf.h linux-2417vary/include/linux/iobuf.h
--- linux/include/linux/iobuf.h Thu Nov 22 11:46:26 2001
+++ linux-2417vary/include/linux/iobuf.h Wed Jan 9 16:09:08 2002
@@ -44,7 +44,8 @@

struct page ** maplist;

- unsigned int locked : 1; /* If set, pages has been locked */
+ unsigned int locked : 1, /* If set, pages has been locked */
+ dovary : 1; /* If set, do PAGE_SIZE length IO */

/* Always embed enough struct pages for atomic IO */
struct page * map_array[KIO_STATIC_PAGES];


2002-01-09 17:59:06

by Benjamin LaHaise

[permalink] [raw]
Subject: Re: [PATCH] PAGE_SIZE IO for RAW (RAW VARY)

On Wed, Jan 09, 2002 at 09:41:10AM -0800, Badari Pulavarty wrote:
> Could you please consider this for 2.4.18 release ? If you need the
> patch on 2.4.18-preX, I can make one quickly.

Do not apply. This breaks the majority of databases that run under linux.

-ben

2002-01-09 18:13:06

by Badari Pulavarty

[permalink] [raw]
Subject: Re: [PATCH] PAGE_SIZE IO for RAW (RAW VARY)

>
> On Wed, Jan 09, 2002 at 09:41:10AM -0800, Badari Pulavarty wrote:
> > Could you please consider this for 2.4.18 release ? If you need the
> > patch on 2.4.18-preX, I can make one quickly.
>
> Do not apply. This breaks the majority of databases that run under linux.
>
> -ben
>

why ? could you explain ? I am not expecting that user buffer be aligned
to PAGE_SIZE.

Thanks,
Badari

2002-01-09 18:14:36

by Jens Axboe

[permalink] [raw]
Subject: Re: [PATCH] PAGE_SIZE IO for RAW (RAW VARY)

On Wed, Jan 09 2002, Benjamin LaHaise wrote:
> On Wed, Jan 09, 2002 at 09:41:10AM -0800, Badari Pulavarty wrote:
> > Could you please consider this for 2.4.18 release ? If you need the
> > patch on 2.4.18-preX, I can make one quickly.
>
> Do not apply. This breaks the majority of databases that run under linux.

Besides, it's much more of a 'lets just clamp this on' approach than a
real solution.

--
Jens Axboe

2002-01-09 18:15:46

by Andi Kleen

[permalink] [raw]
Subject: Re: [PATCH] PAGE_SIZE IO for RAW (RAW VARY)

Benjamin LaHaise <[email protected]> writes:

> On Wed, Jan 09, 2002 at 09:41:10AM -0800, Badari Pulavarty wrote:
> > Could you please consider this for 2.4.18 release ? If you need the
> > patch on 2.4.18-preX, I can make one quickly.
>
> Do not apply. This breaks the majority of databases that run under linux.

It could be done after a check for proper alignment of the user buffer;
e.g. if buffer aligned to 2^N, submit in 2^N chunks.

-Andi

2002-01-09 18:22:26

by Benjamin LaHaise

[permalink] [raw]
Subject: Re: [PATCH] PAGE_SIZE IO for RAW (RAW VARY)

On Wed, Jan 09, 2002 at 10:12:11AM -0800, Badari Pulavarty wrote:
> why ? could you explain ? I am not expecting that user buffer be aligned
> to PAGE_SIZE.

Okay, that part I misread from the message, but that leaves the question of
"does it work?" Iirc, Jeff Merkey tested variable sized ios with nwfs, but
found that triggered bugs in the low level drivers, some of which assume that
all buffer heads within a request have the same block size. Given that
concern, I really don't think this is a safe 2.4 patch.

-ben
--
Fish.

2002-01-09 19:29:13

by Badari Pulavarty

[permalink] [raw]
Subject: Re: [PATCH] PAGE_SIZE IO for RAW (RAW VARY)

Ben,

By any chance do you have a list of drivers that assume this ?
What does it take to fix them ?

I think Jens BIO changes for 2.5 will fix this problem. But
2.4 needs a solution in this area too. This patch showed
significant improvement for database workloads.

If it is not reasonable to fix all the brokern drivers,
how about making this configurable (to do variable size IO) ?
Do you favour this solution ?

Regards,
Badari


>
> On Wed, Jan 09, 2002 at 10:12:11AM -0800, Badari Pulavarty wrote:
> > why ? could you explain ? I am not expecting that user buffer be aligned
> > to PAGE_SIZE.
>
> Okay, that part I misread from the message, but that leaves the question of
> "does it work?" Iirc, Jeff Merkey tested variable sized ios with nwfs, but
> found that triggered bugs in the low level drivers, some of which assume that
> all buffer heads within a request have the same block size. Given that
> concern, I really don't think this is a safe 2.4 patch.
>
> -ben
> --
> Fish.

2002-01-09 19:49:54

by Benjamin LaHaise

[permalink] [raw]
Subject: Re: [PATCH] PAGE_SIZE IO for RAW (RAW VARY)

On Wed, Jan 09, 2002 at 11:28:39AM -0800, Badari Pulavarty wrote:
> Ben,
>
> By any chance do you have a list of drivers that assume this ?
> What does it take to fix them ?

No, doing the audit *is* the hard work in creating a patch like this.

> If it is not reasonable to fix all the brokern drivers,
> how about making this configurable (to do variable size IO) ?
> Do you favour this solution ?

How does that solve anything? There is no way for a user to be even
remotely confident that enabling the option will not corrupt data. As
a distributor, that makes it impossible to enable. As a developer,
that's just unsound practice.

-ben
--
Blue.

2002-01-09 22:45:31

by Alan

[permalink] [raw]
Subject: Re: [PATCH] PAGE_SIZE IO for RAW (RAW VARY)

> Here is a 2.4.17 patch for doing PAGE_SIZE IO on raw devices. Instead
> of doing 512 byte buffer heads, the patch does 4K buffer heads. This
> patch significantly reduced CPU overhead and increased IO throughput
> in our database benchmark runs. (CPU went from 45% busy to 6% busy).

Does that work out when the application is still doing 512 byte raw I/O.
Its fine to fall back to the current performance but at least one very
large competing database would get quite irate if the fallback made
512 byte mode slower or nonfunctional ?

Alan

2002-01-09 22:47:01

by Alan

[permalink] [raw]
Subject: Re: [PATCH] PAGE_SIZE IO for RAW (RAW VARY)

> If it is not reasonable to fix all the brokern drivers,
> how about making this configurable (to do variable size IO) ?
> Do you favour this solution ?

We have hardware that requires aligned power of two for writes (ie 4K on
4K boundaries only). The 3ware is one example Jeff Merkey found

2002-01-09 23:20:37

by Badari Pulavarty

[permalink] [raw]
Subject: Re: [PATCH] PAGE_SIZE IO for RAW (RAW VARY)

>
> > Here is a 2.4.17 patch for doing PAGE_SIZE IO on raw devices. Instead
> > of doing 512 byte buffer heads, the patch does 4K buffer heads. This
> > patch significantly reduced CPU overhead and increased IO throughput
> > in our database benchmark runs. (CPU went from 45% busy to 6% busy).
>
> Does that work out when the application is still doing 512 byte raw I/O.
> Its fine to fall back to the current performance but at least one very
> large competing database would get quite irate if the fallback made
> 512 byte mode slower or nonfunctional ?
>

It works as usual for 512 byte IO (few checks in the code). I have not
seen any slowness for < 4K IO. Infact, I can change the patch to try
raw vary only for iosize > PAGE_SIZE.

I tested with 2 large competing databases, both of them seem to benifit
significantly from this patch. I tested with 4 different Fiber & SCSI
adaptors, they all seem to work fine. (only on i386).

But unfortunately, if the hardware have special alignment restrictions
(as you mentioned), this patch does not work. I don't know if it makes
sense to make this configurable and expect customer/user to enable this
feature if they know about their hardware/driver alignment restrictions.

Thanks,
Badari

2002-01-09 23:49:05

by Badari Pulavarty

[permalink] [raw]
Subject: Re: [PATCH] PAGE_SIZE IO for RAW (RAW VARY)

Alan,

>
> > If it is not reasonable to fix all the brokern drivers,
> > how about making this configurable (to do variable size IO) ?
> > Do you favour this solution ?
>
> We have hardware that requires aligned power of two for writes (ie 4K on
> 4K boundaries only). The 3ware is one example Jeff Merkey found
>

emm.. come to think of it, I can easily (2 line) change my patch to
do 512 byte buffer heads till we get PAGE alignment and then start
issuing 4K IO buffer heads. What do you think ? will this work ?

And also, do you know any low level drivers Ben mentioning:

> low level drivers, some of which assume that
> all buffer heads within a request have the same block size.

Is it still true for 2.4 ?

Regards,
Badari

2002-01-10 00:12:09

by Alan

[permalink] [raw]
Subject: Re: [PATCH] PAGE_SIZE IO for RAW (RAW VARY)

> I tested with 2 large competing databases, both of them seem to benifit
> significantly from this patch. I tested with 4 different Fiber & SCSI
> adaptors, they all seem to work fine. (only on i386).

Great.

> But unfortunately, if the hardware have special alignment restrictions
> (as you mentioned), this patch does not work. I don't know if it makes
> sense to make this configurable and expect customer/user to enable this
> feature if they know about their hardware/driver alignment restrictions.

The only one I know about is the 3ware, but I don't know how many I don't
know about (so to speak). For the kind of win you report, it looks well
worth exploring the compatibility issues and fixing them.

2002-01-10 04:07:04

by Masanori Goto

[permalink] [raw]
Subject: Re: [PATCH] PAGE_SIZE IO for RAW (RAW VARY)

At Thu, 10 Jan 2002 00:23:18 +0000 (GMT),
Alan Cox <[email protected]> wrote:
> > I tested with 2 large competing databases, both of them seem to benifit
> > significantly from this patch. I tested with 4 different Fiber & SCSI
> > adaptors, they all seem to work fine. (only on i386).
>
> Great.

It's just a question, you already said that CPU workloads
are significantly down, but would you tell me your
workloads and environment? I think it's more worth thinking
to improve even for for the normal users' environment nicely...

-- gotom

2002-01-10 07:13:25

by Jens Axboe

[permalink] [raw]
Subject: Re: [PATCH] PAGE_SIZE IO for RAW (RAW VARY)

On Wed, Jan 09 2002, Benjamin LaHaise wrote:
> On Wed, Jan 09, 2002 at 10:12:11AM -0800, Badari Pulavarty wrote:
> > why ? could you explain ? I am not expecting that user buffer be aligned
> > to PAGE_SIZE.
>
> Okay, that part I misread from the message, but that leaves the question of
> "does it work?" Iirc, Jeff Merkey tested variable sized ios with nwfs, but
> found that triggered bugs in the low level drivers, some of which assume that
> all buffer heads within a request have the same block size. Given that
> concern, I really don't think this is a safe 2.4 patch.

I don't think that point was ever proven, and Jeff never showed any
information as to what was broken. I'm reluctant to allow differently
sized buffers heads in the _same_ request for 2.4 just to be cautios,
but that's a two-liner (or so) in the elevator to stop that from
happening.

--
Jens Axboe

2002-01-10 10:19:58

by Andrea Arcangeli

[permalink] [raw]
Subject: Re: [PATCH] PAGE_SIZE IO for RAW (RAW VARY)

On Wed, Jan 09, 2002 at 11:28:39AM -0800, Badari Pulavarty wrote:
> Ben,
>
> By any chance do you have a list of drivers that assume this ?
> What does it take to fix them ?
>
> I think Jens BIO changes for 2.5 will fix this problem. But
> 2.4 needs a solution in this area too. This patch showed
> significant improvement for database workloads.

I didn't checked the implementation but as far as the blkdev is
concerned the b_size changes without notification as soon as you 'mkfs
-b somethingelse' and then mount the fs. So it cannot break as far I can
tell. The only important thing is that b_size stays between 512 and 4k.

> If it is not reasonable to fix all the brokern drivers,
> how about making this configurable (to do variable size IO) ?
> Do you favour this solution ?

no.

Andrea

2002-01-10 10:22:59

by Jens Axboe

[permalink] [raw]
Subject: Re: [PATCH] PAGE_SIZE IO for RAW (RAW VARY)

On Thu, Jan 10 2002, Andrea Arcangeli wrote:
> On Wed, Jan 09, 2002 at 11:28:39AM -0800, Badari Pulavarty wrote:
> > Ben,
> >
> > By any chance do you have a list of drivers that assume this ?
> > What does it take to fix them ?
> >
> > I think Jens BIO changes for 2.5 will fix this problem. But
> > 2.4 needs a solution in this area too. This patch showed
> > significant improvement for database workloads.
>
> I didn't checked the implementation but as far as the blkdev is
> concerned the b_size changes without notification as soon as you 'mkfs
> -b somethingelse' and then mount the fs. So it cannot break as far I can
> tell. The only important thing is that b_size stays between 512 and 4k.

The concern is/was differently sized buffer_heads in the same request,
ie b_size changing as you iterate through the chunks of one request.

--
Jens Axboe

2002-01-10 10:31:38

by Andrea Arcangeli

[permalink] [raw]
Subject: Re: [PATCH] PAGE_SIZE IO for RAW (RAW VARY)

On Wed, Jan 09, 2002 at 10:58:05PM +0000, Alan Cox wrote:
> > If it is not reasonable to fix all the brokern drivers,
> > how about making this configurable (to do variable size IO) ?
> > Do you favour this solution ?
>
> We have hardware that requires aligned power of two for writes (ie 4K on
> 4K boundaries only). The 3ware is one example Jeff Merkey found

of course the right way to do it is to coalesce only if all the
alignments simulates what would happen with a 4k ext2. Now it maybe that
the patch is broken (didn't checked the implementation), but at least
theoretically the merging to save cpu should be doable without breaking
anything and without the need of bio (of course bio can do more than 4k
at once, so it's not nearly as powerful as bio but still it should make a
noticeable difference in the benchmarks).

Andrea

2002-01-10 10:35:30

by Andrea Arcangeli

[permalink] [raw]
Subject: Re: [PATCH] PAGE_SIZE IO for RAW (RAW VARY)

On Wed, Jan 09, 2002 at 03:48:17PM -0800, Badari Pulavarty wrote:
> Alan,
>
> >
> > > If it is not reasonable to fix all the brokern drivers,
> > > how about making this configurable (to do variable size IO) ?
> > > Do you favour this solution ?
> >
> > We have hardware that requires aligned power of two for writes (ie 4K on
> > 4K boundaries only). The 3ware is one example Jeff Merkey found
> >
>
> emm.. come to think of it, I can easily (2 line) change my patch to
> do 512 byte buffer heads till we get PAGE alignment and then start
> issuing 4K IO buffer heads. What do you think ? will this work ?

you should make sure the buffer is naturally aligned with the b_size (so
you emulate the change of blocksize in a filesystem), if it wasn't the
case you should fix it.

>
> And also, do you know any low level drivers Ben mentioning:
>
> > low level drivers, some of which assume that
> > all buffer heads within a request have the same block size.
>
> Is it still true for 2.4 ?

yes, you simply need to call submit_bh or to use separate ll_rw_blocks
when you change b_size.

>
> Regards,
> Badari


Andrea

2002-01-10 10:48:00

by Andrea Arcangeli

[permalink] [raw]
Subject: Re: [PATCH] PAGE_SIZE IO for RAW (RAW VARY)

On Thu, Jan 10, 2002 at 11:22:25AM +0100, Jens Axboe wrote:
> On Thu, Jan 10 2002, Andrea Arcangeli wrote:
> > On Wed, Jan 09, 2002 at 11:28:39AM -0800, Badari Pulavarty wrote:
> > > Ben,
> > >
> > > By any chance do you have a list of drivers that assume this ?
> > > What does it take to fix them ?
> > >
> > > I think Jens BIO changes for 2.5 will fix this problem. But
> > > 2.4 needs a solution in this area too. This patch showed
> > > significant improvement for database workloads.
> >
> > I didn't checked the implementation but as far as the blkdev is
> > concerned the b_size changes without notification as soon as you 'mkfs
> > -b somethingelse' and then mount the fs. So it cannot break as far I can
> > tell. The only important thing is that b_size stays between 512 and 4k.
>
> The concern is/was differently sized buffer_heads in the same request,
> ie b_size changing as you iterate through the chunks of one request.

ok, I don't expect problems there. It can happen for example if you
create a snapshot with 4k and then you switch back the original volume
to 1k. the physical volume will get mixed b_size colaesced into the same
request.

Andrea

2002-01-10 10:52:20

by Jens Axboe

[permalink] [raw]
Subject: Re: [PATCH] PAGE_SIZE IO for RAW (RAW VARY)

On Thu, Jan 10 2002, Andrea Arcangeli wrote:
> On Thu, Jan 10, 2002 at 11:22:25AM +0100, Jens Axboe wrote:
> > On Thu, Jan 10 2002, Andrea Arcangeli wrote:
> > > On Wed, Jan 09, 2002 at 11:28:39AM -0800, Badari Pulavarty wrote:
> > > > Ben,
> > > >
> > > > By any chance do you have a list of drivers that assume this ?
> > > > What does it take to fix them ?
> > > >
> > > > I think Jens BIO changes for 2.5 will fix this problem. But 2.4
> > > > needs a solution in this area too. This patch showed significant
> > > > improvement for database workloads.
> > >
> > > I didn't checked the implementation but as far as the blkdev is
> > > concerned the b_size changes without notification as soon as you
> > > 'mkfs -b somethingelse' and then mount the fs. So it cannot break
> > > as far I can tell. The only important thing is that b_size stays
> > > between 512 and 4k.
> >
> > The concern is/was differently sized buffer_heads in the same
> > request, ie b_size changing as you iterate through the chunks of one
> > request.
>
> ok, I don't expect problems there. It can happen for example if you
> create a snapshot with 4k and then you switch back the original volume
> to 1k. the physical volume will get mixed b_size colaesced into the
> same request.

Well I don't expect problems either, however Jeff Merkey did report them
but see my previous mail on that (validity of that report is
questionable).

I still wouldn't feel to good doing this, and just because snapshotting
opens the possibility for this to happen doesn't mean it a) ever
triggered in real life, and b) works on all devices.

--
Jens Axboe

2002-01-10 11:10:23

by Andrea Arcangeli

[permalink] [raw]
Subject: Re: [PATCH] PAGE_SIZE IO for RAW (RAW VARY)

On Thu, Jan 10, 2002 at 11:51:51AM +0100, Jens Axboe wrote:
> On Thu, Jan 10 2002, Andrea Arcangeli wrote:
> > On Thu, Jan 10, 2002 at 11:22:25AM +0100, Jens Axboe wrote:
> > > On Thu, Jan 10 2002, Andrea Arcangeli wrote:
> > > > On Wed, Jan 09, 2002 at 11:28:39AM -0800, Badari Pulavarty wrote:
> > > > > Ben,
> > > > >
> > > > > By any chance do you have a list of drivers that assume this ?
> > > > > What does it take to fix them ?
> > > > >
> > > > > I think Jens BIO changes for 2.5 will fix this problem. But 2.4
> > > > > needs a solution in this area too. This patch showed significant
> > > > > improvement for database workloads.
> > > >
> > > > I didn't checked the implementation but as far as the blkdev is
> > > > concerned the b_size changes without notification as soon as you
> > > > 'mkfs -b somethingelse' and then mount the fs. So it cannot break
> > > > as far I can tell. The only important thing is that b_size stays
> > > > between 512 and 4k.
> > >
> > > The concern is/was differently sized buffer_heads in the same
> > > request, ie b_size changing as you iterate through the chunks of one
> > > request.
> >
> > ok, I don't expect problems there. It can happen for example if you
> > create a snapshot with 4k and then you switch back the original volume
> > to 1k. the physical volume will get mixed b_size colaesced into the
> > same request.
>
> Well I don't expect problems either, however Jeff Merkey did report them
> but see my previous mail on that (validity of that report is
> questionable).
>
> I still wouldn't feel to good doing this, and just because snapshotting
> opens the possibility for this to happen doesn't mean it a) ever
> triggered in real life, and b) works on all devices.

fair enough. one way to do it certainly safely is to add a bitflag to
the struct blkkdev.

Andrea

2002-01-10 16:59:17

by Badari Pulavarty

[permalink] [raw]
Subject: Re: [PATCH] PAGE_SIZE IO for RAW (RAW VARY)

Andrea,

> >
> > I still wouldn't feel to good doing this, and just because snapshotting
> > opens the possibility for this to happen doesn't mean it a) ever
> > triggered in real life, and b) works on all devices.
>
> fair enough. one way to do it certainly safely is to add a bitflag to
> the struct blkkdev.
>
> Andrea
>

Could you please elaborate on your suggestion. What does a biflag in
blkdev do ?

I am sure by now, you must have understood what my patch is trying to
do. I am trying to issue 4K buffer heads on RAW whenever possible
(instead of 512). But for the first and last pages buffer heads may
be different size than 4K (due to alignment and length of IO). So, We
end up with buffer heads with multiple IO sizes in a single request.

How does your suggestion fix the problem ? How can I address this
problem safely. Please let me know.

Thanks,
Badari

2002-01-10 19:25:29

by Badari Pulavarty

[permalink] [raw]
Subject: Re: [PATCH] PAGE_SIZE IO for RAW (RAW VARY)

>
> fair enough. one way to do it certainly safely is to add a bitflag to
> the struct blkkdev.
>
> Andrea
>

Thanks to Andrea !!

How about adding a flag in blk_dev structure. (I currently couldn't find one).
Set the flag for the drivers which support multiple bufferhead sizes in
a single IO request and use this flag to do RAW VARY or not.

Does this address everyones concerns ? I am willing to work with the
drivers I tested/reviewed/verified to make the change to set the flag.
As driver owners verify their drivers, could set the flag (in future).

If everyone is okay with this approach, I can make a new patch for this.

Thanks,
Badari

2002-01-10 19:49:03

by Alan

[permalink] [raw]
Subject: Re: [PATCH] PAGE_SIZE IO for RAW (RAW VARY)

> Does this address everyones concerns ? I am willing to work with the
> drivers I tested/reviewed/verified to make the change to set the flag.
> As driver owners verify their drivers, could set the flag (in future).

Im just trying to work out how this deals with the 2.4 scsi case

2002-01-10 21:04:03

by Badari Pulavarty

[permalink] [raw]
Subject: Re: [PATCH] PAGE_SIZE IO for RAW (RAW VARY)

>
> > Does this address everyones concerns ? I am willing to work with the
> > drivers I tested/reviewed/verified to make the change to set the flag.
> > As driver owners verify their drivers, could set the flag (in future).
>
> Im just trying to work out how this deals with the 2.4 scsi case
>
Alan,

I am lost ... What do you mean ? Could you please explain ?

Thanks,
Badari

2002-01-10 21:15:04

by Alan

[permalink] [raw]
Subject: Re: [PATCH] PAGE_SIZE IO for RAW (RAW VARY)

> > Im just trying to work out how this deals with the 2.4 scsi case
> >
> Alan,
>
> I am lost ... What do you mean ? Could you please explain ?

The whole of scsi in 2.4 is effectively one block driver. 2.5 splits the
queues up really nicely.

2002-01-10 21:15:47

by Badari Pulavarty

[permalink] [raw]
Subject: Re: [PATCH] PAGE_SIZE IO for RAW (RAW VARY)

>
> > Does this address everyones concerns ? I am willing to work with the
> > drivers I tested/reviewed/verified to make the change to set the flag.
> > As driver owners verify their drivers, could set the flag (in future).
>
> Im just trying to work out how this deals with the 2.4 scsi case
>

Alan,

The issue with my (mostly) PAGE_SIZE RAW IO patch is, it could generate
buffer heads with different b_size in a single IO request. Jens & Ben
have concerns about some of the low level drivers not able to handle
these. (it would be hard to analyse all the drivers to verify this).

So, Andrea suggested we add a flag in "blk_dev" structure. Only the
drivers which support variable size buffer heads in a single IO will
set it. My RAW VARY patch will use this flag to see if I can do
RAW VARY or not. Makes sense ?

Let me know, what you think about this approach.

Regards,
Badari

2002-01-10 22:00:58

by Alan

[permalink] [raw]
Subject: Re: [PATCH] PAGE_SIZE IO for RAW (RAW VARY)

> Something like sd_attach() could get this info from template and
> set a flag in blk_dev. (or in a gloal array).

I didnt think blk_dev was per minor ?

2002-01-10 22:20:39

by Badari Pulavarty

[permalink] [raw]
Subject: Re: [PATCH] PAGE_SIZE IO for RAW (RAW VARY)

>
> > Something like sd_attach() could get this info from template and
> > set a flag in blk_dev. (or in a gloal array).
>
> I didnt think blk_dev was per minor ?
>

hmm.. I see where you are going. I guess I need to have of a pointer
to arrary of MAX_MINOR. But I can use similar technique as "blksize_size"
arrary.

Otherwise, I can have gloabl arrary similar to "blksize_size".

Thanks,
Badari

2002-01-11 13:54:04

by Andrea Arcangeli

[permalink] [raw]
Subject: Re: [PATCH] PAGE_SIZE IO for RAW (RAW VARY)

On Thu, Jan 10, 2002 at 08:58:41AM -0800, Badari Pulavarty wrote:
> Andrea,
>
> > >
> > > I still wouldn't feel to good doing this, and just because snapshotting
> > > opens the possibility for this to happen doesn't mean it a) ever
> > > triggered in real life, and b) works on all devices.
> >
> > fair enough. one way to do it certainly safely is to add a bitflag to
> > the struct blkkdev.
> >
> > Andrea
> >
>
> Could you please elaborate on your suggestion. What does a biflag in
> blkdev do ?
>
> I am sure by now, you must have understood what my patch is trying to
> do. I am trying to issue 4K buffer heads on RAW whenever possible
> (instead of 512). But for the first and last pages buffer heads may
> be different size than 4K (due to alignment and length of IO). So, We
> end up with buffer heads with multiple IO sizes in a single request.
>
> How does your suggestion fix the problem ? How can I address this
> problem safely. Please let me know.

the bitflag will allow you to enable this feature only on the lowlevel
drivers that you can afford to test. on the hardware that you can afford
to test you will know for sure that enabling the feature is safe. This
way we won't risk random breakage of not-tested drivers.

Andrea

2002-01-15 03:17:11

by Joel Becker

[permalink] [raw]
Subject: Re: [PATCH] PAGE_SIZE IO for RAW (RAW VARY)

On Thu, Jan 10, 2002 at 01:15:01PM -0800, Badari Pulavarty wrote:
> The issue with my (mostly) PAGE_SIZE RAW IO patch is, it could generate
> buffer heads with different b_size in a single IO request. Jens & Ben
> have concerns about some of the low level drivers not able to handle
> these. (it would be hard to analyse all the drivers to verify this).
>
> So, Andrea suggested we add a flag in "blk_dev" structure. Only the
> drivers which support variable size buffer heads in a single IO will
> set it. My RAW VARY patch will use this flag to see if I can do
> RAW VARY or not. Makes sense ?

The way I handled it in my O_DIRECT code for the same task (do
the largest I/O you can, instead of a hardcoded 512 or sw_blocksize) was
to merely choose the minimum alignment. See the patch in
<[email protected]>. In your
approach, you want leading blocks to be aligned at 512 or so, then all
4k aligned I/Os to be 4k size, then the trailing I/Os are aligned again
at their size. My approach was 'if aligned at 512, do all 512'. It is
slower than your approach for large I/Os aligned on 512, but it also
has the property of submitting identical blocksizes in one request.
Andrea has suggested that I work with you to be incremental on
top of your code. This would include managing a flag bit to decide if a
device can handle variable I/O sizes in one request. The issue I have
with that is that in the O_DIRECT case, the fallback is failure, not
slow I/O. IOW, in rawio, if the flag is false, you issue all I/Os with
a 512 blocksize. That's slow. However, in O_DIRECT, if the flag is
false, the sw_blocksize is 4096, and the alignment is 512, you fail with
EINVAL. That is bad. In my current patch, the O_DIRECT I/O would
succeed at a 512 byte blocksize. However, my patch doesn't touch rawio
at all.
Please have a look at my patch and maybe we can work on merging
our efforts.

Joel

--

"The first thing we do, let's kill all the lawyers."
-Henry VI, IV:ii

http://www.jlbec.org/
[email protected]