2002-01-09 19:58:05

by Joel Becker

Subject: [PATCH] O_DIRECT with hardware blocksize alignment

Folks,
Some major users of O_DIRECT (Oracle, for instance) align and
size I/O based on the 512-byte hardware blocksize common to most hard
disk drives. The current O_DIRECT code enforces that the alignment and
size of the I/O match the software blocksize (inode->i_sb->s_blocksize).
This patch relaxes that restriction to a minimum of the hardware
blocksize. In the interest of efficiency,
min(I/O alignment, s_blocksize) is used as the effective
blocksize, e.g.:

I/O alignment    s_blocksize    final blocksize
         8192           4096               4096
         4096           4096               4096
          512           4096                512
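
For illustration only (not part of the patch), the kind of caller this
is aimed at looks roughly like the following user-space sketch, issuing
512-byte-aligned O_DIRECT reads; the file name and sizes are made up:

#define _GNU_SOURCE
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>

#define HW_BLOCK 512            /* hardware sector size assumed here */
#define IO_SIZE  (64 * HW_BLOCK)

int main(void)
{
	void *buf;
	ssize_t ret;
	int fd;

	/* Buffer aligned to 512 bytes, not to the 4k fs blocksize. */
	if (posix_memalign(&buf, HW_BLOCK, IO_SIZE))
		return 1;

	fd = open("/data/file.dbf", O_RDONLY | O_DIRECT);
	if (fd < 0) {
		perror("open");
		return 1;
	}

	/*
	 * Offset and length are 512-byte multiples but not 4k
	 * multiples; without the patch this read fails with EINVAL
	 * on a 4k-blocksize filesystem.
	 */
	ret = pread(fd, buf, IO_SIZE, HW_BLOCK);
	if (ret < 0)
		perror("pread");

	close(fd);
	free(buf);
	return 0;
}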

Joel
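
For anyone tracing the arithmetic in the fs/buffer.c hunk below, here
is a stand-alone sketch of the block-number scaling with made-up
numbers (not kernel code; the values are assumptions for the example):

#include <assert.h>
#include <stdio.h>

int main(void)
{
	unsigned long s_blocksize = 4096; /* filesystem (software) blocksize */
	unsigned long blocksize   = 512;  /* chosen I/O blocksize */
	unsigned long blocknr     = 35;   /* 35th 512-byte block of the file */

	assert((s_blocksize % blocksize) == 0);

	unsigned long i_bscale   = s_blocksize / blocksize; /* 8 */
	unsigned long i_blocknr  = blocknr / i_bscale;      /* fs block 4 */
	unsigned long i_blockoff = blocknr % i_bscale;      /* 3 sectors into it */

	/*
	 * Pretend get_block() mapped fs block 4 to disk block 1000;
	 * the device-relative 512-byte block number is then:
	 */
	unsigned long b_blocknr = 1000;
	printf("%lu\n", (b_blocknr * i_bscale) + i_blockoff); /* 8003 */

	return 0;
}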


diff -uNr linux-2.4.17/fs/buffer.c linux-2.4.17-od/fs/buffer.c
--- linux-2.4.17/fs/buffer.c Fri Dec 21 09:41:55 2001
+++ linux-2.4.17-od/fs/buffer.c Wed Jan 9 10:55:52 2002
@@ -2003,6 +2003,17 @@
 {
 	int i, nr_blocks, retval;
 	unsigned long * blocks = iobuf->blocks;
+	int i_bscale, i_blocknr, i_blockoff;
+
+	/* Calculate I/O blocksize to sw blocksize scaling factor */
+	i_bscale = 1;
+	if (blocksize != inode->i_sb->s_blocksize)
+	{
+		if ((inode->i_sb->s_blocksize < blocksize) ||
+		    ((inode->i_sb->s_blocksize % blocksize) != 0))
+			BUG();
+		i_bscale = inode->i_sb->s_blocksize / blocksize;
+	}
 
 	nr_blocks = iobuf->length / blocksize;
 	/* build the blocklist */
@@ -2013,7 +2024,13 @@
 		bh.b_dev = inode->i_dev;
 		bh.b_size = blocksize;
 
-		retval = get_block(inode, blocknr, &bh, rw == READ ? 0 : 1);
+		/* Convert blocknr to the software blocksize */
+		if (!i_bscale)
+			BUG();
+		i_blocknr = blocknr / i_bscale;
+		i_blockoff = blocknr % i_bscale;
+
+		retval = get_block(inode, i_blocknr, &bh, rw == READ ? 0 : 1);
 		if (retval)
 			goto out;
 
@@ -2031,7 +2048,12 @@
 			if (!buffer_mapped(&bh))
 				BUG();
 		}
-		blocks[i] = bh.b_blocknr;
+
+		/*
+		 * Convert the returned blocknr back to the
+		 * I/O blocksize.
+		 */
+		blocks[i] = (bh.b_blocknr * i_bscale) + i_blockoff;
 	}
 
 	retval = brw_kiovec(rw, 1, &iobuf, inode->i_dev, iobuf->blocks, blocksize);
diff -uNr linux-2.4.17/mm/filemap.c linux-2.4.17-od/mm/filemap.c
--- linux-2.4.17/mm/filemap.c Fri Dec 21 09:42:04 2001
+++ linux-2.4.17-od/mm/filemap.c Wed Jan 9 10:58:13 2002
@@ -1491,6 +1491,7 @@
 {
 	ssize_t retval;
 	int new_iobuf, chunk_size, blocksize_mask, blocksize, blocksize_bits, iosize, progress;
+	int b_bsize, b_bmask;
 	struct kiobuf * iobuf;
 	struct address_space * mapping = filp->f_dentry->d_inode->i_mapping;
 	struct inode * inode = mapping->host;
@@ -1508,14 +1509,36 @@
 		new_iobuf = 1;
 	}
 
+	chunk_size = KIO_MAX_ATOMIC_IO << 10;
+
+	retval = -EINVAL;
+
+	/*
+	 * Starting at the software blocksize, check size
+	 * and alignment of the I/O. Shift the blocksize
+	 * down until we get an alignment that works, or we hit
+	 * the hardware blocksize and fail.
+	 */
+	b_bsize = get_hardsect_size(inode->i_dev);
+	b_bmask = (b_bsize - 1);
+
 	blocksize = 1 << inode->i_blkbits;
 	blocksize_bits = inode->i_blkbits;
 	blocksize_mask = blocksize - 1;
-	chunk_size = KIO_MAX_ATOMIC_IO << 10;
+	while (blocksize >= b_bsize)
+	{
+		if (! ((offset & blocksize_mask) ||
+		       (count & blocksize_mask) ||
+		       ((unsigned long)buf & blocksize_mask)))
+			break;
 
-	retval = -EINVAL;
-	if ((offset & blocksize_mask) || (count & blocksize_mask))
+		blocksize >>= 1;
+		blocksize_mask = (blocksize - 1);
+		blocksize_bits--;
+	}
+	if (blocksize < b_bsize)
 		goto out_free;
+
 	if (!mapping->a_ops->direct_IO)
 		goto out_free;
 
--

"Every new beginning comes from some other beginning's end."

http://www.jlbec.org/
[email protected]


2002-01-12 12:34:12

by Andrea Arcangeli

Subject: Re: [PATCH] O_DIRECT with hardware blocksize alignment

On Wed, Jan 09, 2002 at 07:56:07PM +0000, Joel Becker wrote:
> Folks,
> Some major users of O_DIRECT (Oracle, for instance) align and
> size I/O based on the 512-byte hardware blocksize common to most hard
> disk drives. The current O_DIRECT code enforces that the alignment and
> size of the I/O match the software blocksize (inode->i_sb->s_blocksize).
> This patch relaxes that restriction to a minimum of the hardware
> blocksize. In the interest of efficiency,
> min(I/O alignment, s_blocksize) is used as the effective
> blocksize, e.g.:
>
> I/O alignment    s_blocksize    final blocksize
>          8192           4096               4096
>          4096           4096               4096
>           512           4096                512

This falls into the same risky category as Badari's vary-I/O patch for
rawio (check the discussion on l-k), so to make it safe it will also
need to check the same new per-blkdev bitflag before it can trigger the
scaling (if the bitflag says 'no', it will have to enforce the software
blocksize for b_size etc.). You would then get the same optimization in
brw_kiovec as well as in rawio: large I/O even when it is not aligned
to the software blocksize. So I suggest you check Badari's stuff and
the thread on l-k and make a new patch incremental with his code (note:
I still had some problems with his latest patch, but it is not far from
a final version).

Andrea

2002-01-15 03:21:41

by Joel Becker

Subject: Re: [PATCH] O_DIRECT with hardware blocksize alignment

On Sat, Jan 12, 2002 at 01:31:22PM +0100, Andrea Arcangeli wrote:
> On Wed, Jan 09, 2002 at 07:56:07PM +0000, Joel Becker wrote:
> > min(I/O alignment, s_blocksize) is used as the effective
> > blocksize, e.g.:
> >
> > I/O alignment    s_blocksize    final blocksize
> >          8192           4096               4096
> >          4096           4096               4096
> >           512           4096                512
>
> this falls in the same risky category of the vary-I/O patch from Badari
> (check the discussion on l-k) for rawio, so to make it safe it also will

How so? All I/O is at the computed blocksize. In every
request, the size of each I/O in the kiovec is the same. The
computation is done upon entrance to generic_file_direct_IO, and it is
kept that way. You don't have bh[0]->b_size = 512; bh[1]->b_size =
4096;
Hmm, maybe you mean things like that rumoured 3-ware issue. I
dunno. I do know that this code seems to work just fine with ide,
aha7xxx, and the qlogic driver. Certain software really wants to use
O_DIRECT, and they align I/O on 512-byte boundaries. So any scheme that
fails this when it doesn't have to is a problem.

> aligned I/O, but still large I/O) So I suggest you to check Badari's
> stuff and the thread on l-k and to make a new patch incremental with his

I've added myself to that thread as well.

Joel

--

"Vote early and vote often."
- Al Capone

http://www.jlbec.org/
[email protected]

2002-01-15 12:20:41

by Andrea Arcangeli

Subject: Re: [PATCH] O_DIRECT with hardware blocksize alignment

On Tue, Jan 15, 2002 at 03:21:26AM +0000, Joel Becker wrote:
> On Sat, Jan 12, 2002 at 01:31:22PM +0100, Andrea Arcangeli wrote:
> > On Wed, Jan 09, 2002 at 07:56:07PM +0000, Joel Becker wrote:
> > > min(I/O alignment, s_blocksize) is used as the effective
> > > blocksize, e.g.:
> > >
> > > I/O alignment    s_blocksize    final blocksize
> > >          8192           4096               4096
> > >          4096           4096               4096
> > >           512           4096                512
> >
> > this falls in the same risky category of the vary-I/O patch from Badari
> > (check the discussion on l-k) for rawio, so to make it safe it also will
>
> How so? All I/O is at the computed blocksize. In every
> request, the size of each I/O in the kiovec is the same. The

In the kiovec, yes, but in the same request queue there will also be
the concurrent requests from the filesystem, and they will have a
different b_size; see Jens's mail about buffer heads with different
b_size being merged into the same request.

Actually, we could also forbid merging at the ll_rw_block layer if the
b_size values are not equal; maybe that's the simpler solution to the
problem after all, since merging between kiovec I/O and buffered I/O
probably doesn't matter.
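
(A stand-alone illustration of the b_size mismatch such a check would
refuse; the sizes below are assumptions for the example, not kernel
code.)

#include <stdio.h>

/*
 * A request built from 4k buffered-I/O buffer heads spans 8 sectors
 * per bh, while a 512-byte O_DIRECT bh spans 1, so an equality check
 * on the per-bh sector count refuses the merge.
 */
int main(void)
{
	unsigned int req_bh_size = 4096; /* b_size of the bhs already in the request */
	unsigned int new_bh_size = 512;  /* b_size of the incoming O_DIRECT bh */

	unsigned int req_sectors = req_bh_size >> 9; /* 8 */
	unsigned int new_sectors = new_bh_size >> 9; /* 1 */

	if (req_sectors != new_sectors)
		printf("merge refused: %u != %u sectors per bh\n",
		       req_sectors, new_sectors);
	return 0;
}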

> computation is done upon entrance to generic_file_direct_IO, and it is
> kept that way. You don't have bh[0]->b_size = 512; bh[1]->b_size =
> 4096;
> Hmm, maybe you mean things like that rumoured 3-ware issue. I
> dunno. I do know that this code seems to work just fine with ide,
> aha7xxx, and the qlogic driver. Certain software really wants to use
> O_DIRECT, and they align I/O on 512byte boundaries. So any scheme that
> fails this when it doesn't have to is a problem.
>
> > aligned I/O, but still large I/O) So I suggest you to check Badari's
> > stuff and the thread on l-k and to make a new patch incremental with his
>
> I've added myself to that thread as well.
>
> Joel
>


Andrea

2002-01-15 13:09:47

by Jens Axboe

Subject: Re: [PATCH] O_DIRECT with hardware blocksize alignment

On Tue, Jan 15 2002, Andrea Arcangeli wrote:
> actually we could also forbid merging at the ll_rw_block layer if b_size
> is not equal, maybe that's the simpler solution to that problem after
> all, merging between kiovec I/O and buffered I/O probably doesn't
> matter.

Agreed, this is also what I suggested.

--- /opt/kernel/linux-2.4.18-pre3/drivers/block/ll_rw_blk.c Tue Jan 15 14:06:13 2002
+++ drivers/block/ll_rw_blk.c Tue Jan 15 14:07:23 2002
@@ -694,10 +694,11 @@
 	switch (el_ret) {
 
 		case ELEVATOR_BACK_MERGE:
-			if (!q->back_merge_fn(q, req, bh, max_segments)) {
-				insert_here = &req->queue;
+			insert_here = &req->queue;
+			if (!q->back_merge_fn(q, req, bh, max_segments))
+				break;
+			if (req->current_nr_sectors != (bh->b_size >> 9))
 				break;
-			}
 			elevator->elevator_merge_cleanup_fn(q, req, count);
 			req->bhtail->b_reqnext = bh;
 			req->bhtail = bh;
@@ -708,10 +709,11 @@
 			goto out;
 
 		case ELEVATOR_FRONT_MERGE:
-			if (!q->front_merge_fn(q, req, bh, max_segments)) {
-				insert_here = req->queue.prev;
+			insert_here = req->queue.prev;
+			if (!q->front_merge_fn(q, req, bh, max_segments))
+				break;
+			if (req->current_nr_sectors != (bh->b_size >> 9))
 				break;
-			}
 			elevator->elevator_merge_cleanup_fn(q, req, count);
 			bh->b_reqnext = req->bh;
 			req->bh = bh;

--
Jens Axboe

2002-01-15 13:56:49

by Jens Axboe

Subject: Re: [PATCH] O_DIRECT with hardware blocksize alignment

On Tue, Jan 15 2002, Jens Axboe wrote:
> On Tue, Jan 15 2002, Andrea Arcangeli wrote:
> > actually we could also forbid merging at the ll_rw_block layer if b_size
> > is not equal, maybe that's the simpler solution to that problem after
> > all, merging between kiovec I/O and buffered I/O probably doesn't
> > matter.
>
> Agreed, this is also what I suggested.

Here's the right version, sorry. This still potentially decrements
elevator sequence wrongly for a missed front merge, but that's an issue
I can definitely live with :-)

--- /opt/kernel/linux-2.4.18-pre3/drivers/block/ll_rw_blk.c Tue Jan 15 14:06:13 2002
+++ drivers/block/ll_rw_blk.c Tue Jan 15 14:54:20 2002
@@ -694,10 +694,11 @@
 	switch (el_ret) {
 
 		case ELEVATOR_BACK_MERGE:
-			if (!q->back_merge_fn(q, req, bh, max_segments)) {
-				insert_here = &req->queue;
+			insert_here = &req->queue;
+			if (req->current_nr_sectors != count)
+				break;
+			if (!q->back_merge_fn(q, req, bh, max_segments))
 				break;
-			}
 			elevator->elevator_merge_cleanup_fn(q, req, count);
 			req->bhtail->b_reqnext = bh;
 			req->bhtail = bh;
@@ -708,10 +709,11 @@
 			goto out;
 
 		case ELEVATOR_FRONT_MERGE:
-			if (!q->front_merge_fn(q, req, bh, max_segments)) {
-				insert_here = req->queue.prev;
+			insert_here = req->queue.prev;
+			if (req->current_nr_sectors != count)
+				break;
+			if (!q->front_merge_fn(q, req, bh, max_segments))
 				break;
-			}
 			elevator->elevator_merge_cleanup_fn(q, req, count);
 			bh->b_reqnext = req->bh;
 			req->bh = bh;

--
Jens Axboe

2002-01-15 21:23:40

by Badari Pulavarty

Subject: Re: [PATCH] O_DIRECT with hardware blocksize alignment


Wouldn't this cause a single I/O to split into (possibly) 3 I/Os for
RAW VARY on all unaligned I/Os? In that case, wouldn't it be better
to do the entire I/O with 512-byte buffer heads, so that all of them
can be merged?

IMHO, preventing the merge due to a b_size mismatch just to handle a few
drivers/pieces of hardware may be overkill. It penalizes all the drivers
that can handle variable-size I/Os as well, doesn't it? Leaving it to
the driver and doing variable-size I/O only on the drivers that can
handle it may be a better option.

In my initial version of the RAW VARY patch, I split the I/O into possibly
3 I/Os in raw.c: brw_kiovec(.., 512) for the first and last pages and
4K for the middle pages. It showed a reduction in CPU utilization for
large I/Os. But since it could split a single I/O into 3 I/Os, I did
not see a significant I/O throughput improvement.
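
(A rough user-space sketch of the head/middle/tail split described
above, assuming 4k pages; the split() helper and the sample request are
made up for illustration.)

#include <stdio.h>

/* Print how an unaligned request breaks into a 512-byte head, a
 * 4k-aligned middle, and a 512-byte tail. */
static void split(unsigned long long off, unsigned long long len)
{
	unsigned long long start = off, end = off + len;
	unsigned long long mid_start = (start + 4095) & ~4095ULL; /* round up */
	unsigned long long mid_end   = end & ~4095ULL;            /* round down */

	if (mid_start >= mid_end) {
		printf("512-byte I/O only: [%llu, %llu)\n", start, end);
		return;
	}
	if (start < mid_start)
		printf("512-byte head: [%llu, %llu)\n", start, mid_start);
	printf("4k middle:     [%llu, %llu)\n", mid_start, mid_end);
	if (mid_end < end)
		printf("512-byte tail: [%llu, %llu)\n", mid_end, end);
}

int main(void)
{
	/* A 2048-aligned, 12k request: 2k head, 8k middle, 2k tail. */
	split(2048, 12288);
	return 0;
}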

Regards,
Badari



>
> On Tue, Jan 15 2002, Jens Axboe wrote:
> > On Tue, Jan 15 2002, Andrea Arcangeli wrote:
> > > actually we could also forbid merging at the ll_rw_block layer if b_size
> > > is not equal, maybe that's the simpler solution to that problem after
> > > all, merging between kiovec I/O and buffered I/O probably doesn't
> > > matter.
> >
> > Agreed, this is also what I suggested.
>
> Here's the right version, sorry. This still potentially decrements
> elevator sequence wrongly for a missed front merge, but that's an issue
> I can definitely live with :-)
> ....

2002-01-15 21:34:00

by Alan

Subject: Re: [PATCH] O_DIRECT with hardware blocksize alignment

> which can handle variable size IOs also. Isn't it ? Leaving it to
> the driver and doing variable size I/O on only the drivers that can
> handle them, may be a better option ?

Given that almost all drivers can handle it, possibly what is wanted in
the 2.5 case is something as simple as a library routine a driver can
call from the block I/O path that converts variable-sized I/Os.

That would meet your very sensible argument that it shouldn't be
punishing other drivers.

Alan

2002-01-16 00:08:57

by Joel Becker

Subject: Re: [PATCH] O_DIRECT with hardware blocksize alignment

On Tue, Jan 15, 2002 at 01:20:26PM +0100, Andrea Arcangeli wrote:
> > How so? All I/O is at the computed blocksize. In every
> > request, the size of each I/O in the kiovec is the same. The
>
> in the kiovec yes, but in the same request queue there will be also the
> concurrent requests from the filesystem, and they will have different
> b_size, see Jens's mail about different b_size merged in the same
> request.

Ok, I've read over Badari's latest patch, and it would seem he
assumes that the kiovec coming into brw_kiovec has buffer_heads of
512 bytes (as raw.c would prepare). He then submits the I/O in differing
chunks (e.g. 2048 + 4096 + 4096 + 2048 for a 2048-aligned buffer).
Correct me if I am wrong (I can't see anything in raw.c that would
change the sizes in the kiovec).
For O_DIRECT, the fallback is s_blocksize, not the hardware
minimum of 512. So the kiovec would be coming into brw_kiovec with a
b_size of 4096 or so. My patch lets b_size be anything between 512 and
s_blocksize.
To work with Badari's code, would not the O_DIRECT path want to
submit a kiovec that is entirely b_size = 512 and let brw_kiovec handle
expanding it to larger sizes? That would make my patch simpler, but I
wonder what issues it presents.

Joel

--

"When choosing between two evils, I always like to try the one
I've never tried before."
- Mae West

http://www.jlbec.org/
[email protected]

2002-01-24 00:44:52

by Badari Pulavarty

Subject: [PATCH] small bugfix for ll_rw_bio() for 2.5.3-pre3

Hi,

Here is a small bug fix for ll_rw_bio() for 2.5.3-pre3:
kio->io_count can safely be incremented only after bio_alloc()
succeeds.

Thanks,
Badari

--- linux-253pre3/fs/bio.c Thu Dec 27 08:15:15 2001
+++ linux-253pre3.new/fs/bio.c Wed Jan 23 17:32:59 2002
@@ -368,8 +368,6 @@
 		if (nr_pages > total_nr_pages)
 			nr_pages = total_nr_pages;
 
-		atomic_inc(&kio->io_count);
-
 		/*
 		 * allocate bio and do initial setup
 		 */
@@ -377,6 +375,8 @@
 			err = -ENOMEM;
 			goto out;
 		}
+
+		atomic_inc(&kio->io_count);
 
 		bio->bi_sector = sector;
 		bio->bi_dev = dev;

2002-01-24 21:52:32

by Badari Pulavarty

Subject: O_DIRECT broken in 2.5.3-preX ?

Hi,

I am reading the O_DIRECT code patch for 2.5.3-pre4, and I was wondering
how this is working in 2.5.X. Here is my concern:

generic_direct_IO() creates a blocks[] list and passes it to
brw_kiovec() with a single kiobuf:

	retval = brw_kiovec(rw, 1, &iobuf, inode->i_dev, blocks, blocksize);

But brw_kiovec() uses only b[0] when it calls ll_rw_kio():

	for (i = 0; i < nr; i++) {
		iobuf = iovec[i];
		iobuf->errno = 0;

		ll_rw_kio(rw, iobuf, dev, b[i] * (size >> 9));
	}

Note that nr = 1 here. ll_rw_kio() uses b[0] as the starting sector
and does the entire I/O (for iobuf->length). This is wrong!!!
It is doing I/O from the wrong blocks. Someone should be using the
other block numbers from the blocks[] list, shouldn't they?

What am I missing here? Please let me know.
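
(Purely to illustrate the point, not a fix: every blocksize-sized
chunk of the kiobuf has its own entry in blocks[], and those entries
need not be contiguous on disk, so the starting sector has to be taken
per chunk rather than from b[0] alone. The mapping below is made up.)

#include <stdio.h>

int main(void)
{
	/* Made-up mapping: four 4k chunks of one kiobuf, with a hole
	 * in the on-disk layout between chunk 1 and chunk 2. */
	unsigned long blocks[]  = { 120, 121, 9000, 9001 };
	unsigned long blocksize = 4096;
	unsigned long length    = 4 * blocksize; /* iobuf->length */
	unsigned long i;

	for (i = 0; i < length / blocksize; i++)
		printf("chunk %lu starts at sector %lu\n",
		       i, blocks[i] * (blocksize >> 9));
	return 0;
}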

Thanks,
Badari

2002-01-28 02:06:40

by Andrea Arcangeli

Subject: Re: O_DIRECT broken in 2.5.3-preX ?

On Thu, Jan 24, 2002 at 01:52:04PM -0800, Badari Pulavarty wrote:
> Hi,
>
> I am reading the O_DIRECT code patch for 2.5.3-pre4, and I was wondering
> how this is working in 2.5.X. Here is my concern:
>
> generic_direct_IO() creates a blocks[] list and passes it to
> brw_kiovec() with a single kiobuf:
>
> 	retval = brw_kiovec(rw, 1, &iobuf, inode->i_dev, blocks, blocksize);
>
> But brw_kiovec() uses only b[0] when it calls ll_rw_kio():
>
> 	for (i = 0; i < nr; i++) {
> 		iobuf = iovec[i];
> 		iobuf->errno = 0;
>
> 		ll_rw_kio(rw, iobuf, dev, b[i] * (size >> 9));
> 	}
>
> Note that nr = 1 here. ll_rw_kio() uses b[0] as the starting sector
> and does the entire I/O (for iobuf->length). This is wrong!!!
> It is doing I/O from the wrong blocks. Someone should be using the
> other block numbers from the blocks[] list, shouldn't they?

Correct. It seems like somebody broke the semantics of the block field
of brw_kiovec without updating the callers, and furthermore now it is
impossible to use brw_kiovec for O_DIRECT because it is not going to
submit physically contiguous I/O.

Andrea