2006-02-27 21:19:20

by Badari Pulavarty

Subject: [PATCH 0/4] map multiple blocks in get_block() and mpage_readpages()

Hi,

The following patches add support for mapping multiple blocks in ->get_block().
This will allow us to handle mapping of multiple disk blocks for
mpage_readpages() and mpage_writepages() etc. Instead of adding a new
argument, I use "b_size" to indicate the amount of disk mapping needed
by get_block(). On success, get_block() also indicates the amount of
disk mapping it actually did.

Now that get_block() can handle multiple blocks, there is no need
for ->get_blocks(), which was added for DIO.
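The calling convention the series settles on — b_size as an in/out length — can be sketched in userspace C. Everything below (the toy types, the hypothetical one-extent "filesystem", all names and sizes) is illustrative only, not kernel code:

```c
#include <assert.h>
#include <stddef.h>

/* Toy stand-in for struct buffer_head: only the fields the
 * multi-block contract touches.  Userspace sketch, not kernel code. */
struct toy_bh {
    unsigned long long b_blocknr; /* first mapped disk block */
    size_t b_size;                /* in: bytes wanted; out: bytes mapped */
    int mapped;
};

/* Hypothetical filesystem with a single 4-block extent: logical
 * block 0 maps to disk block 100, blocksize 4096. */
#define BLKSIZE   4096
#define EXT_START 0ULL
#define EXT_LEN   4ULL
#define EXT_DISK  100ULL

/* get_block()-style helper: the caller sets bh->b_size to the amount
 * of disk mapping it wants; on success we shrink b_size to what the
 * extent can actually supply in one contiguous run. */
static int toy_get_block(unsigned long long iblock, struct toy_bh *bh)
{
    unsigned long long avail;

    if (iblock < EXT_START || iblock >= EXT_START + EXT_LEN)
        return -1;                      /* hole: nothing mapped */
    avail = (EXT_START + EXT_LEN - iblock) * BLKSIZE;
    if (bh->b_size > avail)
        bh->b_size = avail;             /* out: bytes really mapped */
    bh->b_blocknr = EXT_DISK + (iblock - EXT_START);
    bh->mapped = 1;
    return 0;
}
```

A caller that asks for eight blocks starting at logical block 1 gets back three (the rest of the extent), with b_size shrunk accordingly.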

[PATCH 1/4] change buffer_head.b_size to size_t

[PATCH 2/4] pass b_size to ->get_block()

[PATCH 3/4] map multiple blocks for mpage_readpages()

[PATCH 4/4] remove ->get_blocks() support

I noticed decent improvements (reduced sys time) on JFS, XFS and ext3.
(on simple "dd" read tests).

(rc3.mm1) (rc3.mm1 + patches)
real 0m18.814s 0m18.482s
user 0m0.000s 0m0.004s
sys 0m3.240s 0m2.912s

Andrew, could you include this in the -mm tree?

Comments?

Thanks,
Badari



2006-02-27 21:21:20

by Badari Pulavarty

Subject: [PATCH 1/4] change buffer_head.b_size to size_t

Increase the size of the buffer_head b_size field (only) for
64 bit platforms. Update some old and moldy comments in and
around the structure as well.

The b_size increase allows us to perform larger mappings and
allocations for large I/O requests from userspace, which tie
in with other changes allowing the get_block_t() interface to
map multiple blocks at once.
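A small userspace illustration of why the field needs to widen on 64-bit platforms: a 32-bit b_size silently truncates any mapping length of 4 GB or more. The sizes here are made up purely for the demonstration:

```c
#include <assert.h>
#include <stdint.h>

/* Old field width: storing a large mapping length in a 32-bit
 * b_size silently truncates it. */
static uint64_t store_in_u32(uint64_t len)
{
    uint32_t b_size = (uint32_t)len;   /* truncates above 4GB */
    return b_size;
}

/* New field width: size_t is 64 bits on a 64-bit platform, so the
 * length survives intact. */
static uint64_t store_in_size_t(uint64_t len)
{
    uint64_t b_size = len;
    return b_size;
}
```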



Attachments:
increase-bsize.patch (4.55 kB)

2006-02-27 21:22:15

by Badari Pulavarty

Subject: [PATCH 2/4] pass b_size to ->get_block()

Pass the amount of disk that needs to be mapped to get_block().
This way one can modify the fs ->get_block() functions
to map multiple blocks at the same time.



Attachments:
getblock-take-size.patch (3.26 kB)

2006-02-27 21:22:56

by Badari Pulavarty

Subject: [PATCH 3/4] map multiple blocks for mpage_readpages()

This patch changes mpage_readpages() and get_block() to
get the disk mapping information for multiple blocks at the
same time.

b_size represents the amount of disk mapping that needs to
be mapped. On a successful get_block(), b_size indicates the
amount of disk mapping that was actually mapped. Only
filesystems that care to use this information and can provide
multiple disk blocks at a time need do so.

No changes are needed for filesystems that want to ignore this.
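The payoff on the read side is that a multi-block mapping can be reused across consecutive blocks. A userspace sketch of that loop (toy types and a pretend all-contiguous filesystem; the names are illustrative, not the actual mpage code):

```c
#include <assert.h>
#include <stddef.h>

#define BLKSIZE 4096

/* Cached mapping, carried across calls the way do_mpage_readpage()
 * would carry it: the logical block the mapping starts at plus the
 * number of bytes the last get_block() call mapped.  Hypothetical
 * simplified form. */
struct cached_map {
    unsigned long first_logical;
    size_t b_size;
    int valid;
};

static int getblock_calls; /* how often we fall back to the fs */

/* Pretend filesystem: everything is one big contiguous extent, so a
 * single get_block() maps as much as it is asked for. */
static void toy_get_block(unsigned long iblock, size_t want,
                          struct cached_map *map)
{
    getblock_calls++;
    map->first_logical = iblock;
    map->b_size = want;
    map->valid = 1;
}

/* Map nblocks consecutive logical blocks, reusing the cached
 * mapping wherever it still covers the block we need. */
static void map_range(unsigned long start, unsigned long nblocks,
                      struct cached_map *map)
{
    for (unsigned long b = start; b < start + nblocks; b++) {
        unsigned long covered = map->b_size / BLKSIZE;

        if (map->valid && b >= map->first_logical &&
            b < map->first_logical + covered)
            continue;   /* still inside the cached mapping */
        toy_get_block(b, (start + nblocks - b) * BLKSIZE, map);
    }
}
```

With a contiguous file, sixteen blocks cost one get_block() call instead of sixteen; a filesystem that ignores b_size simply degrades back to one call per block.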



Attachments:
getblocks-readpages.patch (7.39 kB)

2006-02-27 21:23:43

by Badari Pulavarty

Subject: [PATCH 4/4] remove ->get_blocks() support

Now that get_block() can handle mapping multiple disk blocks,
there is no need for ->get_blocks(). This patch removes the
fs-specific ->get_blocks() added for DIO and makes its users
use get_block() instead.


Attachments:
remove-getblocks.patch (16.18 kB)

2006-03-02 01:49:30

by Andrew Morton

Subject: Re: [PATCH 1/4] change buffer_head.b_size to size_t

Badari Pulavarty <[email protected]> wrote:
>
> + size_t b_size; /* size of mapping */
> + char *b_data; /* pointer to data within the page */
>
> struct block_device *b_bdev;
> bh_end_io_t *b_end_io; /* I/O completion */
> void *b_private; /* reserved for b_end_io */
> struct list_head b_assoc_buffers; /* associated with another mapping */
> + atomic_t b_count; /* users using this buffer_head */
> };
>
> /*
> Index: linux-2.6.16-rc5/fs/buffer.c
> ===================================================================
> --- linux-2.6.16-rc5.orig/fs/buffer.c 2006-02-26 21:09:35.000000000 -0800
> +++ linux-2.6.16-rc5/fs/buffer.c 2006-02-27 08:22:37.000000000 -0800
> @@ -432,7 +432,8 @@ __find_get_block_slow(struct block_devic
> printk("__find_get_block_slow() failed. "
> "block=%llu, b_blocknr=%llu\n",
> (unsigned long long)block, (unsigned long long)bh->b_blocknr);
> - printk("b_state=0x%08lx, b_size=%u\n", bh->b_state, bh->b_size);
> + printk("b_state=0x%08lx, b_size=%lu\n", bh->b_state,
> + (unsigned long)bh->b_size);

We print size_t with `%z'. Hence the cast isn't needed.

(I'll edit the diff..)

2006-03-02 01:49:39

by Andrew Morton

Subject: Re: [PATCH 1/4] change buffer_head.b_size to size_t

Badari Pulavarty <[email protected]> wrote:
>
> + * Historically, a buffer_head was used to map a single block
> + * within a page, and of course as the unit of I/O through the
> + * filesystem and block layers. Nowadays the basic I/O unit
> + * is the bio, and buffer_heads are used for extracting block
> + * mappings (via a get_block_t call), for tracking state within
> + * a page (via a page_mapping) and for wrapping bio submission
> + * for backward compatibility reasons (e.g. submit_bh).

Well kinda. A buffer_head remains the kernel's basic abstraction for a
"disk block". We cannot replace that with `struct page' (size isn't
flexible) nor of course with `struct bio'.

2006-03-02 01:50:10

by Andrew Morton

Subject: Re: [PATCH 2/4] pass b_size to ->get_block()

Badari Pulavarty <[email protected]> wrote:
>
> Pass the amount of disk that needs to be mapped to get_block().
> This way one can modify the fs ->get_block() functions
> to map multiple blocks at the same time.

I can't say I terribly like this patch. Initialising b_size all over the
place seems fragile.

We're _already_ setting bh.b_size to the right thing in
alloc_page_buffers(), and for a bh which is attached to
pagecache_page->private, there's no reason why b_size would ever change.

So what I think I'll do is to convert those places where you're needlessly
assigning to b_size into temporary WARN_ON(b_size != blocksize).

The only place where we need to initialise b_size is where we've got a
non-pagecache bh allocated on the stack.

We need to be sure that no ->get_block() implementations write garbage into
bh->b_size if something goes wrong. b_size on a pagecache-based
buffer_head must remain inviolate.
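The conversion Andrew describes can be sketched in userspace C: where the old code blindly re-assigned b_size on a pagecache bh, the replacement just checks the invariant. WARN_ON here mimics the kernel macro's "evaluate, complain, return truth value" shape; the toy types are illustrative only:

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>

/* Userspace mimic of the kernel's WARN_ON(): prints a warning and
 * evaluates to the truth of the condition. */
#define WARN_ON(cond) \
    ((cond) ? (fprintf(stderr, "WARNING: %s\n", #cond), 1) : 0)

struct toy_bh { size_t b_size; };

/* For a bh attached to a pagecache page, b_size is set once (in
 * alloc_page_buffers()) and must never change.  This replaces a
 * needless "bh->b_size = blocksize" assignment with a check. */
static int check_pagecache_bh(const struct toy_bh *bh, size_t blocksize)
{
    return WARN_ON(bh->b_size != blocksize);
}
```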

2006-03-02 01:50:22

by Andrew Morton

Subject: Re: [PATCH 0/4] map multiple blocks in get_block() and mpage_readpages()

Badari Pulavarty <[email protected]> wrote:
>
> I noticed decent improvements (reduced sys time) on JFS, XFS and ext3.
> (on simple "dd" read tests).
>
> (rc3.mm1) (rc3.mm1 + patches)
> real 0m18.814s 0m18.482s
> user 0m0.000s 0m0.004s
> sys 0m3.240s 0m2.912s

With which filesystem? XFS and JFS implement a larger-than-b_size
->get_block, but ext3 doesn't. I'd expect ext3 system time to increase a
bit, if anything?

2006-03-02 02:23:05

by Nathan Scott

Subject: Re: [PATCH 1/4] change buffer_head.b_size to size_t

On Wed, Mar 01, 2006 at 05:51:48PM -0800, Andrew Morton wrote:
> Badari Pulavarty <[email protected]> wrote:
> >
> > + * Historically, a buffer_head was used to map a single block
> > + * within a page, and of course as the unit of I/O through the
> > + * filesystem and block layers. Nowadays the basic I/O unit
> > + * is the bio, and buffer_heads are used for extracting block
> > + * mappings (via a get_block_t call), for tracking state within
> > + * a page (via a page_mapping) and for wrapping bio submission
> > + * for backward compatibility reasons (e.g. submit_bh).
>
> Well kinda. A buffer_head remains the kernel's basic abstraction for a
> "disk block".

That's what I said (meant to say) with
"buffer_heads are used for extracting block mappings".

I think by "disk block" you mean what I'm thinking of as a
"block mapping" (a series of contiguous blocks). I'd think
of a sector_t as a "disk block", but maybe I'm just weird
that way... a better wordsmith should jump in and update
the comment, I guess.

> We cannot replace that with `struct page' (size isn't
> flexible) nor of course with `struct bio'.

Indeed.

--
Nathan

2006-03-02 03:07:40

by Andrew Morton

Subject: Re: [PATCH 1/4] change buffer_head.b_size to size_t

Nathan Scott <[email protected]> wrote:
>
> On Wed, Mar 01, 2006 at 05:51:48PM -0800, Andrew Morton wrote:
> > Badari Pulavarty <[email protected]> wrote:
> > >
> > > + * Historically, a buffer_head was used to map a single block
> > > + * within a page, and of course as the unit of I/O through the
> > > + * filesystem and block layers. Nowadays the basic I/O unit
> > > + * is the bio, and buffer_heads are used for extracting block
> > > + * mappings (via a get_block_t call), for tracking state within
> > > + * a page (via a page_mapping) and for wrapping bio submission
> > > + * for backward compatibility reasons (e.g. submit_bh).
> >
> > Well kinda. A buffer_head remains the kernel's basic abstraction for a
> > "disk block".
>
> That's what I said (meant to say) with
> "buffer_heads are used for extracting block mappings".
>
> I think by "disk block" you mean what I'm thinking of as a
> "block mapping" (a series of contiguous blocks). I'd think
> of a sector_t as a "disk block", but maybe I'm just weird
> that way... a better wordsmith should jump in and update
> the comment, I guess.

Think of `struct buffer_head' as `struct block'. If you want to read a
block off the disk, diddle with it and write it back (while, of course,
being coherent with everything else which is going on), the bh is the only
construct we have for this.

2006-03-02 04:46:22

by Andrew Morton

Subject: Re: [PATCH 3/4] map multiple blocks for mpage_readpages()

Badari Pulavarty <[email protected]> wrote:
>
> do_mpage_readpage(struct bio *bio, struct page *page, unsigned nr_pages,
> - sector_t *last_block_in_bio, get_block_t get_block)
> + sector_t *last_block_in_bio, struct buffer_head *map_bh,
> + unsigned long *first_logical_block, int *map_valid,
> + get_block_t get_block)

I wonder if we really need that map_valid pointer there. The normal way of
communicating the validity of a bh is via buffer_uptodate(). I _think_
that's an appropriate thing to use here. With appropriate comments, of
course..

If those options are not a sufficiently good fit then we can easily create
a new buffer-head.state bit for this purpose - there are lots to spare.
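The spare-bit mechanics Andrew refers to look roughly like this in userspace C: b_state is a bitmask, and the kernel generates buffer_foo()/set_buffer_foo() helpers per bit (via its BUFFER_FNS() macro). TOY_BH_Mapvalid is a made-up name for a hypothetical "cached multi-block mapping still valid" bit; none of this is the actual kernel code:

```c
#include <assert.h>

/* Toy versions of the buffer_head state bits.  TOY_BH_Mapvalid is
 * hypothetical: a new bit meaning "multi-block map is usable". */
enum toy_bh_state_bits {
    TOY_BH_Uptodate,   /* buffer contains valid data */
    TOY_BH_Mapped,     /* buffer has a disk mapping */
    TOY_BH_Mapvalid,   /* hypothetical new bit */
};

struct toy_bh { unsigned long b_state; };

/* Per-bit accessor generator, in the style of the kernel's
 * BUFFER_FNS() macro. */
#define TOY_BUFFER_FNS(bit, name)                               \
static void set_buffer_##name(struct toy_bh *bh)                \
        { bh->b_state |= 1UL << (bit); }                        \
static void clear_buffer_##name(struct toy_bh *bh)              \
        { bh->b_state &= ~(1UL << (bit)); }                     \
static int buffer_##name(const struct toy_bh *bh)               \
        { return !!(bh->b_state & (1UL << (bit))); }

TOY_BUFFER_FNS(TOY_BH_Uptodate, uptodate)
TOY_BUFFER_FNS(TOY_BH_Mapvalid, mapvalid)
```

Each bit is set, cleared, and tested independently, which is why adding one more state bit is cheap.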

2006-03-03 16:50:29

by Badari Pulavarty

Subject: Re: [PATCH 3/4] map multiple blocks for mpage_readpages()

On Wed, 2006-03-01 at 20:45 -0800, Andrew Morton wrote:
> Badari Pulavarty <[email protected]> wrote:
> >
> > do_mpage_readpage(struct bio *bio, struct page *page, unsigned nr_pages,
> > - sector_t *last_block_in_bio, get_block_t get_block)
> > + sector_t *last_block_in_bio, struct buffer_head *map_bh,
> > + unsigned long *first_logical_block, int *map_valid,
> > + get_block_t get_block)
>
> I wonder if we really need that map_valid pointer there. The normal way of
> communicating the validity of a bh is via buffer_uptodate(). I _think_
> that's an appropriate thing to use here. With appropriate comments, of
> course..
>
> If those options are not a sufficiently good fit then we can easily create
> a new buffer-head.state bit for this purpose - there are lots to spare.

Yep. Can be done. I will take a pass at it.

Thanks,
Badari

2006-03-03 17:04:14

by Badari Pulavarty

Subject: Re: [PATCH 0/4] map multiple blocks in get_block() and mpage_readpages()

On Wed, 2006-03-01 at 17:52 -0800, Andrew Morton wrote:
> Badari Pulavarty <[email protected]> wrote:
> >
> > I noticed decent improvements (reduced sys time) on JFS, XFS and ext3.
> > (on simple "dd" read tests).
> >
> > (rc3.mm1) (rc3.mm1 + patches)
> > real 0m18.814s 0m18.482s
> > user 0m0.000s 0m0.004s
> > sys 0m3.240s 0m2.912s
>
> With which filesystem? XFS and JFS implement a larger-than-b_size
> ->get_block, but ext3 doesn't. I'd expect ext3 system time to increase a
> bit, if anything?

These numbers are on JFS. With current mainline ext3 I found a
barely noticeable increase in sys time (due to code overhead).

I also tested ext3 with Mingming's getblocks() support in -mm,
which showed a reduction in sys time.

Thanks,
Badari

2006-03-03 17:16:35

by Badari Pulavarty

Subject: Re: [PATCH 2/4] pass b_size to ->get_block()

On Wed, 2006-03-01 at 17:52 -0800, Andrew Morton wrote:
> Badari Pulavarty <[email protected]> wrote:
> >
> > Pass the amount of disk that needs to be mapped to get_block().
> > This way one can modify the fs ->get_block() functions
> > to map multiple blocks at the same time.
>
> I can't say I terribly like this patch. Initialising b_size all over the
> place seems fragile.

I went through all the in-kernel places where we call ->get_block()
and initialized b_size correctly.

>
> We're _already_ setting bh.b_size to the right thing in
> alloc_page_buffers(), and for a bh which is attached to
> pagecache_page->private, there's no reason why b_size would ever change.

Good point.

> So what I think I'll do is to convert those places where you're needlessly
> assigning to b_size into temporary WARN_ON(b_size != blocksize).

Yep.


> The only place where we need to initialise b_size is where we've got a
> non-pagecache bh allocated on the stack.
>
> We need to be sure that no ->get_block() implementations write garbage into
> bh->b_size if something goes wrong. b_size on a pagecache-based
> buffer_head must remain inviolate.

The good news is that filesystems that allocate a single block don't
currently care about b_size anyway.

I guess to be paranoid, we should add a WARN_ON() or BUG_ON() in the
jfs, xfs and ext3 (in -mm) ->get_block() implementations to check for
a valid b_size before using it (not negative, a multiple of blocksize,
not-a-huge-number type checks?).
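Those paranoia checks could look something like the following userspace sketch; the cap, the function name, and the toy types are all hypothetical. (Since b_size is a size_t, "negative" in practice means a wrapped-around huge value, which the size cap catches.)

```c
#include <assert.h>
#include <stddef.h>

#define BLKSIZE 4096
/* Arbitrary illustrative cap on a single mapping request. */
#define MAX_MAP (1UL << 30)

/* Validate the b_size a ->get_block() implementation is handed
 * before trusting it. */
static int bsize_sane(size_t b_size)
{
    if (b_size == 0)
        return 0;          /* must request at least one block */
    if (b_size % BLKSIZE)
        return 0;          /* must be a multiple of blocksize */
    if (b_size > MAX_MAP)
        return 0;          /* reject huge / wrapped values */
    return 1;
}
```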

Thanks,
Badari

2006-03-03 20:33:48

by Andrew Morton

Subject: Re: [PATCH 0/4] map multiple blocks in get_block() and mpage_readpages()

Badari Pulavarty <[email protected]> wrote:
>
> On Wed, 2006-03-01 at 17:52 -0800, Andrew Morton wrote:
> > Badari Pulavarty <[email protected]> wrote:
> > >
> > > I noticed decent improvements (reduced sys time) on JFS, XFS and ext3.
> > > (on simple "dd" read tests).
> > >
> > > (rc3.mm1) (rc3.mm1 + patches)
> > > real 0m18.814s 0m18.482s
> > > user 0m0.000s 0m0.004s
> > > sys 0m3.240s 0m2.912s
> >
> > With which filesystem? XFS and JFS implement a larger-than-b_size
> > ->get_block, but ext3 doesn't. I'd expect ext3 system time to increase a
> > bit, if anything?
>
> > These numbers are on JFS. With current mainline ext3 I found a
> > barely noticeable increase in sys time (due to code overhead).
> >
> > I also tested ext3 with Mingming's getblocks() support in -mm,
> > which showed a reduction in sys time.
> >
>

OK, no surprises there. Things will improve when someone gets around to
doing multi-block get_block for ext3.

2006-03-03 21:52:36

by Badari Pulavarty

Subject: Re: [PATCH 0/4] map multiple blocks in get_block() and mpage_readpages()

On Fri, 2006-03-03 at 12:32 -0800, Andrew Morton wrote:
> Badari Pulavarty <[email protected]> wrote:
> >
> > On Wed, 2006-03-01 at 17:52 -0800, Andrew Morton wrote:
> > > Badari Pulavarty <[email protected]> wrote:
> > > >
> > > > I noticed decent improvements (reduced sys time) on JFS, XFS and ext3.
> > > > (on simple "dd" read tests).
> > > >
> > > > (rc3.mm1) (rc3.mm1 + patches)
> > > > real 0m18.814s 0m18.482s
> > > > user 0m0.000s 0m0.004s
> > > > sys 0m3.240s 0m2.912s
> > >
> > > With which filesystem? XFS and JFS implement a larger-than-b_size
> > > ->get_block, but ext3 doesn't. I'd expect ext3 system time to increase a
> > > bit, if anything?
> >
> > These numbers are on JFS. With current mainline ext3 I found a
> > barely noticeable increase in sys time (due to code overhead).
> >
> > I also tested ext3 with Mingming's getblocks() support in -mm,
> > which showed a reduction in sys time.
> >
> >
>
> OK, no surprises there. Things will improve when someone gets around to
> doing multi-block get_block for ext3.

Well, I already did that when I tested Mingming's multiblock work for
ext3 (an -mm-only patch).

Thanks,
Badari



Attachments:
ext3-getblocks.patch (1.70 kB)

2006-03-06 19:54:47

by Tim Pepper

Subject: Re: [PATCH 1/4] change buffer_head.b_size to size_t

If you're looking for nice wording: I forget whether it is Linux Device
Drivers, Understanding the Linux Kernel or Linux Kernel Development,
but between them they manage to describe the bastard state of
what struct buffer_head used to be and how the arrival of the bio leaves
the bh primarily as a mapping between a memory buffer and a disk block.
With block-related things being abstracted more cleanly as of 2.6 (and
that hopefully being a continuing goal), I'd vote for any change to this
struct's comment focusing on what the struct is meant to do, not on what
abstractions have been (ab)used to accomplish different things over
time.


Tim