2006-02-20 21:20:17

by Badari Pulavarty

[permalink] [raw]
Subject: [PATCH 0/3] map multiple blocks in get_block() and mpage_readpages()

Hi,

The following patches add support for mapping multiple blocks in ->get_block().
This will allow us to handle mapping of multiple disk blocks for
mpage_readpages() and mpage_writepages() etc. Instead of adding a new
argument, I use "b_size" to indicate the amount of disk mapping needed
from get_block(). On success, get_block() also indicates the amount
of disk mapping it actually did.
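
To make the convention concrete, here is a minimal standalone sketch
(simplified stand-in types, not the real kernel structures or any particular
filesystem) of a caller asking for a multi-block mapping and a get_block()-style
routine reporting back how much it mapped:

/* Simplified stand-in types, for illustration only - not the kernel's. */
#include <stdio.h>

typedef unsigned long sector_t;

struct buffer_head {
	sector_t b_blocknr;	/* out: first mapped block on disk */
	size_t   b_size;	/* in:  bytes of mapping wanted
				 * out: bytes actually mapped */
};

/*
 * Hypothetical get_block(): maps up to the requested size, but may
 * return less if the on-disk extent is shorter.  A real filesystem
 * would consult its block-mapping metadata here.
 */
static int example_get_block(sector_t iblock, struct buffer_head *bh,
			     unsigned int blkbits)
{
	size_t extent_bytes = 16UL << blkbits;	/* pretend a 16-block extent */

	bh->b_blocknr = iblock + 1000;		/* pretend disk location */
	if (bh->b_size > extent_bytes)
		bh->b_size = extent_bytes;	/* report what was mapped */
	return 0;
}

int main(void)
{
	unsigned int blkbits = 12;				/* 4K blocks */
	struct buffer_head bh = { .b_size = 32UL << blkbits };	/* want 32 blocks */

	example_get_block(0, &bh, blkbits);
	printf("mapped %zu bytes starting at disk block %lu\n",
	       bh.b_size, bh.b_blocknr);
	return 0;
}

The point is simply that b_size is an in/out parameter: bytes wanted on
entry, bytes actually mapped on return.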

Now that get_block() can handle multiple blocks, there is no need
for ->get_blocks(), which was added for DIO.

[PATCH 1/3] pass b_size to ->get_block()

[PATCH 2/3] map multiple blocks for mpage_readpages()

[PATCH 3/3] remove ->get_blocks() support

I noticed decent improvements (reduced sys time) on JFS, XFS and ext3
(on simple "dd" read tests).

(rc3.mm1) (rc3.mm1 + patches)
real 0m18.814s 0m18.482s
user 0m0.000s 0m0.004s
sys 0m3.240s 0m2.912s

Andrew, Could you include it in -mm tree ?

Comments ?

Thanks,
Badari



2006-02-20 21:22:13

by Badari Pulavarty

[permalink] [raw]
Subject: [PATCH 1/3] pass b_size to ->get_block()

[PATCH 1/3] pass b_size to ->get_block()

Pass the amount of disk mapping needed to get_block().
This way one can modify the fs ->get_block() functions
to map multiple blocks at the same time.

Thanks,
Badari



Attachments:
getblock-take-size.patch (3.26 kB)

2006-02-20 21:22:45

by Badari Pulavarty

[permalink] [raw]
Subject: [PATCH 2/3] map multiple blocks for mpage_readpages()

[PATCH 2/3] map multiple blocks for mpage_readpages()

This patch changes mpage_readpages() and get_block() to
get the disk mapping information for multiple blocks at the
same time.

b_size represents the amount of disk mapping that needs to be
mapped. On a successful get_block(), b_size indicates the
amount of disk mapping that was actually mapped. Only the
filesystems that care to use this information and can provide
multiple disk blocks at a time need do so.

No changes are needed for the filesystems that want to ignore this.
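
As a rough illustration of the reader side (again with simplified stand-ins -
the real mpage_readpages() also handles holes, the confused case and
blocksize < pagesize), a caller can keep consuming one returned mapping across
many pages before asking for another:

/* Illustration only: consume one multi-block mapping across several pages. */
#include <stdio.h>

#define PAGE_SIZE	4096UL
#define BLKBITS		12	/* assume 4K filesystem blocks */

struct map {
	unsigned long	block;	/* first mapped disk block */
	size_t		size;	/* bytes mapped */
};

/* Stand-in for ->get_block(): pretends at most 8 pages are contiguous. */
static struct map fake_get_block(unsigned long iblock, size_t wanted)
{
	struct map m = { .block = iblock + 1000, .size = wanted };

	if (m.size > 8 * PAGE_SIZE)
		m.size = 8 * PAGE_SIZE;
	return m;
}

int main(void)
{
	unsigned long iblock = 0, nr_pages = 20;

	while (nr_pages) {
		/* one mapping call covers as many remaining pages as it can */
		struct map m = fake_get_block(iblock, nr_pages * PAGE_SIZE);

		while (m.size >= PAGE_SIZE) {
			printf("page for file block %lu -> disk block %lu\n",
			       iblock, m.block);
			m.block += PAGE_SIZE >> BLKBITS;
			m.size  -= PAGE_SIZE;
			iblock  += PAGE_SIZE >> BLKBITS;
			nr_pages--;
		}
	}
	return 0;
}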

Thanks,
Badari



Attachments:
getblocks-readpages.patch (7.39 kB)

2006-02-20 21:23:40

by Badari Pulavarty

[permalink] [raw]
Subject: [PATCH 3/3] remove ->get_blocks() support

[PATCH 3/3] remove ->get_blocks() support


Now that get_block() can handle mapping multiple disk blocks,
there is no need to have ->get_blocks(). This patch removes the
fs-specific ->get_blocks() added for DIO and makes its users use
get_block() instead.


Thanks,
Badari



Attachments:
remove-getblocks.patch (16.18 kB)

2006-02-20 22:00:19

by Nathan Scott

[permalink] [raw]
Subject: Re: [PATCH 0/3] map multiple blocks in get_block() and mpage_readpages()

On Mon, Feb 20, 2006 at 01:21:27PM -0800, Badari Pulavarty wrote:
> Hi,

Hi Badari,

> Following patches add support to map multiple blocks in ->get_block().
> This is will allow us to handle mapping of multiple disk blocks for
> mpage_readpages() and mpage_writepages() etc. Instead of adding new
> argument, I use "b_size" to indicate the amount of disk mapping needed
> for get_block(). And also, on success get_block() actually indicates
> the amount of disk mapping it did.

Thanks for doing this work!

> Now that get_block() can handle multiple blocks, there is no need
> for ->get_blocks() which was added for DIO.
>
> [PATCH 1/3] pass b_size to ->get_block()
>
> [PATCH 2/3] map multiple blocks for mpage_readpages()
>
> [PATCH 3/3] remove ->get_blocks() support
>
> I noticed decent improvements (reduced sys time) on JFS, XFS and ext3.
> (on simple "dd" read tests).
>
> (rc3.mm1) (rc3.mm1 + patches)
> real 0m18.814s 0m18.482s
> user 0m0.000s 0m0.004s
> sys 0m3.240s 0m2.912s
>
> Andrew, Could you include it in -mm tree ?
>
> Comments ?

I've been running these patches in my development tree for a while
and have not seen any problems. My one (possibly minor) concern
is that we pass get_block a size in units of bytes, e.g.:

bh->b_size = 1 << inode->i_blkbits;
err = get_block(inode, block, bh, 1);

And b_size is a u32. We have had the situation in the past where
people (I'm looking at you, Jeremy ;) have been issuing multiple-
gigabyte direct reads/writes through XFS. The syscall interface
takes an (s)size_t in bytes, which on 64 bit platforms is a 64 bit
byte count.

I wonder if this change will end up ruining things for the lunatic
fringe issuing these kinds of IOs? Maybe the get_block call could
take a block count rather than a byte count? (I guess that would
equate to dropping get_block_t rather than get_blocks_t... which is
kinda the alternate direction to what you took here). On the other
hand, maybe it'd be simpler to change b_size to be a size_t instead
of u32? Although, since we are now mapping multiple blocks at once,
"get_blocks_t" does seem an appropriate name. *shrug*, whatever ...
the main thing that'd be good to see addressed is the 32 bit size.

cheers.

--
Nathan

2006-02-20 23:04:59

by Badari Pulavarty

[permalink] [raw]
Subject: Re: [PATCH 0/3] map multiple blocks in get_block() and mpage_readpages()

On Tue, 2006-02-21 at 08:59 +1100, Nathan Scott wrote:
> On Mon, Feb 20, 2006 at 01:21:27PM -0800, Badari Pulavarty wrote:
> > Hi,
>
> Hi Badari,
>
> > Following patches add support to map multiple blocks in ->get_block().
> > This is will allow us to handle mapping of multiple disk blocks for
> > mpage_readpages() and mpage_writepages() etc. Instead of adding new
> > argument, I use "b_size" to indicate the amount of disk mapping needed
> > for get_block(). And also, on success get_block() actually indicates
> > the amount of disk mapping it did.
>
> Thanks for doing this work!
>
> > Now that get_block() can handle multiple blocks, there is no need
> > for ->get_blocks() which was added for DIO.
> >
> > [PATCH 1/3] pass b_size to ->get_block()
> >
> > [PATCH 2/3] map multiple blocks for mpage_readpages()
> >
> > [PATCH 3/3] remove ->get_blocks() support
> >
> > I noticed decent improvements (reduced sys time) on JFS, XFS and ext3.
> > (on simple "dd" read tests).
> >
> > (rc3.mm1) (rc3.mm1 + patches)
> > real 0m18.814s 0m18.482s
> > user 0m0.000s 0m0.004s
> > sys 0m3.240s 0m2.912s
> >
> > Andrew, Could you include it in -mm tree ?
> >
> > Comments ?
>
> I've been running these patches in my development tree for awhile
> and have not seen any problems. My one (possibly minor) concern
> is that we pass get_block a size in units of bytes, e.g....
>
> bh->b_size = 1 << inode->i_blkbits;
> err = get_block(inode, block, bh, 1);
>
> And b_size is a u32. We have had the situation in the past where
> people (I'm looking at you, Jeremy ;) have been issuing multiple-
> gigabyte direct reads/writes through XFS. The syscall interface
> takes an (s)size_t in bytes, which on 64 bit platforms is a 64 bit
> byte count.

> I wonder if this change will end up ruining things for the lunatic
> fringe issuing these kinds of IOs? Maybe the get_block call could
> take a block count rather than a byte count?

Yes, I thought about it too. I wanted to pass a "block count" instead
of a "byte count". Right now it does:

bh->b_size = 1 << inode->i_blkbits;
call get_block();

The first thing get_block() does is:
blocks = bh->b_size >> inode->i_blkbits;

All that unnecessary shifting around for nothing :(

But I ended up using a "byte count" just to avoid the confusion of
asking in "blocks" and getting back "bytes".

I have no problem making b_size a "size_t" to handle 64-bit.
But again, if we are fiddling with buffer_head - maybe it's time
to look at an alternative to "buffer_head" with exactly the
information we need for get_block()?
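
Purely for the sake of discussion, such a descriptor might carry little more
than the request and the result - a hypothetical sketch, not proposed kernel
code:

/* Hypothetical minimal mapping descriptor - for discussion only. */
#include <stddef.h>

typedef unsigned long long sector_t;	/* assume a 64-bit (CONFIG_LBD-style) sector_t */

struct blockmap_request {
	sector_t	logical;	/* in:  first file block to map */
	size_t		len;		/* in:  bytes wanted, out: bytes mapped */
	sector_t	phys;		/* out: first on-disk block */
	unsigned int	flags;		/* out: mapped/new/unwritten/hole, etc. */
};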

Thanks,
Badari

2006-02-20 23:17:06

by Nathan Scott

[permalink] [raw]
Subject: Re: [PATCH 0/3] map multiple blocks in get_block() and mpage_readpages()

On Mon, Feb 20, 2006 at 03:06:11PM -0800, Badari Pulavarty wrote:
> On Tue, 2006-02-21 at 08:59 +1100, Nathan Scott wrote:
> ...
> > I wonder if this change will end up ruining things for the lunatic
> > fringe issuing these kinds of IOs? Maybe the get_block call could
> > take a block count rather than a byte count?
>
> Yes. I thought about it too.. I wanted to pass "block count" instead
> of "byte count". Right now it does ..
>
> bh->b_size = 1 << inode->i_blkbits;
> call get_block();
>
> First thing get_block() does is
> blocks = bh->b_size >> inode->i_blkbits;
>
> All, the unnecessary shifting around for nothing :(

Yeah, pretty silly really, but there's not much choice if the
goal is to keep this simple. Oh well.

> But, I ended up doing "byte count" just to avoid confusion of
> asking in "blocks" getting back in "bytes".

Understood.

> I have no problem making b_size as "size_t" to handle 64-bit.
> But again, if we are fiddling with buffer_head - may be its time
> to look at alternative to "buffer_head" with the information exactly
> we need for getblock() ?

That is a much bigger change - I'm not in a position to make the
call on whether that's in everyone's best interests. However, I do
want to make sure we don't regress anything, so I guess the u32
to size_t switch probably should be made to resolve this issue.

Thanks again for following up on this.

cheers.

--
Nathan

2006-02-21 02:41:25

by Jeremy Higdon

[permalink] [raw]
Subject: Re: [PATCH 0/3] map multiple blocks in get_block() and mpage_readpages()

On Tue, Feb 21, 2006 at 08:59:53AM +1100, Nathan Scott wrote:
> On Mon, Feb 20, 2006 at 01:21:27PM -0800, Badari Pulavarty wrote:
>
> I've been running these patches in my development tree for awhile
> and have not seen any problems. My one (possibly minor) concern
> is that we pass get_block a size in units of bytes, e.g....
>
> bh->b_size = 1 << inode->i_blkbits;
> err = get_block(inode, block, bh, 1);
>
> And b_size is a u32. We have had the situation in the past where
> people (I'm looking at you, Jeremy ;) have been issuing multiple-
> gigabyte direct reads/writes through XFS. The syscall interface
> takes an (s)size_t in bytes, which on 64 bit platforms is a 64 bit
> byte count.
>
> I wonder if this change will end up ruining things for the lunatic
> fringe issuing these kinds of IOs? Maybe the get_block call could

Hey! Lunatic fringe indeed. Harumph! :-)

Yes, there are a few people out there who will need to issue really
large I/O reads or writes to get maximum I/O bandwidth on large
stripes. The largest I've done so far is 4GiB, but I expect that
number will likely increase this year, or next year if not.

jeremy

2006-02-21 16:02:42

by Badari Pulavarty

[permalink] [raw]
Subject: Re: [PATCH 0/3] map multiple blocks in get_block() and mpage_readpages()

On Mon, 2006-02-20 at 18:41 -0800, Jeremy Higdon wrote:
> On Tue, Feb 21, 2006 at 08:59:53AM +1100, Nathan Scott wrote:
> > On Mon, Feb 20, 2006 at 01:21:27PM -0800, Badari Pulavarty wrote:
> >
> > I've been running these patches in my development tree for awhile
> > and have not seen any problems. My one (possibly minor) concern
> > is that we pass get_block a size in units of bytes, e.g....
> >
> > bh->b_size = 1 << inode->i_blkbits;
> > err = get_block(inode, block, bh, 1);
> >
> > And b_size is a u32. We have had the situation in the past where
> > people (I'm looking at you, Jeremy ;) have been issuing multiple-
> > gigabyte direct reads/writes through XFS. The syscall interface
> > takes an (s)size_t in bytes, which on 64 bit platforms is a 64 bit
> > byte count.
> >
> > I wonder if this change will end up ruining things for the lunatic
> > fringe issuing these kinds of IOs? Maybe the get_block call could
>
> Hey! Lunatic fringe indeed. Harumph! :-)
>
> Yes, there are a few people out there who will need to issue really
> large I/O reads or writes to get maximum I/O bandwidth on large
> stripes. The largest I've done so far is 4GiB, but I expect that
> number will likely increase this year, and more likely next year,
> if not.

Thinking more about it, currently (without my patches) only the DIO
code can request large chunks of mapping through get_blocks().
Since b_size is only a u32, you can only map 4GB max, isn't it?
With my patches, mpage_readpages() can now also request large
chunks (through readahead). So my patches are not adding any
extra limitation; they carry the same existing limitation.

In order to handle larger chunks of disk mapping, changing b_size
to u64 is required, and we should pursue it irrespective of my
patches.

Thanks,
Badari

2006-02-21 21:40:24

by Nathan Scott

[permalink] [raw]
Subject: Re: [PATCH 0/3] map multiple blocks in get_block() and mpage_readpages()

On Tue, Feb 21, 2006 at 08:03:54AM -0800, Badari Pulavarty wrote:
> ...
> Thinking more about it, currently (without my patches) only DIO
> code can request for large chunks of mapping through get_blocks().
> Since b_size is only u32, you can only map 4GB max. Isn't it ?
> With my patches, now mpage_readpages() also can request large
> chunks. (through readahead). So, my patches are not adding any
> extra limitation. Its carrying the same existing limitation.

The DIO get_blocks is super-freaky at the moment, in that it passes
in an unsigned long max_blocks (units of blocks) but it expects the
filesystem to fill in a u32 b_size (in bytes). This does at least
give the filesystem the hint that this was a monster write and it
gets the chance to allocate contiguously, but that was probably
unintentional, who knows? I guess that subtlety would be dropped
with this change if b_size isn't made a 64 bit entity.
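
The truncation risk is easy to demonstrate in isolation; a standalone sketch
(not the actual DIO code) of what storing a >4GiB byte count in a u32 does:

#include <stdint.h>
#include <stdio.h>

int main(void)
{
	unsigned long max_blocks = (5UL << 30) >> 9;	/* 5 GiB in 512-byte blocks */
	uint32_t b_size_32 = max_blocks << 9;		/* truncates above 4 GiB */
	size_t   b_size_sz = (size_t)max_blocks << 9;	/* intact on a 64-bit box */

	printf("u32:    %u bytes\n", (unsigned)b_size_32);	/* prints 1 GiB - wrong */
	printf("size_t: %zu bytes\n", b_size_sz);		/* prints 5 GiB on LP64 */
	return 0;
}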

> In order to handle larger chunks of disk mapping, changing b_size
> to u64 is required and we should request for it irrespective of my
> patches.

*shrug*... it's all very closely inter-related, I'd keep it together.
I'd also thought that size_t was the right type here, not u64? 32
bit platforms aren't capable of submitting IOs this large, so they
needn't be doing any 64 bit manipulation here, no?

cheers.

--
Nathan

2006-02-22 15:12:35

by Christoph Hellwig

[permalink] [raw]
Subject: Re: [PATCH 0/3] map multiple blocks in get_block() and mpage_readpages()

On Mon, Feb 20, 2006 at 01:21:27PM -0800, Badari Pulavarty wrote:
> Hi,
>
> Following patches add support to map multiple blocks in ->get_block().
> This is will allow us to handle mapping of multiple disk blocks for
> mpage_readpages() and mpage_writepages() etc. Instead of adding new
> argument, I use "b_size" to indicate the amount of disk mapping needed
> for get_block(). And also, on success get_block() actually indicates
> the amount of disk mapping it did.
>
> Now that get_block() can handle multiple blocks, there is no need
> for ->get_blocks() which was added for DIO.
>
> [PATCH 1/3] pass b_size to ->get_block()
>
> [PATCH 2/3] map multiple blocks for mpage_readpages()
>
> [PATCH 3/3] remove ->get_blocks() support
>
> I noticed decent improvements (reduced sys time) on JFS, XFS and ext3.
> (on simple "dd" read tests).
>
> (rc3.mm1) (rc3.mm1 + patches)
> real 0m18.814s 0m18.482s
> user 0m0.000s 0m0.004s
> sys 0m3.240s 0m2.912s
>
> Andrew, Could you include it in -mm tree ?

Thanks Badari, with that interface change the mpage_readpage changes
look a lot nicer than my original version. I'd like to second
the request to put it into -mm.

And it'd be nice if the namesys folks could try out whether this
works for their reiser4 requirements. If you have an even faster
->readpages I'd be interested in that secret sauce recipe for
further improvement to mpage_readpages.

2006-02-22 16:57:17

by Badari Pulavarty

[permalink] [raw]
Subject: Re: [PATCH 0/3] map multiple blocks in get_block() and mpage_readpages()

On Wed, 2006-02-22 at 16:12 +0100, christoph wrote:
> On Mon, Feb 20, 2006 at 01:21:27PM -0800, Badari Pulavarty wrote:
> > Hi,
> >
> > Following patches add support to map multiple blocks in ->get_block().
> > This is will allow us to handle mapping of multiple disk blocks for
> > mpage_readpages() and mpage_writepages() etc. Instead of adding new
> > argument, I use "b_size" to indicate the amount of disk mapping needed
> > for get_block(). And also, on success get_block() actually indicates
> > the amount of disk mapping it did.
> >
> > Now that get_block() can handle multiple blocks, there is no need
> > for ->get_blocks() which was added for DIO.
> >
> > [PATCH 1/3] pass b_size to ->get_block()
> >
> > [PATCH 2/3] map multiple blocks for mpage_readpages()
> >
> > [PATCH 3/3] remove ->get_blocks() support
> >
> > I noticed decent improvements (reduced sys time) on JFS, XFS and ext3.
> > (on simple "dd" read tests).
> >
> > (rc3.mm1) (rc3.mm1 + patches)
> > real 0m18.814s 0m18.482s
> > user 0m0.000s 0m0.004s
> > sys 0m3.240s 0m2.912s
> >
> > Andrew, Could you include it in -mm tree ?
>
> Thanks Badari, with that interface changes the mpage_readpage changes
> look a lot nicer than my original version. I'd like to second
> the request to put it into -mm.

Thanks. The only current issue Nathan raised is that he wants to see
b_size changed to u64 (instead of u32) to support really-huge-IO
requests. Does this sound reasonable to you?

Thanks,
Badari

2006-02-22 16:59:53

by Christoph Hellwig

[permalink] [raw]
Subject: Re: [PATCH 0/3] map multiple blocks in get_block() and mpage_readpages()

On Wed, Feb 22, 2006 at 08:58:30AM -0800, Badari Pulavarty wrote:
> Thanks. Only current issue Nathan raised is, he wants to see
> b_size change to u64 (instead of u32) to support really-huge-IO
> requests. Does this sound reasonable to you ?

I know that we didn't want to increase b_size at some point because
of concerns about the number of objects fitting into a page in the
slab allocator. If the same number of bigger heads fit into the
same page I see no problems with the increase.
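
As a rough back-of-the-envelope check (ignoring slab management overhead and
cache alignment): 4096 / 88 ≈ 46 objects per page today versus 4096 / 96 ≈ 42
with an 8-byte-larger head on 64-bit, and no change on 32-bit where size_t
stays 4 bytes.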

2006-02-22 17:23:18

by Peter Staubach

[permalink] [raw]
Subject: Re: [PATCH 0/3] map multiple blocks in get_block() and mpage_readpages()

Badari Pulavarty wrote:

>
>Thanks. Only current issue Nathan raised is, he wants to see
>b_size change to u64 (instead of u32) to support really-huge-IO
>requests. Does this sound reasonable to you ?
>

I believe that the various read and write system calls impose a 32 bit limit
on the size of the i/o request, so you may need to look at that too.

Increasing the size of b_size would address a bug which exists when trying
to read more than 4GB from a block device which is greater than 4GB in size.
Currently, that bug is masked because the maximum system call size is
arbitrarily limited.

Thanx...

ps

2006-02-22 18:37:21

by Dave Kleikamp

[permalink] [raw]
Subject: Re: [PATCH 0/3] map multiple blocks in get_block() and mpage_readpages()

On Wed, 2006-02-22 at 08:58 -0800, Badari Pulavarty wrote:
> Thanks. Only current issue Nathan raised is, he wants to see
> b_size change to u64 (instead of u32) to support really-huge-IO
> requests. Does this sound reasonable to you ?

Didn't someone point out that size_t would make more sense? There's no
reason for a 32-bit architecture to have a 64-bit b_size.

> Thanks,
> Badari

Shaggy
--
David Kleikamp
IBM Linux Technology Center

2006-02-22 19:00:43

by Badari Pulavarty

[permalink] [raw]
Subject: Re: [PATCH 0/3] map multiple blocks in get_block() and mpage_readpages()

Dave Kleikamp wrote:

> On Wed, 2006-02-22 at 08:58 -0800, Badari Pulavarty wrote:
>
>>Thanks. Only current issue Nathan raised is, he wants to see
>>b_size change to u64 (instead of u32) to support really-huge-IO
>>requests. Does this sound reasonable to you ?
>
>
> Didn't someone point out that size_t would make more sense? There's no
> reason for a 32-bit architecture to have a 64-bit b_size.

Yes. I meant to say size_t.

- Badari

2006-02-23 01:43:33

by Nathan Scott

[permalink] [raw]
Subject: Re: [PATCH 0/3] map multiple blocks in get_block() and mpage_readpages()

On Wed, Feb 22, 2006 at 05:59:42PM +0100, christoph wrote:
> On Wed, Feb 22, 2006 at 08:58:30AM -0800, Badari Pulavarty wrote:
> > Thanks. Only current issue Nathan raised is, he wants to see
> > b_size change to u64 [size_t] (instead of u32) to support
> > really-huge-IO requests.

And also to not go backwards on what the current DIO mapping
interface provides us.

> Does this sound reasonable to you ?
>
> I know that we didn't want to increase b_size at some point because
> of concerns about the number of objects fitting into a page in the

There's four extra bytes on an 88 byte structure (with sector_t
CONFIG'd at 64 bits) - so, yes, there'll be a small decrease in
the number that fit in a page on 64 bit platforms. Perhaps it's
not worth it, but it would be good to sort out these pesky size
mismatches cos they introduce very subtle bugs.

> slab allocator. If the same number of bigger heads fit into the
> same page I see no problems with the increase.

Heh, bigger heads. Well, the same number fit in the page for 32
bit platforms, does that count?

Taking a quick look at the struct definition, there are some oddities
that have crept in since the 2.4 version - looks like sct had arranged
the fields in nice 32-bit-platform-cache-aligned groups, with several
cache-alignment related comments, some of which now remain and make
very little sense on their own (& with the now-incorrect grouping).

Anyway, here's a patch Badari - I reckon it's worth it, but then I
am a bit biased (as it's likely only XFS is going to be seeing this
size of I/O for some time, and as someone who has hunted down some
of these size mismatch bugs...)

cheers.

--
Nathan


Increase the size of the buffer_head b_size field for 64 bit
platforms. Update some old and moldy comments in and around
the structure as well.

The b_size increase allows us to perform larger mappings and
allocations for large I/O requests from userspace, which tie
in with other changes allowing the get_block_t() interface to
map multiple blocks at once.

Signed-off-by: Nathan Scott <[email protected]>

--- a/include/linux/buffer_head.h
+++ b/include/linux/buffer_head.h
@@ -46,20 +46,25 @@ struct address_space;
typedef void (bh_end_io_t)(struct buffer_head *bh, int uptodate);

/*
- * Keep related fields in common cachelines. The most commonly accessed
- * field (b_state) goes at the start so the compiler does not generate
- * indexed addressing for it.
+ * Historically, a buffer_head was used to map a single block
+ * within a page, and of course as the unit of I/O through the
+ * filesystem and block layers. Nowadays the basic I/O unit
+ * is the bio, and buffer_heads are used for extracting block
+ * mappings (via a get_block_t call), for tracking state within
+ * a page (via a page_mapping) and for wrapping bio submission
+ * for backward compatibility reasons (e.g. submit_bh). There
+ * may be one or two other uses too (I used it for drying the
+ * dishes the other night when I couldn't find a tea towel...).
*/
struct buffer_head {
- /* First cache line: */
unsigned long b_state; /* buffer state bitmap (see above) */
struct buffer_head *b_this_page;/* circular list of page's buffers */
struct page *b_page; /* the page this bh is mapped to */
- atomic_t b_count; /* users using this block */
- u32 b_size; /* block size */
+ atomic_t b_count; /* users using this buffer_head */

- sector_t b_blocknr; /* block number */
- char *b_data; /* pointer to data block */
+ size_t b_size; /* size of mapping */
+ sector_t b_blocknr; /* start block number */
+ char *b_data; /* pointer to data within the page */

struct block_device *b_bdev;
bh_end_io_t *b_end_io; /* I/O completion */

2006-02-23 02:05:02

by Andrew Morton

[permalink] [raw]
Subject: Re: [PATCH 0/3] map multiple blocks in get_block() and mpage_readpages()

Nathan Scott <[email protected]> wrote:
>
> There's four extra bytes on an 88 byte structure (with sector_t
> CONFIG'd at 64 bits)

oy, buffer_heads are 48 bytes - took a few months out of my life, that did.
That's where it most counts - memory-constrained non-LBD 32-bit platforms.

Making b_size size_t doesn't affect that, so..

2006-02-23 03:28:58

by Suparna Bhattacharya

[permalink] [raw]
Subject: Re: [PATCH 2/3] map multiple blocks for mpage_readpages()

On Mon, Feb 20, 2006 at 01:23:55PM -0800, Badari Pulavarty wrote:
> [PATCH 2/3] map multiple blocks for mpage_readpages()
>
> This patch changes mpage_readpages() and get_block() to
> get the disk mapping information for multiple blocks at the
> same time.
>
> b_size represents the amount of disk mapping that needs to
> mapped. On the successful get_block() b_size indicates the
> amount of disk mapping thats actually mapped. Only the
> filesystems who care to use this information and provide multiple
> disk blocks at a time can choose to do so.
>
> No changes are needed for the filesystems who wants to ignore this.

That's a nice way to handle this !

Just one minor suggestion on the patches - instead of adding all those
arguments to do_mpage_readpage, have you considered using
an mio structure as we did in the mpage_writepages multiple blocks case ?

Regards
Suparna

>
> Thanks,
> Badari
>
>



--
Suparna Bhattacharya ([email protected])
Linux Technology Center
IBM Software Lab, India

2006-02-23 16:27:10

by Badari Pulavarty

[permalink] [raw]
Subject: [PATCH] change b_size to size_t

On Wed, 2006-02-22 at 17:59 -0800, Andrew Morton wrote:
> Nathan Scott <[email protected]> wrote:
> >
> > There's four extra bytes on an 88 byte structure (with sector_t
> > CONFIG'd at 64 bits)
>
> oy, buffer_heads are 48 bytes - took a few months out of my life, that did.
> That's where it most counts - memory-constrained non-LBD 32-bit platforms.
>
> Making b_size size_t doesn't affect that, so..

Here is the updated version of the patch, which changes
buffer_head.b_size to size_t to support mapping a large
number of disk blocks (for large IOs).

Thanks,
Badari



Attachments:
increase-bsize.patch (4.43 kB)

2006-02-23 16:37:14

by Benjamin LaHaise

[permalink] [raw]
Subject: Re: [PATCH] change b_size to size_t

On Thu, Feb 23, 2006 at 08:28:12AM -0800, Badari Pulavarty wrote:
> Here is the updated version of the patch, which changes
> buffer_head.b_size to size_t to support mapping large
> amount of disk blocks (for large IOs).

Your patch doesn't seem to be inline, so I can't quote it. Several
problems: on 64 bit platforms you introduced 4 bytes of padding into
buffer_head. atomic_t only takes up 4 bytes, while size_t is 8-byte
aligned. This is a waste of memory, imo, seeing as the vast majority
of systems out there will not be doing 4GB+ ios any time soon.

Also, the cast to unsigned long long for size_t is pretty atrocious.
Cast to unsigned long if anything, as size_t is unsigned long on all
platforms linux runs on.

-ben
--
"Ladies and gentlemen, I'm sorry to interrupt, but the police are here
and they've asked us to stop the party." Don't Email: <[email protected]>.

2006-02-23 16:40:12

by Dave Kleikamp

[permalink] [raw]
Subject: Re: [PATCH] change b_size to size_t

On Thu, 2006-02-23 at 08:28 -0800, Badari Pulavarty wrote:
> Index: linux-2.6.16-rc4/fs/buffer.c
> ===================================================================
> --- linux-2.6.16-rc4.orig/fs/buffer.c 2006-02-17 14:23:45.000000000 -0800
> +++ linux-2.6.16-rc4/fs/buffer.c 2006-02-23 08:19:15.000000000 -0800
> @@ -432,7 +432,7 @@ __find_get_block_slow(struct block_devic
> printk("__find_get_block_slow() failed. "
> "block=%llu, b_blocknr=%llu\n",
> (unsigned long long)block, (unsigned long long)bh->b_blocknr);
> - printk("b_state=0x%08lx, b_size=%u\n", bh->b_state, bh->b_size);
> + printk("b_state=0x%08lx, b_size=%llu\n", bh->b_state, (unsigned long long)bh->b_size);

Wrapping at 80 columns, it looks right, but you've got a long line in
here.

> printk("device blocksize: %d\n", 1 << bd_inode->i_blkbits);
> }
>
--
David Kleikamp
IBM Linux Technology Center

2006-02-23 17:19:06

by Badari Pulavarty

[permalink] [raw]
Subject: Re: [PATCH] change b_size to size_t

On Thu, 2006-02-23 at 11:32 -0500, Benjamin LaHaise wrote:
> On Thu, Feb 23, 2006 at 08:28:12AM -0800, Badari Pulavarty wrote:
> > Here is the updated version of the patch, which changes
> > buffer_head.b_size to size_t to support mapping large
> > amount of disk blocks (for large IOs).
>
> Your patch doesn't seem to be inline, so I can't quote it. Several
> problems: on 64 bit platforms you introduced 4 bytes of padding into
> buffer_head. atomic_t only takes up 4 byte, while size_t is 8 byte
> aligned.

I moved stuff around, but I still see 96 bytes (an 8-byte increase)
for the structure. What am I doing wrong? Can you check?

> This is a waste of memory, imo, seeing as the vast majority
> of systems out there will not be doing 4GB+ ios any time soon.

Yep, I agree. I was modifying get_block() to take b_size as the amount
of disk mapping requested, so I can get rid of ->get_blocks() and also
add support for mpage_readpages() and mpage_writepages() to deal with
multiple blocks. XFS seems to want >4GB IOs, but is doing 4GB
IOs through DIO today and expects to see bigger IOs in the future.

Since "buffer_head" is not a primary structure used for IO anymore and
this change affects only 64-bit machines, is it worth doing?

> Also, the cast to unsigned long long for size_t is pretty atrocious.
> Cast to unsigned long if anything, as size_t is unsigned long on all
> platforms linux runs on.

Okay. Changed it to (unsigned long) instead. I was just following
sector_t :(

Thanks,
Badari

Increase the size of the buffer_head b_size field for 64 bit
platforms. Update some old and moldy comments in and around
the structure as well.

The b_size increase allows us to perform larger mappings and
allocations for large I/O requests from userspace, which tie
in with other changes allowing the get_block_t() interface to
map multiple blocks at once.

Signed-off-by: Nathan Scott <[email protected]>
Signed-off-by: Badari Pulavarty <[email protected]>

Index: linux-2.6.16-rc4/include/linux/buffer_head.h
===================================================================
--- linux-2.6.16-rc4.orig/include/linux/buffer_head.h 2006-02-17 14:23:45.000000000 -0800
+++ linux-2.6.16-rc4/include/linux/buffer_head.h 2006-02-23 09:00:06.000000000 -0800
@@ -46,20 +46,23 @@ struct address_space;
typedef void (bh_end_io_t)(struct buffer_head *bh, int uptodate);

/*
- * Keep related fields in common cachelines. The most commonly accessed
- * field (b_state) goes at the start so the compiler does not generate
- * indexed addressing for it.
+ * Historically, a buffer_head was used to map a single block
+ * within a page, and of course as the unit of I/O through the
+ * filesystem and block layers. Nowadays the basic I/O unit
+ * is the bio, and buffer_heads are used for extracting block
+ * mappings (via a get_block_t call), for tracking state within
+ * a page (via a page_mapping) and for wrapping bio submission
+ * for backward compatibility reasons (e.g. submit_bh).
*/
struct buffer_head {
- /* First cache line: */
unsigned long b_state; /* buffer state bitmap (see above) */
struct buffer_head *b_this_page;/* circular list of page's buffers */
struct page *b_page; /* the page this bh is mapped to */
- atomic_t b_count; /* users using this block */
- u32 b_size; /* block size */
+ char *b_data; /* pointer to data within the page */

- sector_t b_blocknr; /* block number */
- char *b_data; /* pointer to data block */
+ size_t b_size; /* size of mapping */
+ sector_t b_blocknr; /* start block number */
+ atomic_t b_count; /* users using this buffer_head */

struct block_device *b_bdev;
bh_end_io_t *b_end_io; /* I/O completion */
Index: linux-2.6.16-rc4/fs/buffer.c
===================================================================
--- linux-2.6.16-rc4.orig/fs/buffer.c 2006-02-17 14:23:45.000000000 -0800
+++ linux-2.6.16-rc4/fs/buffer.c 2006-02-23 08:55:18.000000000 -0800
@@ -432,7 +432,8 @@ __find_get_block_slow(struct block_devic
printk("__find_get_block_slow() failed. "
"block=%llu, b_blocknr=%llu\n",
(unsigned long long)block, (unsigned long long)bh->b_blocknr);
- printk("b_state=0x%08lx, b_size=%u\n", bh->b_state, bh->b_size);
+ printk("b_state=0x%08lx, b_size=%lu\n", bh->b_state,
+ (unsigned long)bh->b_size);
printk("device blocksize: %d\n", 1 << bd_inode->i_blkbits);
}
out_unlock:
Index: linux-2.6.16-rc4/fs/reiserfs/prints.c
===================================================================
--- linux-2.6.16-rc4.orig/fs/reiserfs/prints.c 2006-02-17 14:23:45.000000000 -0800
+++ linux-2.6.16-rc4/fs/reiserfs/prints.c 2006-02-23 08:56:17.000000000 -0800
@@ -143,8 +143,8 @@ static void sprintf_buffer_head(char *bu
char b[BDEVNAME_SIZE];

sprintf(buf,
- "dev %s, size %d, blocknr %llu, count %d, state 0x%lx, page %p, (%s,
%s, %s)",
- bdevname(bh->b_bdev, b), bh->b_size,
+ "dev %s, size %ld, blocknr %llu, count %d, state 0x%lx, page %p, (%s,
%s, %s)",
+ bdevname(bh->b_bdev, b), (unsigned long)bh->b_size,
(unsigned long long)bh->b_blocknr, atomic_read(&(bh->b_count)),
bh->b_state, bh->b_page,
buffer_uptodate(bh) ? "UPTODATE" : "!UPTODATE",
Index: linux-2.6.16-rc4/fs/ocfs2/journal.c
===================================================================
--- linux-2.6.16-rc4.orig/fs/ocfs2/journal.c 2006-02-17 14:23:45.000000000 -0800
+++ linux-2.6.16-rc4/fs/ocfs2/journal.c 2006-02-23 08:56:55.000000000 -0800
@@ -377,12 +377,12 @@ int ocfs2_journal_access(struct ocfs2_jo
BUG_ON(!bh);
BUG_ON(!(handle->flags & OCFS2_HANDLE_STARTED));

- mlog_entry("bh->b_blocknr=%llu, type=%d (\"%s\"), bh->b_size = %hu\n",
+ mlog_entry("bh->b_blocknr=%llu, type=%d (\"%s\"), bh->b_size = %lu\n",
(unsigned long long)bh->b_blocknr, type,
(type == OCFS2_JOURNAL_ACCESS_CREATE) ?
"OCFS2_JOURNAL_ACCESS_CREATE" :
"OCFS2_JOURNAL_ACCESS_WRITE",
- bh->b_size);
+ (unsigned long)bh->b_size);

/* we can safely remove this assertion after testing. */
if (!buffer_uptodate(bh)) {


Attachments:
increase-bsize.patch (4.41 kB)

2006-02-23 17:27:44

by Badari Pulavarty

[permalink] [raw]
Subject: Re: [PATCH] change b_size to size_t

On Thu, 2006-02-23 at 11:32 -0500, Benjamin LaHaise wrote:
> On Thu, Feb 23, 2006 at 08:28:12AM -0800, Badari Pulavarty wrote:
> > Here is the updated version of the patch, which changes
> > buffer_head.b_size to size_t to support mapping large
> > amount of disk blocks (for large IOs).
>
> Your patch doesn't seem to be inline, so I can't quote it. Several
> problems: on 64 bit platforms you introduced 4 bytes of padding into
> buffer_head. atomic_t only takes up 4 byte, while size_t is 8 byte
> aligned.


Ignore my previous mail.

How about doing this? Change b_state to u32 and change b_size
to "size_t". This way, we don't increase the overall size of
the structure on 64-bit machines, right?

Thanks,
Badari


struct buffer_head {
	u32 b_state;			/* buffer state bitmap (see above) */
	atomic_t b_count;		/* users using this buffer_head */
	struct buffer_head *b_this_page;/* circular list of page's buffers */
	struct page *b_page;		/* the page this bh is mapped to */
	size_t b_size;			/* size of mapping */

	sector_t b_blocknr;		/* start block number */
	char *b_data;			/* pointer to data within the page */

	struct block_device *b_bdev;
	bh_end_io_t *b_end_io;		/* I/O completion */
	void *b_private;		/* reserved for b_end_io */
	struct list_head b_assoc_buffers; /* associated with another mapping */
};


Thanks,
Badari

2006-02-23 17:34:27

by Benjamin LaHaise

[permalink] [raw]
Subject: Re: [PATCH] change b_size to size_t

On Thu, Feb 23, 2006 at 09:28:58AM -0800, Badari Pulavarty wrote:
> How about doing this ? Change b_state to u32 and change b_size
> to "size_t". This way, we don't increase the overall size of
> the structure on 64-bit machines. Isn't it ?

I was thinking of that too, but that doesn't work with the bit operations
on big endian machines (which require unsigned long). Oh well, I guess
that even with adding an extra 8 bytes on x86-64 we're still at the 96
bytes, or 92 bytes if the atomic_t is put at the end of the struct.

-ben
--
"Ladies and gentlemen, I'm sorry to interrupt, but the police are here
and they've asked us to stop the party." Don't Email: <[email protected]>.

2006-02-23 17:39:42

by Badari Pulavarty

[permalink] [raw]
Subject: Re: [PATCH] change b_size to size_t

On Thu, 2006-02-23 at 09:28 -0800, Badari Pulavarty wrote:
> On Thu, 2006-02-23 at 11:32 -0500, Benjamin LaHaise wrote:
> > On Thu, Feb 23, 2006 at 08:28:12AM -0800, Badari Pulavarty wrote:
> > > Here is the updated version of the patch, which changes
> > > buffer_head.b_size to size_t to support mapping large
> > > amount of disk blocks (for large IOs).
> >
> > Your patch doesn't seem to be inline, so I can't quote it. Several
> > problems: on 64 bit platforms you introduced 4 bytes of padding into
> > buffer_head. atomic_t only takes up 4 byte, while size_t is 8 byte
> > aligned.
>
>
> Ignore my previous mail.
>
> How about doing this ? Change b_state to u32 and change b_size
> to "size_t". This way, we don't increase the overall size of
> the structure on 64-bit machines. Isn't it ?

I hate to correct myself again, but this won't work either.
If we do this, we can't use the bit_spin_lock() helpers anymore
to do b_state manipulation :(

Yes, the bottom line is that we would increase the size of the
structure by 8 bytes on 64-bit machines. I don't see any way out
of it. But this would provide the ability to let the filesystems
know that we are dealing with large (> 4GB) IOs (maybe they can
allocate as much as possible contiguously), even if we don't
really do IOs that big yet.

Thanks,
Badari

2006-02-23 18:45:17

by Badari Pulavarty

[permalink] [raw]
Subject: Re: [PATCH] change b_size to size_t

On Thu, 2006-02-23 at 12:29 -0500, Benjamin LaHaise wrote:
> On Thu, Feb 23, 2006 at 09:28:58AM -0800, Badari Pulavarty wrote:
> > How about doing this ? Change b_state to u32 and change b_size
> > to "size_t". This way, we don't increase the overall size of
> > the structure on 64-bit machines. Isn't it ?
>
> I was thinking of that too, but that doesn't work with the bit operations
> on big endian machines (which require unsigned long). Oh well, I guess
> that even with adding an extra 8 bytes on x86-64 we're still at the 96
> bytes, or 92 bytes if the atomic_t is put at the end of the struct.
>
> -ben

Okay, here is the final version (hopefully). I moved atomic_t to the
end of the structure. I still see 96 bytes on x86-64 as well :(


buffer_head 40425 40760 96 40 1 : tunables 120 60 8 : slabdata 1019 1019 0
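
The reason shuffling the atomic_t around doesn't help can be seen with a quick
standalone check: on an LP64 ABI the structure gets padded out to a multiple of
its largest member alignment (8 bytes here), so a lone 4-byte member costs 8
bytes wherever it sits. The layout below is a simplified stand-in for
buffer_head, not the real definition:

#include <stdio.h>
#include <stddef.h>

typedef struct { int counter; } atomic_t;	/* 4-byte stand-in */

struct bh_like {
	unsigned long b_state;
	void *b_this_page;
	void *b_page;
	unsigned long long b_blocknr;		/* 64-bit sector_t */
	size_t b_size;
	char *b_data;
	void *b_bdev;
	void *b_end_io;
	void *b_private;
	void *b_assoc_prev, *b_assoc_next;	/* stand-in for list_head */
	atomic_t b_count;			/* 4 bytes + 4 bytes tail padding */
};

int main(void)
{
	printf("sizeof = %zu, b_count at offset %zu\n",
	       sizeof(struct bh_like), offsetof(struct bh_like, b_count));
	return 0;
}

On x86-64 this prints sizeof = 96 with b_count at offset 88 - the same 96
bytes the slab reports above.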

Thanks,
Badari

Increase the size of the buffer_head b_size field for 64 bit
platforms. Update some old and moldy comments in and around
the structure as well.

The b_size increase allows us to perform larger mappings and
allocations for large I/O requests from userspace, which tie
in with other changes allowing the get_block_t() interface to
map multiple blocks at once.

Signed-off-by: Nathan Scott <[email protected]>
Signed-off-by: Badari Pulavarty <[email protected]>

Index: linux-2.6.16-rc4/include/linux/buffer_head.h
===================================================================
--- linux-2.6.16-rc4.orig/include/linux/buffer_head.h 2006-02-17 14:23:45.000000000 -0800
+++ linux-2.6.16-rc4/include/linux/buffer_head.h 2006-02-23 10:15:33.000000000 -0800
@@ -46,25 +46,28 @@ struct address_space;
typedef void (bh_end_io_t)(struct buffer_head *bh, int uptodate);

/*
- * Keep related fields in common cachelines. The most commonly accessed
- * field (b_state) goes at the start so the compiler does not generate
- * indexed addressing for it.
+ * Historically, a buffer_head was used to map a single block
+ * within a page, and of course as the unit of I/O through the
+ * filesystem and block layers. Nowadays the basic I/O unit
+ * is the bio, and buffer_heads are used for extracting block
+ * mappings (via a get_block_t call), for tracking state within
+ * a page (via a page_mapping) and for wrapping bio submission
+ * for backward compatibility reasons (e.g. submit_bh).
*/
struct buffer_head {
- /* First cache line: */
unsigned long b_state; /* buffer state bitmap (see above) */
struct buffer_head *b_this_page;/* circular list of page's buffers */
struct page *b_page; /* the page this bh is mapped to */
- atomic_t b_count; /* users using this block */
- u32 b_size; /* block size */

- sector_t b_blocknr; /* block number */
- char *b_data; /* pointer to data block */
+ sector_t b_blocknr; /* start block number */
+ size_t b_size; /* size of mapping */
+ char *b_data; /* pointer to data within the page */

struct block_device *b_bdev;
bh_end_io_t *b_end_io; /* I/O completion */
void *b_private; /* reserved for b_end_io */
struct list_head b_assoc_buffers; /* associated with another mapping */
+ atomic_t b_count; /* users using this buffer_head */
};

/*
Index: linux-2.6.16-rc4/fs/buffer.c
===================================================================
--- linux-2.6.16-rc4.orig/fs/buffer.c 2006-02-17 14:23:45.000000000 -0800
+++ linux-2.6.16-rc4/fs/buffer.c 2006-02-23 08:55:18.000000000 -0800
@@ -432,7 +432,8 @@ __find_get_block_slow(struct block_devic
printk("__find_get_block_slow() failed. "
"block=%llu, b_blocknr=%llu\n",
(unsigned long long)block, (unsigned long long)bh->b_blocknr);
- printk("b_state=0x%08lx, b_size=%u\n", bh->b_state, bh->b_size);
+ printk("b_state=0x%08lx, b_size=%lu\n", bh->b_state,
+ (unsigned long)bh->b_size);
printk("device blocksize: %d\n", 1 << bd_inode->i_blkbits);
}
out_unlock:
Index: linux-2.6.16-rc4/fs/reiserfs/prints.c
===================================================================
--- linux-2.6.16-rc4.orig/fs/reiserfs/prints.c 2006-02-17 14:23:45.000000000 -0800
+++ linux-2.6.16-rc4/fs/reiserfs/prints.c 2006-02-23 08:56:17.000000000 -0800
@@ -143,8 +143,8 @@ static void sprintf_buffer_head(char *bu
char b[BDEVNAME_SIZE];

sprintf(buf,
- "dev %s, size %d, blocknr %llu, count %d, state 0x%lx, page %p, (%s,
%s, %s)",
- bdevname(bh->b_bdev, b), bh->b_size,
+ "dev %s, size %ld, blocknr %llu, count %d, state 0x%lx, page %p, (%s,
%s, %s)",
+ bdevname(bh->b_bdev, b), (unsigned long)bh->b_size,
(unsigned long long)bh->b_blocknr, atomic_read(&(bh->b_count)),
bh->b_state, bh->b_page,
buffer_uptodate(bh) ? "UPTODATE" : "!UPTODATE",
Index: linux-2.6.16-rc4/fs/ocfs2/journal.c
===================================================================
--- linux-2.6.16-rc4.orig/fs/ocfs2/journal.c 2006-02-17 14:23:45.000000000 -0800
+++ linux-2.6.16-rc4/fs/ocfs2/journal.c 2006-02-23 08:56:55.000000000 -0800
@@ -377,12 +377,12 @@ int ocfs2_journal_access(struct ocfs2_jo
BUG_ON(!bh);
BUG_ON(!(handle->flags & OCFS2_HANDLE_STARTED));

- mlog_entry("bh->b_blocknr=%llu, type=%d (\"%s\"), bh->b_size = %hu\n",
+ mlog_entry("bh->b_blocknr=%llu, type=%d (\"%s\"), bh->b_size = %lu\n",
(unsigned long long)bh->b_blocknr, type,
(type == OCFS2_JOURNAL_ACCESS_CREATE) ?
"OCFS2_JOURNAL_ACCESS_CREATE" :
"OCFS2_JOURNAL_ACCESS_WRITE",
- bh->b_size);
+ (unsigned long)bh->b_size);

/* we can safely remove this assertion after testing. */
if (!buffer_uptodate(bh)) {


2006-02-23 22:16:54

by Badari Pulavarty

[permalink] [raw]
Subject: Re: [PATCH 2/3] map multiple blocks for mpage_readpages()

On Thu, 2006-02-23 at 08:59 +0530, Suparna Bhattacharya wrote:
> On Mon, Feb 20, 2006 at 01:23:55PM -0800, Badari Pulavarty wrote:
> > [PATCH 2/3] map multiple blocks for mpage_readpages()
> >
> > This patch changes mpage_readpages() and get_block() to
> > get the disk mapping information for multiple blocks at the
> > same time.
> >
> > b_size represents the amount of disk mapping that needs to
> > mapped. On the successful get_block() b_size indicates the
> > amount of disk mapping thats actually mapped. Only the
> > filesystems who care to use this information and provide multiple
> > disk blocks at a time can choose to do so.
> >
> > No changes are needed for the filesystems who wants to ignore this.
>
> That's a nice way to handle this !
>
> Just one minor suggestion on the patches - instead of adding all those
> arguments to do_mpage_readpage, have you considered using
> an mio structure as we did in the mpage_writepages multiple blocks case ?

I did think about moving the state into an "mio" and passing it
around. Then I realized I would need to think hard about
mpage_writepages() as well, to come up with a superset "mio". For now
I settled for the extra arguments; when we clean up mpage_writepages()
we can take a pass at cleaning up mpage_readpages() too. (BTW, only
2 more arguments are added.)

Thanks,
Badari

2006-02-24 17:17:52

by Badari Pulavarty

[permalink] [raw]
Subject: Re: [PATCH 0/3] map multiple blocks in get_block() and mpage_readpages()

On Wed, 2006-02-22 at 16:12 +0100, christoph wrote:
..
> Thanks Badari, with that interface changes the mpage_readpage changes
> look a lot nicer than my original version. I'd like to second
> the request to put it into -mm.
>
> And if the namesys folks could try out whether this works for their
> reiser4 requirements it'd be nice. If you have an even faster
> ->readpages I'd be interested in that secrete souce receipe for
> further improvement to mpage_readpages.

I don't have any secret recipes, but I was thinking of re-organizing
the code a little. The complexity is due to the "confused" case and
the "blocksize < pagesize" cases, plus going in and out of the worker
routine with stored state.

I am thinking of having a "fast path" which doesn't deal with any
of those and a "slow path" to deal with all that nonsense.
Something like:

mpage_readpages()
{
	if (block-size < page-size)
		slow_path;

	while (nr-pages) {
		if (get_block(bh))
			slow_path;
		if (uptodate(bh))
			slow_path;
		while (bh.b_size) {
			if (not contig)
				submit bio();
			add all the pages we can to bio();
			bh.b_size -= size-of-pages-added;
			nr_pages -= count-of-pages-added;
		}
	}
}

slow_path is going to be slow & ugly. How important is it to handle
1k, 2k filesystems efficiently? Should I try?

Thanks,
Badari

2006-03-06 10:03:31

by Suparna Bhattacharya

[permalink] [raw]
Subject: Re: [PATCH 0/3] map multiple blocks in get_block() and mpage_readpages()

On Fri, Feb 24, 2006 at 09:19:08AM -0800, Badari Pulavarty wrote:
> On Wed, 2006-02-22 at 16:12 +0100, christoph wrote:
> ..
> > Thanks Badari, with that interface changes the mpage_readpage changes
> > look a lot nicer than my original version. I'd like to second
> > the request to put it into -mm.
> >
> > And if the namesys folks could try out whether this works for their
> > reiser4 requirements it'd be nice. If you have an even faster
> > ->readpages I'd be interested in that secrete souce receipe for
> > further improvement to mpage_readpages.
>
> I don't have any secret receipes, but I was thinking of re-organizing
> the code a little. Complexity is due to "confused" case and
> "blocksize < pagesize" cases + going in-and-out of the worker routine
> with stored state.
>
> I am thinking of having a "fast path" which doesn't deal with any
> of those and "slow" path to deal with all that non-sense.
> Something like ..
>
> mpage_readpages()
> {
> if (block-size < page-size)
> slow_path;
>
> while (nr-pages) {
> if (get_block(bh))
> slow_path;
> if (uptodate(bh))
> slow_path;
> while (bh.b_size) {
> if (not contig)
> submit bio();
> add all the pages we can to bio();
> bh.b_size -= size-of-pages-added;
> nr_pages -= count-of-pages-added;
> }
>
> }
> }
>
> slow_path is going to be slow & ugly. How important is to handle
> 1k, 2k filesystems efficiently ? Should I try ?

With a 64K page size that could include 4K, 8K, 16K and 32K block size
filesystems as well - not sure how likely that would be?

Regards
Suparna

>
> Thanks,
> Badari
>

--
Suparna Bhattacharya ([email protected])
Linux Technology Center
IBM Software Lab, India

2006-03-06 22:40:30

by Nathan Scott

[permalink] [raw]
Subject: Re: [PATCH 0/3] map multiple blocks in get_block() and mpage_readpages()

On Mon, Mar 06, 2006 at 03:33:21PM +0530, Suparna Bhattacharya wrote:
> On Fri, Feb 24, 2006 at 09:19:08AM -0800, Badari Pulavarty wrote:
> >
> > I am thinking of having a "fast path" which doesn't deal with any
> > of those and "slow" path to deal with all that non-sense.
> > ...
> > slow_path is going to be slow & ugly. How important is to handle
> > 1k, 2k filesystems efficiently ? Should I try ?
>
> With 64K page size that could include 4K, 8K, 16K, 32K block size filesystems
> as well, not sure how likely that would be ?

A number of architectures have a pagesize greater than 4K. Most
(OK, sample size of 2) mkfs programs default to using 4K blocksizes.
So, any/all platforms not having a 4K pagesize will be disadvantaged.
Search on the definition of PAGE_SHIFT in asm-*/page.h: for all
platforms where it's not defined to 12, this will hurt.

cheers.

--
Nathan

2006-03-07 01:01:00

by Badari Pulavarty

[permalink] [raw]
Subject: Re: [PATCH 0/3] map multiple blocks in get_block() and mpage_readpages()



Nathan Scott wrote:

>On Mon, Mar 06, 2006 at 03:33:21PM +0530, Suparna Bhattacharya wrote:
>
>>On Fri, Feb 24, 2006 at 09:19:08AM -0800, Badari Pulavarty wrote:
>>
>>>I am thinking of having a "fast path" which doesn't deal with any
>>>of those and "slow" path to deal with all that non-sense.
>>>...
>>>slow_path is going to be slow & ugly. How important is to handle
>>>1k, 2k filesystems efficiently ? Should I try ?
>>>
>>With 64K page size that could include 4K, 8K, 16K, 32K block size filesystems
>>as well, not sure how likely that would be ?
>>
>
>A number of architectures have a pagesize greater than 4K. Most
>(OK, sample size of 2) mkfs programs default to using 4K blocksizes.
>So, any/all platforms not having a 4K pagesize will be disadvantaged.
>Search on the definition of PAGE_SHIFT in asm-*/page.h and for all
>platforms where its not defined to 12, this will hurt.
>

I agree - I haven't made up my mind either on whether it's really
worth doing.

I was hoping that it would help the simple cases, and that it would
help filesystems which use page->private for something other than
buffer heads.

Thanks,
Badari

>