2008-09-23 00:35:26

by Theodore Ts'o

Subject: [PATCH, RFC] ext4: Use preallocation when reading from the inode table


With modern hard drives, reading 64k takes roughly the same time as
reading a 4k block. So request adjacent inode table blocks to reduce
the time it takes when iterating over directories (especially when doing
this in htree sort order) in a cold cache case. With this patch, the
time it takes to run "git status" on a kernel tree after flushing the
caches via "echo 3 > /proc/sys/vm/drop_caches", is reduced by 21%.

Signed-off-by: "Theodore Ts'o" <[email protected]>
----

Note: this patch could work for ext3 as well. Anyone see any downsides?
A 20% improvement seems too easy... I guess it does increase what we
hold in the buffer cache by a small amount, but once the inodes are
cached, we'll stop needing to do any I/O, and we only try to do the
readahead when we know that we need to do some I/O anyway.

This patch also eliminates ext4_get_inode_block(), since it's a static
function which is only called once, and we needed access to the block
group descriptor, so it simplified the code to move the code into
__ext4_get_inode_loc(). The interesting bits are in the very last hunk
of the patch.
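
For readers who want to follow the arithmetic, here is a stand-alone
user-space sketch of the inode-number to inode-table-block mapping that
moves into __ext4_get_inode_loc(); it is illustrative only, not part of
the patch, and the geometry constants and inode number are made-up values:

/*
 * Stand-alone sketch (not part of the patch) of the inode-number ->
 * (group, inode table block, offset) mapping done in __ext4_get_inode_loc().
 * The geometry values below are hypothetical; a real filesystem reads them
 * from the superblock and the group descriptor.
 */
#include <stdio.h>

#define INODES_PER_GROUP  8192UL        /* assumed s_inodes_per_group */
#define INODE_SIZE         256UL        /* assumed s_inode_size */
#define BLOCK_SIZE        4096UL        /* assumed 4k blocksize */

int main(void)
{
        unsigned long ino = 123456;             /* example inode number */
        unsigned long itable_start = 1049601;   /* assumed itable block for this group */

        unsigned long group  = (ino - 1) / INODES_PER_GROUP;
        unsigned long offset = ((ino - 1) % INODES_PER_GROUP) * INODE_SIZE;
        unsigned long block  = itable_start + offset / BLOCK_SIZE;

        printf("inode %lu: group %lu, itable block %lu, offset %lu in block\n",
               ino, group, block, offset % BLOCK_SIZE);
        return 0;
}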


diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c
index eed1265..9f6c10e 100644
--- a/fs/ext4/inode.c
+++ b/fs/ext4/inode.c
@@ -3833,41 +3833,6 @@ out_stop:
ext4_journal_stop(handle);
}

-static ext4_fsblk_t ext4_get_inode_block(struct super_block *sb,
- unsigned long ino, struct ext4_iloc *iloc)
-{
- ext4_group_t block_group;
- unsigned long offset;
- ext4_fsblk_t block;
- struct ext4_group_desc *gdp;
-
- if (!ext4_valid_inum(sb, ino)) {
- /*
- * This error is already checked for in namei.c unless we are
- * looking at an NFS filehandle, in which case no error
- * report is needed
- */
- return 0;
- }
-
- block_group = (ino - 1) / EXT4_INODES_PER_GROUP(sb);
- gdp = ext4_get_group_desc(sb, block_group, NULL);
- if (!gdp)
- return 0;
-
- /*
- * Figure out the offset within the block group inode table
- */
- offset = ((ino - 1) % EXT4_INODES_PER_GROUP(sb)) *
- EXT4_INODE_SIZE(sb);
- block = ext4_inode_table(sb, gdp) +
- (offset >> EXT4_BLOCK_SIZE_BITS(sb));
-
- iloc->block_group = block_group;
- iloc->offset = offset & (EXT4_BLOCK_SIZE(sb) - 1);
- return block;
-}
-
/*
* ext4_get_inode_loc returns with an extra refcount against the inode's
* underlying buffer_head on success. If 'in_mem' is true, we have all
@@ -3877,13 +3842,30 @@ static ext4_fsblk_t ext4_get_inode_block(struct super_block *sb,
static int __ext4_get_inode_loc(struct inode *inode,
struct ext4_iloc *iloc, int in_mem)
{
+ struct ext4_group_desc *gdp;
ext4_fsblk_t block;
struct buffer_head *bh;

- block = ext4_get_inode_block(inode->i_sb, inode->i_ino, iloc);
- if (!block)
+ iloc->bh = 0;
+ if (!ext4_valid_inum(inode->i_sb, inode->i_ino))
return -EIO;

+ iloc->block_group = (inode->i_ino - 1) /
+ EXT4_INODES_PER_GROUP(inode->i_sb);
+ gdp = ext4_get_group_desc(inode->i_sb, iloc->block_group, NULL);
+ if (!gdp)
+ return -EIO;
+
+ /*
+ * Figure out the offset within the block group inode table
+ */
+ iloc->offset = ((inode->i_ino - 1) %
+ EXT4_INODES_PER_GROUP(inode->i_sb)) *
+ EXT4_INODE_SIZE(inode->i_sb);
+ block = ext4_inode_table(inode->i_sb, gdp) +
+ (iloc->offset >> EXT4_BLOCK_SIZE_BITS(inode->i_sb));
+ iloc->offset = iloc->offset & (EXT4_BLOCK_SIZE(inode->i_sb) - 1);
+
bh = sb_getblk(inode->i_sb, block);
if (!bh) {
ext4_error (inode->i_sb, "ext4_get_inode_loc",
@@ -3969,6 +3951,31 @@ static int __ext4_get_inode_loc(struct inode *inode,

make_io:
/*
+ * If we need to do any I/O, try to readahead up to 16
+ * blocks from the inode table.
+ */
+ {
+ ext4_fsblk_t b, end, table;
+ unsigned num;
+
+ table = ext4_inode_table(inode->i_sb, gdp);
+ b = block & ~15;
+ if (table > b)
+ b = table;
+ end = b+16;
+ num = EXT4_INODES_PER_GROUP(inode->i_sb);
+ if (EXT4_HAS_RO_COMPAT_FEATURE(inode->i_sb,
+ EXT4_FEATURE_RO_COMPAT_GDT_CSUM))
+ num -= le16_to_cpu(gdp->bg_itable_unused);
+ table += num / (EXT4_BLOCK_SIZE(inode->i_sb) /
+ EXT4_INODE_SIZE(inode->i_sb));
+ if (end > table)
+ end = table;
+ while (b <= end)
+ sb_breadahead(inode->i_sb, b++);
+ }
+
+ /*
* There are other valid inodes in the buffer, this inode
* has in-inode xattrs, or we don't have this inode in memory.
* Read the block from disk.
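
The window arithmetic in that last hunk is easier to follow outside a
diff, so here is a stand-alone sketch of the same clamping logic,
illustrative only, with made-up numbers (4k blocks, 256-byte inodes,
8192 inodes per group, no GDT_CSUM adjustment):

/*
 * Stand-alone illustration (not part of the patch) of the readahead
 * window chosen in the hunk above.  All numbers are hypothetical.
 */
#include <stdio.h>

int main(void)
{
        unsigned long table = 1024;     /* assumed first block of this group's inode table */
        unsigned long block = 1029;     /* the inode table block we actually need */
        unsigned long inodes_per_block = 4096 / 256;
        unsigned long used_inodes = 8192;

        unsigned long b = block & ~15UL;        /* round down to a 16-block boundary */
        if (table > b)
                b = table;                      /* never start before the table itself */
        unsigned long end = b + 16;
        unsigned long table_end = table + used_inodes / inodes_per_block;
        if (end > table_end)
                end = table_end;                /* never run past the in-use itable blocks */

        /* the patch then calls sb_breadahead() on each block from b to end */
        printf("need block %lu -> readahead blocks %lu..%lu\n", block, b, end);
        return 0;
}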


2008-09-23 09:16:37

by Alan

Subject: Re: [PATCH, RFC] ext4: Use preallocation when reading from the inode table

On Mon, 22 Sep 2008 20:35:23 -0400
"Theodore Ts'o" <[email protected]> wrote:

>
> With modern hard drives, reading 64k takes roughly the same time as
> reading a 4k block. So request adjacent inode table blocks to reduce
> the time it takes when iterating over directories (especially when doing
> this in htree sort order) in a cold cache case. With this patch, the
> time it takes to run "git status" on a kernel tree after flushing the
> caches via "echo 3 > /proc/sys/vm/drop_caches", is reduced by 21%.
>
> Signed-off-by: "Theodore Ts'o" <[email protected]>

Acked-by: Alan Cox <[email protected]>

I'm actually surprised that 16 is the magic tuning number you've used and
a bigger one isn't even more of a win

Alan

2008-09-23 11:51:18

by Andreas Dilger

Subject: Re: [PATCH, RFC] ext4: Use preallocation when reading from the inode table

On Sep 23, 2008 10:16 +0100, Alan Cox wrote:
> On Mon, 22 Sep 2008 20:35:23 -0400
> "Theodore Ts'o" <[email protected]> wrote:
> > With modern hard drives, reading 64k takes roughly the same time as
> > reading a 4k block. So request adjacent inode table blocks to reduce
> > the time it takes when iterating over directories (especially when doing
> > this in htree sort order) in a cold cache case. With this patch, the
> > time it takes to run "git status" on a kernel tree after flushing the
> > caches via "echo 3 > /proc/sys/vm/drop_caches", is reduced by 21%.
> >
> > Signed-off-by: "Theodore Ts'o" <[email protected]>
>
> Acked-by: Alan Cox <[email protected]>
>
> I'm actually surprised that 16 is the magic tuning number you've used and
> a bigger one isn't even more of a win

I was going to suggest making this at least a #defined constant instead
of hard coding the values there. Making it a mount option and/or /proc
value would allow further testing.
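
A minimal sketch of what lifting the constant out might look like; the
names DEF_INODE_READAHEAD_BLKS and inode_readahead_blks below are
hypothetical placeholders, not necessarily what ext4 would adopt:

#include <stdio.h>

/*
 * Hypothetical names only -- the point is just to lift the hard-coded 16
 * into a constant and give a mount option or /proc knob something to
 * override.
 */
#define DEF_INODE_READAHEAD_BLKS 16

struct sb_opts {
        unsigned int inode_readahead_blks;      /* 0 means "use the default" */
};

static unsigned int inode_ra_blks(const struct sb_opts *opts)
{
        return opts->inode_readahead_blks ? opts->inode_readahead_blks
                                          : DEF_INODE_READAHEAD_BLKS;
}

int main(void)
{
        struct sb_opts def = { 0 };
        struct sb_opts tuned = { 64 };  /* e.g. set from a mount option */

        printf("default: %u blocks, tuned: %u blocks\n",
               inode_ra_blks(&def), inode_ra_blks(&tuned));
        return 0;
}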

Cheers, Andreas
--
Andreas Dilger
Sr. Staff Engineer, Lustre Group
Sun Microsystems of Canada, Inc.


2008-09-23 12:19:08

by Ric Wheeler

Subject: Re: [PATCH, RFC] ext4: Use preallocation when reading from the inode table

Andreas Dilger wrote:
> On Sep 23, 2008 10:16 +0100, Alan Cox wrote:
>
>> On Mon, 22 Sep 2008 20:35:23 -0400
>> "Theodore Ts'o" <[email protected]> wrote:
>>
>>> With modern hard drives, reading 64k takes roughly the same time as
>>> reading a 4k block. So request adjacent inode table blocks to reduce
>>> the time it takes when iterating over directories (especially when doing
>>> this in htree sort order) in a cold cache case. With this patch, the
>>> time it takes to run "git status" on a kernel tree after flushing the
>>> caches via "echo 3 > /proc/sys/vm/drop_caches", is reduced by 21%.
>>>
>>> Signed-off-by: "Theodore Ts'o" <[email protected]>
>>>
>> Acked-by: Alan Cox <[email protected]>
>>
>> I'm actually surprised that 16 is the magic tuning number you've used and
>> a bigger one isn't even more of a win
>>
>
> I was going to suggest making this at least a #defined constant instead
> of hard coding the values there. Making it a mount option and/or /proc
> value would allow further testing.
>
> Cheers, Andreas
> --
> Andreas Dilger
> Sr. Staff Engineer, Lustre Group
> Sun Microsystems of Canada, Inc.
>
>

I think that Alan is probably right - the magic number for modern drives
is probably closer to 256K. Having it be a /sys tunable (with a larger
default) would be a nice way to verify this.

Ric


2008-09-24 01:30:28

by Theodore Ts'o

Subject: Re: [PATCH, RFC] ext4: Use preallocation when reading from the inode table

On Tue, Sep 23, 2008 at 08:18:54AM -0400, Ric Wheeler wrote:
> I think that Alan is probably right - the magic number for modern drives
> is probably closer to 256K. Having it be a /sys tunable (with a larger
> default) would be a nice way to verify this.

I've played with this a bit, and with the "git status" workload,
increasing the magic number beyond 16 (64k) doesn't actually help,
because the number of inodes we need to touch wasn't big enough.

So I switched to a different workload, which ran "find /path -size 0
-print" with a much larger directory hierarchy. With that workload I
got the following results:

ra_bits  ra_blocks  ra_kb  seconds  % improvement
      0          1      4     53.3        -
      1          2      8     47.3      11.3%
      2          4     16     41.7      21.8%
      3          8     32     37.5      29.6%
      4         16     64     34.4      35.5%
      5         32    128     32        40.0%
      6         64    256     30.7      42.4%
      7        128    512     28.8      46.0%
      8        256   1024     28.3      46.9%
      9        512   2048     27.5      48.4%
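
For reference, the first three columns are related by
ra_blocks = 1 << ra_bits and ra_kb = ra_blocks * 4 for a 4k blocksize;
a trivial stand-alone check, illustrative only:

#include <stdio.h>

int main(void)
{
        /* ra_blocks = 1 << ra_bits; ra_kb assumes a 4k blocksize */
        for (int ra_bits = 0; ra_bits <= 9; ra_bits++)
                printf("ra_bits %d -> %4d blocks = %5d kB\n",
                       ra_bits, 1 << ra_bits, (1 << ra_bits) * 4);
        return 0;
}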

Given these numbers, I'm using a default of inode_readahead_bits of 5
(i.e., 32 blocks, or 128k for 4k blocksize filesystems). For a
workload that is 100% stat-based, without any I/O, it is possible to
get better results by using a higher number, yes, but I'm concerned
that a larger readahead may end up interfering with other reads. We
need to run some other workloads to be sure a larger number won't
cause problems before we go more aggressive on this parameter.

I'll send the revised patch in another message.

- Ted

2008-09-24 13:23:48

by Ric Wheeler

Subject: Re: [PATCH, RFC] ext4: Use preallocation when reading from the inode table

Theodore Tso wrote:
> On Tue, Sep 23, 2008 at 08:18:54AM -0400, Ric Wheeler wrote:
>
>> I think that Alan is probably right - the magic number for modern drives
>> is probably closer to 256K. Having it be a /sys tunable (with a larger
>> default) would be a nice way to verify this.
>>
>
> I've played with this a bit, and with the "git status" workload,
> increasing the magic number beyond 16 (64k) doesn't actually help,
> because the number of inodes we need to touch wasn't big enough.
>
> So I switched to a different workload, which ran "find /path -size 0
> -print" with a much larger directory hierarchy. With that workload I
> got the following results:
>
> ra_bits  ra_blocks  ra_kb  seconds  % improvement
>       0          1      4     53.3        -
>       1          2      8     47.3      11.3%
>       2          4     16     41.7      21.8%
>       3          8     32     37.5      29.6%
>       4         16     64     34.4      35.5%
>       5         32    128     32        40.0%
>       6         64    256     30.7      42.4%
>       7        128    512     28.8      46.0%
>       8        256   1024     28.3      46.9%
>       9        512   2048     27.5      48.4%
>
> Given these numbers, I'm using a default of inode_readahead_bits of 5
> (i.e., 32 blocks, or 128k for 4k blocksize filesystems). For a
> workload that is 100% stat-based, without any I/O, it is possible to
> get better results by using a higher number, yes, but I'm concerned
> that a larger readahead may end up interfering with other reads. We
> need to run some other workloads to be sure a larger number won't
> cause problems before we go more aggressive on this parameter.
>
> I'll send the revised patch in another message.
>
> - Ted
>

That sounds about right for modern S-ATA/SAS drives. I would expect that
having this be a tunable knob might help for some types of storage (SSD
might not care, but should be faster in any case?).

ric


2008-09-24 14:21:28

by Chris Mason

Subject: Re: [PATCH, RFC] ext4: Use preallocation when reading from the inode table

On Wed, 2008-09-24 at 09:23 -0400, Ric Wheeler wrote:
> Theodore Tso wrote:
> > On Tue, Sep 23, 2008 at 08:18:54AM -0400, Ric Wheeler wrote:

[ numbers ]

> > Given these numbers, I'm using a default of inode_readahead_bits of 5
> > (i.e., 32 blocks, or 128k for 4k blocksize filesystems). For a
> > workload that is 100% stat-based, without any I/O, it is possible to
> > get better results by using a higher number, yes, but I'm concerned
> > that a larger readahead may end up interfering with other reads. We
> > need to run some other workloads to be sure a larger number won't
> > cause problems before we go more aggressive on this parameter.
>
> That sounds about right for modern S-ATA/SAS drives. I would expect that
> having this be a tunable knob might help for some types of storage (SSD
> might not care, but should be faster in any case?).

For the test runs being done here, there's a pretty high chance that all
of the inodes you read ahead will get used before the pages are dropped,
so we want to find a balance between those and the worst case workloads
where inode reads are basically random. One good data point is the
completion time for IOs of different sizes.

I used fio to measure the latencies of O_DIRECT random reads of given
sizes on a fast 500GB SATA drive. Here is the output for a 4k run (I
used elevator=noop, but cfq was about the same):

f4k: (groupid=6, jobs=1): err= 0: pid=22877
read : io=15816KiB, bw=539KiB/s, iops=131, runt= 30004msec
clat (usec): min=555, max=20909, avg=7581.38, stdev=2475.88
issued r/w: total=3954/0, short=0/0
lat (usec): 750=0.03%
lat (msec): 2=0.03%, 4=7.08%, 10=71.60%, 20=21.24%, 50=0.03%

clat is completion latency, but note fio switches between usec and msec
just to keep us on our toes. Other important numbers are iop/s and
total issued ios. The test limits the run on each IO size to 30
seconds.

The 4k run gets 131 iop/s, so my SATA drive can read 131 inodes/second
in a worst-case random workload. The iop rates for the other sizes:

   4k    131
   8k    130
  16k    128
  32k    126
  64k    121
 128k    113
 256k    100

A slightly trimmed job output is below, and the fio job file I used is
attached if anyone wants to try this on their own machines. I'd stick
with either 32k or 64k as the sweet spots, but a tunable is definitely a
good idea.

-chris

f256k: (groupid=0, jobs=1): err= 0: pid=22871
read : io=770816KiB, bw=26309KiB/s, iops=100, runt= 30001msec
clat (msec): min=1, max=45, avg= 9.96, stdev= 2.63
issued r/w: total=3011/0, short=0/0
lat (msec): 2=0.03%, 10=50.35%, 20=49.58%, 50=0.03%

f128k: (groupid=1, jobs=1): err= 0: pid=22872
read : io=434560KiB, bw=14830KiB/s, iops=113, runt= 30005msec
clat (msec): min=1, max=72, avg= 8.83, stdev= 2.82
issued r/w: total=3395/0, short=0/0
lat (msec): 2=0.06%, 4=0.62%, 10=63.62%, 20=35.64%, 100=0.06%

f64k: (groupid=2, jobs=1): err= 0: pid=22873
read : io=233280KiB, bw=7961KiB/s, iops=121, runt= 30006msec
clat (usec): min=815, max=14931, avg=8225.21, stdev=2471.22
issued r/w: total=3645/0, short=0/0
lat (usec): 1000=0.05%
lat (msec): 4=2.50%, 10=69.11%, 20=28.34%

f32k: (groupid=3, jobs=1): err= 0: pid=22874
read : io=121472KiB, bw=4144KiB/s, iops=126, runt= 30010msec
clat (usec): min=715, max=53124, avg=7898.75, stdev=2613.35
issued r/w: total=3796/0, short=0/0
lat (usec): 750=0.03%
lat (msec): 4=4.77%, 10=70.10%, 20=25.08%, 100=0.03%

f16k: (groupid=4, jobs=1): err= 0: pid=22875
read : io=61584KiB, bw=2101KiB/s, iops=128, runt= 30001msec
clat (msec): min=1, max=16, avg= 7.79, stdev= 2.46
issued r/w: total=3849/0, short=0/0


Attachments:
read-lat (319.00 B)
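
For anyone who wants to reproduce this sort of measurement without fio,
here is a rough stand-alone C sketch of the same idea -- timed O_DIRECT
random reads of a fixed size. It is not the attached job file, and the
device path, span, and iteration count are made-up placeholders:

/*
 * Rough sketch: time O_DIRECT random reads of one size against a block
 * device.  Run as root against a disposable device; all parameters below
 * are hypothetical.
 */
#define _GNU_SOURCE
#define _FILE_OFFSET_BITS 64
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <time.h>
#include <unistd.h>

int main(int argc, char **argv)
{
        const char *path = argc > 1 ? argv[1] : "/dev/sdb";     /* hypothetical device */
        size_t io_size = argc > 2 ? atoi(argv[2]) * 1024UL : 64 * 1024;
        const int iterations = 1000;
        off_t span = 400LL * 1024 * 1024 * 1024;        /* assume ~400GB usable span */
        struct timespec t0, t1;
        double total_ms = 0;
        void *buf;
        int fd;

        fd = open(path, O_RDONLY | O_DIRECT);
        if (fd < 0) { perror("open"); return 1; }
        if (posix_memalign(&buf, 4096, io_size)) { perror("posix_memalign"); return 1; }

        srandom(42);
        for (int i = 0; i < iterations; i++) {
                /* pick a 4k-aligned offset somewhere in the span */
                off_t off = ((off_t)random() % (span / 4096)) * 4096;

                clock_gettime(CLOCK_MONOTONIC, &t0);
                if (pread(fd, buf, io_size, off) < 0) { perror("pread"); return 1; }
                clock_gettime(CLOCK_MONOTONIC, &t1);
                total_ms += (t1.tv_sec - t0.tv_sec) * 1000.0 +
                            (t1.tv_nsec - t0.tv_nsec) / 1e6;
        }
        printf("%zuk random reads: avg completion latency %.2f ms (%.0f iops)\n",
               io_size / 1024, total_ms / iterations,
               iterations / (total_ms / 1000.0));
        free(buf);
        close(fd);
        return 0;
}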

2008-09-24 20:36:24

by Theodore Ts'o

Subject: Re: [PATCH, RFC] ext4: Use preallocation when reading from the inode table

On Wed, Sep 24, 2008 at 09:23:34AM -0400, Ric Wheeler wrote:
>
> That sounds about right for modern S-ATA/SAS drives. I would expect that
> having this be a tunable knob might help for some types of storage (SSD
> might not care, but should be faster in any case?).
>

Well, for SSD's, seek time shouldn't matter; the limiting factor
should instead be the per-read transaction overhead and the I/O
bandwidth of the SSD. So prefetching too much might hurt even more
for SSD's, although in comparison with spinning rust platters it
would probably still be faster. :-)

So I'm pretty sure that with an SSD we'll want to turn the tunable
down, not up.

On Wed, Sep 24, 2008 at 10:20:34AM -0400, Chris Mason wrote:
>For the test runs being done here, there's a pretty high chance that all
>of the inodes you read ahead will get used before the pages are dropped,
>so we want to find a balance between those and the worst case workloads
>where inode reads are basically random.

Yep, agreed.

On the other hand, if we take your iop/s and translate them to
milliseconds so we can measure the latency in the case where the
workload is essentially doing random reads, and then cross-correlate
them with my measurements, we get this table:

i/o size   iops/s   ms latency         % degradation   % improvement
                    of random inodes                   of related inodes I/O
    4k      131        7.634
    8k      130        7.692               0.77%            11.3%
   16k      128        7.813               2.34%            21.8%
   32k      126        7.937               3.97%            29.6%
   64k      121        8.264               8.26%            35.5%
  128k      113        8.850              15.93%            40.0%
  256k      100       10.000              31.00%            42.4%
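
The latency and degradation columns above follow directly from Chris's
iop rates (latency = 1000/iops, degradation relative to the 4k case);
a quick stand-alone check:

#include <stdio.h>

int main(void)
{
        /* iop rates copied from the fio runs above */
        const int kb[]   = { 4, 8, 16, 32, 64, 128, 256 };
        const int iops[] = { 131, 130, 128, 126, 121, 113, 100 };
        double base = 1000.0 / iops[0];         /* 4k case: ~7.634 ms */

        for (int i = 0; i < 7; i++) {
                double ms = 1000.0 / iops[i];   /* avg completion latency */
                printf("%4dk  %3d iops  %6.3f ms  %+6.2f%%\n",
                       kb[i], iops[i], ms, (ms - base) / base * 100.0);
        }
        return 0;
}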

Depending on whether you believe that workloads involving random inode
reads are more common than related-inode I/O, the sweet spot
is probably somewhere between 32k and 128k. I'm open to opinions
(preferably backed up with more benchmarks of likely workloads) on
whether we should use a default value of inode_readahead_bits of 4 or
5 (i.e., 64k, my original guess, or 128k, in v2 of the patch). But
yes, making it tunable is definitely going to be necessary, since
different workloads (e.g., squid vs. git repositories) will have very
different requirements.

The other thought that comes to mind is whether we should use a
similarly large readahead when all we are doing is writes rather than
reads. For example, if we are just updating a single inode, and we are
reading a block to do a read/modify/write cycle, maybe we shouldn't be
doing as much readahead.

- Ted

P.S. One caveat is that my numbers were taken from a laptop SATA
drive, and if Chris's were taken from a desktop/server SATA drive the
numbers might not be properly comparable.

2008-09-25 23:40:26

by Andreas Dilger

Subject: Re: [PATCH, RFC] ext4: Use preallocation when reading from the inode table

On Sep 24, 2008 16:35 -0400, Theodore Ts'o wrote:
> On the other hand, if we take your iop/s and translate them to
> milliseconds so we can measure the latency in the case where the
> workload is essentially doing random reads, and then cross-correlate
> them with my measurements, we get this table:

Comparing the incremental benefit of each step:

> i/o size   iops/s   ms latency         % degradation   % improvement
>                     of random inodes                   of related inodes I/O
>     4k      131        7.634
>     8k      130        7.692               0.77%            11.3%
                                             1.57%            10.5%
>    16k      128        7.813               2.34%            21.8%
                                             1.63%             7.8%
>    32k      126        7.937               3.97%            29.6%
                                             4.29%             5.9%
>    64k      121        8.264               8.26%            35.5%
                                             7.67%             4.5%
>   128k      113        8.850              15.93%            40.0%
                                            16.07%             2.4%
>   256k      100       10.000              31.00%            42.4%
>
> Depending on whether you believe that workloads involving random inode
> reads are more common compared to related inodes I/O, the sweet spot
> is probably somewhere between 32k and 128k. I'm open to opinions
> (preferably backed up with more benchmarks of likely workloads) of
> whether we should use a default value of inode_readahead_bits of 4 or
> 5 (i.e., 64k, my original guess, or 128k, in v2 of the patch). But
> yes, making it tunable is definitely going to be necessary, since
> different workloads (e.g., squid vs. git repositories) will have very
> different requirements.

It looks like moving from 64kB to 128kB readahead might be a loss for
"unknown" workloads, since that increases latency by 7.67% for the random
inode case, but we only get 4.5% improvement in the sequential inode case.
Also recall that at large scale "htree" breaks down to random inode
lookup so that isn't exactly a fringe case (though readahead may still
help if the cache is large enough).

Cheers, Andreas
--
Andreas Dilger
Sr. Staff Engineer, Lustre Group
Sun Microsystems of Canada, Inc.