2004-09-07 13:02:33

by Stephen C. Tweedie

Subject: [Patch 0/6]: Cleanup and rbtree for ext3 reservations in 2.6.9-rc1-mm4

The patches in the following set contain several cleanups for ext3
reservations, fix a reproducible SMP race, and turn the per-superblock
linear list of reservations into an rbtree for better scaling.

Apart from performance and the SMP race, there should be no
behavioural change in this set of patches, except for one item: the
EXT3_IOC_GETRSVSZ and EXT3_IOC_SETRSVSZ ioctl numbers are currently
using a non-standard character in their names, 'r' instead of 'f' (the
latter is used for all existing ext2/3 ioctls), so I thought it would
be best to rename that sooner rather than later if we're going to do
it at all.

These changes have been in rawhide for a couple of weeks, and have
been undergoing testing both within Red Hat and at IBM.


2004-09-11 00:35:34

by Mingming Cao

Subject: Re: [Patch 0/6]: Cleanup and rbtree for ext3 reservations in 2.6.9-rc1-mm4

On Tue, 2004-09-07 at 06:02, Stephen Tweedie wrote:
> The patches in the following set contain several cleanups for ext3
> reservations, fix a reproducible SMP race, and turn the per-superblock
> linear list of reservations into an rbtree for better scaling.

> These changes have been in rawhide for a couple of weeks, and have
> been undergoing testing both within Red Hat and at IBM.
>

We have run several tests on this set of reservation changes, comparing
the results with no reservation, rbtree-based reservation, and
linked-list-based reservation. Here are the tiobench sequential test
results. Note that the 2.6.8.1-mm4 kernel includes the
doubly-linked-list-based per-fs reservation tree.

tiobench sequential write throughputs
============================================================
Threads   no reservation   2.6.8.1-mm4   2.6.8.1-mm4+rbtree patch
  1             29              29              29
  4              3              29              29
  8              4              28              28
 16              3              27              27
 32              4              27              27
 64              3              27              27
128              2              20              25
256              1              20              24

We did see the rbtree changes scale better with 128 or more threads. We
also ran tiobench random tests and did not see a throughput regression there.

We also re-ran dbench; the results show that the two reservation
implementations (rbtree based vs. linked-list based) perform almost
equally well.

dbench average throughputs on 4 runs
==================================================
Threads   no reservation   2.6.8.1-mm4   2.6.8.1-mm4_rbtree
  1             97              93              96
  4            234             250             213
  8            201             213             213
 16            156             168             169
 32             73             106             105
 64             38              65              67

We previously had some concerns about the CPU cost of all the reservation
changes under a seeky random-write workload. We are running some tests in
that area too.

Mingming

2004-09-12 08:01:47

by Ram Pai

Subject: Re: [Patch 0/6]: Cleanup and rbtree for ext3 reservations in 2.6.9-rc1-mm4

On 10 Sep 2004, Mingming Cao wrote:

> On Tue, 2004-09-07 at 06:02, Stephen Tweedie wrote:
> > The patches in the following set contain several cleanups for ext3
> > reservations, fix a reproducible SMP race, and turn the per-superblock
> > linear list of reservations into an rbtree for better scaling.
>
> > These changes have been in rawhide for a couple of weeks, and have
> > been undergoing testing both within Red Hat and at IBM.
> >
>
> We have run several tests on this set of the reservation changes. We
> compared the results w/o reservation, rbtree based reservation vs link
> list based reservation.

I have some more results at
http://eaglet.rain.com/ram/seekwrite/seekwrite.html

In summary, the reservation code performs better than no reservation.
There is no clear winner when comparing linked-list-based reservation
vs. rbtree-based reservation.

RP

2004-09-22 01:03:52

by Mingming Cao

Subject: [Patch] ext3_new_inode() failure case fix

While studying the ext3_new_inode() failure code paths, I found that
inode->i_nlink is not cleared to 0 in the first failure path, although it
is cleared in the second failure path (failure to allocate disk quota). I
think it should be cleared in both cases: later, when generic_drop_inode()
is called by iput(), i_nlink is checked to decide whether to call
generic_delete_inode(), but currently generic_forget_inode() is called
instead.

Also, the reference to the inode bitmap buffer head should be dropped on
the failure path too.

Mingming

---

linux-2.6.9-rc1-mm5-ming/fs/ext3/ialloc.c | 2 ++
1 files changed, 2 insertions(+)

diff -puN fs/ext3/ialloc.c~ext3_new_inode_failure_case_fix fs/ext3/ialloc.c
--- linux-2.6.9-rc1-mm5/fs/ext3/ialloc.c~ext3_new_inode_failure_case_fix 2004-09-22 00:18:18.196012520 -0700
+++ linux-2.6.9-rc1-mm5-ming/fs/ext3/ialloc.c 2004-09-22 00:19:20.063607216 -0700
@@ -622,7 +622,9 @@ got:
fail:
ext3_std_error(sb, err);
out:
+ inode->i_nlink = 0;
iput(inode);
+ brelse(bitmap_bh);
ret = ERR_PTR(err);
really_out:
brelse(bitmap_bh);

_

2004-09-22 06:03:32

by Andrew Morton

Subject: Re: [Patch] ext3_new_inode() failure case fix

Mingming Cao <[email protected]> wrote:
>
> While I am studying ext3_new_inode() failure code paths, I found the
> inode->i_nlink is not cleared to 0 in the first failure path. But it is
> cleared to 0 in the second failure path(fail to allocate disk quota). I
> think it should be cleared in both cases, because later when
> generic_drop_inode() is called by iput(), i_nlink is checked to decide
> whether to call generic_delete_inode(). Currently it is calling
> generic_forget_inode().
>
> Also the reference to the inode bitmap buffer head should be dropped on
> the failure path too.

That reference already is dropped:

> --- linux-2.6.9-rc1-mm5/fs/ext3/ialloc.c~ext3_new_inode_failure_case_fix 2004-09-22 00:18:18.196012520 -0700
> +++ linux-2.6.9-rc1-mm5-ming/fs/ext3/ialloc.c 2004-09-22 00:19:20.063607216 -0700
> @@ -622,7 +622,9 @@ got:
> fail:
> ext3_std_error(sb, err);
> out:
> + inode->i_nlink = 0;
> iput(inode);
> + brelse(bitmap_bh);
> ret = ERR_PTR(err);
> really_out:
> brelse(bitmap_bh);

^ here

So I'll drop that part of the patch.

2004-09-22 07:29:35

by Al Viro

Subject: Re: [Patch] ext3_new_inode() failure case fix

On Tue, Sep 21, 2004 at 11:01:24PM -0700, Andrew Morton wrote:
> Mingming Cao <[email protected]> wrote:
> >
> > While I am studying ext3_new_inode() failure code paths, I found the
> > inode->i_nlink is not cleared to 0 in the first failure path. But it is
> > cleared to 0 in the second failure path(fail to allocate disk quota). I
> > think it should be cleared in both cases, because later when
> > generic_drop_inode() is called by iput(), i_nlink is checked to decide
> > whether to call generic_delete_inode(). Currently it is calling
> > generic_forget_inode().
> >
> > Also the reference to the inode bitmap buffer head should be dropped on
> > the failure path too.
>
> That reference already is dropped:
>
> > --- linux-2.6.9-rc1-mm5/fs/ext3/ialloc.c~ext3_new_inode_failure_case_fix 2004-09-22 00:18:18.196012520 -0700
> > +++ linux-2.6.9-rc1-mm5-ming/fs/ext3/ialloc.c 2004-09-22 00:19:20.063607216 -0700
> > @@ -622,7 +622,9 @@ got:
> > fail:
> > ext3_std_error(sb, err);
> > out:
> > + inode->i_nlink = 0;
> > iput(inode);
> > + brelse(bitmap_bh);
> > ret = ERR_PTR(err);
> > really_out:
> > brelse(bitmap_bh);
>
> ^ here
>
> So I'll drop that part of the patch.

Drop it completely - we do *NOT* want ->delete_inode() to be called until
we have done the allocation (and set a sane ->i_ino while we are at it).
And that is exactly the difference between the places that bail to fail2
and the places that bail to out.