2001-03-21 07:08:38

by Anton Blanchard

Subject: spinlock usage - ext2_get_block, lru_list_lock


Hi,

I ported lockmeter to PPC and ran a few dbench runs on a quad CPU F50 here.
These runs were made to never hit the disk. The full results can be found
here:

http://samba.org/~anton/ppc/lockmeter/2.4.3-pre3_hacked/

It was not surprising that the BKL was one of the main offenders. Looking at
the stats, ext2_get_block was the bad guy (UTIL is the % of time the lock was
busy, WAIT is the time spent waiting for the lock):

SPINLOCKS       HOLD               WAIT
 UTIL   CON   MEAN(  MAX )   MEAN(  MAX )( %CPU)   TOTAL  NAME
38.8% 41.0%  7.6us(  31ms)   15us(  18ms)( 7.7%) 1683368  kernel_flag
0.87%  9.1%   13ms(  31ms)  129us( 231us)(0.00%)      22  do_exit+0x120
 2.6% 21.6%   45us(2103us)   79us(  18ms)(0.25%)   19240  ext2_delete_inode+0x34
0.32% 24.8%  1.2us(  46us)   14us( 992us)(0.25%)   92415  ext2_discard_prealloc+0x34

29.2% 50.9%   10us( 400us)   15us( 892us)( 5.4%)  957740  ext2_get_block+0x64

0.40% 32.8%   18us( 208us)   31us(  11ms)(0.06%)    7435  lookup_hash+0xb0
0.09% 17.3%   11us( 139us)   17us( 237us)(0.01%)    2560  notify_change+0x8c
0.01% 17.3%   34us( 138us)  912us(  11ms)(0.01%)      81  real_lookup+0x94
0.02% 39.5%   34us( 344us)   47us( 331us)(0.00%)     172  schedule+0x4fc
0.00% 15.4%   11us(  37us)   14us(  22us)(0.00%)      26  sys_ioctl+0x50
 1.1% 28.7%  0.7us( 131us)   12us( 910us)( 1.5%)  559700  sys_lseek+0x90
0.56% 25.8%   48us( 245us)   12us( 162us)(0.01%)    3900  sys_rename+0x1fc
0.03% 25.0%   24us(  43us)   64us(1004us)(0.00%)     400  tty_read+0xd4
0.07% 24.1%   31us(  64us)   17us( 293us)(0.00%)     776  tty_write+0x234
 2.0% 32.5%   35us( 267us)   13us( 504us)(0.06%)   19220  vfs_create+0xd0
0.29% 76.5%  437us( 533us)   25us( 456us)(0.00%)     221  vfs_mkdir+0xd0
0.05% 19.2%   65us( 285us)  460us(9017us)(0.02%)     240  vfs_rmdir+0x208
 1.1% 23.2%   19us( 236us)   17us( 819us)(0.06%)   19220  vfs_unlink+0x188

It can also be seen that do_exit grabbed the BKL for way too long. Another
large waster of CPU time was the lru_list_lock:

SPINLOCKS       HOLD               WAIT
 UTIL   CON   MEAN(  MAX )   MEAN(  MAX )( %CPU)   TOTAL  NAME
25.8% 27.0%  1.6us( 169us)  8.9us( 446us)( 9.5%) 5281025  lru_list_lock
0.07% 33.0%  2.9us(  34us)   11us( 293us)(0.02%)    8051  __bforget+0x20
 1.7% 14.6%  0.3us(  44us)  5.2us( 265us)( 1.1%) 1870792  buffer_insert_inode_queue+0x24
 7.3% 13.6%  1.9us( 169us)  5.5us( 278us)(0.70%) 1239163  getblk+0x28
0.00% 58.8%  1.0us( 4.5us)   13us( 142us)(0.00%)     221  invalidate_inode_buffers+0x20
10.0% 45.5%  1.7us( 134us)   10us( 446us)( 6.6%) 1920438  refile_buffer+0x20
 6.7% 45.2%  9.2us( 149us)   14us( 330us)( 1.2%)  242360  try_to_free_buffers+0x44

I began smashing up lru_list_lock but found a few problems. With a name
like lru_list_lock, you would expect it to only synchronise operations on
lru_list[]. However, I found things like:

int inode_has_buffers(struct inode *inode)
{
        int ret;

        spin_lock(&lru_list_lock);
        ret = !list_empty(&inode->i_dirty_buffers);
        spin_unlock(&lru_list_lock);

        return ret;
}

It also looks like it is protecting some of the fields in the buffer_head
struct. Is the lru_list_lock spinlock usage documented anywhere?

Cheers,
Anton


2001-03-21 08:47:36

by Andreas Dilger

Subject: Re: spinlock usage - ext2_get_block, lru_list_lock

Anton Blanchard writes:
> I ported lockmeter to PPC and ran a few dbench runs on a quad CPU F50 here.
> These runs were made to never hit the disk. The full results can be found
> here:
>
> http://samba.org/~anton/ppc/lockmeter/2.4.3-pre3_hacked/
>
> It was not surprising that the BKL was one of the main offenders. Looking at
> the stats, ext2_get_block was the bad guy (UTIL is the % of time the lock was
> busy, WAIT is the time spent waiting for the lock):
>
> SPINLOCKS       HOLD               WAIT
>  UTIL   CON   MEAN(  MAX )   MEAN(  MAX )( %CPU)   TOTAL  NAME
> 38.8% 41.0%  7.6us(  31ms)   15us(  18ms)( 7.7%) 1683368  kernel_flag
> 29.2% 50.9%   10us( 400us)   15us( 892us)( 5.4%)  957740  ext2_get_block+0x64

There has been previous discussion of removing the BKL from much (all?) of the
VFS and pushing the locking down into the filesystems. This would require that
lock_super() actually be an SMP-safe locking mechanism. At that point it
would also be possible to have separate locks for each ext2 group, which
may greatly reduce lock contention when searching for free blocks/inodes
(it really depends on allocation patterns).

With per-group (or maybe per-bitmap) locking, files could be created in
parallel with only a small amount of global locking if they are in different
groups. The CPU-intensive work (bitmap searching) could be done with
only a bitmap lock. It may even be desirable to do the bitmap searching
without any locks. We would depend on the atomic test_and_set() to tell
us if our newfound "free" block was allocated from underneath us, and use
find_next_free_bit() to continue our search in the bitmap.
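
Roughly like this (completely untested sketch - the function name is made up,
I'm using the generic find_next_zero_bit()/test_and_set_bit() names, and real
ext2 code would want the byte-order-safe ext2_*_bit() variants):

/*
 * Sketch only: scan a group's block bitmap without holding a lock.
 * The atomic test_and_set_bit() tells us if someone else grabbed our
 * newfound "free" bit first; if so, just keep scanning from there.
 */
static int ext2_grab_block_lockless(void *bitmap, unsigned long size,
                                    unsigned long start)
{
        unsigned long bit = start;

        while ((bit = find_next_zero_bit(bitmap, size, bit)) < size) {
                if (!test_and_set_bit(bit, bitmap))
                        return bit;     /* we won the race - block is ours */
                bit++;                  /* lost the race - keep searching */
        }
        return -1;                      /* no free block in this group */
}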

The group lock would be needed for updating the group descriptor counts,
and a superblock lock for doing the superblock counts. It may also be
possible to have lazy updating of the superblock counts, and depend on
e2fsck to update the superblock counts on a crash. I'm thinking something
like updating a "delta count" for each ext2 group (blocks, inodes, dirs)
while we hold the group lock, and only moving the deltas from the groups
to the superblock on sync or similar. This would reduce lock contention
on the superblock lock a great deal.
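
The bookkeeping itself would be tiny - something like this rough sketch (all
names invented, on-disk structures and endianness ignored):

/*
 * Rough sketch, invented names: per-group counters protected by the
 * per-group lock.  Allocations only touch these deltas; they are folded
 * into the superblock counts on sync (or statfs), not on every change.
 */
struct ext2_group_delta {
        spinlock_t      lock;           /* the per-group lock itself */
        int             free_blocks;    /* +/- since the last fold into the sb */
        int             free_inodes;
        int             used_dirs;
};

static inline void ext2_group_alloc_blocks(struct ext2_group_delta *gd,
                                           int count)
{
        spin_lock(&gd->lock);
        gd->free_blocks -= count;       /* blocks just allocated in this group */
        spin_unlock(&gd->lock);
}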

Cheers, Andreas
--
Andreas Dilger \ "If a man ate a pound of pasta and a pound of antipasto,
\ would they cancel out, leaving him still hungry?"
http://www-mddsp.enel.ucalgary.ca/People/adilger/ -- Dogbert

2001-03-21 09:07:23

by Mitchell Blank Jr

Subject: Re: spinlock usage - ext2_get_block, lru_list_lock

Andreas Dilger wrote:
> With per-group (or maybe per-bitmap) locking, files could be created in
> parallel with only a small amount of global locking if they are in different
> groups.

...and then we can let the disc go nuts seeking to actually commit all
these new blocks. I suspect that this change would only be a win for
memory-based discs where seek time is zero.

I think that before anyone starts modifying the kernel for this they
should benchmark how often the free block checker blocks on lock
contention _AND_ how often the thread it's contending with is looking
for a free block in a _different_ allocation group. I bet it's not
often at all.

> It may also be
> possible to have lazy updating of the superblock counts, and depend on
> e2fsck to update the superblock counts on a crash.

That sounds more promising.

> , and only moving the deltas from the groups
> to the superblock on sync or similar.

If we're going to assume that e2fsck will correct the numbers anyway then
there's really no reason to update them any time except when marking
the filesystem clean (umount, remount-ro). As a bonus, we have to update
the superblock then anyway.

-Mitch

2001-03-21 09:43:43

by Andreas Dilger

Subject: Re: spinlock usage - ext2_get_block, lru_list_lock

Mitchell Blank, Jr. writes:
> Andreas Dilger wrote:
> > With per-group (or maybe per-bitmap) locking, files could be created in
> > parallel with only a small amount of global locking if they are in different
> > groups.
>
> ...and then we can let the disc go nuts seeking to actually commit all
> these new blocks. I suspect that this change would only be a win for
> memory-based discs where seek time is zero.

No, because the bitmaps are not updated synchronously. It would be exactly
the same as today, but without the lock contention. Also, for very large
filesystems, you get into the case where it is spread across multiple
physical disks, so they _can_ be doing I/O in parallel.

> I think that before anyone starts modifying the kernel for this they
> should benchmark how often the free block checker blocks on lock
> contention _AND_ how often the thread it's contending with is looking
> for a free block in a _different_ allocation group.

It really depends on what you are using for a benchmark. Ext2 likes to
spread directories evenly across the groups, so if you are creating files
in separate directories, then you only have a 1/number_of_groups chance
that they would be contending for the same group lock. If the files are
being created in the same directory, then you have other locking issues
like updating the directory file itself.

> > It may also be possible to have lazy updating of the superblock counts,
> > and depend on e2fsck to update the superblock counts on a crash.
>
> That sounds more promising.
>
> > , and only moving the deltas from the groups
> > to the superblock on sync or similar.
>
> If we're going to assume that e2fsck will correct the numbers anyway then
> there's really no reason to update them any time except when marking
> the filesystem clean (umount, remount-ro). As a bonus, we have to update
> the superblock then anyway.

Well, the numbers in the superblock are what's used for statfs, and are
also used to determine if the fs is full. It would be safe enough to have
an ext2 function call which "gathers" all of the lazy updates into the SB
for use by statfs and such. For the case of a full filesystem, the block
allocation routines would eventually find this out anyway, once they
search all of the groups and don't find a free block, so no harm done. I
believe that sync_supers() is called by kupdated every 5 seconds, so
that would be a good time to collect the deltas into the superblock.
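
Something along these lines (sketch only - ext2_group_delta, s_group_delta[]
and the s_free_* fields are invented in-core bookkeeping, and the on-disk /
endianness details are ignored):

/*
 * Sketch: fold the accumulated per-group deltas into (invented) in-core
 * superblock totals.  Called from the statfs path and at sync_supers()
 * time, so the totals are never more than a few seconds stale.
 */
static void ext2_gather_counts(struct super_block *sb)
{
        struct ext2_sb_info *sbi = &sb->u.ext2_sb;
        unsigned int i;

        lock_super(sb);
        for (i = 0; i < sbi->s_groups_count; i++) {
                struct ext2_group_delta *gd = &sbi->s_group_delta[i];

                spin_lock(&gd->lock);
                sbi->s_free_blocks += gd->free_blocks;
                sbi->s_free_inodes += gd->free_inodes;
                gd->free_blocks = 0;
                gd->free_inodes = 0;
                spin_unlock(&gd->lock);
        }
        unlock_super(sb);
}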

Cheers, Andreas
--
Andreas Dilger \ "If a man ate a pound of pasta and a pound of antipasto,
\ would they cancel out, leaving him still hungry?"
http://www-mddsp.enel.ucalgary.ca/People/adilger/ -- Dogbert

2001-03-21 12:13:45

by Ingo Molnar

Subject: [patch] pagecache SMP-scalability patch [was: spinlock usage]


Anton,

if you are doing SMP-intensive dbench runs, then check out the SMP
pagecache-scalability patch (against 2.4.2-ac20):

http://people.redhat.com/~mingo/smp-pagecache-patches/pagecache-2.4.2-H1

this patch splits up the main scalability offender in non-RAM-limited
dbench runs, which is pagecache_lock. The patch was designed and written
by David Miller, and is being forward ported / maintained by me. (The new
pagecache lock design is similar to TCP's hashed spinlocks, which proved
to scale excellently.)
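
(the rough shape of the idea, not the actual code from the patch - made-up
names and sizes: instead of one global pagecache_lock, the page's (mapping,
index) pair picks one lock out of a small array, so unrelated pages almost
never contend on the same lock:)

/*
 * Illustration only, not the patch itself: the single pagecache_lock
 * becomes an array of spinlocks, and the same kind of hash that picks a
 * page's hash chain also picks the lock that protects it.  Each element
 * gets spin_lock_init()ed at boot.
 */
#define PAGECACHE_LOCK_SHIFT    6
#define NR_PAGECACHE_LOCKS      (1 << PAGECACHE_LOCK_SHIFT)

static spinlock_t pagecache_locks[NR_PAGECACHE_LOCKS];

static inline spinlock_t *pagecache_lock_for(struct address_space *mapping,
                                             unsigned long index)
{
        unsigned long hash = (unsigned long)mapping / sizeof(*mapping) + index;

        return &pagecache_locks[hash & (NR_PAGECACHE_LOCKS - 1)];
}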

(about lstat(): IMO lstat() should not call into the lowlevel FS code.)

Ingo

2001-03-21 12:32:19

by Anton Blanchard

Subject: Re: [patch] pagecache SMP-scalability patch [was: spinlock usage]


Hi,

> http://people.redhat.com/~mingo/smp-pagecache-patches/pagecache-2.4.2-H1
>
> this patch splits up the main scalability offender in non-RAM-limited
> dbench runs, which is pagecache_lock. The patch was designed and written
> by David Miller, and is being forward ported / maintained by me. (The new
> pagecache lock design is similar to TCP's hashed spinlocks, which proved
> to scale excellently.)

Thanks Ingo! Davem told me about this a while ago but I had forgotten
about it. I'll do some runs tomorrow, including ones which don't fit in
RAM.

> (about lstat(): IMO lstat() should not call into the lowlevel FS code.)

Ooops, sorry I meant stats as in statistics :)

Anton

2001-03-21 13:01:44

by Ingo Molnar

Subject: Re: [patch] pagecache SMP-scalability patch [was: spinlock usage]


On Wed, 21 Mar 2001, Alexander Viro wrote:

> > (about lstat(): IMO lstat() should not call into the lowlevel FS code.)
>
> a) revalidation on network filesystems
> b) just about anything layered would win from the ability to replace the
> normal stat() behaviour (think of UID/GID replacement, etc.)

sorry, i meant ext2fs - but this was fixed by your VFS scalability changes
already :-)

Ingo

2001-03-21 12:55:14

by Alexander Viro

Subject: Re: [patch] pagecache SMP-scalability patch [was: spinlock usage]



On Wed, 21 Mar 2001, Ingo Molnar wrote:

>
> Anton,
>
> if you are doing SMP-intensive dbench runs, then check out the SMP
> pagecache-scalability patch (against 2.4.2-ac20):
>
> http://people.redhat.com/~mingo/smp-pagecache-patches/pagecache-2.4.2-H1
>
> this patch splits up the main scalability offender in non-RAM-limited
> dbench runs, which is pagecache_lock. The patch was designed and written
> by David Miller, and is being forward ported / maintained by me. (The new
> pagecache lock design is similar to TCP's hashed spinlocks, which proved
> to scale excellently.)
>
> (about lstat(): IMO lstat() should not call into the lowlevel FS code.)

a) revalidation on network filesystems
b) just about anything layered would win from the ability to replace the
normal stat() behaviour (think of UID/GID replacement, etc.)
Cheers,
Al

2001-03-21 13:57:31

by Alexander Viro

Subject: Re: [patch] pagecache SMP-scalability patch [was: spinlock usage]



On Wed, 21 Mar 2001, Ingo Molnar wrote:

>
> On Wed, 21 Mar 2001, Alexander Viro wrote:
>
> > > (about lstat(): IMO lstat() should not call into the lowlevel FS code.)
> >
> > a) revalidation on network filesystems
> > b) just about anything layered would win from the ability to replace the
> > normal stat() behaviour (think of UID/GID replacement, etc.)
>
> sorry, i meant ext2fs - but this was fixed by your VFS scalability changes
> already :-)

???
<checking 2.2.0>
Nope, no calls into ext2/*. do_revalidate() seeing NULL ->i_revalidate
and doing nothing, lnamei() doing usual lookup, cp_old_stat() not touching
fs code at all...

The only change was s/lnamei()/user_path_walk_link()/ and they are
equivalent, modulo different API (filling nameidata vs. returning
dentry).
Cheers,
Al

2001-03-21 14:04:01

by Ingo Molnar

Subject: Re: [patch] pagecache SMP-scalability patch [was: spinlock usage]


On Wed, 21 Mar 2001, Alexander Viro wrote:

> ???
> <checking 2.2.0>
> Nope, no calls into ext2/*. do_revalidate() seeing NULL ->i_revalidate
> and doing nothing, lnamei() doing usual lookup, cp_old_stat() not touching
> fs code at all...

the problem was that it was calling lock_kernel(), not the lowlevel fs i
believe - this caused contention.

Ingo

2001-03-21 16:54:15

by Linus Torvalds

Subject: Re: spinlock usage - ext2_get_block, lru_list_lock

In article <[email protected]>,
Anton Blanchard <[email protected]> wrote:
>
>It was not surprising that the BKL was one of the main offenders. Looking at
>the stats, ext2_get_block was the bad guy (UTIL is the % of time the lock was
>busy, WAIT is the time spent waiting for the lock):

Actually, I find the BKL fairly surprising - we've whittled down all the
major non-lowlevel-FS offenders, and I didn't realize that it's still
there in do_exit().

And the do_exit() case should be _trivial_ to fix: almost none of the
code protected by the kernel lock in the exit path actually needs the
lock. I suspect you could cut the kernel lock coverage there down to
something much smaller.

The big case seems to be ext2_get_block(), we'll fix that early in
2.5.x. I think Al already has patches for it.

As to lseek, that one should probably get the inode semaphore, not the
kernel lock.

Linus

2001-03-21 17:17:55

by Alexander Viro

Subject: Re: spinlock usage - ext2_get_block, lru_list_lock



On 21 Mar 2001, Linus Torvalds wrote:

> The big case seems to be ext2_get_block(), we'll fix that early in
> 2.5.x. I think Al already has patches for it.

Since last August ;-) Bitmap handling is a separate story (it got
less testing), but with ->gfp_mask in the tree it will be much simpler.
I'll port these patches to the current tree if anyone is interested - all
the infrastructure is already there, so only code in fs/ext2/* is touched.

Obext2: <plug>
Guys, help with testing directories-in-pagecache patch. It works fine
here and I would really like it to get serious beating.
Patch is on ftp.math.psu.edu/pub/viro/ext2-dir-patch-b-S2.gz (against
2.4.2, but applies to 2.4.3-pre* too).
</plug>

(Linus, if you want me to mail it to you - just tell; comments on the
style would be _very_ welcome)

> As to lseek, that one should probably get the inode semaphore, not the
> kernel lock.

lseek() is per-file, so ->i_sem seems strange... I really don't see
why we would want a blocking semaphore there - a spinlock (heck, even a
global spinlock) should be OK, AFAICS. All we really want is an atomic
assignment of the 64-bit offset and an atomic += on the same beast.
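
Something as dumb as this would do (sketch, not a patch - the name is
invented and SEEK_END is left out, since that needs ->i_size anyway):

/*
 * Sketch: the only thing lseek() really needs from the BKL is that the
 * 64-bit read-modify-write of file->f_pos is not torn on 32-bit boxes.
 * One global spinlock already gives us that.
 */
static spinlock_t f_pos_lock = SPIN_LOCK_UNLOCKED;

static loff_t file_pos_update(struct file *file, loff_t offset, int origin)
{
        loff_t new;

        spin_lock(&f_pos_lock);
        new = (origin == 1) ? file->f_pos + offset : offset;    /* SEEK_CUR : SEEK_SET */
        if (new >= 0)
                file->f_pos = new;
        else
                new = -EINVAL;
        spin_unlock(&f_pos_lock);
        return new;
}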
Cheers,
Al

2001-03-21 18:15:11

by James Lewis Nance

Subject: Re: spinlock usage - ext2_get_block, lru_list_lock

On Wed, Mar 21, 2001 at 12:16:47PM -0500, Alexander Viro wrote:

> Obext2: <plug>
> Guys, help with testing directories-in-pagecache patch. It works fine
> here and I would really like it to get serious beating.
> Patch is on ftp.math.psu.edu/pub/viro/ext2-dir-patch-b-S2.gz (against
> 2.4.2, but applies to 2.4.3-pre* too).
> </plug>

I would love to test this patch, but I really don't want it touching my other
ext2 file systems (like /). I assume it would be possible to copy the ext2
code over to something like linux/fs/extnew, patch that, and then mount my
scratch partitions as extnew. I can try and cook something like this up, but
I thought you might already have it, so I am posting here to see.

Thanks,

Jim

2001-03-21 21:02:55

by Alexander Viro

Subject: Re: spinlock usage - ext2_get_block, lru_list_lock



On Wed, 21 Mar 2001, James Lewis Nance wrote:

> On Wed, Mar 21, 2001 at 12:16:47PM -0500, Alexander Viro wrote:
>
> > Obext2: <plug>
> > Guys, help with testing directories-in-pagecache patch. It works fine
> > here and I would really like it to get serious beating.
> > Patch is on ftp.math.psu.edu/pub/viro/ext2-dir-patch-b-S2.gz (against
> > 2.4.2, but applies to 2.4.3-pre* too).
> > </plug>
>
> I would love to test this patch, but I really don't want it touching my other
> ext2 file systems (like /). I assume it would be possible to copy the ext2
> code over to something like linux/fs/extnew, patch that, and then mount my
> scratch partitions as extnew. I can try and cook something like this up, but
> I thought you might already have it, so I am posting here to see.

cp -a fs/ext{2,69}
cp -a include/linux/ext{2,69}_fs.h
cp -a include/linux/ext{2,69}_fs_i.h
cp -a include/linux/ext{2,69}_fs_sb.h
for i in fs/ext69/* include/linux/ext69*; do
vi '-cse ext|%s/(ext|EXT)2/\169/g|x' $i;
done
vi '-c/EXT/|y|pu|s/2/69/|s/Second/FUBAR/|x' fs/Config.in
vi '-c/ext2/|y|pu|s/ext2/ext69/g|//|y|pu|&g|//|y|pu|&g|//|y|pu|&g|x' include/linux/fs.h

had done the trick the last time I needed something like that, but that was a
long time ago...
Cheers,
Al