2002-02-27 18:21:05

by Hanna Linder

Subject: lockmeter results comparing 2.4.17, 2.5.3, and 2.5.5


Congratulations to everyone working to reduce contention on the
Big Kernel Lock. According to this benchmark, your work has paid off!
John Hawkes of SGI asked me to beta-test his 2.5.5 version of lockmeter
last night (that's what I get for asking when it would be out). The results
were interesting enough to post.
All three runs were on an 8-way SMP system running dbench with
up to 16 clients, 10 times each. The results are at http://lse.sf.net/locking .
Throughput numbers are not included yet; I need to rerun dbench without
lockmeter to get accurate throughput results.

(Read down the Con(tention) column.)
TOTAL is for the whole system.
kernel_flag is for every function holding the BKL, combined.


SPINLOCKS HOLD WAIT
UTIL CON MEAN( MAX ) MEAN( MAX )(% CPU) TOTAL NOWAIT SPIN RJECT NAME

2.4.17:

13.7% 2.2us( 43ms) 31us( 43ms)(20.0%) 232292367 86.3% 13.7% 0.00% *TOTAL*

33.9% 40.3% 11us( 43ms) 51us( 43ms)( 8.2%) 19725127 59.7% 40.3% 0% kernel_flag


2.5.3:
11.1% 1.0us( 21ms) 8.2us( 18ms)( 3.8%) 738953957 88.9% 11.1% 0.00% *TOTAL*

10.4% 22.6% 8.3us( 21ms) 23us( 18ms)(0.81%) 27982565 77.4% 22.6% 0% kernel_flag


2.5.5:

8.6% 1.6us( 100ms) 30us( 86ms)( 9.4%) 783373441 91.4% 8.6% 0.00% *TOTAL*

1.2% 0.33% 2.5us( 50ms) 1167us( 43ms)(0.23%) 12793605 99.7% 0.33% 0% kernel_flag



Hanna Linder
IBM Linux Technology Center
[email protected]


2002-02-27 18:35:20

by Martin J. Bligh

Subject: Re: [Lse-tech] lockmeter results comparing 2.4.17, 2.5.3, and 2.5.5

> 2.5.5:
>
> 8.6% 1.6us( 100ms) 30us( 86ms)( 9.4%) 783373441 91.4% 8.6% 0.00% *TOTAL*

Whilst it's great to see BKL contention going down, this:

0.16% 0.25% 0.7us( 100ms) 252us( 86ms)(0.02%) 6077746 99.8% 0.25% 0% inode_lock
0.03% 0.11% 0.6us( 55us) 2.1us( 9.9us)(0.00%) 1322338 99.9% 0.11% 0% __mark_inode_dirty+0x48
0.00% 0% 0.7us( 5.9us) 0us 391 100% 0% 0% get_new_inode+0x28
0.00% 0.22% 2.5us( 50us) 495us( 28ms)(0.00%) 50397 99.8% 0.22% 0% iget4+0x3c
0.03% 0.28% 0.6us( 26us) 30us( 58ms)(0.00%) 1322080 99.7% 0.28% 0% insert_inode_hash+0x44
0.04% 0.29% 0.5us( 39us) 332us( 86ms)(0.01%) 2059365 99.7% 0.29% 0% iput+0x68
0.03% 0.30% 0.7us( 57us) 422us( 77ms)(0.01%) 1323036 99.7% 0.30% 0% new_inode+0x1c
0.03% 8.3% 63ms( 100ms) 3.8us( 3.8us)(0.00%) 12 91.7% 8.3% 0% prune_icache+0x1c
0.00% 0% 1.0us( 5.2us) 0us 34 100% 0% 0% sync_unlocked_inodes+0x10
0.00% 0% 1.0us( 2.4us) 0us 93 100% 0% 0% sync_unlocked_inodes+0x110

looks a little distressing - the hold times on inode_lock by prune_icache
look bad in terms of latency (contention is still low, but people are still
waiting on it for a very long time). Is this a transient thing, or do people
think this is going to be a problem?

Martin.

2002-02-27 19:28:27

by Linus Torvalds

Subject: Re: [Lse-tech] lockmeter results comparing 2.4.17, 2.5.3, and 2.5.5

In article <67850000.1014834875@flay>,
Martin J. Bligh <[email protected]> wrote:
>
>Whilst it's great to see BKL contention going down, this:
>
>0.16% 0.25% 0.7us( 100ms) 252us( 86ms)(0.02%) 6077746 99.8% 0.25% 0% inode_lock
> 0.03% 0.11% 0.6us( 55us) 2.1us( 9.9us)(0.00%) 1322338 99.9% 0.11% 0% __mark_inode_dirty+0x48
> 0.00% 0% 0.7us( 5.9us) 0us 391 100% 0% 0% get_new_inode+0x28
> 0.00% 0.22% 2.5us( 50us) 495us( 28ms)(0.00%) 50397 99.8% 0.22% 0% iget4+0x3c
> 0.03% 0.28% 0.6us( 26us) 30us( 58ms)(0.00%) 1322080 99.7% 0.28% 0% insert_inode_hash+0x44
> 0.04% 0.29% 0.5us( 39us) 332us( 86ms)(0.01%) 2059365 99.7% 0.29% 0% iput+0x68
> 0.03% 0.30% 0.7us( 57us) 422us( 77ms)(0.01%) 1323036 99.7% 0.30% 0% new_inode+0x1c
> 0.03% 8.3% 63ms( 100ms) 3.8us( 3.8us)(0.00%) 12 91.7% 8.3% 0% prune_icache+0x1c
> 0.00% 0% 1.0us( 5.2us) 0us 34 100% 0% 0% sync_unlocked_inodes+0x10
> 0.00% 0% 1.0us( 2.4us) 0us 93 100% 0% 0% sync_unlocked_inodes+0x110

Ugh. 100ms max hold time really looks fairly disgusting.

Since that code only does list manipulation, that implies that the
inode_unused list is d*mn long.

>looks a little distressing - the hold times on inode_lock by prune_icache
>look bad in terms of latency (contention is still low, but people are still
>waiting on it for a very long time). Is this a transient thing, or do people
>think this is going to be a problem?

That already _is_ a problem: 0.1 seconds of hiccups is nasty (and
obviously while it wasn't called that many times, several other users
actually hit it for almost that long).

It _sounds_ like we may put inodes on the unused list when they really
aren't unused. I don't think you can get the mean time that high
otherwise (63ms _mean_ lock hold time? Do I really read that thing
correctly?).

Actually, looking at "prune_icache()", I think we have two problems:

 - inodes may be too often on the unused list when they cannot be unused

 - these kinds of inodes tend to cluster at the _head_ of the list, and
   we never correct for that. So once we get into that situation, we
   stay there.
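
For readers following along, the loop shape being described is roughly the
following. This is an illustrative sketch only, using 2.4-style list and
inode fields and a made-up function name, not the actual prune_icache()
source:

/*
 * Sketch: the whole walk runs under inode_lock, and entries that cannot
 * be freed are skipped in place, so they pile up at the head of
 * inode_unused and get re-scanned, under the lock, on every call.
 */
static void prune_icache_sketch(int goal)
{
        struct list_head *tmp, *next;
        LIST_HEAD(freeable);

        spin_lock(&inode_lock);
        list_for_each_safe(tmp, next, &inode_unused) {
                struct inode *inode = list_entry(tmp, struct inode, i_list);

                /* Still referenced, locked or dirty: cannot be unused.
                 * "continue" leaves it exactly where it was. */
                if (atomic_read(&inode->i_count) ||
                    (inode->i_state & (I_DIRTY | I_LOCK)))
                        continue;

                /* Genuinely unused: detach it so it can be freed later,
                 * outside inode_lock. */
                list_del(&inode->i_list);
                list_add(&inode->i_list, &freeable);
                if (--goal <= 0)
                        break;
        }
        spin_unlock(&inode_lock);

        /* dispose of everything on "freeable" with inode_lock dropped */
}

Both problems fall out of the "continue" path: an inode that cannot be
unused is neither freed nor moved, so it stays at the head of the list and
is re-walked under the lock on every subsequent prune.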

Al, comments? I note:

- in __iget(), if the inode count goes from 0 to 1, we'll leave it on
the unused list if it is locked (if it is dirty it should already
have been on the dirty list, so that case should be ok)

But I don't really see where there would be any bigger bug..

- is this the time to get rid of the "inode_unused" thing entirely? If
we drop the thing from the dcache, we might as well drop the inode
too.

Ehh..

Linus

2002-02-27 19:47:47

by Andrew Morton

[permalink] [raw]
Subject: Re: [Lse-tech] lockmeter results comparing 2.4.17, 2.5.3, and 2.5.5

"Martin J. Bligh" wrote:
>
> ...
> looks a little distressing - the hold times on inode_lock by prune_icache
> look bad in terms of latency (contention is still low, but people are still
> waiting on it for a very long time). Is this a transient thing, or do people
> think this is going to be a problem?

inode_lock hold times are a problem for other reasons. Leaving this
unfixed makes the preemptible kernel rather pointless... An ideal
fix would be to release inodes based on VM pressure against their backing
page. But I don't think anyone's started looking at inode_lock yet.

The big one is lru_list_lock, of course. I'll be releasing code in
the next couple of days which should take that off the map. Testing
would be appreciated.

I have a concern about the lockmeter results. Lockmeter appears
to be measuring lock frequency and hold times and contention. But
is it measuring the cost of the cacheline transfers? For example,
I expect that with delayed allocation and radix-tree pagecache, one
of the major remaining bottlenecks will be ownership of the superblock
semaphore's cacheline. Is this measurable? (Actually, we may
then be at the point where copy_from_user costs dominate).

-

2002-02-27 19:59:10

by Hanna Linder

Subject: Re: [Lse-tech] lockmeter results comparing 2.4.17, 2.5.3, and 2.5.5



--On Wednesday, February 27, 2002 11:45:15 -0800 Andrew Morton <[email protected]> wrote:

> The big one is lru_list_lock, of course. I'll be releasing code in
> the next couple of days which should take that off the map. Testing
> would be appreciated.

I'll be glad to run this again with your patch. Also, John Hawkes
has an even bigger system and keeps hitting lru_list_lock too.
>
> I have a concern about the lockmeter results. Lockmeter appears
> to be measuring lock frequency and hold times and contention. But
> is it measuring the cost of the cacheline transfers?

This has come up a few times on lse-tech. Lockmeter doesn't
measure cacheline hits/misses/bouncing. However, someone said
kernprof could be used to access the performance registers on the Pentium
chip to get this info. I don't know anyone who has tried that, though.
I am working on a patch to decrease cacheline bouncing and it
would be great to see some specific results. Is anyone working on a tool
that could measure cache hits/misses/bouncing?

Hanna



2002-02-27 20:02:24

by Martin J. Bligh

Subject: Re: [Lse-tech] lockmeter results comparing 2.4.17, 2.5.3, and 2.5.5

> inode_lock hold times are a problem for other reasons. Leaving this
> unfixed makes the preemptible kernel rather pointless... An ideal
> fix would be to release inodes based on VM pressure against their backing
> page. But I don't think anyone's started looking at inode_lock yet.
>
> The big one is lru_list_lock, of course. I'll be releasing code in
> the next couple of days which should take that off the map. Testing
> would be appreciated.

Seeing as people seem to be interested ... there are some big holders
of BKL around too - do_exit shows up badly (50ms in the data Hanna
posted, and I've seen that a lot before). I've seen sync_old_buffers
hold the BKL for 64ms on an 8-way Specweb99 run (22GB of RAM?)
(though this was on an older 2.4 kernel, and might be fixed by now).

M.

2002-02-27 20:17:24

by Andrew Morton

Subject: Re: [Lse-tech] lockmeter results comparing 2.4.17, 2.5.3, and 2.5.5

"Martin J. Bligh" wrote:
>
> > inode_lock hold times are a problem for other reasons. Leaving this
> > unfixed makes the preemptible kernel rather pointless... An ideal
> > fix would be to release inodes based on VM pressure against their backing
> > page. But I don't think anyone's started looking at inode_lock yet.
> >
> > The big one is lru_list_lock, of course. I'll be releasing code in
> > the next couple of days which should take that off the map. Testing
> > would be appreciated.
>
> Seeing as people seem to be interested ... there are some big holders
> of BKL around too - do_exit shows up badly (50ms in the data Hanna
> posted, and I've seen that a lot before).

That'll be where exit() takes down the task's address space.
zap_page_range(). That's a nasty one.

> I've seen sync_old_buffers
> hold the BKL for 64ms on an 8-way Specweb99 run (22GB of RAM?)
> (though this was on an older 2.4 kernel, and might be fixed by now).

That will still be there - presumably it's where we walk the
per-superblock dirty inode list. Hmm.

For lru_list_lock we can do an end-around by not using
buffers at all.

The other big one is truncate_inode_pages(). With ratcache (the
radix-tree pagecache) it's not a contention problem, but it is a latency
problem.
I expect that we can drastically reduce the lock hold time
there by simply snipping the wholly-truncated pages out of
the tree, and thus privatising them so they can be disposed
of outside any locking.
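
The pattern being described, detach under the lock and dispose outside it,
looks roughly like the sketch below. It uses 2.4-style names
(pagecache_lock, mapping->clean_pages, page->list) and a hypothetical
helper __remove_from_page_tree(); with the radix-tree pagecache the lock
and the lookup structure differ, but the shape is the same:

/*
 * Sketch only: snip the wholly-truncated pages out of the shared
 * structures while holding the lock, then do the expensive freeing with
 * no lock held, since the pages on "doomed" are now private to us.
 */
static void truncate_pages_sketch(struct address_space *mapping,
                                  unsigned long start)
{
        struct list_head *curr, *next;
        LIST_HEAD(doomed);

        spin_lock(&pagecache_lock);
        list_for_each_safe(curr, next, &mapping->clean_pages) {
                struct page *page = list_entry(curr, struct page, list);

                if (page->index < start)        /* not truncated */
                        continue;

                __remove_from_page_tree(mapping, page); /* hypothetical */
                list_del(&page->list);
                list_add(&page->list, &doomed);
        }
        spin_unlock(&pagecache_lock);

        /* No locks held from here on. */
        list_for_each_safe(curr, next, &doomed) {
                struct page *page = list_entry(curr, struct page, list);

                list_del(&page->list);
                page_cache_release(page);
        }
}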

-

2002-02-27 21:33:17

by Linus Torvalds

Subject: Re: [Lse-tech] lockmeter results comparing 2.4.17, 2.5.3, and 2.5.5

In article <[email protected]>,
Andrew Morton <[email protected]> wrote:
>"Martin J. Bligh" wrote:
>>
>> Seeing as people seem to be interested ... there are some big holders
>> of BKL around too - do_exit shows up badly (50ms in the data Hanna
>> posted, and I've seen that a lot before).
>
>That'll be where exit() takes down the task's address space.
>zap_page_range(). That's a nasty one.

No, lock_kernel happens after exit_mm, and in fact I suspect it's not
really needed at all any more except for the current "sem_exit()". I
think most everything else is threaded already.

(Hmm.. Maybe disassociate_ctty() too).

So minimizing the BKL footprint in do_exit() should be pretty much
trivial: all the really interesting stuff should be ok already.
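
A condensed schematic of the ordering being described here (not the
actual kernel/exit.c source; only the calls named above are shown):

NORET_TYPE void do_exit(long code)
{
        struct task_struct *tsk = current;

        /* ... */
        exit_mm(tsk);           /* address space is torn down here,
                                 * with no BKL held                  */

        lock_kernel();          /* BKL taken only afterwards, essentially
                                 * for the SysV semaphore undo and the
                                 * controlling-tty teardown              */
        sem_exit();
        disassociate_ctty(1);
        /* ... exit_notify(), schedule(); never returns */
}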

Linus

2002-02-27 21:50:58

by Alexander Viro

Subject: Re: [Lse-tech] lockmeter results comparing 2.4.17, 2.5.3, and 2.5.5



On Wed, 27 Feb 2002, Andrew Morton wrote:

> "Martin J. Bligh" wrote:
> >
> > ...
> > looks a little distressing - the hold times on inode_lock by prune_icache
> > look bad in terms of latency (contention is still low, but people are still
> > waiting on it for a very long time). Is this a transient thing, or do people
> > think this is going to be a problem?
>
> inode_lock hold times are a problem for other reasons.

ed mm/vmscan.c <<EOF
/shrink_icache_memory/s/priority/1/
w
q
EOF

and repeat the tests. Unreferenced inodes == useless inodes. Aging is
already taken care of in dcache and anything that had fallen through
is fair game.

2002-02-27 23:13:46

by Hanna Linder

Subject: Re: [Lse-tech] lockmeter results comparing 2.4.17, 2.5.3, and 2.5.5


--On Wednesday, February 27, 2002 16:48:07 -0500 Alexander Viro <[email protected]> wrote:
>
>> > looks a little distressing - the hold times on inode_lock by prune_icache
>> > look bad in terms of latency (contention is still low, but people are still
>> > waiting on it for a very long time). Is this a transient thing, or do people
>> > think this is going to be a problem?
>>
>> inode_lock hold times are a problem for other reasons.
>
> ed mm/vmscan.c <<EOF
> /shrink_icache_memory/s/priority/1/
> w
> q
> EOF
>
> and repeat the tests. Unreferenced inodes == useless inodes. Aging is
> already taken care of in dcache and anything that had fallen through
> is fair game.
>

I applied your patch and reran the tests. Looks like you solved the problem:

SPINLOCKS HOLD WAIT
UTIL CON MEAN( MAX ) MEAN( MAX )(% CPU) TOTAL NOWAIT SPIN RJECT NAME

7.1% 0.7us( 19ms) 7.7us( 17ms)( 2.6%) 779799309 92.9% 7.1% 0.00% *TOTAL*

0.16% 0.29% 0.6us( 91us) 2.2us( 46us)(0.00%) 5495642 99.7% 0.29% 0% inode_lock

0.90% 0.47% 1.4us( 19ms) 280us( 17ms)(0.10%) 12681192 99.5% 0.47% 0% kernel_flag

The results are again stored at http://lse.sf.net/locking .

Hanna



2002-02-27 23:33:01

by Hanna Linder

Subject: Re: [Lse-tech] lockmeter results comparing 2.4.17, 2.5.3, and 2.5.5



--On Wednesday, February 27, 2002 16:48:07 -0500 Alexander Viro <[email protected]> wrote:

>
> ed mm/vmscan.c <<EOF
> /shrink_icache_memory/s/priority/1/
> w
> q
> EOF
>
> and repeat the tests. Unreferenced inodes == useless inodes. Aging is
> already taken care of in dcache and anything that had fallen through
> is fair game.
>

FYI:

The patch does this:

*** vmscan.c.orig Wed Feb 27 14:09:49 2002
--- vmscan.c Wed Feb 27 14:10:16 2002
***************
*** 578,584 ****
return 0;

shrink_dcache_memory(priority, gfp_mask);
! shrink_icache_memory(priority, gfp_mask);
#ifdef CONFIG_QUOTA
shrink_dqcache_memory(DEF_PRIORITY, gfp_mask);
#endif
--- 578,584 ----
return 0;

shrink_dcache_memory(priority, gfp_mask);
! shrink_icache_memory(1, gfp_mask);
#ifdef CONFIG_QUOTA
shrink_dqcache_memory(DEF_PRIORITY, gfp_mask);
#endif
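
The reason a constant 1 is the aggressive choice: the 2.4-era
shrink_icache_memory() scaled its pruning goal by the priority argument,
roughly as in the paraphrase below (not an exact copy of the source), so
passing 1 asks it to try to reap the entire unused list instead of a
fraction of it on each call:

/* Paraphrase for illustration only. */
void shrink_icache_memory(int priority, int gfp_mask)
{
        int count = 0;

        if (priority)
                count = inodes_stat.nr_unused / priority;

        prune_icache(count);    /* priority == 1: goal is every unused inode */
}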

2002-02-28 08:40:58

by Ravikiran G Thirumalai

Subject: Re: [Lse-tech] lockmeter results comparing 2.4.17, 2.5.3, and 2.5.5

On Wed, Feb 27, 2002 at 11:57:57AM -0800, Hanna Linder wrote:
>
> >
> > I have a concern about the lockmeter results. Lockmeter appears
> > to be measuring lock frequency and hold times and contention. But
> > is it measuring the cost of the cacheline transfers?
>
> This has come up a few times on lse-tech. Lockmeter doesnt
> measure cacheline hits/misses/bouncing. However, someone said
> kernprof could be used to access performance registers on the Pentium
> chip to get this info. I don't know anyone who has tried that though.

I have tried using kernprof; what kernprof can do is list
the per-routine performance event counts for a given workload.

I use it with benchmark runs, using a script that starts profiling,
runs the tests, and then stops profiling, something like:

kernprof -r
kernprof -b -c all -d pmc -a 0x0020024
your-test-to-be-executed
kernprof -e

The event above, 0x0020024, is L2_LINES_IN for the P6 family.

I've used this to measure cache effects with SMP-friendly counters,
but for lockmetering this would not be as good as having cache event
measurements built into lockmeter...

-Kiran