2006-12-04 01:53:39

by Bill Huey

[permalink] [raw]
Subject: [PATCH 0/4] lock stat for 2.6.19-rt1


Hello,

I'm please to announce the "lock stat" patch which instruments all locks in
the kernel tracking contention against the slow path of an a rt_mutex in -rt
kernels before blocking and possibly priority inheritence boosting. This is
for 2.6.19-rt1. In the real time patch, all locks such as a mutex, semaphore,
etc... are funneled through the rt_mutex as well as locks such as a spinlock
which preserve the semantics of BKL when hitting scheduler code.

On a contention, it increments a field atomically that's pointed in a variable
within the rt_mutex structure. It records the numbers of times the rt_mutex
is contented against during the lifetime of that lock.

By using a hash constructed out of __FILE__, __func__ and __LINE__ of the
actual lock initializer function (spin_lock_init() and friends) single
backing object within a red black tree based dictionary can serve to record
these statistics versus explicitly tracking the life and death of a lock
explicitly. This is more meaningful than having these statistics recorded
on an individual lock instance. It's a higher order of expression with how
lock statistics are represented since it's about the macro behavior of the
kernel and therefore is more meaningful.

I'll defer the other parts of the discussion to the comments in the patch
itself. I just want to get this patch right now with further delaying this.

It's quite similar to how lockdep also hooks into the kernel but I wasn't
aware of how lockdep worked when I was writing this code only to be told by
Peter Zijlstra after his review of the patch.

This is superior to lock meter for a number of reasons. First it's the -rt
kernel (and we all know how I love the -rt kernel :) ) it is Linux specific,
it hooks into lockdep closely and has a friendly proc interface to go with
it with the ability to reset stats during load testing runs by simply
writing to the file, /proc/lock_stat/contention.

Sample output is attached:

-------------------------------
[1, 1, 0] {finish_wait, -, 0}
[1, 1, 0] {lru_add_drain, -, 0}
[1, 1, 0] {rt_down_read, -, 0}
[1, 1, 0] {__sync_inodes, -, 0}
[1, 1, 0] {unix_create1, -, 0}
[1110, 1, 0] {kmem_cache_free, -, 0}
[11, 1, 0] {pdflush, -, 0}
[1, 123148, 0] {init_buffer_head, fs/buffer.c, 2968}
[1, 2, 0] {__svc_create, net/sunrpc/svc.c, 328}
[14, 11446, 0] {sock_lock_init, net/core/sock.c, 809}
[14373, 1, 0] {ide_intr, -, 0}
[14878, 13, 0] {create_workqueue_thread, kernel/workqueue.c, 347}
[15, 123148, 0] {init_buffer_head, fs/buffer.c, 2969}
[153, 1, 0] {__free_pages_ok, -, 0}
[17, 1, 0] {init_timers_cpu, kernel/timer.c, 1834}
[17533, 1, 0] {_atomic_dec_and_spin_lock, -, 0}
[203, 1, 0] {release_pages, -, 0}
[2040, 1, 0] {do_wait, -, 0}
[2, 1, 0] {cache_alloc_refill, -, 0}
[2, 1, 0] {sys_setsid, -, 0}
[23, 996648, 0] {inode_init_once, fs/inode.c, 199}
[242, 996648, 0] {inode_init_once, fs/inode.c, 197}
[251, 1, 0] {get_zone_pcp, -, 0}
[25, 1248, 0] {anon_vma_ctor, mm/rmap.c, 168}
[2725, 1, 0] {lock_kernel, -, 0}
[3, 1, 0] {lock_timer_base, -, 0}
[3, 1, 0] {serio_thread, -, 0}
[31, 81, 0] {kmem_list3_init, mm/slab.c, 411}
[32, 1, 0] {lru_cache_add_active, -, 0}
[3, 256, 0] {init, kernel/futex.c, 2776}
[33, 1, 0] {get_page_from_freelist, -, 0}
[35, 1, 0] {do_lookup, -, 0}
[37, 1, 0] {file_kill, -, 0}
[39, 1, 0] {__cache_free, -, 0}
[403, 204, 0] {sighand_ctor, kernel/fork.c, 1469}
[4, 16, 0] {xprt_create_transport, net/sunrpc/xprt.c, 930}
[4, 29369, 0] {mm_init, kernel/fork.c, 368}
[49, 2, 0] {journal_init_common, fs/jbd/journal.c, 666}
[5, 1, 0] {mm_init, -, 0}
[52, 1, 0] {lookup_mnt, -, 0}
[549, 0, 1349567] {, d_alloc_string, 1}
[5871, 2, 0] {journal_init_common, fs/jbd/journal.c, 667}
[59, 1, 0] {nv_probe, drivers/net/forcedeth.c, 4241}
[6, 11443, 0] {sock_init_data, net/core/sock.c, 1503}
[8, 2, 0] {journal_init_revoke, fs/jbd/revoke.c, 261}
[8264, 996648, 0] {inode_init_once, fs/inode.c, 196}
[8552, 996648, 0] {inode_init_once, fs/inode.c, 193}
[86, 1, 0] {exit_mmap, -, 0}
[86, 86258, 0] {init_waitqueue_head, kernel/wait.c, 15}
[8968, 1, 0] {iget_locked, -, 0}
[9, 1, 0] {tty_ldisc_try, -, 0}
[9, 1, 0] {uart_set_options, drivers/serial/serial_core.c, 1876}
@contention events = 86925
@found = 9
-------------------------------

The first column is the number of the times that object was contented against.
The second is the number of times this lock object was initialized. The third
is the annotation scheme that directly attaches the lock object (spinlock,
etc..) in line with the function initializer to avoid the binary tree lookup.

This shows contention issue with inode access around what looks like a lock
around a tree structure (inode_init_once @ lines 193/196) as well as lower
layers of the IO system around the anticipatory scheduler and interrupt handlers.
Keep in mind this is -rt so we're using interrupt threads here.

This result is a combination of series of normal "find" loads as a result of
the machine being idle overnight in a 2x AMD process set up. I'm sure that
bigger machines should generate more interesting loads from the IBM folks and
others.

Thanks to all of the folks on #offtopic2 (Zwane Mwaikambo, nikita, Rik van Riel,
the ever increasingly bitter Con Kolivas, Bill Irwin, etc...) for talking me
through early stages of this process and general support.

I'm requesting review, comments and inclusion of this into the -rt series of
patches as well as general help.

I've CCed folks that might be interested in this patch outside of the normal
core -rt that I thought might find this interesting or have been friendly with
me about -rt in the past.

bill


Attachments:
(No filename) (5.52 kB)
corefiles.diff (22.42 kB)
Download all attachments

2006-12-04 12:21:32

by bert hubert

[permalink] [raw]
Subject: Re: [PATCH 0/4] lock stat for 2.6.19-rt1

On Sun, Dec 03, 2006 at 05:53:23PM -0800, Bill Huey wrote:

> [8264, 996648, 0] {inode_init_once, fs/inode.c, 196}
> [8552, 996648, 0] {inode_init_once, fs/inode.c, 193}

Impressive, Bill!

How tightly is your work bound to -rt? Iow, any chance of separating the
two? Or should we even want to?

> The first column is the number of the times that object was contented against.
> The second is the number of times this lock object was initialized. The third
> is the annotation scheme that directly attaches the lock object (spinlock,
> etc..) in line with the function initializer to avoid the binary tree lookup.

I don't entirely get the third item, can you elaborate a bit?

Do you have a feeling of the runtime overhead?

Thanks.

--
http://www.PowerDNS.com Open source, database driven DNS Software
http://netherlabs.nl Open and Closed source services

2006-12-04 17:09:05

by Bill Huey

[permalink] [raw]
Subject: Re: [PATCH 0/4] lock stat for 2.6.19-rt1

On Mon, Dec 04, 2006 at 01:21:29PM +0100, bert hubert wrote:
> On Sun, Dec 03, 2006 at 05:53:23PM -0800, Bill Huey wrote:
>
> > [8264, 996648, 0] {inode_init_once, fs/inode.c, 196}
> > [8552, 996648, 0] {inode_init_once, fs/inode.c, 193}
>
> Impressive, Bill!
>
> How tightly is your work bound to -rt? Iow, any chance of separating the
> two? Or should we even want to?

Right now, it's solely dependent on -rt, but the basic mechanisms of how
it works is pretty much the same as lockdep. Parts of it should be
moveable across to regular kernels. The only remaining parts would be
altering the lock structure (spinlock, mutex, etc...) to have a pointer
that it can use to do the statistical tracking. If it's NULL then it's
dunamically allocated and handled differently and allocated when the
lock is contended against.

There's other uses for it as well. Think about RCU algorithms that need
to spin-try to make sure the update of an element or the validation of
it's data is safe to do. If an object was created to detect those spins
it'll track what is effectively contention as well as it is represented
in that algorithm. I've seen an RCU radix tree implementation do something
like that.

> > The first column is the number of the times that object was contented against.
> > The second is the number of times this lock object was initialized. The third
> > is the annotation scheme that directly attaches the lock object (spinlock,
> > etc..) in line with the function initializer to avoid the binary tree lookup.
>
> I don't entirely get the third item, can you elaborate a bit?

I hate the notion of a search, so I directly set the pointer to a
object that's statically defined. It means that the object is directly
connected to what is suppose to backing it without doing a runtime
search. When that's done, all of the numbers from the second column
get moved to the third.

> Do you have a feeling of the runtime overhead?

There's minimal runtime overhead I believe since it's only doing an
atomic increment of the stats during the slowpath before the thread
is actually shoved into a wait queue. That's something that happpens
seldomly.

bill

2006-12-04 23:57:38

by Bill Huey

[permalink] [raw]
Subject: Re: [PATCH 0/4] lock stat for 2.6.19-rt1

On Mon, Dec 04, 2006 at 09:08:56AM -0800, Bill Huey wrote:
> On Mon, Dec 04, 2006 at 01:21:29PM +0100, bert hubert wrote:
> > How tightly is your work bound to -rt? Iow, any chance of separating the
> > two? Or should we even want to?
>
> There's other uses for it as well. Think about RCU algorithms that need
> to spin-try to make sure the update of an element or the validation of
> it's data is safe to do. If an object was created to detect those spins
> it'll track what is effectively contention as well as it is represented
> in that algorithm. I've seen an RCU radix tree implementation do something
> like that.

That was a horrible paragraph plus I'm bored at the moment. What I meant is
that lockless algorithms occasionally have a spin-try associated with it as
well that might possibly validate the data that's updated against the entire
data structure for some kind of consistency cohernecy or possibly on an
individual element. That retry or spin can be considered a contention as well
and it can be made aware to this lock-stat patch just by connecting the
actually occurance of retry logic against a backing object.

I need to be more conscious about proofreading what I write before sending
it off. Was this clear ?

bill