Ingo,
We did some benchmarking on 2.6.19-rt14, compared it with the 2.6.19
kernel and noticed several slowdowns. The test machine is a 2-socket
Woodcrest machine with your default configuration.
Netperf TCP Streaming was slower by 40% (1 server and 1 client, each
bound to separate CPU cores on different sockets; network loopback mode
was used).
Volanomark was slower by 80% (server and clients communicate over
network loopback; idle time goes from 1% to 60%).
Re-Aim7 was slower by 40% (idle time goes from 0% to 20%).
Wonder if you have any suggestions on what could cause the slowdown.
We've tried disabling CONFIG_NO_HZ and it didn't help much.
Thanks.
Tim
On Fri, 2006-12-22 at 13:39 -0800, Chen, Tim C wrote:
> Wonder if you have any suggestions on what could cause the slowdown.
> We've tried disabling CONFIG_NO_HZ and it didn't help much.
Did you compare PREEMPT_RT with vanilla PREEMPT ? Or one version of
PREEMPT_RT vs. another ?
Daniel
Chen, Tim C wrote:
> Ingo,
>
> We did some benchmarking on 2.6.19-rt14, compared it with 2.6.19
> kernel and noticed several slowdowns. The test machine is a 2 socket
> woodcrest machine with your default configuration.
>
> Netperf TCP Streaming was slower by 40% ( 1 server and 1 client
> each bound to separate cpu cores on different socket, network
> loopback mode was used).
>
> Volanomark was slower by 80% (Server and Clients communicate with
> network loopback mode. Idle time goes from 1% to 60%)
>
> Re-Aim7 was slower by 40% (idle time goes from 0% to 20%)
>
> Wonder if you have any suggestions on what could cause the slowdown.
> We've tried disabling CONFIG_NO_HZ and it didn't help much.
>
> Thanks.
>
> Tim
Take a look at this. Not sure if this is the same problem but it looks
like a candidate. I can definitely say that some latencies I have seen
in my recent testing have gone away in the last few versions
2.6.20-rc1-rt{3,4}.
Sorry I missed the URL first time around.
http://marc.theaimsgroup.com/?l=linux-kernel&m=116670388527349&w=2
--
kr
Chen, Tim C wrote:
> Ingo,
>
> We did some benchmarking on 2.6.19-rt14, compared it with 2.6.19
> kernel and noticed several slowdowns. The test machine is a 2 socket
> woodcrest machine with your default configuration.
>
> Netperf TCP Streaming was slower by 40% ( 1 server and 1 client
> each bound to separate cpu cores on different socket, network
> loopback mode was used).
>
> Volanomark was slower by 80% (Server and Clients communicate with
> network loopback mode. Idle time goes from 1% to 60%)
>
> Re-Aim7 was slower by 40% (idle time goes from 0% to 20%)
>
> Wonder if you have any suggestions on what could cause the slowdown.
> We've tried disabling CONFIG_NO_HZ and it didn't help much.
>
> Thanks.
>
> Tim
Take a look at this. Not sure if this is the same problem but it looks
like a candidate. I can definitely say that some latencies I have seen
in my recent testing have gone away in the last few versions
2.6.20-rc1-rt{3,4}.
--
kr
* Chen, Tim C <[email protected]> wrote:
> Ingo,
>
> We did some benchmarking on 2.6.19-rt14, compared it with 2.6.19
> kernel and noticed several slowdowns. The test machine is a 2 socket
> woodcrest machine with your default configuration.
cool - thanks for the feedback! Running the 64-bit kernel, right?
> Netperf TCP Streaming was slower by 40% ( 1 server and 1 client each
> bound to separate cpu cores on different socket, network loopback mode
> was used).
>
> Volanomark was slower by 80% (Server and Clients communicate with
> network loopback mode. Idle time goes from 1% to 60%)
>
> Re-Aim7 was slower by 40% (idle time goes from 0% to 20%)
>
> Wonder if you have any suggestions on what could cause the slowdown.
> We've tried disabling CONFIG_NO_HZ and it didn't help much.
while some slowdown is to be expected, did idle time increase
significantly in each case? If yes then this is the effect of lock
contention. Lock contention effects are 'magnified' by PREEMPT_RT. For
example if you run a 128-thread workload where all threads use the same
lock then the -rt kernel can act as if it were a 128-way box (!). This
way, by running -rt you'll see scalability problems a lot sooner than on
real hardware. In other words: PREEMPT_RT in essence simulates the
scalability behavior of up to an infinite number of CPUs. (with the
exception of cachemiss emulation ;) [the effect is not this precise, but
that's the rough trend]
the good news: any fix for such problems in -rt is useful for the
upstream kernel too, because it improves scalability on real hardware as
well. The bad news: this work hasn't been finished yet ;-)
We've fixed a couple of contention points discovered by -rt this way, in
particular Peter's excellent scalability patches at:
http://programming.kicks-ass.net/kernel-patches/
were done based on lock-contention profiling feedback. I've included the
lockless pagecache and the s_files scalability patches which were very
useful already. The lru_cache_add one is the next one for me to try,
plus on i686 -rt kernels highmem locking hurts, hence the
highmem_rewrite.patch (not in -rt yet either). Plus there's the really
exciting concurrent-pagecache patch-queue from Peter, which will be
interesting to try too.
If you'd like to profile this yourself then the lowest-cost way of
profiling lock contention on -rt is to use the yum kernel and run the
attached trace-it-lock-prof.c code on the box while your workload is in
'steady state' (and is showing those extended idle times):
./trace-it-lock-prof > trace.txt
this captures up to 1 second worth of system activity, on the current
CPU. Then you can construct the histogram via:
grep -A 1 ' __schedule()<-' trace.txt | cut -d: -f2- | sort |
uniq -c | sort -n > prof.txt
i just did this for "hackbench 20", and it gave me:
38 kfree()<-skb_release_data()<-kfree_skbmem()<-(-1)()
40 kmem_cache_free()<-kfree_skbmem()<-__kfree_skb()<-(-1)()
47 __schedule()<-preempt_schedule_irq()<-retint_kernel()<-__spin_unlock_irqrestore()
173 rt_read_lock()<-sock_def_readable()<-unix_stream_sendmsg()<-(-1)()
173 __schedule()<-schedule()<-rt_spin_lock_slowlock()<-__lock_text_start()
274 __schedule()<-schedule()<-posix_cpu_timers_thread()<-kthread()
277 __schedule()<-schedule()<-ksoftirqd()<-kthread()
365 kmem_cache_alloc_node()<-__alloc_skb()<-sock_alloc_send_skb()<-(-1)()
366 kmem_cache_alloc_node()<-__kmalloc_node()<-__alloc_skb()<-(-1)()
814 __schedule()<-schedule()<-rt_spin_lock_slowlock()<-rt_spin_lock()
1819 __schedule()<-schedule()<-int_careful()<-<ffffe405>()
2346 __schedule()<-schedule()<-schedule_timeout()<-unix_stream_recvmsg()
2346 sock_aio_read()<-do_sync_read()<-vfs_read()<-(-1)()
(note the output mixes the two lines of symbolic stack backtrace entries
that are generated by the tracer, so the last two lines for example are
for the same contention/preemption point.)
this means that the biggest contention points (or natural rescheduling
points) in this workload are sock_aio_read()->unix_stream_recvmsg() -
not unexpected. But there's also ~10% of contention in the kmalloc code,
meaning that for this workload the per-CPU slabs are probably set a bit
too low. There's also ~5% of contention from sock_def_readable()'s use
of an rwlock.
with more CPUs and more tasks you'll see these effects in a more drastic
way.
Ingo
Ingo Molnar wrote:
>
> cool - thanks for the feedback! Running the 64-bit kernel, right?
>
Yes, 64-bit kernel was used.
>
> while some slowdown is to be expected, did in each case idle time
> increase significantly?
Volanomark and Re-Aim7 ran at close to 0% idle time on the 2.6.19 kernel.
Idle time increased significantly for Volanomark (to 60% idle) and
Re-Aim7 (to 20% idle) with the -rt kernel. For netperf, the system was
60% idle for both the 2.6.19 and -rt kernels, and the change in idle time
was not significant.
> If yes then this is the effect of lock
> contention. Lock contention effects are 'magnified' by PREEMPT_RT. For
> example if you run 128 threads workload that all use the same lock
> then
> the -rt kernel can act as if it were a 128-way box (!). This way by
> running -rt you'll see scalability problems alot sooner than on real
> hardware. In other words: PREEMPT_RT in essence simulates the
> scalability behavior of up to an infinite amount of CPUs. (with the
> exception of cachemiss emulation ;) [the effect is not this precise,
> but
> that's the rough trend]
Turning off PREEMPT_RT for the 2.6.20-rc2-rt0 kernel restored most of
the performance of Volanomark and Re-Aim7, and idle time is close to 0%.
So the benchmarks with a large number of threads are affected more by
PREEMPT_RT.
For netperf TCP streaming, the regression improved from 40% down to 20%
relative to the 2.6.20-rc2 kernel. There are only a server and a client
process for netperf, so the underlying reason for the change in
performance is probably different.
>
> If you'd like to profile this yourself then the lowest-cost way of
> profiling lock contention on -rt is to use the yum kernel and run the
> attached trace-it-lock-prof.c code on the box while your workload is
> in 'steady state' (and is showing those extended idle times):
>
> ./trace-it-lock-prof > trace.txt
>
Thanks for the pointer. Will let you know of any relevant traces.
Thanks.
Tim
On Tue, Dec 26, 2006 at 04:51:21PM -0800, Chen, Tim C wrote:
> Ingo Molnar wrote:
> > If you'd like to profile this yourself then the lowest-cost way of
> > profiling lock contention on -rt is to use the yum kernel and run the
> > attached trace-it-lock-prof.c code on the box while your workload is
> > in 'steady state' (and is showing those extended idle times):
> >
> > ./trace-it-lock-prof > trace.txt
>
> Thanks for the pointer. Will let you know of any relevant traces.
Tim,
http://mmlinux.sourceforge.net/public/patch-2.6.20-rc2-rt2.lock_stat.patch
You can also apply this patch to get more precise statistics down to
the lock. For example:
...
[50, 30, 279 :: 1, 0] {tty_ldisc_try, -, 0}
[5, 5, 0 :: 19, 0] {alloc_super, fs/super.c, 76}
[5, 5, 3 :: 1, 0] {__free_pages_ok, -, 0}
[5728, 862, 156 :: 2, 0] {journal_init_common, fs/jbd/journal.c, 667}
[594713, 79020, 4287 :: 60818, 0] {inode_init_once, fs/inode.c, 193}
[602, 0, 0 :: 1, 0] {lru_cache_add_active, -, 0}
[63, 5, 59 :: 1, 0] {lookup_mnt, -, 0}
[6425, 378, 103 :: 24, 0] {initialize_tty_struct, drivers/char/tty_io.c, 3530}
[6708, 1, 225 :: 1, 0] {file_move, -, 0}
[67, 8, 15 :: 1, 0] {do_lookup, -, 0}
[69, 0, 0 :: 1, 0] {exit_mmap, -, 0}
[7, 0, 0 :: 1, 0] {uart_set_options, drivers/serial/serial_core.c, 1876}
[76, 0, 0 :: 1, 0] {get_zone_pcp, -, 0}
[7777, 5, 9 :: 1, 0] {as_work_handler, -, 0}
[8689, 0, 0 :: 15, 0] {create_workqueue_thread, kernel/workqueue.c, 474}
[89, 7, 6 :: 195, 0] {sighand_ctor, kernel/fork.c, 1474}
@contention events = 1791177
@found = 21
This is the output from /proc/lock_stat/contention. The first column is
the number of contentions that resulted in a full block of the task, the
second is the number of times the mutex owner was active on a per-CPU
run queue in the scheduler, and the third is the number of times Steve
Rostedt's ownership handoff code averted a full block. Peter Zijlstra
used it initially during his files_lock work.
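To make the columns concrete, here is one of the lines from above again,
annotated with those meanings (the braces are the lock's initialization
point):

  [594713, 79020, 4287 :: 60818, 0]   {inode_init_once, fs/inode.c, 193}
   |       |      |
   |       |      +--- handoff averted a full block 4287 times
   |       +---------- owner was found active on a per-CPU run queue 79020 times
   +------------------ 594713 contentions fully blocked the task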
Overhead of the patch is very low since it is only recording stuff in the
slow path of the rt-mutex implementation.
Writing to that file clears all of the stats for a fresh run with a
benchmark. This should give a precise point at which any contention would
happen in -rt. In general, -rt should do about as well as the stock kernel
minus the overhead of interrupt threads.
Since the last release, I've added checks for whether the task is running
as "current" on a run queue to see if adaptive spins would be useful in -rt.
These new stats show that only a small percentage of events would benefit
from the use of adaptive spins in front of an rt-mutex. Any implementation
of it would have little impact on the system. It's not the mechanism but
the raw MP work itself that contributes to the good MP performance of Linux.
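Roughly, the idea being measured looks like this (an illustrative sketch
only, not code from the patch; owner_on_cpu() stands in for the
task_on_rq()-style check the patch adds):

        /*
         * Adaptive wait sketch for the rt-mutex slow path: spin while the
         * lock owner is actively running on another CPU, fall back to
         * blocking as soon as the owner is scheduled out.
         */
        static int adaptive_wait(struct rt_mutex *lock, struct task_struct *owner)
        {
                while (rt_mutex_owner(lock) == owner) {
                        if (!owner_on_cpu(owner))
                                return 0;       /* owner not running: block */
                        cpu_relax();            /* owner running: keep spinning */
                }
                return 1;                       /* lock released while we spun */
        }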
Apply and have fun.
bill
Ingo Molnar wrote:
>
> If you'd like to profile this yourself then the lowest-cost way of
> profiling lock contention on -rt is to use the yum kernel and run the
> attached trace-it-lock-prof.c code on the box while your workload is
> in 'steady state' (and is showing those extended idle times):
>
> ./trace-it-lock-prof > trace.txt
>
> this captures up to 1 second worth of system activity, on the current
> CPU. Then you can construct the histogram via:
>
> grep -A 1 ' __schedule()<-' trace.txt | cut -d: -f2- | sort |
> uniq -c | sort -n > prof.txt
>
I did lock profiling on Volanomark as suggested and obtained the
profile that is listed below.
  246 __sched_text_start()<-schedule()<-rt_spin_lock_slowlock()<-__lock_text_start()
  264 rt_mutex_slowunlock()<-rt_mutex_unlock()<-rt_up_read()<-(-1)()
  334 __sched_text_start()<-schedule()<-posix_cpu_timers_thread()<-kthread()
  437 __sched_text_start()<-schedule()<-do_futex()<-sys_futex()
  467 (-1)()<-(0)()<-(0)()<-(0)()
  495 __sched_text_start()<-preempt_schedule()<-__spin_unlock_irqrestore()<-rt_mutex_adjust_prio()
  497 __netif_rx_schedule()<-netif_rx()<-loopback_xmit()<-(-1)()
  499 __sched_text_start()<-schedule()<-schedule_timeout()<-sk_wait_data()
  500 tcp_recvmsg()<-sock_common_recvmsg()<-sock_recvmsg()<-(-1)()
  503 __rt_down_read()<-rt_down_read()<-do_futex()<-(-1)()
 1160 __sched_text_start()<-schedule()<-ksoftirqd()<-kthread()
 1433 __rt_down_read()<-rt_down_read()<-futex_wake()<-(-1)()
 1497 child_rip()<-(-1)()<-(0)()<-(0)()
 1936 __sched_text_start()<-schedule()<-rt_mutex_slowlock()<-rt_mutex_lock()
Looks like the idle time I saw was due to lock contention during the
call to futex_wake, which requires acquisition of current->mm->mmap_sem.
Many of the Java threads share an mm, which results in concurrent access
to the common mm.
It looks like in the -rt case there is no special treatment of read
locking, so the read-lock accesses contend under __rt_down_read. In the
non-rt case, __down_read distinguishes read-lock accesses and the read
lockings do not contend.
Things are made worse here as this delayed waking up the processes
blocked on the futex.
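For reference, the futex_wake() path in 2.6.x looks roughly like this
(heavily trimmed sketch; on -rt the down_read() below maps to
rt_down_read()):

        static int futex_wake(u32 __user *uaddr, int nr_wake)
        {
                union futex_key key;
                int ret;

                /* every waker takes the shared mm's mmap_sem for reading */
                down_read(&current->mm->mmap_sem);

                ret = get_futex_key(uaddr, &key);
                if (unlikely(ret != 0))
                        goto out;

                /* ... hash the key and wake up to nr_wake waiters ... */
        out:
                up_read(&current->mm->mmap_sem);
                return ret;
        }

With dozens of Java threads sharing one mm, every one of these wakers is
a reader of the same mmap_sem.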
See also a snippet of the latency_trace below.
<idle>-0 2D..2 5821us!: thread_return <softirq--31> (150 20)
<idle>-0 2DN.1 6278us : __sched_text_start()<-cpu_idle()<-start_secondary()<-(-1)()
<idle>-0 2DN.1 6278us : (0)()<-(0)()<-(0)()<-(0)()
java-6648 2D..2 6280us+: thread_return <<idle>-0> (20 -4)
java-6648 2D..1 6296us : try_to_wake_up()<-wake_up_process()<-wakeup_next_waiter()<-rt_mutex_slowunlock()
java-6648 2D..1 6296us : rt_mutex_unlock()<-rt_up_read()<-do_futex()<-(-1)()
java-6648 2D..2 6297us : effective_prio <<...>-6673> (-4 -4)
java-6648 2D..2 6297us : __activate_task <<...>-6673> (-4 1)
java-6648 2.... 6297us < (-11)
java-6648 2.... 6298us+> sys_futex (0000000000afaf50 0000000000000001 0000000000000001)
java-6648 2...1 6315us : __sched_text_start()<-schedule()<-rt_mutex_slowlock()<-rt_mutex_lock()
java-6648 2...1 6315us : __rt_down_read()<-rt_down_read()<-futex_wake()<-(-1)()
java-6648 2D..2 6316us+: deactivate_task <java-6648> (-4 1)
<idle>-0 2D..2 6318us+: thread_return <java-6648> (-4 20)
<idle>-0 2DN.1 6327us : __sched_text_start()<-cpu_idle()<-start_secondary()<-(-1)()
<idle>-0 2DN.1 6328us+: (0)()<-(0)()<-(0)()<-(0)()
java-6629 2D..2 6330us+: thread_return <<idle>-0> (20 -4)
java-6629 2D..1 6347us : try_to_wake_up()<-wake_up_process()<-wakeup_next_waiter()<-rt_mutex_slowunlock()
java-6629 2D..1 6347us : rt_mutex_unlock()<-rt_up_read()<-futex_wake()<-(-1)()
java-6629 2D..2 6348us : effective_prio <java-6235> (-4 -4)
java-6629 2D..2 6349us : __activate_task <java-6235> (-4 1)
java-6629 2.... 6350us+< (0)
java-6629 2.... 6352us+> sys_futex (0000000000afc1dc 0000000000000001 0000000000000001)
java-6629 2...1 6368us : __sched_text_start()<-schedule()<-rt_mutex_slowlock()<-rt_mutex_lock()
java-6629 2...1 6368us : __rt_down_read()<-rt_down_read()<-futex_wake()<-(-1)()
java-6629 2D..2 6369us+: deactivate_task <java-6629> (-4 1)
<idle>-0 2D..2 6404us!: thread_return <java-6629> (-4 20)
<idle>-0 2DN.1 6584us : __sched_text_start()<-cpu_idle()<-start_secondary()<-(-1)()
Thanks.
Tim
* Chen, Tim C <[email protected]> wrote:
> Ingo Molnar wrote:
> >
> > If you'd like to profile this yourself then the lowest-cost way of
> > profiling lock contention on -rt is to use the yum kernel and run the
> > attached trace-it-lock-prof.c code on the box while your workload is
> > in 'steady state' (and is showing those extended idle times):
> >
> > ./trace-it-lock-prof > trace.txt
> >
> > this captures up to 1 second worth of system activity, on the current
> > CPU. Then you can construct the histogram via:
> >
> > grep -A 1 ' __schedule()<-' trace.txt | cut -d: -f2- | sort |
> > uniq -c | sort -n > prof.txt
> >
>
> I did lock profiling on Volanomark as suggested and obtained the
> profile that is listed below.
thanks - this is really useful!
> 246 __sched_text_start()<-schedule()<-rt_spin_lock_slowlock()<-__lock_text_start()
> 264 rt_mutex_slowunlock()<-rt_mutex_unlock()<-rt_up_read()<-(-1)()
> 334 __sched_text_start()<-schedule()<-posix_cpu_timers_thread()<-kthread()
> 437 __sched_text_start()<-schedule()<-do_futex()<-sys_futex()
> 467 (-1)()<-(0)()<-(0)()<-(0)()
> 495 __sched_text_start()<-preempt_schedule()<-__spin_unlock_irqrestore()<-rt_mutex_adjust_prio()
> 497 __netif_rx_schedule()<-netif_rx()<-loopback_xmit()<-(-1)()
> 499 __sched_text_start()<-schedule()<-schedule_timeout()<-sk_wait_data()
> 500 tcp_recvmsg()<-sock_common_recvmsg()<-sock_recvmsg()<-(-1)()
> 503 __rt_down_read()<-rt_down_read()<-do_futex()<-(-1)()
> 1160 __sched_text_start()<-schedule()<-ksoftirqd()<-kthread()
> 1433 __rt_down_read()<-rt_down_read()<-futex_wake()<-(-1)()
> 1497 child_rip()<-(-1)()<-(0)()<-(0)()
> 1936 __sched_text_start()<-schedule()<-rt_mutex_slowlock()<-rt_mutex_lock()
>
> Looks like the idle time I saw was due to lock contention during call
> to futex_wake, which requires acquisition of current->mm->mmap_sem.
> Many of the java threads share mm and result in concurrent access to
> common mm. [...]
ah. This explains why i'm not seeing this bad contention in a comparable
workload (hackbench.c): because hackbench uses processes not threads.
> [...] Looks like under rt case there is no special treatment to read
> locking so the read lock accesses are contended under __rt_down_read.
> For non rt case, __down_read makes the distinction for read lock
> access and the read lockings do not contend.
yeah, makes sense. I'll do something about this.
> Things are made worse here as this delayed waking up processes locked
> by the futex. See also a snippet of the latency_trace below.
>
> <idle>-0 2D..2 5821us!: thread_return <softirq--31> (150 20)
> <idle>-0 2DN.1 6278us : __sched_text_start()<-cpu_idle()<-start_secondary()<-(-1)()
> <idle>-0 2DN.1 6278us : (0)()<-(0)()<-(0)()<-(0)()
> java-6648 2D..2 6280us+: thread_return <<idle>-0> (20 -4)
> java-6648 2D..1 6296us : try_to_wake_up()<-wake_up_process()<-wakeup_next_waiter()<-rt_mutex_slowunlock()
> java-6648 2D..1 6296us : rt_mutex_unlock()<-rt_up_read()<-do_futex()<-(-1)()
> java-6648 2D..2 6297us : effective_prio <<...>-6673> (-4 -4)
> java-6648 2D..2 6297us : __activate_task <<...>-6673> (-4 1)
> java-6648 2.... 6297us < (-11)
> java-6648 2.... 6298us+> sys_futex (0000000000afaf50 0000000000000001 0000000000000001)
> java-6648 2...1 6315us : __sched_text_start()<-schedule()<-rt_mutex_slowlock()<-rt_mutex_lock()
> java-6648 2...1 6315us : __rt_down_read()<-rt_down_read()<-futex_wake()<-(-1)()
> java-6648 2D..2 6316us+: deactivate_task <java-6648> (-4 1)
> <idle>-0 2D..2 6318us+: thread_return <java-6648> (-4 20)
> <idle>-0 2DN.1 6327us : __sched_text_start()<-cpu_idle()<-start_secondary()<-(-1)()
> <idle>-0 2DN.1 6328us+: (0)()<-(0)()<-(0)()<-(0)()
> java-6629 2D..2 6330us+: thread_return <<idle>-0> (20 -4)
> java-6629 2D..1 6347us : try_to_wake_up()<-wake_up_process()<-wakeup_next_waiter()<-rt_mutex_slowunlock()
> java-6629 2D..1 6347us : rt_mutex_unlock()<-rt_up_read()<-futex_wake()<-(-1)()
> java-6629 2D..2 6348us : effective_prio <java-6235> (-4 -4)
> java-6629 2D..2 6349us : __activate_task <java-6235> (-4 1)
> java-6629 2.... 6350us+< (0)
> java-6629 2.... 6352us+> sys_futex (0000000000afc1dc 0000000000000001 0000000000000001)
> java-6629 2...1 6368us : __sched_text_start()<-schedule()<-rt_mutex_slowlock()<-rt_mutex_lock()
> java-6629 2...1 6368us : __rt_down_read()<-rt_down_read()<-futex_wake()<-(-1)()
> java-6629 2D..2 6369us+: deactivate_task <java-6629> (-4 1)
> <idle>-0 2D..2 6404us!: thread_return <java-6629> (-4 20)
> <idle>-0 2DN.1 6584us : __sched_text_start()<-cpu_idle()<-start_secondary()<-(-1)()
indeed - basically the mm semaphore is a common resource here. I suspect
you'll see somewhat better numbers by using idle=poll or idle=mwait (or
are you using those options already?).
(could you send me the whole trace if you still have it? It would be
interesting to see a broader snippet from the life of individual java
threads.)
Ingo
* Bill Huey <[email protected]> wrote:
> On Tue, Dec 26, 2006 at 04:51:21PM -0800, Chen, Tim C wrote:
> > Ingo Molnar wrote:
> > > If you'd like to profile this yourself then the lowest-cost way of
> > > profiling lock contention on -rt is to use the yum kernel and run the
> > > attached trace-it-lock-prof.c code on the box while your workload is
> > > in 'steady state' (and is showing those extended idle times):
> > >
> > > ./trace-it-lock-prof > trace.txt
> >
> > Thanks for the pointer. Will let you know of any relevant traces.
>
> Tim,
> http://mmlinux.sourceforge.net/public/patch-2.6.20-rc2-rt2.lock_stat.patch
>
> You can also apply this patch to get more precise statistics down to
> the lock. [...]
your patch looks pretty ok to me in principle. A couple of suggestions
to make it more mergable:
- instead of BUG_ON()s please use DEBUG_LOCKS_WARN_ON() and make sure
the code is never entered again if one assertion has been triggered.
Pass down a return result of '0' to signal failure. See
kernel/lockdep.c about how to do this (see the sketch after this list).
One thing we don't need is bugs in instrumentation bringing down a
machine.
- remove dead (#if 0) code
- Documentation/CodingStyle compliance - the code is not ugly per se
but still looks a bit 'alien' - please try to make it look Linuxish,
if i apply this we'll probably stick with it forever. This is the
major reason i haven't applied it yet.
- the xfs/wrap_lock change looks bogus - the lock is initialized
already. What am i missing?
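for the DEBUG_LOCKS_WARN_ON() point, the pattern i mean is roughly this
(sketch only; lock_stat_check() and the !ls->file check are just
illustrative, not code from your patch):

        /*
         * lockdep-style non-fatal assertion: warn once, turn the
         * instrumentation off, let callers bail out via the return value.
         */
        static int lock_stat_check(struct lock_stat *ls)
        {
                if (unlikely(!debug_locks))
                        return 0;               /* checks already disabled */

                if (DEBUG_LOCKS_WARN_ON(!ls->file))
                        return 0;               /* warns and calls debug_locks_off() */

                return 1;
        }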
Ingo
On Sat, 2006-12-30 at 12:19 +0100, Ingo Molnar wrote:
>
> - Documentation/CodingStyle compliance - the code is not ugly per se
> but still looks a bit 'alien' - please try to make it look Linuxish,
> if i apply this we'll probably stick with it forever. This is the
> major reason i havent applied it yet.
I did some cleanup while reviewing the patch, nothing very exciting but
it's an attempt to bring it more into the "Linuxish" scope .. I didn't
compile it, so be warned.
There's lots of ifdef'd code under CONFIG_LOCK_STAT inside rtmutex.c; I
suspect it would be a benefit to move that all into a header and ifdef
only in the header.
Daniel
Signed-Off-By: Daniel Walker <[email protected]>
---
include/linux/lock_stat.h | 25 +++++++++++++++----------
kernel/lock_stat.c | 34 ++++++++++++++++++++++------------
2 files changed, 37 insertions(+), 22 deletions(-)
Index: linux-2.6.19/include/linux/lock_stat.h
===================================================================
--- linux-2.6.19.orig/include/linux/lock_stat.h
+++ linux-2.6.19/include/linux/lock_stat.h
@@ -26,9 +26,9 @@
#include <asm/atomic.h>
typedef struct lock_stat {
- char function[KSYM_NAME_LEN];
- int line;
- char *file;
+ char function[KSYM_NAME_LEN];
+ int line;
+ char *file;
atomic_t ncontended;
atomic_t nhandoff;
@@ -52,7 +52,7 @@ struct task_struct;
#define LOCK_STAT_INIT(field)
#define LOCK_STAT_INITIALIZER(field) { \
__FILE__, __FUNCTION__, __LINE__, \
- ATOMIC_INIT(0), LIST_HEAD_INIT(field)}
+ ATOMIC_INIT(0), LIST_HEAD_INIT(field) }
#define LOCK_STAT_NOTE __FILE__, __FUNCTION__, __LINE__
#define LOCK_STAT_NOTE_VARS _file, _function, _line
@@ -84,9 +84,12 @@ extern void lock_stat_sys_init(void);
#define lock_stat_is_initialized(o) ((unsigned long) (*o)->file)
-extern void lock_stat_note_contention(lock_stat_ref_t *ls, struct task_struct *owner, unsigned long ip, int handoff);
+extern void lock_stat_note_contention(lock_stat_ref_t *ls,
+ struct task_struct *owner,
+ unsigned long ip, int handoff);
extern void lock_stat_print(void);
-extern void lock_stat_scoped_attach(lock_stat_ref_t *_s, LOCK_STAT_NOTE_PARAM_DECL);
+extern void lock_stat_scoped_attach(lock_stat_ref_t *_s,
+ LOCK_STAT_NOTE_PARAM_DECL);
#define ksym_strcmp(a, b) strncmp(a, b, KSYM_NAME_LEN)
#define ksym_strcpy(a, b) strncpy(a, b, KSYM_NAME_LEN)
@@ -102,10 +105,11 @@ static inline char * ksym_strdup(const c
#define LS_INIT(name, h) { \
/*.function,*/ .file = h, .line = 1, \
- .nhandoff = ATOMIC_INIT(0), .ntracked = 0, .ncontended = ATOMIC_INIT(0), \
+ .nhandoff = ATOMIC_INIT(0), .ntracked = 0, \
+ .ncontended = ATOMIC_INIT(0), \
.list_head = LIST_HEAD_INIT(name.list_head), \
- .rb_node.rb_left = NULL, .rb_node.rb_left = NULL \
- }
+ .rb_node.rb_left = NULL, \
+ .rb_node.rb_left = NULL } \
#define DECLARE_LS_ENTRY(name) \
extern struct lock_stat _lock_stat_##name##_entry
@@ -114,7 +118,8 @@ static inline char * ksym_strdup(const c
*/
#define DEFINE_LS_ENTRY(name) \
- struct lock_stat _lock_stat_##name##_entry = LS_INIT(_lock_stat_##name##_entry, #name "_string")
+ struct lock_stat _lock_stat_##name##_entry = \
+ LS_INIT(_lock_stat_##name##_entry, #name "_string")
DECLARE_LS_ENTRY(d_alloc);
DECLARE_LS_ENTRY(eventpoll_init_file);
Index: linux-2.6.19/kernel/lock_stat.c
===================================================================
--- linux-2.6.19.orig/kernel/lock_stat.c
+++ linux-2.6.19/kernel/lock_stat.c
@@ -71,7 +71,9 @@ static char null_string[] = "";
static char static_string[] = "-";
static char special_static_string[] = "-";
-struct lock_stat _lock_stat_null_entry = LS_INIT(_lock_stat_null_entry, null_string);
+struct lock_stat _lock_stat_null_entry =
+ LS_INIT(_lock_stat_null_entry, null_string);
+
EXPORT_SYMBOL(_lock_stat_null_entry);
static DEFINE_LS_ENTRY(inline); /* lock_stat_inline_entry */
@@ -111,10 +113,12 @@ static DEFINE_LS_ENTRY(tcp_init_1);
static DEFINE_LS_ENTRY(tcp_init_2);
*/
-/* I should never have to create more entries that this since I audited the kernel
- * and found out that there are only ~1500 or so places in the kernel where these
- * rw/spinlocks are initialized. Use the initialization points as a hash value to
- * look up the backing objects */
+/*
+ * I should never have to create more entries that this since I audited the
+ * kernel and found out that there are only ~1500 or so places in the kernel
+ * where these rw/spinlocks are initialized. Use the initialization points as a
+ * hash value to look up the backing objects
+ */
#define MAGIC_ENTRIES 1600
@@ -156,7 +160,7 @@ void lock_stat_sys_init(void)
int i;
for (i = 0; i < MAGIC_ENTRIES; ++i) {
- s = (struct lock_stat *) kmalloc(sizeof(struct lock_stat), GFP_KERNEL);
+ s = kmalloc(sizeof(struct lock_stat), GFP_KERNEL);
if (s) {
lock_stat_init(s);
@@ -182,7 +186,8 @@ void lock_stat_sys_init(void)
}
static
-struct lock_stat *lock_stat_allocate_object(void) {
+struct lock_stat *lock_stat_allocate_object(void)
+{
unsigned long flags;
spin_lock_irqsave(&free_store_lock, flags);
@@ -208,7 +213,8 @@ struct lock_stat *lock_stat_allocate_obj
* Comparision, greater to/less than or equals. zero is equals.
* Suitable for ordered insertion into a tree
*
- * The (entry - object) order of comparison is switched from the parameter definition
+ * The (entry - object) order of comparison is switched from the parameter
+ * definition
*/
static
int lock_stat_compare_objs(struct lock_stat *a, struct lock_stat *b)
@@ -254,7 +260,8 @@ void lock_stat_print_hash(struct lock_st
}
static
-void lock_stat_print_entry(struct seq_file *seq, struct lock_stat *o) {
+void lock_stat_print_entry(struct seq_file *seq, struct lock_stat *o)
+{
seq_printf(seq, "[%d, %d, %d :: %d, %d]\t\t{%s, %s, %d}\n",
atomic_read(&o->ncontended),
atomic_read(&o->nspinnable),
@@ -282,7 +289,8 @@ void lock_stat_rbtree_db_zero(void) {
}
static
-void _lock_stat_rbtree_db_zero(struct rb_node *node) {
+void _lock_stat_rbtree_db_zero(struct rb_node *node)
+{
struct lock_stat *o = container_of(node, struct lock_stat, rb_node);
@@ -295,12 +303,14 @@ void _lock_stat_rbtree_db_zero(struct rb
}
static
-void lock_stat_rbtree_db_print(struct seq_file *seq) {
+void lock_stat_rbtree_db_print(struct seq_file *seq)
+{
_lock_stat_rbtree_db_print(seq, lock_stat_rbtree_db.rb_node, 0);
}
static
-void _lock_stat_rbtree_db_print(struct seq_file *seq, struct rb_node *node, int level) {
+void _lock_stat_rbtree_db_print(struct seq_file *seq, struct rb_node *node, int level)
+{
struct lock_stat *o = container_of(node, struct lock_stat, rb_node);
On Sat, Dec 30, 2006 at 06:56:08AM -0800, Daniel Walker wrote:
> On Sat, 2006-12-30 at 12:19 +0100, Ingo Molnar wrote:
>
> >
> > - Documentation/CodingStyle compliance - the code is not ugly per se
> > but still looks a bit 'alien' - please try to make it look Linuxish,
> > if i apply this we'll probably stick with it forever. This is the
> > major reason i havent applied it yet.
>
> I did some cleanup while reviewing the patch, nothing very exciting but
> it's an attempt to bring it more into the "Linuxish" scope .. I didn't
> compile it so be warned.
>
> There lots of ifdef'd code under CONFIG_LOCK_STAT inside rtmutex.c I
> suspect it would be a benefit to move that all into a header and ifdef
> only in the header .
Ingo and Daniel,
I'll try and make it more Linuxish. It's one of the reasons why I posted
it, since I knew it would need some kind of help in that arena and I've
been in need of feedback regarding it. Originally, I picked a style that
made what I was doing extremely obvious and clear to facilitate
development, which is the rationale behind it.
I'll make those changes and we can progressively pass it back and forth
to see if it passes.
bill
Ingo Molnar wrote:
>
> (could you send me the whole trace if you still have it? It would be
> interesting to see a broader snippet from the life of individual java
> threads.)
>
> Ingo
Sure, I'll send it to you separately due to the size of the complete
trace.
Tim
Bill Huey (hui) wrote:
> On Tue, Dec 26, 2006 at 04:51:21PM -0800, Chen, Tim C wrote:
>> Ingo Molnar wrote:
>>> If you'd like to profile this yourself then the lowest-cost way of
>>> profiling lock contention on -rt is to use the yum kernel and run
>>> the attached trace-it-lock-prof.c code on the box while your
>>> workload is in 'steady state' (and is showing those extended idle
>>> times):
>>>
>>> ./trace-it-lock-prof > trace.txt
>>
>> Thanks for the pointer. Will let you know of any relevant traces.
>
> Tim,
>
http://mmlinux.sourceforge.net/public/patch-2.6.20-rc2-rt2.lock_stat.patch
>
> You can also apply this patch to get more precise statistics down to
> the lock. For example:
>
Bill,
I'm having some problems getting this patch to run stably. I'm
encountering errors like the one in the trace that follows:
Thanks.
Tim
Unable to handle kernel NULL pointer dereference at 0000000000000008
RIP:
[<ffffffff802cd6e4>] lock_stat_note_contention+0x12d/0x1c3
PGD 0
Oops: 0000 [1] PREEMPT SMP
CPU 1
Modules linked in: autofs4 sunrpc dm_mirror dm_mod video sbs i2c_ec dock button battery ac uhci_hcd ehci_hcd i2d
Pid: 0, comm: swapper Not tainted 2.6.20-rc2-rt2 #4
RIP: 0010:[<ffffffff802cd6e4>] [<ffffffff802cd6e4>]
lock_stat_note_contention+0x12d/0x1c3
RSP: 0018:ffff81013fdb3d28 EFLAGS: 00010097
RAX: ffff81013fd68018 RBX: ffff81013fd68000 RCX: 0000000000000000
RDX: ffffffff8026762e RSI: 0000000000000000 RDI: ffffffff8026762e
RBP: ffff81013fdb3df8 R08: ffff8100092bab60 R09: ffff8100092aafc8
R10: 0000000000000001 R11: 0000000000000000 R12: ffff81013fd68030
R13: 0000000000000000 R14: 0000000000000046 R15: 0000002728d5ecd0
FS: 0000000000000000(0000) GS:ffff81013fd078c0(0000) knlGS:0000000000000000
CS: 0010 DS: 0018 ES: 0018 CR0: 000000008005003b
CR2: 0000000000000008 CR3: 0000000000201000 CR4: 00000000000006e0
Process swapper (pid: 0, threadinfo ffff81013fdb2000, task ffff81013fdb14e0)
Stack: 0000000000000000 0000000200000001 0000000000000000 0000000000000000
0000000000000000 0000000000000000 0000000000000100 000000000000000e
000000000000000e 0000000000000000 0000000000000000 ffff8100092bc440
Call Trace:
[<ffffffff802af471>] rt_mutex_slowtrylock+0x84/0x9b
[<ffffffff80266909>] rt_mutex_trylock+0x2e/0x30
[<ffffffff8026762e>] rt_spin_trylock+0x9/0xb
[<ffffffff8029beef>] get_next_timer_interrupt+0x34/0x226
[<ffffffff802a8b4d>] hrtimer_stop_sched_tick+0xb6/0x138
[<ffffffff8024b1ce>] cpu_idle+0x1b/0xdd
[<ffffffff80278edd>] start_secondary+0x2ed/0x2f9
---------------------------
| preempt count: 00000003 ]
| 3-level deep critical section nesting:
----------------------------------------
.. [<ffffffff8024b28a>] .... cpu_idle+0xd7/0xdd
.....[<ffffffff80278edd>] .. ( <= start_secondary+0x2ed/0x2f9)
.. [<ffffffff80267837>] .... __spin_lock_irqsave+0x18/0x42
.....[<ffffffff802af406>] .. ( <= rt_mutex_slowtrylock+0x19/0x9b)
.. [<ffffffff802678db>] .... __spin_trylock+0x14/0x4c
.....[<ffffffff80268540>] .. ( <= oops_begin+0x23/0x6f)
skipping trace printing on CPU#1 != -1
Code: 49 8b 45 08 8b 78 18 75 0d 49 8b 04 24 f0 ff 80 94 00 00 00
RIP [<ffffffff802cd6e4>] lock_stat_note_contention+0x12d/0x1c3
RSP <ffff81013fdb3d28>
CR2: 0000000000000008
<3>BUG: sleeping function called from invalid context swapper(0) at kernel/rtmutex.c:1312
in_atomic():1 [00000002], irqs_disabled():1
Call Trace:
[<ffffffff8026ec53>] dump_trace+0xbe/0x3cd
[<ffffffff8026eff3>] show_trace+0x3a/0x58
[<ffffffff8026f026>] dump_stack+0x15/0x17
[<ffffffff8020b75e>] __might_sleep+0x103/0x10a
[<ffffffff80266e44>] rt_mutex_lock_with_ip+0x1e/0xac
[<ffffffff802aff07>] __rt_down_read+0x49/0x4d
[<ffffffff802aff16>] rt_down_read+0xb/0xd
[<ffffffff8029fc96>] blocking_notifier_call_chain+0x19/0x3f
[<ffffffff80296301>] profile_task_exit+0x15/0x17
[<ffffffff80215572>] do_exit+0x25/0x8de
[<ffffffff8026a2c1>] do_page_fault+0x7d4/0x856
[<ffffffff802681ad>] error_exit+0x0/0x84
[<ffffffff802cd6e4>] lock_stat_note_contention+0x12d/0x1c3
[<ffffffff802af471>] rt_mutex_slowtrylock+0x84/0x9b
[<ffffffff80266909>] rt_mutex_trylock+0x2e/0x30
[<ffffffff8026762e>] rt_spin_trylock+0x9/0xb
[<ffffffff8029beef>] get_next_timer_interrupt+0x34/0x226
[<ffffffff802a8b4d>] hrtimer_stop_sched_tick+0xb6/0x138
[<ffffffff8024b1ce>] cpu_idle+0x1b/0xdd
[<ffffffff80278edd>] start_secondary+0x2ed/0x2f9
On Tue, Jan 02, 2007 at 02:51:05PM -0800, Chen, Tim C wrote:
> Bill,
>
> I'm having some problem getting this patch to run stablely. I'm
> encoutering errors like that in the trace that follow:
>
> Thanks.
> Tim
>
> Unable to handle kernel NULL pointer dereference at 0000000000000008
Yes, those are the reason why I have some aggressive asserts in the code
to try track down the problem. Try this:
http://mmlinux.sourceforge.net/public/patch-2.6.20-rc2-rt2.1.lock_stat.patch
It's got some cosmetic cleanup in it to make it more Linux-ish instead
of me trying to reinvent some kind of Smalltalk system in the kernel. I'm
trying to address all of Ingo's complaints about the code, so it's still a
work in progress, namely the style issues (I'd like help/suggestions on
that) and assert conventions.
It might be the case that the lock isn't known to the lock stats code yet.
It's got some technical overlap with lockdep in that a lock might not be
known yet, and that is causing the crash.
Try that patch and report back to me what happens.
bill
On Tue, Jan 02, 2007 at 03:12:34PM -0800, Bill Huey wrote:
> On Tue, Jan 02, 2007 at 02:51:05PM -0800, Chen, Tim C wrote:
> > Bill,
> >
> > I'm having some problem getting this patch to run stablely. I'm
> > encoutering errors like that in the trace that follow:
>
> It might the case that the lock isn't know to the lock stats code yet.
> It's got some technical overlap with lockdep in that a lock might not be
> known yet and is causing a crashing.
The stack trace and code examination reveal that if that structure in the
timer code is used before it's initialized by CPU bringup, it'll cause
problems like that crash. I'll look at it later on tonight.
bill
On Sat, Dec 30, 2006 at 12:19:40PM +0100, Ingo Molnar wrote:
> your patch looks pretty ok to me in principle. A couple of suggestions
> to make it more mergable:
>
> - instead of BUG_ON()s please use DEBUG_LOCKS_WARN_ON() and make sure
> the code is never entered again if one assertion has been triggered.
> Pass down a return result of '0' to signal failure. See
> kernel/lockdep.c about how to do this. One thing we dont need are
> bugs in instrumentation bringing down a machine.
I'm using non-fatal error checking instead of BUG_ON. BUG_ON was a more
aggressive way that I used to find problems initially.
> - remove dead (#if 0) code
Done.
> - Documentation/CodingStyle compliance - the code is not ugly per se
> but still looks a bit 'alien' - please try to make it look Linuxish,
> if i apply this we'll probably stick with it forever. This is the
> major reason i havent applied it yet.
I reformatted most of the patch to be 80-column limited. I simplified a
number of names, but I'm open to suggestions and patches on how to go
about this. Much of this code was a style experiment, but now I have to
make this more mergable.
> - the xfs/wrap_lock change looks bogus - the lock is initialized
> already. What am i missing?
Correct. This has been removed.
I've applied Daniel Walker's changes as well.
Patch here:
http://mmlinux.sourceforge.net/public/patch-2.6.20-rc2-rt2.2.lock_stat.patch
bill
Bill Huey (hui) wrote:
>
> Patch here:
>
>
http://mmlinux.sourceforge.net/public/patch-2.6.20-rc2-rt2.2.lock_stat.patch
>
> bill
This version is much better and ran stably.
If I'm reading the output correctly, the locks are listed by their
initialization point (the function, file and line # at which a lock is
initialized). That's good information to identify the lock.
However, it would be more useful if there were also information about
where the locking was initiated from and who was trying to obtain the
lock.
Tim
On Wed, Jan 03, 2007 at 03:59:28PM -0800, Chen, Tim C wrote:
> Bill Huey (hui) wrote:
> http://mmlinux.sourceforge.net/public/patch-2.6.20-rc2-rt2.2.lock_stat.patch
>
> This version is much better and ran stablely.
>
> If I'm reading the output correctly, the locks are listed by
> their initialization point (function, file and line # that a lock is
> initialized).
> That's good information to identify the lock.
Yes, that's correct.
Good to know that. What did the output reveal?
It could be extended with per-pid/futex tracking for userspace apps, but
that has yet to be done. It might require changes to glibc or some kind
of dynamic tracing to communicate information about that lock to kernel
space. There are other kernel uses as well. It's just a basic mechanism
for a variety of uses. This patch has some LTT and DTrace-isms to it.
What's your intended use again, summarized? Futex contention? I'll read
the first posting again.
> However, it will be more useful if there is information about where the
> locking
> was initiated from and who was trying to obtain the lock.
It would add quite a bit more overhead, but I believe it could be done
with lockdep directly, in conjunction with this patch. However, it should
be specific enough that a kernel code examination at the key points of
all users of the lock would show where the problem places are, as well
as the users.
bill
Bill Huey (hui) wrote:
>
> Good to know that. What did the output reveal ?
>
> What's your intended use again summarized ? futex contention ? I'll
> read the first posting again.
>
Earlier I used latency_trace and figured out that there was read
contention on mm->mmap_sem during calls to _rt_down_read by Java threads
when I was running Volanomark. That caused the slowdown of the -rt kernel
compared to the non-rt kernel. The output from lock_stat confirms that
mm->mmap_sem was indeed the most heavily contended lock.
Tim
On Wed, Jan 03, 2007 at 04:25:46PM -0800, Chen, Tim C wrote:
> Earlier I used latency_trace and figured that there was read contention
> on mm->mmap_sem during call to _rt_down_read by java threads
> when I was running volanomark. That caused the slowdown of the rt
> kernel
> compared to non-rt kernel. The output from lock_stat confirm
> that mm->map_sem was indeed the most heavily contended lock.
Can you sort the output ("sort -n" or whatever..) and post it without the
zeroed entries?
I'm curious about how that statistical spike compares to the rest of the
system activity. I'm sure that'll get the attention of Peter as well, and
maybe he'll do something about it? :)
bill
Bill Huey (hui) wrote:
> Can you sort the output ("sort -n" what ever..) and post it without
> the zeroed entries ?
>
> I'm curious about how that statistical spike compares to the rest of
> the system activity. I'm sure that'll get the attention of Peter as
> well and maybe he'll do something about it ? :)
>
Here's the lockstat trace. You can cross reference it with my
earlier post.
http://marc.theaimsgroup.com/?l=linux-kernel&m=116743637422465&w=2
The contention happened on mm->mmap_sem shared
by the java threads during futex_wake's invocation of _rt_down_read.
Tim
@contention events = 247149
@failure_events = 146
@lookup_failed_scope = 175
@lookup_failed_static = 43
@static_found = 16
[1, 113, 77 -- 32768, 0] {tcp_init, net/ipv4/tcp.c, 2426}
[2, 759, 182 -- 1, 0] {lock_kernel, -, 0}
[13, 0, 7 -- 4, 0] {kmem_cache_free, -, 0}
[25, 3564, 9278 -- 1, 0] {lock_timer_base, -, 0}
[56, 9528, 24552 -- 3, 0] {init_timers_cpu, kernel/timer.c, 1842}
[471, 52845, 17682 -- 10448, 0] {sock_lock_init, net/core/sock.c, 817}
[32251, 9024, 242 -- 256, 0] {init, kernel/futex.c, 2781}
[173724, 11899638, 9886960 -- 11194, 0] {mm_init, kernel/fork.c, 369}
On Wed, Jan 03, 2007 at 04:46:37PM -0800, Chen, Tim C wrote:
> Bill Huey (hui) wrote:
> > Can you sort the output ("sort -n" what ever..) and post it without
> > the zeroed entries ?
> >
> > I'm curious about how that statistical spike compares to the rest of
> > the system activity. I'm sure that'll get the attention of Peter as
> > well and maybe he'll do something about it ? :)
...
> @contention events = 247149
> @failure_events = 146
> @lookup_failed_scope = 175
> @lookup_failed_static = 43
> @static_found = 16
> [1, 113, 77 -- 32768, 0] {tcp_init, net/ipv4/tcp.c, 2426}
> [2, 759, 182 -- 1, 0] {lock_kernel, -, 0}
> [13, 0, 7 -- 4, 0] {kmem_cache_free, -, 0}
> [25, 3564, 9278 -- 1, 0] {lock_timer_base, -, 0}
> [56, 9528, 24552 -- 3, 0] {init_timers_cpu, kernel/timer.c, 1842}
> [471, 52845, 17682 -- 10448, 0] {sock_lock_init, net/core/sock.c, 817}
> [32251, 9024, 242 -- 256, 0] {init, kernel/futex.c, 2781}
> [173724, 11899638, 9886960 -- 11194, 0] {mm_init, kernel/fork.c, 369}
Thanks, the numbers look a bit weird in that the first column should
have a bigger number of events than the second column, since it is a
special-case subset. Looking at the lock_stat_note() code should show
that to be the case. Did you make a change to the output?
I can't tell from your output which are the "steal", actively-running or
overall contention stats against the lock.
Thanks
bill
On Wed, Jan 03, 2007 at 05:00:49PM -0800, Bill Huey wrote:
> On Wed, Jan 03, 2007 at 04:46:37PM -0800, Chen, Tim C wrote:
> > @contention events = 247149
> > @failure_events = 146
> > @lookup_failed_scope = 175
> > @lookup_failed_static = 43
> > @static_found = 16
> > [1, 113, 77 -- 32768, 0] {tcp_init, net/ipv4/tcp.c, 2426}
> > [2, 759, 182 -- 1, 0] {lock_kernel, -, 0}
> > [13, 0, 7 -- 4, 0] {kmem_cache_free, -, 0}
> > [25, 3564, 9278 -- 1, 0] {lock_timer_base, -, 0}
> > [56, 9528, 24552 -- 3, 0] {init_timers_cpu, kernel/timer.c, 1842}
> > [471, 52845, 17682 -- 10448, 0] {sock_lock_init, net/core/sock.c, 817}
> > [32251, 9024, 242 -- 256, 0] {init, kernel/futex.c, 2781}
> > [173724, 11899638, 9886960 -- 11194, 0] {mm_init, kernel/fork.c, 369}
>
> Thanks, the numbers look a bit weird in that the first column should
> have a bigger number of events than that second column since it is a
> special case subset. Looking at the lock_stat_note() code should show
> that to be the case. Did you make a change to the output ?
>
> I can't tell which are "steal", actively running or overall contention
> stats against the lock from your output.
Also, the output is weird in that the "contention_events" should be
a total of all of the events in the first three columns. Clearly the
number is wrong, and I don't know if the text output was mangled or
if that's accurate and my code is buggy. I have yet to see it give
me data like that in my setup.
The fourth and fifth columns are the number of times this lock was
initialized inline by a function like spin_lock_init(). It might
have a correspondence to clone() calls.
bill
Bill Huey (hui) wrote:
>
> Thanks, the numbers look a bit weird in that the first column should
> have a bigger number of events than that second column since it is a
> special case subset. Looking at the lock_stat_note() code should show
> that to be the case. Did you make a change to the output ?
No, I did not change the output. I did reset the contention counts
by doing echo "0" > /proc/lock_stat/contention.
I noticed that the first column gets reset but not the second column, so
the reset code probably needs to be checked.
Tim
On Wed, Jan 03, 2007 at 05:11:04PM -0800, Chen, Tim C wrote:
> Bill Huey (hui) wrote:
> >
> > Thanks, the numbers look a bit weird in that the first column should
> > have a bigger number of events than that second column since it is a
> > special case subset. Looking at the lock_stat_note() code should show
> > that to be the case. Did you make a change to the output ?
>
> No, I did not change the output. I did reset to the contention content
>
> by doing echo "0" > /proc/lock_stat/contention.
>
> I noticed that the first column get reset but not the second column. So
> the reset code probably need to be checked.
This should have the fix.
http://mmlinux.sf.net/public/patch-2.6.20-rc2-rt2.3.lock_stat.patch
If you can rerun it and post the results, it'll hopefully show the behavior
of that lock acquisition better.
Thanks
bill
Bill Huey (hui) wrote:
> This should have the fix.
>
>
http://mmlinux.sf.net/public/patch-2.6.20-rc2-rt2.3.lock_stat.patch
>
> If you can rerun it and post the results, it'll hopefully show the
> behavior of that lock acquisition better.
>
Here's the run with fix to produce correct statistics.
Tim
@contention events = 848858
@failure_events = 10
@lookup_failed_scope = 175
@lookup_failed_static = 47
@static_found = 17
[2, 0, 0 -- 1, 0] {journal_init_common, fs/jbd/journal.c, 667}
[2, 0, 0 -- 31, 0] {blk_init_queue_node, block/ll_rw_blk.c, 1910}
[2, 0, 0 -- 31, 0] {create_workqueue_thread, kernel/workqueue.c, 474}
[3, 3, 2 -- 16384, 0] {tcp_init, net/ipv4/tcp.c, 2426}
[4, 4, 1 -- 1, 0] {lock_kernel, -, 0}
[19, 0, 0 -- 1, 0] {kmem_cache_alloc, -, 0}
[25, 0, 0 -- 1, 0] {kfree, -, 0}
[49, 0, 0 -- 2, 0] {kmem_cache_free, -, 0}
[69, 38, 176 -- 1, 0] {lock_timer_base, -, 0}
[211, 117, 517 -- 3, 0] {init_timers_cpu, kernel/timer.c, 1842}
[1540, 778, 365 -- 7326, 0] {sock_lock_init, net/core/sock.c, 817}
[112584, 150, 6 -- 256, 0] {init, kernel/futex.c, 2781}
[597012, 183895, 136277 -- 9546, 0] {mm_init, kernel/fork.c, 369}
* Bill Huey <[email protected]> wrote:
> > - Documentation/CodingStyle compliance - the code is not ugly per se
> > but still looks a bit 'alien' - please try to make it look Linuxish,
> > if i apply this we'll probably stick with it forever. This is the
> > major reason i havent applied it yet.
>
> I reformatted most of the patch to be 80 column limited. I simplified
> a number of names, but I'm open to suggestions and patches to how to
> go about this. Much of this code was a style experiment, but now I
> have to make this more mergable.
thanks. It's looking better, but there's still quite a bit of work left:
there's a considerable amount of whitespace noise in it - lots of lines
with space/tab at the end, lines with 8 spaces instead of tabs, etc.
comment style issues:
+/* To be use for avoiding the dynamic attachment of spinlocks at runtime
+ * by attaching it inline with the lock initialization function */
the proper multi-line style is:
/*
* To be used for avoiding the dynamic attachment of spinlocks at
* runtime by attaching it inline with the lock initialization function:
*/
(note i also fixed a typo in the one above)
more unused code:
+/*
+static DEFINE_LS_ENTRY(__pte_alloc);
+static DEFINE_LS_ENTRY(get_empty_filp);
+static DEFINE_LS_ENTRY(init_waitqueue_head);
...
+*/
these:
+static int lock_stat_inited = 0;
should not be initialized to 0, that is implicit for static variables.
weird alignment here:
+void lock_stat_init(struct lock_stat *oref)
+{
+ oref->function[0] = 0;
+ oref->file = NULL;
+ oref->line = 0;
+
+ oref->ntracked = 0;
funky branching:
+ spin_lock_irqsave(&free_store_lock, flags);
+ if (!list_empty(&lock_stat_free_store)) {
+ struct list_head *e = lock_stat_free_store.next;
+ struct lock_stat *s;
+
+ s = container_of(e, struct lock_stat, list_head);
+ list_del(e);
+
+ spin_unlock_irqrestore(&free_store_lock, flags);
+
+ return s;
+ }
+ spin_unlock_irqrestore(&free_store_lock, flags);
+
+ return NULL;
that should be s = NULL in the function scope and a plain unlock and
return s.
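i.e. something like this (untested):

        static struct lock_stat *lock_stat_allocate_object(void)
        {
                struct lock_stat *s = NULL;
                unsigned long flags;

                spin_lock_irqsave(&free_store_lock, flags);
                if (!list_empty(&lock_stat_free_store)) {
                        struct list_head *e = lock_stat_free_store.next;

                        s = container_of(e, struct lock_stat, list_head);
                        list_del(e);
                }
                spin_unlock_irqrestore(&free_store_lock, flags);

                return s;
        }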
assignments mixed with arithmetics:
+static
+int lock_stat_compare_objs(struct lock_stat *x, struct lock_stat *y)
+{
+ int a = 0, b = 0, c = 0;
+
+ (a = ksym_strcmp(x->function, y->function)) ||
+ (b = ksym_strcmp(x->file, y->file)) ||
+ (c = (x->line - y->line));
+
+ return a | b | c;
the usual (and more readable) style is to separate them out explicitly:
a = ksym_strcmp(x->function, y->function);
if (!a)
return 0;
b = ksym_strcmp(x->file, y->file);
if (!b)
return 0;
return x->line == y->line;
(detail: this btw also fixes a bug in the function above AFAICS, in the
a && !b case.)
also, i'm not fully convinced we want that x->function as a string. That
makes comparisons a lot slower. Why not make it a void *, and resolve to
the name via kallsyms only when printing it in /proc, like lockdep does
it?
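i.e. store the init site as an address and only resolve it when
generating the /proc output, something like (sketch, the 'ip' field is
illustrative):

        static void lock_stat_print_entry(struct seq_file *seq, struct lock_stat *o)
        {
                unsigned long size, offset;
                char *modname, namebuf[KSYM_NAME_LEN + 1];
                const char *name;

                /* kallsyms lookup only at print time, comparisons stay pointer-sized */
                name = kallsyms_lookup((unsigned long)o->ip, &size, &offset,
                                       &modname, namebuf);
                seq_printf(seq, "{%s, %s, %d}\n", name ? name : "-", o->file, o->line);
        }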
no need to put dates into comments:
+ * Fri Oct 27 00:26:08 PDT 2006
then:
+ while (node)
+ {
proper style is:
+ while (node) {
this function definition:
+static
+void lock_stat_insert_object(struct lock_stat *o)
can be single-line. We make it multi-line only when needed.
these are only samples of the types of style problems still present in
the code.
Ingo
On Wed, Jan 03, 2007 at 06:14:11PM -0800, Chen, Tim C wrote:
> Bill Huey (hui) wrote:
> http://mmlinux.sf.net/public/patch-2.6.20-rc2-rt2.3.lock_stat.patch
> > If you can rerun it and post the results, it'll hopefully show the
> > behavior of that lock acquisition better.
>
> Here's the run with fix to produce correct statistics.
>
> Tim
>
> @contention events = 848858
> @failure_events = 10
> @lookup_failed_scope = 175
> @lookup_failed_static = 47
> @static_found = 17
...
> [112584, 150, 6 -- 256, 0] {init, kernel/futex.c, 2781}
> [597012, 183895, 136277 -- 9546, 0] {mm_init, kernel/fork.c,
> 369}
Interesting. The second column means that those can be adaptively spun
on to prevent the blocking from happening. That's roughly 1/3rd of the
blocking events that happen (second/first). Something like that would
help out, but the problem is the contention on that lock in the first
place.
Also, Linux can do a hell of a lot of context switches per second. Is
the number of total contentions (top figure) in that run consistent with
the performance degradation? And how much would a reduction of those
events by 1/3rd help out with the benchmark? Those are the questions in
my mind at this moment.
bill
* Bill Huey <[email protected]> wrote:
> Patch here:
>
> http://mmlinux.sourceforge.net/public/patch-2.6.20-rc2-rt2.2.lock_stat.patch
hm, most of the review feedback i gave has not been addressed in the
patch above. So i did it myself: find below various fixups for problems
i outlined, plus for a few other problems as well (on top of
2.3.lock_stat).
While it's now markedly better, it's still not quite mergeable; for
example the build fails quite spectacularly if LOCK_STAT is disabled in
the .config.
Also, it would be nice to replace those #ifdef CONFIG_LOCK_STAT changes
in rtmutex.c with some nice inline functions that do nothing on
!CONFIG_LOCK_STAT.
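i.e. something like this in lock_stat.h (sketch):

        #ifdef CONFIG_LOCK_STAT
        extern void lock_stat_note(lock_stat_ref_t *ls, struct task_struct *owner,
                                   unsigned long ip, int handoff);
        #else
        /* no-op stub so rtmutex.c can call it unconditionally */
        static inline void lock_stat_note(lock_stat_ref_t *ls,
                                          struct task_struct *owner,
                                          unsigned long ip, int handoff)
        {
        }
        #endif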
but in general i'm positive about the direction this is heading, it just
needs more work.
Ingo
---
include/linux/eventpoll.h | 3
include/linux/lock_stat.h | 50 +-----
include/linux/sched.h | 3
kernel/futex.c | 6
kernel/lock_stat.c | 352 +++++++++++++++-------------------------------
kernel/rtmutex.c | 4
6 files changed, 139 insertions(+), 279 deletions(-)
Index: linux/include/linux/eventpoll.h
===================================================================
--- linux.orig/include/linux/eventpoll.h
+++ linux/include/linux/eventpoll.h
@@ -15,7 +15,6 @@
#define _LINUX_EVENTPOLL_H
#include <linux/types.h>
-#include <linux/lock_stat.h>
/* Valid opcodes to issue to sys_epoll_ctl() */
@@ -46,6 +45,8 @@ struct epoll_event {
#ifdef __KERNEL__
+#include <linux/lock_stat.h>
+
/* Forward declarations to avoid compiler errors */
struct file;
Index: linux/include/linux/lock_stat.h
===================================================================
--- linux.orig/include/linux/lock_stat.h
+++ linux/include/linux/lock_stat.h
@@ -1,18 +1,8 @@
/*
* By Bill Huey (hui) at <[email protected]>
*
- * Release under the what ever the Linux kernel chooses for a
- * license, GNU Public License GPL v2
- *
- * Tue Sep 5 17:27:48 PDT 2006
- * Created lock_stat.h
- *
- * Wed Sep 6 15:36:14 PDT 2006
- * Thinking about the object lifetime of a spinlock. Please refer to
- * comments in kernel/lock_stat.c instead.
- *
+ * Released under GNU Public License GPL v2
*/
-
#ifndef LOCK_STAT_H
#define LOCK_STAT_H
@@ -86,23 +76,14 @@ extern void lock_stat_sys_init(void);
#define lock_stat_is_initialized(o) ((unsigned long) (*o)->file)
extern void lock_stat_note(lock_stat_ref_t *ls, struct task_struct *owner,
- unsigned long ip,
- int handoff);
+ unsigned long ip, int handoff);
extern void lock_stat_print(void);
extern int lock_stat_scoped_attach(lock_stat_ref_t *_s,
- LOCK_STAT_NOTE_PARAM_DECL);
+ LOCK_STAT_NOTE_PARAM_DECL);
-#define ksym_strcmp(a, b) strncmp(a, b, KSYM_NAME_LEN)
-#define ksym_strcpy(a, b) strncpy(a, b, KSYM_NAME_LEN)
-#define ksym_strlen(a) strnlen(a, KSYM_NAME_LEN)
-
-/*
-static inline char * ksym_strdup(const char *a)
-{
- char *s = (char *) kmalloc(ksym_strlen(a), GFP_KERNEL);
- return strncpy(s, a, KSYM_NAME_LEN);
-}
-*/
+#define ksym_strcmp(a, b) strncmp(a, b, KSYM_NAME_LEN)
+#define ksym_strcpy(a, b) strncpy(a, b, KSYM_NAME_LEN)
+#define ksym_strlen(a) strnlen(a, KSYM_NAME_LEN)
#define LS_INIT(name, h) { \
/*.function,*/ .file = h, .line = 1, \
@@ -115,29 +96,14 @@ static inline char * ksym_strdup(const c
#define DECLARE_LS_ENTRY(name) \
extern struct lock_stat _lock_stat_##name##_entry
-/* char _##name##_string[] = #name; \
-*/
-
#define DEFINE_LS_ENTRY(name) \
struct lock_stat _lock_stat_##name##_entry = \
LS_INIT(_lock_stat_##name##_entry, #name "_string")
DECLARE_LS_ENTRY(d_alloc);
DECLARE_LS_ENTRY(eventpoll_init_file);
-/*
-DECLARE_LS_ENTRY(get_empty_filp);
-DECLARE_LS_ENTRY(init_once_1);
-DECLARE_LS_ENTRY(init_once_2);
-DECLARE_LS_ENTRY(inode_init_once_1);
-DECLARE_LS_ENTRY(inode_init_once_2);
-DECLARE_LS_ENTRY(inode_init_once_3);
-DECLARE_LS_ENTRY(inode_init_once_4);
-DECLARE_LS_ENTRY(inode_init_once_5);
-DECLARE_LS_ENTRY(inode_init_once_6);
-DECLARE_LS_ENTRY(inode_init_once_7);
-*/
-#else /* CONFIG_LOCK_STAT */
+#else /* CONFIG_LOCK_STAT */
#define __COMMA_LOCK_STAT_FN_DECL
#define __COMMA_LOCK_STAT_FN_VAR
@@ -147,7 +113,7 @@ DECLARE_LS_ENTRY(inode_init_once_7);
#define __COMMA_LOCK_STAT_NOTE_FLLN
#define __COMMA_LOCK_STAT_NOTE_FLLN_VARS
-#define __COMMA_LOCK_STAT_INITIALIZER
+#define __COMMA_LOCK_STAT_INITIALIZER
#define __COMMA_LOCK_STAT_IP_DECL
#define __COMMA_LOCK_STAT_IP
Index: linux/include/linux/sched.h
===================================================================
--- linux.orig/include/linux/sched.h
+++ linux/include/linux/sched.h
@@ -236,6 +236,9 @@ extern void sched_init(void);
extern void sched_init_smp(void);
extern void init_idle(struct task_struct *idle, int cpu);
+extern int task_on_rq(int cpu, struct task_struct *p);
+extern struct task_struct *get_task_on_rq(int cpu);
+
extern cpumask_t nohz_cpu_mask;
/*
Index: linux/kernel/futex.c
===================================================================
--- linux.orig/kernel/futex.c
+++ linux/kernel/futex.c
@@ -760,7 +760,8 @@ retry:
(*pi_state)->key = *key;
/* init the mutex without owner */
- __rt_mutex_init(&(*pi_state)->pi_mutex, NULL);
+ __rt_mutex_init(&(*pi_state)->pi_mutex, NULL
+ __COMMA_LOCK_STAT_NOTE);
}
return 0;
@@ -2773,7 +2774,8 @@ retry:
(*pi_state)->key = *key;
/* init the mutex without owner */
- __rt_mutex_init(&(*pi_state)->pi_mutex, NULL);
+ __rt_mutex_init(&(*pi_state)->pi_mutex, NULL
+ __COMMA_LOCK_STAT_NOTE);
}
return 0;
Index: linux/kernel/lock_stat.c
===================================================================
--- linux.orig/kernel/lock_stat.c
+++ linux/kernel/lock_stat.c
@@ -1,45 +1,9 @@
/*
- * lock_stat.h
+ * kernel/lock_stat.c
*
* By Bill Huey (hui) [email protected]
- * Release under GPL license compatible with the Linux kernel
*
- * Tue Sep 5 17:27:48 PDT 2006
- * Started after thinking about the problem the of Burning Man 2006 and lock
- * lifetimes, scoping of those objects, etc...
- *
- * Thu Sep 14 04:01:26 PDT 2006
- * Some of this elaborate list handling stuff might not be necessary since I
- * can attach all of the spinlocks at spin_lock_init() time, etc... It can be
- * done out of line from the contention event. It's just a matter of how to
- * detect and record contentions for spinlocks are statically initialized
- * being the last part of what I need get this all going. I thought about that
- * last night after going to BaG and talking to Mitch a bunch about this.
- *
- * Fri Sep 15 04:27:47 PDT 2006
- * I maybe have greatly simplified this today during the drive down here to
- * SD. Keep all of the statically defined stuff late, but protect the
- * persistent list by a simple spinlock and append it to the list immediately.
- * This is possible because the static initializer stuff handles proper insert
- * of the lock_stat struct during calls to spin_lock_init() for all other
- * insertion cases.
- *
- * Special thanks go to Zwane and Peter for helping me with an initial
- * implemention using RCU and lists even though that's now been replaced by
- * something that's better and much simpler.
- *
- * Thu Sep 21 23:38:48 PDT 2006
- * I'm in San Diego this last week or so and I've been auditing the kernel
- * for spinlock and rwlocks to see if they are statically defined or scoped.
- * I finished that audit last night.
- *
- * Mon Sep 25 21:49:43 PDT 2006
- * Back in SF as of last night. Think about what I need to do to get rudimentary
- * testing in place so that I know this code isn't going to crash the kernel.
- *
- * Fri Nov 24 19:06:31 PST 2006
- * As suggested by peterz, I removed the init_wait_queue stuff since lockdep
- * already apparently has handled it.
+ * Released under GNU Public License GPL v2
*/
#include <linux/kernel.h>
@@ -78,8 +42,10 @@ static DEFINE_LS_ENTRY(inline); /* lock
static DEFINE_LS_ENTRY(untracked); /* lock_stat_untracked_entry */
static DEFINE_LS_ENTRY(preinit); /* lock_stat_preinit_entry */
-/* To be use for avoiding the dynamic attachment of spinlocks at runtime
- * by attaching it inline with the lock initialization function */
+/*
+ * To be used for avoiding the dynamic attachment of spinlocks at runtime
+ * by attaching it inline with the lock initialization function:
+ */
DEFINE_LS_ENTRY(d_alloc);
EXPORT_SYMBOL(_lock_stat_d_alloc_entry);
@@ -88,64 +54,32 @@ DEFINE_LS_ENTRY(eventpoll_init_file);
EXPORT_SYMBOL(_lock_stat_eventpoll_init_file_entry);
/*
-static DEFINE_LS_ENTRY(__pte_alloc);
-static DEFINE_LS_ENTRY(get_empty_filp);
-static DEFINE_LS_ENTRY(init_waitqueue_head);
-static DEFINE_LS_ENTRY(init_buffer_head_1);
-static DEFINE_LS_ENTRY(init_buffer_head_2);
-static DEFINE_LS_ENTRY(init_once_1);
-static DEFINE_LS_ENTRY(init_once_2);
-static DEFINE_LS_ENTRY(inode_init_once_1);
-static DEFINE_LS_ENTRY(inode_init_once_2);
-static DEFINE_LS_ENTRY(inode_init_once_3);
-static DEFINE_LS_ENTRY(inode_init_once_4);
-static DEFINE_LS_ENTRY(inode_init_once_5);
-static DEFINE_LS_ENTRY(inode_init_once_6);
-static DEFINE_LS_ENTRY(inode_init_once_7);
-static DEFINE_LS_ENTRY(inode_init_once_8);
-static DEFINE_LS_ENTRY(mm_init_1);
-static DEFINE_LS_ENTRY(mm_init_2);
-static DEFINE_LS_ENTRY(mm_init_3);
-static DEFINE_LS_ENTRY(skb_queue_init);
-static DEFINE_LS_ENTRY(tcp_init_1);
-static DEFINE_LS_ENTRY(tcp_init_2);
-*/
-
-/*
- * I should never have to create more entries that this since I audited the
- * kernel and found out that there are only ~1500 or so places in the kernel
- * where these rw/spinlocks are initialized. Use the initialization points as a
+ * Should be enough for now. Use the initialization points as a
* hash value to look up the backing objects
*/
-
-#define MAGIC_ENTRIES 1600
+#define LOCKSTAT_ENTRIES 1600
static int lock_stat_procfs_init(void);
static void lock_stat_insert_object(struct lock_stat *);
-static int lock_stat_inited = 0;
+static int lock_stat_inited;
+
+static int missed, attached, left_insert, right_insert;
-static int missed = 0, attached = 0, left_insert = 0, right_insert = 0;
-
-static int lookup_failed_scoped = 0, lookup_failed_static = 0, static_found = 0;
-static int failure_events = 0;
+static int lookup_failed_scoped, lookup_failed_static, static_found;
+static int failure_events;
-static
-DEFINE_PER_CPU(atomic_t, lock_stat_total_events);
+static DEFINE_PER_CPU(atomic_t, lock_stat_total_events);
-static
-struct rb_root lock_stat_rbtree_db = RB_ROOT;
+static struct rb_root lock_stat_rbtree_db = RB_ROOT;
static void lock_stat_rbtree_db_print(struct seq_file *seq);
static void _lock_stat_rbtree_db_print(struct seq_file *seq,
- struct rb_node *node,
- int level);
-
+ struct rb_node *node, int level);
static void lock_stat_rbtree_db_zero(void);
static void _lock_stat_rbtree_db_zero(struct rb_node *node);
-static
-void lock_stat_rbtree_db_zero(void)
+static void lock_stat_rbtree_db_zero(void)
{
_lock_stat_rbtree_db_zero(lock_stat_rbtree_db.rb_node);
}
@@ -153,10 +87,9 @@ void lock_stat_rbtree_db_zero(void)
void lock_stat_init(struct lock_stat *oref)
{
oref->function[0] = 0;
- oref->file = NULL;
- oref->line = 0;
-
- oref->ntracked = 0;
+ oref->file = NULL;
+ oref->line = 0;
+ oref->ntracked = 0;
atomic_set(&oref->ninlined, 0);
atomic_set(&oref->ncontended, 0);
atomic_set(&oref->nspinnable, 0);
@@ -172,9 +105,8 @@ void lock_stat_reset_stats(void)
lock_stat_rbtree_db_zero();
- for_each_possible_cpu(cpu) {
+ for_each_possible_cpu(cpu)
atomic_set(&per_cpu(lock_stat_total_events, cpu), 0);
- }
}
void lock_stat_sys_init(void)
@@ -182,24 +114,23 @@ void lock_stat_sys_init(void)
struct lock_stat *s;
int i;
- for (i = 0; i < MAGIC_ENTRIES; ++i) {
+ for (i = 0; i < LOCKSTAT_ENTRIES; ++i) {
s = kmalloc(sizeof(struct lock_stat), GFP_KERNEL);
if (s) {
lock_stat_init(s);
list_add_tail(&s->list_head, &lock_stat_free_store);
- }
- else {
+ } else {
printk("%s: kmalloc returned NULL\n", __func__);
return;
}
}
- lock_stat_insert_object(&_lock_stat_inline_entry);
- lock_stat_insert_object(&_lock_stat_untracked_entry);
- lock_stat_insert_object(&_lock_stat_preinit_entry);
+ lock_stat_insert_object(&_lock_stat_inline_entry);
+ lock_stat_insert_object(&_lock_stat_untracked_entry);
+ lock_stat_insert_object(&_lock_stat_preinit_entry);
- lock_stat_insert_object(&_lock_stat_d_alloc_entry);
+ lock_stat_insert_object(&_lock_stat_d_alloc_entry);
lock_stat_insert_object(&_lock_stat_eventpoll_init_file_entry);
lock_stat_procfs_init();
@@ -208,9 +139,9 @@ void lock_stat_sys_init(void)
lock_stat_inited = 1;
}
-static
-struct lock_stat *lock_stat_allocate_object(void)
+static struct lock_stat *lock_stat_allocate_object(void)
{
+ struct lock_stat *s = NULL;
unsigned long flags;
if (!lock_stat_inited)
@@ -219,65 +150,57 @@ struct lock_stat *lock_stat_allocate_obj
spin_lock_irqsave(&free_store_lock, flags);
if (!list_empty(&lock_stat_free_store)) {
struct list_head *e = lock_stat_free_store.next;
- struct lock_stat *s;
s = container_of(e, struct lock_stat, list_head);
list_del(e);
-
- spin_unlock_irqrestore(&free_store_lock, flags);
-
- return s;
}
spin_unlock_irqrestore(&free_store_lock, flags);
- return NULL;
+ return s;
}
/*
* For ordered insertion into a tree. The (entry - object) order of
* comparison is switched from the parameter definition.
*/
-static
-int lock_stat_compare_objs(struct lock_stat *x, struct lock_stat *y)
+static int lock_stat_compare_objs(struct lock_stat *x, struct lock_stat *y)
{
- int a = 0, b = 0, c = 0;
+ if (!ksym_strcmp(x->function, y->function))
+ return 0;
- (a = ksym_strcmp(x->function, y->function)) ||
- (b = ksym_strcmp(x->file, y->file)) ||
- (c = (x->line - y->line));
+ if (!ksym_strcmp(x->file, y->file))
+ return 0;
- return a | b | c;
+ return x->line == y->line;
}
-static
-int lock_stat_compare_key_to_obj(LOCK_STAT_NOTE_PARAM_DECL, struct lock_stat *x)
+static int
+lock_stat_compare_key_to_obj(LOCK_STAT_NOTE_PARAM_DECL, struct lock_stat *x)
{
- int a = 0, b = 0, c = 0;
+ if (!ksym_strcmp(_function, x->function))
+ return 0;
- (a = ksym_strcmp(_function, x->function)) ||
- (b = ksym_strcmp(_file, x->file)) ||
- (c = (_line - x->line));
+ if (!ksym_strcmp(_file, x->file))
+ return 0;
- return a | b | c;
+ return _line - x->line;
}
-static
-void lock_stat_print_entry(struct seq_file *seq, struct lock_stat *o)
+static void lock_stat_print_entry(struct seq_file *seq, struct lock_stat *o)
{
seq_printf(seq, "[%d, %d, %d -- %d, %d]\t\t{%s, %s, %d}\n",
- atomic_read(&o->ncontended),
- atomic_read(&o->nspinnable),
- atomic_read(&o->nhandoff),
- o->ntracked,
- atomic_read(&o->ninlined),
- o->function,
- o->file,
- o->line
- );
+ atomic_read(&o->ncontended),
+ atomic_read(&o->nspinnable),
+ atomic_read(&o->nhandoff),
+ o->ntracked,
+ atomic_read(&o->ninlined),
+ o->function,
+ o->file,
+ o->line
+ );
}
-static
-void _lock_stat_rbtree_db_zero(struct rb_node *node)
+static void _lock_stat_rbtree_db_zero(struct rb_node *node)
{
struct lock_stat *o = container_of(node, struct lock_stat, rb_node);
@@ -292,15 +215,14 @@ void _lock_stat_rbtree_db_zero(struct rb
_lock_stat_rbtree_db_zero(node->rb_right);
}
-static
-void lock_stat_rbtree_db_print(struct seq_file *seq)
+static void lock_stat_rbtree_db_print(struct seq_file *seq)
{
_lock_stat_rbtree_db_print(seq, lock_stat_rbtree_db.rb_node, 0);
}
-static
-void _lock_stat_rbtree_db_print(struct seq_file *seq, struct rb_node *node,
- int level)
+static void
+_lock_stat_rbtree_db_print(struct seq_file *seq, struct rb_node *node,
+ int level)
{
struct lock_stat *o = container_of(node, struct lock_stat, rb_node);
@@ -314,8 +236,7 @@ void _lock_stat_rbtree_db_print(struct s
_lock_stat_rbtree_db_print(seq, node->rb_right, level);
}
-static
-struct lock_stat *lock_stat_insert_rbtree_db(struct lock_stat *o)
+static struct lock_stat *lock_stat_insert_rbtree_db(struct lock_stat *o)
{
struct rb_node **p = &lock_stat_rbtree_db.rb_node;
struct rb_node *parent = NULL;
@@ -339,7 +260,7 @@ struct lock_stat *lock_stat_insert_rbtre
}
}
- rb_link_node(&o->rb_node, parent, p);
+ rb_link_node(&o->rb_node, parent, p);
rb_insert_color(&o->rb_node, &lock_stat_rbtree_db);
++attached;
@@ -347,14 +268,12 @@ struct lock_stat *lock_stat_insert_rbtre
return NULL;
}
-static
-struct lock_stat *lock_stat_lookup_rbtree_db(LOCK_STAT_NOTE_PARAM_DECL)
+static struct lock_stat *lock_stat_lookup_rbtree_db(LOCK_STAT_NOTE_PARAM_DECL)
{
struct rb_node *node = lock_stat_rbtree_db.rb_node;
struct lock_stat *cursor;
- while (node)
- {
+ while (node) {
cursor = container_of(node, struct lock_stat, rb_node);
/*
* item less than-> left
@@ -374,15 +293,13 @@ struct lock_stat *lock_stat_lookup_rbtre
}
/* must hold object_store_lock */
-static
-struct lock_stat * lock_stat_lookup_object(LOCK_STAT_NOTE_PARAM_DECL)
+static struct lock_stat * lock_stat_lookup_object(LOCK_STAT_NOTE_PARAM_DECL)
{
return lock_stat_lookup_rbtree_db(LOCK_STAT_NOTE_VARS);
}
/* Must hold object_store_lock */
-static
-void lock_stat_insert_object(struct lock_stat *o)
+static void lock_stat_insert_object(struct lock_stat *o)
{
lock_stat_insert_rbtree_db(o);
}
@@ -401,26 +318,25 @@ void lock_stat_insert_object(struct lock
* instances, you still track it with one backing object but it's connect to
* this object by the initialization point (file, function, line) of the
* rtmutex. That place in the source file is also used as an identifying key
- * for that instance and it is used as to associate it with a backing statistical
- * object. They are forever connected together from that point by that key. That
- * backing object holds the total number of contentions and other stats for all
- * objects associated with that key.
+ * for that instance and it is used as to associate it with a backing
+ * statistical object. They are forever connected together from that point by
+ * that key. That backing object holds the total number of contentions and
+ * other stats for all objects associated with that key.
*
* There maybe other initialization points for that same structure effecting
* how the keys are utilized, but they should * all effectively be connected
* to a single lock_stat object representing the * overall contention behavior
* and use of that object.
*
- * Connecting the rtmutex to the lock_stat object at spin_lock_init() time. It's
- * not a critical path. There are cases where a C99 style struct initializer is
- * used so I can't insert it into the lock_stat dictionary and those initializer
- * must be converted to use spin_lock_init() instead.
- *
+ * Connecting the rtmutex to the lock_stat object at spin_lock_init() time.
+ * It's not a critical path. There are cases where a C99 style struct
+ * initializer is used so we can't insert it into the lock_stat dictionary
+ * and those initializer must be converted to use spin_lock_init() instead.
*/
int lock_stat_scoped_attach(lock_stat_ref_t *oref, LOCK_STAT_NOTE_PARAM_DECL)
{
- unsigned long flags;
struct lock_stat *o = NULL;
+ unsigned long flags;
/* Failed error condition and ignore instrumentation calls before it's
* intialized */
@@ -475,11 +391,11 @@ failure:
* structures.
*/
-static
-void lock_stat_static_attach(lock_stat_ref_t *oref, LOCK_STAT_NOTE_PARAM_DECL)
+static void
+lock_stat_static_attach(lock_stat_ref_t *oref, LOCK_STAT_NOTE_PARAM_DECL)
{
- unsigned long flags;
struct lock_stat *o, *p;
+ unsigned long flags;
if (!oref || !lock_stat_inited)
return;
@@ -494,8 +410,7 @@ void lock_stat_static_attach(lock_stat_r
if ((p = lock_stat_lookup_object(LOCK_STAT_NOTE_VARS))) {
o = p;
static_found++;
- }
- else {
+ } else {
o = lock_stat_allocate_object();
lookup_failed_static++;
}
@@ -520,36 +435,32 @@ failure:
spin_unlock_irqrestore(&object_store_lock, flags);
}
-static
-char *ksym_get(unsigned long address, char *namebuffer)
-{
+static char *ksym_get(unsigned long address, char *namebuffer)
+{
unsigned long offset = 0, symsize;
- char *modname;
const char *symname;
+ char *modname;
symname = kallsyms_lookup(address, &symsize, &offset, &modname,
- namebuffer);
+ namebuffer);
- if (!symname) {
+ if (!symname)
return null_string;
- }
return (char *) symname;
}
-extern int task_on_rq(int cpu, struct task_struct *p);
-
void lock_stat_note(lock_stat_ref_t *oref, struct task_struct *owner,
- unsigned long ip,
- int handoff)
+ unsigned long ip, int handoff)
{
- int cpu;
char ksym_scoped_namebuf[KSYM_NAME_LEN+1];
- extern struct task_struct *get_task_on_rq(int cpu);
+ int cpu;
/* Invalid parameters, just do a return instead of crash */
- if (!oref || !owner || !lock_stat_inited)
- goto failure;
+ if (!oref || !owner || !lock_stat_inited) {
+ failure_events++;
+ goto out;
+ }
cpu = task_cpu(owner);
@@ -570,75 +481,30 @@ void lock_stat_note(lock_stat_ref_t *ore
>= KSYM_NAME_LEN);
if (!*oref)
lock_stat_static_attach(oref, special_static_string,
- ksym_scoped_namebuf,
- 0);
+ ksym_scoped_namebuf, 0);
if (handoff == LOCK_STAT_NOTE_TYPE_HANDOFF) {
atomic_inc(&((*oref)->nhandoff));
- }
- else {
+ } else {
if (owner && (owner == get_task_on_rq(cpu)))
atomic_inc(&((*oref)->nspinnable));
atomic_inc(&((*oref)->ncontended));
}
- atomic_inc(&per_cpu(lock_stat_total_events, get_cpu()));
- put_cpu();
-
- return;
-failure:
- failure_events++;
-
+out:
atomic_inc(&per_cpu(lock_stat_total_events, get_cpu()));
put_cpu();
}
-static int lock_stat_procfs_open(struct inode *, struct file *);
-static int lock_stat_procfs_write(struct file *file, const char *buf,
- size_t count,
- loff_t *off);
-
struct proc_dir_entry *lock_stat_procfs_dir;
struct proc_dir_entry *lock_stat_procfs_entry;
-static struct file_operations lock_stat_procfs_ops = {
- .open = lock_stat_procfs_open,
- .read = seq_read,
- .write = lock_stat_procfs_write,
- .llseek = seq_lseek,
- .release = seq_release,
-};
-
-static int lock_stat_procfs_init(void)
-{
- int error = 0;
-
- lock_stat_procfs_dir = proc_mkdir("lock_stat", NULL);
- if (lock_stat_procfs_dir == NULL) {
- error = -ENOMEM;
- return error;
- }
-
- lock_stat_procfs_entry = create_proc_entry("contention", 0666,
- lock_stat_procfs_dir);
- if (lock_stat_procfs_entry == NULL) {
- error = -ENOMEM;
- return error;
- }
- else {
- lock_stat_procfs_entry->proc_fops = &lock_stat_procfs_ops;
- }
-
- return 0;
-}
-
static int lock_stat_procfs_show(struct seq_file *seq, void *v)
{
int cpu, count = 0;
- for_each_possible_cpu(cpu) {
+ for_each_possible_cpu(cpu)
count += atomic_read(&per_cpu(lock_stat_total_events, cpu));
- }
seq_printf(seq, "@contention events = %d\n", count);
seq_printf(seq, "@failure_events = %d\n", failure_events);
@@ -650,15 +516,41 @@ static int lock_stat_procfs_show(struct
return 0;
}
-static int lock_stat_procfs_write (struct file *file, const char *buf,
- size_t count,
- loff_t *off)
+static ssize_t
+lock_stat_procfs_write(struct file *file, const char __user *buf,
+ size_t count, loff_t *ppos)
{
lock_stat_reset_stats();
return count;
}
-static int lock_stat_procfs_open(struct inode *inode, struct file *file) {
+static int lock_stat_procfs_open(struct inode *inode, struct file *file)
+{
return single_open (file, &lock_stat_procfs_show, NULL );
}
+static struct file_operations lock_stat_procfs_ops = {
+ .open = lock_stat_procfs_open,
+ .read = seq_read,
+ .write = lock_stat_procfs_write,
+ .llseek = seq_lseek,
+ .release = seq_release,
+};
+
+static int lock_stat_procfs_init(void)
+{
+ lock_stat_procfs_dir = proc_mkdir("lock_stat", NULL);
+ if (lock_stat_procfs_dir == NULL)
+ return -ENOMEM;
+
+ lock_stat_procfs_entry = create_proc_entry("contention", 0666,
+ lock_stat_procfs_dir);
+ if (lock_stat_procfs_entry == NULL)
+ return -ENOMEM;
+
+ lock_stat_procfs_entry->proc_fops = &lock_stat_procfs_ops;
+
+ return 0;
+}
+
+
Index: linux/kernel/rtmutex.c
===================================================================
--- linux.orig/kernel/rtmutex.c
+++ linux/kernel/rtmutex.c
@@ -970,9 +970,6 @@ rt_mutex_slowlock(struct rt_mutex *lock,
int ret = 0, saved_lock_depth = -1;
struct rt_mutex_waiter waiter;
unsigned long flags;
-#ifdef CONFIG_LOCK_STAT
- struct task_struct *owner;
-#endif
debug_rt_mutex_init_waiter(&waiter);
waiter.task = NULL;
@@ -1113,7 +1110,6 @@ rt_mutex_slowtrylock(struct rt_mutex *lo
unsigned long flags;
int ret = 0;
#ifdef CONFIG_LOCK_STAT
- struct task_struct *owner;
int type;
#endif
On Wed, Jan 24, 2007 at 12:31:15PM +0100, Ingo Molnar wrote:
> * Bill Huey <[email protected]> wrote:
>
> > Patch here:
> >
> > http://mmlinux.sourceforge.net/public/patch-2.6.20-rc2-rt2.2.lock_stat.patch
>
> hm, most of the review feedback i gave has not been addressed in the
> patch above. So i did it myself: find below various fixups for problems
> i outlined plus for a few other problems as well (ontop of
> 2.3.lock_stat).
Sorry, I've been silently keeping your suggested changes in my private
repo without announcing them to the world. I'll reply to the old email at
length in another message.
http://mmlinux.sourceforge.net/public/patch-2.6.20-rc2-rt2.4.lock_stat.patch
> While it's now markedly better it's still not quite mergeable, for
> example the build fails quite spectacularly if LOCK_STAT is disabled in
> the .config.
I'll look into it. I've been focused on cleanup and a couple of other
things regarding the stability of this patch. Making small changes to it
tends to make the kernel crash hard, and I suspect it's an interaction
problem with lockdep and that I need to turn lockdep off when hitting
"lock stats" locks. I'm going to move to "__raw_..." locks...
Meanwhile, please wait until I hand-merge your changes against the older
patch into my latest tree. If that takes too long, I suggest keeping it
out of the tree for a bit until I finish this round, unless something is
pressing for this to happen now, like a mass change to the spinlock macros.
I stalled a bit trying to get Peter Zijlstra an extra feature.
> Also, it would be nice to replace those #ifdef CONFIG_LOCK_STAT changes
> in rtmutex.c with some nice inline functions that do nothing on
> !CONFIG_LOCK_STAT.
I'll look into it. I'm not sure what style you'd prefer here and I'm open
to suggestions. I'm also interested in reducing the length of the #define
identifiers if you or somebody else has a good convention to suggest.
There's a sketch of what I have in mind below.
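Something along these lines, perhaps (just a sketch, not what's in the
patch yet; lock_stat_note_contention is a made-up name, and it assumes
lock_stat_ref_t stays visible when LOCK_STAT is off, otherwise the empty
variant would become a do { } while (0) macro):

#ifdef CONFIG_LOCK_STAT
static inline void lock_stat_note_contention(lock_stat_ref_t *ls,
					     struct task_struct *owner,
					     unsigned long ip, int handoff)
{
	lock_stat_note(ls, owner, ip, handoff);
}
#else
/* does nothing when LOCK_STAT is disabled */
static inline void lock_stat_note_contention(lock_stat_ref_t *ls,
					     struct task_struct *owner,
					     unsigned long ip, int handoff)
{
}
#endif

That would let rtmutex.c call it unconditionally and keep the #ifdefs out
of the slowpath.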
> but in general i'm positive about the direction this is heading, it just
> needs more work.
Sorry for the lag. Juggling this with the current demands of my employer
contributed to the delay, unfortunately.
bill
On Thu, Jan 04, 2007 at 05:46:59AM +0100, Ingo Molnar wrote:
> thanks. It's looking better, but there's still quite a bit of work left:
>
> there's considerable amount of whitespace noise in it - lots of lines
> with space/tab at the end, lines with 8 spaces instead of tabs, etc.
These comments of mine are from before the hand merge I'm going to do tonight.
> comment style issues:
>
> +/* To be use for avoiding the dynamic attachment of spinlocks at runtime
> + * by attaching it inline with the lock initialization function */
>
> the proper multi-line style is:
>
> /*
> * To be used for avoiding the dynamic attachment of spinlocks at
> * runtime by attaching it inline with the lock initialization function:
> */
I fixed all of those I could find.
> (note i also fixed a typo in the one above)
>
> more unused code:
>
> +/*
> +static DEFINE_LS_ENTRY(__pte_alloc);
> +static DEFINE_LS_ENTRY(get_empty_filp);
> +static DEFINE_LS_ENTRY(init_waitqueue_head);
> ...
> +*/
Removed. They were for annotation, which isn't important right now.
> +static int lock_stat_inited = 0;
>
> should not be initialized to 0, that is implicit for static variables.
Removed.
> weird alignment here:
>
> +void lock_stat_init(struct lock_stat *oref)
> +{
> + oref->function[0] = 0;
> + oref->file = NULL;
> + oref->line = 0;
> +
> + oref->ntracked = 0;
I reduced that all to a single space without using huge tabs.
> funky branching:
>
> + spin_lock_irqsave(&free_store_lock, flags);
> + if (!list_empty(&lock_stat_free_store)) {
> + struct list_head *e = lock_stat_free_store.next;
> + struct lock_stat *s;
> +
> + s = container_of(e, struct lock_stat, list_head);
> + list_del(e);
> +
> + spin_unlock_irqrestore(&free_store_lock, flags);
> +
> + return s;
> + }
> + spin_unlock_irqrestore(&free_store_lock, flags);
> +
> + return NULL;
>
> that should be s = NULL in the function scope and a plain unlock and
> return s.
I made this change.
> assignments mixed with arithmetics:
>
> +static
> +int lock_stat_compare_objs(struct lock_stat *x, struct lock_stat *y)
> +{
> + int a = 0, b = 0, c = 0;
> +
> + (a = ksym_strcmp(x->function, y->function)) ||
> + (b = ksym_strcmp(x->file, y->file)) ||
> + (c = (x->line - y->line));
> +
> + return a | b | c;
>
> the usual (and more readable) style is to separate them out explicitly:
>
> a = ksym_strcmp(x->function, y->function);
> if (!a)
> return 0;
> b = ksym_strcmp(x->file, y->file);
> if (!b)
> return 0;
>
> return x->line == y->line;
>
> (detail: this btw also fixes a bug in the function above AFAICS, in the
> a && !b case.)
I'm not sure what you mean here, but I made the key comparison treat each
struct field in most-to-least-significant order of evaluation. The old code
worked fine; what you're seeing is the newer version. See the sketch below.
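To make the ordering explicit, the field-by-field version I have in mind
looks roughly like this (a sketch using the patch's existing names, not
the exact code in the current patch):

/*
 * Sketch: compare the key fields in most-to-least-significant order and
 * return the first non-zero result, so insertion and lookup order the
 * rbtree the same way.
 */
static int lock_stat_compare_objs(struct lock_stat *x, struct lock_stat *y)
{
	int r;

	r = ksym_strcmp(x->function, y->function);
	if (r)
		return r;
	r = ksym_strcmp(x->file, y->file);
	if (r)
		return r;
	return x->line - y->line;
}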
> also, i'm not fully convinced we want that x->function as a string. That
> makes comparisons alot slower. Why not make it a void *, and resolve to
> the name via kallsyms only when printing it in /proc, like lockdep does
> it?
I've made your suggested change, but I'm not done with it.
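The direction I'm heading is roughly this (a sketch only; func_addr is a
hypothetical field that is not in the current patch):

/*
 * Sketch: keep the raw address in the backing object and only resolve it
 * through kallsyms when generating the /proc output, so key comparisons
 * become integer compares instead of KSYM_NAME_LEN strncmp()s.
 * 'func_addr' is a placeholder field name.
 */
static void lock_stat_print_entry(struct seq_file *seq, struct lock_stat *o)
{
	char namebuf[KSYM_NAME_LEN+1];

	seq_printf(seq, "[%d, %d, %d -- %d, %d]\t\t{%s, %s, %d}\n",
		   atomic_read(&o->ncontended),
		   atomic_read(&o->nspinnable),
		   atomic_read(&o->nhandoff),
		   o->ntracked,
		   atomic_read(&o->ninlined),
		   ksym_get(o->func_addr, namebuf),
		   o->file, o->line);
}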
>
> no need to put dates into comments:
>
> + * Fri Oct 27 00:26:08 PDT 2006
>
> then:
>
> + while (node)
> + {
>
> proper style is:
>
> + while (node) {
Done. I misinterpreted the style guide and have made the changes to
conform to it.
> this function definition:
>
> +static
> +void lock_stat_insert_object(struct lock_stat *o)
>
> can be single-line. We make it multi-line only when needed.
Done for all the instances I remember offhand.
> these are only samples of the types of style problems still present in
> the code.
I'm a bit of a space cadet so I might have missed something.
Latest patch here.
http://finfin.is-a-geek.org/~billh/contention/patch-2.6.20-rc2-rt2.4.lock_stat.patch
I'm going to review and hand merge your changes to the older patch tonight.
Thanks for the comments.
bill
* Bill Huey <[email protected]> wrote:
> Latest patch here.
>
> http://finfin.is-a-geek.org/~billh/contention/patch-2.6.20-rc2-rt2.4.lock_stat.patch
>
> I'm going to review and hand merge your changes to the older patch
> tonight.
one more suggestion: please do diffs in -p1 (not in -p0) format.
Ingo