Hi Ingo,
First, I have to say that the patch contained in this mail is
*VERY ROUGH*, because it affects not only the spinlock code
but also a wide range of the kernel, including the VFS, MM,
process management, etc.
For that reason I have not formatted the patch to pass
checkpatch.pl, and it is not signed off.
But I strongly believe that this patch will help some
virtualization people (and some non-virtualization people too),
so I would like your advice and comments on it.
* Description
This patch creates the file spinlock_stats on top of debugfs.
When a user reads this file, statistical data about spinlocks
is printed: the acquisition count and the longest, shortest,
total, and average hold times. The time columns are raw TSC
cycles (read with native_read_tsc()), and the average is
total / count with integer division.
e.g.
name: count longest shortest total average
lock: 0 0 0 0 0
i8259A_lock: 44 408852 3596 643488 14624
i8253_lock: 0 0 0 0 0
rtc_lock: 3 63504 6980 77740 25913
mce_state_lock: 0 0 0 0 0
cmci_discover_lock: 4 33512 20292 94544 23636
set_atomicity_lock: 4 3247864 0 7534840 1883710
lock: 0 0 0 0 0
ioapic_lock: 221 15512 1848 1264832 5723
vector_lock: 82 31312 528 174716 2130
gart_lock: 0 0 0 0 0
iommu_bitmap_lock: 0 0 0 0 0
pgd_lock: 19404 39880 42 23864027 1229
cpa_lock: 2383 24836 576 1830496 768
......
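The measurement itself is simple; condensed from the
spinlock_api_smp.h hunks in the patch below, it boils down to
this at lock time and at unlock time:

	/* at lock time */
	atomic64_inc(&lock->count);
	atomic64_set(&lock->lock_start, native_read_tsc());

	/* at unlock time */
	end = native_read_tsc();
	locked_term = end - atomic64_read(&lock->lock_start);
	atomic64_add(locked_term, &lock->total);
	if (locked_term > atomic64_read(&lock->longest))
		atomic64_set(&lock->longest, locked_term);
	else if (locked_term < atomic64_read(&lock->shortest))
		atomic64_set(&lock->shortest, locked_term);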
The reason I want this data is that the virtualization world
has a problem called "Lock Holder Preemption" (LHP).
LHP is described well in this paper:
http://www.amd64.org/fileadmin/user_upload/pub/2008-Friebel-LHP-GI_OS.pdf
LHP is not a trivial problem; VMware, for example, employs
co-scheduling to avoid it (source:
http://communities.vmware.com/docs/DOC-4960 ), and the paper
linked above describes the problem and a way to deal with it
in a Xen environment.
And, this is the important point, some papers, including the
one above, measure how long spinlocks are held.
But their authors did not describe how they measured it.
(Strange! Is it secret voodoo?)
So I implemented this patch to measure how long spinlocks are
held; the output above is what it produces.
But the patch contains a lot of trivial and non-trivial
problems, so I would like your advice and comments.
Below is the list of problems:
1) How the spinlock_t instances are detected
This is a trivial problem.
The current strategy is to modify DEFINE_SPINLOCK() so that it
adds __section(.spinlock.data); statically defined spinlocks
are then placed in the section .spinlock.data, which is created
specifically for spinlock_t.
I expected this section to be a plain array of spinlock_t, but
it is not; I think this is caused by some alignment issue.
Because of my lack of knowledge about ld, I used a temporary
way to detect each spinlock_t instead: set a char * pointer to
the start of .spinlock.data, advance it one byte at a time,
and whenever the value at that address equals the spinlock
magic, obtain the head address of the spinlock_t with the
container_of() macro.
This is implemented in spinlock_stats_show().
Can I make the section .spinlock.data a real array of
spinlock_t?
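In code, the detection loop is roughly this (condensed from
spinlock_stats_show() in the patch below):

	char *p;
	spinlock_t *spin;

	for (p = &__lock_data_start; p < &__lock_data_end; p++) {
		if (*(int *)p != SPINLOCK_MAGIC)
			continue;
		/* p points at the magic member; recover the lock */
		spin = container_of((unsigned int *)p, spinlock_t, magic);
		/* ... read spin->count, spin->total, and so on ... */
	}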
2) It affects a wide range of the kernel
Some spinlocks are declared with __cacheline_aligned_in_smp.
These declarations conflict with the new DEFINE_SPINLOCK(),
because __cacheline_aligned_in_smp contains
__section(.data.page_aligned), so the two section placements
clash.
For these locks I manually added print code to
spinlock_stats_show(), like this:
#define P(spin) do { \
count = atomic64_read(&spin.count); \
total = atomic64_read(&spin.total); \
shortest = atomic64_read(&spin.shortest); \
seq_printf(m, "%s: %ld %ld %ld %ld %ld\n", spin.dep_map.name, count, atomic64_read(&spin.longest), \
(shortest == LONG_MAX) ? 0 : shortest, total, count ? total / count : 0); \
}while(0)
P(bdev_lock);
P(dcache_lock);
P(files_lock);
...
This is very ugly and hurts the maintainability of the code.
3) Overhead
This is the result of 'perf bench sched messaging -g 20 -l 1000'
Raw kernel:
# Running sched/messaging benchmark...
# 20 sender and receiver processes per group
# 20 groups == 800 processes run
Total time: 6.457 [sec]
Patched kernel:
# 20 sender and receiver processes per group
# 20 groups == 800 processes run
Total time: 6.607 [sec]
So there is a small overhead, about 2.3% on this benchmark
(6.457 s -> 6.607 s).
I'm writing this email on the patched kernel and cannot notice
the overhead from a human perspective, but I cannot judge the
significance of this overhead by myself.
How should I think about it? Can we ignore it?
Of course, I isolated the new code with #ifdef,
and if this patch is merged I will add a new Kconfig option
for it.
... these are my proposals. Sorry for the long email :-(
But, as I mentioned at the beginning, I strongly believe that
this patch makes a great amount of sense!
# And I'm working on 'perf bench mem memcpy' concurrently :)
Below is the patch. Comments are welcome.
diff --git a/fs/block_dev.c b/fs/block_dev.c
index 8bed055..7ebee97 100644
--- a/fs/block_dev.c
+++ b/fs/block_dev.c
@@ -412,7 +412,7 @@ static int block_fsync(struct file *filp, struct dentry *dentry, int datasync)
* pseudo-fs
*/
-static __cacheline_aligned_in_smp DEFINE_SPINLOCK(bdev_lock);
+__cacheline_aligned_in_smp DEFINE_SPINLOCK_RAW(bdev_lock);
static struct kmem_cache * bdev_cachep __read_mostly;
static struct inode *bdev_alloc_inode(struct super_block *sb)
diff --git a/fs/dcache.c b/fs/dcache.c
index a100fa3..87bb305 100644
--- a/fs/dcache.c
+++ b/fs/dcache.c
@@ -38,7 +38,7 @@
int sysctl_vfs_cache_pressure __read_mostly = 100;
EXPORT_SYMBOL_GPL(sysctl_vfs_cache_pressure);
- __cacheline_aligned_in_smp DEFINE_SPINLOCK(dcache_lock);
+ __cacheline_aligned_in_smp DEFINE_SPINLOCK_RAW(dcache_lock);
__cacheline_aligned_in_smp DEFINE_SEQLOCK(rename_lock);
EXPORT_SYMBOL(dcache_lock);
diff --git a/fs/file_table.c b/fs/file_table.c
index 8eb4404..e59de52 100644
--- a/fs/file_table.c
+++ b/fs/file_table.c
@@ -31,7 +31,7 @@ struct files_stat_struct files_stat = {
};
/* public. Not pretty! */
-__cacheline_aligned_in_smp DEFINE_SPINLOCK(files_lock);
+__cacheline_aligned_in_smp DEFINE_SPINLOCK_RAW(files_lock);
/* SLAB cache for file structures */
static struct kmem_cache *filp_cachep __read_mostly;
diff --git a/fs/namespace.c b/fs/namespace.c
index bdc3cb4..791dcc4 100644
--- a/fs/namespace.c
+++ b/fs/namespace.c
@@ -38,7 +38,7 @@
#define HASH_SIZE (1UL << HASH_SHIFT)
/* spinlock for vfsmount related operations, inplace of dcache_lock */
-__cacheline_aligned_in_smp DEFINE_SPINLOCK(vfsmount_lock);
+__cacheline_aligned_in_smp DEFINE_SPINLOCK_RAW(vfsmount_lock);
static int event;
static DEFINE_IDA(mnt_id_ida);
diff --git a/include/asm-generic/vmlinux.lds.h b/include/asm-generic/vmlinux.lds.h
index b6e818f..d0da219 100644
--- a/include/asm-generic/vmlinux.lds.h
+++ b/include/asm-generic/vmlinux.lds.h
@@ -154,6 +154,10 @@
VMLINUX_SYMBOL(__start___tracepoints) = .; \
*(__tracepoints) \
VMLINUX_SYMBOL(__stop___tracepoints) = .; \
+ . = ALIGN(8); \
+ VMLINUX_SYMBOL(__lock_data_start) = .; \
+ *(.spinlock.data) \
+ VMLINUX_SYMBOL(__lock_data_end) = .; \
/* implement dynamic printk debug */ \
. = ALIGN(8); \
VMLINUX_SYMBOL(__start___verbose) = .; \
diff --git a/include/linux/spinlock_api_smp.h b/include/linux/spinlock_api_smp.h
index 7a7e18f..8f75bfe 100644
--- a/include/linux/spinlock_api_smp.h
+++ b/include/linux/spinlock_api_smp.h
@@ -196,6 +196,10 @@ static inline int __spin_trylock(spinlock_t *lock)
preempt_disable();
if (_raw_spin_trylock(lock)) {
spin_acquire(&lock->dep_map, 0, 1, _RET_IP_);
+#if defined(CONFIG_DEBUG_SPINLOCK) && defined(CONFIG_DEBUG_LOCK_ALLOC)
+ atomic64_inc(&lock->count);
+ atomic64_set(&lock->lock_start, native_read_tsc());
+#endif
return 1;
}
preempt_enable();
@@ -245,6 +249,11 @@ static inline unsigned long __spin_lock_irqsave(spinlock_t *lock)
local_irq_save(flags);
preempt_disable();
spin_acquire(&lock->dep_map, 0, 0, _RET_IP_);
+#if defined(CONFIG_DEBUG_SPINLOCK) && defined(CONFIG_DEBUG_LOCK_ALLOC)
+ atomic64_inc(&lock->count);
+ atomic64_set(&lock->lock_start, native_read_tsc());
+#endif
+
/*
* On lockdep we dont want the hand-coded irq-enable of
* _raw_spin_lock_flags() code, because lockdep assumes
@@ -263,6 +272,10 @@ static inline void __spin_lock_irq(spinlock_t *lock)
local_irq_disable();
preempt_disable();
spin_acquire(&lock->dep_map, 0, 0, _RET_IP_);
+#if defined(CONFIG_DEBUG_SPINLOCK) && defined(CONFIG_DEBUG_LOCK_ALLOC)
+ atomic64_inc(&lock->count);
+ atomic64_set(&lock->lock_start, native_read_tsc());
+#endif
LOCK_CONTENDED(lock, _raw_spin_trylock, _raw_spin_lock);
}
@@ -271,6 +284,10 @@ static inline void __spin_lock_bh(spinlock_t *lock)
local_bh_disable();
preempt_disable();
spin_acquire(&lock->dep_map, 0, 0, _RET_IP_);
+#if defined(CONFIG_DEBUG_SPINLOCK) && defined(CONFIG_DEBUG_LOCK_ALLOC)
+ atomic64_inc(&lock->count);
+ atomic64_set(&lock->lock_start, native_read_tsc());
+#endif
LOCK_CONTENDED(lock, _raw_spin_trylock, _raw_spin_lock);
}
@@ -334,6 +351,10 @@ static inline void __spin_lock(spinlock_t *lock)
{
preempt_disable();
spin_acquire(&lock->dep_map, 0, 0, _RET_IP_);
+#if defined(CONFIG_DEBUG_SPINLOCK) && defined(CONFIG_DEBUG_LOCK_ALLOC)
+ atomic64_inc(&lock->count);
+ atomic64_set(&lock->lock_start, native_read_tsc());
+#endif
LOCK_CONTENDED(lock, _raw_spin_trylock, _raw_spin_lock);
}
@@ -348,8 +369,23 @@ static inline void __write_lock(rwlock_t *lock)
static inline void __spin_unlock(spinlock_t *lock)
{
+#if defined(CONFIG_DEBUG_SPINLOCK) && defined(CONFIG_DEBUG_LOCK_ALLOC)
+ long end, locked_term;
+#endif
+
spin_release(&lock->dep_map, 1, _RET_IP_);
_raw_spin_unlock(lock);
+
+#if defined(CONFIG_DEBUG_SPINLOCK) && defined(CONFIG_DEBUG_LOCK_ALLOC)
+ end = native_read_tsc();
+ locked_term = end - atomic64_read(&lock->lock_start);
+ atomic64_add(locked_term, &lock->total);
+ if (locked_term > atomic64_read(&lock->longest))
+ atomic64_set(&lock->longest, locked_term);
+ else if (locked_term < atomic64_read(&lock->shortest))
+ atomic64_set(&lock->shortest, locked_term);
+#endif
+
preempt_enable();
}
@@ -370,24 +406,66 @@ static inline void __read_unlock(rwlock_t *lock)
static inline void __spin_unlock_irqrestore(spinlock_t *lock,
unsigned long flags)
{
+#if defined(CONFIG_DEBUG_SPINLOCK) && defined(CONFIG_DEBUG_LOCK_ALLOC)
+ long end, locked_term;
+#endif
+
spin_release(&lock->dep_map, 1, _RET_IP_);
_raw_spin_unlock(lock);
local_irq_restore(flags);
+#if defined(CONFIG_DEBUG_SPINLOCK) && defined(CONFIG_DEBUG_LOCK_ALLOC)
+ end = native_read_tsc();
+ locked_term = end - atomic64_read(&lock->lock_start);
+ atomic64_add(locked_term, &lock->total);
+ if (locked_term > atomic64_read(&lock->longest))
+ atomic64_set(&lock->longest, locked_term);
+ else if (locked_term < atomic64_read(&lock->shortest))
+ atomic64_set(&lock->shortest, locked_term);
+#endif
+
preempt_enable();
}
static inline void __spin_unlock_irq(spinlock_t *lock)
{
+#if defined(CONFIG_DEBUG_SPINLOCK) && defined(CONFIG_DEBUG_LOCK_ALLOC)
+ long end, locked_term;
+#endif
+
spin_release(&lock->dep_map, 1, _RET_IP_);
_raw_spin_unlock(lock);
local_irq_enable();
+#if defined(CONFIG_DEBUG_SPINLOCK) && defined(CONFIG_DEBUG_LOCK_ALLOC)
+ end = native_read_tsc();
+ locked_term = end - atomic64_read(&lock->lock_start);
+ atomic64_add(locked_term, &lock->total);
+ if (locked_term > atomic64_read(&lock->longest))
+ atomic64_set(&lock->longest, locked_term);
+ else if (locked_term < atomic64_read(&lock->shortest))
+ atomic64_set(&lock->shortest, locked_term);
+#endif
+
preempt_enable();
}
static inline void __spin_unlock_bh(spinlock_t *lock)
{
+#if defined(CONFIG_DEBUG_SPINLOCK) && defined(CONFIG_DEBUG_LOCK_ALLOC)
+ long end, locked_term;
+#endif
spin_release(&lock->dep_map, 1, _RET_IP_);
_raw_spin_unlock(lock);
+
+#if defined(CONFIG_DEBUG_SPINLOCK) && defined(CONFIG_DEBUG_LOCK_ALLOC)
+ end = native_read_tsc();
+ locked_term = end - atomic64_read(&lock->lock_start);
+ atomic64_add(locked_term, &lock->total);
+ if (locked_term > atomic64_read(&lock->longest))
+ atomic64_set(&lock->longest, locked_term);
+ else if (locked_term < atomic64_read(&lock->shortest))
+ atomic64_set(&lock->shortest, locked_term);
+#endif
+
preempt_enable_no_resched();
local_bh_enable_ip((unsigned long)__builtin_return_address(0));
}
@@ -447,6 +525,10 @@ static inline int __spin_trylock_bh(spinlock_t *lock)
preempt_disable();
if (_raw_spin_trylock(lock)) {
spin_acquire(&lock->dep_map, 0, 1, _RET_IP_);
+#if defined(CONFIG_DEBUG_SPINLOCK) && defined(CONFIG_DEBUG_LOCK_ALLOC)
+ atomic64_inc(&lock->count);
+ atomic64_set(&lock->lock_start, native_read_tsc());
+#endif
return 1;
}
preempt_enable_no_resched();
diff --git a/include/linux/spinlock_types.h b/include/linux/spinlock_types.h
index 68d88f7..a6e0db5 100644
--- a/include/linux/spinlock_types.h
+++ b/include/linux/spinlock_types.h
@@ -29,6 +29,10 @@ typedef struct {
#ifdef CONFIG_DEBUG_LOCK_ALLOC
struct lockdep_map dep_map;
#endif
+
+#if defined(CONFIG_DEBUG_SPINLOCK) && defined(CONFIG_DEBUG_LOCK_ALLOC)
+ atomic64_t lock_start, longest, count, total, shortest;
+#endif
} spinlock_t;
#define SPINLOCK_MAGIC 0xdead4ead
@@ -69,6 +73,7 @@ typedef struct {
.magic = SPINLOCK_MAGIC, \
.owner = SPINLOCK_OWNER_INIT, \
.owner_cpu = -1, \
+ .shortest = ATOMIC64_INIT(LONG_MAX), \
SPIN_DEP_MAP_INIT(lockname) }
#define __RW_LOCK_UNLOCKED(lockname) \
(rwlock_t) { .raw_lock = __RAW_RW_LOCK_UNLOCKED, \
@@ -94,7 +99,13 @@ typedef struct {
#define SPIN_LOCK_UNLOCKED __SPIN_LOCK_UNLOCKED(old_style_spin_init)
#define RW_LOCK_UNLOCKED __RW_LOCK_UNLOCKED(old_style_rw_init)
+#if defined(CONFIG_DEBUG_SPINLOCK) && defined(CONFIG_DEBUG_LOCK_ALLOC)
+#define DEFINE_SPINLOCK(x) spinlock_t __section(.spinlock.data) x = __SPIN_LOCK_UNLOCKED(x)
+#define DEFINE_SPINLOCK_RAW(x) spinlock_t x = __SPIN_LOCK_UNLOCKED(x)
+#else
#define DEFINE_SPINLOCK(x) spinlock_t x = __SPIN_LOCK_UNLOCKED(x)
+#endif
+
#define DEFINE_RWLOCK(x) rwlock_t x = __RW_LOCK_UNLOCKED(x)
#endif /* __LINUX_SPINLOCK_TYPES_H */
diff --git a/kernel/fork.c b/kernel/fork.c
index 166b8c4..a19b0b8 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -415,7 +415,7 @@ static inline void mm_free_pgd(struct mm_struct * mm)
#define mm_free_pgd(mm)
#endif /* CONFIG_MMU */
-__cacheline_aligned_in_smp DEFINE_SPINLOCK(mmlist_lock);
+__cacheline_aligned_in_smp DEFINE_SPINLOCK_RAW(mmlist_lock);
#define allocate_mm() (kmem_cache_alloc(mm_cachep, GFP_KERNEL))
#define free_mm(mm) (kmem_cache_free(mm_cachep, (mm)))
diff --git a/kernel/pid.c b/kernel/pid.c
index d3f722d..1e4accc 100644
--- a/kernel/pid.c
+++ b/kernel/pid.c
@@ -110,7 +110,7 @@ EXPORT_SYMBOL(is_container_init);
* For now it is easier to be safe than to prove it can't happen.
*/
-static __cacheline_aligned_in_smp DEFINE_SPINLOCK(pidmap_lock);
+__cacheline_aligned_in_smp DEFINE_SPINLOCK_RAW(pidmap_lock);
static void free_pidmap(struct upid *upid)
{
diff --git a/kernel/spinlock.c b/kernel/spinlock.c
index 5ddab73..288e1a5 100644
--- a/kernel/spinlock.c
+++ b/kernel/spinlock.c
@@ -21,6 +21,11 @@
#include <linux/debug_locks.h>
#include <linux/module.h>
+#include <linux/debugfs.h>
+#include <linux/seq_file.h>
+
+#include <asm/msr.h>
+
#ifndef _spin_trylock
int __lockfunc _spin_trylock(spinlock_t *lock)
{
@@ -385,3 +390,81 @@ notrace int in_lock_functions(unsigned long addr)
&& addr < (unsigned long)__lock_text_end;
}
EXPORT_SYMBOL(in_lock_functions);
+
+#if defined(CONFIG_DEBUG_SPINLOCK) && defined(CONFIG_DEBUG_LOCK_ALLOC)
+
+extern char __lock_data_start;
+extern char __lock_data_end;
+
+extern spinlock_t bdev_lock;
+extern spinlock_t dcache_lock;
+extern spinlock_t files_lock;
+extern spinlock_t vfsmount_lock;
+extern spinlock_t mmlist_lock;
+extern spinlock_t pidmap_lock;
+extern spinlock_t kernel_flag;
+
+static int spinlock_stats_show(struct seq_file *m, void *v)
+{
+ char *_spin;
+ spinlock_t *spin;
+ long count, total, shortest;
+
+ seq_printf(m, "name: count longest shortest total average\n");
+
+ for (_spin = &__lock_data_start; _spin < &__lock_data_end; _spin++) {
+ if (*(int *)_spin != SPINLOCK_MAGIC) /* DAMMIT! */
+ continue;
+
+ spin = container_of((unsigned int *)_spin, spinlock_t, magic);
+
+ count = atomic64_read(&spin->count);
+ total = atomic64_read(&spin->total);
+ shortest = atomic64_read(&spin->shortest);
+ seq_printf(m, "%s: %ld %ld %ld %ld %ld\n", spin->dep_map.name, count, atomic64_read(&spin->longest),
+ (shortest == LONG_MAX) ? 0 : shortest , total, count ? total / count : 0);
+ }
+
+#define P(spin) do { \
+ count = atomic64_read(&spin.count); \
+ total = atomic64_read(&spin.total); \
+ shortest = atomic64_read(&spin.shortest); \
+ seq_printf(m, "%s: %ld %ld %ld %ld %ld\n", spin.dep_map.name, count, atomic64_read(&spin.longest), \
+ (shortest == LONG_MAX) ? 0 : shortest, total, count ? total / count : 0); \
+ }while(0)
+
+ P(bdev_lock);
+ P(dcache_lock);
+ P(files_lock);
+ P(vfsmount_lock);
+ P(mmlist_lock);
+ P(pidmap_lock);
+ P(kernel_flag);
+
+#undef P
+
+ return 0;
+}
+
+static int spinlock_stats_open(struct inode *inode, struct file *filp)
+{
+ return single_open(filp, spinlock_stats_show, NULL);
+}
+
+static struct file_operations spinlock_stats_fops = {
+ .open = spinlock_stats_open,
+ .read = seq_read,
+ .llseek = seq_lseek,
+ .release = single_release,
+};
+
+static __init int init_spinlock_stats(void)
+{
+ debugfs_create_file("spinlock_stats", 0644, NULL, NULL,
+ &spinlock_stats_fops);
+
+ return 0;
+}
+late_initcall(init_spinlock_stats);
+
+#endif
diff --git a/lib/kernel_lock.c b/lib/kernel_lock.c
index 39f1029..87841f9 100644
--- a/lib/kernel_lock.c
+++ b/lib/kernel_lock.c
@@ -20,7 +20,7 @@
*
* Don't use in new code.
*/
-static __cacheline_aligned_in_smp DEFINE_SPINLOCK(kernel_flag);
+__cacheline_aligned_in_smp DEFINE_SPINLOCK_RAW(kernel_flag);
/*
* Hitoshi Mitake <[email protected]> wrote:
> * Description
>
> This patch creates the file spinlock_stats on top of debugfs.
> When a user reads this file, statistical data about spinlocks
> is printed.
hm, are you aware of the lockstat tracepoints? They do something like
this - and it utilizes perf events to report these events. See
include/trace/events/lockdep.h. Needs CONFIG_LOCK_STAT enabled.
Might make sense to put it into 'perf lock' kind of tool. I think
Frederic had such plans.
Ingo
From: Ingo Molnar <[email protected]>
Subject: Re: [PATCH][RFC] Measuring term of acquiring spinlock
Date: Thu, 12 Nov 2009 08:39:09 +0100
>
> * Hitoshi Mitake <[email protected]> wrote:
>
> > * Description
> >
> > This patch creates the file spinlock_stats on top of debugfs.
> > When a user reads this file, statistical data about spinlocks
> > is printed.
>
> hm, are you aware of the lockstat tracepoints? They do something like
> this - and it utilizes perf events to report these events. See
> include/trace/events/lockdep.h. Needs CONFIG_LOCK_STAT enabled.
>
Wow, I didn't know that. I'll try it.
> Might make sense to put it into 'perf lock' kind of tool. I think
> Frederic had such plans.
If you want a tester for 'perf lock', please call me, Frederic :)
* Hitoshi Mitake <[email protected]> wrote:
> From: Ingo Molnar <[email protected]>
> Subject: Re: [PATCH][RFC] Measuring term of acquiring spinlock
> Date: Thu, 12 Nov 2009 08:39:09 +0100
>
> >
> > * Hitoshi Mitake <[email protected]> wrote:
> >
> > > * Description
> > >
> > > This patch creates the file spinlock_stats on top of debugfs.
> > > When a user reads this file, statistical data about spinlocks
> > > is printed.
> >
> > hm, are you aware of the lockstat tracepoints? They do something
> > like this - and it utilizes perf events to report these events. See
> > include/trace/events/lockdep.h. Needs CONFIG_LOCK_STAT enabled.
> >
>
> Wow, I didn't know that. I'll try it.
Btw., i think we should rename that tracepoint category from 'lockdep'
to 'lock'. It's possible to enable them without CONFIG_PROVE_LOCKING,
and they arent high-overhead in that case.
Ingo
On Fri, 2009-11-13 at 09:17 +0100, Ingo Molnar wrote:
>
>
> Btw., i think we should rename that tracepoint category from 'lockdep'
> to 'lock'. It's possible to enable them without CONFIG_PROVE_LOCKING,
> and they arent high-overhead in that case.
They're still in 'lockdep', and it does still require enabling all of
the lock debugging code, so the overhead is far from zero.
On Fri, Nov 13, 2009 at 09:17:22AM +0100, Ingo Molnar wrote:
>
> * Hitoshi Mitake <[email protected]> wrote:
>
> > From: Ingo Molnar <[email protected]>
> > Subject: Re: [PATCH][RFC] Measuring term of acquiring spinlock
> > Date: Thu, 12 Nov 2009 08:39:09 +0100
> >
> > >
> > > * Hitoshi Mitake <[email protected]> wrote:
> > >
> > > > * Description
> > > >
> > > > This patch creates the file spinlock_stats on top of debugfs.
> > > > When a user reads this file, statistical data about spinlocks
> > > > is printed.
> > >
> > > hm, are you aware of the lockstat tracepoints? They do something
> > > like this - and it utilizes perf events to report these events. See
> > > include/trace/events/lockdep.h. Needs CONFIG_LOCK_STAT enabled.
> > >
> >
> > Wow, I didn't know that. I'll try it.
>
> Btw., i think we should rename that tracepoint category from 'lockdep'
> to 'lock'. It's possible to enable them without CONFIG_PROVE_LOCKING,
> and they arent high-overhead in that case.
>
> Ingo
I have a pending patch for that somewhere. I can send it right away.
* Frederic Weisbecker <[email protected]> wrote:
> On Fri, Nov 13, 2009 at 09:17:22AM +0100, Ingo Molnar wrote:
> >
> > * Hitoshi Mitake <[email protected]> wrote:
> >
> > > From: Ingo Molnar <[email protected]>
> > > Subject: Re: [PATCH][RFC] Measuring term of acquiring spinlock
> > > Date: Thu, 12 Nov 2009 08:39:09 +0100
> > >
> > > >
> > > > * Hitoshi Mitake <[email protected]> wrote:
> > > >
> > > > > * Description
> > > > >
> > > > > This patch creates the file spinlock_stats on top of debugfs.
> > > > > When a user reads this file, statistical data about spinlocks
> > > > > is printed.
> > > >
> > > > hm, are you aware of the lockstat tracepoints? They do something
> > > > like this - and it utilizes perf events to report these events. See
> > > > include/trace/events/lockdep.h. Needs CONFIG_LOCK_STAT enabled.
> > > >
> > >
> > > Wow, I didn't know that. I'll try it.
> >
> > Btw., i think we should rename that tracepoint category from 'lockdep'
> > to 'lock'. It's possible to enable them without CONFIG_PROVE_LOCKING,
> > and they arent high-overhead in that case.
> >
> > Ingo
>
>
> I have a pending patch for that somewhere. I can send it right away.
Cool - thanks!
Ingo
Lockdep events subsystem gathers various locking related events such
as a request, release, contention or acquisition of a lock.
The name of this event subsystem is a bit of a misnomer since these
events are not quite related to lockdep but more generally to locking,
ie: these events are not reporting lock dependencies or possible
deadlock scenario but pure locking events.
Hence this rename.
Signed-off-by: Frederic Weisbecker <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Cc: Arnaldo Carvalho de Melo <[email protected]>
Cc: Mike Galbraith <[email protected]>
Cc: Paul Mackerras <[email protected]>
Cc: Steven Rostedt <[email protected]>
Cc: Li Zefan <[email protected]>
Cc: Hitoshi Mitake <[email protected]>
---
include/trace/events/{lockdep.h => lock.h} | 8 ++++----
kernel/lockdep.c | 2 +-
2 files changed, 5 insertions(+), 5 deletions(-)
rename include/trace/events/{lockdep.h => lock.h} (92%)
diff --git a/include/trace/events/lockdep.h b/include/trace/events/lock.h
similarity index 92%
rename from include/trace/events/lockdep.h
rename to include/trace/events/lock.h
index bcf1d20..a870ba1 100644
--- a/include/trace/events/lockdep.h
+++ b/include/trace/events/lock.h
@@ -1,8 +1,8 @@
#undef TRACE_SYSTEM
-#define TRACE_SYSTEM lockdep
+#define TRACE_SYSTEM lock
-#if !defined(_TRACE_LOCKDEP_H) || defined(TRACE_HEADER_MULTI_READ)
-#define _TRACE_LOCKDEP_H
+#if !defined(_TRACE_LOCK_H) || defined(TRACE_HEADER_MULTI_READ)
+#define _TRACE_LOCK_H
#include <linux/lockdep.h>
#include <linux/tracepoint.h>
@@ -90,7 +90,7 @@ TRACE_EVENT(lock_acquired,
#endif
#endif
-#endif /* _TRACE_LOCKDEP_H */
+#endif /* _TRACE_LOCK_H */
/* This part must be outside protection */
#include <trace/define_trace.h>
diff --git a/kernel/lockdep.c b/kernel/lockdep.c
index 9af5672..f5dcd36 100644
--- a/kernel/lockdep.c
+++ b/kernel/lockdep.c
@@ -49,7 +49,7 @@
#include "lockdep_internals.h"
#define CREATE_TRACE_POINTS
-#include <trace/events/lockdep.h>
+#include <trace/events/lock.h>
#ifdef CONFIG_PROVE_LOCKING
int prove_locking = 1;
--
1.6.2.3
On Fri, 2009-11-13 at 10:06 +0100, Frederic Weisbecker wrote:
> Lockdep events subsystem gathers various locking related events such
> as a request, release, contention or acquisition of a lock.
>
> The name of this event subsystem is a bit of a misnomer since these
> events are not quite related to lockdep but more generally to locking,
> ie: these events are not reporting lock dependencies or possible
> deadlock scenario but pure locking events.
But in order to get them you need pretty much all of lockdep, except
PROVE_LOCKING. You get all the lock debugging, the lock tracking, the
struct dep_map bloat etc.
But sure, I don't mind renaming the category.
* Peter Zijlstra <[email protected]> wrote:
> On Fri, 2009-11-13 at 10:06 +0100, Frederic Weisbecker wrote:
> > Lockdep events subsystem gathers various locking related events such
> > as a request, release, contention or acquisition of a lock.
> >
> > The name of this event subsystem is a bit of a misnomer since these
> > events are not quite related to lockdep but more generally to
> > locking, ie: these events are not reporting lock dependencies or
> > possible deadlock scenario but pure locking events.
>
> But in order to get them you need pretty much all of lockdep, except
> PROVE_LOCKING. You get all the lock debugging, the lock tracking, the
> struct dep_map bloat etc.
>
> But sure, I don't mind renaming the category.
Yeah, i'd prefer it this way. You are right that most of lockdep.o is
still there, but most of the lockdep _overhead_ shouldnt be there - no
hashing, no tracking, etc.
it's still nonzero - see 'perf top' of a hackbench session, with
LOCK_STAT enabled and PROVE_LOCKING disabled:
------------------------------------------------------------------------------
PerfTop: 14059 irqs/sec kernel:99.8% [1000Hz cycles], (all, 16 CPUs)
------------------------------------------------------------------------------
samples pcnt function DSO
_______ _____ ________________________________ ________________
7320.00 7.9% sched_clock_local [kernel]
7217.00 7.8% lock_acquired [kernel]
5768.00 6.2% trace_hardirqs_off [kernel]
4562.00 4.9% __lock_acquire [kernel]
4304.00 4.6% lock_release [kernel]
3838.00 4.1% lock_acquire [kernel]
3833.00 4.1% look_up_lock_class [kernel]
3561.00 3.8% cpu_clock [kernel]
3283.00 3.5% start_critical_timing [kernel]
2992.00 3.2% __alloc_skb [kernel]
2680.00 2.9% acpi_pm_read [kernel]
2498.00 2.7% sched_clock [kernel]
2409.00 2.6% copy_user_generic_string [kernel]
2016.00 2.2% lock_release_holdtime [kernel]
1899.00 2.0% sched_clock_cpu [kernel]
but i think those should be gradually eliminated and improved, as lock
statistics could become a quite popular thing.
If the tracepoints are named 'lock', people will expect less overhead,
and might end up fixing/improving it. If it's named 'lockdep' on the
other hand, the expectation is higher overhead.
Ingo
Commit-ID: b8b901965dbc6572f365457bb8c6a487f7da5d02
Gitweb: http://git.kernel.org/tip/b8b901965dbc6572f365457bb8c6a487f7da5d02
Author: Frederic Weisbecker <[email protected]>
AuthorDate: Fri, 13 Nov 2009 10:06:34 +0100
Committer: Ingo Molnar <[email protected]>
CommitDate: Fri, 13 Nov 2009 10:28:13 +0100
tracing: Rename 'lockdep' event subsystem into 'lock'
Lockdep events subsystem gathers various locking related events
such as a request, release, contention or acquisition of a lock.
The name of this event subsystem is a bit of a misnomer since
these events are not quite related to lockdep but more generally
to locking, ie: these events are not reporting lock dependencies
or possible deadlock scenario but pure locking events.
Hence this rename.
Signed-off-by: Frederic Weisbecker <[email protected]>
Acked-by: Peter Zijlstra <[email protected]>
Acked-by: Hitoshi Mitake <[email protected]>
Cc: Arnaldo Carvalho de Melo <[email protected]>
Cc: Mike Galbraith <[email protected]>
Cc: Paul Mackerras <[email protected]>
Cc: Steven Rostedt <[email protected]>
Cc: Li Zefan <[email protected]>
LKML-Reference: <[email protected]>
Signed-off-by: Ingo Molnar <[email protected]>
---
include/trace/events/lockdep.h | 8 ++++----
kernel/lockdep.c | 2 +-
2 files changed, 5 insertions(+), 5 deletions(-)
diff --git a/include/trace/events/lockdep.h b/include/trace/events/lockdep.h
index bcf1d20..a870ba1 100644
--- a/include/trace/events/lockdep.h
+++ b/include/trace/events/lockdep.h
@@ -1,8 +1,8 @@
#undef TRACE_SYSTEM
-#define TRACE_SYSTEM lockdep
+#define TRACE_SYSTEM lock
-#if !defined(_TRACE_LOCKDEP_H) || defined(TRACE_HEADER_MULTI_READ)
-#define _TRACE_LOCKDEP_H
+#if !defined(_TRACE_LOCK_H) || defined(TRACE_HEADER_MULTI_READ)
+#define _TRACE_LOCK_H
#include <linux/lockdep.h>
#include <linux/tracepoint.h>
@@ -90,7 +90,7 @@ TRACE_EVENT(lock_acquired,
#endif
#endif
-#endif /* _TRACE_LOCKDEP_H */
+#endif /* _TRACE_LOCK_H */
/* This part must be outside protection */
#include <trace/define_trace.h>
diff --git a/kernel/lockdep.c b/kernel/lockdep.c
index 9af5672..f5dcd36 100644
--- a/kernel/lockdep.c
+++ b/kernel/lockdep.c
@@ -49,7 +49,7 @@
#include "lockdep_internals.h"
#define CREATE_TRACE_POINTS
-#include <trace/events/lockdep.h>
+#include <trace/events/lock.h>
#ifdef CONFIG_PROVE_LOCKING
int prove_locking = 1;
On Fri, Nov 13, 2009 at 10:10:09AM +0100, Peter Zijlstra wrote:
> On Fri, 2009-11-13 at 10:06 +0100, Frederic Weisbecker wrote:
> > Lockdep events subsystem gathers various locking related events such
> > as a request, release, contention or acquisition of a lock.
> >
> > The name of this event subsystem is a bit of a misnomer since these
> > events are not quite related to lockdep but more generally to locking,
> > ie: these events are not reporting lock dependencies or possible
> > deadlock scenario but pure locking events.
>
> But in order to get them you need pretty much all of lockdep, except
> PROVE_LOCKING. You get all the lock debugging, the lock tracking, the
> struct dep_map bloat etc.
Sure. But I think an event subsystem name should reflect its
actual purpose despite the machinery that makes it work.
Commit-ID: 67178767b936fb47a3a5e88097cff41ccbda7acb
Gitweb: http://git.kernel.org/tip/67178767b936fb47a3a5e88097cff41ccbda7acb
Author: Frederic Weisbecker <[email protected]>
AuthorDate: Fri, 13 Nov 2009 10:06:34 +0100
Committer: Ingo Molnar <[email protected]>
CommitDate: Fri, 13 Nov 2009 10:48:27 +0100
tracing: Rename 'lockdep' event subsystem into 'lock'
Lockdep events subsystem gathers various locking related events
such as a request, release, contention or acquisition of a lock.
The name of this event subsystem is a bit of a misnomer since
these events are not quite related to lockdep but more generally
to locking, ie: these events are not reporting lock dependencies
or possible deadlock scenario but pure locking events.
Hence this rename.
Signed-off-by: Frederic Weisbecker <[email protected]>
Acked-by: Peter Zijlstra <[email protected]>
Acked-by: Hitoshi Mitake <[email protected]>
Cc: Arnaldo Carvalho de Melo <[email protected]>
Cc: Mike Galbraith <[email protected]>
Cc: Paul Mackerras <[email protected]>
Cc: Steven Rostedt <[email protected]>
Cc: Li Zefan <[email protected]>
LKML-Reference: <[email protected]>
Signed-off-by: Ingo Molnar <[email protected]>
---
include/trace/events/{lockdep.h => lock.h} | 8 ++++----
kernel/lockdep.c | 2 +-
2 files changed, 5 insertions(+), 5 deletions(-)
diff --git a/include/trace/events/lockdep.h b/include/trace/events/lock.h
similarity index 92%
rename from include/trace/events/lockdep.h
rename to include/trace/events/lock.h
index bcf1d20..a870ba1 100644
--- a/include/trace/events/lockdep.h
+++ b/include/trace/events/lock.h
@@ -1,8 +1,8 @@
#undef TRACE_SYSTEM
-#define TRACE_SYSTEM lockdep
+#define TRACE_SYSTEM lock
-#if !defined(_TRACE_LOCKDEP_H) || defined(TRACE_HEADER_MULTI_READ)
-#define _TRACE_LOCKDEP_H
+#if !defined(_TRACE_LOCK_H) || defined(TRACE_HEADER_MULTI_READ)
+#define _TRACE_LOCK_H
#include <linux/lockdep.h>
#include <linux/tracepoint.h>
@@ -90,7 +90,7 @@ TRACE_EVENT(lock_acquired,
#endif
#endif
-#endif /* _TRACE_LOCKDEP_H */
+#endif /* _TRACE_LOCK_H */
/* This part must be outside protection */
#include <trace/define_trace.h>
diff --git a/kernel/lockdep.c b/kernel/lockdep.c
index 9af5672..f5dcd36 100644
--- a/kernel/lockdep.c
+++ b/kernel/lockdep.c
@@ -49,7 +49,7 @@
#include "lockdep_internals.h"
#define CREATE_TRACE_POINTS
-#include <trace/events/lockdep.h>
+#include <trace/events/lock.h>
#ifdef CONFIG_PROVE_LOCKING
int prove_locking = 1;
On Fri, Nov 13, 2009 at 01:21:25PM +0900, Hitoshi Mitake wrote:
> > Might make sense to put it into 'perf lock' kind of tool. I think
> > Frederic had such plans.
>
> If you want a tester for 'perf lock', please call me, Frederic :)
Thanks :)
I haven't started it yet, because of some other things I need to finish.
Would you be interested in starting it?
The lock tracepoints are:
- lock_acquire: the lock is requested but not yet "acquired"
- lock_release
- lock_contended: we can't get the lock yet because someone already
holds it.
- lock_acquired: we got the lock
So, the point would be to develop a state machine that is
able to give us statistics on a per lock granularity
basis:
- the total/average/maximum time a lock is contended,
waited before being acquired (probably
lock_acquired time - lock_acquire time)
- the ratio of contention versus immediate availability
=> Sequence acquire - acquired means immediate
availability. If we have a contended event inside
it means it had to wait because of the contention.
- the total/average/maximum time a lock is held.
It would be very nice to also add an option to see the
callsites sorted from max to min, if possible with a per
boundary granularity: the couple (lock callsite, unlock callsite),
still on a per lock basis.
- any idea you could have...
Such a tool would be very useful to profile the kernel locking.
It would be nice to use design close to what perf sched does:
having an lock event structure that provides callbacks for each
lock events so that we can easily plug various plugin inside.
It's just a suggestion in case you are interested and have time
for that. Otherwise I'll do it later.
Hm?
Thanks,
Frederic.
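A minimal sketch of the per-lock bookkeeping described above,
keyed per lock instance and fed with the event timestamps (all
names here are hypothetical, not taken from any posted patch):

	#include <stdint.h>

	struct lock_seq {			/* per-lock state machine */
		uint64_t acquire_ts;		/* last lock_acquire */
		uint64_t acquired_ts;		/* last lock_acquired */
	};

	struct lock_stat {			/* per-lock statistics */
		uint64_t nr_acquired, nr_contended;
		uint64_t wait_total, wait_max;	/* acquire -> acquired */
		uint64_t hold_total, hold_max;	/* acquired -> release */
	};

	static void on_acquire(struct lock_seq *seq, uint64_t ts)
	{
		seq->acquire_ts = ts;		/* lock requested */
	}

	static void on_contended(struct lock_stat *st)
	{
		st->nr_contended++;		/* had to wait for the owner */
	}

	static void on_acquired(struct lock_seq *seq, struct lock_stat *st,
				uint64_t ts)
	{
		uint64_t wait = ts - seq->acquire_ts;

		seq->acquired_ts = ts;
		st->nr_acquired++;
		st->wait_total += wait;
		if (wait > st->wait_max)
			st->wait_max = wait;
	}

	static void on_release(struct lock_seq *seq, struct lock_stat *st,
			       uint64_t ts)
	{
		uint64_t held = ts - seq->acquired_ts;

		st->hold_total += held;
		if (held > st->hold_max)
			st->hold_max = held;
	}

The contention ratio mentioned above would then simply be
nr_contended / nr_acquired per lock.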
From: Frederic Weisbecker <[email protected]>
Subject: Re: [PATCH][RFC] Measuring term of acquiring spinlock
Date: Fri, 13 Nov 2009 11:51:27 +0100
>
> On Fri, Nov 13, 2009 at 01:21:25PM +0900, Hitoshi Mitake wrote:
> > > Might make sense to put it into 'perf lock' kind of tool. I think
> > > Frederic had such plans.
> >
> > If you want a tester for 'perf lock', please call me, Frederic :)
>
>
> Thanks :)
> I haven't started it yet, because of some other things I need to finish.
>
> Would you be interested in starting it?
Yes, I'm very interested in it!
>
> The lock tracepoints are:
>
> - lock_acquire: the lock is requested but not yet "acquired"
> - lock_release
> - lock_contended: we can't get the lock yet because someone already
> holds it.
> - lock_acquired: we got the lock
>
>
> So, the point would be to develop a state machine that is
> able to give us statistics on a per lock granularity
> basis:
>
>
> - the total/average/maximum time a lock is contended,
> waited before being acquired (probably
> lock_acquired time - lock_acquire time)
>
> - the ratio of contention versus immediate availability
> => Sequence acquire - acquired means immediate
> availability. If we have a contended event inside
> it means it had to wait because of the contention.
>
> - the total/average/maximum time a lock is held.
> It would be very nice to also add an option to see the
> callsites sorted from max to min, if possible with a per
> boundary granularity: the couple (lock callsite, unlock callsite),
> still on a per lock basis.
>
> - any idea you could have...
>
>
> Such a tool would be very useful to profile the kernel locking.
>
> It would be nice to use design close to what perf sched does:
> having an lock event structure that provides callbacks for each
> lock events so that we can easily plug various plugin inside.
>
> It's just a suggestion in case you are interested and have time
> for that. Otherwise I'll do it later.
>
> Hm?
I'd like to do that.
But I'm an only newbie, so it may take a week (or more).
So If you finish this work, please post and disregard me :)
Hitoshi
On Sun, Nov 15, 2009 at 10:20:11AM +0900, Hitoshi Mitake wrote:
> > Thanks :)
> > I haven't started it yet, because of some other things I need to finish.
> >
> > Would you be interested in starting it?
>
> Yes, I'm very interested in it!
Great!
> > Such a tool would be very useful to profile the kernel locking.
> >
> > It would be nice to use design close to what perf sched does:
> > having an lock event structure that provides callbacks for each
> > lock events so that we can easily plug various plugin inside.
> >
> > It's just a suggestion in case you are interested and have time
> > for that. Otherwise I'll do it later.
> >
> > Hm?
>
> I'd like to do that.
> But I'm an only newbie, so it may take a week (or more).
Don't worry about that. Take your time.
> So If you finish this work, please post and disregard me :)
No if you take it I won't start a concurrent work.
Don't hesitate if you have questions. This will be the first tool
(at least that I'm aware of) that post-processes the lock
event so you may encounter weird things, missing events,
unappropriate sequences, or any bad things we haven't yet seen.
And don't forget to use -M on perf record to record
the lock events so that you have them multiplexed across cpus
and hence well ordered wrt time. If later we need something that
scales better, we can still drop the use of -M and do the reordering
from userspace.
Thanks.
From: Frederic Weisbecker <[email protected]>
Subject: Re: [PATCH][RFC] Measuring term of acquiring spinlock
Date: Sun, 15 Nov 2009 03:21:38 +0100
> On Sun, Nov 15, 2009 at 10:20:11AM +0900, Hitoshi Mitake wrote:
> > > Thanks :)
> > > I haven't started it yet, because of some other things I need to finish.
> > >
> > > Would you be interested in starting it?
> >
> > Yes, I'm very interested in it!
>
>
>
> Great!
>
>
> > > Such a tool would be very useful to profile the kernel locking.
> > >
> > > It would be nice to use design close to what perf sched does:
> > > having an lock event structure that provides callbacks for each
> > > lock events so that we can easily plug various plugin inside.
> > >
> > > It's just a suggestion in case you are interested and have time
> > > for that. Otherwise I'll do it later.
> > >
> > > Hm?
> >
> > I'd like to do that.
> > But I'm an only newbie, so it may take a week (or more).
>
>
> Don't worry about that. Take your time.
>
>
> > So If you finish this work, please post and disregard me :)
>
>
> No if you take it I won't start a concurrent work.
>
> Don't hesitate if you have questions. This will be the first tool
> (at least that I'm aware of) that post-processes the lock
> event so you may encounter weird things, missing events,
> unappropriate sequences, or any bad things we haven't yet seen.
Thanks. Of course, I won't hesitate to ask questions :)
>
> And don't forget to use -M on perf record to record
> the lock events so that you have them multiplexed across cpus
> and hence well ordered wrt time. If later we need something that
> scales better, we can still drop the use of -M and do the reordering
> from userspace.
Thanks for your advice! I'll do my best.
VERY very sorry for my long silence...
Please don't commit this patch, Ingo.
This patch is required to test the next patch for perf lock.
I have a question for Steven Rostedt and Tom Zanussi.
Does current raw_field_ptr() support __data_loc?
At 064739bc4b3d7f424b2f25547e6611bcf0132415 ,
support for the modifier "__data_loc" of format is added.
But when I tried to parse the format of lock_acquired (or some other event),
raw_field_ptr() did not return the correct pointer.
So I modified raw_field_ptr() as in this patch, and then raw_field_ptr() works well.
The parser is large, so I'd like to ask you about this.
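For reference, my understanding of the __data_loc encoding
that this patch relies on (the variable names below are only
for illustration):

	/* A __data_loc field does not hold the string itself; it holds a
	 * 32-bit locator: the low 16 bits are the offset of the payload
	 * inside the record, the high 16 bits are its length. */
	int dloc   = *(int *)(data + field->offset);
	int offset = dloc & 0xffff;	/* where the string really lives */
	int len    = dloc >> 16;	/* its length (unused by this patch) */
	char *str  = (char *)data + offset;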
Cc: Frederic Weisbecker <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Cc: Paul Mackerras <[email protected]>
Cc: Tom Zanussi <[email protected]>
Cc: Steven Rostedt <[email protected]>
---
tools/perf/util/trace-event-parse.c | 7 +++++++
1 files changed, 7 insertions(+), 0 deletions(-)
diff --git a/tools/perf/util/trace-event-parse.c b/tools/perf/util/trace-event-parse.c
index 0302405..494a470 100644
--- a/tools/perf/util/trace-event-parse.c
+++ b/tools/perf/util/trace-event-parse.c
@@ -1925,6 +1925,13 @@ void *raw_field_ptr(struct event *event, const char *name, void *data)
if (!field)
return NULL;
+ if (field->flags & FIELD_IS_STRING) {
+ int offset;
+ offset = *(int *)(data + field->offset);
+ offset &= 0xffff;
+ return data + offset;
+ }
+
return data + field->offset;
}
--
1.6.5.2
This patch adds the new subcommand "lock" to perf for analyzing lock usage statistics.
The current perf lock is very primitive and cannot yet provide even the minimum functionality.
Of course I will continue working on this,
but a too-big patch is not a good thing for you, so I'm posting this now.
And I found some important problems, so I'd like to ask your opinion.
As another issue, this patch depends on the previous one.
The previous one is very dirty and temporary, and I cannot sign off on it, so I cannot sign off on this one either...
First, it seems that the current locks (spinlock, rwlock, mutex) have no numeric ID,
so we can't treat rq->lock on CPU 0 and rq->lock on CPU 1 as different things;
the symbol name of a lock cannot be a complete ID.
This is the reason for the current ugly data structure for lock_stat
(the per-lock statistics structure in builtin-lock.c).
A hash table would solve the speed problem,
but it is not a fundamental solution.
I understand it is hard to implement numeric IDs for locks,
but they are seriously needed; do you have any ideas?
Second, a lot of information is missing from the trace events.
For example, the current lock event subsystem cannot provide the time between
lock_acquired and lock_release.
This time is already measured by lockdep, and we can obtain it
from /proc/lock_stat,
but /proc/lock_stat only provides cumulative information since boot.
So I would have to modify a wide area of the kernel, including lockdep; may I do this?
Third, significant overhead :-(
% perf bench sched messaging # Without perf lock rec
Total time: 0.436 [sec]
% sudo ./perf lock rec perf bench sched messaging # With perf lock rec
Total time: 4.677 [sec]
[ perf record: Woken up 0 times to write data ]
[ perf record: Captured and wrote 106.065 MB perf.data (~4634063 samples) ]
Over 10 times! No one can ignore this...
This is an example of using perf lock prof:
% sudo ./perf lock prof # Outputs in pager
------------------------------------------------------------------------------------------
Lock | Acquired | Max wait ns | Min wait ns | Total wait ns |
--------------------------------------------------------------------------------------------
&q->lock 30 0 0 0
&ctx->lock 3912 0 0 0
event_mutex 2 0 0 0
&newf->file_lock 1008 0 0 0
dcache_lock 444 0 0 0
&dentry->d_lock 1164 0 0 0
&ctx->mutex 2 0 0 0
&child->perf_event_mutex 2 0 0 0
&event->child_mutex 18 0 0 0
&f->f_lock 2 0 0 0
&event->mmap_mutex 2 0 0 0
&sb->s_type->i_mutex_key 259 0 0 0
&sem->wait_lock 27205 0 0 0
&(&ip->i_iolock)->mr_lock 130 0 0 0
&(&ip->i_lock)->mr_lock 6376 0 0 0
&parent->list_lock 9149 7367 146 527013
&inode->i_data.tree_lock 12175 0 0 0
&inode->i_data.private_lock 6097 0 0 0
Cc: Frederic Weisbecker <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Cc: Paul Mackerras <[email protected]>
Cc: Tom Zanussi <[email protected]>
Cc: Steven Rostedt <[email protected]>
---
tools/perf/Makefile | 1 +
tools/perf/builtin-lock.c | 449 +++++++++++++++++++++++++++++++++++++++++++++
tools/perf/builtin.h | 1 +
tools/perf/perf.c | 1 +
4 files changed, 452 insertions(+), 0 deletions(-)
create mode 100644 tools/perf/builtin-lock.c
diff --git a/tools/perf/Makefile b/tools/perf/Makefile
index 23ec660..28b0258 100644
--- a/tools/perf/Makefile
+++ b/tools/perf/Makefile
@@ -437,6 +437,7 @@ BUILTIN_OBJS += builtin-top.o
BUILTIN_OBJS += builtin-trace.o
BUILTIN_OBJS += builtin-probe.o
BUILTIN_OBJS += builtin-kmem.o
+BUILTIN_OBJS += builtin-lock.o
PERFLIBS = $(LIB_FILE)
diff --git a/tools/perf/builtin-lock.c b/tools/perf/builtin-lock.c
new file mode 100644
index 0000000..d81ba3f
--- /dev/null
+++ b/tools/perf/builtin-lock.c
@@ -0,0 +1,449 @@
+#include "builtin.h"
+#include "perf.h"
+
+#include "util/util.h"
+#include "util/cache.h"
+#include "util/symbol.h"
+#include "util/thread.h"
+#include "util/header.h"
+
+#include "util/parse-options.h"
+#include "util/trace-event.h"
+
+#include "util/debug.h"
+#include "util/data_map.h"
+
+#include <sys/types.h>
+#include <sys/prctl.h>
+
+#include <semaphore.h>
+#include <pthread.h>
+#include <math.h>
+
+#include <limits.h>
+
+/*
+ * FIXME and TODO:
+ * At least, hash table should be used here.
+ * Ideally, numeric ID for each lock instance is desired.
+ */
+struct lock_stat {
+ char *name; /* for strcpy(), we cannot use const */
+
+ unsigned int nr_acquired;
+ unsigned int nr_contended;
+ /* nr_contended - nr_acquired == immediate availability */
+
+ /* these times are in nano sec. */
+ u64 wait_time_total;
+ u64 wait_time_min;
+ u64 wait_time_max;
+};
+
+static struct lock_stat lock_stat_sentinel = {
+ .name = NULL,
+};
+
+static struct lock_stat **lock_stats;
+
+static void lock_stats_init(void)
+{
+ lock_stats = zalloc(sizeof(struct lock_stat *));
+ if (!lock_stats)
+ die("memory allocation failed\n");
+ lock_stats[0] = zalloc(sizeof(struct lock_stat));
+ if (!lock_stats[0])
+ die("memory allocation failed\n");
+ lock_stats[0] = &lock_stat_sentinel;
+}
+
+static struct lock_stat *lock_stat_findnew(const char *name)
+{
+ int i, new;
+
+ for (i = 0; lock_stats[i]->name; i++) {
+ if (!strcmp(lock_stats[i]->name, name))
+ return lock_stats[i];
+ }
+
+ new = i;
+ lock_stats = xrealloc(lock_stats,
+ sizeof(struct lock_stats *) * (i + 2));
+ lock_stats[new] = calloc(sizeof(struct lock_stat), 1);
+ if (!lock_stats[new])
+ die("memory allocation failed\n");
+ lock_stats[new]->name = calloc(sizeof(char), strlen(name) + 1);
+ if (!lock_stats[new]->name)
+ die("memory allocation failed\n");
+ strcpy(lock_stats[new]->name, name);
+ lock_stats[new]->wait_time_min = ULLONG_MAX;
+
+ lock_stats[i + 1] = &lock_stat_sentinel;
+ return lock_stats[new];
+}
+
+static char const *input_name = "perf.data";
+
+static struct perf_header *header;
+static u64 sample_type;
+
+static int profile_cpu = -1;
+
+struct raw_event_sample {
+ u32 size;
+ char data[0];
+};
+
+struct trace_acquire_event {
+ const char *name;
+};
+
+struct trace_acquired_event {
+ const char *name;
+ u64 wait_time; /* in nano sec. */
+};
+
+struct trace_contended_event {
+ const char *name;
+};
+
+struct trace_release_event {
+ const char *name;
+};
+
+struct trace_lock_handler {
+ void (*acquire_event)(struct trace_acquire_event *,
+ struct event *,
+ int cpu,
+ u64 timestamp,
+ struct thread *thread);
+
+ void (*acquired_event)(struct trace_acquired_event *,
+ struct event *,
+ int cpu,
+ u64 timestamp,
+ struct thread *thread);
+
+ void (*contended_event)(struct trace_contended_event *,
+ struct event *,
+ int cpu,
+ u64 timestamp,
+ struct thread *thread);
+
+ void (*release_event)(struct trace_release_event *,
+ struct event *,
+ int cpu,
+ u64 timestamp,
+ struct thread *thread);
+};
+
+static void prof_lock_acquire_event(struct trace_acquire_event *acquire_event __used,
+ struct event *__event __used,
+ int cpu __used,
+ u64 timestamp __used,
+ struct thread *thread __used)
+{
+}
+
+static void prof_lock_acquired_event(struct trace_acquired_event *acquired_event,
+ struct event *__event __used,
+ int cpu __used,
+ u64 timestamp __used,
+ struct thread *thread __used)
+{
+ struct lock_stat *st;
+
+ st = lock_stat_findnew(acquired_event->name);
+ st->nr_acquired++;
+
+ if (acquired_event->wait_time != 0) {
+ st->wait_time_total += acquired_event->wait_time;
+ if (st->wait_time_max < acquired_event->wait_time)
+ st->wait_time_max = acquired_event->wait_time;
+ else if (st->wait_time_min > acquired_event->wait_time)
+ st->wait_time_min = acquired_event->wait_time;
+ }
+}
+
+static void prof_lock_contended_event(struct trace_contended_event *contended_event,
+ struct event *__event __used,
+ int cpu __used,
+ u64 timestamp __used,
+ struct thread *thread __used)
+{
+ struct lock_stat *st;
+
+ st = lock_stat_findnew(contended_event->name);
+ st->nr_contended++;
+}
+
+static void prof_lock_release_event(struct trace_release_event *release_event __used,
+ struct event *__event __used,
+ int cpu __used,
+ u64 timestamp __used,
+ struct thread *thread __used)
+{
+}
+
+/* lock oriented handlers */
+/* TODO: handlers for CPU oriented, thread oriented */
+static struct trace_lock_handler prof_lock_ops = {
+ .acquire_event = prof_lock_acquire_event,
+ .acquired_event = prof_lock_acquired_event,
+ .contended_event = prof_lock_contended_event,
+ .release_event = prof_lock_release_event,
+};
+
+static struct trace_lock_handler *trace_handler;
+
+static void
+process_lock_acquire_event(struct raw_event_sample *raw __used,
+ struct event *event __used,
+ int cpu __used,
+ u64 timestamp __used,
+ struct thread *thread __used)
+{
+}
+
+static void
+process_lock_acquired_event(struct raw_event_sample *raw __used,
+ struct event *event __used,
+ int cpu __used,
+ u64 timestamp __used,
+ struct thread *thread __used)
+{
+ struct trace_acquired_event acquired_event;
+
+ acquired_event.name = (char *)raw_field_ptr(event, "name", raw->data);
+ acquired_event.wait_time =
+ (unsigned long)raw_field_value(event, "wait_usec", raw->data)
+ * 1000;
+ acquired_event.wait_time +=
+ (unsigned long)raw_field_value(event,
+ "wait_nsec_rem", raw->data);
+
+ if (trace_handler->acquired_event) {
+ trace_handler->acquired_event(&acquired_event,
+ event, cpu, timestamp, thread);
+ }
+}
+
+static void
+process_lock_contended_event(struct raw_event_sample *raw __used,
+ struct event *event __used,
+ int cpu __used,
+ u64 timestamp __used,
+ struct thread *thread __used)
+{
+}
+
+static void
+process_lock_release_event(struct raw_event_sample *raw __used,
+ struct event *event __used,
+ int cpu __used,
+ u64 timestamp __used,
+ struct thread *thread __used)
+{
+}
+
+static void
+process_raw_event(event_t *raw_event __used, void *more_data,
+ int cpu, u64 timestamp, struct thread *thread)
+{
+ struct raw_event_sample *raw = more_data;
+ struct event *event;
+ int type;
+
+ type = trace_parse_common_type(raw->data);
+ event = trace_find_event(type);
+
+ if (!strcmp(event->name, "lock_acquire"))
+ process_lock_acquire_event(raw, event, cpu, timestamp, thread);
+ if (!strcmp(event->name, "lock_acquired"))
+ process_lock_acquired_event(raw, event, cpu, timestamp, thread);
+ if (!strcmp(event->name, "lock_contended"))
+ process_lock_contended_event(raw, event, cpu, timestamp, thread);
+ if (!strcmp(event->name, "lock_release"))
+ process_lock_release_event(raw, event, cpu, timestamp, thread);
+}
+
+static int process_sample_event(event_t *event)
+{
+ struct thread *thread;
+ u64 ip = event->ip.ip;
+ u64 timestamp = -1;
+ u32 cpu = -1;
+ u64 period = 1;
+ void *more_data = event->ip.__more_data;
+
+ thread = threads__findnew(event->ip.pid);
+
+ if (!(sample_type & PERF_SAMPLE_RAW))
+ return 0;
+
+ if (sample_type & PERF_SAMPLE_TIME) {
+ timestamp = *(u64 *)more_data;
+ more_data += sizeof(u64);
+ }
+
+ if (sample_type & PERF_SAMPLE_CPU) {
+ cpu = *(u32 *)more_data;
+ more_data += sizeof(u32);
+ more_data += sizeof(u32); /* reserved */
+ }
+
+ if (sample_type & PERF_SAMPLE_PERIOD) {
+ period = *(u64 *)more_data;
+ more_data += sizeof(u64);
+ }
+
+ dump_printf("(IP, %d): %d/%d: %p period: %Ld\n",
+ event->header.misc,
+ event->ip.pid, event->ip.tid,
+ (void *)(long)ip,
+ (long long)period);
+
+ if (thread == NULL) {
+ pr_debug("problem processing %d event, skipping it.\n",
+ event->header.type);
+ return -1;
+ }
+
+ dump_printf(" ... thread: %s:%d\n", thread->comm, thread->pid);
+
+ if (profile_cpu != -1 && profile_cpu != (int) cpu)
+ return 0;
+
+ process_raw_event(event, more_data, cpu, timestamp, thread);
+
+ return 0;
+}
+
+static int sample_type_check(u64 type)
+{
+ sample_type = type;
+
+ if (!(sample_type & PERF_SAMPLE_RAW)) {
+ fprintf(stderr,
+ "No trace sample to read. Did you call perf record "
+ "without -R?");
+ return -1;
+ }
+
+ return 0;
+}
+
+/* TODO: various way to print */
+static void print_result(void)
+{
+ int i;
+
+ /* FIXME: formatting output is very rough */
+ printf("\n ------------------------------------------------------------------------------------------\n");
+ printf(" Lock | Acquired | Max wait ns | Min wait ns | Total wait ns |\n");
+ printf(" --------------------------------------------------------------------------------------------\n");
+
+ for (i = 0; lock_stats[i]->name; i++) {
+ printf("%32s %13u %13llu %13llu %15llu\n",
+ lock_stats[i]->name,
+ lock_stats[i]->nr_acquired,
+ lock_stats[i]->wait_time_max,
+ lock_stats[i]->wait_time_min == ULLONG_MAX ?
+ 0 : lock_stats[i]->wait_time_min,
+ lock_stats[i]->wait_time_total);
+ }
+}
+
+static struct perf_file_handler file_handler = {
+ .process_sample_event = process_sample_event,
+ .sample_type_check = sample_type_check,
+};
+
+static int read_events(void)
+{
+ register_idle_thread();
+ register_perf_file_handler(&file_handler);
+
+ return mmap_dispatch_perf_file(&header, input_name, 0, 0,
+ &event__cwdlen, &event__cwd);
+}
+
+static void __cmd_prof(void)
+{
+ setup_pager();
+ lock_stats_init();
+ read_events();
+ print_result();
+}
+
+static const char * const lock_usage[] = {
+ "perf lock [<options>] {record|trace|prof}",
+ NULL
+};
+
+static const struct option lock_options[] = {
+ OPT_STRING('i', "input", &input_name, "file",
+ "input file name"),
+ OPT_BOOLEAN('v', "verbose", &verbose,
+ "be more verbose (show symbol address, etc)"),
+ OPT_BOOLEAN('D', "dump-raw-trace", &dump_trace,
+ "dump raw trace in ASCII"),
+ OPT_END()
+};
+
+static const char *record_args[] = {
+ "record",
+ "-a",
+ "-R",
+ "-M",
+ "-f",
+ "-m", "1024",
+ "-c", "1",
+ "-e", "lock:lock_acquire:r",
+ "-e", "lock:lock_acquired:r",
+ "-e", "lock:lock_contended:r",
+ "-e", "lock:lock_release:r",
+};
+
+static int __cmd_record(int argc, const char **argv)
+{
+ unsigned int rec_argc, i, j;
+ const char **rec_argv;
+
+ rec_argc = ARRAY_SIZE(record_args) + argc - 1;
+ rec_argv = calloc(rec_argc + 1, sizeof(char *));
+
+ for (i = 0; i < ARRAY_SIZE(record_args); i++)
+ rec_argv[i] = strdup(record_args[i]);
+
+ for (j = 1; j < (unsigned int)argc; j++, i++)
+ rec_argv[i] = argv[j];
+
+ BUG_ON(i != rec_argc);
+
+ return cmd_record(i, rec_argv, NULL);
+}
+
+int cmd_lock(int argc, const char **argv, const char *prefix __used)
+{
+ symbol__init(0);
+
+ argc = parse_options(argc, argv, lock_options, lock_usage,
+ PARSE_OPT_STOP_AT_NON_OPTION);
+ if (!argc)
+ usage_with_options(lock_usage, lock_options);
+
+ if (!strncmp(argv[0], "rec", 3)) {
+ return __cmd_record(argc, argv);
+ } else if (!strncmp(argv[0], "prof", 4)) {
+ trace_handler = &prof_lock_ops;
+ __cmd_prof();
+ } else if (!strcmp(argv[0], "trace")) {
+ /* Aliased to 'perf trace' */
+ return cmd_trace(argc, argv, prefix);
+ } else {
+ usage_with_options(lock_usage, lock_options);
+ }
+
+ return 0;
+}
diff --git a/tools/perf/builtin.h b/tools/perf/builtin.h
index a3d8bf6..aeaa710 100644
--- a/tools/perf/builtin.h
+++ b/tools/perf/builtin.h
@@ -29,5 +29,6 @@ extern int cmd_trace(int argc, const char **argv, const char *prefix);
extern int cmd_version(int argc, const char **argv, const char *prefix);
extern int cmd_probe(int argc, const char **argv, const char *prefix);
extern int cmd_kmem(int argc, const char **argv, const char *prefix);
+extern int cmd_lock(int argc, const char **argv, const char *prefix);
#endif
diff --git a/tools/perf/perf.c b/tools/perf/perf.c
index cf64049..f100e4b 100644
--- a/tools/perf/perf.c
+++ b/tools/perf/perf.c
@@ -300,6 +300,7 @@ static void handle_internal_command(int argc, const char **argv)
{ "sched", cmd_sched, 0 },
{ "probe", cmd_probe, 0 },
{ "kmem", cmd_kmem, 0 },
+ { "lock", cmd_lock, 0 },
};
unsigned int i;
static const char ext[] = STRIP_EXTENSION;
--
1.6.5.2
On Mon, Dec 07, 2009 at 12:34:43PM +0900, Hitoshi Mitake wrote:
> VERY very sorry for my long silence...
>
> Please don't commit this patch, Ingo.
> This patch is required to test the next patch for perf lock.
>
> I have a question for Steven Rostedt and Tom Zanussi.
> Does current raw_field_ptr() support __data_loc?
>
> At 064739bc4b3d7f424b2f25547e6611bcf0132415 ,
> support for the modifier "__data_loc" of format is added.
I guess this patch handled the nature of the strings, but the current
end use is in the perl scripting support. There is no support for it in
raw_field_ptr().
Your patch looks good to me.
On Mon, Dec 07, 2009 at 12:34:44PM +0900, Hitoshi Mitake wrote:
> This patch adds new subcommand "lock" to perf for analyzing lock usage statistics.
> Current perf lock is very primitive. This cannot provide the minimum functions.
> Of course I continue to working on this.
> But too big patch is not good thing for you, so I post this.
Oh great!
Yeah, the work can be done incrementally.
>
> And I found some important problem, so I'd like to ask your opinion.
> For another issue, this patch depends on the previous one.
> The previous one is very dirty and temporary, I cannot sign on it, so I cannot sign on this too...
The previous one looks rather good actually.
> First, it seems that current locks (spinlock, rwlock, mutex) has no numeric ID.
> So we can't treat rq->lock on CPU 0 and rq->lock on CPU 1 as different things.
> Symbol name of locks cannot be complete ID.
> This is the result of current ugly data structure for lock_stat
> (data structure for stat per lock in builtin-lock.c).
> Hash table will solve the problem of speed,
> but it is not a radical solution.
> I understand it is hard to implement numeric IDs for locks,
> but it is required seriously, do you have some ideas?
Indeed. I think every lock instance has its own lockdep_map.
And this lockdep_map is passed in every lock event but is
only used to retrieve the name of the lock.
Why not add the address of the lockdep_map to the event?
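For illustration, recording the lockdep_map address could look roughly like
the sketch below. This is not the in-tree event definition: the prototype is
simplified, the lockdep_addr field name is made up, and the usual
trace-header boilerplate (TRACE_SYSTEM define, include guards,
CREATE_TRACE_POINTS) is omitted.

TRACE_EVENT(lock_acquired,

	TP_PROTO(struct lockdep_map *lock, unsigned long ip),

	TP_ARGS(lock, ip),

	TP_STRUCT__entry(
		__string(name, lock->name)
		__field(void *, lockdep_addr)	/* unique per lock instance */
	),

	TP_fast_assign(
		__assign_str(name, lock->name);
		__entry->lockdep_addr = lock;
	),

	TP_printk("%p %s", __entry->lockdep_addr, __get_str(name))
);

On the userspace side, that address can then serve as a stable
per-instance key instead of the class name string.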
> Second, there's a lot of lack of information from trace events.
> For example, current lock event subsystem cannot provide the time between
> lock_acquired and lock_release.
> But this time is already measured in lockdep, and we can obtain it
> from /proc/lock_stat.
> But /proc/lock_stat provides information from boot time only.
> So I have to modify wide area of kernel including lockdep, may I do this?
I think this is more something to compute in a state machine:
the lock_release - lock_acquired delta.
This is what we do with the sched events in perf sched latency.
Also I think we should remove the field that gives the time waited
between lock_acquire and lock_acquired. This is more something that
should be done in userspace instead of being calculated in the kernel.
That puts the overhead in the wrong place.
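A minimal sketch of that state machine, assuming per-instance bookkeeping on
the userspace side (the struct, field, and function names here are made up
for illustration, not the actual perf code):

#include <stdint.h>

/*
 * Remember the timestamp of the last lock_acquired per lock instance and
 * account the held time when the matching lock_release arrives.
 */
struct lock_seq_stat {
	void		*id;		/* lock instance identifier */
	uint64_t	acquired_time;	/* timestamp of last lock_acquired, 0 if none */
	uint64_t	held_total;	/* accumulated held time in ns */
	unsigned int	nr_held;	/* acquire/release pairs seen */
};

static void on_lock_acquired(struct lock_seq_stat *st, uint64_t timestamp)
{
	st->acquired_time = timestamp;
}

static void on_lock_release(struct lock_seq_stat *st, uint64_t timestamp)
{
	if (!st->acquired_time)
		return;		/* release without a matching acquire seen */

	st->held_total += timestamp - st->acquired_time;
	st->nr_held++;
	st->acquired_time = 0;
}

The same pair of hooks could also track maximum and minimum held times.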
>
> Third, siginificant overhead :-(
>
> % perf bench sched messaging # Without perf lock rec
>
> Total time: 0.436 [sec]
>
> % sudo ./perf lock rec perf bench sched messaging # With perf lock rec
>
> Total time: 4.677 [sec]
> [ perf record: Woken up 0 times to write data ]
> [ perf record: Captured and wrote 106.065 MB perf.data (~4634063 samples) ]
>
> Over 10 times! No one can ignore this...
I think the lock events are much more sensitive than the sched events,
and that by nature: they are a very high frequency event class, probably the
highest among all the event classes we have (the worst being function tracing :)
But still, you're right, there are certainly various things we need to
optimize in this area.
More than 8 times slower is high.
>
> This is example of using perf lock prof:
> % sudo ./perf lock prof # Outputs in pager
> ------------------------------------------------------------------------------------------
> Lock | Acquired | Max wait ns | Min wait ns | Total wait ns |
> --------------------------------------------------------------------------------------------
> &q->lock 30 0 0 0
> &ctx->lock 3912 0 0 0
> event_mutex 2 0 0 0
> &newf->file_lock 1008 0 0 0
> dcache_lock 444 0 0 0
> &dentry->d_lock 1164 0 0 0
> &ctx->mutex 2 0 0 0
> &child->perf_event_mutex 2 0 0 0
> &event->child_mutex 18 0 0 0
> &f->f_lock 2 0 0 0
> &event->mmap_mutex 2 0 0 0
> &sb->s_type->i_mutex_key 259 0 0 0
> &sem->wait_lock 27205 0 0 0
> &(&ip->i_iolock)->mr_lock 130 0 0 0
> &(&ip->i_lock)->mr_lock 6376 0 0 0
> &parent->list_lock 9149 7367 146 527013
> &inode->i_data.tree_lock 12175 0 0 0
> &inode->i_data.private_lock 6097 0 0 0
Very nice and promising!
I can't wait to try it.
* Frederic Weisbecker <[email protected]> wrote:
> On Mon, Dec 07, 2009 at 12:34:44PM +0900, Hitoshi Mitake wrote:
> > This patch adds new subcommand "lock" to perf for analyzing lock usage statistics.
> > Current perf lock is very primitive. This cannot provide the minimum functions.
> > Of course I continue to working on this.
> > But too big patch is not good thing for you, so I post this.
>
> Oh great!
> Yeah, the work can be done incrementally.
>
[...]
>
>
> Very nice and promising!
>
> I can't wait to try it.
ok, to ease testing i've created a new (and not yet permanent) topic
tree for it to track this new perf feature: tip:perf/lock and pushed it
out.
Note: because it's not yet in a final form i have not merged it into
tip:master yet - when you are working on these bits you need to do this
manually via:
git merge tip/perf/lock
Also, we might need to rebase this branch as it's WIP, so the commit IDs
are not permanent yet. But i thought it would be easier to do deltas on
this basis.
Hitoshi-san, the patches did not have a Signed-off-by line from you, can
i add them for you?
Also, i agree that the performance aspect is probably the most pressing
issue. Note that 'perf bench sched messaging' is very locking intense so
a 10x slowdown is not entirely unexpected - we still ought to optimize
it all some more. 'perf lock' is an excellent testcase for this in any
case.
Thanks,
Ingo
Ingo Molnar wrote:
> Also, i agree that the performance aspect is probably the most pressing
> issue. Note that 'perf bench sched messaging' is very locking intense so
> a 10x slowdown is not entirely unexpected - we still ought to optimize
> it all some more. 'perf lock' is an excellent testcase for this in any
> case.
>
Here are some test results to show the overhead of lockdep trace events:
                        select    pagefault  mmap      Memory par  Cont_SW
                        latency   latency    latency   R/W BD      latency
 disable ftrace         0         0          0         0           0
 enable all ftrace      -16.65%   -109.80%   -93.62%   0.14%       -6.94%
 enable all ftrace,
   except lockdep       -2.67%    1.08%      -3.65%    -0.52%      -0.68%
We also found big overhead when using kernbench and fio, but we haven't
verified whether it's caused by lockdep events.
Thanks,
Xiao
From: Frederic Weisbecker <[email protected]>
Subject: Re: [PATCH 2/2] perf lock: New subcommand "lock" to perf for analyzing lock statistics
Date: Mon, 7 Dec 2009 05:41:26 +0100
Frederic, thanks for your comment!
> >
> > And I found some important problem, so I'd like to ask your opinion.
> > For another issue, this patch depends on the previous one.
> > The previous one is very dirty and temporary, I cannot sign on it, so I cannot sign on this too...
>
>
>
> The previous one looks rather good actually.
Thanks for your review in the previous mail.
I'm new to perf, so I didn't have much confidence.
Your advice is encouraging!
>
>
>
> > First, it seems that current locks (spinlock, rwlock, mutex) has no numeric ID.
> > So we can't treat rq->lock on CPU 0 and rq->lock on CPU 1 as different things.
> > Symbol name of locks cannot be complete ID.
> > This is the result of current ugly data structure for lock_stat
> > (data structure for stat per lock in builtin-lock.c).
> > Hash table will solve the problem of speed,
> > but it is not a radical solution.
> > I understand it is hard to implement numeric IDs for locks,
> > but it is required seriously, do you have some ideas?
>
>
> Indeed. I think every lock instance has its own lockdep_map.
> And this lockdep_map is passed in every lock event but is
> only used to retrieve the name of the lock.
>
> Why not adding the address of the lockdep_map in the event?
It's a good idea. The address cannot be used as an array index directly,
but dealing with it is far easier than a string, and much cheaper.
>
>
> > Second, there's a lot of lack of information from trace events.
> > For example, current lock event subsystem cannot provide the time between
> > lock_acquired and lock_release.
> > But this time is already measured in lockdep, and we can obtain it
> > from /proc/lock_stat.
> > But /proc/lock_stat provides information from boot time only.
> > So I have to modify wide area of kernel including lockdep, may I do this?
>
>
>
> I think this is more something to compute in a state machine:
> lock_release - lock_acquired event.
>
> This is what we do with sched events in perf sched latency
Yeah, tracing the state of the lock is a smart way. I'll try it.
>
> Also I think we should remove the field that gives the time waited
> between lock_acquire and lock_acquired. This is more something that
> should be done in userspace instead of calculating in from the kernel.
> This brings overhead in the wrong place.
I agree. I think we can extract more information from the timestamps.
>
>
> >
> > Third, siginificant overhead :-(
> >
> > % perf bench sched messaging # Without perf lock rec
> >
> > Total time: 0.436 [sec]
> >
> > % sudo ./perf lock rec perf bench sched messaging # With perf lock rec
> >
> > Total time: 4.677 [sec]
> > [ perf record: Woken up 0 times to write data ]
> > [ perf record: Captured and wrote 106.065 MB perf.data (~4634063 samples) ]
> >
> > Over 10 times! No one can ignore this...
>
>
> I think that the lock events are much more sensible than the sched events,
> and that by nature: these are very high frequency events class, probably the
> highest among every event classes we have (the worst beeing function tracing :)
>
> But still, you're right, there are certainly various things we need to
> optimize in this area.
>
> More than 8 times slower is high.
It seems that lockdep contains some O(n) code paths.
Of course lockdep is important, but analyzing lock usage statistics
is a separate problem.
I think separating lockdep from the lock events used for stats can be a solution.
>
>
> >
> > This is example of using perf lock prof:
> > % sudo ./perf lock prof # Outputs in pager
> > ------------------------------------------------------------------------------------------
> > Lock | Acquired | Max wait ns | Min wait ns | Total wait ns |
> > --------------------------------------------------------------------------------------------
> > &q->lock 30 0 0 0
> > &ctx->lock 3912 0 0 0
> > event_mutex 2 0 0 0
> > &newf->file_lock 1008 0 0 0
> > dcache_lock 444 0 0 0
> > &dentry->d_lock 1164 0 0 0
> > &ctx->mutex 2 0 0 0
> > &child->perf_event_mutex 2 0 0 0
> > &event->child_mutex 18 0 0 0
> > &f->f_lock 2 0 0 0
> > &event->mmap_mutex 2 0 0 0
> > &sb->s_type->i_mutex_key 259 0 0 0
> > &sem->wait_lock 27205 0 0 0
> > &(&ip->i_iolock)->mr_lock 130 0 0 0
> > &(&ip->i_lock)->mr_lock 6376 0 0 0
> > &parent->list_lock 9149 7367 146 527013
> > &inode->i_data.tree_lock 12175 0 0 0
> > &inode->i_data.private_lock 6097 0 0 0
>
>
>
> Very nice and promising!
>
> I can't wait to try it.
>
>
Thanks! I'll do my best :)
From: Ingo Molnar <[email protected]>
Subject: Re: [PATCH 2/2] perf lock: New subcommand "lock" to perf for analyzing lock statistics
Date: Mon, 7 Dec 2009 08:27:52 +0100
>
> * Frederic Weisbecker <[email protected]> wrote:
>
> > On Mon, Dec 07, 2009 at 12:34:44PM +0900, Hitoshi Mitake wrote:
> > > This patch adds new subcommand "lock" to perf for analyzing lock usage statistics.
> > > Current perf lock is very primitive. This cannot provide the minimum functions.
> > > Of course I continue to working on this.
> > > But too big patch is not good thing for you, so I post this.
> >
> > Oh great!
> > Yeah, the work can be done incrementally.
> >
> [...]
> >
> >
> > Very nice and promising!
> >
> > I can't wait to try it.
>
> ok, to ease testing i've created a new (and not yet permanent) topic
> tree for it to track this new perf feature: tip:perf/lock and pushed it
> out.
>
> Note: because it's not yet in a final form i have not merged it into
> tip:master yet - when you are working on these bits you need to do this
> manually via:
>
> git merge tip/perf/lock
>
> Also, we might need to rebase this branch as it's WIP, so the commit IDs
> are not permanent yet. But i thought it would be easier to do deltas on
> this basis.
Thanks!
>
> Hitoshi-san, the patches did not have a Signed-off-by line from you, can
> i add them for you?
Yes, of course. I didn't sign them off
because they were too experimental for the master branch.
But I can sign them off as experimental work on an experimental branch.
>
> Also, i agree that the performance aspect is probably the most pressing
> issue. Note that 'perf bench sched messaging' is very locking intense so
> a 10x slowdown is not entirely unexpected - we still ought to optimize
> it all some more. 'perf lock' is an excellent testcase for this in any
> case.
Yeah, as I described in my reply to Frederic,
separating lockdep from the lock events used by perf lock
might be a solution for the performance problem. I'll try it.
Thanks
Hitoshi
From: Xiao Guangrong <[email protected]>
Subject: Re: [PATCH 2/2] perf lock: New subcommand "lock" to perf for analyzing lock statistics
Date: Mon, 07 Dec 2009 16:38:03 +0800
Hi Xiao,
>
>
> Ingo Molnar wrote:
>
> > Also, i agree that the performance aspect is probably the most pressing
> > issue. Note that 'perf bench sched messaging' is very locking intense so
> > a 10x slowdown is not entirely unexpected - we still ought to optimize
> > it all some more. 'perf lock' is an excellent testcase for this in any
> > case.
> >
>
> Here are some test results to show the overhead of lockdep trace events:
>
> select pagefault mmap Memory par Cont_SW
> latency latency latency R/W BD latency
>
> disable ftrace 0 0 0 0 0
>
> enable all ftrace -16.65% -109.80% -93.62% 0.14% -6.94%
>
> enable all ftrace -2.67% 1.08% -3.65% -0.52% -0.68%
> except lockdep
>
>
> We also found big overhead when using kernbench and fio, but we haven't
> verified whether it's caused by lockdep events.
Thanks for the data - the numbers are alarming, but important.
The difference between "enable all ftrace" and "enable all ftrace except lockdep"
is significant... This must be reduced.
Thanks
Hitoshi
On Mon, 2009-12-07 at 12:34 +0900, Hitoshi Mitake wrote:
> VERY very sorry for my long silence...
>
> Please don't commit this patch, Ingo.
> This patch is required to test the next patch for perf lock.
>
> I have a question for Steven Rostedt and Tom Zanussi.
> Does current raw_field_ptr() support __data_loc?
>
> At 064739bc4b3d7f424b2f25547e6611bcf0132415 ,
> support for the modifier "__data_loc" of format is added.
>
> But, when I wanted to parse format of lock_acquired (or some event else),
> raw_field_ptr() did not returned correct pointer.
>
> So I modified raw_field_ptr() like this patch. Then raw_field_ptr() works well.
> Parser was large so I'd like to ask you about this.
>
> Cc: Frederic Weisbecker <[email protected]>
> Cc: Peter Zijlstra <[email protected]>
> Cc: Paul Mackerras <[email protected]>
> Cc: Tom Zanussi <[email protected]>
> Cc: Steven Rostedt <[email protected]>
>
> I have a question
> ---
> tools/perf/util/trace-event-parse.c | 7 +++++++
> 1 files changed, 7 insertions(+), 0 deletions(-)
>
> diff --git a/tools/perf/util/trace-event-parse.c b/tools/perf/util/trace-event-parse.c
> index 0302405..494a470 100644
> --- a/tools/perf/util/trace-event-parse.c
> +++ b/tools/perf/util/trace-event-parse.c
> @@ -1925,6 +1925,13 @@ void *raw_field_ptr(struct event *event, const char *name, void *data)
> if (!field)
> return NULL;
>
> + if (field->flags & FIELD_IS_STRING) {
> + int offset;
> + offset = *(int *)(data + field->offset);
> + offset &= 0xffff;
> + return data + offset;
> + }
> +
Hmm, I guess I need to update the code in perf again. The way I did this
was to have FIELD_IS_STRING evaluated in eval_num_arg().
-- Steve
> return data + field->offset;
> }
>
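For reference, a __data_loc field stores a 32-bit descriptor in the fixed
part of the record while the payload lives in the dynamic part: the low 16
bits are the offset from the start of the record, the high 16 bits the
payload length. A small decoding helper could look like the sketch below
(illustrative only, not the perf code Steve refers to):

static void *data_loc_ptr(void *data, int field_offset, int *lenp)
{
	unsigned int desc = *(unsigned int *)((char *)data + field_offset);

	if (lenp)
		*lenp = desc >> 16;		/* length of the dynamic payload */

	return (char *)data + (desc & 0xffff);	/* start of the payload */
}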
On Mon, 2009-12-07 at 16:38 +0800, Xiao Guangrong wrote:
>
> Ingo Molnar wrote:
>
> > Also, i agree that the performance aspect is probably the most pressing
> > issue. Note that 'perf bench sched messaging' is very locking intense so
> > a 10x slowdown is not entirely unexpected - we still ought to optimize
> > it all some more. 'perf lock' is an excellent testcase for this in any
> > case.
> >
>
> Here are some test results to show the overhead of lockdep trace events:
>
> select pagefault mmap Memory par Cont_SW
> latency latency latency R/W BD latency
>
> disable ftrace 0 0 0 0 0
>
> enable all ftrace -16.65% -109.80% -93.62% 0.14% -6.94%
>
> enable all ftrace -2.67% 1.08% -3.65% -0.52% -0.68%
> except lockdep
>
>
> We also found big overhead when using kernbench and fio, but we haven't
> verified whether it's caused by lockdep events.
Well, it is expected that recording all locking is going to have a
substantial overhead. In my measurements, a typical event takes around
250ns (note, I've gotten this down to 140ns in recent updates,
and even 90ns by disabling integrity checks, but I don't want to disable
those checks in production).
Anyway, if you add just 100ns to every lock taken in the kernel, that
will definitely increase the overhead. Just enable spin_lock() in the
function tracer and watch the performance go down. This is why, when
using the function tracer, I usually add all the locking functions to the
notrace filter. That alone helps tremendously when tracing functions.
-- Steve
On Mon, Dec 07, 2009 at 04:38:03PM +0800, Xiao Guangrong wrote:
>
>
> Ingo Molnar wrote:
>
> > Also, i agree that the performance aspect is probably the most pressing
> > issue. Note that 'perf bench sched messaging' is very locking intense so
> > a 10x slowdown is not entirely unexpected - we still ought to optimize
> > it all some more. 'perf lock' is an excellent testcase for this in any
> > case.
> >
>
> Here are some test results to show the overhead of lockdep trace events:
>
> select pagefault mmap Memory par Cont_SW
> latency latency latency R/W BD latency
>
> disable ftrace 0 0 0 0 0
>
> enable all ftrace -16.65% -109.80% -93.62% 0.14% -6.94%
>
> enable all ftrace -2.67% 1.08% -3.65% -0.52% -0.68%
> except lockdep
>
>
> We also found big overhead when using kernbench and fio, but we haven't
> verified whether it's caused by lockdep events.
>
> Thanks,
> Xiao
This profile has been done using ftrace with perf, right?
It might be because the lock events are high rate events and
fill a lot of perf buffer space, more than other events.
In one of your previous mails, you showed us the difference
in the size of perf.data when capturing either scheduler events
or lock events.
And IIRC, the case of lock events resulted in a 100 MB perf.data
whereas it was a small file for the sched events.
The overhead in the pagefault and mmap latency could then
come from the fact that we have many more events to save, walking
through many more pages in the perf buffer, then faulting more often,
etc.
Plus the fact that various locks are taken in the mmap and fault paths,
generating more lock events.
Just a guess...
On Mon, Dec 07, 2009 at 08:48:05PM +0100, Frederic Weisbecker wrote:
> On Mon, Dec 07, 2009 at 04:38:03PM +0800, Xiao Guangrong wrote:
> >
> >
> > Ingo Molnar wrote:
> >
> > > Also, i agree that the performance aspect is probably the most pressing
> > > issue. Note that 'perf bench sched messaging' is very locking intense so
> > > a 10x slowdown is not entirely unexpected - we still ought to optimize
> > > it all some more. 'perf lock' is an excellent testcase for this in any
> > > case.
> > >
> >
> > Here are some test results to show the overhead of lockdep trace events:
> >
> > select pagefault mmap Memory par Cont_SW
> > latency latency latency R/W BD latency
> >
> > disable ftrace 0 0 0 0 0
> >
> > enable all ftrace -16.65% -109.80% -93.62% 0.14% -6.94%
> >
> > enable all ftrace -2.67% 1.08% -3.65% -0.52% -0.68%
> > except lockdep
> >
> >
> > We also found big overhead when using kernbench and fio, but we haven't
> > verified whether it's caused by lockdep events.
> >
> > Thanks,
> > Xiao
>
>
> This profile has been done using ftrace with perf right?
> It might be because the lock events are high rate events and
> fill a lot of perf buffer space. More than other events.
> In one of your previous mails, you showed us the difference
> of the size of perf.data by capturing either scheduler events
> or lock events.
I'm not sure who sent this email actually. Maybe you or Hitoshi.
But anyway, you got the point :)
On Mon, Dec 07, 2009 at 11:51:23PM +0900, Hitoshi Mitake wrote:
> From: Frederic Weisbecker <[email protected]>
> > Indeed. I think every lock instance has its own lockdep_map.
> > And this lockdep_map is passed in every lock event but is
> > only used to retrieve the name of the lock.
> >
> > Why not adding the address of the lockdep_map in the event?
>
> It's good idea. Address cannot be used as index of array directly,
> but dealing with it is far easier than string and low cost.
Yeah, dealing with lock strings as identifiers is like dealing
with the lock class instead of the lock instance. It would
require a very tricky state machine to sort that out.
The address of the lockdep_map would be useful as an identifier.
Probably a hashlist based on it would be a good way: we can
already use the hashlist from the kernel headers (list.h) in the
perf tools.
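A rough sketch of such a hash, assuming the events carry the lockdep_map
address and that the hlist_head/hlist_node types and container_of() come
from the list helpers perf borrows from the kernel headers; the hash size,
shift, and struct layout are illustrative, not the final builtin-lock.c code:

#define LOCKHASH_BITS	12
#define LOCKHASH_SIZE	(1UL << LOCKHASH_BITS)

static struct hlist_head lockhash_table[LOCKHASH_SIZE];

#define lockhash(addr)	(((unsigned long)(addr) >> 4) & (LOCKHASH_SIZE - 1))

struct lock_stat {
	struct hlist_node	hash_entry;
	void			*addr;	/* lockdep_map address = instance id */
	const char		*name;	/* class name, for display only */
	/* ... wait/hold statistics ... */
};

static struct lock_stat *lock_stat_find(void *addr)
{
	struct hlist_head *head = &lockhash_table[lockhash(addr)];
	struct hlist_node *node;

	for (node = head->first; node; node = node->next) {
		struct lock_stat *st =
			container_of(node, struct lock_stat, hash_entry);

		if (st->addr == addr)
			return st;
	}
	return NULL;
}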
> > I think that the lock events are much more sensible than the sched events,
> > and that by nature: these are very high frequency events class, probably the
> > highest among every event classes we have (the worst beeing function tracing :)
> >
> > But still, you're right, there are certainly various things we need to
> > optimize in this area.
> >
> > More than 8 times slower is high.
>
> It seems that lockdep contains some O(n) codes.
> Of course lockdep is important, but analyzing statistics of lock usage
> is another problem.
> I think separating lockdep and lock event for stats can be solution.
Indeed.
Thanks!
Hi Frederic,
Frederic Weisbecker wrote:
> This profile has been done using ftrace with perf right?
We did not use perf; we just drove it through debugfs. The test procedure
is as follows:
case 1: disable all trace events:
echo nop > debugfs/tracing/current_tracer
echo > debugfs/tracing/set_event
run benchmarks
...
case 2: enable all trace events except lockdep:
echo nop > debugfs/tracing/current_tracer
echo *:* > debugfs/tracing/set_event
echo 0 > debugfs/tracing/events/lockdep/enable
run benchmarks
...
> In one of your previous mails, you showed us the difference
> of the size of perf.data by capturing either scheduler events
> or lock events.
>
It's not my work :-)
Thanks,
Xiao
> And IIRC, the case of lock events resulted in a 100 MB perf.data
> whereas it was a small file for sched events.
>
> The overhead in the pagefault and mmap latency could then
> result in the fact we have much more events to save, walking
> through much more pages in perf buffer, then faulting more often,
> etc.
>
> Plus the fact various locks are taken in mmap and fault path,
> generating more lock events.
>
> Just a guess...
>
>
>