2007-05-29 13:19:21

by Peter Zijlstra

Subject: [PATCH 3/5] lockstat: core infrastructure

Introduce the core lock statistics code.

Lock statistics provides lock wait-time and hold-time (as well as the count
of the corresponding contention and acquisition events). Also, the first few
call-sites that encounter contention are tracked.

Lock wait-time is the time spent waiting on the lock. This provides insight
into the locking scheme; that is, a heavily contended lock indicates a
too-coarse locking scheme.

Lock hold-time is the duration the lock was held; it provides a reference for
the wait-time numbers, so they can be put into perspective.

1)
lock
2)
... do stuff ..
unlock
3)

The time between 1 and 2 is the wait-time. The time between 2 and 3 is the
hold-time.

The lockdep held-lock tracking code is reused, because it already collects locks
into meaningful groups (classes), and because it is an existing infrastructure
for lock instrumentation.

Currently lockdep tracks lock acquisition with two hooks:

lock()
lock_acquire()
_lock()

... code protected by lock ...

unlock()
lock_release()
_unlock()

We extend this with two more hooks in order to measure contention:

lock_contended() - marks the start of a contention event
lock_acquired() - marks the completion of the contention

These hooks are placed as follows:

lock()
lock_acquire()
if (!_try_lock())
lock_contended()
_lock()
lock_acquired()

... do locked stuff ...

unlock()
lock_release()
_unlock()

(Note: the try_lock() 'trick' is used to avoid having to instrument every
platform-dependent lock primitive implementation.)
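
As an illustration, a spinlock wrapper would then look roughly like the
sketch below; the actual instrumentation of the lock primitives is done
in later patches of this series:

void __lockfunc _spin_lock(spinlock_t *lock)
{
	preempt_disable();
	spin_acquire(&lock->dep_map, 0, 0, _RET_IP_);
	LOCK_CONTENDED(lock, _raw_spin_trylock, _raw_spin_lock);
}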

It is also possible to toggle the two lockdep features at runtime using:

/proc/sys/kernel/prove_locking
/proc/sys/kernel/lock_stat

(especially turning off the O(n^2) prove_locking functionality can help;
e.g. echo 0 > /proc/sys/kernel/prove_locking)

Signed-off-by: Peter Zijlstra <[email protected]>
Acked-by: Ingo Molnar <[email protected]>
Acked-by: Jason Baron <[email protected]>
---
include/linux/lockdep.h | 50 +++++++
include/linux/sysctl.h | 2
kernel/lockdep.c | 314 +++++++++++++++++++++++++++++++++++++++++++++++-
kernel/sysctl.c | 28 ++++
lib/Kconfig.debug | 11 +
5 files changed, 404 insertions(+), 1 deletion(-)

Index: linux-2.6-git/kernel/lockdep.c
===================================================================
--- linux-2.6-git.orig/kernel/lockdep.c
+++ linux-2.6-git/kernel/lockdep.c
@@ -42,6 +42,16 @@

#include "lockdep_internals.h"

+#ifdef CONFIG_PROVE_LOCKING
+int prove_locking = 1;
+module_param(prove_locking, int, 0644);
+#endif
+
+#ifdef CONFIG_LOCK_STAT
+int lock_stat = 1;
+module_param(lock_stat, int, 0644);
+#endif
+
/*
* lockdep_lock: protects the lockdep graph, the hashes and the
* class/list/hash allocators.
@@ -123,6 +133,92 @@ static struct lock_list *alloc_list_entr
unsigned long nr_lock_classes;
static struct lock_class lock_classes[MAX_LOCKDEP_KEYS];

+#ifdef CONFIG_LOCK_STAT
+static DEFINE_PER_CPU(struct lock_class_stats[MAX_LOCKDEP_KEYS], lock_stats);
+
+static int lock_contention_point(struct lock_class *class, unsigned long ip)
+{
+ int i;
+
+ for (i = 0; i < ARRAY_SIZE(class->contention_point); i++) {
+ if (class->contention_point[i] == 0) {
+ class->contention_point[i] = ip;
+ break;
+ }
+ if (class->contention_point[i] == ip)
+ break;
+ }
+
+ return i;
+}
+
+static inline void lock_time_add(struct lock_time *src, struct lock_time *dst)
+{
+ /* aggregate extremes as min/max rather than summing them */
+ if (src->nr) {
+ if (src->min < dst->min || !dst->nr)
+ dst->min = src->min;
+ if (src->max > dst->max)
+ dst->max = src->max;
+ }
+ dst->total += src->total;
+ dst->nr += src->nr;
+}
+
+static void lock_time_inc(struct lock_time *lt, unsigned long long time)
+{
+ if (time > lt->max)
+ lt->max = time;
+
+ if (time < lt->min || !lt->min)
+ lt->min = time;
+
+ lt->total += time;
+ lt->nr++;
+}
+
+struct lock_class_stats lock_stats(struct lock_class *class)
+{
+ struct lock_class_stats stats;
+ int cpu, i;
+
+ memset(&stats, 0, sizeof(struct lock_class_stats));
+ for_each_possible_cpu(cpu) {
+ struct lock_class_stats *pcs =
+ &per_cpu(lock_stats, cpu)[class - lock_classes];
+
+ for (i = 0; i < ARRAY_SIZE(stats.contention_point); i++)
+ stats.contention_point[i] += pcs->contention_point[i];
+
+ lock_time_add(&pcs->read_waittime, &stats.read_waittime);
+ lock_time_add(&pcs->write_waittime, &stats.write_waittime);
+
+ lock_time_add(&pcs->read_holdtime, &stats.read_holdtime);
+ lock_time_add(&pcs->write_holdtime, &stats.write_holdtime);
+ }
+
+ return stats;
+}
+
+void clear_lock_stats(struct lock_class *class)
+{
+ int cpu;
+
+ for_each_possible_cpu(cpu) {
+ struct lock_class_stats *cpu_stats =
+ &per_cpu(lock_stats, cpu)[class - lock_classes];
+
+ memset(cpu_stats, 0, sizeof(struct lock_class_stats));
+ }
+ memset(class->contention_point, 0, sizeof(class->contention_point));
+}
+
+static struct lock_class_stats *get_lock_stats(struct lock_class *class)
+{
+ return &get_cpu_var(lock_stats)[class - lock_classes];
+}
+
+static void put_lock_stats(struct lock_class_stats *stats)
+{
+ put_cpu_var(lock_stats);
+}
+#endif
+
/*
* We keep a global list of all lock classes. The list only grows,
* never shrinks. The list is only accessed with the lockdep
@@ -2035,6 +2131,11 @@ static int __lock_acquire(struct lockdep
int chain_head = 0;
u64 chain_key;

+#ifdef CONFIG_PROVE_LOCKING
+ if (!prove_locking)
+ check = 1;
+#endif
+
if (unlikely(!debug_locks))
return 0;

@@ -2085,6 +2186,10 @@ static int __lock_acquire(struct lockdep
hlock->read = read;
hlock->check = check;
hlock->hardirqs_off = hardirqs_off;
+#ifdef CONFIG_LOCK_STAT
+ hlock->waittime_stamp = 0;
+ hlock->holdtime_stamp = sched_clock();
+#endif

if (check != 2)
goto out_calc_hash;
@@ -2132,10 +2237,11 @@ static int __lock_acquire(struct lockdep
}
}
#endif
+out_calc_hash:
/* mark it as used: */
if (!mark_lock(curr, hlock, LOCK_USED))
return 0;
-out_calc_hash:
+
/*
* Calculate the chain hash: it's the combined hash of all the
* lock keys along the dependency chain. We save the hash value
@@ -2183,6 +2289,7 @@ out_calc_hash:
chain_key = iterate_chain_key(chain_key, id);
curr->curr_chain_key = chain_key;

+#ifdef CONFIG_PROVE_LOCKING
/*
* Trylock needs to maintain the stack of held locks, but it
* does not add new dependencies, because trylock can be done
@@ -2229,6 +2336,7 @@ out_calc_hash:
/* after lookup_chain_cache(): */
if (unlikely(!debug_locks))
return 0;
+#endif

curr->lockdep_depth++;
check_chain_key(curr);
@@ -2276,6 +2384,35 @@ print_unlock_inbalance_bug(struct task_s
return 0;
}

+#ifdef CONFIG_LOCK_STAT
+static int
+print_lock_contention_bug(struct task_struct *curr, struct lockdep_map *lock,
+ unsigned long ip)
+{
+ if (!debug_locks_off())
+ return 0;
+ if (debug_locks_silent)
+ return 0;
+
+ printk("\n=================================\n");
+ printk( "[ BUG: bad contention detected! ]\n");
+ printk( "---------------------------------\n");
+ printk("%s/%d is trying to contend lock (",
+ curr->comm, curr->pid);
+ print_lockdep_cache(lock);
+ printk(") at:\n");
+ print_ip_sym(ip);
+ printk("but there are no locks held!\n");
+ printk("\nother info that might help us debug this:\n");
+ lockdep_print_held_locks(curr);
+
+ printk("\nstack backtrace:\n");
+ dump_stack();
+
+ return 0;
+}
+#endif
+
/*
* Common debugging checks for both nested and non-nested unlock:
*/
@@ -2293,6 +2430,32 @@ static int check_unlock(struct task_stru
return 1;
}

+#ifdef CONFIG_LOCK_STAT
+static void lock_release_holdtime(struct held_lock *hlock)
+{
+ struct lock_class_stats *stats;
+ unsigned long long holdtime;
+
+ if (!lock_stat)
+ return;
+
+ holdtime = sched_clock() - hlock->holdtime_stamp;
+
+ stats = get_lock_stats(hlock->class);
+
+ if (hlock->read)
+ lock_time_inc(&stats->read_holdtime, holdtime);
+ else
+ lock_time_inc(&stats->write_holdtime, holdtime);
+
+ put_lock_stats(stats);
+}
+#else
+static void lock_release_holdtime(struct held_lock *hlock)
+{
+}
+#endif
+
/*
* Remove the lock to the list of currently held locks in a
* potentially non-nested (out of order) manner. This is a
@@ -2330,6 +2493,8 @@ lock_release_non_nested(struct task_stru
return print_unlock_inbalance_bug(curr, lock, ip);

found_it:
+ lock_release_holdtime(hlock);
+
/*
* We have the right lock to unlock, 'hlock' points to it.
* Now we remove it from the stack, and add back the other
@@ -2382,6 +2547,8 @@ static int lock_release_nested(struct ta

curr->curr_chain_key = hlock->prev_chain_key;

+ lock_release_holdtime(hlock);
+
#ifdef CONFIG_DEBUG_LOCKDEP
hlock->prev_chain_key = 0;
hlock->class = NULL;
@@ -2416,6 +2583,95 @@ __lock_release(struct lockdep_map *lock,
check_chain_key(curr);
}

+#ifdef CONFIG_LOCK_STAT
+static void
+__lock_contended(struct lockdep_map *lock, unsigned long ip)
+{
+ struct task_struct *curr = current;
+ struct held_lock *hlock, *prev_hlock;
+ struct lock_class_stats *stats;
+ unsigned int depth;
+ int i, point;
+
+ depth = curr->lockdep_depth;
+ if (DEBUG_LOCKS_WARN_ON(!depth))
+ return;
+
+ prev_hlock = NULL;
+ for (i = depth-1; i >= 0; i--) {
+ hlock = curr->held_locks + i;
+ /*
+ * We must not cross into another context:
+ */
+ if (prev_hlock && prev_hlock->irq_context != hlock->irq_context)
+ break;
+ if (hlock->instance == lock)
+ goto found_it;
+ prev_hlock = hlock;
+ }
+ print_lock_contention_bug(curr, lock, ip);
+ return;
+
+found_it:
+ hlock->waittime_stamp = sched_clock();
+
+ point = lock_contention_point(hlock->class, ip);
+
+ stats = get_lock_stats(hlock->class);
+ if (point < ARRAY_SIZE(stats->contention_point))
+ stats->contention_point[point]++;
+ put_lock_stats(stats);
+}
+
+static void
+__lock_acquired(struct lockdep_map *lock)
+{
+ struct task_struct *curr = current;
+ struct held_lock *hlock, *prev_hlock;
+ struct lock_class_stats *stats;
+ unsigned int depth;
+ unsigned long long now, waittime;
+ int i;
+
+ depth = curr->lockdep_depth;
+ if (DEBUG_LOCKS_WARN_ON(!depth))
+ return;
+
+ prev_hlock = NULL;
+ for (i = depth-1; i >= 0; i--) {
+ hlock = curr->held_locks + i;
+ /*
+ * We must not cross into another context:
+ */
+ if (prev_hlock && prev_hlock->irq_context != hlock->irq_context)
+ break;
+ if (hlock->instance == lock)
+ goto found_it;
+ prev_hlock = hlock;
+ }
+ print_lock_contention_bug(curr, lock, _RET_IP_);
+ return;
+
+found_it:
+ if (!hlock->waittime_stamp)
+ return;
+
+ now = sched_clock();
+ waittime = now - hlock->waittime_stamp;
+
+ hlock->holdtime_stamp = now;
+
+ stats = get_lock_stats(hlock->class);
+
+ if (hlock->read)
+ lock_time_inc(&stats->read_waittime, waittime);
+ else
+ lock_time_inc(&stats->write_waittime, waittime);
+
+ put_lock_stats(stats);
+}
+#endif
+
/*
* Check whether we follow the irq-flags state precisely:
*/
@@ -2456,6 +2712,14 @@ void lock_acquire(struct lockdep_map *lo
{
unsigned long flags;

+#ifdef CONFIG_LOCK_STAT
+ if (unlikely(!lock_stat))
+#endif
+#ifdef CONFIG_PROVE_LOCKING
+ if (unlikely(!prove_locking))
+#endif
+ return;
+
if (unlikely(current->lockdep_recursion))
return;

@@ -2475,6 +2739,14 @@ void lock_release(struct lockdep_map *lo
{
unsigned long flags;

+#ifdef CONFIG_LOCK_STAT
+ if (unlikely(!lock_stat))
+#endif
+#ifdef CONFIG_PROVE_LOCKING
+ if (unlikely(!prove_locking))
+#endif
+ return;
+
if (unlikely(current->lockdep_recursion))
return;

@@ -2488,6 +2760,46 @@ void lock_release(struct lockdep_map *lo

EXPORT_SYMBOL_GPL(lock_release);

+#ifdef CONFIG_LOCK_STAT
+void lock_contended(struct lockdep_map *lock, unsigned long ip)
+{
+ unsigned long flags;
+
+ if (unlikely(!lock_stat))
+ return;
+
+ if (unlikely(current->lockdep_recursion))
+ return;
+
+ raw_local_irq_save(flags);
+ check_flags(flags);
+ current->lockdep_recursion = 1;
+ __lock_contended(lock, ip);
+ current->lockdep_recursion = 0;
+ raw_local_irq_restore(flags);
+}
+EXPORT_SYMBOL_GPL(lock_contended);
+
+void lock_acquired(struct lockdep_map *lock)
+{
+ unsigned long flags;
+
+ if (unlikely(!lock_stat))
+ return;
+
+ if (unlikely(current->lockdep_recursion))
+ return;
+
+ raw_local_irq_save(flags);
+ check_flags(flags);
+ current->lockdep_recursion = 1;
+ __lock_acquired(lock);
+ current->lockdep_recursion = 0;
+ raw_local_irq_restore(flags);
+}
+EXPORT_SYMBOL_GPL(lock_acquired);
+#endif
+
/*
* Used by the testsuite, sanitize the validator state
* after a simulated failure:
Index: linux-2.6-git/include/linux/lockdep.h
===================================================================
--- linux-2.6-git.orig/include/linux/lockdep.h
+++ linux-2.6-git/include/linux/lockdep.h
@@ -114,8 +114,30 @@ struct lock_class {

const char *name;
int name_version;
+
+ unsigned long contention_point[4];
+};
+
+#ifdef CONFIG_LOCK_STAT
+struct lock_time {
+ unsigned long long min;
+ unsigned long long max;
+ unsigned long long total;
+ unsigned long nr;
+};
+
+struct lock_class_stats {
+ unsigned long contention_point[4];
+ struct lock_time read_waittime;
+ struct lock_time write_waittime;
+ struct lock_time read_holdtime;
+ struct lock_time write_holdtime;
};

+struct lock_class_stats lock_stats(struct lock_class *class);
+void clear_lock_stats(struct lock_class *class);
+#endif
+
/*
* Map the lock object (the lock instance) to the lock-class object.
* This is embedded into specific lock instances:
@@ -165,6 +187,10 @@ struct held_lock {
unsigned long acquire_ip;
struct lockdep_map *instance;

+#ifdef CONFIG_LOCK_STAT
+ unsigned long long waittime_stamp;
+ unsigned long long holdtime_stamp;
+#endif
/*
* The lock-stack is unified in that the lock chains of interrupt
* contexts nest ontop of process context chains, but we 'separate'
@@ -281,6 +307,30 @@ struct lock_class_key { };

#endif /* !LOCKDEP */

+#ifdef CONFIG_LOCK_STAT
+
+extern void lock_contended(struct lockdep_map *lock, unsigned long ip);
+extern void lock_acquired(struct lockdep_map *lock);
+
+#define LOCK_CONTENDED(_lock, try, lock) \
+do { \
+ if (!try(_lock)) { \
+ lock_contended(&(_lock)->dep_map, _RET_IP_); \
+ lock(_lock); \
+ lock_acquired(&(_lock)->dep_map); \
+ } \
+} while (0)
+
+#else /* CONFIG_LOCK_STAT */
+
+#define lock_contended(l, i) do { } while (0)
+#define lock_acquired(l) do { } while (0)
+
+#define LOCK_CONTENDED(_lock, try, lock) \
+ lock(_lock)
+
+#endif /* CONFIG_LOCK_STAT */
+
#if defined(CONFIG_TRACE_IRQFLAGS) && defined(CONFIG_GENERIC_HARDIRQS)
extern void early_init_irq_lock_class(void);
#else
Index: linux-2.6-git/lib/Kconfig.debug
===================================================================
--- linux-2.6-git.orig/lib/Kconfig.debug
+++ linux-2.6-git/lib/Kconfig.debug
@@ -273,6 +273,17 @@ config LOCKDEP
select KALLSYMS
select KALLSYMS_ALL

+config LOCK_STAT
+ bool "Lock usage statisitics"
+ depends on DEBUG_KERNEL && TRACE_IRQFLAGS_SUPPORT && STACKTRACE_SUPPORT && LOCKDEP_SUPPORT
+ select LOCKDEP
+ select DEBUG_SPINLOCK
+ select DEBUG_MUTEXES
+ select DEBUG_LOCK_ALLOC
+ default n
+ help
+ This feature enables tracking of lock contention points.
+
config DEBUG_LOCKDEP
bool "Lock dependency engine debugging"
depends on DEBUG_KERNEL && LOCKDEP
Index: linux-2.6-git/kernel/sysctl.c
===================================================================
--- linux-2.6-git.orig/kernel/sysctl.c
+++ linux-2.6-git/kernel/sysctl.c
@@ -164,6 +164,14 @@ int sysctl_legacy_va_layout;
#endif


+#ifdef CONFIG_PROVE_LOCKING
+extern int prove_locking;
+#endif
+
+#ifdef CONFIG_LOCK_STAT
+extern int lock_stat;
+#endif
+
/* The default sysctl tables: */

static ctl_table root_table[] = {
@@ -683,6 +691,26 @@ static ctl_table kern_table[] = {
.proc_handler = &proc_dostring,
.strategy = &sysctl_string,
},
+#ifdef CONFIG_PROVE_LOCKING
+ {
+ .ctl_name = KERN_PROVE_LOCKING,
+ .procname = "prove_locking",
+ .data = &prove_locking,
+ .maxlen = sizeof(int),
+ .mode = 0644,
+ .proc_handler = &proc_dointvec,
+ },
+#endif
+#ifdef CONFIG_LOCK_STAT
+ {
+ .ctl_name = KERN_LOCK_STAT,
+ .procname = "lock_stat",
+ .data = &lock_stat,
+ .maxlen = sizeof(int),
+ .mode = 0644,
+ .proc_handler = &proc_dointvec,
+ },
+#endif

{ .ctl_name = 0 }
};
Index: linux-2.6-git/include/linux/sysctl.h
===================================================================
--- linux-2.6-git.orig/include/linux/sysctl.h
+++ linux-2.6-git/include/linux/sysctl.h
@@ -166,6 +166,8 @@ enum
KERN_NMI_WATCHDOG=75, /* int: enable/disable nmi watchdog */
KERN_PANIC_ON_NMI=76, /* int: whether we will panic on an unrecovered */
KERN_POWEROFF_CMD=77, /* string: poweroff command line */
+ KERN_PROVE_LOCKING=78, /* int: enable lock dependency checking */
+ KERN_LOCK_STAT=79, /* int: enable lock statistics */
};



--


2007-05-29 20:31:34

by Daniel Walker

Subject: Re: [PATCH 3/5] lockstat: core infrastructure

On Tue, 2007-05-29 at 14:52 +0200, Peter Zijlstra wrote:
> + now = sched_clock();
> + waittime = now - hlock->waittime_stamp;
> +

It looks like you're using sched_clock() throughout .. It's a little
troubling considering the constraints on the function .. Most
architectures implement a jiffies sched_clock() w/ 1 millisecond or worse
resolution .. I'd imagine a millisecond hold time is pretty rare, and even
a millisecond wait time might be fairly rare too .. There's also no
guarantee that sched_clock() timestamps off two different cpus can be
compared (or at least that's my understanding) ..
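
(For reference, the generic weak fallback that such architectures end up
with looks roughly like this jiffies-granular sketch:

unsigned long long __attribute__((weak)) sched_clock(void)
{
	/* ~1ms resolution at HZ=1000, 10ms at HZ=100 */
	return (unsigned long long)jiffies * (1000000000 / HZ);
}

so any interval shorter than a jiffy simply reads as zero.)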

Daniel

2007-05-30 03:45:04

by Andrew Morton

Subject: Re: [PATCH 3/5] lockstat: core infrastructure

On Tue, 29 May 2007 14:52:51 +0200 Peter Zijlstra <[email protected]> wrote:

> Introduce the core lock statistics code.
>

I must say that an aggregate addition of 27 ifdefs is a bit sad. And there
is some easy stuff here.

> +#ifdef CONFIG_PROVE_LOCKING
> +int prove_locking = 1;
> +module_param(prove_locking, int, 0644);
> +#endif

#else
#define prove_locking 0
#endif

> +
> +#ifdef CONFIG_LOCK_STAT
> +int lock_stat = 1;
> +module_param(lock_stat, int, 0644);
> +#endif

ditto.
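
i.e. presumably:

#else
#define lock_stat 0
#endif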

>
> ...
>
> +struct lock_class_stats lock_stats(struct lock_class *class)
> +{
> + struct lock_class_stats stats;
> + int cpu, i;
> +
> + memset(&stats, 0, sizeof(struct lock_class_stats));
> + for_each_possible_cpu(cpu) {
> + struct lock_class_stats *pcs =
> + &per_cpu(lock_stats, cpu)[class - lock_classes];
> +
> + for (i = 0; i < ARRAY_SIZE(stats.contention_point); i++)
> + stats.contention_point[i] += pcs->contention_point[i];
> +
> + lock_time_add(&pcs->read_waittime, &stats.read_waittime);
> + lock_time_add(&pcs->write_waittime, &stats.write_waittime);
> +
> + lock_time_add(&pcs->read_holdtime, &stats.read_holdtime);
> + lock_time_add(&pcs->write_holdtime, &stats.write_holdtime);
> + }
> +
> + return stats;
> +}

hm, that isn't trying to be very efficient.

> @@ -2035,6 +2131,11 @@ static int __lock_acquire(struct lockdep
> int chain_head = 0;
> u64 chain_key;
>
> +#ifdef CONFIG_PROVE_LOCKING
> + if (!prove_locking)
> + check = 1;
> +#endif

Removable

> +#ifdef CONFIG_LOCK_STAT
> +static void lock_release_holdtime(struct held_lock *hlock)
> +{
> + struct lock_class_stats *stats;
> + unsigned long long holdtime;
> +
> + if (!lock_stat)
> + return;
> +
> + holdtime = sched_clock() - hlock->holdtime_stamp;
> +
> + stats = get_lock_stats(hlock->class);
> +
> + if (hlock->read)
> + lock_time_inc(&stats->read_holdtime, holdtime);
> + else
> + lock_time_inc(&stats->write_holdtime, holdtime);
> +
> + put_lock_stats(stats);
> +}
> +#else
> +static void lock_release_holdtime(struct held_lock *hlock)

inline

> +{
> +}
> +#endif
> +
> ...
>
> @@ -2456,6 +2712,14 @@ void lock_acquire(struct lockdep_map *lo
> {
> unsigned long flags;
>
> +#ifdef CONFIG_LOCK_STAT
> + if (unlikely(!lock_stat))
> +#endif

removable

> +#ifdef CONFIG_PROVE_LOCKING
> + if (unlikely(!prove_locking))
> +#endif

removable

> @@ -2475,6 +2739,14 @@ void lock_release(struct lockdep_map *lo
> {
> unsigned long flags;
>
> +#ifdef CONFIG_LOCK_STAT
> + if (unlikely(!lock_stat))
> +#endif

removable

> +#ifdef CONFIG_PROVE_LOCKING
> + if (unlikely(!prove_locking))
> +#endif
> + return;
> +
> if (unlikely(current->lockdep_recursion))
> return;
>
>
> ...
>
> +#ifdef CONFIG_LOCK_STAT
> +
> +extern void lock_contended(struct lockdep_map *lock, unsigned long ip);
> +extern void lock_acquired(struct lockdep_map *lock);
> +
> +#define LOCK_CONTENDED(_lock, try, lock) \
> +do { \
> + if (!try(_lock)) { \
> + lock_contended(&(_lock)->dep_map, _RET_IP_); \
> + lock(_lock); \
> + lock_acquired(&(_lock)->dep_map); \
> + } \
> +} while (0)
> +
> +#else /* CONFIG_LOCK_STAT */
> +
> +#define lock_contended(l, i) do { } while (0)
> +#define lock_acquired(l) do { } while (0)

inlines are better.
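
e.g. something like this, matching the prototypes declared above (sketch):

static inline void lock_contended(struct lockdep_map *l, unsigned long ip)
{
}

static inline void lock_acquired(struct lockdep_map *l)
{
}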

> +#define LOCK_CONTENDED(_lock, try, lock) \
> + lock(_lock)
> +
> +#endif /* CONFIG_LOCK_STAT */
> +
> },
>
> ...
>
> +#ifdef CONFIG_PROVE_LOCKING
> + {
> + .ctl_name = KERN_PROVE_LOCKING,
> + .procname = "prove_locking",
> + .data = &prove_locking,
> + .maxlen = sizeof(int),
> + .mode = 0644,
> + .proc_handler = &proc_dointvec,
> + },
> +#endif
> +#ifdef CONFIG_LOCK_STAT
> + {
> + .ctl_name = KERN_LOCK_STAT,
> + .procname = "lock_stat",
> + .data = &lock_stat,
> + .maxlen = sizeof(int),
> + .mode = 0644,
> + .proc_handler = &proc_dointvec,
> + },
> +#endif

Please use CTL_UNNUMBERED for new sysctls.
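
i.e. something like this (sketch; CTL_UNNUMBERED tells the sysctl core not
to assign a binary ctl number):

	{
		.ctl_name	= CTL_UNNUMBERED,
		.procname	= "lock_stat",
		.data		= &lock_stat,
		.maxlen		= sizeof(int),
		.mode		= 0644,
		.proc_handler	= &proc_dointvec,
	},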

> { .ctl_name = 0 }
> };
> Index: linux-2.6-git/include/linux/sysctl.h
> ===================================================================
> --- linux-2.6-git.orig/include/linux/sysctl.h
> +++ linux-2.6-git/include/linux/sysctl.h
> @@ -166,6 +166,8 @@ enum
> KERN_NMI_WATCHDOG=75, /* int: enable/disable nmi watchdog */
> KERN_PANIC_ON_NMI=76, /* int: whether we will panic on an unrecovered */
> KERN_POWEROFF_CMD=77, /* string: poweroff command line */
> + KERN_PROVE_LOCKING=78, /* int: enable lock dependancy checking */
> + KERN_LOCK_STAT=79, /* int: enable lock statistics */
> };

And lose these.

So I'm inclined to ask if you can redo these patches with a view to reducing
the ifdef density with a bit of restructuring.

We could do that as a follow-on patch, I guess. Nicer not to, though.

2007-05-30 13:19:37

by Peter Zijlstra

Subject: Re: [PATCH 3/5] lockstat: core infrastructure

On Tue, 2007-05-29 at 13:28 -0700, Daniel Walker wrote:
> On Tue, 2007-05-29 at 14:52 +0200, Peter Zijlstra wrote:
> > + now = sched_clock();
> > + waittime = now - hlock->waittime_stamp;
> > +
>
> It looks like your using sched_clock() through out .. It's a little
> troubling considering the constraints on the function .. Most
> architecture implement a jiffies sched_clock() w/ 1 millisecond or worse
> resolution.. I'd imagine a millisecond hold time is pretty rare, even a
> millisecond wait time might be fairly rare too .. There's also no
> guarantee that sched_clock timestamps off two different cpu's can be
> compared (or at least that's my understanding) ..

All valid points, however.. calling anything more expensive 2-3 times
per lock acquisition is going to be _very_ painful.

Also, IMHO the contention count vs the acquisition count is the most
interesting number; the times are just a nice bonus (if and when they
work).


2007-05-30 13:25:28

by Ingo Molnar

Subject: Re: [PATCH 3/5] lockstat: core infrastructure


* Daniel Walker <[email protected]> wrote:

> [...] Most architecture implement a jiffies sched_clock() w/ 1
> millisecond or worse resolution.. [...]

weird that you count importance by 'number of architectures', while 98%
of the installed server base is x86 or x86_64, where sched_clock() is
pretty precise ;-)

Ingo

2007-05-30 13:41:48

by Steven Rostedt

Subject: Re: [PATCH 3/5] lockstat: core infrastructure

>
> > [...] Most architecture implement a jiffies sched_clock() w/ 1
> > millisecond or worse resolution.. [...]
>
> weird that you count importance by 'number of architectures', while 98%
> of the installed server base is x86 or x86_64, where sched_clock() is
> pretty precise ;-)
>


I can understand Daniel's POV (from working at TimeSys once upon a time).
He works for MontaVista, which does a lot with embedded systems running on
something other than x86. So from Daniel's POV, those "number of
architectures" are of importance, while for those of us who do server work,
where x86 is now becoming the dominant platform, the focus is on x86.

As long as the work doesn't "break" an arch, we can argue that sched_clock
is "good enough". If someone wants better accounting of locks on some
other arch, they can simply change sched_clock to be more precise.

-- Steve

2007-05-30 13:49:41

by Ingo Molnar

Subject: Re: [PATCH 3/5] lockstat: core infrastructure


* Steven Rostedt <[email protected]> wrote:

> [...] We can argue that sched_clock is "good enough". If someone
> wants better accounting of locks on some other arch, they can simply
> change sched_clock to be more precise.

exactly. Imprecise sched_clock() if there's a better fast clock source
available is a bug in the architecture code. If the only available
clocksource is 1 msec resolution then there's no solution and nothing to
talk about - lock statistics will be 1msec granular just as much as
scheduling.

Ingo

2007-05-30 15:23:30

by Daniel Walker

Subject: Re: [PATCH 3/5] lockstat: core infrastructure

On Wed, 2007-05-30 at 15:24 +0200, Ingo Molnar wrote:
> * Daniel Walker <[email protected]> wrote:
>
> > [...] Most architecture implement a jiffies sched_clock() w/ 1
> > millisecond or worse resolution.. [...]
>
> weird that you count importance by 'number of architectures', while 98%
> of the installed server base is x86 or x86_64, where sched_clock() is
> pretty precise ;-)

I work with many other architectures, so it's important to me. That's
aside anyway, just consider there are several Linux architectures. When
we write code we should consider those other architectures .. Non-x86
Linux is used pretty frequently. Dare I say that x86 and x86_64
installations combined don't out weigh all the other architecture
installations, that's only a guess but I wouldn't be surprised if that's
true.

Daniel

2007-05-30 17:09:27

by Daniel Walker

Subject: Re: [PATCH 3/5] lockstat: core infrastructure

On Wed, 2007-05-30 at 15:49 +0200, Ingo Molnar wrote:
> * Steven Rostedt <[email protected]> wrote:
>
> > [...] We can argue that sched_clock is "good enough". If someone
> > wants better accounting of locks on some other arch, they can simply
> > change sched_clock to be more precise.
>
> exactly. Imprecise sched_clock() if there's a better fast clock source
> available is a bug in the architecture code. If the only available
> clocksource is 1 msec resolution then there's no solution and nothing to
> talk about - lock statistics will be 1msec granular just as much as
> scheduling.


I don't agree .. sched_clock() is obsoleted by timekeeping's clocksource
structure .. sched_clock() was a quick way to get lowlevel time stamps
just for the scheduler. The timekeeping clocksource structure is a more
complete solution.

From the architecture perspective there are two low level clock hooks to
implement: sched_clock(), and at least one clocksource structure.
Both do essentially the same thing, with timekeeping's clocksource
structure actually being easier to implement because the math is built in.

It's clear to me that architectures will implement clocksource
structures .. However, they will not implement sched_clock() because
there is no benefit associated with it. As you have said, the scheduler
works fine with a jiffies-resolution clock; even a large number of x86
machines use a jiffies sched_clock() ..

So I don't think it's a bug if sched_clock() is lowres, and it shouldn't
be a bug in the future..

Daniel

2007-05-30 17:16:52

by Peter Zijlstra

Subject: Re: [PATCH 3/5] lockstat: core infrastructure

On Wed, 2007-05-30 at 10:06 -0700, Daniel Walker wrote:
> On Wed, 2007-05-30 at 15:49 +0200, Ingo Molnar wrote:
> > * Steven Rostedt <[email protected]> wrote:
> >
> > > [...] We can argue that sched_clock is "good enough". If someone
> > > wants better accounting of locks on some other arch, they can simply
> > > change sched_clock to be more precise.
> >
> > exactly. Imprecise sched_clock() if there's a better fast clock source
> > available is a bug in the architecture code. If the only available
> > clocksource is 1 msec resolution then there's no solution and nothing to
> > talk about - lock statistics will be 1msec granular just as much as
> > scheduling.
>
>
> I don't agree .. sched_clock() is obsoleted by timekeepings clocksource
> structure.. sched_clock() was a quick way to get lowlevel time stamps
> just for the scheduler. The timekeeping clocksource structure is a more
> complete solution.
>
> From the architecture perspective there are two low level clock hooks to
> implement one is sched_clock() , and at least one clocksource structure.
> Both do essentially the same thing. With timekeepings clocksource
> structure actually being easier to implement cause the math is built in.

I think you are mistaken here; the two are similar but not identical.

I see sched_clock() as fast first, accurate second. Whereas the
clocksource thing is accurate first, fast second.

There is room for both of them.

2007-05-30 17:28:11

by Daniel Walker

Subject: Re: [PATCH 3/5] lockstat: core infrastructure

On Wed, 2007-05-30 at 19:16 +0200, Peter Zijlstra wrote:

> > From the architecture perspective there are two low level clock hooks to
> > implement one is sched_clock() , and at least one clocksource structure.
> > Both do essentially the same thing. With timekeepings clocksource
> > structure actually being easier to implement cause the math is built in.
>
> I think you are mistaken here; the two are similar but not identical.
>
> I see sched_clock() as fast first, accurate second. Whereas the
> clocksource thing is accurate first, fast second.

This is true .. However, if there is a speed difference it's small.
In the past I've replaced sched_clock() with a clocksource, and there was
no noticeable speed difference .. Just recently I replaced x86's
sched_clock() math with the clocksource math with no noticeable
difference .. At least not from my benchmarks ..

> There is room for both of them.

There is room, but we don't need sched_clock() .. Certainly we shouldn't
force architectures to implement sched_clock() by calling it a "bug" if
it's lowres.

Daniel

2007-06-01 13:13:23

by Ingo Molnar

Subject: Re: [PATCH 3/5] lockstat: core infrastructure


* Daniel Walker <[email protected]> wrote:

> On Wed, 2007-05-30 at 19:16 +0200, Peter Zijlstra wrote:

> > I think you are mistaken here; the two are similar but not
> > identical.
> >
> > I see sched_clock() as fast first, accurate second. Whereas the
> > clocksource thing is accurate first, fast second.
>
> This is true .. However, if there is a speed different it's small.

Ugh. Have you ever compared pmtimer (or even hpet) against TSC based
sched_clock()? What you write is so wrong that it's not even funny. You
keep repeating this nonsense despite having been told multiple times
that you are dead wrong.

Ingo

2007-06-01 13:29:47

by Andi Kleen

Subject: Re: [PATCH 3/5] lockstat: core infrastructure

Daniel Walker <[email protected]> writes:

> On Wed, 2007-05-30 at 19:16 +0200, Peter Zijlstra wrote:
>
> > > From the architecture perspective there are two low level clock hooks to
> > > implement one is sched_clock() , and at least one clocksource structure.
> > > Both do essentially the same thing. With timekeepings clocksource
> > > structure actually being easier to implement cause the math is built in.
> >
> > I think you are mistaken here; the two are similar but not identical.
> >
> > I see sched_clock() as fast first, accurate second. Whereas the
> > clocksource thing is accurate first, fast second.
>
> This is true .. However, if there is a speed different it's small.

pmtimer (factor 1000+) or HPET (factor 10-100+, depending on CPU)
accesses are much slower than TSC or jiffies reads. Talking to the
southbridge is slow.

-Andi

2007-06-01 15:29:47

by Daniel Walker

Subject: Re: [PATCH 3/5] lockstat: core infrastructure

On Fri, 2007-06-01 at 15:12 +0200, Ingo Molnar wrote:
> * Daniel Walker <[email protected]> wrote:
>
> > On Wed, 2007-05-30 at 19:16 +0200, Peter Zijlstra wrote:
>
> > > I think you are mistaken here; the two are similar but not
> > > identical.
> > >
> > > I see sched_clock() as fast first, accurate second. Whereas the
> > > clocksource thing is accurate first, fast second.
> >
> > This is true .. However, if there is a speed different it's small.
>
> Ugh. Have you ever compared pmtimer (or even hpet) against TSC based
> sched_clock()? What you write is so wrong that it's not even funny. You
> keep repeating this nonsense despite having been told multiple times
> that you are dead wrong.

Yes I have, and you're right, there is a difference, and a big
difference .. Above I was referring only to the TSC clocksource, since
that's an apples-to-apples comparison .. I would never compare the TSC
to the acpi_pm, that's no contest ..

The acpi_pm as sched_clock() with hackbench was about 25% slower, and the
PIT was approximately 10x slower. (I did this months ago.)

Daniel

2007-06-01 15:53:42

by Peter Zijlstra

Subject: Re: [PATCH 3/5] lockstat: core infrastructure

On Fri, 2007-06-01 at 08:26 -0700, Daniel Walker wrote:
> On Fri, 2007-06-01 at 15:12 +0200, Ingo Molnar wrote:
> > * Daniel Walker <[email protected]> wrote:
> >
> > > On Wed, 2007-05-30 at 19:16 +0200, Peter Zijlstra wrote:
> >
> > > > I think you are mistaken here; the two are similar but not
> > > > identical.
> > > >
> > > > I see sched_clock() as fast first, accurate second. Whereas the
> > > > clocksource thing is accurate first, fast second.
> > >
> > > This is true .. However, if there is a speed different it's small.
> >
> > Ugh. Have you ever compared pmtimer (or even hpet) against TSC based
> > sched_clock()? What you write is so wrong that it's not even funny. You
> > keep repeating this nonsense despite having been told multiple times
> > that you are dead wrong.
>
> Yes I have, and your right there is a difference, and a big
> difference .. Above I was referring only to the TSC clocksource, since
> that's an apples to apples comparison .. I would never compare the TSC
> to the acpi_pm, that's no contest ..
>
> The acpi_pm as sched_clock() with hackbench was about %25 slower, the
> pit was 10x slower approximately. (I did this months ago.)

The whole issue is that you don't have any control over what clocksource
you'll end up with. If it so happens that pmtimer gets selected, your
whole box will crawl if it's used liberally, like the patch under
discussion does.

So, having two interfaces, one fast and one accurate is the right answer
IMHO.

And I agree that if the arch has a fast clock but doesn't use it for
sched_clock(), that would be a shortcoming of that arch.


2007-06-01 16:14:04

by Daniel Walker

Subject: Re: [PATCH 3/5] lockstat: core infrastructure

On Fri, 2007-06-01 at 17:52 +0200, Peter Zijlstra wrote:
> On Fri, 2007-06-01 at 08:26 -0700, Daniel Walker wrote:
> > On Fri, 2007-06-01 at 15:12 +0200, Ingo Molnar wrote:
> > > * Daniel Walker <[email protected]> wrote:
> > >
> > > > On Wed, 2007-05-30 at 19:16 +0200, Peter Zijlstra wrote:
> > >
> > > > > I think you are mistaken here; the two are similar but not
> > > > > identical.
> > > > >
> > > > > I see sched_clock() as fast first, accurate second. Whereas the
> > > > > clocksource thing is accurate first, fast second.
> > > >
> > > > This is true .. However, if there is a speed different it's small.
> > >
> > > Ugh. Have you ever compared pmtimer (or even hpet) against TSC based
> > > sched_clock()? What you write is so wrong that it's not even funny. You
> > > keep repeating this nonsense despite having been told multiple times
> > > that you are dead wrong.
> >
> > Yes I have, and your right there is a difference, and a big
> > difference .. Above I was referring only to the TSC clocksource, since
> > that's an apples to apples comparison .. I would never compare the TSC
> > to the acpi_pm, that's no contest ..
> >
> > The acpi_pm as sched_clock() with hackbench was about %25 slower, the
> > pit was 10x slower approximately. (I did this months ago.)
>
> The whole issue is that you don't have any control over what clocksource
> you'll end up with. If it so happens that pmtimer gets selected your
> whole box will crawl if its used liberaly, like the patch under
> discussion does.

You can have control over it, which I think the whole point of this
discussion ..

> So, having two interfaces, one fast and one accurate is the right answer
> IMHO.

In the case of lockstat you have two cases: fast and functional, and
non-functional .. Right now your patch has no slow-and-functional state.

The non-functional state is even the majority from my perspective.

> And I agree, that if the arch has a fast clock but doesn't use it for
> sched_clock() that would be a shortcoming of that arch.

As I said before, there is no reason why an architecture should be
forced to implement sched_clock() .. Is there some specific reason why
you think it should be mandatory?

Daniel

2007-06-01 18:20:26

by Ingo Molnar

Subject: Re: [PATCH 3/5] lockstat: core infrastructure


* Daniel Walker <[email protected]> wrote:

> > > > I see sched_clock() as fast first, accurate second. Whereas the
> > > > clocksource thing is accurate first, fast second.
> > >
> > > This is true .. However, if there is a speed different it's small.
> >
> > Ugh. Have you ever compared pmtimer (or even hpet) against TSC based
> > sched_clock()? What you write is so wrong that it's not even funny.
> > You keep repeating this nonsense despite having been told multiple
> > times that you are dead wrong.
>
> Yes I have, and your right there is a difference, and a big difference
> .. Above I was referring only to the TSC clocksource, since that's an
> apples to apples comparison .. I would never compare the TSC to the
> acpi_pm, that's no contest ..

You still don't get it, I think: in real life we end up using the TSC in
sched_clock() _much more often_ than we end up using the TSC for the
clocksource! So your flawed suggestion does not fix anything; it in fact
introduces a really bad regression: instead of using the TSC (or
jiffies) we'd end up using the pmtimer or hpet for every lock operation
when lockstat is enabled, bringing the box to a screeching halt in
essence.

so what you suggest has a far worse effect on the _majority_ of systems
that are even interested in running lockstat than the case you
mentioned, where some seldom-used arch that is lazy about sched_clock()
falls back to jiffies granularity. That's not a big deal: the stats will
have the same granularity. (the op counts in lockstat will still be
quite useful)

sched_clock() is a 'fast but occasionally inaccurate clock', while the
GTOD clocksource is an accurate clock (but very often slow).

Ingo

2007-06-01 18:32:21

by Ingo Molnar

Subject: Re: [PATCH 3/5] lockstat: core infrastructure


* Daniel Walker <[email protected]> wrote:

> > So, having two interfaces, one fast and one accurate is the right
> > answer IMHO.
>
> In the case of lockstat you have two cases fast and functional, and
> non-functional .. Right now your patch has no slow and functional
> state.

let me explain it to you:

1) there is absolutely no problem here to begin with. If a rare
architecture is lazy enough to not bother implementing a finegrained
sched_clock() then it certainly does not care about the granularity of
lockstat fields either. If it does, it can improve scheduling and get
more finegrained lockstat by implementing a proper sched_clock()
function - all for the same price! ;-)

2) the 'solution' you suggested for this non-problem is _far worse_ than
the granularity non-problem, on the _majority_ of server systems today!
Think about it! Your suggestion would make lockstat _totally unusable_.
Not "slow and functional" like you claim but "dead-slow and unusable".

in light of all this it is puzzling to me how you can still call Peter's
code "non-functional" with a straight face. I have just tried lockstat
with jiffies granular sched_clock() and it was still fully functional.
So if you want to report some bug then please do it in a proper form.

> As I said before there is no reason why and architectures should be
> forced to implement sched_clock() .. Is there some specific reason why
> you think it should be mandatory?

Easy: it's not mandatory, but it's certainly "nice" even today, even
without lockstat. It will get you:

- better scheduling
- better printk timestamps
- higher-quality blktrace timestamps

With lockstat, append "more finegrained lockstat output" to that list of
benefits too. That's why every sane server architecture has a
sched_clock() implementation - go check the kernel source. Now I wouldn't
mind cleaning the API up and calling it get_stat_clock() or whatever - but
that was not your suggestion at all - your suggestion was flawed: to
implement sched_clock() via the GTOD clocksource.

Ingo

2007-06-01 18:44:44

by Peter Zijlstra

Subject: Re: [PATCH 3/5] lockstat: core infrastructure

On Fri, 2007-06-01 at 09:11 -0700, Daniel Walker wrote:
> On Fri, 2007-06-01 at 17:52 +0200, Peter Zijlstra wrote:

> > The whole issue is that you don't have any control over what clocksource
> > you'll end up with. If it so happens that pmtimer gets selected your
> > whole box will crawl if its used liberaly, like the patch under
> > discussion does.
>
> You can have control over it, which I think the whole point of this
> discussion ..

No you don't; clocksource will gladly discard the TSC when it's not found
stable enough (the majority of systems today), while it would still be
good enough for sched_clock().



2007-06-01 18:51:53

by Ingo Molnar

Subject: Re: [PATCH 3/5] lockstat: core infrastructure


* Peter Zijlstra <[email protected]> wrote:

> On Fri, 2007-06-01 at 09:11 -0700, Daniel Walker wrote:
> > On Fri, 2007-06-01 at 17:52 +0200, Peter Zijlstra wrote:
>
> > > The whole issue is that you don't have any control over what clocksource
> > > you'll end up with. If it so happens that pmtimer gets selected your
> > > whole box will crawl if its used liberaly, like the patch under
> > > discussion does.
> >
> > You can have control over it, which I think the whole point of this
> > discussion ..
>
> No you don't, clocksource will gladly discard the TSC when its not
> found stable enough (the majority of the systems today). While it
> would be good enough for sched_clock().

yeah, precisely. [ There is another thing as well: most embedded
architectures do not even implement LOCKDEP_SUPPORT today, so it wouldn't
be possible to enable lockstat on them anyway. So this whole topic is
ridiculous to begin with. How about fixing some real, non-imaginary bugs
instead? ;-) ]

Ingo

2007-06-01 19:26:30

by Matt Mackall

Subject: Re: [PATCH 3/5] lockstat: core infrastructure

On Fri, Jun 01, 2007 at 08:30:53PM +0200, Ingo Molnar wrote:
> - better scheduling
> - better printk timestamps
> - higher-quality blktrace timestamps
- more entropy in /dev/random

--
Mathematics is the supreme nostalgia of our time.

2007-06-01 19:33:35

by Daniel Walker

Subject: Re: [PATCH 3/5] lockstat: core infrastructure

On Fri, 2007-06-01 at 20:19 +0200, Ingo Molnar wrote:
> * Daniel Walker <[email protected]> wrote:
>
> > > > > I see sched_clock() as fast first, accurate second. Whereas the
> > > > > clocksource thing is accurate first, fast second.
> > > >
> > > > This is true .. However, if there is a speed different it's small.
> > >
> > > Ugh. Have you ever compared pmtimer (or even hpet) against TSC based
> > > sched_clock()? What you write is so wrong that it's not even funny.
> > > You keep repeating this nonsense despite having been told multiple
> > > times that you are dead wrong.
> >
> > Yes I have, and your right there is a difference, and a big difference
> > .. Above I was referring only to the TSC clocksource, since that's an
> > apples to apples comparison .. I would never compare the TSC to the
> > acpi_pm, that's no contest ..
>
> You still dont get it i think: in real life we end up using the TSC in
> sched_clock() _much more often_ than we end up using the TSC for
> clocksource! So your flawed suggestion does not fix anything, it in fact
> introduces a really bad regression: instead of using the TSC (or
> jiffies) we'd end up using the pmtimer or hpet for every lock operation
> when lockstat is enabled, bringing the box to a screeching halt in
> essence.

My position isn't that we should use the high level clocksource
interface as it is now without changes .. That's never been my position
since I've been working with it.. The high level interface will need to
evolve.

I'm saying we should use the clocksource structure as the main hook into
the low level architecture code.

> so what you suggest has a far worse effect on the _majority_ of systems
> that are even interested in running lockstat, than the case you
> mentioned that some seldom-used arch which is lazy about sched_clock()
> falls back to jiffies granularity. It's not a big deal: the stats will
> have the same granularity. (the op counts in lockstat will still be
> quite useful)

My suggestion is only as good as the implementation .. You're making some
fairly sweeping assumptions about how lockstat _would_ use the
clocksources in its final form ..

So clearly lockstat has a constraint, which is that it can't use slow
clocks ..

> sched_clock() is a 'fast but occasionally inaccurate clock', while the
> GTOD clocksource is an accurate clock (but very often slow).

I think we're just taking different perspectives .. The tsc clocksource
is just as fast as the tsc sched_clock(); you can interchange the two
without ill effects .. That's one perspective ..

You can use a yet-to-be-written API that uses the GTOD and a
clocksource, and allows slow clocksources to be used in place of fast
ones with really bad effects; that's another perspective ..

Daniel

2007-06-01 19:33:47

by Daniel Walker

Subject: Re: [PATCH 3/5] lockstat: core infrastructure

On Fri, 2007-06-01 at 20:30 +0200, Ingo Molnar wrote:
> * Daniel Walker <[email protected]> wrote:
>
> > > So, having two interfaces, one fast and one accurate is the right
> > > answer IMHO.
> >
> > In the case of lockstat you have two cases fast and functional, and
> > non-functional .. Right now your patch has no slow and functional
> > state.
>
> let me explain it to you:
>
> 1) there is absolutely no problem here to begin with. If a rare
> architecture is lazy enough to not bother implementing a finegrained
> sched_clock() then it certainly does not care about the granularity of
> lockstat fields either. If it does, it can improve scheduling and get
> more finegrained lockstat by implementing a proper sched_clock()
> function - all for the same price! ;-)

There is a problem, which we are discussing ... sched_clock() can be
lowres in lots of different situations, and lockstat fails to account
for that .. That in turn makes its timing non-functional.

> 2) the 'solution' you suggested for this non-problem is _far worse_ than
> the granularity non-problem, on the _majority_ of server systems today!
> Think about it! Your suggestion would make lockstat _totally unusable_.
> Not "slow and functional" like you claim but "dead-slow and unusable".

I'm not sure how to respond to this .. You're taking a big ball of
assumptions and molding it into whatever you want ..

> in light of all this it is puzzling to me how you can still call Peter's
> code "non-functional" with a straight face. I have just tried lockstat
> with jiffies granular sched_clock() and it was still fully functional.
> So if you want to report some bug then please do it in a proper form.

Clearly you can't have sane microsecond level timestamps with a clock
that doesn't support microsecond resolution.. This is even something
Peter acknowledged in his first email to me.

> > As I said before there is no reason why and architectures should be
> > forced to implement sched_clock() .. Is there some specific reason why
> > you think it should be mandatory?
>
> Easy: it's not mandatory, but it's certainly "nice" even today, even
> without lockstat. It will get you:
>
> - better scheduling
> - better printk timestamps
> - higher-quality blktrace timestamps
>
> With lockstat, append "more finegrained lockstat output" to that list of
> benefits too. That's why every sane server architecture has a
> sched_clock() implementation - go check the kernel source. Now i wouldnt
> mind to clean the API up and call it get_stat_clock() or whatever - but
> that was not your suggestion at all - your suggestion was flawed: to
> implement sched_clock() via the GTOD clocksource.

At this point it's not clear to me that you know what my suggestion was ..
You're saying you want a better API for sched_clock(), and yes, I agree
with that 100%; sched_clock() needs a better API .. In the paragraph above
it looks like you're on the verge of agreeing with me ..

You think my words are puzzling; try it from this end ..

Daniel

2007-06-01 19:33:58

by Daniel Walker

Subject: Re: [PATCH 3/5] lockstat: core infrastructure

On Fri, 2007-06-01 at 20:43 +0200, Peter Zijlstra wrote:
> On Fri, 2007-06-01 at 09:11 -0700, Daniel Walker wrote:
> > On Fri, 2007-06-01 at 17:52 +0200, Peter Zijlstra wrote:
>
> > > The whole issue is that you don't have any control over what clocksource
> > > you'll end up with. If it so happens that pmtimer gets selected your
> > > whole box will crawl if its used liberaly, like the patch under
> > > discussion does.
> >
> > You can have control over it, which I think the whole point of this
> > discussion ..
>
> No you don't, clocksource will gladly discard the TSC when its not found
> stable enough (the majority of the systems today). While it would be
> good enough for sched_clock().

You're misreading the sentence above, "You can have control over it" ..
this means that you _can_ make lockstat use the TSC, or disable itself when
the TSC is unstable .. Clock management is secondary to me, and we can
change it .. What matters more is whether "struct clocksource" provides a
better method for accessing lowlevel clocks than sched_clock() .. My
contention is that it does provide a better method.

Daniel