2006-05-29 21:20:54

by Ingo Molnar

Subject: [patch 00/61] ANNOUNCE: lock validator -V1

We are pleased to announce the first release of the "lock dependency
correctness validator" kernel debugging feature, which can be downloaded
from:

http://redhat.com/~mingo/lockdep-patches/

The easiest way to try lockdep on a testbox is to apply the combo patch
to 2.6.17-rc4-mm3. The patch order is:

http://kernel.org/pub/linux/kernel/v2.6/testing/linux-2.6.17-rc4.tar.bz2
http://kernel.org/pub/linux/kernel/people/akpm/patches/2.6/2.6.17-rc4/2.6.17-rc4-mm3/2.6.17-rc4-mm3.bz2
http://redhat.com/~mingo/lockdep-patches/lockdep-combo.patch

Do 'make oldconfig' and accept all the defaults for the new config
options, then reboot into the kernel. If everything goes well it should
boot up fine and you should have /proc/lockdep and /proc/lockdep_stats
files.

If the lock validator finds a problem it will typically print out
voluminous debug output that begins with "BUG: ..."; that syslog output
can be used by kernel developers to figure out the precise locking
scenario.

What does the lock validator do? It "observes" and maps all locking
rules as they occur dynamically (as triggered by the kernel's natural
use of spinlocks, rwlocks, mutexes and rwsems). Whenever the lock
validator subsystem detects a new locking scenario, it validates this
new rule against the existing set of rules. If this new rule is
consistent with the existing set of rules then the new rule is added
transparently and the kernel continues as normal. If the new rule could
create a deadlock scenario then this condition is printed out.
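
As a quick illustration of what such a "rule" is, consider the classic
ABBA pattern below. This is a made-up sketch, not code from the patch
queue, and the lock and function names are invented:

#include <linux/spinlock.h>

static DEFINE_SPINLOCK(lock_a);
static DEFINE_SPINLOCK(lock_b);

static void path_one(void)
{
        spin_lock(&lock_a);
        spin_lock(&lock_b);     /* validator records the rule: a -> b */
        spin_unlock(&lock_b);
        spin_unlock(&lock_a);
}

static void path_two(void)
{
        spin_lock(&lock_b);
        spin_lock(&lock_a);     /* new rule b -> a contradicts a -> b */
        spin_unlock(&lock_a);
        spin_unlock(&lock_b);
}

Each path is harmless on its own, and the two paths never have to run
concurrently for the validator to complain: as soon as both orderings
have been observed, the circular dependency is reported.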

When determining the validity of locking, all possible "deadlock
scenarios" are considered: an arbitrary number of CPUs, arbitrary irq
context and task context constellations, running arbitrary combinations
of all the existing locking scenarios. In a typical system this means
millions of separate scenarios. This is why we call it a "locking
correctness" validator - for all rules that are observed the lock
validator proves with mathematical certainty that a deadlock could not
occur (assuming that the lock validator implementation itself is
correct and its internal data structures are not corrupted by some
other kernel subsystem). [See more details and the conditions of this
statement in include/linux/lockdep.h and
Documentation/lockdep-design.txt.]
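
The irq-context dimension is what makes this stronger than plain
lock-order checking. A hypothetical example (invented names, using
roughly the 2.6.17-era irq handler prototype):

#include <linux/spinlock.h>
#include <linux/interrupt.h>

static DEFINE_SPINLOCK(stats_lock);

static irqreturn_t my_irq_handler(int irq, void *dev_id, struct pt_regs *regs)
{
        spin_lock(&stats_lock);         /* lock used in hardirq context ... */
        /* ... update statistics ... */
        spin_unlock(&stats_lock);
        return IRQ_HANDLED;
}

static void slow_path(void)
{
        spin_lock(&stats_lock);         /* ... also taken with irqs enabled */
        /* ... */
        spin_unlock(&stats_lock);       /* should be spin_lock_irqsave()/restore() */
}

If the interrupt ever hit while slow_path() held the lock on the same
CPU, the system would deadlock. The validator observes that stats_lock
is used in hardirq context and is also taken hardirq-unsafe (with irqs
enabled), so the bug is reported after a single run of each path,
without the unlucky timing ever occurring.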

Furthermore, this "all possible scenarios" property of the validator
also enables the finding of complex, highly unlikely multi-CPU
multi-context races via individual single-context rules, drastically
increasing the likelihood of finding bugs. In practical terms: the lock
validator already found a bug in the upstream kernel that could only
occur on systems with 3 or more CPUs, and which needed 3 very unlikely
code sequences to occur at once on the 3 CPUs. That bug was found and
reported on a single-CPU system (!). So in essence a race is found
"piecemeal", by triggering all the necessary components of the race
without having to reproduce the race scenario itself! In its short
existence the lock validator has found and reported many bugs before
they actually caused a real deadlock.

To further increase the efficiency of the validator, the mapping is not
per "lock instance", but per "lock-type". For example, all struct inode
objects in the kernel have inode->inotify_mutex. If there are 10,000
inodes cached, then there are 10,000 lock objects. But ->inotify_mutex
is a single "lock type", and all locking activities that occur against
->inotify_mutex are "unified" into this single lock-type. The advantage
of the lock-type approach is that all historical ->inotify_mutex uses
are mapped into a single (and as narrow as possible) set of locking
rules - regardless of how many different tasks or inode structures it
took to build this set of rules. The set of rules persists for the
lifetime of the kernel.
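
A rough sketch of why this works: every instance is initialized from
the same mutex_init() call site, and in the mutex-debug patch below
mutex_init() names the lock __FILE__":"#lock, so all instances share a
single type identity. The structure here is invented for illustration;
think of it as standing in for struct inode and ->inotify_mutex:

#include <linux/mutex.h>

struct my_object {
        struct mutex lock;
};

static void my_object_init(struct my_object *obj)
{
        /*
         * 10,000 instances, but a single lock-type as far as the
         * validator is concerned:
         */
        mutex_init(&obj->lock);
}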

To see the rough magnitude of checking that the lock validator does,
here's a portion of /proc/lockdep_stats, fresh after bootup:

lock-types: 694 [max: 2048]
direct dependencies: 1598 [max: 8192]
indirect dependencies: 17896
all direct dependencies: 16206
dependency chains: 1910 [max: 8192]
in-hardirq chains: 17
in-softirq chains: 105
in-process chains: 1065
stack-trace entries: 38761 [max: 131072]
combined max dependencies: 2033928
hardirq-safe locks: 24
hardirq-unsafe locks: 176
softirq-safe locks: 53
softirq-unsafe locks: 137
irq-safe locks: 59
irq-unsafe locks: 176

The lock validator has observed 1598 actual single-thread locking
patterns, and has validated all 2033928 possible distinct locking
scenarios.

More details about the design of the lock validator can be found in
Documentation/lockdep-design.txt, which can also be found at:

http://redhat.com/~mingo/lockdep-patches/lockdep-design.txt

The patch queue consists of 61 patches, and the changes are quite
extensive:

215 files changed, 7693 insertions(+), 1247 deletions(-)

So be careful when testing.

We only plan to post the full queue to lkml this time; we'll try not to
flood lkml with future releases. The fine-grained patch queue can also
be seen at:

http://redhat.com/~mingo/lockdep-patches/patches/

(The series file, with explanations of the split-up categories of the
patches, can be found attached below.)

The lock validator has been build-tested with allyesconfig, and booted
on x86 and x86_64. (Other architectures probably don't build/work yet.)

Comments, test-results, bug fixes, and improvements are welcome!

Ingo


# locking fixes (for bugs found by lockdep), not yet in mainline or -mm:

floppy-release-fix.patch
forcedeth-deadlock-fix.patch

# fixes for upstream bugs that only trigger with lockdep:

sound_oss_emu10k1_midi-fix.patch
mutex-section-bug.patch

# locking subsystem debugging improvements:

warn-once.patch
add-module-address.patch

generic-lock-debugging.patch
locking-selftests.patch

spinlock-init-cleanups.patch
lock-init-improvement.patch
xfs-improve-mrinit-macro.patch

# stacktrace:

x86_64-beautify-stack-backtrace.patch
x86_64-document-stack-backtrace.patch
stacktrace.patch

x86_64-use-stacktrace-for-backtrace.patch

# irq-flags state tracing:

lockdep-fown-fixes.patch
lockdep-sk-callback-lock-fixes.patch
trace-irqflags.patch
trace-irqflags-cleanups-x86.patch
trace-irqflags-cleanups-x86_64.patch
local-irq-enable-in-hardirq.patch

# percpu subsystem feature needed for lockdep:

add-per-cpu-offset.patch

# lockdep subsystem core bits:

lockdep-core.patch
lockdep-proc.patch
lockdep-docs.patch

# make use of lockdep in locking subsystems:

lockdep-prove-rwsems.patch
lockdep-prove-spin_rwlocks.patch
lockdep-prove-mutexes.patch

# lockdep utility patches:

lockdep-print-types-in-sysrq.patch
lockdep-x86_64-early-init.patch
lockdep-i386-alternatives-off.patch
lockdep-printk-recursion.patch
lockdep-disable-nmi-watchdog.patch

# map all the locking details and quirks to lockdep:

lockdep-blockdev.patch
lockdep-direct-io.patch
lockdep-serial.patch
lockdep-dcache.patch
lockdep-namei.patch
lockdep-super.patch
lockdep-futex.patch
lockdep-genirq.patch
lockdep-kgdb.patch
lockdep-completions.patch
lockdep-waitqueue.patch
lockdep-mm.patch
lockdep-slab.patch

lockdep-skb_queue_head_init.patch
lockdep-timer.patch
lockdep-sched.patch
lockdep-hrtimer.patch
lockdep-sock.patch
lockdep-af_unix.patch
lockdep-lock_sock.patch
lockdep-mmap_sem.patch

lockdep-prune_dcache-workaround.patch
lockdep-jbd.patch
lockdep-posix-timers.patch
lockdep-sch_generic.patch
lockdep-xfrm.patch
lockdep-sound-seq-ports.patch

lockdep-enable-Kconfig.patch


2006-05-29 21:22:36

by Ingo Molnar

Subject: [patch 01/61] lock validator: floppy.c irq-release fix

From: Ingo Molnar <[email protected]>

floppy.c does a lot of irq-unsafe work within
floppy_release_irq_and_dma(): free_irq(), release_region() ... so
instead of executing it in irq context, push the whole function into
keventd.

Signed-off-by: Ingo Molnar <[email protected]>
Signed-off-by: Arjan van de Ven <[email protected]>
---
drivers/block/floppy.c | 27 +++++++++++++++++++++++++--
1 file changed, 25 insertions(+), 2 deletions(-)

Index: linux/drivers/block/floppy.c
===================================================================
--- linux.orig/drivers/block/floppy.c
+++ linux/drivers/block/floppy.c
@@ -573,6 +573,21 @@ static int floppy_grab_irq_and_dma(void)
static void floppy_release_irq_and_dma(void);

/*
+ * Interrupt, DMA and region freeing must not be done from IRQ
+ * context - e.g. irq-unregistration means /proc VFS work, region
+ * release takes an irq-unsafe lock, etc. So we push this work
+ * into keventd:
+ */
+static void fd_release_fn(void *data)
+{
+ mutex_lock(&open_lock);
+ floppy_release_irq_and_dma();
+ mutex_unlock(&open_lock);
+}
+
+static DECLARE_WORK(floppy_release_irq_and_dma_work, fd_release_fn, NULL);
+
+/*
* The "reset" variable should be tested whenever an interrupt is scheduled,
* after the commands have been sent. This is to ensure that the driver doesn't
* get wedged when the interrupt doesn't come because of a failed command.
@@ -836,7 +851,7 @@ static int set_dor(int fdc, char mask, c
if (newdor & FLOPPY_MOTOR_MASK)
floppy_grab_irq_and_dma();
if (olddor & FLOPPY_MOTOR_MASK)
- floppy_release_irq_and_dma();
+ schedule_work(&floppy_release_irq_and_dma_work);
return olddor;
}

@@ -917,6 +932,8 @@ static int _lock_fdc(int drive, int inte

set_current_state(TASK_RUNNING);
remove_wait_queue(&fdc_wait, &wait);
+
+ flush_scheduled_work();
}
command_status = FD_COMMAND_NONE;

@@ -950,7 +967,7 @@ static inline void unlock_fdc(void)
if (elv_next_request(floppy_queue))
do_fd_request(floppy_queue);
spin_unlock_irqrestore(&floppy_lock, flags);
- floppy_release_irq_and_dma();
+ schedule_work(&floppy_release_irq_and_dma_work);
wake_up(&fdc_wait);
}

@@ -4647,6 +4664,12 @@ void cleanup_module(void)
del_timer_sync(&fd_timer);
blk_cleanup_queue(floppy_queue);

+ /*
+ * Wait for any asynchronous floppy_release_irq_and_dma()
+ * calls to finish first:
+ */
+ flush_scheduled_work();
+
if (usage_count)
floppy_release_irq_and_dma();

2006-05-29 21:22:56

by Ingo Molnar

Subject: [patch 02/61] lock validator: forcedeth.c fix

From: Ingo Molnar <[email protected]>

nv_do_nic_poll() is called from a timer softirq, which runs with
interrupts enabled, but np->lock might also be taken by some other
interrupt context.

Signed-off-by: Ingo Molnar <[email protected]>
Signed-off-by: Arjan van de Ven <[email protected]>
---
drivers/net/forcedeth.c | 5 +++--
1 file changed, 3 insertions(+), 2 deletions(-)

Index: linux/drivers/net/forcedeth.c
===================================================================
--- linux.orig/drivers/net/forcedeth.c
+++ linux/drivers/net/forcedeth.c
@@ -2869,6 +2869,7 @@ static void nv_do_nic_poll(unsigned long
struct net_device *dev = (struct net_device *) data;
struct fe_priv *np = netdev_priv(dev);
u8 __iomem *base = get_hwbase(dev);
+ unsigned long flags;
u32 mask = 0;

/*
@@ -2897,10 +2898,9 @@ static void nv_do_nic_poll(unsigned long
mask |= NVREG_IRQ_OTHER;
}
}
+ local_irq_save(flags);
np->nic_poll_irq = 0;

- /* FIXME: Do we need synchronize_irq(dev->irq) here? */
-
writel(mask, base + NvRegIrqMask);
pci_push(base);

@@ -2924,6 +2924,7 @@ static void nv_do_nic_poll(unsigned long
enable_irq(np->msi_x_entry[NV_MSI_X_VECTOR_OTHER].vector);
}
}
+ local_irq_restore(flags);
}

#ifdef CONFIG_NET_POLL_CONTROLLER

2006-05-29 21:23:01

by Ingo Molnar

Subject: [patch 03/61] lock validator: sound/oss/emu10k1/midi.c cleanup

From: Ingo Molnar <[email protected]>

Move the __attribute((unused)) annotation outside of the
DEFINE_SPINLOCK() macro's argument.

Signed-off-by: Ingo Molnar <[email protected]>
Signed-off-by: Arjan van de Ven <[email protected]>
---
sound/oss/emu10k1/midi.c | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)

Index: linux/sound/oss/emu10k1/midi.c
===================================================================
--- linux.orig/sound/oss/emu10k1/midi.c
+++ linux/sound/oss/emu10k1/midi.c
@@ -45,7 +45,7 @@
#include "../sound_config.h"
#endif

-static DEFINE_SPINLOCK(midi_spinlock __attribute((unused)));
+static __attribute((unused)) DEFINE_SPINLOCK(midi_spinlock);

static void init_midi_hdr(struct midi_hdr *midihdr)
{

2006-05-29 21:23:33

by Ingo Molnar

Subject: [patch 07/61] lock validator: better lock debugging

From: Ingo Molnar <[email protected]>

generic lock debugging:

- generalized lock debugging framework. For example, a bug detected in
one lock subsystem now turns off debugging in all lock subsystems (a
short usage sketch follows after this list).

- got rid of the caller address passing from the mutex/rtmutex debugging
code: it caused way too much prototype hackery, and lockdep will give
the same information anyway.

- ability to do silent tests

- check lock freeing in vfree too.

- more fine-grained debugging options, to allow distributions to
turn off the more expensive debugging features.
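
Here is a rough usage sketch of the generic facility. The check itself
is invented for illustration; DEBUG_WARN_ON() and debug_locks_off() are
what the patch below introduces in <linux/debug_locks.h>:

#include <linux/mutex.h>
#include <linux/debug_locks.h>

static void example_owner_check(struct mutex *lock)
{
        /*
         * The first failed assertion calls debug_locks_off(), which
         * clears the global debug_locks flag - so every lock debugging
         * subsystem goes quiet after the first report instead of
         * flooding the log with follow-up warnings:
         */
        if (DEBUG_WARN_ON(lock->owner != current_thread_info()))
                return;
        /* ... further debug bookkeeping ... */
}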

Signed-off-by: Ingo Molnar <[email protected]>
Signed-off-by: Arjan van de Ven <[email protected]>
---
drivers/char/sysrq.c | 2
include/asm-generic/mutex-null.h | 11 -
include/linux/debug_locks.h | 62 ++++++++
include/linux/init_task.h | 1
include/linux/mm.h | 8 -
include/linux/mutex-debug.h | 12 -
include/linux/mutex.h | 6
include/linux/rtmutex.h | 10 -
include/linux/sched.h | 4
init/main.c | 9 +
kernel/exit.c | 5
kernel/fork.c | 4
kernel/mutex-debug.c | 289 +++----------------------------------
kernel/mutex-debug.h | 87 +----------
kernel/mutex.c | 83 +++++++---
kernel/mutex.h | 18 --
kernel/rtmutex-debug.c | 302 +--------------------------------------
kernel/rtmutex-debug.h | 8 -
kernel/rtmutex.c | 45 ++---
kernel/rtmutex.h | 3
kernel/sched.c | 16 +-
lib/Kconfig.debug | 26 ++-
lib/Makefile | 2
lib/debug_locks.c | 45 +++++
lib/spinlock_debug.c | 60 +++----
mm/vmalloc.c | 2
26 files changed, 329 insertions(+), 791 deletions(-)

Index: linux/drivers/char/sysrq.c
===================================================================
--- linux.orig/drivers/char/sysrq.c
+++ linux/drivers/char/sysrq.c
@@ -152,7 +152,7 @@ static struct sysrq_key_op sysrq_mountro
static void sysrq_handle_showlocks(int key, struct pt_regs *pt_regs,
struct tty_struct *tty)
{
- mutex_debug_show_all_locks();
+ debug_show_all_locks();
}
static struct sysrq_key_op sysrq_showlocks_op = {
.handler = sysrq_handle_showlocks,
Index: linux/include/asm-generic/mutex-null.h
===================================================================
--- linux.orig/include/asm-generic/mutex-null.h
+++ linux/include/asm-generic/mutex-null.h
@@ -10,14 +10,9 @@
#ifndef _ASM_GENERIC_MUTEX_NULL_H
#define _ASM_GENERIC_MUTEX_NULL_H

-/* extra parameter only needed for mutex debugging: */
-#ifndef __IP__
-# define __IP__
-#endif
-
-#define __mutex_fastpath_lock(count, fail_fn) fail_fn(count __RET_IP__)
-#define __mutex_fastpath_lock_retval(count, fail_fn) fail_fn(count __RET_IP__)
-#define __mutex_fastpath_unlock(count, fail_fn) fail_fn(count __RET_IP__)
+#define __mutex_fastpath_lock(count, fail_fn) fail_fn(count)
+#define __mutex_fastpath_lock_retval(count, fail_fn) fail_fn(count)
+#define __mutex_fastpath_unlock(count, fail_fn) fail_fn(count)
#define __mutex_fastpath_trylock(count, fail_fn) fail_fn(count)
#define __mutex_slowpath_needs_to_unlock() 1

Index: linux/include/linux/debug_locks.h
===================================================================
--- /dev/null
+++ linux/include/linux/debug_locks.h
@@ -0,0 +1,62 @@
+#ifndef __LINUX_DEBUG_LOCKING_H
+#define __LINUX_DEBUG_LOCKING_H
+
+extern int debug_locks;
+extern int debug_locks_silent;
+
+/*
+ * Generic 'turn off all lock debugging' function:
+ */
+extern int debug_locks_off(void);
+
+/*
+ * In the debug case we carry the caller's instruction pointer into
+ * other functions, but we dont want the function argument overhead
+ * in the nondebug case - hence these macros:
+ */
+#define _RET_IP_ (unsigned long)__builtin_return_address(0)
+#define _THIS_IP_ ({ __label__ __here; __here: (unsigned long)&&__here; })
+
+#define DEBUG_WARN_ON(c) \
+({ \
+ int __ret = 0; \
+ \
+ if (unlikely(c)) { \
+ if (debug_locks_off()) \
+ WARN_ON(1); \
+ __ret = 1; \
+ } \
+ __ret; \
+})
+
+#ifdef CONFIG_SMP
+# define SMP_DEBUG_WARN_ON(c) DEBUG_WARN_ON(c)
+#else
+# define SMP_DEBUG_WARN_ON(c) do { } while (0)
+#endif
+
+#ifdef CONFIG_DEBUG_LOCKING_API_SELFTESTS
+ extern void locking_selftest(void);
+#else
+# define locking_selftest() do { } while (0)
+#endif
+
+static inline void
+debug_check_no_locks_freed(const void *from, unsigned long len)
+{
+}
+
+static inline void
+debug_check_no_locks_held(struct task_struct *task)
+{
+}
+
+static inline void debug_show_all_locks(void)
+{
+}
+
+static inline void debug_show_held_locks(struct task_struct *task)
+{
+}
+
+#endif
Index: linux/include/linux/init_task.h
===================================================================
--- linux.orig/include/linux/init_task.h
+++ linux/include/linux/init_task.h
@@ -133,7 +133,6 @@ extern struct group_info init_groups;
.journal_info = NULL, \
.cpu_timers = INIT_CPU_TIMERS(tsk.cpu_timers), \
.fs_excl = ATOMIC_INIT(0), \
- INIT_RT_MUTEXES(tsk) \
}


Index: linux/include/linux/mm.h
===================================================================
--- linux.orig/include/linux/mm.h
+++ linux/include/linux/mm.h
@@ -14,6 +14,7 @@
#include <linux/prio_tree.h>
#include <linux/fs.h>
#include <linux/mutex.h>
+#include <linux/debug_locks.h>

struct mempolicy;
struct anon_vma;
@@ -1080,13 +1081,6 @@ static inline void vm_stat_account(struc
}
#endif /* CONFIG_PROC_FS */

-static inline void
-debug_check_no_locks_freed(const void *from, unsigned long len)
-{
- mutex_debug_check_no_locks_freed(from, len);
- rt_mutex_debug_check_no_locks_freed(from, len);
-}
-
#ifndef CONFIG_DEBUG_PAGEALLOC
static inline void
kernel_map_pages(struct page *page, int numpages, int enable)
Index: linux/include/linux/mutex-debug.h
===================================================================
--- linux.orig/include/linux/mutex-debug.h
+++ linux/include/linux/mutex-debug.h
@@ -7,17 +7,11 @@
* Mutexes - debugging helpers:
*/

-#define __DEBUG_MUTEX_INITIALIZER(lockname) \
- , .held_list = LIST_HEAD_INIT(lockname.held_list), \
- .name = #lockname , .magic = &lockname
+#define __DEBUG_MUTEX_INITIALIZER(lockname) \
+ , .magic = &lockname

-#define mutex_init(sem) __mutex_init(sem, __FUNCTION__)
+#define mutex_init(sem) __mutex_init(sem, __FILE__":"#sem)

extern void FASTCALL(mutex_destroy(struct mutex *lock));

-extern void mutex_debug_show_all_locks(void);
-extern void mutex_debug_show_held_locks(struct task_struct *filter);
-extern void mutex_debug_check_no_locks_held(struct task_struct *task);
-extern void mutex_debug_check_no_locks_freed(const void *from, unsigned long len);
-
#endif
Index: linux/include/linux/mutex.h
===================================================================
--- linux.orig/include/linux/mutex.h
+++ linux/include/linux/mutex.h
@@ -50,8 +50,6 @@ struct mutex {
struct list_head wait_list;
#ifdef CONFIG_DEBUG_MUTEXES
struct thread_info *owner;
- struct list_head held_list;
- unsigned long acquire_ip;
const char *name;
void *magic;
#endif
@@ -76,10 +74,6 @@ struct mutex_waiter {
# define __DEBUG_MUTEX_INITIALIZER(lockname)
# define mutex_init(mutex) __mutex_init(mutex, NULL)
# define mutex_destroy(mutex) do { } while (0)
-# define mutex_debug_show_all_locks() do { } while (0)
-# define mutex_debug_show_held_locks(p) do { } while (0)
-# define mutex_debug_check_no_locks_held(task) do { } while (0)
-# define mutex_debug_check_no_locks_freed(from, len) do { } while (0)
#endif

#define __MUTEX_INITIALIZER(lockname) \
Index: linux/include/linux/rtmutex.h
===================================================================
--- linux.orig/include/linux/rtmutex.h
+++ linux/include/linux/rtmutex.h
@@ -29,8 +29,6 @@ struct rt_mutex {
struct task_struct *owner;
#ifdef CONFIG_DEBUG_RT_MUTEXES
int save_state;
- struct list_head held_list_entry;
- unsigned long acquire_ip;
const char *name, *file;
int line;
void *magic;
@@ -98,14 +96,6 @@ extern int rt_mutex_trylock(struct rt_mu

extern void rt_mutex_unlock(struct rt_mutex *lock);

-#ifdef CONFIG_DEBUG_RT_MUTEXES
-# define INIT_RT_MUTEX_DEBUG(tsk) \
- .held_list_head = LIST_HEAD_INIT(tsk.held_list_head), \
- .held_list_lock = SPIN_LOCK_UNLOCKED
-#else
-# define INIT_RT_MUTEX_DEBUG(tsk)
-#endif
-
#ifdef CONFIG_RT_MUTEXES
# define INIT_RT_MUTEXES(tsk) \
.pi_waiters = PLIST_HEAD_INIT(tsk.pi_waiters, tsk.pi_lock), \
Index: linux/include/linux/sched.h
===================================================================
--- linux.orig/include/linux/sched.h
+++ linux/include/linux/sched.h
@@ -910,10 +910,6 @@ struct task_struct {
struct plist_head pi_waiters;
/* Deadlock detection and priority inheritance handling */
struct rt_mutex_waiter *pi_blocked_on;
-# ifdef CONFIG_DEBUG_RT_MUTEXES
- spinlock_t held_list_lock;
- struct list_head held_list_head;
-# endif
#endif

#ifdef CONFIG_DEBUG_MUTEXES
Index: linux/init/main.c
===================================================================
--- linux.orig/init/main.c
+++ linux/init/main.c
@@ -53,6 +53,7 @@
#include <linux/key.h>
#include <linux/root_dev.h>
#include <linux/buffer_head.h>
+#include <linux/debug_locks.h>

#include <asm/io.h>
#include <asm/bugs.h>
@@ -512,6 +513,14 @@ asmlinkage void __init start_kernel(void
panic(panic_later, panic_param);
profile_init();
local_irq_enable();
+
+ /*
+ * Need to run this when irqs are enabled, because it wants
+ * to self-test [hard/soft]-irqs on/off lock inversion bugs
+ * too:
+ */
+ locking_selftest();
+
#ifdef CONFIG_BLK_DEV_INITRD
if (initrd_start && !initrd_below_start_ok &&
initrd_start < min_low_pfn << PAGE_SHIFT) {
Index: linux/kernel/exit.c
===================================================================
--- linux.orig/kernel/exit.c
+++ linux/kernel/exit.c
@@ -952,10 +952,9 @@ fastcall NORET_TYPE void do_exit(long co
if (unlikely(current->pi_state_cache))
kfree(current->pi_state_cache);
/*
- * If DEBUG_MUTEXES is on, make sure we are holding no locks:
+ * Make sure we are holding no locks:
*/
- mutex_debug_check_no_locks_held(tsk);
- rt_mutex_debug_check_no_locks_held(tsk);
+ debug_check_no_locks_held(tsk);

if (tsk->io_context)
exit_io_context();
Index: linux/kernel/fork.c
===================================================================
--- linux.orig/kernel/fork.c
+++ linux/kernel/fork.c
@@ -921,10 +921,6 @@ static inline void rt_mutex_init_task(st
spin_lock_init(&p->pi_lock);
plist_head_init(&p->pi_waiters, &p->pi_lock);
p->pi_blocked_on = NULL;
-# ifdef CONFIG_DEBUG_RT_MUTEXES
- spin_lock_init(&p->held_list_lock);
- INIT_LIST_HEAD(&p->held_list_head);
-# endif
#endif
}

Index: linux/kernel/mutex-debug.c
===================================================================
--- linux.orig/kernel/mutex-debug.c
+++ linux/kernel/mutex-debug.c
@@ -19,37 +19,10 @@
#include <linux/spinlock.h>
#include <linux/kallsyms.h>
#include <linux/interrupt.h>
+#include <linux/debug_locks.h>

#include "mutex-debug.h"

-/*
- * We need a global lock when we walk through the multi-process
- * lock tree. Only used in the deadlock-debugging case.
- */
-DEFINE_SPINLOCK(debug_mutex_lock);
-
-/*
- * All locks held by all tasks, in a single global list:
- */
-LIST_HEAD(debug_mutex_held_locks);
-
-/*
- * In the debug case we carry the caller's instruction pointer into
- * other functions, but we dont want the function argument overhead
- * in the nondebug case - hence these macros:
- */
-#define __IP_DECL__ , unsigned long ip
-#define __IP__ , ip
-#define __RET_IP__ , (unsigned long)__builtin_return_address(0)
-
-/*
- * "mutex debugging enabled" flag. We turn it off when we detect
- * the first problem because we dont want to recurse back
- * into the tracing code when doing error printk or
- * executing a BUG():
- */
-int debug_mutex_on = 1;
-
static void printk_task(struct task_struct *p)
{
if (p)
@@ -66,157 +39,28 @@ static void printk_ti(struct thread_info
printk("<none>");
}

-static void printk_task_short(struct task_struct *p)
-{
- if (p)
- printk("%s/%d [%p, %3d]", p->comm, p->pid, p, p->prio);
- else
- printk("<none>");
-}
-
static void printk_lock(struct mutex *lock, int print_owner)
{
- printk(" [%p] {%s}\n", lock, lock->name);
+#ifdef CONFIG_PROVE_MUTEX_LOCKING
+ printk(" [%p] {%s}\n", lock, lock->dep_map.name);
+#else
+ printk(" [%p]\n", lock);
+#endif

if (print_owner && lock->owner) {
printk(".. held by: ");
printk_ti(lock->owner);
printk("\n");
}
- if (lock->owner) {
- printk("... acquired at: ");
- print_symbol("%s\n", lock->acquire_ip);
- }
-}
-
-/*
- * printk locks held by a task:
- */
-static void show_task_locks(struct task_struct *p)
-{
- switch (p->state) {
- case TASK_RUNNING: printk("R"); break;
- case TASK_INTERRUPTIBLE: printk("S"); break;
- case TASK_UNINTERRUPTIBLE: printk("D"); break;
- case TASK_STOPPED: printk("T"); break;
- case EXIT_ZOMBIE: printk("Z"); break;
- case EXIT_DEAD: printk("X"); break;
- default: printk("?"); break;
- }
- printk_task(p);
- if (p->blocked_on) {
- struct mutex *lock = p->blocked_on->lock;
-
- printk(" blocked on mutex:");
- printk_lock(lock, 1);
- } else
- printk(" (not blocked on mutex)\n");
-}
-
-/*
- * printk all locks held in the system (if filter == NULL),
- * or all locks belonging to a single task (if filter != NULL):
- */
-void show_held_locks(struct task_struct *filter)
-{
- struct list_head *curr, *cursor = NULL;
- struct mutex *lock;
- struct thread_info *t;
- unsigned long flags;
- int count = 0;
-
- if (filter) {
- printk("------------------------------\n");
- printk("| showing all locks held by: | (");
- printk_task_short(filter);
- printk("):\n");
- printk("------------------------------\n");
- } else {
- printk("---------------------------\n");
- printk("| showing all locks held: |\n");
- printk("---------------------------\n");
- }
-
- /*
- * Play safe and acquire the global trace lock. We
- * cannot printk with that lock held so we iterate
- * very carefully:
- */
-next:
- debug_spin_lock_save(&debug_mutex_lock, flags);
- list_for_each(curr, &debug_mutex_held_locks) {
- if (cursor && curr != cursor)
- continue;
- lock = list_entry(curr, struct mutex, held_list);
- t = lock->owner;
- if (filter && (t != filter->thread_info))
- continue;
- count++;
- cursor = curr->next;
- debug_spin_unlock_restore(&debug_mutex_lock, flags);
-
- printk("\n#%03d: ", count);
- printk_lock(lock, filter ? 0 : 1);
- goto next;
- }
- debug_spin_unlock_restore(&debug_mutex_lock, flags);
- printk("\n");
-}
-
-void mutex_debug_show_all_locks(void)
-{
- struct task_struct *g, *p;
- int count = 10;
- int unlock = 1;
-
- printk("\nShowing all blocking locks in the system:\n");
-
- /*
- * Here we try to get the tasklist_lock as hard as possible,
- * if not successful after 2 seconds we ignore it (but keep
- * trying). This is to enable a debug printout even if a
- * tasklist_lock-holding task deadlocks or crashes.
- */
-retry:
- if (!read_trylock(&tasklist_lock)) {
- if (count == 10)
- printk("hm, tasklist_lock locked, retrying... ");
- if (count) {
- count--;
- printk(" #%d", 10-count);
- mdelay(200);
- goto retry;
- }
- printk(" ignoring it.\n");
- unlock = 0;
- }
- if (count != 10)
- printk(" locked it.\n");
-
- do_each_thread(g, p) {
- show_task_locks(p);
- if (!unlock)
- if (read_trylock(&tasklist_lock))
- unlock = 1;
- } while_each_thread(g, p);
-
- printk("\n");
- show_held_locks(NULL);
- printk("=============================================\n\n");
-
- if (unlock)
- read_unlock(&tasklist_lock);
}

static void report_deadlock(struct task_struct *task, struct mutex *lock,
- struct mutex *lockblk, unsigned long ip)
+ struct mutex *lockblk)
{
printk("\n%s/%d is trying to acquire this lock:\n",
current->comm, current->pid);
printk_lock(lock, 1);
- printk("... trying at: ");
- print_symbol("%s\n", ip);
- show_held_locks(current);
+ debug_show_held_locks(current);

if (lockblk) {
printk("but %s/%d is deadlocking current task %s/%d!\n\n",
@@ -225,7 +69,7 @@ static void report_deadlock(struct task_
task->comm, task->pid);
printk_lock(lockblk, 1);

- show_held_locks(task);
+ debug_show_held_locks(task);

printk("\n%s/%d's [blocked] stackdump:\n\n",
task->comm, task->pid);
@@ -235,7 +79,7 @@ static void report_deadlock(struct task_
printk("\n%s/%d's [current] stackdump:\n\n",
current->comm, current->pid);
dump_stack();
- mutex_debug_show_all_locks();
+ debug_show_all_locks();
printk("[ turning off deadlock detection. Please report this. ]\n\n");
local_irq_disable();
}
@@ -243,13 +87,12 @@ static void report_deadlock(struct task_
/*
* Recursively check for mutex deadlocks:
*/
-static int check_deadlock(struct mutex *lock, int depth,
- struct thread_info *ti, unsigned long ip)
+static int check_deadlock(struct mutex *lock, int depth, struct thread_info *ti)
{
struct mutex *lockblk;
struct task_struct *task;

- if (!debug_mutex_on)
+ if (!debug_locks)
return 0;

ti = lock->owner;
@@ -263,123 +106,46 @@ static int check_deadlock(struct mutex *

/* Self-deadlock: */
if (current == task) {
- DEBUG_OFF();
+ debug_locks_off();
if (depth)
return 1;
printk("\n==========================================\n");
printk( "[ BUG: lock recursion deadlock detected! |\n");
printk( "------------------------------------------\n");
- report_deadlock(task, lock, NULL, ip);
+ report_deadlock(task, lock, NULL);
return 0;
}

/* Ugh, something corrupted the lock data structure? */
if (depth > 20) {
- DEBUG_OFF();
+ debug_locks_off();
printk("\n===========================================\n");
printk( "[ BUG: infinite lock dependency detected!? |\n");
printk( "-------------------------------------------\n");
- report_deadlock(task, lock, lockblk, ip);
+ report_deadlock(task, lock, lockblk);
return 0;
}

/* Recursively check for dependencies: */
- if (lockblk && check_deadlock(lockblk, depth+1, ti, ip)) {
+ if (lockblk && check_deadlock(lockblk, depth+1, ti)) {
printk("\n============================================\n");
printk( "[ BUG: circular locking deadlock detected! ]\n");
printk( "--------------------------------------------\n");
- report_deadlock(task, lock, lockblk, ip);
+ report_deadlock(task, lock, lockblk);
return 0;
}
return 0;
}

/*
- * Called when a task exits, this function checks whether the
- * task is holding any locks, and reports the first one if so:
- */
-void mutex_debug_check_no_locks_held(struct task_struct *task)
-{
- struct list_head *curr, *next;
- struct thread_info *t;
- unsigned long flags;
- struct mutex *lock;
-
- if (!debug_mutex_on)
- return;
-
- debug_spin_lock_save(&debug_mutex_lock, flags);
- list_for_each_safe(curr, next, &debug_mutex_held_locks) {
- lock = list_entry(curr, struct mutex, held_list);
- t = lock->owner;
- if (t != task->thread_info)
- continue;
- list_del_init(curr);
- DEBUG_OFF();
- debug_spin_unlock_restore(&debug_mutex_lock, flags);
-
- printk("BUG: %s/%d, lock held at task exit time!\n",
- task->comm, task->pid);
- printk_lock(lock, 1);
- if (lock->owner != task->thread_info)
- printk("exiting task is not even the owner??\n");
- return;
- }
- debug_spin_unlock_restore(&debug_mutex_lock, flags);
-}
-
-/*
- * Called when kernel memory is freed (or unmapped), or if a mutex
- * is destroyed or reinitialized - this code checks whether there is
- * any held lock in the memory range of <from> to <to>:
- */
-void mutex_debug_check_no_locks_freed(const void *from, unsigned long len)
-{
- struct list_head *curr, *next;
- const void *to = from + len;
- unsigned long flags;
- struct mutex *lock;
- void *lock_addr;
-
- if (!debug_mutex_on)
- return;
-
- debug_spin_lock_save(&debug_mutex_lock, flags);
- list_for_each_safe(curr, next, &debug_mutex_held_locks) {
- lock = list_entry(curr, struct mutex, held_list);
- lock_addr = lock;
- if (lock_addr < from || lock_addr >= to)
- continue;
- list_del_init(curr);
- DEBUG_OFF();
- debug_spin_unlock_restore(&debug_mutex_lock, flags);
-
- printk("BUG: %s/%d, active lock [%p(%p-%p)] freed!\n",
- current->comm, current->pid, lock, from, to);
- dump_stack();
- printk_lock(lock, 1);
- if (lock->owner != current_thread_info())
- printk("freeing task is not even the owner??\n");
- return;
- }
- debug_spin_unlock_restore(&debug_mutex_lock, flags);
-}
-
-/*
* Must be called with lock->wait_lock held.
*/
-void debug_mutex_set_owner(struct mutex *lock,
- struct thread_info *new_owner __IP_DECL__)
+void debug_mutex_set_owner(struct mutex *lock, struct thread_info *new_owner)
{
lock->owner = new_owner;
- DEBUG_WARN_ON(!list_empty(&lock->held_list));
- if (debug_mutex_on) {
- list_add_tail(&lock->held_list, &debug_mutex_held_locks);
- lock->acquire_ip = ip;
- }
}

-void debug_mutex_init_waiter(struct mutex_waiter *waiter)
+void debug_mutex_lock_common(struct mutex *lock, struct mutex_waiter *waiter)
{
memset(waiter, 0x11, sizeof(*waiter));
waiter->magic = waiter;
@@ -401,10 +167,12 @@ void debug_mutex_free_waiter(struct mute
}

void debug_mutex_add_waiter(struct mutex *lock, struct mutex_waiter *waiter,
- struct thread_info *ti __IP_DECL__)
+ struct thread_info *ti)
{
SMP_DEBUG_WARN_ON(!spin_is_locked(&lock->wait_lock));
- check_deadlock(lock, 0, ti, ip);
+#ifdef CONFIG_DEBUG_MUTEX_DEADLOCKS
+ check_deadlock(lock, 0, ti);
+#endif
/* Mark the current thread as blocked on the lock: */
ti->task->blocked_on = waiter;
waiter->lock = lock;
@@ -424,13 +192,10 @@ void mutex_remove_waiter(struct mutex *l

void debug_mutex_unlock(struct mutex *lock)
{
+ DEBUG_WARN_ON(lock->owner != current_thread_info());
DEBUG_WARN_ON(lock->magic != lock);
DEBUG_WARN_ON(!lock->wait_list.prev && !lock->wait_list.next);
DEBUG_WARN_ON(lock->owner != current_thread_info());
- if (debug_mutex_on) {
- DEBUG_WARN_ON(list_empty(&lock->held_list));
- list_del_init(&lock->held_list);
- }
}

void debug_mutex_init(struct mutex *lock, const char *name)
@@ -438,10 +203,8 @@ void debug_mutex_init(struct mutex *lock
/*
* Make sure we are not reinitializing a held lock:
*/
- mutex_debug_check_no_locks_freed((void *)lock, sizeof(*lock));
+ debug_check_no_locks_freed((void *)lock, sizeof(*lock));
lock->owner = NULL;
- INIT_LIST_HEAD(&lock->held_list);
- lock->name = name;
lock->magic = lock;
}

Index: linux/kernel/mutex-debug.h
===================================================================
--- linux.orig/kernel/mutex-debug.h
+++ linux/kernel/mutex-debug.h
@@ -10,110 +10,43 @@
* More details are in kernel/mutex-debug.c.
*/

-extern spinlock_t debug_mutex_lock;
-extern struct list_head debug_mutex_held_locks;
-extern int debug_mutex_on;
-
-/*
- * In the debug case we carry the caller's instruction pointer into
- * other functions, but we dont want the function argument overhead
- * in the nondebug case - hence these macros:
- */
-#define __IP_DECL__ , unsigned long ip
-#define __IP__ , ip
-#define __RET_IP__ , (unsigned long)__builtin_return_address(0)
-
/*
* This must be called with lock->wait_lock held.
*/
-extern void debug_mutex_set_owner(struct mutex *lock,
- struct thread_info *new_owner __IP_DECL__);
+extern void
+debug_mutex_set_owner(struct mutex *lock, struct thread_info *new_owner);

static inline void debug_mutex_clear_owner(struct mutex *lock)
{
lock->owner = NULL;
}

-extern void debug_mutex_init_waiter(struct mutex_waiter *waiter);
+extern void debug_mutex_lock_common(struct mutex *lock,
+ struct mutex_waiter *waiter);
extern void debug_mutex_wake_waiter(struct mutex *lock,
struct mutex_waiter *waiter);
extern void debug_mutex_free_waiter(struct mutex_waiter *waiter);
extern void debug_mutex_add_waiter(struct mutex *lock,
struct mutex_waiter *waiter,
- struct thread_info *ti __IP_DECL__);
+ struct thread_info *ti);
extern void mutex_remove_waiter(struct mutex *lock, struct mutex_waiter *waiter,
struct thread_info *ti);
extern void debug_mutex_unlock(struct mutex *lock);
extern void debug_mutex_init(struct mutex *lock, const char *name);

-#define debug_spin_lock_save(lock, flags) \
- do { \
- local_irq_save(flags); \
- if (debug_mutex_on) \
- spin_lock(lock); \
- } while (0)
-
-#define debug_spin_unlock_restore(lock, flags) \
- do { \
- if (debug_mutex_on) \
- spin_unlock(lock); \
- local_irq_restore(flags); \
- preempt_check_resched(); \
- } while (0)
-
#define spin_lock_mutex(lock, flags) \
do { \
struct mutex *l = container_of(lock, struct mutex, wait_lock); \
\
DEBUG_WARN_ON(in_interrupt()); \
- debug_spin_lock_save(&debug_mutex_lock, flags); \
- spin_lock(lock); \
+ local_irq_save(flags); \
+ __raw_spin_lock(&(lock)->raw_lock); \
DEBUG_WARN_ON(l->magic != l); \
} while (0)

#define spin_unlock_mutex(lock, flags) \
do { \
- spin_unlock(lock); \
- debug_spin_unlock_restore(&debug_mutex_lock, flags); \
+ __raw_spin_unlock(&(lock)->raw_lock); \
+ local_irq_restore(flags); \
+ preempt_check_resched(); \
} while (0)
-
-#define DEBUG_OFF() \
-do { \
- if (debug_mutex_on) { \
- debug_mutex_on = 0; \
- console_verbose(); \
- if (spin_is_locked(&debug_mutex_lock)) \
- spin_unlock(&debug_mutex_lock); \
- } \
-} while (0)
-
-#define DEBUG_BUG() \
-do { \
- if (debug_mutex_on) { \
- DEBUG_OFF(); \
- BUG(); \
- } \
-} while (0)
-
-#define DEBUG_WARN_ON(c) \
-do { \
- if (unlikely(c && debug_mutex_on)) { \
- DEBUG_OFF(); \
- WARN_ON(1); \
- } \
-} while (0)
-
-# define DEBUG_BUG_ON(c) \
-do { \
- if (unlikely(c)) \
- DEBUG_BUG(); \
-} while (0)
-
-#ifdef CONFIG_SMP
-# define SMP_DEBUG_WARN_ON(c) DEBUG_WARN_ON(c)
-# define SMP_DEBUG_BUG_ON(c) DEBUG_BUG_ON(c)
-#else
-# define SMP_DEBUG_WARN_ON(c) do { } while (0)
-# define SMP_DEBUG_BUG_ON(c) do { } while (0)
-#endif
-
Index: linux/kernel/mutex.c
===================================================================
--- linux.orig/kernel/mutex.c
+++ linux/kernel/mutex.c
@@ -17,6 +17,7 @@
#include <linux/module.h>
#include <linux/spinlock.h>
#include <linux/interrupt.h>
+#include <linux/debug_locks.h>

/*
* In the DEBUG case we are using the "NULL fastpath" for mutexes,
@@ -38,7 +39,7 @@
*
* It is not allowed to initialize an already locked mutex.
*/
-void fastcall __mutex_init(struct mutex *lock, const char *name)
+__always_inline void fastcall __mutex_init(struct mutex *lock, const char *name)
{
atomic_set(&lock->count, 1);
spin_lock_init(&lock->wait_lock);
@@ -56,7 +57,7 @@ EXPORT_SYMBOL(__mutex_init);
* branch is predicted by the CPU as default-untaken.
*/
static void fastcall noinline __sched
-__mutex_lock_slowpath(atomic_t *lock_count __IP_DECL__);
+__mutex_lock_slowpath(atomic_t *lock_count);

/***
* mutex_lock - acquire the mutex
@@ -79,7 +80,7 @@ __mutex_lock_slowpath(atomic_t *lock_cou
*
* This function is similar to (but not equivalent to) down().
*/
-void fastcall __sched mutex_lock(struct mutex *lock)
+void inline fastcall __sched mutex_lock(struct mutex *lock)
{
might_sleep();
/*
@@ -92,7 +93,7 @@ void fastcall __sched mutex_lock(struct
EXPORT_SYMBOL(mutex_lock);

static void fastcall noinline __sched
-__mutex_unlock_slowpath(atomic_t *lock_count __IP_DECL__);
+__mutex_unlock_slowpath(atomic_t *lock_count);

/***
* mutex_unlock - release the mutex
@@ -116,22 +117,36 @@ void fastcall __sched mutex_unlock(struc

EXPORT_SYMBOL(mutex_unlock);

+static void fastcall noinline __sched
+__mutex_unlock_non_nested_slowpath(atomic_t *lock_count);
+
+void fastcall __sched mutex_unlock_non_nested(struct mutex *lock)
+{
+ /*
+ * The unlocking fastpath is the 0->1 transition from 'locked'
+ * into 'unlocked' state:
+ */
+ __mutex_fastpath_unlock(&lock->count, __mutex_unlock_non_nested_slowpath);
+}
+
+EXPORT_SYMBOL(mutex_unlock_non_nested);
+
+
/*
* Lock a mutex (possibly interruptible), slowpath:
*/
static inline int __sched
-__mutex_lock_common(struct mutex *lock, long state __IP_DECL__)
+__mutex_lock_common(struct mutex *lock, long state, unsigned int subtype)
{
struct task_struct *task = current;
struct mutex_waiter waiter;
unsigned int old_val;
unsigned long flags;

- debug_mutex_init_waiter(&waiter);
-
spin_lock_mutex(&lock->wait_lock, flags);

- debug_mutex_add_waiter(lock, &waiter, task->thread_info, ip);
+ debug_mutex_lock_common(lock, &waiter);
+ debug_mutex_add_waiter(lock, &waiter, task->thread_info);

/* add waiting tasks to the end of the waitqueue (FIFO): */
list_add_tail(&waiter.list, &lock->wait_list);
@@ -173,7 +188,7 @@ __mutex_lock_common(struct mutex *lock,

/* got the lock - rejoice! */
mutex_remove_waiter(lock, &waiter, task->thread_info);
- debug_mutex_set_owner(lock, task->thread_info __IP__);
+ debug_mutex_set_owner(lock, task->thread_info);

/* set it to 0 if there are no waiters left: */
if (likely(list_empty(&lock->wait_list)))
@@ -183,32 +198,41 @@ __mutex_lock_common(struct mutex *lock,

debug_mutex_free_waiter(&waiter);

- DEBUG_WARN_ON(list_empty(&lock->held_list));
DEBUG_WARN_ON(lock->owner != task->thread_info);

return 0;
}

static void fastcall noinline __sched
-__mutex_lock_slowpath(atomic_t *lock_count __IP_DECL__)
+__mutex_lock_slowpath(atomic_t *lock_count)
{
struct mutex *lock = container_of(lock_count, struct mutex, count);

- __mutex_lock_common(lock, TASK_UNINTERRUPTIBLE __IP__);
+ __mutex_lock_common(lock, TASK_UNINTERRUPTIBLE, 0);
}

+#ifdef CONFIG_DEBUG_MUTEXES
+void __sched
+mutex_lock_nested(struct mutex *lock, unsigned int subtype)
+{
+ might_sleep();
+ __mutex_lock_common(lock, TASK_UNINTERRUPTIBLE, subtype);
+}
+
+EXPORT_SYMBOL_GPL(mutex_lock_nested);
+#endif
+
/*
* Release the lock, slowpath:
*/
-static fastcall noinline void
-__mutex_unlock_slowpath(atomic_t *lock_count __IP_DECL__)
+static fastcall inline void
+__mutex_unlock_common_slowpath(atomic_t *lock_count, int nested)
{
struct mutex *lock = container_of(lock_count, struct mutex, count);
unsigned long flags;

- DEBUG_WARN_ON(lock->owner != current_thread_info());
-
spin_lock_mutex(&lock->wait_lock, flags);
+ debug_mutex_unlock(lock);

/*
* some architectures leave the lock unlocked in the fastpath failure
@@ -218,8 +242,6 @@ __mutex_unlock_slowpath(atomic_t *lock_c
if (__mutex_slowpath_needs_to_unlock())
atomic_set(&lock->count, 1);

- debug_mutex_unlock(lock);
-
if (!list_empty(&lock->wait_list)) {
/* get the first entry from the wait-list: */
struct mutex_waiter *waiter =
@@ -237,11 +259,27 @@ __mutex_unlock_slowpath(atomic_t *lock_c
}

/*
+ * Release the lock, slowpath:
+ */
+static fastcall noinline void
+__mutex_unlock_slowpath(atomic_t *lock_count)
+{
+ __mutex_unlock_common_slowpath(lock_count, 1);
+}
+
+static fastcall noinline void
+__mutex_unlock_non_nested_slowpath(atomic_t *lock_count)
+{
+ __mutex_unlock_common_slowpath(lock_count, 0);
+}
+
+
+/*
* Here come the less common (and hence less performance-critical) APIs:
* mutex_lock_interruptible() and mutex_trylock().
*/
static int fastcall noinline __sched
-__mutex_lock_interruptible_slowpath(atomic_t *lock_count __IP_DECL__);
+__mutex_lock_interruptible_slowpath(atomic_t *lock_count);

/***
* mutex_lock_interruptible - acquire the mutex, interruptable
@@ -264,11 +302,11 @@ int fastcall __sched mutex_lock_interrup
EXPORT_SYMBOL(mutex_lock_interruptible);

static int fastcall noinline __sched
-__mutex_lock_interruptible_slowpath(atomic_t *lock_count __IP_DECL__)
+__mutex_lock_interruptible_slowpath(atomic_t *lock_count)
{
struct mutex *lock = container_of(lock_count, struct mutex, count);

- return __mutex_lock_common(lock, TASK_INTERRUPTIBLE __IP__);
+ return __mutex_lock_common(lock, TASK_INTERRUPTIBLE, 0);
}

/*
@@ -285,7 +323,8 @@ static inline int __mutex_trylock_slowpa

prev = atomic_xchg(&lock->count, -1);
if (likely(prev == 1))
- debug_mutex_set_owner(lock, current_thread_info() __RET_IP__);
+ debug_mutex_set_owner(lock, current_thread_info());
+
/* Set it back to 0 if there are no waiters: */
if (likely(list_empty(&lock->wait_list)))
atomic_set(&lock->count, 0);
Index: linux/kernel/mutex.h
===================================================================
--- linux.orig/kernel/mutex.h
+++ linux/kernel/mutex.h
@@ -19,19 +19,15 @@
#define DEBUG_WARN_ON(c) do { } while (0)
#define debug_mutex_set_owner(lock, new_owner) do { } while (0)
#define debug_mutex_clear_owner(lock) do { } while (0)
-#define debug_mutex_init_waiter(waiter) do { } while (0)
#define debug_mutex_wake_waiter(lock, waiter) do { } while (0)
#define debug_mutex_free_waiter(waiter) do { } while (0)
-#define debug_mutex_add_waiter(lock, waiter, ti, ip) do { } while (0)
+#define debug_mutex_add_waiter(lock, waiter, ti) do { } while (0)
+#define mutex_acquire(lock, subtype, trylock) do { } while (0)
+#define mutex_release(lock, nested) do { } while (0)
#define debug_mutex_unlock(lock) do { } while (0)
#define debug_mutex_init(lock, name) do { } while (0)

-/*
- * Return-address parameters/declarations. They are very useful for
- * debugging, but add overhead in the !DEBUG case - so we go the
- * trouble of using this not too elegant but zero-cost solution:
- */
-#define __IP_DECL__
-#define __IP__
-#define __RET_IP__
-
+static inline void
+debug_mutex_lock_common(struct mutex *lock, struct mutex_waiter *waiter)
+{
+}
Index: linux/kernel/rtmutex-debug.c
===================================================================
--- linux.orig/kernel/rtmutex-debug.c
+++ linux/kernel/rtmutex-debug.c
@@ -26,6 +26,7 @@
#include <linux/interrupt.h>
#include <linux/plist.h>
#include <linux/fs.h>
+#include <linux/debug_locks.h>

#include "rtmutex_common.h"

@@ -45,8 +46,6 @@ do { \
console_verbose(); \
if (spin_is_locked(&current->pi_lock)) \
spin_unlock(&current->pi_lock); \
- if (spin_is_locked(&current->held_list_lock)) \
- spin_unlock(&current->held_list_lock); \
} \
} while (0)

@@ -105,14 +104,6 @@ static void printk_task(task_t *p)
printk("<none>");
}

-static void printk_task_short(task_t *p)
-{
- if (p)
- printk("%s/%d [%p, %3d]", p->comm, p->pid, p, p->prio);
- else
- printk("<none>");
-}
-
static void printk_lock(struct rt_mutex *lock, int print_owner)
{
if (lock->name)
@@ -128,222 +119,6 @@ static void printk_lock(struct rt_mutex
printk_task(rt_mutex_owner(lock));
printk("\n");
}
- if (rt_mutex_owner(lock)) {
- printk("... acquired at: ");
- print_symbol("%s\n", lock->acquire_ip);
- }
-}
-
-static void printk_waiter(struct rt_mutex_waiter *w)
-{
- printk("-------------------------\n");
- printk("| waiter struct %p:\n", w);
- printk("| w->list_entry: [DP:%p/%p|SP:%p/%p|PRI:%d]\n",
- w->list_entry.plist.prio_list.prev, w->list_entry.plist.prio_list.next,
- w->list_entry.plist.node_list.prev, w->list_entry.plist.node_list.next,
- w->list_entry.prio);
- printk("| w->pi_list_entry: [DP:%p/%p|SP:%p/%p|PRI:%d]\n",
- w->pi_list_entry.plist.prio_list.prev, w->pi_list_entry.plist.prio_list.next,
- w->pi_list_entry.plist.node_list.prev, w->pi_list_entry.plist.node_list.next,
- w->pi_list_entry.prio);
- printk("\n| lock:\n");
- printk_lock(w->lock, 1);
- printk("| w->ti->task:\n");
- printk_task(w->task);
- printk("| blocked at: ");
- print_symbol("%s\n", w->ip);
- printk("-------------------------\n");
-}
-
-static void show_task_locks(task_t *p)
-{
- switch (p->state) {
- case TASK_RUNNING: printk("R"); break;
- case TASK_INTERRUPTIBLE: printk("S"); break;
- case TASK_UNINTERRUPTIBLE: printk("D"); break;
- case TASK_STOPPED: printk("T"); break;
- case EXIT_ZOMBIE: printk("Z"); break;
- case EXIT_DEAD: printk("X"); break;
- default: printk("?"); break;
- }
- printk_task(p);
- if (p->pi_blocked_on) {
- struct rt_mutex *lock = p->pi_blocked_on->lock;
-
- printk(" blocked on:");
- printk_lock(lock, 1);
- } else
- printk(" (not blocked)\n");
-}
-
-void rt_mutex_show_held_locks(task_t *task, int verbose)
-{
- struct list_head *curr, *cursor = NULL;
- struct rt_mutex *lock;
- task_t *t;
- unsigned long flags;
- int count = 0;
-
- if (!rt_trace_on)
- return;
-
- if (verbose) {
- printk("------------------------------\n");
- printk("| showing all locks held by: | (");
- printk_task_short(task);
- printk("):\n");
- printk("------------------------------\n");
- }
-
-next:
- spin_lock_irqsave(&task->held_list_lock, flags);
- list_for_each(curr, &task->held_list_head) {
- if (cursor && curr != cursor)
- continue;
- lock = list_entry(curr, struct rt_mutex, held_list_entry);
- t = rt_mutex_owner(lock);
- WARN_ON(t != task);
- count++;
- cursor = curr->next;
- spin_unlock_irqrestore(&task->held_list_lock, flags);
-
- printk("\n#%03d: ", count);
- printk_lock(lock, 0);
- goto next;
- }
- spin_unlock_irqrestore(&task->held_list_lock, flags);
-
- printk("\n");
-}
-
-void rt_mutex_show_all_locks(void)
-{
- task_t *g, *p;
- int count = 10;
- int unlock = 1;
-
- printk("\n");
- printk("----------------------\n");
- printk("| showing all tasks: |\n");
- printk("----------------------\n");
-
- /*
- * Here we try to get the tasklist_lock as hard as possible,
- * if not successful after 2 seconds we ignore it (but keep
- * trying). This is to enable a debug printout even if a
- * tasklist_lock-holding task deadlocks or crashes.
- */
-retry:
- if (!read_trylock(&tasklist_lock)) {
- if (count == 10)
- printk("hm, tasklist_lock locked, retrying... ");
- if (count) {
- count--;
- printk(" #%d", 10-count);
- mdelay(200);
- goto retry;
- }
- printk(" ignoring it.\n");
- unlock = 0;
- }
- if (count != 10)
- printk(" locked it.\n");
-
- do_each_thread(g, p) {
- show_task_locks(p);
- if (!unlock)
- if (read_trylock(&tasklist_lock))
- unlock = 1;
- } while_each_thread(g, p);
-
- printk("\n");
-
- printk("-----------------------------------------\n");
- printk("| showing all locks held in the system: |\n");
- printk("-----------------------------------------\n");
-
- do_each_thread(g, p) {
- rt_mutex_show_held_locks(p, 0);
- if (!unlock)
- if (read_trylock(&tasklist_lock))
- unlock = 1;
- } while_each_thread(g, p);
-
-
- printk("=============================================\n\n");
-
- if (unlock)
- read_unlock(&tasklist_lock);
-}
-
-void rt_mutex_debug_check_no_locks_held(task_t *task)
-{
- struct rt_mutex_waiter *w;
- struct list_head *curr;
- struct rt_mutex *lock;
-
- if (!rt_trace_on)
- return;
- if (!rt_prio(task->normal_prio) && rt_prio(task->prio)) {
- printk("BUG: PI priority boost leaked!\n");
- printk_task(task);
- printk("\n");
- }
- if (list_empty(&task->held_list_head))
- return;
-
- spin_lock(&task->pi_lock);
- plist_for_each_entry(w, &task->pi_waiters, pi_list_entry) {
- TRACE_OFF();
-
- printk("hm, PI interest held at exit time? Task:\n");
- printk_task(task);
- printk_waiter(w);
- return;
- }
- spin_unlock(&task->pi_lock);
-
- list_for_each(curr, &task->held_list_head) {
- lock = list_entry(curr, struct rt_mutex, held_list_entry);
-
- printk("BUG: %s/%d, lock held at task exit time!\n",
- task->comm, task->pid);
- printk_lock(lock, 1);
- if (rt_mutex_owner(lock) != task)
- printk("exiting task is not even the owner??\n");
- }
-}
-
-int rt_mutex_debug_check_no_locks_freed(const void *from, unsigned long len)
-{
- const void *to = from + len;
- struct list_head *curr;
- struct rt_mutex *lock;
- unsigned long flags;
- void *lock_addr;
-
- if (!rt_trace_on)
- return 0;
-
- spin_lock_irqsave(&current->held_list_lock, flags);
- list_for_each(curr, &current->held_list_head) {
- lock = list_entry(curr, struct rt_mutex, held_list_entry);
- lock_addr = lock;
- if (lock_addr < from || lock_addr >= to)
- continue;
- TRACE_OFF();
-
- printk("BUG: %s/%d, active lock [%p(%p-%p)] freed!\n",
- current->comm, current->pid, lock, from, to);
- dump_stack();
- printk_lock(lock, 1);
- if (rt_mutex_owner(lock) != current)
- printk("freeing task is not even the owner??\n");
- return 1;
- }
- spin_unlock_irqrestore(&current->held_list_lock, flags);
-
- return 0;
}

void rt_mutex_debug_task_free(struct task_struct *task)
@@ -395,85 +170,41 @@ void debug_rt_mutex_print_deadlock(struc
current->comm, current->pid);
printk_lock(waiter->lock, 1);

- printk("... trying at: ");
- print_symbol("%s\n", waiter->ip);
-
printk("\n2) %s/%d is blocked on this lock:\n", task->comm, task->pid);
printk_lock(waiter->deadlock_lock, 1);

- rt_mutex_show_held_locks(current, 1);
- rt_mutex_show_held_locks(task, 1);
+ debug_show_held_locks(current);
+ debug_show_held_locks(task);

printk("\n%s/%d's [blocked] stackdump:\n\n", task->comm, task->pid);
show_stack(task, NULL);
printk("\n%s/%d's [current] stackdump:\n\n",
current->comm, current->pid);
dump_stack();
- rt_mutex_show_all_locks();
+ debug_show_all_locks();
+
printk("[ turning off deadlock detection."
"Please report this trace. ]\n\n");
local_irq_disable();
}

-void debug_rt_mutex_lock(struct rt_mutex *lock __IP_DECL__)
+void debug_rt_mutex_lock(struct rt_mutex *lock)
{
- unsigned long flags;
-
- if (rt_trace_on) {
- TRACE_WARN_ON_LOCKED(!list_empty(&lock->held_list_entry));
-
- spin_lock_irqsave(&current->held_list_lock, flags);
- list_add_tail(&lock->held_list_entry, &current->held_list_head);
- spin_unlock_irqrestore(&current->held_list_lock, flags);
-
- lock->acquire_ip = ip;
- }
}

void debug_rt_mutex_unlock(struct rt_mutex *lock)
{
- unsigned long flags;
-
- if (rt_trace_on) {
- TRACE_WARN_ON_LOCKED(rt_mutex_owner(lock) != current);
- TRACE_WARN_ON_LOCKED(list_empty(&lock->held_list_entry));
-
- spin_lock_irqsave(&current->held_list_lock, flags);
- list_del_init(&lock->held_list_entry);
- spin_unlock_irqrestore(&current->held_list_lock, flags);
- }
+ TRACE_WARN_ON_LOCKED(rt_mutex_owner(lock) != current);
}

-void debug_rt_mutex_proxy_lock(struct rt_mutex *lock,
- struct task_struct *powner __IP_DECL__)
+void
+debug_rt_mutex_proxy_lock(struct rt_mutex *lock, struct task_struct *powner)
{
- unsigned long flags;
-
- if (rt_trace_on) {
- TRACE_WARN_ON_LOCKED(!list_empty(&lock->held_list_entry));
-
- spin_lock_irqsave(&powner->held_list_lock, flags);
- list_add_tail(&lock->held_list_entry, &powner->held_list_head);
- spin_unlock_irqrestore(&powner->held_list_lock, flags);
-
- lock->acquire_ip = ip;
- }
}

void debug_rt_mutex_proxy_unlock(struct rt_mutex *lock)
{
- unsigned long flags;
-
- if (rt_trace_on) {
- struct task_struct *owner = rt_mutex_owner(lock);
-
- TRACE_WARN_ON_LOCKED(!owner);
- TRACE_WARN_ON_LOCKED(list_empty(&lock->held_list_entry));
-
- spin_lock_irqsave(&owner->held_list_lock, flags);
- list_del_init(&lock->held_list_entry);
- spin_unlock_irqrestore(&owner->held_list_lock, flags);
- }
+ TRACE_WARN_ON_LOCKED(!rt_mutex_owner(lock));
}

void debug_rt_mutex_init_waiter(struct rt_mutex_waiter *waiter)
@@ -493,14 +224,11 @@ void debug_rt_mutex_free_waiter(struct r

void debug_rt_mutex_init(struct rt_mutex *lock, const char *name)
{
- void *addr = lock;
-
- if (rt_trace_on) {
- rt_mutex_debug_check_no_locks_freed(addr,
- sizeof(struct rt_mutex));
- INIT_LIST_HEAD(&lock->held_list_entry);
- lock->name = name;
- }
+ /*
+ * Make sure we are not reinitializing a held lock:
+ */
+ debug_check_no_locks_freed((void *)lock, sizeof(*lock));
+ lock->name = name;
}

void rt_mutex_deadlock_account_lock(struct rt_mutex *lock, task_t *task)
Index: linux/kernel/rtmutex-debug.h
===================================================================
--- linux.orig/kernel/rtmutex-debug.h
+++ linux/kernel/rtmutex-debug.h
@@ -9,20 +9,16 @@
* This file contains macros used solely by rtmutex.c. Debug version.
*/

-#define __IP_DECL__ , unsigned long ip
-#define __IP__ , ip
-#define __RET_IP__ , (unsigned long)__builtin_return_address(0)
-
extern void
rt_mutex_deadlock_account_lock(struct rt_mutex *lock, struct task_struct *task);
extern void rt_mutex_deadlock_account_unlock(struct task_struct *task);
extern void debug_rt_mutex_init_waiter(struct rt_mutex_waiter *waiter);
extern void debug_rt_mutex_free_waiter(struct rt_mutex_waiter *waiter);
extern void debug_rt_mutex_init(struct rt_mutex *lock, const char *name);
-extern void debug_rt_mutex_lock(struct rt_mutex *lock __IP_DECL__);
+extern void debug_rt_mutex_lock(struct rt_mutex *lock);
extern void debug_rt_mutex_unlock(struct rt_mutex *lock);
extern void debug_rt_mutex_proxy_lock(struct rt_mutex *lock,
- struct task_struct *powner __IP_DECL__);
+ struct task_struct *powner);
extern void debug_rt_mutex_proxy_unlock(struct rt_mutex *lock);
extern void debug_rt_mutex_deadlock(int detect, struct rt_mutex_waiter *waiter,
struct rt_mutex *lock);
Index: linux/kernel/rtmutex.c
===================================================================
--- linux.orig/kernel/rtmutex.c
+++ linux/kernel/rtmutex.c
@@ -160,8 +160,7 @@ int max_lock_depth = 1024;
static int rt_mutex_adjust_prio_chain(task_t *task,
int deadlock_detect,
struct rt_mutex *orig_lock,
- struct rt_mutex_waiter *orig_waiter
- __IP_DECL__)
+ struct rt_mutex_waiter *orig_waiter)
{
struct rt_mutex *lock;
struct rt_mutex_waiter *waiter, *top_waiter = orig_waiter;
@@ -356,7 +355,7 @@ static inline int try_to_steal_lock(stru
*
* Must be called with lock->wait_lock held.
*/
-static int try_to_take_rt_mutex(struct rt_mutex *lock __IP_DECL__)
+static int try_to_take_rt_mutex(struct rt_mutex *lock)
{
/*
* We have to be careful here if the atomic speedups are
@@ -383,7 +382,7 @@ static int try_to_take_rt_mutex(struct r
return 0;

/* We got the lock. */
- debug_rt_mutex_lock(lock __IP__);
+ debug_rt_mutex_lock(lock);

rt_mutex_set_owner(lock, current, 0);

@@ -401,8 +400,7 @@ static int try_to_take_rt_mutex(struct r
*/
static int task_blocks_on_rt_mutex(struct rt_mutex *lock,
struct rt_mutex_waiter *waiter,
- int detect_deadlock
- __IP_DECL__)
+ int detect_deadlock)
{
struct rt_mutex_waiter *top_waiter = waiter;
task_t *owner = rt_mutex_owner(lock);
@@ -450,8 +448,7 @@ static int task_blocks_on_rt_mutex(struc

spin_unlock(&lock->wait_lock);

- res = rt_mutex_adjust_prio_chain(owner, detect_deadlock, lock,
- waiter __IP__);
+ res = rt_mutex_adjust_prio_chain(owner, detect_deadlock, lock, waiter);

spin_lock(&lock->wait_lock);

@@ -523,7 +520,7 @@ static void wakeup_next_waiter(struct rt
* Must be called with lock->wait_lock held
*/
static void remove_waiter(struct rt_mutex *lock,
- struct rt_mutex_waiter *waiter __IP_DECL__)
+ struct rt_mutex_waiter *waiter)
{
int first = (waiter == rt_mutex_top_waiter(lock));
int boost = 0;
@@ -564,7 +561,7 @@ static void remove_waiter(struct rt_mute

spin_unlock(&lock->wait_lock);

- rt_mutex_adjust_prio_chain(owner, 0, lock, NULL __IP__);
+ rt_mutex_adjust_prio_chain(owner, 0, lock, NULL);

spin_lock(&lock->wait_lock);
}
@@ -575,7 +572,7 @@ static void remove_waiter(struct rt_mute
static int __sched
rt_mutex_slowlock(struct rt_mutex *lock, int state,
struct hrtimer_sleeper *timeout,
- int detect_deadlock __IP_DECL__)
+ int detect_deadlock)
{
struct rt_mutex_waiter waiter;
int ret = 0;
@@ -586,7 +583,7 @@ rt_mutex_slowlock(struct rt_mutex *lock,
spin_lock(&lock->wait_lock);

/* Try to acquire the lock again: */
- if (try_to_take_rt_mutex(lock __IP__)) {
+ if (try_to_take_rt_mutex(lock)) {
spin_unlock(&lock->wait_lock);
return 0;
}
@@ -600,7 +597,7 @@ rt_mutex_slowlock(struct rt_mutex *lock,

for (;;) {
/* Try to acquire the lock: */
- if (try_to_take_rt_mutex(lock __IP__))
+ if (try_to_take_rt_mutex(lock))
break;

/*
@@ -624,7 +621,7 @@ rt_mutex_slowlock(struct rt_mutex *lock,
*/
if (!waiter.task) {
ret = task_blocks_on_rt_mutex(lock, &waiter,
- detect_deadlock __IP__);
+ detect_deadlock);
/*
* If we got woken up by the owner then start loop
* all over without going into schedule to try
@@ -650,7 +647,7 @@ rt_mutex_slowlock(struct rt_mutex *lock,
set_current_state(TASK_RUNNING);

if (unlikely(waiter.task))
- remove_waiter(lock, &waiter __IP__);
+ remove_waiter(lock, &waiter);

/*
* try_to_take_rt_mutex() sets the waiter bit
@@ -681,7 +678,7 @@ rt_mutex_slowlock(struct rt_mutex *lock,
* Slow path try-lock function:
*/
static inline int
-rt_mutex_slowtrylock(struct rt_mutex *lock __IP_DECL__)
+rt_mutex_slowtrylock(struct rt_mutex *lock)
{
int ret = 0;

@@ -689,7 +686,7 @@ rt_mutex_slowtrylock(struct rt_mutex *lo

if (likely(rt_mutex_owner(lock) != current)) {

- ret = try_to_take_rt_mutex(lock __IP__);
+ ret = try_to_take_rt_mutex(lock);
/*
* try_to_take_rt_mutex() sets the lock waiters
* bit unconditionally. Clean this up.
@@ -739,13 +736,13 @@ rt_mutex_fastlock(struct rt_mutex *lock,
int detect_deadlock,
int (*slowfn)(struct rt_mutex *lock, int state,
struct hrtimer_sleeper *timeout,
- int detect_deadlock __IP_DECL__))
+ int detect_deadlock))
{
if (!detect_deadlock && likely(rt_mutex_cmpxchg(lock, NULL, current))) {
rt_mutex_deadlock_account_lock(lock, current);
return 0;
} else
- return slowfn(lock, state, NULL, detect_deadlock __RET_IP__);
+ return slowfn(lock, state, NULL, detect_deadlock);
}

static inline int
@@ -753,24 +750,24 @@ rt_mutex_timed_fastlock(struct rt_mutex
struct hrtimer_sleeper *timeout, int detect_deadlock,
int (*slowfn)(struct rt_mutex *lock, int state,
struct hrtimer_sleeper *timeout,
- int detect_deadlock __IP_DECL__))
+ int detect_deadlock))
{
if (!detect_deadlock && likely(rt_mutex_cmpxchg(lock, NULL, current))) {
rt_mutex_deadlock_account_lock(lock, current);
return 0;
} else
- return slowfn(lock, state, timeout, detect_deadlock __RET_IP__);
+ return slowfn(lock, state, timeout, detect_deadlock);
}

static inline int
rt_mutex_fasttrylock(struct rt_mutex *lock,
- int (*slowfn)(struct rt_mutex *lock __IP_DECL__))
+ int (*slowfn)(struct rt_mutex *lock))
{
if (likely(rt_mutex_cmpxchg(lock, NULL, current))) {
rt_mutex_deadlock_account_lock(lock, current);
return 1;
}
- return slowfn(lock __RET_IP__);
+ return slowfn(lock);
}

static inline void
@@ -918,7 +915,7 @@ void rt_mutex_init_proxy_locked(struct r
struct task_struct *proxy_owner)
{
__rt_mutex_init(lock, NULL);
- debug_rt_mutex_proxy_lock(lock, proxy_owner __RET_IP__);
+ debug_rt_mutex_proxy_lock(lock, proxy_owner);
rt_mutex_set_owner(lock, proxy_owner, 0);
rt_mutex_deadlock_account_lock(lock, proxy_owner);
}
Index: linux/kernel/rtmutex.h
===================================================================
--- linux.orig/kernel/rtmutex.h
+++ linux/kernel/rtmutex.h
@@ -10,9 +10,6 @@
* Non-debug version.
*/

-#define __IP_DECL__
-#define __IP__
-#define __RET_IP__
#define rt_mutex_deadlock_check(l) (0)
#define rt_mutex_deadlock_account_lock(m, t) do { } while (0)
#define rt_mutex_deadlock_account_unlock(l) do { } while (0)
Index: linux/kernel/sched.c
===================================================================
--- linux.orig/kernel/sched.c
+++ linux/kernel/sched.c
@@ -30,6 +30,7 @@
#include <linux/capability.h>
#include <linux/completion.h>
#include <linux/kernel_stat.h>
+#include <linux/debug_locks.h>
#include <linux/security.h>
#include <linux/notifier.h>
#include <linux/profile.h>
@@ -3158,12 +3159,13 @@ void fastcall add_preempt_count(int val)
/*
* Underflow?
*/
- BUG_ON((preempt_count() < 0));
+ if (DEBUG_WARN_ON((preempt_count() < 0)))
+ return;
preempt_count() += val;
/*
* Spinlock count overflowing soon?
*/
- BUG_ON((preempt_count() & PREEMPT_MASK) >= PREEMPT_MASK-10);
+ DEBUG_WARN_ON((preempt_count() & PREEMPT_MASK) >= PREEMPT_MASK-10);
}
EXPORT_SYMBOL(add_preempt_count);

@@ -3172,11 +3174,15 @@ void fastcall sub_preempt_count(int val)
/*
* Underflow?
*/
- BUG_ON(val > preempt_count());
+ if (DEBUG_WARN_ON(val > preempt_count()))
+ return;
/*
* Is the spinlock portion underflowing?
*/
- BUG_ON((val < PREEMPT_MASK) && !(preempt_count() & PREEMPT_MASK));
+ if (DEBUG_WARN_ON((val < PREEMPT_MASK) &&
+ !(preempt_count() & PREEMPT_MASK)))
+ return;
+
preempt_count() -= val;
}
EXPORT_SYMBOL(sub_preempt_count);
@@ -4715,7 +4721,7 @@ void show_state(void)
} while_each_thread(g, p);

read_unlock(&tasklist_lock);
- mutex_debug_show_all_locks();
+ debug_show_all_locks();
}

/**
Index: linux/lib/Kconfig.debug
===================================================================
--- linux.orig/lib/Kconfig.debug
+++ linux/lib/Kconfig.debug
@@ -130,12 +130,30 @@ config DEBUG_PREEMPT
will detect preemption count underflows.

config DEBUG_MUTEXES
- bool "Mutex debugging, deadlock detection"
- default n
+ bool "Mutex debugging, basic checks"
+ default y
depends on DEBUG_KERNEL
help
- This allows mutex semantics violations and mutex related deadlocks
- (lockups) to be detected and reported automatically.
+ This feature allows mutex semantics violations to be detected and
+ reported.
+
+config DEBUG_MUTEX_ALLOC
+ bool "Detect incorrect freeing of live mutexes"
+ default y
+ depends on DEBUG_MUTEXES
+ help
+ This feature will check whether any held mutex is incorrectly
+ freed by the kernel, via any of the memory-freeing routines
+ (kfree(), kmem_cache_free(), free_pages(), vfree(), etc.),
+ or whether there is any lock held during task exit.
+
+config DEBUG_MUTEX_DEADLOCKS
+ bool "Detect mutex related deadlocks"
+ default y
+ depends on DEBUG_MUTEXES
+ help
+ This feature will automatically detect and report mutex related
+ deadlocks, as they happen.

config DEBUG_RT_MUTEXES
bool "RT Mutex debugging, deadlock detection"
Index: linux/lib/Makefile
===================================================================
--- linux.orig/lib/Makefile
+++ linux/lib/Makefile
@@ -11,7 +11,7 @@ lib-$(CONFIG_SMP) += cpumask.o

lib-y += kobject.o kref.o kobject_uevent.o klist.o

-obj-y += sort.o parser.o halfmd4.o iomap_copy.o
+obj-y += sort.o parser.o halfmd4.o iomap_copy.o debug_locks.o

ifeq ($(CONFIG_DEBUG_KOBJECT),y)
CFLAGS_kobject.o += -DDEBUG
Index: linux/lib/debug_locks.c
===================================================================
--- /dev/null
+++ linux/lib/debug_locks.c
@@ -0,0 +1,45 @@
+/*
+ * lib/debug_locks.c
+ *
+ * Generic place for common debugging facilities for various locks:
+ * spinlocks, rwlocks, mutexes and rwsems.
+ *
+ * Started by Ingo Molnar:
+ *
+ * Copyright (C) 2006 Red Hat, Inc., Ingo Molnar <[email protected]>
+ */
+#include <linux/rwsem.h>
+#include <linux/mutex.h>
+#include <linux/module.h>
+#include <linux/spinlock.h>
+#include <linux/debug_locks.h>
+
+/*
+ * We want to turn all lock-debugging facilities on/off at once,
+ * via a global flag. The reason is that once a single bug has been
+ * detected and reported, there might be a cascade of followup bugs
+ * that would just muddy the log. So we report the first one and
+ * shut up after that.
+ */
+int debug_locks = 1;
+
+/*
+ * The locking-testsuite uses <debug_locks_silent> to get a
+ * 'silent failure': nothing is printed to the console when
+ * a locking bug is detected.
+ */
+int debug_locks_silent;
+
+/*
+ * Generic 'turn off all lock debugging' function:
+ */
+int debug_locks_off(void)
+{
+ if (xchg(&debug_locks, 0)) {
+ if (!debug_locks_silent) {
+ console_verbose();
+ return 1;
+ }
+ }
+ return 0;
+}
Index: linux/lib/spinlock_debug.c
===================================================================
--- linux.orig/lib/spinlock_debug.c
+++ linux/lib/spinlock_debug.c
@@ -9,38 +9,35 @@
#include <linux/config.h>
#include <linux/spinlock.h>
#include <linux/interrupt.h>
+#include <linux/debug_locks.h>
#include <linux/delay.h>
+#include <linux/module.h>

static void spin_bug(spinlock_t *lock, const char *msg)
{
- static long print_once = 1;
struct task_struct *owner = NULL;

- if (xchg(&print_once, 0)) {
- if (lock->owner && lock->owner != SPINLOCK_OWNER_INIT)
- owner = lock->owner;
- printk(KERN_EMERG "BUG: spinlock %s on CPU#%d, %s/%d\n",
- msg, raw_smp_processor_id(),
- current->comm, current->pid);
- printk(KERN_EMERG " lock: %p, .magic: %08x, .owner: %s/%d, "
- ".owner_cpu: %d\n",
- lock, lock->magic,
- owner ? owner->comm : "<none>",
- owner ? owner->pid : -1,
- lock->owner_cpu);
- dump_stack();
-#ifdef CONFIG_SMP
- /*
- * We cannot continue on SMP:
- */
-// panic("bad locking");
-#endif
- }
+ if (!debug_locks_off())
+ return;
+
+ if (lock->owner && lock->owner != SPINLOCK_OWNER_INIT)
+ owner = lock->owner;
+ printk(KERN_EMERG "BUG: spinlock %s on CPU#%d, %s/%d\n",
+ msg, raw_smp_processor_id(),
+ current->comm, current->pid);
+ printk(KERN_EMERG " lock: %p, .magic: %08x, .owner: %s/%d, "
+ ".owner_cpu: %d\n",
+ lock, lock->magic,
+ owner ? owner->comm : "<none>",
+ owner ? owner->pid : -1,
+ lock->owner_cpu);
+ dump_stack();
}

#define SPIN_BUG_ON(cond, lock, msg) if (unlikely(cond)) spin_bug(lock, msg)

-static inline void debug_spin_lock_before(spinlock_t *lock)
+static inline void
+debug_spin_lock_before(spinlock_t *lock)
{
SPIN_BUG_ON(lock->magic != SPINLOCK_MAGIC, lock, "bad magic");
SPIN_BUG_ON(lock->owner == current, lock, "recursion");
@@ -119,20 +116,13 @@ void _raw_spin_unlock(spinlock_t *lock)

static void rwlock_bug(rwlock_t *lock, const char *msg)
{
- static long print_once = 1;
+ if (!debug_locks_off())
+ return;

- if (xchg(&print_once, 0)) {
- printk(KERN_EMERG "BUG: rwlock %s on CPU#%d, %s/%d, %p\n",
- msg, raw_smp_processor_id(), current->comm,
- current->pid, lock);
- dump_stack();
-#ifdef CONFIG_SMP
- /*
- * We cannot continue on SMP:
- */
- panic("bad locking");
-#endif
- }
+ printk(KERN_EMERG "BUG: rwlock %s on CPU#%d, %s/%d, %p\n",
+ msg, raw_smp_processor_id(), current->comm,
+ current->pid, lock);
+ dump_stack();
}

#define RWLOCK_BUG_ON(cond, lock, msg) if (unlikely(cond)) rwlock_bug(lock, msg)
Index: linux/mm/vmalloc.c
===================================================================
--- linux.orig/mm/vmalloc.c
+++ linux/mm/vmalloc.c
@@ -330,6 +330,8 @@ void __vunmap(void *addr, int deallocate
return;
}

+ debug_check_no_locks_freed(addr, area->size);
+
if (deallocate_pages) {
int i;

2006-05-29 21:23:41

by Ingo Molnar

[permalink] [raw]
Subject: [patch 04/61] lock validator: mutex section binutils workaround

From: Ingo Molnar <[email protected]>

work around a weird section nesting build bug causing smp-alternatives
failures under certain circumstances.

Signed-off-by: Ingo Molnar <[email protected]>
Signed-off-by: Arjan van de Ven <[email protected]>
---
kernel/mutex.c | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)

Index: linux/kernel/mutex.c
===================================================================
--- linux.orig/kernel/mutex.c
+++ linux/kernel/mutex.c
@@ -309,7 +309,7 @@ static inline int __mutex_trylock_slowpa
* This function must not be used in interrupt context. The
* mutex must be released by the same task that acquired it.
*/
-int fastcall mutex_trylock(struct mutex *lock)
+int fastcall __sched mutex_trylock(struct mutex *lock)
{
return __mutex_fastpath_trylock(&lock->count,
__mutex_trylock_slowpath);

2006-05-29 21:24:10

by Ingo Molnar

[permalink] [raw]
Subject: [patch 16/61] lock validator: fown locking workaround

From: Ingo Molnar <[email protected]>

temporary workaround for the lock validator: make all uses of
f_owner.lock irq-safe. (The real solution will be to express to
the lock validator that f_owner.lock rules are to be generated
per-filesystem.)

Signed-off-by: Ingo Molnar <[email protected]>
Signed-off-by: Arjan van de Ven <[email protected]>
---
fs/cifs/file.c | 18 +++++++++---------
fs/fcntl.c | 11 +++++++----
2 files changed, 16 insertions(+), 13 deletions(-)

Index: linux/fs/cifs/file.c
===================================================================
--- linux.orig/fs/cifs/file.c
+++ linux/fs/cifs/file.c
@@ -108,7 +108,7 @@ static inline int cifs_open_inode_helper
&pCifsInode->openFileList);
}
write_unlock(&GlobalSMBSeslock);
- write_unlock(&file->f_owner.lock);
+ write_unlock_irq(&file->f_owner.lock);
if (pCifsInode->clientCanCacheRead) {
/* we have the inode open somewhere else
no need to discard cache data */
@@ -280,7 +280,7 @@ int cifs_open(struct inode *inode, struc
goto out;
}
pCifsFile = cifs_init_private(file->private_data, inode, file, netfid);
- write_lock(&file->f_owner.lock);
+ write_lock_irq(&file->f_owner.lock);
write_lock(&GlobalSMBSeslock);
list_add(&pCifsFile->tlist, &pTcon->openFileList);

@@ -291,7 +291,7 @@ int cifs_open(struct inode *inode, struc
&oplock, buf, full_path, xid);
} else {
write_unlock(&GlobalSMBSeslock);
- write_unlock(&file->f_owner.lock);
+ write_unlock_irq(&file->f_owner.lock);
}

if (oplock & CIFS_CREATE_ACTION) {
@@ -470,7 +470,7 @@ int cifs_close(struct inode *inode, stru
pTcon = cifs_sb->tcon;
if (pSMBFile) {
pSMBFile->closePend = TRUE;
- write_lock(&file->f_owner.lock);
+ write_lock_irq(&file->f_owner.lock);
if (pTcon) {
/* no sense reconnecting to close a file that is
already closed */
@@ -485,23 +485,23 @@ int cifs_close(struct inode *inode, stru
the struct would be in each open file,
but this should give enough time to
clear the socket */
- write_unlock(&file->f_owner.lock);
+ write_unlock_irq(&file->f_owner.lock);
cERROR(1,("close with pending writes"));
msleep(timeout);
- write_lock(&file->f_owner.lock);
+ write_lock_irq(&file->f_owner.lock);
timeout *= 4;
}
- write_unlock(&file->f_owner.lock);
+ write_unlock_irq(&file->f_owner.lock);
rc = CIFSSMBClose(xid, pTcon,
pSMBFile->netfid);
- write_lock(&file->f_owner.lock);
+ write_lock_irq(&file->f_owner.lock);
}
}
write_lock(&GlobalSMBSeslock);
list_del(&pSMBFile->flist);
list_del(&pSMBFile->tlist);
write_unlock(&GlobalSMBSeslock);
- write_unlock(&file->f_owner.lock);
+ write_unlock_irq(&file->f_owner.lock);
kfree(pSMBFile->search_resume_name);
kfree(file->private_data);
file->private_data = NULL;
Index: linux/fs/fcntl.c
===================================================================
--- linux.orig/fs/fcntl.c
+++ linux/fs/fcntl.c
@@ -470,9 +470,10 @@ static void send_sigio_to_task(struct ta
void send_sigio(struct fown_struct *fown, int fd, int band)
{
struct task_struct *p;
+ unsigned long flags;
int pid;

- read_lock(&fown->lock);
+ read_lock_irqsave(&fown->lock, flags);
pid = fown->pid;
if (!pid)
goto out_unlock_fown;
@@ -490,7 +491,7 @@ void send_sigio(struct fown_struct *fown
}
read_unlock(&tasklist_lock);
out_unlock_fown:
- read_unlock(&fown->lock);
+ read_unlock_irqrestore(&fown->lock, flags);
}

static void send_sigurg_to_task(struct task_struct *p,
@@ -503,9 +504,10 @@ static void send_sigurg_to_task(struct t
int send_sigurg(struct fown_struct *fown)
{
struct task_struct *p;
+ unsigned long flags;
int pid, ret = 0;

- read_lock(&fown->lock);
+ read_lock_irqsave(&fown->lock, flags);
pid = fown->pid;
if (!pid)
goto out_unlock_fown;
@@ -525,7 +527,8 @@ int send_sigurg(struct fown_struct *fown
}
read_unlock(&tasklist_lock);
out_unlock_fown:
- read_unlock(&fown->lock);
+ read_unlock_irqrestore(&fown->lock, flags);
+
return ret;
}
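
The conversion above is mechanical: every section that takes
f_owner.lock switches from the plain read_lock()/read_unlock() pair to
the irq-saving variants, so the lock is never held with interrupts
enabled. The general shape of the pattern is sketched below (the
fown_example() wrapper is made up for illustration; only the lock calls
come from the patch):

static void fown_example(struct fown_struct *fown)
{
	unsigned long flags;

	read_lock_irqsave(&fown->lock, flags);	/* disables irqs, saves state */
	/* ... work that must not be interrupted while the lock is held ... */
	read_unlock_irqrestore(&fown->lock, flags);	/* restores irq state */
}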

2006-05-29 21:24:57

by Ingo Molnar

[permalink] [raw]
Subject: [patch 19/61] lock validator: irqtrace: cleanup: include/asm-i386/irqflags.h

From: Ingo Molnar <[email protected]>

clean up the x86 irqflags.h file:

- macro => inline function transformation
- simplifications
- style fixes

Signed-off-by: Ingo Molnar <[email protected]>
Signed-off-by: Arjan van de Ven <[email protected]>
---
include/asm-i386/irqflags.h | 95 ++++++++++++++++++++++++++++++++++++++------
1 file changed, 83 insertions(+), 12 deletions(-)

Index: linux/include/asm-i386/irqflags.h
===================================================================
--- linux.orig/include/asm-i386/irqflags.h
+++ linux/include/asm-i386/irqflags.h
@@ -5,24 +5,95 @@
*
* This file gets included from lowlevel asm headers too, to provide
* wrapped versions of the local_irq_*() APIs, based on the
- * raw_local_irq_*() macros from the lowlevel headers.
+ * raw_local_irq_*() functions from the lowlevel headers.
*/
#ifndef _ASM_IRQFLAGS_H
#define _ASM_IRQFLAGS_H

-#define raw_local_save_flags(x) do { typecheck(unsigned long,x); __asm__ __volatile__("pushfl ; popl %0":"=g" (x): /* no input */); } while (0)
-#define raw_local_irq_restore(x) do { typecheck(unsigned long,x); __asm__ __volatile__("pushl %0 ; popfl": /* no output */ :"g" (x):"memory", "cc"); } while (0)
-#define raw_local_irq_disable() __asm__ __volatile__("cli": : :"memory")
-#define raw_local_irq_enable() __asm__ __volatile__("sti": : :"memory")
-/* used in the idle loop; sti takes one instruction cycle to complete */
-#define raw_safe_halt() __asm__ __volatile__("sti; hlt": : :"memory")
-/* used when interrupts are already enabled or to shutdown the processor */
-#define halt() __asm__ __volatile__("hlt": : :"memory")
+#ifndef __ASSEMBLY__

-#define raw_irqs_disabled_flags(flags) (!((flags) & (1<<9)))
+static inline unsigned long __raw_local_save_flags(void)
+{
+ unsigned long flags;
+
+ __asm__ __volatile__(
+ "pushfl ; popl %0"
+ : "=g" (flags)
+ : /* no input */
+ );
+
+ return flags;
+}
+
+#define raw_local_save_flags(flags) \
+ do { (flags) = __raw_local_save_flags(); } while (0)
+
+static inline void raw_local_irq_restore(unsigned long flags)
+{
+ __asm__ __volatile__(
+ "pushl %0 ; popfl"
+ : /* no output */
+ :"g" (flags)
+ :"memory", "cc"
+ );
+}
+
+static inline void raw_local_irq_disable(void)
+{
+ __asm__ __volatile__("cli" : : : "memory");
+}
+
+static inline void raw_local_irq_enable(void)
+{
+ __asm__ __volatile__("sti" : : : "memory");
+}

-/* For spinlocks etc */
-#define raw_local_irq_save(x) __asm__ __volatile__("pushfl ; popl %0 ; cli":"=g" (x): /* no input */ :"memory")
+/*
+ * Used in the idle loop; sti takes one instruction cycle
+ * to complete:
+ */
+static inline void raw_safe_halt(void)
+{
+ __asm__ __volatile__("sti; hlt" : : : "memory");
+}
+
+/*
+ * Used when interrupts are already enabled or to
+ * shutdown the processor:
+ */
+static inline void halt(void)
+{
+ __asm__ __volatile__("hlt": : :"memory");
+}
+
+static inline int raw_irqs_disabled_flags(unsigned long flags)
+{
+ return !(flags & (1 << 9));
+}
+
+static inline int raw_irqs_disabled(void)
+{
+ unsigned long flags = __raw_local_save_flags();
+
+ return raw_irqs_disabled_flags(flags);
+}
+
+/*
+ * For spinlocks, etc:
+ */
+static inline unsigned long __raw_local_irq_save(void)
+{
+ unsigned long flags = __raw_local_save_flags();
+
+ raw_local_irq_disable();
+
+ return flags;
+}
+
+#define raw_local_irq_save(flags) \
+ do { (flags) = __raw_local_irq_save(); } while (0)
+
+#endif /* __ASSEMBLY__ */

/*
* Do the CPU's IRQ-state tracing from assembly code. We call a

2006-05-29 21:25:05

by Ingo Molnar

[permalink] [raw]
Subject: [patch 25/61] lock validator: design docs

From: Ingo Molnar <[email protected]>

lock validator design documentation.

Signed-off-by: Ingo Molnar <[email protected]>
Signed-off-by: Arjan van de Ven <[email protected]>
---
Documentation/lockdep-design.txt | 224 +++++++++++++++++++++++++++++++++++++++
1 file changed, 224 insertions(+)

Index: linux/Documentation/lockdep-design.txt
===================================================================
--- /dev/null
+++ linux/Documentation/lockdep-design.txt
@@ -0,0 +1,224 @@
+Runtime locking correctness validator
+=====================================
+
+started by Ingo Molnar <[email protected]>
+additions by Arjan van de Ven <[email protected]>
+
+Lock-type
+---------
+
+The basic object the validator operates upon is the 'type' or 'class' of
+locks.
+
+A class of locks is a group of locks that are logically the same with
+respect to locking rules, even if the locks may have multiple (possibly
+tens of thousands of) instantiations. For example a lock in the inode
+struct is one class, while each inode has its own instantiation of that
+lock class.
+
+The validator tracks the 'state' of lock-types, and it tracks
+dependencies between different lock-types. The validator maintains a
+rolling proof that the state and the dependencies are correct.
+
+Unlike a lock instantiation, the lock-type itself never goes away: when
+a lock-type is used for the first time after bootup it gets registered,
+and all subsequent uses of that lock-type will be attached to this
+lock-type.
+
+State
+-----
+
+The validator tracks lock-type usage history into 5 separate state bits:
+
+- 'ever held in hardirq context' [ == hardirq-safe ]
+- 'ever held in softirq context' [ == softirq-safe ]
+- 'ever held with hardirqs enabled' [ == hardirq-unsafe ]
+- 'ever held with softirqs and hardirqs enabled' [ == softirq-unsafe ]
+
+- 'ever used' [ == !unused ]
+
+Single-lock state rules:
+------------------------
+
+A softirq-unsafe lock-type is automatically hardirq-unsafe as well. The
+following states are exclusive, and only one of them is allowed to be
+set for any lock-type:
+
+ <hardirq-safe> and <hardirq-unsafe>
+ <softirq-safe> and <softirq-unsafe>
+
+The validator detects and reports lock usage that violates these
+single-lock state rules.
+
+Multi-lock dependency rules:
+----------------------------
+
+The same lock-type must not be acquired twice, because this could lead
+to lock recursion deadlocks.
+
+Furthermore, two locks may not be taken in different order:
+
+ <L1> -> <L2>
+ <L2> -> <L1>
+
+because this could lead to lock inversion deadlocks. (The validator
+finds such dependencies in arbitrary complexity, i.e. there can be any
+other locking sequence between the acquire-lock operations; the
+validator will still track all dependencies between locks.)
+
+Furthermore, the following usage-based lock dependencies are not allowed
+between any two lock-types:
+
+ <hardirq-safe> -> <hardirq-unsafe>
+ <softirq-safe> -> <softirq-unsafe>
+
+The first rule comes from the fact that a hardirq-safe lock could be
+taken by a hardirq context, interrupting a hardirq-unsafe lock - and
+thus could result in a lock inversion deadlock. Likewise, a softirq-safe
+lock could be taken by a softirq context, interrupting a softirq-unsafe
+lock.
+
+The above rules are enforced for any locking sequence that occurs in the
+kernel: when acquiring a new lock, the validator checks whether there is
+any rule violation between the new lock and any of the held locks.
+
+When a lock-type changes its state, the following aspects of the above
+dependency rules are enforced:
+
+- if a new hardirq-safe lock is discovered, we check whether it
+ took any hardirq-unsafe lock in the past.
+
+- if a new softirq-safe lock is discovered, we check whether it took
+ any softirq-unsafe lock in the past.
+
+- if a new hardirq-unsafe lock is discovered, we check whether any
+ hardirq-safe lock took it in the past.
+
+- if a new softirq-unsafe lock is discovered, we check whether any
+ softirq-safe lock took it in the past.
+
+(Again, we do these checks too on the basis that an interrupt context
+could interrupt _any_ of the irq-unsafe or hardirq-unsafe locks, which
+could lead to a lock inversion deadlock - even if that lock scenario did
+not trigger in practice yet.)
+
+Exception 1: Nested data types leading to nested locking
+--------------------------------------------------------
+
+There are a few cases where the Linux kernel acquires more than one
+instance of the same lock-type. Such cases typically happen when there
+is some sort of hierarchy within objects of the same type. In these
+cases there is an inherent "natural" ordering between the two objects
+(defined by the properties of the hierarchy), and the kernel grabs the
+locks in this fixed order on each of the objects.
+
+An example of such an object hierarchy that results in "nested locking"
+is that of a "whole disk" block-dev object and a "partition" block-dev
+object; the partition is "part of" the whole device and as long as one
+always takes the whole disk lock as a higher lock than the partition
+lock, the lock ordering is fully correct. The validator does not
+automatically detect this natural ordering, as the locking rule behind
+the ordering is not static.
+
+In order to teach the validator about this correct usage model, new
+versions of the various locking primitives were added that allow you to
+specify a "nesting level". An example call, for the block device mutex,
+looks like this:
+
+enum bdev_bd_mutex_lock_type
+{
+ BD_MUTEX_NORMAL,
+ BD_MUTEX_WHOLE,
+ BD_MUTEX_PARTITION
+};
+
+ mutex_lock_nested(&bdev->bd_contains->bd_mutex, BD_MUTEX_PARTITION);
+
+In this case the locking is done on a bdev object that is known to be a
+partition.
+
+The validator treats a lock that is taken in such a nested fashion as a
+separate (sub)class for the purposes of validation.
+
+Note: When changing code to use the _nested() primitives, be careful and
+check really thoroughly that the hierarchy is correctly mapped; otherwise
+you can get false positives or false negatives.
+
+Exception 2: Out of order unlocking
+-----------------------------------
+
+In the Linux kernel, locks are released in the opposite order in which
+they were taken, with a few exceptions. The validator is optimized for
+the common case, and in fact treats an "out of order" unlock as a
+locking bug. (the rationale is that the code is doing something rare,
+which can be a sign of a bug)
+
+There are some cases where releasing the locks out of order is
+unavoidable and dictated by the algorithm that is being implemented.
+Therefore, the validator can be told about this, using a special
+unlocking variant of the primitives. An example call looks like this:
+
+ spin_unlock_non_nested(&target->d_lock);
+
+Here the d_lock is released by the VFS in a different order than it was
+taken, as required by the d_move() algorithm.
+
+Note: the _non_nested() primitives are more expensive than the "normal"
+primitives, and in almost all cases it's trivial to use the natural
+unlock order. Unlocking in the natural order also has benefits outside
+the realm of the validator, so it's strongly suggested to keep the
+natural unlock order whenever reasonable, rather than blindly changing
+code to use the _non_nested() variants.
+
+Proof of 100% correctness:
+--------------------------
+
+The validator achieves perfect, mathematical 'closure' (proof of locking
+correctness) in the sense that for every simple, standalone single-task
+locking sequence that occurred at least once during the lifetime of the
+kernel, the validator proves with 100% certainty that no combination
+and timing of these locking sequences can cause any type of
+lock-related deadlock. [*]
+
+I.e. complex multi-CPU and multi-task locking scenarios do not have to
+occur in practice to prove a deadlock: only the simple 'component'
+locking chains have to occur at least once (anytime, in any
+task/context) for the validator to be able to prove correctness. (For
+example, complex deadlocks that would normally need more than 3 CPUs and
+a very unlikely constellation of tasks, irq-contexts and timings to
+occur, can be detected on a plain, lightly loaded single-CPU system as
+well!)
+
+This radically decreases the complexity of locking related QA of the
+kernel: what has to be done during QA is to trigger as many "simple"
+single-task locking dependencies in the kernel as possible, at least
+once, to prove locking correctness - instead of having to trigger every
+possible combination of locking interaction between CPUs, combined with
+every possible hardirq and softirq nesting scenario (which is impossible
+to do in practice).
+
+[*] assuming that the validator itself is 100% correct, and no other
+ part of the system corrupts the state of the validator in any way.
+ We also assume that all NMI/SMM paths [which could interrupt
+ even hardirq-disabled codepaths] are correct and do not interfere
+ with the validator. We also assume that the 64-bit 'chain hash'
+ value is unique for every lock-chain in the system. Also, lock
+ recursion must not be higher than 20.
+
+Performance:
+------------
+
+The above rules require _massive_ amounts of runtime checking. If we did
+that for every lock taken and for every irqs-enable event, it would
+render the system practically unusably slow. The complexity of checking
+is O(N^2), so even with just a few hundred lock-types we'd have to do
+tens of thousands of checks for every event.
+
+This problem is solved by checking any given 'locking scenario' (unique
+sequence of locks taken after each other) only once. A simple stack of
+held locks is maintained, and a lightweight 64-bit hash value is
+calculated; this hash is unique for every lock chain. When the chain is
+validated for the first time, the hash value is put into a hash table,
+which can be checked in a lockfree manner. If the locking chain occurs
+again later on, the hash table tells us that we don't have to validate
+the chain again.
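
To make the chain-hash caching concrete, here is a minimal user-space
sketch of the idea (the FNV-1a hash, the table size and all names below
are illustrative assumptions, not lockdep's actual implementation; in
particular, real collision handling is more involved):

#include <stdint.h>

#define CHAIN_HASH_BUCKETS	(1u << 14)

/* one remembered hash per bucket - a stand-in for the real hash table */
static uint64_t chain_hashes[CHAIN_HASH_BUCKETS];

/* hash the stack of held lock-class IDs into a 64-bit chain hash */
static uint64_t chain_hash(const int *held_ids, int depth)
{
	uint64_t h = 14695981039346656037ull;	/* FNV-1a offset basis */
	int i;

	for (i = 0; i < depth; i++) {
		h ^= (uint64_t)held_ids[i];
		h *= 1099511628211ull;		/* FNV-1a prime */
	}
	return h;
}

/* returns 1 if this chain was seen (and validated) before, 0 if new */
static int chain_seen_before(const int *held_ids, int depth)
{
	uint64_t h = chain_hash(held_ids, depth);
	uint32_t idx = (uint32_t)(h & (CHAIN_HASH_BUCKETS - 1));

	if (chain_hashes[idx] == h)
		return 1;		/* skip the expensive O(N^2) checks */
	chain_hashes[idx] = h;		/* remember the new chain... */
	return 0;			/* ...and run full validation now */
}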

2006-05-29 21:25:53

by Ingo Molnar

[permalink] [raw]
Subject: [patch 37/61] lock validator: special locking: dcache

From: Ingo Molnar <[email protected]>

teach special (recursive) locking code to the lock validator. Has no
effect on non-lockdep kernels.

Signed-off-by: Ingo Molnar <[email protected]>
Signed-off-by: Arjan van de Ven <[email protected]>
---
fs/dcache.c | 6 +++---
include/linux/dcache.h | 12 ++++++++++++
2 files changed, 15 insertions(+), 3 deletions(-)

Index: linux/fs/dcache.c
===================================================================
--- linux.orig/fs/dcache.c
+++ linux/fs/dcache.c
@@ -1380,10 +1380,10 @@ void d_move(struct dentry * dentry, stru
*/
if (target < dentry) {
spin_lock(&target->d_lock);
- spin_lock(&dentry->d_lock);
+ spin_lock_nested(&dentry->d_lock, DENTRY_D_LOCK_NESTED);
} else {
spin_lock(&dentry->d_lock);
- spin_lock(&target->d_lock);
+ spin_lock_nested(&target->d_lock, DENTRY_D_LOCK_NESTED);
}

/* Move the dentry to the target hash queue, if on different bucket */
@@ -1420,7 +1420,7 @@ already_unhashed:
}

list_add(&dentry->d_u.d_child, &dentry->d_parent->d_subdirs);
- spin_unlock(&target->d_lock);
+ spin_unlock_non_nested(&target->d_lock);
fsnotify_d_move(dentry);
spin_unlock(&dentry->d_lock);
write_sequnlock(&rename_lock);
Index: linux/include/linux/dcache.h
===================================================================
--- linux.orig/include/linux/dcache.h
+++ linux/include/linux/dcache.h
@@ -114,6 +114,18 @@ struct dentry {
unsigned char d_iname[DNAME_INLINE_LEN_MIN]; /* small names */
};

+/*
+ * dentry->d_lock spinlock nesting types:
+ *
+ * 0: normal
+ * 1: nested
+ */
+enum dentry_d_lock_type
+{
+ DENTRY_D_LOCK_NORMAL,
+ DENTRY_D_LOCK_NESTED
+};
+
struct dentry_operations {
int (*d_revalidate)(struct dentry *, struct nameidata *);
int (*d_hash) (struct dentry *, struct qstr *);
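
The d_move() hunk above is the canonical shape of the nesting
annotation: two locks of the same class are taken in a stable (address)
order, and the second acquisition is marked as nested so the validator
does not flag it as recursion. A sketch of the general pattern, reusing
the DENTRY_D_LOCK_NESTED level added above (the helper name is made up):

static void lock_two_dentries(struct dentry *a, struct dentry *b)
{
	/* impose a stable order so every path takes the locks the same way */
	if (a < b) {
		spin_lock(&a->d_lock);
		/* same class taken again - tell the validator it is nested */
		spin_lock_nested(&b->d_lock, DENTRY_D_LOCK_NESTED);
	} else {
		spin_lock(&b->d_lock);
		spin_lock_nested(&a->d_lock, DENTRY_D_LOCK_NESTED);
	}
}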

2006-05-29 21:26:00

by Ingo Molnar

[permalink] [raw]
Subject: [patch 34/61] lock validator: special locking: bdev

From: Ingo Molnar <[email protected]>

teach special (recursive) locking code to the lock validator. Has no
effect on non-lockdep kernels.

Signed-off-by: Ingo Molnar <[email protected]>
Signed-off-by: Arjan van de Ven <[email protected]>
---
drivers/md/md.c | 6 +--
fs/block_dev.c | 105 ++++++++++++++++++++++++++++++++++++++++++++++-------
include/linux/fs.h | 17 ++++++++
3 files changed, 112 insertions(+), 16 deletions(-)

Index: linux/drivers/md/md.c
===================================================================
--- linux.orig/drivers/md/md.c
+++ linux/drivers/md/md.c
@@ -1394,7 +1394,7 @@ static int lock_rdev(mdk_rdev_t *rdev, d
struct block_device *bdev;
char b[BDEVNAME_SIZE];

- bdev = open_by_devnum(dev, FMODE_READ|FMODE_WRITE);
+ bdev = open_partition_by_devnum(dev, FMODE_READ|FMODE_WRITE);
if (IS_ERR(bdev)) {
printk(KERN_ERR "md: could not open %s.\n",
__bdevname(dev, b));
@@ -1404,7 +1404,7 @@ static int lock_rdev(mdk_rdev_t *rdev, d
if (err) {
printk(KERN_ERR "md: could not bd_claim %s.\n",
bdevname(bdev, b));
- blkdev_put(bdev);
+ blkdev_put_partition(bdev);
return err;
}
rdev->bdev = bdev;
@@ -1418,7 +1418,7 @@ static void unlock_rdev(mdk_rdev_t *rdev
if (!bdev)
MD_BUG();
bd_release(bdev);
- blkdev_put(bdev);
+ blkdev_put_partition(bdev);
}

void md_autodetect_dev(dev_t dev);
Index: linux/fs/block_dev.c
===================================================================
--- linux.orig/fs/block_dev.c
+++ linux/fs/block_dev.c
@@ -746,7 +746,7 @@ static int bd_claim_by_kobject(struct bl
if (!bo)
return -ENOMEM;

- mutex_lock(&bdev->bd_mutex);
+ mutex_lock_nested(&bdev->bd_mutex, BD_MUTEX_PARTITION);
res = bd_claim(bdev, holder);
if (res || !add_bd_holder(bdev, bo))
free_bd_holder(bo);
@@ -771,7 +771,7 @@ static void bd_release_from_kobject(stru
if (!kobj)
return;

- mutex_lock(&bdev->bd_mutex);
+ mutex_lock_nested(&bdev->bd_mutex, BD_MUTEX_PARTITION);
bd_release(bdev);
if ((bo = del_bd_holder(bdev, kobj)))
free_bd_holder(bo);
@@ -829,6 +829,22 @@ struct block_device *open_by_devnum(dev_

EXPORT_SYMBOL(open_by_devnum);

+static int
+blkdev_get_partition(struct block_device *bdev, mode_t mode, unsigned flags);
+
+struct block_device *open_partition_by_devnum(dev_t dev, unsigned mode)
+{
+ struct block_device *bdev = bdget(dev);
+ int err = -ENOMEM;
+ int flags = mode & FMODE_WRITE ? O_RDWR : O_RDONLY;
+ if (bdev)
+ err = blkdev_get_partition(bdev, mode, flags);
+ return err ? ERR_PTR(err) : bdev;
+}
+
+EXPORT_SYMBOL(open_partition_by_devnum);
+
+
/*
* This routine checks whether a removable media has been changed,
* and invalidates all buffer-cache-entries in that case. This
@@ -875,7 +891,11 @@ void bd_set_size(struct block_device *bd
}
EXPORT_SYMBOL(bd_set_size);

-static int do_open(struct block_device *bdev, struct file *file)
+static int
+blkdev_get_whole(struct block_device *bdev, mode_t mode, unsigned flags);
+
+static int
+do_open(struct block_device *bdev, struct file *file, unsigned int subtype)
{
struct module *owner = NULL;
struct gendisk *disk;
@@ -892,7 +912,8 @@ static int do_open(struct block_device *
}
owner = disk->fops->owner;

- mutex_lock(&bdev->bd_mutex);
+ mutex_lock_nested(&bdev->bd_mutex, subtype);
+
if (!bdev->bd_openers) {
bdev->bd_disk = disk;
bdev->bd_contains = bdev;
@@ -917,13 +938,17 @@ static int do_open(struct block_device *
struct block_device *whole;
whole = bdget_disk(disk, 0);
ret = -ENOMEM;
+ /*
+ * We must not recurse deeper than 1:
+ */
+ WARN_ON(subtype != 0);
if (!whole)
goto out_first;
- ret = blkdev_get(whole, file->f_mode, file->f_flags);
+ ret = blkdev_get_whole(whole, file->f_mode, file->f_flags);
if (ret)
goto out_first;
bdev->bd_contains = whole;
- mutex_lock(&whole->bd_mutex);
+ mutex_lock_nested(&whole->bd_mutex, BD_MUTEX_WHOLE);
whole->bd_part_count++;
p = disk->part[part - 1];
bdev->bd_inode->i_data.backing_dev_info =
@@ -951,7 +976,8 @@ static int do_open(struct block_device *
if (bdev->bd_invalidated)
rescan_partitions(bdev->bd_disk, bdev);
} else {
- mutex_lock(&bdev->bd_contains->bd_mutex);
+ mutex_lock_nested(&bdev->bd_contains->bd_mutex,
+ BD_MUTEX_PARTITION);
bdev->bd_contains->bd_part_count++;
mutex_unlock(&bdev->bd_contains->bd_mutex);
}
@@ -992,11 +1018,49 @@ int blkdev_get(struct block_device *bdev
fake_file.f_dentry = &fake_dentry;
fake_dentry.d_inode = bdev->bd_inode;

- return do_open(bdev, &fake_file);
+ return do_open(bdev, &fake_file, BD_MUTEX_NORMAL);
}

EXPORT_SYMBOL(blkdev_get);

+static int
+blkdev_get_whole(struct block_device *bdev, mode_t mode, unsigned flags)
+{
+ /*
+ * This crockload is due to bad choice of ->open() type.
+ * It will go away.
+ * For now, block device ->open() routine must _not_
+ * examine anything in 'inode' argument except ->i_rdev.
+ */
+ struct file fake_file = {};
+ struct dentry fake_dentry = {};
+ fake_file.f_mode = mode;
+ fake_file.f_flags = flags;
+ fake_file.f_dentry = &fake_dentry;
+ fake_dentry.d_inode = bdev->bd_inode;
+
+ return do_open(bdev, &fake_file, BD_MUTEX_WHOLE);
+}
+
+static int
+blkdev_get_partition(struct block_device *bdev, mode_t mode, unsigned flags)
+{
+ /*
+ * This crockload is due to bad choice of ->open() type.
+ * It will go away.
+ * For now, block device ->open() routine must _not_
+ * examine anything in 'inode' argument except ->i_rdev.
+ */
+ struct file fake_file = {};
+ struct dentry fake_dentry = {};
+ fake_file.f_mode = mode;
+ fake_file.f_flags = flags;
+ fake_file.f_dentry = &fake_dentry;
+ fake_dentry.d_inode = bdev->bd_inode;
+
+ return do_open(bdev, &fake_file, BD_MUTEX_PARTITION);
+}
+
static int blkdev_open(struct inode * inode, struct file * filp)
{
struct block_device *bdev;
@@ -1012,7 +1076,7 @@ static int blkdev_open(struct inode * in

bdev = bd_acquire(inode);

- res = do_open(bdev, filp);
+ res = do_open(bdev, filp, BD_MUTEX_NORMAL);
if (res)
return res;

@@ -1026,13 +1090,13 @@ static int blkdev_open(struct inode * in
return res;
}

-int blkdev_put(struct block_device *bdev)
+static int __blkdev_put(struct block_device *bdev, unsigned int subtype)
{
int ret = 0;
struct inode *bd_inode = bdev->bd_inode;
struct gendisk *disk = bdev->bd_disk;

- mutex_lock(&bdev->bd_mutex);
+ mutex_lock_nested(&bdev->bd_mutex, subtype);
lock_kernel();
if (!--bdev->bd_openers) {
sync_blockdev(bdev);
@@ -1042,7 +1106,9 @@ int blkdev_put(struct block_device *bdev
if (disk->fops->release)
ret = disk->fops->release(bd_inode, NULL);
} else {
- mutex_lock(&bdev->bd_contains->bd_mutex);
+ WARN_ON(subtype != 0);
+ mutex_lock_nested(&bdev->bd_contains->bd_mutex,
+ BD_MUTEX_PARTITION);
bdev->bd_contains->bd_part_count--;
mutex_unlock(&bdev->bd_contains->bd_mutex);
}
@@ -1059,7 +1125,8 @@ int blkdev_put(struct block_device *bdev
bdev->bd_disk = NULL;
bdev->bd_inode->i_data.backing_dev_info = &default_backing_dev_info;
if (bdev != bdev->bd_contains) {
- blkdev_put(bdev->bd_contains);
+ WARN_ON(subtype != 0);
+ __blkdev_put(bdev->bd_contains, 1);
}
bdev->bd_contains = NULL;
}
@@ -1069,8 +1136,20 @@ int blkdev_put(struct block_device *bdev
return ret;
}

+int blkdev_put(struct block_device *bdev)
+{
+ return __blkdev_put(bdev, BD_MUTEX_NORMAL);
+}
+
EXPORT_SYMBOL(blkdev_put);

+int blkdev_put_partition(struct block_device *bdev)
+{
+ return __blkdev_put(bdev, BD_MUTEX_PARTITION);
+}
+
+EXPORT_SYMBOL(blkdev_put_partition);
+
static int blkdev_close(struct inode * inode, struct file * filp)
{
struct block_device *bdev = I_BDEV(filp->f_mapping->host);
Index: linux/include/linux/fs.h
===================================================================
--- linux.orig/include/linux/fs.h
+++ linux/include/linux/fs.h
@@ -436,6 +436,21 @@ struct block_device {
};

/*
+ * bdev->bd_mutex nesting types for the LOCKDEP validator:
+ *
+ * 0: normal
+ * 1: 'whole'
+ * 2: 'partition'
+ */
+enum bdev_bd_mutex_lock_type
+{
+ BD_MUTEX_NORMAL,
+ BD_MUTEX_WHOLE,
+ BD_MUTEX_PARTITION
+};
+
+
+/*
* Radix-tree tags, for tagging dirty and writeback pages within the pagecache
* radix trees
*/
@@ -1404,6 +1419,7 @@ extern void bd_set_size(struct block_dev
extern void bd_forget(struct inode *inode);
extern void bdput(struct block_device *);
extern struct block_device *open_by_devnum(dev_t, unsigned);
+extern struct block_device *open_partition_by_devnum(dev_t, unsigned);
extern const struct file_operations def_blk_fops;
extern const struct address_space_operations def_blk_aops;
extern const struct file_operations def_chr_fops;
@@ -1414,6 +1430,7 @@ extern int blkdev_ioctl(struct inode *,
extern long compat_blkdev_ioctl(struct file *, unsigned, unsigned long);
extern int blkdev_get(struct block_device *, mode_t, unsigned);
extern int blkdev_put(struct block_device *);
+extern int blkdev_put_partition(struct block_device *);
extern int bd_claim(struct block_device *, void *);
extern void bd_release(struct block_device *);
#ifdef CONFIG_SYSFS

2006-05-29 21:26:39

by Ingo Molnar

[permalink] [raw]
Subject: [patch 47/61] lock validator: special locking: skb_queue_head_init()

From: Ingo Molnar <[email protected]>

teach special (multi-initialized) locking code to the lock validator.
Has no effect on non-lockdep kernels.

Signed-off-by: Ingo Molnar <[email protected]>
Signed-off-by: Arjan van de Ven <[email protected]>
---
include/linux/skbuff.h | 7 +------
net/core/skbuff.c | 9 +++++++++
2 files changed, 10 insertions(+), 6 deletions(-)

Index: linux/include/linux/skbuff.h
===================================================================
--- linux.orig/include/linux/skbuff.h
+++ linux/include/linux/skbuff.h
@@ -584,12 +584,7 @@ static inline __u32 skb_queue_len(const
return list_->qlen;
}

-static inline void skb_queue_head_init(struct sk_buff_head *list)
-{
- spin_lock_init(&list->lock);
- list->prev = list->next = (struct sk_buff *)list;
- list->qlen = 0;
-}
+extern void skb_queue_head_init(struct sk_buff_head *list);

/*
* Insert an sk_buff at the start of a list.
Index: linux/net/core/skbuff.c
===================================================================
--- linux.orig/net/core/skbuff.c
+++ linux/net/core/skbuff.c
@@ -71,6 +71,15 @@
static kmem_cache_t *skbuff_head_cache __read_mostly;
static kmem_cache_t *skbuff_fclone_cache __read_mostly;

+void skb_queue_head_init(struct sk_buff_head *list)
+{
+ spin_lock_init(&list->lock);
+ list->prev = list->next = (struct sk_buff *)list;
+ list->qlen = 0;
+}
+
+EXPORT_SYMBOL(skb_queue_head_init);
+
/*
* Keep out-of-line to prevent kernel bloat.
* __builtin_return_address is not used because it is not always

2006-05-29 21:27:14

by Ingo Molnar

[permalink] [raw]
Subject: [patch 55/61] lock validator: special locking: sb->s_umount

From: Ingo Molnar <[email protected]>

workaround for special sb->s_umount locking rule.

s_umount gets held across a series of lock dropping and releasing
in prune_one_dentry(), so I changed the order, at the risk of
introducing an umount race. FIXME.

I think a better fix would be to do the unlocks as _non_nested() in
prune_one_dentry(), and to do the up_read() here as
an up_read_non_nested() as well?

Signed-off-by: Ingo Molnar <[email protected]>
Signed-off-by: Arjan van de Ven <[email protected]>
---
fs/dcache.c | 3 ++-
1 file changed, 2 insertions(+), 1 deletion(-)

Index: linux/fs/dcache.c
===================================================================
--- linux.orig/fs/dcache.c
+++ linux/fs/dcache.c
@@ -470,8 +470,9 @@ static void prune_dcache(int count, stru
s_umount = &dentry->d_sb->s_umount;
if (down_read_trylock(s_umount)) {
if (dentry->d_sb->s_root != NULL) {
- prune_one_dentry(dentry);
+// lockdep hack: do this better!
up_read(s_umount);
+ prune_one_dentry(dentry);
continue;
}
up_read(s_umount);

2006-05-29 21:26:46

by Ingo Molnar

[permalink] [raw]
Subject: [patch 45/61] lock validator: special locking: mm

From: Ingo Molnar <[email protected]>

teach special (recursive) locking code to the lock validator. Has no
effect on non-lockdep kernels.

Signed-off-by: Ingo Molnar <[email protected]>
Signed-off-by: Arjan van de Ven <[email protected]>
---
mm/memory.c | 2 +-
mm/mremap.c | 2 +-
2 files changed, 2 insertions(+), 2 deletions(-)

Index: linux/mm/memory.c
===================================================================
--- linux.orig/mm/memory.c
+++ linux/mm/memory.c
@@ -509,7 +509,7 @@ again:
return -ENOMEM;
src_pte = pte_offset_map_nested(src_pmd, addr);
src_ptl = pte_lockptr(src_mm, src_pmd);
- spin_lock(src_ptl);
+ spin_lock_nested(src_ptl, SINGLE_DEPTH_NESTING);

do {
/*
Index: linux/mm/mremap.c
===================================================================
--- linux.orig/mm/mremap.c
+++ linux/mm/mremap.c
@@ -97,7 +97,7 @@ static void move_ptes(struct vm_area_str
new_pte = pte_offset_map_nested(new_pmd, new_addr);
new_ptl = pte_lockptr(mm, new_pmd);
if (new_ptl != old_ptl)
- spin_lock(new_ptl);
+ spin_lock_nested(new_ptl, SINGLE_DEPTH_NESTING);

for (; old_addr < old_end; old_pte++, old_addr += PAGE_SIZE,
new_pte++, new_addr += PAGE_SIZE) {

2006-05-29 21:27:34

by Ingo Molnar

[permalink] [raw]
Subject: [patch 50/61] lock validator: special locking: hrtimer.c

From: Ingo Molnar <[email protected]>

teach special (recursive) locking code to the lock validator. Has no
effect on non-lockdep kernels.

Signed-off-by: Ingo Molnar <[email protected]>
Signed-off-by: Arjan van de Ven <[email protected]>
---
kernel/hrtimer.c | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)

Index: linux/kernel/hrtimer.c
===================================================================
--- linux.orig/kernel/hrtimer.c
+++ linux/kernel/hrtimer.c
@@ -786,7 +786,7 @@ static void __devinit init_hrtimers_cpu(
int i;

for (i = 0; i < MAX_HRTIMER_BASES; i++, base++)
- spin_lock_init(&base->lock);
+ spin_lock_init_static(&base->lock);
}

#ifdef CONFIG_HOTPLUG_CPU

2006-05-29 21:28:04

by Ingo Molnar

[permalink] [raw]
Subject: [patch 57/61] lock validator: special locking: posix-timers

From: Ingo Molnar <[email protected]>

teach special (non-nested) unlocking code to the lock validator. Has no
effect on non-lockdep kernels.

Signed-off-by: Ingo Molnar <[email protected]>
Signed-off-by: Arjan van de Ven <[email protected]>
---
kernel/posix-timers.c | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)

Index: linux/kernel/posix-timers.c
===================================================================
--- linux.orig/kernel/posix-timers.c
+++ linux/kernel/posix-timers.c
@@ -576,7 +576,7 @@ static struct k_itimer * lock_timer(time
timr = (struct k_itimer *) idr_find(&posix_timers_id, (int) timer_id);
if (timr) {
spin_lock(&timr->it_lock);
- spin_unlock(&idr_lock);
+ spin_unlock_non_nested(&idr_lock);

if ((timr->it_id != timer_id) || !(timr->it_process) ||
timr->it_process->tgid != current->tgid) {

2006-05-29 21:28:47

by Ingo Molnar

[permalink] [raw]
Subject: [patch 61/61] lock validator: enable lock validator in Kconfig

From: Ingo Molnar <[email protected]>

offer the following lock validation options:

CONFIG_PROVE_SPIN_LOCKING
CONFIG_PROVE_RW_LOCKING
CONFIG_PROVE_MUTEX_LOCKING
CONFIG_PROVE_RWSEM_LOCKING

Signed-off-by: Ingo Molnar <[email protected]>
Signed-off-by: Arjan van de Ven <[email protected]>
---
lib/Kconfig.debug | 167 ++++++++++++++++++++++++++++++++++++++++++++++++++++++
1 file changed, 167 insertions(+)

Index: linux/lib/Kconfig.debug
===================================================================
--- linux.orig/lib/Kconfig.debug
+++ linux/lib/Kconfig.debug
@@ -184,6 +184,173 @@ config DEBUG_SPINLOCK
best used in conjunction with the NMI watchdog so that spinlock
deadlocks are also debuggable.

+config PROVE_SPIN_LOCKING
+ bool "Prove spin-locking correctness"
+ default y
+ help
+ This feature enables the kernel to prove that all spinlock
+ locking that occurs in the kernel runtime is mathematically
+ correct: that under no circumstance could an arbitrary (and
+ not yet triggered) combination of observed spinlock locking
+ sequences (on an arbitrary number of CPUs, running an
+ arbitrary number of tasks and interrupt contexts) cause a
+ deadlock.
+
+ In short, this feature enables the kernel to report spinlock
+ deadlocks before they actually occur.
+
+ The proof does not depend on how hard and complex a
+ deadlock scenario would be to trigger: how many
+ participant CPUs, tasks and irq-contexts would be needed
+ for it to trigger. The proof also does not depend on
+ timing: if a race and a resulting deadlock is possible
+ theoretically (no matter how unlikely the race scenario
+ is), it will be proven so and will immediately be
+ reported by the kernel (once the event is observed that
+ makes the deadlock theoretically possible).
+
+ If a deadlock is impossible (i.e. the locking rules, as
+ observed by the kernel, are mathematically correct), the
+ kernel reports nothing.
+
+ NOTE: this feature can also be enabled for rwlocks, mutexes
+ and rwsems - in which case all dependencies between these
+ different locking variants are observed and mapped too, and
+ the proof of observed correctness is also maintained for an
+ arbitrary combination of these separate locking variants.
+
+ For more details, see Documentation/locking-correctness.txt.
+
+config PROVE_RW_LOCKING
+ bool "Prove rw-locking correctness"
+ default y
+ help
+ This feature enables the kernel to prove that all rwlock
+ locking that occurs in the kernel runtime is mathematically
+ correct: that under no circumstance could an arbitrary (and
+ not yet triggered) combination of observed rwlock locking
+ sequences (on an arbitrary number of CPUs, running an
+ arbitrary number of tasks and interrupt contexts) cause a
+ deadlock.
+
+ In short, this feature enables the kernel to report rwlock
+ deadlocks before they actually occur.
+
+ The proof does not depend on how hard and complex a
+ deadlock scenario would be to trigger: how many
+ participant CPUs, tasks and irq-contexts would be needed
+ for it to trigger. The proof also does not depend on
+ timing: if a race and a resulting deadlock is possible
+ theoretically (no matter how unlikely the race scenario
+ is), it will be proven so and will immediately be
+ reported by the kernel (once the event is observed that
+ makes the deadlock theoretically possible).
+
+ If a deadlock is impossible (i.e. the locking rules, as
+ observed by the kernel, are mathematically correct), the
+ kernel reports nothing.
+
+ NOTE: this feature can also be enabled for spinlocks, mutexes
+ and rwsems - in which case all dependencies between these
+ different locking variants are observed and mapped too, and
+ the proof of observed correctness is also maintained for an
+ arbitrary combination of these separate locking variants.
+
+ For more details, see Documentation/locking-correctness.txt.
+
+config PROVE_MUTEX_LOCKING
+ bool "Prove mutex-locking correctness"
+ default y
+ help
+ This feature enables the kernel to prove that all mutex
+ locking that occurs in the kernel runtime is mathematically
+ correct: that under no circumstance could an arbitrary (and
+ not yet triggered) combination of observed mutex locking
+ sequences (on an arbitrary number of CPUs, running an
+ arbitrary number of tasks and interrupt contexts) cause a
+ deadlock.
+
+ In short, this feature enables the kernel to report mutex
+ deadlocks before they actually occur.
+
+ The proof does not depend on how hard and complex a
+ deadlock scenario would be to trigger: how many
+ participant CPUs, tasks and irq-contexts would be needed
+ for it to trigger. The proof also does not depend on
+ timing: if a race and a resulting deadlock is possible
+ theoretically (no matter how unlikely the race scenario
+ is), it will be proven so and will immediately be
+ reported by the kernel (once the event is observed that
+ makes the deadlock theoretically possible).
+
+ If a deadlock is impossible (i.e. the locking rules, as
+ observed by the kernel, are mathematically correct), the
+ kernel reports nothing.
+
+ NOTE: this feature can also be enabled for spinlocks, rwlocks
+ and rwsems - in which case all dependencies between these
+ different locking variants are observed and mapped too, and
+ the proof of observed correctness is also maintained for an
+ arbitrary combination of these separate locking variants.
+
+ For more details, see Documentation/locking-correctness.txt.
+
+config PROVE_RWSEM_LOCKING
+ bool "Prove rwsem-locking correctness"
+ default y
+ help
+ This feature enables the kernel to prove that all rwsem
+ locking that occurs in the kernel runtime is mathematically
+ correct: that under no circumstance could an arbitrary (and
+ not yet triggered) combination of observed rwsem locking
+ sequences (on an arbitrary number of CPUs, running an
+ arbitrary number of tasks and interrupt contexts) cause a
+ deadlock.
+
+ In short, this feature enables the kernel to report rwsem
+ deadlocks before they actually occur.
+
+ The proof does not depend on how hard and complex a
+ deadlock scenario would be to trigger: how many
+ participant CPUs, tasks and irq-contexts would be needed
+ for it to trigger. The proof also does not depend on
+ timing: if a race and a resulting deadlock is possible
+ theoretically (no matter how unlikely the race scenario
+ is), it will be proven so and will immediately be
+ reported by the kernel (once the event is observed that
+ makes the deadlock theoretically possible).
+
+ If a deadlock is impossible (i.e. the locking rules, as
+ observed by the kernel, are mathematically correct), the
+ kernel reports nothing.
+
+ NOTE: this feature can also be enabled for spinlocks, rwlocks
+ and mutexes - in which case all dependencies between these
+ different locking variants are observed and mapped too, and
+ the proof of observed correctness is also maintained for an
+ arbitrary combination of these separate locking variants.
+
+ For more details, see Documentation/locking-correctness.txt.
+
+config LOCKDEP
+ bool
+ default y
+ depends on PROVE_SPIN_LOCKING || PROVE_RW_LOCKING || PROVE_MUTEX_LOCKING || PROVE_RWSEM_LOCKING
+
+config DEBUG_LOCKDEP
+ bool "Lock dependency engine debugging"
+ depends on LOCKDEP
+ default y
+ help
+ If you say Y here, the lock dependency engine will do
+ additional runtime checks to debug itself, at the price
+ of more runtime overhead.
+
+config TRACE_IRQFLAGS
+ bool
+ default y
+ depends on PROVE_SPIN_LOCKING || PROVE_RW_LOCKING
+
config DEBUG_SPINLOCK_SLEEP
bool "Sleep-inside-spinlock checking"
depends on DEBUG_KERNEL

2006-05-29 21:27:34

by Ingo Molnar

[permalink] [raw]
Subject: [patch 51/61] lock validator: special locking: sock_lock_init()

From: Ingo Molnar <[email protected]>

teach special (multi-initialized, per-address-family) locking code to the
lock validator. Has no effect on non-lockdep kernels.

Signed-off-by: Ingo Molnar <[email protected]>
Signed-off-by: Arjan van de Ven <[email protected]>
---
include/net/sock.h | 6 ------
net/core/sock.c | 27 +++++++++++++++++++++++----
2 files changed, 23 insertions(+), 10 deletions(-)

Index: linux/include/net/sock.h
===================================================================
--- linux.orig/include/net/sock.h
+++ linux/include/net/sock.h
@@ -81,12 +81,6 @@ typedef struct {
wait_queue_head_t wq;
} socket_lock_t;

-#define sock_lock_init(__sk) \
-do { spin_lock_init(&((__sk)->sk_lock.slock)); \
- (__sk)->sk_lock.owner = NULL; \
- init_waitqueue_head(&((__sk)->sk_lock.wq)); \
-} while(0)
-
struct sock;
struct proto;

Index: linux/net/core/sock.c
===================================================================
--- linux.orig/net/core/sock.c
+++ linux/net/core/sock.c
@@ -739,6 +739,27 @@ lenout:
return 0;
}

+/*
+ * Each address family might have different locking rules, so we have
+ * one slock key per address family:
+ */
+static struct lockdep_type_key af_family_keys[AF_MAX];
+
+static void noinline sock_lock_init(struct sock *sk)
+{
+ spin_lock_init_key(&sk->sk_lock.slock, af_family_keys + sk->sk_family);
+ sk->sk_lock.owner = NULL;
+ init_waitqueue_head(&sk->sk_lock.wq);
+}
+
+static struct lockdep_type_key af_callback_keys[AF_MAX];
+
+static void noinline sock_rwlock_init(struct sock *sk)
+{
+ rwlock_init(&sk->sk_dst_lock);
+ rwlock_init_key(&sk->sk_callback_lock, af_callback_keys + sk->sk_family);
+}
+
/**
* sk_alloc - All socket objects are allocated here
* @family: protocol family
@@ -833,8 +854,7 @@ struct sock *sk_clone(const struct sock
skb_queue_head_init(&newsk->sk_receive_queue);
skb_queue_head_init(&newsk->sk_write_queue);

- rwlock_init(&newsk->sk_dst_lock);
- rwlock_init(&newsk->sk_callback_lock);
+ sock_rwlock_init(newsk);

newsk->sk_dst_cache = NULL;
newsk->sk_wmem_queued = 0;
@@ -1404,8 +1424,7 @@ void sock_init_data(struct socket *sock,
} else
sk->sk_sleep = NULL;

- rwlock_init(&sk->sk_dst_lock);
- rwlock_init(&sk->sk_callback_lock);
+ sock_rwlock_init(sk);

sk->sk_state_change = sock_def_wakeup;
sk->sk_data_ready = sock_def_readable;

2006-05-29 21:28:05

by Ingo Molnar

[permalink] [raw]
Subject: [patch 59/61] lock validator: special locking: xfrm

From: Ingo Molnar <[email protected]>

teach special (non-nested) unlocking code to the lock validator. Has no
effect on non-lockdep kernels.
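
"non-nested" unlocking means a lock is released while a more recently
acquired lock is still held, i.e. not in perfect LIFO order; the annotation
tells the validator that this is intentional. Minimal sketch with two
hypothetical rwlocks, mirroring the xfrm code below:

    read_lock(&outer_lock);
    read_lock(&inner_lock);
    /* outer lock released first, out of nesting order: */
    read_unlock_non_nested(&outer_lock);
    ...
    read_unlock(&inner_lock);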

Signed-off-by: Ingo Molnar <[email protected]>
Signed-off-by: Arjan van de Ven <[email protected]>
---
net/xfrm/xfrm_policy.c | 2 +-
net/xfrm/xfrm_state.c | 2 +-
2 files changed, 2 insertions(+), 2 deletions(-)

Index: linux/net/xfrm/xfrm_policy.c
===================================================================
--- linux.orig/net/xfrm/xfrm_policy.c
+++ linux/net/xfrm/xfrm_policy.c
@@ -1308,7 +1308,7 @@ static struct xfrm_policy_afinfo *xfrm_p
afinfo = xfrm_policy_afinfo[family];
if (likely(afinfo != NULL))
read_lock(&afinfo->lock);
- read_unlock(&xfrm_policy_afinfo_lock);
+ read_unlock_non_nested(&xfrm_policy_afinfo_lock);
return afinfo;
}

Index: linux/net/xfrm/xfrm_state.c
===================================================================
--- linux.orig/net/xfrm/xfrm_state.c
+++ linux/net/xfrm/xfrm_state.c
@@ -1105,7 +1105,7 @@ static struct xfrm_state_afinfo *xfrm_st
afinfo = xfrm_state_afinfo[family];
if (likely(afinfo != NULL))
read_lock(&afinfo->lock);
- read_unlock(&xfrm_state_afinfo_lock);
+ read_unlock_non_nested(&xfrm_state_afinfo_lock);
return afinfo;
}

2006-05-29 21:28:05

by Ingo Molnar

[permalink] [raw]
Subject: [patch 58/61] lock validator: special locking: sch_generic.c

From: Ingo Molnar <[email protected]>

teach special (non-nested) unlocking code to the lock validator. Has no
effect on non-lockdep kernels.

Signed-off-by: Ingo Molnar <[email protected]>
Signed-off-by: Arjan van de Ven <[email protected]>
---
net/sched/sch_generic.c | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)

Index: linux/net/sched/sch_generic.c
===================================================================
--- linux.orig/net/sched/sch_generic.c
+++ linux/net/sched/sch_generic.c
@@ -132,7 +132,7 @@ int qdisc_restart(struct net_device *dev

{
/* And release queue */
- spin_unlock(&dev->queue_lock);
+ spin_unlock_non_nested(&dev->queue_lock);

if (!netif_queue_stopped(dev)) {
int ret;

2006-05-29 21:29:53

by Ingo Molnar

[permalink] [raw]
Subject: [patch 53/61] lock validator: special locking: bh_lock_sock()

From: Ingo Molnar <[email protected]>

teach special (recursive) locking code to the lock validator. Has no
effect on non-lockdep kernels.
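
bh_lock_sock_nested() marks the case where a second socket lock of the same
lock-type is taken while one is already held - without the annotation the
validator would report this as a potential recursive deadlock. Sketch, with
sk1 and sk2 being two hypothetical sockets:

    bh_lock_sock(sk1);
    bh_lock_sock_nested(sk2);    /* same lock-type, deliberate nesting */
    ...
    bh_unlock_sock(sk2);
    bh_unlock_sock(sk1);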

Signed-off-by: Ingo Molnar <[email protected]>
Signed-off-by: Arjan van de Ven <[email protected]>
---
include/net/sock.h | 3 +++
net/ipv4/tcp_ipv4.c | 2 +-
2 files changed, 4 insertions(+), 1 deletion(-)

Index: linux/include/net/sock.h
===================================================================
--- linux.orig/include/net/sock.h
+++ linux/include/net/sock.h
@@ -743,6 +743,9 @@ extern void FASTCALL(release_sock(struct

/* BH context may only use the following locking interface. */
#define bh_lock_sock(__sk) spin_lock(&((__sk)->sk_lock.slock))
+#define bh_lock_sock_nested(__sk) \
+ spin_lock_nested(&((__sk)->sk_lock.slock), \
+ SINGLE_DEPTH_NESTING)
#define bh_unlock_sock(__sk) spin_unlock(&((__sk)->sk_lock.slock))

extern struct sock *sk_alloc(int family,
Index: linux/net/ipv4/tcp_ipv4.c
===================================================================
--- linux.orig/net/ipv4/tcp_ipv4.c
+++ linux/net/ipv4/tcp_ipv4.c
@@ -1088,7 +1088,7 @@ process:

skb->dev = NULL;

- bh_lock_sock(sk);
+ bh_lock_sock_nested(sk);
ret = 0;
if (!sock_owned_by_user(sk)) {
if (!tcp_prequeue(sk, skb))

2006-05-29 21:29:52

by Ingo Molnar

[permalink] [raw]
Subject: [patch 52/61] lock validator: special locking: af_unix

From: Ingo Molnar <[email protected]>

teach special (recursive) locking code to the lock validator. Has no
effect on non-lockdep kernels.

(includes a workaround for sk_receive_queue.lock, which is currently
treated globally by the lock validator, but which will be switched to
per-address-family locking rules.)

Signed-off-by: Ingo Molnar <[email protected]>
Signed-off-by: Arjan van de Ven <[email protected]>
---
include/net/af_unix.h | 3 +++
net/unix/af_unix.c | 10 +++++-----
net/unix/garbage.c | 8 ++++----
3 files changed, 12 insertions(+), 9 deletions(-)

Index: linux/include/net/af_unix.h
===================================================================
--- linux.orig/include/net/af_unix.h
+++ linux/include/net/af_unix.h
@@ -61,6 +61,9 @@ struct unix_skb_parms {
#define unix_state_rlock(s) spin_lock(&unix_sk(s)->lock)
#define unix_state_runlock(s) spin_unlock(&unix_sk(s)->lock)
#define unix_state_wlock(s) spin_lock(&unix_sk(s)->lock)
+#define unix_state_wlock_nested(s) \
+ spin_lock_nested(&unix_sk(s)->lock, \
+ SINGLE_DEPTH_NESTING)
#define unix_state_wunlock(s) spin_unlock(&unix_sk(s)->lock)

#ifdef __KERNEL__
Index: linux/net/unix/af_unix.c
===================================================================
--- linux.orig/net/unix/af_unix.c
+++ linux/net/unix/af_unix.c
@@ -1022,7 +1022,7 @@ restart:
goto out_unlock;
}

- unix_state_wlock(sk);
+ unix_state_wlock_nested(sk);

if (sk->sk_state != st) {
unix_state_wunlock(sk);
@@ -1073,12 +1073,12 @@ restart:
unix_state_wunlock(sk);

/* take ten and and send info to listening sock */
- spin_lock(&other->sk_receive_queue.lock);
+ spin_lock_bh(&other->sk_receive_queue.lock);
__skb_queue_tail(&other->sk_receive_queue, skb);
/* Undo artificially decreased inflight after embrion
* is installed to listening socket. */
atomic_inc(&newu->inflight);
- spin_unlock(&other->sk_receive_queue.lock);
+ spin_unlock_bh(&other->sk_receive_queue.lock);
unix_state_runlock(other);
other->sk_data_ready(other, 0);
sock_put(other);
@@ -1843,7 +1843,7 @@ static int unix_ioctl(struct socket *soc
break;
}

- spin_lock(&sk->sk_receive_queue.lock);
+ spin_lock_bh(&sk->sk_receive_queue.lock);
if (sk->sk_type == SOCK_STREAM ||
sk->sk_type == SOCK_SEQPACKET) {
skb_queue_walk(&sk->sk_receive_queue, skb)
@@ -1853,7 +1853,7 @@ static int unix_ioctl(struct socket *soc
if (skb)
amount=skb->len;
}
- spin_unlock(&sk->sk_receive_queue.lock);
+ spin_unlock_bh(&sk->sk_receive_queue.lock);
err = put_user(amount, (int __user *)arg);
break;
}
Index: linux/net/unix/garbage.c
===================================================================
--- linux.orig/net/unix/garbage.c
+++ linux/net/unix/garbage.c
@@ -235,7 +235,7 @@ void unix_gc(void)
struct sock *x = pop_stack();
struct sock *sk;

- spin_lock(&x->sk_receive_queue.lock);
+ spin_lock_bh(&x->sk_receive_queue.lock);
skb = skb_peek(&x->sk_receive_queue);

/*
@@ -270,7 +270,7 @@ void unix_gc(void)
maybe_unmark_and_push(skb->sk);
skb=skb->next;
}
- spin_unlock(&x->sk_receive_queue.lock);
+ spin_unlock_bh(&x->sk_receive_queue.lock);
sock_put(x);
}

@@ -283,7 +283,7 @@ void unix_gc(void)
if (u->gc_tree == GC_ORPHAN) {
struct sk_buff *nextsk;

- spin_lock(&s->sk_receive_queue.lock);
+ spin_lock_bh(&s->sk_receive_queue.lock);
skb = skb_peek(&s->sk_receive_queue);
while (skb &&
skb != (struct sk_buff *)&s->sk_receive_queue) {
@@ -298,7 +298,7 @@ void unix_gc(void)
}
skb = nextsk;
}
- spin_unlock(&s->sk_receive_queue.lock);
+ spin_unlock_bh(&s->sk_receive_queue.lock);
}
u->gc_tree = GC_ORPHAN;
}

2006-05-29 21:29:55

by Ingo Molnar

[permalink] [raw]
Subject: [patch 60/61] lock validator: special locking: sound/core/seq/seq_ports.c

From: Ingo Molnar <[email protected]>

teach special (recursive) locking code to the lock validator. Has no
effect on non-lockdep kernels.

Signed-off-by: Ingo Molnar <[email protected]>
Signed-off-by: Arjan van de Ven <[email protected]>
---
sound/core/seq/seq_ports.c | 4 ++--
1 file changed, 2 insertions(+), 2 deletions(-)

Index: linux/sound/core/seq/seq_ports.c
===================================================================
--- linux.orig/sound/core/seq/seq_ports.c
+++ linux/sound/core/seq/seq_ports.c
@@ -518,7 +518,7 @@ int snd_seq_port_connect(struct snd_seq_
atomic_set(&subs->ref_count, 2);

down_write(&src->list_mutex);
- down_write(&dest->list_mutex);
+ down_write_nested(&dest->list_mutex, SINGLE_DEPTH_NESTING);

exclusive = info->flags & SNDRV_SEQ_PORT_SUBS_EXCLUSIVE ? 1 : 0;
err = -EBUSY;
@@ -591,7 +591,7 @@ int snd_seq_port_disconnect(struct snd_s
unsigned long flags;

down_write(&src->list_mutex);
- down_write(&dest->list_mutex);
+ down_write_nested(&dest->list_mutex, SINGLE_DEPTH_NESTING);

/* look for the connection */
list_for_each(p, &src->list_head) {

2006-05-29 21:31:18

by Ingo Molnar

[permalink] [raw]
Subject: [patch 54/61] lock validator: special locking: mmap_sem

From: Ingo Molnar <[email protected]>

teach special (recursive) locking code to the lock validator. Has no
effect on non-lockdep kernels.

Signed-off-by: Ingo Molnar <[email protected]>
Signed-off-by: Arjan van de Ven <[email protected]>
---
kernel/exit.c | 2 +-
kernel/fork.c | 5 ++++-
2 files changed, 5 insertions(+), 2 deletions(-)

Index: linux/kernel/exit.c
===================================================================
--- linux.orig/kernel/exit.c
+++ linux/kernel/exit.c
@@ -582,7 +582,7 @@ static void exit_mm(struct task_struct *
/* more a memory barrier than a real lock */
task_lock(tsk);
tsk->mm = NULL;
- up_read(&mm->mmap_sem);
+ up_read_non_nested(&mm->mmap_sem);
enter_lazy_tlb(mm, current);
task_unlock(tsk);
mmput(mm);
Index: linux/kernel/fork.c
===================================================================
--- linux.orig/kernel/fork.c
+++ linux/kernel/fork.c
@@ -196,7 +196,10 @@ static inline int dup_mmap(struct mm_str

down_write(&oldmm->mmap_sem);
flush_cache_mm(oldmm);
- down_write(&mm->mmap_sem);
+ /*
+ * Not linked in yet - no deadlock potential:
+ */
+ down_write_nested(&mm->mmap_sem, 1);

mm->locked_vm = 0;
mm->mmap = NULL;

2006-05-29 21:31:18

by Ingo Molnar

[permalink] [raw]
Subject: [patch 49/61] lock validator: special locking: sched.c

From: Ingo Molnar <[email protected]>

teach special (recursive) locking code to the lock validator. Has no
effect on non-lockdep kernels.

Signed-off-by: Ingo Molnar <[email protected]>
Signed-off-by: Arjan van de Ven <[email protected]>
---
kernel/sched.c | 16 ++++++++--------
1 file changed, 8 insertions(+), 8 deletions(-)

Index: linux/kernel/sched.c
===================================================================
--- linux.orig/kernel/sched.c
+++ linux/kernel/sched.c
@@ -1963,7 +1963,7 @@ static void double_rq_unlock(runqueue_t
__releases(rq1->lock)
__releases(rq2->lock)
{
- spin_unlock(&rq1->lock);
+ spin_unlock_non_nested(&rq1->lock);
if (rq1 != rq2)
spin_unlock(&rq2->lock);
else
@@ -1980,7 +1980,7 @@ static void double_lock_balance(runqueue
{
if (unlikely(!spin_trylock(&busiest->lock))) {
if (busiest->cpu < this_rq->cpu) {
- spin_unlock(&this_rq->lock);
+ spin_unlock_non_nested(&this_rq->lock);
spin_lock(&busiest->lock);
spin_lock(&this_rq->lock);
} else
@@ -2602,7 +2602,7 @@ static int load_balance_newidle(int this
nr_moved = move_tasks(this_rq, this_cpu, busiest,
minus_1_or_zero(busiest->nr_running),
imbalance, sd, NEWLY_IDLE, NULL);
- spin_unlock(&busiest->lock);
+ spin_unlock_non_nested(&busiest->lock);
}

if (!nr_moved) {
@@ -2687,7 +2687,7 @@ static void active_load_balance(runqueue
else
schedstat_inc(sd, alb_failed);
out:
- spin_unlock(&target_rq->lock);
+ spin_unlock_non_nested(&target_rq->lock);
}

/*
@@ -3032,7 +3032,7 @@ static void wake_sleeping_dependent(int
}

for_each_cpu_mask(i, sibling_map)
- spin_unlock(&cpu_rq(i)->lock);
+ spin_unlock_non_nested(&cpu_rq(i)->lock);
/*
* We exit with this_cpu's rq still held and IRQs
* still disabled:
@@ -3068,7 +3068,7 @@ static int dependent_sleeper(int this_cp
* The same locking rules and details apply as for
* wake_sleeping_dependent():
*/
- spin_unlock(&this_rq->lock);
+ spin_unlock_non_nested(&this_rq->lock);
sibling_map = sd->span;
for_each_cpu_mask(i, sibling_map)
spin_lock(&cpu_rq(i)->lock);
@@ -3146,7 +3146,7 @@ check_smt_task:
}
out_unlock:
for_each_cpu_mask(i, sibling_map)
- spin_unlock(&cpu_rq(i)->lock);
+ spin_unlock_non_nested(&cpu_rq(i)->lock);
return ret;
}
#else
@@ -6680,7 +6680,7 @@ void __init sched_init(void)
prio_array_t *array;

rq = cpu_rq(i);
- spin_lock_init(&rq->lock);
+ spin_lock_init_static(&rq->lock);
rq->nr_running = 0;
rq->active = rq->arrays;
rq->expired = rq->arrays + 1;

2006-05-29 21:31:27

by Ingo Molnar

[permalink] [raw]
Subject: [patch 56/61] lock validator: special locking: jbd

From: Ingo Molnar <[email protected]>

teach special (non-nested) unlocking code to the lock validator. Has no
effect on non-lockdep kernels.

Signed-off-by: Ingo Molnar <[email protected]>
Signed-off-by: Arjan van de Ven <[email protected]>
---
fs/jbd/checkpoint.c | 2 +-
fs/jbd/commit.c | 2 +-
2 files changed, 2 insertions(+), 2 deletions(-)

Index: linux/fs/jbd/checkpoint.c
===================================================================
--- linux.orig/fs/jbd/checkpoint.c
+++ linux/fs/jbd/checkpoint.c
@@ -135,7 +135,7 @@ void __log_wait_for_space(journal_t *jou
log_do_checkpoint(journal);
spin_lock(&journal->j_state_lock);
}
- mutex_unlock(&journal->j_checkpoint_mutex);
+ mutex_unlock_non_nested(&journal->j_checkpoint_mutex);
}
}

Index: linux/fs/jbd/commit.c
===================================================================
--- linux.orig/fs/jbd/commit.c
+++ linux/fs/jbd/commit.c
@@ -838,7 +838,7 @@ restart_loop:
J_ASSERT(commit_transaction == journal->j_committing_transaction);
journal->j_commit_sequence = commit_transaction->t_tid;
journal->j_committing_transaction = NULL;
- spin_unlock(&journal->j_state_lock);
+ spin_unlock_non_nested(&journal->j_state_lock);

if (commit_transaction->t_checkpoint_list == NULL) {
__journal_drop_transaction(journal, commit_transaction);

2006-05-29 21:32:52

by Ingo Molnar

[permalink] [raw]
Subject: [patch 44/61] lock validator: special locking: waitqueues

From: Ingo Molnar <[email protected]>

map special (multi-initialized) locking code to the lock validator.
Has no effect on non-lockdep kernels.

Signed-off-by: Ingo Molnar <[email protected]>
Signed-off-by: Arjan van de Ven <[email protected]>
---
include/linux/wait.h | 11 +++++++++--
kernel/wait.c | 9 +++++++++
2 files changed, 18 insertions(+), 2 deletions(-)

Index: linux/include/linux/wait.h
===================================================================
--- linux.orig/include/linux/wait.h
+++ linux/include/linux/wait.h
@@ -77,12 +77,19 @@ struct task_struct;
#define __WAIT_BIT_KEY_INITIALIZER(word, bit) \
{ .flags = word, .bit_nr = bit, }

-static inline void init_waitqueue_head(wait_queue_head_t *q)
+/*
+ * lockdep: we want one lock-type for all waitqueue locks.
+ */
+extern struct lockdep_type_key waitqueue_lock_key;
+
+static inline void __init_waitqueue_head(wait_queue_head_t *q)
{
- spin_lock_init(&q->lock);
+ spin_lock_init_key(&q->lock, &waitqueue_lock_key);
INIT_LIST_HEAD(&q->task_list);
}

+extern void init_waitqueue_head(wait_queue_head_t *q);
+
static inline void init_waitqueue_entry(wait_queue_t *q, struct task_struct *p)
{
q->flags = 0;
Index: linux/kernel/wait.c
===================================================================
--- linux.orig/kernel/wait.c
+++ linux/kernel/wait.c
@@ -11,6 +11,15 @@
#include <linux/wait.h>
#include <linux/hash.h>

+struct lockdep_type_key waitqueue_lock_key;
+
+void init_waitqueue_head(wait_queue_head_t *q)
+{
+ __init_waitqueue_head(q);
+}
+
+EXPORT_SYMBOL(init_waitqueue_head);
+
void fastcall add_wait_queue(wait_queue_head_t *q, wait_queue_t *wait)
{
unsigned long flags;

2006-05-29 21:32:53

by Ingo Molnar

[permalink] [raw]
Subject: [patch 46/61] lock validator: special locking: slab

From: Ingo Molnar <[email protected]>

teach special (recursive) locking code to the lock validator. Has no
effect on non-lockdep kernels.

fix initialize-locks-via-memcpy assumptions.

Signed-off-by: Ingo Molnar <[email protected]>
Signed-off-by: Arjan van de Ven <[email protected]>
---
mm/slab.c | 59 ++++++++++++++++++++++++++++++++++++++++++++++++-----------
1 file changed, 48 insertions(+), 11 deletions(-)

Index: linux/mm/slab.c
===================================================================
--- linux.orig/mm/slab.c
+++ linux/mm/slab.c
@@ -1026,7 +1026,8 @@ static void drain_alien_cache(struct kme
}
}

-static inline int cache_free_alien(struct kmem_cache *cachep, void *objp)
+static inline int cache_free_alien(struct kmem_cache *cachep, void *objp,
+ int nesting)
{
struct slab *slabp = virt_to_slab(objp);
int nodeid = slabp->nodeid;
@@ -1044,7 +1045,7 @@ static inline int cache_free_alien(struc
STATS_INC_NODEFREES(cachep);
if (l3->alien && l3->alien[nodeid]) {
alien = l3->alien[nodeid];
- spin_lock(&alien->lock);
+ spin_lock_nested(&alien->lock, nesting);
if (unlikely(alien->avail == alien->limit)) {
STATS_INC_ACOVERFLOW(cachep);
__drain_alien_cache(cachep, alien, nodeid);
@@ -1073,7 +1074,8 @@ static inline void free_alien_cache(stru
{
}

-static inline int cache_free_alien(struct kmem_cache *cachep, void *objp)
+static inline int cache_free_alien(struct kmem_cache *cachep, void *objp,
+ int nesting)
{
return 0;
}
@@ -1278,6 +1280,11 @@ static void init_list(struct kmem_cache

local_irq_disable();
memcpy(ptr, list, sizeof(struct kmem_list3));
+ /*
+ * Do not assume that spinlocks can be initialized via memcpy:
+ */
+ spin_lock_init(&ptr->list_lock);
+
MAKE_ALL_LISTS(cachep, ptr, nodeid);
cachep->nodelists[nodeid] = ptr;
local_irq_enable();
@@ -1408,7 +1415,7 @@ void __init kmem_cache_init(void)
}
/* 4) Replace the bootstrap head arrays */
{
- void *ptr;
+ struct array_cache *ptr;

ptr = kmalloc(sizeof(struct arraycache_init), GFP_KERNEL);

@@ -1416,6 +1423,11 @@ void __init kmem_cache_init(void)
BUG_ON(cpu_cache_get(&cache_cache) != &initarray_cache.cache);
memcpy(ptr, cpu_cache_get(&cache_cache),
sizeof(struct arraycache_init));
+ /*
+ * Do not assume that spinlocks can be initialized via memcpy:
+ */
+ spin_lock_init(&ptr->lock);
+
cache_cache.array[smp_processor_id()] = ptr;
local_irq_enable();

@@ -1426,6 +1438,11 @@ void __init kmem_cache_init(void)
!= &initarray_generic.cache);
memcpy(ptr, cpu_cache_get(malloc_sizes[INDEX_AC].cs_cachep),
sizeof(struct arraycache_init));
+ /*
+ * Do not assume that spinlocks can be initialized via memcpy:
+ */
+ spin_lock_init(&ptr->lock);
+
malloc_sizes[INDEX_AC].cs_cachep->array[smp_processor_id()] =
ptr;
local_irq_enable();
@@ -1753,6 +1770,8 @@ static void slab_destroy_objs(struct kme
}
#endif

+static void __cache_free(struct kmem_cache *cachep, void *objp, int nesting);
+
/**
* slab_destroy - destroy and release all objects in a slab
* @cachep: cache pointer being destroyed
@@ -1776,8 +1795,17 @@ static void slab_destroy(struct kmem_cac
call_rcu(&slab_rcu->head, kmem_rcu_free);
} else {
kmem_freepages(cachep, addr);
- if (OFF_SLAB(cachep))
- kmem_cache_free(cachep->slabp_cache, slabp);
+ if (OFF_SLAB(cachep)) {
+ unsigned long flags;
+
+ /*
+ * lockdep: we may nest inside an already held
+ * ac->lock, so pass in a nesting flag:
+ */
+ local_irq_save(flags);
+ __cache_free(cachep->slabp_cache, slabp, 1);
+ local_irq_restore(flags);
+ }
}
}

@@ -3062,7 +3090,16 @@ static void free_block(struct kmem_cache
if (slabp->inuse == 0) {
if (l3->free_objects > l3->free_limit) {
l3->free_objects -= cachep->num;
+ /*
+ * It is safe to drop the lock. The slab is
+ * no longer linked to the cache. cachep
+ * cannot disappear - we are using it and
+ * all destruction of caches must be
+ * serialized properly by the user.
+ */
+ spin_unlock(&l3->list_lock);
slab_destroy(cachep, slabp);
+ spin_lock(&l3->list_lock);
} else {
list_add(&slabp->list, &l3->slabs_free);
}
@@ -3088,7 +3125,7 @@ static void cache_flusharray(struct kmem
#endif
check_irq_off();
l3 = cachep->nodelists[node];
- spin_lock(&l3->list_lock);
+ spin_lock_nested(&l3->list_lock, SINGLE_DEPTH_NESTING);
if (l3->shared) {
struct array_cache *shared_array = l3->shared;
int max = shared_array->limit - shared_array->avail;
@@ -3131,14 +3168,14 @@ free_done:
* Release an obj back to its cache. If the obj has a constructed state, it must
* be in this state _before_ it is released. Called with disabled ints.
*/
-static inline void __cache_free(struct kmem_cache *cachep, void *objp)
+static void __cache_free(struct kmem_cache *cachep, void *objp, int nesting)
{
struct array_cache *ac = cpu_cache_get(cachep);

check_irq_off();
objp = cache_free_debugcheck(cachep, objp, __builtin_return_address(0));

- if (cache_free_alien(cachep, objp))
+ if (cache_free_alien(cachep, objp, nesting))
return;

if (likely(ac->avail < ac->limit)) {
@@ -3393,7 +3430,7 @@ void kmem_cache_free(struct kmem_cache *
BUG_ON(virt_to_cache(objp) != cachep);

local_irq_save(flags);
- __cache_free(cachep, objp);
+ __cache_free(cachep, objp, 0);
local_irq_restore(flags);
}
EXPORT_SYMBOL(kmem_cache_free);
@@ -3418,7 +3455,7 @@ void kfree(const void *objp)
kfree_debugcheck(objp);
c = virt_to_cache(objp);
debug_check_no_locks_freed(objp, obj_size(c));
- __cache_free(c, (void *)objp);
+ __cache_free(c, (void *)objp, 0);
local_irq_restore(flags);
}
EXPORT_SYMBOL(kfree);

2006-05-29 21:32:53

by Ingo Molnar

[permalink] [raw]
Subject: [patch 48/61] lock validator: special locking: timer.c

From: Ingo Molnar <[email protected]>

teach special (recursive) locking code to the lock validator. Has no
effect on non-lockdep kernels.

Signed-off-by: Ingo Molnar <[email protected]>
Signed-off-by: Arjan van de Ven <[email protected]>
---
kernel/timer.c | 9 ++++++++-
1 file changed, 8 insertions(+), 1 deletion(-)

Index: linux/kernel/timer.c
===================================================================
--- linux.orig/kernel/timer.c
+++ linux/kernel/timer.c
@@ -1496,6 +1496,13 @@ asmlinkage long sys_sysinfo(struct sysin
return 0;
}

+/*
+ * lockdep: we want to track each per-CPU base as a separate lock-type,
+ * but timer-bases are kmalloc()-ed, so we need to attach separate
+ * keys to them:
+ */
+static struct lockdep_type_key base_lock_keys[NR_CPUS];
+
static int __devinit init_timers_cpu(int cpu)
{
int j;
@@ -1530,7 +1537,7 @@ static int __devinit init_timers_cpu(int
base = per_cpu(tvec_bases, cpu);
}

- spin_lock_init(&base->lock);
+ spin_lock_init_key(&base->lock, base_lock_keys + cpu);
for (j = 0; j < TVN_SIZE; j++) {
INIT_LIST_HEAD(base->tv5.vec + j);
INIT_LIST_HEAD(base->tv4.vec + j);

2006-05-29 21:33:37

by Ingo Molnar

[permalink] [raw]
Subject: [patch 43/61] lock validator: special locking: completions

From: Ingo Molnar <[email protected]>

teach special (multi-initialized) locking code to the lock validator.
Has no effect on non-lockdep kernels.

Signed-off-by: Ingo Molnar <[email protected]>
Signed-off-by: Arjan van de Ven <[email protected]>
---
include/linux/completion.h | 6 +-----
kernel/sched.c | 8 ++++++++
2 files changed, 9 insertions(+), 5 deletions(-)

Index: linux/include/linux/completion.h
===================================================================
--- linux.orig/include/linux/completion.h
+++ linux/include/linux/completion.h
@@ -21,11 +21,7 @@ struct completion {
#define DECLARE_COMPLETION(work) \
struct completion work = COMPLETION_INITIALIZER(work)

-static inline void init_completion(struct completion *x)
-{
- x->done = 0;
- init_waitqueue_head(&x->wait);
-}
+extern void init_completion(struct completion *x);

extern void FASTCALL(wait_for_completion(struct completion *));
extern int FASTCALL(wait_for_completion_interruptible(struct completion *x));
Index: linux/kernel/sched.c
===================================================================
--- linux.orig/kernel/sched.c
+++ linux/kernel/sched.c
@@ -3569,6 +3569,14 @@ __wake_up_sync(wait_queue_head_t *q, uns
}
EXPORT_SYMBOL_GPL(__wake_up_sync); /* For internal use only */

+void init_completion(struct completion *x)
+{
+ x->done = 0;
+ __init_waitqueue_head(&x->wait);
+}
+
+EXPORT_SYMBOL(init_completion);
+
void fastcall complete(struct completion *x)
{
unsigned long flags;

2006-05-29 21:34:19

by Ingo Molnar

[permalink] [raw]
Subject: [patch 39/61] lock validator: special locking: s_lock

From: Ingo Molnar <[email protected]>

teach special (per-filesystem) locking code to the lock validator. Has no
effect on non-lockdep kernels.

Signed-off-by: Ingo Molnar <[email protected]>
Signed-off-by: Arjan van de Ven <[email protected]>
---
fs/super.c | 13 +++++++++----
include/linux/fs.h | 1 +
2 files changed, 10 insertions(+), 4 deletions(-)

Index: linux/fs/super.c
===================================================================
--- linux.orig/fs/super.c
+++ linux/fs/super.c
@@ -54,7 +54,7 @@ DEFINE_SPINLOCK(sb_lock);
* Allocates and initializes a new &struct super_block. alloc_super()
* returns a pointer new superblock or %NULL if allocation had failed.
*/
-static struct super_block *alloc_super(void)
+static struct super_block *alloc_super(struct file_system_type *type)
{
struct super_block *s = kzalloc(sizeof(struct super_block), GFP_USER);
static struct super_operations default_op;
@@ -72,7 +72,12 @@ static struct super_block *alloc_super(v
INIT_HLIST_HEAD(&s->s_anon);
INIT_LIST_HEAD(&s->s_inodes);
init_rwsem(&s->s_umount);
- mutex_init(&s->s_lock);
+ /*
+ * The locking rules for s_lock are up to the
+ * filesystem. For example ext3fs has different
+ * lock ordering than usbfs:
+ */
+ mutex_init_key(&s->s_lock, type->name, &type->s_lock_key);
down_write(&s->s_umount);
s->s_count = S_BIAS;
atomic_set(&s->s_active, 1);
@@ -297,7 +302,7 @@ retry:
}
if (!s) {
spin_unlock(&sb_lock);
- s = alloc_super();
+ s = alloc_super(type);
if (!s)
return ERR_PTR(-ENOMEM);
goto retry;
@@ -696,7 +701,7 @@ struct super_block *get_sb_bdev(struct f
*/
mutex_lock(&bdev->bd_mount_mutex);
s = sget(fs_type, test_bdev_super, set_bdev_super, bdev);
- mutex_unlock(&bdev->bd_mount_mutex);
+ mutex_unlock_non_nested(&bdev->bd_mount_mutex);
if (IS_ERR(s))
goto out;

Index: linux/include/linux/fs.h
===================================================================
--- linux.orig/include/linux/fs.h
+++ linux/include/linux/fs.h
@@ -1307,6 +1307,7 @@ struct file_system_type {
struct module *owner;
struct file_system_type * next;
struct list_head fs_supers;
+ struct lockdep_type_key s_lock_key;
};

struct super_block *get_sb_bdev(struct file_system_type *fs_type,

2006-05-29 21:34:46

by Ingo Molnar

[permalink] [raw]
Subject: [patch 42/61] lock validator: special locking: kgdb

From: Ingo Molnar <[email protected]>

teach special (recursive, non-ordered) locking code to the lock validator.
Has no effect on non-lockdep kernels.

Signed-off-by: Ingo Molnar <[email protected]>
Signed-off-by: Arjan van de Ven <[email protected]>
---
kernel/kgdb.c | 4 ++--
1 file changed, 2 insertions(+), 2 deletions(-)

Index: linux/kernel/kgdb.c
===================================================================
--- linux.orig/kernel/kgdb.c
+++ linux/kernel/kgdb.c
@@ -1539,7 +1539,7 @@ int kgdb_handle_exception(int ex_vector,

if (!debugger_step || !kgdb_contthread) {
for (i = 0; i < NR_CPUS; i++)
- spin_unlock(&slavecpulocks[i]);
+ spin_unlock_non_nested(&slavecpulocks[i]);
/* Wait till all the processors have quit
* from the debugger. */
for (i = 0; i < NR_CPUS; i++) {
@@ -1622,7 +1622,7 @@ static void __init kgdb_internal_init(vo

/* Initialize our spinlocks. */
for (i = 0; i < NR_CPUS; i++)
- spin_lock_init(&slavecpulocks[i]);
+ spin_lock_init_static(&slavecpulocks[i]);

for (i = 0; i < MAX_BREAKPOINTS; i++)
kgdb_break[i].state = bp_none;

2006-05-29 21:35:00

by Ingo Molnar

[permalink] [raw]
Subject: [patch 38/61] lock validator: special locking: i_mutex

From: Ingo Molnar <[email protected]>

teach special (recursive) locking code to the lock validator. Has no
effect on non-lockdep kernels.
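
the new i_mutex subtypes let the VFS express its parent -> child locking
order to the validator. Usage sketch (dir and victim are hypothetical
dentries, the real conversions are in the diff below):

    mutex_lock_nested(&dir->d_inode->i_mutex, I_MUTEX_PARENT);
    mutex_lock_nested(&victim->d_inode->i_mutex, I_MUTEX_CHILD);
    ...
    mutex_unlock(&victim->d_inode->i_mutex);
    mutex_unlock(&dir->d_inode->i_mutex);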

Signed-off-by: Ingo Molnar <[email protected]>
Signed-off-by: Arjan van de Ven <[email protected]>
---
drivers/usb/core/inode.c | 2 +-
fs/namei.c | 24 ++++++++++++------------
include/linux/fs.h | 14 ++++++++++++++
3 files changed, 27 insertions(+), 13 deletions(-)

Index: linux/drivers/usb/core/inode.c
===================================================================
--- linux.orig/drivers/usb/core/inode.c
+++ linux/drivers/usb/core/inode.c
@@ -201,7 +201,7 @@ static void update_sb(struct super_block
if (!root)
return;

- mutex_lock(&root->d_inode->i_mutex);
+ mutex_lock_nested(&root->d_inode->i_mutex, I_MUTEX_PARENT);

list_for_each_entry(bus, &root->d_subdirs, d_u.d_child) {
if (bus->d_inode) {
Index: linux/fs/namei.c
===================================================================
--- linux.orig/fs/namei.c
+++ linux/fs/namei.c
@@ -1422,7 +1422,7 @@ struct dentry *lock_rename(struct dentry
struct dentry *p;

if (p1 == p2) {
- mutex_lock(&p1->d_inode->i_mutex);
+ mutex_lock_nested(&p1->d_inode->i_mutex, I_MUTEX_PARENT);
return NULL;
}

@@ -1430,30 +1430,30 @@ struct dentry *lock_rename(struct dentry

for (p = p1; p->d_parent != p; p = p->d_parent) {
if (p->d_parent == p2) {
- mutex_lock(&p2->d_inode->i_mutex);
- mutex_lock(&p1->d_inode->i_mutex);
+ mutex_lock_nested(&p2->d_inode->i_mutex, I_MUTEX_PARENT);
+ mutex_lock_nested(&p1->d_inode->i_mutex, I_MUTEX_CHILD);
return p;
}
}

for (p = p2; p->d_parent != p; p = p->d_parent) {
if (p->d_parent == p1) {
- mutex_lock(&p1->d_inode->i_mutex);
- mutex_lock(&p2->d_inode->i_mutex);
+ mutex_lock_nested(&p1->d_inode->i_mutex, I_MUTEX_PARENT);
+ mutex_lock_nested(&p2->d_inode->i_mutex, I_MUTEX_CHILD);
return p;
}
}

- mutex_lock(&p1->d_inode->i_mutex);
- mutex_lock(&p2->d_inode->i_mutex);
+ mutex_lock_nested(&p1->d_inode->i_mutex, I_MUTEX_PARENT);
+ mutex_lock_nested(&p2->d_inode->i_mutex, I_MUTEX_CHILD);
return NULL;
}

void unlock_rename(struct dentry *p1, struct dentry *p2)
{
- mutex_unlock(&p1->d_inode->i_mutex);
+ mutex_unlock_non_nested(&p1->d_inode->i_mutex);
if (p1 != p2) {
- mutex_unlock(&p2->d_inode->i_mutex);
+ mutex_unlock_non_nested(&p2->d_inode->i_mutex);
mutex_unlock(&p1->d_inode->i_sb->s_vfs_rename_mutex);
}
}
@@ -1750,7 +1750,7 @@ struct dentry *lookup_create(struct name
{
struct dentry *dentry = ERR_PTR(-EEXIST);

- mutex_lock(&nd->dentry->d_inode->i_mutex);
+ mutex_lock_nested(&nd->dentry->d_inode->i_mutex, I_MUTEX_PARENT);
/*
* Yucky last component or no last component at all?
* (foo/., foo/.., /////)
@@ -2007,7 +2007,7 @@ static long do_rmdir(int dfd, const char
error = -EBUSY;
goto exit1;
}
- mutex_lock(&nd.dentry->d_inode->i_mutex);
+ mutex_lock_nested(&nd.dentry->d_inode->i_mutex, I_MUTEX_PARENT);
dentry = lookup_hash(&nd);
error = PTR_ERR(dentry);
if (!IS_ERR(dentry)) {
@@ -2081,7 +2081,7 @@ static long do_unlinkat(int dfd, const c
error = -EISDIR;
if (nd.last_type != LAST_NORM)
goto exit1;
- mutex_lock(&nd.dentry->d_inode->i_mutex);
+ mutex_lock_nested(&nd.dentry->d_inode->i_mutex, I_MUTEX_PARENT);
dentry = lookup_hash(&nd);
error = PTR_ERR(dentry);
if (!IS_ERR(dentry)) {
Index: linux/include/linux/fs.h
===================================================================
--- linux.orig/include/linux/fs.h
+++ linux/include/linux/fs.h
@@ -558,6 +558,20 @@ struct inode {
};

/*
+ * inode->i_mutex nesting types for the LOCKDEP validator:
+ *
+ * 0: the object of the current VFS operation
+ * 1: parent
+ * 2: child/target
+ */
+enum inode_i_mutex_lock_type
+{
+ I_MUTEX_NORMAL,
+ I_MUTEX_PARENT,
+ I_MUTEX_CHILD
+};
+
+/*
* NOTE: in a 32bit arch with a preemptable kernel and
* an UP compile the i_size_read/write must be atomic
* with respect to the local cpu (unlike with preempt disabled),

2006-05-29 21:35:00

by Ingo Molnar

[permalink] [raw]
Subject: [patch 33/61] lock validator: disable NMI watchdog if CONFIG_LOCKDEP

From: Ingo Molnar <[email protected]>

The NMI watchdog uses spinlocks (notifier chains, etc.),
so it's not lockdep-safe at the moment.

Signed-off-by: Ingo Molnar <[email protected]>
Signed-off-by: Arjan van de Ven <[email protected]>
---
arch/x86_64/kernel/nmi.c | 12 ++++++++++++
1 file changed, 12 insertions(+)

Index: linux/arch/x86_64/kernel/nmi.c
===================================================================
--- linux.orig/arch/x86_64/kernel/nmi.c
+++ linux/arch/x86_64/kernel/nmi.c
@@ -205,6 +205,18 @@ int __init check_nmi_watchdog (void)
int *counts;
int cpu;

+#ifdef CONFIG_LOCKDEP
+ /*
+ * The NMI watchdog uses spinlocks (notifier chains, etc.),
+ * so it's not lockdep-safe:
+ */
+ nmi_watchdog = 0;
+ for_each_online_cpu(cpu)
+ per_cpu(nmi_watchdog_ctlblk.enabled, cpu) = 0;
+
+ printk("lockdep: disabled NMI watchdog.\n");
+ return 0;
+#endif
if ((nmi_watchdog == NMI_NONE) || (nmi_watchdog == NMI_DEFAULT))
return 0;

2006-05-29 21:35:35

by Ingo Molnar

[permalink] [raw]
Subject: [patch 40/61] lock validator: special locking: futex

From: Ingo Molnar <[email protected]>

teach special (recursive) locking code to the lock validator. Has no
effect on non-lockdep kernels.
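
taking two futex hash-bucket locks (same lock-type) needs both a stable
ordering - by address here - and a nesting annotation for the second lock;
the new double_lock_hb() helper below encapsulates that. Usage sketch, as
in the converted code paths:

    double_lock_hb(hb1, hb2);
    /* ... operate on both hash buckets ... */
    spin_unlock_non_nested(&hb1->lock);
    if (hb1 != hb2)
        spin_unlock_non_nested(&hb2->lock);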

Signed-off-by: Ingo Molnar <[email protected]>
Signed-off-by: Arjan van de Ven <[email protected]>
---
kernel/futex.c | 44 ++++++++++++++++++++++++++------------------
1 file changed, 26 insertions(+), 18 deletions(-)

Index: linux/kernel/futex.c
===================================================================
--- linux.orig/kernel/futex.c
+++ linux/kernel/futex.c
@@ -604,6 +604,22 @@ static int unlock_futex_pi(u32 __user *u
}

/*
+ * Express the locking dependencies for lockdep:
+ */
+static inline void
+double_lock_hb(struct futex_hash_bucket *hb1, struct futex_hash_bucket *hb2)
+{
+ if (hb1 <= hb2) {
+ spin_lock(&hb1->lock);
+ if (hb1 < hb2)
+ spin_lock_nested(&hb2->lock, SINGLE_DEPTH_NESTING);
+ } else { /* hb1 > hb2 */
+ spin_lock(&hb2->lock);
+ spin_lock_nested(&hb1->lock, SINGLE_DEPTH_NESTING);
+ }
+}
+
+/*
* Wake up all waiters hashed on the physical page that is mapped
* to this virtual address:
*/
@@ -669,19 +685,15 @@ retryfull:
hb2 = hash_futex(&key2);

retry:
- if (hb1 < hb2)
- spin_lock(&hb1->lock);
- spin_lock(&hb2->lock);
- if (hb1 > hb2)
- spin_lock(&hb1->lock);
+ double_lock_hb(hb1, hb2);

op_ret = futex_atomic_op_inuser(op, uaddr2);
if (unlikely(op_ret < 0)) {
u32 dummy;

- spin_unlock(&hb1->lock);
+ spin_unlock_non_nested(&hb1->lock);
if (hb1 != hb2)
- spin_unlock(&hb2->lock);
+ spin_unlock_non_nested(&hb2->lock);

#ifndef CONFIG_MMU
/*
@@ -748,9 +760,9 @@ retry:
ret += op_ret;
}

- spin_unlock(&hb1->lock);
+ spin_unlock_non_nested(&hb1->lock);
if (hb1 != hb2)
- spin_unlock(&hb2->lock);
+ spin_unlock_non_nested(&hb2->lock);
out:
up_read(&current->mm->mmap_sem);
return ret;
@@ -782,11 +794,7 @@ static int futex_requeue(u32 __user *uad
hb1 = hash_futex(&key1);
hb2 = hash_futex(&key2);

- if (hb1 < hb2)
- spin_lock(&hb1->lock);
- spin_lock(&hb2->lock);
- if (hb1 > hb2)
- spin_lock(&hb1->lock);
+ double_lock_hb(hb1, hb2);

if (likely(cmpval != NULL)) {
u32 curval;
@@ -794,9 +802,9 @@ static int futex_requeue(u32 __user *uad
ret = get_futex_value_locked(&curval, uaddr1);

if (unlikely(ret)) {
- spin_unlock(&hb1->lock);
+ spin_unlock_non_nested(&hb1->lock);
if (hb1 != hb2)
- spin_unlock(&hb2->lock);
+ spin_unlock_non_nested(&hb2->lock);

/*
* If we would have faulted, release mmap_sem, fault
@@ -842,9 +850,9 @@ static int futex_requeue(u32 __user *uad
}

out_unlock:
- spin_unlock(&hb1->lock);
+ spin_unlock_non_nested(&hb1->lock);
if (hb1 != hb2)
- spin_unlock(&hb2->lock);
+ spin_unlock_non_nested(&hb2->lock);

/* drop_key_refs() must be called outside the spinlocks. */
while (--drop_count >= 0)

2006-05-29 21:36:21

by Ingo Molnar

[permalink] [raw]
Subject: [patch 41/61] lock validator: special locking: genirq

From: Ingo Molnar <[email protected]>

teach special (recursive) locking code to the lock validator. Has no
effect on non-lockdep kernels.

Signed-off-by: Ingo Molnar <[email protected]>
Signed-off-by: Arjan van de Ven <[email protected]>
---
kernel/irq/handle.c | 13 +++++++++++++
1 file changed, 13 insertions(+)

Index: linux/kernel/irq/handle.c
===================================================================
--- linux.orig/kernel/irq/handle.c
+++ linux/kernel/irq/handle.c
@@ -11,6 +11,7 @@
#include <linux/random.h>
#include <linux/interrupt.h>
#include <linux/kernel_stat.h>
+#include <linux/kallsyms.h>

#include "internals.h"

@@ -193,3 +194,15 @@ out:
return 1;
}

+/*
+ * lockdep: we want to handle all irq_desc locks as a single lock-type:
+ */
+static struct lockdep_type_key irq_desc_lock_type;
+
+void early_init_irq_lock_type(void)
+{
+ int i;
+
+ for (i = 0; i < NR_IRQS; i++)
+ spin_lock_init_key(&irq_desc[i].lock, &irq_desc_lock_type);
+}

2006-05-29 21:38:06

by Ingo Molnar

[permalink] [raw]
Subject: [patch 26/61] lock validator: prove rwsem locking correctness

From: Ingo Molnar <[email protected]>

add CONFIG_PROVE_RWSEM_LOCKING, which uses the lock validator framework
to prove rwsem locking correctness.
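
with this patch each init_rwsem() call site provides its own static key,
so every init site becomes a separate lock-type. Usage stays unchanged for
ordinary code - sketch:

    init_rwsem(&mm->mmap_sem);    /* one static lockdep_type_key per call site */
    down_write(&mm->mmap_sem);    /* acquisitions tracked via sem->dep_map */
    up_write(&mm->mmap_sem);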

Signed-off-by: Ingo Molnar <[email protected]>
Signed-off-by: Arjan van de Ven <[email protected]>
---
include/asm-i386/rwsem.h | 38 +++++++++++++++++++--------
include/linux/rwsem-spinlock.h | 23 +++++++++++++++-
include/linux/rwsem.h | 56 +++++++++++++++++++++++++++++++++++++++++
lib/rwsem-spinlock.c | 15 ++++++++--
lib/rwsem.c | 19 +++++++++++++
5 files changed, 135 insertions(+), 16 deletions(-)

Index: linux/include/asm-i386/rwsem.h
===================================================================
--- linux.orig/include/asm-i386/rwsem.h
+++ linux/include/asm-i386/rwsem.h
@@ -40,6 +40,7 @@

#include <linux/list.h>
#include <linux/spinlock.h>
+#include <linux/lockdep.h>

struct rwsem_waiter;

@@ -64,6 +65,9 @@ struct rw_semaphore {
#if RWSEM_DEBUG
int debug;
#endif
+#ifdef CONFIG_PROVE_RWSEM_LOCKING
+ struct lockdep_map dep_map;
+#endif
};

/*
@@ -75,22 +79,29 @@ struct rw_semaphore {
#define __RWSEM_DEBUG_INIT /* */
#endif

+#ifdef CONFIG_PROVE_RWSEM_LOCKING
+# define __RWSEM_DEP_MAP_INIT(lockname) , .dep_map = { .name = #lockname }
+#else
+# define __RWSEM_DEP_MAP_INIT(lockname)
+#endif
+
+
#define __RWSEM_INITIALIZER(name) \
{ RWSEM_UNLOCKED_VALUE, SPIN_LOCK_UNLOCKED, LIST_HEAD_INIT((name).wait_list) \
- __RWSEM_DEBUG_INIT }
+ __RWSEM_DEBUG_INIT __RWSEM_DEP_MAP_INIT(name) }

#define DECLARE_RWSEM(name) \
struct rw_semaphore name = __RWSEM_INITIALIZER(name)

-static inline void init_rwsem(struct rw_semaphore *sem)
-{
- sem->count = RWSEM_UNLOCKED_VALUE;
- spin_lock_init(&sem->wait_lock);
- INIT_LIST_HEAD(&sem->wait_list);
-#if RWSEM_DEBUG
- sem->debug = 0;
-#endif
-}
+extern void __init_rwsem(struct rw_semaphore *sem, const char *name,
+ struct lockdep_type_key *key);
+
+#define init_rwsem(sem) \
+do { \
+ static struct lockdep_type_key __key; \
+ \
+ __init_rwsem((sem), #sem, &__key); \
+} while (0)

/*
* lock for reading
@@ -143,7 +154,7 @@ LOCK_PREFIX " cmpxchgl %2,%0\n\t"
/*
* lock for writing
*/
-static inline void __down_write(struct rw_semaphore *sem)
+static inline void __down_write_nested(struct rw_semaphore *sem, int subtype)
{
int tmp;

@@ -167,6 +178,11 @@ LOCK_PREFIX " xadd %%edx,(%%eax)\n
: "memory", "cc");
}

+static inline void __down_write(struct rw_semaphore *sem)
+{
+ __down_write_nested(sem, 0);
+}
+
/*
* trylock for writing -- returns 1 if successful, 0 if contention
*/
Index: linux/include/linux/rwsem-spinlock.h
===================================================================
--- linux.orig/include/linux/rwsem-spinlock.h
+++ linux/include/linux/rwsem-spinlock.h
@@ -35,6 +35,9 @@ struct rw_semaphore {
#if RWSEM_DEBUG
int debug;
#endif
+#ifdef CONFIG_PROVE_RWSEM_LOCKING
+ struct lockdep_map dep_map;
+#endif
};

/*
@@ -46,16 +49,32 @@ struct rw_semaphore {
#define __RWSEM_DEBUG_INIT /* */
#endif

+#ifdef CONFIG_PROVE_RWSEM_LOCKING
+# define __RWSEM_DEP_MAP_INIT(lockname) , .dep_map = { .name = #lockname }
+#else
+# define __RWSEM_DEP_MAP_INIT(lockname)
+#endif
+
#define __RWSEM_INITIALIZER(name) \
-{ 0, SPIN_LOCK_UNLOCKED, LIST_HEAD_INIT((name).wait_list) __RWSEM_DEBUG_INIT }
+{ 0, SPIN_LOCK_UNLOCKED, LIST_HEAD_INIT((name).wait_list) __RWSEM_DEBUG_INIT __RWSEM_DEP_MAP_INIT(name) }

#define DECLARE_RWSEM(name) \
struct rw_semaphore name = __RWSEM_INITIALIZER(name)

-extern void FASTCALL(init_rwsem(struct rw_semaphore *sem));
+extern void __init_rwsem(struct rw_semaphore *sem, const char *name,
+ struct lockdep_type_key *key);
+
+#define init_rwsem(sem) \
+do { \
+ static struct lockdep_type_key __key; \
+ \
+ __init_rwsem((sem), #sem, &__key); \
+} while (0)
+
extern void FASTCALL(__down_read(struct rw_semaphore *sem));
extern int FASTCALL(__down_read_trylock(struct rw_semaphore *sem));
extern void FASTCALL(__down_write(struct rw_semaphore *sem));
+extern void FASTCALL(__down_write_nested(struct rw_semaphore *sem, int subtype));
extern int FASTCALL(__down_write_trylock(struct rw_semaphore *sem));
extern void FASTCALL(__up_read(struct rw_semaphore *sem));
extern void FASTCALL(__up_write(struct rw_semaphore *sem));
Index: linux/include/linux/rwsem.h
===================================================================
--- linux.orig/include/linux/rwsem.h
+++ linux/include/linux/rwsem.h
@@ -40,6 +40,20 @@ extern void FASTCALL(rwsemtrace(struct r
static inline void down_read(struct rw_semaphore *sem)
{
might_sleep();
+ rwsem_acquire_read(&sem->dep_map, 0, 0, _THIS_IP_);
+
+ rwsemtrace(sem,"Entering down_read");
+ __down_read(sem);
+ rwsemtrace(sem,"Leaving down_read");
+}
+
+/*
+ * Take a lock that will be released by a different task (not the owner):
+ */
+static inline void down_read_non_owner(struct rw_semaphore *sem)
+{
+ might_sleep();
+
rwsemtrace(sem,"Entering down_read");
__down_read(sem);
rwsemtrace(sem,"Leaving down_read");
@@ -53,6 +67,8 @@ static inline int down_read_trylock(stru
int ret;
rwsemtrace(sem,"Entering down_read_trylock");
ret = __down_read_trylock(sem);
+ if (ret == 1)
+ rwsem_acquire_read(&sem->dep_map, 0, 1, _THIS_IP_);
rwsemtrace(sem,"Leaving down_read_trylock");
return ret;
}
@@ -63,12 +79,28 @@ static inline int down_read_trylock(stru
static inline void down_write(struct rw_semaphore *sem)
{
might_sleep();
+ rwsem_acquire(&sem->dep_map, 0, 0, _THIS_IP_);
+
rwsemtrace(sem,"Entering down_write");
__down_write(sem);
rwsemtrace(sem,"Leaving down_write");
}

/*
+ * lock for writing
+ */
+static inline void down_write_nested(struct rw_semaphore *sem, int subtype)
+{
+ might_sleep();
+ rwsem_acquire(&sem->dep_map, subtype, 0, _THIS_IP_);
+
+ rwsemtrace(sem,"Entering down_write_nested");
+ __down_write_nested(sem, subtype);
+ rwsemtrace(sem,"Leaving down_write_nested");
+}
+
+
+/*
* trylock for writing -- returns 1 if successful, 0 if contention
*/
static inline int down_write_trylock(struct rw_semaphore *sem)
@@ -76,6 +108,8 @@ static inline int down_write_trylock(str
int ret;
rwsemtrace(sem,"Entering down_write_trylock");
ret = __down_write_trylock(sem);
+ if (ret == 1)
+ rwsem_acquire(&sem->dep_map, 0, 0, _THIS_IP_);
rwsemtrace(sem,"Leaving down_write_trylock");
return ret;
}
@@ -85,16 +119,34 @@ static inline int down_write_trylock(str
*/
static inline void up_read(struct rw_semaphore *sem)
{
+ rwsem_release(&sem->dep_map, 1, _THIS_IP_);
+
rwsemtrace(sem,"Entering up_read");
__up_read(sem);
rwsemtrace(sem,"Leaving up_read");
}

+static inline void up_read_non_nested(struct rw_semaphore *sem)
+{
+ rwsem_release(&sem->dep_map, 0, _THIS_IP_);
+ __up_read(sem);
+}
+
+/*
+ * Release a lock that was acquired by a different task:
+ */
+static inline void up_read_non_owner(struct rw_semaphore *sem)
+{
+ __up_read(sem);
+}
+
/*
* release a write lock
*/
static inline void up_write(struct rw_semaphore *sem)
{
+ rwsem_release(&sem->dep_map, 1, _THIS_IP_);
+
rwsemtrace(sem,"Entering up_write");
__up_write(sem);
rwsemtrace(sem,"Leaving up_write");
@@ -105,6 +157,10 @@ static inline void up_write(struct rw_se
*/
static inline void downgrade_write(struct rw_semaphore *sem)
{
+ /*
+ * lockdep: a downgraded write will live on as a write
+ * dependency.
+ */
rwsemtrace(sem,"Entering downgrade_write");
__downgrade_write(sem);
rwsemtrace(sem,"Leaving downgrade_write");
Index: linux/lib/rwsem-spinlock.c
===================================================================
--- linux.orig/lib/rwsem-spinlock.c
+++ linux/lib/rwsem-spinlock.c
@@ -30,7 +30,8 @@ void rwsemtrace(struct rw_semaphore *sem
/*
* initialise the semaphore
*/
-void fastcall init_rwsem(struct rw_semaphore *sem)
+void __init_rwsem(struct rw_semaphore *sem, const char *name,
+ struct lockdep_type_key *key)
{
sem->activity = 0;
spin_lock_init(&sem->wait_lock);
@@ -38,6 +39,9 @@ void fastcall init_rwsem(struct rw_semap
#if RWSEM_DEBUG
sem->debug = 0;
#endif
+#ifdef CONFIG_PROVE_RWSEM_LOCKING
+ lockdep_init_map(&sem->dep_map, name, key);
+#endif
}

/*
@@ -204,7 +208,7 @@ int fastcall __down_read_trylock(struct
* get a write lock on the semaphore
* - we increment the waiting count anyway to indicate an exclusive lock
*/
-void fastcall __sched __down_write(struct rw_semaphore *sem)
+void fastcall __sched __down_write_nested(struct rw_semaphore *sem, int subtype)
{
struct rwsem_waiter waiter;
struct task_struct *tsk;
@@ -247,6 +251,11 @@ void fastcall __sched __down_write(struc
rwsemtrace(sem, "Leaving __down_write");
}

+void fastcall __sched __down_write(struct rw_semaphore *sem)
+{
+ __down_write_nested(sem, 0);
+}
+
/*
* trylock for writing -- returns 1 if successful, 0 if contention
*/
@@ -331,7 +340,7 @@ void fastcall __downgrade_write(struct r
rwsemtrace(sem, "Leaving __downgrade_write");
}

-EXPORT_SYMBOL(init_rwsem);
+EXPORT_SYMBOL(__init_rwsem);
EXPORT_SYMBOL(__down_read);
EXPORT_SYMBOL(__down_read_trylock);
EXPORT_SYMBOL(__down_write);
Index: linux/lib/rwsem.c
===================================================================
--- linux.orig/lib/rwsem.c
+++ linux/lib/rwsem.c
@@ -8,6 +8,25 @@
#include <linux/init.h>
#include <linux/module.h>

+/*
+ * Initialize an rwsem:
+ */
+void __init_rwsem(struct rw_semaphore *sem, const char *name,
+ struct lockdep_type_key *key)
+{
+ sem->count = RWSEM_UNLOCKED_VALUE;
+ spin_lock_init(&sem->wait_lock);
+ INIT_LIST_HEAD(&sem->wait_list);
+#if RWSEM_DEBUG
+ sem->debug = 0;
+#endif
+#ifdef CONFIG_PROVE_RWSEM_LOCKING
+ lockdep_init_map(&sem->dep_map, name, key);
+#endif
+}
+
+EXPORT_SYMBOL(__init_rwsem);
+
struct rwsem_waiter {
struct list_head list;
struct task_struct *task;

2006-05-29 21:38:21

by Ingo Molnar

[permalink] [raw]
Subject: [patch 31/61] lock validator: SMP alternatives workaround

From: Ingo Molnar <[email protected]>

disable SMP alternatives fixups (the patching in of NOPs on 1-CPU
systems) if the lock validator is enabled: there is a binutils
section handling bug that causes corrupted instructions when
UP instructions are patched in.

Signed-off-by: Ingo Molnar <[email protected]>
Signed-off-by: Arjan van de Ven <[email protected]>
---
arch/i386/kernel/alternative.c | 10 ++++++++++
1 file changed, 10 insertions(+)

Index: linux/arch/i386/kernel/alternative.c
===================================================================
--- linux.orig/arch/i386/kernel/alternative.c
+++ linux/arch/i386/kernel/alternative.c
@@ -301,6 +301,16 @@ void alternatives_smp_switch(int smp)
struct smp_alt_module *mod;
unsigned long flags;

+#ifdef CONFIG_LOCKDEP
+ /*
+ * A not yet fixed binutils section handling bug prevents
+ * alternatives-replacement from working reliably, so turn
+ * it off:
+ */
+ printk("lockdep: not fixing up alternatives.\n");
+ return;
+#endif
+
if (no_replacement || smp_alt_once)
return;
BUG_ON(!smp && (num_online_cpus() > 1));

2006-05-29 21:37:10

by Ingo Molnar

[permalink] [raw]
Subject: [patch 35/61] lock validator: special locking: direct-IO

From: Ingo Molnar <[email protected]>

teach special (rwsem-in-irq) locking code to the lock validator. Has no
effect on non-lockdep kernels.
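
"rwsem-in-irq" refers to inode->i_alloc_sem in the direct-IO code: it is
acquired by the task submitting the IO but released from whatever context
completes it (possibly irq context), so strict owner tracking does not
apply. Hence the _non_owner variants - sketch:

    down_read_non_owner(&inode->i_alloc_sem);    /* IO submitter */
    ...
    up_read_non_owner(&inode->i_alloc_sem);      /* completion path */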

Signed-off-by: Ingo Molnar <[email protected]>
Signed-off-by: Arjan van de Ven <[email protected]>
---
fs/direct-io.c | 6 ++++--
1 file changed, 4 insertions(+), 2 deletions(-)

Index: linux/fs/direct-io.c
===================================================================
--- linux.orig/fs/direct-io.c
+++ linux/fs/direct-io.c
@@ -220,7 +220,8 @@ static void dio_complete(struct dio *dio
if (dio->end_io && dio->result)
dio->end_io(dio->iocb, offset, bytes, dio->map_bh.b_private);
if (dio->lock_type == DIO_LOCKING)
- up_read(&dio->inode->i_alloc_sem);
+ /* lockdep: non-owner release */
+ up_read_non_owner(&dio->inode->i_alloc_sem);
}

/*
@@ -1261,7 +1262,8 @@ __blockdev_direct_IO(int rw, struct kioc
}

if (dio_lock_type == DIO_LOCKING)
- down_read(&inode->i_alloc_sem);
+ /* lockdep: not the owner will release it */
+ down_read_non_owner(&inode->i_alloc_sem);
}

/*

2006-05-29 21:37:10

by Ingo Molnar

[permalink] [raw]
Subject: [patch 36/61] lock validator: special locking: serial

From: Ingo Molnar <[email protected]>

teach special (dual-initialized) locking code to the lock validator.
Has no effect on non-lockdep kernels.

Signed-off-by: Ingo Molnar <[email protected]>
Signed-off-by: Arjan van de Ven <[email protected]>
---
drivers/serial/serial_core.c | 10 ++++++++--
1 file changed, 8 insertions(+), 2 deletions(-)

Index: linux/drivers/serial/serial_core.c
===================================================================
--- linux.orig/drivers/serial/serial_core.c
+++ linux/drivers/serial/serial_core.c
@@ -1849,6 +1849,12 @@ static const struct baud_rates baud_rate
{ 0, B38400 }
};

+/*
+ * lockdep: port->lock is initialized in two places, but we
+ * want only one lock-type:
+ */
+static struct lockdep_type_key port_lock_key;
+
/**
* uart_set_options - setup the serial console parameters
* @port: pointer to the serial ports uart_port structure
@@ -1869,7 +1875,7 @@ uart_set_options(struct uart_port *port,
* Ensure that the serial console lock is initialised
* early.
*/
- spin_lock_init(&port->lock);
+ spin_lock_init_key(&port->lock, &port_lock_key);

memset(&termios, 0, sizeof(struct termios));

@@ -2255,7 +2261,7 @@ int uart_add_one_port(struct uart_driver
* initialised.
*/
if (!(uart_console(port) && (port->cons->flags & CON_ENABLED)))
- spin_lock_init(&port->lock);
+ spin_lock_init_key(&port->lock, &port_lock_key);

uart_configure_port(drv, state, port);

2006-05-29 21:38:06

by Ingo Molnar

[permalink] [raw]
Subject: [patch 21/61] lock validator: lockdep: add local_irq_enable_in_hardirq() API.

From: Ingo Molnar <[email protected]>

introduce the local_irq_enable_in_hardirq() API. It is currently
aliased to local_irq_enable(), hence has no functional effect.

This API will be used by lockdep, but even without lockdep
this will better document places in the kernel where a hardirq
context enables hardirqs.

Signed-off-by: Ingo Molnar <[email protected]>
Signed-off-by: Arjan van de Ven <[email protected]>
---
arch/i386/kernel/nmi.c | 3 ++-
arch/x86_64/kernel/nmi.c | 3 ++-
drivers/ide/ide-io.c | 6 +++---
drivers/ide/ide-taskfile.c | 2 +-
include/linux/ide.h | 2 +-
include/linux/trace_irqflags.h | 2 ++
kernel/irq/handle.c | 2 +-
7 files changed, 12 insertions(+), 8 deletions(-)

Index: linux/arch/i386/kernel/nmi.c
===================================================================
--- linux.orig/arch/i386/kernel/nmi.c
+++ linux/arch/i386/kernel/nmi.c
@@ -188,7 +188,8 @@ static __cpuinit inline int nmi_known_cp
static __init void nmi_cpu_busy(void *data)
{
volatile int *endflag = data;
- local_irq_enable();
+
+ local_irq_enable_in_hardirq();
/* Intentionally don't use cpu_relax here. This is
to make sure that the performance counter really ticks,
even if there is a simulator or similar that catches the
Index: linux/arch/x86_64/kernel/nmi.c
===================================================================
--- linux.orig/arch/x86_64/kernel/nmi.c
+++ linux/arch/x86_64/kernel/nmi.c
@@ -186,7 +186,8 @@ void nmi_watchdog_default(void)
static __init void nmi_cpu_busy(void *data)
{
volatile int *endflag = data;
- local_irq_enable();
+
+ local_irq_enable_in_hardirq();
/* Intentionally don't use cpu_relax here. This is
to make sure that the performance counter really ticks,
even if there is a simulator or similar that catches the
Index: linux/drivers/ide/ide-io.c
===================================================================
--- linux.orig/drivers/ide/ide-io.c
+++ linux/drivers/ide/ide-io.c
@@ -689,7 +689,7 @@ static ide_startstop_t drive_cmd_intr (i
u8 stat = hwif->INB(IDE_STATUS_REG);
int retries = 10;

- local_irq_enable();
+ local_irq_enable_in_hardirq();
if ((stat & DRQ_STAT) && args && args[3]) {
u8 io_32bit = drive->io_32bit;
drive->io_32bit = 0;
@@ -1273,7 +1273,7 @@ static void ide_do_request (ide_hwgroup_
if (masked_irq != IDE_NO_IRQ && hwif->irq != masked_irq)
disable_irq_nosync(hwif->irq);
spin_unlock(&ide_lock);
- local_irq_enable();
+ local_irq_enable_in_hardirq();
/* allow other IRQs while we start this request */
startstop = start_request(drive, rq);
spin_lock_irq(&ide_lock);
@@ -1622,7 +1622,7 @@ irqreturn_t ide_intr (int irq, void *dev
spin_unlock(&ide_lock);

if (drive->unmask)
- local_irq_enable();
+ local_irq_enable_in_hardirq();
/* service this interrupt, may set handler for next interrupt */
startstop = handler(drive);
spin_lock_irq(&ide_lock);
Index: linux/drivers/ide/ide-taskfile.c
===================================================================
--- linux.orig/drivers/ide/ide-taskfile.c
+++ linux/drivers/ide/ide-taskfile.c
@@ -223,7 +223,7 @@ ide_startstop_t task_no_data_intr (ide_d
ide_hwif_t *hwif = HWIF(drive);
u8 stat;

- local_irq_enable();
+ local_irq_enable_in_hardirq();
if (!OK_STAT(stat = hwif->INB(IDE_STATUS_REG),READY_STAT,BAD_STAT)) {
return ide_error(drive, "task_no_data_intr", stat);
/* calls ide_end_drive_cmd */
Index: linux/include/linux/ide.h
===================================================================
--- linux.orig/include/linux/ide.h
+++ linux/include/linux/ide.h
@@ -1361,7 +1361,7 @@ extern struct semaphore ide_cfg_sem;
* ide_drive_t->hwif: constant, no locking
*/

-#define local_irq_set(flags) do { local_save_flags((flags)); local_irq_enable(); } while (0)
+#define local_irq_set(flags) do { local_save_flags((flags)); local_irq_enable_in_hardirq(); } while (0)

extern struct bus_type ide_bus_type;

Index: linux/include/linux/trace_irqflags.h
===================================================================
--- linux.orig/include/linux/trace_irqflags.h
+++ linux/include/linux/trace_irqflags.h
@@ -66,6 +66,8 @@
} \
} while (0)

+#define local_irq_enable_in_hardirq() local_irq_enable()
+
#define safe_halt() \
do { \
trace_hardirqs_on(); \
Index: linux/kernel/irq/handle.c
===================================================================
--- linux.orig/kernel/irq/handle.c
+++ linux/kernel/irq/handle.c
@@ -83,7 +83,7 @@ fastcall irqreturn_t handle_IRQ_event(un
unsigned int status = 0;

if (!(action->flags & SA_INTERRUPT))
- local_irq_enable();
+ local_irq_enable_in_hardirq();

do {
ret = action->handler(irq, action->dev_id, regs);

2006-05-29 21:38:54

by Ingo Molnar

[permalink] [raw]
Subject: [patch 32/61] lock validator: do not recurse in printk()

From: Ingo Molnar <[email protected]>

make printk()-ing from within the lock validation code safer by
using the lockdep-recursion counter.
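
the logbuf_lock section is bracketed with the per-task lockdep_recursion
counter, so printk() calls issued from within the validator (or from the
scheduler code it instruments) do not recurse back into lockdep or into
the log_wait wakeup path. In sketch form:

    local_irq_save(flags);
    current->lockdep_recursion++;
    spin_lock(&logbuf_lock);
    /* ... emit the message ... */
    spin_unlock(&logbuf_lock);
    current->lockdep_recursion--;
    local_irq_restore(flags);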

Signed-off-by: Ingo Molnar <[email protected]>
Signed-off-by: Arjan van de Ven <[email protected]>
---
kernel/printk.c | 20 ++++++++++++++++----
1 file changed, 16 insertions(+), 4 deletions(-)

Index: linux/kernel/printk.c
===================================================================
--- linux.orig/kernel/printk.c
+++ linux/kernel/printk.c
@@ -516,7 +516,9 @@ asmlinkage int vprintk(const char *fmt,
zap_locks();

/* This stops the holder of console_sem just where we want him */
- spin_lock_irqsave(&logbuf_lock, flags);
+ local_irq_save(flags);
+ current->lockdep_recursion++;
+ spin_lock(&logbuf_lock);
printk_cpu = smp_processor_id();

/* Emit the output into the temporary buffer */
@@ -586,7 +588,7 @@ asmlinkage int vprintk(const char *fmt,
*/
console_locked = 1;
printk_cpu = UINT_MAX;
- spin_unlock_irqrestore(&logbuf_lock, flags);
+ spin_unlock(&logbuf_lock);

/*
* Console drivers may assume that per-cpu resources have
@@ -602,6 +604,8 @@ asmlinkage int vprintk(const char *fmt,
console_locked = 0;
up(&console_sem);
}
+ current->lockdep_recursion--;
+ local_irq_restore(flags);
} else {
/*
* Someone else owns the drivers. We drop the spinlock, which
@@ -609,7 +613,9 @@ asmlinkage int vprintk(const char *fmt,
* console drivers with the output which we just produced.
*/
printk_cpu = UINT_MAX;
- spin_unlock_irqrestore(&logbuf_lock, flags);
+ spin_unlock(&logbuf_lock);
+ current->lockdep_recursion--;
+ local_irq_restore(flags);
}

preempt_enable();
@@ -783,7 +789,13 @@ void release_console_sem(void)
up(&console_sem);
spin_unlock_irqrestore(&logbuf_lock, flags);
if (wake_klogd && !oops_in_progress && waitqueue_active(&log_wait))
- wake_up_interruptible(&log_wait);
+ /*
+ * If we printk from within the lock dependency code,
+ * from within the scheduler code, then do not lock
+ * up due to self-recursion:
+ */
+ if (current->lockdep_recursion <= 1)
+ wake_up_interruptible(&log_wait);
}
EXPORT_SYMBOL(release_console_sem);

2006-05-29 21:39:42

by Ingo Molnar

[permalink] [raw]
Subject: [patch 23/61] lock validator: core

From: Ingo Molnar <[email protected]>

lock validator core changes. Not enabled yet.

Signed-off-by: Ingo Molnar <[email protected]>
Signed-off-by: Arjan van de Ven <[email protected]>
---
include/linux/init_task.h | 1
include/linux/lockdep.h | 280 ++++
include/linux/sched.h | 12
include/linux/trace_irqflags.h | 13
init/main.c | 16
kernel/Makefile | 1
kernel/fork.c | 5
kernel/irq/manage.c | 6
kernel/lockdep.c | 2633 +++++++++++++++++++++++++++++++++++++++++
kernel/lockdep_internals.h | 93 +
kernel/module.c | 3
lib/Kconfig.debug | 2
lib/locking-selftest.c | 4
13 files changed, 3064 insertions(+), 5 deletions(-)
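
For orientation, the calling convention of the new API in
include/linux/lockdep.h is roughly the following. (Sketch only - the my_*
names are invented for illustration; lockdep_init_map(), lockdep_acquire()
and lockdep_release() are the functions declared by this patch.)

	struct my_lock {
		raw_spinlock_t		raw_lock;
		struct lockdep_map	dep_map;  /* maps this instance to its lock-type */
	};

	/* one static key per lock-type - its address identifies the type: */
	static struct lockdep_type_key my_lock_key;

	static void my_lock_init(struct my_lock *lock)
	{
		lock->raw_lock = (raw_spinlock_t)__RAW_SPIN_LOCK_UNLOCKED;
		lockdep_init_map(&lock->dep_map, "my_lock", &my_lock_key);
	}

	static void my_lock_acquire(struct my_lock *lock)
	{
		/* subtype 0, trylock=0, read=0: a plain write-acquire */
		lockdep_acquire(&lock->dep_map, 0, 0, 0,
				(unsigned long)__builtin_return_address(0));
		__raw_spin_lock(&lock->raw_lock);
	}

	static void my_lock_release(struct my_lock *lock)
	{
		lockdep_release(&lock->dep_map, 0,
				(unsigned long)__builtin_return_address(0));
		__raw_spin_unlock(&lock->raw_lock);
	}

The per-primitive wrappers are expected to go through the spin_acquire(),
rwlock_acquire(), mutex_acquire() and rwsem_acquire() macro families at the
end of lockdep.h, which map to the calls above or to NOPs depending on the
CONFIG_PROVE_*_LOCKING options.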

Index: linux/include/linux/init_task.h
===================================================================
--- linux.orig/include/linux/init_task.h
+++ linux/include/linux/init_task.h
@@ -134,6 +134,7 @@ extern struct group_info init_groups;
.cpu_timers = INIT_CPU_TIMERS(tsk.cpu_timers), \
.fs_excl = ATOMIC_INIT(0), \
INIT_TRACE_IRQFLAGS \
+ INIT_LOCKDEP \
}


Index: linux/include/linux/lockdep.h
===================================================================
--- /dev/null
+++ linux/include/linux/lockdep.h
@@ -0,0 +1,280 @@
+/*
+ * Runtime locking correctness validator
+ *
+ * Copyright (C) 2006 Red Hat, Inc., Ingo Molnar <[email protected]>
+ *
+ * see Documentation/lockdep-design.txt for more details.
+ */
+#ifndef __LINUX_LOCKDEP_H
+#define __LINUX_LOCKDEP_H
+
+#include <linux/linkage.h>
+#include <linux/list.h>
+#include <linux/debug_locks.h>
+#include <linux/stacktrace.h>
+
+#ifdef CONFIG_LOCKDEP
+
+/*
+ * Lock-type usage-state bits:
+ */
+enum lock_usage_bit
+{
+ LOCK_USED = 0,
+ LOCK_USED_IN_HARDIRQ,
+ LOCK_USED_IN_SOFTIRQ,
+ LOCK_ENABLED_SOFTIRQS,
+ LOCK_ENABLED_HARDIRQS,
+ LOCK_USED_IN_HARDIRQ_READ,
+ LOCK_USED_IN_SOFTIRQ_READ,
+ LOCK_ENABLED_SOFTIRQS_READ,
+ LOCK_ENABLED_HARDIRQS_READ,
+ LOCK_USAGE_STATES
+};
+
+/*
+ * Usage-state bitmasks:
+ */
+#define LOCKF_USED (1 << LOCK_USED)
+#define LOCKF_USED_IN_HARDIRQ (1 << LOCK_USED_IN_HARDIRQ)
+#define LOCKF_USED_IN_SOFTIRQ (1 << LOCK_USED_IN_SOFTIRQ)
+#define LOCKF_ENABLED_HARDIRQS (1 << LOCK_ENABLED_HARDIRQS)
+#define LOCKF_ENABLED_SOFTIRQS (1 << LOCK_ENABLED_SOFTIRQS)
+
+#define LOCKF_ENABLED_IRQS (LOCKF_ENABLED_HARDIRQS | LOCKF_ENABLED_SOFTIRQS)
+#define LOCKF_USED_IN_IRQ (LOCKF_USED_IN_HARDIRQ | LOCKF_USED_IN_SOFTIRQ)
+
+#define LOCKF_USED_IN_HARDIRQ_READ (1 << LOCK_USED_IN_HARDIRQ_READ)
+#define LOCKF_USED_IN_SOFTIRQ_READ (1 << LOCK_USED_IN_SOFTIRQ_READ)
+#define LOCKF_ENABLED_HARDIRQS_READ (1 << LOCK_ENABLED_HARDIRQS_READ)
+#define LOCKF_ENABLED_SOFTIRQS_READ (1 << LOCK_ENABLED_SOFTIRQS_READ)
+
+#define LOCKF_ENABLED_IRQS_READ \
+ (LOCKF_ENABLED_HARDIRQS_READ | LOCKF_ENABLED_SOFTIRQS_READ)
+#define LOCKF_USED_IN_IRQ_READ \
+ (LOCKF_USED_IN_HARDIRQ_READ | LOCKF_USED_IN_SOFTIRQ_READ)
+
+#define MAX_LOCKDEP_SUBTYPES 8UL
+
+/*
+ * Lock-types are keyed via unique addresses, by embedding the
+ * locktype-key into the kernel (or module) .data section. (For
+ * static locks we use the lock address itself as the key.)
+ */
+struct lockdep_subtype_key {
+ char __one_byte;
+} __attribute__ ((__packed__));
+
+struct lockdep_type_key {
+ struct lockdep_subtype_key subkeys[MAX_LOCKDEP_SUBTYPES];
+};
+
+/*
+ * The lock-type itself:
+ */
+struct lock_type {
+ /*
+ * type-hash:
+ */
+ struct list_head hash_entry;
+
+ /*
+ * global list of all lock-types:
+ */
+ struct list_head lock_entry;
+
+ struct lockdep_subtype_key *key;
+ unsigned int subtype;
+
+ /*
+ * IRQ/softirq usage tracking bits:
+ */
+ unsigned long usage_mask;
+ struct stack_trace usage_traces[LOCK_USAGE_STATES];
+
+ /*
+ * These fields represent a directed graph of lock dependencies,
+ * to every node we attach a list of "forward" and a list of
+ * "backward" graph nodes.
+ */
+ struct list_head locks_after, locks_before;
+
+ /*
+ * Generation counter, when doing certain types of graph walking,
+ * to ensure that we check one node only once:
+ */
+ unsigned int version;
+
+ /*
+ * Statistics counter:
+ */
+ unsigned long ops;
+
+ const char *name;
+ int name_version;
+};
+
+/*
+ * Map the lock object (the lock instance) to the lock-type object.
+ * This is embedded into specific lock instances:
+ */
+struct lockdep_map {
+ struct lockdep_type_key *key;
+ struct lock_type *type[MAX_LOCKDEP_SUBTYPES];
+ const char *name;
+};
+
+/*
+ * Every lock has a list of other locks that were taken after it.
+ * We only grow the list, never remove from it:
+ */
+struct lock_list {
+ struct list_head entry;
+ struct lock_type *type;
+ struct stack_trace trace;
+};
+
+/*
+ * We record lock dependency chains, so that we can cache them:
+ */
+struct lock_chain {
+ struct list_head entry;
+ u64 chain_key;
+};
+
+struct held_lock {
+ /*
+ * One-way hash of the dependency chain up to this point. We
+ * hash the hashes step by step as the dependency chain grows.
+ *
+ * We use it for dependency-caching and we skip detection
+ * passes and dependency-updates if there is a cache-hit, so
+ * it is absolutely critical for 100% coverage of the validator
+ * to have a unique key value for every unique dependency path
+ * that can occur in the system, to make a unique hash value
+ * as likely as possible - hence the 64-bit width.
+ *
+ * The task struct holds the current hash value (initialized
+ * with zero), here we store the previous hash value:
+ */
+ u64 prev_chain_key;
+ struct lock_type *type;
+ unsigned long acquire_ip;
+ struct lockdep_map *instance;
+
+ /*
+ * The lock-stack is unified in that the lock chains of interrupt
+ * contexts nest ontop of process context chains, but we 'separate'
+ * the hashes by starting with 0 if we cross into an interrupt
+ * context, and we also do not add cross-context lock
+ * dependencies - the lock usage graph walking covers that area
+ * anyway, and we'd just unnecessarily increase the number of
+ * dependencies otherwise. [Note: hardirq and softirq contexts
+ * are separated from each other too.]
+ *
+ * The following field is used to detect when we cross into an
+ * interrupt context:
+ */
+ int irq_context;
+ int trylock;
+ int read;
+ int hardirqs_off;
+};
+
+/*
+ * Initialization, self-test and debugging-output methods:
+ */
+extern void lockdep_init(void);
+extern void lockdep_info(void);
+extern void lockdep_reset(void);
+extern void lockdep_reset_lock(struct lockdep_map *lock);
+extern void lockdep_free_key_range(void *start, unsigned long size);
+
+extern void print_lock_types(void);
+extern void lockdep_print_held_locks(struct task_struct *task);
+
+/*
+ * These methods are used by specific locking variants (spinlocks,
+ * rwlocks, mutexes and rwsems) to pass init/acquire/release events
+ * to lockdep:
+ */
+
+extern void lockdep_init_map(struct lockdep_map *lock, const char *name,
+ struct lockdep_type_key *key);
+
+extern void lockdep_acquire(struct lockdep_map *lock, unsigned int subtype,
+ int trylock, int read, unsigned long ip);
+
+extern void lockdep_release(struct lockdep_map *lock, int nested,
+ unsigned long ip);
+
+# define INIT_LOCKDEP .lockdep_recursion = 0,
+
+extern void early_boot_irqs_off(void);
+extern void early_boot_irqs_on(void);
+
+#else /* LOCKDEP */
+# define lockdep_init() do { } while (0)
+# define lockdep_info() do { } while (0)
+# define print_lock_types() do { } while (0)
+# define lockdep_print_held_locks(task) do { (void)(task); } while (0)
+# define lockdep_init_map(lock, name, key) do { } while (0)
+# define INIT_LOCKDEP
+# define lockdep_reset() do { debug_locks = 1; } while (0)
+# define lockdep_free_key_range(start, size) do { } while (0)
+# define early_boot_irqs_off() do { } while (0)
+# define early_boot_irqs_on() do { } while (0)
+/*
+ * The type key takes no space if lockdep is disabled:
+ */
+struct lockdep_type_key { };
+#endif /* !LOCKDEP */
+
+/*
+ * For trivial one-depth nesting of a lock-type, the following
+ * global define can be used. (Subsystems with multiple levels
+ * of nesting should define their own lock-nesting subtypes.)
+ */
+#define SINGLE_DEPTH_NESTING 1
+
+/*
+ * Map the dependency ops to NOP or to real lockdep ops, depending
+ * on the per lock-type debug mode:
+ */
+#ifdef CONFIG_PROVE_SPIN_LOCKING
+# define spin_acquire(l, s, t, i) lockdep_acquire(l, s, t, 0, i)
+# define spin_release(l, n, i) lockdep_release(l, n, i)
+#else
+# define spin_acquire(l, s, t, i) do { } while (0)
+# define spin_release(l, n, i) do { } while (0)
+#endif
+
+#ifdef CONFIG_PROVE_RW_LOCKING
+# define rwlock_acquire(l, s, t, i) lockdep_acquire(l, s, t, 0, i)
+# define rwlock_acquire_read(l, s, t, i) lockdep_acquire(l, s, t, 1, i)
+# define rwlock_release(l, n, i) lockdep_release(l, n, i)
+#else
+# define rwlock_acquire(l, s, t, i) do { } while (0)
+# define rwlock_acquire_read(l, s, t, i) do { } while (0)
+# define rwlock_release(l, n, i) do { } while (0)
+#endif
+
+#ifdef CONFIG_PROVE_MUTEX_LOCKING
+# define mutex_acquire(l, s, t, i) lockdep_acquire(l, s, t, 0, i)
+# define mutex_release(l, n, i) lockdep_release(l, n, i)
+#else
+# define mutex_acquire(l, s, t, i) do { } while (0)
+# define mutex_release(l, n, i) do { } while (0)
+#endif
+
+#ifdef CONFIG_PROVE_RWSEM_LOCKING
+# define rwsem_acquire(l, s, t, i) lockdep_acquire(l, s, t, 0, i)
+# define rwsem_acquire_read(l, s, t, i) lockdep_acquire(l, s, t, -1, i)
+# define rwsem_release(l, n, i) lockdep_release(l, n, i)
+#else
+# define rwsem_acquire(l, s, t, i) do { } while (0)
+# define rwsem_acquire_read(l, s, t, i) do { } while (0)
+# define rwsem_release(l, n, i) do { } while (0)
+#endif
+
+#endif /* __LINUX_LOCKDEP_H */
Index: linux/include/linux/sched.h
===================================================================
--- linux.orig/include/linux/sched.h
+++ linux/include/linux/sched.h
@@ -931,6 +931,13 @@ struct task_struct {
int hardirq_context;
int softirq_context;
#endif
+#ifdef CONFIG_LOCKDEP
+# define MAX_LOCK_DEPTH 30UL
+ u64 curr_chain_key;
+ int lockdep_depth;
+ struct held_lock held_locks[MAX_LOCK_DEPTH];
+#endif
+ unsigned int lockdep_recursion;

/* journalling filesystem info */
void *journal_info;
@@ -1350,6 +1357,11 @@ static inline void task_lock(struct task
spin_lock(&p->alloc_lock);
}

+static inline void task_lock_free(struct task_struct *p)
+{
+ spin_lock_nested(&p->alloc_lock, SINGLE_DEPTH_NESTING);
+}
+
static inline void task_unlock(struct task_struct *p)
{
spin_unlock(&p->alloc_lock);
Index: linux/include/linux/trace_irqflags.h
===================================================================
--- linux.orig/include/linux/trace_irqflags.h
+++ linux/include/linux/trace_irqflags.h
@@ -66,7 +66,18 @@
} \
} while (0)

-#define local_irq_enable_in_hardirq() local_irq_enable()
+/*
+ * On lockdep we dont want to enable hardirqs in hardirq
+ * context. NOTE: in theory this might break fragile code
+ * that relies on hardirq delivery - in practice we dont
+ * seem to have such places left. So the only effect should
+ * be slightly increased irqs-off latencies.
+ */
+#ifdef CONFIG_LOCKDEP
+# define local_irq_enable_in_hardirq() do { } while (0)
+#else
+# define local_irq_enable_in_hardirq() local_irq_enable()
+#endif

#define safe_halt() \
do { \
Index: linux/init/main.c
===================================================================
--- linux.orig/init/main.c
+++ linux/init/main.c
@@ -54,6 +54,7 @@
#include <linux/root_dev.h>
#include <linux/buffer_head.h>
#include <linux/debug_locks.h>
+#include <linux/lockdep.h>

#include <asm/io.h>
#include <asm/bugs.h>
@@ -80,6 +81,7 @@

static int init(void *);

+extern void early_init_irq_lock_type(void);
extern void init_IRQ(void);
extern void fork_init(unsigned long);
extern void mca_init(void);
@@ -461,6 +463,17 @@ asmlinkage void __init start_kernel(void
{
char * command_line;
extern struct kernel_param __start___param[], __stop___param[];
+
+ /*
+ * Need to run as early as possible, to initialize the
+ * lockdep hash:
+ */
+ lockdep_init();
+
+ local_irq_disable();
+ early_boot_irqs_off();
+ early_init_irq_lock_type();
+
/*
* Interrupts are still disabled. Do necessary setups, then
* enable them
@@ -512,8 +525,11 @@ asmlinkage void __init start_kernel(void
if (panic_later)
panic(panic_later, panic_param);
profile_init();
+ early_boot_irqs_on();
local_irq_enable();

+ lockdep_info();
+
/*
* Need to run this when irqs are enabled, because it wants
* to self-test [hard/soft]-irqs on/off lock inversion bugs
Index: linux/kernel/Makefile
===================================================================
--- linux.orig/kernel/Makefile
+++ linux/kernel/Makefile
@@ -12,6 +12,7 @@ obj-y = sched.o fork.o exec_domain.o

obj-y += time/
obj-$(CONFIG_DEBUG_MUTEXES) += mutex-debug.o
+obj-$(CONFIG_LOCKDEP) += lockdep.o
obj-$(CONFIG_FUTEX) += futex.o
ifeq ($(CONFIG_COMPAT),y)
obj-$(CONFIG_FUTEX) += futex_compat.o
Index: linux/kernel/fork.c
===================================================================
--- linux.orig/kernel/fork.c
+++ linux/kernel/fork.c
@@ -1049,6 +1049,11 @@ static task_t *copy_process(unsigned lon
}
mpol_fix_fork_child_flag(p);
#endif
+#ifdef CONFIG_LOCKDEP
+ p->lockdep_depth = 0; /* no locks held yet */
+ p->curr_chain_key = 0;
+ p->lockdep_recursion = 0;
+#endif

rt_mutex_init_task(p);

Index: linux/kernel/irq/manage.c
===================================================================
--- linux.orig/kernel/irq/manage.c
+++ linux/kernel/irq/manage.c
@@ -406,6 +406,12 @@ int request_irq(unsigned int irq,
immediately, so let's make sure....
We do this before actually registering it, to make sure that a 'real'
IRQ doesn't run in parallel with our fake. */
+#ifdef CONFIG_LOCKDEP
+ /*
+ * Lockdep wants atomic interrupt handlers:
+ */
+ irqflags |= SA_INTERRUPT;
+#endif
if (irqflags & SA_INTERRUPT) {
unsigned long flags;

Index: linux/kernel/lockdep.c
===================================================================
--- /dev/null
+++ linux/kernel/lockdep.c
@@ -0,0 +1,2633 @@
+/*
+ * kernel/lockdep.c
+ *
+ * Runtime locking correctness validator
+ *
+ * Started by Ingo Molnar:
+ *
+ * Copyright (C) 2006 Red Hat, Inc., Ingo Molnar <[email protected]>
+ *
+ * this code maps all the lock dependencies as they occur in a live kernel
+ * and will warn about the following types of locking bugs:
+ *
+ * - lock inversion scenarios
+ * - circular lock dependencies
+ * - hardirq/softirq safe/unsafe locking bugs
+ *
+ * Bugs are reported even if the current locking scenario does not cause
+ * any deadlock at this point.
+ *
+ * I.e. if anytime in the past two locks were taken in a different order,
+ * even if it happened for another task, even if those were different
+ * locks (but of the same type as this lock), this code will detect it.
+ *
+ * Thanks to Arjan van de Ven for coming up with the initial idea of
+ * mapping lock dependencies runtime.
+ */
+#include <linux/mutex.h>
+#include <linux/sched.h>
+#include <linux/delay.h>
+#include <linux/module.h>
+#include <linux/proc_fs.h>
+#include <linux/seq_file.h>
+#include <linux/spinlock.h>
+#include <linux/kallsyms.h>
+#include <linux/interrupt.h>
+#include <linux/stacktrace.h>
+#include <linux/debug_locks.h>
+#include <linux/trace_irqflags.h>
+
+#include <asm/sections.h>
+
+#include "lockdep_internals.h"
+
+/*
+ * hash_lock: protects the lockdep hashes and type/list/hash allocators.
+ *
+ * This is one of the rare exceptions where it's justified
+ * to use a raw spinlock - we really dont want the spinlock
+ * code to recurse back into the lockdep code.
+ */
+static raw_spinlock_t hash_lock = (raw_spinlock_t)__RAW_SPIN_LOCK_UNLOCKED;
+
+static int lockdep_initialized;
+
+unsigned long nr_list_entries;
+static struct lock_list list_entries[MAX_LOCKDEP_ENTRIES];
+
+/*
+ * Allocate a lockdep entry. (assumes hash_lock held, returns
+ * with NULL on failure)
+ */
+static struct lock_list *alloc_list_entry(void)
+{
+ if (nr_list_entries >= MAX_LOCKDEP_ENTRIES) {
+ __raw_spin_unlock(&hash_lock);
+ debug_locks_off();
+ printk("BUG: MAX_LOCKDEP_ENTRIES too low!\n");
+ printk("turning off the locking correctness validator.\n");
+ return NULL;
+ }
+ return list_entries + nr_list_entries++;
+}
+
+/*
+ * All data structures here are protected by the global debug_lock.
+ *
+ * Mutex key structs only get allocated, once during bootup, and never
+ * get freed - this significantly simplifies the debugging code.
+ */
+unsigned long nr_lock_types;
+static struct lock_type lock_types[MAX_LOCKDEP_KEYS];
+
+/*
+ * We keep a global list of all lock types. The list only grows,
+ * never shrinks. The list is only accessed with the lockdep
+ * spinlock lock held.
+ */
+LIST_HEAD(all_lock_types);
+
+/*
+ * The lockdep types are in a hash-table as well, for fast lookup:
+ */
+#define TYPEHASH_BITS (MAX_LOCKDEP_KEYS_BITS - 1)
+#define TYPEHASH_SIZE (1UL << TYPEHASH_BITS)
+#define TYPEHASH_MASK (TYPEHASH_SIZE - 1)
+#define __typehashfn(key) ((((unsigned long)key >> TYPEHASH_BITS) + (unsigned long)key) & TYPEHASH_MASK)
+#define typehashentry(key) (typehash_table + __typehashfn((key)))
+
+static struct list_head typehash_table[TYPEHASH_SIZE];
+
+unsigned long nr_lock_chains;
+static struct lock_chain lock_chains[MAX_LOCKDEP_CHAINS];
+
+/*
+ * We put the lock dependency chains into a hash-table as well, to cache
+ * their existence:
+ */
+#define CHAINHASH_BITS (MAX_LOCKDEP_CHAINS_BITS-1)
+#define CHAINHASH_SIZE (1UL << CHAINHASH_BITS)
+#define CHAINHASH_MASK (CHAINHASH_SIZE - 1)
+#define __chainhashfn(chain) \
+ (((chain >> CHAINHASH_BITS) + chain) & CHAINHASH_MASK)
+#define chainhashentry(chain) (chainhash_table + __chainhashfn((chain)))
+
+static struct list_head chainhash_table[CHAINHASH_SIZE];
+
+/*
+ * The hash key of the lock dependency chains is a hash itself too:
+ * it's a hash of all locks taken up to that lock, including that lock.
+ * It's a 64-bit hash, because it's important for the keys to be
+ * unique.
+ */
+#define iterate_chain_key(key1, key2) \
+ (((key1) << MAX_LOCKDEP_KEYS_BITS/2) ^ \
+ ((key1) >> (64-MAX_LOCKDEP_KEYS_BITS/2)) ^ \
+ (key2))
+
+/*
+ * Debugging switches:
+ */
+#define LOCKDEP_OFF 0
+
+#define VERBOSE 0
+
+#if VERBOSE
+# define HARDIRQ_VERBOSE 1
+# define SOFTIRQ_VERBOSE 1
+#else
+# define HARDIRQ_VERBOSE 0
+# define SOFTIRQ_VERBOSE 0
+#endif
+
+#if VERBOSE || HARDIRQ_VERBOSE || SOFTIRQ_VERBOSE
+/*
+ * Quick filtering for interesting events:
+ */
+static int type_filter(struct lock_type *type)
+{
+ if (type->name_version == 2 &&
+ !strcmp(type->name, "xfrm_state_afinfo_lock"))
+ return 1;
+ if ((type->name_version == 2 || type->name_version == 4) &&
+ !strcmp(type->name, "&mc->mca_lock"))
+ return 1;
+ return 0;
+}
+#endif
+
+static int verbose(struct lock_type *type)
+{
+#if VERBOSE
+ return type_filter(type);
+#endif
+ return 0;
+}
+
+static int hardirq_verbose(struct lock_type *type)
+{
+#if HARDIRQ_VERBOSE
+ return type_filter(type);
+#endif
+ return 0;
+}
+
+static int softirq_verbose(struct lock_type *type)
+{
+#if SOFTIRQ_VERBOSE
+ return type_filter(type);
+#endif
+ return 0;
+}
+
+/*
+ * Stack-trace: tightly packed array of stack backtrace
+ * addresses. Protected by the hash_lock.
+ */
+unsigned long nr_stack_trace_entries;
+static unsigned long stack_trace[MAX_STACK_TRACE_ENTRIES];
+
+static int save_trace(struct stack_trace *trace)
+{
+ trace->nr_entries = 0;
+ trace->max_entries = MAX_STACK_TRACE_ENTRIES - nr_stack_trace_entries;
+ trace->entries = stack_trace + nr_stack_trace_entries;
+
+ save_stack_trace(trace, NULL, 0, 3);
+
+ trace->max_entries = trace->nr_entries;
+
+ nr_stack_trace_entries += trace->nr_entries;
+ if (DEBUG_WARN_ON(nr_stack_trace_entries > MAX_STACK_TRACE_ENTRIES))
+ return 0;
+
+ if (nr_stack_trace_entries == MAX_STACK_TRACE_ENTRIES) {
+ __raw_spin_unlock(&hash_lock);
+ if (debug_locks_off()) {
+ printk("BUG: MAX_STACK_TRACE_ENTRIES too low!\n");
+ printk("turning off the locking correctness validator.\n");
+ dump_stack();
+ }
+ return 0;
+ }
+
+ return 1;
+}
+
+unsigned int nr_hardirq_chains;
+unsigned int nr_softirq_chains;
+unsigned int nr_process_chains;
+unsigned int max_lockdep_depth;
+unsigned int max_recursion_depth;
+
+#ifdef CONFIG_DEBUG_LOCKDEP
+/*
+ * We cannot printk in early bootup code. Not even early_printk()
+ * might work. So we mark any initialization errors and printk
+ * about it later on, in lockdep_info().
+ */
+int lockdep_init_error;
+
+/*
+ * Various lockdep statistics:
+ */
+atomic_t chain_lookup_hits;
+atomic_t chain_lookup_misses;
+atomic_t hardirqs_on_events;
+atomic_t hardirqs_off_events;
+atomic_t redundant_hardirqs_on;
+atomic_t redundant_hardirqs_off;
+atomic_t softirqs_on_events;
+atomic_t softirqs_off_events;
+atomic_t redundant_softirqs_on;
+atomic_t redundant_softirqs_off;
+atomic_t nr_unused_locks;
+atomic_t nr_hardirq_safe_locks;
+atomic_t nr_softirq_safe_locks;
+atomic_t nr_hardirq_unsafe_locks;
+atomic_t nr_softirq_unsafe_locks;
+atomic_t nr_hardirq_read_safe_locks;
+atomic_t nr_softirq_read_safe_locks;
+atomic_t nr_hardirq_read_unsafe_locks;
+atomic_t nr_softirq_read_unsafe_locks;
+atomic_t nr_cyclic_checks;
+atomic_t nr_cyclic_check_recursions;
+atomic_t nr_find_usage_forwards_checks;
+atomic_t nr_find_usage_forwards_recursions;
+atomic_t nr_find_usage_backwards_checks;
+atomic_t nr_find_usage_backwards_recursions;
+# define debug_atomic_inc(ptr) atomic_inc(ptr)
+# define debug_atomic_dec(ptr) atomic_dec(ptr)
+# define debug_atomic_read(ptr) atomic_read(ptr)
+#else
+# define debug_atomic_inc(ptr) do { } while (0)
+# define debug_atomic_dec(ptr) do { } while (0)
+# define debug_atomic_read(ptr) 0
+#endif
+
+/*
+ * Locking printouts:
+ */
+
+static const char *usage_str[] =
+{
+ [LOCK_USED] = "initial-use ",
+ [LOCK_USED_IN_HARDIRQ] = "in-hardirq-W",
+ [LOCK_USED_IN_SOFTIRQ] = "in-softirq-W",
+ [LOCK_ENABLED_SOFTIRQS] = "softirq-on-W",
+ [LOCK_ENABLED_HARDIRQS] = "hardirq-on-W",
+ [LOCK_USED_IN_HARDIRQ_READ] = "in-hardirq-R",
+ [LOCK_USED_IN_SOFTIRQ_READ] = "in-softirq-R",
+ [LOCK_ENABLED_SOFTIRQS_READ] = "softirq-on-R",
+ [LOCK_ENABLED_HARDIRQS_READ] = "hardirq-on-R",
+};
+
+static void printk_sym(unsigned long ip)
+{
+ printk(" [<%08lx>]", ip);
+ print_symbol(" %s\n", ip);
+}
+
+const char * __get_key_name(struct lockdep_subtype_key *key, char *str)
+{
+ unsigned long offs, size;
+ char *modname;
+
+ return kallsyms_lookup((unsigned long)key, &size, &offs, &modname, str);
+}
+
+void
+get_usage_chars(struct lock_type *type, char *c1, char *c2, char *c3, char *c4)
+{
+ *c1 = '.', *c2 = '.', *c3 = '.', *c4 = '.';
+
+ if (type->usage_mask & LOCKF_USED_IN_HARDIRQ)
+ *c1 = '+';
+ else
+ if (type->usage_mask & LOCKF_ENABLED_HARDIRQS)
+ *c1 = '-';
+
+ if (type->usage_mask & LOCKF_USED_IN_SOFTIRQ)
+ *c2 = '+';
+ else
+ if (type->usage_mask & LOCKF_ENABLED_SOFTIRQS)
+ *c2 = '-';
+
+ if (type->usage_mask & LOCKF_ENABLED_HARDIRQS_READ)
+ *c3 = '-';
+ if (type->usage_mask & LOCKF_USED_IN_HARDIRQ_READ) {
+ *c3 = '+';
+ if (type->usage_mask & LOCKF_ENABLED_HARDIRQS_READ)
+ *c3 = '?';
+ }
+
+ if (type->usage_mask & LOCKF_ENABLED_SOFTIRQS_READ)
+ *c4 = '-';
+ if (type->usage_mask & LOCKF_USED_IN_SOFTIRQ_READ) {
+ *c4 = '+';
+ if (type->usage_mask & LOCKF_ENABLED_SOFTIRQS_READ)
+ *c4 = '?';
+ }
+}
+
+static void print_lock_name(struct lock_type *type)
+{
+ char str[128], c1, c2, c3, c4;
+ const char *name;
+
+ get_usage_chars(type, &c1, &c2, &c3, &c4);
+
+ name = type->name;
+ if (!name) {
+ name = __get_key_name(type->key, str);
+ printk(" (%s", name);
+ } else {
+ printk(" (%s", name);
+ if (type->name_version > 1)
+ printk("#%d", type->name_version);
+ if (type->subtype)
+ printk("/%d", type->subtype);
+ }
+ printk("){%c%c%c%c}", c1, c2, c3, c4);
+}
+
+static void print_lock_name_field(struct lock_type *type)
+{
+ const char *name;
+ char str[128];
+
+ name = type->name;
+ if (!name) {
+ name = __get_key_name(type->key, str);
+ printk("%30s", name);
+ } else {
+ printk("%30s", name);
+ if (type->name_version > 1)
+ printk("#%d", type->name_version);
+ if (type->subtype)
+ printk("/%d", type->subtype);
+ }
+}
+
+static void print_lockdep_cache(struct lockdep_map *lock)
+{
+ const char *name;
+ char str[128];
+
+ name = lock->name;
+ if (!name)
+ name = __get_key_name(lock->key->subkeys, str);
+
+ printk("%s", name);
+}
+
+static void print_lock(struct held_lock *hlock)
+{
+ print_lock_name(hlock->type);
+ printk(", at:");
+ printk_sym(hlock->acquire_ip);
+}
+
+void lockdep_print_held_locks(struct task_struct *curr)
+{
+ int i;
+
+ if (!curr->lockdep_depth) {
+ printk("no locks held by %s/%d.\n", curr->comm, curr->pid);
+ return;
+ }
+ printk("%d locks held by %s/%d:\n",
+ curr->lockdep_depth, curr->comm, curr->pid);
+
+ for (i = 0; i < curr->lockdep_depth; i++) {
+ printk(" #%d: ", i);
+ print_lock(curr->held_locks + i);
+ }
+}
+/*
+ * Helper to print a nice hierarchy of lock dependencies:
+ */
+static void print_spaces(int nr)
+{
+ int i;
+
+ for (i = 0; i < nr; i++)
+ printk(" ");
+}
+
+void print_lock_type_header(struct lock_type *type, int depth)
+{
+ int bit;
+
+ print_spaces(depth);
+ printk("->");
+ print_lock_name(type);
+ printk(" ops: %lu", type->ops);
+ printk(" {\n");
+
+ for (bit = 0; bit < LOCK_USAGE_STATES; bit++) {
+ if (type->usage_mask & (1 << bit)) {
+ int len = depth;
+
+ print_spaces(depth);
+ len += printk(" %s", usage_str[bit]);
+ len += printk(" at:\n");
+ print_stack_trace(type->usage_traces + bit, len);
+ }
+ }
+ print_spaces(depth);
+ printk(" }\n");
+
+ print_spaces(depth);
+ printk(" ... key at:");
+ printk_sym((unsigned long)type->key);
+}
+
+/*
+ * printk all lock dependencies starting at <entry>:
+ */
+static void print_lock_dependencies(struct lock_type *type, int depth)
+{
+ struct lock_list *entry;
+
+ if (DEBUG_WARN_ON(depth >= 20))
+ return;
+
+ print_lock_type_header(type, depth);
+
+ list_for_each_entry(entry, &type->locks_after, entry) {
+ DEBUG_WARN_ON(!entry->type);
+ print_lock_dependencies(entry->type, depth + 1);
+
+ print_spaces(depth);
+ printk(" ... acquired at:\n");
+ print_stack_trace(&entry->trace, 2);
+ printk("\n");
+ }
+}
+
+/*
+ * printk all locks that are taken after this lock:
+ */
+static void print_flat_dependencies(struct lock_type *type)
+{
+ struct lock_list *entry;
+ int nr = 0;
+
+ printk(" {\n");
+ list_for_each_entry(entry, &type->locks_after, entry) {
+ nr++;
+ DEBUG_WARN_ON(!entry->type);
+ printk(" -> ");
+ print_lock_name_field(entry->type);
+ if (entry->type->subtype)
+ printk("/%d", entry->type->subtype);
+ print_stack_trace(&entry->trace, 2);
+ }
+ printk(" } [%d]", nr);
+}
+
+void print_lock_type(struct lock_type *type)
+{
+ print_lock_type_header(type, 0);
+ if (!list_empty(&type->locks_after))
+ print_flat_dependencies(type);
+ printk("\n");
+}
+
+void print_lock_types(void)
+{
+ struct list_head *head;
+ struct lock_type *type;
+ int i, nr;
+
+ printk("lock types:\n");
+
+ for (i = 0; i < TYPEHASH_SIZE; i++) {
+ head = typehash_table + i;
+ if (list_empty(head))
+ continue;
+ printk("\nhash-list at %d:\n", i);
+ nr = 0;
+ list_for_each_entry(type, head, hash_entry) {
+ printk("\n");
+ print_lock_type(type);
+ nr++;
+ }
+ }
+}
+
+/*
+ * Add a new dependency to the head of the list:
+ */
+static int add_lock_to_list(struct lock_type *type, struct lock_type *this,
+ struct list_head *head, unsigned long ip)
+{
+ struct lock_list *entry;
+ /*
+ * Lock not present yet - get a new dependency struct and
+ * add it to the list:
+ */
+ entry = alloc_list_entry();
+ if (!entry)
+ return 0;
+
+ entry->type = this;
+ save_trace(&entry->trace);
+
+ /*
+ * Since we never remove from the dependency list, the list can
+ * be walked lockless by other CPUs, it's only allocation
+ * that must be protected by the spinlock. But this also means
+ * we must make new entries visible only once writes to the
+ * entry become visible - hence the RCU op:
+ */
+ list_add_tail_rcu(&entry->entry, head);
+
+ return 1;
+}
+
+/*
+ * Recursive, forwards-direction lock-dependency checking, used for
+ * both noncyclic checking and for hardirq-unsafe/softirq-unsafe
+ * checking.
+ *
+ * (to keep the stackframe of the recursive functions small we
+ * use these global variables, and we also mark various helper
+ * functions as noinline.)
+ */
+static struct held_lock *check_source, *check_target;
+
+/*
+ * Print a dependency chain entry (this is only done when a deadlock
+ * has been detected):
+ */
+static noinline int
+print_circular_bug_entry(struct lock_list *target, unsigned int depth)
+{
+ if (debug_locks_silent)
+ return 0;
+ printk("\n-> #%u", depth);
+ print_lock_name(target->type);
+ printk(":\n");
+ print_stack_trace(&target->trace, 6);
+
+ return 0;
+}
+
+/*
+ * When a circular dependency is detected, print the
+ * header first:
+ */
+static noinline int
+print_circular_bug_header(struct lock_list *entry, unsigned int depth)
+{
+ struct task_struct *curr = current;
+
+ __raw_spin_unlock(&hash_lock);
+ debug_locks_off();
+ if (debug_locks_silent)
+ return 0;
+
+ printk("\n=====================================================\n");
+ printk( "[ BUG: possible circular locking deadlock detected! ]\n");
+ printk( "-----------------------------------------------------\n");
+ printk("%s/%d is trying to acquire lock:\n",
+ curr->comm, curr->pid);
+ print_lock(check_source);
+ printk("\nbut task is already holding lock:\n");
+ print_lock(check_target);
+ printk("\nwhich lock already depends on the new lock,\n");
+ printk("which could lead to circular deadlocks!\n");
+ printk("\nthe existing dependency chain (in reverse order) is:\n");
+
+ print_circular_bug_entry(entry, depth);
+
+ return 0;
+}
+
+static noinline int print_circular_bug_tail(void)
+{
+ struct task_struct *curr = current;
+ struct lock_list this;
+
+ if (debug_locks_silent)
+ return 0;
+
+ this.type = check_source->type;
+ save_trace(&this.trace);
+ print_circular_bug_entry(&this, 0);
+
+ printk("\nother info that might help us debug this:\n\n");
+ lockdep_print_held_locks(curr);
+
+ printk("\nstack backtrace:\n");
+ dump_stack();
+
+ return 0;
+}
+
+static int noinline print_infinite_recursion_bug(void)
+{
+ __raw_spin_unlock(&hash_lock);
+ DEBUG_WARN_ON(1);
+
+ return 0;
+}
+
+/*
+ * Prove that the dependency graph starting at <entry> can not
+ * lead to <target>. Print an error and return 0 if it does.
+ */
+static noinline int
+check_noncircular(struct lock_type *source, unsigned int depth)
+{
+ struct lock_list *entry;
+
+ debug_atomic_inc(&nr_cyclic_check_recursions);
+ if (depth > max_recursion_depth)
+ max_recursion_depth = depth;
+ if (depth >= 20)
+ return print_infinite_recursion_bug();
+ /*
+ * Check this lock's dependency list:
+ */
+ list_for_each_entry(entry, &source->locks_after, entry) {
+ if (entry->type == check_target->type)
+ return print_circular_bug_header(entry, depth+1);
+ debug_atomic_inc(&nr_cyclic_checks);
+ if (!check_noncircular(entry->type, depth+1))
+ return print_circular_bug_entry(entry, depth+1);
+ }
+ return 1;
+}
+
+#ifdef CONFIG_TRACE_IRQFLAGS
+
+/*
+ * Forwards and backwards subgraph searching, for the purposes of
+ * proving that two subgraphs can be connected by a new dependency
+ * without creating any illegal irq-safe -> irq-unsafe lock dependency.
+ */
+static enum lock_usage_bit find_usage_bit;
+static struct lock_type *forwards_match, *backwards_match;
+
+/*
+ * Find a node in the forwards-direction dependency sub-graph starting
+ * at <source> that matches <find_usage_bit>.
+ *
+ * Return 2 if such a node exists in the subgraph, and put that node
+ * into <forwards_match>.
+ *
+ * Return 1 otherwise and keep <forwards_match> unchanged.
+ * Return 0 on error.
+ */
+static noinline int
+find_usage_forwards(struct lock_type *source, unsigned int depth)
+{
+ struct lock_list *entry;
+ int ret;
+
+ if (depth > max_recursion_depth)
+ max_recursion_depth = depth;
+ if (depth >= 20)
+ return print_infinite_recursion_bug();
+
+ debug_atomic_inc(&nr_find_usage_forwards_checks);
+ if (source->usage_mask & (1 << find_usage_bit)) {
+ forwards_match = source;
+ return 2;
+ }
+
+ /*
+ * Check this lock's dependency list:
+ */
+ list_for_each_entry(entry, &source->locks_after, entry) {
+ debug_atomic_inc(&nr_find_usage_forwards_recursions);
+ ret = find_usage_forwards(entry->type, depth+1);
+ if (ret == 2 || ret == 0)
+ return ret;
+ }
+ return 1;
+}
+
+/*
+ * Find a node in the backwards-direction dependency sub-graph starting
+ * at <source> that matches <find_usage_bit>.
+ *
+ * Return 2 if such a node exists in the subgraph, and put that node
+ * into <backwards_match>.
+ *
+ * Return 1 otherwise and keep <backwards_match> unchanged.
+ * Return 0 on error.
+ */
+static noinline int
+find_usage_backwards(struct lock_type *source, unsigned int depth)
+{
+ struct lock_list *entry;
+ int ret;
+
+ if (depth > max_recursion_depth)
+ max_recursion_depth = depth;
+ if (depth >= 20)
+ return print_infinite_recursion_bug();
+
+ debug_atomic_inc(&nr_find_usage_backwards_checks);
+ if (source->usage_mask & (1 << find_usage_bit)) {
+ backwards_match = source;
+ return 2;
+ }
+
+ /*
+ * Check this lock's dependency list:
+ */
+ list_for_each_entry(entry, &source->locks_before, entry) {
+ debug_atomic_inc(&nr_find_usage_backwards_recursions);
+ ret = find_usage_backwards(entry->type, depth+1);
+ if (ret == 2 || ret == 0)
+ return ret;
+ }
+ return 1;
+}
+
+static int
+print_bad_irq_dependency(struct task_struct *curr,
+ struct held_lock *prev,
+ struct held_lock *next,
+ enum lock_usage_bit bit1,
+ enum lock_usage_bit bit2,
+ const char *irqtype)
+{
+ __raw_spin_unlock(&hash_lock);
+ debug_locks_off();
+ if (debug_locks_silent)
+ return 0;
+
+ printk("\n======================================================\n");
+ printk( "[ BUG: %s-safe -> %s-unsafe lock order detected! ]\n",
+ irqtype, irqtype);
+ printk( "------------------------------------------------------\n");
+ printk("%s/%d [HC%u[%lu]:SC%u[%lu]:HE%u:SE%u] is trying to acquire:\n",
+ curr->comm, curr->pid,
+ curr->hardirq_context, hardirq_count() >> HARDIRQ_SHIFT,
+ curr->softirq_context, softirq_count() >> SOFTIRQ_SHIFT,
+ curr->hardirqs_enabled,
+ curr->softirqs_enabled);
+ print_lock(next);
+
+ printk("\nand this task is already holding:\n");
+ print_lock(prev);
+ printk("which would create a new lock dependency:\n");
+ print_lock_name(prev->type);
+ printk(" ->");
+ print_lock_name(next->type);
+ printk("\n");
+
+ printk("\nbut this new dependency connects a %s-irq-safe lock:\n",
+ irqtype);
+ print_lock_name(backwards_match);
+ printk("\n... which became %s-irq-safe at:\n", irqtype);
+
+ print_stack_trace(backwards_match->usage_traces + bit1, 1);
+
+ printk("\nto a %s-irq-unsafe lock:\n", irqtype);
+ print_lock_name(forwards_match);
+ printk("\n... which became %s-irq-unsafe at:\n", irqtype);
+ printk("...");
+
+ print_stack_trace(forwards_match->usage_traces + bit2, 1);
+
+ printk("\nwhich could potentially lead to deadlocks!\n");
+
+ printk("\nother info that might help us debug this:\n\n");
+ lockdep_print_held_locks(curr);
+
+ printk("\nthe %s-irq-safe lock's dependencies:\n", irqtype);
+ print_lock_dependencies(backwards_match, 0);
+
+ printk("\nthe %s-irq-unsafe lock's dependencies:\n", irqtype);
+ print_lock_dependencies(forwards_match, 0);
+
+ printk("\nstack backtrace:\n");
+ dump_stack();
+
+ return 0;
+}
+
+static int
+check_usage(struct task_struct *curr, struct held_lock *prev,
+ struct held_lock *next, enum lock_usage_bit bit_backwards,
+ enum lock_usage_bit bit_forwards, const char *irqtype)
+{
+ int ret;
+
+ find_usage_bit = bit_backwards;
+ /* fills in <backwards_match> */
+ ret = find_usage_backwards(prev->type, 0);
+ if (!ret || ret == 1)
+ return ret;
+
+ find_usage_bit = bit_forwards;
+ ret = find_usage_forwards(next->type, 0);
+ if (!ret || ret == 1)
+ return ret;
+ /* ret == 2 */
+ return print_bad_irq_dependency(curr, prev, next,
+ bit_backwards, bit_forwards, irqtype);
+}
+
+#endif
+
+static int
+print_deadlock_bug(struct task_struct *curr, struct held_lock *prev,
+ struct held_lock *next)
+{
+ debug_locks_off();
+ __raw_spin_unlock(&hash_lock);
+ if (debug_locks_silent)
+ return 0;
+
+ printk("\n====================================\n");
+ printk( "[ BUG: possible deadlock detected! ]\n");
+ printk( "------------------------------------\n");
+ printk("%s/%d is trying to acquire lock:\n",
+ curr->comm, curr->pid);
+ print_lock(next);
+ printk("\nbut task is already holding lock:\n");
+ print_lock(prev);
+ printk("\nwhich could potentially lead to deadlocks!\n");
+
+ printk("\nother info that might help us debug this:\n");
+ lockdep_print_held_locks(curr);
+
+ printk("\nstack backtrace:\n");
+ dump_stack();
+
+ return 0;
+}
+
+/*
+ * Check whether we are holding such a type already.
+ *
+ * (Note that this has to be done separately, because the graph cannot
+ * detect such types of deadlocks.)
+ *
+ * Returns: 0 on deadlock detected, 1 on OK, 2 on recursive read
+ */
+static int
+check_deadlock(struct task_struct *curr, struct held_lock *next,
+ struct lockdep_map *next_instance, int read)
+{
+ struct held_lock *prev;
+ int i;
+
+ for (i = 0; i < curr->lockdep_depth; i++) {
+ prev = curr->held_locks + i;
+ if (prev->type != next->type)
+ continue;
+ /*
+ * Allow read-after-read recursion of the same
+ * lock instance (i.e. read_lock(lock)+read_lock(lock)):
+ */
+ if ((read > 0) && prev->read &&
+ (prev->instance == next_instance))
+ return 2;
+ return print_deadlock_bug(curr, prev, next);
+ }
+ return 1;
+}
+
+/*
+ * There was a chain-cache miss, and we are about to add a new dependency
+ * to a previous lock. We recursively validate the following rules:
+ *
+ * - would the adding of the <prev> -> <next> dependency create a
+ * circular dependency in the graph? [== circular deadlock]
+ *
+ * - does the new prev->next dependency connect any hardirq-safe lock
+ * (in the full backwards-subgraph starting at <prev>) with any
+ * hardirq-unsafe lock (in the full forwards-subgraph starting at
+ * <next>)? [== illegal lock inversion with hardirq contexts]
+ *
+ * - does the new prev->next dependency connect any softirq-safe lock
+ * (in the full backwards-subgraph starting at <prev>) with any
+ * softirq-unsafe lock (in the full forwards-subgraph starting at
+ * <next>)? [== illegal lock inversion with softirq contexts]
+ *
+ * any of these scenarios could lead to a deadlock.
+ *
+ * Then if all the validations pass, we add the forwards and backwards
+ * dependency.
+ */
+static int
+check_prev_add(struct task_struct *curr, struct held_lock *prev,
+ struct held_lock *next)
+{
+ struct lock_list *entry;
+ int ret;
+
+ /*
+ * Prove that the new <prev> -> <next> dependency would not
+ * create a circular dependency in the graph. (We do this by
+ * forward-recursing into the graph starting at <next>, and
+ * checking whether we can reach <prev>.)
+ *
+ * We are using global variables to control the recursion, to
+ * keep the stackframe size of the recursive functions low:
+ */
+ check_source = next;
+ check_target = prev;
+ if (!(check_noncircular(next->type, 0)))
+ return print_circular_bug_tail();
+
+#ifdef CONFIG_TRACE_IRQFLAGS
+ /*
+ * Prove that the new dependency does not connect a hardirq-safe
+ * lock with a hardirq-unsafe lock - to achieve this we search
+ * the backwards-subgraph starting at <prev>, and the
+ * forwards-subgraph starting at <next>:
+ */
+ if (!check_usage(curr, prev, next, LOCK_USED_IN_HARDIRQ,
+ LOCK_ENABLED_HARDIRQS, "hard"))
+ return 0;
+
+ /*
+ * Prove that the new dependency does not connect a hardirq-safe-read
+ * lock with a hardirq-unsafe lock - to achieve this we search
+ * the backwards-subgraph starting at <prev>, and the
+ * forwards-subgraph starting at <next>:
+ */
+ if (!check_usage(curr, prev, next, LOCK_USED_IN_HARDIRQ_READ,
+ LOCK_ENABLED_HARDIRQS, "hard-read"))
+ return 0;
+
+ /*
+ * Prove that the new dependency does not connect a softirq-safe
+ * lock with a softirq-unsafe lock - to achieve this we search
+ * the backwards-subgraph starting at <prev>, and the
+ * forwards-subgraph starting at <next>:
+ */
+ if (!check_usage(curr, prev, next, LOCK_USED_IN_SOFTIRQ,
+ LOCK_ENABLED_SOFTIRQS, "soft"))
+ return 0;
+ /*
+ * Prove that the new dependency does not connect a softirq-safe-read
+ * lock with a softirq-unsafe lock - to achieve this we search
+ * the backwards-subgraph starting at <prev>, and the
+ * forwards-subgraph starting at <next>:
+ */
+ if (!check_usage(curr, prev, next, LOCK_USED_IN_SOFTIRQ_READ,
+ LOCK_ENABLED_SOFTIRQS, "soft"))
+ return 0;
+#endif
+ /*
+ * For recursive read-locks we do all the dependency checks,
+ * but we dont store read-triggered dependencies (only
+ * write-triggered dependencies). This ensures that only the
+ * write-side dependencies matter, and that if for example a
+ * write-lock never takes any other locks, then the reads are
+ * equivalent to a NOP.
+ */
+ if (next->read == 1 || prev->read == 1)
+ return 1;
+ /*
+ * Is the <prev> -> <next> dependency already present?
+ *
+ * (this may occur even though this is a new chain: consider
+ * e.g. the L1 -> L2 -> L3 -> L4 and the L5 -> L1 -> L2 -> L3
+ * chains - the second one will be new, but L1 already has
+ * L2 added to its dependency list, due to the first chain.)
+ */
+ list_for_each_entry(entry, &prev->type->locks_after, entry) {
+ if (entry->type == next->type)
+ return 2;
+ }
+
+ /*
+ * Ok, all validations passed, add the new lock
+ * to the previous lock's dependency list:
+ */
+ ret = add_lock_to_list(prev->type, next->type,
+ &prev->type->locks_after, next->acquire_ip);
+ if (!ret)
+ return 0;
+ /*
+ * Return value of 2 signals 'dependency already added',
+ * in that case we dont have to add the backlink either.
+ */
+ if (ret == 2)
+ return 2;
+ ret = add_lock_to_list(next->type, prev->type,
+ &next->type->locks_before, next->acquire_ip);
+
+ /*
+ * Debugging printouts:
+ */
+ if (verbose(prev->type) || verbose(next->type)) {
+ __raw_spin_unlock(&hash_lock);
+ print_lock_name_field(prev->type);
+ printk(" => ");
+ print_lock_name_field(next->type);
+ printk("\n");
+ dump_stack();
+ __raw_spin_lock(&hash_lock);
+ }
+ return 1;
+}
+
+/*
+ * Add the dependency to all directly-previous locks that are 'relevant'.
+ * The ones that are relevant are (in increasing distance from curr):
+ * all consecutive trylock entries and the final non-trylock entry - or
+ * the end of this context's lock-chain - whichever comes first.
+ */
+static int
+check_prevs_add(struct task_struct *curr, struct held_lock *next)
+{
+ int depth = curr->lockdep_depth;
+ struct held_lock *hlock;
+
+ /*
+ * Debugging checks.
+ *
+ * Depth must not be zero for a non-head lock:
+ */
+ if (!depth)
+ goto out_bug;
+ /*
+ * At least two relevant locks must exist for this
+ * to be a head:
+ */
+ if (curr->held_locks[depth].irq_context !=
+ curr->held_locks[depth-1].irq_context)
+ goto out_bug;
+
+ for (;;) {
+ hlock = curr->held_locks + depth-1;
+ /*
+ * Only non-recursive-read entries get new dependencies
+ * added:
+ */
+ if (hlock->read != 2) {
+ check_prev_add(curr, hlock, next);
+ /*
+ * Stop after the first non-trylock entry,
+ * as non-trylock entries have added their
+ * own direct dependencies already, so this
+ * lock is connected to them indirectly:
+ */
+ if (!hlock->trylock)
+ break;
+ }
+ depth--;
+ /*
+ * End of lock-stack?
+ */
+ if (!depth)
+ break;
+ /*
+ * Stop the search if we cross into another context:
+ */
+ if (curr->held_locks[depth].irq_context !=
+ curr->held_locks[depth-1].irq_context)
+ break;
+ }
+ return 1;
+out_bug:
+ __raw_spin_unlock(&hash_lock);
+ DEBUG_WARN_ON(1);
+
+ return 0;
+}
+
+
+/*
+ * Is this the address of a static object:
+ */
+static int static_obj(void *obj)
+{
+ unsigned long start = (unsigned long) &_stext,
+ end = (unsigned long) &_end,
+ addr = (unsigned long) obj;
+ int i;
+
+ /*
+ * static variable?
+ */
+ if ((addr >= start) && (addr < end))
+ return 1;
+
+#ifdef CONFIG_SMP
+ /*
+ * percpu var?
+ */
+ for_each_possible_cpu(i) {
+ start = (unsigned long) &__per_cpu_start + per_cpu_offset(i);
+ end = (unsigned long) &__per_cpu_end + per_cpu_offset(i);
+
+ if ((addr >= start) && (addr < end))
+ return 1;
+ }
+#endif
+
+ /*
+ * module var?
+ */
+ return __module_address(addr);
+}
+
+/*
+ * To make lock name printouts unique, we calculate a unique
+ * type->name_version generation counter:
+ */
+int count_matching_names(struct lock_type *new_type)
+{
+ struct lock_type *type;
+ int count = 0;
+
+ if (!new_type->name)
+ return 0;
+
+ list_for_each_entry(type, &all_lock_types, lock_entry) {
+ if (new_type->key - new_type->subtype == type->key)
+ return type->name_version;
+ if (!strcmp(type->name, new_type->name))
+ count = max(count, type->name_version);
+ }
+
+ return count + 1;
+}
+
+extern void __error_too_big_MAX_LOCKDEP_SUBTYPES(void);
+
+/*
+ * Register a lock's type in the hash-table, if the type is not present
+ * yet. Otherwise we look it up. We cache the result in the lock object
+ * itself, so actual lookup of the hash should be once per lock object.
+ */
+static inline struct lock_type *
+register_lock_type(struct lockdep_map *lock, unsigned int subtype)
+{
+ struct lockdep_subtype_key *key;
+ struct list_head *hash_head;
+ struct lock_type *type;
+
+#ifdef CONFIG_DEBUG_LOCKDEP
+ /*
+ * If the architecture calls into lockdep before initializing
+ * the hashes then we'll warn about it later. (we cannot printk
+ * right now)
+ */
+ if (unlikely(!lockdep_initialized)) {
+ lockdep_init();
+ lockdep_init_error = 1;
+ }
+#endif
+
+ /*
+ * Static locks do not have their type-keys yet - for them the key
+ * is the lock object itself:
+ */
+ if (unlikely(!lock->key))
+ lock->key = (void *)lock;
+
+ /*
+ * Debug-check: all keys must be persistent!
+ */
+ if (DEBUG_WARN_ON(!static_obj(lock->key))) {
+ debug_locks_off();
+ printk("BUG: trying to register non-static key!\n");
+ printk("turning off the locking correctness validator.\n");
+ dump_stack();
+ return NULL;
+ }
+
+ /*
+ * NOTE: the type-key must be unique. For dynamic locks, a static
+ * lockdep_type_key variable is passed in through the mutex_init()
+ * (or spin_lock_init()) call - which acts as the key. For static
+ * locks we use the lock object itself as the key.
+ */
+ if (sizeof(struct lockdep_type_key) > sizeof(struct lock_type))
+ __error_too_big_MAX_LOCKDEP_SUBTYPES();
+
+ key = lock->key->subkeys + subtype;
+
+ hash_head = typehashentry(key);
+
+ /*
+ * We can walk the hash lockfree, because the hash only
+ * grows, and we are careful when adding entries to the end:
+ */
+ list_for_each_entry(type, hash_head, hash_entry)
+ if (type->key == key)
+ goto out_set;
+
+ __raw_spin_lock(&hash_lock);
+ /*
+ * We have to do the hash-walk again, to avoid races
+ * with another CPU:
+ */
+ list_for_each_entry(type, hash_head, hash_entry)
+ if (type->key == key)
+ goto out_unlock_set;
+ /*
+ * Allocate a new key from the static array, and add it to
+ * the hash:
+ */
+ if (nr_lock_types >= MAX_LOCKDEP_KEYS) {
+ __raw_spin_unlock(&hash_lock);
+ debug_locks_off();
+ printk("BUG: MAX_LOCKDEP_KEYS too low!\n");
+ printk("turning off the locking correctness validator.\n");
+ return NULL;
+ }
+ type = lock_types + nr_lock_types++;
+ debug_atomic_inc(&nr_unused_locks);
+ type->key = key;
+ type->name = lock->name;
+ type->subtype = subtype;
+ INIT_LIST_HEAD(&type->lock_entry);
+ INIT_LIST_HEAD(&type->locks_before);
+ INIT_LIST_HEAD(&type->locks_after);
+ type->name_version = count_matching_names(type);
+ /*
+ * We use RCU's safe list-add method to make
+ * parallel walking of the hash-list safe:
+ */
+ list_add_tail_rcu(&type->hash_entry, hash_head);
+
+ if (verbose(type)) {
+ __raw_spin_unlock(&hash_lock);
+ printk("new type %p: %s", type->key, type->name);
+ if (type->name_version > 1)
+ printk("#%d", type->name_version);
+ printk("\n");
+ dump_stack();
+ __raw_spin_lock(&hash_lock);
+ }
+out_unlock_set:
+ __raw_spin_unlock(&hash_lock);
+
+out_set:
+ lock->type[subtype] = type;
+
+ DEBUG_WARN_ON(type->subtype != subtype);
+
+ return type;
+}
+
+/*
+ * Look up a dependency chain. If the key is not present yet then
+ * add it and return 0 - in this case the new dependency chain is
+ * validated. If the key is already hashed, return 1.
+ */
+static inline int lookup_chain_cache(u64 chain_key)
+{
+ struct list_head *hash_head = chainhashentry(chain_key);
+ struct lock_chain *chain;
+
+ DEBUG_WARN_ON(!irqs_disabled());
+ /*
+ * We can walk it lock-free, because entries only get added
+ * to the hash:
+ */
+ list_for_each_entry(chain, hash_head, entry) {
+ if (chain->chain_key == chain_key) {
+cache_hit:
+ debug_atomic_inc(&chain_lookup_hits);
+ /*
+ * In the debugging case, force redundant checking
+ * by returning 1:
+ */
+#ifdef CONFIG_DEBUG_LOCKDEP
+ __raw_spin_lock(&hash_lock);
+ return 1;
+#endif
+ return 0;
+ }
+ }
+ /*
+ * Allocate a new chain entry from the static array, and add
+ * it to the hash:
+ */
+ __raw_spin_lock(&hash_lock);
+ /*
+ * We have to walk the chain again locked - to avoid duplicates:
+ */
+ list_for_each_entry(chain, hash_head, entry) {
+ if (chain->chain_key == chain_key) {
+ __raw_spin_unlock(&hash_lock);
+ goto cache_hit;
+ }
+ }
+ if (unlikely(nr_lock_chains >= MAX_LOCKDEP_CHAINS)) {
+ __raw_spin_unlock(&hash_lock);
+ debug_locks_off();
+ printk("BUG: MAX_LOCKDEP_CHAINS too low!\n");
+ printk("turning off the locking correctness validator.\n");
+ return 0;
+ }
+ chain = lock_chains + nr_lock_chains++;
+ chain->chain_key = chain_key;
+ list_add_tail_rcu(&chain->entry, hash_head);
+ debug_atomic_inc(&chain_lookup_misses);
+#ifdef CONFIG_TRACE_IRQFLAGS
+ if (current->hardirq_context)
+ nr_hardirq_chains++;
+ else {
+ if (current->softirq_context)
+ nr_softirq_chains++;
+ else
+ nr_process_chains++;
+ }
+#else
+ nr_process_chains++;
+#endif
+
+ return 1;
+}
+
+/*
+ * We are building curr_chain_key incrementally, so double-check
+ * it from scratch, to make sure that it's done correctly:
+ */
+static void check_chain_key(struct task_struct *curr)
+{
+#ifdef CONFIG_DEBUG_LOCKDEP
+ struct held_lock *hlock, *prev_hlock = NULL;
+ unsigned int i, id;
+ u64 chain_key = 0;
+
+ for (i = 0; i < curr->lockdep_depth; i++) {
+ hlock = curr->held_locks + i;
+ if (chain_key != hlock->prev_chain_key) {
+ debug_locks_off();
+ printk("hm#1, depth: %u [%u], %016Lx != %016Lx\n",
+ curr->lockdep_depth, i, chain_key,
+ hlock->prev_chain_key);
+ WARN_ON(1);
+ return;
+ }
+ id = hlock->type - lock_types;
+ DEBUG_WARN_ON(id >= MAX_LOCKDEP_KEYS);
+ if (prev_hlock && (prev_hlock->irq_context !=
+ hlock->irq_context))
+ chain_key = 0;
+ chain_key = iterate_chain_key(chain_key, id);
+ prev_hlock = hlock;
+ }
+ if (chain_key != curr->curr_chain_key) {
+ debug_locks_off();
+ printk("hm#2, depth: %u [%u], %016Lx != %016Lx\n",
+ curr->lockdep_depth, i, chain_key,
+ curr->curr_chain_key);
+ WARN_ON(1);
+ }
+#endif
+}
+
+#ifdef CONFIG_TRACE_IRQFLAGS
+
+/*
+ * print irq inversion bug:
+ */
+static int
+print_irq_inversion_bug(struct task_struct *curr, struct lock_type *other,
+ struct held_lock *this, int forwards,
+ const char *irqtype)
+{
+ __raw_spin_unlock(&hash_lock);
+ debug_locks_off();
+ if (debug_locks_silent)
+ return 0;
+
+ printk("\n==================================================\n");
+ printk( "[ BUG: possible irq lock inversion bug detected! ]\n");
+ printk( "--------------------------------------------------\n");
+ printk("%s/%d just changed the state of lock:\n",
+ curr->comm, curr->pid);
+ print_lock(this);
+ if (forwards)
+ printk("but this lock took another, %s-irq-unsafe lock in the past:\n", irqtype);
+ else
+ printk("but this lock was taken by another, %s-irq-safe lock in the past:\n", irqtype);
+ print_lock_name(other);
+ printk("\n\nand interrupts could create inverse lock ordering between them,\n");
+
+ printk("which could potentially lead to deadlocks!\n");
+
+ printk("\nother info that might help us debug this:\n");
+ lockdep_print_held_locks(curr);
+
+ printk("\nthe first lock's dependencies:\n");
+ print_lock_dependencies(this->type, 0);
+
+ printk("\nthe second lock's dependencies:\n");
+ print_lock_dependencies(other, 0);
+
+ printk("\nstack backtrace:\n");
+ dump_stack();
+
+ return 0;
+}
+
+/*
+ * Prove that in the forwards-direction subgraph starting at <this>
+ * there is no lock matching <mask>:
+ */
+static int
+check_usage_forwards(struct task_struct *curr, struct held_lock *this,
+ enum lock_usage_bit bit, const char *irqtype)
+{
+ int ret;
+
+ find_usage_bit = bit;
+ /* fills in <forwards_match> */
+ ret = find_usage_forwards(this->type, 0);
+ if (!ret || ret == 1)
+ return ret;
+
+ return print_irq_inversion_bug(curr, forwards_match, this, 1, irqtype);
+}
+
+/*
+ * Prove that in the backwards-direction subgraph starting at <this>
+ * there is no lock matching <mask>:
+ */
+static int
+check_usage_backwards(struct task_struct *curr, struct held_lock *this,
+ enum lock_usage_bit bit, const char *irqtype)
+{
+ int ret;
+
+ find_usage_bit = bit;
+ /* fills in <backwards_match> */
+ ret = find_usage_backwards(this->type, 0);
+ if (!ret || ret == 1)
+ return ret;
+
+ return print_irq_inversion_bug(curr, backwards_match, this, 0, irqtype);
+}
+
+static inline void print_irqtrace_events(struct task_struct *curr)
+{
+ printk("irq event stamp: %u\n", curr->irq_events);
+ printk("hardirqs last enabled at (%u): [<%08lx>]",
+ curr->hardirq_enable_event, curr->hardirq_enable_ip);
+ print_symbol(" %s\n", curr->hardirq_enable_ip);
+ printk("hardirqs last disabled at (%u): [<%08lx>]",
+ curr->hardirq_disable_event, curr->hardirq_disable_ip);
+ print_symbol(" %s\n", curr->hardirq_disable_ip);
+ printk("softirqs last enabled at (%u): [<%08lx>]",
+ curr->softirq_enable_event, curr->softirq_enable_ip);
+ print_symbol(" %s\n", curr->softirq_enable_ip);
+ printk("softirqs last disabled at (%u): [<%08lx>]",
+ curr->softirq_disable_event, curr->softirq_disable_ip);
+ print_symbol(" %s\n", curr->softirq_disable_ip);
+}
+
+#else
+static inline void print_irqtrace_events(struct task_struct *curr)
+{
+}
+#endif
+
+static int
+print_usage_bug(struct task_struct *curr, struct held_lock *this,
+ enum lock_usage_bit prev_bit, enum lock_usage_bit new_bit)
+{
+ __raw_spin_unlock(&hash_lock);
+ debug_locks_off();
+ if (debug_locks_silent)
+ return 0;
+
+ printk("\n============================\n");
+ printk( "[ BUG: illegal lock usage! ]\n");
+ printk( "----------------------------\n");
+
+ printk("illegal {%s} -> {%s} usage.\n",
+ usage_str[prev_bit], usage_str[new_bit]);
+
+ printk("%s/%d [HC%u[%lu]:SC%u[%lu]:HE%u:SE%u] takes:\n",
+ curr->comm, curr->pid,
+ trace_hardirq_context(curr), hardirq_count() >> HARDIRQ_SHIFT,
+ trace_softirq_context(curr), softirq_count() >> SOFTIRQ_SHIFT,
+ trace_hardirqs_enabled(curr),
+ trace_softirqs_enabled(curr));
+ print_lock(this);
+
+ printk("{%s} state was registered at:\n", usage_str[prev_bit]);
+ print_stack_trace(this->type->usage_traces + prev_bit, 1);
+
+ print_irqtrace_events(curr);
+ printk("\nother info that might help us debug this:\n");
+ lockdep_print_held_locks(curr);
+
+ printk("\nstack backtrace:\n");
+ dump_stack();
+
+ return 0;
+}
+
+/*
+ * Print out an error if an invalid bit is set:
+ */
+static inline int
+valid_state(struct task_struct *curr, struct held_lock *this,
+ enum lock_usage_bit new_bit, enum lock_usage_bit bad_bit)
+{
+ if (unlikely(this->type->usage_mask & (1 << bad_bit)))
+ return print_usage_bug(curr, this, bad_bit, new_bit);
+ return 1;
+}
+
+#define STRICT_READ_CHECKS 1
+
+/*
+ * Mark a lock with a usage bit, and validate the state transition:
+ */
+static int mark_lock(struct task_struct *curr, struct held_lock *this,
+ enum lock_usage_bit new_bit, unsigned long ip)
+{
+ unsigned int new_mask = 1 << new_bit, ret = 1;
+
+ /*
+ * If already set then do not dirty the cacheline,
+ * nor do any checks:
+ */
+ if (likely(this->type->usage_mask & new_mask))
+ return 1;
+
+ __raw_spin_lock(&hash_lock);
+ /*
+ * Make sure we didnt race:
+ */
+ if (unlikely(this->type->usage_mask & new_mask)) {
+ __raw_spin_unlock(&hash_lock);
+ return 1;
+ }
+
+ this->type->usage_mask |= new_mask;
+
+#ifdef CONFIG_TRACE_IRQFLAGS
+ if (new_bit == LOCK_ENABLED_HARDIRQS ||
+ new_bit == LOCK_ENABLED_HARDIRQS_READ)
+ ip = curr->hardirq_enable_ip;
+ else if (new_bit == LOCK_ENABLED_SOFTIRQS ||
+ new_bit == LOCK_ENABLED_SOFTIRQS_READ)
+ ip = curr->softirq_enable_ip;
+#endif
+ if (!save_trace(this->type->usage_traces + new_bit))
+ return 0;
+
+ switch (new_bit) {
+#ifdef CONFIG_TRACE_IRQFLAGS
+ case LOCK_USED_IN_HARDIRQ:
+ if (!valid_state(curr, this, new_bit, LOCK_ENABLED_HARDIRQS))
+ return 0;
+ if (!valid_state(curr, this, new_bit,
+ LOCK_ENABLED_HARDIRQS_READ))
+ return 0;
+ /*
+ * just marked it hardirq-safe, check that this lock
+ * took no hardirq-unsafe lock in the past:
+ */
+ if (!check_usage_forwards(curr, this,
+ LOCK_ENABLED_HARDIRQS, "hard"))
+ return 0;
+#if STRICT_READ_CHECKS
+ /*
+ * just marked it hardirq-safe, check that this lock
+ * took no hardirq-unsafe-read lock in the past:
+ */
+ if (!check_usage_forwards(curr, this,
+ LOCK_ENABLED_HARDIRQS_READ, "hard-read"))
+ return 0;
+#endif
+ debug_atomic_inc(&nr_hardirq_safe_locks);
+ if (hardirq_verbose(this->type))
+ ret = 2;
+ break;
+ case LOCK_USED_IN_SOFTIRQ:
+ if (!valid_state(curr, this, new_bit, LOCK_ENABLED_SOFTIRQS))
+ return 0;
+ if (!valid_state(curr, this, new_bit,
+ LOCK_ENABLED_SOFTIRQS_READ))
+ return 0;
+ /*
+ * just marked it softirq-safe, check that this lock
+ * took no softirq-unsafe lock in the past:
+ */
+ if (!check_usage_forwards(curr, this,
+ LOCK_ENABLED_SOFTIRQS, "soft"))
+ return 0;
+#if STRICT_READ_CHECKS
+ /*
+ * just marked it softirq-safe, check that this lock
+ * took no softirq-unsafe-read lock in the past:
+ */
+ if (!check_usage_forwards(curr, this,
+ LOCK_ENABLED_SOFTIRQS_READ, "soft-read"))
+ return 0;
+#endif
+ debug_atomic_inc(&nr_softirq_safe_locks);
+ if (softirq_verbose(this->type))
+ ret = 2;
+ break;
+ case LOCK_USED_IN_HARDIRQ_READ:
+ if (!valid_state(curr, this, new_bit, LOCK_ENABLED_HARDIRQS))
+ return 0;
+ /*
+ * just marked it hardirq-read-safe, check that this lock
+ * took no hardirq-unsafe lock in the past:
+ */
+ if (!check_usage_forwards(curr, this,
+ LOCK_ENABLED_HARDIRQS, "hard"))
+ return 0;
+ debug_atomic_inc(&nr_hardirq_read_safe_locks);
+ if (hardirq_verbose(this->type))
+ ret = 2;
+ break;
+ case LOCK_USED_IN_SOFTIRQ_READ:
+ if (!valid_state(curr, this, new_bit, LOCK_ENABLED_SOFTIRQS))
+ return 0;
+ /*
+ * just marked it softirq-read-safe, check that this lock
+ * took no softirq-unsafe lock in the past:
+ */
+ if (!check_usage_forwards(curr, this,
+ LOCK_ENABLED_SOFTIRQS, "soft"))
+ return 0;
+ debug_atomic_inc(&nr_softirq_read_safe_locks);
+ if (softirq_verbose(this->type))
+ ret = 2;
+ break;
+ case LOCK_ENABLED_HARDIRQS:
+ if (!valid_state(curr, this, new_bit, LOCK_USED_IN_HARDIRQ))
+ return 0;
+ if (!valid_state(curr, this, new_bit,
+ LOCK_USED_IN_HARDIRQ_READ))
+ return 0;
+ /*
+ * just marked it hardirq-unsafe, check that no hardirq-safe
+ * lock in the system ever took it in the past:
+ */
+ if (!check_usage_backwards(curr, this,
+ LOCK_USED_IN_HARDIRQ, "hard"))
+ return 0;
+#if STRICT_READ_CHECKS
+ /*
+ * just marked it hardirq-unsafe, check that no
+ * hardirq-safe-read lock in the system ever took
+ * it in the past:
+ */
+ if (!check_usage_backwards(curr, this,
+ LOCK_USED_IN_HARDIRQ_READ, "hard-read"))
+ return 0;
+#endif
+ debug_atomic_inc(&nr_hardirq_unsafe_locks);
+ if (hardirq_verbose(this->type))
+ ret = 2;
+ break;
+ case LOCK_ENABLED_SOFTIRQS:
+ if (!valid_state(curr, this, new_bit, LOCK_USED_IN_SOFTIRQ))
+ return 0;
+ if (!valid_state(curr, this, new_bit,
+ LOCK_USED_IN_SOFTIRQ_READ))
+ return 0;
+ /*
+ * just marked it softirq-unsafe, check that no softirq-safe
+ * lock in the system ever took it in the past:
+ */
+ if (!check_usage_backwards(curr, this,
+ LOCK_USED_IN_SOFTIRQ, "soft"))
+ return 0;
+#if STRICT_READ_CHECKS
+ /*
+ * just marked it softirq-unsafe, check that no
+ * softirq-safe-read lock in the system ever took
+ * it in the past:
+ */
+ if (!check_usage_backwards(curr, this,
+ LOCK_USED_IN_SOFTIRQ_READ, "soft-read"))
+ return 0;
+#endif
+ debug_atomic_inc(&nr_softirq_unsafe_locks);
+ if (softirq_verbose(this->type))
+ ret = 2;
+ break;
+ case LOCK_ENABLED_HARDIRQS_READ:
+ if (!valid_state(curr, this, new_bit, LOCK_USED_IN_HARDIRQ))
+ return 0;
+#if STRICT_READ_CHECKS
+ /*
+ * just marked it hardirq-read-unsafe, check that no
+ * hardirq-safe lock in the system ever took it in the past:
+ */
+ if (!check_usage_backwards(curr, this,
+ LOCK_USED_IN_HARDIRQ, "hard"))
+ return 0;
+#endif
+ debug_atomic_inc(&nr_hardirq_read_unsafe_locks);
+ if (hardirq_verbose(this->type))
+ ret = 2;
+ break;
+ case LOCK_ENABLED_SOFTIRQS_READ:
+ if (!valid_state(curr, this, new_bit, LOCK_USED_IN_SOFTIRQ))
+ return 0;
+#if STRICT_READ_CHECKS
+ /*
+ * just marked it softirq-read-unsafe, check that no
+ * softirq-safe lock in the system ever took it in the past:
+ */
+ if (!check_usage_backwards(curr, this,
+ LOCK_USED_IN_SOFTIRQ, "soft"))
+ return 0;
+#endif
+ debug_atomic_inc(&nr_softirq_read_unsafe_locks);
+ if (softirq_verbose(this->type))
+ ret = 2;
+ break;
+#endif
+ case LOCK_USED:
+ /*
+ * Add it to the global list of types:
+ */
+ list_add_tail_rcu(&this->type->lock_entry, &all_lock_types);
+ debug_atomic_dec(&nr_unused_locks);
+ break;
+ default:
+ debug_locks_off();
+ WARN_ON(1);
+ return 0;
+ }
+
+ __raw_spin_unlock(&hash_lock);
+
+ /*
+ * We must printk outside of the hash_lock:
+ */
+ if (ret == 2) {
+ printk("\nmarked lock as {%s}:\n", usage_str[new_bit]);
+ print_lock(this);
+ print_irqtrace_events(curr);
+ dump_stack();
+ }
+
+ return ret;
+}
+
+#ifdef CONFIG_TRACE_IRQFLAGS
+/*
+ * Mark all held locks with a usage bit:
+ */
+static int
+mark_held_locks(struct task_struct *curr, int hardirq, unsigned long ip)
+{
+ enum lock_usage_bit usage_bit;
+ struct held_lock *hlock;
+ int i;
+
+ for (i = 0; i < curr->lockdep_depth; i++) {
+ hlock = curr->held_locks + i;
+
+ if (hardirq) {
+ if (hlock->read)
+ usage_bit = LOCK_ENABLED_HARDIRQS_READ;
+ else
+ usage_bit = LOCK_ENABLED_HARDIRQS;
+ } else {
+ if (hlock->read)
+ usage_bit = LOCK_ENABLED_SOFTIRQS_READ;
+ else
+ usage_bit = LOCK_ENABLED_SOFTIRQS;
+ }
+ if (!mark_lock(curr, hlock, usage_bit, ip))
+ return 0;
+ }
+
+ return 1;
+}
+
+/*
+ * Debugging helper: via this flag we know that we are in
+ * 'early bootup code', and will warn about any invalid irqs-on event:
+ */
+static int early_boot_irqs_enabled;
+
+void early_boot_irqs_off(void)
+{
+ early_boot_irqs_enabled = 0;
+}
+
+void early_boot_irqs_on(void)
+{
+ early_boot_irqs_enabled = 1;
+}
+
+/*
+ * Hardirqs will be enabled:
+ */
+void trace_hardirqs_on(void)
+{
+ struct task_struct *curr = current;
+ unsigned long ip;
+
+ if (unlikely(!debug_locks))
+ return;
+
+ if (DEBUG_WARN_ON(unlikely(!early_boot_irqs_enabled)))
+ return;
+
+ if (unlikely(curr->hardirqs_enabled)) {
+ debug_atomic_inc(&redundant_hardirqs_on);
+ return;
+ }
+ /* we'll do an OFF -> ON transition: */
+ curr->hardirqs_enabled = 1;
+ ip = (unsigned long) __builtin_return_address(0);
+
+ if (DEBUG_WARN_ON(!irqs_disabled()))
+ return;
+ if (DEBUG_WARN_ON(current->hardirq_context))
+ return;
+ /*
+ * We are going to turn hardirqs on, so set the
+ * usage bit for all held locks:
+ */
+ if (!mark_held_locks(curr, 1, ip))
+ return;
+ /*
+ * If we have softirqs enabled, then set the usage
+ * bit for all held locks. (disabled hardirqs prevented
+ * this bit from being set before)
+ */
+ if (curr->softirqs_enabled)
+ if (!mark_held_locks(curr, 0, ip))
+ return;
+
+ curr->hardirq_enable_ip = ip;
+ curr->hardirq_enable_event = ++curr->irq_events;
+ debug_atomic_inc(&hardirqs_on_events);
+}
+
+EXPORT_SYMBOL(trace_hardirqs_on);
+
+/*
+ * Hardirqs were disabled:
+ */
+void trace_hardirqs_off(void)
+{
+ struct task_struct *curr = current;
+
+ if (unlikely(!debug_locks))
+ return;
+
+ if (DEBUG_WARN_ON(!irqs_disabled()))
+ return;
+
+ if (curr->hardirqs_enabled) {
+ /*
+ * We have done an ON -> OFF transition:
+ */
+ curr->hardirqs_enabled = 0;
+ curr->hardirq_disable_ip = _RET_IP_;
+ curr->hardirq_disable_event = ++curr->irq_events;
+ debug_atomic_inc(&hardirqs_off_events);
+ } else
+ debug_atomic_inc(&redundant_hardirqs_off);
+}
+
+EXPORT_SYMBOL(trace_hardirqs_off);
+
+/*
+ * Softirqs will be enabled:
+ */
+void trace_softirqs_on(unsigned long ip)
+{
+ struct task_struct *curr = current;
+
+ if (unlikely(!debug_locks))
+ return;
+
+ if (DEBUG_WARN_ON(!irqs_disabled()))
+ return;
+
+ if (curr->softirqs_enabled) {
+ debug_atomic_inc(&redundant_softirqs_on);
+ return;
+ }
+
+ /*
+ * We'll do an OFF -> ON transition:
+ */
+ curr->softirqs_enabled = 1;
+ curr->softirq_enable_ip = ip;
+ curr->softirq_enable_event = ++curr->irq_events;
+ debug_atomic_inc(&softirqs_on_events);
+ /*
+ * We are going to turn softirqs on, so set the
+ * usage bit for all held locks, if hardirqs are
+ * enabled too:
+ */
+ if (curr->hardirqs_enabled)
+ mark_held_locks(curr, 0, ip);
+}
+
+/*
+ * Softirqs were disabled:
+ */
+void trace_softirqs_off(unsigned long ip)
+{
+ struct task_struct *curr = current;
+
+ if (unlikely(!debug_locks))
+ return;
+
+ if (DEBUG_WARN_ON(!irqs_disabled()))
+ return;
+
+ if (curr->softirqs_enabled) {
+ /*
+ * We have done an ON -> OFF transition:
+ */
+ curr->softirqs_enabled = 0;
+ curr->softirq_disable_ip = ip;
+ curr->softirq_disable_event = ++curr->irq_events;
+ debug_atomic_inc(&softirqs_off_events);
+ DEBUG_WARN_ON(!softirq_count());
+ } else
+ debug_atomic_inc(&redundant_softirqs_off);
+}
+
+#endif
+
+/*
+ * Initialize a lock instance's lock-type mapping info:
+ */
+void lockdep_init_map(struct lockdep_map *lock, const char *name,
+ struct lockdep_type_key *key)
+{
+ if (unlikely(!debug_locks))
+ return;
+
+ if (DEBUG_WARN_ON(!key))
+ return;
+
+ /*
+ * Sanity check, the lock-type key must be persistent:
+ */
+ if (!static_obj(key)) {
+ printk("BUG: key %p not in .data!\n", key);
+ DEBUG_WARN_ON(1);
+ return;
+ }
+ lock->name = name;
+ lock->key = key;
+ memset(lock->type, 0, sizeof(lock->type[0])*MAX_LOCKDEP_SUBTYPES);
+}
+
+EXPORT_SYMBOL_GPL(lockdep_init_map);
+
+/*
+ * This gets called for every mutex_lock*()/spin_lock*() operation.
+ * We maintain the dependency maps and validate the locking attempt:
+ */
+static int __lockdep_acquire(struct lockdep_map *lock, unsigned int subtype,
+ int trylock, int read, int hardirqs_off,
+ unsigned long ip)
+{
+ struct task_struct *curr = current;
+ struct held_lock *hlock;
+ struct lock_type *type;
+ unsigned int depth, id;
+ int chain_head = 0;
+ u64 chain_key;
+
+ if (unlikely(!debug_locks))
+ return 0;
+
+ if (DEBUG_WARN_ON(!irqs_disabled()))
+ return 0;
+
+ if (unlikely(subtype >= MAX_LOCKDEP_SUBTYPES)) {
+ debug_locks_off();
+ printk("BUG: MAX_LOCKDEP_SUBTYPES too low!\n");
+ printk("turning off the locking correctness validator.\n");
+ return 0;
+ }
+
+ type = lock->type[subtype];
+ /* not cached yet? */
+ if (unlikely(!type)) {
+ type = register_lock_type(lock, subtype);
+ if (!type)
+ return 0;
+ }
+ debug_atomic_inc((atomic_t *)&type->ops);
+
+ /*
+ * Add the lock to the list of currently held locks.
+ * (we don't increase the depth just yet, up until the
+ * dependency checks are done)
+ */
+ depth = curr->lockdep_depth;
+ if (DEBUG_WARN_ON(depth >= MAX_LOCK_DEPTH))
+ return 0;
+
+ hlock = curr->held_locks + depth;
+
+ hlock->type = type;
+ hlock->acquire_ip = ip;
+ hlock->instance = lock;
+ hlock->trylock = trylock;
+ hlock->read = read;
+ hlock->hardirqs_off = hardirqs_off;
+
+#ifdef CONFIG_TRACE_IRQFLAGS
+ /*
+ * If this is a non-trylock use in a hardirq or softirq
+ * context, then mark the lock as used in these contexts:
+ */
+ if (!trylock) {
+ if (read) {
+ if (curr->hardirq_context)
+ if (!mark_lock(curr, hlock,
+ LOCK_USED_IN_HARDIRQ_READ, ip))
+ return 0;
+ if (curr->softirq_context)
+ if (!mark_lock(curr, hlock,
+ LOCK_USED_IN_SOFTIRQ_READ, ip))
+ return 0;
+ } else {
+ if (curr->hardirq_context)
+ if (!mark_lock(curr, hlock, LOCK_USED_IN_HARDIRQ, ip))
+ return 0;
+ if (curr->softirq_context)
+ if (!mark_lock(curr, hlock, LOCK_USED_IN_SOFTIRQ, ip))
+ return 0;
+ }
+ }
+ if (!hardirqs_off) {
+ if (read) {
+ if (!mark_lock(curr, hlock,
+ LOCK_ENABLED_HARDIRQS_READ, ip))
+ return 0;
+ if (curr->softirqs_enabled)
+ if (!mark_lock(curr, hlock,
+ LOCK_ENABLED_SOFTIRQS_READ, ip))
+ return 0;
+ } else {
+ if (!mark_lock(curr, hlock,
+ LOCK_ENABLED_HARDIRQS, ip))
+ return 0;
+ if (curr->softirqs_enabled)
+ if (!mark_lock(curr, hlock,
+ LOCK_ENABLED_SOFTIRQS, ip))
+ return 0;
+ }
+ }
+#endif
+ /* mark it as used: */
+ if (!mark_lock(curr, hlock, LOCK_USED, ip))
+ return 0;
+ /*
+ * Calculate the chain hash: it's the combined hash of all the
+ * lock keys along the dependency chain. We save the hash value
+ * at every step so that we can get the current hash easily
+ * after unlock. The chain hash is then used to cache dependency
+ * results.
+ *
+ * The 'key ID' is the most compact key value we can use to
+ * drive the hash, not type->key.
+ */
+ id = type - lock_types;
+ if (DEBUG_WARN_ON(id >= MAX_LOCKDEP_KEYS))
+ return 0;
+
+ chain_key = curr->curr_chain_key;
+ if (!depth) {
+ if (DEBUG_WARN_ON(chain_key != 0))
+ return 0;
+ chain_head = 1;
+ }
+
+ hlock->prev_chain_key = chain_key;
+
+#ifdef CONFIG_TRACE_IRQFLAGS
+ /*
+ * Keep track of points where we cross into an interrupt context:
+ */
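+ /* (0: process context, 1: softirq context, 2/3: hardirq context) */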
+ hlock->irq_context = 2*(curr->hardirq_context ? 1 : 0) +
+ curr->softirq_context;
+ if (depth) {
+ struct held_lock *prev_hlock;
+
+ prev_hlock = curr->held_locks + depth-1;
+ /*
+ * If we cross into another context, reset the
+ * hash key (this also prevents the checking and the
+ * adding of the dependency to 'prev'):
+ */
+ if (prev_hlock->irq_context != hlock->irq_context) {
+ chain_key = 0;
+ chain_head = 1;
+ }
+ }
+#endif
+ chain_key = iterate_chain_key(chain_key, id);
+ curr->curr_chain_key = chain_key;
+
+ /*
+ * Trylock needs to maintain the stack of held locks, but it
+ * does not add new dependencies, because trylock can be done
+ * in any order.
+ *
+ * We look up the chain_key and do the O(N^2) check and update of
+ * the dependencies only if this is a new dependency chain.
+ * (If lookup_chain_cache() returns with 1 it acquires
+ * hash_lock for us)
+ */
+ if (!trylock && lookup_chain_cache(chain_key)) {
+ /*
+ * Check whether the last held lock:
+ *
+ * - is irq-safe, if this lock is irq-unsafe
+ * - is softirq-safe, if this lock is hardirq-unsafe
+ *
+ * And check whether the new lock's dependency graph
+ * could lead back to the previous lock.
+ *
+ * Any of these scenarios could lead to a deadlock. If all
+ * of the validations pass, we add the new dependency below:
+ */
+ int ret = check_deadlock(curr, hlock, lock, read);
+
+ if (!ret)
+ return 0;
+ /*
+ * Mark recursive read, as we jump over it when
+ * building dependencies (just like we jump over
+ * trylock entries):
+ */
+ if (ret == 2)
+ hlock->read = 2;
+ /*
+ * Add dependency only if this lock is not the head
+ * of the chain, and if it's not a secondary read-lock:
+ */
+ if (!chain_head && ret != 2)
+ if (!check_prevs_add(curr, hlock))
+ return 0;
+ __raw_spin_unlock(&hash_lock);
+ }
+ curr->lockdep_depth++;
+ check_chain_key(curr);
+ if (unlikely(curr->lockdep_depth >= MAX_LOCK_DEPTH)) {
+ debug_locks_off();
+ printk("BUG: MAX_LOCK_DEPTH too low!\n");
+ printk("turning off the locking correctness validator.\n");
+ return 0;
+ }
+ if (unlikely(curr->lockdep_depth > max_lockdep_depth))
+ max_lockdep_depth = curr->lockdep_depth;
+
+ return 1;
+}
+
+static int
+print_unlock_order_bug(struct task_struct *curr, struct lockdep_map *lock,
+ struct held_lock *hlock, unsigned long ip)
+{
+ debug_locks_off();
+ if (debug_locks_silent)
+ return 0;
+
+ printk("\n======================================\n");
+ printk( "[ BUG: bad unlock ordering detected! ]\n");
+ printk( "--------------------------------------\n");
+ printk("%s/%d is trying to release lock (",
+ curr->comm, curr->pid);
+ print_lockdep_cache(lock);
+ printk(") at:\n");
+ printk_sym(ip);
+ printk("but the next lock to release is:\n");
+ print_lock(hlock);
+ printk("\nother info that might help us debug this:\n");
+ lockdep_print_held_locks(curr);
+
+ printk("\nstack backtrace:\n");
+ dump_stack();
+
+ return 0;
+}
+
+static int
+print_unlock_inbalance_bug(struct task_struct *curr, struct lockdep_map *lock,
+ unsigned long ip)
+{
+ debug_locks_off();
+ if (debug_locks_silent)
+ return 0;
+
+ printk("\n=====================================\n");
+ printk( "[ BUG: bad unlock balance detected! ]\n");
+ printk( "-------------------------------------\n");
+ printk("%s/%d is trying to release lock (",
+ curr->comm, curr->pid);
+ print_lockdep_cache(lock);
+ printk(") at:\n");
+ printk_sym(ip);
+ printk("but there are no more locks to release!\n");
+ printk("\nother info that might help us debug this:\n");
+ lockdep_print_held_locks(curr);
+
+ printk("\nstack backtrace:\n");
+ dump_stack();
+
+ return 0;
+}
+
+/*
+ * Common debugging checks for both nested and non-nested unlock:
+ */
+static int check_unlock(struct task_struct *curr, struct lockdep_map *lock,
+ unsigned long ip)
+{
+ if (unlikely(!debug_locks))
+ return 0;
+ if (DEBUG_WARN_ON(!irqs_disabled()))
+ return 0;
+
+ if (curr->lockdep_depth <= 0)
+ return print_unlock_inbalance_bug(curr, lock, ip);
+
+ return 1;
+}
+
+/*
+ * Remove the lock from the list of currently held locks - this gets
+ * called on mutex_unlock()/spin_unlock*() (or on a failed
+ * mutex_lock_interruptible()). This is done for unlocks that nest
+ * perfectly. (i.e. the current top of the lock-stack is unlocked)
+ */
+static int lockdep_release_nested(struct task_struct *curr,
+ struct lockdep_map *lock, unsigned long ip)
+{
+ struct held_lock *hlock;
+ unsigned int depth;
+
+ /*
+ * Pop off the top of the lock stack:
+ */
+ depth = --curr->lockdep_depth;
+ hlock = curr->held_locks + depth;
+
+ if (hlock->instance != lock)
+ return print_unlock_order_bug(curr, lock, hlock, ip);
+
+ if (DEBUG_WARN_ON(!depth && (hlock->prev_chain_key != 0)))
+ return 0;
+
+ curr->curr_chain_key = hlock->prev_chain_key;
+
+#ifdef CONFIG_DEBUG_LOCKDEP
+ hlock->prev_chain_key = 0;
+ hlock->type = NULL;
+ hlock->acquire_ip = 0;
+ hlock->irq_context = 0;
+#endif
+ return 1;
+}
+
+/*
+ * Remove the lock from the list of currently held locks in a
+ * potentially non-nested (out of order) manner. This is a
+ * relatively rare operation, as all the unlock APIs default
+ * to nested mode (which uses lockdep_release()):
+ */
+static int
+lockdep_release_non_nested(struct task_struct *curr,
+ struct lockdep_map *lock, unsigned long ip)
+{
+ struct held_lock *hlock, *prev_hlock;
+ unsigned int depth;
+ int i;
+
+ /*
+ * Check whether the lock exists in the current stack
+ * of held locks:
+ */
+ depth = curr->lockdep_depth;
+ if (DEBUG_WARN_ON(!depth))
+ return 0;
+
+ prev_hlock = NULL;
+ for (i = depth-1; i >= 0; i--) {
+ hlock = curr->held_locks + i;
+ /*
+ * We must not cross into another context:
+ */
+ if (prev_hlock && prev_hlock->irq_context != hlock->irq_context)
+ break;
+ if (hlock->instance == lock)
+ goto found_it;
+ prev_hlock = hlock;
+ }
+ return print_unlock_inbalance_bug(curr, lock, ip);
+
+found_it:
+ /*
+ * We have the right lock to unlock, 'hlock' points to it.
+ * Now we remove it from the stack, and add back the other
+ * entries (if any), recalculating the hash along the way:
+ */
+ curr->lockdep_depth = i;
+ curr->curr_chain_key = hlock->prev_chain_key;
+
+ for (i++; i < depth; i++) {
+ hlock = curr->held_locks + i;
+ if (!__lockdep_acquire(hlock->instance,
+ hlock->type->subtype, hlock->trylock,
+ hlock->read, hlock->hardirqs_off,
+ hlock->acquire_ip))
+ return 0;
+ }
+
+ if (DEBUG_WARN_ON(curr->lockdep_depth != depth - 1))
+ return 0;
+ return 1;
+}
+
+/*
+ * Remove the lock from the list of currently held locks - this gets
+ * called on mutex_unlock()/spin_unlock*() (or on a failed
+ * mutex_lock_interruptible()). This is done for unlocks that nest
+ * perfectly. (i.e. the current top of the lock-stack is unlocked)
+ */
+static void __lockdep_release(struct lockdep_map *lock, int nested,
+ unsigned long ip)
+{
+ struct task_struct *curr = current;
+
+ if (!check_unlock(curr, lock, ip))
+ return;
+
+ if (nested) {
+ if (!lockdep_release_nested(curr, lock, ip))
+ return;
+ } else {
+ if (!lockdep_release_non_nested(curr, lock, ip))
+ return;
+ }
+
+ check_chain_key(curr);
+}
+
+/*
+ * Check whether we follow the irq-flags state precisely:
+ */
+static void check_flags(unsigned long flags)
+{
+#if defined(CONFIG_DEBUG_LOCKDEP) && defined(CONFIG_TRACE_IRQFLAGS)
+ if (!debug_locks)
+ return;
+
+ if (irqs_disabled_flags(flags))
+ DEBUG_WARN_ON(current->hardirqs_enabled);
+ else
+ DEBUG_WARN_ON(!current->hardirqs_enabled);
+
+ /*
+ * We don't accurately track softirq state in e.g.
+ * hardirq contexts (such as on 4KSTACKS), so only
+ * check if not in hardirq contexts:
+ */
+ if (!hardirq_count()) {
+ if (softirq_count())
+ DEBUG_WARN_ON(current->softirqs_enabled);
+ else
+ DEBUG_WARN_ON(!current->softirqs_enabled);
+ }
+
+ if (!debug_locks)
+ print_irqtrace_events(current);
+#endif
+}
+
+/*
+ * We are not always called with irqs disabled - do that here,
+ * and also avoid lockdep recursion:
+ */
+void lockdep_acquire(struct lockdep_map *lock, unsigned int subtype,
+ int trylock, int read, unsigned long ip)
+{
+ unsigned long flags;
+
+ if (LOCKDEP_OFF)
+ return;
+
+ raw_local_irq_save(flags);
+ check_flags(flags);
+
+ if (unlikely(current->lockdep_recursion))
+ goto out;
+ current->lockdep_recursion = 1;
+ __lockdep_acquire(lock, subtype, trylock, read, irqs_disabled_flags(flags), ip);
+ current->lockdep_recursion = 0;
+out:
+ raw_local_irq_restore(flags);
+}
+
+EXPORT_SYMBOL_GPL(lockdep_acquire);
+
+void lockdep_release(struct lockdep_map *lock, int nested, unsigned long ip)
+{
+ unsigned long flags;
+
+ if (LOCKDEP_OFF)
+ return;
+
+ raw_local_irq_save(flags);
+ check_flags(flags);
+ if (unlikely(current->lockdep_recursion))
+ goto out;
+ current->lockdep_recursion = 1;
+ __lockdep_release(lock, nested, ip);
+ current->lockdep_recursion = 0;
+out:
+ raw_local_irq_restore(flags);
+}
+
+EXPORT_SYMBOL_GPL(lockdep_release);
+
+/*
+ * Used by the testsuite - sanitize the validator state
+ * after a simulated failure:
+ */
+
+void lockdep_reset(void)
+{
+ unsigned long flags;
+
+ raw_local_irq_save(flags);
+ current->curr_chain_key = 0;
+ current->lockdep_depth = 0;
+ current->lockdep_recursion = 0;
+ memset(current->held_locks, 0, MAX_LOCK_DEPTH*sizeof(struct held_lock));
+ nr_hardirq_chains = 0;
+ nr_softirq_chains = 0;
+ nr_process_chains = 0;
+ debug_locks = 1;
+ raw_local_irq_restore(flags);
+}
+
+static void zap_type(struct lock_type *type)
+{
+ int i;
+
+ /*
+ * Remove all dependencies this lock is
+ * involved in:
+ */
+ for (i = 0; i < nr_list_entries; i++) {
+ if (list_entries[i].type == type)
+ list_del_rcu(&list_entries[i].entry);
+ }
+ /*
+ * Unhash the type and remove it from the all_lock_types list:
+ */
+ list_del_rcu(&type->hash_entry);
+ list_del_rcu(&type->lock_entry);
+
+}
+
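+/* is the address within the [start, start+size) range? */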
+static inline int within(void *addr, void *start, unsigned long size)
+{
+ return addr >= start && addr < start + size;
+}
+
+void lockdep_free_key_range(void *start, unsigned long size)
+{
+ struct lock_type *type, *next;
+ struct list_head *head;
+ unsigned long flags;
+ int i;
+
+ raw_local_irq_save(flags);
+ __raw_spin_lock(&hash_lock);
+
+ /*
+ * Unhash all types that were created by this module:
+ */
+ for (i = 0; i < TYPEHASH_SIZE; i++) {
+ head = typehash_table + i;
+ if (list_empty(head))
+ continue;
+ list_for_each_entry_safe(type, next, head, hash_entry)
+ if (within(type->key, start, size))
+ zap_type(type);
+ }
+
+ __raw_spin_unlock(&hash_lock);
+ raw_local_irq_restore(flags);
+}
+
+void lockdep_reset_lock(struct lockdep_map *lock)
+{
+ struct lock_type *type, *next, *entry;
+ struct list_head *head;
+ unsigned long flags;
+ int i, j;
+
+ raw_local_irq_save(flags);
+ __raw_spin_lock(&hash_lock);
+
+ /*
+ * Remove all types this lock has:
+ */
+ for (i = 0; i < TYPEHASH_SIZE; i++) {
+ head = typehash_table + i;
+ if (list_empty(head))
+ continue;
+ list_for_each_entry_safe(type, next, head, hash_entry) {
+ for (j = 0; j < MAX_LOCKDEP_SUBTYPES; j++) {
+ entry = lock->type[j];
+ if (type == entry) {
+ zap_type(type);
+ lock->type[j] = NULL;
+ break;
+ }
+ }
+ }
+ }
+
+ /*
+ * Debug check: in the end all mapped types should
+ * be gone.
+ */
+ for (j = 0; j < MAX_LOCKDEP_SUBTYPES; j++) {
+ entry = lock->type[j];
+ if (!entry)
+ continue;
+ __raw_spin_unlock(&hash_lock);
+ DEBUG_WARN_ON(1);
+ raw_local_irq_restore(flags);
+ return;
+ }
+
+ __raw_spin_unlock(&hash_lock);
+ raw_local_irq_restore(flags);
+}
+
+void __init lockdep_init(void)
+{
+ int i;
+
+ /*
+ * Some architectures have their own start_kernel()
+ * code which calls lockdep_init(), while we also
+ * call lockdep_init() from start_kernel() itself,
+ * and we want to initialize the hashes only once:
+ */
+ if (lockdep_initialized)
+ return;
+
+ for (i = 0; i < TYPEHASH_SIZE; i++)
+ INIT_LIST_HEAD(typehash_table + i);
+
+ for (i = 0; i < CHAINHASH_SIZE; i++)
+ INIT_LIST_HEAD(chainhash_table + i);
+
+ lockdep_initialized = 1;
+}
+
+void __init lockdep_info(void)
+{
+ printk("Lock dependency validator: Copyright (c) 2006 Red Hat, Inc., Ingo Molnar\n");
+
+ printk("... MAX_LOCKDEP_SUBTYPES: %lu\n", MAX_LOCKDEP_SUBTYPES);
+ printk("... MAX_LOCK_DEPTH: %lu\n", MAX_LOCK_DEPTH);
+ printk("... MAX_LOCKDEP_KEYS: %lu\n", MAX_LOCKDEP_KEYS);
+ printk("... TYPEHASH_SIZE: %lu\n", TYPEHASH_SIZE);
+ printk("... MAX_LOCKDEP_ENTRIES: %lu\n", MAX_LOCKDEP_ENTRIES);
+ printk("... MAX_LOCKDEP_CHAINS: %lu\n", MAX_LOCKDEP_CHAINS);
+ printk("... CHAINHASH_SIZE: %lu\n", CHAINHASH_SIZE);
+
+ printk(" memory used by lock dependency info: %lu kB\n",
+ (sizeof(struct lock_type) * MAX_LOCKDEP_KEYS +
+ sizeof(struct list_head) * TYPEHASH_SIZE +
+ sizeof(struct lock_list) * MAX_LOCKDEP_ENTRIES +
+ sizeof(struct lock_chain) * MAX_LOCKDEP_CHAINS +
+ sizeof(struct list_head) * CHAINHASH_SIZE) / 1024);
+
+ printk(" per task-struct memory footprint: %lu bytes\n",
+ sizeof(struct held_lock) * MAX_LOCK_DEPTH);
+
+#ifdef CONFIG_DEBUG_LOCKDEP
+ if (lockdep_init_error)
+ printk("WARNING: lockdep init error! Arch code didnt call lockdep_init() early enough?\n");
+#endif
+}
+
Index: linux/kernel/lockdep_internals.h
===================================================================
--- /dev/null
+++ linux/kernel/lockdep_internals.h
@@ -0,0 +1,93 @@
+/*
+ * kernel/lockdep_internals.h
+ *
+ * Runtime locking correctness validator
+ *
+ * lockdep subsystem internal functions and variables.
+ */
+
+/*
+ * MAX_LOCKDEP_ENTRIES is the maximum number of lock dependencies
+ * we track.
+ *
+ * We use the per-lock dependency maps in two ways: we grow them by adding
+ * every to-be-taken lock to each currently held lock's own dependency
+ * table (if it's not there yet), and we check them for lock-order
+ * conflicts and deadlocks.
+ */
+#define MAX_LOCKDEP_ENTRIES 8192UL
+
+#define MAX_LOCKDEP_KEYS_BITS 11
+#define MAX_LOCKDEP_KEYS (1UL << MAX_LOCKDEP_KEYS_BITS)
+
+#define MAX_LOCKDEP_CHAINS_BITS 13
+#define MAX_LOCKDEP_CHAINS (1UL << MAX_LOCKDEP_CHAINS_BITS)
+
+/*
+ * Stack-trace: tightly packed array of stack backtrace
+ * addresses. Protected by the hash_lock.
+ */
+#define MAX_STACK_TRACE_ENTRIES 131072UL
+
+extern struct list_head all_lock_types;
+
+extern void
+get_usage_chars(struct lock_type *type, char *c1, char *c2, char *c3, char *c4);
+
+extern const char * __get_key_name(struct lockdep_subtype_key *key, char *str);
+
+extern unsigned long nr_lock_types;
+extern unsigned long nr_list_entries;
+extern unsigned long nr_lock_chains;
+extern unsigned long nr_stack_trace_entries;
+
+extern unsigned int nr_hardirq_chains;
+extern unsigned int nr_softirq_chains;
+extern unsigned int nr_process_chains;
+extern unsigned int max_lockdep_depth;
+extern unsigned int max_recursion_depth;
+
+#ifdef CONFIG_DEBUG_LOCKDEP
+/*
+ * We cannot printk in early bootup code - not even early_printk()
+ * may work there. So we mark any initialization errors and report
+ * them later on, in lockdep_info().
+ */
+extern int lockdep_init_error;
+
+/*
+ * Various lockdep statistics:
+ */
+extern atomic_t chain_lookup_hits;
+extern atomic_t chain_lookup_misses;
+extern atomic_t hardirqs_on_events;
+extern atomic_t hardirqs_off_events;
+extern atomic_t redundant_hardirqs_on;
+extern atomic_t redundant_hardirqs_off;
+extern atomic_t softirqs_on_events;
+extern atomic_t softirqs_off_events;
+extern atomic_t redundant_softirqs_on;
+extern atomic_t redundant_softirqs_off;
+extern atomic_t nr_unused_locks;
+extern atomic_t nr_hardirq_safe_locks;
+extern atomic_t nr_softirq_safe_locks;
+extern atomic_t nr_hardirq_unsafe_locks;
+extern atomic_t nr_softirq_unsafe_locks;
+extern atomic_t nr_hardirq_read_safe_locks;
+extern atomic_t nr_softirq_read_safe_locks;
+extern atomic_t nr_hardirq_read_unsafe_locks;
+extern atomic_t nr_softirq_read_unsafe_locks;
+extern atomic_t nr_cyclic_checks;
+extern atomic_t nr_cyclic_check_recursions;
+extern atomic_t nr_find_usage_forwards_checks;
+extern atomic_t nr_find_usage_forwards_recursions;
+extern atomic_t nr_find_usage_backwards_checks;
+extern atomic_t nr_find_usage_backwards_recursions;
+# define debug_atomic_inc(ptr) atomic_inc(ptr)
+# define debug_atomic_dec(ptr) atomic_dec(ptr)
+# define debug_atomic_read(ptr) atomic_read(ptr)
+#else
+# define debug_atomic_inc(ptr) do { } while (0)
+# define debug_atomic_dec(ptr) do { } while (0)
+# define debug_atomic_read(ptr) 0
+#endif
Index: linux/kernel/module.c
===================================================================
--- linux.orig/kernel/module.c
+++ linux/kernel/module.c
@@ -1151,6 +1151,9 @@ static void free_module(struct module *m
if (mod->percpu)
percpu_modfree(mod->percpu);

+ /* Free lock-types: */
+ lockdep_free_key_range(mod->module_core, mod->core_size);
+
/* Finally, free the core (containing the module structure) */
module_free(mod, mod->module_core);
}
Index: linux/lib/Kconfig.debug
===================================================================
--- linux.orig/lib/Kconfig.debug
+++ linux/lib/Kconfig.debug
@@ -57,7 +57,7 @@ config DEBUG_KERNEL
config LOG_BUF_SHIFT
int "Kernel log buffer size (16 => 64KB, 17 => 128KB)" if DEBUG_KERNEL
range 12 21
- default 17 if S390
+ default 17 if S390 || LOCKDEP
default 16 if X86_NUMAQ || IA64
default 15 if SMP
default 14
Index: linux/lib/locking-selftest.c
===================================================================
--- linux.orig/lib/locking-selftest.c
+++ linux/lib/locking-selftest.c
@@ -15,6 +15,7 @@
#include <linux/sched.h>
#include <linux/delay.h>
#include <linux/module.h>
+#include <linux/lockdep.h>
#include <linux/spinlock.h>
#include <linux/kallsyms.h>
#include <linux/interrupt.h>
@@ -872,9 +873,6 @@ GENERATE_PERMUTATIONS_3_EVENTS(irq_read_
#include "locking-selftest-softirq.h"
// GENERATE_PERMUTATIONS_3_EVENTS(irq_read_recursion2_soft)

-#define lockdep_reset()
-#define lockdep_reset_lock(x)
-
#ifdef CONFIG_PROVE_SPIN_LOCKING
# define I_SPINLOCK(x) lockdep_reset_lock(&lock_##x.dep_map)
#else

2006-05-29 21:39:35

by Ingo Molnar

[permalink] [raw]
Subject: [patch 22/61] lock validator: add per_cpu_offset()

From: Ingo Molnar <[email protected]>

add the per_cpu_offset() generic method. (used by the lock validator)
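
(Illustrative sketch only, not part of the patch: the offset can be used
to resolve the address of a per-CPU object for a given CPU by adding it
to the object's base address - the helper name below is made up:)

 static inline void *percpu_ptr_example(void *base, int cpu)
 {
 	/* shift the base address by this CPU's per-CPU data offset: */
 	return (void *)((unsigned long)base + per_cpu_offset(cpu));
 }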

Signed-off-by: Ingo Molnar <[email protected]>
Signed-off-by: Arjan van de Ven <[email protected]>
---
include/asm-generic/percpu.h | 2 ++
include/asm-x86_64/percpu.h | 2 ++
2 files changed, 4 insertions(+)

Index: linux/include/asm-generic/percpu.h
===================================================================
--- linux.orig/include/asm-generic/percpu.h
+++ linux/include/asm-generic/percpu.h
@@ -7,6 +7,8 @@

extern unsigned long __per_cpu_offset[NR_CPUS];

+#define per_cpu_offset(x) (__per_cpu_offset[x])
+
/* Separate out the type, so (int[3], foo) works. */
#define DEFINE_PER_CPU(type, name) \
__attribute__((__section__(".data.percpu"))) __typeof__(type) per_cpu__##name
Index: linux/include/asm-x86_64/percpu.h
===================================================================
--- linux.orig/include/asm-x86_64/percpu.h
+++ linux/include/asm-x86_64/percpu.h
@@ -14,6 +14,8 @@
#define __per_cpu_offset(cpu) (cpu_pda(cpu)->data_offset)
#define __my_cpu_offset() read_pda(data_offset)

+#define per_cpu_offset(x) (__per_cpu_offset(x))
+
/* Separate out the type, so (int[3], foo) works. */
#define DEFINE_PER_CPU(type, name) \
__attribute__((__section__(".data.percpu"))) __typeof__(type) per_cpu__##name

2006-05-29 21:40:43

by Ingo Molnar

[permalink] [raw]
Subject: [patch 28/61] lock validator: prove mutex locking correctness

From: Ingo Molnar <[email protected]>

add CONFIG_PROVE_MUTEX_LOCKING, which uses the lock validator framework
to prove mutex locking correctness.
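
(A minimal usage sketch, not part of the patch: when two mutexes of the
same lock-type are nested intentionally, the inner lock is annotated
with a subtype via the new mutex_lock_nested() API, so the validator
does not flag the nesting as a self-deadlock. The structure and function
names below are made up for illustration:)

 struct object {
 	struct mutex lock;
 };

 static void lock_parent_and_child(struct object *parent,
 				   struct object *child)
 {
 	mutex_lock(&parent->lock);
 	/* subtype 1: a distinct nesting level within the same lock-type */
 	mutex_lock_nested(&child->lock, 1);

 	/* ... operate on both objects ... */

 	mutex_unlock(&child->lock);
 	mutex_unlock(&parent->lock);
 }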

Signed-off-by: Ingo Molnar <[email protected]>
Signed-off-by: Arjan van de Ven <[email protected]>
---
include/linux/mutex-debug.h | 8 +++++++-
include/linux/mutex.h | 34 +++++++++++++++++++++++++++++++---
kernel/mutex-debug.c | 8 ++++++++
kernel/mutex-lockdep.h | 40 ++++++++++++++++++++++++++++++++++++++++
kernel/mutex.c | 28 ++++++++++++++++++++++------
kernel/mutex.h | 3 +--
6 files changed, 109 insertions(+), 12 deletions(-)

Index: linux/include/linux/mutex-debug.h
===================================================================
--- linux.orig/include/linux/mutex-debug.h
+++ linux/include/linux/mutex-debug.h
@@ -2,6 +2,7 @@
#define __LINUX_MUTEX_DEBUG_H

#include <linux/linkage.h>
+#include <linux/lockdep.h>

/*
* Mutexes - debugging helpers:
@@ -10,7 +11,12 @@
#define __DEBUG_MUTEX_INITIALIZER(lockname) \
, .magic = &lockname

-#define mutex_init(sem) __mutex_init(sem, __FILE__":"#sem)
+#define mutex_init(mutex) \
+do { \
+ static struct lockdep_type_key __key; \
+ \
+ __mutex_init((mutex), #mutex, &__key); \
+} while (0)

extern void FASTCALL(mutex_destroy(struct mutex *lock));

Index: linux/include/linux/mutex.h
===================================================================
--- linux.orig/include/linux/mutex.h
+++ linux/include/linux/mutex.h
@@ -13,6 +13,7 @@
#include <linux/list.h>
#include <linux/spinlock_types.h>
#include <linux/linkage.h>
+#include <linux/lockdep.h>

#include <asm/atomic.h>

@@ -53,6 +54,9 @@ struct mutex {
const char *name;
void *magic;
#endif
+#ifdef CONFIG_PROVE_MUTEX_LOCKING
+ struct lockdep_map dep_map;
+#endif
};

/*
@@ -72,20 +76,36 @@ struct mutex_waiter {
# include <linux/mutex-debug.h>
#else
# define __DEBUG_MUTEX_INITIALIZER(lockname)
-# define mutex_init(mutex) __mutex_init(mutex, NULL)
+# define mutex_init(mutex) \
+do { \
+ static struct lockdep_type_key __key; \
+ \
+ __mutex_init((mutex), NULL, &__key); \
+} while (0)
# define mutex_destroy(mutex) do { } while (0)
#endif

+#ifdef CONFIG_PROVE_MUTEX_LOCKING
+# define __DEP_MAP_MUTEX_INITIALIZER(lockname) \
+ , .dep_map = { .name = #lockname }
+#else
+# define __DEP_MAP_MUTEX_INITIALIZER(lockname)
+#endif
+
#define __MUTEX_INITIALIZER(lockname) \
{ .count = ATOMIC_INIT(1) \
, .wait_lock = SPIN_LOCK_UNLOCKED \
, .wait_list = LIST_HEAD_INIT(lockname.wait_list) \
- __DEBUG_MUTEX_INITIALIZER(lockname) }
+ __DEBUG_MUTEX_INITIALIZER(lockname) \
+ __DEP_MAP_MUTEX_INITIALIZER(lockname) }

#define DEFINE_MUTEX(mutexname) \
struct mutex mutexname = __MUTEX_INITIALIZER(mutexname)

-extern void fastcall __mutex_init(struct mutex *lock, const char *name);
+extern void __mutex_init(struct mutex *lock, const char *name,
+ struct lockdep_type_key *key);
+
+#define mutex_init_key(mutex, name, key) __mutex_init((mutex), name, key)

/***
* mutex_is_locked - is the mutex locked
@@ -104,11 +124,19 @@ static inline int fastcall mutex_is_lock
*/
extern void fastcall mutex_lock(struct mutex *lock);
extern int fastcall mutex_lock_interruptible(struct mutex *lock);
+
+#ifdef CONFIG_PROVE_MUTEX_LOCKING
+extern void mutex_lock_nested(struct mutex *lock, unsigned int subtype);
+#else
+# define mutex_lock_nested(lock, subtype) mutex_lock(lock)
+#endif
+
/*
* NOTE: mutex_trylock() follows the spin_trylock() convention,
* not the down_trylock() convention!
*/
extern int fastcall mutex_trylock(struct mutex *lock);
extern void fastcall mutex_unlock(struct mutex *lock);
+extern void fastcall mutex_unlock_non_nested(struct mutex *lock);

#endif
Index: linux/kernel/mutex-debug.c
===================================================================
--- linux.orig/kernel/mutex-debug.c
+++ linux/kernel/mutex-debug.c
@@ -100,6 +100,14 @@ static int check_deadlock(struct mutex *
return 0;

task = ti->task;
+ /*
+ * In the PROVE_MUTEX_LOCKING case we are already tracking
+ * all held locks, which allows us to optimize this:
+ */
+#ifdef CONFIG_PROVE_MUTEX_LOCKING
+ if (!task->lockdep_depth)
+ return 0;
+#endif
lockblk = NULL;
if (task->blocked_on)
lockblk = task->blocked_on->lock;
Index: linux/kernel/mutex-lockdep.h
===================================================================
--- /dev/null
+++ linux/kernel/mutex-lockdep.h
@@ -0,0 +1,40 @@
+/*
+ * Mutexes: blocking mutual exclusion locks
+ *
+ * started by Ingo Molnar:
+ *
+ * Copyright (C) 2004-2006 Red Hat, Inc., Ingo Molnar <[email protected]>
+ *
+ * This file contains mutex debugging related internal prototypes, for the
+ * !CONFIG_DEBUG_MUTEXES && CONFIG_PROVE_MUTEX_LOCKING case. Most of
+ * them are NOPs:
+ */
+
+#define spin_lock_mutex(lock, flags) \
+ do { \
+ local_irq_save(flags); \
+ __raw_spin_lock(&(lock)->raw_lock); \
+ } while (0)
+
+#define spin_unlock_mutex(lock, flags) \
+ do { \
+ __raw_spin_unlock(&(lock)->raw_lock); \
+ local_irq_restore(flags); \
+ } while (0)
+
+#define mutex_remove_waiter(lock, waiter, ti) \
+ __list_del((waiter)->list.prev, (waiter)->list.next)
+
+#define debug_mutex_set_owner(lock, new_owner) do { } while (0)
+#define debug_mutex_clear_owner(lock) do { } while (0)
+#define debug_mutex_wake_waiter(lock, waiter) do { } while (0)
+#define debug_mutex_free_waiter(waiter) do { } while (0)
+#define debug_mutex_add_waiter(lock, waiter, ti) do { } while (0)
+#define debug_mutex_unlock(lock) do { } while (0)
+#define debug_mutex_init(lock, name) do { } while (0)
+
+static inline void
+debug_mutex_lock_common(struct mutex *lock,
+ struct mutex_waiter *waiter)
+{
+}
Index: linux/kernel/mutex.c
===================================================================
--- linux.orig/kernel/mutex.c
+++ linux/kernel/mutex.c
@@ -27,8 +27,13 @@
# include "mutex-debug.h"
# include <asm-generic/mutex-null.h>
#else
-# include "mutex.h"
-# include <asm/mutex.h>
+# ifdef CONFIG_PROVE_MUTEX_LOCKING
+# include "mutex-lockdep.h"
+# include <asm-generic/mutex-null.h>
+# else
+# include "mutex.h"
+# include <asm/mutex.h>
+# endif
#endif

/***
@@ -39,13 +44,18 @@
*
* It is not allowed to initialize an already locked mutex.
*/
-__always_inline void fastcall __mutex_init(struct mutex *lock, const char *name)
+void
+__mutex_init(struct mutex *lock, const char *name, struct lockdep_type_key *key)
{
atomic_set(&lock->count, 1);
spin_lock_init(&lock->wait_lock);
INIT_LIST_HEAD(&lock->wait_list);

debug_mutex_init(lock, name);
+
+#ifdef CONFIG_PROVE_MUTEX_LOCKING
+ lockdep_init_map(&lock->dep_map, name, key);
+#endif
}

EXPORT_SYMBOL(__mutex_init);
@@ -146,6 +156,7 @@ __mutex_lock_common(struct mutex *lock,
spin_lock_mutex(&lock->wait_lock, flags);

debug_mutex_lock_common(lock, &waiter);
+ mutex_acquire(&lock->dep_map, subtype, 0, _RET_IP_);
debug_mutex_add_waiter(lock, &waiter, task->thread_info);

/* add waiting tasks to the end of the waitqueue (FIFO): */
@@ -173,6 +184,7 @@ __mutex_lock_common(struct mutex *lock,
if (unlikely(state == TASK_INTERRUPTIBLE &&
signal_pending(task))) {
mutex_remove_waiter(lock, &waiter, task->thread_info);
+ mutex_release(&lock->dep_map, 1, _RET_IP_);
spin_unlock_mutex(&lock->wait_lock, flags);

debug_mutex_free_waiter(&waiter);
@@ -198,7 +210,9 @@ __mutex_lock_common(struct mutex *lock,

debug_mutex_free_waiter(&waiter);

+#ifdef CONFIG_DEBUG_MUTEXES
DEBUG_WARN_ON(lock->owner != task->thread_info);
+#endif

return 0;
}
@@ -211,7 +225,7 @@ __mutex_lock_slowpath(atomic_t *lock_cou
__mutex_lock_common(lock, TASK_UNINTERRUPTIBLE, 0);
}

-#ifdef CONFIG_DEBUG_MUTEXES
+#ifdef CONFIG_PROVE_MUTEX_LOCKING
void __sched
mutex_lock_nested(struct mutex *lock, unsigned int subtype)
{
@@ -232,6 +246,7 @@ __mutex_unlock_common_slowpath(atomic_t
unsigned long flags;

spin_lock_mutex(&lock->wait_lock, flags);
+ mutex_release(&lock->dep_map, nested, _RET_IP_);
debug_mutex_unlock(lock);

/*
@@ -322,9 +337,10 @@ static inline int __mutex_trylock_slowpa
spin_lock_mutex(&lock->wait_lock, flags);

prev = atomic_xchg(&lock->count, -1);
- if (likely(prev == 1))
+ if (likely(prev == 1)) {
debug_mutex_set_owner(lock, current_thread_info());
-
+ mutex_acquire(&lock->dep_map, 0, 1, _RET_IP_);
+ }
/* Set it back to 0 if there are no waiters: */
if (likely(list_empty(&lock->wait_list)))
atomic_set(&lock->count, 0);
Index: linux/kernel/mutex.h
===================================================================
--- linux.orig/kernel/mutex.h
+++ linux/kernel/mutex.h
@@ -16,14 +16,13 @@
#define mutex_remove_waiter(lock, waiter, ti) \
__list_del((waiter)->list.prev, (waiter)->list.next)

+#undef DEBUG_WARN_ON
#define DEBUG_WARN_ON(c) do { } while (0)
#define debug_mutex_set_owner(lock, new_owner) do { } while (0)
#define debug_mutex_clear_owner(lock) do { } while (0)
#define debug_mutex_wake_waiter(lock, waiter) do { } while (0)
#define debug_mutex_free_waiter(waiter) do { } while (0)
#define debug_mutex_add_waiter(lock, waiter, ti) do { } while (0)
-#define mutex_acquire(lock, subtype, trylock) do { } while (0)
-#define mutex_release(lock, nested) do { } while (0)
#define debug_mutex_unlock(lock) do { } while (0)
#define debug_mutex_init(lock, name) do { } while (0)

2006-05-29 21:40:43

by Ingo Molnar

[permalink] [raw]
Subject: [patch 30/61] lock validator: x86_64 early init

From: Ingo Molnar <[email protected]>

x86_64 uses spinlocks very early - earlier than start_kernel().
So call lockdep_init() from the arch setup code.

Signed-off-by: Ingo Molnar <[email protected]>
Signed-off-by: Arjan van de Ven <[email protected]>
---
arch/x86_64/kernel/head64.c | 5 +++++
1 file changed, 5 insertions(+)

Index: linux/arch/x86_64/kernel/head64.c
===================================================================
--- linux.orig/arch/x86_64/kernel/head64.c
+++ linux/arch/x86_64/kernel/head64.c
@@ -85,6 +85,11 @@ void __init x86_64_start_kernel(char * r
clear_bss();

/*
+ * This must be called really, really early:
+ */
+ lockdep_init();
+
+ /*
* switch to init_level4_pgt from boot_level4_pgt
*/
memcpy(init_level4_pgt, boot_level4_pgt, PTRS_PER_PGD*sizeof(pgd_t));

2006-05-29 21:40:42

by Ingo Molnar

[permalink] [raw]
Subject: [patch 29/61] lock validator: print all lock-types on SysRq-D

From: Ingo Molnar <[email protected]>

print all lock-types on SysRq-D.

Signed-off-by: Ingo Molnar <[email protected]>
Signed-off-by: Arjan van de Ven <[email protected]>
---
drivers/char/sysrq.c | 4 +++-
1 file changed, 3 insertions(+), 1 deletion(-)

Index: linux/drivers/char/sysrq.c
===================================================================
--- linux.orig/drivers/char/sysrq.c
+++ linux/drivers/char/sysrq.c
@@ -148,12 +148,14 @@ static struct sysrq_key_op sysrq_mountro
.enable_mask = SYSRQ_ENABLE_REMOUNT,
};

-#ifdef CONFIG_DEBUG_MUTEXES
+#ifdef CONFIG_LOCKDEP
static void sysrq_handle_showlocks(int key, struct pt_regs *pt_regs,
struct tty_struct *tty)
{
debug_show_all_locks();
+ print_lock_types();
}
+
static struct sysrq_key_op sysrq_showlocks_op = {
.handler = sysrq_handle_showlocks,
.help_msg = "show-all-locks(D)",

2006-05-29 21:42:46

by Ingo Molnar

[permalink] [raw]
Subject: [patch 13/61] lock validator: x86_64: document stack frame internals

From: Ingo Molnar <[email protected]>

document stack frame nesting internals some more.

Signed-off-by: Ingo Molnar <[email protected]>
Signed-off-by: Arjan van de Ven <[email protected]>
---
arch/x86_64/kernel/traps.c | 64 +++++++++++++++++++++++++++++++++++++++++++--
1 file changed, 62 insertions(+), 2 deletions(-)

Index: linux/arch/x86_64/kernel/traps.c
===================================================================
--- linux.orig/arch/x86_64/kernel/traps.c
+++ linux/arch/x86_64/kernel/traps.c
@@ -134,8 +134,9 @@ void printk_address(unsigned long addres
}
#endif

-static unsigned long *in_exception_stack(unsigned cpu, unsigned long stack,
- unsigned *usedp, const char **idp)
+unsigned long *
+in_exception_stack(unsigned cpu, unsigned long stack, unsigned *usedp,
+ const char **idp)
{
static char ids[][8] = {
[DEBUG_STACK - 1] = "#DB",
@@ -149,10 +150,22 @@ static unsigned long *in_exception_stack
};
unsigned k;

+ /*
+ * Iterate over all exception stacks, and figure out whether
+ * 'stack' is in one of them:
+ */
for (k = 0; k < N_EXCEPTION_STACKS; k++) {
unsigned long end;

+ /*
+ * set 'end' to the end of the exception stack.
+ */
switch (k + 1) {
+ /*
+ * TODO: this block is not needed, I think, because
+ * setup64.c:cpu_init() sets up t->ist[DEBUG_STACK]
+ * properly too.
+ */
#if DEBUG_STKSZ > EXCEPTION_STKSZ
case DEBUG_STACK:
end = cpu_pda(cpu)->debugstack + DEBUG_STKSZ;
@@ -162,19 +175,43 @@ static unsigned long *in_exception_stack
end = per_cpu(init_tss, cpu).ist[k];
break;
}
+ /*
+ * Is 'stack' above this exception frame's end?
+ * If yes then skip to the next frame.
+ */
if (stack >= end)
continue;
+ /*
+ * Is 'stack' above this exception frame's start address?
+ * If yes then we found the right frame.
+ */
if (stack >= end - EXCEPTION_STKSZ) {
+ /*
+ * Make sure we only iterate through an exception
+ * stack once. If it comes up for the second time
+ * then there's something wrong going on - just
+ * break out and return NULL:
+ */
if (*usedp & (1U << k))
break;
*usedp |= 1U << k;
*idp = ids[k];
return (unsigned long *)end;
}
+ /*
+ * If this is a debug stack, and if it has a larger size than
+ * the usual exception stacks, then 'stack' might still
+ * be within the lower portion of the debug stack:
+ */
#if DEBUG_STKSZ > EXCEPTION_STKSZ
if (k == DEBUG_STACK - 1 && stack >= end - DEBUG_STKSZ) {
unsigned j = N_EXCEPTION_STACKS - 1;

+ /*
+ * Black magic. A large debug stack is composed of
+ * multiple exception stack entries, which we
+ * iterate through now. Don't look:
+ */
do {
++j;
end -= EXCEPTION_STKSZ;
@@ -206,6 +243,11 @@ void show_trace(unsigned long *stack)

printk("\nCall Trace:\n");

+ /*
+ * Print function call entries within a stack. 'cond' is the
+ * "end of stackframe" condition, that the 'stack++'
+ * iteration will eventually trigger.
+ */
#define HANDLE_STACK(cond) \
do while (cond) { \
unsigned long addr = *stack++; \
@@ -223,6 +265,11 @@ void show_trace(unsigned long *stack)
} \
} while (0)

+ /*
+ * Print function call entries in all stacks, starting at the
+ * current stack address. If the stacks consist of nested
+ * exceptions
+ */
for ( ; ; ) {
const char *id;
unsigned long *estack_end;
@@ -233,6 +280,11 @@ void show_trace(unsigned long *stack)
printk(" <%s>", id);
HANDLE_STACK (stack < estack_end);
printk(" <EOE>");
+ /*
+ * We link to the next stack via the
+ * second-to-last pointer (index -2 to end) in the
+ * exception stack:
+ */
stack = (unsigned long *) estack_end[-2];
continue;
}
@@ -244,6 +296,11 @@ void show_trace(unsigned long *stack)
if (stack >= irqstack && stack < irqstack_end) {
printk(" <IRQ>");
HANDLE_STACK (stack < irqstack_end);
+ /*
+ * We link to the next stack (which would normally be
+ * the process stack) via the last
+ * pointer (index -1 to end) in the IRQ stack:
+ */
stack = (unsigned long *) (irqstack_end[-1]);
irqstack_end = NULL;
printk(" <EOI>");
@@ -253,6 +310,9 @@ void show_trace(unsigned long *stack)
break;
}

+ /*
+ * This prints the process stack:
+ */
HANDLE_STACK (((long) stack & (THREAD_SIZE-1)) != 0);
#undef HANDLE_STACK

2006-05-29 21:25:05

by Ingo Molnar

[permalink] [raw]
Subject: [patch 24/61] lock validator: procfs

From: Ingo Molnar <[email protected]>

lock validator /proc/lockdep and /proc/lockdep_stats support.
(FIXME: should go into debugfs)

Signed-off-by: Ingo Molnar <[email protected]>
Signed-off-by: Arjan van de Ven <[email protected]>
---
kernel/Makefile | 3
kernel/lockdep_proc.c | 345 ++++++++++++++++++++++++++++++++++++++++++++++++++
2 files changed, 348 insertions(+)

Index: linux/kernel/Makefile
===================================================================
--- linux.orig/kernel/Makefile
+++ linux/kernel/Makefile
@@ -13,6 +13,9 @@ obj-y = sched.o fork.o exec_domain.o
obj-y += time/
obj-$(CONFIG_DEBUG_MUTEXES) += mutex-debug.o
obj-$(CONFIG_LOCKDEP) += lockdep.o
+ifeq ($(CONFIG_PROC_FS),y)
+obj-$(CONFIG_LOCKDEP) += lockdep_proc.o
+endif
obj-$(CONFIG_FUTEX) += futex.o
ifeq ($(CONFIG_COMPAT),y)
obj-$(CONFIG_FUTEX) += futex_compat.o
Index: linux/kernel/lockdep_proc.c
===================================================================
--- /dev/null
+++ linux/kernel/lockdep_proc.c
@@ -0,0 +1,345 @@
+/*
+ * kernel/lockdep_proc.c
+ *
+ * Runtime locking correctness validator
+ *
+ * Started by Ingo Molnar:
+ *
+ * Copyright (C) 2006 Red Hat, Inc., Ingo Molnar <[email protected]>
+ *
+ * Code for /proc/lockdep and /proc/lockdep_stats:
+ *
+ */
+#include <linux/sched.h>
+#include <linux/module.h>
+#include <linux/proc_fs.h>
+#include <linux/seq_file.h>
+#include <linux/kallsyms.h>
+#include <linux/debug_locks.h>
+
+#include "lockdep_internals.h"
+
+static void *l_next(struct seq_file *m, void *v, loff_t *pos)
+{
+ struct lock_type *type = v;
+
+ (*pos)++;
+
+ if (type->lock_entry.next != &all_lock_types)
+ type = list_entry(type->lock_entry.next, struct lock_type,
+ lock_entry);
+ else
+ type = NULL;
+ m->private = type;
+
+ return type;
+}
+
+static void *l_start(struct seq_file *m, loff_t *pos)
+{
+ struct lock_type *type = m->private;
+
+ if (&type->lock_entry == all_lock_types.next)
+ seq_printf(m, "all lock types:\n");
+
+ return type;
+}
+
+static void l_stop(struct seq_file *m, void *v)
+{
+}
+
+static unsigned long count_forward_deps(struct lock_type *type)
+{
+ struct lock_list *entry;
+ unsigned long ret = 1;
+
+ /*
+ * Recurse this type's dependency list:
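+ * (the count starts at 1, to include this type itself)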
+ */
+ list_for_each_entry(entry, &type->locks_after, entry)
+ ret += count_forward_deps(entry->type);
+
+ return ret;
+}
+
+static unsigned long count_backward_deps(struct lock_type *type)
+{
+ struct lock_list *entry;
+ unsigned long ret = 1;
+
+ /*
+ * Recurse this type's dependency list:
+ */
+ list_for_each_entry(entry, &type->locks_before, entry)
+ ret += count_backward_deps(entry->type);
+
+ return ret;
+}
+
+static int l_show(struct seq_file *m, void *v)
+{
+ unsigned long nr_forward_deps, nr_backward_deps;
+ struct lock_type *type = m->private;
+ char str[128], c1, c2, c3, c4;
+ const char *name;
+
+ seq_printf(m, "%p", type->key);
+#ifdef CONFIG_DEBUG_LOCKDEP
+ seq_printf(m, " OPS:%8ld", type->ops);
+#endif
+ nr_forward_deps = count_forward_deps(type);
+ seq_printf(m, " FD:%5ld", nr_forward_deps);
+
+ nr_backward_deps = count_backward_deps(type);
+ seq_printf(m, " BD:%5ld", nr_backward_deps);
+
+ get_usage_chars(type, &c1, &c2, &c3, &c4);
+ seq_printf(m, " %c%c%c%c", c1, c2, c3, c4);
+
+ name = type->name;
+ if (!name) {
+ name = __get_key_name(type->key, str);
+ seq_printf(m, ": %s", name);
+ } else {
+ seq_printf(m, ": %s", name);
+ if (type->name_version > 1)
+ seq_printf(m, "#%d", type->name_version);
+ if (type->subtype)
+ seq_printf(m, "/%d", type->subtype);
+ }
+ seq_puts(m, "\n");
+
+ return 0;
+}
+
+static struct seq_operations lockdep_ops = {
+ .start = l_start,
+ .next = l_next,
+ .stop = l_stop,
+ .show = l_show,
+};
+
+static int lockdep_open(struct inode *inode, struct file *file)
+{
+ int res = seq_open(file, &lockdep_ops);
+ if (!res) {
+ struct seq_file *m = file->private_data;
+
+ if (!list_empty(&all_lock_types))
+ m->private = list_entry(all_lock_types.next,
+ struct lock_type, lock_entry);
+ else
+ m->private = NULL;
+ }
+ return res;
+}
+
+static struct file_operations proc_lockdep_operations = {
+ .open = lockdep_open,
+ .read = seq_read,
+ .llseek = seq_lseek,
+ .release = seq_release,
+};
+
+static void lockdep_stats_debug_show(struct seq_file *m)
+{
+#ifdef CONFIG_DEBUG_LOCKDEP
+ unsigned int hi1 = debug_atomic_read(&hardirqs_on_events),
+ hi2 = debug_atomic_read(&hardirqs_off_events),
+ hr1 = debug_atomic_read(&redundant_hardirqs_on),
+ hr2 = debug_atomic_read(&redundant_hardirqs_off),
+ si1 = debug_atomic_read(&softirqs_on_events),
+ si2 = debug_atomic_read(&softirqs_off_events),
+ sr1 = debug_atomic_read(&redundant_softirqs_on),
+ sr2 = debug_atomic_read(&redundant_softirqs_off);
+
+ seq_printf(m, " chain lookup misses: %11u\n",
+ debug_atomic_read(&chain_lookup_misses));
+ seq_printf(m, " chain lookup hits: %11u\n",
+ debug_atomic_read(&chain_lookup_hits));
+ seq_printf(m, " cyclic checks: %11u\n",
+ debug_atomic_read(&nr_cyclic_checks));
+ seq_printf(m, " cyclic-check recursions: %11u\n",
+ debug_atomic_read(&nr_cyclic_check_recursions));
+ seq_printf(m, " find-mask forwards checks: %11u\n",
+ debug_atomic_read(&nr_find_usage_forwards_checks));
+ seq_printf(m, " find-mask forwards recursions: %11u\n",
+ debug_atomic_read(&nr_find_usage_forwards_recursions));
+ seq_printf(m, " find-mask backwards checks: %11u\n",
+ debug_atomic_read(&nr_find_usage_backwards_checks));
+ seq_printf(m, " find-mask backwards recursions:%11u\n",
+ debug_atomic_read(&nr_find_usage_backwards_recursions));
+
+ seq_printf(m, " hardirq on events: %11u\n", hi1);
+ seq_printf(m, " hardirq off events: %11u\n", hi2);
+ seq_printf(m, " redundant hardirq ons: %11u\n", hr1);
+ seq_printf(m, " redundant hardirq offs: %11u\n", hr2);
+ seq_printf(m, " softirq on events: %11u\n", si1);
+ seq_printf(m, " softirq off events: %11u\n", si2);
+ seq_printf(m, " redundant softirq ons: %11u\n", sr1);
+ seq_printf(m, " redundant softirq offs: %11u\n", sr2);
+#endif
+}
+
+static int lockdep_stats_show(struct seq_file *m, void *v)
+{
+ struct lock_type *type;
+ unsigned long nr_unused = 0, nr_uncategorized = 0,
+ nr_irq_safe = 0, nr_irq_unsafe = 0,
+ nr_softirq_safe = 0, nr_softirq_unsafe = 0,
+ nr_hardirq_safe = 0, nr_hardirq_unsafe = 0,
+ nr_irq_read_safe = 0, nr_irq_read_unsafe = 0,
+ nr_softirq_read_safe = 0, nr_softirq_read_unsafe = 0,
+ nr_hardirq_read_safe = 0, nr_hardirq_read_unsafe = 0,
+ sum_forward_deps = 0, factor = 0;
+
+ list_for_each_entry(type, &all_lock_types, lock_entry) {
+
+ if (type->usage_mask == 0)
+ nr_unused++;
+ if (type->usage_mask == LOCKF_USED)
+ nr_uncategorized++;
+ if (type->usage_mask & LOCKF_USED_IN_IRQ)
+ nr_irq_safe++;
+ if (type->usage_mask & LOCKF_ENABLED_IRQS)
+ nr_irq_unsafe++;
+ if (type->usage_mask & LOCKF_USED_IN_SOFTIRQ)
+ nr_softirq_safe++;
+ if (type->usage_mask & LOCKF_ENABLED_SOFTIRQS)
+ nr_softirq_unsafe++;
+ if (type->usage_mask & LOCKF_USED_IN_HARDIRQ)
+ nr_hardirq_safe++;
+ if (type->usage_mask & LOCKF_ENABLED_HARDIRQS)
+ nr_hardirq_unsafe++;
+ if (type->usage_mask & LOCKF_USED_IN_IRQ_READ)
+ nr_irq_read_safe++;
+ if (type->usage_mask & LOCKF_ENABLED_IRQS_READ)
+ nr_irq_read_unsafe++;
+ if (type->usage_mask & LOCKF_USED_IN_SOFTIRQ_READ)
+ nr_softirq_read_safe++;
+ if (type->usage_mask & LOCKF_ENABLED_SOFTIRQS_READ)
+ nr_softirq_read_unsafe++;
+ if (type->usage_mask & LOCKF_USED_IN_HARDIRQ_READ)
+ nr_hardirq_read_safe++;
+ if (type->usage_mask & LOCKF_ENABLED_HARDIRQS_READ)
+ nr_hardirq_read_unsafe++;
+
+ sum_forward_deps += count_forward_deps(type);
+ }
+#ifdef CONFIG_DEBUG_LOCKDEP
+ DEBUG_WARN_ON(debug_atomic_read(&nr_unused_locks) != nr_unused);
+#endif
+ seq_printf(m, " lock-types: %11lu [max: %lu]\n",
+ nr_lock_types, MAX_LOCKDEP_KEYS);
+ seq_printf(m, " direct dependencies: %11lu [max: %lu]\n",
+ nr_list_entries, MAX_LOCKDEP_ENTRIES);
+ seq_printf(m, " indirect dependencies: %11lu\n",
+ sum_forward_deps);
+
+ /*
+ * Total number of dependencies:
+ *
+ * All irq-safe locks may nest inside irq-unsafe locks,
+ * plus all the other known dependencies:
+ */
+ seq_printf(m, " all direct dependencies: %11lu\n",
+ nr_irq_unsafe * nr_irq_safe +
+ nr_hardirq_unsafe * nr_hardirq_safe +
+ nr_list_entries);
+
+ /*
+ * Estimated factor between direct and indirect
+ * dependencies:
+ */
+ if (nr_list_entries)
+ factor = sum_forward_deps / nr_list_entries;
+
+ seq_printf(m, " dependency chains: %11lu [max: %lu]\n",
+ nr_lock_chains, MAX_LOCKDEP_CHAINS);
+
+#ifdef CONFIG_TRACE_IRQFLAGS
+ seq_printf(m, " in-hardirq chains: %11u\n",
+ nr_hardirq_chains);
+ seq_printf(m, " in-softirq chains: %11u\n",
+ nr_softirq_chains);
+#endif
+ seq_printf(m, " in-process chains: %11u\n",
+ nr_process_chains);
+ seq_printf(m, " stack-trace entries: %11lu [max: %lu]\n",
+ nr_stack_trace_entries, MAX_STACK_TRACE_ENTRIES);
+ seq_printf(m, " combined max dependencies: %11u\n",
+ (nr_hardirq_chains + 1) *
+ (nr_softirq_chains + 1) *
+ (nr_process_chains + 1)
+ );
+ seq_printf(m, " hardirq-safe locks: %11lu\n",
+ nr_hardirq_safe);
+ seq_printf(m, " hardirq-unsafe locks: %11lu\n",
+ nr_hardirq_unsafe);
+ seq_printf(m, " softirq-safe locks: %11lu\n",
+ nr_softirq_safe);
+ seq_printf(m, " softirq-unsafe locks: %11lu\n",
+ nr_softirq_unsafe);
+ seq_printf(m, " irq-safe locks: %11lu\n",
+ nr_irq_safe);
+ seq_printf(m, " irq-unsafe locks: %11lu\n",
+ nr_irq_unsafe);
+
+ seq_printf(m, " hardirq-read-safe locks: %11lu\n",
+ nr_hardirq_read_safe);
+ seq_printf(m, " hardirq-read-unsafe locks: %11lu\n",
+ nr_hardirq_read_unsafe);
+ seq_printf(m, " softirq-read-safe locks: %11lu\n",
+ nr_softirq_read_safe);
+ seq_printf(m, " softirq-read-unsafe locks: %11lu\n",
+ nr_softirq_read_unsafe);
+ seq_printf(m, " irq-read-safe locks: %11lu\n",
+ nr_irq_read_safe);
+ seq_printf(m, " irq-read-unsafe locks: %11lu\n",
+ nr_irq_read_unsafe);
+
+ seq_printf(m, " uncategorized locks: %11lu\n",
+ nr_uncategorized);
+ seq_printf(m, " unused locks: %11lu\n",
+ nr_unused);
+ seq_printf(m, " max locking depth: %11u\n",
+ max_lockdep_depth);
+ seq_printf(m, " max recursion depth: %11u\n",
+ max_recursion_depth);
+ lockdep_stats_debug_show(m);
+ seq_printf(m, " debug_locks: %11u\n",
+ debug_locks);
+
+ return 0;
+}
+
+static int lockdep_stats_open(struct inode *inode, struct file *file)
+{
+ return single_open(file, lockdep_stats_show, NULL);
+}
+
+static struct file_operations proc_lockdep_stats_operations = {
+ .open = lockdep_stats_open,
+ .read = seq_read,
+ .llseek = seq_lseek,
+ .release = seq_release,
+};
+
+static int __init lockdep_proc_init(void)
+{
+ struct proc_dir_entry *entry;
+
+ entry = create_proc_entry("lockdep", S_IRUSR, NULL);
+ if (entry)
+ entry->proc_fops = &proc_lockdep_operations;
+
+ entry = create_proc_entry("lockdep_stats", S_IRUSR, NULL);
+ if (entry)
+ entry->proc_fops = &proc_lockdep_stats_operations;
+
+ return 0;
+}
+
+__initcall(lockdep_proc_init);
+

2006-05-29 21:42:12

by Ingo Molnar

[permalink] [raw]
Subject: [patch 20/61] lock validator: irqtrace: cleanup: include/asm-x86_64/irqflags.h

From: Ingo Molnar <[email protected]>

clean up the x86-64 irqflags.h file:

- macro => inline function transformation
- simplifications
- style fixes
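
The callers' view does not change; a minimal usage sketch (hypothetical
caller, relying only on the interfaces in the patch below):

	unsigned long flags;

	raw_local_irq_save(flags);	/* now expands to flags = __raw_local_irq_save() */
	/* ... critical section runs with interrupts disabled ... */
	WARN_ON(!raw_irqs_disabled());	/* a type-checked inline now, not a macro */
	raw_local_irq_restore(flags);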

Signed-off-by: Ingo Molnar <[email protected]>
Signed-off-by: Arjan van de Ven <[email protected]>
---
arch/x86_64/lib/thunk.S | 5 +
include/asm-x86_64/irqflags.h | 159 ++++++++++++++++++++++++++++++++----------
2 files changed, 128 insertions(+), 36 deletions(-)

Index: linux/arch/x86_64/lib/thunk.S
===================================================================
--- linux.orig/arch/x86_64/lib/thunk.S
+++ linux/arch/x86_64/lib/thunk.S
@@ -47,6 +47,11 @@
thunk_retrax __down_failed_interruptible,__down_interruptible
thunk_retrax __down_failed_trylock,__down_trylock
thunk __up_wakeup,__up
+
+#ifdef CONFIG_TRACE_IRQFLAGS
+ thunk trace_hardirqs_on_thunk,trace_hardirqs_on
+ thunk trace_hardirqs_off_thunk,trace_hardirqs_off
+#endif

/* SAVE_ARGS below is used only for the .cfi directives it contains. */
CFI_STARTPROC
Index: linux/include/asm-x86_64/irqflags.h
===================================================================
--- linux.orig/include/asm-x86_64/irqflags.h
+++ linux/include/asm-x86_64/irqflags.h
@@ -5,50 +5,137 @@
*
* This file gets included from lowlevel asm headers too, to provide
* wrapped versions of the local_irq_*() APIs, based on the
- * raw_local_irq_*() macros from the lowlevel headers.
+ * raw_local_irq_*() functions from the lowlevel headers.
*/
#ifndef _ASM_IRQFLAGS_H
#define _ASM_IRQFLAGS_H

-/* interrupt control.. */
-#define raw_local_save_flags(x) do { warn_if_not_ulong(x); __asm__ __volatile__("# save_flags \n\t pushfq ; popq %q0":"=g" (x): /* no input */ :"memory"); } while (0)
-#define raw_local_irq_restore(x) __asm__ __volatile__("# restore_flags \n\t pushq %0 ; popfq": /* no output */ :"g" (x):"memory", "cc")
+#ifndef __ASSEMBLY__
+/*
+ * Interrupt control:
+ */
+
+static inline unsigned long __raw_local_save_flags(void)
+{
+ unsigned long flags;
+
+ __asm__ __volatile__(
+ "# __raw_save_flags\n\t"
+ "pushfq ; popq %q0"
+ : "=g" (flags)
+ : /* no input */
+ : "memory"
+ );
+
+ return flags;
+}
+
+#define raw_local_save_flags(flags) \
+ do { (flags) = __raw_local_save_flags(); } while (0)
+
+static inline void raw_local_irq_restore(unsigned long flags)
+{
+ __asm__ __volatile__(
+ "pushq %0 ; popfq"
+ : /* no output */
+ :"g" (flags)
+ :"memory", "cc"
+ );
+}

#ifdef CONFIG_X86_VSMP
-/* Interrupt control for VSMP architecture */
-#define raw_local_irq_disable() do { unsigned long flags; raw_local_save_flags(flags); raw_local_irq_restore((flags & ~(1 << 9)) | (1 << 18)); } while (0)
-#define raw_local_irq_enable() do { unsigned long flags; raw_local_save_flags(flags); raw_local_irq_restore((flags | (1 << 9)) & ~(1 << 18)); } while (0)
-
-#define raw_irqs_disabled_flags(flags) \
-({ \
- (flags & (1<<18)) || !(flags & (1<<9)); \
-})
-
-/* For spinlocks etc */
-#define raw_local_irq_save(x) do { raw_local_save_flags(x); raw_local_irq_restore((x & ~(1 << 9)) | (1 << 18)); } while (0)
-#else /* CONFIG_X86_VSMP */
-#define raw_local_irq_disable() __asm__ __volatile__("cli": : :"memory")
-#define raw_local_irq_enable() __asm__ __volatile__("sti": : :"memory")
-
-#define raw_irqs_disabled_flags(flags) \
-({ \
- !(flags & (1<<9)); \
-})

-/* For spinlocks etc */
-#define raw_local_irq_save(x) do { warn_if_not_ulong(x); __asm__ __volatile__("# raw_local_irq_save \n\t pushfq ; popq %0 ; cli":"=g" (x): /* no input */ :"memory"); } while (0)
+/*
+ * Interrupt control for the VSMP architecture:
+ */
+
+static inline void raw_local_irq_disable(void)
+{
+ unsigned long flags = __raw_local_save_flags();
+
+ raw_local_irq_restore((flags & ~(1 << 9)) | (1 << 18));
+}
+
+static inline void raw_local_irq_enable(void)
+{
+ unsigned long flags = __raw_local_save_flags();
+
+ raw_local_irq_restore((flags | (1 << 9)) & ~(1 << 18));
+}
+
+static inline int raw_irqs_disabled_flags(unsigned long flags)
+{
+ return !(flags & (1<<9)) || (flags & (1 << 18));
+}
+
+#else /* CONFIG_X86_VSMP */
+
+static inline void raw_local_irq_disable(void)
+{
+ __asm__ __volatile__("cli" : : : "memory");
+}
+
+static inline void raw_local_irq_enable(void)
+{
+ __asm__ __volatile__("sti" : : : "memory");
+}
+
+static inline int raw_irqs_disabled_flags(unsigned long flags)
+{
+ return !(flags & (1 << 9));
+}
+
#endif

-#define raw_irqs_disabled() \
-({ \
- unsigned long flags; \
- raw_local_save_flags(flags); \
- raw_irqs_disabled_flags(flags); \
-})
-
-/* used in the idle loop; sti takes one instruction cycle to complete */
-#define raw_safe_halt() __asm__ __volatile__("sti; hlt": : :"memory")
-/* used when interrupts are already enabled or to shutdown the processor */
-#define halt() __asm__ __volatile__("hlt": : :"memory")
+/*
+ * For spinlocks, etc.:
+ */
+
+static inline unsigned long __raw_local_irq_save(void)
+{
+ unsigned long flags = __raw_local_save_flags();
+
+ raw_local_irq_disable();
+
+ return flags;
+}
+
+#define raw_local_irq_save(flags) \
+ do { (flags) = __raw_local_irq_save(); } while (0)
+
+static inline int raw_irqs_disabled(void)
+{
+ unsigned long flags = __raw_local_save_flags();
+
+ return raw_irqs_disabled_flags(flags);
+}
+
+/*
+ * Used in the idle loop; sti takes one instruction cycle
+ * to complete:
+ */
+static inline void raw_safe_halt(void)
+{
+ __asm__ __volatile__("sti; hlt" : : : "memory");
+}
+
+/*
+ * Used when interrupts are already enabled or to
+ * shutdown the processor:
+ */
+static inline void halt(void)
+{
+ __asm__ __volatile__("hlt": : :"memory");
+}
+
+#else /* __ASSEMBLY__: */
+# ifdef CONFIG_TRACE_IRQFLAGS
+# define TRACE_IRQS_ON call trace_hardirqs_on_thunk
+# define TRACE_IRQS_OFF call trace_hardirqs_off_thunk
+# else
+# define TRACE_IRQS_ON
+# define TRACE_IRQS_OFF
+# endif
+#endif

#endif

2006-05-29 21:42:12

by Ingo Molnar

[permalink] [raw]
Subject: [patch 27/61] lock validator: prove spinlock/rwlock locking correctness

From: Ingo Molnar <[email protected]>

add CONFIG_PROVE_SPIN_LOCKING and CONFIG_PROVE_RW_LOCKING, which use
the lock validator framework to prove spinlock and rwlock locking
correctness.
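
As a usage sketch of the new annotations (hypothetical bucket structure
and key name; only the interfaces added by this patch are assumed):

	/* hypothetical example: dynamically initialized bucket locks */
	struct my_hash_bucket {
		spinlock_t lock;
	};

	/* one lockdep type shared by all buckets: */
	static struct lockdep_type_key my_bucket_key;

	static void my_bucket_init(struct my_hash_bucket *b)
	{
		spin_lock_init_key(&b->lock, &my_bucket_key);
	}

	/* taking two locks of the same type needs a nesting annotation: */
	static void my_lock_two(struct my_hash_bucket *a, struct my_hash_bucket *b)
	{
		spin_lock(&a->lock);
		spin_lock_nested(&b->lock, 1);	/* subtype 1 nests inside subtype 0 */
		/* ... work on both buckets ... */
		spin_unlock(&b->lock);
		spin_unlock(&a->lock);
	}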

Signed-off-by: Ingo Molnar <[email protected]>
Signed-off-by: Arjan van de Ven <[email protected]>
---
include/asm-i386/spinlock.h | 2
include/linux/spinlock.h | 96 ++++++++++++++++++++++-----
include/linux/spinlock_api_smp.h | 4 +
include/linux/spinlock_api_up.h | 3
include/linux/spinlock_types.h | 32 ++++++++-
include/linux/spinlock_types_up.h | 10 ++
include/linux/spinlock_up.h | 4 -
kernel/Makefile | 2
kernel/sched.c | 10 ++
kernel/spinlock.c | 131 +++++++++++++++++++++++++++++++++++---
lib/kernel_lock.c | 7 +-
net/ipv4/route.c | 4 -
12 files changed, 269 insertions(+), 36 deletions(-)

Index: linux/include/asm-i386/spinlock.h
===================================================================
--- linux.orig/include/asm-i386/spinlock.h
+++ linux/include/asm-i386/spinlock.h
@@ -68,6 +68,7 @@ static inline void __raw_spin_lock(raw_s
"=m" (lock->slock) : : "memory");
}

+#ifndef CONFIG_PROVE_SPIN_LOCKING
static inline void __raw_spin_lock_flags(raw_spinlock_t *lock, unsigned long flags)
{
alternative_smp(
@@ -75,6 +76,7 @@ static inline void __raw_spin_lock_flags
__raw_spin_lock_string_up,
"=m" (lock->slock) : "r" (flags) : "memory");
}
+#endif

static inline int __raw_spin_trylock(raw_spinlock_t *lock)
{
Index: linux/include/linux/spinlock.h
===================================================================
--- linux.orig/include/linux/spinlock.h
+++ linux/include/linux/spinlock.h
@@ -82,14 +82,64 @@ extern int __lockfunc generic__raw_read_
/*
* Pull the __raw*() functions/declarations (UP-nondebug doesnt need them):
*/
-#if defined(CONFIG_SMP)
+#ifdef CONFIG_SMP
# include <asm/spinlock.h>
#else
# include <linux/spinlock_up.h>
#endif

-#define spin_lock_init(lock) do { *(lock) = SPIN_LOCK_UNLOCKED; } while (0)
-#define rwlock_init(lock) do { *(lock) = RW_LOCK_UNLOCKED; } while (0)
+#if defined(CONFIG_DEBUG_SPINLOCK) || defined(CONFIG_PROVE_SPIN_LOCKING)
+ extern void __spin_lock_init(spinlock_t *lock, const char *name,
+ struct lockdep_type_key *key);
+# define spin_lock_init(lock) \
+do { \
+ static struct lockdep_type_key __key; \
+ \
+ __spin_lock_init((lock), #lock, &__key); \
+} while (0)
+
+/*
+ * If, for example, an array of static locks is initialized
+ * via spin_lock_init(), this API variant can be used to
+ * split them into separate lock-types:
+ */
+# define spin_lock_init_static(lock) \
+ __spin_lock_init((lock), #lock, \
+ (struct lockdep_type_key *)(lock)) \
+
+/*
+ * Type splitting can also be done for dynamic locks, if for
+ * example there are per-CPU dynamically allocated locks:
+ */
+# define spin_lock_init_key(lock, key) \
+ __spin_lock_init((lock), #lock, key)
+
+#else
+# define spin_lock_init(lock) \
+ do { *(lock) = SPIN_LOCK_UNLOCKED; } while (0)
+# define spin_lock_init_static(lock) \
+ spin_lock_init(lock)
+# define spin_lock_init_key(lock, key) \
+ do { spin_lock_init(lock); (void)(key); } while (0)
+#endif
+
+#if defined(CONFIG_DEBUG_SPINLOCK) || defined(CONFIG_PROVE_RW_LOCKING)
+ extern void __rwlock_init(rwlock_t *lock, const char *name,
+ struct lockdep_type_key *key);
+# define rwlock_init(lock) \
+do { \
+ static struct lockdep_type_key __key; \
+ \
+ __rwlock_init((lock), #lock, &__key); \
+} while (0)
+# define rwlock_init_key(lock, key) \
+ __rwlock_init((lock), #lock, key)
+#else
+# define rwlock_init(lock) \
+ do { *(lock) = RW_LOCK_UNLOCKED; } while (0)
+# define rwlock_init_key(lock, key) \
+ do { rwlock_init(lock); (void)(key); } while (0)
+#endif

#define spin_is_locked(lock) __raw_spin_is_locked(&(lock)->raw_lock)

@@ -102,7 +152,9 @@ extern int __lockfunc generic__raw_read_
/*
* Pull the _spin_*()/_read_*()/_write_*() functions/declarations:
*/
-#if defined(CONFIG_SMP) || defined(CONFIG_DEBUG_SPINLOCK)
+#if defined(CONFIG_SMP) || defined(CONFIG_DEBUG_SPINLOCK) || \
+ defined(CONFIG_PROVE_SPIN_LOCKING) || \
+ defined(CONFIG_PROVE_RW_LOCKING)
# include <linux/spinlock_api_smp.h>
#else
# include <linux/spinlock_api_up.h>
@@ -113,7 +165,6 @@ extern int __lockfunc generic__raw_read_
#define _raw_spin_lock_flags(lock, flags) _raw_spin_lock(lock)
extern int _raw_spin_trylock(spinlock_t *lock);
extern void _raw_spin_unlock(spinlock_t *lock);
-
extern void _raw_read_lock(rwlock_t *lock);
extern int _raw_read_trylock(rwlock_t *lock);
extern void _raw_read_unlock(rwlock_t *lock);
@@ -121,17 +172,17 @@ extern int __lockfunc generic__raw_read_
extern int _raw_write_trylock(rwlock_t *lock);
extern void _raw_write_unlock(rwlock_t *lock);
#else
-# define _raw_spin_unlock(lock) __raw_spin_unlock(&(lock)->raw_lock)
-# define _raw_spin_trylock(lock) __raw_spin_trylock(&(lock)->raw_lock)
# define _raw_spin_lock(lock) __raw_spin_lock(&(lock)->raw_lock)
# define _raw_spin_lock_flags(lock, flags) \
__raw_spin_lock_flags(&(lock)->raw_lock, *(flags))
+# define _raw_spin_trylock(lock) __raw_spin_trylock(&(lock)->raw_lock)
+# define _raw_spin_unlock(lock) __raw_spin_unlock(&(lock)->raw_lock)
# define _raw_read_lock(rwlock) __raw_read_lock(&(rwlock)->raw_lock)
-# define _raw_write_lock(rwlock) __raw_write_lock(&(rwlock)->raw_lock)
-# define _raw_read_unlock(rwlock) __raw_read_unlock(&(rwlock)->raw_lock)
-# define _raw_write_unlock(rwlock) __raw_write_unlock(&(rwlock)->raw_lock)
# define _raw_read_trylock(rwlock) __raw_read_trylock(&(rwlock)->raw_lock)
+# define _raw_read_unlock(rwlock) __raw_read_unlock(&(rwlock)->raw_lock)
+# define _raw_write_lock(rwlock) __raw_write_lock(&(rwlock)->raw_lock)
# define _raw_write_trylock(rwlock) __raw_write_trylock(&(rwlock)->raw_lock)
+# define _raw_write_unlock(rwlock) __raw_write_unlock(&(rwlock)->raw_lock)
#endif

#define read_can_lock(rwlock) __raw_read_can_lock(&(rwlock)->raw_lock)
@@ -147,10 +198,14 @@ extern int __lockfunc generic__raw_read_
#define write_trylock(lock) __cond_lock(_write_trylock(lock))

#define spin_lock(lock) _spin_lock(lock)
+#define spin_lock_nested(lock, subtype) \
+ _spin_lock_nested(lock, subtype)
#define write_lock(lock) _write_lock(lock)
#define read_lock(lock) _read_lock(lock)

-#if defined(CONFIG_SMP) || defined(CONFIG_DEBUG_SPINLOCK)
+#if defined(CONFIG_SMP) || defined(CONFIG_DEBUG_SPINLOCK) || \
+ defined(CONFIG_PROVE_SPIN_LOCKING) || \
+ defined(CONFIG_PROVE_RW_LOCKING)
#define spin_lock_irqsave(lock, flags) flags = _spin_lock_irqsave(lock)
#define read_lock_irqsave(lock, flags) flags = _read_lock_irqsave(lock)
#define write_lock_irqsave(lock, flags) flags = _write_lock_irqsave(lock)
@@ -172,21 +227,24 @@ extern int __lockfunc generic__raw_read_
/*
* We inline the unlock functions in the nondebug case:
*/
-#if defined(CONFIG_DEBUG_SPINLOCK) || defined(CONFIG_PREEMPT) || !defined(CONFIG_SMP)
+#if defined(CONFIG_DEBUG_SPINLOCK) || defined(CONFIG_PREEMPT) || \
+ !defined(CONFIG_SMP) || \
+ defined(CONFIG_PROVE_SPIN_LOCKING) || \
+ defined(CONFIG_PROVE_RW_LOCKING)
# define spin_unlock(lock) _spin_unlock(lock)
+# define spin_unlock_non_nested(lock) _spin_unlock_non_nested(lock)
# define read_unlock(lock) _read_unlock(lock)
+# define read_unlock_non_nested(lock) _read_unlock_non_nested(lock)
# define write_unlock(lock) _write_unlock(lock)
-#else
-# define spin_unlock(lock) __raw_spin_unlock(&(lock)->raw_lock)
-# define read_unlock(lock) __raw_read_unlock(&(lock)->raw_lock)
-# define write_unlock(lock) __raw_write_unlock(&(lock)->raw_lock)
-#endif
-
-#if defined(CONFIG_DEBUG_SPINLOCK) || defined(CONFIG_PREEMPT) || !defined(CONFIG_SMP)
# define spin_unlock_irq(lock) _spin_unlock_irq(lock)
# define read_unlock_irq(lock) _read_unlock_irq(lock)
# define write_unlock_irq(lock) _write_unlock_irq(lock)
#else
+# define spin_unlock(lock) __raw_spin_unlock(&(lock)->raw_lock)
+# define spin_unlock_non_nested(lock) __raw_spin_unlock(&(lock)->raw_lock)
+# define read_unlock(lock) __raw_read_unlock(&(lock)->raw_lock)
+# define read_unlock_non_nested(lock) __raw_read_unlock(&(lock)->raw_lock)
+# define write_unlock(lock) __raw_write_unlock(&(lock)->raw_lock)
# define spin_unlock_irq(lock) \
do { __raw_spin_unlock(&(lock)->raw_lock); local_irq_enable(); } while (0)
# define read_unlock_irq(lock) \
Index: linux/include/linux/spinlock_api_smp.h
===================================================================
--- linux.orig/include/linux/spinlock_api_smp.h
+++ linux/include/linux/spinlock_api_smp.h
@@ -20,6 +20,8 @@ int in_lock_functions(unsigned long addr
#define assert_spin_locked(x) BUG_ON(!spin_is_locked(x))

void __lockfunc _spin_lock(spinlock_t *lock) __acquires(spinlock_t);
+void __lockfunc _spin_lock_nested(spinlock_t *lock, int subtype)
+ __acquires(spinlock_t);
void __lockfunc _read_lock(rwlock_t *lock) __acquires(rwlock_t);
void __lockfunc _write_lock(rwlock_t *lock) __acquires(rwlock_t);
void __lockfunc _spin_lock_bh(spinlock_t *lock) __acquires(spinlock_t);
@@ -39,7 +41,9 @@ int __lockfunc _read_trylock(rwlock_t *l
int __lockfunc _write_trylock(rwlock_t *lock);
int __lockfunc _spin_trylock_bh(spinlock_t *lock);
void __lockfunc _spin_unlock(spinlock_t *lock) __releases(spinlock_t);
+void __lockfunc _spin_unlock_non_nested(spinlock_t *lock) __releases(spinlock_t);
void __lockfunc _read_unlock(rwlock_t *lock) __releases(rwlock_t);
+void __lockfunc _read_unlock_non_nested(rwlock_t *lock) __releases(rwlock_t);
void __lockfunc _write_unlock(rwlock_t *lock) __releases(rwlock_t);
void __lockfunc _spin_unlock_bh(spinlock_t *lock) __releases(spinlock_t);
void __lockfunc _read_unlock_bh(rwlock_t *lock) __releases(rwlock_t);
Index: linux/include/linux/spinlock_api_up.h
===================================================================
--- linux.orig/include/linux/spinlock_api_up.h
+++ linux/include/linux/spinlock_api_up.h
@@ -49,6 +49,7 @@
do { local_irq_restore(flags); __UNLOCK(lock); } while (0)

#define _spin_lock(lock) __LOCK(lock)
+#define _spin_lock_nested(lock, subtype) __LOCK(lock)
#define _read_lock(lock) __LOCK(lock)
#define _write_lock(lock) __LOCK(lock)
#define _spin_lock_bh(lock) __LOCK_BH(lock)
@@ -65,7 +66,9 @@
#define _write_trylock(lock) ({ __LOCK(lock); 1; })
#define _spin_trylock_bh(lock) ({ __LOCK_BH(lock); 1; })
#define _spin_unlock(lock) __UNLOCK(lock)
+#define _spin_unlock_non_nested(lock) __UNLOCK(lock)
#define _read_unlock(lock) __UNLOCK(lock)
+#define _read_unlock_non_nested(lock) __UNLOCK(lock)
#define _write_unlock(lock) __UNLOCK(lock)
#define _spin_unlock_bh(lock) __UNLOCK_BH(lock)
#define _write_unlock_bh(lock) __UNLOCK_BH(lock)
Index: linux/include/linux/spinlock_types.h
===================================================================
--- linux.orig/include/linux/spinlock_types.h
+++ linux/include/linux/spinlock_types.h
@@ -9,6 +9,8 @@
* Released under the General Public License (GPL).
*/

+#include <linux/lockdep.h>
+
#if defined(CONFIG_SMP)
# include <asm/spinlock_types.h>
#else
@@ -24,6 +26,9 @@ typedef struct {
unsigned int magic, owner_cpu;
void *owner;
#endif
+#ifdef CONFIG_PROVE_SPIN_LOCKING
+ struct lockdep_map dep_map;
+#endif
} spinlock_t;

#define SPINLOCK_MAGIC 0xdead4ead
@@ -37,28 +42,47 @@ typedef struct {
unsigned int magic, owner_cpu;
void *owner;
#endif
+#ifdef CONFIG_PROVE_RW_LOCKING
+ struct lockdep_map dep_map;
+#endif
} rwlock_t;

#define RWLOCK_MAGIC 0xdeaf1eed

#define SPINLOCK_OWNER_INIT ((void *)-1L)

+#ifdef CONFIG_PROVE_SPIN_LOCKING
+# define SPIN_DEP_MAP_INIT(lockname) .dep_map = { .name = #lockname }
+#else
+# define SPIN_DEP_MAP_INIT(lockname)
+#endif
+
+#ifdef CONFIG_PROVE_RW_LOCKING
+# define RW_DEP_MAP_INIT(lockname) .dep_map = { .name = #lockname }
+#else
+# define RW_DEP_MAP_INIT(lockname)
+#endif
+
#ifdef CONFIG_DEBUG_SPINLOCK
# define __SPIN_LOCK_UNLOCKED(lockname) \
(spinlock_t) { .raw_lock = __RAW_SPIN_LOCK_UNLOCKED, \
.magic = SPINLOCK_MAGIC, \
.owner = SPINLOCK_OWNER_INIT, \
- .owner_cpu = -1 }
+ .owner_cpu = -1, \
+ SPIN_DEP_MAP_INIT(lockname) }
#define __RW_LOCK_UNLOCKED(lockname) \
(rwlock_t) { .raw_lock = __RAW_RW_LOCK_UNLOCKED, \
.magic = RWLOCK_MAGIC, \
.owner = SPINLOCK_OWNER_INIT, \
- .owner_cpu = -1 }
+ .owner_cpu = -1, \
+ RW_DEP_MAP_INIT(lockname) }
#else
# define __SPIN_LOCK_UNLOCKED(lockname) \
- (spinlock_t) { .raw_lock = __RAW_SPIN_LOCK_UNLOCKED }
+ (spinlock_t) { .raw_lock = __RAW_SPIN_LOCK_UNLOCKED, \
+ SPIN_DEP_MAP_INIT(lockname) }
#define __RW_LOCK_UNLOCKED(lockname) \
- (rwlock_t) { .raw_lock = __RAW_RW_LOCK_UNLOCKED }
+ (rwlock_t) { .raw_lock = __RAW_RW_LOCK_UNLOCKED, \
+ RW_DEP_MAP_INIT(lockname) }
#endif

#define SPIN_LOCK_UNLOCKED __SPIN_LOCK_UNLOCKED(old_style_spin_init)
Index: linux/include/linux/spinlock_types_up.h
===================================================================
--- linux.orig/include/linux/spinlock_types_up.h
+++ linux/include/linux/spinlock_types_up.h
@@ -12,10 +12,15 @@
* Released under the General Public License (GPL).
*/

-#ifdef CONFIG_DEBUG_SPINLOCK
+#if defined(CONFIG_DEBUG_SPINLOCK) || \
+ defined(CONFIG_PROVE_SPIN_LOCKING) || \
+ defined(CONFIG_PROVE_RW_LOCKING)

typedef struct {
volatile unsigned int slock;
+#ifdef CONFIG_PROVE_SPIN_LOCKING
+ struct lockdep_map dep_map;
+#endif
} raw_spinlock_t;

#define __RAW_SPIN_LOCK_UNLOCKED { 1 }
@@ -30,6 +35,9 @@ typedef struct { } raw_spinlock_t;

typedef struct {
/* no debug version on UP */
+#ifdef CONFIG_PROVE_RW_LOCKING
+ struct lockdep_map dep_map;
+#endif
} raw_rwlock_t;

#define __RAW_RW_LOCK_UNLOCKED { }
Index: linux/include/linux/spinlock_up.h
===================================================================
--- linux.orig/include/linux/spinlock_up.h
+++ linux/include/linux/spinlock_up.h
@@ -17,7 +17,9 @@
* No atomicity anywhere, we are on UP.
*/

-#ifdef CONFIG_DEBUG_SPINLOCK
+#if defined(CONFIG_DEBUG_SPINLOCK) || \
+ defined(CONFIG_PROVE_SPIN_LOCKING) || \
+ defined(CONFIG_PROVE_RW_LOCKING)

#define __raw_spin_is_locked(x) ((x)->slock == 0)

Index: linux/kernel/Makefile
===================================================================
--- linux.orig/kernel/Makefile
+++ linux/kernel/Makefile
@@ -26,6 +26,8 @@ obj-$(CONFIG_RT_MUTEX_TESTER) += rtmutex
obj-$(CONFIG_GENERIC_ISA_DMA) += dma.o
obj-$(CONFIG_SMP) += cpu.o spinlock.o
obj-$(CONFIG_DEBUG_SPINLOCK) += spinlock.o
+obj-$(CONFIG_PROVE_SPIN_LOCKING) += spinlock.o
+obj-$(CONFIG_PROVE_RW_LOCKING) += spinlock.o
obj-$(CONFIG_UID16) += uid16.o
obj-$(CONFIG_MODULES) += module.o
obj-$(CONFIG_KALLSYMS) += kallsyms.o
Index: linux/kernel/sched.c
===================================================================
--- linux.orig/kernel/sched.c
+++ linux/kernel/sched.c
@@ -312,6 +312,13 @@ static inline void finish_lock_switch(ru
/* this is a valid case when another task releases the spinlock */
rq->lock.owner = current;
#endif
+ /*
+ * If we are tracking spinlock dependencies then we have to
+ * fix up the runqueue lock - which gets 'carried over' from
+ * prev into current:
+ */
+ spin_acquire(&rq->lock.dep_map, 0, 0, _THIS_IP_);
+
spin_unlock_irq(&rq->lock);
}

@@ -1839,6 +1846,7 @@ task_t * context_switch(runqueue_t *rq,
WARN_ON(rq->prev_mm);
rq->prev_mm = oldmm;
}
+ spin_release(&rq->lock.dep_map, 1, _THIS_IP_);

/* Here we just switch the register state and the stack. */
switch_to(prev, next, prev);
@@ -4406,6 +4414,7 @@ asmlinkage long sys_sched_yield(void)
* no need to preempt or enable interrupts:
*/
__release(rq->lock);
+ spin_release(&rq->lock.dep_map, 1, _THIS_IP_);
_raw_spin_unlock(&rq->lock);
preempt_enable_no_resched();

@@ -4465,6 +4474,7 @@ int cond_resched_lock(spinlock_t *lock)
spin_lock(lock);
}
if (need_resched()) {
+ spin_release(&lock->dep_map, 1, _THIS_IP_);
_raw_spin_unlock(lock);
preempt_enable_no_resched();
__cond_resched();
Index: linux/kernel/spinlock.c
===================================================================
--- linux.orig/kernel/spinlock.c
+++ linux/kernel/spinlock.c
@@ -14,8 +14,47 @@
#include <linux/preempt.h>
#include <linux/spinlock.h>
#include <linux/interrupt.h>
+#include <linux/debug_locks.h>
#include <linux/module.h>

+#if defined(CONFIG_DEBUG_SPINLOCK) || defined(CONFIG_PROVE_SPIN_LOCKING)
+void __spin_lock_init(spinlock_t *lock, const char *name,
+ struct lockdep_type_key *key)
+{
+ lock->raw_lock = (raw_spinlock_t)__RAW_SPIN_LOCK_UNLOCKED;
+#ifdef CONFIG_DEBUG_SPINLOCK
+ lock->magic = SPINLOCK_MAGIC;
+ lock->owner = SPINLOCK_OWNER_INIT;
+ lock->owner_cpu = -1;
+#endif
+#ifdef CONFIG_PROVE_SPIN_LOCKING
+ lockdep_init_map(&lock->dep_map, name, key);
+#endif
+}
+
+EXPORT_SYMBOL(__spin_lock_init);
+
+#endif
+
+#if defined(CONFIG_DEBUG_SPINLOCK) || defined(CONFIG_PROVE_RW_LOCKING)
+
+void __rwlock_init(rwlock_t *lock, const char *name,
+ struct lockdep_type_key *key)
+{
+ lock->raw_lock = (raw_rwlock_t) __RAW_RW_LOCK_UNLOCKED;
+#ifdef CONFIG_DEBUG_SPINLOCK
+ lock->magic = RWLOCK_MAGIC;
+ lock->owner = SPINLOCK_OWNER_INIT;
+ lock->owner_cpu = -1;
+#endif
+#ifdef CONFIG_PROVE_RW_LOCKING
+ lockdep_init_map(&lock->dep_map, name, key);
+#endif
+}
+
+EXPORT_SYMBOL(__rwlock_init);
+
+#endif
/*
* Generic declaration of the raw read_trylock() function,
* architectures are supposed to optimize this:
@@ -30,8 +69,10 @@ EXPORT_SYMBOL(generic__raw_read_trylock)
int __lockfunc _spin_trylock(spinlock_t *lock)
{
preempt_disable();
- if (_raw_spin_trylock(lock))
+ if (_raw_spin_trylock(lock)) {
+ spin_acquire(&lock->dep_map, 0, 1, _RET_IP_);
return 1;
+ }

preempt_enable();
return 0;
@@ -41,8 +82,10 @@ EXPORT_SYMBOL(_spin_trylock);
int __lockfunc _read_trylock(rwlock_t *lock)
{
preempt_disable();
- if (_raw_read_trylock(lock))
+ if (_raw_read_trylock(lock)) {
+ rwlock_acquire_read(&lock->dep_map, 0, 1, _RET_IP_);
return 1;
+ }

preempt_enable();
return 0;
@@ -52,19 +95,29 @@ EXPORT_SYMBOL(_read_trylock);
int __lockfunc _write_trylock(rwlock_t *lock)
{
preempt_disable();
- if (_raw_write_trylock(lock))
+ if (_raw_write_trylock(lock)) {
+ rwlock_acquire(&lock->dep_map, 0, 1, _RET_IP_);
return 1;
+ }

preempt_enable();
return 0;
}
EXPORT_SYMBOL(_write_trylock);

-#if !defined(CONFIG_PREEMPT) || !defined(CONFIG_SMP)
+/*
+ * If lockdep is enabled then we use the non-preemption spin-ops
+ * even on CONFIG_PREEMPT, because lockdep assumes that interrupts are
+ * not re-enabled during lock-acquire (which the preempt-spin-ops do):
+ */
+#if !defined(CONFIG_PREEMPT) || !defined(CONFIG_SMP) || \
+ defined(CONFIG_PROVE_SPIN_LOCKING) || \
+ defined(CONFIG_PROVE_RW_LOCKING)

void __lockfunc _read_lock(rwlock_t *lock)
{
preempt_disable();
+ rwlock_acquire_read(&lock->dep_map, 0, 0, _RET_IP_);
_raw_read_lock(lock);
}
EXPORT_SYMBOL(_read_lock);
@@ -75,7 +128,17 @@ unsigned long __lockfunc _spin_lock_irqs

local_irq_save(flags);
preempt_disable();
+ spin_acquire(&lock->dep_map, 0, 0, _RET_IP_);
+ /*
+ * On lockdep we dont want the hand-coded irq-enable of
+ * _raw_spin_lock_flags() code, because lockdep assumes
+ * that interrupts are not re-enabled during lock-acquire:
+ */
+#ifdef CONFIG_PROVE_SPIN_LOCKING
+ _raw_spin_lock(lock);
+#else
_raw_spin_lock_flags(lock, &flags);
+#endif
return flags;
}
EXPORT_SYMBOL(_spin_lock_irqsave);
@@ -84,6 +147,7 @@ void __lockfunc _spin_lock_irq(spinlock_
{
local_irq_disable();
preempt_disable();
+ spin_acquire(&lock->dep_map, 0, 0, _RET_IP_);
_raw_spin_lock(lock);
}
EXPORT_SYMBOL(_spin_lock_irq);
@@ -92,6 +156,7 @@ void __lockfunc _spin_lock_bh(spinlock_t
{
local_bh_disable();
preempt_disable();
+ spin_acquire(&lock->dep_map, 0, 0, _RET_IP_);
_raw_spin_lock(lock);
}
EXPORT_SYMBOL(_spin_lock_bh);
@@ -102,6 +167,7 @@ unsigned long __lockfunc _read_lock_irqs

local_irq_save(flags);
preempt_disable();
+ rwlock_acquire_read(&lock->dep_map, 0, 0, _RET_IP_);
_raw_read_lock(lock);
return flags;
}
@@ -111,6 +177,7 @@ void __lockfunc _read_lock_irq(rwlock_t
{
local_irq_disable();
preempt_disable();
+ rwlock_acquire_read(&lock->dep_map, 0, 0, _RET_IP_);
_raw_read_lock(lock);
}
EXPORT_SYMBOL(_read_lock_irq);
@@ -119,6 +186,7 @@ void __lockfunc _read_lock_bh(rwlock_t *
{
local_bh_disable();
preempt_disable();
+ rwlock_acquire_read(&lock->dep_map, 0, 0, _RET_IP_);
_raw_read_lock(lock);
}
EXPORT_SYMBOL(_read_lock_bh);
@@ -129,6 +197,7 @@ unsigned long __lockfunc _write_lock_irq

local_irq_save(flags);
preempt_disable();
+ rwlock_acquire(&lock->dep_map, 0, 0, _RET_IP_);
_raw_write_lock(lock);
return flags;
}
@@ -138,6 +207,7 @@ void __lockfunc _write_lock_irq(rwlock_t
{
local_irq_disable();
preempt_disable();
+ rwlock_acquire(&lock->dep_map, 0, 0, _RET_IP_);
_raw_write_lock(lock);
}
EXPORT_SYMBOL(_write_lock_irq);
@@ -146,6 +216,7 @@ void __lockfunc _write_lock_bh(rwlock_t
{
local_bh_disable();
preempt_disable();
+ rwlock_acquire(&lock->dep_map, 0, 0, _RET_IP_);
_raw_write_lock(lock);
}
EXPORT_SYMBOL(_write_lock_bh);
@@ -153,6 +224,7 @@ EXPORT_SYMBOL(_write_lock_bh);
void __lockfunc _spin_lock(spinlock_t *lock)
{
preempt_disable();
+ spin_acquire(&lock->dep_map, 0, 0, _RET_IP_);
_raw_spin_lock(lock);
}

@@ -161,6 +233,7 @@ EXPORT_SYMBOL(_spin_lock);
void __lockfunc _write_lock(rwlock_t *lock)
{
preempt_disable();
+ rwlock_acquire(&lock->dep_map, 0, 0, _RET_IP_);
_raw_write_lock(lock);
}

@@ -256,15 +329,35 @@ BUILD_LOCK_OPS(write, rwlock);

#endif /* CONFIG_PREEMPT */

+void __lockfunc _spin_lock_nested(spinlock_t *lock, int subtype)
+{
+ preempt_disable();
+ spin_acquire(&lock->dep_map, subtype, 0, _RET_IP_);
+ _raw_spin_lock(lock);
+}
+
+EXPORT_SYMBOL(_spin_lock_nested);
+
void __lockfunc _spin_unlock(spinlock_t *lock)
{
+ spin_release(&lock->dep_map, 1, _RET_IP_);
_raw_spin_unlock(lock);
preempt_enable();
}
EXPORT_SYMBOL(_spin_unlock);

+void __lockfunc _spin_unlock_non_nested(spinlock_t *lock)
+{
+ spin_release(&lock->dep_map, 0, _RET_IP_);
+ _raw_spin_unlock(lock);
+ preempt_enable();
+}
+EXPORT_SYMBOL(_spin_unlock_non_nested);
+
+
void __lockfunc _write_unlock(rwlock_t *lock)
{
+ rwlock_release(&lock->dep_map, 1, _RET_IP_);
_raw_write_unlock(lock);
preempt_enable();
}
@@ -272,13 +365,23 @@ EXPORT_SYMBOL(_write_unlock);

void __lockfunc _read_unlock(rwlock_t *lock)
{
+ rwlock_release(&lock->dep_map, 1, _RET_IP_);
_raw_read_unlock(lock);
preempt_enable();
}
EXPORT_SYMBOL(_read_unlock);

+void __lockfunc _read_unlock_non_nested(rwlock_t *lock)
+{
+ rwlock_release(&lock->dep_map, 0, _RET_IP_);
+ _raw_read_unlock(lock);
+ preempt_enable();
+}
+EXPORT_SYMBOL(_read_unlock_non_nested);
+
void __lockfunc _spin_unlock_irqrestore(spinlock_t *lock, unsigned long flags)
{
+ spin_release(&lock->dep_map, 1, _RET_IP_);
_raw_spin_unlock(lock);
local_irq_restore(flags);
preempt_enable();
@@ -287,6 +390,7 @@ EXPORT_SYMBOL(_spin_unlock_irqrestore);

void __lockfunc _spin_unlock_irq(spinlock_t *lock)
{
+ spin_release(&lock->dep_map, 1, _RET_IP_);
_raw_spin_unlock(lock);
local_irq_enable();
preempt_enable();
@@ -295,14 +399,16 @@ EXPORT_SYMBOL(_spin_unlock_irq);

void __lockfunc _spin_unlock_bh(spinlock_t *lock)
{
+ spin_release(&lock->dep_map, 1, _RET_IP_);
_raw_spin_unlock(lock);
preempt_enable_no_resched();
- local_bh_enable();
+ local_bh_enable_ip((unsigned long)__builtin_return_address(0));
}
EXPORT_SYMBOL(_spin_unlock_bh);

void __lockfunc _read_unlock_irqrestore(rwlock_t *lock, unsigned long flags)
{
+ rwlock_release(&lock->dep_map, 1, _RET_IP_);
_raw_read_unlock(lock);
local_irq_restore(flags);
preempt_enable();
@@ -311,6 +417,7 @@ EXPORT_SYMBOL(_read_unlock_irqrestore);

void __lockfunc _read_unlock_irq(rwlock_t *lock)
{
+ rwlock_release(&lock->dep_map, 1, _RET_IP_);
_raw_read_unlock(lock);
local_irq_enable();
preempt_enable();
@@ -319,14 +426,16 @@ EXPORT_SYMBOL(_read_unlock_irq);

void __lockfunc _read_unlock_bh(rwlock_t *lock)
{
+ rwlock_release(&lock->dep_map, 1, _RET_IP_);
_raw_read_unlock(lock);
preempt_enable_no_resched();
- local_bh_enable();
+ local_bh_enable_ip((unsigned long)__builtin_return_address(0));
}
EXPORT_SYMBOL(_read_unlock_bh);

void __lockfunc _write_unlock_irqrestore(rwlock_t *lock, unsigned long flags)
{
+ rwlock_release(&lock->dep_map, 1, _RET_IP_);
_raw_write_unlock(lock);
local_irq_restore(flags);
preempt_enable();
@@ -335,6 +444,7 @@ EXPORT_SYMBOL(_write_unlock_irqrestore);

void __lockfunc _write_unlock_irq(rwlock_t *lock)
{
+ rwlock_release(&lock->dep_map, 1, _RET_IP_);
_raw_write_unlock(lock);
local_irq_enable();
preempt_enable();
@@ -343,9 +453,10 @@ EXPORT_SYMBOL(_write_unlock_irq);

void __lockfunc _write_unlock_bh(rwlock_t *lock)
{
+ rwlock_release(&lock->dep_map, 1, _RET_IP_);
_raw_write_unlock(lock);
preempt_enable_no_resched();
- local_bh_enable();
+ local_bh_enable_ip((unsigned long)__builtin_return_address(0));
}
EXPORT_SYMBOL(_write_unlock_bh);

@@ -353,11 +464,13 @@ int __lockfunc _spin_trylock_bh(spinlock
{
local_bh_disable();
preempt_disable();
- if (_raw_spin_trylock(lock))
+ if (_raw_spin_trylock(lock)) {
+ spin_acquire(&lock->dep_map, 0, 1, _RET_IP_);
return 1;
+ }

preempt_enable_no_resched();
- local_bh_enable();
+ local_bh_enable_ip((unsigned long)__builtin_return_address(0));
return 0;
}
EXPORT_SYMBOL(_spin_trylock_bh);
Index: linux/lib/kernel_lock.c
===================================================================
--- linux.orig/lib/kernel_lock.c
+++ linux/lib/kernel_lock.c
@@ -177,7 +177,12 @@ static inline void __lock_kernel(void)

static inline void __unlock_kernel(void)
{
- spin_unlock(&kernel_flag);
+ /*
+ * the BKL is not covered by lockdep, so we open-code the
+ * unlocking sequence (and thus avoid the dep-chain ops):
+ */
+ _raw_spin_unlock(&kernel_flag);
+ preempt_enable();
}

/*
Index: linux/net/ipv4/route.c
===================================================================
--- linux.orig/net/ipv4/route.c
+++ linux/net/ipv4/route.c
@@ -206,7 +206,9 @@ __u8 ip_tos2prio[16] = {
struct rt_hash_bucket {
struct rtable *chain;
};
-#if defined(CONFIG_SMP) || defined(CONFIG_DEBUG_SPINLOCK)
+#if defined(CONFIG_SMP) || defined(CONFIG_DEBUG_SPINLOCK) || \
+ defined(CONFIG_PROVE_SPIN_LOCKING) || \
+ defined(CONFIG_PROVE_RW_LOCKING)
/*
* Instead of using one spinlock for each rt_hash_bucket, we use a table of spinlocks
* The size of this table is a power of two and depends on the number of CPUS.

2006-05-29 21:42:47

by Ingo Molnar

[permalink] [raw]
Subject: [patch 14/61] lock validator: stacktrace

From: Ingo Molnar <[email protected]>

framework to generate and save stacktraces quickly, without printing
anything to the console.
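
A minimal usage sketch (hypothetical buffer size; only the API declared
in include/linux/stacktrace.h below is assumed):

	static unsigned long my_entries[32];	/* hypothetical size */
	static struct stack_trace my_trace = {
		.nr_entries	= 0,
		.max_entries	= 32,
		.entries	= my_entries,
	};

	/* snapshot the current task's backtrace, current context only,
	   skipping no entries: */
	save_stack_trace(&my_trace, current, 0, 0);

	/* ... and print it later, indented by two spaces: */
	print_stack_trace(&my_trace, 2);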

Signed-off-by: Ingo Molnar <[email protected]>
Signed-off-by: Arjan van de Ven <[email protected]>
---
arch/i386/kernel/Makefile | 2
arch/i386/kernel/stacktrace.c | 98 +++++++++++++++++
arch/x86_64/kernel/Makefile | 2
arch/x86_64/kernel/stacktrace.c | 219 ++++++++++++++++++++++++++++++++++++++++
include/linux/stacktrace.h | 15 ++
kernel/Makefile | 2
kernel/stacktrace.c | 26 ++++
7 files changed, 361 insertions(+), 3 deletions(-)

Index: linux/arch/i386/kernel/Makefile
===================================================================
--- linux.orig/arch/i386/kernel/Makefile
+++ linux/arch/i386/kernel/Makefile
@@ -4,7 +4,7 @@

extra-y := head.o init_task.o vmlinux.lds

-obj-y := process.o semaphore.o signal.o entry.o traps.o irq.o \
+obj-y := process.o semaphore.o signal.o entry.o traps.o irq.o stacktrace.o \
ptrace.o time.o ioport.o ldt.o setup.o i8259.o sys_i386.o \
pci-dma.o i386_ksyms.o i387.o bootflag.o \
quirks.o i8237.o topology.o alternative.o i8253.o tsc.o
Index: linux/arch/i386/kernel/stacktrace.c
===================================================================
--- /dev/null
+++ linux/arch/i386/kernel/stacktrace.c
@@ -0,0 +1,98 @@
+/*
+ * arch/i386/kernel/stacktrace.c
+ *
+ * Stack trace management functions
+ *
+ * Copyright (C) 2006 Red Hat, Inc., Ingo Molnar <[email protected]>
+ */
+#include <linux/sched.h>
+#include <linux/stacktrace.h>
+
+static inline int valid_stack_ptr(struct thread_info *tinfo, void *p)
+{
+ return p > (void *)tinfo &&
+ p < (void *)tinfo + THREAD_SIZE - 3;
+}
+
+/*
+ * Save stack-backtrace addresses into a stack_trace buffer:
+ */
+static inline unsigned long
+save_context_stack(struct stack_trace *trace, unsigned int skip,
+ struct thread_info *tinfo, unsigned long *stack,
+ unsigned long ebp)
+{
+ unsigned long addr;
+
+#ifdef CONFIG_FRAME_POINTER
+ while (valid_stack_ptr(tinfo, (void *)ebp)) {
+ addr = *(unsigned long *)(ebp + 4);
+ if (!skip)
+ trace->entries[trace->nr_entries++] = addr;
+ else
+ skip--;
+ if (trace->nr_entries >= trace->max_entries)
+ break;
+ /*
+ * break out of recursive entries (such as
+ * end_of_stack_stop_unwind_function):
+ */
+ if (ebp == *(unsigned long *)ebp)
+ break;
+
+ ebp = *(unsigned long *)ebp;
+ }
+#else
+ while (valid_stack_ptr(tinfo, stack)) {
+ addr = *stack++;
+ if (__kernel_text_address(addr)) {
+ if (!skip)
+ trace->entries[trace->nr_entries++] = addr;
+ else
+ skip--;
+ if (trace->nr_entries >= trace->max_entries)
+ break;
+ }
+ }
+#endif
+
+ return ebp;
+}
+
+/*
+ * Save stack-backtrace addresses into a stack_trace buffer.
+ * If all_contexts is set, all contexts (hardirq, softirq and process)
+ * are saved. If not set then only the current context is saved.
+ */
+void save_stack_trace(struct stack_trace *trace,
+ struct task_struct *task, int all_contexts,
+ unsigned int skip)
+{
+ unsigned long ebp;
+ unsigned long *stack = &ebp;
+
+ WARN_ON(trace->nr_entries || !trace->max_entries);
+
+ if (!task || task == current) {
+ /* Grab ebp right from our regs: */
+ asm ("movl %%ebp, %0" : "=r" (ebp));
+ } else {
+ /* ebp is the last reg pushed by switch_to(): */
+ ebp = *(unsigned long *) task->thread.esp;
+ }
+
+ while (1) {
+ struct thread_info *context = (struct thread_info *)
+ ((unsigned long)stack & (~(THREAD_SIZE - 1)));
+
+ ebp = save_context_stack(trace, skip, context, stack, ebp);
+ stack = (unsigned long *)context->previous_esp;
+ if (!all_contexts || !stack ||
+ trace->nr_entries >= trace->max_entries)
+ break;
+ trace->entries[trace->nr_entries++] = ULONG_MAX;
+ if (trace->nr_entries >= trace->max_entries)
+ break;
+ }
+}
+
Index: linux/arch/x86_64/kernel/Makefile
===================================================================
--- linux.orig/arch/x86_64/kernel/Makefile
+++ linux/arch/x86_64/kernel/Makefile
@@ -4,7 +4,7 @@

extra-y := head.o head64.o init_task.o vmlinux.lds
EXTRA_AFLAGS := -traditional
-obj-y := process.o signal.o entry.o traps.o irq.o \
+obj-y := process.o signal.o entry.o traps.o irq.o stacktrace.o \
ptrace.o time.o ioport.o ldt.o setup.o i8259.o sys_x86_64.o \
x8664_ksyms.o i387.o syscall.o vsyscall.o \
setup64.o bootflag.o e820.o reboot.o quirks.o i8237.o \
Index: linux/arch/x86_64/kernel/stacktrace.c
===================================================================
--- /dev/null
+++ linux/arch/x86_64/kernel/stacktrace.c
@@ -0,0 +1,219 @@
+/*
+ * arch/x86_64/kernel/stacktrace.c
+ *
+ * Stack trace management functions
+ *
+ * Copyright (C) 2006 Red Hat, Inc., Ingo Molnar <[email protected]>
+ */
+#include <linux/sched.h>
+#include <linux/stacktrace.h>
+
+#include <asm/smp.h>
+
+static inline int
+in_range(unsigned long start, unsigned long addr, unsigned long end)
+{
+ return addr >= start && addr <= end;
+}
+
+static unsigned long
+get_stack_end(struct task_struct *task, unsigned long stack)
+{
+ unsigned long stack_start, stack_end, flags;
+ int i, cpu;
+
+ /*
+ * The most common case is that we are in the task stack:
+ */
+ stack_start = (unsigned long)task->thread_info;
+ stack_end = stack_start + THREAD_SIZE;
+
+ if (in_range(stack_start, stack, stack_end))
+ return stack_end;
+
+ /*
+ * We are in an interrupt if irqstackptr is set:
+ */
+ raw_local_irq_save(flags);
+ cpu = safe_smp_processor_id();
+ stack_end = (unsigned long)cpu_pda(cpu)->irqstackptr;
+
+ if (stack_end) {
+ stack_start = stack_end & ~(IRQSTACKSIZE-1);
+ if (in_range(stack_start, stack, stack_end))
+ goto out_restore;
+ /*
+ * We get here if we are in an IRQ context but we
+ * are also in an exception stack.
+ */
+ }
+
+ /*
+ * Iterate over all exception stacks, and figure out whether
+ * 'stack' is in one of them:
+ */
+ for (i = 0; i < N_EXCEPTION_STACKS; i++) {
+ /*
+ * set 'end' to the end of the exception stack.
+ */
+ stack_end = per_cpu(init_tss, cpu).ist[i];
+ stack_start = stack_end - EXCEPTION_STKSZ;
+
+ /*
+ * Is 'stack' above this exception frame's end?
+ * If yes then skip to the next frame.
+ */
+ if (stack >= stack_end)
+ continue;
+ /*
+ * Is 'stack' above this exception frame's start address?
+ * If yes then we found the right frame.
+ */
+ if (stack >= stack_start)
+ goto out_restore;
+
+ /*
+ * If this is a debug stack, and if it has a larger size than
+ * the usual exception stacks, then 'stack' might still
+ * be within the lower portion of the debug stack:
+ */
+#if DEBUG_STKSZ > EXCEPTION_STKSZ
+ if (i == DEBUG_STACK - 1 && stack >= stack_end - DEBUG_STKSZ) {
+ /*
+ * Black magic. A large debug stack is composed of
+ * multiple exception stack entries, which we
+ * iterate through now. Don't look:
+ */
+ do {
+ stack_end -= EXCEPTION_STKSZ;
+ stack_start -= EXCEPTION_STKSZ;
+ } while (stack < stack_start);
+
+ goto out_restore;
+ }
+#endif
+ }
+ /*
+ * Ok, 'stack' is not pointing to any of the system stacks.
+ */
+ stack_end = 0;
+
+out_restore:
+ raw_local_irq_restore(flags);
+
+ return stack_end;
+}
+
+
+/*
+ * Save stack-backtrace addresses into a stack_trace buffer:
+ */
+static inline unsigned long
+save_context_stack(struct stack_trace *trace, unsigned int skip,
+ unsigned long stack, unsigned long stack_end)
+{
+ unsigned long addr, prev_stack = 0;
+
+#ifdef CONFIG_FRAME_POINTER
+ while (in_range(prev_stack, (unsigned long)stack, stack_end)) {
+ pr_debug("stack: %p\n", (void *)stack);
+ addr = (unsigned long)(((unsigned long *)stack)[1]);
+ pr_debug("addr: %p\n", (void *)addr);
+ if (!skip)
+ trace->entries[trace->nr_entries++] = addr-1;
+ else
+ skip--;
+ if (trace->nr_entries >= trace->max_entries)
+ break;
+ if (!addr)
+ return 0;
+ /*
+ * Stack frames must go forwards (otherwise a loop could
+ * happen if the stackframe is corrupted), so we move
+ * prev_stack forwards:
+ */
+ prev_stack = stack;
+ stack = (unsigned long)(((unsigned long *)stack)[0]);
+ }
+ pr_debug("invalid: %p\n", (void *)stack);
+#else
+ while (stack < stack_end) {
+ addr = *(unsigned long *)stack;
+ stack += sizeof(long);
+ if (__kernel_text_address(addr)) {
+ if (!skip)
+ trace->entries[trace->nr_entries++] = addr-1;
+ else
+ skip--;
+ if (trace->nr_entries >= trace->max_entries)
+ break;
+ }
+ }
+#endif
+ return stack;
+}
+
+#define MAX_STACKS 10
+
+/*
+ * Save stack-backtrace addresses into a stack_trace buffer.
+ * If all_contexts is set, all contexts (hardirq, softirq and process)
+ * are saved. If not set then only the current context is saved.
+ */
+void save_stack_trace(struct stack_trace *trace,
+ struct task_struct *task, int all_contexts,
+ unsigned int skip)
+{
+ unsigned long stack = (unsigned long)&stack;
+ int i, nr_stacks = 0, stacks_done[MAX_STACKS];
+
+ WARN_ON(trace->nr_entries || !trace->max_entries);
+
+ if (!task)
+ task = current;
+
+ pr_debug("task: %p, ti: %p\n", task, task->thread_info);
+
+ if (!task || task == current) {
+ /* Grab rbp right from our regs: */
+ asm ("mov %%rbp, %0" : "=r" (stack));
+ pr_debug("rbp: %p\n", (void *)stack);
+ } else {
+ /* rbp is the last reg pushed by switch_to(): */
+ stack = task->thread.rsp;
+ pr_debug("other task rsp: %p\n", (void *)stack);
+ stack = (unsigned long)(((unsigned long *)stack)[0]);
+ pr_debug("other task rbp: %p\n", (void *)stack);
+ }
+
+ while (1) {
+ unsigned long stack_end = get_stack_end(task, stack);
+
+ pr_debug("stack: %p\n", (void *)stack);
+ pr_debug("stack end: %p\n", (void *)stack_end);
+
+ /*
+ * Invalid stack address?
+ */
+ if (!stack_end)
+ return;
+ /*
+ * Were we in this stack already? (recursion)
+ */
+ for (i = 0; i < nr_stacks; i++)
+ if (stacks_done[i] == stack_end)
+ return;
+ stacks_done[nr_stacks] = stack_end;
+
+ stack = save_context_stack(trace, skip, stack, stack_end);
+ if (!all_contexts || !stack ||
+ trace->nr_entries >= trace->max_entries)
+ return;
+ trace->entries[trace->nr_entries++] = ULONG_MAX;
+ if (trace->nr_entries >= trace->max_entries)
+ return;
+ if (++nr_stacks >= MAX_STACKS)
+ return;
+ }
+}
+
Index: linux/include/linux/stacktrace.h
===================================================================
--- /dev/null
+++ linux/include/linux/stacktrace.h
@@ -0,0 +1,15 @@
+#ifndef __LINUX_STACKTRACE_H
+#define __LINUX_STACKTRACE_H
+
+struct stack_trace {
+ unsigned int nr_entries, max_entries;
+ unsigned long *entries;
+};
+
+extern void save_stack_trace(struct stack_trace *trace,
+ struct task_struct *task, int all_contexts,
+ unsigned int skip);
+
+extern void print_stack_trace(struct stack_trace *trace, int spaces);
+
+#endif
Index: linux/kernel/Makefile
===================================================================
--- linux.orig/kernel/Makefile
+++ linux/kernel/Makefile
@@ -8,7 +8,7 @@ obj-y = sched.o fork.o exec_domain.o
signal.o sys.o kmod.o workqueue.o pid.o \
rcupdate.o extable.o params.o posix-timers.o \
kthread.o wait.o kfifo.o sys_ni.o posix-cpu-timers.o mutex.o \
- hrtimer.o nsproxy.o
+ hrtimer.o nsproxy.o stacktrace.o

obj-y += time/
obj-$(CONFIG_DEBUG_MUTEXES) += mutex-debug.o
Index: linux/kernel/stacktrace.c
===================================================================
--- /dev/null
+++ linux/kernel/stacktrace.c
@@ -0,0 +1,26 @@
+/*
+ * kernel/stacktrace.c
+ *
+ * Stack trace management functions
+ *
+ * Copyright (C) 2006 Red Hat, Inc., Ingo Molnar <[email protected]>
+ */
+#include <linux/sched.h>
+#include <linux/kallsyms.h>
+#include <linux/stacktrace.h>
+
+void print_stack_trace(struct stack_trace *trace, int spaces)
+{
+ int i, j;
+
+ for (i = 0; i < trace->nr_entries; i++) {
+ unsigned long ip = trace->entries[i];
+
+ for (j = 0; j < spaces + 1; j++)
+ printk(" ");
+
+ printk("[<%08lx>]", ip);
+ print_symbol(" %s\n", ip);
+ }
+}
+

2006-05-29 21:43:23

by Ingo Molnar

[permalink] [raw]
Subject: [patch 18/61] lock validator: irqtrace: core

From: Ingo Molnar <[email protected]>

accurate hard-IRQ-flags state tracing. This allows us to attach
extra functionality to IRQ flags on/off events (such as trace-on/off).
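
The wrapping idea, as a minimal sketch (not the exact
include/linux/trace_irqflags.h code added below):

	/*
	 * Conceptually, the generic local_irq_*() APIs become wrappers
	 * around the architecture's raw_local_irq_*() primitives, with
	 * the trace hooks called while interrupts are known to be off:
	 */
	#define local_irq_enable() \
		do { trace_hardirqs_on(); raw_local_irq_enable(); } while (0)

	#define local_irq_disable() \
		do { raw_local_irq_disable(); trace_hardirqs_off(); } while (0)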

Signed-off-by: Ingo Molnar <[email protected]>
Signed-off-by: Arjan van de Ven <[email protected]>
---
arch/i386/kernel/entry.S | 25 ++++++-
arch/i386/kernel/irq.c | 6 +
arch/x86_64/ia32/ia32entry.S | 19 +++++
arch/x86_64/kernel/entry.S | 54 +++++++++++++++-
arch/x86_64/kernel/irq.c | 4 -
include/asm-i386/irqflags.h | 56 ++++++++++++++++
include/asm-i386/spinlock.h | 5 +
include/asm-i386/system.h | 20 -----
include/asm-powerpc/irqflags.h | 31 +++++++++
include/asm-x86_64/irqflags.h | 54 ++++++++++++++++
include/asm-x86_64/system.h | 38 -----------
include/linux/hardirq.h | 13 +++
include/linux/init_task.h | 1
include/linux/interrupt.h | 11 +--
include/linux/sched.h | 15 ++++
include/linux/trace_irqflags.h | 87 ++++++++++++++++++++++++++
kernel/fork.c | 20 +++++
kernel/sched.c | 4 -
kernel/softirq.c | 137 +++++++++++++++++++++++++++++++++++------
lib/locking-selftest.c | 3
20 files changed, 513 insertions(+), 90 deletions(-)

Index: linux/arch/i386/kernel/entry.S
===================================================================
--- linux.orig/arch/i386/kernel/entry.S
+++ linux/arch/i386/kernel/entry.S
@@ -43,6 +43,7 @@
#include <linux/config.h>
#include <linux/linkage.h>
#include <asm/thread_info.h>
+#include <asm/irqflags.h>
#include <asm/errno.h>
#include <asm/segment.h>
#include <asm/smp.h>
@@ -76,7 +77,7 @@ NT_MASK = 0x00004000
VM_MASK = 0x00020000

#ifdef CONFIG_PREEMPT
-#define preempt_stop cli
+#define preempt_stop cli; TRACE_IRQS_OFF
#else
#define preempt_stop
#define resume_kernel restore_nocheck
@@ -186,6 +187,10 @@ need_resched:
ENTRY(sysenter_entry)
movl TSS_sysenter_esp0(%esp),%esp
sysenter_past_esp:
+ /*
+ * No need to follow this irqs on/off section: the syscall
+ * disabled irqs and here we enable it straight after entry:
+ */
sti
pushl $(__USER_DS)
pushl %ebp
@@ -217,6 +222,7 @@ sysenter_past_esp:
call *sys_call_table(,%eax,4)
movl %eax,EAX(%esp)
cli
+ TRACE_IRQS_OFF
movl TI_flags(%ebp), %ecx
testw $_TIF_ALLWORK_MASK, %cx
jne syscall_exit_work
@@ -224,6 +230,7 @@ sysenter_past_esp:
movl EIP(%esp), %edx
movl OLDESP(%esp), %ecx
xorl %ebp,%ebp
+ TRACE_IRQS_ON
sti
sysexit

@@ -250,6 +257,7 @@ syscall_exit:
cli # make sure we don't miss an interrupt
# setting need_resched or sigpending
# between sampling and the iret
+ TRACE_IRQS_OFF
movl TI_flags(%ebp), %ecx
testw $_TIF_ALLWORK_MASK, %cx # current->work
jne syscall_exit_work
@@ -265,11 +273,14 @@ restore_all:
cmpl $((4 << 8) | 3), %eax
je ldt_ss # returning to user-space with LDT SS
restore_nocheck:
+ TRACE_IRQS_ON
+restore_nocheck_notrace:
RESTORE_REGS
addl $4, %esp
1: iret
.section .fixup,"ax"
iret_exc:
+ TRACE_IRQS_ON
sti
pushl $0 # no error code
pushl $do_iret_error
@@ -293,10 +304,12 @@ ldt_ss:
* dosemu and wine happy. */
subl $8, %esp # reserve space for switch16 pointer
cli
+ TRACE_IRQS_OFF
movl %esp, %eax
/* Set up the 16bit stack frame with switch32 pointer on top,
* and a switch16 pointer on top of the current frame. */
call setup_x86_bogus_stack
+ TRACE_IRQS_ON
RESTORE_REGS
lss 20+4(%esp), %esp # switch to 16bit stack
1: iret
@@ -315,6 +328,7 @@ work_resched:
cli # make sure we don't miss an interrupt
# setting need_resched or sigpending
# between sampling and the iret
+ TRACE_IRQS_OFF
movl TI_flags(%ebp), %ecx
andl $_TIF_WORK_MASK, %ecx # is there any work to be done other
# than syscall tracing?
@@ -364,6 +378,7 @@ syscall_trace_entry:
syscall_exit_work:
testb $(_TIF_SYSCALL_TRACE|_TIF_SYSCALL_AUDIT|_TIF_SINGLESTEP), %cl
jz work_pending
+ TRACE_IRQS_ON
sti # could let do_syscall_trace() call
# schedule() instead
movl %esp, %eax
@@ -425,9 +440,14 @@ ENTRY(irq_entries_start)
vector=vector+1
.endr

+/*
+ * the CPU automatically disables interrupts when executing an IRQ vector,
+ * so IRQ-flags tracing has to follow that:
+ */
ALIGN
common_interrupt:
SAVE_ALL
+ TRACE_IRQS_OFF
movl %esp,%eax
call do_IRQ
jmp ret_from_intr
@@ -436,6 +456,7 @@ common_interrupt:
ENTRY(name) \
pushl $~(nr); \
SAVE_ALL \
+ TRACE_IRQS_OFF \
movl %esp,%eax; \
call smp_/**/name; \
jmp ret_from_intr;
@@ -565,7 +586,7 @@ nmi_stack_correct:
xorl %edx,%edx # zero error code
movl %esp,%eax # pt_regs pointer
call do_nmi
- jmp restore_all
+ jmp restore_nocheck_notrace

nmi_stack_fixup:
FIX_STACK(12,nmi_stack_correct, 1)
Index: linux/arch/i386/kernel/irq.c
===================================================================
--- linux.orig/arch/i386/kernel/irq.c
+++ linux/arch/i386/kernel/irq.c
@@ -147,7 +147,7 @@ void irq_ctx_init(int cpu)
irqctx->tinfo.task = NULL;
irqctx->tinfo.exec_domain = NULL;
irqctx->tinfo.cpu = cpu;
- irqctx->tinfo.preempt_count = SOFTIRQ_OFFSET;
+ irqctx->tinfo.preempt_count = 0;
irqctx->tinfo.addr_limit = MAKE_MM_SEG(0);

softirq_ctx[cpu] = irqctx;
@@ -192,6 +192,10 @@ asmlinkage void do_softirq(void)
: "0"(isp)
: "memory", "cc", "edx", "ecx", "eax"
);
+ /*
+ * Shouldn't happen, we returned above if in_interrupt():
+ */
+ WARN_ON_ONCE(softirq_count());
}

local_irq_restore(flags);
Index: linux/arch/x86_64/ia32/ia32entry.S
===================================================================
--- linux.orig/arch/x86_64/ia32/ia32entry.S
+++ linux/arch/x86_64/ia32/ia32entry.S
@@ -13,6 +13,7 @@
#include <asm/thread_info.h>
#include <asm/segment.h>
#include <asm/vsyscall32.h>
+#include <asm/irqflags.h>
#include <linux/linkage.h>

#define IA32_NR_syscalls ((ia32_syscall_end - ia32_sys_call_table)/8)
@@ -75,6 +76,10 @@ ENTRY(ia32_sysenter_target)
swapgs
movq %gs:pda_kernelstack, %rsp
addq $(PDA_STACKOFFSET),%rsp
+ /*
+ * No need to follow this irqs on/off section: the syscall
+ * disabled irqs, here we enable it straight after entry:
+ */
sti
movl %ebp,%ebp /* zero extension */
pushq $__USER32_DS
@@ -118,6 +123,7 @@ sysenter_do_call:
movq %rax,RAX-ARGOFFSET(%rsp)
GET_THREAD_INFO(%r10)
cli
+ TRACE_IRQS_OFF
testl $_TIF_ALLWORK_MASK,threadinfo_flags(%r10)
jnz int_ret_from_sys_call
andl $~TS_COMPAT,threadinfo_status(%r10)
@@ -132,6 +138,7 @@ sysenter_do_call:
CFI_REGISTER rsp,rcx
movl $VSYSCALL32_SYSEXIT,%edx /* User %eip */
CFI_REGISTER rip,rdx
+ TRACE_IRQS_ON
swapgs
sti /* sti only takes effect after the next instruction */
/* sysexit */
@@ -186,6 +193,10 @@ ENTRY(ia32_cstar_target)
movl %esp,%r8d
CFI_REGISTER rsp,r8
movq %gs:pda_kernelstack,%rsp
+ /*
+ * No need to follow this irqs on/off section: the syscall
+ * disabled irqs and here we enable it straight after entry:
+ */
sti
SAVE_ARGS 8,1,1
movl %eax,%eax /* zero extension */
@@ -220,6 +231,7 @@ cstar_do_call:
movq %rax,RAX-ARGOFFSET(%rsp)
GET_THREAD_INFO(%r10)
cli
+ TRACE_IRQS_OFF
testl $_TIF_ALLWORK_MASK,threadinfo_flags(%r10)
jnz int_ret_from_sys_call
andl $~TS_COMPAT,threadinfo_status(%r10)
@@ -228,6 +240,7 @@ cstar_do_call:
CFI_REGISTER rip,rcx
movl EFLAGS-ARGOFFSET(%rsp),%r11d
/*CFI_REGISTER rflags,r11*/
+ TRACE_IRQS_ON
movl RSP-ARGOFFSET(%rsp),%esp
CFI_RESTORE rsp
swapgs
@@ -286,7 +299,11 @@ ENTRY(ia32_syscall)
/*CFI_REL_OFFSET rflags,EFLAGS-RIP*/
/*CFI_REL_OFFSET cs,CS-RIP*/
CFI_REL_OFFSET rip,RIP-RIP
- swapgs
+ swapgs
+ /*
+ * No need to follow this irqs on/off section: the syscall
+ * disabled irqs and here we enable it straight after entry:
+ */
sti
movl %eax,%eax
pushq %rax
Index: linux/arch/x86_64/kernel/entry.S
===================================================================
--- linux.orig/arch/x86_64/kernel/entry.S
+++ linux/arch/x86_64/kernel/entry.S
@@ -42,13 +42,14 @@
#include <asm/thread_info.h>
#include <asm/hw_irq.h>
#include <asm/page.h>
+#include <asm/irqflags.h>

.code64

#ifndef CONFIG_PREEMPT
#define retint_kernel retint_restore_args
#endif
-
+
/*
* C code is not supposed to know about undefined top of stack. Every time
* a C function with an pt_regs argument is called from the SYSCALL based
@@ -195,6 +196,10 @@ ENTRY(system_call)
swapgs
movq %rsp,%gs:pda_oldrsp
movq %gs:pda_kernelstack,%rsp
+ /*
+ * No need to follow this irqs off/on section - it's straight
+ * and short:
+ */
sti
SAVE_ARGS 8,1
movq %rax,ORIG_RAX-ARGOFFSET(%rsp)
@@ -220,10 +225,15 @@ ret_from_sys_call:
sysret_check:
GET_THREAD_INFO(%rcx)
cli
+ TRACE_IRQS_OFF
movl threadinfo_flags(%rcx),%edx
andl %edi,%edx
CFI_REMEMBER_STATE
jnz sysret_careful
+ /*
+ * sysretq will re-enable interrupts:
+ */
+ TRACE_IRQS_ON
movq RIP-ARGOFFSET(%rsp),%rcx
CFI_REGISTER rip,rcx
RESTORE_ARGS 0,-ARG_SKIP,1
@@ -238,6 +248,7 @@ sysret_careful:
CFI_RESTORE_STATE
bt $TIF_NEED_RESCHED,%edx
jnc sysret_signal
+ TRACE_IRQS_ON
sti
pushq %rdi
CFI_ADJUST_CFA_OFFSET 8
@@ -248,6 +259,7 @@ sysret_careful:

/* Handle a signal */
sysret_signal:
+ TRACE_IRQS_ON
sti
testl $(_TIF_SIGPENDING|_TIF_NOTIFY_RESUME|_TIF_SINGLESTEP),%edx
jz 1f
@@ -262,6 +274,7 @@ sysret_signal:
/* Use IRET because user could have changed frame. This
works because ptregscall_common has called FIXUP_TOP_OF_STACK. */
cli
+ TRACE_IRQS_OFF
jmp int_with_check

badsys:
@@ -315,6 +328,7 @@ ENTRY(int_ret_from_sys_call)
CFI_REL_OFFSET r10,R10-ARGOFFSET
CFI_REL_OFFSET r11,R11-ARGOFFSET
cli
+ TRACE_IRQS_OFF
testl $3,CS-ARGOFFSET(%rsp)
je retint_restore_args
movl $_TIF_ALLWORK_MASK,%edi
@@ -333,6 +347,7 @@ int_with_check:
int_careful:
bt $TIF_NEED_RESCHED,%edx
jnc int_very_careful
+ TRACE_IRQS_ON
sti
pushq %rdi
CFI_ADJUST_CFA_OFFSET 8
@@ -340,10 +355,12 @@ int_careful:
popq %rdi
CFI_ADJUST_CFA_OFFSET -8
cli
+ TRACE_IRQS_OFF
jmp int_with_check

/* handle signals and tracing -- both require a full stack frame */
int_very_careful:
+ TRACE_IRQS_ON
sti
SAVE_REST
/* Check for syscall exit trace */
@@ -357,6 +374,7 @@ int_very_careful:
CFI_ADJUST_CFA_OFFSET -8
andl $~(_TIF_SYSCALL_TRACE|_TIF_SYSCALL_AUDIT|_TIF_SINGLESTEP),%edi
cli
+ TRACE_IRQS_OFF
jmp int_restore_rest

int_signal:
@@ -369,6 +387,7 @@ int_signal:
int_restore_rest:
RESTORE_REST
cli
+ TRACE_IRQS_OFF
jmp int_with_check
CFI_ENDPROC
END(int_ret_from_sys_call)
@@ -501,6 +520,11 @@ END(stub_rt_sigreturn)
#ifndef CONFIG_DEBUG_INFO
CFI_ADJUST_CFA_OFFSET 8
#endif
+ /*
+ * We entered an interrupt context - irqs are off:
+ */
+ TRACE_IRQS_OFF
+
call \func
.endm

@@ -514,6 +538,7 @@ ret_from_intr:
CFI_ADJUST_CFA_OFFSET -8
#endif
cli
+ TRACE_IRQS_OFF
decl %gs:pda_irqcount
#ifdef CONFIG_DEBUG_INFO
movq RBP(%rdi),%rbp
@@ -538,9 +563,21 @@ retint_check:
CFI_REMEMBER_STATE
jnz retint_careful
retint_swapgs:
+ /*
+ * The iretq will re-enable interrupts:
+ */
+ cli
+ TRACE_IRQS_ON
swapgs
+ jmp restore_args
+
retint_restore_args:
cli
+ /*
+ * The iretq will re-enable interrupts:
+ */
+ TRACE_IRQS_ON
+restore_args:
RESTORE_ARGS 0,8,0
iret_label:
iretq
@@ -553,6 +590,7 @@ iret_label:
/* running with kernel gs */
bad_iret:
movq $11,%rdi /* SIGSEGV */
+ TRACE_IRQS_ON
sti
jmp do_exit
.previous
@@ -562,6 +600,7 @@ retint_careful:
CFI_RESTORE_STATE
bt $TIF_NEED_RESCHED,%edx
jnc retint_signal
+ TRACE_IRQS_ON
sti
pushq %rdi
CFI_ADJUST_CFA_OFFSET 8
@@ -570,11 +609,13 @@ retint_careful:
CFI_ADJUST_CFA_OFFSET -8
GET_THREAD_INFO(%rcx)
cli
+ TRACE_IRQS_OFF
jmp retint_check

retint_signal:
testl $(_TIF_SIGPENDING|_TIF_NOTIFY_RESUME|_TIF_SINGLESTEP),%edx
jz retint_swapgs
+ TRACE_IRQS_ON
sti
SAVE_REST
movq $-1,ORIG_RAX(%rsp)
@@ -583,6 +624,7 @@ retint_signal:
call do_notify_resume
RESTORE_REST
cli
+ TRACE_IRQS_OFF
movl $_TIF_NEED_RESCHED,%edi
GET_THREAD_INFO(%rcx)
jmp retint_check
@@ -714,6 +756,7 @@ END(spurious_interrupt)
addq $EXCEPTION_STKSZ, per_cpu__init_tss + TSS_ist + (\ist - 1) * 8(%rbp)
.endif
cli
+ TRACE_IRQS_OFF
.endm

/*
@@ -771,6 +814,7 @@ error_exit:
movl %ebx,%eax
RESTORE_REST
cli
+ TRACE_IRQS_OFF
GET_THREAD_INFO(%rcx)
testl %eax,%eax
jne retint_kernel
@@ -778,6 +822,10 @@ error_exit:
movl $_TIF_WORK_MASK,%edi
andl %edi,%edx
jnz retint_careful
+ /*
+ * The iret will restore flags:
+ */
+ TRACE_IRQS_ON
swapgs
RESTORE_ARGS 0,8,0
jmp iret_label
@@ -980,16 +1028,20 @@ paranoid_userspace:
testl $_TIF_NEED_RESCHED,%ebx
jnz paranoid_schedule
movl %ebx,%edx /* arg3: thread flags */
+ TRACE_IRQS_ON
sti
xorl %esi,%esi /* arg2: oldset */
movq %rsp,%rdi /* arg1: &pt_regs */
call do_notify_resume
cli
+ TRACE_IRQS_OFF
jmp paranoid_userspace
paranoid_schedule:
+ TRACE_IRQS_ON
sti
call schedule
cli
+ TRACE_IRQS_OFF
jmp paranoid_userspace
CFI_ENDPROC
END(nmi)
Index: linux/arch/x86_64/kernel/irq.c
===================================================================
--- linux.orig/arch/x86_64/kernel/irq.c
+++ linux/arch/x86_64/kernel/irq.c
@@ -145,8 +145,10 @@ asmlinkage void do_softirq(void)
local_irq_save(flags);
pending = local_softirq_pending();
/* Switch to interrupt stack */
- if (pending)
+ if (pending) {
call_softirq();
+ WARN_ON_ONCE(softirq_count());
+ }
local_irq_restore(flags);
}
EXPORT_SYMBOL(do_softirq);
Index: linux/include/asm-i386/irqflags.h
===================================================================
--- /dev/null
+++ linux/include/asm-i386/irqflags.h
@@ -0,0 +1,56 @@
+/*
+ * include/asm-i386/irqflags.h
+ *
+ * IRQ flags handling
+ *
+ * This file gets included from lowlevel asm headers too, to provide
+ * wrapped versions of the local_irq_*() APIs, based on the
+ * raw_local_irq_*() macros from the lowlevel headers.
+ */
+#ifndef _ASM_IRQFLAGS_H
+#define _ASM_IRQFLAGS_H
+
+#define raw_local_save_flags(x) do { typecheck(unsigned long,x); __asm__ __volatile__("pushfl ; popl %0":"=g" (x): /* no input */); } while (0)
+#define raw_local_irq_restore(x) do { typecheck(unsigned long,x); __asm__ __volatile__("pushl %0 ; popfl": /* no output */ :"g" (x):"memory", "cc"); } while (0)
+#define raw_local_irq_disable() __asm__ __volatile__("cli": : :"memory")
+#define raw_local_irq_enable() __asm__ __volatile__("sti": : :"memory")
+/* used in the idle loop; sti takes one instruction cycle to complete */
+#define raw_safe_halt() __asm__ __volatile__("sti; hlt": : :"memory")
+/* used when interrupts are already enabled or to shutdown the processor */
+#define halt() __asm__ __volatile__("hlt": : :"memory")
+
+#define raw_irqs_disabled_flags(flags) (!((flags) & (1<<9)))
+
+/* For spinlocks etc */
+#define raw_local_irq_save(x) __asm__ __volatile__("pushfl ; popl %0 ; cli":"=g" (x): /* no input */ :"memory")
+
+/*
+ * Do the CPU's IRQ-state tracing from assembly code. We call a
+ * C function, so save all the C-clobbered registers:
+ */
+#ifdef CONFIG_TRACE_IRQFLAGS
+
+# define TRACE_IRQS_ON \
+ pushl %eax; \
+ pushl %ecx; \
+ pushl %edx; \
+ call trace_hardirqs_on; \
+ popl %edx; \
+ popl %ecx; \
+ popl %eax;
+
+# define TRACE_IRQS_OFF \
+ pushl %eax; \
+ pushl %ecx; \
+ pushl %edx; \
+ call trace_hardirqs_off; \
+ popl %edx; \
+ popl %ecx; \
+ popl %eax;
+
+#else
+# define TRACE_IRQS_ON
+# define TRACE_IRQS_OFF
+#endif
+
+#endif
Index: linux/include/asm-i386/spinlock.h
===================================================================
--- linux.orig/include/asm-i386/spinlock.h
+++ linux/include/asm-i386/spinlock.h
@@ -31,6 +31,11 @@
"jmp 1b\n" \
"3:\n\t"

+/*
+ * NOTE: there's an irqs-on section here, which normally would have to be
+ * irq-traced, but on CONFIG_TRACE_IRQFLAGS we never use
+ * __raw_spin_lock_string_flags().
+ */
#define __raw_spin_lock_string_flags \
"\n1:\t" \
"lock ; decb %0\n\t" \
Index: linux/include/asm-i386/system.h
===================================================================
--- linux.orig/include/asm-i386/system.h
+++ linux/include/asm-i386/system.h
@@ -456,25 +456,7 @@ static inline unsigned long long __cmpxc

#define set_wmb(var, value) do { var = value; wmb(); } while (0)

-/* interrupt control.. */
-#define local_save_flags(x) do { typecheck(unsigned long,x); __asm__ __volatile__("pushfl ; popl %0":"=g" (x): /* no input */); } while (0)
-#define local_irq_restore(x) do { typecheck(unsigned long,x); __asm__ __volatile__("pushl %0 ; popfl": /* no output */ :"g" (x):"memory", "cc"); } while (0)
-#define local_irq_disable() __asm__ __volatile__("cli": : :"memory")
-#define local_irq_enable() __asm__ __volatile__("sti": : :"memory")
-/* used in the idle loop; sti takes one instruction cycle to complete */
-#define safe_halt() __asm__ __volatile__("sti; hlt": : :"memory")
-/* used when interrupts are already enabled or to shutdown the processor */
-#define halt() __asm__ __volatile__("hlt": : :"memory")
-
-#define irqs_disabled() \
-({ \
- unsigned long flags; \
- local_save_flags(flags); \
- !(flags & (1<<9)); \
-})
-
-/* For spinlocks etc */
-#define local_irq_save(x) __asm__ __volatile__("pushfl ; popl %0 ; cli":"=g" (x): /* no input */ :"memory")
+#include <linux/trace_irqflags.h>

/*
* disable hlt during certain critical i/o operations
Index: linux/include/asm-powerpc/irqflags.h
===================================================================
--- /dev/null
+++ linux/include/asm-powerpc/irqflags.h
@@ -0,0 +1,31 @@
+/*
+ * include/asm-powerpc/irqflags.h
+ *
+ * IRQ flags handling
+ *
+ * This file gets included from lowlevel asm headers too, to provide
+ * wrapped versions of the local_irq_*() APIs, based on the
+ * raw_local_irq_*() macros from the lowlevel headers.
+ */
+#ifndef _ASM_IRQFLAGS_H
+#define _ASM_IRQFLAGS_H
+
+/*
+ * Get definitions for raw_local_save_flags(x), etc.
+ */
+#include <asm-powerpc/hw_irq.h>
+
+/*
+ * Do the CPU's IRQ-state tracing from assembly code. We call a
+ * C function, so save all the C-clobbered registers:
+ */
+#ifdef CONFIG_TRACE_IRQFLAGS
+
+#error No support on PowerPC yet for CONFIG_TRACE_IRQFLAGS
+
+#else
+# define TRACE_IRQS_ON
+# define TRACE_IRQS_OFF
+#endif
+
+#endif
Index: linux/include/asm-x86_64/irqflags.h
===================================================================
--- /dev/null
+++ linux/include/asm-x86_64/irqflags.h
@@ -0,0 +1,54 @@
+/*
+ * include/asm-x86_64/irqflags.h
+ *
+ * IRQ flags handling
+ *
+ * This file gets included from lowlevel asm headers too, to provide
+ * wrapped versions of the local_irq_*() APIs, based on the
+ * raw_local_irq_*() macros from the lowlevel headers.
+ */
+#ifndef _ASM_IRQFLAGS_H
+#define _ASM_IRQFLAGS_H
+
+/* interrupt control.. */
+#define raw_local_save_flags(x) do { warn_if_not_ulong(x); __asm__ __volatile__("# save_flags \n\t pushfq ; popq %q0":"=g" (x): /* no input */ :"memory"); } while (0)
+#define raw_local_irq_restore(x) __asm__ __volatile__("# restore_flags \n\t pushq %0 ; popfq": /* no output */ :"g" (x):"memory", "cc")
+
+#ifdef CONFIG_X86_VSMP
+/* Interrupt control for VSMP architecture */
+#define raw_local_irq_disable() do { unsigned long flags; raw_local_save_flags(flags); raw_local_irq_restore((flags & ~(1 << 9)) | (1 << 18)); } while (0)
+#define raw_local_irq_enable() do { unsigned long flags; raw_local_save_flags(flags); raw_local_irq_restore((flags | (1 << 9)) & ~(1 << 18)); } while (0)
+
+#define raw_irqs_disabled_flags(flags) \
+({ \
+ (flags & (1<<18)) || !(flags & (1<<9)); \
+})
+
+/* For spinlocks etc */
+#define raw_local_irq_save(x) do { raw_local_save_flags(x); raw_local_irq_restore((x & ~(1 << 9)) | (1 << 18)); } while (0)
+#else /* CONFIG_X86_VSMP */
+#define raw_local_irq_disable() __asm__ __volatile__("cli": : :"memory")
+#define raw_local_irq_enable() __asm__ __volatile__("sti": : :"memory")
+
+#define raw_irqs_disabled_flags(flags) \
+({ \
+ !(flags & (1<<9)); \
+})
+
+/* For spinlocks etc */
+#define raw_local_irq_save(x) do { warn_if_not_ulong(x); __asm__ __volatile__("# raw_local_irq_save \n\t pushfq ; popq %0 ; cli":"=g" (x): /* no input */ :"memory"); } while (0)
+#endif
+
+#define raw_irqs_disabled() \
+({ \
+ unsigned long flags; \
+ raw_local_save_flags(flags); \
+ raw_irqs_disabled_flags(flags); \
+})
+
+/* used in the idle loop; sti takes one instruction cycle to complete */
+#define raw_safe_halt() __asm__ __volatile__("sti; hlt": : :"memory")
+/* used when interrupts are already enabled or to shutdown the processor */
+#define halt() __asm__ __volatile__("hlt": : :"memory")
+
+#endif
Index: linux/include/asm-x86_64/system.h
===================================================================
--- linux.orig/include/asm-x86_64/system.h
+++ linux/include/asm-x86_64/system.h
@@ -244,43 +244,7 @@ static inline unsigned long __cmpxchg(vo

#define warn_if_not_ulong(x) do { unsigned long foo; (void) (&(x) == &foo); } while (0)

-/* interrupt control.. */
-#define local_save_flags(x) do { warn_if_not_ulong(x); __asm__ __volatile__("# save_flags \n\t pushfq ; popq %q0":"=g" (x): /* no input */ :"memory"); } while (0)
-#define local_irq_restore(x) __asm__ __volatile__("# restore_flags \n\t pushq %0 ; popfq": /* no output */ :"g" (x):"memory", "cc")
-
-#ifdef CONFIG_X86_VSMP
-/* Interrupt control for VSMP architecture */
-#define local_irq_disable() do { unsigned long flags; local_save_flags(flags); local_irq_restore((flags & ~(1 << 9)) | (1 << 18)); } while (0)
-#define local_irq_enable() do { unsigned long flags; local_save_flags(flags); local_irq_restore((flags | (1 << 9)) & ~(1 << 18)); } while (0)
-
-#define irqs_disabled() \
-({ \
- unsigned long flags; \
- local_save_flags(flags); \
- (flags & (1<<18)) || !(flags & (1<<9)); \
-})
-
-/* For spinlocks etc */
-#define local_irq_save(x) do { local_save_flags(x); local_irq_restore((x & ~(1 << 9)) | (1 << 18)); } while (0)
-#else /* CONFIG_X86_VSMP */
-#define local_irq_disable() __asm__ __volatile__("cli": : :"memory")
-#define local_irq_enable() __asm__ __volatile__("sti": : :"memory")
-
-#define irqs_disabled() \
-({ \
- unsigned long flags; \
- local_save_flags(flags); \
- !(flags & (1<<9)); \
-})
-
-/* For spinlocks etc */
-#define local_irq_save(x) do { warn_if_not_ulong(x); __asm__ __volatile__("# local_irq_save \n\t pushfq ; popq %0 ; cli":"=g" (x): /* no input */ :"memory"); } while (0)
-#endif
-
-/* used in the idle loop; sti takes one instruction cycle to complete */
-#define safe_halt() __asm__ __volatile__("sti; hlt": : :"memory")
-/* used when interrupts are already enabled or to shutdown the processor */
-#define halt() __asm__ __volatile__("hlt": : :"memory")
+#include <linux/trace_irqflags.h>

void cpu_idle_wait(void);

Index: linux/include/linux/hardirq.h
===================================================================
--- linux.orig/include/linux/hardirq.h
+++ linux/include/linux/hardirq.h
@@ -87,7 +87,11 @@ extern void synchronize_irq(unsigned int
#endif

#define nmi_enter() irq_enter()
-#define nmi_exit() sub_preempt_count(HARDIRQ_OFFSET)
+#define nmi_exit() \
+ do { \
+ sub_preempt_count(HARDIRQ_OFFSET); \
+ trace_hardirq_exit(); \
+ } while (0)

struct task_struct;

@@ -97,10 +101,17 @@ static inline void account_system_vtime(
}
#endif

+/*
+ * It is safe to do non-atomic ops on ->hardirq_context,
+ * because NMI handlers may not preempt and the ops are
+ * always balanced, so the interrupted value of ->hardirq_context
+ * will always be restored.
+ */
#define irq_enter() \
do { \
account_system_vtime(current); \
add_preempt_count(HARDIRQ_OFFSET); \
+ trace_hardirq_enter(); \
} while (0)

extern void irq_exit(void);
Index: linux/include/linux/init_task.h
===================================================================
--- linux.orig/include/linux/init_task.h
+++ linux/include/linux/init_task.h
@@ -133,6 +133,7 @@ extern struct group_info init_groups;
.journal_info = NULL, \
.cpu_timers = INIT_CPU_TIMERS(tsk.cpu_timers), \
.fs_excl = ATOMIC_INIT(0), \
+ INIT_TRACE_IRQFLAGS \
}


Index: linux/include/linux/interrupt.h
===================================================================
--- linux.orig/include/linux/interrupt.h
+++ linux/include/linux/interrupt.h
@@ -10,6 +10,7 @@
#include <linux/irqreturn.h>
#include <linux/hardirq.h>
#include <linux/sched.h>
+#include <linux/trace_irqflags.h>
#include <asm/atomic.h>
#include <asm/ptrace.h>
#include <asm/system.h>
@@ -72,13 +73,11 @@ static inline void __deprecated save_and
#define save_and_cli(x) save_and_cli(&x)
#endif /* CONFIG_SMP */

-/* SoftIRQ primitives. */
-#define local_bh_disable() \
- do { add_preempt_count(SOFTIRQ_OFFSET); barrier(); } while (0)
-#define __local_bh_enable() \
- do { barrier(); sub_preempt_count(SOFTIRQ_OFFSET); } while (0)
-
+extern void local_bh_disable(void);
+extern void __local_bh_enable(void);
+extern void _local_bh_enable(void);
extern void local_bh_enable(void);
+extern void local_bh_enable_ip(unsigned long ip);

/* PLEASE, avoid to allocate new softirqs, if you need not _really_ high
frequency threaded job scheduling. For almost all the purposes
Index: linux/include/linux/sched.h
===================================================================
--- linux.orig/include/linux/sched.h
+++ linux/include/linux/sched.h
@@ -916,6 +916,21 @@ struct task_struct {
/* mutex deadlock detection */
struct mutex_waiter *blocked_on;
#endif
+#ifdef CONFIG_TRACE_IRQFLAGS
+ unsigned int irq_events;
+ int hardirqs_enabled;
+ unsigned long hardirq_enable_ip;
+ unsigned int hardirq_enable_event;
+ unsigned long hardirq_disable_ip;
+ unsigned int hardirq_disable_event;
+ int softirqs_enabled;
+ unsigned long softirq_disable_ip;
+ unsigned int softirq_disable_event;
+ unsigned long softirq_enable_ip;
+ unsigned int softirq_enable_event;
+ int hardirq_context;
+ int softirq_context;
+#endif

/* journalling filesystem info */
void *journal_info;
Index: linux/include/linux/trace_irqflags.h
===================================================================
--- /dev/null
+++ linux/include/linux/trace_irqflags.h
@@ -0,0 +1,87 @@
+/*
+ * include/linux/trace_irqflags.h
+ *
+ * IRQ flags tracing: follow the state of the hardirq and softirq flags and
+ * provide callbacks for transitions between ON and OFF states.
+ *
+ * This file gets included from lowlevel asm headers too, to provide
+ * wrapped versions of the local_irq_*() APIs, based on the
+ * raw_local_irq_*() macros from the lowlevel headers.
+ */
+#ifndef _LINUX_TRACE_IRQFLAGS_H
+#define _LINUX_TRACE_IRQFLAGS_H
+
+#include <asm/irqflags.h>
+
+/*
+ * The local_irq_*() APIs are equal to the raw_local_irq*()
+ * if !TRACE_IRQFLAGS.
+ */
+#ifdef CONFIG_TRACE_IRQFLAGS
+ extern void trace_hardirqs_on(void);
+ extern void trace_hardirqs_off(void);
+ extern void trace_softirqs_on(unsigned long ip);
+ extern void trace_softirqs_off(unsigned long ip);
+# define trace_hardirq_context(p) ((p)->hardirq_context)
+# define trace_softirq_context(p) ((p)->softirq_context)
+# define trace_hardirqs_enabled(p) ((p)->hardirqs_enabled)
+# define trace_softirqs_enabled(p) ((p)->softirqs_enabled)
+# define trace_hardirq_enter() do { current->hardirq_context++; } while (0)
+# define trace_hardirq_exit() do { current->hardirq_context--; } while (0)
+# define trace_softirq_enter() do { current->softirq_context++; } while (0)
+# define trace_softirq_exit() do { current->softirq_context--; } while (0)
+# define INIT_TRACE_IRQFLAGS .softirqs_enabled = 1,
+
+#else
+# define trace_hardirqs_on() do { } while (0)
+# define trace_hardirqs_off() do { } while (0)
+# define trace_softirqs_on(ip) do { } while (0)
+# define trace_softirqs_off(ip) do { } while (0)
+# define trace_hardirq_context(p) 0
+# define trace_softirq_context(p) 0
+# define trace_hardirqs_enabled(p) 0
+# define trace_softirqs_enabled(p) 0
+# define trace_hardirq_enter() do { } while (0)
+# define trace_hardirq_exit() do { } while (0)
+# define trace_softirq_enter() do { } while (0)
+# define trace_softirq_exit() do { } while (0)
+# define INIT_TRACE_IRQFLAGS
+#endif
+
+#define local_irq_enable() \
+ do { trace_hardirqs_on(); raw_local_irq_enable(); } while (0)
+#define local_irq_disable() \
+ do { raw_local_irq_disable(); trace_hardirqs_off(); } while (0)
+#define local_irq_save(flags) \
+ do { raw_local_irq_save(flags); trace_hardirqs_off(); } while (0)
+
+#define local_irq_restore(flags) \
+ do { \
+ if (raw_irqs_disabled_flags(flags)) { \
+ raw_local_irq_restore(flags); \
+ trace_hardirqs_off(); \
+ } else { \
+ trace_hardirqs_on(); \
+ raw_local_irq_restore(flags); \
+ } \
+ } while (0)
+
+#define safe_halt() \
+ do { \
+ trace_hardirqs_on(); \
+ raw_safe_halt(); \
+ } while (0)
+
+#define local_save_flags(flags) raw_local_save_flags(flags)
+
+#define irqs_disabled() \
+({ \
+ unsigned long flags; \
+ \
+ raw_local_save_flags(flags); \
+ raw_irqs_disabled_flags(flags); \
+})
+
+#define irqs_disabled_flags(flags) raw_irqs_disabled_flags(flags)
+
+#endif
Index: linux/kernel/fork.c
===================================================================
--- linux.orig/kernel/fork.c
+++ linux/kernel/fork.c
@@ -970,6 +970,10 @@ static task_t *copy_process(unsigned lon
if (!p)
goto fork_out;

+#ifdef CONFIG_TRACE_IRQFLAGS
+ DEBUG_WARN_ON(!p->hardirqs_enabled);
+ DEBUG_WARN_ON(!p->softirqs_enabled);
+#endif
retval = -EAGAIN;
if (atomic_read(&p->user->processes) >=
p->signal->rlim[RLIMIT_NPROC].rlim_cur) {
@@ -1051,7 +1055,21 @@ static task_t *copy_process(unsigned lon
#ifdef CONFIG_DEBUG_MUTEXES
p->blocked_on = NULL; /* not blocked yet */
#endif
-
+#ifdef CONFIG_TRACE_IRQFLAGS
+ p->irq_events = 0;
+ p->hardirqs_enabled = 0;
+ p->hardirq_enable_ip = 0;
+ p->hardirq_enable_event = 0;
+ p->hardirq_disable_ip = _THIS_IP_;
+ p->hardirq_disable_event = 0;
+ p->softirqs_enabled = 1;
+ p->softirq_enable_ip = _THIS_IP_;
+ p->softirq_enable_event = 0;
+ p->softirq_disable_ip = 0;
+ p->softirq_disable_event = 0;
+ p->hardirq_context = 0;
+ p->softirq_context = 0;
+#endif
p->tgid = p->pid;
if (clone_flags & CLONE_THREAD)
p->tgid = current->tgid;
Index: linux/kernel/sched.c
===================================================================
--- linux.orig/kernel/sched.c
+++ linux/kernel/sched.c
@@ -4481,7 +4481,9 @@ int __sched cond_resched_softirq(void)
BUG_ON(!in_softirq());

if (need_resched()) {
- __local_bh_enable();
+ raw_local_irq_disable();
+ _local_bh_enable();
+ raw_local_irq_enable();
__cond_resched();
local_bh_disable();
return 1;
Index: linux/kernel/softirq.c
===================================================================
--- linux.orig/kernel/softirq.c
+++ linux/kernel/softirq.c
@@ -62,6 +62,119 @@ static inline void wakeup_softirqd(void)
}

/*
+ * This one is for softirq.c-internal use,
+ * where hardirqs are disabled legitimately:
+ */
+static void __local_bh_disable(unsigned long ip)
+{
+ unsigned long flags;
+
+ WARN_ON_ONCE(in_irq());
+
+ raw_local_irq_save(flags);
+ add_preempt_count(SOFTIRQ_OFFSET);
+ /*
+ * Were softirqs turned off above:
+ */
+ if (softirq_count() == SOFTIRQ_OFFSET)
+ trace_softirqs_off(ip);
+ raw_local_irq_restore(flags);
+}
+
+void local_bh_disable(void)
+{
+ WARN_ON_ONCE(irqs_disabled());
+ __local_bh_disable((unsigned long)__builtin_return_address(0));
+}
+
+EXPORT_SYMBOL(local_bh_disable);
+
+void __local_bh_enable(void)
+{
+ WARN_ON_ONCE(in_irq());
+
+ /*
+ * softirqs should never be enabled by __local_bh_enable(),
+ * it always nests inside local_bh_enable() sections:
+ */
+ WARN_ON_ONCE(softirq_count() == SOFTIRQ_OFFSET);
+
+ sub_preempt_count(SOFTIRQ_OFFSET);
+}
+
+EXPORT_SYMBOL(__local_bh_enable);
+
+/*
+ * Special-case - softirqs can safely be enabled in
+ * cond_resched_softirq(), or by __do_softirq(),
+ * without processing still-pending softirqs:
+ */
+void _local_bh_enable(void)
+{
+ WARN_ON_ONCE(in_irq());
+ WARN_ON_ONCE(!irqs_disabled());
+
+ if (softirq_count() == SOFTIRQ_OFFSET)
+ trace_softirqs_on((unsigned long)__builtin_return_address(0));
+ sub_preempt_count(SOFTIRQ_OFFSET);
+}
+
+void local_bh_enable(void)
+{
+ unsigned long flags;
+
+ WARN_ON_ONCE(in_irq());
+ WARN_ON_ONCE(irqs_disabled());
+
+ local_irq_save(flags);
+ /*
+ * Are softirqs going to be turned on now:
+ */
+ if (softirq_count() == SOFTIRQ_OFFSET)
+ trace_softirqs_on((unsigned long)__builtin_return_address(0));
+ /*
+ * Keep preemption disabled until we are done with
+ * softirq processing:
+ */
+ sub_preempt_count(SOFTIRQ_OFFSET - 1);
+
+ if (unlikely(!in_interrupt() && local_softirq_pending()))
+ do_softirq();
+
+ dec_preempt_count();
+ local_irq_restore(flags);
+ preempt_check_resched();
+}
+EXPORT_SYMBOL(local_bh_enable);
+
+void local_bh_enable_ip(unsigned long ip)
+{
+ unsigned long flags;
+
+ WARN_ON_ONCE(in_irq());
+
+ local_irq_save(flags);
+ /*
+ * Are softirqs going to be turned on now:
+ */
+ if (softirq_count() == SOFTIRQ_OFFSET)
+ trace_softirqs_on(ip);
+ /*
+ * Keep preemption disabled until we are done with
+ * softirq processing:
+ */
+ sub_preempt_count(SOFTIRQ_OFFSET - 1);
+
+ if (unlikely(!in_interrupt() && local_softirq_pending()))
+ do_softirq();
+
+ dec_preempt_count();
+ local_irq_restore(flags);
+ preempt_check_resched();
+}
+EXPORT_SYMBOL(local_bh_enable_ip);
+
+/*
* We restart softirq processing MAX_SOFTIRQ_RESTART times,
* and we fall back to softirqd after that.
*
@@ -80,8 +193,9 @@ asmlinkage void __do_softirq(void)
int cpu;

pending = local_softirq_pending();
+ __local_bh_disable((unsigned long)__builtin_return_address(0));
+ trace_softirq_enter();

- local_bh_disable();
cpu = smp_processor_id();
restart:
/* Reset the pending bitmask before enabling irqs */
@@ -109,7 +223,8 @@ restart:
if (pending)
wakeup_softirqd();

- __local_bh_enable();
+ trace_softirq_exit();
+ _local_bh_enable();
}

#ifndef __ARCH_HAS_DO_SOFTIRQ
@@ -136,23 +251,6 @@ EXPORT_SYMBOL(do_softirq);

#endif

-void local_bh_enable(void)
-{
- WARN_ON(irqs_disabled());
- /*
- * Keep preemption disabled until we are done with
- * softirq processing:
- */
- sub_preempt_count(SOFTIRQ_OFFSET - 1);
-
- if (unlikely(!in_interrupt() && local_softirq_pending()))
- do_softirq();
-
- dec_preempt_count();
- preempt_check_resched();
-}
-EXPORT_SYMBOL(local_bh_enable);
-
#ifdef __ARCH_IRQ_EXIT_IRQS_DISABLED
# define invoke_softirq() __do_softirq()
#else
@@ -165,6 +263,7 @@ EXPORT_SYMBOL(local_bh_enable);
void irq_exit(void)
{
account_system_vtime(current);
+ trace_hardirq_exit();
sub_preempt_count(IRQ_EXIT_OFFSET);
if (!in_interrupt() && local_softirq_pending())
invoke_softirq();
Index: linux/lib/locking-selftest.c
===================================================================
--- linux.orig/lib/locking-selftest.c
+++ linux/lib/locking-selftest.c
@@ -19,6 +19,7 @@
#include <linux/kallsyms.h>
#include <linux/interrupt.h>
#include <linux/debug_locks.h>
+#include <linux/trace_irqflags.h>

/*
* Change this to 1 if you want to see the failure printouts:
@@ -157,9 +158,11 @@ static void init_shared_types(void)
#define SOFTIRQ_ENTER() \
local_bh_disable(); \
local_irq_disable(); \
+ trace_softirq_enter(); \
WARN_ON(!in_softirq());

#define SOFTIRQ_EXIT() \
+ trace_softirq_exit(); \
local_irq_enable(); \
local_bh_enable();

2006-05-29 21:44:00

by Ingo Molnar

[permalink] [raw]
Subject: [patch 15/61] lock validator: x86_64: use stacktrace to generate backtraces

From: Ingo Molnar <[email protected]>

this switches x86_64 to use the stacktrace infrastructure when generating
backtrace printouts, if CONFIG_FRAME_POINTER=y. (This patch will go away
once the dwarf2 stackframe parser in -mm goes upstream.)

Signed-off-by: Ingo Molnar <[email protected]>
Signed-off-by: Arjan van de Ven <[email protected]>
---
arch/x86_64/kernel/traps.c | 35 +++++++++++++++++++++++++++++++++--
1 file changed, 33 insertions(+), 2 deletions(-)

Index: linux/arch/x86_64/kernel/traps.c
===================================================================
--- linux.orig/arch/x86_64/kernel/traps.c
+++ linux/arch/x86_64/kernel/traps.c
@@ -235,7 +235,31 @@ in_exception_stack(unsigned cpu, unsigne
* severe exception (double fault, nmi, stack fault, debug, mce) hardware stack
*/

-void show_trace(unsigned long *stack)
+#ifdef CONFIG_FRAME_POINTER
+
+#include <linux/stacktrace.h>
+
+#define MAX_TRACE_ENTRIES 64
+
+static void __show_trace(struct task_struct *task, unsigned long *stack)
+{
+ unsigned long entries[MAX_TRACE_ENTRIES];
+ struct stack_trace trace;
+
+ trace.nr_entries = 0;
+ trace.max_entries = MAX_TRACE_ENTRIES;
+ trace.entries = entries;
+
+ save_stack_trace(&trace, task, 1, 0);
+
+ pr_debug("got %d/%d entries.\n", trace.nr_entries, trace.max_entries);
+
+ print_stack_trace(&trace, 4);
+}
+
+#else
+
+void __show_trace(struct task_struct *task, unsigned long *stack)
{
const unsigned cpu = safe_smp_processor_id();
unsigned long *irqstack_end = (unsigned long *)cpu_pda(cpu)->irqstackptr;
@@ -319,6 +343,13 @@ void show_trace(unsigned long *stack)
printk("\n");
}

+#endif
+
+void show_trace(unsigned long *stack)
+{
+ __show_trace(current, stack);
+}
+
void show_stack(struct task_struct *tsk, unsigned long * rsp)
{
unsigned long *stack;
@@ -353,7 +384,7 @@ void show_stack(struct task_struct *tsk,
printk("%016lx ", *stack++);
touch_nmi_watchdog();
}
- show_trace((unsigned long *)rsp);
+ __show_trace(tsk, (unsigned long *)rsp);
}

/*

2006-05-29 21:24:10

by Ingo Molnar

[permalink] [raw]
Subject: [patch 10/61] lock validator: locking init debugging improvement

From: Ingo Molnar <[email protected]>

locking init improvement:

- introduce and use __SPIN_LOCK_UNLOCKED (and the matching __RW_LOCK_UNLOCKED
and __SEQLOCK_UNLOCKED) for static initializers, so that the name string of
each lock can be passed in for the lock debugging code to use
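
To illustrate the pattern applied throughout the diff below (the struct
and field names in this sketch are made up, only the initializer macro
is the real one): a static initializer now passes the name of the lock
it initializes, so the lock debugging code can print something more
useful than a bare address:

#include <linux/spinlock.h>

struct my_table {			/* hypothetical example structure */
	spinlock_t	lock;
	int		count;
};

/*
 * Before this patch the initializer would have been the anonymous
 * SPIN_LOCK_UNLOCKED; now it passes "table.lock" for debugging use:
 */
static struct my_table table = {
	.lock	= __SPIN_LOCK_UNLOCKED(table.lock),
	.count	= 0,
};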

Signed-off-by: Ingo Molnar <[email protected]>
Signed-off-by: Arjan van de Ven <[email protected]>
---
arch/x86_64/kernel/smpboot.c | 3 +++
arch/x86_64/kernel/vsyscall.c | 2 +-
block/ll_rw_blk.c | 1 +
drivers/char/random.c | 6 +++---
drivers/ide/ide-io.c | 2 ++
drivers/scsi/libata-core.c | 2 ++
drivers/spi/spi.c | 1 +
fs/dcache.c | 2 +-
include/linux/idr.h | 2 +-
include/linux/init_task.h | 10 +++++-----
include/linux/notifier.h | 2 +-
include/linux/seqlock.h | 12 ++++++++++--
include/linux/spinlock_types.h | 15 +++++++++------
include/linux/wait.h | 2 +-
kernel/kmod.c | 2 ++
kernel/rcupdate.c | 4 ++--
kernel/timer.c | 2 +-
mm/swap_state.c | 2 +-
net/ipv4/tcp_ipv4.c | 2 +-
net/ipv4/tcp_minisocks.c | 2 +-
net/ipv4/xfrm4_policy.c | 4 ++--
21 files changed, 51 insertions(+), 29 deletions(-)

Index: linux/arch/x86_64/kernel/smpboot.c
===================================================================
--- linux.orig/arch/x86_64/kernel/smpboot.c
+++ linux/arch/x86_64/kernel/smpboot.c
@@ -771,8 +771,11 @@ static int __cpuinit do_boot_cpu(int cpu
.cpu = cpu,
.done = COMPLETION_INITIALIZER(c_idle.done),
};
+
DECLARE_WORK(work, do_fork_idle, &c_idle);

+ init_completion(&c_idle.done);
+
/* allocate memory for gdts of secondary cpus. Hotplug is considered */
if (!cpu_gdt_descr[cpu].address &&
!(cpu_gdt_descr[cpu].address = get_zeroed_page(GFP_KERNEL))) {
Index: linux/arch/x86_64/kernel/vsyscall.c
===================================================================
--- linux.orig/arch/x86_64/kernel/vsyscall.c
+++ linux/arch/x86_64/kernel/vsyscall.c
@@ -37,7 +37,7 @@
#define __vsyscall(nr) __attribute__ ((unused,__section__(".vsyscall_" #nr)))

int __sysctl_vsyscall __section_sysctl_vsyscall = 1;
-seqlock_t __xtime_lock __section_xtime_lock = SEQLOCK_UNLOCKED;
+__section_xtime_lock DEFINE_SEQLOCK(__xtime_lock);

#include <asm/unistd.h>

Index: linux/block/ll_rw_blk.c
===================================================================
--- linux.orig/block/ll_rw_blk.c
+++ linux/block/ll_rw_blk.c
@@ -2529,6 +2529,7 @@ int blk_execute_rq(request_queue_t *q, s
char sense[SCSI_SENSE_BUFFERSIZE];
int err = 0;

+ init_completion(&wait);
/*
* we need an extra reference to the request, so we can look at
* it after io completion
Index: linux/drivers/char/random.c
===================================================================
--- linux.orig/drivers/char/random.c
+++ linux/drivers/char/random.c
@@ -417,7 +417,7 @@ static struct entropy_store input_pool =
.poolinfo = &poolinfo_table[0],
.name = "input",
.limit = 1,
- .lock = SPIN_LOCK_UNLOCKED,
+ .lock = __SPIN_LOCK_UNLOCKED(&input_pool.lock),
.pool = input_pool_data
};

@@ -426,7 +426,7 @@ static struct entropy_store blocking_poo
.name = "blocking",
.limit = 1,
.pull = &input_pool,
- .lock = SPIN_LOCK_UNLOCKED,
+ .lock = __SPIN_LOCK_UNLOCKED(&blocking_pool.lock),
.pool = blocking_pool_data
};

@@ -434,7 +434,7 @@ static struct entropy_store nonblocking_
.poolinfo = &poolinfo_table[1],
.name = "nonblocking",
.pull = &input_pool,
- .lock = SPIN_LOCK_UNLOCKED,
+ .lock = __SPIN_LOCK_UNLOCKED(&nonblocking_pool.lock),
.pool = nonblocking_pool_data
};

Index: linux/drivers/ide/ide-io.c
===================================================================
--- linux.orig/drivers/ide/ide-io.c
+++ linux/drivers/ide/ide-io.c
@@ -1700,6 +1700,8 @@ int ide_do_drive_cmd (ide_drive_t *drive
int where = ELEVATOR_INSERT_BACK, err;
int must_wait = (action == ide_wait || action == ide_head_wait);

+ init_completion(&wait);
+
rq->errors = 0;
rq->rq_status = RQ_ACTIVE;

Index: linux/drivers/scsi/libata-core.c
===================================================================
--- linux.orig/drivers/scsi/libata-core.c
+++ linux/drivers/scsi/libata-core.c
@@ -994,6 +994,8 @@ unsigned ata_exec_internal(struct ata_de
unsigned int err_mask;
int rc;

+ init_completion(&wait);
+
spin_lock_irqsave(&ap->host_set->lock, flags);

/* no internal command while frozen */
Index: linux/drivers/spi/spi.c
===================================================================
--- linux.orig/drivers/spi/spi.c
+++ linux/drivers/spi/spi.c
@@ -512,6 +512,7 @@ int spi_sync(struct spi_device *spi, str
DECLARE_COMPLETION(done);
int status;

+ init_completion(&done);
message->complete = spi_complete;
message->context = &done;
status = spi_async(spi, message);
Index: linux/fs/dcache.c
===================================================================
--- linux.orig/fs/dcache.c
+++ linux/fs/dcache.c
@@ -39,7 +39,7 @@ int sysctl_vfs_cache_pressure __read_mos
EXPORT_SYMBOL_GPL(sysctl_vfs_cache_pressure);

__cacheline_aligned_in_smp DEFINE_SPINLOCK(dcache_lock);
-static seqlock_t rename_lock __cacheline_aligned_in_smp = SEQLOCK_UNLOCKED;
+static __cacheline_aligned_in_smp DEFINE_SEQLOCK(rename_lock);

EXPORT_SYMBOL(dcache_lock);

Index: linux/include/linux/idr.h
===================================================================
--- linux.orig/include/linux/idr.h
+++ linux/include/linux/idr.h
@@ -66,7 +66,7 @@ struct idr {
.id_free = NULL, \
.layers = 0, \
.id_free_cnt = 0, \
- .lock = SPIN_LOCK_UNLOCKED, \
+ .lock = __SPIN_LOCK_UNLOCKED(name.lock), \
}
#define DEFINE_IDR(name) struct idr name = IDR_INIT(name)

Index: linux/include/linux/init_task.h
===================================================================
--- linux.orig/include/linux/init_task.h
+++ linux/include/linux/init_task.h
@@ -22,7 +22,7 @@
.count = ATOMIC_INIT(1), \
.fdt = &init_files.fdtab, \
.fdtab = INIT_FDTABLE, \
- .file_lock = SPIN_LOCK_UNLOCKED, \
+ .file_lock = __SPIN_LOCK_UNLOCKED(init_task.file_lock), \
.next_fd = 0, \
.close_on_exec_init = { { 0, } }, \
.open_fds_init = { { 0, } }, \
@@ -37,7 +37,7 @@
.user_id = 0, \
.next = NULL, \
.wait = __WAIT_QUEUE_HEAD_INITIALIZER(name.wait), \
- .ctx_lock = SPIN_LOCK_UNLOCKED, \
+ .ctx_lock = __SPIN_LOCK_UNLOCKED(name.ctx_lock), \
.reqs_active = 0U, \
.max_reqs = ~0U, \
}
@@ -49,7 +49,7 @@
.mm_users = ATOMIC_INIT(2), \
.mm_count = ATOMIC_INIT(1), \
.mmap_sem = __RWSEM_INITIALIZER(name.mmap_sem), \
- .page_table_lock = SPIN_LOCK_UNLOCKED, \
+ .page_table_lock = __SPIN_LOCK_UNLOCKED(name.page_table_lock), \
.mmlist = LIST_HEAD_INIT(name.mmlist), \
.cpu_vm_mask = CPU_MASK_ALL, \
}
@@ -78,7 +78,7 @@ extern struct nsproxy init_nsproxy;
#define INIT_SIGHAND(sighand) { \
.count = ATOMIC_INIT(1), \
.action = { { { .sa_handler = NULL, } }, }, \
- .siglock = SPIN_LOCK_UNLOCKED, \
+ .siglock = __SPIN_LOCK_UNLOCKED(sighand.siglock), \
}

extern struct group_info init_groups;
@@ -129,7 +129,7 @@ extern struct group_info init_groups;
.list = LIST_HEAD_INIT(tsk.pending.list), \
.signal = {{0}}}, \
.blocked = {{0}}, \
- .alloc_lock = SPIN_LOCK_UNLOCKED, \
+ .alloc_lock = __SPIN_LOCK_UNLOCKED(tsk.alloc_lock), \
.journal_info = NULL, \
.cpu_timers = INIT_CPU_TIMERS(tsk.cpu_timers), \
.fs_excl = ATOMIC_INIT(0), \
Index: linux/include/linux/notifier.h
===================================================================
--- linux.orig/include/linux/notifier.h
+++ linux/include/linux/notifier.h
@@ -65,7 +65,7 @@ struct raw_notifier_head {
} while (0)

#define ATOMIC_NOTIFIER_INIT(name) { \
- .lock = SPIN_LOCK_UNLOCKED, \
+ .lock = __SPIN_LOCK_UNLOCKED(name.lock), \
.head = NULL }
#define BLOCKING_NOTIFIER_INIT(name) { \
.rwsem = __RWSEM_INITIALIZER((name).rwsem), \
Index: linux/include/linux/seqlock.h
===================================================================
--- linux.orig/include/linux/seqlock.h
+++ linux/include/linux/seqlock.h
@@ -38,9 +38,17 @@ typedef struct {
* These macros triggered gcc-3.x compile-time problems. We think these are
* OK now. Be cautious.
*/
-#define SEQLOCK_UNLOCKED { 0, SPIN_LOCK_UNLOCKED }
-#define seqlock_init(x) do { *(x) = (seqlock_t) SEQLOCK_UNLOCKED; } while (0)
+#define __SEQLOCK_UNLOCKED(lockname) \
+ { 0, __SPIN_LOCK_UNLOCKED(lockname) }

+#define SEQLOCK_UNLOCKED \
+ __SEQLOCK_UNLOCKED(old_style_seqlock_init)
+
+#define seqlock_init(x) \
+ do { *(x) = (seqlock_t) __SEQLOCK_UNLOCKED(x); } while (0)
+
+#define DEFINE_SEQLOCK(x) \
+ seqlock_t x = __SEQLOCK_UNLOCKED(x)

/* Lock out other writers and update the count.
* Acts like a normal spin_lock/unlock.
Index: linux/include/linux/spinlock_types.h
===================================================================
--- linux.orig/include/linux/spinlock_types.h
+++ linux/include/linux/spinlock_types.h
@@ -44,24 +44,27 @@ typedef struct {
#define SPINLOCK_OWNER_INIT ((void *)-1L)

#ifdef CONFIG_DEBUG_SPINLOCK
-# define SPIN_LOCK_UNLOCKED \
+# define __SPIN_LOCK_UNLOCKED(lockname) \
(spinlock_t) { .raw_lock = __RAW_SPIN_LOCK_UNLOCKED, \
.magic = SPINLOCK_MAGIC, \
.owner = SPINLOCK_OWNER_INIT, \
.owner_cpu = -1 }
-#define RW_LOCK_UNLOCKED \
+#define __RW_LOCK_UNLOCKED(lockname) \
(rwlock_t) { .raw_lock = __RAW_RW_LOCK_UNLOCKED, \
.magic = RWLOCK_MAGIC, \
.owner = SPINLOCK_OWNER_INIT, \
.owner_cpu = -1 }
#else
-# define SPIN_LOCK_UNLOCKED \
+# define __SPIN_LOCK_UNLOCKED(lockname) \
(spinlock_t) { .raw_lock = __RAW_SPIN_LOCK_UNLOCKED }
-#define RW_LOCK_UNLOCKED \
+#define __RW_LOCK_UNLOCKED(lockname) \
(rwlock_t) { .raw_lock = __RAW_RW_LOCK_UNLOCKED }
#endif

-#define DEFINE_SPINLOCK(x) spinlock_t x = SPIN_LOCK_UNLOCKED
-#define DEFINE_RWLOCK(x) rwlock_t x = RW_LOCK_UNLOCKED
+#define SPIN_LOCK_UNLOCKED __SPIN_LOCK_UNLOCKED(old_style_spin_init)
+#define RW_LOCK_UNLOCKED __RW_LOCK_UNLOCKED(old_style_rw_init)
+
+#define DEFINE_SPINLOCK(x) spinlock_t x = __SPIN_LOCK_UNLOCKED(x)
+#define DEFINE_RWLOCK(x) rwlock_t x = __RW_LOCK_UNLOCKED(x)

#endif /* __LINUX_SPINLOCK_TYPES_H */
Index: linux/include/linux/wait.h
===================================================================
--- linux.orig/include/linux/wait.h
+++ linux/include/linux/wait.h
@@ -68,7 +68,7 @@ struct task_struct;
wait_queue_t name = __WAITQUEUE_INITIALIZER(name, tsk)

#define __WAIT_QUEUE_HEAD_INITIALIZER(name) { \
- .lock = SPIN_LOCK_UNLOCKED, \
+ .lock = __SPIN_LOCK_UNLOCKED(name.lock), \
.task_list = { &(name).task_list, &(name).task_list } }

#define DECLARE_WAIT_QUEUE_HEAD(name) \
Index: linux/kernel/kmod.c
===================================================================
--- linux.orig/kernel/kmod.c
+++ linux/kernel/kmod.c
@@ -246,6 +246,8 @@ int call_usermodehelper_keys(char *path,
};
DECLARE_WORK(work, __call_usermodehelper, &sub_info);

+ init_completion(&done);
+
if (!khelper_wq)
return -EBUSY;

Index: linux/kernel/rcupdate.c
===================================================================
--- linux.orig/kernel/rcupdate.c
+++ linux/kernel/rcupdate.c
@@ -53,13 +53,13 @@
static struct rcu_ctrlblk rcu_ctrlblk = {
.cur = -300,
.completed = -300,
- .lock = SPIN_LOCK_UNLOCKED,
+ .lock = __SPIN_LOCK_UNLOCKED(&rcu_ctrlblk.lock),
.cpumask = CPU_MASK_NONE,
};
static struct rcu_ctrlblk rcu_bh_ctrlblk = {
.cur = -300,
.completed = -300,
- .lock = SPIN_LOCK_UNLOCKED,
+ .lock = __SPIN_LOCK_UNLOCKED(&rcu_bh_ctrlblk.lock),
.cpumask = CPU_MASK_NONE,
};

Index: linux/kernel/timer.c
===================================================================
--- linux.orig/kernel/timer.c
+++ linux/kernel/timer.c
@@ -1142,7 +1142,7 @@ unsigned long wall_jiffies = INITIAL_JIF
* playing with xtime and avenrun.
*/
#ifndef ARCH_HAVE_XTIME_LOCK
-seqlock_t xtime_lock __cacheline_aligned_in_smp = SEQLOCK_UNLOCKED;
+__cacheline_aligned_in_smp DEFINE_SEQLOCK(xtime_lock);

EXPORT_SYMBOL(xtime_lock);
#endif
Index: linux/mm/swap_state.c
===================================================================
--- linux.orig/mm/swap_state.c
+++ linux/mm/swap_state.c
@@ -39,7 +39,7 @@ static struct backing_dev_info swap_back

struct address_space swapper_space = {
.page_tree = RADIX_TREE_INIT(GFP_ATOMIC|__GFP_NOWARN),
- .tree_lock = RW_LOCK_UNLOCKED,
+ .tree_lock = __RW_LOCK_UNLOCKED(swapper_space.tree_lock),
.a_ops = &swap_aops,
.i_mmap_nonlinear = LIST_HEAD_INIT(swapper_space.i_mmap_nonlinear),
.backing_dev_info = &swap_backing_dev_info,
Index: linux/net/ipv4/tcp_ipv4.c
===================================================================
--- linux.orig/net/ipv4/tcp_ipv4.c
+++ linux/net/ipv4/tcp_ipv4.c
@@ -90,7 +90,7 @@ static struct socket *tcp_socket;
void tcp_v4_send_check(struct sock *sk, int len, struct sk_buff *skb);

struct inet_hashinfo __cacheline_aligned tcp_hashinfo = {
- .lhash_lock = RW_LOCK_UNLOCKED,
+ .lhash_lock = __RW_LOCK_UNLOCKED(tcp_hashinfo.lhash_lock),
.lhash_users = ATOMIC_INIT(0),
.lhash_wait = __WAIT_QUEUE_HEAD_INITIALIZER(tcp_hashinfo.lhash_wait),
};
Index: linux/net/ipv4/tcp_minisocks.c
===================================================================
--- linux.orig/net/ipv4/tcp_minisocks.c
+++ linux/net/ipv4/tcp_minisocks.c
@@ -41,7 +41,7 @@ int sysctl_tcp_abort_on_overflow;
struct inet_timewait_death_row tcp_death_row = {
.sysctl_max_tw_buckets = NR_FILE * 2,
.period = TCP_TIMEWAIT_LEN / INET_TWDR_TWKILL_SLOTS,
- .death_lock = SPIN_LOCK_UNLOCKED,
+ .death_lock = __SPIN_LOCK_UNLOCKED(tcp_death_row.death_lock),
.hashinfo = &tcp_hashinfo,
.tw_timer = TIMER_INITIALIZER(inet_twdr_hangman, 0,
(unsigned long)&tcp_death_row),
Index: linux/net/ipv4/xfrm4_policy.c
===================================================================
--- linux.orig/net/ipv4/xfrm4_policy.c
+++ linux/net/ipv4/xfrm4_policy.c
@@ -17,7 +17,7 @@
static struct dst_ops xfrm4_dst_ops;
static struct xfrm_policy_afinfo xfrm4_policy_afinfo;

-static struct xfrm_type_map xfrm4_type_map = { .lock = RW_LOCK_UNLOCKED };
+static struct xfrm_type_map xfrm4_type_map = { .lock = __RW_LOCK_UNLOCKED(xfrm4_type_map.lock) };

static int xfrm4_dst_lookup(struct xfrm_dst **dst, struct flowi *fl)
{
@@ -299,7 +299,7 @@ static struct dst_ops xfrm4_dst_ops = {

static struct xfrm_policy_afinfo xfrm4_policy_afinfo = {
.family = AF_INET,
- .lock = RW_LOCK_UNLOCKED,
+ .lock = __RW_LOCK_UNLOCKED(xfrm4_policy_afinfo.lock),
.type_map = &xfrm4_type_map,
.dst_ops = &xfrm4_dst_ops,
.dst_lookup = xfrm4_dst_lookup,

2006-05-29 21:44:15

by Ingo Molnar

[permalink] [raw]
Subject: [patch 17/61] lock validator: sk_callback_lock workaround

From: Ingo Molnar <[email protected]>

temporary workaround for the lock validator: make all uses of
sk_callback_lock softirq-safe. (The real solution will be to teach
the lock validator that sk_callback_lock's locking rules should be
tracked per address family.)
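
The conversion is mechanical: every read_lock()/read_unlock() of
sk->sk_callback_lock becomes the _bh variant, which also disables
softirqs for the duration of the critical section. A minimal sketch of
the pattern (the helper function below is hypothetical, the lock field
is the real one):

#include <net/sock.h>

static int sk_has_socket(struct sock *sk)
{
	int ret;

	/* was: read_lock(&sk->sk_callback_lock); */
	read_lock_bh(&sk->sk_callback_lock);
	ret = (sk->sk_socket != NULL);
	read_unlock_bh(&sk->sk_callback_lock);

	return ret;
}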

Signed-off-by: Ingo Molnar <[email protected]>
Signed-off-by: Arjan van de Ven <[email protected]>
---
net/core/sock.c | 24 ++++++++++++------------
1 file changed, 12 insertions(+), 12 deletions(-)

Index: linux/net/core/sock.c
===================================================================
--- linux.orig/net/core/sock.c
+++ linux/net/core/sock.c
@@ -934,9 +934,9 @@ int sock_i_uid(struct sock *sk)
{
int uid;

- read_lock(&sk->sk_callback_lock);
+ read_lock_bh(&sk->sk_callback_lock);
uid = sk->sk_socket ? SOCK_INODE(sk->sk_socket)->i_uid : 0;
- read_unlock(&sk->sk_callback_lock);
+ read_unlock_bh(&sk->sk_callback_lock);
return uid;
}

@@ -944,9 +944,9 @@ unsigned long sock_i_ino(struct sock *sk
{
unsigned long ino;

- read_lock(&sk->sk_callback_lock);
+ read_lock_bh(&sk->sk_callback_lock);
ino = sk->sk_socket ? SOCK_INODE(sk->sk_socket)->i_ino : 0;
- read_unlock(&sk->sk_callback_lock);
+ read_unlock_bh(&sk->sk_callback_lock);
return ino;
}

@@ -1306,33 +1306,33 @@ ssize_t sock_no_sendpage(struct socket *

static void sock_def_wakeup(struct sock *sk)
{
- read_lock(&sk->sk_callback_lock);
+ read_lock_bh(&sk->sk_callback_lock);
if (sk->sk_sleep && waitqueue_active(sk->sk_sleep))
wake_up_interruptible_all(sk->sk_sleep);
- read_unlock(&sk->sk_callback_lock);
+ read_unlock_bh(&sk->sk_callback_lock);
}

static void sock_def_error_report(struct sock *sk)
{
- read_lock(&sk->sk_callback_lock);
+ read_lock_bh(&sk->sk_callback_lock);
if (sk->sk_sleep && waitqueue_active(sk->sk_sleep))
wake_up_interruptible(sk->sk_sleep);
sk_wake_async(sk,0,POLL_ERR);
- read_unlock(&sk->sk_callback_lock);
+ read_unlock_bh(&sk->sk_callback_lock);
}

static void sock_def_readable(struct sock *sk, int len)
{
- read_lock(&sk->sk_callback_lock);
+ read_lock_bh(&sk->sk_callback_lock);
if (sk->sk_sleep && waitqueue_active(sk->sk_sleep))
wake_up_interruptible(sk->sk_sleep);
sk_wake_async(sk,1,POLL_IN);
- read_unlock(&sk->sk_callback_lock);
+ read_unlock_bh(&sk->sk_callback_lock);
}

static void sock_def_write_space(struct sock *sk)
{
- read_lock(&sk->sk_callback_lock);
+ read_lock_bh(&sk->sk_callback_lock);

/* Do not wake up a writer until he can make "significant"
* progress. --DaveM
@@ -1346,7 +1346,7 @@ static void sock_def_write_space(struct
sk_wake_async(sk, 2, POLL_OUT);
}

- read_unlock(&sk->sk_callback_lock);
+ read_unlock_bh(&sk->sk_callback_lock);
}

static void sock_def_destruct(struct sock *sk)

2006-05-29 21:44:53

by Ingo Molnar

[permalink] [raw]
Subject: [patch 11/61] lock validator: lockdep: small xfs init_rwsem() cleanup

From: Ingo Molnar <[email protected]>

init_rwsem() has no return value, so nothing depends on the value of the
mrinit() comma expression that wraps it. That construct is fine while
init_rwsem() is a function, but it breaks once init_rwsem() becomes a
do { ... } while (0) macro (which lockdep introduces): a statement cannot
be an operand of the comma operator, so make mrinit() a statement too.
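
The underlying C issue: a do { ... } while (0) macro is a statement,
not an expression, so it cannot sit inside the comma expression that
mrinit() used. A small sketch of the same failure mode, with made-up
names:

#define init_as_function(x)	((void)(x))			/* expression - ok inside ',' */
#define init_as_macro(x)	do { (void)(x); } while (0)	/* statement - not ok         */

/* compiles: both operands of the comma operator are expressions */
#define setup_ok(p)	( (p)->writer = 0, init_as_function(&(p)->lock) )

/* would not compile when used: a statement cannot be an operand of ','  */
/* #define setup_bad(p)	( (p)->writer = 0, init_as_macro(&(p)->lock) )   */

/* the fix, as in the patch below: make the wrapper a statement block too */
#define setup_fixed(p)	do { (p)->writer = 0; init_as_macro(&(p)->lock); } while (0)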

Signed-off-by: Ingo Molnar <[email protected]>
Signed-off-by: Arjan van de Ven <[email protected]>
---
fs/xfs/linux-2.6/mrlock.h | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)

Index: linux/fs/xfs/linux-2.6/mrlock.h
===================================================================
--- linux.orig/fs/xfs/linux-2.6/mrlock.h
+++ linux/fs/xfs/linux-2.6/mrlock.h
@@ -28,7 +28,7 @@ typedef struct {
} mrlock_t;

#define mrinit(mrp, name) \
- ( (mrp)->mr_writer = 0, init_rwsem(&(mrp)->mr_lock) )
+ do { (mrp)->mr_writer = 0; init_rwsem(&(mrp)->mr_lock); } while (0)
#define mrlock_init(mrp, t,n,s) mrinit(mrp, n)
#define mrfree(mrp) do { } while (0)
#define mraccess(mrp) mraccessf(mrp, 0)

2006-05-29 21:45:33

by Ingo Molnar

[permalink] [raw]
Subject: [patch 12/61] lock validator: beautify x86_64 stacktraces

From: Ingo Molnar <[email protected]>

beautify x86_64 stacktraces: print one stack frame per line, with hex
offsets and symbol sizes, to make the output more readable.
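
The visible change is in printk_address() and show_trace(): one stack
frame per line, with a hex offset and the symbol size, instead of the
old run-on brace format. A compilable sketch of the two format strings
(lifted from the diff below; the sample address, symbol and sizes in
the comments are invented for illustration):

#include <linux/kernel.h>

static void sample_frame(unsigned long address, char *delim, char *modname,
			 const char *symname, unsigned long offset,
			 unsigned long symsize)
{
	/* old: frames packed onto one wrapped line, decimal offset */
	printk("<%016lx>{%s%s%s%s%+ld}",
		address, delim, modname, delim, symname, offset);
	/*  -> <ffffffff8012a4b0>{try_to_wake_up+339} */

	/* new: one frame per line, hex offset plus total symbol size */
	printk(" [<%016lx>] %s%s%s%s+0x%lx/0x%lx",
		address, delim, modname, delim, symname, offset, symsize);
	/*  ->  [<ffffffff8012a4b0>] try_to_wake_up+0x153/0x1f0 */
}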

Signed-off-by: Ingo Molnar <[email protected]>
Signed-off-by: Arjan van de Ven <[email protected]>
---
arch/x86_64/kernel/traps.c | 55 ++++++++++++++++++++------------------------
include/asm-x86_64/kdebug.h | 2 -
2 files changed, 27 insertions(+), 30 deletions(-)

Index: linux/arch/x86_64/kernel/traps.c
===================================================================
--- linux.orig/arch/x86_64/kernel/traps.c
+++ linux/arch/x86_64/kernel/traps.c
@@ -108,28 +108,30 @@ static inline void preempt_conditional_c
static int kstack_depth_to_print = 10;

#ifdef CONFIG_KALLSYMS
-#include <linux/kallsyms.h>
-int printk_address(unsigned long address)
-{
+# include <linux/kallsyms.h>
+void printk_address(unsigned long address)
+{
unsigned long offset = 0, symsize;
const char *symname;
char *modname;
- char *delim = ":";
+ char *delim = ":";
char namebuf[128];

- symname = kallsyms_lookup(address, &symsize, &offset, &modname, namebuf);
- if (!symname)
- return printk("[<%016lx>]", address);
- if (!modname)
+ symname = kallsyms_lookup(address, &symsize, &offset, &modname, namebuf);
+ if (!symname) {
+ printk(" [<%016lx>]", address);
+ return;
+ }
+ if (!modname)
modname = delim = "";
- return printk("<%016lx>{%s%s%s%s%+ld}",
- address, delim, modname, delim, symname, offset);
-}
+ printk(" [<%016lx>] %s%s%s%s+0x%lx/0x%lx",
+ address, delim, modname, delim, symname, offset, symsize);
+}
#else
-int printk_address(unsigned long address)
-{
- return printk("[<%016lx>]", address);
-}
+void printk_address(unsigned long address)
+{
+ printk(" [<%016lx>]", address);
+}
#endif

static unsigned long *in_exception_stack(unsigned cpu, unsigned long stack,
@@ -200,21 +202,14 @@ void show_trace(unsigned long *stack)
{
const unsigned cpu = safe_smp_processor_id();
unsigned long *irqstack_end = (unsigned long *)cpu_pda(cpu)->irqstackptr;
- int i;
unsigned used = 0;

- printk("\nCall Trace:");
+ printk("\nCall Trace:\n");

#define HANDLE_STACK(cond) \
do while (cond) { \
unsigned long addr = *stack++; \
if (kernel_text_address(addr)) { \
- if (i > 50) { \
- printk("\n "); \
- i = 0; \
- } \
- else \
- i += printk(" "); \
/* \
* If the address is either in the text segment of the \
* kernel, or in the region which contains vmalloc'ed \
@@ -223,20 +218,21 @@ void show_trace(unsigned long *stack)
* down the cause of the crash will be able to figure \
* out the call path that was taken. \
*/ \
- i += printk_address(addr); \
+ printk_address(addr); \
+ printk("\n"); \
} \
} while (0)

- for(i = 11; ; ) {
+ for ( ; ; ) {
const char *id;
unsigned long *estack_end;
estack_end = in_exception_stack(cpu, (unsigned long)stack,
&used, &id);

if (estack_end) {
- i += printk(" <%s>", id);
+ printk(" <%s>", id);
HANDLE_STACK (stack < estack_end);
- i += printk(" <EOE>");
+ printk(" <EOE>");
stack = (unsigned long *) estack_end[-2];
continue;
}
@@ -246,11 +242,11 @@ void show_trace(unsigned long *stack)
(IRQSTACKSIZE - 64) / sizeof(*irqstack);

if (stack >= irqstack && stack < irqstack_end) {
- i += printk(" <IRQ>");
+ printk(" <IRQ>");
HANDLE_STACK (stack < irqstack_end);
stack = (unsigned long *) (irqstack_end[-1]);
irqstack_end = NULL;
- i += printk(" <EOI>");
+ printk(" <EOI>");
continue;
}
}
@@ -259,6 +255,7 @@ void show_trace(unsigned long *stack)

HANDLE_STACK (((long) stack & (THREAD_SIZE-1)) != 0);
#undef HANDLE_STACK
+
printk("\n");
}

Index: linux/include/asm-x86_64/kdebug.h
===================================================================
--- linux.orig/include/asm-x86_64/kdebug.h
+++ linux/include/asm-x86_64/kdebug.h
@@ -49,7 +49,7 @@ static inline int notify_die(enum die_va
return atomic_notifier_call_chain(&die_chain, val, &args);
}

-extern int printk_address(unsigned long address);
+extern void printk_address(unsigned long address);
extern void die(const char *,struct pt_regs *,long);
extern void __die(const char *,struct pt_regs *,long);
extern void show_registers(struct pt_regs *regs);

2006-05-29 21:44:53

by Ingo Molnar

[permalink] [raw]
Subject: [patch 08/61] lock validator: locking API self-tests

From: Ingo Molnar <[email protected]>

introduce DEBUG_LOCKING_API_SELFTESTS, which uses the generic lock
debugging code's silent-failure feature to run a matrix of testcases.
There are 210 testcases currently:

------------------------
| Locking API testsuite:
----------------------------------------------------------------------------
| spin |wlock |rlock |mutex | wsem | rsem |
--------------------------------------------------------------------------
A-A deadlock: ok | ok | ok | ok | ok | ok |
A-B-B-A deadlock: ok | ok | ok | ok | ok | ok |
A-B-B-C-C-A deadlock: ok | ok | ok | ok | ok | ok |
A-B-C-A-B-C deadlock: ok | ok | ok | ok | ok | ok |
A-B-B-C-C-D-D-A deadlock: ok | ok | ok | ok | ok | ok |
A-B-C-D-B-D-D-A deadlock: ok | ok | ok | ok | ok | ok |
A-B-C-D-B-C-D-A deadlock: ok | ok | ok | ok | ok | ok |
double unlock: ok | ok | ok | ok | ok | ok |
bad unlock order: ok | ok | ok | ok | ok | ok |
--------------------------------------------------------------------------
recursive read-lock: | ok | | ok |
--------------------------------------------------------------------------
non-nested unlock: ok | ok | ok | ok |
------------------------------------------------------------
hard-irqs-on + irq-safe-A/12: ok | ok | ok |
soft-irqs-on + irq-safe-A/12: ok | ok | ok |
hard-irqs-on + irq-safe-A/21: ok | ok | ok |
soft-irqs-on + irq-safe-A/21: ok | ok | ok |
sirq-safe-A => hirqs-on/12: ok | ok | ok |
sirq-safe-A => hirqs-on/21: ok | ok | ok |
hard-safe-A + irqs-on/12: ok | ok | ok |
soft-safe-A + irqs-on/12: ok | ok | ok |
hard-safe-A + irqs-on/21: ok | ok | ok |
soft-safe-A + irqs-on/21: ok | ok | ok |
hard-safe-A + unsafe-B #1/123: ok | ok | ok |
soft-safe-A + unsafe-B #1/123: ok | ok | ok |
hard-safe-A + unsafe-B #1/132: ok | ok | ok |
soft-safe-A + unsafe-B #1/132: ok | ok | ok |
hard-safe-A + unsafe-B #1/213: ok | ok | ok |
soft-safe-A + unsafe-B #1/213: ok | ok | ok |
hard-safe-A + unsafe-B #1/231: ok | ok | ok |
soft-safe-A + unsafe-B #1/231: ok | ok | ok |
hard-safe-A + unsafe-B #1/312: ok | ok | ok |
soft-safe-A + unsafe-B #1/312: ok | ok | ok |
hard-safe-A + unsafe-B #1/321: ok | ok | ok |
soft-safe-A + unsafe-B #1/321: ok | ok | ok |
hard-safe-A + unsafe-B #2/123: ok | ok | ok |
soft-safe-A + unsafe-B #2/123: ok | ok | ok |
hard-safe-A + unsafe-B #2/132: ok | ok | ok |
soft-safe-A + unsafe-B #2/132: ok | ok | ok |
hard-safe-A + unsafe-B #2/213: ok | ok | ok |
soft-safe-A + unsafe-B #2/213: ok | ok | ok |
hard-safe-A + unsafe-B #2/231: ok | ok | ok |
soft-safe-A + unsafe-B #2/231: ok | ok | ok |
hard-safe-A + unsafe-B #2/312: ok | ok | ok |
soft-safe-A + unsafe-B #2/312: ok | ok | ok |
hard-safe-A + unsafe-B #2/321: ok | ok | ok |
soft-safe-A + unsafe-B #2/321: ok | ok | ok |
hard-irq lock-inversion/123: ok | ok | ok |
soft-irq lock-inversion/123: ok | ok | ok |
hard-irq lock-inversion/132: ok | ok | ok |
soft-irq lock-inversion/132: ok | ok | ok |
hard-irq lock-inversion/213: ok | ok | ok |
soft-irq lock-inversion/213: ok | ok | ok |
hard-irq lock-inversion/231: ok | ok | ok |
soft-irq lock-inversion/231: ok | ok | ok |
hard-irq lock-inversion/312: ok | ok | ok |
soft-irq lock-inversion/312: ok | ok | ok |
hard-irq lock-inversion/321: ok | ok | ok |
soft-irq lock-inversion/321: ok | ok | ok |
hard-irq read-recursion/123: ok |
soft-irq read-recursion/123: ok |
hard-irq read-recursion/132: ok |
soft-irq read-recursion/132: ok |
hard-irq read-recursion/213: ok |
soft-irq read-recursion/213: ok |
hard-irq read-recursion/231: ok |
soft-irq read-recursion/231: ok |
hard-irq read-recursion/312: ok |
soft-irq read-recursion/312: ok |
hard-irq read-recursion/321: ok |
soft-irq read-recursion/321: ok |
-------------------------------------------------------
Good, all 210 testcases passed! |
---------------------------------
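
For a feel of what a single cell in the matrix exercises, here is a
standalone sketch of the simplest inversion, the A-B-B-A case for
spinlocks (the in-tree harness below generates all the variants via the
LOCK/UNLOCK macro headers; this is just the shape of the bug it feeds
to the validator, and the two contexts are hypothetical):

#include <linux/spinlock.h>

static DEFINE_SPINLOCK(lock_A);
static DEFINE_SPINLOCK(lock_B);

static void context_1(void)
{
	spin_lock(&lock_A);		/* records the dependency A -> B */
	spin_lock(&lock_B);
	spin_unlock(&lock_B);
	spin_unlock(&lock_A);
}

static void context_2(void)
{
	spin_lock(&lock_B);		/* records B -> A: together with the
					 * above this forms a cycle, which the
					 * validator reports even though no
					 * real deadlock happened here */
	spin_lock(&lock_A);
	spin_unlock(&lock_A);
	spin_unlock(&lock_B);
}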

Signed-off-by: Ingo Molnar <[email protected]>
Signed-off-by: Arjan van de Ven <[email protected]>
---
Documentation/kernel-parameters.txt | 9
lib/Kconfig.debug | 12
lib/Makefile | 1
lib/locking-selftest-hardirq.h | 9
lib/locking-selftest-mutex.h | 5
lib/locking-selftest-rlock-hardirq.h | 2
lib/locking-selftest-rlock-softirq.h | 2
lib/locking-selftest-rlock.h | 5
lib/locking-selftest-rsem.h | 5
lib/locking-selftest-softirq.h | 9
lib/locking-selftest-spin-hardirq.h | 2
lib/locking-selftest-spin-softirq.h | 2
lib/locking-selftest-spin.h | 5
lib/locking-selftest-wlock-hardirq.h | 2
lib/locking-selftest-wlock-softirq.h | 2
lib/locking-selftest-wlock.h | 5
lib/locking-selftest-wsem.h | 5
lib/locking-selftest.c | 1168 +++++++++++++++++++++++++++++++++++
18 files changed, 1250 insertions(+)

Index: linux/Documentation/kernel-parameters.txt
===================================================================
--- linux.orig/Documentation/kernel-parameters.txt
+++ linux/Documentation/kernel-parameters.txt
@@ -436,6 +436,15 @@ running once the system is up.

debug [KNL] Enable kernel debugging (events log level).

+ debug_locks_verbose=
+ [KNL] verbose self-tests
+ Format=<0|1>
+ Print debugging info while doing the locking API
+ self-tests.
+ We default to 0 (no extra messages), setting it to
+ 1 will print _a lot_ more information - normally
+ only useful to kernel developers.
+
decnet= [HW,NET]
Format: <area>[,<node>]
See also Documentation/networking/decnet.txt.
Index: linux/lib/Kconfig.debug
===================================================================
--- linux.orig/lib/Kconfig.debug
+++ linux/lib/Kconfig.debug
@@ -191,6 +191,18 @@ config DEBUG_SPINLOCK_SLEEP
If you say Y here, various routines which may sleep will become very
noisy if they are called with a spinlock held.

+config DEBUG_LOCKING_API_SELFTESTS
+ bool "Locking API boot-time self-tests"
+ depends on DEBUG_KERNEL
+ default y
+ help
+ Say Y here if you want the kernel to run a short self-test during
+ bootup. The self-test checks whether common types of locking bugs
+ are detected by debugging mechanisms or not. (if you disable
+ lock debugging then those bugs wont be detected of course.)
+ The following locking APIs are covered: spinlocks, rwlocks,
+ mutexes and rwsems.
+
config DEBUG_KOBJECT
bool "kobject debugging"
depends on DEBUG_KERNEL
Index: linux/lib/Makefile
===================================================================
--- linux.orig/lib/Makefile
+++ linux/lib/Makefile
@@ -18,6 +18,7 @@ CFLAGS_kobject.o += -DDEBUG
CFLAGS_kobject_uevent.o += -DDEBUG
endif

+obj-$(CONFIG_DEBUG_LOCKING_API_SELFTESTS) += locking-selftest.o
obj-$(CONFIG_DEBUG_SPINLOCK) += spinlock_debug.o
lib-$(CONFIG_RWSEM_GENERIC_SPINLOCK) += rwsem-spinlock.o
lib-$(CONFIG_RWSEM_XCHGADD_ALGORITHM) += rwsem.o
Index: linux/lib/locking-selftest-hardirq.h
===================================================================
--- /dev/null
+++ linux/lib/locking-selftest-hardirq.h
@@ -0,0 +1,9 @@
+#undef IRQ_DISABLE
+#undef IRQ_ENABLE
+#undef IRQ_ENTER
+#undef IRQ_EXIT
+
+#define IRQ_ENABLE HARDIRQ_ENABLE
+#define IRQ_DISABLE HARDIRQ_DISABLE
+#define IRQ_ENTER HARDIRQ_ENTER
+#define IRQ_EXIT HARDIRQ_EXIT
Index: linux/lib/locking-selftest-mutex.h
===================================================================
--- /dev/null
+++ linux/lib/locking-selftest-mutex.h
@@ -0,0 +1,5 @@
+#undef LOCK
+#define LOCK ML
+
+#undef UNLOCK
+#define UNLOCK MU
Index: linux/lib/locking-selftest-rlock-hardirq.h
===================================================================
--- /dev/null
+++ linux/lib/locking-selftest-rlock-hardirq.h
@@ -0,0 +1,2 @@
+#include "locking-selftest-rlock.h"
+#include "locking-selftest-hardirq.h"
Index: linux/lib/locking-selftest-rlock-softirq.h
===================================================================
--- /dev/null
+++ linux/lib/locking-selftest-rlock-softirq.h
@@ -0,0 +1,2 @@
+#include "locking-selftest-rlock.h"
+#include "locking-selftest-softirq.h"
Index: linux/lib/locking-selftest-rlock.h
===================================================================
--- /dev/null
+++ linux/lib/locking-selftest-rlock.h
@@ -0,0 +1,5 @@
+#undef LOCK
+#define LOCK RL
+
+#undef UNLOCK
+#define UNLOCK RU
Index: linux/lib/locking-selftest-rsem.h
===================================================================
--- /dev/null
+++ linux/lib/locking-selftest-rsem.h
@@ -0,0 +1,5 @@
+#undef LOCK
+#define LOCK RSL
+
+#undef UNLOCK
+#define UNLOCK RSU
Index: linux/lib/locking-selftest-softirq.h
===================================================================
--- /dev/null
+++ linux/lib/locking-selftest-softirq.h
@@ -0,0 +1,9 @@
+#undef IRQ_DISABLE
+#undef IRQ_ENABLE
+#undef IRQ_ENTER
+#undef IRQ_EXIT
+
+#define IRQ_DISABLE SOFTIRQ_DISABLE
+#define IRQ_ENABLE SOFTIRQ_ENABLE
+#define IRQ_ENTER SOFTIRQ_ENTER
+#define IRQ_EXIT SOFTIRQ_EXIT
Index: linux/lib/locking-selftest-spin-hardirq.h
===================================================================
--- /dev/null
+++ linux/lib/locking-selftest-spin-hardirq.h
@@ -0,0 +1,2 @@
+#include "locking-selftest-spin.h"
+#include "locking-selftest-hardirq.h"
Index: linux/lib/locking-selftest-spin-softirq.h
===================================================================
--- /dev/null
+++ linux/lib/locking-selftest-spin-softirq.h
@@ -0,0 +1,2 @@
+#include "locking-selftest-spin.h"
+#include "locking-selftest-softirq.h"
Index: linux/lib/locking-selftest-spin.h
===================================================================
--- /dev/null
+++ linux/lib/locking-selftest-spin.h
@@ -0,0 +1,5 @@
+#undef LOCK
+#define LOCK L
+
+#undef UNLOCK
+#define UNLOCK U
Index: linux/lib/locking-selftest-wlock-hardirq.h
===================================================================
--- /dev/null
+++ linux/lib/locking-selftest-wlock-hardirq.h
@@ -0,0 +1,2 @@
+#include "locking-selftest-wlock.h"
+#include "locking-selftest-hardirq.h"
Index: linux/lib/locking-selftest-wlock-softirq.h
===================================================================
--- /dev/null
+++ linux/lib/locking-selftest-wlock-softirq.h
@@ -0,0 +1,2 @@
+#include "locking-selftest-wlock.h"
+#include "locking-selftest-softirq.h"
Index: linux/lib/locking-selftest-wlock.h
===================================================================
--- /dev/null
+++ linux/lib/locking-selftest-wlock.h
@@ -0,0 +1,5 @@
+#undef LOCK
+#define LOCK WL
+
+#undef UNLOCK
+#define UNLOCK WU
Index: linux/lib/locking-selftest-wsem.h
===================================================================
--- /dev/null
+++ linux/lib/locking-selftest-wsem.h
@@ -0,0 +1,5 @@
+#undef LOCK
+#define LOCK WSL
+
+#undef UNLOCK
+#define UNLOCK WSU
Index: linux/lib/locking-selftest.c
===================================================================
--- /dev/null
+++ linux/lib/locking-selftest.c
@@ -0,0 +1,1168 @@
+/*
+ * lib/locking-selftest.c
+ *
+ * Testsuite for various locking APIs: spinlocks, rwlocks,
+ * mutexes and rw-semaphores.
+ *
+ * It checks both false positives and false negatives.
+ *
+ * Started by Ingo Molnar:
+ *
+ * Copyright (C) 2006 Red Hat, Inc., Ingo Molnar <[email protected]>
+ */
+#include <linux/rwsem.h>
+#include <linux/mutex.h>
+#include <linux/sched.h>
+#include <linux/delay.h>
+#include <linux/module.h>
+#include <linux/spinlock.h>
+#include <linux/kallsyms.h>
+#include <linux/interrupt.h>
+#include <linux/debug_locks.h>
+
+/*
+ * Change this to 1 if you want to see the failure printouts:
+ */
+static unsigned int debug_locks_verbose;
+
+static int __init setup_debug_locks_verbose(char *str)
+{
+ get_option(&str, &debug_locks_verbose);
+
+ return 1;
+}
+
+__setup("debug_locks_verbose=", setup_debug_locks_verbose);
+
+#define FAILURE 0
+#define SUCCESS 1
+
+enum {
+ LOCKTYPE_SPIN,
+ LOCKTYPE_RWLOCK,
+ LOCKTYPE_MUTEX,
+ LOCKTYPE_RWSEM,
+};
+
+/*
+ * Normal standalone locks, for the circular and irq-context
+ * dependency tests:
+ */
+static DEFINE_SPINLOCK(lock_A);
+static DEFINE_SPINLOCK(lock_B);
+static DEFINE_SPINLOCK(lock_C);
+static DEFINE_SPINLOCK(lock_D);
+
+static DEFINE_RWLOCK(rwlock_A);
+static DEFINE_RWLOCK(rwlock_B);
+static DEFINE_RWLOCK(rwlock_C);
+static DEFINE_RWLOCK(rwlock_D);
+
+static DEFINE_MUTEX(mutex_A);
+static DEFINE_MUTEX(mutex_B);
+static DEFINE_MUTEX(mutex_C);
+static DEFINE_MUTEX(mutex_D);
+
+static DECLARE_RWSEM(rwsem_A);
+static DECLARE_RWSEM(rwsem_B);
+static DECLARE_RWSEM(rwsem_C);
+static DECLARE_RWSEM(rwsem_D);
+
+/*
+ * Locks that we initialize dynamically as well so that
+ * e.g. X1 and X2 become two instances of the same type,
+ * but X* and Y* are different types. We do this so that
+ * we do not trigger a real lockup:
+ */
+static DEFINE_SPINLOCK(lock_X1);
+static DEFINE_SPINLOCK(lock_X2);
+static DEFINE_SPINLOCK(lock_Y1);
+static DEFINE_SPINLOCK(lock_Y2);
+static DEFINE_SPINLOCK(lock_Z1);
+static DEFINE_SPINLOCK(lock_Z2);
+
+static DEFINE_RWLOCK(rwlock_X1);
+static DEFINE_RWLOCK(rwlock_X2);
+static DEFINE_RWLOCK(rwlock_Y1);
+static DEFINE_RWLOCK(rwlock_Y2);
+static DEFINE_RWLOCK(rwlock_Z1);
+static DEFINE_RWLOCK(rwlock_Z2);
+
+static DEFINE_MUTEX(mutex_X1);
+static DEFINE_MUTEX(mutex_X2);
+static DEFINE_MUTEX(mutex_Y1);
+static DEFINE_MUTEX(mutex_Y2);
+static DEFINE_MUTEX(mutex_Z1);
+static DEFINE_MUTEX(mutex_Z2);
+
+static DECLARE_RWSEM(rwsem_X1);
+static DECLARE_RWSEM(rwsem_X2);
+static DECLARE_RWSEM(rwsem_Y1);
+static DECLARE_RWSEM(rwsem_Y2);
+static DECLARE_RWSEM(rwsem_Z1);
+static DECLARE_RWSEM(rwsem_Z2);
+
+/*
+ * non-inlined runtime initializers, to let separate locks share
+ * the same lock-type:
+ */
+#define INIT_TYPE_FUNC(type) \
+static noinline void \
+init_type_##type(spinlock_t *lock, rwlock_t *rwlock, struct mutex *mutex, \
+ struct rw_semaphore *rwsem) \
+{ \
+ spin_lock_init(lock); \
+ rwlock_init(rwlock); \
+ mutex_init(mutex); \
+ init_rwsem(rwsem); \
+}
+
+INIT_TYPE_FUNC(X)
+INIT_TYPE_FUNC(Y)
+INIT_TYPE_FUNC(Z)
+
+static void init_shared_types(void)
+{
+ init_type_X(&lock_X1, &rwlock_X1, &mutex_X1, &rwsem_X1);
+ init_type_X(&lock_X2, &rwlock_X2, &mutex_X2, &rwsem_X2);
+
+ init_type_Y(&lock_Y1, &rwlock_Y1, &mutex_Y1, &rwsem_Y1);
+ init_type_Y(&lock_Y2, &rwlock_Y2, &mutex_Y2, &rwsem_Y2);
+
+ init_type_Z(&lock_Z1, &rwlock_Z1, &mutex_Z1, &rwsem_Z1);
+ init_type_Z(&lock_Z2, &rwlock_Z2, &mutex_Z2, &rwsem_Z2);
+}
+
+/*
+ * For spinlocks and rwlocks we also do hardirq-safe / softirq-safe tests.
+ * The following functions use a lock from a simulated hardirq/softirq
+ * context, causing the locks to be marked as hardirq-safe/softirq-safe:
+ */
+
+#define HARDIRQ_DISABLE local_irq_disable
+#define HARDIRQ_ENABLE local_irq_enable
+
+#define HARDIRQ_ENTER() \
+ local_irq_disable(); \
+ nmi_enter(); \
+ WARN_ON(!in_irq());
+
+#define HARDIRQ_EXIT() \
+ nmi_exit(); \
+ local_irq_enable();
+
+#define SOFTIRQ_DISABLE local_bh_disable
+#define SOFTIRQ_ENABLE local_bh_enable
+
+#define SOFTIRQ_ENTER() \
+ local_bh_disable(); \
+ local_irq_disable(); \
+ WARN_ON(!in_softirq());
+
+#define SOFTIRQ_EXIT() \
+ local_irq_enable(); \
+ local_bh_enable();
+
+/*
+ * Shortcuts for lock/unlock API variants, to keep
+ * the testcases compact:
+ */
+#define L(x) spin_lock(&lock_##x)
+#define U(x) spin_unlock(&lock_##x)
+#define UNN(x) spin_unlock_non_nested(&lock_##x)
+#define LU(x) L(x); U(x)
+
+#define WL(x) write_lock(&rwlock_##x)
+#define WU(x) write_unlock(&rwlock_##x)
+#define WLU(x) WL(x); WU(x)
+
+#define RL(x) read_lock(&rwlock_##x)
+#define RU(x) read_unlock(&rwlock_##x)
+#define RUNN(x) read_unlock_non_nested(&rwlock_##x)
+#define RLU(x) RL(x); RU(x)
+
+#define ML(x) mutex_lock(&mutex_##x)
+#define MU(x) mutex_unlock(&mutex_##x)
+#define MUNN(x) mutex_unlock_non_nested(&mutex_##x)
+
+#define WSL(x) down_write(&rwsem_##x)
+#define WSU(x) up_write(&rwsem_##x)
+
+#define RSL(x) down_read(&rwsem_##x)
+#define RSU(x) up_read(&rwsem_##x)
+#define RSUNN(x) up_read_non_nested(&rwsem_##x)
+
+#define LOCK_UNLOCK_2(x,y) LOCK(x); LOCK(y); UNLOCK(y); UNLOCK(x)
+
+/*
+ * Generate different permutations of the same testcase, using
+ * the same basic lock-dependency/state events:
+ */
+
+#define GENERATE_TESTCASE(name) \
+ \
+static void name(void) { E(); }
+
+#define GENERATE_PERMUTATIONS_2_EVENTS(name) \
+ \
+static void name##_12(void) { E1(); E2(); } \
+static void name##_21(void) { E2(); E1(); }
+
+#define GENERATE_PERMUTATIONS_3_EVENTS(name) \
+ \
+static void name##_123(void) { E1(); E2(); E3(); } \
+static void name##_132(void) { E1(); E3(); E2(); } \
+static void name##_213(void) { E2(); E1(); E3(); } \
+static void name##_231(void) { E2(); E3(); E1(); } \
+static void name##_312(void) { E3(); E1(); E2(); } \
+static void name##_321(void) { E3(); E2(); E1(); }
+
+/*
+ * AA deadlock:
+ */
+
+#define E() \
+ \
+ LOCK(X1); \
+ LOCK(X2); /* this one should fail */ \
+ UNLOCK(X2); \
+ UNLOCK(X1);
+
+/*
+ * 6 testcases:
+ */
+#include "locking-selftest-spin.h"
+GENERATE_TESTCASE(AA_spin)
+#include "locking-selftest-wlock.h"
+GENERATE_TESTCASE(AA_wlock)
+#include "locking-selftest-rlock.h"
+GENERATE_TESTCASE(AA_rlock)
+#include "locking-selftest-mutex.h"
+GENERATE_TESTCASE(AA_mutex)
+#include "locking-selftest-wsem.h"
+GENERATE_TESTCASE(AA_wsem)
+#include "locking-selftest-rsem.h"
+GENERATE_TESTCASE(AA_rsem)
+
+#undef E
+
+/*
+ * Special case for read-locking: read-locks are
+ * allowed to recurse on the same lock instance:
+ */
+static void rlock_AA1(void)
+{
+ RL(X1);
+ RL(X1); // this one should NOT fail
+ RU(X1);
+ RU(X1);
+}
+
+static void rsem_AA1(void)
+{
+ RSL(X1);
+ RSL(X1); // this one should fail
+ RSU(X1);
+ RSU(X1);
+}
+
+/*
+ * ABBA deadlock:
+ */
+
+#define E() \
+ \
+ LOCK_UNLOCK_2(A, B); \
+ LOCK_UNLOCK_2(B, A); /* fail */
+
+/*
+ * 6 testcases:
+ */
+#include "locking-selftest-spin.h"
+GENERATE_TESTCASE(ABBA_spin)
+#include "locking-selftest-wlock.h"
+GENERATE_TESTCASE(ABBA_wlock)
+#include "locking-selftest-rlock.h"
+GENERATE_TESTCASE(ABBA_rlock)
+#include "locking-selftest-mutex.h"
+GENERATE_TESTCASE(ABBA_mutex)
+#include "locking-selftest-wsem.h"
+GENERATE_TESTCASE(ABBA_wsem)
+#include "locking-selftest-rsem.h"
+GENERATE_TESTCASE(ABBA_rsem)
+
+#undef E
+
+/*
+ * AB BC CA deadlock:
+ */
+
+#define E() \
+ \
+ LOCK_UNLOCK_2(A, B); \
+ LOCK_UNLOCK_2(B, C); \
+ LOCK_UNLOCK_2(C, A); /* fail */
+
+/*
+ * 6 testcases:
+ */
+#include "locking-selftest-spin.h"
+GENERATE_TESTCASE(ABBCCA_spin)
+#include "locking-selftest-wlock.h"
+GENERATE_TESTCASE(ABBCCA_wlock)
+#include "locking-selftest-rlock.h"
+GENERATE_TESTCASE(ABBCCA_rlock)
+#include "locking-selftest-mutex.h"
+GENERATE_TESTCASE(ABBCCA_mutex)
+#include "locking-selftest-wsem.h"
+GENERATE_TESTCASE(ABBCCA_wsem)
+#include "locking-selftest-rsem.h"
+GENERATE_TESTCASE(ABBCCA_rsem)
+
+#undef E
+
+/*
+ * AB CA BC deadlock:
+ */
+
+#define E() \
+ \
+ LOCK_UNLOCK_2(A, B); \
+ LOCK_UNLOCK_2(C, A); \
+ LOCK_UNLOCK_2(B, C); /* fail */
+
+/*
+ * 6 testcases:
+ */
+#include "locking-selftest-spin.h"
+GENERATE_TESTCASE(ABCABC_spin)
+#include "locking-selftest-wlock.h"
+GENERATE_TESTCASE(ABCABC_wlock)
+#include "locking-selftest-rlock.h"
+GENERATE_TESTCASE(ABCABC_rlock)
+#include "locking-selftest-mutex.h"
+GENERATE_TESTCASE(ABCABC_mutex)
+#include "locking-selftest-wsem.h"
+GENERATE_TESTCASE(ABCABC_wsem)
+#include "locking-selftest-rsem.h"
+GENERATE_TESTCASE(ABCABC_rsem)
+
+#undef E
+
+/*
+ * AB BC CD DA deadlock:
+ */
+
+#define E() \
+ \
+ LOCK_UNLOCK_2(A, B); \
+ LOCK_UNLOCK_2(B, C); \
+ LOCK_UNLOCK_2(C, D); \
+ LOCK_UNLOCK_2(D, A); /* fail */
+
+/*
+ * 6 testcases:
+ */
+#include "locking-selftest-spin.h"
+GENERATE_TESTCASE(ABBCCDDA_spin)
+#include "locking-selftest-wlock.h"
+GENERATE_TESTCASE(ABBCCDDA_wlock)
+#include "locking-selftest-rlock.h"
+GENERATE_TESTCASE(ABBCCDDA_rlock)
+#include "locking-selftest-mutex.h"
+GENERATE_TESTCASE(ABBCCDDA_mutex)
+#include "locking-selftest-wsem.h"
+GENERATE_TESTCASE(ABBCCDDA_wsem)
+#include "locking-selftest-rsem.h"
+GENERATE_TESTCASE(ABBCCDDA_rsem)
+
+#undef E
+
+/*
+ * AB CD BD DA deadlock:
+ */
+#define E() \
+ \
+ LOCK_UNLOCK_2(A, B); \
+ LOCK_UNLOCK_2(C, D); \
+ LOCK_UNLOCK_2(B, D); \
+ LOCK_UNLOCK_2(D, A); /* fail */
+
+/*
+ * 6 testcases:
+ */
+#include "locking-selftest-spin.h"
+GENERATE_TESTCASE(ABCDBDDA_spin)
+#include "locking-selftest-wlock.h"
+GENERATE_TESTCASE(ABCDBDDA_wlock)
+#include "locking-selftest-rlock.h"
+GENERATE_TESTCASE(ABCDBDDA_rlock)
+#include "locking-selftest-mutex.h"
+GENERATE_TESTCASE(ABCDBDDA_mutex)
+#include "locking-selftest-wsem.h"
+GENERATE_TESTCASE(ABCDBDDA_wsem)
+#include "locking-selftest-rsem.h"
+GENERATE_TESTCASE(ABCDBDDA_rsem)
+
+#undef E
+
+/*
+ * AB CD BC DA deadlock:
+ */
+#define E() \
+ \
+ LOCK_UNLOCK_2(A, B); \
+ LOCK_UNLOCK_2(C, D); \
+ LOCK_UNLOCK_2(B, C); \
+ LOCK_UNLOCK_2(D, A); /* fail */
+
+/*
+ * 6 testcases:
+ */
+#include "locking-selftest-spin.h"
+GENERATE_TESTCASE(ABCDBCDA_spin)
+#include "locking-selftest-wlock.h"
+GENERATE_TESTCASE(ABCDBCDA_wlock)
+#include "locking-selftest-rlock.h"
+GENERATE_TESTCASE(ABCDBCDA_rlock)
+#include "locking-selftest-mutex.h"
+GENERATE_TESTCASE(ABCDBCDA_mutex)
+#include "locking-selftest-wsem.h"
+GENERATE_TESTCASE(ABCDBCDA_wsem)
+#include "locking-selftest-rsem.h"
+GENERATE_TESTCASE(ABCDBCDA_rsem)
+
+#undef E
+
+/*
+ * Double unlock:
+ */
+#define E() \
+ \
+ LOCK(A); \
+ UNLOCK(A); \
+ UNLOCK(A); /* fail */
+
+/*
+ * 6 testcases:
+ */
+#include "locking-selftest-spin.h"
+GENERATE_TESTCASE(double_unlock_spin)
+#include "locking-selftest-wlock.h"
+GENERATE_TESTCASE(double_unlock_wlock)
+#include "locking-selftest-rlock.h"
+GENERATE_TESTCASE(double_unlock_rlock)
+#include "locking-selftest-mutex.h"
+GENERATE_TESTCASE(double_unlock_mutex)
+#include "locking-selftest-wsem.h"
+GENERATE_TESTCASE(double_unlock_wsem)
+#include "locking-selftest-rsem.h"
+GENERATE_TESTCASE(double_unlock_rsem)
+
+#undef E
+
+/*
+ * Bad unlock ordering:
+ */
+#define E() \
+ \
+ LOCK(A); \
+ LOCK(B); \
+ UNLOCK(A); /* fail */ \
+ UNLOCK(B);
+
+/*
+ * 6 testcases:
+ */
+#include "locking-selftest-spin.h"
+GENERATE_TESTCASE(bad_unlock_order_spin)
+#include "locking-selftest-wlock.h"
+GENERATE_TESTCASE(bad_unlock_order_wlock)
+#include "locking-selftest-rlock.h"
+GENERATE_TESTCASE(bad_unlock_order_rlock)
+#include "locking-selftest-mutex.h"
+GENERATE_TESTCASE(bad_unlock_order_mutex)
+#include "locking-selftest-wsem.h"
+GENERATE_TESTCASE(bad_unlock_order_wsem)
+#include "locking-selftest-rsem.h"
+GENERATE_TESTCASE(bad_unlock_order_rsem)
+
+#undef E
+
+#ifdef CONFIG_LOCKDEP
+/*
+ * bad unlock ordering - but using the _non_nested API,
+ * which must suppress the warning:
+ */
+static void spin_order_nn(void)
+{
+ L(A);
+ L(B);
+ UNN(A); // this one should succeed
+ UNN(B);
+}
+
+static void rlock_order_nn(void)
+{
+ RL(A);
+ RL(B);
+ RUNN(A); // this one should succeed
+ RUNN(B);
+}
+
+static void mutex_order_nn(void)
+{
+ ML(A);
+ ML(B);
+ MUNN(A); // this one should succeed
+ MUNN(B);
+}
+
+static void rsem_order_nn(void)
+{
+ RSL(A);
+ RSL(B);
+ RSUNN(A); // this one should succeed
+ RSUNN(B);
+}
+
+#endif
+
+/*
+ * locking an irq-safe lock with irqs enabled:
+ */
+#define E1() \
+ \
+ IRQ_ENTER(); \
+ LOCK(A); \
+ UNLOCK(A); \
+ IRQ_EXIT();
+
+#define E2() \
+ \
+ LOCK(A); \
+ UNLOCK(A);
+
+/*
+ * Generate 24 testcases:
+ */
+#include "locking-selftest-spin-hardirq.h"
+GENERATE_PERMUTATIONS_2_EVENTS(irqsafe1_hard_spin)
+
+#include "locking-selftest-rlock-hardirq.h"
+GENERATE_PERMUTATIONS_2_EVENTS(irqsafe1_hard_rlock)
+
+#include "locking-selftest-wlock-hardirq.h"
+GENERATE_PERMUTATIONS_2_EVENTS(irqsafe1_hard_wlock)
+
+#include "locking-selftest-spin-softirq.h"
+GENERATE_PERMUTATIONS_2_EVENTS(irqsafe1_soft_spin)
+
+#include "locking-selftest-rlock-softirq.h"
+GENERATE_PERMUTATIONS_2_EVENTS(irqsafe1_soft_rlock)
+
+#include "locking-selftest-wlock-softirq.h"
+GENERATE_PERMUTATIONS_2_EVENTS(irqsafe1_soft_wlock)
+
+#undef E1
+#undef E2
+
+/*
+ * Enabling hardirqs with a softirq-safe lock held:
+ */
+#define E1() \
+ \
+ SOFTIRQ_ENTER(); \
+ LOCK(A); \
+ UNLOCK(A); \
+ SOFTIRQ_EXIT();
+
+#define E2() \
+ \
+ HARDIRQ_DISABLE(); \
+ LOCK(A); \
+ HARDIRQ_ENABLE(); \
+ UNLOCK(A);
+
+/*
+ * Generate 12 testcases:
+ */
+#include "locking-selftest-spin.h"
+GENERATE_PERMUTATIONS_2_EVENTS(irqsafe2A_spin)
+
+#include "locking-selftest-wlock.h"
+GENERATE_PERMUTATIONS_2_EVENTS(irqsafe2A_wlock)
+
+#include "locking-selftest-rlock.h"
+GENERATE_PERMUTATIONS_2_EVENTS(irqsafe2A_rlock)
+
+#undef E1
+#undef E2
+
+/*
+ * Enabling irqs with an irq-safe lock held:
+ */
+#define E1() \
+ \
+ IRQ_ENTER(); \
+ LOCK(A); \
+ UNLOCK(A); \
+ IRQ_EXIT();
+
+#define E2() \
+ \
+ IRQ_DISABLE(); \
+ LOCK(A); \
+ IRQ_ENABLE(); \
+ UNLOCK(A);
+
+/*
+ * Generate 24 testcases:
+ */
+#include "locking-selftest-spin-hardirq.h"
+GENERATE_PERMUTATIONS_2_EVENTS(irqsafe2B_hard_spin)
+
+#include "locking-selftest-rlock-hardirq.h"
+GENERATE_PERMUTATIONS_2_EVENTS(irqsafe2B_hard_rlock)
+
+#include "locking-selftest-wlock-hardirq.h"
+GENERATE_PERMUTATIONS_2_EVENTS(irqsafe2B_hard_wlock)
+
+#include "locking-selftest-spin-softirq.h"
+GENERATE_PERMUTATIONS_2_EVENTS(irqsafe2B_soft_spin)
+
+#include "locking-selftest-rlock-softirq.h"
+GENERATE_PERMUTATIONS_2_EVENTS(irqsafe2B_soft_rlock)
+
+#include "locking-selftest-wlock-softirq.h"
+GENERATE_PERMUTATIONS_2_EVENTS(irqsafe2B_soft_wlock)
+
+#undef E1
+#undef E2
+
+/*
+ * Acquiring an irq-unsafe lock while holding an irq-safe lock:
+ */
+#define E1() \
+ \
+ LOCK(A); \
+ LOCK(B); \
+ UNLOCK(B); \
+ UNLOCK(A); \
+
+#define E2() \
+ \
+ LOCK(B); \
+ UNLOCK(B);
+
+#define E3() \
+ \
+ IRQ_ENTER(); \
+ LOCK(A); \
+ UNLOCK(A); \
+ IRQ_EXIT();
+
+/*
+ * Generate 36 testcases:
+ */
+#include "locking-selftest-spin-hardirq.h"
+GENERATE_PERMUTATIONS_3_EVENTS(irqsafe3_hard_spin)
+
+#include "locking-selftest-rlock-hardirq.h"
+GENERATE_PERMUTATIONS_3_EVENTS(irqsafe3_hard_rlock)
+
+#include "locking-selftest-wlock-hardirq.h"
+GENERATE_PERMUTATIONS_3_EVENTS(irqsafe3_hard_wlock)
+
+#include "locking-selftest-spin-softirq.h"
+GENERATE_PERMUTATIONS_3_EVENTS(irqsafe3_soft_spin)
+
+#include "locking-selftest-rlock-softirq.h"
+GENERATE_PERMUTATIONS_3_EVENTS(irqsafe3_soft_rlock)
+
+#include "locking-selftest-wlock-softirq.h"
+GENERATE_PERMUTATIONS_3_EVENTS(irqsafe3_soft_wlock)
+
+#undef E1
+#undef E2
+#undef E3
+
+/*
+ * If a lock becomes softirq-safe, but earlier it took
+ * a softirq-unsafe lock:
+ */
+
+#define E1() \
+ IRQ_DISABLE(); \
+ LOCK(A); \
+ LOCK(B); \
+ UNLOCK(B); \
+ UNLOCK(A); \
+ IRQ_ENABLE();
+
+#define E2() \
+ LOCK(B); \
+ UNLOCK(B);
+
+#define E3() \
+ IRQ_ENTER(); \
+ LOCK(A); \
+ UNLOCK(A); \
+ IRQ_EXIT();
+
+/*
+ * Generate 36 testcases:
+ */
+#include "locking-selftest-spin-hardirq.h"
+GENERATE_PERMUTATIONS_3_EVENTS(irqsafe4_hard_spin)
+
+#include "locking-selftest-rlock-hardirq.h"
+GENERATE_PERMUTATIONS_3_EVENTS(irqsafe4_hard_rlock)
+
+#include "locking-selftest-wlock-hardirq.h"
+GENERATE_PERMUTATIONS_3_EVENTS(irqsafe4_hard_wlock)
+
+#include "locking-selftest-spin-softirq.h"
+GENERATE_PERMUTATIONS_3_EVENTS(irqsafe4_soft_spin)
+
+#include "locking-selftest-rlock-softirq.h"
+GENERATE_PERMUTATIONS_3_EVENTS(irqsafe4_soft_rlock)
+
+#include "locking-selftest-wlock-softirq.h"
+GENERATE_PERMUTATIONS_3_EVENTS(irqsafe4_soft_wlock)
+
+#undef E1
+#undef E2
+#undef E3
+
+/*
+ * read-lock / write-lock irq inversion.
+ *
+ * Deadlock scenario:
+ *
+ * CPU#1 is at #1, i.e. it has write-locked A, but has not
+ * taken B yet.
+ *
+ * CPU#2 is at #2, i.e. it has locked B.
+ *
+ * Hardirq hits CPU#2 at point #2 and is trying to read-lock A.
+ *
+ * The deadlock occurs because CPU#1 will spin on B, and CPU#2
+ * will spin on A.
+ */
+
+#define E1() \
+ \
+ IRQ_DISABLE(); \
+ WL(A); \
+ LOCK(B); \
+ UNLOCK(B); \
+ WU(A); \
+ IRQ_ENABLE();
+
+#define E2() \
+ \
+ LOCK(B); \
+ UNLOCK(B);
+
+#define E3() \
+ \
+ IRQ_ENTER(); \
+ RL(A); \
+ RU(A); \
+ IRQ_EXIT();
+
+/*
+ * Generate 36 testcases:
+ */
+#include "locking-selftest-spin-hardirq.h"
+GENERATE_PERMUTATIONS_3_EVENTS(irq_inversion_hard_spin)
+
+#include "locking-selftest-rlock-hardirq.h"
+GENERATE_PERMUTATIONS_3_EVENTS(irq_inversion_hard_rlock)
+
+#include "locking-selftest-wlock-hardirq.h"
+GENERATE_PERMUTATIONS_3_EVENTS(irq_inversion_hard_wlock)
+
+#include "locking-selftest-spin-softirq.h"
+GENERATE_PERMUTATIONS_3_EVENTS(irq_inversion_soft_spin)
+
+#include "locking-selftest-rlock-softirq.h"
+GENERATE_PERMUTATIONS_3_EVENTS(irq_inversion_soft_rlock)
+
+#include "locking-selftest-wlock-softirq.h"
+GENERATE_PERMUTATIONS_3_EVENTS(irq_inversion_soft_wlock)
+
+#undef E1
+#undef E2
+#undef E3
+
+/*
+ * read-lock / write-lock recursion that is actually safe.
+ */
+
+#define E1() \
+ \
+ IRQ_DISABLE(); \
+ WL(A); \
+ WU(A); \
+ IRQ_ENABLE();
+
+#define E2() \
+ \
+ RL(A); \
+ RU(A); \
+
+#define E3() \
+ \
+ IRQ_ENTER(); \
+ RL(A); \
+ L(B); \
+ U(B); \
+ RU(A); \
+ IRQ_EXIT();
+
+/*
+ * Generate 12 testcases:
+ */
+#include "locking-selftest-hardirq.h"
+GENERATE_PERMUTATIONS_3_EVENTS(irq_read_recursion_hard)
+
+#include "locking-selftest-softirq.h"
+GENERATE_PERMUTATIONS_3_EVENTS(irq_read_recursion_soft)
+
+#undef E1
+#undef E2
+#undef E3
+
+/*
+ * read-lock / write-lock recursion that is unsafe.
+ */
+
+#define E1() \
+ \
+ IRQ_DISABLE(); \
+ L(B); \
+ WL(A); \
+ WU(A); \
+ U(B); \
+ IRQ_ENABLE();
+
+#define E2() \
+ \
+ RL(A); \
+ RU(A); \
+
+#define E3() \
+ \
+ IRQ_ENTER(); \
+ L(B); \
+ U(B); \
+ IRQ_EXIT();
+
+/*
+ * Generate 12 testcases:
+ */
+#include "locking-selftest-hardirq.h"
+// GENERATE_PERMUTATIONS_3_EVENTS(irq_read_recursion2_hard)
+
+#include "locking-selftest-softirq.h"
+// GENERATE_PERMUTATIONS_3_EVENTS(irq_read_recursion2_soft)
+
+#define lockdep_reset()
+#define lockdep_reset_lock(x)
+
+#ifdef CONFIG_PROVE_SPIN_LOCKING
+# define I_SPINLOCK(x) lockdep_reset_lock(&lock_##x.dep_map)
+#else
+# define I_SPINLOCK(x)
+#endif
+
+#ifdef CONFIG_PROVE_RW_LOCKING
+# define I_RWLOCK(x) lockdep_reset_lock(&rwlock_##x.dep_map)
+#else
+# define I_RWLOCK(x)
+#endif
+
+#ifdef CONFIG_PROVE_MUTEX_LOCKING
+# define I_MUTEX(x) lockdep_reset_lock(&mutex_##x.dep_map)
+#else
+# define I_MUTEX(x)
+#endif
+
+#ifdef CONFIG_PROVE_RWSEM_LOCKING
+# define I_RWSEM(x) lockdep_reset_lock(&rwsem_##x.dep_map)
+#else
+# define I_RWSEM(x)
+#endif
+
+#define I1(x) \
+ do { \
+ I_SPINLOCK(x); \
+ I_RWLOCK(x); \
+ I_MUTEX(x); \
+ I_RWSEM(x); \
+ } while (0)
+
+#define I2(x) \
+ do { \
+ spin_lock_init(&lock_##x); \
+ rwlock_init(&rwlock_##x); \
+ mutex_init(&mutex_##x); \
+ init_rwsem(&rwsem_##x); \
+ } while (0)
+
+static void reset_locks(void)
+{
+ local_irq_disable();
+ I1(A); I1(B); I1(C); I1(D);
+ I1(X1); I1(X2); I1(Y1); I1(Y2); I1(Z1); I1(Z2);
+ lockdep_reset();
+ I2(A); I2(B); I2(C); I2(D);
+ init_shared_types();
+ local_irq_enable();
+}
+
+#undef I
+
+static int testcase_total;
+static int testcase_successes;
+static int expected_testcase_failures;
+static int unexpected_testcase_failures;
+
+static void dotest(void (*testcase_fn)(void), int expected, int locktype)
+{
+ unsigned long saved_preempt_count = preempt_count();
+ int unexpected_failure = 0;
+
+ WARN_ON(irqs_disabled());
+
+ testcase_fn();
+#ifdef CONFIG_PROVE_SPIN_LOCKING
+ if (locktype == LOCKTYPE_SPIN && debug_locks != expected)
+ unexpected_failure = 1;
+#endif
+#ifdef CONFIG_PROVE_RW_LOCKING
+ if (locktype == LOCKTYPE_RWLOCK && debug_locks != expected)
+ unexpected_failure = 1;
+#endif
+#ifdef CONFIG_PROVE_MUTEX_LOCKING
+ if (locktype == LOCKTYPE_MUTEX && debug_locks != expected)
+ unexpected_failure = 1;
+#endif
+#ifdef CONFIG_PROVE_RWSEM_LOCKING
+ if (locktype == LOCKTYPE_RWSEM && debug_locks != expected)
+ unexpected_failure = 1;
+#endif
+ if (debug_locks != expected) {
+ if (unexpected_failure) {
+ unexpected_testcase_failures++;
+ printk("FAILED|");
+ } else {
+ expected_testcase_failures++;
+ printk("failed|");
+ }
+ } else {
+ testcase_successes++;
+ printk(" ok |");
+ }
+ testcase_total++;
+
+ /*
+ * Some tests (e.g. double-unlock) might corrupt the preemption
+ * count, so restore it:
+ */
+ preempt_count() = saved_preempt_count;
+#ifdef CONFIG_TRACE_IRQFLAGS
+ if (softirq_count())
+ current->softirqs_enabled = 0;
+ else
+ current->softirqs_enabled = 1;
+#endif
+
+ reset_locks();
+}
+
+static inline void print_testname(const char *testname)
+{
+ printk("%33s:", testname);
+}
+
+#define DO_TESTCASE_1(desc, name, nr) \
+ print_testname(desc"/"#nr); \
+ dotest(name##_##nr, SUCCESS, LOCKTYPE_RWLOCK); \
+ printk("\n");
+
+#define DO_TESTCASE_1B(desc, name, nr) \
+ print_testname(desc"/"#nr); \
+ dotest(name##_##nr, FAILURE, LOCKTYPE_RWLOCK); \
+ printk("\n");
+
+#define DO_TESTCASE_3(desc, name, nr) \
+ print_testname(desc"/"#nr); \
+ dotest(name##_spin_##nr, FAILURE, LOCKTYPE_SPIN); \
+ dotest(name##_wlock_##nr, FAILURE, LOCKTYPE_RWLOCK); \
+ dotest(name##_rlock_##nr, SUCCESS, LOCKTYPE_RWLOCK); \
+ printk("\n");
+
+#define DO_TESTCASE_6(desc, name) \
+ print_testname(desc); \
+ dotest(name##_spin, FAILURE, LOCKTYPE_SPIN); \
+ dotest(name##_wlock, FAILURE, LOCKTYPE_RWLOCK); \
+ dotest(name##_rlock, FAILURE, LOCKTYPE_RWLOCK); \
+ dotest(name##_mutex, FAILURE, LOCKTYPE_MUTEX); \
+ dotest(name##_wsem, FAILURE, LOCKTYPE_RWSEM); \
+ dotest(name##_rsem, FAILURE, LOCKTYPE_RWSEM); \
+ printk("\n");
+
+/*
+ * 'read' variant: rlocks must not trigger.
+ */
+#define DO_TESTCASE_6R(desc, name) \
+ print_testname(desc); \
+ dotest(name##_spin, FAILURE, LOCKTYPE_SPIN); \
+ dotest(name##_wlock, FAILURE, LOCKTYPE_RWLOCK); \
+ dotest(name##_rlock, SUCCESS, LOCKTYPE_RWLOCK); \
+ dotest(name##_mutex, FAILURE, LOCKTYPE_MUTEX); \
+ dotest(name##_wsem, FAILURE, LOCKTYPE_RWSEM); \
+ dotest(name##_rsem, FAILURE, LOCKTYPE_RWSEM); \
+ printk("\n");
+
+#define DO_TESTCASE_2I(desc, name, nr) \
+ DO_TESTCASE_1("hard-"desc, name##_hard, nr); \
+ DO_TESTCASE_1("soft-"desc, name##_soft, nr);
+
+#define DO_TESTCASE_2IB(desc, name, nr) \
+ DO_TESTCASE_1B("hard-"desc, name##_hard, nr); \
+ DO_TESTCASE_1B("soft-"desc, name##_soft, nr);
+
+#define DO_TESTCASE_6I(desc, name, nr) \
+ DO_TESTCASE_3("hard-"desc, name##_hard, nr); \
+ DO_TESTCASE_3("soft-"desc, name##_soft, nr);
+
+#define DO_TESTCASE_2x3(desc, name) \
+ DO_TESTCASE_3(desc, name, 12); \
+ DO_TESTCASE_3(desc, name, 21);
+
+#define DO_TESTCASE_2x6(desc, name) \
+ DO_TESTCASE_6I(desc, name, 12); \
+ DO_TESTCASE_6I(desc, name, 21);
+
+#define DO_TESTCASE_6x2(desc, name) \
+ DO_TESTCASE_2I(desc, name, 123); \
+ DO_TESTCASE_2I(desc, name, 132); \
+ DO_TESTCASE_2I(desc, name, 213); \
+ DO_TESTCASE_2I(desc, name, 231); \
+ DO_TESTCASE_2I(desc, name, 312); \
+ DO_TESTCASE_2I(desc, name, 321);
+
+#define DO_TESTCASE_6x2B(desc, name) \
+ DO_TESTCASE_2IB(desc, name, 123); \
+ DO_TESTCASE_2IB(desc, name, 132); \
+ DO_TESTCASE_2IB(desc, name, 213); \
+ DO_TESTCASE_2IB(desc, name, 231); \
+ DO_TESTCASE_2IB(desc, name, 312); \
+ DO_TESTCASE_2IB(desc, name, 321);
+
+
+#define DO_TESTCASE_6x6(desc, name) \
+ DO_TESTCASE_6I(desc, name, 123); \
+ DO_TESTCASE_6I(desc, name, 132); \
+ DO_TESTCASE_6I(desc, name, 213); \
+ DO_TESTCASE_6I(desc, name, 231); \
+ DO_TESTCASE_6I(desc, name, 312); \
+ DO_TESTCASE_6I(desc, name, 321);
+
+void locking_selftest(void)
+{
+ /*
+ * Got a locking failure before the selftest ran?
+ */
+ if (!debug_locks) {
+ printk("----------------------------------\n");
+ printk("| Locking API testsuite disabled |\n");
+ printk("----------------------------------\n");
+ return;
+ }
+
+ /*
+ * Run the testsuite:
+ */
+ printk("------------------------\n");
+ printk("| Locking API testsuite:\n");
+ printk("----------------------------------------------------------------------------\n");
+ printk(" | spin |wlock |rlock |mutex | wsem | rsem |\n");
+ printk(" --------------------------------------------------------------------------\n");
+
+ init_shared_types();
+ debug_locks_silent = !debug_locks_verbose;
+
+ DO_TESTCASE_6("A-A deadlock", AA);
+ DO_TESTCASE_6R("A-B-B-A deadlock", ABBA);
+ DO_TESTCASE_6R("A-B-B-C-C-A deadlock", ABBCCA);
+ DO_TESTCASE_6R("A-B-C-A-B-C deadlock", ABCABC);
+ DO_TESTCASE_6R("A-B-B-C-C-D-D-A deadlock", ABBCCDDA);
+ DO_TESTCASE_6R("A-B-C-D-B-D-D-A deadlock", ABCDBDDA);
+ DO_TESTCASE_6R("A-B-C-D-B-C-D-A deadlock", ABCDBCDA);
+ DO_TESTCASE_6("double unlock", double_unlock);
+ DO_TESTCASE_6("bad unlock order", bad_unlock_order);
+
+ printk(" --------------------------------------------------------------------------\n");
+ print_testname("recursive read-lock");
+ printk(" |");
+ dotest(rlock_AA1, SUCCESS, LOCKTYPE_RWLOCK);
+ printk(" |");
+ dotest(rsem_AA1, FAILURE, LOCKTYPE_RWSEM);
+ printk("\n");
+
+ printk(" --------------------------------------------------------------------------\n");
+
+#ifdef CONFIG_LOCKDEP
+ print_testname("non-nested unlock");
+ dotest(spin_order_nn, SUCCESS, LOCKTYPE_SPIN);
+ dotest(rlock_order_nn, SUCCESS, LOCKTYPE_RWLOCK);
+ dotest(mutex_order_nn, SUCCESS, LOCKTYPE_MUTEX);
+ dotest(rsem_order_nn, SUCCESS, LOCKTYPE_RWSEM);
+ printk("\n");
+ printk(" ------------------------------------------------------------\n");
+#endif
+ /*
+ * irq-context testcases:
+ */
+ DO_TESTCASE_2x6("irqs-on + irq-safe-A", irqsafe1);
+ DO_TESTCASE_2x3("sirq-safe-A => hirqs-on", irqsafe2A);
+ DO_TESTCASE_2x6("safe-A + irqs-on", irqsafe2B);
+ DO_TESTCASE_6x6("safe-A + unsafe-B #1", irqsafe3);
+ DO_TESTCASE_6x6("safe-A + unsafe-B #2", irqsafe4);
+ DO_TESTCASE_6x6("irq lock-inversion", irq_inversion);
+
+ DO_TESTCASE_6x2("irq read-recursion", irq_read_recursion);
+// DO_TESTCASE_6x2B("irq read-recursion #2", irq_read_recursion2);
+
+ if (unexpected_testcase_failures) {
+ printk("-----------------------------------------------------------------\n");
+ debug_locks = 0;
+ printk("BUG: %3d unexpected failures (out of %3d) - debugging disabled! |\n",
+ unexpected_testcase_failures, testcase_total);
+ printk("-----------------------------------------------------------------\n");
+ } else if (expected_testcase_failures && testcase_successes) {
+ printk("--------------------------------------------------------\n");
+ printk("%3d out of %3d testcases failed, as expected. |\n",
+ expected_testcase_failures, testcase_total);
+ printk("----------------------------------------------------\n");
+ debug_locks = 1;
+ } else if (expected_testcase_failures && !testcase_successes) {
+ printk("--------------------------------------------------------\n");
+ printk("All %3d testcases failed, as expected. |\n",
+ expected_testcase_failures);
+ printk("----------------------------------------\n");
+ debug_locks = 1;
+ } else {
+ printk("-------------------------------------------------------\n");
+ printk("Good, all %3d testcases passed! |\n",
+ testcase_successes);
+ printk("---------------------------------\n");
+ debug_locks = 1;
+ }
+ debug_locks_silent = 0;
+}
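
To illustrate how the header-include trick above turns a single E() scenario
into per-API testcases, here is roughly what the preprocessor produces for the
AA case once locking-selftest-spin.h and locking-selftest-rlock.h have
redefined LOCK/UNLOCK - a sketch reconstructed from the L()/RL() shortcuts in
the patch, not literal compiler output:

static void AA_spin(void)
{
	spin_lock(&lock_X1);
	spin_lock(&lock_X2);	/* second lock of the same type - expected to fail */
	spin_unlock(&lock_X2);
	spin_unlock(&lock_X1);
}

static void AA_rlock(void)
{
	read_lock(&rwlock_X1);
	read_lock(&rwlock_X2);	/* same type again - also flagged in DO_TESTCASE_6 */
	read_unlock(&rwlock_X2);
	read_unlock(&rwlock_X1);
}

Because X1 and X2 are initialized by init_shared_types(), they share one
lock-type, so the validator flags the nested acquisition as a potential
self-deadlock of that type even though two distinct lock instances are used.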

2006-05-29 21:46:31

by Ingo Molnar

[permalink] [raw]
Subject: [patch 05/61] lock validator: introduce WARN_ON_ONCE(cond)

From: Ingo Molnar <[email protected]>

add WARN_ON_ONCE(cond) to print once-per-bootup messages.

Signed-off-by: Ingo Molnar <[email protected]>
Signed-off-by: Arjan van de Ven <[email protected]>
---
include/asm-generic/bug.h | 13 +++++++++++++
1 file changed, 13 insertions(+)

Index: linux/include/asm-generic/bug.h
===================================================================
--- linux.orig/include/asm-generic/bug.h
+++ linux/include/asm-generic/bug.h
@@ -44,4 +44,17 @@
# define WARN_ON_SMP(x) do { } while (0)
#endif

+#define WARN_ON_ONCE(condition) \
+({ \
+ static int __warn_once = 1; \
+ int __ret = 0; \
+ \
+ if (unlikely(__warn_once && (condition))) { \
+ __warn_once = 0; \
+ WARN_ON(1); \
+ __ret = 1; \
+ } \
+ __ret; \
+})
+
#endif
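
A minimal usage sketch (hypothetical call site, not part of the patch): the
point is that a condition which may trigger on every interrupt or every packet
still produces only one stack trace per boot.

/*
 * Hypothetical helper - the names are illustrative. The check could
 * trigger on every received frame, but only the first occurrence per
 * boot prints a WARN_ON stack trace:
 */
static void foo_check_frame(unsigned int len)
{
	WARN_ON_ONCE(len == 0);
}

With the definition above, the expression also evaluates to 1 only the first
time the condition is seen true (and 0 afterwards), so callers that branch on
the return value take that branch at most once per boot.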

2006-05-29 21:47:46

by Ingo Molnar

[permalink] [raw]
Subject: [patch 09/61] lock validator: spin/rwlock init cleanups

From: Ingo Molnar <[email protected]>

locking init cleanups:

- convert " = SPIN_LOCK_UNLOCKED" to spin_lock_init() or DEFINE_SPINLOCK()
- convert rwlocks in a similar manner

this patch was generated automatically.
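
the conversion pattern, with illustrative identifiers (the real call sites
are in the diff below):

/* before: assigning the generic static initializer */
static spinlock_t my_lock = SPIN_LOCK_UNLOCKED;
static rwlock_t my_rwlock = RW_LOCK_UNLOCKED;

/* after: declare-and-initialize helpers for static locks ... */
static DEFINE_SPINLOCK(my_lock);
static DEFINE_RWLOCK(my_rwlock);

/* ... and explicit runtime init for locks embedded in structures: */
spin_lock_init(&my_dev->lock);
rwlock_init(&my_table->lock);

beyond style, this gives every lock a single, well-defined initialization
site, which later patches in this series can hook to set up per-lock type
information.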

Signed-off-by: Ingo Molnar <[email protected]>
Signed-off-by: Arjan van de Ven <[email protected]>
---
arch/ia64/sn/kernel/irq.c | 2 +-
arch/mips/kernel/smtc.c | 4 ++--
arch/powerpc/platforms/cell/spufs/switch.c | 2 +-
arch/powerpc/platforms/powermac/pfunc_core.c | 2 +-
arch/powerpc/platforms/pseries/eeh_event.c | 2 +-
arch/powerpc/sysdev/mmio_nvram.c | 2 +-
arch/xtensa/kernel/time.c | 2 +-
arch/xtensa/kernel/traps.c | 2 +-
drivers/char/drm/drm_memory_debug.h | 2 +-
drivers/char/drm/via_dmablit.c | 2 +-
drivers/char/epca.c | 2 +-
drivers/char/moxa.c | 2 +-
drivers/char/specialix.c | 2 +-
drivers/char/sx.c | 2 +-
drivers/isdn/gigaset/common.c | 2 +-
drivers/leds/led-core.c | 2 +-
drivers/leds/led-triggers.c | 2 +-
drivers/message/i2o/exec-osm.c | 2 +-
drivers/misc/ibmasm/module.c | 2 +-
drivers/pcmcia/m8xx_pcmcia.c | 4 ++--
drivers/rapidio/rio-access.c | 4 ++--
drivers/rtc/rtc-sa1100.c | 2 +-
drivers/rtc/rtc-vr41xx.c | 2 +-
drivers/s390/block/dasd_eer.c | 2 +-
drivers/scsi/libata-core.c | 2 +-
drivers/sn/ioc3.c | 2 +-
drivers/usb/ip/stub_dev.c | 4 ++--
drivers/usb/ip/vhci_hcd.c | 4 ++--
drivers/video/backlight/hp680_bl.c | 2 +-
fs/gfs2/ops_fstype.c | 2 +-
fs/nfsd/nfs4state.c | 2 +-
fs/ocfs2/cluster/heartbeat.c | 2 +-
fs/ocfs2/cluster/tcp.c | 2 +-
fs/ocfs2/dlm/dlmdomain.c | 2 +-
fs/ocfs2/dlm/dlmlock.c | 2 +-
fs/ocfs2/dlm/dlmrecovery.c | 4 ++--
fs/ocfs2/dlmglue.c | 2 +-
fs/ocfs2/journal.c | 2 +-
fs/reiser4/block_alloc.c | 2 +-
fs/reiser4/debug.c | 2 +-
fs/reiser4/fsdata.c | 2 +-
fs/reiser4/txnmgr.c | 2 +-
include/asm-alpha/core_t2.h | 2 +-
kernel/audit.c | 2 +-
mm/sparse.c | 2 +-
net/ipv6/route.c | 2 +-
net/sunrpc/auth_gss/gss_krb5_seal.c | 2 +-
net/tipc/bcast.c | 4 ++--
net/tipc/bearer.c | 2 +-
net/tipc/config.c | 2 +-
net/tipc/dbg.c | 2 +-
net/tipc/handler.c | 2 +-
net/tipc/name_table.c | 4 ++--
net/tipc/net.c | 2 +-
net/tipc/node.c | 2 +-
net/tipc/port.c | 4 ++--
net/tipc/ref.c | 4 ++--
net/tipc/subscr.c | 2 +-
net/tipc/user_reg.c | 2 +-
59 files changed, 69 insertions(+), 69 deletions(-)

Index: linux/arch/ia64/sn/kernel/irq.c
===================================================================
--- linux.orig/arch/ia64/sn/kernel/irq.c
+++ linux/arch/ia64/sn/kernel/irq.c
@@ -27,7 +27,7 @@ static void unregister_intr_pda(struct s
int sn_force_interrupt_flag = 1;
extern int sn_ioif_inited;
struct list_head **sn_irq_lh;
-static spinlock_t sn_irq_info_lock = SPIN_LOCK_UNLOCKED; /* non-IRQ lock */
+static DEFINE_SPINLOCK(sn_irq_info_lock); /* non-IRQ lock */

u64 sn_intr_alloc(nasid_t local_nasid, int local_widget,
struct sn_irq_info *sn_irq_info,
Index: linux/arch/mips/kernel/smtc.c
===================================================================
--- linux.orig/arch/mips/kernel/smtc.c
+++ linux/arch/mips/kernel/smtc.c
@@ -367,7 +367,7 @@ void mipsmt_prepare_cpus(void)
dvpe();
dmt();

- freeIPIq.lock = SPIN_LOCK_UNLOCKED;
+ spin_lock_init(&freeIPIq.lock);

/*
* We probably don't have as many VPEs as we do SMP "CPUs",
@@ -375,7 +375,7 @@ void mipsmt_prepare_cpus(void)
*/
for (i=0; i<NR_CPUS; i++) {
IPIQ[i].head = IPIQ[i].tail = NULL;
- IPIQ[i].lock = SPIN_LOCK_UNLOCKED;
+ spin_lock_init(&IPIQ[i].lock);
IPIQ[i].depth = 0;
ipi_timer_latch[i] = 0;
}
Index: linux/arch/powerpc/platforms/cell/spufs/switch.c
===================================================================
--- linux.orig/arch/powerpc/platforms/cell/spufs/switch.c
+++ linux/arch/powerpc/platforms/cell/spufs/switch.c
@@ -2183,7 +2183,7 @@ void spu_init_csa(struct spu_state *csa)

memset(lscsa, 0, sizeof(struct spu_lscsa));
csa->lscsa = lscsa;
- csa->register_lock = SPIN_LOCK_UNLOCKED;
+ spin_lock_init(&csa->register_lock);

/* Set LS pages reserved to allow for user-space mapping. */
for (p = lscsa->ls; p < lscsa->ls + LS_SIZE; p += PAGE_SIZE)
Index: linux/arch/powerpc/platforms/powermac/pfunc_core.c
===================================================================
--- linux.orig/arch/powerpc/platforms/powermac/pfunc_core.c
+++ linux/arch/powerpc/platforms/powermac/pfunc_core.c
@@ -545,7 +545,7 @@ struct pmf_device {
};

static LIST_HEAD(pmf_devices);
-static spinlock_t pmf_lock = SPIN_LOCK_UNLOCKED;
+static DEFINE_SPINLOCK(pmf_lock);

static void pmf_release_device(struct kref *kref)
{
Index: linux/arch/powerpc/platforms/pseries/eeh_event.c
===================================================================
--- linux.orig/arch/powerpc/platforms/pseries/eeh_event.c
+++ linux/arch/powerpc/platforms/pseries/eeh_event.c
@@ -35,7 +35,7 @@
*/

/* EEH event workqueue setup. */
-static spinlock_t eeh_eventlist_lock = SPIN_LOCK_UNLOCKED;
+static DEFINE_SPINLOCK(eeh_eventlist_lock);
LIST_HEAD(eeh_eventlist);
static void eeh_thread_launcher(void *);
DECLARE_WORK(eeh_event_wq, eeh_thread_launcher, NULL);
Index: linux/arch/powerpc/sysdev/mmio_nvram.c
===================================================================
--- linux.orig/arch/powerpc/sysdev/mmio_nvram.c
+++ linux/arch/powerpc/sysdev/mmio_nvram.c
@@ -32,7 +32,7 @@

static void __iomem *mmio_nvram_start;
static long mmio_nvram_len;
-static spinlock_t mmio_nvram_lock = SPIN_LOCK_UNLOCKED;
+static DEFINE_SPINLOCK(mmio_nvram_lock);

static ssize_t mmio_nvram_read(char *buf, size_t count, loff_t *index)
{
Index: linux/arch/xtensa/kernel/time.c
===================================================================
--- linux.orig/arch/xtensa/kernel/time.c
+++ linux/arch/xtensa/kernel/time.c
@@ -29,7 +29,7 @@

extern volatile unsigned long wall_jiffies;

-spinlock_t rtc_lock = SPIN_LOCK_UNLOCKED;
+DEFINE_SPINLOCK(rtc_lock);
EXPORT_SYMBOL(rtc_lock);


Index: linux/arch/xtensa/kernel/traps.c
===================================================================
--- linux.orig/arch/xtensa/kernel/traps.c
+++ linux/arch/xtensa/kernel/traps.c
@@ -461,7 +461,7 @@ void show_code(unsigned int *pc)
}
}

-spinlock_t die_lock = SPIN_LOCK_UNLOCKED;
+DEFINE_SPINLOCK(die_lock);

void die(const char * str, struct pt_regs * regs, long err)
{
Index: linux/drivers/char/drm/drm_memory_debug.h
===================================================================
--- linux.orig/drivers/char/drm/drm_memory_debug.h
+++ linux/drivers/char/drm/drm_memory_debug.h
@@ -43,7 +43,7 @@ typedef struct drm_mem_stats {
unsigned long bytes_freed;
} drm_mem_stats_t;

-static spinlock_t drm_mem_lock = SPIN_LOCK_UNLOCKED;
+static DEFINE_SPINLOCK(drm_mem_lock);
static unsigned long drm_ram_available = 0; /* In pages */
static unsigned long drm_ram_used = 0;
static drm_mem_stats_t drm_mem_stats[] =
Index: linux/drivers/char/drm/via_dmablit.c
===================================================================
--- linux.orig/drivers/char/drm/via_dmablit.c
+++ linux/drivers/char/drm/via_dmablit.c
@@ -557,7 +557,7 @@ via_init_dmablit(drm_device_t *dev)
blitq->num_outstanding = 0;
blitq->is_active = 0;
blitq->aborting = 0;
- blitq->blit_lock = SPIN_LOCK_UNLOCKED;
+ spin_lock_init(&blitq->blit_lock);
for (j=0; j<VIA_NUM_BLIT_SLOTS; ++j) {
DRM_INIT_WAITQUEUE(blitq->blit_queue + j);
}
Index: linux/drivers/char/epca.c
===================================================================
--- linux.orig/drivers/char/epca.c
+++ linux/drivers/char/epca.c
@@ -80,7 +80,7 @@ static int invalid_lilo_config;
/* The ISA boards do window flipping into the same spaces so its only sane
with a single lock. It's still pretty efficient */

-static spinlock_t epca_lock = SPIN_LOCK_UNLOCKED;
+static DEFINE_SPINLOCK(epca_lock);

/* -----------------------------------------------------------------------
MAXBOARDS is typically 12, but ISA and EISA cards are restricted to
Index: linux/drivers/char/moxa.c
===================================================================
--- linux.orig/drivers/char/moxa.c
+++ linux/drivers/char/moxa.c
@@ -301,7 +301,7 @@ static struct tty_operations moxa_ops =
.tiocmset = moxa_tiocmset,
};

-static spinlock_t moxa_lock = SPIN_LOCK_UNLOCKED;
+static DEFINE_SPINLOCK(moxa_lock);

#ifdef CONFIG_PCI
static int moxa_get_PCI_conf(struct pci_dev *p, int board_type, moxa_board_conf * board)
Index: linux/drivers/char/specialix.c
===================================================================
--- linux.orig/drivers/char/specialix.c
+++ linux/drivers/char/specialix.c
@@ -2477,7 +2477,7 @@ static int __init specialix_init(void)
#endif

for (i = 0; i < SX_NBOARD; i++)
- sx_board[i].lock = SPIN_LOCK_UNLOCKED;
+ spin_lock_init(&sx_board[i].lock);

if (sx_init_drivers()) {
func_exit();
Index: linux/drivers/char/sx.c
===================================================================
--- linux.orig/drivers/char/sx.c
+++ linux/drivers/char/sx.c
@@ -2320,7 +2320,7 @@ static int sx_init_portstructs (int nboa
#ifdef NEW_WRITE_LOCKING
port->gs.port_write_mutex = MUTEX;
#endif
- port->gs.driver_lock = SPIN_LOCK_UNLOCKED;
+ spin_lock_init(&port->gs.driver_lock);
/*
* Initializing wait queue
*/
Index: linux/drivers/isdn/gigaset/common.c
===================================================================
--- linux.orig/drivers/isdn/gigaset/common.c
+++ linux/drivers/isdn/gigaset/common.c
@@ -981,7 +981,7 @@ exit:
EXPORT_SYMBOL_GPL(gigaset_stop);

static LIST_HEAD(drivers);
-static spinlock_t driver_lock = SPIN_LOCK_UNLOCKED;
+static DEFINE_SPINLOCK(driver_lock);

struct cardstate *gigaset_get_cs_by_id(int id)
{
Index: linux/drivers/leds/led-core.c
===================================================================
--- linux.orig/drivers/leds/led-core.c
+++ linux/drivers/leds/led-core.c
@@ -18,7 +18,7 @@
#include <linux/leds.h>
#include "leds.h"

-rwlock_t leds_list_lock = RW_LOCK_UNLOCKED;
+DEFINE_RWLOCK(leds_list_lock);
LIST_HEAD(leds_list);

EXPORT_SYMBOL_GPL(leds_list);
Index: linux/drivers/leds/led-triggers.c
===================================================================
--- linux.orig/drivers/leds/led-triggers.c
+++ linux/drivers/leds/led-triggers.c
@@ -26,7 +26,7 @@
/*
* Nests outside led_cdev->trigger_lock
*/
-static rwlock_t triggers_list_lock = RW_LOCK_UNLOCKED;
+static DEFINE_RWLOCK(triggers_list_lock);
static LIST_HEAD(trigger_list);

ssize_t led_trigger_store(struct class_device *dev, const char *buf,
Index: linux/drivers/message/i2o/exec-osm.c
===================================================================
--- linux.orig/drivers/message/i2o/exec-osm.c
+++ linux/drivers/message/i2o/exec-osm.c
@@ -213,7 +213,7 @@ static int i2o_msg_post_wait_complete(st
{
struct i2o_exec_wait *wait, *tmp;
unsigned long flags;
- static spinlock_t lock = SPIN_LOCK_UNLOCKED;
+ static DEFINE_SPINLOCK(lock);
int rc = 1;

/*
Index: linux/drivers/misc/ibmasm/module.c
===================================================================
--- linux.orig/drivers/misc/ibmasm/module.c
+++ linux/drivers/misc/ibmasm/module.c
@@ -85,7 +85,7 @@ static int __devinit ibmasm_init_one(str
}
memset(sp, 0, sizeof(struct service_processor));

- sp->lock = SPIN_LOCK_UNLOCKED;
+ spin_lock_init(&sp->lock);
INIT_LIST_HEAD(&sp->command_queue);

pci_set_drvdata(pdev, (void *)sp);
Index: linux/drivers/pcmcia/m8xx_pcmcia.c
===================================================================
--- linux.orig/drivers/pcmcia/m8xx_pcmcia.c
+++ linux/drivers/pcmcia/m8xx_pcmcia.c
@@ -157,7 +157,7 @@ MODULE_LICENSE("Dual MPL/GPL");

static int pcmcia_schlvl = PCMCIA_SCHLVL;

-static spinlock_t events_lock = SPIN_LOCK_UNLOCKED;
+static DEFINE_SPINLOCK(events_lock);


#define PCMCIA_SOCKET_KEY_5V 1
@@ -644,7 +644,7 @@ static struct platform_device m8xx_devic
};

static u32 pending_events[PCMCIA_SOCKETS_NO];
-static spinlock_t pending_event_lock = SPIN_LOCK_UNLOCKED;
+static DEFINE_SPINLOCK(pending_event_lock);

static irqreturn_t m8xx_interrupt(int irq, void *dev, struct pt_regs *regs)
{
Index: linux/drivers/rapidio/rio-access.c
===================================================================
--- linux.orig/drivers/rapidio/rio-access.c
+++ linux/drivers/rapidio/rio-access.c
@@ -17,8 +17,8 @@
* These interrupt-safe spinlocks protect all accesses to RIO
* configuration space and doorbell access.
*/
-static spinlock_t rio_config_lock = SPIN_LOCK_UNLOCKED;
-static spinlock_t rio_doorbell_lock = SPIN_LOCK_UNLOCKED;
+static DEFINE_SPINLOCK(rio_config_lock);
+static DEFINE_SPINLOCK(rio_doorbell_lock);

/*
* Wrappers for all RIO configuration access functions. They just check
Index: linux/drivers/rtc/rtc-sa1100.c
===================================================================
--- linux.orig/drivers/rtc/rtc-sa1100.c
+++ linux/drivers/rtc/rtc-sa1100.c
@@ -45,7 +45,7 @@

static unsigned long rtc_freq = 1024;
static struct rtc_time rtc_alarm;
-static spinlock_t sa1100_rtc_lock = SPIN_LOCK_UNLOCKED;
+static DEFINE_SPINLOCK(sa1100_rtc_lock);

static int rtc_update_alarm(struct rtc_time *alrm)
{
Index: linux/drivers/rtc/rtc-vr41xx.c
===================================================================
--- linux.orig/drivers/rtc/rtc-vr41xx.c
+++ linux/drivers/rtc/rtc-vr41xx.c
@@ -93,7 +93,7 @@ static void __iomem *rtc2_base;

static unsigned long epoch = 1970; /* Jan 1 1970 00:00:00 */

-static spinlock_t rtc_lock = SPIN_LOCK_UNLOCKED;
+static DEFINE_SPINLOCK(rtc_lock);
static char rtc_name[] = "RTC";
static unsigned long periodic_frequency;
static unsigned long periodic_count;
Index: linux/drivers/s390/block/dasd_eer.c
===================================================================
--- linux.orig/drivers/s390/block/dasd_eer.c
+++ linux/drivers/s390/block/dasd_eer.c
@@ -89,7 +89,7 @@ struct eerbuffer {
};

static LIST_HEAD(bufferlist);
-static spinlock_t bufferlock = SPIN_LOCK_UNLOCKED;
+static DEFINE_SPINLOCK(bufferlock);
static DECLARE_WAIT_QUEUE_HEAD(dasd_eer_read_wait_queue);

/*
Index: linux/drivers/scsi/libata-core.c
===================================================================
--- linux.orig/drivers/scsi/libata-core.c
+++ linux/drivers/scsi/libata-core.c
@@ -5605,7 +5605,7 @@ module_init(ata_init);
module_exit(ata_exit);

static unsigned long ratelimit_time;
-static spinlock_t ata_ratelimit_lock = SPIN_LOCK_UNLOCKED;
+static DEFINE_SPINLOCK(ata_ratelimit_lock);

int ata_ratelimit(void)
{
Index: linux/drivers/sn/ioc3.c
===================================================================
--- linux.orig/drivers/sn/ioc3.c
+++ linux/drivers/sn/ioc3.c
@@ -26,7 +26,7 @@ static DECLARE_RWSEM(ioc3_devices_rwsem)

static struct ioc3_submodule *ioc3_submodules[IOC3_MAX_SUBMODULES];
static struct ioc3_submodule *ioc3_ethernet;
-static rwlock_t ioc3_submodules_lock = RW_LOCK_UNLOCKED;
+static DEFINE_RWLOCK(ioc3_submodules_lock);

/* NIC probing code */

Index: linux/drivers/usb/ip/stub_dev.c
===================================================================
--- linux.orig/drivers/usb/ip/stub_dev.c
+++ linux/drivers/usb/ip/stub_dev.c
@@ -285,13 +285,13 @@ static struct stub_device * stub_device_

sdev->ud.side = USBIP_STUB;
sdev->ud.status = SDEV_ST_AVAILABLE;
- sdev->ud.lock = SPIN_LOCK_UNLOCKED;
+ spin_lock_init(&sdev->ud.lock);
sdev->ud.tcp_socket = NULL;

INIT_LIST_HEAD(&sdev->priv_init);
INIT_LIST_HEAD(&sdev->priv_tx);
INIT_LIST_HEAD(&sdev->priv_free);
- sdev->priv_lock = SPIN_LOCK_UNLOCKED;
+ spin_lock_init(&sdev->priv_lock);

sdev->ud.eh_ops.shutdown = stub_shutdown_connection;
sdev->ud.eh_ops.reset = stub_device_reset;
Index: linux/drivers/usb/ip/vhci_hcd.c
===================================================================
--- linux.orig/drivers/usb/ip/vhci_hcd.c
+++ linux/drivers/usb/ip/vhci_hcd.c
@@ -768,11 +768,11 @@ static void vhci_device_init(struct vhci

vdev->ud.side = USBIP_VHCI;
vdev->ud.status = VDEV_ST_NULL;
- vdev->ud.lock = SPIN_LOCK_UNLOCKED;
+ spin_lock_init(&vdev->ud.lock );

INIT_LIST_HEAD(&vdev->priv_rx);
INIT_LIST_HEAD(&vdev->priv_tx);
- vdev->priv_lock = SPIN_LOCK_UNLOCKED;
+ spin_lock_init(&vdev->priv_lock);

init_waitqueue_head(&vdev->waitq);

Index: linux/drivers/video/backlight/hp680_bl.c
===================================================================
--- linux.orig/drivers/video/backlight/hp680_bl.c
+++ linux/drivers/video/backlight/hp680_bl.c
@@ -27,7 +27,7 @@

static int hp680bl_suspended;
static int current_intensity = 0;
-static spinlock_t bl_lock = SPIN_LOCK_UNLOCKED;
+static DEFINE_SPINLOCK(bl_lock);
static struct backlight_device *hp680_backlight_device;

static void hp680bl_send_intensity(struct backlight_device *bd)
Index: linux/fs/gfs2/ops_fstype.c
===================================================================
--- linux.orig/fs/gfs2/ops_fstype.c
+++ linux/fs/gfs2/ops_fstype.c
@@ -58,7 +58,7 @@ static struct gfs2_sbd *init_sbd(struct
gfs2_tune_init(&sdp->sd_tune);

for (x = 0; x < GFS2_GL_HASH_SIZE; x++) {
- sdp->sd_gl_hash[x].hb_lock = RW_LOCK_UNLOCKED;
+ rwlock_init(&sdp->sd_gl_hash[x].hb_lock);
INIT_LIST_HEAD(&sdp->sd_gl_hash[x].hb_list);
}
INIT_LIST_HEAD(&sdp->sd_reclaim_list);
Index: linux/fs/nfsd/nfs4state.c
===================================================================
--- linux.orig/fs/nfsd/nfs4state.c
+++ linux/fs/nfsd/nfs4state.c
@@ -123,7 +123,7 @@ static void release_stateid(struct nfs4_
*/

/* recall_lock protects the del_recall_lru */
-static spinlock_t recall_lock = SPIN_LOCK_UNLOCKED;
+static DEFINE_SPINLOCK(recall_lock);
static struct list_head del_recall_lru;

static void
Index: linux/fs/ocfs2/cluster/heartbeat.c
===================================================================
--- linux.orig/fs/ocfs2/cluster/heartbeat.c
+++ linux/fs/ocfs2/cluster/heartbeat.c
@@ -54,7 +54,7 @@ static DECLARE_RWSEM(o2hb_callback_sem);
* multiple hb threads are watching multiple regions. A node is live
* whenever any of the threads sees activity from the node in its region.
*/
-static spinlock_t o2hb_live_lock = SPIN_LOCK_UNLOCKED;
+static DEFINE_SPINLOCK(o2hb_live_lock);
static struct list_head o2hb_live_slots[O2NM_MAX_NODES];
static unsigned long o2hb_live_node_bitmap[BITS_TO_LONGS(O2NM_MAX_NODES)];
static LIST_HEAD(o2hb_node_events);
Index: linux/fs/ocfs2/cluster/tcp.c
===================================================================
--- linux.orig/fs/ocfs2/cluster/tcp.c
+++ linux/fs/ocfs2/cluster/tcp.c
@@ -107,7 +107,7 @@
##args); \
} while (0)

-static rwlock_t o2net_handler_lock = RW_LOCK_UNLOCKED;
+static DEFINE_RWLOCK(o2net_handler_lock);
static struct rb_root o2net_handler_tree = RB_ROOT;

static struct o2net_node o2net_nodes[O2NM_MAX_NODES];
Index: linux/fs/ocfs2/dlm/dlmdomain.c
===================================================================
--- linux.orig/fs/ocfs2/dlm/dlmdomain.c
+++ linux/fs/ocfs2/dlm/dlmdomain.c
@@ -88,7 +88,7 @@ out_free:
*
*/

-spinlock_t dlm_domain_lock = SPIN_LOCK_UNLOCKED;
+DEFINE_SPINLOCK(dlm_domain_lock);
LIST_HEAD(dlm_domains);
static DECLARE_WAIT_QUEUE_HEAD(dlm_domain_events);

Index: linux/fs/ocfs2/dlm/dlmlock.c
===================================================================
--- linux.orig/fs/ocfs2/dlm/dlmlock.c
+++ linux/fs/ocfs2/dlm/dlmlock.c
@@ -53,7 +53,7 @@
#define MLOG_MASK_PREFIX ML_DLM
#include "cluster/masklog.h"

-static spinlock_t dlm_cookie_lock = SPIN_LOCK_UNLOCKED;
+static DEFINE_SPINLOCK(dlm_cookie_lock);
static u64 dlm_next_cookie = 1;

static enum dlm_status dlm_send_remote_lock_request(struct dlm_ctxt *dlm,
Index: linux/fs/ocfs2/dlm/dlmrecovery.c
===================================================================
--- linux.orig/fs/ocfs2/dlm/dlmrecovery.c
+++ linux/fs/ocfs2/dlm/dlmrecovery.c
@@ -101,8 +101,8 @@ static int dlm_lockres_master_requery(st

static u64 dlm_get_next_mig_cookie(void);

-static spinlock_t dlm_reco_state_lock = SPIN_LOCK_UNLOCKED;
-static spinlock_t dlm_mig_cookie_lock = SPIN_LOCK_UNLOCKED;
+static DEFINE_SPINLOCK(dlm_reco_state_lock);
+static DEFINE_SPINLOCK(dlm_mig_cookie_lock);
static u64 dlm_mig_cookie = 1;

static u64 dlm_get_next_mig_cookie(void)
Index: linux/fs/ocfs2/dlmglue.c
===================================================================
--- linux.orig/fs/ocfs2/dlmglue.c
+++ linux/fs/ocfs2/dlmglue.c
@@ -242,7 +242,7 @@ static void ocfs2_build_lock_name(enum o
mlog_exit_void();
}

-static spinlock_t ocfs2_dlm_tracking_lock = SPIN_LOCK_UNLOCKED;
+static DEFINE_SPINLOCK(ocfs2_dlm_tracking_lock);

static void ocfs2_add_lockres_tracking(struct ocfs2_lock_res *res,
struct ocfs2_dlm_debug *dlm_debug)
Index: linux/fs/ocfs2/journal.c
===================================================================
--- linux.orig/fs/ocfs2/journal.c
+++ linux/fs/ocfs2/journal.c
@@ -49,7 +49,7 @@

#include "buffer_head_io.h"

-spinlock_t trans_inc_lock = SPIN_LOCK_UNLOCKED;
+DEFINE_SPINLOCK(trans_inc_lock);

static int ocfs2_force_read_journal(struct inode *inode);
static int ocfs2_recover_node(struct ocfs2_super *osb,
Index: linux/fs/reiser4/block_alloc.c
===================================================================
--- linux.orig/fs/reiser4/block_alloc.c
+++ linux/fs/reiser4/block_alloc.c
@@ -499,7 +499,7 @@ void cluster_reserved2free(int count)
spin_unlock_reiser4_super(sbinfo);
}

-static spinlock_t fake_lock = SPIN_LOCK_UNLOCKED;
+static DEFINE_SPINLOCK(fake_lock);
static reiser4_block_nr fake_gen = 0;

/* obtain a block number for new formatted node which will be used to refer
Index: linux/fs/reiser4/debug.c
===================================================================
--- linux.orig/fs/reiser4/debug.c
+++ linux/fs/reiser4/debug.c
@@ -52,7 +52,7 @@ static char panic_buf[REISER4_PANIC_MSG_
/*
* lock protecting consistency of panic_buf under concurrent panics
*/
-static spinlock_t panic_guard = SPIN_LOCK_UNLOCKED;
+static DEFINE_SPINLOCK(panic_guard);

/* Your best friend. Call it on each occasion. This is called by
fs/reiser4/debug.h:reiser4_panic(). */
Index: linux/fs/reiser4/fsdata.c
===================================================================
--- linux.orig/fs/reiser4/fsdata.c
+++ linux/fs/reiser4/fsdata.c
@@ -17,7 +17,7 @@ static LIST_HEAD(cursor_cache);
static unsigned long d_cursor_unused = 0;

/* spinlock protecting manipulations with dir_cursor's hash table and lists */
-spinlock_t d_lock = SPIN_LOCK_UNLOCKED;
+DEFINE_SPINLOCK(d_lock);

static reiser4_file_fsdata *create_fsdata(struct file *file);
static int file_is_stateless(struct file *file);
Index: linux/fs/reiser4/txnmgr.c
===================================================================
--- linux.orig/fs/reiser4/txnmgr.c
+++ linux/fs/reiser4/txnmgr.c
@@ -905,7 +905,7 @@ jnode *find_first_dirty_jnode(txn_atom *

/* this spin lock is used to prevent races during steal on capture.
FIXME: should be per filesystem or even per atom */
-spinlock_t scan_lock = SPIN_LOCK_UNLOCKED;
+DEFINE_SPINLOCK(scan_lock);

/* Scan atom->writeback_nodes list and dispatch jnodes according to their state:
* move dirty and !writeback jnodes to @fq, clean jnodes to atom's clean
Index: linux/include/asm-alpha/core_t2.h
===================================================================
--- linux.orig/include/asm-alpha/core_t2.h
+++ linux/include/asm-alpha/core_t2.h
@@ -435,7 +435,7 @@ static inline void t2_outl(u32 b, unsign
set_hae(msb); \
}

-static spinlock_t t2_hae_lock = SPIN_LOCK_UNLOCKED;
+static DEFINE_SPINLOCK(t2_hae_lock);

__EXTERN_INLINE u8 t2_readb(const volatile void __iomem *xaddr)
{
Index: linux/kernel/audit.c
===================================================================
--- linux.orig/kernel/audit.c
+++ linux/kernel/audit.c
@@ -787,7 +787,7 @@ err:
*/
unsigned int audit_serial(void)
{
- static spinlock_t serial_lock = SPIN_LOCK_UNLOCKED;
+ static DEFINE_SPINLOCK(serial_lock);
static unsigned int serial = 0;

unsigned long flags;
Index: linux/mm/sparse.c
===================================================================
--- linux.orig/mm/sparse.c
+++ linux/mm/sparse.c
@@ -45,7 +45,7 @@ static struct mem_section *sparse_index_

static int sparse_index_init(unsigned long section_nr, int nid)
{
- static spinlock_t index_init_lock = SPIN_LOCK_UNLOCKED;
+ static DEFINE_SPINLOCK(index_init_lock);
unsigned long root = SECTION_NR_TO_ROOT(section_nr);
struct mem_section *section;
int ret = 0;
Index: linux/net/ipv6/route.c
===================================================================
--- linux.orig/net/ipv6/route.c
+++ linux/net/ipv6/route.c
@@ -343,7 +343,7 @@ static struct rt6_info *rt6_select(struc
(strict & RT6_SELECT_F_REACHABLE) &&
last && last != rt0) {
/* no entries matched; do round-robin */
- static spinlock_t lock = SPIN_LOCK_UNLOCKED;
+ static DEFINE_SPINLOCK(lock);
spin_lock(&lock);
*head = rt0->u.next;
rt0->u.next = last->u.next;
Index: linux/net/sunrpc/auth_gss/gss_krb5_seal.c
===================================================================
--- linux.orig/net/sunrpc/auth_gss/gss_krb5_seal.c
+++ linux/net/sunrpc/auth_gss/gss_krb5_seal.c
@@ -70,7 +70,7 @@
# define RPCDBG_FACILITY RPCDBG_AUTH
#endif

-spinlock_t krb5_seq_lock = SPIN_LOCK_UNLOCKED;
+DEFINE_SPINLOCK(krb5_seq_lock);

u32
gss_get_mic_kerberos(struct gss_ctx *gss_ctx, struct xdr_buf *text,
Index: linux/net/tipc/bcast.c
===================================================================
--- linux.orig/net/tipc/bcast.c
+++ linux/net/tipc/bcast.c
@@ -102,7 +102,7 @@ struct bclink {
static struct bcbearer *bcbearer = NULL;
static struct bclink *bclink = NULL;
static struct link *bcl = NULL;
-static spinlock_t bc_lock = SPIN_LOCK_UNLOCKED;
+static DEFINE_SPINLOCK(bc_lock);

char tipc_bclink_name[] = "multicast-link";

@@ -783,7 +783,7 @@ int tipc_bclink_init(void)
memset(bclink, 0, sizeof(struct bclink));
INIT_LIST_HEAD(&bcl->waiting_ports);
bcl->next_out_no = 1;
- bclink->node.lock = SPIN_LOCK_UNLOCKED;
+ spin_lock_init(&bclink->node.lock);
bcl->owner = &bclink->node;
bcl->max_pkt = MAX_PKT_DEFAULT_MCAST;
tipc_link_set_queue_limits(bcl, BCLINK_WIN_DEFAULT);
Index: linux/net/tipc/bearer.c
===================================================================
--- linux.orig/net/tipc/bearer.c
+++ linux/net/tipc/bearer.c
@@ -552,7 +552,7 @@ restart:
b_ptr->link_req = tipc_disc_init_link_req(b_ptr, &m_ptr->bcast_addr,
bcast_scope, 2);
}
- b_ptr->publ.lock = SPIN_LOCK_UNLOCKED;
+ spin_lock_init(&b_ptr->publ.lock);
write_unlock_bh(&tipc_net_lock);
info("Enabled bearer <%s>, discovery domain %s, priority %u\n",
name, addr_string_fill(addr_string, bcast_scope), priority);
Index: linux/net/tipc/config.c
===================================================================
--- linux.orig/net/tipc/config.c
+++ linux/net/tipc/config.c
@@ -63,7 +63,7 @@ struct manager {

static struct manager mng = { 0};

-static spinlock_t config_lock = SPIN_LOCK_UNLOCKED;
+static DEFINE_SPINLOCK(config_lock);

static const void *req_tlv_area; /* request message TLV area */
static int req_tlv_space; /* request message TLV area size */
Index: linux/net/tipc/dbg.c
===================================================================
--- linux.orig/net/tipc/dbg.c
+++ linux/net/tipc/dbg.c
@@ -41,7 +41,7 @@
#define MAX_STRING 512

static char print_string[MAX_STRING];
-static spinlock_t print_lock = SPIN_LOCK_UNLOCKED;
+static DEFINE_SPINLOCK(print_lock);

static struct print_buf cons_buf = { NULL, 0, NULL, NULL };
struct print_buf *TIPC_CONS = &cons_buf;
Index: linux/net/tipc/handler.c
===================================================================
--- linux.orig/net/tipc/handler.c
+++ linux/net/tipc/handler.c
@@ -44,7 +44,7 @@ struct queue_item {

static kmem_cache_t *tipc_queue_item_cache;
static struct list_head signal_queue_head;
-static spinlock_t qitem_lock = SPIN_LOCK_UNLOCKED;
+static DEFINE_SPINLOCK(qitem_lock);
static int handler_enabled = 0;

static void process_signal_queue(unsigned long dummy);
Index: linux/net/tipc/name_table.c
===================================================================
--- linux.orig/net/tipc/name_table.c
+++ linux/net/tipc/name_table.c
@@ -101,7 +101,7 @@ struct name_table {

static struct name_table table = { NULL } ;
static atomic_t rsv_publ_ok = ATOMIC_INIT(0);
-rwlock_t tipc_nametbl_lock = RW_LOCK_UNLOCKED;
+DEFINE_RWLOCK(tipc_nametbl_lock);


static int hash(int x)
@@ -172,7 +172,7 @@ static struct name_seq *tipc_nameseq_cre
}

memset(nseq, 0, sizeof(*nseq));
- nseq->lock = SPIN_LOCK_UNLOCKED;
+ spin_lock_init(&nseq->lock);
nseq->type = type;
nseq->sseqs = sseq;
dbg("tipc_nameseq_create() nseq = %x type %u, ssseqs %x, ff: %u\n",
Index: linux/net/tipc/net.c
===================================================================
--- linux.orig/net/tipc/net.c
+++ linux/net/tipc/net.c
@@ -115,7 +115,7 @@
* - A local spin_lock protecting the queue of subscriber events.
*/

-rwlock_t tipc_net_lock = RW_LOCK_UNLOCKED;
+DEFINE_RWLOCK(tipc_net_lock);
struct network tipc_net = { NULL };

struct node *tipc_net_select_remote_node(u32 addr, u32 ref)
Index: linux/net/tipc/node.c
===================================================================
--- linux.orig/net/tipc/node.c
+++ linux/net/tipc/node.c
@@ -64,7 +64,7 @@ struct node *tipc_node_create(u32 addr)
if (n_ptr != NULL) {
memset(n_ptr, 0, sizeof(*n_ptr));
n_ptr->addr = addr;
- n_ptr->lock = SPIN_LOCK_UNLOCKED;
+ spin_lock_init(&n_ptr->lock);
INIT_LIST_HEAD(&n_ptr->nsub);

c_ptr = tipc_cltr_find(addr);
Index: linux/net/tipc/port.c
===================================================================
--- linux.orig/net/tipc/port.c
+++ linux/net/tipc/port.c
@@ -57,8 +57,8 @@
static struct sk_buff *msg_queue_head = NULL;
static struct sk_buff *msg_queue_tail = NULL;

-spinlock_t tipc_port_list_lock = SPIN_LOCK_UNLOCKED;
-static spinlock_t queue_lock = SPIN_LOCK_UNLOCKED;
+DEFINE_SPINLOCK(tipc_port_list_lock);
+static DEFINE_SPINLOCK(queue_lock);

static LIST_HEAD(ports);
static void port_handle_node_down(unsigned long ref);
Index: linux/net/tipc/ref.c
===================================================================
--- linux.orig/net/tipc/ref.c
+++ linux/net/tipc/ref.c
@@ -63,7 +63,7 @@

struct ref_table tipc_ref_table = { NULL };

-static rwlock_t ref_table_lock = RW_LOCK_UNLOCKED;
+static DEFINE_RWLOCK(ref_table_lock);

/**
* tipc_ref_table_init - create reference table for objects
@@ -87,7 +87,7 @@ int tipc_ref_table_init(u32 requested_si
index_mask = sz - 1;
for (i = sz - 1; i >= 0; i--) {
table[i].object = NULL;
- table[i].lock = SPIN_LOCK_UNLOCKED;
+ spin_lock_init(&table[i].lock);
table[i].data.next_plus_upper = (start & ~index_mask) + i - 1;
}
tipc_ref_table.entries = table;
Index: linux/net/tipc/subscr.c
===================================================================
--- linux.orig/net/tipc/subscr.c
+++ linux/net/tipc/subscr.c
@@ -457,7 +457,7 @@ int tipc_subscr_start(void)
int res = -1;

memset(&topsrv, 0, sizeof (topsrv));
- topsrv.lock = SPIN_LOCK_UNLOCKED;
+ spin_lock_init(&topsrv.lock);
INIT_LIST_HEAD(&topsrv.subscriber_list);

spin_lock_bh(&topsrv.lock);
Index: linux/net/tipc/user_reg.c
===================================================================
--- linux.orig/net/tipc/user_reg.c
+++ linux/net/tipc/user_reg.c
@@ -67,7 +67,7 @@ struct tipc_user {

static struct tipc_user *users = NULL;
static u32 next_free_user = MAX_USERID + 1;
-static spinlock_t reg_lock = SPIN_LOCK_UNLOCKED;
+static DEFINE_SPINLOCK(reg_lock);

/**
* reg_init - create TIPC user registry (but don't activate it)

2006-05-29 21:46:32

by Ingo Molnar

[permalink] [raw]
Subject: [patch 06/61] lock validator: add __module_address() method

From: Ingo Molnar <[email protected]>

add __module_address() method - to be used by lockdep.

Signed-off-by: Ingo Molnar <[email protected]>
Signed-off-by: Arjan van de Ven <[email protected]>
---
include/linux/module.h | 6 ++++++
kernel/module.c | 14 ++++++++++++++
2 files changed, 20 insertions(+)

Index: linux/include/linux/module.h
===================================================================
--- linux.orig/include/linux/module.h
+++ linux/include/linux/module.h
@@ -371,6 +371,7 @@ static inline int module_is_live(struct
/* Is this address in a module? (second is with no locks, for oops) */
struct module *module_text_address(unsigned long addr);
struct module *__module_text_address(unsigned long addr);
+int __module_address(unsigned long addr);

/* Returns module and fills in value, defined and namebuf, or NULL if
symnum out of range. */
@@ -509,6 +510,11 @@ static inline struct module *__module_te
return NULL;
}

+static inline int __module_address(unsigned long addr)
+{
+ return 0;
+}
+
/* Get/put a kernel symbol (calls should be symmetric) */
#define symbol_get(x) ({ extern typeof(x) x __attribute__((weak)); &(x); })
#define symbol_put(x) do { } while(0)
Index: linux/kernel/module.c
===================================================================
--- linux.orig/kernel/module.c
+++ linux/kernel/module.c
@@ -2222,6 +2222,20 @@ const struct exception_table_entry *sear
return e;
}

+/*
+ * Is this a valid module address? We don't grab the lock.
+ */
+int __module_address(unsigned long addr)
+{
+ struct module *mod;
+
+ list_for_each_entry(mod, &modules, list)
+ if (within(addr, mod->module_core, mod->core_size))
+ return 1;
+ return 0;
+}
+
+
/* Is this a valid kernel address? We don't grab the lock: we are oopsing. */
struct module *__module_text_address(unsigned long addr)
{

2006-05-29 22:28:38

by Michal Piotrowski

[permalink] [raw]
Subject: Re: [patch 00/61] ANNOUNCE: lock validator -V1

On 29/05/06, Ingo Molnar <[email protected]> wrote:
> We are pleased to announce the first release of the "lock dependency
> correctness validator" kernel debugging feature, which can be downloaded
> from:
>
> http://redhat.com/~mingo/lockdep-patches/
>
[snip]

I get this while loading cpufreq modules

=====================================================
[ BUG: possible circular locking deadlock detected! ]
-----------------------------------------------------
modprobe/1942 is trying to acquire lock:
(&anon_vma->lock){--..}, at: [<c10609cf>] anon_vma_link+0x1d/0xc9

but task is already holding lock:
(&mm->mmap_sem/1){--..}, at: [<c101e5a0>] copy_process+0xbc6/0x1519

which lock already depends on the new lock,
which could lead to circular deadlocks!

the existing dependency chain (in reverse order) is:

-> #1 (cpucontrol){--..}:
[<c10394be>] lockdep_acquire+0x69/0x82
[<c11ed759>] __mutex_lock_slowpath+0xd0/0x347
[<c11ed9ec>] mutex_lock+0x1c/0x1f
[<c103dda5>] __lock_cpu_hotplug+0x36/0x56
[<c103ddde>] lock_cpu_hotplug+0xa/0xc
[<c1199e06>] __cpufreq_driver_target+0x15/0x50
[<c119a1c2>] cpufreq_governor_performance+0x1a/0x20
[<c1198b0a>] __cpufreq_governor+0xa0/0x1a9
[<c1198ce2>] __cpufreq_set_policy+0xcf/0x100
[<c11991c6>] cpufreq_set_policy+0x2d/0x6f
[<c1199cae>] cpufreq_add_dev+0x34f/0x492
[<c114b8c8>] sysdev_driver_register+0x58/0x9b
[<c119a036>] cpufreq_register_driver+0x80/0xf4
[<fd97b02a>] ct_get_next+0x17/0x3f [ip_conntrack]
[<c10410e1>] sys_init_module+0xa6/0x230
[<c11ef9ab>] sysenter_past_esp+0x54/0x8d

-> #0 (&anon_vma->lock){--..}:
[<c10394be>] lockdep_acquire+0x69/0x82
[<c11ed759>] __mutex_lock_slowpath+0xd0/0x347
[<c11ed9ec>] mutex_lock+0x1c/0x1f
[<c11990eb>] cpufreq_update_policy+0x34/0xd8
[<fd9ad50b>] cpufreq_stat_cpu_callback+0x1b/0x7c [cpufreq_stats]
[<fd9b007d>] cpufreq_stats_init+0x7d/0x9b [cpufreq_stats]
[<c10410e1>] sys_init_module+0xa6/0x230
[<c11ef9ab>] sysenter_past_esp+0x54/0x8d

other info that might help us debug this:

1 locks held by modprobe/1942:
#0: (cpucontrol){--..}, at: [<c11ed9ec>] mutex_lock+0x1c/0x1f

stack backtrace:
<c1003f36> show_trace+0xd/0xf <c1004449> dump_stack+0x17/0x19
<c103863e> print_circular_bug_tail+0x59/0x64 <c1038e91>
__lockdep_acquire+0x848/0xa39
<c10394be> lockdep_acquire+0x69/0x82 <c11ed759>
__mutex_lock_slowpath+0xd0/0x347
<c11ed9ec> mutex_lock+0x1c/0x1f <c11990eb> cpufreq_update_policy+0x34/0xd8
<fd9ad50b> cpufreq_stat_cpu_callback+0x1b/0x7c [cpufreq_stats]
<fd9b007d> cpufreq_stats_init+0x7d/0x9b [cpufreq_stats]
<c10410e1> sys_init_module+0xa6/0x230 <c11ef9ab> sysenter_past_esp+0x54/0x8d

Here is dmesg http://www.stardust.webpages.pl/files/lockdep/2.6.17-rc4-mm3-lockdep1/lockdep-dmesg3

Here is config
http://www.stardust.webpages.pl/files/lockdep/2.6.17-rc4-mm3-lockdep1/lockdep-config2

BTW I still must revert lockdep-serial.patch - it doesn't compile on
my gcc 4.1.1

Regards,
Michal

--
Michal K. K. Piotrowski
LTG - Linux Testers Group
(http://www.stardust.webpages.pl/ltg/wiki/)

2006-05-29 22:40:53

by Ingo Molnar

[permalink] [raw]
Subject: Re: [patch 00/61] ANNOUNCE: lock validator -V1


* Michal Piotrowski <[email protected]> wrote:

> On 29/05/06, Ingo Molnar <[email protected]> wrote:
> >We are pleased to announce the first release of the "lock dependency
> >correctness validator" kernel debugging feature, which can be downloaded
> >from:
> >
> > http://redhat.com/~mingo/lockdep-patches/
> >
> [snip]
>
> I get this while loading cpufreq modules
>
> =====================================================
> [ BUG: possible circular locking deadlock detected! ]
> -----------------------------------------------------
> modprobe/1942 is trying to acquire lock:
> (&anon_vma->lock){--..}, at: [<c10609cf>] anon_vma_link+0x1d/0xc9
>
> but task is already holding lock:
> (&mm->mmap_sem/1){--..}, at: [<c101e5a0>] copy_process+0xbc6/0x1519
>
> which lock already depends on the new lock,
> which could lead to circular deadlocks!

hm, this one could perhaps be a real bug. Dave: lockdep complains about
having observed:

anon_vma->lock => mm->mmap_sem
mm->mmap_sem => anon_vma->lock

locking sequences, in the cpufreq code. Is there some special runtime
behavior that still makes this safe, or is it a real bug?

> stack backtrace:
> <c1003f36> show_trace+0xd/0xf <c1004449> dump_stack+0x17/0x19
> <c103863e> print_circular_bug_tail+0x59/0x64 <c1038e91>
> __lockdep_acquire+0x848/0xa39
> <c10394be> lockdep_acquire+0x69/0x82 <c11ed759>
> __mutex_lock_slowpath+0xd0/0x347

there's one small detail to improve future lockdep printouts: please set
CONFIG_STACK_BACKTRACE_COLS=1, so that the backtrace is more readable.
(i'll change the code to force that when CONFIG_LOCKDEP is enabled)

> BTW I still must revert lockdep-serial.patch - it doesn't compile on
> my gcc 4.1.1

ok, will check this.

Ingo

2006-05-29 22:50:10

by Keith Owens

[permalink] [raw]
Subject: Re: [patch 33/61] lock validator: disable NMI watchdog if CONFIG_LOCKDEP

Ingo Molnar (on Mon, 29 May 2006 23:25:50 +0200) wrote:
>From: Ingo Molnar <[email protected]>
>
>The NMI watchdog uses spinlocks (notifier chains, etc.),
>so it's not lockdep-safe at the moment.

Fixed in 2.6.17-rc1. notify_die() uses atomic_notifier_call_chain()
which uses RCU, not spinlocks.

2006-05-29 23:09:28

by Dave Jones

[permalink] [raw]
Subject: Re: [patch 00/61] ANNOUNCE: lock validator -V1

On Tue, May 30, 2006 at 12:41:08AM +0200, Ingo Molnar wrote:

> > =====================================================
> > [ BUG: possible circular locking deadlock detected! ]
> > -----------------------------------------------------
> > modprobe/1942 is trying to acquire lock:
> > (&anon_vma->lock){--..}, at: [<c10609cf>] anon_vma_link+0x1d/0xc9
> >
> > but task is already holding lock:
> > (&mm->mmap_sem/1){--..}, at: [<c101e5a0>] copy_process+0xbc6/0x1519
> >
> > which lock already depends on the new lock,
> > which could lead to circular deadlocks!
>
> hm, this one could perhaps be a real bug. Dave: lockdep complains about
> having observed:
>
> anon_vma->lock => mm->mmap_sem
> mm->mmap_sem => anon_vma->lock
>
> locking sequences, in the cpufreq code. Is there some special runtime
> behavior that still makes this safe, or is it a real bug?

I'm feeling a bit overwhelmed by the voluminous output of this checker.
Especially as (directly at least) cpufreq doesn't touch vma's, or mmap's.

The first stack trace it shows has us down in the bowels of cpu hotplug,
where we're taking the cpucontrol sem. The second stack trace shows
us in cpufreq_update_policy taking a per-cpu data->lock semaphore.

Now, I notice this is modprobe triggering this, and this *looks* like
we're loading two modules simultaneously (the first trace is from a
scaling driver like powernow-k8 or the like, whilst the second trace
is from cpufreq-stats).

How on earth did we get into this situation? Module loading is supposed
to be serialised on the module_mutex, no?

It's been a while since a debug patch has sent me in search of paracetamol ;)

Dave

--
http://www.codemonkey.org.uk

2006-05-30 01:29:13

by Andrew Morton

[permalink] [raw]
Subject: Re: [patch 06/61] lock validator: add __module_address() method

On Mon, 29 May 2006 23:23:33 +0200
Ingo Molnar <[email protected]> wrote:

> +/*
> + * Is this a valid module address? We don't grab the lock.
> + */
> +int __module_address(unsigned long addr)
> +{
> + struct module *mod;
> +
> + list_for_each_entry(mod, &modules, list)
> + if (within(addr, mod->module_core, mod->core_size))
> + return 1;
> + return 0;
> +}

Returns a boolean.

> /* Is this a valid kernel address? We don't grab the lock: we are oopsing. */
> struct module *__module_text_address(unsigned long addr)

But this returns a module*.

I'd suggest that __module_address() should do the same thing, from an API neatness
POV. Although perhaps that's not very useful if we didn't take a ref on the returned
object (but module_text_address() doesn't either).

Also, the name's a bit misleading - it sounds like it returns the address
of a module or something. __module_any_address() would be better, perhaps?

Also, how come this doesn't need modlist_lock()?
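
For illustration, here is a minimal sketch of the module*-returning variant
being suggested (the __module_any_address() name is only Andrew's suggestion;
within() and the modules list are the ones used in the quoted patch):

/*
 * Hypothetical variant: return the module containing addr (or NULL),
 * mirroring __module_text_address(). Still lockless, so only usable
 * where module unload is otherwise excluded.
 */
struct module *__module_any_address(unsigned long addr)
{
        struct module *mod;

        list_for_each_entry(mod, &modules, list)
                if (within(addr, mod->module_core, mod->core_size))
                        return mod;
        return NULL;
}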

2006-05-30 01:29:10

by Andrew Morton

[permalink] [raw]
Subject: Re: [patch 05/61] lock validator: introduce WARN_ON_ONCE(cond)

On Mon, 29 May 2006 23:23:28 +0200
Ingo Molnar <[email protected]> wrote:

> add WARN_ON_ONCE(cond) to print once-per-bootup messages.
>
> Signed-off-by: Ingo Molnar <[email protected]>
> Signed-off-by: Arjan van de Ven <[email protected]>
> ---
> include/asm-generic/bug.h | 13 +++++++++++++
> 1 file changed, 13 insertions(+)
>
> Index: linux/include/asm-generic/bug.h
> ===================================================================
> --- linux.orig/include/asm-generic/bug.h
> +++ linux/include/asm-generic/bug.h
> @@ -44,4 +44,17 @@
> # define WARN_ON_SMP(x) do { } while (0)
> #endif
>
> +#define WARN_ON_ONCE(condition) \
> +({ \
> + static int __warn_once = 1; \
> + int __ret = 0; \
> + \
> + if (unlikely(__warn_once && (condition))) { \
> + __warn_once = 0; \
> + WARN_ON(1); \
> + __ret = 1; \
> + } \
> + __ret; \
> +})
> +
> #endif

I'll queue this for mainline inclusion.
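
As a usage note (illustrative, not part of the patch): the macro is meant for
statement-style use, and with this particular implementation its value is 1
only the first time the condition fires, so it should not be relied on as a
guard on later occurrences:

        /* warn once per boot about an unexpected state, then carry on */
        WARN_ON_ONCE(nr_pending < 0);

        /* if a bail-out is needed every time, test the condition itself */
        if (unlikely(!buf)) {
                WARN_ON_ONCE(1);
                return -EINVAL;
        }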

2006-05-30 01:29:36

by Andrew Morton

[permalink] [raw]
Subject: Re: [patch 07/61] lock validator: better lock debugging

On Mon, 29 May 2006 23:23:37 +0200
Ingo Molnar <[email protected]> wrote:

> --- /dev/null
> +++ linux/include/linux/debug_locks.h
> @@ -0,0 +1,62 @@
> +#ifndef __LINUX_DEBUG_LOCKING_H
> +#define __LINUX_DEBUG_LOCKING_H
> +
> +extern int debug_locks;
> +extern int debug_locks_silent;
> +
> +/*
> + * Generic 'turn off all lock debugging' function:
> + */
> +extern int debug_locks_off(void);
> +
> +/*
> + * In the debug case we carry the caller's instruction pointer into
> + * other functions, but we dont want the function argument overhead
> + * in the nondebug case - hence these macros:
> + */
> +#define _RET_IP_ (unsigned long)__builtin_return_address(0)
> +#define _THIS_IP_ ({ __label__ __here; __here: (unsigned long)&&__here; })
> +
> +#define DEBUG_WARN_ON(c) \
> +({ \
> + int __ret = 0; \
> + \
> + if (unlikely(c)) { \
> + if (debug_locks_off()) \
> + WARN_ON(1); \
> + __ret = 1; \
> + } \
> + __ret; \
> +})

Either the name of this thing is too generic, or we _make_ it generic, in
which case it's in the wrong header file.

> +#ifdef CONFIG_SMP
> +# define SMP_DEBUG_WARN_ON(c) DEBUG_WARN_ON(c)
> +#else
> +# define SMP_DEBUG_WARN_ON(c) do { } while (0)
> +#endif

Probably ditto.


2006-05-30 01:29:37

by Andrew Morton

[permalink] [raw]
Subject: Re: [patch 03/61] lock validator: sound/oss/emu10k1/midi.c cleanup

On Mon, 29 May 2006 23:23:19 +0200
Ingo Molnar <[email protected]> wrote:

> move the __attribute outside of the DEFINE_SPINLOCK() section.
>
> Signed-off-by: Ingo Molnar <[email protected]>
> Signed-off-by: Arjan van de Ven <[email protected]>
> ---
> sound/oss/emu10k1/midi.c | 2 +-
> 1 file changed, 1 insertion(+), 1 deletion(-)
>
> Index: linux/sound/oss/emu10k1/midi.c
> ===================================================================
> --- linux.orig/sound/oss/emu10k1/midi.c
> +++ linux/sound/oss/emu10k1/midi.c
> @@ -45,7 +45,7 @@
> #include "../sound_config.h"
> #endif
>
> -static DEFINE_SPINLOCK(midi_spinlock __attribute((unused)));
> +static __attribute((unused)) DEFINE_SPINLOCK(midi_spinlock);
>
> static void init_midi_hdr(struct midi_hdr *midihdr)
> {

I'll tag this as for-mainline-via-alsa.

2006-05-30 01:28:22

by Andrew Morton

[permalink] [raw]
Subject: Re: [patch 01/61] lock validator: floppy.c irq-release fix

On Mon, 29 May 2006 23:22:56 +0200
Ingo Molnar <[email protected]> wrote:

> floppy.c does a lot of irq-unsafe work within floppy_release_irq_and_dma():
> free_irq(), release_region() ... so when executing in irq context, push
> the whole function into keventd.

I seem to remember having issues with this - of the "not yet adequate"
type. But I forget what they were. Perhaps we have enough
flush_scheduled_work()s in there now.

We're glad to see you reassuming floppy.c maintenance.

2006-05-30 01:30:13

by Andrew Morton

[permalink] [raw]
Subject: Re: [patch 18/61] lock validator: irqtrace: core

On Mon, 29 May 2006 23:24:32 +0200
Ingo Molnar <[email protected]> wrote:

> accurate hard-IRQ-flags state tracing. This allows us to attach
> extra functionality to IRQ flags on/off events (such as trace-on/off).

That's a fairly skimpy description of some fairly substantial new
infrastructure.

2006-05-30 01:30:04

by Andrew Morton

[permalink] [raw]
Subject: Re: [patch 11/61] lock validator: lockdep: small xfs init_rwsem() cleanup

On Mon, 29 May 2006 23:23:59 +0200
Ingo Molnar <[email protected]> wrote:

> init_rwsem() has no return value. This is not a problem if init_rwsem()
> is a function, but it's a problem if it's a do { ... } while (0) macro.
> (which lockdep introduces)
>
> Signed-off-by: Ingo Molnar <[email protected]>
> Signed-off-by: Arjan van de Ven <[email protected]>
> ---
> fs/xfs/linux-2.6/mrlock.h | 2 +-
> 1 file changed, 1 insertion(+), 1 deletion(-)
>
> Index: linux/fs/xfs/linux-2.6/mrlock.h
> ===================================================================
> --- linux.orig/fs/xfs/linux-2.6/mrlock.h
> +++ linux/fs/xfs/linux-2.6/mrlock.h
> @@ -28,7 +28,7 @@ typedef struct {
> } mrlock_t;
>
> #define mrinit(mrp, name) \
> - ( (mrp)->mr_writer = 0, init_rwsem(&(mrp)->mr_lock) )
> + do { (mrp)->mr_writer = 0; init_rwsem(&(mrp)->mr_lock); } while (0)
> #define mrlock_init(mrp, t,n,s) mrinit(mrp, n)
> #define mrfree(mrp) do { } while (0)
> #define mraccess(mrp) mraccessf(mrp, 0)

I'll queue this for mainline, via the XFS tree.

2006-05-30 01:30:13

by Andrew Morton

[permalink] [raw]
Subject: Re: [patch 15/61] lock validator: x86_64: use stacktrace to generate backtraces

On Mon, 29 May 2006 23:24:19 +0200
Ingo Molnar <[email protected]> wrote:

> this switches x86_64 to use the stacktrace infrastructure when generating
> backtrace printouts, if CONFIG_FRAME_POINTER=y. (This patch will go away
> once the dwarf2 stackframe parser in -mm goes upstream.)

yup, I dropped it.

2006-05-30 01:31:50

by Andrew Morton

[permalink] [raw]
Subject: Re: [patch 50/61] lock validator: special locking: hrtimer.c

On Mon, 29 May 2006 23:27:09 +0200
Ingo Molnar <[email protected]> wrote:

> From: Ingo Molnar <[email protected]>
>
> teach special (recursive) locking code to the lock validator. Has no
> effect on non-lockdep kernels.
>
> Signed-off-by: Ingo Molnar <[email protected]>
> Signed-off-by: Arjan van de Ven <[email protected]>
> ---
> kernel/hrtimer.c | 2 +-
> 1 file changed, 1 insertion(+), 1 deletion(-)
>
> Index: linux/kernel/hrtimer.c
> ===================================================================
> --- linux.orig/kernel/hrtimer.c
> +++ linux/kernel/hrtimer.c
> @@ -786,7 +786,7 @@ static void __devinit init_hrtimers_cpu(
> int i;
>
> for (i = 0; i < MAX_HRTIMER_BASES; i++, base++)
> - spin_lock_init(&base->lock);
> + spin_lock_init_static(&base->lock);
> }
>

Perhaps the validator core's implementation of spin_lock_init() could look
at the address and work out if it's within the static storage sections.
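
A rough sketch of the kind of address test being suggested, assuming the usual
_stext/_end linker symbols plus the __module_address() helper from patch 06
(names and details here are guesses, not the actual implementation):

extern char _stext[], _end[];

static int lock_is_static(const void *obj)
{
        unsigned long addr = (unsigned long)obj;

        /* inside the core kernel image (text/data/bss)? */
        if (addr >= (unsigned long)_stext && addr < (unsigned long)_end)
                return 1;

        /* otherwise, inside some module's core area? */
        return __module_address(addr);
}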

2006-05-30 01:32:30

by Andrew Morton

[permalink] [raw]
Subject: Re: [patch 61/61] lock validator: enable lock validator in Kconfig

On Mon, 29 May 2006 23:28:12 +0200
Ingo Molnar <[email protected]> wrote:

> offer the following lock validation options:
>
> CONFIG_PROVE_SPIN_LOCKING
> CONFIG_PROVE_RW_LOCKING
> CONFIG_PROVE_MUTEX_LOCKING
> CONFIG_PROVE_RWSEM_LOCKING
>
> Signed-off-by: Ingo Molnar <[email protected]>
> Signed-off-by: Arjan van de Ven <[email protected]>
> ---
> lib/Kconfig.debug | 167 ++++++++++++++++++++++++++++++++++++++++++++++++++++++
> 1 file changed, 167 insertions(+)
>
> Index: linux/lib/Kconfig.debug
> ===================================================================
> --- linux.orig/lib/Kconfig.debug
> +++ linux/lib/Kconfig.debug
> @@ -184,6 +184,173 @@ config DEBUG_SPINLOCK
> best used in conjunction with the NMI watchdog so that spinlock
> deadlocks are also debuggable.
>
> +config PROVE_SPIN_LOCKING
> + bool "Prove spin-locking correctness"
> + default y

err, I think I'll be sticking a `depends on X86' in there, thanks very
much. I'd prefer that you be the first to test it ;)

2006-05-30 01:32:38

by Andrew Morton

[permalink] [raw]
Subject: Re: [patch 59/61] lock validator: special locking: xfrm

On Mon, 29 May 2006 23:27:51 +0200
Ingo Molnar <[email protected]> wrote:

> From: Ingo Molnar <[email protected]>
>
> teach special (non-nested) unlocking code to the lock validator. Has no
> effect on non-lockdep kernels.
>
> Signed-off-by: Ingo Molnar <[email protected]>
> Signed-off-by: Arjan van de Ven <[email protected]>
> ---
> net/xfrm/xfrm_policy.c | 2 +-
> net/xfrm/xfrm_state.c | 2 +-
> 2 files changed, 2 insertions(+), 2 deletions(-)
>
> Index: linux/net/xfrm/xfrm_policy.c
> ===================================================================
> --- linux.orig/net/xfrm/xfrm_policy.c
> +++ linux/net/xfrm/xfrm_policy.c
> @@ -1308,7 +1308,7 @@ static struct xfrm_policy_afinfo *xfrm_p
> afinfo = xfrm_policy_afinfo[family];
> if (likely(afinfo != NULL))
> read_lock(&afinfo->lock);
> - read_unlock(&xfrm_policy_afinfo_lock);
> + read_unlock_non_nested(&xfrm_policy_afinfo_lock);
> return afinfo;
> }
>
> Index: linux/net/xfrm/xfrm_state.c
> ===================================================================
> --- linux.orig/net/xfrm/xfrm_state.c
> +++ linux/net/xfrm/xfrm_state.c
> @@ -1105,7 +1105,7 @@ static struct xfrm_state_afinfo *xfrm_st
> afinfo = xfrm_state_afinfo[family];
> if (likely(afinfo != NULL))
> read_lock(&afinfo->lock);
> - read_unlock(&xfrm_state_afinfo_lock);
> + read_unlock_non_nested(&xfrm_state_afinfo_lock);
> return afinfo;
> }
>

I got a bunch of rejects here due to changes in git-net.patch. Please
verify the result. It could well be wrong (the changes in there are odd).

2006-05-30 01:31:51

by Andrew Morton

[permalink] [raw]
Subject: Re: [patch 46/61] lock validator: special locking: slab

On Mon, 29 May 2006 23:26:49 +0200
Ingo Molnar <[email protected]> wrote:

> + /*
> + * Do not assume that spinlocks can be initialized via memcpy:
> + */

I'd view that as something which should be fixed in mainline.

2006-05-30 01:33:14

by Nathan Scott

[permalink] [raw]
Subject: Re: [patch 11/61] lock validator: lockdep: small xfs init_rwsem() cleanup

On Mon, May 29, 2006 at 06:33:41PM -0700, Andrew Morton wrote:
> I'll queue this for mainline, via the XFS tree.

Thanks Andrew, it's merged in our tree now.

--
Nathan

2006-05-30 01:30:14

by Andrew Morton

[permalink] [raw]
Subject: Re: [patch 02/61] lock validator: forcedeth.c fix

On Mon, 29 May 2006 23:23:13 +0200
Ingo Molnar <[email protected]> wrote:

> nv_do_nic_poll() is called from timer softirqs, which has interrupts
> enabled, but np->lock might also be taken by some other interrupt
> context.

But the driver does disable_irq(), so I'd say this was a false-positive.

And afaict this is not a timer handler - it's a poll_controller handler
(although maybe that gets called from a timer handler somewhere?)

That being said, doing disable_irq() from a poll_controller handler is
downright scary.

Anyway, I'll tentatively mark this as a lockdep workaround, not a bugfix.

2006-05-30 01:34:25

by Andrew Morton

[permalink] [raw]
Subject: Re: [patch 36/61] lock validator: special locking: serial

On Mon, 29 May 2006 23:26:04 +0200
Ingo Molnar <[email protected]> wrote:

> From: Ingo Molnar <[email protected]>
>
> teach special (dual-initialized) locking code to the lock validator.
> Has no effect on non-lockdep kernels.
>

This isn't an adequate description of the problem which this patch is
solving, IMO.

I _assume_ the validator is using the instruction pointer of the
spin_lock_init() site (or the file-n-line) as the lock's identifier. Or
something?

>
> Index: linux/drivers/serial/serial_core.c
> ===================================================================
> --- linux.orig/drivers/serial/serial_core.c
> +++ linux/drivers/serial/serial_core.c
> @@ -1849,6 +1849,12 @@ static const struct baud_rates baud_rate
> { 0, B38400 }
> };
>
> +/*
> + * lockdep: port->lock is initialized in two places, but we
> + * want only one lock-type:
> + */
> +static struct lockdep_type_key port_lock_key;
> +
> /**
> * uart_set_options - setup the serial console parameters
> * @port: pointer to the serial ports uart_port structure
> @@ -1869,7 +1875,7 @@ uart_set_options(struct uart_port *port,
> * Ensure that the serial console lock is initialised
> * early.
> */
> - spin_lock_init(&port->lock);
> + spin_lock_init_key(&port->lock, &port_lock_key);
>
> memset(&termios, 0, sizeof(struct termios));
>
> @@ -2255,7 +2261,7 @@ int uart_add_one_port(struct uart_driver
> * initialised.
> */
> if (!(uart_console(port) && (port->cons->flags & CON_ENABLED)))
> - spin_lock_init(&port->lock);
> + spin_lock_init_key(&port->lock, &port_lock_key);
>
> uart_configure_port(drv, state, port);
>

Is there a cleaner way of doing this?

Perhaps write a new helper function which initialises the spinlock, call
that? Rather than open-coding lockdep stuff?
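
One possible shape for such a helper - it keeps the single lock type but hides
the lockdep detail behind one function (name and placement are hypothetical):

/*
 * All uart_port locks share one lockdep type; callers need not know
 * about the key.
 */
static void uart_port_spin_lock_init(struct uart_port *port)
{
        static struct lockdep_type_key port_lock_key;

        spin_lock_init_key(&port->lock, &port_lock_key);
}

Both quoted call sites would then simply call uart_port_spin_lock_init(port).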

2006-05-30 01:33:49

by Andrew Morton

[permalink] [raw]
Subject: Re: [patch 52/61] lock validator: special locking: af_unix

On Mon, 29 May 2006 23:27:19 +0200
Ingo Molnar <[email protected]> wrote:

> From: Ingo Molnar <[email protected]>
>
> teach special (recursive) locking code to the lock validator. Has no
> effect on non-lockdep kernels.
>
> (includes workaround for sk_receive_queue.lock, which is currently
> treated globally by the lock validator, but which be switched to
> per-address-family locking rules.)
>
> ...
>
>
> - spin_lock(&sk->sk_receive_queue.lock);
> + spin_lock_bh(&sk->sk_receive_queue.lock);

Again, a bit of a show-stopper. Will the real fix be far off?

2006-05-30 01:34:55

by Andrew Morton

[permalink] [raw]
Subject: Re: [patch 51/61] lock validator: special locking: sock_lock_init()

On Mon, 29 May 2006 23:27:14 +0200
Ingo Molnar <[email protected]> wrote:

> From: Ingo Molnar <[email protected]>
>
> teach special (multi-initialized, per-address-family) locking code to the
> lock validator. Has no effect on non-lockdep kernels.
>
> Index: linux/include/net/sock.h
> ===================================================================
> --- linux.orig/include/net/sock.h
> +++ linux/include/net/sock.h
> @@ -81,12 +81,6 @@ typedef struct {
> wait_queue_head_t wq;
> } socket_lock_t;
>
> -#define sock_lock_init(__sk) \
> -do { spin_lock_init(&((__sk)->sk_lock.slock)); \
> - (__sk)->sk_lock.owner = NULL; \
> - init_waitqueue_head(&((__sk)->sk_lock.wq)); \
> -} while(0)
> -
> struct sock;
> struct proto;
>
> Index: linux/net/core/sock.c
> ===================================================================
> --- linux.orig/net/core/sock.c
> +++ linux/net/core/sock.c
> @@ -739,6 +739,27 @@ lenout:
> return 0;
> }
>
> +/*
> + * Each address family might have different locking rules, so we have
> + * one slock key per address family:
> + */
> +static struct lockdep_type_key af_family_keys[AF_MAX];
> +
> +static void noinline sock_lock_init(struct sock *sk)
> +{
> + spin_lock_init_key(&sk->sk_lock.slock, af_family_keys + sk->sk_family);
> + sk->sk_lock.owner = NULL;
> + init_waitqueue_head(&sk->sk_lock.wq);
> +}

OK, no code outside net/core/sock.c uses sock_lock_init().

Hopefully the same is true of out-of-tree code...

2006-05-30 01:34:36

by Andrew Morton

[permalink] [raw]
Subject: Re: [patch 37/61] lock validator: special locking: dcache

On Mon, 29 May 2006 23:26:08 +0200
Ingo Molnar <[email protected]> wrote:

> From: Ingo Molnar <[email protected]>
>
> teach special (recursive) locking code to the lock validator. Has no
> effect on non-lockdep kernels.
>
> Signed-off-by: Ingo Molnar <[email protected]>
> Signed-off-by: Arjan van de Ven <[email protected]>
> ---
> fs/dcache.c | 6 +++---
> include/linux/dcache.h | 12 ++++++++++++
> 2 files changed, 15 insertions(+), 3 deletions(-)
>
> Index: linux/fs/dcache.c
> ===================================================================
> --- linux.orig/fs/dcache.c
> +++ linux/fs/dcache.c
> @@ -1380,10 +1380,10 @@ void d_move(struct dentry * dentry, stru
> */
> if (target < dentry) {
> spin_lock(&target->d_lock);
> - spin_lock(&dentry->d_lock);
> + spin_lock_nested(&dentry->d_lock, DENTRY_D_LOCK_NESTED);
> } else {
> spin_lock(&dentry->d_lock);
> - spin_lock(&target->d_lock);
> + spin_lock_nested(&target->d_lock, DENTRY_D_LOCK_NESTED);
> }
>
> /* Move the dentry to the target hash queue, if on different bucket */
> @@ -1420,7 +1420,7 @@ already_unhashed:
> }
>
> list_add(&dentry->d_u.d_child, &dentry->d_parent->d_subdirs);
> - spin_unlock(&target->d_lock);
> + spin_unlock_non_nested(&target->d_lock);
> fsnotify_d_move(dentry);
> spin_unlock(&dentry->d_lock);
> write_sequnlock(&rename_lock);
> Index: linux/include/linux/dcache.h
> ===================================================================
> --- linux.orig/include/linux/dcache.h
> +++ linux/include/linux/dcache.h
> @@ -114,6 +114,18 @@ struct dentry {
> unsigned char d_iname[DNAME_INLINE_LEN_MIN]; /* small names */
> };
>
> +/*
> + * dentry->d_lock spinlock nesting types:
> + *
> + * 0: normal
> + * 1: nested
> + */
> +enum dentry_d_lock_type
> +{
> + DENTRY_D_LOCK_NORMAL,
> + DENTRY_D_LOCK_NESTED
> +};
> +
> struct dentry_operations {
> int (*d_revalidate)(struct dentry *, struct nameidata *);
> int (*d_hash) (struct dentry *, struct qstr *);

DENTRY_D_LOCK_NORMAL isn't used anywhere.
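
Presumably DENTRY_D_LOCK_NORMAL (== 0) is only there to document the default:
a plain spin_lock() uses subtype 0, and only the inner lock of the pair needs
the explicit annotation, as in the quoted d_move() hunk:

        spin_lock(&target->d_lock);                       /* subtype 0, "normal" */
        spin_lock_nested(&dentry->d_lock, DENTRY_D_LOCK_NESTED); /* subtype 1 */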

2006-05-30 01:33:50

by Andrew Morton

[permalink] [raw]
Subject: Re: [patch 55/61] lock validator: special locking: sb->s_umount

On Mon, 29 May 2006 23:27:32 +0200
Ingo Molnar <[email protected]> wrote:

> From: Ingo Molnar <[email protected]>
>
> workaround for special sb->s_umount locking rule.
>
> s_umount gets held across a series of lock dropping and releasing
> in prune_one_dentry(), so i changed the order, at the risk of
> introducing a umount race. FIXME.
>
> i think a better fix would be to do the unlocks as _non_nested in
> prune_one_dentry(), and to do the up_read() here as
> an up_read_non_nested() as well?
>
> Signed-off-by: Ingo Molnar <[email protected]>
> Signed-off-by: Arjan van de Ven <[email protected]>
> ---
> fs/dcache.c | 3 ++-
> 1 file changed, 2 insertions(+), 1 deletion(-)
>
> Index: linux/fs/dcache.c
> ===================================================================
> --- linux.orig/fs/dcache.c
> +++ linux/fs/dcache.c
> @@ -470,8 +470,9 @@ static void prune_dcache(int count, stru
> s_umount = &dentry->d_sb->s_umount;
> if (down_read_trylock(s_umount)) {
> if (dentry->d_sb->s_root != NULL) {
> - prune_one_dentry(dentry);
> +// lockdep hack: do this better!
> up_read(s_umount);
> + prune_one_dentry(dentry);
> continue;

argh, you broke my kernel!

I'll whack some ifdefs in here so it's only known-broken if CONFIG_LOCKDEP.

Again, we'd need the real fix here.

2006-05-30 01:34:54

by Andrew Morton

[permalink] [raw]
Subject: Re: [patch 16/61] lock validator: fown locking workaround

On Mon, 29 May 2006 23:24:23 +0200
Ingo Molnar <[email protected]> wrote:

> temporary workaround for the lock validator: make all uses of
> f_owner.lock irq-safe. (The real solution will be to express to
> the lock validator that f_owner.lock rules are to be generated
> per-filesystem.)

This description forgot to tell us what problem is being worked around.

This patch is a bit of a show-stopper. How hard-n-bad is the real fix?

2006-05-30 01:31:06

by Andrew Morton

[permalink] [raw]
Subject: Re: [patch 00/61] ANNOUNCE: lock validator -V1

On Mon, 29 May 2006 23:21:09 +0200
Ingo Molnar <[email protected]> wrote:

> We are pleased to announce the first release of the "lock dependency
> correctness validator" kernel debugging feature

What are the runtime speed and space costs of enabling this?

2006-05-30 01:36:07

by Andrew Morton

[permalink] [raw]
Subject: Re: [patch 21/61] lock validator: lockdep: add local_irq_enable_in_hardirq() API.

On Mon, 29 May 2006 23:24:52 +0200
Ingo Molnar <[email protected]> wrote:

> introduce local_irq_enable_in_hardirq() API. It is currently
> aliased to local_irq_enable(), hence has no functional effects.
>
> This API will be used by lockdep, but even without lockdep
> this will better document places in the kernel where a hardirq
> context enables hardirqs.

If we expect people to use this then we'd best whack a comment over it.

Also, trace_irqflags.h doesn't seem an appropriate place for it to live.

I trust all the affected files are including trace_irqflags.h by some
means. Hopefully a _reliable_ means. No doubt I'm about to find out ;)
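
A sketch of the sort of comment that could sit next to the definition (the
wording here is illustrative; the trivial expansion is per the description
above):

/*
 * local_irq_enable_in_hardirq() - enable hardirqs from hardirq context.
 *
 * Functionally identical to local_irq_enable() for now, but spelled
 * differently so that code which deliberately re-enables interrupts
 * while running in hardirq context is self-documenting - and so that
 * lockdep can attach extra semantics to it later.
 */
#define local_irq_enable_in_hardirq()   local_irq_enable()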

2006-05-30 01:36:21

by Andrew Morton

[permalink] [raw]
Subject: Re: [patch 27/61] lock validator: prove spinlock/rwlock locking correctness

On Mon, 29 May 2006 23:25:23 +0200
Ingo Molnar <[email protected]> wrote:

> +# define spin_lock_init_key(lock, key) \
> + __spin_lock_init((lock), #lock, key)

erk. This adds a whole new layer of obfuscation on top of the existing
spinlock header files. You already need to run the preprocessor and
disassembler to even work out which flavour you're presently using.

Ho hum.

2006-05-30 01:37:05

by Andrew Morton

[permalink] [raw]
Subject: Re: [patch 34/61] lock validator: special locking: bdev

On Mon, 29 May 2006 23:25:54 +0200
Ingo Molnar <[email protected]> wrote:

> From: Ingo Molnar <[email protected]>
>
> teach special (recursive) locking code to the lock validator. Has no
> effect on non-lockdep kernels.
>

There's no description here of the problem which is being worked around.
This leaves everyone in the dark.

> +static int
> +blkdev_get_whole(struct block_device *bdev, mode_t mode, unsigned flags)
> +{
> + /*
> + * This crockload is due to bad choice of ->open() type.
> + * It will go away.
> + * For now, block device ->open() routine must _not_
> + * examine anything in 'inode' argument except ->i_rdev.
> + */
> + struct file fake_file = {};
> + struct dentry fake_dentry = {};
> + fake_file.f_mode = mode;
> + fake_file.f_flags = flags;
> + fake_file.f_dentry = &fake_dentry;
> + fake_dentry.d_inode = bdev->bd_inode;
> +
> + return do_open(bdev, &fake_file, BD_MUTEX_WHOLE);
> +}

"crock" is a decent description ;)

How long will this live, and what will the fix look like?

(This is all a bit of a pain - carrying these patches in -mm will require
some effort, and they're not ready to go yet, which will lengthen the pain
arbitrarily).

2006-05-30 01:36:51

by Andrew Morton

[permalink] [raw]
Subject: Re: [patch 22/61] lock validator: add per_cpu_offset()

On Mon, 29 May 2006 23:24:57 +0200
Ingo Molnar <[email protected]> wrote:

> From: Ingo Molnar <[email protected]>
>
> add the per_cpu_offset() generic method. (used by the lock validator)
>
> Signed-off-by: Ingo Molnar <[email protected]>
> Signed-off-by: Arjan van de Ven <[email protected]>
> ---
> include/asm-generic/percpu.h | 2 ++
> include/asm-x86_64/percpu.h | 2 ++
> 2 files changed, 4 insertions(+)
>
> Index: linux/include/asm-generic/percpu.h
> ===================================================================
> --- linux.orig/include/asm-generic/percpu.h
> +++ linux/include/asm-generic/percpu.h
> @@ -7,6 +7,8 @@
>
> extern unsigned long __per_cpu_offset[NR_CPUS];
>
> +#define per_cpu_offset(x) (__per_cpu_offset[x])
> +
> /* Separate out the type, so (int[3], foo) works. */
> #define DEFINE_PER_CPU(type, name) \
> __attribute__((__section__(".data.percpu"))) __typeof__(type) per_cpu__##name
> Index: linux/include/asm-x86_64/percpu.h
> ===================================================================
> --- linux.orig/include/asm-x86_64/percpu.h
> +++ linux/include/asm-x86_64/percpu.h
> @@ -14,6 +14,8 @@
> #define __per_cpu_offset(cpu) (cpu_pda(cpu)->data_offset)
> #define __my_cpu_offset() read_pda(data_offset)
>
> +#define per_cpu_offset(x) (__per_cpu_offset(x))
> +
> /* Separate out the type, so (int[3], foo) works. */
> #define DEFINE_PER_CPU(type, name) \
> __attribute__((__section__(".data.percpu"))) __typeof__(type) per_cpu__##name

I can tell just looking at it that it'll break various builds. I assume that
things still happen to compile because you're presently using it in code
which those architectures don't presently compile.

But introducing a "generic" function invites others to start using it. And
they will, and they'll ship code which "works" but is broken, because they
only tested it on x86 and x86_64.

I'll queue the needed fixups - please check it.

2006-05-30 01:30:01

by Andrew Morton

[permalink] [raw]
Subject: Re: [patch 12/61] lock validator: beautify x86_64 stacktraces

On Mon, 29 May 2006 23:24:05 +0200
Ingo Molnar <[email protected]> wrote:

> beautify x86_64 stacktraces to be more readable.

One reject fixed due to the backtrace changes in Andi's tree.

I'll get all this compiling, but please review and test the end result to
make sure that it all landed OK.

2006-05-30 01:38:04

by Andrew Morton

[permalink] [raw]
Subject: Re: [patch 17/61] lock validator: sk_callback_lock workaround

On Mon, 29 May 2006 23:24:27 +0200
Ingo Molnar <[email protected]> wrote:

> temporary workaround for the lock validator: make all uses of
> sk_callback_lock softirq-safe. (The real solution will be to
> express to the lock validator that sk_callback_lock rules are
> to be generated per-address-family.)

Ditto. What's the actual problem being worked around here, and how's the
real fix shaping up?


2006-05-30 04:50:16

by Mike Galbraith

[permalink] [raw]
Subject: Re: [patch 00/61] ANNOUNCE: lock validator -V1

On Mon, 2006-05-29 at 23:21 +0200, Ingo Molnar wrote:
> The easiest way to try lockdep on a testbox is to apply the combo patch
> to 2.6.17-rc4-mm3. The patch order is:
>
> http://kernel.org/pub/linux/kernel/v2.6/testing/linux-2.6.17-rc4.tar.bz2
> http://kernel.org/pub/linux/kernel/people/akpm/patches/2.6/2.6.17-rc4/2.6.17-rc4-mm3/2.6.17-rc4-mm3.bz2
> http://redhat.com/~mingo/lockdep-patches/lockdep-combo.patch
>
> do 'make oldconfig' and accept all the defaults for new config options -
> reboot into the kernel and if everything goes well it should boot up
> fine and you should have /proc/lockdep and /proc/lockdep_stats files.

Darn. It said all tests passed, then oopsed.

(have .config all gzipped up if you want it)

-Mike

BUG: unable to handle kernel NULL pointer dereference at virtual address 00000000
printing eip:
b103a872
*pde = 00000000
Oops: 0000 [#1]
PREEMPT SMP
last sysfs file:
Modules linked in:
CPU: 0
EIP: 0060:[<b103a872>] Not tainted VLI
EFLAGS: 00010083 (2.6.17-rc4-mm3-smp #157)
EIP is at count_matching_names+0x5b/0xa2
eax: b15074a8 ebx: 00000000 ecx: b165c430 edx: b165b320
esi: 00000000 edi: b1410423 ebp: dfe20e74 esp: dfe20e68
ds: 007b es: 007b ss: 0068
Process idle (pid: 1, threadinfo=dfe20000 task=effc1470)
Stack: 000139b0 b165c430 00000000 dfe20ec8 b103d442 b1797a6c b1797a64 effc1470
b1797a64 00000004 b1797a50 00000000 b15074a8 effc1470 dfe20ef8 b106da88
b169d0a8 b1797a64 dfe20f52 0000000a b106dec7 00000282 dfe20000 00000000
Call Trace:
<b1003d73> show_stack_log_lvl+0x9e/0xc3 <b1003f80> show_registers+0x1ac/0x237
<b100413d> die+0x132/0x2fb <b101a083> do_page_fault+0x5cf/0x656
<b10038a7> error_code+0x4f/0x54 <b103d442> __lockdep_acquire+0xa6f/0xc32
<b103d9f8> lockdep_acquire+0x61/0x77 <b13d27f3> _spin_lock+0x2e/0x42
<b102b03a> register_sysctl_table+0x4e/0xaa <b15a463a> sched_init_smp+0x411/0x41e
<b100035d> init+0xbd/0x2c6 <b1001005> kernel_thread_helper+0x5/0xb
Code: 92 50 b1 74 5d 8b 41 10 2b 41 14 31 db 39 42 10 75 0d eb 53 8b 41 10 2b 41 14 3b 42 10 74 48 8b b2 a0 00 00 00 8b b9 a0 00 00 00 <ac> ae 75 08 84 c0 75 f8 31 c0 eb 04 19 c0 0c 01 85 c0 75 0b 8b

1151 list_for_each_entry(type, &all_lock_types, lock_entry) {
1152 if (new_type->key - new_type->subtype == type->key)
1153 return type->name_version;
1154 if (!strcmp(type->name, new_type->name)) <--kaboom
1155 count = max(count, type->name_version);
1156 }

EIP: [<b103a872>] count_matching_names+0x5b/0xa2 SS:ESP 0068:dfe20e68
Kernel panic - not syncing: Attempted to kill init!
BUG: warning at arch/i386/kernel/smp.c:537/smp_call_function()
<b1003dd2> show_trace+0xd/0xf <b10044c0> dump_stack+0x17/0x19
<b10129ff> smp_call_function+0x11d/0x122 <b1012a22> smp_send_stop+0x1e/0x31
<b1022f4b> panic+0x60/0x1d5 <b10267fa> do_exit+0x613/0x94f
<b1004306> do_trap+0x0/0x9e <b101a083> do_page_fault+0x5cf/0x656
<b10038a7> error_code+0x4f/0x54 <b103d442> __lockdep_acquire+0xa6f/0xc32
<b103d9f8> lockdep_acquire+0x61/0x77 <b13d27f3> _spin_lock+0x2e/0x42
<b102b03a> register_sysctl_table+0x4e/0xaa <b15a463a> sched_init_smp+0x411/0x41e
<b100035d> init+0xbd/0x2c6 <b1001005> kernel_thread_helper+0x5/0xb
BUG: NMI Watchdog detected LOCKUP on CPU1, eip b103cc64, registers:
Modules linked in:
CPU: 1
EIP: 0060:[<b103cc64>] Not tainted VLI
EFLAGS: 00000086 (2.6.17-rc4-mm3-smp #157)
EIP is at __lockdep_acquire+0x291/0xc32
eax: 00000000 ebx: 000001d7 ecx: b16bf938 edx: 00000000
esi: 00000000 edi: b16bf938 ebp: effc4ea4 esp: effc4e58
ds: 007b es: 007b ss: 0068
Process idle (pid: 0, threadinfo=effc4000 task=effc0a50)
Stack: b101d4ce 00000000 effc0fb8 000001d7 effc0a50 b16bf938 00000000 b29b38c8
effc0a50 effc0fb8 00000001 00000000 00000005 00000000 00000000 00000000
00000096 effc4000 00000000 effc4ecc b103d9f8 00000000 00000001 b101d4ce
Call Trace:
<b1003d73> show_stack_log_lvl+0x9e/0xc3 <b1003f80> show_registers+0x1ac/0x237
<b10050d9> die_nmi+0x93/0xeb <b1015af1> nmi_watchdog_tick+0xff/0x20e
<b1004542> do_nmi+0x80/0x249 <b1003912> nmi_stack_correct+0x1d/0x22
<b103d9f8> lockdep_acquire+0x61/0x77 <b13d27f3> _spin_lock+0x2e/0x42
<b101d4ce> scheduler_tick+0xd0/0x381 <b102d47e> update_process_times+0x42/0x61
<b1014f9f> smp_apic_timer_interrupt+0x67/0x78 <b10037ba> apic_timer_interrupt+0x2a/0x30
<b1001e5b> cpu_idle+0x71/0xb8 <b1013c6e> start_secondary+0x3e5/0x46b
<00000000> _stext+0x4efffd68/0x8 <effc4fb4> 0xeffc4fb4
Code: 18 01 90 39 c7 0f 84 2e 02 00 00 8b 50 0c 31 f2 8b 40 08 31 d8 09 c2 75 e2 f0 ff 05 08 8a 61 b1 f0 fe 0d e4 92 50 b1 79 0d f3 90 <80> 3d e4 92 50 b1 00 7e f5 eb ea 8b 55 d4 8b b2 64 05 00 00 85
console shuts up ...


2006-05-30 05:13:33

by Arjan van de Ven

[permalink] [raw]
Subject: Re: [patch 34/61] lock validator: special locking: bdev

On Mon, 2006-05-29 at 18:35 -0700, Andrew Morton wrote:
> On Mon, 29 May 2006 23:25:54 +0200
> Ingo Molnar <[email protected]> wrote:
>
> > From: Ingo Molnar <[email protected]>
> >
> > teach special (recursive) locking code to the lock validator. Has no
> > effect on non-lockdep kernels.
> >
>
> There's no description here of the problem which is being worked around.
> This leaves everyone in the dark.

it's not really a workaround, it's a "separate the uses" thing. The real
problem is an inherent hierarchy between "disk" and "partition", where
lots of code assumes you can first take the disk mutex, and then the
partition mutex, and never deadlock. This patch basically separates the
"get me the disk" versus "get me the partition" uses.

2006-05-30 05:20:18

by Arjan van de Ven

[permalink] [raw]
Subject: Re: [patch 00/61] ANNOUNCE: lock validator -V1

On Tue, 2006-05-30 at 00:28 +0200, Michal Piotrowski wrote:
> On 29/05/06, Ingo Molnar <[email protected]> wrote:
> > We are pleased to announce the first release of the "lock dependency
> > correctness validator" kernel debugging feature, which can be downloaded
> > from:
> >
> > http://redhat.com/~mingo/lockdep-patches/
> >
> [snip]
>
> I get this while loading cpufreq modules

can you enable CONFIG_KALLSYMS_ALL? that will give a more accurate
debug output...

2006-05-30 05:45:52

by Arjan van de Ven

[permalink] [raw]
Subject: Re: [patch 00/61] ANNOUNCE: lock validator -V1


> I'm feeling a bit overwhelmed by the voluminous output of this checker.
> Especially as (directly at least) cpufreq doesn't touch vma's, or mmap's.

the reporter doesn't have CONFIG_KALLSYMS_ALL enabled, which sometimes
gives misleading backtraces (should lockdep just enable KALLSYMS_ALL
to get more useful bug reports?)

the problem is this, there are 2 scenarios in this bug:

One
---
store_scaling_governor takes policy->lock and then calls __cpufreq_set_policy
__cpufreq_set_policy calls __cpufreq_governor
__cpufreq_governor calls __cpufreq_driver_target via cpufreq_governor_performance
__cpufreq_driver_target calls lock_cpu_hotplug() (which takes the hotplug lock)


Two
---
cpufreq_stats_init calls lock_cpu_hotplug() and then calls cpufreq_stat_cpu_callback
cpufreq_stat_cpu_callback calls cpufreq_update_policy
cpufreq_update_policy takes the policy->lock


so this looks like a real honest AB-BA deadlock to me...
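
Reduced to just the lock operations, the two paths take the same two locks in
opposite order (illustrative pseudo-paths, not the literal cpufreq code):

        /* path one: writing scaling_governor via sysfs */
        mutex_lock(&policy->lock);
        lock_cpu_hotplug();              /* acquires cpucontrol */
        /* ... */
        unlock_cpu_hotplug();
        mutex_unlock(&policy->lock);

        /* path two: cpufreq_stats module init */
        lock_cpu_hotplug();              /* acquires cpucontrol */
        mutex_lock(&policy->lock);       /* opposite order: classic AB-BA */
        /* ... */
        mutex_unlock(&policy->lock);
        unlock_cpu_hotplug();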


2006-05-30 05:52:16

by Michal Piotrowski

[permalink] [raw]
Subject: Re: [patch 00/61] ANNOUNCE: lock validator -V1

Hi,

On 30/05/06, Dave Jones <[email protected]> wrote:
> On Tue, May 30, 2006 at 12:41:08AM +0200, Ingo Molnar wrote:
>
> > > =====================================================
> > > [ BUG: possible circular locking deadlock detected! ]
> > > -----------------------------------------------------
> > > modprobe/1942 is trying to acquire lock:
> > > (&anon_vma->lock){--..}, at: [<c10609cf>] anon_vma_link+0x1d/0xc9
> > >
> > > but task is already holding lock:
> > > (&mm->mmap_sem/1){--..}, at: [<c101e5a0>] copy_process+0xbc6/0x1519
> > >
> > > which lock already depends on the new lock,
> > > which could lead to circular deadlocks!
> >
> > hm, this one could perhaps be a real bug. Dave: lockdep complains about
> > having observed:
> >
> > anon_vma->lock => mm->mmap_sem
> > mm->mmap_sem => anon_vma->lock
> >
> > locking sequences, in the cpufreq code. Is there some special runtime
> > behavior that still makes this safe, or is it a real bug?
>
> I'm feeling a bit overwhelmed by the voluminous output of this checker.
> Especially as (directly at least) cpufreq doesn't touch vma's, or mmap's.
>
> The first stack trace it shows has us down in the bowels of cpu hotplug,
> where we're taking the cpucontrol sem. The second stack trace shows
> us in cpufreq_update_policy taking a per-cpu data->lock semaphore.
>
> Now, I notice this is modprobe triggering this, and this *looks* like
> we're loading two modules simultaneously (the first trace is from a
> scaling driver like powernow-k8 or the like, whilst the second trace
> is from cpufreq-stats).

/etc/init.d/cpuspeed starts very early
$ ls /etc/rc5.d/ | grep cpu
S06cpuspeed

I have this in my /etc/rc.local
modprobe -i cpufreq_conservative
modprobe -i cpufreq_ondemand
modprobe -i cpufreq_powersave
modprobe -i cpufreq_stats
modprobe -i cpufreq_userspace
modprobe -i freq_table

>
> How on earth did we get into this situation?

Just before gdm starts, while /etc/rc.local is processed.

> module loading is supposed
> to be serialised on the module_mutex no ?
>
> It's been a while since a debug patch has sent me in search of paracetamol ;)
>
> Dave

Regards,
Michal

--
Michal K. K. Piotrowski
LTG - Linux Testers Group
(http://www.stardust.webpages.pl/ltg/wiki/)

2006-05-30 06:07:05

by Michal Piotrowski

[permalink] [raw]
Subject: Re: [patch 00/61] ANNOUNCE: lock validator -V1

Hi,

On 30/05/06, Arjan van de Ven <[email protected]> wrote:
>
> > I'm feeling a bit overwhelmed by the voluminous output of this checker.
> > Especially as (directly at least) cpufreq doesn't touch vma's, or mmap's.
>
> the reporter doesn't have CONFIG_KALLSYMS_ALL enabled which gives
> sometimes misleading backtraces (should lockdep just enable KALLSYMS_ALL
> to get more useful bugreports?)

Here is bug with CONFIG_KALLSYMS_ALL enabled.

=====================================================
[ BUG: possible circular locking deadlock detected! ]
-----------------------------------------------------
modprobe/1950 is trying to acquire lock:
(&sighand->siglock){.+..}, at: [<c102b632>] do_notify_parent+0x12b/0x1b9

but task is already holding lock:
(tasklist_lock){..-<B1>}, at: [<c1023473>] do_exit+0x608/0xa43

which lock already depends on the new lock,
which could lead to circular deadlocks!

the existing dependency chain (in reverse order) is:

-> #1 (cpucontrol){--..}:
[<c10394be>] lockdep_acquire+0x69/0x82
[<c11ed729>] __mutex_lock_slowpath+0xd0/0x347
[<c11ed9bc>] mutex_lock+0x1c/0x1f
[<c103dda5>] __lock_cpu_hotplug+0x36/0x56
[<c103ddde>] lock_cpu_hotplug+0xa/0xc
[<c1199dd6>] __cpufreq_driver_target+0x15/0x50
[<c119a192>] cpufreq_governor_performance+0x1a/0x20
[<c1198ada>] __cpufreq_governor+0xa0/0x1a9
[<c1198cb2>] __cpufreq_set_policy+0xcf/0x100
[<c1199196>] cpufreq_set_policy+0x2d/0x6f
[<c1199c7e>] cpufreq_add_dev+0x34f/0x492
[<c114b898>] sysdev_driver_register+0x58/0x9b
[<c119a006>] cpufreq_register_driver+0x80/0xf4
[<fd91402a>] ipt_local_out_hook+0x2a/0x65 [iptable_filter]
[<c10410e1>] sys_init_module+0xa6/0x230
[<c11ef97b>] sysenter_past_esp+0x54/0x8d

-> #0 (&sighand->siglock){.+..}:
[<c10394be>] lockdep_acquire+0x69/0x82
[<c11ed729>] __mutex_lock_slowpath+0xd0/0x347
[<c11ed9bc>] mutex_lock+0x1c/0x1f
[<c11990bb>] cpufreq_update_policy+0x34/0xd8
[<fd9a350b>] cpufreq_stat_cpu_callback+0x1b/0x7c [cpufreq_stats]
[<fd9a607d>] cpufreq_stats_init+0x7d/0x9b [cpufreq_stats]
[<c10410e1>] sys_init_module+0xa6/0x230
[<c11ef97b>] sysenter_past_esp+0x54/0x8d

other info that might help us debug this:

1 locks held by modprobe/1950:
#0: (cpucontrol){--..}, at: [<c11ed9bc>] mutex_lock+0x1c/0x1f

stack backtrace:
[<c1003ed6>] show_trace+0xd/0xf
[<c10043e9>] dump_stack+0x17/0x19
[<c103863e>] print_circular_bug_tail+0x59/0x64
[<c1038e91>] __lockdep_acquire+0x848/0xa39
[<c10394be>] lockdep_acquire+0x69/0x82
[<c11ed729>] __mutex_lock_slowpath+0xd0/0x347
[<c11ed9bc>] mutex_lock+0x1c/0x1f
[<c11990bb>] cpufreq_update_policy+0x34/0xd8
[<fd9a350b>] cpufreq_stat_cpu_callback+0x1b/0x7c [cpufreq_stats]
[<fd9a607d>] cpufreq_stats_init+0x7d/0x9b [cpufreq_stats]
[<c10410e1>] sys_init_module+0xa6/0x230
[<c11ef97b>] sysenter_past_esp+0x54/0x8d


>
> the problem is this, there are 2 scenarios in this bug:
>
> One
> ---
> store_scaling_governor takes policy->lock and then calls __cpufreq_set_policy
> __cpufreq_set_policy calls __cpufreq_governor
> __cpufreq_governor calls __cpufreq_driver_target via cpufreq_governor_performance
> __cpufreq_driver_target calls lock_cpu_hotplug() (which takes the hotplug lock)
>
>
> Two
> ---
> cpufreq_stats_init calls lock_cpu_hotplug() and then calls cpufreq_stat_cpu_callback
> cpufreq_stat_cpu_callback calls cpufreq_update_policy
> cpufreq_update_policy takes the policy->lock
>
>
> so this looks like a real honest AB-BA deadlock to me...

Regards,
Michal

--
Michal K. K. Piotrowski
LTG - Linux Testers Group
(http://www.stardust.webpages.pl/ltg/wiki/)

2006-05-30 06:20:14

by Arjan van de Ven

[permalink] [raw]
Subject: Re: [patch 00/61] ANNOUNCE: lock validator -V1

On Tue, 2006-05-30 at 06:52 +0200, Mike Galbraith wrote:
> On Mon, 2006-05-29 at 23:21 +0200, Ingo Molnar wrote:
> > The easiest way to try lockdep on a testbox is to apply the combo patch
> > to 2.6.17-rc4-mm3. The patch order is:
> >
> > http://kernel.org/pub/linux/kernel/v2.6/testing/linux-2.6.17-rc4.tar.bz2
> > http://kernel.org/pub/linux/kernel/people/akpm/patches/2.6/2.6.17-rc4/2.6.17-rc4-mm3/2.6.17-rc4-mm3.bz2
> > http://redhat.com/~mingo/lockdep-patches/lockdep-combo.patch
> >
> > do 'make oldconfig' and accept all the defaults for new config options -
> > reboot into the kernel and if everything goes well it should boot up
> > fine and you should have /proc/lockdep and /proc/lockdep_stats files.
>
> Darn. It said all tests passed, then oopsed.
>
> (have .config all gzipped up if you want it)


yes please get me/Ingo the .config; something odd is going on

2006-05-30 06:35:44

by Arjan van de Ven

[permalink] [raw]
Subject: Re: [patch 00/61] ANNOUNCE: lock validator -V1

On Tue, 2006-05-30 at 06:52 +0200, Mike Galbraith wrote:
> On Mon, 2006-05-29 at 23:21 +0200, Ingo Molnar wrote:
> > The easiest way to try lockdep on a testbox is to apply the combo patch
> > to 2.6.17-rc4-mm3. The patch order is:
> >
> > http://kernel.org/pub/linux/kernel/v2.6/testing/linux-2.6.17-rc4.tar.bz2
> > http://kernel.org/pub/linux/kernel/people/akpm/patches/2.6/2.6.17-rc4/2.6.17-rc4-mm3/2.6.17-rc4-mm3.bz2
> > http://redhat.com/~mingo/lockdep-patches/lockdep-combo.patch
> >
> > do 'make oldconfig' and accept all the defaults for new config options -
> > reboot into the kernel and if everything goes well it should boot up
> > fine and you should have /proc/lockdep and /proc/lockdep_stats files.
>
> Darn. It said all tests passed, then oopsed.


does this fix it?


type->name can be NULL legitimately; all places but one check for this
already. Fix the one place that doesn't.

Signed-off-by: Arjan van de Ven <[email protected]>

--- linux-2.6.17-rc4-mm3-lockdep/kernel/lockdep.c.org 2006-05-30 08:32:52.000000000 +0200
+++ linux-2.6.17-rc4-mm3-lockdep/kernel/lockdep.c 2006-05-30 08:33:09.000000000 +0200
@@ -1151,7 +1151,7 @@ int count_matching_names(struct lock_typ
list_for_each_entry(type, &all_lock_types, lock_entry) {
if (new_type->key - new_type->subtype == type->key)
return type->name_version;
- if (!strcmp(type->name, new_type->name))
+ if (type->name && !strcmp(type->name, new_type->name))
count = max(count, type->name_version);
}



2006-05-30 06:37:07

by Ingo Molnar

[permalink] [raw]
Subject: Re: [patch 00/61] ANNOUNCE: lock validator -V1


* Mike Galbraith <[email protected]> wrote:

> Darn. It said all tests passed, then oopsed.
>
> (have .config all gzipped up if you want it)

yeah, please.

> EIP: 0060:[<b103a872>] Not tainted VLI
> EFLAGS: 00010083 (2.6.17-rc4-mm3-smp #157)
> EIP is at count_matching_names+0x5b/0xa2

> 1151 list_for_each_entry(type, &all_lock_types, lock_entry) {
> 1152 if (new_type->key - new_type->subtype == type->key)
> 1153 return type->name_version;
> 1154 if (!strcmp(type->name, new_type->name)) <--kaboom
> 1155 count = max(count, type->name_version);

hm, while most code (except the one above) is prepared for type->name
being NULL, it should not be NULL. Maybe an uninitialized lock slipped
through? Please try the patch below - it both protects against
type->name being NULL in this place, and will warn if it finds a NULL
lockname.

Ingo

Index: linux/kernel/lockdep.c
===================================================================
--- linux.orig/kernel/lockdep.c
+++ linux/kernel/lockdep.c
@@ -1151,7 +1151,7 @@ int count_matching_names(struct lock_typ
list_for_each_entry(type, &all_lock_types, lock_entry) {
if (new_type->key - new_type->subtype == type->key)
return type->name_version;
- if (!strcmp(type->name, new_type->name))
+ if (type->name && !strcmp(type->name, new_type->name))
count = max(count, type->name_version);
}

@@ -1974,7 +1974,8 @@ void lockdep_init_map(struct lockdep_map

if (DEBUG_WARN_ON(!key))
return;
-
+ if (DEBUG_WARN_ON(!name))
+ return;
/*
* Sanity check, the lock-type key must be persistent:
*/

2006-05-30 07:47:10

by Ingo Molnar

[permalink] [raw]
Subject: Re: [patch 00/61] ANNOUNCE: lock validator -V1


* Arjan van de Ven <[email protected]> wrote:

> > Darn. It said all tests passed, then oopsed.
>
> does this fix it?
>
> type->name can be NULL legitimately; all places but one check for this
> already. Fix this off-by-one.

that used to be the case, but shouldn't happen anymore - with current
lockdep code we always pass some string to the lock init code. (that's
what lock-init-improvement.patch achieves in essence.) Worst-case the
string should be "old_style_spin_init" or "old_style_rw_init".

So Mike, please try the other patch I sent - it also adds a debugging
check so that we can see where that NULL name comes from. It could be
something benign like me forgetting to pass in a string somewhere in the
initialization macros, but it could also be something nastier, like an
initialize-by-memset assumption.
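
(For illustration, a hypothetical example of such an initialize-by-memset
pattern - "struct foo" and "foo_init" are made up, not taken from the
kernel:)

	#include <linux/spinlock.h>
	#include <linux/string.h>

	struct foo {
		spinlock_t lock;
	};

	static void foo_init(struct foo *f)
	{
		/* lock "initialized" by zeroing: lockdep never gets a
		 * name/key for it */
		memset(f, 0, sizeof(*f));

		/* correct would be an explicit spin_lock_init(&f->lock); */
	}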

Ingo

2006-05-30 09:12:12

by Nikita Danilov

[permalink] [raw]
Subject: Re: [patch 25/61] lock validator: design docs

Ingo Molnar writes:
> From: Ingo Molnar <[email protected]>

[...]

> +
> +enum bdev_bd_mutex_lock_type
> +{
> + BD_MUTEX_NORMAL,
> + BD_MUTEX_WHOLE,
> + BD_MUTEX_PARTITION
> +};

In some situations a well-defined and finite set of "nesting levels" does
not exist. For example, when one has a tree with per-node locking, and
algorithms acquire multiple node locks left-to-right in the tree
order. Reiser4 does this.
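
Roughly this kind of pattern (an illustrative sketch only, not Reiser4
code - "struct node" and "lock_range" are made-up names):

	#include <linux/spinlock.h>

	struct node {
		spinlock_t	lock;
		struct node	*right;		/* next node in tree order */
	};

	/* take every node lock from 'from' up to (but not including) 'to',
	 * left-to-right; the number of locks held at once is bounded only
	 * by the width of the tree, not by a small fixed set of levels */
	static void lock_range(struct node *from, struct node *to)
	{
		struct node *n;

		for (n = from; n != to; n = n->right)
			spin_lock(&n->lock);
	}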

Can nested locking restrictions be weakened for certain lock types?

Nikita.

2006-05-30 09:15:28

by Benoit Boissinot

[permalink] [raw]
Subject: Re: [patch 00/61] ANNOUNCE: lock validator -V1

On 5/29/06, Ingo Molnar <[email protected]> wrote:
> We are pleased to announce the first release of the "lock dependency
> correctness validator" kernel debugging feature, which can be downloaded
> from:
>
> http://redhat.com/~mingo/lockdep-patches/
> [snip]

I get this right after ipw2200 is loaded (it is quite verbose, I
probably shouldn't post everything...)

ipw2200: Detected Intel PRO/Wireless 2200BG Network Connection
ipw2200: Detected geography ZZD (13 802.11bg channels, 0 802.11a channels)

======================================================
[ BUG: hard-safe -> hard-unsafe lock order detected! ]
------------------------------------------------------
default.hotplug/3212 [HC0[0]:SC1[1]:HE0:SE0] is trying to acquire:
(nl_table_lock){-.-?}, at: [<c0301efa>] netlink_broadcast+0x7a/0x360

and this task is already holding:
(&priv->lock){++..}, at: [<e1cfe588>] ipw_irq_tasklet+0x18/0x500 [ipw2200]
which would create a new lock dependency:
(&priv->lock){++..} -> (nl_table_lock){-.-?}

but this new dependency connects a hard-irq-safe lock:
(&priv->lock){++..}
... which became hard-irq-safe at:
[<c01395da>] lockdep_acquire+0x7a/0xa0
[<c0352583>] _spin_lock+0x23/0x30
[<e1cfdbc1>] ipw_isr+0x21/0xd0 [ipw2200]
[<c01466e3>] handle_IRQ_event+0x33/0x80
[<c01467e4>] __do_IRQ+0xb4/0x120
[<c01057c0>] do_IRQ+0x70/0xc0

to a hard-irq-unsafe lock:
(nl_table_lock){-.-?}
... which became hard-irq-unsafe at:
... [<c01395da>] lockdep_acquire+0x7a/0xa0
[<c03520da>] _write_lock_bh+0x2a/0x30
[<c03017d2>] netlink_table_grab+0x12/0xe0
[<c0301bcb>] netlink_insert+0x2b/0x180
[<c030307c>] netlink_kernel_create+0xac/0x140
[<c048f29a>] rtnetlink_init+0x6a/0xc0
[<c048f6b9>] netlink_proto_init+0x169/0x180
[<c010029f>] _stext+0x7f/0x250
[<c0101005>] kernel_thread_helper+0x5/0xb

which could potentially lead to deadlocks!

other info that might help us debug this:

1 locks held by default.hotplug/3212:
#0: (&priv->lock){++..}, at: [<e1cfe588>] ipw_irq_tasklet+0x18/0x500 [ipw2200]

the hard-irq-safe lock's dependencies:
-> (&priv->lock){++..} ops: 102 {
initial-use at:
[<c01395da>] lockdep_acquire+0x7a/0xa0
[<c03524c0>] _spin_lock_irqsave+0x30/0x50
[<e1cf6a0c>] ipw_load+0x1fc/0xc90 [ipw2200]
[<e1cf74e8>] ipw_up+0x48/0x520 [ipw2200]
[<e1cfda87>] ipw_net_init+0x27/0x50 [ipw2200]
[<c02eeef1>] register_netdevice+0xd1/0x410
[<c02f0609>] register_netdev+0x59/0x70
[<e1cfe4d6>] ipw_pci_probe+0x806/0x8a0 [ipw2200]
[<c023481e>] pci_device_probe+0x5e/0x80
[<c02a86e4>] driver_probe_device+0x44/0xc0
[<c02a888b>] __driver_attach+0x9b/0xa0
[<c02a8039>] bus_for_each_dev+0x49/0x70
[<c02a8629>] driver_attach+0x19/0x20
[<c02a7c64>] bus_add_driver+0x74/0x140
[<c02a8b06>] driver_register+0x56/0x90
[<c0234a10>] __pci_register_driver+0x50/0x70
[<e18b302e>] 0xe18b302e
[<c014034d>] sys_init_module+0xcd/0x1630
[<c035273b>] sysenter_past_esp+0x54/0x8d
in-hardirq-W at:
[<c01395da>] lockdep_acquire+0x7a/0xa0
[<c0352583>] _spin_lock+0x23/0x30
[<e1cfdbc1>] ipw_isr+0x21/0xd0 [ipw2200]
[<c01466e3>] handle_IRQ_event+0x33/0x80
[<c01467e4>] __do_IRQ+0xb4/0x120
[<c01057c0>] do_IRQ+0x70/0xc0
in-softirq-W at:
[<c01395da>] lockdep_acquire+0x7a/0xa0
[<c03524c0>] _spin_lock_irqsave+0x30/0x50
[<e1cfe588>] ipw_irq_tasklet+0x18/0x500 [ipw2200]
[<c0121ea0>] tasklet_action+0x40/0x90
[<c01223b4>] __do_softirq+0x54/0xc0
[<c01056bb>] do_softirq+0x5b/0xf0
}
... key at: [<e1d0b438>] __key.27363+0x0/0xffff38f6 [ipw2200]
-> (&q->lock){++..} ops: 33353 {
initial-use at:
[<c01395da>] lockdep_acquire+0x7a/0xa0
[<c0352509>] _spin_lock_irq+0x29/0x40
[<c034f084>] wait_for_completion+0x24/0x150
[<c013160e>] keventd_create_kthread+0x2e/0x70
[<c01315d6>] kthread_create+0xe6/0xf0
[<c0121b75>] cpu_callback+0x95/0x110
[<c0481194>] spawn_ksoftirqd+0x14/0x30
[<c010023c>] _stext+0x1c/0x250
[<c0101005>] kernel_thread_helper+0x5/0xb
in-hardirq-W at:
[<c01395da>] lockdep_acquire+0x7a/0xa0
[<c03524c0>] _spin_lock_irqsave+0x30/0x50
[<c011794b>] __wake_up+0x1b/0x50
[<c012dcdd>] __queue_work+0x4d/0x70
[<c012ddaf>] queue_work+0x6f/0x80
[<c0269588>] acpi_os_execute+0xcd/0xe9
[<c026eea1>] acpi_ev_gpe_dispatch+0xbc/0x122
[<c026f106>] acpi_ev_gpe_detect+0x99/0xe0
[<c026d90b>] acpi_ev_sci_xrupt_handler+0x15/0x1d
[<c0268c55>] acpi_irq+0xe/0x18
[<c01466e3>] handle_IRQ_event+0x33/0x80
[<c01467e4>] __do_IRQ+0xb4/0x120
[<c01057c0>] do_IRQ+0x70/0xc0
in-softirq-W at:
[<c01395da>] lockdep_acquire+0x7a/0xa0
[<c03524c0>] _spin_lock_irqsave+0x30/0x50
[<c011786b>] complete+0x1b/0x60
[<c012ef0b>] wakeme_after_rcu+0xb/0x10
[<c012f0c9>] __rcu_process_callbacks+0x69/0x1c0
[<c012f232>] rcu_process_callbacks+0x12/0x30
[<c0121ea0>] tasklet_action+0x40/0x90
[<c01223b4>] __do_softirq+0x54/0xc0
[<c01056bb>] do_softirq+0x5b/0xf0
}
... key at: [<c04d47c8>] 0xc04d47c8
-> (&rq->lock){++..} ops: 68824 {
initial-use at:
[<c01395da>] lockdep_acquire+0x7a/0xa0
[<c03524c0>] _spin_lock_irqsave+0x30/0x50
[<c0117bcc>] init_idle+0x4c/0x80
[<c0480ad8>] sched_init+0xa8/0xb0
[<c0473558>] start_kernel+0x58/0x330
[<c0100199>] 0xc0100199
in-hardirq-W at:
[<c01395da>] lockdep_acquire+0x7a/0xa0
[<c0352583>] _spin_lock+0x23/0x30
[<c0117cc7>] scheduler_tick+0xc7/0x310
[<c01270ee>] update_process_times+0x3e/0x70
[<c0106c21>] timer_interrupt+0x41/0xa0
[<c01466e3>] handle_IRQ_event+0x33/0x80
[<c01467e4>] __do_IRQ+0xb4/0x120
[<c01057c0>] do_IRQ+0x70/0xc0
in-softirq-W at:
[<c01395da>] lockdep_acquire+0x7a/0xa0
[<c0352583>] _spin_lock+0x23/0x30
[<c01183e0>] try_to_wake_up+0x30/0x170
[<c011854f>] wake_up_process+0xf/0x20
[<c0122413>] __do_softirq+0xb3/0xc0
[<c01056bb>] do_softirq+0x5b/0xf0
}
... key at: [<c04c1400>] 0xc04c1400
... acquired at:
[<c01395da>] lockdep_acquire+0x7a/0xa0
[<c0352583>] _spin_lock+0x23/0x30
[<c01183e0>] try_to_wake_up+0x30/0x170
[<c011852b>] default_wake_function+0xb/0x10
[<c01172d9>] __wake_up_common+0x39/0x70
[<c011788d>] complete+0x3d/0x60
[<c01316d4>] kthread+0x84/0xbc
[<c0101005>] kernel_thread_helper+0x5/0xb

... acquired at:
[<c01395da>] lockdep_acquire+0x7a/0xa0
[<c03524c0>] _spin_lock_irqsave+0x30/0x50
[<c011794b>] __wake_up+0x1b/0x50
[<e1cf6a2e>] ipw_load+0x21e/0xc90 [ipw2200]
[<e1cf74e8>] ipw_up+0x48/0x520 [ipw2200]
[<e1cfda87>] ipw_net_init+0x27/0x50 [ipw2200]
[<c02eeef1>] register_netdevice+0xd1/0x410
[<c02f0609>] register_netdev+0x59/0x70
[<e1cfe4d6>] ipw_pci_probe+0x806/0x8a0 [ipw2200]
[<c023481e>] pci_device_probe+0x5e/0x80
[<c02a86e4>] driver_probe_device+0x44/0xc0
[<c02a888b>] __driver_attach+0x9b/0xa0
[<c02a8039>] bus_for_each_dev+0x49/0x70
[<c02a8629>] driver_attach+0x19/0x20
[<c02a7c64>] bus_add_driver+0x74/0x140
[<c02a8b06>] driver_register+0x56/0x90
[<c0234a10>] __pci_register_driver+0x50/0x70
[<e18b302e>] 0xe18b302e
[<c014034d>] sys_init_module+0xcd/0x1630
[<c035273b>] sysenter_past_esp+0x54/0x8d

-> (&rxq->lock){.+..} ops: 40 {
initial-use at:
[<c01395da>] lockdep_acquire+0x7a/0xa0
[<c03524c0>] _spin_lock_irqsave+0x30/0x50
[<e1cf66d0>] ipw_rx_queue_replenish+0x20/0x120 [ipw2200]
[<e1cf72e0>] ipw_load+0xad0/0xc90 [ipw2200]
[<e1cf74e8>] ipw_up+0x48/0x520 [ipw2200]
[<e1cfda87>] ipw_net_init+0x27/0x50 [ipw2200]
[<c02eeef1>] register_netdevice+0xd1/0x410
[<c02f0609>] register_netdev+0x59/0x70
[<e1cfe4d6>] ipw_pci_probe+0x806/0x8a0 [ipw2200]
[<c023481e>] pci_device_probe+0x5e/0x80
[<c02a86e4>] driver_probe_device+0x44/0xc0
[<c02a888b>] __driver_attach+0x9b/0xa0
[<c02a8039>] bus_for_each_dev+0x49/0x70
[<c02a8629>] driver_attach+0x19/0x20
[<c02a7c64>] bus_add_driver+0x74/0x140
[<c02a8b06>] driver_register+0x56/0x90
[<c0234a10>] __pci_register_driver+0x50/0x70
[<e18b302e>] 0xe18b302e
[<c014034d>] sys_init_module+0xcd/0x1630
[<c035273b>] sysenter_past_esp+0x54/0x8d
in-softirq-W at:
[<c01395da>] lockdep_acquire+0x7a/0xa0
[<c03524c0>] _spin_lock_irqsave+0x30/0x50
[<e1cf25bf>] ipw_rx_queue_restock+0x1f/0x120 [ipw2200]
[<e1cf80d1>] ipw_rx+0x631/0x1bb0 [ipw2200]
[<e1cfe6ac>] ipw_irq_tasklet+0x13c/0x500 [ipw2200]
[<c0121ea0>] tasklet_action+0x40/0x90
[<c01223b4>] __do_softirq+0x54/0xc0
[<c01056bb>] do_softirq+0x5b/0xf0
}
... key at: [<e1d0b440>] __key.23915+0x0/0xffff38ee [ipw2200]
-> (&parent->list_lock){.+..} ops: 17457 {
initial-use at:
[<c01395da>] lockdep_acquire+0x7a/0xa0
[<c0352583>] _spin_lock+0x23/0x30
[<c0166437>] cache_alloc_refill+0x87/0x650
[<c0166bae>] kmem_cache_zalloc+0xbe/0xd0
[<c01672d4>] kmem_cache_create+0x154/0x540
[<c0483ad9>] kmem_cache_init+0x179/0x3d0
[<c0473638>] start_kernel+0x138/0x330
[<c0100199>] 0xc0100199
in-softirq-W at:
[<c01395da>] lockdep_acquire+0x7a/0xa0
[<c0352583>] _spin_lock+0x23/0x30
[<c0166073>] free_block+0x183/0x190
[<c0165bdf>] __cache_free+0x9f/0x120
[<c0165da8>] kmem_cache_free+0x88/0xb0
[<c0119e21>] free_task+0x21/0x30
[<c011b955>] __put_task_struct+0x95/0x156
[<c011db12>] delayed_put_task_struct+0x32/0x60
[<c012f0c9>] __rcu_process_callbacks+0x69/0x1c0
[<c012f232>] rcu_process_callbacks+0x12/0x30
[<c0121ea0>] tasklet_action+0x40/0x90
[<c01223b4>] __do_softirq+0x54/0xc0
[<c01056bb>] do_softirq+0x5b/0xf0
}
... key at: [<c060d00c>] 0xc060d00c
... acquired at:
[<c01395da>] lockdep_acquire+0x7a/0xa0
[<c0352583>] _spin_lock+0x23/0x30
[<c0166437>] cache_alloc_refill+0x87/0x650
[<c0166ab8>] __kmalloc+0xb8/0xf0
[<c02eb3cb>] __alloc_skb+0x4b/0x100
[<e1cf6769>] ipw_rx_queue_replenish+0xb9/0x120 [ipw2200]
[<e1cf72e0>] ipw_load+0xad0/0xc90 [ipw2200]
[<e1cf74e8>] ipw_up+0x48/0x520 [ipw2200]
[<e1cfda87>] ipw_net_init+0x27/0x50 [ipw2200]
[<c02eeef1>] register_netdevice+0xd1/0x410
[<c02f0609>] register_netdev+0x59/0x70
[<e1cfe4d6>] ipw_pci_probe+0x806/0x8a0 [ipw2200]
[<c023481e>] pci_device_probe+0x5e/0x80
[<c02a86e4>] driver_probe_device+0x44/0xc0
[<c02a888b>] __driver_attach+0x9b/0xa0
[<c02a8039>] bus_for_each_dev+0x49/0x70
[<c02a8629>] driver_attach+0x19/0x20
[<c02a7c64>] bus_add_driver+0x74/0x140
[<c02a8b06>] driver_register+0x56/0x90
[<c0234a10>] __pci_register_driver+0x50/0x70
[<e18b302e>] 0xe18b302e
[<c014034d>] sys_init_module+0xcd/0x1630
[<c035273b>] sysenter_past_esp+0x54/0x8d

... acquired at:
[<c01395da>] lockdep_acquire+0x7a/0xa0
[<c03524c0>] _spin_lock_irqsave+0x30/0x50
[<e1cf25bf>] ipw_rx_queue_restock+0x1f/0x120 [ipw2200]
[<e1cf80d1>] ipw_rx+0x631/0x1bb0 [ipw2200]
[<e1cfe6ac>] ipw_irq_tasklet+0x13c/0x500 [ipw2200]
[<c0121ea0>] tasklet_action+0x40/0x90
[<c01223b4>] __do_softirq+0x54/0xc0
[<c01056bb>] do_softirq+0x5b/0xf0

-> (&ieee->lock){.+..} ops: 15 {
initial-use at:
[<c01395da>] lockdep_acquire+0x7a/0xa0
[<c03524c0>] _spin_lock_irqsave+0x30/0x50
[<e1c9d0cf>] ieee80211_process_probe_response+0x1ff/0x790 [ieee80211]
[<e1c9d70f>] ieee80211_rx_mgt+0xaf/0x340 [ieee80211]
[<e1cf8219>] ipw_rx+0x779/0x1bb0 [ipw2200]
[<e1cfe6ac>] ipw_irq_tasklet+0x13c/0x500 [ipw2200]
[<c0121ea0>] tasklet_action+0x40/0x90
[<c01223b4>] __do_softirq+0x54/0xc0
[<c01056bb>] do_softirq+0x5b/0xf0
in-softirq-W at:
[<c01395da>] lockdep_acquire+0x7a/0xa0
[<c03524c0>] _spin_lock_irqsave+0x30/0x50
[<e1c9d0cf>] ieee80211_process_probe_response+0x1ff/0x790 [ieee80211]
[<e1c9d70f>] ieee80211_rx_mgt+0xaf/0x340 [ieee80211]
[<e1cf8219>] ipw_rx+0x779/0x1bb0 [ipw2200]
[<e1cfe6ac>] ipw_irq_tasklet+0x13c/0x500 [ipw2200]
[<c0121ea0>] tasklet_action+0x40/0x90
[<c01223b4>] __do_softirq+0x54/0xc0
[<c01056bb>] do_softirq+0x5b/0xf0
}
... key at: [<e1ca2781>] __key.22782+0x0/0xffffdc00 [ieee80211]
... acquired at:
[<c01395da>] lockdep_acquire+0x7a/0xa0
[<c03524c0>] _spin_lock_irqsave+0x30/0x50
[<e1c9d0cf>] ieee80211_process_probe_response+0x1ff/0x790 [ieee80211]
[<e1c9d70f>] ieee80211_rx_mgt+0xaf/0x340 [ieee80211]
[<e1cf8219>] ipw_rx+0x779/0x1bb0 [ipw2200]
[<e1cfe6ac>] ipw_irq_tasklet+0x13c/0x500 [ipw2200]
[<c0121ea0>] tasklet_action+0x40/0x90
[<c01223b4>] __do_softirq+0x54/0xc0
[<c01056bb>] do_softirq+0x5b/0xf0

-> (&cwq->lock){++..} ops: 3739 {
initial-use at:
[<c01395da>] lockdep_acquire+0x7a/0xa0
[<c03524c0>] _spin_lock_irqsave+0x30/0x50
[<c012dca8>] __queue_work+0x18/0x70
[<c012ddaf>] queue_work+0x6f/0x80
[<c012d949>] call_usermodehelper_keys+0x139/0x160
[<c0219a2a>] kobject_uevent+0x7a/0x4a0
[<c0219753>] kobject_register+0x43/0x50
[<c02a7687>] sysdev_register+0x67/0x100
[<c02aa950>] register_cpu+0x30/0x70
[<c0108f7a>] arch_register_cpu+0x2a/0x30
[<c047850a>] topology_init+0xa/0x10
[<c010029f>] _stext+0x7f/0x250
[<c0101005>] kernel_thread_helper+0x5/0xb
in-hardirq-W at:
[<c01395da>] lockdep_acquire+0x7a/0xa0
[<c03524c0>] _spin_lock_irqsave+0x30/0x50
[<c012dca8>] __queue_work+0x18/0x70
[<c012ddaf>] queue_work+0x6f/0x80
[<c0269588>] acpi_os_execute+0xcd/0xe9
[<c026eea1>] acpi_ev_gpe_dispatch+0xbc/0x122
[<c026f106>] acpi_ev_gpe_detect+0x99/0xe0
[<c026d90b>] acpi_ev_sci_xrupt_handler+0x15/0x1d
[<c0268c55>] acpi_irq+0xe/0x18
[<c01466e3>] handle_IRQ_event+0x33/0x80
[<c01467e4>] __do_IRQ+0xb4/0x120
[<c01057c0>] do_IRQ+0x70/0xc0
in-softirq-W at:
[<c01395da>] lockdep_acquire+0x7a/0xa0
[<c03524c0>] _spin_lock_irqsave+0x30/0x50
[<c012dca8>] __queue_work+0x18/0x70
[<c012dd30>] delayed_work_timer_fn+0x30/0x40
[<c012633e>] run_timer_softirq+0x12e/0x180
[<c01223b4>] __do_softirq+0x54/0xc0
[<c01056bb>] do_softirq+0x5b/0xf0
}
... key at: [<c04d4334>] 0xc04d4334
-> (&q->lock){++..} ops: 33353 {
initial-use at:
[<c01395da>] lockdep_acquire+0x7a/0xa0
[<c0352509>] _spin_lock_irq+0x29/0x40
[<c034f084>] wait_for_completion+0x24/0x150
[<c013160e>] keventd_create_kthread+0x2e/0x70
[<c01315d6>] kthread_create+0xe6/0xf0
[<c0121b75>] cpu_callback+0x95/0x110
[<c0481194>] spawn_ksoftirqd+0x14/0x30
[<c010023c>] _stext+0x1c/0x250
[<c0101005>] kernel_thread_helper+0x5/0xb
in-hardirq-W at:
[<c01395da>] lockdep_acquire+0x7a/0xa0
[<c03524c0>] _spin_lock_irqsave+0x30/0x50
[<c011794b>] __wake_up+0x1b/0x50
[<c012dcdd>] __queue_work+0x4d/0x70
[<c012ddaf>] queue_work+0x6f/0x80
[<c0269588>] acpi_os_execute+0xcd/0xe9
[<c026eea1>] acpi_ev_gpe_dispatch+0xbc/0x122
[<c026f106>] acpi_ev_gpe_detect+0x99/0xe0
[<c026d90b>] acpi_ev_sci_xrupt_handler+0x15/0x1d
[<c0268c55>] acpi_irq+0xe/0x18
[<c01466e3>] handle_IRQ_event+0x33/0x80
[<c01467e4>] __do_IRQ+0xb4/0x120
[<c01057c0>] do_IRQ+0x70/0xc0
in-softirq-W at:
[<c01395da>] lockdep_acquire+0x7a/0xa0
[<c03524c0>] _spin_lock_irqsave+0x30/0x50
[<c011786b>] complete+0x1b/0x60
[<c012ef0b>] wakeme_after_rcu+0xb/0x10
[<c012f0c9>] __rcu_process_callbacks+0x69/0x1c0
[<c012f232>] rcu_process_callbacks+0x12/0x30
[<c0121ea0>] tasklet_action+0x40/0x90
[<c01223b4>] __do_softirq+0x54/0xc0
[<c01056bb>] do_softirq+0x5b/0xf0
}
... key at: [<c04d47c8>] 0xc04d47c8
-> (&rq->lock){++..} ops: 68824 {
initial-use at:
[<c01395da>] lockdep_acquire+0x7a/0xa0
[<c03524c0>] _spin_lock_irqsave+0x30/0x50
[<c0117bcc>] init_idle+0x4c/0x80
[<c0480ad8>] sched_init+0xa8/0xb0
[<c0473558>] start_kernel+0x58/0x330
[<c0100199>] 0xc0100199
in-hardirq-W at:
[<c01395da>] lockdep_acquire+0x7a/0xa0
[<c0352583>] _spin_lock+0x23/0x30
[<c0117cc7>] scheduler_tick+0xc7/0x310
[<c01270ee>] update_process_times+0x3e/0x70
[<c0106c21>] timer_interrupt+0x41/0xa0
[<c01466e3>] handle_IRQ_event+0x33/0x80
[<c01467e4>] __do_IRQ+0xb4/0x120
[<c01057c0>] do_IRQ+0x70/0xc0
in-softirq-W at:
[<c01395da>] lockdep_acquire+0x7a/0xa0
[<c0352583>] _spin_lock+0x23/0x30
[<c01183e0>] try_to_wake_up+0x30/0x170
[<c011854f>] wake_up_process+0xf/0x20
[<c0122413>] __do_softirq+0xb3/0xc0
[<c01056bb>] do_softirq+0x5b/0xf0
}
... key at: [<c04c1400>] 0xc04c1400
... acquired at:
[<c01395da>] lockdep_acquire+0x7a/0xa0
[<c0352583>] _spin_lock+0x23/0x30
[<c01183e0>] try_to_wake_up+0x30/0x170
[<c011852b>] default_wake_function+0xb/0x10
[<c01172d9>] __wake_up_common+0x39/0x70
[<c011788d>] complete+0x3d/0x60
[<c01316d4>] kthread+0x84/0xbc
[<c0101005>] kernel_thread_helper+0x5/0xb

... acquired at:
[<c01395da>] lockdep_acquire+0x7a/0xa0
[<c03524c0>] _spin_lock_irqsave+0x30/0x50
[<c011794b>] __wake_up+0x1b/0x50
[<c012dcdd>] __queue_work+0x4d/0x70
[<c012ddaf>] queue_work+0x6f/0x80
[<c012d949>] call_usermodehelper_keys+0x139/0x160
[<c0219a2a>] kobject_uevent+0x7a/0x4a0
[<c0219753>] kobject_register+0x43/0x50
[<c02a7687>] sysdev_register+0x67/0x100
[<c02aa950>] register_cpu+0x30/0x70
[<c0108f7a>] arch_register_cpu+0x2a/0x30
[<c047850a>] topology_init+0xa/0x10
[<c010029f>] _stext+0x7f/0x250
[<c0101005>] kernel_thread_helper+0x5/0xb

... acquired at:
[<c01395da>] lockdep_acquire+0x7a/0xa0
[<c03524c0>] _spin_lock_irqsave+0x30/0x50
[<c012dca8>] __queue_work+0x18/0x70
[<c012ddaf>] queue_work+0x6f/0x80
[<e1cf267e>] ipw_rx_queue_restock+0xde/0x120 [ipw2200]
[<e1cf80d1>] ipw_rx+0x631/0x1bb0 [ipw2200]
[<e1cfe6ac>] ipw_irq_tasklet+0x13c/0x500 [ipw2200]
[<c0121ea0>] tasklet_action+0x40/0x90
[<c01223b4>] __do_softirq+0x54/0xc0
[<c01056bb>] do_softirq+0x5b/0xf0

-> (&base->lock){++..} ops: 8140 {
initial-use at:
[<c01395da>] lockdep_acquire+0x7a/0xa0
[<c03524c0>] _spin_lock_irqsave+0x30/0x50
[<c0126e4a>] lock_timer_base+0x3a/0x60
[<c0126f17>] __mod_timer+0x37/0xc0
[<c0127036>] mod_timer+0x36/0x50
[<c048a2e5>] con_init+0x1b5/0x200
[<c0489802>] console_init+0x32/0x40
[<c04735ea>] start_kernel+0xea/0x330
[<c0100199>] 0xc0100199
in-hardirq-W at:
[<c01395da>] lockdep_acquire+0x7a/0xa0
[<c03524c0>] _spin_lock_irqsave+0x30/0x50
[<c0126e4a>] lock_timer_base+0x3a/0x60
[<c0126e9c>] del_timer+0x2c/0x70
[<c02bc619>] ide_intr+0x69/0x1f0
[<c01466e3>] handle_IRQ_event+0x33/0x80
[<c01467e4>] __do_IRQ+0xb4/0x120
[<c01057c0>] do_IRQ+0x70/0xc0
in-softirq-W at:
[<c01395da>] lockdep_acquire+0x7a/0xa0
[<c0352509>] _spin_lock_irq+0x29/0x40
[<c0126239>] run_timer_softirq+0x29/0x180
[<c01223b4>] __do_softirq+0x54/0xc0
[<c01056bb>] do_softirq+0x5b/0xf0
}
... key at: [<c04d3af8>] 0xc04d3af8
... acquired at:
[<c01395da>] lockdep_acquire+0x7a/0xa0
[<c03524c0>] _spin_lock_irqsave+0x30/0x50
[<c0126e4a>] lock_timer_base+0x3a/0x60
[<c0126e9c>] del_timer+0x2c/0x70
[<e1cf83d9>] ipw_rx+0x939/0x1bb0 [ipw2200]
[<e1cfe6ac>] ipw_irq_tasklet+0x13c/0x500 [ipw2200]
[<c0121ea0>] tasklet_action+0x40/0x90
[<c01223b4>] __do_softirq+0x54/0xc0
[<c01056bb>] do_softirq+0x5b/0xf0


the hard-irq-unsafe lock's dependencies:
-> (nl_table_lock){-.-?} ops: 1585 {
initial-use at:
[<c01395da>] lockdep_acquire+0x7a/0xa0
[<c03520da>] _write_lock_bh+0x2a/0x30
[<c03017d2>] netlink_table_grab+0x12/0xe0
[<c0301bcb>] netlink_insert+0x2b/0x180
[<c030307c>] netlink_kernel_create+0xac/0x140
[<c048f29a>] rtnetlink_init+0x6a/0xc0
[<c048f6b9>] netlink_proto_init+0x169/0x180
[<c010029f>] _stext+0x7f/0x250
[<c0101005>] kernel_thread_helper+0x5/0xb
hardirq-on-W at:
[<c01395da>] lockdep_acquire+0x7a/0xa0
[<c03520da>] _write_lock_bh+0x2a/0x30
[<c03017d2>] netlink_table_grab+0x12/0xe0
[<c0301bcb>] netlink_insert+0x2b/0x180
[<c030307c>] netlink_kernel_create+0xac/0x140
[<c048f29a>] rtnetlink_init+0x6a/0xc0
[<c048f6b9>] netlink_proto_init+0x169/0x180
[<c010029f>] _stext+0x7f/0x250
[<c0101005>] kernel_thread_helper+0x5/0xb
in-softirq-R at:
[<c01395da>] lockdep_acquire+0x7a/0xa0
[<c0352130>] _read_lock+0x20/0x30
[<c0301efa>] netlink_broadcast+0x7a/0x360
[<c02fb6a4>] wireless_send_event+0x304/0x340
[<e1cf8e11>] ipw_rx+0x1371/0x1bb0 [ipw2200]
[<e1cfe6ac>] ipw_irq_tasklet+0x13c/0x500 [ipw2200]
[<c0121ea0>] tasklet_action+0x40/0x90
[<c01223b4>] __do_softirq+0x54/0xc0
[<c01056bb>] do_softirq+0x5b/0xf0
softirq-on-R at:
[<c01395da>] lockdep_acquire+0x7a/0xa0
[<c0352130>] _read_lock+0x20/0x30
[<c0301efa>] netlink_broadcast+0x7a/0x360
[<c02199f0>] kobject_uevent+0x40/0x4a0
[<c0219753>] kobject_register+0x43/0x50
[<c02a7687>] sysdev_register+0x67/0x100
[<c02aa950>] register_cpu+0x30/0x70
[<c0108f7a>] arch_register_cpu+0x2a/0x30
[<c047850a>] topology_init+0xa/0x10
[<c010029f>] _stext+0x7f/0x250
[<c0101005>] kernel_thread_helper+0x5/0xb
hardirq-on-R at:
[<c01395da>] lockdep_acquire+0x7a/0xa0
[<c0352130>] _read_lock+0x20/0x30
[<c0301efa>] netlink_broadcast+0x7a/0x360
[<c02199f0>] kobject_uevent+0x40/0x4a0
[<c0219753>] kobject_register+0x43/0x50
[<c02a7687>] sysdev_register+0x67/0x100
[<c02aa950>] register_cpu+0x30/0x70
[<c0108f7a>] arch_register_cpu+0x2a/0x30
[<c047850a>] topology_init+0xa/0x10
[<c010029f>] _stext+0x7f/0x250
[<c0101005>] kernel_thread_helper+0x5/0xb
}
... key at: [<c0438908>] 0xc0438908

stack backtrace:
<c010402d> show_trace+0xd/0x10 <c0104687> dump_stack+0x17/0x20
<c0137fe3> check_usage+0x263/0x270 <c0138f06> __lockdep_acquire+0xb96/0xd40
<c01395da> lockdep_acquire+0x7a/0xa0 <c0352130> _read_lock+0x20/0x30
<c0301efa> netlink_broadcast+0x7a/0x360 <c02fb6a4> wireless_send_event+0x304/0x340
<e1cf8e11> ipw_rx+0x1371/0x1bb0 [ipw2200] <e1cfe6ac> ipw_irq_tasklet+0x13c/0x500 [ipw2200]
<c0121ea0> tasklet_action+0x40/0x90 <c01223b4> __do_softirq+0x54/0xc0
<c01056bb> do_softirq+0x5b/0xf0
=======================
<c0122455> irq_exit+0x35/0x40 <c01057c7> do_IRQ+0x77/0xc0
<c0103949> common_interrupt+0x25/0x2c

2006-05-30 09:23:22

by Mike Galbraith

[permalink] [raw]
Subject: Re: [patch 00/61] ANNOUNCE: lock validator -V1

On Tue, 2006-05-30 at 08:37 +0200, Ingo Molnar wrote:
> * Mike Galbraith <[email protected]> wrote:
>
> > Darn. It said all tests passed, then oopsed.
> >
> > (have .config all gzipped up if you want it)
>
> yeah, please.

(sent off list)

> > EIP: 0060:[<b103a872>] Not tainted VLI
> > EFLAGS: 00010083 (2.6.17-rc4-mm3-smp #157)
> > EIP is at count_matching_names+0x5b/0xa2
>
> > 1151 list_for_each_entry(type, &all_lock_types, lock_entry) {
> > 1152 if (new_type->key - new_type->subtype == type->key)
> > 1153 return type->name_version;
> > 1154 if (!strcmp(type->name, new_type->name)) <--kaboom
> > 1155 count = max(count, type->name_version);
>
> hm, while most code (except the one above) is prepared for type->name
> being NULL, it should not be NULL. Maybe an uninitialized lock slipped
> through? Please try the patch below - it both protects against
> type->name being NULL in this place, and will warn if it finds a NULL
> lockname.

Got the warning. It failed testing, but booted.

Lock dependency validator: Copyright (c) 2006 Red Hat, Inc., Ingo Molnar
... MAX_LOCKDEP_SUBTYPES: 8
... MAX_LOCK_DEPTH: 30
... MAX_LOCKDEP_KEYS: 2048
... TYPEHASH_SIZE: 1024
... MAX_LOCKDEP_ENTRIES: 8192
... MAX_LOCKDEP_CHAINS: 8192
... CHAINHASH_SIZE: 4096
memory used by lock dependency info: 696 kB
per task-struct memory footprint: 1080 bytes
------------------------
| Locking API testsuite:
----------------------------------------------------------------------------
| spin |wlock |rlock |mutex | wsem | rsem |
--------------------------------------------------------------------------
BUG: warning at kernel/lockdep.c:1977/lockdep_init_map()
<b1003dd2> show_trace+0xd/0xf <b10044c0> dump_stack+0x17/0x19
<b103badf> lockdep_init_map+0x10a/0x10f <b10398d7> __mutex_init+0x3b/0x44
<b11d4601> init_type_X+0x37/0x4d <b11d4638> init_shared_types+0x21/0xaa
<b11dcca3> locking_selftest+0x76/0x1889 <b1597657> start_kernel+0x1e7/0x400
<b1000210> 0xb1000210
A-A deadlock: ok | ok | ok | ok | ok | ok |
A-B-B-A deadlock: ok | ok |FAILED| ok | ok | ok |
A-B-B-C-C-A deadlock: ok | ok |FAILED| ok | ok | ok |
A-B-C-A-B-C deadlock: ok | ok |FAILED| ok | ok | ok |
A-B-B-C-C-D-D-A deadlock: ok | ok |FAILED| ok | ok | ok |
A-B-C-D-B-D-D-A deadlock: ok | ok |FAILED| ok | ok | ok |
A-B-C-D-B-C-D-A deadlock: ok | ok |FAILED| ok | ok | ok |
double unlock: ok | ok | ok | ok | ok | ok |
bad unlock order: ok | ok | ok | ok | ok | ok |
--------------------------------------------------------------------------
recursive read-lock: |FAILED| | ok |
--------------------------------------------------------------------------
non-nested unlock:FAILED|FAILED|FAILED|FAILED|
------------------------------------------------------------
hard-irqs-on + irq-safe-A/12: ok | ok |FAILED|
soft-irqs-on + irq-safe-A/12: ok | ok |FAILED|
hard-irqs-on + irq-safe-A/21: ok | ok |FAILED|
soft-irqs-on + irq-safe-A/21: ok | ok |FAILED|
sirq-safe-A => hirqs-on/12: ok | ok |FAILED|
sirq-safe-A => hirqs-on/21: ok | ok |FAILED|
hard-safe-A + irqs-on/12: ok | ok |FAILED|
soft-safe-A + irqs-on/12: ok | ok |FAILED|
hard-safe-A + irqs-on/21: ok | ok |FAILED|
soft-safe-A + irqs-on/21: ok | ok |FAILED|
hard-safe-A + unsafe-B #1/123: ok | ok |FAILED|
soft-safe-A + unsafe-B #1/123: ok | ok |FAILED|
hard-safe-A + unsafe-B #1/132: ok | ok |FAILED|
soft-safe-A + unsafe-B #1/132: ok | ok |FAILED|
hard-safe-A + unsafe-B #1/213: ok | ok |FAILED|
soft-safe-A + unsafe-B #1/213: ok | ok |FAILED|
hard-safe-A + unsafe-B #1/231: ok | ok |FAILED|
soft-safe-A + unsafe-B #1/231: ok | ok |FAILED|
hard-safe-A + unsafe-B #1/312: ok | ok |FAILED|
soft-safe-A + unsafe-B #1/312: ok | ok |FAILED|
hard-safe-A + unsafe-B #1/321: ok | ok |FAILED|
soft-safe-A + unsafe-B #1/321: ok | ok |FAILED|
hard-safe-A + unsafe-B #2/123: ok | ok |FAILED|
soft-safe-A + unsafe-B #2/123: ok | ok |FAILED|
hard-safe-A + unsafe-B #2/132: ok | ok |FAILED|
soft-safe-A + unsafe-B #2/132: ok | ok |FAILED|
hard-safe-A + unsafe-B #2/213: ok | ok |FAILED|
soft-safe-A + unsafe-B #2/213: ok | ok |FAILED|
hard-safe-A + unsafe-B #2/231: ok | ok |FAILED|
soft-safe-A + unsafe-B #2/231: ok | ok |FAILED|
hard-safe-A + unsafe-B #2/312: ok | ok |FAILED|
soft-safe-A + unsafe-B #2/312: ok | ok |FAILED|
hard-safe-A + unsafe-B #2/321: ok | ok |FAILED|
soft-safe-A + unsafe-B #2/321: ok | ok |FAILED|
hard-irq lock-inversion/123: ok | ok |FAILED|
soft-irq lock-inversion/123: ok | ok |FAILED|
hard-irq lock-inversion/132: ok | ok |FAILED|
soft-irq lock-inversion/132: ok | ok |FAILED|
hard-irq lock-inversion/213: ok | ok |FAILED|
soft-irq lock-inversion/213: ok | ok |FAILED|
hard-irq lock-inversion/231: ok | ok |FAILED|
soft-irq lock-inversion/231: ok | ok |FAILED|
hard-irq lock-inversion/312: ok | ok |FAILED|
soft-irq lock-inversion/312: ok | ok |FAILED|
hard-irq lock-inversion/321: ok | ok |FAILED|
soft-irq lock-inversion/321: ok | ok |FAILED|
hard-irq read-recursion/123:FAILED|
soft-irq read-recursion/123:FAILED|
hard-irq read-recursion/132:FAILED|
soft-irq read-recursion/132:FAILED|
hard-irq read-recursion/213:FAILED|
soft-irq read-recursion/213:FAILED|
hard-irq read-recursion/231:FAILED|
soft-irq read-recursion/231:FAILED|
hard-irq read-recursion/312:FAILED|
soft-irq read-recursion/312:FAILED|
hard-irq read-recursion/321:FAILED|
soft-irq read-recursion/321:FAILED|
-----------------------------------------------------------------
BUG: 69 unexpected failures (out of 210) - debugging disabled! |
-----------------------------------------------------------------


2006-05-30 09:59:09

by Al Viro

[permalink] [raw]
Subject: Re: [patch 34/61] lock validator: special locking: bdev

On Mon, May 29, 2006 at 06:35:23PM -0700, Andrew Morton wrote:
> > + * For now, block device ->open() routine must _not_
> > + * examine anything in 'inode' argument except ->i_rdev.
> > + */
> > + struct file fake_file = {};
> > + struct dentry fake_dentry = {};
> > + fake_file.f_mode = mode;
> > + fake_file.f_flags = flags;
> > + fake_file.f_dentry = &fake_dentry;
> > + fake_dentry.d_inode = bdev->bd_inode;
> > +
> > + return do_open(bdev, &fake_file, BD_MUTEX_WHOLE);
> > +}
>
> "crock" is a decent description ;)
>
> How long will this live, and what will the fix look like?

The comment there is a bit deceptive.

The real problem is with the stuff ->open() uses. Short version of the
story:
* everything uses inode->i_bdev. Since we always pass an inode
allocated in block_dev.c along with bdev and its ->i_bdev points to that
bdev (i.e. at the constant offset from inode), it doesn't matter whether
we pass struct inode or struct block_device.
* many things use file->f_mode. Nobody modifies it.
* some things use file->f_flags. Used flags: O_EXCL and O_NDELAY.
Nobody modifies it.
* one (and only one) weird driver uses something else. That FPOS
is floppy.c, and it needs a more detailed description.

floppy.c is _weird_. In addition to the normally used stuff, it checks
whether the opener has write permissions on file->f_dentry->d_inode. Then it
modifies file->private_data to store that information and uses it as a
permission check in ->ioctl().

The rationale for that crock is a big load of bullshit. It goes like this:
	We have privileged ioctls and can't allow them unless you have
write permissions.
	We can't ask callers to just open() the damn thing for write and let
these be done as usual (and check file->f_mode & FMODE_WRITE), because we
might want them on a drive that has no disk in it or a write-protected one.
Opening it for write would try to check for the disk being writable and
screw itself.
	Passing O_NDELAY would avoid that problem by skipping the checks
for the disk being writable, present, etc., but we can't use that. Reasons
why we can't? We don't need no stinkin' reasons!

IOW, *all* of that could be avoided if floppy.c
	* checked FMODE_WRITE for the ability to do privileged ioctls
	* had those who want to issue such ioctls on a drive that might have
no disk in it pass O_NDELAY|O_WRONLY (or O_NDELAY|O_RDWR) when they open
the fscker. Note that userland code always could have done that -
passing O_NDELAY|O_RDWR will do the right thing with any kernel (see the
sketch below).
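
(A minimal userland sketch of that access pattern - the privileged ioctl
chosen here, FDFLUSH, is just an arbitrary example:)

	#include <fcntl.h>
	#include <sys/ioctl.h>
	#include <linux/fd.h>
	#include <stdio.h>

	int main(void)
	{
		/* O_NDELAY skips the "is there a writable disk in the
		 * drive?" checks, while O_RDWR still requires write
		 * permission on the device node. */
		int fd = open("/dev/fd0", O_RDWR | O_NDELAY);

		if (fd < 0) {
			perror("open /dev/fd0");
			return 1;
		}
		if (ioctl(fd, FDFLUSH) < 0)
			perror("FDFLUSH");
		return 0;
	}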

That FPOS is the main reason why we pass struct file * there at all *and*
care to have ->f_dentry->d_inode in it (normally that wouldn't even be
looked at). Again, my preferred solution would be to pass 4-bit flags and
either inode or block_device. Flags being FMODE_READ, FMODE_WRITE,
O_EXCL and O_NDELAY.

The problem is moronic semantics for ioctl access control in floppy.c,
even though the sane API is _already_ present and always had been. In
the very same floppy_open()...

2006-05-30 10:26:58

by Arjan van de Ven

[permalink] [raw]
Subject: Re: [patch 00/61] ANNOUNCE: lock validator -V1

On Tue, 2006-05-30 at 11:14 +0200, Benoit Boissinot wrote:
> On 5/29/06, Ingo Molnar <[email protected]> wrote:
> > We are pleased to announce the first release of the "lock dependency
> > correctness validator" kernel debugging feature, which can be downloaded
> > from:
> >
> > http://redhat.com/~mingo/lockdep-patches/
> > [snip]
>
> I get this right after ipw2200 is loaded (it is quite verbose, I
> probably shoudln't post everything...)
>
> ipw2200: Detected Intel PRO/Wireless 2200BG Network Connection
> ipw2200: Detected geography ZZD (13 802.11bg channels, 0 802.11a channels)


> <c0301efa> netlink_broadcast+0x7a/0x360

this isn't allowed to be called from IRQ context, because it takes
nl_table_lock for read, but that lock is taken as
write_lock_bh(&nl_table_lock);
in
static void netlink_table_grab(void)
i.e. without disabling interrupts; which would thus deadlock if this
read_lock-from-irq were ever to hit.

> <c02fb6a4> wireless_send_event+0x304/0x340
> <e1cf8e11> ipw_rx+0x1371/0x1bb0 [ipw2200]
> <e1cfe6ac> ipw_irq_tasklet+0x13c/0x500 [ipw2200]
> <c0121ea0> tasklet_action+0x40/0x90

but it's more complex than that, since we ARE in BH context.
The complexity comes from us holding &priv->lock, which is
used in hard irq context.

so the deadlock is like this:


cpu 0: user context                          cpu 1: softirq context

netlink_table_grab takes nl_table_lock as    take priv->lock in ipw_irq_tasklet
write_lock_bh, but leaves irqs enabled

hardirq comes in and the isr tries to take   in ipw_rx, call wireless_send_event which
priv->lock but has to wait on cpu 1          tries to take nl_table_lock for read
                                             but has to wait for cpu 0

and... kaboom kabang deadlock :)
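
(The same three ingredients expressed as code - an illustrative sketch
using the lock primitives from the report; nl_table_lock lives in
net/netlink/af_netlink.c and priv->lock in the ipw2200 driver, the
function wrappers here are made up:)

	/* cpu 0, process context - what netlink_table_grab() does: */
	static void cpu0_user_context(void)
	{
		write_lock_bh(&nl_table_lock);	/* BHs off, hard irqs still on */
	}

	/* cpu 1, tasklet (softirq) context - ipw_irq_tasklet()/ipw_rx(): */
	static void cpu1_softirq_context(struct ipw_priv *priv)
	{
		unsigned long flags;

		spin_lock_irqsave(&priv->lock, flags);
		read_lock(&nl_table_lock);	/* spins: cpu 0 holds it for write */
	}

	/* cpu 0, hard irq arrives - ipw_isr(): */
	static void cpu0_hardirq(struct ipw_priv *priv)
	{
		spin_lock(&priv->lock);		/* spins: cpu 1 holds it -> deadlock */
	}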


2006-05-30 10:45:05

by Arjan van de Ven

[permalink] [raw]
Subject: Re: [patch 34/61] lock validator: special locking: bdev

On Mon, 2006-05-29 at 18:35 -0700, Andrew Morton wrote:
> On Mon, 29 May 2006 23:25:54 +0200
> Ingo Molnar <[email protected]> wrote:
>
> > From: Ingo Molnar <[email protected]>
> >
> > teach special (recursive) locking code to the lock validator. Has no
> > effect on non-lockdep kernels.
> >
>
> There's no description here of the problem which is being worked around.
> This leaves everyone in the dark.
>
> > +static int
> > +blkdev_get_whole(struct block_device *bdev, mode_t mode, unsigned flags)
> > +{
> > + /*
> > + * This crockload is due to bad choice of ->open() type.
> > + * It will go away.
> > + * For now, block device ->open() routine must _not_
> > + * examine anything in 'inode' argument except ->i_rdev.
> > + */
> > + struct file fake_file = {};
> > + struct dentry fake_dentry = {};
> > + fake_file.f_mode = mode;
> > + fake_file.f_flags = flags;
> > + fake_file.f_dentry = &fake_dentry;
> > + fake_dentry.d_inode = bdev->bd_inode;
> > +
> > + return do_open(bdev, &fake_file, BD_MUTEX_WHOLE);
> > +}
>
> "crock" is a decent description ;)
>
> How long will this live, and what will the fix look like?

this btw is not new crock; the only new thing is the BD_MUTEX_WHOLE :)

2006-05-30 10:52:04

by Takashi Iwai

[permalink] [raw]
Subject: Re: [patch 03/61] lock validator: sound/oss/emu10k1/midi.c cleanup

At Mon, 29 May 2006 18:33:17 -0700,
Andrew Morton wrote:
>
> On Mon, 29 May 2006 23:23:19 +0200
> Ingo Molnar <[email protected]> wrote:
>
> > move the __attribute outside of the DEFINE_SPINLOCK() section.
> >
> > Signed-off-by: Ingo Molnar <[email protected]>
> > Signed-off-by: Arjan van de Ven <[email protected]>
> > ---
> > sound/oss/emu10k1/midi.c | 2 +-
> > 1 file changed, 1 insertion(+), 1 deletion(-)
> >
> > Index: linux/sound/oss/emu10k1/midi.c
> > ===================================================================
> > --- linux.orig/sound/oss/emu10k1/midi.c
> > +++ linux/sound/oss/emu10k1/midi.c
> > @@ -45,7 +45,7 @@
> > #include "../sound_config.h"
> > #endif
> >
> > -static DEFINE_SPINLOCK(midi_spinlock __attribute((unused)));
> > +static __attribute((unused)) DEFINE_SPINLOCK(midi_spinlock);
> >
> > static void init_midi_hdr(struct midi_hdr *midihdr)
> > {
>
> I'll tag this as for-mainline-via-alsa.

Acked-by: Takashi Iwai <[email protected]>


It's OSS stuff, so feel free to push it from your side ;)


thanks,

Takashi

2006-05-30 10:57:35

by Ingo Molnar

[permalink] [raw]
Subject: Re: [patch 00/61] ANNOUNCE: lock validator -V1


* Mike Galbraith <[email protected]> wrote:

> On Tue, 2006-05-30 at 08:37 +0200, Ingo Molnar wrote:
> > * Mike Galbraith <[email protected]> wrote:
> >
> > > Darn. It said all tests passed, then oopsed.
> > >
> > > (have .config all gzipped up if you want it)
> >
> > yeah, please.
>
> (sent off list)

thanks, I managed to reproduce the warning with your .config - I'm
debugging the problem now.

Ingo

2006-05-30 11:03:00

by Alexey Dobriyan

[permalink] [raw]
Subject: Re: [patch 03/61] lock validator: sound/oss/emu10k1/midi.c cleanup

On Tue, May 30, 2006 at 12:51:53PM +0200, Takashi Iwai wrote:
> At Mon, 29 May 2006 18:33:17 -0700,
> Andrew Morton wrote:
> >
> > On Mon, 29 May 2006 23:23:19 +0200
> > Ingo Molnar <[email protected]> wrote:
> >
> > > move the __attribute outside of the DEFINE_SPINLOCK() section.
> > >
> > > Signed-off-by: Ingo Molnar <[email protected]>
> > > Signed-off-by: Arjan van de Ven <[email protected]>
> > > ---
> > > sound/oss/emu10k1/midi.c | 2 +-
> > > 1 file changed, 1 insertion(+), 1 deletion(-)
> > >
> > > Index: linux/sound/oss/emu10k1/midi.c
> > > ===================================================================
> > > --- linux.orig/sound/oss/emu10k1/midi.c
> > > +++ linux/sound/oss/emu10k1/midi.c
> > > @@ -45,7 +45,7 @@
> > > #include "../sound_config.h"
> > > #endif
> > >
> > > -static DEFINE_SPINLOCK(midi_spinlock __attribute((unused)));
> > > +static __attribute((unused)) DEFINE_SPINLOCK(midi_spinlock);
> > >
> > > static void init_midi_hdr(struct midi_hdr *midihdr)
> > > {
> >
> > I'll tag this as for-mainline-via-alsa.
>
> Acked-by: Takashi Iwai <[email protected]>
>
>
> It's OSS stuff, so feel free to push it from your side ;)

Why is it marked unused when in fact it's used?

[PATCH] Mark midi_spinlock as used

Signed-off-by: Alexey Dobriyan <[email protected]>
---

--- a/sound/oss/emu10k1/midi.c
+++ b/sound/oss/emu10k1/midi.c
@@ -45,7 +45,7 @@
#include "../sound_config.h"
#endif

-static DEFINE_SPINLOCK(midi_spinlock __attribute((unused)));
+static DEFINE_SPINLOCK(midi_spinlock);

static void init_midi_hdr(struct midi_hdr *midihdr)
{

2006-05-30 11:42:08

by Benoit Boissinot

[permalink] [raw]
Subject: Re: [patch 00/61] ANNOUNCE: lock validator -V1

On Tue, May 30, 2006 at 12:26:27PM +0200, Arjan van de Ven wrote:
> On Tue, 2006-05-30 at 11:14 +0200, Benoit Boissinot wrote:
> > On 5/29/06, Ingo Molnar <[email protected]> wrote:
> > > We are pleased to announce the first release of the "lock dependency
> > > correctness validator" kernel debugging feature, which can be downloaded
> > > from:
> > >
> > > http://redhat.com/~mingo/lockdep-patches/
> > > [snip]
> >
> > I get this right after ipw2200 is loaded (it is quite verbose, I
> > probably shoudln't post everything...)
> >
> > ipw2200: Detected Intel PRO/Wireless 2200BG Network Connection
> > ipw2200: Detected geography ZZD (13 802.11bg channels, 0 802.11a channels)
>
>
> > <c0301efa> netlink_broadcast+0x7a/0x360
>
> this isn't allow to be called from IRQ context, because it takes
> nl_table_lock for read, but that is taken as
> write_lock_bh(&nl_table_lock);
> in
> static void netlink_table_grab(void)
> so without disabling interrupts; which would thus deadlock if this
> read_lock-from-irq would hit.
>
> > <c02fb6a4> wireless_send_event+0x304/0x340
> > <e1cf8e11> ipw_rx+0x1371/0x1bb0 [ipw2200]
> > <e1cfe6ac> ipw_irq_tasklet+0x13c/0x500 [ipw2200]
> > <c0121ea0> tasklet_action+0x40/0x90
>
> but it's more complex than that, since we ARE in BH context.
> The complexity comes from us holding &priv->lock, which is
> used in hard irq context.

It is probably related, but I got this in my log too:

BUG: warning at kernel/softirq.c:86/local_bh_disable()
<c010402d> show_trace+0xd/0x10 <c0104687> dump_stack+0x17/0x20
<c0121fdc> local_bh_disable+0x5c/0x70 <c03520f1> _read_lock_bh+0x11/0x30
<c02e8dce> sock_def_readable+0x1e/0x80 <c0302130> netlink_broadcast+0x2b0/0x360
<c02fb6a4> wireless_send_event+0x304/0x340 <e1cf8e11> ipw_rx+0x1371/0x1bb0 [ipw2200]
<e1cfe6ac> ipw_irq_tasklet+0x13c/0x500 [ipw2200] <c0121ea0> tasklet_action+0x40/0x90
<c01223b4> __do_softirq+0x54/0xc0 <c01056bb> do_softirq+0x5b/0xf0
=======================
<c0122455> irq_exit+0x35/0x40 <c01057c7> do_IRQ+0x77/0xc0
<c0103949> common_interrupt+0x25/0x2c

>
> so the deadlock is like this:
>
>
> cpu 0: user context cpu1: softirq context
> netlink_table_grab takes nl_table_lock as take priv->lock in ipw_irq_tasklet
> write_lock_bh, but leaves irqs enabled
>
>
> hardirq comes in and the isr tries to take in ipw_rx, call wireless_send_event which
> priv->lock but has to wait on cpu 1 tries to take nl_table_lock for read
> but has to wait for cpu0
>
> and... kaboom kabang deadlock :)
>
>

--
powered by bash/screen/(urxvt/fvwm|linux-console)/gentoo/gnu/linux OS

2006-05-30 12:12:58

by Ingo Molnar

[permalink] [raw]
Subject: Re: [patch 00/61] ANNOUNCE: lock validator -V1


* Benoit Boissinot <[email protected]> wrote:

> It is probably related, but I got this in my log too:
>
> BUG: warning at kernel/softirq.c:86/local_bh_disable()

this one is harmless, you can ignore it. (already sent a patch to remove
the WARN_ON)

Ingo

2006-05-30 13:33:53

by Roman Zippel

[permalink] [raw]
Subject: Re: [patch 61/61] lock validator: enable lock validator in Kconfig

Hi,

On Mon, 29 May 2006, Ingo Molnar wrote:

> Index: linux/lib/Kconfig.debug
> ===================================================================
> --- linux.orig/lib/Kconfig.debug
> +++ linux/lib/Kconfig.debug
> @@ -184,6 +184,173 @@ config DEBUG_SPINLOCK
> best used in conjunction with the NMI watchdog so that spinlock
> deadlocks are also debuggable.
>
> +config PROVE_SPIN_LOCKING
> + bool "Prove spin-locking correctness"
> + default y

Could you please keep all the defaults in a separate -mm-only patch, so
it doesn't get merged?
There are also a number of dependencies on DEBUG_KERNEL missing, which
completely breaks the debugging menu.

> +config LOCKDEP
> + bool
> + default y
> + depends on PROVE_SPIN_LOCKING || PROVE_RW_LOCKING || PROVE_MUTEX_LOCKING || PROVE_RWSEM_LOCKING

This can be written more concisely as:

config LOCKDEP
def_bool PROVE_SPIN_LOCKING || PROVE_RW_LOCKING || PROVE_MUTEX_LOCKING || PROVE_RWSEM_LOCKING

bye, Roman

2006-05-30 14:10:35

by Dave Jones

[permalink] [raw]
Subject: Re: [patch 00/61] ANNOUNCE: lock validator -V1

On Tue, May 30, 2006 at 07:45:47AM +0200, Arjan van de Ven wrote:

> One
> ---
> store_scaling_governor takes policy->lock and then calls __cpufreq_set_policy
> __cpufreq_set_policy calls __cpufreq_governor
> __cpufreq_governor calls __cpufreq_driver_target via cpufreq_governor_performance
> __cpufreq_driver_target calls lock_cpu_hotplug() (which takes the hotplug lock)
>
>
> Two
> ---
> cpufreq_stats_init lock_cpu_hotplug() and then calls cpufreq_stat_cpu_callback
> cpufreq_stat_cpu_callback calls cpufreq_update_policy
> cpufreq_update_policy takes the policy->lock
>
>
> so this looks like a real honest AB-BA deadlock to me...

This looks a little clearer this morning. I missed the fact that sys_init_module
isn't completely serialised, only the loading part. ->init routines can and will be
called in parallel.

I don't see where cpufreq_update_policy takes policy->lock though.
In my tree it just takes the per-cpu data->lock.

Time for more wake-up juice? Or am I missing something obvious again?

Dave

--
http://www.codemonkey.org.uk

2006-05-30 14:19:30

by Arjan van de Ven

[permalink] [raw]
Subject: Re: [patch 00/61] ANNOUNCE: lock validator -V1

On Tue, 2006-05-30 at 10:10 -0400, Dave Jones wrote:
> On Tue, May 30, 2006 at 07:45:47AM +0200, Arjan van de Ven wrote:
>
> > One
> > ---
> > store_scaling_governor takes policy->lock and then calls __cpufreq_set_policy
> > __cpufreq_set_policy calls __cpufreq_governor
> > __cpufreq_governor calls __cpufreq_driver_target via cpufreq_governor_performance
> > __cpufreq_driver_target calls lock_cpu_hotplug() (which takes the hotplug lock)
> >
> >
> > Two
> > ---
> > cpufreq_stats_init lock_cpu_hotplug() and then calls cpufreq_stat_cpu_callback
> > cpufreq_stat_cpu_callback calls cpufreq_update_policy
> > cpufreq_update_policy takes the policy->lock
> >
> >
> > so this looks like a real honest AB-BA deadlock to me...
>
> This looks a little clearer this morning. I missed the fact that sys_init_module
> isn't completely serialised, only the loading part. ->init routines can and will be
> called in parallel.
>
> I don't see where cpufreq_update_policy takes policy->lock though.
> In my tree it just takes the per-cpu data->lock.

isn't that basically the same lock?


2006-05-30 14:59:12

by Dave Jones

[permalink] [raw]
Subject: Re: [patch 00/61] ANNOUNCE: lock validator -V1

On Tue, May 30, 2006 at 04:19:22PM +0200, Arjan van de Ven wrote:

> > > One
> > > ---
> > > store_scaling_governor takes policy->lock and then calls __cpufreq_set_policy
> > > __cpufreq_set_policy calls __cpufreq_governor
> > > __cpufreq_governor calls __cpufreq_driver_target via cpufreq_governor_performance
> > > __cpufreq_driver_target calls lock_cpu_hotplug() (which takes the hotplug lock)
> > >
> > >
> > > Two
> > > ---
> > > cpufreq_stats_init lock_cpu_hotplug() and then calls cpufreq_stat_cpu_callback
> > > cpufreq_stat_cpu_callback calls cpufreq_update_policy
> > > cpufreq_update_policy takes the policy->lock
> > >
> > >
> > > so this looks like a real honest AB-BA deadlock to me...
> >
> > This looks a little clearer this morning. I missed the fact that sys_init_module
> > isn't completely serialised, only the loading part. ->init routines can and will be
> > called in parallel.
> >
> > I don't see where cpufreq_update_policy takes policy->lock though.
> > In my tree it just takes the per-cpu data->lock.
>
> isn't that basically the same lock?

Ugh, I've completely forgotten how this stuff fits together.

Dominik, any clues ?

Dave

--
http://www.codemonkey.org.uk

2006-05-30 17:13:22

by Dominik Brodowski

[permalink] [raw]
Subject: Re: [patch 00/61] ANNOUNCE: lock validator -V1

Hi,

On Tue, May 30, 2006 at 10:58:52AM -0400, Dave Jones wrote:
> On Tue, May 30, 2006 at 04:19:22PM +0200, Arjan van de Ven wrote:
>
> > > > One
> > > > ---
> > > > store_scaling_governor takes policy->lock and then calls __cpufreq_set_policy
> > > > __cpufreq_set_policy calls __cpufreq_governor
> > > > __cpufreq_governor calls __cpufreq_driver_target via cpufreq_governor_performance
> > > > __cpufreq_driver_target calls lock_cpu_hotplug() (which takes the hotplug lock)
> > > >
> > > >
> > > > Two
> > > > ---
> > > > cpufreq_stats_init lock_cpu_hotplug() and then calls cpufreq_stat_cpu_callback
> > > > cpufreq_stat_cpu_callback calls cpufreq_update_policy
> > > > cpufreq_update_policy takes the policy->lock
> > > >
> > > >
> > > > so this looks like a real honest AB-BA deadlock to me...
> > >
> > > This looks a little clearer this morning. I missed the fact that sys_init_module
> > > isn't completely serialised, only the loading part. ->init routines can and will be
> > > called in parallel.
> > >
> > > I don't see where cpufreq_update_policy takes policy->lock though.
> > > In my tree it just takes the per-cpu data->lock.
> >
> > isn't that basically the same lock?
>
> Ugh, I've completely forgotten how this stuff fits together.
>
> Dominik, any clues ?

That's indeed a possible deadlock situation -- what's the
cpufreq_update_policy() call needed for in cpufreq_stat_cpu_callback anyway?

Dominik

2006-05-30 17:38:39

by Steven Rostedt

[permalink] [raw]
Subject: Re: [patch 05/61] lock validator: introduce WARN_ON_ONCE(cond)

On Mon, 2006-05-29 at 18:33 -0700, Andrew Morton wrote:
> On Mon, 29 May 2006 23:23:28 +0200
> Ingo Molnar <[email protected]> wrote:
>
> > add WARN_ON_ONCE(cond) to print once-per-bootup messages.
> >
> > Signed-off-by: Ingo Molnar <[email protected]>
> > Signed-off-by: Arjan van de Ven <[email protected]>
> > ---
> > include/asm-generic/bug.h | 13 +++++++++++++
> > 1 file changed, 13 insertions(+)
> >
> > Index: linux/include/asm-generic/bug.h
> > ===================================================================
> > --- linux.orig/include/asm-generic/bug.h
> > +++ linux/include/asm-generic/bug.h
> > @@ -44,4 +44,17 @@
> > # define WARN_ON_SMP(x) do { } while (0)
> > #endif
> >
> > +#define WARN_ON_ONCE(condition) \
> > +({ \
> > + static int __warn_once = 1; \
> > + int __ret = 0; \
> > + \
> > + if (unlikely(__warn_once && (condition))) { \

Since __warn_once is likely to be true, and the condition is likely to
be false, wouldn't it be better to switch this around to:

if (unlikely((condition) && __warn_once)) {

So the && will short-circuit before having to check a global variable.

Only after the unlikely condition would the __warn_once be false.
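
(i.e. the full macro with that one change applied would look like this -
just a sketch of the suggestion, otherwise identical to the quoted patch:)

	#define WARN_ON_ONCE(condition)				\
	({							\
		static int __warn_once = 1;			\
		int __ret = 0;					\
								\
		if (unlikely((condition) && __warn_once)) {	\
			__warn_once = 0;			\
			WARN_ON(1);				\
			__ret = 1;				\
		}						\
		__ret;						\
	})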

-- Steve

> > + __warn_once = 0; \
> > + WARN_ON(1); \
> > + __ret = 1; \
> > + } \
> > + __ret; \
> > +})
> > +
> > #endif
>
> I'll queue this for mainline inclusion.


2006-05-30 17:45:26

by Steven Rostedt

[permalink] [raw]
Subject: Re: [patch 06/61] lock validator: add __module_address() method

On Mon, 2006-05-29 at 18:33 -0700, Andrew Morton wrote:

>
> I'd suggest that __module_address() should do the same thing, from an API neatness
> POV. Although perhaps that's not very useful if we didn't take a ref on the returned
> object (but module_text_address() doesn't either).
>
> Also, the name's a bit misleading - it sounds like it returns the address
> of a module or something. __module_any_address() would be better, perhaps?

How about __valid_module_address(), so that it describes exactly what it
is doing? Or __module_address_valid().

-- Steve

>
> Also, how come this doesn't need modlist_lock()?


2006-05-30 19:03:15

by Dave Jones

[permalink] [raw]
Subject: Re: [patch 00/61] ANNOUNCE: lock validator -V1

On Tue, May 30, 2006 at 07:11:18PM +0200, Dominik Brodowski wrote:

> That's indeed a possible deadlock situation -- what's the
> cpufreq_update_policy() call needed for in cpufreq_stat_cpu_callback anyway?

I was hoping you could enlighten me :)
I started picking through history with gitk, but my tk install uses
fonts that make my eyes bleed. My kingdom for a 'git annotate'..

Dave
--
http://www.codemonkey.org.uk

2006-05-30 19:25:34

by Roland Dreier

[permalink] [raw]
Subject: Re: [patch 00/61] ANNOUNCE: lock validator -V1

Dave> I was hoping you could enlighten me :) I started picking
Dave> through history with gitk, but my tk install uses fonts that
Dave> make my eyes bleed. My kingdom for a 'git annotate'..

Heh -- try "git annotate" or "git blame". I think you need git 1.3.x
for that... details of where to send your kingdom forthcoming...

- R.

2006-05-30 19:34:23

by Dave Jones

[permalink] [raw]
Subject: Re: [patch 00/61] ANNOUNCE: lock validator -V1

On Tue, May 30, 2006 at 12:25:29PM -0700, Roland Dreier wrote:
> Dave> I was hoping you could enlighten me :) I started picking
> Dave> through history with gitk, but my tk install uses fonts that
> Dave> make my eyes bleed. My kingdom for a 'git annotate'..
>
> Heh -- try "git annotate" or "git blame". I think you need git 1.3.x
> for that... details of where to send your kingdom forthcoming...

How on earth did I miss that? Thanks for the pointer.

Dave

--
http://www.codemonkey.org.uk

2006-05-30 19:40:14

by Dave Jones

[permalink] [raw]
Subject: Re: [patch 00/61] ANNOUNCE: lock validator -V1

On Tue, May 30, 2006 at 07:11:18PM +0200, Dominik Brodowski wrote:

> On Tue, May 30, 2006 at 10:58:52AM -0400, Dave Jones wrote:
> > On Tue, May 30, 2006 at 04:19:22PM +0200, Arjan van de Ven wrote:
> >
> > > > > One
> > > > > ---
> > > > > store_scaling_governor takes policy->lock and then calls __cpufreq_set_policy
> > > > > __cpufreq_set_policy calls __cpufreq_governor
> > > > > __cpufreq_governor calls __cpufreq_driver_target via cpufreq_governor_performance
> > > > > __cpufreq_driver_target calls lock_cpu_hotplug() (which takes the hotplug lock)
> > > > >
> > > > >
> > > > > Two
> > > > > ---
> > > > > cpufreq_stats_init lock_cpu_hotplug() and then calls cpufreq_stat_cpu_callback
> > > > > cpufreq_stat_cpu_callback calls cpufreq_update_policy
> > > > > cpufreq_update_policy takes the policy->lock
> > > > >
> > > > >
> > > > > so this looks like a real honest AB-BA deadlock to me...
> > > >
> > > > This looks a little clearer this morning. I missed the fact that sys_init_module
> > > > isn't completely serialised, only the loading part. ->init routines can and will be
> > > > called in parallel.
> > > >
> > > > I don't see where cpufreq_update_policy takes policy->lock though.
> > > > In my tree it just takes the per-cpu data->lock.
> > >
> > > isn't that basically the same lock?
> >
> > Ugh, I've completely forgotten how this stuff fits together.
> >
> > Dominik, any clues ?
>
> That's indeed a possible deadlock situation -- what's the
> cpufreq_update_policy() call needed for in cpufreq_stat_cpu_callback anyway?

Oh wow. Reading the commit message of this change rings alarm bells.

Change c32b6b8e524d2c337767d312814484d9289550cf has this to say:

[PATCH] create and destroy cpufreq sysfs entries based on cpu notifiers

cpufreq entries in sysfs should only be populated when CPU is online state.
When we either boot with maxcpus=x and then boot the other cpus by echoing
to sysfs online file, these entries should be created and destroyed when
CPU_DEAD is notified. Same treatement as cache entries under sysfs.

We place the processor in the lowest frequency, so hw managed P-State
transitions can still work on the other threads to save power.

Primary goal was to just make these directories appear/disapper dynamically.

There is one in this patch i had to do, which i really dont like myself but
probably best if someone handling the cpufreq infrastructure could give
this code right treatment if this is not acceptable. I guess its probably
good for the first cut.

- Converting lock_cpu_hotplug()/unlock_cpu_hotplug() to disable/enable preempt.
The locking was smack in the middle of the notification path, when the
hotplug is already holding the lock. I tried another solution to avoid this
so avoid taking locks if we know we are from notification path. The solution
was getting very ugly and i decided this was probably good for this iteration
until someone who understands cpufreq could do a better job than me.

So, that last part pretty much highlights that we knew about this problem, and
meant to come back and fix it later. Surprise surprise, no one came back and
fixed it.

Dave

--
http://www.codemonkey.org.uk

2006-05-30 19:56:58

by Raj, Ashok

[permalink] [raw]
Subject: Re: [patch 00/61] ANNOUNCE: lock validator -V1

On Tue, May 30, 2006 at 03:39:47PM -0400, Dave Jones wrote:

> So, that last part pretty highlights that we knew about this problem, and meant to
> come back and fix it later. Surprise surprise, no one came back and fixed it.
>

There was another iteration after his, and currently we keep track of
the owner in lock_cpu_hotplug()->__lock_cpu_hotplug(). So if we are in the
same thread context we don't acquire the lock again.

	if (lock_cpu_hotplug_owner != current) {
		if (interruptible)
			ret = down_interruptible(&cpucontrol);
		else
			down(&cpucontrol);
	}


The lock and unlock also keep track of the depth, so we know when to release.

We didn't hear any better suggestions (from the cpufreq folks), so we left it
in that state (at least the same thread doesn't try to take the lock twice,
which is what resulted in deadlocks earlier).

--
Cheers,
Ashok Raj
- Open Source Technology Center

2006-05-30 20:41:15

by Ingo Molnar

[permalink] [raw]
Subject: Re: [patch 00/61] ANNOUNCE: lock validator -V1


* Roland Dreier <[email protected]> wrote:

> Dave> I was hoping you could enlighten me :) I started picking
> Dave> through history with gitk, but my tk install uses fonts that
> Dave> make my eyes bleed. My kingdom for a 'git annotate'..
>
> Heh -- try "git annotate" or "git blame". I think you need git 1.3.x
> for that... details of where to send your kingdom forthcoming...

i use qgit, which is GTK based and thus uses the native desktop fonts.

Ingo

2006-05-30 20:44:04

by Ingo Molnar

[permalink] [raw]
Subject: Re: [patch 00/61] ANNOUNCE: lock validator -V1


* Ingo Molnar <[email protected]> wrote:

>
> * Roland Dreier <[email protected]> wrote:
>
> > Dave> I was hoping you could enlighten me :) I started picking
> > Dave> through history with gitk, but my tk install uses fonts that
> > Dave> make my eyes bleed. My kingdom for a 'git annotate'..
> >
> > Heh -- try "git annotate" or "git blame". I think you need git 1.3.x
> > for that... details of where to send your kingdom forthcoming...
>
> i use qgit, which is GTK based and thus uses the native desktop fonts.

and qgit annotates source files in the background while you are viewing
them, and then you can click on lines to jump to the last commit that
touched them. It doesn't need the latest GIT, qgit has always done this
(by itself).

Ingo

2006-05-30 20:51:40

by Steven Rostedt

[permalink] [raw]
Subject: Re: [patch 37/61] lock validator: special locking: dcache

On Mon, 2006-05-29 at 18:35 -0700, Andrew Morton wrote:

> > Index: linux/fs/dcache.c
> > ===================================================================
> > --- linux.orig/fs/dcache.c
> > +++ linux/fs/dcache.c
> > @@ -1380,10 +1380,10 @@ void d_move(struct dentry * dentry, stru
> > */
> > if (target < dentry) {
> > spin_lock(&target->d_lock);
> > - spin_lock(&dentry->d_lock);
> > + spin_lock_nested(&dentry->d_lock, DENTRY_D_LOCK_NESTED);
> > } else {
> > spin_lock(&dentry->d_lock);
> > - spin_lock(&target->d_lock);
> > + spin_lock_nested(&target->d_lock, DENTRY_D_LOCK_NESTED);
> > }
> >
>

[...]

> > +/*
> > + * dentry->d_lock spinlock nesting types:
> > + *
> > + * 0: normal
> > + * 1: nested
> > + */
> > +enum dentry_d_lock_type
> > +{
> > + DENTRY_D_LOCK_NORMAL,
> > + DENTRY_D_LOCK_NESTED
> > +};
> > +
> > struct dentry_operations {
> > int (*d_revalidate)(struct dentry *, struct nameidata *);
> > int (*d_hash) (struct dentry *, struct qstr *);
>
> DENTRY_D_LOCK_NORMAL isn't used anywhere.
>

I guess it is implied with the normal spin_lock. Since
spin_lock(&target->d_lock) and
spin_lock_nested(&target->d_lock, DENTRY_D_LOCK_NORMAL)
are equivalent. (DENTRY_D_LOCK_NORMAL == 0)

Probably this deserves a comment.

-- Steve


2006-05-30 20:53:30

by Steven Rostedt

[permalink] [raw]
Subject: Re: [patch 38/61] lock validator: special locking: i_mutex

On Mon, 2006-05-29 at 23:26 +0200, Ingo Molnar wrote:
> + * inode->i_mutex nesting types for the LOCKDEP validator:
> + *
> + * 0: the object of the current VFS operation
> + * 1: parent
> + * 2: child/target
> + */
> +enum inode_i_mutex_lock_type
> +{
> + I_MUTEX_NORMAL,
> + I_MUTEX_PARENT,
> + I_MUTEX_CHILD
> +};
> +
> +/*

I guess we can say the same about I_MUTEX_NORMAL.

-- Steve


2006-05-30 20:54:31

by Ingo Molnar

[permalink] [raw]
Subject: [patch, -rc5-mm1] lock validator: select KALLSYMS_ALL


* Arjan van de Ven <[email protected]> wrote:

> the reporter doesn't have CONFIG_KALLSYMS_ALL enabled, which sometimes
> gives misleading backtraces (should lockdep just enable KALLSYMS_ALL
> to get more useful bug reports?)

agreed - the patch below does that.

-----------------------
Subject: lock validator: select KALLSYMS_ALL
From: Ingo Molnar <[email protected]>

all the kernel symbol printouts make a lot more sense if KALLSYMS_ALL
is enabled too - force it on if lockdep is enabled.

Signed-off-by: Ingo Molnar <[email protected]>
---
lib/Kconfig.debug | 1 +
1 file changed, 1 insertion(+)

Index: linux/lib/Kconfig.debug
===================================================================
--- linux.orig/lib/Kconfig.debug
+++ linux/lib/Kconfig.debug
@@ -342,6 +342,7 @@ config LOCKDEP
default y
select FRAME_POINTER
select KALLSYMS
+ select KALLSYMS_ALL
depends on PROVE_SPIN_LOCKING || PROVE_RW_LOCKING || PROVE_MUTEX_LOCKING || PROVE_RWSEM_LOCKING

config DEBUG_LOCKDEP

2006-05-30 21:00:52

by Ingo Molnar

[permalink] [raw]
Subject: Re: [patch 37/61] lock validator: special locking: dcache


* Steven Rostedt <[email protected]> wrote:

> > > +enum dentry_d_lock_type
> > > +{
> > > + DENTRY_D_LOCK_NORMAL,
> > > + DENTRY_D_LOCK_NESTED
> > > +};
> > > +
> > > struct dentry_operations {
> > > int (*d_revalidate)(struct dentry *, struct nameidata *);
> > > int (*d_hash) (struct dentry *, struct qstr *);
> >
> > DENTRY_D_LOCK_NORMAL isn't used anywhere.
>
> I guess it is implied with the normal spin_lock. Since
> spin_lock(&target->d_lock) and
> spin_lock_nested(&target->d_lock, DENTRY_D_LOCK_NORMAL)
> are equivalent. (DENTRY_D_LOCK_NORMAL == 0)

correct. This is the case for all the subtype enum definitions: 0 means
normal spinlock [rwlock, rwsem, mutex] API use.

Ingo

2006-05-30 21:06:17

by Ingo Molnar

[permalink] [raw]
Subject: Re: [patch 38/61] lock validator: special locking: i_mutex


* Steven Rostedt <[email protected]> wrote:

> On Mon, 2006-05-29 at 23:26 +0200, Ingo Molnar wrote:
> > + * inode->i_mutex nesting types for the LOCKDEP validator:
> > + *
> > + * 0: the object of the current VFS operation
> > + * 1: parent
> > + * 2: child/target
> > + */
> > +enum inode_i_mutex_lock_type
> > +{
> > + I_MUTEX_NORMAL,
> > + I_MUTEX_PARENT,
> > + I_MUTEX_CHILD
> > +};
> > +
> > +/*
>
> I guess we can say the same about I_MUTEX_NORMAL.

yeah. Subtypes start from 1, as 0 is the basic type.

Lock types are keyed via static kernel addresses. This means that we can
use the lock address (for DEFINE_SPINLOCK) or the static key embedded in
spin_lock_init() as a key in 99% of the cases. The key [struct
lockdep_type_key, see include/linux/lockdep.h] occupies enough bytes (of
kernel static virtual memory) so that the keys remain automatically
unique. Right now MAX_LOCKDEP_SUBTYPES is 8, so the keys take at most 8
bytes. (To save some memory there's another detail: for static locks
(DEFINE_SPINLOCK ones) we use the lock address itself as the key.)
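
For illustration, the key embedding at spin_lock_init() call sites can be
pictured roughly like this (a simplified sketch of the scheme described
above, not the literal lockdep macros):

	/* one byte per subtype keeps the per-subtype key addresses unique */
	struct lockdep_type_key {
		char subkeys[MAX_LOCKDEP_SUBTYPES];
	};

	#define spin_lock_init(lock)					\
	do {								\
		/* a unique static address per init call site: */	\
		static struct lockdep_type_key __key;			\
									\
		__spin_lock_init((lock), #lock, &__key);		\
	} while (0)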

Ingo

2006-05-30 21:58:34

by Paolo Ciarrocchi

[permalink] [raw]
Subject: Re: [patch 00/61] ANNOUNCE: lock validator -V1

On 5/30/06, Ingo Molnar <[email protected]> wrote:
>
> * Roland Dreier <[email protected]> wrote:
>
> > Dave> I was hoping you could enlighten me :) I started picking
> > Dave> through history with gitk, but my tk install uses fonts that
> > Dave> make my eyes bleed. My kingdom for a 'git annotate'..
> >
> > Heh -- try "git annotate" or "git blame". I think you need git 1.3.x
> > for that... details of where to send your kingdom forthcoming...
>
> i use qgit, which is GTK based and thus uses the native desktop fonts.

GTK? A typo, I suppose.
QGit is a git GUI viewer built on Qt/C++ (that I hope will be added to
the git.git tree soon).

Ciao,

--
Paolo
http://paolociarrocchi.googlepages.com

2006-05-31 05:40:59

by Manfred Spraul

[permalink] [raw]
Subject: Re: [patch 02/61] lock validator: forcedeth.c fix

Andrew Morton wrote:

>On Mon, 29 May 2006 23:23:13 +0200
>Ingo Molnar <[email protected]> wrote:
>
>
>
>>nv_do_nic_poll() is called from timer softirqs, which has interrupts
>>enabled, but np->lock might also be taken by some other interrupt
>>context.
>>
>>
>
>But the driver does disable_irq(), so I'd say this was a false-positive.
>
>And afaict this is not a timer handler - it's a poll_controller handler
>(although maybe that get called from timer handler somewhere?)
>
>
>
It's both a timer handler and a poll_controller handler:
- if the interrupt handler causes a system overload (gigabit ethernet
without irq mitigation...), then the driver disables the irq on the
device, waits one tick and handles the interrupts from a timer. This is
nv_do_nic_poll().

- nv_do_nic_poll is also called from the poll_controller handler.

I'll try to remove the disable_irq() calls from the poll_controller
handler, but probably not before the weekend.
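
For reference, the pattern being discussed boils down to something like
this (heavily simplified from forcedeth; not the literal driver code):

	/* timer handler: runs in softirq context with hardirqs enabled */
	static void nv_do_nic_poll(unsigned long data)
	{
		struct net_device *dev = (struct net_device *) data;
		struct fe_priv *np = netdev_priv(dev);

		/*
		 * The device's own irq is masked here, so its interrupt
		 * handler cannot take np->lock concurrently - which is why
		 * the report is arguably a false positive:
		 */
		disable_irq(dev->irq);

		spin_lock(&np->lock);
		/* ... process the pending interrupt events ... */
		spin_unlock(&np->lock);

		enable_irq(dev->irq);
	}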

--
Manfred

2006-05-31 08:40:51

by Ingo Molnar

[permalink] [raw]
Subject: Re: [patch 00/61] ANNOUNCE: lock validator -V1


* Paolo Ciarrocchi <[email protected]> wrote:

> GTK? A typo, I suppose.

brainfart, sorry :)

> QGit is a git GUI viewer built on Qt/C++ (that I hope will be added to
> the git.git tree soon).

yeah.

Ingo

2007-02-13 14:23:08

by Ingo Molnar

[permalink] [raw]
Subject: [patch 01/11] syslets: add async.h include file, kernel-side API definitions

From: Ingo Molnar <[email protected]>

add include/linux/async.h which contains the kernel-side API
declarations.

it also provides NOP stubs for the !CONFIG_ASYNC_SUPPORT case.

Signed-off-by: Ingo Molnar <[email protected]>
Signed-off-by: Arjan van de Ven <[email protected]>
---
include/linux/async.h | 25 +++++++++++++++++++++++++
1 file changed, 25 insertions(+)

Index: linux/include/linux/async.h
===================================================================
--- /dev/null
+++ linux/include/linux/async.h
@@ -0,0 +1,25 @@
+#ifndef _LINUX_ASYNC_H
+#define _LINUX_ASYNC_H
+/*
+ * The syslet subsystem - asynchronous syscall execution support.
+ *
+ * Generic kernel API definitions:
+ */
+
+#ifdef CONFIG_ASYNC_SUPPORT
+extern void async_init(struct task_struct *t);
+extern void async_exit(struct task_struct *t);
+extern void __async_schedule(struct task_struct *t);
+#else /* !CONFIG_ASYNC_SUPPORT */
+static inline void async_init(struct task_struct *t)
+{
+}
+static inline void async_exit(struct task_struct *t)
+{
+}
+static inline void __async_schedule(struct task_struct *t)
+{
+}
+#endif /* !CONFIG_ASYNC_SUPPORT */
+
+#endif

2007-02-13 14:23:14

by Ingo Molnar

[permalink] [raw]
Subject: [patch 03/11] syslets: generic kernel bits

From: Ingo Molnar <[email protected]>

add the kernel generic bits - these are present even if !CONFIG_ASYNC_SUPPORT.

Signed-off-by: Ingo Molnar <[email protected]>
Signed-off-by: Arjan van de Ven <[email protected]>
---
include/linux/sched.h | 7 ++++++-
kernel/exit.c | 3 +++
kernel/fork.c | 2 ++
kernel/sched.c | 9 +++++++++
4 files changed, 20 insertions(+), 1 deletion(-)

Index: linux/include/linux/sched.h
===================================================================
--- linux.orig/include/linux/sched.h
+++ linux/include/linux/sched.h
@@ -88,7 +88,8 @@ struct sched_param {

struct exec_domain;
struct futex_pi_state;
-
+struct async_thread;
+struct async_head;
/*
* List of flags we want to share for kernel threads,
* if only because they are not used by them anyway.
@@ -997,6 +998,10 @@ struct task_struct {
/* journalling filesystem info */
void *journal_info;

+/* async syscall support: */
+ struct async_thread *at, *async_ready;
+ struct async_head *ah;
+
/* VM state */
struct reclaim_state *reclaim_state;

Index: linux/kernel/exit.c
===================================================================
--- linux.orig/kernel/exit.c
+++ linux/kernel/exit.c
@@ -26,6 +26,7 @@
#include <linux/ptrace.h>
#include <linux/profile.h>
#include <linux/mount.h>
+#include <linux/async.h>
#include <linux/proc_fs.h>
#include <linux/mempolicy.h>
#include <linux/taskstats_kern.h>
@@ -889,6 +890,8 @@ fastcall NORET_TYPE void do_exit(long co
schedule();
}

+ async_exit(tsk);
+
tsk->flags |= PF_EXITING;

if (unlikely(in_atomic()))
Index: linux/kernel/fork.c
===================================================================
--- linux.orig/kernel/fork.c
+++ linux/kernel/fork.c
@@ -22,6 +22,7 @@
#include <linux/personality.h>
#include <linux/mempolicy.h>
#include <linux/sem.h>
+#include <linux/async.h>
#include <linux/file.h>
#include <linux/key.h>
#include <linux/binfmts.h>
@@ -1054,6 +1055,7 @@ static struct task_struct *copy_process(

p->lock_depth = -1; /* -1 = no lock */
do_posix_clock_monotonic_gettime(&p->start_time);
+ async_init(p);
p->security = NULL;
p->io_context = NULL;
p->io_wait = NULL;
Index: linux/kernel/sched.c
===================================================================
--- linux.orig/kernel/sched.c
+++ linux/kernel/sched.c
@@ -38,6 +38,7 @@
#include <linux/vmalloc.h>
#include <linux/blkdev.h>
#include <linux/delay.h>
+#include <linux/async.h>
#include <linux/smp.h>
#include <linux/threads.h>
#include <linux/timer.h>
@@ -3436,6 +3437,14 @@ asmlinkage void __sched schedule(void)
}
profile_hit(SCHED_PROFILING, __builtin_return_address(0));

+ prev = current;
+ if (unlikely(prev->async_ready)) {
+ if (prev->state && !(preempt_count() & PREEMPT_ACTIVE) &&
+ (!(prev->state & TASK_INTERRUPTIBLE) ||
+ !signal_pending(prev)))
+ __async_schedule(prev);
+ }
+
need_resched:
preempt_disable();
prev = current;

2007-02-13 14:23:33

by Ingo Molnar

[permalink] [raw]
Subject: [patch 06/11] syslets: core, documentation

From: Ingo Molnar <[email protected]>

Add Documentation/syslet-design.txt with a high-level description
of the syslet concepts.

Signed-off-by: Ingo Molnar <[email protected]>
Signed-off-by: Arjan van de Ven <[email protected]>
---
Documentation/syslet-design.txt | 137 ++++++++++++++++++++++++++++++++++++++++
1 file changed, 137 insertions(+)

Index: linux/Documentation/syslet-design.txt
===================================================================
--- /dev/null
+++ linux/Documentation/syslet-design.txt
@@ -0,0 +1,137 @@
+Syslets / asynchronous system calls
+===================================
+
+started by Ingo Molnar <[email protected]>
+
+Goal:
+-----
+
+The goal of the syslet subsystem is to allow user-space to execute
+arbitrary system calls asynchronously. It does so by allowing user-space
+to execute "syslets" which are small scriptlets that the kernel can execute
+both securely and asynchronously without having to exit to user-space.
+
+the core syslet concepts are:
+
+The Syslet Atom:
+----------------
+
+The syslet atom is a small, fixed-size (44 bytes on 32-bit) piece of
+user-space memory, which is the basic unit of execution within the syslet
+framework. A syslet represents a single system-call and its arguments.
+In addition it also has condition flags attached to it that allows the
+construction of larger programs (syslets) from these atoms.
+
+Arguments to the system call are implemented via pointers to arguments.
+This not only increases the flexibility of syslet atoms (multiple syslets
+can share the same variable for example), but is also an optimization:
+copy_uatom() will only fetch syscall parameters up until the point it
+meets the first NULL pointer. 50% of all syscalls have 2 or less
+parameters (and 90% of all syscalls have 4 or less parameters).
+
+ [ Note: since the argument array is at the end of the atom, and the
+ kernel will not touch any argument beyond the final NULL one, atoms
+ might be packed more tightly. (the only special case exception to
+ this rule would be SKIP_TO_NEXT_ON_STOP atoms, where the kernel will
+ jump a full syslet_uatom number of bytes.) ]
+
+The Syslet:
+-----------
+
+A syslet is a program, represented by a graph of syslet atoms. The
+syslet atoms are chained to each other either via the atom->next pointer,
+or via the SYSLET_SKIP_TO_NEXT_ON_STOP flag.
+
+Running Syslets:
+----------------
+
+Syslets can be run via the sys_async_exec() system call, which takes
+the first atom of the syslet as an argument. The kernel does not need
+to be told about the other atoms - it will fetch them on the fly as
+execution goes forward.
+
+A syslet might either be executed 'cached', or it might generate a
+'cachemiss'.
+
+'Cached' syslet execution means that the whole syslet was executed
+without blocking. The system-call returns the submitted atom's address
+in this case.
+
+If a syslet blocks while the kernel executes a system-call embedded in
+one of its atoms, the kernel will keep working on that syscall in
+parallel, but it immediately returns to user-space with a NULL pointer,
+so the submitting task can submit other syslets.
+
+Completion of asynchronous syslets:
+-----------------------------------
+
+Completion of asynchronous syslets is done via the 'completion ring',
+which is a ringbuffer of syslet atom pointers in user-space memory,
+provided by user-space in the sys_async_register() syscall. The
+kernel fills in the ringbuffer starting at index 0, and user-space
+must clear out these pointers. Once the kernel reaches the end of
+the ring it wraps back to index 0. The kernel will not overwrite
+non-NULL pointers (but will return an error), user-space has to
+make sure it completes all events it asked for.
+
+Waiting for completions:
+------------------------
+
+Syslet completions can be waited for via the sys_async_wait()
+system call - which takes the number of events it should wait for as
+a parameter. This system call will also return if the number of
+pending events goes down to zero.
+
+Sample Hello World syslet code:
+
+--------------------------->
+/*
+ * Set up a syslet atom:
+ */
+static void
+init_atom(struct syslet_uatom *atom, int nr,
+ void *arg_ptr0, void *arg_ptr1, void *arg_ptr2,
+ void *arg_ptr3, void *arg_ptr4, void *arg_ptr5,
+ void *ret_ptr, unsigned long flags, struct syslet_uatom *next)
+{
+ atom->nr = nr;
+ atom->arg_ptr[0] = arg_ptr0;
+ atom->arg_ptr[1] = arg_ptr1;
+ atom->arg_ptr[2] = arg_ptr2;
+ atom->arg_ptr[3] = arg_ptr3;
+ atom->arg_ptr[4] = arg_ptr4;
+ atom->arg_ptr[5] = arg_ptr5;
+ atom->ret_ptr = ret_ptr;
+ atom->flags = flags;
+ atom->next = next;
+}
+
+int main(int argc, char *argv[])
+{
+ unsigned long int fd_out = 1; /* standard output */
+ char *buf = "Hello Syslet World!\n";
+ unsigned long size = strlen(buf);
+ struct syslet_uatom atom, *done;
+
+ async_head_init();
+
+ /*
+ * Simple syslet consisting of a single atom:
+ */
+ init_atom(&atom, __NR_sys_write, &fd_out, &buf, &size,
+ NULL, NULL, NULL, NULL, SYSLET_ASYNC, NULL);
+ done = sys_async_exec(&atom);
+ if (!done) {
+ sys_async_wait(1);
+ if (completion_ring[curr_ring_idx] == &atom) {
+ completion_ring[curr_ring_idx] = NULL;
+ printf("completed an async syslet atom!\n");
+ }
+ } else {
+ printf("completed an cached syslet atom!\n");
+ }
+
+ async_head_exit();
+
+ return 0;
+}

2007-02-13 14:23:42

by Ingo Molnar

[permalink] [raw]
Subject: [patch 04/11] syslets: core, data structures

From: Ingo Molnar <[email protected]>

this adds the data structures used by the syslet / async system calls
infrastructure.

This is used only if CONFIG_ASYNC_SUPPORT is enabled.

Signed-off-by: Ingo Molnar <[email protected]>
Signed-off-by: Arjan van de Ven <[email protected]>
---
kernel/async.h | 58 +++++++++++++++++++++++++++++++++++++++++++++++++++++++++
1 file changed, 58 insertions(+)

Index: linux/kernel/async.h
===================================================================
--- /dev/null
+++ linux/kernel/async.h
@@ -0,0 +1,58 @@
+/*
+ * The syslet subsystem - asynchronous syscall execution support.
+ *
+ * Syslet-subsystem internal definitions:
+ */
+
+/*
+ * The kernel-side copy of a syslet atom - with arguments expanded:
+ */
+struct syslet_atom {
+ unsigned long flags;
+ unsigned long nr;
+ long __user *ret_ptr;
+ struct syslet_uatom __user *next;
+ unsigned long args[6];
+};
+
+/*
+ * The 'async head' is the thread which has user-space context (ptregs)
+ * 'below it' - this is the one that can return to user-space:
+ */
+struct async_head {
+ spinlock_t lock;
+ struct task_struct *user_task;
+
+ struct list_head ready_async_threads;
+ struct list_head busy_async_threads;
+
+ unsigned long events_left;
+ wait_queue_head_t wait;
+
+ struct async_head_user __user *uah;
+ struct syslet_uatom __user **completion_ring;
+ unsigned long curr_ring_idx;
+ unsigned long max_ring_idx;
+ unsigned long ring_size_bytes;
+
+ unsigned int nr_threads;
+ unsigned int max_nr_threads;
+
+ struct completion start_done;
+ struct completion exit_done;
+};
+
+/*
+ * The 'async thread' is either a newly created async thread or it is
+ * an 'ex-head' - it cannot return to user-space and only has kernel
+ * context.
+ */
+struct async_thread {
+ struct task_struct *task;
+ struct syslet_uatom __user *work;
+ struct async_head *ah;
+
+ struct list_head entry;
+
+ unsigned int exit;
+};

2007-02-13 14:24:50

by Ingo Molnar

[permalink] [raw]
Subject: [patch 05/11] syslets: core code

From: Ingo Molnar <[email protected]>

the core syslet / async system calls infrastructure code.

Is built only if CONFIG_ASYNC_SUPPORT is enabled.

Signed-off-by: Ingo Molnar <[email protected]>
Signed-off-by: Arjan van de Ven <[email protected]>
---
kernel/Makefile | 1
kernel/async.c | 811 ++++++++++++++++++++++++++++++++++++++++++++++++++++++++
2 files changed, 812 insertions(+)

Index: linux/kernel/Makefile
===================================================================
--- linux.orig/kernel/Makefile
+++ linux/kernel/Makefile
@@ -10,6 +10,7 @@ obj-y = sched.o fork.o exec_domain.o
kthread.o wait.o kfifo.o sys_ni.o posix-cpu-timers.o mutex.o \
hrtimer.o rwsem.o latency.o nsproxy.o srcu.o

+obj-$(CONFIG_ASYNC_SUPPORT) += async.o
obj-$(CONFIG_STACKTRACE) += stacktrace.o
obj-y += time/
obj-$(CONFIG_DEBUG_MUTEXES) += mutex-debug.o
Index: linux/kernel/async.c
===================================================================
--- /dev/null
+++ linux/kernel/async.c
@@ -0,0 +1,811 @@
+/*
+ * kernel/async.c
+ *
+ * The syslet subsystem - asynchronous syscall execution support.
+ *
+ * Started by Ingo Molnar:
+ *
+ * Copyright (C) 2007 Red Hat, Inc., Ingo Molnar <[email protected]>
+ *
+ * This file is released under the GPLv2.
+ *
+ * This code implements asynchronous syscalls via 'syslets'.
+ *
+ * Syslets consist of a set of 'syslet atoms' which are residing
+ * purely in user-space memory and have no kernel-space resource
+ * attached to them. These atoms can be linked to each other via
+ * pointers. Besides the fundamental ability to execute system
+ * calls, syslet atoms can also implement branches, loops and
+ * arithmetics.
+ *
+ * Thus syslets can be used to build small autonomous programs that
+ * the kernel can execute purely from kernel-space, without having
+ * to return to any user-space context. Syslets can be run by any
+ * unprivileged user-space application - they are executed safely
+ * by the kernel.
+ */
+#include <linux/syscalls.h>
+#include <linux/syslet.h>
+#include <linux/delay.h>
+#include <linux/async.h>
+#include <linux/sched.h>
+#include <linux/init.h>
+#include <linux/err.h>
+
+#include <asm/uaccess.h>
+#include <asm/unistd.h>
+
+#include "async.h"
+
+typedef asmlinkage long (*syscall_fn_t)(long, long, long, long, long, long);
+
+extern syscall_fn_t sys_call_table[NR_syscalls];
+
+static void
+__mark_async_thread_ready(struct async_thread *at, struct async_head *ah)
+{
+ list_del(&at->entry);
+ list_add_tail(&at->entry, &ah->ready_async_threads);
+ if (list_empty(&ah->busy_async_threads))
+ wake_up(&ah->wait);
+}
+
+static void
+mark_async_thread_ready(struct async_thread *at, struct async_head *ah)
+{
+ spin_lock(&ah->lock);
+ __mark_async_thread_ready(at, ah);
+ spin_unlock(&ah->lock);
+}
+
+static void
+__mark_async_thread_busy(struct async_thread *at, struct async_head *ah)
+{
+ list_del(&at->entry);
+ list_add_tail(&at->entry, &ah->busy_async_threads);
+}
+
+static void
+mark_async_thread_busy(struct async_thread *at, struct async_head *ah)
+{
+ spin_lock(&ah->lock);
+ __mark_async_thread_busy(at, ah);
+ spin_unlock(&ah->lock);
+}
+
+static void
+__async_thread_init(struct task_struct *t, struct async_thread *at,
+ struct async_head *ah)
+{
+ INIT_LIST_HEAD(&at->entry);
+ at->exit = 0;
+ at->task = t;
+ at->ah = ah;
+ at->work = NULL;
+
+ t->at = at;
+ ah->nr_threads++;
+}
+
+static void
+async_thread_init(struct task_struct *t, struct async_thread *at,
+ struct async_head *ah)
+{
+ spin_lock(&ah->lock);
+ __async_thread_init(t, at, ah);
+ __mark_async_thread_ready(at, ah);
+ spin_unlock(&ah->lock);
+}
+
+
+static void
+async_thread_exit(struct async_thread *at, struct task_struct *t)
+{
+ struct async_head *ah;
+
+ ah = at->ah;
+
+ spin_lock(&ah->lock);
+ list_del_init(&at->entry);
+ if (at->exit)
+ complete(&ah->exit_done);
+ t->at = NULL;
+ at->task = NULL;
+ WARN_ON(!ah->nr_threads);
+ ah->nr_threads--;
+ spin_unlock(&ah->lock);
+}
+
+static struct async_thread *
+pick_ready_cachemiss_thread(struct async_head *ah)
+{
+ struct list_head *head = &ah->ready_async_threads;
+ struct async_thread *at;
+
+ if (list_empty(head))
+ return NULL;
+
+ at = list_entry(head->next, struct async_thread, entry);
+
+ return at;
+}
+
+static void pick_new_async_head(struct async_head *ah,
+ struct task_struct *t, struct pt_regs *old_regs)
+{
+ struct async_thread *new_async_thread;
+ struct async_thread *async_ready;
+ struct task_struct *new_task;
+ struct pt_regs *new_regs;
+
+ spin_lock(&ah->lock);
+
+ new_async_thread = pick_ready_cachemiss_thread(ah);
+ if (!new_async_thread)
+ goto out_unlock;
+
+ async_ready = t->async_ready;
+ WARN_ON(!async_ready);
+ t->async_ready = NULL;
+
+ new_task = new_async_thread->task;
+ new_regs = task_pt_regs(new_task);
+ *new_regs = *old_regs;
+
+ new_task->at = NULL;
+ t->ah = NULL;
+ new_task->ah = ah;
+
+ wake_up_process(new_task);
+
+ __async_thread_init(t, async_ready, ah);
+ __mark_async_thread_busy(t->at, ah);
+
+ out_unlock:
+ spin_unlock(&ah->lock);
+}
+
+void __async_schedule(struct task_struct *t)
+{
+ struct async_head *ah = t->ah;
+ struct pt_regs *old_regs = task_pt_regs(t);
+
+ pick_new_async_head(ah, t, old_regs);
+}
+
+static void async_schedule(struct task_struct *t)
+{
+ if (t->async_ready)
+ __async_schedule(t);
+}
+
+static long __exec_atom(struct task_struct *t, struct syslet_atom *atom)
+{
+ struct async_thread *async_ready_save;
+ long ret;
+
+ /*
+ * If user-space expects the syscall to schedule then
+ * (try to) switch user-space to another thread straight
+ * away and execute the syscall asynchronously:
+ */
+ if (unlikely(atom->flags & SYSLET_ASYNC))
+ async_schedule(t);
+ /*
+ * Does user-space want synchronous execution for this atom?:
+ */
+ async_ready_save = t->async_ready;
+ if (unlikely(atom->flags & SYSLET_SYNC))
+ t->async_ready = NULL;
+
+ if (unlikely(atom->nr >= NR_syscalls))
+ return -ENOSYS;
+
+ ret = sys_call_table[atom->nr](atom->args[0], atom->args[1],
+ atom->args[2], atom->args[3],
+ atom->args[4], atom->args[5]);
+ if (atom->ret_ptr && put_user(ret, atom->ret_ptr))
+ return -EFAULT;
+
+ if (t->ah)
+ t->async_ready = async_ready_save;
+
+ return ret;
+}
+
+/*
+ * Arithmetics syscall, add a value to a user-space memory location.
+ *
+ * Generic C version - in case the architecture has not implemented it
+ * in assembly.
+ */
+asmlinkage __attribute__((weak)) long
+sys_umem_add(unsigned long __user *uptr, unsigned long inc)
+{
+ unsigned long val, new_val;
+
+ if (get_user(val, uptr))
+ return -EFAULT;
+ /*
+ * inc == 0 means 'read memory value':
+ */
+ if (!inc)
+ return val;
+
+ new_val = val + inc;
+ __put_user(new_val, uptr);
+
+ return new_val;
+}
+
+/*
+ * Open-coded because this is a very hot codepath during syslet
+ * execution and every cycle counts ...
+ *
+ * [ NOTE: it's an explicit fastcall because optimized assembly code
+ * might depend on this. There are some kernels that disable regparm,
+ * so lets not break those if possible. ]
+ */
+fastcall __attribute__((weak)) long
+copy_uatom(struct syslet_atom *atom, struct syslet_uatom __user *uatom)
+{
+ unsigned long __user *arg_ptr;
+ long ret = 0;
+
+ if (!access_ok(VERIFY_WRITE, uatom, sizeof(*uatom)))
+ return -EFAULT;
+
+ ret = __get_user(atom->nr, &uatom->nr);
+ ret |= __get_user(atom->ret_ptr, &uatom->ret_ptr);
+ ret |= __get_user(atom->flags, &uatom->flags);
+ ret |= __get_user(atom->next, &uatom->next);
+
+ memset(atom->args, 0, sizeof(atom->args));
+
+ ret |= __get_user(arg_ptr, &uatom->arg_ptr[0]);
+ if (!arg_ptr)
+ return ret;
+ if (!access_ok(VERIFY_WRITE, arg_ptr, sizeof(*arg_ptr)))
+ return -EFAULT;
+ ret |= __get_user(atom->args[0], arg_ptr);
+
+ ret |= __get_user(arg_ptr, &uatom->arg_ptr[1]);
+ if (!arg_ptr)
+ return ret;
+ if (!access_ok(VERIFY_WRITE, arg_ptr, sizeof(*arg_ptr)))
+ return -EFAULT;
+ ret |= __get_user(atom->args[1], arg_ptr);
+
+ ret |= __get_user(arg_ptr, &uatom->arg_ptr[2]);
+ if (!arg_ptr)
+ return ret;
+ if (!access_ok(VERIFY_WRITE, arg_ptr, sizeof(*arg_ptr)))
+ return -EFAULT;
+ ret |= __get_user(atom->args[2], arg_ptr);
+
+ ret |= __get_user(arg_ptr, &uatom->arg_ptr[3]);
+ if (!arg_ptr)
+ return ret;
+ if (!access_ok(VERIFY_WRITE, arg_ptr, sizeof(*arg_ptr)))
+ return -EFAULT;
+ ret |= __get_user(atom->args[3], arg_ptr);
+
+ ret |= __get_user(arg_ptr, &uatom->arg_ptr[4]);
+ if (!arg_ptr)
+ return ret;
+ if (!access_ok(VERIFY_WRITE, arg_ptr, sizeof(*arg_ptr)))
+ return -EFAULT;
+ ret |= __get_user(atom->args[4], arg_ptr);
+
+ ret |= __get_user(arg_ptr, &uatom->arg_ptr[5]);
+ if (!arg_ptr)
+ return ret;
+ if (!access_ok(VERIFY_WRITE, arg_ptr, sizeof(*arg_ptr)))
+ return -EFAULT;
+ ret |= __get_user(atom->args[5], arg_ptr);
+
+ return ret;
+}
+
+/*
+ * Should the next atom run, depending on the return value of
+ * the current atom - or should we stop execution?
+ */
+static int run_next_atom(struct syslet_atom *atom, long ret)
+{
+ switch (atom->flags & SYSLET_STOP_MASK) {
+ case SYSLET_STOP_ON_NONZERO:
+ if (!ret)
+ return 1;
+ return 0;
+ case SYSLET_STOP_ON_ZERO:
+ if (ret)
+ return 1;
+ return 0;
+ case SYSLET_STOP_ON_NEGATIVE:
+ if (ret >= 0)
+ return 1;
+ return 0;
+ case SYSLET_STOP_ON_NON_POSITIVE:
+ if (ret > 0)
+ return 1;
+ return 0;
+ }
+ return 1;
+}
+
+static struct syslet_uatom __user *
+next_uatom(struct syslet_atom *atom, struct syslet_uatom *uatom, long ret)
+{
+ /*
+ * If the stop condition is false then continue
+ * to atom->next:
+ */
+ if (run_next_atom(atom, ret))
+ return atom->next;
+ /*
+ * Special-case: if the stop condition is true and the atom
+ * has SKIP_TO_NEXT_ON_STOP set, then instead of
+ * stopping we skip to the atom directly after this atom
+ * (in linear address-space).
+ *
+ * This, combined with the atom->next pointer and the
+ * stop condition flags is what allows true branches and
+ * loops in syslets:
+ */
+ if (atom->flags & SYSLET_SKIP_TO_NEXT_ON_STOP)
+ return uatom + 1;
+
+ return NULL;
+}
+
+/*
+ * If user-space requested a completion event then put the last
+ * executed uatom into the completion ring:
+ */
+static long
+complete_uatom(struct async_head *ah, struct task_struct *t,
+ struct syslet_atom *atom, struct syslet_uatom __user *uatom)
+{
+ struct syslet_uatom __user **ring_slot, *slot_val = NULL;
+ long ret;
+
+ WARN_ON(!t->at);
+ WARN_ON(t->ah);
+
+ if (unlikely(atom->flags & SYSLET_NO_COMPLETE))
+ return 0;
+
+ /*
+ * Asynchronous threads can complete in parallel so use the
+ * head-lock to serialize:
+ */
+ spin_lock(&ah->lock);
+ ring_slot = ah->completion_ring + ah->curr_ring_idx;
+ ret = __copy_from_user_inatomic(&slot_val, ring_slot, sizeof(slot_val));
+ /*
+ * User-space submitted more work than what fits into the
+ * completion ring - do not stomp over it silently and signal
+ * the error condition:
+ */
+ if (unlikely(slot_val)) {
+ spin_unlock(&ah->lock);
+ return -EFAULT;
+ }
+ slot_val = uatom;
+ ret |= __copy_to_user_inatomic(ring_slot, &slot_val, sizeof(slot_val));
+
+ ah->curr_ring_idx++;
+ if (unlikely(ah->curr_ring_idx == ah->max_ring_idx))
+ ah->curr_ring_idx = 0;
+
+ /*
+ * See whether the async-head is waiting and needs a wakeup:
+ */
+ if (ah->events_left) {
+ ah->events_left--;
+ if (!ah->events_left)
+ wake_up(&ah->wait);
+ }
+
+ spin_unlock(&ah->lock);
+
+ return ret;
+}
+
+/*
+ * This is the main syslet atom execution loop. This fetches atoms
+ * and executes them until it runs out of atoms or until the
+ * exit condition becomes false:
+ */
+static struct syslet_uatom __user *
+exec_atom(struct async_head *ah, struct task_struct *t,
+ struct syslet_uatom __user *uatom)
+{
+ struct syslet_uatom __user *last_uatom;
+ struct syslet_atom atom;
+ long ret;
+
+ run_next:
+ if (unlikely(copy_uatom(&atom, uatom)))
+ return ERR_PTR(-EFAULT);
+
+ last_uatom = uatom;
+ ret = __exec_atom(t, &atom);
+ if (unlikely(signal_pending(t) || need_resched()))
+ goto stop;
+
+ uatom = next_uatom(&atom, uatom, ret);
+ if (uatom)
+ goto run_next;
+ stop:
+ /*
+ * We do completion only in async context:
+ */
+ if (t->at && complete_uatom(ah, t, &atom, last_uatom))
+ return ERR_PTR(-EFAULT);
+
+ return last_uatom;
+}
+
+static void cachemiss_execute(struct async_thread *at, struct async_head *ah,
+ struct task_struct *t)
+{
+ struct syslet_uatom __user *uatom;
+
+ uatom = at->work;
+ WARN_ON(!uatom);
+ at->work = NULL;
+
+ exec_atom(ah, t, uatom);
+}
+
+static void
+cachemiss_loop(struct async_thread *at, struct async_head *ah,
+ struct task_struct *t)
+{
+ for (;;) {
+ schedule();
+ mark_async_thread_busy(at, ah);
+ set_task_state(t, TASK_INTERRUPTIBLE);
+ if (at->work)
+ cachemiss_execute(at, ah, t);
+ if (unlikely(t->ah || at->exit || signal_pending(t)))
+ break;
+ mark_async_thread_ready(at, ah);
+ }
+ t->state = TASK_RUNNING;
+
+ async_thread_exit(at, t);
+}
+
+static int cachemiss_thread(void *data)
+{
+ struct task_struct *t = current;
+ struct async_head *ah = data;
+ struct async_thread at;
+
+ async_thread_init(t, &at, ah);
+ complete(&ah->start_done);
+
+ cachemiss_loop(&at, ah, t);
+ if (at.exit)
+ do_exit(0);
+
+ if (!t->ah && signal_pending(t)) {
+ WARN_ON(1);
+ do_exit(0);
+ }
+
+ /*
+ * Return to user-space with NULL:
+ */
+ return 0;
+}
+
+static void __notify_async_thread_exit(struct async_thread *at,
+ struct async_head *ah)
+{
+ list_del_init(&at->entry);
+ at->exit = 1;
+ init_completion(&ah->exit_done);
+ wake_up_process(at->task);
+}
+
+static void stop_cachemiss_threads(struct async_head *ah)
+{
+ struct async_thread *at;
+
+repeat:
+ spin_lock(&ah->lock);
+ list_for_each_entry(at, &ah->ready_async_threads, entry) {
+
+ __notify_async_thread_exit(at, ah);
+ spin_unlock(&ah->lock);
+
+ wait_for_completion(&ah->exit_done);
+
+ goto repeat;
+ }
+
+ list_for_each_entry(at, &ah->busy_async_threads, entry) {
+
+ __notify_async_thread_exit(at, ah);
+ spin_unlock(&ah->lock);
+
+ wait_for_completion(&ah->exit_done);
+
+ goto repeat;
+ }
+ spin_unlock(&ah->lock);
+}
+
+static void async_head_exit(struct async_head *ah, struct task_struct *t)
+{
+ stop_cachemiss_threads(ah);
+ WARN_ON(!list_empty(&ah->ready_async_threads));
+ WARN_ON(!list_empty(&ah->busy_async_threads));
+ WARN_ON(ah->nr_threads);
+ WARN_ON(spin_is_locked(&ah->lock));
+ kfree(ah);
+ t->ah = NULL;
+}
+
+/*
+ * Pretty arbitrary for now. The kernel resource-controls the number
+ * of threads anyway.
+ */
+#define DEFAULT_THREAD_LIMIT 1024
+
+/*
+ * Initialize the in-kernel async head, based on the user-space async
+ * head:
+ */
+static long
+async_head_init(struct task_struct *t, struct async_head_user __user *uah)
+{
+ unsigned long max_nr_threads, ring_size_bytes, max_ring_idx;
+ struct syslet_uatom __user **completion_ring;
+ struct async_head *ah;
+ long ret;
+
+ if (get_user(max_nr_threads, &uah->max_nr_threads))
+ return -EFAULT;
+ if (get_user(completion_ring, &uah->completion_ring))
+ return -EFAULT;
+ if (get_user(ring_size_bytes, &uah->ring_size_bytes))
+ return -EFAULT;
+ if (!ring_size_bytes)
+ return -EINVAL;
+ /*
+ * We pre-check the ring pointer, so that in the fastpath
+ * we can use __put_user():
+ */
+ if (!access_ok(VERIFY_WRITE, completion_ring, ring_size_bytes))
+ return -EFAULT;
+
+ max_ring_idx = ring_size_bytes / sizeof(void *);
+ if (ring_size_bytes != max_ring_idx * sizeof(void *))
+ return -EINVAL;
+
+ /*
+ * Lock down the ring. Note: user-space should not munlock() this,
+ * because if the ring pages get swapped out then the async
+ * completion code might return a -EFAULT instead of the expected
+ * completion. (the kernel safely handles that case too, so this
+ * isnt a security problem.)
+ *
+ * mlock() is better here because it gets resource-accounted
+ * properly, and even unprivileged userspace has a few pages
+ * of mlock-able memory available. (which is more than enough
+ * for the completion-pointers ringbuffer)
+ */
+ ret = sys_mlock((unsigned long)completion_ring, ring_size_bytes);
+ if (ret)
+ return ret;
+
+ /*
+ * -1 means: the kernel manages the optimal size of the async pool.
+ * Simple static limit for now.
+ */
+ if (max_nr_threads == -1UL)
+ max_nr_threads = DEFAULT_THREAD_LIMIT;
+ /*
+ * If the ring is smaller than the number of threads requested
+ * then lower the thread count - otherwise we might lose
+ * syslet completion events:
+ */
+ max_nr_threads = min(max_ring_idx, max_nr_threads);
+
+ ah = kmalloc(sizeof(*ah), GFP_KERNEL);
+ if (!ah)
+ return -ENOMEM;
+
+ spin_lock_init(&ah->lock);
+ ah->nr_threads = 0;
+ ah->max_nr_threads = max_nr_threads;
+ INIT_LIST_HEAD(&ah->ready_async_threads);
+ INIT_LIST_HEAD(&ah->busy_async_threads);
+ init_waitqueue_head(&ah->wait);
+ ah->events_left = 0;
+ ah->uah = uah;
+ ah->curr_ring_idx = 0;
+ ah->max_ring_idx = max_ring_idx;
+ ah->completion_ring = completion_ring;
+ ah->ring_size_bytes = ring_size_bytes;
+
+ ah->user_task = t;
+ t->ah = ah;
+
+ return 0;
+}
+
+/**
+ * sys_async_register - enable async syscall support
+ */
+asmlinkage long
+sys_async_register(struct async_head_user __user *uah, unsigned int len)
+{
+ struct task_struct *t = current;
+
+ /*
+ * This 'len' check enables future extension of
+ * the async_head ABI:
+ */
+ if (len != sizeof(struct async_head_user))
+ return -EINVAL;
+ /*
+ * Already registered?
+ */
+ if (t->ah)
+ return -EEXIST;
+
+ return async_head_init(t, uah);
+}
+
+/**
+ * sys_async_unregister - disable async syscall support
+ */
+asmlinkage long
+sys_async_unregister(struct async_head_user __user *uah, unsigned int len)
+{
+ struct syslet_uatom __user **completion_ring;
+ struct task_struct *t = current;
+ struct async_head *ah = t->ah;
+ unsigned long ring_size_bytes;
+
+ if (len != sizeof(struct async_head_user))
+ return -EINVAL;
+ /*
+ * Already unregistered?
+ */
+ if (!ah)
+ return -EINVAL;
+
+ completion_ring = ah->completion_ring;
+ ring_size_bytes = ah->ring_size_bytes;
+
+ async_head_exit(ah, t);
+
+ /*
+ * Unpin the ring:
+ */
+ return sys_munlock((unsigned long)completion_ring, ring_size_bytes);
+}
+
+/*
+ * Simple limit and pool management mechanism for now:
+ */
+static void refill_cachemiss_pool(struct async_head *ah)
+{
+ int pid;
+
+ if (ah->nr_threads >= ah->max_nr_threads)
+ return;
+
+ init_completion(&ah->start_done);
+
+ pid = create_async_thread(cachemiss_thread, (void *)ah,
+ CLONE_VM | CLONE_FS | CLONE_FILES | CLONE_SIGHAND |
+ CLONE_PTRACE | CLONE_THREAD | CLONE_SYSVSEM);
+ if (pid < 0)
+ return;
+
+ wait_for_completion(&ah->start_done);
+}
+
+/**
+ * sys_async_wait - wait for async completion events
+ *
+ * This syscall waits for @min_wait_events syslet completion events
+ * to finish or for all async processing to finish (whichever
+ * comes first).
+ */
+asmlinkage long sys_async_wait(unsigned long min_wait_events)
+{
+ struct async_head *ah = current->ah;
+
+ if (!ah)
+ return -EINVAL;
+
+ if (min_wait_events) {
+ spin_lock(&ah->lock);
+ ah->events_left = min_wait_events;
+ spin_unlock(&ah->lock);
+ }
+
+ return wait_event_interruptible(ah->wait,
+ list_empty(&ah->busy_async_threads) || !ah->events_left);
+}
+
+/**
+ * sys_async_exec - execute a syslet.
+ *
+ * returns the uatom that was last executed, if the kernel was able to
+ * execute the syslet synchronously, or NULL if the syslet became
+ * asynchronous. (in the latter case syslet completion will be notified
+ * via the completion ring)
+ *
+ * (Various errors might also be returned via the usual negative numbers.)
+ */
+asmlinkage struct syslet_uatom __user *
+sys_async_exec(struct syslet_uatom __user *uatom)
+{
+ struct syslet_uatom __user *ret;
+ struct task_struct *t = current;
+ struct async_head *ah = t->ah;
+ struct async_thread at;
+
+ if (unlikely(!ah))
+ return ERR_PTR(-EINVAL);
+
+ if (list_empty(&ah->ready_async_threads))
+ refill_cachemiss_pool(ah);
+
+ t->async_ready = &at;
+ ret = exec_atom(ah, t, uatom);
+
+ if (t->ah) {
+ WARN_ON(!t->async_ready);
+ t->async_ready = NULL;
+ return ret;
+ }
+ ret = ERR_PTR(-EINTR);
+ if (!at.exit && !signal_pending(t)) {
+ set_task_state(t, TASK_INTERRUPTIBLE);
+ mark_async_thread_ready(&at, ah);
+ cachemiss_loop(&at, ah, t);
+ }
+ if (t->ah)
+ return NULL;
+ else
+ do_exit(0);
+}
+
+/*
+ * fork()-time initialization:
+ */
+void async_init(struct task_struct *t)
+{
+ t->at = NULL;
+ t->async_ready = NULL;
+ t->ah = NULL;
+}
+
+/*
+ * do_exit()-time cleanup:
+ */
+void async_exit(struct task_struct *t)
+{
+ struct async_thread *at = t->at;
+ struct async_head *ah = t->ah;
+
+ WARN_ON(at && ah);
+ WARN_ON(t->async_ready);
+
+ if (unlikely(at))
+ async_thread_exit(at, t);
+
+ if (unlikely(ah))
+ async_head_exit(ah, t);
+}

2007-02-13 14:24:50

by Ingo Molnar

[permalink] [raw]
Subject: [patch 00/11] ANNOUNCE: "Syslets", generic asynchronous system call support

I'm pleased to announce the first release of the "Syslet" kernel feature
and kernel subsystem, which provides generic asynchronous system call
support:

http://redhat.com/~mingo/syslet-patches/

Syslets are small, simple, lightweight programs (consisting of
system-calls, 'atoms') that the kernel can execute autonomously (and,
not the least, asynchronously), without having to exit back into
user-space. Syslets can be freely constructed and submitted by any
unprivileged user-space context - and they have access to all the
resources (and only those resources) that the original context has
access to.

because the proof of the pudding is in the eating, here are the performance
results from async-test.c which does open()+read()+close() of 1000 small
random files (smaller is better):

                 synchronous IO   |   Syslets:
   ----------------------------------------------------------
   uncached:     45.8 seconds     |   34.2 seconds  ( +33.9% )
   cached:       31.6 msecs       |   26.5 msecs    ( +19.2% )

("uncached" results were done via "echo 3 > /proc/sys/vm/drop_caches".
The default IO scheduler was the deadline scheduler, and the test was run on
ext3, using a single PATA IDE disk.)

So syslets, in this particular workload, are a nice speedup /both/ in
the uncached and in the cached case. (note that i used only a single
disk, so the level of parallelism in the hardware is quite limited.)

the testcode can be found at:

http://redhat.com/~mingo/syslet-patches/async-test-0.1.tar.gz

The boring details:

Syslets consist of 'syslet atoms', where each atom represents a single
system-call. These atoms can be chained to each other: serially, in
branches or in loops. The return value of an executed atom is checked
against the condition flags. So an atom can specify 'exit on nonzero' or
'loop until non-negative' kind of constructs.

Syslet atoms fundamentally execute only system calls, thus to be able to
manipulate user-space variables from syslets i've added a simple special
system call: sys_umem_add(ptr, val). This can be used to increase or
decrease the user-space variable (and to get the result), or to simply
read out the variable (if 'val' is 0).

So a single syslet (submitted and executed via a single system call) can
be arbitrarily complex. For example it can be like this:

   --------------------
   |     accept()     |-----> [ stop if returns negative ]
   --------------------
            |
            V
   -------------------------------
   |  setsockopt(TCP_NODELAY)   |-----> [ stop if returns negative ]
   -------------------------------
            |
            v
   --------------------
   |      read()      |<---------
   --------------------         |    [ loop while positive ]
            |          |        |
            |          ----------
            |
   -----------------------------------------
   | decrease and read user space variable |
   -----------------------------------------            ^
            |                                            |
            -------[ loop back to accept() if positive ]--

(you can find a VFS example and a hello.c example in the user-space
testcode.)
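
As a rough illustration of how such a loop is expressed, here is a sketch
that builds the read()-loop part of the picture with two adjacent atoms,
reusing the init_atom() helper from the sample code in
Documentation/syslet-design.txt (conn_fd and buf are placeholders):

	/*
	 * atoms[0] loops on itself while read() returns a positive value;
	 * once the stop condition triggers, SKIP_TO_NEXT_ON_STOP makes
	 * execution fall through to the linearly next atom, atoms[1].
	 */
	struct syslet_uatom atoms[2], *done;
	long nread;
	unsigned long fd = conn_fd;
	unsigned long buf_ptr = (unsigned long) buf, count = sizeof(buf);

	init_atom(&atoms[0], __NR_read, &fd, &buf_ptr, &count,
		  NULL, NULL, NULL, &nread,
		  SYSLET_STOP_ON_NON_POSITIVE | SYSLET_SKIP_TO_NEXT_ON_STOP,
		  &atoms[0]);

	init_atom(&atoms[1], __NR_close, &fd, NULL, NULL,
		  NULL, NULL, NULL, NULL, 0, NULL);

	done = sys_async_exec(&atoms[0]);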

A syslet is executed opportunistically: i.e. the syslet subsystem
assumes that the syslet will not block, and it will switch to a
cachemiss kernel thread from the scheduler. This means that even a
single-atom syslet (i.e. a pure system call) is very close in
performance to a pure system call. The syslet NULL-overhead in the
cached case is roughly 10% of the SYSENTER NULL-syscall overhead. This
means that two atoms are a win already, even in the cached case.

When a 'cachemiss' occurs, i.e. if we hit schedule() and are about to
consider other threads, the syslet subsystem picks up a 'cachemiss
thread' and switches the current task's user-space context over to the
cachemiss thread, and makes the cachemiss thread available. The original
thread (which now becomes a 'busy' cachemiss thread) continues to block.
This means that user-space will still be executed without stopping -
even if user-space is single-threaded.

if the submitting user-space context /knows/ that a system call will
block, it can request immediate 'cachemiss' via the SYSLET_ASYNC flag.
This would be used if for example an O_DIRECT file is read() or
write()n.

likewise, if user-space knows (or expects) that a system call takes a lot
of CPU time even in the cached case, and it wants to offload it to
another asynchronous context, it can request that via the SYSLET_ASYNC
flag too.

completions of asynchronous syslets are done via a user-space ringbuffer
that the kernel fills and user-space clears. Waiting is done via the
sys_async_wait() system call. Completion can be suppressed on a per-atom
basis via the SYSLET_NO_COMPLETE flag, for atoms that include some
implicit notification mechanism. (such as sys_kill(), etc.)
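
a minimal sketch of the submit/wait/drain cycle on the user-space side
(completion_ring, curr_idx, ring_entries and handle_completion() are the
application's own bookkeeping and helpers here, not part of the ABI):

	struct syslet_uatom *done;

	done = sys_async_exec(&atom);
	if (!done) {
		/* went async: wait for at least one completion event */
		sys_async_wait(1);

		while (completion_ring[curr_idx]) {
			handle_completion(completion_ring[curr_idx]);

			/* the kernel never overwrites a non-NULL slot: clear it */
			completion_ring[curr_idx] = NULL;
			curr_idx = (curr_idx + 1) % ring_entries;
		}
	}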

As it might be obvious to some of you, the syslet subsystem takes many
ideas and experience from my Tux in-kernel webserver :) The syslet code
originates from a heavy rewrite of the Tux-atom and the Tux-cachemiss
infrastructure.

Open issues:

- the 'TID' of the 'head' thread currently varies depending on which
thread is running the user-space context.

- signal support is not fully thought through - probably the head
should be getting all of them - the cachemiss threads are not really
interested in executing signal handlers.

- sys_fork() and sys_async_exec() should be filtered out from the
syscalls that are allowed - first one only makes sense with ptregs,
second one is a nice kernel recursion thing :) I didnt want to
duplicate the sys_call_table though - maybe others have a better
idea.

See more details in Documentation/syslet-design.txt. The patchset is
against v2.6.20, but should apply to the -git head as well.

Thanks to Zach Brown for the idea to drive cachemisses via the
scheduler. Thanks to Arjan van de Ven for early review feedback.

Comments, suggestions, reports are welcome!

Ingo

2007-02-13 14:26:03

by Ingo Molnar

[permalink] [raw]
Subject: [patch 02/11] syslets: add syslet.h include file, user API/ABI definitions

From: Ingo Molnar <[email protected]>

add include/linux/syslet.h which contains the user-space API/ABI
declarations. Add the new header to include/linux/Kbuild as well.

Signed-off-by: Ingo Molnar <[email protected]>
Signed-off-by: Arjan van de Ven <[email protected]>
---
include/linux/Kbuild | 1
include/linux/syslet.h | 136 +++++++++++++++++++++++++++++++++++++++++++++++++
2 files changed, 137 insertions(+)

Index: linux/include/linux/Kbuild
===================================================================
--- linux.orig/include/linux/Kbuild
+++ linux/include/linux/Kbuild
@@ -140,6 +140,7 @@ header-y += sockios.h
header-y += som.h
header-y += sound.h
header-y += synclink.h
+header-y += syslet.h
header-y += telephony.h
header-y += termios.h
header-y += ticable.h
Index: linux/include/linux/syslet.h
===================================================================
--- /dev/null
+++ linux/include/linux/syslet.h
@@ -0,0 +1,136 @@
+#ifndef _LINUX_SYSLET_H
+#define _LINUX_SYSLET_H
+/*
+ * The syslet subsystem - asynchronous syscall execution support.
+ *
+ * Started by Ingo Molnar:
+ *
+ * Copyright (C) 2007 Red Hat, Inc., Ingo Molnar <[email protected]>
+ *
+ * User-space API/ABI definitions:
+ */
+
+/*
+ * This is the 'Syslet Atom' - the basic unit of execution
+ * within the syslet framework. A syslet always represents
+ * a single system-call plus its arguments, plus has conditions
+ * attached to it that allows the construction of larger
+ * programs from these atoms. User-space variables can be used
+ * (for example a loop index) via the special sys_umem*() syscalls.
+ *
+ * Arguments are implemented via pointers to arguments. This not
+ * only increases the flexibility of syslet atoms (multiple syslets
+ * can share the same variable for example), but is also an
+ * optimization: copy_uatom() will only fetch syscall parameters
+ * up until the point it meets the first NULL pointer. 50% of all
+ * syscalls have 2 or less parameters (and 90% of all syscalls have
+ * 4 or less parameters).
+ *
+ * [ Note: since the argument array is at the end of the atom, and the
+ * kernel will not touch any argument beyond the final NULL one, atoms
+ * might be packed more tightly. (the only special case exception to
+ * this rule would be SKIP_TO_NEXT_ON_STOP atoms, where the kernel will
+ * jump a full syslet_uatom number of bytes.) ]
+ */
+struct syslet_uatom {
+ unsigned long flags;
+ unsigned long nr;
+ long __user *ret_ptr;
+ struct syslet_uatom __user *next;
+ unsigned long __user *arg_ptr[6];
+ /*
+ * User-space can put anything in here, kernel will not
+ * touch it:
+ */
+ void __user *private;
+};
+
+/*
+ * Flags to modify/control syslet atom behavior:
+ */
+
+/*
+ * Immediately queue this syslet asynchronously - do not even
+ * attempt to execute it synchronously in the user context:
+ */
+#define SYSLET_ASYNC 0x00000001
+
+/*
+ * Never queue this syslet asynchronously - even if synchronous
+ * execution causes a context-switching:
+ */
+#define SYSLET_SYNC 0x00000002
+
+/*
+ * Do not queue the syslet in the completion ring when done.
+ *
+ * ( the default is that the final atom of a syslet is queued
+ * in the completion ring. )
+ *
+ * Some syscalls generate implicit completion events of their
+ * own.
+ */
+#define SYSLET_NO_COMPLETE 0x00000004
+
+/*
+ * Execution control: conditions upon the return code
+ * of the previous syslet atom. 'Stop' means syslet
+ * execution is stopped and the atom is put into the
+ * completion ring:
+ */
+#define SYSLET_STOP_ON_NONZERO 0x00000008
+#define SYSLET_STOP_ON_ZERO 0x00000010
+#define SYSLET_STOP_ON_NEGATIVE 0x00000020
+#define SYSLET_STOP_ON_NON_POSITIVE 0x00000040
+
+#define SYSLET_STOP_MASK \
+ ( SYSLET_STOP_ON_NONZERO | \
+ SYSLET_STOP_ON_ZERO | \
+ SYSLET_STOP_ON_NEGATIVE | \
+ SYSLET_STOP_ON_NON_POSITIVE )
+
+/*
+ * Special modifier to 'stop' handling: instead of stopping the
+ * execution of the syslet, the linearly next syslet is executed.
+ * (Normal execution flows along atom->next, and execution stops
+ * if atom->next is NULL or a stop condition becomes true.)
+ *
+ * This is what allows true branches of execution within syslets.
+ */
+#define SYSLET_SKIP_TO_NEXT_ON_STOP 0x00000080
+
+/*
+ * This is the (per-user-context) descriptor of the async completion
+ * ring. This gets registered via sys_async_register().
+ */
+struct async_head_user {
+ /*
+ * Pointers to completed async syslets (i.e. syslets that
+ * generated a cachemiss and went async, returning -EASYNCSYSLET
+ * to the user context by sys_async_exec()) are queued here.
+ * Syslets that were executed synchronously are not queued here.
+ *
+ * Note: the final atom that generated the exit condition is
+ * queued here. Normally this would be the last atom of a syslet.
+ */
+ struct syslet_uatom __user **completion_ring;
+ /*
+ * Ring size in bytes:
+ */
+ unsigned long ring_size_bytes;
+
+ /*
+ * Maximum number of asynchronous contexts the kernel creates.
+ *
+ * -1UL has a special meaning: the kernel manages the optimal
+ * size of the async pool.
+ *
+ * Note: this field should be valid for the lifetime of async
+ * processing, because future kernels detect changes to this
+ * field. (enabling user-space to control the size of the async
+ * pool in a low-overhead fashion)
+ */
+ unsigned long max_nr_threads;
+};
+
+#endif

2007-02-13 14:26:04

by Ingo Molnar

[permalink] [raw]
Subject: [patch 07/11] syslets: x86, add create_async_thread() method

From: Ingo Molnar <[email protected]>

add the create_async_thread() way of creating kernel threads:
these threads first execute a kernel function and when they
return from it they execute user-space.

An architecture must implement this interface before it can turn
CONFIG_ASYNC_SUPPORT on.

Signed-off-by: Ingo Molnar <[email protected]>
Signed-off-by: Arjan van de Ven <[email protected]>
---
arch/i386/kernel/entry.S | 25 +++++++++++++++++++++++++
arch/i386/kernel/process.c | 31 +++++++++++++++++++++++++++++++
include/asm-i386/processor.h | 5 +++++
3 files changed, 61 insertions(+)

Index: linux/arch/i386/kernel/entry.S
===================================================================
--- linux.orig/arch/i386/kernel/entry.S
+++ linux/arch/i386/kernel/entry.S
@@ -996,6 +996,31 @@ ENTRY(kernel_thread_helper)
CFI_ENDPROC
ENDPROC(kernel_thread_helper)

+ENTRY(async_thread_helper)
+ CFI_STARTPROC
+ /*
+ * Allocate space on the stack for pt-regs.
+ * sizeof(struct pt_regs) == 64, and we've got 8 bytes on the
+ * kernel stack already:
+ */
+ subl $64-8, %esp
+ CFI_ADJUST_CFA_OFFSET 64
+ movl %edx,%eax
+ push %edx
+ CFI_ADJUST_CFA_OFFSET 4
+ call *%ebx
+ addl $4, %esp
+ CFI_ADJUST_CFA_OFFSET -4
+
+ movl %eax, PT_EAX(%esp)
+
+ GET_THREAD_INFO(%ebp)
+
+ jmp syscall_exit
+ CFI_ENDPROC
+ENDPROC(async_thread_helper)
+
+
.section .rodata,"a"
#include "syscall_table.S"

Index: linux/arch/i386/kernel/process.c
===================================================================
--- linux.orig/arch/i386/kernel/process.c
+++ linux/arch/i386/kernel/process.c
@@ -352,6 +352,37 @@ int kernel_thread(int (*fn)(void *), voi
EXPORT_SYMBOL(kernel_thread);

/*
+ * This gets run with %ebx containing the
+ * function to call, and %edx containing
+ * the "args".
+ */
+extern void async_thread_helper(void);
+
+/*
+ * Create an async thread
+ */
+int create_async_thread(int (*fn)(void *), void * arg, unsigned long flags)
+{
+ struct pt_regs regs;
+
+ memset(&regs, 0, sizeof(regs));
+
+ regs.ebx = (unsigned long) fn;
+ regs.edx = (unsigned long) arg;
+
+ regs.xds = __USER_DS;
+ regs.xes = __USER_DS;
+ regs.xgs = __KERNEL_PDA;
+ regs.orig_eax = -1;
+ regs.eip = (unsigned long) async_thread_helper;
+ regs.xcs = __KERNEL_CS | get_kernel_rpl();
+ regs.eflags = X86_EFLAGS_IF | X86_EFLAGS_SF | X86_EFLAGS_PF | 0x2;
+
+ /* Ok, create the new task.. */
+ return do_fork(flags | CLONE_VM, 0, &regs, 0, NULL, NULL);
+}
+
+/*
* Free current thread data structures etc..
*/
void exit_thread(void)
Index: linux/include/asm-i386/processor.h
===================================================================
--- linux.orig/include/asm-i386/processor.h
+++ linux/include/asm-i386/processor.h
@@ -468,6 +468,11 @@ extern void prepare_to_copy(struct task_
*/
extern int kernel_thread(int (*fn)(void *), void * arg, unsigned long flags);

+/*
+ * create an async thread:
+ */
+extern int create_async_thread(int (*fn)(void *), void * arg, unsigned long flags);
+
extern unsigned long thread_saved_pc(struct task_struct *tsk);
void show_trace(struct task_struct *task, struct pt_regs *regs, unsigned long *stack);

2007-02-13 14:52:13

by Alan

[permalink] [raw]
Subject: Re: [patch 00/11] ANNOUNCE: "Syslets", generic asynchronous system call support

> A syslet is executed opportunistically: i.e. the syslet subsystem
> assumes that the syslet will not block, and it will switch to a
> cachemiss kernel thread from the scheduler. This means that even a

How is scheduler fairness maintained ? and what is done for resource
accounting here ?

> that the kernel fills and user-space clears. Waiting is done via the
> sys_async_wait() system call. Completion can be suppressed on a per-atom

They should be selectable as well iff possible.

> Open issues:

Let me add some more

sys_setuid/gid/etc need to be synchronous only and not occur
while other async syscalls are running in parallel to meet current kernel
assumptions.

sys_exec and other security boundaries must be synchronous only
and not allow async "spill over" (consider setuid async binary patching)

> - sys_fork() and sys_async_exec() should be filtered out from the
> syscalls that are allowed - first one only makes sense with ptregs,

clone and vfork. async_vfork is a real mindbender actually.

> second one is a nice kernel recursion thing :) I didnt want to
> duplicate the sys_call_table though - maybe others have a better
> idea.

What are the semantics of async sys_async_wait and async sys_async ?

2007-02-13 14:59:13

by Benjamin LaHaise

[permalink] [raw]
Subject: Re: [patch 00/11] ANNOUNCE: "Syslets", generic asynchronous system call support

On Tue, Feb 13, 2007 at 03:00:19PM +0000, Alan wrote:
> > Open issues:
>
> Let me add some more

Also: FPU state (especially important with the FPU and SSE memory copy
variants), segment register bases on x86-64, interaction with set_fs()...
There is no easy way of getting around the full thread context switch and
its associated overhead (mucking around in CR0 is one of the more expensive
bits of the context switch code path, and at the very least, setting the FPU
not present is mandatory). I have looked into exactly this approach, and
it's only cheaper if the code is incomplete. Linux's native threads are
pretty damned good.

-ben
--
"Time is of no importance, Mr. President, only life is important."
Don't Email: <[email protected]>.

2007-02-13 15:09:42

by Arjan van de Ven

[permalink] [raw]
Subject: Re: [patch 00/11] ANNOUNCE: "Syslets", generic asynchronous system call support

On Tue, 2007-02-13 at 09:58 -0500, Benjamin LaHaise wrote:
> On Tue, Feb 13, 2007 at 03:00:19PM +0000, Alan wrote:
> > > Open issues:
> >
> > Let me add some more
>
> Also: FPU state (especially important with the FPU and SSE memory copy
> variants)

are these preserved over explicit system calls?
--
if you want to mail me at work (you don't), use arjan (at) linux.intel.com
Test the interaction between Linux and your BIOS via http://www.linuxfirmwarekit.org

2007-02-13 15:39:45

by Andi Kleen

[permalink] [raw]
Subject: Re: [patch 00/11] ANNOUNCE: "Syslets", generic asynchronous system call support

Alan <[email protected]> writes:

Funny, it sounds like batch() on steroids @) OK, with an async context it
becomes somewhat more interesting.

> sys_setuid/gid/etc need to be synchronous only and not occur
> while other async syscalls are running in parallel to meet current kernel
> assumptions.
>
> sys_exec and other security boundaries must be synchronous only
> and not allow async "spill over" (consider setuid async binary patching)

He probably would need some generalization of Andrea's seccomp work.
Perhaps using bitmaps? For paranoia I would suggest to white list, not black list
calls.

-Andi

2007-02-13 15:46:08

by Dmitry Torokhov

[permalink] [raw]
Subject: Re: [patch 00/11] ANNOUNCE: "Syslets", generic asynchronous system call support

On 2/13/07, Alan <[email protected]> wrote:
> > A syslet is executed opportunistically: i.e. the syslet subsystem
> > assumes that the syslet will not block, and it will switch to a
> > cachemiss kernel thread from the scheduler. This means that even a
>
> How is scheduler fairness maintained ? and what is done for resource
> accounting here ?
>
> > that the kernel fills and user-space clears. Waiting is done via the
> > sys_async_wait() system call. Completion can be suppressed on a per-atom
>
> They should be selectable as well iff possible.
>
> > Open issues:
>
> Let me add some more
>
> sys_setuid/gid/etc need to be synchronous only and not occur
> while other async syscalls are running in parallel to meet current kernel
> assumptions.
>
> sys_exec and other security boundaries must be synchronous only
> and not allow async "spill over" (consider setuid async binary patching)
>
> > - sys_fork() and sys_async_exec() should be filtered out from the
> > syscalls that are allowed - first one only makes sense with ptregs,
>
> clone and vfork. async_vfork is a real mindbender actually.
>
> > second one is a nice kernel recursion thing :) I didnt want to
> > duplicate the sys_call_table though - maybe others have a better
> > idea.
>
> What are the semantics of async sys_async_wait and async sys_async ?
>

Ooooohh. OpenVMS lives forever ;) Me likeee ;)

--
Dmitry

2007-02-13 16:24:22

by bert hubert

[permalink] [raw]
Subject: Re: [patch 00/11] ANNOUNCE: "Syslets", generic asynchronous system call support

On Tue, Feb 13, 2007 at 09:58:48AM -0500, Benjamin LaHaise wrote:

> not present is mandatory). I have looked into exactly this approach, and
> it's only cheaper if the code is incomplete. Linux's native threads are
> pretty damned good.

Cheaper in time or in memory? IOW, would you be able to queue up as many
threads as syslets?

Bert

--
http://www.PowerDNS.com Open source, database driven DNS Software
http://netherlabs.nl Open and Closed source services

2007-02-13 16:28:30

by Linus Torvalds

[permalink] [raw]
Subject: Re: [patch 00/11] ANNOUNCE: "Syslets", generic asynchronous system call support



On Tue, 13 Feb 2007, Andi Kleen wrote:

> > sys_exec and other security boundaries must be synchronous only
> > and not allow async "spill over" (consider setuid async binary patching)
>
> He probably would need some generalization of Andrea's seccomp work.
> Perhaps using bitmaps? For paranoia I would suggest to white list, not black list
> calls.

It's actually more likely a lot more efficient to let the system call
itself do the sanity checking. That allows the common system calls (that
*don't* need to even check) to just not do anything at all, instead of
having some complex logic in the common system call execution trying to
figure out for each system call whether it is ok or not.

Ie, we could just add to "do_fork()" (which is where all of the
vfork/clone/fork cases end up) a simple case like

        err = wait_async_context();
        if (err)
                return err;

or

        if (in_async_context())
                return -EINVAL;

or similar. We need that "async_context()" function anyway for the other
cases where we can't do other things concurrently, like changing the UID.

I would suggest that "wait_async_context()" would do:

 - if we are *in* an async context, return an error. We cannot wait for
   ourselves!
 - if we are the "real thread", wait for all async contexts to go away
   (and since we are the real thread, no new ones will be created, so this
   is not going to be an infinite wait)

The new thing would be that wait_async_context() would possibly return
-ERESTARTSYS (signal while an async context was executing), so any system
call that does this would possibly return EINTR. Which "fork()" hasn't
historically done. But if you have async events active, some operations
likely cannot be done (setuid() and execve() comes to mind), so you really
do need something like this.

And obviously it would only affect any program that actually would _use_
any of the suggested new interfaces, so it's not like a new error return
would break anything old.
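As a rough illustration, a minimal sketch of what these two helpers might
look like, based on the fields visible elsewhere in this thread
(current->ah, ah->wait, ah->busy_async_threads); the 'async_ready' marker
used below is an assumption, not code from the posted patches:

static inline int in_async_context(void)
{
        /* assumed marker field for async (cachemiss) threads: */
        return current->async_ready != NULL;
}

static inline long wait_async_context(void)
{
        struct async_head *ah = current->ah;

        /* an async context cannot wait for itself: */
        if (in_async_context())
                return -EINVAL;

        /* no async activity at all - nothing to wait for: */
        if (!ah)
                return 0;

        /* interruptible, so callers can see -ERESTARTSYS/-EINTR: */
        return wait_event_interruptible(ah->wait,
                        list_empty(&ah->busy_async_threads));
}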

Linus

2007-02-13 16:46:13

by Ingo Molnar

[permalink] [raw]
Subject: Re: [patch 00/11] ANNOUNCE: "Syslets", generic asynchronous system call support


* Alan <[email protected]> wrote:

> > A syslet is executed opportunistically: i.e. the syslet subsystem
> > assumes that the syslet will not block, and it will switch to a
> > cachemiss kernel thread from the scheduler. This means that even a
>
> How is scheduler fairness maintained ? and what is done for resource
> accounting here ?

the async threads are as if the user created user-space threads - and
it's accounted (and scheduled) accordingly.

> > that the kernel fills and user-space clears. Waiting is done via the
> > sys_async_wait() system call. Completion can be suppressed on a
> > per-atom
>
> They should be selectable as well iff possible.

basically arbitrary notification interfaces are supported. For example
if you add a sys_kill() call as the last syslet atom then this will
notify any waiter in sigwait().

or if you want to select(), just do it in the fds that you are
interested in, and the write that the syslet does triggers select()
completion.

but the fastest one will be by using syslets: to just check the
notification ring pointer in user-space, and then call into
sys_async_wait() if the ring is empty.
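A rough user-space sketch of that fast path (the ring handling and the raw
__NR_async_wait syscall number are assumptions for illustration only):

#include <stddef.h>
#include <sys/syscall.h>
#include <unistd.h>

struct syslet_uatom;                            /* opaque here */

struct syslet_uatom *next_completion(struct syslet_uatom **ring,
                                     unsigned long *idx,
                                     unsigned long ring_size)
{
        for (;;) {
                struct syslet_uatom *done = ring[*idx];

                if (done) {
                        ring[*idx] = NULL;      /* user-space clears the slot */
                        *idx = (*idx + 1) % ring_size;
                        return done;
                }
                /* ring empty: block until at least one more completion */
                syscall(__NR_async_wait, 1UL, *idx);
        }
}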

I just noticed a small bug here: sys_async_wait() should also take the
ring index userspace checked as a second parameter, and fix up the
number of events it waits for with the delta between the ring index the
kernel maintains and the ring index user-space has. The patch below
fixes this bug.

> > Open issues:
>
> Let me add some more
>
> sys_setuid/gid/etc need to be synchronous only and not occur
> while other async syscalls are running in parallel to meet current
> kernel assumptions.

these should probably be taken out of the 'async syscall table', along
with fork and the async syscalls themselves.

> sys_exec and other security boundaries must be synchronous
> only and not allow async "spill over" (consider setuid async binary
> patching)

i've tested sys_exec() and it seems to work, but i might have missed
some corner-cases. (And what you raise is not academic, it might even
make sense to do it, in the vfork() way.)

> > - sys_fork() and sys_async_exec() should be filtered out from the
> > syscalls that are allowed - first one only makes sense with ptregs,
>
> clone and vfork. async_vfork is a real mindbender actually.

yeah. Also, create_module() perhaps. I'm starting to lean towards an
async_syscall_table[]. At which point we could reduce the max syslet
parameter count to 4, and do those few 5 and 6 parameter syscalls (of
which only splice() and futex() truly matter i suspect) via wrappers.
This would fit a syslet atom into 32 bytes on x86. Hm?

> > second one is a nice kernel recursion thing :) I didnt want to
> > duplicate the sys_call_table though - maybe others have a better
> > idea.
>
> What are the semantics of async sys_async_wait and async sys_async ?

agreed, that should be forbidden too.

Ingo

---------------------->
---
kernel/async.c | 12 +++++++++---
kernel/async.h | 2 +-
2 files changed, 10 insertions(+), 4 deletions(-)

Index: linux/kernel/async.c
===================================================================
--- linux.orig/kernel/async.c
+++ linux/kernel/async.c
@@ -721,7 +721,8 @@ static void refill_cachemiss_pool(struct
* to finish or for all async processing to finish (whichever
* comes first).
*/
-asmlinkage long sys_async_wait(unsigned long min_wait_events)
+asmlinkage long
+sys_async_wait(unsigned long min_wait_events, unsigned long user_curr_ring_idx)
{
struct async_head *ah = current->ah;

@@ -730,12 +731,17 @@ asmlinkage long sys_async_wait(unsigned

if (min_wait_events) {
spin_lock(&ah->lock);
- ah->events_left = min_wait_events;
+ /*
+ * Account any completions that happened since user-space
+ * checked the ring:
+ */
+ ah->events_left = min_wait_events -
+ (ah->curr_ring_idx - user_curr_ring_idx);
spin_unlock(&ah->lock);
}

return wait_event_interruptible(ah->wait,
- list_empty(&ah->busy_async_threads) || !ah->events_left);
+ list_empty(&ah->busy_async_threads) || ah->events_left > 0);
}

/**
Index: linux/kernel/async.h
===================================================================
--- linux.orig/kernel/async.h
+++ linux/kernel/async.h
@@ -26,7 +26,7 @@ struct async_head {
struct list_head ready_async_threads;
struct list_head busy_async_threads;

- unsigned long events_left;
+ long events_left;
wait_queue_head_t wait;

struct async_head_user __user *uah;

2007-02-13 16:53:10

by Ingo Molnar

[permalink] [raw]
Subject: Re: [patch 00/11] ANNOUNCE: "Syslets", generic asynchronous system call support


* Andi Kleen <[email protected]> wrote:

> > sys_exec and other security boundaries must be synchronous
> > only and not allow async "spill over" (consider setuid async binary
> > patching)
>
> He probably would need some generalization of Andrea's seccomp work.
> Perhaps using bitmaps? For paranoia I would suggest to white list, not
> black list calls.

what i've implemented in my tree is sys_async_call_table[] which is a
copy of sys_call_table[] with certain entries modified (by architecture
level code, not by kernel/async.c) to sys_ni_syscall(). It's up to the
architecture to decide which syscalls are allowed.

but i could use a bitmap too - whatever linear construct. [ I'm not sure
there's much connection to seccomp - seccomp uses a NULL terminated
whitelist - while syslets would use most of the entries (and would not
want to have the overhead of checking a blacklist). ]
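For illustration, a sketch of what such an arch-level table setup could look
like (the typedef, table declarations and __NR_async_* numbers are assumptions,
and note that this whole approach gets dropped later in the thread in favour of
in-syscall checks):

asmlinkage long sys_ni_syscall(void);

typedef long (*syscall_fn_t)(void);

extern syscall_fn_t sys_call_table[NR_syscalls];
syscall_fn_t sys_async_call_table[NR_syscalls];

void __init init_async_call_table(void)
{
        int i;

        /* start out as a plain copy of the real syscall table: */
        for (i = 0; i < NR_syscalls; i++)
                sys_async_call_table[i] = sys_call_table[i];

        /* entries that must stay synchronous-only: */
        sys_async_call_table[__NR_fork]       = sys_ni_syscall;
        sys_async_call_table[__NR_vfork]      = sys_ni_syscall;
        sys_async_call_table[__NR_clone]      = sys_ni_syscall;
        sys_async_call_table[__NR_execve]     = sys_ni_syscall;
        sys_async_call_table[__NR_setuid]     = sys_ni_syscall;
        sys_async_call_table[__NR_async_exec] = sys_ni_syscall;
        sys_async_call_table[__NR_async_wait] = sys_ni_syscall;
}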

Ingo

2007-02-13 17:00:06

by Ingo Molnar

[permalink] [raw]
Subject: Re: [patch 00/11] ANNOUNCE: "Syslets", generic asynchronous system call support


* Benjamin LaHaise <[email protected]> wrote:

> > > Open issues:
> >
> > Let me add some more
>
> Also: FPU state (especially important with the FPU and SSE memory copy
> variants), segment register bases on x86-64, interaction with
> set_fs()...

agreed - i'll fix this. But i can see no big conceptual issue here -
these resources are all attached to the user context, and that doesnt
change upon an 'async context-switch'. So it's "only" a matter of
properly separating the user execution context from the kernel execution
context. The hardest bit was getting the ptregs details right - the
FPU/SSE state is pretty much async already (in the hardware too) and
isnt even touched by any of these codepaths.

Ingo

2007-02-13 17:06:43

by Ingo Molnar

[permalink] [raw]
Subject: Re: [patch 00/11] ANNOUNCE: "Syslets", generic asynchronous system call support


* Linus Torvalds <[email protected]> wrote:

> Ie, we could just add to "do_fork()" (which is where all of the
> vfork/clone/fork cases end up) a simple case like
>
>         err = wait_async_context();
>         if (err)
>                 return err;
>
> or
>
>         if (in_async_context())
>                 return -EINVAL;

ok, this is a much nicer solution. I've scrapped the
sys_async_sys_call_table[] thing.

Ingo

2007-02-13 19:08:27

by Evgeniy Polyakov

[permalink] [raw]
Subject: Re: [patch 00/11] ANNOUNCE: "Syslets", generic asynchronous system call support

On Tue, Feb 13, 2007 at 05:56:42PM +0100, Ingo Molnar ([email protected]) wrote:
>
> * Benjamin LaHaise <[email protected]> wrote:
>
> > > > Open issues:
> > >
> > > Let me add some more
> >
> > Also: FPU state (especially important with the FPU and SSE memory copy
> > variants), segment register bases on x86-64, interaction with
> > set_fs()...
>
> agreed - i'll fix this. But i can see no big conceptual issue here -
> these resources are all attached to the user context, and that doesnt
> change upon an 'async context-switch'. So it's "only" a matter of
> properly separating the user execution context from the kernel execution
> context. The hardest bit was getting the ptregs details right - the
> FPU/SSE state is pretty much async already (in the hardware too) and
> isnt even touched by any of these codepaths.

Good work, Ingo.

I have not received first mail with announcement yet, so I will place
my thoughts here if you do not mind.

First one is per-thread data like TID. What about TLS related kernel
data (is non-exec stack property stored in TLS block or in kernel)?
Should it be copied with regs too (or better introduce new clone flag,
which would force that info copy)?

Btw, is SSE?/MMX?/call-it-yourself state really saved on context switch?
As far as I can see no syscalls (and the kernel in general) use those registers.

Another one is more global AIO question - while this approach IMHO
outperforms micro-thread design (Zach and Linus created really good
starting points, but they too have fundamental limiting factor), it
still has a problem - syscall blocks and the same thread thus is not
allowed to continue execution and fill the pipe - so what if the system
issues thousands of requests and there are only tens of working threads
at most. What Tux did, as far as I recall, (and some other similar
state machines do :) was to break blocking syscall issues and return
to the next execution entity (next syslet or atom). Is it possible to
extend exactly this state machine and interface to allow that (so that
some other state machine implementations would not continue its life :)?

> Ingo

--
Evgeniy Polyakov

2007-02-13 19:15:08

by Evgeniy Polyakov

[permalink] [raw]
Subject: Re: [patch 00/11] ANNOUNCE: "Syslets", generic asynchronous system call support

> I have not received first mail with announcement yet, so I will place
> my thoughts here if you do not mind.

An issue with sys_async_wait():
is it possible that events_left will be set up too late, so that all
events are already ready and thus sys_async_wait() can wait forever
(or until the next sys_async_wait()'s events are ready)?

--
Evgeniy Polyakov

2007-02-13 20:18:25

by Davide Libenzi

[permalink] [raw]
Subject: Re: [patch 06/11] syslets: core, documentation


Wow! You really helped Zach out ;)



On Tue, 13 Feb 2007, Ingo Molnar wrote:

> +The Syslet Atom:
> +----------------
> +
> +The syslet atom is a small, fixed-size (44 bytes on 32-bit) piece of
> +user-space memory, which is the basic unit of execution within the syslet
> +framework. A syslet represents a single system-call and its arguments.
> +In addition it also has condition flags attached to it that allows the
> +construction of larger programs (syslets) from these atoms.
> +
> +Arguments to the system call are implemented via pointers to arguments.
> +This not only increases the flexibility of syslet atoms (multiple syslets
> +can share the same variable for example), but is also an optimization:
> +copy_uatom() will only fetch syscall parameters up until the point it
> +meets the first NULL pointer. 50% of all syscalls have 2 or less
> +parameters (and 90% of all syscalls have 4 or less parameters).

Why do you need to have an extra memory indirection per parameter in
copy_uatom()? It also forces you to have parameters pointed-to, to be
"long" (or pointers), instead of their natural POSIX type (like fd being
"int" for example). Also, you need to have array pointers (think about a
"char buf[];" passed to an async read(2)) to be saved into a pointer
variable, and pass the pointer of the latter to the async system. Same for
all structures (ie. stat(2) "struct stat"). Let them be real arguments
and add an nparams argument to the structure:

struct syslet_atom {
        unsigned long flags;
        unsigned int nr;
        unsigned int nparams;
        long __user *ret_ptr;
        struct syslet_uatom __user *next;
        unsigned long args[6];
};

I can understand that chaining syscalls requires variable sharing, but the
majority of the parameters passed to syscalls are just direct ones.
Maybe a smart method that allows you to know if a parameter is a direct
one or a pointer to one? An "unsigned int pmap" where bit N is 1 if param
N is an indirection? Hmm?
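Something along these lines is presumably what is meant - a sketch only,
against an assumed user-side atom layout that mirrors the proposed struct
above (a direct args[] array plus pmap and nparams fields), not the code
that was actually posted:

static int copy_uatom_args(struct syslet_atom *atom,
                           struct syslet_uatom __user *uatom)
{
        unsigned long arg;
        unsigned int i, pmap, nparams;

        if (get_user(pmap, &uatom->pmap) ||
            get_user(nparams, &uatom->nparams))
                return -EFAULT;

        atom->nparams = nparams;
        for (i = 0; i < nparams && i < 6; i++) {
                if (get_user(arg, &uatom->args[i]))
                        return -EFAULT;
                /* bit i set in pmap: args[i] is an indirect pointer */
                if ((pmap & (1U << i)) &&
                    get_user(arg, (unsigned long __user *)arg))
                        return -EFAULT;
                atom->args[i] = arg;
        }
        return 0;
}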





> +Running Syslets:
> +----------------
> +
> +Syslets can be run via the sys_async_exec() system call, which takes
> +the first atom of the syslet as an argument. The kernel does not need
> +to be told about the other atoms - it will fetch them on the fly as
> +execution goes forward.
> +
> +A syslet might either be executed 'cached', or it might generate a
> +'cachemiss'.
> +
> +'Cached' syslet execution means that the whole syslet was executed
> +without blocking. The system-call returns the submitted atom's address
> +in this case.
> +
> +If a syslet blocks while the kernel executes a system-call embedded in
> +one of its atoms, the kernel will keep working on that syscall in
> +parallel, but it immediately returns to user-space with a NULL pointer,
> +so the submitting task can submit other syslets.
> +
> +Completion of asynchronous syslets:
> +-----------------------------------
> +
> +Completion of asynchronous syslets is done via the 'completion ring',
> +which is a ringbuffer of syslet atom pointers in user-space memory,
> +provided by user-space in the sys_async_register() syscall. The
> +kernel fills in the ringbuffer starting at index 0, and user-space
> +must clear out these pointers. Once the kernel reaches the end of
> +the ring it wraps back to index 0. The kernel will not overwrite
> +non-NULL pointers (but will return an error), user-space has to
> +make sure it completes all events it asked for.

Sigh, I really dislike shared userspace/kernel stuff, when we're
transferring pointers to userspace. Did you actually bench it against a:

int async_wait(struct syslet_uatom **r, int n);

I can fully understand sharing userspace buffers with the kernel, if we're
talking about KB transferred during a block or net I/O DMA operation, but
for transferring a pointer? Behind each pointer transfer (4/8 bytes) there
is a whole syscall execution, which makes the 4/8 byte transfers have a
relative cost of 0.01% *maybe*. A different case is an O_DIRECT read of 16KB
of data, where the memory transfer has a relative cost, compared to the
syscall, that can be pretty high. The syscall-saving argument is moot too,
because syscalls are cheap, and if there's a lot of async traffic, you'll be
fetching lots of completions to keep your dispatch loop pretty busy for a
while.
And the API is *certainly* cleaner.
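For comparison, the retrieval loop under such a syscall-based API would look
roughly like this (async_wait() is the call proposed above; handle_completion()
and the batch size are hypothetical):

#define COMPLETION_BATCH        64

struct syslet_uatom;                                    /* opaque here */
extern int async_wait(struct syslet_uatom **r, int n);  /* proposed call */
extern void handle_completion(struct syslet_uatom *a);  /* hypothetical */

static void dispatch_completions(void)
{
        struct syslet_uatom *done[COMPLETION_BATCH];
        int i, n;

        /* blocks until at least one syslet has completed: */
        n = async_wait(done, COMPLETION_BATCH);
        for (i = 0; i < n; i++)
                handle_completion(done[i]);
}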



- Davide


2007-02-13 20:22:10

by Davide Libenzi

[permalink] [raw]
Subject: Re: [patch 00/11] ANNOUNCE: "Syslets", generic asynchronous system call support

On Tue, 13 Feb 2007, Ingo Molnar wrote:

> As it might be obvious to some of you, the syslet subsystem takes many
> ideas and experience from my Tux in-kernel webserver :) The syslet code
> originates from a heavy rewrite of the Tux-atom and the Tux-cachemiss
> infrastructure.
>
> Open issues:
>
> - the 'TID' of the 'head' thread currently varies depending on which
> thread is running the user-space context.
>
> - signal support is not fully thought through - probably the head
> should be getting all of them - the cachemiss threads are not really
> interested in executing signal handlers.
>
> - sys_fork() and sys_async_exec() should be filtered out from the
> syscalls that are allowed - first one only makes sense with ptregs,
> second one is a nice kernel recursion thing :) I didnt want to
> duplicate the sys_call_table though - maybe others have a better
> idea.

If this is going to be a generic AIO subsystem:

- Cancellation of pending request



- Davide


2007-02-13 20:26:58

by Davide Libenzi

[permalink] [raw]
Subject: Re: [patch 00/11] ANNOUNCE: "Syslets", generic asynchronous system call support

On Tue, 13 Feb 2007, Linus Torvalds wrote:

>         if (in_async_context())
>                 return -EINVAL;
>
> or similar. We need that "async_context()" function anyway for the other
> cases where we can't do other things concurrently, like changing the UID.

Yes, that's definitely better. Let's have the policy about whether a
syscall is or is not async-enabled inside the syscall itself. It simplifies
things a lot.



- Davide


2007-02-13 20:37:19

by Ingo Molnar

[permalink] [raw]
Subject: Re: [patch 00/11] ANNOUNCE: "Syslets", generic asynchronous system call support


* Benjamin LaHaise <[email protected]> wrote:

> [...] interaction with set_fs()...

hm, this one should already work in the current version, because
addr_limit is in thread_info and hence stays with the async context. Or
can you see any hole in it?

Ingo

2007-02-13 20:42:42

by Ingo Molnar

[permalink] [raw]
Subject: Re: [patch 00/11] ANNOUNCE: "Syslets", generic asynchronous system call support


* Dmitry Torokhov <[email protected]> wrote:

> > What are the semantics of async sys_async_wait and async sys_async ?
>
> Ooooohh. OpenVMS lives forever ;) Me likeee ;)

hm, i dont know OpenVMS - but googled around a bit for 'VMS
asynchronous' and it gave me this:

http://en.wikipedia.org/wiki/Asynchronous_system_trap

is AST what you mean? From a quick read AST seems to be a signal
mechanism a bit like Unix signals, extended to kernel-space as well -
while syslets are a different 'safe execution engine' kind of thing
centered around the execution of system calls.

Ingo

2007-02-13 21:00:57

by Indan Zupancic

[permalink] [raw]
Subject: Re: [patch 02/11] syslets: add syslet.h include file, user API/ABI definitions

On Tue, February 13, 2007 15:20, Ingo Molnar wrote:
> +/*
> + * Execution control: conditions upon the return code
> + * of the previous syslet atom. 'Stop' means syslet
> + * execution is stopped and the atom is put into the
> + * completion ring:
> + */
> +#define SYSLET_STOP_ON_NONZERO 0x00000008
> +#define SYSLET_STOP_ON_ZERO 0x00000010
> +#define SYSLET_STOP_ON_NEGATIVE 0x00000020
> +#define SYSLET_STOP_ON_NON_POSITIVE 0x00000040

This is confusing. Why the return code of the previous syslet atom?
Wouldn't it be clearer if the flag applied to the current atom?
Worse, what is the previous atom? Imagine some case with a loop:

A
|
B<--.
| |
C---'

What will be the previous atom of B here? It can be either A or C,
but their return values can be different and incompatible, so what
flag should B set?

> +/*
> + * Special modifier to 'stop' handling: instead of stopping the
> + * execution of the syslet, the linearly next syslet is executed.
> + * (Normal execution flows along atom->next, and execution stops
> + * if atom->next is NULL or a stop condition becomes true.)
> + *
> + * This is what allows true branches of execution within syslets.
> + */
> +#define SYSLET_SKIP_TO_NEXT_ON_STOP 0x00000080
> +

Might rename this to SYSLET_SKIP_NEXT_ON_STOP too then.

Greetings,

Indan



2007-02-13 21:24:55

by Davide Libenzi

[permalink] [raw]
Subject: Re: [patch 00/11] ANNOUNCE: "Syslets", generic asynchronous system call support

On Tue, 13 Feb 2007, Davide Libenzi wrote:

> If this is going to be a generic AIO subsystem:
>
> - Cancellation of pending request

What about the busy_async_threads list becoming a hash/rb_tree indexed by
syslet_atom ptr. A cancel would lookup the thread and send a signal (of
course, signal handling of the async threads should be set properly)?



- Davide


2007-02-13 21:36:45

by Ingo Molnar

[permalink] [raw]
Subject: Re: [patch 06/11] syslets: core, documentation


* Davide Libenzi <[email protected]> wrote:

> > +The Syslet Atom:
> > +----------------
> > +
> > +The syslet atom is a small, fixed-size (44 bytes on 32-bit) piece of
> > +user-space memory, which is the basic unit of execution within the syslet
> > +framework. A syslet represents a single system-call and its arguments.
> > +In addition it also has condition flags attached to it that allows the
> > +construction of larger programs (syslets) from these atoms.
> > +
> > +Arguments to the system call are implemented via pointers to arguments.
> > +This not only increases the flexibility of syslet atoms (multiple syslets
> > +can share the same variable for example), but is also an optimization:
> > +copy_uatom() will only fetch syscall parameters up until the point it
> > +meets the first NULL pointer. 50% of all syscalls have 2 or less
> > +parameters (and 90% of all syscalls have 4 or less parameters).
>
> Why do you need to have an extra memory indirection per parameter in
> copy_uatom()? [...]

yes. Try to use them in real programs, and you'll see that most of the
time the variable an atom wants to access should also be accessed by
other atoms. For example a socket file descriptor - one atom opens it,
another one reads from it, a third one closes it. By having the
parameters in the atoms we'd have to copy the fd to two other places.

but i see your point: i actually had it like that in my earlier
versions, only changed it to an indirect method later on, when writing
more complex syslets. And, surprisingly, performance of atom handling
/improved/ on both Intel and AMD CPUs when i added indirection, because
the indirection enables the 'tail NULL' optimization. (which wasnt the
goal of indirection, it was just a side-effect)

> [...] It also forces you to have parameters pointed-to, to be "long"
> (or pointers), instead of their natural POSIX type (like fd being
> "int" for example). [...]

this wasnt a big problem while coding syslets. I'd also not expect
application writers having to do these things on the syscall level -
this is a system interface after all. But you do have a point.

> I can understand that chaining syscalls requires variable sharing, but
> the majority of the parameters passed to syscalls are just direct
> ones. Maybe a smart method that allows you to know if a parameter is a
> direct one or a pointer to one? An "unsigned int pmap" where bit N is
> 1 if param N is an indirection? Hmm?

adding such things tends to slow down atom parsing.

there's another reason as well: i wanted syslets to be like
'instructions' - i.e. not self-modifying. If the fd parameter is
embedded in the syslet then every syslet has to be replicated

note that chaining does not necessarily require variable sharing: a
sys_umem_add() atom could be used to modify the next syslet's ->fd
parameter. So for example

sys_open() -> returns 'fd'
sys_umem_add(&atom1->fd) <= atom1->fd is 0 initially
sys_umem_add(&atom2->fd) <= the first umem_add returns the value
atom1 [uses fd]
atom2 [uses fd]

but i didnt like this approach: this means 1 more atom per indirect
parameter, and quite some trickery to put the right information into the
right place. Furthermore, this makes syslets very much tied to the
'register contents' - instead of them being 'pure instructions/code'.

> > +Completion of asynchronous syslets:
> > +-----------------------------------
> > +
> > +Completion of asynchronous syslets is done via the 'completion ring',
> > +which is a ringbuffer of syslet atom pointers in user-space memory,
> > +provided by user-space in the sys_async_register() syscall. The
> > +kernel fills in the ringbuffer starting at index 0, and user-space
> > +must clear out these pointers. Once the kernel reaches the end of
> > +the ring it wraps back to index 0. The kernel will not overwrite
> > +non-NULL pointers (but will return an error), user-space has to
> > +make sure it completes all events it asked for.
>
> Sigh, I really dislike shared userspace/kernel stuff, when we're
> transferring pointers to userspace. Did you actually bench it against
> a:
>
> int async_wait(struct syslet_uatom **r, int n);
>
> I can fully understand sharing userspace buffers with the kernel, if
> we're talking about KB transferred during a block or net I/O DMA
> operation, but for transferring a pointer? Behind each pointer
> transfer (4/8 bytes) there is a whole syscall execution, [...]

there are three main reasons for this choice:

- firstly, by putting completion events into the user-space ringbuffer
the asynchronous contexts are not held up at all, and the threads are
available for further syslet use.

- secondly, it was the most obvious and simplest solution to me - it
just fits well into the syslet model - which is an execution concept
centered around pure user-space memory and system calls, not some
kernel resource. Kernel fills in the ringbuffer, user-space clears it.
If we had to worry about a handshake between user-space and
kernel-space for the completion information to be passed along, that
would either mean extra buffering or extra overhead. Extra buffering
(in the kernel) would be for no good reason: why not buffer it in the
place where the information is destined for in the first place. The
ringbuffer of /pointers/ is what makes this really powerful. I never
really liked the AIO/etc. method /event buffer/ rings. With syslets
the 'cookie' is the pointer to the syslet atom itself. It doesnt get
any more straightforward than that i believe.

- making 'is there more stuff for me to work on' a simple instruction in
user-space makes it a no-brainer for user-space to promptly and
without thinking complete events. It's also the right thing to do on
SMP: if one core is solely dedicated to the asynchronous workload,
only running on kernel mode, and the other code is only running
user-space, why ever switch between protection domains? [except if any
of them is idle] The fastest completion signalling method is the
/memory bus/, not an interrupt. User-space could in theory even use
MWAIT (in user-space!) to wait for the other core to complete stuff.
That makes for a hell of a fast wakeup.

Ingo

2007-02-13 21:45:34

by Ingo Molnar

[permalink] [raw]
Subject: Re: [patch 02/11] syslets: add syslet.h include file, user API/ABI definitions


* Indan Zupancic <[email protected]> wrote:

> > + * Execution control: conditions upon the return code
> > + * of the previous syslet atom. 'Stop' means syslet
> > + * execution is stopped and the atom is put into the
> > + * completion ring:
> > + */
> > +#define SYSLET_STOP_ON_NONZERO 0x00000008
> > +#define SYSLET_STOP_ON_ZERO 0x00000010
> > +#define SYSLET_STOP_ON_NEGATIVE 0x00000020
> > +#define SYSLET_STOP_ON_NON_POSITIVE 0x00000040
>
> This is confusing. Why the return code of the previous syslet atom?
> Wouldn't it be clearer if the flag applied to the current atom?
> Worse, what is the previous atom? [...]

the previously executed atom. (I have fixed up the comment in my tree to
say that.)

> [...] Imagine some case with a loop:
>
> A
> |
> B<--.
> | |
> C---'
>
> What will be the previous atom of B here? It can be either A or C, but
> their return values can be different and incompatible, so what flag
> should B set?

previous here is the previously executed atom, which is always a
specific atom. Think of atoms as 'instructions', and these condition
flags as the 'CPU flags' like 'zero' 'carry' 'sign', etc. Syslets can be
thought of as streams of simplified instructions.

> > +/*
> > + * Special modifier to 'stop' handling: instead of stopping the
> > + * execution of the syslet, the linearly next syslet is executed.
> > + * (Normal execution flows along atom->next, and execution stops
> > + * if atom->next is NULL or a stop condition becomes true.)
> > + *
> > + * This is what allows true branches of execution within syslets.
> > + */
> > +#define SYSLET_SKIP_TO_NEXT_ON_STOP 0x00000080
> > +
>
> Might rename this to SYSLET_SKIP_NEXT_ON_STOP too then.

but that's not what it does. It really 'skips to the next one on a stop
event'. I.e. if you have three consecutive atoms (consecutive in linear
memory):

atom1 returns 0
atom2 has SYSLET_STOP_ON_ZERO|SYSLET_SKIP_NEXT_ON_STOP set
atom3

then after atom1 returns 0, the SYSLET_STOP_ON_ZERO condition is
recognized as a 'stop' event - but due to the SYSLET_SKIP_NEXT_ON_STOP
flag execution does not stop (i.e. we do not return to user-space or
complete the syslet), but we continue execution at atom3.

this flag basically avoids having to add an atom->else pointer and keeps
the data structure more compressed. Two-way branches are sufficiently
rare, so i wanted to avoid the atom->else pointer.

Ingo

2007-02-13 21:59:58

by Ingo Molnar

[permalink] [raw]
Subject: Re: [patch 00/11] ANNOUNCE: "Syslets", generic asynchronous system call support


* Davide Libenzi <[email protected]> wrote:

> > Open issues:

> If this is going to be a generic AIO subsystem:
>
> - Cancellation of pending request

How about implementing aio_cancel() as a NOP. Can anyone prove that the
kernel didnt actually attempt to cancel that IO? [but unfortunately
failed at doing so, because the platters were being written already.]

really, what's the point behind aio_cancel()?

Ingo

2007-02-13 22:13:29

by Ingo Molnar

[permalink] [raw]
Subject: Re: [patch 00/11] ANNOUNCE: "Syslets", generic asynchronous system call support


* Davide Libenzi <[email protected]> wrote:

> > If this is going to be a generic AIO subsystem:
> >
> > - Cancellation of pending request
>
> What about the busy_async_threads list becoming a hash/rb_tree indexed
> by syslet_atom ptr. A cancel would lookup the thread and send a signal
> (of course, signal handling of the async threads should be set
> properly)?

well, each async syslet has a separate TID at the moment, so if we want
a submitted syslet to be cancellable then we could return the TID of the
syslet handler (instead of the NULL) in sys_async_exec(). Then
user-space could send a signal the old-fashioned way, via sys_tkill(),
if it so wishes.

the TID could also be used in a sys_async_wait_on() API. I.e. it would
be a natural, readily accessible 'cookie' for the pending work. TIDs can
be looked up lockless via RCU, so it's reasonably fast as well.

( Note that there's already a way to 'signal' pending syslets: do_exit()
in the user context will signal all async contexts (which results in
-EINTR of currently executing syscalls, wherever possible) and will
tear them down. But that's too crude for aio_cancel() i guess. )
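A user-space sketch of that cancellation path, assuming sys_async_exec() did
return the handler TID as suggested above (the posted code returns NULL for a
blocked syslet, so the TID source here is an assumption):

#include <signal.h>
#include <sys/syscall.h>
#include <unistd.h>

/* interrupt whatever syscall the async context is blocked in: */
static int async_cancel(pid_t syslet_tid)
{
        return syscall(SYS_tkill, syslet_tid, SIGINT);
}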

Ingo

2007-02-13 22:16:34

by Andi Kleen

[permalink] [raw]
Subject: Re: [patch 05/11] syslets: core code

Ingo Molnar <[email protected]> writes:

> +
> +static struct async_thread *
> +pick_ready_cachemiss_thread(struct async_head *ah)

The cachemiss names are confusing. I assume that's just a left over
from Tux?
> +
> + memset(atom->args, 0, sizeof(atom->args));
> +
> + ret |= __get_user(arg_ptr, &uatom->arg_ptr[0]);
> + if (!arg_ptr)
> + return ret;
> + if (!access_ok(VERIFY_WRITE, arg_ptr, sizeof(*arg_ptr)))
> + return -EFAULT;

It's a little unclear why you do that many individual access_ok()s.
And why is the target constant sized anyways?


+ /*
+ * Lock down the ring. Note: user-space should not munlock() this,
+ * because if the ring pages get swapped out then the async
+ * completion code might return a -EFAULT instead of the expected
+ * completion. (the kernel safely handles that case too, so this
+ * isnt a security problem.)
+ *
+ * mlock() is better here because it gets resource-accounted
+ * properly, and even unprivileged userspace has a few pages
+ * of mlock-able memory available. (which is more than enough
+ * for the completion-pointers ringbuffer)
+ */

If it's only a few pages you don't need any resource accounting.
If it's more then it's nasty to steal the users quota.
I think plain gup() would be better.


-Andi

2007-02-13 22:20:18

by Ingo Molnar

[permalink] [raw]
Subject: Re: [patch 00/11] ANNOUNCE: "Syslets", generic asynchronous system call support


* Evgeniy Polyakov <[email protected]> wrote:

> [...] it still has a problem - syscall blocks and the same thread thus
> is not allowed to continue execution and fill the pipe - so what if
> the system issues thousands of requests and there are only tens of working
> threads at most. [...]

the same thread is allowed to continue execution even if the system call
blocks: take a look at async_schedule(). The blocked system-call is 'put
aside' (in a sleeping thread), the kernel switches the user-space
context (registers) to a free kernel thread and switches to it - and
returns to user-space as if nothing happened - allowing the user-space
context to 'fill the pipe' as much as it can. Or did i misunderstand
your point?

basically there's SYSLET_ASYNC for 'always async' and SYSLET_SYNC for
'always sync' - but the default syslet behavior is: 'try sync and switch
transparently to async on demand'. The testcode i sent very much uses
this. (and this mechanism is in essence Zach's fibril-switching thing,
but done via kernel threads.)

Ingo

2007-02-13 22:22:50

by Ingo Molnar

[permalink] [raw]
Subject: Re: [patch 00/11] ANNOUNCE: "Syslets", generic asynchronous system call support


* Evgeniy Polyakov <[email protected]> wrote:

> > I have not received first mail with announcement yet, so I will place
> > my thoughts here if you do not mind.
>
> An issue with sys_async_wait(): is it possible that events_left will
> be set up too late, so that all events are already ready and thus
> sys_async_wait() can wait forever (or until the next sys_async_wait()'s
> events are ready)?

yeah. I have fixed this up and have uploaded a newer queue to:

http://redhat.com/~mingo/syslet-patches/

Ingo

2007-02-13 22:25:09

by Andi Kleen

[permalink] [raw]
Subject: Re: [patch 00/11] ANNOUNCE: "Syslets", generic asynchronous system call support

Ingo Molnar <[email protected]> writes:
>
> really, what's the point behind aio_cancel()?

The main use case is when you open a file requester on a network
file system where the server is down and you get tired of waiting
and press "Cancel" it should abort the hanging IO immediately.

At least I would appreciate such a feature sometimes.

e.g. the readdir loop could be a syslet (are they powerful
enough to allocate memory for an arbitrarily sized directory? Probably not)
and then the cancel button could async_cancel() it.

-Andi

2007-02-13 22:25:08

by Indan Zupancic

[permalink] [raw]
Subject: Re: [patch 02/11] syslets: add syslet.h include file, user API/ABI definitions

On Tue, February 13, 2007 22:43, Ingo Molnar wrote:
> * Indan Zupancic <[email protected]> wrote:
>> A
>> |
>> B<--.
>> | |
>> C---'
>>
>> What will be the previous atom of B here? It can be either A or C, but
>> their return values can be different and incompatible, so what flag
>> should B set?
>
> previous here is the previously executed atom, which is always a
> specific atom. Think of atoms as 'instructions', and these condition
> flags as the 'CPU flags' like 'zero' 'carry' 'sign', etc. Syslets can be
> thought of as streams of simplified instructions.

In the diagram above the previously executed atom, when handling atom B,
can be either atom A or atom C. So B doesn't know what kind of return value
to expect, because it depends on the previous atom's kind of syscall, and
not on B's return type. So I think you would want to move those return value
flags one atom earlier, in this case to A and C. So each atom will have a
flag telling what to do depending on its own return value.

>> > +/*
>> > + * Special modifier to 'stop' handling: instead of stopping the
>> > + * execution of the syslet, the linearly next syslet is executed.
>> > + * (Normal execution flows along atom->next, and execution stops
>> > + * if atom->next is NULL or a stop condition becomes true.)
>> > + *
>> > + * This is what allows true branches of execution within syslets.
>> > + */
>> > +#define SYSLET_SKIP_TO_NEXT_ON_STOP 0x00000080
>> > +
>>
>> Might rename this to SYSLET_SKIP_NEXT_ON_STOP too then.
>
> but that's not what it does. It really 'skips to the next one on a stop
> event'. I.e. if you have three consecutive atoms (consecutive in linear
> memory):
>
> atom1 returns 0
> atom2 has SYSLET_STOP_ON_ZERO|SYSLET_SKIP_NEXT_ON_STOP set
> atom3
>
> then after atom1 returns 0, the SYSLET_STOP_ON_ZERO condition is
> recognized as a 'stop' event - but due to the SYSLET_SKIP_NEXT_ON_STOP
> flag execution does not stop (i.e. we do not return to user-space or
> complete the syslet), but we continue execution at atom3.
>
> this flag basically avoids having to add an atom->else pointer and keeps
> the data structure more compressed. Two-way branches are sufficiently
> rare, so i wanted to avoid the atom->else pointer.

The flags are smart, they're just at the wrong place I think.

In your example, if atom3 has a 'next' pointing to atom2, atom2 wouldn't
know which return value it's checking: The one of atom1, or the one of
atom3? You're spreading syscall specific knowledge over multiple atoms
while that isn't necessary.

What I propose:

atom1 returns 0, has SYSLET_STOP_ON_ZERO|SYSLET_SKIP_NEXT_ON_STOP set
atom2
atom3

(You've already used my SYSLET_SKIP_NEXT_ON_STOP instead of
SYSLET_SKIP_TO_NEXT_ON_STOP. ;-)

Perhaps it's even more clear when splitting that SYSLET_STOP_* into a
SYSLET_STOP flag, and specific SYSLET_IF_* flags. Either that, or go
all the way and introduce separate SYSLET_SKIP_NEXT_ON_*.

atom1 returns 0, has SYSLET_SKIP_NEXT|SYSLET_IF_ZERO set
atom2
atom3

Greetings,

Indan


2007-02-13 22:27:38

by Ingo Molnar

[permalink] [raw]
Subject: Re: [patch 05/11] syslets: core code


* Andi Kleen <[email protected]> wrote:

> Ingo Molnar <[email protected]> writes:
>
> > +
> > +static struct async_thread *
> > +pick_ready_cachemiss_thread(struct async_head *ah)
>
> The cachemiss names are confusing. I assume that's just a left over
> from Tux?

yeah. Although 'stuff goes async' is quite similar to a cachemiss. We
didnt have some resource available right now so the syscall has to block
== i.e. some cache was not available.

> > +
> > + memset(atom->args, 0, sizeof(atom->args));
> > +
> > + ret |= __get_user(arg_ptr, &uatom->arg_ptr[0]);
> > + if (!arg_ptr)
> > + return ret;
> > + if (!access_ok(VERIFY_WRITE, arg_ptr, sizeof(*arg_ptr)))
> > + return -EFAULT;
>
> It's a little unclear why you do that many individual access_ok()s.
> And why is the target constant sized anyways?

each indirect pointer has to be checked separately, before dereferencing
it. (Andrew pointed out that they should be VERIFY_READ, i fixed that in
my tree)

it looks a bit scary in C but the assembly code is very fast and quite
straightforward.

> + /*
> + * Lock down the ring. Note: user-space should not munlock() this,
> + * because if the ring pages get swapped out then the async
> + * completion code might return a -EFAULT instead of the expected
> + * completion. (the kernel safely handles that case too, so this
> + * isnt a security problem.)
> + *
> + * mlock() is better here because it gets resource-accounted
> + * properly, and even unprivileged userspace has a few pages
> + * of mlock-able memory available. (which is more than enough
> + * for the completion-pointers ringbuffer)
> + */
>
> If it's only a few pages you don't need any resource accounting. If
> it's more then it's nasty to steal the users quota. I think plain
> gup() would be better.

get_user_pages() would have to be limited in some way - and i didnt want
to add yet another wacky limit thing - so i just used the already
existing mlock() infrastructure for this. If Oracle wants to set up a 10
MB ringbuffer, they can set the PAM resource limits to 11 MB and still
have enough stuff left. And i dont really expect GPG to start using
syslets - just yet ;-)

a single page is enough for 1024 completion pointers - that's more than
enough for most purposes - and the default mlock limit is 40K.
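A user-space sketch of allocating such a ring - one mlock()ed page holds 1024
completion pointers on 32-bit; actually registering it with the kernel via
sys_async_register() is assumed and not shown:

#include <stdlib.h>
#include <string.h>
#include <sys/mman.h>
#include <unistd.h>

struct syslet_uatom;                    /* opaque here */

static struct syslet_uatom **alloc_completion_ring(size_t slots)
{
        size_t size = slots * sizeof(struct syslet_uatom *);
        struct syslet_uatom **ring;

        if (posix_memalign((void **)&ring, getpagesize(), size))
                return NULL;
        memset(ring, 0, size);

        /* pin it, so a completion never hits a swapped-out ring page: */
        if (mlock(ring, size)) {
                free(ring);
                return NULL;
        }
        return ring;
}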

Ingo

2007-02-13 22:28:33

by Ingo Molnar

[permalink] [raw]
Subject: Re: [patch 00/11] ANNOUNCE: "Syslets", generic asynchronous system call support


* Andi Kleen <[email protected]> wrote:

> > really, what's the point behind aio_cancel()?
>
> The main use case is when you open a file requester on a network file
> system where the server is down and you get tired of waiting and press
> "Cancel" it should abort the hanging IO immediately.

ok, that should work fine already - exit in the user context gets
propagated to all async syslet contexts immediately. So if the syscalls
that the syslet uses are reasonably interruptible, it will work out
fine.

Ingo

2007-02-13 22:30:20

by Andi Kleen

[permalink] [raw]
Subject: Re: [patch 05/11] syslets: core code

On Tue, Feb 13, 2007 at 11:24:43PM +0100, Ingo Molnar wrote:
> > > + memset(atom->args, 0, sizeof(atom->args));
> > > +
> > > + ret |= __get_user(arg_ptr, &uatom->arg_ptr[0]);
> > > + if (!arg_ptr)
> > > + return ret;
> > > + if (!access_ok(VERIFY_WRITE, arg_ptr, sizeof(*arg_ptr)))
> > > + return -EFAULT;
> >
> > It's a little unclear why you do that many individual access_ok()s.
> > And why is the target constant sized anyways?
>
> each indirect pointer has to be checked separately, before dereferencing
> it. (Andrew pointed out that they should be VERIFY_READ, i fixed that in
> my tree)

But why only constant sized? It could be a variable length object, couldn't it?

If it's an array it could be all checked together

(i must be missing something here)

> > If it's only a few pages you don't need any resource accounting. If
> > it's more then it's nasty to steal the users quota. I think plain
> > gup() would be better.
>
> get_user_pages() would have to be limited in some way - and i didnt want

If you only use it for a small ring buffer it is naturally limited.

Also beancounter will fix that eventually.

> a single page is enough for 1024 completion pointers - that's more than
> enough for most purposes - and the default mlock limit is 40K.

Then limit it to a single page and use gup

-Andi

2007-02-13 22:32:22

by Andi Kleen

[permalink] [raw]
Subject: Re: [patch 00/11] ANNOUNCE: "Syslets", generic asynchronous system call support

On Tue, Feb 13, 2007 at 11:26:26PM +0100, Ingo Molnar wrote:
>
> * Andi Kleen <[email protected]> wrote:
>
> > > really, what's the point behind aio_cancel()?
> >
> > The main use case is when you open a file requester on a network file
> > system where the server is down and you get tired of waiting and press
> > "Cancel" it should abort the hanging IO immediately.
>
> ok, that should work fine already - exit in the user context gets

That would be a little heavy handed. I wouldn't expect my GUI
program to quit itself on cancel. And requiring it to create a new
thread just to exit on cancel would be also nasty.

And of course you cannot interrupt blocked IOs this way right now
(currently it only works with signals in some cases on NFS)

-Andi

2007-02-13 22:35:16

by Ingo Molnar

[permalink] [raw]
Subject: Re: [patch 02/11] syslets: add syslet.h include file, user API/ABI definitions


* Indan Zupancic <[email protected]> wrote:

> What I propose:
>
> atom1 returns 0, has SYSLET_STOP_ON_ZERO|SYSLET_SKIP_NEXT_ON_STOP set
> atom2
> atom3
>
> (You've already used my SYSLET_SKIP_NEXT_ON_STOP instead of
> SYSLET_SKIP_TO_NEXT_ON_STOP. ;-)

doh. Yes. I noticed and implemented this yesterday and it's in the
submitted syslet code - but i guess i was too tired to remember my own
code - so i added the wrong comments :-/ If you look at the sample
user-space code:

        init_atom(req, &req->open_file, __NR_sys_open,
                  &req->filename_p, &O_RDONLY_var, NULL, NULL, NULL, NULL,
                  &req->fd, SYSLET_STOP_ON_NEGATIVE, &req->read_file);

the 'STOP_ON_NEGATIVE' acts on that particular atom.

this indeed cleaned up things quite a bit and made the user-space syslet
code alot more straightforward. A return value can still be recovered
and examined (with a different condition and a different jump target)
arbitrary number of times via ret_ptr and via sys_umem_add().

Ingo

2007-02-13 22:36:39

by Dmitry Torokhov

[permalink] [raw]
Subject: Re: [patch 00/11] ANNOUNCE: "Syslets", generic asynchronous system call support

Hi Ingo,

On Tuesday 13 February 2007 15:39, Ingo Molnar wrote:
>
> * Dmitry Torokhov <[email protected]> wrote:
>
> > > What are the semantics of async sys_async_wait and async sys_async ?
> >
> > Ooooohh. OpenVMS lives forever ;) Me likeee ;)
>
> hm, i dont know OpenVMS - but googled around a bit for 'VMS
> asynchronous' and it gave me this:
>
> http://en.wikipedia.org/wiki/Asynchronous_system_trap
>
> is AST what you mean? From a quick read AST seems to be a signal
> mechanism a bit like Unix signals, extended to kernel-space as well -
> while syslets are a different 'safe execution engine' kind of thing
> centered around the execution of system calls.
>

That is only one of the ways of notifying userspace of system call completion
on OpenVMS. Pretty much every syscall there exists in 2 flavors - async
and sync, for example $QIO and $QIOW or $ENQ/$ENQW (actually -W flavor
is async call + $SYNCH to wait for completion). Once system service call
is completed the OS would raise a so-called event flag and may also
deliver an AST to the process. Application may either wait for an
event flag/set of event flags (EFN) or rely on AST to get notification.

--
Dmitry

2007-02-13 22:44:13

by Ingo Molnar

[permalink] [raw]
Subject: Re: [patch 05/11] syslets: core code


* Andi Kleen <[email protected]> wrote:

> > > > + if (!access_ok(VERIFY_WRITE, arg_ptr, sizeof(*arg_ptr)))
> > > > + return -EFAULT;
> > >
> > > It's a little unclear why you do that many individual access_ok()s.
> > > And why is the target constant sized anyways?
> >
> > each indirect pointer has to be checked separately, before dereferencing
> > it. (Andrew pointed out that they should be VERIFY_READ, i fixed that in
> > my tree)
>
> But why only constant sized? It could be a variable length object,
> couldn't it?

i think what you might be missing is that it's only the 6 syscall
arguments that are fetched via indirect pointers - security checks are
then done by the system calls themselves. It's a bit awkward to think
about, but it is surprisingly clean in the assembly, and it simplified
syslet programming too.

> > get_user_pages() would have to be limited in some way - and i didnt
> > want
>
> If you only use it for a small ring buffer it is naturally limited.

yeah, but 'small' is a dangerous word when it comes to adding IO
interfaces ;-)

> > a single page is enough for 1024 completion pointers - that's more
> > than enough for most purposes - and the default mlock limit is 40K.
>
> Then limit it to a single page and use gup

1024 (512 on 64-bit) is alot but not ALOT. It is also certainly not
ALOOOOT :-) Really, people will want to have more than 512
disks/spindles in the same box. I have used such a beast myself. For Tux
workloads and benchmarks we had parallelism levels of millions of
pending requests (!) on a single system - networking, socket limits,
disk IO combined with thousands of clients do create such scenarios. I
really think that such 'pinned pages' are a pretty natural fit for
sys_mlock() and RLIMIT_MEMLOCK, and since the kernel side is careful to
use the _inatomic() uaccess methods, it's safe (and fast) as well.

Ingo

2007-02-13 22:45:36

by Ingo Molnar

[permalink] [raw]
Subject: Re: [patch 00/11] ANNOUNCE: "Syslets", generic asynchronous system call support


* Andi Kleen <[email protected]> wrote:

> > ok, that should work fine already - exit in the user context gets
>
> That would be a little heavy handed. I wouldn't expect my GUI program
> to quit itself on cancel. And requiring it to create a new thread just
> to exit on cancel would be also nasty.
>
> And of course you cannot interrupt blocked IOs this way right now
> (currently it only works with signals in some cases on NFS)

ok. The TID+signal approach i mentioned in the other reply should work.
If it's frequent enough we could make this an explicit
sys_async_cancel(TID) API.

Ingo

2007-02-13 22:47:39

by Andi Kleen

[permalink] [raw]
Subject: Re: [patch 00/11] ANNOUNCE: "Syslets", generic asynchronous system call support

> ok. The TID+signal approach i mentioned in the other reply should work.

Not sure if a signal is good for this. It might conflict with existing
strange historical semantics.

> If it's frequent enough we could make this an explicit
> sys_async_cancel(TID) API.

Ideally there should be a new function like signal_pending() that checks for
this. Then the network fs could check those in their blocking loops
and error out.

Then it would even work on non intr NFS mounts.
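A sketch of the kind of check being suggested - modelled on signal_pending();
the TIF_ASYNC_CANCEL flag is an assumption and does not exist in the posted
patches:

static inline int async_cancel_pending(struct task_struct *tsk)
{
        return test_tsk_thread_flag(tsk, TIF_ASYNC_CANCEL);
}

A network filesystem's blocking retry loop could then check
async_cancel_pending(current) next to signal_pending(current) and bail out
with -EINTR, which would also cover non-intr NFS mounts.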

-Andi

2007-02-13 22:51:01

by Olivier Galibert

[permalink] [raw]
Subject: Re: [patch 00/11] ANNOUNCE: "Syslets", generic asynchronous system call support

On Tue, Feb 13, 2007 at 10:57:24PM +0100, Ingo Molnar wrote:
>
> * Davide Libenzi <[email protected]> wrote:
>
> > > Open issues:
>
> > If this is going to be a generic AIO subsystem:
> >
> > - Cancellation of pending request
>
> How about implementing aio_cancel() as a NOP. Can anyone prove that the
> kernel didnt actually attempt to cancel that IO? [but unfortunately
> failed at doing so, because the platters were being written already.]
>
> really, what's the point behind aio_cancel()?

Lemme give you a real-world scenario: Question Answering in a Dialog
System. Your locked-in-memory index ranks documents in a several
million files corpus depending of the chances they have to have what
you're looking for. You have a tenth of a second to read as many of
them as you can, and each seek is 5ms. So you aio-read them,
requesting them in order of ranking up to 200 or so, and see what you
have at the 0.1s deadline. If you're lucky, a combination of cache
(especially if you stat() the whole dir tree on a regular basis to
keep the metadata fresh in cache) and of good io reorganisation by the
scheduler will allow you to get a good number of them and do the
information extraction, scoring and clustering of answers, which is
pure CPU at that point. You *have* to cancel the remaining i/o
because you do not want the disk saturated when the next request
comes, especially if it's 10ms later because the dialog manager found
out it needed a complementary request.

Incidentally, that's something I'm currently implementing for work,
making these aio discussions more interesting than usual :-)

OG.

2007-02-13 22:58:22

by Andrew Morton

[permalink] [raw]
Subject: Re: [patch 05/11] syslets: core code

> On Tue, 13 Feb 2007 23:24:43 +0100 Ingo Molnar <[email protected]> wrote:
> > If it's only a few pages you don't need any resource accounting. If
> > it's more then it's nasty to steal the users quota. I think plain
> > gup() would be better.
>
> get_user_pages() would have to be limited in some way - and i didnt want
> to add yet another wacky limit thing - so i just used the already
> existing mlock() infrastructure for this. If Oracle wants to set up a 10
> MB ringbuffer, they can set the PAM resource limits to 11 MB and still
> have enough stuff left. And i dont really expect GPG to start using
> syslets - just yet ;-)
>
> a single page is enough for 1024 completion pointers - that's more than
> enough for most purposes - and the default mlock limit is 40K.

So if I have an application which instantiates a single mlocked page
for this purpose, I can only run ten of them at once, and any other
mlock-using process which I'm using starts to mysteriously fail.

It seems like a problem to me..

2007-02-13 22:59:20

by Ulrich Drepper

[permalink] [raw]
Subject: Re: [patch 00/11] ANNOUNCE: "Syslets", generic asynchronous system call support

Ingo Molnar wrote:
> really, what's the point behind aio_cancel()?

- sequence

aio_write()
aio_cancel()
aio_write()

with both writes going to the same place must behave predictably

- think beyond files. Writes to sockets, ttys, they can block and
cancel must abort them. Even for files the same applies in some
situations, e.g., for network filesystems.

--
➧ Ulrich Drepper ➧ Red Hat, Inc. ➧ 444 Castro St ➧ Mountain View, CA ❖


Attachments:
signature.asc (251.00 B)
OpenPGP digital signature

2007-02-13 23:21:18

by Davide Libenzi

[permalink] [raw]
Subject: Re: [patch 06/11] syslets: core, documentation

On Tue, 13 Feb 2007, Ingo Molnar wrote:

>
> * Davide Libenzi <[email protected]> wrote:
>
> > > +The Syslet Atom:
> > > +----------------
> > > +
> > > +The syslet atom is a small, fixed-size (44 bytes on 32-bit) piece of
> > > +user-space memory, which is the basic unit of execution within the syslet
> > > +framework. A syslet represents a single system-call and its arguments.
> > > +In addition it also has condition flags attached to it that allows the
> > > +construction of larger programs (syslets) from these atoms.
> > > +
> > > +Arguments to the system call are implemented via pointers to arguments.
> > > +This not only increases the flexibility of syslet atoms (multiple syslets
> > > +can share the same variable for example), but is also an optimization:
> > > +copy_uatom() will only fetch syscall parameters up until the point it
> > > +meets the first NULL pointer. 50% of all syscalls have 2 or less
> > > +parameters (and 90% of all syscalls have 4 or less parameters).
> >
> > Why do you need to have an extra memory indirection per parameter in
> > copy_uatom()? [...]
>
> yes. Try to use them in real programs, and you'll see that most of the
> time the variable an atom wants to access should also be accessed by
> other atoms. For example a socket file descriptor - one atom opens it,
> another one reads from it, a third one closes it. By having the
> parameters in the atoms we'd have to copy the fd to two other places.

Yes, of course we have to support the indirection, otherwise chaining
won't work. But ...



> > I can understand that chaining syscalls requires variable sharing, but
> > the majority of the parameters passed to syscalls are just direct
> > ones. Maybe a smart method that allows you to know if a parameter is a
> > direct one or a pointer to one? An "unsigned int pmap" where bit N is
> > 1 if param N is an indirection? Hmm?
>
> adding such things tends to slow down atom parsing.

I really think it simplifies it. You simply *copy* the parameter (I'd say
that 70+% of cases fall into this category), and if the current "pmap" bit is
set, then you do all the indirection copy-from-userspace stuff.
It also simplifies userspace a lot, since you can now pass arrays and
structure pointers directly, w/out saving them in a temporary variable.




> > Sigh, I really dislike shared userspace/kernel stuff, when we're
> > transfering pointers to userspace. Did you actually bench it against
> > a:
> >
> > int async_wait(struct syslet_uatom **r, int n);
> >
> > I can fully understand sharing userspace buffers with the kernel, if
> > we're talking about KB transferd during a block or net I/O DMA
> > operation, but for transfering a pointer? Behind each pointer
> > transfer(4/8 bytes) there is a whole syscall execution, [...]
>
> there are three main reasons for this choice:
>
> - firstly, by putting completion events into the user-space ringbuffer
> the asynchronous contexts are not held up at all, and the threads are
> available for further syslet use.
>
> - secondly, it was the most obvious and simplest solution to me - it
> just fits well into the syslet model - which is an execution concept
> centered around pure user-space memory and system calls, not some
> kernel resource. Kernel fills in the ringbuffer, user-space clears it.
> If we had to worry about a handshake between user-space and
> kernel-space for the completion information to be passed along, that
> would either mean extra buffering or extra overhead. Extra buffering
> (in the kernel) would be for no good reason: why not buffer it in the
> place where the information is destined for in the first place. The
> ringbuffer of /pointers/ is what makes this really powerful. I never
> really liked the AIO/etc. method /event buffer/ rings. With syslets
> the 'cookie' is the pointer to the syslet atom itself. It doesnt get
> any more straightforward than that i believe.
>
> - making 'is there more stuff for me to work on' a simple instruction in
> user-space makes it a no-brainer for user-space to promptly and
> without thinking complete events. It's also the right thing to do on
> SMP: if one core is solely dedicated to the asynchronous workload,
> only running in kernel mode, and the other core is only running
> user-space, why ever switch between protection domains? [except if any
> of them is idle] The fastest completion signalling method is the
> /memory bus/, not an interrupt. User-space could in theory even use
> MWAIT (in user-space!) to wait for the other core to complete stuff.
> That makes for a hell of a fast wakeup.
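
In user-space, the pointer-ring completion model described above amounts
to a polling loop along these lines (a minimal sketch: the one-page ring
size, the NULL-means-free slot convention and the handle_completion()
callback are assumptions drawn from the description, not the actual
syslet ABI):

---
#include <stddef.h>

#define RING_SLOTS 1024                 /* one 4K page of 32-bit pointers */

struct syslet_uatom;                    /* opaque here */

extern void handle_completion(struct syslet_uatom *done);  /* user-defined */

static struct syslet_uatom *ring[RING_SLOTS];

/*
 * Consume every completion currently visible in the ring: the kernel
 * writes atom pointers into free (NULL) slots, user-space clears a slot
 * back to NULL once the completion has been handled.
 */
static unsigned int reap_completions(unsigned int tail)
{
        while (ring[tail] != NULL) {
                struct syslet_uatom *done = ring[tail];

                handle_completion(done);
                ring[tail] = NULL;              /* hand the slot back */
                tail = (tail + 1) % RING_SLOTS;
        }
        return tail;
}
---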

That also makes for a hell of an ugly retrieval API IMO ;)
If it were backed up by considerable performance gains, then it might be OK.
But I believe that won't be the case, and that leaves us with an ugly API.
OTOH, if no one else objects to this, it means that I'm the only weirdo :) and
the API is just fine.




- Davide


2007-02-13 23:24:46

by Davide Libenzi

[permalink] [raw]
Subject: Re: [patch 00/11] ANNOUNCE: "Syslets", generic asynchronous system call support

On Tue, 13 Feb 2007, Ingo Molnar wrote:

>
> * Davide Libenzi <[email protected]> wrote:
>
> > > Open issues:
>
> > If this is going to be a generic AIO subsystem:
> >
> > - Cancellation of pending request
>
> How about implementing aio_cancel() as a NOP. Can anyone prove that the
> kernel didnt actually attempt to cancel that IO? [but unfortunately
> failed at doing so, because the platters were being written already.]
>
> really, what's the point behind aio_cancel()?

You need cancel. If you scheduled an async syscall, and the "session"
linked with that chain is going away, you had better have it canceled before
cleaning up the buffers that the chain is going to read from/write to.
If you keep a hash or a tree indexed by atom-ptr, then it becomes a matter
of a lookup and sending a signal.



- Davide


2007-02-13 23:28:07

by Davide Libenzi

[permalink] [raw]
Subject: Re: [patch 00/11] ANNOUNCE: "Syslets", generic asynchronous system call support

On Tue, 13 Feb 2007, Ingo Molnar wrote:

> * Davide Libenzi <[email protected]> wrote:
>
> > > If this is going to be a generic AIO subsystem:
> > >
> > > - Cancellation of peding request
> >
> > What about the busy_async_threads list becoming a hash/rb_tree indexed
> > by syslet_atom ptr. A cancel would lookup the thread and send a signal
> > (of course, signal handling of the async threads should be set
> > properly)?
>
> well, each async syslet has a separate TID at the moment, so if we want
> a submitted syslet to be cancellable then we could return the TID of the
> syslet handler (instead of the NULL) in sys_async_exec(). Then
> user-space could send a signal the old-fashioned way, via sys_tkill(),
> if it so wishes.

That works too. I was thinking about identifying syslets with the
userspace ptr, but the TID is fine too.
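
As a rough sketch of that cancellation scheme (sys_async_exec returning
the handler TID is only a proposal at this point, and __NR_async_exec
below is a placeholder, not a real syscall number; tkill() itself is the
existing syscall):

---
#include <signal.h>
#include <unistd.h>
#include <sys/syscall.h>

struct syslet_uatom;                /* as defined by the syslet headers */

#ifndef __NR_async_exec
#define __NR_async_exec 0           /* placeholder only */
#endif

/* submit an atom; under the convention discussed above, a nonzero
   return would be the TID of the async thread running the syslet   */
static long submit_syslet(struct syslet_uatom *atom)
{
        return syscall(__NR_async_exec, atom);
}

/* "cancel" = interrupt that handler thread the old-fashioned way */
static int cancel_syslet(pid_t handler_tid)
{
        return syscall(SYS_tkill, handler_tid, SIGINT);
}
---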



> the TID could also be used in a sys_async_wait_on() API. I.e. it would
> be a natural, readily accessible 'cookie' for the pending work. TIDs can
> be looked up lockless via RCU, so it's reasonably fast as well.
>
> ( Note that there's already a way to 'signal' pending syslets: do_exit()
> in the user context will signal all async contexts (which results in
> -EINTR of currently executing syscalls, wherever possible) and will
> tear them down. But that's too crude for aio_cancel() i guess. )

Yup.



- Davide


2007-02-14 00:18:33

by Davide Libenzi

[permalink] [raw]
Subject: Re: [patch 06/11] syslets: core, documentation

On Tue, 13 Feb 2007, Davide Libenzi wrote:

> > > I can understand that chaining syscalls requires variable sharing, but
> > > the majority of the parameters passed to syscalls are just direct
> > > ones. Maybe a smart method that allows you to know if a parameter is a
> > > direct one or a pointer to one? An "unsigned int pmap" where bit N is
> > > 1 if param N is an indirection? Hmm?
> >
> > adding such things tends to slow down atom parsing.
>
> I really think it simplifies it. You simply *copy* the parameter (I'd say
> that 70+% of cases falls inside here), and if the current "pmap" bit is
> set, then you do all the indirection copy-from-userspace stuff.
> It also simplify userspace a lot, since you can now pass arrays and
> structure pointers directly, w/out saving them in a temporary variable.

Very rough sketch below ...


---
struct syslet_uatom {
        unsigned long flags;
        unsigned int nr;
        unsigned short nparams;
        unsigned short pmap;
        long __user *ret_ptr;
        struct syslet_uatom __user *next;
        unsigned long __user args[6];
        void __user *private;
};

long copy_uatom(struct syslet_atom *atom, struct syslet_uatom __user *uatom)
{
        unsigned short i, pmap;
        unsigned long __user *arg_ptr;
        long ret = 0;

        if (!access_ok(VERIFY_WRITE, uatom, sizeof(*uatom)))
                return -EFAULT;

        ret = __get_user(atom->nr, &uatom->nr);
        ret |= __get_user(atom->nparams, &uatom->nparams);
        ret |= __get_user(pmap, &uatom->pmap);
        ret |= __get_user(atom->ret_ptr, &uatom->ret_ptr);
        ret |= __get_user(atom->flags, &uatom->flags);
        ret |= __get_user(atom->next, &uatom->next);
        if (unlikely(atom->nparams > 6))
                return -EINVAL;
        for (i = 0; i < atom->nparams; i++, pmap >>= 1) {
                /* fetch the (direct) parameter value from user-space: */
                ret |= __get_user(atom->args[i], &uatom->args[i]);
                /*
                 * If the pmap bit is set, the value is a pointer to the
                 * real argument - dereference it with one more fetch:
                 */
                if (unlikely(pmap & 1)) {
                        arg_ptr = (unsigned long __user *) atom->args[i];
                        if (!access_ok(VERIFY_WRITE, arg_ptr, sizeof(*arg_ptr)))
                                return -EFAULT;
                        ret |= __get_user(atom->args[i], arg_ptr);
                }
        }

        return ret;
}

void init_uatom(struct syslet_uatom *ua, unsigned long flags, unsigned int nr,
                long *ret_ptr, struct syslet_uatom *next, void *private,
                int nparams, ...)
{
        int i, mode;
        va_list args;

        ua->flags = flags;
        ua->nr = nr;
        ua->ret_ptr = ret_ptr;
        ua->next = next;
        ua->private = private;
        ua->nparams = nparams;
        ua->pmap = 0;
        va_start(args, nparams);
        for (i = 0; i < nparams; i++) {
                mode = va_arg(args, int);
                ua->args[i] = va_arg(args, unsigned long);
                if (mode == UASYNC_INDIR)
                        ua->pmap |= 1 << i;
        }
        va_end(args);
}


#define UASYNC_IMM 0
#define UASYNC_INDIR 1
#define UAPD(a) UASYNC_IMM, (unsigned long) (a)
#define UAPI(a) UASYNC_INDIR, (unsigned long) (a)


void foo(void)
{
        int fd;
        long res;
        struct stat stb;
        struct syslet_uatom ua;

        init_uatom(&ua, 0, __NR_fstat, &res, NULL, NULL, 2,
                   UAPI(&fd), UAPD(&stb));
        ...
}
---



- Davide


2007-02-14 03:29:00

by Davide Libenzi

[permalink] [raw]
Subject: Re: [patch 00/11] ANNOUNCE: "Syslets", generic asynchronous system call support

On Tue, 13 Feb 2007, Ingo Molnar wrote:

> I'm pleased to announce the first release of the "Syslet" kernel feature
> and kernel subsystem, which provides generic asynchrous system call
> support:
> [...]

Ok, I had little time to review the code, and it has been a long
working day, so bear with me if I missed something.
I don't see how sys_async_exec would not block, based on your patches.
Let's try to follow:

- We enter sys_async_exec

- We may fill the pool, but that's nothing interesting ATM. A bunch of
threads will be created, and they'll end up sleeping inside the
cachemiss_loop

- We set the async_ready pointer and we fall inside exec_atom

- There we copy the atom (nothing interesting from a scheduling POV) and
we fall inside __exec_atom

- In __exec_atom we do the actual syscall call. Note that we're still the
task/thread that called sys_async_exec

- So we enter the syscall, and now we end up in schedule because we're
just unlucky

- We notice that the async_ready pointer is not NULL, and we call
__async_schedule

- Finally we're in pick_new_async_thread and we pick one of the ready
threads sleeping in cachemiss_loop

- We copy the pt_regs to the newly picked-up thread, we set its async head
pointer, we set the current task async_ready pointer to NULL, we
re-initialize the async_thread structure (the old async_ready), and we
put ourselves in the busy_list

- Then we roll back to the schedule that started everything, and being
still "prev" for the scheduler, we go to sleep

So the sys_async_exec task is going to block. Now, am I being really
tired, or is the cachemiss fast return simply not there?
There's another problem AFAICS:

- We woke up one of the cachemiss_loop threads in pick_new_async_thread

- The thread wakes up, marks itself as busy, and looks at the ->work
pointer hoping to find something to work on

But we never set that pointer to a userspace atom AFAICS. Me blind? :)




- Davide


2007-02-14 04:42:41

by Willy Tarreau

[permalink] [raw]
Subject: Re: [patch 00/11] ANNOUNCE: "Syslets", generic asynchronous system call support

Hi Ingo !

On Tue, Feb 13, 2007 at 03:20:10PM +0100, Ingo Molnar wrote:
> I'm pleased to announce the first release of the "Syslet" kernel feature
> and kernel subsystem, which provides generic asynchrous system call
> support:
>
> http://redhat.com/~mingo/syslet-patches/
>
> Syslets are small, simple, lightweight programs (consisting of
> system-calls, 'atoms') that the kernel can execute autonomously (and,
> not the least, asynchronously), without having to exit back into
> user-space. Syslets can be freely constructed and submitted by any
> unprivileged user-space context - and they have access to all the
> resources (and only those resources) that the original context has
> access to.

I like this a lot. I've always felt frustrated by the wasted time in
setsockopt() calls after accept() or before connect(), or in multiple
calls to epoll_ctl(). It might also be useful as an efficient readv()
emulation using recv(), etc...

Nice work !
Willy

2007-02-14 04:49:27

by Davide Libenzi

[permalink] [raw]
Subject: Re: [patch 00/11] ANNOUNCE: "Syslets", generic asynchronous system call support

On Tue, 13 Feb 2007, Davide Libenzi wrote:

[...]

> So the sys_async_exec task is going to block. Now, am I being really
> tired, or the cachemiss fast return is simply not there?

The former 8)

pick_new_async_head()
        new_task->ah = ah;

cachemiss_loop()
        for (;;) {
                if (unlikely(t->ah || ...))
                        break;


> There's another problem AFAICS:
>
> - We woke up one of the cachemiss_loop threads in pick_new_async_thread
>
> - The threads wakes up, mark itself as busy, and look at the ->work
> pointer hoping to find something to work on
>
> But we never set that pointer to a userspace atom AFAICS. Me blind? :)

I still don't see at->work ever set to a valid userspace atom though...



- Davide


2007-02-14 08:30:24

by Ingo Molnar

[permalink] [raw]
Subject: Re: [patch 00/11] ANNOUNCE: "Syslets", generic asynchronous system call support


* Davide Libenzi <[email protected]> wrote:

> > There's another problem AFAICS:
> >
> > - We woke up one of the cachemiss_loop threads in pick_new_async_thread
> >
> > - The threads wakes up, mark itself as busy, and look at the ->work
> > pointer hoping to find something to work on
> >
> > But we never set that pointer to a userspace atom AFAICS. Me blind? :)
>
> I still don't see at->work ever set to a valid userspace atom
> though...

yeah - i havent added 'submit syslet from within a syslet' support
yet :-)

note that current normal syslet operation (both async and sync alike)
does not need at->work at all. When we cachemiss then the new head task
just wants to return a NULL pointer to user-space, to signal that work
is continuing in the background. A ready 'cachemiss' thread is really
not there to do cachemisses, it is a 'new head task in waiting'. The
name comes from Tux and i guess it's time for a rename :)

but i do plan a SYSLET_ASYNC_CONTINUE flag, roughly along the lines of
the patch i've attached below: this would skip to the linearly next
syslet and would let the original syslet execute in the background. I
have not fully thought this through though (let alone tested it ;) - can
you see any hole in this approach? This would in essence allow the
following construct:

syslet1 &
syslet2 &
syslet3 &
syslet4 &

submitted in parallel, straight to cachemiss threads, from a syslet
itself.
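
In terms of atoms, such a parallel submission would presumably look like
the sketch below (the struct is a trimmed-down stand-in for the
syslet_uatom of the posted patches, and the flag values match the patch
attached further down; making the last atom plain SYSLET_ASYNC is a
guess, since the flag's exact semantics are still open):

---
/* trimmed-down stand-in for struct syslet_uatom; only the fields this
   sketch touches, with guessed (user-space view) types                */
struct syslet_uatom {
        unsigned long           flags;
        unsigned int            nr;
        long                    *ret_ptr;
        struct syslet_uatom     *next;
        unsigned long           *arg_ptr[6];
        void                    *private;
};

#define SYSLET_ASYNC            0x00000001
#define SYSLET_ASYNC_CONTINUE   0x00000002

/*
 * Four atoms laid out linearly in user memory: the first three carry
 * SYSLET_ASYNC_CONTINUE, so each gets queued to a cachemiss thread
 * while execution continues with the linearly next atom; the last one
 * is a plain async submission since nothing follows it.
 */
static struct syslet_uatom vec[4];

static void setup_parallel_vector(void)
{
        int i;

        for (i = 0; i < 4; i++) {
                /* vec[i].nr and vec[i].arg_ptr[] would be filled in here */
                vec[i].next  = NULL;    /* each atom is its own chain */
                vec[i].flags = SYSLET_ASYNC_CONTINUE;
        }
        vec[3].flags = SYSLET_ASYNC;
}
---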

there's yet another work submission variant that makes sense to do, a
true syslet vector submission: to do a loop over syslet atoms in
sys_async_exec(). That would have the added advantage of enabling
caching. If one vector component generates a cachemiss then the head
would continue with the next vector component. (this too needs at->work-alike
communication between the ex-head and the new head)

maybe the latter would be the cleaner approach - SYSLET_ASYNC_CONTINUE
has no effect in cachemiss context, so it only makes sense if the
submitted syslet is a pure vector of parallel atoms. Alternatively,
SYSLET_ASYNC_CONTINUE would have to be made to work from cachemiss contexts
too. (because that makes sense too, to start new async execution from
another async context.)

another not yet clear area is when there's no cachemiss thread
available. Right now SYSLET_ASYNC_CONTINUE will just fail - which makes
it nondeterministic.

Ingo

---
include/linux/async.h | 13 +++++++++++--
include/linux/sched.h | 3 +--
include/linux/syslet.h | 20 +++++++++++++-------
kernel/async.c | 43 +++++++++++++++++++++++++++++--------------
kernel/sched.c | 2 +-
5 files changed, 56 insertions(+), 27 deletions(-)

# *DOCUMENTATION*
Index: linux/include/linux/async.h
===================================================================
--- linux.orig/include/linux/async.h
+++ linux/include/linux/async.h
@@ -1,15 +1,23 @@
#ifndef _LINUX_ASYNC_H
#define _LINUX_ASYNC_H
+
+#include <linux/compiler.h>
+
/*
* The syslet subsystem - asynchronous syscall execution support.
*
* Generic kernel API definitions:
*/

+struct syslet_uatom;
+struct async_thread;
+struct async_head;
+
#ifdef CONFIG_ASYNC_SUPPORT
extern void async_init(struct task_struct *t);
extern void async_exit(struct task_struct *t);
-extern void __async_schedule(struct task_struct *t);
+extern void
+__async_schedule(struct task_struct *t, struct syslet_uatom __user *next_uatom);
#else /* !CONFIG_ASYNC_SUPPORT */
static inline void async_init(struct task_struct *t)
{
@@ -17,7 +25,8 @@ static inline void async_init(struct tas
static inline void async_exit(struct task_struct *t)
{
}
-static inline void __async_schedule(struct task_struct *t)
+static inline void
+__async_schedule(struct task_struct *t, struct syslet_uatom __user *next_uatom)
{
}
#endif /* !CONFIG_ASYNC_SUPPORT */
Index: linux/include/linux/sched.h
===================================================================
--- linux.orig/include/linux/sched.h
+++ linux/include/linux/sched.h
@@ -83,13 +83,12 @@ struct sched_param {
#include <linux/timer.h>
#include <linux/hrtimer.h>
#include <linux/task_io_accounting.h>
+#include <linux/async.h>

#include <asm/processor.h>

struct exec_domain;
struct futex_pi_state;
-struct async_thread;
-struct async_head;
/*
* List of flags we want to share for kernel threads,
* if only because they are not used by them anyway.
Index: linux/include/linux/syslet.h
===================================================================
--- linux.orig/include/linux/syslet.h
+++ linux/include/linux/syslet.h
@@ -56,10 +56,16 @@ struct syslet_uatom {
#define SYSLET_ASYNC 0x00000001

/*
+ * Queue this syslet asynchronously and continue executing the
+ * next linear atom:
+ */
+#define SYSLET_ASYNC_CONTINUE 0x00000002
+
+/*
* Never queue this syslet asynchronously - even if synchronous
* execution causes a context-switching:
*/
-#define SYSLET_SYNC 0x00000002
+#define SYSLET_SYNC 0x00000004

/*
* Do not queue the syslet in the completion ring when done.
@@ -70,7 +76,7 @@ struct syslet_uatom {
* Some syscalls generate implicit completion events of their
* own.
*/
-#define SYSLET_NO_COMPLETE 0x00000004
+#define SYSLET_NO_COMPLETE 0x00000008

/*
* Execution control: conditions upon the return code
@@ -78,10 +84,10 @@ struct syslet_uatom {
* execution is stopped and the atom is put into the
* completion ring:
*/
-#define SYSLET_STOP_ON_NONZERO 0x00000008
-#define SYSLET_STOP_ON_ZERO 0x00000010
-#define SYSLET_STOP_ON_NEGATIVE 0x00000020
-#define SYSLET_STOP_ON_NON_POSITIVE 0x00000040
+#define SYSLET_STOP_ON_NONZERO 0x00000010
+#define SYSLET_STOP_ON_ZERO 0x00000020
+#define SYSLET_STOP_ON_NEGATIVE 0x00000040
+#define SYSLET_STOP_ON_NON_POSITIVE 0x00000080

#define SYSLET_STOP_MASK \
( SYSLET_STOP_ON_NONZERO | \
@@ -97,7 +103,7 @@ struct syslet_uatom {
*
* This is what allows true branches of execution within syslets.
*/
-#define SYSLET_SKIP_TO_NEXT_ON_STOP 0x00000080
+#define SYSLET_SKIP_TO_NEXT_ON_STOP 0x00000100

/*
* This is the (per-user-context) descriptor of the async completion
Index: linux/kernel/async.c
===================================================================
--- linux.orig/kernel/async.c
+++ linux/kernel/async.c
@@ -75,13 +75,14 @@ mark_async_thread_busy(struct async_thre

static void
__async_thread_init(struct task_struct *t, struct async_thread *at,
- struct async_head *ah)
+ struct async_head *ah,
+ struct syslet_uatom __user *work)
{
INIT_LIST_HEAD(&at->entry);
at->exit = 0;
at->task = t;
at->ah = ah;
- at->work = NULL;
+ at->work = work;

t->at = at;
ah->nr_threads++;
@@ -92,7 +93,7 @@ async_thread_init(struct task_struct *t,
struct async_head *ah)
{
spin_lock(&ah->lock);
- __async_thread_init(t, at, ah);
+ __async_thread_init(t, at, ah, NULL);
__mark_async_thread_ready(at, ah);
spin_unlock(&ah->lock);
}
@@ -130,8 +131,10 @@ pick_ready_cachemiss_thread(struct async
return at;
}

-static void pick_new_async_head(struct async_head *ah,
- struct task_struct *t, struct pt_regs *old_regs)
+static void
+pick_new_async_head(struct async_head *ah, struct task_struct *t,
+ struct pt_regs *old_regs,
+ struct syslet_uatom __user *next_uatom)
{
struct async_thread *new_async_thread;
struct async_thread *async_ready;
@@ -158,28 +161,31 @@ static void pick_new_async_head(struct a

wake_up_process(new_task);

- __async_thread_init(t, async_ready, ah);
+ __async_thread_init(t, async_ready, ah, next_uatom);
__mark_async_thread_busy(t->at, ah);

out_unlock:
spin_unlock(&ah->lock);
}

-void __async_schedule(struct task_struct *t)
+void
+__async_schedule(struct task_struct *t, struct syslet_uatom __user *next_uatom)
{
struct async_head *ah = t->ah;
struct pt_regs *old_regs = task_pt_regs(t);

- pick_new_async_head(ah, t, old_regs);
+ pick_new_async_head(ah, t, old_regs, next_uatom);
}

-static void async_schedule(struct task_struct *t)
+static void
+async_schedule(struct task_struct *t, struct syslet_uatom __user *next_uatom)
{
if (t->async_ready)
- __async_schedule(t);
+ __async_schedule(t, next_uatom);
}

-static long __exec_atom(struct task_struct *t, struct syslet_atom *atom)
+static long __exec_atom(struct task_struct *t, struct syslet_atom *atom,
+ struct syslet_uatom __user *uatom)
{
struct async_thread *async_ready_save;
long ret;
@@ -189,8 +195,17 @@ static long __exec_atom(struct task_stru
* (try to) switch user-space to another thread straight
* away and execute the syscall asynchronously:
*/
- if (unlikely(atom->flags & SYSLET_ASYNC))
- async_schedule(t);
+ if (unlikely(atom->flags & (SYSLET_ASYNC | SYSLET_ASYNC_CONTINUE))) {
+ /*
+ * If this is a parallel (vectored) submission straight to
+ * a cachemiss context then the linearly next (uatom + 1)
+ * uatom will be executed by this context.
+ */
+ if (atom->flags & SYSLET_ASYNC_CONTINUE)
+ async_schedule(t, uatom + 1);
+ else
+ async_schedule(t, NULL);
+ }
/*
* Does user-space want synchronous execution for this atom?:
*/
@@ -432,7 +447,7 @@ exec_atom(struct async_head *ah, struct
return ERR_PTR(-EFAULT);

last_uatom = uatom;
- ret = __exec_atom(t, &atom);
+ ret = __exec_atom(t, &atom, uatom);
if (unlikely(signal_pending(t) || need_resched()))
goto stop;

Index: linux/kernel/sched.c
===================================================================
--- linux.orig/kernel/sched.c
+++ linux/kernel/sched.c
@@ -3442,7 +3442,7 @@ asmlinkage void __sched schedule(void)
if (prev->state && !(preempt_count() & PREEMPT_ACTIVE) &&
(!(prev->state & TASK_INTERRUPTIBLE) ||
!signal_pending(prev)))
- __async_schedule(prev);
+ __async_schedule(prev, NULL);
}

need_resched:

2007-02-14 09:04:41

by Evgeniy Polyakov

[permalink] [raw]
Subject: Re: [patch 00/11] ANNOUNCE: "Syslets", generic asynchronous system call support

On Tue, Feb 13, 2007 at 11:18:10PM +0100, Ingo Molnar ([email protected]) wrote:
>
> * Evgeniy Polyakov <[email protected]> wrote:
>
> > [...] it still has a problem - syscall blocks and the same thread thus
> > is not allowed to continue execution and fill the pipe - so what if
> > system issues thousands of requests and there are only tens of working
> > thread at most. [...]
>
> the same thread is allowed to continue execution even if the system call
> blocks: take a look at async_schedule(). The blocked system-call is 'put
> aside' (in a sleeping thread), the kernel switches the user-space
> context (registers) to a free kernel thread and switches to it - and
> returns to user-space as if nothing happened - allowing the user-space
> context to 'fill the pipe' as much as it can. Or did i misunderstand
> your point?

Let me clarify what I meant.
There is only a limited number of threads which are supposed to execute
blocking context, so when they are all used, the main one will block too -
I asked about the possibility of reusing the same thread to execute a
queue of requests attached to it; each request can block, but once the
blocking issue is removed, it would be possible to return.

What I'm getting at is how the kevent IO state machine actually works
- each IO request is made not through the usual mpage and bio
allocations, but with special kevent ones, which do not wait for
completion; instead, in the destructor the request is either rescheduled
(if a big file is transferred, it is split into parts for transmission)
or committed as ready (thus it becomes possible to read the completion
through the kevent queue or ring). So there are only several threads,
each one does a small piece of work on each request, but the same
request can be rescheduled to it again and again (from the bio
destructor or an ->end_io callback, for example).

So I asked if it is possible to extend this state machine to work not
only with blocking syscalls but also with non-blocking functions, with
the possibility of rescheduling the same item again.
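
A hedged sketch of that requeue-on-completion flow (all names below -
kevent_request, requeue_to_worker(), complete_to_ring() - are
illustrative stand-ins, not the real kevent API):

---
/* illustrative request descriptor - not the real kevent structures */
struct kevent_request {
        long            pos;            /* next file offset to populate  */
        unsigned long   remaining;      /* bytes of the file still to go */
        /* fd, destination socket, completion ring slot, ... */
};

extern void requeue_to_worker(struct kevent_request *req);  /* hypothetical */
extern void complete_to_ring(struct kevent_request *req);   /* hypothetical */

/*
 * Called from the BIO completion path: no thread ever sleeps waiting
 * for the I/O, the request simply re-enters the queue for its next
 * chunk (possibly on a different worker thread) or is marked complete.
 */
static void chunk_done(struct kevent_request *req, unsigned long chunk)
{
        req->pos       += chunk;
        req->remaining -= chunk;

        if (req->remaining)
                requeue_to_worker(req);
        else
                complete_to_ring(req);
}
---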

--
Evgeniy Polyakov

2007-02-14 09:15:58

by Evgeniy Polyakov

[permalink] [raw]
Subject: Re: [patch 05/11] syslets: core code

On Tue, Feb 13, 2007 at 11:41:31PM +0100, Ingo Molnar ([email protected]) wrote:
> > Then limit it to a single page and use gup
>
> 1024 (512 on 64-bit) is alot but not ALOT. It is also certainly not
> ALOOOOT :-) Really, people will want to have more than 512
> disks/spindles in the same box. I have used such a beast myself. For Tux
> workloads and benchmarks we had parallelism levels of millions of
> pending requests (!) on a single system - networking, socket limits,
> disk IO combined with thousands of clients do create such scenarios. I
> really think that such 'pinned pages' are a pretty natural fit for
> sys_mlock() and RLIMIT_MEMLOCK, and since the kernel side is careful to
> use the _inatomic() uaccess methods, it's safe (and fast) as well.

This will end up badly - I used the same approach in the early kevent
days, and it was proven better to have swappable memory for the ring. I
think it would be much better to have a userspace-allocated ring and use
copy_to_user() there.

Btw, as a bit of advertisement, the whole completion part can be done
through kevent which already has ring buffer, queue operations and
non-racy updates... :)

> Ingo

--
Evgeniy Polyakov

2007-02-14 09:49:49

by Ingo Molnar

[permalink] [raw]
Subject: Re: [patch 05/11] syslets: core code


* Evgeniy Polyakov <[email protected]> wrote:

> This will end up badly - I used the same approach in the early kevent
> days and was proven to have swapable memory for the ring. I think it
> would be much better to have userspace allocated ring and use
> copy_to_user() there.

it is a userspace allocated ring - but pinned down by the kernel.

Ingo

2007-02-14 10:12:16

by Evgeniy Polyakov

[permalink] [raw]
Subject: Re: [patch 05/11] syslets: core code

On Wed, Feb 14, 2007 at 10:46:29AM +0100, Ingo Molnar ([email protected]) wrote:
>
> * Evgeniy Polyakov <[email protected]> wrote:
>
> > This will end up badly - I used the same approach in the early kevent
> > days and was proven to have swapable memory for the ring. I think it
> > would be much better to have userspace allocated ring and use
> > copy_to_user() there.
>
> it is a userspace allocated ring - but pinned down by the kernel.

That's a problem - 1000/512 pages per 'usual' thread ends up with the
whole memory locked by a malicious/stupid application (at least on Debian
and Mandrake there is no locked memory limit by default). And if such
a limit exists, this will hurt big-iron applications, which want to use
high-order rings legitimately.

> Ingo

--
Evgeniy Polyakov

2007-02-14 10:31:29

by Arjan van de Ven

[permalink] [raw]
Subject: Re: [patch 05/11] syslets: core code

> (at least on Debian
> and Mandrake there is no locked memory limit by default).

that sounds like 2 very large bugtraq-worthy bugs in these distros.. so
bad a bug that I almost find it hard to believe...

--
if you want to mail me at work (you don't), use arjan (at) linux.intel.com
Test the interaction between Linux and your BIOS via http://www.linuxfirmwarekit.org

2007-02-14 10:37:30

by Russell King

[permalink] [raw]
Subject: Re: [patch 06/11] syslets: core, documentation

On Tue, Feb 13, 2007 at 03:20:42PM +0100, Ingo Molnar wrote:
> +Arguments to the system call are implemented via pointers to arguments.
> +This not only increases the flexibility of syslet atoms (multiple syslets
> +can share the same variable for example), but is also an optimization:
> +copy_uatom() will only fetch syscall parameters up until the point it
> +meets the first NULL pointer. 50% of all syscalls have 2 or less
> +parameters (and 90% of all syscalls have 4 or less parameters).
> +
> + [ Note: since the argument array is at the end of the atom, and the
> + kernel will not touch any argument beyond the final NULL one, atoms
> + might be packed more tightly. (the only special case exception to
> + this rule would be SKIP_TO_NEXT_ON_STOP atoms, where the kernel will
> + jump a full syslet_uatom number of bytes.) ]

What if you need to increase the number of arguments passed to a system
call later? That would be an API change since the size of syslet_uatom
would change?

Also, what if you have an ABI such that:

sys_foo(int fd, long long a)

where:
arg[0] <= fd
arg[1] <= unused
arg[2] <= low 32-bits a
arg[3] <= high 32-bits a

it seems you need to point arg[1] to some valid but dummy variable.

How do you propose syslet users know about these kinds of ABI issues
(including the endian-ness of 64-bit arguments) ?
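
Spelled out against the pointer-per-argument atom layout, the concern
looks roughly like this (the slot assignment copies the mapping above;
the variable names and the low/high ordering are exactly the per-ABI
assumptions being questioned):

---
#include <stddef.h>

/* hypothetical packing of sys_foo(int fd, long long a) into the six
   pointer-to-argument slots of one atom, on a 32-bit ABI that passes
   64-bit values in an aligned register pair                          */
static unsigned long fd_arg;        /* the fd value                       */
static unsigned long pad;           /* arg[1]: unused, but it still has
                                       to point at something valid        */
static unsigned long a_lo, a_hi;    /* the split 64-bit value             */

static unsigned long *foo_args[6] = {
        &fd_arg,        /* arg[0] */
        &pad,           /* arg[1] */
        &a_lo,          /* arg[2]: low 32 bits - or is it? (endian/ABI)  */
        &a_hi,          /* arg[3]: high 32 bits                          */
        NULL,           /* copy_uatom() stops at the first NULL pointer  */
        NULL,
};
---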

--
Russell King
Linux kernel 2.6 ARM Linux - http://www.arm.linux.org.uk/
maintainer of:

2007-02-14 10:40:50

by Ingo Molnar

[permalink] [raw]
Subject: Re: [patch 00/11] ANNOUNCE: "Syslets", generic asynchronous system call support


* Evgeniy Polyakov <[email protected]> wrote:

> Let me clarify what I meant. There is only limited number of threads,
> which are supposed to execute blocking context, so when all they are
> used, main one will block too - I asked about possibility to reuse the
> same thread to execute queue of requests attached to it, each request
> can block, but if blocking issue is removed, it would be possible to
> return.

ah, ok, i understand your point. This is not quite possible: the
cachemisses are driven from schedule(), which can be arbitrarily deep
inside arbitrary system calls. It can be in a mutex_lock() deep inside a
driver. It can be due to an alloc_pages() call done by a kmalloc() call
done from within ext3, which was called from the loopback block driver,
which was called from XFS, which was called from a VFS syscall.

Even if it were possible to backtrack i'm quite sure we dont want to do
this, for three main reasons:

Firstly, backtracking and retrying always has a cost. We construct state
on the way in - and we destruct on the way out. The kernel stack we have
built up has a (nontrivial) construction cost and thus a construction
value - we should preserve that if possible.

Secondly, and this is equally important: i wanted the number of async
kernel threads to be the natural throttling mechanism. If user-space
wants to use less threads and overcommit the request queue then it can
be done in user-space: by over-queueing requests into a separate list,
and taking from that list upon completion and submitting it. User-space
has precise knowledge of overqueueing scenarios: if the event ring is
full then all async kernel threads are busy.
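
In user-space terms that over-queueing is something like the following
sketch (ring_is_full(), submit_syslet() and handle_completion() are
stand-ins for the proposed interface, not real calls):

---
struct syslet_uatom;                                /* opaque here */

extern int  ring_is_full(void);                     /* stand-in */
extern void submit_syslet(struct syslet_uatom *a);  /* stand-in */
extern void handle_completion(struct syslet_uatom *a);

#define MAX_OVERFLOW 4096

static struct syslet_uatom *overflow[MAX_OVERFLOW];
static unsigned int qhead, qtail;

/* submit directly while kernel async threads are free, otherwise park
   the atom in a local list - no kernel-side buffering needed           */
static void issue(struct syslet_uatom *atom)
{
        if (ring_is_full())
                overflow[qhead++ % MAX_OVERFLOW] = atom;
        else
                submit_syslet(atom);
}

/* for every completion reaped from the ring, drain one backlog entry */
static void on_completion(struct syslet_uatom *done)
{
        handle_completion(done);
        if (qtail != qhead)
                submit_syslet(overflow[qtail++ % MAX_OVERFLOW]);
}
---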

but note that there's a deeper reason as well for not wanting
over-queueing: the main cost of a 'pending request' is the kernel stack
of the blocked thread itself! So do we want to allow 'requests' to stay
'pending' even if there are "no more threads available"? Nope: because
letting them 'pend' would essentially (and implicitly) mean an increase
of the thread pool.

In other words: with the syslet subsystem, a kernel thread /is/ the
asynchronous request itself. So 'have more requests pending' means 'have
more kernel threads'. And 'no kernel thread available' must thus mean
'no queueing of this request'.

Thirdly, there is a performance advantage of this queueing property as
well: by letting a cachemiss thread only do a single syslet all work is
concentrated back to the 'head' task, and all queueing decisions are
immediately known by user-space and can be acted upon.

So the work-queueing setup is not symmetric at all, there's a
fundamental bias and tendency back towards the head task - this helps
caching too. That's what Tux did too - it always tried to queue back to
the 'head task' as soon as it could. Spreading out work dynamically and
transparently is necessary and nice, but it's useless if the system has
no automatic tendency to move back into single-threaded (fully cached)
state if the workload becomes less parallel. Without this fundamental
(and transparent) 'shrink parallelism' property syslets would only
degrade into yet another threading construct.

Ingo

2007-02-14 10:45:39

by Evgeniy Polyakov

[permalink] [raw]
Subject: Re: [patch 05/11] syslets: core code

On Wed, Feb 14, 2007 at 11:30:55AM +0100, Arjan van de Ven ([email protected]) wrote:
> > (at least on Debian
> > and Mandrake there is no locked memory limit by default).
>
> that sounds like 2 very large bugtraq-worthy bugs in these distros.. so
> bad a bug that I almost find it hard to believe...

Well:

$ ulimit -a
core file size (blocks, -c) 0
data seg size (kbytes, -d) unlimited
max nice (-e) 0
file size (blocks, -f) unlimited
pending signals (-i) unlimited
max locked memory (kbytes, -l) unlimited
max memory size (kbytes, -m) unlimited
open files (-n) 1024
pipe size (512 bytes, -p) 8
POSIX message queues (bytes, -q) unlimited
max rt priority (-r) 0
stack size (kbytes, -s) 8192
cpu time (seconds, -t) unlimited
max user processes (-u) unlimited
virtual memory (kbytes, -v) unlimited
file locks (-x) unlimited
$ cat /etc/debian_version
4.0

$ ulimit -a
core file size (blocks, -c) 0
data seg size (kbytes, -d) unlimited
file size (blocks, -f) unlimited
max locked memory (kbytes, -l) unlimited
max memory size (kbytes, -m) unlimited
open files (-n) 1024
pipe size (512 bytes, -p) 8
stack size (kbytes, -s) 8192
cpu time (seconds, -t) unlimited
max user processes (-u) 7168
virtual memory (kbytes, -v) unlimited
$ cat /etc/mandrake-release
Mandrake Linux release 10.0 (Community) for i586

Anyway, even if there is a limit, like the 32kb one in FC5,
I doubt any unprivileged userspace application
will ever run there.

--
Evgeniy Polyakov

2007-02-14 10:53:42

by Ingo Molnar

[permalink] [raw]
Subject: Re: [patch 06/11] syslets: core, documentation


* Russell King <[email protected]> wrote:

> On Tue, Feb 13, 2007 at 03:20:42PM +0100, Ingo Molnar wrote:
> > +Arguments to the system call are implemented via pointers to arguments.
> > +This not only increases the flexibility of syslet atoms (multiple syslets
> > +can share the same variable for example), but is also an optimization:
> > +copy_uatom() will only fetch syscall parameters up until the point it
> > +meets the first NULL pointer. 50% of all syscalls have 2 or less
> > +parameters (and 90% of all syscalls have 4 or less parameters).
> > +
> > + [ Note: since the argument array is at the end of the atom, and the
> > + kernel will not touch any argument beyond the final NULL one, atoms
> > + might be packed more tightly. (the only special case exception to
> > + this rule would be SKIP_TO_NEXT_ON_STOP atoms, where the kernel will
> > + jump a full syslet_uatom number of bytes.) ]
>
> What if you need to increase the number of arguments passed to a
> system call later? That would be an API change since the size of
> syslet_uatom would change?

the syslet_uatom has a constant size right now, and space for a maximum
of 6 arguments. /If/ the user knows that a specific atom (which for
example does a sys_close()) takes only 1 argument, it could shrink the
size of the atom down by 4 arguments.

[ i'd not actually recommend doing this, because it's generally a
volatile thing to play such tricks - i guess i shouldnt have written
that side-note in the header file :-) ]

there should be no new ABI issues: the existing syscall ABI never
changes, it's only extended. New syslets can rely on new properties of
new system calls. This is quite parallel to how glibc handles system
calls.

> How do you propose syslet users know about these kinds of ABI issues
> (including the endian-ness of 64-bit arguments) ?

syslet users would preferably be libraries like glibc - not applications
- i'm not sure the raw syslet interface should be exposed to
applications. Thus my current thinking is that syslets ought to be
per-arch structures - no need to pad them out to 64 bits on 32-bit
architectures - it's per-arch userspace that makes use of them anyway.
system call encodings are fundamentally per-arch anyway - every arch
does various fixups and has its own order of system calls.

but ... i'd not be against having a 'generic syscall layer' though, and
syslets might be a good starting point for that. But that would
necessiate a per-arch table of translating syscall numbers into this
'generic' numbering, at minimum - or a separate sys_async_call_table[].

Ingo

2007-02-14 10:55:37

by Alan

[permalink] [raw]
Subject: Re: [patch 00/11] ANNOUNCE: "Syslets", generic asynchronous system call support

> > Ooooohh. OpenVMS lives forever ;) Me likeee ;)
>
> hm, i dont know OpenVMS - but googled around a bit for 'VMS
> asynchronous' and it gave me this:

VMS had SYS$QIO which is asynchronous I/O queueing with completions of
sorts. You had to specifically remember if you wanted to do a
synchronous I/O.

Nothing afaik quite like a series of commands batched async, although VMS
has a call for everything else so it's possible there is one buried in the
back of volume 347 of the grey wall ;)

Looking at the completion side I'm not 100% sure we need async_wait given
the async batches can include futex operations...

Alan

2007-02-14 11:04:49

by Russell King

[permalink] [raw]
Subject: Re: [patch 06/11] syslets: core, documentation

On Wed, Feb 14, 2007 at 11:50:39AM +0100, Ingo Molnar wrote:
> * Russell King <[email protected]> wrote:
> > On Tue, Feb 13, 2007 at 03:20:42PM +0100, Ingo Molnar wrote:
> > > +Arguments to the system call are implemented via pointers to arguments.
> > > +This not only increases the flexibility of syslet atoms (multiple syslets
> > > +can share the same variable for example), but is also an optimization:
> > > +copy_uatom() will only fetch syscall parameters up until the point it
> > > +meets the first NULL pointer. 50% of all syscalls have 2 or less
> > > +parameters (and 90% of all syscalls have 4 or less parameters).
> > > +
> > > + [ Note: since the argument array is at the end of the atom, and the
> > > + kernel will not touch any argument beyond the final NULL one, atoms
> > > + might be packed more tightly. (the only special case exception to
> > > + this rule would be SKIP_TO_NEXT_ON_STOP atoms, where the kernel will
> > > + jump a full syslet_uatom number of bytes.) ]
> >
> > What if you need to increase the number of arguments passed to a
> > system call later? That would be an API change since the size of
> > syslet_uatom would change?
>
> the syslet_uatom has a constant size right now, and space for a maximum
> of 6 arguments. /If/ the user knows that a specific atom (which for
> example does a sys_close()) takes only 1 argument, it could shrink the
> size of the atom down by 4 arguments.
>
> [ i'd not actually recommend doing this, because it's generally a
> volatile thing to play such tricks - i guess i shouldnt have written
> that side-note in the header file :-) ]
>
> there should be no new ABI issues: the existing syscall ABI never
> changes, it's only extended. New syslets can rely on new properties of
> new system calls. This is quite parallel to how glibc handles system
> calls.

Let me spell it out, since you appear to have completely missed my point.

At the moment, SKIP_TO_NEXT_ON_STOP is specified to "jump a full
syslet_uatom number of bytes".

If we end up with a system call being added which requires more than
the currently allowed number of arguments (and it _has_ happened before)
then either those syscalls are not accessible to syslets, or you need
to increase the arg_ptr array.

That makes syslet_uatom larger.

If syslet_uatom is larger, SKIP_TO_NEXT_ON_STOP increments the syslet_uatom
pointer by a greater number of bytes.

If we're running a set of userspace syslets built for an older kernel on
such a newer kernel, that is an incompatible change which will break.

> > How do you propose syslet users know about these kinds of ABI issues
> > (including the endian-ness of 64-bit arguments) ?
>
> syslet users would preferably be libraries like glibc - not applications
> - i'm not sure the raw syslet interface should be exposed to
> applications. Thus my current thinking is that syslets ought to be
> per-arch structures - no need to pad them out to 64 bits on 32-bit
> architectures - it's per-arch userspace that makes use of them anyway.
> system call encodings are fundamentally per-arch anyway - every arch
> does various fixups and has its own order of system calls.
>
> but ... i'd not be against having a 'generic syscall layer' though, and
> syslets might be a good starting point for that. But that would
> necessiate a per-arch table of translating syscall numbers into this
> 'generic' numbering, at minimum - or a separate sys_async_call_table[].

Okay - I guess the userspace library approach is fine, but it needs
to be documented that applications which build syslets directly are
going to be non-portable.

--
Russell King
Linux kernel 2.6 ARM Linux - http://www.arm.linux.org.uk/
maintainer of:

2007-02-14 11:12:37

by Evgeniy Polyakov

[permalink] [raw]
Subject: Re: [patch 00/11] ANNOUNCE: "Syslets", generic asynchronous system call support

On Wed, Feb 14, 2007 at 11:37:31AM +0100, Ingo Molnar ([email protected]) wrote:
> > Let me clarify what I meant. There is only limited number of threads,
> > which are supposed to execute blocking context, so when all they are
> > used, main one will block too - I asked about possibility to reuse the
> > same thread to execute queue of requests attached to it, each request
> > can block, but if blocking issue is removed, it would be possible to
> > return.
>
> ah, ok, i understand your point. This is not quite possible: the
> cachemisses are driven from schedule(), which can be arbitraily deep
> inside arbitrary system calls. It can be in a mutex_lock() deep inside a
> driver. It can be due to a alloc_pages() call done by a kmalloc() call
> done from within ext3, which was called from the loopback block driver,
> which was called from XFS, which was called from a VFS syscall.

That's only because schedule() is the main point where
'rescheduling'/requeueing (a task switch, in other words) happens - but if
it were possible to bypass schedule()'s decision and not reschedule
there, but reschedule 'on demand' instead, would it be possible to reuse
the same syslet?

Let me show an example:
consider aio_sendfile() on a big file - it is not possible to pull it
fully into the VFS cache, but spinning on a per-page basis (like right
now) is not an optimal solution either. For kevent AIO I created a new
address space operation, aio_getpages(), which is essentially
mpage_readpages() - it populates several pages into the VFS cache in one
BIO (if possible, otherwise in the smallest possible number of chunks),
and then in the bio destruction callback (actually in the bio_endio
callback, but for this case it can be considered the same) I reschedule
the same request to some other thread (not necessarily the one that
started it). The processed data is then sent and the next chunk of the
file is populated into the VFS cache using aio_getpages(), whose BIO
callback will reschedule the same request again.

So it is possible with essentially one thread (or a limited number of
them) to fill the whole IO pipe.

With the syslet approach this seems to be impossible, due to the fact
that the request is a whole sendfile. Even if one uses proper readahead
(fadvise) advice, there is no possibility of splitting the sendfile and
forming it as a set of essentially identical requests with different
start/offset/whatever parameters (well, for sendfile() exactly this is
possible - just set up several calls in one syslet from different offsets
and with different lengths and form a proper state machine out of them,
but for example TCP recv() will not match that scenario).

So my main question was about the possibility of reusing the syslet
state machine in kevent AIO instead of its own (although the kevent one
currently lacks only one good feature of the syslet threads - its set of
threads is global, not per-task, which does not allow it to scale well
with the number of different processes doing IO), so as not to duplicate
the code if it ever becomes possible for kevent to get in.

--
Evgeniy Polyakov

2007-02-14 12:40:05

by Pavel Machek

[permalink] [raw]
Subject: Re: [patch 00/11] ANNOUNCE: "Syslets", generic asynchronous system call support

Hi!

> The boring details:
>
> Syslets consist of 'syslet atoms', where each atom represents a single
> system-call. These atoms can be chained to each other: serially, in
> branches or in loops. The return value of an executed atom is checked
> against the condition flags. So an atom can specify 'exit on nonzero' or
> 'loop until non-negative' kind of constructs.

Ouch, yet another interpreter in the kernel :-(. Can we reuse acpi or
something?

Pavel
--
(english) http://www.livejournal.com/~pavelmachek
(cesky, pictures) http://atrey.karlin.mff.cuni.cz/~pavel/picture/horses/blog.html

2007-02-14 12:43:21

by Guillaume Chazarain

[permalink] [raw]
Subject: Re: [patch 05/11] syslets: core code

Ingo Molnar a écrit :
> + if (unlikely(signal_pending(t) || need_resched()))
> + goto stop;
>

So, this is how you'll prevent me from running an infinite loop ;-)
The attached patch adds a cond_resched() instead, to allow infinite
loops without DoS. I dropped the unlikely() as it's already in the
definition of signal_pending().

> +asmlinkage long sys_async_wait(unsigned long min_wait_events)
>

Here I would expect:

sys_async_wait_for_all(struct syslet_atom *atoms, long nr_atoms)

and

sys_async_wait_for_any(struct syslet_atom *atoms, long nr_atoms).

This way syslets can be used by different parts of a program without
having them waiting for each other.
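
For example, with per-group waiting two independent parts of a program
could each wait on just their own submissions (the wrappers below are
hypothetical, following the signatures proposed above; the arrays of 16
atoms are assumed to be set up elsewhere):

---
struct syslet_atom;                     /* as in the proposed signatures */

/* hypothetical user-space wrappers around the two proposed syscalls */
extern long async_wait_for_all(struct syslet_atom *atoms, long nr_atoms);
extern long async_wait_for_any(struct syslet_atom *atoms, long nr_atoms);

/* each subsystem owns its own submission array (set up elsewhere) */
extern struct syslet_atom *net_atoms;   /* networking submissions */
extern struct syslet_atom *disk_atoms;  /* storage submissions    */

/* the network code waits only on its own atoms ... */
static void network_poll(void)
{
        async_wait_for_any(net_atoms, 16);
}

/* ... and the storage code flushes only its own, so neither part of
   the program is held up by the other's pending syslets             */
static void storage_flush(void)
{
        async_wait_for_all(disk_atoms, 16);
}
---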

Thanks.

--
Guillaume


Attachments:
cond_resched.diff (321.00 B)

2007-02-14 13:17:20

by Stephen Rothwell

[permalink] [raw]
Subject: Re: [patch 05/11] syslets: core code

Hi Ingo,

On Tue, 13 Feb 2007 15:20:35 +0100 Ingo Molnar <[email protected]> wrote:
>
> From: Ingo Molnar <[email protected]>
>
> the core syslet / async system calls infrastructure code.

It occurred to me that the 32-bit compat code for 64-bit architectures
for all this could be very hairy ...

--
Cheers,
Stephen Rothwell [email protected]
http://www.canb.auug.org.au/~sfr/



2007-02-14 17:16:10

by Linus Torvalds

[permalink] [raw]
Subject: Re: [patch 00/11] ANNOUNCE: "Syslets", generic asynchronous system call support



On Wed, 14 Feb 2007, Pavel Machek wrote:
>
> Ouch, yet another interpretter in kernel :-(. Can we reuse acpi or
> something?

Hah. You make the joke! I get it!

Mwahahahaa!

Linus

2007-02-14 17:17:59

by Davide Libenzi

[permalink] [raw]
Subject: Re: [patch 00/11] ANNOUNCE: "Syslets", generic asynchronous system call support

On Wed, 14 Feb 2007, Ingo Molnar wrote:

>
> * Evgeniy Polyakov <[email protected]> wrote:
>
> > Let me clarify what I meant. There is only limited number of threads,
> > which are supposed to execute blocking context, so when all they are
> > used, main one will block too - I asked about possibility to reuse the
> > same thread to execute queue of requests attached to it, each request
> > can block, but if blocking issue is removed, it would be possible to
> > return.
>
> ah, ok, i understand your point. This is not quite possible: the
> cachemisses are driven from schedule(), which can be arbitraily deep
> inside arbitrary system calls. It can be in a mutex_lock() deep inside a
> driver. It can be due to a alloc_pages() call done by a kmalloc() call
> done from within ext3, which was called from the loopback block driver,
> which was called from XFS, which was called from a VFS syscall.
>
> Even if it were possible to backtrack i'm quite sure we dont want to do
> this, for three main reasons:

IMO it'd be quite simple. We detect the service-threads-full condition
*before* entering exec_atom, and we queue the atom in an async_head request
list. Yes, there is the chance that between the test time in sys_async_exec
and the time we end up entering exec_atom and going down to schedule, one
of the threads becomes free, but IMO that's better than blocking
sys_async_exec.



- Davide


2007-02-14 17:52:26

by Davide Libenzi

[permalink] [raw]
Subject: Re: [patch 06/11] syslets: core, documentation

On Wed, 14 Feb 2007, Russell King wrote:

> Let me spell it out, since you appear to have completely missed my point.
>
> At the moment, SKIP_TO_NEXT_ON_STOP is specified to jump a "jump a full
> syslet_uatom number of bytes".
>
> If we end up with a system call being added which requires more than
> the currently allowed number of arguments (and it _has_ happened before)
> then either those syscalls are not accessible to syslets, or you need
> to increase the arg_ptr array.

I was thinking about this yesterday, since I honestly thought that this
whole chaining, and conditions, and parameter lists, and arguments passed
by pointer, etc... was in the end a little clumsy IMO.
Wouldn't a syslet look better like:

long syslet(void *ctx) {
        struct sctx *c = ctx;

        if (open(c->file, ...) == -1)
                return -1;
        read();
        send();
        blah();
        ...
        return 0;
}

That is, instead of passing a chain of atoms, with the kernel
interpreting conditions, and parameter lists, etc..., we let gcc
do this stuff for us, and we pass the "clet" :) pointer to sys_async_exec,
which executes the above under the same schedule-trapped environment, but
in userspace. We set up a special userspace ad-hoc frame (ala signals),
and we trap the task's schedule attempt underneath in the same way we do
now. We set up the frame and when we return from sys_async_exec, we
basically enter the "clet", which will return to a ret_from_async, which
will return to userspace. Or, maybe we can support both: a simple
single-syscall exec the way we do now, and a clet way for the ones that
require chains and conditions. Hmmm?



- Davide


2007-02-14 18:03:52

by Benjamin LaHaise

[permalink] [raw]
Subject: Re: [patch 06/11] syslets: core, documentation

On Wed, Feb 14, 2007 at 09:52:20AM -0800, Davide Libenzi wrote:
> That'd be, instead of passing a chain of atoms, with the kernel
> interpreting conditions, and parameter lists, etc..., we let gcc
> do this stuff for us, and we pass the "clet" :) pointer to sys_async_exec,
> that exec the above under the same schedule-trapped environment, but in
> userspace. We setup a special userspace ad-hoc frame (ala signal), and we
> trap underneath task schedule attempt in the same way we do now.
> We setup the frame and when we return from sys_async_exec, we basically
> enter the "clet", that will return to a ret_from_async, that will return
> to userspace. Or, maybe we can support both. A simple single-syscall exec
> in the way we do now, and a clet way for the ones that requires chains and
> conditions. Hmmm?

Which is just the same as using threads. My argument is that once you
look at all the details involved, what you end up arriving at is the
creation of threads. Threads are relatively cheap, it's just that the
hardware currently has several performance bugs with them on x86 (and more
on x86-64 with the MSR fiddling that hits the hot path). Architectures
like powerpc are not going to benefit anywhere near as much from this
exercise, as the state involved is processed much more sanely. IA64 as
usual is simply doomed by way of having too many registers to switch.

If people really want to go down this path, please make an effort to compare
threads on a properly tuned platform. This means that things like the kernel
and userland stacks must take into account the cache alignment (we do some
of this already, but there are some very definite L1 cache colour collisions
between commonly hit data structures amongst threads). The existing AIO
ringbuffer suffers from this, as important data is always on the beginning
of the first page. Yes, these might be microoptimizations, but accumulated
changes of this nature have been known to buy 100%+ improvements in
performance.

-ben
--
"Time is of no importance, Mr. President, only life is important."
Don't Email: <[email protected]>.

2007-02-14 19:45:27

by Davide Libenzi

[permalink] [raw]
Subject: Re: [patch 06/11] syslets: core, documentation

On Wed, 14 Feb 2007, Benjamin LaHaise wrote:

> On Wed, Feb 14, 2007 at 09:52:20AM -0800, Davide Libenzi wrote:
> > That'd be, instead of passing a chain of atoms, with the kernel
> > interpreting conditions, and parameter lists, etc..., we let gcc
> > do this stuff for us, and we pass the "clet" :) pointer to sys_async_exec,
> > that exec the above under the same schedule-trapped environment, but in
> > userspace. We setup a special userspace ad-hoc frame (ala signal), and we
> > trap underneath task schedule attempt in the same way we do now.
> > We setup the frame and when we return from sys_async_exec, we basically
> > enter the "clet", that will return to a ret_from_async, that will return
> > to userspace. Or, maybe we can support both. A simple single-syscall exec
> > in the way we do now, and a clet way for the ones that requires chains and
> > conditions. Hmmm?
>
> Which is just the same as using threads. My argument is that once you
> look at all the details involved, what you end up arriving at is the
> creation of threads. Threads are relatively cheap, it's just that the
> hardware currently has several performance bugs with them on x86 (and more
> on x86-64 with the MSR fiddling that hits the hot path). Architectures
> like powerpc are not going to benefit anywhere near as much from this
> exercise, as the state involved is processed much more sanely. IA64 as
> usual is simply doomed by way of having too many registers to switch.

Sort of, except that the whole thing can complete synchronously w/out
context switches. The real point of the whole fibrils/syslets solution is
that kind of optimization. The solution, as it is now, is good for
single syscalls (modulo the sys_async_cancel implementation), but for
multiple chained submissions it kinda stinks IMHO. Once you have to build
chains, and conditions, and new syscalls to implement userspace variable
increments, and so on..., at that point it's better to have the chain
coded in C, ala a thread proc. Yes, it requires a frame setup and another
entry into the kernel, but IMO that will be amortized over the cost of the
multiple syscalls inside the "clet".



- Davide


2007-02-14 20:03:58

by Benjamin LaHaise

[permalink] [raw]
Subject: Re: [patch 06/11] syslets: core, documentation

On Wed, Feb 14, 2007 at 11:45:23AM -0800, Davide Libenzi wrote:
> Sort of, except that the whole thing can complete syncronously w/out
> context switches. The real point of the whole fibrils/syslets solution is
> that kind of optimization. The solution is as good as it is now, for

Except that You Can't Do That (tm). Try to predict beforehand if the code
path being followed will touch the FPU or SSE state, and you can't. There is
no way to avoid the context switch overhead, as you have to preserve things
so that whatever state is being returned to the user is as it was. Unless
you plan on resetting the state beforehand, but then you have to call into
arch specific code that ends up with a comparable overhead to the context
switch.

-ben
--
"Time is of no importance, Mr. President, only life is important."
Don't Email: <[email protected]>.

2007-02-14 20:14:33

by Davide Libenzi

[permalink] [raw]
Subject: Re: [patch 06/11] syslets: core, documentation

On Wed, 14 Feb 2007, Benjamin LaHaise wrote:

> On Wed, Feb 14, 2007 at 11:45:23AM -0800, Davide Libenzi wrote:
> > Sort of, except that the whole thing can complete syncronously w/out
> > context switches. The real point of the whole fibrils/syslets solution is
> > that kind of optimization. The solution is as good as it is now, for
>
> Except that You Can't Do That (tm). Try to predict beforehand if the code
> path being followed will touch the FPU or SSE state, and you can't. There is
> no way to avoid the context switch overhead, as you have to preserve things
> so that whatever state is being returned to the user is as it was. Unless
> you plan on resetting the state beforehand, but then you have to call into
> arch specific code that ends up with a comparable overhead to the context
> switch.

I think you may have misinterpreted my words. *When* a schedule would
block a synchronous execution attempt, then you do have a context switch.
No one argues that, and the code is clear. The sys_async_exec thread will
block, and a newly woken-up thread will re-emerge from sys_async_exec with
a NULL returned to userspace. But in the "cachehit" case (no schedule
happens during the syscall/*let execution), there is no context switch at
all. That is the whole point of the optimization.



- Davide


2007-02-14 20:35:56

by Benjamin LaHaise

[permalink] [raw]
Subject: Re: [patch 06/11] syslets: core, documentation

On Wed, Feb 14, 2007 at 12:14:29PM -0800, Davide Libenzi wrote:
> I think you may have misinterpreted my words. *When* a schedule would
> block a synchronous execution attempt, then you do have a context switch.
> No one argues that, and the code is clear. The sys_async_exec thread will
> block, and a newly woken-up thread will re-emerge from sys_async_exec with a NULL
> returned to userspace. But in a "cachehit" case (no schedule happens
> during the syscall/*let execution), there is no context switch at all.
> That is the whole point of the optimization.

And I will repeat myself: that cannot be done. Tell me how the following
what-if scenario works: you're in an MMX-optimized memory copy and you take
a page fault. How does returning to the submitter of the async operation
get the correct MMX state restored? It doesn't.

-ben
--
"Time is of no importance, Mr. President, only life is important."
Don't Email: <[email protected]>.

2007-02-14 20:39:40

by Linus Torvalds

[permalink] [raw]
Subject: Re: [patch 05/11] syslets: core code



On Tue, 13 Feb 2007, Ingo Molnar wrote:
>
> the core syslet / async system calls infrastructure code.

Ok, having now looked at it more, I can say:

- I hate it.

I dislike it intensely, because it's so _close_ to being usable. But the
programming interface looks absolutely horrid for any "casual" use, and
while the loops etc look like fun, I think they are likely to be less than
useful in practice. Yeah, you can do the "setup and teardown" just once,
but it ends up being "once per user", and it ends up being a lot of stuff
to do for somebody who wants to just do some simple async stuff.

And the whole "lock things down in memory" approach is bad. It's doing
expensive things like mlock(), making the overhead for _single_ system
calls much more expensive. Since I don't actually believe that the
non-single case is even all that interesting, I really don't like it.

I think it's clever and potentially useful to allow user mode to see the
data structures (and even allow user mode to *modify* them) while the
async thing is running, but it really seems to be a case of excessive
cleverness.

For example, how would you use this to emulate the *current* aio_read()
etc interfaces that don't have any user-level component except for the
actual call? And if you can't do that, the whole exercise is pointless.

Or how would you do the trivial example loop that I explained was a good
idea:

	struct one_entry *prev = NULL;
	struct dirent *de;

	while ((de = readdir(dir)) != NULL) {
		struct one_entry *entry = malloc(..);

		/* Add it to the list, fill in the name */
		entry->next = prev;
		prev = entry;
		strcpy(entry->name, de->d_name);

		/* Do the stat lookup async */
		async_stat(de->d_name, &entry->stat_buf);
	}
	wait_for_async();
	.. Ta-daa! All done ..


Notice? This also "chains system calls together", but it does it using a
*much* more powerful entity called "user space". That's what user space
is. And yeah, it's a pretty complex sequencer, but happily we have
hardware support for accelerating it to the point that the kernel never
even needs to care.

The above is a *realistic* scenario, where you actually have things like
memory allocation etc going on. In contrast, just chaining system calls
together isn't a realistic scenario at all.

So I think we have one _known_ usage scenario:

- replacing the _existing_ aio_read() etc system calls (with not just
existing semantics, but actually binary-compatible)

- simple code use where people are willing to perhaps do something
Linux-specific, but because it's so _simple_, they'll do it.

In neither case does the "chaining atoms together" seem to really solve
the problem. It's clever, but it's not what people would actually do.

And yes, you can hide things like that behind an abstraction library, but
once you start doing that, I've got three questions for you:

- what's the point?
- we're adding overhead, so how are we getting it back
- how do we handle independent libraries each doing their own thing and
version skew between them?

In other words, the "let user space sort out the complexity" is not a good
answer. It just means that the interface is badly designed.

Linus

2007-02-14 20:52:31

by Jeremy Fitzhardinge

[permalink] [raw]
Subject: Re: [patch 00/11] ANNOUNCE: "Syslets", generic asynchronous system call support

Ingo Molnar wrote:
> Syslets consist of 'syslet atoms', where each atom represents a single
> system-call. These atoms can be chained to each other: serially, in
> branches or in loops. The return value of an executed atom is checked
> against the condition flags. So an atom can specify 'exit on nonzero' or
> 'loop until non-negative' kind of constructs.
>
> Syslet atoms fundamentally execute only system calls, thus to be able to
> manipulate user-space variables from syslets i've added a simple special
> system call: sys_umem_add(ptr, val). This can be used to increase or
> decrease the user-space variable (and to get the result), or to simply
> read out the variable (if 'val' is 0).
>

This looks very interesting. A couple of questions:

Are there any special semantics that result from running the syslet
atoms in kernel mode? If I wanted to, could I write a syslet emulation
in userspace that's functionally identical to a kernel-based
implementation? (Obviously the performance characteristics will be
different.)

I'm asking from the perspective of trying to work out the Valgrind
binding for this if it goes into the kernel. Valgrind needs to see all
the input and output values of each system call the client makes,
including those done within the syslet mechanism. It seems to me that
the easiest way to do this would be to intercept the syslet system
calls, and just implement them in usermode, performing the same series
of syscalls directly, and applying the Valgrind machinery to each one in
turn.

Would this work?

Also, an unrelated question: is there enough control structure in place
to allow multiple syslet streams to synchronize with each other with
futexes?

Thanks,
J

2007-02-14 21:06:04

by Ingo Molnar

[permalink] [raw]
Subject: Re: [patch 05/11] syslets: core code


* Linus Torvalds <[email protected]> wrote:

> And the whole "lock things down in memory" approach is bad. It's doing
> expensive things like mlock(), making the overhead for _single_ system
> calls much more expensive. [...]

hm, there must be some misunderstanding here. That mlock is /only/ once
per the lifetime of the whole 'head' - i.e. per sys_async_register().
(And you can even forget i ever did it - it's 5 lines of code to turn
the completion ring into a swappable entity.)

never does any MMU trick ever enter the picture during the whole
operation of this thing, and that's very much intentional.

Ingo

2007-02-14 21:07:06

by Davide Libenzi

[permalink] [raw]
Subject: Re: [patch 06/11] syslets: core, documentation

On Wed, 14 Feb 2007, Benjamin LaHaise wrote:

> On Wed, Feb 14, 2007 at 12:14:29PM -0800, Davide Libenzi wrote:
> > I think you may have misinterpreted my words. *When* a schedule would
> > block a synchronous execution attempt, then you do have a context switch.
> > No one argues that, and the code is clear. The sys_async_exec thread will
> > block, and a newly woken-up thread will re-emerge from sys_async_exec with a NULL
> > returned to userspace. But in a "cachehit" case (no schedule happens
> > during the syscall/*let execution), there is no context switch at all.
> > That is the whole point of the optimization.
>
> And I will repeat myself: that cannot be done. Tell me how the following
> what-if scenario works: you're in an MMX-optimized memory copy and you take
> a page fault. How does returning to the submitter of the async operation
> get the correct MMX state restored? It doesn't.

Bear with me Ben, and let's follow this up :) If you are in the middle of
an MMX copy operation, inside the syscall, you are:

- Userspace, on task A, calls sys_async_exec

- Userspace is _not_ doing any MMX stuff before the call

- We execute the syscall

- Task A, executing the syscall and inside an MMX copy operation, gets a
page fault

- We get a schedule

- Task A's MMX state will *follow* task A, which will be put to sleep

- We wake task B that will return to userspace

So if the MMX work happens inside the syscall execution, we're fine,
because its context will follow the same task being put to sleep. The
problem would be preserving the *caller* (userspace) context. But that
can be done in a lazy way (detecting whether task A used the FPU), like
we're currently doing it, once we detect a schedule-out condition. That
wouldn't be the most common case for many userspace programs anyway.




- Davide


2007-02-14 21:10:18

by Davide Libenzi

[permalink] [raw]
Subject: Re: [patch 05/11] syslets: core code

On Wed, 14 Feb 2007, Linus Torvalds wrote:

>
>
> On Tue, 13 Feb 2007, Ingo Molnar wrote:
> >
> > the core syslet / async system calls infrastructure code.
>
> Ok, having now looked at it more, I can say:
>
> - I hate it.
>
> I dislike it intensely, because it's so _close_ to being usable. But the
> programming interface looks absolutely horrid for any "casual" use, and
> while the loops etc look like fun, I think they are likely to be less than
> useful in practice. Yeah, you can do the "setup and teardown" just once,
> but it ends up being "once per user", and it ends up being a lot of stuff
> to do for somebody who wants to just do some simple async stuff.
>
> And the whole "lock things down in memory" approach is bad. It's doing
> expensive things like mlock(), making the overhead for _single_ system
> calls much more expensive. Since I don't actually believe that the
> non-single case is even all that interesting, I really don't like it.
>
> I think it's clever and potentially useful to allow user mode to see the
> data structures (and even allow user mode to *modify* them) while the
> async thing is running, but it really seems to be a case of excessive
> cleverness.

Ok, that brings the weirdo-count up to two :) I agree with you that the
chained API, at least, can be improved.



- Davide


2007-02-14 21:14:58

by Ingo Molnar

[permalink] [raw]
Subject: Re: [patch 05/11] syslets: core code


* Ingo Molnar <[email protected]> wrote:

> * Linus Torvalds <[email protected]> wrote:
>
> > And the whole "lock things down in memory" approach is bad. It's
> > doing expensive things like mlock(), making the overhead for
> > _single_ system calls much more expensive. [...]
>
> hm, there must be some misunderstanding here. That mlock is /only/
> once per the lifetime of the whole 'head' - i.e. per
> sys_async_register(). (And you can even forget i ever did it - it's 5
> lines of code to turn the completion ring into a swappable entity.)
>
> never does any MMU trick ever enter the picture during the whole
> operation of this thing, and that's very much intentional.

to stress it: never does any mlocking or other lockdown happen to any
syslet atom - it is /only/ the completion ring of syslet pointers that i
made mlocked - and even that can be made generic memory, no problem.

It's all about asynchronous system calls, and if you want you can have a
terabyte of syslets in user memory, half of it swapped out. They have
absolutely zero kernel context attached to them in the 'cached case' (be
that locked memory or some other kernel resource).

Ingo

2007-02-14 21:27:33

by Linus Torvalds

[permalink] [raw]
Subject: Re: [patch 05/11] syslets: core code



On Wed, 14 Feb 2007, Ingo Molnar wrote:
>
> hm, there must be some misunderstanding here. That mlock is /only/ once
> per the lifetime of the whole 'head' - i.e. per sys_async_register().
> (And you can even forget i ever did it - it's 5 lines of code to turn
> the completion ring into a swappable entity.)

But the whole point is that the notion of a "register" is wrong in the
first place. It's wrong because:

- it assumes we are going to make these complex state machines (which I
don't believe for a second that a real program will do)

- it assumes that we're going to make many async system calls that go
together (which breaks the whole notion of having different libraries
using this for their own internal reasons - they may not even *know*
about other libraries that _also_ do async IO for *their* reasons)

- it fundamentally is based on a broken notion that everything would use
this "AIO atom" in the first place, WHICH WE KNOW IS INCORRECT, since
current users use "aio_read()" that simply doesn't have that and
doesn't build up any such data structures.

So please answer my questions. The problem wasn't the mlock(), even though
that was just STUPID. The problem was much deeper. This is not a "prepare
to do a lot of very boutique linked list operations" problem. This is a
"people already use 'aio_read()' and want to extend on it" problem.

You didn't at all react to that fundamental issue: you have an overly
complex and clever thing that doesn't actually *match* what people do.

Linus

2007-02-14 21:36:17

by Davide Libenzi

[permalink] [raw]
Subject: Re: [patch 00/11] ANNOUNCE: "Syslets", generic asynchronous system call support

On Wed, 14 Feb 2007, Jeremy Fitzhardinge wrote:

> Are there any special semantics that result from running the syslet
> atoms in kernel mode? If I wanted to, could I write a syslet emulation
> in userspace that's functionally identical to a kernel-based
> implementation? (Obviously the performance characteristics will be
> different.)
>
> I'm asking from the perspective of trying to work out the Valgrind
> binding for this if it goes into the kernel. Valgrind needs to see all
> the input and output values of each system call the client makes,
> including those done within the syslet mechanism. It seems to me that
> the easiest way to do this would be to intercept the syslet system
> calls, and just implement them in usermode, performing the same series
> of syscalls directly, and applying the Valgrind machinery to each one in
> turn.
>
> Would this work?

Hopefully the API will simplify enough so that emulation will become
easier.



> Also, an unrelated question: is there enough control structure in place
> to allow multiple syslet streams to synchronize with each other with
> futexes?

I think the whole point of async execution of a syscall or a syslet is
that the syscall/syslet itself involves operations that are not interlocked
with other syscalls/syslets, so that the main scheduler thread can run in a
lockless/singletask fashion. There are no technical obstacles that
prevent you from doing it, but if you start adding locks (and hence having
long-living syslet threads), at that point you'll end up with a fully
multithreaded solution.



- Davide


2007-02-14 21:38:03

by Ingo Molnar

[permalink] [raw]
Subject: Re: [patch 05/11] syslets: core code


* Linus Torvalds <[email protected]> wrote:

> But the whole point is that the notion of a "register" is wrong in the
> first place. [...]

forget about it then. The thing we "register" is dead-simple:

struct async_head_user {
	struct syslet_uatom __user **completion_ring;
	unsigned long ring_size_bytes;
	unsigned long max_nr_threads;
};

this can be passed in to sys_async_exec() as a second pointer, and the
kernel can put the expected-completion pointer (and the user ring idx
pointer) into its struct atom. It's just a few instructions, and only in
the cachemiss case.

that would make completions arbitrarily split-up-able. No registration
whatsoever. A waiter could specify which ring's events it is interested
in. A 'ring' could be a single-entry thing as well, for a single
instance of pending IO.
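
For illustration, a userspace sketch of that "no registration" form, using
the struct shown above; the two-argument sys_async_exec() prototype and the
field values are assumptions, not part of the posted patches:

	/* a single-entry completion ring, for a single instance of pending IO */
	static struct syslet_uatom *my_ring[1];

	static struct async_head_user my_head = {
		.completion_ring	= my_ring,
		.ring_size_bytes	= sizeof(my_ring),
		.max_nr_threads		= 4,	/* arbitrary value for the sketch */
	};

	/*
	 * Hypothetical submission - the head travels with each call instead
	 * of being registered up front:
	 *
	 *	done = sys_async_exec(atom, &my_head);
	 */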

Ingo

2007-02-14 21:43:44

by Alan

[permalink] [raw]
Subject: Re: [patch 05/11] syslets: core code

> - it assumes we are going to make these complex state machines (which I
> don't believe for a second that a real program will do)

They've not had the chance before and there are certain chains of them
which make huge amounts of sense because you don't want to keep taking
completion hits. Not so much looping ones but stuff like

cork write sendfile uncork close

are very natural sequences.
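
For reference, the synchronous version of that sequence with today's socket
API looks roughly like this (sock, filefd, hdr, hdr_len and file_len are
placeholders; a syslet chain would issue the same five syscalls as linked
atoms, without returning to userspace in between):

	#include <netinet/in.h>
	#include <netinet/tcp.h>
	#include <sys/sendfile.h>
	#include <sys/socket.h>
	#include <unistd.h>

	static void send_response(int sock, int filefd, const void *hdr,
				  size_t hdr_len, size_t file_len)
	{
		int on = 1, off = 0;

		setsockopt(sock, IPPROTO_TCP, TCP_CORK, &on, sizeof(on));	/* cork */
		write(sock, hdr, hdr_len);					/* headers */
		sendfile(sock, filefd, NULL, file_len);				/* body */
		setsockopt(sock, IPPROTO_TCP, TCP_CORK, &off, sizeof(off));	/* uncork */
		close(filefd);							/* close */
	}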

There seem to be a lot of typical sequences it doesn't represent, however
(consider the trivial copy case where you feed the result of one syscall into
the next).

> - it assumes that we're going to make many async system calls that go
> together (which breaks the whole notion of having different libraries
> using this for their own internal reasons - they may not even *know*
> about other libraries that _also_ do async IO for *their* reasons)

They can each register their own async objects. They need to do this
anyway so that the libraries can use asynchronous I/O and hide it from
applications.

> this "AIO atom" in the first place, WHICH WE KNOW IS INCORRECT, since
> current users use "aio_read()" that simply doesn't have that and
> doesn't build up any such data structures.

Do current users do this because that is all they have, because it is
hard, or because the current option is all that makes sense ?

The ability to avoid asynchronous completion waits and
complete/wake/despatch cycles is a good thing in itself. I don't know if
it justifies the rest, but it has potential for excellent performance.

Alan

2007-02-14 21:44:19

by Benjamin LaHaise

[permalink] [raw]
Subject: Re: [patch 06/11] syslets: core, documentation

On Wed, Feb 14, 2007 at 01:06:59PM -0800, Davide Libenzi wrote:
> Bear with me Ben, and let's follow this up :) If you are in the middle of
> an MMX copy operation, inside the syscall, you are:
>
> - Userspace, on task A, calls sys_async_exec
>
> - Userspace is _not_ doing any MMX stuff before the call

That's an incorrect assumption. Every task/thread in the system has FPU
state associated with it, in part due to the fact that glibc has to change
some of the rounding mode bits, making them different than the default from
a freshly initialized state.

> - We wake task B that will return to userspace

At which point task B has to touch the FPU in userspace as part of the
cleanup, which adds back in an expensive operation to the whole process.

The whole context switch mechanism is a zero sum game -- everything that
occurs does so because it *must* be done. If you remove something at one
point, then it has to occur somewhere else.

My opinion of this whole thread is that it implies that our thread creation
and/or context switch is too slow. If that is the case, improve those
elements first. At least some of those optimizations have to be done in
hardware on x86, while on other platforms they are probably unnecessary.

Fwiw, there are patches floating around that did AIO via kernel threads
for file descriptors that didn't implement AIO (and remember: kernel thread
context switches are cheaper than userland thread context switches). At
least take a stab at measuring what the performance differences are and
what optimizations are possible before prematurely introducing a new "fast"
way of doing things that adds a bunch more to maintain.

-ben
--
"Time is of no importance, Mr. President, only life is important."
Don't Email: <[email protected]>.

2007-02-14 21:47:14

by Ingo Molnar

[permalink] [raw]
Subject: Re: [patch 05/11] syslets: core code


* Linus Torvalds <[email protected]> wrote:

> - it fundamentally is based on a broken notion that everything would
> use this "AIO atom" in the first place, WHICH WE KNOW IS INCORRECT,
> since current users use "aio_read()" that simply doesn't have that
> and doesn't build up any such data structures.

i'm not sure what you mean here either - aio_read()/write()/etc. could
very much be implemented using syslets - and in fact one goal of syslets
is to enable such use. struct aiocb is mostly shaped by glibc internals,
and it currently has 32 bytes of free space. Enough to put a single atom
there. (or a pointer to an atom)

Ingo

2007-02-14 21:52:16

by Ingo Molnar

[permalink] [raw]
Subject: [patch] x86: split FPU state from task state


* Benjamin LaHaise <[email protected]> wrote:

> On Wed, Feb 14, 2007 at 12:14:29PM -0800, Davide Libenzi wrote:
> > I think you may have misinterpreted my words. *When* a schedule
> > would block a synchronous execution attempt, then you do have a context
> > switch. No one argues that, and the code is clear. The sys_async_exec
> > thread will block, and a newly woken-up thread will re-emerge from
> > sys_async_exec with a NULL returned to userspace. But in a
> > "cachehit" case (no schedule happens during the syscall/*let
> > execution), there is no context switch at all. That is the whole
> > point of the optimization.
>
> And I will repeat myself: that cannot be done. Tell me how the
> following what-if scenario works: you're in an MMX-optimized memory
> copy and you take a page fault. How does returning to the submitter
> of the async operation get the correct MMX state restored? It
> doesn't.

this can very much be done, with a straightforward extension of how we
handle FPU state. That makes sense independently of syslets/async as
well, so find below the standalone patch from Arjan. It's in my current
syslet queue and works great.

Ingo

------------------------>
Subject: [patch] x86: split FPU state from task state
From: Arjan van de Ven <[email protected]>

Split the FPU save area from the task struct. This allows easy migration
of FPU context, and it's generally cleaner. It also allows the following
two (future) optimizations:

1) allocate the right size for the actual cpu rather than 512 bytes always
2) only allocate when the application actually uses FPU, so in the first
lazy FPU trap. This could save memory for non-fpu using apps.

Signed-off-by: Arjan van de Ven <[email protected]>
Signed-off-by: Ingo Molnar <[email protected]>
---
arch/i386/kernel/i387.c | 96 ++++++++++++++++++++---------------------
arch/i386/kernel/process.c | 56 +++++++++++++++++++++++
arch/i386/kernel/traps.c | 10 ----
include/asm-i386/i387.h | 6 +-
include/asm-i386/processor.h | 6 ++
include/asm-i386/thread_info.h | 6 ++
kernel/fork.c | 7 ++
7 files changed, 123 insertions(+), 64 deletions(-)

Index: linux/arch/i386/kernel/i387.c
===================================================================
--- linux.orig/arch/i386/kernel/i387.c
+++ linux/arch/i386/kernel/i387.c
@@ -31,9 +31,9 @@ void mxcsr_feature_mask_init(void)
unsigned long mask = 0;
clts();
if (cpu_has_fxsr) {
- memset(&current->thread.i387.fxsave, 0, sizeof(struct i387_fxsave_struct));
- asm volatile("fxsave %0" : : "m" (current->thread.i387.fxsave));
- mask = current->thread.i387.fxsave.mxcsr_mask;
+ memset(&current->thread.i387->fxsave, 0, sizeof(struct i387_fxsave_struct));
+ asm volatile("fxsave %0" : : "m" (current->thread.i387->fxsave));
+ mask = current->thread.i387->fxsave.mxcsr_mask;
if (mask == 0) mask = 0x0000ffbf;
}
mxcsr_feature_mask &= mask;
@@ -49,16 +49,16 @@ void mxcsr_feature_mask_init(void)
void init_fpu(struct task_struct *tsk)
{
if (cpu_has_fxsr) {
- memset(&tsk->thread.i387.fxsave, 0, sizeof(struct i387_fxsave_struct));
- tsk->thread.i387.fxsave.cwd = 0x37f;
+ memset(&tsk->thread.i387->fxsave, 0, sizeof(struct i387_fxsave_struct));
+ tsk->thread.i387->fxsave.cwd = 0x37f;
if (cpu_has_xmm)
- tsk->thread.i387.fxsave.mxcsr = 0x1f80;
+ tsk->thread.i387->fxsave.mxcsr = 0x1f80;
} else {
- memset(&tsk->thread.i387.fsave, 0, sizeof(struct i387_fsave_struct));
- tsk->thread.i387.fsave.cwd = 0xffff037fu;
- tsk->thread.i387.fsave.swd = 0xffff0000u;
- tsk->thread.i387.fsave.twd = 0xffffffffu;
- tsk->thread.i387.fsave.fos = 0xffff0000u;
+ memset(&tsk->thread.i387->fsave, 0, sizeof(struct i387_fsave_struct));
+ tsk->thread.i387->fsave.cwd = 0xffff037fu;
+ tsk->thread.i387->fsave.swd = 0xffff0000u;
+ tsk->thread.i387->fsave.twd = 0xffffffffu;
+ tsk->thread.i387->fsave.fos = 0xffff0000u;
}
/* only the device not available exception or ptrace can call init_fpu */
set_stopped_child_used_math(tsk);
@@ -152,18 +152,18 @@ static inline unsigned long twd_fxsr_to_
unsigned short get_fpu_cwd( struct task_struct *tsk )
{
if ( cpu_has_fxsr ) {
- return tsk->thread.i387.fxsave.cwd;
+ return tsk->thread.i387->fxsave.cwd;
} else {
- return (unsigned short)tsk->thread.i387.fsave.cwd;
+ return (unsigned short)tsk->thread.i387->fsave.cwd;
}
}

unsigned short get_fpu_swd( struct task_struct *tsk )
{
if ( cpu_has_fxsr ) {
- return tsk->thread.i387.fxsave.swd;
+ return tsk->thread.i387->fxsave.swd;
} else {
- return (unsigned short)tsk->thread.i387.fsave.swd;
+ return (unsigned short)tsk->thread.i387->fsave.swd;
}
}

@@ -171,9 +171,9 @@ unsigned short get_fpu_swd( struct task_
unsigned short get_fpu_twd( struct task_struct *tsk )
{
if ( cpu_has_fxsr ) {
- return tsk->thread.i387.fxsave.twd;
+ return tsk->thread.i387->fxsave.twd;
} else {
- return (unsigned short)tsk->thread.i387.fsave.twd;
+ return (unsigned short)tsk->thread.i387->fsave.twd;
}
}
#endif /* 0 */
@@ -181,7 +181,7 @@ unsigned short get_fpu_twd( struct task_
unsigned short get_fpu_mxcsr( struct task_struct *tsk )
{
if ( cpu_has_xmm ) {
- return tsk->thread.i387.fxsave.mxcsr;
+ return tsk->thread.i387->fxsave.mxcsr;
} else {
return 0x1f80;
}
@@ -192,27 +192,27 @@ unsigned short get_fpu_mxcsr( struct tas
void set_fpu_cwd( struct task_struct *tsk, unsigned short cwd )
{
if ( cpu_has_fxsr ) {
- tsk->thread.i387.fxsave.cwd = cwd;
+ tsk->thread.i387->fxsave.cwd = cwd;
} else {
- tsk->thread.i387.fsave.cwd = ((long)cwd | 0xffff0000u);
+ tsk->thread.i387->fsave.cwd = ((long)cwd | 0xffff0000u);
}
}

void set_fpu_swd( struct task_struct *tsk, unsigned short swd )
{
if ( cpu_has_fxsr ) {
- tsk->thread.i387.fxsave.swd = swd;
+ tsk->thread.i387->fxsave.swd = swd;
} else {
- tsk->thread.i387.fsave.swd = ((long)swd | 0xffff0000u);
+ tsk->thread.i387->fsave.swd = ((long)swd | 0xffff0000u);
}
}

void set_fpu_twd( struct task_struct *tsk, unsigned short twd )
{
if ( cpu_has_fxsr ) {
- tsk->thread.i387.fxsave.twd = twd_i387_to_fxsr(twd);
+ tsk->thread.i387->fxsave.twd = twd_i387_to_fxsr(twd);
} else {
- tsk->thread.i387.fsave.twd = ((long)twd | 0xffff0000u);
+ tsk->thread.i387->fsave.twd = ((long)twd | 0xffff0000u);
}
}

@@ -298,8 +298,8 @@ static inline int save_i387_fsave( struc
struct task_struct *tsk = current;

unlazy_fpu( tsk );
- tsk->thread.i387.fsave.status = tsk->thread.i387.fsave.swd;
- if ( __copy_to_user( buf, &tsk->thread.i387.fsave,
+ tsk->thread.i387->fsave.status = tsk->thread.i387->fsave.swd;
+ if ( __copy_to_user( buf, &tsk->thread.i387->fsave,
sizeof(struct i387_fsave_struct) ) )
return -1;
return 1;
@@ -312,15 +312,15 @@ static int save_i387_fxsave( struct _fps

unlazy_fpu( tsk );

- if ( convert_fxsr_to_user( buf, &tsk->thread.i387.fxsave ) )
+ if ( convert_fxsr_to_user( buf, &tsk->thread.i387->fxsave ) )
return -1;

- err |= __put_user( tsk->thread.i387.fxsave.swd, &buf->status );
+ err |= __put_user( tsk->thread.i387->fxsave.swd, &buf->status );
err |= __put_user( X86_FXSR_MAGIC, &buf->magic );
if ( err )
return -1;

- if ( __copy_to_user( &buf->_fxsr_env[0], &tsk->thread.i387.fxsave,
+ if ( __copy_to_user( &buf->_fxsr_env[0], &tsk->thread.i387->fxsave,
sizeof(struct i387_fxsave_struct) ) )
return -1;
return 1;
@@ -343,7 +343,7 @@ int save_i387( struct _fpstate __user *b
return save_i387_fsave( buf );
}
} else {
- return save_i387_soft( &current->thread.i387.soft, buf );
+ return save_i387_soft( &current->thread.i387->soft, buf );
}
}

@@ -351,7 +351,7 @@ static inline int restore_i387_fsave( st
{
struct task_struct *tsk = current;
clear_fpu( tsk );
- return __copy_from_user( &tsk->thread.i387.fsave, buf,
+ return __copy_from_user( &tsk->thread.i387->fsave, buf,
sizeof(struct i387_fsave_struct) );
}

@@ -360,11 +360,11 @@ static int restore_i387_fxsave( struct _
int err;
struct task_struct *tsk = current;
clear_fpu( tsk );
- err = __copy_from_user( &tsk->thread.i387.fxsave, &buf->_fxsr_env[0],
+ err = __copy_from_user( &tsk->thread.i387->fxsave, &buf->_fxsr_env[0],
sizeof(struct i387_fxsave_struct) );
/* mxcsr reserved bits must be masked to zero for security reasons */
- tsk->thread.i387.fxsave.mxcsr &= mxcsr_feature_mask;
- return err ? 1 : convert_fxsr_from_user( &tsk->thread.i387.fxsave, buf );
+ tsk->thread.i387->fxsave.mxcsr &= mxcsr_feature_mask;
+ return err ? 1 : convert_fxsr_from_user( &tsk->thread.i387->fxsave, buf );
}

int restore_i387( struct _fpstate __user *buf )
@@ -378,7 +378,7 @@ int restore_i387( struct _fpstate __user
err = restore_i387_fsave( buf );
}
} else {
- err = restore_i387_soft( &current->thread.i387.soft, buf );
+ err = restore_i387_soft( &current->thread.i387->soft, buf );
}
set_used_math();
return err;
@@ -391,7 +391,7 @@ int restore_i387( struct _fpstate __user
static inline int get_fpregs_fsave( struct user_i387_struct __user *buf,
struct task_struct *tsk )
{
- return __copy_to_user( buf, &tsk->thread.i387.fsave,
+ return __copy_to_user( buf, &tsk->thread.i387->fsave,
sizeof(struct user_i387_struct) );
}

@@ -399,7 +399,7 @@ static inline int get_fpregs_fxsave( str
struct task_struct *tsk )
{
return convert_fxsr_to_user( (struct _fpstate __user *)buf,
- &tsk->thread.i387.fxsave );
+ &tsk->thread.i387->fxsave );
}

int get_fpregs( struct user_i387_struct __user *buf, struct task_struct *tsk )
@@ -411,7 +411,7 @@ int get_fpregs( struct user_i387_struct
return get_fpregs_fsave( buf, tsk );
}
} else {
- return save_i387_soft( &tsk->thread.i387.soft,
+ return save_i387_soft( &tsk->thread.i387->soft,
(struct _fpstate __user *)buf );
}
}
@@ -419,14 +419,14 @@ int get_fpregs( struct user_i387_struct
static inline int set_fpregs_fsave( struct task_struct *tsk,
struct user_i387_struct __user *buf )
{
- return __copy_from_user( &tsk->thread.i387.fsave, buf,
+ return __copy_from_user( &tsk->thread.i387->fsave, buf,
sizeof(struct user_i387_struct) );
}

static inline int set_fpregs_fxsave( struct task_struct *tsk,
struct user_i387_struct __user *buf )
{
- return convert_fxsr_from_user( &tsk->thread.i387.fxsave,
+ return convert_fxsr_from_user( &tsk->thread.i387->fxsave,
(struct _fpstate __user *)buf );
}

@@ -439,7 +439,7 @@ int set_fpregs( struct task_struct *tsk,
return set_fpregs_fsave( tsk, buf );
}
} else {
- return restore_i387_soft( &tsk->thread.i387.soft,
+ return restore_i387_soft( &tsk->thread.i387->soft,
(struct _fpstate __user *)buf );
}
}
@@ -447,7 +447,7 @@ int set_fpregs( struct task_struct *tsk,
int get_fpxregs( struct user_fxsr_struct __user *buf, struct task_struct *tsk )
{
if ( cpu_has_fxsr ) {
- if (__copy_to_user( buf, &tsk->thread.i387.fxsave,
+ if (__copy_to_user( buf, &tsk->thread.i387->fxsave,
sizeof(struct user_fxsr_struct) ))
return -EFAULT;
return 0;
@@ -461,11 +461,11 @@ int set_fpxregs( struct task_struct *tsk
int ret = 0;

if ( cpu_has_fxsr ) {
- if (__copy_from_user( &tsk->thread.i387.fxsave, buf,
+ if (__copy_from_user( &tsk->thread.i387->fxsave, buf,
sizeof(struct user_fxsr_struct) ))
ret = -EFAULT;
/* mxcsr reserved bits must be masked to zero for security reasons */
- tsk->thread.i387.fxsave.mxcsr &= mxcsr_feature_mask;
+ tsk->thread.i387->fxsave.mxcsr &= mxcsr_feature_mask;
} else {
ret = -EIO;
}
@@ -479,7 +479,7 @@ int set_fpxregs( struct task_struct *tsk
static inline void copy_fpu_fsave( struct task_struct *tsk,
struct user_i387_struct *fpu )
{
- memcpy( fpu, &tsk->thread.i387.fsave,
+ memcpy( fpu, &tsk->thread.i387->fsave,
sizeof(struct user_i387_struct) );
}

@@ -490,10 +490,10 @@ static inline void copy_fpu_fxsave( stru
unsigned short *from;
int i;

- memcpy( fpu, &tsk->thread.i387.fxsave, 7 * sizeof(long) );
+ memcpy( fpu, &tsk->thread.i387->fxsave, 7 * sizeof(long) );

to = (unsigned short *)&fpu->st_space[0];
- from = (unsigned short *)&tsk->thread.i387.fxsave.st_space[0];
+ from = (unsigned short *)&tsk->thread.i387->fxsave.st_space[0];
for ( i = 0 ; i < 8 ; i++, to += 5, from += 8 ) {
memcpy( to, from, 5 * sizeof(unsigned short) );
}
@@ -540,7 +540,7 @@ int dump_task_extended_fpu(struct task_s
if (fpvalid) {
if (tsk == current)
unlazy_fpu(tsk);
- memcpy(fpu, &tsk->thread.i387.fxsave, sizeof(*fpu));
+ memcpy(fpu, &tsk->thread.i387->fxsave, sizeof(*fpu));
}
return fpvalid;
}
Index: linux/arch/i386/kernel/process.c
===================================================================
--- linux.orig/arch/i386/kernel/process.c
+++ linux/arch/i386/kernel/process.c
@@ -645,7 +645,7 @@ struct task_struct fastcall * __switch_t

/* we're going to use this soon, after a few expensive things */
if (next_p->fpu_counter > 5)
- prefetch(&next->i387.fxsave);
+ prefetch(&next->i387->fxsave);

/*
* Reload esp0.
@@ -908,3 +908,57 @@ unsigned long arch_align_stack(unsigned
sp -= get_random_int() % 8192;
return sp & ~0xf;
}
+
+
+
+struct kmem_cache *task_struct_cachep;
+struct kmem_cache *task_i387_cachep;
+
+struct task_struct * alloc_task_struct(void)
+{
+ struct task_struct *tsk;
+ tsk = kmem_cache_alloc(task_struct_cachep, GFP_KERNEL);
+ if (!tsk)
+ return NULL;
+ tsk->thread.i387 = kmem_cache_alloc(task_i387_cachep, GFP_KERNEL);
+ if (!tsk->thread.i387)
+ goto error;
+ WARN_ON((unsigned long)tsk->thread.i387 & 15);
+ return tsk;
+
+error:
+ kfree(tsk);
+ return NULL;
+}
+
+void memcpy_task_struct(struct task_struct *dst, struct task_struct *src)
+{
+ union i387_union *ptr;
+ ptr = dst->thread.i387;
+ *dst = *src;
+ dst->thread.i387 = ptr;
+ memcpy(dst->thread.i387, src->thread.i387, sizeof(union i387_union));
+}
+
+void free_task_struct(struct task_struct *tsk)
+{
+ kmem_cache_free(task_i387_cachep, tsk->thread.i387);
+ tsk->thread.i387=NULL;
+ kmem_cache_free(task_struct_cachep, tsk);
+}
+
+
+void task_struct_slab_init(void)
+{
+ /* create a slab on which task_structs can be allocated */
+ task_struct_cachep =
+ kmem_cache_create("task_struct", sizeof(struct task_struct),
+ ARCH_MIN_TASKALIGN, SLAB_PANIC, NULL, NULL);
+ task_i387_cachep =
+ kmem_cache_create("task_i387", sizeof(union i387_union), 32,
+ SLAB_PANIC | SLAB_MUST_HWCACHE_ALIGN, NULL, NULL);
+}
+
+
+/* the very init task needs a static allocated i387 area */
+union i387_union init_i387_context;
Index: linux/arch/i386/kernel/traps.c
===================================================================
--- linux.orig/arch/i386/kernel/traps.c
+++ linux/arch/i386/kernel/traps.c
@@ -1154,16 +1154,6 @@ void __init trap_init(void)
set_trap_gate(19,&simd_coprocessor_error);

if (cpu_has_fxsr) {
- /*
- * Verify that the FXSAVE/FXRSTOR data will be 16-byte aligned.
- * Generates a compile-time "error: zero width for bit-field" if
- * the alignment is wrong.
- */
- struct fxsrAlignAssert {
- int _:!(offsetof(struct task_struct,
- thread.i387.fxsave) & 15);
- };
-
printk(KERN_INFO "Enabling fast FPU save and restore... ");
set_in_cr4(X86_CR4_OSFXSR);
printk("done.\n");
Index: linux/include/asm-i386/i387.h
===================================================================
--- linux.orig/include/asm-i386/i387.h
+++ linux/include/asm-i386/i387.h
@@ -34,7 +34,7 @@ extern void init_fpu(struct task_struct
"nop ; frstor %1", \
"fxrstor %1", \
X86_FEATURE_FXSR, \
- "m" ((tsk)->thread.i387.fxsave))
+ "m" ((tsk)->thread.i387->fxsave))

extern void kernel_fpu_begin(void);
#define kernel_fpu_end() do { stts(); preempt_enable(); } while(0)
@@ -60,8 +60,8 @@ static inline void __save_init_fpu( stru
"fxsave %[fx]\n"
"bt $7,%[fsw] ; jnc 1f ; fnclex\n1:",
X86_FEATURE_FXSR,
- [fx] "m" (tsk->thread.i387.fxsave),
- [fsw] "m" (tsk->thread.i387.fxsave.swd) : "memory");
+ [fx] "m" (tsk->thread.i387->fxsave),
+ [fsw] "m" (tsk->thread.i387->fxsave.swd) : "memory");
/* AMD K7/K8 CPUs don't save/restore FDP/FIP/FOP unless an exception
is pending. Clear the x87 state here by setting it to fixed
values. safe_address is a random variable that should be in L1 */
Index: linux/include/asm-i386/processor.h
===================================================================
--- linux.orig/include/asm-i386/processor.h
+++ linux/include/asm-i386/processor.h
@@ -407,7 +407,7 @@ struct thread_struct {
/* fault info */
unsigned long cr2, trap_no, error_code;
/* floating point info */
- union i387_union i387;
+ union i387_union *i387;
/* virtual 86 mode info */
struct vm86_struct __user * vm86_info;
unsigned long screen_bitmap;
@@ -420,11 +420,15 @@ struct thread_struct {
unsigned long io_bitmap_max;
};

+
+extern union i387_union init_i387_context;
+
#define INIT_THREAD { \
.vm86_info = NULL, \
.sysenter_cs = __KERNEL_CS, \
.io_bitmap_ptr = NULL, \
.gs = __KERNEL_PDA, \
+ .i387 = &init_i387_context, \
}

/*
Index: linux/include/asm-i386/thread_info.h
===================================================================
--- linux.orig/include/asm-i386/thread_info.h
+++ linux/include/asm-i386/thread_info.h
@@ -102,6 +102,12 @@ static inline struct thread_info *curren

#define free_thread_info(info) kfree(info)

+#define __HAVE_ARCH_TASK_STRUCT_ALLOCATOR
+extern struct task_struct * alloc_task_struct(void);
+extern void free_task_struct(struct task_struct *tsk);
+extern void memcpy_task_struct(struct task_struct *dst, struct task_struct *src);
+extern void task_struct_slab_init(void);
+
#else /* !__ASSEMBLY__ */

/* how to get the thread information struct from ASM */
Index: linux/kernel/fork.c
===================================================================
--- linux.orig/kernel/fork.c
+++ linux/kernel/fork.c
@@ -83,6 +83,8 @@ int nr_processes(void)
#ifndef __HAVE_ARCH_TASK_STRUCT_ALLOCATOR
# define alloc_task_struct() kmem_cache_alloc(task_struct_cachep, GFP_KERNEL)
# define free_task_struct(tsk) kmem_cache_free(task_struct_cachep, (tsk))
+# define memcpy_task_struct(dst, src) *dst = *src;
+
static struct kmem_cache *task_struct_cachep;
#endif

@@ -137,6 +139,8 @@ void __init fork_init(unsigned long memp
task_struct_cachep =
kmem_cache_create("task_struct", sizeof(struct task_struct),
ARCH_MIN_TASKALIGN, SLAB_PANIC, NULL, NULL);
+#else
+ task_struct_slab_init();
#endif

/*
@@ -175,7 +179,8 @@ static struct task_struct *dup_task_stru
return NULL;
}

- *tsk = *orig;
+ memcpy_task_struct(tsk, orig);
+
tsk->thread_info = ti;
setup_thread_stack(tsk, orig);

2007-02-14 22:04:57

by Benjamin LaHaise

[permalink] [raw]
Subject: Re: [patch] x86: split FPU state from task state

On Wed, Feb 14, 2007 at 10:49:44PM +0100, Ingo Molnar wrote:
> this can very much be done, with a straightforward extension of how we
> handle FPU state. That makes sense independently of syslets/async as
> well, so find below the standalone patch from Arjan. It's in my current
> syslet queue and works great.

That patch adds a bunch of memory dereferences and another allocation
to the thread creation code path -- a tax that all users must pay. Granted,
it's small, but at the very least it should be configurable out for the 99.9%
of users that will never use this functionality.

I'm willing to be convinced, it's just that I would like to see some
numbers that scream out that this is a good thing.

-ben

2007-02-14 22:10:38

by Arjan van de Ven

[permalink] [raw]
Subject: Re: [patch] x86: split FPU state from task state

On Wed, 2007-02-14 at 17:04 -0500, Benjamin LaHaise wrote:
> On Wed, Feb 14, 2007 at 10:49:44PM +0100, Ingo Molnar wrote:
> > this can very much be done, with a straightforward extension of how we
> > handle FPU state. That makes sense independently of syslets/async as
> > well, so find below the standalone patch from Arjan. It's in my current
> > syslet queue and works great.
>
> That patch adds a bunch of memory dereferences

not really; you missed that most of the ->'s are actually just going to
members of the union and aren't actually extra dereferences.

> and another allocation
> to the thread creation code path -- a tax that all users must pay.

so the next step, as mentioned in the changelog, is to allocate only on the
first FPU fault, so that it becomes a GAIN for everyone, since only
threads that use the FPU will use the memory.

The second gain (although only on old cpus) is that you only need to
allocate enough memory for your cpu, rather than 512 bytes always.
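
A sketch of what that "allocate on the first FPU fault" step could look
like - this is not part of the patch above, and it assumes that init_fpu()
is only reached from the first lazy-FPU trap (or ptrace) and that allocating
there is acceptable:

	void init_fpu(struct task_struct *tsk)
	{
		/*
		 * Sketch: defer the save-area allocation to the first FPU
		 * use, so threads that never touch the FPU never pay for it.
		 * (Error handling omitted.)
		 */
		if (!tsk->thread.i387)
			tsk->thread.i387 = kmem_cache_alloc(task_i387_cachep,
							    GFP_KERNEL);

		/* ... followed by the fxsave/fsave initialization as in the patch ... */
	}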



--
if you want to mail me at work (you don't), use arjan (at) linux.intel.com
Test the interaction between Linux and your BIOS via http://www.linuxfirmwarekit.org

2007-02-14 22:14:15

by Ingo Molnar

[permalink] [raw]
Subject: Re: [patch 05/11] syslets: core code


* Linus Torvalds <[email protected]> wrote:

> Or how would you do the trivial example loop that I explained was a
> good idea:
>
> 	struct one_entry *prev = NULL;
> 	struct dirent *de;
>
> 	while ((de = readdir(dir)) != NULL) {
> 		struct one_entry *entry = malloc(..);
>
> 		/* Add it to the list, fill in the name */
> 		entry->next = prev;
> 		prev = entry;
> 		strcpy(entry->name, de->d_name);
>
> 		/* Do the stat lookup async */
> 		async_stat(de->d_name, &entry->stat_buf);
> 	}
> 	wait_for_async();
> 	.. Ta-daa! All done ..

i think you are banging on open doors. That async_stat() call is very
much what i'd like to see glibc to provide, not really the raw syslet
interface. Nor do i want to see raw syscalls exposed to applications.
Plus the single-atom thing is what i think will be used mostly
initially, so all my optimizations went into that case.

while i agree with you that state machines are hard, it's all a function
of where the concentration of processing is. If most of the application
complexity happens in user-space, then the logic should live there. But
for infrastructure things (like the async_stat() calls, or aio_read(),
or other, future interfaces) i wouldnt mind at all if they were
implemented using syslets. Likewise, if someone wants to implement the
hottest accept loop in Apache or Samba via syslets, keeping them from
wasting time on writing in-kernel webservers (oops, did i really say
that?), it can be done. If a JVM wants to use syslets, sure - it's an
abstraction machine anyway so application programmers are not exposed to
it.

syslets are just a realization that /if/ the thing we want to do is
mostly on the kernel side, then we might as well put the logic to the
kernel side. It's more of a 'compound interface builder' than the place
for real program logic. It makes our interfaces usable more flexibly,
and it allows the kernel to provide 'atomic' APIs, instead of having to
provide the most common compounded uses as well.

and note that if you actually try to do an async_stat() sanely, you do
get quite close to the point of having syslets. You get basically up to
a one-shot atom concept and 90% of what i have in kernel/async.c. The
remaining 10% of further execution control is easy and still it opens up
these new things that were not possible before: compounding, vectoring,
simple program logic, etc.

The 'cost' of syslets is mostly the atom->next pointer in essence. The
whole async infrastructure only takes up 20 nsecs more in the cached
case. (but with some crazier hacks i got the one-shot atom overhead
[compared to a simple synchronous null syscall] to below 10 nsecs, so
there's room in there for further optimizations. Our current null
syscall latency is around ~150 nsecs.)

Ingo

2007-02-14 22:35:35

by Ingo Molnar

[permalink] [raw]
Subject: Re: [patch 05/11] syslets: core code


* Alan <[email protected]> wrote:

> > this "AIO atom" in the first place, WHICH WE KNOW IS INCORRECT,
> > since current users use "aio_read()" that simply doesn't have
> > that and doesn't build up any such data structures.
>
> Do current users do this because that is all they have, because it is
> hard, or because the current option is all that makes sense ?
>
> The ability to avoid asynchronous completion waits and
> complete/wake/despatch cycles is a good thing of itself. [...]

yeah, that's another key thing. I do plan to provide a sys_upcall()
syscall as well which calls a 5-parameter user-space function with a
special stack. (it's like a lightweight signal/event handler, without
any of the signal handler legacies and overhead - it's like a reverse
system call - a "user call". Obviously pure userspace would never use
sys_upcall(), unless as an act of sheer masochism.)

[ that way say a full HTTP request could be done by an asynchronous
context, because the HTTP parser could be done as a sys_upcall(). ]

so if it's simpler/easier for a syslet to do a step in user-space - as
long as it's an 'atom' of processing - it can be done.

or if processing is so heavily in user-space that most of the logic
lives there then just use plain pthreads. There's just no point in
moving complex user-space code to the syslet side if it's easier/faster
to do it in user-space. Syslets are there for asynchronous /kernel/
execution, and is centered around how the kernel does stuff: system
calls.

besides sys_upcall() i also plan two other extensions:

- a CLONE_ASYNC_WORKER for user-space to be able to use its pthread as an
optional worker thread in the async engine. A thread executing
user-space code qualifies as a 'busy' thread - it has to call into
sys_async_cachemiss_thread() to 'offer' itself as a ready thread that
the 'head' could switch to anytime.

- support for multiple heads sharing the async context pool. All the
locking logic is there already (because cachemiss threads can already
access the queue), it only needs a refcount in struct async_head
(only accessed during fork/exit), and an update to the teardown logic
(that too is a slowpath).

Ingo

2007-02-14 23:14:15

by Linus Torvalds

[permalink] [raw]
Subject: Re: [patch 05/11] syslets: core code



On Wed, 14 Feb 2007, Ingo Molnar wrote:
>
> i think you are banging on open doors. That async_stat() call is very
> much what i'd like to see glibc to provide, not really the raw syslet
> interface.

Right. Which is why I wrote (and you removed) the rest of my email.

If the "raw" interfaces aren't actually what you use, and you just expect
glibc to translate things into them, WHY DO WE HAVE THEM AT ALL?

> The 'cost' of syslets is mostly the atom->next pointer in essence.

No. The cost is:

- indirect interfaces are harder to follow and debug. It's a LOT easier
to debug things that go wrong when it just does what you ask it for,
instead of writing to memory and doing something totally obscure.

I don't know about you, but I use "strace" a lot. That's the kind of
cost we have.

- the cost is the extra and totally unnecessary setup for the
indirection, that nobody really is likely to use.

> The whole async infrastructure only takes up 20 nsecs more in the cached
> case. (but with some crazier hacks i got the one-shot atom overhead
> [compared to a simple synchronous null syscall] to below 10 nsecs, so
> there's room in there for further optimizations. Our current null
> syscall latency is around ~150 nsecs.)

You are not counting the whole setup cost there, then, because your setup
cost is going to be at a minimum more expensive than the null system call.

And yes, for benchmarks, it's going to be done just once, and then the
benchmark will loop a million times. But for other things like libraries,
that don't know whether they get called once, or a million times, this is
a big deal.

This is why I'd like a "async_stat()" to basically be the *same* cost as a
"stat()". To within nanoseconds. WITH ALL THE SETUP! Because otherwise, a
library may not be able to use it without thinking about it a lot, because
it simply doesn't know whether the caller is going to call it once or many
times.

THIS was why I wanted the "synchronous mode". Exactly because it removes
all the questions about "is it worth it". If the cost overhead is
basically zero, you know it's always worth it.

Now, if you make the "async_submit()" _include_ the setup itself (as you
alluded to in one of your emails), and the cost of that is basically
negligible, and it still allows people to do things simply and just submit
a single system call without any real overhead, then hey, it's may be a
complex interface, but at least you can _use_ it as a simple one.

At that point most of my arguments against it go away. It might still be
over-engineered, but if the costs aren't visible, and it's obvious enough
that the over-engineering doesn't result in subtle bugs, THEN (and only
then) is a more complex and generic interface worth it even if nobody
actually ends up using it.

Linus

2007-02-14 23:18:05

by Davide Libenzi

[permalink] [raw]
Subject: Re: [patch 06/11] syslets: core, documentation

On Wed, 14 Feb 2007, Benjamin LaHaise wrote:

> On Wed, Feb 14, 2007 at 01:06:59PM -0800, Davide Libenzi wrote:
> > Bear with me Ben, and let's follow this up :) If you are in the middle of
> > an MMX copy operation, inside the syscall, you are:
> >
> > - Userspace, on task A, calls sys_async_exec
> >
> > - Userspace is _not_ doing any MMX stuff before the call
>
> That's an incorrect assumption. Every task/thread in the system has FPU
> state associated with it, in part due to the fact that glibc has to change
> some of the rounding mode bits, making them different than the default from
> a freshly initialized state.

IMO I still believe this is not a huge problem. FPU state propagation/copy
can be done in a clever way, once we detect the in-async condition.



- Davide


2007-02-14 23:40:53

by Benjamin LaHaise

[permalink] [raw]
Subject: Re: [patch 06/11] syslets: core, documentation

On Wed, Feb 14, 2007 at 03:17:59PM -0800, Davide Libenzi wrote:
> > That's an incorrect assumption. Every task/thread in the system has FPU
> > state associated with it, in part due to the fact that glibc has to change
> > some of the rounding mode bits, making them different than the default from
> > a freshly initialized state.
>
> IMO I still believe this is not a huge problem. FPU state propagation/copy
> can be done in a clever way, once we detect the in-async condition.

Show me. clts() and stts() are expensive hardware operations which there
is no means of avoiding, as control register writes impact the CPU in a
non-trivial manner. I've spent far too much time staring at profiles of what
goes on in the context switch code, in the process of looking for
optimizations on this very issue, to be ignored on this point.

-ben
--
"Time is of no importance, Mr. President, only life is important."
Don't Email: <[email protected]>.

2007-02-14 23:46:48

by Ingo Molnar

[permalink] [raw]
Subject: Re: [patch 05/11] syslets: core code


* Linus Torvalds <[email protected]> wrote:

> > case. (but with some crazier hacks i got the one-shot atom overhead
> > [compared to a simple synchronous null syscall] to below 10 nsecs,
> > so there's room in there for further optimizations. Our current null
> > syscall latency is around ~150 nsecs.)
>
> You are not counting the whole setup cost there, then, because your
> setup cost is going to be at a minimum more expensive than the null
> system call.

hm, this one-time cost was never on my radar. [ It's really dwarfed by
other startup costs (a single fork() takes 100 usecs, an exec() takes
800 usecs.) ]

In any case, we can delay this cost into the first cachemiss, or can
eliminate it by making it a globally pooled thing. It does not seem like
a big issue.

Ingo

2007-02-15 00:07:15

by Ingo Molnar

[permalink] [raw]
Subject: Re: [patch 05/11] syslets: core code


* Ingo Molnar <[email protected]> wrote:

> > You are not counting the whole setup cost there, then, because your
> > setup cost is going to be at a minimum more expensive than the null
> > system call.
>
> hm, this one-time cost was never on my radar. [ It's really dwarfed by
> other startup costs (a single fork() takes 100 usecs, an exec() takes
> 800 usecs.) ]

i really count this into the category of 'application startup', and thus
it is another type of 'cachemiss': the cost of having to bootstrap a
new context. (Even though obviously we want this to go as fast as
possible too.) Library startups, linking (even with prelink), etc., are
quite expensive already - they go into the tens of milliseconds.

or if it's a new thread startup - where this cost would indeed be
visible, if the thread exits straight after being started up, and where
this thread would want to do a single AIO, then shareable async heads
(see my mail to Alan) ought to solve this. (But short-lifetime threads
are not really a good idea in themselves.)

but the moment it's some fork()ed context, or even an exec()ed context,
this cost is very small in comparison. And who in their right mind
starts up a whole new process just to do a single IO and then exits
without doing any other processing? (so that the async setup cost would
show up)

but, short-lived contexts, where this cost would be visible, are
generally a really bad idea.

Ingo

2007-02-15 00:09:05

by Jeremy Fitzhardinge

[permalink] [raw]
Subject: Re: [patch 00/11] ANNOUNCE: "Syslets", generic asynchronous system call support

Davide Libenzi wrote:
>> Would this work?
>>
>
> Hopefully the API will simplify enough so that emulation will become
> easier.
>

The big question in my mind is how all this stuff interacts with
signals. Can a blocked syscall atom be interrupted by a signal? If so,
what thread does it get delivered to? How does sigprocmask affect this
(is it atomic with respect to blocking atoms)?

>> Also, an unrelated question: is there enough control structure in place
>> to allow multiple syslet streams to synchronize with each other with
>> futexes?
>>
>
> I think the whole point of async execution of a syscall or a syslet is
> that the syscall/syslet itself involves operations that are not interlocked
> with other syscalls/syslets, so that the main scheduler thread can run in a
> lockless/singletask fashion. There are no technical obstacles that
> prevent you from doing it, but if you start adding locks (and hence having
> long-living syslet threads), at that point you'll end up with a fully
> multithreaded solution.
>

I was thinking you'd use the futexes more like barriers than locks.
That way you could have several streams going asynchronously, but use
futexes to gang them together at appropriate times in the stream. A
handwavy example would be to have separate async streams for audio and
video, but use futexes to stop them from drifting too far from each other.
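
A rough userspace sketch of that gang-sync idea, independent of syslets: a
one-shot barrier built directly on the futex syscall (the names and the
two-stream count are only for illustration):

	#include <limits.h>
	#include <linux/futex.h>
	#include <sys/syscall.h>
	#include <unistd.h>

	#define NSTREAMS	2

	static volatile int arrived;		/* shared by the streams */

	static void stream_barrier(void)
	{
		int n = __sync_add_and_fetch(&arrived, 1);

		if (n == NSTREAMS)		/* last one in wakes the rest */
			syscall(SYS_futex, &arrived, FUTEX_WAKE, INT_MAX,
				NULL, NULL, 0);
		else				/* sleep until everyone arrives */
			while (arrived != NSTREAMS)
				syscall(SYS_futex, &arrived, FUTEX_WAIT, n,
					NULL, NULL, 0);
	}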

J

2007-02-15 00:35:47

by Davide Libenzi

[permalink] [raw]
Subject: Re: [patch 06/11] syslets: core, documentation

On Wed, 14 Feb 2007, Benjamin LaHaise wrote:

> On Wed, Feb 14, 2007 at 03:17:59PM -0800, Davide Libenzi wrote:
> > > That's an incorrect assumption. Every task/thread in the system has FPU
> > > state associated with it, in part due to the fact that glibc has to change
> > > some of the rounding mode bits, making them different than the default from
> > > a freshly initialized state.
> >
> > IMO I still believe this is not a huge problem. FPU state propagation/copy
> > can be done in a clever way, once we detect the in-async condition.
>
> Show me. clts() and stts() are expensive hardware operations which there
> is no means of avoiding as control register writes impact the CPU in a not
> trivial manner. I've spent far too much time staring at profiles of what
> goes on in the context switch code in the process of looking for optimizations
> on this very issue to be ignored on this point.

The trivial case is the cachehit case. Everything flows as usual, since
we don't swap threads.
If we're going to sleep, __async_schedule has to save/copy (depending on
whether TS_USEDFPU is set) the current FPU state to the newly selected
service thread (the return-to-userspace thread).
When a fault eventually happens in the new userspace thread, the context
is restored.
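
A very rough sketch of that handoff; it borrows the names from the split-FPU
patch earlier in the thread and is only meant to illustrate the idea, not to
be the actual __async_schedule code:

	/* called while 'from' (the blocking submission task) is still current */
	static void async_pass_fpu(struct task_struct *from, struct task_struct *to)
	{
		/*
		 * Flush any live FPU registers into from's save area
		 * (unlazy_fpu() is a no-op unless TS_USEDFPU is set).
		 */
		unlazy_fpu(from);

		/*
		 * Hand the userspace FPU context to the thread that will
		 * return to userspace on behalf of the caller.
		 */
		memcpy(to->thread.i387, from->thread.i387,
		       sizeof(union i387_union));
	}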



- Davide


2007-02-15 01:01:09

by Davide Libenzi

[permalink] [raw]
Subject: Re: [patch 05/11] syslets: core code

On Wed, 14 Feb 2007, Ingo Molnar wrote:

> yeah, that's another key thing. I do plan to provide a sys_upcall()
> syscall as well which calls a 5-parameter user-space function with a
> special stack. (it's like a lightweight signal/event handler, without
> any of the signal handler legacies and overhead - it's like a reverse
> system call - a "user call". Obviously pure userspace would never use
> sys_upcall(), unless as an act of sheer masochism.)

That is exactly what I described as clets. Instead of having complex jump
and condition interpreters in the kernel (on top of new syscalls to
modify/increment userspace variables), you just code it in C and you pass
the clet pointer to the kernel.
The upcall will set up a frame, execute the clet (where jumps/conditions and
userspace variable changes happen in machine code - gcc is pretty good at
taking care of that for us), on its return come back through
sys_async_return, and go back to userspace.




- Davide


2007-02-15 01:29:00

by Davide Libenzi

[permalink] [raw]
Subject: Re: [patch 05/11] syslets: core code

On Wed, 14 Feb 2007, Davide Libenzi wrote:

> On Wed, 14 Feb 2007, Ingo Molnar wrote:
>
> > yeah, that's another key thing. I do plan to provide a sys_upcall()
> > syscall as well which calls a 5-parameter user-space function with a
> > special stack. (it's like a lightweight signal/event handler, without
> > any of the signal handler legacies and overhead - it's like a reverse
> > system call - a "user call". Obviously pure userspace would never use
> > sys_upcall(), unless as an act of sheer masochism.)
>
> That is exactly what I described as clets. Instead of having complex jump
> and condition interpreters in the kernel (on top of new syscalls to
> modify/increment userspace variables), you just code it in C and pass
> the clet pointer to the kernel.
> The upcall will set up a frame, execute the clet (where jumps/conditions and
> userspace variable changes happen in machine code - gcc is pretty good at
> taking care of that for us) and, on its return, come back through
> sys_async_return and go back to userspace.

So, for example, this is the setup code for the current API (and that's a
really simple one - imagine going wacko with loops and userspace variable
changes):


static struct req *alloc_req(void)
{
        /*
         * Constants can be picked up by syslets via static variables:
         */
        static long O_RDONLY_var = O_RDONLY;
        static long FILE_BUF_SIZE_var = FILE_BUF_SIZE;

        struct req *req;

        if (freelist) {
                req = freelist;
                freelist = freelist->next_free;
                req->next_free = NULL;
                return req;
        }

        req = calloc(1, sizeof(struct req));

        /*
         * This is the first atom in the syslet, it opens the file:
         *
         *   req->fd = open(req->filename, O_RDONLY);
         *
         * It is linked to the next read() atom.
         */
        req->filename_p = req->filename;
        init_atom(req, &req->open_file, __NR_sys_open,
                  &req->filename_p, &O_RDONLY_var, NULL, NULL, NULL, NULL,
                  &req->fd, SYSLET_STOP_ON_NEGATIVE, &req->read_file);

        /*
         * This second read() atom is linked back to itself, it skips to
         * the next one on stop:
         */
        req->file_buf_ptr = req->file_buf;
        init_atom(req, &req->read_file, __NR_sys_read,
                  &req->fd, &req->file_buf_ptr, &FILE_BUF_SIZE_var,
                  NULL, NULL, NULL, NULL,
                  SYSLET_STOP_ON_NON_POSITIVE | SYSLET_SKIP_TO_NEXT_ON_STOP,
                  &req->read_file);

        /*
         * This close() atom has NULL as next, this finishes the syslet:
         */
        init_atom(req, &req->close_file, __NR_sys_close,
                  &req->fd, NULL, NULL, NULL, NULL, NULL, NULL, 0, NULL);

        return req;
}


Here's what your clet would look like:

static long main_sync_loop(ctx *c)
{
        int fd;
        char file_buf[FILE_BUF_SIZE + 1];

        if ((fd = open(c->filename, O_RDONLY)) == -1)
                return -1;
        while (read(fd, file_buf, FILE_BUF_SIZE) > 0)
                ;
        close(fd);
        return 0;
}


Kinda easier to code, isn't it? And the cost of the upcall to schedule the
clet is largely amortized by the multiple syscalls you're going to do inside
your clet.




- Davide


2007-02-15 01:32:38

by Michael K. Edwards

[permalink] [raw]
Subject: Re: [patch 06/11] syslets: core, documentation

On 2/14/07, Benjamin LaHaise <[email protected]> wrote:
> My opinion of this whole thread is that it implies that our thread creation
> and/or context switch is too slow. If that is the case, improve those
> elements first. At least some of those optimizations have to be done in
> hardware on x86, while on other platforms they are probably unnecessary.

Not necessarily too slow, but too opaque in terms of system-wide
impact and global flow control. Here are the four practical use cases
that I have seen come up in this discussion:

1) Databases that want to parallelize I/O storms, with an emphasis on
getting results that are already cache-hot immediately (not least so
they don't get evicted by other I/O results); there is also room to
push some of the I/O clustering and sequencing logic down into the
kernel.

2) Static-content-intensive network servers, with an emphasis on
servicing those requests that can be serviced promptly (to avoid a
ballooning connection backlog) and avoiding duplication of I/O effort
when many clients suddenly want the same cold content; the real win
may be in "smart prefetch" initiated from outside the network server
proper.

3) Network information gathering GUIs, which want to harvest as much
information as possible for immediate display and then switch to an
event-based delivery mechanism for tardy responses; these need
throttling of concurrent requests (ideally, in-kernel traffic shaping
by request group and destination class) and efficient cancellation of
no-longer-interesting requests.

4) Document search facilities, which need all of the above (big
surprise there) as well as a rich diagnostic toolset, including a
practical snooping and profiling facility to guide tuning for
application responsiveness.

Even if threads were so cheap that you could just fire off one per I/O
request, they're a poor solution to the host of flow control issues
raised in these use cases. A sequential thread of execution per I/O
request may be the friendliest mental model for the individual delayed
I/Os, but the global traffic shaping and scheduling is a data
structure problem.

The right question to be asking is, what are the operations that need
to be supported on the system-wide pool of pending AIOs, and on what
data structure can they be implemented efficiently? For instance, can
we provide an RCU priority queue implementation (perhaps based on
splay trees) so that userspace can scan a coherent read-only snapshot
of the structure and select candidates for cancellation, etc., without
interfering with kernel completions? Or is it more important to have
a three-sided query operation (characteristic of priority search
trees), or perhaps a lower amortized cost bulk delete?
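
A naive illustration of what a "three-sided query" over the pending pool
means - all requests with a deadline inside a window and at most a given
priority. A priority search tree answers this in O(log n + k); the linear
scan below (with an invented struct) only shows the semantics:

struct pending_io {
        unsigned long           deadline;
        int                     prio;
        struct pending_io       *next;
};

/* visit every pending request with deadline in [lo, hi] and prio <= max_prio */
static void three_sided_query(struct pending_io *head,
                              unsigned long lo, unsigned long hi,
                              int max_prio,
                              void (*visit)(struct pending_io *))
{
        struct pending_io *p;

        for (p = head; p != NULL; p = p->next)
                if (p->deadline >= lo && p->deadline <= hi &&
                    p->prio <= max_prio)
                        visit(p);
}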

Once you've thought through the data structure manipulation, you'll
know what AIO submission / cancellation / reprioritization interfaces
are practical. Then you can work on a programming model for
application-level "I/O completions" that is library-friendly and
allows a "fast path" optimization for the fraction of requests that
can be served synchronously. Then and only then does it make sense to
code-bum the asynchronous path. Not that it isn't interesting to
think in advance about what stack space completions will run in and
which bits of the task struct needn't be in a coherent condition; but
that's probably not going to guide you to the design that meets the
application needs.

I know I'm teaching my grandmother to suck eggs here. But there are
application programmers high up the code stack whose code makes
implicit use of asynchronous I/O continuations. In addition to the
GUI example I blathered about a few days ago, I have in mind Narrative
Javascript's "blocking operator" and Twisted Python's Deferred. Those
folks would be well served by an async I/O interface to the kernel
which mates well to their language's closure/continuation facilities.
If it's usable from C, that's nice too. :-)

Cheers,
- Michael

2007-02-15 02:07:34

by Davide Libenzi

[permalink] [raw]
Subject: Re: [patch 00/11] ANNOUNCE: "Syslets", generic asynchronous system call support

On Wed, 14 Feb 2007, Jeremy Fitzhardinge wrote:

> Davide Libenzi wrote:
> >> Would this work?
> >>
> >
> > Hopefully the API will simplify enough so that emulation will becomes
> > easier.
> >
>
> The big question in my mind is how all this stuff interacts with
> signals. Can a blocked syscall atom be interrupted by a signal? If so,
> what thread does it get delivered to? How does sigprocmask affect this
> (is it atomic with respect to blocking atoms)?

Signal context is another thing that we need to transfer to the
return-to-userspace task, in case we switch. Async threads inherit that
from the "main" task once they're created, but between then and the
sys_async_exec syscall, userspace might have changed the signal context,
and re-emerging with a different one is not an option ;)
We should set up the service threads' signal context so that we can cancel
them, but the implementation should be hidden from userspace (which will use
sys_async_cancel - or whatever name - to do that).



- Davide


2007-02-15 02:45:24

by Zach Brown

[permalink] [raw]
Subject: Re: [patch 00/11] ANNOUNCE: "Syslets", generic asynchronous system call support


I'm finally back from my travel and conference hiatus.. you guys have
been busy! :)

On Feb 13, 2007, at 6:20 AM, Ingo Molnar wrote:

> I'm pleased to announce the first release of the "Syslet" kernel
> feature
> and kernel subsystem, which provides generic asynchronous system call
> support:
>
> http://redhat.com/~mingo/syslet-patches/

In general, I really like the look of this.

I think I'm convinced that your strong preference to do this with
full kernel threads (1:1 task_struct -> thread_info/stack
relationship) is the right thing to do. The fibrils fell on the side
of risking bugs by sharing task_structs amongst stacks executing
kernel paths. This, correct me if I'm wrong, falls on the side of
risking behavioural quirks stemming from task_struct references that
we happen to have not enabled sharing of yet.

I have strong hopes that we won't actually *care* about the
behavioural differences we get from having individual task structs
(which share the important things!) between syscall handlers. The
*only* seemingly significant case I've managed to find, the IO
scheduler priority and context fields, is easy enough to fix up.
Jens and I have been talking about that. It's been bugging him for
other reasons.

So, thanks, nice work. I'm going to focus on finding out if it's
feasible for The Database to use this instead of the current iocb
mechanics. I'm optimistic.

> Syslets are small, simple, lightweight programs (consisting of
> system-calls, 'atoms')

I will admit, though, that I'm not at all convinced that we need
this. Adding a system call for addition (just addition? how far do
we go?!) sure feels like a warning sign to me that we're heading down
a slippery slope. I would rather we started with an obviously
minimal syscall which just takes an array of calls and args and
executes them unconditionally.
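
A sketch of roughly what such a minimal interface could look like - a flat
array of calls, each executed unconditionally. None of this is in the posted
patches; it only illustrates the shape being argued for:

/* one entry per system call in the batch */
struct batched_call {
        long    nr;             /* syscall number                      */
        long    args[6];        /* arguments by value, unused ones 0   */
        long    result;         /* filled in by the kernel             */
};

/* e.g.: long sys_batch(struct batched_call *calls, unsigned int nr_calls);
 * executes calls[0..nr_calls-1] in order, no conditions, no jumps. */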

But its existence doesn't stop the use case I care about. So it's
hard to get *too* worked up about it.

> Comments, suggestions, reports are welcome!

For what it's worth, it looks like 'x86-optimized-copy_uatom.patch'
got some hunks that should have been in 'x86-optimized-
sys_umem_add.patch'.

- z

2007-02-15 02:53:39

by Zach Brown

[permalink] [raw]
Subject: Re: [patch 05/11] syslets: core code

>> But the whole point is that the notion of a "register" is wrong in
>> the
>> first place. [...]
>
> forget about it then. The thing we "register" is dead-simple:
>
> struct async_head_user {
>         struct syslet_uatom __user      **completion_ring;
>         unsigned long                   ring_size_bytes;
>         unsigned long                   max_nr_threads;
> };
>
> this can be passed in to sys_async_exec() as a second pointer, and the
> kernel can put the expected-completion pointer (and the user ring idx
> pointer) into its struct atom. It's just a few instructions, and
> only in
> the cachemiss case.
>
> that would make completions arbitrarily split-up-able. No registration
> whatsoever. A waiter could specify which ring's events it is
> interested
> in. A 'ring' could be a single-entry thing as well, for a single
> instance of pending IO.

I like this, too. (Not surprisingly, having outlined something like
it in a mail in one of the previous threads :)).

I'll bring up the POSIX AIO "list" IO case. It wants to issue a
group of IOs and sleep until they all return. Being able to cheaply
instantiate a ring implicitly with the submission of the IO calls in
the list will make implementing this almost too easy. It'd obviously
just wait for that list's ring to drain.

I hope. There might be complications around the edges (waiting for
multiple list IOs to drain?), but it seems like this would be on the
right track.
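
For reference, this is the POSIX "list IO" entry point in question: glibc has
to block until every request in the list is done, which is exactly the
"submit a batch, drain its ring" pattern (minimal example, error handling
omitted):

#include <aio.h>
#include <string.h>

/* read two blocks of one file and return only when both are complete */
static int read_two_blocks(int fd, char *a, char *b, size_t len,
                           off_t off_a, off_t off_b)
{
        struct aiocb cb_a, cb_b;
        struct aiocb *list[2] = { &cb_a, &cb_b };

        memset(&cb_a, 0, sizeof(cb_a));
        cb_a.aio_fildes = fd;
        cb_a.aio_buf = a;
        cb_a.aio_nbytes = len;
        cb_a.aio_offset = off_a;
        cb_a.aio_lio_opcode = LIO_READ;

        memset(&cb_b, 0, sizeof(cb_b));
        cb_b.aio_fildes = fd;
        cb_b.aio_buf = b;
        cb_b.aio_nbytes = len;
        cb_b.aio_offset = off_b;
        cb_b.aio_lio_opcode = LIO_READ;

        /* LIO_WAIT: lio_listio() returns only when both reads are done */
        return lio_listio(LIO_WAIT, list, 2, NULL);
}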

I might be alone in caring about having a less ridiculous POSIX AIO
interface in glibc, though, I'll admit. It seems like it'd be a
pretty sad missed opportunity if we rolled a fantastic general AIO
interface and left glibc to still screw around with its own manual
threading :/.

- z

2007-02-15 13:43:07

by Evgeniy Polyakov

[permalink] [raw]
Subject: Re: [patch 05/11] syslets: core code

On Wed, Feb 14, 2007 at 12:38:16PM -0800, Linus Torvalds ([email protected]) wrote:
> Or how would you do the trivial example loop that I explained was a good
> idea:
>
> struct one_entry *prev = NULL;
> struct dirent *de;
>
> while ((de = readdir(dir)) != NULL) {
>         struct one_entry *entry = malloc(..);
>
>         /* Add it to the list, fill in the name */
>         entry->next = prev;
>         prev = entry;
>         strcpy(entry->name, de->d_name);
>
>         /* Do the stat lookup async */
>         async_stat(de->d_name, &entry->stat_buf);
> }
> wait_for_async();
> .. Ta-daa! All done ..
>
>
> Notice? This also "chains system calls together", but it does it using a
> *much* more powerful entity called "user space". That's what user space
> is. And yeah, it's a pretty complex sequencer, but happily we have
> hardware support for accelerating it to the point that the kernel never
> even needs to care.
>
> The above is a *realistic* scenario, where you actually have things like
> memory allocation etc going on. In contrast, just chaining system calls
> together isn't a realistic scenario at all.

One can still perfectly well and easily use sys_async_exec(...stat()...)
in the above scenario. Although I do think that having a web server in
the kernel is overkill, having a proper state machine for good async
processing is a must.
Not that I agree that it should be done on top of syscalls as basic
elements, but it is an initial step.

> So I think we have one _known_ usage scenario:
>
> - replacing the _existing_ aio_read() etc system calls (with not just
> existing semantics, but actually binary-compatible)
>
> - simple code use where people are willing to perhaps do something
> Linux-specific, but because it's so _simple_, they'll do it.
>
> In neither case does the "chaining atoms together" seem to really solve
> the problem. It's clever, but it's not what people would actually do.

It is an example of what can be done. If one does not like it - do not use
it. A state machine is implemented in the sendfile() syscall - and although it
is not a good idea to have async sendfile as-is in a micro-thread design
(due to network blocking and small per-page reading), it is still a state
machine, which can be used with the syslet state machine (if it could be
extended).

> And yes, you can hide things like that behind an abstraction library, but
> once you start doing that, I've got three questions for you:
>
> - what's the point?
> - we're adding overhead, so how are we getting it back
> - how do we handle independent libraries each doing their own thing and
> version skew between them?
>
> In other words, the "let user space sort out the complexity" is not a good
> answer. It just means that the interface is badly designed.

Well, if we can set up an iocb structure, why can we not set up a syslet one?

Yes, with syscalls as state machine elements 99% of users will not use
it (I can only think of proper fadvise()+read()/sendfile() states),
but there is no problem in setting up a structure in userspace at all. And if
there is a possibility to use it for other things, it is definitely a win.

Actually the complex-structure-setup argument is stupid - everyone is forced
to use a timeval structure instead of a number of microseconds.

So there is no point in 'complex setup and overhead', but there is
a. the limit of the AIO (although my point is not to have a huge number of
   working threads - they were created by people who can not
   program state machines (c) Alan Cox)
b. the possibility to implement a state machine (in its current form it will
   likely not be used except maybe for some optional hints for IO tasks like
   fadvise)
c. in all other ways it has all the pros and cons of the micro-thread design
   (it looks neat and simple, although it is utterly broken in some usage
   cases).

> Linus

--
Evgeniy Polyakov

2007-02-15 16:16:16

by Linus Torvalds

[permalink] [raw]
Subject: Re: [patch 05/11] syslets: core code



On Thu, 15 Feb 2007, Evgeniy Polyakov wrote:
> >
> > In other words, the "let user space sort out the complexity" is not a good
> > answer. It just means that the interface is badly designed.
>
> Well, if we can setup iocb structure, why we can not setup syslet one?

(I'm cutting wildly, to try to get to the part I wanted to answer)

I actually think aio_read/write() and friends are *horrible* interfaces.

Here's a quick question: how many people have actually ever seen them used
in "normal code"?

Yeah. Nobody uses them. They're not all that portable (even within unixes
they aren't always there, much less in other places), they are fairly
obscure, and they are just not really easy to use.

Guess what? The same is going to be true *in*spades* for any Linux-
specific async system call thing.

This is why I think simplicity of use, along with transparency, is so
important. I think "aio_read()" is already a nasty enough interface, and
it's sure more portable than any Linux-specific extension will be (only
until the revolution comes, of course - at that point, everybody who
doesn't use Linux will be up against the wall, so we can solve the problem
that way).

So a Linux-specific extension needs to be *easier* to use or at least
understand, and have *more* obvious upsides than "aio_read()" has.
Otherwise, it's pointless - nobody is really going to use it.

This is why I think the main goals should be:

- the *internal* kernel goal of trying to replace some very specific
aio_read() etc code with something more maintainable.

This is really a maintainability argument, nothing more. Even if we
never do anything *but* aio_read() and friends, if we can avoid having
the VFS code have multiple code-paths and explicit checks for AIO, and
instead handle it more "automatically", I think it is already worth it.

- add extensions that people *actually*can*use* in practice.

And this is where "simple interfaces" comes in.

> So there is no point in 'complex setup and overhead', but there is
> a. limit of the AIO (although my point is not to have huge amount of
> working threads - they were created by people who can not
> program state machines (c) Alan Cox)
> b. possibility to implement a state machine (in current form likely will
> not be used except maybe some optional hints for IO tasks like
> fadvice)
> c. in all other ways it has all pros and cons of micro-thread design (it
> looks neat and simple, although is utterly broken in some usage
> cases).

I don't think the "atom" approach is bad per se. I think it could be fine
to have some state information in user space. It's just that I think
complex interfaces that people largely won't even use are a big mistake. We
should concentrate on usability first, and some excessive cleverness
really isn't a big advantage.

Being able to do a "open + stat" looks like a fine thing. But I doubt
you'll see a lot of other combinations.
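
A sketch of what that one combination would look like with the init_atom()
helper from the test code earlier in the thread (the fstat-related req
fields and the __NR_sys_fstat naming are assumed here, not taken from the
posted example):

/* req->fd = open(req->filename, O_RDONLY); stop the syslet on error */
req->filename_p = req->filename;
init_atom(req, &req->open_file, __NR_sys_open,
          &req->filename_p, &O_RDONLY_var, NULL, NULL, NULL, NULL,
          &req->fd, SYSLET_STOP_ON_NEGATIVE, &req->fstat_file);

/* fstat(req->fd, &req->stat_buf); NULL as next finishes the syslet */
req->stat_buf_p = &req->stat_buf;
init_atom(req, &req->fstat_file, __NR_sys_fstat,
          &req->fd, &req->stat_buf_p, NULL, NULL, NULL, NULL,
          NULL, 0, NULL);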

Linus

2007-02-15 16:38:58

by Evgeniy Polyakov

[permalink] [raw]
Subject: Re: [patch 05/11] syslets: core code

On Thu, Feb 15, 2007 at 08:09:54AM -0800, Linus Torvalds ([email protected]) wrote:
> > > In other words, the "let user space sort out the complexity" is not a good
> > > answer. It just means that the interface is badly designed.
> >
> > Well, if we can setup iocb structure, why we can not setup syslet one?
>
> (I'm cutting wildly, to try to get to the part I wanted to answer)
>
> I actually think aio_read/write() and friends are *horrible* interfaces.
>
> Here's a quick question: how many people have actually ever seen them used
> in "normal code"?

Agreed, the existing AIO interface is far from ideal IMO, but it is used. No
matter whether it is normal or not, AIO itself is not a normal interface -
there are no books about POSIX AIO, so no one knows about AIO at all.

> Yeah. Nobody uses them. They're not all that portable (even within unixes
> they aren't always there, much less in other places), they are fairly
> obscure, and they are just not really easy to use.
>
> Guess what? The same is going to be true *in*spades* for any Linux-
> specific async system call thing.
>
> This is why I think simplicity of use along with transparency, is so
> important. I think "aio_read()" is already a nasty enough interface, and
> it's sure more portable than any Linux-specific extension will be (only
> until the revolution comes, of course - at that point, everybody who
> doesn't use Linux will be up against the wall, so we can solve the problem
> that way).
>
> So a Linux-specific extension needs to be *easier* to use or at least
> understand, and have *more* obvious upsides than "aio_read()" has.
> Otherwise, it's pointless - nobody is really going to use it.

Userspace_API_is_the_ever_possible_last_thing_to_ever_think_about. Period
. // <- wrapped one

If a system is designed such that it breaks with API changes - that system
sucks wildly and should be thrown away. Syslets do not suffer from that.

We can have tons of interfaces any alien would be happy with (imho it is
not even the kernel's task at all) - a new table of syscalls, used the way
the usual ones are, for example.
And we will have async_stat() in exactly the same way, and people will use it.

It is not even a thing to discuss. There are other technical issues with
syslets yet to resolve. Once people are happy with the design of the system,
it is time to think about how it will look from the user's point of view.

syslet(__NR_stat) -> async_stat() - say it, and Ingo and other developers
will think about how to implement that, or start to discuss that it is a
bad interface and that something else should be invented instead.

If the interface sucks, then the _interface_ must be changed/extended/replaced.
If the overall design sucks, then it must be changed.

Solve problems one by one, instead of throwing something away just because it
uses a wild interface which can be changed in a minute.

> This is why I think the main goals should be:
>
> - the *internal* kernel goal of trying to replace some very specific
> aio_read() etc code with somethign more maintainable.
>
> This is really a maintainability argument, nothing more. Even if we
> never do anything *but* aio_read() and friends, if we can avoid having
> the VFS code have multiple code-paths and explicit checks for AIO, and
> instead handle it more "automatically", I think it is already worth it.
>
> - add extensions that people *actually*can*use* in practice.
>
> And this is where "simple interfaces" comes in.

There is absolutely _NO_ problem in having any interface people will use.
Which one do you want?
async_stat() instead of syslet(complex_struct_blah_sync)?
No problem - it is _really_ trivial to implement.
Ingo mentioned that it should be done, and it is a really simple task for
glibc, just like it is done for the usual syscalls - it has nothing to do
with the overall system design at all.

> > So there is no point in 'complex setup and overhead', but there is
> > a. limit of the AIO (although my point is not to have huge amount of
> > working threads - they were created by people who can not
> > program state machines (c) Alan Cox)
> > b. possibility to implement a state machine (in current form likely will
> > not be used except maybe some optional hints for IO tasks like
> > fadvice)
> > c. in all other ways it has all pros and cons of micro-thread design (it
> > looks neat and simple, although is utterly broken in some usage
> > cases).
>
> I don't think the "atom" approach is bad per se. I think it could be fine
> to have some state information in user space. It's just that I think
> complex interfaces that people largely won't even use is a big mistake. We
> should concentrate on usability first, and some excessive cleverness
> really isn't a big advantage.
>
> Being able to do a "open + stat" looks like a fine thing. But I doubt
> you'll see a lot of other combinations.

Then no problem.

The interface does suck, especially since it does not allow forbidding some
syscalls from async execution, but it is just a brick which allows a really
good system to be built out of it.

I personally vote for a table of async syscalls translated into
human-readable aliases like async_stat() and the like.

> Linus

--
Evgeniy Polyakov

2007-02-15 17:05:19

by Davide Libenzi

[permalink] [raw]
Subject: Re: [patch 05/11] syslets: core code

On Thu, 15 Feb 2007, Linus Torvalds wrote:

> I don't think the "atom" approach is bad per se. I think it could be fine
> to have some state information in user space. It's just that I think
> complex interfaces that people largely won't even use is a big mistake. We
> should concentrate on usability first, and some excessive cleverness
> really isn't a big advantage.
>
> Being able to do a "open + stat" looks like a fine thing. But I doubt
> you'll see a lot of other combinations.

I actually think that building chains of syscalls brings you back to a
multithreaded solution. Why? Because suddenly the service thread goes
from servicing a syscall (with possible cache-hit optimization) to
servicing a whole session. So the number of service threads needed (locked
down by a chain) becomes big, because requests go from being short-lived
syscalls to long-lived chains of them. Think about the trivial web server,
and think about a chain that does open->fstat->sendhdrs->sendfile after an
accept. What's the difference with a multithreaded solution that does
accept->clone and executes the above code in the new thread? Nada, NIL.
Actually, there is a difference. The standard multithreaded function is
easier to code in C than with the complex atom chains. The number of
service threads suddenly becomes proportional to the number of active
sessions.
The more I look at this, the more I think that async_submit should submit
simple syscalls, or an array of them (unrelated/parallel).



- Davide


2007-02-15 17:17:25

by Ulrich Drepper

[permalink] [raw]
Subject: Re: [patch 05/11] syslets: core code

Linus Torvalds wrote:
> Here's a quick question: how many people have actually ever seen them used
> in "normal code"?
>
> Yeah. Nobody uses them. They're not all that portable (even within unixes
> they aren't always there, much less in other places), they are fairly
> obscure, and they are just not really easy to use.

That's nonsense. They are widely used (just hear people scream if
something changes or breaks) and they are available on all Unix
implementations which are not geared towards embedded use. POSIX makes
AIO in the next revision mandatory.

Just because you don't like it, don't discount it. Yes, the interface
is not the best. But this is what you get if you cannot dictate
interfaces to everybody. You have to make concessions.

--
➧ Ulrich Drepper ➧ Red Hat, Inc. ➧ 444 Castro St ➧ Mountain View, CA ❖



2007-02-15 17:23:30

by Evgeniy Polyakov

[permalink] [raw]
Subject: Re: [patch 05/11] syslets: core code

On Thu, Feb 15, 2007 at 09:05:13AM -0800, Davide Libenzi ([email protected]) wrote:
> On Thu, 15 Feb 2007, Linus Torvalds wrote:
>
> > I don't think the "atom" approach is bad per se. I think it could be fine
> > to have some state information in user space. It's just that I think
> > complex interfaces that people largely won't even use is a big mistake. We
> > should concentrate on usability first, and some excessive cleverness
> > really isn't a big advantage.
> >
> > Being able to do a "open + stat" looks like a fine thing. But I doubt
> > you'll see a lot of other combinations.
>
> I actually think that building chains of syscalls bring you back to a
> multithreaded solution. Why? Because suddendly the service thread become
> from servicing a syscall (with possible cachehit optimization), to
> servicing a whole session. So the number of service threads needed (locked
> down by a chain) becomes big because requests goes from being short-lived
> syscalls to long-lived chains of them. Think about the trivial web server,
> and think about a chain that does open->fstat->sendhdrs->sendfile after an
> accept. What's the difference with a multithreaded solution that does
> accept->clone and execute the above code in the new thread? Nada, NIL.

That is more of an ideological question about the micro-thread design in
general. If a syslet is able to perform only one syscall, one will have 4
threads for the above case, not one, so it is even more broken.

So, if Linux moves that way of doing AIO (IMO incorrect - I think the
correct state machine is made not of syscalls but of specially crafted
entries, like populate pages into the VFS, send a chunk, receive a chunk
without blocking and continue on completion, and the like), syslets with
attached state machines are the best (smallest evil) choice.

> Actually, there is a difference. The standard multithreaded function is
> easier to code in C than with the complex atoms chains. The number of
> service thread becomes suddendly proportional to the number of active
> sessions.
> The more I look at this, the more I think that async_submit should submit
> simple syscalls, or an array of them (unrelated/parallel).

That is the case - atom items (I do hope that this subsystem will be
able to perform not only syscalls, but calls to any kernel interface with a
suitable prototype; v2 seems to move in that direction) are called
asynchronously from the main userspace thread to achieve maximum performance.

> - Davide
>

--
Evgeniy Polyakov

2007-02-15 17:39:40

by Davide Libenzi

[permalink] [raw]
Subject: Re: [patch 05/11] syslets: core code

On Thu, 15 Feb 2007, Evgeniy Polyakov wrote:

> On Thu, Feb 15, 2007 at 09:05:13AM -0800, Davide Libenzi ([email protected]) wrote:
> >
> > I actually think that building chains of syscalls bring you back to a
> > multithreaded solution. Why? Because suddendly the service thread become
> > from servicing a syscall (with possible cachehit optimization), to
> > servicing a whole session. So the number of service threads needed (locked
> > down by a chain) becomes big because requests goes from being short-lived
> > syscalls to long-lived chains of them. Think about the trivial web server,
> > and think about a chain that does open->fstat->sendhdrs->sendfile after an
> > accept. What's the difference with a multithreaded solution that does
> > accept->clone and execute the above code in the new thread? Nada, NIL.
>
> That is more ideological question about micro-thread design at all.
> If syslet will be able to perform only one syscall, one will have 4
> threads for above case, not one, so it is even more broken.

Nope, just one thread. Well, two, if you consider the "main" dispatch
thread, and the syscall service thread.



> So, if Linux moves that way of doing AIO (IMO incorrect, I think that
> the correct state machine made not of syscalls, but specially crafted
> entries - like populate pages into VFS, send chunk, recv chunk without
> blocking and continue on completion and the like), syslets with attached
> state machines are the (smallest evil) best choice.

But at that point you don't need to have complex atom interfaces, with
chains, whips and leather pants :) Just code it in C and submit that to
the async engine. The longer the chain though, the closer you get to a
fully multithreaded solution, in terms of service thread consumption. And
what do you save WRT a multithreaded solution? Not thread
creation/destruction, because that cost is fully amortized inside the chain
execution cost (plus a pool would even save that).
IMO the plus of a generic async engine is mostly from a kernel code
maintenance POV. You no longer need to have AIO-aware code paths, which
automatically translates to smaller and more maintainable code.



- Davide


2007-02-15 17:49:11

by Linus Torvalds

[permalink] [raw]
Subject: Re: [patch 05/11] syslets: core code



On Thu, 15 Feb 2007, Evgeniy Polyakov wrote:
>
> Userspace_API_is_the_ever_possible_last_thing_to_ever_think_about. Period
> . // <- wrapped one

No, I really think you're wrong.

In many ways, the interfaces and especially data structures are *more*
important than the code.

The code we can fix. The interfaces, on the other hand, we'll have to live
with forever.

So complex interfaces that expose lots of implementation detail are not a
good thing, and it's _not_ the last thing you want to think about. Complex
interfaces with a lot of semantic knowledge seriously limit how you can
fix things up later.

In contrast, simple interfaces that have clear and unambiguous semantics
and that can be explained at a conceptual level are things that you can
often implement in many different ways. So the interface isn't the
bottleneck: you may have to have a "backwards compatibility layer" for it.

> If system is designed that with API changes it breaks - that system sucks
> wildly and should be thrown away. Syslets do not suffer from that.

The syslet code itself looks fine. It's the user-visible part I'm not
convinced about.

I'm just saying: how would you use this for existing programs?

For something this machine-specific, you're not going to have any big
project written around the "async atom" code. So realistically, the kinds
of usage we'd see is likely some compile-time configuration option, where
people replace some specific sequence of code with another one. THAT is
what we should aim to make easy and flexible, I think. And that is where
interfaces really are as important as code.

We know one interface: the current aio_read() one. Nobody really _likes_
it (even database people would apparently like to extend it), but it has
the huge advantage of "being there", and having real programs that really
care that use it today.

Others? We don't know yet. And exposing complex interfaces that may not be
the right ones is much *worse* than exposing simple interfaces (that
_also_ may not be the right ones, of course - but simple and
straightforward interfaces with obvious and not-very-complex semantics are
a lot easier to write compatibility layers for if the internal code
changes radically)

Linus

2007-02-15 18:03:18

by Evgeniy Polyakov

[permalink] [raw]
Subject: Re: [patch 05/11] syslets: core code

On Thu, Feb 15, 2007 at 09:39:33AM -0800, Davide Libenzi ([email protected]) wrote:
> On Thu, 15 Feb 2007, Evgeniy Polyakov wrote:
>
> > On Thu, Feb 15, 2007 at 09:05:13AM -0800, Davide Libenzi ([email protected]) wrote:
> > >
> > > I actually think that building chains of syscalls bring you back to a
> > > multithreaded solution. Why? Because suddendly the service thread become
> > > from servicing a syscall (with possible cachehit optimization), to
> > > servicing a whole session. So the number of service threads needed (locked
> > > down by a chain) becomes big because requests goes from being short-lived
> > > syscalls to long-lived chains of them. Think about the trivial web server,
> > > and think about a chain that does open->fstat->sendhdrs->sendfile after an
> > > accept. What's the difference with a multithreaded solution that does
> > > accept->clone and execute the above code in the new thread? Nada, NIL.
> >
> > That is more ideological question about micro-thread design at all.
> > If syslet will be able to perform only one syscall, one will have 4
> > threads for above case, not one, so it is even more broken.
>
> Nope, just one thread. Well, two, if you consider the "main" dispatch
> thread, and the syscall service thread.

Argh, if they are supposed to run synchronously - for example, stat can be
done in parallel with sendfile in the above example - but generally yes, one
execution thread.

> > So, if Linux moves that way of doing AIO (IMO incorrect, I think that
> > the correct state machine made not of syscalls, but specially crafted
> > entries - like populate pages into VFS, send chunk, recv chunk without
> > blocking and continue on completion and the like), syslets with attached
> > state machines are the (smallest evil) best choice.
>
> But at that point you don't need to have complex atom interfaces, with
> chains, whips and leather pants :) Just code it in C and submit that to
> the async engine. The longer is the chain though, the closer you get to a
> fully multithreaded solution, in terms of service thread consuption. And
> what do you save WRT a multithreaded solution? Not thread
> creation/destroy, because that cost is fully amortized inside the chain
> execution cost (plus a pool would even save that).
> IMO the plus of a generic async engine is mostly from a kernel code
> maintainance POV. You don't need anymore to have AIO-aware code paths,
> that automatically transalte to smaller and more maintainable code.

It is completely possible to not wire up several syscalls and just use
only one per async call, but _if_ such a requirement arises, the whole
infrastructure is there.

> - Davide
>

--
Evgeniy Polyakov

2007-02-15 18:12:34

by Evgeniy Polyakov

[permalink] [raw]
Subject: Re: [patch 05/11] syslets: core code

On Thu, Feb 15, 2007 at 09:42:32AM -0800, Linus Torvalds ([email protected]) wrote:
>
>
> On Thu, 15 Feb 2007, Evgeniy Polyakov wrote:
> >
> > Userspace_API_is_the_ever_possible_last_thing_to_ever_think_about. Period
> > . // <- wrapped one
>
> No, I really think you're wrong.
>
> In many ways, the interfaces and especially data structures are *more*
> important than the code.
>
> The code we can fix. The interfaces, on the other hand, we'll have to live
> with forever.
>
> So complex interfaces that expose lots of implementation detail are not a
> good thing, and it's _not_ the last thing you want to think about. Complex
> interfaces with a lot of semantic knowledge seriously limit how you can
> fix things up later.
>
> In contrast, simple interfaces that have clear and unambiguous semantics
> and that can be explained at a conceptual level are things that you can
> often implement in many different ways. So the interface isn't the bottle
> neck: you may have to have a "backwards compatibility layer" for it

That's exactly the way we should discuss it - you do not like that
interface, but Ingo proposed a way to change that via a table of async
syscalls - people ask, people answer - so eventually the interface and (if
any) other problems get resolved.

> > If system is designed that with API changes it breaks - that system sucks
> > wildly and should be thrown away. Syslets do not suffer from that.
>
> The syslet code itself looks fine. It's the user-visible part I'm not
> convinced about.
>
> I'm just saying: how would use use this for existing programs?
>
> For something this machine-specific, you're not going to have any big
> project written around the "async atom" code. So realistically, the kinds
> of usage we'd see is likely some compile-time configuration option, where
> people replace some specific sequence of code with another one. THAT is
> what we should aim to make easy and flexible, I think. And that is where
> interfaces really are as important as code.
>
> We know one interface: the current aio_read() one. Nobody really _likes_
> it (even database people would apparently like to extend it), but it has
> the huge advantage of "being there", and having real programs that really
> care that use it today.
>
> Others? We don't know yet. And exposing complex interfaces that may not be
> the right ones is much *worse* than exposing simple interfaces (that
> _also_ may not be the right ones, of course - but simple and
> straightforward interfaces with obvious and not-very-complex semantics are
> a lot easier to write compatibility layers for if the internal code
> changes radically)

So we just need to describe the way we want to see the new interface -
that's it.

Here is a stub for async_stat() - probably a broken example, but that does
not matter - this interface is really easy to change.

static void syslet_setup(struct syslet *s, int nr, void *arg1...)
{
        s->flags = ...
        s->arg[1] = arg1;
        ....
}

long glibc_async_stat(const char *path, struct stat *buf)
{
        /* What about making syslet and/or set of atoms per thread and preallocate
         * them when working threads are allocated? */
        struct syslet s;

        syslet_setup(&s, __NR_stat, path, buf, NULL, NULL, NULL, NULL);
        return async_submit(&s);
}

> Linus

--
Evgeniy Polyakov

2007-02-15 18:32:15

by Linus Torvalds

[permalink] [raw]
Subject: Re: [patch 05/11] syslets: core code



On Thu, 15 Feb 2007, Evgeniy Polyakov wrote:
>
> So we just need to describe the way we want to see new interface -
> that's it.

Agreed. Absolutely.

But please keep the kernel interface as part of that. Not just a strange
and complex kernel interface and then _usable_ library interfaces that use
the strange and complex one internally. Because if the complex one has no
validity on its own, it's just (a) a bitch to debug and (b) if we ever
change any details inside the kernel we'll end up with a lot of subtle
code where user land creates complex data, and the kernel just reads it,
and both just (unnecessarily) work around the fact that the other doesn't
do the straightforward thing.

> Here is a stub for async_stat() - probably broken example, but that does
> not matter - this interface is really easy to change.
>
> static void syslet_setup(struct syslet *s, int nr, void *arg1...)
> {
> s->flags = ...
> s->arg[1] = arg1;
> ....
> }
>
> long glibc_async_stat(const char *path, struct stat *buf)
> {
> /* What about making syslet and/or set of atoms per thread and preallocate
> * them when working threads are allocated? */
> struct syslet s;
> syslet_setup(&s, __NR_stat, path, buf, NULL, NULL, NULL, NULL);
> return async_submit(&s);
> }

And this is a classic example of potentially totally buggy code.

Why? You're releasing the automatic variable on the stack before it's
necessarily all used!

So now you need to do a _longterm_ allocation, and that in turn means that
you need to do a long-term de-allocation!

Ok, so do we make the rule be that all atoms *have* to be read fully
before we start the async submission (so that the caller doesn't need to
do a long-term allocation)?

Or do we make the rule be that just the *first* atom is copied by the
kernel before the async_submit() returns, and thus it's ok to do the above
*IFF* you only have a single system call?

See? The example you tried to use to show how "simple" the interface is was
actually EXACTLY THE REVERSE. It shows how subtle bugs can creep in!

Linus

2007-02-15 18:47:00

by bert hubert

[permalink] [raw]
Subject: Re: [patch 05/11] syslets: core code

On Thu, Feb 15, 2007 at 09:42:32AM -0800, Linus Torvalds wrote:

> We know one interface: the current aio_read() one. Nobody really _likes_
[...]

> Others? We don't know yet. And exposing complex interfaces that may not be
> the right ones is much *worse* than exposing simple interfaces (that
> _also_ may not be the right ones, of course - but simple and

From humble userland, here are two things I'd hope to be able to do, although
I admit my needs are rather specialist.

1) batch, and wait for, with proper error reporting:
        socket();
        [ setsockopt(); ]
        bind();
        connect();
        gettimeofday();   // doesn't *always* happen
        send();
        recv();
        gettimeofday();   // doesn't *always* happen

I go through this sequence for each outgoing powerdns UDP query
because I need a new random source port for each query, and I
connect because I care about errors. Linux does not give me random
source ports for UDP sockets.

When async, I can probably just drop the setsockopt (for
nonblocking). I already batch the gettimeofday to 'once per epoll
return', but quite often this is once per packet.

2) On the client facing side (port 53), I'd very much hope for a way to
do 'recvv' on datagram sockets, so I can retrieve a whole bunch of
UDP datagrams with only one kernel transition.

This would mean that I batch up either 10 calls to recv(), or one
'atom' of 10 recv's.

Both 1 and 2 are currently limiting factors when I enter the 100kqps domain
of name serving. This doesn't mean the rest of my code is as tight as it
could be, but I spend a significant portion of time in the kernel even at
moderate (10kqps effective) loads, even though I already use epoll. A busy
PowerDNS recursor typically spends 25% to 50% of its time on 'sy' load.

This might be due to my use of get/set/swap/makecontext though.
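
A sketch of the shape such a batched datagram receive could take; neither the
struct nor the call exist, this only illustrates the interface being wished
for:

#include <sys/types.h>
#include <sys/socket.h>

/* one slot per datagram to be pulled out of the socket */
struct dgram_slot {
        void                    *buf;     /* where to place the payload   */
        size_t                  buflen;   /* capacity of buf              */
        ssize_t                 len;      /* filled in: size or -errno    */
        struct sockaddr_storage from;     /* filled in: sender address    */
};

/* hypothetical: fill up to nslots datagrams in one kernel transition and
 * return the number of slots actually filled (0 if none are ready):
 *
 *      long recvv(int sockfd, struct dgram_slot *slots,
 *                 unsigned int nslots, int flags);
 */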

Bert

--
http://www.PowerDNS.com Open source, database driven DNS Software
http://netherlabs.nl Open and Closed source services

2007-02-15 19:07:23

by Evgeniy Polyakov

[permalink] [raw]
Subject: Re: [patch 05/11] syslets: core code

On Thu, Feb 15, 2007 at 10:25:37AM -0800, Linus Torvalds ([email protected]) wrote:
> > static void syslet_setup(struct syslet *s, int nr, void *arg1...)
> > {
> > s->flags = ...
> > s->arg[1] = arg1;
> > ....
> > }
> >
> > long glibc_async_stat(const char *path, struct stat *buf)
> > {
> > /* What about making syslet and/or set of atoms per thread and preallocate
> > * them when working threads are allocated? */
> > struct syslet s;
> > syslet_setup(&s, __NR_stat, path, buf, NULL, NULL, NULL, NULL);
> > return async_submit(&s);
> > }
>
> And this is a classic example of potentially totally buggy code.
>
> Why? You're releasing the automatic variable on the stack before it's
> necessarily all used!
>
> So now you need to do a _longterm_ allocation, and that in turn means that
> you need to do a long-term de-allocation!
>
> Ok, so do we make the rule be that all atoms *have* to be read fully
> before we start the async submission (so that the caller doesn't need to
> do a long-term allocation)?
>
> Or do we make the rule be that just the *first* atom is copied by the
> kernel before the async_sumbit() returns, and thus it's ok to do the above
> *IFF* you only have a single system call?
>
> See? The example you tried to use to show how "simple" the interface iswas
> actually EXACTLY THE REVERSE. It shows how subtle bugs can creep in!

So describe: what are the requirements (constraints)?

The above example has exactly one syscall in the chain, so it is ok, but
generally it is not correct.

So instead there will be
        s = atom_create_and_add(__NR_stat, path, stat, NULL, NULL, NULL, NULL);
and the atom can then be freed in the glibc_async_wait() wrapper, just before
returning data to userspace.

There are millions of possible ways to do that, but which exact one
should be used, in your point of view? Describe _your_ vision of that path.

Currently the generic example is the following:
        allocate mem
        setup complex structure
        submit syscall
        wait syscall
        free mem

The first two can be hidden in glibc setup/startup code, the last one in
the waiting or cleanup entry.

Or it can be this one (just an idea):

glibc_async_stat(path, &stat);

int glibc_async_stat(char *path, struct stat *stat)
{
        struct pthread *p;

        asm ("movl %%gs:0, %0" : "=r" (p));

        atom = allocate_new_atom_and_setup_initial_values();
        setup_atom(atom, __NR_stat, path, stat, ...);
        add_atom_into_private_tree(p, atom);
        return async_submit(atom);
}

glibc_async_wait()
{
        struct pthread *p;

        asm ("movl %%gs:0, %0" : "=r" (p));

        cookie = sys_async_wait();
        atom = search_for_cookie_and_remove(p);
        free_atom(atom);
}

Although that cruft might need to be extended...

So, describe how exactly _you_ think it should be implemented, with its
pros and cons, so that the system could be adopted without trying to
mind-read what is simple and good or complex and really bad.

> Linus

--
Evgeniy Polyakov

2007-02-15 19:12:41

by Evgeniy Polyakov

[permalink] [raw]
Subject: Re: [patch 05/11] syslets: core code

On Thu, Feb 15, 2007 at 07:46:56PM +0100, bert hubert ([email protected]) wrote:
> 1) batch, and wait for, with proper error reporting:
> socket();
> [ setsockopt(); ]
> bind();
> connect();
> gettimeofday(); // doesn't *always* happen
> send();
> recv();
> gettimeofday(); // doesn't *always* happen
>
> I go through this sequence for each outgoing powerdns UDP query
> because I need a new random source port for each query, and I
> connect because I care about errrors. Linux does not give me random
> source ports for UDP sockets.

What about a setsockopt or just random port selection patch? :)

> When async, I can probably just drop the setsockopt (for
> nonblocking). I already batch the gettimeofday to 'once per epoll
> return', but quite often this is once per packet.
>
> 2) On the client facing side (port 53), I'd very much hope for a way to
> do 'recvv' on datagram sockets, so I can retrieve a whole bunch of
> UDP datagrams with only one kernel transition.
>
> This would mean that I batch up either 10 calls to recv(), or one
> 'atom' of 10 recv's.
>
> Both 1 and 2 are currently limiting factors when I enter the 100kqps domain
> of name serving. This doesn't mean the rest of my code is as tight as it
> could be, but I spend a significant portion of time in the kernel even at
> moderate (10kqps effective) loads, even though I already use epoll. A busy
> PowerDNS recursor typically spends 25% to 50% of its time on 'sy' load.
>
> This might be due to my use of get/set/swap/makecontext though.

It is only about one syscall in get and set/swap context, btw, so it
should not be a major factor, should it?

As an advertisement note: if you have a lot of network events per epoll
return, try to use kevent - its socket notifications do not require an
additional traversal of the list of ready events as in poll usage.

> Bert
>
> --
> http://www.PowerDNS.com Open source, database driven DNS Software
> http://netherlabs.nl Open and Closed source services

--
Evgeniy Polyakov

2007-02-15 19:22:50

by Zach Brown

[permalink] [raw]
Subject: Re: [patch 05/11] syslets: core code

> 2) On the client facing side (port 53), I'd very much hope for a
> way to
> do 'recvv' on datagram sockets, so I can retrieve a whole bunch of
> UDP datagrams with only one kernel transition.

I want to highlight this point that Bert is making.

Whenever we talk about AIO and kernel threads some folks are rightly
concerned that we're talking about a thread *per IO* and fear that
memory consumption will be fatal.

Take the case of userspace which implements what we'd think of as
page cache writeback. (*coughs, points at email address*). It wants
to issue thousands of IOs to disjoint regions of a file. "Thousands
of kernel threads, oh crap!"

But it only issues each IO with a separate syscall (or io_submit()
op) because it doesn't have an interface that lets it specify IOs
that vector user memory addresses *and file position*.

If we had a seemingly obvious interface that let it kick off batched
IOs to different parts of the file, the looming disaster of a thread
per IO vanishes in that case.

struct off_vec {
        off_t   pos;
        size_t  len;
};

long sys_sgwrite(int fd, struct iovec *memvec, size_t mv_count,
                 struct off_vec *ovec, size_t ov_count);
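
As a sketch of how the userspace-writeback case could use it, assuming the
proposed (non-existent) sys_sgwrite() above and 4K buffers:

#include <sys/uio.h>

/* write three dirty 4K buffers to three disjoint file offsets in one call */
static long writeback_three_pages(int fd, void *b0, void *b1, void *b2)
{
        struct iovec mem[3] = {
                { .iov_base = b0, .iov_len = 4096 },
                { .iov_base = b1, .iov_len = 4096 },
                { .iov_base = b2, .iov_len = 4096 },
        };
        struct off_vec pos[3] = {
                { .pos =   0 * 4096, .len = 4096 },
                { .pos =  37 * 4096, .len = 4096 },
                { .pos = 512 * 4096, .len = 4096 },
        };

        return sys_sgwrite(fd, mem, 3, pos, 3);
}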

It doesn't take long to imagine other uses for this that are less
exotic.

Take e2fsck and its iterating through indirect blocks or directory
data blocks. It has a list of disjoint file regions (blocks) it
wants to read, but it does them serially to keep the code from
getting even more confusing. blktrace a clean e2fsck -f some time..
it's leaving *HALF* of the disk read bandwidth on the table by
performing serial block-sized reads. If it could specify batches of
them the code would still be simple but it could tell the kernel and
IO scheduler *exactly* what it wants, without having to mess around
with sys_readahead() or AIO or any of that junk :).

Anyway, that's just something that's been on my mind. If there are
obvious clean opportunities to get more done with single syscalls, it
might not be such a bad thing.

- z

2007-02-15 19:26:18

by Eric Dumazet

[permalink] [raw]
Subject: Re: [patch 05/11] syslets: core code

On Thursday 15 February 2007 19:46, bert hubert wrote:

> Both 1 and 2 are currently limiting factors when I enter the 100kqps domain
> of name serving. This doesn't mean the rest of my code is as tight as it
> could be, but I spend a significant portion of time in the kernel even at
> moderate (10kqps effective) loads, even though I already use epoll. A busy
> PowerDNS recursor typically spends 25% to 50% of its time on 'sy' load.

Well, I guess in your workload most of the system overhead is due to socket
creation/destruction, UDP/IP stack work, the NIC driver, interrupts... I really
doubt async_io could help you... Do you have some oprofile results to share
with us?

2007-02-15 19:34:48

by Linus Torvalds

[permalink] [raw]
Subject: Re: [patch 05/11] syslets: core code



On Thu, 15 Feb 2007, Evgeniy Polyakov wrote:
> >
> > See? The example you tried to use to show how "simple" the interface iswas
> > actually EXACTLY THE REVERSE. It shows how subtle bugs can creep in!
>
> So describe what are the requirements (constraints)?
>
> Above example has exactly one syscall in the chain, so it is ok, but
> generally it is not correct.

Well, it *could* be correct. It depends on the semantics of the atom
fetching. If we make the semantics be that the first atom is fetched
entirely synchronously, then we could make the rule be that single-syscall
async things can do their job with a temporary allocation.

So that wasn't my point. My point was that a complicated interface that
uses indirection actually has subtle issues. You *thought* you were doing
something simple, and you didn't even realize the subtle assumptions you
made.

THAT was the point. Interfaces are really really subtle and important.
It's absolutely not a case of "we can just write wrappers to fix up any
library issues".

> So instead there will be
> s = atom_create_and_add(__NR_stat, path, stat, NULL, NULL, NULL, NULL);
> atom then can be freed in the glibc_async_wait() wrapper just before
> returning data to userspace.

So now you add some kind of allocation/deallocation thing. In user space or
in the kernel?

> There are millions of possible ways to do that, but what exactly one
> should be used from your point of view? Describe _your_ vision of that path.

My vision is that we should be able to do the simple things *easily* and
without any extra overhead.

And doing wrappers in user space is almost entirely unacceptable, because
a lot of the wrapping needs to be done at release time (for example:
de-allocating memory), and that means that you no longer can do simple
system calls that don't even need release notification AT ALL.

> Currently generic example is following:
> allocate mem
> setup complex structure
> submit syscall
> wait syscall
> free mem

And that "allocate mem" and "free mem" is a problem. It's not just a
performance problem, it is a _complexity_ problem. It means that people
have to track things that they are NOT AT ALL INTERESTED IN!

> So, describe how exactly _you_ think it should be implemented with its
> pros and cons, so that systemn could be adopted without trying to
> mind-read of what is simple and good or complex and really bad.

So I think that a good implementation just does everything up-front, and
doesn't _need_ a user buffer that is live over longer periods, except for
the actual results. Exactly because the whole alloc/teardown is nasty.

And I think a good implementation doesn't need wrapping in user space to
be useful - at *least* not wrapping at completion time, which is the
really difficult one (since, by definition, in an async world completion
is separated from the initial submit() event, and with kernel-only threads
you actually want to *avoid* having to do user code after the operation
completed).

I suspect Ingo's thing can do that. But I also suspect (nay, _know_, from
this discussion), that you didn't even think of the problems.

Linus

2007-02-15 20:13:29

by Linus Torvalds

[permalink] [raw]
Subject: Re: [patch 05/11] syslets: core code



On Thu, 15 Feb 2007, Linus Torvalds wrote:
>
> So I think that a good implementation just does everything up-front, and
> doesn't _need_ a user buffer that is live over longer periods, except for
> the actual results. Exactly because the whole alloc/teardown is nasty.

Btw, this doesn't necessarily mean "not supporting multiple atoms at all".

I think the batching of async things is potentially a great idea. I think
it's quite workable for "open+fstat" kind of things, and I agree that it
can solve other things too (the "socket+bind+connect+sendmsg+rcv" kind of
complex setup things).

But I suspect that if we just said:
- we limit these atom sequences to just linear sequences of max "n" ops
- we read them all in in a single go at startup

we actually avoid several nasty issues. Not just the memory allocation
issue in user space (now it's perfectly ok to build up a sequence of ops
in temporary memory and throw it away once it's been submitted), but also
issues like the 32-bit vs 64-bit compatibility stuff (the compat handlers
would just convert it when they do the initial copying, and then the
actual run-time wouldn't care about user-level pointers having different
sizes etc).

Would it make the interface less cool? Yeah. Would it limit it to just a
few linked system calls (to avoid memory allocation issues in the kernel)?
Yes again. But it would simplify a lot of the interface issues.
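
A sketch of what such a flattened, copied-up-front submission could look
like; nothing here is a real interface, it just illustrates the "linear
sequence of max n ops, read in one go" idea:

#define ASYNC_SEQ_MAX   8

struct async_call {
        unsigned long   nr;             /* syscall number     */
        unsigned long   args[6];        /* arguments by value */
};

struct async_seq {
        unsigned int            nr_calls;       /* <= ASYNC_SEQ_MAX */
        struct async_call       calls[ASYNC_SEQ_MAX];
        long                    results[ASYNC_SEQ_MAX]; /* written by the kernel */
};

/* e.g.: long sys_async_submit_seq(const struct async_seq *seq);
 * the kernel copies *seq completely at submit time, so the caller can
 * build it in temporary memory and throw it away immediately. */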

It would _also_ allow the "sys_aio_read()" function to build up its
*own* set of atoms in kernel space to actually do the read, and there
would be no impact of the actual run-time wanting to read stuff from user
space. Again - it's actually the same issue as with the compat system
call: by making the interfaces do things up-front rather than dynamically,
it becomes more static, but also easier to do interface translations. You
can translate into any arbitrary internal format _once_, and be done with
it.

I dunno.

Linus

2007-02-15 21:17:18

by Davide Libenzi

[permalink] [raw]
Subject: Re: [patch 05/11] syslets: core code

On Thu, 15 Feb 2007, Linus Torvalds wrote:

>
>
> On Thu, 15 Feb 2007, Linus Torvalds wrote:
> >
> > So I think that a good implementation just does everything up-front, and
> > doesn't _need_ a user buffer that is live over longer periods, except for
> > the actual results. Exactly because the whole alloc/teardown is nasty.
>
> Btw, this doesn't necessarily mean "not supporting multiple atoms at all".
>
> I think the batching of async things is potentially a great idea. I think
> it's quite workable for "open+fstat" kind of things, and I agree that it
> can solve other things too (the "socket+bind+connect+sendmsg+rcv" kind of
> complex setup things).

If you *really* want to allow chains (note that the above could
prolly be hosted on a real thread, once chains become that long), then
try to build that chain with the current API, and compare it with:

long my_clet(ctx *c) {
        int fd, error = -1;

        if ((fd = socket(...)) == -1 ||
            bind(fd, &c->laddr, sizeof(c->laddr)) ||
            connect(fd, &c->saddr, sizeof(c->saddr)) ||
            sendmsg(fd, ...) == -1 ||
            recv(fd, ...) <= 0)
                goto exit;
        error = 0;
exit:
        close(fd);
        return error;
}

Points:

- Keep the submission API to submit one or an array of parallel async
syscalls/clets

- Keep the syscall arguments as longs (no need for extra pointer
indirection compat code, or special copy_atoms functions)

- No need for the "next" atom pointer chaining (nice for compat too)

- No need to create special condition/jump interpreters inside the kernel
(nice for compat and emulators) - the C compiler turns that into machine
code for us

- Easier to code. Try to build a chain like that with the current API and
you will see what I am saying

- Did I say faster? Machine code is faster than pseudo-VM interpretation of
jumps/conditions done inside the kernel
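
For illustration, a sketch of how such a clet might be handed to the submission API described in the first point (struct ctx and async_submit() are made-up names used only for this sketch):

#include <netinet/in.h>

/* per-request context the clet reads its parameters from */
struct ctx {
        struct sockaddr_in laddr;       /* address for bind() */
        struct sockaddr_in saddr;       /* address for connect() */
        /* send/receive buffers would live here as well */
};

/*
 * Hypothetical call: queue the function to run asynchronously, as
 * ordinary user-mode machine code, under the async scheduler.
 *
 * Usage would be something like:
 *
 *      struct ctx c = { .laddr = my_laddr, .saddr = my_saddr };
 *      async_submit(my_clet, &c);
 */
extern long async_submit(long (*clet)(struct ctx *), struct ctx *c);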




- Davide


2007-02-15 22:34:15

by Michael K. Edwards

[permalink] [raw]
Subject: Re: [patch 05/11] syslets: core code

On 2/15/07, Linus Torvalds <[email protected]> wrote:
> Would it make the interface less cool? Yeah. Would it limit it to just a
> few linked system calls (to avoid memory allocation issues in the kernel)?
> Yes again. But it would simplify a lot of the interface issues.

Only in toy applications. Real userspace code that lives between
networks+disks and impatient humans is 80% exception handling,
logging, and diagnostics. If you can't do any of that between stages
of an async syscall chain, you're fscked when it comes to performance
analysis (the "which 10% of the traffic do we not abort under
pressure" kind, not the "cut overhead by 50%" kind). Not that real
userspace code could get any simpler by using this facility anyway,
since you can't jump the queue, cancel in bulk, or add cleanup hooks.

Efficiently interleaved execution of high-latency I/O chains would be
nice. Low overhead for cache hits would be nicer. But at least for the
workloads that interest me, neither is anywhere near as important as
the ability to say, "This 10% (or 90%) of my requests are going to
take forever? Nevermind -- but don't cancel the 1% I can't do
without."

This is not a scheduling problem, it is a caching problem. Caches are
data structures, not thread pools. Assume that you need to design for
dynamic reprioritization, speculative fetch, and opportunistic flush,
even if you don't implement them at first. Above all, stay out of the
way when a synchronous request misses cache -- and when application
code decides that a bunch of its outstanding requests are no longer
interesting, take the hint!

Oh, and while you're at it: I'd like to program AIO facilities using a
C compiler with an explicitly parallel construct -- something along
the lines of:

try (my_aio_batch, initial_priority, ...) {
} catch {
} finally {
}

Naturally the compiler will know how to convert synchronous syscalls
to their asynchronous equivalent, will use an analogue of IEEE NaNs to
minimize the hits to the exception path, and won't let you call
functions that aren't annotated as safe in IO completion context. I
would also like five acres in town and a pony.

Cheers,
- Michael

2007-02-16 08:58:32

by Evgeniy Polyakov

[permalink] [raw]
Subject: Re: [patch 05/11] syslets: core code

On Thu, Feb 15, 2007 at 11:28:57AM -0800, Linus Torvalds ([email protected]) wrote:
> THAT was the point. Interfaces are really really subtle and important.
> It's absolutely not a case of "we can just write wrappers to fix up any
> library issues".

Interfaces can be created and destroyed - they do not affect the overall
system design in any way (well, if they do, something is broken).
So let's solve problems in the order of their appearance - if interfaces
are more important to you than the overall design, that is a problem I think.

> > So instead there will be
> > s = atom_create_and_add(__NR_stat, path, stat, NULL, NULL, NULL, NULL);
> > atom then can be freed in the glibc_async_wait() wrapper just before
> > returning data to userspace.
>
> So now you add some kind of allocation/deallocation thing. In user space or
> in the kernel?

In userspace.
It was not added by me - it is just a wrapper.

> > There are millions of possible ways to do that, but what exactly one
> > should be used from your point of view? Describe _your_ vision of that path.
>
> My vision is that we should be able to do the simple things *easily* and
> without any extra overhead.
>
> And doing wrappers in user space is almost entirely unacceptable, because
> a lot of the wrapping needs to be done at release time (for example:
> de-allocating memory), and that means that you no longer can do simple
> system calls that don't even need release notification AT ALL.

syslets do work that way - they require some user memory, likely
long-standing (100% sure for a multi-atom setup, though maybe it can be
optimized) - if you do not want to allocate it explicitly, it is possible
to have a wrapper.

> > Currently generic example is following:
> > allocate mem
> > setup complex structure
> > submit syscall
> > wait syscall
> > free mem
>
> And that "allocate mem" and "free mem" is a problem. It's not just a
> performance problem, it is a _complexity_ problem. It means that people
> have to track things that they are NOT AT ALL INTERESTED IN!

I proposed a way to hide the allocation - it is simple, but you cut it
from your reply. I can create another one, without a special per-thread
thing:
handle = async_init();
async_stat(handle, path, stat);
async_cleanup(); // not needed, since it will be freed on exit automatically

Another one is to preallocate a set of atoms in an
__attribute__((constructor)) function.
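
A sketch of how the async_init()/async_stat() variant above might look as a user-space wrapper that keeps the atom memory entirely out of the caller's hands (the function names and the handle layout are illustrative assumptions, not an existing interface):

#include <stdlib.h>
#include <sys/stat.h>

struct async_handle {
        void *atoms;    /* kernel-visible atoms live here, not in the caller */
};

static struct async_handle *async_init(void)
{
        struct async_handle *h = calloc(1, sizeof(*h));

        /* allocate the atom memory once, behind the caller's back */
        if (h)
                h->atoms = calloc(16, 64);      /* size is illustrative */
        return h;
}

static int async_stat(struct async_handle *h, const char *path,
                      struct stat *st)
{
        /*
         * Fill in a stat atom from h->atoms and submit it; the caller
         * only ever sees 'path' and 'st', never the atom memory.
         */
        (void)h; (void)path; (void)st;
        return 0;
}

static void async_cleanup(struct async_handle *h)
{
        /* optional: everything is freed automatically on exit anyway */
        free(h->atoms);
        free(h);
}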

There are really a lot of possible ways - _I_ can use the first one with
explicit operations, others likely can not - so I _ask_ how it should look.

> > So, describe how exactly _you_ think it should be implemented with its
> > pros and cons, so that the system could be adopted without trying to
> > mind-read what is simple and good or complex and really bad.
>
> So I think that a good implementation just does everything up-front, and
> doesn't _need_ a user buffer that is live over longer periods, except for
> the actual results. Exactly because the whole alloc/teardown is nasty.
>
> And I think a good implementation doesn't need wrapping in user space to
> be useful - at *least* not wrapping at completion time, which is the
> really difficult one (since, by definition, in an async world completion
> is separated from the initial submit() event, and with kernel-only threads
> you actually want to *avoid* having to do user code after the operation
> completed).

So where is the problem?
I have already proposed three ways to do this - the user would not even know
that anything happened. You did not comment on any of them; instead you
hand-wave about how, in theory, something should look. What exactly do _you_
expect from the interface?

> I suspect Ingo's thing can do that. But I also suspect (nay, _know_, from
> this discussion), that you didn't even think of the problems.

That is another problem - you think you know something, but you fail to
prove it.

I can work with explicit structure allocation/deallocation/setup - you do
not want that - so I ask for your opinion, and instead of getting an answer
I receive a theoretical word-fall about how the perfect interface should
look.

You only need to have one function call, without ever thinking about
freeing? I proposed _two_ ways to do that.
You can live with explicit init/cleanup (optional) code? There is another
one.

So please describe your vision of the interface in detail, so that it can
be thought about and/or implemented.

> Linus

--
Evgeniy Polyakov

2007-02-16 12:32:32

by Ingo Molnar

[permalink] [raw]
Subject: Re: [patch 05/11] syslets: core code


* Linus Torvalds <[email protected]> wrote:

> On Thu, 15 Feb 2007, Linus Torvalds wrote:
> >
> > So I think that a good implementation just does everything up-front,
> > and doesn't _need_ a user buffer that is live over longer periods,
> > except for the actual results. Exactly because the whole
> > alloc/teardown is nasty.
>
> Btw, this doesn't necessarily mean "not supporting multiple atoms at
> all".
>
> I think the batching of async things is potentially a great idea. I
> think it's quite workable for "open+fstat" kind of things, and I agree
> that it can solve other things too (the
> "socket+bind+connect+sendmsg+rcv" kind of complex setup things).
>
> But I suspect that if we just said:
> - we limit these atom sequences to just linear sequences of max "n" ops
> - we read them all in in a single go at startup
>
> we actually avoid several nasty issues. Not just the memory allocation
> issue in user space (now it's perfectly ok to build up a sequence of
> ops in temporary memory and throw it away once it's been submitted),
> but also issues like the 32-bit vs 64-bit compatibility stuff (the
> compat handlers would just convert it when they do the initial
> copying, and then the actual run-time wouldn't care about user-level
> pointers having different sizes etc).
>
> Would it make the interface less cool? Yeah. Would it limit it to just
> a few linked system calls (to avoid memory allocation issues in the
> kernel)? Yes again. But it would simplify a lot of the interface
> issues.
>
> It would _also_ allow the "sys_aio_read()" function to build up its
> *own* set of atoms in kernel space to actually do the read, and there
> would be no impact of the actual run-time wanting to read stuff from
> user space. Again - it's actually the same issue as with the compat
> system call: by making the interfaces do things up-front rather than
> dynamically, it becomes more static, but also easier to do interface
> translations. You can translate into any arbitrary internal format
> _once_, and be done with it.
>
> I dunno.

[ hm. I again wrote a pretty long email for you to read. Darn! ]

regarding the API - i share most of your concerns, and it's all a
function of how widely we want to push this into user-space.

My initial thought was for syslets to be used by glibc as small, secure
kernel-side 'syscall plugins' mainly - so that it can do things like
'POSIX AIO signal notifications' (which are madness in terms of
performance, but which applications rely on) /without/ having to burden
the kernel-side AIO with such requirements: glibc just adds an enclosing
sys_kill() to the syslet and it will do the proper signal notification,
asynchronously. (and of course syslets can be used for the Tux type of
performance sillinesses as well ;-)
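
As an illustration of that "enclosing sys_kill()" idea, a sketch that extends the alloc_req() atom-chain example quoted later in this thread (the __NR_sys_kill number, the req->notify field and the *_var statics are assumptions, not part of the posted patches):

/* filled in once at startup, e.g. my_pid_var = getpid(); */
static long my_pid_var;
static long SIGIO_var = SIGIO;

/* the close() atom now links to a notification atom instead of ending: */
init_atom(req, &req->close_file, __NR_sys_close,
          &req->fd, NULL, NULL, NULL, NULL, NULL,
          NULL, 0, &req->notify);

/* final atom: kill(getpid(), SIGIO); NULL as next ends the syslet */
init_atom(req, &req->notify, __NR_sys_kill,
          &my_pid_var, &SIGIO_var, NULL, NULL, NULL, NULL,
          NULL, 0, NULL);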

So a sane user API (all used at the glibc level, not at application
level) would use simple syslets, while more broken ones would have to
use longer ones - but nobody would have the burden of having to
synchronize back to the issuer context. Natural selection will gravitate
application use towards the APIs with the shorter syslets. (at least so
i hope)

In this model syslets arent really user-programmable entities but rather
small plugins available to glibc to build up more complex, more
innovative (or just more broken) APIs than what the kernel wants to
provide - without putting any true new ABI dependency on the kernel,
other than the already existing syscall ABIs.

But if we'd like glibc to provide this to applications in some sort of
standardized /programmable/ manner, with a wide range of atom selections
(not directly coded syscall numbers, but rather as function pointers to
actual glibc functions, which glibc could translate to syscall numbers,
argument encodings, etc.), then i agree that doing the compat things and
making it 32/64-bit agnostic (and much more) is pretty much a must. If
90% of this current job is finished then sorting those out will at least
be another 90% of the work ;-)

and actually this latter model scares me, and i think that model scared
the hell out of you as well.

But i really have no strong opinion about which one we want yet, without
having walked the path. Somewhere inside me i'd of course like syslets
to become a widely available interface - but my fear is that it might
just not be 'human' enough to make sense - and we'd just not want to tie
us down with an ABI that's not used. I dont want this to become another
sys_sendfile - much talked about and _almost_ useful but in practice
seldom used due to its programmability and utility limitations.

OTOH, the syslet concept right now already looks very ubiquitous, and
the main problem with AIO use in applications wasnt even just its broken
API or its broken performance, but the fundamental lack of all Linux IO
disciplines supporting AIO, and the lack of significantly parallel
hardware. We have kaio that is centered around block drivers - then we
have epoll that works best with networking, and inotify that deals with
some (but not all) VFS events - but none of them supports every IO and
event discipline well, at once. My feeling is that /this/ is the main
fundamental problem with AIO in general, not just its programmability
limitations.

Right now i'm concentrating on trying to build up something on the
scheduling side that shows the issues in practice, shows the limitations
and shows the possibilities. For example the easy ability to turn a
cachemiss thread back into a user thread (and then back into a cachemiss
thread) was a true surprise to me which increased utility quite a bit. I
couldnt have designed it into the concept because it just didnt occur to
me in the early stages. The notification-ring-related limitations you
noticed are another important thing to fix - and these issues go to the
core scheduling model of the concept and affect everything.

Thirdly, while Tux does not matter much to us, at least to me it is
pretty clear what it takes to get performance up to the levels of Tux -
and i dont see any big fundamental compromise possible on that front.
Syslets are partly Tux repackaged into something generic - they are
probably a bit slower than straight kernel code Tux, but not by much and
it's also not behaving fundamentally differently. And if we dont offer
at least something close to those possibilities then people will
re-start trying to add those special-purpose state machine APIs again,
and the whole "we need /true/ async IO" game starts again.

So if we accept "make parallelism easier to program" and "get somewhat
close to Tux's performance and scalability" as a premise (which you
might not agree with in that form), then i dont think there's much
choice we have: either we use kernel threads, synchronous system calls
and the scheduler intelligently (and the scheduling/threading bits of
syslets are pretty much the most intelligent kernel thread based
approach i can imagine at the moment =B-) or we use a special-purpose
KAIO state machine subsystem, avoiding most of the existing synchronous
infrastructure, painfully coding it into every IO discipline - and this
will certainly haunt us until the end of times.

So that's why i'm not /that/ much worried about the final form of the
API at the moment - even though i agree that it is /the/ most important
decision factor in the end: i see various unavoidable externalities
forcing us very much, and in the end we either like the result and make
it available to programmers, or we dont, and limit it to system-glue
glibc use - or we throw it away altogether. I'm curious about the end
result even if it gets limited or gets thrown away (joining 4:4 on the
way to the bit bucket ;) and while i'm cautiously optimistic that
something useful can come out of this, i cannot know it for sure at the
moment.

Ingo

2007-02-16 13:33:20

by Evgeniy Polyakov

[permalink] [raw]
Subject: Re: [patch 05/11] syslets: core code

On Fri, Feb 16, 2007 at 01:28:06PM +0100, Ingo Molnar ([email protected]) wrote:
> OTOH, the syslet concept right now already looks very ubiquitous, and
> the main problem with AIO use in applications wasnt even just its broken
> API or its broken performance, but the fundamental lack of all Linux IO
> disciplines supporting AIO, and the lack of significantly parallel
> hardware. We have kaio that is centered around block drivers - then we
> have epoll that works best with networking, and inotify that deals with
> some (but not all) VFS events - but none of them supports every IO and
> event discipline well, at once. My feeling is that /this/ is the main
> fundamental problem with AIO in general, not just its programmability
> limitations.

That is quite disappointing to hear, when the weekly-released kevent could
already solve that problem more than a year ago - it was designed specifically
to support every possible notification type, and it does support file
descriptor ones, VFS events (dropped in current releases to reduce size) and
tons of others, including POSIX timers, signals, and its own high-performance
AIO (which was created as a somewhat complex state machine over the internals
of the page population code) - essentially everything one can ever imagine,
with a bit of code needed for a new type.

I was asked to add waiting on a futex through the kevent queue - that is
quite a simple task, but given the complete lack of feedback and the
ignoring of the project even by people who asked about its features, it
looks like there is no need for that at all.

--
Evgeniy Polyakov

2007-02-16 15:55:37

by Linus Torvalds

[permalink] [raw]
Subject: Re: [patch 05/11] syslets: core code



On Fri, 16 Feb 2007, Evgeniy Polyakov wrote:
>
> Interfaces can be created and destroyed - they do not affect overall
> system design in anyway (well, if they do, something is broken).

I'm sorry, but you've obviously never maintained any piece of software
that actually has users.

As long as you think that interfaces can change, this discussion is
pointless.

So go away, ponder things.

Linus

2007-02-16 16:07:58

by Evgeniy Polyakov

[permalink] [raw]
Subject: Re: [patch 05/11] syslets: core code

On Fri, Feb 16, 2007 at 07:54:22AM -0800, Linus Torvalds ([email protected]) wrote:
> > Interfaces can be created and destroyed - they do not affect overall
> > system design in anyway (well, if they do, something is broken).
>
> I'm sorry, but you've obviously never maintained any piece of software
> that actually has users.

Strong. But making such claims about others usually tends to reveal one's own problems.

> As long as you think that interfaces can change, this discussion is
> pointless.

That is too cool a phrase to be heard. If you would do me the favour of
rereading what was written, you will (hopefully) see that there were no
words about interfaces being changed after they are put into the wild - the
talk was only about the time when the system is being designed and
implemented, and when there is time to discuss its rough edges. If its
design is good, then the interface can be changed in a moment without any
problem - that is what we see with syslets right now - they are designed
and implemented (the former was done several years ago), and it is time to
shape their edges - such as changing the userspace API. It is easy, but you
do not (want/like to) see that.

> So go away, ponder things.

But my words above are apparently too lame for a self-listening dweller on
Olympus. Definitely.

> Linus

--
Evgeniy Polyakov

2007-02-16 16:53:33

by Ray Lee

[permalink] [raw]
Subject: Re: [patch 05/11] syslets: core code

On 2/16/07, Evgeniy Polyakov <[email protected]> wrote:
> if its design is good, then
> interface can be changed in a moment without any problem

This isn't always the case. Sometimes the interface puts requirements
(contract-like) upon the implementation. Case in point in the kernel,
dnotify versus inotify. dnotify is a steaming pile of worthlessness,
because its userspace interface is so bad (meaning inefficient) as to
be nearly unusable.

inotify has a different interface, one that supplies details about
events rather than mere notice that an event occurred, and therefore
has different requirements in implementation. dnotify probably was a
good design, but for a worthless interface.

The interface isn't always important, but it's certainly something
that has to be understood before putting the finishing touches on the
behind-the-scenes implementation.

Ray

2007-02-16 17:01:23

by Evgeniy Polyakov

[permalink] [raw]
Subject: Re: [patch 05/11] syslets: core code

On Fri, Feb 16, 2007 at 08:53:30AM -0800, Ray Lee ([email protected]) wrote:
> On 2/16/07, Evgeniy Polyakov <[email protected]> wrote:
> >if its design is good, then
> >interface can be changed in a moment without any problem
>
> This isn't always the case. Sometimes the interface puts requirements
> (contract-like) upon the implementation. Case in point in the kernel,
> dnotify versus inotify. dnotify is a steaming pile of worthlessness,
> because it's userspace interface is so bad (meaning inefficient) as to
> be nearly unusable.
>
> inotify has a different interface, one that supplies details about
> events rather that mere notice that an event occurred, and therefore
> has different requirements in implementation. dnotify probably was a
> good design, but for a worthless interface.
>
> The interface isn't always important, but it's certainly something
> that has to be understood before putting the finishing touches on the
> behind-the-scenes implementation.

Absolutely.
And if the overall system design is good, there is no problem changing
(well, for those who fail to read to the end and understand my English,
replace 'to change' with 'to create and commit') the interface into a state
where it will satisfy all (or a majority of) users.

When a system is designed from the interface down, it ends up
with one thread per IO and huge limitations on how the system is going to be
used at all.

> Ray

--
Evgeniy Polyakov

2007-02-16 20:23:22

by Cyrill Gorcunov

[permalink] [raw]
Subject: Re: [patch 05/11] syslets: core code

On Fri, Feb 16, 2007 at 07:58:54PM +0300, Evgeniy Polyakov wrote:
| Absolutely.
| And if the overall system design is good, there is no problem changing
| (well, for those who fail to read to the end and understand my English,
| replace 'to change' with 'to create and commit') the interface into a state
| where it will satisfy all (or a majority of) users.
|
| When a system is designed from the interface down, it ends up
| with one thread per IO and huge limitations on how the system is going to be
| used at all.
|
| --
| Evgeniy Polyakov

I'm sorry for meddling in the conversation, but I think Linus misunderstood
you. If I'm right, you propose to "create and commit" _new_ interfaces
only? I mean, _changing_ interfaces exported to user space is
very painful... for further support. Don't swear at me if I wrote
something stupid ;)

--

Cyrill

2007-02-17 05:22:04

by Ray Lee

[permalink] [raw]
Subject: Re: [patch 05/11] syslets: core code

Evgeniy Polyakov wrote:
> On Fri, Feb 16, 2007 at 08:53:30AM -0800, Ray Lee ([email protected]) wrote:
>> On 2/16/07, Evgeniy Polyakov <[email protected]> wrote:
>>> if its design is good, then
>>> interface can be changed in a moment without any problem
>> This isn't always the case. Sometimes the interface puts requirements
>> (contract-like) upon the implementation. Case in point in the kernel,
>> dnotify versus inotify. dnotify is a steaming pile of worthlessness,
>> because its userspace interface is so bad (meaning inefficient) as to
>> be nearly unusable.
>>
>> inotify has a different interface, one that supplies details about
>> events rather than mere notice that an event occurred, and therefore
>> has different requirements in implementation. dnotify probably was a
>> good design, but for a worthless interface.
>>
>> The interface isn't always important, but it's certainly something
>> that has to be understood before putting the finishing touches on the
>> behind-the-scenes implementation.
>
> Absolutely.
> And if the overall system design is good,

dnotify was a good system design for a stupid (or misunderstood) problem.

> there is no problem changing
> (well, for those who fail to read to the end and understand my English,
> replace 'to change' with 'to create and commit') the interface into a state
> where it will satisfy all (or a majority of) users.

You might be right, but the point I (and others) are trying to make is
that there are some cases where you *really* need to understand the
users of the interface first. You might have everything else right
(userspace wants to know when filesystem changes occur, great), but if
you don't know what form those notifications have to look like, you'll
end up doing a lot of wasted work on a worthless piece of code that no
one will ever use.

Sometimes the interface really is the most important thing. Just like a
contract between people.

(This is probably why, by the way, most people are staying silent on
your excellent kevent work. The kernel side is, in some ways, the easy
part. It's getting an interface that will handle all users [ users ==
producers and consumers of kevents ], that is the hard bit.)

Or, let me put it yet another way: How do you prove to the rest of us
that you, or Ingo, or whomever, are not building another dnotify? (Maybe
you're smart enough in this problem space that you know you're not --
that's actually the most likely possibility. But you still have to prove
it to the rest of us. Sucks, I know.)

> When a system is designed from the interface down, it ends up
> with one thread per IO and huge limitations on how the system is going to be
> used at all.

The other side is you start from the goal in mind and get Ingo's state
machines with loops and conditionals and marmalade in syslets which
appear a bit baroque and overkill for the majority of us userspace folk.

(No offense intended to Ingo, he's obviously quite a bit more conversant
on the needs of high speed interfaces than I am. However, I suspect I
have a bit more clarity on what us normal folk would actually use, and
kernel driven FSMs ain't it. Userspace often makes a lot of contextual
decisions that I would absolutely *hate* to write and debug as a state
machine that gets handed off to the kernel. I'll happily take a 10% hit
in efficiency that Moore's law will get me back in a few months, instead
of spending a bunch of time debugging difficult heisenbugs due to the
syslet FSM reading a userspace variable at a slightly different time
once in a blue moon. OTOH, I'm also not Oracle, so what do I know?)

The truth of this lies somewhere in the middle. It isn't kernel driven,
or userspace interface driven, but a tradeoff between the two.

So:

> Userspace_API_is_the_ever_possible_last_thing_to_ever_think_about.
> Period

Please listen to those of us who are saying that this might not be the
case. Maybe we're idiots, but then again maybe we're not, okay?
Sometimes the API really *DOES* change the underlying implementation.

Ray

2007-02-17 10:02:14

by Evgeniy Polyakov

[permalink] [raw]
Subject: Re: [patch 05/11] syslets: core code

On Fri, Feb 16, 2007 at 11:20:36PM +0300, Cyrill V. Gorcunov ([email protected]) wrote:
> On Fri, Feb 16, 2007 at 07:58:54PM +0300, Evgeniy Polyakov wrote:
> | Absolutely.
> | And if the overall system design is good, there is no problem changing
> | (well, for those who fail to read to the end and understand my English,
> | replace 'to change' with 'to create and commit') the interface into a state
> | where it will satisfy all (or a majority of) users.
> |
> | When a system is designed from the interface down, it ends up
> | with one thread per IO and huge limitations on how the system is going to be
> | used at all.
> |
> | --
> | Evgeniy Polyakov
>
> I'm sorry for meddling in conversation but I think Linus misunderstood
> you. If I'm right you propose to "create and commit" _new_ interfaces
> only? I mean _changing_ of interfaces exported to user space is
> very painfull... for further support. Don't swear at me if I wrote
> something stupid ;)

Yes, I only proposed to change what Ingo has right now - although it is
usable, it does suck; but since the overall syslet design is indeed good,
it does not suffer from possible interface changes - so I said that it
can be trivially changed, in the sense that until it is committed
anything can be done to extend it.

> --
>
> Cyrill

--
Evgeniy Polyakov

2007-02-17 10:22:23

by Evgeniy Polyakov

[permalink] [raw]
Subject: Re: [patch 05/11] syslets: core code

On Fri, Feb 16, 2007 at 08:54:11PM -0800, Ray Lee ([email protected]) wrote:
> (This is probably why, by the way, most people are staying silent on
> your excellent kevent work. The kernel side is, in some ways, the easy
> part. It's getting an interface that will handle all users [ users ==
> producers and consumers of kevents ], that is the hard bit.)

The kevent interface was completely changed 4 (!) times over the last year,
at kernel developers' request, without any damage to its kernel part.

> Or, let me put it yet another way: How do you prove to the rest of us
> that you, or Ingo, or whomever, are not building another dnotify? (Maybe
> you're smart enough in this problem space that you know you're not --
> that's actually the most likely possibility. But you still have to prove
> it to the rest of us. Sucks, I know.)

I only want to say that when a system is designed correctly there is no
problem changing its interface (yes, I said 'to change' again just because
I hope everyone understands that I'm talking about the time when the system
is not yet committed to the tree).

Btw, dnotify had problems in its design that were highlighted when inotify
started - mainly that watchers were not attached to the inode.

Right now is the time to ask users what interface they expect from
AIO - so I asked Linus and proposed three different ones, two of them
designed in a way that the user would not even know that some
allocation/freeing was done - and as a result I got a 'you suck' response,
exactly the same as was returned on the first syslet release - just, _only_
fscking _just_, because it had an ugly interface.

> > Situations when system is designed from interface down to system ends up
> > with one thread per IO and huge limitations on how system is going to be
> > used at all.
>
> The other side is you start from the goal in mind and get Ingo's state
> machines with loops and conditionals and marmalade in syslets which
> appear a bit baroque and overkill for the majority of us userspace folk.

Well, I designed kevent AIO in a similar way, but it has an even more
complex state machine, built on top of the internal page population
functions.

It is a bit complex, but it works fast. And it works with any type of AIO
(if I were not too lazy to implement the bindings).

The syslet interface is not perfect, but it can be changed (did I say that
again? I think we all understand what I mean by that already) trivially
right now (before it is included) - throwing the thing away just because it
has a bad interface, which can be extended in a moment, is not the way.

> (No offense intended to Ingo, he's obviously quite a bit more conversant
> on the needs of high speed interfaces than I am. However, I suspect I
> have a bit more clarity on what us normal folk would actually use, and
> kernel driven FSMs ain't it. Userspace often makes a lot of contextual
> decisions that I would absolutely *hate* to write and debug as a state
> machine that gets handed off to the kernel. I'll happily take a 10% hit
> in efficiency that Moore's law will get me back in a few months, instead
> of spending a bunch of time debugging difficult heisenbugs due to the
> syslet FSM reading a userspace variable at a slightly different time
> once in a blue moon. OTOH, I'm also not Oracle, so what do I know?)
>
> The truth of this lies somewhere in the middle. It isn't kernel driven,
> or userspace interface driven, but a tradeoff between the two.
>
> So:
>
> > Userspace_API_is_the_ever_possible_last_thing_to_ever_think_about.
> > Period
>
> Please listen to those of us who are saying that this might not be the
> case. Maybe we're idiots, but then again maybe we're not, okay?
> Sometimes the API really *DOES* change the underlying implementation.

Now is exactly the time to say what the interface should be.
The system is almost ready - it is time to make it look cool for users.

> Ray

--
Evgeniy Polyakov

2007-02-17 15:00:37

by Al Boldi

[permalink] [raw]
Subject: Re: [patch 05/11] syslets: core code

Evgeniy Polyakov wrote:
> Ray Lee ([email protected]) wrote:
> > The truth of this lies somewhere in the middle. It isn't kernel driven,
> > or userspace interface driven, but a tradeoff between the two.
> >
> > So:
> > > Userspace_API_is_the_ever_possible_last_thing_to_ever_think_about.
> > > Period
> >
> > Please listen to those of us who are saying that this might not be the
> > case. Maybe we're idiots, but then again maybe we're not, okay?
> > Sometimes the API really *DOES* change the underlying implementation.
>
> Now is exactly the time to say what the interface should be.
> The system is almost ready - it is time to make it look cool for users.

IMHO, what is needed is an event registration switch-board that handles
notifications from the kernel and the user side respectively.


Thanks!

--
Al

2007-02-17 18:02:09

by Cyrill Gorcunov

[permalink] [raw]
Subject: Re: [patch 05/11] syslets: core code

On Sat, Feb 17, 2007 at 01:02:00PM +0300, Evgeniy Polyakov wrote:
[... snipped ...]

| Yes, I only proposed to change what Ingo has right now - although it is
| usable, it does suck; but since the overall syslet design is indeed good,
| it does not suffer from possible interface changes - so I said that it
| can be trivially changed, in the sense that until it is committed
| anything can be done to extend it.
|
| --
| Evgeniy Polyakov
|

I think you are right, Evgeniy! In times of research, _changing_ a lot
of things is almost a law. syslets are in the test area, so why should we
bind ourselves in the search for the best? If something in syslets sucks,
let's change it as early as possible. Of course I mean no more interface
changes after some _commit_ point (and that should be Linus's decision).

--

Cyrill

2007-02-18 20:21:19

by Pavel Machek

[permalink] [raw]
Subject: Re: [patch 05/11] syslets: core code

Hi!

> > The upcall will setup a frame, execute the clet (where jump/conditions and
> > userspace variable changes happen in machine code - gcc is pretty good in
> > taking care of that for us) on its return, come back through a
> > sys_async_return, and go back to userspace.
>
> So, for example, this is the setup code for the current API (and that's a
> really simple one - imagine going wacko with loops and userspace variable
> changes):
>
>
> static struct req *alloc_req(void)
> {
>         /*
>          * Constants can be picked up by syslets via static variables:
>          */
>         static long O_RDONLY_var = O_RDONLY;
>         static long FILE_BUF_SIZE_var = FILE_BUF_SIZE;
>
>         struct req *req;
>
>         if (freelist) {
>                 req = freelist;
>                 freelist = freelist->next_free;
>                 req->next_free = NULL;
>                 return req;
>         }
>
>         req = calloc(1, sizeof(struct req));
>
>         /*
>          * This is the first atom in the syslet, it opens the file:
>          *
>          *  req->fd = open(req->filename, O_RDONLY);
>          *
>          * It is linked to the next read() atom.
>          */
>         req->filename_p = req->filename;
>         init_atom(req, &req->open_file, __NR_sys_open,
>                   &req->filename_p, &O_RDONLY_var, NULL, NULL, NULL, NULL,
>                   &req->fd, SYSLET_STOP_ON_NEGATIVE, &req->read_file);
>
>         /*
>          * This second read() atom is linked back to itself, it skips to
>          * the next one on stop:
>          */
>         req->file_buf_ptr = req->file_buf;
>         init_atom(req, &req->read_file, __NR_sys_read,
>                   &req->fd, &req->file_buf_ptr, &FILE_BUF_SIZE_var,
>                   NULL, NULL, NULL, NULL,
>                   SYSLET_STOP_ON_NON_POSITIVE | SYSLET_SKIP_TO_NEXT_ON_STOP,
>                   &req->read_file);
>
>         /*
>          * This close() atom has NULL as next, this finishes the syslet:
>          */
>         init_atom(req, &req->close_file, __NR_sys_close,
>                   &req->fd, NULL, NULL, NULL, NULL, NULL, NULL, 0, NULL);
>
>         return req;
> }
>
>
> Here's how your clet would look like:
>
> static long main_sync_loop(ctx *c)
> {
>         int fd;
>         char file_buf[FILE_BUF_SIZE+1];
>
>         if ((fd = open(c->filename, O_RDONLY)) == -1)
>                 return -1;
>         while (read(fd, file_buf, FILE_BUF_SIZE) > 0)
>                 ;
>         close(fd);
>         return 0;
> }
>
>
> Kinda easier to code isn't it? And the cost of the upcall to schedule the
> clet is widely amortized by the multiple syscalls you're going to do inside
> your clet.

I do not get it. What if a clet includes

int *a = 0; *a = 1; /* enjoy your oops, stupid kernel? */

I.e. how do you make sure the kernel is protected from malicious clets?

--
(english) http://www.livejournal.com/~pavelmachek
(cesky, pictures) http://atrey.karlin.mff.cuni.cz/~pavel/picture/horses/blog.html

2007-02-18 20:37:28

by Davide Libenzi

[permalink] [raw]
Subject: Re: [patch 05/11] syslets: core code

On Sun, 18 Feb 2007, Pavel Machek wrote:

> > > The upcall will setup a frame, execute the clet (where jump/conditions and
> > > userspace variable changes happen in machine code - gcc is pretty good in
> > > taking care of that for us) on its return, come back through a
> > > sys_async_return, and go back to userspace.
> >
> > So, for example, this is the setup code for the current API (and that's a
> > really simple one - imagine going wacko with loops and userspace variable
> > changes):
> >
> >
> > static struct req *alloc_req(void)
> > {
> >         /*
> >          * Constants can be picked up by syslets via static variables:
> >          */
> >         static long O_RDONLY_var = O_RDONLY;
> >         static long FILE_BUF_SIZE_var = FILE_BUF_SIZE;
> >
> >         struct req *req;
> >
> >         if (freelist) {
> >                 req = freelist;
> >                 freelist = freelist->next_free;
> >                 req->next_free = NULL;
> >                 return req;
> >         }
> >
> >         req = calloc(1, sizeof(struct req));
> >
> >         /*
> >          * This is the first atom in the syslet, it opens the file:
> >          *
> >          *  req->fd = open(req->filename, O_RDONLY);
> >          *
> >          * It is linked to the next read() atom.
> >          */
> >         req->filename_p = req->filename;
> >         init_atom(req, &req->open_file, __NR_sys_open,
> >                   &req->filename_p, &O_RDONLY_var, NULL, NULL, NULL, NULL,
> >                   &req->fd, SYSLET_STOP_ON_NEGATIVE, &req->read_file);
> >
> >         /*
> >          * This second read() atom is linked back to itself, it skips to
> >          * the next one on stop:
> >          */
> >         req->file_buf_ptr = req->file_buf;
> >         init_atom(req, &req->read_file, __NR_sys_read,
> >                   &req->fd, &req->file_buf_ptr, &FILE_BUF_SIZE_var,
> >                   NULL, NULL, NULL, NULL,
> >                   SYSLET_STOP_ON_NON_POSITIVE | SYSLET_SKIP_TO_NEXT_ON_STOP,
> >                   &req->read_file);
> >
> >         /*
> >          * This close() atom has NULL as next, this finishes the syslet:
> >          */
> >         init_atom(req, &req->close_file, __NR_sys_close,
> >                   &req->fd, NULL, NULL, NULL, NULL, NULL, NULL, 0, NULL);
> >
> >         return req;
> > }
> >
> >
> > Here's how your clet would look like:
> >
> > static long main_sync_loop(ctx *c)
> > {
> >         int fd;
> >         char file_buf[FILE_BUF_SIZE+1];
> >
> >         if ((fd = open(c->filename, O_RDONLY)) == -1)
> >                 return -1;
> >         while (read(fd, file_buf, FILE_BUF_SIZE) > 0)
> >                 ;
> >         close(fd);
> >         return 0;
> > }
> >
> >
> > Kinda easier to code isn't it? And the cost of the upcall to schedule the
> > clet is widely amortized by the multiple syscalls you're going to do inside
> > your clet.
>
> I do not get it. What if a clet includes
>
> int *a = 0; *a = 1; /* enjoy your oops, stupid kernel? */
>
> I.e. how do you make sure the kernel is protected from malicious clets?

Clets would execute in userspace, like signal handlers, but under the
special schedule() handler. That way, chaining happens by means of
natural C code, and access to userspace variables happens by means of
natural C code too (not with special syscalls to manipulate userspace
memory). I'm not a big fan of chains of syscalls, for the reasons I
already explained, but at least clets (or whatever the name) have a much
lower cost for the programmer (easier to code than atom chains) and for
the kernel (no need for all that atom-handling stuff, no need for limited
cond/jump interpreters in the kernel, and no need for nightmare compat
code).



- Davide


2007-02-18 21:04:59

by Michael K. Edwards

[permalink] [raw]
Subject: Re: [patch 05/11] syslets: core code

On 2/18/07, Davide Libenzi <[email protected]> wrote:
> Clets would execute in userspace, like signal handlers,

or like "event handlers" in cooperative multitasking environments
without the Unix baggage

> but under the special schedule() handler.

or, better yet, as the next tasklet in the chain after the softirq
dispatcher, since I/Os almost always unblock as a result of something
that happens in an ISR or softirq

> In that way chains happens by the mean of
> natural C code, and access to userspace variables happen by the mean of
> natural C code too (not with special syscalls to manipulate userspace
> memory).

yep. That way you can exploit this nice hardware block called an MMU.

> I'm not a big fan of chains of syscalls for the reasons I
> already explained,

to a kernel programmer, all userspace programs are chains of syscalls. :-)

> but at least clets (or whatever name) has a way lower
> cost for the programmer (easier to code than atom chains),

except you still have the 80% of the code that is half-assed exception
handling using overloaded semantics on function return values and a
thread-local errno, which is totally unsafe with fibrils, syslets,
clets, and giblets, since none of them promise to run continuations in
the same thread context as the submission. Of course you aren't going
to use errno as such, but that means that async-ifying code isn't
s/syscall/aio_syscall/, it's a complete rewrite. If you're going to
design a new AIO interface, please model it after the only standard
that has ever made deeply pipelined, massively parallel execution
programmer-friendly -- IEEE 754.
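
To make the errno point concrete, a tiny sketch of the broken pattern (aio_submit_read() and aio_wait_one() are made-up names for a hypothetical async API):

#include <errno.h>
#include <stddef.h>
#include <stdio.h>

/* hypothetical async API, for illustration only */
extern long aio_submit_read(int fd, void *buf, size_t len);
extern long aio_wait_one(long cookie);  /* returns the syscall's result */

static void broken_pattern(int fd, void *buf, size_t len)
{
        long cookie = aio_submit_read(fd, buf, len);
        long ret = aio_wait_one(cookie);

        if (ret < 0) {
                /*
                 * BROKEN: the read may have run (and failed) on a different
                 * kernel thread, so this thread's errno was never set by it.
                 * The error has to travel in 'ret' itself, which is why
                 * async-ifying errno-based code is a rewrite rather than a
                 * mechanical read() -> aio_read() substitution.
                 */
                perror("read");
        }
}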

> and for the kernel (no need of all that atom handling stuff,

you still need this, but it has to be centered on a data structure
that makes request throttling, dynamic reprioritization, and bulk
cancellation practical

> no need of limited cond/jump interpreters in the kernel,

you still need this, for efficient handling of speculative execution,
pipeline stalls, and exception propagation, but it's invisible to the
interface and you don't have to invent it up front

> and no need of nightmare compat code).

compat code, yes. nightmare, no. Just like kernel FP emulation on any
processor other than an x86. Unimplemented instruction traps. x86 is
so utterly the wrong architecture on which to prototype this it isn't
even funny.

Cheers,
- Michael

2007-02-19 00:22:09

by Paul Mackerras

[permalink] [raw]
Subject: Re: [patch 02/11] syslets: add syslet.h include file, user API/ABI definitions

Ingo Molnar writes:

> add include/linux/syslet.h which contains the user-space API/ABI
> declarations. Add the new header to include/linux/Kbuild as well.

> +struct syslet_uatom {
> +        unsigned long                  flags;
> +        unsigned long                  nr;
> +        long __user                    *ret_ptr;
> +        struct syslet_uatom __user     *next;
> +        unsigned long __user           *arg_ptr[6];
> +        /*
> +         * User-space can put anything in here, kernel will not
> +         * touch it:
> +         */
> +        void __user                    *private;
> +};

This structure, with its unsigned longs and pointers, is going to
create enormous headaches for 32-bit processes on 64-bit machines as
far as I can see---and on ppc64 machines, almost all processes are
32-bit, since there is no inherent speed penalty for running in 32-bit
mode, and some space savings.

Have you thought about how you will handle compatibility for 32-bit
processes? The issue will arise for x86_64 and ia64 (among others)
too, I would think.
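
For illustration, one conventional way to avoid such 32/64-bit headaches is a fixed-width layout in which user pointers travel as __u64 (a sketch of the usual trick only, not a proposed replacement ABI):

#include <linux/types.h>

/*
 * Same information as syslet_uatom, but with an identical layout for
 * 32-bit and 64-bit user-space: pointers are stored as __u64 and cast
 * via (void __user *)(unsigned long) on the kernel side, so no compat
 * translation layer is needed.
 */
struct syslet_uatom64 {
        __u64 flags;
        __u64 nr;
        __u64 ret_ptr;          /* user pointer carried as __u64 */
        __u64 next;             /* user pointer to the next uatom */
        __u64 arg_ptr[6];       /* user pointers to the arguments */
        __u64 private;          /* opaque to the kernel */
};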

Paul.