2001-03-15 01:26:43

by Nigel Gamble

Subject: [PATCH for 2.5] preemptible kernel

Here is the latest preemptible kernel patch. It's much cleaner and
smaller than previous versions, so I've appended it to this mail. This
patch is against 2.4.2, although it's not intended for 2.4. I'd like
comments from anyone interested in a low-latency Linux kernel solution
for the 2.5 development tree.

Kernel preemption is not allowed while spinlocks are held, which means
that this patch alone cannot guarantee low preemption latencies. But
as long-held locks (in particular the BKL) are replaced by finer-grained
locks, this patch will enable lower latencies as the kernel also becomes
more scalable on large SMP systems.
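
In outline, the patch enforces this rule in include/linux/spinlock.h by
wrapping every lock operation in a per-task preemption count; the two
macros below are just a condensed restatement of the CONFIG_PREEMPT hunk
further down, not additional code:

        /* ctx_sw_off() atomically increments current->preempt_count;
         * ctx_sw_on() decrements it and, if it reaches zero while
         * need_resched is set, calls preempt_schedule(). */
        #define spin_lock(lock)   do { ctx_sw_off(); _raw_spin_lock(lock); } while (0)
        #define spin_unlock(lock) do { _raw_spin_unlock(lock); ctx_sw_on(); } while (0)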

Notwithstanding the comments in the Configure.help section for
CONFIG_PREEMPT, I think this patch has a negligible effect on
throughput. In fact, I got better average results from running 'dbench
16' on a 750MHz PIII with 128MB with kernel preemption turned on
(~30MB/s) than on the plain 2.4.2 kernel (~26MB/s).

(I had to rearrange three header files that are needed in sched.h before
task_struct is defined, but which include inline functions that cannot
now be compiled until after task_struct is defined. I chose not to
move them into sched.h, like d_path(), as I don't want to make it more
difficult to apply kernel patches to my kernel source tree.)
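
For reference, each of the three headers (dcache.h, fs_struct.h and
tqueue.h) now uses the same two-part guard: the inline functions whose
spin_lock()/spin_unlock() calls now touch current->preempt_count are
moved below a second guard that is only satisfied once sched.h, having
defined task_struct, defines _TASK_STRUCT_DEFINED and re-includes the
header. Schematically ("foo" is a placeholder, not a real header):

        /* linux/foo.h */
        #ifndef _LINUX_FOO_H
        #define _LINUX_FOO_H
        /* declarations that do not need task_struct */
        #endif /* _LINUX_FOO_H */

        #if !defined(_LINUX_FOO_H_INLINES) && defined(_TASK_STRUCT_DEFINED)
        #define _LINUX_FOO_H_INLINES
        /* inline functions (e.g. d_drop()) that need the new spin_lock() */
        #endif

        /* and at the end of linux/sched.h, after task_struct is defined: */
        #define _TASK_STRUCT_DEFINED
        #include <linux/dcache.h>
        #include <linux/tqueue.h>
        #include <linux/fs_struct.h>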

Nigel Gamble [email protected]
Mountain View, CA, USA. http://www.nrg.org/


diff -Nur 2.4.2/CREDITS linux/CREDITS
--- 2.4.2/CREDITS Wed Mar 14 12:15:49 2001
+++ linux/CREDITS Wed Mar 14 12:21:42 2001
@@ -907,8 +907,8 @@

N: Nigel Gamble
E: [email protected]
-E: [email protected]
D: Interrupt-driven printer driver
+D: Preemptible kernel
S: 120 Alley Way
S: Mountain View, California 94040
S: USA
diff -Nur 2.4.2/Documentation/Configure.help linux/Documentation/Configure.help
--- 2.4.2/Documentation/Configure.help Wed Mar 14 12:16:10 2001
+++ linux/Documentation/Configure.help Wed Mar 14 12:22:04 2001
@@ -130,6 +130,23 @@
If you have system with several CPU's, you do not need to say Y
here: APIC will be used automatically.

+Preemptible Kernel
+CONFIG_PREEMPT
+ This option reduces the latency of the kernel when reacting to
+ real-time or interactive events by allowing a low priority process to
+ be preempted even if it is in kernel mode executing a system call.
+ This allows applications that need real-time response, such as audio
+ and other multimedia applications, to run more reliably even when the
+ system is under load due to other, lower priority, processes.
+
+ This option is currently experimental if used in conjunction with SMP
+ support.
+
+ Say Y here if you are building a kernel for a desktop system, embedded
+ system or real-time system. Say N if you are building a kernel for a
+ system where throughput is more important than interactive response,
+ such as a server system. Say N if you are unsure.
+
Kernel math emulation
CONFIG_MATH_EMULATION
Linux can emulate a math coprocessor (used for floating point
diff -Nur 2.4.2/arch/i386/config.in linux/arch/i386/config.in
--- 2.4.2/arch/i386/config.in Wed Mar 14 12:14:18 2001
+++ linux/arch/i386/config.in Wed Mar 14 12:20:02 2001
@@ -161,6 +161,11 @@
define_bool CONFIG_X86_IO_APIC y
define_bool CONFIG_X86_LOCAL_APIC y
fi
+ bool 'Preemptible Kernel' CONFIG_PREEMPT
+else
+ if [ "$CONFIG_EXPERIMENTAL" = "y" ]; then
+ bool 'Preemptible SMP Kernel (EXPERIMENTAL)' CONFIG_PREEMPT
+ fi
fi

if [ "$CONFIG_SMP" = "y" -a "$CONFIG_X86_CMPXCHG" = "y" ]; then
diff -Nur 2.4.2/arch/i386/kernel/entry.S linux/arch/i386/kernel/entry.S
--- 2.4.2/arch/i386/kernel/entry.S Wed Mar 14 12:17:37 2001
+++ linux/arch/i386/kernel/entry.S Wed Mar 14 12:23:42 2001
@@ -72,7 +72,7 @@
* these are offsets into the task-struct.
*/
state = 0
-flags = 4
+preempt_count = 4
sigpending = 8
addr_limit = 12
exec_domain = 16
@@ -80,8 +80,30 @@
tsk_ptrace = 24
processor = 52

+ /* These are offsets into the irq_stat structure
+ * There is one per cpu and it is aligned to 32
+ * byte boundary (we put that here as a shift count)
+ */
+irq_array_shift = CONFIG_X86_L1_CACHE_SHIFT
+
+irq_stat_softirq_active = 0
+irq_stat_softirq_mask = 4
+irq_stat_local_irq_count = 8
+irq_stat_local_bh_count = 12
+
ENOSYS = 38

+#ifdef CONFIG_SMP
+#define GET_CPU_INDX movl processor(%ebx),%eax; \
+ shll $irq_array_shift,%eax
+#define GET_CURRENT_CPU_INDX GET_CURRENT(%ebx); \
+ GET_CPU_INDX
+#define CPU_INDX (,%eax)
+#else
+#define GET_CPU_INDX
+#define GET_CURRENT_CPU_INDX GET_CURRENT(%ebx)
+#define CPU_INDX
+#endif

#define SAVE_ALL \
cld; \
@@ -270,16 +292,44 @@
#endif
jne handle_softirq

+#ifdef CONFIG_PREEMPT
+ cli
+ incl preempt_count(%ebx)
+#endif
ENTRY(ret_from_intr)
GET_CURRENT(%ebx)
+#ifdef CONFIG_PREEMPT
+ cli
+ decl preempt_count(%ebx)
+#endif
movl EFLAGS(%esp),%eax # mix EFLAGS and CS
movb CS(%esp),%al
testl $(VM_MASK | 3),%eax # return to VM86 mode or non-supervisor?
jne ret_with_reschedule
+#ifdef CONFIG_PREEMPT
+ cmpl $0,preempt_count(%ebx)
+ jnz restore_all
+ cmpl $0,need_resched(%ebx)
+ jz restore_all
+ movl SYMBOL_NAME(irq_stat)+irq_stat_local_bh_count CPU_INDX,%ecx
+ addl SYMBOL_NAME(irq_stat)+irq_stat_local_irq_count CPU_INDX,%ecx
+ jnz restore_all
+ incl preempt_count(%ebx)
+ sti
+ call SYMBOL_NAME(preempt_schedule)
+ jmp ret_from_intr
+#else
jmp restore_all
+#endif

ALIGN
handle_softirq:
+#ifdef CONFIG_PREEMPT
+ cli
+ GET_CURRENT(%ebx)
+ incl preempt_count(%ebx)
+ sti
+#endif
call SYMBOL_NAME(do_softirq)
jmp ret_from_intr

diff -Nur 2.4.2/arch/i386/kernel/traps.c linux/arch/i386/kernel/traps.c
--- 2.4.2/arch/i386/kernel/traps.c Wed Mar 14 12:16:46 2001
+++ linux/arch/i386/kernel/traps.c Wed Mar 14 12:22:45 2001
@@ -973,7 +973,7 @@
set_trap_gate(11,&segment_not_present);
set_trap_gate(12,&stack_segment);
set_trap_gate(13,&general_protection);
- set_trap_gate(14,&page_fault);
+ set_intr_gate(14,&page_fault);
set_trap_gate(15,&spurious_interrupt_bug);
set_trap_gate(16,&coprocessor_error);
set_trap_gate(17,&alignment_check);
diff -Nur 2.4.2/arch/i386/lib/dec_and_lock.c linux/arch/i386/lib/dec_and_lock.c
--- 2.4.2/arch/i386/lib/dec_and_lock.c Wed Mar 14 12:16:12 2001
+++ linux/arch/i386/lib/dec_and_lock.c Wed Mar 14 12:22:07 2001
@@ -8,6 +8,7 @@
*/

#include <linux/spinlock.h>
+#include <linux/sched.h>
#include <asm/atomic.h>

int atomic_dec_and_lock(atomic_t *atomic, spinlock_t *lock)
diff -Nur 2.4.2/arch/i386/mm/fault.c linux/arch/i386/mm/fault.c
--- 2.4.2/arch/i386/mm/fault.c Wed Mar 14 12:16:41 2001
+++ linux/arch/i386/mm/fault.c Wed Mar 14 12:22:40 2001
@@ -117,6 +117,9 @@
/* get the address */
__asm__("movl %%cr2,%0":"=r" (address));

+ /* It's safe to allow preemption after cr2 has been saved */
+ local_irq_restore(regs->eflags);
+
tsk = current;

/*
diff -Nur 2.4.2/fs/exec.c linux/fs/exec.c
--- 2.4.2/fs/exec.c Wed Mar 14 12:14:14 2001
+++ linux/fs/exec.c Wed Mar 14 12:19:57 2001
@@ -412,8 +412,8 @@
active_mm = current->active_mm;
current->mm = mm;
current->active_mm = mm;
- task_unlock(current);
activate_mm(active_mm, mm);
+ task_unlock(current);
mm_release();
if (old_mm) {
if (active_mm != old_mm) BUG();
diff -Nur 2.4.2/include/asm-i386/hardirq.h linux/include/asm-i386/hardirq.h
--- 2.4.2/include/asm-i386/hardirq.h Wed Mar 14 12:17:12 2001
+++ linux/include/asm-i386/hardirq.h Wed Mar 14 12:23:18 2001
@@ -36,6 +36,8 @@

#define synchronize_irq() barrier()

+#define release_irqlock(cpu) do { } while (0)
+
#else

#include <asm/atomic.h>
diff -Nur 2.4.2/include/asm-i386/hw_irq.h linux/include/asm-i386/hw_irq.h
--- 2.4.2/include/asm-i386/hw_irq.h Wed Mar 14 12:16:34 2001
+++ linux/include/asm-i386/hw_irq.h Wed Mar 14 12:22:32 2001
@@ -92,6 +92,18 @@
#define __STR(x) #x
#define STR(x) __STR(x)

+#define GET_CURRENT \
+ "movl %esp, %ebx\n\t" \
+ "andl $-8192, %ebx\n\t"
+
+#ifdef CONFIG_PREEMPT
+#define BUMP_CONTEX_SWITCH_LOCK \
+ GET_CURRENT \
+ "incl 4(%ebx)\n\t"
+#else
+#define BUMP_CONTEX_SWITCH_LOCK
+#endif
+
#define SAVE_ALL \
"cld\n\t" \
"pushl %es\n\t" \
@@ -105,14 +117,11 @@
"pushl %ebx\n\t" \
"movl $" STR(__KERNEL_DS) ",%edx\n\t" \
"movl %edx,%ds\n\t" \
- "movl %edx,%es\n\t"
+ "movl %edx,%es\n\t" \
+ BUMP_CONTEX_SWITCH_LOCK

#define IRQ_NAME2(nr) nr##_interrupt(void)
#define IRQ_NAME(nr) IRQ_NAME2(IRQ##nr)
-
-#define GET_CURRENT \
- "movl %esp, %ebx\n\t" \
- "andl $-8192, %ebx\n\t"

/*
* SMP has a few special interrupts for IPI messages
diff -Nur 2.4.2/include/asm-i386/mmu_context.h linux/include/asm-i386/mmu_context.h
--- 2.4.2/include/asm-i386/mmu_context.h Wed Mar 14 12:13:52 2001
+++ linux/include/asm-i386/mmu_context.h Wed Mar 14 12:19:34 2001
@@ -27,6 +27,10 @@

static inline void switch_mm(struct mm_struct *prev, struct mm_struct *next, struct task_struct *tsk, unsigned cpu)
{
+#ifdef CONFIG_PREEMPT
+ if (in_ctx_sw_off() == 0)
+ BUG();
+#endif
if (prev != next) {
/* stop flush ipis for the previous mm */
clear_bit(cpu, &prev->cpu_vm_mask);
diff -Nur 2.4.2/include/asm-i386/smplock.h linux/include/asm-i386/smplock.h
--- 2.4.2/include/asm-i386/smplock.h Wed Mar 14 12:14:40 2001
+++ linux/include/asm-i386/smplock.h Wed Mar 14 12:20:27 2001
@@ -10,7 +10,15 @@

extern spinlock_t kernel_flag;

+#ifdef CONFIG_SMP
#define kernel_locked() spin_is_locked(&kernel_flag)
+#else
+#ifdef CONFIG_PREEMPT
+#define kernel_locked() in_ctx_sw_off()
+#else
+#define kernel_locked() 1
+#endif
+#endif

/*
* Release global kernel lock and global interrupt lock
@@ -42,6 +50,11 @@
*/
extern __inline__ void lock_kernel(void)
{
+#ifdef CONFIG_PREEMPT
+ if (current->lock_depth == -1)
+ spin_lock(&kernel_flag);
+ ++current->lock_depth;
+#else
#if 1
if (!++current->lock_depth)
spin_lock(&kernel_flag);
@@ -53,6 +66,7 @@
"\n9:"
:"=m" (__dummy_lock(&kernel_flag)),
"=m" (current->lock_depth));
+#endif
#endif
}

diff -Nur 2.4.2/include/asm-i386/softirq.h linux/include/asm-i386/softirq.h
--- 2.4.2/include/asm-i386/softirq.h Wed Mar 14 12:16:35 2001
+++ linux/include/asm-i386/softirq.h Wed Mar 14 12:22:35 2001
@@ -4,8 +4,8 @@
#include <asm/atomic.h>
#include <asm/hardirq.h>

-#define cpu_bh_disable(cpu) do { local_bh_count(cpu)++; barrier(); } while (0)
-#define cpu_bh_enable(cpu) do { barrier(); local_bh_count(cpu)--; } while (0)
+#define cpu_bh_disable(cpu) do { ctx_sw_off(); local_bh_count(cpu)++; barrier(); } while (0)
+#define cpu_bh_enable(cpu) do { barrier(); local_bh_count(cpu)--;ctx_sw_on(); } while (0)

#define local_bh_disable() cpu_bh_disable(smp_processor_id())
#define local_bh_enable() cpu_bh_enable(smp_processor_id())
diff -Nur 2.4.2/include/asm-i386/spinlock.h linux/include/asm-i386/spinlock.h
--- 2.4.2/include/asm-i386/spinlock.h Wed Mar 14 12:16:48 2001
+++ linux/include/asm-i386/spinlock.h Wed Mar 14 12:22:49 2001
@@ -65,7 +65,7 @@
#define spin_unlock_string \
"movb $1,%0"

-static inline int spin_trylock(spinlock_t *lock)
+static inline int _raw_spin_trylock(spinlock_t *lock)
{
char oldval;
__asm__ __volatile__(
@@ -75,7 +75,7 @@
return oldval > 0;
}

-static inline void spin_lock(spinlock_t *lock)
+static inline void _raw_spin_lock(spinlock_t *lock)
{
#if SPINLOCK_DEBUG
__label__ here;
@@ -90,7 +90,7 @@
:"=m" (lock->lock) : : "memory");
}

-static inline void spin_unlock(spinlock_t *lock)
+static inline void _raw_spin_unlock(spinlock_t *lock)
{
#if SPINLOCK_DEBUG
if (lock->magic != SPINLOCK_MAGIC)
@@ -143,7 +143,7 @@
*/
/* the spinlock helpers are in arch/i386/kernel/semaphore.c */

-static inline void read_lock(rwlock_t *rw)
+static inline void _raw_read_lock(rwlock_t *rw)
{
#if SPINLOCK_DEBUG
if (rw->magic != RWLOCK_MAGIC)
@@ -152,7 +152,7 @@
__build_read_lock(rw, "__read_lock_failed");
}

-static inline void write_lock(rwlock_t *rw)
+static inline void _raw_write_lock(rwlock_t *rw)
{
#if SPINLOCK_DEBUG
if (rw->magic != RWLOCK_MAGIC)
@@ -161,10 +161,10 @@
__build_write_lock(rw, "__write_lock_failed");
}

-#define read_unlock(rw) asm volatile("lock ; incl %0" :"=m" ((rw)->lock) : : "memory")
-#define write_unlock(rw) asm volatile("lock ; addl $" RW_LOCK_BIAS_STR ",%0":"=m" ((rw)->lock) : : "memory")
+#define _raw_read_unlock(rw) asm volatile("lock ; incl %0" :"=m" ((rw)->lock) : : "memory")
+#define _raw_write_unlock(rw) asm volatile("lock ; addl $" RW_LOCK_BIAS_STR ",%0":"=m" ((rw)->lock) : : "memory")

-static inline int write_trylock(rwlock_t *lock)
+static inline int _raw_write_trylock(rwlock_t *lock)
{
atomic_t *count = (atomic_t *)lock;
if (atomic_sub_and_test(RW_LOCK_BIAS, count))
diff -Nur 2.4.2/include/linux/brlock.h linux/include/linux/brlock.h
--- 2.4.2/include/linux/brlock.h Wed Mar 14 12:14:04 2001
+++ linux/include/linux/brlock.h Wed Mar 14 12:19:47 2001
@@ -170,12 +170,19 @@
__br_write_unlock(idx);
}

+#else /* CONFIG_SMP */
+#ifdef CONFIG_PREEMPT
+# define br_read_lock(idx) ({ (void)(idx); ctx_sw_off(); })
+# define br_read_unlock(idx) ({ (void)(idx); ctx_sw_on(); })
+# define br_write_lock(idx) ({ (void)(idx); ctx_sw_off(); })
+# define br_write_unlock(idx) ({ (void)(idx); ctx_sw_on(); })
#else
# define br_read_lock(idx) ((void)(idx))
# define br_read_unlock(idx) ((void)(idx))
# define br_write_lock(idx) ((void)(idx))
# define br_write_unlock(idx) ((void)(idx))
#endif
+#endif /* CONFIG_SMP */

/*
* Now enumerate all of the possible sw/hw IRQ protected
diff -Nur 2.4.2/include/linux/dcache.h linux/include/linux/dcache.h
--- 2.4.2/include/linux/dcache.h Wed Mar 14 12:15:09 2001
+++ linux/include/linux/dcache.h Wed Mar 14 12:20:59 2001
@@ -126,31 +126,6 @@

extern spinlock_t dcache_lock;

-/**
- * d_drop - drop a dentry
- * @dentry: dentry to drop
- *
- * d_drop() unhashes the entry from the parent
- * dentry hashes, so that it won't be found through
- * a VFS lookup any more. Note that this is different
- * from deleting the dentry - d_delete will try to
- * mark the dentry negative if possible, giving a
- * successful _negative_ lookup, while d_drop will
- * just make the cache lookup fail.
- *
- * d_drop() is used mainly for stuff that wants
- * to invalidate a dentry for some reason (NFS
- * timeouts or autofs deletes).
- */
-
-static __inline__ void d_drop(struct dentry * dentry)
-{
- spin_lock(&dcache_lock);
- list_del(&dentry->d_hash);
- INIT_LIST_HEAD(&dentry->d_hash);
- spin_unlock(&dcache_lock);
-}
-
static __inline__ int dname_external(struct dentry *d)
{
return d->d_name.name != d->d_iname;
@@ -271,3 +246,34 @@
#endif /* __KERNEL__ */

#endif /* __LINUX_DCACHE_H */
+
+#if !defined(__LINUX_DCACHE_H_INLINES) && defined(_TASK_STRUCT_DEFINED)
+#define __LINUX_DCACHE_H_INLINES
+
+#ifdef __KERNEL__
+/**
+ * d_drop - drop a dentry
+ * @dentry: dentry to drop
+ *
+ * d_drop() unhashes the entry from the parent
+ * dentry hashes, so that it won't be found through
+ * a VFS lookup any more. Note that this is different
+ * from deleting the dentry - d_delete will try to
+ * mark the dentry negative if possible, giving a
+ * successful _negative_ lookup, while d_drop will
+ * just make the cache lookup fail.
+ *
+ * d_drop() is used mainly for stuff that wants
+ * to invalidate a dentry for some reason (NFS
+ * timeouts or autofs deletes).
+ */
+
+static __inline__ void d_drop(struct dentry * dentry)
+{
+ spin_lock(&dcache_lock);
+ list_del(&dentry->d_hash);
+ INIT_LIST_HEAD(&dentry->d_hash);
+ spin_unlock(&dcache_lock);
+}
+#endif
+#endif
diff -Nur 2.4.2/include/linux/fs_struct.h linux/include/linux/fs_struct.h
--- 2.4.2/include/linux/fs_struct.h Wed Mar 14 12:17:20 2001
+++ linux/include/linux/fs_struct.h Wed Mar 14 12:23:26 2001
@@ -20,6 +20,15 @@
extern void exit_fs(struct task_struct *);
extern void set_fs_altroot(void);

+struct fs_struct *copy_fs_struct(struct fs_struct *old);
+void put_fs_struct(struct fs_struct *fs);
+
+#endif
+#endif
+
+#if !defined(_LINUX_FS_STRUCT_H_INLINES) && defined(_TASK_STRUCT_DEFINED)
+#define _LINUX_FS_STRUCT_H_INLINES
+#ifdef __KERNEL__
/*
* Replace the fs->{rootmnt,root} with {mnt,dentry}. Put the old values.
* It can block. Requires the big lock held.
@@ -65,9 +74,5 @@
mntput(old_pwdmnt);
}
}
-
-struct fs_struct *copy_fs_struct(struct fs_struct *old);
-void put_fs_struct(struct fs_struct *fs);
-
#endif
#endif
diff -Nur 2.4.2/include/linux/sched.h linux/include/linux/sched.h
--- 2.4.2/include/linux/sched.h Wed Mar 14 12:15:15 2001
+++ linux/include/linux/sched.h Wed Mar 14 12:21:06 2001
@@ -86,6 +86,7 @@
#define TASK_UNINTERRUPTIBLE 2
#define TASK_ZOMBIE 4
#define TASK_STOPPED 8
+#define TASK_PREEMPTED 64

#define __set_task_state(tsk, state_value) \
do { (tsk)->state = (state_value); } while (0)
@@ -150,6 +151,9 @@
#define MAX_SCHEDULE_TIMEOUT LONG_MAX
extern signed long FASTCALL(schedule_timeout(signed long timeout));
asmlinkage void schedule(void);
+#ifdef CONFIG_PREEMPT
+asmlinkage void preempt_schedule(void);
+#endif

extern int schedule_task(struct tq_struct *task);
extern void flush_scheduled_tasks(void);
@@ -280,7 +284,17 @@
* offsets of these are hardcoded elsewhere - touch with care
*/
volatile long state; /* -1 unrunnable, 0 runnable, >0 stopped */
+#ifdef CONFIG_PREEMPT
+ /*
+ * We want the preempt_count in this cache line, but we
+ * a) don't want to mess up the offsets in asm code and
+ * b) the alignment of the next line below so
+ * we move "flags" down
+ */
+ atomic_t preempt_count; /* 0=> preemptable, < 0 => BUG */
+#else
unsigned long flags; /* per process flags, defined below */
+#endif
int sigpending;
mm_segment_t addr_limit; /* thread address space:
0-0xBFFFFFFF for user-thead
@@ -312,6 +326,9 @@

struct task_struct *next_task, *prev_task;
struct mm_struct *active_mm;
+#ifdef CONFIG_PREEMPT
+ unsigned long flags; /* per process flags, defined below */
+#endif

/* task state */
struct linux_binfmt *binfmt;
@@ -885,6 +902,11 @@
mntput(rootmnt);
return res;
}
+
+#define _TASK_STRUCT_DEFINED
+#include <linux/dcache.h>
+#include <linux/tqueue.h>
+#include <linux/fs_struct.h>

#endif /* __KERNEL__ */

diff -Nur 2.4.2/include/linux/smp.h linux/include/linux/smp.h
--- 2.4.2/include/linux/smp.h Wed Mar 14 12:15:31 2001
+++ linux/include/linux/smp.h Wed Mar 14 12:21:23 2001
@@ -81,7 +81,9 @@
#define smp_processor_id() 0
#define hard_smp_processor_id() 0
#define smp_threads_ready 1
+#ifndef CONFIG_PREEMPT
#define kernel_lock()
+#endif
#define cpu_logical_map(cpu) 0
#define cpu_number_map(cpu) 0
#define smp_call_function(func,info,retry,wait) ({ 0; })
diff -Nur 2.4.2/include/linux/smp_lock.h linux/include/linux/smp_lock.h
--- 2.4.2/include/linux/smp_lock.h Wed Mar 14 12:15:40 2001
+++ linux/include/linux/smp_lock.h Wed Mar 14 12:21:32 2001
@@ -3,7 +3,7 @@

#include <linux/config.h>

-#ifndef CONFIG_SMP
+#if !defined(CONFIG_SMP) && !defined(CONFIG_PREEMPT)

#define lock_kernel() do { } while(0)
#define unlock_kernel() do { } while(0)
diff -Nur 2.4.2/include/linux/spinlock.h linux/include/linux/spinlock.h
--- 2.4.2/include/linux/spinlock.h Wed Mar 14 12:13:53 2001
+++ linux/include/linux/spinlock.h Wed Mar 14 12:19:35 2001
@@ -40,7 +40,9 @@

#if (DEBUG_SPINLOCKS < 1)

+#ifndef CONFIG_PREEMPT
#define atomic_dec_and_lock(atomic,lock) atomic_dec_and_test(atomic)
+#endif

/*
* Your basic spinlocks, allowing only a single CPU anywhere
@@ -56,11 +58,11 @@
#endif

#define spin_lock_init(lock) do { } while(0)
-#define spin_lock(lock) (void)(lock) /* Not "unused variable". */
+#define _raw_spin_lock(lock) (void)(lock) /* Not "unused variable". */
#define spin_is_locked(lock) (0)
-#define spin_trylock(lock) ({1; })
+#define _raw_spin_trylock(lock) ({1; })
#define spin_unlock_wait(lock) do { } while(0)
-#define spin_unlock(lock) do { } while(0)
+#define _raw_spin_unlock(lock) do { } while(0)

#elif (DEBUG_SPINLOCKS < 2)

@@ -119,12 +121,74 @@
#endif

#define rwlock_init(lock) do { } while(0)
-#define read_lock(lock) (void)(lock) /* Not "unused variable". */
-#define read_unlock(lock) do { } while(0)
-#define write_lock(lock) (void)(lock) /* Not "unused variable". */
-#define write_unlock(lock) do { } while(0)
+#define _raw_read_lock(lock) (void)(lock) /* Not "unused variable". */
+#define _raw_read_unlock(lock) do { } while(0)
+#define _raw_write_lock(lock) (void)(lock) /* Not "unused variable". */
+#define _raw_write_unlock(lock) do { } while(0)

#endif /* !SMP */
+
+#ifdef CONFIG_PREEMPT
+
+#define switch_lock_count() current->preempt_count
+
+#define in_ctx_sw_off() (switch_lock_count().counter)
+#define atomic_ptr_in_ctx_sw_off() (&switch_lock_count())
+
+#define ctx_sw_off() \
+do { \
+ atomic_inc(atomic_ptr_in_ctx_sw_off()); \
+} while (0)
+
+#define ctx_sw_on_no_preempt() \
+do { \
+ atomic_dec(atomic_ptr_in_ctx_sw_off()); \
+} while (0)
+
+#define ctx_sw_on() \
+do { \
+ if (atomic_dec_and_test(atomic_ptr_in_ctx_sw_off()) && \
+ current->need_resched) \
+ preempt_schedule(); \
+} while (0)
+
+#define spin_lock(lock) \
+do { \
+ ctx_sw_off(); \
+ _raw_spin_lock(lock); \
+} while(0)
+#define spin_trylock(lock) ({ctx_sw_off(); _raw_spin_trylock(lock) ? \
+ 1 : ({ctx_sw_on(); 0;});})
+#define spin_unlock(lock) \
+do { \
+ _raw_spin_unlock(lock); \
+ ctx_sw_on(); \
+} while (0)
+
+#define read_lock(lock) ({ctx_sw_off(); _raw_read_lock(lock);})
+#define read_unlock(lock) ({_raw_read_unlock(lock); ctx_sw_on();})
+#define write_lock(lock) ({ctx_sw_off(); _raw_write_lock(lock);})
+#define write_unlock(lock) ({_raw_write_unlock(lock); ctx_sw_on();})
+#define write_trylock(lock) ({ctx_sw_off(); _raw_write_trylock(lock) ? \
+ 1 : ({ctx_sw_on(); 0;});})
+
+#else
+
+#define in_ctx_sw_off() do { } while (0)
+#define ctx_sw_off() do { } while (0)
+#define ctx_sw_on_no_preempt()
+#define ctx_sw_on() do { } while (0)
+
+#define spin_lock(lock) _raw_spin_lock(lock)
+#define spin_trylock(lock) _raw_spin_trylock(lock)
+#define spin_unlock(lock) _raw_spin_unlock(lock)
+
+#define read_lock(lock) _raw_read_lock(lock)
+#define read_unlock(lock) _raw_read_unlock(lock)
+#define write_lock(lock) _raw_write_lock(lock)
+#define write_unlock(lock) _raw_write_unlock(lock)
+#define write_trylock(lock) _raw_write_trylock(lock)
+#endif

/* "lock on reference count zero" */
#ifndef atomic_dec_and_lock
diff -Nur 2.4.2/include/linux/tqueue.h linux/include/linux/tqueue.h
--- 2.4.2/include/linux/tqueue.h Wed Mar 14 12:15:53 2001
+++ linux/include/linux/tqueue.h Wed Mar 14 12:21:46 2001
@@ -75,6 +75,22 @@
extern spinlock_t tqueue_lock;

/*
+ * Call all "bottom halfs" on a given list.
+ */
+
+extern void __run_task_queue(task_queue *list);
+
+static inline void run_task_queue(task_queue *list)
+{
+ if (TQ_ACTIVE(*list))
+ __run_task_queue(list);
+}
+
+#endif /* _LINUX_TQUEUE_H */
+
+#if !defined(_LINUX_TQUEUE_H_INLINES) && defined(_TASK_STRUCT_DEFINED)
+#define _LINUX_TQUEUE_H_INLINES
+/*
* Queue a task on a tq. Return non-zero if it was successfully
* added.
*/
@@ -90,17 +106,4 @@
}
return ret;
}
-
-/*
- * Call all "bottom halfs" on a given list.
- */
-
-extern void __run_task_queue(task_queue *list);
-
-static inline void run_task_queue(task_queue *list)
-{
- if (TQ_ACTIVE(*list))
- __run_task_queue(list);
-}
-
-#endif /* _LINUX_TQUEUE_H */
+#endif
diff -Nur 2.4.2/kernel/exit.c linux/kernel/exit.c
--- 2.4.2/kernel/exit.c Wed Mar 14 12:16:14 2001
+++ linux/kernel/exit.c Wed Mar 14 12:22:10 2001
@@ -276,6 +276,10 @@
struct mm_struct * start_lazy_tlb(void)
{
struct mm_struct *mm = current->mm;
+#ifdef CONFIG_PREEMPT
+ if (in_ctx_sw_off() == 0)
+ BUG();
+#endif
current->mm = NULL;
/* active_mm is still 'mm' */
atomic_inc(&mm->mm_count);
@@ -287,6 +291,10 @@
{
struct mm_struct *active_mm = current->active_mm;

+#ifdef CONFIG_PREEMPT
+ if (in_ctx_sw_off() == 0)
+ BUG();
+#endif
current->mm = mm;
if (mm != active_mm) {
current->active_mm = mm;
@@ -310,8 +318,8 @@
/* more a memory barrier than a real lock */
task_lock(tsk);
tsk->mm = NULL;
- task_unlock(tsk);
enter_lazy_tlb(mm, current, smp_processor_id());
+ task_unlock(tsk);
mmput(mm);
}
}
diff -Nur 2.4.2/kernel/fork.c linux/kernel/fork.c
--- 2.4.2/kernel/fork.c Wed Mar 14 12:14:12 2001
+++ linux/kernel/fork.c Wed Mar 14 12:19:57 2001
@@ -594,6 +594,12 @@
if (p->binfmt && p->binfmt->module)
__MOD_INC_USE_COUNT(p->binfmt->module);

+#ifdef CONFIG_PREEMPT
+ /* Since we are keeping the context switch off state as part
+ * of the context, make sure we start with it off.
+ */
+ p->preempt_count.counter = 1;
+#endif
p->did_exec = 0;
p->swappable = 0;
p->state = TASK_UNINTERRUPTIBLE;
diff -Nur 2.4.2/kernel/ksyms.c linux/kernel/ksyms.c
--- 2.4.2/kernel/ksyms.c Wed Mar 14 12:16:28 2001
+++ linux/kernel/ksyms.c Wed Mar 14 12:22:27 2001
@@ -427,6 +427,9 @@
EXPORT_SYMBOL(interruptible_sleep_on);
EXPORT_SYMBOL(interruptible_sleep_on_timeout);
EXPORT_SYMBOL(schedule);
+#ifdef CONFIG_PREEMPT
+EXPORT_SYMBOL(preempt_schedule);
+#endif
EXPORT_SYMBOL(schedule_timeout);
EXPORT_SYMBOL(jiffies);
EXPORT_SYMBOL(xtime);
diff -Nur 2.4.2/kernel/sched.c linux/kernel/sched.c
--- 2.4.2/kernel/sched.c Wed Mar 14 12:13:59 2001
+++ linux/kernel/sched.c Wed Mar 14 12:19:41 2001
@@ -443,7 +443,7 @@
task_lock(prev);
prev->has_cpu = 0;
mb();
- if (prev->state == TASK_RUNNING)
+ if (task_on_runqueue(prev))
goto needs_resched;

out_unlock:
@@ -473,7 +473,7 @@
goto out_unlock;

spin_lock_irqsave(&runqueue_lock, flags);
- if (prev->state == TASK_RUNNING)
+ if (task_on_runqueue(prev))
reschedule_idle(prev);
spin_unlock_irqrestore(&runqueue_lock, flags);
goto out_unlock;
@@ -486,6 +486,9 @@
void schedule_tail(struct task_struct *prev)
{
__schedule_tail(prev);
+#ifdef CONFIG_PREEMPT
+ ctx_sw_on();
+#endif
}

/*
@@ -505,6 +508,10 @@
struct list_head *tmp;
int this_cpu, c;

+#ifdef CONFIG_PREEMPT
+ ctx_sw_off();
+#endif
+
if (!current->active_mm) BUG();
need_resched_back:
prev = current;
@@ -540,7 +547,14 @@
break;
}
default:
+#ifdef CONFIG_PREEMPT
+ if (prev->state & TASK_PREEMPTED)
+ break;
+#endif
del_from_runqueue(prev);
+#ifdef CONFIG_PREEMPT
+ case TASK_PREEMPTED:
+#endif
case TASK_RUNNING:
}
prev->need_resched = 0;
@@ -555,7 +569,7 @@
*/
next = idle_task(this_cpu);
c = -1000;
- if (prev->state == TASK_RUNNING)
+ if (task_on_runqueue(prev))
goto still_running;

still_running_back:
@@ -646,6 +660,9 @@
if (current->need_resched)
goto need_resched_back;

+#ifdef CONFIG_PREEMPT
+ ctx_sw_on_no_preempt();
+#endif
return;

recalculate:
@@ -1231,3 +1248,15 @@
atomic_inc(&init_mm.mm_count);
enter_lazy_tlb(&init_mm, current, cpu);
}
+#ifdef CONFIG_PREEMPT
+asmlinkage void preempt_schedule(void)
+{
+ while (current->need_resched) {
+ ctx_sw_off();
+ current->state |= TASK_PREEMPTED;
+ schedule();
+ current->state &= ~TASK_PREEMPTED;
+ ctx_sw_on_no_preempt();
+ }
+}
+#endif
diff -Nur 2.4.2/lib/dec_and_lock.c linux/lib/dec_and_lock.c
--- 2.4.2/lib/dec_and_lock.c Wed Mar 14 12:14:15 2001
+++ linux/lib/dec_and_lock.c Wed Mar 14 12:19:57 2001
@@ -1,4 +1,5 @@
#include <linux/spinlock.h>
+#include <linux/sched.h>
#include <asm/atomic.h>

/*
diff -Nur 2.4.2/net/socket.c linux/net/socket.c
--- 2.4.2/net/socket.c Wed Mar 14 12:16:29 2001
+++ linux/net/socket.c Wed Mar 14 12:22:26 2001
@@ -131,7 +131,7 @@

static struct net_proto_family *net_families[NPROTO];

-#ifdef CONFIG_SMP
+#if defined(CONFIG_SMP) || defined(CONFIG_PREEMPT)
static atomic_t net_family_lockct = ATOMIC_INIT(0);
static spinlock_t net_family_lock = SPIN_LOCK_UNLOCKED;



2001-03-17 21:32:14

by Pavel Machek

Subject: Re: [PATCH for 2.5] preemptible kernel

Hi!

> Here is the latest preemptible kernel patch. It's much cleaner and
> smaller than previous versions, so I've appended it to this mail. This
> patch is against 2.4.2, although it's not intended for 2.4. I'd like
> comments from anyone interested in a low-latency Linux kernel solution
> for the 2.5 development tree.
>
> Kernel preemption is not allowed while spinlocks are held, which means
> that this patch alone cannot guarantee low preemption latencies. But
> as long held locks (in particular the BKL) are replaced by finer-grained
> locks, this patch will enable lower latencies as the kernel also becomes
> more scalable on large SMP systems.
>
> Notwithstanding the comments in the Configure.help section for
> CONFIG_PREEMPT, I think this patch has a negligible effect on
> throughput. In fact, I got better average results from running 'dbench
> 16' on a 750MHz PIII with 128MB with kernel preemption turned on
> (~30MB/s) than on the plain 2.4.2 kernel (~26MB/s).

That is not a bad result!

> (I had to rearrange three headers files that are needed in sched.h before
> task_struct is defined, but which include inline functions that cannot
> now be compiled until after task_struct is defined. I chose not to
> move them into sched.h, like d_path(), as I don't want to make it more
> difficult to apply kernel patches to my kernel source tree.)


> diff -Nur 2.4.2/arch/i386/kernel/traps.c linux/arch/i386/kernel/traps.c
> --- 2.4.2/arch/i386/kernel/traps.c Wed Mar 14 12:16:46 2001
> +++ linux/arch/i386/kernel/traps.c Wed Mar 14 12:22:45 2001
> @@ -973,7 +973,7 @@
> set_trap_gate(11,&segment_not_present);
> set_trap_gate(12,&stack_segment);
> set_trap_gate(13,&general_protection);
> - set_trap_gate(14,&page_fault);
> + set_intr_gate(14,&page_fault);
> set_trap_gate(15,&spurious_interrupt_bug);
> set_trap_gate(16,&coprocessor_error);
> set_trap_gate(17,&alignment_check);

Are you sure about this piece? At least add a comment, because it
*looks* strange.
Pavel
--
I'm [email protected]. "In my country we have almost anarchy and I don't care."
Panos Katsaloulis describing me w.r.t. patents at [email protected]

2001-03-19 21:02:44

by Nigel Gamble

Subject: Re: [PATCH for 2.5] preemptible kernel

Hi Pavel,

Thanks for your comments.

On Sat, 17 Mar 2001, Pavel Machek wrote:
> > diff -Nur 2.4.2/arch/i386/kernel/traps.c linux/arch/i386/kernel/traps.c
> > --- 2.4.2/arch/i386/kernel/traps.c Wed Mar 14 12:16:46 2001
> > +++ linux/arch/i386/kernel/traps.c Wed Mar 14 12:22:45 2001
> > @@ -973,7 +973,7 @@
> > set_trap_gate(11,&segment_not_present);
> > set_trap_gate(12,&stack_segment);
> > set_trap_gate(13,&general_protection);
> > - set_trap_gate(14,&page_fault);
> > + set_intr_gate(14,&page_fault);
> > set_trap_gate(15,&spurious_interrupt_bug);
> > set_trap_gate(16,&coprocessor_error);
> > set_trap_gate(17,&alignment_check);
>
> Are you sure about this piece? At least add a comment, because it
> *looks* strange.

With a preemptible kernel, we need to enter the page fault handler with
interrupts disabled to protect the cr2 register. The interrupt state is
restored immediately after cr2 has been saved. Otherwise, an interrupt
could cause the faulting thread to be preempted, and the new thread
could also fault, clobbering the cr2 register for the preempted thread.
See the diff for linux/arch/i386/mm/fault.c.
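
Condensed, the two relevant fragments of the patch are:

        /* arch/i386/kernel/traps.c: use an interrupt gate so that the
         * page fault handler is entered with interrupts disabled */
        set_intr_gate(14,&page_fault);

        /* arch/i386/mm/fault.c: read cr2 with interrupts still off, then
         * restore the interrupt state of the faulting context */
        __asm__("movl %%cr2,%0":"=r" (address));

        /* It's safe to allow preemption after cr2 has been saved */
        local_irq_restore(regs->eflags);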

Nigel Gamble [email protected]
Mountain View, CA, USA. http://www.nrg.org/

MontaVista Software [email protected]

2001-03-20 08:41:46

by Rusty Russell

Subject: Re: [PATCH for 2.5] preemptible kernel

In message <[email protected]> you write:
> Kernel preemption is not allowed while spinlocks are held, which means
> that this patch alone cannot guarantee low preemption latencies. But
> as long held locks (in particular the BKL) are replaced by finer-grained
> locks, this patch will enable lower latencies as the kernel also becomes
> more scalable on large SMP systems.

Hi Nigel,

I can see three problems with this approach, only one of which
is serious.

The first is that code which is already SMP unsafe is now a problem for
everyone, not just the 0.1% of SMP machines. I consider this a good
thing for 2.5 though.

The second is that there are "manual" locking schemes which are used
in several places in the kernel which rely on non-preemptability;
de-facto spinlocks if you will. I consider all these uses flawed: (1)
they are often subtly broken anyway, (2) they make reading those parts
of the code much harder, and (3) they break when things like this are
done.

The third is that preemptibility conflicts with the naive
quiescent-period approach proposed for module unloading in 2.5, and
useful for several other things (eg. hotplugging CPUs). This method
relies on knowing that when a schedule() has occurred on every CPU, we
know no one is holding certain references. The simplest example is a
single linked list: you can traverse without a lock as long as you
don't sleep, and then someone can unlink a node, and wait for a
schedule on every other CPU before freeing it. The non-SMP case is a
no-op. See synchronize_kernel() below.

This, too, is soluble, but it means that synchronize_kernel() must
guarantee that each task which was running or preempted in kernel
space when it was called, has been non-preemptively scheduled before
synchronize_kernel() can exit. Icky.

Thoughts?
Rusty.
--
Premature optmztion is rt of all evl. --DK

/* We could keep a schedule count for each CPU and make idle tasks
schedule (some don't unless need_resched), but this scales quite
well (eg. 64 processors, average time to wait for first schedule =
jiffie/64. Total time for all processors = jiffie/63 + jiffie/62...

At 1024 cpus, this is about 7.5 jiffies. And that assumes no one
schedules early. --RR */
void synchronize_kernel(void)
{
        unsigned long cpus_allowed, policy, rt_priority;

        /* Save current state */
        cpus_allowed = current->cpus_allowed;
        policy = current->policy;
        rt_priority = current->rt_priority;

        /* Create an unreal time task. */
        current->policy = SCHED_FIFO;
        current->rt_priority = 1001 + sys_sched_get_priority_max(SCHED_FIFO);

        /* Make us schedulable on all CPUs. */
        current->cpus_allowed = (1UL<<smp_num_cpus)-1;

        /* Eliminate current cpu, reschedule */
        while ((current->cpus_allowed &= ~(1 << smp_processor_id())) != 0)
                schedule();

        /* Back to normal. */
        current->cpus_allowed = cpus_allowed;
        current->policy = policy;
        current->rt_priority = rt_priority;
}

2001-03-20 09:33:20

by Keith Owens

Subject: Re: [PATCH for 2.5] preemptible kernel

On Tue, 20 Mar 2001 19:43:50 +1100,
Rusty Russell <[email protected]> wrote:
>The third is that preemptibility conflicts with the naive
>quiescent-period approach proposed for module unloading in 2.5, and
>useful for several other things (eg. hotplugging CPUs). This method
>relies on knowing that when a schedule() has occurred on every CPU, we
>know no one is holding certain references.
>
>This, too, is soluble, but it means that synchronize_kernel() must
>guarantee that each task which was running or preempted in kernel
>space when it was called, has been non-preemptively scheduled before
>synchronize_kernel() can exit. Icky.

The preemption patch only allows preemption from interrupt and only for
a single level of preemption. That coexists quite happily with
synchronize_kernel() which runs in user context. Just count user
context schedules (preempt_count == 0), not preemptive schedules.
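
As a rough sketch of what I mean (these names are invented and are not
in the patch): keep a per-cpu count of non-preemptive schedules and let
synchronize_kernel() wait for every cpu's count to advance.

        /* hypothetical per-cpu counters, bumped only for schedules that
         * were not preemptions (the patch marks those with TASK_PREEMPTED) */
        static unsigned long voluntary_schedules[NR_CPUS];

        /* in schedule(), once prev and this_cpu are known: */
        if (!(prev->state & TASK_PREEMPTED))
                voluntary_schedules[this_cpu]++;

        /* synchronize_kernel() samples all the counters, then waits
         * until each cpu's counter has moved on. */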

2001-03-20 18:28:53

by Roger Larsson

Subject: Re: [PATCH for 2.5] preemptible kernel

Hi,

One little readability thing I found.
The prev->state TASK_ value is mostly used as a plain value,
but the new TASK_PREEMPTED is OR'ed together with whatever was there.
Later, when we switch on the state, it is checked against TASK_PREEMPTED
only. Since TASK_RUNNING is 0 it works OK, but...

--- sched.c.nigel Tue Mar 20 18:52:43 2001
+++ sched.c.roger Tue Mar 20 19:03:28 2001
@@ -553,7 +553,7 @@
#endif
del_from_runqueue(prev);
#ifdef CONFIG_PREEMPT
- case TASK_PREEMPTED:
+ case TASK_RUNNING | TASK_PREEMPTED:
#endif
case TASK_RUNNING:
}


We could add all/(other common) combinations as cases

switch (prev->state) {
        case TASK_INTERRUPTIBLE:
                if (signal_pending(prev)) {
                        prev->state = TASK_RUNNING;
                        break;
                }
        default:
#ifdef CONFIG_PREEMPT
                if (prev->state & TASK_PREEMPTED)
                        break;
#endif
                del_from_runqueue(prev);
#ifdef CONFIG_PREEMPT
        case TASK_RUNNING | TASK_PREEMPTED:
        case TASK_INTERRUPTIBLE | TASK_PREEMPTED:
        case TASK_UNINTERRUPTIBLE | TASK_PREEMPTED:
#endif
        case TASK_RUNNING:
}


Then the break in default case could almost be replaced with a BUG()...
(I have not checked the generated code)

/RogerL

2001-03-20 22:07:56

by Nigel Gamble

Subject: Re: [PATCH for 2.5] preemptible kernel

On Tue, 20 Mar 2001, Roger Larsson wrote:
> One little readability thing I found.
> The prev->state TASK_ value is mostly used as a plain value
> but the new TASK_PREEMPTED is or:ed together with whatever was there.
> Later when we switch to check the state it is checked against TASK_PREEMPTED
> only. Since TASK_RUNNING is 0 it works OK but...

Yes, you're right. I had forgotten that TASK_RUNNING is 0 and I think I
was assuming that there could be (rare) cases where a task was preempted
while prev->state was in transition such that no other flags were set.
This is, of course, impossible given that TASK_RUNNING is 0. So your
change makes the common case more obvious (to me, at least!)

> --- sched.c.nigel Tue Mar 20 18:52:43 2001
> +++ sched.c.roger Tue Mar 20 19:03:28 2001
> @@ -553,7 +553,7 @@
> #endif
> del_from_runqueue(prev);
> #ifdef CONFIG_PREEMPT
> - case TASK_PREEMPTED:
> + case TASK_RUNNING | TASK_PREEMPTED:
> #endif
> case TASK_RUNNING:
> }
>
>
> We could add all/(other common) combinations as cases
>
> switch (prev->state) {
> case TASK_INTERRUPTIBLE:
> if (signal_pending(prev)) {
> prev->state = TASK_RUNNING;
> break;
> }
> default:
> #ifdef CONFIG_PREEMPT
> if (prev->state & TASK_PREEMPTED)
> break;
> #endif
> del_from_runqueue(prev);
> #ifdef CONFIG_PREEMPT
> case TASK_RUNNING | TASK_PREEMPTED:
> case TASK_INTERRUPTIBLE | TASK_PREEMPTED:
> case TASK_UNINTERRUPTIBLE | TASK_PREEMPTED:
> #endif
> case TASK_RUNNING:
> }
>
>
> Then the break in default case could almost be replaced with a BUG()...
> (I have not checked the generated code)

The other cases are not very common, as they only happen if a task is
preempted during the short time that it is running while in the process
of changing state while going to sleep or waking up, so the default case
is probably OK for them; and I'd be happier to leave the default case
for reliability reasons anyway.

Nigel Gamble [email protected]
Mountain View, CA, USA. http://www.nrg.org/

MontaVista Software [email protected]

2001-03-20 22:32:47

by George Anzinger

Subject: Re: [PATCH for 2.5] preemptible kernel

Nigel Gamble wrote:
>
> On Tue, 20 Mar 2001, Roger Larsson wrote:
> > One little readability thing I found.
> > The prev->state TASK_ value is mostly used as a plain value
> > but the new TASK_PREEMPTED is or:ed together with whatever was there.
> > Later when we switch to check the state it is checked against TASK_PREEMPTED
> > only. Since TASK_RUNNING is 0 it works OK but...
>
> Yes, you're right. I had forgotten that TASK_RUNNING is 0 and I think I
> was assuming that there could be (rare) cases where a task was preempted
> while prev->state was in transition such that no other flags were set.
> This is, of course, impossible given that TASK_RUNNING is 0. So your
> change makes the common case more obvious (to me, at least!)
>
> > --- sched.c.nigel Tue Mar 20 18:52:43 2001
> > +++ sched.c.roger Tue Mar 20 19:03:28 2001
> > @@ -553,7 +553,7 @@
> > #endif
> > del_from_runqueue(prev);
> > #ifdef CONFIG_PREEMPT
> > - case TASK_PREEMPTED:
> > + case TASK_RUNNING | TASK_PREEMPTED:
> > #endif
> > case TASK_RUNNING:
> > }
> >
> >
> > We could add all/(other common) combinations as cases
> >
> > switch (prev->state) {
> > case TASK_INTERRUPTIBLE:
> > if (signal_pending(prev)) {
> > prev->state = TASK_RUNNING;
> > break;
> > }
> > default:
> > #ifdef CONFIG_PREEMPT
> > if (prev->state & TASK_PREEMPTED)
> > break;
> > #endif
> > del_from_runqueue(prev);
> > #ifdef CONFIG_PREEMPT
> > case TASK_RUNNING | TASK_PREEMPTED:
> > case TASK_INTERRUPTIBLE | TASK_PREEMPTED:
> > case TASK_UNINTERRUPTIBLE | TASK_PREEMPTED:
> > #endif
> > case TASK_RUNNING:
> > }
> >
> >
> > Then the break in default case could almost be replaced with a BUG()...
> > (I have not checked the generated code)
>
> The other cases are not very common, as they only happen if a task is
> preempted during the short time that it is running while in the process
> of changing state while going to sleep or waking up, so the default case
> is probably OK for them; and I'd be happier to leave the default case
> for reliability reasons anyway.

Especially since he forgot:

TASK_ZOMBIE
TASK_STOPPED
TASK_SWAPPING

I don't know about the last two but TASK_ZOMBIE must be handled
correctly or the task will never clear.

In general, a task must run until it gets to schedule() before the actual
state is "real", hence the need for TASK_PREEMPTED.

The actual code generated with what you propose should be the same (even
if TASK_RUNNING != 0, except for the constant).

George

2001-03-21 00:25:17

by Nigel Gamble

Subject: Re: [PATCH for 2.5] preemptible kernel

On Tue, 20 Mar 2001, Rusty Russell wrote:
> I can see three problems with this approach, only one of which
> is serious.
>
> The first is code which is already SMP unsafe is now a problem for
> everyone, not just the 0.1% of SMP machines. I consider this a good
> thing for 2.5 though.

So do I.

> The second is that there are "manual" locking schemes which are used
> in several places in the kernel which rely on non-preemptability;
> de-facto spinlocks if you will. I consider all these uses flawed: (1)
> they are often subtly broken anyway, (2) they make reading those parts
> of the code much harder, and (3) they break when things like this are
> done.

Likewise.

> The third is that preemptibility conflicts with the naive
> quiescent-period approach proposed for module unloading in 2.5, and
> useful for several other things (eg. hotplugging CPUs). This method
> relies on knowing that when a schedule() has occurred on every CPU, we
> know no one is holding certain references. The simplest example is a
> single linked list: you can traverse without a lock as long as you
> don't sleep, and then someone can unlink a node, and wait for a
> schedule on every other CPU before freeing it. The non-SMP case is a
> no-op. See synchronize_kernel() below.

So, to make sure I understand this, the code to free a node would look
like:

prev->next = node->next; /* assumed to be atomic */
synchronize_kernel();
free(node);

So that any other CPU concurrently traversing the list would see a
consistent state, either including or not including "node" before the
call to synchronize_kernel(); but after synchronize_kernel() all other
CPUs are guaranteed to see a list that no longer includes "node", so it
is now safe to free it.

It looks like there are also implicit assumptions to this approach, like
no other CPU is trying to use the same approach simultaneously to free
"prev". So my initial reaction is that this approach is, like the
manual locking schemes you commented on above, open to being subtly
broken when people don't understand all the implicit assumptions and
subsequently invalidate them.

> This, too, is soluble, but it means that synchronize_kernel() must
> guarantee that each task which was running or preempted in kernel
> space when it was called, has been non-preemptively scheduled before
> synchronize_kernel() can exit. Icky.

Yes, you're right.

> Thoughts?

Perhaps synchronize_kernel() could take the run_queue lock, mark all the
tasks on it and count them. Any task marked when it calls schedule()
voluntarily (but not if it is preempted) is unmarked and the count
decremented. synchronize_kernel() continues until the count is zero.
As you said, "Icky."
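
In rough, untested pseudo-C (the flag, counter and locking details here
are only illustrative, nothing in the patch defines them):

        static atomic_t sync_count;

        void synchronize_kernel(void)
        {
                struct task_struct *p;

                spin_lock_irq(&runqueue_lock);
                for_each_task(p) {                      /* really: only tasks on the run queue */
                        p->flags |= PF_SYNC_PENDING;    /* invented flag */
                        atomic_inc(&sync_count);
                }
                spin_unlock_irq(&runqueue_lock);

                while (atomic_read(&sync_count) > 0)
                        schedule();
        }

        /* and in schedule(), on the voluntary (non-TASK_PREEMPTED) path: */
        if (prev->flags & PF_SYNC_PENDING) {
                prev->flags &= ~PF_SYNC_PENDING;
                atomic_dec(&sync_count);
        }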

> /* We could keep a schedule count for each CPU and make idle tasks
> schedule (some don't unless need_resched), but this scales quite
> well (eg. 64 processors, average time to wait for first schedule =
> jiffie/64. Total time for all processors = jiffie/63 + jiffie/62...
>
> At 1024 cpus, this is about 7.5 jiffies. And that assumes no one
> schedules early. --RR */
> void synchronize_kernel(void)
> {
> unsigned long cpus_allowed, policy, rt_priority;
>
> /* Save current state */
> cpus_allowed = current->cpus_allowed;
> policy = current->policy;
> rt_priority = current->rt_priority;
>
> /* Create an unreal time task. */
> current->policy = SCHED_FIFO;
> current->rt_priority = 1001 + sys_sched_get_priority_max(SCHED_FIFO);
>
> /* Make us schedulable on all CPUs. */
> current->cpus_allowed = (1UL<<smp_num_cpus)-1;
>
> /* Eliminate current cpu, reschedule */
> while ((current->cpus_allowed &= ~(1 << smp_processor_id())) != 0)
> schedule();
>
> /* Back to normal. */
> current->cpus_allowed = cpus_allowed;
> current->policy = policy;
> current->rt_priority = rt_priority;
> }
>

Nigel Gamble [email protected]
Mountain View, CA, USA. http://www.nrg.org/

MontaVista Software [email protected]

2001-03-21 00:49:19

by Nigel Gamble

Subject: Re: [PATCH for 2.5] preemptible kernel

On Tue, 20 Mar 2001, Keith Owens wrote:
> The preemption patch only allows preemption from interrupt and only for
> a single level of preemption. That coexists quite happily with
> synchronize_kernel() which runs in user context. Just count user
> context schedules (preempt_count == 0), not preemptive schedules.

I'm not sure what you mean by "only for a single level of preemption."
It's possible for a preempting process to be preempted itself by a
higher priority process, and for that process to be preempted by an even
higher priority one, limited only by the number of processes waiting for
interrupt handlers to make them runnable. This isn't very likely in
practice (kernel preemptions tend to be rare compared to normal calls to
schedule()), but it could happen in theory.

If you're looking at preempt_schedule(), note the call to ctx_sw_off()
only increments current->preempt_count for the preempted task - the
higher priority preempting task that is about to be scheduled will have
a preempt_count of 0.
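
For reference, here is preempt_schedule() from the patch:

        asmlinkage void preempt_schedule(void)
        {
                while (current->need_resched) {
                        ctx_sw_off();
                        current->state |= TASK_PREEMPTED;
                        schedule();
                        current->state &= ~TASK_PREEMPTED;
                        ctx_sw_on_no_preempt();
                }
        }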

Nigel Gamble [email protected]
Mountain View, CA, USA. http://www.nrg.org/

MontaVista Software [email protected]

2001-03-21 01:24:35

by Keith Owens

Subject: Re: [PATCH for 2.5] preemptible kernel

On Tue, 20 Mar 2001 16:48:01 -0800 (PST),
Nigel Gamble <[email protected]> wrote:
>On Tue, 20 Mar 2001, Keith Owens wrote:
>> The preemption patch only allows preemption from interrupt and only for
>> a single level of preemption. That coexists quite happily with
>> synchronize_kernel() which runs in user context. Just count user
>> context schedules (preempt_count == 0), not preemptive schedules.
>
>If you're looking at preempt_schedule(), note the call to ctx_sw_off()
>only increments current->preempt_count for the preempted task - the
>higher priority preempting task that is about to be scheduled will have
>a preempt_count of 0.

I misread the code, but the idea is still correct. Add a preemption
depth counter to each cpu; when you schedule and the depth is zero, you
know that the cpu is no longer holding any references to quiesced
structures.

>So, to make sure I understand this, the code to free a node would look
>like:
>
> prev->next = node->next; /* assumed to be atomic */
> synchronize_kernel();
> free(node);
>
>So that any other CPU concurrently traversing the list would see a
>consistent state, either including or not including "node" before the
>call to synchronize_kernel(); but after synchronize_kernel() all other
>CPUs are guaranteed to see a list that no longer includes "node", so it
>is now safe to free it.
>
>It looks like there are also implicit assumptions to this approach, like
>no other CPU is trying to use the same approach simultaneously to free
>"prev".

Not quite. The idea is that readers can traverse lists without locks;
code that changes the list needs to take a semaphore first.

Read
        node = node->next;

Update
        down(&list_sem);
        prev->next = node->next;
        synchronize_kernel();
        free(node);
        up(&list_sem);

Because the readers have no locks, other cpus can have references to
the node being freed. The updating code needs to wait until all other
cpus have gone through at least one schedule to ensure that all
dangling references have been flushed. Adding preemption complicates
this slightly: we have to distinguish between the bottom level schedule
and higher level schedules for preemptive code. Only when all
preemptive code on a cpu has ended is it safe to say that there are no
dangling references left on that cpu.

This method is a win for high read, low update lists. Instead of
penalizing the read code every time on the off chance that somebody
will update the data, speed up the common code and penalize the update
code. The classic example is module code: it is rarely unloaded, but
right now everything that *might* be entering a module has to grab the
module spin lock and update the module use count. So mainline code is
being slowed down all the time.

2001-03-21 03:36:34

by Nigel Gamble

Subject: Re: [PATCH for 2.5] preemptible kernel

On Wed, 21 Mar 2001, Keith Owens wrote:
> I misread the code, but the idea is still correct. Add a preemption
> depth counter to each cpu, when you schedule and the depth is zero then
> you know that the cpu is no longer holding any references to quiesced
> structures.

A task that has been preempted is on the run queue and can be
rescheduled on a different CPU, so I can't see how a per-CPU counter
would work. It seems to me that you would need a per run queue
counter, like the example I gave in a previous posting.

Nigel Gamble [email protected]
Mountain View, CA, USA. http://www.nrg.org/

MontaVista Software [email protected]

2001-03-21 08:10:57

by George Anzinger

Subject: Re: [PATCH for 2.5] preemptible kernel

Nigel Gamble wrote:
>
> On Wed, 21 Mar 2001, Keith Owens wrote:
> > I misread the code, but the idea is still correct. Add a preemption
> > depth counter to each cpu, when you schedule and the depth is zero then
> > you know that the cpu is no longer holding any references to quiesced
> > structures.
>
> A task that has been preempted is on the run queue and can be
> rescheduled on a different CPU, so I can't see how a per-CPU counter
> would work. It seems to me that you would need a per run queue
> counter, like the example I gave in a previous posting.

Exactly so. The method does not depend on the sum of preemption being
zip, but on each potential reader (writers take locks) passing through a
"sync point". Your notion of waiting for each task to arrive
"naturally" at schedule() would work. It is, in fact, overkill, as you
could also add arrival at syscall exit as a (the) "sync point". In
fact, for module unload, isn't this the real "sync point"? After all, a
module can call schedule, or did I miss a usage counter somewhere?

By the way, there is a paper on this somewhere on the web. Anyone
remember where?

George

2001-03-21 09:06:02

by Keith Owens

Subject: Re: [PATCH for 2.5] preemptible kernel

On Wed, 21 Mar 2001 00:04:56 -0800,
george anzinger <[email protected]> wrote:
>Exactly so. The method does not depend on the sum of preemption being
>zip, but on each potential reader (writers take locks) passing thru a
>"sync point". Your notion of waiting for each task to arrive
>"naturally" at schedule() would work. It is, in fact, over kill as you
>could also add arrival at sys call exit as a (the) "sync point". In
>fact, for module unload, isn't this the real "sync point"? After all, a
>module can call schedule, or did I miss a usage counter somewhere?

A module can call schedule but it must do MOD_INC_USE_COUNT first.
Sleeping in module code without incrementing the module use count first
is a shooting offence. It is so full of races that you may as well
call it Daytona.

>By the way, there is a paper on this somewhere on the web. Anyone
>remember where?

http://www.rdrop.com/users/paulmck/paper/rclockpdcsproof.pdf

2001-03-21 09:21:02

by Keith Owens

Subject: Re: [PATCH for 2.5] preemptible kernel

Nigel Gamble wrote:
> A task that has been preempted is on the run queue and can be
> rescheduled on a different CPU, so I can't see how a per-CPU counter
> would work. It seems to me that you would need a per run queue
> counter, like the example I gave in a previous posting.

Ouch. What about all the per-cpu structures in the kernel? How do you
handle them if a preempted task can be rescheduled on another cpu?

int count[NR_CPUS], *p;

p = count + smp_processor_id();         /* start on cpu 0, &count[0] */
if (*p >= 1024) {
        /* preempt here, reschedule on cpu 1 */
        *p = 1;                         /* update cpu 0 count from cpu 1, oops */
}

Unless you find every use of a per cpu structure and wrap a spin lock
around it, migrating a task from one cpu to another is going to be a
source of weird and wonderful errors. Since the main use of per cpu
structures is to avoid locks, adding a spin lock to every structure
will be a killer. Or have I missed something?

2001-03-21 09:42:53

by David Miller

Subject: Re: [PATCH for 2.5] preemptible kernel


Keith Owens writes:
> Or have I missed something?

Nope, it is a fundamental problem with such kernel pre-emption
schemes. As a result, it would also break our big-reader locks
(see include/linux/brlock.h).

Basically, anything which uses smp_processor_id() would need to
be holding some lock so as to not get pre-empted.

Later,
David S. Miller
[email protected]

2001-03-21 10:05:13

by Andrew Morton

Subject: Re: [PATCH for 2.5] preemptible kernel

"David S. Miller" wrote:
>
> Keith Owens writes:
> > Or have I missed something?
>
> Nope, it is a fundamental problem with such kernel pre-emption
> schemes. As a result, it would also break our big-reader locks
> (see include/linux/brlock.h).
>
> Basically, anything which uses smp_processor_id() would need to
> be holding some lock so as to not get pre-empted.
>

It's a problem for uniprocessors as well.

Example:

#define current_cpu_data boot_cpu_data
#define pgd_quicklist (current_cpu_data.pgd_quick)

extern __inline__ void free_pgd_fast(pgd_t *pgd)
{
        *(unsigned long *)pgd = (unsigned long) pgd_quicklist;
        pgd_quicklist = (unsigned long *) pgd;
        pgtable_cache_size++;
}

Preemption could corrupt this list.

-

2001-03-21 11:03:34

by George Anzinger

Subject: Re: [PATCH for 2.5] preemptible kernel

"David S. Miller" wrote:
>
> Keith Owens writes:
> > Or have I missed something?
>
> Nope, it is a fundamental problem with such kernel pre-emption
> schemes. As a result, it would also break our big-reader locks
> (see include/linux/brlock.h).

He has this one covered. The patch puts preemption locks around
read_locks.

By the by, if a preemption lock is all that is needed the patch defines
it and it is rather fast (an inc going in and a dec & test comming
out). A lot faster than a spin lock with its "LOCK" access. A preempt
lock does not need to be "LOCK"ed because the only contender is the same
cpu.
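
For reference, the preempt lock in the patch is just this pair from the
include/linux/spinlock.h hunk (condensed; switch_lock_count() expands to
current->preempt_count):

        #define ctx_sw_off() \
        do { \
                atomic_inc(&current->preempt_count); \
        } while (0)

        #define ctx_sw_on() \
        do { \
                if (atomic_dec_and_test(&current->preempt_count) && \
                    current->need_resched) \
                        preempt_schedule(); \
        } while (0)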

George

>
> Basically, anything which uses smp_processor_id() would need to
> be holding some lock so as to not get pre-empted.
>
> Later,
> David S. Miller
> [email protected]

2001-03-21 11:32:07

by David Miller

Subject: Re: [PATCH for 2.5] preemptible kernel


george anzinger writes:
> By the by, if a preemption lock is all that is needed the patch defines
> it and it is rather fast (an inc going in and a dec & test comming
> out). A lot faster than a spin lock with its "LOCK" access. A preempt
> lock does not need to be "LOCK"ed because the only contender is the same
> cpu.

So we would have to invoke this thing around every set of
smp_processor_id() references?

Later,
David S. Miller
[email protected]

2001-03-21 15:46:22

by Andrea Arcangeli

Subject: Re: [PATCH for 2.5] preemptible kernel

On Wed, Mar 21, 2001 at 08:19:54PM +1100, Keith Owens wrote:
> Ouch. What about all the per cpu structures in the kernel, how do you
> handle them if a preempted task can be rescheduled on another cpu?
>
> int count[NR_CPUS], *p;
> p = count+smp_processor_id(); /* start on cpu 0, &count[0] */
> if (*p >= 1024) {
> /* preempt here, reschedule on cpu 1 */
> *p = 1; /* update cpu 0 count from cpu 1, oops */
> }
>
> Unless you find every use of a per cpu structure and wrap a spin lock
> around it, migrating a task from one cpu to another is going to be a
> source of weird and wonderful errors. Since the main use of per cpu
> structures is to avoid locks, adding a spin lock to every structure
> will be a killer. Or have I missed something?

That's why Linus suggested it for UP only.

Andrea

2001-03-21 17:13:25

by George Anzinger

Subject: Re: [PATCH for 2.5] preemptible kernel

"David S. Miller" wrote:
>
> george anzinger writes:
> > By the by, if a preemption lock is all that is needed the patch defines
> > it and it is rather fast (an inc going in and a dec & test coming
> > out). A lot faster than a spin lock with its "LOCK" access. A preempt
> > lock does not need to be "LOCK"ed because the only contender is the same
> > cpu.
>
> So we would have to invoke this thing around every set of
> smp_processor_id() references?

Well, most of them are already protected. In most preemptible kernels
there are several protection mechanisms, used singly or in some
combination. These are:

a.) interrupt disable
b.) preemption disable
c.) spin_lock

Nigel will tell you that systems he has worked on used a.) and c.) only.
This requires that every place where you would want to disable
preemption be short enough that you don't take a hit from the interrupt
disable (i.e. don't turn interrupts off for too long). There are pros
and cons to using preemption disable, even if the off time is very
short:

Pro: The preemption disable is much lower overhead. We are talking
about something like ++, --, if() schedule versus cli, sti. There are
more instructions in the preemption-disable case, but if done carefully
you have a ++, --, test, and a correctly predicted non-branch (probably
requires asm code to get the non-branch, but it can be done). This is
MUCH faster than the cli, sti pair because those are VERY heavy
instructions. (Note that the ++ and -- need to be atomic but NOT
locked, as the contention is with the current cpu, not others.)

Con: The idea of preemption is to get to the contending task as fast as
possible, not necessarily to do it efficiently. With the cli, sti
approach, once the interrupt IS taken the system is free to do the
context switch immediately, after the interrupt has basically trashed
the cache anyway. With the preemption-disable method, while it is
faster for the normal, non-preempting case, a preemption that becomes
pending inside the protected region cannot be taken at once, so the
system must return from the interrupt and continue, thus adding an
interrupt return and a bunch of cache faulting to the preemption delay.
(It is true that the interrupt return is in the "way", but one could
counter-argue that the cache disruption is small because we are, after
all, talking about a SHORT protected region which cannot use much cache
in the first place.)

Other arguments can be made for preemption disable. For example, on
most systems there are cases where an explicit schedule call wants to be
made without the possibility of preemption. Linux can only do this by
turning off interrupts, but that protects only the call and not the
return, whereas preemption can be disabled for the round trip. (Of
course, the system could save the interrupt state as part of the task
state if the return leg really needed to be protected. The upside is
that the synchronization methods used in Linux do not seem to ever
require this sort of protection.) You may note that the preemption path
in the patch does disable preemption while it calls schedule, but this
is to protect against preempting the preemption, which quickly leads to
stack overflow...

As to the original question, references to smp_processor_id are made
for at least two reasons. If the reason for the reference is to dodge a
spin_lock, then it needs to be protected with either a.) or b.). I
don't think you will find this very often, but it is probably done in a
few places. Note, however, that testing found the unprotected cr2 in
the do_page_fault code and has not found any of these (i.e. nothing in
the patch suggests that they even exist). I think we can safely say
that these things will not be found in the kernel directory. The driver
directory and one or more of the file system directories are another
matter. The list handling code that Andrew points out is also a place
where an implied assumption of not being preemptible was made. Again,
it is a dodge to avoid a spin lock, and the same arguments apply.

On the other hand, other references to smp_processor_id are usually made
for dispatching reasons. I think you will find that these are already
protected by a.) or c.) just because they must be even if there is no
preemption.

George

2001-03-21 18:19:52

by Nigel Gamble

[permalink] [raw]
Subject: Re: [PATCH for 2.5] preemptible kernel

On Wed, 21 Mar 2001, David S. Miller wrote:
> Basically, anything which uses smp_processor_id() would need to
> be holding some lock so as to not get pre-empted.

Not necessarily. Another solution for the smp_processor_id() case is
to ensure that the task can only be scheduled on the current CPU for the
duration that the value of smp_processor_id() is used. Or, if the
critical region is very short, to disable interrupts on the local CPU.
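
For concreteness, a sketch of the first alternative using the 2.4
cpus_allowed mask. The helper names are invented for this illustration,
and the small window between reading the cpu id and writing the mask is
ignored here:

/*
 * Illustrative only: keep the task on its current cpu while the value
 * of smp_processor_id() is in use, by narrowing cpus_allowed and then
 * restoring it afterwards.
 */
static inline unsigned long pin_to_current_cpu(void)
{
        unsigned long old_mask = current->cpus_allowed;

        current->cpus_allowed = 1UL << smp_processor_id();
        return old_mask;
}

static inline void unpin_cpu(unsigned long old_mask)
{
        current->cpus_allowed = old_mask;
}

        /* usage, somewhere that touches per-cpu data: */
        unsigned long mask = pin_to_current_cpu();
        /* ... use smp_processor_id() / per-cpu structures ... */
        unpin_cpu(mask);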

Nigel Gamble [email protected]
Mountain View, CA, USA. http://www.nrg.org/

MontaVista Software [email protected]

2001-03-22 00:22:16

by Nigel Gamble

[permalink] [raw]
Subject: Re: [PATCH for 2.5] preemptible kernel

On Wed, 21 Mar 2001, Andrew Morton wrote:
> It's a problem for uniprocessors as well.
>
> Example:
>
> #define current_cpu_data boot_cpu_data
> #define pgd_quicklist (current_cpu_data.pgd_quick)
>
> extern __inline__ void free_pgd_fast(pgd_t *pgd)
> {
> *(unsigned long *)pgd = (unsigned long) pgd_quicklist;
> pgd_quicklist = (unsigned long *) pgd;
> pgtable_cache_size++;
> }
>
> Preemption could corrupt this list.

Thanks, Andrew, for pointing this out. I've added fixes to the patch
for this problem and the others in pgalloc.h. If you know of any other
similar problems on uniprocessors, please let me know.

Nigel Gamble [email protected]
Mountain View, CA, USA. http://www.nrg.org/

MontaVista Software [email protected]

2001-03-23 12:37:04

by Rusty Russell

[permalink] [raw]
Subject: Re: [PATCH for 2.5] preemptible kernel

In message <[email protected]> you write:
> Nigel Gamble wrote:
> >
> > On Wed, 21 Mar 2001, Keith Owens wrote:
> > > I misread the code, but the idea is still correct. Add a preemption
> > > depth counter to each cpu, when you schedule and the depth is zero then
> > > you know that the cpu is no longer holding any references to quiesced
> > > structures.
> >
> > A task that has been preempted is on the run queue and can be
> > rescheduled on a different CPU, so I can't see how a per-CPU counter
> > would work. It seems to me that you would need a per run queue
> > counter, like the example I gave in a previous posting.
>
> Exactly so. The method does not depend on the sum of preemption being
> zip, but on each potential reader (writers take locks) passing thru a
> "sync point". Your notion of waiting for each task to arrive
> "naturally" at schedule() would work. It is, in fact, over kill as you
> could also add arrival at sys call exit as a (the) "sync point". In

Power off is also a sync point 8). But you want it to happen in
bounded time: consider a daemon which opens a device every minute and
never exits.

Nigel's "traverse the run queue and mark the preempted" solution is
actually pretty nice, and cheap. Since the runqueue lock is grabbed,
it doesn't require icky atomic ops, either.

Despite Nigel's initial belief that this technique is fragile, I
believe it will become an increasingly fundamental method in the
kernel, so (with documentation) it will become widely understood, as
it offers scalability and efficiency.

Rusty.
--
Premature optmztion is rt of all evl. --DK

2001-03-23 12:44:15

by Rusty Russell

[permalink] [raw]
Subject: Re: [PATCH for 2.5] preemptible kernel

In message <[email protected]> you write:
>
> Keith Owens writes:
> > Or have I missed something?
>
> Nope, it is a fundamental problem with such kernel pre-emption
> schemes. As a result, it would also break our big-reader locks
> (see include/linux/brlock.h).

Good point: holding a brlock has to block preemption, as per spinlocks.

> Basically, anything which uses smp_processor_id() would need to
> be holding some lock so as to not get pre-empted.

When I audited the uses of smp_processor_id() for the hotplug cpu
stuff, there were surprisingly few calls to smp_processor_id(), and
most of these will be holding a brlock, so OK already.

Rusty.
--
Premature optmztion is rt of all evl. --DK

2001-03-23 20:43:21

by Nigel Gamble

[permalink] [raw]
Subject: Re: [PATCH for 2.5] preemptible kernel

On Thu, 22 Mar 2001, Rusty Russell wrote:
> Nigel's "traverse the run queue and mark the preempted" solution is
> actually pretty nice, and cheap. Since the runqueue lock is grabbed,
> it doesn't require icky atomic ops, either.

You'd have to mark both the preempted tasks, and the tasks currently
running on each CPU (which could become preempted before reaching a
voluntary schedule point).

> Despite Nigel's initial belief that this technique is fragile, I
> believe it will become an increasingly fundamental method in the
> kernel, so (with documentation) it will become widely understood, as
> it offers scalability and efficiency.

Actually, I agree with you now that I've had a chance to think about
this some more.

Nigel Gamble [email protected]
Mountain View, CA, USA. http://www.nrg.org/

MontaVista Software [email protected]

2001-03-29 09:42:48

by Dipankar Sarma

[permalink] [raw]
Subject: Re: [PATCH for 2.5] preemptible kernel

On Wed, Mar 28, 2001 at 12:51:02PM -0800, george anzinger wrote:
> Dipankar Sarma wrote:
> >
> > Also, a task could be preempted and then rescheduled on the same cpu
> > making
> > the depth counter 0 (right ?), but it could still be holding references
> > to data
> > structures to be updated using synchronize_kernel(). There seems to be
> > two
> > approaches to tackle preemption -
> >
> > 1. Disable pre-emption during the time when references to data
> > structures
> > updated using such Two-phase updates are held.
>
> Doesn't this fly in the face of the whole Two-phase system? It seems to
> me that the point was to not require any locks. Preemption disable IS a
> lock. Not as strong as some, but a lock none the less.

The point is to avoid acquiring costly locks, so it is a question of
relative cost. Disabling preemption (by an atomic increment) for
short critical sections may not be as bad as spin-waiting for
highly contended locks or thrashing lock cachelines.


> >
> > Pros: easy to implement using a flag (ctx_sw_off() ?)
> > Cons: not so easy to use since critical sections need to be clearly
> > identified and interfaces defined. also affects preemptive behavior.
> >
> > 2. In synchronize_kernel(), distinguish between "natural" and preemptive
> > schedules() and ignore preemptive ones.
> >
> > Pros: easy to use
> > Cons: Not so easy to implement. Also a low priority task that keeps
> > getting
> > preempted often can affect update side performance significantly.
>
> Actually is is fairly easy to distinguish the two (see TASK_PREEMPTED in
> state). Don't you also have to have some sort of task flag that
> indicates that the task is one that needs to sync? Something that gets
> set when it enters the area of interest and cleared when it hits the
> sync point?

Neither of the two two-phase update implementations -
synchronize_kernel() by Rusty and read-copy update by us - monitors the
tasks that require a sync before an update. synchronize_kernel() forces
a schedule on every cpu, and read-copy update waits until every cpu has
gone through a quiescent state before updating. Both approaches will
require significant special handling because they implicitly assume
that a task inside the kernel stays bound to the current cpu until it
reaches a quiescent state (like a "normal" context switch). Since
preempted tasks can run later on any cpu, we would have to keep track
of sync points on a per-task basis, and that will probably require
using a snapshot of the running tasks from the global runqueue. That
may not be a good thing from a performance standpoint, not to mention
the complexity.

Also, in situations where the read-to-write ratio is not heavily skewed
towards reads, or where lots of updates are happening, a very low
priority task can have a significant impact on performance by getting
preempted all the time.

Thanks
Dipankar
--
Dipankar Sarma ([email protected])
IBM Linux Technology Center
IBM Software Lab, Bangalore, India.
Project Page: http://lse.sourceforge.net

2001-03-30 00:27:52

by Nigel Gamble

[permalink] [raw]
Subject: Re: [PATCH for 2.5] preemptible kernel

On Tue, 20 Mar 2001, Nigel Gamble wrote:
> On Tue, 20 Mar 2001, Rusty Russell wrote:
> > Thoughts?
>
> Perhaps synchronize_kernel() could take the run_queue lock, mark all the
> tasks on it and count them. Any task marked when it calls schedule()
> voluntarily (but not if it is preempted) is unmarked and the count
> decremented. synchronize_kernel() continues until the count is zero.

Hi Rusty,

Here is an attempt at a possible version of synchronize_kernel() that
should work on a preemptible kernel. I haven't tested it yet.


static int sync_count = 0;
static struct task_struct *syncing_task = NULL;
static DECLARE_MUTEX(synchronize_kernel_mtx);

void
synchronize_kernel()
{
        struct list_head *tmp;
        struct task_struct *p;
        unsigned long cpus_allowed, policy, rt_priority;

        /* Guard against multiple calls to this function */
        down(&synchronize_kernel_mtx);

        /* Mark all tasks on the runqueue */
        spin_lock_irq(&runqueue_lock);
        list_for_each(tmp, &runqueue_head) {
                p = list_entry(tmp, struct task_struct, run_list);
                if (p == current)
                        continue;
                if (p->state == TASK_RUNNING ||
                    (p->state == (TASK_RUNNING|TASK_PREEMPTED))) {
                        p->flags |= PF_SYNCING;
                        sync_count++;
                }
        }
        if (sync_count == 0)
                goto out;

        syncing_task = current;
        spin_unlock_irq(&runqueue_lock);

        /*
         * Cause a schedule on every CPU, as for a non-preemptible
         * kernel
         */

        /* Save current state */
        cpus_allowed = current->cpus_allowed;
        policy = current->policy;
        rt_priority = current->rt_priority;

        /* Create an unreal time task. */
        current->policy = SCHED_FIFO;
        current->rt_priority = 1001 + sys_sched_get_priority_max(SCHED_FIFO);

        /* Make us schedulable on all CPUs. */
        current->cpus_allowed = (1UL<<smp_num_cpus)-1;

        /* Eliminate current cpu, reschedule */
        while ((current->cpus_allowed &= ~(1 << smp_processor_id())) != 0)
                schedule();

        /* Back to normal. */
        current->cpus_allowed = cpus_allowed;
        current->policy = policy;
        current->rt_priority = rt_priority;

        /*
         * Wait, if necessary, until all preempted tasks
         * have reached a sync point.
         */

        spin_lock_irq(&runqueue_lock);
        for (;;) {
                set_current_state(TASK_UNINTERRUPTIBLE);
                if (sync_count == 0)
                        break;
                spin_unlock_irq(&runqueue_lock);
                schedule();
                spin_lock_irq(&runqueue_lock);
        }
        current->state = TASK_RUNNING;
        syncing_task = NULL;
out:
        spin_unlock_irq(&runqueue_lock);

        up(&synchronize_kernel_mtx);
}

And add this code to the beginning of schedule(), just after the
runqueue_lock is taken (the flags field is probably not the right
place to put the synchronize mark, and the test should be optimized for
the fast path in the same way as the other tests in schedule(), but you
get the idea):

        if ((prev->flags & PF_SYNCING) && !(prev->state & TASK_PREEMPTED)) {
                prev->flags &= ~PF_SYNCING;
                if (--sync_count == 0) {
                        syncing_task->state = TASK_RUNNING;
                        if (!task_on_runqueue(syncing_task))
                                add_to_runqueue(syncing_task);
                        syncing_task = NULL;
                }
        }

Nigel Gamble [email protected]
Mountain View, CA, USA. http://www.nrg.org/

MontaVista Software [email protected]

2001-03-30 06:33:37

by Keith Owens

[permalink] [raw]
Subject: Re: [PATCH for 2.5] preemptible kernel

On Wed, 28 Mar 2001 12:51:02 -0800,
george anzinger <[email protected]> wrote:
>Dipankar Sarma wrote:
>> 1. Disable pre-emption during the time when references to data
>> structures
>> updated using such Two-phase updates are held.
>
>Doesn't this fly in the face of the whole Two-phase system? It seems to
>me that the point was to not require any locks. Preemption disable IS a
>lock. Not as strong as some, but a lock none the less.

The aim is to remove all module locks from the main kernel path and
move the penalty for module unloading into the task doing the unload.
Only the unload code needs to worry about preemption, not the main
kernel code paths, so main line code runs faster and scales better. I
don't care how long (within reason) it takes to unload a module, it is
such a rare event. I want module unload to be race free with zero
impact on the fast code paths, the current design penalizes the fast
code paths.

2001-03-30 20:26:27

by Rusty Russell

[permalink] [raw]
Subject: Re: [PATCH for 2.5] preemptible kernel

In message <[email protected]> you write:
> Here is an attempt at a possible version of synchronize_kernel() that
> should work on a preemptible kernel. I haven't tested it yet.

It's close, but...

Those who suggest that we don't do preemption on SMP make this much
easier (synchronize_kernel() is a NOP on UP), and I'm starting to
agree with them. Anyway:

> if (p->state == TASK_RUNNING ||
> (p->state == (TASK_RUNNING|TASK_PREEMPTED))) {
> p->flags |= PF_SYNCING;

Setting a running task's flags brings races, AFAICT, and checking
p->state is NOT sufficient; consider wait_event(): you need p->has_cpu
here, I think. You could do it for TASK_PREEMPTED only, but you'd have
to do the "unreal priority" part of synchronize_kernel() with some
method to say "don't preempt anyone", and that will hurt latency.
Hmmm...

The only way I can see is to have a new element in "struct
task_struct" saying "syncing now", which is protected by the runqueue
lock. This looks like (and I prefer wait queues, they have such nice
helpers):

static DECLARE_WAIT_QUEUE_HEAD(syncing_task);
static DECLARE_MUTEX(synchronize_kernel_mtx);
static int sync_count = 0;

schedule():
        if (!(prev->state & TASK_PREEMPTED) && prev->syncing)
                if (--sync_count == 0) wake_up(&syncing_task);

synchronize_kernel():
{
        struct list_head *tmp;
        struct task_struct *p;

        /* Guard against multiple calls to this function */
        down(&synchronize_kernel_mtx);

        /* Everyone running now or currently preempted must
           voluntarily schedule before we know we are safe. */
        spin_lock_irq(&runqueue_lock);
        list_for_each(tmp, &runqueue_head) {
                p = list_entry(tmp, struct task_struct, run_list);
                if (p->has_cpu || p->state == (TASK_RUNNING|TASK_PREEMPTED)) {
                        p->syncing = 1;
                        sync_count++;
                }
        }
        spin_unlock_irq(&runqueue_lock);

        /* Wait for them all */
        wait_event(syncing_task, sync_count == 0);
        up(&synchronize_kernel_mtx);
}

Also untested 8),
Rusty.
--
Premature optmztion is rt of all evl. --DK

2001-03-28 10:19:03

by Dipankar Sarma

[permalink] [raw]
Subject: Re: [PATCH for 2.5] preemptible kernel

Nigel Gamble wrote:
>
> On Wed, 21 Mar 2001, Keith Owens wrote:
> > I misread the code, but the idea is still correct. Add a preemption
> > depth counter to each cpu, when you schedule and the depth is zero then
> > you know that the cpu is no longer holding any references to quiesced
> > structures.
>
> A task that has been preempted is on the run queue and can be
> rescheduled on a different CPU, so I can't see how a per-CPU counter
> would work. It seems to me that you would need a per run queue
> counter, like the example I gave in a previous posting.

Also, a task could be preempted and then rescheduled on the same cpu,
making the depth counter 0 (right?), but it could still be holding
references to data structures that are to be updated using
synchronize_kernel(). There seem to be two approaches to tackling
preemption -

1. Disable pre-emption during the time when references to data
structures updated using such two-phase updates are held.

Pros: easy to implement using a flag (ctx_sw_off() ?)
Cons: not so easy to use, since critical sections need to be clearly
identified and interfaces defined. Also affects preemptive behavior.

2. In synchronize_kernel(), distinguish between "natural" and preemptive
schedules() and ignore preemptive ones.

Pros: easy to use
Cons: not so easy to implement. Also, a low priority task that keeps
getting preempted often can affect update-side performance
significantly.

I intend to experiment with both to understand the impact.

Thanks
Dipankar
--
Dipankar Sarma ([email protected])
IBM Linux Technology Center
IBM Software Lab, Bangalore, India.
Project Page: http://lse.sourceforge.net

2001-03-28 11:46:18

by Dipankar Sarma

[permalink] [raw]
Subject: Re: [PATCH for 2.5] preemptible kernel

Hi George,

george anzinger wrote:
>
> Exactly so. The method does not depend on the sum of preemption being
> zip, but on each potential reader (writers take locks) passing thru a
> "sync point". Your notion of waiting for each task to arrive
> "naturally" at schedule() would work. It is, in fact, over kill as you
> could also add arrival at sys call exit as a (the) "sync point". In
> fact, for module unload, isn't this the real "sync point"? After all, a
> module can call schedule, or did I miss a usage counter somewhere?

It is certainly possible to implement a synchronize_kernel()-like
primitive for two-phase updates using "sync points". Waiting for
sys call exit will perhaps work in the module unloading case,
but there could be performance issues if a cpu spends most of
its time in the idle task or in interrupts. synchronize_kernel()
provides a simple generic way of implementing a two-phase update
without serialization on the read side.

I am working on a "sync point" based version of such an approach,
available at http://lse.sourceforge.net/locking/rclock.html. It
is based on the original DYNIX/ptx stuff that Paul McKenney
developed in the early 90s. This and synchronize_kernel() are
very similar in approach, and each can be implemented using
the other.

As for handling preemption, we can perhaps try 2 things -

1. The read side of the critical section is enclosed in
RC_RDPROTECT()/RC_RDUNPROTECT(), which are currently nops.
We can disable/enable preemption using these (see the sketch
after this list).

2. Avoid counting preemptive context switches. I am not sure
about this one though.
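
The sketch referred to in point 1 - purely illustrative, assuming the
preemptible kernel exposes something like the ctx_sw_off()/ctx_sw_on()
flag mentioned earlier in this thread:

/*
 * Sketch: make the read-side markers disable preemption on a
 * preemptible kernel and remain no-ops otherwise.  The ctx_sw_off()/
 * ctx_sw_on() names are taken from this discussion; the real
 * interface may differ.
 */
#ifdef CONFIG_PREEMPT
#define RC_RDPROTECT()          ctx_sw_off()    /* no preemption on the read side */
#define RC_RDUNPROTECT()        ctx_sw_on()     /* reader done, allow preemption  */
#else
#define RC_RDPROTECT()          do { } while (0)
#define RC_RDUNPROTECT()        do { } while (0)
#endif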

>
> By the way, there is a paper on this somewhere on the web. Anyone
> remember where?

If you are talking about Paul's paper, the link is
http://www.rdrop.com/users/paulmck/paper/rclockpdcsproof.pdf.

Thanks
Dipankar
--
Dipankar Sarma ([email protected])
IBM Linux Technology Center
IBM Software Lab, Bangalore, India.
Project Page: http://lse.sourceforge.net

2001-03-28 20:57:05

by George Anzinger

[permalink] [raw]
Subject: Re: [PATCH for 2.5] preemptible kernel

Dipankar Sarma wrote:
>
> Nigel Gamble wrote:
> >
> > On Wed, 21 Mar 2001, Keith Owens wrote:
> > > I misread the code, but the idea is still correct. Add a preemption
> > > depth counter to each cpu, when you schedule and the depth is zero then
> > > you know that the cpu is no longer holding any references to quiesced
> > > structures.
> >
> > A task that has been preempted is on the run queue and can be
> > rescheduled on a different CPU, so I can't see how a per-CPU counter
> > would work. It seems to me that you would need a per run queue
> > counter, like the example I gave in a previous posting.
>
> Also, a task could be preempted and then rescheduled on the same cpu
> making
> the depth counter 0 (right ?), but it could still be holding references
> to data
> structures to be updated using synchronize_kernel(). There seems to be
> two
> approaches to tackle preemption -
>
> 1. Disable pre-emption during the time when references to data
> structures
> updated using such Two-phase updates are held.

Doesn't this fly in the face of the whole Two-phase system? It seems to
me that the point was to not require any locks. Preemption disable IS a
lock. Not as strong as some, but a lock none the less.
>
> Pros: easy to implement using a flag (ctx_sw_off() ?)
> Cons: not so easy to use since critical sections need to be clearly
> identified and interfaces defined. also affects preemptive behavior.
>
> 2. In synchronize_kernel(), distinguish between "natural" and preemptive
> schedules() and ignore preemptive ones.
>
> Pros: easy to use
> Cons: Not so easy to implement. Also a low priority task that keeps
> getting
> preempted often can affect update side performance significantly.

Actually it is fairly easy to distinguish the two (see TASK_PREEMPTED
in state). Don't you also have to have some sort of task flag that
indicates that the task is one that needs to sync? Something that gets
set when it enters the area of interest and cleared when it hits the
sync point?

George

2001-04-01 07:53:51

by George Anzinger

[permalink] [raw]
Subject: Re: [PATCH for 2.5] preemptible kernel

Rusty Russell wrote:
>
> In message <[email protected]> you write:
> > Here is an attempt at a possible version of synchronize_kernel() that
> > should work on a preemptible kernel. I haven't tested it yet.
>
> It's close, but...
>
> Those who suggest that we don't do preemtion on SMP make this much
> easier (synchronize_kernel() is a NOP on UP), and I'm starting to
> agree with them. Anyway:
>
> > if (p->state == TASK_RUNNING ||
> > (p->state == (TASK_RUNNING|TASK_PREEMPTED))) {
> > p->flags |= PF_SYNCING;
>
> Setting a running task's flags brings races, AFAICT, and checking
> p->state is NOT sufficient, consider wait_event(): you need p->has_cpu
> here I think. You could do it for TASK_PREEMPTED only, but you'd have
> to do the "unreal priority" part of synchronize_kernel() with some
> method to say "don't preempt anyone", but it will hurt latency.
> Hmmm...
>
> The only way I can see is to have a new element in "struct
> task_struct" saying "syncing now", which is protected by the runqueue
> lock. This looks like (and I prefer wait queues, they have such nice
> helpers):
>
> static DECLARE_WAIT_QUEUE_HEAD(syncing_task);
> static DECLARE_MUTEX(synchronize_kernel_mtx);
> static int sync_count = 0;
>
> schedule():
> if (!(prev->state & TASK_PREEMPTED) && prev->syncing)
> if (--sync_count == 0) wake_up(&syncing_task);
>
> synchronize_kernel():
> {
> struct list_head *tmp;
> struct task_struct *p;
>
> /* Guard against multiple calls to this function */
> down(&synchronize_kernel_mtx);
>
> /* Everyone running now or currently preempted must
> voluntarily schedule before we know we are safe. */
> spin_lock_irq(&runqueue_lock);
> list_for_each(tmp, &runqueue_head) {
> p = list_entry(tmp, struct task_struct, run_list);
> if (p->has_cpu || p->state == (TASK_RUNNING|TASK_PREEMPTED)) {
I think this should be:
        if (p->has_cpu || (p->state & TASK_PREEMPTED)) {
to catch tasks that were preempted with other states. The lse Multi
Queue scheduler folks are going to love this.

George

> p->syncing = 1;
> sync_count++;
> }
> }
> spin_unlock_irq(&runqueue_lock);
>
> /* Wait for them all */
> wait_event(syncing_task, sync_count == 0);
> up(&synchronize_kernel_mtx);
> }
>
> Also untested 8),
> Rusty.
> --
> Premature optmztion is rt of all evl. --DK

2001-04-01 21:08:46

by Nigel Gamble

[permalink] [raw]
Subject: Re: [PATCH for 2.5] preemptible kernel

On Sat, 31 Mar 2001, Rusty Russell wrote:
> > if (p->state == TASK_RUNNING ||
> > (p->state == (TASK_RUNNING|TASK_PREEMPTED))) {
> > p->flags |= PF_SYNCING;
>
> Setting a running task's flags brings races, AFAICT, and checking
> p->state is NOT sufficient, consider wait_event(): you need p->has_cpu
> here I think.

My thought here was that if p->state is anything other than TASK_RUNNING
or TASK_RUNNING|TASK_PREEMPTED, then that task is already at a
synchronize point, so we don't need to wait for it to arrive at another
one - it will get a consistent view of the data we are protecting.
wait_event() qualifies as a synchronize point, doesn't it? Or am I
missing something?

> The only way I can see is to have a new element in "struct
> task_struct" saying "syncing now", which is protected by the runqueue
> lock. This looks like (and I prefer wait queues, they have such nice
> helpers):
>
> static DECLARE_WAIT_QUEUE_HEAD(syncing_task);
> static DECLARE_MUTEX(synchronize_kernel_mtx);
> static int sync_count = 0;
>
> schedule():
> if (!(prev->state & TASK_PREEMPTED) && prev->syncing)
> if (--sync_count == 0) wake_up(&syncing_task);

Don't forget to reset prev->syncing. I agree with you about wait
queues, but didn't use them here because of the problem of avoiding
deadlock on the runqueue lock, which the wait queues also use. The
above code in schedule needs the runqueue lock to protect sync_count.
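
For concreteness, here is a sketch of the schedule() hook with those
corrections folded in - prev->syncing cleared, sync_count touched only
under the runqueue lock, and a manual wake-up instead of wait_event().
The prev->syncing flag is from Rusty's sketch; syncing_task here is a
pointer to the synchronize_kernel() caller, as in the earlier posting.
Neither is from the actual patch:

        /* in schedule(), with runqueue_lock already held */
        if (prev->syncing && !(prev->state & TASK_PREEMPTED)) {
                prev->syncing = 0;      /* reset, as noted above */
                if (--sync_count == 0) {
                        /* wake the synchronize_kernel() caller by hand;
                         * wait_event()/wake_up() would retake runqueue_lock */
                        syncing_task->state = TASK_RUNNING;
                        if (!task_on_runqueue(syncing_task))
                                add_to_runqueue(syncing_task);
                        syncing_task = NULL;
                }
        }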


Nigel Gamble [email protected]
Mountain View, CA, USA. http://www.nrg.org/

MontaVista Software [email protected]

2001-04-01 21:14:46

by Nigel Gamble

[permalink] [raw]
Subject: Re: [PATCH for 2.5] preemptible kernel

On Sat, 31 Mar 2001, george anzinger wrote:
> I think this should be:
> if (p->has_cpu || p->state & TASK_PREEMPTED)) {
> to catch tasks that were preempted with other states.

But the other states are all part of the state change that happens at a
non-preemtive schedule() point, aren't they, so those tasks are already
safe to access the data we are protecting.

Nigel Gamble [email protected]
Mountain View, CA, USA. http://www.nrg.org/

MontaVista Software [email protected]

2001-04-02 20:01:46

by George Anzinger

[permalink] [raw]
Subject: Re: [PATCH for 2.5] preemptible kernel

Nigel Gamble wrote:
>
> On Sat, 31 Mar 2001, george anzinger wrote:
> > I think this should be:
> > if (p->has_cpu || p->state & TASK_PREEMPTED)) {
> > to catch tasks that were preempted with other states.
>
> But the other states are all part of the state change that happens at a
> non-preemtive schedule() point, aren't they, so those tasks are already
> safe to access the data we are protecting.
>
If you're saying that the task's "thinking" about a state change is
sufficient, ok. The point is that a task changes its state prior to
calling schedule() and then, sometimes, doesn't call schedule and just
changes its state back to running. Preemption can happen at any of
these times; after all, that is what the TASK_PREEMPTED flag is used for.

On a higher level, I think the scanning of the run list to set flags and
counters is a bit off. If these things need to be counted and kept
track of, the tasks should be doing it "in the moment" not some other
task at some distant time. For example if what is being protected is a
data structure, a counter could be put in the structure that keeps track
of the number of tasks that are interested in seeing it stay around. As
I understand the objective of the method being explored, a writer wants
to change the structure, but old readers can continue to use the old
while new readers get the new structure. The issue then is when to
"garbage collect" the no longer used structures. It seems to me that
the pointer to the structure can be changed to point to the new one when
the writer has it set up. Old users continue to use the old. When they
are done, they decrement the use count. When the use count goes to
zero, it is time to "garbage collect". At this time, the "garbage man"
is called (one simple one would be to check if the structure is still
the one a "new" task would get). Various methods exist for determing
how and if the "garbage man" should be called, but this sort of thing,
IMNSHO, does NOT belong in schedule().
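
As an illustration of the scheme described above (the structure and
helper names are invented for this sketch, using the ordinary 2.4
atomic_t operations):

/*
 * Illustrative refcount scheme: readers pin the copy they see, the
 * writer publishes a replacement, and the last user of the old copy
 * frees it - the "garbage man" is run by the readers themselves
 * rather than by schedule().
 */
struct foo {
        atomic_t        refcnt;
        /* ... protected data ... */
};

static struct foo *current_foo;         /* published pointer */

static inline struct foo *foo_get(void)
{
        struct foo *f = current_foo;

        /* NB: the window between loading current_foo and taking the
         * reference still needs protection (a lock, or preemption
         * disable) - which is exactly the issue under discussion. */
        atomic_inc(&f->refcnt);
        return f;
}

static inline void foo_put(struct foo *f)
{
        if (atomic_dec_and_test(&f->refcnt))
                kfree(f);               /* last user: garbage collect */
}

/* Writer: publish the new copy and drop the reference that was held
 * on behalf of the old published pointer. */
static void foo_replace(struct foo *new)
{
        struct foo *old = current_foo;

        atomic_set(&new->refcnt, 1);    /* reference held by current_foo */
        current_foo = new;
        foo_put(old);
}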

George

2001-04-05 07:59:18

by Rusty Russell

[permalink] [raw]
Subject: Re: [PATCH for 2.5] preemptible kernel

In message <[email protected]> you write:
> > Setting a running task's flags brings races, AFAICT, and checking
> > p->state is NOT sufficient, consider wait_event(): you need p->has_cpu
> > here I think.
>
> My thought here was that if p->state is anything other than TASK_RUNNING
> or TASK_RUNNING|TASK_PREEMPTED, then that task is already at a
> synchonize point,

Right. Theoretically possible to set p->state and not sleep, but it's
not a bad assumption.

> > schedule():
> > if (!(prev->state & TASK_PREEMPTED) && prev->syncing)
> > if (--sync_count == 0) wake_up(&syncing_task);
>
> Don't forget to reset prev->syncing.

Right.

> I agree with you about wait queues, but didn't use them here because
> of the problem of avoiding deadlock on the runqueue lock, which the
> wait queues also use.

And right again. Yeah, it has to be done manually. Ack, can I
retract that Email?

Rusty.
--
Premature optmztion is rt of all evl. --DK

2001-04-05 07:59:28

by Rusty Russell

[permalink] [raw]
Subject: Re: [PATCH for 2.5] preemptible kernel

In message <[email protected]> you write:
> On a higher level, I think the scanning of the run list to set flags and
> counters is a bit off.

[snip standard refcnt scheme]

For most things, refcnts are great. I use them in connection
tracking. But when writes can be insanely slow (eg. once per hour),
those two atomic ops just hurt scalability, and are invasive.

In particular, they suck HARD for modules, which is where my initial
quiescent wait implementation came from, only I didn't call it that
because Andi hadn't posted the Sequent paper URL yet and I hadn't
figured out that this was a generically useful scheme that can be
implemented in 20 lines...

Rusty.
--
Premature optmztion is rt of all evl. --DK

2001-04-06 23:58:02

by Paul McKenney

[permalink] [raw]
Subject: Re: [PATCH for 2.5] preemptible kernel

Please accept my apologies if I am missing something basic, but...

1. On a busy system, isn't it possible for a preempted task
to stay preempted for a -long- time, especially if there are
lots of real-time tasks in the mix?

2. Isn't it possible to get in trouble even on a UP if a task
is preempted in a critical region? For example, suppose the
preempting task does a synchronize_kernel()?

If I understand the preemptible kernel patch, suppressing preemption
is quite cheap -- just increment preempt_count before and decrement
it after, with no locks, atomic operations, or disabling of interrupts
required. I understand that disabling preemption for any long time
is a Bad Thing, however, Dipankar's patch is able to detect situations
where preemption has been suppressed for too long. In addition, Rusty's
original synchronize_kernel() patch is much simpler than the two that
wait for preempted tasks to reach a voluntary context switch.

What am I missing here?

Thanx, Paul

> > Here is an attempt at a possible version of synchronize_kernel() that
> > should work on a preemptible kernel. I haven't tested it yet.
>
> It's close, but...
>
> Those who suggest that we don't do preemtion on SMP make this much
> easier (synchronize_kernel() is a NOP on UP), and I'm starting to
> agree with them. Anyway:
>
> > if (p->state == TASK_RUNNING ||
> > (p->state == (TASK_RUNNING|TASK_PREEMPTED))) {
> > p->flags |= PF_SYNCING;
>
> Setting a running task's flags brings races, AFAICT, and checking
> p->state is NOT sufficient, consider wait_event(): you need p->has_cpu
> here I think. You could do it for TASK_PREEMPTED only, but you'd have
> to do the "unreal priority" part of synchronize_kernel() with some
> method to say "don't preempt anyone", but it will hurt latency.
> Hmmm...
>
> The only way I can see is to have a new element in "struct
> task_struct" saying "syncing now", which is protected by the runqueue
> lock. This looks like (and I prefer wait queues, they have such nice
> helpers):
>
> static DECLARE_WAIT_QUEUE_HEAD(syncing_task);
> static DECLARE_MUTEX(synchronize_kernel_mtx);
> static int sync_count = 0;
>
> schedule():
> if (!(prev->state & TASK_PREEMPTED) && prev->syncing)
> if (--sync_count == 0) wake_up(&syncing_task);
>
> synchronize_kernel():
> {
> struct list_head *tmp;
> struct task_struct *p;
>
> /* Guard against multiple calls to this function */
> down(&synchronize_kernel_mtx);
>
> /* Everyone running now or currently preempted must
> voluntarily schedule before we know we are safe. */
> spin_lock_irq(&runqueue_lock);
> list_for_each(tmp, &runqueue_head) {
> p = list_entry(tmp, struct task_struct, run_list);
> if (p->has_cpu || p->state == (TASK_RUNNING|TASK_PREEMPTED)) {
> p->syncing = 1;
> sync_count++;
> }
> }
> spin_unlock_irq(&runqueue_lock);
>
> /* Wait for them all */
> wait_event(syncing_task, sync_count == 0);
> up(&synchronize_kernel_mtx);
> }
>
> Also untested 8),
> Rusty.

2001-04-07 00:46:26

by Andi Kleen

[permalink] [raw]
Subject: Re: [PATCH for 2.5] preemptible kernel

Hallo,

On Fri, Apr 06, 2001 at 04:52:25PM -0700, Paul McKenney wrote:
> 1. On a busy system, isn't it possible for a preempted task
> to stay preempted for a -long- time, especially if there are
> lots of real-time tasks in the mix?

The problem you're describing is probably considered too hard to solve
properly (a bad answer, but that is how it is currently).

And yes, it is possible. You can also force a normal (non-preemptive)
kernel into complete livelock just by giving it enough network traffic
to do, so that it always works in the high priority network softirq, or
by doing the same with some other interrupt.

When this happens, a lot of basic things will stop working (like page
cleaning, IO flushing etc.), so your callbacks or module unloads not
running are probably the least of your worries.

The same problem applies on a smaller scale to real time processes;
kernel services normally do not run real-time, so they can be starved.

Priority inversion is not handled in the Linux kernel ATM, BTW; there
are already situations where a realtime task can cause a deadlock with
some lower priority system thread (I believe there is at least one case
of this known with realtime ntpd on 2.4).

> 2. Isn't it possible to get in trouble even on a UP if a task
> is preempted in a critical region? For example, suppose the
> preempting task does a synchronize_kernel()?

Ugly. I guess one way to solve it would be to re-add the 2.2 scheduler
taskqueue, and just queue a scheduler callback in this case.

-Andi

2001-04-07 19:36:57

by Paul McKenney

[permalink] [raw]
Subject: Re: [PATCH for 2.5] preemptible kernel


Andi, thank you for the background! More comments interspersed...

> On Fri, Apr 06, 2001 at 04:52:25PM -0700, Paul McKenney wrote:
> > 1. On a busy system, isn't it possible for a preempted task
> > to stay preempted for a -long- time, especially if there are
> > lots of real-time tasks in the mix?
>
> The problem you're describing is probably considered too hard to
> solve properly (bad answer, but that is how it is currently)
>
> Yes there is. You can also force a normal (non preemptive) kernel
> into complete livelock by just giving it enough network traffic
> to do, so that it always works in the high priority network
> softirq or doing the same with some other interrupt.
>
> Just when this happens a lot of basic things will stop working (like
> page cleaning, IO flushing etc.), so your callbacks or module unloads
> not running are probably the least of your worries.
>
> The same problem applies to a smaller scale to real time processes;
> kernel services normally do not run real-time, so they can be starved.
>
> Priority inversion is not handled in Linux kernel ATM BTW, there
> are already situations where a realtime task can cause a deadlock
> with some lower priority system thread (I believe there is at least
> one case of this known with realtime ntpd on 2.4)

I see your point here, but need to think about it. One question:
isn't it the case that the alternative to using synchronize_kernel()
is to protect the read side with explicit locks, which will themselves
suppress preemption? If so, why not just suppress preemption on the read
side in preemptible kernels, and thus gain the simpler implementation
of synchronize_kernel()? You are not losing any preemption latency
compared to a kernel that uses traditional locks, in fact, you should
improve latency a bit since the lock operations are more expensive than
are simple increments and decrements. As usual, what am I missing
here? ;-)

> > 2. Isn't it possible to get in trouble even on a UP if a task
> > is preempted in a critical region? For example, suppose the
> > preempting task does a synchronize_kernel()?
>
> Ugly. I guess one way to solve it would be to readd the 2.2 scheduler
> taskqueue, and just queue a scheduler callback in this case.

Another approach would be to define a "really low" priority that no one
other than synchronize_kernel() was allowed to use. Then the UP
implementation of synchronize_kernel() could drop its priority to
this level, yield the CPU, and know that all preempted tasks must
have obtained and voluntarily yielded the CPU before synchronize_kernel()
gets it back again.

I still prefer suppressing preemption on the read side, though I
suppose one could claim that this is only because I am -really-
used to it. ;-)

Thanx, Paul

2001-04-07 20:10:38

by Rusty Russell

[permalink] [raw]
Subject: Re: [PATCH for 2.5] preemptible kernel

In message <OF37B0793C.6B15F182-ON88256A27.0007C3EF@LocalDomain> you write:
> > Priority inversion is not handled in Linux kernel ATM BTW, there
> > are already situations where a realtime task can cause a deadlock
> > with some lower priority system thread (I believe there is at least
> > one case of this known with realtime ntpd on 2.4)
>
> I see your point here, but need to think about it. One question:
> isn't it the case that the alternative to using synchronize_kernel()
> is to protect the read side with explicit locks, which will themselves
> suppress preemption? If so, why not just suppress preemption on the read
> side in preemptible kernels, and thus gain the simpler implementation
> of synchronize_kernel()? You are not losing any preemption latency
> compared to a kernel that uses traditional locks, in fact, you should
> improve latency a bit since the lock operations are more expensive than
> are simple increments and decrements. As usual, what am I missing
> here? ;-)

Already preempted tasks.

> Another approach would be to define a "really low" priority that noone
> other than synchronize_kernel() was allowed to use. Then the UP
> implementation of synchronize_kernel() could drop its priority to
> this level, yield the CPU, and know that all preempted tasks must
> have obtained and voluntarily yielded the CPU before synchronize_kernel()
> gets it back again.

Or "never", because I'm running RC5 etc. 8(.

Rusty.
--
Premature optmztion is rt of all evl. --DK

2001-04-07 21:26:24

by Andi Kleen

[permalink] [raw]
Subject: Re: [Lse-tech] Re: [PATCH for 2.5] preemptible kernel

On Fri, Apr 06, 2001 at 06:25:36PM -0700, Paul McKenney wrote:
> I see your point here, but need to think about it. One question:
> isn't it the case that the alternative to using synchronize_kernel()
> is to protect the read side with explicit locks, which will themselves
> suppress preemption? If so, why not just suppress preemption on the read
> side in preemptible kernels, and thus gain the simpler implementation
> of synchronize_kernel()? You are not losing any preemption latency
> compared to a kernel that uses traditional locks, in fact, you should
> improve latency a bit since the lock operations are more expensive than
> are simple increments and decrements. As usual, what am I missing
> here? ;-)

You miss nothing I think. In fact it's already used (see below)

>
> > > 2. Isn't it possible to get in trouble even on a UP if a task
> > > is preempted in a critical region? For example, suppose the
> > > preempting task does a synchronize_kernel()?
> >
> > Ugly. I guess one way to solve it would be to readd the 2.2 scheduler
> > taskqueue, and just queue a scheduler callback in this case.
>
> Another approach would be to define a "really low" priority that noone
> other than synchronize_kernel() was allowed to use. Then the UP
> implementation of synchronize_kernel() could drop its priority to
> this level, yield the CPU, and know that all preempted tasks must
> have obtained and voluntarily yielded the CPU before synchronize_kernel()
> gets it back again.

That would just allow nasty starvation, e.g. when someone runs a
cpu-intensive screensaver or a seti-at-home.

>
> I still prefer suppressing preemption on the read side, though I
> suppose one could claim that this is only because I am -really-
> used to it. ;-)

For a lot of reader cases, non-preemption by threads is guaranteed
anyway -- e.g. anything that runs in interrupts, timers, tasklets and
the network softirq. I think that already covers a lot of interesting
cases.


-Andi