2002-09-29 17:39:54

by Ingo Molnar

Subject: [patch] smptimers, old BH removal, tq-cleanup, 2.5.39


the attached patch is the smptimers patch plus the removal of old BHs and
a rewrite of task-queue handling.

Basically with the removal of TIMER_BH i think the time is right to get
rid of old BHs forever, and to do a massive cleanup of all related fields.
The following five basic 'execution context' abstractions are supported by
the kernel:

- hardirq
- softirq
- tasklet
- keventd-driven task-queues
- process contexts

i've done the following cleanups/simplifications to task-queues:

- removed the ability to define your own task-queue; what can be done is
to schedule_task() a given task to keventd, and to flush all pending
tasks.

this is actually a quite easy transition, since 90% of all task-queue
users in the kernel used BH_IMMEDIATE - which is very similar in
functionality to keventd.
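
as an illustration, a typical BH_IMMEDIATE conversion looks roughly like
the sketch below (hypothetical driver, illustration only - the real
conversions are in the patch):

    #include <linux/tqueue.h>       /* struct tq_struct, schedule_task() */
    #include <linux/interrupt.h>

    static void my_deferred_work(void *data)
    {
            /* runs in keventd's process context - may sleep */
    }

    static struct tq_struct my_task = {
            routine:        my_deferred_work,
    };

    static void my_interrupt(int irq, void *dev_id, struct pt_regs *regs)
    {
            /* old: queue_task(&my_task, &tq_immediate);
             *      mark_bh(IMMEDIATE_BH); */
            schedule_task(&my_task);
    }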

i believe task-queues should not be removed from the kernel altogether.
It's true that they were written as a candidate replacement for BHs
originally, but they do make sense in a different way: it's perhaps the
easiest interface to do deferred processing from IRQ context, in
performance-uncritical code areas. They are easier to use than tasklets.

code that cares about performance should convert to tasklets - as the
timer code and the serial subsystem have done already. For extreme
performance softirqs should be used - the net subsystem does this.
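
for comparison, the tasklet variant of the same deferral pattern (again
a sketch with hypothetical names):

    #include <linux/interrupt.h>

    static void my_tasklet_fn(unsigned long data)
    {
            /* runs in softirq context - must not sleep */
    }

    static DECLARE_TASKLET(my_tasklet, my_tasklet_fn, 0);

    static void my_interrupt(int irq, void *dev_id, struct pt_regs *regs)
    {
            tasklet_schedule(&my_tasklet);
    }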

and we can do this for 2.6 - there are only a couple of areas left after
fixing all the BH_IMMEDIATE places.

i have moved all the taskqueue handling code into kernel/context.c, and
only kept the basic 'queue a task' definitions in include/linux/tqueue.h.
I've converted three of the most commonly used BH_IMMEDIATE users:
tty_io.c, floppy.c and random.c. [random.c might need more thought
though.]

i've also cleaned up kernel/timer.c relative to the stock smptimers
patch: privatized the timer-vec definitions (nothing needs them,
init_timer() used them mistakenly) and cleaned up the code. Plus i've
moved some code around that does not belong in timer.c, and within
timer.c i've organized data and functions by functionality and further
separated the base timer code from the NTP bits.

net_bh_lock: i have removed it, since it would synchronize to nothing. The
old protocol handlers should still run on UP, and on SMP the kernel prints
a warning upon use. Alexey, is this approach fine with you?

scalable timers: i've further improved the patch ported to 2.5 by wli and
Dipankar. There is only one pending issue i can see, the question of
whether to migrate timers in mod_timer() or not. I'm quite convinced that
they should be migrated, but i might be wrong. It's a 10-line change to
switch between migrating and non-migrating timers, we can do performance
tests later on. The current, more complex migration code is pretty fast
and has been stable under extremely high networking loads in the past 2
years, so we can immediately switch to the simpler variant if someone
proves it improves performance. (I'd say if non-migrating timers improve
Apache performance on one of the bigger NUMA boxes then the point is
proven, no further thought will be needed.)
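
(for reference, the non-migrating variant would look something like the
sketch below: keep a timer on whatever base it is already on, and only
fall back to the local CPU's base for a not-yet-added timer. This is an
illustration of the idea, not the actual change:)

    int mod_timer(timer_t *timer, unsigned long expires)
    {
            tvec_base_t *base;
            unsigned long flags;
            int ret;

            local_irq_save(flags);
    repeat:
            base = timer->base;
            if (!base)
                    base = tvec_bases + smp_processor_id();
            spin_lock(&base->lock);
            /* re-check: a concurrent add_timer() may have set the base */
            if (timer->base && timer->base != base) {
                    spin_unlock(&base->lock);
                    goto repeat;
            }
            timer->expires = expires;
            ret = detach_timer(timer);
            internal_add_timer(base, timer);
            timer->base = base;
            spin_unlock_irqrestore(&base->lock, flags);
            return ret;
    }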

would this patch be an acceptable approach? We could avoid all the fuss
about synchronizing the timer execution to the old-BH (and tq) paradigms
by removing those right on the spot.

The attached patch (against BK-curr) compiles, boots & works just fine on
x86 SMP and UP. I've done some wider functionality testing as well (X,
mouse, other input and console features, networking), and it all appears
to work just fine.

Ingo

--- linux/drivers/net/eepro100.c.orig Fri Sep 20 17:20:31 2002
+++ linux/drivers/net/eepro100.c Sun Sep 29 17:53:24 2002
@@ -1210,9 +1210,6 @@
/* We must continue to monitor the media. */
sp->timer.expires = RUN_AT(2*HZ); /* 2.0 sec. */
add_timer(&sp->timer);
-#if defined(timer_exit)
- timer_exit(&sp->timer);
-#endif
}

static void speedo_show_state(struct net_device *dev)
--- linux/drivers/char/random.c.orig Sun Sep 29 18:26:17 2002
+++ linux/drivers/char/random.c Sun Sep 29 18:27:01 2002
@@ -649,7 +649,7 @@
* Changes to the entropy data is put into a queue rather than being added to
* the entropy counts directly. This is presumably to avoid doing heavy
* hashing calculations during an interrupt in add_timer_randomness().
- * Instead, the entropy is only added to the pool once per timer tick.
+ * Instead, the entropy is only added to the pool by keventd.
*/
void batch_entropy_store(u32 a, u32 b, int num)
{
@@ -664,7 +664,8 @@

new = (batch_head+1) & (batch_max-1);
if (new != batch_tail) {
- queue_task(&batch_tqueue, &tq_timer);
+ // FIXME: is this correct?
+ schedule_task(&batch_tqueue);
batch_head = new;
} else {
DEBUG_ENT("batch entropy buffer full\n");
--- linux/drivers/char/tty_io.c.orig Sun Sep 29 18:11:49 2002
+++ linux/drivers/char/tty_io.c Sun Sep 29 18:14:34 2002
@@ -1265,7 +1265,6 @@
/*
* Make sure that the tty's task queue isn't activated.
*/
- run_task_queue(&tq_timer);
flush_scheduled_tasks();

/*
@@ -1876,7 +1875,6 @@

/*
* The tq handling here is a little racy - tty->SAK_tq may already be queued.
- * But there's no mechanism to fix that without futzing with tqueue_lock.
* Fortunately we don't need to worry, because if ->SAK_tq is already queued,
* the values which we write to it will be identical to the values which it
* already has. --akpm
@@ -1902,7 +1900,7 @@
unsigned long flags;

if (test_bit(TTY_DONT_FLIP, &tty->flags)) {
- queue_task(&tty->flip.tqueue, &tq_timer);
+ schedule_task(&tty->flip.tqueue);
return;
}
if (tty->flip.buf_num) {
@@ -1979,7 +1977,7 @@
if (tty->low_latency)
flush_to_ldisc((void *) tty);
else
- queue_task(&tty->flip.tqueue, &tq_timer);
+ schedule_task(&tty->flip.tqueue);
}

/*
--- linux/drivers/block/floppy.c.orig Sun Sep 29 18:29:09 2002
+++ linux/drivers/block/floppy.c Sun Sep 29 18:30:22 2002
@@ -1009,8 +1009,7 @@
static void schedule_bh( void (*handler)(void*) )
{
floppy_tq.routine = (void *)(void *) handler;
- queue_task(&floppy_tq, &tq_immediate);
- mark_bh(IMMEDIATE_BH);
+ schedule_task(&floppy_tq);
}

static struct timer_list fd_timer;
@@ -4361,7 +4360,7 @@
if (have_no_fdc)
{
DPRINT("no floppy controllers found\n");
- run_task_queue(&tq_immediate);
+ flush_scheduled_tasks();
if (usage_count)
floppy_release_irq_and_dma();
blk_cleanup_queue(BLK_DEFAULT_QUEUE(MAJOR_NR));
--- linux/arch/i386/mm/fault.c.orig Fri Sep 20 17:20:13 2002
+++ linux/arch/i386/mm/fault.c Sun Sep 29 17:53:24 2002
@@ -99,18 +99,14 @@
goto bad_area;
}

-extern spinlock_t timerlist_lock;
-
/*
* Unlock any spinlocks which will prevent us from getting the
- * message out (timerlist_lock is acquired through the
- * console unblank code)
+ * message out
*/
void bust_spinlocks(int yes)
{
int loglevel_save = console_loglevel;

- spin_lock_init(&timerlist_lock);
if (yes) {
oops_in_progress = 1;
return;
--- linux/fs/file_table.c.orig Sun Sep 29 18:52:58 2002
+++ linux/fs/file_table.c Sun Sep 29 18:53:20 2002
@@ -25,6 +25,9 @@
/* public *and* exported. Not pretty! */
spinlock_t files_lock = SPIN_LOCK_UNLOCKED;

+/* file version */
+unsigned long event;
+
/* Find an unused file structure and return a pointer to it.
* Returns NULL, if there are no more free file structures or
* we run out of memory.
--- linux/include/linux/interrupt.h.orig Fri Sep 20 17:20:29 2002
+++ linux/include/linux/interrupt.h Sun Sep 29 17:53:24 2002
@@ -22,25 +22,6 @@
struct irqaction *next;
};

-
-/* Who gets which entry in bh_base. Things which will occur most often
- should come first */
-
-enum {
- TIMER_BH = 0,
- TQUEUE_BH = 1,
- DIGI_BH = 2,
- SERIAL_BH = 3,
- RISCOM8_BH = 4,
- SPECIALIX_BH = 5,
- AURORA_BH = 6,
- ESP_BH = 7,
- IMMEDIATE_BH = 9,
- CYCLADES_BH = 10,
- MACSERIAL_BH = 13,
- ISICOM_BH = 14
-};
-
#include <asm/hardirq.h>
#include <asm/softirq.h>

@@ -217,23 +198,6 @@
#define SMP_TIMER_DEFINE(name, task)

#endif /* CONFIG_SMP */
-
-
-/* Old BH definitions */
-
-extern struct tasklet_struct bh_task_vec[];
-
-/* It is exported _ONLY_ for wait_on_irq(). */
-extern spinlock_t global_bh_lock;
-
-static inline void mark_bh(int nr)
-{
- tasklet_hi_schedule(bh_task_vec+nr);
-}
-
-extern void init_bh(int nr, void (*routine)(void));
-extern void remove_bh(int nr);
-

/*
* Autoprobing for irqs:
--- linux/include/linux/timer.h.orig Fri Sep 20 17:20:19 2002
+++ linux/include/linux/timer.h Sun Sep 29 19:07:24 2002
@@ -2,11 +2,15 @@
#define _LINUX_TIMER_H

#include <linux/config.h>
+#include <linux/smp.h>
#include <linux/stddef.h>
#include <linux/list.h>
+#include <linux/spinlock.h>
+#include <linux/cache.h>
+
+struct tvec_t_base_s;

/*
- * In Linux 2.4, static timers have been removed from the kernel.
* Timers may be dynamically created and destroyed, and should be initialized
* by a call to init_timer() upon creation.
*
@@ -14,22 +18,31 @@
* timeouts. You can use this field to distinguish between the different
* invocations.
*/
-struct timer_list {
+typedef struct timer_list {
struct list_head list;
unsigned long expires;
unsigned long data;
void (*function)(unsigned long);
-};
-
-extern void add_timer(struct timer_list * timer);
-extern int del_timer(struct timer_list * timer);
+ struct tvec_t_base_s *base;
+} timer_t;

+extern void add_timer(timer_t * timer);
+extern int del_timer(timer_t * timer);
+
#ifdef CONFIG_SMP
-extern int del_timer_sync(struct timer_list * timer);
+extern int del_timer_sync(timer_t * timer);
+extern void sync_timers(void);
+#define timer_enter(base, t) do { base->running_timer = t; mb(); } while (0)
+#define timer_exit(base) do { base->running_timer = NULL; } while (0)
+#define timer_is_running(base,t) (base->running_timer == t)
+#define timer_synchronize(base,t) while (timer_is_running(base,t)) barrier()
#else
#define del_timer_sync(t) del_timer(t)
+#define sync_timers() do { } while (0)
+#define timer_enter(base,t) do { } while (0)
+#define timer_exit(base) do { } while (0)
#endif
-
+
/*
* mod_timer is a more efficient way to update the expire field of an
* active timer (if the timer is inactive it will be activated)
@@ -37,16 +50,20 @@
* If the timer is known to be not pending (ie, in the handler), mod_timer
* is less efficient than a->expires = b; add_timer(a).
*/
-int mod_timer(struct timer_list *timer, unsigned long expires);
+int mod_timer(timer_t *timer, unsigned long expires);

extern void it_real_fn(unsigned long);

-static inline void init_timer(struct timer_list * timer)
+extern void init_timers(void);
+extern void run_local_timers(void);
+
+static inline void init_timer(timer_t * timer)
{
timer->list.next = timer->list.prev = NULL;
+ timer->base = NULL;
}

-static inline int timer_pending (const struct timer_list * timer)
+static inline int timer_pending(const timer_t * timer)
{
return timer->list.next != NULL;
}
--- linux/include/linux/sched.h.orig Sun Sep 29 18:33:10 2002
+++ linux/include/linux/sched.h Sun Sep 29 18:33:14 2002
@@ -172,7 +172,6 @@
extern signed long FASTCALL(schedule_timeout(signed long timeout));
asmlinkage void schedule(void);

-extern void flush_scheduled_tasks(void);
extern int start_context_thread(void);
extern int current_is_keventd(void);

--- linux/include/linux/tqueue.h.orig Sun Sep 29 18:15:16 2002
+++ linux/include/linux/tqueue.h Sun Sep 29 18:33:39 2002
@@ -1,13 +1,12 @@
/*
* tqueue.h --- task queue handling for Linux.
*
- * Mostly based on a proposed bottom-half replacement code written by
- * Kai Petzke, [email protected].
+ * Modified version of previous incarnations of task-queues,
+ * written by:
*
+ * (C) 1994 Kai Petzke, [email protected]
* Modified for use in the Linux kernel by Theodore Ts'o,
- * [email protected]. Any bugs are my fault, not Kai's.
- *
- * The original comment follows below.
+ * [email protected].
*/

#ifndef _LINUX_TQUEUE_H
@@ -18,25 +17,8 @@
#include <linux/bitops.h>
#include <asm/system.h>

-/*
- * New proposed "bottom half" handlers:
- * (C) 1994 Kai Petzke, [email protected]
- *
- * Advantages:
- * - Bottom halfs are implemented as a linked list. You can have as many
- * of them, as you want.
- * - No more scanning of a bit field is required upon call of a bottom half.
- * - Support for chained bottom half lists. The run_task_queue() function can be
- * used as a bottom half handler. This is for example useful for bottom
- * halfs, which want to be delayed until the next clock tick.
- *
- * Notes:
- * - Bottom halfs are called in the reverse order that they were linked into
- * the list.
- */
-
struct tq_struct {
- struct list_head list; /* linked list of active bh's */
+ struct list_head list; /* linked list of active tq's */
unsigned long sync; /* must be initialized to zero */
void (*routine)(void *); /* function to call */
void *data; /* argument to function */
@@ -61,68 +43,13 @@
PREPARE_TQUEUE((_tq), (_routine), (_data)); \
} while (0)

-typedef struct list_head task_queue;
-
#define DECLARE_TASK_QUEUE(q) LIST_HEAD(q)
-#define TQ_ACTIVE(q) (!list_empty(&q))
-
-extern task_queue tq_timer, tq_immediate;
-
-/*
- * To implement your own list of active bottom halfs, use the following
- * two definitions:
- *
- * DECLARE_TASK_QUEUE(my_tqueue);
- * struct tq_struct my_task = {
- * routine: (void (*)(void *)) my_routine,
- * data: &my_data
- * };
- *
- * To activate a bottom half on a list, use:
- *
- * queue_task(&my_task, &my_tqueue);
- *
- * To later run the queued tasks use
- *
- * run_task_queue(&my_tqueue);
- *
- * This allows you to do deferred processing. For example, you could
- * have a task queue called tq_timer, which is executed within the timer
- * interrupt.
- */
-
-extern spinlock_t tqueue_lock;
-
-/*
- * Queue a task on a tq. Return non-zero if it was successfully
- * added.
- */
-static inline int queue_task(struct tq_struct *bh_pointer, task_queue *bh_list)
-{
- int ret = 0;
- if (!test_and_set_bit(0,&bh_pointer->sync)) {
- unsigned long flags;
- spin_lock_irqsave(&tqueue_lock, flags);
- list_add_tail(&bh_pointer->list, bh_list);
- spin_unlock_irqrestore(&tqueue_lock, flags);
- ret = 1;
- }
- return ret;
-}

/* Schedule a tq to run in process context */
extern int schedule_task(struct tq_struct *task);

-/*
- * Call all "bottom halfs" on a given list.
- */
-
-extern void __run_task_queue(task_queue *list);
+/* finish all currently pending tasks - do not call from irq context */
+extern void flush_scheduled_tasks(void);

-static inline void run_task_queue(task_queue *list)
-{
- if (TQ_ACTIVE(*list))
- __run_task_queue(list);
-}
+#endif

-#endif /* _LINUX_TQUEUE_H */
--- linux/include/linux/tty_flip.h.orig Sun Sep 29 18:23:30 2002
+++ linux/include/linux/tty_flip.h Sun Sep 29 18:23:41 2002
@@ -19,7 +19,7 @@

_INLINE_ void tty_schedule_flip(struct tty_struct *tty)
{
- queue_task(&tty->flip.tqueue, &tq_timer);
+ schedule_task(&tty->flip.tqueue);
}

#undef _INLINE_
--- linux/net/core/dev.c.orig Fri Sep 20 17:20:29 2002
+++ linux/net/core/dev.c Sun Sep 29 17:53:24 2002
@@ -1296,7 +1296,6 @@
static int deliver_to_old_ones(struct packet_type *pt,
struct sk_buff *skb, int last)
{
- static spinlock_t net_bh_lock = SPIN_LOCK_UNLOCKED;
int ret = NET_RX_DROP;

if (!last) {
@@ -1307,20 +1306,13 @@
if (skb_is_nonlinear(skb) && skb_linearize(skb, GFP_ATOMIC))
goto out_kfree;

- /* The assumption (correct one) is that old protocols
- did not depened on BHs different of NET_BH and TIMER_BH.
+#if CONFIG_SMP
+ /* Old protocols did not depend on BHs other than NET_BH and
+ TIMER_BH - they need to be fixed for the new assumptions.
*/
-
- /* Emulate NET_BH with special spinlock */
- spin_lock(&net_bh_lock);
-
- /* Disable timers and wait for all timers completion */
- tasklet_disable(bh_task_vec+TIMER_BH);
-
+ print_symbol("fix old protocol handler %s!\n", (unsigned long)pt->func);
+#endif
ret = pt->func(skb, skb->dev, pt);
-
- tasklet_hi_enable(bh_task_vec+TIMER_BH);
- spin_unlock(&net_bh_lock);
out:
return ret;
out_kfree:
--- linux/lib/bust_spinlocks.c.orig Fri Sep 20 17:20:20 2002
+++ linux/lib/bust_spinlocks.c Sun Sep 29 17:53:24 2002
@@ -14,11 +14,9 @@
#include <linux/wait.h>
#include <linux/vt_kern.h>

-extern spinlock_t timerlist_lock;

void bust_spinlocks(int yes)
{
- spin_lock_init(&timerlist_lock);
if (yes) {
oops_in_progress = 1;
} else {
--- linux/kernel/ksyms.c.orig Sun Sep 29 17:52:59 2002
+++ linux/kernel/ksyms.c Sun Sep 29 18:12:59 2002
@@ -420,12 +420,9 @@
EXPORT_SYMBOL(del_timer_sync);
#endif
EXPORT_SYMBOL(mod_timer);
-EXPORT_SYMBOL(tq_timer);
-EXPORT_SYMBOL(tq_immediate);
+EXPORT_SYMBOL(tvec_bases);

#ifdef CONFIG_SMP
-/* Various random spinlocks we want to export */
-EXPORT_SYMBOL(tqueue_lock);

/* Big-Reader lock implementation */
EXPORT_SYMBOL(__brlock_array);
--- linux/kernel/sched.c.orig Sun Sep 29 17:52:59 2002
+++ linux/kernel/sched.c Sun Sep 29 17:53:24 2002
@@ -29,6 +29,7 @@
#include <linux/security.h>
#include <linux/notifier.h>
#include <linux/delay.h>
+#include <linux/timer.h>

/*
* Convert user-nice values [ -20 ... 0 ... 19 ]
@@ -860,6 +861,7 @@
runqueue_t *rq = this_rq();
task_t *p = current;

+ run_local_timers();
if (p == rq->idle) {
/* note: this timer irq context must be accounted for as well */
if (irq_count() - HARDIRQ_OFFSET >= SOFTIRQ_OFFSET)
@@ -2101,10 +2103,7 @@
spinlock_t kernel_flag __cacheline_aligned_in_smp = SPIN_LOCK_UNLOCKED;
#endif

-extern void init_timervecs(void);
-extern void timer_bh(void);
-extern void tqueue_bh(void);
-extern void immediate_bh(void);
+extern void init_timers(void);

void __init sched_init(void)
{
@@ -2140,10 +2139,7 @@
set_task_cpu(current, smp_processor_id());
wake_up_process(current);

- init_timervecs();
- init_bh(TIMER_BH, timer_bh);
- init_bh(TQUEUE_BH, tqueue_bh);
- init_bh(IMMEDIATE_BH, immediate_bh);
+ init_timers();

/*
* The boot idle thread does lazy MMU switching as well:
--- linux/kernel/timer.c.orig Sun Sep 29 17:52:59 2002
+++ linux/kernel/timer.c Sun Sep 29 19:22:25 2002
@@ -14,74 +14,21 @@
* Copyright (C) 1998 Andrea Arcangeli
* 1999-03-10 Improved NTP compatibility by Ulrich Windl
* 2002-05-31 Move sys_sysinfo here and make its locking sane, Robert Love
+ * 2000-10-05 Implemented scalable SMP per-CPU timer handling.
+ * Copyright (C) 2000, 2001, 2002 Ingo Molnar
+ * Designed by David S. Miller, Alexey Kuznetsov and Ingo Molnar
*/

-#include <linux/config.h>
-#include <linux/mm.h>
-#include <linux/timex.h>
-#include <linux/delay.h>
-#include <linux/smp_lock.h>
-#include <linux/interrupt.h>
-#include <linux/tqueue.h>
#include <linux/kernel_stat.h>
+#include <linux/interrupt.h>
+#include <linux/percpu.h>
+#include <linux/init.h>
+#include <linux/mm.h>

#include <asm/uaccess.h>

-struct kernel_stat kstat;
-
-/*
- * Timekeeping variables
- */
-
-unsigned long tick_usec = TICK_USEC; /* ACTHZ period (usec) */
-unsigned long tick_nsec = TICK_NSEC(TICK_USEC); /* USER_HZ period (nsec) */
-
-/* The current time */
-struct timespec xtime __attribute__ ((aligned (16)));
-
-/* Don't completely fail for HZ > 500. */
-int tickadj = 500/HZ ? : 1; /* microsecs */
-
-DECLARE_TASK_QUEUE(tq_timer);
-DECLARE_TASK_QUEUE(tq_immediate);
-
/*
- * phase-lock loop variables
- */
-/* TIME_ERROR prevents overwriting the CMOS clock */
-int time_state = TIME_OK; /* clock synchronization status */
-int time_status = STA_UNSYNC; /* clock status bits */
-long time_offset; /* time adjustment (us) */
-long time_constant = 2; /* pll time constant */
-long time_tolerance = MAXFREQ; /* frequency tolerance (ppm) */
-long time_precision = 1; /* clock precision (us) */
-long time_maxerror = NTP_PHASE_LIMIT; /* maximum error (us) */
-long time_esterror = NTP_PHASE_LIMIT; /* estimated error (us) */
-long time_phase; /* phase offset (scaled us) */
-long time_freq = ((1000000 + HZ/2) % HZ - HZ/2) << SHIFT_USEC;
- /* frequency offset (scaled ppm)*/
-long time_adj; /* tick adjust (scaled 1 / HZ) */
-long time_reftime; /* time at last adjustment (s) */
-
-long time_adjust;
-
-unsigned long event;
-
-extern int do_setitimer(int, struct itimerval *, struct itimerval *);
-
-/*
- * The 64-bit jiffies value is not atomic - you MUST NOT read it
- * without holding read_lock_irq(&xtime_lock).
- * jiffies is defined in the linker script...
- */
-
-
-unsigned int * prof_buffer;
-unsigned long prof_len;
-unsigned long prof_shift;
-
-/*
- * Event timer code
+ * per-CPU timer vector definitions:
*/
#define TVN_BITS 6
#define TVR_BITS 8
@@ -90,115 +37,88 @@
#define TVN_MASK (TVN_SIZE - 1)
#define TVR_MASK (TVR_SIZE - 1)

-struct timer_vec {
+typedef struct tvec_s {
int index;
struct list_head vec[TVN_SIZE];
-};
+} tvec_t;

-struct timer_vec_root {
+typedef struct tvec_root_s {
int index;
struct list_head vec[TVR_SIZE];
-};
+} tvec_root_t;

-static struct timer_vec tv5;
-static struct timer_vec tv4;
-static struct timer_vec tv3;
-static struct timer_vec tv2;
-static struct timer_vec_root tv1;
+struct tvec_t_base_s {
+ spinlock_t lock;
+ unsigned long timer_jiffies;
+ volatile timer_t * volatile running_timer;
+ tvec_root_t tv1;
+ tvec_t tv2;
+ tvec_t tv3;
+ tvec_t tv4;
+ tvec_t tv5;
+} ____cacheline_aligned_in_smp;

-static struct timer_vec * const tvecs[] = {
- (struct timer_vec *)&tv1, &tv2, &tv3, &tv4, &tv5
-};
+typedef struct tvec_t_base_s tvec_base_t;

-#define NOOF_TVECS (sizeof(tvecs) / sizeof(tvecs[0]))
+static tvec_base_t tvec_bases[NR_CPUS] __cacheline_aligned;

-void init_timervecs (void)
-{
- int i;
+/* Fake initialization needed to avoid compiler breakage */
+static DEFINE_PER_CPU(struct tasklet_struct, timer_tasklet) = { NULL };

- for (i = 0; i < TVN_SIZE; i++) {
- INIT_LIST_HEAD(tv5.vec + i);
- INIT_LIST_HEAD(tv4.vec + i);
- INIT_LIST_HEAD(tv3.vec + i);
- INIT_LIST_HEAD(tv2.vec + i);
- }
- for (i = 0; i < TVR_SIZE; i++)
- INIT_LIST_HEAD(tv1.vec + i);
-}
-
-static unsigned long timer_jiffies;
-
-static inline void internal_add_timer(struct timer_list *timer)
+static inline void internal_add_timer(tvec_base_t *base, timer_t *timer)
{
- /*
- * must be cli-ed when calling this
- */
unsigned long expires = timer->expires;
- unsigned long idx = expires - timer_jiffies;
+ unsigned long idx = expires - base->timer_jiffies;
struct list_head * vec;

if (idx < TVR_SIZE) {
int i = expires & TVR_MASK;
- vec = tv1.vec + i;
+ vec = base->tv1.vec + i;
} else if (idx < 1 << (TVR_BITS + TVN_BITS)) {
int i = (expires >> TVR_BITS) & TVN_MASK;
- vec = tv2.vec + i;
+ vec = base->tv2.vec + i;
} else if (idx < 1 << (TVR_BITS + 2 * TVN_BITS)) {
int i = (expires >> (TVR_BITS + TVN_BITS)) & TVN_MASK;
- vec = tv3.vec + i;
+ vec = base->tv3.vec + i;
} else if (idx < 1 << (TVR_BITS + 3 * TVN_BITS)) {
int i = (expires >> (TVR_BITS + 2 * TVN_BITS)) & TVN_MASK;
- vec = tv4.vec + i;
+ vec = base->tv4.vec + i;
} else if ((signed long) idx < 0) {
- /* can happen if you add a timer with expires == jiffies,
+ /*
+ * Can happen if you add a timer with expires == jiffies,
* or you set a timer to go off in the past
*/
- vec = tv1.vec + tv1.index;
+ vec = base->tv1.vec + base->tv1.index;
} else if (idx <= 0xffffffffUL) {
int i = (expires >> (TVR_BITS + 3 * TVN_BITS)) & TVN_MASK;
- vec = tv5.vec + i;
+ vec = base->tv5.vec + i;
} else {
/* Can only get here on architectures with 64-bit jiffies */
INIT_LIST_HEAD(&timer->list);
return;
}
/*
- * Timers are FIFO!
+ * Timers are FIFO:
*/
- list_add(&timer->list, vec->prev);
+ list_add_tail(&timer->list, vec);
}

-/* Initialize both explicitly - let's try to have them in the same cache line */
-spinlock_t timerlist_lock ____cacheline_aligned_in_smp = SPIN_LOCK_UNLOCKED;
-
-#ifdef CONFIG_SMP
-volatile struct timer_list * volatile running_timer;
-#define timer_enter(t) do { running_timer = t; mb(); } while (0)
-#define timer_exit() do { running_timer = NULL; } while (0)
-#define timer_is_running(t) (running_timer == t)
-#define timer_synchronize(t) while (timer_is_running(t)) barrier()
-#else
-#define timer_enter(t) do { } while (0)
-#define timer_exit() do { } while (0)
-#endif
-
-void add_timer(struct timer_list *timer)
+void add_timer(timer_t *timer)
{
- unsigned long flags;
+ int cpu = get_cpu();
+ tvec_base_t *base = tvec_bases + cpu;
+ unsigned long flags;
+
+ BUG_ON(timer_pending(timer));

- spin_lock_irqsave(&timerlist_lock, flags);
- if (unlikely(timer_pending(timer)))
- goto bug;
- internal_add_timer(timer);
- spin_unlock_irqrestore(&timerlist_lock, flags);
- return;
-bug:
- spin_unlock_irqrestore(&timerlist_lock, flags);
- printk(KERN_ERR "BUG: kernel timer added twice at %p.\n",
- __builtin_return_address(0));
+ spin_lock_irqsave(&base->lock, flags);
+ internal_add_timer(base, timer);
+ timer->base = base;
+ spin_unlock_irqrestore(&base->lock, flags);
+ put_cpu();
}

-static inline int detach_timer (struct timer_list *timer)
+static inline int detach_timer (timer_t *timer)
{
if (!timer_pending(timer))
return 0;
@@ -206,28 +126,78 @@
return 1;
}

-int mod_timer(struct timer_list *timer, unsigned long expires)
+/*
+ * mod_timer() has subtle locking semantics because parallel
+ * calls to it must happen serialized.
+ */
+int mod_timer(timer_t *timer, unsigned long expires)
{
- int ret;
+ tvec_base_t *old_base, *new_base;
unsigned long flags;
+ int ret;
+
+ if (timer_pending(timer) && timer->expires == expires)
+ return 1;
+
+ local_irq_save(flags);
+ new_base = tvec_bases + smp_processor_id();
+repeat:
+ old_base = timer->base;
+
+ /*
+ * Prevent deadlocks via ordering by old_base < new_base.
+ */
+ if (old_base && (new_base != old_base)) {
+ if (old_base < new_base) {
+ spin_lock(&new_base->lock);
+ spin_lock(&old_base->lock);
+ } else {
+ spin_lock(&old_base->lock);
+ spin_lock(&new_base->lock);
+ }
+ /*
+ * Subtle, we rely on timer->base being always
+ * valid and being updated atomically.
+ */
+ if (timer->base != old_base) {
+ spin_unlock(&new_base->lock);
+ spin_unlock(&old_base->lock);
+ goto repeat;
+ }
+ } else
+ spin_lock(&new_base->lock);

- spin_lock_irqsave(&timerlist_lock, flags);
timer->expires = expires;
ret = detach_timer(timer);
- internal_add_timer(timer);
- spin_unlock_irqrestore(&timerlist_lock, flags);
+ internal_add_timer(new_base, timer);
+ timer->base = new_base;
+
+ if (old_base && (new_base != old_base))
+ spin_unlock(&old_base->lock);
+ spin_unlock_irqrestore(&new_base->lock, flags);
+
return ret;
}

-int del_timer(struct timer_list * timer)
+int del_timer(timer_t * timer)
{
- int ret;
unsigned long flags;
+ tvec_base_t * base;
+ int ret;

- spin_lock_irqsave(&timerlist_lock, flags);
+ if (!timer->base)
+ return 0;
+repeat:
+ base = timer->base;
+ spin_lock_irqsave(&base->lock, flags);
+ if (base != timer->base) {
+ spin_unlock_irqrestore(&base->lock, flags);
+ goto repeat;
+ }
ret = detach_timer(timer);
timer->list.next = timer->list.prev = NULL;
- spin_unlock_irqrestore(&timerlist_lock, flags);
+ spin_unlock_irqrestore(&base->lock, flags);
+
return ret;
}

@@ -240,24 +210,33 @@
* (for reference counting).
*/

-int del_timer_sync(struct timer_list * timer)
+int del_timer_sync(timer_t * timer)
{
+ tvec_base_t * base;
int ret = 0;

+ if (!timer->base)
+ return 0;
for (;;) {
unsigned long flags;
int running;

- spin_lock_irqsave(&timerlist_lock, flags);
+repeat:
+ base = timer->base;
+ spin_lock_irqsave(&base->lock, flags);
+ if (base != timer->base) {
+ spin_unlock_irqrestore(&base->lock, flags);
+ goto repeat;
+ }
ret += detach_timer(timer);
timer->list.next = timer->list.prev = 0;
- running = timer_is_running(timer);
- spin_unlock_irqrestore(&timerlist_lock, flags);
+ running = timer_is_running(base, timer);
+ spin_unlock_irqrestore(&base->lock, flags);

if (!running)
break;

- timer_synchronize(timer);
+ timer_synchronize(base, timer);
}

return ret;
@@ -265,7 +244,7 @@
#endif


-static inline void cascade_timers(struct timer_vec *tv)
+static void cascade(tvec_base_t *base, tvec_t *tv)
{
/* cascade all the timers from tv up one level */
struct list_head *head, *curr, *next;
@@ -277,67 +256,107 @@
* detach them individually, just clear the list afterwards.
*/
while (curr != head) {
- struct timer_list *tmp;
+ timer_t *tmp;

- tmp = list_entry(curr, struct timer_list, list);
+ tmp = list_entry(curr, timer_t, list);
+ if (tmp->base != base)
+ BUG();
next = curr->next;
list_del(curr); // not needed
- internal_add_timer(tmp);
+ internal_add_timer(base, tmp);
curr = next;
}
INIT_LIST_HEAD(head);
tv->index = (tv->index + 1) & TVN_MASK;
}

-static inline void run_timer_list(void)
+static void __run_timers(tvec_base_t *base)
{
- spin_lock_irq(&timerlist_lock);
- while ((long)(jiffies - timer_jiffies) >= 0) {
+ unsigned long flags;
+
+ spin_lock_irqsave(&base->lock, flags);
+ while ((long)(jiffies - base->timer_jiffies) >= 0) {
struct list_head *head, *curr;
- if (!tv1.index) {
- int n = 1;
- do {
- cascade_timers(tvecs[n]);
- } while (tvecs[n]->index == 1 && ++n < NOOF_TVECS);
+
+ /*
+ * Cascade timers:
+ */
+ if (!base->tv1.index) {
+ cascade(base, &base->tv2);
+ if (base->tv2.index == 1) {
+ cascade(base, &base->tv3);
+ if (base->tv3.index == 1) {
+ cascade(base, &base->tv4);
+ if (base->tv4.index == 1)
+ cascade(base, &base->tv5);
+ }
+ }
}
repeat:
- head = tv1.vec + tv1.index;
+ head = base->tv1.vec + base->tv1.index;
curr = head->next;
if (curr != head) {
- struct timer_list *timer;
void (*fn)(unsigned long);
unsigned long data;
+ timer_t *timer;

- timer = list_entry(curr, struct timer_list, list);
+ timer = list_entry(curr, timer_t, list);
fn = timer->function;
- data= timer->data;
+ data = timer->data;

detach_timer(timer);
timer->list.next = timer->list.prev = NULL;
- timer_enter(timer);
- spin_unlock_irq(&timerlist_lock);
+ timer_enter(base, timer);
+ spin_unlock_irq(&base->lock);
fn(data);
- spin_lock_irq(&timerlist_lock);
- timer_exit();
+ spin_lock_irq(&base->lock);
+ timer_exit(base);
goto repeat;
}
- ++timer_jiffies;
- tv1.index = (tv1.index + 1) & TVR_MASK;
+ ++base->timer_jiffies;
+ base->tv1.index = (base->tv1.index + 1) & TVR_MASK;
}
- spin_unlock_irq(&timerlist_lock);
+ spin_unlock_irqrestore(&base->lock, flags);
}

-spinlock_t tqueue_lock __cacheline_aligned_in_smp = SPIN_LOCK_UNLOCKED;
+/******************************************************************/

-void tqueue_bh(void)
-{
- run_task_queue(&tq_timer);
-}
+/*
+ * Timekeeping variables
+ */
+unsigned long tick_usec = TICK_USEC; /* ACTHZ period (usec) */
+unsigned long tick_nsec = TICK_NSEC(TICK_USEC); /* USER_HZ period (nsec) */

-void immediate_bh(void)
-{
- run_task_queue(&tq_immediate);
-}
+/* The current time */
+struct timespec xtime __attribute__ ((aligned (16)));
+
+/* Don't completely fail for HZ > 500. */
+int tickadj = 500/HZ ? : 1; /* microsecs */
+
+struct kernel_stat kstat;
+
+/*
+ * phase-lock loop variables
+ */
+/* TIME_ERROR prevents overwriting the CMOS clock */
+int time_state = TIME_OK; /* clock synchronization status */
+int time_status = STA_UNSYNC; /* clock status bits */
+long time_offset; /* time adjustment (us) */
+long time_constant = 2; /* pll time constant */
+long time_tolerance = MAXFREQ; /* frequency tolerance (ppm) */
+long time_precision = 1; /* clock precision (us) */
+long time_maxerror = NTP_PHASE_LIMIT; /* maximum error (us) */
+long time_esterror = NTP_PHASE_LIMIT; /* estimated error (us) */
+long time_phase; /* phase offset (scaled us) */
+long time_freq = ((1000000 + HZ/2) % HZ - HZ/2) << SHIFT_USEC;
+ /* frequency offset (scaled ppm)*/
+long time_adj; /* tick adjust (scaled 1 / HZ) */
+long time_reftime; /* time at last adjustment (s) */
+long time_adjust;
+
+unsigned int * prof_buffer;
+unsigned long prof_len;
+unsigned long prof_shift;

/*
* this routine handles the overflow of the microsecond field
@@ -638,17 +657,33 @@
rwlock_t xtime_lock __cacheline_aligned_in_smp = RW_LOCK_UNLOCKED;
unsigned long last_time_offset;

+/*
+ * This function runs timers and the timer-tq in softirq context.
+ */
+static void run_timer_tasklet(unsigned long data)
+{
+ tvec_base_t *base = tvec_bases + smp_processor_id();
+
+ if ((long)(jiffies - base->timer_jiffies) >= 0)
+ __run_timers(base);
+}
+
+/*
+ * Called by the local, per-CPU timer interrupt on SMP.
+ */
+void run_local_timers(void)
+{
+ tasklet_hi_schedule(&per_cpu(timer_tasklet, smp_processor_id()));
+}
+
+/*
+ * Called by the timer interrupt. xtime_lock must already be taken
+ * by the timer IRQ!
+ */
static inline void update_times(void)
{
unsigned long ticks;

- /*
- * update_times() is run from the raw timer_bh handler so we
- * just know that the irqs are locally enabled and so we don't
- * need to save/restore the flags of the local CPU here. -arca
- */
- write_lock_irq(&xtime_lock);
-
ticks = jiffies - wall_jiffies;
if (ticks) {
wall_jiffies += ticks;
@@ -656,14 +691,13 @@
}
last_time_offset = 0;
calc_load(ticks);
- write_unlock_irq(&xtime_lock);
-}
-
-void timer_bh(void)
-{
- update_times();
- run_timer_list();
}
+
+/*
+ * The 64-bit jiffies value is not atomic - you MUST NOT read it
+ * without holding read_lock_irq(&xtime_lock).
+ * jiffies is defined in the linker script...
+ */

void do_timer(struct pt_regs *regs)
{
@@ -673,13 +707,13 @@

update_process_times(user_mode(regs));
#endif
- mark_bh(TIMER_BH);
- if (TQ_ACTIVE(tq_timer))
- mark_bh(TQUEUE_BH);
+ update_times();
}

#if !defined(__alpha__) && !defined(__ia64__)

+extern int do_setitimer(int, struct itimerval *, struct itimerval *);
+
/*
* For backwards compatibility? This can be done in libc so Alpha
* and all newer ports shouldn't need it.
@@ -821,7 +855,7 @@
*/
signed long schedule_timeout(signed long timeout)
{
- struct timer_list timer;
+ timer_t timer;
unsigned long expire;

switch (timeout)
@@ -973,4 +1007,25 @@
return -EFAULT;

return 0;
+}
+
+void __init init_timers(void)
+{
+ int i, j;
+
+ for (i = 0; i < NR_CPUS; i++) {
+ tvec_base_t *base;
+
+ base = tvec_bases + i;
+ spin_lock_init(&base->lock);
+ for (j = 0; j < TVN_SIZE; j++) {
+ INIT_LIST_HEAD(base->tv5.vec + j);
+ INIT_LIST_HEAD(base->tv4.vec + j);
+ INIT_LIST_HEAD(base->tv3.vec + j);
+ INIT_LIST_HEAD(base->tv2.vec + j);
+ }
+ for (j = 0; j < TVR_SIZE; j++)
+ INIT_LIST_HEAD(base->tv1.vec + j);
+ tasklet_init(&per_cpu(timer_tasklet, i), run_timer_tasklet, 0);
+ }
}
--- linux/kernel/softirq.c.orig Fri Sep 20 17:20:20 2002
+++ linux/kernel/softirq.c Sun Sep 29 18:37:46 2002
@@ -3,21 +3,15 @@
*
* Copyright (C) 1992 Linus Torvalds
*
- * Fixed a disable_bh()/enable_bh() race (was causing a console lockup)
- * due bh_mask_count not atomic handling. Copyright (C) 1998 Andrea Arcangeli
- *
* Rewritten. Old one was good in 2.2, but in 2.3 it was immoral. --ANK (990903)
*/

-#include <linux/config.h>
-#include <linux/mm.h>
#include <linux/kernel_stat.h>
#include <linux/interrupt.h>
-#include <linux/smp_lock.h>
-#include <linux/init.h>
-#include <linux/tqueue.h>
-#include <linux/percpu.h>
#include <linux/notifier.h>
+#include <linux/percpu.h>
+#include <linux/init.h>
+#include <linux/mm.h>

/*
- No shared variables, all the data are CPU local.
@@ -35,7 +29,6 @@
it is logically serialized per device, but this serialization
is invisible to common code.
- Tasklets: serialized wrt itself.
- - Bottom halves: globally serialized, grr...
*/

irq_cpustat_t irq_stat[NR_CPUS];
@@ -115,10 +108,10 @@
__cpu_raise_softirq(cpu, nr);

/*
- * If we're in an interrupt or bh, we're done
- * (this also catches bh-disabled code). We will
+ * If we're in an interrupt or softirq, we're done
+ * (this also catches softirq-disabled code). We will
* actually run the softirq once we return from
- * the irq or bh.
+ * the irq or softirq.
*
* Otherwise we wake up ksoftirqd to make sure we
* schedule the softirq soon.
@@ -267,89 +260,10 @@
clear_bit(TASKLET_STATE_SCHED, &t->state);
}

-
-
-/* Old style BHs */
-
-static void (*bh_base[32])(void);
-struct tasklet_struct bh_task_vec[32];
-
-/* BHs are serialized by spinlock global_bh_lock.
-
- It is still possible to make synchronize_bh() as
- spin_unlock_wait(&global_bh_lock). This operation is not used
- by kernel now, so that this lock is not made private only
- due to wait_on_irq().
-
- It can be removed only after auditing all the BHs.
- */
-spinlock_t global_bh_lock = SPIN_LOCK_UNLOCKED;
-
-static void bh_action(unsigned long nr)
-{
- if (!spin_trylock(&global_bh_lock))
- goto resched;
-
- if (bh_base[nr])
- bh_base[nr]();
-
- hardirq_endlock();
- spin_unlock(&global_bh_lock);
- return;
-
- spin_unlock(&global_bh_lock);
-resched:
- mark_bh(nr);
-}
-
-void init_bh(int nr, void (*routine)(void))
-{
- bh_base[nr] = routine;
- mb();
-}
-
-void remove_bh(int nr)
-{
- tasklet_kill(bh_task_vec+nr);
- bh_base[nr] = NULL;
-}
-
void __init softirq_init()
{
- int i;
-
- for (i=0; i<32; i++)
- tasklet_init(bh_task_vec+i, bh_action, i);
-
open_softirq(TASKLET_SOFTIRQ, tasklet_action, NULL);
open_softirq(HI_SOFTIRQ, tasklet_hi_action, NULL);
-}
-
-void __run_task_queue(task_queue *list)
-{
- struct list_head head, *next;
- unsigned long flags;
-
- spin_lock_irqsave(&tqueue_lock, flags);
- list_add(&head, list);
- list_del_init(list);
- spin_unlock_irqrestore(&tqueue_lock, flags);
-
- next = head.next;
- while (next != &head) {
- void (*f) (void *);
- struct tq_struct *p;
- void *data;
-
- p = list_entry(next, struct tq_struct, list);
- next = next->next;
- f = p->routine;
- data = p->data;
- wmb();
- p->sync = 0;
- if (f)
- f(data);
- }
}

static int ksoftirqd(void * __bind_cpu)
--- linux/kernel/context.c.orig Sun Sep 29 18:13:44 2002
+++ linux/kernel/context.c Sun Sep 29 18:36:48 2002
@@ -28,6 +28,60 @@
static int keventd_running;
static struct task_struct *keventd_task;

+static spinlock_t tqueue_lock __cacheline_aligned_in_smp = SPIN_LOCK_UNLOCKED;
+
+typedef struct list_head task_queue;
+
+/*
+ * Queue a task on a tq. Return non-zero if it was successfully
+ * added.
+ */
+static inline int queue_task(struct tq_struct *tq, task_queue *list)
+{
+ int ret = 0;
+ unsigned long flags;
+
+ if (!test_and_set_bit(0, &tq->sync)) {
+ spin_lock_irqsave(&tqueue_lock, flags);
+ list_add_tail(&tq->list, list);
+ spin_unlock_irqrestore(&tqueue_lock, flags);
+ ret = 1;
+ }
+ return ret;
+}
+
+#define TQ_ACTIVE(q) (!list_empty(&q))
+
+static inline void run_task_queue(task_queue *list)
+{
+ struct list_head head, *next;
+ unsigned long flags;
+
+ if (!TQ_ACTIVE(*list))
+ return;
+
+ spin_lock_irqsave(&tqueue_lock, flags);
+ list_add(&head, list);
+ list_del_init(list);
+ spin_unlock_irqrestore(&tqueue_lock, flags);
+
+ next = head.next;
+ while (next != &head) {
+ void (*f) (void *);
+ struct tq_struct *p;
+ void *data;
+
+ p = list_entry(next, struct tq_struct, list);
+ next = next->next;
+ f = p->routine;
+ data = p->data;
+ wmb();
+ p->sync = 0;
+ if (f)
+ f(data);
+ }
+}
+
static int need_keventd(const char *who)
{
if (keventd_running == 0)



2002-09-29 18:45:24

by Jeff Garzik

Subject: Re: [patch] smptimers, old BH removal, tq-cleanup, 2.5.39

Jeff Garzik wrote:
> 3) Maybe this is an improvement worth considering, maybe not: I think
> it would be useful to have "process context timers." What I mean by
> this is, make it easier to implement code that can be boiled down to:
> run in process context
> sleep for a little while
> Sure I can do this by implementing a timer that does nothing more than
> call schedule_task(), but that seems a little redundant if done in
> multiple places :)


To sum up, I guess I want something to "run schedule_task() [at least] NN
usecs from now". schedule_task_future()?

2002-09-29 18:34:05

by Jeff Garzik

Subject: Re: [patch] smptimers, old BH removal, tq-cleanup, 2.5.39

Ingo Molnar wrote:
> Basically with the removal of TIMER_BH i think the time is right to get
> rid of old BHs forever, and to do a massive cleanup of all related fields.
> The following five basic 'execution context' abstractions are supported by
> the kernel:
>
> - hardirq
> - softirq
> - tasklet
> - keventd-driven task-queues
> - process contexts

To be pedantic (and perhaps more clear), I think keventd can really be
considered together with process context, because that's schedule_task()'s
main purpose - to provide a process context for code that otherwise may
not reliably have one.

And to tangent a bit, there are three improvements in this area that
(IMO) are worth considering:

1) keventd's worth is diminished, I think, due to what I see as an open
question about its usage: it provides a process context, and that's why
most code uses it. And the main reason why people want process context
is to potentially sleep. So by extension, one really doesn't know how
long a schedule_task() will wait before it is issued.

At least in driver code I see common patterns -- especially error
handling paths -- which could benefit from some code in interrupt
context saying "run my error handling path in process context." This is
a one-shot task that's perfect for keventd... if it weren't for the fact
that worst case, you don't really know if your error handling path will
even get called in the next 5 minutes.

I have a suggested solution, but it seems heavyweight to me and easily
shoot-down-able: have a keventd manager that manages the keventd queue.
A second thread actually does the work of providing a process context;
third, fourth, etc. threads are created if work starts piling up
because earlier work threads are taking a long time to complete. Cap
the number of threads at a maximum, and kill off idle threads [i.e.
maintain a thread pool].

But in any case, I think keventd has a lot of utility, and is very
under-utilized at present.


2) Given the above, I think a better driver interface is to simply pass
in a function pointer, and a user data pointer. No need for any special
task queue structure. Drivers call
schedule_task(my_func, my_context_specific_data);
and the driver's callback looks like
void my_func(void *data)

3) Maybe this is an improvement worth considering, maybe not: I think
it would be useful to have "process context timers." What I mean by
this is, make it easier to implement code that can be boiled down to:
run in process context
sleep for a little while
Sure I can do this by implementing a timer that does nothing more than
call schedule_task(), but that seems a little redundant if done in
multiple places :)

Ethernet link state machines, pcmcia-16's multi-stage state machine,
etc. are examples of this. Ethernet link state machines are often
handled via the standard timers, and I would prefer to be able to sleep
and do process-context-y stuff instead. Talking to ethernet phys often
takes quite a while...
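
The boilerplate I'd rather not duplicate in every driver looks roughly
like this (hypothetical PHY-polling sketch):

    static void phy_poll(void *data);

    static struct tq_struct phy_task = { routine: phy_poll };
    static struct timer_list phy_timer;     /* function = phy_timer_fn,
                                               set up at driver init */

    static void phy_poll(void *data)
    {
            /* process context: free to sleep while talking to the PHY */
            mod_timer(&phy_timer, jiffies + HZ);    /* rearm the poll */
    }

    static void phy_timer_fn(unsigned long data)
    {
            /* timer (softirq) context: cannot sleep, so bounce to keventd */
            schedule_task(&phy_task);
    }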

I think these improvements would also serve to encourage coders to use
process context whenever possible (not only is it the most flexible,
it's easy too!), which is always a good thing.

Comments anyone?


> i believe task-queues should not be removed from the kernel altogether.
> It's true that they were written as a candidate replacement for BHs
> originally, but they do make sense in a different way: it's perhaps the
> easiest interface to do deferred processing from IRQ context, in
> performance-uncritical code areas. They are easier to use than tasklets.

I have a theory that most if not all "task queues" can be boiled down
into keventd-tasks, which allows for further [if minor] simplifications
such as the function-calling simplification in #2 above.


> i have moved all the taskqueue handling code into kernel/context.c, and
> only kept the basic 'queue a task' definitions in include/linux/tqueue.h.
> I've converted three of the most commonly used BH_IMMEDIATE users:
> tty_io.c, floppy.c and random.c. [random.c might need more thought
> though.]

makes sense


> scalable timers: i've further improved the patch ported to 2.5 by wli and
> Dipankar. There is only one pending issue i can see, the question of
> whether to migrate timers in mod_timer() or not. I'm quite convinced that
> they should be migrated, but i might be wrong. It's a 10-line change to
> switch between migrating and non-migrating timers, we can do performance
> tests later on.

As an aside, most drivers seem to use mod_timer().

Regards,

Jeff




2002-09-29 19:01:20

by Ingo Molnar

Subject: Re: [patch] smptimers, old BH removal, tq-cleanup, 2.5.39


the attached patch is against BK-curr, it removes some more old symbols
from ksyms.c. This makes the kernel compile with modules enabled.

Ingo

--- linux/kernel/ksyms.c.orig Sun Sep 29 21:12:50 2002
+++ linux/kernel/ksyms.c Sun Sep 29 21:13:24 2002
@@ -420,7 +420,6 @@
EXPORT_SYMBOL(del_timer_sync);
#endif
EXPORT_SYMBOL(mod_timer);
-EXPORT_SYMBOL(tvec_bases);

#ifdef CONFIG_SMP

@@ -589,12 +588,8 @@
EXPORT_SYMBOL(strsep);

/* software interrupts */
-EXPORT_SYMBOL(bh_task_vec);
-EXPORT_SYMBOL(init_bh);
-EXPORT_SYMBOL(remove_bh);
EXPORT_SYMBOL(tasklet_init);
EXPORT_SYMBOL(tasklet_kill);
-EXPORT_SYMBOL(__run_task_queue);
EXPORT_SYMBOL(do_softirq);
EXPORT_SYMBOL(raise_softirq);
EXPORT_SYMBOL(open_softirq);

2002-09-29 19:05:16

by Dipankar Sarma

Subject: Re: [patch] smptimers, old BH removal, tq-cleanup, 2.5.39

Hi Ingo,

First of all, YES! I am going to start testing first thing tomorrow.

On Sun, Sep 29, 2002 at 07:52:17PM +0200, Ingo Molnar wrote:
>
> i've done the following cleanups/simplifications to task-queues:
>
> - removed the ability to define your own task-queue; what can be done is
> to schedule_task() a given task to keventd, and to flush all pending
> tasks.
>
> this is actually a quite easy transition, since 90% of all task-queue
> users in the kernel used BH_IMMEDIATE - which is very similar in
> functionality to keventd.

This is a problem I ran into in my "kill-BHs" project. I was wondering
if callbacks executed through keventd might have significantly
higher latency (and potential starvation) compared to IMMEDIATE_BH-driven
task-queues, and that this might break existing drivers. Is this not
going to be an issue?

> net_bh_lock: i have removed it, since it would synchronize to nothing. The
> old protocol handlers should still run on UP, and on SMP the kernel prints
> a warning upon use. Alexey, is this approach fine with you?

The cache line bouncing of global_bh_lock and net_bh_lock in
run_timer_tasklet() shows up in our profiles, so getting rid of
them is a good thing (TM).

> scalable timers: i've further improved the patch ported to 2.5 by wli and
> Dipankar. There is only one pending issue i can see, the question of
> whether to migrate timers in mod_timer() or not. I'm quite convinced that
> they should be migrated, but i might be wrong. It's a 10-line change to
> switch between migrating and non-migrating timers, we can do performance
> tests later on. The current, more complex migration code is pretty fast
> and has been stable under extremely high networking loads in the past 2
> years, so we can immediately switch to the simpler variant if someone
> proves it improves performance. (I'd say if non-migrating timers improve
> Apache performance on one of the bigger NUMA boxes then the point is
> proven, no further thought will be needed.)

I will start testing this patch and will try to get you some numbers.
Thanks for taking this up.

Thanks
--
Dipankar Sarma <[email protected]> http://lse.sourceforge.net
Linux Technology Center, IBM Software Lab, Bangalore, India.

2002-09-29 18:48:57

by Dave Jones

Subject: Re: [patch] smptimers, old BH removal, tq-cleanup, 2.5.39

On Sun, Sep 29, 2002 at 07:52:17PM +0200, Ingo Molnar wrote:
>
> the attached patch is the smptimers patch plus the removal of old BHs and
> a rewrite of task-queue handling.

As an aside, some of the stuff in Documentation/ like Rusty's various
guides is now woefully out of date with what's happening in 2.5.
Yet another small project for someone with too much time on their
hands would be to go through it, deleting the obsolete stuff, and
updating the locking documentation to reflect new issues like
preemption.

Dave

--
| Dave Jones. http://www.codemonkey.org.uk
| SuSE Labs

2002-09-29 19:12:27

by Ingo Molnar

Subject: Re: [patch] smptimers, old BH removal, tq-cleanup, 2.5.39


yes, wrt. keventd i was thinking along the same line - but in a different,
perhaps cleaner and simpler direction.

i'd like to introduce the following interfaces:

- create_work_queue(wq, handler_fn)

- destroy_work_queue(wq)

- queue_work(wq, work_fn, work_data)

- flush_work_queue(wq)

this is an extension of the keventd concept. A work queue is a simplified
interface to create a kernel thread that gets work queued from IRQ and
process contexts. No more, no less.

there would be a number of 'default' work-queues that would be created
upon bootup:

- &irq_workqueue
- &io_workqueue

each work queue would get its own separate kernel thread. schedule_task()
would simply queue to the irq_workqueue. We could make the irq_workqueue's
kernel thread a highprio RT thread, to make it really softirq-alike. (Or
for the very specific case of BH_IMMEDIATE-type strictly IRQ-safe work,
we could add a tasklet that works down this queue.)

There's tons of code within the kernel that can be streamlined this way -
most of the helper threads do this kind of work. (I'll post a
patch soon to show what it would look like.)
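
usage would look something like this (the struct name and the exact
signatures are my assumptions from the list above - the real patch may
differ):

    static struct work_queue my_wq;

    static void my_work_fn(void *data)
    {
            /* runs in my_wq's own kernel thread - process context */
    }

    static int my_driver_init(void)
    {
            /* one kernel thread per queue; handler_fn left NULL here */
            create_work_queue(&my_wq, NULL);
            return 0;
    }

    static void my_interrupt(int irq, void *dev_id, struct pt_regs *regs)
    {
            /* safe from IRQ context: */
            queue_work(&my_wq, my_work_fn, dev_id);
    }

    static void my_driver_exit(void)
    {
            flush_work_queue(&my_wq);
            destroy_work_queue(&my_wq);
    }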

Ingo

2002-09-29 19:43:31

by Jeff Garzik

Subject: Re: [patch] smptimers, old BH removal, tq-cleanup, 2.5.39

Ingo Molnar wrote:
> yes, wrt. keventd i was thinking along the same line - but in a different,
> perhaps cleaner and simpler direction.
>
> i'd like to introduce the following interfaces:
>
> - create_work_queue(wq, handler_fn)

what is handler_fn for, if you pass work_fn later?


> - destroy_work_queue(wq)
>
> - queue_work(wq, work_fn, work_data)

queue_work_delayed(wq, work_fn, work_data, delay) would be nice too


> - flush_work_queue(wq)
>
> this is an extension of the keventd concept. A work queue is a simplified
> interface to create a kernel thread that gets work queued from IRQ and
> process contexts. No more, no less.

Your proposal sounds good to me...

Jeff



2002-09-30 00:21:48

by David Miller

Subject: Re: [patch] smptimers, old BH removal, tq-cleanup, 2.5.39

From: Dipankar Sarma <[email protected]>
Date: Mon, 30 Sep 2002 00:45:59 +0530

> net_bh_lock: i have removed it, since it would synchronize to nothing. The
> old protocol handlers should still run on UP, and on SMP the kernel prints
> a warning upon use. Alexey, is this approach fine with you?

The cache line bouncing of global_bh_lock and net_bh_lock in
run_timer_tasklet() shows up in our profiles, so getting rid of
them is a good thing (TM).

What ancient protocols are you running that make use of this?

IPv4 and IPv6 both do not use it at all. Even the IPX, Appletalk, and
DECnet layers do not use it.

2002-09-30 00:41:10

by David Miller

Subject: Re: [patch] smptimers, old BH removal, tq-cleanup, 2.5.39

From: Ingo Molnar <[email protected]>
Date: Sun, 29 Sep 2002 19:52:17 +0200 (CEST)

net_bh_lock: i have removed it, since it would synchronize to nothing. The
old protocol handlers should still run on UP, and on SMP the kernel prints
a warning upon use. Alexey, is this approach fine with you?

Just kill this crap completely. Old protocol handlers are 100%
unsupported in 2.6.

I know people are working on fixing up basically every old protocol
layer currently in the tree, so this will not be an issue.

When a "struct packet_type" is registered in dev_add_pack(), fail if
!pt->data which is the indication of "old protocol". Once all the
protocols are finished being fixed up, pt->data and this test can
die.
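
Roughly (sketch only - pt->data being non-NULL is the mark of a
new-style protocol, as described above):

    void dev_add_pack(struct packet_type *pt)
    {
            if (!pt->data) {
                    printk(KERN_ERR "dev_add_pack: old-style protocol "
                           "handler rejected\n");
                    return;
            }
            /* ... existing registration ... */
    }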

2002-09-30 04:23:47

by Dipankar Sarma

Subject: Re: [patch] smptimers, old BH removal, tq-cleanup, 2.5.39

On Sun, Sep 29, 2002 at 05:20:22PM -0700, David S. Miller wrote:
> From: Dipankar Sarma <[email protected]>
> Date: Mon, 30 Sep 2002 00:45:59 +0530
>
> > net_bh_lock: i have removed it, since it would synchronize to nothing. The
> > old protocol handlers should still run on UP, and on SMP the kernel prints
> > a warning upon use. Alexey, is this approach fine with you?
>
> The cache line bouncing of global_bh_lock and net_bh_lock in
> run_timer_tasklet() show up in our profiles, so getting rid of
> them is a good thing (TM).
>
> What ancient protocols are you running that make use of this?

I wasn't running any old protocols. It was a problem I faced
with my port of smptimers - I serialized
wrt BHs and old protocols using global_bh_lock and net_bh_lock (exported
it globally) respectively in the per-cpu tasklet that runs timers.
So, the spin_trylock() in run_timer_tasklet() would modify the
lock cache line and hence the bouncing. Getting rid of BHs and
old protocol serialization avoids this as in Ingo's latest patch.

>
> IPv4 and IPv6 both do not use it at all. Even IPX, Appletalk, and
> DecNET layers do not use it

This is the list, I think, by looking at packet_types -

802/psnap.c
appletalk/ddp.c
ax25/af_ax25.c
core/ext8022.c
econet/af_econet.c
irda/irsyms.c
x25/af_x25.c

Thanks
--
Dipankar Sarma <[email protected]> http://lse.sourceforge.net
Linux Technology Center, IBM Software Lab, Bangalore, India.

2002-09-30 04:30:00

by David Miller

Subject: Re: [patch] smptimers, old BH removal, tq-cleanup, 2.5.39

From: Dipankar Sarma <[email protected]>
Date: Mon, 30 Sep 2002 10:03:17 +0530

This is the list, I think, by looking at packet_types -

You are looking at old sources, then; in current 2.5.x psnap.c and ext8022.c
are taken care of by Arnaldo's LLC stack. He is hacking x25 as well.
Ralf Baechle is doing ax25/rose/etc. radio layers. Appletalk should
be sane in 2.5.x as well.

2002-09-30 04:33:21

by Arnaldo Carvalho de Melo

Subject: Re: [patch] smptimers, old BH removal, tq-cleanup, 2.5.39

Em Mon, Sep 30, 2002 at 10:03:17AM +0530, Dipankar Sarma escreveu:
> This is the list, I think, by looking at packet_types -

Don't bother with snap - that one is just a dummy packet type, it's not
even registered via dev_add_pack.
> 802/psnap.c

I'm working on Appletalk; it will be fixed after X.25. Humm, in fact Appletalk
only uses SNAP on Ethernet, so it is only broken for ppptalk and ltalk - does
anybody still use those latter two?

> appletalk/ddp.c

Ralf Bächle is working on ax25 (and its clients: ROSE and NETROM).
> ax25/af_ax25.c

This doesn't exist anymore - what kernel are you looking at?
> core/ext8022.c

Nobody is working on this, as far as I know.
> econet/af_econet.c

Last I heard Alexey was working on fixing irda.
> irda/irsyms.c

I'm working on this one now.
> x25/af_x25.c

- Arnaldo

2002-09-30 05:51:29

by Ingo Molnar

Subject: Re: [patch] smptimers, old BH removal, tq-cleanup, 2.5.39


On Sun, 29 Sep 2002, David S. Miller wrote:

> From: Ingo Molnar <[email protected]>
> Date: Sun, 29 Sep 2002 19:52:17 +0200 (CEST)
>
> net_bh_lock: i have removed it, since it would synchronize to nothing. The
> old protocol handlers should still run on UP, and on SMP the kernel prints
> a warning upon use. Alexey, is this approach fine with you?
>
> Just kill this crap completely. Old protocol handlers are 100%
> unsupported in 2.6.
>
> I know people are working on fixing up basically every old protocol
> layer currently in the tree, so this will not be an issue.

wonderful. I thought it might have helped if we kept those old-protocol
callbacks around in the UP kernel still (to help converting stuff) - but
if this is being taken care of then great.

Ingo

2002-09-30 12:45:33

by Alan Cox

Subject: Re: [patch] smptimers, old BH removal, tq-cleanup, 2.5.39

On Mon, 2002-09-30 at 05:38, Arnaldo Carvalho de Melo wrote:
> I'm working on Appletalk, will be fixed after X.25, humm, in fact Appletalk
> only uses SNAP on Ethernet, so it is only broken for ppptalk and ltalk, does
> anybody still uses these later two?

ppptalk is relevant to the modern world; localtalk is basically for
talking to old Macintoshes, many of which don't have any capability for
ethernet. I don't think either of them is even going to be a performance
matter.

> Nobody working on this, as far as I know
> > econet/af_econet.c

Ancient BBC micro protocol, could probably be done just as well in user
space.

Alan

2002-09-30 14:45:50

by Arnaldo Carvalho de Melo

Subject: Re: [patch] smptimers, old BH removal, tq-cleanup, 2.5.39

Em Mon, Sep 30, 2002 at 01:55:56PM +0100, Alan Cox escreveu:
> On Mon, 2002-09-30 at 05:38, Arnaldo Carvalho de Melo wrote:
> > I'm working on Appletalk, will be fixed after X.25, humm, in fact Appletalk
> > only uses SNAP on Ethernet, so it is only broken for ppptalk and ltalk, does
> > anybody still uses these later two?
>
> ppptalk is relevant to the modern world, localtalk is basically for
> talking to old macintoshes many of which don't have any capability for
> ethernet. I don't think either of them are even going to be performance
> matters.

OK, but even those will be taken care of, as the changes had to be done anyway
for SNAP, so I'll just stick the (void*)1 into their packet_types.

> > Nobody working on this, as far as I know
> > > econet/af_econet.c
>
> Ancient BBC micro protocol, could probably be done just as well in user
> space.

Like some of the other protocols, but at this point it may well be easier
to fix it in the kernel, where it sits 8)

Oh, dang, I forgot that these other protocols can work on fast lines these
days 8)

- Arnaldo

2002-09-30 16:22:58

by Ingo Molnar

[permalink] [raw]
Subject: Re: [patch] smptimers, old BH removal, tq-cleanup, 2.5.39


On Mon, 30 Sep 2002, Christoph Hellwig wrote:

> But it breaks XFS. Chris Wedgwood fixed it to use schedule_task()
> instead (and I cleaned it up a little more, see patch below), but this
> does effectively single-thread XFS I/O completion.

see the workqueues patch i posted a couple of minutes ago. Does this solve
XFS's problems?

> Alternatively we could allow kernel code to create its own keventds with
> associated task-queues, but that sounds rather ugly..

why is it ugly? I can add a simple interface to the workqueues subsystem
that will bind the XFS worker threads to given sets of CPUs. That should
give you per-CPU workqueues, with separate per-CPU locking.

Ingo

2002-09-30 16:25:44

by Christoph Hellwig

[permalink] [raw]
Subject: Re: [patch] smptimers, old BH removal, tq-cleanup, 2.5.39

On Mon, Sep 30, 2002 at 06:38:03PM +0200, Ingo Molnar wrote:
> see the workqueues patch i posted a couple of minutes ago. Does this solve
> XFS's problems?

Not exactly. All your work on one queue is internally serialized. A
totally unserialized workqueue would be best for XFS.

> why is it ugly? I can add a simple interface to the workqueues subsystem
> that will bind the XFS worker threads to given sets of CPUs. That should
> give you per-CPU workqueues, with separate per-CPU locking.

That would also work, but would require more code in XFS than my
above suggestion.

2002-09-30 16:18:52

by Christoph Hellwig

[permalink] [raw]
Subject: Re: [patch] smptimers, old BH removal, tq-cleanup, 2.5.39

On Sun, Sep 29, 2002 at 07:52:17PM +0200, Ingo Molnar wrote:
> i've done the following cleanups/simplifications to task-queues:
>
> - removed the ability to define your own task-queue, what can be done is
> to schedule_task() a given task to keventd, and to flush all pending
> tasks.
>
> this is actually a quite easy transition, since 90% of all task-queue
> users in the kernel used BH_IMMEDIATE - which is very similar in
> functionality to keventd.

But it breaks XFS. Chris Wedgwood fixed it to use schedule_task()
instead (and I cleaned it up a little more, see patch below), but
this does effectively single-thread XFS I/O completion.

Why? XFS used to have a per-cpu task-queue, emptied by a per-cpu
kernel thread. To get anywhere near that performance again we'd
need to merge the per-cpu keventd patch Arjan did for RH 2.1AS into
2.5. Alternatively we could allow kernel code to create its own
keventds with associated task-queues, but that sounds rather ugly..


diff -uNr -p linux-2.5.39-mm1/fs/xfs/pagebuf/page_buf.c linux-2.5.39/fs/xfs/pagebuf/page_buf.c
--- linux-2.5.39-mm1/fs/xfs/pagebuf/page_buf.c Mon Sep 30 14:27:25 2002
+++ linux-2.5.39/fs/xfs/pagebuf/page_buf.c Mon Sep 30 16:49:06 2002
@@ -160,8 +160,6 @@ pb_tracking_free(

STATIC kmem_cache_t *pagebuf_cache;
STATIC pagebuf_daemon_t *pb_daemon;
-STATIC struct list_head pagebuf_iodone_tq[NR_CPUS];
-STATIC wait_queue_head_t pagebuf_iodone_wait[NR_CPUS];
STATIC void pagebuf_daemon_wakeup(int);

/*
@@ -1157,15 +1155,6 @@ _pagebuf_wait_unpin(
current->state = TASK_RUNNING;
}

-void
-pagebuf_queue_task(
- struct tq_struct *task)
-{
- queue_task(task, &pagebuf_iodone_tq[smp_processor_id()]);
- wake_up(&pagebuf_iodone_wait[smp_processor_id()]);
-}
-
-
/*
* Buffer Utility Routines
*/
@@ -1210,9 +1199,8 @@ pagebuf_iodone(
INIT_TQUEUE(&pb->pb_iodone_sched,
pagebuf_iodone_sched, (void *)pb);

- queue_task(&pb->pb_iodone_sched,
- &pagebuf_iodone_tq[smp_processor_id()]);
- wake_up(&pagebuf_iodone_wait[smp_processor_id()]);
+ schedule_task(&pb->pb_iodone_sched);
+
} else {
up(&pb->pb_iodonesema);
}
@@ -1666,62 +1654,6 @@ pagebuf_delwri_dequeue(
spin_unlock(&pb_daemon->pb_delwrite_lock);
}

-
-/*
- * The pagebuf iodone daemon
- */
-
-STATIC int pb_daemons[NR_CPUS];
-
-STATIC int
-pagebuf_iodone_daemon(
- void *__bind_cpu)
-{
- int cpu = (long) __bind_cpu;
- DECLARE_WAITQUEUE (wait, current);
-
- /* Set up the thread */
- daemonize();
-
- /* Avoid signals */
- spin_lock_irq(&current->sig->siglock);
- sigfillset(&current->blocked);
- recalc_sigpending();
- spin_unlock_irq(&current->sig->siglock);
-
- /* Migrate to the right CPU */
- set_cpus_allowed(current, 1UL << cpu);
- if (smp_processor_id() != cpu)
- BUG();
-
- sprintf(current->comm, "pagebuf_io_CPU%d", cpu);
- INIT_LIST_HEAD(&pagebuf_iodone_tq[cpu]);
- init_waitqueue_head(&pagebuf_iodone_wait[cpu]);
- __set_current_state(TASK_INTERRUPTIBLE);
- mb();
-
- pb_daemons[cpu] = 1;
-
- for (;;) {
- add_wait_queue(&pagebuf_iodone_wait[cpu],
- &wait);
-
- if (TQ_ACTIVE(pagebuf_iodone_tq[cpu]))
- __set_task_state(current, TASK_RUNNING);
- schedule();
- remove_wait_queue(&pagebuf_iodone_wait[cpu],
- &wait);
- run_task_queue(&pagebuf_iodone_tq[cpu]);
- if (pb_daemons[cpu] == 0)
- break;
- __set_current_state(TASK_INTERRUPTIBLE);
- }
-
- pb_daemons[cpu] = -1;
- wake_up_interruptible(&pagebuf_iodone_wait[cpu]);
- return 0;
-}
-
/* Defines for pagebuf daemon */
DECLARE_WAIT_QUEUE_HEAD(pbd_waitq);
STATIC int force_flush;
@@ -1907,8 +1839,6 @@ STATIC int
pagebuf_daemon_start(void)
{
if (!pb_daemon) {
- int cpu;
-
pb_daemon = (pagebuf_daemon_t *)
kmalloc(sizeof(pagebuf_daemon_t), GFP_KERNEL);
if (!pb_daemon) {
@@ -1924,19 +1854,6 @@ pagebuf_daemon_start(void)

kernel_thread(pagebuf_daemon, (void *)pb_daemon,
CLONE_FS|CLONE_FILES|CLONE_VM);
- for (cpu = 0; cpu < NR_CPUS; cpu++) {
- if (!cpu_online(cpu))
- continue;
- if (kernel_thread(pagebuf_iodone_daemon,
- (void *)(long) cpu,
- CLONE_FS|CLONE_FILES|CLONE_VM) < 0) {
- printk("pagebuf_daemon_start failed\n");
- } else {
- while (!pb_daemons[cpu]) {
- yield();
- }
- }
- }
}
return 0;
}
@@ -1950,24 +1867,12 @@ STATIC void
pagebuf_daemon_stop(void)
{
if (pb_daemon) {
- int cpu;
-
pb_daemon->active = 0;
pb_daemon->io_active = 0;

wake_up_interruptible(&pbd_waitq);
while (pb_daemon->active == 0) {
interruptible_sleep_on(&pbd_waitq);
- }
- for (cpu = 0; cpu < NR_CPUS; cpu++) {
- if (!cpu_online(cpu))
- continue;
- pb_daemons[cpu] = 0;
- wake_up(&pagebuf_iodone_wait[cpu]);
- while (pb_daemons[cpu] != -1) {
- interruptible_sleep_on(
- &pagebuf_iodone_wait[cpu]);
- }
}

kfree(pb_daemon);
diff -uNr -p linux-2.5.39-mm1/fs/xfs/pagebuf/page_buf.h linux-2.5.39/fs/xfs/pagebuf/page_buf.h
--- linux-2.5.39-mm1/fs/xfs/pagebuf/page_buf.h Mon Sep 30 14:13:46 2002
+++ linux-2.5.39/fs/xfs/pagebuf/page_buf.h Mon Sep 30 16:15:25 2002
@@ -324,9 +324,6 @@ extern void pagebuf_unlock( /* unlock b

#define pagebuf_geterror(pb) ((pb)->pb_error)

-extern void pagebuf_queue_task(
- struct tq_struct *);
-
extern void pagebuf_iodone( /* mark buffer I/O complete */
page_buf_t *); /* buffer to mark */

diff -uNr -p linux-2.5.39-mm1/fs/xfs/xfs_log.c linux-2.5.39/fs/xfs/xfs_log.c
--- linux-2.5.39-mm1/fs/xfs/xfs_log.c Mon Sep 30 14:15:55 2002
+++ linux-2.5.39/fs/xfs/xfs_log.c Mon Sep 30 16:15:25 2002
@@ -2779,7 +2779,7 @@ xlog_state_release_iclog(xlog_t *log,
case 0:
return xlog_sync(log, iclog, 0);
case 1:
- pagebuf_queue_task(&iclog->ic_write_sched);
+ schedule_task(&iclog->ic_write_sched);
}
}
return (0);

2002-09-30 16:57:54

by Ingo Molnar

[permalink] [raw]
Subject: Re: [patch] smptimers, old BH removal, tq-cleanup, 2.5.39


On Mon, 30 Sep 2002, Christoph Hellwig wrote:

> On Mon, Sep 30, 2002 at 06:38:03PM +0200, Ingo Molnar wrote:
> > see the workqueues patch i posted a couple of minutes ago. Does this solve
> > XFS's problems?
>
> Not exactly. All your work on one queue is internally serialized. A
> totally unserialized workqueue would be best for XFS.

you can create as many queues as you wish - one per CPU for example. Or
one per mounted fs per CPU.

Ingo

2002-09-30 17:23:40

by Christoph Hellwig

[permalink] [raw]
Subject: Re: [patch] smptimers, old BH removal, tq-cleanup, 2.5.39

On Mon, Sep 30, 2002 at 07:12:54PM +0200, Ingo Molnar wrote:
> > Not exactly. All your work on one queue is internally serialized. A
> > totally unserialized workqueue would be best for XFS.
>
> you can create as many queues as you wish - one per CPU for example. Or
> one per mounted fs per CPU.

Yeah. But adding a create_workqueue_per_cpu that has one queue and thread
per cpu, to which queue_work dispatches, would centralize the code needed
to manage that in one place instead of duplicating it over and over.

Sure, both work, but IMHO hiding it behind a nice abstraction is much
better.
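
Something along these lines, perhaps - a rough sketch of the wrapper
being described, assuming Ingo's posted workqueue interface looks
roughly like create_workqueue()/queue_work(); the per-cpu wrapper names
and the struct here are made up for illustration:

#include <linux/workqueue.h>
#include <linux/slab.h>
#include <linux/smp.h>

/* hypothetical: one workqueue (and thus one worker thread) per
 * online cpu, hidden behind a single handle */
struct percpu_workqueue {
	struct workqueue_struct *wq[NR_CPUS];
};

struct percpu_workqueue *create_workqueue_per_cpu(const char *name)
{
	struct percpu_workqueue *pwq;
	char tname[16];
	int cpu;

	pwq = kmalloc(sizeof(*pwq), GFP_KERNEL);
	if (!pwq)
		return NULL;
	for (cpu = 0; cpu < NR_CPUS; cpu++) {
		if (!cpu_online(cpu))
			continue;
		/* one queue per cpu, named after it (error handling
		 * omitted for brevity) */
		snprintf(tname, sizeof(tname), "%s/%d", name, cpu);
		pwq->wq[cpu] = create_workqueue(tname);
	}
	return pwq;
}

/* dispatch to the local cpu's queue - work queued from different
 * cpus never serializes against each other */
int queue_work_per_cpu(struct percpu_workqueue *pwq,
		       struct work_struct *work)
{
	return queue_work(pwq->wq[smp_processor_id()], work);
}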

2002-09-30 21:41:27

by George Anzinger

[permalink] [raw]
Subject: Re: [patch] smptimers, old BH removal, tq-cleanup, 2.5.39

Ingo Molnar wrote:
>
> the attached patch is the smptimers patch plus the removal of old BHs and
> a rewrite of task-queue handling.
>
~snip
>
> scalable timers: i've further improved the patch ported to 2.5 by wli and
> Dipankar. There is only one pending issue i can see, the question of
> whether to migrate timers in mod_timer() or not. I'm quite convinced that
> they should be migrated, but i might be wrong. It's a 10 lines change to
> switch between migrating and non-migrating timers, we can do performance
> tests later on. The current, more complex migration code is pretty fast
> and has been stable under extremely high networking loads in the past 2
> years, so we can immediately switch to the simpler variant if someone
> proves it improves performance. (I'd say if non-migrating timers improve
> Apache performance on one of the bigger NUMA boxes then the point is
> proven, no further though will be needed.)

As the APIC timers are currently set up, they are
undisciplined WRT the PIT, which is still used to drive the
clock. This means that, since this patch drives the
"run_timer_list" code from the APIC timers, the actual delay
in timer servicing relative to the requested time will vary
with a.) the cpu (since each cpu is set up to have its timer
expire at a different point within the 1/HZ tick) and b.)
over time, as the PIT and the APIC clocks drift. This may be
acceptable with 1/HZ timer resolution (though I don't
really think it is), but it is in no way acceptable WRT high
resolution timers.
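
To put rough numbers on a.) - illustrative arithmetic only, based on
the slice computation in the interleaving code that Ingo's patch below
removes, assuming HZ=100 on a kernel built with NR_CPUS=4:

	unsigned int slice, offset;

	/* what setup_APIC_timer() does today, in effect; clocks is
	 * the APIC count for one 1/HZ tick: */
	slice  = clocks / (NR_CPUS + 1);	   /* 10ms tick / 5 = 2ms */
	offset = slice * (smp_processor_id() + 1); /* cpu0..3: 2,4,6,8 ms */

	/* so run_timer_list can lag the PIT tick by up to 8ms depending
	 * on which cpu a timer lives on, before any drift on top. */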

The solution I would suggest is to discipline the APIC
clocks. They _should_ be set up to interrupt as soon after
a PIT interrupt as possible, and they should all do so at the
same time if we are to avoid timer glitches (not time
glitches; actual time keeping is not in question here) when
moving from one cpu to another. Further, checks for drift
need to be in place to "pull" the APIC timer back into sync
when it drifts.

I had similar problems in the high-res-timers patch keeping
the PIT synched with the TSC or the pm timer. It is doable.
>
~snip
>
> Ingo

--
George Anzinger [email protected]
High-res-timers:
http://sourceforge.net/projects/high-res-timers/
Preemption patch:
http://www.kernel.org/pub/linux/kernel/people/rml

2002-10-01 03:36:25

by Ingo Molnar

[permalink] [raw]
Subject: Re: [patch] smptimers, old BH removal, tq-cleanup, 2.5.39


On Mon, 30 Sep 2002, george anzinger wrote:

> As the APIC timers are currently set up, they are undisciplined WRT the
> PIT, which is still used to drive the clock. This means that, since this
> patch drives the "run_timer_list" code from the APIC timers, the actual
> delay in timer servicing relative to the requested time will vary with
> a.) the cpu (since each cpu is set up to have its timer expire at a
> different point within the 1/HZ tick) and b.) over time, as the PIT and
> the APIC clocks drift. This may be acceptable with 1/HZ timer resolution
> (though I don't really think it is), but it is in no way acceptable WRT
> high resolution timers.

the smp_processor_id()/(HZ*num_cpus) 'interleaving' of every APIC clock
was done to address an SMP scalability issue, as part of the smptimers
patch. It just got into the kernel much earlier.

but these days, with the removal of BHs, it might be less of a factor,
mainly because timers have no global synchronization anymore, so we can
again try to not interleave the APIC clocks. Only testing will tell,
because there might still be some interaction between timer-driven code
paths.

Dipankar, wli, would it be possible to try the attached simple patch with
some of the more complex networking loads? The patch gets rid of the APIC
timer interleaving.

Ingo

--- arch/i386/kernel/apic.c.orig Tue Oct 1 05:47:34 2002
+++ arch/i386/kernel/apic.c Tue Oct 1 05:49:23 2002
@@ -813,24 +813,9 @@

static void setup_APIC_timer(unsigned int clocks)
{
- unsigned int slice, t0, t1;
unsigned long flags;
- int delta;

- local_save_flags(flags);
- local_irq_enable();
- /*
- * ok, Intel has some smart code in their APIC that knows
- * if a CPU was in 'hlt' lowpower mode, and this increases
- * its APIC arbitration priority. To avoid the external timer
- * IRQ APIC event being in synchron with the APIC clock we
- * introduce an interrupt skew to spread out timer events.
- *
- * The number of slices within a 'big' timeslice is NR_CPUS+1
- */
-
- slice = clocks / (NR_CPUS+1);
- printk("cpu: %d, clocks: %d, slice: %d\n", smp_processor_id(), clocks, slice);
+ local_irq_save(flags);

/*
* Wait for IRQ0's slice:
@@ -839,22 +824,6 @@

__setup_APIC_LVTT(clocks);

- t0 = apic_read(APIC_TMICT)*APIC_DIVISOR;
- /* Wait till TMCCT gets reloaded from TMICT... */
- do {
- t1 = apic_read(APIC_TMCCT)*APIC_DIVISOR;
- delta = (int)(t0 - t1 - slice*(smp_processor_id()+1));
- } while (delta >= 0);
- /* Now wait for our slice for real. */
- do {
- t1 = apic_read(APIC_TMCCT)*APIC_DIVISOR;
- delta = (int)(t0 - t1 - slice*(smp_processor_id()+1));
- } while (delta < 0);
-
- __setup_APIC_LVTT(clocks);
-
- printk("CPU%d<T0:%d,T1:%d,D:%d,S:%d,C:%d>\n", smp_processor_id(), t0, t1, delta, slice, clocks);
-
local_irq_restore(flags);
}


2002-10-01 04:20:20

by David Miller

[permalink] [raw]
Subject: Re: [patch] smptimers, old BH removal, tq-cleanup, 2.5.39

From: Ingo Molnar <[email protected]>
Date: Tue, 1 Oct 2002 05:51:45 +0200 (CEST)

- /*
- * ok, Intel has some smart code in their APIC that knows
- * if a CPU was in 'hlt' lowpower mode, and this increases
- * its APIC arbitration priority. To avoid the external timer
- * IRQ APIC event being in synchron with the APIC clock we
- * introduce an interrupt skew to spread out timer events.
- *
- * The number of slices within a 'big' timeslice is NR_CPUS+1
- */
-

I did some thinking, and I don't understand how this old code
can be legal. Doesn't this make do_gettimeofday() inaccurate?

I must be missing something.

2002-10-01 05:10:02

by Ingo Molnar

[permalink] [raw]
Subject: Re: [patch] smptimers, old BH removal, tq-cleanup, 2.5.39


On Mon, 30 Sep 2002, David S. Miller wrote:

> I did some thinking, and I don't understand how this old code can be
> legal. Doesn't this make do_gettimeofday() inaccurate?

it's a mostly bogus comment, don't think about it too much.

Ingo

2002-10-01 05:22:11

by Dipankar Sarma

[permalink] [raw]
Subject: Re: [patch] smptimers, old BH removal, tq-cleanup, 2.5.39

On Tue, Oct 01, 2002 at 05:51:45AM +0200, Ingo Molnar wrote:
> the smp_processor_id()/(HZ*num_cpus) 'interleaving' of every APIC clock
> was done to address an SMP scalability issue, as part of the smptimers
> patch. It just got into the kernel much earlier.
>
> but these days, with the removal of BHs, it might be less of a factor,
> mainly because timers have no global synchronization anymore, so we can
> again try to not interleave the APIC clocks. Only testing will tell,
> because there might still be some interaction between timer-driven code
> paths.

Yes, with earlier versions of smptimers, where global_bh_lock was
still being acquired to serialize with BHs, local timer clocks needed
to be spaced out over a 1/HZ tick to reduce contention. Archs that
didn't space the clocks performed poorly with smptimers, as Anton
found out with ppc64, which had to change.

>
> Dipankar, wli, would it be possible to try the attached simple patch with
> some of the more complex networking loads? The patch gets rid of the APIC
> timer interleaving.

I will give it a spin.

Thanks
--
Dipankar Sarma <[email protected]> http://lse.sourceforge.net
Linux Technology Center, IBM Software Lab, Bangalore, India.

2002-10-01 07:55:38

by George Anzinger

[permalink] [raw]
Subject: Re: [patch] smptimers, old BH removal, tq-cleanup, 2.5.39

Ingo Molnar wrote:
>
> On Mon, 30 Sep 2002, David S. Miller wrote:
>
> > I did some thinking, and I don't understand how this old code can be
> > legal. Doesn't this make do_gettimeofday() inaccurate?
>
> it's a mostly bogus comment, don't think about it too much.
>
> Ingo
Actually gettimeofday is fine. It does not depend on the
timer interrupt, but only on one happening every once in a
while. It makes up for any latency by using the TSC and the
time of the last interrupt, and is in no way dependent on any
latency in the run_timer_list.
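
Roughly what the i386 do_gettimeofday() path does (a simplified sketch
from memory, not the verbatim source; do_gettimeoffset() is the arch
helper that scales TSC cycles since the last tick into microseconds):

void do_gettimeofday(struct timeval *tv)
{
	unsigned long flags, lost, sec, usec;

	read_lock_irqsave(&xtime_lock, flags);
	usec = do_gettimeoffset();	/* TSC-based offset, in usecs */
	lost = jiffies - wall_jiffies;	/* ticks not yet folded in */
	if (lost)
		usec += lost * (1000000 / HZ);
	sec = xtime.tv_sec;
	usec += xtime.tv_usec;	/* tv_nsec/1000 where xtime is a timespec */
	read_unlock_irqrestore(&xtime_lock, flags);

	while (usec >= 1000000) {	/* normalize */
		usec -= 1000000;
		sec++;
	}
	tv->tv_sec = sec;
	tv->tv_usec = usec;
}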

By the way, I have lately been impressed with the relative
amount of time a cli/sti takes, and have wondered if we might
not get a "nice" improvement in system performance by moving
the xtime read/write lock from an irq lock to a bh lock. This
would avoid the cli/sti in the read lock, which is needed to
read the time. All this should take (aside from finding all
the locks) is to move the write access of xtime into the bh
code. Since it does nothing if timer_jiffies == jiffies, it
does not hurt to call it from each cpu. The timer interrupt
would then bump a shadow jiffie which would not appear in
jiffies until the bh code runs.
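
On the read side the change would look something like this (a sketch
only; the hard part, as noted, is finding every xtime_lock site and
moving the write into the bh code):

	unsigned long flags;

	/* today: readers must disable interrupts, because the timer
	 * interrupt takes xtime_lock as a writer: */
	read_lock_irqsave(&xtime_lock, flags);
	/* ... read xtime / compute the gettimeofday offset ... */
	read_unlock_irqrestore(&xtime_lock, flags);

	/* the suggestion: with the writer moved into bh context,
	 * readers only need to keep bottom halves off this cpu: */
	read_lock_bh(&xtime_lock);
	/* ... read xtime ... */
	read_unlock_bh(&xtime_lock);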

On finding the locks, I suggest abstracting them to a macro,
thus allowing the change to be made in only one place. Of
course, we need to change the name of the lock to enlist the
compiler's help in finding them. But then, if we are sure
this is the way to go, there is no need for the macro.
--
George Anzinger [email protected]
High-res-timers:
http://sourceforge.net/projects/high-res-timers/
Preemption patch:
http://www.kernel.org/pub/linux/kernel/people/rml

2002-10-03 06:59:10

by Dipankar Sarma

[permalink] [raw]
Subject: Re: [patch] smptimers, old BH removal, tq-cleanup, 2.5.39

On Tue, Oct 01, 2002 at 05:51:45AM +0200, Ingo Molnar wrote:
>
> the smp_processor_id()/(HZ*num_cpus) 'interleaving' of every APIC clock
> was done to address an SMP scalability issue, as part of the smptimers
> patch. It just got into the kernel much earlier.
>
> but these days, with the removal of BHs, it might be less of a factor,
> mainly because timers have no global synchronization anymore, so we can
> again try to not interleave the APIC clocks. Only testing will tell,
> because there might still be some interaction between timer-driven code
> paths.
>
> Dipankar, wli, would it be possible to try the attached simple patch with
> some of the more complex networking loads? The patch gets rid of the APIC
> timer interleaving.
>

Ingo,

Removing the interleaving of the APIC timers doesn't seem to have any
adverse effect. Here are some numbers from a 16-CPU NUMA-Q with tbench
(32 clients), averaged over 5 runs -

2.5.40-vanilla - 44.16 MB/Sec
2.5.40-no-clock-interleave - 44.17 MB/Sec

Thanks
--
Dipankar Sarma <[email protected]> http://lse.sourceforge.net
Linux Technology Center, IBM Software Lab, Bangalore, India.