2004-10-09 05:50:21

by Sven-Thorsten Dietrich

Subject: [ANNOUNCE] Linux 2.6 Real Time Kernel


Announcing the availability of prototype real-time (RT)
enhancements to the Linux 2.6 kernel.

We will submit 3 additional emails following this one, containing
the remaining 3 patches (of 4) inline, with their descriptions.

Download:

Patches against the Linux-2.6.9-rc3 kernel are available at:

ftp://source.mvista.com/pub/realtime/Linux-2.6.9-rc3-RT_irqthreads.patch
ftp://source.mvista.com/pub/realtime/Linux-2.6.9-rc3-RT_mutex.patch
ftp://source.mvista.com/pub/realtime/Linux-2.6.9-rc3-RT_spinlock1.patch
ftp://source.mvista.com/pub/realtime/Linux-2.6.9-rc3-RT_spinlock2.patch

The patches are to be applied to the linux-2.6.9-rc3 kernel in the
order listed above.

Subsequent announcements will include the links to the ftp site only,
to reduce email bulk on the Linux kernel mailing list.


Introduction:

The purpose of this effort is to further reduce interrupt latency
and to dramatically reduce task preemption latency in the 2.6 kernel
series. Our broad objective is to achieve preemption latency bounded
by the worst case IRQ disable.

We are in the process of porting to the 2.6.9-rc3-mm kernel
series, and would like to present our work at this stage to
request general feedback and to interact with others working
on similar kernel enhancements.

These RT enhancements are an integration of features developed by
others and some new MontaVista components:

- Voluntary Preemption by Ingo Molnar
- IRQ thread patches by Scott Wood and Ingo Molnar
- BKL mutex patch by Ingo Molnar (with MV extensions)
- PMutex from Germany's Universitaet der Bundeswehr, Munich
- MontaVista mutex abstraction layer replacing spinlocks with mutexes

WHY IMPLEMENT PRELIMINARY RT SUPPORT IN LINUX:

Our objective is to make the Linux 2.6 kernel usable for
high-performance multimedia applications and for applications
requiring very fast, reliable task-level control functions.

The AV industry is building HDTV related technology on Linux,
and desktop systems are increasingly used for similar applications.

Cell phones, PDAs and MP3 players are converging into highly
integrated devices requiring a large number of threads. These
threads support a vast array of communications protocols
(IP, Bluetooth, 802.11, GSM, CDMA, etc.). The cellular-based
protocols in particular require highly deadline-sensitive
operations to work reliably.

GPS processing, for example, requires hard real-time tasks and
guaranteed KHz frequency interrupt processing. Linux-based remote
controlled GPS stations at inaccessible or dangerous sites,
like the inside of Mt. St. Helens, stream live data via IP.

Additionally, Linux is being increasingly utilized in traditional
real-time control environments including radar processing, factory
automation systems, "in the loop" process control systems, medical and
instrumentation systems, and automotive control systems. These
systems often have task-level response requirements in the tens
to hundreds of microseconds, a level of guaranteed task response
not achievable with current 2.6 Linux technology.


Other precedent work:

There are several micro-kernel solutions available, which achieve
the required performance, but there are two general concerns with
such solutions:

1. Two separate kernel environments, creating more overall
system complexity and application design complexity.
2. Legal controversy.

In line with the previously mentioned kernel enhancements,
our work is designed to be transparent to existing applications
and drivers.



Implementation Details:

We have replaced the definition of kernel spinlocks with
a mutex abstraction that uses the P-mutex from the Bundeswehr
University in Munich, Germany:

http://inf3-www.informatik.unibw-muenchen.de/research/linux/mutex/

The spinlock definitions have been abstracted to invoke
a crude but effective #define-based substitution of spin_lock
to mutex_lock functions (in linux/kmutex.h).
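
To illustrate, the substitution amounts to something along these
lines (a minimal sketch only; the actual macro and function names
in linux/kmutex.h may differ):

    /* Sketch of the #define-based substitution; the kmutex_*
     * names are illustrative stand-ins for the kmutex.h API. */
    #define spin_lock_init(l)   kmutex_init(l)
    #define spin_lock(l)        kmutex_lock(l)
    #define spin_trylock(l)     kmutex_trylock(l)
    #define spin_unlock(l)      kmutex_unlock(l)

    /* Once a section may sleep, the irqsave variants can simply
     * take the mutex and ignore the flags argument. */
    #define spin_lock_irqsave(l, flags) \
            do { (void)(flags); kmutex_lock(l); } while (0)
    #define spin_unlock_irqrestore(l, flags) \
            do { (void)(flags); kmutex_unlock(l); } while (0)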

We have abstracted the mutex layer to allow configuration
and selection of the mutex implementation. We have used a
simple mutex implementation, but intend to support the use of
other mutexes, for example the existing system semaphore,
or third-party plugins such as the FUSYN project.
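
The back-end selection could be expressed at the abstraction
layer roughly as follows (illustrative only; the config symbols
and function names here are assumptions, not necessarily those
used in the patch):

    /* Hypothetical back-end selection in the mutex layer. */
    #if defined(CONFIG_KMUTEX_PMUTEX)
    # define kmutex_lock(m)     pmutex_lock(m)   /* P-mutex */
    # define kmutex_unlock(m)   pmutex_unlock(m)
    #elif defined(CONFIG_KMUTEX_SEMAPHORE)
    # define kmutex_lock(m)     down(m)          /* system semaphore */
    # define kmutex_unlock(m)   up(m)
    #else
    # define kmutex_lock(m)     fusyn_lock(m)    /* e.g. a FUSYN plug-in */
    # define kmutex_unlock(m)   fusyn_unlock(m)
    #endif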


Partitioning the Critical Sections:

A partitioning between critical sections protected by spinlocks
and critical sections protected by mutexes has been established.

There are currently some overlaps (or holes) in the partitioning.
It is possible for a task holding a spinlock to block
on a mutex, causing a deadlock. These deadlocks are resolved for
interactive tasks on UP by grace of the interactive scheduler.

We are eliminating this nesting of mutex-protected sections
inside spinlock-protected critical sections.
Only a minimal set (in the teens) of spinlocks will remain.
This set will be composed of the spinlocks necessary to protect
immediate hardware, as well as minimal critical sections that
would not benefit from mutex-based preemptibility.
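
In contrived form, the nesting we are removing looks like this
(hypothetical locks, for illustration only):

    static spinlock_t hw_lock;     /* remains a true spinlock */
    static kmutex_t   list_mutex;  /* former spinlock, now a sleeping mutex */

    void broken_path(void)
    {
            spin_lock(&hw_lock);       /* non-preemptible section begins */
            kmutex_lock(&list_mutex);  /* BUG: may sleep while holding
                                        * hw_lock, the deadlock above */
            /* ... critical work ... */
            kmutex_unlock(&list_mutex);
            spin_unlock(&hw_lock);
    }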

Our broad objective is to achieve preemption latency bounded by the
worst case IRQ disable. Total response latency (i.e., the time to
initiate/complete an arbitrary system call) would still be bounded
by the worst case spinlock protected critical region.


Testing:

This experimental code requires further enhancement
and is very much a work in progress.

The kernel is fairly stable, though it still fails under high
load and in low-memory conditions.

The kernel has not been extensively tested on SMP systems.

We are reluctant to publish any performance numbers until
we have completed the mutex-spinlock partitioning and
provided support for RW locks.

At that point, we expect the worst case preemption latencies
to be in the hundreds of microseconds on a typical workstation.

We acknowledge performance degradation due to the mutex
debug code and the abstraction layer.
We expect to be able to improve throughput as the code matures
and the RT kernel becomes more refined.


Documentation:

Please find additional documentation in the
Documentation/rttReleaseNotes file.

Please see this document for a complete list of
known problems and latest status.



Credits and Thanks:

We wish to acknowledge the precedent work that has
allowed us to build this framework, as cited above.

We would also like to thank Dirk Grambow, Arnd Heursch,
and Witold Jaworski of the Universitaet der Bundeswehr,
Muenchen, Germany.

We are providing this kernel patch as a waypoint on the course
towards configurable responsiveness in the 2.6 Linux kernel.

Thank you

Sven-Thorsten Dietrich



Attached below, please find the first of 4 patches.


RT Prototype 2004 (C) MontaVista Software, Inc.
This file is licensed under the terms of the GNU
General Public License version 2. This program
is licensed "as is" without any warranty of any kind,
whether express or implied.


Linux-2.6.9-rc3-RT_irqthreads.patch
===================================
This patch is a hybrid of several IRQ thread implementations,
as cited above.
We have made some modifications to adapt wake-up handling to
the scenario where an IRQ thread could be blocked on a mutex
at the transition of an interrupt.

We expect to revise this IRQ thread code after moving to
the mm kernel series, and while incorporating the voluntary
preemption code.

This patch adds options to the 'General setup' section of
the kernel configuration. Running IRQs in threads is a
prerequisite for the subsequent patches. We have provided
defaults for running softirqs in threads, and have selected
Ingo Molnar's IRQ thread implementation as the default.
A request_irq() usage sketch follows the option descriptions
below.

CONFIG_SOFTIRQ_THREADS

- required for the RT kernel. Runs all softirqs in softirqd.

CONFIG_INGO_IRQ_THREADS

- enables Ingo Molnar's version of IRQ threads. This is not
in sync with the latest releases of the voluntary preemption
series.

CONFIG_IRQ_THREADS

- the version of IRQ threads posted to LKML by Scott Wood.
This appears to have been superseded by Ingo Molnar's changes.
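
For reference, a driver that needs its handler to stay in hard
interrupt context would request its IRQ roughly as follows
(sketch; the IRQ number, handler and device name are
placeholders):

    /* Keep this handler out of the IRQ thread.  Under
     * CONFIG_INGO_IRQ_THREADS the patch maps SA_NOTHREAD onto
     * SA_NODELAY, so the same flag covers both implementations. */
    ret = request_irq(irq, my_handler, SA_INTERRUPT | SA_NOTHREAD,
                      "mydev", dev_id);
    if (ret)
            printk(KERN_ERR "mydev: could not claim IRQ %d\n", irq);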


In addition, this patch includes a port of Ingo Molnar's
proposed substitution of the BKL with the kernel semaphore.

Sign-off: Sven-Thorsten Dietrich ([email protected])


diff -pruN a/arch/i386/Kconfig b/arch/i386/Kconfig
--- a/arch/i386/Kconfig 2004-10-09 03:50:45.000000000 +0400
+++ b/arch/i386/Kconfig 2004-10-09 04:01:36.000000000 +0400
@@ -497,6 +497,7 @@ config SCHED_SMT

config PREEMPT
bool "Preemptible Kernel"
+ default y
help
This option reduces the latency of the kernel when reacting to
real-time or interactive events by allowing a low priority process to
diff -pruN a/arch/i386/kernel/i386_ksyms.c b/arch/i386/kernel/i386_ksyms.c
--- a/arch/i386/kernel/i386_ksyms.c 2004-10-09 03:50:45.000000000 +0400
+++ b/arch/i386/kernel/i386_ksyms.c 2004-10-09 04:01:36.000000000 +0400
@@ -76,9 +76,11 @@ EXPORT_SYMBOL_GPL(kernel_fpu_begin);
EXPORT_SYMBOL(__ioremap);
EXPORT_SYMBOL(ioremap_nocache);
EXPORT_SYMBOL(iounmap);
+#ifndef CONFIG_INGO_IRQ_THREADS
EXPORT_SYMBOL(enable_irq);
EXPORT_SYMBOL(disable_irq);
EXPORT_SYMBOL(disable_irq_nosync);
+#endif
EXPORT_SYMBOL(probe_irq_mask);
EXPORT_SYMBOL(kernel_thread);
EXPORT_SYMBOL(pm_idle);
@@ -138,6 +140,10 @@ EXPORT_SYMBOL(smp_num_siblings);
EXPORT_SYMBOL(cpu_sibling_map);
#endif

+#if defined(CONFIG_IRQ_THREADS) && !defined(CONFIG_SMP) && !defined(CONFIG_INGO_IRQ_THREADS)
+EXPORT_SYMBOL(synchronize_irq);
+#endif
+
#ifdef CONFIG_SMP
EXPORT_SYMBOL(cpu_data);
EXPORT_SYMBOL(cpu_online_map);
@@ -145,9 +151,9 @@ EXPORT_SYMBOL(cpu_callout_map);
EXPORT_SYMBOL(__write_lock_failed);
EXPORT_SYMBOL(__read_lock_failed);

-/* Global SMP stuff */
-EXPORT_SYMBOL(synchronize_irq);
+#ifndef CONFIG_INGO_IRQ_THREADS
EXPORT_SYMBOL(smp_call_function);
+#endif

/* TLB flushing */
EXPORT_SYMBOL(flush_tlb_page);
diff -pruN a/arch/i386/kernel/i8259.c b/arch/i386/kernel/i8259.c
--- a/arch/i386/kernel/i8259.c 2004-10-09 03:50:45.000000000 +0400
+++ b/arch/i386/kernel/i8259.c 2004-10-09 04:01:36.000000000 +0400
@@ -358,7 +358,14 @@ static irqreturn_t math_error_irq(int cp
* New motherboards sometimes make IRQ 13 be a PCI interrupt,
* so allow interrupt sharing.
*/
-static struct irqaction fpu_irq = { math_error_irq, 0, CPU_MASK_NONE, "fpu", NULL, NULL };
+#ifndef CONFIG_INGO_IRQ_THREADS
+static struct irqaction fpu_irq =
+ { math_error_irq, SA_NOTHREAD, CPU_MASK_NONE, "fpu", NULL, NULL };
+#else
+static struct irqaction fpu_irq =
+ { math_error_irq, SA_NODELAY, CPU_MASK_NONE, "fpu", NULL, NULL };
+#endif
+

void __init init_ISA_irqs (void)
{
diff -pruN a/arch/i386/kernel/irq.c b/arch/i386/kernel/irq.c
--- a/arch/i386/kernel/irq.c 2004-10-09 03:50:45.000000000 +0400
+++ b/arch/i386/kernel/irq.c 2004-10-09 04:01:36.000000000 +0400
@@ -45,6 +45,8 @@
#include <asm/desc.h>
#include <asm/irq.h>

+static DECLARE_MUTEX(probe_sem);
+
/*
* Linux has a controller-independent x86 interrupt architecture.
* every controller has a 'controller-template', that is used
@@ -71,7 +73,9 @@ irq_desc_t irq_desc[NR_IRQS] __cacheline
}
};

+#ifndef CONFIG_INGO_IRQ_THREADS
static void register_irq_proc (unsigned int irq);
+#endif

/*
* per-CPU IRQ handling stacks
@@ -198,9 +202,9 @@ skip:
return 0;
}

+#ifndef CONFIG_INGO_IRQ_THREADS

-
-
+#ifndef CONFIG_IRQ_THREADS
#ifdef CONFIG_SMP
inline void synchronize_irq(unsigned int irq)
{
@@ -208,6 +212,7 @@ inline void synchronize_irq(unsigned int
cpu_relax();
}
#endif
+#endif /* CONFIG_IRQ_THREADS */

/*
* This should really return information about whether
@@ -226,10 +231,16 @@ asmlinkage int handle_IRQ_event(unsigned
local_irq_enable();

do {
- ret = action->handler(irq, action->dev_id, regs);
- if (ret == IRQ_HANDLED)
- status |= action->flags;
- retval |= ret;
+#ifdef CONFIG_IRQ_THREADS
+ if (action->flags & SA_NOTHREAD)
+#endif
+ {
+ ret = action->handler(irq, action->dev_id, regs);
+ if (ret == IRQ_HANDLED)
+ status |= action->flags;
+ retval |= ret;
+
+ }
action = action->next;
} while (action);
if (status & SA_SAMPLE_RANDOM)
@@ -291,12 +302,10 @@ __setup("noirqdebug", noirqdebug_setup);
*
* Called under desc->lock
*/
-static void note_interrupt(int irq, irq_desc_t *desc, irqreturn_t action_ret)
+static void note_interrupt(int irq, irq_desc_t *desc)
{
- if (action_ret != IRQ_HANDLED) {
+ if (desc->status & IRQ_HANDLED) {
desc->irqs_unhandled++;
- if (action_ret != IRQ_NONE)
- report_bad_irq(irq, desc, action_ret);
}

desc->irq_count++;
@@ -308,7 +317,7 @@ static void note_interrupt(int irq, irq_
/*
* The interrupt is stuck
*/
- __report_bad_irq(irq, desc, action_ret);
+ __report_bad_irq(irq, desc, IRQ_NONE);
/*
* Now kill the IRQ
*/
@@ -340,13 +349,13 @@ static void note_interrupt(int irq, irq_

inline void disable_irq_nosync(unsigned int irq)
{
- irq_desc_t *desc = irq_desc + irq;
+ irq_desc_t *desc = irq_descp(irq);
unsigned long flags;

spin_lock_irqsave(&desc->lock, flags);
if (!desc->depth++) {
desc->status |= IRQ_DISABLED;
- desc->handler->disable(irq);
+ SHUTDOWN_IRQ(irq);
}
spin_unlock_irqrestore(&desc->lock, flags);
}
@@ -366,7 +375,7 @@ inline void disable_irq_nosync(unsigned

void disable_irq(unsigned int irq)
{
- irq_desc_t *desc = irq_desc + irq;
+ irq_desc_t *desc = irq_descp(irq);
disable_irq_nosync(irq);
if (desc->action)
synchronize_irq(irq);
@@ -385,7 +394,7 @@ void disable_irq(unsigned int irq)

void enable_irq(unsigned int irq)
{
- irq_desc_t *desc = irq_desc + irq;
+ irq_desc_t *desc = irq_descp(irq);
unsigned long flags;

spin_lock_irqsave(&desc->lock, flags);
@@ -397,7 +406,15 @@ void enable_irq(unsigned int irq)
desc->status = status | IRQ_REPLAY;
hw_resend_irq(desc->handler,irq);
}
- desc->handler->enable(irq);
+
+ /* Don't unmask the IRQ if it's in progress, or else you
+ could re-enter the IRQ handler. As it is now enabled,
+ the IRQ will be enabled when the handler is finished. */
+
+ if (!(desc->status & (IRQ_INPROGRESS | IRQ_THREADRUNNING |
+ IRQ_THREADPENDING)))
+ STARTUP_IRQ(irq);
+
/* fall-through */
}
default:
@@ -410,6 +427,8 @@ void enable_irq(unsigned int irq)
spin_unlock_irqrestore(&desc->lock, flags);
}

+#endif
+
/*
* do_IRQ handles all normal device IRQ's (the special
* SMP cross-CPU interrupts have their own specific
@@ -428,7 +447,7 @@ asmlinkage unsigned int do_IRQ(struct pt
* handled by some other CPU. (or is disabled)
*/
int irq = regs.orig_eax & 0xff; /* high bits used in ret_from_ code */
- irq_desc_t *desc = irq_desc + irq;
+ irq_desc_t *desc = irq_descp(irq);
struct irqaction * action;
unsigned int status;

@@ -456,14 +475,17 @@ asmlinkage unsigned int do_IRQ(struct pt
WAITING is used by probe to mark irqs that are being tested
*/
status = desc->status & ~(IRQ_REPLAY | IRQ_WAITING);
- status |= IRQ_PENDING; /* we _want_ to handle it */
+ status |= IRQ_PENDING | /* we _want_ to handle it */
+ IRQ_UNHANDLED; /* This will be cleared after a
+ handler that cares. */

/*
* If the IRQ is disabled for whatever reason, we cannot
* use the action we have.
*/
action = NULL;
- if (likely(!(status & (IRQ_DISABLED | IRQ_INPROGRESS)))) {
+ if (likely(!(status & (IRQ_DISABLED | IRQ_INPROGRESS |
+ IRQ_THREADPENDING | IRQ_THREADRUNNING)))) {
action = desc->action;
status &= ~IRQ_PENDING; /* we commit to handling */
status |= IRQ_INPROGRESS; /* we are handling it */
@@ -479,6 +501,14 @@ asmlinkage unsigned int do_IRQ(struct pt
if (unlikely(!action))
goto out;

+#ifdef CONFIG_INGO_IRQ_THREADS
+ /*
+ * hardirq redirection to the irqd process context:
+ */
+ if (generic_redirect_hardirq(desc))
+ goto out_no_end;
+#endif
+
/*
* Edge triggered interrupts need to remember
* pending events.
@@ -500,8 +530,16 @@ asmlinkage unsigned int do_IRQ(struct pt
curctx = (union irq_ctx *) current_thread_info();
irqctx = hardirq_ctx[smp_processor_id()];

- spin_unlock(&desc->lock);
-
+#ifdef CONFIG_IRQ_THREADS
+ if (desc->thread) {
+ desc->status |= IRQ_THREADPENDING;
+ wake_up_process(desc->thread);
+ }
+
+ if (!desc->thread || (desc->status & IRQ_NOTHREAD))
+#endif
+ {
+ spin_unlock(&desc->lock);
/*
* this is where we switch to the IRQ stack. However, if we are already using
* the IRQ stack (because we interrupted a hardirq handler) we can't do that
@@ -509,51 +547,80 @@ asmlinkage unsigned int do_IRQ(struct pt
* after all)
*/

- if (curctx == irqctx)
- action_ret = handle_IRQ_event(irq, &regs, action);
- else {
- /* build the stack frame on the IRQ stack */
- isp = (u32*) ((char*)irqctx + sizeof(*irqctx));
- irqctx->tinfo.task = curctx->tinfo.task;
- irqctx->tinfo.previous_esp = current_stack_pointer();
-
- *--isp = (u32) action;
- *--isp = (u32) &regs;
- *--isp = (u32) irq;
-
- asm volatile(
- " xchgl %%ebx,%%esp \n"
- " call handle_IRQ_event \n"
- " xchgl %%ebx,%%esp \n"
- : "=a"(action_ret)
- : "b"(isp)
- : "memory", "cc", "edx", "ecx"
- );
-
+ if (curctx == irqctx)
+ action_ret = handle_IRQ_event(irq, &regs, action);
+ else {
+ /* build the stack frame on the IRQ stack */
+ isp = (u32*) ((char*)irqctx + sizeof(*irqctx));
+ irqctx->tinfo.task = curctx->tinfo.task;
+ irqctx->tinfo.previous_esp = current_stack_pointer();
+
+ *--isp = (u32) action;
+ *--isp = (u32) &regs;
+ *--isp = (u32) irq;
+
+ asm volatile(
+ " xchgl %%ebx,%%esp \n"
+#ifdef CONFIG_INGO_IRQ_THREADS
+ " call generic_handle_IRQ_event \n"
+#else
+ " call handle_IRQ_event \n"
+#endif
+ " xchgl %%ebx,%%esp \n"
+ : "=a"(action_ret)
+ : "b"(isp)
+ : "memory", "cc", "edx", "ecx"
+ );
+ }
+ spin_lock(&desc->lock);
+ if (!noirqdebug)
+#ifdef CONFIG_INGO_IRQ_THREADS
+ generic_note_interrupt(irq, desc, action_ret);
+#else
+ note_interrupt(irq, desc, action_ret);
+#endif

+ if (curctx != irqctx)
+ irqctx->tinfo.task = NULL;
+ if (likely(!(desc->status & IRQ_PENDING)))
+ break;
+ desc->status &= ~IRQ_PENDING;
}
- spin_lock(&desc->lock);
- if (!noirqdebug)
- note_interrupt(irq, desc, action_ret);
- if (curctx != irqctx)
- irqctx->tinfo.task = NULL;
- if (likely(!(desc->status & IRQ_PENDING)))
- break;
- desc->status &= ~IRQ_PENDING;
- }

#else

for (;;) {
irqreturn_t action_ret;

- spin_unlock(&desc->lock);
-
- action_ret = handle_IRQ_event(irq, &regs, action);
+# ifdef CONFIG_IRQ_THREADS
+ if (desc->thread) {
+ desc->status |= IRQ_THREADPENDING;
+ wake_up_process(desc->thread);
+ }
+
+ if (!desc->thread || (desc->status & IRQ_NOTHREAD))
+# endif
+ {
+ spin_unlock(&desc->lock);
+#ifdef CONFIG_INGO_IRQ_THREADS
+ action_ret = generic_handle_IRQ_event(irq, &regs, action);
+#else
+ action_ret = handle_IRQ_event(irq, &regs, action);
+#endif
+ spin_lock(&desc->lock);
+ if (!noirqdebug)
+#ifdef CONFIG_INGO_IRQ_THREADS
+ generic_note_interrupt(irq, desc, action_ret);
+#else
+ {
+ if (action_ret == IRQ_HANDLED)
+ desc->status &= ~IRQ_UNHANDLED;
+ else if (action_ret != IRQ_NONE)
+ report_bad_irq(irq, desc, action_ret);
+ }
+#endif
+ }

- spin_lock(&desc->lock);
- if (!noirqdebug)
- note_interrupt(irq, desc, action_ret);
if (likely(!(desc->status & IRQ_PENDING)))
break;
desc->status &= ~IRQ_PENDING;
@@ -566,11 +633,20 @@ out:
* The ->end() handler has to deal with interrupts which got
* disabled while the handler was running.
*/
- desc->handler->end(irq);
+ if (!(desc->status & (IRQ_DISABLED | IRQ_INPROGRESS |
+ IRQ_THREADPENDING | IRQ_THREADRUNNING))) {
+#ifndef CONFIG_INGO_IRQ_THREADS
+ if (!noirqdebug)
+ note_interrupt(irq, desc);
+#endif
+
+
+ desc->handler->end(irq);
+ }
+out_no_end:
spin_unlock(&desc->lock);

irq_exit();
-
return 1;
}

@@ -659,7 +735,12 @@ int request_irq(unsigned int irq,
action->next = NULL;
action->dev_id = dev_id;

- retval = setup_irq(irq, action);
+#ifdef CONFIG_INGO_IRQ_THREADS
+ retval = generic_setup_irq(irq, action);
+#else
+ retval = setup_irq(irq, action);
+#endif
+
if (retval)
kfree(action);
return retval;
@@ -667,6 +748,8 @@ int request_irq(unsigned int irq,

EXPORT_SYMBOL(request_irq);

+
+#ifndef CONFIG_INGO_IRQ_THREADS
/**
* free_irq - free an interrupt
* @irq: Interrupt line to free
@@ -691,7 +774,7 @@ void free_irq(unsigned int irq, void *de
if (irq >= NR_IRQS)
return;

- desc = irq_desc + irq;
+ desc = irq_descp(irq);
spin_lock_irqsave(&desc->lock,flags);
p = &desc->action;
for (;;) {
@@ -706,7 +789,7 @@ void free_irq(unsigned int irq, void *de
*pp = action->next;
if (!desc->action) {
desc->status |= IRQ_DISABLED;
- desc->handler->shutdown(irq);
+ SHUTDOWN_IRQ(irq);
}
spin_unlock_irqrestore(&desc->lock,flags);

@@ -722,6 +805,7 @@ void free_irq(unsigned int irq, void *de
}

EXPORT_SYMBOL(free_irq);
+#endif

/*
* IRQ autodetection code..
@@ -732,7 +816,6 @@ EXPORT_SYMBOL(free_irq);
* disabled.
*/

-static DECLARE_MUTEX(probe_sem);

/**
* probe_irq_on - begin an interrupt autodetect
@@ -755,7 +838,7 @@ unsigned long probe_irq_on(void)
* flush such a longstanding irq before considering it as spurious.
*/
for (i = NR_IRQS-1; i > 0; i--) {
- desc = irq_desc + i;
+ desc = irq_descp(i);

spin_lock_irq(&desc->lock);
if (!irq_desc[i].action)
@@ -778,7 +861,7 @@ unsigned long probe_irq_on(void)
spin_lock_irq(&desc->lock);
if (!desc->action) {
desc->status |= IRQ_AUTODETECT | IRQ_WAITING;
- if (desc->handler->startup(i))
+ if (STARTUP_IRQ(i))
desc->status |= IRQ_PENDING;
}
spin_unlock_irq(&desc->lock);
@@ -795,7 +878,7 @@ unsigned long probe_irq_on(void)
*/
val = 0;
for (i = 0; i < NR_IRQS; i++) {
- irq_desc_t *desc = irq_desc + i;
+ irq_desc_t *desc = irq_descp(i);
unsigned int status;

spin_lock_irq(&desc->lock);
@@ -805,7 +888,7 @@ unsigned long probe_irq_on(void)
/* It triggered already - consider it spurious. */
if (!(status & IRQ_WAITING)) {
desc->status = status & ~IRQ_AUTODETECT;
- desc->handler->shutdown(i);
+ SHUTDOWN_IRQ(i);
} else
if (i < 32)
val |= 1 << i;
@@ -842,7 +925,7 @@ unsigned int probe_irq_mask(unsigned lon

mask = 0;
for (i = 0; i < NR_IRQS; i++) {
- irq_desc_t *desc = irq_desc + i;
+ irq_desc_t *desc = irq_descp(i);
unsigned int status;

spin_lock_irq(&desc->lock);
@@ -853,7 +936,7 @@ unsigned int probe_irq_mask(unsigned lon
mask |= 1 << i;

desc->status = status & ~IRQ_AUTODETECT;
- desc->handler->shutdown(i);
+ SHUTDOWN_IRQ(i);
}
spin_unlock_irq(&desc->lock);
}
@@ -892,7 +975,7 @@ int probe_irq_off(unsigned long val)
nr_irqs = 0;
irq_found = 0;
for (i = 0; i < NR_IRQS; i++) {
- irq_desc_t *desc = irq_desc + i;
+ irq_desc_t *desc = irq_descp(i);
unsigned int status;

spin_lock_irq(&desc->lock);
@@ -905,7 +988,7 @@ int probe_irq_off(unsigned long val)
nr_irqs++;
}
desc->status = status & ~IRQ_AUTODETECT;
- desc->handler->shutdown(i);
+ SHUTDOWN_IRQ(i);
}
spin_unlock_irq(&desc->lock);
}
@@ -918,13 +1001,15 @@ int probe_irq_off(unsigned long val)

EXPORT_SYMBOL(probe_irq_off);

+#ifndef CONFIG_INGO_IRQ_THREADS
+
/* this was setup_x86_irq but it seems pretty generic */
int setup_irq(unsigned int irq, struct irqaction * new)
{
int shared = 0;
unsigned long flags;
struct irqaction *old, **p;
- irq_desc_t *desc = irq_desc + irq;
+ irq_desc_t *desc = irq_descp(irq);

if (desc->handler == &no_irq_type)
return -ENOSYS;
@@ -945,6 +1030,8 @@ int setup_irq(unsigned int irq, struct i
rand_initialize_irq(irq);
}

+ setup_irq_spawn_thread(irq, new);
+
/*
* The following block of code has to be executed atomically
*/
@@ -970,7 +1057,7 @@ int setup_irq(unsigned int irq, struct i
if (!shared) {
desc->depth = 0;
desc->status &= ~(IRQ_DISABLED | IRQ_AUTODETECT | IRQ_WAITING | IRQ_INPROGRESS);
- desc->handler->startup(irq);
+ STARTUP_IRQ(irq);
}
spin_unlock_irqrestore(&desc->lock,flags);

@@ -1075,7 +1162,7 @@ void init_irq_proc (void)
for (i = 0; i < NR_IRQS; i++)
register_irq_proc(i);
}
-
+#endif /* CONFIG_INGO_IRQ_THREADS */

#ifdef CONFIG_4KSTACKS
/*
diff -pruN a/arch/i386/mach-default/setup.c b/arch/i386/mach-default/setup.c
--- a/arch/i386/mach-default/setup.c 2004-10-09 03:50:45.000000000 +0400
+++ b/arch/i386/mach-default/setup.c 2004-10-09 04:01:36.000000000 +0400
@@ -27,7 +27,12 @@ void __init pre_intr_init_hook(void)
/*
* IRQ2 is cascade interrupt to second interrupt controller
*/
-static struct irqaction irq2 = { no_action, 0, CPU_MASK_NONE, "cascade", NULL, NULL};
+#ifdef CONFIG_INGO_IRQ_THREADS
+static struct irqaction irq2 = { no_action, SA_NODELAY, CPU_MASK_NONE, "cascade", NULL, NULL};
+#else
+static struct irqaction irq2 =
+ { no_action, SA_NOTHREAD, CPU_MASK_NONE, "cascade", NULL, NULL };
+#endif

/**
* intr_init_hook - post gate setup interrupt initialisation
@@ -71,7 +76,13 @@ void __init trap_init_hook(void)
{
}

-static struct irqaction irq0 = { timer_interrupt, SA_INTERRUPT, CPU_MASK_NONE, "timer", NULL, NULL};
+#ifdef CONFIG_INGO_IRQ_THREADS
+static struct irqaction irq0 = { timer_interrupt, SA_INTERRUPT | SA_NODELAY,
+ CPU_MASK_NONE, "timer", NULL, NULL};
+#else
+static struct irqaction irq0 =
+ { timer_interrupt, SA_INTERRUPT | SA_NOTHREAD, CPU_MASK_NONE, "timer", NULL, NULL };
+#endif

/**
* time_init_hook - do any specific initialisations for the system timer.
diff -pruN a/arch/i386/mach-visws/setup.c b/arch/i386/mach-visws/setup.c
--- a/arch/i386/mach-visws/setup.c 2004-10-09 03:50:45.000000000 +0400
+++ b/arch/i386/mach-visws/setup.c 2004-10-09 04:01:36.000000000 +0400
@@ -112,7 +112,11 @@ void __init pre_setup_arch_hook()

static struct irqaction irq0 = {
.handler = timer_interrupt,
+#ifdef CONFIG_INGO_IRQ_THREADS
+ .flags = SA_INTERRUPT | SA_NODELAY,
+#else
.flags = SA_INTERRUPT,
+#endif
.name = "timer",
};

diff -pruN a/arch/i386/mach-voyager/setup.c b/arch/i386/mach-voyager/setup.c
--- a/arch/i386/mach-voyager/setup.c 2004-10-09 03:50:45.000000000 +0400
+++ b/arch/i386/mach-voyager/setup.c 2004-10-09 04:01:36.000000000 +0400
@@ -17,7 +17,11 @@ void __init pre_intr_init_hook(void)
/*
* IRQ2 is cascade interrupt to second interrupt controller
*/
+#ifdef CONFIG_INGO_IRQ_THREADS
+static struct irqaction irq2 = { no_action, SA_NODELAY, 0, "cascade", NULL, NULL};
+#else
static struct irqaction irq2 = { no_action, 0, CPU_MASK_NONE, "cascade", NULL, NULL};
+#endif

void __init intr_init_hook(void)
{
@@ -39,8 +43,11 @@ void __init pre_setup_arch_hook(void)
void __init trap_init_hook(void)
{
}
-
+#ifdef CONFIG_INGO_IRQ_THREADS
+static struct irqaction irq0 = { timer_interrupt, SA_INTERRUPT | SA_NODELAY, 0, "timer", NULL, NULL};
+#else
static struct irqaction irq0 = { timer_interrupt, SA_INTERRUPT, CPU_MASK_NONE, "timer", NULL, NULL};
+#endif

void __init time_init_hook(void)
{
diff -pruN a/drivers/block/ll_rw_blk.c b/drivers/block/ll_rw_blk.c
--- a/drivers/block/ll_rw_blk.c 2004-10-09 03:50:45.000000000 +0400
+++ b/drivers/block/ll_rw_blk.c 2004-10-09 04:01:36.000000000 +0400
@@ -7,6 +7,9 @@
* Queue request tables / lock, selectable elevator, Jens Axboe <[email protected]>
* kernel-doc documentation started by NeilBrown <[email protected]> - July2000
* bio rewrite, highmem i/o, etc, Jens Axboe <[email protected]> - may 2001
+ *
+ * 2004-07-16 Modified by Eugeny S. Mints for RT Prototype.
+ * RT Prototype 2004 (C) MontaVista Software, Inc.
*/

/*
@@ -1211,7 +1214,16 @@ static int ll_merge_requests_fn(request_
*/
void blk_plug_device(request_queue_t *q)
{
+ /* XXX: emints: since irqs in threads patch is employed only routines
+ * executed from do_IRQ() are executed from a real interrupt context.
+ * For others holding a lock should be enough. Thus while irqs in
+ * threads, !irqs_disabled() is not a sign that we are not protected
+ * properly. May be substituted by checking the corresponding lock
+ * later if paranoid.
+ */
+#if !defined(CONFIG_IRQ_THREADS) && !defined(CONFIG_INGO_IRQ_THREADS)
WARN_ON(!irqs_disabled());
+#endif /* CONFIG_IRQ_THREADS */

/*
* don't plug a stopped queue, it must be paired with blk_start_queue()
@@ -1232,7 +1244,16 @@ EXPORT_SYMBOL(blk_plug_device);
*/
int blk_remove_plug(request_queue_t *q)
{
- WARN_ON(!irqs_disabled());
+ /* XXX: emints: since irqs in threads patch is employed only routines
+ * executed from do_IRQ() are executed from a real interrupt context.
+ * For others holding a lock should be enough. Thus while irqs in
+ * threads, !irqs_disabled() is not a sign that we are not protected
+ * properly. May be substituted by checking the corresponding lock
+ * later if paranoid.
+ */
+#if !defined(CONFIG_IRQ_THREADS) && !defined(CONFIG_INGO_IRQ_THREADS)
+ WARN_ON(!irqs_disabled());
+#endif /* CONFIG_IRQ_THREADS */

if (!test_and_clear_bit(QUEUE_FLAG_PLUGGED, &q->queue_flags))
return 0;
diff -pruN a/drivers/ide/ide-probe.c b/drivers/ide/ide-probe.c
--- a/drivers/ide/ide-probe.c 2004-10-09 03:50:45.000000000 +0400
+++ b/drivers/ide/ide-probe.c 2004-10-09 04:01:36.000000000 +0400
@@ -378,7 +378,10 @@ static int try_to_identify (ide_drive_t
hwif->OUTB(drive->ctl|2, IDE_CONTROL_REG);
/* clear drive IRQ */
(void) hwif->INB(IDE_STATUS_REG);
- udelay(5);
+
+ set_current_state(TASK_INTERRUPTIBLE);
+ schedule_timeout(1);
+
irq = probe_irq_off(cookie);
if (!hwif->irq) {
if (irq > 0) {
diff -pruN a/drivers/input/serio/ambakmi.c b/drivers/input/serio/ambakmi.c
--- a/drivers/input/serio/ambakmi.c 2004-10-09 03:50:45.000000000 +0400
+++ b/drivers/input/serio/ambakmi.c 2004-10-09 04:01:36.000000000 +0400
@@ -84,7 +84,7 @@ static int amba_kmi_open(struct serio *i
writeb(divisor, KMICLKDIV);
writeb(KMICR_EN, KMICR);

- ret = request_irq(kmi->irq, amba_kmi_int, 0, "kmi-pl050", kmi);
+ ret = request_irq(kmi->irq, amba_kmi_int, SA_NOTHREAD, "kmi-pl050", kmi);
if (ret) {
printk(KERN_ERR "kmi: failed to claim IRQ%d\n", kmi->irq);
writeb(0, KMICR);
diff -pruN a/drivers/input/serio/ct82c710.c b/drivers/input/serio/ct82c710.c
--- a/drivers/input/serio/ct82c710.c 2004-10-09 03:50:45.000000000 +0400
+++ b/drivers/input/serio/ct82c710.c 2004-10-09 04:01:36.000000000 +0400
@@ -113,7 +113,7 @@ static int ct82c710_open(struct serio *s
{
unsigned char status;

- if (request_irq(CT82C710_IRQ, ct82c710_interrupt, 0, "ct82c710", NULL))
+ if (request_irq(CT82C710_IRQ, ct82c710_interrupt, SA_NOTHREAD, "ct82c710", NULL))
return -1;

status = inb_p(CT82C710_STATUS);
diff -pruN a/drivers/input/serio/i8042.c b/drivers/input/serio/i8042.c
--- a/drivers/input/serio/i8042.c 2004-10-09 03:50:45.000000000 +0400
+++ b/drivers/input/serio/i8042.c 2004-10-09 04:01:36.000000000 +0400
@@ -10,6 +10,7 @@
* the Free Software Foundation.
*/

+#include <linux/config.h>
#include <linux/delay.h>
#include <linux/module.h>
#include <linux/moduleparam.h>
@@ -303,7 +304,7 @@ static int i8042_open(struct serio *port
return 0;

if (request_irq(values->irq, i8042_interrupt,
- SA_SHIRQ, "i8042", i8042_request_irq_cookie)) {
+ SA_SHIRQ | SA_NOTHREAD, "i8042", i8042_request_irq_cookie)) {
printk(KERN_ERR "i8042.c: Can't get irq %d for %s, unregistering the port.\n", values->irq, values->name);
goto irq_fail;
}
@@ -566,7 +567,7 @@ static int __init i8042_check_aux(struct
* in trying to detect AUX presence.
*/

- if (request_irq(values->irq, i8042_interrupt, SA_SHIRQ,
+ if (request_irq(values->irq, i8042_interrupt, SA_SHIRQ | SA_NOTHREAD,
"i8042", &i8042_check_aux_cookie))
return -1;
free_irq(values->irq, &i8042_check_aux_cookie);
diff -pruN a/drivers/input/serio/pcips2.c b/drivers/input/serio/pcips2.c
--- a/drivers/input/serio/pcips2.c 2004-10-09 03:50:45.000000000 +0400
+++ b/drivers/input/serio/pcips2.c 2004-10-09 04:01:36.000000000 +0400
@@ -107,7 +107,7 @@ static int pcips2_open(struct serio *io)
outb(PS2_CTRL_ENABLE, ps2if->base);
pcips2_flush_input(ps2if);

- ret = request_irq(ps2if->dev->irq, pcips2_interrupt, SA_SHIRQ,
+ ret = request_irq(ps2if->dev->irq, pcips2_interrupt, SA_SHIRQ | SA_NOTHREAD,
"pcips2", ps2if);
if (ret == 0)
val = PS2_CTRL_ENABLE | PS2_CTRL_RXIRQ;
diff -pruN a/drivers/input/serio/rpckbd.c b/drivers/input/serio/rpckbd.c
--- a/drivers/input/serio/rpckbd.c 2004-10-09 03:50:45.000000000 +0400
+++ b/drivers/input/serio/rpckbd.c 2004-10-09 04:01:36.000000000 +0400
@@ -85,12 +85,12 @@ static int rpckbd_open(struct serio *por
iomd_writeb(8, IOMD_KCTRL);
iomd_readb(IOMD_KARTRX);

- if (request_irq(IRQ_KEYBOARDRX, rpckbd_rx, 0, "rpckbd", port) != 0) {
+ if (request_irq(IRQ_KEYBOARDRX, rpckbd_rx, SA_NOTHREAD, "rpckbd", port) != 0) {
printk(KERN_ERR "rpckbd.c: Could not allocate keyboard receive IRQ\n");
return -EBUSY;
}

- if (request_irq(IRQ_KEYBOARDTX, rpckbd_tx, 0, "rpckbd", port) != 0) {
+ if (request_irq(IRQ_KEYBOARDTX, rpckbd_tx, SA_NOTHREAD, "rpckbd", port) != 0) {
printk(KERN_ERR "rpckbd.c: Could not allocate keyboard transmit IRQ\n");
free_irq(IRQ_KEYBOARDRX, NULL);
return -EBUSY;
diff -pruN a/drivers/input/serio/sa1111ps2.c b/drivers/input/serio/sa1111ps2.c
--- a/drivers/input/serio/sa1111ps2.c 2004-10-09 03:50:45.000000000 +0400
+++ b/drivers/input/serio/sa1111ps2.c 2004-10-09 04:01:36.000000000 +0400
@@ -127,7 +127,7 @@ static int ps2_open(struct serio *io)

sa1111_enable_device(ps2if->dev);

- ret = request_irq(ps2if->dev->irq[0], ps2_rxint, 0,
+ ret = request_irq(ps2if->dev->irq[0], ps2_rxint, SA_NOTHREAD,
SA1111_DRIVER_NAME(ps2if->dev), ps2if);
if (ret) {
printk(KERN_ERR "sa1111ps2: could not allocate IRQ%d: %d\n",
@@ -135,7 +135,7 @@ static int ps2_open(struct serio *io)
return ret;
}

- ret = request_irq(ps2if->dev->irq[1], ps2_txint, 0,
+ ret = request_irq(ps2if->dev->irq[1], ps2_txint, SA_NOTHREAD,
SA1111_DRIVER_NAME(ps2if->dev), ps2if);
if (ret) {
printk(KERN_ERR "sa1111ps2: could not allocate IRQ%d: %d\n",
diff -pruN a/include/asm-i386/hardirq.h b/include/asm-i386/hardirq.h
--- a/include/asm-i386/hardirq.h 2004-10-09 03:50:45.000000000 +0400
+++ b/include/asm-i386/hardirq.h 2004-10-09 04:01:36.000000000 +0400
@@ -46,10 +46,28 @@ typedef struct {
# error HARDIRQ_BITS is too low!
#endif

+/*
+ * Are we doing bottom half or hardware interrupt processing?
+ * Are we in a softirq context? Interrupt context?
+ */
+#ifdef CONFIG_INGO_IRQ_THREADS
+#define in_irq() (hardirq_count() || (current->flags & PF_HARDIRQ))
+#define in_softirq() (softirq_count() || (current->flags & PF_SOFTIRQ))
+#else
+#define in_irq() (hardirq_count())
+#define in_softirq() (softirq_count())
+#endif
+#define in_interrupt() (irq_count())
+
+
+#define hardirq_trylock() (!in_interrupt())
+#define hardirq_endlock() do { } while (0)
+
+#define irq_enter() (preempt_count() += HARDIRQ_OFFSET)
#define nmi_enter() (irq_enter())
#define nmi_exit() (preempt_count() -= HARDIRQ_OFFSET)

-#define irq_enter() (preempt_count() += HARDIRQ_OFFSET)
+#ifndef CONFIG_SOFTIRQ_THREADS
#define irq_exit() \
do { \
preempt_count() -= IRQ_EXIT_OFFSET; \
@@ -57,5 +75,55 @@ do { \
do_softirq(); \
preempt_enable_no_resched(); \
} while (0)
+#else
+#define irq_exit() (preempt_count() -= HARDIRQ_OFFSET)
+#endif
+
+#ifndef CONFIG_INGO_IRQ_THREADS
+
+#if !defined(CONFIG_SMP) && !defined(CONFIG_IRQ_THREADS)
+# define synchronize_irq(irq) barrier()
+#else
+ extern void synchronize_irq(unsigned int irq);
+#endif /* CONFIG_SMP */
+
+#else
+static inline void synchronize_irq(unsigned int irq)
+{
+ generic_synchronize_irq(irq);
+}
+
+static inline void free_irq(unsigned int irq, void *dev_id)
+{
+ generic_free_irq(irq, dev_id);
+}
+
+static inline void disable_irq_nosync(unsigned int irq)
+{
+ generic_disable_irq_nosync(irq);
+}
+
+static inline void disable_irq(unsigned int irq)
+{
+ generic_disable_irq(irq);
+}
+
+static inline void enable_irq(unsigned int irq)
+{
+ generic_enable_irq(irq);
+}
+
+static inline int setup_irq(unsigned int irq, struct irqaction *new)
+{
+ return generic_setup_irq(irq, new);
+}
+#endif /* CONFIG_INGO_IRQ_THREADS */
+

#endif /* __ASM_HARDIRQ_H */
+
+
+
+
+
+
diff -pruN a/include/asm-i386/hw_irq.h b/include/asm-i386/hw_irq.h
--- a/include/asm-i386/hw_irq.h 2004-10-09 03:50:45.000000000 +0400
+++ b/include/asm-i386/hw_irq.h 2004-10-09 04:01:36.000000000 +0400
@@ -54,6 +54,9 @@ void make_8259A_irq(unsigned int irq);
void init_8259A(int aeoi);
void FASTCALL(send_IPI_self(int vector));
void init_VISWS_APIC_irqs(void);
+#ifdef CONFIG_INGO_IRQ_THREADS
+extern void init_hardirqs(void);
+#endif
void setup_IO_APIC(void);
void disable_IO_APIC(void);
void print_IO_APIC(void);
@@ -78,4 +81,7 @@ static inline void hw_resend_irq(struct
static inline void hw_resend_irq(struct hw_interrupt_type *h, unsigned int i) {}
#endif

+/* Return a pointer to the irq descriptor for IRQ. */
+#define irq_descp(irq) (irq_desc + (irq))
+
#endif /* _ASM_HW_IRQ_H */
diff -pruN a/include/asm-i386/irq.h b/include/asm-i386/irq.h
--- a/include/asm-i386/irq.h 2004-10-09 03:50:45.000000000 +0400
+++ b/include/asm-i386/irq.h 2004-10-09 04:01:36.000000000 +0400
@@ -20,10 +20,12 @@ static __inline__ int irq_canonicalize(i
{
return ((irq == 2) ? 9 : irq);
}
-
+#ifndef CONFIG_INGO_IRQ_THREADS
extern void disable_irq(unsigned int);
extern void disable_irq_nosync(unsigned int);
extern void enable_irq(unsigned int);
+#endif
+
extern void release_x86_irqs(struct task_struct *);
extern int can_request_irq(unsigned int, unsigned long flags);

diff -pruN a/include/asm-i386/signal.h b/include/asm-i386/signal.h
--- a/include/asm-i386/signal.h 2004-10-09 03:50:45.000000000 +0400
+++ b/include/asm-i386/signal.h 2004-10-09 04:01:36.000000000 +0400
@@ -121,6 +121,12 @@ typedef unsigned long sigset_t;
*/
#define SA_PROBE SA_ONESHOT
#define SA_SAMPLE_RANDOM SA_RESTART
+#define SA_NOTHREAD 0x01000000
+#ifdef CONFIG_INGO_IRQ_THREADS
+#define SA_NODELAY 0x02000000
+#undef SA_NOTHREAD
+#define SA_NOTHREAD SA_NODELAY
+#endif
#define SA_SHIRQ 0x04000000
#endif

diff -pruN a/include/linux/hardirq.h b/include/linux/hardirq.h
--- a/include/linux/hardirq.h 2004-10-09 03:50:45.000000000 +0400
+++ b/include/linux/hardirq.h 2004-10-09 04:01:36.000000000 +0400
@@ -23,24 +23,18 @@
* Are we doing bottom half or hardware interrupt processing?
* Are we in a softirq context? Interrupt context?
*/
-#define in_irq() (hardirq_count())
-#define in_softirq() (softirq_count())
-#define in_interrupt() (irq_count())
-
#ifdef CONFIG_PREEMPT
-# define in_atomic() ((preempt_count() & ~PREEMPT_ACTIVE) != kernel_locked())
+# if defined CONFIG_INGO_BKL
+ /* lock_depth is not incremented if BKL is a mutex */
+# define in_atomic() ((preempt_count() & ~PREEMPT_ACTIVE) != 0)
+# else
+# define in_atomic() ((preempt_count() & ~PREEMPT_ACTIVE) != kernel_locked())
+# endif
# define preemptible() (preempt_count() == 0 && !irqs_disabled())
# define IRQ_EXIT_OFFSET (HARDIRQ_OFFSET-1)
#else
-# define in_atomic() (preempt_count() != 0)
+# define in_atomic() (preempt_count() != 0)
# define preemptible() 0
# define IRQ_EXIT_OFFSET HARDIRQ_OFFSET
#endif
-
-#ifdef CONFIG_SMP
-extern void synchronize_irq(unsigned int irq);
-#else
-# define synchronize_irq(irq) barrier()
-#endif
-
#endif /* LINUX_HARDIRQ_H */
diff -pruN a/include/linux/interrupt.h b/include/linux/interrupt.h
--- a/include/linux/interrupt.h 2004-10-09 03:50:45.000000000 +0400
+++ b/include/linux/interrupt.h 2004-10-09 04:01:36.000000000 +0400
@@ -39,6 +39,10 @@ struct irqaction {
cpumask_t mask;
const char *name;
void *dev_id;
+#ifdef CONFIG_INGO_IRQ_THREADS
+ int irq;
+ struct proc_dir_entry *dir, *threaded;
+#endif
struct irqaction *next;
};

@@ -51,7 +55,7 @@ extern void free_irq(unsigned int, void
/*
* Temporary defines for UP kernels, until all code gets fixed.
*/
-#ifndef CONFIG_SMP
+#if !defined(CONFIG_SMP) && !defined(CONFIG_IRQ_THREADS)
# define cli() local_irq_disable()
# define sti() local_irq_enable()
# define save_flags(x) local_save_flags(x)
@@ -60,6 +64,8 @@ extern void free_irq(unsigned int, void
#endif

/* SoftIRQ primitives. */
+#ifndef CONFIG_SOFTIRQ_THREADS
+
#define local_bh_disable() \
do { preempt_count() += SOFTIRQ_OFFSET; barrier(); } while (0)
#define __local_bh_enable() \
@@ -67,6 +73,27 @@ extern void free_irq(unsigned int, void

extern void local_bh_enable(void);

+#else
+
+/* As far as I can tell, local_bh_disable() didn't stop ksoftirqd
+ from running before. Since all softirqs now run from one of
+ the ksoftirqds, this shouldn't be necessary. */
+
+static inline void local_bh_disable(void)
+{
+}
+
+static inline void __local_bh_enable(void)
+{
+}
+
+static inline void local_bh_enable(void)
+{
+}
+
+#endif
+
+
/* PLEASE, avoid to allocate new softirqs, if you need not _really_ high
frequency threaded job scheduling. For almost all the purposes
tasklets are more than enough. F.e. all serial device BHs et
@@ -92,6 +119,10 @@ struct softirq_action
void (*action)(struct softirq_action *);
void *data;
};
+#ifdef CONFIG_INGO_IRQ_THREADS
+extern void do_hardirq(irq_desc_t *desc);
+extern void wakeup_irqd(void);
+#endif

asmlinkage void do_softirq(void);
extern void open_softirq(int nr, void (*action)(struct softirq_action*), void *data);
@@ -147,6 +178,7 @@ enum
TASKLET_STATE_RUN /* Tasklet is running (SMP only) */
};

+
#ifdef CONFIG_SMP
static inline int tasklet_trylock(struct tasklet_struct *t)
{
diff -pruN a/include/linux/irq.h b/include/linux/irq.h
--- a/include/linux/irq.h 2004-10-09 03:50:45.000000000 +0400
+++ b/include/linux/irq.h 2004-10-09 04:01:36.000000000 +0400
@@ -7,6 +7,9 @@
* within this file.
*
* Thanks. --rmk
+ *
+ * 2004-07-16 Modified by Eugeny S. Mints for RT Prototype.
+ * RT Prototype 2004 (C) MontaVista Software, Inc.
*/

#include <linux/config.h>
@@ -32,6 +35,26 @@
#define IRQ_LEVEL 64 /* IRQ level triggered */
#define IRQ_MASKED 128 /* IRQ masked - shouldn't be seen again */
#define IRQ_PER_CPU 256 /* IRQ is per CPU */
+#ifndef CONFIG_INGO_IRQ_THREADS
+#define IRQ_THREAD 512 /* IRQ has at least one threaded handler */
+#else
+#define IRQ_NODELAY 512 /* IRQ must run immediately */
+#endif
+
+#define IRQ_NOTHREAD 1024 /* IRQ has at least one nonthreaded handler */
+#define IRQ_THREADPENDING 2048 /* IRQ thread has been woken */
+#define IRQ_THREADRUNNING 4096 /* IRQ thread is currently running */
+
+/* Nobody has yet handled this IRQ. This is set when ack() is called,
+ and checked when end() is called. It is done this way to accommodate
+ threaded and non-threaded IRQs sharing the same IRQ. */
+
+#define IRQ_UNHANDLED 8192
+
+/* The interrupt is supposed to be enabled, but the IRQ thread hasn't
+ been spawned yet. Call startup_irq() once the thread is spawned. */
+
+#define IRQ_DELAYEDSTARTUP 16384

/*
* Interrupt controller descriptor. This is all we need
@@ -64,17 +87,58 @@ typedef struct irq_desc {
unsigned int depth; /* nested irq disables */
unsigned int irq_count; /* For detecting broken interrupts */
unsigned int irqs_unhandled;
+ /*
+ * this lock is used from a real interrupt context (do_IRQ) even if
+ * irqs in threads patch is employed.
+ */
spinlock_t lock;
+
+#if defined CONFIG_INGO_IRQ_THREADS || defined CONFIG_IRQ_THREADS
+ struct task_struct *thread;
+# ifdef CONFIG_IRQ_THREADS
+ wait_queue_head_t sync;
+# endif
+#endif
} ____cacheline_aligned irq_desc_t;

extern irq_desc_t irq_desc [NR_IRQS];

#include <asm/hw_irq.h> /* the arch dependent stuff */
-
+#ifndef CONFIG_INGO_IRQ_THREADS
extern int setup_irq(unsigned int , struct irqaction * );
+#else
+extern int generic_redirect_hardirq(struct irq_desc *desc);
+extern asmlinkage int generic_handle_IRQ_event(unsigned int irq, struct pt_regs *regs, struct irqaction *action);
+extern void generic_synchronize_irq(unsigned int irq);
+extern int generic_setup_irq(unsigned int irq, struct irqaction * new);
+extern void generic_free_irq(unsigned int irq, void *dev_id);
+extern void generic_disable_irq_nosync(unsigned int irq);
+extern void generic_disable_irq(unsigned int irq);
+extern void generic_enable_irq(unsigned int irq);
+extern void generic_note_interrupt(int irq, irq_desc_t *desc, int action_ret);
+
+extern int noirqdebug;
+#endif

extern hw_irq_controller no_irq_type; /* needed in every arch ? */

-#endif
+#ifdef CONFIG_IRQ_THREADS
+void spawn_irq_threads(void);
+void setup_irq_spawn_thread(unsigned int irq, struct irqaction *new);
+unsigned int it_startup_irq(unsigned int irq);
+void it_shutdown_irq(unsigned int irq);
+#define STARTUP_IRQ(irq) it_startup_irq(irq)
+#define SHUTDOWN_IRQ(irq) it_shutdown_irq(irq)
+#else
+#define setup_irq_spawn_thread(irq, new)
+#define STARTUP_IRQ(irq) desc->handler->startup(irq)
+#define SHUTDOWN_IRQ(irq) desc->handler->shutdown(irq)
+#endif /* CONFIG_IRQ_THREADS */
+
+
+
+
+
+#endif /* CONFIG_ARCH_S390 */

#endif /* __irq_h */
diff -pruN a/include/linux/sched.h b/include/linux/sched.h
--- a/include/linux/sched.h 2004-10-09 03:50:45.000000000 +0400
+++ b/include/linux/sched.h 2004-10-09 04:01:36.000000000 +0400
@@ -178,6 +178,9 @@ extern int in_sched_functions(unsigned l

#define MAX_SCHEDULE_TIMEOUT LONG_MAX
extern signed long FASTCALL(schedule_timeout(signed long timeout));
+struct timeout;
+#define MAX_SCHEDULE_TIMEOUT_EXT ((struct timeout *) ~0)
+extern void FASTCALL(schedule_timeout_ext (const struct timeout *timeout));
asmlinkage void schedule(void);

struct namespace;
@@ -216,7 +219,6 @@ struct mm_struct {
int map_count; /* number of VMAs */
struct rw_semaphore mmap_sem;
spinlock_t page_table_lock; /* Protects task page tables and mm->rss */
-
struct list_head mmlist; /* List of all active mm's. These are globally strung
* together off init_mm.mmlist, and are protected
* by mmlist_lock
@@ -260,7 +262,7 @@ struct sighand_struct {
};

/*
- * NOTE! "signal_struct" does not have it's own
+ * NOTE! "signal_struct" des not have it's own
* locking, because a shared signal_struct always
* implies a shared sighand_struct, so locking
* sighand_struct is always a proper superset of
@@ -328,9 +330,10 @@ struct signal_struct {
*/

#define MAX_USER_RT_PRIO 100
-#define MAX_RT_PRIO MAX_USER_RT_PRIO
+#define MAX_RT_PRIO 100 /* MAX_USER_RT_PRIO */

#define MAX_PRIO (MAX_RT_PRIO + 40)
+#define BOTTOM_PRIO INT_MAX

#define rt_task(p) (unlikely((p)->prio < MAX_RT_PRIO))

@@ -443,7 +446,7 @@ struct task_struct {

int lock_depth; /* Lock depth */

- int prio, static_prio;
+ int prio, static_prio, boost_prio;
struct list_head run_list;
prio_array_t *array;

@@ -454,6 +457,9 @@ struct task_struct {

unsigned long policy;
cpumask_t cpus_allowed;
+#ifdef CONFIG_INGO_BKL
+ cpumask_t saved_cpus_allowed;
+#endif
unsigned int time_slice, first_time_slice;

#ifdef CONFIG_SCHEDSTATS
@@ -559,7 +565,13 @@ struct task_struct {
spinlock_t proc_lock;
/* context-switch lock */
spinlock_t switch_lock;
-
+/*
+ * current io wait handle: wait queue entry to use for io waits
+ * If this thread is processing aio, this points at the waitqueue
+ * inside the currently handled kiocb. It may be NULL (i.e. default
+ * to a stack based synchronous wait) if its doing sync IO.
+ */
+ wait_queue_t *io_wait;
/* journalling filesystem info */
void *journal_info;

@@ -573,13 +585,7 @@ struct task_struct {

unsigned long ptrace_message;
siginfo_t *last_siginfo; /* For ptrace use. */
-/*
- * current io wait handle: wait queue entry to use for io waits
- * If this thread is processing aio, this points at the waitqueue
- * inside the currently handled kiocb. It may be NULL (i.e. default
- * to a stack based synchronous wait) if its doing sync IO.
- */
- wait_queue_t *io_wait;
+
#ifdef CONFIG_NUMA
struct mempolicy *mempolicy;
short il_next; /* could be shared with used_math */
@@ -613,6 +619,12 @@ do { if (atomic_dec_and_test(&(tsk)->usa
#define PF_MEMDIE 0x00001000 /* Killed for out-of-memory */
#define PF_FLUSHER 0x00002000 /* responsible for disk writeback */

+
+/* Thread is an IRQ handler. This is used to determine which softirq
+ thread to wake. */
+
+#define PF_IRQHANDLER 0x10000000
+
#define PF_FREEZE 0x00004000 /* this task should be frozen for suspend */
#define PF_NOFREEZE 0x00008000 /* this thread should not be frozen */
#define PF_FROZEN 0x00010000 /* frozen for system suspend */
@@ -621,6 +633,13 @@ do { if (atomic_dec_and_test(&(tsk)->usa
#define PF_SWAPOFF 0x00080000 /* I am in swapoff */
#define PF_LESS_THROTTLE 0x00100000 /* Throttle me less: I clean memory */
#define PF_SYNCWRITE 0x00200000 /* I am doing a sync write */
+#ifdef CONFIG_INGO_IRQ_THREADS
+#define PF_SOFTIRQ 0x00400000 /* softirq context */
+#define PF_HARDIRQ 0x00800000 /* hardirq context */
+#endif
+
+#define PF_ADD_TO_HEAD 0x40000000
+#define PF_MUTEX_INTERRUPTIBLE 0x20000000

#ifdef CONFIG_SMP
extern int set_cpus_allowed(task_t *p, cpumask_t new_mask);
@@ -695,6 +714,7 @@ extern unsigned long itimer_ticks;
extern unsigned long itimer_next;
extern void do_timer(struct pt_regs *);

+extern int try_to_wake_up(struct task_struct *p, unsigned int state, int sync);
extern int FASTCALL(wake_up_state(struct task_struct * tsk, unsigned int state));
extern int FASTCALL(wake_up_process(struct task_struct * tsk));
extern void FASTCALL(wake_up_new_task(struct task_struct * tsk,
@@ -880,6 +900,9 @@ static inline int thread_group_empty(tas
return list_empty(&p->pids[PIDTYPE_TGID].pid_list);
}

+asmlinkage long sys_sched_setscheduler(pid_t pid, int policy,
+ struct sched_param __user *param);
+
#define delay_group_leader(p) \
(thread_group_leader(p) && !thread_group_empty(p))

diff -pruN a/include/linux/smp_lock.h b/include/linux/smp_lock.h
--- a/include/linux/smp_lock.h 2004-10-09 03:50:45.000000000 +0400
+++ b/include/linux/smp_lock.h 2004-10-09 04:01:36.000000000 +0400
@@ -7,12 +7,17 @@

#if defined(CONFIG_SMP) || defined(CONFIG_PREEMPT)

-extern spinlock_t kernel_flag;
-
-#define kernel_locked() (current->lock_depth >= 0)
-
-#define get_kernel_lock() spin_lock(&kernel_flag)
-#define put_kernel_lock() spin_unlock(&kernel_flag)
+# ifdef CONFIG_INGO_BKL
+ extern int kernel_locked(void);
+ extern void lock_kernel(void);
+ extern void unlock_kernel(void);
+# else
+
+# define kernel_locked() (current->lock_depth >= 0)
+
+# define get_kernel_lock() _spin_lock(&kernel_flag)
+# define put_kernel_lock() _spin_unlock(&kernel_flag)
+ extern spinlock_t kernel_flag;

/*
* Release global kernel lock.
@@ -53,14 +58,18 @@ static inline void unlock_kernel(void)
if (likely(--current->lock_depth < 0))
put_kernel_lock();
}
+# endif /* !INGO's BKL */

#else

-#define lock_kernel() do { } while(0)
-#define unlock_kernel() do { } while(0)
-#define release_kernel_lock(task) do { } while(0)
-#define reacquire_kernel_lock(task) do { } while(0)
-#define kernel_locked() 1
+# define lock_kernel() do { } while(0)
+# define unlock_kernel() do { } while(0)
+# define kernel_locked() 1
+
+# ifndef CONFIG_INGO_BKL
+# define release_kernel_lock(task) do { } while(0)
+# define reacquire_kernel_lock(task) do { } while(0)
+# endif /* INGO's BKL */

#endif /* CONFIG_SMP || CONFIG_PREEMPT */
#endif /* __LINUX_SMPLOCK_H */
diff -pruN a/init/Kconfig b/init/Kconfig
--- a/init/Kconfig 2004-10-09 03:50:45.000000000 +0400
+++ b/init/Kconfig 2004-10-09 04:01:36.000000000 +0400
@@ -224,6 +224,30 @@ config IKCONFIG_PROC
This option enables access to the kernel configuration file
through /proc/config.gz.

+config INGO_BKL
+ bool "Replace the BKL with a sleeping lock"
+ default y
+ ---help---
+ Uses Ingo Molnar's code to replace the BKL with
+ a semaphore.
+
+choice
+ prompt "Select lock"
+ depends on INGO_BKL
+ default BKL_SEM
+
+config BKL_SEM
+ bool "BKL becomes the system semaphore."
+ ---help---
+ Use the system semaphore to replace the BKL instead of
+ the kmutex.
+
+config BKL_MTX
+ bool "BKL becomes a mutex"
+ ---help---
+ Use the kmutex to replace the BKL instead of
+ the system semaphore.
+endchoice

menuconfig EMBEDDED
bool "Configure standard kernel features (for small systems)"
@@ -280,6 +304,40 @@ config EPOLL

source "drivers/block/Kconfig.iosched"

+config SOFTIRQ_THREADS
+ bool "Run all softirqs in threads"
+ default y
+ depends on PREEMPT
+ help
+ This option creates a second softirq thread per CPU, which
+ runs at high real-time priority, to replace the softirqs
+ which were previously run immediately. This allows these
+ softirqs to be prioritized, so as to avoid preempting
+ very high priority real-time tasks. This also allows
+ certain spinlocks to be converted into sleeping mutexes,
+ for futher reduction of scheduling latency.
+
+config INGO_IRQ_THREADS
+ bool "Support for Ingo Molnar's version of IRQ Threads."
+ default y
+ depends on !IRQ_THREADS && SOFTIRQ_THREADS
+ help
+ Interrupts are redirected to high priority threads.
+
+
+config IRQ_THREADS
+ bool "Run all IRQs in threads by default"
+ depends on PREEMPT && SOFTIRQ_THREADS
+ help
+ This option creates a thread for each IRQ, which runs at
+ high real-time priority, unless the SA_NOTHREAD option is
+ passed to request_irq(). This allows these IRQs to be
+ prioritized, so as to avoid preempting very high priority
+ real-time tasks. This also allows certain spinlocks to be
+ converted into sleeping mutexes, for further reduction of
+ scheduling latency (however, this is not done automatically).
+
+
config CC_OPTIMIZE_FOR_SIZE
bool "Optimize for size" if EMBEDDED
default y if ARM || H8300
@@ -389,3 +447,5 @@ config STOP_MACHINE
help
Need stop_machine() primitive.
endmenu
+
+
diff -pruN a/init/main.c b/init/main.c
--- a/init/main.c 2004-10-09 03:50:45.000000000 +0400
+++ b/init/main.c 2004-10-09 04:01:36.000000000 +0400
@@ -42,6 +42,7 @@
#include <linux/writeback.h>
#include <linux/cpu.h>
#include <linux/efi.h>
+#include <linux/irq.h>
#include <linux/unistd.h>
#include <linux/rmap.h>
#include <linux/mempolicy.h>
@@ -435,6 +436,9 @@ static void noinline rest_init(void)
kernel_thread(init, NULL, CLONE_FS | CLONE_SIGHAND);
numa_default_policy();
unlock_kernel();
+#ifdef CONFIG_INGO_BKL
+ preempt_enable_no_resched();
+#endif
cpu_idle();
}

@@ -493,13 +497,21 @@ asmlinkage void __init start_kernel(void
* printk() and can access its per-cpu storage.
*/
smp_prepare_boot_cpu();
-
/*
* Set up the scheduler prior starting any interrupts (such as the
* timer interrupt). Full topology setup happens at smp_init()
* time - but meanwhile we still have a functioning scheduler.
*/
sched_init();
+#ifdef CONFIG_INGO_BKL
+ /*
+ * The early boot stage up until we run the first idle thread
+ * is a very volatile affair for the scheduler. Disable preemption
+ * up until the init thread has been started:
+ */
+ preempt_disable();
+#endif
+
build_all_zonelists();
page_alloc_init();
printk("Kernel command line: %s\n", saved_command_line);
@@ -680,6 +692,10 @@ static inline void fixup_cpu_present_map

static int init(void * unused)
{
+#ifdef CONFIG_IRQ_THREADS
+ spawn_irq_threads();
+#endif
+
lock_kernel();
/*
* Tell the world that we're going to be the grim
diff -pruN a/kernel/hardirq.c b/kernel/hardirq.c
--- a/kernel/hardirq.c 1970-01-01 03:00:00.000000000 +0300
+++ b/kernel/hardirq.c 2004-10-09 04:01:36.000000000 +0400
@@ -0,0 +1,697 @@
+/*
+ * linux/kernel/hardirq.c
+ */
+
+#include <linux/module.h>
+#include <linux/interrupt.h>
+#include <linux/init.h>
+#include <linux/kthread.h>
+#include <linux/random.h>
+#include <linux/sched.h>
+#include <linux/mm.h>
+#include <linux/kallsyms.h>
+#include <linux/proc_fs.h>
+#include <asm/uaccess.h>
+
+#ifdef CONFIG_INGO_IRQ_THREADS
+extern struct irq_desc irq_desc[NR_IRQS];
+
+static struct proc_dir_entry * root_irq_dir;
+static struct proc_dir_entry * irq_dir [NR_IRQS];
+
+int noirqdebug;
+static void register_irq_proc (unsigned int irq);
+static void register_handler_proc (unsigned int irq, struct irqaction *action);
+static int start_irq_thread(int irq, struct irq_desc *desc);
+
+int generic_redirect_hardirq(struct irq_desc *desc)
+{
+ /*
+ * Direct execution:
+ */
+ if ((desc->status & IRQ_NODELAY))
+ return 0;
+
+ BUG_ON(!desc->thread);
+ BUG_ON(!irqs_disabled());
+ if (desc->thread->state != TASK_RUNNING)
+ wake_up_process(desc->thread);
+
+ return 1;
+}
+
+/*
+ * This should really return information about whether
+ * we should do bottom half handling etc. Right now we
+ * end up _always_ checking the bottom half, which is a
+ * waste of time and is not what some drivers would
+ * prefer.
+ */
+asmlinkage int generic_handle_IRQ_event(unsigned int irq,
+ struct pt_regs *regs, struct irqaction *action)
+{
+ int status = 1; /* Force the "do bottom halves" bit */
+ int retval = 0;
+
+ if (!(action->flags & SA_INTERRUPT))
+ local_irq_enable();
+
+ do {
+ status |= action->flags;
+ retval |= action->handler(irq, action->dev_id, regs);
+ action = action->next;
+ } while (action);
+ if (status & SA_SAMPLE_RANDOM)
+ add_interrupt_randomness(irq);
+ local_irq_disable();
+ return retval;
+}
+
+void do_hardirq(struct irq_desc *desc)
+{
+ struct irqaction * action;
+ unsigned int irq = desc - irq_desc, count;
+
+ local_irq_disable();
+
+repeat:
+ count = 0;
+ while (desc->status & IRQ_INPROGRESS) {
+ action = desc->action;
+ count++;
+ spin_lock(&desc->lock);
+ for (;;) {
+ irqreturn_t action_ret = 0;
+
+ if (action) {
+ spin_unlock(&desc->lock);
+ action_ret = generic_handle_IRQ_event(irq, NULL,action);
+ spin_lock_irq(&desc->lock);
+ }
+ if (!noirqdebug)
+ generic_note_interrupt(irq, desc, action_ret);
+ if (likely(!(desc->status & IRQ_PENDING)))
+ break;
+ desc->status &= ~IRQ_PENDING;
+ }
+ desc->status &= ~IRQ_INPROGRESS;
+ /*
+ * The ->end() handler has to deal with interrupts which got
+ * disabled while the handler was running.
+ */
+ desc->handler->end(irq);
+ spin_unlock(&desc->lock);
+ }
+
+ if (count)
+ goto repeat;
+
+ local_irq_enable();
+}
+
+
+static void __report_bad_irq(int irq, irq_desc_t *desc, irqreturn_t action_ret)
+{
+ struct irqaction *action;
+
+ if (action_ret != IRQ_HANDLED && action_ret != IRQ_NONE) {
+ printk(KERN_ERR "irq event %d: bogus return value %x\n",
+ irq, action_ret);
+ } else {
+ printk(KERN_ERR "irq %d: nobody cared!\n", irq);
+ }
+ dump_stack();
+ printk(KERN_ERR "handlers:\n");
+ action = desc->action;
+ while (action) {
+ printk(KERN_ERR "[<%p>]", action->handler);
+ print_symbol(" (%s)",
+ (unsigned long)action->handler);
+ printk("\n");
+ action = action->next;
+ }
+}
+
+static void report_bad_irq(int irq, irq_desc_t *desc, irqreturn_t action_ret)
+{
+ static int count = 100;
+
+ if (count) {
+ count--;
+ __report_bad_irq(irq, desc, action_ret);
+ }
+}
+
+
+static int __init noirqdebug_setup(char *str)
+{
+ noirqdebug = 1;
+ printk("IRQ lockup detection disabled\n");
+ return 1;
+}
+
+__setup("noirqdebug", noirqdebug_setup);
+
+/*
+ * If 99,900 of the previous 100,000 interrupts have not been handled then
+ * assume that the IRQ is stuck in some manner. Drop a diagnostic and try to
+ * turn the IRQ off.
+ *
+ * (The other 100-of-100,000 interrupts may have been a correctly-functioning
+ * device sharing an IRQ with the failing one)
+ *
+ * Called under desc->lock
+ */
+void generic_note_interrupt(int irq, irq_desc_t *desc, irqreturn_t action_ret)
+{
+ if (action_ret != IRQ_HANDLED) {
+ desc->irqs_unhandled++;
+ if (action_ret != IRQ_NONE)
+ report_bad_irq(irq, desc, action_ret);
+ }
+
+ desc->irq_count++;
+ if (desc->irq_count < 100000)
+ return;
+
+ desc->irq_count = 0;
+ if (desc->irqs_unhandled > 99900) {
+ /*
+ * The interrupt is stuck
+ */
+ __report_bad_irq(irq, desc, action_ret);
+ /*
+ * Now kill the IRQ
+ */
+ printk(KERN_EMERG "Disabling IRQ #%d\n", irq);
+ desc->status |= IRQ_DISABLED;
+ desc->handler->disable(irq);
+ }
+ desc->irqs_unhandled = 0;
+}
+
+void generic_synchronize_irq(unsigned int irq)
+{
+ while (irq_desc[irq].status & IRQ_INPROGRESS) {
+ cpu_relax();
+ do_hardirq(irq_desc + irq);
+ }
+}
+
+EXPORT_SYMBOL(generic_synchronize_irq);
+
+/*
+ * Generic enable/disable code: this just calls
+ * down into the PIC-specific version for the actual
+ * hardware disable after having gotten the irq
+ * controller lock.
+ */
+
+/**
+ * disable_irq_nosync - disable an irq without waiting
+ * @irq: Interrupt to disable
+ *
+ * Disable the selected interrupt line. Disables and Enables are
+ * nested.
+ * Unlike disable_irq(), this function does not ensure existing
+ * instances of the IRQ handler have completed before returning.
+ *
+ * This function may be called from IRQ context.
+ */
+
+void generic_disable_irq_nosync(unsigned int irq)
+{
+ irq_desc_t *desc = irq_desc + irq;
+ unsigned long flags;
+
+ spin_lock_irqsave(&desc->lock, flags);
+ if (!desc->depth++) {
+ desc->status |= IRQ_DISABLED;
+ desc->handler->disable(irq);
+ }
+ spin_unlock_irqrestore(&desc->lock, flags);
+}
+
+EXPORT_SYMBOL(generic_disable_irq_nosync);
+
+/**
+ * disable_irq - disable an irq and wait for completion
+ * @irq: Interrupt to disable
+ *
+ * Disable the selected interrupt line. Enables and Disables are
+ * nested.
+ * This function waits for any pending IRQ handlers for this interrupt
+ * to complete before returning. If you use this function while
+ * holding a resource the IRQ handler may need you will deadlock.
+ *
+ * This function may be called - with care - from IRQ context.
+ */
+
+void generic_disable_irq(unsigned int irq)
+{
+ irq_desc_t *desc = irq_desc + irq;
+ generic_disable_irq_nosync(irq);
+ if (desc->action)
+ synchronize_irq(irq);
+}
+
+EXPORT_SYMBOL(generic_disable_irq);
+
+/**
+ * enable_irq - enable handling of an irq
+ * @irq: Interrupt to enable
+ *
+ * Undoes the effect of one call to disable_irq(). If this
+ * matches the last disable, processing of interrupts on this
+ * IRQ line is re-enabled.
+ *
+ * This function may be called from IRQ context.
+ */
+
+void generic_enable_irq(unsigned int irq)
+{
+ irq_desc_t *desc = irq_desc + irq;
+ unsigned long flags;
+
+ spin_lock_irqsave(&desc->lock, flags);
+ switch (desc->depth) {
+ case 1: {
+ unsigned int status = desc->status & ~IRQ_DISABLED;
+ desc->status = status;
+ if ((status & (IRQ_PENDING | IRQ_REPLAY)) == IRQ_PENDING) {
+ desc->status = status | IRQ_REPLAY;
+ hw_resend_irq(desc->handler,irq);
+ }
+ desc->handler->enable(irq);
+ /* fall-through */
+ }
+ default:
+ desc->depth--;
+ break;
+ case 0:
+ printk("enable_irq(%u) unbalanced from %p\n", irq,
+ __builtin_return_address(0));
+ }
+ spin_unlock_irqrestore(&desc->lock, flags);
+}
+
+EXPORT_SYMBOL(generic_enable_irq);
+
+/*
+ * If any action has SA_NODELAY then turn IRQ_NODELAY on:
+ */
+static void recalculate_desc_flags(struct irq_desc *desc)
+{
+ struct irqaction *action;
+
+ desc->status &= ~IRQ_NODELAY;
+ for (action = desc->action ; action; action = action->next)
+ if (action->flags & SA_NODELAY)
+ desc->status |= IRQ_NODELAY;
+}
+
+int generic_setup_irq(unsigned int irq, struct irqaction * new)
+{
+ int shared = 0;
+ unsigned long flags;
+ struct irqaction *old, **p;
+ struct irq_desc *desc = irq_desc + irq;
+
+ if (desc->handler == &no_irq_type)
+ return -ENOSYS;
+ /*
+ * Some drivers like serial.c use request_irq() heavily,
+ * so we have to be careful not to interfere with a
+ * running system.
+ */
+ if (new->flags & SA_SAMPLE_RANDOM) {
+ /*
+ * This function might sleep, we want to call it first,
+ * outside of the atomic block.
+ * Yes, this might clear the entropy pool if the wrong
+ * driver is attempted to be loaded, without actually
+ * installing a new handler. But is this really a problem?
+ * Only the sysadmin is able to do this.
+ */
+ rand_initialize_irq(irq);
+ }
+
+ if (!(new->flags & SA_NODELAY))
+ if (start_irq_thread(irq, desc))
+ return -ENOMEM;
+ /*
+ * The following block of code has to be executed atomically
+ */
+ spin_lock_irqsave(&desc->lock,flags);
+ p = &desc->action;
+ if ((old = *p) != NULL) {
+ /* Can't share interrupts unless both agree to */
+ if (!(old->flags & new->flags & SA_SHIRQ)) {
+ spin_unlock_irqrestore(&desc->lock,flags);
+ return -EBUSY;
+ }
+
+ /* add new interrupt at end of irq queue */
+ do {
+ p = &old->next;
+ old = *p;
+ } while (old);
+ shared = 1;
+ }
+
+ *p = new;
+
+ /*
+ * Propagate any possible SA_NODELAY flag into IRQ_NODELAY:
+ */
+ recalculate_desc_flags(desc);
+
+ if (!shared) {
+ desc->depth = 0;
+ desc->status &= ~(IRQ_DISABLED | IRQ_AUTODETECT | IRQ_WAITING | IRQ_INPROGRESS);
+ desc->handler->startup(irq);
+ }
+ spin_unlock_irqrestore(&desc->lock,flags);
+
+ new->irq = irq;
+ register_irq_proc(irq);
+ new->dir = new->threaded = NULL;
+ register_handler_proc(irq, new);
+
+ return 0;
+}
+
+/**
+ * generic_free_irq - free an interrupt
+ * @irq: Interrupt line to free
+ * @dev_id: Device identity to free
+ *
+ * Remove an interrupt handler. The handler is removed and if the
+ * interrupt line is no longer in use by any driver it is disabled.
+ * On a shared IRQ the caller must ensure the interrupt is disabled
+ * on the card it drives before calling this function. The function
+ * does not return until any executing interrupts for this IRQ
+ * have completed.
+ *
+ * This function must not be called from interrupt context.
+ */
+
+void generic_free_irq(unsigned int irq, void *dev_id)
+{
+ struct irq_desc *desc;
+ struct irqaction **p;
+ unsigned long flags;
+
+ if (irq >= NR_IRQS)
+ return;
+
+ desc = irq_desc + irq;
+ spin_lock_irqsave(&desc->lock,flags);
+ p = &desc->action;
+ for (;;) {
+ struct irqaction * action = *p;
+ if (action) {
+ struct irqaction **pp = p;
+ p = &action->next;
+ if (action->dev_id != dev_id)
+ continue;
+
+ /* Found it - now remove it from the list of entries */
+ *pp = action->next;
+ if (!desc->action) {
+ desc->status |= IRQ_DISABLED;
+ desc->handler->shutdown(irq);
+ }
+ recalculate_desc_flags(desc);
+ spin_unlock_irqrestore(&desc->lock,flags);
+ if (action->threaded)
+ remove_proc_entry(action->threaded->name, action->dir);
+ if (action->dir)
+ remove_proc_entry(action->dir->name, irq_dir[irq]);
+
+ /* Wait to make sure it's not being used on another CPU */
+ synchronize_irq(irq);
+ kfree(action);
+ return;
+ }
+ printk("Trying to free free IRQ%d\n",irq);
+ spin_unlock_irqrestore(&desc->lock,flags);
+ return;
+ }
+}
+
+EXPORT_SYMBOL(generic_free_irq);
+
+
+#ifdef CONFIG_SMP
+extern cpumask_t irq_affinity[NR_IRQS];
+#endif
+
+static int do_irqd(void * __desc)
+{
+ struct irq_desc *desc = __desc;
+ int irq = desc - irq_desc;
+#ifdef CONFIG_SMP
+ cpumask_t mask = irq_affinity[irq];
+
+ set_cpus_allowed(current, mask);
+#endif
+ current->flags |= PF_NOFREEZE | PF_HARDIRQ;
+
+ set_user_nice(current, -10);
+
+ printk("IRQ#%d thread started up.\n", irq);
+
+ while (!kthread_should_stop()) {
+ set_current_state(TASK_INTERRUPTIBLE);
+ do_hardirq(desc);
+#ifdef CONFIG_SMP
+ /*
+ * Did IRQ affinities change?
+ */
+ if (!cpus_equal(mask, irq_affinity[irq])) {
+ mask = irq_affinity[irq];
+ set_cpus_allowed(current, mask);
+ }
+#endif
+ schedule();
+ }
+ __set_current_state(TASK_RUNNING);
+ return 0;
+}
+
+static int start_irq_thread(int irq, struct irq_desc *desc)
+{
+ if (desc->thread)
+ return 0;
+
+ printk("requesting new irq thread for IRQ%d...\n", irq);
+ desc->thread = kthread_create(do_irqd, desc, "IRQ %d", irq);
+ if (!desc->thread) {
+ printk(KERN_ERR "irqd: could not create IRQ thread %d!\n", irq);
+ return -ENOMEM;
+ }
+
+ return 0;
+}
+
+#ifdef CONFIG_SMP
+
+static struct proc_dir_entry *smp_affinity_entry[NR_IRQS];
+
+cpumask_t irq_affinity[NR_IRQS] = { [0 ... NR_IRQS-1] = CPU_MASK_ALL };
+
+static int irq_affinity_read_proc(char *page, char **start, off_t off,
+ int count, int *eof, void *data)
+{
+ int len = cpumask_scnprintf(page, count, irq_affinity[(long)data]);
+ if (count - len < 2)
+ return -EINVAL;
+ len += sprintf(page + len, "\n");
+ return len;
+}
+
+static int irq_affinity_write_proc(struct file *file, const char __user *buffer,
+ unsigned long count, void *data)
+{
+ int irq = (long)data, full_count = count, err;
+ cpumask_t new_value, tmp;
+
+ if (!irq_desc[irq].handler->set_affinity)
+ return -EIO;
+
+ err = cpumask_parse(buffer, count, new_value);
+ if (err)
+ return err;
+
+ /*
+ * Do not allow disabling IRQs completely - it's a too easy
+ * way to make the system unusable accidentally :-) At least
+ * one online CPU still has to be targeted.
+ */
+ cpus_and(tmp, new_value, cpu_online_map);
+ if (cpus_empty(tmp))
+ return -EINVAL;
+
+ irq_affinity[irq] = new_value;
+ irq_desc[irq].handler->set_affinity(irq,
+ cpumask_of_cpu(first_cpu(new_value)));
+
+ return full_count;
+}
+
+#endif
+
+static int prof_cpu_mask_read_proc (char *page, char **start, off_t off,
+ int count, int *eof, void *data)
+{
+ int len = cpumask_scnprintf(page, count, *(cpumask_t *)data);
+ if (count - len < 2)
+ return -EINVAL;
+ len += sprintf(page + len, "\n");
+ return len;
+}
+
+static int prof_cpu_mask_write_proc (struct file *file, const char __user *buffer,
+ unsigned long count, void *data)
+{
+ cpumask_t *mask = (cpumask_t *)data;
+ unsigned long full_count = count, err;
+ cpumask_t new_value;
+
+ err = cpumask_parse(buffer, count, new_value);
+ if (err)
+ return err;
+
+ *mask = new_value;
+ return full_count;
+}
+
+#define MAX_NAMELEN 10
+
+static void register_irq_proc (unsigned int irq)
+{
+ char name [MAX_NAMELEN];
+
+ if (!root_irq_dir || (irq_desc[irq].handler == &no_irq_type) ||
+ irq_dir[irq])
+ return;
+
+ memset(name, 0, MAX_NAMELEN);
+ sprintf(name, "%d", irq);
+
+ /* create /proc/irq/1234 */
+ irq_dir[irq] = proc_mkdir(name, root_irq_dir);
+
+#ifdef CONFIG_SMP
+ {
+ struct proc_dir_entry *entry;
+
+ /* create /proc/irq/1234/smp_affinity */
+ entry = create_proc_entry("smp_affinity", 0600, irq_dir[irq]);
+
+ if (entry) {
+ entry->nlink = 1;
+ entry->data = (void *)(long)irq;
+ entry->read_proc = irq_affinity_read_proc;
+ entry->write_proc = irq_affinity_write_proc;
+ }
+
+ smp_affinity_entry[irq] = entry;
+ }
+#endif
+}
+
+#undef MAX_NAMELEN
+
+static int threaded_read_proc (char *page, char **start, off_t off,
+ int count, int *eof, void *data)
+{
+ return sprintf(page, "%c\n",
+ ((struct irqaction *)data)->flags & SA_NODELAY ? '0' : '1');
+}
+
+static int threaded_write_proc (struct file *file, const char __user *buffer,
+ unsigned long count, void *data)
+{
+ struct irqaction *action = data;
+ irq_desc_t *desc = irq_desc + action->irq;
+ int c;
+
+ if (get_user(c, buffer))
+ return -EFAULT;
+ if (c != '0' && c != '1')
+ return -EINVAL;
+
+ spin_lock_irq(&desc->lock);
+
+ if (c == '0')
+ action->flags |= SA_NODELAY;
+ if (c == '1')
+ action->flags &= ~SA_NODELAY;
+ recalculate_desc_flags(desc);
+
+ spin_unlock_irq(&desc->lock);
+
+ return 1;
+}
+
+
+#define MAX_NAMELEN 128
+
+static void register_handler_proc (unsigned int irq, struct irqaction *action)
+{
+ char name [MAX_NAMELEN];
+ struct proc_dir_entry *entry;
+
+ if (!irq_dir[irq] || action->dir || !action->name)
+ return;
+
+ memset(name, 0, MAX_NAMELEN);
+ snprintf(name, MAX_NAMELEN, "%s", action->name);
+
+ /* create /proc/irq/1234/handler/ */
+ action->dir = proc_mkdir(name, irq_dir[irq]);
+ if (!action->dir)
+ return;
+ /* create /proc/irq/1234/handler/threaded */
+ entry = create_proc_entry("threaded", 0600, action->dir);
+ if (!entry)
+ return;
+ entry->nlink = 1;
+ entry->data = (void *)action;
+ entry->read_proc = threaded_read_proc;
+ entry->write_proc = threaded_write_proc;
+ action->threaded = entry;
+}
+
+
+unsigned long prof_cpu_mask = -1;
+
+void init_irq_proc (void)
+{
+ struct proc_dir_entry *entry;
+ int i;
+
+ /* create /proc/irq */
+ root_irq_dir = proc_mkdir("irq", NULL);
+
+ /* create /proc/irq/prof_cpu_mask */
+ entry = create_proc_entry("prof_cpu_mask", 0600, root_irq_dir);
+
+ if (!entry)
+ return;
+
+ entry->nlink = 1;
+ entry->data = (void *)&prof_cpu_mask;
+ entry->read_proc = prof_cpu_mask_read_proc;
+ entry->write_proc = prof_cpu_mask_write_proc;
+
+ /*
+ * Create entries for all existing IRQs.
+ */
+ for (i = 0; i < NR_IRQS; i++)
+ register_irq_proc(i);
+}
+
+#endif /* CONFIG_INGO_IRQ_THREADS */
diff -pruN a/kernel/irq.c b/kernel/irq.c
--- a/kernel/irq.c 1970-01-01 03:00:00.000000000 +0300
+++ b/kernel/irq.c 2004-10-09 04:01:36.000000000 +0400
@@ -0,0 +1,260 @@
+/*
+ * linux/kernel/irq.c
+ *
+ * Copyright (C) 1992, 1998 Linus Torvalds, Ingo Molnar
+ * Includes portions of Andrey Panin's IRQ consolidation patches.
+ *
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License version 2 as
+ * published by the Free Software Foundation.
+ */
+
+#include <linux/config.h>
+#include <linux/errno.h>
+#include <linux/module.h>
+#include <linux/signal.h>
+#include <linux/sched.h>
+#include <linux/ioport.h>
+#include <linux/interrupt.h>
+#include <linux/timex.h>
+#include <linux/slab.h>
+#include <linux/random.h>
+#include <linux/smp_lock.h>
+#include <linux/init.h>
+#include <linux/kernel_stat.h>
+#include <linux/irq.h>
+#include <linux/proc_fs.h>
+#include <linux/seq_file.h>
+#include <linux/kallsyms.h>
+
+#include <asm/atomic.h>
+#include <asm/io.h>
+#include <asm/smp.h>
+#include <asm/system.h>
+#include <asm/bitops.h>
+#include <asm/uaccess.h>
+#include <asm/pgalloc.h>
+#include <asm/delay.h>
+#include <asm/irq.h>
+
+
+#ifdef CONFIG_IRQ_THREADS
+static const int irq_prio = MAX_USER_RT_PRIO - 9;
+
+static inline void synchronize_hard_irq(unsigned int irq)
+{
+#ifdef CONFIG_SMP
+ while (irq_descp(irq)->status & IRQ_INPROGRESS)
+ cpu_relax();
+#endif
+}
+
+void synchronize_irq(unsigned int irq)
+{
+ irq_desc_t *desc = irq_descp(irq);
+
+ synchronize_hard_irq(irq);
+
+ if (desc->thread)
+ wait_event(desc->sync, !(desc->status & IRQ_THREADRUNNING));
+}
+
+typedef struct {
+ struct semaphore sem;
+ int irq;
+} irq_thread_info;
+
+static int run_irq_thread(void *__info)
+{
+ irq_thread_info *info = __info;
+ int irq = info->irq;
+ struct sched_param param = { .sched_priority = irq_prio };
+ irq_desc_t *desc = irq_descp(irq);
+
+ daemonize("IRQ %d", irq);
+
+ set_fs(KERNEL_DS);
+ sys_sched_setscheduler(0, SCHED_FIFO, &param);
+
+ current->flags |= PF_IRQHANDLER | PF_NOFREEZE;
+
+ init_waitqueue_head(&desc->sync);
+ smp_wmb();
+ desc->thread = current;
+
+ spin_lock_irq(&desc->lock);
+
+ if (desc->status & IRQ_DELAYEDSTARTUP) {
+ desc->status &= ~IRQ_DELAYEDSTARTUP;
+ STARTUP_IRQ(irq);
+ }
+
+ spin_unlock_irq(&desc->lock);
+
+ /* Don't reference info after the up(). */
+ up(&info->sem);
+
+ for (;;) {
+ struct irqaction *action;
+ int status, retval;
+
+ set_current_state(TASK_INTERRUPTIBLE);
+
+ while (!(desc->status & IRQ_THREADPENDING))
+ schedule();
+
+ set_current_state(TASK_RUNNING);
+
+ spin_lock_irq(&desc->lock);
+
+ desc->status |= IRQ_THREADRUNNING;
+ desc->status &= ~IRQ_THREADPENDING;
+ status = desc->status;
+
+ spin_unlock_irq(&desc->lock);
+
+ retval = 0;
+
+ if (!(status & IRQ_DISABLED)) {
+ action = desc->action;
+
+ while (action) {
+ if (!(action->flags & SA_NOTHREAD)) {
+ status |= action->flags;
+ retval |= action->handler(irq, action->dev_id, NULL);
+ }
+
+ action = action->next;
+ }
+ }
+
+ if (status & SA_SAMPLE_RANDOM)
+ add_interrupt_randomness(irq);
+
+ spin_lock_irq(&desc->lock);
+
+
+ desc->status &= ~IRQ_THREADRUNNING;
+ if (!(desc->status & (IRQ_DISABLED | IRQ_INPROGRESS |
+ IRQ_THREADPENDING | IRQ_THREADRUNNING))) {
+ desc->handler->end(irq);
+ }
+
+ spin_unlock_irq(&desc->lock);
+
+ if (waitqueue_active(&desc->sync))
+ wake_up(&desc->sync);
+ }
+}
+
+static int ok_to_spawn_threads;
+
+void do_spawn_irq_thread(int irq)
+{
+ irq_thread_info info;
+
+ info.irq = irq;
+ sema_init(&info.sem, 0);
+
+ if (kernel_thread(run_irq_thread, &info, CLONE_KERNEL) < 0) {
+ printk(KERN_EMERG "Could not spawn thread for IRQ %d\n", irq);
+ } else {
+ /* This assumes that up() doesn't touch the semaphore
+ at all after down() returns... */
+
+ down(&info.sem);
+ }
+}
+
+void setup_irq_spawn_thread(unsigned int irq, struct irqaction *new)
+{
+ irq_desc_t *desc = irq_descp(irq);
+ int spawn_thread = 0;
+ unsigned long flags;
+
+ spin_lock_irqsave(&desc->lock, flags);
+
+ if (new->flags & SA_NOTHREAD) {
+ desc->status |= IRQ_NOTHREAD;
+ } else {
+ /* Only the first threaded handler should spawn
+ a thread. */
+
+ if (!(desc->status & IRQ_THREAD)) {
+ spawn_thread = 1;
+ desc->status |= IRQ_THREAD;
+ }
+ }
+
+ spin_unlock_irqrestore(&desc->lock, flags);
+
+ if (ok_to_spawn_threads && spawn_thread)
+ do_spawn_irq_thread(irq);
+}
+
+
+/* This takes care of interrupts that were requested before the
+ scheduler was ready for threads to be created. */
+
+void spawn_irq_threads(void)
+{
+ int i;
+
+ for (i = 0; i < NR_IRQS; i++) {
+ irq_desc_t *desc = irq_descp(i);
+
+ if (desc->action && !desc->thread && (desc->status & IRQ_THREAD))
+ do_spawn_irq_thread(i);
+ }
+
+ ok_to_spawn_threads = 1;
+}
+
+/*
+ * Workarounds for interrupt types without startup()/shutdown() (ppc, ppc64).
+ * Will be removed some day.
+ */
+
+unsigned int it_startup_irq(unsigned int irq)
+{
+ irq_desc_t *desc = irq_descp(irq);
+
+#ifdef CONFIG_IRQ_THREADS
+ if ((desc->status & IRQ_THREAD) && !desc->thread) {
+ /* The IRQ threads haven't been spawned yet. Don't
+ turn on the IRQ until that happens. */
+
+ desc->status |= IRQ_DELAYEDSTARTUP;
+ return 0;
+ }
+#endif
+
+ if (desc->handler->startup)
+ return desc->handler->startup(irq);
+ else if (desc->handler->enable)
+ desc->handler->enable(irq);
+ else
+ BUG();
+ return 0;
+}
+
+void it_shutdown_irq(unsigned int irq)
+{
+ irq_desc_t *desc = irq_descp(irq);
+
+#ifdef CONFIG_IRQ_THREADS
+ if (desc->status & IRQ_DELAYEDSTARTUP) {
+ desc->status &= ~IRQ_DELAYEDSTARTUP;
+ return;
+ }
+#endif
+
+ if (desc->handler->shutdown)
+ desc->handler->shutdown(irq);
+ else if (desc->handler->disable)
+ desc->handler->disable(irq);
+ else
+ BUG();
+}
+
+#endif
diff -pruN a/kernel/kthread.c b/kernel/kthread.c
--- a/kernel/kthread.c 2004-10-09 03:50:45.000000000 +0400
+++ b/kernel/kthread.c 2004-10-09 04:01:36.000000000 +0400
@@ -14,6 +14,14 @@
#include <linux/module.h>
#include <asm/semaphore.h>

+#ifdef CONFIG_INGO_IRQ_THREADS
+/*
+ * We don't want to execute off keventd since it might
+ * hold a semaphore our callers hold too:
+ */
+static struct workqueue_struct *helper_wq;
+#endif
+
struct kthread_create_info
{
/* Information passed to kthread() from keventd. */
@@ -126,12 +134,23 @@ struct task_struct *kthread_create(int (
init_completion(&create.started);
init_completion(&create.done);

+#ifdef CONFIG_INGO_IRQ_THREADS
+ /*
+ * The workqueue needs to start up first:
+ */
+ if (!helper_wq)
+#else
/* If we're being called to start the first workqueue, we
* can't use keventd. */
if (!keventd_up())
+#endif
work.func(work.data);
else {
- schedule_work(&work);
+#ifdef CONFIG_INGO_IRQ_THREADS
+ queue_work(helper_wq, &work);
+#else
+ schedule_work(&work);
+#endif
wait_for_completion(&create.done);
}
if (!IS_ERR(create.result)) {
@@ -183,3 +202,20 @@ int kthread_stop(struct task_struct *k)
return ret;
}
EXPORT_SYMBOL(kthread_stop);
+
+#ifdef CONFIG_INGO_IRQ_THREADS
+static __init int helper_init(void)
+{
+ helper_wq = create_singlethread_workqueue("kthread");
+ BUG_ON(!helper_wq);
+
+ return 0;
+}
+core_initcall(helper_init);
+#endif
+
+
+
+
+
+
diff -pruN a/kernel/Makefile b/kernel/Makefile
--- a/kernel/Makefile 2004-10-09 03:50:45.000000000 +0400
+++ b/kernel/Makefile 2004-10-09 04:01:36.000000000 +0400
@@ -3,11 +3,11 @@
#

obj-y = sched.o fork.o exec_domain.o panic.o printk.o profile.o \
- exit.o itimer.o time.o softirq.o resource.o \
+ exit.o itimer.o time.o softirq.o hardirq.o resource.o \
sysctl.o capability.o ptrace.o timer.o user.o \
signal.o sys.o kmod.o workqueue.o pid.o \
rcupdate.o intermodule.o extable.o params.o posix-timers.o \
- kthread.o
+ kthread.o irq.o

obj-$(CONFIG_FUTEX) += futex.o
obj-$(CONFIG_GENERIC_ISA_DMA) += dma.o
@@ -25,6 +25,7 @@ obj-$(CONFIG_AUDIT) += audit.o
obj-$(CONFIG_AUDITSYSCALL) += auditsc.o
obj-$(CONFIG_KPROBES) += kprobes.o

+
ifneq ($(CONFIG_IA64),y)
# According to Alan Modra <[email protected]>, the -fno-omit-frame-pointer is
# needed for x86 only. Why this used to be enabled for all architectures is beyond
diff -pruN a/kernel/sched.c b/kernel/sched.c
--- a/kernel/sched.c 2004-10-09 03:50:45.000000000 +0400
+++ b/kernel/sched.c 2004-10-09 04:01:36.000000000 +0400
@@ -450,6 +450,19 @@ static runqueue_t *task_rq_lock(task_t *
struct runqueue *rq;

repeat_lock_task:
+ /* Note this potential BUG:
+ * Mutex substitution maps spin_unlock_irqrestore
+ * to a simple spin_unlock. If we substituted
+ * a mutex here, we would save flags and disable
+ * ints, but the spin_unlock_irqrestore call wouldn't
+ * unlock irqs because of the remapping.
+ * Since we are not substituting mutexes for the
+ * rq lock we are OK, but it is symptomatic of problems
+ * we could encounter elsewhere in the kernel.
+ * This type of construct should be rewritten
+ * using a local_irq_restore following the spin_unlock()
+ * to be mutex-substitution-safe. */
+
local_irq_save(*flags);
rq = task_rq(p);
spin_lock(&rq->lock);
@@ -1118,7 +1131,7 @@ static inline int wake_idle(int cpu, tas
*
* returns failure only if the task is already active.
*/
-static int try_to_wake_up(task_t * p, unsigned int state, int sync)
+int try_to_wake_up(task_t * p, unsigned int state, int sync)
{
int cpu, this_cpu, success = 0;
unsigned long flags;
@@ -2620,6 +2633,136 @@ static inline int dependent_sleeper(int
}
#endif

+#if defined(CONFIG_INGO_BKL)
+#if defined(CONFIG_SMP) || defined(CONFIG_PREEMPT)
+/*
+ * The 'big kernel semaphore'
+ *
+ * This mutex is taken and released recursively by lock_kernel()
+ * and unlock_kernel().? It is transparently dropped and reaquired
+ * over schedule().? It is used to protect legacy code that hasn't
+ * been migrated to a proper locking design yet.
+ *
+ * Note: code locked by this semaphore will only be serialized against
+ * other code using the same locking facility. The code guarantees that
+ * the task remains on the same CPU.
+ *
+ * Don't use in new code.
+ */
+#ifdef CONFIG_BKL_SEM
+static __cacheline_aligned_in_smp DECLARE_MUTEX(kernel_sem);
+#else
+kmutex_t kernel_flag __cacheline_aligned_in_smp = KMUTEX_INIT;
+#endif
+
+int kernel_locked(void)
+{
+ return current->lock_depth >= 0;
+}
+
+EXPORT_SYMBOL(kernel_locked);
+
+static inline void put_kernel_sem(void)
+{
+ current->cpus_allowed = current->saved_cpus_allowed;
+#ifdef CONFIG_BKL_SEM
+ up(&kernel_sem);
+#else
+ kmutex_unlock(&kernel_flag);
+#endif
+}
+
+/*
+ * Release global kernel semaphore:
+ */
+static inline void release_kernel_sem(struct task_struct *task)
+{
+ if (unlikely(task->lock_depth >= 0))
+ put_kernel_sem();
+}
+
+/*
+ * Re-acquire the kernel semaphore.
+ *
+ * This function is called with preemption off.
+ *
+ * We are executing in schedule() so the code must be extremely careful
+ * about recursion, both due to the down() and due to the enabling of
+ * preemption. schedule() will re-check the preemption flag after
+ * reacquiring the semaphore.
+ */
+static inline void reacquire_kernel_sem(struct task_struct *task)
+{
+ int this_cpu, saved_lock_depth = task->lock_depth;
+
+ if (likely(saved_lock_depth < 0))
+ return;
+
+ task->lock_depth = -1;
+ preempt_enable_no_resched();
+
+#ifdef CONFIG_BKL_SEM
+ down(&kernel_sem);
+#else
+ kmutex_lock(&kernel_flag);
+#endif
+ this_cpu = get_cpu();
+ /*
+ * Magic. We can pin the task to this CPU safely and
+ * cheaply here because we have preemption disabled
+ * and we are obviously running on the current CPU:
+ */
+ current->saved_cpus_allowed = current->cpus_allowed;
+ current->cpus_allowed = cpumask_of_cpu(this_cpu);
+ task->lock_depth = saved_lock_depth;
+}
+
+/*
+ * Getting the big kernel semaphore.
+ */
+void lock_kernel(void)
+{
+ int this_cpu, depth = current->lock_depth + 1;
+
+ if (likely(!depth)) {
+ /*
+ * No recursion worries - we set up lock_depth _after_
+ */
+#ifdef CONFIG_BKL_SEM
+ down(&kernel_sem);
+#else
+ kmutex_lock(&kernel_flag);
+#endif
+ this_cpu = get_cpu();
+ current->saved_cpus_allowed = current->cpus_allowed;
+ current->cpus_allowed = cpumask_of_cpu(this_cpu);
+ current->lock_depth = depth;
+ put_cpu();
+ } else
+ current->lock_depth = depth;
+}
+
+EXPORT_SYMBOL(lock_kernel);
+
+void unlock_kernel(void)
+{
+ BUG_ON(current->lock_depth < 0);
+
+ if (likely(--current->lock_depth < 0))
+ put_kernel_sem();
+}
+
+EXPORT_SYMBOL(unlock_kernel);
+
+#else
+
+static inline void release_kernel_sem(struct task_struct *task) { }
+static inline void reacquire_kernel_sem(struct task_struct *task) { }
+
+#endif
+#endif /* INGO's BKL */
+
+
/*
* schedule() is the main scheduler function.
*/
@@ -2645,12 +2788,15 @@ asmlinkage void __sched schedule(void)
dump_stack();
}
}
-
need_resched:
preempt_disable();
prev = current;
rq = this_rq();
-
+#ifdef CONFIG_INGO_BKL
+ release_kernel_sem(prev);
+#else
+ release_kernel_lock(prev);
+#endif
/*
* The idle thread is not allowed to schedule!
* Remove this check after it has been exercised a bit.
@@ -2660,8 +2806,8 @@ need_resched:
dump_stack();
}

- release_kernel_lock(prev);
schedstat_inc(rq, sched_cnt);
+
now = sched_clock();
if (likely(now - prev->timestamp < NS_MAX_SLEEP_AVG))
run_time = now - prev->timestamp;
@@ -2781,7 +2927,11 @@ switch_tasks:
} else
spin_unlock_irq(&rq->lock);

+#ifdef CONFIG_INGO_BKL
+ reacquire_kernel_sem(current);
+#else
reacquire_kernel_lock(current);
+#endif
preempt_enable_no_resched();
if (unlikely(test_thread_flag(TIF_NEED_RESCHED)))
goto need_resched;
@@ -2798,6 +2948,9 @@ EXPORT_SYMBOL(schedule);
asmlinkage void __sched preempt_schedule(void)
{
struct thread_info *ti = current_thread_info();
+#ifdef CONFIG_INGO_BKL
+ int saved_lock_depth;
+#endif

/*
* If there is a non-zero preempt_count or interrupts are disabled,
@@ -2808,7 +2961,19 @@ asmlinkage void __sched preempt_schedule

need_resched:
ti->preempt_count = PREEMPT_ACTIVE;
+#ifdef CONFIG_INGO_BKL
+ /*
+ * We keep the big kernel semaphore locked, but we
+ * clear ->lock_depth so that schedule() doesn't
+ * auto-release the semaphore:
+ */
+ saved_lock_depth = current->lock_depth;
+ current->lock_depth = 0;
schedule();
+ current->lock_depth = saved_lock_depth;
+#else
+ schedule();
+#endif
ti->preempt_count = 0;

/* we could miss a preemption opportunity between schedule and now */
@@ -3790,7 +3955,7 @@ void __devinit init_idle(task_t *idle, i
spin_unlock_irqrestore(&rq->lock, flags);

/* Set the preempt count _outside_ the spinlocks! */
-#ifdef CONFIG_PREEMPT
+#if defined CONFIG_PREEMPT && !defined CONFIG_INGO_BKL
idle->thread_info->preempt_count = (idle->lock_depth >= 0);
#else
idle->thread_info->preempt_count = 0;
@@ -3839,13 +4004,23 @@ int set_cpus_allowed(task_t *p, cpumask_
migration_req_t req;
runqueue_t *rq;

+#ifdef CONFIG_INGO_BKL
+ lock_kernel();
+#endif
rq = task_rq_lock(p, &flags);
+
if (!cpus_intersects(new_mask, cpu_online_map)) {
+#ifdef CONFIG_INGO_BKL
+ unlock_kernel();
+#endif
ret = -EINVAL;
goto out;
}

p->cpus_allowed = new_mask;
+#ifdef CONFIG_INGO_BKL
+ unlock_kernel();
+#endif
/* Can the task run on the task's current CPU? If so, we're done */
if (cpu_isset(task_cpu(p), new_mask))
goto out;
@@ -4205,8 +4380,11 @@ int __init migration_init(void)
*
* Note: spinlock debugging needs this even on !CONFIG_SMP.
*/
+#if !defined(CONFIG_INGO_BKL)
spinlock_t kernel_flag __cacheline_aligned_in_smp = SPIN_LOCK_UNLOCKED;
EXPORT_SYMBOL(kernel_flag);
+#endif
+

#ifdef CONFIG_SMP
/* Attach the domain 'sd' to 'cpu' as its base domain */
@@ -4766,3 +4944,23 @@ void __might_sleep(char *file, int line)
}
EXPORT_SYMBOL(__might_sleep);
#endif
+
+
+#if defined(CONFIG_SMP) && defined(CONFIG_PREEMPT)
+/*
+ * This could be a long-held lock. If another CPU holds it for a long time,
+ * and that CPU is not asked to reschedule then *this* CPU will spin on the
+ * lock for a long time, even if *this* CPU is asked to reschedule.
+ *
+ * So what we do here, in the slow (contended) path is to spin on the lock by
+ * hand while permitting preemption.
+ *
+ * Called inside preempt_disable().
+ */
+
+/* these functions are only called from inside spin_lock
+ * and old_write_lock therefore under spinlock substitution
+ * they will only be passed old spinlocks or old rwlocks as parameter
+ * there are no issues with modified mutex behavior here. */
+
+#endif /* defined(CONFIG_SMP) && defined(CONFIG_PREEMPT) */
diff -pruN a/kernel/softirq.c b/kernel/softirq.c
--- a/kernel/softirq.c 2004-10-09 03:50:45.000000000 +0400
+++ b/kernel/softirq.c 2004-10-09 04:01:36.000000000 +0400
@@ -16,6 +16,12 @@
#include <linux/cpu.h>
#include <linux/kthread.h>
#include <linux/rcupdate.h>
+#include <asm/uaccess.h>
+
+#ifdef CONFIG_SOFTIRQ_THREADS
+static const int softirq_prio = MAX_USER_RT_PRIO - 8;
+#endif
+

#include <asm/irq.h>
/*
@@ -45,6 +51,10 @@ static struct softirq_action softirq_vec

static DEFINE_PER_CPU(struct task_struct *, ksoftirqd);

+#ifdef CONFIG_SOFTIRQ_THREADS
+static DEFINE_PER_CPU(struct task_struct *, ksoftirqd_high_prio);
+#endif
+
/*
* we cannot loop indefinitely here to avoid userspace starvation,
* but we also don't want to introduce a worst case 1/HZ latency
@@ -56,10 +66,25 @@ static inline void wakeup_softirqd(void)
/* Interrupts are disabled: no need to stop preemption */
struct task_struct *tsk = __get_cpu_var(ksoftirqd);

- if (tsk && tsk->state != TASK_RUNNING)
+ if (tsk && (tsk->state != TASK_RUNNING &&
+ tsk->state != TASK_UNINTERRUPTIBLE))
wake_up_process(tsk);
}

+#ifdef CONFIG_SOFTIRQ_THREADS
+
+static inline void wakeup_softirqd_high_prio(void)
+{
+ /* Interrupts are disabled: no need to stop preemption */
+ struct task_struct *tsk = __get_cpu_var(ksoftirqd_high_prio);
+
+ if (tsk && (tsk->state != TASK_RUNNING &&
+ tsk->state != TASK_UNINTERRUPTIBLE))
+ wake_up_process(tsk);
+}
+
+#endif
+
/*
* We restart softirq processing MAX_SOFTIRQ_RESTART times,
* and we fall back to softirqd after that.
@@ -118,8 +143,13 @@ asmlinkage void do_softirq(void)
__u32 pending;
unsigned long flags;

+#ifdef CONFIG_SOFTIRQ_THREADS
+ if (in_interrupt())
+ BUG();
+#else
if (in_interrupt())
return;
+#endif

local_irq_save(flags);

@@ -135,17 +165,20 @@ EXPORT_SYMBOL(do_softirq);

#endif

+#ifndef CONFIG_SOFTIRQ_THREADS
+
void local_bh_enable(void)
{
__local_bh_enable();
WARN_ON(irqs_disabled());
- if (unlikely(!in_interrupt() &&
- local_softirq_pending()))
+ if (unlikely(!in_interrupt() && local_softirq_pending()))
invoke_softirq();
preempt_check_resched();
}
EXPORT_SYMBOL(local_bh_enable);

+#endif
+
/*
* This function must run with irqs disabled!
*/
@@ -162,8 +195,19 @@ inline fastcall void raise_softirq_irqof
* Otherwise we wake up ksoftirqd to make sure we
* schedule the softirq soon.
*/
+#ifdef CONFIG_SOFTIRQ_THREADS
+
+ if (in_interrupt() || (current->flags & PF_IRQHANDLER))
+ wakeup_softirqd_high_prio();
+ else
+ wakeup_softirqd();
+
+#else
+
if (!in_interrupt())
wakeup_softirqd();
+
+#endif
}

EXPORT_SYMBOL(raise_softirq_irqoff);
@@ -319,6 +363,47 @@ void tasklet_kill(struct tasklet_struct

EXPORT_SYMBOL(tasklet_kill);

+#ifdef CONFIG_SOFTIRQ_THREADS
+
+static int ksoftirqd_high_prio(void *__bind_cpu)
+{
+ int cpu = (int)(long)__bind_cpu;
+ struct sched_param param = { .sched_priority = softirq_prio };
+
+ /* Yuck. Thanks for separating the implementation from the
+ user API. */
+
+ set_fs(KERNEL_DS);
+ sys_sched_setscheduler(0, SCHED_FIFO, &param);
+
+ current->flags |= PF_NOFREEZE; /* PF_IOTHREAD in < 2.6.5 */
+
+ /* Migrate to the right CPU */
+ set_cpus_allowed(current, cpumask_of_cpu(cpu));
+ BUG_ON(smp_processor_id() != cpu);
+
+ __set_current_state(TASK_INTERRUPTIBLE);
+ mb();
+
+ __get_cpu_var(ksoftirqd_high_prio) = current;
+
+ for (;;) {
+ if (!local_softirq_pending())
+ schedule();
+
+ __set_current_state(TASK_RUNNING);
+
+ while (local_softirq_pending()) {
+ do_softirq();
+ cond_resched();
+ }
+
+ __set_current_state(TASK_INTERRUPTIBLE);
+ }
+}
+
+#endif
+
void __init softirq_init(void)
{
open_softirq(TASKLET_SOFTIRQ, tasklet_action, NULL);
@@ -430,15 +515,28 @@ static int __devinit cpu_callback(struct
case CPU_UP_PREPARE:
BUG_ON(per_cpu(tasklet_vec, hotcpu).list);
BUG_ON(per_cpu(tasklet_hi_vec, hotcpu).list);
- p = kthread_create(ksoftirqd, hcpu, "ksoftirqd/%d", hotcpu);
+ p = kthread_create(ksoftirqd, hcpu, "ksoftirqd/l%d", hotcpu);
if (IS_ERR(p)) {
- printk("ksoftirqd for %i failed\n", hotcpu);
+ printk("ksoftirqd/l%i failed\n", hotcpu);
return NOTIFY_BAD;
}
kthread_bind(p, hotcpu);
per_cpu(ksoftirqd, hotcpu) = p;
+#ifdef CONFIG_SOFTIRQ_THREADS
+ p = kthread_create(ksoftirqd_high_prio, hcpu, "ksoftirqd/h%d", hotcpu);
+ if (IS_ERR(p)) {
+ printk("ksoftirqd/h%i failed\n", hotcpu);
+ return NOTIFY_BAD;
+ }
+ per_cpu(ksoftirqd_high_prio, hotcpu) = p;
+ kthread_bind(p, hotcpu);
+ per_cpu(ksoftirqd_high_prio, hotcpu) = p;
+#endif
break;
case CPU_ONLINE:
+#ifdef CONFIG_SOFTIRQ_THREADS
+ wake_up_process(per_cpu(ksoftirqd_high_prio, hotcpu));
+#endif
wake_up_process(per_cpu(ksoftirqd, hotcpu));
break;
#ifdef CONFIG_HOTPLUG_CPU
diff -pruN a/Makefile b/Makefile
--- a/Makefile 2004-10-09 03:51:27.000000000 +0400
+++ b/Makefile 2004-10-09 04:01:36.000000000 +0400
@@ -1,7 +1,7 @@
VERSION = 2
PATCHLEVEL = 6
SUBLEVEL = 9
-EXTRAVERSION = -rc3
+EXTRAVERSION = -rc3-RT
NAME=Zonked Quokka

# *DOCUMENTATION*
diff -pruN a/mm/slab.c b/mm/slab.c
--- a/mm/slab.c 2004-10-09 03:50:45.000000000 +0400
+++ b/mm/slab.c 2004-10-09 04:01:36.000000000 +0400
@@ -2730,6 +2730,10 @@ static void drain_array_locked(kmem_cach
static void cache_reap(void *unused)
{
struct list_head *walk;
+#if DEBUG && !defined(CONFIG_SOFTIRQ_THREADS)
+ BUG_ON(!in_interrupt());
+ BUG_ON(in_irq());
+#endif

if (down_trylock(&cache_chain_sem)) {
/* Give up. Setup the next iteration. */
diff -pruN a/net/ipv4/ipconfig.c b/net/ipv4/ipconfig.c
--- a/net/ipv4/ipconfig.c 2004-10-09 03:50:45.000000000 +0400
+++ b/net/ipv4/ipconfig.c 2004-10-09 04:01:36.000000000 +0400
@@ -1100,8 +1100,10 @@ static int __init ic_dynamic(void)

jiff = jiffies + (d->next ? CONF_INTER_TIMEOUT : timeout);
while (time_before(jiffies, jiff) && !ic_got_reply) {
- barrier();
- cpu_relax();
+ /* need to drop the BKL here to allow preemption. */
+
+ set_current_state(TASK_INTERRUPTIBLE);
+ schedule_timeout(1);
}
#ifdef IPCONFIG_DHCP
/* DHCP isn't done until we get a DHCPACK. */




2004-10-09 06:40:52

by Lee Revell

[permalink] [raw]
Subject: Re: [ANNOUNCE] Linux 2.6 Real Time Kernel

On Sat, 2004-10-09 at 01:59, Sven-Thorsten Dietrich wrote:
> Announcing the availability of prototype real-time (RT)
> enhancements to the Linux 2.6 kernel.
>

Does not compile:

CC arch/i386/kernel/semaphore.o
CC arch/i386/kernel/signal.o
AS arch/i386/kernel/entry.o
CC arch/i386/kernel/traps.o
CC arch/i386/kernel/irq.o
arch/i386/kernel/irq.c: In function `do_IRQ':
arch/i386/kernel/irq.c:582: error: too many arguments to function `note_interrupt'
arch/i386/kernel/irq.c:667: warning: ISO C90 forbids mixed declarations and code
arch/i386/kernel/irq.c:751: error: initializer element is not constant
arch/i386/kernel/irq.c:751: error: (near initialization for `__ksymtab_request_irq.value')
arch/i386/kernel/irq.c:809: error: initializer element is not constant
arch/i386/kernel/irq.c:809: error: (near initialization for `__ksymtab_free_irq.value')
arch/i386/kernel/irq.c:904: error: initializer element is not constant
arch/i386/kernel/irq.c:904: error: (near initialization for `__ksymtab_probe_irq_on.value')
arch/i386/kernel/irq.c:1004: error: initializer element is not constant
arch/i386/kernel/irq.c:1004: error: (near initialization for `__ksymtab_probe_irq_off.value')
arch/i386/kernel/irq.c:1246: error: initializer element is not constant
arch/i386/kernel/irq.c:1246: error: (near initialization for `__ksymtab_do_softirq.value')
arch/i386/kernel/irq.c:1246: error: parse error at end of input
arch/i386/kernel/irq.c:648: warning: label `out_no_end' defined but not used
arch/i386/kernel/irq.c:79: warning: 'register_irq_proc' declared `static' but never defined
arch/i386/kernel/irq.c:277: warning: 'report_bad_irq' defined but not used
make[1]: *** [arch/i386/kernel/irq.o] Error 1
make: *** [arch/i386/kernel] Error 2

I am using gcc 3.4. I accepted all the default settings except I
enabled "Run all IRQS in threads".

Lee

2004-10-09 07:34:13

by Daniel Walker

[permalink] [raw]
Subject: Re: [ANNOUNCE] Linux 2.6 Real Time Kernel


Do you have 4k stacks turned off? The docs make note of this.

Daniel Walker


On Fri, 2004-10-08 at 23:40, Lee Revell wrote:
> On Sat, 2004-10-09 at 01:59, Sven-Thorsten Dietrich wrote:
> > Announcing the availability of prototype real-time (RT)
> > enhancements to the Linux 2.6 kernel.
> >
>
> Does not compile:
>
> CC arch/i386/kernel/semaphore.o
> CC arch/i386/kernel/signal.o
> AS arch/i386/kernel/entry.o
> CC arch/i386/kernel/traps.o
> CC arch/i386/kernel/irq.o
> arch/i386/kernel/irq.c: In function `do_IRQ':
> arch/i386/kernel/irq.c:582: error: too many arguments to function `note_interrupt'
> arch/i386/kernel/irq.c:667: warning: ISO C90 forbids mixed declarations and code
> arch/i386/kernel/irq.c:751: error: initializer element is not constant
> arch/i386/kernel/irq.c:751: error: (near initialization for `__ksymtab_request_irq.value')
> arch/i386/kernel/irq.c:809: error: initializer element is not constant
> arch/i386/kernel/irq.c:809: error: (near initialization for `__ksymtab_free_irq.value')
> arch/i386/kernel/irq.c:904: error: initializer element is not constant
> arch/i386/kernel/irq.c:904: error: (near initialization for `__ksymtab_probe_irq_on.value')
> arch/i386/kernel/irq.c:1004: error: initializer element is not constant
> arch/i386/kernel/irq.c:1004: error: (near initialization for `__ksymtab_probe_irq_off.value')
> arch/i386/kernel/irq.c:1246: error: initializer element is not constant
> arch/i386/kernel/irq.c:1246: error: (near initialization for `__ksymtab_do_softirq.value')
> arch/i386/kernel/irq.c:1246: error: parse error at end of input
> arch/i386/kernel/irq.c:648: warning: label `out_no_end' defined but not used
> arch/i386/kernel/irq.c:79: warning: 'register_irq_proc' declared `static' but never defined
> arch/i386/kernel/irq.c:277: warning: 'report_bad_irq' defined but not used
> make[1]: *** [arch/i386/kernel/irq.o] Error 1
> make: *** [arch/i386/kernel] Error 2
>
> I am using gcc 3.4. I accepted all the default settings except I
> enabled "Run all IRQS in threads".
>
> Lee
>

2004-10-09 07:42:48

by Lee Revell

[permalink] [raw]
Subject: Re: [ANNOUNCE] Linux 2.6 Real Time Kernel

On Sat, 2004-10-09 at 03:33, Daniel Walker wrote:
> Do you have 4k stacks turned off? The docs make note of this.
>

My mistake, it works now.

Lee

2004-10-09 08:52:47

by Lee Revell

[permalink] [raw]
Subject: Re: [ANNOUNCE] Linux 2.6 Real Time Kernel

On Sat, 2004-10-09 at 03:33, Daniel Walker wrote:
> Do you have 4k stacks turned off? The docs make note of this.
>

OK after fixing this it builds OK, but several modules complain about
unresolved symbols:

Oct 9 04:43:23 krustophenia kernel: usbcore: Unknown symbol kmutex_unlock
Oct 9 04:43:23 krustophenia kernel: usbcore: Unknown symbol kmutex_lock
Oct 9 04:43:23 krustophenia kernel: usbcore: Unknown symbol kmutex_init
Oct 9 04:43:23 krustophenia kernel: uhci_hcd: Unknown symbol usb_hcd_pci_probe
Oct 9 04:43:23 krustophenia kernel: uhci_hcd: Unknown symbol usb_check_bandwidth
Oct 9 04:43:23 krustophenia kernel: uhci_hcd: Unknown symbol usb_disabled
Oct 9 04:43:23 krustophenia kernel: uhci_hcd: Unknown symbol usb_release_bandwidth
Oct 9 04:43:23 krustophenia kernel: uhci_hcd: Unknown symbol usb_register_root_hub
Oct 9 04:43:23 krustophenia kernel: uhci_hcd: Unknown symbol usb_put_dev
Oct 9 04:43:23 krustophenia kernel: uhci_hcd: Unknown symbol usb_get_dev
Oct 9 04:43:23 krustophenia kernel: uhci_hcd: Unknown symbol usb_claim_bandwidth
Oct 9 04:43:23 krustophenia kernel: uhci_hcd: Unknown symbol usb_hcd_giveback_urb
Oct 9 04:43:23 krustophenia kernel: uhci_hcd: Unknown symbol kmutex_unlock
Oct 9 04:43:23 krustophenia kernel: uhci_hcd: Unknown symbol kmutex_lock
Oct 9 04:43:23 krustophenia kernel: uhci_hcd: Unknown symbol usb_hcd_pci_remove
Oct 9 04:43:23 krustophenia kernel: uhci_hcd: Unknown symbol kmutex_init
Oct 9 04:43:23 krustophenia kernel: uhci_hcd: Unknown symbol usb_alloc_dev
Oct 9 04:43:23 krustophenia kernel: usbcore: Unknown symbol kmutex_unlock
Oct 9 04:43:23 krustophenia kernel: usbcore: Unknown symbol kmutex_lock
Oct 9 04:43:23 krustophenia kernel: usbcore: Unknown symbol kmutex_init
Oct 9 04:43:23 krustophenia kernel: usbhid: Unknown symbol usb_alloc_urb
Oct 9 04:43:23 krustophenia kernel: usbhid: Unknown symbol usb_free_urb
Oct 9 04:43:23 krustophenia kernel: usbhid: Unknown symbol usb_register
Oct 9 04:43:23 krustophenia kernel: usbhid: Unknown symbol usb_submit_urb
Oct 9 04:43:23 krustophenia kernel: usbhid: Unknown symbol usb_control_msg
Oct 9 04:43:23 krustophenia kernel: usbhid: Unknown symbol usb_deregister
Oct 9 04:43:23 krustophenia kernel: usbhid: Unknown symbol usb_string
Oct 9 04:43:23 krustophenia kernel: usbhid: Unknown symbol usb_unlink_urb
Oct 9 04:43:23 krustophenia kernel: usbhid: Unknown symbol kmutex_unlock
Oct 9 04:43:23 krustophenia kernel: usbhid: Unknown symbol kmutex_lock
Oct 9 04:43:23 krustophenia kernel: usbhid: Unknown symbol usb_kill_urb
Oct 9 04:43:23 krustophenia kernel: usbhid: Unknown symbol usb_buffer_free
Oct 9 04:43:23 krustophenia kernel: usbhid: Unknown symbol kmutex_init
Oct 9 04:43:23 krustophenia kernel: usbhid: Unknown symbol __usb_get_extra_descriptor
Oct 9 04:43:23 krustophenia kernel: usbhid: Unknown symbol usb_buffer_alloc
Oct 9 04:43:23 krustophenia kernel: via_rhine: Unknown symbol kmutex_unlock
Oct 9 04:43:23 krustophenia kernel: via_rhine: Unknown symbol kmutex_lock
Oct 9 04:43:23 krustophenia kernel: via_rhine: Unknown symbol kmutex_init

Lee

2004-10-09 10:51:23

by Måns Rullgård

[permalink] [raw]
Subject: Re: [ANNOUNCE] Linux 2.6 Real Time Kernel

Sven-Thorsten Dietrich <[email protected]> writes:

> +#if defined(CONFIG_SMP) && defined(CONFIG_PREEMPT)
> +/*
> + * This could be a long-held lock. If another CPU holds it for a long time,
> + * and that CPU is not asked to reschedule then *this* CPU will spin on the
> + * lock for a long time, even if *this* CPU is asked to reschedule.
> + *
> + * So what we do here, in the slow (contended) path is to spin on the lock by
> + * hand while permitting preemption.
> + *
> + * Called inside preempt_disable().
> + */
> +
> +/* these functions are only called from inside spin_lock
> + * and old_write_lock therefore under spinlock substitution
> + * they will only be passed old spinlocks or old rwlocks as parameter
> + * there are no issues with modified mutex behavior here. */
> +
> +#endif /* defined(CONFIG_SMP) && defined(CONFIG_PREEMPT) */

May I inquire as to the purpose of placing a couple of comments under
an #ifdef?

--
Måns Rullgård
[email protected]

2004-10-09 12:54:20

by John Hedditch

[permalink] [raw]
Subject: Re: [ANNOUNCE] Linux 2.6 Real Time Kernel


By disabling compilation of usb, s2io and scsi I can get this to build and link, but it hangs immediately on getting
to init.

Cheers,
John

2004-10-09 13:15:33

by Måns Rullgård

[permalink] [raw]
Subject: Re: [ANNOUNCE] Linux 2.6 Real Time Kernel


I got this thing to build by adding a few EXPORT_SYMBOL, patch below.
Now it seems to be running quite well. I am, however, getting
occasional "bad: scheduling while atomic!" messages, all alike:

bad: scheduling while atomic!
[<c02ef301>] schedule+0x4e5/0x4ea
[<c0114cbe>] try_to_wake_up+0x99/0xa8
[<c01332e2>] __p_mutex_down+0xfe/0x190
[<c029e238>] alloc_skb+0x32/0xc3
[<c01335e0>] kmutex_is_locked+0x1f/0x33
[<c029fa63>] skb_queue_tail+0x1c/0x45
[<c02eb43e>] unix_stream_sendmsg+0x22c/0x38c
[<c029ab03>] sock_sendmsg+0xc9/0xe3
[<c029f95a>] skb_dequeue+0x4a/0x5b
[<c02eb9e6>] unix_stream_recvmsg+0x119/0x430
[<c0137f06>] __alloc_pages+0x1cc/0x33f
[<c01168ca>] autoremove_wake_function+0x0/0x43
[<c029af4b>] sock_readv_writev+0x6e/0x97
[<c029afec>] sock_writev+0x37/0x3e
[<c029afb5>] sock_writev+0x0/0x3e
[<c014ebb8>] do_readv_writev+0x1db/0x21f
[<c01168ca>] autoremove_wake_function+0x0/0x43
[<c014e5ca>] vfs_read+0xd0/0xf5
[<c014ec94>] vfs_writev+0x49/0x52
[<c014ed5a>] sys_writev+0x47/0x76
[<c0103f09>] sysenter_past_esp+0x52/0x71

USB, sound and wireless are all working nicely.

Now the patch:

--- kernel/kmutex.c~ 2004-10-09 12:51:37 +02:00
+++ kernel/kmutex.c 2004-10-09 13:50:43 +02:00
@@ -20,6 +20,7 @@
#include <linux/config.h>
#include <linux/kmutex.h>
#include <linux/sched.h>
+#include <linux/module.h>

# if defined CONFIG_PMUTEX
# include <linux/pmutex.h>
@@ -40,11 +41,14 @@
return p_mutex_trylock(&(lock->kmtx));
}

+EXPORT_SYMBOL(kmutex_trylock);

inline int kmutex_is_locked(struct kmutex *lock)
{
return p_mutex_is_locked(&(lock->kmtx));
}
+
+EXPORT_SYMBOL(kmutex_is_locked);
# endif


@@ -60,6 +64,7 @@
#endif
}

+EXPORT_SYMBOL(kmutex_init);

/*
* print warning is case kmutex_lock is called while preempt count is
@@ -88,6 +93,8 @@
#endif
}

+EXPORT_SYMBOL(kmutex_lock);
+
void kmutex_unlock(struct kmutex *lock)
{
#if defined CONFIG_KMUTEX_DEBUG
@@ -102,6 +109,7 @@
#endif
}

+EXPORT_SYMBOL(kmutex_unlock);

void kmutex_unlock_wait(struct kmutex * lock)
{
@@ -111,4 +119,4 @@
}
}

-
+EXPORT_SYMBOL(kmutex_unlock_wait);


--
Måns Rullgård
[email protected]

2004-10-09 17:33:27

by Karim Yaghmour

[permalink] [raw]
Subject: Re: [ANNOUNCE] Linux 2.6 Real Time Kernel


Sven-Thorsten Dietrich wrote:
> - Voluntary Preemption by Ingo Molnar
> - IRQ thread patches by Scott Wood and Ingo Molnar
> - BKL mutex patch by Ingo Molnar (with MV extensions)
> - PMutex from Germany's Universitaet der Bundeswehr, Munich
> - MontaVista mutex abstraction layer replacing spinlocks with mutexes

To the best of my understanding, this still doesn't provide deterministic
hard-real-time performance in Linux.

> There are several micro-kernel solutions available, which achieve
> the required performance, but there are two general concerns with
> such solutions:
>
> 1. Two separate kernel environments, creating more overall
> system complexity and application design complexity.
> 2. Legal controversy.

It's been quite a while since any of this has been true.

> In line with the above mentioned previous Kernel enhancements,
> our work is designed to be transparent to existing applications
> and drivers.

I guess you haven't taken a look at the work on RTAI/fusion lately.
Applications use the same Linux API, and get deterministic
hard-real-time response times. It's really much less complicated
to use than the above-suggested aggregate.

Karim
--
Author, Speaker, Developer, Consultant
Pushing Embedded and Real-Time Linux Systems Beyond the Limits
http://www.opersys.com || [email protected] || 1-866-677-4546

2004-10-09 18:30:40

by Lee Revell

[permalink] [raw]
Subject: Re: [ANNOUNCE] Linux 2.6 Real Time Kernel

On Sat, 2004-10-09 at 13:41, Karim Yaghmour wrote:
> Sven-Thorsten Dietrich wrote:
> > - Voluntary Preemption by Ingo Molnar
> > - IRQ thread patches by Scott Wood and Ingo Molnar
> > - BKL mutex patch by Ingo Molnar (with MV extensions)
> > - PMutex from Germany's Universitaet der Bundeswehr, Munich
> > - MontaVista mutex abstraction layer replacing spinlocks with mutexes
>
> To the best of my understanding, this still doesn't provide deterministic
> hard-real-time performance in Linux.

Using only the VP+IRQ thread patch, I ran my RT app for 11 million
cycles yesterday, with a maximum delay of 190 usecs. How would this not
satisfy a 200 usec hard RT constraint?
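
(For reference, such a test is essentially a periodic wakeup-latency loop.
A minimal sketch of that kind of loop is below; it is not the actual
application referred to above, and the period and priority values are
only illustrative assumptions.)

#include <sched.h>
#include <stdio.h>
#include <time.h>

#define PERIOD_NS 1000000	/* wake up every 1 ms (illustrative) */

int main(void)
{
	/* Illustrative priority; pick it relative to the IRQ threads. */
	struct sched_param param = { .sched_priority = 99 };
	struct timespec next, now;
	long max_delay = 0;

	sched_setscheduler(0, SCHED_FIFO, &param);
	clock_gettime(CLOCK_MONOTONIC, &next);

	for (;;) {
		next.tv_nsec += PERIOD_NS;
		if (next.tv_nsec >= 1000000000L) {
			next.tv_nsec -= 1000000000L;
			next.tv_sec++;
		}
		/* Sleep until the absolute deadline, then see how late we woke. */
		clock_nanosleep(CLOCK_MONOTONIC, TIMER_ABSTIME, &next, NULL);
		clock_gettime(CLOCK_MONOTONIC, &now);

		long delay = (now.tv_sec - next.tv_sec) * 1000000000L
			   + (now.tv_nsec - next.tv_nsec);
		if (delay > max_delay) {
			max_delay = delay;
			printf("new max delay: %ld ns\n", max_delay);
		}
	}
	return 0;
}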

PHB: "I've looked at your proposal and decided it can't be done"
Dilbert: "I just did it. It's working perfectly"

Lee

2004-10-09 19:24:59

by stefan.eletzhofer

[permalink] [raw]
Subject: Re: [ANNOUNCE] Linux 2.6 Real Time Kernel

On Sat, Oct 09, 2004 at 02:30:28PM -0400, Lee Revell wrote:
> On Sat, 2004-10-09 at 13:41, Karim Yaghmour wrote:
> > Sven-Thorsten Dietrich wrote:
> > > - Voluntary Preemption by Ingo Molnar
> > > - IRQ thread patches by Scott Wood and Ingo Molnar
> > > - BKL mutex patch by Ingo Molnar (with MV extensions)
> > > - PMutex from Germany's Universitaet der Bundeswehr, Munich
> > > - MontaVista mutex abstraction layer replacing spinlocks with mutexes
> >
> > To the best of my understanding, this still doesn't provide deterministic
> > hard-real-time performance in Linux.
>
> Using only the VP+IRQ thread patch, I ran my RT app for 11 million
> cycles yesterday, with a maximum delay of 190 usecs. How would this not
> satisfy a 200 usec hard RT constraint?

I think the keyword here is "deterministic", isn't it?

>
> PHB: "I've looked at your proposal and decided it can't be done"
> Dilbert: "I just did it. It's working perfectly"
>
> Lee
>

--
Stefan Eletzhofer
InQuant Data GBR
http://www.inquant.de
+49 (0) 751 35 44 112
+49 (0) 171 23 24 529 (Mobil)
+49 (0) 751 35 44 115 (FAX)

2004-10-09 19:31:06

by Lee Revell

[permalink] [raw]
Subject: Re: [ANNOUNCE] Linux 2.6 Real Time Kernel

On Sat, 2004-10-09 at 17:26, [email protected] wrote:
> On Sat, Oct 09, 2004 at 02:30:28PM -0400, Lee Revell wrote:
> > On Sat, 2004-10-09 at 13:41, Karim Yaghmour wrote:
> > > Sven-Thorsten Dietrich wrote:
> > > > - Voluntary Preemption by Ingo Molnar
> > > > - IRQ thread patches by Scott Wood and Ingo Molnar
> > > > - BKL mutex patch by Ingo Molnar (with MV extensions)
> > > > - PMutex from Germany's Universitaet der Bundeswehr, Munich
> > > > - MontaVista mutex abstraction layer replacing spinlocks with mutexes
> > >
> > > To the best of my understanding, this still doesn't provide deterministic
> > > hard-real-time performance in Linux.
> >
> > Using only the VP+IRQ thread patch, I ran my RT app for 11 million
> > cycles yesterday, with a maximum delay of 190 usecs. How would this not
> > satisfy a 200 usec hard RT constraint?
>
> I think the keyword here is "deterministic", isn't it?

Well, depends what you mean by deterministic. Some RT apps only require
an upper bound on response time. This is a form of determinism.

Lee

2004-10-09 19:37:05

by stefan.eletzhofer

[permalink] [raw]
Subject: Re: [ANNOUNCE] Linux 2.6 Real Time Kernel

On Sat, Oct 09, 2004 at 03:30:27PM -0400, Lee Revell wrote:
> On Sat, 2004-10-09 at 17:26, [email protected] wrote:
> > On Sat, Oct 09, 2004 at 02:30:28PM -0400, Lee Revell wrote:
> > > On Sat, 2004-10-09 at 13:41, Karim Yaghmour wrote:
> > > > Sven-Thorsten Dietrich wrote:
> > > > > - Voluntary Preemption by Ingo Molnar
> > > > > - IRQ thread patches by Scott Wood and Ingo Molnar
> > > > > - BKL mutex patch by Ingo Molnar (with MV extensions)
> > > > > - PMutex from Germany's Universitaet der Bundeswehr, Munich
> > > > > - MontaVista mutex abstraction layer replacing spinlocks with mutexes
> > > >
> > > > To the best of my understanding, this still doesn't provide deterministic
> > > > hard-real-time performance in Linux.
> > >
> > > Using only the VP+IRQ thread patch, I ran my RT app for 11 million
> > > cycles yesterday, with a maximum delay of 190 usecs. How would this not
> > > satisfy a 200 usec hard RT constraint?
> >
> > I think the keyword here is "deterministic", isn't it?
>
> Well, depends what you mean by deterministic. Some RT apps only require
> an upper bound on response time. This is a form of determinism.

Yes. But can you give that upper bound "a priori", that is w/o doing
measurements with your application?

Without that I think it's impossible to get _guaranteed_ upper
bounds, regardless of the application running. I think that's what
"hard real-time" is all about.

Stefan

>
> Lee
>

--
Stefan Eletzhofer
InQuant Data GBR
http://www.inquant.de
+49 (0) 751 35 44 112
+49 (0) 171 23 24 529 (Mobil)
+49 (0) 751 35 44 115 (FAX)

2004-10-09 19:39:15

by Måns Rullgård

[permalink] [raw]
Subject: Re: [ANNOUNCE] Linux 2.6 Real Time Kernel

Lee Revell <[email protected]> writes:

>> > > To the best of my understanding, this still doesn't provide
>> > > deterministic hard-real-time performance in Linux.
>> >
>> > Using only the VP+IRQ thread patch, I ran my RT app for 11 million
>> > cycles yesterday, with a maximum delay of 190 usecs. How would this not
>> > satisfy a 200 usec hard RT constraint?
>>
>> I think the keyword here is "deterministic", isn't it?
>
> Well, depends what you mean by deterministic. Some RT apps only require
> an upper bound on response time. This is a form of determinism.

Sure, but running for a zillion cycles without breaking some limit
doesn't guarantee that it never will happen. Being able to give such
a guarantee is what determinism is about.

--
Måns Rullgård
[email protected]

2004-10-09 19:47:24

by Lee Revell

[permalink] [raw]
Subject: Re: [ANNOUNCE] Linux 2.6 Real Time Kernel

On Sat, 2004-10-09 at 17:38, [email protected] wrote:
> On Sat, Oct 09, 2004 at 03:30:27PM -0400, Lee Revell wrote:
> > On Sat, 2004-10-09 at 17:26, [email protected] wrote:
> > > On Sat, Oct 09, 2004 at 02:30:28PM -0400, Lee Revell wrote:
> > > > On Sat, 2004-10-09 at 13:41, Karim Yaghmour wrote:
> > > > > Sven-Thorsten Dietrich wrote:
> > > > > > - Voluntary Preemption by Ingo Molnar
> > > > > > - IRQ thread patches by Scott Wood and Ingo Molnar
> > > > > > - BKL mutex patch by Ingo Molnar (with MV extensions)
> > > > > > - PMutex from Germany's Universitaet der Bundeswehr, Munich
> > > > > > - MontaVista mutex abstraction layer replacing spinlocks with mutexes
> > > > >
> > > > > To the best of my understanding, this still doesn't provide deterministic
> > > > > hard-real-time performance in Linux.
> > > >
> > > > Using only the VP+IRQ thread patch, I ran my RT app for 11 million
> > > > cycles yesterday, with a maximum delay of 190 usecs. How would this not
> > > > satisfy a 200 usec hard RT constraint?
> > >
> > > I think the keyword here is "deterministic", isn't it?
> >
> > Well, depends what you mean by deterministic. Some RT apps only require
> > an upper bound on response time. This is a form of determinism.
>
> Yes. But can you give that upper bound "a priori", that is w/o doing
> measurements with your application?
>

Yes. The upper bound on the response time of an RT task is a function
of the longest non-preemptible code path in the kernel. Currently this
is the processing of a single packet by netif_receive_skb.

AIUI hard realtime is about bounded response times. How does this not
qualify?

Lee

2004-10-09 20:04:33

by Karim Yaghmour

[permalink] [raw]
Subject: Re: [ANNOUNCE] Linux 2.6 Real Time Kernel


Lee Revell wrote:
> Yes. The upper bound on the response time of an RT task is a function
> of the longest non-preemptible code path in the kernel. Currently this
> is the processing of a single packet by netif_receive_skb.

And this has been demonstrated mathematically/algorithmically to be
true 100% of the time, regardless of the load and the driver set? IOW,
if I was building an automated industrial saw (based on a VP+IRQ-thread
kernel or a combination of the above-mentioned aggregate) with a
safety mechanism that depended on the kernel's responsiveness to
outside events to avoid bodily harm, would you be willing to put your
hand beneath it?

How about things like a hard-rt deterministic nanosleep() 100% of the
time with RTAI/fusion?

Karim
--
Author, Speaker, Developer, Consultant
Pushing Embedded and Real-Time Linux Systems Beyond the Limits
http://www.opersys.com || [email protected] || 1-866-677-4546

2004-10-09 20:19:20

by Robert Love

[permalink] [raw]
Subject: Re: [ANNOUNCE] Linux 2.6 Real Time Kernel

On Sat, 2004-10-09 at 15:47 -0400, Lee Revell wrote:

> Yes. The upper bound on the response time of an RT task is a function
> of the longest non-preemptible code path in the kernel. Currently this
> is the processing of a single packet by netif_receive_skb.
>
> AIUI hard realtime is about bounded response times. How does this not
> qualify?

I am actually in agreement with you, favoring this soft real-time
approach, but this is not bounded response time or determinism. There
are no guarantees, no measurements conducted with all possible inputs,
sizes, errors, and so on. This soft real-time approach gives great
average case--but the worst case is only a measurement on a specific
machine in a specific workload.

Robert Love


2004-10-09 20:18:49

by Lee Revell

[permalink] [raw]
Subject: Re: [ANNOUNCE] Linux 2.6 Real Time Kernel

On Sat, 2004-10-09 at 16:11, Karim Yaghmour wrote:
> Lee Revell wrote:
> > Yes. The upper bound on the response time of an RT task is a function
> > of the longest non-preemptible code path in the kernel. Currently this
> > is the processing of a single packet by netif_receive_skb.
>
> And this has been demonstrated mathematically/algorithmically to be
> true 100% of the time, regardless of the load and the driver set? IOW,
> if I was building an automated industrial saw (based on a VP+IRQ-thread
> kernel or a combination of the above-mentioned aggregate) with a
> safety mechanism that depended on the kernel's responsiveness to
> outside events to avoid bodily harm, would you be willing to put your
> hand beneath it?
>

In theory, I think yes, if all IRQs on the system run in threads except
the saw interrupt, and the RT task that controls the saw runs at a
higher priority than all the IRQ threads. You can guarantee that other
interrupts won't delay the saw, because the saw irq is the only thing on
the system that runs in interrupt context. With the current VP
implementation you are still bounded by the longest non-preemptible code
path in the kernel AKA the longest time that a spinlock is held.
Replacing most spinlocks with mutexes reduces this to less than 20 code
paths according to Mvista, which then can be individually audited for
RT-safeness.
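
For what it's worth, a minimal userspace sketch of that priority setup
(the value 90 is made up; it only needs to sit above whatever priority
the IRQ threads are given):

#include <sched.h>
#include <stdio.h>

int main(void)
{
        struct sched_param sp = { .sched_priority = 90 };

        /* Run the control task SCHED_FIFO, above every IRQ thread,
         * so only the non-threaded saw interrupt can preempt it. */
        if (sched_setscheduler(0, SCHED_FIFO, &sp) < 0) {
                perror("sched_setscheduler");
                return 1;
        }

        /* ... RT control loop ... */
        return 0;
}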

That being said, no way would I put my hand under the saw with the
current implementation. But, unless I am missing something, it seems
like this kind of determinism is possible with the Mvista design.

> How about things like a hard-rt deterministic nanosleep() 100% of the
> time with RTAI/fusion?

I will check that out, I have not looked at RTAI in over a year.

Lee

2004-10-09 20:27:52

by Lee Revell

[permalink] [raw]
Subject: Re: [ANNOUNCE] Linux 2.6 Real Time Kernel

On Sat, 2004-10-09 at 16:20, Robert Love wrote:
> On Sat, 2004-10-09 at 15:47 -0400, Lee Revell wrote:
>
> > Yes. The upper bound on the response time of an RT task is a function
> > of the longest non-preemptible code path in the kernel. Currently this
> > is the processing of a single packet by netif_receive_skb.
> >
> > AIUI hard realtime is about bounded response times. How does this not
> > qualify?
>
> I am actually in agreement with you, favoring this soft real-time
> approach, but this is not bounded response time or determinism. There
> are no guarantees, no measurements conducted with all possible inputs,
> sizes, errors, and so on. This soft real-time approach gives great
> average case--but the worst case is only a measurement on a specific
> machine in a specific workload.

I did not mean to say that VP approach alone can do hard realtime, that
was just an example. But, when combined the MontaVista approach of
turning all but ~20 spinlocks into mutexes, it seems like the amount of
non-preemptible code is small enough that you could analyze it all and
start to make hard RT guarantees.

Lee

2004-10-09 20:51:35

by Karim Yaghmour

[permalink] [raw]
Subject: Re: [ANNOUNCE] Linux 2.6 Real Time Kernel


Lee Revell wrote:
> In theory, I think yes, if all IRQs on the system run in threads except
> the saw interrupt, and the RT task that controls the saw runs at a
> higher priority than all the IRQ threads. You can guarantee that other
> interrupts won't delay the saw, because the saw irq is the only thing on
> the system that runs in interrupt context. With the current VP
> implementation you are still bounded by the longest non-preemptible code
> path in the kernel AKA the longest time that a spinlock is held.
> Replacing most spinlocks with mutexes reduces this to less than 20 code
> paths according to Mvista, which then can be individually audited for
> RT-safeness.
>
> That being said, no way would I put my hand under the saw with the
> current implementation. But, unless I am missing something, it seems
> like this kind of determinism is possible with the Mvista design.

It may be a question of taste, but even if that did work, which I am
not convinced of, it seems to me that it's awfully convoluted.
With the current interrupt pipeline mechanism part of Adeos, on
which RTAI and RTAI fusion are built, I can give you absolute hard-rt
deterministic guarantees while keeping the spinlocks intact, and not
having to check for the rt-safeness of any part of the kernel. You
just write the time-sensitive saw driver int handler in front of
Linux in the ipipe and you're done: 100% deterministic hard-rt,
regardless of the application load and the driver set.

> I will check that out, I have not looked at RTAI in over a year.

Here are some interesting links:

RTAI/fusion presentation by Philippe Gerum last July (see slide 25
for some interesting numbers):
http://www.enseirb.fr/~kadionik/rmll2004/presentation/philippe_gerum.pdf
Here's a thread that explains the details about RTAI/fusion:
https://mail.rtai.org/pipermail/rtai/2004-June/thread.html#7909
Here's the ipipe core API:
http://home.gna.org/adeos/doc/api/interface_8h.html

Karim
--
Author, Speaker, Developer, Consultant
Pushing Embedded and Real-Time Linux Systems Beyond the Limits
http://www.opersys.com || [email protected] || 1-866-677-4546


2004-10-09 20:59:26

by Lee Revell

[permalink] [raw]
Subject: Re: [ANNOUNCE] Linux 2.6 Real Time Kernel

On Sat, 2004-10-09 at 16:53, Karim Yaghmour wrote:
> Lee Revell wrote:
> > In theory, I think yes, if all IRQs on the system run in threads except
> > the saw interrupt, and the RT task that controls the saw runs at a
> > higher priority than all the IRQ threads. You can guarantee that other
> > interrupts won't delay the saw, because the saw irq is the only thing on
> > the system that runs in interrupt context. With the current VP
> > implementation you are still bounded by the longest non-preemptible code
> > path in the kernel AKA the longest time that a spinlock is held.
> > Replacing most spinlocks with mutexes reduces this to less than 20 code
> > paths according to Mvista, which then can be individually audited for
> > RT-safeness.
> >
> > That being said, no way would I put my hand under the saw with the
> > current implementation. But, unless I am missing something, it seems
> > like this kind of determinism is possible with the Mvista design.
>
> It may be a question of taste, but even if that did work, which I am
> not convinced of, it seems to me that it's awfully convoluted.
> With the current interrupt pipeline mechanism part of Adeos, on
> which RTAI and RTAI fusion are built, I can give you absolute hard-rt
> deterministic guarantees while keeping the spinlocks intact, and not
> having to check for the rt-safeness of any part of the kernel. You
> just write the time-sensitive saw driver int handler in front of
> Linux in the ipipe and you're done: 100% deterministic hard-rt,
> regardless of the application load and the driver set.

True, there are probably too many "ifs" in my above statement for a saw
or an airplane or a power plant. There does seem to be a gray area
between soft and hard realtime, where either approach could be
reasonable. For example the Mt. St. Helens example, where you could
miss a sample and it would be really bad, but not kill anyone.

Lee

2004-10-09 21:20:39

by Lee Revell

[permalink] [raw]
Subject: Re: [ANNOUNCE] Linux 2.6 Real Time Kernel

On Sat, 2004-10-09 at 09:15, Måns Rullgård wrote:
> I got this thing to build by adding a few EXPORT_SYMBOL, patch below.
> Now it seems to be running quite well. I am, however, getting
> occasional "bad: scheduling while atomic!" messages, all alike:
>

I am getting the same message. Also, leaving all the default debug
options on, I got this debug output, but it did not coincide with the
"bad" messages.

Mtx: dd84e644 [773] pri (0) inherit from [3] pri(92)
Mtx dd84e644 task [773] pri (92) restored pri(0). Next owner [3] pri (92)
Mtx: dd84e644 [773] pri (0) inherit from [3] pri(92)
Mtx dd84e644 task [773] pri (92) restored pri(0). Next owner [3] pri (92)
Mtx: dd84e644 [773] pri (0) inherit from [3] pri(92)
Mtx dd84e644 task [773] pri (92) restored pri(0). Next owner [3] pri (92)

Lee

2004-10-09 21:35:34

by Måns Rullgård

[permalink] [raw]
Subject: Re: [ANNOUNCE] Linux 2.6 Real Time Kernel

Lee Revell <[email protected]> writes:

> On Sat, 2004-10-09 at 09:15, Måns Rullgård wrote:
>> I got this thing to build by adding a few EXPORT_SYMBOL, patch below.
>> Now it seems to be running quite well. I am, however, getting
>> occasional "bad: scheduling while atomic!" messages, all alike:
>>
>
> I am getting the same message. Also, leaving all the default debug
> options on, I got this debug output, but it did not coincide with the
> "bad" messages.
>
> Mtx: dd84e644 [773] pri (0) inherit from [3] pri(92)
> Mtx dd84e644 task [773] pri (92) restored pri(0). Next owner [3] pri (92)
> Mtx: dd84e644 [773] pri (0) inherit from [3] pri(92)
> Mtx dd84e644 task [773] pri (92) restored pri(0). Next owner [3] pri (92)
> Mtx: dd84e644 [773] pri (0) inherit from [3] pri(92)
> Mtx dd84e644 task [773] pri (92) restored pri(0). Next owner [3] pri (92)

Well, those don't give me any clues.

I had the system running that kernel for a bit over an hour and got
five of the "bad" messages, approximately evenly spaced in a
two-minute interval about 20 minutes after boot.

I did notice one improvement compared to vanilla 2.6.8.1. The sound
didn't skip when I switched from X to a text console. However, my
keyboard no longer worked in X, but that seems to be due to some
recent changes to the input subsystem.

Did you build it with or without my patch, BTW?

--
Måns Rullgård
[email protected]

2004-10-09 21:41:19

by Lee Revell

[permalink] [raw]
Subject: Re: [ANNOUNCE] Linux 2.6 Real Time Kernel

On Sat, 2004-10-09 at 17:35, Måns Rullgård wrote:

> I did notice one improvement compared to vanilla 2.6.8.1. The sound
> didn't skip when I switched from X to a text console. However, my
> keyboard no longer worked in X, but that seems to be due to some
> recent changes to the input subsystem.
>
> Did you build it with or without my patch, BTW?

With. Most of the modules did not work without your patch.

Lee

2004-10-09 21:45:27

by Måns Rullgård

[permalink] [raw]
Subject: Re: [ANNOUNCE] Linux 2.6 Real Time Kernel

Lee Revell <[email protected]> writes:

> On Sat, 2004-10-09 at 17:35, Måns Rullgård wrote:
>
>> I did notice one improvement compared to vanilla 2.6.8.1. The sound
>> didn't skip when I switched from X to a text console. However, my
>> keyboard no longer worked in X, but that seems to be due to some
>> recent changes to the input subsystem.
>>
>> Did you build it with or without my patch, BTW?
>
> With. Most of the modules did not work without your patch.

Do the Montavista folks build their kernels without modules?

--
Måns Rullgård
[email protected]

2004-10-09 22:04:18

by Lee Revell

[permalink] [raw]
Subject: Re: [ANNOUNCE] Linux 2.6 Real Time Kernel

On Sat, 2004-10-09 at 17:35, Måns Rullgård wrote:
> Lee Revell <[email protected]> writes:
>
> > On Sat, 2004-10-09 at 09:15, Måns Rullgård wrote:
> >> I got this thing to build by adding a few EXPORT_SYMBOL, patch below.
> >> Now it seems to be running quite well. I am, however, getting
> >> occasional "bad: scheduling while atomic!" messages, all alike:
> >>
> >
> > I am getting the same message. Also, leaving all the default debug
> > options on, I got this debug output, but it did not coincide with the
> > "bad" messages.
> >
> > Mtx: dd84e644 [773] pri (0) inherit from [3] pri(92)
> > Mtx dd84e644 task [773] pri (92) restored pri(0). Next owner [3] pri (92)
> > Mtx: dd84e644 [773] pri (0) inherit from [3] pri(92)
> > Mtx dd84e644 task [773] pri (92) restored pri(0). Next owner [3] pri (92)
> > Mtx: dd84e644 [773] pri (0) inherit from [3] pri(92)
> > Mtx dd84e644 task [773] pri (92) restored pri(0). Next owner [3] pri (92)
>
> Well, those don't give me any clues.
>
> I had the system running that kernel for a bit over an hour and got
> five of the "bad" messages, approximately evenly spaced in a
> two-minute interval about 20 minutes after boot.
>

I am getting these too:

bad: scheduling while atomic!
[<c0279c5a>] schedule+0x62a/0x630
[<c013b137>] kmutex_unlock+0x37/0x50
[<c013ab0d>] __p_mutex_down+0x1ed/0x360
[<c013b1e0>] kmutex_is_locked+0x20/0x40
[<c01cba47>] journal_dirty_data+0x77/0x230
[<c01bf2e2>] ext3_journal_dirty_data+0x12/0x40
[<c01bf150>] walk_page_buffers+0x60/0x70
[<c01bf7c7>] ext3_ordered_writepage+0xf7/0x160
[<c01bf6b0>] journal_dirty_data_fn+0x0/0x20
[<c018067d>] mpage_writepages+0x29d/0x3e0
[<c01bf6d0>] ext3_ordered_writepage+0x0/0x160
[<c0141c09>] do_writepages+0x39/0x50
[<c017ec5f>] __sync_single_inode+0x5f/0x220
[<c017f0b7>] sync_sb_inodes+0x1c7/0x2e0
[<c017f2c7>] writeback_inodes+0xf7/0x110
[<c0141a03>] wb_kupdate+0x93/0x100
[<c0142ccf>] __pdflush+0x2af/0x5a0
[<c0142fc0>] pdflush+0x0/0x30
[<c0142fde>] pdflush+0x1e/0x30
[<c0141970>] wb_kupdate+0x0/0x100
[<c0134af3>] kthread+0xa3/0xb0
[<c0134a50>] kthread+0x0/0xb0
[<c0103fe5>] kernel_thread_helper+0x5/0x10

Lee

2004-10-09 22:22:10

by Måns Rullgård

[permalink] [raw]
Subject: Re: [ANNOUNCE] Linux 2.6 Real Time Kernel

Lee Revell <[email protected]> writes:

> On Sat, 2004-10-09 at 17:35, Måns Rullgård wrote:
>> Lee Revell <[email protected]> writes:
>>
>> > On Sat, 2004-10-09 at 09:15, Måns Rullgård wrote:
>> >> I got this thing to build by adding a few EXPORT_SYMBOL, patch below.
>> >> Now it seems to be running quite well. I am, however, getting
>> >> occasional "bad: scheduling while atomic!" messages, all alike:
>> >>
>> >
>> > I am getting the same message. Also, leaving all the default debug
>> > options on, I got this debug output, but it did not coincide with the
>> > "bad" messages.
>> >
>> > Mtx: dd84e644 [773] pri (0) inherit from [3] pri(92)
>> > Mtx dd84e644 task [773] pri (92) restored pri(0). Next owner [3] pri (92)
>> > Mtx: dd84e644 [773] pri (0) inherit from [3] pri(92)
>> > Mtx dd84e644 task [773] pri (92) restored pri(0). Next owner [3] pri (92)
>> > Mtx: dd84e644 [773] pri (0) inherit from [3] pri(92)
>> > Mtx dd84e644 task [773] pri (92) restored pri(0). Next owner [3] pri (92)
>>
>> Well, those don't give me any clues.
>>
>> I had the system running that kernel for a bit over an hour and got
>> five of the "bad" messages, approximately evenly spaced in a
>> two-minute interval about 20 minutes after boot.
>>
>
> I am getting these too:
>
> bad: scheduling while atomic!
> [<c0279c5a>] schedule+0x62a/0x630
> [<c013b137>] kmutex_unlock+0x37/0x50
> [<c013ab0d>] __p_mutex_down+0x1ed/0x360
> [<c013b1e0>] kmutex_is_locked+0x20/0x40
> [<c01cba47>] journal_dirty_data+0x77/0x230
> [<c01bf2e2>] ext3_journal_dirty_data+0x12/0x40

My machine is mostly XFS, which might explain why I haven't seen any
of those. I've found XFS to perform better with the multi-gigabyte
files I often deal with.

--
Måns Rullgård
[email protected]

2004-10-09 22:55:38

by Sven-Thorsten Dietrich

[permalink] [raw]
Subject: RE: [ANNOUNCE] Linux 2.6 Real Time Kernel



Thanks for giving it a try!

The "bad: scheduling while atomic!" are indicative
of blocking on a mutex while holding a spinlock.

You can see the __p_mutex_down in the trace.
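
In other words, a minimal sketch of the offending pattern, using the
kmutex names from the patch quoted below (the spinlock here stands for
one that has not been converted to a mutex):

spinlock_t raw_lock = SPIN_LOCK_UNLOCKED;  /* still a real spinlock */
struct kmutex mtx;                         /* a converted lock */

void broken_path(void)
{
        spin_lock(&raw_lock);   /* raises preempt_count: atomic from here */
        kmutex_lock(&mtx);      /* may block in __p_mutex_down ->
                                   "bad: scheduling while atomic!" */
        /* ... */
        kmutex_unlock(&mtx);
        spin_unlock(&raw_lock);
}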

See the notes in the original announcement
regarding the partitioning: work in progress.

I can't offer a fix for that now, but I will
post an updated mutex patch for the
EXPORT_SYMBOLS / module build issue.

Sven

> -----Original Message-----
> From: [email protected]
> [mailto:[email protected]]On Behalf Of Måns Rullgård
> Sent: Saturday, October 09, 2004 6:15 AM
> To: [email protected]
> Subject: Re: [ANNOUNCE] Linux 2.6 Real Time Kernel
>
>
>
> I got this thing to build by adding a few EXPORT_SYMBOL, patch below.
> Now it seems to be running quite well. I am, however, getting
> occasional "bad: scheduling while atomic!" messages, all alike:
>
> bad: scheduling while atomic!
> [<c02ef301>] schedule+0x4e5/0x4ea
> [<c0114cbe>] try_to_wake_up+0x99/0xa8
> [<c01332e2>] __p_mutex_down+0xfe/0x190
> [<c029e238>] alloc_skb+0x32/0xc3
> [<c01335e0>] kmutex_is_locked+0x1f/0x33
> [<c029fa63>] skb_queue_tail+0x1c/0x45
> [<c02eb43e>] unix_stream_sendmsg+0x22c/0x38c
> [<c029ab03>] sock_sendmsg+0xc9/0xe3
> [<c029f95a>] skb_dequeue+0x4a/0x5b
> [<c02eb9e6>] unix_stream_recvmsg+0x119/0x430
> [<c0137f06>] __alloc_pages+0x1cc/0x33f
> [<c01168ca>] autoremove_wake_function+0x0/0x43
> [<c029af4b>] sock_readv_writev+0x6e/0x97
> [<c029afec>] sock_writev+0x37/0x3e
> [<c029afb5>] sock_writev+0x0/0x3e
> [<c014ebb8>] do_readv_writev+0x1db/0x21f
> [<c01168ca>] autoremove_wake_function+0x0/0x43
> [<c014e5ca>] vfs_read+0xd0/0xf5
> [<c014ec94>] vfs_writev+0x49/0x52
> [<c014ed5a>] sys_writev+0x47/0x76
> [<c0103f09>] sysenter_past_esp+0x52/0x71
>
> USB, sound and wireless are all working nicely.
>
> Now the patch:
>
> --- kernel/kmutex.c~ 2004-10-09 12:51:37 +02:00
> +++ kernel/kmutex.c 2004-10-09 13:50:43 +02:00
> @@ -20,6 +20,7 @@
> #include <linux/config.h>
> #include <linux/kmutex.h>
> #include <linux/sched.h>
> +#include <linux/module.h>
>
> # if defined CONFIG_PMUTEX
> # include <linux/pmutex.h>
> @@ -40,11 +41,14 @@
> return p_mutex_trylock(&(lock->kmtx));
> }
>
> +EXPORT_SYMBOL(kmutex_trylock);
>
> inline int kmutex_is_locked(struct kmutex *lock)
> {
> return p_mutex_is_locked(&(lock->kmtx));
> }
> +
> +EXPORT_SYMBOL(kmutex_is_locked);
> # endif
>
>
> @@ -60,6 +64,7 @@
> #endif
> }
>
> +EXPORT_SYMBOL(kmutex_init);
>
> /*
> * print warning is case kmutex_lock is called while preempt count is
> @@ -88,6 +93,8 @@
> #endif
> }
>
> +EXPORT_SYMBOL(kmutex_lock);
> +
> void kmutex_unlock(struct kmutex *lock)
> {
> #if defined CONFIG_KMUTEX_DEBUG
> @@ -102,6 +109,7 @@
> #endif
> }
>
> +EXPORT_SYMBOL(kmutex_unlock);
>
> void kmutex_unlock_wait(struct kmutex * lock)
> {
> @@ -111,4 +119,4 @@
> }
> }
>
> -
> +EXPORT_SYMBOL(kmutex_unlock_wait);
>
>
> --
> Måns Rullgård
> [email protected]
>

2004-10-09 23:21:07

by Dave Hansen

[permalink] [raw]
Subject: Re: [ANNOUNCE] Linux 2.6 Real Time Kernel

On Sat, 2004-10-09 at 00:33, Daniel Walker wrote:
> Do you have 4k stacks turned off? The docs make note of this.

Isn't this a better thing to spell out in a Kconfig file than some
documentation?

-- Dave

2004-10-09 23:24:56

by Lee Revell

[permalink] [raw]
Subject: Re: [ANNOUNCE] Linux 2.6 Real Time Kernel

On Sat, 2004-10-09 at 19:20, Dave Hansen wrote:
> On Sat, 2004-10-09 at 00:33, Daniel Walker wrote:
> > Do you have 4k stacks turned off? The docs make note of this.
>
> Isn't this a better thing to spell out in a Kconfig file than some
> documentation?

FWIW I did see this in the docs, it's clearly stated, I just forgot that
I had enabled 4k stacks.

Lee

2004-10-09 23:42:38

by Matthias Urlichs

[permalink] [raw]
Subject: Re: [ANNOUNCE] Linux 2.6 Real Time Kernel

Hi, Lee Revell wrote:

> On Sat, 2004-10-09 at 03:33, Daniel Walker wrote:
>> Do you have 4k stacks turned off? The docs make note of this.
>>
>
> My mistake, it works now.
>
Actually, if 4k stacks don't work with RT turned on, this exclusion should
be encoded in the appropriate Kconfig file(s).
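
Something along these lines in the i386 Kconfig would express it (a
sketch only; RT_MUTEX stands in for whatever symbol the RT patches end
up defining):

config 4KSTACKS
	bool "Use 4Kb for kernel stacks instead of 8Kb"
	depends on !RT_MUTEX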

--
Matthias Urlichs | {M:U} IT Design @ m-u-it.de | [email protected]

2004-10-09 23:52:48

by Lee Revell

[permalink] [raw]
Subject: Re: [ANNOUNCE] Linux 2.6 Real Time Kernel

On Sat, 2004-10-09 at 17:35, Måns Rullgård wrote:
> Lee Revell <[email protected]> writes:
>
> > On Sat, 2004-10-09 at 09:15, Måns Rullgård wrote:
> >> I got this thing to build by adding a few EXPORT_SYMBOL, patch below.
> >> Now it seems to be running quite well. I am, however, getting
> >> occasional "bad: scheduling while atomic!" messages, all alike:
> >>
> >
> > I am getting the same message. Also, leaving all the default debug
> > options on, I got this debug output, but it did not coincide with the
> > "bad" messages.
> >
> > Mtx: dd84e644 [773] pri (0) inherit from [3] pri(92)
> > Mtx dd84e644 task [773] pri (92) restored pri(0). Next owner [3] pri (92)
> > Mtx: dd84e644 [773] pri (0) inherit from [3] pri(92)
> > Mtx dd84e644 task [773] pri (92) restored pri(0). Next owner [3] pri (92)
> > Mtx: dd84e644 [773] pri (0) inherit from [3] pri(92)
> > Mtx dd84e644 task [773] pri (92) restored pri(0). Next owner [3] pri (92)
>
> Well, those don't give me any clues.

Pid 773 is the IRQ thread for eth0. I am using the via-rhine driver.

Lee

2004-10-10 00:05:40

by Måns Rullgård

[permalink] [raw]
Subject: Re: [ANNOUNCE] Linux 2.6 Real Time Kernel

Lee Revell <[email protected]> writes:

> On Sat, 2004-10-09 at 17:35, Måns Rullgård wrote:
>> Lee Revell <[email protected]> writes:
>>
>> > On Sat, 2004-10-09 at 09:15, Måns Rullgård wrote:
>> >> I got this thing to build by adding a few EXPORT_SYMBOL, patch below.
>> >> Now it seems to be running quite well. I am, however, getting
>> >> occasional "bad: scheduling while atomic!" messages, all alike:
>> >>
>> >
>> > I am getting the same message. Also, leaving all the default debug
>> > options on, I got this debug output, but it did not coincide with the
>> > "bad" messages.
>> >
>> > Mtx: dd84e644 [773] pri (0) inherit from [3] pri(92)
>> > Mtx dd84e644 task [773] pri (92) restored pri(0). Next owner [3] pri (92)
>> > Mtx: dd84e644 [773] pri (0) inherit from [3] pri(92)
>> > Mtx dd84e644 task [773] pri (92) restored pri(0). Next owner [3] pri (92)
>> > Mtx: dd84e644 [773] pri (0) inherit from [3] pri(92)
>> > Mtx dd84e644 task [773] pri (92) restored pri(0). Next owner [3] pri (92)
>>
>> Well, those don't give me any clues.
>
> Pid 773 is the IRQ thread for eth0. I am using the via-rhine driver.

I was using a prism54 wireless card.

--
Måns Rullgård
[email protected]

2004-10-10 00:42:20

by Micha

[permalink] [raw]
Subject: Re: [ANNOUNCE] Linux 2.6 Real Time Kernel

On Sat, Oct 09, 2004 at 11:35:16PM +0200, Måns Rullgård wrote:
> Lee Revell <[email protected]> writes:
>
> > On Sat, 2004-10-09 at 09:15, Måns Rullgård wrote:
> >> I got this thing to build by adding a few EXPORT_SYMBOL, patch below.
> >> Now it seems to be running quite well. I am, however, getting
> >> occasional "bad: scheduling while atomic!" messages, all alike:
> >>
> >
> > I am getting the same message. Also, leaving all the default debug
> > options on, I got this debug output, but it did not coincide with the
> > "bad" messages.
> >
> > Mtx: dd84e644 [773] pri (0) inherit from [3] pri(92)
> > Mtx dd84e644 task [773] pri (92) restored pri(0). Next owner [3] pri (92)
> > Mtx: dd84e644 [773] pri (0) inherit from [3] pri(92)
> > Mtx dd84e644 task [773] pri (92) restored pri(0). Next owner [3] pri (92)
> > Mtx: dd84e644 [773] pri (0) inherit from [3] pri(92)
> > Mtx dd84e644 task [773] pri (92) restored pri(0). Next owner [3] pri (92)
>
> Well, those don't give me any clues.
>
> I had the system running that kernel for a bit over an hour and got
> five of the "bad" messages, approximately evenly spaced in a
> two-minute interval about 20 minutes after boot.
>
> I did notice one improvement compared to vanilla 2.6.8.1. The sound
> didn't skip when I switched from X to a text console. However, my
> keyboard no longer worked in X, but that seems to be due to some
> recent changes to the input subsystem.

There was some change in 2.6.9-pre-something that caused the mouse and
keyboard to exchange event interfaces between them, if it interests you.

>
> Did you build it with or without my patch, BTW?
>
> --
> Måns Rullgård
> [email protected]

2004-10-10 00:45:49

by Lee Revell

[permalink] [raw]
Subject: Re: [ANNOUNCE] Linux 2.6 Real Time Kernel

On Sat, 2004-10-09 at 20:05, Måns Rullgård wrote:
> Lee Revell <[email protected]> writes:
>
> On Sat, 2004-10-09 at 17:35, Måns Rullgård wrote:
> >> Lee Revell <[email protected]> writes:
> >>
> >> > On Sat, 2004-10-09 at 09:15, Måns Rullgård wrote:
> >> >> I got this thing to build by adding a few EXPORT_SYMBOL, patch below.
> >> >> Now it seems to be running quite well. I am, however, getting
> >> >> occasional "bad: scheduling while atomic!" messages, all alike:
> >> >>
> >> >
> >> > I am getting the same message. Also, leaving all the default debug
> >> > options on, I got this debug output, but it did not coincide with the
> >> > "bad" messages.
> >> >
> >> > Mtx: dd84e644 [773] pri (0) inherit from [3] pri(92)
> >> > Mtx dd84e644 task [773] pri (92) restored pri(0). Next owner [3] pri (92)
> >> > Mtx: dd84e644 [773] pri (0) inherit from [3] pri(92)
> >> > Mtx dd84e644 task [773] pri (92) restored pri(0). Next owner [3] pri (92)
> >> > Mtx: dd84e644 [773] pri (0) inherit from [3] pri(92)
> >> > Mtx dd84e644 task [773] pri (92) restored pri(0). Next owner [3] pri (92)
> >>
> >> Well, those don't give me any clues.
> >
> > Pid 773 is the IRQ thread for eth0. I am using the via-rhine driver.
>
> I was using a prism54 wireless card.

OK, first bug: I lost my PS/2 keyboard, and had to reboot to get it
back. Unplugging and replugging it made Num Lock work again, but the
system did not respond to the keyboard at all. USB mouse continued to
work fine.

Lee

2004-10-10 01:06:15

by Måns Rullgård

[permalink] [raw]
Subject: Re: [ANNOUNCE] Linux 2.6 Real Time Kernel

Lee Revell <[email protected]> writes:

> On Sat, 2004-10-09 at 20:05, Måns Rullgård wrote:
>> Lee Revell <[email protected]> writes:
>>
>> > On Sat, 2004-10-09 at 17:35, Måns Rullgård wrote:
>> >> Lee Revell <[email protected]> writes:
>> >>
>> >> > On Sat, 2004-10-09 at 09:15, Måns Rullgård wrote:
>> >> >> I got this thing to build by adding a few EXPORT_SYMBOL, patch below.
>> >> >> Now it seems to be running quite well. I am, however, getting
>> >> >> occasional "bad: scheduling while atomic!" messages, all alike:
>> >> >>
>> >> >
>> >> > I am getting the same message. Also, leaving all the default debug
>> >> > options on, I got this debug output, but it did not coincide with the
>> >> > "bad" messages.
>> >> >
>> >> > Mtx: dd84e644 [773] pri (0) inherit from [3] pri(92)
>> >> > Mtx dd84e644 task [773] pri (92) restored pri(0). Next owner [3] pri (92)
>> >> > Mtx: dd84e644 [773] pri (0) inherit from [3] pri(92)
>> >> > Mtx dd84e644 task [773] pri (92) restored pri(0). Next owner [3] pri (92)
>> >> > Mtx: dd84e644 [773] pri (0) inherit from [3] pri(92)
>> >> > Mtx dd84e644 task [773] pri (92) restored pri(0). Next owner [3] pri (92)
>> >>
>> >> Well, those don't give me any clues.
>> >
>> > Pid 773 is the IRQ thread for eth0. I am using the via-rhine driver.
>>
>> I was using a prism54 wireless card.
>
> OK, first bug: I lost my PS/2 keyboard, and had to reboot to get it
> back. Unplugging and replugging it made Num Lock work again, but the
> system did not respond to the keyboard at all. USB mouse continued to
> work fine.

I lost my keyboard as well, though only in X, but I figured that could
be caused by some changes to the input layer that went in between
2.6.9-rc2 and -rc3. My synaptics touchpad also stopped working
properly. USB keyboard and mouse worked properly.

--
Måns Rullgård
[email protected]

2004-10-10 01:08:50

by Måns Rullgård

[permalink] [raw]
Subject: Re: [ANNOUNCE] Linux 2.6 Real Time Kernel

Micha Feigin <[email protected]> writes:

> On Sat, Oct 09, 2004 at 11:35:16PM +0200, Måns Rullgård wrote:
>> Lee Revell <[email protected]> writes:
>>
>> > On Sat, 2004-10-09 at 09:15, Måns Rullgård wrote:
>> >> I got this thing to build by adding a few EXPORT_SYMBOL, patch below.
>> >> Now it seems to be running quite well. I am, however, getting
>> >> occasional "bad: scheduling while atomic!" messages, all alike:
>> >>
>> >
>> > I am getting the same message. Also, leaving all the default debug
>> > options on, I got this debug output, but it did not coincide with the
>> > "bad" messages.
>> >
>> > Mtx: dd84e644 [773] pri (0) inherit from [3] pri(92)
>> > Mtx dd84e644 task [773] pri (92) restored pri(0). Next owner [3] pri (92)
>> > Mtx: dd84e644 [773] pri (0) inherit from [3] pri(92)
>> > Mtx dd84e644 task [773] pri (92) restored pri(0). Next owner [3] pri (92)
>> > Mtx: dd84e644 [773] pri (0) inherit from [3] pri(92)
>> > Mtx dd84e644 task [773] pri (92) restored pri(0). Next owner [3] pri (92)
>>
>> Well, those don't give me any clues.
>>
>> I had the system running that kernel for a bit over an hour and got
>> five of the "bad" messages, approximately evenly spaced in a
>> two-minute interval about 20 minutes after boot.
>>
>> I did notice one improvement compared to vanilla 2.6.8.1. The sound
>> didn't skip when I switched from X to a text console. However, my
>> keyboard no longer worked in X, but that seems to be due to some
>> recent changes to the input subsystem.
>
> There was some change in 2.6.9-pre-something that caused the mouse and
> keyboard to exchange event interfaces between them, if it interests you.

The keyboard doesn't have a device entry in my X config file, and I
suppose the mouse would still be at /dev/input/mice, no?

--
Måns Rullgård
[email protected]

2004-10-10 01:09:19

by Lee Revell

[permalink] [raw]
Subject: Re: [ANNOUNCE] Linux 2.6 Real Time Kernel

On Sat, 2004-10-09 at 21:05, Måns Rullgård wrote:
> > OK, first bug: I lost my PS/2 keyboard, and had to reboot to get it
> > back. Unplugging and replugging it made Num Lock work again, but the
> > system did not respond to the keyboard at all. USB mouse continued to
> > work fine.
>
> I lost my keyboard as well, though only in X, but I figured that could
> be caused by some changes to the input layer that went in between
> 2.6.9-rc2 and -rc3. My synaptics touchpad also stopped working
> properly. USB keyboard and mouse worked properly.

Looks like the same areas that were problematic with the VP kernel will
be an issue here. I suspect many of the fixes already exist in Ingo's
patch or in -mm.

I think my keyboard issue is different because it worked fine, then I
lost it suddenly.

Lee

2004-10-10 01:15:49

by Lee Revell

[permalink] [raw]
Subject: Re: [ANNOUNCE] Linux 2.6 Real Time Kernel

On Sat, 2004-10-09 at 01:59, Sven-Thorsten Dietrich wrote:
> Announcing the availability of prototype real-time (RT)
> enhancements to the Linux 2.6 kernel.
>

More "scheduling while atomic":

Oct 9 21:06:55 krustophenia kernel: bad: scheduling while atomic!
Oct 9 21:06:55 krustophenia kernel: [schedule+1578/1584] schedule+0x62a/0x630
Oct 9 21:06:55 krustophenia kernel: [__p_mutex_down+493/864] __p_mutex_down+0x1ed/0x360
Oct 9 21:06:55 krustophenia kernel: [kmutex_is_locked+32/64] kmutex_is_locked+0x20/0x40
Oct 9 21:06:55 krustophenia kernel: [pg0+509195408/1070195712] snd_emu10k1_ptr_read+0xc0/0xe0 [snd_emu10k1]
Oct 9 21:06:55 krustophenia kernel: [pg0+509190243/1070195712] snd_emu10k1_capture_pointer+0x33/0x70 [snd_emu10k1]
Oct 9 21:06:55 krustophenia kernel: [pg0+508937310/1070195712] snd_pcm_period_elapsed+0xde/0x3d0 [snd_pcm]
Oct 9 21:06:55 krustophenia kernel: [pg0+509178406/1070195712] snd_emu10k1_interrupt+0xd6/0x400 [snd_emu10k1]
Oct 9 21:06:55 krustophenia kernel: [generic_handle_IRQ_event+49/96] generic_handle_IRQ_event+0x31/0x60
Oct 9 21:06:55 krustophenia kernel: [do_IRQ+317/848] do_IRQ+0x13d/0x350
Oct 9 21:06:55 krustophenia kernel: [common_interrupt+24/32] common_interrupt+0x18/0x20
Oct 9 21:06:55 krustophenia kernel: [pg0+509195625/1070195712] snd_emu10k1_ptr_write+0xb9/0xc0 [snd_emu10k1]
Oct 9 21:06:55 krustophenia kernel: [pg0+509174891/1070195712] snd_emu10k1_voice_init+0x11b/0x1e0 [snd_emu10k1]
Oct 9 21:06:55 krustophenia kernel: [pg0+509182616/1070195712] snd_emu10k1_voice_free+0x38/0x70 [snd_emu10k1]
Oct 9 21:06:55 krustophenia kernel: [pg0+509187993/1070195712] snd_emu10k1_playback_hw_free+0x99/0xd0 [snd_emu10k1]
Oct 9 21:06:55 krustophenia kernel: [pg0+510049279/1070195712] snd_pcm_oss_release_file+0xbf/0x110 [snd_pcm_oss]
Oct 9 21:06:55 krustophenia kernel: [pg0+510051019/1070195712] snd_pcm_oss_release+0x4b/0x100 [snd_pcm_oss]
Oct 9 21:06:55 krustophenia kernel: [__fput+292/320] __fput+0x124/0x140
Oct 9 21:06:55 krustophenia kernel: [filp_close+67/112] filp_close+0x43/0x70
Oct 9 21:06:55 krustophenia kernel: [sys_close+88/112] sys_close+0x58/0x70
Oct 9 21:06:55 krustophenia kernel: [syscall_call+7/11] syscall_call+0x7/0xb
Oct 9 21:06:56 krustophenia kernel: Mtx: ddfc1ed0 [1445] pri (0) inherit from [1495] pri(10)
Oct 9 21:06:56 krustophenia kernel: bad: scheduling while atomic!
Oct 9 21:06:56 krustophenia kernel: [schedule+1578/1584] schedule+0x62a/0x630
Oct 9 21:06:56 krustophenia kernel: [__p_mutex_down+493/864] __p_mutex_down+0x1ed/0x360
Oct 9 21:06:56 krustophenia kernel: [kmutex_is_locked+32/64] kmutex_is_locked+0x20/0x40
Oct 9 21:06:56 krustophenia kernel: [pg0+508918664/1070195712] snd_pcm_capture_poll+0x48/0x120 [snd_pcm]
Oct 9 21:06:56 krustophenia kernel: [do_pollfd+125/144] do_pollfd+0x7d/0x90
Oct 9 21:06:56 krustophenia kernel: [do_poll+95/192] do_poll+0x5f/0xc0
Oct 9 21:06:56 krustophenia kernel: [sys_poll+330/560] sys_poll+0x14a/0x230
Oct 9 21:06:56 krustophenia kernel: [__pollwait+0/160] __pollwait+0x0/0xa0
Oct 9 21:06:56 krustophenia kernel: [syscall_call+7/11] syscall_call+0x7/0xb
Oct 9 21:09:27 krustophenia kernel: Mtx: cde58020 [1383] pri (0) inherit from [3] pri(92)
Oct 9 21:09:27 krustophenia kernel: bad: scheduling while atomic!
Oct 9 21:09:27 krustophenia kernel: [schedule+1578/1584] schedule+0x62a/0x630
Oct 9 21:09:27 krustophenia kernel: [__p_mutex_down+493/864] __p_mutex_down+0x1ed/0x360
Oct 9 21:09:27 krustophenia kernel: [kmutex_is_locked+32/64] kmutex_is_locked+0x20/0x40
Oct 9 21:09:27 krustophenia kernel: [tcp_v4_rcv+1207/2048] tcp_v4_rcv+0x4b7/0x800
Oct 9 21:09:27 krustophenia kernel: [ip_local_deliver+154/304] ip_local_deliver+0x9a/0x130
Oct 9 21:09:27 krustophenia kernel: [ip_rcv+729/992] ip_rcv+0x2d9/0x3e0
Oct 9 21:09:27 krustophenia kernel: [netif_receive_skb+264/448] netif_receive_skb+0x108/0x1c0
Oct 9 21:09:27 krustophenia kernel: [process_backlog+125/272] process_backlog+0x7d/0x110
Oct 9 21:09:27 krustophenia kernel: [ksoftirqd_high_prio+0/192] ksoftirqd_high_prio+0x0/0xc0
Oct 9 21:09:27 krustophenia kernel: [net_rx_action+108/256] net_rx_action+0x6c/0x100
Oct 9 21:09:27 krustophenia kernel: [__do_softirq+99/112] __do_softirq+0x63/0x70
Oct 9 21:09:27 krustophenia kernel: [do_softirq+53/64] do_softirq+0x35/0x40
Oct 9 21:09:27 krustophenia kernel: [ksoftirqd_high_prio+133/192] ksoftirqd_high_prio+0x85/0xc0
Oct 9 21:09:27 krustophenia kernel: [kthread+163/176] kthread+0xa3/0xb0
Oct 9 21:09:27 krustophenia kernel: [kthread+0/176] kthread+0x0/0xb0
Oct 9 21:09:27 krustophenia kernel: [kernel_thread_helper+5/16] kernel_thread_helper+0x5/0x10
Oct 9 21:09:42 krustophenia kernel: Mtx: cbafd9a0 [1388] pri (0) inherit from [3] pri(92)
Oct 9 21:09:42 krustophenia kernel: bad: scheduling while atomic!
Oct 9 21:09:42 krustophenia kernel: [schedule+1578/1584] schedule+0x62a/0x630
Oct 9 21:09:42 krustophenia kernel: [__p_mutex_down+493/864] __p_mutex_down+0x1ed/0x360
Oct 9 21:09:42 krustophenia kernel: [kmutex_is_locked+32/64] kmutex_is_locked+0x20/0x40
Oct 9 21:09:42 krustophenia kernel: [tcp_v4_rcv+1207/2048] tcp_v4_rcv+0x4b7/0x800
Oct 9 21:09:42 krustophenia kernel: [ip_local_deliver+154/304] ip_local_deliver+0x9a/0x130
Oct 9 21:09:42 krustophenia kernel: [ip_rcv+729/992] ip_rcv+0x2d9/0x3e0
Oct 9 21:09:42 krustophenia kernel: [netif_receive_skb+264/448] netif_receive_skb+0x108/0x1c0
Oct 9 21:09:42 krustophenia kernel: [process_backlog+125/272] process_backlog+0x7d/0x110
Oct 9 21:09:42 krustophenia kernel: [ksoftirqd_high_prio+0/192] ksoftirqd_high_prio+0x0/0xc0
Oct 9 21:09:42 krustophenia kernel: [net_rx_action+108/256] net_rx_action+0x6c/0x100
Oct 9 21:09:42 krustophenia kernel: [__do_softirq+99/112] __do_softirq+0x63/0x70
Oct 9 21:09:42 krustophenia kernel: [do_softirq+53/64] do_softirq+0x35/0x40
Oct 9 21:09:42 krustophenia kernel: [ksoftirqd_high_prio+133/192] ksoftirqd_high_prio+0x85/0xc0
Oct 9 21:09:42 krustophenia kernel: [kthread+163/176] kthread+0xa3/0xb0
Oct 9 21:09:42 krustophenia kernel: [kthread+0/176] kthread+0x0/0xb0
Oct 9 21:09:42 krustophenia kernel: [kernel_thread_helper+5/16] kernel_thread_helper+0x5/0x10
Oct 9 21:09:42 krustophenia kernel: Mtx cbafd9a0 task [1388] pri (92) restored pri(0). Next owner [3] pri (92)

Looks like the Mtx debug messages are related.

Lee


2004-10-10 08:45:13

by Ingo Molnar

[permalink] [raw]
Subject: Re: [ANNOUNCE] Linux 2.6 Real Time Kernel


* Sven-Thorsten Dietrich <[email protected]> wrote:

> Announcing the availability of prototype real-time (RT)
> enhancements to the Linux 2.6 kernel.
>
> We will submit 3 additional emails following this one, containing
> the remaining 3 patches (of 4) inline, with their descriptions.

cool! Basically the biggest problem is not the technology itself, but
its proper integration into Linux. As can be seen from the 2.4 RT
patches (TimeSys and yours), just walking the path towards a fully
preemptible kernel is not fruitful because it generates lots of huge,
intrusive patches that end up being unmaintainable forks of the Linux
tree.

the other approach is what i'm currently doing with the
voluntary-preempt patchset: to improve the generic kernel for latency
purposes without actually adding too many extra features. Here is what
is happening in the -mm tree right now:

- the generic irq subsystem: irq threading is a simple ~200-lines,
architecture-independent add-on to this. It makes no sense to offer 3
different implementations - pick one and help make it work well.

- preemptible BKL. Related to this is new debugging infrastructure in
-mm that allows the safe and slow conversion of spinlocks to mutexes.
In the case of the BKL this conversion is expected to be permanent,
for most of the other spinlocks it will be optional - but the
debugging code can still be used.

- various fixes and latency improvements. A mutex based kernel is of
little use if the only code you can execute reliably is user-space
code and the moment you hit kernel-space your RT app is exposed to
high latencies.

A couple of suggestions wrt. how to speed up the integration effort: you
might want to rebase this stuff to the -mm tree. Also, what i don't see
in your (and others') patches (yet?) is some of the harder stuff:

- the handling of per-CPU data structures (get_cpu_var())

- RCU and softirq data structures

- the handling of the IRQ flag

These are basic correctness issues that affect UP just as much as SMP.
Without these the kernel is still not a "fully preemptible" kernel.
These need infrastructure changes too, so they must precede any addition
of a spinlock -> mutex conversion feature.
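
To illustrate the per-CPU point with a quick sketch (my_counter is a
made-up example):

static DEFINE_PER_CPU(int, my_counter);

void touch_counter(void)
{
        /* get_cpu_var() disables preemption around the access, so this
         * stays correct even when the surrounding locks become mutexes: */
        get_cpu_var(my_counter)++;
        put_cpu_var(my_counter);
}

/* Code that instead relied on an enclosing spin_lock() to keep the task
 * on one CPU, and then used __get_cpu_var() or smp_processor_id()
 * directly, silently breaks once that spinlock becomes a sleeping mutex. */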

So the mutex patch will probably be the one that can go upstream _last_,
which will do the "final step" of making the kernel fully preemptible.

Ingo

2004-10-10 12:21:57

by John Richard Moser

[permalink] [raw]
Subject: Re: [ANNOUNCE] Linux 2.6 Real Time Kernel

-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1



Sven-Thorsten Dietrich wrote:
|
| Announcing the availability of prototype real-time (RT)
| enhancements to the Linux 2.6 kernel.
|
| We will submit 3 additional emails following this one, containing
| the remaining 3 patches (of 4) inline, with their descriptions.
|
| Download:
|
| Patches against the Linux-2.6.9-rc3 kernel are available at:
|
| ftp://source.mvista.com/pub/realtime/Linux-2.6.9-rc3-RT_irqthreads.patch
| ftp://source.mvista.com/pub/realtime/Linux-2.6.9-rc3-RT_mutex.patch
| ftp://source.mvista.com/pub/realtime/Linux-2.6.9-rc3-RT_spinlock1.patch
| ftp://source.mvista.com/pub/realtime/Linux-2.6.9-rc3-RT_spinlock2.patch
|
| The patches are to be applied to the linux-2.6.9-rc3 kernel in the
| order listed above.

Does any of this 'work' on x86_64 yet? I heard that Ingo's voluntary
pre-empt was x86 only and didn't work on amd64; this stuff's kinda new,
does it work outside x86 yet?

I'd like to see what these kinds of things do. :)

[...]

- --
All content of all messages exchanged herein are left in the
Public Domain, unless otherwise explicitly stated.

-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.2.6 (GNU/Linux)
Comment: Using GnuPG with Thunderbird - http://enigmail.mozdev.org

iD8DBQFBaSk6hDd4aOud5P8RAotcAJ9GgA3P1mAG/CpdlJDknGK6zwA92QCePZi4
AyNDvW6urtDNdvJAPDMZZfk=
=gVeZ
-----END PGP SIGNATURE-----

2004-10-10 17:29:47

by Daniel Walker

[permalink] [raw]
Subject: Re: [ANNOUNCE] Linux 2.6 Real Time Kernel

On Sun, 2004-10-10 at 05:21, John Richard Moser wrote:
> -----BEGIN PGP SIGNED MESSAGE-----
> Hash: SHA1
>
>
>
> Sven-Thorsten Dietrich wrote:
> |
> | Announcing the availability of prototype real-time (RT)
> | enhancements to the Linux 2.6 kernel.
> |
> | We will submit 3 additional emails following this one, containing
> | the remaining 3 patches (of 4) inline, with their descriptions.
> |
> | Download:
> |
> | Patches against the Linux-2.6.9-rc3 kernel are available at:
> |
> | ftp://source.mvista.com/pub/realtime/Linux-2.6.9-rc3-RT_irqthreads.patch
> | ftp://source.mvista.com/pub/realtime/Linux-2.6.9-rc3-RT_mutex.patch
> | ftp://source.mvista.com/pub/realtime/Linux-2.6.9-rc3-RT_spinlock1.patch
> | ftp://source.mvista.com/pub/realtime/Linux-2.6.9-rc3-RT_spinlock2.patch
> |
> | The patches are to be applied to the linux-2.6.9-rc3 kernel in the
> | order listed above.
>
> Does any of this 'work' on x86_64 yet? I heard that Ingo's voluntary
> pre-empt was x86 only and didn't work on amd64; this stuff's kinda new,
> does it work outside x86 yet?
>
> I'd like to see what these kinds of things do. :)


No, it's x86-only right now. The mutex is partly in assembly, and both
of the IRQ thread implementations we are using are x86-only.

Daniel Walker

2004-10-10 17:32:26

by Lee Revell

[permalink] [raw]
Subject: Re: [ANNOUNCE] Linux 2.6 Real Time Kernel

On Sun, 2004-10-10 at 08:21, John Richard Moser wrote:

> Does any of this 'work' on x86_64 yet? I heard that Ingo's voluntary
> pre-empt was x86 only and didn't work on amd64; this stuff's kinda new,
> does it work outside x86 yet?
>
> I'd like to see what these kinds of things do. :)

The VP patches currently work on x86, x64, amd64, and ppc AFAIK. As
stated in the docs, the MontaVista stuff is x86 only right now.

My tests show the worst case latency with the MontaVista patches is
about twice that of the VP patches. Probably due to debug overhead and
a bug or two. But, as expected, the average case latency is _much_
better.

Here's the top of the VP histogram, delay is in usecs:

Delay #
0 5764433
1 3154867
2 461521
3 332445
4 403847
5 320120
6 237955
7 152418
8 94274
9 66496
10 52976
11 44605
12 38437
13 31620
14 27816
15 26845
16 23743
17 20648
18 21611
19 24853
20 30352
21 50046
22 101989
23 24843
24 28829
25 56247
26 42408
27 28228
28 20773
29 19521

Here's the top of the Mvista histogram:

Delay #
0 6771692
1 26
2 29
3 12
4 15
5 15
6 15
7 18
8 19
9 10
10 15
11 10
12 19
13 12
14 15
15 11
16 13
17 13
18 11
19 13
20 12
21 9
22 11
23 13
24 17
25 10
26 9
27 11
28 8
29 12

Lee

2004-10-10 18:45:14

by John Richard Moser

[permalink] [raw]
Subject: Re: [ANNOUNCE] Linux 2.6 Real Time Kernel

-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1



Lee Revell wrote:
| On Sun, 2004-10-10 at 08:21, John Richard Moser wrote:
|
|
|>Does any of this 'work' on x86_64 yet? I heard that Ingo's voluntary
|>pre-empt was x86 only and didn't work on amd64; this stuff's kinda new,
|>does it work outside x86 yet?
|>
|>I'd like to see what these kinds of things do. :)
|
|
| The VP patches currently work on x86, x64, amd64, and ppc AFAIK. As
| stated in the docs, the MontaVista stuff is x86 only right now.

Is there a stable amd64 voluntary pre-empt patch for 2.6.7? I'm using
PaX so I can't go up until the author catches up to the new VM changes
introduced in 2.6.8+.

- --
All content of all messages exchanged herein are left in the
Public Domain, unless otherwise explicitly stated.

-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.2.6 (GNU/Linux)
Comment: Using GnuPG with Thunderbird - http://enigmail.mozdev.org

iD8DBQFBaYM7hDd4aOud5P8RAl9tAJ9mJmKtt4p+I4iLh9u1hiFQXK1DlwCfbBhL
TTXwLyxVxwBNuZvnpfj5BN8=
=tbRd
-----END PGP SIGNATURE-----

2004-10-10 19:42:07

by Daniel Walker

[permalink] [raw]
Subject: Re: [ANNOUNCE] Linux 2.6 Real Time Kernel

On Sun, 2004-10-10 at 01:46, Ingo Molnar wrote:
> - the generic irq subsystem: irq threading is a simple ~200-lines,
> architecture-independent add-on to this. It makes no sense to offer 3
> different implementations - pick one and help make it work well.
>
> - preemptible BKL. Related to this is new debugging infrastructure in
> -mm that allows the safe and slow conversion of spinlocks to mutexes.
> In the case of the BKL this conversion is expected to be permanent,
> for most of the other spinlocks it will be optional - but the
> debugging code can still be used.

Are you referring to the lock metering? I've ported our changes to
-mm3-VP-T3 on top of lock metering. It needs some clean-up but it will
be released soon. It's very similar to our rc3 release only without the
IRQ threads patch.

Daniel Walker

2004-10-10 20:19:22

by Ingo Molnar

[permalink] [raw]
Subject: Re: [ANNOUNCE] Linux 2.6 Real Time Kernel


* John Richard Moser <[email protected]> wrote:

> | The VP patches currently work on x86, x64, amd64, and ppc AFAIK. As
> | stated in the docs, the MontaVista stuff is x86 only right now.
>
> Is there a stable amd64 voluntary pre-empt patch for 2.6.7? I'm using
> PaX so I can't go up until the author catches up to the new VM changes
> introduced in 2.6.8+.

nope, latest -VP is against 2.6.9-rc3-mm3-ish kernels. Since half of -VP
is in -mm already in various forms of patches it would be quite hard to
extract all of that even against a vanilla 2.6.9-rc3 kernel - let alone
against 2.6.7.

Ingo

2004-10-10 20:44:11

by John Richard Moser

[permalink] [raw]
Subject: Re: [ANNOUNCE] Linux 2.6 Real Time Kernel

-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1



Ingo Molnar wrote:
| * John Richard Moser <[email protected]> wrote:
|
|

[...]

|
| nope, latest -VP is against 2.6.9-rc3-mm3-ish kernels. Since half of -VP
| is in -mm already in various forms of patches it would be quite hard to
| extract all of that even against a vanilla 2.6.9-rc3 kernel - let alone
| against 2.6.7.

Alright, I'll just wait for a new PaX patch then.

[...]

- --
All content of all messages exchanged herein are left in the
Public Domain, unless otherwise explicitly stated.

-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.2.6 (GNU/Linux)
Comment: Using GnuPG with Thunderbird - http://enigmail.mozdev.org

iD8DBQFBaZ8fhDd4aOud5P8RAjYlAJ98UZqYWigQacDJLg1BPHLgS9dxQgCggv0S
KDoa7bJJYso9DlRTwldbFlo=
=u2eR
-----END PGP SIGNATURE-----

2004-10-10 19:44:35

by Ingo Molnar

[permalink] [raw]
Subject: Re: [ANNOUNCE] Linux 2.6 Real Time Kernel


* Daniel Walker <[email protected]> wrote:

> On Sun, 2004-10-10 at 01:46, Ingo Molnar wrote:
> > - the generic irq subsystem: irq threading is a simple ~200-lines,
> > architecture-independent add-on to this. It makes no sense to offer 3
> > different implementations - pick one and help make it work well.
> >
> > - preemptible BKL. Related to this is new debugging infrastructure in
> > -mm that allows the safe and slow conversion of spinlocks to mutexes.
> > In the case of the BKL this conversion is expected to be permanent,
> > for most of the other spinlocks it will be optional - but the
> > debugging code can still be used.
>
> Are you referring to the lock metering? I've ported our changes
> to -mm3-VP-T3 on top of lock metering. It needs some clean-up but it
> will be released soon. It's very similar to our rc3 release only
> without the IRQ threads patch.

no, i mean the smp_processor_id() debugger, and the other bits triggered
by CONFIG_DEBUG_PREEMPT.
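
For reference, the kind of use that debugging code flags (a sketch):

void example(void)
{
        /* with CONFIG_DEBUG_PREEMPT this warns: preemption is enabled,
         * so the task can migrate right after reading the CPU id */
        int cpu = smp_processor_id();

        /* the safe pattern: */
        cpu = get_cpu();        /* disables preemption */
        /* ... per-CPU work on 'cpu' ... */
        put_cpu();              /* enables preemption again */
}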

Ingo

2004-10-10 21:22:22

by Andrew Morton

[permalink] [raw]
Subject: Re: [ANNOUNCE] Linux 2.6 Real Time Kernel

Daniel Walker <[email protected]> wrote:
>
> On Sun, 2004-10-10 at 01:46, Ingo Molnar wrote:
> > - the generic irq subsystem: irq threading is a simple ~200-lines,
> > architecture-independent add-on to this. It makes no sense to offer 3
> > different implementations - pick one and help make it work well.
> >
> > - preemptible BKL. Related to this is new debugging infrastructure in
> > -mm that allows the safe and slow conversion of spinlocks to mutexes.
> > In the case of the BKL this conversion is expected to be permanent,
> > for most of the other spinlocks it will be optional - but the
> > debugging code can still be used.
>
> Are you referring to the lock metering? I've ported our changes to
> -mm3-VP-T3 on top of lock metering.

Lockmeter gets in the way of all this activity in a big way. I'll drop it.

2004-10-10 21:57:39

by Ingo Molnar

[permalink] [raw]
Subject: Re: [ANNOUNCE] Linux 2.6 Real Time Kernel


* Andrew Morton <[email protected]> wrote:

> Lockmeter gets in the way of all this activity in a big way. I'll
> drop it.

great. Daniel, would you mind merging your patchkit against the
following base:

-mm3, minus lockmeter, plus the -T3 patch

? To make this easier i've uploaded a combined undo-lockmeter patch to:

http://redhat.com/~mingo/voluntary-preempt/undo-lockmeter-2.6.9-rc3-mm3-A1

which you should apply to vanilla -mm3, then apply the -T3 patch:

http://redhat.com/~mingo/voluntary-preempt/voluntary-preempt-2.6.9-rc3-mm3-T3

this will apply cleanly with some minor fuzz. The resulting kernel
builds & boots fine with my .config.

Ingo

2004-10-11 15:28:00

by Vadim Lebedev

[permalink] [raw]
Subject: Re: [ANNOUNCE] Linux 2.6 Real Time Kernel

Sven-Thorsten Dietrich <[email protected]> wrote in message
news:<[email protected]>...
> Announcing the availability of prototype real-time (RT)
> enhancements to the Linux 2.6 kernel.

Reading the sources, I believe that __p_mutex_up is not a constant-time
operation, because of __p_mutex_down....

It is clear that __p_mutex_down is not a constant-time operation because
of the insertion into the priority-sorted sleepers list. However, both
__p_mutex_down and __p_mutex_up synchronize on the same global spinlock
(m_spin_lock), so while __p_mutex_down is holding this spinlock during
the insertion, NO other process(or) is able to perform any __p_mutex
operation...

Maybe a better idea would be to have a per-mutex spinlock? Or even
better, given that task->rt_priority has a finite range, maybe each
mutex could have a table of sleeper lists indexed by rt_priority?
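
A rough sketch of what that second idea could look like (names invented
for illustration only):

struct prio_mutex {
        raw_spinlock_t   wait_lock;                 /* per-mutex, not global */
        struct list_head sleepers[MAX_RT_PRIO];     /* one list per rt_priority */
        unsigned long    prio_bitmap[BITS_TO_LONGS(MAX_RT_PRIO)];
};

/* Enqueue becomes O(1): set the bit for the waiter's priority and
 * list_add_tail() onto sleepers[prio]; wakeup scans the bitmap for the
 * highest-priority non-empty list, which is also constant time. */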


Vadim

2004-10-11 16:02:07

by Eugeny S. Mints

[permalink] [raw]
Subject: Re: [ANNOUNCE] Linux 2.6 Real Time Kernel

Vadim Lebedev wrote:
> Sven-Thorsten Dietrich <[email protected]> wrote in message
> news:<[email protected]>...
>
>>Announcing the availability of prototype real-time (RT)
>>enhancements to the Linux 2.6 kernel.
>
>
> Reading the sources, I believe that __p_mutex_up is not a constant-time
> operation, because of __p_mutex_down....
>
> It is clear that __p_mutex_down is not a constant-time operation because
> of the insertion into the priority-sorted sleepers list. However, both
> __p_mutex_down and __p_mutex_up synchronize on the same global spinlock
> (m_spin_lock), so while __p_mutex_down is holding this spinlock during
> the insertion, NO other process(or) is able to perform any __p_mutex
> operation...

The current pmutex implementation was chosen only as a prototype. The
kmutex abstraction layer makes it easy to switch between alternative
mutex implementations and to choose the optimal one on a benchmarking
basis.
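
Conceptually the layer is just a thin wrapper, as the kmutex patch
quoted earlier in the thread shows; roughly (a sketch, with the backend
type name approximated):

struct kmutex {
#if defined CONFIG_PMUTEX
        pmutex_t kmtx;          /* current prototype backend */
#endif
        /* any alternative mutex implementation can slot in here instead */
};

static inline int kmutex_is_locked(struct kmutex *lock)
{
        return p_mutex_is_locked(&(lock->kmtx));    /* as in the posted patch */
}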

>
> Maybe a better idea would be to have a per-mutex spinlock? Or even
> better, given that task->rt_priority has a finite range, maybe each
> mutex could have a table of sleeper lists indexed by rt_priority?
>
>
> Vadim


2004-10-11 17:53:21

by Daniel Walker

[permalink] [raw]
Subject: Re: [ANNOUNCE] Linux 2.6 Real Time Kernel

On Sun, 2004-10-10 at 14:59, Ingo Molnar wrote:
> * Andrew Morton <[email protected]> wrote:
>
> > Lockmeter gets in the way of all this activity in a big way. I'll
> > drop it.
>
> great. Daniel, would you mind merging your patchkit against the
> following base:
>
> -mm3, minus lockmeter, plus the -T3 patch


No problem. Next release will be without lockmeter. Thanks for the
patches.



Daniel Walker



2004-10-11 20:48:33

by Ingo Molnar

[permalink] [raw]
Subject: Re: [ANNOUNCE] Linux 2.6 Real Time Kernel


* Daniel Walker <[email protected]> wrote:

> On Sun, 2004-10-10 at 14:59, Ingo Molnar wrote:
> > * Andrew Morton <[email protected]> wrote:
> >
> > > Lockmeter gets in the way of all this activity in a big way. I'll
> > > drop it.
> >
> > great. Daniel, would you mind merging your patchkit against the
> > following base:
> >
> > -mm3, minus lockmeter, plus the -T3 patch
>
>
> No problem. Next release will be without lockmeter. Thanks for the
> patches.

what do you think about the PREEMPT_REALTIME stuff in -T4? Ideally, if
you agree with the generic approach, the next step would be to add your
priority inheritance handling code to Linux semaphores and
rw-semaphores. The sched.c bits for that looked pretty straightforward.
The list walking is a bit ugly but probably unavoidable - the only other
option would be 100 priority queues per semaphore -> yuck.

Ingo

2004-10-11 21:44:37

by Sven-Thorsten Dietrich

[permalink] [raw]
Subject: RE: [ANNOUNCE] Linux 2.6 Real Time Kernel


I think Daniel has some separate thoughts,
here are mine:

Regarding the list walking stuff:

There are a lot of hashing, indexing, and similar
options that could be done. We thought of that
as a future optimization. An easy fix would be
to insert RT processes at the front and non-RT
processes at the tail of the queue.
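
Roughly (a sketch; the sorted-insert helper is a placeholder for the
existing walk):

if (!rt_task(current)) {
        /* non-RT waiters skip the priority walk entirely */
        list_add_tail(&waiter->list, &mutex->sleepers);
} else {
        /* only RT waiters pay for the sorted insert from the head */
        insert_sorted_by_priority(waiter, &mutex->sleepers);
}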


Regarding patch size: clearly this is
an issue. We are working on creating a
good map of spinlock nestings, to help
with this.

Will publish that ASAP.


IMO the number of raw_spinlocks should be
lower, I said teens before.

Theoretically, it should only need to be
around hardware registers and some memory maps
and cache code, plus interrupt controller
and other SMP-contended hardware.

Practically, it's an efficiency judgement call.
It's not worth blocking for 5 instructions in
a critical section under any circumstance,
so the deepest-nested locks should probably remain
spinlocks.

There are some concurrency issues in kernel threads,
and I think there is a lot of work here.
The abstraction for LOCK_OPS is a good alternative,
but like the spin_undefs, it's difficult to tell
in the code whether you are dealing with a mutex
or a spinlock.

Regarding the use of the system semaphore:
We have WIP on PMUTEX modified to use atomic_t,
thereby eliminating the assembly for instant
portability.

It's slow, but optimizations are allowed for.

Of course for actual portability the
IRQ threads must also be running on those
other platforms.

Your IRQ abstraction is ideal for that.

Eventually, I think that we will see
optimization - the last touches would have
the final mutex code converted back to
assembly, for performance reasons.

There are a whole lot of caveats and race
conditions that have not yet been unearthed
by the brief LKML testing. A lot of them
have to do with wakeups of tasks blocked
on a mutex, and differentiating between
blocked "ready" and blocked "mutex" states.
Here the system semaphore may have an advantage.

With that, maybe we can work back towards
the abstraction, so that we can evaluate both
solutions for their specific advantages.

I'll have to take a look at the new T4 patch
in detail, but at first glance it seems
that both mutexes could coexist in the
abstraction.

We'll give it a test run, and look forward to
your thoughts.

Thanks,

Sven



2004-10-11 21:55:55

by Ingo Molnar

[permalink] [raw]
Subject: Re: [ANNOUNCE] Linux 2.6 Real Time Kernel


* Sven Dietrich <[email protected]> wrote:

> IMO the number of raw_spinlocks should be lower, I said teens before.
>
> Theoretically, it should only need to be around hardware registers and
> some memory maps and cache code, plus interrupt controller and other
> SMP-contended hardware.

yeah, fully agreed. Right now the 90 locks i have means roughly 20% of
all locking still happens as raw spinlocks.

But, there is a 'correctness' _minimum_ set of spinlocks that _must_ be
raw spinlocks - this i tried to map in the -T4 patch. The patch does run
on SMP systems for example. (it was developed as an SMP kernel - in fact
i never compiled it as UP :-|.) If code has per-CPU or preemption
assumptions then there is no choice but to make it a raw spinlock, until
those assumptions are fixed.

> There are some concurrency issues in kernel threads, and I think there
> is a lot of work here. The abstraction for LOCK_OPS is a good
> alternative, but like the spin_undefs, its difficult to tell in the
> code whether you are dealing with a mutex or a spinlock.

what do you mean by 'it's difficult to tell'? In -T4 you do the choice
of type in the data structure and the API adapts automatically. If the
type is raw_spinlock_t then a spin_lock() is turned into a
_raw_spin_lock(). If the type is spinlock_t then the spin_lock() is
redirected to mutex_lock(). It's all transparently done and always
correct.
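
Roughly, the dispatch works like the sketch below (illustration only, not
the literal -T4 macros; the helper names are placeholders). Both branches
are compiled and type-checked, which is why the casts go through void *;
the branch that does not match the declared type is discarded as dead code:

#define TYPE_EQUAL(lock, type) \
	__builtin_types_compatible_p(typeof(*(lock)), type)

#define spin_lock(lock)						\
do {								\
	if (TYPE_EQUAL(lock, raw_spinlock_t))			\
		_raw_spin_lock((raw_spinlock_t *)(void *)(lock)); \
	else							\
		mutex_lock((spinlock_t *)(void *)(lock));	\
} while (0)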

> There are a whole lot of caveats and race conditions that have not yet
> been unearthed by the brief LKML testing. [...]

actually, have you tried your patchset on an SMP box? As far as i can
see the locking in it ignores SMP issues _completely_, which makes the
choice of locks much less useful.

Ingo

2004-10-11 23:07:54

by Sven-Thorsten Dietrich

[permalink] [raw]
Subject: RE: [ANNOUNCE] Linux 2.6 Real Time Kernel


>
> * Sven Dietrich <[email protected]> wrote:
>
> > IMO the number of raw_spinlocks should be lower, I said teens before.
> >
> > Theoretically, it should only need to be around hardware registers and
> > some memory maps and cache code, plus interrupt controller and other
> > SMP-contended hardware.
>
> yeah, fully agreed. Right now the 90 locks i have means roughly 20% of
> all locking still happens as raw spinlocks.
>
> But, there is a 'correctness' _minimum_ set of spinlocks that _must_ be
> raw spinlocks - this i tried to map in the -T4 patch. The patch does run
> on SMP systems for example. (it was developed as an SMP kernel - in fact
> i never compiled it as UP :-|.) If code has per-CPU or preemption
> assumptions then there is no choice but to make it a raw spinlock, until
> those assumptions are fixed.
>

The grunt work is in identifying those problem areas and coming up with
elegant, low-impact solutions. RCU locking is one example, as mentioned
before. We had a fix to serialize RCU access, but weren't happy with that.
We were hoping to get some input on this, but these problems seem to show
up more readily on slow systems (we are also testing with a bunch of
old P1, P2 and K6 boxes, all far sub-1 GHz).

> > There are some concurrency issues in kernel threads, and I think there
> > is a lot of work here. The abstraction for LOCK_OPS is a good
> > alternative, but like the spin_undefs, its difficult to tell in the
> > code whether you are dealing with a mutex or a spinlock.
>
> what do you mean by 'it's difficult to tell'? In -T4 you do the choice
> of type in the data structure and the API adapts automatically. If the
> type is raw_spinlock_t then a spin_lock() is turned into a
> _raw_spin_lock(). If the type is spinlock_t then the spin_lock() is
> redirected to mutex_lock(). It's all transparently done and always
> correct.
>

I was making this observation:
One can't look at an arbitrary piece of code and tell if it will
be a spinlock or a mutex. One has to go look elsewhere.
In the spin_undefs case one can look at the top of the file and check;
in the LOCK_OPS case, you have to call up the data structure declaration.

> > There are a whole lot of caveats and race conditions that have not yet
> > been unearthed by the brief LKML testing. [...]
>
> actually, have you tried your patchset on an SMP box? As far as i can
> see the locking in it ignores SMP issues _completely_, which makes the
> choice of locks much less useful.
>

We stated that it's been tested minimally on SMP. That means we have
had it up and running and found it to be unstable. I fully agree that
SMP is the superset to get it working on, and that PMutex is not
perfect at this point.

We will take a look at the T5 patch and see what we can do about
PI for the system semaphore, but I am not sure how portable it would
be without also touching the assembly. FWIW PMutex is already based
in part on the system semaphore, so we might get similar problems when
porting elsewhere.

I think we should try and eliminate the mutex as an issue ASAP so we can
move on to the real meat. We have spec'd some requirements in the
rttReleaseNotes, clearly not all are being met, but we hoped to capture
most of them.
I have copied Arndt Heursch and Witold Jaworski in Germany, maybe they
will also have some insights.


Sven


2004-10-12 05:49:33

by Ingo Molnar

[permalink] [raw]
Subject: Re: [ANNOUNCE] Linux 2.6 Real Time Kernel


* Sven Dietrich <[email protected]> wrote:

> > But, there is a 'correctness' _minimum_ set of spinlocks that _must_ be
> > raw spinlocks - this i tried to map in the -T4 patch. The patch does run
> > on SMP systems for example. (it was developed as an SMP kernel - in fact
> > i never compiled it as UP :-|.) If code has per-CPU or preemption
> > assumptions then there is no choice but to make it a raw spinlock, until
> > those assumptions are fixed.

> The grunt work is in identifying those problem areas and coming up
> with elegant, low-impact solutions. RCU locks is one example as
> mentioned before. We had a fix to serialize RCU access, but weren't
> happy with that. We were hoping to get some input on this, but these
> problems seem to show up more readily on slow systems (we are also
> testing with a bunch of old P1, P2 and K6 boxes all far sub 1 GHz)

identifying problem areas is near 100% automatic if you look at -T5: all
illegal sleeps and illegal smp_processor_id() assumptions are reported
when they happen. That's how i identified & fixed the core 90 locks in
the first wave, in just a couple of hours. The only minor annoyance when
doing a conversion is the inflexibility of the SPIN_LOCK_UNLOCKED and
RW_LOCK_UNLOCKED initializers. If it weren't for the initializers then a
'conversion' would be a matter of a 2-line change, the change of the
prototype and the change of the definition. Now it's a 3-line change
most of the time - still very fast and painless.
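
As a made-up example of one such conversion (the lock name is invented,
and RAW_SPIN_LOCK_UNLOCKED stands for whatever raw initializer the patch
provides), the three edits are the definition's type, its initializer,
and the extern prototype:

	/* before */
	extern spinlock_t example_lock;				/* prototype  */
	spinlock_t example_lock = SPIN_LOCK_UNLOCKED;		/* definition + initializer */

	/* after */
	extern raw_spinlock_t example_lock;
	raw_spinlock_t example_lock = RAW_SPIN_LOCK_UNLOCKED;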

regarding RCU serialization - i think that is the way to go - i dont
think there is any sensible way to extend RCU to a fully preempted
model, RCU is all about per-CPU-ness and per-CPU-ness is quite limited
in a fully preemptible model.

could you send those RCU patches (no matter how incomplete/broken)? It's
the main issue that causes the dcache_lock to be raw still. (and a
number of dependent locks: fs-writeback.c, inode.c, etc.) We can make
those RCU changes not impact the normal !PREEMPT_REALTIME locking so it
might have a chance for upstream merging as well.

> I was making this observation: One can't look at an arbitrary piece of
> code and tell if it will be a spinlock or a mutex. One has to go look
> elsewhere. In the spin_undefs case one can look the top of the file
> and check for it, in the LOCK_OPS case, you have to call up the data
> structure declaration.

ok, i now understand what you mean. The way i drove it wasnt really via
code review but via: 'compile kernel, look at the bootlogs, fix the
first lock reported, repeat' iterations. This was much easier and much
more reliable than trying to figure out lock dependencies from the
source. The turnaround for a single lock was 2-3 minutes in the typical
case, allowing the conversion of 90 locks in a couple of hours.

> > > There are a whole lot of caveats and race conditions that have not yet
> > > been unearthed by the brief LKML testing. [...]
> >
> > actually, have you tried your patchset on an SMP box? As far as i can
> > see the locking in it ignores SMP issues _completely_, which makes the
> > choice of locks much less useful.
>
> We stated that its been tested minimally on SMP. That means we have
> had it up and running and found it to be unstable. I fully agree that
> SMP is the superset to get it working on, and that PMutex is not
> perfect at this point.

it's not just the problem of PMutex - i believe it's mainly the plain
inadequacy of the 30 raw locks you have identified - and identifying the
locks is the bigger work, not the semaphore implementation. I'm now at
90 locks (20% of all locking in this .config) and that's just to quiet
the DEBUG_PREEMPT violations on my testboxes.

and no matter how well UP works, to fix SMP one has to 'cover' all the
necessary locks first before fixing it, which (drastic) increase in raw
locks invalidates most of the UP efforts of getting rid of raw locks.
That's why i decided to go for SMP primarily - didnt see much point in
going for UP.

> We will take a look at the T5 patch and see what we can do about PI
> for the system semaphore, but I am not sure how portable it would be
> without also touching the assembly. FWIW PMutex is already based in
> part on the system semaphore, so we might get similar problems when
> porting elsewhere.

there are in-C variants of Linux mutexes and rw-semaphores in the kernel
source, so worst-case we could just make use of them in the
PREEMPT_REALTIME case. I'm not a big fan of assembly optimizations (or
having to touch assembly optimizations) at an early stage like this.

Ingo

2004-10-12 18:51:18

by Daniel Walker

[permalink] [raw]
Subject: Re: [ANNOUNCE] Linux 2.6 Real Time Kernel

On Mon, 2004-10-11 at 13:49, Ingo Molnar wrote:
> * Daniel Walker <[email protected]> wrote:
>
> what do you think about the PREEMPT_REALTIME stuff in -T4? Ideally, if
> you agree with the generic approach, the next step would be to add your
> priority inheritance handling code to Linux semaphores and
> rw-semaphores. The sched.c bits for that looked pretty straightforward.
> The list walking is a bit ugly but probably unavoidable - the only other
> option would be 100 priority queues per semaphore -> yuck.


I think patch size is an issue, but I also think that, eventually, we
should change all spin_lock calls that actually lock a mutex to be more
distinct so it's obvious what is going on. Sven and I both agree that
this should be addressed. Is this a non-issue for you? What does the
community want? I don't find your code or ours acceptable in its
current form, due to this issue.

With the addition of PREEMPT_REALTIME it looks like you more than
doubled the size of voluntary preempt. I really feel that it should
remain as two distinct patches. They are dependent, but the scope of
the changes is too vast to lump it all together.

Daniel Walker

2004-10-12 19:55:23

by Thomas Gleixner

[permalink] [raw]
Subject: Re: [Ext-rt-dev] Re: [ANNOUNCE] Linux 2.6 Real Time Kernel

On Tue, 2004-10-12 at 20:50, Daniel Walker wrote:
> > what do you think about the PREEMPT_REALTIME stuff in -T4? Ideally, if
> > you agree with the generic approach, the next step would be to add your
> > priority inheritance handling code to Linux semaphores and
> > rw-semaphores. The sched.c bits for that looked pretty straightforward.
> > The list walking is a bit ugly but probably unavoidable - the only other
> > option would be 100 priority queues per semaphore -> yuck.
>
> I think patch size is an issue, but I also think that , eventually, we
> should change all spin_lock calls that actually lock a mutex to be more
> distinct so it's obvious what is going on. Sven and I both agree that
> this should be addressed. Is this a non-issue for you? What does the
> community want? I don't find your code or ours acceptable in it's
> current form , due to this issue.
>
> With the addition of PREEMPT_REALTIME it looks like you more than
> doubled the size of voluntary preempt. I really feel that it should
> remain as two distinct patches. They are dependent , but the scope of
> the changes are too vast to lump it all together.
>

Both patches (MV's & Ingo's) have their good bits, but both share the same
ugliness and are hard to compare and harder to combine. The conversion
of spin_lock to _spin_lock and the substitution of spin_lock by mutexes,
semaphores or whatever makes it more than hard to keep the code in a
readable form.

If there is a tendency to touch the concurrency controls in general
all over the kernel, then I would suggest a script-driven overhaul of
all concurrency controls like spin_locks, mutexes and semaphores into
general macros like

enter_critical_section(TYPE, &var, &flags, whatever);
leave_critical_section(TYPE, &var, flags, whatever);

where TYPE might be SPIN_LOCK, SPIN_LOCK_IRQ, MUTEX, PMUTEX or whatever
we have and come up with in the future.
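
One conceivable way such wrappers could be realized is plain token
pasting (a sketch under assumed names; only two TYPEs are shown and the
helper macros are invented for illustration):

#define enter_critical_section(TYPE, var, flagsp, ...) \
	__enter_##TYPE(var, flagsp)
#define leave_critical_section(TYPE, var, flagsp, ...) \
	__leave_##TYPE(var, flagsp)

#define __enter_SPIN_LOCK(l, f)		spin_lock(l)
#define __leave_SPIN_LOCK(l, f)		spin_unlock(l)
#define __enter_MUTEX(m, f)		down(m)
#define __leave_MUTEX(m, f)		up(m)

A call site would then read enter_critical_section(SPIN_LOCK, &some_lock,
&flags, 0), and switching that section to another implementation is a
one-token change at the call site, which is the point being made.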

This could be done as a first step; then it is clearly identifiable,
it gives us more flexibility to wrap different implementations, and it
lets us change particular points in a clearer way.

I would be willing to provide some scripted conversion aid, if there is
enough interest in that. I started with some test files and the results
are quite encouraging.

Any thoughts ?

tglx








2004-10-12 20:32:15

by Sven-Thorsten Dietrich

[permalink] [raw]
Subject: RE: [Ext-rt-dev] Re: [ANNOUNCE] Linux 2.6 Real Time Kernel

> >
> > I think patch size is an issue, but I also think that , eventually, we
> > should change all spin_lock calls that actually lock a mutex to be more
> > distinct so it's obvious what is going on. Sven and I both agree that
> > this should be addressed. Is this a non-issue for you? What does the
> > community want? I don't find your code or ours acceptable in it's
> > current form , due to this issue.
> >
> > With the addition of PREEMPT_REALTIME it looks like you more than
> > doubled the size of voluntary preempt. I really feel that it should
> > remain as two distinct patches. They are dependent , but the scope of
> > the changes are too vast to lump it all together.
> >
>
>
> If there is the tendency to touch the concurrency controls in general
> all over the kernel, then I would suggest a script driven overhaul of
> all concurrency controls like spin_locks, mutexes and semaphores to
> general macros like
>
> enter_critical_section(TYPE, &var, &flags, whatever);
> leave_critical_section(TYPE, &var, flags, whatever);
>
> where TYPE might be SPIN_LOCK, SPIN_LOCK_IRQ, MUTEX, PMUTEX or whatever
> we have and come up with in the future.
>
> This could be done in a first step and then it is clearly identifiable
> and it gives us more flexibility to wrap different implementations and
> lets us change particular points in a more clear way.
>
> I would be willing to provide some scripted conversion aid, if there is
> enough interest to that. I started with some test files and the results
> are quite encouraging.
>



Ideally we would eventually provide some level of tunability, i.e.
if you want the spinlocks all the way around it should be possible
to have that, or one could enable degrees of enhancements,
expanding on the existing sequence starting with PREEMPT, IRQ_THREADS,
BKL, MUTEX, etc. In addition to that, once the minimum set of spinlocks
necessary for RT is established, additional layers, corresponding to
the lock nesting order, could be established, making the "mutex-depth"
somewhat configurable based on the performance requirements.

The entire effort would have the side effect of making the locking and
critical sections more distinct, revealing soft spots in concurrency
code, and raising awareness of the code density inside
critical sections.

The concept of tunable foreground / background responsiveness,
based on preemptability of low priority processes comes to mind.
A lot of folks would probably not mind making UI responsiveness
a little crisper, others will want the throughput.

I realize this is an early stage to be looking at it from such a high level,
but I think in general this type of script would not be a bad addition
to the patch kit(s).


Sven


2004-10-12 20:48:01

by Thomas Gleixner

[permalink] [raw]
Subject: RE: [Ext-rt-dev] Re: [ANNOUNCE] Linux 2.6 Real Time Kernel

On Tue, 2004-10-12 at 22:31, Sven Dietrich wrote:
> > >
> > > I think patch size is an issue, but I also think that , eventually, we
> > > should change all spin_lock calls that actually lock a mutex to be more
> > > distinct so it's obvious what is going on. Sven and I both agree that
> > > this should be addressed. Is this a non-issue for you? What does the
> > > community want? I don't find your code or ours acceptable in it's
> > > current form , due to this issue.
> > >
> > > With the addition of PREEMPT_REALTIME it looks like you more than
> > > doubled the size of voluntary preempt. I really feel that it should
> > > remain as two distinct patches. They are dependent , but the scope of
> > > the changes are too vast to lump it all together.
> > >
> >
> >
> > If there is the tendency to touch the concurrency controls in general
> > all over the kernel, then I would suggest a script driven overhaul of
> > all concurrency controls like spin_locks, mutexes and semaphores to
> > general macros like
> >
> > enter_critical_section(TYPE, &var, &flags, whatever);
> > leave_critical_section(TYPE, &var, flags, whatever);
> >
> > where TYPE might be SPIN_LOCK, SPIN_LOCK_IRQ, MUTEX, PMUTEX or whatever
> > we have and come up with in the future.
> >
> > This could be done in a first step and then it is clearly identifiable
> > and it gives us more flexibility to wrap different implementations and
> > lets us change particular points in a more clear way.
> >
> > I would be willing to provide some scripted conversion aid, if there is
> > enough interest to that. I started with some test files and the results
> > are quite encouraging.
> >

> Ideally we would eventually provide some level of tunability, i.e.
> if you want the spinlocks all the way around it should be possible
> to have that, or one could enable degrees of enhancements,
> expanding on the existing sequence starting with PREEMPT, IRQ_THREADS,
> BKL, MUTEX, etc. In addition to that, once the minim set of spinlocks
> necessary for RT is established, additional layers, corresponding to
> the lock nesting order, could be established, making the "mutex-depth"
> somewhat configurable based on the performance requirements.
>
> The entire effort would have the side effect of making the locking and
> critical sections more distinct, and reveal soft spots in concurrency
> code, as well as to raise awareness of the code density inside
> critical sections.
>
> The concept of tunable foreground / background responsiveness,
> based on preemptability of low priority processes comes to mind.
> A lot of folks would probably not mind making UI responsiveness
> a little crisper, others will want the throughput.

Yup, and having a unique identifiable thing for all that stuff in the
code would make life easier for coders and for people who want to
experiment and change things.

> I realize this is an early stage to be looking at it so high end,
> but I think in general this type of script would not be a bad addition
> to the patch kit(s).

Ok, will try to make it work on more than two files and two patterns.

Any preferences on scripting language ?

tglx


2004-10-12 21:12:43

by Bill Huey

[permalink] [raw]
Subject: Re: [Ext-rt-dev] Re: [ANNOUNCE] Linux 2.6 Real Time Kernel

On Tue, Oct 12, 2004 at 09:46:34PM +0200, Thomas Gleixner wrote:
> Both patches (MV & Ingos) have their good bits, but both share the same
> ugliness and are hard to compare and harder to combine. The conversion
> of spin_lock to _spin_lock and substitution of spin_lock by mutexes,
> semaphores or what ever makes it more than hard to keep the code in a
> readable form.
>
> If there is the tendency to touch the concurrency controls in general
> all over the kernel, then I would suggest a script driven overhaul of
> all concurrency controls like spin_locks, mutexes and semaphores to
> general macros like
>
> enter_critical_section(TYPE, &var, &flags, whatever);
> leave_critical_section(TYPE, &var, flags, whatever);

FreeBSD uses these things, but they create severe pipeline stalls
since they toggle interrupt flags on entry and exit. The current scheme
in Linux with preempt_count used to be a curse when I was working on an
equivalent implementation of their stuff at:

http://mmlinux.sf.net

It's a project I've been working on for a long time and I'm farther than
them in the area of stability and most likely the problem space in general.
They are 7 engineers and I am a single engineer, though.

I don't have the latest sources up and I'm going to upload them in a
couple of hours. I've been playing with it for about 2 months, since late
July, when it was able to boot reliably, and I've felt/measured how a fully
preemptable kernel like this can perform. I'm getting about 4-6 usecs
average latency in the system from interrupt exception frame to the start
of the irq-thread in question. Tons of events were at 2 usecs, which I
thought was insane at the time, but an ndelay insert into the path verified
this to be correct. The majority of the spread was at 5 and 10 usecs,
pushing to about 12 usecs. That's fantastic latency performance and I
was floored when the measurements were validating my preemption ideas
at the time.

> where TYPE might be SPIN_LOCK, SPIN_LOCK_IRQ, MUTEX, PMUTEX or whatever
> we have and come up with in the future.

There are two problems that need to be solved at this moment regarding
this issue. One is long term: a clear differentiation of what remains a
persistent spinlock across a compile-time .config choice
(preemptable or standard kernel) is useful, since it clearly
identifies which devices and low-level systems are affected. The other is
Ingo's need to be able to rapidly change mutexes at the drop of a hat.
Eventually, the long-term goal will impose on stylistic issues in the
Linux kernel community and papers/documentation will have to be written
to describe these changes across all kernel subsystems and drivers. It's
complete epic flame bait.

In my system, I do exactly what you just outlined. With a three-character
"vim" command, I capitalize the entire word, spin_lock -> SPIN_LOCK,
repeated with a ".". I chose this convention because capitals stand out
broadly in the source code. It's good because having this kind of
visibility can show static/compile-time sleep violations that are the
main source of instability, and almost certainly all of the deadlocks
in Monta Vista's current preemption release.

My tree is stable. I was able to hammer this machine for 2-3 days straight
(no networking, that's another major can of worms) without deadlocking,
using multiple mass "find / -exec egrep" runs of some sort that stress both
process creation and all parts of the IO system.

The lock graph changes I made ironically outlined some serious Linux
structural problems as they concern latency. Through my effort of fixing
all of the sleep violations, I came all of the way back to the start of
the project, which is that all major systems have become non-preemptable
again.

That graph that I saw from Lee is consistent with my results in that a
deadlock prone system will have phenomenal latency performance at the
expense of being absolutely incorrect. It's just a flat out broken
system at this point that they've released.

> This could be done in a first step and then it is clearly identifiable
> and it gives us more flexibility to wrap different implementations and
> lets us change particular points in a more clear way.

Yes, I agree, but the convention needs to be standardized.

> I would be willing to provide some scripted conversion aid, if there is
> enough interest to that. I started with some test files and the results
> are quite encouraging.

No, all of this can only be manual at this time, either through static
analysis by a compiler, like what Ingo did over the weekend, or by hand
with runtime sleep violation checks.

Give me a bit of time to upload those files. I was just given permission
to talk about this openly now. But I can definitely tell you that I had
this running months before Monta Vista's announcement over the weekend.

Full preemption has just heated up in a serious way. :) It's going to be
interesting.

> Any thoughts ?

bill

2004-10-12 21:24:50

by Bill Huey

[permalink] [raw]
Subject: Re: [Ext-rt-dev] Re: [ANNOUNCE] Linux 2.6 Real Time Kernel

On Tue, Oct 12, 2004 at 02:12:01PM -0700, Bill Huey wrote:
> On Tue, Oct 12, 2004 at 09:46:34PM +0200, Thomas Gleixner wrote:
> > enter_critical_section(TYPE, &var, &flags, whatever);
> > leave_critical_section(TYPE, &var, flags, whatever);
>
> FreeBSD uses these things, but it they create severe pipeline stalls
> since they toggle interrupt flags on entry and exit. The current scheme
> in Linux with preempt_count use to be a curse when I was working on an
> equivalent implementation of there stuff at:
>
> http://mmlinux.sf.net

Duh, I didn't finish the sentence. I meant that the method above is nasty,
filled with pipeline stalls. I don't know if that's what you were saying, but
non-preemptable critical sections denoted by preempt_count must have some
kind of conceptual overlap with the local_irq* functions. I used to curse the
separation of the two since it made my own conception irregular, but I
have come to the conclusion that using something relatively lightweight
like preempt_count() for that functionality is better instead. That's what I
meant. :)

bill

2004-10-12 21:40:17

by Thomas Gleixner

[permalink] [raw]
Subject: Re: [Ext-rt-dev] Re: [ANNOUNCE] Linux 2.6 Real Time Kernel

On Tue, 2004-10-12 at 23:24, Bill Huey wrote:
> On Tue, Oct 12, 2004 at 02:12:01PM -0700, Bill Huey wrote:
> > On Tue, Oct 12, 2004 at 09:46:34PM +0200, Thomas Gleixner wrote:
> > > enter_critical_section(TYPE, &var, &flags, whatever);
> > > leave_critical_section(TYPE, &var, flags, whatever);
> >
> > FreeBSD uses these things, but it they create severe pipeline stalls
> > since they toggle interrupt flags on entry and exit. The current scheme
> > in Linux with preempt_count use to be a curse when I was working on an
> > equivalent implementation of there stuff at:

You missed the point. TYPE decides whether to toggle interrupts or not.
It's a generic functional equivalent which identifies sections of code
that must be protected. The grade of protection is defined in TYPE.

> > http://mmlinux.sf.net
>
> Duh, I didn't finish the sentence. I meant this method above is nasty
> filled with pipeline stalls. Don't know if that's what were saying, but
> non-preemptable critical sections denoted by preempt_count must have some
> kind of conceptual overlap with local_irq* functions. I use to curse the
> seperation of the two since it made my own conception irregular, but I
> have come to the conclusion that using relatively something light weight
> like preempt_count() for that functionality instead. That's what I
> meant. :)

I don't see a drawback in the proposal of the enter_critical_section and
leave_critical_section conversion.

They indicate a non-preemptible region, which must be protected in one
way or another. Which way is chosen must be evaluated by the
programmer.

There are several grades, from preempt_disable over mutexes, spinlocks
and irq blocking. All those grades allow different implementations for
different goals.

Systems which are optimized for throughput will use other mechanisms than
systems which are optimized for guaranteed response times.

There is no generic solution available for those problems.

But having a generic identifiable expression is more suitable for
improvements than struggling with substitutions of x, y and z.

tglx







2004-10-12 21:42:43

by Sven-Thorsten Dietrich

[permalink] [raw]
Subject: RE: [Ext-rt-dev] Re: [ANNOUNCE] Linux 2.6 Real Time Kernel


I emailed the mmlinux project about 2 months ago,
telling you that we were doing this.

There was no response.

I am sorry that the early stage of our development upsets you.

It was intended to promote discussion, and that seems to be working.

We are aware of the issues you describe, and are making
every effort to raise awareness of these problems.

It is difficult to solve them for a team of 1 or N,
in a maintainable fashion, as it requires some level
of awareness by the maintainers that we are looking
at it from that angle.

Thanks for the insights, we look forward to seeing your
implementation added to the smorgasbord ;)

Sven




2004-10-12 22:08:30

by Thomas Gleixner

[permalink] [raw]
Subject: Re: [Ext-rt-dev] Re: [ANNOUNCE] Linux 2.6 Real Time Kernel

On Tue, 2004-10-12 at 23:12, Bill Huey wrote:
> My tree is stable. I was able to hammer this machine for 2-3 days straight
> (no networking, that's another major can of worms) with deadlocking
> using multipule mass "find / -exec egrep" of some sort that stress both
> process creation and all parts of the IO system.

Heh, a system without networking is a real measurement? Ever heard of
hackbench in combination with ping -f?

> That graph that I saw from Lee is consistent with my results in that a
> deadlock prone system will have phenomenal latency performance at the
> expense of being absolutely incorrect. It's just a flat out broken
> system at this point that they've released.

That's a major problem caused by "dumb" priority inheritance. The goal is
not priority inheritance at the very end. It's proxy execution, where
priority inheritance is a subset.

> > This could be done in a first step and then it is clearly identifiable
> > and it gives us more flexibility to wrap different implementations and
> > lets us change particular points in a more clear way.
>
> Yes, I agree, but the convention needs to be standardized.

That's all I was talking about.

> > I would be willing to provide some scripted conversion aid, if there is
> > enough interest to that. I started with some test files and the results
> > are quite encouraging.
>
> No, all of this can only be manual at this time, either through static
> analysis by a compiler, like what Ingo did over the weekend or by hand
> with runtime sleep violation checks.

I'm not talking about automatic conversion of rules. I'm talking about
automatic conversion of different concurrency controls into an
equivalence function, which lets you better identify the necessary
manual changes and leaves room for simple and non-intrusive replacement
implementations.

> Give me a bit of time to upload those files. I was just given permission
> to talk about this openly now. But I can definitely tell you that I had
> this running months before Monta Vista's announcement over the weekend.

There are a bunch of other efforts underway around the world, which
might be concentrated now into one.

tglx


2004-10-12 22:39:28

by Bill Huey

[permalink] [raw]
Subject: Re: [Ext-rt-dev] Re: [ANNOUNCE] Linux 2.6 Real Time Kernel

On Wed, Oct 13, 2004 at 12:00:16AM +0200, Thomas Gleixner wrote:
> On Tue, 2004-10-12 at 23:12, Bill Huey wrote:
> > My tree is stable. I was able to hammer this machine for 2-3 days straight
> > (no networking, that's another major can of worms) with deadlocking
> > using multipule mass "find / -exec egrep" of some sort that stress both
> > process creation and all parts of the IO system.
>
> He, a system without networking is a real measurement ? Ever heard of
> hackbench in combination with ping -f ?

The problem with doing this project is to create an identically
functioning system that's correct. The current track taken by Monta Vista
is highly unstable given the lack of locking throughout their kernel. It
has all of the complexities of mutex-style conventions without any debugging
methodology attached to it. It's no longer the spinlock universe that
Linux is using, since a deadlock situation just leaves us running in
cpu_idle wondering what is going on.

It's something that needs to be addressed in the larger scheme of the project.

> > That graph that I saw from Lee is consistent with my results in that a
> > deadlock prone system will have phenomenal latency performance at the
> > expense of being absolutely incorrect. It's just a flat out broken
> > system at this point that they've released.
>
> Thats a major problem caused by "dumb" priority inheritence. The goal is
> not priority inheritence at the very end. It's proxy execution, where
> priority inheritence is a subset.

This has been articulated a couple of times by both me and Ingo (recent email).
MV's system is highly unstable, not because of priority inheritance,
but because of basic lock violations in the lock graph itself. It's another kind
of SMP granularity problem. The hard problem is just what Ingo was saying,
but higher up in the graph.

> > Yes, I agree, but the convention needs to be standardized.
>
> That's all I was talking about.

Yeah, it needs to be done. I like the "_" methodology that both Monta Vista
and Ingo are using. I'll convert my stuff over to using it when I'm finished
with a couple of large items here.

> I'm not talking about automatic conversion of rules. I'm talking about
> automatic conversion of different concurrency controls into a
> equivillance function, which lets you better identify the neccecary
> manual changes and leaves room for simple and non intrusive replacement
> implementations.

This is kind of a sketchy problem. So far all of what I've seen really needs
to be done manually and can be done using all of the normal Linux locking
and scheduler/interrupt masking primitives. I'd hate to see another system
added to this that solves a problem that may not exist. Please correct
me if I'm not understanding you.

> > Give me a bit of time to upload those files. I was just given permission
> > to talk about this openly now. But I can definitely tell you that I had
> > this running months before Monta Vista's announcement over the weekend.

> There are a bunch of other efforts underway around the world, which
> might be concentrated now into one.

bill

2004-10-12 22:57:41

by Bill Huey

[permalink] [raw]
Subject: Re: [Ext-rt-dev] Re: [ANNOUNCE] Linux 2.6 Real Time Kernel

On Tue, Oct 12, 2004 at 02:41:02PM -0700, Sven Dietrich wrote:
>
> I emailed the mmlinux project about 2 months ago,
> telling you that we were doing this.

I don't remember getting an email from you. I get tons of
email at times and I don't know if I lost it or not.
I'm sorry if I didn't respond to you, but being in the
context of commercial development has a certain kind of
conflict with open source culture, and balancing them with
competitors is tenuous and tense. I'm about as die-hard
open source as it gets, and it's a difficult balance if
one thinks of this problem within these constraints.

> There was no response.
>
> I am sorry that the early stage of our development upsets you.

Well, it kind of forced a number of things to happen that
are premature, from multiple folks, namely me (Ingo can
speak for himself). I didn't want to release these patches
until I had solved a number of really critical problems,
since it would have made the release rather useless.

But since this is in a commercial context we have to save
face by at least putting our cards on the table and establishing
a sort of role in this community. That commercial development
attitude is the reason why I haven't been permitted to talk about
this stuff openly, only sort of on the side in various
preemption discussions.

> It was intended to promote discussion, and that seems to be working.

Yeah, for me it was a bit of a freak-out Saturday that is still
kind of happening, since this has been a personal project
of mine for a long time. :) I interpreted it as a visibility
move on your company's part, and I hate to say it is a bit
unnerving to know that another group was doing the same
work. TimeSys's Scott Wood and friends are doing something
like this as well. I'm only being fair by mentioning them. :)

BTW, I'm using their irq-thread patches with modifications.
I intuited that they were doing an incremental model, which,
since this problem space is a bit better known now, is no longer
a clearly viable track for them, assuming they are going this
route, because of all of the recent work.

There's going to be tons of overlap here and I suspect
that Ingo is going to kick all of our respective commerical
butts. :)

> We are aware of the issues you describe, and are making
> every effort to raise awareness of these problems.

> It is difficult to solve them for a team of 1 or N,
> in a maintainable fashion, as it requires some level
> of awareness by the maintainers that we are looking
> at it from that angle.

> Thanks for the insights, we look forward to seeing your
> implementation added to the smorgasbord ;)

Well, uh, at least you're single-kernel-image folks like
us and not flaming us/me yet for corrupting the sanctity
of Linux. Oh man, I feel a flame war coming. This is such
touchy material.

What's Monta Vista's attitude toward preemption development?
Open or closed? I know this is a charged question, but
this has to be asked. :)

This commercial thing is going to be weird. I wish I was
an angry hippie instead of having a job at certain moments. :)

But the bay area is pretty damn cool, so... that makes up
for it. :)

bill

2004-10-12 23:14:19

by Bill Huey

[permalink] [raw]
Subject: Re: [Ext-rt-dev] Re: [ANNOUNCE] Linux 2.6 Real Time Kernel

On Tue, Oct 12, 2004 at 11:32:18PM +0200, Thomas Gleixner wrote:
> You missed the point. TYPE decides whether to toggle interrupts or not.
> It's a generic function equivivalent, which identifies sections of code,
> which must be protected. The grade of protection is defined in TYPE.

Sorry, I misunderstood out of my impulsiveness. If I understand you,
you just want a gradual method of determining which critical sections
need to be preemptive or not depending if you need a server or RT
performance ?

I thought you were talking about something else if this is the case.

bill

2004-10-12 23:19:23

by Thomas Gleixner

[permalink] [raw]
Subject: Re: [Ext-rt-dev] Re: [ANNOUNCE] Linux 2.6 Real Time Kernel

On Wed, 2004-10-13 at 00:36, Bill Huey wrote:
> On Wed, Oct 13, 2004 at 12:00:16AM +0200, Thomas Gleixner wrote:
> > On Tue, 2004-10-12 at 23:12, Bill Huey wrote:
> > > My tree is stable. I was able to hammer this machine for 2-3 days straight
> > > (no networking, that's another major can of worms) with deadlocking
> > > using multipule mass "find / -exec egrep" of some sort that stress both
> > > process creation and all parts of the IO system.
> >
> > He, a system without networking is a real measurement ? Ever heard of
> > hackbench in combination with ping -f ?
>
> The problem with doing this project is to create an identically
> functioning system that's correct. The current track taking by Monta Vista
> is highly unstable given the lack of locking throughout their kernel. It
> has all of the complexities of mutex style conventions without any debugging
> methodology attached to it. It's no longer the spinlock universe that
> Linux is using since a deadlock situation just leaves use running in
> cpu_idle wondering what is going on.
>
> It's something that needs to be address in the large scheme of the project.

Ack.

> > > That graph that I saw from Lee is consistent with my results in that a
> > > deadlock prone system will have phenomenal latency performance at the
> > > expense of being absolutely incorrect. It's just a flat out broken
> > > system at this point that they've released.
> >
> > Thats a major problem caused by "dumb" priority inheritence. The goal is
> > not priority inheritence at the very end. It's proxy execution, where
> > priority inheritence is a subset.
>
> This has been articulate a couple of times by both me and Ingo (recent email).
> The MV's system is highly unstable, not because of priority inheritance,
> but because of basic lock violation in the lock graph itself. It's another kind
> of SMP granularity problem. The hard problem was just what Ingo was saying and
> it's higher, but higher in the graph.

Can you point me a bit more clear on what you are talking about ?

> > > Yes, I agree, but the convention needs to be standardized.
> >
> > That's all I was talking about.
>
> Yeah, it needs to be done. I like the "_" methodology that both Monta Vista
> and Ingo are using. I'll convert my stuff over to using it when I'm finished
> with a couple of large items here.

That's totally fucked up. Compile XFS with that and you are toast. That's
ugly and not understandable/fixable for anybody in the universe without
more ugly and less understandable hacks. Yes, I managed to get XFS up,
but I refuse to show the patch, because it makes me barf when I look
into it.

_spinlock = spinlock
spinlock = mutex
_mutex = semaphore
semaphore = whatever
....

That's violating every single aspect of software design. That's messing
up the whole kernel.

What do we have at the very end? An endless mess of non-understandable
macros, which are resolved by compiler magic? Where nobody can see at
first look which kind of concurrency control you are using? That's
a nice thing for some proof-of-concept implementation, but it can not
be a solution for something that is targeted to go into mainline. The
frequency of T4-T7 patches, including the small fixes posted on LKML, is
just proof of this.

> > I'm not talking about automatic conversion of rules. I'm talking about
> > automatic conversion of different concurrency controls into a
> > equivillance function, which lets you better identify the neccecary
> > manual changes and leaves room for simple and non intrusive replacement
> > implementations.
>
> This is kind of a sketchy problem. So far all of what I've seen really needs
> to be done manually and can be done using the all of the normal Linux locking
> and scheduler/interrupt masking primitives. I'd hate to see another system
> added to this that solves a problem that may not exist. Please, correct
> me if I'm not understanding you.

We have spinlocks, mutexes, semaphores and preemption as types of
concurrency control implementations in the kernel. They represent
different grades of access exclusion control.

But all of them have one thing in common: exclusive access to resources.

So the natural consequence is to convert _all_ concurrency control
mechanisms into a single identifiable one. That's a purely semantical
conversion, in terms of macro replacement, where no functional change
takes place.

After you have done this, it is much easier to

a) identify the nested places, as you have to look for exactly one
pattern instead of N
b) easily experiment with replacement functions
c) make clear which changes to the code you are making

substituting

enter_critical_section(SPIN_LOCK,....) by
enter_critical_section(XYZ_MUTEX,....) is
understandable for most people.

Changing it by hidden gcc magic is not.

The bad thing about hidden gcc magic is that you will not be able to
analyse nested concurrency controls in one go. You have to figure out
what the heck spin_lock vs. _spin_lock vs. semaphore vs. _semaphore vs.
mutex vs. _mutex means.

So cleaning up in the purely semantic (clear wording) sense is the
first step to take, instead of changing a bunch of macros all over the place
and breaking half of the kernel compile.

tglx













2004-10-12 23:22:24

by Adam Heath

[permalink] [raw]
Subject: Re: [Ext-rt-dev] Re: [ANNOUNCE] Linux 2.6 Real Time Kernel

On Tue, 12 Oct 2004, Bill Huey wrote:

> Yeah, for me a bit of freak out Saturday that is still
> kind of happening since this has been a personal project
> of mine for a long time. :) I interpreted it as a visibility
> move on your company's part, which I hate to say is a bit
> unnerving to know that another group was doing the same
> work. TimeSys's Scott Wood and friends are doing something
> like this as well. I'm only being fair by mentioning them. :)

This is because companies and individuals still think that developing things
privately is the correct way to go. Doing things this way leaves
open the possibility that someone else will do the same bit of work, and
the final output will clash.

Remember, release early, release often.

2004-10-12 23:33:10

by Thomas Gleixner

[permalink] [raw]
Subject: Re: [Ext-rt-dev] Re: [ANNOUNCE] Linux 2.6 Real Time Kernel

On Wed, 2004-10-13 at 00:57, Bill Huey wrote:
> But since this is in a commerical context we have to save
> face by at least putting our cards on the table and establishing
> a sort of role in this community.

Yeah, a pretty good way to establish a role: by keeping your mouth shut
and letting others do redundant work.

> That commericial development
> attitude the reason why I haven't been permitted to talk about
> this stuff openly, only sort of on the side in various
> preemption discussions.

Discuss this with your company.

> Yeah, for me a bit of freak out Saturday that is still
> kind of happening since this has been a personal project
> of mine for a long time. :) I interpreted it as a visibility
> move on your company's part, which I hate to say is a bit
> unnerving to know that another group was doing the same
> work. TimeSys's Scott Wood and friends are doing something
> like this as well. I'm only being fair by mentioning them. :)

There are other people around who worked on similar things openly.

> Well, uh, at least you're single kernel image folks like
> us and not flaming us/me yet for corrupting the sancity
> of Linux. Oh man, I feel a flame war coming. This is such
> touchy material.

The flame war might come, ...

> What's Monta Vista's attitude toward preemption development ?
> open or closed ? I know this is a charged question, but
> this has to be asked. :)
>
> This commerical thing is going to be weird. I wish I was
> an angry hippie instead of having a job at certain moments. :)
>
> But the bay area is pretty damn cool, so... that makes up
> for it. :)

... but not about realtime improvements.

It might be about the "hey, we're putting this up to play a role" attitude
of companies.

tglx



2004-10-12 23:33:42

by Bill Huey

[permalink] [raw]
Subject: Re: [Ext-rt-dev] Re: [ANNOUNCE] Linux 2.6 Real Time Kernel

On Wed, Oct 13, 2004 at 01:10:34AM +0200, Thomas Gleixner wrote:
> > This has been articulate a couple of times by both me and Ingo (recent email).
> > The MV's system is highly unstable, not because of priority inheritance,
> > but because of basic lock violation in the lock graph itself. It's another kind
> > of SMP granularity problem. The hard problem was just what Ingo was saying and
> > it's higher, but higher in the graph.
>
> Can you point me a bit more clear on what you are talking about ?

It's just a lock graph dependency problem. Things up top in the graph
force things below them to be non-preemptable. The things up top need
to be changed so that things below them can also be preemptable. Sleeping
within an atomic critical section, local_irq* or preempt_count() > 0,
is a deadlock waiting to happen.
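
The runtime side of that rule is a check along the lines of the kernel's
own might_sleep()/DEBUG_PREEMPT reporting; a minimal sketch (illustration
only, not the actual implementation):

void assert_may_sleep(void)
{
	/* Sleeping with preemption or interrupts off is the deadlock
	 * waiting to happen described above. */
	if (in_atomic() || irqs_disabled())
		printk(KERN_ERR "BUG: sleeping call in atomic section\n");
}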

> So the natural consequence is to convert _all_ concurrency control
> mechanisms into a single identifiable one. That's a purely semantical
> conversion, in terms of macro replacement, where no functional change
> takes place.
...
> The bad thing of hidden gcc magic is that you will not be able to
> analyse nested concurrency controls in one go. You have to figure out
> what the heck spin_lock vs. _spin_lock vs. semaphore vs. _semaphore vs.
> mutex vs. _mutex means.

Yeah, I thought of it initially as a great idea, but ultimately this
is going to impose on the overall Linux development methodology if
these patches go into the mainstream.

I know what you're saying, but I ask you to be patient. All of this
stuff is going to get cleaned up when I get some critical parts in place.
And, yes, I do agree that this is unspeakably horrid. The static
type determination thing probably will have to be removed at some point,
but it's useful for rapid changing in the kernel at this time so that
Ingo can make changes to keep up with MontaVista.

All I can ask is for folks to be patient as all groups get synced up
to each other, and then we'll be able to talk about it more meaningfully.
A bunch of things will fall into place once all parties are mentally
synced up.

bill

2004-10-12 23:43:04

by Lee Revell

[permalink] [raw]
Subject: Re: [Ext-rt-dev] Re: [ANNOUNCE] Linux 2.6 Real Time Kernel

On Tue, 2004-10-12 at 19:17, Adam Heath wrote:
> This is because companies and inviduals still think that developing things
> privately is the correct way to go. Doing things this way will leave
> open the possibility that someone else will do the same bit of work, and
> the final output will clash.
>
> Remember, release early, release often.

Except that none of the parties involved claim to have solved all the
priority inheritance issues etc. "Releasing early" when it doesn't work
yet just makes you look bad. There are perfectly valid reasons to do
kernel development privately. MontaVista was doing just that, and when they
saw that some of their work might be duplicated they released it. I
don't see how this conflicts with the open source development process at
all.

Lee

2004-10-12 23:45:11

by Thomas Gleixner

[permalink] [raw]
Subject: Re: [Ext-rt-dev] Re: [ANNOUNCE] Linux 2.6 Real Time Kernel

On Wed, 2004-10-13 at 01:33, Bill Huey wrote:
> Yeah, I thought of it initially as a great idea, but ultimately this
> is going to impose a burden on the overall Linux development methodology if
> these patches go into mainline.
>
> I know what you're saying, but I ask you to be patient. All of this
> stuff is going to get cleaned up when I get some critical parts in place.
> And, yes, I do agree that this is unspeakably horrid. The static
> type determination thing probably will have to be removed at some point,
> but it's useful for rapid change in the kernel at this time so that
> Ingo can make changes to keep up with MontaVista.
>
> All I can ask is for folks to be patient as all groups get synced up
> to each other and then we'll be able to talk about it more meaningfully.
> A bunch of things will fall into place once all parties are mentally
> synced up.

Hey, what are you talking about ?

Everybody should shut up, until some people have decided that others can
participate in the development ?

I proposed this to stop this stupid race for the better solution, which
is ugly and horrid, as you admit yourself.

There is no rush to push these enhancements in overnight, and there is
no Nobel prize to win.

Both groups have published their incomplete solutions, and now we should
stop and contemplate how to merge this effort in a less nerve-racking
way, so we can improve and investigate this further on a common base.

tglx




2004-10-12 23:52:41

by Bill Huey

[permalink] [raw]
Subject: Re: [Ext-rt-dev] Re: [ANNOUNCE] Linux 2.6 Real Time Kernel

On Wed, Oct 13, 2004 at 01:37:06AM +0200, Thomas Gleixner wrote:
> Hey, what are you talking about ?
>
> Everybody should shut up, until some people have decided that others can
> participate in the development ?

No, just wait and your (everybody's) concern should be addressed. It takes
time to work through all of the slop. I'm all for syncing to a single
solution, but there are a ton of problems that still need to be addressed.

> I proposed this to stop this stupid race for the better solution, which
> is ugly and horrid, as you admit yourself.

Yes, the efforts are distant from each other and it's going to take time
to reconcile them. I'm probably going to use Ingo's stuff in 2.6.9+, but my
stuff in 2.6.7 is useful as a specialized kind of test harness. I'll
have to think about the best way of resolving this. I agree on
these points.

bill

2004-10-13 00:31:31

by George Anzinger

[permalink] [raw]
Subject: Re: [Ext-rt-dev] Re: [ANNOUNCE] Linux 2.6 Real Time Kernel

Sven Dietrich wrote:
>>>I think patch size is an issue, but I also think that, eventually, we
>>>should change all spin_lock calls that actually lock a mutex to be more
>>>distinct so it's obvious what is going on. Sven and I both agree that
>>>this should be addressed. Is this a non-issue for you? What does the
>>>community want? I don't find your code or ours acceptable in its
>>>current form, due to this issue.
>>>
>>>With the addition of PREEMPT_REALTIME it looks like you more than
>>>doubled the size of voluntary preempt. I really feel that it should
>>>remain as two distinct patches. They are dependent, but the scope of
>>>the changes is too vast to lump it all together.
>>>
>>
>>
>>If there is the tendency to touch the concurrency controls in general
>>all over the kernel, then I would suggest a script driven overhaul of
>>all concurrency controls like spin_locks, mutexes and semaphores to
>>general macros like
>>
>>enter_critical_section(TYPE, &var, &flags, whatever);
>>leave_critical_section(TYPE, &var, flags, whatever);

There is nothing here that cannot be done with a macro. We don't really need a
script. The optimizer would drop out the unused code...

-g
>>
>>where TYPE might be SPIN_LOCK, SPIN_LOCK_IRQ, MUTEX, PMUTEX or whatever
>>we have and come up with in the future.
>>
>>This could be done in a first step and then it is clearly identifiable
>>and it gives us more flexibility to wrap different implementations and
>>lets us change particular points in a more clear way.
>>
>>I would be willing to provide some scripted conversion aid, if there is
>>enough interest in that. I started with some test files and the results
>>are quite encouraging.
>>
>
>
>
>
> Ideally we would eventually provide some level of tunability, i.e.
> if you want the spinlocks all the way around, it should be possible
> to have that, or one could enable degrees of enhancements,
> expanding on the existing sequence starting with PREEMPT, IRQ_THREADS,
> BKL, MUTEX, etc. In addition to that, once the minimal set of spinlocks
> necessary for RT is established, additional layers, corresponding to
> the lock nesting order, could be defined, making the "mutex-depth"
> somewhat configurable based on the performance requirements.
>
> The entire effort would have the side effect of making the locking and
> critical sections more distinct, of revealing soft spots in concurrency
> code, and of raising awareness of the code density inside
> critical sections.
>
> The concept of tunable foreground / background responsiveness,
> based on preemptability of low priority processes comes to mind.
> A lot of folks would probably not mind making UI responsiveness
> a little crisper, others will want the throughput.
>
> I realize this is an early stage to be looking at it from so high up,
> but I think in general this type of script would not be a bad addition
> to the patch kit(s).
>
>
> Sven
>
>

--
George Anzinger [email protected]
High-res-timers: http://sourceforge.net/projects/high-res-timers/
Preemption patch: http://www.kernel.org/pub/linux/kernel/people/rml
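
For reference, the enter_critical_section()/leave_critical_section() idea
quoted above could indeed be prototyped with nothing more than token
pasting, along the lines George suggests. The following is only a sketch;
the helper names are invented here and are not from any posted patch, and
IRQ-disabling variants would follow the same pattern with a flags argument:

/* sketch only -- TYPE dispatch via token pasting, hypothetical names */
#include <linux/spinlock.h>
#include <asm/semaphore.h>

#define enter_critical_section(type, lock) __enter_##type(lock)
#define leave_critical_section(type, lock) __leave_##type(lock)

#define __enter_SPIN_LOCK(lock)  spin_lock(lock)
#define __leave_SPIN_LOCK(lock)  spin_unlock(lock)
#define __enter_SEMAPHORE(sem)   down(sem)
#define __leave_SEMAPHORE(sem)   up(sem)

/*
 * Example call site -- the TYPE is now visible (and greppable) at every
 * use, so a later script or config option could retarget SPIN_LOCK sites
 * to a mutex type in one pass, which is the tunability Sven describes
 * above:
 *
 *	enter_critical_section(SPIN_LOCK, &mylock);
 *	...
 *	leave_critical_section(SPIN_LOCK, &mylock);
 */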

2004-10-13 01:01:52

by Valdis Klētnieks

[permalink] [raw]
Subject: Re: [Ext-rt-dev] Re: [ANNOUNCE] Linux 2.6 Real Time Kernel

On Wed, 13 Oct 2004 01:10:34 +0200, Thomas Gleixner said:

> What do we have at the very end ? An endless mess of non-understandable
> macros, which are resolved by compiler magic ? Where nobody can see at
> first glance which kind of concurrency control you are using ? That's
> a nice thing for a proof-of-concept implementation, but it cannot be
> the solution for something that is targeted to go into mainline. The
> frequency of T4-T7 patches, including the small fixes posted on LKML, is
> proof enough of this.

I seem to remember Ingo saying that this *is* still somewhat "proof of concept",
and that the gcc preprocessor ad-crockery was just a *really* nice way of doing
it semi-automagically while minimizing the patch footprint and intrusiveness.

I'm sure that once we've got a non-moving target, at least 2 or 3 levels
of preprocessor redirection will get cleaned up and removed, to save
future programmers' sanity..

(Viewed alternatively - how many more flubs would the T4-T7 series have
if Ingo wasn't using the preprocessor to do the heavy lifting? For something
at the current level of cookedness, it's doing fairly well)...



2004-10-13 02:02:29

by K.R. Foley

[permalink] [raw]
Subject: Re: [Ext-rt-dev] Re: [ANNOUNCE] Linux 2.6 Real Time Kernel

Bill Huey (hui) wrote:
<snip>
>
> Well, uh, at least you're single kernel image folks like
> us and not flaming us/me yet for corrupting the sanctity
> of Linux. Oh man, I feel a flame war coming. This is such
> touchy material.
>
> What's Monta Vista's attitude toward preemption development ?
> open or closed ? I know this is a charged question, but
> this has to be asked. :)
>
> This commercial thing is going to be weird. I wish I was
> an angry hippie instead of having a job at certain moments. :)

Aside from being able to claim first to market, what is to be gained by
having this effort closed? If it is truly integrated into the Linux
kernel and not another segregated/multi-kernel solution, is there any
way to keep it closed?

>
> But the bay area is pretty damn cool, so... that makes up
> for it. :)
>
> bill
>

2004-10-13 03:56:17

by Bill Huey

[permalink] [raw]
Subject: Re: [Ext-rt-dev] Re: [ANNOUNCE] Linux 2.6 Real Time Kernel

On Tue, Oct 12, 2004 at 02:41:02PM -0700, Sven Dietrich wrote:
> I emailed the mmlinux project about 2 months ago,
> telling you that we were doing this.

http://mmlinux.sourceforge.net/temp/

I'll do an official announcement tomorrow. It's party time for me. :)

bill

2004-10-14 05:06:22

by Dipankar Sarma

[permalink] [raw]
Subject: Re: [ANNOUNCE] Linux 2.6 Real Time Kernel

On Tue, Oct 12, 2004 at 07:50:29AM +0200, Ingo Molnar wrote:
>
> regarding RCU serialization - i think that is the way to go - i dont
> think there is any sensible way to extend RCU to a fully preempted
> model, RCU is all about per-CPU-ness and per-CPU-ness is quite limited
> in a fully preemptible model.

It seems that way to me too. Long ago I implemented preemptible
RCU, but did not follow it through because I believed it
was not a good idea. The original patch is here :

http://www.uwsg.iu.edu/hypermail/linux/kernel/0205.1/0026.html

This allows read-side critical sections of RCU to be preempted.
It will take a bit of work to re-use it in RCU as of now, but
I don't think it makes sense to do so. My primary concern is
DoS/OOM situation due to preempted tasks holding up RCU.

>
> could you send those RCU patches (no matter how incomplete/broken)? It's
> the main issue that causes the dcache_lock to be raw still. (and a
> number of dependent locks: fs-writeback.c, inode.c, etc.) We can make
> those RCU changes not impact the normal !PREEMPT_REALTIME locking so it
> might have a chance for upstream merging as well.

I would be interested in this too.

Thanks
Dipankar

2004-10-14 07:16:58

by Ingo Molnar

[permalink] [raw]
Subject: Re: [ANNOUNCE] Linux 2.6 Real Time Kernel


* Dipankar Sarma <[email protected]> wrote:

> On Tue, Oct 12, 2004 at 07:50:29AM +0200, Ingo Molnar wrote:
> >
> > regarding RCU serialization - i think that is the way to go - i dont
> > think there is any sensible way to extend RCU to a fully preempted
> > model, RCU is all about per-CPU-ness and per-CPU-ness is quite limited
> > in a fully preemptible model.
>
> It seems that way to me too. Long ago I implemented preemptible RCU,
> but did not follow it through because I believed it was not a good
> idea. The original patch is here :
>
> http://www.uwsg.iu.edu/hypermail/linux/kernel/0205.1/0026.html

interesting!

> This allows read-side critical sections of RCU to be preempted. It
> will take a bit of work to re-use it in RCU as of now, but I don't
> think it makes sense to do so. [...]

note that meanwhile i have implemented another variant:

http://marc.theaimsgroup.com/?l=linux-kernel&m=109771365907797&w=2

i dont think this will be the final interface (the _rt postfix is
stupid, it should probably be _spin?), but i think this is roughly the
structure of how to attack it - a minimal extension to the RCU APIs to
allow for serialization. What do you think about this particular
approach?

> [...] My primary concern is DoS/OOM situation due to preempted tasks
> holding up RCU.

in the serialization solution in -U0 it would be possible to immediately
free the RCU entries and hence have no DoS/OOM situation - although the
-U0 patch does not do this yet.

Ingo

2004-10-15 15:04:17

by Paul E. McKenney

[permalink] [raw]
Subject: Re: [ANNOUNCE] Linux 2.6 Real Time Kernel

On Thu, Oct 14, 2004 at 09:18:10AM +0200, Ingo Molnar wrote:
>
> * Dipankar Sarma <[email protected]> wrote:
>
> > On Tue, Oct 12, 2004 at 07:50:29AM +0200, Ingo Molnar wrote:
> > >
> > > regarding RCU serialization - i think that is the way to go - i dont
> > > think there is any sensible way to extend RCU to a fully preempted
> > > model, RCU is all about per-CPU-ness and per-CPU-ness is quite limited
> > > in a fully preemptible model.
> >
> > It seems that way to me too. Long ago I implemented preemptible RCU,
> > but did not follow it through because I believed it was not a good
> > idea. The original patch is here :
> >
> > http://www.uwsg.iu.edu/hypermail/linux/kernel/0205.1/0026.html
>
> interesting!
>
> > This allows read-side critical sections of RCU to be preempted. It
> > will take a bit of work to re-use it in RCU as of now, but I don't
> > think it makes sense to do so. [...]
>
> note that meanwhile i have implemented another variant:
>
> http://marc.theaimsgroup.com/?l=linux-kernel&m=109771365907797&w=2
>
> i dont think this will be the final interface (the _rt postfix is
> stupid, it should probably be _spin?), but i think this is roughly the
> structure of how to attack it - a minimal extension to the RCU APIs to
> allow for serialization. What do you think about this particular
> approach?

One caution (which you are no doubt already aware of) -- if an RCU
algorithm reads (rcu_read_lock()/rcu_read_unlock()) in process
context and updates in softirq/bh/irq context, you can see deadlocks.

Thanx, Paul

> > [...] My primary concern is DoS/OOM situation due to preempted tasks
> > holding up RCU.
>
> in the serialization solution in -U0 it would be possible to immediately
> free the RCU entries and hence have no DoS/OOM situation - although the
> -U0 patch does not do this yet.
>
> Ingo

2004-10-15 15:44:39

by Ingo Molnar

[permalink] [raw]
Subject: Re: [ANNOUNCE] Linux 2.6 Real Time Kernel


* Paul E. McKenney <[email protected]> wrote:

> One caution (which you are no doubt already aware of) -- if an RCU
> algorithm that reads (rcu_read_lock()/rcu_read_unlock()) in process
> context and updates in softirq/bh/irq context, you can see deadlocks.

yeah - but in the PREEMPT_REALTIME kernel there are simply no irq or
softirq contexts in process contexts - everything is a task. So
everything can (and does) block.

Ingo

2004-10-15 16:46:02

by Paul E. McKenney

[permalink] [raw]
Subject: Re: [ANNOUNCE] Linux 2.6 Real Time Kernel

On Fri, Oct 15, 2004 at 05:45:42PM +0200, Ingo Molnar wrote:
>
> * Paul E. McKenney <[email protected]> wrote:
>
> > One caution (which you are no doubt already aware of) -- if an RCU
> > algorithm that reads (rcu_read_lock()/rcu_read_unlock()) in process
> > context and updates in softirq/bh/irq context, you can see deadlocks.
>
> yeah - but in the PREEMPT_REALTIME kernel there are simply no irq or
> softirq contexts in process contexts - everything is a task. So
> everything can (and does) block.

OK, am probably confused, but I thought that the whole point of your
PREEMPT_REALTIME implementation of rcu_read_lock_rt() was to enable
preemption in the RCU read-side critical section. If this is indeed
the case, then it looks to me like code that would run in softirq/bh/irq
context in a kernel compiled non-PREEMPT_REALTIME could now run during
the time that a code path running under rcu_read_lock_rt() was preempted.

If so, then the kernel can end up freeing a data item that the preempted
RCU read-side critical section is still referencing.

OK, so what am I missing here?

Thanx, Paul

2004-10-15 16:50:56

by Paul E. McKenney

[permalink] [raw]
Subject: Re: [ANNOUNCE] Linux 2.6 Real Time Kernel

On Fri, Oct 15, 2004 at 09:40:39AM -0700, Paul E. McKenney wrote:
> On Fri, Oct 15, 2004 at 05:45:42PM +0200, Ingo Molnar wrote:
> >
> > * Paul E. McKenney <[email protected]> wrote:
> >
> > > One caution (which you are no doubt already aware of) -- if an RCU
> > > algorithm that reads (rcu_read_lock()/rcu_read_unlock()) in process
> > > context and updates in softirq/bh/irq context, you can see deadlocks.
> >
> > yeah - but in the PREEMPT_REALTIME kernel there are simply no irq or
> > softirq contexts in process contexts - everything is a task. So
> > everything can (and does) block.
>
> OK, am probably confused, but I thought that the whole point of your
> PREEMPT_REALTIME implementation of rcu_read_lock_rt() was to enable
> preemption in the RCU read-side critical section. If this is indeed
> the case, then it looks to me like code that would run in softirq/bh/irq
> context in a kernel compiled non-PREEMPT_REALTIME could now run during
> the time that a code path running under rcu_read_lock_rt() was preempted.
>
> If so, then the kernel can end up freeing a data item that the preempted
> RCU read-side critical section is still referencing.
>
> OK, so what am I missing here?

Never mind!!! You insert the mutex. Sorry for the noise!

Thanx, Paul
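
To make the "you insert the mutex" resolution above concrete: a heavily
simplified sketch of a serialized, preemptible read side under
PREEMPT_REALTIME might look like the following. This is not Ingo's actual
-U0 code; a 2.6.9-era semaphore stands in for the RT mutex, and the real
patch keys the serialization to the data structure rather than one global
lock:

/* sketch only -- not the -U0 implementation */
#include <asm/semaphore.h>

static DECLARE_MUTEX(rcu_rt_lock);	/* semaphore standing in for the RT mutex */

static inline void rcu_read_lock_rt(void)
{
	down(&rcu_rt_lock);		/* may sleep -- readers are ordinary tasks */
}

static inline void rcu_read_unlock_rt(void)
{
	up(&rcu_rt_lock);
}

/*
 * An updater takes the same lock before freeing, so a preempted reader
 * blocks the updater instead of referencing freed memory, and entries can
 * be freed immediately -- which is also why the DoS/OOM concern goes away
 * in the serialized scheme.  The cost is that the read side is no longer
 * lock-free.
 */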

2004-10-17 17:10:47

by Ingo Molnar

[permalink] [raw]
Subject: Re: [ANNOUNCE] Linux 2.6 Real Time Kernel


* Dipankar Sarma <[email protected]> wrote:

> It seems that way to me too. Long ago I implemented preemptible RCU,
> but did not follow it through because I believed it was not a good
> idea. The original patch is here :
>
> http://www.uwsg.iu.edu/hypermail/linux/kernel/0205.1/0026.html
>
> This allows read-side critical sections of RCU to be preempted. It
> will take a bit of work to re-use it in RCU as of now, but I don't
> think it makes sense to do so. My primary concern is DoS/OOM situation
> due to preempted tasks holding up RCU.

the DoS/OOM problems are serious i believe. Preemptible RCU in that
sense is 'RCU with no guarantee of progress', which sounds bad from a
design POV.

Ingo