2005-02-23 18:08:59

by Lee Revell

Subject: More latency regressions with 2.6.11-rc4-RT-V0.7.39-02

Ingo,

Did something change recently in the VM that made copy_pte_range and
clear_page_range a lot more expensive? I noticed a reference in the
"Page Table Iterators" thread to excessive overhead introduced by
aggressive page freeing. That sure looks like what is going on in
trace2. trace1 and trace3 look like big fork latencies associated with
copy_pte_range.

This is all with PREEMPT_DESKTOP.

Lee


Attachments:
trace1.txt (3.75 kB)
trace2.txt (48.58 kB)
trace3.txt (3.75 kB)

2005-02-23 19:18:10

by Hugh Dickins

Subject: Re: More latency regressions with 2.6.11-rc4-RT-V0.7.39-02

On Wed, 23 Feb 2005, Lee Revell wrote:
>
> Did something change recently in the VM that made copy_pte_range and
> clear_page_range a lot more expensive? I noticed a reference in the
> "Page Table Iterators" thread to excessive overhead introduced by
> aggressive page freeing. That sure looks like what is going on in
> trace2. trace1 and trace3 look like big fork latencies associated with
> copy_pte_range.

I'm just about to test this patch below: please give it a try: thanks...

Ingo's patch to reduce scheduling latencies, by checking for lockbreak
in copy_page_range, was in the -VP and -mm patchsets some months ago;
but got preempted by the 4level rework, and not reinstated since.
Restore it now in copy_pte_range - which mercifully makes it easier.

Signed-off-by: Hugh Dickins <[email protected]>

--- 2.6.11-rc4-bk9/mm/memory.c 2005-02-21 11:32:19.000000000 +0000
+++ linux/mm/memory.c 2005-02-23 18:35:28.000000000 +0000
@@ -328,6 +328,7 @@ static int copy_pte_range(struct mm_stru
pte_t *s, *d;
unsigned long vm_flags = vma->vm_flags;

+again:
d = dst_pte = pte_alloc_map(dst_mm, dst_pmd, addr);
if (!dst_pte)
return -ENOMEM;
@@ -338,11 +339,22 @@ static int copy_pte_range(struct mm_stru
if (pte_none(*s))
continue;
copy_one_pte(dst_mm, src_mm, d, s, vm_flags, addr);
+ /*
+ * We are holding two locks at this point - either of them
+ * could generate latencies in another task on another CPU.
+ */
+ if (need_resched() ||
+ need_lockbreak(&src_mm->page_table_lock) ||
+ need_lockbreak(&dst_mm->page_table_lock))
+ break;
}
pte_unmap_nested(src_pte);
pte_unmap(dst_pte);
spin_unlock(&src_mm->page_table_lock);
+
cond_resched_lock(&dst_mm->page_table_lock);
+ if (addr < end)
+ goto again;
return 0;
}

2005-02-23 19:37:52

by Lee Revell

Subject: Re: More latency regressions with 2.6.11-rc4-RT-V0.7.39-02

On Wed, 2005-02-23 at 19:16 +0000, Hugh Dickins wrote:
> On Wed, 23 Feb 2005, Lee Revell wrote:
> >
> > Did something change recently in the VM that made copy_pte_range and
> > clear_page_range a lot more expensive? I noticed a reference in the
> > "Page Table Iterators" thread to excessive overhead introduced by
> > aggressive page freeing. That sure looks like what is going on in
> > trace2. trace1 and trace3 look like big fork latencies associated with
> > copy_pte_range.
>
> I'm just about to test this patch below: please give it a try: thanks...
>
> Ingo's patch to reduce scheduling latencies, by checking for lockbreak
> in copy_page_range, was in the -VP and -mm patchsets some months ago;
> but got preempted by the 4level rework, and not reinstated since.
> Restore it now in copy_pte_range - which mercifully makes it easier.

Aha, that explains why all the latency regressions involve the VM
subsystem.

Thanks, your patch fixes the copy_pte_range latency. Now zap_pte_range,
which Ingo also fixed a few months ago, is the worst offender. Can this
fix be easily ported too?

Lee


Attachments:
trace5.txt (12.49 kB)

2005-02-23 19:53:26

by Lee Revell

Subject: Re: More latency regressions with 2.6.11-rc4-RT-V0.7.39-02

On Wed, 2005-02-23 at 13:07 -0500, Lee Revell wrote:
> This is all with PREEMPT_DESKTOP.
>

Here is another new one, this time in the ext3 reservation code.

Lee


Attachments:
trace6.txt (62.16 kB)

2005-02-23 20:08:18

by Hugh Dickins

Subject: Re: More latency regressions with 2.6.11-rc4-RT-V0.7.39-02

On Wed, 23 Feb 2005, Lee Revell wrote:
> On Wed, 2005-02-23 at 19:16 +0000, Hugh Dickins wrote:
> >
> > I'm just about to test this patch below: please give it a try: thanks...

I'm very sorry, there's two things wrong with that version: _must_
increment addr before breaking out, and better to check after pte_none
too (we can question whether it might be checking too often, but this
replicates what Ingo was doing). Please replace by new patch below,
which I'm now running through lmbench.
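
To spell out the failure mode of that first version, schematically rather
than as the literal diff: the break skips the for loop's increment clause,
so addr still covers the entry just copied, and the "goto again" retry
copies it a second time, taking an extra page reference each pass.

	/* version 1 of the loop, schematic -- the bug annotated */
	for (; addr < end; addr += PAGE_SIZE, s++, d++) {
		if (pte_none(*s))
			continue;
		copy_one_pte(dst_mm, src_mm, d, s, vm_flags, addr);
		if (need_resched() ||
		    need_lockbreak(&src_mm->page_table_lock) ||
		    need_lockbreak(&dst_mm->page_table_lock))
			break;	/* addr not yet advanced... */
	}
	/* ...so the "goto again" retry recopies the same pte: leaked pages */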

> Aha, that explains why all the latency regressions involve the VM
> subsystem.
>
> Thanks, your patch fixes the copy_pte_range latency.

Great, if the previous patch fixed that latency then this new one
will too, no need to report on that; but please get rid of the old
patch before it leaks too many of your pages.

> Now zap_pte_range,
> which Ingo also fixed a few months ago, is the worst offender. Can this
> fix be easily ported too?

That surprises me: all the zap_pte_range latency fixes I know of
are in 2.6.11-rc, perhaps Ingo knows of something missing there?

Hugh

Ingo's patch to reduce scheduling latencies, by checking for lockbreak
in copy_page_range, was in the -VP and -mm patchsets some months ago;
but got preempted by the 4level rework, and not reinstated since.
Restore it now in copy_pte_range - which mercifully makes it easier.

Signed-off-by: Hugh Dickins <[email protected]>

--- 2.6.11-rc4-bk/mm/memory.c 2005-02-21 11:32:19.000000000 +0000
+++ linux/mm/memory.c 2005-02-23 19:46:40.000000000 +0000
@@ -328,21 +328,33 @@ static int copy_pte_range(struct mm_stru
pte_t *s, *d;
unsigned long vm_flags = vma->vm_flags;

+again:
d = dst_pte = pte_alloc_map(dst_mm, dst_pmd, addr);
if (!dst_pte)
return -ENOMEM;

spin_lock(&src_mm->page_table_lock);
s = src_pte = pte_offset_map_nested(src_pmd, addr);
- for (; addr < end; addr += PAGE_SIZE, s++, d++) {
- if (pte_none(*s))
- continue;
- copy_one_pte(dst_mm, src_mm, d, s, vm_flags, addr);
+ for (; addr < end; s++, d++) {
+ if (!pte_none(*s))
+ copy_one_pte(dst_mm, src_mm, d, s, vm_flags, addr);
+ addr += PAGE_SIZE;
+ /*
+ * We are holding two locks at this point - either of them
+ * could generate latencies in another task on another CPU.
+ */
+ if (need_resched() ||
+ need_lockbreak(&src_mm->page_table_lock) ||
+ need_lockbreak(&dst_mm->page_table_lock))
+ break;
}
pte_unmap_nested(src_pte);
pte_unmap(dst_pte);
spin_unlock(&src_mm->page_table_lock);
+
cond_resched_lock(&dst_mm->page_table_lock);
+ if (addr < end)
+ goto again;
return 0;
}

2005-02-23 20:11:04

by Lee Revell

Subject: Re: More latency regressions with 2.6.11-rc4-RT-V0.7.39-02

On Wed, 2005-02-23 at 20:06 +0000, Hugh Dickins wrote:
> On Wed, 23 Feb 2005, Lee Revell wrote:
> > On Wed, 2005-02-23 at 19:16 +0000, Hugh Dickins wrote:
> > >
> > > I'm just about to test this patch below: please give it a try: thanks...
>
> I'm very sorry, there's two things wrong with that version: _must_
> increment addr before breaking out, and better to check after pte_none
> too (we can question whether it might be checking too often, but this
> replicates what Ingo was doing). Please replace by new patch below,
> which I'm now running through lmbench.

OK, I will report any interesting results with the new patch.

Lee

2005-02-23 20:31:21

by Lee Revell

Subject: Re: More latency regressions with 2.6.11-rc4-RT-V0.7.39-02

On Wed, 2005-02-23 at 20:06 +0000, Hugh Dickins wrote:
> >
> > Thanks, your patch fixes the copy_pte_range latency.
>
> Great, if the previous patch fixed that latency then this new one
> will too, no need to report on that; but please get rid of the old
> patch before it leaks too many of your pages.

clear_page_range is also problematic.

Lee


Attachments:
trace7.txt (48.77 kB)

2005-02-23 20:53:55

by Hugh Dickins

Subject: Re: More latency regressions with 2.6.11-rc4-RT-V0.7.39-02

On Wed, 23 Feb 2005, Hugh Dickins wrote:
> Please replace by new patch below, which I'm now running through lmbench.

That second patch seems fine, and I see no lmbench regression from it.

Hugh

2005-02-23 21:04:50

by Hugh Dickins

Subject: Re: More latency regressions with 2.6.11-rc4-RT-V0.7.39-02

On Wed, 23 Feb 2005, Lee Revell wrote:
> > >
> > > Thanks, your patch fixes the copy_pte_range latency.
>
> clear_page_range is also problematic.

Yes, I saw that from your other traces too. I know there are plans
to improve clear_page_range during 2.6.12, but I didn't realize that
it had become very much worse than its antecedent clear_page_tables,
and I don't see missing latency fixes for that. Nick's the expert.

Hugh

2005-02-23 22:18:51

by Lee Revell

Subject: Re: More latency regressions with 2.6.11-rc4-RT-V0.7.39-02

On Wed, 2005-02-23 at 20:53 +0000, Hugh Dickins wrote:
> On Wed, 23 Feb 2005, Hugh Dickins wrote:
> > Please replace by new patch below, which I'm now running through lmbench.
>
> That second patch seems fine, and I see no lmbench regression from it.

Should go into 2.6.11, right?

Lee

2005-02-23 22:18:45

by Lee Revell

Subject: Re: More latency regressions with 2.6.11-rc4-RT-V0.7.39-02

On Wed, 2005-02-23 at 21:03 +0000, Hugh Dickins wrote:
> On Wed, 23 Feb 2005, Lee Revell wrote:
> > > >
> > > > Thanks, your patch fixes the copy_pte_range latency.
> >
> > clear_page_range is also problematic.
>
> Yes, I saw that from your other traces too.

Heh, sorry, that one was a dupe... I should know to give the files
better names.

Lee

2005-02-23 23:32:48

by Nick Piggin

Subject: Re: More latency regressions with 2.6.11-rc4-RT-V0.7.39-02

Hugh Dickins wrote:
> On Wed, 23 Feb 2005, Lee Revell wrote:
>
>>>>Thanks, your patch fixes the copy_pte_range latency.
>>
>>clear_page_range is also problematic.
>
>
> Yes, I saw that from your other traces too. I know there are plans
> to improve clear_page_range during 2.6.12, but I didn't realize that
> it had become very much worse than its antecedent clear_page_tables,
> and I don't see missing latency fixes for that. Nick's the expert.
>

I wouldn't have thought it should have become worse, latency
wise. What is actually happening is that the lower level freeing
functions are being called more often. But this should result in
the work being spread out more, if anything, whereas in the old
system things tended to be batched up into bigger chunks
(typically at exit() time).

If you are using i386 with 2-level page tables (no highmem), then
the behaviour should be more or less identical. Odd.

Nick

2005-02-24 00:44:18

by john cooper

Subject: PPC RT Patch..

--- ./arch/ppc/syslib/i8259.c.ORG 2005-02-01 16:26:40.000000000 -0500
+++ ./arch/ppc/syslib/i8259.c 2005-02-10 12:57:45.000000000 -0500
@@ -10,7 +10,7 @@ unsigned char cached_8259[2] = { 0xff, 0
#define cached_A1 (cached_8259[0])
#define cached_21 (cached_8259[1])

-static DEFINE_SPINLOCK(i8259_lock);
+static DEFINE_RAW_SPINLOCK(i8259_lock);

int i8259_pic_irq_offset;

=================================================================
--- ./arch/ppc/syslib/ocp.c.ORG 2004-12-24 16:35:23.000000000 -0500
+++ ./arch/ppc/syslib/ocp.c 2005-02-23 16:51:04.009548560 -0500
@@ -49,7 +49,6 @@
#include <asm/io.h>
#include <asm/ocp.h>
#include <asm/errno.h>
-#include <asm/rwsem.h>
#include <asm/semaphore.h>

//#define DBG(x) printk x
=================================================================
--- ./arch/ppc/kernel/Makefile.ORG 2005-02-01 16:26:40.000000000 -0500
+++ ./arch/ppc/kernel/Makefile 2005-02-18 20:34:07.000000000 -0500
@@ -13,7 +13,7 @@ extra-y += vmlinux.lds

obj-y := entry.o traps.o irq.o idle.o time.o misc.o \
process.o signal.o ptrace.o align.o \
- semaphore.o syscalls.o setup.o \
+ syscalls.o setup.o \
cputable.o ppc_htab.o perfmon.o
obj-$(CONFIG_6xx) += l2cr.o cpu_setup_6xx.o
obj-$(CONFIG_POWER4) += cpu_setup_power4.o
@@ -26,6 +26,9 @@ obj-$(CONFIG_TAU) += temp.o
obj-$(CONFIG_ALTIVEC) += vecemu.o vector.o
obj-$(CONFIG_FSL_BOOKE) += perfmon_fsl_booke.o

+obj-$(CONFIG_ASM_SEMAPHORES) += semaphore.o
+obj-$(CONFIG_MCOUNT) += mcount.o
+
ifndef CONFIG_MATH_EMULATION
obj-$(CONFIG_8xx) += softemu8xx.o
endif
=================================================================
--- ./arch/ppc/kernel/signal.c.ORG 2005-02-01 16:26:40.000000000 -0500
+++ ./arch/ppc/kernel/signal.c 2005-02-09 19:16:25.000000000 -0500
@@ -704,6 +704,13 @@ int do_signal(sigset_t *oldset, struct p
unsigned long frame, newsp;
int signr, ret;

+#ifdef CONFIG_PREEMPT_RT
+ /*
+ * Fully-preemptible kernel does not need interrupts disabled:
+ */
+ local_irq_enable();
+ preempt_check_resched();
+#endif
if (!oldset)
oldset = &current->blocked;

=================================================================
--- ./arch/ppc/kernel/time.c.ORG 2005-02-01 16:26:40.000000000 -0500
+++ ./arch/ppc/kernel/time.c 2005-02-19 10:46:17.000000000 -0500
@@ -91,7 +91,7 @@ extern unsigned long wall_jiffies;

static long time_offset;

-DEFINE_SPINLOCK(rtc_lock);
+DEFINE_RAW_SPINLOCK(rtc_lock);

EXPORT_SYMBOL(rtc_lock);

@@ -109,7 +109,7 @@ static inline int tb_delta(unsigned *jif
}

#ifdef CONFIG_SMP
-unsigned long profile_pc(struct pt_regs *regs)
+unsigned long notrace profile_pc(struct pt_regs *regs)
{
unsigned long pc = instruction_pointer(regs);

@@ -126,7 +126,7 @@ EXPORT_SYMBOL(profile_pc);
* with interrupts disabled.
* We set it up to overflow again in 1/HZ seconds.
*/
-void timer_interrupt(struct pt_regs * regs)
+void notrace timer_interrupt(struct pt_regs * regs)
{
int next_dec;
unsigned long cpu = smp_processor_id();
=================================================================
--- ./arch/ppc/kernel/traps.c.ORG 2005-02-01 16:26:40.000000000 -0500
+++ ./arch/ppc/kernel/traps.c 2005-02-09 18:59:40.000000000 -0500
@@ -72,7 +72,7 @@ void (*debugger_fault_handler)(struct pt
* Trap & Exception support
*/

-DEFINE_SPINLOCK(die_lock);
+DEFINE_RAW_SPINLOCK(die_lock);

void die(const char * str, struct pt_regs * fp, long err)
{
@@ -111,6 +111,10 @@ void _exception(int signr, struct pt_reg
debugger(regs);
die("Exception in kernel mode", regs, signr);
}
+#ifdef CONFIG_PREEMPT_RT
+ local_irq_enable();
+ preempt_check_resched();
+#endif
info.si_signo = signr;
info.si_errno = 0;
info.si_code = code;
=================================================================
--- ./arch/ppc/kernel/ppc_ksyms.c.ORG 2004-12-24 16:35:28.000000000 -0500
+++ ./arch/ppc/kernel/ppc_ksyms.c 2005-02-08 12:00:56.000000000 -0500
@@ -294,9 +294,11 @@ EXPORT_SYMBOL(console_drivers);
EXPORT_SYMBOL(xmon);
EXPORT_SYMBOL(xmon_printf);
#endif
+#ifdef CONFIG_ASM_SEMAPHORES
EXPORT_SYMBOL(__up);
EXPORT_SYMBOL(__down);
EXPORT_SYMBOL(__down_interruptible);
+#endif

#if defined(CONFIG_KGDB) || defined(CONFIG_XMON)
extern void (*debugger)(struct pt_regs *regs);
=================================================================
--- ./arch/ppc/kernel/entry.S.ORG 2005-02-01 16:26:40.000000000 -0500
+++ ./arch/ppc/kernel/entry.S 2005-02-23 16:30:55.924205448 -0500
@@ -8,6 +8,7 @@
* rewritten by Paul Mackerras.
* Copyright (C) 1996 Paul Mackerras.
* MPC8xx modifications Copyright (C) 1997 Dan Malek ([email protected]).
+ * RT_PREEMPT support (C) Timesys Corp. <[email protected]>
*
* This file contains the system call entry code, context switch
* code, and exception/interrupt return code for PowerPC.
@@ -44,6 +45,48 @@
#define LOAD_MSR_KERNEL(r, x) li r,(x)
#endif

+#if defined(CONFIG_LATENCY_TRACE) || defined(CONFIG_CRITICAL_IRQSOFF_TIMING) \
+ || defined(CONFIG_CRITICAL_TIMING)
+#define TFSIZE 64 /* frame SHDB multiple of 16 bytes */
+#define TFR12 48 /* TODO: prune this down -- safe but overkill */
+#define TFR11 44
+#define TFR10 40
+#define TFR9 36
+#define TFR8 32
+#define TFR7 28
+#define TFR6 24
+#define TFR5 20
+#define TFR4 16
+#define TFR3 12
+#define TFR0 8
+#define PUSHFRAME() \
+ stwu r1, -TFSIZE(r1); \
+ stw r12, TFR12(r1); \
+ stw r11, TFR11(r1); \
+ stw r10, TFR10(r1); \
+ stw r9, TFR9(r1); \
+ stw r8, TFR8(r1); \
+ stw r7, TFR7(r1); \
+ stw r6, TFR6(r1); \
+ stw r5, TFR5(r1); \
+ stw r4, TFR4(r1); \
+ stw r3, TFR3(r1); \
+ stw r0, TFR0(r1)
+#define POPFRAME() \
+ lwz r12, TFR12(r1); \
+ lwz r11, TFR11(r1); \
+ lwz r10, TFR10(r1); \
+ lwz r9, TFR9(r1); \
+ lwz r8, TFR8(r1); \
+ lwz r7, TFR7(r1); \
+ lwz r6, TFR6(r1); \
+ lwz r5, TFR5(r1); \
+ lwz r4, TFR4(r1); \
+ lwz r3, TFR3(r1); \
+ lwz r0, TFR0(r1); \
+ addi r1, r1, TFSIZE
+#endif
+
#ifdef CONFIG_BOOKE
#define COR r8 /* Critical Offset Register (COR) */
#define BOOKE_LOAD_COR lis COR,crit_save@ha
@@ -197,6 +240,20 @@ _GLOBAL(DoSyscall)
lwz r11,_CCR(r1) /* Clear SO bit in CR */
rlwinm r11,r11,0,4,2
stw r11,_CCR(r1)
+#ifdef CONFIG_LATENCY_TRACE
+ lwz r3,GPR0(r1)
+ lwz r4,GPR3(r1)
+ lwz r5,GPR4(r1)
+ lwz r6,GPR5(r1)
+ bl sys_call
+ lwz r0,GPR0(r1)
+ lwz r3,GPR3(r1)
+ lwz r4,GPR4(r1)
+ lwz r5,GPR5(r1)
+ lwz r6,GPR6(r1)
+ lwz r7,GPR7(r1)
+ lwz r8,GPR8(r1)
+#endif
#ifdef SHOW_SYSCALLS
bl do_show_syscall
#endif /* SHOW_SYSCALLS */
@@ -250,6 +307,21 @@ syscall_exit_cont:
andis. r10,r0,DBCR0_IC@h
bnel- load_dbcr0
#endif
+#if defined(CONFIG_LATENCY_TRACE) || defined(CONFIG_CRITICAL_IRQSOFF_TIMING) \
+ || defined(CONFIG_CRITICAL_TIMING)
+ PUSHFRAME();
+#ifdef CONFIG_CRITICAL_TIMING
+ bl touch_critical_timing
+#endif
+#ifdef CONFIG_CRITICAL_IRQSOFF_TIMING
+ bl trace_irqs_on
+#endif
+#ifdef CONFIG_LATENCY_TRACE
+ lwz r3, RESULT+TFSIZE(r1)
+ bl sys_ret
+#endif
+ POPFRAME();
+#endif
stwcx. r0,0,r1 /* to clear the reservation */
lwz r4,_LINK(r1)
lwz r5,_CCR(r1)
@@ -614,32 +686,38 @@ restore_user:

/* N.B. the only way to get here is from the beq following ret_from_except. */
resume_kernel:
+ lis r9,kernel_preemption@ha
+ lwz r9,kernel_preemption@l(r9)
+ cmpwi 0,r9,0
+ beq restore
/* check current_thread_info->preempt_count */
rlwinm r9,r1,0,0,18
lwz r0,TI_PREEMPT(r9)
cmpwi 0,r0,0 /* if non-zero, just restore regs and return */
bne restore
+check_resched:
lwz r0,TI_FLAGS(r9)
andi. r0,r0,_TIF_NEED_RESCHED
beq+ restore
andi. r0,r3,MSR_EE /* interrupts off? */
beq restore /* don't schedule if so */
-1: lis r0,PREEMPT_ACTIVE@h
- stw r0,TI_PREEMPT(r9)
+1:
ori r10,r10,MSR_EE
- SYNC
- MTMSRD(r10) /* hard-enable interrupts */
- bl schedule
LOAD_MSR_KERNEL(r10,MSR_KERNEL)
SYNC
MTMSRD(r10) /* disable interrupts */
+ bl preempt_schedule_irq
rlwinm r9,r1,0,0,18
li r0,0
stw r0,TI_PREEMPT(r9)
+#if 0
lwz r3,TI_FLAGS(r9)
andi. r0,r3,_TIF_NEED_RESCHED
bne- 1b
#else
+ b check_resched
+#endif
+#else
resume_kernel:
#endif /* CONFIG_PREEMPT */

=================================================================
--- ./arch/ppc/kernel/process.c.ORG 2004-12-24 16:34:45.000000000 -0500
+++ ./arch/ppc/kernel/process.c 2005-02-09 18:43:14.000000000 -0500
@@ -360,6 +360,7 @@ void show_regs(struct pt_regs * regs)
print_symbol("%s\n", regs->nip);
printk("LR [%08lx] ", regs->link);
print_symbol("%s\n", regs->link);
+ printk("preempt: %08x\n", preempt_count());
#endif
show_stack(current, (unsigned long *) regs->gpr[1]);
}
=================================================================
--- ./arch/ppc/kernel/mcount.S.ORG 2005-02-18 19:51:33.000000000 -0500
+++ ./arch/ppc/kernel/mcount.S 2005-02-23 14:25:18.780025528 -0500
@@ -0,0 +1,86 @@
+/*
+ * linux/arch/ppc/kernel/mcount.S
+ *
+ * Copyright (C) 2005 TimeSys Corporation, [email protected]
+ */
+
+#include <asm/ppc_asm.h>
+
+/*
+ * stack frame in effect when calling __mcount():
+ *
+ * 52: RA to caller
+ * 48: caller chain
+ * 44: [alignment pad]
+ * 40: saved LR to prolog/target function
+ * 36: r10
+ * 32: r9
+ * 28: r8
+ * 24: r7
+ * 20: r6
+ * 16: r5
+ * 12: r4
+ * 8: r3
+ * 4: LR save for __mcount() use
+ * 0: frame chain pointer / current r1
+ */
+
+/* preamble present in each traced function:
+ *
+ * .data
+ * .align 2
+ * 0:
+ * .long 0
+ * .previous
+ * mflr r0
+ * lis r11, 0b@ha
+ * stw r0, 4(r1)
+ * addi r0, r11, 0b@l
+ * bl _mcount
+ */
+
+ .text
+.globl _mcount
+_mcount:
+ lis r11,mcount_enabled@ha
+ lwz r11,mcount_enabled@l(r11)
+ cmpwi 0,r11,0
+ beq disabled
+ stwu r1,-48(r1) /* local frame */
+ stw r3, 8(r1)
+ stw r4, 12(r1)
+ stw r5, 16(r1)
+ stw r6, 20(r1)
+ mflr r4 /* RA to function prolog */
+ stw r7, 24(r1)
+ stw r8, 28(r1)
+ stw r9, 32(r1)
+ stw r10,36(r1)
+ stw r4, 40(r1)
+ bl __mcount /* void __mcount(void) */
+ nop
+ lwz r0, 40(r1)
+ lwz r3, 8(r1)
+ mtctr r0
+ lwz r4, 12(r1)
+ lwz r5, 16(r1)
+ lwz r6, 20(r1)
+ lwz r0, 52(r1) /* RA to function caller */
+ lwz r7, 24(r1)
+ lwz r8, 28(r1)
+ mtlr r0
+ lwz r9, 32(r1)
+ lwz r10,36(r1)
+ addi r1,r1,48 /* toss frame */
+ bctr /* return to target function */
+
+/* the function preamble modified LR in getting here so we need
+ * to restore its LR and return to the preamble otherwise
+ */
+disabled:
+ mflr r12 /* return address to preamble */
+ lwz r11, 4(r1)
+ mtlr r11 /* restore LR modified by preamble */
+ mtctr r12
+ bctr
+
=================================================================
--- ./arch/ppc/kernel/smp.c.ORG 2005-02-01 16:26:40.000000000 -0500
+++ ./arch/ppc/kernel/smp.c 2005-02-22 10:21:39.000000000 -0500
@@ -163,7 +163,7 @@ void smp_send_stop(void)
* static memory requirements. It also looks cleaner.
* Stolen from the i386 version.
*/
-static DEFINE_SPINLOCK(call_lock);
+static DEFINE_RAW_SPINLOCK(call_lock);

static struct call_data_struct {
void (*func) (void *info);
@@ -397,5 +397,15 @@ int __cpu_up(unsigned int cpu)

void smp_cpus_done(unsigned int max_cpus)
{
- smp_ops->setup_cpu(0);
+ if (smp_ops)
+ smp_ops->setup_cpu(0);
+}
+
+/* this function sends a 'reschedule' IPI to all other CPUs.
+ * This is used when RT tasks are starving and other CPUs
+ * might be able to run them
+ */
+void smp_send_reschedule_allbutself(void)
+{
+ smp_message_pass(MSG_ALL_BUT_SELF, PPC_MSG_RESCHEDULE, 0, 0);
}
=================================================================
--- ./arch/ppc/kernel/irq.c.ORG 2004-12-24 16:35:24.000000000 -0500
+++ ./arch/ppc/kernel/irq.c 2005-02-19 21:43:53.000000000 -0500
@@ -135,10 +135,11 @@ skip:
return 0;
}

-void do_IRQ(struct pt_regs *regs)
+void notrace do_IRQ(struct pt_regs *regs)
{
int irq, first = 1;
irq_enter();
+ trace_special(regs->nip, irq, 0);

/*
* Every platform is required to implement ppc_md.get_irq.
=================================================================
--- ./arch/ppc/kernel/idle.c.ORG 2005-02-01 16:26:40.000000000 -0500
+++ ./arch/ppc/kernel/idle.c 2005-02-23 13:02:34.000000000 -0500
@@ -39,6 +39,7 @@ void default_idle(void)
powersave = ppc_md.power_save;

if (!need_resched()) {
+ stop_critical_timing();
if (powersave != NULL)
powersave();
else {
=================================================================
--- ./arch/ppc/mm/fault.c.ORG 2004-12-24 16:34:29.000000000 -0500
+++ ./arch/ppc/mm/fault.c 2005-02-19 21:32:02.000000000 -0500
@@ -92,7 +92,7 @@ static int store_updates_sp(struct pt_re
* the error_code parameter is ESR for a data fault, 0 for an instruction
* fault.
*/
-int do_page_fault(struct pt_regs *regs, unsigned long address,
+int notrace do_page_fault(struct pt_regs *regs, unsigned long address,
unsigned long error_code)
{
struct vm_area_struct * vma;
@@ -104,6 +104,7 @@ int do_page_fault(struct pt_regs *regs,
#else
int is_write = 0;

+ trace_special(regs->nip, error_code, address);
/*
* Fortunately the bit assignments in SRR1 for an instruction
* fault and DSISR for a data fault are mostly the same for the
=================================================================
--- ./arch/ppc/mm/init.c.ORG 2005-02-01 16:26:40.000000000 -0500
+++ ./arch/ppc/mm/init.c 2005-02-21 13:26:10.000000000 -0500
@@ -56,7 +56,7 @@
#endif
#define MAX_LOW_MEM CONFIG_LOWMEM_SIZE

-DEFINE_PER_CPU(struct mmu_gather, mmu_gathers);
+DEFINE_PER_CPU_LOCKED(struct mmu_gather, mmu_gathers);

unsigned long total_memory;
unsigned long total_lowmem;
=================================================================
--- ./arch/ppc/boot/common/util.S.ORG 2004-12-24 16:35:49.000000000 -0500
+++ ./arch/ppc/boot/common/util.S 2005-02-23 14:27:14.577421648 -0500
@@ -289,5 +289,15 @@ _GLOBAL(flush_data_cache)
bdnz 00b
10: blr

- .previous
+#ifdef CONFIG_MCOUNT
+/* innocuous _mcount for boot header
+ */
+_GLOBAL(_mcount)
+ mflr r12 /* return address to preamble */
+ lwz r11, 4(r1)
+ mtlr r11 /* restore LR modified by preamble */
+ mtctr r12
+ bctr
+#endif

+ .previous
=================================================================
--- ./arch/ppc/platforms/encpp1_time.c.ORG 2005-02-02 20:42:55.000000000 -0500
+++ ./arch/ppc/platforms/encpp1_time.c 2005-02-09 16:35:10.000000000 -0500
@@ -155,9 +155,9 @@ int pp1_set_rtc_time (unsigned long nowt
{
unsigned char save_control, save_freq_select;
struct rtc_time tm;
- extern spinlock_t rtc_lock;
+ extern raw_spinlock_t rtc_lock;

- spin_lock (&rtc_lock);
+ __raw_spin_lock (&rtc_lock);
to_tm (nowtime, &tm);

called ();
@@ -202,7 +202,7 @@ int pp1_set_rtc_time (unsigned long nowt

if ((time_state == TIME_ERROR) || (time_state == TIME_BAD))
time_state = TIME_OK;
- spin_unlock (&rtc_lock);
+ __raw_spin_unlock (&rtc_lock);
return 0;
}

=================================================================
--- ./arch/ppc/lib/locks.c.ORG 2004-12-24 16:34:26.000000000 -0500
+++ ./arch/ppc/lib/locks.c 2005-02-07 19:21:14.000000000 -0500
@@ -8,6 +8,7 @@
#include <linux/sched.h>
#include <linux/spinlock.h>
#include <linux/module.h>
+#include <linux/rt_lock.h>
#include <asm/ppc_asm.h>
#include <asm/smp.h>

@@ -43,7 +44,7 @@ static inline unsigned long __spin_trylo
return ret;
}

-void _raw_spin_lock(spinlock_t *lock)
+void __raw_spin_lock(raw_spinlock_t *lock)
{
int cpu = smp_processor_id();
unsigned int stuck = INIT_STUCK;
@@ -63,9 +64,9 @@ void _raw_spin_lock(spinlock_t *lock)
lock->owner_pc = (unsigned long)__builtin_return_address(0);
lock->owner_cpu = cpu;
}
-EXPORT_SYMBOL(_raw_spin_lock);
+EXPORT_SYMBOL(__raw_spin_lock);

-int _raw_spin_trylock(spinlock_t *lock)
+int __raw_spin_trylock(raw_spinlock_t *lock)
{
if (__spin_trylock(&lock->lock))
return 0;
@@ -73,9 +74,9 @@ int _raw_spin_trylock(spinlock_t *lock)
lock->owner_pc = (unsigned long)__builtin_return_address(0);
return 1;
}
-EXPORT_SYMBOL(_raw_spin_trylock);
+EXPORT_SYMBOL(__raw_spin_trylock);

-void _raw_spin_unlock(spinlock_t *lp)
+void __raw_spin_unlock(raw_spinlock_t *lp)
{
if ( !lp->lock )
printk("_spin_unlock(%p): no lock cpu %d curr PC %p %s/%d\n",
@@ -89,7 +90,7 @@ void _raw_spin_unlock(spinlock_t *lp)
wmb();
lp->lock = 0;
}
-EXPORT_SYMBOL(_raw_spin_unlock);
+EXPORT_SYMBOL(__raw_spin_unlock);


/*
@@ -97,7 +98,7 @@ EXPORT_SYMBOL(_raw_spin_unlock);
* with the high bit (sign) being the "write" bit.
* -- Cort
*/
-void _raw_read_lock(rwlock_t *rw)
+void __raw_read_lock(raw_rwlock_t *rw)
{
unsigned long stuck = INIT_STUCK;
int cpu = smp_processor_id();
@@ -123,9 +124,9 @@ again:
}
wmb();
}
-EXPORT_SYMBOL(_raw_read_lock);
+EXPORT_SYMBOL(__raw_read_lock);

-void _raw_read_unlock(rwlock_t *rw)
+void __raw_read_unlock(raw_rwlock_t *rw)
{
if ( rw->lock == 0 )
printk("_read_unlock(): %s/%d (nip %08lX) lock %lx\n",
@@ -134,9 +135,9 @@ void _raw_read_unlock(rwlock_t *rw)
wmb();
atomic_dec((atomic_t *) &(rw)->lock);
}
-EXPORT_SYMBOL(_raw_read_unlock);
+EXPORT_SYMBOL(__raw_read_unlock);

-void _raw_write_lock(rwlock_t *rw)
+void __raw_write_lock(raw_rwlock_t *rw)
{
unsigned long stuck = INIT_STUCK;
int cpu = smp_processor_id();
@@ -175,9 +176,9 @@ again:
}
wmb();
}
-EXPORT_SYMBOL(_raw_write_lock);
+EXPORT_SYMBOL(__raw_write_lock);

-int _raw_write_trylock(rwlock_t *rw)
+int __raw_write_trylock(raw_rwlock_t *rw)
{
if (test_and_set_bit(31, &(rw)->lock)) /* someone has a write lock */
return 0;
@@ -190,9 +191,9 @@ int _raw_write_trylock(rwlock_t *rw)
wmb();
return 1;
}
-EXPORT_SYMBOL(_raw_write_trylock);
+EXPORT_SYMBOL(__raw_write_trylock);

-void _raw_write_unlock(rwlock_t *rw)
+void __raw_write_unlock(raw_rwlock_t *rw)
{
if ( !(rw->lock & (1<<31)) )
printk("_write_lock(): %s/%d (nip %08lX) lock %lx\n",
@@ -201,6 +202,6 @@ void _raw_write_unlock(rwlock_t *rw)
wmb();
clear_bit(31,&(rw)->lock);
}
-EXPORT_SYMBOL(_raw_write_unlock);
+EXPORT_SYMBOL(__raw_write_unlock);

#endif
=================================================================
--- ./arch/ppc/lib/dec_and_lock.c.ORG 2004-12-24 16:35:27.000000000 -0500
+++ ./arch/ppc/lib/dec_and_lock.c 2005-02-07 21:25:52.000000000 -0500
@@ -19,7 +19,7 @@
*/

#ifndef ATOMIC_DEC_AND_LOCK
-int _atomic_dec_and_lock(atomic_t *atomic, spinlock_t *lock)
+int _atomic_dec_and_raw_spin_lock(atomic_t *atomic, raw_spinlock_t *lock)
{
int counter;
int newcount;
@@ -35,12 +35,12 @@ int _atomic_dec_and_lock(atomic_t *atomi
return 0;
}

- spin_lock(lock);
+ _raw_spin_lock(lock);
if (atomic_dec_and_test(atomic))
return 1;
- spin_unlock(lock);
+ _raw_spin_unlock(lock);
return 0;
}

-EXPORT_SYMBOL(_atomic_dec_and_lock);
+EXPORT_SYMBOL(_atomic_dec_and_raw_spin_lock);
#endif /* ATOMIC_DEC_AND_LOCK */
=================================================================
--- ./arch/ppc/Kconfig.ORG 2005-02-02 20:42:55.000000000 -0500
+++ ./arch/ppc/Kconfig 2005-02-08 19:54:51.000000000 -0500
@@ -15,13 +15,6 @@ config GENERIC_HARDIRQS
bool
default y

-config RWSEM_GENERIC_SPINLOCK
- bool
-
-config RWSEM_XCHGADD_ALGORITHM
- bool
- default y
-
config GENERIC_CALIBRATE_DELAY
bool
default y
@@ -866,15 +859,21 @@ config NR_CPUS
depends on SMP
default "4"

-config PREEMPT
- bool "Preemptible Kernel"
- help
- This option reduces the latency of the kernel when reacting to
- real-time or interactive events by allowing a low priority process to
- be preempted even if it is in kernel mode executing a system call.
+source "lib/Kconfig.RT"

- Say Y here if you are building a kernel for a desktop, embedded
- or real-time system. Say N if you are unsure.
+config RWSEM_GENERIC_SPINLOCK
+ bool
+ depends on !PREEMPT_RT
+ default y
+
+config ASM_SEMAPHORES
+ bool
+ depends on !PREEMPT_RT
+ default y
+
+config RWSEM_XCHGADD_ALGORITHM
+ bool
+ depends on !RWSEM_GENERIC_SPINLOCK && !PREEMPT_RT

config HIGHMEM
bool "High memory support"
=================================================================
--- ./include/asm-generic/tlb.h.ORG 2005-02-01 16:26:51.000000000 -0500
+++ ./include/asm-generic/tlb.h 2005-02-21 13:43:21.000000000 -0500
@@ -50,7 +50,8 @@ struct mmu_gather {
#define tlb_mm(tlb) ((tlb)->mm)

/* Users of the generic TLB shootdown code must declare this storage space. */
-DECLARE_PER_CPU(struct mmu_gather, mmu_gathers);
+
+DECLARE_PER_CPU_LOCKED(struct mmu_gather, mmu_gathers);

/* tlb_gather_mmu
* Return a pointer to an initialized struct mmu_gather.
@@ -58,7 +59,8 @@ DECLARE_PER_CPU(struct mmu_gather, mmu_g
static inline struct mmu_gather *
tlb_gather_mmu(struct mm_struct *mm, unsigned int full_mm_flush)
{
- struct mmu_gather *tlb = &get_cpu_var(mmu_gathers);
+ struct mmu_gather *tlb = &get_cpu_var_locked(mmu_gathers,
+ _smp_processor_id());

tlb->mm = mm;

@@ -99,7 +101,7 @@ tlb_finish_mmu(struct mmu_gather *tlb, u
freed = rss;
mm->rss = rss - freed;
tlb_flush_mmu(tlb, start, end);
- put_cpu_var(mmu_gathers);
+ put_cpu_var_locked(mmu_gathers, _smp_processor_id());

/* keep the page table cache within bounds */
check_pgt_cache();
=================================================================
--- ./include/asm-generic/percpu.h.ORG 2005-02-01 16:26:51.000000000 -0500
+++ ./include/asm-generic/percpu.h 2005-02-21 13:11:53.000000000 -0500
@@ -53,6 +53,9 @@ do { \
#endif /* SMP */

#define DECLARE_PER_CPU(type, name) extern __typeof__(type) per_cpu__##name
+#define DECLARE_PER_CPU_LOCKED(type, name) \
+ extern __typeof__(type) per_cpu__##name##_locked; \
+ extern spinlock_t per_cpu_lock__##name##_locked

#define EXPORT_PER_CPU_SYMBOL(var) EXPORT_SYMBOL(per_cpu__##var)
#define EXPORT_PER_CPU_SYMBOL_GPL(var) EXPORT_SYMBOL_GPL(per_cpu__##var)
=================================================================
--- ./include/asm-ppc/rtc.h.ORG 2004-12-24 16:33:49.000000000 -0500
+++ ./include/asm-ppc/rtc.h 2005-02-23 14:30:06.705254240 -0500
@@ -24,6 +24,11 @@
#ifndef __ASM_RTC_H__
#define __ASM_RTC_H__

+#ifdef CONFIG_ENCPP1 /* Ampro work-around. Ugh. */
+#define cpu_mhz 300
+#define cpu_khz (cpu_mhz * 1000)
+#endif
+
#ifdef __KERNEL__

#include <linux/rtc.h>
=================================================================
--- ./include/asm-ppc/semaphore.h.ORG 2004-12-24 16:34:57.000000000 -0500
+++ ./include/asm-ppc/semaphore.h 2005-02-08 11:39:18.000000000 -0500
@@ -18,6 +18,13 @@

#include <asm/atomic.h>
#include <asm/system.h>
+
+#ifdef CONFIG_PREEMPT_RT
+
+#include <linux/rt_lock.h>
+
+#else
+
#include <linux/wait.h>
#include <linux/rwsem.h>

@@ -108,4 +115,8 @@ extern inline void up(struct semaphore *

#endif /* __KERNEL__ */

+extern int FASTCALL(sem_is_locked(struct semaphore *sem));
+
+#endif /* CONFIG_PREEMPT_RT */
+
#endif /* !(_PPC_SEMAPHORE_H) */
=================================================================
--- ./include/asm-ppc/spinlock.h.ORG 2005-02-01 16:26:45.000000000 -0500
+++ ./include/asm-ppc/spinlock.h 2005-02-22 09:50:38.000000000 -0500
@@ -7,17 +7,6 @@
* Simple spin lock operations.
*/

-typedef struct {
- volatile unsigned long lock;
-#ifdef CONFIG_DEBUG_SPINLOCK
- volatile unsigned long owner_pc;
- volatile unsigned long owner_cpu;
-#endif
-#ifdef CONFIG_PREEMPT
- unsigned int break_lock;
-#endif
-} spinlock_t;
-
#ifdef __KERNEL__
#ifdef CONFIG_DEBUG_SPINLOCK
#define SPINLOCK_DEBUG_INIT , 0, 0
@@ -25,16 +14,19 @@ typedef struct {
#define SPINLOCK_DEBUG_INIT /* */
#endif

-#define SPIN_LOCK_UNLOCKED (spinlock_t) { 0 SPINLOCK_DEBUG_INIT }
+#define __RAW_SPIN_LOCK_UNLOCKED { 0 SPINLOCK_DEBUG_INIT }
+#define RAW_SPIN_LOCK_UNLOCKED (raw_spinlock_t) __RAW_SPIN_LOCK_UNLOCKED

-#define spin_lock_init(x) do { *(x) = SPIN_LOCK_UNLOCKED; } while(0)
-#define spin_is_locked(x) ((x)->lock != 0)
-#define spin_unlock_wait(x) do { barrier(); } while(spin_is_locked(x))
-#define _raw_spin_lock_flags(lock, flags) _raw_spin_lock(lock)
+#define __raw_spin_lock_init(x) \
+ do { *(x) = RAW_SPIN_LOCK_UNLOCKED; } while(0)
+#define __raw_spin_is_locked(x) ((x)->lock != 0)
+#define __raw_spin_unlock_wait(x) \
+ do { barrier(); } while(__raw_spin_is_locked(x))
+#define __raw_spin_lock_flags(lock, flags) __raw_spin_lock(lock)

#ifndef CONFIG_DEBUG_SPINLOCK

-static inline void _raw_spin_lock(spinlock_t *lock)
+static inline void __raw_spin_lock(raw_spinlock_t *lock)
{
unsigned long tmp;

@@ -55,54 +47,37 @@ static inline void _raw_spin_lock(spinlo
: "cr0", "memory");
}

-static inline void _raw_spin_unlock(spinlock_t *lock)
+static inline void __raw_spin_unlock(raw_spinlock_t *lock)
{
__asm__ __volatile__("eieio # spin_unlock": : :"memory");
lock->lock = 0;
}

-#define _raw_spin_trylock(l) (!test_and_set_bit(0,&(l)->lock))
+#define __raw_spin_trylock(l) (!test_and_set_bit(0,&(l)->lock))

#else

-extern void _raw_spin_lock(spinlock_t *lock);
-extern void _raw_spin_unlock(spinlock_t *lock);
-extern int _raw_spin_trylock(spinlock_t *lock);
-
-#endif
+extern void __raw_spin_lock(raw_spinlock_t *lock);
+extern void __raw_spin_unlock(raw_spinlock_t *lock);
+extern int __raw_spin_trylock(raw_spinlock_t *lock);

-/*
- * Read-write spinlocks, allowing multiple readers
- * but only one writer.
- *
- * NOTE! it is quite common to have readers in interrupts
- * but no interrupt writers. For those circumstances we
- * can "mix" irq-safe locks - any writer needs to get a
- * irq-safe write-lock, but readers can get non-irqsafe
- * read-locks.
- */
-typedef struct {
- volatile unsigned long lock;
-#ifdef CONFIG_DEBUG_SPINLOCK
- volatile unsigned long owner_pc;
-#endif
-#ifdef CONFIG_PREEMPT
- unsigned int break_lock;
-#endif
-} rwlock_t;
+#endif /* CONFIG_DEBUG_SPINLOCK */

#ifdef CONFIG_DEBUG_SPINLOCK
-#define RWLOCK_DEBUG_INIT , 0
+#define RAW_RWLOCK_DEBUG_INIT , 0
#else
-#define RWLOCK_DEBUG_INIT /* */
+#define RAW_RWLOCK_DEBUG_INIT /* */
#endif

-#define RW_LOCK_UNLOCKED (rwlock_t) { 0 RWLOCK_DEBUG_INIT }
-#define rwlock_init(lp) do { *(lp) = RW_LOCK_UNLOCKED; } while(0)
+#define __RAW_RW_LOCK_UNLOCKED { 0 RAW_RWLOCK_DEBUG_INIT }
+#define RAW_RW_LOCK_UNLOCKED (raw_rwlock_t) __RAW_RW_LOCK_UNLOCKED
+#define __raw_rwlock_init(lp) do { *(lp) = RAW_RW_LOCK_UNLOCKED; } while(0)
+#define __raw_read_can_lock(lp) (0 <= (lp)->lock)
+#define __raw_write_can_lock(lp) (!(lp)->lock)

#ifndef CONFIG_DEBUG_SPINLOCK

-static __inline__ void _raw_read_lock(rwlock_t *rw)
+static __inline__ void __raw_read_lock(raw_rwlock_t *rw)
{
unsigned int tmp;

@@ -123,7 +98,7 @@ static __inline__ void _raw_read_lock(rw
: "cr0", "memory");
}

-static __inline__ void _raw_read_unlock(rwlock_t *rw)
+static __inline__ void __raw_read_unlock(raw_rwlock_t *rw)
{
unsigned int tmp;

@@ -139,7 +114,7 @@ static __inline__ void _raw_read_unlock(
: "cr0", "memory");
}

-static __inline__ int _raw_write_trylock(rwlock_t *rw)
+static __inline__ int __raw_write_trylock(raw_rwlock_t *rw)
{
unsigned int tmp;

@@ -159,7 +134,7 @@ static __inline__ int _raw_write_trylock
return tmp == 0;
}

-static __inline__ void _raw_write_lock(rwlock_t *rw)
+static __inline__ void __raw_write_lock(raw_rwlock_t *rw)
{
unsigned int tmp;

@@ -180,7 +155,7 @@ static __inline__ void _raw_write_lock(r
: "cr0", "memory");
}

-static __inline__ void _raw_write_unlock(rwlock_t *rw)
+static __inline__ void __raw_write_unlock(raw_rwlock_t *rw)
{
__asm__ __volatile__("eieio # write_unlock": : :"memory");
rw->lock = 0;
@@ -188,15 +163,15 @@ static __inline__ void _raw_write_unlock

#else

-extern void _raw_read_lock(rwlock_t *rw);
-extern void _raw_read_unlock(rwlock_t *rw);
-extern void _raw_write_lock(rwlock_t *rw);
-extern void _raw_write_unlock(rwlock_t *rw);
-extern int _raw_write_trylock(rwlock_t *rw);
+extern void __raw_read_lock(raw_rwlock_t *rw);
+extern void __raw_read_unlock(raw_rwlock_t *rw);
+extern void __raw_write_lock(raw_rwlock_t *rw);
+extern void __raw_write_unlock(raw_rwlock_t *rw);
+extern int __raw_write_trylock(raw_rwlock_t *rw);

#endif

-#define _raw_read_trylock(lock) generic_raw_read_trylock(lock)
+#define __raw_read_trylock(lock) generic_raw_read_trylock(lock)

#endif /* __ASM_SPINLOCK_H */
#endif /* __KERNEL__ */
=================================================================
--- ./include/asm-ppc/hw_irq.h.ORG 2004-12-24 16:35:15.000000000 -0500
+++ ./include/asm-ppc/hw_irq.h 2005-02-23 15:55:45.653015432 -0500
@@ -13,8 +13,17 @@ extern void timer_interrupt(struct pt_re
#define INLINE_IRQS

#define irqs_disabled() ((mfmsr() & MSR_EE) == 0)
+#define irqs_disabled_flags(f) (!((f) & MSR_EE))

-#ifdef INLINE_IRQS
+#ifdef CONFIG_CRITICAL_IRQSOFF_TIMING
+ extern void notrace trace_irqs_off(void);
+ extern void notrace trace_irqs_on(void);
+#else
+# define trace_irqs_off() do { } while (0)
+# define trace_irqs_on() do { } while (0)
+#endif
+
+#if defined(INLINE_IRQS) || defined(CONFIG_CRITICAL_IRQSOFF_TIMING)

static inline void local_irq_disable(void)
{
@@ -22,11 +31,14 @@ static inline void local_irq_disable(voi
msr = mfmsr();
mtmsr(msr & ~MSR_EE);
__asm__ __volatile__("": : :"memory");
+ trace_irqs_off();
}

static inline void local_irq_enable(void)
{
unsigned long msr;
+
+ trace_irqs_on();
__asm__ __volatile__("": : :"memory");
msr = mfmsr();
mtmsr(msr | MSR_EE);
@@ -39,11 +51,19 @@ static inline void local_irq_save_ptr(un
*flags = msr;
mtmsr(msr & ~MSR_EE);
__asm__ __volatile__("": : :"memory");
+ trace_irqs_off();
}

#define local_save_flags(flags) ((flags) = mfmsr())
#define local_irq_save(flags) local_irq_save_ptr(&flags)
-#define local_irq_restore(flags) mtmsr(flags)
+#define local_irq_restore(flags) \
+ do { \
+ if (irqs_disabled_flags(flags)) \
+ trace_irqs_off(); \
+ else \
+ trace_irqs_on(); \
+ mtmsr(flags); \
+ } while (0)

#else

=================================================================
--- ./include/asm-ppc/tlb.h.ORG 2004-12-24 16:34:58.000000000 -0500
+++ ./include/asm-ppc/tlb.h 2005-02-09 19:12:21.000000000 -0500
@@ -50,7 +50,11 @@ static inline void __tlb_remove_tlb_entr
#define tlb_flush(tlb) flush_tlb_mm((tlb)->mm)

/* Get the generic bits... */
+#ifdef CONFIG_PREEMPT_RT
+#include <asm-generic/tlb-simple.h>
+#else
#include <asm-generic/tlb.h>
+#endif

#endif /* CONFIG_PPC_STD_MMU */

=================================================================
--- ./include/asm-ppc/ocp.h.ORG 2004-12-24 16:34:26.000000000 -0500
+++ ./include/asm-ppc/ocp.h 2005-02-23 16:50:53.514144104 -0500
@@ -32,7 +32,6 @@

#include <asm/mmu.h>
#include <asm/ocp_ids.h>
-#include <asm/rwsem.h>
#include <asm/semaphore.h>

#ifdef CONFIG_PPC_OCP
=================================================================
--- ./include/linux/sched.h.ORG 2005-02-01 16:26:51.000000000 -0500
+++ ./include/linux/sched.h 2005-02-20 18:24:02.000000000 -0500
@@ -74,9 +74,18 @@ extern int debug_direct_keyboard;
#endif

#ifdef CONFIG_FRAME_POINTER
+#ifdef CONFIG_PPC
+# define __valid_ra(l) ((__builtin_frame_address(l) && \
+ *(unsigned long *)__builtin_frame_address(l)) ? \
+ (unsigned long)__builtin_return_address(l) : 0UL)
+# define CALLER_ADDR0 ((unsigned long)__builtin_return_address(0))
+# define CALLER_ADDR1 __valid_ra(1)
+# define CALLER_ADDR2 (__valid_ra(1) ? __valid_ra(2) : 0UL)
+#else
# define CALLER_ADDR0 ((unsigned long)__builtin_return_address(0))
# define CALLER_ADDR1 ((unsigned long)__builtin_return_address(1))
# define CALLER_ADDR2 ((unsigned long)__builtin_return_address(2))
+#endif
#else
# define CALLER_ADDR0 ((unsigned long)__builtin_return_address(0))
# define CALLER_ADDR1 0UL
@@ -84,9 +93,14 @@ extern int debug_direct_keyboard;
#endif

#ifdef CONFIG_MCOUNT
- extern void notrace mcount(void);
+#ifdef CONFIG_PPC
+# define ARCH_MCOUNT _mcount
+#else
+# define ARCH_MCOUNT mcount
+#endif
+ extern void notrace ARCH_MCOUNT(void);
#else
-# define mcount() do { } while (0)
+# define ARCH_MCOUNT() do { } while (0)
#endif

#ifdef CONFIG_LATENCY_TRACE
=================================================================
--- ./drivers/char/blocker.c.ORG 2005-02-01 16:26:48.000000000 -0500
+++ ./drivers/char/blocker.c 2005-02-02 17:16:24.000000000 -0500
@@ -4,6 +4,7 @@

#include <linux/fs.h>
#include <linux/miscdevice.h>
+#include <asm/time.h>

#define BLOCKER_MINOR 221

@@ -17,8 +18,18 @@ u64 notrace get_cpu_tick(void)
u64 tsc;
#ifdef ARCHARM
tsc = *oscr;
-#else
+#elif defined(CONFIG_X86)
__asm__ __volatile__("rdtsc" : "=A" (tsc));
+#elif defined(CONFIG_PPC)
+ unsigned long hi, lo;
+
+ do {
+ hi = get_tbu();
+ lo = get_tbl();
+ } while (get_tbu() != hi);
+ tsc = (u64)hi << 32 | lo;
+#else
+ #error Implement get_cpu_tick()
#endif
return tsc;
}
=================================================================
--- ./kernel/sched.c.ORG 2005-02-01 16:26:46.000000000 -0500
+++ ./kernel/sched.c 2005-02-23 14:37:51.325621208 -0500
@@ -1237,7 +1237,7 @@ int fastcall wake_up_process(task_t * p)
int ret = try_to_wake_up(p, TASK_STOPPED | TASK_TRACED |
TASK_RUNNING_MUTEX | TASK_INTERRUPTIBLE |
TASK_UNINTERRUPTIBLE, 0, 0);
- mcount();
+ ARCH_MCOUNT();
return ret;
}

@@ -1248,7 +1248,7 @@ int fastcall wake_up_process_mutex(task_
int ret = try_to_wake_up(p, TASK_STOPPED | TASK_TRACED |
TASK_RUNNING_MUTEX | TASK_INTERRUPTIBLE |
TASK_UNINTERRUPTIBLE, 0, 1);
- mcount();
+ ARCH_MCOUNT();
return ret;
}

@@ -1257,7 +1257,7 @@ EXPORT_SYMBOL(wake_up_process_mutex);
int fastcall wake_up_state(task_t *p, unsigned int state)
{
int ret = try_to_wake_up(p, state | TASK_RUNNING_MUTEX, 0, 0);
- mcount();
+ ARCH_MCOUNT();
return ret;
}

@@ -1502,7 +1502,11 @@ static void finish_task_switch(task_t *p
* schedule_tail - first thing a freshly forked thread must call.
* @prev: the thread we just switched away from.
*/
+#ifdef CONFIG_PPC
+asmlinkage notrace void schedule_tail(task_t *prev)
+#else
asmlinkage void schedule_tail(task_t *prev)
+#endif
__releases(rq->lock)
{
preempt_disable(); // TODO: move this to fork setup
=================================================================
--- ./kernel/latency.c.ORG 2005-02-01 16:26:46.000000000 -0500
+++ ./kernel/latency.c 2005-02-23 08:32:38.000000000 -0500
@@ -50,6 +50,12 @@ static __cacheline_aligned_in_smp struct
int wakeup_timing = 1;
#endif

+#ifdef NONASCII
+#define MU 'µ'
+#else
+#define MU 'u'
+#endif
+
#ifdef CONFIG_LATENCY_TIMING

/*
@@ -357,9 +363,9 @@ void notrace __trace(unsigned long eip,
___trace(TRACE_FN, eip, parent_eip, 0, 0, 0);
}

-extern void mcount(void);
+extern void ARCH_MCOUNT(void);

-EXPORT_SYMBOL(mcount);
+EXPORT_SYMBOL(ARCH_MCOUNT);

void notrace __mcount(void)
{
@@ -631,8 +637,8 @@ static void * notrace l_start(struct seq
if (!n) {
seq_printf(m, "preemption latency trace v1.1.4 on %s\n", UTS_RELEASE);
seq_puts(m, "--------------------------------------------------------------------\n");
- seq_printf(m, " latency: %lu ?s, #%lu/%lu, CPU#%d | (M:%s VP:%d, KP:%d, SP:%d HP:%d #P:%d)\n",
- cycles_to_usecs(tr->saved_latency),
+ seq_printf(m, " latency: %lu %cs, #%lu/%lu, CPU#%d | (M:%s VP:%d, KP:%d, SP:%d HP:%d #P:%d)\n",
+ cycles_to_usecs(tr->saved_latency), MU,
entries, tr->trace_idx, out_tr.cpu,
#if defined(CONFIG_PREEMPT_NONE)
"server",
@@ -698,7 +704,7 @@ static void notrace l_stop(struct seq_fi
static void print_timestamp(struct seq_file *m, unsigned long abs_usecs,
unsigned long rel_usecs)
{
- seq_printf(m, " %4ld?s", abs_usecs);
+ seq_printf(m, " %4ld%cs", abs_usecs, MU);
if (rel_usecs > 100)
seq_puts(m, "!: ");
else if (rel_usecs > 1)
@@ -711,7 +717,7 @@ static void
print_timestamp_short(struct seq_file *m, unsigned long abs_usecs,
unsigned long rel_usecs)
{
- seq_printf(m, " %4ld?s", abs_usecs);
+ seq_printf(m, " %4ld%cs", abs_usecs, MU);
if (rel_usecs > 100)
seq_putc(m, '!');
else if (rel_usecs > 1)
@@ -1043,7 +1049,7 @@ static int setup_preempt_thresh(char *s)
get_option(&s, &thresh);
if (thresh > 0) {
preempt_thresh = usecs_to_cycles(thresh);
- printk("Preemption threshold = %u ?s\n", thresh);
+ printk("Preemption threshold = %u %cs\n", thresh, MU);
}
return 1;
}
@@ -1091,18 +1097,18 @@ check_critical_timing(struct cpu_trace *
update_max_tr(tr);

if (preempt_thresh)
- printk("(%16s-%-5d|#%d): %lu ?s critical section "
- "violates %lu ?s threshold.\n"
+ printk("(%16s-%-5d|#%d): %lu %cs critical section "
+ "violates %lu %cs threshold.\n"
" => started at timestamp %lu: ",
current->comm, current->pid,
- _smp_processor_id(),
- latency, cycles_to_usecs(preempt_thresh), t0);
+ _smp_processor_id(), latency,
+ MU, cycles_to_usecs(preempt_thresh), MU, t0);
else
- printk("(%16s-%-5d|#%d): new %lu ?s maximum-latency "
+ printk("(%16s-%-5d|#%d): new %lu %cs maximum-latency "
"critical section.\n => started at timestamp %lu: ",
current->comm, current->pid,
_smp_processor_id(),
- latency, t0);
+ latency, MU, t0);

print_symbol("<%s>\n", tr->critical_start);
printk(" => ended at timestamp %lu: ", t1);
@@ -1345,15 +1351,15 @@ check_wakeup_timing(struct cpu_trace *tr
update_max_tr(tr);

if (preempt_thresh)
- printk("(%16s-%-5d|#%d): %lu ?s wakeup latency "
- "violates %lu ?s threshold.\n",
+ printk("(%16s-%-5d|#%d): %lu %cs wakeup latency "
+ "violates %lu %cs threshold.\n",
current->comm, current->pid,
_smp_processor_id(), latency,
- cycles_to_usecs(preempt_thresh));
+ MU, cycles_to_usecs(preempt_thresh), MU);
else
- printk("(%16s-%-5d|#%d): new %lu ?s maximum-latency "
+ printk("(%16s-%-5d|#%d): new %lu %cs maximum-latency "
"wakeup.\n", current->comm, current->pid,
- _smp_processor_id(), latency);
+ _smp_processor_id(), latency, MU);

max_sequence++;

@@ -1399,7 +1405,7 @@ void __trace_start_sched_wakeup(struct t
tr->preempt_timestamp = cycles();
tr->critical_start = CALLER_ADDR0;
trace_cmdline();
- mcount();
+ ARCH_MCOUNT();
out_unlock:
spin_unlock(&sch.trace_lock);
}
@@ -1489,7 +1495,7 @@ long user_trace_start(void)
tr->critical_sequence = max_sequence;
tr->preempt_timestamp = cycles();
trace_cmdline();
- mcount();
+ ARCH_MCOUNT();
preempt_enable();

up(&max_mutex);
@@ -1507,7 +1513,7 @@ long user_trace_stop(void)
return -EINVAL;

preempt_disable();
- mcount();
+ ARCH_MCOUNT();

if (wakeup_timing) {
spin_lock_irqsave(&sch.trace_lock, flags);
@@ -1538,15 +1544,15 @@ long user_trace_stop(void)
latency = cycles_to_usecs(delta);

if (preempt_thresh)
- printk("(%16s-%-5d|#%d): %lu ?s user-latency "
- "violates %lu ?s threshold.\n",
+ printk("(%16s-%-5d|#%d): %lu %cs user-latency "
+ "violates %lu %cs threshold.\n",
current->comm, current->pid,
- _smp_processor_id(), latency,
- cycles_to_usecs(preempt_thresh));
+ _smp_processor_id(), latency, MU,
+ cycles_to_usecs(preempt_thresh), MU);
else
- printk("(%16s-%-5d|#%d): new %lu ?s user-latency.\n",
+ printk("(%16s-%-5d|#%d): new %lu %cs user-latency.\n",
current->comm, current->pid,
- _smp_processor_id(), latency);
+ _smp_processor_id(), latency, MU);

max_sequence++;
up(&max_mutex);


Attachments:
realtime-preempt-2.6.11-rc2-V0.7.37-02-ppc (38.23 kB)

2005-02-24 01:04:13

by Lee Revell

Subject: Re: More latency regressions with 2.6.11-rc4-RT-V0.7.39-02

On Thu, 2005-02-24 at 10:27 +1100, Nick Piggin wrote:
> Hugh Dickins wrote:
> > On Wed, 23 Feb 2005, Lee Revell wrote:
> >
> >>>>Thanks, your patch fixes the copy_pte_range latency.
> >>
> >>clear_page_range is also problematic.
> >
> >
> > Yes, I saw that from your other traces too. I know there are plans
> > to improve clear_page_range during 2.6.12, but I didn't realize that
> > it had become very much worse than its antecedent clear_page_tables,
> > and I don't see missing latency fixes for that. Nick's the expert.
> >
>
> I wouldn't have thought it should have become worse, latency
> wise. What is actually happening is that the lower level freeing
> functions are being called more often. But this should result in
> the work being spread out more, if anything, whereas in the old
> system things tended to be batched up into bigger chunks
> (typically at exit() time).
>
> If you are using i386 with 2-level page tables (no highmem), then
> the behaviour should be more or less identical. Odd.

IIRC last time I really tested this a few months ago, the worst case
latency on that machine was about 150us. Currently it's 422us from the
same clear_page_range code path.

On my Athlon XP the clear_page_range latency is not showing up at all,
and the worst delay so far is only 35us, most of which is the timer
interrupt; IOW that machine is showing the best achievable latency (with
PREEMPT_DESKTOP). The machine seeing 422us latencies in
clear_page_range is a 600MHz C3, which is known to be an FSB-limited
architecture.

Lee

2005-02-24 01:29:36

by Nick Piggin

Subject: Re: More latency regressions with 2.6.11-rc4-RT-V0.7.39-02

Lee Revell wrote:
> On Thu, 2005-02-24 at 10:27 +1100, Nick Piggin wrote:

>>If you are using i386 with 2-level page tables (no highmem), then
>>the behaviour should be more or less identical. Odd.
>
>
> IIRC last time I really tested this a few months ago, the worst case
> latency on that machine was about 150us. Currently it's 422us from the
> same clear_page_range code path.
>
> On my Athlon XP the clear_page_range latency is not showing up at all,
> and the worst delay so far is only 35us, most of which is the timer
> interrupt; IOW that machine is showing the best achievable latency (with
> PREEMPT_DESKTOP). The machine seeing 422us latencies in
> clear_page_range is a 600MHz C3, which is known to be an FSB-limited
> architecture.
>

Well it should be pretty trivial to add a break in there.
I don't think it can get into 2.6.11 at this point though,
so we'll revisit this for 2.6.12 if the clear_page_range
optimisations don't get anywhere.
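
A hypothetical outline of such a break, modeled on the copy_pte_range fix
earlier in this thread; clear_one_level() is a made-up stand-in for the
real per-level freeing work, and the real locking context may differ:

	/* hypothetical sketch: restartable clearing with a lockbreak */
again:
	spin_lock(&mm->page_table_lock);
	while (addr < end) {
		addr = clear_one_level(mm, addr, end);	/* made-up helper */
		if (need_resched() ||
		    need_lockbreak(&mm->page_table_lock))
			break;
	}
	spin_unlock(&mm->page_table_lock);
	cond_resched();
	if (addr < end)
		goto again;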

Nick

2005-02-24 02:25:44

by Lee Revell

Subject: Re: More latency regressions with 2.6.11-rc4-RT-V0.7.39-02

On Thu, 2005-02-24 at 12:29 +1100, Nick Piggin wrote:
> Lee Revell wrote:
> >
> > IIRC last time I really tested this a few months ago, the worst case
> > latency on that machine was about 150us. Currently it's 422us from the
> > same clear_page_range code path.
> >
> Well it should be pretty trivial to add a break in there.
> I don't think it can get into 2.6.11 at this point though,
> so we'll revisit this for 2.6.12 if the clear_page_range
> optimisations don't get anywhere.
>

Agreed, it would be much better to optimize this away than just add a
scheduling point. It seems like we could do this lazily.

IMHO it's not critical that these latency fixes be merged until the VP
feature gets merged; until then people will be using Ingo's patches
anyway.

Lee

2005-02-24 02:43:18

by Nick Piggin

Subject: Re: More latency regressions with 2.6.11-rc4-RT-V0.7.39-02

Lee Revell wrote:
> On Thu, 2005-02-24 at 12:29 +1100, Nick Piggin wrote:
>
>>Lee Revell wrote:
>>
>>>IIRC last time I really tested this a few months ago, the worst case
>>>latency on that machine was about 150us. Currently it's 422us from the
>>>same clear_page_range code path.
>>>
>>
>>Well it should be pretty trivial to add a break in there.
>>I don't think it can get into 2.6.11 at this point though,
>>so we'll revisit this for 2.6.12 if the clear_page_range
>>optimisations don't get anywhere.
>>
>
>
> Agreed, it would be much better to optimize this away than just add a
> scheduling point. It seems like we could do this lazily.
>

Oh? What do you mean by lazy? IMO it is sort of implemented lazily now.
That is, we are too lazy to refcount page table pages in fastpaths, so
that pushes a lot of work to unmap time. Not necessarily a bad trade-off,
mind you. Just something I'm looking into.

2005-02-24 03:04:08

by Lee Revell

Subject: Re: More latency regressions with 2.6.11-rc4-RT-V0.7.39-02

On Thu, 2005-02-24 at 13:41 +1100, Nick Piggin wrote:
> Lee Revell wrote:
> >
> > Agreed, it would be much better to optimize this away than just add a
> > scheduling point. It seems like we could do this lazily.
> >
>
> Oh? What do you mean by lazy? IMO it is sort of implemented lazily now.
> That is, we are too lazy to refcount page table pages in fastpaths, so
> that pushes a lot of work to unmap time. Not necessarily a bad trade-off,
> mind you. Just something I'm looking into.
>

I guess I was thinking we could be even more lazy, and somehow defer it
until after unmap time (absent memory pressure, that is). Actually
that's kind of what a lock break would do.

Lee

2005-02-24 04:34:19

by Frank Rowand

Subject: Re: PPC RT Patch..

john cooper wrote:
> Ingo,
> We've had a PPC port of your RT work underway with
> a focus on trace instrumentation. This is based upon
> realtime-preempt-2.6.11-rc2-V0.7.37-02. A diff is
> attached.
>
> To the extent possible the tracing facilities are the
> same as your x86 work. In the process a few PPC/gcc
> issues needed to be resolved. There is also a bug fix
> contained for tlb_gather_mmu() which was causing debug
> assertions to be generated in a path which attempted to
> sleep with a non-zero preempt count.

Manish Lachwani mentioned to me that he faced the same issue
with the MIPS RT support, and that when he discussed
it with Ingo, the solution was for include/asm-ppc/tlb.h
to include include/asm-generic/tlb-simple.h when PREEMPT_RT is turned on.
The patch does this for the #ifdef CONFIG_PPC_STD_MMU case,
but not for the #else case. I don't know which case is used
for the Ampro board.
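
Concretely, the hunk in John's patch only switches the include inside the
CONFIG_PPC_STD_MMU branch. Assuming the #else branch likewise ends up
including asm-generic/tlb.h, the same selection there would be an untested
sketch along these lines:

	#ifdef CONFIG_PREEMPT_RT
	#include <asm-generic/tlb-simple.h>
	#else
	#include <asm-generic/tlb.h>
	#endif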


>
> This does build and function when SMP is configured,
> though we have not yet verified it on other than a
> uniprocessor. As a simplifying assumption, testing has
> thus far concentrated on the following modes:
>
> PREEMPT_NONE
> - verify baseline regression
>
> PREEMPT_RT && !PREEMPT_SMP
> - typical for an embedded RT PPC application
>
> PREEMPT_RT && PREEMPT_SMP
> - kicks in live locking code which otherwise receives no
> coverage. This is functionally equivalent to the above
> config on a single CPU target thus no MP dynamic testing
> is achieved. Still quite useful IMHO.
>
> The target used for development/testing is an Ampro EnCore PP1
> which sports a 300MHz MPC8245. For testing this boots with NFS
> as root. An mp3 decode at nice --20 is launched which requires
> just under 20% of the CPU to maintain an uninterrupted audio
> decode and output. To this a series of "du -s /" are launched
> to soak up excess CPU bandwidth. Perhaps not rigorous but a
> fair sanity check and load for the purpose at hand.
>
> Under these conditions maximum scheduling latencies are seen in
> the 120-150us range. Note no attempt has yet been made to
> optimize arch specific paths and full trace instrumentation has
> been enabled.
>
> I've written some logging code to help find problems such as
> the tlb issue above. As it has not been made general I've
> removed it from this patch. At some point I'll likely revisit
> this.
>
> Comments/suggestions welcome.

I am glad to see the instrumentation and measurement-related code
in your patch. (My patch of last week ("Frank's patch") is lacking
that code.)

Other differences between the two patches are:

arch/ppc/syslib/i8259.c
Frank neglected to convert i8259_lock to a raw spinlock (a sketch of
this conversion follows the list of differences below).

arch/ppc/kernel/signal.c
John added an enable of irqs in do_signal() #ifdef CONFIG_PREEMPT_RT

arch/ppc/kernel/traps.c
John added an enable of irqs and preempt_check_resched() in _exception().

various files
Frank added the intrusive variable tb_to_us for use by cycles_to_usec()
and added an ugly #ifdef in cycles_to_usec().
John hard-coded cpu_khz for one specific board so that no change would
be needed in cycles_to_usec().

various files
John has the mmu_gather fix that is described above.
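
For the i8259 item, the conversion is of this general shape (a sketch
of the change, not the actual hunk from either patch):

        --- a/arch/ppc/syslib/i8259.c
        +++ b/arch/ppc/syslib/i8259.c
        -spinlock_t i8259_lock = SPIN_LOCK_UNLOCKED;
        +raw_spinlock_t i8259_lock = RAW_SPIN_LOCK_UNLOCKED;

Under PREEMPT_RT a plain spinlock_t becomes a sleeping lock, so a lock
taken from hard interrupt context, like the PIC lock, has to stay raw.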

John's patch and Frank's patch are otherwise mostly the same, except for
the differences that result from being based on different kernel
versions. I am glad to see that because it means that two sets of
eyes have agreed.

Frank's patch may have missed some EXPORT_SYMBOL()s in arch/ppc/lib/locks.c.
I'll check those over again tomorrow.


> -john


-Frank
--
Frank Rowand <[email protected]>
MontaVista Software, Inc

2005-02-24 05:00:02

by Hugh Dickins

[permalink] [raw]
Subject: Re: More latency regressions with 2.6.11-rc4-RT-V0.7.39-02

On Wed, 23 Feb 2005, Lee Revell wrote:
> On Wed, 2005-02-23 at 20:53 +0000, Hugh Dickins wrote:
> > On Wed, 23 Feb 2005, Hugh Dickins wrote:
> > > Please replace by new patch below, which I'm now running through lmbench.
> >
> > That second patch seems fine, and I see no lmbench regression from it.
>
> Should go into 2.6.11, right?

That's up to Andrew (and Linus).

I was thinking that way when I rushed you the patch. But given that
you have remaining unresolved latency issues nearby (zap_pte_range,
clear_page_range), and given the warning shot that I screwed up my
first attempt, I'd be inclined to say hold off.

It's a pity: for a while we were thinking 2.6.11 would be a big step
forward for mainline latency; but it now looks to me like these tests
have come too late in the cycle to be dealt with safely.

In other mail, you do expect people still to be using Ingo's patches,
so probably this patch should stick there (and in -mm) for now.

Hugh

2005-02-24 06:38:41

by Lee Revell

[permalink] [raw]
Subject: Re: More latency regressions with 2.6.11-rc4-RT-V0.7.39-02

On Thu, 2005-02-24 at 04:56 +0000, Hugh Dickins wrote:
> On Wed, 23 Feb 2005, Lee Revell wrote:
> > On Wed, 2005-02-23 at 20:53 +0000, Hugh Dickins wrote:
> > > On Wed, 23 Feb 2005, Hugh Dickins wrote:
> > > > Please replace by new patch below, which I'm now running through lmbench.
> > >
> > > That second patch seems fine, and I see no lmbench regression from it.
> >
> > Should go into 2.6.11, right?
>
> That's up to Andrew (and Linus).
>
> I was thinking that way when I rushed you the patch. But given that
> you have remaining unresolved latency issues nearby (zap_pte_range,
> clear_page_range), and given the warning shot that I screwed up my
> first attempt, I'd be inclined to say hold off.
>
> It's a pity: for a while we were thinking 2.6.11 would be a big step
> forward for mainline latency; but it now looks to me like these tests
> have come too late in the cycle to be dealt with safely.
>
> In other mail, you do expect people still to be using Ingo's patches,
> so probably this patch should stick there (and in -mm) for now.

Well, all of these were fixed in the past, so it may not be
unreasonable to fix them again for 2.6.11.

Lee

2005-02-24 08:27:04

by Hugh Dickins

[permalink] [raw]
Subject: Re: More latency regressions with 2.6.11-rc4-RT-V0.7.39-02

On Thu, 24 Feb 2005, Lee Revell wrote:
> On Thu, 2005-02-24 at 04:56 +0000, Hugh Dickins wrote:
> >
> > In other mail, you do expect people still to be using Ingo's patches,
> > so probably this patch should stick there (and in -mm) for now.
>
> Well all of these were fixed in the past so it may not be unreasonable
> to fix them for 2.6.11.

If we'd got to it earlier, yes. But 2.6.11 looks to be just a day or
two away, and we've no idea why zap_pte_range or clear_page_range
would have reverted. Nor have we heard from Ingo yet.

Hugh

2005-02-24 14:01:08

by john cooper

[permalink] [raw]
Subject: Re: PPC RT Patch..

Frank Rowand wrote:
> john cooper wrote:
>> ... There is also a bug fix
>> contained for tlb_gather_mmu() which was causing debug
>> assertions to be generated in a path which attempted to
>> sleep with a non-zero preempt count.
>
>
> Manish Lachwani mentioned to me that he hit the same issue
> with the MIPS RT support, and that when he discussed it with
> Ingo the solution was for include/asm-ppc/tlb.h to include
> include/asm-generic/tlb-simple.h when PREEMPT_RT is turned on.
> The patch does this for the #ifdef CONFIG_PPC_STD_MMU case,
> but not for the #else case. I don't know which case is used
> for the Ampro board.

It appeared to me to be a generic issue, though I believe a number
of solutions are possible. asm-generic/tlb.h:tlb_gather_mmu()
expands to linux/percpu.h:get_cpu_var(), which does a
preempt_disable() followed by __get_cpu_var(). This caused the
debug assertion to kick in when __page_cache_release(), and to a
lesser extent activate_page(), attempted to block on a mutex
(though other paths may well exist). My approach was to replace
the outer layer preempt_disable/enable calls with a mutex-style
spinlock.
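
To spell out the chain (roughly the generic code, from memory rather
than verbatim):

        /* asm-generic/tlb.h */
        static inline struct mmu_gather *
        tlb_gather_mmu(struct mm_struct *mm, unsigned int full_mm_flush)
        {
                /* get_cpu_var() = preempt_disable() + __get_cpu_var() */
                struct mmu_gather *tlb = &get_cpu_var(mmu_gathers);
                /* ... initialize tlb ... */
                return tlb;
        }

Anything that can block between tlb_gather_mmu() and tlb_finish_mmu()
then does so with preemption disabled, which is exactly what the
assertion catches under PREEMPT_RT.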

The fix was fairly easy once it was known where the gratuitous
preempt_disable() call was coming from. I cobbled together a logging
mechanism which detected the problem; as it wasn't very general I
removed it from the patch. I didn't see an alternative means of
diagnosing such a scenario, so I'll likely get around to generalizing
the code.
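
The core of such a check is small; something along these lines (a
sketch of the idea, not the code I removed):

        /* Complain if we are about to block with preemption disabled. */
        static void check_atomic_sleep(void)
        {
                if (preempt_count())
                        printk(KERN_ERR
                               "blocking with preempt_count=%08x at %p\n",
                               preempt_count(),
                               __builtin_return_address(0));
        }

Hooked into the mutex slow path, this points straight at the offending
preempt_disable() caller.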

-john


--
[email protected]

2005-02-25 03:31:02

by Lee Revell

[permalink] [raw]
Subject: Re: More latency regressions with 2.6.11-rc4-RT-V0.7.39-02

On Thu, 2005-02-24 at 08:26 +0000, Hugh Dickins wrote:
> On Thu, 24 Feb 2005, Lee Revell wrote:
> > On Thu, 2005-02-24 at 04:56 +0000, Hugh Dickins wrote:
> > >
> > > In other mail, you do expect people still to be using Ingo's patches,
> > > so probably this patch should stick there (and in -mm) for now.
> >
> > Well all of these were fixed in the past so it may not be unreasonable
> > to fix them for 2.6.11.
>
> If we'd got to it earlier, yes. But 2.6.11 looks to be just a day or
> two away, and we've no idea why zap_pte_range or clear_page_range
> would have reverted. Nor have we heard from Ingo yet.
>

It's also not clear that the patch completely fixes the copy_pte_range
latency. This trace is from the Athlon XP.

Lee

preemption latency trace v1.1.4 on 2.6.11-rc4-RT-V0.7.39-02
--------------------------------------------------------------------
latency: 284 µs, #25/25, CPU#0 | (M:preempt VP:0, KP:1, SP:1 HP:1 #P:1)
-----------------
| task: ksoftirqd/0-2 (uid:0 nice:-10 policy:0 rt_prio:0)
-----------------

_------=> CPU#
/ _-----=> irqs-off
| / _----=> need-resched
|| / _---=> hardirq/softirq
||| / _--=> preempt-depth
|||| /
||||| delay
cmd pid ||||| time | caller
\ / ||||| \ | /
(T1/#0) dpkg 9299 0 3 00000005 00000000 [0001017457529380] 0.000ms (+3633922.612ms): <676b7064> (<00746500>)
(T1/#2) dpkg 9299 0 3 00000005 00000002 [0001017457529620] 0.000ms (+0.000ms): __trace_start_sched_wakeup+0x9a/0xd0 <c012eaca> (try_to_wake_up+0x90/0x160 <c0110350>)
(T1/#3) dpkg 9299 0 3 00000004 00000003 [0001017457529825] 0.000ms (+0.000ms): preempt_schedule+0x11/0x80 <c02879d1> (try_to_wake_up+0x90/0x160 <c0110350>)
(T3/#4) dpkg-9299 0dn.4 0µs : try_to_wake_up+0x118/0x160 <c01103d8> <<...>-2> (69 74):
(T1/#5) dpkg 9299 0 3 00000003 00000005 [0001017457530633] 0.000ms (+0.000ms): preempt_schedule+0x11/0x80 <c02879d1> (try_to_wake_up+0xf2/0x160 <c01103b2>)
(T1/#6) dpkg 9299 0 3 00000003 00000006 [0001017457530809] 0.001ms (+0.000ms): wake_up_process+0x35/0x40 <c0110455> (do_softirq+0x3f/0x50 <c011aedf>)
(T6/#7) dpkg-9299 0dn.2 1µs!< (1)
(T1/#8) dpkg 9299 0 2 00000001 00000008 [0001017457898984] 0.276ms (+0.000ms): preempt_schedule+0x11/0x80 <c02879d1> (copy_pte_range+0xbc/0x1b0 <c014573c>)
(T1/#9) dpkg 9299 0 2 00000001 00000009 [0001017457899172] 0.276ms (+0.000ms): __cond_resched_raw_spinlock+0xb/0x50 <c0111f9b> (copy_pte_range+0xad/0x1b0 <c014572d>)
(T1/#10) dpkg 9299 0 2 00000000 0000000a [0001017457899575] 0.277ms (+0.000ms): __cond_resched+0xe/0x70 <c0111f2e> (__cond_resched_raw_spinlock+0x35/0x50 <c0111fc5>)
(T1/#11) dpkg 9299 0 3 00000000 0000000b [0001017457900063] 0.277ms (+0.000ms): __schedule+0xe/0x680 <c028720e> (__cond_resched+0x4a/0x70 <c0111f6a>)
(T1/#12) dpkg 9299 0 3 00000000 0000000c [0001017457900379] 0.277ms (+0.000ms): profile_hit+0x9/0x50 <c0116449> (__schedule+0x43/0x680 <c0287243>)
(T1/#13) dpkg 9299 0 3 00000001 0000000d [0001017457900602] 0.277ms (+0.001ms): sched_clock+0x14/0x80 <c010cbb4> (__schedule+0x73/0x680 <c0287273>)
(T1/#14) dpkg 9299 0 3 00000002 0000000e [0001017457902490] 0.279ms (+0.000ms): dequeue_task+0x12/0x60 <c010ff32> (__schedule+0x1e0/0x680 <c02873e0>)
(T1/#15) dpkg 9299 0 3 00000002 0000000f [0001017457902687] 0.279ms (+0.000ms): recalc_task_prio+0xe/0x140 <c01100be> (__schedule+0x202/0x680 <c0287402>)
(T1/#16) dpkg 9299 0 3 00000002 00000010 [0001017457902848] 0.279ms (+0.000ms): effective_prio+0x8/0x60 <c0110058> (recalc_task_prio+0x88/0x140 <c0110138>)
(T1/#17) dpkg 9299 0 3 00000002 00000011 [0001017457902995] 0.279ms (+0.000ms): enqueue_task+0x11/0x80 <c010ff91> (__schedule+0x20e/0x680 <c028740e>)
(T4/#18) [ => dpkg ] 0.280ms (+0.000ms)
(T1/#19) <...> 2 0 1 00000002 00000013 [0001017457905091] 0.281ms (+0.000ms): __switch_to+0xe/0x190 <c010110e> (__schedule+0x306/0x680 <c0287506>)
(T3/#20) <...>-2 0d..2 281µs : __schedule+0x337/0x680 <c0287537> <dpkg-9299> (74 69):
(T1/#21) <...> 2 0 1 00000002 00000015 [0001017457906484] 0.282ms (+0.000ms): finish_task_switch+0x14/0xa0 <c0110844> (__schedule+0x33f/0x680 <c028753f>)
(T1/#22) <...> 2 0 1 00000001 00000016 [0001017457906713] 0.282ms (+0.000ms): trace_stop_sched_switched+0x11/0x180 <c012eb11> (finish_task_switch+0x51/0xa0 <c0110881>)
(T3/#23) <...>-2 0d..1 282µs : trace_stop_sched_switched+0x4c/0x180 <c012eb4c> <<...>-2> (69 0):
(T1/#24) <...> 2 0 1 00000001 00000018 [0001017457908107] 0.283ms (+0.000ms): trace_stop_sched_switched+0x11c/0x180 <c012ec1c> (finish_task_switch+0x51/0xa0 <c0110881>)


2005-02-25 05:59:14

by Hugh Dickins

[permalink] [raw]
Subject: Re: More latency regressions with 2.6.11-rc4-RT-V0.7.39-02

On Thu, 24 Feb 2005, Lee Revell wrote:
> On Thu, 2005-02-24 at 08:26 +0000, Hugh Dickins wrote:
> >
> > If we'd got to it earlier, yes. But 2.6.11 looks to be just a day or
> > two away, and we've no idea why zap_pte_range or clear_page_range
> > would have reverted. Nor have we heard from Ingo yet.
>
> It's also not clear that the patch completely fixes the copy_pte_range
> latency. This trace is from the Athlon XP.

Then we need Ingo to investigate and explain all these reversions.
I'm not _blaming_ Ingo for them, but I'm not familiar with his patches
nor with deciphering latency traces - he's the magician around here.

Hugh

2005-02-25 15:04:19

by Lee Revell

[permalink] [raw]
Subject: Re: More latency regressions with 2.6.11-rc4-RT-V0.7.39-02

On Fri, 2005-02-25 at 05:58 +0000, Hugh Dickins wrote:
> On Thu, 24 Feb 2005, Lee Revell wrote:
> > On Thu, 2005-02-24 at 08:26 +0000, Hugh Dickins wrote:
> > >
> > > If we'd got to it earlier, yes. But 2.6.11 looks to be just a day or
> > > two away, and we've no idea why zap_pte_range or clear_page_range
> > > would have reverted. Nor have we heard from Ingo yet.
> >
> > It's also not clear that the patch completely fixes the copy_pte_range
> > latency. This trace is from the Athlon XP.
>
> Then we need Ingo to investigate and explain all these reversions.
> I'm not _blaming_ Ingo for them, but I'm not familiar with his patches
> nor with deciphering latency traces - he's the magician around here.
>

Yup. Oh well.

I'll try to compile a comprehensive list of these so we can fix them for
2.6.12.

Lee