2005-12-27 23:27:45

by Lee Revell

Subject: 2.6.15-rc5: latency regression vs 2.6.14 in exit_mmap->free_pgtables

I am seeing excessive latencies (2ms+) in free_pgtables, called from
exit_mmap with 2.6.15-rc5. This is a significant regression from 2.6.14
where the maximum observed latency was less than 1ms with some tuning.
Here is the trace.

(Although the trace was captured with the -rt kernel, I am only using it
for the instrumentation features - all preemption settings are the same
as in mainline with PREEMPT.)

preemption latency trace v1.1.5 on 2.6.15-rc5-rt4
--------------------------------------------------------------------
latency: 2267 us, #2632/2632, CPU#0 | (M:preempt VP:0, KP:0, SP:0 HP:0)
-----------------
| task: softirq-timer/0-3 (uid:0 nice:0 policy:1 rt_prio:1)
-----------------

_------=> CPU#
/ _-----=> irqs-off
| / _----=> need-resched
|| / _---=> hardirq/softirq
||| / _--=> preempt-depth
|||| /
||||| delay
cmd pid ||||| time | caller
\ / ||||| \ | /
evolutio-4728 0D.h2 0us : __trace_start_sched_wakeup (try_to_wake_up)
evolutio-4728 0D.h2 1us : __trace_start_sched_wakeup <<...>-3> (62 0)
evolutio-4728 0D.h. 2us : wake_up_process (wakeup_softirqd)
evolutio-4728 0D.h. 3us : scheduler_tick (update_process_times)
evolutio-4728 0D.h. 3us : sched_clock (scheduler_tick)
evolutio-4728 0D.h1 5us : task_timeslice (scheduler_tick)
evolutio-4728 0D.h. 6us : run_posix_cpu_timers (update_process_times)
evolutio-4728 0D.h1 7us : note_interrupt (__do_IRQ)
evolutio-4728 0D.h1 7us : end_8259A_irq (__do_IRQ)
evolutio-4728 0D.h1 8us : enable_8259A_irq (end_8259A_irq)
evolutio-4728 0Dnh1 10us : irq_exit (do_IRQ)
evolutio-4728 0Dn.1 11us < (608)
evolutio-4728 0.n.1 11us : unlink_file_vma (free_pgtables)
evolutio-4728 0.n.2 12us : __remove_shared_vm_struct (unlink_file_vma)
evolutio-4728 0.n.2 13us : vma_prio_tree_remove (__remove_shared_vm_struct)
evolutio-4728 0.n.2 13us : prio_tree_remove (vma_prio_tree_remove)
evolutio-4728 0.n.1 14us : preempt_schedule (unlink_file_vma)
evolutio-4728 0.n.1 15us : anon_vma_unlink (free_pgtables)
evolutio-4728 0.n.1 16us : unlink_file_vma (free_pgtables)
evolutio-4728 0.n.2 17us : __remove_shared_vm_struct (unlink_file_vma)
evolutio-4728 0.n.2 17us : vma_prio_tree_remove (__remove_shared_vm_struct)
evolutio-4728 0.n.2 18us : prio_tree_remove (vma_prio_tree_remove)
evolutio-4728 0.n.2 19us : prio_tree_replace (prio_tree_remove)
evolutio-4728 0.n.1 20us : preempt_schedule (unlink_file_vma)
evolutio-4728 0.n.1 21us : anon_vma_unlink (free_pgtables)
evolutio-4728 0.n.1 22us : preempt_schedule (anon_vma_unlink)
evolutio-4728 0.n.1 23us : kmem_cache_free (anon_vma_unlink)
evolutio-4728 0.n.1 24us : unlink_file_vma (free_pgtables)

[...]

evolutio-4728 0.n.1 2239us : unlink_file_vma (free_pgtables)
evolutio-4728 0.n.1 2240us : anon_vma_unlink (free_pgtables)
evolutio-4728 0.n.1 2241us : preempt_schedule (anon_vma_unlink)
evolutio-4728 0.n.1 2241us : kmem_cache_free (anon_vma_unlink)
evolutio-4728 0.n.1 2242us : unlink_file_vma (free_pgtables)
evolutio-4728 0.n.1 2243us : free_pgd_range (free_pgtables)
evolutio-4728 0.n.1 2243us : free_pte_range (free_pgd_range)
evolutio-4728 0.n.1 2244us : free_page_and_swap_cache (free_pte_range)
evolutio-4728 0.n.1 2245us : put_page (free_page_and_swap_cache)
evolutio-4728 0.n.1 2245us : __page_cache_release (put_page)
evolutio-4728 0.n.1 2246us : preempt_schedule (__page_cache_release)
evolutio-4728 0.n.1 2247us : free_hot_page (__page_cache_release)
evolutio-4728 0.n.1 2248us : free_hot_cold_page (free_hot_page)
evolutio-4728 0.n.1 2248us : __mod_page_state (free_hot_cold_page)
evolutio-4728 0.n.1 2250us : preempt_schedule (free_hot_cold_page)
evolutio-4728 0.n.1 2250us : __mod_page_state (free_pte_range)
evolutio-4728 0.n.. 2252us : preempt_schedule (exit_mmap)
evolutio-4728 0Dn.. 2252us : __schedule (preempt_schedule)
evolutio-4728 0Dn.. 2254us : profile_hit (__schedule)
evolutio-4728 0Dn.1 2255us+: sched_clock (__schedule)
<...>-3 0D..2 2259us+: __switch_to (__schedule)
<...>-3 0D..2 2262us : __schedule <evolutio-4728> (74 62)
<...>-3 0...1 2263us : trace_stop_sched_switched (__schedule)
<...>-3 0D..2 2264us+: trace_stop_sched_switched <<...>-3> (62 0)
<...>-3 0D..2 2266us : trace_stop_sched_switched (__schedule)

The problem is that we now do a lot more work in free_pgtables under the
mm->page_table_lock spinlock so preemption can be delayed for a long
time. Here is the change responsible:

--- ../linux-2.6.14-rt22/mm/memory.c 2005-12-05 14:47:02.000000000 -0500
+++ ../linux-2.6.15-rc5-rt2/mm/memory.c 2005-12-22 16:35:26.000000000 -0500
@@ -260,6 +261,12 @@
struct vm_area_struct *next = vma->vm_next;
unsigned long addr = vma->vm_start;

+ /*
+ * Hide vma from rmap and vmtruncate before freeing pgtables
+ */
+ anon_vma_unlink(vma);
+ unlink_file_vma(vma);
+
if (is_hugepage_only_range(vma->vm_mm, addr, HPAGE_SIZE)) {
hugetlb_free_pgd_range(tlb, addr, vma->vm_end,
floor, next? next->vm_start: ceiling);
@@ -272,6 +279,8 @@
HPAGE_SIZE)) {
vma = next;
next = vma->vm_next;
+ anon_vma_unlink(vma);
+ unlink_file_vma(vma);
}
free_pgd_range(tlb, addr, vma->vm_end,
floor, next? next->vm_start: ceiling);

What was the purpose of this change?

Lee




2005-12-28 02:41:33

by Lee Revell

Subject: Re: 2.6.15-rc5: latency regression vs 2.6.14 in exit_mmap->free_pgtables

On Tue, 2005-12-27 at 18:31 -0500, Lee Revell wrote:
> The problem is that we now do a lot more work in free_pgtables under the
> mm->page_table_lock spinlock so preemption can be delayed for a long
> time. Here is the change responsible:

Hugh,

I have found the commit that introduced the regression:

[PATCH 16/21] mm: unlink vma before pagetables

http://lkml.org/lkml/2005/10/12/227

Lee

2005-12-28 22:58:47

by Hugh Dickins

Subject: Re: 2.6.15-rc5: latency regression vs 2.6.14 in exit_mmap->free_pgtables

On Tue, 27 Dec 2005, Lee Revell wrote:

> I am seeing excessive latencies (2ms+) in free_pgtables, called from
> exit_mmap with 2.6.15-rc5. This is a significant regression from 2.6.14
> where the maximum observed latency was less than 1ms with some tuning.
> Here is the trace.
>
> The problem is that we now do a lot more work in free_pgtables under the
> mm->page_table_lock spinlock so preemption can be delayed for a long
> time. Here is the change responsible:

Yes, I'm to blame for that - sorry. It didn't occur to me that I was
moving any significant amount of work (on mms with many vmas) into the
section with preemption disabled. Actually, the mm->page_table_lock
is _not_ held there any more; but preemption is still disabled while
using the per-cpu mmu_gathers.
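
For reference, the reason preemption stays off there is visible in the
generic mmu_gather helpers. Roughly (a simplified sketch of the
asm-generic/tlb.h code, details trimmed - not the exact source):

struct mmu_gather *
tlb_gather_mmu(struct mm_struct *mm, unsigned int full_mm_flush)
{
        /* get_cpu_var() disables preemption and hands out this CPU's gather */
        struct mmu_gather *tlb = &get_cpu_var(mmu_gathers);

        tlb->mm = mm;
        tlb->fullmm = full_mm_flush;
        /* ... page batching and TLB-flush bookkeeping trimmed ... */
        return tlb;
}

void
tlb_finish_mmu(struct mmu_gather *tlb, unsigned long start, unsigned long end)
{
        /* flush the TLB and free the batched pages, then... */
        put_cpu_var(mmu_gathers);       /* ...re-enable preemption */
}

So everything between those two calls in exit_mmap - now including the
anon_vma_unlink/unlink_file_vma pass in free_pgtables - runs with
preemption disabled.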

I wish you'd found it at -rc1 time. It's not something that can
be properly corrected in a hurry. The proper fix is to rework the
tlb_gather_mmu stuff, so it can be done without preemption disabled.
It's already a serious limitation in unmap_vmas, with CONFIG_PREEMPT's
ZAP_BLOCK_SIZE spoiling throughput with far too many TLB flushes.

On my list to work on; but the TLB always needs great care, and this
goes down into architectural divergences, with truncation of a mapped
file adding further awkward constraints. I imagine 2.6.16-rc1 is only
a couple of weeks away, so it's unlikely to be fixed in 2.6.16 either.

> What was the purpose of this change?

To narrow the scope of the page_table_lock, so that it could be taken
later and released earlier, for slightly better preemptibility, and
to allow more scalable locking by splitting it up per page-table page.
And a step towards the mmu_gather rework I refer to above, to recover
from 2.6.11's unmap_vmas slowdown.

Here's an untested patch which should mostly correct your latency
problem with 2.6.15-rc. But it's certainly not the right solution,
and it's probably both too ugly and too late for 2.6.15. If you
really want Linus to put it in, please test it out, especially on
ia64, and try to persuade him. Personally I'd prefer to wait for
the right solution: but I don't have your low-latency needs, and
I'm certainly guilty of a regression here.


2.6.15-rc1 moved the unlinking of a vma from its prio_tree and anon_vma
into free_pgtables: so the vma is hidden from rmap and vmtruncate before
freeing its page tables, allowing safe descent without page table lock.
But free_pgtables is still called with preemption disabled, and Lee
Revell has now detected high latency there.

The right fix will be to rework the mmu_gathering so that it does not
need preemption disabled; but for now, an ugly CONFIG_PREEMPT block in
free_pgtables, to make an initial unlinking pass with preemption
enabled - made uglier by CONFIG_IA64 definitions (only ia64 actually
uses the start and end given to tlb_finish_mmu, and our floor and
ceiling don't quite work for those). These CONFIG choices are there to
minimize the additional TLB flushing.

Signed-off-by: Hugh Dickins <[email protected]>
---

mm/memory.c | 32 ++++++++++++++++++++++++++++++++
1 files changed, 32 insertions(+)

--- 2.6.15-rc7/mm/memory.c 2005-12-25 11:19:27.000000000 +0000
+++ linux/mm/memory.c 2005-12-28 21:49:35.000000000 +0000
@@ -254,18 +254,48 @@ void free_pgd_range(struct mmu_gather **
flush_tlb_pgtables((*tlb)->mm, start, end);
}

+#ifdef CONFIG_IA64
+#define tlb_start_addr(tlb) (tlb)->start_addr
+#define tlb_end_addr(tlb) (tlb)->end_addr
+#else
+#define tlb_start_addr(tlb) 0UL /* only ia64 really uses it */
+#define tlb_end_addr(tlb) 0UL /* only ia64 really uses it */
+#endif
+
void free_pgtables(struct mmu_gather **tlb, struct vm_area_struct *vma,
unsigned long floor, unsigned long ceiling)
{
+#ifdef CONFIG_PREEMPT
+ struct vm_area_struct *unlink = vma;
+ int fullmm = (*tlb)->fullmm;
+
+ if (!vma) /* Sometimes when exiting after an oops */
+ return;
+ if (vma->vm_next)
+ tlb_finish_mmu(*tlb, tlb_start_addr(*tlb), tlb_end_addr(*tlb));
+ /*
+ * Hide vma from rmap and vmtruncate before freeing pgtables,
+ * with preemption enabled, except when unmapping just one area.
+ */
+ while (unlink) {
+ anon_vma_unlink(unlink);
+ unlink_file_vma(unlink);
+ unlink = unlink->vm_next;
+ }
+ if (vma->vm_next)
+ *tlb = tlb_gather_mmu(vma->vm_mm, fullmm);
+#endif
while (vma) {
struct vm_area_struct *next = vma->vm_next;
unsigned long addr = vma->vm_start;

+#ifndef CONFIG_PREEMPT
/*
* Hide vma from rmap and vmtruncate before freeing pgtables
*/
anon_vma_unlink(vma);
unlink_file_vma(vma);
+#endif

if (is_hugepage_only_range(vma->vm_mm, addr, HPAGE_SIZE)) {
hugetlb_free_pgd_range(tlb, addr, vma->vm_end,
@@ -279,8 +309,10 @@ void free_pgtables(struct mmu_gather **t
HPAGE_SIZE)) {
vma = next;
next = vma->vm_next;
+#ifndef CONFIG_PREEMPT
anon_vma_unlink(vma);
unlink_file_vma(vma);
+#endif
}
free_pgd_range(tlb, addr, vma->vm_end,
floor, next? next->vm_start: ceiling);

2005-12-28 23:54:42

by Lee Revell

Subject: Re: 2.6.15-rc5: latency regression vs 2.6.14 in exit_mmap->free_pgtables

On Wed, 2005-12-28 at 22:59 +0000, Hugh Dickins wrote:
> I wish you'd found it at -rc1 time.

Unfortunately this is the kind of thing that could only have been found
with Ingo's latency tracer, and the -rt patch set was not rebased to
2.6.15 until -rc5.

I'm not sure how much work it would be to break out CONFIG_LATENCY_TRACE
as a separate patch.

Thanks for the patch, I'll try it.

Lee



2005-12-29 00:48:37

by Lee Revell

Subject: Re: 2.6.15-rc5: latency regression vs 2.6.14 in exit_mmap->free_pgtables

On Wed, 2005-12-28 at 22:59 +0000, Hugh Dickins wrote:
> Here's an untested patch which should mostly correct your latency
> problem with 2.6.15-rc. But it's certainly not the right solution,
> and it's probably both too ugly and too late for 2.6.15. If you
> really want Linus to put it in, please test it out, especially on
> ia64, and try to persuade him. Personally I'd prefer to wait for
> the right solution: but I don't have your low-latency needs, and
> I'm certainly guilty of a regression here.

OK, FWIW the patch does work.

It occurred to me that this might only be a noticeable latency
regression on slower machines. Although it's too late for 2.6.15, it
would be nice to have this fixed for 2.6.16...

Lee

2005-12-29 08:22:36

by Ingo Molnar

Subject: [patch] latency tracer, 2.6.15-rc7


* Lee Revell <[email protected]> wrote:

> I'm not sure how much work it would be to break out
> CONFIG_LATENCY_TRACE as a separate patch.

could you check out the 2.6.15-rc7 version of the latency tracer i
uploaded to:

http://redhat.com/~mingo/latency-tracing-patches/

You could test it by e.g. trying to reproduce the same VM latency as in
the -rt tree. [The two zlib patches are needed if you are using 4K
stacks; mcount increases stack footprint.]

Ingo

2005-12-29 10:03:04

by Dave Jones

Subject: Re: [patch] latency tracer, 2.6.15-rc7

On Thu, Dec 29, 2005 at 09:22:17AM +0100, Ingo Molnar wrote:
>
> * Lee Revell <[email protected]> wrote:
>
> > I'm not sure how much work it would be to break out
> > CONFIG_LATENCY_TRACE as a separate patch.
>
> could you check out the 2.6.15-rc7 version of the latency tracer i
> uploaded to:
>
> http://redhat.com/~mingo/latency-tracing-patches/
>
> could test it by e.g. trying to reproduce the same VM latency as in the
> -rt tree. [the two zlib patches are needed if you are using 4K stacks,
> mcount increases stack footprint.]

kernel/latency.c: In function 'add_preempt_count_ti':
kernel/latency.c:1703: warning: implicit declaration of function 'preempt_count_ti'
kernel/latency.c:1703: error: invalid lvalue in assignment
kernel/latency.c: In function 'sub_preempt_count_ti':
kernel/latency.c:1764: error: invalid lvalue in assignment

interesting config options ...

# CONFIG_PREEMPT_NONE is not set
CONFIG_PREEMPT_VOLUNTARY=y
# CONFIG_PREEMPT is not set
CONFIG_PREEMPT_BKL=y

CONFIG_WAKEUP_TIMING=y
CONFIG_WAKEUP_LATENCY_HIST=y
CONFIG_CRITICAL_IRQSOFF_TIMING=y
CONFIG_INTERRUPT_OFF_HIST=y
CONFIG_LATENCY_TRACE=y
CONFIG_USE_FRAME_POINTER=y
CONFIG_FRAME_POINTER=y

Dave

2005-12-29 10:17:59

by Ingo Molnar

Subject: Re: [patch] latency tracer, 2.6.15-rc7


* Dave Jones <[email protected]> wrote:

> > could test it by e.g. trying to reproduce the same VM latency as in the
> > -rt tree. [the two zlib patches are needed if you are using 4K stacks,
> > mcount increases stack footprint.]
>
> kernel/latency.c: In function 'add_preempt_count_ti':
> kernel/latency.c:1703: warning: implicit declaration of function 'preempt_count_ti'
> kernel/latency.c:1703: error: invalid lvalue in assignment
> kernel/latency.c: In function 'sub_preempt_count_ti':
> kernel/latency.c:1764: error: invalid lvalue in assignment

indeed - i have fixed this and have uploaded a new version to:

http://redhat.com/~mingo/latency-tracing-patches/

> interesting config options ...
>
> # CONFIG_PREEMPT_NONE is not set
> CONFIG_PREEMPT_VOLUNTARY=y
> # CONFIG_PREEMPT is not set
> CONFIG_PREEMPT_BKL=y
>
> CONFIG_WAKEUP_TIMING=y
> CONFIG_WAKEUP_LATENCY_HIST=y
> CONFIG_CRITICAL_IRQSOFF_TIMING=y
> CONFIG_INTERRUPT_OFF_HIST=y
> CONFIG_LATENCY_TRACE=y
> CONFIG_USE_FRAME_POINTER=y
> CONFIG_FRAME_POINTER=y

these are various things one might be interested in gathering on a live
system. Enabling more of them means higher runtime overhead.
WAKEUP_TIMING only measures the worst-case wakeup cost; it's the
lowest-overhead option. It's activated by resetting the worst-case
cost:

echo 0 > /proc/sys/kernel/preempt_max_latency

The histogram options gather a histogram (but no traces) into
/proc/latency_hist/<cpu_nr>. These have some overhead, as they hook into
every preempt-disable/enable call. Output is:

$ head -15 /proc/latency_hist/wakeup_latency/CPU0
#Minimum latency: 2 microseconds.
#Average latency: 7 microseconds.
#Maximum latency: 511 microseconds.
#Total samples: 5241
#There are 0 samples greater or equal than 10240 microseconds
#usecs samples
0 0
1 0
2 7
3 2041
4 921
5 194
6 62
7 502
8 119
...

The LATENCY_TRACE option does full tracing of critical sections
[driven by e.g. WAKEUP_TIMING - but it can also be activated by
interrupts, or be completely user-driven via userspace calls]; the
resulting trace is then put into /proc/latency_trace - mcount is done
for every function call in the kernel. This can add up to 30% runtime
overhead [or worse], but is obviously very useful for debugging
latencies.
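
(For the curious: the mcount hook is conceptually something like the
sketch below. This is a hand-waved C view - the real entry point is a
small assembly stub, and tracing_enabled/record_trace_entry are
stand-in names.)

void notrace mcount(void)
{
        if (!tracing_enabled)
                return;
        /*
         * Building with -pg makes gcc call mcount() from every function
         * prologue, so return address 0 identifies the traced function
         * and return address 1 its caller - that pair plus a timestamp
         * becomes one line of /proc/latency_trace.
         */
        record_trace_entry(get_cycles(),
                           __builtin_return_address(0),
                           __builtin_return_address(1));
}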

Ingo

2005-12-29 20:05:23

by Lee Revell

Subject: Re: [patch] latency tracer, 2.6.15-rc7

On Thu, 2005-12-29 at 11:17 +0100, Ingo Molnar wrote:
> * Dave Jones <[email protected]> wrote:
>
> > > could test it by e.g. trying to reproduce the same VM latency as in the
> > > -rt tree. [the two zlib patches are needed if you are using 4K stacks,
> > > mcount increases stack footprint.]
> >
> > kernel/latency.c: In function 'add_preempt_count_ti':
> > kernel/latency.c:1703: warning: implicit declaration of function 'preempt_count_ti'
> > kernel/latency.c:1703: error: invalid lvalue in assignment
> > kernel/latency.c: In function 'sub_preempt_count_ti':
> > kernel/latency.c:1764: error: invalid lvalue in assignment
>
> indeed - i have fixed this and have uploaded a new version to:
>
> http://redhat.com/~mingo/latency-tracing-patches/

Still does not quite work for me on i386. I applied all the patches as
I'm using 4K stacks.

LD .tmp_vmlinux1
init/built-in.o: In function `start_kernel':
: undefined reference to `preempt_max_latency'
make: *** [.tmp_vmlinux1] Error 1

.config excerpts:

# CONFIG_SMP is not set
# CONFIG_PREEMPT_NONE is not set
# CONFIG_PREEMPT_VOLUNTARY is not set
CONFIG_PREEMPT=y
CONFIG_PREEMPT_BKL=y

#
# Kernel hacking
#
# CONFIG_PRINTK_TIME is not set
CONFIG_DEBUG_KERNEL=y
# CONFIG_MAGIC_SYSRQ is not set
CONFIG_LOG_BUF_SHIFT=14
# CONFIG_DETECT_SOFTLOCKUP is not set
# CONFIG_SCHEDSTATS is not set
# CONFIG_DEBUG_SLAB is not set
CONFIG_DEBUG_PREEMPT=y
# CONFIG_DEBUG_SPINLOCK is not set
# CONFIG_DEBUG_SPINLOCK_SLEEP is not set
CONFIG_WAKEUP_TIMING=y
# CONFIG_WAKEUP_LATENCY_HIST is not set
CONFIG_PREEMPT_TRACE=y
CONFIG_CRITICAL_PREEMPT_TIMING=y
# CONFIG_PREEMPT_OFF_HIST is not set
# CONFIG_CRITICAL_IRQSOFF_TIMING is not set
CONFIG_CRITICAL_TIMING=y
CONFIG_LATENCY_TIMING=y
CONFIG_LATENCY_TRACE=y
CONFIG_MCOUNT=y
# CONFIG_DEBUG_KOBJECT is not set
CONFIG_DEBUG_BUGVERBOSE=y
# CONFIG_DEBUG_INFO is not set
# CONFIG_DEBUG_FS is not set
# CONFIG_DEBUG_VM is not set
CONFIG_FRAME_POINTER=y
# CONFIG_RCU_TORTURE_TEST is not set
CONFIG_EARLY_PRINTK=y
# CONFIG_DEBUG_STACKOVERFLOW is not set
# CONFIG_DEBUG_STACK_USAGE is not set
# CONFIG_DEBUG_PAGEALLOC is not set
CONFIG_4KSTACKS=y
CONFIG_X86_FIND_SMP_CONFIG=y
CONFIG_X86_MPPARSE=y

Lee

2005-12-29 20:21:04

by Lee Revell

Subject: Re: [patch] latency tracer, 2.6.15-rc7

On Thu, 2005-12-29 at 15:11 -0500, Lee Revell wrote:
> On Thu, 2005-12-29 at 11:17 +0100, Ingo Molnar wrote:
> > * Dave Jones <[email protected]> wrote:
> >
> > > > could test it by e.g. trying to reproduce the same VM latency as in the
> > > > -rt tree. [the two zlib patches are needed if you are using 4K stacks,
> > > > mcount increases stack footprint.]
> > >
> > > kernel/latency.c: In function 'add_preempt_count_ti':
> > > kernel/latency.c:1703: warning: implicit declaration of function 'preempt_count_ti'
> > > kernel/latency.c:1703: error: invalid lvalue in assignment
> > > kernel/latency.c: In function 'sub_preempt_count_ti':
> > > kernel/latency.c:1764: error: invalid lvalue in assignment
> >
> > indeed - i have fixed this and have uploaded a new version to:
> >
> > http://redhat.com/~mingo/latency-tracing-patches/
>
> Still does not quite work for me on i386. I applied all the patches as
> I'm using 4K stacks.
>
> LD .tmp_vmlinux1
> init/built-in.o: In function `start_kernel':
> : undefined reference to `preempt_max_latency'
> make: *** [.tmp_vmlinux1] Error 1
>

This patch fixes the problem.

Lee

--- linux-2.6.15-rc5-rt2/kernel/latency.c~ 2005-12-29 14:04:26.000000000 -0500
+++ linux-2.6.15-rc5-rt2/kernel/latency.c 2005-12-29 15:23:32.000000000 -0500
@@ -71,9 +71,9 @@
* we clear it after bootup.
*/
#ifdef CONFIG_LATENCY_HIST
-static cycles_t preempt_max_latency = (cycles_t)0UL;
+cycles_t preempt_max_latency = (cycles_t)0UL;
#else
-static cycles_t preempt_max_latency = (cycles_t)ULONG_MAX;
+cycles_t preempt_max_latency = (cycles_t)ULONG_MAX;
#endif

static cycles_t preempt_thresh;


2005-12-29 20:29:12

by Ingo Molnar

Subject: Re: [patch] latency tracer, 2.6.15-rc7


* Lee Revell <[email protected]> wrote:

> > Still does not quite work for me on i386. I applied all the patches as
> > I'm using 4K stacks.

oops!

> > LD .tmp_vmlinux1
> > init/built-in.o: In function `start_kernel':
> > : undefined reference to `preempt_max_latency'
> > make: *** [.tmp_vmlinux1] Error 1
> >
>
> This patch fixes the problem.

thanks, applied - new version uploaded.

Ingo

2005-12-29 22:12:15

by Lee Revell

Subject: Re: [patch] latency tracer, 2.6.15-rc7

On Thu, 2005-12-29 at 21:28 +0100, Ingo Molnar wrote:
> * Lee Revell <[email protected]> wrote:
>
> > > Still does not quite work for me on i386. I applied all the patches as
> > > I'm using 4K stacks.
>
> oops!
>
> > > LD .tmp_vmlinux1
> > > init/built-in.o: In function `start_kernel':
> > > : undefined reference to `preempt_max_latency'
> > > make: *** [.tmp_vmlinux1] Error 1
> > >
> >
> > This patch fixes the problem.
>
> thanks, applied - new version uploaded.

OK, this works perfectly. It would be great to see this in "Kernel
Hacking" for 2.6.16...

Lee

2005-12-30 00:08:09

by Grant Coady

Subject: Re: [patch] latency tracer, 2.6.15-rc7

On Thu, 29 Dec 2005 21:28:48 +0100, Ingo Molnar <[email protected]> wrote:

>
>thanks, applied - new version uploaded.

I just booted with latency tracer, it died with (copy by hand):
{ 40} [<c012e74a>] debug_stackoverflow+0x6a/0xc0

Much unusual stuff (several screenfuls) scrolled up prior to lockup.

Grant.

2005-12-30 00:18:41

by Lee Revell

Subject: Re: [patch] latency tracer, 2.6.15-rc7

On Fri, 2005-12-30 at 11:08 +1100, Grant Coady wrote:
> On Thu, 29 Dec 2005 21:28:48 +0100, Ingo Molnar <[email protected]> wrote:
>
> >
> >thanks, applied - new version uploaded.
>
> I just booted with latency tracer, it died with (copy by hand):
> { 40} [<c012e74a>] debug_stackoverflow+0x6a/0xc0
>
> Much unusual stuff (several screenfuls) scrolled up prior to lockup.

Can you verify that it works with CONFIG_DEBUG_STACKOVERFLOW disabled?

Lee

2005-12-30 00:42:12

by Grant Coady

Subject: Re: [patch] latency tracer, 2.6.15-rc7

On Thu, 29 Dec 2005 19:18:38 -0500, Lee Revell <[email protected]> wrote:

>On Fri, 2005-12-30 at 11:08 +1100, Grant Coady wrote:
>> On Thu, 29 Dec 2005 21:28:48 +0100, Ingo Molnar <[email protected]> wrote:
>>
>> >
>> >thanks, applied - new version uploaded.
>>
>> I just booted with latency tracer, it died with (copy by hand):
>> { 40} [<c012e74a>] debug_stackoverflow+0x6a/0xc0
>>
>> Much unusual stuff (several screenfuls) scrolled up prior to lockup.
>
>Can you verify that it works with CONFIG_DEBUG_STACKOVERFLOW disabled?

Okay! That sorted it :o)

Grant.

2005-12-30 02:16:22

by Lee Revell

Subject: Re: [patch] latency tracer, 2.6.15-rc7

On Thu, 2005-12-29 at 21:28 +0100, Ingo Molnar wrote:
> * Lee Revell <[email protected]> wrote:
>
> > > Still does not quite work for me on i386. I applied all the patches as
> > > I'm using 4K stacks.
>
> oops!
>
> > > LD .tmp_vmlinux1
> > > init/built-in.o: In function `start_kernel':
> > > : undefined reference to `preempt_max_latency'
> > > make: *** [.tmp_vmlinux1] Error 1
> > >
> >
> > This patch fixes the problem.
>
> thanks, applied - new version uploaded.

It seems that debug_smp_processor_id is being called a lot, even though
I have a UP config, which I didn't see with the -rt kernel:

$ grep debug_smp_processor_id /proc/latency_trace | head -20
evolutio-4568 0d.H1 3us : debug_smp_processor_id (do_IRQ)
evolutio-4568 0d.h. 25us : debug_smp_processor_id (netif_rx)
evolutio-4568 0d.h. 28us : debug_smp_processor_id (kmem_cache_alloc)
evolutio-4568 0d.h. 31us+: debug_smp_processor_id (__kmalloc)
evolutio-4568 0d.s. 46us : debug_smp_processor_id (__do_softirq)
evolutio-4568 0d.s. 47us : debug_smp_processor_id (__do_softirq)
evolutio-4568 0d.s1 54us+: debug_smp_processor_id (kmem_cache_alloc)
evolutio-4568 0d.s1 65us : debug_smp_processor_id (kmem_cache_free)
evolutio-4568 0d.s3 109us : debug_smp_processor_id (kmem_cache_alloc)
evolutio-4568 0d.s3 111us+: debug_smp_processor_id (__kmalloc)
evolutio-4568 0d.s3 140us : debug_smp_processor_id (kfree)
evolutio-4568 0d.s3 141us : debug_smp_processor_id (kmem_cache_free)
evolutio-4568 0d.s3 158us : debug_smp_processor_id (kfree)
evolutio-4568 0d.s3 160us : debug_smp_processor_id (kmem_cache_free)
evolutio-4568 0d.s3 192us : debug_smp_processor_id (kfree)
evolutio-4568 0d.s3 193us : debug_smp_processor_id (kmem_cache_free)
evolutio-4568 0d.s3 210us : debug_smp_processor_id (kfree)
evolutio-4568 0d.s3 211us : debug_smp_processor_id (kmem_cache_free)
evolutio-4568 0d.s3 216us : debug_smp_processor_id (kmem_cache_alloc)
evolutio-4568 0d.s3 217us+: debug_smp_processor_id (__kmalloc)

etc.

Was this optimized out on UP before?

Lee

2005-12-30 08:00:52

by Ingo Molnar

Subject: Re: [patch] latency tracer, 2.6.15-rc7


* Lee Revell <[email protected]> wrote:

> It seems that debug_smp_processor_id is being called a lot, even
> though I have a UP config, which I didn't see with the -rt kernel:

do you have CONFIG_DEBUG_PREEMPT enabled? (if yes then disable it)

> Was this optimized out on UP before?

no, because smp_processor_id() debugging is useful on UP too: it checks
whether smp_processor_id() is ever called with preemption enabled, and
reports such bugs.
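
the check itself is small. A simplified sketch (not the exact kernel
function - the real one also lets tasks that are pinned to a single
CPU pass, among other details):

unsigned int debug_smp_processor_id(void)
{
        int this_cpu = raw_smp_processor_id();

        if (likely(preempt_count()))
                return this_cpu;        /* preemption off: we cannot migrate */
        if (irqs_disabled())
                return this_cpu;        /* ditto */

        printk(KERN_ERR
               "BUG: using smp_processor_id() in preemptible code\n");
        dump_stack();
        return this_cpu;
}

every debug_smp_processor_id entry in your trace is just this wrapper
running in front of the real smp_processor_id().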

Ingo

2005-12-30 08:09:35

by Ingo Molnar

Subject: Re: [patch] latency tracer, 2.6.15-rc7


* Grant Coady <[email protected]> wrote:

> On Thu, 29 Dec 2005 21:28:48 +0100, Ingo Molnar <[email protected]> wrote:
>
> >
> >thanks, applied - new version uploaded.
>
> I just booted with latency tracer, it died with (copy by hand):
> { 40} [<c012e74a>] debug_stackoverflow+0x6a/0xc0
>
> Much unusual stuff (several screenfuls) scrolled up prior to lockup.

have you applied the zlib patches too? In particular this one should
make a difference:

http://redhat.com/~mingo/latency-tracing-patches/patches/reduce-zlib-stack-hack.patch

If you didn't have this applied, could you apply it and retry with
stack-footprint-debugging again?

Ingo

2005-12-30 16:52:31

by Linus Torvalds

Subject: Re: [patch] latency tracer, 2.6.15-rc7



On Fri, 30 Dec 2005, Ingo Molnar wrote:
>
> have you applied the zlib patches too? In particular this one should
> make a difference:
>
> http://redhat.com/~mingo/latency-tracing-patches/patches/reduce-zlib-stack-hack.patch
>
> If you didnt have this applied, could you apply it and retry with
> stack-footprint-debugging again?

Ingo, instead of having a static work area and using locking, why not just
move those fields into the "z_stream" structure, and thus make them
per-stream? The z_stream structure should already be allocated with
kmalloc() or similar by the caller, so that also gets it off the stack
without the locking thing.

Hmm?

Linus

2005-12-30 18:51:04

by Lee Revell

Subject: Re: [patch] latency tracer, 2.6.15-rc7

On Fri, 2005-12-30 at 09:00 +0100, Ingo Molnar wrote:
> * Lee Revell <[email protected]> wrote:
>
> > It seems that debug_smp_processor_id is being called a lot, even
> > though I have a UP config, which I didn't see with the -rt kernel:
>
> do you have CONFIG_DEBUG_PREEMPT enabled? (if yes then disable it)

Ah, thanks, I had assumed that the latency tracing depended on this.

Lee

2005-12-31 00:20:28

by Grant Coady

Subject: Re: [patch] latency tracer, 2.6.15-rc7

On Fri, 30 Dec 2005 09:09:14 +0100, Ingo Molnar <[email protected]> wrote:

>
>* Grant Coady <[email protected]> wrote:
>
>> On Thu, 29 Dec 2005 21:28:48 +0100, Ingo Molnar <[email protected]> wrote:
>>
>> >
>> >thanks, applied - new version uploaded.
>>
>> I just booted with latency tracer, it died with (copy by hand):
>> { 40} [<c012e74a>] debug_stackoverflow+0x6a/0xc0
>>
>> Much unusual stuff (several screenfuls) scrolled up prior to lockup.
>
>have you applied the zlib patches too? In particular this one should
>make a difference:
>
> http://redhat.com/~mingo/latency-tracing-patches/patches/reduce-zlib-stack-hack.patch

Yes, that was applied (well, trying to apply it asks me to reverse it :)

>If you didnt have this applied, could you apply it and retry with
>stack-footprint-debugging again?

Silly question: where is stack-footprint-debugging?

Cheers,
Grant.

2005-12-31 00:51:14

by Lee Revell

Subject: Re: [patch] latency tracer, 2.6.15-rc7

On Fri, 2005-12-30 at 09:00 +0100, Ingo Molnar wrote:
> * Lee Revell <[email protected]> wrote:
>
> > It seems that debug_smp_processor_id is being called a lot, even
> > though I have a UP config, which I didn't see with the -rt kernel:
>
> do you have CONFIG_DEBUG_PREEMPT enabled? (if yes then disable it)
>
> > Was this optimized out on UP before?
>
> no, because smp_processor_id() debugging is useful on UP too: it checks
> whether smp_processor_id() is every called with preemption enabled, and
> reports such bugs.

It seems that the networking code's use of RCU can cause 10ms+
latencies:

preemption latency trace v1.1.5 on 2.6.15-rc7
--------------------------------------------------------------------
latency: 10437 us, #12080/12080, CPU#0 | (M:rt VP:0, KP:0, SP:0 HP:0)
-----------------
| task: gam_server-21330 (uid:1000 nice:0 policy:0 rt_prio:0)
-----------------

_------=> CPU#
/ _-----=> irqs-off
| / _----=> need-resched
|| / _---=> hardirq/softirq
||| / _--=> preempt-depth
|||| /
||||| delay
cmd pid ||||| time | caller
\ / ||||| \ | /
<idle>-0 0d.s2 1us : __trace_start_sched_wakeup (try_to_wake_up)
<idle>-0 0d.s2 1us : __trace_start_sched_wakeup <<...>-21330> (73 0)
<idle>-0 0d.s. 2us : wake_up_process (process_timeout)
<idle>-0 0d.s. 3us : tasklet_action (__do_softirq)
<idle>-0 0d.s. 4us : rcu_process_callbacks (tasklet_action)
<idle>-0 0d.s. 5us : __rcu_process_callbacks (rcu_process_callbacks)
<idle>-0 0d.s. 5us : rcu_check_quiescent_state (__rcu_process_callbacks)
<idle>-0 0d.s. 6us : __rcu_process_callbacks (rcu_process_callbacks)
<idle>-0 0d.s. 7us : rcu_check_quiescent_state (__rcu_process_callbacks)
<idle>-0 0d.s. 7us : rcu_do_batch (__rcu_process_callbacks)
<idle>-0 0d.s. 8us : dst_rcu_free (rcu_do_batch)
<idle>-0 0d.s. 9us : dst_destroy (dst_rcu_free)
<idle>-0 0d.s. 11us : ipv4_dst_destroy (dst_destroy)
<idle>-0 0d.s. 12us : kmem_cache_free (dst_destroy)
<idle>-0 0d.s. 13us+: debug_smp_processor_id (kmem_cache_free)
<idle>-0 0d.s. 15us : dst_rcu_free (rcu_do_batch)
<idle>-0 0d.s. 16us : dst_destroy (dst_rcu_free)
<idle>-0 0d.s. 16us : ipv4_dst_destroy (dst_destroy)
<idle>-0 0d.s. 18us : kmem_cache_free (dst_destroy)
<idle>-0 0d.s. 18us : debug_smp_processor_id (kmem_cache_free)
<idle>-0 0d.s. 19us : dst_rcu_free (rcu_do_batch)
<idle>-0 0d.s. 20us : dst_destroy (dst_rcu_free)

[last 5 lines repeat ~2000 times]

<idle>-0 0d.s. 10368us : ipv4_dst_destroy (dst_destroy)
<idle>-0 0d.s. 10369us : kmem_cache_free (dst_destroy)
<idle>-0 0d.s. 10370us : debug_smp_processor_id (kmem_cache_free)
<idle>-0 0d.s. 10371us : dst_rcu_free (rcu_do_batch)
<idle>-0 0d.s. 10371us : dst_destroy (dst_rcu_free)
<idle>-0 0d.s. 10372us : ipv4_dst_destroy (dst_destroy)
<idle>-0 0d.s. 10373us : kmem_cache_free (dst_destroy)
<idle>-0 0d.s. 10373us : debug_smp_processor_id (kmem_cache_free)
<idle>-0 0d.s. 10375us : debug_smp_processor_id (__do_softirq)
<idle>-0 0d.s. 10375us : debug_smp_processor_id (__do_softirq)
<idle>-0 0d.s. 10376us+: run_timer_softirq (__do_softirq)
<idle>-0 0d.s. 10378us : rh_timer_func (run_timer_softirq)
<idle>-0 0d.s. 10379us : usb_hcd_poll_rh_status (rh_timer_func)
<idle>-0 0d.s. 10380us : uhci_hub_status_data (usb_hcd_poll_rh_status)
<idle>-0 0d.s1 10381us : uhci_scan_schedule (uhci_hub_status_data)
<idle>-0 0d.s1 10382us : uhci_get_current_frame_number (uhci_scan_schedule)
<idle>-0 0d.s1 10384us : uhci_free_pending_qhs (uhci_scan_schedule)
<idle>-0 0d.s1 10384us : uhci_free_pending_tds (uhci_scan_schedule)
<idle>-0 0d.s1 10385us : uhci_remove_pending_urbps (uhci_scan_schedule)
<idle>-0 0d.s1 10387us : uhci_transfer_result (uhci_scan_schedule)
<idle>-0 0d.s2 10388us : uhci_result_common (uhci_transfer_result)
<idle>-0 0d.s1 10390us : uhci_finish_completion (uhci_scan_schedule)
<idle>-0 0d.s1 10391us : __wake_up (uhci_scan_schedule)
<idle>-0 0d.s2 10391us : __wake_up_common (__wake_up)
<idle>-0 0d.s1 10392us : check_fsbr (uhci_hub_status_data)
<idle>-0 0d.s1 10394us+: uhci_check_ports (uhci_hub_status_data)
<idle>-0 0d.s1 10398us : any_ports_active (uhci_hub_status_data)
<idle>-0 0d.s. 10399us : mod_timer (usb_hcd_poll_rh_status)
<idle>-0 0d.s. 10400us : __mod_timer (mod_timer)
<idle>-0 0d.s. 10401us : lock_timer_base (__mod_timer)
<idle>-0 0d.s1 10402us : internal_add_timer (__mod_timer)
<idle>-0 0d.s. 10403us : i8042_timer_func (run_timer_softirq)
<idle>-0 0d.s. 10404us : i8042_interrupt (i8042_timer_func)
<idle>-0 0d.s. 10405us : mod_timer (i8042_interrupt)
<idle>-0 0d.s. 10405us : __mod_timer (mod_timer)
<idle>-0 0d.s. 10406us : lock_timer_base (__mod_timer)
<idle>-0 0d.s1 10407us+: internal_add_timer (__mod_timer)
<idle>-0 0d.s. 10411us : tasklet_action (__do_softirq)
<idle>-0 0d.s. 10412us : rcu_process_callbacks (tasklet_action)
<idle>-0 0d.s. 10412us : __rcu_process_callbacks (rcu_process_callbacks)
<idle>-0 0d.s. 10413us : rcu_check_quiescent_state (__rcu_process_callbacks)
<idle>-0 0d.s. 10414us : __rcu_process_callbacks (rcu_process_callbacks)
<idle>-0 0d.s. 10414us : rcu_check_quiescent_state (__rcu_process_callbacks)
<idle>-0 0d.s. 10415us+: debug_smp_processor_id (__do_softirq)
<idle>-0 0dn.1 10417us < (2097760)
<idle>-0 0dn.. 10418us : schedule (cpu_idle)
<idle>-0 0dn.. 10419us : stop_trace (schedule)
<idle>-0 0dn.. 10420us : profile_hit (schedule)
<idle>-0 0dn.1 10421us : sched_clock (schedule)
<idle>-0 0dn.2 10422us : debug_smp_processor_id (schedule)
<idle>-0 0dn.2 10423us : recalc_task_prio (schedule)
<idle>-0 0dn.2 10424us : effective_prio (recalc_task_prio)
<idle>-0 0dn.2 10425us : requeue_task (schedule)
<idle>-0 0d..2 10426us : debug_smp_processor_id (schedule)
<...>-21330 0d..2 10430us : __switch_to (schedule)
<...>-21330 0d..2 10431us+: debug_smp_processor_id (__switch_to)
<...>-21330 0d..2 10433us : schedule <<idle>-0> (8c 73)
<...>-21330 0d..1 10434us : trace_stop_sched_switched (schedule)
<...>-21330 0d..2 10435us+: trace_stop_sched_switched <<...>-21330> (73 0)
<...>-21330 0d..2 10437us : trace_stop_sched_switched (schedule)

2005-12-31 00:59:08

by Mark Knecht

Subject: Re: [patch] latency tracer, 2.6.15-rc7

On 12/30/05, Lee Revell <[email protected]> wrote:
> On Fri, 2005-12-30 at 09:00 +0100, Ingo Molnar wrote:
> > * Lee Revell <[email protected]> wrote:
> >
> > > It seems that debug_smp_processor_id is being called a lot, even
> > > though I have a UP config, which I didn't see with the -rt kernel:
> >
> > do you have CONFIG_DEBUG_PREEMPT enabled? (if yes then disable it)
> >
> > > Was this optimized out on UP before?
> >
> > no, because smp_processor_id() debugging is useful on UP too: it checks
> > whether smp_processor_id() is every called with preemption enabled, and
> > reports such bugs.
>
> It seems that the networking code's use of RCU can cause 10ms+
> latencies:
>


Hi,
I've noted for a while that on my AMD64 machine, which has xrun issues,
it has always seemed, at least anecdotally, that the network interface
was somehow involved. I wonder if this may turn out to be true?

Q: Did the latency trace stuff ever get fixed for AMD64? I'm running
2.6.15-rc7-rt1 as of today. I've had xruns all afternoon while working
inside of MythTV. (Start, stop, rewind, pause all over the place...)

15:09:47.132 Statistics reset.
15:09:47.303 Client activated.
15:09:47.304 Audio connection change.
15:09:47.307 Audio connection graph change.
15:09:48.147 Audio connection graph change.
15:09:48.327 Audio connection change.
15:09:50.487 Audio connection graph change.
15:34:05.107 XRUN callback (1).
**** alsa_pcm: xrun of at least 0.148 msecs
subgraph starting at qjackctl-7906 timed out (subgraph_wait_fd=17,
status = 0, state = Finished)
15:36:15.580 XRUN callback (2).
**** alsa_pcm: xrun of at least 0.707 msecs
subgraph starting at qjackctl-7906 timed out (subgraph_wait_fd=17,
status = 0, state = Finished)
15:38:07.232 XRUN callback (3).
**** alsa_pcm: xrun of at least 0.967 msecs
**** alsa_pcm: xrun of at least 2.921 msecs
15:38:08.143 XRUN callback (1 skipped).
15:39:36.616 XRUN callback (5).
**** alsa_pcm: xrun of at least 0.079 msecs
subgraph starting at qjackctl-7906 timed out (subgraph_wait_fd=17,
status = 0, state = Finished)
15:43:04.318 XRUN callback (6).
**** alsa_pcm: xrun of at least 1.065 msecs
**** alsa_pcm: xrun of at least 0.955 msecs
15:43:05.642 XRUN callback (1 skipped).
subgraph starting at qjackctl-7906 timed out (subgraph_wait_fd=17,
status = 0, state = Finished)
15:44:16.964 XRUN callback (8).
**** alsa_pcm: xrun of at least 1.311 msecs
**** alsa_pcm: xrun of at least 1.956 msecs
15:44:18.028 XRUN callback (1 skipped).
subgraph starting at qjackctl-7906 timed out (subgraph_wait_fd=17,
status = 0, state = Finished)
15:45:11.371 XRUN callback (10).
**** alsa_pcm: xrun of at least 1.409 msecs
subgraph starting at qjackctl-7906 timed out (subgraph_wait_fd=17,
status = 0, state = Finished)
15:49:22.014 XRUN callback (11).
**** alsa_pcm: xrun of at least 1.332 msecs
16:10:30.306 XRUN callback (12).
**** alsa_pcm: xrun of at least 0.368 msecs
subgraph starting at qjackctl-7906 timed out (subgraph_wait_fd=17,
status = 0, state = Finished)
16:30:05.355 XRUN callback (13).
**** alsa_pcm: xrun of at least 0.873 msecs
subgraph starting at qjackctl-7906 timed out (subgraph_wait_fd=17,
status = 0, state = Finished)
16:33:21.938 XRUN callback (14).
**** alsa_pcm: xrun of at least 1.647 msecs

Bad stuff...

- Mark

2005-12-31 01:03:11

by Linus Torvalds

Subject: Re: [patch] latency tracer, 2.6.15-rc7



On Fri, 30 Dec 2005, Lee Revell wrote:
>
> It seems that the networking code's use of RCU can cause 10ms+
> latencies:

Hmm. Is there a big jump at the 10ms mark? Do you have a 100Hz timer
source?

A latency jump at 10ms would tend to indicate a missed wakeup that
was "picked up" by the next timer tick.

Linus

2005-12-31 01:15:44

by Lee Revell

Subject: Re: [patch] latency tracer, 2.6.15-rc7

On Fri, 2005-12-30 at 17:02 -0800, Linus Torvalds wrote:
>
> On Fri, 30 Dec 2005, Lee Revell wrote:
> >
> > It seems that the networking code's use of RCU can cause 10ms+
> > latencies:
>
> Hmm. Is there a big jump at the 10ms mark? Do you have a 100Hz timer
> source?
>
> A latency jump at 10ms would tend to indicate a missed wakeup that
> was "picked up" by the next timer tick.

No, there are no large jumps; it really seems that this was the
network code causing RCU callbacks to drop ~2K routes at once.
Specifically, RCU invokes dst_rcu_free 2085 times in a single batch
(call_rcu_bh(&rt->u.dst.rcu_head, dst_rcu_free) is only called from
rt_free() and rt_drop()).
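
For reference, the queueing side is tiny (paraphrased from memory of
2.6.15's net/ipv4/route.c, so treat the exact shape as approximate):

static void rt_free(struct rtable *rt)
{
        call_rcu_bh(&rt->u.dst.rcu_head, dst_rcu_free);
}

static void rt_drop(struct rtable *rt)
{
        ip_rt_put(rt);
        call_rcu_bh(&rt->u.dst.rcu_head, dst_rcu_free);
}

So every dropped route becomes one queued callback, and once the grace
period ends the RCU tasklet runs the whole backlog (all ~2085 of them)
back to back in softirq context, which is the solid block of
dst_rcu_free/dst_destroy calls in the trace above.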

I have found that many of the paths in the network stack that can cause
high latencies can be tuned via sysctls (for example
net.ipv4.route.gc_thresh); this one may be the same.

Lee

2005-12-31 01:16:19

by Lee Revell

Subject: Re: [patch] latency tracer, 2.6.15-rc7

On Fri, 2005-12-30 at 16:59 -0800, Mark Knecht wrote:
> On 12/30/05, Lee Revell <[email protected]> wrote:
> > On Fri, 2005-12-30 at 09:00 +0100, Ingo Molnar wrote:
> > > * Lee Revell <[email protected]> wrote:
> > >
> > > > It seems that debug_smp_processor_id is being called a lot, even
> > > > though I have a UP config, which I didn't see with the -rt kernel:
> > >
> > > do you have CONFIG_DEBUG_PREEMPT enabled? (if yes then disable it)
> > >
> > > > Was this optimized out on UP before?
> > >
> > > no, because smp_processor_id() debugging is useful on UP too: it checks
> > > whether smp_processor_id() is every called with preemption enabled, and
> > > reports such bugs.
> >
> > It seems that the networking code's use of RCU can cause 10ms+
> > latencies:
> >
>
>
> Hi,
> I've noted for awhile that on my AMD64 machine that has xrun issues
> that at least annecdotally it has always seemed that the network
> interface was somehow involved. I wonder if this may turn out to be
> true?
>
> Q: Did the latency trace stuff ever get fixed for AMD64? I'm running
> 2.6.15-rc7-rt1 as of today. I've had xruns all afternoon while working
> inside of MythTV. (Start, stop, rewind, pause all over the place...)

Yes, it should have. What happens when you try it?

Lee

2005-12-31 01:30:42

by Lee Revell

Subject: Re: [patch] latency tracer, 2.6.15-rc7

On Fri, 2005-12-30 at 16:59 -0800, Mark Knecht wrote:
> I've noted for awhile that on my AMD64 machine that has xrun issues
> that at least annecdotally it has always seemed that the network
> interface was somehow involved. I wonder if this may turn out to be
> true?

Yes, it probably is. Since at least 2.6.14 almost all of the 1ms+
latencies left in the kernel are due to long-running network softirqs -
thanks to lots of work by Ingo and others, there are almost no long-held
spinlocks left. So I would certainly expect audio underruns to
correspond to heavy network activity.

I believe that softirqs usually run on the processor that they were
raised on, so if you have an SMP system you could test this by locking
all interrupt handling to one processor, running JACK on the other,
and seeing if your xruns decrease.

Lee

2005-12-31 01:40:25

by Linus Torvalds

Subject: Re: [patch] latency tracer, 2.6.15-rc7



On Fri, 30 Dec 2005, Lee Revell wrote:
>
> No there are no large jumps, it really seems that this was the network
> code causing an RCU callback to drop ~2K routes at once. Specifically
> RCU invokes dst_rcu_free 2085 times in a single batch
> (call_rcu_bh(&rt->u.dst.rcu_head, dst_rcu_free) is only called from
> rt_free() and rt_drop()).

Ok. This is likely something that was hidden by the RCU batch size thing,
but that in turn was effectively turned off because it caused
out-of-memory situations where a small batch size would cause the RCU
queues to grow without bounds (noticed when we started freeing the
dentries from RCU).

We fixed that for "regular" RCU callbacks by noticing when the RCU queue
got long, and encouraging an RCU event when that happened. However, that
doesn't happen for the "call_rcu_bh()" case, so I'm not surprised that the
network queues can grow fairly long.

I've added Eric Dumazet, Dipankar and Paul to the Cc: list, and appended a
totally untested (and probably horribly buggy) possible patch as a
starting point for discussion. It just sets "need_resched()" in the hope
that we'll go through an RCU idle point and run the RCU callbacks. Whether
it helps your case or not, I have no clue.

Linus
---
diff --git a/kernel/rcupdate.c b/kernel/rcupdate.c
index 48d3bce..b107562 100644
--- a/kernel/rcupdate.c
+++ b/kernel/rcupdate.c
@@ -149,11 +149,10 @@ void fastcall call_rcu_bh(struct rcu_hea
*rdp->nxttail = head;
rdp->nxttail = &head->next;
rdp->count++;
-/*
- * Should we directly call rcu_do_batch() here ?
- * if (unlikely(rdp->count > 10000))
- * rcu_do_batch(rdp);
- */
+
+ if (unlikely(++rdp->count > 100))
+ set_need_resched();
+
local_irq_restore(flags);
}

2005-12-31 03:54:30

by Lee Revell

Subject: Re: [patch] latency tracer, 2.6.15-rc7

On Fri, 2005-12-30 at 17:39 -0800, Linus Torvalds wrote:
> Whether
> it helps your case or not, I have no clue.

OK, so far so good, I won't know whether it helps for a while.

Lee

2005-12-31 04:00:19

by Lee Revell

Subject: Re: [patch] latency tracer, 2.6.15-rc7

On Fri, 2005-12-30 at 17:39 -0800, Linus Torvalds wrote:
> diff --git a/kernel/rcupdate.c b/kernel/rcupdate.c
> index 48d3bce..b107562 100644
> --- a/kernel/rcupdate.c
> +++ b/kernel/rcupdate.c
> @@ -149,11 +149,10 @@ void fastcall call_rcu_bh(struct rcu_hea
> *rdp->nxttail = head;
> rdp->nxttail = &head->next;
> rdp->count++;
> -/*
> - * Should we directly call rcu_do_batch() here ?
> - * if (unlikely(rdp->count > 10000))
> - * rcu_do_batch(rdp);
> - */
> +
> + if (unlikely(++rdp->count > 100))
> + set_need_resched();
> +
> local_irq_restore(flags);
> }

This increments rdp->count twice - is that intentional?

Also, what was the story with the commented-out code?

Lee

2005-12-31 04:29:34

by Paul E. McKenney

Subject: Re: [patch] latency tracer, 2.6.15-rc7

On Fri, Dec 30, 2005 at 11:00:15PM -0500, Lee Revell wrote:
> On Fri, 2005-12-30 at 17:39 -0800, Linus Torvalds wrote:
> > diff --git a/kernel/rcupdate.c b/kernel/rcupdate.c
> > index 48d3bce..b107562 100644
> > --- a/kernel/rcupdate.c
> > +++ b/kernel/rcupdate.c
> > @@ -149,11 +149,10 @@ void fastcall call_rcu_bh(struct rcu_hea
> > *rdp->nxttail = head;
> > rdp->nxttail = &head->next;
> > rdp->count++;
> > -/*
> > - * Should we directly call rcu_do_batch() here ?
> > - * if (unlikely(rdp->count > 10000))
> > - * rcu_do_batch(rdp);
> > - */
> > +
> > + if (unlikely(++rdp->count > 100))
> > + set_need_resched();

This should help in UP configurations, or in SMP configurations where
all CPUs are doing call_rcu_bh() very frequently. I would not expect
it to help in cases where one of several CPUs is frequently executing
call_rcu_bh(), but where the other CPUs are either CPU-bound in user
space or are in a tickful idle state.

My guess is that Lee would be running a UP kernel.

> > local_irq_restore(flags);
> > }
>
> This increments rdp->count twice - is that intentional?

If it was intentional, I would argue instead for

if (unlikely(rdp->count > 50))

> Also what was the story deal with the commented out code?

Any number of people, myself included, have been seduced by the thought
of invoking callbacks from call_rcu() or call_rcu_bh() context. When I
first un-seduced myself from this approach, I documented the reasons
why it is a bad idea in Documentation/RCU/UP.txt -- the gist of it is
that there are a number of self-deadlock traps that are easy to fall into.
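
The simplest of those traps looks something like this (a hypothetical
example with made-up names, along the lines of what UP.txt describes):
if call_rcu() invoked the callback on the spot, any callback that needs
a lock the caller is already holding would self-deadlock.

struct my_obj {
        struct list_head list;
        struct rcu_head rcu;
};

static DEFINE_SPINLOCK(mylock);
static int nr_objs;

static void my_rcu_callback(struct rcu_head *head)
{
        struct my_obj *p = container_of(head, struct my_obj, rcu);

        spin_lock(&mylock);             /* deadlocks if run from retire_obj() */
        nr_objs--;
        spin_unlock(&mylock);
        kfree(p);
}

static void retire_obj(struct my_obj *p)
{
        spin_lock(&mylock);
        list_del_rcu(&p->list);
        call_rcu(&p->rcu, my_rcu_callback);     /* must only queue it */
        spin_unlock(&mylock);
}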

Thanx, Paul

2005-12-31 04:48:45

by Linus Torvalds

Subject: Re: [patch] latency tracer, 2.6.15-rc7



On Fri, 30 Dec 2005, Lee Revell wrote:
>
> This increments rdp->count twice - is that intentional?

No. Just stupid. Remove the second one.

> Also what was the story deal with the commented out code?

You can't do it that way, since call_rcu_bh() will be called when calling
rcu_do_batch() is not valid.

Linus

2005-12-31 04:54:17

by Lee Revell

Subject: Re: [patch] latency tracer, 2.6.15-rc7

On Fri, 2005-12-30 at 20:29 -0800, Paul E. McKenney wrote:
> This should help in UP configurations, or in SMP configurations where
> all CPUs are doing call_rcu_bh() very frequently. I would not expect
> it to help in cases where one of several CPUs is frequently executing
> call_rcu_bh(), but where the other CPUs are either CPU-bound in user
> space or are in a tickful idle state.

This and net/decnet/dn_route.c are the only two uses of call_rcu_bh in
the kernel. And this one does not seem to be invoked frequently; it
took ~48 hours to show up in the latency tracer. Of course a server
workload might call it all the time.

Lee

2005-12-31 20:15:12

by Paul E. McKenney

Subject: Re: [patch] latency tracer, 2.6.15-rc7

On Fri, Dec 30, 2005 at 11:54:14PM -0500, Lee Revell wrote:
> On Fri, 2005-12-30 at 20:29 -0800, Paul E. McKenney wrote:
> > This should help in UP configurations, or in SMP configurations where
> > all CPUs are doing call_rcu_bh() very frequently. I would not expect
> > it to help in cases where one of several CPUs is frequently executing
> > call_rcu_bh(), but where the other CPUs are either CPU-bound in user
> > space or are in a tickful idle state.
>
> This and net/decnet/dn_route.c are the only two uses of call_rcu_bh in
> the kernel. And this one does not seem to be invoked frequently, it
> took ~48 hours to show up in the latency tracer. Of course a server
> workload might call it all the time.

Well, most old-style server workloads wouldn't notice a 10ms hit to
latency. I expect server workloads to become much more sensitive to
latency over time, however. There are several reasons for this:
(1) as more and more workloads are spread over many machines, the
latency of each individual machine must improve just to maintain
the same overall latency; (2) as computers continue to get less
expensive relative to people (and this trend is most pronounced in
-developing- countries, -not- developed countries!), it will be less
and less acceptable to have people waiting on computers; and (3) the
trend of which VOIP is part will continue, so that traditional server
workloads will have increasingly large latency-sensitive components.

But back to the problem at hand.

So it seems to me that Linus's patch is part of the solution, but
needs to also have a global component, perhaps as follows:

        if (unlikely(rdp->count > 100)) {
                set_need_resched();
                if (unlikely(rdp->count - rdp->last_rs_count > 1000)) {
                        int cpu;

                        rdp->last_rs_count = rdp->count;
                        spin_lock_bh(&rcu_bh_state.lock);
                        for_each_cpu_mask(cpu, rdp->rcu_bh_state.cpumask)
                                smp_send_reschedule(cpu);
                        spin_unlock_bh(&rcu_bh_state.lock);
                }
        }

I am sure that I am missing some interaction or another with tickless
idle and CPU hotplug.

There also needs to be some adjustment in rcu_do_batch(), which will
have to somehow get back to a quiescent state periodically. Dipankar,
Vatsa, thoughts?
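
To make that concrete, one shape such an adjustment could take is
roughly the following (sketch only - rcu_batch_limit is a made-up knob
and the donetail bookkeeping is omitted): cap the number of callbacks
run per pass and requeue the tasklet, so the CPU passes through a
preemption/quiescent point between chunks.

static int rcu_batch_limit = 100;       /* made-up tunable */

static void rcu_do_batch(struct rcu_data *rdp)
{
        struct rcu_head *next, *list = rdp->donelist;
        int count = 0;

        while (list) {
                next = list->next;
                list->func(list);
                list = next;
                if (++count >= rcu_batch_limit)
                        break;          /* leave the rest for later */
        }
        rdp->donelist = list;
        if (list)
                tasklet_schedule(&per_cpu(rcu_tasklet, smp_processor_id()));
}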

Thanx, Paul

2006-01-01 05:46:23

by Lee Revell

Subject: Re: [patch] latency tracer, 2.6.15-rc7

On Sat, 2005-12-31 at 12:14 -0800, Paul E. McKenney wrote:
> So it seems to me that Linus's patch is part of the solution, but
> needs to also have a global component, perhaps as follows:
>
> if (unlikely(rdp->count > 100)) {
> set_need_resched();

In fact neither of these patches helps, because these RCU callbacks run
from a tasklet in softirq context, and softirqs are not preemptible
(unless they are running in threads, see below):

_------=> CPU#
/ _-----=> irqs-off
| / _----=> need-resched
|| / _---=> hardirq/softirq <---------***
||| / _--=> preempt-depth
|||| /
||||| delay
cmd pid ||||| time | caller
\ / ||||| \ | /
<idle>-0 0d.s2 1us : __trace_start_sched_wakeup (try_to_wake_up)
<idle>-0 0d.s2 1us : __trace_start_sched_wakeup <<...>-21330> (73 0)
<idle>-0 0d.s. 2us : wake_up_process (process_timeout)
<idle>-0 0d.s. 3us : tasklet_action (__do_softirq)
<idle>-0 0d.s. 4us : rcu_process_callbacks (tasklet_action)
<idle>-0 0d.s. 5us : __rcu_process_callbacks (rcu_process_callbacks)
<idle>-0 0d.s. 5us : rcu_check_quiescent_state (__rcu_process_callbacks)
<idle>-0 0d.s. 6us : __rcu_process_callbacks (rcu_process_callbacks)
<idle>-0 0d.s. 7us : rcu_check_quiescent_state (__rcu_process_callbacks)
<idle>-0 0d.s. 7us : rcu_do_batch (__rcu_process_callbacks)
<idle>-0 0d.s. 8us : dst_rcu_free (rcu_do_batch)

Fortunately softirq preemption is one of the simplest parts of the -rt
patch - since the kernel already runs all softirqs in threads under
heavy load, it's simply a matter of adding a .config option to always do
that.
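
That fallback already exists in mainline: __do_softirq() gives up after
a bounded number of restart passes and hands whatever is still pending
to the per-CPU ksoftirqd thread. Very roughly (pending-mask handling
left out, with run_raised_softirq_handlers() as a stand-in for the real
handler loop):

#define MAX_SOFTIRQ_RESTART     10

asmlinkage void __do_softirq(void)
{
        int max_restart = MAX_SOFTIRQ_RESTART;
        __u32 pending = local_softirq_pending();

        do {
                run_raised_softirq_handlers(pending);
                pending = local_softirq_pending();
        } while (pending && --max_restart);

        if (pending)
                wakeup_softirqd();      /* ksoftirqd is an ordinary, preemptible thread */
}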

Linus, would you accept CONFIG_PREEMPT_SOFTIRQS to always run softirqs
in threads (default N of course, it certainly has a slight throughput
cost) for mainline if Ingo were to submit it?

Lee

2006-01-01 08:32:32

by Lee Revell

Subject: Re: [patch] latency tracer, 2.6.15-rc7

On Fri, 2005-12-30 at 20:15 -0500, Lee Revell wrote:
> On Fri, 2005-12-30 at 17:02 -0800, Linus Torvalds wrote:
> >
> > On Fri, 30 Dec 2005, Lee Revell wrote:
> > >
> > > It seems that the networking code's use of RCU can cause 10ms+
> > > latencies:
> >
> > Hmm. Is there a big jump at the 10ms mark? Do you have a 100Hz timer
> > source?
> >
> > A latency jump at 10ms would tend to indicate a missed wakeup that
> > was "picked up" by the next timer tick.
>
> No there are no large jumps, it really seems that this was the network
> code causing an RCU callback to drop ~2K routes at once. Specifically
> RCU invokes dst_rcu_free 2085 times in a single batch
> (call_rcu_bh(&rt->u.dst.rcu_head, dst_rcu_free) is only called from
> rt_free() and rt_drop()).
>
> I have found that many of the paths in the network stack that can cause
> high latencies can be tuned via sysctls (for example
> net.ipv4.route.gc_thresh); this one may be the same.

On a related topic:

I thought I had solved the route cache flushing problem by tuning these
sysctls, but it does not seem to help.

The short version is that rt_run_flush and rt_garbage_collect can cause
15ms+ latencies when a lot of routes (up to 4096 it seems) are flushed
in one go:

$ grep "local_bh_enable (rt_run_flush)" /proc/latency_trace | wc -l
4096

$ grep "local_bh_enable (rt_garbage_collect)" /proc/latency_trace | wc
-l
4096

(rt_run_flush and rt_garbage_collect take and release a BH spinlock for
each hash chain they walk, so the above greps effectively count how
much of the table was processed in one go)
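
For reference, the flush loop looks roughly like this (paraphrased from
memory of 2.6.15's net/ipv4/route.c; helper names approximate) - 4096
is the size of the hash table, which is why the counts above stay the
same no matter how the sysctls are tuned:

static void rt_run_flush(unsigned long dummy)
{
        int i;
        struct rtable *rth, *next;

        get_random_bytes(&rt_hash_rnd, 4);

        for (i = rt_hash_mask; i >= 0; i--) {
                spin_lock_bh(rt_hash_lock_addr(i));
                rth = rt_hash_table[i].chain;
                if (rth)
                        rt_hash_table[i].chain = NULL;
                spin_unlock_bh(rt_hash_lock_addr(i));

                for (; rth; rth = next) {
                        next = rth->u.rt_next;
                        rt_free(rth);   /* queues the dst_rcu_free RCU callback */
                }
        }
}

And since the whole loop runs from the rt_secret_rebuild timer softirq,
the per-chain unlock only creates preemption points if softirqs run in
threads.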

I reported this a year or so ago and it led to Ingo adding an option to
-rt (then called the voluntary preempt patch) to always run softirqs in
threads, which makes rt_run_flush preemptible.

Anyway, I thought I could work around this on mainline by tuning the
network stack to minimize the effect of route cache flushing on
scheduler latency, using these sysctls to cause the route cache to be
flushed more often and/or to limit its maximum size:

$ sudo /sbin/sysctl -a | grep route
net.ipv4.route.gc_elasticity = 8
net.ipv4.route.gc_interval = 60
net.ipv4.route.gc_timeout = 300
net.ipv4.route.gc_min_interval_ms = 20
net.ipv4.route.gc_min_interval = 0
net.ipv4.route.max_size = 65536
net.ipv4.route.gc_thresh = 256
net.ipv4.route.max_delay = 10
net.ipv4.route.min_delay = 2

I tried lowering gc_min_interval_ms, gc_timeout, max_size, and gc_thresh
but rt_run_flush will still process up to 4096 (the size of the route
hash table?) routes at once.

(stop here if you don't care to interpret long latency_trace reports)

17ms+ latency caused by rt_run_flush then rt_garbage_collect processing
4096 routes:

preemption latency trace v1.1.5 on 2.6.15-rc7
--------------------------------------------------------------------
latency: 17286 us, #19154/19154, CPU#0 | (M:rt VP:0, KP:0, SP:0 HP:0)
-----------------
| task: gtk-gnutella-8581 (uid:1000 nice:0 policy:0 rt_prio:0)
-----------------

_------=> CPU#
/ _-----=> irqs-off
| / _----=> need-resched
|| / _---=> hardirq/softirq
||| / _--=> preempt-depth
|||| /
||||| delay
cmd pid ||||| time | caller
\ / ||||| \ | /
epiphany-8742 0dns8 0us : __trace_start_sched_wakeup (try_to_wake_up)
epiphany-8742 0dns8 1us : __trace_start_sched_wakeup <<...>-8581> (73 0)
epiphany-8742 0dns7 2us : preempt_schedule (__trace_start_sched_wakeup)
epiphany-8742 0dns6 2us : preempt_schedule (try_to_wake_up)
epiphany-8742 0dns5 3us : preempt_schedule (__wake_up)
epiphany-8742 0dns5 4us : preempt_schedule (ep_poll_safewake)
epiphany-8742 0dnH5 6us : do_IRQ (c011223e b 0)
epiphany-8742 0d.h. 6us : __do_IRQ (do_IRQ)
epiphany-8742 0d.h1 7us+: mask_and_ack_8259A (__do_IRQ)
epiphany-8742 0d.h. 12us : handle_IRQ_event (__do_IRQ)
epiphany-8742 0d.h. 13us : usb_hcd_irq (handle_IRQ_event)
epiphany-8742 0d.h. 13us : uhci_irq (usb_hcd_irq)
epiphany-8742 0d.h. 14us : via_driver_irq_handler (handle_IRQ_event)
epiphany-8742 0d.h. 16us : rhine_interrupt (handle_IRQ_event)
epiphany-8742 0d.h. 16us : ioread16 (rhine_interrupt)
epiphany-8742 0d.h. 17us : ioread8 (rhine_interrupt)
epiphany-8742 0d.h. 18us : iowrite16 (rhine_interrupt)
epiphany-8742 0d.h. 19us : ioread8 (rhine_interrupt)
epiphany-8742 0d.h. 20us : rhine_tx (rhine_interrupt)
epiphany-8742 0d.h1 21us : raise_softirq_irqoff (rhine_tx)
epiphany-8742 0d.h. 22us : ioread16 (rhine_interrupt)
epiphany-8742 0d.h. 23us : ioread8 (rhine_interrupt)
epiphany-8742 0d.h1 25us : note_interrupt (__do_IRQ)
epiphany-8742 0d.h1 25us : end_8259A_irq (__do_IRQ)
epiphany-8742 0d.h1 26us : enable_8259A_irq (end_8259A_irq)
epiphany-8742 0dnH5 28us : irq_exit (do_IRQ)
epiphany-8742 0dns5 28us < (2097760)
epiphany-8742 0dns4 29us : preempt_schedule (__wake_up)
epiphany-8742 0dns3 30us : preempt_schedule (sock_def_readable)
epiphany-8742 0dns2 31us : preempt_schedule (tcp_v4_rcv)
epiphany-8742 0dns1 32us : preempt_schedule (ip_local_deliver)
epiphany-8742 0dns. 33us : preempt_schedule (netif_receive_skb)
epiphany-8742 0dns. 33us : net_tx_action (__do_softirq)
epiphany-8742 0dns. 34us : __kfree_skb (net_tx_action)
epiphany-8742 0dns. 35us : sock_wfree (__kfree_skb)
epiphany-8742 0dns. 35us : kfree_skbmem (__kfree_skb)
epiphany-8742 0dns. 36us : skb_release_data (kfree_skbmem)
epiphany-8742 0dns. 37us : kfree (skb_release_data)
epiphany-8742 0dns. 38us : kmem_cache_free (kfree_skbmem)
epiphany-8742 0dn.. 39us : schedule (work_resched)
epiphany-8742 0dn.. 40us : stop_trace (schedule)
epiphany-8742 0dn.. 40us : profile_hit (schedule)
epiphany-8742 0dn.1 41us : sched_clock (schedule)
epiphany-8742 0dn.2 43us : recalc_task_prio (schedule)
epiphany-8742 0dn.2 44us : effective_prio (recalc_task_prio)
epiphany-8742 0dn.2 45us : requeue_task (schedule)
<...>-8581 0d..2 47us : __switch_to (schedule)
<...>-8581 0d..2 49us : schedule <epiphany-8742> (7d 73)
<...>-8581 0d.h2 50us : do_IRQ (c022ecf4 0 0)
<...>-8581 0d.h. 51us : __do_IRQ (do_IRQ)
<...>-8581 0d.h1 52us+: mask_and_ack_8259A (__do_IRQ)
<...>-8581 0d.h. 56us : handle_IRQ_event (__do_IRQ)
<...>-8581 0d.h. 56us : timer_interrupt (handle_IRQ_event)
<...>-8581 0d.h1 57us+: mark_offset_tsc (timer_interrupt)
<...>-8581 0d.h1 64us : do_timer (timer_interrupt)
<...>-8581 0d.h1 64us : update_wall_time (do_timer)
<...>-8581 0d.h1 65us : update_wall_time_one_tick (update_wall_time)
<...>-8581 0d.h1 66us : update_process_times (timer_interrupt)
<...>-8581 0d.h1 66us : account_system_time (update_process_times)
<...>-8581 0d.h1 67us : acct_update_integrals (account_system_time)
<...>-8581 0d.h1 69us : run_local_timers (update_process_times)
<...>-8581 0d.h1 69us : raise_softirq (run_local_timers)
<...>-8581 0d.h1 70us : scheduler_tick (update_process_times)
<...>-8581 0d.h1 71us : sched_clock (scheduler_tick)
<...>-8581 0d.h2 72us : task_timeslice (scheduler_tick)
<...>-8581 0d.h1 73us : run_posix_cpu_timers (update_process_times)
<...>-8581 0d.h1 74us : smp_local_timer_interrupt (timer_interrupt)
<...>-8581 0d.h1 75us : profile_tick (smp_local_timer_interrupt)
<...>-8581 0d.h1 76us : profile_hit (profile_tick)
<...>-8581 0d.h1 77us : note_interrupt (__do_IRQ)
<...>-8581 0d.h1 77us : end_8259A_irq (__do_IRQ)
<...>-8581 0d.h1 78us+: enable_8259A_irq (end_8259A_irq)
<...>-8581 0d.h2 80us : irq_exit (do_IRQ)
<...>-8581 0d..3 81us : do_softirq (irq_exit)
<...>-8581 0dns. 81us : __do_softirq (do_softirq)
<...>-8581 0dns. 82us : run_timer_softirq (__do_softirq)
<...>-8581 0dns. 83us : preempt_schedule (run_timer_softirq)
<...>-8581 0dns. 84us : rt_secret_rebuild (run_timer_softirq)
<...>-8581 0dns. 85us : rt_cache_flush (rt_secret_rebuild)
<...>-8581 0dns1 86us : del_timer (rt_cache_flush)
<...>-8581 0dns. 87us : local_bh_enable (rt_cache_flush)
<...>-8581 0dns. 88us : preempt_schedule (local_bh_enable)
<...>-8581 0dns. 88us : rt_run_flush (rt_cache_flush)
<...>-8581 0dns. 89us : get_random_bytes (rt_run_flush)
<...>-8581 0dns. 90us : extract_entropy (get_random_bytes)
<...>-8581 0dns. 91us : xfer_secondary_pool (extract_entropy)
<...>-8581 0dns. 92us : extract_entropy (xfer_secondary_pool)
<...>-8581 0dns. 93us : xfer_secondary_pool (extract_entropy)
<...>-8581 0dns. 94us : account (extract_entropy)
<...>-8581 0dns. 95us : preempt_schedule (account)
<...>-8581 0dns. 96us : extract_buf (extract_entropy)
<...>-8581 0dns. 97us : sha_init (extract_buf)
<...>-8581 0dns. 98us+: sha_transform (extract_buf)
<...>-8581 0dns. 106us+: __add_entropy_words (extract_buf)
<...>-8581 0dns. 108us : preempt_schedule (__add_entropy_words)
<...>-8581 0dns. 109us+: sha_transform (extract_buf)
<...>-8581 0dns. 116us : __add_entropy_words (extract_buf)
<...>-8581 0dns. 117us : preempt_schedule (__add_entropy_words)
<...>-8581 0dns. 118us+: sha_transform (extract_buf)
<...>-8581 0dns. 125us : __add_entropy_words (extract_buf)
<...>-8581 0dns. 126us : preempt_schedule (__add_entropy_words)
<...>-8581 0dns. 127us+: sha_transform (extract_buf)
<...>-8581 0dns. 134us : __add_entropy_words (extract_buf)
<...>-8581 0dns. 135us : preempt_schedule (__add_entropy_words)
<...>-8581 0dns. 136us+: sha_transform (extract_buf)
<...>-8581 0dns. 143us : __add_entropy_words (extract_buf)
<...>-8581 0dns. 144us : preempt_schedule (__add_entropy_words)
<...>-8581 0dns. 144us+: sha_transform (extract_buf)
<...>-8581 0dns. 151us : __add_entropy_words (extract_buf)
<...>-8581 0dns. 152us : preempt_schedule (__add_entropy_words)
<...>-8581 0dns. 153us+: sha_transform (extract_buf)
<...>-8581 0dns. 160us : __add_entropy_words (extract_buf)
<...>-8581 0dns. 161us : preempt_schedule (__add_entropy_words)
<...>-8581 0dns. 162us+: sha_transform (extract_buf)
<...>-8581 0dns. 169us : __add_entropy_words (extract_buf)
<...>-8581 0dns. 170us : preempt_schedule (__add_entropy_words)
<...>-8581 0dns. 171us : __add_entropy_words (extract_buf)
<...>-8581 0dns. 172us : preempt_schedule (__add_entropy_words)
<...>-8581 0dns. 173us+: sha_transform (extract_buf)
<...>-8581 0dns. 180us : __add_entropy_words (xfer_secondary_pool)
<...>-8581 0dns. 182us : preempt_schedule (__add_entropy_words)
<...>-8581 0dns. 183us : credit_entropy_store (xfer_secondary_pool)
<...>-8581 0dns. 184us : preempt_schedule (credit_entropy_store)
<...>-8581 0dns. 185us : account (extract_entropy)
<...>-8581 0dns1 186us : __wake_up (account)
<...>-8581 0dns2 186us : __wake_up_common (__wake_up)
<...>-8581 0dns1 187us : preempt_schedule (__wake_up)
<...>-8581 0dns. 188us : preempt_schedule (account)
<...>-8581 0dns. 188us : extract_buf (extract_entropy)
<...>-8581 0dns. 189us : sha_init (extract_buf)
<...>-8581 0dns. 190us+: sha_transform (extract_buf)
<...>-8581 0dns. 197us : __add_entropy_words (extract_buf)
<...>-8581 0dns. 198us : preempt_schedule (__add_entropy_words)
<...>-8581 0dns. 198us+: sha_transform (extract_buf)
<...>-8581 0dns. 205us : __add_entropy_words (extract_buf)
<...>-8581 0dns. 206us : preempt_schedule (__add_entropy_words)
<...>-8581 0dns. 207us : __add_entropy_words (extract_buf)
<...>-8581 0dns. 209us : preempt_schedule (__add_entropy_words)
<...>-8581 0dns. 209us+: sha_transform (extract_buf)
<...>-8581 0dns. 217us : local_bh_enable (rt_run_flush)
<...>-8581 0dns. 218us : preempt_schedule (local_bh_enable)
<...>-8581 0dns. 219us : call_rcu_bh (rt_run_flush)
<...>-8581 0dns. 220us : local_bh_enable (rt_run_flush)
<...>-8581 0dns. 221us : preempt_schedule (local_bh_enable)
<...>-8581 0dns. 222us : call_rcu_bh (rt_run_flush)
<...>-8581 0dns. 223us : call_rcu_bh (rt_run_flush)

<...>-8581 0dns. 224us : local_bh_enable (rt_run_flush)
<...>-8581 0dns. 225us : preempt_schedule (local_bh_enable)

<...>-8581 0dns. 226us : local_bh_enable (rt_run_flush)
<...>-8581 0dns. 227us : preempt_schedule (local_bh_enable)

[ rt_run_flush x 4096 routes ]

<...>-8581 0dns. 8763us : local_bh_enable (rt_run_flush)
<...>-8581 0dns. 8764us : preempt_schedule (local_bh_enable)
<...>-8581 0dns. 8765us : mod_timer (rt_secret_rebuild)
<...>-8581 0dns. 8766us : __mod_timer (mod_timer)
<...>-8581 0dns. 8766us : lock_timer_base (__mod_timer)
<...>-8581 0dns1 8767us : internal_add_timer (__mod_timer)
<...>-8581 0dns. 8769us+: preempt_schedule (__mod_timer)
<...>-8581 0dns. 8771us : preempt_schedule (run_timer_softirq)
<...>-8581 0dns. 8772us : tcp_delack_timer (run_timer_softirq)

[ tcp_delack_timer ... ]

<...>-8581 0dns. 8841us : preempt_schedule (tcp_delack_timer)
<...>-8581 0dns. 8842us+: preempt_schedule (run_timer_softirq)

<...>-8581 0dnH. 8845us : do_IRQ (c010dc95 b 0)

[ network irq ]

<...>-8581 0dnH. 8871us : irq_exit (do_IRQ)

<...>-8581 0dns. 8871us < (2097760)
<...>-8581 0dns. 8872us : run_timer_softirq (__do_softirq)
<...>-8581 0dns. 8873us : net_rx_action (__do_softirq)
<...>-8581 0dns. 8874us : process_backlog (net_rx_action)
<...>-8581 0dns. 8875us : netif_receive_skb (process_backlog)
<...>-8581 0dns1 8877us : packet_rcv_spkt (netif_receive_skb)
<...>-8581 0dns1 8878us : skb_clone (packet_rcv_spkt)
<...>-8581 0dns1 8878us : kmem_cache_alloc (skb_clone)
<...>-8581 0dns1 8880us : strlcpy (packet_rcv_spkt)
<...>-8581 0dns2 8881us : sk_run_filter (packet_rcv_spkt)
<...>-8581 0dns1 8882us : preempt_schedule (packet_rcv_spkt)
<...>-8581 0dns1 8883us : __kfree_skb (packet_rcv_spkt)
<...>-8581 0dns1 8884us : skb_release_data (kfree_skbmem)
<...>-8581 0dns1 8884us : kmem_cache_free (kfree_skbmem)
<...>-8581 0dns1 8885us : ip_rcv (netif_receive_skb)
<...>-8581 0dns1 8886us : ip_route_input (ip_rcv)
<...>-8581 0dns1 8887us : rt_hash_code (ip_route_input)
<...>-8581 0dns1 8888us : preempt_schedule (ip_route_input)
<...>-8581 0dns1 8889us : ip_route_input_slow (ip_route_input)
<...>-8581 0dns1 8890us : preempt_schedule (ip_route_input_slow)
<...>-8581 0dns1 8891us+: memset (ip_route_input_slow)
<...>-8581 0dns1 8893us+: fn_hash_lookup (ip_route_input_slow)
<...>-8581 0dns2 8895us : fib_semantic_match (fn_hash_lookup)
<...>-8581 0dns1 8897us : preempt_schedule (fn_hash_lookup)
<...>-8581 0dns1 8899us : fib_validate_source (ip_route_input_slow)
<...>-8581 0dns1 8900us : memset (fib_validate_source)
<...>-8581 0dns1 8901us : preempt_schedule (fib_validate_source)
<...>-8581 0dns1 8902us : fn_hash_lookup (fib_validate_source)
<...>-8581 0dns1 8903us : preempt_schedule (fn_hash_lookup)
<...>-8581 0dns1 8904us : fn_hash_lookup (fib_validate_source)
<...>-8581 0dns2 8906us : fib_semantic_match (fn_hash_lookup)
<...>-8581 0dns1 8907us : preempt_schedule (fn_hash_lookup)
<...>-8581 0dns1 8908us : __fib_res_prefsrc (fib_validate_source)
<...>-8581 0dns1 8909us : inet_select_addr (__fib_res_prefsrc)
<...>-8581 0dns1 8910us+: preempt_schedule (inet_select_addr)
<...>-8581 0dns1 8912us : dst_alloc (ip_route_input_slow)
<...>-8581 0dns1 8913us+: rt_garbage_collect (dst_alloc)

<...>-8581 0dns1 8916us : local_bh_enable (rt_garbage_collect)
<...>-8581 0dns1 8917us : preempt_schedule (local_bh_enable)

<...>-8581 0dns1 8918us : local_bh_enable (rt_garbage_collect)
<...>-8581 0dns1 8919us : preempt_schedule (local_bh_enable)

[ rt_garbage_collect x 4096 routes ]

<...>-8581 0dns1 16460us : local_bh_enable (rt_garbage_collect)
<...>-8581 0dns1 16461us : preempt_schedule (local_bh_enable)
<...>-8581 0dns1 16462us+: kmem_cache_alloc (dst_alloc)
<...>-8581 0dns1 16468us : preempt_schedule (ip_route_input_slow)
<...>-8581 0dns1 16469us : rt_hash_code (ip_route_input_slow)
<...>-8581 0dns1 16470us : rt_intern_hash (ip_route_input_slow)
<...>-8581 0dns1 16472us : local_bh_enable (rt_intern_hash)
<...>-8581 0dns1 16473us+: preempt_schedule (local_bh_enable)

<...>-8581 0dns1 16476us+: ip_local_deliver (ip_rcv)

[ ~1ms more network softirqs ]

<...>-8581 0dns. 17278us+: preempt_schedule (rcu_check_quiescent_state)
<...>-8581 0dn.2 17280us < (2097760)
<...>-8581 0dn.1 17281us : preempt_schedule (schedule)
<...>-8581 0dn.1 17282us : trace_stop_sched_switched (schedule)
<...>-8581 0dn.2 17283us+: trace_stop_sched_switched <<...>-8581> (73 0)
<...>-8581 0dn.2 17285us : trace_stop_sched_switched (schedule)

Lee

2006-01-01 18:57:00

by Linus Torvalds

[permalink] [raw]
Subject: Re: [patch] latency tracer, 2.6.15-rc7



On Sun, 1 Jan 2006, Lee Revell wrote:
>
> In fact neither of these patches helps, because these RCU callbacks run
> from a tasklet in softirq context, and softirqs are not preemptible
> (unless they are running in threads, see below):

That wasn't why I was hoping they would help...

The actual RCU callback will run uninterruptible, but I was hoping that
the patch would make it be called more often, and not have thousands of
events to work on...

> Linus, would you accept CONFIG_PREEMPT_SOFTIRQS to always run softirqs
> in threads (default N of course, it certainly has a slight throughput
> cost) for mainline if Ingo were to submit it?

Actually, I think there's a better solution.

Try just setting maxbatch back to 10 in kernel/rcupdate.c.

The thing is, "maxbatch" doesn't actually _work_ because what happens is
that the tasklet will continually re-schedule itself, and the caller will
keep calling it. So maxbatch is actually broken.

However, what happens is that after kernel/softirq.c has called the
tasklet ten times, and it is still pending, it will do the softirq in a
thread (see the "max_restart" logic).

Which happens to do the right thing, although I'm pretty convinced that it
was by mistake (or if on purpose, it depends way too closely on silly
magic implementation issues).

I raised maxbatch because the dentry RCU problem (since fixed) was
reported to become less severe (probably because the scheduling of the
softirqs delayed the freeing of old dentries further, making the
problem worse), and because I don't like the subtlety involved in
having two layers of batching. But hey, it might work, at which point
we should just document how maxbatch and max_restart interact with
each other and with softirqd.
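
(For reference, the max_restart logic in question looks roughly like
this in 2.6.x kernel/softirq.c; a simplified sketch from memory, not a
verbatim copy, with MAX_SOFTIRQ_RESTART being 10:)

	asmlinkage void __do_softirq(void)
	{
		struct softirq_action *h;
		int max_restart = MAX_SOFTIRQ_RESTART;
		__u32 pending;

		pending = local_softirq_pending();
		local_bh_disable();
	restart:
		local_softirq_pending() = 0;
		local_irq_enable();

		h = softirq_vec;
		do {
			if (pending & 1)
				h->action(h);	/* e.g. tasklet_action -> rcu_do_batch */
			h++;
			pending >>= 1;
		} while (pending);

		local_irq_disable();
		pending = local_softirq_pending();
		if (pending && --max_restart)
			goto restart;		/* at most 10 passes here */

		if (pending)
			wakeup_softirqd();	/* remainder goes to ksoftirqd */

		__local_bh_enable();
	}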

Linus

2006-01-01 19:02:12

by Lee Revell

[permalink] [raw]
Subject: Re: [patch] latency tracer, 2.6.15-rc7

On Sun, 2006-01-01 at 10:56 -0800, Linus Torvalds wrote:
> The thing is, "maxbatch" doesn't actually _work_ because what happens
> is that the tasklet will continually re-schedule itself, and the
> caller will keep calling it. So maxbatch is actually broken.
>
> However, what happens is that after kernel/softirq.c has called the
> tasklet ten times, and it is still pending, it will do the softirq in
> a thread (see the "max_restart" logic).

Ah OK thanks for the explanation. I'll try Paul's patch (yours did not
seem to help).

Lee

2006-01-01 19:06:51

by Lee Revell

[permalink] [raw]
Subject: Re: [patch] latency tracer, 2.6.15-rc7

On Sun, 2006-01-01 at 14:02 -0500, Lee Revell wrote:
> On Sun, 2006-01-01 at 10:56 -0800, Linus Torvalds wrote:
> > The thing is, "maxbatch" doesn't actually _work_ because what happens
> > is that the tasklet will continually re-schedule itself, and the
> > caller will keep calling it. So maxbatch is actually broken.
> >
> > However, what happens is that after kernel/softirq.c has called the
> > tasklet ten times, and it is still pending, it will do the softirq in
> > a thread (see the "max_restart" logic).
>
> Ah OK thanks for the explanation. I'll try Paul's patch (yours did not
> seem to help).
>

s/try Paul's patch/set maxbatch to 10/



2006-01-02 20:14:43

by Ingo Molnar

[permalink] [raw]
Subject: Re: [patch] latency tracer, 2.6.15-rc7


* Linus Torvalds <[email protected]> wrote:

> > have you applied the zlib patches too? In particular this one should
> > make a difference:
> >
> > http://redhat.com/~mingo/latency-tracing-patches/patches/reduce-zlib-stack-hack.patch
> >
> > If you didnt have this applied, could you apply it and retry with
> > stack-footprint-debugging again?
>
> Ingo, instead of having a static work area and using locking, why not
> just move those fields into the "z_stream" structure, and thus make
> them be per-stream? The z_stream structure should already be allocated
> with kmalloc() or similar by the caller, so that also gets it off the
> stack without the locking thing.
>
> Hmm?

yeah, i'll implement that. The above one was just a quick hack that
somehow survived. (and i am definitely a wimp when it comes to changing
zlib internals :-)

Ingo

2006-01-03 11:11:31

by Dipankar Sarma

[permalink] [raw]
Subject: Re: [patch] latency tracer, 2.6.15-rc7

On Sun, Jan 01, 2006 at 10:56:25AM -0800, Linus Torvalds wrote:
>
> > Linus, would you accept CONFIG_PREEMPT_SOFTIRQS to always run softirqs
> > in threads (default N of course, it certainly has a slight throughput
> > cost) for mainline if Ingo were to submit it?
>
> Actually, I think there's a better solution.
>
> Try just setting maxbatch back to 10 in kernel/rcupdate.c.
>
> The thing is, "maxbatch" doesn't actually _work_ because what happens is
> that the tasklet will continually re-schedule itself, and the caller will
> keep calling it. So maxbatch is actually broken.

Not really. maxbatch limits the number of RCU callbacks in a
batch inside the RCU subsystem; it doesn't ensure that the total
number of RCU callbacks invoked in that instance of the softirq
is at most maxbatch. The idea was to give control back to the
softirq subsystem after maxbatch RCU callbacks are processed and
let the softirq latency logic take over.
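
(For reference, the batching in question looks roughly like this; a
simplified sketch from memory of 2.6.x rcu_do_batch() in
kernel/rcupdate.c, not a verbatim copy:)

	static void rcu_do_batch(struct rcu_data *rdp)
	{
		struct rcu_head *next, *list;
		int count = 0;

		list = rdp->donelist;
		while (list) {
			next = list->next;
			list->func(list);	/* the actual RCU callback */
			list = next;
			if (++count >= maxbatch)
				break;		/* stop after maxbatch callbacks */
		}
		rdp->donelist = list;
		if (!rdp->donelist)
			rdp->donetail = &rdp->donelist;
		else
			/* more work left: re-raise the tasklet and let the
			 * softirq restart/ksoftirqd logic decide when */
			tasklet_schedule(&per_cpu(rcu_tasklet, rdp->cpu));
	}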

> However, what happens is that after kernel/softirq.c has called the
> tasklet ten times, and it is still pending, it will do the softirq in a
> thread (see the "max_restart" logic).
>
> Which happens to do the right thing, although I'm pretty convinced that it
> was by mistake (or if on purpose, it depends way too closely on silly
> magic implementation issues).

It was intentional. I wanted to keep the RCU throttling separate
and let softirq handling do its own thing. Softirqs, once delegated
to ksoftirqd, were manageable from the latency perspective, but a
very long RCU batch was not.

I do agree that the two layers of batching really makes things
subtle. I think the best we can do is to figure out some way of
automatic throttling in RCU and forced quiescent state under extreme
conditions. That way we will have less dependency on softirq
throttling.

Thanks
Dipankar

2006-01-03 13:30:52

by David Lang

[permalink] [raw]
Subject: Re: [patch] latency tracer, 2.6.15-rc7

On Tue, 3 Jan 2006, Dipankar Sarma wrote:

> I do agree that the two layers of batching really makes things
> subtle. I think the best we can do is to figure out some way of
> automatic throttling in RCU and forced quiescent state under extreme
> conditions. That way we will have less dependency on softirq
> throttling.

Would it make sense to have the RCU subsystem use a threshold, so that
when more than X items are outstanding it triggers a preemption of all
other CPUs ASAP (forcing the scheduling break needed to make progress on
clearing RCU)? This wouldn't work in all cases, but it could
significantly reduce the problem situations.

David Lang

--
There are two ways of constructing a software design. One way is to make it so simple that there are obviously no deficiencies. And the other way is to make it so complicated that there are no obvious deficiencies.
-- C.A.R. Hoare

2006-01-03 14:09:08

by Dipankar Sarma

[permalink] [raw]
Subject: Re: [patch] latency tracer, 2.6.15-rc7

On Sat, Dec 31, 2005 at 12:14:26PM -0800, Paul E. McKenney wrote:
> So it seems to me that Linus's patch is part of the solution, but
> needs to also have a global component, perhaps as follows:
>
> 	if (unlikely(rdp->count > 100)) {
> 		set_need_resched();
> 		if (unlikely(rdp->count - rdp->last_rs_count > 1000)) {
> 			int cpu;
>
> 			rdp->last_rs_count = rdp->count;
> 			spin_lock_bh(&rcu_bh_state.lock);
> 			for_each_cpu_mask(cpu, rdp->rcu_bh_state.cpumask)
> 				smp_send_reschedule(cpu);
> 			spin_unlock_bh(&rcu_bh_state.lock);
> 		}
> 	}

Yes, something like this that covers corner cases and forces
a quiescent state on all CPUs would be ideal.

> I am sure that I am missing some interaction or another with tickless
> idle and CPU hotplug covered.

It would be safe to miss a cpu or two while sending the resched
interrupt. So, I don't think we need to worry about tickless
idle and cpu hotplug.

> There also needs to be some adjustment in rcu_do_batch(), which will
> have to somehow get back to a quiescent state periodically. Dipankar,
> Vatsa, thoughts?

My original thought was to make maxbatch dynamic and automatically
adjust it depending on the situation. I can try that approach.
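
(Purely as an illustration of what "dynamic" might mean here, and not
an actual patch: the per-invocation limit could scale with the backlog,
so the queue cannot run away while the common case stays short.
rdp->count is the queued-callback count from Paul's snippet above; the
helper name and threshold values are hypothetical.)

	/* Hypothetical sketch only, not a real patch. */
	static inline int rcu_batch_limit(struct rcu_data *rdp)
	{
		int limit = maxbatch;		/* small default */

		if (rdp->count > 1000)		/* backlog growing too large */
			limit = rdp->count / 10; /* drain more aggressively */
		return limit;
	}

	/* rcu_do_batch() would then test ++count >= rcu_batch_limit(rdp)
	 * instead of the fixed maxbatch. */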

Thanks
Dipankar

2006-01-03 14:10:18

by Dipankar Sarma

[permalink] [raw]
Subject: Re: [patch] latency tracer, 2.6.15-rc7

On Tue, Jan 03, 2006 at 05:28:15AM -0800, David Lang wrote:
> On Tue, 3 Jan 2006, Dipankar Sarma wrote:
>
> >I do agree that the two layers of batching really makes things
> >subtle. I think the best we can do is to figure out some way of
> >automatic throttling in RCU and forced quiescent state under extreme
> >conditions. That way we will have less dependency on softirq
> >throttling.
>
> Would it make sense to have the RCU subsystem use a threshold, so that
> when more than X items are outstanding it triggers a preemption of all
> other CPUs ASAP (forcing the scheduling break needed to make progress on
> clearing RCU)? This wouldn't work in all cases, but it could
> significantly reduce the problem situations.

Yes, I think it would make sense to try something like that. See Paul's earlier
mail in this thread for an example code snippet.

Thanks
Dipankar

2006-01-03 15:56:33

by Paul E. McKenney

[permalink] [raw]
Subject: Re: [patch] latency tracer, 2.6.15-rc7

On Tue, Jan 03, 2006 at 07:39:42PM +0530, Dipankar Sarma wrote:
> On Sat, Dec 31, 2005 at 12:14:26PM -0800, Paul E. McKenney wrote:
> > So it seems to me that Linus's patch is part of the solution, but
> > needs to also have a global component, perhaps as follows:
> >
> > 	if (unlikely(rdp->count > 100)) {
> > 		set_need_resched();
> > 		if (unlikely(rdp->count - rdp->last_rs_count > 1000)) {
> > 			int cpu;
> >
> > 			rdp->last_rs_count = rdp->count;
> > 			spin_lock_bh(&rcu_bh_state.lock);
> > 			for_each_cpu_mask(cpu, rdp->rcu_bh_state.cpumask)
> > 				smp_send_reschedule(cpu);
> > 			spin_unlock_bh(&rcu_bh_state.lock);
> > 		}
> > 	}
>
> Yes, something like this that covers corner cases and forces
> a quiescent state on all CPUs would be ideal.
>
> > I am sure that I am missing some interaction or another with tickless
> > idle and CPU hotplug covered.
>
> It would be safe to miss a cpu or two while sending the resched
> interrupt. So, I don't think we need to worry about tickless
> idle and cpu hotplug.

OK, does that also mean that the spin_lock_bh/spin_unlock_bh are also
unnecessary? ;-)

> > There also needs to be some adjustment in rcu_do_batch(), which will
> > have to somehow get back to a quiescent state periodically. Dipankar,
> > Vatsa, thoughts?
>
> My original thought was to make maxbatch dynamic and automatically
> adjust it depending on the situation. I can try that approach.

Makes sense to me!

Thanx, Paul

2006-01-06 20:29:04

by Lee Revell

[permalink] [raw]
Subject: Re: [patch] latency tracer, 2.6.15-rc7

On Thu, 2005-12-29 at 09:22 +0100, Ingo Molnar wrote:
> * Lee Revell <[email protected]> wrote:
>
> > I'm not sure how much work it would be to break out
> > CONFIG_LATENCY_TRACE as a separate patch.
>
> could you check out the 2.6.15-rc7 version of the latency tracer i
> uploaded to:
>
> http://redhat.com/~mingo/latency-tracing-patches/
>
> could test it by e.g. trying to reproduce the same VM latency as in the
> -rt tree. [the two zlib patches are needed if you are using 4K stacks,
> mcount increases stack footprint.]

Ingo,

Would you mind submitting this for 2.6.16? We already have one case
where it would have caught a regression in time for 2.6.15, so IMHO this
would be a great addition to "Kernel hacking".

I have had absolutely no problems running this with 2.6.15-rc7; it works
perfectly, as it has in the -rt tree for ages.

Lee

2006-01-19 01:49:38

by Lee Revell

[permalink] [raw]
Subject: Re: 2.6.15-rc5: latency regression vs 2.6.14 in exit_mmap->free_pgtables

On Wed, 2005-12-28 at 22:59 +0000, Hugh Dickins wrote:
> I'm to blame for that - sorry. It didn't occur to me that I was
> moving any significant amount of work (on mms with many vmas) into the
> section with preemption disabled. Actually, the mm->page_table_lock
> is _not_ held there any more; but preemption is still disabled while
> using the per-cpu mmu_gathers.
>
> I wish you'd found it at -rc1 time. It's not something that can
> be properly corrected in a hurry. The proper fix is to rework the
> tlb_gather_mmu stuff, so it can be done without preemption disabled.
> It's already a serious limitation in unmap_vmas, with CONFIG_PREEMPT's
> ZAP_BLOCK_SIZE spoiling throughput with far too many TLB flushes.
>
> On my list to work on; but the TLB always needs great care, and this
> goes down into architectural divergences, with truncation of a mapped
> file adding further awkward constraints. I imagine 2.6.16-rc1 is only
> a couple of weeks away, so it's unlikely to be fixed in 2.6.16 either.
>

Hugh,

Is this believed to be fixed in 2.6.16-rc1?

Lee
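
(Background on Hugh's point above about the per-cpu mmu_gathers: it is
get_cpu_var()/put_cpu_var() that disables and re-enables preemption
around the whole teardown. A rough sketch from memory of the generic
2.6.x helpers in include/asm-generic/tlb.h, not verbatim:)

	static inline struct mmu_gather *
	tlb_gather_mmu(struct mm_struct *mm, unsigned int full_mm_flush)
	{
		/* get_cpu_var() disables preemption */
		struct mmu_gather *tlb = &get_cpu_var(mmu_gathers);

		tlb->mm = mm;
		tlb->fullmm = full_mm_flush;
		tlb->nr = 0;
		return tlb;
	}

	static inline void
	tlb_finish_mmu(struct mmu_gather *tlb, unsigned long start,
		       unsigned long end)
	{
		tlb_flush_mmu(tlb, start, end);	/* flush and free pages */
		/* ... architecture-specific cleanup ... */
		put_cpu_var(mmu_gathers);	/* preemption back on here */
	}

Everything between tlb_gather_mmu() and tlb_finish_mmu(), including the
free_pgtables() work at exit_mmap() time, therefore runs with
preemption disabled.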

2006-01-19 01:49:39

by Lee Revell

[permalink] [raw]
Subject: Re: [patch] latency tracer, 2.6.15-rc7

On Mon, 2006-01-02 at 21:14 +0100, Ingo Molnar wrote:
> * Linus Torvalds <[email protected]> wrote:
>
> > > have you applied the zlib patches too? In particular this one should
> > > make a difference:
> > >
> > > http://redhat.com/~mingo/latency-tracing-patches/patches/reduce-zlib-stack-hack.patch
> > >
> > > If you didnt have this applied, could you apply it and retry with
> > > stack-footprint-debugging again?
> >
> > Ingo, instead of having a static work area and using locking, why not
> > just move those fields into the "z_stream" structure, and thus make
> > them be per-stream? The z_stream structure should already be allocated
> > with kmalloc() or similar by the caller, so that also gets it off the
> > stack without the locking thing.
> >
> > Hmm?
>
> yeah, i'll implement that. The above one was just a quick hack that
> somehow survived. (and i am definitely a wimp when it comes to changing
> zlib internals :-)

Ingo,

If you get a chance would you mind porting this to 2.6.16-rc1? I only
get 6 failures when applying the 2.6.15-rc7 patch set.

I'd like to be able to catch any 2.6.16 latency regressions while there
is time to do something about them, to avoid a repeat of the unmap_vmas
problem in 2.6.15.

Lee

2006-01-19 06:18:07

by Lee Revell

[permalink] [raw]
Subject: Re: [patch] latency tracer, 2.6.15-rc7

On Wed, 2006-01-18 at 20:49 -0500, Lee Revell wrote:
> I'd like to be able to catch any 2.6.16 latency regressions while there
> is time to do something about them, to avoid a repeat of the unmap_vmas
> problem in 2.6.15.

Good news - I was able to trivially port this to 2.6.16-rc1.

Lee

2006-01-19 07:29:08

by Hugh Dickins

[permalink] [raw]
Subject: Re: 2.6.15-rc5: latency regression vs 2.6.14 in exit_mmap->free_pgtables

On Wed, 18 Jan 2006, Lee Revell wrote:
> On Wed, 2005-12-28 at 22:59 +0000, Hugh Dickins wrote:
> >
> > On my list to work on; but the TLB always needs great care, and this
> > goes down into architectural divergences, with truncation of a mapped
> > file adding further awkward constraints. I imagine 2.6.16-rc1 is only
> > a couple of weeks away, so it's unlikely to be fixed in 2.6.16 either.
>
> Is this believed to be fixed in 2.6.16-rc1?

Not at all, I'm afraid. Do you think I ought to try to persuade Linus
and Andrew to take that ugly free_pgtables #ifdef CONFIG_PREEMPT patch
in the interim before we've the proper latency fix there? (I doubt the
mmu_gather rewrite in 2.6.17 too, but perhaps a reasonable compromise.)

Hugh

2006-01-19 07:32:22

by Lee Revell

[permalink] [raw]
Subject: Re: 2.6.15-rc5: latency regression vs 2.6.14 in exit_mmap->free_pgtables

On Thu, 2006-01-19 at 07:29 +0000, Hugh Dickins wrote:
> On Wed, 18 Jan 2006, Lee Revell wrote:
> > On Wed, 2005-12-28 at 22:59 +0000, Hugh Dickins wrote:
> > >
> > > On my list to work on; but the TLB always needs great care, and this
> > > goes down into architectural divergences, with truncation of a mapped
> > > file adding further awkward constraints. I imagine 2.6.16-rc1 is only
> > > a couple of weeks away, so it's unlikely to be fixed in 2.6.16 either.
> >
> > Is this believed to be fixed in 2.6.16-rc1?
>
> Not at all, I'm afraid. Do you think I ought to try to persuade Linus
> and Andrew to take that ugly free_pgtables #ifdef CONFIG_PREEMPT patch
> in the interim before we've the proper latency fix there? (I doubt the
> mmu_gather rewrite in 2.6.17 too, but perhaps a reasonable compromise.)
>

I don't feel like I understand this code well enough to make a
reasonable case for it. Maybe Ingo cares to comment?

Lee

2006-01-19 07:35:23

by Lee Revell

[permalink] [raw]
Subject: Re: 2.6.15-rc5: latency regression vs 2.6.14 in exit_mmap->free_pgtables

On Thu, 2006-01-19 at 07:29 +0000, Hugh Dickins wrote:
> On Wed, 18 Jan 2006, Lee Revell wrote:
> > On Wed, 2005-12-28 at 22:59 +0000, Hugh Dickins wrote:
> > >
> > > On my list to work on; but the TLB always needs great care, and this
> > > goes down into architectural divergences, with truncation of a mapped
> > > file adding further awkward constraints. I imagine 2.6.16-rc1 is only
> > > a couple of weeks away, so it's unlikely to be fixed in 2.6.16 either.
> >
> > Is this believed to be fixed in 2.6.16-rc1?
>
> Not at all, I'm afraid. Do you think I ought to try to persuade Linus
> and Andrew to take that ugly free_pgtables #ifdef CONFIG_PREEMPT patch
> in the interim before we've the proper latency fix there? (I doubt the
> mmu_gather rewrite in 2.6.17 too, but perhaps a reasonable compromise.)

For the time being, this is overshadowed by long latencies (~8ms or so)
in rt_garbage_collect() and rt_run_flush(). Previously I was under the
impression those could be worked around by sysctl/proc tuning, but this
is not the case.

Until those are fixed I would say it's low priority.

Lee