2004-09-02 06:32:43

by Ingo Molnar

Subject: Re: [patch] voluntary-preempt-2.6.9-rc1-bk4-Q5


* [email protected] <[email protected]> wrote:

> In 2.4, it appears that the duration of the write system call is
> basically fixed and dependent on the duration of the audio fragment.
> In 2.6, this behavior is now different. If I look at the chart in
> detail, it appears the system is queueing up several write operations
> during the first few seconds of testing. You can see this by
> consistently low elapsed times for the write system call. Then the
> elapsed time for the write bounces up / down in a sawtooth pattern
> over a 1 msec range. Could someone explain the cause of this new
> behavior and if there is a setting to restore the old behavior? I am
> concerned that this queueing adds latency to audio operations (when
> trying to synchronize audio with other real time behavior).

i think i found the reason for the sawtooth: it's a bug in hardirq
redirection. In certain situations we can end up not waking up softirqd,
resulting in a random 0-1msec latency between hardirq arrival and
softirq execution. We don't see higher latencies because timer IRQs
always wake up softirqd, which hides the bug to a certain degree.

I'll fix this in -Q8.
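
(a rough sketch of the intended behaviour - not the actual -Q8 change, and
hardirq_redirect_done() is a made-up helper name - the idea being that a
redirected hardirq which leaves softirqs pending should wake ksoftirqd
right away instead of waiting for the next timer tick:)

	/* hypothetical sketch: wake ksoftirqd right after handling a
	 * redirected hardirq, instead of relying on the next timer IRQ */
	static inline void hardirq_redirect_done(void)
	{
		if (local_softirq_pending() && !in_interrupt())
			wakeup_softirqd();
	}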

Ingo


2004-09-02 06:55:13

by Ingo Molnar

Subject: [patch] voluntary-preempt-2.6.9-rc1-bk4-Q8


i've released the -Q8 patch:

http://redhat.com/~mingo/voluntary-preempt/voluntary-preempt-2.6.9-rc1-bk4-Q8

ontop of:

http://redhat.com/~mingo/voluntary-preempt/diff-bk-040828-2.6.8.1.bz2

this release fixes an artificial 0-1msec delay between hardirq arrival
and softirq invocation. This should solve some of the ALSA artifacts
reported by Mark H Johnson. It should also solve the rtl8139 problems -
i've put such a card into a testbox and with -Q7 i had similar packet
latency problems while with -Q8 it works just fine.

So netdev_backlog_granularity still defaults to 1 in -Q8 - please check
whether the networking problems (bootup, service startup and latency)
are resolved. (and increase this value in case there are still
problems.)

Ingo

2004-09-02 07:04:17

by Lee Revell

Subject: Re: [patch] voluntary-preempt-2.6.9-rc1-bk4-Q8

On Thu, 2004-09-02 at 02:55, Ingo Molnar wrote:
> i've released the -Q8 patch:
>
> http://redhat.com/~mingo/voluntary-preempt/voluntary-preempt-2.6.9-rc1-bk4-Q8
>
> ontop of:
>
> http://redhat.com/~mingo/voluntary-preempt/diff-bk-040828-2.6.8.1.bz2
>

Here are traces of a 145, 190, and 217 usec latencies in
netif_receive_skb:

http://krustophenia.net/testresults.php?dataset=2.6.9-rc1-Q6#/var/www/2.6.9-rc1-Q6/trace2.txt
http://krustophenia.net/testresults.php?dataset=2.6.9-rc1-Q6#/var/www/2.6.9-rc1-Q6/trace3.txt
http://krustophenia.net/testresults.php?dataset=2.6.9-rc1-Q6#/var/www/2.6.9-rc1-Q6/trace4.txt

Some of these are with ip_conntrack enabled, at the request of another
poster; this does not make much of a difference, it increases the worst
case latency by 20 usec or so.

Also there is the rt_garbage_collect issue, previously reported. I have
not seen this lately but I do not remember seeing that it was fixed.

Lee

2004-09-02 07:14:04

by Ingo Molnar

Subject: Re: [patch] voluntary-preempt-2.6.9-rc1-bk4-Q8


* Lee Revell <[email protected]> wrote:

> Here are traces of a 145, 190, and 217 usec latencies in
> netif_receive_skb:
>
> http://krustophenia.net/testresults.php?dataset=2.6.9-rc1-Q6#/var/www/2.6.9-rc1-Q6/trace2.txt
> http://krustophenia.net/testresults.php?dataset=2.6.9-rc1-Q6#/var/www/2.6.9-rc1-Q6/trace3.txt
> http://krustophenia.net/testresults.php?dataset=2.6.9-rc1-Q6#/var/www/2.6.9-rc1-Q6/trace4.txt

these all seem to be single-packet processing latencies - it would be
quite hard to make those codepaths preemptible.

i'd suggest to turn off things like netfilter and ip_conntrack (and
other optional networking features that show up in the trace), they can
only increase latency:

00000001 0.016ms (+0.000ms): ip_rcv (netif_receive_skb)
00000001 0.019ms (+0.002ms): nf_hook_slow (ip_rcv)
00000002 0.019ms (+0.000ms): nf_iterate (nf_hook_slow)
00000002 0.021ms (+0.001ms): ip_conntrack_defrag (nf_iterate)
00000002 0.022ms (+0.000ms): ip_conntrack_in (nf_iterate)
00000002 0.022ms (+0.000ms): ip_ct_find_proto (ip_conntrack_in)
00000103 0.023ms (+0.000ms): __ip_ct_find_proto (ip_ct_find_proto)
00000102 0.024ms (+0.000ms): local_bh_enable (ip_ct_find_proto)
00000002 0.025ms (+0.001ms): tcp_error (ip_conntrack_in)

Ingo

2004-09-02 07:16:22

by Ingo Molnar

Subject: Re: [patch] voluntary-preempt-2.6.9-rc1-bk4-Q8


* Lee Revell <[email protected]> wrote:

> Also there is the rt_garbage_collect issue, previously reported. I
> have not seen this lately but I do not remember seeing that it was
> fixed.

i don't think it's fixed - please re-report it if it occurs again, there
have been many changes.

Ingo

2004-09-02 07:31:29

by Lee Revell

Subject: Re: [patch] voluntary-preempt-2.6.9-rc1-bk4-Q8

On Thu, 2004-09-02 at 03:15, Ingo Molnar wrote:
> * Lee Revell <[email protected]> wrote:
>
> > Here are traces of a 145, 190, and 217 usec latencies in
> > netif_receive_skb:
> >
> > http://krustophenia.net/testresults.php?dataset=2.6.9-rc1-Q6#/var/www/2.6.9-rc1-Q6/trace2.txt
> > http://krustophenia.net/testresults.php?dataset=2.6.9-rc1-Q6#/var/www/2.6.9-rc1-Q6/trace3.txt
> > http://krustophenia.net/testresults.php?dataset=2.6.9-rc1-Q6#/var/www/2.6.9-rc1-Q6/trace4.txt
>
> these all seem to be single-packet processing latencies - it would be
> quite hard to make those codepaths preemptible.
>

I suspected as much, these are not a problem. The large latencies from
reading the /proc filesystem are a bit worrisome (trace1.txt); I will
report these again if they still happen with Q8.

Lee

2004-09-02 07:44:48

by Ingo Molnar

Subject: Re: [patch] voluntary-preempt-2.6.9-rc1-bk4-Q8


* Lee Revell <[email protected]> wrote:

> > these all seem to be single-packet processing latencies - it would be
> > quite hard to make those codepaths preemptible.
>
> I suspected as much, these are not a problem. The large latencies
> from reading the /proc filesystem are a bit worrisome (trace1.txt), I
> will report these again if they still happen with Q8.

conntrack's ct_seq ops indeed seem to have latency problems - the quick
workaround is to disable conntrack.

The reason for the latency is that ct_seq_start() does a read_lock() on
ip_conntrack_lock and only ct_seq_stop() releases it - possibly
milliseconds later. But the whole conntrack /proc code is quite flawed:

	READ_LOCK(&ip_conntrack_lock);

	if (*pos >= ip_conntrack_htable_size)
		return NULL;

	bucket = kmalloc(sizeof(unsigned int), GFP_KERNEL);
	if (!bucket) {
		return ERR_PTR(-ENOMEM);
	}
	*bucket = *pos;
	return bucket;

#1: we kmalloc(GFP_KERNEL) with a spinlock held and softirqs off - ouch!

#2: why does it do the kmalloc() anyway? It could store the position in
the seq pointer just fine. No need to alloc an integer pointer to
store the value in ...

#3: to fix the latency, ct_seq_show() could take the ip_conntrack_lock
and could check the current index against ip_conntrack_htable_size.
There's not much point in making this non-preemptible, there's
a 4K granularity anyway.
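
(a rough sketch of the direction #2 and #3 suggest - not a tested patch,
and the per-bucket printing is elided - keep the bucket index in the
seq_file position itself and take the lock only per bucket in ->show():)

	static void *ct_seq_start(struct seq_file *s, loff_t *pos)
	{
		if (*pos >= ip_conntrack_htable_size)
			return NULL;
		return pos;	/* no kmalloc(), no lock held across reads */
	}

	static int ct_seq_show(struct seq_file *s, void *v)
	{
		unsigned int bucket = (unsigned int)*(loff_t *)v;

		READ_LOCK(&ip_conntrack_lock);
		/* ... print the entries in ip_conntrack hash bucket 'bucket' ... */
		READ_UNLOCK(&ip_conntrack_lock);
		return 0;
	}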

Rusty, what's going on in this code?

Ingo

2004-09-02 08:01:40

by Mika Penttilä

Subject: Re: [patch] voluntary-preempt-2.6.9-rc1-bk4-Q8

Ingo,

I think there might be a problem with voluntary-preempt's handling of softirqs. Namely, in cond_resched_softirq(), you do __local_bh_enable() and local_bh_disable(). But it may be the case that the softirq is handled from ksoftirqd, and then the preempt_count isn't elevated with SOFTIRQ_OFFSET (only PF_SOFTIRQ is set). So the __local_bh_enable() actually makes preempt_count negative, which might have bad effects. Or am I missing something?

Mika


2004-09-02 08:22:12

by Ingo Molnar

Subject: Re: [patch] voluntary-preempt-2.6.9-rc1-bk4-Q8


* Ingo Molnar <[email protected]> wrote:

> this release fixes an artificial 0-1msec delay between hardirq arrival
> and softirq invocation. This should solve some of the ALSA artifacts
> reported by Mark H Johnson. It should also solve the rtl8139 problems
> - i've put such a card into a testbox and with -Q7 i had similar
> packet latency problems while with -Q8 it works just fine.

the rtl8139 problems are not fixed yet - i can still reproduce the
delayed packet issues.

Ingo

2004-09-02 08:36:40

by Ingo Molnar

Subject: Re: [patch] voluntary-preempt-2.6.9-rc1-bk4-Q8


* [email protected] <[email protected]> wrote:

> Ingo,
>
> I think there might be a problem with voluntary-preempt's handling of
> softirqs. Namely, in cond_resched_softirq(), you do
> __local_bh_enable() and local_bh_disable(). But it may be the case
> that the softirq is handled from ksoftirqd, and then the preempt_count
> isn't elevated with SOFTIRQ_OFFSET (only PF_SOFTIRQ is set). So the
> __local_bh_enable() actually makes preempt_count negative, which might
> have bad effects. Or am I missing something?

you are right. Fortunately the main use of cond_resched_softirq() is via
cond_resched_all() - which is safe because it uses softirq_count(). But
the kernel/timer.c explicit call to cond_resched_softirq() is unsafe.
I've fixed this in my tree and i've added an assert to catch the
underflow when it happens.
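
(roughly the kind of guard described above, as a sketch rather than the
actual fix - assert that this context really holds the softirq count
before dropping it, so an underflow gets caught:)

	static inline void cond_resched_softirq(void)
	{
		WARN_ON(!softirq_count());	/* catch the underflow Mika described */
		if (need_resched()) {
			__local_bh_enable();
			__cond_resched();
			local_bh_disable();
		}
	}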

Ingo

2004-09-02 09:10:39

by Peter Zijlstra

Subject: Re: [patch] voluntary-preempt-2.6.9-rc1-bk4-Q8

On Thu, 2004-09-02 at 10:32, Ingo Molnar wrote:
> * [email protected] <[email protected]> wrote:
>
> > Ingo,
> >
> > I think there might be a problem with voluntary-preempt's handling of
> > softirqs. Namely, in cond_resched_softirq(), you do
> > __local_bh_enable() and local_bh_disable(). But it may be the case
> > that the softirq is handled from ksoftirqd, and then the preempt_count
> > isn't elevated with SOFTIRQ_OFFSET (only PF_SOFTIRQ is set). So the
> > __local_bh_enable() actually makes preempt_count negative, which might
> > have bad effects. Or am I missing something?
>
> you are right. Fortunately the main use of cond_resched_softirq() is via
> cond_resched_all() - which is safe because it uses softirq_count(). But
> the kernel/timer.c explicit call to cond_resched_softirq() is unsafe.
> I've fixed this in my tree and i've added an assert to catch the
> underflow when it happens.
>
> Ingo

I've had linux-2.6.9-rc1-bk8-Q7 lock up on me this morning, not long
after starting a glibc compile (from 'emerge -uo gnome'), although it
did survive a make World on xorg-cvs.

Could this have been caused by the bug under discussion?

Unfortunately I don't have much testing time before I go on holidays,
so for now I went back to linux-2.6.9-rc1-bk6-Q5, which on my machine is
rock solid.

Peter

2004-09-02 11:09:42

by Ingo Molnar

Subject: [patch] voluntary-preempt-2.6.9-rc1-bk4-Q9


i've released the -Q9 patch:

http://redhat.com/~mingo/voluntary-preempt/voluntary-preempt-2.6.9-rc1-bk4-Q9

ontop of:

http://redhat.com/~mingo/voluntary-preempt/diff-bk-040828-2.6.8.1.bz2

Changes:

- fixed the cond_resched_softirq() bug noticed by Mika Penttila.

- updated the preemption-friendly network-RX code but 8139too.c still
produces delayed packets so netdev_backlog_granularity now defaults
to 2, which seems to be working fine on my testbox.

- the latency_trace output now includes the kernel and patch version,
for easier sorting of reports.

Ingo

2004-09-02 12:14:49

by Thomas Charbonnel

Subject: Re: [patch] voluntary-preempt-2.6.9-rc1-bk4-Q9

Ingo Molnar wrote :
> i've released the -Q9 patch:
>
> http://redhat.com/~mingo/voluntary-preempt/voluntary-preempt-2.6.9-rc1-bk4-Q9
>
> ontop of:
>
> http://redhat.com/~mingo/voluntary-preempt/diff-bk-040828-2.6.8.1.bz2
>
> Changes:
>
> - fixed the cond_resched_softirq() bug noticed by Mika Penttila.
>
> - updated the preemption-friendly network-RX code but 8139too.c still
> produces delayed packets so netdev_backlog_granularity now defaults
> to 2, which seems to be working fine on my testbox.

Thanks, network seems to be working fine with the new default value.
I still get > 140 us non-preemptible sections in rtl8139_poll, though:
http://www.undata.org/~thomas/q9_rtl8139.trace

Thomas


2004-09-02 13:16:49

by Thomas Charbonnel

Subject: Re: [patch] voluntary-preempt-2.6.9-rc1-bk4-Q9

Ingo Molnar wrote :
> i've released the -Q9 patch:
>
> http://redhat.com/~mingo/voluntary-preempt/voluntary-preempt-2.6.9-rc1-bk4-Q9

With ACPI compiled in and booting with acpi=off (which again doesn't
seem to be honoured), here's another weird one:

preemption latency trace v1.0.5 on 2.6.9-rc1-VP-Q9
--------------------------------------------------
latency: 2888 us, entries: 37 (37)
-----------------
| task: swapper/0, uid:0 nice:0 policy:0 rt_prio:0
-----------------
=> started at: do_IRQ+0x19/0x1a0
=> ended at: do_IRQ+0x14d/0x1a0
=======>
00010000 0.000ms (+0.000ms): do_IRQ (common_interrupt)
00010000 0.000ms (+0.000ms): do_IRQ (default_idle)
00010000 0.000ms (+0.000ms): do_IRQ (<00000000>)
00010001 0.000ms (+0.002ms): mask_and_ack_8259A (do_IRQ)
00010001 0.002ms (+0.000ms): generic_redirect_hardirq (do_IRQ)
00010000 0.002ms (+0.000ms): generic_handle_IRQ_event (do_IRQ)
00010000 0.003ms (+0.000ms): timer_interrupt (generic_handle_IRQ_event)
00010001 0.003ms (+2.878ms): mark_offset_tsc (timer_interrupt)
00010001 2.882ms (+0.000ms): do_timer (timer_interrupt)
00010001 2.882ms (+0.000ms): update_process_times (do_timer)
00010001 2.882ms (+0.000ms): update_one_process (update_process_times)
00010001 2.882ms (+0.000ms): run_local_timers (update_process_times)
00010001 2.882ms (+0.000ms): raise_softirq (update_process_times)
00010001 2.882ms (+0.000ms): scheduler_tick (update_process_times)
00010001 2.883ms (+0.000ms): sched_clock (scheduler_tick)
00010001 2.883ms (+0.000ms): update_wall_time (do_timer)
00010001 2.883ms (+0.000ms): update_wall_time_one_tick (update_wall_time)
00010001 2.883ms (+0.000ms): profile_tick (timer_interrupt)
00010001 2.884ms (+0.000ms): profile_hook (profile_tick)
00010002 2.884ms (+0.000ms): notifier_call_chain (profile_hook)
00010001 2.884ms (+0.000ms): profile_hit (timer_interrupt)
00010001 2.884ms (+0.000ms): generic_note_interrupt (do_IRQ)
00010001 2.884ms (+0.000ms): end_8259A_irq (do_IRQ)
00010001 2.885ms (+0.000ms): enable_8259A_irq (do_IRQ)
00000001 2.886ms (+0.000ms): do_softirq (do_IRQ)
00000001 2.886ms (+0.000ms): __do_softirq (do_softirq)
00000001 2.886ms (+0.000ms): wake_up_process (do_softirq)
00000001 2.886ms (+0.000ms): try_to_wake_up (wake_up_process)
00000001 2.886ms (+0.000ms): task_rq_lock (try_to_wake_up)
00000002 2.886ms (+0.000ms): activate_task (try_to_wake_up)
00000002 2.886ms (+0.000ms): sched_clock (activate_task)
00000002 2.887ms (+0.000ms): recalc_task_prio (activate_task)
00000002 2.887ms (+0.000ms): effective_prio (recalc_task_prio)
00000002 2.887ms (+0.000ms): enqueue_task (activate_task)
00000001 2.887ms (+0.000ms): preempt_schedule (try_to_wake_up)
00000001 2.888ms (+0.000ms): sub_preempt_count (do_IRQ)
00000001 2.888ms (+0.000ms): update_max_trace (check_preempt_timing)

Thomas


2004-09-02 13:22:12

by Ingo Molnar

Subject: Re: [patch] voluntary-preempt-2.6.9-rc1-bk4-Q9


* Thomas Charbonnel <[email protected]> wrote:

> With ACPI compiled in and booting with acpi=off (which again doesn't
> seem to be honoured), here's another weird one :

> 00010000 0.003ms (+0.000ms): timer_interrupt (generic_handle_IRQ_event)
> 00010001 0.003ms (+2.878ms): mark_offset_tsc (timer_interrupt)
> 00010001 2.882ms (+0.000ms): do_timer (timer_interrupt)

do you have the NMI watchdog enabled? That could help us debug this.
Enabling the APIC/IO_APIC and using nmi_watchdog=1 would be the ideal
solution - if that doesn't work then nmi_watchdog=2 would be fine too,
but remove this code from arch/i386/kernel/nmi.c:

	if (nmi_watchdog == NMI_LOCAL_APIC)
		nmi_hz = 1;

(otherwise nmi_watchdog=2 would result in one NMI per second, not enough
to shed light on the above 2.8 msec latency.)

Ingo

2004-09-02 14:39:00

by Thomas Charbonnel

Subject: Re: [patch] voluntary-preempt-2.6.9-rc1-bk4-Q9

Ingo Molnar wrote :

> do you have the NMI watchdog enabled? That could help us debug this.
> Enabling the APIC/IO_APIC and using nmi_watchdog=1 would be the ideal
> solution - if that doesn't work then nmi_watchdog=2 would be fine too,
> but remove this code from arch/i386/kernel/nmi.c:
>
> if (nmi_watchdog == NMI_LOCAL_APIC)
> nmi_hz = 1;
>
> (otherwise nmi_watchdog=2 would result in one NMI per second, not enough
> to shed light on the above 2.8 msec latency.)
>
> Ingo

nmi_watchdog=1 worked fine.
Here's a trace (there seems to be a counter overflow):

preemption latency trace v1.0.5 on 2.6.9-rc1-VP-Q9
--------------------------------------------------
latency: 539 us, entries: 38 (38)
-----------------
| task: swapper/0, uid:0 nice:0 policy:0 rt_prio:0
-----------------
=> started at: do_IRQ+0x19/0x190
=> ended at: do_IRQ+0x14a/0x190
=======>
00010000 0.000ms (+0.000ms): do_IRQ (common_interrupt)
00010000 0.000ms (+0.000ms): do_IRQ (default_idle)
00010000 0.000ms (+0.000ms): do_IRQ (<00000000>)
00010001 0.000ms (+0.002ms): mask_and_ack_8259A (do_IRQ)
00010001 0.002ms (+0.000ms): generic_redirect_hardirq (do_IRQ)
00010000 0.002ms (+0.000ms): generic_handle_IRQ_event (do_IRQ)
00010000 0.002ms (+0.000ms): timer_interrupt (generic_handle_IRQ_event)
00010001 0.003ms (+0.005ms): mark_offset_tsc (timer_interrupt)
00010001 0.008ms (+0.000ms): do_timer (timer_interrupt)
00010001 0.008ms (+0.000ms): update_process_times (do_timer)
00010001 0.008ms (+0.000ms): update_one_process (update_process_times)
00010001 0.009ms (+0.000ms): run_local_timers (update_process_times)
00010001 0.009ms (+0.000ms): raise_softirq (update_process_times)
00010001 0.009ms (+0.000ms): scheduler_tick (update_process_times)
00010001 0.009ms (+0.000ms): sched_clock (scheduler_tick)
00010001 0.009ms (+0.000ms): update_wall_time (do_timer)
00010001 0.009ms (+0.000ms): update_wall_time_one_tick (update_wall_time)
00010001 0.010ms (+0.000ms): generic_note_interrupt (do_IRQ)
00010001 0.010ms (+0.000ms): end_8259A_irq (do_IRQ)
00010001 0.010ms (+0.000ms): enable_8259A_irq (do_IRQ)
00000001 0.011ms (+0.524ms): do_softirq (do_IRQ)
00000001 0.535ms (+0.000ms): do_nmi (mcount)
00010001 0.536ms (+0.000ms): profile_tick (nmi_watchdog_tick)
00010001 0.536ms (+0.000ms): profile_hook (profile_tick)
00010002 0.536ms (+0.000ms): notifier_call_chain (profile_hook)
00010001 0.537ms (+689953.062ms): profile_hit (nmi_watchdog_tick)
00000002 689953.599ms (+1.523ms): sched_clock (activate_task)
00000001 0.537ms (+0.000ms): wake_up_process (do_softirq)
00000001 0.537ms (+0.000ms): try_to_wake_up (wake_up_process)
00000001 0.537ms (+0.000ms): task_rq_lock (try_to_wake_up)
00000002 0.537ms (+0.000ms): activate_task (try_to_wake_up)
00000002 0.538ms (+0.000ms): sched_clock (activate_task)
00000002 0.538ms (+0.000ms): recalc_task_prio (activate_task)
00000002 0.538ms (+0.000ms): effective_prio (recalc_task_prio)
00000002 0.538ms (+0.000ms): enqueue_task (activate_task)
00000001 0.538ms (+0.000ms): preempt_schedule (try_to_wake_up)
00000001 0.539ms (+0.000ms): sub_preempt_count (do_IRQ)
00000001 0.539ms (+0.000ms): update_max_trace (check_preempt_timing)

Thomas


2004-09-02 20:28:57

by Mark_H_Johnson

Subject: Re: [patch] voluntary-preempt-2.6.9-rc1-bk4-Q9

>I've released the -Q9 patch:
>...

Built a kernel with -Q9 plus the following additional patches:
... email saved into mark-offset-tsc-mcount.patch ...
... email saved into ens001.patch ...
... own patch to add mcount() calls to sched.c ...

Ran my 25 minute test and collected 71 traces > 500 usec. The following
has a summary of the results.

[1] No cascade traces - several runs with this result - we should be OK
here.

[2] rtl8139 still showing some traces due to my use of
netdev_backlog_granularity = 8. Had severe problems when it is 1.

[3] I read what you said about hardware overhead, but the data seems
to show latencies getting to "__switch_to" may be buried in some
inline functions. A typical trace looks something like this...

00000002 0.003ms (+0.000ms): dummy_switch_tasks (schedule)
00000002 0.003ms (+0.000ms): schedule (worker_thread)
00000002 0.003ms (+0.000ms): schedule (worker_thread)
00000002 0.004ms (+0.000ms): schedule (worker_thread)
00000002 0.004ms (+0.000ms): schedule (worker_thread)
00000002 0.004ms (+0.000ms): schedule (worker_thread)
00000002 0.004ms (+0.274ms): schedule (worker_thread)
04000002 0.279ms (+0.000ms): __switch_to (schedule)

dummy_switch_tasks refers to a function I added / called right after
the label switch_tasks (in sched.c). The mcount() traces that follow
are basically at each step leading up to the call to context_switch.
Since context_switch is static inline, I assume it is not traced -
please confirm. I am considering adding mcount() calls inside
context_switch to see if there is a step that has some long
duration.
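
(for reference, such a marker is roughly the following - a hypothetical
sketch, not the exact patch - an empty, non-inlined function called right
after the switch_tasks: label so the call shows up as a timestamped entry
in the mcount trace:)

	/* kernel/sched.c: empty marker function, kept out-of-line so that
	 * the mcount trace records a call at this exact point */
	static void __attribute__((noinline)) dummy_switch_tasks(void)
	{
	}

	/* ... inside schedule(): ... */
	switch_tasks:
		dummy_switch_tasks();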

[4] The go_idle path is also one that appears to have some long
latencies. The following is an example - the trace of dummy_go_idle
refers to the label named go_idle.

00000002 0.002ms (+0.000ms): dummy_go_idle (schedule)
00000002 0.002ms (+0.060ms): schedule (io_schedule)
00000002 0.063ms (+0.069ms): load_balance_newidle (schedule)
00000002 0.133ms (+0.074ms): find_busiest_group (load_balance_newidle)
00000002 0.207ms (+0.034ms): find_next_bit (find_busiest_group)
00000002 0.242ms (+0.039ms): find_next_bit (find_busiest_group)
00000002 0.281ms (+0.070ms): find_busiest_queue (load_balance_newidle)
00000002 0.351ms (+0.071ms): find_next_bit (find_busiest_queue)
00000002 0.422ms (+0.069ms): double_lock_balance (load_balance_newidle)
00000003 0.492ms (+0.070ms): move_tasks (load_balance_newidle)
00010003 0.563ms (+0.000ms): do_nmi (move_tasks)
00010003 0.563ms (+0.002ms): do_nmi (del_timer_sync)
00010003 0.566ms (+0.000ms): profile_tick (nmi_watchdog_tick)
00010003 0.566ms (+0.000ms): profile_hook (profile_tick)
00010003 0.567ms (+0.000ms): read_lock (profile_hook)
00010004 0.567ms (+0.000ms): notifier_call_chain (profile_hook)
00010003 0.567ms (+0.001ms): profile_hit (nmi_watchdog_tick)
00000003 0.569ms (+0.000ms): find_next_bit (move_tasks)
00000003 0.569ms (+0.000ms): dequeue_task (move_tasks)
00000003 0.570ms (+0.000ms): enqueue_task (move_tasks)
00000003 0.570ms (+0.000ms): resched_task (move_tasks)
00000003 0.571ms (+0.000ms): find_next_bit (move_tasks)
00000003 0.571ms (+0.000ms): dequeue_task (move_tasks)
00000003 0.571ms (+0.000ms): enqueue_task (move_tasks)
00000003 0.572ms (+0.000ms): resched_task (move_tasks)
00000002 0.572ms (+0.000ms): preempt_schedule (load_balance_newidle)

We had 500 usec accumulation of latency through several function calls.

[5] mark_offset_tsc - some steps in that sequence are generating
some long latencies. For example:

04010003 0.011ms (+0.000ms): mark_offset_tsc (timer_interrupt)
04010003 0.011ms (+0.000ms): mark_offset_tsc (timer_interrupt)
04010003 0.011ms (+0.000ms): spin_lock (mark_offset_tsc)
04010004 0.011ms (+0.137ms): mark_offset_tsc (timer_interrupt)
04010004 0.149ms (+0.000ms): mark_offset_tsc (timer_interrupt)
04010004 0.149ms (+0.000ms): spin_lock (mark_offset_tsc)
04010005 0.149ms (+0.144ms): mark_offset_tsc (timer_interrupt)
04010005 0.294ms (+0.004ms): mark_offset_tsc (timer_interrupt)
04010005 0.298ms (+0.003ms): mark_offset_tsc (timer_interrupt)
04010005 0.301ms (+0.000ms): mark_offset_tsc (timer_interrupt)
04010005 0.301ms (+0.000ms): mark_offset_tsc (timer_interrupt)
04010004 0.301ms (+0.000ms): mark_offset_tsc (timer_interrupt)
04010004 0.301ms (+0.073ms): mark_offset_tsc (timer_interrupt)
04010004 0.375ms (+0.000ms): mark_offset_tsc (timer_interrupt)
04010003 0.375ms (+0.000ms): timer_interrupt (generic_handle_IRQ_event)

I didn't see any feedback from the table I provided previously.
Is this data helpful or should I take out the patch?

[6] spin_lock - may just be an SMP lock problem but here's a trace
I don't recall seeing previously.

00000002 0.008ms (+0.000ms): snd_ensoniq_trigger (snd_pcm_do_stop)
00000002 0.008ms (+0.344ms): spin_lock (snd_ensoniq_trigger)
00010003 0.353ms (+0.015ms): do_nmi (snd_ensoniq_trigger)
00010003 0.368ms (+0.006ms): do_nmi (update_one_process)
00010003 0.375ms (+0.000ms): profile_tick (nmi_watchdog_tick)
00010003 0.375ms (+0.000ms): profile_hook (profile_tick)
00010003 0.375ms (+0.121ms): read_lock (profile_hook)
00010004 0.496ms (+0.000ms): notifier_call_chain (profile_hook)
00010003 0.497ms (+0.068ms): profile_hit (nmi_watchdog_tick)
00000002 0.566ms (+0.000ms): snd_pcm_post_stop (snd_pcm_action_single)

[7] Not sure what to call it, I don't recall seeing this type of trace
before either.

=> started at: __spin_lock_irqsave+0x39/0x90
=> ended at: as_work_handler+0x5c/0xa0
=======>
00000001 0.000ms (+0.000ms): __spin_lock_irqsave (spin_lock_irq)
00000001 0.000ms (+0.000ms): generic_enable_irq (ide_do_request)
00000001 0.000ms (+0.000ms): __spin_lock_irqsave (generic_enable_irq)
00000002 0.000ms (+0.000ms): unmask_IO_APIC_irq (generic_enable_irq)
00000002 0.000ms (+0.000ms): __spin_lock_irqsave (unmask_IO_APIC_irq)
00000003 0.001ms (+0.000ms): __unmask_IO_APIC_irq (unmask_IO_APIC_irq)
00000003 0.001ms (+0.066ms): __modify_IO_APIC_irq (__unmask_IO_APIC_irq)
00000001 0.067ms (+0.001ms): smp_apic_timer_interrupt (as_work_handler)
00010001 0.069ms (+0.087ms): profile_tick (smp_apic_timer_interrupt)
00010001 0.157ms (+0.000ms): profile_hook (profile_tick)
00010001 0.157ms (+0.069ms): read_lock (profile_hook)
00010002 0.227ms (+0.000ms): notifier_call_chain (profile_hook)
00010001 0.227ms (+0.069ms): profile_hit (smp_apic_timer_interrupt)
00010001 0.297ms (+0.000ms): update_process_times (smp_apic_timer_interrupt)
00010001 0.297ms (+0.069ms): update_one_process (update_process_times)
00010001 0.367ms (+0.000ms): run_local_timers (update_process_times)
00010001 0.367ms (+0.069ms): raise_softirq (update_process_times)
00010001 0.437ms (+0.000ms): scheduler_tick (update_process_times)
00010001 0.437ms (+0.070ms): sched_clock (scheduler_tick)
00020001 0.507ms (+0.000ms): do_nmi (scheduler_tick)
00020001 0.508ms (+0.002ms): do_nmi (del_timer_sync)
00020001 0.511ms (+0.000ms): profile_tick (nmi_watchdog_tick)
00020001 0.511ms (+0.000ms): profile_hook (profile_tick)
00020001 0.511ms (+0.065ms): read_lock (profile_hook)
00020002 0.577ms (+0.000ms): notifier_call_chain (profile_hook)
00020001 0.577ms (+0.000ms): profile_hit (nmi_watchdog_tick)
00010001 0.578ms (+0.000ms): spin_lock (scheduler_tick)

[8] exit_mmap - there are a few traces referring to code in or
called by exit_mmap. Here's an example.

=> started at: cond_resched_lock+0x6b/0x110
=> ended at: exit_mmap+0x155/0x1f0
=======>
00000001 0.000ms (+0.000ms): touch_preempt_timing (cond_resched_lock)
00000001 0.000ms (+0.000ms): __bitmap_weight (unmap_vmas)
00000001 0.000ms (+0.000ms): vm_acct_memory (exit_mmap)
00000001 0.001ms (+0.629ms): clear_page_tables (exit_mmap)
00000001 0.631ms (+0.000ms): flush_tlb_mm (exit_mmap)
00000001 0.631ms (+0.000ms): free_pages_and_swap_cache (exit_mmap)

I will send the traces separately (not to linux-kernel) for analysis.

--Mark

2004-09-02 21:57:59

by Ingo Molnar

Subject: [patch] voluntary-preempt-2.6.9-rc1-bk4-R0


i've released the -R0 patch:

http://redhat.com/~mingo/voluntary-preempt/voluntary-preempt-2.6.9-rc1-bk4-R0

ontop of:

http://redhat.com/~mingo/voluntary-preempt/diff-bk-040828-2.6.8.1.bz2

i've given up on the netdev_backlog_granularity approach, and as a
replacement i've modified specific network drivers to return at a safe
point if softirq preemption is requested. This gives the same end result
but is more robust. For the time being i've fixed 8139too.c and e100.c.
(will fix up other drivers too as latencies get reported)
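
(the shape of the change, as a rough sketch rather than the actual
8139too.c/e100.c diff - rx_pending(), rtl8139_rx_one() and
softirq_preemption_requested() are placeholder names - is to bail out of
the NAPI ->poll() RX loop at a packet boundary and let NAPI re-poll:)

	static int rtl8139_poll(struct net_device *dev, int *budget)
	{
		int done = 1;

		while (*budget > 0 && rx_pending(dev)) {
			rtl8139_rx_one(dev);	/* process a single packet */
			(*budget)--;
			if (softirq_preemption_requested()) {
				done = 0;	/* stop at a safe point, */
				break;		/* NAPI will poll us again */
			}
		}
		if (done)
			netif_rx_complete(dev);
		return !done;
	}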

this should fix the crash reported by P.O. Gaillard, and it should solve
the packet delay/loss issues reported by Mark H Johnson. I cannot see
any problems on my rtl8139 testbox anymore.

Ingo

2004-09-02 22:19:49

by Lee Revell

Subject: Re: [patch] voluntary-preempt-2.6.9-rc1-bk4-R0

On Thu, 2004-09-02 at 18:14, Ingo Molnar wrote:
> * Lee Revell <[email protected]> wrote:
>
> > > i've given up on the netdev_backlog_granularity approach, and as a
> > > replacement i've modified specific network drivers to return at a safe
> > > point if softirq preemption is requested.
> >
> > Makes sense, netdev_max_backlog never made a difference on my system
> > (via-rhine driver).
>
> via-rhine does RX processing from the hardirq handler, this codepath is
> harder to break up. The NAPI ->poll functions used by e100 and 8139too
> are much easier to break up because RX throttling and re-trying is a
> basic property of NAPI.
>

What I meant was, I did not consider the latencies from via-rhine to be
a problem. The other posters reported 300-500 usec latencies in the
network driver; I did not see this.

Lee

2004-09-02 22:24:31

by Ingo Molnar

Subject: Re: [patch] voluntary-preempt-2.6.9-rc1-bk4-R0


* Lee Revell <[email protected]> wrote:

> > i've given up on the netdev_backlog_granularity approach, and as a
> > replacement i've modified specific network drivers to return at a safe
> > point if softirq preemption is requested.
>
> Makes sense, netdev_max_backlog never made a difference on my system
> (via-rhine driver).

via-rhine does RX processing from the hardirq handler, this codepath is
harder to break up. The NAPI ->poll functions used by e100 and 8139too
are much easier to break up because RX throttling and re-trying is a
basic property of NAPI.

Ingo

2004-09-02 22:47:44

by Ingo Molnar

Subject: Re: [patch] voluntary-preempt-2.6.9-rc1-bk4-Q9


* [email protected] <[email protected]> wrote:

> [1] No cascade traces - several runs with this result - we should be
> OK here.

yeah, good. I think my analysis was correct: softirqd got delayed by
higher prio tasks and this caused a backlog in timer processing, hence
the 'storm' of cascade() calls. Now we may reschedule between cascade()
calls.

> [2] rtl8139 still showing some traces due to my use of
> netdev_backlog_granularity = 8. Had severe problems when it is 1.

i've just released -R0 which should fix these issues in theory.

> [3] I read what you said about hardware overhead, but the data seems
> to show latencies getting to "__switch_to" may be buried in some
> inline functions. A typical trace looks something like this...

>
> 00000002 0.003ms (+0.000ms): dummy_switch_tasks (schedule)
> 00000002 0.003ms (+0.000ms): schedule (worker_thread)
> 00000002 0.003ms (+0.000ms): schedule (worker_thread)
> 00000002 0.004ms (+0.000ms): schedule (worker_thread)
> 00000002 0.004ms (+0.000ms): schedule (worker_thread)
> 00000002 0.004ms (+0.000ms): schedule (worker_thread)
> 00000002 0.004ms (+0.274ms): schedule (worker_thread)
> 04000002 0.279ms (+0.000ms): __switch_to (schedule)
>
> dummy_switch_tasks refers to a function I added / called right after
> the label switch_tasks (in sched.c). The mcount() traces that follow
> are basically at each step leading up to the call to context_switch.
> Since context_switch is static inline, I assume it is not traced -
> please confirm. [...]

correct, inline functions are not traced. (Also, 'tail-merged' function
calls are not traced either - this can happen if a function calls
another one right at the end.)

> [...] I am considering adding mcount() calls inside context_switch to
> see if there is a step that has some long duration.

yeah. If you can isolate the overhead down to a specific line of code
that would implicate that particular piece of code quite heavily. But if
the latencies happen all over the place then it means that it's not
the code that is at fault, it's just an innocent bystander.

> [4] The go_idle path is also one that appears to have some long
> latencies. The following is an example - the trace of dummy_go_idle
> refers to the label named go_idle.
>
> 00000002 0.002ms (+0.000ms): dummy_go_idle (schedule)
> 00000002 0.002ms (+0.060ms): schedule (io_schedule)
> 00000002 0.063ms (+0.069ms): load_balance_newidle (schedule)
> 00000002 0.133ms (+0.074ms): find_busiest_group (load_balance_newidle)
> 00000002 0.207ms (+0.034ms): find_next_bit (find_busiest_group)
> 00000002 0.242ms (+0.039ms): find_next_bit (find_busiest_group)
> 00000002 0.281ms (+0.070ms): find_busiest_queue (load_balance_newidle)
> 00000002 0.351ms (+0.071ms): find_next_bit (find_busiest_queue)
> 00000002 0.422ms (+0.069ms): double_lock_balance (load_balance_newidle)
> 00000003 0.492ms (+0.070ms): move_tasks (load_balance_newidle)

this is as if the CPU executed everything in 'slow motion'. E.g. the
cache being disabled could be one such reason - or some severe DMA or
other bus traffic.

the code being executed is completely benign - no locking, just
straightforward processing. No way can they take 60-70 usecs!

as if certain data structures were uncached or something like that.
Although i think this theory can be excluded, because the same code
doesn't execute slowly in other traces (and most likely not in the
normal, non-traced execution either).

in theory code tracing itself could perhaps generate some sort of bad
cache pattern that hits a sweet spot of the CPU. (very weak theory but
that's all that's left ...) Could you disable tracing (but keep the
preemption-timing on) and see whether you still get these 500+ usec
latencies reported? Switching /proc/sys/kernel/tracing_enabled off
should be enough.

> [5] mark_offset_tsc - some steps in that sequence are generating
> some long latencies. For example:
>
> 04010003 0.011ms (+0.000ms): mark_offset_tsc (timer_interrupt)
> 04010003 0.011ms (+0.000ms): mark_offset_tsc (timer_interrupt)
> 04010003 0.011ms (+0.000ms): spin_lock (mark_offset_tsc)
> 04010004 0.011ms (+0.137ms): mark_offset_tsc (timer_interrupt)
> 04010004 0.149ms (+0.000ms): mark_offset_tsc (timer_interrupt)
> 04010004 0.149ms (+0.000ms): spin_lock (mark_offset_tsc)
> 04010005 0.149ms (+0.144ms): mark_offset_tsc (timer_interrupt)
> 04010005 0.294ms (+0.004ms): mark_offset_tsc (timer_interrupt)
> 04010005 0.298ms (+0.003ms): mark_offset_tsc (timer_interrupt)
> 04010005 0.301ms (+0.000ms): mark_offset_tsc (timer_interrupt)
> 04010005 0.301ms (+0.000ms): mark_offset_tsc (timer_interrupt)
> 04010004 0.301ms (+0.000ms): mark_offset_tsc (timer_interrupt)
> 04010004 0.301ms (+0.073ms): mark_offset_tsc (timer_interrupt)
> 04010004 0.375ms (+0.000ms): mark_offset_tsc (timer_interrupt)
> 04010003 0.375ms (+0.000ms): timer_interrupt (generic_handle_IRQ_event)
>
> I didn't see any feedback from the table I provided previously.
> Is this data helpful or should I take out the patch?

sent a mail separately, replying to your previous table. I think it's
useful, let's try to pin down what code there is between #03-#04.

> [6] spin_lock - may just be an SMP lock problem but here's a trace
> I don't recall seeing previously.
>
> 00000002 0.008ms (+0.000ms): snd_ensoniq_trigger (snd_pcm_do_stop)
> 00000002 0.008ms (+0.344ms): spin_lock (snd_ensoniq_trigger)
> 00010003 0.353ms (+0.015ms): do_nmi (snd_ensoniq_trigger)
> 00010003 0.368ms (+0.006ms): do_nmi (update_one_process)
> 00010003 0.375ms (+0.000ms): profile_tick (nmi_watchdog_tick)
> 00010003 0.375ms (+0.000ms): profile_hook (profile_tick)
> 00010003 0.375ms (+0.121ms): read_lock (profile_hook)
> 00010004 0.496ms (+0.000ms): notifier_call_chain (profile_hook)
> 00010003 0.497ms (+0.068ms): profile_hit (nmi_watchdog_tick)
> 00000002 0.566ms (+0.000ms): snd_pcm_post_stop (snd_pcm_action_single)

it's hard to tell whether the overhead is the 'magic' overhead or
spinning on the lock. I've attached trace-spin.patch that shows us how
many times a spinlock had to spin before it got the lock. It will also
tell the precise time it took to execute the spinning function.

if we looped zero times and still it's high overhead then this again
implicates the 'magic' hw issue that interferes.
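
(what the patch measures, as a rough sketch - trace_special() stands in
for whatever logging call the real trace-spin.patch uses:)

	static void traced_spin_lock(spinlock_t *lock)
	{
		unsigned long long t0 = get_cycles();
		unsigned int loops = 0;

		while (!spin_trylock(lock))
			loops++;
		/* report the spin count and the cycles spent spinning */
		trace_special(loops, (unsigned int)(get_cycles() - t0), 0);
	}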

> [7] Not sure what to call it, I don't recall seeing this type of trace
> before either.
>
> => started at: __spin_lock_irqsave+0x39/0x90
> => ended at: as_work_handler+0x5c/0xa0
> =======>
> 00000001 0.000ms (+0.000ms): __spin_lock_irqsave (spin_lock_irq)
> 00000001 0.000ms (+0.000ms): generic_enable_irq (ide_do_request)
> 00000001 0.000ms (+0.000ms): __spin_lock_irqsave (generic_enable_irq)
> 00000002 0.000ms (+0.000ms): unmask_IO_APIC_irq (generic_enable_irq)
> 00000002 0.000ms (+0.000ms): __spin_lock_irqsave (unmask_IO_APIC_irq)
> 00000003 0.001ms (+0.000ms): __unmask_IO_APIC_irq (unmask_IO_APIC_irq)
> 00000003 0.001ms (+0.066ms): __modify_IO_APIC_irq (__unmask_IO_APIC_irq)
> 00000001 0.067ms (+0.001ms): smp_apic_timer_interrupt (as_work_handler)
> 00010001 0.069ms (+0.087ms): profile_tick (smp_apic_timer_interrupt)
> 00010001 0.157ms (+0.000ms): profile_hook (profile_tick)
> 00010001 0.157ms (+0.069ms): read_lock (profile_hook)
> 00010002 0.227ms (+0.000ms): notifier_call_chain (profile_hook)
> 00010001 0.227ms (+0.069ms): profile_hit (smp_apic_timer_interrupt)
> 00010001 0.297ms (+0.000ms): update_process_times (smp_apic_timer_interrupt)
> 00010001 0.297ms (+0.069ms): update_one_process (update_process_times)
> 00010001 0.367ms (+0.000ms): run_local_timers (update_process_times)
> 00010001 0.367ms (+0.069ms): raise_softirq (update_process_times)
> 00010001 0.437ms (+0.000ms): scheduler_tick (update_process_times)
> 00010001 0.437ms (+0.070ms): sched_clock (scheduler_tick)
> 00020001 0.507ms (+0.000ms): do_nmi (scheduler_tick)
> 00020001 0.508ms (+0.002ms): do_nmi (del_timer_sync)
> 00020001 0.511ms (+0.000ms): profile_tick (nmi_watchdog_tick)
> 00020001 0.511ms (+0.000ms): profile_hook (profile_tick)
> 00020001 0.511ms (+0.065ms): read_lock (profile_hook)
> 00020002 0.577ms (+0.000ms): notifier_call_chain (profile_hook)
> 00020001 0.577ms (+0.000ms): profile_hit (nmi_watchdog_tick)
> 00010001 0.578ms (+0.000ms): spin_lock (scheduler_tick)

this too seems to be 'magic' overhead just randomly in the middle of a
commonly executing function.

> [8] exit_mmap - there are a few traces referring to code in or
> called by exit_mmap. Here's an example.
>
> => started at: cond_resched_lock+0x6b/0x110
> => ended at: exit_mmap+0x155/0x1f0
> =======>
> 00000001 0.000ms (+0.000ms): touch_preempt_timing (cond_resched_lock)
> 00000001 0.000ms (+0.000ms): __bitmap_weight (unmap_vmas)
> 00000001 0.000ms (+0.000ms): vm_acct_memory (exit_mmap)
> 00000001 0.001ms (+0.629ms): clear_page_tables (exit_mmap)
> 00000001 0.631ms (+0.000ms): flush_tlb_mm (exit_mmap)
> 00000001 0.631ms (+0.000ms): free_pages_and_swap_cache (exit_mmap)

this should be a known latency: clear_page_tables() runs with preemption
disabled (a spinlock held).

Ingo


Attachments:
trace-spin.patch (1.23 kB)

2004-09-02 23:28:34

by Lee Revell

Subject: Re: [patch] voluntary-preempt-2.6.9-rc1-bk4-Q8

On Thu, 2004-09-02 at 03:15, Ingo Molnar wrote:
> * Lee Revell <[email protected]> wrote:
>
> > Here are traces of a 145, 190, and 217 usec latencies in
> > netif_receive_skb:
> >
> > http://krustophenia.net/testresults.php?dataset=2.6.9-rc1-Q6#/var/www/2.6.9-rc1-Q6/trace2.txt
> > http://krustophenia.net/testresults.php?dataset=2.6.9-rc1-Q6#/var/www/2.6.9-rc1-Q6/trace3.txt
> > http://krustophenia.net/testresults.php?dataset=2.6.9-rc1-Q6#/var/www/2.6.9-rc1-Q6/trace4.txt
>
> these all seem to be single-packet processing latencies - it would be
> quite hard to make those codepaths preemptible.
>
> i'd suggest to turn off things like netfilter and ip_conntrack (and
> other optional networking features that show up in the trace), they can
> only increase latency:
>

Do you see any optional networking features in the trace (other than
ip_conntrack)? I was under the impression that I had everything
optional disabled.

Lee

2004-09-02 23:35:18

by Lee Revell

Subject: Re: [patch] voluntary-preempt-2.6.9-rc1-bk4-Q8

On Thu, 2004-09-02 at 19:28, Ingo Molnar wrote:
> * Lee Revell <[email protected]> wrote:
>
> > Do you see any optional networking features in the trace (other than
> > ip_conntrack)? I was under the impression that I had everything
> > optional disabled.
>
> yeah, it seems to be only ip_conntrack and netfilter (which conntrack
> relies on).
>

FWIW these seem to only slow down the single packet path by about 10%.
This is pretty good.

Lee

2004-09-02 23:27:47

by Ingo Molnar

Subject: Re: [patch] voluntary-preempt-2.6.9-rc1-bk4-Q8


* Lee Revell <[email protected]> wrote:

> Do you see any optional networking features in the trace (other than
> ip_conntrack)? I was under the impression that I had everything
> optional disabled.

yeah, it seems to be only ip_conntrack and netfilter (which conntrack
relies on).

Ingo

2004-09-03 00:29:11

by Lee Revell

Subject: Re: [patch] voluntary-preempt-2.6.9-rc1-bk4-R0

On Thu, 2004-09-02 at 18:14, Ingo Molnar wrote:
> * Lee Revell <[email protected]> wrote:
>
> > > i've given up on the netdev_backlog_granularity approach, and as a
> > > replacement i've modified specific network drivers to return at a safe
> > > point if softirq preemption is requested.
> >
> > Makes sense, netdev_max_backlog never made a difference on my system
> > (via-rhine driver).
>
> via-rhine does RX processing from the hardirq handler, this codepath is
> harder to break up. The NAPI ->poll functions used by e100 and 8193too
> are much easier to break up because RX throttling and re-trying is a
> basic property of NAPI.
>

Judging from these graphs, all of the latency issues are solved, at
least on my UP hardware, and the latencies seem to be getting very close
to the limits of what the hardware can do:

http://krustophenia.net/testresults.php?dataset=2.6.9-rc1-Q6#/var/www/2.6.9-rc1-Q6/jack-test-1

The worst case latency is only 160 usecs, and the vast majority fall
into the pattern from 0 to 120 usecs. All of the spikes above 120 are
almost certainly caused by netdev_max_backlog. However these are not
long enough to cause any problems with my workload; the lowest practical
latency for audio work is around 0.66 ms (32 frames at 48khz).

Lee


2004-09-03 01:23:53

by Rusty Russell

Subject: Re: [patch] voluntary-preempt-2.6.9-rc1-bk4-Q8

On Thu, 2004-09-02 at 17:46, Ingo Molnar wrote:
> Rusty, what's going on in this code?

Good question! Not my code, fortunately...

> #1: we kmalloc(GFP_KERNEL) with a spinlock held and softirqs off - ouch!
>
> #2: why does it do the kmalloc() anyway? It could store the position in
> the seq pointer just fine. No need to alloc an integer pointer to
> store the value in ...
>
> #3: to fix the latency, ct_seq_show() could take the ip_conntrack_lock
> and could check the current index against ip_conntrack_htable_size.
> There's not much point in making this non-preemptible, there's
> a 4K granularity anyway.

The code tries to put an entire hash bucket into a single seq_read().
That's not going to work if the hash is really deep. On the other hand,
not much will, and it's simple.

The lock is only needed on traversing: htable_size can't change after
init anyway, so it should be done in ct_seq_show.

Fix should be fairly simple...
Rusty.
--
Anyone who quotes me in their signature is an idiot -- Rusty Russell

2004-09-03 03:04:31

by Lee Revell

Subject: Re: [patch] voluntary-preempt-2.6.9-rc1-bk4-R0

On Thu, 2004-09-02 at 17:57, Ingo Molnar wrote:
> i've released the -R0 patch:
>
> http://redhat.com/~mingo/voluntary-preempt/voluntary-preempt-2.6.9-rc1-bk4-R0
>
> ontop of:
>
> http://redhat.com/~mingo/voluntary-preempt/diff-bk-040828-2.6.8.1.bz2
>
> i've given up on the netdev_backlog_granularity approach, and as a
> replacement i've modified specific network drivers to return at a safe
> point if softirq preemption is requested.

Makes sense, netdev_max_backlog never made a difference on my system
(via-rhine driver).

If you read that Microsoft paper I posted a link to earlier, they
describe all of the main categories of latencies we have dealt with.
They give special mention to network driver DPCs/softirqs.

The worst offender was a driver that used a DPC/softirq to reset the
card if there was no traffic in 10 seconds, to work around a hardware
lockup bug in some versions. Since the reset path was never designed to
be fast this caused problems. They also mention a similar problem to
the one that still exists with the via-rhine driver (fixed in -mm) where
it uses a DPC to poll for link status.

Lee



2004-09-03 03:18:21

by Eric St-Laurent

Subject: Re: [patch] voluntary-preempt-2.6.9-rc1-bk4-R0

> Judging from these graphs, all of the latency issues are solved, at
> least on my UP hardware, and the latencies seem to be getting very close
> to the limits of what the hardware can do:
>
> http://krustophenia.net/testresults.php?dataset=2.6.9-rc1-Q6#/var/www/2.6.9-rc1-Q6/jack-test-1
>
> The worst case latency is only 160 usecs, and the vast majority fall
> into the pattern from 0 to 120 usecs. All of the spikes above 120 are
> almost certainly caused by netdev_max_backlog. However these are not
> long enough to cause any problems with my workload; the lowest practical
> latency for audio work is around 0.66 ms (32 frames at 48khz).

Lee,

A few weeks ago you wrote that "the worst latency I was able to trigger
was 46 usecs", now it's 160 usecs.

Ingo has done much work on his patches since then.

Why is the worst latency higher now? I presume that the latency
measurement technique is more accurate and the 46 usecs figure was
inaccurate?

Ref: http://uwsg.indiana.edu/hypermail/linux/kernel/0407.3/0994.html

Best regards,

Eric St-Laurent


2004-09-03 03:24:35

by Ingo Molnar

Subject: Re: [patch] voluntary-preempt-2.6.9-rc1-bk4-Q9


* [email protected] <[email protected]> wrote:

> 00000002 0.003ms (+0.000ms): dummy_switch_tasks (schedule)
> 00000002 0.003ms (+0.000ms): schedule (worker_thread)
> 00000002 0.003ms (+0.000ms): schedule (worker_thread)
> 00000002 0.004ms (+0.000ms): schedule (worker_thread)
> 00000002 0.004ms (+0.000ms): schedule (worker_thread)
> 00000002 0.004ms (+0.000ms): schedule (worker_thread)
> 00000002 0.004ms (+0.274ms): schedule (worker_thread)
> 04000002 0.279ms (+0.000ms): __switch_to (schedule)

a quick suggestion: could you add this near the top of sched.c (below
the #include lines):

#define static

this will turn off all inlining and makes the scheduler-internal
functions visible. If there's any scheduler-internal overhead we should
see it. Maybe this is the CR3 flush (switch_mm()) - but 274 usecs is
still excessive ...

Ingo

2004-09-03 06:26:34

by Lee Revell

Subject: Re: [patch] voluntary-preempt-2.6.9-rc1-bk4-R0

On Thu, 2004-09-02 at 23:17, Eric St-Laurent wrote:
> > Judging from these graphs, all of the latency issues are solved, at
> > least on my UP hardware, and the latencies seem to be getting very close
> > to the limits of what the hardware can do:
> >
> > http://krustophenia.net/testresults.php?dataset=2.6.9-rc1-Q6#/var/www/2.6.9-rc1-Q6/jack-test-1
> >
> > The worst case latency is only 160 usecs, and the vast majority fall
> > into the pattern from 0 to 120 usecs. All of the spikes above 120 are
> > almost certainly caused by netdev_max_backlog. However these are not
> > long enough to cause any problems with my workload; the lowest practical
> > latency for audio work is around 0.66 ms (32 frames at 48khz).
>
> Lee,
>
> A few weeks ago you wrote that "the worst latency I was able to trigger
> was 46 usecs", now it's 160 usecs.
>
> Ingo has done much work on his patches since then.
>
> Why is the worst latency higher now? I presume that the latency
> measurement technique is more accurate and the 46 usecs figure was
> inaccurate?
>
> Ref: http://uwsg.indiana.edu/hypermail/linux/kernel/0407.3/0994.html
>

Yup, due to my incomplete understanding of the jackd code, my initial
measurements were measuring the time it took to run one process cycle
(basically a NOOP if there are no clients), rather than the actual time
jackd spent in poll() between cycles.

This did have the effect of measuring the scheduler latency, but I
believe it was being measured indirectly via cache effects - the longer
it had been since jackd last ran, the colder the cachelines touched by
the last cycle.

Since I am using a patch to jackd to measure these latencies, which will
be merged in the near future, it's more important for me that the patch
accurately reflect the latencies jackd users will see than it is for the
new, accurate results to be compatible with the old.

All datasets for -O and earlier use the old code:

http://krustophenia.net/testresults.php?dataset=2.6.8-rc3-O5

This one uses the initial version of the new code, measuring the time
jackd spends in poll(). The bimodal distribution is due to my sound
card having two different interrupt sources for capture and playback.
So one of the spikes represent the elapsed time between the capture
interrupt and the playback interrupt, and the other the time between the
playback interrupt and the next capture interrupt:

http://krustophenia.net/testresults.php?dataset=2.6.8.1-P0


-Q and later use the current method, which is like the above except the
second hump is discarded, as it is a function of the scheduling latency
and the period size rather than just the scheduling latency:

http://krustophenia.net/testresults.php?dataset=2.6.9-rc1-Q6

So, don't be fooled by the numbers: the newest version of the patch is
in fact the best. I have been meaning to go back and measure the
current patches with the old code but it's pretty low priority...

Lee




2004-09-03 06:36:46

by Ingo Molnar

Subject: Re: [patch] voluntary-preempt-2.6.9-rc1-bk4-R0


* Lee Revell <[email protected]> wrote:

> -Q and later use the current method, which is like the above except
> the second hump is discarded, as it is a function of the scheduling
> latency and the period size rather than just the scheduling latency:
>
> http://krustophenia.net/testresults.php?dataset=2.6.9-rc1-Q6
>
> So, don't be fooled by the numbers, the newest version of the patch is
> in fact the best. I have been meaning to go back and measure the
> current patches with the old code but it's pretty low priority...

it would be quite interesting to get a few charts of vanilla kernel
2.6.8.1 - especially if your measurement methodology has changed. There's not
much sense in re-testing older VP patches.

also, has the userspace workload you are using stayed constant during
all these tests?

Ingo

2004-09-03 06:49:19

by Lee Revell

Subject: Re: [patch] voluntary-preempt-2.6.9-rc1-bk4-R0

On Fri, 2004-09-03 at 02:36, Ingo Molnar wrote:
> * Lee Revell <[email protected]> wrote:
>
> > -Q and later use the current method, which is like the above except
> > the second hump is discarded, as it is a function of the scheduling
> > latency and the period size rather than just the scheduling latency:
> >
> > http://krustophenia.net/testresults.php?dataset=2.6.9-rc1-Q6
> >
> > So, don't be fooled by the numbers, the newest version of the patch is
> > in fact the best. I have been meaning to go back and measure the
> > current patches with the old code but it's pretty low priority...
>
> vanilla kernel 2.6.8.1 would be quite interesting to get a few charts of
> - especially if your measurement methodology has changed.

OK, I will give this a shot. Now that the VP patches are stabilizing I
will be doing more profiling. I also want to try the -mm kernel; it
has some interesting differences from the stock kernel. For example I
measured about a 10% improvement with the old method, which implies a
big performance gain.

> There's not
> much sense in re-testing older VP patches.
>

Yup, my thoughts exactly, this would just tell us what we already know,
that the latency gets better with each version.

> also, has the userspace workload you are using stayed constant during
> all these tests?
>

I am mostly just using normal desktop hacker workloads, web browsing,
email, builds. Lately I am using the box as a Samba server. At first,
I was stressing the system using every disk benchmark I could think of,
but it never seemed to affect the worst case and did not even change the
shape of the distribution much, so I don't bother. For all practical
purposes, it's impossible to change the shape of these graphs much by
stressing the system.

I am able to induce large latencies by using up all available swap with
make -j12 on a KDE program, and by pingflooding the broadcast address,
but these are pathological enough that I have not worried about them.

Lee

2004-09-03 07:00:23

by Ingo Molnar

Subject: Re: [patch] voluntary-preempt-2.6.9-rc1-bk4-R0


* Lee Revell <[email protected]> wrote:

> > vanilla kernel 2.6.8.1 would be quite interesting to get a few charts of
> > - especially if your measurement methodology has changed.
>
> OK, I will give this a shot. Now that the VP patches are stabilizing
> I will be doing more profiling. I also want to try the -mm kernel,
> this has some interesting differences from the stock kernel. For
> example I measured about a 10% improvement with the old method, which
> implies a big performance gain.

the -mm kernel used to have additional *-latency-fix patches that were
done based on the initial preemption-timing patch in -mm and partly
based on early VP discussions and findings. I recently reviewed and
merged the 2-3 missing ones into VP. Andrew has dropped these patches
meanwhile and i expect to submit the cleaner and more complete solution
that is in VP.

So i'd expect -mm to still perform better than vanilla (it usually
does), but if that big 10% difference in latencies doesn't show up
anymore i'd attribute it to the shuffling around of latency related
patches, not some genuine deficiency in -mm.

Ingo

2004-09-03 07:04:04

by Ingo Molnar

Subject: Re: [patch] voluntary-preempt-2.6.9-rc1-bk4-R0


another question - any objections against me rebasing the VP patch to
the current -mm tree and keeping it there exclusively until all possible
merges are done? It would probably be quite some work to create a complete
patch for the upstream or BK tree during that process, as small patches
start to flow in the VP => -mm => BK direction. Would any of the regular
VP users/testers be wary of using the -mm tree?

Ingo

2004-09-03 07:40:57

by Lee Revell

Subject: Re: [patch] voluntary-preempt-2.6.9-rc1-bk4-R0

On Fri, 2004-09-03 at 03:05, Ingo Molnar wrote:
> another question - any objections against me rebasing the VP patch to
> the current -mm tree and keeping it there exclusively until all possible
> merges are done? It would probably quite some work to create a complete
> patch for the upstream or BK tree during that process, as small patches
> start to flow in the VP => -mm => BK direction. Would any of the regular
> VP users/testers be wary to use the -mm tree?
>

None here. Assuming the SMP issues are resolved, the only
remaining issues on UP will probably be with various drivers; those
fixes should apply just as easily to -mm as to vanilla.

As far as I am concerned the VP patches are stable enough for the
audio-centric distros to start distributing VP kernel packages; these
will certainly be using the vanilla kernel. I think the PlanetCCRMA and
AGNULA people are planning to start distributing test VP-kernel packages
as soon as the patches stabilize. IIRC Nando is on vacation this week.

I will make an announcement on LAD that as of R0 the VP patches should
be stable and are ready for wider testing. You may want to wait until
after the initial slew of bug reports before rebasing VP against MM. I
suspect most of the problems will be driver specific, and most of the
fixes will apply equally to -mm and vanilla.

I have added Luke (AudioSlack), Free (AGNULA), and Nando (CCRMA) to the
cc: list. They would be in the best position to answer your question.

Lee


2004-09-03 07:49:00

by Free Ekanayaka

Subject: Re: [patch] voluntary-preempt-2.6.9-rc1-bk4-R0

|--==> "LR" == Lee Revell <[email protected]> writes:

LR> On Fri, 2004-09-03 at 03:05, Ingo Molnar wrote:
>>another question - any objections against me rebasing the VP patch to
>>the current -mm tree and keeping it there exclusively until all possible
>>merges are done? It would probably quite some work to create a complete
>>patch for the upstream or BK tree during that process, as small patches
>>start to flow in the VP => -mm => BK direction. Would any of the regular
>>VP users/testers be wary to use the -mm tree?
>>

LR> None here. Assuming the SMP issues are resolved, then the only
LR> remaining issues on UP will probably be with various drivers, those
LR> fixes should apply just as easily to -mm as vanilla.

LR> As far as I am concerned the VP patches are stable enough for the
LR> audio-centric distros to start distributing VP kernel packages, these
LR> will certainly be using the vanilla kernel. I think the PlanetCCRMA and
LR> AGNULA people are planning to start distributing test VP-kernel packages
LR> as soon as the patches stabilize. IIRC Nando is on vacation this week.

LR> I will make an announcement on LAD that as of R0 the VP patches should
LR> be stable and are ready for wider testing. You may want to wait until
LR> after the initial slew of bug reports before rebasing VP against MM. I
LR> suspect most of the problems will be driver specific, and most of the
LR> fixes will apply equally to -mm and vanilla.

LR> I have added Luke (AudioSlack), Free (AGNULA), and Nando (CCRMA) to the
LR> cc: list. They would be in the best position to answer your question.

Yes, you're right. We plan to provide test 2.6.x packages as soon as
patches stabilise. Please let me know if you have some recommendation
(configuration flags, additional patches, etc.).

Cheers,

Free


2004-09-03 08:07:30

by Lee Revell

[permalink] [raw]
Subject: Re: [patch] voluntary-preempt-2.6.9-rc1-bk4-R0

On Fri, 2004-09-03 at 03:50, Free Ekanayaka wrote:
> LR> As far as I am concerned the VP patches are stable enough for the
> LR> audio-centric distros to start distributing VP kernel packages, these
> LR> will certainly be using the vanilla kernel. I think the PlanetCCRMA and
> LR> AGNULA people are planning to start distributing test VP-kernel packages
> LR> as soon as the patches stabilize. IIRC Nando is on vacation this week.
>
> LR> I will make an announcement on LAD that as of R0 the VP patches should
> LR> be stable and are ready for wider testing. You may want to wait until
> LR> after the initial slew of bug reports before rebasing VP against MM. I
> LR> suspect most of the problems will be driver specific, and most of the
> LR> fixes will apply equally to -mm and vanilla.
>
> LR> I have added Luke (AudioSlack), Free (AGNULA), and Nando (CCRMA) to the
> LR> cc: list. They would be in the best position to answer your question.
>
> Yes, you're right. We plan to provide test 2.6.x packages as soon as
> patches stabilise. Please let me know if you have some recommendation
> (configuration flags, additional patches, etc.).
>

As of -R0 it's definitely stable on UP and SMP users are reporting the
same. All known problems should be fixed, and there are no known
regressions. You should probably post a UP version and have your users
test that before posting SMP packages, the latter are not quite as well
tested.

No other patches (ie various scheduler tweaks, CK) should be necessary,
and in fact are not recommended because they might mask latency issues
that we would rather fix.

I use Debian unstable which should be pretty close to AGNULA, and for
several weeks now I have been unable to produce an xrun in jack at 32
frames no matter what I throw at the machine. I actually have not had
xrun debugging enabled in weeks because I don't get any xruns.

Any problems at this point are most likely going to involve less common
hardware, stuff the LKML testers don't have.

Lee

2004-09-03 08:17:45

by Luke Yelavich

[permalink] [raw]
Subject: Re: [patch] voluntary-preempt-2.6.9-rc1-bk4-R0

On Fri, Sep 03, 2004 at 05:40:34PM EST, Lee Revell wrote:
> On Fri, 2004-09-03 at 03:05, Ingo Molnar wrote:
> > another question - any objections against me rebasing the VP patch to
> > the current -mm tree and keeping it there exclusively until all possible
> > merges are done? It would probably be quite some work to create a complete
> > patch for the upstream or BK tree during that process, as small patches
> > start to flow in the VP => -mm => BK direction. Would any of the regular
> > VP users/testers be wary to use the -mm tree?
> >
>
> None here. Assuming the SMP issues are resolved, then the only
> remaining issues on UP will probably be with various drivers, those
> fixes should apply just as easily to -mm as vanilla.
>
> As far as I am concerned the VP patches are stable enough for the
> audio-centric distros to start distributing VP kernel packages, these
> will certainly be using the vanilla kernel. I think the PlanetCCRMA and
> AGNULA people are planning to start distributing test VP-kernel packages
> as soon as the patches stabilize. IIRC Nando is on vacation this week.

I certainly have intentions of making test kernels available in my repository
in a separate testing branch. For myself, I don't really mind what tree the
patch is based on, but I feel that the patches may get more testing if they
are against vanilla, rather than against another tree, as users may consider
the patch to be more stable. On the other hand, it is not hard for packagers
like myself to do the legwork, and we can also provide a pre-patched kernel
source for those who still want to roll their own.

> I will make an announcement on LAD that as of R0 the VP patches should
> be stable and are ready for wider testing. You may want to wait until
> after the initial slew of bug reports before rebasing VP against MM. I
> suspect most of the problems will be driver specific, and most of the
> fixes will apply equally to -mm and vanilla.

Well with Lee's help, I think I have identified an ICE1712 sound driver issue,
but this is yet to be determined.

> I have added Luke (AudioSlack), Free (AGNULA), and Nando (CCRMA) to the
> cc: list. They would be in the best position to answer your question.

I will go with whatever is decided, as I am prepared to do the legwork for my
repository users.
--
Luke Yelavich
http://www.audioslack.com
[email protected]

2004-09-03 08:22:25

by Lee Revell

[permalink] [raw]
Subject: Re: [patch] voluntary-preempt-2.6.9-rc1-bk4-R0

On Fri, 2004-09-03 at 04:09, Luke Yelavich wrote:
> On Fri, Sep 03, 2004 at 05:40:34PM EST, Lee Revell wrote:
> > On Fri, 2004-09-03 at 03:05, Ingo Molnar wrote:
> Well with Lee's help, I think I have identified an ICE1712 sound driver issue,
> but this is yet to be determined.

Hmm, this one is still not fixed, using the latest VP patches?

What are the symptoms again?

Lee

2004-09-03 08:47:20

by Luke Yelavich

[permalink] [raw]
Subject: Re: [patch] voluntary-preempt-2.6.9-rc1-bk4-R0

On Fri, Sep 03, 2004 at 06:13:16PM EST, Lee Revell wrote:
> > On Fri, Sep 03, 2004 at 05:40:34PM EST, Lee Revell wrote:
> > > On Fri, 2004-09-03 at 03:05, Ingo Molnar wrote:
> > Well with Lee's help, I think I have identified an ICE1712 sound driver issue,
> > but this is yet to be determined.
>
> Hmm, this one is still not fixed, using the latest VP patches?

I have been keeping up with the patches, but haven't had a chance to try it out
with them lately. Currently building the latest one, and will be able to test
again when done.
--
Luke Yelavich
http://www.audioslack.com
[email protected]

2004-09-03 09:09:57

by Free Ekanayaka

[permalink] [raw]
Subject: Re: [patch] voluntary-preempt-2.6.9-rc1-bk4-R0

|--==> "LR" == Lee Revell <[email protected]> writes:

LR> On Fri, 2004-09-03 at 03:50, Free Ekanayaka wrote:
LR> As far as I am concerned the VP patches are stable enough for the
LR> audio-centric distros to start distributing VP kernel packages, these
LR> will certainly be using the vanilla kernel. I think the PlanetCCRMA and
LR> AGNULA people are planning to start distributing test VP-kernel packages
LR> as soon as the patches stabilize. IIRC Nando is on vacation this week.
>>
LR> I will make an announcement on LAD that as of R0 the VP patches should
LR> be stable and are ready for wider testing. You may want to wait until
LR> after the initial slew of bug reports before rebasing VP against MM. I
LR> suspect most of the problems will be driver specific, and most of the
LR> fixes will apply equally to -mm and vanilla.
>>
LR> I have added Luke (AudioSlack), Free (AGNULA), and Nando (CCRMA) to the
LR> cc: list. They would be in the best position to answer your question.
>>
>>Yes, you're right. We plan to provide test 2.6.x packages as soon as
>>patches stabilise. Please let me know if you have some recommendation
>>(configuration flags, additional patches, etc.).
>>

LR> As of -R0 it's definitely stable on UP and SMP users are reporting the
LR> same. All known problems should be fixed, and there are no known
LR> regressions. You should probably post a UP version and have your users
LR> test that before posting SMP packages, the latter are not quite as well
LR> tested.

LR> No other patches (ie various scheduler tweaks, CK) should be necessary,
LR> and in fact are not recommended because they might mask latency issues
LR> that we would rather fix.

LR> I use Debian unstable which should be pretty close to AGNULA, and for
LR> several weeks now I have been unable to produce an xrun in jack at 32
LR> frames no matter what I throw at the machine. I actually have not had
LR> xrun debugging enabled in weeks because I don't get any xruns.

LR> Any problems at this point are most likely going to involve less common
LR> hardware, stuff the LKML testers don't have.

A/DeMuDi is based on a sarge snapshot (currently 17 July... time to
shift forward!), plus some bits of sid (especially up-to-date audio
packages).

Which kernel tree are you using?

Have you built your kernel starting from the stock kernel-source-2.6.8
package along with the relevant debian patch?

Cheers,

Free

2004-09-03 09:28:50

by Ingo Molnar

[permalink] [raw]
Subject: [patch] voluntary-preempt-2.6.9-rc1-bk4-R1


* Lee Revell <[email protected]> wrote:

> As of -R0 it's definitely stable on UP and SMP users are reporting the
> same. All known problems should be fixed, and there are no known
> regressions. You should probably post a UP version and have your
> users test that before posting SMP packages, the latter are not quite
> as well tested.

Florian Schmidt reported a minor bug that prevents a successful build if
!CONFIG_LATENCY_TRACE - i've uploaded -R1 that fixes this:

http://redhat.com/~mingo/voluntary-preempt/voluntary-preempt-2.6.9-rc1-bk4-R1

there are no other changes in -R1, and there are no known pending bugs
currently.

for a packaged kernel i'd suggest enabling all the CONFIG_PREEMPT_*
values in the .config, including CONFIG_PREEMPT_TIMING.

CONFIG_LATENCY_TRACE can add overhead when active, so while it would be
useful for initial packages to enable this .config option, i'd suggest
turning it off by default by changing 'tracing_enabled = 1' to
'tracing_enabled = 0' in the patch. Then people can enable it and do
precise tracing whenever they encounter a particularly high latency on
their system.
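
(purely as an illustration - assuming the downloaded patch file keeps
the name from the URL above and that the patch contains exactly one
'tracing_enabled = 1' assignment - packagers could flip the default with
something like:

sed 's/tracing_enabled = 1/tracing_enabled = 0/' voluntary-preempt-2.6.9-rc1-bk4-R1 > voluntary-preempt-R1-notrace

the output filename is made up, any name will do.)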

configuring the threaded/non-threaded properties of IRQ handlers can be
tricky. Perhaps a script could scan /proc (and/or /sys) for audio
interrupts and make them directly executed? Unfortunately audio
interrupt handler requests dont seem to go through a central ALSA
function, so i see no easy way to tag audio interrupts automatically and
provide some /proc flag to make them non-threaded by default.
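
(as a rough sketch of the kind of helper a distro could ship - the
script name and the example handler name below are hypothetical, the
only thing assumed is the per-handler /proc/irq/*/<handler>/threaded
flag provided by this patch:

#!/bin/sh
# audio-irq-nothread.sh - make the IRQ(s) of the given handler non-threaded.
# usage (as root): audio-irq-nothread.sh "Ensoniq AudioPCI"
for f in /proc/irq/*/"$1"/threaded; do
    [ -f "$f" ] && echo 0 > "$f"
done

but the user or the packager still has to know which handler name
belongs to the soundcard - there is no automatic way to find that out
yet.)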

Ingo

2004-09-03 09:32:18

by Ingo Molnar

[permalink] [raw]
Subject: Re: [patch] voluntary-preempt-2.6.9-rc1-bk4-Q9


* Ingo Molnar <[email protected]> wrote:

> > 00000002 0.002ms (+0.000ms): dummy_go_idle (schedule)
> > 00000002 0.002ms (+0.060ms): schedule (io_schedule)
> > 00000002 0.063ms (+0.069ms): load_balance_newidle (schedule)
> > 00000002 0.133ms (+0.074ms): find_busiest_group (load_balance_newidle)
> > 00000002 0.207ms (+0.034ms): find_next_bit (find_busiest_group)
> > 00000002 0.242ms (+0.039ms): find_next_bit (find_busiest_group)
> > 00000002 0.281ms (+0.070ms): find_busiest_queue (load_balance_newidle)
> > 00000002 0.351ms (+0.071ms): find_next_bit (find_busiest_queue)
> > 00000002 0.422ms (+0.069ms): double_lock_balance (load_balance_newidle)
> > 00000003 0.492ms (+0.070ms): move_tasks (load_balance_newidle)
>
> this is as if the CPU executed everything in 'slow motion'. E.g. the
> cache being disabled could be one such reason - or some severe DMA or
> other bus traffic.

another thing: could you try maxcpus=1 again and see whether 1 CPU
produces similar 'slow motion' traces? If it's DMA or PCI bus traffic
somehow interfering then i'd expect the same phenomenon to pop up with a
single CPU too. IIRC you tested an UP kernel once before, but i believe
that was prior fixing all the latency measurement errors. Would be nice
to re-test again, with maxcpus=1, and see whether any of these
slow-motion traces trigger. On a 1-CPU test i'd suggest to lower the
tracing threshold to half of the 2-CPU value.

Ingo

2004-09-03 09:57:06

by Luke Yelavich

[permalink] [raw]
Subject: Re: [patch] voluntary-preempt-2.6.9-rc1-bk4-R1

On Fri, Sep 03, 2004 at 07:25:47PM EST, Ingo Molnar wrote:
>
> * Lee Revell <[email protected]> wrote:
>
> > As of -R0 it's definitely stable on UP and SMP users are reporting the
> > same. All known problems should be fixed, and there are no known
> > regressions. You should probably post a UP version and have your
> > users test that before posting SMP packages, the latter are not quite
> > as well tested.
>
> Florian Schmidt reported a minor bug that prevents a successful build if
> !CONFIG_LATENCY_TRACE - i've uploaded -R1 that fixes this:
>
> http://redhat.com/~mingo/voluntary-preempt/voluntary-preempt-2.6.9-rc1-bk4-R1
>
> there are no other changes in -R1, and there are no known pending bugs
> currently.

I think I might be having a problem here that I didn't have with previous
patches. Bear in mind that the previous patch I tested was against 2.6.8.1
vanilla.

I am using a D-Link KVM here between my notebook and my desktop machine.
It is the desktop I am currently testing these patches on, and the KVM
requires a double-tap of the scroll lock key to switch between machines. With
the latest R0 patch, something is not working when I attempt to change from
my desktop to my notebook. The KVM usually lets out a beep when I can
use the arrow keys to switch, but it isn't here. Adding to that, my console
locks up totally for about 10 seconds, before allowing me to go on and
type commands. I also seem to get some debugging output or a trace of
some sort when rebooting, and the kernel panics with the message:
(0)Kernel Panic - not syncing: Failed exception in interrupt

Included is some dmesg output from beginning of bootup.

Let me know if any more info is needed.
--
Luke Yelavich
http://www.audioslack.com
[email protected]

CPU: After generic identify, caps: bfebfbff 00000000 00000000 00000000
CPU: After vendor identify, caps: bfebfbff 00000000 00000000 00000000
CPU: Trace cache: 12K uops, L1 D cache: 8K
CPU: L2 cache: 512K
CPU: After all inits, caps: bfebfbff 00000000 00000000 00000080
CPU: Intel(R) Pentium(R) 4 CPU 2.40GHz stepping 07
Enabling fast FPU save and restore... done.
Enabling unmasked SIMD FPU exception support... done.
Checking 'hlt' instruction... OK.
Badness in schedule at kernel/sched.c:2637
[<c0302fb0>] schedule+0x558/0x55d
[<c0302ff5>] preempt_schedule+0x40/0x5f
[<c010041f>] rest_init+0x67/0x73
[<c03da780>] start_kernel+0x195/0x1d4
[<c03da38c>] unknown_bootoption+0x0/0x147
(swapper/1): new 666 us maximum-latency critical section.
=> started at: <preempt_schedule+0x3b/0x5f>
=> ended at: <finish_task_switch+0x37/0x93>
[<c012f01f>] check_preempt_timing+0x11b/0x1da
[<c01103df>] finish_task_switch+0x37/0x93
[<c01103df>] finish_task_switch+0x37/0x93
[<c012f212>] sub_preempt_count+0x4d/0x65
[<c012f212>] sub_preempt_count+0x4d/0x65
[<c01103df>] finish_task_switch+0x37/0x93
[<c011045e>] schedule_tail+0x23/0x5c
[<c0105f5e>] ret_from_fork+0x6/0x14
[<c010046b>] init+0x0/0x179
[<c0104210>] kernel_thread_helper+0x0/0xb
Badness in schedule at kernel/sched.c:2637
[<c0302fb0>] schedule+0x558/0x55d
[<c0302ff5>] preempt_schedule+0x40/0x5f
[<c0110439>] finish_task_switch+0x91/0x93
[<c011045e>] schedule_tail+0x23/0x5c
[<c0105f5e>] ret_from_fork+0x6/0x14
[<c010046b>] init+0x0/0x179
[<c0104210>] kernel_thread_helper+0x0/0xb
NET: Registered protocol family 16
PCI: PCI BIOS revision 2.10 entry at 0xf1e60, last bus=2
(swapper/1): new 904 us maximum-latency critical section.
=> started at: <cond_resched+0xd/0x41>
=> ended at: <cond_resched+0xd/0x41>
[<c012f01f>] check_preempt_timing+0x11b/0x1da
[<c0303417>] cond_resched+0xd/0x41
[<c0303417>] cond_resched+0xd/0x41
[<c012f114>] touch_preempt_timing+0x36/0x3a
[<c012f114>] touch_preempt_timing+0x36/0x3a
[<c0303417>] cond_resched+0xd/0x41
[<c0138b6b>] kmem_cache_alloc+0x5f/0x61
[<c0119fe8>] __request_region+0x26/0xbd
[<c03f1336>] pci_direct_init+0x3e/0x10e
[<c03da807>] do_initcalls+0x2f/0xbc
[<c01004c4>] init+0x59/0x179
[<c010046b>] init+0x0/0x179
[<c0104215>] kernel_thread_helper+0x5/0xb
PCI: Using configuration type 1
Linux Plug and Play Support v0.97 (c) Adam Belay
SCSI subsystem initialized
usbcore: registered new driver usbfs
usbcore: registered new driver hub

<snip>

get_random_bytes called before random driver initialization
(swapper/1): new 6605 us maximum-latency critical section.
=> started at: <cond_resched+0xd/0x41>
=> ended at: <cond_resched+0xd/0x41>
[<c012f01f>] check_preempt_timing+0x11b/0x1da
[<c0303417>] cond_resched+0xd/0x41
[<c0303417>] cond_resched+0xd/0x41
[<c012f114>] touch_preempt_timing+0x36/0x3a
[<c012f114>] touch_preempt_timing+0x36/0x3a
[<c0303417>] cond_resched+0xd/0x41
[<c0138da7>] __kmalloc+0x89/0x90
[<c0183f6c>] proc_create+0x86/0xd9
[<c018416b>] create_proc_entry+0x66/0xc9
[<c03f1fcc>] dev_proc_init+0x2b/0xa3
[<c03f206c>] net_dev_init+0x28/0x17c
[<c03f1ce4>] pcibios_init+0x65/0x7e
[<c03da807>] do_initcalls+0x2f/0xbc
[<c01004c4>] init+0x59/0x179
[<c010046b>] init+0x0/0x179
[<c0104215>] kernel_thread_helper+0x5/0xb
(swapper/1): new 43385 us maximum-latency critical section.
=> started at: <cond_resched+0xd/0x41>
=> ended at: <cond_resched+0xd/0x41>
[<c012f01f>] check_preempt_timing+0x11b/0x1da
[<c0303417>] cond_resched+0xd/0x41
[<c0107b1d>] do_IRQ+0x14a/0x18e
[<c0303417>] cond_resched+0xd/0x41
[<c012f114>] touch_preempt_timing+0x36/0x3a
[<c012f114>] touch_preempt_timing+0x36/0x3a
[<c0303417>] cond_resched+0xd/0x41
[<c0138b6b>] kmem_cache_alloc+0x5f/0x61
[<c0119fe8>] __request_region+0x26/0xbd
[<c01efb09>] isapnp_next_rdp+0x66/0xa1
[<c03e6d2d>] isapnp_isolate_rdp_select+0x5b/0xc5
[<c03e6eda>] isapnp_isolate+0x143/0x1f8
[<c03e7c15>] isapnp_init+0xbb/0x2da
[<c03da807>] do_initcalls+0x2f/0xbc
[<c01004c4>] init+0x59/0x179
[<c010046b>] init+0x0/0x179
[<c0104215>] kernel_thread_helper+0x5/0xb
(swapper/1): new 43492 us maximum-latency critical section.
=> started at: <cond_resched+0xd/0x41>
=> ended at: <cond_resched+0xd/0x41>
[<c012f01f>] check_preempt_timing+0x11b/0x1da
[<c0303417>] cond_resched+0xd/0x41
[<c0303417>] cond_resched+0xd/0x41
[<c012f114>] touch_preempt_timing+0x36/0x3a
[<c012f114>] touch_preempt_timing+0x36/0x3a
[<c0303417>] cond_resched+0xd/0x41
[<c0138b6b>] kmem_cache_alloc+0x5f/0x61
[<c0119fe8>] __request_region+0x26/0xbd
[<c01efb09>] isapnp_next_rdp+0x66/0xa1
[<c03e6d2d>] isapnp_isolate_rdp_select+0x5b/0xc5
[<c03e6eda>] isapnp_isolate+0x143/0x1f8
[<c03e7c15>] isapnp_init+0xbb/0x2da
[<c03da807>] do_initcalls+0x2f/0xbc
[<c01004c4>] init+0x59/0x179
[<c010046b>] init+0x0/0x179
[<c0104215>] kernel_thread_helper+0x5/0xb
(swapper/1): new 43561 us maximum-latency critical section.
=> started at: <cond_resched+0xd/0x41>
=> ended at: <cond_resched+0xd/0x41>
[<c012f01f>] check_preempt_timing+0x11b/0x1da
[<c0303417>] cond_resched+0xd/0x41
[<c0107b1d>] do_IRQ+0x14a/0x18e
[<c0303417>] cond_resched+0xd/0x41
[<c012f114>] touch_preempt_timing+0x36/0x3a
[<c012f114>] touch_preempt_timing+0x36/0x3a
[<c0303417>] cond_resched+0xd/0x41
[<c0138b6b>] kmem_cache_alloc+0x5f/0x61
[<c0119fe8>] __request_region+0x26/0xbd
[<c01efb09>] isapnp_next_rdp+0x66/0xa1
[<c03e6d2d>] isapnp_isolate_rdp_select+0x5b/0xc5
[<c03e6eda>] isapnp_isolate+0x143/0x1f8
[<c03e7c15>] isapnp_init+0xbb/0x2da
[<c03da807>] do_initcalls+0x2f/0xbc
[<c01004c4>] init+0x59/0x179
[<c010046b>] init+0x0/0x179
[<c0104215>] kernel_thread_helper+0x5/0xb
isapnp: No Plug & Play device found
requesting new irq thread for IRQ8...
Real Time Clock Driver v1.12
Using anticipatory io scheduler
<snip>
requesting new irq thread for IRQ6...
<snip>
IRQ#6 thread started up.
eth0: Tigon3 [partno(BCM95702A20) rev 1002 PHY(5703)] (PCI:33MHz:32-bit) 10/100/1000BaseT Ethernet 00:e0:18:df:d5:b6
eth0: HostTXDS[1] RXcsums[1] LinkChgREG[0] MIirq[0] ASF[0] Split[0] WireSpeed[1] TSOcap[1]
requesting new irq thread for IRQ3...
(dhcpcd/2226): new 115753 us maximum-latency critical section.
=> started at: <tg3_open+0xc9/0x263 [tg3]>
=> ended at: <tg3_open+0x122/0x263 [tg3]>
[<c012f01f>] check_preempt_timing+0x11b/0x1da
[<f88c805a>] tg3_open+0x122/0x263 [tg3]
[<f88c805a>] tg3_open+0x122/0x263 [tg3]
[<c012f212>] sub_preempt_count+0x4d/0x65
[<c012f212>] sub_preempt_count+0x4d/0x65
[<f88c805a>] tg3_open+0x122/0x263 [tg3]
[<c029f2ad>] dev_open+0xcf/0xfe
[<c02a30c1>] dev_mc_upload+0x3d/0x5d
[<c02a0987>] dev_change_flags+0x5b/0x12c
[<c029e930>] __dev_get_by_name+0xe/0xb9
[<c02e2d4b>] devinet_ioctl+0x247/0x5c1
[<c02e5066>] inet_ioctl+0x5f/0x9f
[<c0301f9d>] packet_ioctl+0x14d/0x177
[<c0296a96>] sock_ioctl+0x112/0x2df
[<c0161e06>] sys_ioctl+0x13a/0x2a0
[<c0106087>] syscall_call+0x7/0xb
IRQ#3 thread started up.
divert: not allocating divert_blk for non-ethernet device sit0
(IRQ 1/716): new 204218 us maximum-latency critical section.
=> started at: <__do_softirq+0x39/0x59>
=> ended at: <__do_softirq+0x4a/0x59>
[<c012f01f>] check_preempt_timing+0x11b/0x1da
[<c01186d0>] __do_softirq+0x4a/0x59
[<c01186d0>] __do_softirq+0x4a/0x59
[<c012f212>] sub_preempt_count+0x4d/0x65
[<c012f212>] sub_preempt_count+0x4d/0x65
[<c01186d0>] __do_softirq+0x4a/0x59
[<c01194d6>] do_irqd+0x71/0x91
[<c012819f>] kthread+0xaa/0xaf
[<c0119465>] do_irqd+0x0/0x91
[<c01280f5>] kthread+0x0/0xaf
[<c0104215>] kernel_thread_helper+0x5/0xb

2004-09-03 10:31:13

by Ingo Molnar

[permalink] [raw]
Subject: Re: [patch] voluntary-preempt-2.6.9-rc1-bk4-R1


* Luke Yelavich <[email protected]> wrote:

> I think I might be having a problem here that I didn't have with
> previous patches. Bear in mind that the previous patch I tested was
> against 2.6.8.1 vanilla.
>
> I am using a D-Link KVM here between my notebook and my desktop
> machine. It is the desktop I am currently testing these patches on,
> and the KVM requires a double-tap of the scroll lock key to switch
> between machines. With the latest R0 patch, something is not working
> when I attempt to change from my desktop to my notebook. The KVM
> usually lets out a beep when I can use the arrow keys to switch, but
> it isn't here. Adding to that, my console locks up totally for about
> 10 seconds, before allowing me to go on and type commands. [...]

i have a KVM too, connected to two testsystems, and unfortunately i
cannot reproduce your problems with either KVM (key-based) switching or
scroll lock. But this KVM uses a triple-key combination to switch, not
scroll-lock.

it's a PS2 keyboard, right? If yes then does:

echo 0 > /proc/irq/1/i8042/threaded
( maybe also: echo 0 > /proc/irq/12/i8042/threaded )

fix the problem? The PS2 driver has been a bit unrobust when hardirq
redirection is enabled.

> [...] I also seem to get some debugging output or a trace of some sort
> when rebooting, and the kernel panics with the message: (0)Kernel
> Panic - not syncing: Failed exception in interrupt

hm. Would be nice to get a serial console capture of this, if possible.

Ingo

2004-09-03 10:44:51

by Luke Yelavich

[permalink] [raw]
Subject: Re: [patch] voluntary-preempt-2.6.9-rc1-bk4-R1

On Fri, Sep 03, 2004 at 08:29:48PM EST, Ingo Molnar wrote:
> * Luke Yelavich <[email protected]> wrote:
> > I am using a D-Link KVM here between my notebook and my desktop
> > machine. It is the desktop I am currently testing these patches on,
> > and the KVM requires a double-tap of the scroll lock key to switch
> > between machines. With the latest R0 patch, something is not working
> > when I attempt to change from my desktop to my notebook. The KVM
> > usually lets out a beep when I can use the arrow keys to switch, but
> > it isn't here. Adding to that, my console locks up totally for about
> > 10 seconds, before allowing me to go on and type commands. [...]
>
> i have a KVM too to two testsystems and unfortunately i cannot reproduce
> your problems neither with KVM (key-based-)switching nor with scroll
> lock. But this KVM uses a triple-key combination to switch, not
> scroll-lock.
>
> it's a PS2 keyboard, right?

Yes, that is correct.

> If yes then does:
>
> echo 0 > /proc/irq/1/i8042/threaded
> ( maybe also: echo 0 > /proc/irq/12/i8042/threaded )
>
> fix the problem? The PS2 driver has been a bit unrobust when hardirq
> redirection is enabled.

I only had to turn off IRQ threading for IRQ 1, and all is well again.

>
> > [...] I also seem to get some debugging output or a trace of some sort
> > when rebooting, and the kernel panics with the message: (0)Kernel
> > Panic - not syncing: Failed exception in interrupt
>
> hm. Would be nice to get a serial console capture of this, if possible.

I am pretty sure I have a cable; however, my desktop is the only computer in
this room that has serial ports, and I also don't know how to set up a serial
console. If you feel this is important, I will see what I can do.
--
Luke Yelavich
http://www.audioslack.com
[email protected]

2004-09-03 11:04:13

by K.R. Foley

[permalink] [raw]
Subject: Re: [patch] voluntary-preempt-2.6.9-rc1-bk4-R0

Ingo Molnar wrote:
> i've released the -R0 patch:
>
> http://redhat.com/~mingo/voluntary-preempt/voluntary-preempt-2.6.9-rc1-bk4-R0
>
> ontop of:
>
> http://redhat.com/~mingo/voluntary-preempt/diff-bk-040828-2.6.8.1.bz2
>
> i've given up on the netdev_backlog_granularity approach, and as a
> replacement i've modified specific network drivers to return at a safe
> point if softirq preemption is requested. This gives the same end result
> but is more robust. For the time being i've fixed 8139too.c and e100.c.
> (will fix up other drivers too as latencies get reported)
>
> this should fix the crash reported by P.O. Gaillard, and it should solve
> the packet delay/loss issues reported by Mark H Johnson. I cannot see
> any problems on my rtl8139 testbox anymore.
>
> Ingo
>

Last night when rebooting on R0 for the first time, my system locked up.
It appears to have happened the first time it tried to send/recv any real
data over the e100 interface, when ntp was starting up. I will try to
investigate further a little later.

kr

2004-09-03 11:34:11

by Thomas Charbonnel

[permalink] [raw]
Subject: Re: [patch] voluntary-preempt-2.6.9-rc1-bk4-R1

Ingo Molnar wrote :
> Florian Schmidt reported a minor bug that prevents a successful build if
> !CONFIG_LATENCY_TRACE - i've uploaded -R1 that fixes this:
>
> http://redhat.com/~mingo/voluntary-preempt/voluntary-preempt-2.6.9-rc1-bk4-R1
>

I still get > 170 us latency from rtl8139 :
http://www.undata.org/~thomas/R1_rtl8139.trace

And again this one :
preemption latency trace v1.0.5 on 2.6.9-rc1-VP-R1
--------------------------------------------------
latency: 597 us, entries: 12 (12)
-----------------
| task: swapper/0, uid:0 nice:0 policy:0 rt_prio:0
-----------------
=> started at: smp_apic_timer_interrupt+0x32/0xd0
=> ended at: smp_apic_timer_interrupt+0x86/0xd0
=======>
00010000 0.000ms (+0.000ms): smp_apic_timer_interrupt (apic_timer_interrupt)
00010000 0.000ms (+0.000ms): profile_tick (smp_apic_timer_interrupt)
00010000 0.000ms (+0.000ms): profile_hook (profile_tick)
00010001 0.000ms (+0.595ms): notifier_call_chain (profile_hook)
00010000 0.595ms (+0.000ms): do_nmi (mcount)
00020000 0.596ms (+0.000ms): profile_tick (nmi_watchdog_tick)
00020000 0.596ms (+0.000ms): profile_hook (profile_tick)
00020001 0.597ms (+0.000ms): notifier_call_chain (profile_hook)
00020000 0.597ms (+689953.444ms): profile_hit (nmi_watchdog_tick)
00010001 689954.042ms (+1.141ms): update_process_times (do_timer)
00000001 0.597ms (+0.000ms): sub_preempt_count (smp_apic_timer_interrupt)
00000001 0.598ms (+0.000ms): update_max_trace (check_preempt_timing)

Thomas


2004-09-03 11:34:44

by Ingo Molnar

[permalink] [raw]
Subject: [patch] voluntary-preempt-2.6.9-rc1-bk4-R2


i've uploaded -R2:

http://redhat.com/~mingo/voluntary-preempt/voluntary-preempt-2.6.9-rc1-bk4-R2

no functional changes, only fixing some rough edges: it fixes a UP
boot-time assert that occurs right after 'checking hlt'. I also fixed
PREEMPT_TIMING + !LATENCY_TRACE build bugs, and another boot bug that
occurs when PREEMPT_TIMING + !FRAME_POINTERS is used.

(the reboot assert i'm not sure about - tried to reproduce but here
reboot works fine. Will need some sort of serial log to debug this.)

Ingo

2004-09-03 11:49:11

by Ingo Molnar

[permalink] [raw]
Subject: Re: [patch] voluntary-preempt-2.6.9-rc1-bk4-R1


* Thomas Charbonnel <[email protected]> wrote:

> I still get > 170 us latency from rtl8139 :
> http://www.undata.org/~thomas/R1_rtl8139.trace

this is a single-packet latency, we wont get much lower than this with
the current techniques. Disabling ip_conntrack and tracing ought to
lower the real latency somewhat.

> And again this one :
> preemption latency trace v1.0.5 on 2.6.9-rc1-VP-R1
> --------------------------------------------------
> latency: 597 us, entries: 12 (12)
> -----------------
> | task: swapper/0, uid:0 nice:0 policy:0 rt_prio:0
> -----------------
> => started at: smp_apic_timer_interrupt+0x32/0xd0
> => ended at: smp_apic_timer_interrupt+0x86/0xd0
> =======>
> 00010000 0.000ms (+0.000ms): smp_apic_timer_interrupt (apic_timer_interrupt)
> 00010000 0.000ms (+0.000ms): profile_tick (smp_apic_timer_interrupt)
> 00010000 0.000ms (+0.000ms): profile_hook (profile_tick)
> 00010001 0.000ms (+0.595ms): notifier_call_chain (profile_hook)
> 00010000 0.595ms (+0.000ms): do_nmi (mcount)
> 00020000 0.596ms (+0.000ms): profile_tick (nmi_watchdog_tick)
> 00020000 0.596ms (+0.000ms): profile_hook (profile_tick)
> 00020001 0.597ms (+0.000ms): notifier_call_chain (profile_hook)
> 00020000 0.597ms (+689953.444ms): profile_hit (nmi_watchdog_tick)
> 00010001 689954.042ms (+1.141ms): update_process_times (do_timer)
> 00000001 0.597ms (+0.000ms): sub_preempt_count (smp_apic_timer_interrupt)
> 00000001 0.598ms (+0.000ms): update_max_trace (check_preempt_timing)

this is a pretty weird one. First it shows an apparently non-monotonic
RDTSC: the jump forward and backward in time around profile_hit. I
suspect the real RDTSC value was lower than the previous one and caused
an underflow. What is your cpu_khz in /proc/cpuinfo?

the other weird one is the +0.595 msec entry at notifier_call_chain().
That code is just a couple of instructions, so there is no real room for
any overhead there.

could you try the attached robust-get-cycles.patch ontop of your current
tree and see whether it impacts these weirdnesses? The patch makes sure
that the cycle counter is sane: two subsequent readings of it were
monotonic and less than 1000 cycles apart.

this patch probably wont remove the +0.595 msec latency though - the
RDTSC value jumped forward there permanently. Maybe the RDTSC value is
somehow corrupted by NMIs - could you turn off the NMI watchdog to
check?

Ingo


Attachments:
(No filename) (2.34 kB)
robust-get-cycles.patch (1.56 kB)

2004-09-03 12:12:23

by Thomas Charbonnel

[permalink] [raw]
Subject: Re: [patch] voluntary-preempt-2.6.9-rc1-bk4-R1

Ingo Molnar wrote :
> * Thomas Charbonnel <[email protected]> wrote:
>
> > I still get > 170 us latency from rtl8139 :
> > http://www.undata.org/~thomas/R1_rtl8139.trace
>
> this is a single-packet latency, we wont get much lower than this with
> the current techniques. Disabling ip_conntrack and tracing ought to
> lower the real latency somewhat.
>

Ok, I'll do that.

> > And again this one :
> > preemption latency trace v1.0.5 on 2.6.9-rc1-VP-R1
> > --------------------------------------------------
> > latency: 597 us, entries: 12 (12)
> > -----------------
> > | task: swapper/0, uid:0 nice:0 policy:0 rt_prio:0
> > -----------------
> > => started at: smp_apic_timer_interrupt+0x32/0xd0
> > => ended at: smp_apic_timer_interrupt+0x86/0xd0
> > =======>
> > 00010000 0.000ms (+0.000ms): smp_apic_timer_interrupt (apic_timer_interrupt)
> > 00010000 0.000ms (+0.000ms): profile_tick (smp_apic_timer_interrupt)
> > 00010000 0.000ms (+0.000ms): profile_hook (profile_tick)
> > 00010001 0.000ms (+0.595ms): notifier_call_chain (profile_hook)
> > 00010000 0.595ms (+0.000ms): do_nmi (mcount)
> > 00020000 0.596ms (+0.000ms): profile_tick (nmi_watchdog_tick)
> > 00020000 0.596ms (+0.000ms): profile_hook (profile_tick)
> > 00020001 0.597ms (+0.000ms): notifier_call_chain (profile_hook)
> > 00020000 0.597ms (+689953.444ms): profile_hit (nmi_watchdog_tick)
> > 00010001 689954.042ms (+1.141ms): update_process_times (do_timer)
> > 00000001 0.597ms (+0.000ms): sub_preempt_count (smp_apic_timer_interrupt)
> > 00000001 0.598ms (+0.000ms): update_max_trace (check_preempt_timing)
>
> this is a pretty weird one. First it shows an apparently non-monotonic
> RDTSC: the jump forward and backward in time around profile_hit. I
> suspect the real RDTSC value was lower than the previous one and caused
> an underflow. What is your cpu_khz in /proc/cpuinfo?
>
root@satellite thomas # cat /proc/cpuinfo
processor : 0
vendor_id : GenuineIntel
cpu family : 6
model : 11
model name : Intel(R) Pentium(R) III Mobile CPU 1000MHz
stepping : 1
cpu MHz : 996.879
cache size : 512 KB
fdiv_bug : no
hlt_bug : no
f00f_bug : no
coma_bug : no
fpu : yes
fpu_exception : yes
cpuid level : 2
wp : yes
flags : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 mmx fxsr sse
bogomips : 1966.08

> the other weird one is the +0.595 usec entry at notifier_call_chain().
> That code is just a couple of instructions, so no real for any overhead
> there.
>
> could you try the attached robust-get-cycles.patch ontop of your current
> tree and see whether it impacts these weirdnesses? The patch makes sure
> that the cycle counter is sane: two subsequent readings of it were
> monotonic and less than 1000 cycles apart.
>
> this patch probably wont remove the +0.595 msec latency though - the
> RDTSC value jumped forward there permanently. Maybe the RDTSC value is
> somehow corrupted by NMIs - could you turn off the NMI watchdog to
> check?
>
> Ingo

I enabled the NMI watchdog precisely to track those weird latencies
down. My guess is still that when ACPI is enabled my BIOS does something
funky with SMM/SMI that increments the TSC. I'll try the patch and let
you know.

Thomas


2004-09-03 12:56:31

by Luke Yelavich

[permalink] [raw]
Subject: Re: [patch] voluntary-preempt-2.6.9-rc1-bk4-R0

On Fri, Sep 03, 2004 at 06:13:16PM EST, Lee Revell wrote:
> On Fri, 2004-09-03 at 04:09, Luke Yelavich wrote:
> > On Fri, Sep 03, 2004 at 05:40:34PM EST, Lee Revell wrote:
> > > On Fri, 2004-09-03 at 03:05, Ingo Molnar wrote:
> > Well with Lee's help, I think I have identified an ICE1712 sound driver issue,
> > but this is yet to be determined.
>
> Hmm, this one is still not fixed, using the latest VP patches?

With R2, it turns out that this problem still exists. If I turn off threading
for either ICE1712 soundcard when it is being used by JACK, I get xruns of
around 3 msecs.

I can't remember the info you needed. What is needed to debug this problem?
--
Luke Yelavich
http://www.audioslack.com
[email protected]

2004-09-03 13:43:35

by Mark_H_Johnson

[permalink] [raw]
Subject: Re: [patch] voluntary-preempt-2.6.9-rc1-bk4-Q9

>> this is as if the CPU executed everything in 'slow motion'. E.g. the
>> cache being disabled could be one such reason - or some severe DMA or
>> other bus traffic.
>
>another thing: could you try maxcpus=1 again and see whether 1 CPU
>produces similar 'slow motion' traces?

OK. This is my second try at sending this message. I've had two complete
system lockups now with -Q9: the first while doing a kernel build, the
second while trying to send a message explaining that, plus the results I
saw when trying to test with maxcpus=1.

Booted w/ maxcpus=1 nmi_watchdog=1. Saw information messages confirming
both were set and acted on.

The first major problem was that the audio came out "too fast". The test
repeats a pattern and you could hear it repeating faster than it should have.
Needless to say, this meant that there was no time available to run any
non-real-time tasks, so the application timing results were worthless and
the latency traces were separated by several minutes in some cases [before
I could force a SIGINT to the real time task].
I don't know if the traces I did get are helpful or not, but here is a
summary. The numbers after each title refer to the trace file numbers,
so if you count them, you get an idea of how frequent each symptom was.
I will send the full traces only if you ask for them.

Slow getting to "switch_tasks" (00, 01, 10, 12, 13, 16)

00000002 0.002ms (+0.000ms): dequeue_task (deactivate_task)
00000002 0.002ms (+0.000ms): schedule (io_schedule)
00000002 0.002ms (+0.000ms): schedule (io_schedule)
00000002 0.002ms (+0.000ms): schedule (io_schedule)
00000002 0.003ms (+0.000ms): schedule (io_schedule)
00000002 0.003ms (+0.000ms): schedule (io_schedule)
00000002 0.003ms (+0.303ms): schedule (io_schedule)
00010002 0.306ms (+0.016ms): do_nmi (schedule)
00010002 0.323ms (+0.000ms): profile_tick (nmi_watchdog_tick)
00010002 0.323ms (+0.000ms): profile_hook (profile_tick)
00010002 0.323ms (+0.100ms): read_lock (profile_hook)
00010003 0.424ms (+0.000ms): notifier_call_chain (profile_hook)
00010002 0.424ms (+0.070ms): profile_hit (nmi_watchdog_tick)
00000002 0.494ms (+0.000ms): dummy_switch_tasks (schedule)

Slow getting to "__switch_to" (02, 03, 05, 09)

00000002 0.003ms (+0.000ms): dummy_switch_tasks (schedule)
00000002 0.003ms (+0.000ms): schedule (io_schedule)
00000002 0.003ms (+0.327ms): schedule (io_schedule)
00000002 0.331ms (+0.000ms): schedule (io_schedule)
00000002 0.331ms (+0.000ms): schedule (io_schedule)
00000002 0.331ms (+0.000ms): schedule (io_schedule)
00000002 0.332ms (+0.236ms): schedule (io_schedule)
00010002 0.568ms (+0.001ms): do_nmi (mcount)
00010002 0.570ms (+0.000ms): profile_tick (nmi_watchdog_tick)
00010002 0.570ms (+0.000ms): profile_hook (profile_tick)
00010002 0.570ms (+0.000ms): read_lock (profile_hook)
00010003 0.571ms (+0.000ms): notifier_call_chain (profile_hook)
00010002 0.571ms (+0.000ms): profile_hit (nmi_watchdog_tick)
00000002 0.572ms (+0.000ms): __switch_to (schedule)

Slow to both switch_tasks and __switch_to (04, 06)

00000002 0.002ms (+0.000ms): dequeue_task (deactivate_task)
00000002 0.002ms (+0.000ms): schedule (io_schedule)
00000002 0.002ms (+0.000ms): schedule (io_schedule)
00000002 0.002ms (+0.000ms): schedule (io_schedule)
00000002 0.003ms (+0.000ms): schedule (io_schedule)
00000002 0.003ms (+0.000ms): schedule (io_schedule)
00000002 0.003ms (+0.131ms): schedule (io_schedule)
00000002 0.134ms (+0.000ms): dummy_switch_tasks (schedule)
00000002 0.134ms (+0.000ms): schedule (io_schedule)
00000002 0.134ms (+0.000ms): schedule (io_schedule)
00000002 0.134ms (+0.000ms): schedule (io_schedule)
00000002 0.135ms (+0.000ms): schedule (io_schedule)
00000002 0.135ms (+0.000ms): schedule (io_schedule)
00000002 0.135ms (+0.139ms): schedule (io_schedule)
00010002 0.275ms (+0.279ms): do_nmi (schedule)
00010002 0.554ms (+0.000ms): profile_tick (nmi_watchdog_tick)
00010002 0.554ms (+0.000ms): profile_hook (profile_tick)
00010002 0.554ms (+0.000ms): read_lock (profile_hook)
00010003 0.555ms (+0.000ms): notifier_call_chain (profile_hook)
00010002 0.555ms (+0.001ms): profile_hit (nmi_watchdog_tick)
00000002 0.557ms (+0.000ms): __switch_to (schedule)

sched_clock (08, 11, 14)

00000001 0.000ms (+0.000ms): schedule (worker_thread)
00000001 0.000ms (+0.000ms): schedule (worker_thread)
00000001 0.000ms (+0.000ms): schedule (worker_thread)
00000001 0.000ms (+0.000ms): schedule (worker_thread)
00000001 0.000ms (+0.000ms): schedule (worker_thread)
00000001 0.000ms (+0.000ms): schedule (worker_thread)
00000001 0.000ms (+0.553ms): sched_clock (schedule)
00010001 0.554ms (+0.001ms): do_nmi (sched_clock)
00010001 0.556ms (+0.000ms): profile_tick (nmi_watchdog_tick)
00010001 0.556ms (+0.000ms): profile_hook (profile_tick)
00010001 0.556ms (+0.000ms): read_lock (profile_hook)
00010002 0.556ms (+0.000ms): notifier_call_chain (profile_hook)
00010001 0.557ms (+0.000ms): profile_hit (nmi_watchdog_tick)
00010001 0.557ms (+0.000ms): do_IRQ (sched_clock)
00010001 0.558ms (+0.000ms): do_IRQ (<00000000>)
00010001 0.558ms (+0.000ms): spin_lock (do_IRQ)

__modify_IO_APIC_irq (15)

00000001 0.000ms (+0.000ms): __spin_lock_irqsave (spin_lock_irq)
00000001 0.000ms (+0.000ms): generic_enable_irq (ide_do_request)
00000001 0.000ms (+0.000ms): __spin_lock_irqsave (generic_enable_irq)
00000002 0.001ms (+0.000ms): unmask_IO_APIC_irq (generic_enable_irq)
00000002 0.001ms (+0.000ms): __spin_lock_irqsave (unmask_IO_APIC_irq)
00000003 0.002ms (+0.000ms): __unmask_IO_APIC_irq (unmask_IO_APIC_irq)
00000003 0.002ms (+0.565ms): __modify_IO_APIC_irq (__unmask_IO_APIC_irq)
00010001 0.567ms (+0.001ms): do_nmi (apic_timer_interrupt)
00010001 0.569ms (+0.000ms): profile_tick (nmi_watchdog_tick)
00010001 0.569ms (+0.000ms): profile_hook (profile_tick)
00010001 0.569ms (+0.000ms): read_lock (profile_hook)
00010002 0.570ms (+0.000ms): notifier_call_chain (profile_hook)
00010001 0.570ms (+0.000ms): profile_hit (nmi_watchdog_tick)
00000001 0.571ms (+0.000ms): smp_apic_timer_interrupt (as_work_handler)

In the meantime, I will build a kernel with -R1 and leave the latency trace
on (now running a -Q7 kernel with both CPUs active) to see if it shows
anything interesting before attempting any more tests.

--Mark

2004-09-03 15:37:58

by Mark_H_Johnson

[permalink] [raw]
Subject: Re: [patch] voluntary-preempt-2.6.9-rc1-bk4-Q9

>In the meantime, I will build a kernel with -R1 and leave the latency trace
>on (now running a -Q7 kernel with both CPUs active) to see if it shows
>anything interesting before attempting any more tests.

Well - THAT was special. Another crash, but I may have a clue about both
that and the "general slowdown" in sched.c.

The crash is likely due to a problem with X. I caused it this time when
I was trying to open / hide Mozilla windows (again preparing to send email
with a web based client). The last operation that worked was the window
hide. The one that locked everything up was a click to restore a Mozilla
window onto the screen. I don't know if this is relevant, but the last
trace in /var/log/messages was a latency trace caused by X.

Sep 3 09:57:11 dws77 kernel: (X/2382): new 329 us maximum-latency critical section.
Sep 3 09:57:11 dws77 kernel: => started at: <kmap_atomic+0x23/0xe0>
Sep 3 09:57:11 dws77 kernel: => ended at: <kunmap_atomic+0x7b/0xa0>

I am not sure this is relevant since all the data for it was written
to disk (my script picked up the latency trace as well). Let me know
if you want the trace data.

The slowdown in sched.c may be due to disk DMA activity. When I started
the kernel build, I forgot to run my script that sets a number of /proc
and /sys values. Specifically, the default values were:
cat /sys/block/hda/queue/max_sectors_kb
128
cat /sys/block/hda/queue/read_ahead_kb
128
cat /proc/sys/net/core/netdev_max_backlog
300
cat '/proc/irq/10/Ensoniq AudioPCI/threaded'
1

and in my other tests, they are:
echo 32 > /sys/block/hda/queue/max_sectors_kb
echo 32 > /sys/block/hda/queue/read_ahead_kb
echo 8 > /proc/sys/net/core/netdev_max_backlog
echo 0 > '/proc/irq/10/Ensoniq AudioPCI/threaded'

Note - no audio and only light network activity during the kernel
build.

With the default settings, I had latencies in sched.c over 2 msec
(longest was 2305 us). For example:

preemption latency trace v1.0.3
-------------------------------
latency: 2305 us, entries: 137 (137)
-----------------
| task: ksoftirqd/0/3, uid:0 nice:-10 policy:0 rt_prio:0
-----------------
=> started at: schedule+0x51/0x740
=> ended at: schedule+0x337/0x740
=======>
00000001 0.000ms (+0.000ms): schedule (worker_thread)
00000001 0.000ms (+0.000ms): sched_clock (schedule)
00000001 0.000ms (+0.000ms): spin_lock_irq (schedule)
00000001 0.000ms (+0.000ms): __spin_lock_irqsave (spin_lock_irq)
00000002 0.001ms (+0.000ms): deactivate_task (schedule)
00000002 0.001ms (+0.000ms): dequeue_task (deactivate_task)
04000002 0.765ms (+0.764ms): __switch_to (schedule)
04000002 0.800ms (+0.034ms): finish_task_switch (schedule)
04010002 1.116ms (+0.316ms): do_IRQ (finish_task_switch)
04010002 1.116ms (+0.000ms): spin_lock (do_IRQ)
04010003 1.117ms (+0.000ms): ack_edge_ioapic_irq (do_IRQ)
04010003 1.118ms (+0.000ms): generic_redirect_hardirq (do_IRQ)
04010002 1.328ms (+0.210ms): generic_handle_IRQ_event (do_IRQ)
04010002 1.797ms (+0.468ms): timer_interrupt (generic_handle_IRQ_event)
04010002 1.840ms (+0.043ms): spin_lock (timer_interrupt)
04010003 1.842ms (+0.001ms): mark_offset_tsc (timer_interrupt)
04010003 1.842ms (+0.000ms): mark_offset_tsc (timer_interrupt)
04010003 1.842ms (+0.000ms): spin_lock (mark_offset_tsc)
04010004 1.890ms (+0.048ms): mark_offset_tsc (timer_interrupt)
04010004 2.236ms (+0.345ms): mark_offset_tsc (timer_interrupt)
04010004 2.236ms (+0.000ms): spin_lock (mark_offset_tsc)
04010005 2.236ms (+0.000ms): mark_offset_tsc (timer_interrupt)
04010005 2.240ms (+0.003ms): mark_offset_tsc (timer_interrupt)
04010005 2.244ms (+0.004ms): mark_offset_tsc (timer_interrupt)
04010005 2.246ms (+0.002ms): mark_offset_tsc (timer_interrupt)
04010005 2.247ms (+0.000ms): mark_offset_tsc (timer_interrupt)
04010004 2.247ms (+0.000ms): mark_offset_tsc (timer_interrupt)
04010004 2.247ms (+0.000ms): mark_offset_tsc (timer_interrupt)
04010004 2.248ms (+0.000ms): mark_offset_tsc (timer_interrupt)
04010003 2.249ms (+0.001ms): timer_interrupt (generic_handle_IRQ_event)
...

So this trace shows a number of the symptoms I have previously
reported.

When I used the lower disk settings, the latencies went way
down [at least up to the crash...] with a maximum of 618 us.
I looked through several traces and could not find a similar
sequence for comparison.

So we may both be right:
- there is hardware overhead that needs to be accounted for; this
pair of runs seems to point to disk I/O, which we previously checked
is DMA to/from an IDE disk.
- there may be steps in the code that cause longer latency, which
we want to avoid if possible.

It may be a combination of effects. A question for others doing
testing (like Lee) - have you been doing any other activity in
the background when doing your tests? For example, I have found
that something as simple as
head -c $1 /dev/zero >tmpfile [where $1 is a size greater than physical memory]
or
cat tmpfile > /dev/null
can cause significantly increased latencies in the 2.6 kernels.

--Mark

2004-09-03 16:14:50

by Thomas Charbonnel

[permalink] [raw]
Subject: Re: [patch] voluntary-preempt-2.6.9-rc1-bk4-R1

Ingo Molnar wrote :
> this is a pretty weird one. First it shows an apparently non-monotonic
> RDTSC: the jump forward and backward in time around profile_hit. I
> suspect the real RDTSC value was lower than the previous one and caused
> an underflow. What is your cpu_khz in /proc/cpuinfo?
>
> the other weird one is the +0.595 msec entry at notifier_call_chain().
> That code is just a couple of instructions, so there is no real room for
> any overhead there.
>
> could you try the attached robust-get-cycles.patch ontop of your current
> tree and see whether it impacts these weirdnesses? The patch makes sure
> that the cycle counter is sane: two subsequent readings of it were
> monotonic and less than 1000 cycles apart.
>
> this patch probably wont remove the +0.595 msec latency though - the
> RDTSC value jumped forward there permanently. Maybe the RDTSC value is
> somehow corrupted by NMIs - could you turn off the NMI watchdog to
> check?

Here are more traces with robust-get-cycles applied. So far no
non-monotonic issue.

preemption latency trace v1.0.5 on 2.6.9-rc1-VP-R1
--------------------------------------------------
latency: 626 us, entries: 25 (25)
-----------------
| task: ksoftirqd/0/2, uid:0 nice:-10 policy:0 rt_prio:0
-----------------
=> started at: preempt_schedule+0x3b/0x60
=> ended at: schedule+0x2e4/0x570
=======>
04000000 0.000ms (+0.000ms): preempt_schedule (need_resched)
04000000 0.000ms (+0.000ms): schedule (preempt_schedule)
04000001 0.000ms (+0.620ms): sched_clock (schedule)
04010001 0.620ms (+0.000ms): do_nmi (sched_clock)
04010001 0.621ms (+0.000ms): profile_tick (nmi_watchdog_tick)
04010001 0.622ms (+0.000ms): profile_hook (profile_tick)
04010002 0.622ms (+0.000ms): notifier_call_chain (profile_hook)
04010001 0.622ms (+0.000ms): preempt_schedule (profile_tick)
04010001 0.622ms (+0.000ms): profile_hit (nmi_watchdog_tick)
04000001 0.623ms (+0.000ms): smp_apic_timer_interrupt (sched_clock)
04010001 0.623ms (+0.000ms): profile_tick (smp_apic_timer_interrupt)
04010001 0.623ms (+0.000ms): profile_hook (profile_tick)
04010002 0.623ms (+0.000ms): notifier_call_chain (profile_hook)
04010001 0.624ms (+0.000ms): preempt_schedule (profile_tick)
04010001 0.624ms (+0.000ms): profile_hit (smp_apic_timer_interrupt)
04000002 0.624ms (+0.000ms): do_softirq (smp_apic_timer_interrupt)
04000002 0.624ms (+0.000ms): __do_softirq (do_softirq)
04000002 0.625ms (+0.000ms): dequeue_task (schedule)
04000002 0.625ms (+0.000ms): recalc_task_prio (schedule)
04000002 0.625ms (+0.000ms): effective_prio (recalc_task_prio)
04000002 0.625ms (+0.000ms): enqueue_task (schedule)
00000002 0.626ms (+0.000ms): __switch_to (schedule)
00000002 0.626ms (+0.000ms): finish_task_switch (schedule)
00000001 0.626ms (+0.000ms): sub_preempt_count (schedule)
00000001 0.627ms (+0.000ms): update_max_trace (check_preempt_timing)

preemption latency trace v1.0.5 on 2.6.9-rc1-VP-R1
--------------------------------------------------
latency: 460 us, entries: 38 (38)
-----------------
| task: swapper/0, uid:0 nice:0 policy:0 rt_prio:0
-----------------
=> started at: do_IRQ+0x19/0x190
=> ended at: do_IRQ+0x14a/0x190
=======>
00010000 0.000ms (+0.000ms): do_IRQ (common_interrupt)
00010000 0.000ms (+0.000ms): do_IRQ (default_idle)
00010000 0.000ms (+0.000ms): do_IRQ (<00000000>)
00010001 0.000ms (+0.444ms): mask_and_ack_8259A (do_IRQ)
00020002 0.444ms (+0.000ms): do_nmi (mask_and_ack_8259A)
00020002 0.445ms (+0.000ms): profile_tick (nmi_watchdog_tick)
00020002 0.446ms (+0.000ms): profile_hook (profile_tick)
00020003 0.446ms (+0.000ms): notifier_call_chain (profile_hook)
00020002 0.446ms (+0.001ms): profile_hit (nmi_watchdog_tick)
00010001 0.448ms (+0.000ms): generic_redirect_hardirq (do_IRQ)
00010000 0.448ms (+0.000ms): generic_handle_IRQ_event (do_IRQ)
00010000 0.448ms (+0.000ms): timer_interrupt (generic_handle_IRQ_event)
00010001 0.448ms (+0.005ms): mark_offset_tsc (timer_interrupt)
00010001 0.454ms (+0.000ms): do_timer (timer_interrupt)
00010001 0.454ms (+0.000ms): update_process_times (do_timer)
00010001 0.454ms (+0.000ms): update_one_process (update_process_times)
00010001 0.455ms (+0.000ms): run_local_timers (update_process_times)
00010001 0.455ms (+0.000ms): raise_softirq (update_process_times)
00010001 0.455ms (+0.000ms): scheduler_tick (update_process_times)
00010001 0.455ms (+0.000ms): sched_clock (scheduler_tick)
00010001 0.456ms (+0.000ms): update_wall_time (do_timer)
00010001 0.456ms (+0.000ms): update_wall_time_one_tick (update_wall_time)
00010001 0.456ms (+0.000ms): generic_note_interrupt (do_IRQ)
00010001 0.456ms (+0.000ms): end_8259A_irq (do_IRQ)
00010001 0.457ms (+0.001ms): enable_8259A_irq (do_IRQ)
00000001 0.458ms (+0.000ms): do_softirq (do_IRQ)
00000001 0.458ms (+0.000ms): __do_softirq (do_softirq)
00000001 0.458ms (+0.000ms): wake_up_process (do_softirq)
00000001 0.458ms (+0.000ms): try_to_wake_up (wake_up_process)
00000001 0.458ms (+0.000ms): task_rq_lock (try_to_wake_up)
00000002 0.459ms (+0.000ms): activate_task (try_to_wake_up)
00000002 0.459ms (+0.000ms): sched_clock (activate_task)
00000002 0.459ms (+0.000ms): recalc_task_prio (activate_task)
00000002 0.459ms (+0.000ms): effective_prio (recalc_task_prio)
00000002 0.459ms (+0.000ms): enqueue_task (activate_task)
00000001 0.460ms (+0.000ms): preempt_schedule (try_to_wake_up)
00000001 0.460ms (+0.000ms): sub_preempt_count (do_IRQ)
00000001 0.460ms (+0.000ms): update_max_trace (check_preempt_timing)

preemption latency trace v1.0.5 on 2.6.9-rc1-VP-R1
--------------------------------------------------
latency: 557 us, entries: 39 (39)
-----------------
| task: evolution-1.4/5492, uid:1000 nice:0 policy:0 rt_prio:0
-----------------
=> started at: fget+0x32/0x80
=> ended at: fget+0x58/0x80
=======>
00000001 0.000ms (+0.537ms): fget (do_pollfd)
00010001 0.537ms (+0.000ms): do_nmi (sub_preempt_count)
00010001 0.538ms (+0.000ms): profile_tick (nmi_watchdog_tick)
00010001 0.538ms (+0.000ms): profile_hook (profile_tick)
00010002 0.539ms (+0.000ms): notifier_call_chain (profile_hook)
00010001 0.539ms (+0.001ms): profile_hit (nmi_watchdog_tick)
00010001 0.540ms (+0.000ms): do_IRQ (sub_preempt_count)
00010001 0.540ms (+0.000ms): do_IRQ (<00000000>)
00010002 0.541ms (+0.002ms): mask_and_ack_8259A (do_IRQ)
00010002 0.543ms (+0.000ms): generic_redirect_hardirq (do_IRQ)
00010001 0.543ms (+0.000ms): generic_handle_IRQ_event (do_IRQ)
00010001 0.543ms (+0.000ms): timer_interrupt (generic_handle_IRQ_event)
00010002 0.544ms (+0.005ms): mark_offset_tsc (timer_interrupt)
00010002 0.549ms (+0.000ms): do_timer (timer_interrupt)
00010002 0.550ms (+0.000ms): update_process_times (do_timer)
00010002 0.550ms (+0.000ms): update_one_process (update_process_times)
00010002 0.550ms (+0.000ms): run_local_timers (update_process_times)
00010002 0.550ms (+0.000ms): raise_softirq (update_process_times)
00010002 0.551ms (+0.000ms): scheduler_tick (update_process_times)
00010002 0.551ms (+0.000ms): sched_clock (scheduler_tick)
00010003 0.551ms (+0.000ms): task_timeslice (scheduler_tick)
00010002 0.552ms (+0.000ms): update_wall_time (do_timer)
00010002 0.552ms (+0.000ms): update_wall_time_one_tick (update_wall_time)
00010002 0.552ms (+0.000ms): generic_note_interrupt (do_IRQ)
00010002 0.553ms (+0.000ms): end_8259A_irq (do_IRQ)
00010002 0.553ms (+0.001ms): enable_8259A_irq (do_IRQ)
00000002 0.554ms (+0.000ms): do_softirq (do_IRQ)
00000002 0.554ms (+0.000ms): __do_softirq (do_softirq)
00000002 0.554ms (+0.000ms): wake_up_process (do_softirq)
00000002 0.555ms (+0.000ms): try_to_wake_up (wake_up_process)
00000002 0.555ms (+0.000ms): task_rq_lock (try_to_wake_up)
00000003 0.555ms (+0.000ms): activate_task (try_to_wake_up)
00000003 0.555ms (+0.000ms): sched_clock (activate_task)
00000003 0.555ms (+0.000ms): recalc_task_prio (activate_task)
00000003 0.556ms (+0.000ms): effective_prio (recalc_task_prio)
00000003 0.556ms (+0.000ms): enqueue_task (activate_task)
00000002 0.556ms (+0.000ms): preempt_schedule (try_to_wake_up)
00000001 0.557ms (+0.000ms): sub_preempt_count (fget)
00000001 0.557ms (+0.000ms): update_max_trace (check_preempt_timing)

I had similar problems affecting apparently random code paths before
enabling nmi_watchdog, but if you feel it can be useful, I can do some
more tests with robust-get-cycles and without nmi.

Thomas


2004-09-03 17:03:28

by K.R. Foley

[permalink] [raw]
Subject: Re: [patch] voluntary-preempt-2.6.9-rc1-bk4-R0

Ingo Molnar wrote:
> i've released the -R0 patch:
>
> http://redhat.com/~mingo/voluntary-preempt/voluntary-preempt-2.6.9-rc1-bk4-R0
>
> ontop of:
>
> http://redhat.com/~mingo/voluntary-preempt/diff-bk-040828-2.6.8.1.bz2
>
> i've given up on the netdev_backlog_granularity approach, and as a
> replacement i've modified specific network drivers to return at a safe
> point if softirq preemption is requested. This gives the same end result
> but is more robust. For the time being i've fixed 8139too.c and e100.c.
> (will fix up other drivers too as latencies get reported)
>
> this should fix the crash reported by P.O. Gaillard, and it should solve
> the packet delay/loss issues reported by Mark H Johnson. I cannot see
> any problems on my rtl8139 testbox anymore.
>
> Ingo
>

OK. I previously reported the system hanging while trying to boot this
patch. It may have been a fluke. In any event, it doesn't hang
consistently. I have rebooted several times now without problems. I do,
however, still get some of these:

(ksoftirqd/0/2): new 131 us maximum-latency critical section.
=> started at: <netif_receive_skb+0x82/0x280>
=> ended at: <netif_receive_skb+0x1d7/0x280>
[<c0136789>] check_preempt_timing+0x119/0x1f0
[<c0255017>] netif_receive_skb+0x1d7/0x280
[<c0255017>] netif_receive_skb+0x1d7/0x280
[<c013699d>] sub_preempt_count+0x4d/0x70
[<c013699d>] sub_preempt_count+0x4d/0x70
[<c0255017>] netif_receive_skb+0x1d7/0x280
[<d08d40b7>] e100_poll+0x5b7/0x630 [e100]
[<c013699d>] sub_preempt_count+0x4d/0x70
[<c025527f>] net_rx_action+0x7f/0x110
[<c011eb55>] ___do_softirq+0x75/0x90
[<c011ec1e>] _do_softirq+0xe/0x20
[<c011f034>] ksoftirqd+0x94/0xe0
[<c012f8aa>] kthread+0xaa/0xb0
[<c011efa0>] ksoftirqd+0x0/0xe0
[<c012f800>] kthread+0x0/0xb0
[<c0104575>] kernel_thread_helper+0x5/0x10


2004-09-03 17:39:38

by Thomas Charbonnel

[permalink] [raw]
Subject: Re: [patch] voluntary-preempt-2.6.9-rc1-bk4-R1

I wrote :
> Ingo Molnar wrote :
> > this is a pretty weird one. First it shows an apparently non-monotonic
> > RDTSC: the jump forward and backward in time around profile_hit. I
> > suspect the real RDTSC value was lower than the previous one and caused
> > an underflow. What is your cpu_khz in /proc/cpuinfo?
> >
> > the other weird one is the +0.595 usec entry at notifier_call_chain().
> > That code is just a couple of instructions, so no real for any overhead
> > there.
> >
> > could you try the attached robust-get-cycles.patch ontop of your current
> > tree and see whether it impacts these weirdnesses? The patch makes sure
> > that the cycle counter is sane: two subsequent readings of it were
> > monotonic and less than 1000 cycles apart.
> >
> > this patch probably wont remove the +0.595 msec latency though - the
> > RDTSC value jumped forward there permanently. Maybe the RDTSC value is
> > somehow corrupted by NMIs - could you turn off the NMI watchdog to
> > check?
>
> Here are more traces with robust-get-cycles applied. So far no
> non-monotonic issue.

In the end here's one:

preemption latency trace v1.0.5 on 2.6.9-rc1-VP-R1
--------------------------------------------------
latency: 891 us, entries: 38 (38)
-----------------
| task: swapper/0, uid:0 nice:0 policy:0 rt_prio:0
-----------------
=> started at: do_IRQ+0x19/0x190
=> ended at: do_IRQ+0x14a/0x190
=======>
00010000 0.000ms (+0.876ms): do_nmi (robust_get_cycles)
00020000 0.876ms (+0.000ms): profile_tick (nmi_watchdog_tick)
00020000 0.876ms (+0.000ms): profile_hook (profile_tick)
00020001 0.876ms (+0.000ms): notifier_call_chain (profile_hook)
00020000 0.876ms (+689952.730ms): profile_hit (nmi_watchdog_tick)
04000002 689953.607ms (+1.855ms): finish_task_switch (schedule)
00010000 0.877ms (+0.000ms): do_IRQ (default_idle)
00010000 0.877ms (+0.000ms): do_IRQ (<00000000>)
00010001 0.877ms (+0.002ms): mask_and_ack_8259A (do_IRQ)
00010001 0.879ms (+0.000ms): generic_redirect_hardirq (do_IRQ)
00010000 0.880ms (+0.000ms): generic_handle_IRQ_event (do_IRQ)
00010000 0.880ms (+0.000ms): timer_interrupt (generic_handle_IRQ_event)
00010001 0.880ms (+0.005ms): mark_offset_tsc (timer_interrupt)
00010001 0.886ms (+0.000ms): do_timer (timer_interrupt)
00010001 0.886ms (+0.000ms): update_process_times (do_timer)
00010001 0.886ms (+0.000ms): update_one_process (update_process_times)
00010001 0.886ms (+0.000ms): run_local_timers (update_process_times)
00010001 0.886ms (+0.000ms): raise_softirq (update_process_times)
00010001 0.887ms (+0.000ms): scheduler_tick (update_process_times)
00010001 0.887ms (+0.000ms): sched_clock (scheduler_tick)
00010001 0.887ms (+0.000ms): update_wall_time (do_timer)
00010001 0.887ms (+0.000ms): update_wall_time_one_tick (update_wall_time)
00010001 0.888ms (+0.000ms): generic_note_interrupt (do_IRQ)
00010001 0.888ms (+0.000ms): end_8259A_irq (do_IRQ)
00010001 0.888ms (+0.001ms): enable_8259A_irq (do_IRQ)
00000001 0.889ms (+0.000ms): do_softirq (do_IRQ)
00000001 0.889ms (+0.000ms): __do_softirq (do_softirq)
00000001 0.889ms (+0.000ms): wake_up_process (do_softirq)
00000001 0.890ms (+0.000ms): try_to_wake_up (wake_up_process)
00000001 0.890ms (+0.000ms): task_rq_lock (try_to_wake_up)
00000002 0.890ms (+0.000ms): activate_task (try_to_wake_up)
00000002 0.890ms (+0.000ms): sched_clock (activate_task)
00000002 0.890ms (+0.000ms): recalc_task_prio (activate_task)
00000002 0.891ms (+0.000ms): effective_prio (recalc_task_prio)
00000002 0.891ms (+0.000ms): enqueue_task (activate_task)
00000001 0.891ms (+0.000ms): preempt_schedule (try_to_wake_up)
00000001 0.892ms (+0.000ms): sub_preempt_count (do_IRQ)
00000001 0.892ms (+0.000ms): update_max_trace (check_preempt_timing)


2004-09-03 18:17:27

by K.R. Foley

[permalink] [raw]
Subject: Re: [patch] voluntary-preempt-2.6.9-rc1-bk4-R0

Ingo Molnar wrote:
> another question - any objections against me rebasing the VP patch to
> the current -mm tree and keeping it there exclusively until all possible
> merges are done? It would probably be quite some work to create a complete
> patch for the upstream or BK tree during that process, as small patches
> start to flow in the VP => -mm => BK direction. Would any of the regular
> VP users/testers be wary to use the -mm tree?
>
> Ingo
>

I don't have any objections. Should I be wary of running the -mm tree?

kr

2004-09-03 18:27:05

by Ingo Molnar

[permalink] [raw]
Subject: Re: [patch] voluntary-preempt-2.6.9-rc1-bk4-R0


* K.R. Foley <[email protected]> wrote:

> Managed to hang the system again under heavy load. This time with the
> above patch:
>
> http://www.cybsft.com/testresults/crashes/2.6.9-rc1-bk4-R0.txt
>
> Last time was with Q7:
>
> http://www.cybsft.com/testresults/crashes/2.6.9-rc1-bk4-Q7.txt

seems to be the same thing - an unbalanced preemption count, possibly
due to some locking error. Unfortunately the assert catches the
imbalance only at exit time. (it's unlikely that the do_exit() code is
buggy.)

i'll add a new feature to debug this: when crashing on an assert and
tracing is enabled the trace leading up to the crash will be printed to
the console. How did you capture the crash - was it in the syslog or do
you have serial logging? Maybe this is not the real crash but only a
followup crash?

Ingo

2004-09-03 18:30:00

by Mark_H_Johnson

[permalink] [raw]
Subject: Re: [patch] voluntary-preempt-2.6.9-rc1-bk4-Q9

>a quick suggestion: could you add this near the top of sched.c (below
>the #include lines):
>
> #define static
>
>this will turn off all inlining and makes the scheduler-internal
>functions visible. ....

I did this and built a broken kernel. It panics very early in startup.
I also got warnings like
kernel/sched.c: In function __might_sleep
kernel/sched.c:4974: warning: prev_jiffy might be used uninitialized in
this function
which if I read the code right should be a local static variable.

I suppose you meant
#define inline
instead (which throws a warning about a duplicate definition; can I
ignore it?)

--Mark H Johnson
<mailto:[email protected]>

2004-09-03 18:34:34

by Ingo Molnar

[permalink] [raw]
Subject: Re: [patch] voluntary-preempt-2.6.9-rc1-bk4-Q9


* [email protected] <[email protected]> wrote:

> I suppose you meant
> #define inline

yeah ...

> instead (which throws a warning about a duplicate definition; can I
> ignore it?)

yeah, ought to be fine. (you can do an #undef inline for a cleaner
compile).
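
i.e. something like this near the top of sched.c, below the #include
lines (just a sketch of what i mean):

	/* debug aid: force the sched.c-internal functions out-of-line so
	 * that they show up individually in the latency traces */
	#undef inline
	#define inline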

Ingo

2004-09-03 18:38:44

by K.R. Foley

[permalink] [raw]
Subject: Re: [patch] voluntary-preempt-2.6.9-rc1-bk4-R0

Ingo Molnar wrote:
> * K.R. Foley <[email protected]> wrote:
>
>
>>Managed to hang the system again under heavy load. This time with the
>>above patch:
>>
>>http://www.cybsft.com/testresults/crashes/2.6.9-rc1-bk4-R0.txt
>>
>>Last time was with Q7:
>>
>>http://www.cybsft.com/testresults/crashes/2.6.9-rc1-bk4-Q7.txt
>
>
> seems to be the same thing - an unbalanced preemption count, possibly
> due to some locking error. Unfortunately the assert catches the
> imbalance only at exit time. (it's unlikely that the do_exit() code is
> buggy.)
>
> i'll add a new feature to debug this: when crashing on an assert and
> tracing is enabled the trace leading up to the crash will be printed to
> the console. How did you capture the crash - was it in the syslog or do
> you have serial logging? Maybe this is not the real crash but only a
> followup crash?
>
> Ingo
>

This actually came from syslog. When it happened it was completely
locked, so I have no idea if there was more that didn't make it to disk.
Of course the monitor was blanked because most of the access is via the
network. I could probably enable serial logging if necessary because
it's sitting next to a couple of other machines.

kr

2004-09-03 18:42:15

by Ingo Molnar

[permalink] [raw]
Subject: Re: [patch] voluntary-preempt-2.6.9-rc1-bk4-R0


* K.R. Foley <[email protected]> wrote:

> Managed to hang the system again under heavy load. This time with the
> above patch:
>
> http://www.cybsft.com/testresults/crashes/2.6.9-rc1-bk4-R0.txt

do you have CONFIG_DEBUG_SPINLOCK and CONFIG_DEBUG_SPINLOCK_SLEEP
enabled? These are pretty good at e.g. catching illegal scheduling in
critical sections much earlier. You can enable them on !SMP kernels as
well.

Ingo

2004-09-03 18:50:03

by K.R. Foley

[permalink] [raw]
Subject: Re: [patch] voluntary-preempt-2.6.9-rc1-bk4-R0

Ingo Molnar wrote:
> * K.R. Foley <[email protected]> wrote:
>
>
>>Managed to hang the system again under heavy load. This time with the
>>above patch:
>>
>>http://www.cybsft.com/testresults/crashes/2.6.9-rc1-bk4-R0.txt
>
>
> do you have CONFIG_DEBUG_SPINLOCK and CONFIG_DEBUG_SPINLOCK_SLEEP
> enabled? These are pretty good at e.g. catching illegal scheduling in
> critical sections much earlier. You can enable them on !SMP kernels as
> well.
>
> Ingo

Not currently but I am enabling now.

kr

2004-09-03 19:33:55

by Ingo Molnar

[permalink] [raw]
Subject: [patch] voluntary-preempt-2.6.9-rc1-bk4-R3


* Ingo Molnar <[email protected]> wrote:

> i'll add a new feature to debug this: when crashing on an assert and
> tracing is enabled the trace leading up to the crash will be printed
> to the console. [...]

the -R3 patch has this feature:

http://redhat.com/~mingo/voluntary-preempt/voluntary-preempt-2.6.9-rc1-bk4-R3

you can enable it via enabling CONFIG_LATENCY_TRACE and doing:

echo 3 > /proc/sys/kernel/trace_enabled

it's all automatic from this point on: tracing will happen nonstop and
any assert or crash that prints the process stack will also print the
last 100 trace entries. Sample output:

Call Trace:
[<c0160401>] sys_munmap+0x61/0x80
[<c010520d>] sysenter_past_esp+0x52/0x71
Last 100 trace entries:
00000001: zap_pmd_range+0xe/0x90 <= (unmap_page_range+0x55/0x80)
00000001: stop_trace+0x8/0x20 <= (bust_spinlocks+0x20/0x60)
00000001: bust_spinlocks+0xe/0x60 <= (die+0xbc/0x2a0)
[... 97 more trace entries ...]

Please capture and mail the first 'extended' oops that triggers
(secondary followup traces are probably just side-effects of the crash),
it could give us clues about where the bug is.

Ingo

2004-09-03 19:54:20

by K.R. Foley

[permalink] [raw]
Subject: Re: [patch] voluntary-preempt-2.6.9-rc1-bk4-R3

Ingo Molnar wrote:
> * Ingo Molnar <[email protected]> wrote:
>
>
>>i'll add a new feature to debug this: when crashing on an assert and
>>tracing is enabled the trace leading up to the crash will be printed
>>to the console. [...]
>
>
> the -R3 patch has this feature:
>
> http://redhat.com/~mingo/voluntary-preempt/voluntary-preempt-2.6.9-rc1-bk4-R3
>
> you can enable it via enabling CONFIG_LATENCY_TRACE and doing:
>
> echo 3 > /proc/sys/kernel/trace_enabled
>
> it's all automatic from this point on: tracing will happen nonstop and
> any assert or crash that prints the process stack will also print the
> last 100 trace entries. Sample output:
>
> Call Trace:
> [<c0160401>] sys_munmap+0x61/0x80
> [<c010520d>] sysenter_past_esp+0x52/0x71
> Last 100 trace entries:
> 00000001: zap_pmd_range+0xe/0x90 <= (unmap_page_range+0x55/0x80)
> 00000001: stop_trace+0x8/0x20 <= (bust_spinlocks+0x20/0x60)
> 00000001: bust_spinlocks+0xe/0x60 <= (die+0xbc/0x2a0)
> [... 97 more trace entries ...]
>
> Please capture and mail the first 'extended' oops that triggers
> (secondary followup traces are probably just side-effects of the crash),
> it could give us clues about where the bug is.
>
> Ingo
>

Building now.

kr

2004-09-03 20:45:22

by Lee Revell

[permalink] [raw]
Subject: Re: [patch] voluntary-preempt-2.6.9-rc1-bk4-R0

On Fri, 2004-09-03 at 13:02, K.R. Foley wrote:
> OK. I previously reported the system hanging while trying to boot this
> patch. It may have been a fluke. In any event, it doesn't hang
> consistently. I have rebooted several times now without problems. I do
> however still get some of these:
>
> (ksoftirqd/0/2): new 131 us maximum-latency critical section.
> => started at: <netif_receive_skb+0x82/0x280>
> => ended at: <netif_receive_skb+0x1d7/0x280>

This is actually a good sign if this is the most common one you see.
The netif_receive softirq is basically the longest code path that can't
easily be made preemptible; on my hardware about 150 usecs seems to be
the best we can do. This is really good, I can't imagine an application
that you would use PC hardware for where 150 usecs is not enough, though
I am sure the hard realtime crowd has a few.

Even fixing this one would not help much because it seems there are many
other non-preemptible code paths that are only a little shorter than this
one.

Lee

2004-09-04 00:04:18

by Lee Revell

[permalink] [raw]
Subject: Re: [patch] voluntary-preempt-2.6.9-rc1-bk4-Q9

On Fri, 2004-09-03 at 11:33, [email protected] wrote:
> Well - THAT was special. Another crash but I may have a clue on both that
> and the "general slow down" in sched.c.
>
> The crash is likely due to a problem with X. I caused it this time when
> I was trying to open / hide Mozilla windows (again preparing to send email
> with a web based client). The last operation that worked was the window
> hide. The one that locked everything up was a click to restore a Mozilla
> window onto the screen. I don't know if this is relevant, but the last
> trace in /var/log/messages was a latency trace caused by X.
>
> Sep 3 09:57:11 dws77 kernel: (X/2382): new 329 us maximum-latency critical
> section.
> Sep 3 09:57:11 dws77 kernel: => started at: <kmap_atomic+0x23/0xe0>
> Sep 3 09:57:11 dws77 kernel: => ended at: <kunmap_atomic+0x7b/0xa0>
>
> I am not sure this is relevant since all the data for it was written
> to disk (my script picked up the latency trace as well). Let me know
> if you want the trace data.
>
> The slow down in sched.c may be due to disk DMA activities.

[...]

> It may be a combination of effects. A question for others doing
> testing (like Lee) - have you been doing any other activity in
> the background when doing your tests? For example, I have found
> that something as simple as
> head -c $1 /dev/zero >tmpfile [where $1 is larger than physical memory]
> or
> cat tmpfile > /dev/null
> can cause significantly increased latencies in the 2.6 kernels.
>

This is looking more and more like a video driver problem:

"Misbehaving video card drivers are another source of significant delays
in scheduling user code. A number of video cards manufacturers recently
began employing a hack to save a PCI bus transaction for each display
operation in order to gain a few percentage points on their WinBench
[Ziff-Davis 98] Graphics WinMark performance.

The video cards have a command FIFO that is written to via the PCI bus.
They also have a status register, read via the PCI bus, which says
whether the command FIFO is full or not. The hack is to not check
whether the command FIFO is full before attempting to write to it, thus
saving a PCI bus read.

The problem with this is that the result of attempting to write to the
FIFO when it is full is to stall the CPU waiting on the PCI bus write
until a command has been completed and space becomes available to accept
the new command. In fact, this not only causes the CPU to stall waiting
on the PCI bus, but since the PCI controller chip also controls the ISA
bus and mediates interrupts, ISA traffic and interrupt requests are
stalled as well. Even the clock interrupts stop.

These video cards will stall the machine, for instance, when the user
drags a window. For windows occupying most of a 1024x768 screen on a
333MHz Pentium II with an AccelStar II AGP video board (which is based
on the 3D Labs Permedia 2 chip set) this will stall the machine for
25-30ms at a time!"

(from http://research.microsoft.com/~mbj/papers/tr-98-29.html)
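
(In code terms, the "polite" pattern the paper says these drivers skip
would look roughly like this - a sketch only, with made-up register
offsets and flag bits:)

	#include <linux/types.h>
	#include <asm/io.h>
	#include <asm/processor.h>

	#define FIFO_STATUS_REG	0x10	/* made-up offsets/bits */
	#define FIFO_CMD_REG	0x14
	#define FIFO_FULL	0x01

	/* one extra PCI read per command, but no CPU/PCI-bus stall */
	static void fifo_post_command(void __iomem *mmio, u32 cmd)
	{
		while (readl(mmio + FIFO_STATUS_REG) & FIFO_FULL)
			cpu_relax();
		writel(cmd, mmio + FIFO_CMD_REG);
	}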

Ingo, would the above situation indeed produce these symptoms?

I had this exact problem with my Via Unichrome chipset (an open source
driver!), reported it to the maintainers, and it turned out this was
exactly what the driver was doing (it's fixed now). The above text is
from 1998, the problem in the via driver was discovered last week. So I
would actually expect this behavior to be the norm, unless someone has
fixed that driver.

The easiest way to eliminate this possibility is to disable DRI and set
'Option "NoAccel"' in your X config. Do you get the same mysterious
latencies with this setting?
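
For reference, roughly the XF86Config-4 changes I mean (a sketch - your
section contents will differ):

	Section "Module"
	    # comment out the dri module to disable direct rendering:
	    # Load "dri"
	EndSection

	Section "Device"
	    # your existing card section, whatever it is called:
	    Identifier "Card0"
	    Option     "NoAccel"
	EndSection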

What video hardware are you using?

Lee

2004-09-04 03:39:38

by K.R. Foley

[permalink] [raw]
Subject: Re: [patch] voluntary-preempt-2.6.9-rc1-bk4-R3

Ingo Molnar wrote:
> * Ingo Molnar <[email protected]> wrote:
>
>
>>i'll add a new feature to debug this: when crashing on an assert and
>>tracing is enabled the trace leading up to the crash will be printed
>>to the console. [...]
>
>
> the -R3 patch has this feature:
>

After hammering the system for a little more than an hour it gave up. I
don't have the serial logging setup yet because I haven't had time this
evening. I will be glad to do whatever I can to try to help debug this,
but it will have to wait until tomorrow. The log is here:

http://www.cybsft.com/testresults/crashes/2.6.9-rc1-vo-R3.txt

kr

2004-09-04 03:44:18

by K.R. Foley

[permalink] [raw]
Subject: Re: [patch] voluntary-preempt-2.6.9-rc1-bk4-R3

Sorry I forgot to mention that this was triggered running the
stress-kernel package, minus the NFS-Compile, but it does include the
CRASHME test. In addition, amlat was running as well. The system was
pretty much 100% loaded.

2004-09-04 06:49:33

by Ingo Molnar

[permalink] [raw]
Subject: Re: [patch] voluntary-preempt-2.6.9-rc1-bk4-R3


* K.R. Foley <[email protected]> wrote:

> http://www.cybsft.com/testresults/crashes/2.6.9-rc1-vo-R3.txt

the first line seems partial - isnt the full oops in the log?

> Sorry I forgot to mention that this was triggered running the
> stress-kernel package, minus the NFS-Compile, but it does include the
> CRASHME test. In addition, amlat was running as well. The system was
> pretty much 100% loaded.

Have you run crashme as root? That would be unsafe.

Ingo

2004-09-04 08:55:49

by Ingo Molnar

[permalink] [raw]
Subject: Re: [patch] voluntary-preempt-2.6.9-rc1-bk4-R3


* K.R. Foley <[email protected]> wrote:

> After hammering the system for a little more than an hour it gave up.
> I don't have the serial logging setup yet because I haven't had time
> this evening. I will be glad to do whatever I can to try to help debug
> this, but it will have to wait until tomorrow. The log is here:
>
> http://www.cybsft.com/testresults/crashes/2.6.9-rc1-vo-R3.txt

fyi, i have now triggered a similar crash on a testbox too. It takes
quite some time to trigger but it does.

since it happens with VP=0,KP=0,SP=0,HP=0 as well it should be one of
the cond_resched_lock() (or cond_resched()) additions.

Ingo

2004-09-04 10:16:39

by Lee Revell

[permalink] [raw]
Subject: Re: [patch] voluntary-preempt-2.6.9-rc1-bk4-R3

On Sat, 2004-09-04 at 04:57, Ingo Molnar wrote:
> * K.R. Foley <[email protected]> wrote:
>
> > After hammering the system for a little more than an hour it gave up.
> > I don't have the serial logging setup yet because I haven't had time
> > this evening. I will be glad to do whatever I can to try to help debug
> > this, but it will have to wait until tomorrow. The log is here:
> >
> > http://www.cybsft.com/testresults/crashes/2.6.9-rc1-vo-R3.txt
>
> fyi, i have now triggered a similar crash on a testbox too. It takes
> quite some time to trigger but it does.
>
> since it happens with VP=0,KP=0,SP=0,HP=0 as well it should be one of
> the cond_resched_lock() (or cond_resched()) additions.
>

Here are some results for R0 on another machine, a 1.2Ghz Athlon XP:

http://krustophenia.net/testresults.php?dataset=2.6.9-rc1-R0

I have also changed the test again, to be more accurate. Now the jackd
alsa driver separately keeps track of the time it spends in poll(), and
the time it takes running the jackd process cycle. The length of one
period less the sum of the time we spent in poll() and the time it took
to run the process cycle equals the amount of time we spent ready to
run but waiting to be scheduled, i.e. the scheduling latency.
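
(Per cycle, the accounting is just this - a minimal sketch with made-up
names, not the actual jackd driver code:)

	#include <stdio.h>

	/* scheduling latency for one cycle: the part of the period in which
	 * we were runnable but not yet running */
	static long long cycle_latency_us(long long period_us,
					  long long poll_us,
					  long long process_us)
	{
		return period_us - (poll_us + process_us);
	}

	int main(void)
	{
		/* e.g. a 666 usec period, 600 usecs in poll() and 50 usecs
		 * of processing leave 16 usecs of scheduling latency */
		printf("%lld usecs\n", cycle_latency_us(666, 600, 50));
		return 0;
	}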

The results are pretty amazing - out of a period time of 666 usecs most
of the time we spend between 0 and 1 usec in this state. The worst is
27 usecs or so.

These results are of course not directly comparable with previous tests,
but I believe this is the most accurate way to measure latency in jackd.

Lee

2004-09-04 12:28:19

by K.R. Foley

[permalink] [raw]
Subject: Re: [patch] voluntary-preempt-2.6.9-rc1-bk4-R3

Ingo Molnar wrote:
> * K.R. Foley <[email protected]> wrote:
>
>
>>http://www.cybsft.com/testresults/crashes/2.6.9-rc1-vo-R3.txt
>
>
> the first line seems partial - isnt the full oops in the log?

Agreed. Unfortunately that is all there is.

>
>
>>Sorry I forgot to mention that this was triggered running the
>>stress-kernel package, minus the NFS-Compile, but it does include the
>>CRASHME test. In addition, amlat was running as well. The system was
>>pretty much 100% loaded.
>
>
> Have you run crashme as root? That would be unsafe.

Actually what happens is that it creates a "crashme" user (and group)
for running the test and then deletes the user after the test. In fact
the user is still in passwd; because of the crash it didn't get cleaned
up. So I don't think this SHOULD be a problem.


2004-09-04 14:35:59

by K.R. Foley

[permalink] [raw]
Subject: Re: [patch] voluntary-preempt-2.6.9-rc1-bk4-R3

Ingo Molnar wrote:
> * K.R. Foley <[email protected]> wrote:
>
>
>>After hammering the system for a little more than an hour it gave up.
>>I don't have the serial logging setup yet because I haven't had time
>>this evening. I will be glad to do whatever I can to try to help debug
>>this, but it will have to wait until tomorrow. The log is here:
>>
>>http://www.cybsft.com/testresults/crashes/2.6.9-rc1-vo-R3.txt
>
>
> fyi, i have now triggered a similar crash on a testbox too. It takes
> quite some time to trigger but it does.
>
> since it happens with VP=0,KP=0,SP=0,HP=0 as well it should be one of
> the cond_resched_lock() (or cond_resched()) additions.
>
> Ingo
>

I am glad that it's reproducible for you as well. How did you trigger
it? Because it seems to only crash under heavy load for me. The system
has been up since I rebooted last night after the crash and I haven't
seen any problems. The same goes for the time up until last night when I booted
the new patch. Even building the new patch didn't seem to be enough to
trigger it.

kr

2004-09-04 17:56:42

by Alan

[permalink] [raw]
Subject: Re: [patch] voluntary-preempt-2.6.9-rc1-bk4-Q9

On Sad, 2004-09-04 at 01:04, Lee Revell wrote:
> This is looking more and more like a video driver problem:

Not really. The delay is too small and X is smarter than this. (except a
VIA case that only recently got squished).

> The video cards have a command FIFO that is written to via the PCI bus.
> They also have a status register, read via the PCI bus, which says
> whether the command FIFO is full or not. The hack is to not check
> whether the command FIFO is full before attempting to write to it, thus
> saving a PCI bus read.

On problem cards X defaults to polling the status FIFO. You can tell it
to be rude but you have to actively do so. Newer PCI 2.x specs also have
a thing or two to say on the subject.

2004-09-04 18:07:53

by Lee Revell

[permalink] [raw]
Subject: Re: [patch] voluntary-preempt-2.6.9-rc1-bk4-Q9

On Sat, 2004-09-04 at 12:52, Alan Cox wrote:
> On Sad, 2004-09-04 at 01:04, Lee Revell wrote:
> > This is looking more and more like a video driver problem:
>
> Not really. The delay is too small and X is smarter than this. (except a
> VIA case that only recently got squished).
>

True, the VIA problem did not cause long latency traces to appear on my
machine. This is a weird one.

Lee

2004-09-04 20:04:10

by Ingo Molnar

[permalink] [raw]
Subject: Re: [patch] voluntary-preempt-2.6.9-rc1-bk4-R3


* K.R. Foley <[email protected]> wrote:

> I am glad that it's reproducible for you as well. How did you trigger
> it? Because it seems to only crash under heavy load for me. The system
> has been up since I rebooted last night after the crash and I haven't
> seen any problems. Same thing goes for up until last night when I
> booted the new patch. Even building the new patch didn't seem to be
> enough to trigger it.

i triggered it via two IO-intense scripts running on a 256 MB RAM
testbox:

cd /tmp; while true; do dd if=/dev/zero of=bigfile bs=1000000
count=500 >/dev/null 2>/dev/null; sync; date; sleep 60; done &

while true; do du /usr >/dev/null 2>/dev/null; date; done &

usually it triggers within half an hour or so of runtime. With -R4 it
didnt crash yet after ~1 hour of running so i'm quite optimistic that
this particular bug has been fixed.

i think the key to reproduce was to use KP=0,SP=0,HP=0 - this
concentrates all preemption activity to the places the VP patch adds -
amongst them the one in the do_exit() path that is necessary to trigger
the race.

Ingo

2004-09-07 17:01:37

by Mark_H_Johnson

[permalink] [raw]
Subject: Re: [patch] voluntary-preempt-2.6.9-rc1-bk4-R1

Results from this morning's test with -R1 and some follow up on
related messages.

Two runs with -R1 plus the following changes:
- a change to the ensoniq driver to reduce latency
- added mcount() calls in sched.c and timer_tsc.c
- disabled inline functions in sched.c
The last two so we can see the steps a little more clearly.

The differences between the runs were:
- first run had DRI and hardware acceleration enabled
- second run disabled DRI / hardware acceleration
as suggested by Lee Revell.

>From Lee Revell <[email protected]>
>This is looking more and more like a video driver problem:
>...
>The easiest way to eliminate this possibility is to disable DRI and set
>'Option "NoAccel"' in your X config. Do you get the same mysterious
>latencies with this setting?

I am not sure I see any significant differences in latencies. Most of
the large delays occur during disk reads and the number of traces >
500 usec was over 100 in both tests (about 30 minutes each). There were
also more latency traces per minute during disk reads when DRI was
disabled. There were however none of the "find_next_bit" traces in the
NoAccel run with significant time delays so that may indicate some
display problems but not sure. See below for details.

Side Comment
============

If you look at the date / time of the traces, you will notice that
most occur in the latter part of the test. This is during the
"disk copy" and "disk read" parts of the testing. To illustrate, the
first 10 traces take about 20 minutes to collect; the last 90 are
collected in 15 and 9 minutes (first and second test).

It is also encouraging that the longest trace is < 800 usec in these
two runs. This is far better than what I saw a couple weeks ago.

Shortest trace
==============

preemption latency trace v1.0.5 on 2.6.9-rc1-VP-R1
--------------------------------------------------
latency: 550 us, entries: 6 (6)
-----------------
| task: cat/6771, uid:0 nice:0 policy:0 rt_prio:0
-----------------
=> started at: kmap_atomic+0x23/0xe0
=> ended at: kunmap_atomic+0x7b/0xa0
=======>
00000001 0.000ms (+0.000ms): kmap_atomic (file_read_actor)
00000001 0.000ms (+0.000ms): page_address (file_read_actor)
00000001 0.000ms (+0.549ms): __copy_to_user_ll (file_read_actor)
00000001 0.550ms (+0.000ms): kunmap_atomic (file_read_actor)
00000001 0.550ms (+0.000ms): sub_preempt_count (kunmap_atomic)
00000001 0.550ms (+0.000ms): update_max_trace (check_preempt_timing)

The 1/2 msec latency in __copy_to_user_ll occurred a few more times
in both tests.

Find Next Bit
=============

It appears that -R1 has significantly fewer cases where the code in
find_next_bit is delayed. To summarize:

# grep find_next_bit lt040903/lt*/lt.* | grep '(+0.1' | wc -l
63
# grep find_next_bit lt040907/lt*/lt.* | grep '(+0.1' | wc -l
0

[counting number of trace entries w/ 100-199 usec latencies]

The runs on September 3 were with -Q9, today's (September 7) were
with -R1. Not sure why this changed, but it is encouraging. The
maximum with DRI / display acceleration was +0.069ms; the maximum
without DRI was +0.001ms.

Mark Offset TSC
===============

In a similar way, long latencies in mark_offset_tsc were significantly
reduced in -R1.
# grep _tsc lt040903/lt*/lt.* | grep '(+0.1' | wc -l
24
# grep _tsc lt040907/lt*/lt.* | grep '(+0.1' | wc -l
3

Two of the long delays were in the same trace in the following
sequence:

00010001 0.140ms (+0.000ms): timer_interrupt (generic_handle_IRQ_event)
00010001 0.140ms (+0.000ms): spin_lock (timer_interrupt)
00010002 0.140ms (+0.000ms): spin_lock (<00000000>)
00010002 0.141ms (+0.000ms): mark_offset_tsc (timer_interrupt) [0]
00010002 0.141ms (+0.000ms): mark_offset_tsc (timer_interrupt) [1]
00010002 0.141ms (+0.000ms): spin_lock (mark_offset_tsc)
00010003 0.141ms (+0.000ms): spin_lock (<00000000>)
00010003 0.141ms (+0.131ms): mark_offset_tsc (timer_interrupt) [2]
00010003 0.273ms (+0.000ms): mark_offset_tsc (timer_interrupt) [3]
00010003 0.273ms (+0.000ms): spin_lock (mark_offset_tsc)
00010004 0.273ms (+0.000ms): spin_lock (<00000000>)
00010004 0.273ms (+0.145ms): mark_offset_tsc (timer_interrupt) [4]
00010004 0.419ms (+0.004ms): mark_offset_tsc (timer_interrupt) [5]
00010004 0.423ms (+0.002ms): mark_offset_tsc (timer_interrupt) [6]
00010004 0.425ms (+0.000ms): mark_offset_tsc (timer_interrupt)
00010004 0.425ms (+0.000ms): mark_offset_tsc (timer_interrupt)
00010003 0.426ms (+0.000ms): mark_offset_tsc (timer_interrupt)
00010003 0.426ms (+0.046ms): mark_offset_tsc (timer_interrupt)
00010003 0.472ms (+0.000ms): mark_offset_tsc (timer_interrupt)
00010002 0.472ms (+0.000ms): timer_interrupt (generic_handle_IRQ_event)

For reference, the steps in the code read (w/o comments):

	mcount();			/* [1] */
	write_seqlock(&monotonic_lock);
	mcount();			/* [2] */
	last_offset = ((unsigned long long)last_tsc_high<<32)|last_tsc_low;
	rdtsc(last_tsc_low, last_tsc_high);

	mcount();			/* [3] */
	spin_lock(&i8253_lock);
	mcount();			/* [4] */
	outb_p(0x00, PIT_MODE);		/* latch the count ASAP */
	mcount();			/* [5] */

	count = inb_p(PIT_CH0);		/* read the latched count */
	mcount();			/* [6] */
	count |= inb(PIT_CH0) << 8;
	mcount();			/* [7] */

I numbered the trace entries and the corresponding mcount() calls. If I
labeled this right, it appears the delays were in the rdtsc call and in
the outb_p that latches the count. Perhaps a hardware-level delay is
taking place.

Clear Page Tables
=================

This is the longest single latency with the following traces:

# grep '(+0.6' lt040907/lt*/lt.*
lt040907/lt001.v3k1/lt.28:00000001 0.001ms (+0.635ms): clear_page_tables (exit_mmap)
lt040907/lt002.v3k1/lt.75:00000001 0.001ms (+0.628ms): clear_page_tables (exit_mmap)

__modify_IO_APIC_irq
====================

Not quite sure what is happening here, but this trace

preemption latency trace v1.0.5 on 2.6.9-rc1-VP-R1
--------------------------------------------------
latency: 612 us, entries: 111 (111)
-----------------
| task: kblockd/0/10, uid:0 nice:-10 policy:0 rt_prio:0
-----------------
=> started at: __spin_lock_irqsave+0x4b/0xa0
=> ended at: as_work_handler+0x5c/0xa0
=======>
00000001 0.000ms (+0.000ms): __spin_lock_irqsave (spin_lock_irq)
00000001 0.000ms (+0.000ms): generic_enable_irq (ide_do_request)
00000001 0.000ms (+0.000ms): __spin_lock_irqsave (generic_enable_irq)
00000001 0.000ms (+0.000ms): __spin_lock_irqsave (<00000000>)
00000002 0.001ms (+0.000ms): unmask_IO_APIC_irq (generic_enable_irq)
00000002 0.001ms (+0.000ms): __spin_lock_irqsave (unmask_IO_APIC_irq)
00000002 0.001ms (+0.000ms): __spin_lock_irqsave (<00000000>)
00000003 0.001ms (+0.000ms): __unmask_IO_APIC_irq (unmask_IO_APIC_irq)
00000003 0.002ms (+0.567ms): __modify_IO_APIC_irq (__unmask_IO_APIC_irq)
00010001 0.569ms (+0.000ms): do_nmi (smp_apic_timer_interrupt)
00010001 0.569ms (+0.002ms): do_nmi (sched_clock)
00010001 0.572ms (+0.000ms): profile_tick (nmi_watchdog_tick)
00010001 0.572ms (+0.000ms): profile_hook (profile_tick)
00010001 0.573ms (+0.000ms): read_lock (profile_hook)
00010002 0.573ms (+0.000ms): read_lock (<00000000>)
00010002 0.573ms (+0.000ms): notifier_call_chain (profile_hook)
00010001 0.574ms (+0.000ms): profile_hit (nmi_watchdog_tick)
00000001 0.575ms (+0.000ms): smp_apic_timer_interrupt (as_work_handler)

is an example where __modify_IO_APIC_irq shows up with a long latency.
There were about a dozen long latencies (> 100 usec) in the two tests
run this morning, slightly fewer (6 vs 8) in the first test with DRI
and display acceleration turned on. This appears to be an increase
with -R1; tests over three days (-Q5 through -Q9) only had 4 long
latencies at this spot. [Hmm - that might be due to the change in
the way we compute / display the latencies]

Spin Lock
=========

We seem to have gotten stuck here in a spin lock...

=> started at: spin_lock+0x34/0x90
=> ended at: journal_commit_transaction+0xa25/0x1c30
=======>
00000001 0.000ms (+0.559ms): spin_lock (journal_commit_transaction)

and here...

=> started at: __read_lock_irqsave+0x93/0xa0
=> ended at: snd_pcm_lib_write1+0x3ba/0x600
=======>
00000001 0.000ms (+0.000ms): __read_lock_irqsave (read_lock_irq)
00000001 0.000ms (+0.000ms): spin_lock (snd_pcm_lib_write1)
00000002 0.000ms (+0.000ms): spin_lock (<00000000>)
00000002 0.000ms (+0.000ms): snd_pcm_update_hw_ptr (snd_pcm_lib_write1)
...
00000002 0.008ms (+0.000ms): snd_ensoniq_trigger (snd_pcm_do_stop)
00000002 0.009ms (+0.000ms): spin_lock (snd_ensoniq_trigger)
00000003 0.009ms (+0.549ms): spin_lock (<00000000>)

Overall, the counts of traces referring to spin_lock were...

# grep 'spin_lock (' lt040907/lt*/* | grep -v '+0.0' | wc -l
35
# grep 'spin_lock (' lt040903/lt*/* | grep -v '+0.0' | wc -l
11

More of these traces in -R1 than -Q9, but I am not sure if this is
significant or not.

Context Switch
==============

With the inlined functions disabled, and some extra debug code,
I saw context_switch appear > 800 times in the long latency traces.
Usually very fast (< 10 usec) but sometimes listed with some long
latencies (> 100 usec, 45 trace lines).
For example:

=> started at: schedule+0x5b/0x610
=> ended at: schedule+0x385/0x610
...
00000002 0.004ms (+0.000ms): schedule (do_irqd)
00000002 0.005ms (+0.000ms): context_switch (schedule)
00000002 0.005ms (+0.000ms): dummy_cs_entry (context_switch)
00000002 0.005ms (+0.000ms): context_switch (schedule)
00000002 0.005ms (+0.000ms): dummy_cs_switch_mm (context_switch)
00000002 0.005ms (+0.111ms): context_switch (schedule)
00000002 0.116ms (+0.000ms): dummy_cs_unlikely_if (context_switch)
00000002 0.117ms (+0.000ms): context_switch (schedule)
00000002 0.117ms (+0.000ms): dummy_cs_switch_to (context_switch)
00000002 0.117ms (+0.029ms): context_switch (schedule)
00000002 0.147ms (+0.128ms): __switch_to (context_switch)
00000002 0.275ms (+0.001ms): dummy_cs_exit (context_switch)
00000002 0.277ms (+0.155ms): context_switch (schedule)
00000002 0.432ms (+0.000ms): finish_task_switch (schedule)
00000002 0.433ms (+0.000ms): smp_apic_timer_interrupt (finish_task_switch)
00010002 0.433ms (+0.137ms): profile_tick (smp_apic_timer_interrupt)
00010002 0.571ms (+0.000ms): profile_hook (profile_tick)
00010002 0.571ms (+0.000ms): read_lock (profile_hook)
00010003 0.571ms (+0.000ms): read_lock (<00000000>)
00010003 0.571ms (+0.000ms): notifier_call_chain (profile_hook)
00010002 0.572ms (+0.000ms): profile_hit (smp_apic_timer_interrupt)

The dummy_cs lines refer to calls I added to context_switch at each
step to read...

	dummy_cs_entry();
	if (unlikely(!mm)) {
		next->active_mm = oldmm;
		atomic_inc(&oldmm->mm_count);
		dummy_cs_lazy_tlb();
		enter_lazy_tlb(oldmm, next);
	} else {
		dummy_cs_switch_mm();
		switch_mm(oldmm, mm, next);
	}

	dummy_cs_unlikely_if();
	if (unlikely(!prev->mm)) {
		prev->active_mm = NULL;
		WARN_ON(rq->prev_mm);
		rq->prev_mm = oldmm;
	}

	/* Here we just switch the register state and the stack. */
	dummy_cs_switch_to();
	switch_to(prev, next, prev);

	dummy_cs_exit();
	return prev;

The switch_mm path of the if appears to be more costly time wise
than the lazy_tlb path. To see this from the detailed traces, try
something like...

grep -C2 'dummy_cs_switch_mm' lt040907/lt*/* | less -im
or
grep -C2 'dummy_cs_lazy_tlb' lt040907/lt*/* | less -im

Closing
=======

I will send copies of the traces (not to the mailing lists) in a few
minutes for review / analysis.

I also downloaded the -R8 patch and other associated patches and will
be building that soon.
--Mark

2004-09-08 18:42:38

by Ingo Molnar

[permalink] [raw]
Subject: Re: [patch] voluntary-preempt-2.6.9-rc1-bk4-R1


* [email protected] <[email protected]> wrote:

> If you look at the date / time of the traces, you will notice that
> most occur in the latter part of the test. This is during the "disk
> copy" and "disk read" parts of the testing. [...]

would it be possible to test with DMA disabled? (hdparm -d0 /dev/hda) It
might take some extra work to shun the extra latency reports from the
PIO IDE path (which is quite slow) but once that is done you should be
able to see whether these long 0.5 msec delays remain even if all (most)
DMA activity has been eliminated.

> preemption latency trace v1.0.5 on 2.6.9-rc1-VP-R1
> --------------------------------------------------
> latency: 550 us, entries: 6 (6)
> -----------------
> | task: cat/6771, uid:0 nice:0 policy:0 rt_prio:0
> -----------------
> => started at: kmap_atomic+0x23/0xe0
> => ended at: kunmap_atomic+0x7b/0xa0
> =======>
> 00000001 0.000ms (+0.000ms): kmap_atomic (file_read_actor)
> 00000001 0.000ms (+0.000ms): page_address (file_read_actor)
> 00000001 0.000ms (+0.549ms): __copy_to_user_ll (file_read_actor)
> 00000001 0.550ms (+0.000ms): kunmap_atomic (file_read_actor)
> 00000001 0.550ms (+0.000ms): sub_preempt_count (kunmap_atomic)
> 00000001 0.550ms (+0.000ms): update_max_trace (check_preempt_timing)

this is a full page copy, from a kernelspace pagecache page out to
userspace. This shouldnt take 500 usecs on any hardware. Since this is a
single instruction (memcpy's rep; movsl instruction) there's nothing
that Linux can do to avoid (or even to cause) such a situation.

> 00010002 0.141ms (+0.000ms): mark_offset_tsc (timer_interrupt) [0]
> 00010002 0.141ms (+0.000ms): mark_offset_tsc (timer_interrupt) [1]
> 00010002 0.141ms (+0.000ms): spin_lock (mark_offset_tsc)
> 00010003 0.141ms (+0.000ms): spin_lock (<00000000>)
> 00010003 0.141ms (+0.131ms): mark_offset_tsc (timer_interrupt) [2]
> 00010003 0.273ms (+0.000ms): mark_offset_tsc (timer_interrupt) [3]

note that there's no spinning on the spinlock, the (<00000000>) shows
that there was no contention at all.

> For reference, the steps in the code read (w/o comments):
>
> mcount(); [1]
> write_seqlock(&monotonic_lock);
> mcount(); [2]
> last_offset = ((unsigned long long)last_tsc_high<<32)|last_tsc_low;
> rdtsc(last_tsc_low, last_tsc_high);
> mcount(); [3]

the 131 usec delay occurred between [2] and [3] - and, if you check the
assembly, there are only 14 instructions between those two mcount()
calls:

1a0: 31 db xor %ebx,%ebx
1a2: 8b 0d 10 00 00 00 mov 0x10,%ecx
1a4: R_386_32 .bss
1a8: a1 14 00 00 00 mov 0x14,%eax
1a9: R_386_32 .bss
1ad: 89 c2 mov %eax,%edx
1af: 31 c0 xor %eax,%eax
1b1: 89 c7 mov %eax,%edi
1b3: 09 cf or %ecx,%edi
1b5: 89 7d e0 mov %edi,0xffffffe0(%ebp)
1b8: 89 d7 mov %edx,%edi
1ba: 09 df or %ebx,%edi
1bc: 89 7d e4 mov %edi,0xffffffe4(%ebp)
1bf: 0f 31 rdtsc
1c1: 89 15 14 00 00 00 mov %edx,0x14
1c3: R_386_32 .bss
1c7: a3 10 00 00 00 mov %eax,0x10
1c8: R_386_32 .bss

no loop, no nothing. No way can this take 131 usecs without hardware
effects.

> Clear Page Tables
> =================
>
> This is the longest single latency with the following traces:
>
> # grep '(+0.6' lt040907/lt*/lt.*
> lt040907/lt001.v3k1/lt.28:00000001 0.001ms (+0.635ms): clear_page_tables
> (exit_mmap)
> lt040907/lt002.v3k1/lt.75:00000001 0.001ms (+0.628ms): clear_page_tables
> (exit_mmap)

this one might be a real latency - but it's hard to tell if there are
random 500 usec latencies all around the place.

> __modify_IO_APIC_irq
> ====================

> 00000003 0.001ms (+0.000ms): __unmask_IO_APIC_irq (unmask_IO_APIC_irq)
> 00000003 0.002ms (+0.567ms): __modify_IO_APIC_irq (__unmask_IO_APIC_irq)
> 00010001 0.569ms (+0.000ms): do_nmi (smp_apic_timer_interrupt)

this too seems to be one of these random 500 usec latencies that have no
connection whatsoever to what is being done. It's just some unfortunate
piece of code that is more likely to access the memory bus or happens to
be on a page boundary or something like that.

> Spin Lock
> =========
>
> We seem to have gotten stuck here in a spin lock...

none of the spinlocks had a counter different from zero so there was no
contention. The extra trace entry after a spinlock:

> 00000002 0.000ms (+0.000ms): spin_lock (<00000000>)

shows the number of times the spinlock had to spin internally before it
got the lock. For real contention this should be some large nonzero
number.

> 00000002 0.008ms (+0.000ms): snd_ensoniq_trigger (snd_pcm_do_stop)
> 00000002 0.009ms (+0.000ms): spin_lock (snd_ensoniq_trigger)
> 00000003 0.009ms (+0.549ms): spin_lock (<00000000>)

this too seems to be caused by that 'magic' latency - a noncontended
spinlock cannot take 500 usecs to execute ...

> Context Switch
> ==============

same for the context-switch codepath. I'm very convinced that the 'magic
latencies' are distributed more or less randomly across kernel code.
Code that accesses the main memory bus is more likely to be affected
by DMA starvation; that's what makes some of these functions more
prominent than others.

> 00000002 0.005ms (+0.000ms): dummy_cs_switch_mm (context_switch)
> 00000002 0.005ms (+0.111ms): context_switch (schedule)

since this includes a cr3 flush it is very likely accessing main memory.
(it's possibly re-fetching lots of TLB entries from a new pagetable
which is likely cache-cold.)

so my main candidate is still IDE DMA. Please disable IDE DMA and see
what happens (after hiding the PIO IDE codepath via
touch_preempt_timing()).

Ingo

2004-09-08 20:34:22

by Mark_H_Johnson

[permalink] [raw]
Subject: Re: [patch] voluntary-preempt-2.6.9-rc1-bk4-R1

>.... Please disable IDE DMA and see
>what happens (after hiding the PIO IDE codepath via
>touch_preempt_timing()).

Not quite sure where to add touch_preempt_timing() calls - somewhere in the
loop in ide_outsl and ide_insl? [so we keep resetting the start /end
times?]

I did collect some initial data - most are of the form:

preemption latency trace v1.0.7 on 2.6.9-rc1-bk12-VP-R8
-------------------------------------------------------
latency: 545 us, entries: 8 (8) | [VP:1 KP:1 SP:1 HP:1 #CPUS:2]
-----------------
| task: IRQ 14/140, uid:0 nice:-10 policy:0 rt_prio:0
-----------------
=> started at: kmap_atomic+0x23/0xe0
=> ended at: kunmap_atomic+0x7b/0xa0
=======>
00000001 0.000ms (+0.000ms): kmap_atomic (ide_multwrite)
00000001 0.000ms (+0.001ms): page_address (ide_multwrite)
00000001 0.001ms (+0.000ms): taskfile_output_data (ide_multwrite)
00000001 0.002ms (+0.000ms): ata_output_data (taskfile_output_data)
00000001 0.003ms (+0.542ms): ide_outsl (ata_output_data)
00000001 0.545ms (+0.000ms): kunmap_atomic (ide_multwrite)
00000001 0.545ms (+0.000ms): sub_preempt_count (kunmap_atomic)
00000001 0.546ms (+0.000ms): update_max_trace (check_preempt_timing)

Which I am sure is what you expected.

I did get a couple that looked like this:

preemption latency trace v1.0.7 on 2.6.9-rc1-bk12-VP-R8
-------------------------------------------------------
latency: 1261 us, entries: 73 (73) | [VP:1 KP:1 SP:1 HP:1 #CPUS:2]
-----------------
| task: sleep/4859, uid:0 nice:-9 policy:0 rt_prio:0
-----------------
=> started at: spin_lock+0x24/0x90
=> ended at: _spin_unlock+0x2d/0x60
=======>
00000001 0.000ms (+0.000ms): spin_lock (do_wp_page)
00000001 0.000ms (+0.000ms): spin_lock (<00000000>)
00000001 0.000ms (+0.000ms): kmap_atomic (do_wp_page)
00000002 0.000ms (+0.000ms): page_address (do_wp_page)
00000002 0.001ms (+0.000ms): page_remove_rmap (do_wp_page)
00000002 0.002ms (+0.001ms): flush_tlb_page (do_wp_page)
00000003 0.003ms (+0.000ms): flush_tlb_others (flush_tlb_page)
00000003 0.003ms (+0.000ms): spin_lock (flush_tlb_others)
00000004 0.004ms (+0.001ms): spin_lock (<00000000>)
00000004 0.005ms (+0.000ms): send_IPI_mask (flush_tlb_others)
00000004 0.005ms (+0.321ms): send_IPI_mask_bitmask (flush_tlb_others)
00010004 0.327ms (+0.000ms): do_nmi (flush_tlb_others)
00010004 0.327ms (+0.003ms): do_nmi (kallsyms_lookup)
00010004 0.331ms (+0.000ms): notifier_call_chain (default_do_nmi)
00010004 0.331ms (+0.000ms): profile_tick (nmi_watchdog_tick)
00010004 0.331ms (+0.000ms): profile_hook (profile_tick)
00010004 0.332ms (+0.000ms): read_lock (profile_hook)
00010005 0.332ms (+0.000ms): read_lock (<00000000>)
00010005 0.332ms (+0.000ms): notifier_call_chain (profile_hook)
00010005 0.332ms (+0.000ms): _read_unlock (profile_tick)
00010004 0.333ms (+0.091ms): profile_hit (nmi_watchdog_tick)
00000004 0.424ms (+0.000ms): smp_apic_timer_interrupt (flush_tlb_others)
...
00000005 0.431ms (+0.000ms): do_softirq (smp_apic_timer_interrupt)
00000005 0.431ms (+0.000ms): __do_softirq (do_softirq)
00000005 0.432ms (+0.000ms): wake_up_process (do_softirq)
00000005 0.432ms (+0.000ms): try_to_wake_up (wake_up_process)
00000005 0.433ms (+0.000ms): task_rq_lock (try_to_wake_up)
00000005 0.433ms (+0.000ms): spin_lock (task_rq_lock)
00000006 0.433ms (+0.000ms): spin_lock (<00000000>)
00000006 0.433ms (+0.000ms): wake_idle (try_to_wake_up)
00000006 0.434ms (+0.000ms): activate_task (try_to_wake_up)
00000006 0.434ms (+0.000ms): sched_clock (activate_task)
00000006 0.434ms (+0.000ms): recalc_task_prio (activate_task)
00000006 0.435ms (+0.000ms): effective_prio (recalc_task_prio)
00000006 0.435ms (+0.000ms): __activate_task (try_to_wake_up)
00000006 0.436ms (+0.000ms): enqueue_task (__activate_task)
00000006 0.436ms (+0.000ms): sched_info_queued (enqueue_task)
00000006 0.437ms (+0.000ms): resched_task (try_to_wake_up)
00000006 0.437ms (+0.000ms): task_rq_unlock (try_to_wake_up)
00000006 0.437ms (+0.000ms): _spin_unlock_irqrestore (try_to_wake_up)
00000005 0.437ms (+0.820ms): preempt_schedule (try_to_wake_up)
00000004 1.258ms (+0.000ms): _spin_unlock (flush_tlb_others)
00000003 1.258ms (+0.000ms): preempt_schedule (flush_tlb_others)
00000002 1.258ms (+0.000ms): preempt_schedule (flush_tlb_page)
...

or this

preemption latency trace v1.0.7 on 2.6.9-rc1-bk12-VP-R8
-------------------------------------------------------
latency: 1210 us, entries: 113 (113) | [VP:1 KP:1 SP:1 HP:1 #CPUS:2]
-----------------
| task: get_ltrace.sh/3071, uid:0 nice:-9 policy:0 rt_prio:0
-----------------
=> started at: flush_tlb_mm+0x1d/0xd0
=> ended at: flush_tlb_mm+0x52/0xd0
=======>
00000001 0.000ms (+0.001ms): flush_tlb_mm (copy_mm)
00000001 0.001ms (+0.000ms): flush_tlb_others (flush_tlb_mm)
00000001 0.001ms (+0.000ms): spin_lock (flush_tlb_others)
00000002 0.002ms (+0.000ms): spin_lock (<00000000>)
00000002 0.003ms (+0.000ms): send_IPI_mask (flush_tlb_others)
00000002 0.003ms (+0.378ms): send_IPI_mask_bitmask (flush_tlb_others)
00000002 0.381ms (+0.000ms): smp_apic_timer_interrupt (flush_tlb_others)
...
00010003 0.459ms (+0.000ms): preempt_schedule (timer_interrupt)
00010003 0.460ms (+0.000ms): do_timer (timer_interrupt)
00010003 0.460ms (+0.000ms): update_wall_time (do_timer)
00010003 0.460ms (+0.000ms): update_wall_time_one_tick (update_wall_time)
00010003 0.460ms (+0.000ms): _spin_unlock (timer_interrupt)
00010002 0.460ms (+0.000ms): preempt_schedule (timer_interrupt)
00010002 0.461ms (+0.000ms): spin_lock (do_IRQ)
00010003 0.461ms (+0.000ms): spin_lock (<00000000>)
00010003 0.461ms (+0.000ms): generic_note_interrupt (do_IRQ)
00010003 0.461ms (+0.000ms): end_edge_ioapic_irq (do_IRQ)
00010003 0.461ms (+0.000ms): _spin_unlock (do_IRQ)
00010002 0.462ms (+0.000ms): preempt_schedule (do_IRQ)
00000003 0.462ms (+0.000ms): do_softirq (do_IRQ)
00000003 0.462ms (+0.747ms): __do_softirq (do_softirq)
00000002 1.209ms (+0.000ms): _spin_unlock (flush_tlb_others)
00000001 1.210ms (+0.000ms): preempt_schedule (flush_tlb_others)
00000001 1.210ms (+0.000ms): sub_preempt_count (flush_tlb_mm)
00000001 1.211ms (+0.000ms): update_max_trace (check_preempt_timing)

where I had > 1 msec latencies w/o any apparent references to the IDE.

--Mark

2004-09-08 23:36:59

by Alan

[permalink] [raw]
Subject: Re: [patch] voluntary-preempt-2.6.9-rc1-bk4-R1

On Mer, 2004-09-08 at 21:33, [email protected] wrote:
> >.... Please disable IDE DMA and see
> >what happens (after hiding the PIO IDE codepath via
> >touch_preempt_timing()).
>
> Not quite sure where to add touch_preempt_timing() calls - somewhere in the
> loop in ide_outsl and ide_insl? [so we keep resetting the start /end
> times?]

If you haven't done hdparm -u1 that may be a reason you want to touch
these. To defend against some very bad old h/w where a stall in the I/O
stream to the disk causes corruption we disable IRQ's across the
transfer in PIO mode by default.

2004-09-09 06:26:39

by Ingo Molnar

[permalink] [raw]
Subject: Re: [patch] voluntary-preempt-2.6.9-rc1-bk4-R1


* [email protected] <[email protected]> wrote:

> Not quite sure where to add touch_preempt_timing() calls - somewhere in the
> loop in ide_outsl and ide_insl? [so we keep resetting the start /end
> times?]

> 00000001 0.002ms (+0.000ms): ata_output_data (taskfile_output_data)
> 00000001 0.003ms (+0.542ms): ide_outsl (ata_output_data)
> 00000001 0.545ms (+0.000ms): kunmap_atomic (ide_multwrite)

yeah, i'd add it to ide_outsl. Something like the patch below. (it
compiles but is untested otherwise.)

another possibility besides DMA starvation is starvation between CPUs.
The only way to exclude DMA as a source of CPU starvation would be to
try a maxcpus=1 test using the SMP kernel. You dont even have to run the
RT-latency tester for this - just the disk read/write workload should
trigger the latency traces!

Ingo

--- linux/drivers/ide/ide-iops.c.orig
+++ linux/drivers/ide/ide-iops.c
@@ -56,7 +56,15 @@ static u32 ide_inl (unsigned long port)
 
 static void ide_insl (unsigned long port, void *addr, u32 count)
 {
-	insl(port, addr, count);
+	unsigned int chunk, offset = 0;
+
+	while (count) {
+		chunk = min(128U, count);
+		insl(port, addr + offset, chunk);
+		count -= chunk;
+		offset += chunk * 4; /* insl counts longwords, addr is a byte pointer */
+		touch_preempt_timing();
+	}
 }
 
 static void ide_outb (u8 val, unsigned long port)
@@ -86,7 +94,15 @@ static void ide_outl (u32 val, unsigned
 
 static void ide_outsl (unsigned long port, void *addr, u32 count)
 {
-	outsl(port, addr, count);
+	unsigned int chunk, offset = 0;
+
+	while (count) {
+		chunk = min(128U, count);
+		outsl(port, addr + offset, chunk);
+		count -= chunk;
+		offset += chunk * 4; /* outsl counts longwords, addr is a byte pointer */
+		touch_preempt_timing();
+	}
 }
 
 void default_hwif_iops (ide_hwif_t *hwif)

2004-09-09 15:12:12

by Ingo Molnar

[permalink] [raw]
Subject: Re: [patch] voluntary-preempt-2.6.9-rc1-bk4-R1


* [email protected] <[email protected]> wrote:

> 00000004 0.005ms (+0.321ms): send_IPI_mask_bitmask (flush_tlb_others)
> 00010004 0.327ms (+0.000ms): do_nmi (flush_tlb_others)
> 00010004 0.327ms (+0.003ms): do_nmi (kallsyms_lookup)

these traces happen if you forget to do 'dmesg -n 1' to turn off console
output. One CPU does a latency-printout with interrupts disabled, and
this CPU keeps waiting in flush_tlb_others(). The second do_nmi() entry
(which is a special one, showing the other CPU's activity) indeed shows
that the CPU is executing in kallsyms_lookup() - possibly it's in the
middle of a latency printout.

i havent seen such type of traces with 'dmesg -n 1'.

Ingo

2004-09-09 15:32:28

by Ingo Molnar

[permalink] [raw]
Subject: Re: [patch] voluntary-preempt-2.6.9-rc1-bk4-R1


> >.... Please disable IDE DMA and see
> >what happens (after hiding the PIO IDE codepath via
> >touch_preempt_timing()).

here is a list of a couple of more or less independent ideas regarding
your latency problems. (Some of these i might have suggested before.)

1)

we talked about external SMM events (handled by BIOS code and not
influenced/disabled by Linux) being a possible source of these 'mystic'
latencies. In SMP systems, there's a good likelihood that SMMs are only
executed on CPU#0.

one good way to exclude SMM as a possibility would be to switch the CPUs
in the test. When you start an RT CPU-intense task to do the latency
test, is that task bound to any particular CPU? If yes and it's CPU#1,
then could you bind it to CPU#0? That would shift almost all kernel
activities to CPU#1. Likewise, could you set the
/proc/irq/*/smp_affinity masks to be '2' for all interrupts - to handle
all IRQs on CPU#1.
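
something like this is what i mean (a sketch - assumes a 2-CPU box,
taskset from the schedutils package, and $PID being the pid of the
latency task):

	# keep the RT latency task on CPU#0:
	taskset -p 0x1 $PID

	# handle all hardware IRQs on CPU#1:
	for irq in /proc/irq/[0-9]*; do
		echo 2 > $irq/smp_affinity
	done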

but i dont expect SMMs as the source of these latencies. BIOS writers
are supposed to fix up the TSC in SMM handlers to make them as seamless
as possible. So seeing frequent 500+ usec SMMs with no TSC fixup seems
weird.

2)

the maxcpus=1 workload-only (non-latencytest) test should also give an
important datapoint: is this phenomenon caused by SMP?

3)

it would still be nice if you could try the tests on a different type of
SMP hardware, as i cannot reproduce your results on 3 different types of
SMP hardware. E.g. one of my testsystems is a dual 466 MHz Celeron
(Mendocino) box, and the biggest latencies very closely track the UP
results reported by others - the highest latency in an IO-intense
workload is 192 usecs. The box is using IDE DMA. This system is roughly
half as fast as your 800+ MHz P3 box. I also have tried tests on a dual
P4 and a dual P3 - both have maximum latencies below 100 usecs and never
produce traces in weird places like the scheduler or copy_page_ll.

4)

your previous experiments have shown that disabling DRI in X does not
eliminate these weird latencies. Another DMA agent is IDE, which you are
testing now via the PIO-mode experiment. There's a third DMA agent:
audio cards do DMA too. Have you tried to eliminate the audio component
of the test? Isnt there a latencytest mode that doesnt use an audio card
but /dev/rtc? [or just running the workloads without running latencytest
should be ok too to trigger the latency traces.]

Ingo

2004-09-09 16:01:43

by Ingo Molnar

[permalink] [raw]
Subject: Re: [patch] voluntary-preempt-2.6.9-rc1-bk4-R1


* [email protected] <[email protected]> wrote:

> Results from this morning's test with -R1 and some follow up on
> related messages.

lt002.v3k1/lt.10 is particularly interesting:

00000001 0.000ms (+0.000ms): spin_lock (rtl8139_poll)
00000001 0.000ms (+0.070ms): spin_lock (<00000000>)
00000001 0.070ms (+0.070ms): rtl8139_rx (rtl8139_poll)
00000001 0.140ms (+0.070ms): alloc_skb (rtl8139_rx)
00000001 0.210ms (+0.070ms): kmem_cache_alloc (alloc_skb)
00000001 0.280ms (+0.073ms): __kmalloc (alloc_skb)
00000001 0.354ms (+0.139ms): eth_type_trans (rtl8139_rx)
00000001 0.493ms (+0.076ms): netif_receive_skb (rtl8139_rx)
00000002 0.570ms (+0.001ms): packet_rcv_spkt (netif_receive_skb)

this too shows the CPU in 'slow motion' in a codepath that normally
executes 10 times faster than this on a 100 MHz Pentium Classic ...

another interesting thing is that the unit of delay seems to be around
70 usecs. As if under certain circumstances every main memory access
created a 70 usecs hit. (a cachemiss perhaps?) The eth_type_trans entry
perhaps generated 2 main memory accesses (2 cachemisses?) or so.

Ingo

2004-09-09 16:48:42

by Thomas Charbonnel

[permalink] [raw]
Subject: Re: [patch] voluntary-preempt-2.6.9-rc1-bk4-R1

Alan Cox wrote :
> On Mer, 2004-09-08 at 21:33, [email protected] wrote:
> > >.... Please disable IDE DMA and see
> > >what happens (after hiding the PIO IDE codepath via
> > >touch_preempt_timing()).
> >
> > Not quite sure where to add touch_preempt_timing() calls - somewhere in the
> > loop in ide_outsl and ide_insl? [so we keep resetting the start /end
> > times?]
>
> If you haven't done hdparm -u1 that may be a reason you want to touch
> these. To defend against some very bad old h/w where a stall in the I/O
> stream to the disk causes corruption we disable IRQ's across the
> transfer in PIO mode by default.
>

I had the exact same problem showing in the output of latencytest, and
enabling unmaskirq on the drive being stressed solved it, thanks !

See this for the problem :
http://www.undata.org/~thomas/unmaskirq_0/index.html
and this for the (impressive) results :
http://www.undata.org/~thomas/unmaskirq_1/index.html

Thomas


2004-09-09 18:41:26

by Mark_H_Johnson

[permalink] [raw]
Subject: Re: [patch] voluntary-preempt-2.6.9-rc1-bk4-R1

>If you haven't done hdparm -u1 that may be a reason you want to touch
>these.

Alas, but
# hdparm /dev/hda
/dev/hda:
multcount = 16 (on)
IO_support = 1 (32-bit)
unmaskirq = 1 (on)
using_dma = 0 (off)
keepsettings = 0 (off)
readonly = 0 (off)
readahead = 64 (on)
geometry = 58168/16/63, sectors = 58633344, start = 0

so I already have IRQ's unmasked. [this was during no DMA tests, I usually
run with DMA enabled]

I'll be commenting on the results of the no DMA tests shortly.

--Mark

2004-09-09 19:52:29

by Ingo Molnar

[permalink] [raw]
Subject: Re: [patch] voluntary-preempt-2.6.9-rc1-bk4-R1


* [email protected] <[email protected]> wrote:

> PIO trace
> =========

> 00000001 0.000ms (+0.370ms): touch_preempt_timing (ide_outsl)
> 00000001 0.370ms (+0.000ms): touch_preempt_timing (ide_outsl)

Please decrease the 128U chunking to 8U or so.

> I have several traces where send_IPI_mask_bitmask (flush_tlb_others)
> shows up. For example...

flush_tlb_others() is a good indicator of irqs-off sections on the other
CPU. The code does the following when it flushes TLBs: it sends an IPI
(inter-processor interrupt) to all other CPUs (one CPU in your case) and
waits for arrival of that IRQ and completion while spinning on a flag.
The IPI normally takes 10 usecs or so to process so this is not an
issue. BUT if the CPU has IRQs disabled then the IPI is delayed and the
IRQs-off latency shows up as flush_tlb_others() latency.

> 00000003 0.014ms (+0.132ms): send_IPI_mask_bitmask (flush_tlb_others)
> 00010003 0.147ms (+0.000ms): do_nmi (flush_tlb_others)
> 00010003 0.147ms (+0.001ms): do_nmi (ide_outsl)

Since the other CPU's do_nmi() implicates ide_outsl it could be that we
are doing ide_outsl with IRQs disabled? Could you add something like
this to the ide_outsl code:

if (irqs_disabled() && printk_ratelimit())
        dump_stack();

(the most common irqs-off section is the latency printout itself - this
triggers if the latency message goes to the console - i.e. 'dmesg -n 1'
wasnt done.)

> Buried inside a pretty long trace in kswapd0, I saw the following...

> 00000007 0.111ms (+0.000ms): _spin_unlock_irqrestore (try_to_wake_up)
> 00000006 0.111ms (+0.298ms): preempt_schedule (try_to_wake_up)
> 00000005 0.409ms (+0.000ms): _spin_unlock (flush_tlb_others)

this too is flush_tlb_others() related.

> So the long wait on paths through sched and timer_tsc appear to be
> eliminated with PIO to the disk.

yeah, nice. I'd still like to make sure that we've not hidden latencies
by working down the ide_outsl() latency and its apparent IRQs-off
property.

> Is there some "for sure" way to limit the size and/or duration of DMA
> transfers so I get reasonable performance from the disk (and other DMA
> devices) and reasonable latency?

the 'unit' of the 'weird' delays seems to be around 70 usecs; the maximum
seems to be around 560 usecs. Note the 1:8 relationship between the two.
You have 32 KB as max_sectors, so the 70 usecs unit is for a single 4K
transfer which is a single scatter-gather entry: it all makes perfect
sense. 4K per 70 usecs means a DMA rate of ~57 MB/sec which sounds
reasonable.

so if these assumptions are true i'd suggest to decrease max_sectors
from 32K to 16K - that alone should halve these random latencies from
560 usecs to 280 usecs. Unless you see stability you might want to try
an 8K setting as well - this will likely decrease your IO rates
noticeably though. This would reduce the random delays to 140 usecs.
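
(a quick, reversible way to try the 16K setting is via the max_sectors_kb
sysfs attribute - assuming hda is the disk being stressed:)

# the value is in KB, so 32 here means 32K:
cat /sys/block/hda/queue/max_sectors_kb
# halve the maximum per-request size to 16K:
echo 16 > /sys/block/hda/queue/max_sectors_kb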

but the real fix would be to tweak the IDE controller to not do such
agressive DMA! Are there any BIOS settings that somehow deal with it?
Try increasing the PCI latency value? Is the disk using UDMA - if yes,
could you downgrade it to normal IDE DMA? Perhaps that tweaks the
controller to be 'nice' to the CPU. Is your IDE chipset integrated on
the motherboard? Could you send me your full bootlog (off-list)?

there are also tools that tweak chipsets directly - powertweak and the
PCI latency settings. Maybe something tweaks the IDE controller just the
right way. Also, try disabling specific controller support in the
.config (or turn it on) - by chance the generic IDE code could program
the IDE controller in a way that generates nicer DMA.
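
to put the above suggestions into concrete commands (untested - 00:1f.1
below is only a placeholder device address, lspci will show the real one):

# locate the IDE controller and its current latency timer:
lspci -v | grep -i -A4 ide
# bump the PCI latency timer of the controller (placeholder address):
setpci -s 00:1f.1 latency_timer=40
# drop the drive from UDMA back to plain multiword DMA mode 2:
hdparm -X mdma2 /dev/hda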

Ingo

2004-09-09 19:39:31

by Mark_H_Johnson

[permalink] [raw]
Subject: Re: [patch] voluntary-preempt-2.6.9-rc1-bk4-R1

>would it be possible to test with DMA disabled? (hdparm -d0 /dev/hda) It
>might take some extra work to shun the extra latency reports from the
>PIO IDE path (which is quite slow) but once that is done you should be
>able to see whether these long 0.5 msec delays remain even if all (most)
>DMA activity has been eliminated.

OK. With new patches in hand, I have a set of latency results w/ IDE
DMA turned off. For reference, tests were run with:

# hdparm /dev/hda

/dev/hda:
multcount = 16 (on)
IO_support = 1 (32-bit)
unmaskirq = 1 (on)
using_dma = 0 (off)
keepsettings = 0 (off)
readonly = 0 (off)
readahead = 64 (on)
geometry = 58168/16/63, sectors = 58633344, start = 0

# cat /sys/block/hda/queue/max_sectors_kb
32
# cat /sys/block/hda/queue/read_ahead_kb
32
# cat /proc/sys/net/core/netdev_max_backlog
8
# dmesg -n 1

and all tests run w/ a -R8 kernel plus patches for sched, timer_tsc,
and ide-iops (to add latency trace outputs or suppress known long paths).

No latency traces > 600 usec. These tests were run simply with
head -c 750000000 /dev/zero >tmpfile (disk writes)
cp tmpfile tmpfile2 (disk copy)
cat tmpfile tmpfile2 >/dev/null
while capturing the latency traces in another process. 72 total traces
captured in 15 minutes of tests.

For reference, the 750 meg file size is about 1.5 x physical memory.
No I/O to the audio card. X was running (to monitor the test) and a
network was connected (but generally idle).

PIO trace
=========

preemption latency trace v1.0.7 on 2.6.9-rc1-bk12-VP-R8-nd
-------------------------------------------------------
latency: 369 us, entries: 3 (3) | [VP:1 KP:1 SP:1 HP:1 #CPUS:2]
-----------------
| task: IRQ 14/140, uid:0 nice:-10 policy:0 rt_prio:0
-----------------
=> started at: ide_outsl+0x44/0x50
=> ended at: ide_outsl+0x44/0x50
=======>
00000001 0.000ms (+0.370ms): touch_preempt_timing (ide_outsl)
00000001 0.370ms (+0.000ms): touch_preempt_timing (ide_outsl)
00000001 0.370ms (+0.000ms): update_max_trace (check_preempt_timing)

which appears to be the PIO path ide_outsl. I had a few similar traces
with ide_insl during the copy / read tests as well.

send_IPI_mask_bitmask
=====================

I have several traces where send_IPI_mask_bitmask (flush_tlb_others)
shows up. For example...

00000002 0.010ms (+0.001ms): kunmap_atomic (zap_pte_range)
00000001 0.011ms (+0.000ms): flush_tlb_mm (unmap_vmas)
00000002 0.012ms (+0.000ms): flush_tlb_others (flush_tlb_mm)
00000002 0.012ms (+0.000ms): spin_lock (flush_tlb_others)
00000003 0.013ms (+0.001ms): spin_lock (<00000000>)
00000003 0.014ms (+0.000ms): send_IPI_mask (flush_tlb_others)
00000003 0.014ms (+0.132ms): send_IPI_mask_bitmask (flush_tlb_others)
00010003 0.147ms (+0.000ms): do_nmi (flush_tlb_others)
00010003 0.147ms (+0.001ms): do_nmi (ide_outsl)
00010003 0.149ms (+0.000ms): notifier_call_chain (default_do_nmi)
00010003 0.149ms (+0.000ms): profile_tick (nmi_watchdog_tick)
00010003 0.150ms (+0.000ms): profile_hook (profile_tick)
00010003 0.150ms (+0.000ms): read_lock (profile_hook)
00010004 0.150ms (+0.000ms): read_lock (<00000000>)
00010004 0.150ms (+0.000ms): notifier_call_chain (profile_hook)
00010004 0.151ms (+0.000ms): _read_unlock (profile_tick)
00010003 0.151ms (+0.250ms): profile_hit (nmi_watchdog_tick)
00000003 0.401ms (+0.000ms): _spin_unlock (flush_tlb_others)
00000001 0.402ms (+0.000ms): free_pages_and_swap_cache (unmap_vmas)
00000001 0.402ms (+0.000ms): lru_add_drain (free_pages_and_swap_cache)

or this one...

preemption latency trace v1.0.7 on 2.6.9-rc1-bk12-VP-R8-nd
-------------------------------------------------------
latency: 445 us, entries: 22 (22) | [VP:1 KP:1 SP:1 HP:1 #CPUS:2]
-----------------
| task: kswapd0/40, uid:0 nice:0 policy:0 rt_prio:0
-----------------
=> started at: spin_lock+0x24/0x90
=> ended at: _spin_unlock+0x2d/0x60
=======>
00000001 0.000ms (+0.000ms): spin_lock (page_referenced_file)
00000001 0.000ms (+0.000ms): spin_lock (<00000000>)
00000001 0.000ms (+0.000ms): vma_prio_tree_next (page_referenced_file)
00000001 0.001ms (+0.001ms): prio_tree_first (vma_prio_tree_next)
00000001 0.003ms (+0.001ms): prio_tree_left (prio_tree_first)
00000001 0.005ms (+0.000ms): page_referenced_one (page_referenced_file)
00000001 0.005ms (+0.000ms): spin_lock (page_referenced_one)
00000002 0.006ms (+0.001ms): spin_lock (<00000000>)
00000002 0.007ms (+0.000ms): kmap_atomic (page_referenced_one)
00000003 0.007ms (+0.001ms): page_address (page_referenced_one)
00000003 0.009ms (+0.000ms): flush_tlb_page (page_referenced_one)
00000004 0.010ms (+0.000ms): flush_tlb_others (flush_tlb_page)
00000004 0.010ms (+0.000ms): spin_lock (flush_tlb_others)
00000005 0.011ms (+0.001ms): spin_lock (<00000000>)
00000005 0.012ms (+0.000ms): send_IPI_mask (flush_tlb_others)
00000005 0.012ms (+0.431ms): send_IPI_mask_bitmask (flush_tlb_others)
00000005 0.444ms (+0.000ms): _spin_unlock (flush_tlb_others)
00000003 0.445ms (+0.000ms): kunmap_atomic (page_referenced_one)
00000002 0.445ms (+0.000ms): _spin_unlock (page_referenced_one)
00000001 0.445ms (+0.000ms): _spin_unlock (page_referenced_file)
00000001 0.446ms (+0.000ms): sub_preempt_count (_spin_unlock)
00000001 0.446ms (+0.000ms): update_max_trace (check_preempt_timing)

or this one...

preemption latency trace v1.0.7 on 2.6.9-rc1-bk12-VP-R8-nd
-------------------------------------------------------
latency: 428 us, entries: 49 (49) | [VP:1 KP:1 SP:1 HP:1 #CPUS:2]
-----------------
| task: get_ltrace.sh/5514, uid:0 nice:-9 policy:0 rt_prio:0
-----------------
=> started at: flush_tlb_mm+0x1d/0xd0
=> ended at: flush_tlb_mm+0x52/0xd0
=======>
00000001 0.000ms (+0.001ms): flush_tlb_mm (copy_mm)
00000001 0.001ms (+0.000ms): flush_tlb_others (flush_tlb_mm)
00000001 0.002ms (+0.000ms): spin_lock (flush_tlb_others)
00000002 0.002ms (+0.000ms): spin_lock (<00000000>)
00000002 0.003ms (+0.000ms): send_IPI_mask (flush_tlb_others)
00000002 0.003ms (+0.415ms): send_IPI_mask_bitmask (flush_tlb_others)
00000002 0.419ms (+0.000ms): smp_apic_timer_interrupt (flush_tlb_others)
00010002 0.419ms (+0.000ms): profile_tick (smp_apic_timer_interrupt)
00010002 0.419ms (+0.000ms): profile_hook (profile_tick)
00010002 0.419ms (+0.000ms): read_lock (profile_hook)
00010003 0.420ms (+0.000ms): read_lock (<00000000>)
00010003 0.420ms (+0.000ms): notifier_call_chain (profile_hook)
00010003 0.420ms (+0.000ms): _read_unlock (profile_tick)
00010002 0.420ms (+0.000ms): profile_hit (smp_apic_timer_interrupt)
00010002 0.420ms (+0.000ms): update_process_times
(smp_apic_timer_interrupt)
...

or this one...

preemption latency trace v1.0.7 on 2.6.9-rc1-bk12-VP-R8-nd
-------------------------------------------------------
latency: 330 us, entries: 9 (9) | [VP:1 KP:1 SP:1 HP:1 #CPUS:2]
-----------------
| task: get_ltrace.sh/5514, uid:0 nice:-9 policy:0 rt_prio:0
-----------------
=> started at: flush_tlb_mm+0x1d/0xd0
=> ended at: flush_tlb_mm+0x52/0xd0
=======>
00000001 0.000ms (+0.001ms): flush_tlb_mm (copy_mm)
00000001 0.001ms (+0.000ms): flush_tlb_others (flush_tlb_mm)
00000001 0.002ms (+0.000ms): spin_lock (flush_tlb_others)
00000002 0.002ms (+0.000ms): spin_lock (<00000000>)
00000002 0.003ms (+0.000ms): send_IPI_mask (flush_tlb_others)
00000002 0.004ms (+0.326ms): send_IPI_mask_bitmask (flush_tlb_others)
00000002 0.330ms (+0.000ms): _spin_unlock (flush_tlb_others)
00000001 0.331ms (+0.000ms): sub_preempt_count (flush_tlb_mm)
00000001 0.331ms (+0.000ms): update_max_trace (check_preempt_timing)

try_to_wake_up
==============

Buried inside a pretty long trace in kswapd0, I saw the following...

preemption latency trace v1.0.7 on 2.6.9-rc1-bk12-VP-R8-nd
-------------------------------------------------------
latency: 426 us, entries: 286 (286) | [VP:1 KP:1 SP:1 HP:1 #CPUS:2]
-----------------
| task: kswapd0/40, uid:0 nice:0 policy:0 rt_prio:0
-----------------
=> started at: spin_lock+0x24/0x90
=> ended at: _spin_unlock+0x2d/0x60
=======>
00000001 0.000ms (+0.000ms): spin_lock (page_referenced_file)
00000001 0.000ms (+0.000ms): spin_lock (<00000000>)
00000001 0.000ms (+0.000ms): vma_prio_tree_next (page_referenced_file)
00000001 0.000ms (+0.000ms): prio_tree_first (vma_prio_tree_next)
00000001 0.001ms (+0.000ms): prio_tree_left (prio_tree_first)
00000001 0.001ms (+0.000ms): page_referenced_one (page_referenced_file)
00000001 0.001ms (+0.000ms): spin_lock (page_referenced_one)
00000002 0.002ms (+0.000ms): spin_lock (<00000000>)
...
00000006 0.108ms (+0.000ms): task_rq_lock (try_to_wake_up)
00000006 0.108ms (+0.000ms): spin_lock (task_rq_lock)
00000007 0.108ms (+0.000ms): spin_lock (<00000000>)
00000007 0.109ms (+0.000ms): wake_idle (try_to_wake_up)
00000007 0.109ms (+0.000ms): activate_task (try_to_wake_up)
00000007 0.109ms (+0.000ms): sched_clock (activate_task)
00000007 0.109ms (+0.000ms): recalc_task_prio (activate_task)
00000007 0.109ms (+0.000ms): effective_prio (recalc_task_prio)
00000007 0.110ms (+0.000ms): __activate_task (try_to_wake_up)
00000007 0.110ms (+0.000ms): enqueue_task (__activate_task)
00000007 0.110ms (+0.000ms): sched_info_queued (enqueue_task)
00000007 0.110ms (+0.000ms): resched_task (try_to_wake_up)
00000007 0.111ms (+0.000ms): task_rq_unlock (try_to_wake_up)
00000007 0.111ms (+0.000ms): _spin_unlock_irqrestore (try_to_wake_up)
00000006 0.111ms (+0.298ms): preempt_schedule (try_to_wake_up)
00000005 0.409ms (+0.000ms): _spin_unlock (flush_tlb_others)
00000004 0.409ms (+0.000ms): preempt_schedule (flush_tlb_others)
00000003 0.410ms (+0.000ms): preempt_schedule (flush_tlb_page)
00000003 0.410ms (+0.000ms): kunmap_atomic (page_referenced_one)
00000002 0.410ms (+0.000ms): preempt_schedule (page_referenced_one)
00000002 0.410ms (+0.000ms): _spin_unlock (page_referenced_one)
00000001 0.410ms (+0.000ms): preempt_schedule (page_referenced_one)

There is an almost 300 usec hit there as we unwind the nested layers.

So the long wait on paths through sched and timer_tsc appear to be
eliminated with PIO to the disk.

Is there some "for sure" way to limit the size and/or duration of
DMA transfers so I get reasonable performance from the disk (and
other DMA devices) and reasonable latency?
--Mark

2004-09-09 20:15:08

by Alan

[permalink] [raw]
Subject: Re: [patch] voluntary-preempt-2.6.9-rc1-bk4-R1

On Iau, 2004-09-09 at 20:47, Ingo Molnar wrote:
> there are also tools that tweak chipsets directly - powertweak and the
> PCI latency settings. Maybe something tweaks the IDE controller just the
> right way. Also, try disabling specific controller support in the
> .config (or turn it on) - by chance the generic IDE code could program
> the IDE controller in a way that generates nicer DMA.

Do not tweak your IDE controller's PCI latency if it's anything but the
base motherboard chipset - not unless you make backups first and test it
very hard.

It's a particularly popular area for errata.

You could also try lower DMA modes. I don't know if that will help or
not, but UDMA133 is essentially one entire PCI 32/33 MHz bus worth of
traffic, so the effects are not too surprising. IDE control ports can
give you long stalls too, especially if IORDY is in use.

Alan

2004-09-09 22:43:23

by Mark_H_Johnson

[permalink] [raw]
Subject: Re: [patch] voluntary-preempt-2.6.9-rc1-bk4-R1

>so if these assumptions are true i'd suggest to decrease max_sectors
>from 32K to 16K - that alone should halve these random latencies from
>560 usecs to 280 usecs. Unless you see stability you might want to try
>an 8K setting as well - this will likely decrease your IO rates
>noticeably though. This would reduce the random delays to 140 usecs.

I tried both and still had some long delays. For example:

00000003 0.015ms (+0.000ms): send_IPI_mask (flush_tlb_others)
00000003 0.015ms (+0.137ms): send_IPI_mask_bitmask (flush_tlb_others)
00010003 0.153ms (+0.000ms): do_nmi (flush_tlb_others)
00010003 0.153ms (+0.001ms): do_nmi (ide_outsl)
00010003 0.155ms (+0.000ms): notifier_call_chain (default_do_nmi)
00010003 0.155ms (+0.000ms): profile_tick (nmi_watchdog_tick)
00010003 0.155ms (+0.000ms): profile_hook (profile_tick)
00010003 0.156ms (+0.000ms): read_lock (profile_hook)
00010004 0.156ms (+0.000ms): read_lock (<00000000>)
00010004 0.156ms (+0.000ms): notifier_call_chain (profile_hook)
00010004 0.156ms (+0.000ms): _read_unlock (profile_tick)
00010003 0.157ms (+0.290ms): profile_hit (nmi_watchdog_tick)
00000003 0.447ms (+0.000ms): _spin_unlock (flush_tlb_others)
00000001 0.447ms (+0.000ms): free_pages_and_swap_cache (unmap_vmas)

or this...

preemption latency trace v1.0.7 on 2.6.9-rc1-bk12-VP-R8-nd
-------------------------------------------------------
latency: 440 us, entries: 34 (34) | [VP:1 KP:1 SP:1 HP:1 #CPUS:2]
-----------------
| task: kswapd0/40, uid:0 nice:0 policy:0 rt_prio:0
-----------------
=> started at: spin_lock+0x24/0x90
=> ended at: _spin_unlock+0x2d/0x60
=======>
00000001 0.000ms (+0.000ms): spin_lock (get_swap_page)
00000001 0.000ms (+0.000ms): spin_lock (<00000000>)
00000001 0.000ms (+0.000ms): spin_lock (get_swap_page)
00000002 0.000ms (+0.410ms): spin_lock (<00000000>)
00010002 0.411ms (+0.000ms): do_IRQ (get_swap_page)

or this...

preemption latency trace v1.0.7 on 2.6.9-rc1-bk12-VP-R8-nd
-------------------------------------------------------
latency: 548 us, entries: 8 (8) | [VP:1 KP:1 SP:1 HP:1 #CPUS:2]
-----------------
| task: kswapd0/40, uid:0 nice:0 policy:0 rt_prio:0
-----------------
=> started at: spin_lock+0x24/0x90
=> ended at: _spin_unlock+0x2d/0x60
=======>
00000001 0.000ms (+0.000ms): spin_lock (get_swap_page)
00000001 0.000ms (+0.000ms): spin_lock (<00000000>)
00000001 0.000ms (+0.000ms): spin_lock (get_swap_page)
00000002 0.000ms (+0.547ms): spin_lock (<00000000>)
00000002 0.548ms (+0.000ms): _spin_unlock (get_swap_page)
00000001 0.548ms (+0.000ms): _spin_unlock (get_swap_page)
00000001 0.548ms (+0.000ms): sub_preempt_count (_spin_unlock)
00000001 0.549ms (+0.000ms): update_max_trace (check_preempt_timing)

or this...

preemption latency trace v1.0.7 on 2.6.9-rc1-bk12-VP-R8-nd
-------------------------------------------------------
latency: 422 us, entries: 345 (345) | [VP:1 KP:1 SP:1 HP:1 #CPUS:2]
-----------------
| task: kblockd/0/12, uid:0 nice:-10 policy:0 rt_prio:0
-----------------
=> started at: do_IRQ+0x19/0x1a0
=> ended at: do_IRQ+0x14a/0x1a0
=======>
00010000 0.000ms (+0.000ms): do_IRQ (common_interrupt)
00010000 0.000ms (+0.000ms): do_IRQ (kthread_should_stop)
00010000 0.000ms (+0.000ms): do_IRQ (<0000000a>)
00010000 0.000ms (+0.000ms): spin_lock (do_IRQ)
00010001 0.000ms (+0.052ms): spin_lock (<00000000>)
00010001 0.053ms (+0.000ms): mask_and_ack_level_ioapic_irq (do_IRQ)
00010001 0.053ms (+0.000ms): mask_IO_APIC_irq
(mask_and_ack_level_ioapic_irq)
00010001 0.053ms (+0.000ms): __spin_lock_irqsave (mask_IO_APIC_irq)
00010002 0.053ms (+0.000ms): __spin_lock_irqsave (<00000000>)
00010002 0.053ms (+0.000ms): __mask_IO_APIC_irq (mask_IO_APIC_irq)
00010002 0.054ms (+0.220ms): __modify_IO_APIC_irq (__mask_IO_APIC_irq)
00010002 0.274ms (+0.000ms): _spin_unlock_irqrestore (mask_IO_APIC_irq)
00010001 0.274ms (+0.000ms): generic_redirect_hardirq (do_IRQ)

Kswapd0
=======

Not sure I saw this before - may be a side effect of the smaller
transfer sizes...

preemption latency trace v1.0.7 on 2.6.9-rc1-bk12-VP-R8-nd
-------------------------------------------------------
latency: 500 us, entries: 608 (608) | [VP:1 KP:1 SP:1 HP:1 #CPUS:2]
-----------------
| task: kswapd0/40, uid:0 nice:0 policy:0 rt_prio:0
-----------------
=> started at: spin_lock+0x24/0x90
=> ended at: _spin_unlock+0x2d/0x60
=======>
00000001 0.000ms (+0.000ms): spin_lock (page_referenced_file)
00000001 0.000ms (+0.000ms): spin_lock (<00000000>)
00000001 0.000ms (+0.000ms): vma_prio_tree_next (page_referenced_file)
00000001 0.000ms (+0.000ms): prio_tree_first (vma_prio_tree_next)
00000001 0.000ms (+0.000ms): prio_tree_left (prio_tree_first)
00000001 0.000ms (+0.000ms): page_referenced_one (page_referenced_file)
00000001 0.001ms (+0.000ms): spin_lock (page_referenced_one)
00000002 0.001ms (+0.000ms): spin_lock (<00000000>)
00000002 0.001ms (+0.000ms): kmap_atomic (page_referenced_one)
00000003 0.001ms (+0.000ms): page_address (page_referenced_one)
00000003 0.002ms (+0.000ms): kunmap_atomic (page_referenced_one)
00000002 0.002ms (+0.000ms): _spin_unlock (page_referenced_one)
00000001 0.002ms (+0.000ms): vma_prio_tree_next (page_referenced_file)
00000001 0.002ms (+0.000ms): page_referenced_one (page_referenced_file)
00000001 0.003ms (+0.000ms): spin_lock (page_referenced_one)
00000002 0.003ms (+0.000ms): spin_lock (<00000000>)
...
00000001 0.480ms (+0.000ms): vma_prio_tree_next (page_referenced_file)
00000001 0.481ms (+0.001ms): page_referenced_one (page_referenced_file)
00000001 0.483ms (+0.000ms): spin_lock (page_referenced_one)
00000002 0.484ms (+0.001ms): spin_lock (<00000000>)
00000002 0.485ms (+0.000ms): kmap_atomic (page_referenced_one)
00000003 0.486ms (+0.001ms): page_address (page_referenced_one)
00000003 0.487ms (+0.000ms): flush_tlb_page (page_referenced_one)
00000003 0.488ms (+0.000ms): kunmap_atomic (page_referenced_one)
00000002 0.489ms (+0.000ms): _spin_unlock (page_referenced_one)
00000001 0.489ms (+0.001ms): vma_prio_tree_next (page_referenced_file)
00000001 0.491ms (+0.002ms): page_referenced_one (page_referenced_file)
00000001 0.494ms (+0.000ms): spin_lock (page_referenced_one)
00000002 0.494ms (+0.002ms): spin_lock (<00000000>)
00000002 0.497ms (+0.000ms): kmap_atomic (page_referenced_one)
00000003 0.497ms (+0.000ms): page_address (page_referenced_one)
00000003 0.498ms (+0.000ms): kunmap_atomic (page_referenced_one)
00000002 0.499ms (+0.000ms): _spin_unlock (page_referenced_one)
00000001 0.499ms (+0.002ms): _spin_unlock (page_referenced_file)
00000001 0.501ms (+0.002ms): sub_preempt_count (_spin_unlock)
00000001 0.503ms (+0.000ms): update_max_trace (check_preempt_timing)

>but the real fix would be to tweak the IDE controller to not do so
>agressive DMA! Are there any BIOS settings that somehow deal with it?
>Try increasing the PCI latency value? Is the disk using UDMA - if yes,
>could you downgrade it to normal IDE DMA? Perhaps that tweaks the
>controller to be 'nice' to the CPU.

I tried
hdparm -p -X udma2 /dev/hda
(since it was udma4 previously)
and reran the tests. As a side note - that command itself caused a long
latency trace. Something like...

preemption latency trace v1.0.7 on 2.6.9-rc1-bk12-VP-R8-nd
-------------------------------------------------------
latency: 2019 us, entries: 2829 (2829) | [VP:1 KP:1 SP:1 HP:1 #CPUS:2]
-----------------
| task: hdparm/13753, uid:0 nice:0 policy:0 rt_prio:0
-----------------
=> started at: cond_resched+0xd/0x30
=> ended at: cond_resched+0xd/0x30
=======>
00000001 0.000ms (+0.000ms): touch_preempt_timing (cond_resched)
00000001 0.000ms (+0.007ms): ide_find_setting_by_ioctl (generic_ide_ioctl)
00000001 0.008ms (+0.000ms): ide_write_setting (generic_ide_ioctl)
00000001 0.009ms (+0.001ms): capable (ide_write_setting)
...
00000001 0.064ms (+0.001ms): ide_inb (ide_config_drive_speed)
00000001 0.066ms (+0.000ms): ide_inb (ide_config_drive_speed)
00000001 0.067ms (+0.000ms): ide_inb (ide_config_drive_speed)
...
00000001 1.899ms (+0.000ms): ide_inb (ide_config_drive_speed)
00000001 1.899ms (+0.000ms): ide_inb (ide_config_drive_speed)
00000001 1.900ms (+0.000ms): ide_inb (ide_config_drive_speed)
00000001 1.901ms (+0.000ms): __const_udelay (ide_config_drive_speed)
00000001 1.901ms (+0.000ms): __delay (ide_config_drive_speed)
00000001 1.901ms (+0.001ms): delay_tsc (__delay)
00000001 1.902ms (+0.000ms): ide_inb (ide_config_drive_speed)
...

The results after that change were like this...

preemption latency trace v1.0.7 on 2.6.9-rc1-bk12-VP-R8-nd
-------------------------------------------------------
latency: 536 us, entries: 31 (31) | [VP:1 KP:1 SP:1 HP:1 #CPUS:2]
-----------------
| task: IRQ 14/140, uid:0 nice:-10 policy:0 rt_prio:0
-----------------
=> started at: kmap_atomic+0x23/0xe0
=> ended at: kunmap_atomic+0x7b/0xa0
=======>
00000001 0.000ms (+0.000ms): kmap_atomic (__ide_do_rw_disk)
00000001 0.000ms (+0.000ms): page_address (__ide_do_rw_disk)
00000001 0.000ms (+0.000ms): ide_set_handler (__ide_do_rw_disk)
00000001 0.000ms (+0.000ms): __spin_lock_irqsave (ide_set_handler)
00000002 0.000ms (+0.000ms): __spin_lock_irqsave (<00000000>)
00000002 0.001ms (+0.000ms): __ide_set_handler (ide_set_handler)
00000002 0.001ms (+0.000ms): __mod_timer (ide_set_handler)
00000002 0.001ms (+0.000ms): __spin_lock_irqsave (__mod_timer)
00000003 0.001ms (+0.000ms): __spin_lock_irqsave (<00000000>)
00000003 0.002ms (+0.000ms): spin_lock (__mod_timer)
00000004 0.002ms (+0.000ms): spin_lock (<00000000>)
00000004 0.002ms (+0.000ms): internal_add_timer (__mod_timer)
00000004 0.002ms (+0.000ms): _spin_unlock (__mod_timer)
00000003 0.002ms (+0.000ms): _spin_unlock_irqrestore (__mod_timer)
00000002 0.003ms (+0.000ms): _spin_unlock_irqrestore (ide_set_handler)
00000001 0.003ms (+0.000ms): taskfile_output_data (__ide_do_rw_disk)
00000001 0.003ms (+0.000ms): ata_output_data (taskfile_output_data)
00000001 0.003ms (+0.496ms): ide_outsw (ata_output_data)
00010001 0.500ms (+0.000ms): do_nmi (ide_outsw)
00010001 0.501ms (+0.005ms): do_nmi (<0864d2b2>)
00010001 0.506ms (+0.000ms): notifier_call_chain (default_do_nmi)
00010001 0.506ms (+0.000ms): profile_tick (nmi_watchdog_tick)
00010001 0.506ms (+0.000ms): profile_hook (profile_tick)
00010001 0.506ms (+0.000ms): read_lock (profile_hook)
00010002 0.507ms (+0.000ms): read_lock (<00000000>)
00010002 0.507ms (+0.000ms): notifier_call_chain (profile_hook)
00010002 0.507ms (+0.000ms): _read_unlock (profile_tick)
00010001 0.507ms (+0.028ms): profile_hit (nmi_watchdog_tick)
00000001 0.536ms (+0.000ms): kunmap_atomic (__ide_do_rw_disk)
00000001 0.536ms (+0.000ms): sub_preempt_count (kunmap_atomic)
00000001 0.536ms (+0.000ms): sub_preempt_count (kunmap_atomic)
00000001 0.537ms (+0.000ms): update_max_trace (check_preempt_timing)

or this...

preemption latency trace v1.0.7 on 2.6.9-rc1-bk12-VP-R8-nd
-------------------------------------------------------
latency: 497 us, entries: 18 (18) | [VP:1 KP:1 SP:1 HP:1 #CPUS:2]
-----------------
| task: IRQ 14/140, uid:0 nice:-10 policy:0 rt_prio:0
-----------------
=> started at: kmap_atomic+0x23/0xe0
=> ended at: kunmap_atomic+0x7b/0xa0
=======>
00000001 0.000ms (+0.000ms): kmap_atomic (read_intr)
00000001 0.000ms (+0.000ms): page_address (read_intr)
00000001 0.000ms (+0.000ms): taskfile_input_data (read_intr)
00000001 0.000ms (+0.000ms): ata_input_data (taskfile_input_data)
00000001 0.000ms (+0.066ms): ide_insw (ata_input_data)
00010001 0.067ms (+0.000ms): do_nmi (ide_insw)
00010001 0.067ms (+0.003ms): do_nmi (<0815f06c>)
00010001 0.070ms (+0.000ms): notifier_call_chain (default_do_nmi)
00010001 0.070ms (+0.000ms): profile_tick (nmi_watchdog_tick)
00010001 0.071ms (+0.000ms): profile_hook (profile_tick)
00010001 0.071ms (+0.000ms): read_lock (profile_hook)
00010002 0.071ms (+0.000ms): read_lock (<00000000>)
00010002 0.071ms (+0.000ms): notifier_call_chain (profile_hook)
00010002 0.071ms (+0.000ms): _read_unlock (profile_tick)
00010001 0.072ms (+0.424ms): profile_hit (nmi_watchdog_tick)
00000001 0.497ms (+0.000ms): kunmap_atomic (read_intr)
00000001 0.497ms (+0.000ms): sub_preempt_count (kunmap_atomic)
00000001 0.498ms (+0.000ms): update_max_trace (check_preempt_timing)

or this...

preemption latency trace v1.0.7 on 2.6.9-rc1-bk12-VP-R8-nd
-------------------------------------------------------
latency: 454 us, entries: 60 (60) | [VP:1 KP:1 SP:1 HP:1 #CPUS:2]
-----------------
| task: kswapd0/40, uid:0 nice:0 policy:0 rt_prio:0
-----------------
=> started at: spin_lock+0x24/0x90
=> ended at: _spin_unlock+0x2d/0x60
=======>
00000001 0.000ms (+0.000ms): spin_lock (get_swap_page)
00000001 0.000ms (+0.000ms): spin_lock (<00000000>)
00000001 0.000ms (+0.000ms): spin_lock (get_swap_page)
00000002 0.001ms (+0.267ms): spin_lock (<00000000>)
00010002 0.268ms (+0.000ms): do_nmi (get_swap_page)
00010002 0.268ms (+0.003ms): do_nmi (ide_outsw)
00010002 0.272ms (+0.000ms): notifier_call_chain (default_do_nmi)
00010002 0.272ms (+0.000ms): profile_tick (nmi_watchdog_tick)
00010002 0.272ms (+0.000ms): profile_hook (profile_tick)
00010002 0.272ms (+0.000ms): read_lock (profile_hook)
00010003 0.273ms (+0.000ms): read_lock (<00000000>)
00010003 0.273ms (+0.000ms): notifier_call_chain (profile_hook)
00010003 0.273ms (+0.000ms): _read_unlock (profile_tick)
00010002 0.274ms (+0.000ms): profile_hit (nmi_watchdog_tick)
00010002 0.274ms (+0.000ms): do_IRQ (get_swap_page)
00010002 0.275ms (+0.000ms): do_IRQ (<00000000>)
00010002 0.275ms (+0.000ms): spin_lock (do_IRQ)
00010003 0.275ms (+0.000ms): spin_lock (<00000000>)
00010003 0.275ms (+0.000ms): ack_edge_ioapic_irq (do_IRQ)
00010003 0.275ms (+0.000ms): generic_redirect_hardirq (do_IRQ)
00010003 0.276ms (+0.000ms): _spin_unlock (do_IRQ)

or this...

preemption latency trace v1.0.7 on 2.6.9-rc1-bk12-VP-R8-nd
-------------------------------------------------------
latency: 457 us, entries: 8 (8) | [VP:1 KP:1 SP:1 HP:1 #CPUS:2]
-----------------
| task: kswapd0/40, uid:0 nice:0 policy:0 rt_prio:0
-----------------
=> started at: spin_lock+0x24/0x90
=> ended at: _spin_unlock+0x2d/0x60
=======>
00000001 0.000ms (+0.000ms): spin_lock (get_swap_page)
00000001 0.000ms (+0.000ms): spin_lock (<00000000>)
00000001 0.000ms (+0.000ms): spin_lock (get_swap_page)
00000002 0.000ms (+0.456ms): spin_lock (<00000000>)
00000002 0.457ms (+0.000ms): _spin_unlock (get_swap_page)
00000001 0.457ms (+0.000ms): _spin_unlock (get_swap_page)
00000001 0.458ms (+0.000ms): sub_preempt_count (_spin_unlock)
00000001 0.458ms (+0.000ms): update_max_trace (check_preempt_timing)

So I don't think that particular setting helps.

--Mark


2004-09-09 22:57:58

by Mark_H_Johnson

[permalink] [raw]
Subject: Re: [patch] voluntary-preempt-2.6.9-rc1-bk4-R1

>I tried
> hdparm -p -X udma2 /dev/hda
>(since it was udma4 previously)
>and reran the tests.
Not quite sure this was what I wanted - appears to turn on PIO modes
exclusively.

What is the right incantation for this?
Thanks.
--Mark H Johnson
<mailto:[email protected]>

2004-09-10 02:32:03

by K.R. Foley

[permalink] [raw]
Subject: Re: [patch] voluntary-preempt-2.6.9-rc1-bk4-R1

Ingo Molnar wrote:
> * [email protected] <[email protected]> wrote:
>
>
>>If you look at the date / time of the traces, you will notice that
>>most occur in the latter part of the test. This is during the "disk
>>copy" and "disk read" parts of the testing. [...]
>
>
> would it be possible to test with DMA disabled? (hdparm -d0 /dev/hda) It
> might take some extra work to shun the extra latency reports from the
> PIO IDE path (which is quite slow) but once that is done you should be
> able to see whether these long 0.5 msec delays remain even if all (most)
> DMA activity has been eliminated.
>
>
>>preemption latency trace v1.0.5 on 2.6.9-rc1-VP-R1
>>--------------------------------------------------
>> latency: 550 us, entries: 6 (6)
>> -----------------
>> | task: cat/6771, uid:0 nice:0 policy:0 rt_prio:0
>> -----------------
>> => started at: kmap_atomic+0x23/0xe0
>> => ended at: kunmap_atomic+0x7b/0xa0
>>=======>
>>00000001 0.000ms (+0.000ms): kmap_atomic (file_read_actor)
>>00000001 0.000ms (+0.000ms): page_address (file_read_actor)
>>00000001 0.000ms (+0.549ms): __copy_to_user_ll (file_read_actor)
>>00000001 0.550ms (+0.000ms): kunmap_atomic (file_read_actor)
>>00000001 0.550ms (+0.000ms): sub_preempt_count (kunmap_atomic)
>>00000001 0.550ms (+0.000ms): update_max_trace (check_preempt_timing)
>
>
> this is a full page copy, from userspace into a kernelspace pagecache
> page. This shouldnt take 500 usecs on any hardware. Since this is a
> single instruction (memcpy's rep; movsl instruction) there's nothing
> that Linux can do to avoid (or even to cause) such a situation.

I saw this one (or one very similar) on a system that I just started
testing on today. Not quite as high (~219 usec if I remember correctly).
I don't have access to the system from here; I will forward the trace
tomorrow when I'm there. However, I haven't seen this
on my slower system running the same stress tests. There are several
possible points of interest:

System I saw this on:
P4 2.4GHz or 3.0GHz
2GB memory
2.6.9-rc1-bk12-S0 built for SMP (even though hyperthreading is off
currently)

System I haven't seen this on:
PII 450
256MB memory
2.6.9-rc1-bk12-R6 built for UP

Sorry I don't have more complete data in front of me. I will send the
concrete info tomorrow with the trace.

kr

2004-09-10 12:40:48

by Ingo Molnar

[permalink] [raw]
Subject: Re: [patch] voluntary-preempt-2.6.9-rc1-bk4-R1


* [email protected] <[email protected]> wrote:

> >I tried
> > hdparm -p -X udma2 /dev/hda
> >(since it was udma4 previously)
> >and reran the tests.
> Not quite sure this was what I wanted - appears to turn on PIO modes
> exclusively.
>
> What is the right incantation for this?

does 'hdparm -X udma2 /dev/hda' work? -p seems to be some PIO auto-tune
option that you probably dont need.

Ingo

2004-09-10 12:58:44

by Mark_H_Johnson

[permalink] [raw]
Subject: Re: [patch] voluntary-preempt-2.6.9-rc1-bk4-R1

>You could also try lower DMA modes. I don't know if that will help or
>not, but UDMA133 is essentially one entire PCI 32/33 MHz bus worth of
>traffic, so the effects are not too surprising. IDE control ports can
>give you long stalls too, especially if IORDY is in use.

It appears I finally figured it out and have the drive running with udma2
instead of udma4. For reference:

# hdparm /dev/hda

/dev/hda:
multcount = 16 (on)
IO_support = 1 (32-bit)
unmaskirq = 1 (on)
using_dma = 1 (on)
keepsettings = 0 (off)
readonly = 0 (off)
readahead = 64 (on)
geometry = 58168/16/63, sectors = 58633344, start = 0
# hdparm -i /dev/hda

/dev/hda:

Model=WDC WD300BB-00AUA1, FwRev=18.20D18, SerialNo=WD-WMA6W1764218
Config={ HardSect NotMFM HdSw>15uSec SpinMotCtl Fixed DTR>5Mbs FmtGapReq }
RawCHS=16383/16/63, TrkSize=57600, SectSize=600, ECCbytes=40
BuffType=DualPortCache, BuffSize=2048kB, MaxMultSect=16, MultSect=16
CurCHS=16383/16/63, CurSects=16514064, LBA=yes, LBAsects=58633344
IORDY=on/off, tPIO={min:120,w/IORDY:120}, tDMA={min:120,rec:120}
PIO modes: pio0 pio1 pio2 pio3 pio4
DMA modes: mdma0 mdma1 mdma2
UDMA modes: udma0 udma1 *udma2 udma3 udma4 udma5
AdvancedPM=no WriteCache=enabled
Drive conforms to: device does not report version:

* signifies the current active mode

The same test I ran yesterday
head -c 750000000 /dev/zero >tmpfile (disk writes)
cp tmpfile tmpfile2 (disk copy)
cat tmpfile tmpfile2 >/dev/null
ran with the following results late last night. All are traces > 200 usec
[a much smaller limit than before]; there were no traces > 600 usec.

Exit Mmap - Traces 00 01 02 03 04 05 10 13 24 26
Kswapd0 - Traces 06 07 08 09 11 12 14 16..23 25 27
Spin lock - Trace 15

Exit Mmap
=========

preemption latency trace v1.0.7 on 2.6.9-rc1-bk12-VP-R8-nd
-------------------------------------------------------
latency: 209 us, entries: 34 (34) | [VP:1 KP:1 SP:1 HP:1 #CPUS:2]
-----------------
| task: sleep/4436, uid:0 nice:-9 policy:0 rt_prio:0
-----------------
=> started at: cond_resched_lock+0x1d/0xa0
=> ended at: _spin_unlock+0x2d/0x60
=======>
00000001 0.000ms (+0.000ms): touch_preempt_timing (cond_resched_lock)
00000001 0.000ms (+0.001ms): __bitmap_weight (unmap_vmas)
00000001 0.001ms (+0.002ms): vm_acct_memory (exit_mmap)
00000001 0.003ms (+0.113ms): clear_page_tables (exit_mmap)
00010001 0.117ms (+0.000ms): do_nmi (clear_page_tables)
00010001 0.117ms (+0.003ms): do_nmi (avc_has_perm)
00010001 0.121ms (+0.000ms): notifier_call_chain (default_do_nmi)
00010001 0.121ms (+0.000ms): profile_tick (nmi_watchdog_tick)
00010001 0.121ms (+0.000ms): profile_hook (profile_tick)
00010001 0.121ms (+0.000ms): read_lock (profile_hook)
00010002 0.121ms (+0.000ms): read_lock (<00000000>)
00010002 0.122ms (+0.000ms): notifier_call_chain (profile_hook)
00010002 0.122ms (+0.000ms): _read_unlock (profile_tick)
00010001 0.122ms (+0.078ms): profile_hit (nmi_watchdog_tick)
00000001 0.200ms (+0.002ms): flush_tlb_mm (exit_mmap)
00000001 0.202ms (+0.000ms): free_pages_and_swap_cache (exit_mmap)

Kswapd0
=======

preemption latency trace v1.0.7 on 2.6.9-rc1-bk12-VP-R8-nd
-------------------------------------------------------
latency: 539 us, entries: 719 (719) | [VP:1 KP:1 SP:1 HP:1 #CPUS:2]
-----------------
| task: kswapd0/40, uid:0 nice:0 policy:0 rt_prio:0
-----------------
=> started at: spin_lock+0x24/0x90
=> ended at: _spin_unlock+0x2d/0x60
=======>
00000001 0.000ms (+0.000ms): spin_lock (page_referenced_file)
00000001 0.000ms (+0.000ms): spin_lock (<00000000>)
00000001 0.000ms (+0.000ms): vma_prio_tree_next (page_referenced_file)
00000001 0.000ms (+0.000ms): prio_tree_first (vma_prio_tree_next)
00000001 0.001ms (+0.001ms): prio_tree_left (prio_tree_first)
00000001 0.003ms (+0.000ms): page_referenced_one (page_referenced_file)
00000001 0.004ms (+0.000ms): spin_lock (page_referenced_one)
00000002 0.005ms (+0.001ms): spin_lock (<00000000>)
00000002 0.006ms (+0.000ms): kmap_atomic (page_referenced_one)
00000003 0.006ms (+0.000ms): page_address (page_referenced_one)
00000003 0.007ms (+0.000ms): kunmap_atomic (page_referenced_one)
00000002 0.007ms (+0.000ms): _spin_unlock (page_referenced_one)
00000001 0.007ms (+0.000ms): vma_prio_tree_next (page_referenced_file)
00000001 0.008ms (+0.001ms): page_referenced_one (page_referenced_file)
00000001 0.009ms (+0.000ms): spin_lock (page_referenced_one)
00000002 0.009ms (+0.002ms): spin_lock (<00000000>)
... cycle repeats a number of times ...
00000001 0.535ms (+0.000ms): preempt_schedule (page_referenced_one)
00000001 0.536ms (+0.000ms): vma_prio_tree_next (page_referenced_file)
00000001 0.536ms (+0.000ms): page_referenced_one (page_referenced_file)
00000001 0.537ms (+0.000ms): spin_lock (page_referenced_one)
00000002 0.537ms (+0.000ms): spin_lock (<00000000>)
00000002 0.538ms (+0.000ms): kmap_atomic (page_referenced_one)
00000003 0.538ms (+0.000ms): page_address (page_referenced_one)
00000003 0.538ms (+0.000ms): kunmap_atomic (page_referenced_one)
00000002 0.538ms (+0.000ms): preempt_schedule (page_referenced_one)
00000002 0.539ms (+0.000ms): _spin_unlock (page_referenced_one)
00000001 0.539ms (+0.000ms): preempt_schedule (page_referenced_one)
00000001 0.539ms (+0.000ms): _spin_unlock (page_referenced_file)
00000001 0.540ms (+0.000ms): sub_preempt_count (_spin_unlock)
00000001 0.541ms (+0.000ms): update_max_trace (check_preempt_timing)

Spin Lock
=========

preemption latency trace v1.0.7 on 2.6.9-rc1-bk12-VP-R8-nd
-------------------------------------------------------
latency: 235 us, entries: 18 (18) | [VP:1 KP:1 SP:1 HP:1 #CPUS:2]
-----------------
| task: kswapd0/40, uid:0 nice:0 policy:0 rt_prio:0
-----------------
=> started at: spin_lock+0x24/0x90
=> ended at: _spin_unlock+0x2d/0x60
=======>
00000001 0.000ms (+0.000ms): spin_lock (get_swap_page)
00000001 0.000ms (+0.000ms): spin_lock (<00000000>)
00000001 0.000ms (+0.000ms): spin_lock (get_swap_page)
00000002 0.000ms (+0.113ms): spin_lock (<00000000>)
00010002 0.113ms (+0.000ms): do_nmi (get_swap_page)
00010002 0.114ms (+0.002ms): do_nmi (default_idle)
00010002 0.116ms (+0.000ms): notifier_call_chain (default_do_nmi)
00010002 0.116ms (+0.000ms): profile_tick (nmi_watchdog_tick)
00010002 0.117ms (+0.000ms): profile_hook (profile_tick)
00010002 0.117ms (+0.000ms): read_lock (profile_hook)
00010003 0.117ms (+0.000ms): read_lock (<00000000>)
00010003 0.117ms (+0.000ms): notifier_call_chain (profile_hook)
00010003 0.117ms (+0.000ms): _read_unlock (profile_tick)
00010002 0.118ms (+0.116ms): profile_hit (nmi_watchdog_tick)
00000002 0.235ms (+0.000ms): _spin_unlock (get_swap_page)
00000001 0.235ms (+0.000ms): _spin_unlock (get_swap_page)
00000001 0.235ms (+0.001ms): sub_preempt_count (_spin_unlock)
00000001 0.237ms (+0.000ms): update_max_trace (check_preempt_timing)

Are you SURE the spin lock counter works properly on SMP systems?
I did a quick check of yesterday's results:
# grep -ir '<.*>' latencytest0.42-png/lt040909 | wc -l
6978
# grep -ir '<.*>' latencytest0.42-png/lt040909 | grep -v '<00000000>' |
less -im
...
latencytest0.42-png/lt040909/lt004.v3k1/lt.10:00010002 0.382ms
(+0.000ms): do_IRQ (<0000000e>)
latencytest0.42-png/lt040909/lt004.v3k1/lt.11:00010001 0.003ms (+0.000ms):
do_IRQ (<0000000a>)
latencytest0.42-png/lt040909/lt004.v3k1/lt.11:00020001 0.308ms (+0.000ms):
do_IRQ (<0000000e>)
latencytest0.42-png/lt040909/lt005.v3k1/lt.00:00010002 0.263ms (+0.005ms):
do_nmi (<0804bd56>)
latencytest0.42-png/lt040909/lt006.v3k1/lt.00:00010001 0.501ms (+0.005ms):
do_nmi (<0864d2b2>)
latencytest0.42-png/lt040909/lt006.v3k1/lt.01:00010001 0.067ms (+0.003ms):
do_nmi (<0815f06c>)
...
No entries that are non-zero and lock-related.

--Mark

2004-09-10 13:09:19

by Ingo Molnar

[permalink] [raw]
Subject: Re: [patch] voluntary-preempt-2.6.9-rc1-bk4-R1


* [email protected] <[email protected]> wrote:

> 00000001 0.000ms (+0.000ms): spin_lock (get_swap_page)
> 00000001 0.000ms (+0.000ms): spin_lock (<00000000>)
> 00000001 0.000ms (+0.000ms): spin_lock (get_swap_page)
> 00000002 0.000ms (+0.113ms): spin_lock (<00000000>)
> 00010002 0.113ms (+0.000ms): do_nmi (get_swap_page)

> Are you SURE the spin lock counter works properly on SMP systems?
> I did a quick check of yesterday's results:
> # grep -ir '<.*>' latencytest0.42-png/lt040909 | wc -l
> 6978
> # grep -ir '<.*>' latencytest0.42-png/lt040909 | grep -v '<00000000>' |
> less -im
> ...

> No entries that are non zero and lock related.

it works fine here. To double-check i've created a contention testcase:

00000001 0.000ms (+0.000ms): spin_lock (sys_gettimeofday)
00000001 0.000ms (+0.000ms): spin_lock (<000012ce>)

this spin_lock() spun 4814 times before it got the lock.

Linux locking is pretty uncontended on 2-way boxes.

Ingo

2004-09-10 14:58:56

by Mark_H_Johnson

[permalink] [raw]
Subject: Re: [patch] voluntary-preempt-2.6.9-rc1-bk4-R1

>does 'hdparm -X udma2 /dev/hda' work?

Yes it does and quite well actually. For comparison
-R1 on September 7 - over 100 traces > 500 usec
-R8 on September 10 - 38 traces > 200 usec, only 3 > 500 usec

This was with the full test suite (latencytest active, all different
types of operations).

In addition to what I already reported today, there were a few additional
types of traces:

network poll - Traces 00 01 02 03 04 06 07 08 16 17 29 32
twkill_work - Trace 05
VL kswapd0 - Trace 12
spin lock/nmi- Trace 15
do_IRQ - Trace 25 26 27
the previously reported symptoms...
exit mmap - Traces 09 18 19 20 21 22 23 24 28 30 31 33 34 35 36 37
kswapd0 - Traces 10 11 13 14

Network Poll
============
preemption latency trace v1.0.7 on 2.6.9-rc1-bk12-VP-R8-nd
-------------------------------------------------------
latency: 332 us, entries: 198 (198) | [VP:1 KP:1 SP:1 HP:1 #CPUS:2]
-----------------
| task: ksoftirqd/1/5, uid:0 nice:-10 policy:0 rt_prio:0
-----------------
=> started at: spin_lock+0x24/0x90
=> ended at: rtl8139_rx+0x219/0x340
=======>
00000001 0.000ms (+0.000ms): spin_lock (rtl8139_poll)
00000001 0.000ms (+0.001ms): spin_lock (<00000000>)
00000001 0.001ms (+0.003ms): rtl8139_rx (rtl8139_poll)
... we've been down this path before ...
No individual step > 10 usec; it's just the sheer number of entries involved.

twkill_work
===========

I have not seen this one before....

preemption latency trace v1.0.7 on 2.6.9-rc1-bk12-VP-R8-nd
-------------------------------------------------------
latency: 499 us, entries: 1858 (1858) | [VP:1 KP:1 SP:1 HP:1 #CPUS:2]
-----------------
| task: events/0/6, uid:0 nice:-10 policy:0 rt_prio:0
-----------------
=> started at: twkill_work+0xb5/0xe0
=> ended at: twkill_work+0xb5/0xe0
=======>
00000101 0.000ms (+0.000ms): touch_preempt_timing (twkill_work)
00000101 0.000ms (+0.000ms): smp_apic_timer_interrupt
(touch_preempt_timing)
00010101 0.000ms (+0.000ms): profile_tick (smp_apic_timer_interrupt)
00010101 0.000ms (+0.000ms): profile_hook (profile_tick)
00010101 0.000ms (+0.000ms): read_lock (profile_hook)
00010102 0.001ms (+0.000ms): read_lock (<00000000>)
00010102 0.001ms (+0.000ms): notifier_call_chain (profile_hook)
00010102 0.001ms (+0.000ms): _read_unlock (profile_tick)
00010101 0.002ms (+0.000ms): profile_hit (smp_apic_timer_interrupt)
00010101 0.002ms (+0.000ms): update_process_times
(smp_apic_timer_interrupt)
00010101 0.002ms (+0.000ms): update_one_process (update_process_times)
00010101 0.003ms (+0.000ms): run_local_timers (update_process_times)
00010101 0.003ms (+0.000ms): raise_softirq (update_process_times)
00010101 0.003ms (+0.000ms): scheduler_tick (update_process_times)
00010101 0.004ms (+0.000ms): sched_clock (scheduler_tick)
00010101 0.005ms (+0.000ms): rcu_check_callbacks (scheduler_tick)
00010101 0.005ms (+0.001ms): idle_cpu (rcu_check_callbacks)
00010101 0.006ms (+0.000ms): spin_lock (scheduler_tick)
00010102 0.006ms (+0.000ms): spin_lock (<00000000>)
00010102 0.007ms (+0.000ms): task_timeslice (scheduler_tick)
00010102 0.007ms (+0.001ms): __bitmap_weight (scheduler_tick)
00010102 0.009ms (+0.000ms): _spin_unlock (scheduler_tick)
00010101 0.010ms (+0.001ms): rebalance_tick (scheduler_tick)
00010101 0.011ms (+0.000ms): do_IRQ (touch_preempt_timing)
00010101 0.011ms (+0.000ms): do_IRQ (<00000000>)
00010101 0.012ms (+0.000ms): spin_lock (do_IRQ)
00010102 0.012ms (+0.000ms): spin_lock (<00000000>)
00010102 0.012ms (+0.000ms): ack_edge_ioapic_irq (do_IRQ)
00010102 0.013ms (+0.000ms): generic_redirect_hardirq (do_IRQ)
00010102 0.013ms (+0.000ms): _spin_unlock (do_IRQ)
... Look at this next sequence ...
00010102 0.033ms (+0.000ms): end_edge_ioapic_irq (do_IRQ)
00010102 0.033ms (+0.000ms): _spin_unlock (do_IRQ)
00000101 0.034ms (+0.000ms): tcp_do_twkill_work (twkill_work)
00000101 0.035ms (+0.000ms): _spin_unlock (tcp_do_twkill_work)
00000100 0.035ms (+0.000ms): tcp_timewait_kill (tcp_do_twkill_work)
00000100 0.035ms (+0.000ms): write_lock (tcp_timewait_kill)
00000101 0.036ms (+0.000ms): write_lock (<00000000>)
00000101 0.037ms (+0.000ms): _write_unlock (tcp_timewait_kill)
00000100 0.037ms (+0.000ms): spin_lock (tcp_timewait_kill)
00000101 0.037ms (+0.000ms): spin_lock (<00000000>)
00000101 0.038ms (+0.000ms): tcp_bucket_destroy (tcp_timewait_kill)
00000101 0.038ms (+0.000ms): kmem_cache_free (tcp_bucket_destroy)
00000101 0.038ms (+0.000ms): cache_flusharray (kmem_cache_free)
... or this one ...
00000100 0.050ms (+0.000ms): spin_lock (tcp_do_twkill_work)
00000101 0.051ms (+0.000ms): spin_lock (<00000000>)
00000101 0.051ms (+0.000ms): _spin_unlock (tcp_do_twkill_work)
00000100 0.051ms (+0.000ms): tcp_timewait_kill (tcp_do_twkill_work)
00000100 0.051ms (+0.000ms): write_lock (tcp_timewait_kill)
00000101 0.052ms (+0.000ms): write_lock (<00000000>)
00000101 0.052ms (+0.000ms): _write_unlock (tcp_timewait_kill)
00000100 0.053ms (+0.000ms): spin_lock (tcp_timewait_kill)
00000101 0.053ms (+0.000ms): spin_lock (<00000000>)
00000101 0.053ms (+0.000ms): tcp_bucket_destroy (tcp_timewait_kill)
00000101 0.053ms (+0.000ms): kmem_cache_free (tcp_bucket_destroy)
00000101 0.054ms (+0.000ms): _spin_unlock (tcp_timewait_kill)
00000100 0.054ms (+0.000ms): kmem_cache_free (tcp_do_twkill_work)
00000100 0.054ms (+0.000ms): spin_lock (tcp_do_twkill_work)
00000101 0.054ms (+0.000ms): spin_lock (<00000000>)
00000101 0.054ms (+0.000ms): _spin_unlock (tcp_do_twkill_work)
00000100 0.055ms (+0.000ms): tcp_timewait_kill (tcp_do_twkill_work)
... and ends like this ...
00000101 0.496ms (+0.000ms): _write_unlock (tcp_timewait_kill)
00000100 0.496ms (+0.000ms): preempt_schedule (tcp_timewait_kill)
00000100 0.496ms (+0.000ms): spin_lock (tcp_timewait_kill)
00000101 0.496ms (+0.000ms): spin_lock (<00000000>)
00000101 0.497ms (+0.000ms): tcp_bucket_destroy (tcp_timewait_kill)
00000101 0.497ms (+0.000ms): kmem_cache_free (tcp_bucket_destroy)
00000101 0.497ms (+0.000ms): _spin_unlock (tcp_timewait_kill)
00000100 0.497ms (+0.000ms): preempt_schedule (tcp_timewait_kill)
00000100 0.498ms (+0.000ms): kmem_cache_free (tcp_do_twkill_work)
00000100 0.498ms (+0.000ms): spin_lock (tcp_do_twkill_work)
00000101 0.498ms (+0.000ms): spin_lock (<00000000>)
00000101 0.499ms (+0.000ms): touch_preempt_timing (twkill_work)
00000101 0.499ms (+0.000ms): update_max_trace (check_preempt_timing)

I looked briefly at the -R8 patch and see the change in condition for
breaking the lock, but from this trace, it may not be enough.

VL kswapd0
==========

An extremely long latency (> 6 msec) appears to be a combination of
effects, perhaps tracing related. It starts like this...

preemption latency trace v1.0.7 on 2.6.9-rc1-bk12-VP-R8-nd
-------------------------------------------------------
latency: 6776 us, entries: 548 (548) | [VP:1 KP:1 SP:1 HP:1 #CPUS:2]
-----------------
| task: kswapd0/40, uid:0 nice:0 policy:0 rt_prio:0
-----------------
=> started at: spin_lock+0x24/0x90
=> ended at: _spin_unlock+0x2d/0x60
=======>
00000001 0.000ms (+0.000ms): spin_lock (page_referenced_file)
00000001 0.000ms (+0.000ms): spin_lock (<00000000>)
00000001 0.000ms (+0.000ms): vma_prio_tree_next (page_referenced_file)
00000001 0.001ms (+0.001ms): prio_tree_first (vma_prio_tree_next)
00000001 0.002ms (+0.000ms): page_referenced_one (page_referenced_file)
00000001 0.002ms (+0.000ms): spin_lock (page_referenced_one)
00000002 0.003ms (+0.001ms): spin_lock (<00000000>)
00000002 0.004ms (+0.000ms): kmap_atomic (page_referenced_one)
00000003 0.004ms (+0.000ms): page_address (page_referenced_one)
00000003 0.005ms (+0.000ms): kunmap_atomic (page_referenced_one)
00000002 0.005ms (+0.000ms): _spin_unlock (page_referenced_one)
00000001 0.005ms (+0.000ms): vma_prio_tree_next (page_referenced_file)
00000001 0.006ms (+0.000ms): page_referenced_one (page_referenced_file)
00000001 0.006ms (+0.000ms): spin_lock (page_referenced_one)
00000002 0.007ms (+0.000ms): spin_lock (<00000000>)
... looks like the typical kswapd0 cycle until this point ...
00000001 0.126ms (+0.000ms): page_referenced_one (page_referenced_file)
00000001 0.127ms (+0.000ms): spin_lock (page_referenced_one)
00000002 0.127ms (+0.000ms): spin_lock (<00000000>)
00000002 0.128ms (+0.000ms): kmap_atomic (page_referenced_one)
00000003 0.128ms (+0.000ms): page_address (page_referenced_one)
00000003 0.129ms (+0.000ms): flush_tlb_page (page_referenced_one)
00000004 0.129ms (+0.000ms): flush_tlb_others (flush_tlb_page)
00000004 0.130ms (+0.000ms): spin_lock (flush_tlb_others)
00000005 0.130ms (+0.000ms): spin_lock (<00000000>)
00000005 0.131ms (+0.000ms): send_IPI_mask (flush_tlb_others)
00000005 0.132ms (+0.790ms): send_IPI_mask_bitmask (flush_tlb_others)
00010005 0.922ms (+0.000ms): do_nmi (flush_tlb_others)
00010005 0.923ms (+0.003ms): do_nmi (touch_preempt_timing)
00010005 0.926ms (+0.000ms): notifier_call_chain (default_do_nmi)
00010005 0.926ms (+0.000ms): profile_tick (nmi_watchdog_tick)
00010005 0.926ms (+0.000ms): profile_hook (profile_tick)
00010005 0.927ms (+0.000ms): read_lock (profile_hook)
00010006 0.927ms (+0.000ms): read_lock (<00000000>)
00010006 0.927ms (+0.000ms): notifier_call_chain (profile_hook)
00010006 0.928ms (+0.000ms): _read_unlock (profile_tick)
00010005 0.928ms (+0.027ms): profile_hit (nmi_watchdog_tick)
00000005 0.955ms (+0.000ms): smp_apic_timer_interrupt (flush_tlb_others)
00010005 0.955ms (+0.000ms): profile_tick (smp_apic_timer_interrupt)
00010005 0.956ms (+0.000ms): profile_hook (profile_tick)
00010005 0.956ms (+0.000ms): read_lock (profile_hook)
00010006 0.956ms (+0.000ms): read_lock (<00000000>)
00010006 0.956ms (+0.000ms): notifier_call_chain (profile_hook)
00010006 0.956ms (+0.000ms): _read_unlock (profile_tick)
00010005 0.956ms (+0.000ms): profile_hit (smp_apic_timer_interrupt)
00010005 0.957ms (+0.000ms): update_process_times
(smp_apic_timer_interrupt)
00010005 0.957ms (+0.000ms): update_one_process (update_process_times)
00010005 0.957ms (+0.000ms): run_local_timers (update_process_times)
00010005 0.957ms (+0.000ms): raise_softirq (update_process_times)
00010005 0.958ms (+0.000ms): scheduler_tick (update_process_times)
00010005 0.958ms (+0.000ms): sched_clock (scheduler_tick)
00010005 0.958ms (+0.000ms): spin_lock (scheduler_tick)
00010006 0.958ms (+0.000ms): spin_lock (<00000000>)
00010006 0.959ms (+0.000ms): task_timeslice (scheduler_tick)
00010006 0.959ms (+0.000ms): __bitmap_weight (scheduler_tick)
00010006 0.959ms (+0.000ms): _spin_unlock (scheduler_tick)
00010005 0.959ms (+0.000ms): rebalance_tick (scheduler_tick)
00000006 0.960ms (+0.000ms): do_softirq (smp_apic_timer_interrupt)
00000006 0.960ms (+0.000ms): __do_softirq (do_softirq)
00000006 0.960ms (+0.000ms): wake_up_process (do_softirq)
00000006 0.960ms (+0.000ms): try_to_wake_up (wake_up_process)
00000006 0.960ms (+0.000ms): task_rq_lock (try_to_wake_up)
00000006 0.961ms (+0.000ms): spin_lock (task_rq_lock)
00000007 0.961ms (+0.000ms): spin_lock (<00000000>)
00000007 0.961ms (+0.000ms): wake_idle (try_to_wake_up)
00000007 0.961ms (+0.000ms): activate_task (try_to_wake_up)
00000007 0.961ms (+0.000ms): sched_clock (activate_task)
00000007 0.962ms (+0.000ms): recalc_task_prio (activate_task)
00000007 0.962ms (+0.000ms): effective_prio (recalc_task_prio)
00000007 0.962ms (+0.000ms): __activate_task (try_to_wake_up)
00000007 0.962ms (+0.000ms): enqueue_task (__activate_task)
00000007 0.962ms (+0.000ms): sched_info_queued (enqueue_task)
00000007 0.963ms (+0.000ms): resched_task (try_to_wake_up)
00000007 0.963ms (+0.000ms): task_rq_unlock (try_to_wake_up)
00000007 0.963ms (+0.000ms): _spin_unlock_irqrestore (try_to_wake_up)
00000006 0.964ms (+0.958ms): preempt_schedule (try_to_wake_up)
00010005 1.922ms (+0.000ms): do_nmi (flush_tlb_others)
00010005 1.922ms (+0.003ms): do_nmi (check_preempt_timing)
00010005 1.926ms (+0.000ms): notifier_call_chain (default_do_nmi)
00010005 1.926ms (+0.000ms): profile_tick (nmi_watchdog_tick)
00010005 1.926ms (+0.000ms): profile_hook (profile_tick)
00010005 1.926ms (+0.000ms): read_lock (profile_hook)
00010006 1.927ms (+0.000ms): read_lock (<00000000>)
00010006 1.927ms (+0.000ms): notifier_call_chain (profile_hook)
00010006 1.927ms (+0.000ms): _read_unlock (profile_tick)
00010005 1.927ms (+0.000ms): preempt_schedule (profile_tick)
00010005 1.928ms (+0.027ms): profile_hit (nmi_watchdog_tick)
00000005 1.955ms (+0.000ms): smp_apic_timer_interrupt (flush_tlb_others)
00010005 1.955ms (+0.000ms): profile_tick (smp_apic_timer_interrupt)
00010005 1.955ms (+0.000ms): profile_hook (profile_tick)
00010005 1.955ms (+0.000ms): read_lock (profile_hook)
00010006 1.955ms (+0.000ms): read_lock (<00000000>)
00010006 1.956ms (+0.000ms): notifier_call_chain (profile_hook)
00010006 1.956ms (+0.000ms): _read_unlock (profile_tick)
00010005 1.956ms (+0.000ms): preempt_schedule (profile_tick)
00010005 1.956ms (+0.000ms): profile_hit (smp_apic_timer_interrupt)
00010005 1.956ms (+0.000ms): update_process_times
(smp_apic_timer_interrupt)
00010005 1.956ms (+0.000ms): update_one_process (update_process_times)
00010005 1.957ms (+0.000ms): run_local_timers (update_process_times)
00010005 1.957ms (+0.000ms): raise_softirq (update_process_times)
00010005 1.957ms (+0.000ms): scheduler_tick (update_process_times)
00010005 1.957ms (+0.000ms): sched_clock (scheduler_tick)
00010005 1.958ms (+0.000ms): spin_lock (scheduler_tick)
00010006 1.958ms (+0.000ms): spin_lock (<00000000>)
00010006 1.958ms (+0.000ms): task_timeslice (scheduler_tick)
00010006 1.958ms (+0.000ms): __bitmap_weight (scheduler_tick)
00010006 1.958ms (+0.000ms): _spin_unlock (scheduler_tick)
00010005 1.959ms (+0.000ms): preempt_schedule (scheduler_tick)
00010005 1.959ms (+0.000ms): rebalance_tick (scheduler_tick)
00000006 1.959ms (+0.000ms): do_softirq (smp_apic_timer_interrupt)
00000006 1.959ms (+0.961ms): __do_softirq (do_softirq)
00010005 2.921ms (+0.000ms): do_nmi (flush_tlb_others)
00010005 2.922ms (+0.001ms): do_nmi (check_preempt_timing)
... looks like we got stuck for milliseconds but finally finished ...
00000006 4.959ms (+0.000ms): do_softirq (smp_apic_timer_interrupt)
00000006 4.959ms (+0.960ms): __do_softirq (do_softirq)
00010005 5.919ms (+0.000ms): do_nmi (flush_tlb_others)
00010005 5.920ms (+0.003ms): do_nmi (__trace)
00010005 5.924ms (+0.000ms): notifier_call_chain (default_do_nmi)
00010005 5.924ms (+0.000ms): profile_tick (nmi_watchdog_tick)
00010005 5.924ms (+0.000ms): profile_hook (profile_tick)
00010005 5.924ms (+0.000ms): read_lock (profile_hook)
00010006 5.924ms (+0.000ms): read_lock (<00000000>)
00010006 5.925ms (+0.000ms): notifier_call_chain (profile_hook)
00010006 5.925ms (+0.000ms): _read_unlock (profile_tick)
00010005 5.925ms (+0.000ms): preempt_schedule (profile_tick)
00010005 5.925ms (+0.027ms): profile_hit (nmi_watchdog_tick)
00000005 5.952ms (+0.000ms): smp_apic_timer_interrupt (flush_tlb_others)
00010005 5.953ms (+0.000ms): profile_tick (smp_apic_timer_interrupt)
00010005 5.953ms (+0.000ms): profile_hook (profile_tick)
00010005 5.953ms (+0.000ms): read_lock (profile_hook)
00010006 5.953ms (+0.000ms): read_lock (<00000000>)
00010006 5.953ms (+0.000ms): notifier_call_chain (profile_hook)
00010006 5.953ms (+0.000ms): _read_unlock (profile_tick)
00010005 5.954ms (+0.000ms): preempt_schedule (profile_tick)
00010005 5.954ms (+0.000ms): profile_hit (smp_apic_timer_interrupt)
00010005 5.954ms (+0.000ms): update_process_times (smp_apic_timer_interrupt)
00010005 5.954ms (+0.000ms): update_one_process (update_process_times)
00010005 5.954ms (+0.000ms): run_local_timers (update_process_times)
00010005 5.955ms (+0.000ms): raise_softirq (update_process_times)
00010005 5.955ms (+0.000ms): scheduler_tick (update_process_times)
00010005 5.955ms (+0.000ms): sched_clock (scheduler_tick)
00010005 5.955ms (+0.000ms): spin_lock (scheduler_tick)
00010006 5.956ms (+0.000ms): spin_lock (<00000000>)
00010006 5.956ms (+0.000ms): task_timeslice (scheduler_tick)
00010006 5.956ms (+0.000ms): __bitmap_weight (scheduler_tick)
00010006 5.956ms (+0.000ms): _spin_unlock (scheduler_tick)
00010005 5.956ms (+0.000ms): preempt_schedule (scheduler_tick)
00010005 5.957ms (+0.000ms): rebalance_tick (scheduler_tick)
00000006 5.957ms (+0.000ms): do_softirq (smp_apic_timer_interrupt)
00000006 5.957ms (+0.817ms): __do_softirq (do_softirq)
00000005 6.774ms (+0.000ms): _spin_unlock (flush_tlb_others)
00000004 6.775ms (+0.000ms): preempt_schedule (flush_tlb_others)
00000003 6.775ms (+0.000ms): preempt_schedule (flush_tlb_page)
00000003 6.775ms (+0.000ms): kunmap_atomic (page_referenced_one)
00000002 6.775ms (+0.000ms): preempt_schedule (page_referenced_one)
00000002 6.776ms (+0.000ms): _spin_unlock (page_referenced_one)
00000001 6.776ms (+0.000ms): preempt_schedule (page_referenced_one)
00000001 6.776ms (+0.000ms): _spin_unlock (page_referenced_file)
00000001 6.777ms (+0.000ms): sub_preempt_count (_spin_unlock)
00000001 6.777ms (+0.000ms): update_max_trace (check_preempt_timing)

Spin Lock/nmi
=============

May be an artifact of the test configuration

preemption latency trace v1.0.7 on 2.6.9-rc1-bk12-VP-R8-nd
-------------------------------------------------------
latency: 244 us, entries: 17 (17) | [VP:1 KP:1 SP:1 HP:1 #CPUS:2]
-----------------
| task: fam/3212, uid:0 nice:0 policy:0 rt_prio:0
-----------------
=> started at: __spin_lock_irqsave+0x2b/0xb0
=> ended at: _spin_unlock_irqrestore+0x32/0x70
=======>
00000001 0.000ms (+0.000ms): __spin_lock_irqsave (avc_has_perm_noaudit)
00000001 0.000ms (+0.000ms): __spin_lock_irqsave (<00000000>)
00000001 0.000ms (+0.069ms): avc_insert (avc_has_perm_noaudit)
00010001 0.069ms (+0.000ms): do_nmi (avc_insert)
00010001 0.070ms (+0.002ms): do_nmi (__mcount)
00010001 0.072ms (+0.000ms): notifier_call_chain (default_do_nmi)
00010001 0.073ms (+0.000ms): profile_tick (nmi_watchdog_tick)
00010001 0.073ms (+0.000ms): profile_hook (profile_tick)
00010001 0.073ms (+0.000ms): read_lock (profile_hook)
00010002 0.073ms (+0.000ms): read_lock (<00000000>)
00010002 0.073ms (+0.000ms): notifier_call_chain (profile_hook)
00010002 0.074ms (+0.000ms): _read_unlock (profile_tick)
00010001 0.074ms (+0.168ms): profile_hit (nmi_watchdog_tick)
00000001 0.242ms (+0.000ms): memcpy (avc_has_perm_noaudit)
00000001 0.243ms (+0.002ms): _spin_unlock_irqrestore (avc_has_perm_noaudit)
00000001 0.245ms (+0.000ms): sub_preempt_count (_spin_unlock_irqrestore)
00000001 0.245ms (+0.000ms): update_max_trace (check_preempt_timing)

Do_IRQ
======

Perhaps an unfortunate series of nested IRQ's?
Starts like this...

preemption latency trace v1.0.7 on 2.6.9-rc1-bk12-VP-R8-nd
-------------------------------------------------------
latency: 240 us, entries: 380 (380) | [VP:1 KP:1 SP:1 HP:1 #CPUS:2]
-----------------
| task: cpu_burn/14061, uid:0 nice:10 policy:0 rt_prio:0
-----------------
=> started at: do_IRQ+0x19/0x1a0
=> ended at: do_IRQ+0x14a/0x1a0
=======>
00010000 0.000ms (+0.000ms): do_IRQ (common_interrupt)
00010000 0.000ms (+0.000ms): do_IRQ (<08048340>)
00010000 0.000ms (+0.000ms): do_IRQ (<0000000a>)
00010000 0.000ms (+0.000ms): spin_lock (do_IRQ)
00010001 0.000ms (+0.000ms): spin_lock (<00000000>)
00010001 0.001ms (+0.000ms): mask_and_ack_level_ioapic_irq (do_IRQ)
00010001 0.001ms (+0.000ms): mask_IO_APIC_irq (mask_and_ack_level_ioapic_irq)
00010001 0.001ms (+0.000ms): __spin_lock_irqsave (mask_IO_APIC_irq)
00010002 0.002ms (+0.000ms): __spin_lock_irqsave (<00000000>)
00010002 0.002ms (+0.000ms): __mask_IO_APIC_irq (mask_IO_APIC_irq)
00010002 0.002ms (+0.013ms): __modify_IO_APIC_irq (__mask_IO_APIC_irq)
00010002 0.016ms (+0.000ms): _spin_unlock_irqrestore (mask_IO_APIC_irq)
00010001 0.016ms (+0.000ms): generic_redirect_hardirq (do_IRQ)
00010001 0.016ms (+0.000ms): _spin_unlock (do_IRQ)
00010000 0.017ms (+0.000ms): generic_handle_IRQ_event (do_IRQ)
00010000 0.017ms (+0.001ms): usb_hcd_irq (generic_handle_IRQ_event)
00010000 0.019ms (+0.001ms): uhci_irq (usb_hcd_irq)
00010000 0.021ms (+0.000ms): spin_lock (uhci_irq)
00010001 0.021ms (+0.000ms): spin_lock (<00000000>)
00010001 0.021ms (+0.001ms): uhci_get_current_frame_number (uhci_irq)
00010001 0.023ms (+0.000ms): uhci_free_pending_qhs (uhci_irq)
00010001 0.023ms (+0.001ms): uhci_free_pending_tds (uhci_irq)
00010001 0.024ms (+0.001ms): uhci_remove_pending_urbps (uhci_irq)
00010001 0.025ms (+0.000ms): uhci_transfer_result (uhci_irq)
00010001 0.026ms (+0.000ms): spin_lock (uhci_transfer_result)
...
00020005 0.134ms (+0.000ms): do_IRQ (_spin_unlock_irqrestore)
00020005 0.134ms (+0.000ms): do_IRQ (<00000000>)
00020005 0.134ms (+0.000ms): spin_lock (do_IRQ)
00020006 0.134ms (+0.000ms): spin_lock (<00000000>)
00020006 0.134ms (+0.000ms): ack_edge_ioapic_irq (do_IRQ)
00020006 0.135ms (+0.000ms): generic_redirect_hardirq (do_IRQ)
00020006 0.135ms (+0.000ms): _spin_unlock (do_IRQ)
00020005 0.135ms (+0.000ms): preempt_schedule (do_IRQ)
00020005 0.135ms (+0.000ms): generic_handle_IRQ_event (do_IRQ)
00020005 0.136ms (+0.000ms): timer_interrupt (generic_handle_IRQ_event)
... exits like this ...
00010001 0.220ms (+0.000ms): generic_note_interrupt (do_IRQ)
00010001 0.220ms (+0.000ms): end_level_ioapic_irq (do_IRQ)
00010001 0.220ms (+0.000ms): unmask_IO_APIC_irq (do_IRQ)
00010001 0.221ms (+0.000ms): __spin_lock_irqsave (unmask_IO_APIC_irq)
00010002 0.221ms (+0.000ms): __spin_lock_irqsave (<00000000>)
00010002 0.221ms (+0.000ms): __unmask_IO_APIC_irq (unmask_IO_APIC_irq)
00010002 0.221ms (+0.013ms): __modify_IO_APIC_irq (__unmask_IO_APIC_irq)
00010002 0.235ms (+0.000ms): _spin_unlock_irqrestore (unmask_IO_APIC_irq)
00010001 0.235ms (+0.000ms): preempt_schedule (unmask_IO_APIC_irq)
00010001 0.235ms (+0.000ms): _spin_unlock (do_IRQ)
00010000 0.236ms (+0.000ms): preempt_schedule (do_IRQ)
00000001 0.236ms (+0.000ms): do_softirq (do_IRQ)
00000001 0.236ms (+0.000ms): __do_softirq (do_softirq)
00000001 0.236ms (+0.000ms): wake_up_process (do_softirq)
00000001 0.237ms (+0.000ms): try_to_wake_up (wake_up_process)
00000001 0.237ms (+0.000ms): task_rq_lock (try_to_wake_up)
00000001 0.237ms (+0.000ms): spin_lock (task_rq_lock)
00000002 0.237ms (+0.000ms): spin_lock (<00000000>)
00000002 0.237ms (+0.000ms): wake_idle (try_to_wake_up)
00000002 0.238ms (+0.000ms): activate_task (try_to_wake_up)
00000002 0.238ms (+0.000ms): sched_clock (activate_task)
00000002 0.238ms (+0.000ms): recalc_task_prio (activate_task)
00000002 0.238ms (+0.000ms): effective_prio (recalc_task_prio)
00000002 0.239ms (+0.000ms): __activate_task (try_to_wake_up)
00000002 0.239ms (+0.000ms): enqueue_task (__activate_task)
00000002 0.239ms (+0.000ms): sched_info_queued (enqueue_task)
00000002 0.239ms (+0.000ms): resched_task (try_to_wake_up)
00000002 0.240ms (+0.000ms): task_rq_unlock (try_to_wake_up)
00000002 0.240ms (+0.000ms): _spin_unlock_irqrestore (try_to_wake_up)
00000001 0.240ms (+0.000ms): preempt_schedule (try_to_wake_up)
00000001 0.241ms (+0.000ms): sub_preempt_count (do_IRQ)
00000001 0.242ms (+0.000ms): update_max_trace (check_preempt_timing)

I will send the full traces separately (off list).

--Mark

2004-09-10 18:01:43

by K.R. Foley

[permalink] [raw]
Subject: Re: [patch] voluntary-preempt-2.6.9-rc1-bk4-R1

K.R. Foley wrote:
> Ingo Molnar wrote:
>
>> * [email protected] <[email protected]> wrote:
>>
>>
>>> If you look at the date / time of the traces, you will notice that
>>> most occur in the latter part of the test. This is during the "disk
>>> copy" and "disk read" parts of the testing. [...]
>>
>>
>>
>> would it be possible to test with DMA disabled? (hdparm -d0 /dev/hda) It
>> might take some extra work to shun the extra latency reports from the
>> PIO IDE path (which is quite slow) but once that is done you should be
>> able to see whether these long 0.5 msec delays remain even if all (most)
>> DMA activity has been eliminated.
>>
>>
>>> preemption latency trace v1.0.5 on 2.6.9-rc1-VP-R1
>>> --------------------------------------------------
>>> latency: 550 us, entries: 6 (6)
>>> -----------------
>>> | task: cat/6771, uid:0 nice:0 policy:0 rt_prio:0
>>> -----------------
>>> => started at: kmap_atomic+0x23/0xe0
>>> => ended at: kunmap_atomic+0x7b/0xa0
>>> =======>
>>> 00000001 0.000ms (+0.000ms): kmap_atomic (file_read_actor)
>>> 00000001 0.000ms (+0.000ms): page_address (file_read_actor)
>>> 00000001 0.000ms (+0.549ms): __copy_to_user_ll (file_read_actor)
>>> 00000001 0.550ms (+0.000ms): kunmap_atomic (file_read_actor)
>>> 00000001 0.550ms (+0.000ms): sub_preempt_count (kunmap_atomic)
>>> 00000001 0.550ms (+0.000ms): update_max_trace (check_preempt_timing)
>>
>>
>>
>> this is a full page copy, from userspace into a kernelspace pagecache
>> page. This shouldnt take 500 usecs on any hardware. Since this is a
>> single instruction (memcpy's rep; movsl instruction) there's nothing
>> that Linux can do to avoid (or even to cause) such a situation.
>
>
> I saw this one (or one very similar) on a system that I just started
> testing on today. Not quite as high (~219 usec if I remember
> correctly). I don't have access to the system from here. I will forward
> the trace tomorrow when I'm there. However, I haven't seen this
> on my slower system running the same stress tests. There are several
> possible points of interest:
>
> System I saw this on:
> P4 2.4GHz or 3.0GHz
> 2GB memory
> 2.6.9-rc1-bk12-S0 built for SMP (even though hyperthreading is off
> currently)

Actual system info for the above is 2.4GHz with 512MB of memory.

>
> System I haven't seen this on:
> PII 450
> 256MB memory
> 2.6.9-rc1-bk12-R6 built for UP
>
> Sorry I don't have more complete data in front of me. I will send the
> concrete info tomorrow with the trace.
>

As promised here is the trace:

http://www.cybsft.com/testresults/2.6.9-rc1-bk12-S0/trace2.txt

I also have some other traces from this system that I have not seen
before on my slower system. For instance this one where we spend ~204
usec in __spin_lock_irqsave:

http://www.cybsft.com/testresults/2.6.9-rc1-bk12-S0/trace1.txt

Or this one where we spend ~203 usec in sched_clock. That just doesn't
seem possible.

http://www.cybsft.com/testresults/2.6.9-rc1-bk12-S0/trace4.txt

kr

2004-09-10 19:28:13

by Ingo Molnar

[permalink] [raw]
Subject: Re: [patch] voluntary-preempt-2.6.9-rc1-bk4-R1


* K.R. Foley <[email protected]> wrote:

> I also have some other traces from this system that I have not seen
> before on my slower system. For instance this one where we spend ~204
> usec in __spin_lock_irqsave:
>
> http://www.cybsft.com/testresults/2.6.9-rc1-bk12-S0/trace1.txt
>
> Or this one where we spend ~203 usec in sched_clock. That just doesn't
> seem possible.
>
> http://www.cybsft.com/testresults/2.6.9-rc1-bk12-S0/trace4.txt

seems quite similar to Mark's IDE-DMA related hardware latencies. Does
reducing the DMA mode via hdparm reduce these latencies?

Ingo

2004-09-10 19:33:24

by Ingo Molnar

[permalink] [raw]
Subject: Re: [patch] voluntary-preempt-2.6.9-rc1-bk4-R1


* [email protected] <[email protected]> wrote:

> >does 'hdparm -X udma2 /dev/hda' work?
>
> Yes it does and quite well actually. For comparison
> -R1 on September 7 - over 100 traces > 500 usec
> -R8 on September 10 - 38 traces > 200 usec, only 3 > 500 usec
>
> This was with the full test suite (latencytest active, all different
> types of operations).

cool. Perhaps further reducing the DMA mode (to udma0, or even mdma
modes?) will further reduce these latencies?

most of your remaining latencies seem to be get_swap_page() related -
the attached (highly experimental) patch might fix that particular
latency. (ontop of -S0).

Ingo


Attachments:
swapspace-layout-improvements-2.6.9-rc1-bk12-A1 (7.87 kB)

2004-09-10 19:33:31

by K.R. Foley

[permalink] [raw]
Subject: Re: [patch] voluntary-preempt-2.6.9-rc1-bk4-R1

Ingo Molnar wrote:
> * K.R. Foley <[email protected]> wrote:
>
>
>>I also have some other traces from this system that I have not seen
>>before on my slower system. For instance this one where we spend ~204
>>usec in __spin_lock_irqsave:
>>
>>http://www.cybsft.com/testresults/2.6.9-rc1-bk12-S0/trace1.txt
>>
>>Or this one where we spend ~203 usec in sched_clock. That just doesn't
>>seem possible.
>>
>>http://www.cybsft.com/testresults/2.6.9-rc1-bk12-S0/trace4.txt
>
>
> seems quite similar to Mark's IDE-DMA related hardware latencies. Does
> reducing the DMA mode via hdparm reduce these latencies?
>
> Ingo
>
I will give that a try shortly.

kr

2004-09-13 14:51:54

by Mark_H_Johnson

[permalink] [raw]
Subject: Re: [patch] voluntary-preempt-2.6.9-rc1-bk4-R1

>most of your remaining latencies seem to be get_swap_page() related -
>the attached (highly experimental) patch might fix that particular
>latency. (ontop of -S0).

Qualitatively, -S0 with this patch is a little worse than what I saw with
-R8.
From the last message I sent...

> -R8 on September 10 - 38 traces > 200 usec, only 3 > 500 usec
-S0 on September 13 - 78 traces > 200 usec, only 3 > 500 usec

So I have more relatively small delays in the same test period - almost
twice as many, and few that I can see are swap related. Most are network
related. One very long trace > 70 msec - this is the third day I've had
a single multi-millisecond trace in a run.

Key contributors seem to be:
- IRQ sequence - 00 02 50 64 73 75 76 77
- modprobe - 01
- rtl8139_poll - 03 04 07 08 09 10 12 13 15 16 17 18 19 21 24 47
- 49 72 74
- tcp_prequeue_process - 05
- release_sock - 06 14
- tcp_copy_to_iovec - 11
- do_munmap - 20 [very long trace]
- twkill_work - 22
- exit_mmap - 23 25 26 27 28 29 30 31 32 36 37 38 39 40 41 44
- 46 48 51 52 53 54 55 57 58 59 60 61 62 66 67 68
- 69 70 71
- kswapd - 33 34 35 42 63
- avc_insert - 43 45 65
- schedule - 56

IRQ Sequence
============

Hard to describe - no specific steps that take a long time, just
a long series of activities, apparently IRQ related, that add up to
a > 300 usec time delay. Many I noted have mouse / signal related
activity; see also the "schedule" trace near the end for a similar
condition.

preemption latency trace v1.0.7 on 2.6.9-rc1-bk12-VP-S0
-------------------------------------------------------
latency: 342 us, entries: 280 (280) | [VP:1 KP:1 SP:1 HP:1 #CPUS:2]
-----------------
| task: kjournald/191, uid:0 nice:0 policy:0 rt_prio:0
-----------------
=> started at: __spin_lock_irqsave+0x2b/0xb0
=> ended at: _spin_unlock_irq+0x2e/0x60
=======>
[should look at the full traces for details]

modprobe
========

The shortest trace. I had run
system-config-soundcard
to play the test sound before starting the real time test program to
make sure the sound system ran OK. Not something I would expect to do
on a real time system, but something to fix (or document as a limitation).

preemption latency trace v1.0.7 on 2.6.9-rc1-bk12-VP-S0
-------------------------------------------------------
latency: 460 us, entries: 9 (9) | [VP:1 KP:1 SP:1 HP:1 #CPUS:2]
-----------------
| task: modprobe/2995, uid:0 nice:-10 policy:0 rt_prio:0
-----------------
=> started at: __spin_lock_irqsave+0x2b/0xb0
=> ended at: _spin_unlock_irq+0x2e/0x60
=======>
00000001 0.000ms (+0.000ms): __spin_lock_irqsave (spin_lock_irq)
00000001 0.000ms (+0.000ms): __spin_lock_irqsave (<00000000>)
00000001 0.000ms (+0.454ms): __find_symbol (resolve_symbol)
00000001 0.455ms (+0.000ms): use_module (resolve_symbol)
00000001 0.455ms (+0.003ms): already_uses (use_module)
00000001 0.459ms (+0.001ms): kmem_cache_alloc (use_module)
00000001 0.460ms (+0.000ms): _spin_unlock_irq (resolve_symbol)
00000001 0.460ms (+0.000ms): sub_preempt_count (_spin_unlock_irq)
00000001 0.461ms (+0.000ms): update_max_trace (check_preempt_timing)

rtl8139_poll
============

We've talked about this one before but it still shows up on a regular basis.

preemption latency trace v1.0.7 on 2.6.9-rc1-bk12-VP-S0
-------------------------------------------------------
latency: 347 us, entries: 245 (245) | [VP:1 KP:1 SP:1 HP:1 #CPUS:2]
-----------------
| task: ksoftirqd/0/3, uid:0 nice:-10 policy:0 rt_prio:0
-----------------
=> started at: spin_lock+0x24/0x90
=> ended at: rtl8139_rx+0x219/0x340
=======>
[again - no big steps, just a large number w/o reschedule opportunity]

tcp_prequeue_process
====================

preemption latency trace v1.0.7 on 2.6.9-rc1-bk12-VP-S0
-------------------------------------------------------
latency: 271 us, entries: 294 (294) | [VP:1 KP:1 SP:1 HP:1 #CPUS:2]
-----------------
| task: rcp/3591, uid:2711 nice:0 policy:0 rt_prio:0
-----------------
=> started at: tcp_prequeue_process+0x49/0xb0
=> ended at: tcp_write_xmit+0x249/0x330
=======>
[appears similar to the others - many steps, each one very short]

release_sock
============

preemption latency trace v1.0.7 on 2.6.9-rc1-bk12-VP-S0
-------------------------------------------------------
latency: 209 us, entries: 275 (275) | [VP:1 KP:1 SP:1 HP:1 #CPUS:2]
-----------------
| task: rcp/3705, uid:2711 nice:0 policy:0 rt_prio:0
-----------------
=> started at: __spin_lock_irqsave+0x2b/0xb0
=> ended at: cond_resched_softirq+0x65/0x90
=======>
00000001 0.000ms (+0.000ms): __spin_lock_irqsave (spin_lock_bh)
00000001 0.000ms (+0.000ms): __spin_lock_irqsave (<00000000>)
00000101 0.000ms (+0.000ms): __release_sock (release_sock)
00000101 0.000ms (+0.000ms): _spin_unlock (__release_sock)
... long sequence, not long in each step ...

tcp_copy_to_iovec
=================

preemption latency trace v1.0.7 on 2.6.9-rc1-bk12-VP-S0
-------------------------------------------------------
latency: 210 us, entries: 243 (243) | [VP:1 KP:1 SP:1 HP:1 #CPUS:2]
-----------------
| task: rcp/3894, uid:2711 nice:0 policy:0 rt_prio:0
-----------------
=> started at: tcp_copy_to_iovec+0x5b/0xb0
=> ended at: local_bh_enable+0x17/0xa0
=======>
00000100 0.000ms (+0.002ms): tcp_copy_to_iovec (tcp_rcv_established)
00000100 0.002ms (+0.005ms): tcp_event_data_recv (tcp_rcv_established)
00000100 0.007ms (+0.000ms): tcp_send_ack (tcp_rcv_established)
... large number of steps, each one very short in duration ...

do_munmap
=========

The very long trace (> 4000 entries, > 70 msec). The following lists all steps
until we get "stuck" with roughly 1 msec pauses. Note - I don't have the
"how we got out" part of this trace, but see previous messages for that
kind of detail.

preemption latency trace v1.0.7 on 2.6.9-rc1-bk12-VP-S0
-------------------------------------------------------
latency: 71065 us, entries: 4000 (5343) | [VP:1 KP:1 SP:1 HP:1 #CPUS:2]
-----------------
| task: latencytest/4408, uid:0 nice:0 policy:1 rt_prio:99
-----------------
=> started at: spin_lock+0x24/0x90
=> ended at: cond_resched_lock+0x1d/0xa0
=======>
00000001 0.000ms (+0.000ms): spin_lock (do_munmap)
00000001 0.000ms (+0.000ms): spin_lock (<00000000>)
00000001 0.000ms (+0.000ms): unmap_region (do_munmap)
00000001 0.001ms (+0.000ms): lru_add_drain (unmap_region)
00000002 0.001ms (+0.000ms): __pagevec_lru_add_active (lru_add_drain)
00000002 0.002ms (+0.000ms): spin_lock_irq (__pagevec_lru_add_active)
00000002 0.002ms (+0.000ms): __spin_lock_irqsave (spin_lock_irq)
00000003 0.002ms (+0.001ms): __spin_lock_irqsave (<00000000>)
00000003 0.003ms (+0.000ms): _spin_unlock_irq (__pagevec_lru_add_active)
00000002 0.003ms (+0.001ms): release_pages (__pagevec_lru_add_active)
00000001 0.004ms (+0.001ms): __bitmap_weight (unmap_region)
00000001 0.006ms (+0.001ms): unmap_vmas (unmap_region)
00000001 0.007ms (+0.000ms): unmap_page_range (unmap_vmas)
00000001 0.007ms (+0.000ms): zap_pmd_range (unmap_page_range)
00000001 0.008ms (+0.000ms): zap_pte_range (zap_pmd_range)
00000001 0.009ms (+0.000ms): kmap_atomic (zap_pte_range)
00000002 0.009ms (+0.001ms): page_address (zap_pte_range)
00000002 0.011ms (+0.000ms): set_page_dirty (zap_pte_range)
00000002 0.012ms (+0.000ms): page_remove_rmap (zap_pte_range)
00000002 0.013ms (+0.000ms): set_page_dirty (zap_pte_range)
00000002 0.013ms (+0.000ms): page_remove_rmap (zap_pte_range)
00000002 0.013ms (+0.000ms): set_page_dirty (zap_pte_range)
00000002 0.013ms (+0.000ms): page_remove_rmap (zap_pte_range)
00000002 0.014ms (+0.000ms): set_page_dirty (zap_pte_range)
00000002 0.014ms (+0.000ms): page_remove_rmap (zap_pte_range)
00000002 0.015ms (+0.000ms): set_page_dirty (zap_pte_range)
00000002 0.015ms (+0.000ms): page_remove_rmap (zap_pte_range)
00000002 0.015ms (+0.000ms): set_page_dirty (zap_pte_range)
00000002 0.016ms (+0.000ms): page_remove_rmap (zap_pte_range)
00000002 0.016ms (+0.000ms): set_page_dirty (zap_pte_range)
00000002 0.016ms (+0.001ms): page_remove_rmap (zap_pte_range)
00000002 0.017ms (+0.000ms): smp_apic_timer_interrupt (zap_pte_range)
00010002 0.018ms (+0.000ms): profile_tick (smp_apic_timer_interrupt)
00010002 0.018ms (+0.000ms): profile_hook (profile_tick)
00010002 0.018ms (+0.000ms): read_lock (profile_hook)
00010003 0.019ms (+0.000ms): read_lock (<00000000>)
00010003 0.019ms (+0.000ms): notifier_call_chain (profile_hook)
00010003 0.019ms (+0.000ms): _read_unlock (profile_tick)
00010002 0.020ms (+0.000ms): profile_hit (smp_apic_timer_interrupt)
00010002 0.021ms (+0.000ms): update_process_times (smp_apic_timer_interrupt)
00010002 0.021ms (+0.000ms): update_one_process (update_process_times)
00010002 0.022ms (+0.000ms): run_local_timers (update_process_times)
00010002 0.022ms (+0.000ms): raise_softirq (update_process_times)
00010002 0.023ms (+0.000ms): scheduler_tick (update_process_times)
00010002 0.023ms (+0.002ms): sched_clock (scheduler_tick)
00010002 0.026ms (+0.000ms): spin_lock (scheduler_tick)
00010003 0.027ms (+0.000ms): spin_lock (<00000000>)
00010003 0.027ms (+0.000ms): _spin_unlock (scheduler_tick)
00010002 0.027ms (+0.001ms): rebalance_tick (scheduler_tick)
00000003 0.028ms (+0.000ms): do_softirq (smp_apic_timer_interrupt)
00000003 0.029ms (+0.001ms): __do_softirq (do_softirq)
00000002 0.030ms (+0.000ms): set_page_dirty (zap_pte_range)
00000002 0.031ms (+0.000ms): page_remove_rmap (zap_pte_range)
00000002 0.031ms (+0.001ms): kunmap_atomic (zap_pte_range)
00000001 0.032ms (+0.002ms): flush_tlb_mm (unmap_vmas)
00000002 0.034ms (+0.000ms): flush_tlb_others (flush_tlb_mm)
00000002 0.035ms (+0.000ms): spin_lock (flush_tlb_others)
00000003 0.035ms (+0.000ms): spin_lock (<00000000>)
00000003 0.036ms (+0.000ms): send_IPI_mask (flush_tlb_others)
00000003 0.037ms (+0.158ms): send_IPI_mask_bitmask (flush_tlb_others)
00010003 0.195ms (+0.000ms): do_nmi (flush_tlb_others)
00010003 0.195ms (+0.002ms): do_nmi (__trace)
00010003 0.198ms (+0.000ms): notifier_call_chain (default_do_nmi)
00010003 0.199ms (+0.000ms): profile_tick (nmi_watchdog_tick)
00010003 0.199ms (+0.000ms): profile_hook (profile_tick)
00010003 0.199ms (+0.000ms): read_lock (profile_hook)
00010004 0.199ms (+0.000ms): read_lock (<00000000>)
00010004 0.199ms (+0.000ms): notifier_call_chain (profile_hook)
00010004 0.200ms (+0.000ms): _read_unlock (profile_tick)
00010003 0.200ms (+0.001ms): profile_hit (nmi_watchdog_tick)
00010003 0.202ms (+0.000ms): do_IRQ (flush_tlb_others)
00010003 0.202ms (+0.000ms): do_IRQ (<00000000>)
00010003 0.202ms (+0.000ms): spin_lock (do_IRQ)
00010004 0.203ms (+0.000ms): spin_lock (<00000000>)
00010004 0.203ms (+0.000ms): ack_edge_ioapic_irq (do_IRQ)
00010004 0.203ms (+0.000ms): redirect_hardirq (do_IRQ)
00010004 0.204ms (+0.000ms): _spin_unlock (do_IRQ)
00010003 0.204ms (+0.000ms): handle_IRQ_event (do_IRQ)
00010003 0.204ms (+0.000ms): timer_interrupt (handle_IRQ_event)
00010003 0.204ms (+0.000ms): spin_lock (timer_interrupt)
00010004 0.205ms (+0.000ms): spin_lock (<00000000>)
00010004 0.205ms (+0.000ms): mark_offset_tsc (timer_interrupt)
00010004 0.205ms (+0.000ms): mark_offset_tsc (timer_interrupt)
00010004 0.206ms (+0.000ms): spin_lock (mark_offset_tsc)
00010005 0.206ms (+0.000ms): spin_lock (<00000000>)
00010005 0.206ms (+0.000ms): mark_offset_tsc (timer_interrupt)
00010005 0.206ms (+0.000ms): mark_offset_tsc (timer_interrupt)
00010005 0.206ms (+0.000ms): spin_lock (mark_offset_tsc)
00010006 0.207ms (+0.000ms): spin_lock (<00000000>)
00010006 0.207ms (+0.003ms): mark_offset_tsc (timer_interrupt)
00010006 0.211ms (+0.004ms): mark_offset_tsc (timer_interrupt)
00010006 0.215ms (+0.002ms): mark_offset_tsc (timer_interrupt)
00010006 0.217ms (+0.000ms): mark_offset_tsc (timer_interrupt)
00010006 0.217ms (+0.000ms): mark_offset_tsc (timer_interrupt)
00010006 0.217ms (+0.000ms): _spin_unlock (mark_offset_tsc)
00010005 0.218ms (+0.000ms): mark_offset_tsc (timer_interrupt)
00010005 0.218ms (+0.000ms): mark_offset_tsc (timer_interrupt)
00010005 0.218ms (+0.000ms): mark_offset_tsc (timer_interrupt)
00010005 0.219ms (+0.000ms): _spin_unlock (mark_offset_tsc)
00010004 0.219ms (+0.000ms): timer_interrupt (handle_IRQ_event)
00010004 0.219ms (+0.000ms): spin_lock (timer_interrupt)
00010005 0.220ms (+0.003ms): spin_lock (<00000000>)
00010005 0.223ms (+0.000ms): _spin_unlock (timer_interrupt)
00010004 0.224ms (+0.000ms): do_timer (timer_interrupt)
00010004 0.224ms (+0.000ms): update_wall_time (do_timer)
00010004 0.225ms (+0.000ms): update_wall_time_one_tick (update_wall_time)
00010004 0.225ms (+0.000ms): _spin_unlock (timer_interrupt)
00010003 0.226ms (+0.000ms): spin_lock (do_IRQ)
00010004 0.226ms (+0.000ms): spin_lock (<00000000>)
00010004 0.226ms (+0.000ms): note_interrupt (do_IRQ)
00010004 0.227ms (+0.000ms): end_edge_ioapic_irq (do_IRQ)
00010004 0.227ms (+0.000ms): _spin_unlock (do_IRQ)
00000004 0.227ms (+0.000ms): do_softirq (do_IRQ)
00000004 0.227ms (+0.788ms): __do_softirq (do_softirq)
00000003 1.016ms (+0.000ms): smp_apic_timer_interrupt (flush_tlb_others)
00010003 1.016ms (+0.000ms): profile_tick (smp_apic_timer_interrupt)
00010003 1.017ms (+0.000ms): profile_hook (profile_tick)
00010003 1.017ms (+0.000ms): read_lock (profile_hook)
00010004 1.017ms (+0.000ms): read_lock (<00000000>)
00010004 1.017ms (+0.000ms): notifier_call_chain (profile_hook)
00010004 1.017ms (+0.000ms): _read_unlock (profile_tick)
00010003 1.018ms (+0.000ms): profile_hit (smp_apic_timer_interrupt)
00010003 1.018ms (+0.000ms): update_process_times (smp_apic_timer_interrupt)
00010003 1.018ms (+0.000ms): update_one_process (update_process_times)
00010003 1.018ms (+0.000ms): run_local_timers (update_process_times)
00010003 1.018ms (+0.000ms): raise_softirq (update_process_times)
00010003 1.019ms (+0.000ms): scheduler_tick (update_process_times)
00010003 1.019ms (+0.000ms): sched_clock (scheduler_tick)
00010003 1.019ms (+0.000ms): spin_lock (scheduler_tick)
00010004 1.019ms (+0.000ms): spin_lock (<00000000>)
00010004 1.019ms (+0.000ms): _spin_unlock (scheduler_tick)
00010003 1.020ms (+0.000ms): rebalance_tick (scheduler_tick)
00000004 1.020ms (+0.000ms): do_softirq (smp_apic_timer_interrupt)
00000004 1.020ms (+0.173ms): __do_softirq (do_softirq)
00010003 1.193ms (+0.000ms): do_nmi (flush_tlb_others)
00010003 1.194ms (+0.001ms): do_nmi (__trace)
00010003 1.195ms (+0.000ms): notifier_call_chain (default_do_nmi)
00010003 1.196ms (+0.000ms): profile_tick (nmi_watchdog_tick)
00010003 1.196ms (+0.000ms): profile_hook (profile_tick)
00010003 1.196ms (+0.000ms): read_lock (profile_hook)
00010004 1.196ms (+0.000ms): read_lock (<00000000>)
00010004 1.196ms (+0.000ms): notifier_call_chain (profile_hook)
00010004 1.197ms (+0.000ms): _read_unlock (profile_tick)
00010003 1.197ms (+0.000ms): profile_hit (nmi_watchdog_tick)
00010003 1.198ms (+0.000ms): do_IRQ (flush_tlb_others)
00010003 1.198ms (+0.000ms): do_IRQ (<00000000>)
00010003 1.198ms (+0.000ms): spin_lock (do_IRQ)
00010004 1.198ms (+0.000ms): spin_lock (<00000000>)
00010004 1.198ms (+0.000ms): ack_edge_ioapic_irq (do_IRQ)
00010004 1.199ms (+0.000ms): redirect_hardirq (do_IRQ)
00010004 1.199ms (+0.000ms): _spin_unlock (do_IRQ)
00010003 1.199ms (+0.000ms): handle_IRQ_event (do_IRQ)
00010003 1.199ms (+0.000ms): timer_interrupt (handle_IRQ_event)
00010003 1.199ms (+0.000ms): spin_lock (timer_interrupt)
00010004 1.200ms (+0.000ms): spin_lock (<00000000>)
00010004 1.200ms (+0.000ms): mark_offset_tsc (timer_interrupt)
00010004 1.200ms (+0.000ms): mark_offset_tsc (timer_interrupt)
00010004 1.200ms (+0.000ms): spin_lock (mark_offset_tsc)
00010005 1.200ms (+0.000ms): spin_lock (<00000000>)
00010005 1.201ms (+0.000ms): mark_offset_tsc (timer_interrupt)
00010005 1.201ms (+0.000ms): mark_offset_tsc (timer_interrupt)
00010005 1.201ms (+0.000ms): spin_lock (mark_offset_tsc)
00010006 1.201ms (+0.000ms): spin_lock (<00000000>)
00010006 1.201ms (+0.003ms): mark_offset_tsc (timer_interrupt)
00010006 1.205ms (+0.004ms): mark_offset_tsc (timer_interrupt)
00010006 1.209ms (+0.002ms): mark_offset_tsc (timer_interrupt)
00010006 1.212ms (+0.000ms): mark_offset_tsc (timer_interrupt)
00010006 1.212ms (+0.000ms): mark_offset_tsc (timer_interrupt)
00010006 1.212ms (+0.000ms): _spin_unlock (mark_offset_tsc)
00010005 1.212ms (+0.000ms): mark_offset_tsc (timer_interrupt)
00010005 1.213ms (+0.000ms): mark_offset_tsc (timer_interrupt)
00010005 1.213ms (+0.000ms): mark_offset_tsc (timer_interrupt)
00010005 1.213ms (+0.000ms): _spin_unlock (mark_offset_tsc)
00010004 1.213ms (+0.000ms): timer_interrupt (handle_IRQ_event)
00010004 1.213ms (+0.000ms): spin_lock (timer_interrupt)
00010005 1.214ms (+0.003ms): spin_lock (<00000000>)
00010005 1.217ms (+0.000ms): _spin_unlock (timer_interrupt)
00010004 1.218ms (+0.000ms): do_timer (timer_interrupt)
00010004 1.218ms (+0.000ms): update_wall_time (do_timer)
00010004 1.218ms (+0.000ms): update_wall_time_one_tick (update_wall_time)
00010004 1.218ms (+0.000ms): _spin_unlock (timer_interrupt)
00010003 1.219ms (+0.000ms): spin_lock (do_IRQ)
00010004 1.219ms (+0.000ms): spin_lock (<00000000>)
00010004 1.219ms (+0.000ms): note_interrupt (do_IRQ)
00010004 1.219ms (+0.000ms): end_edge_ioapic_irq (do_IRQ)
00010004 1.219ms (+0.000ms): _spin_unlock (do_IRQ)
00000004 1.220ms (+0.000ms): do_softirq (do_IRQ)
00000004 1.220ms (+0.795ms): __do_softirq (do_softirq)
00000003 2.016ms (+0.000ms): smp_apic_timer_interrupt (flush_tlb_others)
... the steps over the last millisecond get repeated for several cycles ...

twkill_work
===========

A very long sequence (> 1500 traces) with non zero preempt values
that end in "00". Does that mean we did not have any locks but could
not schedule for other reasons?

preemption latency trace v1.0.7 on 2.6.9-rc1-bk12-VP-S0
-------------------------------------------------------
latency: 429 us, entries: 1531 (1531) | [VP:1 KP:1 SP:1 HP:1 #CPUS:2]
-----------------
| task: events/0/6, uid:0 nice:-10 policy:0 rt_prio:0
-----------------
=> started at: __spin_lock_irqsave+0x2b/0xb0
=> ended at: twkill_work+0xb5/0xe0
=======>
00000001 0.000ms (+0.000ms): __spin_lock_irqsave (spin_lock_bh)
00000001 0.000ms (+0.000ms): __spin_lock_irqsave (<00000000>)
00000101 0.000ms (+0.000ms): tcp_do_twkill_work (twkill_work)
00000101 0.000ms (+0.000ms): _spin_unlock (tcp_do_twkill_work)
00000100 0.001ms (+0.000ms): tcp_timewait_kill (tcp_do_twkill_work)
00000100 0.001ms (+0.000ms): write_lock (tcp_timewait_kill)
00000101 0.001ms (+0.000ms): write_lock (<00000000>)
00000101 0.002ms (+0.000ms): _write_unlock (tcp_timewait_kill)
00000100 0.002ms (+0.000ms): spin_lock (tcp_timewait_kill)
00000101 0.002ms (+0.000ms): spin_lock (<00000000>)
00000101 0.002ms (+0.000ms): tcp_bucket_destroy (tcp_timewait_kill)
00000101 0.003ms (+0.000ms): kmem_cache_free (tcp_bucket_destroy)
00000101 0.003ms (+0.000ms): _spin_unlock (tcp_timewait_kill)
00000100 0.003ms (+0.000ms): kmem_cache_free (tcp_do_twkill_work)
00000100 0.003ms (+0.000ms): spin_lock (tcp_do_twkill_work)
00000101 0.004ms (+0.000ms): spin_lock (<00000000>)
00000101 0.004ms (+0.000ms): _spin_unlock (tcp_do_twkill_work)
00000100 0.004ms (+0.000ms): tcp_timewait_kill (tcp_do_twkill_work)
00000100 0.004ms (+0.000ms): write_lock (tcp_timewait_kill)
...

exit_mmap
=========

This one has a relatively low time delay for clear_page_tables, but
several other traces have 100 to 200 usec delays at this step. [though
if you add the delay from handling the nmi_watchdog, this one falls in
the 100-200 usec range as well]

preemption latency trace v1.0.7 on 2.6.9-rc1-bk12-VP-S0
-------------------------------------------------------
latency: 266 us, entries: 146 (146) | [VP:1 KP:1 SP:1 HP:1 #CPUS:2]
-----------------
| task: get_ltrace.sh/5099, uid:0 nice:-9 policy:0 rt_prio:0
-----------------
=> started at: spin_lock+0x24/0x90
=> ended at: _spin_unlock+0x2d/0x60
=======>
00000001 0.000ms (+0.000ms): spin_lock (cond_resched_lock)
00000001 0.000ms (+0.000ms): spin_lock (<00000000>)
00000001 0.000ms (+0.000ms): __bitmap_weight (unmap_vmas)
00000001 0.000ms (+0.001ms): vm_acct_memory (exit_mmap)
00000001 0.001ms (+0.056ms): clear_page_tables (exit_mmap)
00010001 0.058ms (+0.000ms): do_nmi (clear_page_tables)
00010001 0.058ms (+0.002ms): do_nmi (__mcount)
00010001 0.060ms (+0.000ms): notifier_call_chain (default_do_nmi)
00010001 0.061ms (+0.000ms): profile_tick (nmi_watchdog_tick)
00010001 0.061ms (+0.000ms): profile_hook (profile_tick)
00010001 0.061ms (+0.000ms): read_lock (profile_hook)
00010002 0.061ms (+0.000ms): read_lock (<00000000>)
00010002 0.061ms (+0.000ms): notifier_call_chain (profile_hook)
00010002 0.062ms (+0.000ms): _read_unlock (profile_tick)
00010001 0.062ms (+0.091ms): profile_hit (nmi_watchdog_tick)
00010001 0.153ms (+0.000ms): do_IRQ (clear_page_tables)
00010001 0.153ms (+0.000ms): do_IRQ (<0000000a>)
00010001 0.153ms (+0.000ms): spin_lock (do_IRQ)
00010002 0.154ms (+0.000ms): spin_lock (<00000000>)
00010002 0.154ms (+0.000ms): mask_and_ack_level_ioapic_irq (do_IRQ)
00010002 0.154ms (+0.000ms): mask_IO_APIC_irq (mask_and_ack_level_ioapic_irq)
00010002 0.154ms (+0.000ms): __spin_lock_irqsave (mask_IO_APIC_irq)
00010003 0.154ms (+0.000ms): __spin_lock_irqsave (<00000000>)
00010003 0.154ms (+0.000ms): __mask_IO_APIC_irq (mask_IO_APIC_irq)
00010003 0.155ms (+0.018ms): __modify_IO_APIC_irq (__mask_IO_APIC_irq)
00010003 0.173ms (+0.000ms): _spin_unlock_irqrestore (mask_IO_APIC_irq)
00010002 0.174ms (+0.000ms): redirect_hardirq (do_IRQ)
00010002 0.174ms (+0.000ms): _spin_unlock (do_IRQ)
00010001 0.174ms (+0.000ms): handle_IRQ_event (do_IRQ)
00010001 0.174ms (+0.000ms): usb_hcd_irq (handle_IRQ_event)
...

kswapd
======

preemption latency trace v1.0.7 on 2.6.9-rc1-bk12-VP-S0
-------------------------------------------------------
latency: 557 us, entries: 840 (840) | [VP:1 KP:1 SP:1 HP:1 #CPUS:2]
-----------------
| task: kswapd0/40, uid:0 nice:0 policy:0 rt_prio:0
-----------------
=> started at: spin_lock+0x24/0x90
=> ended at: _spin_unlock+0x2d/0x60
=======>
00000001 0.000ms (+0.000ms): spin_lock (page_referenced_file)
00000001 0.000ms (+0.000ms): spin_lock (<00000000>)
00000001 0.000ms (+0.000ms): vma_prio_tree_next (page_referenced_file)
00000001 0.001ms (+0.002ms): prio_tree_first (vma_prio_tree_next)
00000001 0.004ms (+0.002ms): prio_tree_left (prio_tree_first)
00000001 0.006ms (+0.000ms): page_referenced_one (page_referenced_file)
00000001 0.007ms (+0.000ms): spin_lock (page_referenced_one)
00000002 0.007ms (+0.001ms): spin_lock (<00000000>)
00000002 0.009ms (+0.001ms): kmap_atomic (page_referenced_one)
00000003 0.011ms (+0.000ms): page_address (page_referenced_one)
00000003 0.012ms (+0.000ms): kunmap_atomic (page_referenced_one)
00000002 0.012ms (+0.000ms): _spin_unlock (page_referenced_one)
00000001 0.012ms (+0.000ms): vma_prio_tree_next (page_referenced_file)
00000001 0.013ms (+0.000ms): page_referenced_one (page_referenced_file)
00000001 0.013ms (+0.003ms): spin_lock (page_referenced_one)
00000002 0.017ms (+0.003ms): spin_lock (<00000000>)
00000002 0.020ms (+0.000ms): kmap_atomic (page_referenced_one)

avc_insert
==========

preemption latency trace v1.0.7 on 2.6.9-rc1-bk12-VP-S0
-------------------------------------------------------
latency: 266 us, entries: 80 (80) | [VP:1 KP:1 SP:1 HP:1 #CPUS:2]
-----------------
| task: fam/2565, uid:0 nice:0 policy:0 rt_prio:0
-----------------
=> started at: __spin_lock_irqsave+0x2b/0xb0
=> ended at: _spin_unlock_irqrestore+0x32/0x70
=======>
00000001 0.000ms (+0.000ms): __spin_lock_irqsave (avc_has_perm_noaudit)
00000001 0.000ms (+0.000ms): __spin_lock_irqsave (<00000000>)
00000001 0.000ms (+0.204ms): avc_insert (avc_has_perm_noaudit)
00000001 0.204ms (+0.000ms): memcpy (avc_has_perm_noaudit)
00000001 0.204ms (+0.001ms): _spin_unlock_irqrestore (avc_has_perm_noaudit)
00010001 0.205ms (+0.000ms): do_IRQ (_spin_unlock_irqrestore)
00010001 0.206ms (+0.000ms): do_IRQ (<0000000a>)
00010001 0.206ms (+0.000ms): spin_lock (do_IRQ)
00010002 0.206ms (+0.000ms): spin_lock (<00000000>)
...

schedule
========

Have not talked about schedule in a while - this looks like something
different than before. Appears to be some deep nesting of preempt
disabling activities.

preemption latency trace v1.0.7 on 2.6.9-rc1-bk12-VP-S0
-------------------------------------------------------
latency: 237 us, entries: 426 (426) | [VP:1 KP:1 SP:1 HP:1 #CPUS:2]
-----------------
| task: latencytest/5314, uid:0 nice:0 policy:1 rt_prio:99
-----------------
=> started at: schedule+0x59/0x5b0
=> ended at: schedule+0x39d/0x5b0
=======>
00000001 0.000ms (+0.000ms): schedule (ksoftirqd)
00000001 0.000ms (+0.000ms): schedule (ksoftirqd)
00000001 0.000ms (+0.000ms): schedule (ksoftirqd)
00000001 0.000ms (+0.000ms): schedule (ksoftirqd)
00000001 0.000ms (+0.000ms): schedule (ksoftirqd)
00000001 0.000ms (+0.000ms): schedule (ksoftirqd)
00000001 0.000ms (+0.000ms): sched_clock (schedule)
00000001 0.001ms (+0.000ms): schedule (ksoftirqd)
00000001 0.001ms (+0.000ms): schedule (ksoftirqd)
00000001 0.001ms (+0.000ms): schedule (ksoftirqd)
00000001 0.001ms (+0.000ms): spin_lock_irq (schedule)
00000001 0.001ms (+0.000ms): __spin_lock_irqsave (spin_lock_irq)
00000002 0.002ms (+0.000ms): __spin_lock_irqsave (<00000000>)
00000002 0.002ms (+0.000ms): schedule (ksoftirqd)
00000002 0.002ms (+0.000ms): deactivate_task (schedule)
00000002 0.002ms (+0.000ms): dequeue_task (deactivate_task)
00000002 0.002ms (+0.000ms): schedule (ksoftirqd)
00000002 0.002ms (+0.000ms): schedule (ksoftirqd)
00000002 0.003ms (+0.000ms): dependent_sleeper (schedule)
00000002 0.003ms (+0.000ms): schedule (ksoftirqd)
00000002 0.003ms (+0.000ms): schedule (ksoftirqd)
00000002 0.003ms (+0.000ms): schedule (ksoftirqd)
00000002 0.003ms (+0.000ms): schedule (ksoftirqd)
00000002 0.003ms (+0.000ms): dummy_switch_tasks (schedule)
00000002 0.003ms (+0.000ms): schedule (ksoftirqd)
00000002 0.004ms (+0.000ms): schedule (ksoftirqd)
00000002 0.004ms (+0.000ms): schedule (ksoftirqd)
00000002 0.004ms (+0.000ms): schedule (ksoftirqd)
00000002 0.004ms (+0.000ms): schedule (ksoftirqd)
00000002 0.004ms (+0.000ms): sched_info_switch (schedule)
00000002 0.004ms (+0.000ms): sched_info_depart (sched_info_switch)
00000002 0.005ms (+0.000ms): sched_info_arrive (schedule)
00000002 0.005ms (+0.000ms): sched_info_dequeued (sched_info_arrive)
00000002 0.005ms (+0.000ms): schedule (ksoftirqd)
00000002 0.005ms (+0.000ms): context_switch (schedule)
00000002 0.005ms (+0.000ms): dummy_cs_entry (context_switch)
00000002 0.005ms (+0.000ms): context_switch (schedule)
00000002 0.006ms (+0.000ms): dummy_cs_switch_mm (context_switch)
00000002 0.006ms (+0.000ms): context_switch (schedule)
00000002 0.006ms (+0.000ms): dummy_cs_unlikely_if (context_switch)
00000002 0.006ms (+0.000ms): context_switch (schedule)
00000002 0.006ms (+0.000ms): dummy_cs_switch_to (context_switch)
00000002 0.006ms (+0.000ms): context_switch (schedule)
04000002 0.007ms (+0.000ms): __switch_to (context_switch)
04000002 0.007ms (+0.000ms): dummy_cs_exit (context_switch)
04000002 0.007ms (+0.000ms): context_switch (schedule)
04000002 0.007ms (+0.000ms): finish_task_switch (schedule)
04000002 0.008ms (+0.000ms): _spin_unlock_irq (finish_task_switch)
04010002 0.008ms (+0.000ms): do_IRQ (_spin_unlock_irq)
04010002 0.008ms (+0.000ms): do_IRQ (<0000000a>)
04010002 0.008ms (+0.000ms): spin_lock (do_IRQ)
04010003 0.009ms (+0.000ms): spin_lock (<00000000>)
04010003 0.009ms (+0.000ms): mask_and_ack_level_ioapic_irq (do_IRQ)
04010003 0.009ms (+0.000ms): mask_IO_APIC_irq (mask_and_ack_level_ioapic_irq)
04010003 0.009ms (+0.000ms): __spin_lock_irqsave (mask_IO_APIC_irq)
04010004 0.009ms (+0.000ms): __spin_lock_irqsave (<00000000>)
04010004 0.009ms (+0.000ms): __mask_IO_APIC_irq (mask_IO_APIC_irq)
04010004 0.010ms (+0.020ms): __modify_IO_APIC_irq (__mask_IO_APIC_irq)
04010004 0.030ms (+0.000ms): _spin_unlock_irqrestore (mask_IO_APIC_irq)
04010003 0.030ms (+0.000ms): redirect_hardirq (do_IRQ)
04010003 0.030ms (+0.000ms): _spin_unlock (do_IRQ)
04010002 0.031ms (+0.000ms): handle_IRQ_event (do_IRQ)
04010002 0.031ms (+0.000ms): usb_hcd_irq (handle_IRQ_event)
04010002 0.031ms (+0.001ms): uhci_irq (usb_hcd_irq)
04010002 0.032ms (+0.000ms): spin_lock (uhci_irq)
04010003 0.032ms (+0.000ms): spin_lock (<00000000>)
04010003 0.032ms (+0.000ms): uhci_get_current_frame_number (uhci_irq)
04010003 0.033ms (+0.000ms): uhci_free_pending_qhs (uhci_irq)
04010003 0.033ms (+0.000ms): uhci_free_pending_tds (uhci_irq)
...
gets even deeper nesting with the first value going up to 04010008
while the system appears to send signals based on mouse input. None
of the steps take very long, but with over 400 steps in the sequence
it shows up in the trace record.

--Mark

2004-09-14 18:43:10

by Ingo Molnar

[permalink] [raw]
Subject: Re: [patch] voluntary-preempt-2.6.9-rc1-bk4-R1


* [email protected] <[email protected]> wrote:

> So I have more relatively small delays in the same test period -
> almost twice as many, and few that I can see are swap related. Most
> are network related. One very long trace > 70 msec - this is the third
> day I've had a single multi-millisecond trace in a run.

the big one is the flush_tlb_others() latency:

> 00010003 0.195ms (+0.000ms): do_nmi (flush_tlb_others)
> 00010003 0.195ms (+0.002ms): do_nmi (__trace)

> 00010003 1.193ms (+0.000ms): do_nmi (flush_tlb_others)
> 00010003 1.194ms (+0.001ms): do_nmi (__trace)

> ... the steps over the last millisecond get repeated for several cycles ...

this is similar to what i described in earlier emails - IPI passing
getting held up by some IRQs-off activity on the other CPU. You can see
what happened on the other CPU by doing this:

grep ': do_nmi' lt.20 | grep -v flush_tlb_other

there's only one suspicious thing, roughly halfway into the trace:

00010003 30.178ms (+0.004ms): do_nmi (run_timer_softirq)
00010003 31.178ms (+0.001ms): do_nmi (run_timer_softirq)

so run_timer_softirq() was running - but we dont normally run that with
interrupts disabled ...

so maybe something else is holding up the IPI? Eventually the IPI
arrived so it could very well be just a 70 msec delay due to printing a
latency trace to the console. But ... maybe something odd is going on
that delays IPIs. To debug this further, could you do something like
this to kernel/latency.c's nmi_trace() function:

if (cpu_online(cpu) && cpu != this_cpu) {
        __trace(eip, nmi_eips[cpu]);
        __trace(eip, irq_stat[cpu].apic_timer_irqs);
}

this will add a third do_nmi entry to the trace, showing the APIC timer
interrupt counter on that other CPU. If this counter doesnt increase
during the ~70 ticks then that CPU had interrupts disabled permanently
and the IPI latency is normal. If the counter increases then the IPI
delay is unjustified and there's something bad going on ...

> A very long sequence (> 1500 traces) with non zero preempt values that
> end in "00". Does that mean we did not have any locks but could not
> schedule for other reasons?

no - it's 0x00000100 that is SOFTIRQ_OFFSET, so it's normal behavior.

The design is like this: the softirq and hardirq counts are 'merged'
into preempt_count(). spin_lock/unlock increases/decreases the
preempt_count() by 1. Softirq disabling increases the preempt_count() by
0x100 - hardirq entry increases the preempt_count() by 0x10000. This way
we have a nesting up to 255 and we are still able to tell from the
preempt_count() alone whether we are in a hardirq, or whether softirqs
are disabled.

For scheduling purposes all that matters is whether the preempt_count is
nonzero (in which case the kernel must not be preempted). But there are
other contexts too that need exclusion: softirqs will run asynchronously
even if the preempt_count is nonzero - as long as the bits covered by
SOFTIRQ_MASK (0xff00) are not set.
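
to make the encoding concrete, here is a minimal decoder (plain C,
userspace-style - this is only an illustration using the offsets
described above, not code lifted from any kernel header) that splits a
preempt_count value into its three nesting levels, which is also how
the first hex column of the traces (e.g. 00010003) can be read:

#include <stdio.h>

/* illustrative 2.6-era offsets, as described above */
#define PREEMPT_OFFSET  0x00000001UL    /* spin_lock()/unlock() adds/removes 1 */
#define SOFTIRQ_OFFSET  0x00000100UL    /* softirq disabling adds 0x100 */
#define HARDIRQ_OFFSET  0x00010000UL    /* hardirq entry adds 0x10000 */

#define PREEMPT_MASK    0x000000ffUL    /* up to 255 nested preempt-off sections */
#define SOFTIRQ_MASK    0x0000ff00UL    /* nonzero => softirqs are disabled */
#define HARDIRQ_MASK    0x00ff0000UL    /* nonzero => we are in hardirq context */

static void decode_preempt_count(unsigned long count)
{
        printf("%08lx: hardirq depth %lu, softirq-off depth %lu, "
               "preempt-off depth %lu%s\n",
               count,
               (count & HARDIRQ_MASK) >> 16,
               (count & SOFTIRQ_MASK) >> 8,
               count & PREEMPT_MASK,
               count ? " - not preemptible" : " - preemptible");
}

int main(void)
{
        decode_preempt_count(0x00000101); /* softirqs off, one spinlock held */
        decode_preempt_count(0x00010003); /* in a hardirq, three preempt-off levels */
        return 0;
}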

the selinux overhead looks interesting:

> avc_insert
> ==========
>
> preemption latency trace v1.0.7 on 2.6.9-rc1-bk12-VP-S0
> -------------------------------------------------------
> latency: 266 us, entries: 80 (80) | [VP:1 KP:1 SP:1 HP:1 #CPUS:2]
> -----------------
> | task: fam/2565, uid:0 nice:0 policy:0 rt_prio:0
> -----------------
> => started at: __spin_lock_irqsave+0x2b/0xb0
> => ended at: _spin_unlock_irqrestore+0x32/0x70
> =======>
> 00000001 0.000ms (+0.000ms): __spin_lock_irqsave (avc_has_perm_noaudit)
> 00000001 0.000ms (+0.000ms): __spin_lock_irqsave (<00000000>)
> 00000001 0.000ms (+0.204ms): avc_insert (avc_has_perm_noaudit)
> 00000001 0.204ms (+0.000ms): memcpy (avc_has_perm_noaudit)
> 00000001 0.204ms (+0.001ms): _spin_unlock_irqrestore (avc_has_perm_noaudit)

it seems a bit odd. avc_has_perm_noaudit() calls avc_insert(), and from
the entry to avc_insert(), to the memcpy() call done by
avc_has_perm_noaudit() there were 204 usecs spent. That's alot of time -
175 thousand cycles! Now if you check out the code between line 1009 and
1019 in security/selinux/avc.c, there's little that could cause this
amount of overhead. avc_insert() itself is rather simple - it does an
avc_claim_node() call which is an inlined function so it doesnt show up
in the trace. That function _might_ have called avc_reclaim_node() which
seems to do a linear scan over a list - i dont know how long that list
is typically but it could be long. Could you add "#define inline" near
the top of avc.c (but below the #include lines) so that we can see how
the overhead is distributed in the future?
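
i.e. something like this, placed in security/selinux/avc.c just below
the #include block (untested sketch, only to show what i mean):

/*
 * make "inline" a no-op in this file so that avc_claim_node() and
 * friends get real call sites and show up in the latency traces.
 */
#define inline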


> schedule
> ========
>
> Have not talked about schedule in a while - this looks like something
> different than before. Appears to be some deep nesting of preempt
> disabling activities.

> ... gets even deeper nesting with the first value going up to 04010008
> while the system appears to send signals based on mouse input. None of
> the steps take very long, but with over 400 steps in the sequence it
> shows up in the trace record.

it could be tracing overhead - you've got inlining disabled in sched.c
(upon my request) and you've got the extra lines you added to find out
where the overhead is. Now that we know that DMA was (is?) a big
contributor to the mystic latencies you might want to remove those.

Ingo

2004-09-15 15:31:27

by Stephen Smalley

[permalink] [raw]
Subject: Re: [patch] voluntary-preempt-2.6.9-rc1-bk4-R1

On Tue, 2004-09-14 at 14:32, Ingo Molnar wrote:
> it seems a bit odd. avc_has_perm_noaudit() calls avc_insert(), and from
> the entry to avc_insert(), to the memcpy() call done by
> avc_has_perm_noaudit() there were 204 usecs spent. That's alot of time -
> 175 thousand cycles! Now if you check out the code between line 1009 and
> 1019 in security/selinux/avc.c, there's little that could cause this
> amount of overhead. avc_insert() itself is rather simple - it does an
> avc_claim_node() call which is an inlined function so it doesnt show up
> in the trace. That function _might_ have called avc_reclaim_node() which
> seems to do a linear scan over a list - i dont know how long that list
> is typically but it could be long. Could you add "#define inline" near
> the top of avc.c (but below the #include lines) so that we can see how
> the overhead is distributed in the future?

May be rendered moot by the RCU patches for SELinux, see
http://marc.theaimsgroup.com/?l=linux-kernel&m=109386501021596&w=2

--
Stephen Smalley <[email protected]>
National Security Agency

2004-10-05 20:39:31

by Timothy Miller

[permalink] [raw]
Subject: GPU driver misbehavior [Re: [patch] voluntary-preempt-2.6.9-rc1-bk4-Q9]



Lee Revell wrote:

>
> "Misbehaving video card drivers are another source of significant delays
> in scheduling user code. A number of video cards manufacturers recently
> began employing a hack to save a PCI bus transaction for each display
> operation in order to gain a few percentage points on their WinBench
> [Ziff-Davis 98] Graphics WinMark performance.
>
> The video cards have a command FIFO that is written to via the PCI bus.
> They also have a status register, read via the PCI bus, which says
> whether the command FIFO is full or not. The hack is to not check
> whether the command FIFO is full before attempting to write to it, thus
> saving a PCI bus read.
>
> The problem with this is that the result of attempting to write to the
> FIFO when it is full is to stall the CPU waiting on the PCI bus write
> until a command has been completed and space becomes available to accept
> the new command. In fact, this not only causes the CPU to stall waiting
> on the PCI bus, but since the PCI controller chip also controls the ISA
> bus and mediates interrupts, ISA traffic and interrupt requests are
> stalled as well. Even the clock interrupts stop.
>
> These video cards will stall the machine, for instance, when the user
> drags a window. For windows occupying most of a 1024x768 screen on a
> 333MHz Pentium II with an AccelStar II AGP video board (which is based
> on the 3D Labs Permedia 2 chip set) this will stall the machine for
> 25-30ms at a time!"

I would expect that I'm not the first to think of this, but I haven't
seen it mentioned, so it makes me wonder. Therefore, I offer my solution.

Whenever you read the status register, keep a copy of the "number of
free fifo entries" field. Whenever you're going to do a group of writes
to the fifo, you first must check for enough free entries. The macro
that does this checks the copy of the status register to see if there
were enough free the last time you checked. If so, deduct the number of
free slots you're about to use, and move on. If not, re-read the status
register and loop or sleep if you don't have enough free.

The copy of the status register will always be "correct" in that it will
always report a number of free entries less than or equal to the actual
number, and it will never report a number greater than what is available
(barring a hardware glitch or a bug, which would be bad for other reasons).
This is because you're assuming the fifo doesn't drain, when in fact, it
does.

This results in nearly optimal performance, because usually you end up
reading the status register mostly when the fifo is full (a time when
extra bus reads don't hurt anything). If you have a 256-entry fifo,
then you end up reading the status register once for every 256 writes,
for a performance loss of only 0.39%, and you ONLY get this performance
loss when the fifo drains faster than you can fill it.

One challenge to this is when you have more than one entity trying to
access the same resource. But in that case, you'll already have to be
using some sort of mutex mechanism anyhow.
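
Roughly, in code, the idea is the following sketch. (read_fifo_status()
and fifo_write_cmd() are made-up names here, standing in for whatever
register accessors the real driver uses.)

/* placeholders for the driver's real MMIO accessors */
extern unsigned int read_fifo_status(void);        /* returns free FIFO entries */
extern void fifo_write_cmd(unsigned int cmd);      /* pushes one command */

/* free entries as of the last time the status register was actually read */
static unsigned int cached_free;

static void fifo_reserve(unsigned int n)
{
        /*
         * The cached count can only underestimate the free space (the
         * FIFO drains behind our back but never fills), so the status
         * register only needs to be touched when the cache says there
         * might not be room.
         */
        while (cached_free < n)
                cached_free = read_fifo_status();  /* re-read; could sleep instead of spinning */

        cached_free -= n;
}

static void fifo_emit(const unsigned int *cmds, unsigned int n)
{
        unsigned int i;

        fifo_reserve(n);                    /* guarantee room up front ...             */
        for (i = 0; i < n; i++)
                fifo_write_cmd(cmds[i]);    /* ... so these writes never stall the bus */
}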

2004-10-05 21:12:17

by Lee Revell

[permalink] [raw]
Subject: Re: GPU driver misbehavior [Re: [patch] voluntary-preempt-2.6.9-rc1-bk4-Q9]

On Tue, 2004-10-05 at 16:43, Timothy Miller wrote:
> Lee Revell wrote:
>
> >
> > "Misbehaving video card drivers are another source of significant delays
> > in scheduling user code. A number of video cards manufacturers recently
> > began employing a hack to save a PCI bus transaction for each display
> > operation in order to gain a few percentage points on their WinBench
> > [Ziff-Davis 98] Graphics WinMark performance.
> >
> > The video cards have a command FIFO that is written to via the PCI bus.
> > They also have a status register, read via the PCI bus, which says
> > whether the command FIFO is full or not. The hack is to not check
> > whether the command FIFO is full before attempting to write to it, thus
> > saving a PCI bus read.
> >
> > The problem with this is that the result of attempting to write to the
> > FIFO when it is full is to stall the CPU waiting on the PCI bus write
> > until a command has been completed and space becomes available to accept
> > the new command. In fact, this not only causes the CPU to stall waiting
> > on the PCI bus, but since the PCI controller chip also controls the ISA
> > bus and mediates interrupts, ISA traffic and interrupt requests are
> > stalled as well. Even the clock interrupts stop.
> >
> > These video cards will stall the machine, for instance, when the user
> > drags a window. For windows occupying most of a 1024x768 screen on a
> > 333MHz Pentium II with an AccelStar II AGP video board (which is based
> > on the 3D Labs Permedia 2 chip set) this will stall the machine for
> > 25-30ms at a time!"
>
> I would expect that I'm not the first to think of this, but I haven't
> seen it mentioned, so it makes me wonder. Therefore, I offer my solution.
>
> Whenever you read the status register, keep a copy of the "number of
> free fifo entries" field. Whenever you're going to do a group of writes
> to the fifo, you first must check for enough free entries. The macro
> that does this checks the copy of the status register to see if there
> were enough free the last time you checked. If so, deduct the number of
> free slots you're about to use, and move on. If not, re-read the status
> register and loop or sleep if you don't have enough free.
>
> The copy of the status register will always be "correct" in that it will
> always report a number of free entries less than or equal to the actual
> number, and it will never report a number greater than what is available
> (barring a hardware glitch or a bug, which would be bad for other reasons).
> This is because you're assuming the fifo doesn't drain, when in fact, it
> does.
>
> This results in nearly optimal performance, because usually you end up
> reading the status register mostly when the fifo is full (a time when
> extra bus reads don't hurt anything). If you have a 256-entry fifo,
> then you end up reading the status register once for every 256 writes,
> for a performance loss of only 0.39%, and you ONLY get this performance
> loss when the fifo drains faster than you can fill it.
>
> One challenge to this is when you have more than one entity trying to
> access the same resource. But in that case, you'll already have to be
> using some sort of mutex mechanism anyhow.
>
>

AFAIK only one driver (VIA unichrome) has had this problem recently.
Thomas Hellstrom fixed it, so I added him to the cc: list. Thomas, you
mentioned there was a performance hit associated with the fix; would
this be an improvement over what you did?

Also I should add that I was quoting a research.microsoft.com whitepaper
above. But s/AccelStar II AGP/VIA CLE266/ and it applies exactly to my
results. Just want to give credit where it's due...

Lee