2004-09-01 08:29:42

by Ingo Molnar

Subject: [patch] voluntary-preempt-2.6.9-rc1-bk4-Q6


i've released the -Q6 patch:

http://redhat.com/~mingo/voluntary-preempt/voluntary-preempt-2.6.9-rc1-bk4-Q6

on top of:

http://redhat.com/~mingo/voluntary-preempt/diff-bk-040828-2.6.8.1.bz2

this patch includes two changes that should shorten the networking
latencies reported. There's a new 'RX granularity' sysctl now:

/proc/sys/net/core/netdev_backlog_granularity

It defaults to the most fine-grained value, 1.

netdev_max_backlog has been moved back to the upstream value of 300.

Also, the backlog processing is now sensitive to preemption requests and
will break out early in that case.

(This should not result in TCP connection quality issues (all processing
is restarted after such a breakout), but nevertheless i'd suggest that
everyone keep an eye on lost packets and seemingly hung TCP
connections.)
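
As a rough sketch of the idea (not the actual -Q6 diff - the helper and
the loop structure below are illustrative only), the break-out might
look something like this in the backlog-processing loop:

/*
 * Illustrative sketch only, not the real patch: while draining the RX
 * backlog, honour preemption requests every netdev_backlog_granularity
 * packets and break out early.  The softirq is re-run later, so no
 * packets are lost - processing simply restarts.
 */
#include <linux/netdevice.h>
#include <linux/skbuff.h>
#include <linux/sched.h>

extern int netdev_backlog_granularity;	/* the new sysctl, defaults to 1 */

static int process_backlog_sketch(struct net_device *backlog_dev, int *budget)
{
	int work = 0;

	while (work < *budget) {
		/* dequeue_next_rx_skb() is a hypothetical helper */
		struct sk_buff *skb = dequeue_next_rx_skb();

		if (!skb)
			break;
		netif_receive_skb(skb);
		work++;

		/* check for a preemption request at the configured granularity */
		if (!(work % netdev_backlog_granularity) && need_resched())
			break;		/* bail out early, resume later */
	}

	*budget -= work;
	return work;
}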

other changes since -Q5:

- mtrr simplifications and IRQ-disabling. (reported & tested by Lee
Revell) Still under discussion though.

- fix /dev/random driver latency (reported & tested by Lee Revell)

- move vgacon_do_font_op out of the BKL (reported by P.O. Gaillard)

- increase percpu space for tracing (by Mark H Johnson)

- added user-triggerable generic kernel tracing: it is enabled via
tracing_enabled=2, turned on via gettimeofday(0,1) and turned off via
gettimeofday(0,0).
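
For example, a user-space test could bracket the interesting section
like this (a minimal sketch: with the patch applied and
tracing_enabled set to 2, the NULL-tv gettimeofday() calls below act as
the on/off trigger described above; on an unpatched kernel they do
nothing useful):

#include <sys/time.h>
#include <stddef.h>

int main(void)
{
	/* tracing on: gettimeofday(0,1) as described above */
	gettimeofday(NULL, (struct timezone *)1);

	/* ... run the code whose latency should be traced ... */

	/* tracing off: gettimeofday(0,0) */
	gettimeofday(NULL, (struct timezone *)0);
	return 0;
}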

Ingo


2004-09-01 13:51:06

by Ingo Molnar

Subject: [patch] voluntary-preempt-2.6.9-rc1-bk4-Q7


i've released the -Q7 patch:

http://redhat.com/~mingo/voluntary-preempt/voluntary-preempt-2.6.9-rc1-bk4-Q7

on top of:

http://redhat.com/~mingo/voluntary-preempt/diff-bk-040828-2.6.8.1.bz2

the main change in this patch is more SMP latency fixes. The stock
kernel, even with CONFIG_PREEMPT enabled, didn't have any spin-nicely
preemption logic for the following, commonly used SMP locking
primitives: read_lock(), spin_lock_irqsave(), spin_lock_irq(),
spin_lock_bh(), read_lock_irqsave(), read_lock_irq(), read_lock_bh(),
write_lock_irqsave(), write_lock_irq(), write_lock_bh(). Only
spin_lock() and write_lock() [the two simplest cases] were covered.

In addition to the preemption latency problems, the _irq() variants in
the above list didn't do any IRQ-enabling while spinning - possibly
resulting in excessive irqs-off sections of code!

-Q7 fixes all of these latency problems: we now re-enable interrupts
while spinning in all possible cases, and a spinning op stays
preemptible if this is the beginning of a new critical section.
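
Schematically, the spin-nicely behaviour for e.g. spin_lock_irqsave()
is along these lines (a sketch of the idea only - the function name is
illustrative and this is not the actual -Q7 code):

#include <linux/spinlock.h>

static inline void spin_lock_irqsave_nicely(spinlock_t *lock,
					    unsigned long *flags)
{
	for (;;) {
		local_irq_save(*flags);
		if (_raw_spin_trylock(lock))
			return;			/* acquired, irqs stay disabled */

		/* don't spin with irqs off ... */
		local_irq_restore(*flags);

		/*
		 * ... and, at the start of a new critical section, the
		 * busy-wait below remains preemptible.
		 */
		while (spin_is_locked(lock))
			cpu_relax();
	}
}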

there's also an SMP-related tracing improvement in -Q7: the NMI tracing
code now traces the other CPUs too - this way if an NMI hits a
particularly long section, we'll have a chance to see what the other CPU
was doing. These show up as double do_nmi() trace entries on a 2-CPU x86
box. The first one is the current CPU, subsequent entries are the other
CPUs in the system.

(-Q7 is not that interesting to uniprocessor kernel users, but it would
still be useful to test it, just to see that nothing broke (on the
compilation side) - lots of spinlock code had to be changed.)

Ingo

2004-09-01 17:12:26

by Thomas Charbonnel

Subject: Re: [patch] voluntary-preempt-2.6.9-rc1-bk4-Q7

Ingo Molnar wrote :
> i've released the -Q7 patch:
>
> http://redhat.com/~mingo/voluntary-preempt/voluntary-preempt-2.6.9-rc1-bk4-Q7

With Q7 I still get rx latency issues (> 130 us non-preemptible section
from rtl8139_poll). Moreover network connections were extremely slow
(almost hung) until I set /proc/sys/net/core/netdev_backlog_granularity
to 2.

Thomas


2004-09-01 19:03:57

by K.R. Foley

Subject: Re: [patch] voluntary-preempt-2.6.9-rc1-bk4-Q7

Thomas Charbonnel wrote:
> Ingo Molnar wrote :
>
>>i've released the -Q7 patch:
>>
>> http://redhat.com/~mingo/voluntary-preempt/voluntary-preempt-2.6.9-rc1-bk4-Q7
>
>
> With Q7 I still get rx latency issues (> 130 us non-preemptible section
> from rtl8139_poll). Moreover network connections were extremely slow
> (almost hung) until I set /proc/sys/net/core/netdev_backlog_granularity
> to 2.
>
> Thomas
>
>
>
I too am still getting these latencies, although not as often (maybe?).
I, on the other hand, am having no problems with slow connections.
However, this is with very little load on the system. Here is one such
trace:

http://www.cybsft.com/testresults/2.6.9-rc1-bk4-Q7/latencytrace4.txt

I do have a couple of new traces that seem to be related to transmitting
data. They are here:

http://www.cybsft.com/testresults/2.6.9-rc1-bk4-Q7/latencytrace2.txt

http://www.cybsft.com/testresults/2.6.9-rc1-bk4-Q7/latencytrace3.txt

kr

2004-09-01 20:20:21

by Peter Zijlstra

Subject: Re: [patch] voluntary-preempt-2.6.9-rc1-bk4-Q7

On Wed, 2004-09-01 at 19:09 +0200, Thomas Charbonnel wrote:
> Ingo Molnar wrote :
> > i've released the -Q7 patch:
> >
> > http://redhat.com/~mingo/voluntary-preempt/voluntary-preempt-2.6.9-rc1-bk4-Q7
>
> With Q7 I still get rx latency issues (> 130 us non-preemptible section
> from rtl8139_poll). Moreover network connections were extremely slow
> (almost hung) until I set /proc/sys/net/core/netdev_backlog_granularity
> to 2.
>
> Thomas
>

Me too!
I too have a rtl8139 network card.

kr, what kind of nic do you have since this does not occur on your
machine?

--
Peter Zijlstra <[email protected]>

2004-09-01 23:03:58

by Mark_H_Johnson

Subject: Re: [patch] voluntary-preempt-2.6.9-rc1-bk4-Q7

>With Q7 I still get rx latency issues (> 130 us non-preemptible section
>from rtl8139_poll). Moreover network connections were extremely slow
>(almost hung) until I set /proc/sys/net/core/netdev_backlog_granularity
>to 2.

The default of 1 caused a couple of services to fail to start up - the
most annoying failure was NIS. I changed netdev_backlog_granularity to
eight (8) in /etc/sysctl.conf and the system came up fine. It is under
test right now, though it will probably be tomorrow before I get full
results.

It appears to have fewer > 500 usec traces than previous tests, so the
-Q7 stuff seems to work (though it has not yet made it to the disk
tests, where I generally have more problems).

One place where we may need to consider more mcount() calls is in
the scheduler. I got another 500+ msec trace going from dequeue_task
to __switch_to.

I also looked briefly at find_first_bit since it appears in a number
of traces. Just curious, but the coding for the i386 version is MUCH
different in style than several other architectures (e.g, PPC64, SPARC).
Is there some reason why it is recursive on the x86 and a loop in the
others?

--Mark H Johnson
<mailto:[email protected]>

2004-09-01 23:17:23

by K.R. Foley

Subject: Re: [patch] voluntary-preempt-2.6.9-rc1-bk4-Q7

Peter Zijlstra wrote:
> On Wed, 2004-09-01 at 19:09 +0200, Thomas Charbonnel wrote:
>
>>Ingo Molnar wrote :
>>
>>>i've released the -Q7 patch:
>>>
>>> http://redhat.com/~mingo/voluntary-preempt/voluntary-preempt-2.6.9-rc1-bk4-Q7
>>
>>With Q7 I still get rx latency issues (> 130 us non-preemptible section
>>from rtl8139_poll). Moreover network connections were extremely slow
>>(almost hung) until I set /proc/sys/net/core/netdev_backlog_granularity
>>to 2.
>>
>>Thomas
>>
>
>
> Me too!
> I too have a rtl8139 network card.
>
> kr, what kind of nic do you have since this does not occur on your
> machine?
>

Ethernet Pro 100.

2004-09-01 20:27:37

by Lee Revell

Subject: Re: [patch] voluntary-preempt-2.6.9-rc1-bk4-Q7

On Wed, 2004-09-01 at 16:11, Peter Zijlstra wrote:
> On Wed, 2004-09-01 at 19:09 +0200, Thomas Charbonnel wrote:
> > Ingo Molnar wrote :
> > > i've released the -Q7 patch:
> > >
> > > http://redhat.com/~mingo/voluntary-preempt/voluntary-preempt-2.6.9-rc1-bk4-Q7
> >
> > With Q7 I still get rx latency issues (> 130 us non-preemptible section
> > from rtl8139_poll). Moreover network connections were extremely slow
> > (almost hung) until I set /proc/sys/net/core/netdev_backlog_granularity
> > to 2.
> >
> > Thomas
> >
>
> Me too!
> I too have a rtl8139 network card.
>
> kr, what kind of nic do you have since this does not occur on your
> machine?

Hmm, I am not a network driver expert, and this is just a guess, but if
they work anything like sound cards, I would say that that hardware
will only generate an interrupt when there are 2 packets in its queue.

Lee

2004-09-02 05:33:12

by Ingo Molnar

Subject: Re: [patch] voluntary-preempt-2.6.9-rc1-bk4-Q7


* [email protected] <[email protected]> wrote:

> One place where we may need to consider more mcount() calls is in the
> scheduler. I got another 500+ msec trace going from dequeue_task to
> __switch_to.

(you mean 500+ usec, correct?)

there's no way the scheduler can have 500 usecs of overhead going from
dequeue_task() to __switch_to(): we have all interrupts disabled and
take zero locks! This is almost certainly some hardware effect (i
described some possibilities and tests a couple of mails earlier).

In any case, please enable nmi_watchdog=1 so that we can see (in -Q7)
what happens on the other CPUs during such long delays.

> I also looked briefly at find_first_bit since it appears in a number
> of traces. Just curious, but the coding for the i386 version is MUCH
> different in style than several other architectures (e.g, PPC64,
> SPARC). Is there some reason why it is recursive on the x86 and a loop
> in the others?

what do you mean by recursive? It uses the SCAS (scan string) x86
instruction.
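
For comparison, the loop-based style used on the other architectures is
roughly of this shape (a deliberately simplified, bit-by-bit sketch,
not any particular architecture's actual code; the i386 version instead
scans whole words with SCAS):

#include <limits.h>

/* simplified, bit-by-bit sketch of a loop-based find_first_bit() */
static unsigned long find_first_bit_sketch(const unsigned long *addr,
					   unsigned long size)
{
	const unsigned long bpl = sizeof(unsigned long) * CHAR_BIT;
	unsigned long i;

	for (i = 0; i < size; i++)
		if (addr[i / bpl] & (1UL << (i % bpl)))
			return i;	/* index of the first set bit */

	return size;			/* no bit set in the range */
}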

Ingo

2004-09-02 05:35:43

by Ingo Molnar

Subject: Re: [patch] voluntary-preempt-2.6.9-rc1-bk4-Q7


* K.R. Foley <[email protected]> wrote:

> This is an interesting one. ~3.9ms generated here by amlat in do_IRQ:

the overhead is not in do_IRQ():

> 00000001 0.000ms (+0.000ms): n_tty_receive_buf (pty_write)
> 00010001 3.992ms (+3.992ms): do_IRQ (n_tty_receive_buf)

the overhead is always relative to the previous entry - so the overhead
was in n_tty_receive_buf() [that is the function that was interrupted by
do_IRQ()]. But it's a bit weird - you should have gotten timer IRQs
every 1 msec. Does n_tty_receive_buf() run with irqs disabled perhaps?

Ingo

2004-09-02 05:58:15

by Ingo Molnar

Subject: Re: [patch] voluntary-preempt-2.6.9-rc1-bk4-Q7


* Ingo Molnar <[email protected]> wrote:

> > 00000001 0.000ms (+0.000ms): n_tty_receive_buf (pty_write)
> > 00010001 3.992ms (+3.992ms): do_IRQ (n_tty_receive_buf)
>
> the overhead is always relative to the previous entry [...]

i've changed the /proc/latency_trace output in my tree to print the
latency of this entry relative to the next entry, not the previous
entry. This should be more intuitive than using the previous entry.

Ingo

2004-09-02 13:18:45

by Mark_H_Johnson

Subject: Re: [patch] voluntary-preempt-2.6.9-rc1-bk4-Q7

>(you mean 500+ usec, correct?)
>
>there's no way the scheduler can have 500 usecs of overhead going from
>dequeue_task() to __switch_to(): we have all interrupts disabled and
>take zero locks! This is almost certainly some hardware effect (i
>described some possibilities and tests a couple of mails earlier).
>
>In any case, please enable nmi_watchdog=1 so that we can see (in -Q7)
>what happens on the other CPUs during such long delays.

Booted with nmi_watchdog=1, saw the kernel message indicating that
NMI was checked OK.

The first trace looks something like this...

latency 518 us, entries: 79
...
started at schedule+0x51/0x740
ended at schedule+0x337/0x740

00000001 0.000ms (+0.000ms): schedule (io_schedule)
00000001 0.000ms (+0.000ms): sched_clock (schedule)
00010001 0.478ms (+0.478ms): do_nmi (sched_clock)
00010001 0.478ms (+0.000ms): do_nmi (<08049b21>)
00010001 0.482ms (+0.003ms): profile_tick (nmi_watchdog_tick)
...
and a few entries later ends up at do_IRQ (sched_clock).

The second trace goes from dequeue_task to __switch_to with a
similar pattern - the line with do_nmi has +0.282ms duration and
the line notifier_call_chain (profile_hook) has +0.135ms duration.

I don't see how this provides any additional information but will
provide several additional traces when the test gets done in a
few minutes.


--Mark H Johnson
<mailto:[email protected]>

2004-09-02 13:34:24

by Mark_H_Johnson

Subject: Re: [patch] voluntary-preempt-2.6.9-rc1-bk4-Q7

>> I also looked briefly at find_first_bit since it appears in a number
>> of traces. Just curious, but the coding for the i386 version is MUCH
>> different in style than several other architectures (e.g, PPC64,
>> SPARC). Is there some reason why it is recursive on the x86 and a loop
>> in the others?
>
>what do you mean by recursive? It uses the SCAS (scan string) x86
>instruction.

Never mind. In bitops.c I misread "find_first_bit" (the call near the end)
as "find_next_bit" and thought there was recursion here.

--Mark H Johnson
<mailto:[email protected]>

2004-09-02 13:37:20

by Ingo Molnar

Subject: Re: [patch] voluntary-preempt-2.6.9-rc1-bk4-Q7


* [email protected] <[email protected]> wrote:

> >In any case, please enable nmi_watchdog=1 so that we can see (in -Q7)
> >what happens on the other CPUs during such long delays.
>
> Booted with nmi_watchdog=1, saw the kernel message indicating that
> NMI was checked OK.
>
> The first trace looks something like this...
>
> latency 518 us, entries: 79
> ...
> started at schedule+0x51/0x740
> ended at schedule+0x337/0x740
>
> 00000001 0.000ms (+0.000ms): schedule (io_schedule)
> 00000001 0.000ms (+0.000ms): sched_clock (schedule)
> 00010001 0.478ms (+0.478ms): do_nmi (sched_clock)
> 00010001 0.478ms (+0.000ms): do_nmi (<08049b21>)
> 00010001 0.482ms (+0.003ms): profile_tick (nmi_watchdog_tick)
> ...
> and a few entries later ends up at do_IRQ (sched_clock).
>
> The second trace goes from dequeue_task to __switch_to with a
> similar pattern - the line with do_nmi has +0.282ms duration and
> the line notifier_call_chain (profile_hook) has +0.135ms duration.
>
> I don't see how this provides any additional information but will
> provide several additional traces when the test gets done in a few
> minutes.

thanks. The NMI gives us two kinds of information, both useful:

- if the ratio of do_nmi()'s within such a section roughly matches the
number of NMIs we'd expect during the sum of these sections, then it
means that the delay is most likely wall-clock time and not some
measurement artifact (RDTSC artifact or tracing bug). The NMIs are
triggered (indirectly) by the PIT, and the PIT is an independent clock
whose frequency is independent of the rest of the system
(independent of the CPU clock, DMA activities, IRQ load, etc.)

since most of the codepaths in question (the scheduler's
dequeue_task(), etc.) run with interrupts disabled the normal timer
interrupts (smp_apic_timer_interrupt() and do_IRQ(00000000)) cannot
'sample' this codepath. Only the NMI can interrupt these codepaths.

- the NMIs also sample what happens on the other CPU. In your above
trace this gives:

> 00010001 0.478ms (+0.478ms): do_nmi (sched_clock)
> 00010001 0.478ms (+0.000ms): do_nmi (<08049b21>)

the other CPU was executing userspace code during the last NMI tick -
i.e. nothing that could be suspect. 'suspect' code would be some sort of
kernel code that could in theory interact with this CPU's scheduler
code.

this too is statistical sampling, so we'll need as many of these
traces as possible.

some wacky guess based on the above single sampling point: it seems the
delays are real wall-clock delays, and the only thing matching the
theory so far is that DMA traffic on the memory bus somehow stalls this
CPU's memory traffic for up to 500 usecs. How could userspace running on
CPU#0 impact the kernel's scheduler code on CPU#1?

Ingo

2004-09-02 14:49:48

by Ingo Molnar

Subject: Re: [patch] voluntary-preempt-2.6.9-rc1-bk4-Q7


* [email protected] <[email protected]> wrote:

> The test just completed. Over 100 traces (>500 usec) in 25 minutes
> of test runs.
>
> To recap - this kernel has:
>
> Downloaded linux-2.6.8.1 from
> http://www.kernel.org/pub/linux/kernel/v2.6/linux-2.6.8.1.tar.bz2
> Downloaded patches from
> http://redhat.com/~mingo/voluntary-preempt/diff-bk-040828-2.6.8.1.bz2
> http://people.redhat.com/mingo/voluntary-preempt/voluntary-preempt-2.6.9-rc1-bk4-Q7
> ... email saved into mark-offset-tsc-mcount.patch ...
> ... email saved into ens001.patch ...

thanks for the data. There are dozens of traces that show a big latency
for no algorithmic reason, in completely unlocked codepaths. In these
places the CPU seems to have an inexplicable inability to run simple
sequential code that has no looping potential at all.

the NMI samples show that just about any kernel code can be delayed by
this phenomenon - the kernel functions that have critical sections show
up by their likely frequency of use. There doesn't seem to be anything
common to the functions that show these delays, other than that they
have a critical section and that they are running in your workload.

so the remaining theories are:

- DMA starvation. I've never seen anything on this scale but it's
pretty much the only thing interacting with a CPU's ability to
execute code - besides the other CPU running in the system.

i'd not be surprised if some audio cards tried tricks to do as
aggressive DMA as physically possible, even violating hw
specifications - for the purpose of producing skip-free audio output.
Do you have another soundcard for testing by any chance? Another
option would be to try latencytest driven not by the soundcard IRQ
but by /dev/rtc.

- some sort of SMM handler that is triggered on I/O ops or something.
But a number of functions in the traces don't do any I/O ops (port
instructions like IN or OUT) so it's hard to imagine this to be the
case. An externally triggered SMM is possible too, perhaps some
independent timer triggers a watchdog SMM?

it is nearly impossible for these traces to be caused by the kernel. It
really has to be some hardware effect, based on the data we have so far.

Ingo

2004-09-02 18:03:25

by Lee Revell

Subject: Re: [patch] voluntary-preempt-2.6.9-rc1-bk4-Q7

On Thu, 2004-09-02 at 09:37, Ingo Molnar wrote:
> - the NMIs also sample what happens on the other CPU. In your above
> trace this gives:
>
> > 00010001 0.478ms (+0.478ms): do_nmi (sched_clock)
> > 00010001 0.478ms (+0.000ms): do_nmi (<08049b21>)
>
> the other CPU was executing userspace code during the last NMI tick -
> i.e. nothing that could be suspect. 'suspect' code would be some sort of
> kernel code that could in theory interact with this CPU's scheduler
> code.
>
> this too is statistical sampling, so we'll need as many of these
> traces as possible.
>
> some wacky guess based on the above single sampling point: it seems the
> delays are real wall-clock delays, and the only thing matching the
> theory so far is that DMA traffic on the memory bus somehow stalls this
> CPU's memory traffic for up to 500 usecs. How could userspace running on
> CPU#0 impact the kernel's scheduler code on CPU#1?
>

This is not wacky at all. For example, the 2D acceleration driver for my
Via Unichrome chipset will completely stall the PCI bus and processor if
the 2D engine is overloaded and the command FIFO is written to without
checking whether it is full. This can be triggered easily: all you have
to do is enable 'display window contents while dragging' and drag a busy
window around slowly.

Many vendor-supplied drivers (including the open source via_accel.c)
don't bother to check whether this FIFO is full before writing to it,
because it increases benchmark scores slightly, at the expense of ruining
audio performance. You can thank Matrox for this "innovation", though
to be fair, every other vendor seems to have followed suit until they
were busted. Only setting 'Option "NoAccel"' in my X config fixes it.
This is not even a kernel driver, it's part of XFree86!
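
The polite behaviour is conceptually trivial: poll the status register
until the FIFO has room before posting each command. A sketch of that
idea (all register offsets and bit names below are invented for
illustration - this is not via_accel.c):

#include <stdint.h>

#define ENGINE_STATUS		0x400	/* hypothetical MMIO offsets */
#define ENGINE_CMD_FIFO		0x404
#define STATUS_FIFO_FULL	(1u << 0)

/* one extra PCI read per command, but the CPU never stalls on the write */
static void emit_command(volatile uint32_t *mmio, uint32_t cmd)
{
	while (mmio[ENGINE_STATUS / 4] & STATUS_FIFO_FULL)
		;	/* wait for the engine to drain a FIFO slot */

	mmio[ENGINE_CMD_FIFO / 4] = cmd;
}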

This paper describes the problem in more detail:

http://research.microsoft.com/~mbj/papers/tr-98-29.html

Since they don't provide a deep link and the interesting content is
buried, here is the excerpt:

Misbehaving video card drivers are another source of significant delays
in scheduling user code. A number of video card manufacturers recently
began employing a hack to save a PCI bus transaction for each display
operation in order to gain a few percentage points on their WinBench
[Ziff-Davis 98] Graphics WinMark performance.

The video cards have a command FIFO that is written to via the PCI bus.
They also have a status register, read via the PCI bus, which says
whether the command FIFO is full or not. The hack is to not check
whether the command FIFO is full before attempting to write to it, thus
saving a PCI bus read.

The problem with this is that the result of attempting to write to the
FIFO when it is full is to stall the CPU waiting on the PCI bus write
until a command has been completed and space becomes available to accept
the new command. In fact, this not only causes the CPU to stall waiting
on the PCI bus, but since the PCI controller chip also controls the ISA
bus and mediates interrupts, ISA traffic and interrupt requests are
stalled as well. Even the clock interrupts stop.

These video cards will stall the machine, for instance, when the user
drags a window. For windows occupying most of a 1024x768 screen on a
333MHz Pentium II with an AccelStar II AGP video board (which is based
on the 3D Labs Permedia 2 chip set) this will stall the machine for
25-30ms at a time!

This may marginally improve the graphics performance under some
circumstances, but it wreaks havoc on any other devices expecting timely
response from the machine. For instance, this causes severe problems
with USB and IEEE 1394 video and audio streams, as well as standard
sound cards.

Some manufacturers, such as 3D Labs, do provide a registry key that can
be set to disable this anti-social behavior. For instance, [Hanssen 98]
describes this behavior and lists the registry keys to fix several
common graphics cards, including some by Matrox, Tseng Labs, Hercules,
and S3. However as of this writing, there were still drivers, including
some from Number 9 and ATI, for which this behavior could not be
disabled.

This hack, and the problems it causes, has recently started to receive
attention in the trade press [PC Magazine 98]. We hope that pressures
can soon be brought to bear on the vendors to cease this antisocial
behavior. At the very least, should they persist in writing drivers that
can stall the machine, this behavior should no longer be the default.

Lee