LinuxLists.cc - preempt-rt, NUMA and strange latency traces

2006-02-07 11:25:13

Subject: preempt-rt, NUMA and strange latency traces

Hi,

I've been experimenting with 2.6.15-rt16 on a dual 2.8GHz Xeon box
with quite good results and decided to make a run on a NUMA dual node
IBM x440 (8 1.4GHz Xeon, 28GB ram).

However, the kernel crashes early when creating the slabs. Does the
current preempt-rt patchset support NUMA machines or has support
been disabled until things settle down?

Going on, I compiled a non NUMA RT kernel which booted just fine,
but when examining the latency traces, I came upon strange jumps
in the latencies such as:

<...>-6459 2D.h1 42us : rcu_pending (update_process_times)
<...>-6459 2D.h1 42us : scheduler_tick (update_process_times)
<...>-6459 2D.h1 43us : sched_clock (scheduler_tick)
<...>-6459 2D.h1 44us!: _raw_spin_lock (scheduler_tick)
<...>-6459 2D.h2 28806us : _raw_spin_unlock (scheduler_tick)
<...>-6459 2D.h1 28806us : rebalance_tick (scheduler_tick)
<...>-6459 2D.h1 28807us : irq_exit (smp_apic_timer_interrupt)
<...>-6459 2D..1 28808us < (608)
<...>-6459 2D..1 28809us : smp_apic_timer_interrupt (c03e2a02 0 0)
<...>-6459 2D.h1 28810us : handle_nextevent_update (smp_apic_timer_interrupt)
<...>-6459 2D.h1 28810us : hrtimer_interrupt (handle_nextevent_update)

or

<idle>-0 0Dn.. 11us : __schedule (cpu_idle)
<idle>-0 0Dn.. 12us : profile_hit (__schedule)
<idle>-0 0Dn.1 12us : sched_clock (__schedule)
<idle>-0 0Dn.1 13us : _raw_spin_lock_irq (__schedule)
<...>-6459 0D..2 14us : __switch_to (__schedule)
<...>-6459 0D..2 15us : __schedule <<idle>-0> (8c 0)
<...>-6459 0D..2 16us : _raw_spin_unlock_irq (__schedule)
<...>-6459 0...1 16us!: trace_stop_sched_switched (__schedule)
<...>-6459 0D..1 28585us : smp_apic_timer_interrupt (c013babb 0 0)
<...>-6459 0D.h1 28585us : handle_nextevent_update (smp_apic_timer_interrupt)
<...>-6459 0D.h1 28586us : hrtimer_interrupt (handle_nextevent_update)

or even

<idle>-0 3D.h4 0us!: __trace_start_sched_wakeup (try_to_wake_up)
<idle>-0 3D.h4 28899us : __trace_start_sched_wakeup <<...>-6459> (0 3)
<idle>-0 3D.h4 28900us : _raw_spin_unlock (__trace_start_sched_wakeup)
<idle>-0 3D.h3 28900us : resched_task (try_to_wake_up)

There does not seem to be a precise code path leading to those jumps; they
seem to appear anywhere. Furthermore, the jump always seems to be ~27 ms.
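
For reference, the size of each jump can be read off the trace by subtracting the timestamps of consecutive lines. The following is a minimal sketch in C, assuming every trace line has the layout shown in the excerpts above (task, flags, then a timestamp in microseconds as the third field). The function names `trace_timestamp_us` and `trace_jump_us` are hypothetical helpers, not part of the latency tracer.

```c
#include <stdio.h>

/*
 * Extract the microsecond timestamp from one latency-trace line such as
 *   "<...>-6459 2D.h1 44us!: _raw_spin_lock (scheduler_tick)"
 * The timestamp is assumed to be the number that starts the third
 * whitespace-separated field. Returns -1 if the line does not match.
 */
long trace_timestamp_us(const char *line)
{
    long us;

    if (sscanf(line, "%*s %*s %ld", &us) != 1)
        return -1;
    return us;
}

/* Gap in microseconds between two consecutive trace lines, -1 on parse error. */
long trace_jump_us(const char *prev_line, const char *next_line)
{
    long a = trace_timestamp_us(prev_line);
    long b = trace_timestamp_us(next_line);

    if (a < 0 || b < 0)
        return -1;
    return b - a;
}
```

Applied to the line pairs around each "!" marker in the three excerpts, this gives gaps of 28762, 28569 and 28899 us.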

I tried running on only 1 CPU, tried using the TSC instead of the cyclone
timer but to no avail, the phenomenon is still there.

My test program consists only of a thread running at max RT priority doing
a nanosleep().
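
The actual test program was not posted, so the following is only a rough sketch of what such a test might look like. It assumes a Linux system, uses SCHED_FIFO at the maximum priority for the calling thread, and measures how far each nanosleep() overshoots the requested interval. The function name `worst_wakeup_overshoot_ns` and its parameters are made up for illustration.

```c
#define _POSIX_C_SOURCE 200809L
#include <sched.h>
#include <stdio.h>
#include <time.h>

/* Nanoseconds elapsed between two CLOCK_MONOTONIC readings. */
static long long elapsed_ns(const struct timespec *a, const struct timespec *b)
{
    return (long long)(b->tv_sec - a->tv_sec) * 1000000000LL
         + (b->tv_nsec - a->tv_nsec);
}

/*
 * Sleep `iterations` times for `interval_ns` (must be below one second)
 * and return the worst overshoot seen, in nanoseconds.
 * Switching to SCHED_FIFO needs root or CAP_SYS_NICE; if that fails the
 * loop still runs, just without real-time priority.
 */
long long worst_wakeup_overshoot_ns(int iterations, long interval_ns)
{
    struct sched_param sp;
    struct timespec req = { 0, interval_ns };
    long long worst = 0;

    sp.sched_priority = sched_get_priority_max(SCHED_FIFO);
    if (sched_setscheduler(0, SCHED_FIFO, &sp) != 0)
        perror("sched_setscheduler (continuing without RT priority)");

    for (int i = 0; i < iterations; i++) {
        struct timespec before, after;
        long long over;

        clock_gettime(CLOCK_MONOTONIC, &before);
        nanosleep(&req, NULL);
        clock_gettime(CLOCK_MONOTONIC, &after);

        /* Time spent beyond the requested sleep interval. */
        over = elapsed_ns(&before, &after) - interval_ns;
        if (over > worst)
            worst = over;
    }
    return worst;
}
```

Calling `worst_wakeup_overshoot_ns(10000, 1000000L)` would run ten thousand 1 ms sleeps; a stall like the one in the traces would show up as an overshoot in the tens of milliseconds.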

What could be going on here?

Thanks,

Sébastien.

2006-02-08 09:41:51

by Steven Rostedt

Subject: Re: preempt-rt, NUMA and strange latency traces

On Tue, 7 Feb 2006, Sébastien Dugué wrote:

> Hi,
>
> I've been experimenting with 2.6.15-rt16 on a dual 2.8GHz Xeon box
> with quite good results and decided to make a run on a NUMA dual node
> IBM x440 (8 1.4GHz Xeon, 28GB ram).
>
> However, the kernel crashes early when creating the slabs. Does the
> current preempt-rt patchset support NUMA machines or has support
> been disabled until things settle down?

Yeah, currently the -rt patch doesn't work well with NUMA.

>
> Going on, I compiled a non NUMA RT kernel which booted just fine,
> but when examining the latency traces, I came upon strange jumps
> in the latencies such as:
>
>
> <...>-6459 2D.h1 42us : rcu_pending (update_process_times)
> <...>-6459 2D.h1 42us : scheduler_tick (update_process_times)
> <...>-6459 2D.h1 43us : sched_clock (scheduler_tick)
> <...>-6459 2D.h1 44us!: _raw_spin_lock (scheduler_tick)
> <...>-6459 2D.h2 28806us : _raw_spin_unlock (scheduler_tick)
> <...>-6459 2D.h1 28806us : rebalance_tick (scheduler_tick)
> <...>-6459 2D.h1 28807us : irq_exit (smp_apic_timer_interrupt)
> <...>-6459 2D..1 28808us < (608)
> <...>-6459 2D..1 28809us : smp_apic_timer_interrupt (c03e2a02 0 0)
> <...>-6459 2D.h1 28810us : handle_nextevent_update (smp_apic_timer_interrupt)
> <...>-6459 2D.h1 28810us : hrtimer_interrupt (handle_nextevent_update)

Hmm, are the TSCs of the CPUs in sync? If not, you will get bad
measurements from the latency tracer. That is currently why we can't use
tracing with x86_64 x2 CPUs.

>
> There does not seem to be a precise code path leading to those jumps;
> they seem to appear anywhere. Furthermore, the jump always seems to be
> ~27 ms.
>
> I tried running on only 1 CPU, tried using the TSC instead of the cyclone
> timer but to no avail, the phenomenon is still there.

OK, this scares me. You get these with only one CPU? Is it still
compiled for SMP? If not, see if the latencies go away if you turn off
SMP.

>
> My test program consists only of a thread running at max RT priority doing
> a nanosleep().
>
> What could be going on here?

Good question. Could you send your .config?

-- Steve

2006-02-08 10:26:37

by Ingo Molnar

Subject: Re: preempt-rt, NUMA and strange latency traces

* Steven Rostedt wrote:

> > I've been experimenting with 2.6.15-rt16 on a dual 2.8GHz Xeon box
> > with quite good results and decided to make a run on a NUMA dual node
> > IBM x440 (8 1.4GHz Xeon, 28GB ram).
> >
> > However, the kernel crashes early when creating the slabs. Does the
> > current preempt-rt patchset support NUMA machines or has support
> > been disabled until things settle down?
>
> Yeah, currently the -rt patch doesn't work well with NUMA.

FYI, i've got a new port of upstream slab.c to -rt, which should work on
NUMA too. It'll be in -rt17 (later today or tomorrow).

Ingo

2006-02-08 10:42:41

by Sébastien Dugué

Subject: Re: preempt-rt, NUMA and strange latency traces

On Wed, 2006-02-08 at 04:41 -0500, Steven Rostedt wrote:
> On Tue, 7 Feb 2006, Sébastien Dugué wrote:
>
> > Hi,
> >
> > I've been experimenting with 2.6.15-rt16 on a dual 2.8GHz Xeon box
> > with quite good results and decided to make a run on a NUMA dual node
> > IBM x440 (8 1.4GHz Xeon, 28GB ram).
> >
> > However, the kernel crashes early when creating the slabs. Does the
> > current preempt-rt patchset support NUMA machines or has support
> > been disabled until things settle down?
>
> Yeah, currently the -rt patch doesn't work well with NUMA.

OK, I'll have to wait for that I guess.

>
> >
> > Going on, I compiled a non NUMA RT kernel which booted just fine,
> > but when examining the latency traces, I came upon strange jumps
> > in the latencies such as:
> >
> >
> > <...>-6459 2D.h1 42us : rcu_pending (update_process_times)
> > <...>-6459 2D.h1 42us : scheduler_tick (update_process_times)
> > <...>-6459 2D.h1 43us : sched_clock (scheduler_tick)
> > <...>-6459 2D.h1 44us!: _raw_spin_lock (scheduler_tick)
> > <...>-6459 2D.h2 28806us : _raw_spin_unlock (scheduler_tick)
> > <...>-6459 2D.h1 28806us : rebalance_tick (scheduler_tick)
> > <...>-6459 2D.h1 28807us : irq_exit (smp_apic_timer_interrupt)
> > <...>-6459 2D..1 28808us < (608)
> > <...>-6459 2D..1 28809us : smp_apic_timer_interrupt (c03e2a02 0 0)
> > <...>-6459 2D.h1 28810us : handle_nextevent_update (smp_apic_timer_interrupt)
> > <...>-6459 2D.h1 28810us : hrtimer_interrupt (handle_nextevent_update)
>
> Hmm, are the TSCs of the CPUs in sync? If not, you will get bad
> measurements from the latency tracer. That is currently why we can't use
> tracing with x86_64 x2 CPUs.
>
> >
> > There does not seem to be a precise code path leading to those jumps;
> > they seem to appear anywhere. Furthermore, the jump always seems to be
> > ~27 ms.
> >
> > I tried running on only 1 CPU, tried using the TSC instead of the cyclone
> > timer but to no avail, the phenomenon is still there.
>
> OK, this scares me. You get these with only one CPU? Is it still
> compiled for SMP? If not, see if the latencies go away if you turn off
> SMP.

Yes, the kernel is still compiled for SMP. A UP-compiled kernel did
not boot. I will try to fix it and go for it again.

>
> >
> > My test program consists only of a thread running at max RT priority doing
> > a nanosleep().
> >
> > What could be going on here?
>
> Good question. Could you send your .config?

The more I think about it, the more I tend to believe it's hardware
related. It seems as if the CPU just hangs for ~27 ms before
resuming processing.

Anyway, here is my .config. Maybe I've got something misconfigured
somewhere after all.

Thanks for replying.

Attachments:

config-2.6.15-rt16-no-numa.gz (7.43 kB)

2006-02-08 10:44:06

by Sébastien Dugué

Subject: Re: preempt-rt, NUMA and strange latency traces

On Wed, 2006-02-08 at 11:25 +0100, Ingo Molnar wrote:
> * Steven Rostedt wrote:
>
> > > I've been experimenting with 2.6.15-rt16 on a dual 2.8GHz Xeon box
> > > with quite good results and decided to make a run on a NUMA dual node
> > > IBM x440 (8 1.4GHz Xeon, 28GB ram).
> > >
> > > However, the kernel crashes early when creating the slabs. Does the
> > > current preempt-rt patchset support NUMA machines or has support
> > > been disabled until things settle down?
> >
> > Yeah, currently the -rt patch doesn't work well with NUMA.
>
> FYI, i've got a new port of upstream slab.c to -rt, which should work on
> NUMA too. It'll be in -rt17 (later today or tomorrow).
>
> Ingo

Great, I sure will try it.

Thanks.

Sébastien.

2006-02-08 16:49:24

by john stultz

Subject: Re: preempt-rt, NUMA and strange latency traces

On Wed, 2006-02-08 at 11:45 +0100, Sébastien Dugué wrote:
> The more I think about it, the more I tend to believe it's hardware
> related. It seems as if the CPU just hangs for ~27 ms before
> resuming processing.

That sounds like the ~30ms RSA-caused SMIs. Does this system have an
RSA1 or RSA2 card?

thanks
-john

2006-02-09 06:04:10

by Lee Revell

Subject: Re: preempt-rt, NUMA and strange latency traces

On Wed, 2006-02-08 at 11:45 +0100, Sébastien Dugué wrote:
> The more I think about it, the more I tend to believe it's hardware
> related. It seems as if the CPU just hangs for ~27 ms before
> resuming processing.

That would be an exceptionally long latency - you would probably notice
it if the mouse froze, VOIP dropped out, ping stops, etc for 30ms.

Lee

2006-02-09 11:08:46

by Sébastien Dugué

Subject: Re: preempt-rt, NUMA and strange latency traces

On Wed, 2006-02-08 at 08:49 -0800, john stultz wrote:
> On Wed, 2006-02-08 at 11:45 +0100, Sébastien Dugué wrote:
> > The more I think about it, the more I tend to believe it's hardware
> > related. It seems as if the CPU just hangs for ~27 ms before
> > resuming processing.
>
> That sounds like the ~30ms RSA-caused SMIs. Does this system have an
> RSA1 or RSA2 card?
>

Hi John,

I think it's an RSA1 card, but no certainty here. I don't think
the x440 came with the RSA2. Is there a way to check for sure without
unplugging everything (It's such a mess of cables behind the box)?

What's the SMI issue with the RSA? Could the RSA generate bursts of SMIs
that could be enough to freeze the CPUs?

When I ran those tests I was logged on the RSA (serial line). I will
try to run the tests again without being connected when I can manage to
get some CPU time (eh! shared machine)...

Thanks.

Sébastien.

2006-02-09 11:21:28

by Sébastien Dugué

Subject: Re: preempt-rt, NUMA and strange latency traces

On Thu, 2006-02-09 at 01:04 -0500, Lee Revell wrote:
> On Wed, 2006-02-08 at 11:45 +0100, Sébastien Dugué wrote:
> > The more I think about it, the more I tend to believe it's hardware
> > related. It seems as if the CPU just hangs for ~27 ms before
> > resuming processing.
>
> That would be an exceptionally long latency - you would probably notice
> it if the mouse froze, VOIP dropped out, ping stops, etc for 30ms.
>

It's a test machine and I use it remotely with console redirected so
no mouse, no RT applications aside from my silly nanosleep() loop. But
I do notice that that test sometimes takes more time (ie when I get
those weird latencies).

Sébastien.

2006-02-09 11:23:46

by Sébastien Dugué

Subject: Re: preempt-rt, NUMA and strange latency traces

2006-02-09 18:54:25

by john stultz

Subject: Re: preempt-rt, NUMA and strange latency traces

On Thu, 2006-02-09 at 12:11 +0100, Sébastien Dugué wrote:
> On Wed, 2006-02-08 at 08:49 -0800, john stultz wrote:
> > On Wed, 2006-02-08 at 11:45 +0100, Sébastien Dugué wrote:
> > > The more I think about it, the more I tend to believe it's hardware
> > > related. It seems as if the CPU just hangs for ~27 ms before
> > > resuming processing.
> >
> > That sounds like the ~30ms RSA-caused SMIs. Does this system have an
> > RSA1 or RSA2 card?
> >
>
> I think it's an RSA1 card, but no certainty here. I don't think
> the x440 came with the RSA2. Is there a way to check for sure without
> unplugging everything (It's such a mess of cables behind the box)?

No, you're right, the x445 was the box that could have either RSA1 or
RSA2. The x440 can only have an RSA1.

> What's the SMI issue with the RSA? Could the RSA generate bursts of SMIs
> that could be enough to freeze the CPUs?

SMIs effectively freeze all cpus while the BIOS code executes. With the
RSA1, there are two main causes of SMIs:
1) Periodic SMI: Which occurs every 15 minutes as the RSA checks various
hardware status and lasts for ~30ms.

2) Console Redirection SMIs: This occurs only when you use the console
redirection feature from the RSA. You'll see ~30ms stalls ~once a second
while the SMI screen-scrapes the console text buffer and sends it out
over the serial line.

The console redirection one is easy to avoid: just don't use that
feature if you care about low latencies. The periodic SMI is more
difficult to work around. With RSA2 based systems, there is a BIOS
update that allows you to disable this functionality (and on newer
systems it ships disabled), however I don't believe there is any such
feature for the older RSA1 based systems.

> When I ran those tests I was logged on the RSA (serial line). I will
> try to run the tests again without being connected when I can manage to
> get some CPU time (eh! shared machine)...

Yes, not using the console redirection feature will definitely help the
situation, but there still will be the possibility of ~30ms stalls every
15 minutes on the x440.

thanks
-john
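
Since an SMI stalls the CPU without the kernel noticing, one way to look for such stalls from userspace is to spin reading the clock and record the largest gap between two consecutive readings. The sketch below assumes Linux and CLOCK_MONOTONIC; the function name `largest_gap_ns` is hypothetical. Note the limitation: a large gap can also come from ordinary preemption or interrupt handling, so this only suggests an SMI when it is run at real-time priority on an otherwise idle machine and the gaps recur at the intervals described above (about once a second with console redirection, or every 15 minutes).

```c
#define _POSIX_C_SOURCE 200809L
#include <time.h>

/* Nanoseconds elapsed between two CLOCK_MONOTONIC readings. */
static long long ts_diff_ns(const struct timespec *a, const struct timespec *b)
{
    return (long long)(b->tv_sec - a->tv_sec) * 1000000000LL
         + (b->tv_nsec - a->tv_nsec);
}

/*
 * Busy-poll the clock for `duration_ns` and return the largest gap, in
 * nanoseconds, observed between two back-to-back readings. On a quiet
 * system the gap is normally tiny; a ~30 ms result means the CPU made
 * no progress for that long.
 */
long long largest_gap_ns(long long duration_ns)
{
    struct timespec start, prev, now;
    long long worst = 0;

    clock_gettime(CLOCK_MONOTONIC, &start);
    prev = start;
    do {
        long long gap;

        clock_gettime(CLOCK_MONOTONIC, &now);
        gap = ts_diff_ns(&prev, &now);
        if (gap > worst)
            worst = gap;
        prev = now;
    } while (ts_diff_ns(&start, &now) < duration_ns);

    return worst;
}
```

For example, `largest_gap_ns(5000000000LL)` polls for five seconds, which should be long enough to catch several console-redirection stalls if they occur about once a second.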

2006-02-09 20:02:52

by Lee Revell

Subject: Re: preempt-rt, NUMA and strange latency traces

On Thu, 2006-02-09 at 12:24 +0100, Sébastien Dugué wrote:
> On Thu, 2006-02-09 at 01:04 -0500, Lee Revell wrote:
> > On Wed, 2006-02-08 at 11:45 +0100, Sébastien Dugué wrote:
> > > The more I think about it, the more I tend to believe it's hardware
> > > related. It seems as if the CPU just hangs for ~27 ms before
> > > resuming processing.
> >
> > That would be an exceptionally long latency - you would probably notice
> > it if the mouse froze, VOIP dropped out, ping stops, etc for 30ms.
> >
>
> It's a test machine and I use it remotely with console redirected so
> no mouse, no RT applications aside from my silly nanosleep() loop. But
> I do notice that that test sometimes takes more time (ie when I get
> those weird latencies).

Argh. You would think the vendors would consider a 30ms delay
unacceptable. This is big enough to show up on an MRTG graph of ping
times ferchrissake.

I guess the assumption is that most hardware will never be used for even
soft RT work...

Lee

2006-02-10 13:04:43

by Sébastien Dugué

Subject: Re: preempt-rt, NUMA and strange latency traces

On Thu, 2006-02-09 at 10:54 -0800, john stultz wrote:
> >
> > I think it's an RSA1 card, but no certainty here. I don't think
> > the x440 came with the RSA2. Is there a way to check for sure without
> > unplugging everything (It's such a mess of cables behind the box)?
>
> No, you're right, the x445 was the box that could have either RSA1 or
> RSA2. The x440 can only have an RSA1.
>
>
> > What's the SMI issue with the RSA? Could the RSA generate bursts of SMIs
> > that could be enough to freeze the CPUs?
>
> SMIs effectively freeze all cpus while the BIOS code executes. With the
> RSA1, there are two main causes of SMIs:
> 1) Periodic SMI: Which occurs every 15 minutes as the RSA checks various
> hardware status and lasts for ~30ms.
>
> 2) Console Redirection SMIs: This occurs only when you use the console
> redirection feature from the RSA. You'll see ~30ms stalls ~once a second
> while the SMI screen-scrapes the console text buffer and sends it out
> over the serial line.

That must be what I've been hitting in my tests and will check soon.

>
> The console redirection one is easy to avoid: just don't use that
> feature if you care about low latencies. The periodic SMI is more
> difficult to work around. With RSA2 based systems, there is a BIOS
> update that allows you to disable this functionality (and on newer
> systems it ships disabled), however I don't believe there is any such
> feature for the older RSA1 based systems.

I'll have to live with it then.

>
>
> > When I ran those tests I was logged on the RSA (serial line). I will
> > try to run the tests again without being connected when I can manage to
> > get some CPU time (eh! shared machine)...
>
> Yes, not using the console redirection feature will definitely help the
> situation, but there still will be the possibility of ~30ms stalls every
> 15 minutes on the x440.

That's the problem and I'll have to synchronize for a proper window
for running my tests.

Is this SMI thing IBM's eyes only stuff or is it documented somewhere?

Thanks for that valuable info.

Sébastien.

2006-02-10 13:16:56

by Sébastien Dugué

Subject: Re: preempt-rt, NUMA and strange latency traces

On Thu, 2006-02-09 at 15:02 -0500, Lee Revell wrote:
> On Thu, 2006-02-09 at 12:24 +0100, Sébastien Dugué wrote:
> > On Thu, 2006-02-09 at 01:04 -0500, Lee Revell wrote:
> > > On Wed, 2006-02-08 at 11:45 +0100, Sébastien Dugué wrote:
> > > > The more I think about it, the more I tend to believe it's hardware
> > > > related. It seems as if the CPU just hangs for ~27 ms before
> > > > resuming processing.
> > >
> > > That would be an exceptionally long latency - you would probably notice
> > > it if the mouse froze, VOIP dropped out, ping stops, etc for 30ms.
> > >
> >
> > It's a test machine and I use it remotely with console redirected so
> > no mouse, no RT applications aside from my silly nanosleep() loop. But
> > I do notice that that test sometimes takes more time (ie when I get
> > those weird latencies).
>
> Argh. You would think the vendors would consider a 30ms delay
> unacceptable. This is big enough to show up on an MRTG graph of ping
> times ferchrissake.
>
> I guess the assumption is that most hardware will never be used for even
> soft RT work...
>
> Lee
>

That may be but in that case I may be pushing it a bit far testing
that kind of box with realtime stuff.

As a former hw designer I find it useful to have some hardware
monitoring capabilities on a system but it should either not be so
intrusive or at least we should be able to disable it.

Sébastien.

2006-02-10 19:08:00

by Lee Revell

Subject: Re: preempt-rt, NUMA and strange latency traces

On Fri, 2006-02-10 at 14:07 +0100, Sébastien Dugué wrote:
> That's the problem and I'll have to synchronize for a proper window
> for running my tests.
>
> Is this SMI thing IBM's eyes only stuff or is it documented
> somewhere?
>

No, I wish it was just IBM. Lots of modern systems unfortunately use
SMM (system management mode, which is entered when we get a system
management interrupt) to implement all kinds of PM and BIOS junk and
various value added features.

Most of these systems are laptops and are utterly unusable for low
latency work. Worse, most of this stuff is considered highly
proprietary by the vendors (probably because it leads to such suckage).

Lee