2006-02-23 19:56:07

by Gautam Thaker

Subject: ~5x greater CPU load for a networked application when using 2.6.15-rt15-smp vs. 2.6.12-1.1390_FC4

The real-time patches at the URL below do a great job of endowing Linux with
real-time capabilities.

http://people.redhat.com/mingo/realtime-preempt/

It has been documented before (and accepted) that this patch turns Linux into
an RT kernel but considerably slows down the code paths, especially through
the I/O subsystem. I want to provide some additional measurements and seek
opinions on whether it might ever be possible to improve on this situation.

In my tests I used 20 3 GHz Intel Xeon PCs on an isolated gigabit network.
One of the nodes runs a "monitor" process that listens to incoming UDP packets
from the other 19 nodes. Each node sends approximately 2,000 UDP
packets/sec to the monitor process, for a total of about 38,000 incoming UDP
packets/sec. These UDP packets are small, with an application payload of ~10
bytes, for total bandwidth usage of less than 4 Mbits/sec at the application
level and less than 15 Mbits/sec counting all headers. (Total bandwidth usage
is not high, but a large number of packets are coming in.) The monitor process
does some fairly simple processing per packet.

I measured the CPU usage of the "monitor" process when the testbed was run
with two different operating systems. The monitor process is the "nalive.p"
process in the "top" output below. The CPU load is fairly stable, and "top"
gives the following information:

::::::::::::::
top: 2.6.12-1.1390_FC4 # STANDARD KERNEL
::::::::::::::
top - 14:34:39 up 2:32, 2 users, load average: 0.10, 0.05, 0.01
Tasks: 56 total, 2 running, 54 sleeping, 0 stopped, 0 zombie
top - 14:35:32 up 2:33, 2 users, load average: 0.11, 0.06, 0.01
Tasks: 56 total, 2 running, 54 sleeping, 0 stopped, 0 zombie
Cpu(s): 1.4% us, 7.0% sy, 0.0% ni, 80.8% id, 0.2% wa, 7.0% hi, 3.6% si
Mem: 2076008k total, 100292k used, 1975716k free, 16192k buffers
Swap: 128512k total, 0k used, 128512k free, 50376k cached

PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
4823 root -66 0 22712 2236 1484 S 8.4 0.1 0:37.74 nalive.p
4860 gthaker 16 0 7396 2380 1904 R 0.2 0.1 0:00.04 sshd
1 root 16 0 1748 572 492 S 0.0 0.0 0:01.06 init
2 root 34 19 0 0 0 S 0.0 0.0 0:00.00 ksoftirqd/0
3 root 10 -5 0 0 0 S 0.0 0.0 0:00.00 events/0


::::::::::::::
top: 2.6.15-rt15-smp.out # REAL_TIME KERNEL
::::::::::::::
node0> top
top - 09:52:48 up 1:47, 3 users, load average: 0.91, 1.05, 1.02
Tasks: 98 total, 1 running, 97 sleeping, 0 stopped, 0 zombie
Cpu(s): 2.5% us, 41.8% sy, 0.0% ni, 55.6% id, 0.1% wa, 0.0% hi, 0.0% si
Mem: 2058608k total, 88104k used, 1970504k free, 9072k buffers
Swap: 128512k total, 0k used, 128512k free, 39208k cached

PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
2906 root -66 0 18624 2244 1480 S 41.4 0.1 27:11.21 nalive.p
6 root -91 0 0 0 0 S 32.3 0.0 21:04.53 softirq-net-rx/
1379 root -40 -5 0 0 0 S 14.5 0.0 9:54.76 IRQ 23
400 root 15 0 0 0 0 S 0.2 0.0 0:00.13 kjournald
1 root 16 0 1740 564 488 S 0.0 0.0 0:04.03 init

The %CPU is at 8% for the non-real-time, uniprocessor kernel, while it is at
least 41% (and may be as high as 41.4% + 32.3% + 14.5% = 88%) for the
real-time SMP kernel.


My question is this: how much improvement in raw efficiency is possible for
the real-time patches? We take a very long view, so if there is a belief that
in 5 years the penalty will be reduced from 5-10x in this application to less
than 2x, that would be great. If we think this is about as good as can be
done, it helps to know that too.

There is nothing else going on on these machines; all code paths should be
going down the "happy path" with no contention or blocking. My naive view is
that a 2x overhead is plausible, but 5-10x is harder to understand. Nor is
this a case of hitting some large non-preemptible region, since real-time
performance is excellent; the question is why the code paths seem so "heavy".

Gautam Thaker


2006-02-23 20:20:06

by Benjamin LaHaise

Subject: Re: ~5x greater CPU load for a networked application when using 2.6.15-rt15-smp vs. 2.6.12-1.1390_FC4

On Thu, Feb 23, 2006 at 02:55:56PM -0500, Gautam H Thaker wrote:
> It has been documented before (and accepted) that this patch turns Linux into
> an RT kernel but considerably slows down the code paths, especially through
> the I/O subsystem. I want to provide some additional measurements and seek
> opinions on whether it might ever be possible to improve on this situation.

32 bit kernel or 64 bit kernel? What about profiling the system with
oprofile?

-ben
--
"Ladies and gentlemen, I'm sorry to interrupt, but the police are here
and they've asked us to stop the party." Don't Email: <[email protected]>.

2006-02-23 21:00:19

by Ingo Molnar

Subject: Re: ~5x greater CPU load for a networked application when using 2.6.15-rt15-smp vs. 2.6.12-1.1390_FC4


* Gautam H Thaker <[email protected]> wrote:

> ::::::::::::::
> top: 2.6.15-rt15-smp.out # REAL_TIME KERNEL
> ::::::::::::::

> 2906 root -66 0 18624 2244 1480 S 41.4 0.1 27:11.21 nalive.p
> 6 root -91 0 0 0 0 S 32.3 0.0 21:04.53 softirq-net-rx/
> 1379 root -40 -5 0 0 0 S 14.5 0.0 9:54.76 IRQ 23

One effect of the -rt kernel is that it shows IRQ load explicitly -
while the stock kernel can 'hide' it because there interrupts run
'atomically', making it hard to measure the true system overhead. The
-rt kernel will likely show more overhead, but i'd not expect this
amount of overhead.

To figure out the true overhead of both kernels, could you try the
attached loop_print_thread.c code and run it on an idle non-rt kernel,
an idle -rt kernel, a busy non-rt kernel, and a busy -rt kernel, and
send me the typical/average loops/sec value you are getting?

Furthermore, there have been some tasklet related fixes in 2.6.15-rt17,
which maybe could improve this workload. Maybe ...

Also, would there be some easy way for me to reproduce that workload?
Possibly some .c code you could send that is easy to run on the server
and the client to reproduce the guts of this workload?

Ingo


Attachments:
(No filename) (1.23 kB)
loop_print_thread.c (2.05 kB)

2006-02-23 21:06:56

by Nish Aravamudan

Subject: Re: ~5x greater CPU load for a networked application when using 2.6.15-rt15-smp vs. 2.6.12-1.1390_FC4

On 2/23/06, Ingo Molnar <[email protected]> wrote:
>
> * Gautam H Thaker <[email protected]> wrote:
>
> > ::::::::::::::
> > top: 2.6.15-rt15-smp.out # REAL_TIME KERNEL
> > ::::::::::::::
>
> > 2906 root -66 0 18624 2244 1480 S 41.4 0.1 27:11.21 nalive.p
> > 6 root -91 0 0 0 0 S 32.3 0.0 21:04.53 softirq-net-rx/
> > 1379 root -40 -5 0 0 0 S 14.5 0.0 9:54.76 IRQ 23
>
> One effect of the -rt kernel is that it shows IRQ load explicitly -
> while the stock kernel can 'hide' it because there interrupts run
> 'atomically', making it hard to measure the true system overhead. The
> -rt kernel will likely show more overhead, but i'd not expect this
> amount of overhead.
>
> To figure out the true overhead of both kernels, could you try the
> attached loop_print_thread.c code and run it on an idle non-rt kernel,
> an idle -rt kernel, a busy non-rt kernel, and a busy -rt kernel, and
> send me the typical/average loops/sec value you are getting?
>
> Furthermore, there have been some tasklet related fixes in 2.6.15-rt17,
> which maybe could improve this workload. Maybe ...

Would it make more sense to compare 2.6.15 and 2.6.15-rt17, as opposed
to 2.6.12-1.1390_FC4 and 2.6.15-rt17? Seems like the closer the two
kernels are, the easier it will be to isolate the differences.

Thanks,
Nish

2006-02-23 21:10:12

by Ingo Molnar

Subject: Re: ~5x greater CPU load for a networked application when using 2.6.15-rt15-smp vs. 2.6.12-1.1390_FC4


* Nish Aravamudan <[email protected]> wrote:

> Would it make more sense to compare 2.6.15 and 2.6.15-rt17, as opposed
> to 2.6.12-1.1390_FC4 and 2.6.15-rt17? Seems like the closer the two
> kernels are, the easier it will be to isolate the differences.

good point. I'd expect there to be similar 'top' output, but still worth
doing for comparable results.

Ingo

2006-02-23 21:14:36

by Nish Aravamudan

Subject: Re: ~5x greater CPU load for a networked application when using 2.6.15-rt15-smp vs. 2.6.12-1.1390_FC4

On 2/23/06, Ingo Molnar <[email protected]> wrote:
>
> * Nish Aravamudan <[email protected]> wrote:
>
> > Would it make more sense to compare 2.6.15 and 2.6.15-rt17, as opposed
> > to 2.6.12-1.1390_FC4 and 2.6.15-rt17? Seems like the closer the two
> > kernels are, the easier it will be to isolate the differences.
>
> good point. I'd expect there to be similar 'top' output, but still worth
> doing for comparable results.

I'd also expect little difference (hopefully) -- although there's
always an off-chance something big changed somewhere and the problem
was fixed in mainline. Just makes the comparison clearer.

Thanks,
Nish

2006-02-23 22:07:52

by Esben Nielsen

Subject: Re: ~5x greater CPU load for a networked application when using 2.6.15-rt15-smp vs. 2.6.12-1.1390_FC4

When PREEMPT_RT has settled down, I propose that many of the irq handlers
be moved back to "raw" irq context. I am sure many of them are so short
that it won't increase latency. It is always a balance between required
latency and performance. Basically, the rule is that any irq handler that
runs for less than the required latency, and fires rarely enough not
to take a significant part of the CPU load, should run in irq context.

Now, the kernel hacker doesn't know how long various irq handlers run
on a specific piece of hardware, nor which latencies the application needs.
Therefore it has to be a config option per driver. The driver's locks
of course also need to change from rt_lock to raw_spin_lock depending on
that option. Thus a macro framework is needed that picks the lock type
matching the chosen irq-handler context.

For the issue here, I am pretty sure that changing the ethernet driver from
running in task context to raw irq context will improve performance.
What you need to measure as well is how it influences latencies. There
is a good chance you won't be able to measure any difference, because so
little work is actually done in irq context; most of the work is done
by the DMA controller.

Esben

On Thu, 23 Feb 2006, Nish Aravamudan wrote:

> On 2/23/06, Ingo Molnar <[email protected]> wrote:
> >
> > * Nish Aravamudan <[email protected]> wrote:
> >
> > > Would it make more sense to compare 2.6.15 and 2.6.15-rt17, as opposed
> > > to 2.6.12-1.1390_FC4 and 2.6.15-rt17? Seems like the closer the two
> > > kernels are, the easier it will be to isolate the differences.
> >
> > good point. I'd expect there to be similar 'top' output, but still worth
> > doing for comparable results.
>
> I'd also expect little difference (hopefully) -- although there's
> always an off-chance something big changed somewhere and the problem
> was fixed in mainline. Just makes the comparison clearer.
>
> Thanks,
> Nish

2006-02-24 08:03:37

by Jan Engelhardt

Subject: Re: ~5x greater CPU load for a networked application when using 2.6.15-rt15-smp vs. 2.6.12-1.1390_FC4

>> Would it make more sense to compare 2.6.15 and 2.6.15-rt17, as opposed
>> to 2.6.12-1.1390_FC4 and 2.6.15-rt17? Seems like the closer the two
>> kernels are, the easier it will be to isolate the differences.
>
>good point. I'd expect there to be similar 'top' output, but still worth
>doing for comparable results.
>
I have seen this before too (with earlier -rt's), when MPlayer jumped from
1.8% to about 10%. Maybe because it's using the rtc at 1024 Hz?


Jan Engelhardt
--

2006-02-24 12:12:31

by Andrew Morton

Subject: Re: ~5x greater CPU load for a networked application when using 2.6.15-rt15-smp vs. 2.6.12-1.1390_FC4

Ingo Molnar <[email protected]> wrote:
>
> To figure out the true overhead of both kernels, could you try the
> attached loop_print_thread.c code
>

http://www.zip.com.au/~akpm/linux/#zc <- better ;)

2006-02-24 16:52:13

by Theodore Ts'o

Subject: Re: ~5x greater CPU load for a networked application when using 2.6.15-rt15-smp vs. 2.6.12-1.1390_FC4

On Thu, Feb 23, 2006 at 02:55:56PM -0500, Gautam H Thaker wrote:
> The real-time patches at the URL below do a great job of endowing Linux with
> real-time capabilities.
>
> http://people.redhat.com/mingo/realtime-preempt/

Gautam,

#1) Can you publish the code you used in your tests?

#2) Can you post your .config file? In particular, did you have any
of the latency measurement options or other debugging options?

Regards,

- Ted

2006-02-24 19:25:55

by Gautam Thaker

Subject: Re: ~5x greater CPU load for a networked application when using 2.6.15-rt15-smp vs. 2.6.12-1.1390_FC4

Theodore Ts'o wrote:
> On Thu, Feb 23, 2006 at 02:55:56PM -0500, Gautam H Thaker wrote:
>
>>The real-time patches at the URL below do a great job of endowing Linux with
>>real-time capabilities.
>>
>>http://people.redhat.com/mingo/realtime-preempt/
>
>
> Gautam,
>
> #1) Can you publish the code you used in your tests?

This may not be easy for me, but I will try to get corporate approval(s).
Basically, the process that is, at least according to "top", showing ~5x
increased CPU usage is receiving very short UDP packets over a gigabit
interface at a rate of about 38,000 per second. The UDP packets are small,
and according to "/sbin/ifconfig" there are no errors, drops, overruns, frame
or carrier errors, or collisions. (It is an isolated network of 20 PC3000s
(3 GHz Xeon processors) at http://www.emulab.net.)

>
> #2) Can you post your .config file? In particular, did you have any
> of the latency measurement options or other debugging options?

The config file I had used to build the "RT" kernel can be found at:

http://www.atl.external.lmco.com/projects/QoS/config.2.6.15-rt15-smp

I had tried to have all debug options off.

>
> Regards,
>
> - Ted


Gautam

--

Gautam H. Thaker
Distributed Processing Lab; Lockheed Martin Adv. Tech. Labs
3 Executive Campus; Cherry Hill, NJ 08002
856-792-9754, fax 856-792-9925 email: [email protected]

2006-02-24 20:06:52

by Gautam Thaker

Subject: Re: ~5x greater CPU load for a networked application when using 2.6.15-rt15-smp vs. 2.6.12-1.1390_FC4

Andrew Morton wrote:
> Ingo Molnar <[email protected]> wrote:
>
>>To figure out the true overhead of both kernels, could you try the
>> attached loop_print_thread.c code

> http://www.zip.com.au/~akpm/linux/#zc <- better ;)

Andrew,

I read the README for the "zc" tests. I hope Ingo can opine on which may be a
better test. Also, I assume that I can run "zcs" and "zcc" on the same
machine. I would do the tests with "send" instead of "sendfile".

I also have some other test data. The graphical summary result can be viewed
at this link:

http://www.atl.external.lmco.com/projects/QoS/LM_ATL_MW_Comparator_7920.png

In these tests I used a single dual-processor 3 GHz Intel Xeon machine with 4
different kernels, all based on 2.6.14:

2.6.14 Uniprocessor kernel
2.6.14-rt22 Uniprocessor kernel w/ RT patches
2.6.14-smp SMP kernel
2.6.14-rt22-smp SMP kernel w/ RT patches.


The test is similar to the "zcs"/"zcc" tests. In my tests a client process
opens a TCP connection to the server process (all on the same machine) and
sends it 10,000,000 messages of sizes 4 bytes, 8 bytes, 16 bytes, ...,
32 Kbytes, 64 Kbytes. The server sends back a 1-byte reply. The client
measures roundtrip latencies, and the graphic shows the means. Since the
measurements are taken over so many samples, I believe that the large
differences in mean latencies capture the relative CPU consumption of the
various kernels. (This being loopback, there are no NIC card issues or the
like.) One notices a 3:1 ratio here from the uniprocessor non-RT kernel to
the SMP RT kernel. The RT kernel has nice real-time properties, and there is
a lot of pressure in our systems to use the SMP hardware of multicore
machines, and in some cases we can even live with a 3x slowdown (since real
applications do more than just I/O), but when I started to note 5x (or more)
in my newer tests I thought I would at least post something.

I suspect that "zcs"/"zcc" tests would pretty much show the same conclusions
as this graphic.

Gautam H. Thaker
Distributed Processing Lab; Lockheed Martin Adv. Tech. Labs
3 Executive Campus; Cherry Hill, NJ 08002
856-792-9754, fax 856-792-9925 email: [email protected]

2006-02-24 20:32:36

by Andrew Morton

Subject: Re: ~5x greater CPU load for a networked application when using 2.6.15-rt15-smp vs. 2.6.12-1.1390_FC4

Gautam H Thaker <[email protected]> wrote:
>
> > http://www.zip.com.au/~akpm/linux/#zc <- better ;)
>
> Andrew,
>
> I read the README for the "zc" tests. I hope Ingo can opine on which may be a
> better test. Also, I assume that I can run "zcs" and "zcc" on the same
> machine. I would do the tests with "send" instead of "sendfile".

Oh. I don't actually remember what zc does. I was actually referring to
`cyclesoak', which has proven to be a pretty accurate (or at least,
sensitive and repeatable) way of determining overall per-CPU system load.

2006-02-24 20:44:56

by Gautam Thaker

Subject: Re: ~5x greater CPU load for a networked application when using 2.6.15-rt15-smp vs. 2.6.12-1.1390_FC4

Andrew Morton wrote:
> Gautam H Thaker <[email protected]> wrote:
>
>>>http://www.zip.com.au/~akpm/linux/#zc <- better ;)
>>
>> Andrew,
>>
>> I read the README for the "zc" tests. I hope Ingo can opine on which may be a
>> better test. Also, I assume that I can run "zcs" and "zcc" on the same
>> machine. I would do the tests with "send" instead of "sendfile".
>
>
> Oh. I don't actually remember what zc does. I was actually referring to
> `cyclesoak', which has proven to be a pretty accurate (or at least,
> sensitive and repeatable) way of determining overall per-CPU system load.

Yes, I should have been clearer. I meant that perhaps I should use
the 4 combinations of OS configs (non-RT/RT x UniProc/SMP) with zc and
cyclesoak rather than do a 20-node test, but I believe I will need many nodes
sending to my one "monitor" node to reach this high packet receive rate of
about 38,000/second. Lower rates involving only a single machine should still
be capable of revealing conclusively that RT-SMP kernels are some factor
heavier than the non-RT uniprocessor kernel. Anyway, I will do the tests.

--

Gautam H. Thaker
Distributed Processing Lab; Lockheed Martin Adv. Tech. Labs
3 Executive Campus; Cherry Hill, NJ 08002
856-792-9754, fax 856-792-9925 email: [email protected]

2006-02-28 19:27:39

by Matt Mackall

Subject: Re: ~5x greater CPU load for a networked application when using 2.6.15-rt15-smp vs. 2.6.12-1.1390_FC4

On Thu, Feb 23, 2006 at 02:55:56PM -0500, Gautam H Thaker wrote:
> The real-time patches at the URL below do a great job of endowing Linux with
> real-time capabilities.
>
> http://people.redhat.com/mingo/realtime-preempt/
>
> It has been documented before (and accepted) that this patch turns Linux into
> an RT kernel but considerably slows down the code paths, especially through
> the I/O subsystem. I want to provide some additional measurements and seek
> opinions on whether it might ever be possible to improve on this situation.

Are you using the SLAB or SLOB allocator in the -rt kernel?

--
Mathematics is the supreme nostalgia of our time.

2006-02-28 22:19:12

by Gautam Thaker

Subject: Re: ~5x greater CPU load for a networked application when using 2.6.15-rt15-smp vs. 2.6.12-1.1390_FC4

Matt Mackall wrote:
> On Thu, Feb 23, 2006 at 02:55:56PM -0500, Gautam H Thaker wrote:
>
>>The real-time patches at the URL below do a great job of endowing Linux with
>>real-time capabilities.
>>
>>http://people.redhat.com/mingo/realtime-preempt/
>>
>>It has been documented before (and accepted) that this patch turns Linux into
>>an RT kernel but considerably slows down the code paths, especially through
>>the I/O subsystem. I want to provide some additional measurements and seek
>>opinions on whether it might ever be possible to improve on this situation.
>
>
> Are you using the SLAB or SLOB allocator in the -rt kernel?

lake> grep SL config.2.6.15-rt15-smp
CONFIG_SEMAPHORE_SLEEPERS=y
CONFIG_SLAB=y
# CONFIG_SLOB is not set




--

Gautam H. Thaker
Distributed Processing Lab; Lockheed Martin Adv. Tech. Labs
3 Executive Campus; Cherry Hill, NJ 08002
856-792-9754, fax 856-792-9925 email: [email protected]