2006-09-07 18:29:58

by Metathronius Galabant

[permalink] [raw]
Subject: top displaying 50% si time and 50% idle on idle machine

Hi,

Kernel 2.6.17.11 with tg3 network driver, NAPI enabled (Distro CentOS 4.4).
top shows strangely 50% idle even if the machine is _completely_ idle.

top - 01:04:30 up 4 days, 12:05, 2 users, load average: 0.00, 0.00, 0.00
Tasks: 34 total, 2 running, 32 sleeping, 0 stopped, 0 zombie
Cpu(s): 0.0% us, 0.0% sy, 0.0% ni, 50.0% id, 0.0% wa, 0.0% hi, 50.0% si
Mem: 3634452k total, 313284k used, 3321168k free, 71308k buffers
Swap: 505896k total, 0k used, 505896k free, 220272k cached

I find this pretty alarming - can somebody please enlighten me?
Please include me on CC.
Thanks,
M.


2006-09-08 00:53:28

by Andrew Morton

[permalink] [raw]
Subject: Re: top displaying 50% si time and 50% idle on idle machine

On Thu, 7 Sep 2006 20:08:53 +0200
"Metathronius Galabant" <[email protected]> wrote:

> Hi,
>
> Kernel 2.6.17.11 with tg3 network driver, NAPI enabled (Distro CentOS 4.4).
> top shows strangely 50% idle even if the machine is _completely_ idle.
>
> top - 01:04:30 up 4 days, 12:05, 2 users, load average: 0.00, 0.00, 0.00
> Tasks: 34 total, 2 running, 32 sleeping, 0 stopped, 0 zombie
> Cpu(s): 0.0% us, 0.0% sy, 0.0% ni, 50.0% id, 0.0% wa, 0.0% hi, 50.0% si
> Mem: 3634452k total, 313284k used, 3321168k free, 71308k buffers
> Swap: 505896k total, 0k used, 505896k free, 220272k cached
>
> I find this pretty alarming - can somebody please enlighten me?
> Please include me on CC.

Do `ps aux', look for a process stuck in D state.

Do

echo t > /proc/sysrq-trigger
dmesg -s 1000000 > foo

then edit foo, search for the process in D state (look for " D ") and send
that process's backtrace record.

Thanks.

2006-09-08 21:03:34

by Metathronius Galabant

[permalink] [raw]
Subject: Re: top displaying 50% si time and 50% idle on idle machine

>> Kernel 2.6.17.11 with tg3 network driver, NAPI enabled (Distro CentOS 4.4).
>> top shows strangely 50% idle even if the machine is _completely_ idle.
>>
>> top - 01:04:30 up 4 days, 12:05, 2 users, load average: 0.00, 0.00, 0.00
>> Tasks: 34 total, 2 running, 32 sleeping, 0 stopped, 0 zombie
>> Cpu(s): 0.0% us, 0.0% sy, 0.0% ni, 50.0% id, 0.0% wa, 0.0% hi, 50.0%si

> Do `ps aux', look for a process stuck in D state.

The issue that startled me: there is _NO_ process in a D state!
BTW what means si? (interrupt service time? google didn't find anything)

ps auxwm output following:

USER PID %CPU %MEM VSZ RSS TTY STAT START TIME COMMAND
root 1 0.0 0.0 1692 572 ? - Sep04 0:00 init [3]

root - 0.0 - - - - Ss Sep04 0:00 -
root 2 0.0 0.0 0 0 ? - Sep04 0:00 [ksoftirqd/0]
root - 0.0 - - - - SN Sep04 0:00 -
root 3 0.0 0.0 0 0 ? - Sep04 0:00 [events/0]
root - 0.0 - - - - S< Sep04 0:00 -
root 4 0.0 0.0 0 0 ? - Sep04 0:00 [khelper]
root - 0.0 - - - - S< Sep04 0:00 -
root 5 0.0 0.0 0 0 ? - Sep04 0:00 [kthread]
root - 0.0 - - - - S< Sep04 0:00 -
root 7 0.0 0.0 0 0 ? - Sep04 0:00 [kblockd/0]
root - 0.0 - - - - S< Sep04 0:00 -
root 8 0.0 0.0 0 0 ? - Sep04 0:00 [kseriod]
root - 0.0 - - - - S< Sep04 0:00 -
root 70 0.0 0.0 0 0 ? - Sep04 0:00 [pdflush]
root - 0.0 - - - - S Sep04 0:00 -
root 71 0.0 0.0 0 0 ? - Sep04 0:00 [pdflush]
root - 0.0 - - - - S Sep04 0:00 -
root 72 0.0 0.0 0 0 ? - Sep04 0:00 [kswapd0]
root - 0.0 - - - - S Sep04 0:00 -
root 73 0.0 0.0 0 0 ? - Sep04 0:00 [aio/0]
root - 0.0 - - - - S< Sep04 0:00 -
root 672 0.0 0.0 0 0 ? - Sep04 0:00 [kpsmoused]
root - 0.0 - - - - S< Sep04 0:00 -
root 685 0.0 0.0 0 0 ? - Sep04 0:00 [kjournald]
root - 0.0 - - - - S< Sep04 0:00 -
root 1666 0.0 0.0 1588 464 ? - Sep04 0:00 udevd
root - 0.0 - - - - S<s Sep04 0:00 -
root 2065 0.0 0.0 0 0 ? - Sep04 0:00 [kjournald]
root - 0.0 - - - - S< Sep04 0:00 -
root 2073 0.0 0.0 0 0 ? - Sep04 0:00 [kjournald]
root - 0.0 - - - - S< Sep04 0:00 -
root 2763 0.0 0.0 2484 928 ? - Sep04 0:00 crond
root - 0.0 - - - - Ss Sep04 0:00 -
root 2779 0.0 0.0 1532 408 tty2 - Sep04 0:00
/sbin/mingetty tty2
root - 0.0 - - - - Ss+ Sep04 0:00 -
root 2780 0.0 0.0 1536 416 tty3 - Sep04 0:00
/sbin/mingetty tty3
root - 0.0 - - - - Ss+ Sep04 0:00 -
root 4495 0.0 0.0 2904 1236 ? - Sep04 0:00 login -- root
root - 0.0 - - - - Ss Sep04 0:00 -
root 6582 0.0 0.0 0 0 ? - Sep04 0:00 [reiserfs/0]
root - 0.0 - - - - S< Sep04 0:00 -
root 7406 0.0 0.0 2344 1380 tty1 - Sep06 0:00 -bash
root - 0.0 - - - - Ss+ Sep06 0:00 -
root 8158 0.0 0.0 4076 1132 ? - Sep06 0:01 /usr/sbin/sshd
root - 0.0 - - - - Ss Sep06 0:01 -
root 10342 0.0 0.0 2244 608 ? - Sep08 0:03 syslogd -m 0
root - 0.0 - - - - Ss Sep08 0:03 -
root 10347 0.0 0.0 1672 384 ? - Sep08 0:00 klogd -x
root - 0.0 - - - - Ss Sep08 0:00 -
root 16078 0.0 0.0 8040 2240 ? - Sep10 0:00 sshd: root@pts/0
root - 0.0 - - - - Ss Sep10 0:00 -
root 16089 0.0 0.0 2516 1256 pts/0 - Sep10 0:00 -bash
root - 0.0 - - - - Ss Sep10 0:00 -
nscd 24962 0.0 0.0 106476 840 ? - 03:43 0:00 /usr/sbin/nscd
nscd - 0.0 - - - - Ssl 03:43 0:00 -
nscd - 0.0 - - - - Ssl 03:43 0:00 -
nscd - 0.0 - - - - Ssl 03:43 0:00 -
nscd - 0.0 - - - - Ssl 03:43 0:00 -
nscd - 0.0 - - - - Ssl 03:43 0:00 -
nscd - 0.0 - - - - Ssl 03:43 0:00 -
nscd - 0.0 - - - - Ssl 03:43 0:00 -
root 24972 0.0 0.0 2548 724 pts/0 - 03:44 0:00 ps auxwm
root - 0.0 - - - - R+ 03:44 0:00 -

> echo t > /proc/sysrq-trigger
> dmesg -s 1000000 > foo
>
> then edit foo, search for the process in D state (look for " D ") and send
> that process's backtrace record.

This does only make sense if I have found the offender? I tried it,
nothing unusual and almost no output.

Is there any way to track it down? Just hints in the right direction
are highly appreciated!
Thanks,
M.

2006-09-08 22:48:07

by Pavel Machek

[permalink] [raw]
Subject: Re: top displaying 50% si time and 50% idle on idle machine

> >>Cpu(s): 0.0% us, 0.0% sy, 0.0% ni, 50.0% id, 0.0%
> >>wa, 0.0% hi, 50.0%si
>
> >Do `ps aux', look for a process stuck in D state.
>
> The issue that startled me: there is _NO_ process in a D
> state!
> BTW what means si? (interrupt service time? google
> didn't find anything)

'soft interrupt' probably. try disconnecting network.

Pavel

--
Thanks for all the (sleeping) penguins.

2006-09-11 11:05:46

by Metathronius Galabant

[permalink] [raw]
Subject: Re: top displaying 50% si time and 50% idle on idle machine

>>>>Cpu(s): 0.0% us, 0.0% sy, 0.0% ni, 50.0% id, 0.0%
>>>>wa, 0.0% hi, 50.0%si

>> BTW what means si? (interrupt service time? google
>> didn't find anything)

> 'soft interrupt' probably. try disconnecting network.

The cause has been found. The timer of that machine is seriously
broken, 1 second is approximately 500ms long.
It is a HP DL360 G4 and I configured the kernel without ACPI or
similar. Maybe there are some strange BIOS power management schemes
active. I will look deeper into the problem and report back.
A broken timer is _very_ strange to me (I didn't encounter that in the
last 12 years w/ custom kernels).

Cheers,
M.

2006-09-11 11:49:59

by Metathronius Galabant

[permalink] [raw]
Subject: Re: top displaying 50% si time and 50% idle on idle machine

> >>>>Cpu(s): 0.0% us, 0.0% sy, 0.0% ni, 50.0% id, 0.0%
> >>>>wa, 0.0% hi, 50.0%si
>
> >> BTW what means si? (interrupt service time? google
> >> didn't find anything)
>
> > 'soft interrupt' probably. try disconnecting network.
>
> The cause has been found. The timer of that machine is seriously
> broken, 1 second is approximately 500ms long.
> It is a HP DL360 G4 and I configured the kernel without ACPI or
> similar. Maybe there are some strange BIOS power management schemes
> active. I will look deeper into the problem and report back.
> A broken timer is _very_ strange to me (I didn't encounter that in the
> last 12 years w/ custom kernels).

Power management was completely switched off, so that was not the cause.
Instead, please have a look the this dmesg outputs:

not working (no ACPI):

ENABLING IO-APIC IRQs
unknown bus type 32. (repeated MULTIPLE TIMES!)
..TIMER: vector=0x31 apic1=-1 pin1=-1 apic2=-1 pin2=-1
...trying to set up timer (IRQ0) through the 8259A ... failed.
...trying to set up timer as Virtual Wire IRQ... works.


working (ACPI + processor support):

ENABLING IO-APIC IRQs
..TIMER: vector=0x31 apic1=0 pin1=2 apic2=-1 pin2=-1

Can this be considered as a kernel-bug?

Thanks for the help,
M.

2006-09-11 12:32:14

by Vitezslav Samel

[permalink] [raw]
Subject: Re: top displaying 50% si time and 50% idle on idle machine

On Mon, Sep 11, 2006 at 01:49:57PM +0200, Metathronius Galabant wrote:
> >>>>>Cpu(s): 0.0% us, 0.0% sy, 0.0% ni, 50.0% id, 0.0%
> >>>>>wa, 0.0% hi, 50.0%si
> >
> >>> BTW what means si? (interrupt service time? google
> >>> didn't find anything)
> >
> >> 'soft interrupt' probably. try disconnecting network.
> >
> >The cause has been found. The timer of that machine is seriously
> >broken, 1 second is approximately 500ms long.
> >It is a HP DL360 G4 and I configured the kernel without ACPI or
> >similar. Maybe there are some strange BIOS power management schemes
> >active. I will look deeper into the problem and report back.
> >A broken timer is _very_ strange to me (I didn't encounter that in the
> >last 12 years w/ custom kernels).

Try 2.6.18-rc6, there is a fix included (see Changelog:
"[PATCH] x86: increase MAX_MP_BUSSES on default arch")

Cheers,
Vita