From: Randy Dunlap <[email protected]>
Use a round_jiffies() variant to reduce overhead of timer
wakeups. This causes the ipmi timers to occur at the same
time as other timers (per CPU).
Typical powertop for /ipmi/ (2.6.31, before patch):
11.4% (247.4) kipmi0 : __mod_timer (process_timeout)
0.6% ( 13.1) <interrupt> : ipmi_si
0.5% ( 10.0) <kernel core> : __mod_timer (ipmi_timeout)
powertop for /ipmi/, 2.6.31, after patch:
10.8% (247.6) kipmi0 : __mod_timer (process_timeout)
0.3% ( 6.9) <interrupt> : ipmi_si
0.0% ( 1.0) <kernel core> : __mod_timer (ipmi_timeout)
Signed-off-by: Randy Dunlap <[email protected]>
Cc: Corey Minyard <[email protected]>
Cc: [email protected]
---
drivers/char/ipmi/ipmi_msghandler.c | 7 +++++--
drivers/char/ipmi/ipmi_si_intf.c | 3 ++-
2 files changed, 7 insertions(+), 3 deletions(-)
--- lnx-2632-rc5.orig/drivers/char/ipmi/ipmi_msghandler.c
+++ lnx-2632-rc5/drivers/char/ipmi/ipmi_msghandler.c
@@ -39,6 +39,7 @@
#include <linux/spinlock.h>
#include <linux/mutex.h>
#include <linux/slab.h>
+#include <linux/timer.h>
#include <linux/ipmi.h>
#include <linux/ipmi_smi.h>
#include <linux/notifier.h>
@@ -4057,7 +4058,8 @@ static void ipmi_timeout(unsigned long d
ipmi_timeout_handler(IPMI_TIMEOUT_TIME);
- mod_timer(&ipmi_timer, jiffies + IPMI_TIMEOUT_JIFFIES);
+ mod_timer(&ipmi_timer,
+ round_jiffies_up(jiffies + IPMI_TIMEOUT_JIFFIES));
}
@@ -4424,7 +4426,8 @@ static int ipmi_init_msghandler(void)
#endif /* CONFIG_PROC_FS */
setup_timer(&ipmi_timer, ipmi_timeout, 0);
- mod_timer(&ipmi_timer, jiffies + IPMI_TIMEOUT_JIFFIES);
+ mod_timer(&ipmi_timer,
+ round_jiffies_up(jiffies + IPMI_TIMEOUT_JIFFIES));
atomic_notifier_chain_register(&panic_notifier_list, &panic_block);
--- lnx-2632-rc5.orig/drivers/char/ipmi/ipmi_si_intf.c
+++ lnx-2632-rc5/drivers/char/ipmi/ipmi_si_intf.c
@@ -1068,7 +1068,8 @@ static int smi_start_processing(void
/* Set up the timer that drives the interface. */
setup_timer(&new_smi->si_timer, smi_timeout, (long)new_smi);
new_smi->last_timeout_jiffies = jiffies;
- mod_timer(&new_smi->si_timer, jiffies + SI_TIMEOUT_JIFFIES);
+ mod_timer(&new_smi->si_timer,
+ round_jiffies_up(jiffies + SI_TIMEOUT_JIFFIES));
/*
* Check if the user forcefully enabled the daemon.
On Wed, 21 Oct 2009 10:28:22 -0700
Randy Dunlap <[email protected]> wrote:
> From: Randy Dunlap <[email protected]>
>
> Use a round_jiffies() variant to reduce overhead of timer
> wakeups. This causes the ipmi timers to occur at the same
> time as other timers (per CPU).
>
> Typical powertop for /ipmi/ (2.6.31, before patch):
> 11.4% (247.4) kipmi0 : __mod_timer (process_timeout)
> 0.6% ( 13.1) <interrupt> : ipmi_si
> 0.5% ( 10.0) <kernel core> : __mod_timer (ipmi_timeout)
>
> powertop for /ipmi/, 2.6.31, after patch:
> 10.8% (247.6) kipmi0 : __mod_timer (process_timeout)
> 0.3% ( 6.9) <interrupt> : ipmi_si
> 0.0% ( 1.0) <kernel core> : __mod_timer (ipmi_timeout)
while it is nice that ipmi_si ande the timer wake up less now.... it's
still rather sad that the 247.6 from kipmi0 are still there..... those
are a much much bigger issue
--
Arjan van de Ven Intel Open Source Technology Centre
For development, discussion and tips for power savings,
visit http://www.lesswatts.org
Arjan van de Ven wrote:
> On Wed, 21 Oct 2009 10:28:22 -0700
> Randy Dunlap <[email protected]> wrote:
>
>> From: Randy Dunlap <[email protected]>
>>
>> Use a round_jiffies() variant to reduce overhead of timer
>> wakeups. This causes the ipmi timers to occur at the same
>> time as other timers (per CPU).
>>
>> Typical powertop for /ipmi/ (2.6.31, before patch):
>> 11.4% (247.4) kipmi0 : __mod_timer (process_timeout)
>> 0.6% ( 13.1) <interrupt> : ipmi_si
>> 0.5% ( 10.0) <kernel core> : __mod_timer (ipmi_timeout)
>>
>> powertop for /ipmi/, 2.6.31, after patch:
>> 10.8% (247.6) kipmi0 : __mod_timer (process_timeout)
>> 0.3% ( 6.9) <interrupt> : ipmi_si
>> 0.0% ( 1.0) <kernel core> : __mod_timer (ipmi_timeout)
>
> while it is nice that ipmi_si ande the timer wake up less now.... it's
> still rather sad that the 247.6 from kipmi0 are still there..... those
> are a much much bigger issue
obviously :)
Randy, any idea where those are coming from ?
Auke
On Wed, 21 Oct 2009 11:49:59 -0700 Kok, Auke wrote:
> Arjan van de Ven wrote:
> > On Wed, 21 Oct 2009 10:28:22 -0700
> > Randy Dunlap <[email protected]> wrote:
> >
> >> From: Randy Dunlap <[email protected]>
> >>
> >> Use a round_jiffies() variant to reduce overhead of timer
> >> wakeups. This causes the ipmi timers to occur at the same
> >> time as other timers (per CPU).
> >>
> >> Typical powertop for /ipmi/ (2.6.31, before patch):
> >> 11.4% (247.4) kipmi0 : __mod_timer (process_timeout)
> >> 0.6% ( 13.1) <interrupt> : ipmi_si
> >> 0.5% ( 10.0) <kernel core> : __mod_timer (ipmi_timeout)
> >>
> >> powertop for /ipmi/, 2.6.31, after patch:
> >> 10.8% (247.6) kipmi0 : __mod_timer (process_timeout)
> >> 0.3% ( 6.9) <interrupt> : ipmi_si
> >> 0.0% ( 1.0) <kernel core> : __mod_timer (ipmi_timeout)
> >
> > while it is nice that ipmi_si ande the timer wake up less now.... it's
> > still rather sad that the 247.6 from kipmi0 are still there..... those
> > are a much much bigger issue
>
> obviously :)
>
> Randy, any idea where those are coming from ?
obviously from kipmi thread :(
drivers/char/ipmi/ipmi_si_intf.c::ipmi_thread():
static int ipmi_thread(void *data)
{
struct smi_info *smi_info = data;
unsigned long flags;
enum si_sm_result smi_result;
set_user_nice(current, 19);
while (!kthread_should_stop()) {
spin_lock_irqsave(&(smi_info->si_lock), flags);
smi_result = smi_event_handler(smi_info, 0);
spin_unlock_irqrestore(&(smi_info->si_lock), flags);
if (smi_result == SI_SM_CALL_WITHOUT_DELAY)
; /* do nothing */
else if (smi_result == SI_SM_CALL_WITH_DELAY)
schedule();
else
schedule_timeout_interruptible(1); <-----
}
return 0;
}
calls setup_timer_on_stack(), which calls process_timeout().
>From what I recall (probably 2 years ago), [older] ipmi hardware does not
generate event interrupts, so it has to be polled.
Corey, can you elaborate on this?
---
~Randy
Randy Dunlap wrote:
> On Wed, 21 Oct 2009 11:49:59 -0700 Kok, Auke wrote:
>
>
>> Arjan van de Ven wrote:
>>
>>> On Wed, 21 Oct 2009 10:28:22 -0700
>>> Randy Dunlap <[email protected]> wrote:
>>>
>>>
>>>> From: Randy Dunlap <[email protected]>
>>>>
>>>> Use a round_jiffies() variant to reduce overhead of timer
>>>> wakeups. This causes the ipmi timers to occur at the same
>>>> time as other timers (per CPU).
>>>>
>>>> Typical powertop for /ipmi/ (2.6.31, before patch):
>>>> 11.4% (247.4) kipmi0 : __mod_timer (process_timeout)
>>>> 0.6% ( 13.1) <interrupt> : ipmi_si
>>>> 0.5% ( 10.0) <kernel core> : __mod_timer (ipmi_timeout)
>>>>
>>>> powertop for /ipmi/, 2.6.31, after patch:
>>>> 10.8% (247.6) kipmi0 : __mod_timer (process_timeout)
>>>> 0.3% ( 6.9) <interrupt> : ipmi_si
>>>> 0.0% ( 1.0) <kernel core> : __mod_timer (ipmi_timeout)
>>>>
>>> while it is nice that ipmi_si ande the timer wake up less now.... it's
>>> still rather sad that the 247.6 from kipmi0 are still there..... those
>>> are a much much bigger issue
>>>
>> obviously :)
>>
>> Randy, any idea where those are coming from ?
>>
>
>
> obviously from kipmi thread :(
>
> drivers/char/ipmi/ipmi_si_intf.c::ipmi_thread():
>
> static int ipmi_thread(void *data)
> {
> struct smi_info *smi_info = data;
> unsigned long flags;
> enum si_sm_result smi_result;
>
> set_user_nice(current, 19);
> while (!kthread_should_stop()) {
> spin_lock_irqsave(&(smi_info->si_lock), flags);
> smi_result = smi_event_handler(smi_info, 0);
> spin_unlock_irqrestore(&(smi_info->si_lock), flags);
> if (smi_result == SI_SM_CALL_WITHOUT_DELAY)
> ; /* do nothing */
> else if (smi_result == SI_SM_CALL_WITH_DELAY)
> schedule();
> else
> schedule_timeout_interruptible(1); <-----
> }
> return 0;
> }
>
>
> calls setup_timer_on_stack(), which calls process_timeout().
>
> From what I recall (probably 2 years ago), [older] ipmi hardware does not
> generate event interrupts, so it has to be polled.
>
> Corey, can you elaborate on this?
>
Certainly. Yes, some (probably most) IPMI hardware does not use
interrupts, and unfortunately, it's not just older machines. The driver
used to poll more slowly, but in many cases the performance was
unacceptable.
kipmid is only started if the hardware doesn't support interrupts, so
only users with sub-standard hardware have to suffer with this problem.
Thanks for the patch, Randy.
-corey
Corey Minyard wrote:
> Certainly. Yes, some (probably most) IPMI hardware does not use
> interrupts, and unfortunately, it's not just older machines. The driver
> used to poll more slowly, but in many cases the performance was
> unacceptable.
... but now it burns quite a bit of power (I'd not be surprised if it is 10 Watts extra on a 70W server)
is there any way to poll slowly until there is active ipmi traffic, during which we can then poll a bit faster.
... and then go back to slow polling when there is an ipmi idle period ?
Corey Minyard & Randy Dunlap wrote:
Randy>> From what I recall (probably 2 years ago), [older] ipmi hardware
Randy>> does not generate event interrupts, so it has to be polled.
Randy>>
Randy>> Corey, can you elaborate on this?
Corey> Certainly. Yes, some (probably most) IPMI hardware does not use
Corey> interrupts, and unfortunately, it's not just older machines.
Corey> The driver used to poll more slowly, but in many cases the
Corey> performance was unacceptable.
Corey>
Corey> kipmid is only started if the hardware doesn't support
Corey> interrupts, so only users with sub-standard hardware have to
Corey> suffer with this problem.
Regrettably, of the "big three" in the "PC Server" world, only HP's iLO2
BMC supports interrupts. Dell's DRAC4 & 5 don't, IBM's ASM, RSA, etc.
don't. Also (at least out of a sample of one) SuperMicro also doesn't
have an interrupt.
They also have all settled on the KCS interface, which dribbles one
character through per non-interrupt. So sad. Dell's DRAC3 had a BT BMC
which transferred whole IPMI packets via DMA _and_ had an interrupt.
HP's ancient SMIC equipment also had an interrupt (but that's also
char-by-char, and their current KCS has an interrupt, so at least they
haven't regressed). I've guessed that some chip vendor must have
come out with a Really Cheep KCS implementation and drove every other
implementation out of the market. :-(
I've heard rumors that some current Sun hardware has BT.
>Bela<-
On Thu, Oct 22, 2009 at 05:57:06AM +0900, Arjan van de Ven wrote:
> Corey Minyard wrote:
> >Certainly. Yes, some (probably most) IPMI hardware does not use
> >interrupts, and unfortunately, it's not just older machines. The driver
> >used to poll more slowly, but in many cases the performance was
> >unacceptable.
>
> ... but now it burns quite a bit of power (I'd not be surprised if it is 10
> Watts extra on a 70W server)
>
> is there any way to poll slowly until there is active ipmi traffic, during
> which we can then poll a bit faster.
> ... and then go back to slow polling when there is an ipmi idle period ?
I believe that's what the thread does already. Depending on what
userspace apps are generating IPMI requests though, there may not be a
whole lot of time between requests. Dell OpenManage software does a
poll of the IPMI sensors, SEL logs, etc. at regular intervals, on the
order of minutes between runs, but during each run there's almost
always an outstanding IPMI command.
The difference was several minutes during startup, and 15 minutes ->
1.5 minutes during a firmware flash, with the kipmi0 thread present.
I've heard requests to have the some userspace control over the
start/stop of that thread, so it could get started only when there are
more time-critical userspace requests to be made (such as during firmware
update), but I've not worked on that part. We'd need an interface
(sysfs file?) to start/stop the thread, and could adjust userspace to
use that (Dell OpenManage, ipmitool, sblim-cmpi-ipmi, ...).
Though I'm really curious that HP has a KCS+interrupt controller
available. That gives me hope that the industry-wide problems which
prevented Dell from doing likewise a couple years ago are now
resolved. I'll have my team look into it again.
--
Matt Domsch
Technology Strategist, Dell Office of the CTO
linux.dell.com & http://www.dell.com/linux
Matt Domsch wrote:
> On Thu, Oct 22, 2009 at 05:57:06AM +0900, Arjan van de Ven wrote:
>
>> Corey Minyard wrote:
>>
>>> Certainly. Yes, some (probably most) IPMI hardware does not use
>>> interrupts, and unfortunately, it's not just older machines. The driver
>>> used to poll more slowly, but in many cases the performance was
>>> unacceptable.
>>>
>> ... but now it burns quite a bit of power (I'd not be surprised if it is 10
>> Watts extra on a 70W server)
>>
>> is there any way to poll slowly until there is active ipmi traffic, during
>> which we can then poll a bit faster.
>> ... and then go back to slow polling when there is an ipmi idle period ?
>>
>
> I believe that's what the thread does already. Depending on what
> userspace apps are generating IPMI requests though, there may not be a
> whole lot of time between requests. Dell OpenManage software does a
> poll of the IPMI sensors, SEL logs, etc. at regular intervals, on the
> order of minutes between runs, but during each run there's almost
> always an outstanding IPMI command.
>
In addition to userland work, the upper layer of the driver polls for
events once a second. This is another unfortunate IPMI design flaw.
There is a flag (that will thus cause an interrupt) that tells you if
the event queue is full, but it doesn't tell you if there is an event in
the queue, only if its full. So you have to poll for events.
-corey
Matt Domsch wrote:
> Though I'm really curious that HP has a KCS+interrupt controller
> available. That gives me hope that the industry-wide problems which
> prevented Dell from doing likewise a couple years ago are now
> resolved. I'll have my team look into it again.
Can you expand on "industry-wide problems"? (Forced to share
interrupts with a high rate device? Design your gizmo to
support MSI/MSI-x. Add MSI support to ipmi_si if necessary...)
As far as I can tell, HP has never shipped an interrupt-less
BMC. Their current iLO2 BMC is KCS + interrupt. Their older
design was SMIC + interrupt.
Why does everyone use KCS when BT is obviously better? Can
you have your team look into that as well? (Among the various
goals here, I assume that BT -- with a single interrupt and a
DMA transfer instead of shuffling bytes over I/O ports -- would
cost less power. Not that the members of that list will
receive this message: it bounces nonmembers.)
It appears that BT designs usually include both BT & KCS
programming interfaces to the same BMC. So perhaps there is
some increased silicon complexity -- but c'mon, we're talking
about a couple of silicon library macros here.
Also, I see evidence of some Sun BMCs that have BT without KCS,
so apparently it isn't required to pair them. Pairing is
probably done for the benefit of certain dumb client software
that assumes all BMCs are KCS. I say to heck with that SW.
Any app running under an OS that provides an adequate driver
[which includes at least Linux, Solaris, and -- I assume --
Win] shouldn't be thinking about the BMC programming
interface at all.
>Bela<-
Bela Lubkin wrote:
> Matt Domsch wrote:
>
>
>> Though I'm really curious that HP has a KCS+interrupt controller
>> available. That gives me hope that the industry-wide problems which
>> prevented Dell from doing likewise a couple years ago are now
>> resolved. I'll have my team look into it again.
>>
>
> Can you expand on "industry-wide problems"? (Forced to share
> interrupts with a high rate device? Design your gizmo to
> support MSI/MSI-x. Add MSI support to ipmi_si if necessary...)
>
> As far as I can tell, HP has never shipped an interrupt-less
> BMC. Their current iLO2 BMC is KCS + interrupt. Their older
> design was SMIC + interrupt.
>
> Why does everyone use KCS when BT is obviously better? Can
> you have your team look into that as well? (Among the various
> goals here, I assume that BT -- with a single interrupt and a
> DMA transfer instead of shuffling bytes over I/O ports -- would
> cost less power. Not that the members of that list will
> receive this message: it bounces nonmembers.)
>
This is an industry where pennies matter, apparently.
My personal preference would be to use the I2C based standard
interface. That actually doesn't perform too badly, it's probably
cheaper than KCS since you already have I2C, anyway, and the I2C
interfaces are generally tied to an interrupt. The trouble is that the
only hardware implementation of this I know of seems to be poorly done,
but that mostly affects trying to use the ethernet interface and the
local interface at the same time.
Of course, the driver for I2C is not yet in the standard kernel as it
requires a fairly massive I2C driver rewrite to allow the IPMI driver to
do it's panic-time operations.
BT would be better for performance, I guess, but it's yet another
interface to tie in, and hanging this off an existing bus seems like a
sensible thing to do. And performance is really not an issue for IPMI.
-corey
Corey Minyard wrote:
> Bela Lubkin wrote:
>
> > Why does everyone use KCS when BT is obviously better? Can
> > you have your team look into that as well? (Among the various
> > goals here, I assume that BT -- with a single interrupt and a
> > DMA transfer instead of shuffling bytes over I/O ports -- would
> > cost less power. Not that the members of that list will
> > receive this message: it bounces nonmembers.)
> This is an industry where pennies matter, apparently.
Well yeah. Also an industry where a small leg up on some competitor can
turn into a big deal. So that goes both ways.
> My personal preference would be to use the I2C based standard
> interface. That actually doesn't perform too badly, it's probably
> cheaper than KCS since you already have I2C, anyway, and the I2C
> interfaces are generally tied to an interrupt. The trouble
> is that the only hardware implementation of this I know of seems
> to be poorly done, but that mostly affects trying to use the
> ethernet interface and the local interface at the same time.
>
> Of course, the driver for I2C is not yet in the standard kernel as it
> requires a fairly massive I2C driver rewrite to allow the
> IPMI driver to do it's panic-time operations.
That seems like a fairly compelling argument against. It's not in the
"standard kernel" (i.e. of Linux); it's probably not in other OSes. So
now you're asking the software side of the industry to spend another
several years playing catchup.
> BT would be better for performance, I guess, but it's yet another
> interface to tie in, and hanging this off an existing bus
> seems like a sensible thing to do. And performance is really not
> an issue for IPMI.
BT is already supported by existing code (e.g. OpenIPMI driver). It can
almost certainly be presented on a PCI card (at PCI-mediated addresses
not some legacy ISA port range). So what do you mean "yet another
interface to tie in"? It's just a blob of behavior to be situated on a
PCI card (or mboard implementation in PCI space).
I2C? Volunteering to put it on I2C is like volunteering to put the new
branch office on a 1200 baud modem because you're suspicious of all that
newfangled fiber stuff. Huh?
Regarding performance, we've had real struggles with IPMI performance
aspects. There were situations where kipmid failed to start or didn't
do its job correctly, resulting in very slow IPMI operations. Seemingly
harmless -- except that some OEM management agent daemon was making
decisions based on feedback from IPMI. The delays caused it to get into
some watchdog code that decided the system was hung and issued a reboot.
On other systems, the OEM management agents poll IPMI so persistently
that kipmid ends up taking a significant fraction of a CPU (nearly
100% of a CPU in bursts, long term average of 40%+). Since the Linux
"Console OS" in COS-bearing versions of ESX is by design limited to a
single CPU, and since it does have other tasks to do, this causes some
bad bottlenecks.
These things I mention are our bugs, should be fixed, are not
necessarily the "fault" of IPMI. Yet they would have no noticable
impact if the BMC hardware was speaking BT. The slow operations which
(when magnified by 1000s of calls) took up most of a CPU would instead
take up a barely measurable portion of a CPU.
I don't suppose my viewpoint on this is going to be popular on lkml,
but it is rather parochial to base hardware design decisions on "that's
OK because it will be dealt with in some nebulous near-future version
of the Linux kernel". That's great for people who will be running a
bleeding edge Linux kernel. Not so good for people running stodgier
Linux distros, other *ix OSes, Windows, QNX, who knows what else.
It's great that Linux is [or intends to be] on top of this. It's not
the whole world. The Dells of this world necessarily have to design
systems that will work adequately with e.g. RHEL5.4 (2.6.18 + a zillion
patches), not to mention Win2K3 SP${current}, etc.
If you're designing a phone handset, router, set-top box, etc. --
something where you own the entire project from hardware up to the top
of the software stack -- _then_ you can make decisions on such criteria.
General purpose hardware designs have to support general purpose OSes,
including lots of weird little backwaters.
>Bela<-