timekeeping_init() basically does the following:
xtime = RTC
if (arch implements read_boot_clock())
wall_to_monotonic = -read_boot_clock()
else
wall_to_monotonic = -xtime
So wall_to_monotonic records some approximation of the system boot
time, which is then used to derive the "btime" reported in /proc/stat.
The problem I'm seeing is that xtime is updated on timer ticks, so
uninterruptible code, like kernel serial printk, makes us miss ticks,
so xtime falls behind the RTC. Then, when userland fixes xtime, in my
case with "hwclock --hctosys", the delta is applied to both xtime and
wall_to_monotonic. The result is that "btime" is no longer accurate.
Here's an example where I artificially exaggerated the problem by
adding 30 seconds of wait time with interrupts disabled. Assume the
RTC is perfectly correct at boot, and note that xtime has fallen
behind the RTC by 31 seconds by the time userland resets the clock:
rtc 1306957603 xtime 1306957603 wall_to_monotonic -1306957603 in
timekeeping_init()
rtc 1306957638 xtime 1306957607 wall_to_monotonic -1306957603
before do_settimeofday()
rtc 1306957638 xtime 1306957638 wall_to_monotonic -1306957634 after
do_settimeofday()
Now /proc/stat btime reports 1306957634 instead of the correct 1306957603.
Bjorn
On Wed, 2011-06-01 at 14:50 -0600, Bjorn Helgaas wrote:
> timekeeping_init() basically does the following:
>
> xtime = RTC
> if (arch implements read_boot_clock())
> wall_to_monotonic = -read_boot_clock()
> else
> wall_to_monotonic = -xtime
>
> So wall_to_monotonic records some approximation of the system boot
> time, which is then used to derive the "btime" reported in /proc/stat.
>
> The problem I'm seeing is that xtime is updated on timer ticks, so
> uninterruptible code, like kernel serial printk, makes us miss ticks,
> so xtime falls behind the RTC.
Huh. So this sort of issue was common back when we had tick-based
timekeeping (in combination with troubled hardware), but with the
current clocksource based timekeeping, occasional lost ticks shouldn't
really effect time.
Can you explain a bit more about what kind of hardware this is happening
on, and what clocksource is being used?
> Then, when userland fixes xtime, in my
> case with "hwclock --hctosys", the delta is applied to both xtime and
> wall_to_monotonic. The result is that "btime" is no longer accurate.
Yes. If time was actually lost (which I suspect is the actual problem),
then adjustments to fix it do not propagate, and thus btime (which is
approximately calculated as CLOCK_REALTIME - CLOCK_BOOTTIME) will be
off. This is due to the fact that the adjustment changes CLOCK_REALTIME,
but that CLOCK_BOOTTIME (or CLOCK_MONOTONIC) isn't being increased for
the time lost.
> Here's an example where I artificially exaggerated the problem by
> adding 30 seconds of wait time with interrupts disabled. Assume the
> RTC is perfectly correct at boot, and note that xtime has fallen
> behind the RTC by 31 seconds by the time userland resets the clock:
Yea, unless I'm somehow misunderstanding you, I don't think this is a
btime accuracy issue, but instead a hardware problem. If interrupts are
being disabled for longer then the clocksource hardware can handle,
there will be problems.
But let me know more about the clocksource being used and we'll see if
we can solve what you're seeing.
thanks
-john
On Wed, Jun 1, 2011 at 4:35 PM, john stultz <[email protected]> wrote:
> On Wed, 2011-06-01 at 14:50 -0600, Bjorn Helgaas wrote:
>> timekeeping_init() basically does the following:
>>
>> ? ? xtime = RTC
>> ? ? if (arch implements read_boot_clock())
>> ? ? ? ? wall_to_monotonic = -read_boot_clock()
>> ? ? else
>> ? ? ? wall_to_monotonic = -xtime
>>
>> So wall_to_monotonic records some approximation of the system boot
>> time, which is then used to derive the "btime" reported in /proc/stat.
>>
>> The problem I'm seeing is that xtime is updated on timer ticks, so
>> uninterruptible code, like kernel serial printk, makes us miss ticks,
>> so xtime falls behind the RTC.
>
> Huh. So this sort of issue was common back when we had tick-based
> timekeeping (in combination with troubled hardware), but with the
> current clocksource based timekeeping, occasional lost ticks shouldn't
> really effect time.
Makes sense. Your presentation here was a great help:
http://sr71.net/~jstultz/tod/ols-presentation-final.pdf
> Can you explain a bit more about what kind of hardware this is happening
> on, and what clocksource is being used?
Sure. This is an x86 box. Normally we're using the TSC clocksource,
and I don't think the issue happens after that. I guess my
experimentation so far has been with uninterruptible time before we
register *any* clocksource (or at least before I see any "Switching to
clocksource" messages).
Bjorn
On Wed, 2011-06-01 at 17:35 -0600, Bjorn Helgaas wrote:
> On Wed, Jun 1, 2011 at 4:35 PM, john stultz <[email protected]> wrote:
> > On Wed, 2011-06-01 at 14:50 -0600, Bjorn Helgaas wrote:
> >> timekeeping_init() basically does the following:
> >>
> >> xtime = RTC
> >> if (arch implements read_boot_clock())
> >> wall_to_monotonic = -read_boot_clock()
> >> else
> >> wall_to_monotonic = -xtime
> >>
> >> So wall_to_monotonic records some approximation of the system boot
> >> time, which is then used to derive the "btime" reported in /proc/stat.
> >>
> >> The problem I'm seeing is that xtime is updated on timer ticks, so
> >> uninterruptible code, like kernel serial printk, makes us miss ticks,
> >> so xtime falls behind the RTC.
> >
> > Huh. So this sort of issue was common back when we had tick-based
> > timekeeping (in combination with troubled hardware), but with the
> > current clocksource based timekeeping, occasional lost ticks shouldn't
> > really effect time.
>
> Makes sense. Your presentation here was a great help:
> http://sr71.net/~jstultz/tod/ols-presentation-final.pdf
>
> > Can you explain a bit more about what kind of hardware this is happening
> > on, and what clocksource is being used?
>
> Sure. This is an x86 box. Normally we're using the TSC clocksource,
> and I don't think the issue happens after that. I guess my
> experimentation so far has been with uninterruptible time before we
> register *any* clocksource (or at least before I see any "Switching to
> clocksource" messages).
Huh.
So yea, if we are very early at boot, we're likely using the jiffies
clocksource, which is basically a software-based tick counter, which
would be prone to lost-ticks issues if irqs were disabled for too long.
Do you know if this is this a relatively new issue?
My first instinct is "don't do that!" to whatever driver is disabling
irqs for so long. Do you know what's actually causing these long irq off
periods?
I assume you're noticing this offset by seeing that CLOCK_REALTIME is
off from the RTC right after boot? How severe is this? The RTC read is
only second granular, so there's a fair amount of error (~1 second)
possible right at boot, so this then must be many seconds worth of lost
ticks to be noticeable, right?
thanks
-john
On Wed, Jun 1, 2011 at 5:58 PM, john stultz <[email protected]> wrote:
> On Wed, 2011-06-01 at 17:35 -0600, Bjorn Helgaas wrote:
>> On Wed, Jun 1, 2011 at 4:35 PM, john stultz <[email protected]> wrote:
>> > On Wed, 2011-06-01 at 14:50 -0600, Bjorn Helgaas wrote:
>> >> timekeeping_init() basically does the following:
>> >>
>> >> ? ? xtime = RTC
>> >> ? ? if (arch implements read_boot_clock())
>> >> ? ? ? ? wall_to_monotonic = -read_boot_clock()
>> >> ? ? else
>> >> ? ? ? wall_to_monotonic = -xtime
>> >>
>> >> So wall_to_monotonic records some approximation of the system boot
>> >> time, which is then used to derive the "btime" reported in /proc/stat.
>> >>
>> >> The problem I'm seeing is that xtime is updated on timer ticks, so
>> >> uninterruptible code, like kernel serial printk, makes us miss ticks,
>> >> so xtime falls behind the RTC.
>> >
>> > Huh. So this sort of issue was common back when we had tick-based
>> > timekeeping (in combination with troubled hardware), but with the
>> > current clocksource based timekeeping, occasional lost ticks shouldn't
>> > really effect time.
>>
>> Makes sense. ?Your presentation here was a great help:
>> ? http://sr71.net/~jstultz/tod/ols-presentation-final.pdf
>>
>> > Can you explain a bit more about what kind of hardware this is happening
>> > on, and what clocksource is being used?
>>
>> Sure. ?This is an x86 box. ?Normally we're using the TSC clocksource,
>> and I don't think the issue happens after that. ?I guess my
>> experimentation so far has been with uninterruptible time before we
>> register *any* clocksource (or at least before I see any "Switching to
>> clocksource" messages).
>
> Huh.
>
> So yea, if we are very early at boot, we're likely using the jiffies
> clocksource, which is basically a software-based tick counter, which
> would be prone to lost-ticks issues if irqs were disabled for too long.
>
> Do you know if this is this a relatively new issue?
>
> My first instinct is "don't do that!" to whatever driver is disabling
> irqs for so long. Do you know what's actually causing these long irq off
> periods?
>
> I assume you're noticing this offset by seeing that CLOCK_REALTIME is
> off from the RTC right after boot? How severe is this? The RTC read is
> only second granular, so there's a fair amount of error (~1 second)
> possible right at boot, so this then must be many seconds worth of lost
> ticks to be noticeable, right?
I'm using 2.6.34, so not really new. I think the major offender is
kernel serial printk, which is done in polled mode. A *lot* of it,
e.g., 30+ seconds' worth. I wonder if there's some reasonably clean
way to resync with the RTC, say at the time we register a clocksource
better than jiffies, or in clocksource_done_booting(), or something.
Bjorn
On Wed, 2011-06-01 at 18:31 -0600, Bjorn Helgaas wrote:
> On Wed, Jun 1, 2011 at 5:58 PM, john stultz <[email protected]> wrote:
> > My first instinct is "don't do that!" to whatever driver is disabling
> > irqs for so long. Do you know what's actually causing these long irq off
> > periods?
> >
> > I assume you're noticing this offset by seeing that CLOCK_REALTIME is
> > off from the RTC right after boot? How severe is this? The RTC read is
> > only second granular, so there's a fair amount of error (~1 second)
> > possible right at boot, so this then must be many seconds worth of lost
> > ticks to be noticeable, right?
>
> I'm using 2.6.34, so not really new. I think the major offender is
> kernel serial printk, which is done in polled mode. A *lot* of it,
> e.g., 30+ seconds' worth.
CC'ing the linux-serial list and Alan for their thoughts: Does the
serial port is disabling irqs for 30+ seconds during boot seem at all
sane?
I would think it would periodically enable irqs in the polling loop to
let things be processed, but maybe not?
> I wonder if there's some reasonably clean
> way to resync with the RTC, say at the time we register a clocksource
> better than jiffies, or in clocksource_done_booting(), or something.
I definitely think we need to address the long irq disable periods
instead of trying to hack around the issue in the timekeeping code.
thanks
-john
On Wed, Jun 1, 2011 at 6:49 PM, john stultz <[email protected]> wrote:
> On Wed, 2011-06-01 at 18:31 -0600, Bjorn Helgaas wrote:
>> On Wed, Jun 1, 2011 at 5:58 PM, john stultz <[email protected]> wrote:
>> > My first instinct is "don't do that!" to whatever driver is disabling
>> > irqs for so long. Do you know what's actually causing these long irq off
>> > periods?
>> >
>> > I assume you're noticing this offset by seeing that CLOCK_REALTIME is
>> > off from the RTC right after boot? How severe is this? The RTC read is
>> > only second granular, so there's a fair amount of error (~1 second)
>> > possible right at boot, so this then must be many seconds worth of lost
>> > ticks to be noticeable, right?
>>
>> I'm using 2.6.34, so not really new. ?I think the major offender is
>> kernel serial printk, which is done in polled mode. ?A *lot* of it,
>> e.g., 30+ seconds' worth.
>
> CC'ing the linux-serial list and Alan for their thoughts: Does the
> serial port is disabling irqs for 30+ seconds during boot seem at all
> sane?
>
> I would think it would periodically enable irqs in the polling loop to
> let things be processed, but maybe not?
>
>> ? I wonder if there's some reasonably clean
>> way to resync with the RTC, say at the time we register a clocksource
>> better than jiffies, or in clocksource_done_booting(), or something.
>
> I definitely think we need to address the long irq disable periods
> instead of trying to hack around the issue in the timekeeping code.
OK, I have a theory. Here are snippets from my dmesg log (this is
from a current upstream kernel). The "BH" lines are instrumentation I
added to read the RTC at a few places:
Linux version 3.0.0-smp-DEV ...
BH now rtc 1306992452 (start_kernel, before setup_arch)
Printk 230K of numa=fake debug stuff (more than seems strictly
necessary to me, but there it is). All this data goes into the log
buffer, not to the UART, because the console hasn't been
initialized yet.
Command line: auto BOOT_IMAGE=300smpD ro root=/dev/hda1,/dev/sda1
ignore_loglevel console=ttyS0,115200n8 swiotlb=16000 numa=fake=16M
BH now rtc 1306992457 (start_kernel, after mm init)
BH btime rtc 1306992457 xt 1306992457 wtm -1306992457 (timekeeping_init)
Snapshot btime and init xtime and wall_to_monotonic. So
theoretically, we should see 1306992457 in /proc/stat btime.
Console: colour dummy device 80x25
console [ttyS0] enabled
Now ttyS0 is enabled, so we dump the log buffer to the UART. I think
this happens in console_unlock(), with interrupts disabled for the whole
buffer.
BH now rtc 1306992481 xt 1306992459 wtm -1306992457
clocksource_register jiffies
This RTC read is in clocksource_register(); note that xtime is now
22 seconds behind the RTC. The UART is running at 115200 baud,
and 230K/(115200/10) = about 20 seconds, so this sort of matches
the time I expect it to take to dump the buffer.
HPET: 3 timers in total, 0 timers will be used for per-cpu timer
hpet0: at MMIO 0xfed00000, IRQs 2, 8, 0
hpet0: 3 comparators, 64-bit 14.318180 MHz counter
Switching to clocksource hpet
Refined TSC clocksource calibration: 2400.084 MHz.
Switching to clocksource tsc
BH <set rtc 1306992488 xt 1306992465 wtm -1306992457 to 1306992488
BH settimeofday delta 23
BH set> rtc 1306992488 xt 1306992488 wtm -1306992480
This settimeofday() call happens when userland runs "hwclock --hctosys".
We adjust xtime and wall_to_monotonic by 23 seconds.
Finally, we read /proc/stat. Since we only re-synced the system clock
to the RTC (we didn't sync to an external source), btime ought to be
the same as it was when we collected it, i.e., 1306992457. But
instead, we get this, which I assume is the same wall_to_monotonic we
computed in settimeofday(), but rounded:
btime 1306992479
Things that look interesting to me:
- numa=fake prints an excessive amount of debug info!
- We register the ttyS0 console before registering the jiffies clocksource.
It seems like the jiffies clocksource ought to be registered about the
same time we initialize xtime in timekeeping_init().
- Interrupts are disabled the entire time we dump the log buffer, which
could be a very long time. Maybe we could restore/disable interrupts
periodically there.
Bjorn
> CC'ing the linux-serial list and Alan for their thoughts: Does the
> serial port is disabling irqs for 30+ seconds during boot seem at all
> sane?
printk is supposed to work from anywhere so it takes the hit. It should
only be printk that is involved however.
Alan
I'm still spinning my wheels on this, so I guess the only thing left
is to ask even more stupid questions :)
I'm only concerned about the early boot sequence, and I think only
about the period when we're using the jiffies clocksource. My
understanding is:
- I'm using the jiffies clocksource during early boot.
- Jiffies depends on a periodic (1000 HZ in my case) interrupt that
updates xtime via the tick_periodic -> do_timer -> update_wall_time
path.
- If those periodic interrupts are lost, those xtime updates are forever lost.
- An interrupt would be lost if interrupts are disabled for an
interval that covers two or more ticks (my guess ... I'm thinking that
if interrupts were re-enabled before the second tick, the first one
would be delayed but not lost).
- The RTC runs independently of CPU interrupts being disabled, so
its time is not lost.
- User-space will typically reset xtime to match the RTC
And my sequence of events is:
- xtime = RTC reading #1
- wall_to_monotonic = -xtime
- periodic tick increments xtime
- some ticks are lost while interrupts are disabled
- by the time we switch from jiffies to hpet and eventually tsc
clock source, the RTC is ahead of xtime by several seconds (1-2 in a
normal boot, 30+ in more extreme cases)
- user-space resets xtime to RTC ("hwclock -hctosys" in my case),
which adds the delta to xtime and subtracts it from wall_to_monotonic
- getboottime() returns -wall_to_monotonic (should be RTC reading
#1, but now "reading #1 + delta")
It seems like we're throwing away information here at the time we
switch from jiffies to a more capable clocksource -- at that point, we
know the RTC - xtime delta, and we know that delta represents time
when interrupts were disabled. (Obviously this only applies during
early boot, before we do any RTC updates.)
My naive thought was "well, what if we just use the RTC directly as a
clocksource." It's crappy resolution, but at least it doesn't lose
time, so I tried the following, which didn't work at all (hangs during
boot). But I don't know enough to know *why* this isn't feasible.
Seems like jiffies can be different sizes, so why not 1 Hz?
+static cycle_t rtc_read(struct clocksource *cs)
+{
+ static struct timespec base;
+ struct timespec now;
+ static int first = 1;
+
+ read_persistent_clock(&now);
+ if (first) {
+ first = 0;
+ base = now;
+ }
+
+ return (cycle_t) (now.tv_sec - base.tv_sec);
+}
+
+struct clocksource clocksource_rtc = {
+ .name = "rtc",
+ .rating = 1,
+ .read = rtc_read,
+ .flags = CLOCK_SOURCE_IS_CONTINUOUS,
+ .mask = CLOCKSOURCE_MASK(64),
+ .mult = NSEC_PER_SEC,
+};
+
+struct clocksource * __init clocksource_default_clock(void)
+{
+ return &clocksource_rtc;
+}
Bjorn
On Mon, 2011-06-06 at 23:20 -0600, Bjorn Helgaas wrote:
> I'm still spinning my wheels on this, so I guess the only thing left
> is to ask even more stupid questions :)
>
> I'm only concerned about the early boot sequence, and I think only
> about the period when we're using the jiffies clocksource. My
> understanding is:
>
> - I'm using the jiffies clocksource during early boot.
> - Jiffies depends on a periodic (1000 HZ in my case) interrupt that
> updates xtime via the tick_periodic -> do_timer -> update_wall_time
> path.
Yep.
> - If those periodic interrupts are lost, those xtime updates are forever lost.
Yep.
> - An interrupt would be lost if interrupts are disabled for an
> interval that covers two or more ticks (my guess ... I'm thinking that
> if interrupts were re-enabled before the second tick, the first one
> would be delayed but not lost).
Yep. There's also some possibly connected issues here to irq starvation
related to the irq priorities (so even if irqs were disabled, if the irq
is getting hammered, and that irq is higher priority then the tick, you
can lose ticks that way as well).
> - The RTC runs independently of CPU interrupts being disabled, so
> its time is not lost.
Yep.
> - User-space will typically reset xtime to match the RTC
Not really sure about this one. I think most systems will set the system
time via NTP and then after we're considered in-sync with ntpd we'll set
the RTC to system time every 11 minutes.
But regardless, the issue that if we lose ticks, the btime won't seem to
be correct remains.
> And my sequence of events is:
>
> - xtime = RTC reading #1
> - wall_to_monotonic = -xtime
> - periodic tick increments xtime
> - some ticks are lost while interrupts are disabled
> - by the time we switch from jiffies to hpet and eventually tsc
> clock source, the RTC is ahead of xtime by several seconds (1-2 in a
> normal boot, 30+ in more extreme cases)
> - user-space resets xtime to RTC ("hwclock -hctosys" in my case),
> which adds the delta to xtime and subtracts it from wall_to_monotonic
> - getboottime() returns -wall_to_monotonic (should be RTC reading
> #1, but now "reading #1 + delta")
>
> It seems like we're throwing away information here at the time we
> switch from jiffies to a more capable clocksource -- at that point, we
> know the RTC - xtime delta, and we know that delta represents time
> when interrupts were disabled. (Obviously this only applies during
> early boot, before we do any RTC updates.)
But I think you're focusing on trying to solve the symptom instead of
the problem. The really big issue here is that irqs are apparently being
disabled for 30 seconds at a time.
Sure, once a real clocksource is registered, maybe you don't see
timekeeping problems, but if the serial console gets more output, but
then you might see strange scheduling issues, or very late timers.
Further, you could hit other strange problems like OOM issues if you're
doing lots of RCU and the grace periods don't get to run.
Further, even if we did use the RTC to correct for lost ticks that
happened while using the jiffies clocksource, you have the fact that the
RTC resolution is so coarse, you couldn't account for lost ticks of less
then a second anyway (which I suspect is much more common then the 30
second intervals you're seeing).
> My naive thought was "well, what if we just use the RTC directly as a
> clocksource." It's crappy resolution, but at least it doesn't lose
> time, so I tried the following, which didn't work at all (hangs during
> boot). But I don't know enough to know *why* this isn't feasible.
It wouldn't be impossible to use the RTC as a clocksource (I think old
601 ppc macs use this). However, its not really a generic solution, as
systems have a number of different types of RTCs, some which go over i2c
buses or require interrupts in order to be read. read_persistent_clock
is safe, but it doesn't solve the issue for systems that don't provide a
read_persistent_clock hook.
> Seems like jiffies can be different sizes, so why not 1 Hz?
Hmm. That is interesting. I'm guessing it probably hits an edge case
where the timekeeping code expects there to be a non-zero shift value.
But again, I don't think this approach is going to solve all the issues
that might be caused by 30-seconds of irqs being off.
Maybe to get this back on coarse, could you provide some additional
details about the machine where you're seeing this? Is there one
specific driver that is putting out tons of output over the serial
console? Or is there anything unique about the serial port or its
settings (is it configured at 300 baud :)? What is the /proc/interrupts
count after boot on one of these systems?
thanks
-john
On Tue, 2011-06-07 at 10:50 -0700, john stultz wrote:
> Maybe to get this back on coarse, could you provide some additional
> details about the machine where you're seeing this? Is there one
> specific driver that is putting out tons of output over the serial
> console? Or is there anything unique about the serial port or its
> settings (is it configured at 300 baud :)? What is the /proc/interrupts
> count after boot on one of these systems?
Sorry, I just remembered you already provided some of these details
below:
On Thu, 2011-06-02 at 00:34 -0600, Bjorn Helgaas wrote:
> Linux version 3.0.0-smp-DEV ...
> BH now rtc 1306992452 (start_kernel, before setup_arch)
> Printk 230K of numa=fake debug stuff (more than seems strictly
> necessary to me, but there it is). All this data goes into the log
> buffer, not to the UART, because the console hasn't been
> initialized yet.
[snip]
> console [ttyS0] enabled
> Now ttyS0 is enabled, so we dump the log buffer to the UART. I think
> this happens in console_unlock(), with interrupts disabled for the whole
> buffer.
>
> BH now rtc 1306992481 xt 1306992459 wtm -1306992457
> clocksource_register jiffies
> This RTC read is in clocksource_register(); note that xtime is now
> 22 seconds behind the RTC. The UART is running at 115200 baud,
> and 230K/(115200/10) = about 20 seconds, so this sort of matches
> the time I expect it to take to dump the buffer.
Ok. So having the 230k data backlog at console_unlock() seems to be part of the issue.
Maybe something like the following could help? By only holding irqs off for 1k chunks?
thanks
-john
diff --git a/kernel/printk.c b/kernel/printk.c
index 3518539..9703b22 100644
--- a/kernel/printk.c
+++ b/kernel/printk.c
@@ -1243,6 +1243,7 @@ void console_unlock(void)
unsigned long flags;
unsigned _con_start, _log_end;
unsigned wake_klogd = 0;
+ unsigned chunk_size, length;
if (console_suspended) {
up(&console_sem);
@@ -1251,14 +1252,18 @@ void console_unlock(void)
console_may_schedule = 0;
+ chunk_size = min(LOG_BUF_MASK, 1024); /* 1k chunks */
+
for ( ; ; ) {
spin_lock_irqsave(&logbuf_lock, flags);
wake_klogd |= log_start - log_end;
if (con_start == log_end)
break; /* Nothing to print */
+ length = (log_end - con_start) & LOG_BUF_MASK;
+ length = min(length , chunk_size);
_con_start = con_start;
- _log_end = log_end;
- con_start = log_end; /* Flush */
+ _log_end = (con_start + length) & LOG_BUF_MASK;
+ con_start = _log_end; /* Flush */
spin_unlock(&logbuf_lock);
stop_critical_timings(); /* don't trace print latency */
call_console_drivers(_con_start, _log_end);
On Tue, Jun 7, 2011 at 7:03 PM, john stultz <[email protected]> wrote:
> On Tue, 2011-06-07 at 10:50 -0700, john stultz wrote:
>> Maybe to get this back on coarse, could you provide some additional
>> details about the machine where you're seeing this? Is there one
>> specific driver that is putting out tons of output over the serial
>> console? Or is there anything unique about the serial port or its
>> settings (is it configured at 300 baud :)? What is the /proc/interrupts
>> count after boot on one of these systems?
>
> Sorry, I just remembered you already provided some of these details
> below:
>
> On Thu, 2011-06-02 at 00:34 -0600, Bjorn Helgaas wrote:
>> Linux version 3.0.0-smp-DEV ...
>> ? BH now ? rtc 1306992452 (start_kernel, before setup_arch)
>> ? ? ? Printk 230K of numa=fake debug stuff (more than seems strictly
>> ? ? ? necessary to me, but there it is). ?All this data goes into the log
>> ? ? ? buffer, not to the UART, because the console hasn't been
>> ? ? ? initialized yet.
> [snip]
>> console [ttyS0] enabled
>> ? ? ? Now ttyS0 is enabled, so we dump the log buffer to the UART. ?I think
>> ? ? ? this happens in console_unlock(), with interrupts disabled for the whole
>> ? ? ? buffer.
>>
>> ? BH now ? rtc 1306992481 xt 1306992459 wtm -1306992457
>> ? clocksource_register jiffies
>> ? ? ? This RTC read is in clocksource_register(); note that xtime is now
>> ? ? ? 22 seconds behind the RTC. ?The UART is running at 115200 baud,
>> ? ? ? and 230K/(115200/10) = about 20 seconds, so this sort of matches
>> ? ? ? the time I expect it to take to dump the buffer.
>
>
> Ok. So having the 230k data backlog at console_unlock() seems to be part of the issue.
>
> Maybe something like the following could help? By only holding irqs off for 1k chunks?
>
> thanks
> -john
>
>
> diff --git a/kernel/printk.c b/kernel/printk.c
> index 3518539..9703b22 100644
> --- a/kernel/printk.c
> +++ b/kernel/printk.c
> @@ -1243,6 +1243,7 @@ void console_unlock(void)
> ? ? ? ?unsigned long flags;
> ? ? ? ?unsigned _con_start, _log_end;
> ? ? ? ?unsigned wake_klogd = 0;
> + ? ? ? unsigned chunk_size, length;
>
> ? ? ? ?if (console_suspended) {
> ? ? ? ? ? ? ? ?up(&console_sem);
> @@ -1251,14 +1252,18 @@ void console_unlock(void)
>
> ? ? ? ?console_may_schedule = 0;
>
> + ? ? ? chunk_size = min(LOG_BUF_MASK, 1024); /* 1k chunks */
> +
> ? ? ? ?for ( ; ; ) {
> ? ? ? ? ? ? ? ?spin_lock_irqsave(&logbuf_lock, flags);
> ? ? ? ? ? ? ? ?wake_klogd |= log_start - log_end;
> ? ? ? ? ? ? ? ?if (con_start == log_end)
> ? ? ? ? ? ? ? ? ? ? ? ?break; ? ? ? ? ? ? ? ? ?/* Nothing to print */
> + ? ? ? ? ? ? ? length = (log_end - con_start) & LOG_BUF_MASK;
> + ? ? ? ? ? ? ? length = min(length , chunk_size);
> ? ? ? ? ? ? ? ?_con_start = con_start;
> - ? ? ? ? ? ? ? _log_end = log_end;
> - ? ? ? ? ? ? ? con_start = log_end; ? ? ? ? ? ?/* Flush */
> + ? ? ? ? ? ? ? _log_end = (con_start + length) & LOG_BUF_MASK;
> + ? ? ? ? ? ? ? con_start = _log_end; ? ? ? ? ? /* Flush */
> ? ? ? ? ? ? ? ?spin_unlock(&logbuf_lock);
> ? ? ? ? ? ? ? ?stop_critical_timings(); ? ? ? ?/* don't trace print latency */
> ? ? ? ? ? ? ? ?call_console_drivers(_con_start, _log_end);
That's a much better attempt than the disastrous one I made. But it
didn't help at all, and I think I know one reason: if we reenable
interrupts every 1K, and I have 230K to dump, that's 230 additional
ticks we might get. But with HZ=1000, that's only a quarter second of
wallclock time.
Even at 115200 baud, I think we can only print about 10 characters/ms.
It doesn't seem reasonable to me to try to flush in chunks anywhere
near that small.
But something else must still be going on, because even when I *did*
turn the chunk size down to 100 or 10 or even 4 characters, we still
didn't recover any of that missing wallclock time. I tried to verify
that interrupts were actually enabled by the local_irq_restore(), and
they were, but maybe there's still a priority issue or something.
I wonder if we could do something with the TSC, e.g., capture it
early, then use it in getboottime() to compute the delta back from the
current time. Machines without a stable TSC would use the present
system.
Bjorn