all,
After a fresh boot, the Chrome web browser behaves normally. Pages
load quickly and scroll fast. Even image-heavy sites such as
images.google.com work fine. However, after a suspend and resume
cycle, Chrome becomes very slow. Pages take ten seconds or more to
load. The scroll bars and buttons are almost completely
unresponsive. Interestingly, I can run Firefox on the same sites
and it has no issue whatsoever.
I have bisected the kernel and found that the following commit
introduced the bug. It is present in the latest linux-next (20150602).
From 868a3e915f7f5eba8f8cb4f7da2276760807c51c Mon Sep 17 00:00:00 2001
From: Thomas Gleixner <[email protected]>
Date: Tue, 14 Apr 2015 21:08:37 +0000
Subject: [PATCH] hrtimer: Make offset update smarter
On every tick/hrtimer interrupt we update the offset variables of the
clock bases. That's silly because these offsets change very seldom.
Add a sequence counter to the time keeping code which keeps track of
the offset updates (clock_was_set()). Have a sequence cache in the
hrtimer cpu bases to evaluate whether the offsets must be updated or
not. This allows us later to avoid pointless cacheline pollution.
Signed-off-by: Thomas Gleixner <[email protected]>
Reviewed-by: Preeti U Murthy <[email protected]>
Acked-by: Peter Zijlstra <[email protected]>
Cc: Viresh Kumar <[email protected]>
Cc: Marcelo Tosatti <[email protected]>
Cc: Frederic Weisbecker <[email protected]>
Cc: John Stultz <[email protected]>
Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: Thomas Gleixner <[email protected]>
Cc: John Stultz <[email protected]>
---
include/linux/hrtimer.h | 4 ++--
include/linux/timekeeper_internal.h | 2 ++
kernel/time/hrtimer.c | 3 ++-
kernel/time/timekeeping.c | 23 ++++++++++++++++-------
kernel/time/timekeeping.h | 7 ++++---
5 files changed, 26 insertions(+), 13 deletions(-)
--
- Jeremiah Mahler
On Wed, 3 Jun 2015, Jeremiah Mahler wrote:
> After a fresh boot, the Chrome web browser behaves normally. Pages
> load quickly and scroll fast. Even image-heavy sites such as
> images.google.com work fine. However, after a suspend and resume
> cycle, Chrome becomes very slow. Pages take ten seconds or more to
> load. The scroll bars and buttons are almost completely
> unresponsive. Interestingly, I can run Firefox on the same sites
> and it has no issue whatsoever.
Weird.
> I have bisected the kernel and found that the following commit
> introduced the bug. It is present in the latest linux-next (20150602).
>
> From 868a3e915f7f5eba8f8cb4f7da2276760807c51c Mon Sep 17 00:00:00 2001
> From: Thomas Gleixner <[email protected]>
> Date: Tue, 14 Apr 2015 21:08:37 +0000
> Subject: [PATCH] hrtimer: Make offset update smarter
>
> On every tick/hrtimer interrupt we update the offset variables of the
> clock bases. That's silly because these offsets change very seldom.
>
> Add a sequence counter to the time keeping code which keeps track of
> the offset updates (clock_was_set()). Have a sequence cache in the
> hrtimer cpu bases to evaluate whether the offsets must be updated or
> not. This allows us later to avoid pointless cacheline pollution.
I had to wrap my head around that for quite a while, but I think I
have decoded the issue. Can you please test whether the patch below
solves your problem?
Thanks,
tglx
------------------------>
diff --git a/kernel/time/hrtimer.c b/kernel/time/hrtimer.c
index 278d4b36fd94..e9dfcd0b8c41 100644
--- a/kernel/time/hrtimer.c
+++ b/kernel/time/hrtimer.c
@@ -1492,6 +1492,12 @@ static void init_hrtimers_cpu(int cpu)
cpu_base->cpu = cpu;
hrtimer_init_hres(cpu_base);
+ /*
+ * Force an update by setting the clock was set sequence to an
+ * odd value.
+ */
+ cpu_base->clock_was_set_seq = 1;
+ hrtimer_update_base(cpu_base);
}
#ifdef CONFIG_HOTPLUG_CPU
diff --git a/kernel/time/timekeeping.c b/kernel/time/timekeeping.c
index 90ed5db67c1d..c97710137a9e 100644
--- a/kernel/time/timekeeping.c
+++ b/kernel/time/timekeeping.c
@@ -593,7 +593,7 @@ static void timekeeping_update(struct timekeeper *tk, unsigned int action)
update_fast_timekeeper(&tk->tkr_raw, &tk_fast_raw);
if (action & TK_CLOCK_WAS_SET)
- tk->clock_was_set_seq++;
+ tk->clock_was_set_seq += 2;
}
/**
Thomas,
On Thu, Jun 04, 2015 at 01:22:45PM +0200, Thomas Gleixner wrote:
> On Wed, 3 Jun 2015, Jeremiah Mahler wrote:
[...]
>
> I had to wrap my head around that for quite a while, but I think I
> have decoded the issue. Can you please test whether the patch below
> solves your problem?
>
> Thanks,
>
> tglx
>
> ------------------------>
>
> diff --git a/kernel/time/hrtimer.c b/kernel/time/hrtimer.c
> index 278d4b36fd94..e9dfcd0b8c41 100644
> --- a/kernel/time/hrtimer.c
> +++ b/kernel/time/hrtimer.c
> @@ -1492,6 +1492,12 @@ static void init_hrtimers_cpu(int cpu)
>
> cpu_base->cpu = cpu;
> hrtimer_init_hres(cpu_base);
> + /*
> + * Force an update by setting the clock was set sequence to an
> + * odd value.
> + */
> + cpu_base->clock_was_set_seq = 1;
> + hrtimer_update_base(cpu_base);
> }
>
> #ifdef CONFIG_HOTPLUG_CPU
> diff --git a/kernel/time/timekeeping.c b/kernel/time/timekeeping.c
> index 90ed5db67c1d..c97710137a9e 100644
> --- a/kernel/time/timekeeping.c
> +++ b/kernel/time/timekeeping.c
> @@ -593,7 +593,7 @@ static void timekeeping_update(struct timekeeper *tk, unsigned int action)
> update_fast_timekeeper(&tk->tkr_raw, &tk_fast_raw);
>
> if (action & TK_CLOCK_WAS_SET)
> - tk->clock_was_set_seq++;
> + tk->clock_was_set_seq += 2;
> }
>
> /**
With this change it won't boot. It just hangs at "Booting the kernel".
--
- Jeremiah Mahler
On Wed, Jun 3, 2015 at 5:56 PM, Jeremiah Mahler <[email protected]> wrote:
> all,
>
> After a fresh boot, the Chrome web browser behaves normally. Pages
> load quickly and scroll fast. Even image-heavy sites such as
> images.google.com work fine. However, after a suspend and resume
> cycle, Chrome becomes very slow. Pages take ten seconds or more to
> load. The scroll bars and buttons are almost completely
> unresponsive. Interestingly, I can run Firefox on the same sites
> and it has no issue whatsoever.
>
> I have bisected the kernel and found that the following commit
> introduced the bug. It is present in the latest linux-next (20150602).
>
> From 868a3e915f7f5eba8f8cb4f7da2276760807c51c Mon Sep 17 00:00:00 2001
> From: Thomas Gleixner <[email protected]>
> Date: Tue, 14 Apr 2015 21:08:37 +0000
> Subject: [PATCH] hrtimer: Make offset update smarter
>
> On every tick/hrtimer interrupt we update the offset variables of the
> clock bases. That's silly because these offsets change very seldom.
>
> Add a sequence counter to the time keeping code which keeps track of
> the offset updates (clock_was_set()). Have a sequence cache in the
> hrtimer cpu bases to evaluate whether the offsets must be updated or
> not. This allows us later to avoid pointless cacheline pollution.
>
> Signed-off-by: Thomas Gleixner <[email protected]>
> Reviewed-by: Preeti U Murthy <[email protected]>
> Acked-by: Peter Zijlstra <[email protected]>
> Cc: Viresh Kumar <[email protected]>
> Cc: Marcelo Tosatti <[email protected]>
> Cc: Frederic Weisbecker <[email protected]>
> Cc: John Stultz <[email protected]>
> Link: http://lkml.kernel.org/r/[email protected]
> Signed-off-by: Thomas Gleixner <[email protected]>
> Cc: John Stultz <[email protected]>
> ---
> include/linux/hrtimer.h | 4 ++--
> include/linux/timekeeper_internal.h | 2 ++
> kernel/time/hrtimer.c | 3 ++-
> kernel/time/timekeeping.c | 23 ++++++++++++++++-------
> kernel/time/timekeeping.h | 7 ++++---
> 5 files changed, 26 insertions(+), 13 deletions(-)
So I suspect the problem is that the change to clock_was_set_seq in
timekeeping_update is done prior to mirroring the time state to the
shadow-timekeeper. Thus the next time we do update_wall_time() the
updated sequence is overwritten by what's in the shadow copy. The
attached patch, which moves the modification up, seems to avoid the
issue for me.
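To make the ordering concrete, here is a tiny userspace model of the
flow (an illustration only, obviously not the kernel code; just the one
field that matters is modeled):

#include <stdio.h>
#include <string.h>

#define TK_CLOCK_WAS_SET	0x1
#define TK_MIRROR		0x2

/* Minimal stand-ins for tk_core.timekeeper and shadow_timekeeper. */
struct timekeeper { unsigned int clock_was_set_seq; };

static struct timekeeper real_tk;
static struct timekeeper shadow_tk;

/* Broken ordering: the shadow is mirrored before the sequence bump. */
static void timekeeping_update(struct timekeeper *tk, unsigned int action)
{
	if (action & TK_MIRROR)
		memcpy(&shadow_tk, &real_tk, sizeof(real_tk));

	if (action & TK_CLOCK_WAS_SET)
		tk->clock_was_set_seq++;	/* the shadow never sees this */
}

/* update_wall_time() accumulates into the shadow, then copies it back. */
static void update_wall_time(void)
{
	memcpy(&real_tk, &shadow_tk, sizeof(real_tk));
}

int main(void)
{
	timekeeping_update(&real_tk, TK_CLOCK_WAS_SET | TK_MIRROR);
	printf("seq after clock_was_set:    %u\n", real_tk.clock_was_set_seq);
	update_wall_time();
	printf("seq after update_wall_time: %u\n", real_tk.clock_was_set_seq);
	return 0;
}

This prints 1 and then 0: the bump is lost on the copy-back. Doing the
increment before the mirror, as in the attached patch, keeps both
copies in sync.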
Thomas: Looking at the problematic change, I'm not a big fan of it.
Caching timekeeping state here in the hrtimer code has been a source
of bugs in the past, and I'm not sure I see how avoiding copying
24 bytes is that big of a win. Especially since it adds more state to
the timekeeper and hrtimer base that we have to read and manage.
Personally I'd prefer a revert over my fix.
thanks
-john
John,
On Thu, Jun 04, 2015 at 03:54:35PM -0700, John Stultz wrote:
> On Wed, Jun 3, 2015 at 5:56 PM, Jeremiah Mahler <[email protected]> wrote:
[...]
>
>
> So I suspect the problem is that the change to clock_was_set_seq in
> timekeeping_update is done prior to mirroring the time state to the
> shadow-timekeeper. Thus the next time we do update_wall_time() the
> updated sequence is overwritten by what's in the shadow copy. The
> attached patch, which moves the modification up, seems to avoid the
> issue for me.
>
> Thomas: Looking at the problematic change, I'm not a big fan of it.
> Caching timekeeping state here in the hrtimer code has been a source
> of bugs in the past, and I'm not sure I see how avoiding copying
> 24 bytes is that big of a win. Especially since it adds more state to
> the timekeeper and hrtimer base that we have to read and manage.
> Personally I'd prefer a revert over my fix.
>
> thanks
> -john
> diff --git a/kernel/time/timekeeping.c b/kernel/time/timekeeping.c
> index 90ed5db..53be796 100644
> --- a/kernel/time/timekeeping.c
> +++ b/kernel/time/timekeeping.c
> @@ -580,6 +580,9 @@ static void timekeeping_update(struct timekeeper *tk, unsigned int action)
> ntp_clear();
> }
>
> + if (action & TK_CLOCK_WAS_SET)
> + tk->clock_was_set_seq++;
> +
> tk_update_ktime_data(tk);
>
> update_vsyscall(tk);
> @@ -591,9 +594,6 @@ static void timekeeping_update(struct timekeeper *tk, unsigned int action)
>
> update_fast_timekeeper(&tk->tkr_mono, &tk_fast_mono);
> update_fast_timekeeper(&tk->tkr_raw, &tk_fast_raw);
> -
> - if (action & TK_CLOCK_WAS_SET)
> - tk->clock_was_set_seq++;
> }
>
> /**
That patch fixes the problem for me.
Thanks John.
--
- Jeremiah Mahler
* John Stultz <[email protected]> wrote:
> So I suspect the problem is that the change to clock_was_set_seq in
> timekeeping_update is done prior to mirroring the time state to the
> shadow-timekeeper. Thus the next time we do update_wall_time() the updated
> sequence is overwritten by what's in the shadow copy. The attached patch,
> which moves the modification up, seems to avoid the issue for me.
>
> Thomas: Looking at the problematic change, I'm not a big fan of it. Caching
> timekeeping state here in the hrtimer code has been a source of bugs in the
> past, and I'm not sure I see how avoiding copying 24 bytes is that big of a win.
> Especially since it adds more state to the timekeeper and hrtimer base that we
> have to read and manage. Personally I'd prefer a revert over my fix.
So it's not really the copying of the 24 bytes that is the problem with that
pattern.
It's the constant dirtying of a cacheline that would otherwise be read-mostly, and
then the reading of it from the percpu hrtimer interrupts.
That can have negative scalability effects similar to a global lock.
( Now I have not checked whether this cacheline truly becomes read-mostly after
the original commit - I suspect Thomas did. )
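In pattern form, the concern is roughly this (an illustrative sketch,
not the real structures):

/*
 * One globally executed writer dirties a line that every CPU's
 * hrtimer interrupt then has to re-fetch.
 */
struct clock_offsets {
	long long offs_real;
	long long offs_boot;
	long long offs_tai;
};

static struct clock_offsets shared_offs;	/* one shared cacheline */

/* Global timekeeping update, runs on one CPU per tick. */
void timekeeping_writer(void)
{
	/*
	 * Unconditional store: invalidates the line in every other
	 * CPU's cache, although the value almost never truly changes.
	 */
	shared_offs.offs_real++;
}

/* Per-CPU hrtimer interrupt. */
long long hrtimer_reader(void)
{
	/*
	 * Each CPU must pull the line back across the interconnect:
	 * the same ping-pong a contended global lock would cause.
	 */
	return shared_offs.offs_real;
}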
> + if (action & TK_CLOCK_WAS_SET)
> + tk->clock_was_set_seq++;
> +
> tk_update_ktime_data(tk);
So I'd also add a comment that this update should be done before:
if (action & TK_MIRROR)
memcpy(&shadow_timekeeper, &tk_core.timekeeper,
sizeof(tk_core.timekeeper));
Also, there appears to be a layering violation here (unless I mis-read the
tk_real/tk logic): this should copy 'tk', not access tk_core directly, correct?
Right now this does not matter, because 'tk == &tk_core' is always supposed to be
true, or at least it should point to an identical shadow copy, right?
But nevertheless tk_core should only ever be directly referenced when 'tk' is
initialized from it.
Also, there's a few more of these apparent layering violations:
git grep 'tk_core' kernel/time/timekeeping.c | grep -v 'struct timekeeper'
I realize that we are transitioning from the former global-variable-based
timekeeping to a more parametric model, so this isn't a complaint, just an
observation.
Right now it's a code cleanliness detail, but if/when we introduce clocks that are
updated independently of each other, then it will also matter functionally.
Thanks,
Ingo
On Thu, 4 Jun 2015, John Stultz wrote:
> On Wed, Jun 3, 2015 at 5:56 PM, Jeremiah Mahler <[email protected]> wrote:
> So I suspect the problem is that the change to clock_was_set_seq in
> timekeeping_update is done prior to mirroring the time state to the
> shadow-timekeeper. Thus the next time we do update_wall_time() the
> updated sequence is overwritten by what's in the shadow copy. The
> attached patch, which moves the modification up, seems to avoid the
> issue for me.
Duh, yes.
> Thomas: Looking at the problematic change, I'm not a big fan of it.
> Caching timekeeping state here in the hrtimer code has been a source
> of bugs in the past, and I'm not sure I see how avoiding copying
> 24 bytes is that big of a win. Especially since it adds more state to
> the timekeeper and hrtimer base that we have to read and manage.
It's not about copying 24 bytes. It's about touching 3 cache lines for
nothing. In situations where we run high frequency periodic timers on
clock monotonic and nothing is going on in the other clock domains,
which is a pretty common situation, this is measurable in terms of
cache utilization. I went to great lengths to optimize the cache
footprint and access patterns, and that unconditional update really
makes a measurable difference.
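The whole point of the sequence cache is that the hot path degrades to
a single compare with no stores. Roughly (a simplified sketch of the
idea, not the actual kernel code):

/* Global timekeeper side: seq is bumped only by clock_was_set(). */
struct tk_state {
	unsigned int clock_was_set_seq;
	long long offs_real, offs_boot, offs_tai;
};

/* Per-CPU hrtimer base: caches the offsets plus the sequence they
 * were copied under. */
struct cpu_base_state {
	unsigned int clock_was_set_seq;
	long long offs_real, offs_boot, offs_tai;
};

void update_base_offsets(struct cpu_base_state *base,
			 const struct tk_state *tk)
{
	/* Hot path: nothing changed since we last looked, so we do
	 * not touch the per-CPU cachelines at all. */
	if (base->clock_was_set_seq == tk->clock_was_set_seq)
		return;

	/* Cold path: only after a clock_was_set() event. */
	base->clock_was_set_seq = tk->clock_was_set_seq;
	base->offs_real = tk->offs_real;
	base->offs_boot = tk->offs_boot;
	base->offs_tai  = tk->offs_tai;
}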
Thanks,
tglx
* Thomas Gleixner <[email protected]> wrote:
> On Thu, 4 Jun 2015, John Stultz wrote:
> > On Wed, Jun 3, 2015 at 5:56 PM, Jeremiah Mahler <[email protected]> wrote:
> > So I suspect the problem is that the change to clock_was_set_seq in
> > timekeeping_update is done prior to mirroring the time state to the
> > shadow-timekeeper. Thus the next time we do update_wall_time() the
> > updated sequence is overwritten by what's in the shadow copy. The
> > attached patch, which moves the modification up, seems to avoid the
> > issue for me.
>
> Duh, yes.
>
> > Thomas: Looking at the problematic change, I'm not a big fan of it. Caching
> > timekeeping state here in the hrtimer code has been a source of bugs in the
> > past, and I'm not sure I see how avoiding copying 24 bytes is that big of a
> > win. Especially since it adds more state to the timekeeper and hrtimer base
> > that we have to read and manage.
>
> It's not about copying 24 bytes. It's about touching 3 cache lines for nothing.
> In situations where we run high frequency periodic timers on clock monotonic and
> nothing is going on in the other clock domains, which is a pretty common
> situation, this is measurable in terms of cache utilization. [...]
It's not just about 'touching': it's about _dirtying_ cachelines from a globally
executed function (timekeeping), which are then accessed by per-CPU functionality
(hrtimers).
That makes it far more expensive: it has scalability-limiting effects similar to a
global lock - while if we do it smartly it can perform as essentially lockless code
in most cases.
Thanks,
Ingo
On Fri, 5 Jun 2015, Thomas Gleixner wrote:
> On Thu, 4 Jun 2015, John Stultz wrote:
> > On Wed, Jun 3, 2015 at 5:56 PM, Jeremiah Mahler <[email protected]> wrote:
> > So I suspect the problem is that the change to clock_was_set_seq in
> > timekeeping_update is done prior to mirroring the time state to the
> > shadow-timekeeper. Thus the next time we do update_wall_time() the
> > updated sequence is overwritten by what's in the shadow copy. The
> > attached patch, which moves the modification up, seems to avoid the
> > issue for me.
>
> Duh, yes.
>
> > Thomas: Looking at the problematic change, I'm not a big fan of it.
> > Caching timekeeping state here in the hrtimer code has been a source
> > of bugs in the past, and I'm not sure I see how avoiding copying
> > 24 bytes is that big of a win. Especially since it adds more state to
> > the timekeeper and hrtimer base that we have to read and manage.
>
> It's not about copying 24 bytes. It's about touching 3 cache lines for
> nothing. In situations where we run high frequency periodic timers on
> clock monotonic and nothing is going on in the other clock domains,
> which is a pretty common situation, this is measurable in terms of
> > cache utilization. I went to great lengths to optimize the cache
> > footprint and access patterns, and that unconditional update really
> > makes a measurable difference.
That said, can you please send a patch with a proper changelog?
Thanks,
tglx
On Fri, Jun 5, 2015 at 3:07 AM, Ingo Molnar <[email protected]> wrote:
>
> * Thomas Gleixner <[email protected]> wrote:
>
>> On Thu, 4 Jun 2015, John Stultz wrote:
>> > On Wed, Jun 3, 2015 at 5:56 PM, Jeremiah Mahler <[email protected]> wrote:
>> > So I suspect the problem is that the change to clock_was_set_seq in
>> > timekeeping_update is done prior to mirroring the time state to the
>> > shadow-timekeeper. Thus the next time we do update_wall_time() the
>> > updated sequence is overwritten by what's in the shadow copy. The
>> > attached patch, which moves the modification up, seems to avoid the
>> > issue for me.
>>
>> Duh, yes.
>>
>> > Thomas: Looking at the problematic change, I'm not a big fan of it. Caching
>> > timekeeping state here in the hrtimer code has been a source of bugs in the
>> > past, and I'm not sure I see how avoiding copying 24 bytes is that big of a
>> > win. Especially since it adds more state to the timekeeper and hrtimer base
>> > that we have to read and manage.
>>
>> It's not about copying 24 bytes. It's about touching 3 cache lines for nothing.
>> In situations where we run high frequency periodic timers on clock monotonic and
>> nothing is going on in the other clock domains, which is a pretty common
>> situation, this is measurable in terms of cache utilization. [...]
>
> It's not just about 'touching': it's about _dirtying_ cachelines from a globally
> executed function (timekeeping), which are then accessed by per-CPU functionality
> (hrtimers).
Right, but part of that issue is that we're caching in the hrtimer cpu
bases data that *should not be cached*. That was the root cause of the
2012 leapsecond issue, and I'd prefer not to reintroduce it.
The offset data is only valid for the monotonic time it's read for. So
dirtying three cache lines is really just due to the fact that the
data is stored in the cpu_base structure, where I'd argue it doesn't
provide real value (other than the convenience of indexing it cleanly).
Reading the offset data into three values from the stack would be fine
too, and (I think) would avoid dirtying much extra (we have to store
the now value anyway).
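Roughly what I have in mind (a sketch only; get_time_and_offsets() is
a hypothetical helper, not an existing function):

typedef long long ktime_t;	/* stand-in for the kernel type */

/* Hypothetical: returns now and fills in the offsets, one consistent
 * snapshot, no caching anywhere. */
ktime_t get_time_and_offsets(ktime_t *offs_real, ktime_t *offs_boot,
			     ktime_t *offs_tai);

void expire_timers(ktime_t now, ktime_t offs_real, ktime_t offs_boot,
		   ktime_t offs_tai);

void hrtimer_interrupt_sketch(void)
{
	ktime_t offs_real, offs_boot, offs_tai;
	ktime_t now;

	/* One snapshot read into locals ... */
	now = get_time_and_offsets(&offs_real, &offs_boot, &offs_tai);

	/* ... used for this expiry pass only; nothing is written back
	 * into a shared structure, so no extra cachelines get dirtied. */
	expire_timers(now, offs_real, offs_boot, offs_tai);
}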
BTW: Thomas, what are you using to do measurements here? I hesitate to
argue in these sorts of performance discussions, since I really only
have an embarrassingly theoretical understanding of the issues and
suspect I'm a bit naive here. Additionally, these sorts of constraints
aren't always clearly documented, so being able to measure and compare
would be helpful to ensure future changes don't impact things here.
> That makes it far more expensive: it has scalability-limiting effects similar to a
> global lock - while if we do it smartly it can perform as essentially lockless code
> in most cases.
Another reason why I don't like this approach of caching the data is
that it also prevents fixing the leap-second adjustment to happen on
the second edge, because we have to have an async update to the
seqcounter in order to refresh the cached real_offset. Or we have to
also export more ntp state data so we can duplicate the adjustment to
the cached data in the hrtimer code, which is more of the complexity
you've objected to.
thanks
-john
On Fri, 5 Jun 2015, John Stultz wrote:
> On Fri, Jun 5, 2015 at 3:07 AM, Ingo Molnar <[email protected]> wrote:
> > * Thomas Gleixner <[email protected]> wrote:
> >> It's not about copying 24 bytes. It's about touching 3 cache lines for nothing.
> >> In situations where we run high frequency periodic timers on clock monotonic and
> >> nothing is going on in the other clock domains, which is a pretty common
> >> situation, this is measurable in terms of cache utilization. [...]
> >
> > It's not just about 'touching': it's about _dirtying_ cachelines from a globally
> > executed function (timekeeping), which are then accessed by per-CPU functionality
> > (hrtimers).
>
> Right, but part of that issue is that we're caching in the hrtimer cpu
> bases data that *should not be cached*. That was the root cause of the
> 2012 leapsecond issue, and I'd prefer not to reintroduce it.
>
> The offset data is only valid for the monotonic time it's read for. So
> dirtying three cache lines is really just due to the fact that the
> data is stored in the cpu_base structure, where I'd argue it doesn't
> provide real value (other than the convenience of indexing it cleanly).
>
> Reading the offset data into three values from the stack would be fine
> too, and (I think) would avoid dirtying much extra (we have to store
> the now value anyway).
Well, the problem is that we need to fetch that data on several
occasions:
- hrtimer_start (if it is the first expiring timer of a clock)
- hrtimer_reprogram (after canceling the first timer)
- hrtimer_interrupt
So I really prefer to have cached values instead of a function
call. And even if we do not cache stuff, there is no guarantee that we
won't expire a timer too early:
CPU0 CPU1
hrtimer_interrupt()
------------------------------------- leap second edge
get_time_and_offsets_uncached()
do_leap_second_adjustment()
expire_timers()
So, if do_leap_second_adjustment() happens a bit too late, then
clock monotonic + offset_realtime will have advanced over the leap
second and will expire a timer; the sleeper will then observe that it
expired too early, because the leap second adjustment finished before
it returned to user space.
You do not even need two CPUs for this. You can observe the same issue
on a UP machine. Assume we have two hrtimers programmed to go off at
the leap second edge:
1) a user space timer
2) the tick/leap second one
If the user space timer has been enqueued before the leap one, then it
will be expired first, and if the timer interrupt got delayed a bit, it
will again see that it is past the programmed time and happily expire
too early.
So whatever we do with the hrtimer offsets, cached or not, it will not
prevent us from expiring timers early.
> BTW: Thomas, what are you using to do measurements here? I hesitate
> to argue in these sorts of performance discussions, since I really
> only have a embarrassing theoretical understanding of the issues and
> suspect myself a bit naive here. Additionally these sorts of
> constraints aren't always clearly documented, so being able to
> measure and compare would be helpful to ensure future changes don't
> impact things here.
performance counters and tests which stress the particular subsystems.
> > That makes it far more expensive: it has scalability-limiting effects similar to a
> > global lock - while if we do it smartly it can perform as essentially lockless code
> > in most cases.
>
> Another reason why I don't like this approach of caching the data is
> that it also prevents fixing the leap-second adjustment to happen on
> the second edge, because we have to have an async update to the
> seqcounter in order to refresh the cached real_offset. Or we have to
> also export more ntp state data so we can duplicate the adjustment to
> the cached data in the hrtimer code, which is more of the complexity
> you've objected to.
There is no guarantee that it happens at the second's edge. The timer
might be delayed, the vcpu scheduled out ....
All you will be able to do is narrow the window, but as I explained
above it won't prevent early expiry and it won't prevent the VDSO from
seeing the time go over the leap second and then jump back.
We just have to accept that timekeeping, time readout and hrtimers
have asynchronous behaviour. And there is no way around that unless
you want to kill performance completely for the sake of this leap
second nonsense.
Thanks,
tglx
On Mon, Jun 8, 2015 at 12:44 AM, Thomas Gleixner <[email protected]> wrote:
> On Fri, 5 Jun 2015, John Stultz wrote:
>> On Fri, Jun 5, 2015 at 3:07 AM, Ingo Molnar <[email protected]> wrote:
>> > * Thomas Gleixner <[email protected]> wrote:
>> >> It's not about copying 24 bytes. It's about touching 3 cache lines for nothing.
>> >> In situations where we run high frequency periodic timers on clock monotonic and
>> >> nothing is going on in the other clock domains, which is a pretty common
>> >> situation, this is measurable in terms of cache utilization. [...]
>> >
>> > It's not just about 'touching': it's about _dirtying_ cachelines from a globally
>> > executed function (timekeeping), which are then accessed by per-CPU functionality
>> > (hrtimers).
>>
>> Right, but part of that issue is that we're caching in the hrtimer cpu
>> bases data that *should not be cached*. That was the root cause of the
>> 2012 leapsecond issue, and I'd prefer not to reintroduce it.
>>
>> The offset data is only valid for the monotonic time it's read for. So
>> dirtying three cache lines is really just due to the fact that the
>> data is stored in the cpu_base structure, where I'd argue it doesn't
>> provide real value (other than the convenience of indexing it cleanly).
>>
>> Reading the offset data into three values from the stack would be fine
>> too, and (I think) would avoid dirtying much extra (we have to store
>> the now value anyway).
>
> Well, the problem is that we need to fetch that data on several
> occasions:
>
> - hrtimer_start (if it is the first expiring timer of a clock)
> - hrtimer_reprogram (after canceling the first timer)
> - hrtimer_interrupt
hrtimer_interrupt is really the one I'm most concerned with, because
that's what does timer expiration. For that usage, having duplicated
time state has been repeatedly problematic.
For start and reprogram, I guess I'm not as concerned because bad
cache values there can only result in errors in programming the
clockevent hardware, which can only cause timers to fire late.
> So I really prefer to have cached values instead of a function
> call. And even if we do not cache stuff, there is no guarantee that we
> won't expire a timer too early:
>
> CPU0 CPU1
> hrtimer_interrupt()
> ------------------------------------- leap second edge
> get_time_and_offsets_uncached()
> do_leap_second_adjustment()
> expire_timers()
>
> So, if do_leap_second_adjustment() happens a bit too late, then
> clock monotonic + offset_realtime will have advanced over the leap
> second and will expire a timer; the sleeper will then observe that it
> expired too early, because the leap second adjustment finished before
> it returned to user space.
Well, the patch I proposed (which does leapsecond adjustments in the
read paths) avoids this because update_base_and_offsets_now() checks
to see if the monotonic base + offset would cross the leapsecond edge,
and corrects the offset appropriately.
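The core of that check is just a compare plus a one-second pullback,
roughly (a sketch of the idea; names and types are approximate):

typedef long long ktime_t;		/* stand-in for the kernel type */
#define NSEC_PER_SEC	1000000000LL

/* If monotonic time plus the realtime offset has run past a pending
 * leap-second edge, report the offset with the leap already applied,
 * so readers and timer expiry cannot see the pre-leap value.
 * Convention for this sketch: next_leap == 0 means no leap pending. */
ktime_t leap_adjusted_offs_real(ktime_t base_mono, ktime_t offs_real,
				ktime_t next_leap)
{
	if (next_leap && base_mono + offs_real >= next_leap)
		offs_real -= NSEC_PER_SEC;	/* leap second insertion */
	return offs_real;
}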
>
> You do not even need two CPUs for this. You can observe the same issue
> on a UP machine. Assume we have two hrtimers programmed to go off at
> the leap second edge:
>
> 1) a user space timer
> 2) the tick/leap second one
>
> If the user space timer has been enqueued before the leap one, then it
> will be expired first, and if the timer interrupt got delayed a bit, it
> will again see that it is past the programmed time and happily expire
> too early.
>
> So whatever we do with the hrtimer offsets, cached or not, it will not
> prevent us from expiring timers early.
Right, so without my proposed patch, this is an issue, but my proposed
patch requires that the hrtimer_interrupt not use cached offsets in
order to ensure the read-state is adjusted properly for the
leapsecond. (Or it requries the hrtimer_interrupt path to also cache
the leapsecond state so it can do the same adjustment to the cached
data, but this seems terribly duplicative).
>> BTW: Thomas, what are you using to do measurements here? I hesitate
>> to argue in these sorts of performance discussions, since I really
>> only have an embarrassingly theoretical understanding of the issues
>> and suspect I'm a bit naive here. Additionally, these sorts of
>> constraints aren't always clearly documented, so being able to
>> measure and compare would be helpful to ensure future changes don't
>> impact things here.
>
> performance counters and tests which stress the particular subsystems.
Sweet. Vague hand-waving results are simple to reproduce.
[Gestures, points assertively a few times] Results validated! Science! :)
But seriously, I'm earnestly looking for specifics (like which stress
tests you care about) here, so I can also watch that patches I write
or take don't undo your performance optimizations, so in the future
you'll have to yell at fewer people.
>> > That makes it far more expensive: it has scalability-limiting effects similar to a
>> > global lock - while if we do it smartly it can perform as essentially lockless code
>> > in most cases.
>>
>> Another reason why I don't like this approach of caching the data is
>> that it also prevents fixing the leap-second adjustment to happen on
>> the second edge, because we have to have an async update to the
>> seqcounter in order to refresh the cached real_offset. Or we have to
>> also export more ntp state data so we can duplicate the adjustment to
>> the cached data in the hrtimer code, which is more of the complexity
>> you've objected to.
>
> There is no guarantee that it happens at the second's edge. The timer
> might be delayed, the vcpu scheduled out ....
Again, I worry you've not looked closely at my patch. The only way to
assure the change happens at the second edge is to do the adjustment in
the read path.
> All you will be able to do is narrow the window, but as I explained
> above it won't prevent early expiry and it won't prevent the VDSO from
> seeing the time go over the leap second and then jump back.
The vdso is fixable as well. I just avoided it because fixing it has
less benefit than fixing the timer expiration issue, and I expected the
performance overhead (which is an extra comparison and subtraction -
not huge, but I respect that the vdsos are very optimized) would raise
more objections.
(If we want to get fancy, there's even the possibility of switching
the vdso execution page so we use the fixed adjustment function only
near the leap-second and go back to the normal function otherwise -
but regardless, if it's to be done, we have to fix the core first)
> We just have to accept that timekeeping, time readout and hrtimers
> have asynchronous behaviour. And there is no way around that unless
> you want to kill performance completely for the sake of this leap
> second nonsense.
Even outside of my leapsecond correctness concern, I think caching
time state like you're doing here is a maintenance issue. It has bitten
us a number of times already, and things are complicated enough that
even fixing the issues that stem from it is non-trivial.
thanks
-john
On Mon, 8 Jun 2015, John Stultz wrote:
> On Mon, Jun 8, 2015 at 12:44 AM, Thomas Gleixner <[email protected]> wrote:
> > Well, the problem is that we need to fetch that data on several
> > occasions:
> >
> > - hrtimer_start (if it is the first expiring timer of a clock)
> > - hrtimer_reprogram (after canceling the first timer)
> > - hrtimer_interrupt
>
> hrtimer_interrupt is really the one I'm most concerned with, because
> that's what does timer expiration. For that usage, having duplicated
> time state has been repeatedly problematic.
And hurting performance just to solve problematic issues is definitely
not a solution either.
> Right, so without my proposed patch, this is an issue, but my proposed
> patch requires that the hrtimer_interrupt not use cached offsets in
> order to ensure the read-state is adjusted properly for the
> leapsecond. (Or it requires the hrtimer_interrupt path to also cache
> the leapsecond state so it can do the same adjustment to the cached
> data, but this seems terribly duplicative).
No, we can be smarter than that, really.
> But seriously, I'm earnestly looking for specifics (like which stress
> tests you care about) here, so I can also watch that patches I write
> or take don't undo your performance optimizations, so in the future
> you'll have to yell at fewer people.
I'm often using cyclictest or a variant of it which can arm more
timers. I play with various configurations for this, so I can't give a
recipe right away. I'll try to remember next time I play with it.
> Even outside of my leapsecond correctness concern, I think caching
> time state like you're doing here is a maintenance issue. It has bitten
> us a number of times already, and things are complicated enough that
> even fixing the issues that stem from it is non-trivial.
The caching is solely a problem for this silly leap second corner
case. So sacrificing performance for the general case just to solve
this particular issue is really the wrong answer. We know that it is
problematic, so we can have tests which verify that it did not break.
Sure, it takes a few more brain cycles to get it right, but those
cycles are well spent.
Thanks,
tglx