Date: Mon, 8 Jun 2015 09:44:50 +0200 (CEST)
From: Thomas Gleixner <tglx@linutronix.de>
To: John Stultz <john.stultz@linaro.org>
cc: Ingo Molnar <mingo@kernel.org>, Jeremiah Mahler <jmmahler@gmail.com>,
        Preeti U Murthy <preeti@linux.vnet.ibm.com>,
        Peter Zijlstra <peterz@infradead.org>,
        Viresh Kumar <viresh.kumar@linaro.org>,
        Marcelo Tosatti <mtosatti@redhat.com>,
        Frederic Weisbecker <fweisbec@gmail.com>,
        lkml <linux-kernel@vger.kernel.org>
Subject: Re: [BUG, bisect] hrtimer: severe lag after suspend & resume
In-Reply-To: <CALAqxLVRzrMp=o5HY+nyc5T9LytoVQPt6t5XF=3rgzpr3kF3iQ@mail.gmail.com>
Message-ID: <alpine.DEB.2.11.1506071156070.5050@nanos>
References: <20150604005624.GA1789@hudson.localdomain> <CALAqxLUAXDHimaqA0mqUNcY0inOZGRf92SxSbX8dDMzJUBRvmQ@mail.gmail.com> <alpine.DEB.2.11.1506051110040.4155@nanos> <20150605100707.GB8995@gmail.com>
 <CALAqxLVRzrMp=o5HY+nyc5T9LytoVQPt6t5XF=3rgzpr3kF3iQ@mail.gmail.com>
User-Agent: Alpine 2.11 (DEB 23 2013-08-11)
MIME-Version: 1.0
Content-Type: TEXT/PLAIN; charset=US-ASCII
Sender: linux-kernel-owner@vger.kernel.org
Content-Length: 4695
Lines: 107

On Fri, 5 Jun 2015, John Stultz wrote:
> On Fri, Jun 5, 2015 at 3:07 AM, Ingo Molnar <mingo@kernel.org> wrote:
> > * Thomas Gleixner <tglx@linutronix.de> wrote:
> >> It's not about copying 24 bytes. It's about touching 3 cache lines for nothing.
> >> In situations where we run high frequency periodic timers on clock monotonic and
> >> nothing is going on in the other clock domains, which is a pretty common
> >> situation, this is measurable in terms of cache utilization. [...]
> >
> > It's not just about 'touching': it's about _dirtying_ cachelines from a globally
> > executed function (timekeeping), which is then accessed by per-CPU functionality
> > (hrtimers).
> 
> Right, but part of that issue is that we're caching in the hrtimer cpu
> bases data that *should not be cached*. That was the core issue that
> caused the 2012 leapsecond issue, and I'd prefer to not reintroduce
> it.
> 
> The offset data is only valid for the monotonic time its read for. So
> dirtying three cache lines really is just due to the fact that the
> data is stored in the cpu_base structure where I'd argue it doesn't
> provide real value (other then convenience of indexing it cleanly).
> 
> Reading the offset data into three values from the stack would be fine
> too, and (I think) would avoid dirtying much extra (we have to store
> the now value anyway).

Well, the problem is that we need to fetch that data on several
occasions:

	- hrtimer_start (if it is the first expiring timer of a clock)
	- hrtimer_reprogram (after canceling the first timer)
	- hrtimer_interrupt

So I really prefer to have cached values instead of a function
call. And even if we do not cache stuff, there is no guarantee that we
wont expire a timer too early:

CPU0					CPU1
hrtimer_interrupt()
           ------------------------------------- leap second edge
   get_time_and_offsets_uncached()
					do_leap_second_adjustment()
   expire_timers()

So, if the do_leap_second_adjustment() happens a bit too late, then
clock monotonic + offset_realtime will have advanced over the leap
second and expire a timer and the sleeper will then observe that it is
expired too early because the leap second adjustment finished before
it returned to user space.

You do not even need two cpus for this. You can observe the same issue
on a UP machine. Assume we have two hrtimers programmed to go off at
the leap seconds edge:

    1) a user space timer
    2) the tick/leap second one

If the user space timer has been enqueued before the leap one, then it
will be expired first and if the timer interrupt got delayed a bit it
again will see that its over the programmed time and happily expire to
early.

So what ever we do vs. the hrtimer offsets, cached or not will not
prevent that we expire timers early.

> BTW: Thomas, what are you using to do measurements here? I hesitate
> to argue in these sorts of performance discussions, since I really
> only have a embarrassing theoretical understanding of the issues and
> suspect myself a bit naive here. Additionally these sorts of
> constraints aren't always clearly documented, so being able to
> measure and compare would be helpful to ensure future changes don't
> impact things here.

performance counters and tests which stress the particular subsystems.

> > That makes it far more expensive, it has similar scalability limiting effects as a
> > global lock - while if we do it smart it can perform as essentially lockless code
> > in most cases.
> 
> Another reason why I don't like this approach of caching the data is
> that it also prevents fixing the leap-second adjustment to happen on
> the second edge, because we have to have an async update to the
> seqcounter in order to refresh the cached real_offset. Or we have to
> also export more ntp state data so we can duplicate the adjustment to
> the cached data in the hrtimer code, which is more of the complexity
> you've objected to.

There is no guarantee that it happens at the seconds edge. Timer might
be delayed, vcpu scheduled out ....

All you will be able to do is to narrow the window, but as I explained
above it wont prevent early expiry and it wont prevent VDSO seing the
time go over the leap second and then jump back.

We just have to accept that timekeeping, time readout and hrtimers
have asynchronous behaviour. And there is no way around that unless
you want to kill performance completely for the sake of this leap
second nonsense.

Thanks,

	tglx
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/