2023-01-23 18:29:22

by Liang, Kan

Subject: [PATCH 0/3] Convert TSC to monotonic clock for PEBS

From: Kan Liang <[email protected]>

A Processor Event Based Sampling (PEBS) record includes a field that
provides the time stamp counter (TSC) value captured when the counter
overflowed and the PEBS record was generated. The accurate time stamp
can be used to reconcile user samples. However, the current PEBS code
can only convert the time stamp to sched_clock, which is not available
from user space. A way to convert a given TSC value to the user-visible
monotonic clock is required.

The perf_event subsystem only converts the TSC in an NMI handler. The
conversion function must therefore be fast and NMI safe.

The below two existing functions were considered, but neither fulfills
the above requirements.
- ktime_get_mono_fast_ns() is NMI safe, but it can only return the
current monotonic time, not the monotonic time corresponding to a
given timestamp.
- get_device_system_crosststamp() can calculate the system time from
a given device time, but it is neither fast nor NMI safe.

Introduce a new generic interface, get_mono_fast_from_given_time(), to
convert a given timestamp to clock monotonic.

Kan Liang (3):
timekeeping: NMI safe converter from a given time to monotonic
x86/tsc: Add set_tsc_system_counterval
perf/x86/intel/ds: Support monotonic clock for PEBS

arch/x86/events/intel/core.c | 2 +-
arch/x86/events/intel/ds.c | 30 +++++++++++++---
arch/x86/include/asm/tsc.h | 1 +
arch/x86/kernel/tsc.c | 6 ++++
include/linux/timekeeping.h | 9 +++++
kernel/time/timekeeping.c | 68 ++++++++++++++++++++++++++++++++++--
6 files changed, 108 insertions(+), 8 deletions(-)

--
2.35.1



2023-01-23 18:29:43

by Liang, Kan

Subject: [PATCH 1/3] timekeeping: NMI safe converter from a given time to monotonic

From: Kan Liang <[email protected]>

It's useful to provide an NMI-safe function to convert a given time to
monotonic. For example, the perf_event subsystem wants to convert the
TSC of a PEBS record to a monotonic clock in an NMI handler.

The below two existing functions were considered, but neither fulfills
the above requirements.
- ktime_get_mono_fast_ns() is NMI safe, but it can only return the
current monotonic time, not the monotonic time corresponding to a
given timestamp.
- get_device_system_crosststamp() can calculate the system time from
a given device time, but it is neither fast nor NMI safe.

The new function is a combination of the two existing ones. It uses the
tk_fast_mono timekeeper to be fast and NMI safe, and the get_time_fn
callback to retrieve the given timestamp and its clocksource. History
is not supported, since the perf case doesn't really need it; it can be
added later once there is a use case.

Signed-off-by: Kan Liang <[email protected]>
---
include/linux/timekeeping.h | 9 +++++
kernel/time/timekeeping.c | 68 +++++++++++++++++++++++++++++++++++--
2 files changed, 74 insertions(+), 3 deletions(-)

diff --git a/include/linux/timekeeping.h b/include/linux/timekeeping.h
index fe1e467ba046..234fa87a846b 100644
--- a/include/linux/timekeeping.h
+++ b/include/linux/timekeeping.h
@@ -289,6 +289,15 @@ extern int get_device_system_crosststamp(
struct system_time_snapshot *history,
struct system_device_crosststamp *xtstamp);

+/*
+ * Fast NMI safe way to convert a given timestamp to clock monotonic
+ */
+extern int get_mono_fast_from_given_time(int (*get_time_fn)
+ (struct system_counterval_t *sys_counterval,
+ void *ctx),
+ void *ctx,
+ u64 *mono_ns);
+
/*
* Simultaneously snapshot realtime and monotonic raw clocks
*/
diff --git a/kernel/time/timekeeping.c b/kernel/time/timekeeping.c
index 5579ead449f2..5bd32b2981fd 100644
--- a/kernel/time/timekeeping.c
+++ b/kernel/time/timekeeping.c
@@ -431,14 +431,19 @@ static void update_fast_timekeeper(const struct tk_read_base *tkr,
memcpy(base + 1, base, sizeof(*base));
}

-static __always_inline u64 fast_tk_get_delta_ns(struct tk_read_base *tkr)
+static __always_inline u64 fast_tk_get_delta_ns_from_cycles(struct tk_read_base *tkr,
+ u64 cycles)
{
- u64 delta, cycles = tk_clock_read(tkr);
+ u64 delta = clocksource_delta(cycles, tkr->cycle_last, tkr->mask);

- delta = clocksource_delta(cycles, tkr->cycle_last, tkr->mask);
return timekeeping_delta_to_ns(tkr, delta);
}

+static __always_inline u64 fast_tk_get_delta_ns(struct tk_read_base *tkr)
+{
+ return fast_tk_get_delta_ns_from_cycles(tkr, tk_clock_read(tkr));
+}
+
static __always_inline u64 __ktime_get_fast_ns(struct tk_fast *tkf)
{
struct tk_read_base *tkr;
@@ -1303,6 +1308,63 @@ int get_device_system_crosststamp(int (*get_time_fn)
}
EXPORT_SYMBOL_GPL(get_device_system_crosststamp);

+/**
+ * get_mono_fast_from_given_time - Fast NMI safe access to convert a given
+ * timestamp to clock monotonic
+ * @get_time_fn: Callback to get the given time and its clocksource
+ * @ctx: Context passed to get_time_fn()
+ * @mono_ns: The monotonic time of the given time
+ */
+int notrace get_mono_fast_from_given_time(int (*get_time_fn)
+ (struct system_counterval_t *sys_counterval,
+ void *ctx),
+ void *ctx,
+ u64 *mono_ns)
+{
+ struct system_counterval_t system_counterval;
+ struct tk_fast *tkf = &tk_fast_mono;
+ u64 cycles, now, interval_start;
+ struct tk_read_base *tkr;
+ unsigned int seq;
+ int ret;
+
+ do {
+ seq = raw_read_seqcount_latch(&tkf->seq);
+ tkr = tkf->base + (seq & 0x01);
+
+ ret = get_time_fn(&system_counterval, ctx);
+ if (ret)
+ return ret;
+
+ /*
+ * Verify that the clocksource associated with the given
+ * timestamp is the same as the currently installed
+ * timekeeper clocksource
+ */
+ if (tkr->clock != system_counterval.cs)
+ return -EOPNOTSUPP;
+ cycles = system_counterval.cycles;
+
+ /*
+ * Check whether the given timestamp is on the current
+ * timekeeping interval.
+ */
+ now = tk_clock_read(tkr);
+ interval_start = tkr->cycle_last;
+ if (!cycle_between(interval_start, cycles, now))
+ return -EOPNOTSUPP;
+
+ now = ktime_to_ns(tkr->base);
+ now += fast_tk_get_delta_ns_from_cycles(tkr, cycles);
+
+ } while (read_seqcount_latch_retry(&tkf->seq, seq));
+
+ *mono_ns = now;
+
+ return 0;
+}
+EXPORT_SYMBOL_GPL(get_mono_fast_from_given_time);
+
/**
* do_settimeofday64 - Sets the time of day.
* @ts: pointer to the timespec64 variable containing the new time
--
2.35.1


2023-01-23 18:30:16

by Liang, Kan

Subject: [PATCH 2/3] x86/tsc: Add set_tsc_system_counterval

From: Kan Liang <[email protected]>

The perf_event subsystem wants to convert the TSC of a PEBS record to a
monotonic clock. Introduce a new function that provides both the TSC
value and the TSC clocksource information in the format required by the
callback function from the timekeeping code.

Signed-off-by: Kan Liang <[email protected]>
---
arch/x86/include/asm/tsc.h | 1 +
arch/x86/kernel/tsc.c | 6 ++++++
2 files changed, 7 insertions(+)

diff --git a/arch/x86/include/asm/tsc.h b/arch/x86/include/asm/tsc.h
index fbdc3d951494..45aac88b5ce4 100644
--- a/arch/x86/include/asm/tsc.h
+++ b/arch/x86/include/asm/tsc.h
@@ -29,6 +29,7 @@ static inline cycles_t get_cycles(void)

extern struct system_counterval_t convert_art_to_tsc(u64 art);
extern struct system_counterval_t convert_art_ns_to_tsc(u64 art_ns);
+extern struct system_counterval_t set_tsc_system_counterval(u64 tsc);

extern void tsc_early_init(void);
extern void tsc_init(void);
diff --git a/arch/x86/kernel/tsc.c b/arch/x86/kernel/tsc.c
index a78e73da4a74..45803e65630e 100644
--- a/arch/x86/kernel/tsc.c
+++ b/arch/x86/kernel/tsc.c
@@ -1311,6 +1311,12 @@ struct system_counterval_t convert_art_ns_to_tsc(u64 art_ns)
}
EXPORT_SYMBOL(convert_art_ns_to_tsc);

+struct system_counterval_t set_tsc_system_counterval(u64 tsc)
+{
+ return (struct system_counterval_t) { .cs = &clocksource_tsc,
+ .cycles = tsc};
+}
+EXPORT_SYMBOL(set_tsc_system_counterval);

static void tsc_refine_calibration_work(struct work_struct *work);
static DECLARE_DELAYED_WORK(tsc_irqwork, tsc_refine_calibration_work);
--
2.35.1


2023-01-23 18:30:50

by Liang, Kan

Subject: [PATCH 3/3] perf/x86/intel/ds: Support monotonic clock for PEBS

From: Kan Liang <[email protected]>

Users try to reconcile user samples with PEBS samples and require a
common clock source. However, the current PEBS code only converts to
sched_clock, which is not available from user space.

Support converting only to clock monotonic. Having one common clock
source is good enough to fulfill the requirement.

Enable the large PEBS for the monotonic clock to reduce the PEBS
overhead.

There are a few rare cases that may make the conversion fail, e.g., a
TSC overflow, or the cycle_last value changing between samples. In
those cases the time falls back to the less accurate SW timestamp, but
they are extremely unlikely to happen.

Signed-off-by: Kan Liang <[email protected]>
---

This patch has to be applied on top of the patch below:
https://lore.kernel.org/all/[email protected]/

arch/x86/events/intel/core.c | 2 +-
arch/x86/events/intel/ds.c | 30 ++++++++++++++++++++++++++----
2 files changed, 27 insertions(+), 5 deletions(-)

diff --git a/arch/x86/events/intel/core.c b/arch/x86/events/intel/core.c
index 14f0a746257d..ea194556cc73 100644
--- a/arch/x86/events/intel/core.c
+++ b/arch/x86/events/intel/core.c
@@ -3777,7 +3777,7 @@ static unsigned long intel_pmu_large_pebs_flags(struct perf_event *event)
{
unsigned long flags = x86_pmu.large_pebs_flags;

- if (event->attr.use_clockid)
+ if (event->attr.use_clockid && (event->attr.clockid != CLOCK_MONOTONIC))
flags &= ~PERF_SAMPLE_TIME;
if (!event->attr.exclude_kernel)
flags &= ~PERF_SAMPLE_REGS_USER;
diff --git a/arch/x86/events/intel/ds.c b/arch/x86/events/intel/ds.c
index 7980e92dec64..d7f0eaf4405c 100644
--- a/arch/x86/events/intel/ds.c
+++ b/arch/x86/events/intel/ds.c
@@ -1570,13 +1570,33 @@ static u64 get_data_src(struct perf_event *event, u64 aux)
return val;
}

+static int pebs_get_synctime(struct system_counterval_t *system,
+ void *ctx)
+{
+ *system = set_tsc_system_counterval(*(u64 *)ctx);
+ return 0;
+}
+
+static inline int pebs_clockid_time(clockid_t clk_id, u64 tsc, u64 *clk_id_time)
+{
+ /* Only support converting to clock monotonic */
+ if (clk_id != CLOCK_MONOTONIC)
+ return -EINVAL;
+
+ return get_mono_fast_from_given_time(pebs_get_synctime, &tsc, clk_id_time);
+}
+
static void setup_pebs_time(struct perf_event *event,
struct perf_sample_data *data,
u64 tsc)
{
- /* Converting to a user-defined clock is not supported yet. */
- if (event->attr.use_clockid != 0)
- return;
+ u64 time;
+
+ if (event->attr.use_clockid != 0) {
+ if (pebs_clockid_time(event->attr.clockid, tsc, &time))
+ return;
+ goto done;
+ }

/*
* Converting the TSC to perf time is only supported,
@@ -1587,8 +1607,10 @@ static void setup_pebs_time(struct perf_event *event,
*/
if (!using_native_sched_clock() || !sched_clock_stable())
return;
+ time = native_sched_clock_from_tsc(tsc) + __sched_clock_offset;

- data->time = native_sched_clock_from_tsc(tsc) + __sched_clock_offset;
+done:
+ data->time = time;
data->sample_flags |= PERF_SAMPLE_TIME;
}

--
2.35.1


2023-01-24 06:14:02

by John Stultz

Subject: Re: [PATCH 0/3] Convert TSC to monotonic clock for PEBS

On Mon, Jan 23, 2023 at 10:27 AM <[email protected]> wrote:
>
> From: Kan Liang <[email protected]>
>
> A Processor Event Based Sampling (PEBS) record includes a field that
> provide the time stamp counter value when the counter was overflowed
> and the PEBS record was generated. The accurate time stamp can be used
> to reconcile user samples. However, the current PEBS codes only can
> convert the time stamp to sched_clock, which is not available from user
> space. A solution to convert a given TSC to user visible monotonic
> clock is required.
>
> The perf_event subsystem only converts the TSC in a NMI handler. The
> converter function must be fast and NMI safe.
>
> Considered the below two existing functions, but none of them fulfill
> the above requirements.
> - The ktime_get_mono_fast_ns() is NMI safe, but it can only return the
> current clock monotonic rather than a given time's monotonic.
> - The get_device_system_crosststamp() can calculate the system time from
> a given device time. But it's not fast and NMI safe.

So, apologies if this is a silly question (my brain quickly evicts the
details on get_device_system_crosststamp every time I look at it), but
rather than introducing a new interface, what would it take to rework
the existing get_device_system_crosststamp() logic to be usable for
both use cases?

thanks
-john

2023-01-24 06:56:28

by John Stultz

Subject: Re: [PATCH 3/3] perf/x86/intel/ds: Support monotonic clock for PEBS

On Mon, Jan 23, 2023 at 10:27 AM <[email protected]> wrote:
>
> From: Kan Liang <[email protected]>
>
> Users try to reconcile user samples with PEBS samples and require a
> common clock source. However, the current PEBS codes only convert to
> sched_clock, which is not available from the user space.
>
> Only support converting to clock monotonic. Having one common clock
> source is good enough to fulfill the requirement.
>
> Enable the large PEBS for the monotonic clock to reduce the PEBS
> overhead.
>
> There are a few rare cases that may make the conversion fails. For
> example, TSC overflows. The cycle_last may be changed between samples.
> The time will fallback to the inaccurate SW times. But the cases are
> extremely unlikely to happen.
>
> Signed-off-by: Kan Liang <[email protected]>
> ---

Thanks for sending this out!
A few minor style issues below and a warning.

> The patch has to be on top of the below patch
> https://lore.kernel.org/all/[email protected]/
>
> arch/x86/events/intel/core.c | 2 +-
> arch/x86/events/intel/ds.c | 30 ++++++++++++++++++++++++++----
> 2 files changed, 27 insertions(+), 5 deletions(-)
>
> diff --git a/arch/x86/events/intel/core.c b/arch/x86/events/intel/core.c
> index 14f0a746257d..ea194556cc73 100644
> --- a/arch/x86/events/intel/core.c
> +++ b/arch/x86/events/intel/core.c
> @@ -3777,7 +3777,7 @@ static unsigned long intel_pmu_large_pebs_flags(struct perf_event *event)
> {
> unsigned long flags = x86_pmu.large_pebs_flags;
>
> - if (event->attr.use_clockid)
> + if (event->attr.use_clockid && (event->attr.clockid != CLOCK_MONOTONIC))
> flags &= ~PERF_SAMPLE_TIME;
> if (!event->attr.exclude_kernel)
> flags &= ~PERF_SAMPLE_REGS_USER;
> diff --git a/arch/x86/events/intel/ds.c b/arch/x86/events/intel/ds.c
> index 7980e92dec64..d7f0eaf4405c 100644
> --- a/arch/x86/events/intel/ds.c
> +++ b/arch/x86/events/intel/ds.c
> @@ -1570,13 +1570,33 @@ static u64 get_data_src(struct perf_event *event, u64 aux)
> return val;
> }
>
> +static int pebs_get_synctime(struct system_counterval_t *system,
> + void *ctx)

Just because the abstract function type taken by
get_mono_fast_from_given_time is vague, doesn't mean the
implementation needs to be.
ctx is really a tsc value, right? So let's call it that to make this a
bit more readable.

> +{
> + *system = set_tsc_system_counterval(*(u64 *)ctx);
> + return 0;
> +}
> +
> +static inline int pebs_clockid_time(clockid_t clk_id, u64 tsc, u64 *clk_id_time)

clk_id_time is maybe a bit too fuzzy. It is really a mono_ns value,
right? Let's keep that explicit here.

> +{
> + /* Only support converting to clock monotonic */
> + if (clk_id != CLOCK_MONOTONIC)
> + return -EINVAL;
> +
> + return get_mono_fast_from_given_time(pebs_get_synctime, &tsc, clk_id_time);
> +}
> +
> static void setup_pebs_time(struct perf_event *event,
> struct perf_sample_data *data,
> u64 tsc)
> {
> - /* Converting to a user-defined clock is not supported yet. */
> - if (event->attr.use_clockid != 0)
> - return;
> + u64 time;

Again, "time" is too generic a term without any context here.
mono_nsec or something would be more clear.

> +
> + if (event->attr.use_clockid != 0) {
> + if (pebs_clockid_time(event->attr.clockid, tsc, &time))
> + return;
> + goto done;
> + }

Apologies for this warning/rant:

So, I do get that the NMI safety of the "fast" time accessors (along
with the "high performance" sounding name!) is attractive, but as their
use expands I worry the downsides of these interfaces aren't made clear
enough.

The fast accessors *can* see time discontinuities! Because the logic
is done without holding the tk_core.seq lock, if you are reading in
the middle of an ntp adjustment, you may find the current value to be
larger than the next time you read the time. These discontinuities
are likely to be very small, but a negative delta will look very large
as a u64. So part of using these "fast *and unsafe*" interfaces is
that you get to keep both pieces when it breaks. Make sure the code
here that is using these interfaces guards against this (zeroing out
negative deltas).

thanks
-john

2023-01-24 07:01:37

by John Stultz

Subject: Re: [PATCH 1/3] timekeeping: NMI safe converter from a given time to monotonic

On Mon, Jan 23, 2023 at 10:27 AM <[email protected]> wrote:
> +int notrace get_mono_fast_from_given_time(int (*get_time_fn)
> + (struct system_counterval_t *sys_counterval,
> + void *ctx),
> + void *ctx,
> + u64 *mono_ns)
> +{
> + struct system_counterval_t system_counterval;
> + struct tk_fast *tkf = &tk_fast_mono;
> + u64 cycles, now, interval_start;
> + struct tk_read_base *tkr;
> + unsigned int seq;
> + int ret;
> +
> + do {
> + seq = raw_read_seqcount_latch(&tkf->seq);
> + tkr = tkf->base + (seq & 0x01);
> +
> + ret = get_time_fn(&system_counterval, ctx);
> + if (ret)
> + return ret;
> +
> + /*
> + * Verify that the clocksource associated with the given
> + * timestamp is the same as the currently installed
> + * timekeeper clocksource
> + */
> + if (tkr->clock != system_counterval.cs)
> + return -EOPNOTSUPP;
> + cycles = system_counterval.cycles;
> +
> + /*
> + * Check whether the given timestamp is on the current
> + * timekeeping interval.
> + */
> + now = tk_clock_read(tkr);
> + interval_start = tkr->cycle_last;
> + if (!cycle_between(interval_start, cycles, now))
> + return -EOPNOTSUPP;

So. I've not fully thought this out, but it seems like it would be
quite likely that you'd run into the case where the cycle_last value
is updated and your earlier TSC timestamp isn't valid for the current
interval. The get_device_system_crosststamp() logic has a big chunk of
complex code to try to handle this case by interpolating the cycle
value back in time. How well does just failing in this case work out?

thanks
-john

2023-01-24 08:52:05

by Thomas Gleixner

Subject: Re: [PATCH 1/3] timekeeping: NMI safe converter from a given time to monotonic

On Mon, Jan 23 2023 at 10:27, kan liang wrote:
> From: Kan Liang <[email protected]>
>
> It's useful to provide a NMI safe function to convert a given time to
> monotonic. For example, the perf_event subsystem wants to convert a TSC
> of a PEBS record to a monotonic clock in a NMI handler.

Why? That's a postprocessing problem, really.

Thanks,

tglx

2023-01-24 09:10:17

by Stephane Eranian

Subject: Re: [PATCH 1/3] timekeeping: NMI safe converter from a given time to monotonic

On Tue, Jan 24, 2023 at 12:52 AM Thomas Gleixner <[email protected]> wrote:
>
> On Mon, Jan 23 2023 at 10:27, kan liang wrote:
> > From: Kan Liang <[email protected]>
> >
> > It's useful to provide a NMI safe function to convert a given time to
> > monotonic. For example, the perf_event subsystem wants to convert a TSC
> > of a PEBS record to a monotonic clock in a NMI handler.
>
> Why? That's a postprocessing problem, really.
>
Because you want to correlate samples captured by PEBS with samples
from applications timestamped with a user-available clock such as
CLOCK_MONOTONIC, for instance.
When I create a perf_event event and stipulate that
event_attr.clockid=MONOTONIC, I expect all the samples from
that event to be timestamped using the same clock source, regardless
of PEBS or IBS.



> Thanks,
>
> tglx

2023-01-24 15:09:21

by Liang, Kan

Subject: Re: [PATCH 0/3] Convert TSC to monotonic clock for PEBS



On 2023-01-24 1:13 a.m., John Stultz wrote:
> On Mon, Jan 23, 2023 at 10:27 AM <[email protected]> wrote:
>>
>> From: Kan Liang <[email protected]>
>>
>> A Processor Event Based Sampling (PEBS) record includes a field that
>> provide the time stamp counter value when the counter was overflowed
>> and the PEBS record was generated. The accurate time stamp can be used
>> to reconcile user samples. However, the current PEBS codes only can
>> convert the time stamp to sched_clock, which is not available from user
>> space. A solution to convert a given TSC to user visible monotonic
>> clock is required.
>>
>> The perf_event subsystem only converts the TSC in a NMI handler. The
>> converter function must be fast and NMI safe.
>>
>> Considered the below two existing functions, but none of them fulfill
>> the above requirements.
>> - The ktime_get_mono_fast_ns() is NMI safe, but it can only return the
>> current clock monotonic rather than a given time's monotonic.
>> - The get_device_system_crosststamp() can calculate the system time from
>> a given device time. But it's not fast and NMI safe.
>
> So, apologies if this is a silly question (my brain quickly evicts the
> details on get_device_system_crosststamp every time I look at it), but
> rather then introducing a new interface, what would it take to rework
> the existing get_device_system_crosststamp() logic to be usable for
> both use cases?
>

I once tried to rework the existing get_device_system_crosststamp(),
but finally gave up, because:
- The existing function is already very complex. Adding a new case
would make it even more complex and harder to maintain.
- Perf doesn't need all the logic of the existing function. For
example, the history is not required. (I think there is no problem for
perf if we cannot get values for some corner cases. The worst case for
perf is to fall back to the time captured in the NMI handler. It's not
very accurate, but it should be acceptable.) Performance is priority
one. We want a function with much simpler logic.
- If I understand correctly, we already introduced several dedicated
functions for fast NMI access, e.g., ktime_get_mono_fast_ns(). I think
we can follow the same idea.


Thanks,
Kan

2023-01-24 15:09:24

by Liang, Kan

Subject: Re: [PATCH 1/3] timekeeping: NMI safe converter from a given time to monotonic



On 2023-01-24 2:01 a.m., John Stultz wrote:
> On Mon, Jan 23, 2023 at 10:27 AM <[email protected]> wrote:
>> +int notrace get_mono_fast_from_given_time(int (*get_time_fn)
>> + (struct system_counterval_t *sys_counterval,
>> + void *ctx),
>> + void *ctx,
>> + u64 *mono_ns)
>> +{
>> + struct system_counterval_t system_counterval;
>> + struct tk_fast *tkf = &tk_fast_mono;
>> + u64 cycles, now, interval_start;
>> + struct tk_read_base *tkr;
>> + unsigned int seq;
>> + int ret;
>> +
>> + do {
>> + seq = raw_read_seqcount_latch(&tkf->seq);
>> + tkr = tkf->base + (seq & 0x01);
>> +
>> + ret = get_time_fn(&system_counterval, ctx);
>> + if (ret)
>> + return ret;
>> +
>> + /*
>> + * Verify that the clocksource associated with the given
>> + * timestamp is the same as the currently installed
>> + * timekeeper clocksource
>> + */
>> + if (tkr->clock != system_counterval.cs)
>> + return -EOPNOTSUPP;
>> + cycles = system_counterval.cycles;
>> +
>> + /*
>> + * Check whether the given timestamp is on the current
>> + * timekeeping interval.
>> + */
>> + now = tk_clock_read(tkr);
>> + interval_start = tkr->cycle_last;
>> + if (!cycle_between(interval_start, cycles, now))
>> + return -EOPNOTSUPP;
>
> So. I've not fully thought this out, but it seems like it would be
> quite likely that you'd run into the case where the cycle_last value
> is updated and your earlier TSC timestamp isn't valid for the current
> interval. The get_device_system_crosststamp() logic has a big chunk of
> complex code to try to handle this case by interpolating the cycle
> value back in time. How well does just failing in this case work out?
>

For that case, perf falls back to the time captured in the NMI handler,
via ktime_get_mono_fast_ns().

The TSC in PEBS is captured by HW when the sample is generated. There
should be only a small delta compared with the time captured in the NMI
handler, and I think that delta is acceptable as a backup solution for
most analysis cases. Also, I don't think the case (the cycle_last value
being updated during the monitoring) should occur very often either, so
I dropped the history support to simplify the function.

Thanks,
Kan

2023-01-24 15:17:36

by Liang, Kan

Subject: Re: [PATCH 3/3] perf/x86/intel/ds: Support monotonic clock for PEBS



On 2023-01-24 1:56 a.m., John Stultz wrote:
> On Mon, Jan 23, 2023 at 10:27 AM <[email protected]> wrote:
>>
>> From: Kan Liang <[email protected]>
>>
>> Users try to reconcile user samples with PEBS samples and require a
>> common clock source. However, the current PEBS codes only convert to
>> sched_clock, which is not available from the user space.
>>
>> Only support converting to clock monotonic. Having one common clock
>> source is good enough to fulfill the requirement.
>>
>> Enable the large PEBS for the monotonic clock to reduce the PEBS
>> overhead.
>>
>> There are a few rare cases that may make the conversion fails. For
>> example, TSC overflows. The cycle_last may be changed between samples.
>> The time will fallback to the inaccurate SW times. But the cases are
>> extremely unlikely to happen.
>>
>> Signed-off-by: Kan Liang <[email protected]>
>> ---
>
> Thanks for sending this out!
> A few minor style issues below and a warning.

Thanks.

>
>> The patch has to be on top of the below patch
>> https://lore.kernel.org/all/[email protected]/
>>
>> arch/x86/events/intel/core.c | 2 +-
>> arch/x86/events/intel/ds.c | 30 ++++++++++++++++++++++++++----
>> 2 files changed, 27 insertions(+), 5 deletions(-)
>>
>> diff --git a/arch/x86/events/intel/core.c b/arch/x86/events/intel/core.c
>> index 14f0a746257d..ea194556cc73 100644
>> --- a/arch/x86/events/intel/core.c
>> +++ b/arch/x86/events/intel/core.c
>> @@ -3777,7 +3777,7 @@ static unsigned long intel_pmu_large_pebs_flags(struct perf_event *event)
>> {
>> unsigned long flags = x86_pmu.large_pebs_flags;
>>
>> - if (event->attr.use_clockid)
>> + if (event->attr.use_clockid && (event->attr.clockid != CLOCK_MONOTONIC))
>> flags &= ~PERF_SAMPLE_TIME;
>> if (!event->attr.exclude_kernel)
>> flags &= ~PERF_SAMPLE_REGS_USER;
>> diff --git a/arch/x86/events/intel/ds.c b/arch/x86/events/intel/ds.c
>> index 7980e92dec64..d7f0eaf4405c 100644
>> --- a/arch/x86/events/intel/ds.c
>> +++ b/arch/x86/events/intel/ds.c
>> @@ -1570,13 +1570,33 @@ static u64 get_data_src(struct perf_event *event, u64 aux)
>> return val;
>> }
>>
>> +static int pebs_get_synctime(struct system_counterval_t *system,
>> + void *ctx)
>
> Just because the abstract function type taken by
> get_mono_fast_from_given_time is vague, doesn't mean the
> implementation needs to be.
> ctx is really a tsc value, right? So let's call it that to make this a
> bit more readable.

Sure, I will use the tsc to replace ctx.

>
>> +{
>> + *system = set_tsc_system_counterval(*(u64 *)ctx);
>> + return 0;
>> +}
>> +
>> +static inline int pebs_clockid_time(clockid_t clk_id, u64 tsc, u64 *clk_id_time)
>
> clk_id_time is maybe a bit too fuzzy. It is really a mono_ns value,
> right? Let's keep that explicit here.

Yes. Will make it explicit.

>
>> +{
>> + /* Only support converting to clock monotonic */
>> + if (clk_id != CLOCK_MONOTONIC)
>> + return -EINVAL;
>> +
>> + return get_mono_fast_from_given_time(pebs_get_synctime, &tsc, clk_id_time);
>> +}
>> +
>> static void setup_pebs_time(struct perf_event *event,
>> struct perf_sample_data *data,
>> u64 tsc)
>> {
>> - /* Converting to a user-defined clock is not supported yet. */
>> - if (event->attr.use_clockid != 0)
>> - return;
>> + u64 time;
>
> Again, "time" is too generic a term without any context here.
> mono_nsec or something would be more clear.

Sure.

>
>> +
>> + if (event->attr.use_clockid != 0) {
>> + if (pebs_clockid_time(event->attr.clockid, tsc, &time))
>> + return;
>> + goto done;
>> + }
>
> Apologies for this warning/rant:
>
> So, I do get the NMI safety of the "fast" time accessors (along with
> the "high performance" sounding name!) is attractive, but as its use
> expands I worry the downsides of this interface isn't made clear
> enough.
>
> The fast accessors *can* see time discontinuities! Because the logic
> is done without holding the tk_core.seq lock, If you are reading in
> the middle of a ntp adjustment, you may find the current value to be
> larger than the next time you read the time. These discontinuities
> are likely to be very small, but a negative delta will look very large
> as a u64. So part of using these "fast *and unsafe*" interfaces is
> you get to keep both pieces when it breaks. Make sure the code here
> that is using these interfaces guards against this (zeroing out
> negative deltas).
>

Thanks for the warning.
I will add more comments and specially handle it here.

Thanks,
Kan

2023-01-24 16:07:05

by Liang, Kan

Subject: Re: [PATCH 1/3] timekeeping: NMI safe converter from a given time to monotonic



On 2023-01-24 4:10 a.m., Stephane Eranian wrote:
> On Tue, Jan 24, 2023 at 12:52 AM Thomas Gleixner <[email protected]> wrote:
>>
>> On Mon, Jan 23 2023 at 10:27, kan liang wrote:
>>> From: Kan Liang <[email protected]>
>>>
>>> It's useful to provide a NMI safe function to convert a given time to
>>> monotonic. For example, the perf_event subsystem wants to convert a TSC
>>> of a PEBS record to a monotonic clock in a NMI handler.
>>
>> Why? That's a postprocessing problem, really.
>>

Besides the reason Stephane mentioned, the current perf tool assumes
that the kernel always returns the time in the clock it asked for. For
example, without this patch, the kernel uses ktime_get_mono_fast_ns()
to get the MONOTONIC time and dumps it to user space.

If we want to break that assumption, I think we have to modify the
current interfaces and dump extra data to indicate the clock source.
That makes the existing interface more complex. Preparing and dumping
the extra data also brings overhead, and we would also force users to
update their tools.


Thanks,
Kan

> Because you want to correlate samples captured by PEBS with samples
> from applications timestamped with a user available clock such as
> CLOCK_MONOTONIC, for instance.
> When I create a perf_event event and I stipulate that
> event_attr.clockid=MONOTONIC, I expect all the samples from
> that event to be timestamped using the same clock source, regardless
> of PEBS or IBS.
>
>
>
>> Thanks,
>>
>> tglx

2023-01-24 18:43:45

by John Stultz

Subject: Re: [PATCH 1/3] timekeeping: NMI safe converter from a given time to monotonic

On Tue, Jan 24, 2023 at 7:09 AM Liang, Kan <[email protected]> wrote:
> On 2023-01-24 2:01 a.m., John Stultz wrote:
> > On Mon, Jan 23, 2023 at 10:27 AM <[email protected]> wrote:
> >> + /*
> >> + * Check whether the given timestamp is on the current
> >> + * timekeeping interval.
> >> + */
> >> + now = tk_clock_read(tkr);
> >> + interval_start = tkr->cycle_last;
> >> + if (!cycle_between(interval_start, cycles, now))
> >> + return -EOPNOTSUPP;
> >
> > So. I've not fully thought this out, but it seems like it would be
> > quite likely that you'd run into the case where the cycle_last value
> > is updated and your earlier TSC timestamp isn't valid for the current
> > interval. The get_device_system_crosststamp() logic has a big chunk of
> > complex code to try to handle this case by interpolating the cycle
> > value back in time. How well does just failing in this case work out?
> >
>
> For the case, perf fallback to the time captured in the NMI handler, via
> ktime_get_mono_fast_ns().

This feels like *very* subtle behavior. Maybe I'm misunderstanding,
but the goal seems to be to have more accurate timestamps on the hw
events, and using the captured tsc timestamp avoids the measurement
latency of reading the time again. But if every timekeeping update
interval (~tick) you transparently get a delayed value due to the
fallback, it makes it hard to understand which timestamps are better
or worse. The latency between two reads may be real or may just be
bad luck. This doesn't intuitively seem like a great benefit over the
more consistent latency of just using ktime_get_mono_fast()
timestamping.

> The TSC in PEBS is captured by HW when the sample was generated. There
> should be a small delta compared with the time captured in the NMI
> handler. But I think the delta should be acceptable as a backup solution
> for most analysis cases. Also, I don't think the case (the
> cycle_last value being updated during the monitoring) should occur very
> often either. So I dropped the history support to simplify the function.

So the reads and this function are *always* used in NMI context? Has
this been stressed with things like SMIs to see how it does if
interrupted in those cases?

My worry is that (as I bored everyone earlier), the
ktime_get_*_fast_ns() interfaces already have some sharp edges and
need a fair amount of thought as to when they should be used. This is
sort of compounding that: adding an interface that has further special
cases where it can fail, making it difficult to fully understand and
easier to accidentally misuse.

My other concern is that interfaces always get stretched and used
beyond anything they were initially planned for (see the
ktime_get_*fast_ns() interfaces here as an example! :), and in this
case the logic seems to have lots of implicit dependencies on the
facts of your specific use case, so it seems a bit fragile should
folks on other architectures with other constraints try to use it.

So I just want to push a bit to think how you might be able to
extend/generalize the existing get_system_crosststamp for your
purposes, or alternatively find a way to simplify the logic's behavior
so its less tied to specific constraints ("this works most of the time
from NMI, but otherwise no promises"). Or at least some better
documentation around the code, its uses and its constraints? ( "NMI
safe" is not the same as "Only safe to use from NMI" :)

And apologies if I've misunderstood things here.

thanks
-john

2023-01-24 20:13:01

by Liang, Kan

Subject: Re: [PATCH 1/3] timekeeping: NMI safe converter from a given time to monotonic



On 2023-01-24 1:43 p.m., John Stultz wrote:
> On Tue, Jan 24, 2023 at 7:09 AM Liang, Kan <[email protected]> wrote:
>> On 2023-01-24 2:01 a.m., John Stultz wrote:
>>> On Mon, Jan 23, 2023 at 10:27 AM <[email protected]> wrote:
>>>> + /*
>>>> + * Check whether the given timestamp is on the current
>>>> + * timekeeping interval.
>>>> + */
>>>> + now = tk_clock_read(tkr);
>>>> + interval_start = tkr->cycle_last;
>>>> + if (!cycle_between(interval_start, cycles, now))
>>>> + return -EOPNOTSUPP;
>>>
>>> So. I've not fully thought this out, but it seems like it would be
>>> quite likely that you'd run into the case where the cycle_last value
>>> is updated and your earlier TSC timestamp isn't valid for the current
>>> interval. The get_device_system_crosststamp() logic has a big chunk of
>>> complex code to try to handle this case by interpolating the cycle
>>> value back in time. How well does just failing in this case work out?
>>>
>>
>> For that case, perf falls back to the time captured in the NMI handler, via
>> ktime_get_mono_fast_ns().
>
> This feels like *very* subtle behavior. Maybe I'm misunderstanding,
> but the goal seems to be to have more accurate timestamps on the hw
> events, and using the captured tsc timestamp avoids the latency of
> reading the time again. But if every timekeeping update
> interval (~tick) you transparently get a delayed value due to the
> fallback, it makes it hard to understand which timestamps are better
> or worse. The latency between two reads may be real or it may be just
> bad luck. This doesn't intuitively seem like a great benefit over more
> consistent latency of just using the ktime_get_mono_fast()
> timestamping.

Your understanding is correct. We want a more accurate timestamp for the
analysis work.

As I understand it, the timekeeping update should not happen very often. If
I read the code correctly, it should happen only when adjusting NTP or
suspending/resuming. If so, I think the drawback should not impact the
normal analysis work. I will call out the drawbacks in the comments
where the function is used.

>
>> The TSC in PEBS is captured by HW when the sample was generated. There
>> should be a small delta compared with the time captured in the NMI
>> handler. But I think the delta should be acceptable as a backup solution
>> for the most analysis cases. Also, I don't think the case (the
>> cycle_last value is updated during the monitoring) should occur very
>> often either. So I drop the history support to simplify the function.
>
> So the reads and this function are *always* used in NMI context? Has
> this been stressed with things like SMIs to see how it does if
> interrupted in those cases?

Yes, it's *always* and only used in NMI context.

>
> My worry is that (as I bored everyone earlier), the
> ktime_get_*_fast_ns() interfaces already have some sharp edges and
> need a fair amount of thought as to when they should be used. This is
> sort of compounding that adding an interface that has further special
> cases where it can fail, making it difficult to fully understand and
> easier to accidentally misuse.
>
> My other concern is that interfaces always get stretched and used
> beyond anything they were initially planned for (see the
> ktime_get_*fast_ns() interfaces here as an example! :), and in this
> case the logic seems to have lots of implicit dependencies on the
> facts of your specific use case, so it seems a bit fragile should
> folks on other architectures with other constraints try to use it.
>
> So I just want to push a bit to think how you might be able to
> extend/generalize the existing get_system_crosststamp for your
> purposes, or alternatively find a way to simplify the logic's behavior
> so it's less tied to specific constraints ("this works most of the time
> from NMI, but otherwise no promises"). Or at least some better
> documentation around the code, its uses and its constraints? ( "NMI
> safe" is not the same as "Only safe to use from NMI" :)

Since our usage is fixed (only in NMI), I prefer the latter. I think
extending/generalizing the existing function would only make it
extremely complex and inefficient. The new function should have the
same constraints as the existing ktime_get_mono_fast_ns(). Since perf
can live with ktime_get_mono_fast_ns(), there should be no problem
with the new function's constraints. I will add more comments to
clarify the usage and constraints to avoid abuse of the new function.

Thanks,
Kan


2023-01-24 20:33:58

by John Stultz

Subject: Re: [PATCH 1/3] timekeeping: NMI safe converter from a given time to monotonic

On Tue, Jan 24, 2023 at 12:13 PM Liang, Kan <[email protected]> wrote:
> On 2023-01-24 1:43 p.m., John Stultz wrote:
> > On Tue, Jan 24, 2023 at 7:09 AM Liang, Kan <[email protected]> wrote:
> >> On 2023-01-24 2:01 a.m., John Stultz wrote:
> >>> On Mon, Jan 23, 2023 at 10:27 AM <[email protected]> wrote:
> >>>> + /*
> >>>> + * Check whether the given timestamp is on the current
> >>>> + * timekeeping interval.
> >>>> + */
> >>>> + now = tk_clock_read(tkr);
> >>>> + interval_start = tkr->cycle_last;
> >>>> + if (!cycle_between(interval_start, cycles, now))
> >>>> + return -EOPNOTSUPP;
> >>>
> >>> So. I've not fully thought this out, but it seems like it would be
> >>> quite likely that you'd run into the case where the cycle_last value
> >>> is updated and your earlier TSC timestamp isn't valid for the current
> >>> interval. The get_device_system_crosststamp() logic has a big chunk of
> >>> complex code to try to handle this case by interpolating the cycle
> >>> value back in time. How well does just failing in this case work out?
> >>>
> >>
> >> For that case, perf falls back to the time captured in the NMI handler, via
> >> ktime_get_mono_fast_ns().
> >
> > This feels like *very* subtle behavior. Maybe I'm misunderstanding,
> > but the goal seems to be to have more accurate timestamps on the hw
> > events, and using the captured tsc timestamp avoids the latency of
> > reading the time again. But if every timekeeping update
> > interval (~tick) you transparently get a delayed value due to the
> > fallback, it makes it hard to understand which timestamps are better
> > or worse. The latency between two reads may be real or it may be just
> > bad luck. This doesn't intuitively seem like a great benefit over more
> > consistent latency of just using the ktime_get_mono_fast()
> > timestamping.
>
> Your understanding is correct. We want a more accurate timestamp for the
> analysis work.
>
> As I understand it, the timekeeping update should not happen very often. If

"Often" depends on your timescale.

> I read the code correctly, it should happen only when adjusting NTP or
> suspending/resuming. If so, I think the drawback should not impact the
> normal analysis work. I will call out the drawbacks in the comments
> where the function is used.

So the adjustments are done at tick time depending on the current NTP
"error" (basically what the kernel tracks as the delta from its sense
of what NTP has told us).

Not just at the time when ntp makes an adjustment.

So the window for it to happen is every timekeeping update (which is ~HZ).


> >> The TSC in PEBS is captured by HW when the sample was generated. There
> >> should be a small delta compared with the time captured in the NMI
> >> handler. But I think the delta should be acceptable as a backup solution
> >> for the most analysis cases. Also, I don't think the case (the
> >> cycle_last value is updated during the monitoring) should occur very
> >> often either. So I drop the history support to simplify the function.
> >
> > So the reads and this function are *always* used in NMI context? Has
> > this been stressed with things like SMIs to see how it does if
> > interrupted in those cases?
>
> Yes, it's *always* and only used in NMI context.

Thanks, that is helpful to clarify.

> > My worry is that (as I bored everyone earlier), the
> > ktime_get_*_fast_ns() interfaces already have some sharp edges and
> > need a fair amount of thought as to when they should be used. This is
> > sort of compounding that adding an interface that has further special
> > cases where it can fail, making it difficult to fully understand and
> > easier to accidentally misuse.
> >
> > My other concern is that interfaces always get stretched and used
> > beyond anything they were initially planned for (see the
> > ktime_get_*fast_ns() interfaces here as an example! :), and in this
> > case the logic seems to have lots of implicit dependencies on the
> > facts of your specific use case, so it seems a bit fragile should
> > folks on other architectures with other constraints try to use it.
> >
> > So I just want to push a bit to think how you might be able to
> > extend/generalize the existing get_system_crosststamp for your
> > purposes, or alternatively find a way to simplify the logic's behavior
> > so it's less tied to specific constraints ("this works most of the time
> > from NMI, but otherwise no promises"). Or at least some better
> > documentation around the code, its uses and its constraints? ( "NMI
> > safe" is not the same as "Only safe to use from NMI" :)
>
> Since our usage is fixed (only in NMI), I prefer the latter. I think
> extending/generalizing the existing function would only make it
> extremely complex and inefficient. The new function should have the
> same constraints as the existing ktime_get_mono_fast_ns(). Since perf
> can live with ktime_get_mono_fast_ns(), there should be no problem
> with the new function's constraints. I will add more comments to
> clarify the usage and constraints to avoid abuse of the new function.

I agree the existing function is complex, so adding more complexity
isn't ideal, but potentially breaking it up or reworking it might be
better. Having two similar but different implementations is also a
complexity. So I just want to make sure this is well considered. But
clearer documentation as a first step will help.

Thanks
-john

2023-01-24 22:08:11

by Liang, Kan

Subject: Re: [PATCH 1/3] timekeeping: NMI safe converter from a given time to monotonic



On 2023-01-24 3:33 p.m., John Stultz wrote:
> On Tue, Jan 24, 2023 at 12:13 PM Liang, Kan <[email protected]> wrote:
>> On 2023-01-24 1:43 p.m., John Stultz wrote:
>>> On Tue, Jan 24, 2023 at 7:09 AM Liang, Kan <[email protected]> wrote:
>>>> On 2023-01-24 2:01 a.m., John Stultz wrote:
>>>>> On Mon, Jan 23, 2023 at 10:27 AM <[email protected]> wrote:
>>>>>> + /*
>>>>>> + * Check whether the given timestamp is on the current
>>>>>> + * timekeeping interval.
>>>>>> + */
>>>>>> + now = tk_clock_read(tkr);
>>>>>> + interval_start = tkr->cycle_last;
>>>>>> + if (!cycle_between(interval_start, cycles, now))
>>>>>> + return -EOPNOTSUPP;
>>>>> So. I've not fully thought this out, but it seems like it would be
>>>>> quite likely that you'd run into the case where the cycle_last value
>>>>> is updated and your earlier TSC timestamp isn't valid for the current
>>>>> interval. The get_device_system_crosststamp() logic has a big chunk of
>>>>> complex code to try to handle this case by interpolating the cycle
>>>>> value back in time. How well does just failing in this case work out?
>>>>>
>>>> For that case, perf falls back to the time captured in the NMI handler, via
>>>> ktime_get_mono_fast_ns().
>>> This feels like *very* subtle behavior. Maybe I'm misunderstanding,
>>> but the goal seems to be to have more accurate timestamps on the hw
>>> events, and using the captured tsc timestamp avoids the latency of
>>> reading the time again. But if every timekeeping update
>>> interval (~tick) you transparently get a delayed value due to the
>>> fallback, it makes it hard to understand which timestamps are better
>>> or worse. The latency between two reads may be real or it may be just
>>> bad luck. This doesn't intuitively seem like a great benefit over more
>>> consistent latency of just using the ktime_get_mono_fast()
>>> timestamping.
>> Your understanding is correct. We want a more accurate timestamp for the
>> analysis work.
>>
>> As I understand it, the timekeeping update should not happen very often. If
> "Often" depends on your timescale.
>
>> I read the code correctly, it should happen only when adjusting NTP or
>> suspending/resuming. If so, I think the drawback should not impact the
>> normal analysis work. I will call out the drawbacks in the comments
>> where the function is used.
> So the adjustments are done at tick time depending on the current NTP
> "error" (basically what the kernel tracks as the delta from its sense
> of what NTP has told us).
>
> Not just at the time when ntp makes an adjustment.
>
> So the window for it to happen is every timekeeping update (which is ~HZ).

You mean the tkf->base is updated in each tick?
If so, that would be a problem.

Does the NTP "error" consistently happen on all the platforms? Or just
on some platforms which we can check for and limit the usage to?

There are two configurations for PEBS: single PEBS and large PEBS.
In single PEBS mode, the HW triggers an NMI for each record. The TSC is
the time at which the record was generated, and we convert it in the same
NMI. I don't think the NTP "error" can impact it.

But in large PEBS mode, the HW triggers an NMI only when a fixed number
of records has been generated. It's very likely that the base has been
updated between the records. That could bring trouble, since we can
only fall back to the NMI handler time of the last record. I'm afraid we
have to disable the large PEBS mode.

Stephane,

What do you think?
Is it still useful for you if we only support the single PEBS?


Thanks,
Kan

2023-01-24 22:40:57

by John Stultz

Subject: Re: [PATCH 1/3] timekeeping: NMI safe converter from a given time to monotonic

On Tue, Jan 24, 2023 at 2:08 PM Liang, Kan <[email protected]> wrote:
> On 2023-01-24 3:33 p.m., John Stultz wrote:
> > On Tue, Jan 24, 2023 at 12:13 PM Liang, Kan <[email protected]> wrote:
> >> On 2023-01-24 1:43 p.m., John Stultz wrote:
> >> I read the code correctly, it should happen only when adjusting NTP or
> >> suspending/resuming. If so, I think the drawback should not impact the
> >> normal analysis work. I will call out the drawbacks in the comments
> >> where the function is used.
> > So the adjustments are done at tick time depending on the current NTP
> > "error" (basically what the kernel tracks as the delta from its sense
> > of what NTP has told us).
> >
> > Not just at the time when ntp makes an adjustment.
> >
> > So the window for it to happen is every timekeeping update (which is ~HZ).
>
> You mean the tkf->base is updated in each tick?

Right, every call to timekeeping_update(), which is called for every
major state change in time along with regular updates via
timekeeping_advance(), which is called from update_wall_time(), which
gets called from the various tick handlers.

> If so, that should be a problem.
>
> Does the NTP "error" consistently happen on all the platforms? Or just
> on some platforms which we can check and limit the usage?

Basically any platform where adjtimex was ever called. Both offset
and freq adjustments will trigger this. The trouble is that the
adjustments requested are finer grained than what the actual
clocksource's mult values can be adjusted for. So the kernel tracks the
error internally and will oscillate the clocksource multiplier value
to keep the long-term accuracy close to what was requested.

> There are two configurations for PEBS, single PEBS, and large PEBS.
> For the single PEBS mode, HW triggers a NMI for each record. The TSC is
> the time which the record is generated, and we convert it in the same
> NMI. I don't think the NTP "error" can impact it.

If the clocksource is being adjusted in between reads (imagine the
freq is dropped), when using the fast accessors, time may seem to go
backwards between those reads.

In fact, if the same TSC value is used for two calls in a row, you
could see time go backwards or forwards between them if the adjustment
happened in between.
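To make this concrete: the fast-path conversion is just arithmetic on a snapshot, ns = (delta * mult) >> shift, with nothing guarding against mult changing underneath it. A toy illustration (the mult/shift values here are made up):

```c
#include <stdint.h>

/* ns contribution of a cycle delta under a given mult/shift pair. */
static uint64_t delta_to_ns(uint64_t delta, uint32_t mult, uint32_t shift)
{
	return (delta * mult) >> shift;
}

/*
 * If the multiplier is lowered from 11 to 10 between two conversions of
 * the very same delta, the second result is smaller than the first:
 * time appears to step backwards between the two reads.
 */
```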

The regular (non-fast) clocksource accessors use the seqcount to
prevent reads taken while an adjustment is in progress from being
used (you'll note we always re-read the clocksource within the
seqcount loop), but this makes them not NMI safe.
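For readers unfamiliar with it, the seqcount pattern referred to here looks roughly like this minimal single-threaded model (simplified names, not the kernel's seqlock.h; the real reader also spins while the count is odd):

```c
/* Minimal model of a sequence counter: the count is bumped to an odd
 * value while a writer is mid-update, and bumped again when it's done. */
struct seqcount { unsigned seq; };

static unsigned read_begin(const struct seqcount *s)
{
	return s->seq & ~1u;
}

static int read_retry(const struct seqcount *s, unsigned start)
{
	return s->seq != start; /* a writer ran (or is running): retry */
}

static void write_begin(struct seqcount *s) { s->seq++; /* odd: in update */ }
static void write_end(struct seqcount *s)   { s->seq++; /* even: stable */ }
```

A reader does: do { seq = read_begin(&s); /* copy the data */ } while (read_retry(&s, seq)); — which is exactly what an NMI cannot safely do: if it interrupts the writer while the count is odd, it would retry forever, hence the separate fast accessors.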


> But for the large PEBS mode, HW triggers a NMI only when a fixed number
> of records are generated. It's very likely that the base has been
> updated between the records. That could bring troubles, since we can
> only fall back to the NMI handler time of the last record. I'm afraid we
> have to disable the large PEBS mode.
>

Right, if I recall, this is why the interpolation is done in the
existing get_device_system_crosststamp(). It's obviously never going
to be perfect as we don't keep the timekeeper state for every tick,
but it might be better than the bailout you have if you didn't luck
into the current interval.

My apologies, as my intention isn't to dissuade you here, but just to
make sure the complexities (and unfortunately there are a lot) are
well worked out and well tested.

thanks
-john

2023-01-25 16:44:14

by Liang, Kan

Subject: Re: [PATCH 1/3] timekeeping: NMI safe converter from a given time to monotonic



On 2023-01-24 5:40 p.m., John Stultz wrote:
> On Tue, Jan 24, 2023 at 2:08 PM Liang, Kan <[email protected]> wrote:
>> On 2023-01-24 3:33 p.m., John Stultz wrote:
>>> On Tue, Jan 24, 2023 at 12:13 PM Liang, Kan <[email protected]> wrote:
>>>> On 2023-01-24 1:43 p.m., John Stultz wrote:
>>>> I read the code correctly, it should happen only when adjusting NTP or
>>>> suspending/resuming. If so, I think the drawback should not impact the
>>>> normal analysis work. I will call out the drawbacks in the comments
>>>> where the function is used.
>>> So the adjustments are done at tick time depending on the current NTP
>>> "error" (basically what the kernel tracks as the delta from its sense
>>> of what NTP has told us).
>>>
>>> Not just at the time when ntp makes an adjustment.
>>>
>>> So the window for it to happen is every timekeeping update (which is ~HZ).
>>
>> You mean the tkf->base is updated in each tick?
>
> Right, every call to timekeeping_update(), which is called for every
> major state change in time along with regular updates via
> timekeeping_advance() which is called from update_wall_time() which
> gets called from the various tick handlers.
>
>> If so, that should be a problem.
>>
>> Does the NTP "error" consistently happen on all the platforms? Or just
>> on some platforms which we can check and limit the usage?
>
> Basically any platform where adjtimex was ever called. Both offset
> and freq adjustments will trigger this. The trouble is that the
> adjustments requested are finer grained than what the actual
> clocksources mult values can be adjusted for. So the kernel tracks the
> error internally and will oscillate the clocksource multiplier value
> to keep the long-term accuracy close to what was requested.
>
>> There are two configurations for PEBS, single PEBS, and large PEBS.
>> For the single PEBS mode, HW triggers a NMI for each record. The TSC is
>> the time which the record is generated, and we convert it in the same
>> NMI. I don't think the NTP "error" can impact it.
>
> If the clocksource is being adjusted in between reads (imagine the
> freq is dropped), when using the fast accessors, time may seem to go
> backwards between those reads.
>
> In fact, if the same TSC value is used for two calls in a row, you
> could see time go backwards or forwards between them if the adjustment
> happened inbetween.
>
> The regular (non-fast) clocksource accessors use the seqcount to
> prevent reads taken while an adjustment is in progress from being
> used (you'll note we always re-read the clocksource within the
> seqcount loop), but this makes them not NMI safe.
>
>
>> But for the large PEBS mode, HW triggers a NMI only when a fixed number
>> of records are generated. It's very likely that the base has been
>> updated between the records. That could bring troubles, since we can
>> only fall back to the NMI handler time of the last record. I'm afraid we
>> have to disable the large PEBS mode.
>>
>
> Right, if I recall, this is why the interpolation is done in the
> existing get_device_system_crosststamp(). It's obviously never going
> to be perfect as we don't keep the timekeeper state for every tick,
> but it might be better than the bailout you have if you didn't luck
> into the current interval.
>
> My apologies, as my intention isn't to dissuade you here, but just to
> make sure the complexities (and unfortunately there are a lot) are
> well worked out and well tested.
>

Thank you very much for your patience and all the details, I really
appreciate it. I think I understand the problem and the concerns now. I
will reevaluate the current design and try to find a proper solution.

Thanks,
Kan

2023-01-27 13:30:27

by Thomas Gleixner

Subject: Re: [PATCH 1/3] timekeeping: NMI safe converter from a given time to monotonic

On Tue, Jan 24 2023 at 01:10, Stephane Eranian wrote:
> On Tue, Jan 24, 2023 at 12:52 AM Thomas Gleixner <[email protected]> wrote:
>> > It's useful to provide a NMI safe function to convert a given time to
>> > monotonic. For example, the perf_event subsystem wants to convert a TSC
>> > of a PEBS record to a monotonic clock in a NMI handler.
>>
>> Why? That's a postprocessing problem, really.
>>
> Because you want to correlate samples captured by PEBS with samples
> from applications timestamped with a user available clock such as
> CLOCK_MONOTONIC, for instance.

Sure. Postprocessing can do that if the kernel provides the relevant
conversion information. Absolutely no need for doing this in the kernel.

> When I create a perf_event event and I stipulate that
> event_attr.clockid=MONOTONIC, I expect all the samples from that event
> to be timestamped using the same clock source, regardless of PEBS or
> IBS.

Just because the kernel can do the conversion, there is no requirement
to actually do so. Instrumentation has to be as lightweight as possible
and a lot of instrumentation goes the extra mile of doing postprocessing
exactly for this reason.

So if we implement this conversion in the kernel then there needs to be
a compelling technical reason to do so.

So far the provided arguments are in the 'I want a pony' realm.

Thanks,

tglx