Message-ID: <01434319-3828-0d50-4274-8f95b18dafc1@linux.intel.com>
Date:   Wed, 25 Jan 2023 11:44:05 -0500
MIME-Version: 1.0
User-Agent: Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:102.0) Gecko/20100101
 Thunderbird/102.6.1
Subject: Re: [PATCH 1/3] timekeeping: NMI safe converter from a given time to
 monotonic
To:     John Stultz <jstultz@google.com>
Cc:     Stephane Eranian <eranian@google.com>, peterz@infradead.org,
        mingo@redhat.com, tglx@linutronix.de, sboyd@kernel.org,
        linux-kernel@vger.kernel.org, namhyung@kernel.org,
        ak@linux.intel.com
References: <20230123182728.825519-1-kan.liang@linux.intel.com>
 <20230123182728.825519-2-kan.liang@linux.intel.com>
 <CANDhNCpWwxXM8DD9h4zOW+bygshkOg9TWO9Z7wJO_B7bDtgEHw@mail.gmail.com>
 <9c17d6be-e532-84e1-4d35-77b9bd3051dc@linux.intel.com>
 <CANDhNCp_0Os+e0A0LZ7yKw16mWai9MAPMPYL0p1NkcVxifh88w@mail.gmail.com>
 <1fb59dfa-1ab9-51ad-98c6-89431aa56918@linux.intel.com>
 <CANDhNCodq8iyRY-md-nRkAPYS5p3iTCgDqZXvWAA108TctQASg@mail.gmail.com>
 <c8ad4654-6bfd-2983-036e-e969787992e9@linux.intel.com>
 <CANDhNCqxOurpjKQaEEgxdFYV8kQQHVUNZUy1B5AewSi8bys8iQ@mail.gmail.com>
Content-Language: en-US
From:   "Liang, Kan" <kan.liang@linux.intel.com>
In-Reply-To: <CANDhNCqxOurpjKQaEEgxdFYV8kQQHVUNZUy1B5AewSi8bys8iQ@mail.gmail.com>
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 7bit
Precedence: bulk


On 2023-01-24 5:40 p.m., John Stultz wrote:
> On Tue, Jan 24, 2023 at 2:08 PM Liang, Kan <kan.liang@linux.intel.com> wrote:
>> On 2023-01-24 3:33 p.m., John Stultz wrote:
>>> On Tue, Jan 24, 2023 at 12:13 PM Liang, Kan <kan.liang@linux.intel.com> wrote:
>>>> On 2023-01-24 1:43 p.m., John Stultz wrote:
>>>> I read the code correctly, it should happen only when adjusting NTP or
>>>> suspending/resuming. If so, I think the drawback should not impact the
>>>> normal analysis work. I will call out the drwabacks in the comments
>>>> where the function is used.
>>> So the adjustments are done at tick time depending on the current NTP
>>> "error" (basically what the kernel tracks as the delta from its sense
>>> of what NTP has told us).
>>>
>>> Not just at the time when ntp makes an adjustment.
>>>
>>> So the window for it to happen is every timekeeping update (which is ~HZ).
>>
>> You mean the tkf->base is updated in each tick?
> 
> Right, every call to timekeeping_update(), which is called for every
> major state change in time along with regular updates via
> timekeeping_advance() which is called from update_wall_time() which
> gets called from the various tick handlers.
> 
>> If so, that should be a problem.
>>
>> Does the NTP "error" consistently happen on all the platforms? Or just
>> on some platforms which we can check and limit the usage?
> 
> Basically any platform where adjtimex was ever called.  Both offset
> and freq adjustments will trigger this. The trouble is that the
> adjustments requested are finer grained than what the actual
> clocksources mult values can be adjusted for. So the kernel tracks the
> error internally and will oscillate the clocksource multiplier value
> to keep the long-term accuracy close to what was requested.
> 
>> There are two configurations for PEBS, single PEBS, and large PEBS.
>> For the single PEBS mode, HW triggers a NMI for each record. The TSC is
>> the time which the record is generated, and we convert it in the same
>> NMI. I don't think the NTP "error" can impact it.
> 
> If the clocksource is being adjusted in between reads (imagine the
> freq is dropped), when using the fast accessors, time may seem to go
> backwards between those reads.
> 
> In fact, if the same TSC value is used for two calls in a row, you
> could see time go backwards or forwards between them if the adjustment
> happened inbetween.
> 
> The regular (non-fast) clocksource accessors use the seqcount to
> prevent reads from occuring while the adjustment is made from being
> used (you'll note we always re-read the clocksource within the
> seqcount loop) but this makes it not NMI safe.
> 
> 
>> But for the large PEBS mode, HW triggers a NMI only when a fixed number
>> of records are generated. It's very likely that the base has been
>> updated between the records. That could bring troubles, since we can
>> only fall back to the NMI handler time of the last record. I'm afraid we
>> have to disable the large PEBS mode.
>>
> 
> Right, if I recall, this is why the interpolation is done in the
> existing get_device_system_crosststamp(). It's obviously never going
> to be perfect as we dont' keep the timekeeper state for every tick,
> but it might be better than the bailout you have if you didn't luck
> into the current interval.
> 
> My apologies, as my intention isn't to dissuade you here, but just to
> make sure the complexities (and unfortunately there are a lot) are
> well worked out and well tested.
> 

Thank you very much for your patience and all the details, I really
appreciate it. I think I understand the problem and the concerns now. I
will reevaluate the current design and try to find a proper solution.

Thanks,
Kan