2022-03-17 05:58:09

by Waiman Long

Subject: [PATCH 0/2] x86/tsc: Avoid TSC sync failure

When booting a kernel on some Intel CooperLake systems, it was found
that the initial TSC sync at boot up may sometimes fail, leading to a
fallback to HPET as the system clock, with some performance
degradation. For example:

[ 0.034090] TSC deadline timer available
[ 0.008807] TSC ADJUST compensate: CPU36 observed 95626 warp. Adjust: 95626
[ 0.008807] TSC ADJUST compensate: CPU36 observed 74 warp. Adjust: 95700
[ 0.974281] TSC synchronization [CPU#0 -> CPU#36]:
[ 0.974281] Measured 4 cycles TSC warp between CPUs, turning off TSC clock.
[ 0.974281] tsc: Marking TSC unstable due to check_tsc_sync_source failed

This patch set tries to minimize these types of failures by:
1) Putting all the TSC synchronization variables in their own cacheline
   to minimize external interference.
2) Estimating the synchronization overhead and adding it to the
   adjustment value.

With these changes in place, only one TSC adjustment was observed for
each of the affected CPUs, with no failures.

Waiman Long (2):
x86/tsc: Reduce external interference on max_warp detection
x86/tsc_sync: Add synchronization overhead to tsc adjustment

arch/x86/kernel/tsc_sync.c | 84 ++++++++++++++++++++++----------------
1 file changed, 48 insertions(+), 36 deletions(-)

--
2.27.0


2022-03-17 06:06:53

by Waiman Long

Subject: [PATCH 2/2] x86/tsc_sync: Add synchronization overhead to tsc adjustment

As stated in the comment of check_tsc_sync_target():

The adjustment value is slightly off by the overhead of the
sync mechanism (observed values are ~200 TSC cycles), but this
really depends on CPU, node distance and frequency. So
compensating for this is hard to get right.

That overhead, however, can cause the tsc adjustment to fail after 3
test runs, as can be seen when booting up a hot 4-socket Intel CooperLake
system:

[ 0.034090] TSC deadline timer available
[ 0.008807] TSC ADJUST compensate: CPU36 observed 95626 warp. Adjust: 95626
[ 0.008807] TSC ADJUST compensate: CPU36 observed 74 warp. Adjust: 95700
[ 0.974281] TSC synchronization [CPU#0 -> CPU#36]:
[ 0.974281] Measured 4 cycles TSC warp between CPUs, turning off TSC clock.
[ 0.974281] tsc: Marking TSC unstable due to check_tsc_sync_source failed

To prevent this tsc adjustment failure, we need to estimate the sync
overhead, which is at least an unlock operation on one CPU followed by
a lock operation on another CPU.

The measurement is done in check_tsc_sync_target() after stop_count
reaches 2, which is set by the source CPU after it re-initializes the
sync variables, causing the lock cacheline to be remote from the target
CPU. The subsequent time measurement will then be similar to the latency
between successive iterations of the 2-CPU sync loop in check_tsc_warp().

Interrupts should not yet be enabled when check_tsc_sync_target() is
called. However, some interference may cause the overhead estimate to
vary a bit. With this patch applied, the measured overhead on the same
CooperLake system on different reboot runs varies from 104 to 326.

Signed-off-by: Waiman Long <[email protected]>
---
arch/x86/kernel/tsc_sync.c | 27 ++++++++++++++++++---------
1 file changed, 18 insertions(+), 9 deletions(-)

diff --git a/arch/x86/kernel/tsc_sync.c b/arch/x86/kernel/tsc_sync.c
index 70aeb254b62b..e2c43ba4e7b8 100644
--- a/arch/x86/kernel/tsc_sync.c
+++ b/arch/x86/kernel/tsc_sync.c
@@ -445,6 +445,7 @@ void check_tsc_sync_target(void)
struct tsc_adjust *cur = this_cpu_ptr(&tsc_adjust);
unsigned int cpu = smp_processor_id();
cycles_t cur_max_warp, gbl_max_warp;
+ cycles_t start, sync_overhead;
int cpus = 2;

/* Also aborts if there is no TSC. */
@@ -505,29 +506,37 @@ void check_tsc_sync_target(void)
if (!atomic_read(&test_runs))
return;

+ /*
+ * Estimate the synchronization overhead by measuring the time for
+ * a lock/unlock operation.
+ */
+ start = rdtsc_ordered();
+ arch_spin_lock(&sync.lock);
+ arch_spin_unlock(&sync.lock);
+ sync_overhead = rdtsc_ordered() - start;
+
/*
* If the warp value of this CPU is 0, then the other CPU
* observed time going backwards so this TSC was ahead and
* needs to move backwards.
*/
- if (!cur_max_warp)
+ if (!cur_max_warp) {
cur_max_warp = -gbl_max_warp;
+ sync_overhead = -sync_overhead;
+ }

/*
* Add the result to the previous adjustment value.
*
* The adjustment value is slightly off by the overhead of the
* sync mechanism (observed values are ~200 TSC cycles), but this
- * really depends on CPU, node distance and frequency. So
- * compensating for this is hard to get right. Experiments show
- * that the warp is not longer detectable when the observed warp
- * value is used. In the worst case the adjustment needs to go
- * through a 3rd run for fine tuning.
+ * really depends on CPU, node distance and frequency. Add the
+ * estimated sync overhead to the adjustment value.
*/
- cur->adjusted += cur_max_warp;
+ cur->adjusted += cur_max_warp + sync_overhead;

- pr_warn("TSC ADJUST compensate: CPU%u observed %lld warp. Adjust: %lld\n",
- cpu, cur_max_warp, cur->adjusted);
+ pr_warn("TSC ADJUST compensate: CPU%u observed %lld warp (overhead %lld). Adjust: %lld\n",
+ cpu, cur_max_warp, sync_overhead, cur->adjusted);

wrmsrl(MSR_IA32_TSC_ADJUST, cur->adjusted);
goto retry;
--
2.27.0

2022-04-05 02:04:00

by Thomas Gleixner

Subject: Re: [PATCH 2/2] x86/tsc_sync: Add synchronization overhead to tsc adjustment

On Mon, Mar 14 2022 at 15:46, Waiman Long wrote:
> As stated in the comment of check_tsc_sync_target():
>
> The adjustment value is slightly off by the overhead of the
> sync mechanism (observed values are ~200 TSC cycles), but this
> really depends on CPU, node distance and frequency. So
> compensating for this is hard to get right.
>
> That overhead, however, can cause the tsc adjustment to fail after 3
> test runs as can be seen when booting up a hot 4-socket Intel CooperLake
> system:
>
> [ 0.034090] TSC deadline timer available
> [ 0.008807] TSC ADJUST compensate: CPU36 observed 95626 warp. Adjust: 95626
> [ 0.008807] TSC ADJUST compensate: CPU36 observed 74 warp. Adjust: 95700
> [ 0.974281] TSC synchronization [CPU#0 -> CPU#36]:
> [ 0.974281] Measured 4 cycles TSC warp between CPUs, turning off TSC clock.
> [ 0.974281] tsc: Marking TSC unstable due to check_tsc_sync_source failed
>
> To prevent this tsc adjustment failure, we need to estimate the sync
> overhead, which is at least an unlock operation on one CPU followed by
> a lock operation on another CPU.
>
> The measurement is done in check_tsc_sync_target() after stop_count
> reaches 2, which is set by the source CPU after it re-initializes the
> sync variables, causing the lock cacheline to be remote from the target
> CPU. The subsequent time measurement will then be similar to the latency
> between successive iterations of the 2-CPU sync loop in check_tsc_warp().
>
> Interrupts should not yet be enabled when check_tsc_sync_target() is

Interrupts _ARE_ not enabled.

> called. However, some interference may cause the overhead estimate to
> vary a bit. With this patch applied, the measured overhead on the same
> CooperLake system on different reboot runs varies from 104 to 326.

Hmm.

> [ 0.008807] TSC ADJUST compensate: CPU36 observed 95626 warp. Adjust: 95626
> [ 0.008807] TSC ADJUST compensate: CPU36 observed 74 warp. Adjust: 95700
> [ 0.974281] Measured 4 cycles TSC warp between CPUs, turning off TSC clock.
> [ 0.974281] tsc: Marking TSC unstable due to check_tsc_sync_source failed

Can you please provide the output after your changes? It's hard to tell
what effect adding the lock compensation has. Ideally from several runs
for both patched and unpatched.

Also if the compensation value is at the upper end and the real overhead
is way lower then the validation run might end up with the opposite
result. I'm a bit worried about this variation.

Thanks,

tglx


2022-04-22 20:54:29

by Thomas Gleixner

Subject: Re: [PATCH 2/2] x86/tsc_sync: Add synchronization overhead to tsc adjustment

On Mon, Apr 18 2022 at 11:41, Waiman Long wrote:
> On 4/3/22 06:03, Thomas Gleixner wrote:
> [    0.008815] TSC ADJUST compensate: CPU36 observed 86056 warp (overhead 150). Adjust: 86206
> [    0.008815] TSC ADJUST compensate: CPU54 observed 86148 warp (overhead 158). Adjust: 86306
>
>> Also if the compensation value is at the upper end and the real overhead
>> is way lower then the validation run might end up with the opposite
>> result. I'm a bit worried about this variation.
>
> I also have a little concern about that. That is why I added patch 1 to
> minimize external interference as much as possible. For the TSC
> adjustment samples that I have gotten so far, I have never seen one that
> needs a 2nd adjustment to go backward.

I did some experiments and noticed that the boot time overhead is
different from the overhead when doing the sync check after boot
(offline a socket and on/offline the first CPU of it several times).

During boot the overhead is lower on this machine (SKL-X), during
runtime it's way higher and more noisy.

The noise can be pretty much eliminated by running the sync_overhead
measurement multiple times and building the average.

The reason why it is higher is that after offlining the socket the CPU
comes back up at a frequency of 700MHz, while during boot it runs at
2100MHz.

Sync overhead: 118
Sync overhead: 51 A: 22466 M: 22448 F: 2101683

Sync overhead: 178
Sync overhead: 152 A: 22477 M: 67380 F: 700529

Sync overhead: 212
Sync overhead: 152 A: 22475 M: 67380 F: 700467

Sync overhead: 153
Sync overhead: 152 A: 22497 M: 67452 F: 700404

Can you try the patch below and check whether the overhead stabilizes
across several attempts on that CooperLake machine and whether the
frequency is always the same or varies?

Independent of the outcome on that, I think we have to take the actual CPU
frequency into account for calculating the overhead.

Thanks,

tglx
---
--- a/arch/x86/kernel/tsc_sync.c
+++ b/arch/x86/kernel/tsc_sync.c
@@ -446,7 +446,8 @@ void check_tsc_sync_target(void)
unsigned int cpu = smp_processor_id();
cycles_t cur_max_warp, gbl_max_warp;
cycles_t start, sync_overhead;
- int cpus = 2;
+ u64 ap1, ap2, mp1, mp2;
+ int i, cpus = 2;

/* Also aborts if there is no TSC. */
if (unsynchronized_tsc())
@@ -514,6 +515,23 @@ void check_tsc_sync_target(void)
arch_spin_lock(&sync.lock);
arch_spin_unlock(&sync.lock);
sync_overhead = rdtsc_ordered() - start;
+ pr_info("Sync overhead: %lld\n", sync_overhead);
+
+ sync_overhead = 0;
+ rdmsrl(MSR_IA32_APERF, ap1);
+ rdmsrl(MSR_IA32_MPERF, mp1);
+ for (i = 0; i < 256; i++) {
+ start = rdtsc_ordered();
+ arch_spin_lock(&sync.lock);
+ arch_spin_unlock(&sync.lock);
+ sync_overhead += rdtsc_ordered() - start;
+ };
+ rdmsrl(MSR_IA32_APERF, ap2);
+ rdmsrl(MSR_IA32_MPERF, mp2);
+ ap2 -= ap1;
+ mp2 -= mp1;
+ pr_info("Sync overhead: %lld A: %llu M: %llu F: %llu\n", sync_overhead >> 8,
+ ap2, mp2, mp2 ? div64_u64((cpu_khz * ap2), mp2) : 0);

/*
* If the warp value of this CPU is 0, then the other CPU

2022-04-25 22:22:23

by Thomas Gleixner

Subject: Re: [PATCH 2/2] x86/tsc_sync: Add synchronization overhead to tsc adjustment

On Mon, Apr 25 2022 at 09:20, Waiman Long wrote:
> On 4/22/22 06:41, Thomas Gleixner wrote:
>> I did some experiments and noticed that the boot time overhead is
>> different from the overhead when doing the sync check after boot
>> (offline a socket and on/offline the first CPU of it several times).
>>
>> During boot the overhead is lower on this machine (SKL-X), during
>> runtime it's way higher and more noisy.
>>
>> The noise can be pretty much eliminated by running the sync_overhead
>> measurement multiple times and building the average.
>>
>> The reason why it is higher is that after offlining the socket the CPU
>> comes back up with a frequency of 700Mhz while during boot it runs with
>> 2100Mhz.
>>
>> Sync overhead: 118
>> Sync overhead: 51 A: 22466 M: 22448 F: 2101683
> One explanation of the sync overhead difference (118 vs 51) here is
> whether the lock cacheline is local or remote. My analysis of the
> interaction between check_tsc_sync_source() and check_tsc_sync_target()
> is that the real overhead comes from locking with a remote cacheline
> (local to the source, remote to the target). When you do a loop of 256
> lock operations, it is all on a local cacheline. That is why the
> overhead is lower. It also depends on whether the remote cacheline is
> in the same socket or a different socket.

Yes. It's clear that the initial sync overhead is due to the cache line
being remote, but I'd rather underestimate the compensation. Aside from
that, it's not guaranteed that the cache line is actually remote on the
first access. It's by chance, not by design.

>> Sync overhead: 178
>> Sync overhead: 152 A: 22477 M: 67380 F: 700529
>>
>> Sync overhead: 212
>> Sync overhead: 152 A: 22475 M: 67380 F: 700467
>>
>> Sync overhead: 153
>> Sync overhead: 152 A: 22497 M: 67452 F: 700404
>>
>> Can you try the patch below and check whether the overhead stabilizes
>> across several attempts on that CooperLake machine and whether the
>> frequency is always the same or varies?
> Yes, I will try that experiment and report back the results.
>>
>> Independent of the outcome on that, I think we have to take the actual CPU
>> frequency into account for calculating the overhead.
>
> Assuming that the clock frequency remains the same during the
> check_tsc_warp() loop and the sync overhead computation time, I don't
> think the actual clock frequency matters much. However, it will be a
> different matter if the frequency does change. In this case, it is more
> likely the frequency will go up than down. Right? IOW, we may
> underestimate the sync overhead in this case. I think it is better than
> overestimating it.

The question is not whether the clock frequency changes during the loop.
The point is:

start = rdtsc();
do_stuff();
end = rdtsc();
compensation = end - start;

do_stuff() executes a constant number of instructions which are executed
in a constant number of CPU clock cycles, let's say 100 for simplicity.
TSC runs with 2000MHz.

With a CPU frequency of 1000 MHz the real computation time is:

100/1000MHz = 100 nsec = 200 TSC cycles

while with a CPU frequency of 2000MHz it is obviously:

100/2000MHz = 50 nsec = 100 TSC cycles

IOW, TSC runs with a constant frequency independent of the actual CPU
frequency, ergo the CPU frequency dependent execution time has an
influence on the resulting compensation value, no?

On the machine I tested on, it's a factor of 3 between the minimal and
the maximal CPU frequency, which makes quite a difference, right?

Thanks,

tglx


2022-04-26 17:44:40

by Waiman Long

Subject: Re: [PATCH 2/2] x86/tsc_sync: Add synchronization overhead to tsc adjustment

On 4/22/22 06:41, Thomas Gleixner wrote:
> On Mon, Apr 18 2022 at 11:41, Waiman Long wrote:
>> On 4/3/22 06:03, Thomas Gleixner wrote:
>> [    0.008815] TSC ADJUST compensate: CPU36 observed 86056 warp (overhead 150). Adjust: 86206
>> [    0.008815] TSC ADJUST compensate: CPU54 observed 86148 warp (overhead 158). Adjust: 86306
>>
>>> Also if the compensation value is at the upper end and the real overhead
>>> is way lower then the validation run might end up with the opposite
>>> result. I'm a bit worried about this variation.
>> I also have a little concern about that. That is why I added patch 1 to
>> minimize external interference as much as possible. For the TSC
>> adjustment samples that I have gotten so far, I have never seen one that
>> needs a 2nd adjustment to go backward.
> I did some experiments and noticed that the boot time overhead is
> different from the overhead when doing the sync check after boot
> (offline a socket and on/offline the first CPU of it several times).
>
> During boot the overhead is lower on this machine (SKL-X), during
> runtime it's way higher and more noisy.
>
> The noise can be pretty much eliminated by running the sync_overhead
> measurement multiple times and building the average.
>
> The reason why it is higher is that after offlining the socket the CPU
> comes back up with a frequency of 700Mhz while during boot it runs with
> 2100Mhz.
>
> Sync overhead: 118
> Sync overhead: 51 A: 22466 M: 22448 F: 2101683
One explanation of the sync overhead difference (118 vs 51) here is
whether the lock cacheline is local or remote. My analysis of the
interaction between check_tsc_sync_source() and check_tsc_sync_target()
is that the real overhead comes from locking with a remote cacheline
(local to the source, remote to the target). When you do a loop of 256
lock operations, it is all on a local cacheline. That is why the
overhead is lower. It also depends on whether the remote cacheline is
in the same socket or a different socket.
>
> Sync overhead: 178
> Sync overhead: 152 A: 22477 M: 67380 F: 700529
>
> Sync overhead: 212
> Sync overhead: 152 A: 22475 M: 67380 F: 700467
>
> Sync overhead: 153
> Sync overhead: 152 A: 22497 M: 67452 F: 700404
>
> Can you try the patch below and check whether the overhead stabilizes
> across several attempts on that CooperLake machine and whether the
> frequency is always the same or varies?
Yes, I will try that experiment and report back the results.
>
> Independent of the outcome on that, I think we have to take the actual CPU
> frequency into account for calculating the overhead.

Assuming that the clock frequency remains the same during the
check_tsc_warp() loop and the sync overhead computation time, I don't
think the actual clock frequency matters much. However, it will be a
different matter if the frequency does change. In this case, it is more
likely the frequency will go up than down. Right? IOW, we may
underestimate the sync overhead in this case. I think it is better than
overestimating it.

Cheers,
Longman

>
> Thanks,
>
> tglx
> ---
> --- a/arch/x86/kernel/tsc_sync.c
> +++ b/arch/x86/kernel/tsc_sync.c
> @@ -446,7 +446,8 @@ void check_tsc_sync_target(void)
> unsigned int cpu = smp_processor_id();
> cycles_t cur_max_warp, gbl_max_warp;
> cycles_t start, sync_overhead;
> - int cpus = 2;
> + u64 ap1, ap2, mp1, mp2;
> + int i, cpus = 2;
>
> /* Also aborts if there is no TSC. */
> if (unsynchronized_tsc())
> @@ -514,6 +515,23 @@ void check_tsc_sync_target(void)
> arch_spin_lock(&sync.lock);
> arch_spin_unlock(&sync.lock);
> sync_overhead = rdtsc_ordered() - start;
> + pr_info("Sync overhead: %lld\n", sync_overhead);
> +
> + sync_overhead = 0;
> + rdmsrl(MSR_IA32_APERF, ap1);
> + rdmsrl(MSR_IA32_MPERF, mp1);
> + for (i = 0; i < 256; i++) {
> + start = rdtsc_ordered();
> + arch_spin_lock(&sync.lock);
> + arch_spin_unlock(&sync.lock);
> + sync_overhead += rdtsc_ordered() - start;
> + };
> + rdmsrl(MSR_IA32_APERF, ap2);
> + rdmsrl(MSR_IA32_MPERF, mp2);
> + ap2 -= ap1;
> + mp2 -= mp1;
> + pr_info("Sync overhead: %lld A: %llu M: %llu F: %llu\n", sync_overhead >> 8,
> + ap2, mp2, mp2 ? div64_u64((cpu_khz * ap2), mp2) : 0);
>
> /*
> * If the warp value of this CPU is 0, then the other CPU
>

2022-04-27 10:04:26

by Waiman Long

Subject: Re: [PATCH 2/2] x86/tsc_sync: Add synchronization overhead to tsc adjustment

On 4/25/22 15:24, Thomas Gleixner wrote:
> On Mon, Apr 25 2022 at 09:20, Waiman Long wrote:
>> On 4/22/22 06:41, Thomas Gleixner wrote:
>>> I did some experiments and noticed that the boot time overhead is
>>> different from the overhead when doing the sync check after boot
>>> (offline a socket and on/offline the first CPU of it several times).
>>>
>>> During boot the overhead is lower on this machine (SKL-X), during
>>> runtime it's way higher and more noisy.
>>>
>>> The noise can be pretty much eliminated by running the sync_overhead
>>> measurement multiple times and building the average.
>>>
>>> The reason why it is higher is that after offlining the socket the CPU
>>> comes back up with a frequency of 700Mhz while during boot it runs with
>>> 2100Mhz.
>>>
>>> Sync overhead: 118
>>> Sync overhead: 51 A: 22466 M: 22448 F: 2101683
>> One explanation of the sync overhead difference (118 vs 51) here is
>> whether the lock cacheline is local or remote. My analysis of the
>> interaction between check_tsc_sync_source() and check_tsc_sync_target()
>> is that the real overhead comes from locking with a remote cacheline
>> (local to the source, remote to the target). When you do a loop of 256
>> lock operations, it is all on a local cacheline. That is why the
>> overhead is lower. It also depends on whether the remote cacheline is
>> in the same socket or a different socket.
> Yes. It's clear that the initial sync overhead is due to the cache line
> being remote, but I'd rather underestimate the compensation. Aside from
> that, it's not guaranteed that the cache line is actually remote on the
> first access. It's by chance, not by design.

In check_tsc_warp(), the unlikely(prev > now) check may only be
triggered to record a possible warp if last_tsc was previously written
to by another CPU. That requires the transfer of the lock cacheline from
the remote CPU to the local CPU as well. So the sync overhead with a
remote cacheline is what really matters here. I had actually thought
about just measuring the local cacheline sync overhead so as to
underestimate it, and I am fine with doing that.

>>> Sync overhead: 178
>>> Sync overhead: 152 A: 22477 M: 67380 F: 700529
>>>
>>> Sync overhead: 212
>>> Sync overhead: 152 A: 22475 M: 67380 F: 700467
>>>
>>> Sync overhead: 153
>>> Sync overhead: 152 A: 22497 M: 67452 F: 700404
>>>
>>> Can you try the patch below and check whether the overhead stabilizes
>>> across several attempts on that CooperLake machine and whether the
>>> frequency is always the same or varies?
>> Yes, I will try that experiment and report back the results.
>>> Independent of the outcome on that, I think we have to take the actual CPU
>>> frequency into account for calculating the overhead.
>> Assuming that the clock frequency remains the same during the
>> check_tsc_warp() loop and the sync overhead computation time, I don't
>> think the actual clock frequency matters much. However, it will be a
>> different matter if the frequency does change. In this case, it is more
>> likely the frequency will go up than down. Right? IOW, we may
>> underestimate the sync overhead in this case. I think it is better than
>> overestimating it.
> The question is not whether the clock frequency changes during the loop.
> The point is:
>
> start = rdtsc();
> do_stuff();
> end = rdtsc();
> compensation = end - start;
>
> do_stuff() executes a constant number of instructions which are executed
> in a constant number of CPU clock cycles, let's say 100 for simplicity.
> TSC runs with 2000MHz.
>
> With a CPU frequency of 1000 MHz the real computation time is:
>
> 100/1000MHz = 100 nsec = 200 TSC cycles
>
> while with a CPU frequency of 2000MHz it is obviously:
>
> 100/2000MHz = 50 nsec = 100 TSC cycles
>
> IOW, TSC runs with a constant frequency independent of the actual CPU
> frequency, ergo the CPU frequency dependent execution time has an
> influence on the resulting compensation value, no?
>
> On the machine I tested on, it's a factor of 3 between the minimal and
> the maximal CPU frequency, which makes quite a difference, right?

Yes, I understand that. The measurement of sync_overhead is for
estimating the delay (in TSC cycles) that the locking overhead
introduces. At a 1000MHz frequency, the delay in TSC cycles will be
double that of a CPU running at 2000MHz, so you need more compensation
in that case. That is why I said that as long as the clock frequency
doesn't change in the check_tsc_warp() loop and the sync_overhead
measurement part of the code, the actual CPU frequency does not matter
here.

How about we halve the measured sync_overhead as compensation to avoid
over-estimation, though that probably increases the chance that we need
a second adjustment of the TSC warp.

Cheers,
Longman

2022-04-28 08:34:06

by Thomas Gleixner

Subject: Re: [PATCH 2/2] x86/tsc_sync: Add synchronization overhead to tsc adjustment

On Tue, Apr 26 2022 at 11:36, Waiman Long wrote:
> On 4/25/22 15:24, Thomas Gleixner wrote:
>> Yes. It's clear that the initial sync overhead is due to the cache line
>> being remote, but I'd rather underestimate the compensation. Aside from
>> that, it's not guaranteed that the cache line is actually remote on the
>> first access. It's by chance, not by design.
>
> In check_tsc_warp(), the unlikely(prev > now) check may only be
> triggered to record a possible warp if last_tsc was previously written
> to by another CPU. That requires the transfer of the lock cacheline from
> the remote CPU to the local CPU as well. So the sync overhead with a
> remote cacheline is what really matters here. I had actually thought
> about just measuring the local cacheline sync overhead so as to
> underestimate it, and I am fine with doing that.

Fair enough, but what I meant is that when estimating the actual sync
overhead, there is no guarantee that the cache line is remote.

The CPU which does that estimation might have been the last to lock,
there is no guarantee that the reference CPU locked last or wrote to the
cache line last.

>> IOW, TSC runs with a constant frequency independent of the actual CPU
>> frequency, ergo the CPU frequency dependent execution time has an
>> influence on the resulting compensation value, no?
>>
>> On the machine I tested on, it's a factor of 3 between the minimal and
>> the maximal CPU frequency, which makes quite a difference, right?
>
> Yes, I understand that. The measurement of sync_overhead is for
> estimating the delay (in TSC cycles) that the locking overhead
> introduces. With 1000MHz frequency, the delay in TSC cycle will be
> double that of a cpu running at 2000MHz. So you need more compensation
> in this case. That is why I said that as long as clock frequency doesn't
> change in the check_tsc_warp() loop and the sync_overhead measurement
> part of the code, the actual cpu frequency does not matter here.

I grant you that it does not matter for the loop under the assumption
that the loop runs at constant frequency, but is that a guarantee that
it does not matter later on?

If you overcompensate by a factor of 3 because the upcoming CPU ran at
the lowest frequency, then it might become visible later when everything
runs at full speed.

> How about we halve the measured sync_overhead as compensation to avoid
> over-estimation, though that probably increases the chance that we need
> a second adjustment of the TSC warp.

Half of what?

> With this patch applied, the measured overhead on the same CooperLake
> system on different reboot runs varies from 104 to 326.

Half of something which jumps around? Not convinced. :)

Btw:
>> Yes, I will try that experiment and report back the results.

Could you please do that? I would really like to see the data points.

It's so sad that we still have to deal with this nonsense just because
HW people don't give a damn.

Thanks,

tglx

2022-05-01 20:28:08

by Waiman Long

Subject: Re: [PATCH 2/2] x86/tsc_sync: Add synchronization overhead to tsc adjustment

On 4/27/22 18:38, Thomas Gleixner wrote:
> On Tue, Apr 26 2022 at 11:36, Waiman Long wrote:
>> On 4/25/22 15:24, Thomas Gleixner wrote:
>>> Yes. It's clear that the initial sync overhead is due to the cache line
>>> being remote, but I'd rather underestimate the compensation. Aside from
>>> that, it's not guaranteed that the cache line is actually remote on the
>>> first access. It's by chance, not by design.
>> In check_tsc_warp(), the unlikely(prev > now) check may only be
>> triggered to record a possible warp if last_tsc was previously written
>> to by another CPU. That requires the transfer of the lock cacheline from
>> the remote CPU to the local CPU as well. So the sync overhead with a
>> remote cacheline is what really matters here. I had actually thought
>> about just measuring the local cacheline sync overhead so as to
>> underestimate it, and I am fine with doing that.
> Fair enough, but what I meant is that when estimating the actual sync
> overhead, there is no guarantee that the cache line is remote.
>
> The CPU which does that estimation might have been the last to lock,
> there is no guarantee that the reference CPU locked last or wrote to the
> cache line last.
>
>>> IOW, TSC runs with a constant frequency independent of the actual CPU
>>> frequency, ergo the CPU frequency dependent execution time has an
>>> influence on the resulting compensation value, no?
>>>
>>> On the machine I tested on, it's a factor of 3 between the minimal and
>>> the maximal CPU frequency, which makes quite a difference, right?
>> Yes, I understand that. The measurement of sync_overhead is for
>> estimating the delay (in TSC cycles) that the locking overhead
>> introduces. With 1000MHz frequency, the delay in TSC cycle will be
>> double that of a cpu running at 2000MHz. So you need more compensation
>> in this case. That is why I said that as long as clock frequency doesn't
>> change in the check_tsc_warp() loop and the sync_overhead measurement
>> part of the code, the actual cpu frequency does not matter here.
> I grant you that it does not matter for the loop under the assumption
> that the loop runs at constant frequency, but is that a guarantee that
> it does not matter later on?
Yes, that is my point: frequency doesn't matter if the frequency remains
the same. Of course, all bets are off if the frequency really changes.
>
> If you overcompensate by a factor of 3 because the upcoming CPU ran at
> the lowest frequency, then it might become visible later when everything
> runs at full speed.
I don't think the overhead will be directly proportional to the CPU
frequency. A 3X increase in frequency will certainly lower the overhead,
but it won't drop to 1/3. Maybe 1/2 at most.
>
>> How about we halve the measured sync_overhead as compensation to avoid
>> over-estimation, though that probably increases the chance that we need
>> a second adjustment of the TSC warp.
> Half of what?

What I mean is

@@ -533,7 +551,7 @@ void check_tsc_sync_target(void)
         * really depends on CPU, node distance and frequency. Add the
         * estimated sync overhead to the adjustment value.
         */
-       cur->adjusted += cur_max_warp + sync_overhead;
+       cur->adjusted += cur_max_warp + sync_overhead/2;

        pr_warn("TSC ADJUST compensate: CPU%u observed %lld warp (overhead %lld). Adjust: %lld\n",
                cpu, cur_max_warp, sync_overhead, cur->adjusted);

>> With this patch applied, the measured overhead on the same CooperLake
>> system on different reboot runs varies from 104 to 326.
> Half of something which jumps around? Not convinced. :)
>
> Btw:
>>> Yes, I will try that experiment and report back the results.
> Could you please do that? I really like to see the data points.

I have applied your patch with some modification and below was the
relevant part of the boot up log on a Cooperlake system with this TSC
sync problem.

[    0.008858] smpboot: CPU 36 Converting physical 0 to logical die 2
[    0.008858] Sync overhead: 230
[    0.008858] Sync overhead: 547 A: 149597 M: 149596 F: 2500016
[    0.008858] TSC ADJUST compensate: CPU36 observed 76342 warp (overhead 230). Adjust: 76457
[    0.008858] smpboot: CPU 54 Converting physical 0 to logical die 3
[    0.008858] Sync overhead: 178
[    0.008858] Sync overhead: 658 A: 177970 M: 177968 F: 2500028
[    0.008858] TSC ADJUST compensate: CPU54 observed 76568 warp (overhead 178). Adjust: 76657

BTW, CPUs 36 and 54 are the first CPUs of sockets 2 and 3, respectively.
It is always these CPUs that need TSC adjustment.

I have no idea why the sync overhead actually increased in the subsequent
measurement, as the frequency was set at 2.5GHz, I think.

I tried to offline all the CPUs in a socket and then online the first
CPU as suggested. However, I was not able to cause the tsc_sync loop to run.

BTW, without my patch the same system will boot up with the following log:

[    0.008823] TSC ADJUST compensate: CPU36 observed 86036 warp. Adjust: 86036
[    0.008823] TSC ADJUST compensate: CPU36 observed 122 warp. Adjust: 86158
[    0.923620] Measured 2 cycles TSC warp between CPUs, turning off TSC clock.
[    0.923620] tsc: Marking TSC unstable due to check_tsc_sync_source failed

How about we add the full sync_overhead at the bootup stage, but then only
half the sync_overhead after boot in case the tsc_sync loop is run again
for newly onlined CPUs? Or any other scheme that you think is appropriate.

Cheers,
Longman

2022-05-03 00:55:39

by Thomas Gleixner

[permalink] [raw]
Subject: Re: [PATCH 2/2] x86/tsc_sync: Add synchronization overhead to tsc adjustment

On Fri, Apr 29 2022 at 13:41, Waiman Long wrote:
> On 4/27/22 18:38, Thomas Gleixner wrote:
>> I grant you that it does not matter for the loop under the assumption
>> that the loop runs at constant frequency, but is that a guarantee that
>> it does not matter later on?
> Yes, that is my point that frequency doesn't matter if the frequency remains
> the same. Of course, all bets are off if the frequency really changes.

Frequency changes after boot.

>> If you overcompensate by a factor of 3 because the upcoming CPU ran at
>> the lowest frequency, then it might become visible later when everything
>> runs at full speed.
> I don't think the overhead will be directly proportional to the cpu
> frequency. A 3X increase in frequency will certainly cause the overhead
> to be lowered, but it won't be 1/3. Maybe 1/2 at most.

It's very close to proportional according to my experiments.

>>> How about we halve the measured sync_overhead as compensation to avoid
>>> over-estimation, though that probably increases the chance that we need a
>>> second adjustment of the TSC warp.
>> Half of what?
>
> What I mean is
>
> @@ -533,7 +551,7 @@ void check_tsc_sync_target(void)
>          * really depends on CPU, node distance and frequency. Add the
>          * estimated sync overhead to the adjustment value.
>          */
> -       cur->adjusted += cur_max_warp + sync_overhead;
> +       cur->adjusted += cur_max_warp + sync_overhead/2;

The point is? Make it magically half while it still jumps around:

>>> With this patch applied, the measured overhead on the same CooperLake
>>> system on different reboot runs varies from 104 to 326.
>> Half of something which jumps around? Not convinced. :)

...

> I have applied your patch with some modification and below was the
> relevant part of the boot up log on a Cooperlake system with this TSC
> sync problem.
>
> [    0.008858] smpboot: CPU 36 Converting physical 0 to logical die 2
> [    0.008858] Sync overhead: 230
> [    0.008858] Sync overhead: 547 A: 149597 M: 149596 F: 2500016
> [    0.008858] TSC ADJUST compensate: CPU36 observed 76342 warp
> (overhead 230). Adjust: 76457
> [    0.008858] smpboot: CPU 54 Converting physical 0 to logical die 3
> [    0.008858] Sync overhead: 178
> [    0.008858] Sync overhead: 658 A: 177970 M: 177968 F: 2500028
> [    0.008858] TSC ADJUST compensate: CPU54 observed 76568 warp
> (overhead 178). Adjust: 76657
>
> BTW, CPUs 36 and 54 are the first CPU of socket 2 and 3 respectively. It
> is always these CPUs that need TSC adjustment.
>
> I have no idea why the sync overhead actually increases in the subsequent
> measurement, as the frequency was set at 2.5GHz, I think.

See, that's the thing I'm worried about. An allegedly remote lock takes
178 cycles, but a loop of cache-local lock operations takes on average
658 cycles.

That does not make sense and I haven't seen that behaviour on any of my
machines I ran that on. The average was always less than the initial
measurement.

> I tried to offline all the CPUs in a socket and then online the first
> CPU as suggested. However, I was not able to cause the tsc_sync loop
> to run.

It runs for sure when the first CPU on a socket goes online except when
TSC is already disabled. It does not matter whether it was online before
or not.

> BTW, without my patch the same system will boot up with the following
> log:

We know that by now.

> How about we add the full sync_overhead at the bootup stage, but then only
> half the sync_overhead after boot in case the tsc_sync loop is run again
> for newly onlined CPUs? Or any other scheme that you think is appropriate.

Can you please stop grasping for straws?

I'm as much interested as you to get this solved, but I'm not interested
in 'works for me' and 'we don't know why it works' solutions simply
because they end up on my desk sooner than later again.

Thanks

tglx