2024-01-16 08:42:00

by Luo Gengkun

Subject: [PATCH] perf/core: Fix small negative period being ignored

In perf_adjust_period, we will first calculate period, and then use
this period to calculate delta. However, when delta is less than 0,
there will be a deviation compared to when delta is greater than or
equal to 0. For example, when delta is in the range of [-14,-1], the
range of delta = delta + 7 is between [-7,6], so the final value of
delta/8 is 0. Therefore, the impact of -1 and -2 will be ignored.
This is unacceptable when the target period is very short, because
we will lose a lot of samples.
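
For illustration only, here is a minimal user-space sketch of the two
filters (not part of the patch; the lpf_old/lpf_new helpers are made up
for this example):

#include <stdio.h>

/* old filter: every delta in [-14, -1] is rounded up to 0 */
static long long lpf_old(long long delta)
{
	return (delta + 7) / 8;		/* C division truncates toward zero */
}

/* new filter: rounds away from zero, so -1..-8 become -1 */
static long long lpf_new(long long delta)
{
	if (delta >= 0)
		delta += 7;
	else
		delta -= 7;
	return delta / 8;
}

int main(void)
{
	for (long long d = -16; d <= 16; d += 4)
		printf("delta=%4lld old=%3lld new=%3lld\n",
		       d, lpf_old(d), lpf_new(d));
	return 0;
}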

Here are some tests and analyses:
before:
# perf record -e cs -F 1000 ./a.out
[ perf record: Woken up 1 times to write data ]
[ perf record: Captured and wrote 0.022 MB perf.data (518 samples) ]

# perf script
...
a.out 396 257.956048: 23 cs: ffffffff81f4eeec schedul>
a.out 396 257.957891: 23 cs: ffffffff81f4eeec schedul>
a.out 396 257.959730: 23 cs: ffffffff81f4eeec schedul>
a.out 396 257.961545: 23 cs: ffffffff81f4eeec schedul>
a.out 396 257.963355: 23 cs: ffffffff81f4eeec schedul>
a.out 396 257.965163: 23 cs: ffffffff81f4eeec schedul>
a.out 396 257.966973: 23 cs: ffffffff81f4eeec schedul>
a.out 396 257.968785: 23 cs: ffffffff81f4eeec schedul>
a.out 396 257.970593: 23 cs: ffffffff81f4eeec schedul>
...

after:
# perf record -e cs -F 1000 ./a.out
[ perf record: Woken up 1 times to write data ]
[ perf record: Captured and wrote 0.058 MB perf.data (1466 samples) ]

# perf script
...
a.out 395 59.338813: 11 cs: ffffffff81f4eeec schedul>
a.out 395 59.339707: 12 cs: ffffffff81f4eeec schedul>
a.out 395 59.340682: 13 cs: ffffffff81f4eeec schedul>
a.out 395 59.341751: 13 cs: ffffffff81f4eeec schedul>
a.out 395 59.342799: 12 cs: ffffffff81f4eeec schedul>
a.out 395 59.343765: 11 cs: ffffffff81f4eeec schedul>
a.out 395 59.344651: 11 cs: ffffffff81f4eeec schedul>
a.out 395 59.345539: 12 cs: ffffffff81f4eeec schedul>
a.out 395 59.346502: 13 cs: ffffffff81f4eeec schedul>
...

test.c

#include <unistd.h>

int main() {
	for (int i = 0; i < 20000; i++)
		usleep(10);

	return 0;
}

# time ./a.out
real 0m1.583s
user 0m0.040s
sys 0m0.298s

The above results were tested on x86-64 qemu with KVM enabled, using
test.c as the test program. Ideally, we should have around 1500 samples,
but the previous algorithm had only about 500, whereas the modified
algorithm now has about 1400. Furthermore, the new version shows 1
sample per 0.001s, while the previous one is 1 sample per 0.002s. This
indicates that the new algorithm is more sensitive to small negative
values compared to the old algorithm.

Fixes: bd2b5b12849a ("perf_counter: More aggressive frequency adjustment")
Signed-off-by: Luo Gengkun <[email protected]>
---
kernel/events/core.c | 6 +++++-
1 file changed, 5 insertions(+), 1 deletion(-)

diff --git a/kernel/events/core.c b/kernel/events/core.c
index 683dc086ef10..cad50d3439f1 100644
--- a/kernel/events/core.c
+++ b/kernel/events/core.c
@@ -4078,7 +4078,11 @@ static void perf_adjust_period(struct perf_event *event, u64 nsec, u64 count, bo
period = perf_calculate_period(event, nsec, count);

delta = (s64)(period - hwc->sample_period);
- delta = (delta + 7) / 8; /* low pass filter */
+ if (delta >= 0)
+ delta += 7;
+ else
+ delta -= 7;
+ delta /= 8; /* low pass filter */

sample_period = hwc->sample_period + delta;

--
2.34.1



2024-01-19 07:36:26

by Adrian Hunter

Subject: Re: [PATCH] perf/core: Fix small negative period being ignored

On 16/01/24 10:39, Luo Gengkun wrote:
> In perf_adjust_period, we will first calculate period, and then use
> this period to calculate delta. However, when delta is less than 0,
> there will be a deviation compared to when delta is greater than or
> equal to 0. For example, when delta is in the range of [-14,-1], the
> range of delta = delta + 7 is between [-7,6], so the final value of
> delta/8 is 0. Therefore, the impact of -1 and -2 will be ignored.
> This is unacceptable when the target period is very short, because
> we will lose a lot of samples.
>
> Here are some tests and analyses:
> before:
> # perf record -e cs -F 1000 ./a.out
> [ perf record: Woken up 1 times to write data ]
> [ perf record: Captured and wrote 0.022 MB perf.data (518 samples) ]
>
> # perf script
> ...
> a.out 396 257.956048: 23 cs: ffffffff81f4eeec schedul>
> a.out 396 257.957891: 23 cs: ffffffff81f4eeec schedul>
> a.out 396 257.959730: 23 cs: ffffffff81f4eeec schedul>
> a.out 396 257.961545: 23 cs: ffffffff81f4eeec schedul>
> a.out 396 257.963355: 23 cs: ffffffff81f4eeec schedul>
> a.out 396 257.965163: 23 cs: ffffffff81f4eeec schedul>
> a.out 396 257.966973: 23 cs: ffffffff81f4eeec schedul>
> a.out 396 257.968785: 23 cs: ffffffff81f4eeec schedul>
> a.out 396 257.970593: 23 cs: ffffffff81f4eeec schedul>
> ...
>
> after:
> # perf record -e cs -F 1000 ./a.out
> [ perf record: Woken up 1 times to write data ]
> [ perf record: Captured and wrote 0.058 MB perf.data (1466 samples) ]
>
> # perf script
> ...
> a.out 395 59.338813: 11 cs: ffffffff81f4eeec schedul>
> a.out 395 59.339707: 12 cs: ffffffff81f4eeec schedul>
> a.out 395 59.340682: 13 cs: ffffffff81f4eeec schedul>
> a.out 395 59.341751: 13 cs: ffffffff81f4eeec schedul>
> a.out 395 59.342799: 12 cs: ffffffff81f4eeec schedul>
> a.out 395 59.343765: 11 cs: ffffffff81f4eeec schedul>
> a.out 395 59.344651: 11 cs: ffffffff81f4eeec schedul>
> a.out 395 59.345539: 12 cs: ffffffff81f4eeec schedul>
> a.out 395 59.346502: 13 cs: ffffffff81f4eeec schedul>
> ...
>
> test.c
>
> int main() {
> for (int i = 0; i < 20000; i++)
> usleep(10);
>
> return 0;
> }
>
> # time ./a.out
> real 0m1.583s
> user 0m0.040s
> sys 0m0.298s
>
> The above results were tested on x86-64 qemu with KVM enabled, using
> test.c as the test program. Ideally, we should have around 1500 samples,
> but the previous algorithm had only about 500, whereas the modified
> algorithm now has about 1400. Furthermore, the new version shows 1
> sample per 0.001s, while the previous one is 1 sample per 0.002s. This
> indicates that the new algorithm is more sensitive to small negative
> values compared to the old algorithm.
>
> Fixes: bd2b5b12849a ("perf_counter: More aggressive frequency adjustment")
> Signed-off-by: Luo Gengkun <[email protected]>

It seems better, and the maths makes sense, so:

Reviewed-by: Adrian Hunter <[email protected]>


But the test case still seems to give unexpected results. Usually:

# time taskset --cpu 1 ./test
real 0m 1.25s
user 0m 0.03s
sys 0m 0.00
# taskset --cpu 0 perf record -F 1000 -e cs -- taskset --cpu 1 ./test
[ perf record: Woken up 1 times to write data ]
[ perf record: Captured and wrote 0.051 MB perf.data (1290 samples) ]

But occasionally:

# taskset --cpu 0 perf record -F 1000 -e cs -- taskset --cpu 1 ./test
[ perf record: Woken up 1 times to write data ]
[ perf record: Captured and wrote 0.010 MB perf.data (204 samples) ]
# perf script
...
test 865 265.377846: 16 cs: ffffffff832e927b schedule+0x2b
test 865 265.378900: 15 cs: ffffffff832e927b schedule+0x2b
test 865 265.379845: 14 cs: ffffffff832e927b schedule+0x2b
test 865 265.380770: 14 cs: ffffffff832e927b schedule+0x2b
test 865 265.381647: 15 cs: ffffffff832e927b schedule+0x2b
test 865 265.382638: 16 cs: ffffffff832e927b schedule+0x2b
test 865 265.383647: 16 cs: ffffffff832e927b schedule+0x2b
test 865 265.384704: 15 cs: ffffffff832e927b schedule+0x2b
test 865 265.385649: 14 cs: ffffffff832e927b schedule+0x2b
test 865 265.386578: 152 cs: ffffffff832e927b schedule+0x2b
test 865 265.396383: 154 cs: ffffffff832e927b schedule+0x2b
test 865 265.406183: 154 cs: ffffffff832e927b schedule+0x2b
test 865 265.415839: 154 cs: ffffffff832e927b schedule+0x2b
test 865 265.425445: 154 cs: ffffffff832e927b schedule+0x2b
test 865 265.435052: 154 cs: ffffffff832e927b schedule+0x2b
test 865 265.444708: 154 cs: ffffffff832e927b schedule+0x2b
test 865 265.454314: 154 cs: ffffffff832e927b schedule+0x2b
test 865 265.463970: 154 cs: ffffffff832e927b schedule+0x2b
test 865 265.473577: 154 cs: ffffffff832e927b schedule+0x2b
...



> ---
> kernel/events/core.c | 6 +++++-
> 1 file changed, 5 insertions(+), 1 deletion(-)
>
> diff --git a/kernel/events/core.c b/kernel/events/core.c
> index 683dc086ef10..cad50d3439f1 100644
> --- a/kernel/events/core.c
> +++ b/kernel/events/core.c
> @@ -4078,7 +4078,11 @@ static void perf_adjust_period(struct perf_event *event, u64 nsec, u64 count, bo
> period = perf_calculate_period(event, nsec, count);
>
> delta = (s64)(period - hwc->sample_period);
> - delta = (delta + 7) / 8; /* low pass filter */
> + if (delta >= 0)
> + delta += 7;
> + else
> + delta -= 7;
> + delta /= 8; /* low pass filter */
>
> sample_period = hwc->sample_period + delta;
>


2024-01-20 07:22:11

by Luo Gengkun

Subject: Re: [PATCH] perf/core: Fix small negative period being ignored



On 2024/1/19 15:36, Adrian Hunter wrote:
> On 16/01/24 10:39, Luo Gengkun wrote:
>> In perf_adjust_period, we will first calculate period, and then use
>> this period to calculate delta. However, when delta is less than 0,
>> there will be a deviation compared to when delta is greater than or
>> equal to 0. For example, when delta is in the range of [-14,-1], the
>> range of delta = delta + 7 is between [-7,6], so the final value of
>> delta/8 is 0. Therefore, the impact of -1 and -2 will be ignored.
>> This is unacceptable when the target period is very short, because
>> we will lose a lot of samples.
>>
>> Here are some tests and analyses:
>> before:
>> # perf record -e cs -F 1000 ./a.out
>> [ perf record: Woken up 1 times to write data ]
>> [ perf record: Captured and wrote 0.022 MB perf.data (518 samples) ]
>>
>> # perf script
>> ...
>> a.out 396 257.956048: 23 cs: ffffffff81f4eeec schedul>
>> a.out 396 257.957891: 23 cs: ffffffff81f4eeec schedul>
>> a.out 396 257.959730: 23 cs: ffffffff81f4eeec schedul>
>> a.out 396 257.961545: 23 cs: ffffffff81f4eeec schedul>
>> a.out 396 257.963355: 23 cs: ffffffff81f4eeec schedul>
>> a.out 396 257.965163: 23 cs: ffffffff81f4eeec schedul>
>> a.out 396 257.966973: 23 cs: ffffffff81f4eeec schedul>
>> a.out 396 257.968785: 23 cs: ffffffff81f4eeec schedul>
>> a.out 396 257.970593: 23 cs: ffffffff81f4eeec schedul>
>> ...
>>
>> after:
>> # perf record -e cs -F 1000 ./a.out
>> [ perf record: Woken up 1 times to write data ]
>> [ perf record: Captured and wrote 0.058 MB perf.data (1466 samples) ]
>>
>> # perf script
>> ...
>> a.out 395 59.338813: 11 cs: ffffffff81f4eeec schedul>
>> a.out 395 59.339707: 12 cs: ffffffff81f4eeec schedul>
>> a.out 395 59.340682: 13 cs: ffffffff81f4eeec schedul>
>> a.out 395 59.341751: 13 cs: ffffffff81f4eeec schedul>
>> a.out 395 59.342799: 12 cs: ffffffff81f4eeec schedul>
>> a.out 395 59.343765: 11 cs: ffffffff81f4eeec schedul>
>> a.out 395 59.344651: 11 cs: ffffffff81f4eeec schedul>
>> a.out 395 59.345539: 12 cs: ffffffff81f4eeec schedul>
>> a.out 395 59.346502: 13 cs: ffffffff81f4eeec schedul>
>> ...
>>
>> test.c
>>
>> int main() {
>> for (int i = 0; i < 20000; i++)
>> usleep(10);
>>
>> return 0;
>> }
>>
>> # time ./a.out
>> real 0m1.583s
>> user 0m0.040s
>> sys 0m0.298s
>>
>> The above results were tested on x86-64 qemu with KVM enabled, using
>> test.c as the test program. Ideally, we should have around 1500 samples,
>> but the previous algorithm had only about 500, whereas the modified
>> algorithm now has about 1400. Furthermore, the new version shows 1
>> sample per 0.001s, while the previous one is 1 sample per 0.002s. This
>> indicates that the new algorithm is more sensitive to small negative
>> values compared to the old algorithm.
>>
>> Fixes: bd2b5b12849a ("perf_counter: More aggressive frequency adjustment")
>> Signed-off-by: Luo Gengkun <[email protected]>
>
> It seems better, and the maths makes sense, so:
>
> Reviewed-by: Adrian Hunter <[email protected]>
>
>
> But the test case still seems to give unexpected results. Usually:
>
> # time taskset --cpu 1 ./test
> real 0m 1.25s
> user 0m 0.03s
> sys 0m 0.00
> # taskset --cpu 0 perf record -F 1000 -e cs -- taskset --cpu 1 ./test
> [ perf record: Woken up 1 times to write data ]
> [ perf record: Captured and wrote 0.051 MB perf.data (1290 samples) ]
>
> But occasionally:
>
> # taskset --cpu 0 perf record -F 1000 -e cs -- taskset --cpu 1 ./test
> [ perf record: Woken up 1 times to write data ]
> [ perf record: Captured and wrote 0.010 MB perf.data (204 samples) ]
> # perf script
> ...
> test 865 265.377846: 16 cs: ffffffff832e927b schedule+0x2b
> test 865 265.378900: 15 cs: ffffffff832e927b schedule+0x2b
> test 865 265.379845: 14 cs: ffffffff832e927b schedule+0x2b
> test 865 265.380770: 14 cs: ffffffff832e927b schedule+0x2b
> test 865 265.381647: 15 cs: ffffffff832e927b schedule+0x2b
> test 865 265.382638: 16 cs: ffffffff832e927b schedule+0x2b
> test 865 265.383647: 16 cs: ffffffff832e927b schedule+0x2b
> test 865 265.384704: 15 cs: ffffffff832e927b schedule+0x2b
> test 865 265.385649: 14 cs: ffffffff832e927b schedule+0x2b
> test 865 265.386578: 152 cs: ffffffff832e927b schedule+0x2b
> test 865 265.396383: 154 cs: ffffffff832e927b schedule+0x2b
> test 865 265.406183: 154 cs: ffffffff832e927b schedule+0x2b
> test 865 265.415839: 154 cs: ffffffff832e927b schedule+0x2b
> test 865 265.425445: 154 cs: ffffffff832e927b schedule+0x2b
> test 865 265.435052: 154 cs: ffffffff832e927b schedule+0x2b
> test 865 265.444708: 154 cs: ffffffff832e927b schedule+0x2b
> test 865 265.454314: 154 cs: ffffffff832e927b schedule+0x2b
> test 865 265.463970: 154 cs: ffffffff832e927b schedule+0x2b
> test 865 265.473577: 154 cs: ffffffff832e927b schedule+0x2b
> ...
>
>
>
It seems that the unexpected results are caused by timer interrupts not
arriving every TICK_NSEC.

I guess this is due to system idleness. I tried the patch and it should
have fixed the issue.

You can give it a try as well.

diff --git a/include/linux/perf_event.h b/include/linux/perf_event.h
index afb028c54f33..2708f1d0692c 100644
--- a/include/linux/perf_event.h
+++ b/include/linux/perf_event.h
@@ -265,6 +265,7 @@ struct hw_perf_event {
 	 * State for freq target events, see __perf_event_overflow() and
 	 * perf_adjust_freq_unthr_context().
 	 */
+	u64				freq_tick_stamp;
 	u64				freq_time_stamp;
 	u64				freq_count_stamp;
 #endif
diff --git a/kernel/events/core.c b/kernel/events/core.c
index cad50d3439f1..fe0d9b470365 100644
--- a/kernel/events/core.c
+++ b/kernel/events/core.c
@@ -4112,7 +4112,7 @@ perf_adjust_freq_unthr_context(struct perf_event_context *ctx, bool unthrottle)
 {
 	struct perf_event *event;
 	struct hw_perf_event *hwc;
-	u64 now, period = TICK_NSEC;
+	u64 now, period, tick_stamp;
 	s64 delta;

 	/*
@@ -4151,6 +4151,10 @@ perf_adjust_freq_unthr_context(struct perf_event_context *ctx, bool unthrottle)
 		 */
 		event->pmu->stop(event, PERF_EF_UPDATE);

+		tick_stamp = perf_clock();
+		period = tick_stamp - hwc->freq_tick_stamp;
+		hwc->freq_tick_stamp = tick_stamp;
+
 		now = local64_read(&event->count);
 		delta = now - hwc->freq_count_stamp;
 		hwc->freq_count_stamp = now;
@@ -4162,8 +4166,14 @@ perf_adjust_freq_unthr_context(struct perf_event_context *ctx, bool unthrottle)
 		 * to perf_adjust_period() to avoid stopping it
 		 * twice.
 		 */
-		if (delta > 0)
-			perf_adjust_period(event, period, delta, false);
+		if (delta > 0) {
+			/*
+			 * we skip first tick adjust period
+			 */
+			if (likely(period != tick_stamp)) {
+				perf_adjust_period(event, period, delta, false);
+			}
+		}

 		event->pmu->start(event, delta > 0 ? PERF_EF_RELOAD : 0);
 	next:

>> ---
>> kernel/events/core.c | 6 +++++-
>> 1 file changed, 5 insertions(+), 1 deletion(-)
>>
>> diff --git a/kernel/events/core.c b/kernel/events/core.c
>> index 683dc086ef10..cad50d3439f1 100644
>> --- a/kernel/events/core.c
>> +++ b/kernel/events/core.c
>> @@ -4078,7 +4078,11 @@ static void perf_adjust_period(struct perf_event *event, u64 nsec, u64 count, bo
>> period = perf_calculate_period(event, nsec, count);
>>
>> delta = (s64)(period - hwc->sample_period);
>> - delta = (delta + 7) / 8; /* low pass filter */
>> + if (delta >= 0)
>> + delta += 7;
>> + else
>> + delta -= 7;
>> + delta /= 8; /* low pass filter */
>>
>> sample_period = hwc->sample_period + delta;
>>
>

2024-02-05 14:22:08

by Adrian Hunter

Subject: Re: [PATCH] perf/core: Fix small negative period being ignored

On 20/01/24 09:21, Luo Gengkun wrote:
>
>
> On 2024/1/19 15:36, Adrian Hunter wrote:
>> On 16/01/24 10:39, Luo Gengkun wrote:
>>> In perf_adjust_period, we will first calculate period, and then use
>>> this period to calculate delta. However, when delta is less than 0,
>>> there will be a deviation compared to when delta is greater than or
>>> equal to 0. For example, when delta is in the range of [-14,-1], the
>>> range of delta = delta + 7 is between [-7,6], so the final value of
>>> delta/8 is 0. Therefore, the impact of -1 and -2 will be ignored.
>>> This is unacceptable when the target period is very short, because
>>> we will lose a lot of samples.
>>>
>>> Here are some tests and analyses:
>>> before:
>>>    # perf record -e cs -F 1000  ./a.out
>>>    [ perf record: Woken up 1 times to write data ]
>>>    [ perf record: Captured and wrote 0.022 MB perf.data (518 samples) ]
>>>
>>>    # perf script
>>>    ...
>>>    a.out     396   257.956048:         23 cs:  ffffffff81f4eeec schedul>
>>>    a.out     396   257.957891:         23 cs:  ffffffff81f4eeec schedul>
>>>    a.out     396   257.959730:         23 cs:  ffffffff81f4eeec schedul>
>>>    a.out     396   257.961545:         23 cs:  ffffffff81f4eeec schedul>
>>>    a.out     396   257.963355:         23 cs:  ffffffff81f4eeec schedul>
>>>    a.out     396   257.965163:         23 cs:  ffffffff81f4eeec schedul>
>>>    a.out     396   257.966973:         23 cs:  ffffffff81f4eeec schedul>
>>>    a.out     396   257.968785:         23 cs:  ffffffff81f4eeec schedul>
>>>    a.out     396   257.970593:         23 cs:  ffffffff81f4eeec schedul>
>>>    ...
>>>
>>> after:
>>>    # perf record -e cs -F 1000  ./a.out
>>>    [ perf record: Woken up 1 times to write data ]
>>>    [ perf record: Captured and wrote 0.058 MB perf.data (1466 samples) ]
>>>
>>>    # perf script
>>>    ...
>>>    a.out     395    59.338813:         11 cs:  ffffffff81f4eeec schedul>
>>>    a.out     395    59.339707:         12 cs:  ffffffff81f4eeec schedul>
>>>    a.out     395    59.340682:         13 cs:  ffffffff81f4eeec schedul>
>>>    a.out     395    59.341751:         13 cs:  ffffffff81f4eeec schedul>
>>>    a.out     395    59.342799:         12 cs:  ffffffff81f4eeec schedul>
>>>    a.out     395    59.343765:         11 cs:  ffffffff81f4eeec schedul>
>>>    a.out     395    59.344651:         11 cs:  ffffffff81f4eeec schedul>
>>>    a.out     395    59.345539:         12 cs:  ffffffff81f4eeec schedul>
>>>    a.out     395    59.346502:         13 cs:  ffffffff81f4eeec schedul>
>>>    ...
>>>
>>> test.c
>>>
>>> int main() {
>>>          for (int i = 0; i < 20000; i++)
>>>                  usleep(10);
>>>
>>>          return 0;
>>> }
>>>
>>>    # time ./a.out
>>>    real    0m1.583s
>>>    user    0m0.040s
>>>    sys     0m0.298s
>>>
>>> The above results were tested on x86-64 qemu with KVM enabled, using
>>> test.c as the test program. Ideally, we should have around 1500 samples,
>>> but the previous algorithm had only about 500, whereas the modified
>>> algorithm now has about 1400. Furthermore, the new version shows 1
>>> sample per 0.001s, while the previous one is 1 sample per 0.002s. This
>>> indicates that the new algorithm is more sensitive to small negative
>>> values compared to the old algorithm.
>>>
>>> Fixes: bd2b5b12849a ("perf_counter: More aggressive frequency adjustment")
>>> Signed-off-by: Luo Gengkun <[email protected]>
>>
>> It seems better, and the maths makes sense, so:
>>
>> Reviewed-by: Adrian Hunter <[email protected]>
>>
>>
>> But the test case still seems to give unexpected results. Usually:
>>
>>    # time taskset --cpu 1 ./test
>>    real    0m 1.25s
>>    user    0m 0.03s
>>    sys     0m 0.00
>>    # taskset --cpu 0 perf record -F 1000 -e cs -- taskset --cpu 1 ./test
>>    [ perf record: Woken up 1 times to write data ]
>>    [ perf record: Captured and wrote 0.051 MB perf.data (1290 samples) ]
>>
>> But occasionally:
>>
>>    # taskset --cpu 0 perf record -F 1000 -e cs -- taskset --cpu 1 ./test
>>    [ perf record: Woken up 1 times to write data ]
>>    [ perf record: Captured and wrote 0.010 MB perf.data (204 samples) ]
>>    # perf script
>>    ...
>>    test   865   265.377846:         16 cs:  ffffffff832e927b schedule+0x2b
>>    test   865   265.378900:         15 cs:  ffffffff832e927b schedule+0x2b
>>    test   865   265.379845:         14 cs:  ffffffff832e927b schedule+0x2b
>>    test   865   265.380770:         14 cs:  ffffffff832e927b schedule+0x2b
>>    test   865   265.381647:         15 cs:  ffffffff832e927b schedule+0x2b
>>    test   865   265.382638:         16 cs:  ffffffff832e927b schedule+0x2b
>>    test   865   265.383647:         16 cs:  ffffffff832e927b schedule+0x2b
>>    test   865   265.384704:         15 cs:  ffffffff832e927b schedule+0x2b
>>    test   865   265.385649:         14 cs:  ffffffff832e927b schedule+0x2b
>>    test   865   265.386578:        152 cs:  ffffffff832e927b schedule+0x2b
>>    test   865   265.396383:        154 cs:  ffffffff832e927b schedule+0x2b
>>    test   865   265.406183:        154 cs:  ffffffff832e927b schedule+0x2b
>>    test   865   265.415839:        154 cs:  ffffffff832e927b schedule+0x2b
>>    test   865   265.425445:        154 cs:  ffffffff832e927b schedule+0x2b
>>    test   865   265.435052:        154 cs:  ffffffff832e927b schedule+0x2b
>>    test   865   265.444708:        154 cs:  ffffffff832e927b schedule+0x2b
>>    test   865   265.454314:        154 cs:  ffffffff832e927b schedule+0x2b
>>    test   865   265.463970:        154 cs:  ffffffff832e927b schedule+0x2b
>>    test   865   265.473577:        154 cs:  ffffffff832e927b schedule+0x2b
>>    ...
>>
>>
>>
> It seems that the unexpected results are caused by timer interrupts not arriving every TICK_NSEC.
>
> I guess this is due to system idleness.

It looks like the period is adjusted at the tick only
for active tasks, so a task that is asleep a lot, like
the test case, could go a number of ticks without being
adjusted by perf_adjust_freq_unthr_context().

> I tried the patch and it should have fixed the issue.

Maybe also:

@@ -9523,7 +9532,7 @@ __perf_event_account_interrupt(struct perf_event *event, int throttle)

 		hwc->freq_time_stamp = now;

-		if (delta > 0 && delta < 2*TICK_NSEC)
+		if (delta > 0 && delta != now)
 			perf_adjust_period(event, delta, hwc->last_period, true);
 	}


>
> You can give it a try as well.
>
> diff --git a/include/linux/perf_event.h b/include/linux/perf_event.h
> index afb028c54f33..2708f1d0692c 100644
> --- a/include/linux/perf_event.h
> +++ b/include/linux/perf_event.h
> @@ -265,6 +265,7 @@ struct hw_perf_event {
> 	 * State for freq target events, see __perf_event_overflow() and
> 	 * perf_adjust_freq_unthr_context().
> 	 */
> +	u64				freq_tick_stamp;
> 	u64				freq_time_stamp;
> 	u64				freq_count_stamp;
> #endif
> diff --git a/kernel/events/core.c b/kernel/events/core.c
> index cad50d3439f1..fe0d9b470365 100644
> --- a/kernel/events/core.c
> +++ b/kernel/events/core.c
> @@ -4112,7 +4112,7 @@ perf_adjust_freq_unthr_context(struct perf_event_context *ctx, bool unthrottle)
> {
> 	struct perf_event *event;
> 	struct hw_perf_event *hwc;
> -	u64 now, period = TICK_NSEC;
> +	u64 now, period, tick_stamp;
> 	s64 delta;
>
> 	/*
> @@ -4151,6 +4151,10 @@ perf_adjust_freq_unthr_context(struct perf_event_context *ctx, bool unthrottle)
> 		 */
> 		event->pmu->stop(event, PERF_EF_UPDATE);
>
> +		tick_stamp = perf_clock();
> +		period = tick_stamp - hwc->freq_tick_stamp;
> +		hwc->freq_tick_stamp = tick_stamp;
> +
> 		now = local64_read(&event->count);
> 		delta = now - hwc->freq_count_stamp;
> 		hwc->freq_count_stamp = now;
> @@ -4162,8 +4166,14 @@ perf_adjust_freq_unthr_context(struct perf_event_context *ctx, bool unthrottle)
> 		 * to perf_adjust_period() to avoid stopping it
> 		 * twice.
> 		 */
> -		if (delta > 0)
> -			perf_adjust_period(event, period, delta, false);
> +		if (delta > 0) {
> +			/*
> +			 * we skip first tick adjust period
> +			 */
> +			if (likely(period != tick_stamp)) {
> +				perf_adjust_period(event, period, delta, false);
> +			}
> +		}
>
> 		event->pmu->start(event, delta > 0 ? PERF_EF_RELOAD : 0);
> 	next:
>
>>> ---
>>>   kernel/events/core.c | 6 +++++-
>>>   1 file changed, 5 insertions(+), 1 deletion(-)
>>>
>>> diff --git a/kernel/events/core.c b/kernel/events/core.c
>>> index 683dc086ef10..cad50d3439f1 100644
>>> --- a/kernel/events/core.c
>>> +++ b/kernel/events/core.c
>>> @@ -4078,7 +4078,11 @@ static void perf_adjust_period(struct perf_event *event, u64 nsec, u64 count, bo
>>>       period = perf_calculate_period(event, nsec, count);
>>>         delta = (s64)(period - hwc->sample_period);
>>> -    delta = (delta + 7) / 8; /* low pass filter */
>>> +    if (delta >= 0)
>>> +        delta += 7;
>>> +    else
>>> +        delta -= 7;
>>> +    delta /= 8; /* low pass filter */
>>>         sample_period = hwc->sample_period + delta;
>>>  
>>


2024-02-18 06:38:06

by Luo Gengkun

Subject: Re: [PATCH] perf/core: Fix small negative period being ignored



On 2024/2/5 22:21, Adrian Hunter wrote:
> On 20/01/24 09:21, Luo Gengkun wrote:
>>
>>
>> On 2024/1/19 15:36, Adrian Hunter wrote:
>>> On 16/01/24 10:39, Luo Gengkun wrote:
>>>> In perf_adjust_period, we will first calculate period, and then use
>>>> this period to calculate delta. However, when delta is less than 0,
>>>> there will be a deviation compared to when delta is greater than or
>>>> equal to 0. For example, when delta is in the range of [-14,-1], the
>>>> range of delta = delta + 7 is between [-7,6], so the final value of
>>>> delta/8 is 0. Therefore, the impact of -1 and -2 will be ignored.
>>>> This is unacceptable when the target period is very short, because
>>>> we will lose a lot of samples.
>>>>
>>>> Here are some tests and analyses:
>>>> before:
>>>>    # perf record -e cs -F 1000  ./a.out
>>>>    [ perf record: Woken up 1 times to write data ]
>>>>    [ perf record: Captured and wrote 0.022 MB perf.data (518 samples) ]
>>>>
>>>>    # perf script
>>>>    ...
>>>>    a.out     396   257.956048:         23 cs:  ffffffff81f4eeec schedul>
>>>>    a.out     396   257.957891:         23 cs:  ffffffff81f4eeec schedul>
>>>>    a.out     396   257.959730:         23 cs:  ffffffff81f4eeec schedul>
>>>>    a.out     396   257.961545:         23 cs:  ffffffff81f4eeec schedul>
>>>>    a.out     396   257.963355:         23 cs:  ffffffff81f4eeec schedul>
>>>>    a.out     396   257.965163:         23 cs:  ffffffff81f4eeec schedul>
>>>>    a.out     396   257.966973:         23 cs:  ffffffff81f4eeec schedul>
>>>>    a.out     396   257.968785:         23 cs:  ffffffff81f4eeec schedul>
>>>>    a.out     396   257.970593:         23 cs:  ffffffff81f4eeec schedul>
>>>>    ...
>>>>
>>>> after:
>>>>    # perf record -e cs -F 1000  ./a.out
>>>>    [ perf record: Woken up 1 times to write data ]
>>>>    [ perf record: Captured and wrote 0.058 MB perf.data (1466 samples) ]
>>>>
>>>>    # perf script
>>>>    ...
>>>>    a.out     395    59.338813:         11 cs:  ffffffff81f4eeec schedul>
>>>>    a.out     395    59.339707:         12 cs:  ffffffff81f4eeec schedul>
>>>>    a.out     395    59.340682:         13 cs:  ffffffff81f4eeec schedul>
>>>>    a.out     395    59.341751:         13 cs:  ffffffff81f4eeec schedul>
>>>>    a.out     395    59.342799:         12 cs:  ffffffff81f4eeec schedul>
>>>>    a.out     395    59.343765:         11 cs:  ffffffff81f4eeec schedul>
>>>>    a.out     395    59.344651:         11 cs:  ffffffff81f4eeec schedul>
>>>>    a.out     395    59.345539:         12 cs:  ffffffff81f4eeec schedul>
>>>>    a.out     395    59.346502:         13 cs:  ffffffff81f4eeec schedul>
>>>>    ...
>>>>
>>>> test.c
>>>>
>>>> int main() {
>>>>          for (int i = 0; i < 20000; i++)
>>>>                  usleep(10);
>>>>
>>>>          return 0;
>>>> }
>>>>
>>>>    # time ./a.out
>>>>    real    0m1.583s
>>>>    user    0m0.040s
>>>>    sys     0m0.298s
>>>>
>>>> The above results were tested on x86-64 qemu with KVM enabled, using
>>>> test.c as the test program. Ideally, we should have around 1500 samples,
>>>> but the previous algorithm had only about 500, whereas the modified
>>>> algorithm now has about 1400. Furthermore, the new version shows 1
>>>> sample per 0.001s, while the previous one is 1 sample per 0.002s. This
>>>> indicates that the new algorithm is more sensitive to small negative
>>>> values compared to the old algorithm.
>>>>
>>>> Fixes: bd2b5b12849a ("perf_counter: More aggressive frequency adjustment")
>>>> Signed-off-by: Luo Gengkun <[email protected]>
>>>
>>> It seems better, and the maths makes sense, so:
>>>
>>> Reviewed-by: Adrian Hunter <[email protected]>
>>>
>>>
>>> But the test case still seems to give unexpected results. Usually:
>>>
>>>    # time taskset --cpu 1 ./test
>>>    real    0m 1.25s
>>>    user    0m 0.03s
>>>    sys     0m 0.00
>>>    # taskset --cpu 0 perf record -F 1000 -e cs -- taskset --cpu 1 ./test
>>>    [ perf record: Woken up 1 times to write data ]
>>>    [ perf record: Captured and wrote 0.051 MB perf.data (1290 samples) ]
>>>
>>> But occasionally:
>>>
>>>    # taskset --cpu 0 perf record -F 1000 -e cs -- taskset --cpu 1 ./test
>>>    [ perf record: Woken up 1 times to write data ]
>>>    [ perf record: Captured and wrote 0.010 MB perf.data (204 samples) ]
>>>    # perf script
>>>    ...
>>>    test   865   265.377846:         16 cs:  ffffffff832e927b schedule+0x2b
>>>    test   865   265.378900:         15 cs:  ffffffff832e927b schedule+0x2b
>>>    test   865   265.379845:         14 cs:  ffffffff832e927b schedule+0x2b
>>>    test   865   265.380770:         14 cs:  ffffffff832e927b schedule+0x2b
>>>    test   865   265.381647:         15 cs:  ffffffff832e927b schedule+0x2b
>>>    test   865   265.382638:         16 cs:  ffffffff832e927b schedule+0x2b
>>>    test   865   265.383647:         16 cs:  ffffffff832e927b schedule+0x2b
>>>    test   865   265.384704:         15 cs:  ffffffff832e927b schedule+0x2b
>>>    test   865   265.385649:         14 cs:  ffffffff832e927b schedule+0x2b
>>>    test   865   265.386578:        152 cs:  ffffffff832e927b schedule+0x2b
>>>    test   865   265.396383:        154 cs:  ffffffff832e927b schedule+0x2b
>>>    test   865   265.406183:        154 cs:  ffffffff832e927b schedule+0x2b
>>>    test   865   265.415839:        154 cs:  ffffffff832e927b schedule+0x2b
>>>    test   865   265.425445:        154 cs:  ffffffff832e927b schedule+0x2b
>>>    test   865   265.435052:        154 cs:  ffffffff832e927b schedule+0x2b
>>>    test   865   265.444708:        154 cs:  ffffffff832e927b schedule+0x2b
>>>    test   865   265.454314:        154 cs:  ffffffff832e927b schedule+0x2b
>>>    test   865   265.463970:        154 cs:  ffffffff832e927b schedule+0x2b
>>>    test   865   265.473577:        154 cs:  ffffffff832e927b schedule+0x2b
>>>    ...
>>>
>>>
>>>
>> It seems that the unexpected results are caused by timer interrupts not arriving every TICK_NSEC.
>>
>> I guess this is due to system idleness.
>
> It looks like the period is adjusted at the tick only
> for active tasks, so a task that is asleep a lot, like
> the test case, could go a number of ticks without being
> adjusted by perf_adjust_freq_unthr_context().
>
>> I tried the patch and it should have fixed the issue.
>
> Maybe also:
>
> @@ -9523,7 +9532,7 @@ __perf_event_account_interrupt(struct perf_event *event, int throttle)
>
> 		hwc->freq_time_stamp = now;
>
> -		if (delta > 0 && delta < 2*TICK_NSEC)
> +		if (delta > 0 && delta != now)
> 			perf_adjust_period(event, delta, hwc->last_period, true);
> 	}
>
>
It seems like this could re-adjust a period that was already increased
by the tick path. However, it would also introduce some problems. For
example, if a task is scheduled out and then scheduled back in, and we
use such a delta to calculate the new period, it will introduce more
bias, because the delta includes the time when the task is not on the
CPU. In addition, the impact of the tick adjustment is still present:
the timer interrupt does not always arrive every TICK_NSEC, which means
we are passing an incorrect nsec to perf_adjust_period, and therefore
the calculated period can become very large.
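
To make that concrete, here is a rough user-space sketch of the effect
(this only approximates the freq-mode target calculation, it is not the
real perf_calculate_period(), and the 4 ms tick just assumes HZ=250):

#include <stdio.h>

#define NSEC_PER_SEC 1000000000ULL

/* pick a period so that `count` events over `nsec` ns would give
 * `freq` samples per second (rough model, illustration only) */
static unsigned long long calc_period(unsigned long long count,
				      unsigned long long nsec,
				      unsigned long long freq)
{
	return count * NSEC_PER_SEC / (freq * nsec);
}

int main(void)
{
	unsigned long long count = 40;	/* 40 cs really happened over 40 ms */

	/* nsec matches the real elapsed time: period comes out right */
	printf("nsec = 40ms: period = %llu\n",
	       calc_period(count, 40000000ULL, 1000));

	/* nsec claims a single 4 ms tick although several ticks were
	 * skipped: the apparent rate is 10x too high and the period
	 * is inflated by the same factor */
	printf("nsec =  4ms: period = %llu\n",
	       calc_period(count, 4000000ULL, 1000));
	return 0;
}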
>>
>> You can give it a try as well.
>>
>> diff --git a/include/linux/perf_event.h b/include/linux/perf_event.h
>> index afb028c54f33..2708f1d0692c 100644
>> --- a/include/linux/perf_event.h
>> +++ b/include/linux/perf_event.h
>> @@ -265,6 +265,7 @@ struct hw_perf_event {
>> 	 * State for freq target events, see __perf_event_overflow() and
>> 	 * perf_adjust_freq_unthr_context().
>> 	 */
>> +	u64				freq_tick_stamp;
>> 	u64				freq_time_stamp;
>> 	u64				freq_count_stamp;
>> #endif
>> diff --git a/kernel/events/core.c b/kernel/events/core.c
>> index cad50d3439f1..fe0d9b470365 100644
>> --- a/kernel/events/core.c
>> +++ b/kernel/events/core.c
>> @@ -4112,7 +4112,7 @@ perf_adjust_freq_unthr_context(struct perf_event_context *ctx, bool unthrottle)
>> {
>> 	struct perf_event *event;
>> 	struct hw_perf_event *hwc;
>> -	u64 now, period = TICK_NSEC;
>> +	u64 now, period, tick_stamp;
>> 	s64 delta;
>>
>> 	/*
>> @@ -4151,6 +4151,10 @@ perf_adjust_freq_unthr_context(struct perf_event_context *ctx, bool unthrottle)
>> 		 */
>> 		event->pmu->stop(event, PERF_EF_UPDATE);
>>
>> +		tick_stamp = perf_clock();
>> +		period = tick_stamp - hwc->freq_tick_stamp;
>> +		hwc->freq_tick_stamp = tick_stamp;
>> +
>> 		now = local64_read(&event->count);
>> 		delta = now - hwc->freq_count_stamp;
>> 		hwc->freq_count_stamp = now;
>> @@ -4162,8 +4166,14 @@ perf_adjust_freq_unthr_context(struct perf_event_context *ctx, bool unthrottle)
>> 		 * to perf_adjust_period() to avoid stopping it
>> 		 * twice.
>> 		 */
>> -		if (delta > 0)
>> -			perf_adjust_period(event, period, delta, false);
>> +		if (delta > 0) {
>> +			/*
>> +			 * we skip first tick adjust period
>> +			 */
>> +			if (likely(period != tick_stamp)) {
>> +				perf_adjust_period(event, period, delta, false);
>> +			}
>> +		}
>>
>> 		event->pmu->start(event, delta > 0 ? PERF_EF_RELOAD : 0);
>> 	next:
>>
>>>> ---
>>>>   kernel/events/core.c | 6 +++++-
>>>>   1 file changed, 5 insertions(+), 1 deletion(-)
>>>>
>>>> diff --git a/kernel/events/core.c b/kernel/events/core.c
>>>> index 683dc086ef10..cad50d3439f1 100644
>>>> --- a/kernel/events/core.c
>>>> +++ b/kernel/events/core.c
>>>> @@ -4078,7 +4078,11 @@ static void perf_adjust_period(struct perf_event *event, u64 nsec, u64 count, bo
>>>>       period = perf_calculate_period(event, nsec, count);
>>>>         delta = (s64)(period - hwc->sample_period);
>>>> -    delta = (delta + 7) / 8; /* low pass filter */
>>>> +    if (delta >= 0)
>>>> +        delta += 7;
>>>> +    else
>>>> +        delta -= 7;
>>>> +    delta /= 8; /* low pass filter */
>>>>         sample_period = hwc->sample_period + delta;
>>>>
>>>
>