2023-02-08 11:09:15

by Tariq Toukan

Subject: Bug report: UDP ~20% degradation

Hi all,

Our performance verification team spotted a degradation of up to ~20% in
UDP performance for a specific combination of parameters.

Our test matrix covers several parameter values, such as:
IP version: 4/6
MTU: 1500/9000
Msg size: 64/1452/8952 (only when applicable, i.e. while avoiding IP
fragmentation).
Num of streams: 1/8/16/24.
Num of directions: unidir/bidir.
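
As a side note on the message sizes: 1452 and 8952 look like the largest
UDP payloads that avoid fragmentation over IPv6 at MTU 1500/9000, and they
also fit within the IPv4 limits. A minimal sketch of that arithmetic (the
helper name max_udp_payload is just for illustration; it assumes the
standard 20-byte IPv4 / 40-byte IPv6 and 8-byte UDP headers):

/* Largest UDP payload that avoids IP fragmentation for a given MTU. */
#include <stdio.h>

static int max_udp_payload(int mtu, int ipv6)
{
	int ip_hdr = ipv6 ? 40 : 20;   /* minimal IP header size */
	return mtu - ip_hdr - 8;       /* 8-byte UDP header */
}

int main(void)
{
	/* MTU 9000 -> 8972 (IPv4) / 8952 (IPv6); MTU 1500 -> 1472 / 1452 */
	printf("MTU 9000: v4 %d, v6 %d\n",
	       max_udp_payload(9000, 0), max_udp_payload(9000, 1));
	printf("MTU 1500: v4 %d, v6 %d\n",
	       max_udp_payload(1500, 0), max_udp_payload(1500, 1));
	return 0;
}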

Surprisingly, the issue exists only with this specific combination:
8 streams,
MTU 9000,
Msg size 8952,
both IPv4/IPv6,
bidir.
(in unidir it reproduces only with IPv4)

The reproduction is consistent on all the different setups we tested with.

Bisect [2] was done between these two points, v5.19 (Good) and v6.0-rc1
(Bad), with a ConnectX-6DX NIC.

c82a69629c53eda5233f13fc11c3c01585ef48a2 is the first bad commit [1].

We couldn't come up with a good explanation of how this patch causes the
issue. We also looked for related changes in the networking/UDP stack,
but nothing looked suspicious.

Maybe someone here can help with this.
We can provide more details or run further tests/experiments to help
progress the debugging.

Thanks,
Tariq

[1]
commit c82a69629c53eda5233f13fc11c3c01585ef48a2
Author: Vincent Guittot <[email protected]>
Date:   Fri Jul 8 17:44:01 2022 +0200

    sched/fair: fix case with reduced capacity CPU

    The capacity of the CPU available for CFS tasks can be reduced because of
    other activities running on the latter. In such case, it's worth trying to
    move CFS tasks on a CPU with more available capacity.

    The rework of the load balance has filtered the case when the CPU is
    classified to be fully busy but its capacity is reduced.

    Check if CPU's capacity is reduced while gathering load balance statistic
    and classify it group_misfit_task instead of group_fully_busy so we can
    try to move the load on another CPU.

    Reported-by: David Chen <[email protected]>
    Reported-by: Zhang Qiao <[email protected]>
    Signed-off-by: Vincent Guittot <[email protected]>
    Signed-off-by: Peter Zijlstra (Intel) <[email protected]>
    Tested-by: David Chen <[email protected]>
    Tested-by: Zhang Qiao <[email protected]>
    Link: https://lkml.kernel.org/r/[email protected]

[2]

Detailed bisect steps:

+--------------+--------+-----------+-----------+
| Commit | Status | BW (Gbps) | BW (Gbps) |
| | | run1 | run2 |
+--------------+--------+-----------+-----------+
| 526942b8134c | Bad | --- | --- |
+--------------+--------+-----------+-----------+
| 2e7a95156d64 | Bad | --- | --- |
+--------------+--------+-----------+-----------+
| 26c350fe7ae0 | Good | 279.8 | 281.9 |
+--------------+--------+-----------+-----------+
| 9de1f9c8ca51 | Bad | 257.243 | --- |
+--------------+--------+-----------+-----------+
| 892f7237b3ff | Good | 285 | 300.7 |
+--------------+--------+-----------+-----------+
| 0dd1cabe8a4a | Good | 305.599 | 290.3 |
+--------------+--------+-----------+-----------+
| dfea84827f7e | Bad | 250.2 | 258.899 |
+--------------+--------+-----------+-----------+
| 22a39c3d8693 | Bad | 236.8 | 245.399 |
+--------------+--------+-----------+-----------+
| e2f3e35f1f5a | Good | 277.599 | 287 |
+--------------+--------+-----------+-----------+
| 401e4963bf45 | Bad | 250.149 | 248.899 |
+--------------+--------+-----------+-----------+
| 3e8c6c9aac42 | Good | 299.09 | 294.9 |
+--------------+--------+-----------+-----------+
| 1fcf54deb767 | Good | 292.719 | 301.299 |
+--------------+--------+-----------+-----------+
| c82a69629c53 | Bad | 254.7 | 246.1 |
+--------------+--------+-----------+-----------+
| c02d5546ea34 | Good | 276.4 | 294 |
+--------------+--------+-----------+-----------+


2023-02-08 14:13:25

by Vincent Guittot

Subject: Re: Bug report: UDP ~20% degradation

Hi Tariq,

On Wed, 8 Feb 2023 at 12:09, Tariq Toukan <[email protected]> wrote:
>
> Hi all,
>
> Our performance verification team spotted a degradation of up to ~20% in
> UDP performance, for a specific combination of parameters.
>
> Our matrix covers several parameters values, like:
> IP version: 4/6
> MTU: 1500/9000
> Msg size: 64/1452/8952 (only when applicable while avoiding ip
> fragmentation).
> Num of streams: 1/8/16/24.
> Num of directions: unidir/bidir.
>
> Surprisingly, the issue exists only with this specific combination:
> 8 streams,
> MTU 9000,
> Msg size 8952,
> both ipv4/6,
> bidir.
> (in unidir it repros only with ipv4)
>
> The reproduction is consistent on all the different setups we tested with.
>
> Bisect [2] was done between these two points, v5.19 (Good), and v6.0-rc1
> (Bad), with ConnectX-6DX NIC.
>
> c82a69629c53eda5233f13fc11c3c01585ef48a2 is the first bad commit [1].
>
> We couldn't come up with a good explanation how this patch causes this
> issue. We also looked for related changes in the networking/UDP stack,
> but nothing looked suspicious.
>
> Maybe someone here can help with this.
> We can provide more details or do further tests/experiments to progress
> with the debug.

Could you share more details about your system and the CPU topology?

The commit c82a69629c53 migrates a task to an idle CPU when the task
is the only one running on its local CPU but the time spent by that
local CPU under interrupt or RT context becomes significant (10%-17%).
I can imagine that 16/24 streams overload your system, so load_balance
doesn't end up in this case and the CPUs are busy with several
threads. On the other hand, 1 stream is small enough to keep your
system lightly loaded, but 8 streams load your system enough to trigger
the reduced-capacity case while still not overloading it.
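
For reference, a rough sketch of the check that commit adds while
gathering the load-balance statistics (paraphrased from kernel/sched/fair.c
as I recall the patch, not a verbatim quote; sched_reduced_capacity() is
the helper it introduces and check_cpu_capacity() is an existing helper):

/* A CPU whose capacity is eaten by IRQ/RT time and that runs a single
 * CFS task is flagged as misfit instead of fully busy, so load_balance()
 * may move that task to a CPU with more available capacity. */
static inline int
sched_reduced_capacity(struct rq *rq, struct sched_domain *sd)
{
	return rq->cfs.h_nr_running == 1 &&
	       check_cpu_capacity(rq, sd);
}

/* in update_sg_lb_stats(), roughly: */
if (!local_group && env->idle != CPU_NOT_IDLE &&
    sched_reduced_capacity(rq, env->sd)) {
	/* Record the load so group_classify() returns group_misfit_task
	 * rather than group_fully_busy. */
	if (sgs->group_misfit_task_load < load)
		sgs->group_misfit_task_load = load;
}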

Vincent

>
> Thanks,
> Tariq
>
> [1]
> commit c82a69629c53eda5233f13fc11c3c01585ef48a2
> Author: Vincent Guittot <[email protected]>
> Date:   Fri Jul 8 17:44:01 2022 +0200
>
>     sched/fair: fix case with reduced capacity CPU
>
>     The capacity of the CPU available for CFS tasks can be reduced because of
>     other activities running on the latter. In such case, it's worth trying to
>     move CFS tasks on a CPU with more available capacity.
>
>     The rework of the load balance has filtered the case when the CPU is
>     classified to be fully busy but its capacity is reduced.
>
>     Check if CPU's capacity is reduced while gathering load balance statistic
>     and classify it group_misfit_task instead of group_fully_busy so we can
>     try to move the load on another CPU.
>
>     Reported-by: David Chen <[email protected]>
>     Reported-by: Zhang Qiao <[email protected]>
>     Signed-off-by: Vincent Guittot <[email protected]>
>     Signed-off-by: Peter Zijlstra (Intel) <[email protected]>
>     Tested-by: David Chen <[email protected]>
>     Tested-by: Zhang Qiao <[email protected]>
>     Link: https://lkml.kernel.org/r/[email protected]
>
> [2]
>
> Detailed bisec steps:
>
> +--------------+--------+-----------+-----------+
> | Commit | Status | BW (Gbps) | BW (Gbps) |
> | | | run1 | run2 |
> +--------------+--------+-----------+-----------+
> | 526942b8134c | Bad | --- | --- |
> +--------------+--------+-----------+-----------+
> | 2e7a95156d64 | Bad | --- | --- |
> +--------------+--------+-----------+-----------+
> | 26c350fe7ae0 | Good | 279.8 | 281.9 |
> +--------------+--------+-----------+-----------+
> | 9de1f9c8ca51 | Bad | 257.243 | --- |
> +--------------+--------+-----------+-----------+
> | 892f7237b3ff | Good | 285 | 300.7 |
> +--------------+--------+-----------+-----------+
> | 0dd1cabe8a4a | Good | 305.599 | 290.3 |
> +--------------+--------+-----------+-----------+
> | dfea84827f7e | Bad | 250.2 | 258.899 |
> +--------------+--------+-----------+-----------+
> | 22a39c3d8693 | Bad | 236.8 | 245.399 |
> +--------------+--------+-----------+-----------+
> | e2f3e35f1f5a | Good | 277.599 | 287 |
> +--------------+--------+-----------+-----------+
> | 401e4963bf45 | Bad | 250.149 | 248.899 |
> +--------------+--------+-----------+-----------+
> | 3e8c6c9aac42 | Good | 299.09 | 294.9 |
> +--------------+--------+-----------+-----------+
> | 1fcf54deb767 | Good | 292.719 | 301.299 |
> +--------------+--------+-----------+-----------+
> | c82a69629c53 | Bad | 254.7 | 246.1 |
> +--------------+--------+-----------+-----------+
> | c02d5546ea34 | Good | 276.4 | 294 |
> +--------------+--------+-----------+-----------+

2023-02-10 18:38:10

by Thorsten Leemhuis

Subject: Re: Bug report: UDP ~20% degradation

[TLDR: I'm adding this report to the list of tracked Linux kernel
regressions; the text you find below is based on a few template
paragraphs you might have already encountered in similar form.
See link in footer if these mails annoy you.]

On 08.02.23 12:08, Tariq Toukan wrote:
>
> Our performance verification team spotted a degradation of up to ~20% in
> UDP performance, for a specific combination of parameters.
>
> Our matrix covers several parameters values, like:
> IP version: 4/6
> MTU: 1500/9000
> Msg size: 64/1452/8952 (only when applicable while avoiding ip
> fragmentation).
> Num of streams: 1/8/16/24.
> Num of directions: unidir/bidir.
>
> Surprisingly, the issue exists only with this specific combination:
> 8 streams,
> MTU 9000,
> Msg size 8952,
> both ipv4/6,
> bidir.
> (in unidir it repros only with ipv4)
>
> The reproduction is consistent on all the different setups we tested with.
>
> Bisect [2] was done between these two points, v5.19 (Good), and v6.0-rc1
> (Bad), with ConnectX-6DX NIC.
>
> c82a69629c53eda5233f13fc11c3c01585ef48a2 is the first bad commit [1].
>
> We couldn't come up with a good explanation how this patch causes this
> issue. We also looked for related changes in the networking/UDP stack,
> but nothing looked suspicious.
>
> Maybe someone here can help with this.
> We can provide more details or do further tests/experiments to progress
> with the debug.

Thanks for the report. To be sure the issue doesn't fall through the
cracks unnoticed, I'm adding it to regzbot, the Linux kernel regression
tracking bot:

#regzbot ^introduced c82a69629c53eda5233f13fc11c3c01585ef48a2
#regzbot title sched/fair: UDP ~20% degradation
#regzbot ignore-activity

This isn't a regression? This issue or a fix for it is already being
discussed somewhere else? It was fixed already? You want to clarify when
the regression started to happen? Or point out that I got the title or
something else totally wrong? Then just reply and tell me -- ideally
while also telling regzbot about it, as explained by the page listed in
the footer of this mail.

Developers: When fixing the issue, remember to add 'Link:' tags pointing
to the report (the parent of this mail). See page linked in footer for
details.

Ciao, Thorsten (wearing his 'the Linux kernel's regression tracker' hat)
--
Everything you wanna know about Linux kernel regression tracking:
https://linux-regtracking.leemhuis.info/about/#tldr
That page also explains what to do if mails like this annoy you.

> [1]
> commit c82a69629c53eda5233f13fc11c3c01585ef48a2
> Author: Vincent Guittot <[email protected]>
> Date:   Fri Jul 8 17:44:01 2022 +0200
>
>     sched/fair: fix case with reduced capacity CPU
>
>     The capacity of the CPU available for CFS tasks can be reduced because of
>     other activities running on the latter. In such case, it's worth trying to
>     move CFS tasks on a CPU with more available capacity.
>
>     The rework of the load balance has filtered the case when the CPU is
>     classified to be fully busy but its capacity is reduced.
>
>     Check if CPU's capacity is reduced while gathering load balance statistic
>     and classify it group_misfit_task instead of group_fully_busy so we can
>     try to move the load on another CPU.
>
>     Reported-by: David Chen <[email protected]>
>     Reported-by: Zhang Qiao <[email protected]>
>     Signed-off-by: Vincent Guittot <[email protected]>
>     Signed-off-by: Peter Zijlstra (Intel) <[email protected]>
>     Tested-by: David Chen <[email protected]>
>     Tested-by: Zhang Qiao <[email protected]>
>     Link: https://lkml.kernel.org/r/[email protected]
>
> [2]
>
> Detailed bisec steps:
>
> +--------------+--------+-----------+-----------+
> | Commit       | Status | BW (Gbps) | BW (Gbps) |
> |              |        | run1      | run2      |
> +--------------+--------+-----------+-----------+
> | 526942b8134c | Bad    | ---       | ---       |
> +--------------+--------+-----------+-----------+
> | 2e7a95156d64 | Bad    | ---       | ---       |
> +--------------+--------+-----------+-----------+
> | 26c350fe7ae0 | Good   | 279.8     | 281.9     |
> +--------------+--------+-----------+-----------+
> | 9de1f9c8ca51 | Bad    | 257.243   | ---       |
> +--------------+--------+-----------+-----------+
> | 892f7237b3ff | Good   | 285       | 300.7     |
> +--------------+--------+-----------+-----------+
> | 0dd1cabe8a4a | Good   | 305.599   | 290.3     |
> +--------------+--------+-----------+-----------+
> | dfea84827f7e | Bad    | 250.2     | 258.899   |
> +--------------+--------+-----------+-----------+
> | 22a39c3d8693 | Bad    | 236.8     | 245.399   |
> +--------------+--------+-----------+-----------+
> | e2f3e35f1f5a | Good   | 277.599   | 287       |
> +--------------+--------+-----------+-----------+
> | 401e4963bf45 | Bad    | 250.149   | 248.899   |
> +--------------+--------+-----------+-----------+
> | 3e8c6c9aac42 | Good   | 299.09    | 294.9     |
> +--------------+--------+-----------+-----------+
> | 1fcf54deb767 | Good   | 292.719   | 301.299   |
> +--------------+--------+-----------+-----------+
> | c82a69629c53 | Bad    | 254.7     | 246.1     |
> +--------------+--------+-----------+-----------+
> | c02d5546ea34 | Good   | 276.4     | 294       |
> +--------------+--------+-----------+-----------+

2023-02-12 11:50:43

by Tariq Toukan

Subject: Re: Bug report: UDP ~20% degradation



On 08/02/2023 16:12, Vincent Guittot wrote:
> Hi Tariq,
>
> On Wed, 8 Feb 2023 at 12:09, Tariq Toukan <[email protected]> wrote:
>>
>> Hi all,
>>
>> Our performance verification team spotted a degradation of up to ~20% in
>> UDP performance, for a specific combination of parameters.
>>
>> Our matrix covers several parameters values, like:
>> IP version: 4/6
>> MTU: 1500/9000
>> Msg size: 64/1452/8952 (only when applicable while avoiding ip
>> fragmentation).
>> Num of streams: 1/8/16/24.
>> Num of directions: unidir/bidir.
>>
>> Surprisingly, the issue exists only with this specific combination:
>> 8 streams,
>> MTU 9000,
>> Msg size 8952,
>> both ipv4/6,
>> bidir.
>> (in unidir it repros only with ipv4)
>>
>> The reproduction is consistent on all the different setups we tested with.
>>
>> Bisect [2] was done between these two points, v5.19 (Good), and v6.0-rc1
>> (Bad), with ConnectX-6DX NIC.
>>
>> c82a69629c53eda5233f13fc11c3c01585ef48a2 is the first bad commit [1].
>>
>> We couldn't come up with a good explanation how this patch causes this
>> issue. We also looked for related changes in the networking/UDP stack,
>> but nothing looked suspicious.
>>
>> Maybe someone here can help with this.
>> We can provide more details or do further tests/experiments to progress
>> with the debug.
>
> Could you share more details about your system and the cpu topology ?
>

output for 'lscpu':

Architecture: x86_64
CPU op-mode(s): 32-bit, 64-bit
Address sizes: 40 bits physical, 57 bits virtual
Byte Order: Little Endian
CPU(s): 24
On-line CPU(s) list: 0-23
Vendor ID: GenuineIntel
BIOS Vendor ID: QEMU
Model name: Intel(R) Xeon(R) Platinum 8380 CPU @
2.30GHz
BIOS Model name: pc-q35-5.0
CPU family: 6
Model: 106
Thread(s) per core: 1
Core(s) per socket: 1
Socket(s): 24
Stepping: 6
BogoMIPS: 4589.21
Flags: fpu vme de pse tsc msr pae mce cx8 apic
sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ss syscall nx
pdpe1gb rdtscp lm constant_tsc arch_perfmon rep_good nopl xtopology
cpuid tsc_known_freq pni pclmulqdq vmx ssse3 fma cx16 pdcm pcid sse4_1
sse4_2 x2apic movbe popcnt tsc_deadline_timer aes xsave avx f16c rdrand
hypervisor lahf_lm abm 3dnowprefetch cpuid_fault invpcid_single ssbd
ibrs ibpb stibp ibrs_enhanced tpr_shadow vnmi flexpriority ept vpid
ept_ad fsgsbase tsc_adjust bmi1 avx2 smep bmi2 erms invpcid avx512f
avx512dq rdseed adx smap avx512ifma clflushopt clwb avx512cd sha_ni
avx512bw avx512vl xsaveopt xsavec xgetbv1 xsaves wbnoinvd arat
avx512vbmi umip pku ospke avx512_vbmi2 gfni vaes vpclmulqdq avx512_vnni
avx512_bitalg avx512_vpopcntdq rdpid md_clear arch_capabilities
Virtualization: VT-x
Hypervisor vendor: KVM
Virtualization type: full
L1d cache: 768 KiB (24 instances)
L1i cache: 768 KiB (24 instances)
L2 cache: 96 MiB (24 instances)
L3 cache: 384 MiB (24 instances)
NUMA node(s): 1
NUMA node0 CPU(s): 0-23
Vulnerability Itlb multihit: Not affected
Vulnerability L1tf: Not affected
Vulnerability Mds: Not affected
Vulnerability Meltdown: Not affected
Vulnerability Mmio stale data: Vulnerable: Clear CPU buffers
attempted, no microcode; SMT Host state unknown
Vulnerability Retbleed: Not affected
Vulnerability Spec store bypass: Mitigation; Speculative Store Bypass
disabled via prctl
Vulnerability Spectre v1: Mitigation; usercopy/swapgs barriers
and __user pointer sanitization
Vulnerability Spectre v2: Vulnerable: eIBRS with unprivileged eBPF
Vulnerability Srbds: Not affected
Vulnerability Tsx async abort: Not affected

> The commit c82a69629c53 migrates a task on an idle cpu when the task
> is the only one running on local cpu but the time spent by this local
> cpu under interrupt or RT context becomes significant (10%-17%)
> I can imagine that 16/24 stream overload your system so load_balance
> doesn't end up in this case and the cpus are busy with several
> threads. On the other hand, 1 stream is small enough to keep your
> system lightly loaded but 8 streams make your system significantly
> loaded to trigger the reduced capacity case but still not overloaded.
>

I see. Makes sense.
1. How can we check this theory? Any suggested tests/experiments?
2. How do you suggest this degradation should be fixed?

Thanks,
Tariq

2023-02-22 08:49:51

by Tariq Toukan

Subject: Re: Bug report: UDP ~20% degradation



On 12/02/2023 13:50, Tariq Toukan wrote:
>
>
> On 08/02/2023 16:12, Vincent Guittot wrote:
>> Hi Tariq,
>>
>> On Wed, 8 Feb 2023 at 12:09, Tariq Toukan <[email protected]> wrote:
>>>
>>> Hi all,
>>>
>>> Our performance verification team spotted a degradation of up to ~20% in
>>> UDP performance, for a specific combination of parameters.
>>>
>>> Our matrix covers several parameters values, like:
>>> IP version: 4/6
>>> MTU: 1500/9000
>>> Msg size: 64/1452/8952 (only when applicable while avoiding ip
>>> fragmentation).
>>> Num of streams: 1/8/16/24.
>>> Num of directions: unidir/bidir.
>>>
>>> Surprisingly, the issue exists only with this specific combination:
>>> 8 streams,
>>> MTU 9000,
>>> Msg size 8952,
>>> both ipv4/6,
>>> bidir.
>>> (in unidir it repros only with ipv4)
>>>
>>> The reproduction is consistent on all the different setups we tested
>>> with.
>>>
>>> Bisect [2] was done between these two points, v5.19 (Good), and v6.0-rc1
>>> (Bad), with ConnectX-6DX NIC.
>>>
>>> c82a69629c53eda5233f13fc11c3c01585ef48a2 is the first bad commit [1].
>>>
>>> We couldn't come up with a good explanation how this patch causes this
>>> issue. We also looked for related changes in the networking/UDP stack,
>>> but nothing looked suspicious.
>>>
>>> Maybe someone here can help with this.
>>> We can provide more details or do further tests/experiments to progress
>>> with the debug.
>>
>> Could you share more details about your system and the cpu topology ?
>>
>
> output for 'lscpu':
>
> Architecture:                    x86_64
> CPU op-mode(s):                  32-bit, 64-bit
> Address sizes:                   40 bits physical, 57 bits virtual
> Byte Order:                      Little Endian
> CPU(s):                          24
> On-line CPU(s) list:             0-23
> Vendor ID:                       GenuineIntel
> BIOS Vendor ID:                  QEMU
> Model name:                      Intel(R) Xeon(R) Platinum 8380 CPU @
> 2.30GHz
> BIOS Model name:                 pc-q35-5.0
> CPU family:                      6
> Model:                           106
> Thread(s) per core:              1
> Core(s) per socket:              1
> Socket(s):                       24
> Stepping:                        6
> BogoMIPS:                        4589.21
> Flags:                           fpu vme de pse tsc msr pae mce cx8 apic
> sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ss syscall nx
> pdpe1gb rdtscp lm constant_tsc arch_perfmon rep_good nopl xtopology
> cpuid tsc_known_freq pni pclmulqdq vmx ssse3 fma cx16 pdcm pcid sse4_1
> sse4_2 x2apic movbe popcnt tsc_deadline_timer aes xsave avx f16c rdrand
> hypervisor lahf_lm abm 3dnowprefetch cpuid_fault invpcid_single ssbd
> ibrs ibpb stibp ibrs_enhanced tpr_shadow vnmi flexpriority ept vpid
> ept_ad fsgsbase tsc_adjust bmi1 avx2 smep bmi2 erms invpcid avx512f
> avx512dq rdseed adx smap avx512ifma clflushopt clwb avx512cd sha_ni
> avx512bw avx512vl xsaveopt xsavec xgetbv1 xsaves wbnoinvd arat
> avx512vbmi umip pku ospke avx512_vbmi2 gfni vaes vpclmulqdq avx512_vnni
> avx512_bitalg avx512_vpopcntdq rdpid md_clear arch_capabilities
> Virtualization:                  VT-x
> Hypervisor vendor:               KVM
> Virtualization type:             full
> L1d cache:                       768 KiB (24 instances)
> L1i cache:                       768 KiB (24 instances)
> L2 cache:                        96 MiB (24 instances)
> L3 cache:                        384 MiB (24 instances)
> NUMA node(s):                    1
> NUMA node0 CPU(s):               0-23
> Vulnerability Itlb multihit:     Not affected
> Vulnerability L1tf:              Not affected
> Vulnerability Mds:               Not affected
> Vulnerability Meltdown:          Not affected
> Vulnerability Mmio stale data:   Vulnerable: Clear CPU buffers
> attempted, no microcode; SMT Host state unknown
> Vulnerability Retbleed:          Not affected
> Vulnerability Spec store bypass: Mitigation; Speculative Store Bypass
> disabled via prctl
> Vulnerability Spectre v1:        Mitigation; usercopy/swapgs barriers
> and __user pointer sanitization
> Vulnerability Spectre v2:        Vulnerable: eIBRS with unprivileged eBPF
> Vulnerability Srbds:             Not affected
> Vulnerability Tsx async abort:   Not affected
>
>> The commit  c82a69629c53 migrates a task on an idle cpu when the task
>> is the only one running on local cpu but the time spent by this local
>> cpu under interrupt or RT context becomes significant (10%-17%)
>> I can imagine that 16/24 stream overload your system so load_balance
>> doesn't end up in this case and the cpus are busy with several
>> threads. On the other hand, 1 stream is small enough to keep your
>> system lightly loaded but 8 streams make your system significantly
>> loaded to trigger the reduced capacity case but still not overloaded.
>>
>
> I see. Makes sense.
> 1. How do you check this theory? Any suggested tests/experiments?
> 2. How do you suggest this degradation should be fixed?
>

Hi,
A kind reminder.

2023-02-22 16:52:10

by Vincent Guittot

Subject: Re: Bug report: UDP ~20% degradation

On Wed, 22 Feb 2023 at 09:49, Tariq Toukan <[email protected]> wrote:
>
>
>
> On 12/02/2023 13:50, Tariq Toukan wrote:
> >
> >
> > On 08/02/2023 16:12, Vincent Guittot wrote:
> >> Hi Tariq,
> >>
> >> On Wed, 8 Feb 2023 at 12:09, Tariq Toukan <[email protected]> wrote:
> >>>
> >>> Hi all,
> >>>
> >>> Our performance verification team spotted a degradation of up to ~20% in
> >>> UDP performance, for a specific combination of parameters.
> >>>
> >>> Our matrix covers several parameters values, like:
> >>> IP version: 4/6
> >>> MTU: 1500/9000
> >>> Msg size: 64/1452/8952 (only when applicable while avoiding ip
> >>> fragmentation).
> >>> Num of streams: 1/8/16/24.
> >>> Num of directions: unidir/bidir.
> >>>
> >>> Surprisingly, the issue exists only with this specific combination:
> >>> 8 streams,
> >>> MTU 9000,
> >>> Msg size 8952,
> >>> both ipv4/6,
> >>> bidir.
> >>> (in unidir it repros only with ipv4)
> >>>
> >>> The reproduction is consistent on all the different setups we tested
> >>> with.
> >>>
> >>> Bisect [2] was done between these two points, v5.19 (Good), and v6.0-rc1
> >>> (Bad), with ConnectX-6DX NIC.
> >>>
> >>> c82a69629c53eda5233f13fc11c3c01585ef48a2 is the first bad commit [1].
> >>>
> >>> We couldn't come up with a good explanation how this patch causes this
> >>> issue. We also looked for related changes in the networking/UDP stack,
> >>> but nothing looked suspicious.
> >>>
> >>> Maybe someone here can help with this.
> >>> We can provide more details or do further tests/experiments to progress
> >>> with the debug.
> >>
> >> Could you share more details about your system and the cpu topology ?
> >>
> >
> > output for 'lscpu':
> >
> > Architecture: x86_64
> > CPU op-mode(s): 32-bit, 64-bit
> > Address sizes: 40 bits physical, 57 bits virtual
> > Byte Order: Little Endian
> > CPU(s): 24
> > On-line CPU(s) list: 0-23
> > Vendor ID: GenuineIntel
> > BIOS Vendor ID: QEMU
> > Model name: Intel(R) Xeon(R) Platinum 8380 CPU @
> > 2.30GHz
> > BIOS Model name: pc-q35-5.0
> > CPU family: 6
> > Model: 106
> > Thread(s) per core: 1
> > Core(s) per socket: 1
> > Socket(s): 24
> > Stepping: 6
> > BogoMIPS: 4589.21
> > Flags: fpu vme de pse tsc msr pae mce cx8 apic
> > sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ss syscall nx
> > pdpe1gb rdtscp lm constant_tsc arch_perfmon rep_good nopl xtopology
> > cpuid tsc_known_freq pni pclmulqdq vmx ssse3 fma cx16 pdcm pcid sse4_1
> > sse4_2 x2apic movbe popcnt tsc_deadline_timer aes xsave avx f16c rdrand
> > hypervisor lahf_lm abm 3dnowprefetch cpuid_fault invpcid_single ssbd
> > ibrs ibpb stibp ibrs_enhanced tpr_shadow vnmi flexpriority ept vpid
> > ept_ad fsgsbase tsc_adjust bmi1 avx2 smep bmi2 erms invpcid avx512f
> > avx512dq rdseed adx smap avx512ifma clflushopt clwb avx512cd sha_ni
> > avx512bw avx512vl xsaveopt xsavec xgetbv1 xsaves wbnoinvd arat
> > avx512vbmi umip pku ospke avx512_vbmi2 gfni vaes vpclmulqdq avx512_vnni
> > avx512_bitalg avx512_vpopcntdq rdpid md_clear arch_capabilities
> > Virtualization: VT-x
> > Hypervisor vendor: KVM
> > Virtualization type: full
> > L1d cache: 768 KiB (24 instances)
> > L1i cache: 768 KiB (24 instances)
> > L2 cache: 96 MiB (24 instances)
> > L3 cache: 384 MiB (24 instances)
> > NUMA node(s): 1
> > NUMA node0 CPU(s): 0-23
> > Vulnerability Itlb multihit: Not affected
> > Vulnerability L1tf: Not affected
> > Vulnerability Mds: Not affected
> > Vulnerability Meltdown: Not affected
> > Vulnerability Mmio stale data: Vulnerable: Clear CPU buffers
> > attempted, no microcode; SMT Host state unknown
> > Vulnerability Retbleed: Not affected
> > Vulnerability Spec store bypass: Mitigation; Speculative Store Bypass
> > disabled via prctl
> > Vulnerability Spectre v1: Mitigation; usercopy/swapgs barriers
> > and __user pointer sanitization
> > Vulnerability Spectre v2: Vulnerable: eIBRS with unprivileged eBPF
> > Vulnerability Srbds: Not affected
> > Vulnerability Tsx async abort: Not affected
> >
> >> The commit c82a69629c53 migrates a task on an idle cpu when the task
> >> is the only one running on local cpu but the time spent by this local
> >> cpu under interrupt or RT context becomes significant (10%-17%)
> >> I can imagine that 16/24 stream overload your system so load_balance
> >> doesn't end up in this case and the cpus are busy with several
> >> threads. On the other hand, 1 stream is small enough to keep your
> >> system lightly loaded but 8 streams make your system significantly
> >> loaded to trigger the reduced capacity case but still not overloaded.
> >>
> >
> > I see. Makes sense.
> > 1. How do you check this theory? Any suggested tests/experiments?

Could you get some statistics about the threads involved in your tests,
like the number of migrations, for example?

Or a trace, but that might be a lot of data. I haven't tried to
reproduce this on a local system yet.

You could be falling into a situation where the tasks of your benchmark
are periodically moved to the next CPU that becomes idle.

Which cpufreq driver and governor are you using? Could you also check
the average frequency of your CPUs? Another cause could be that we
spread tasks and IRQs across different CPUs, which then triggers a
frequency decrease.
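
One possible way to collect the migration counts, as a sketch: on kernels
that expose the scheduler debug stats in /proc/<pid>/sched, the
se.nr_migrations counter can be sampled before and after a run for each
benchmark thread (perf stat -e migrations on the benchmark processes
should give similar data). The small helper below is only an illustration
of reading that counter, not something taken from this thread:

/* Print the nr_migrations line(s) from /proc/<pid>/sched.
 * Usage: ./nr_migrations <pid> */
#include <stdio.h>
#include <string.h>

int main(int argc, char **argv)
{
	char path[64], line[256];
	FILE *f;

	if (argc != 2) {
		fprintf(stderr, "usage: %s <pid>\n", argv[0]);
		return 1;
	}
	snprintf(path, sizeof(path), "/proc/%s/sched", argv[1]);
	f = fopen(path, "r");
	if (!f) {
		perror("fopen");
		return 1;
	}
	while (fgets(line, sizeof(line), f)) {
		if (strstr(line, "nr_migrations"))
			fputs(line, stdout); /* e.g. "se.nr_migrations : 42" */
	}
	fclose(f);
	return 0;
}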

> > 2. How do you suggest this degradation should be fixed?
> >
>
> Hi,
> A kind reminder.

2023-04-05 13:26:57

by Thorsten Leemhuis

Subject: Re: Bug report: UDP ~20% degradation

Hi, Thorsten here, the Linux kernel's regression tracker. Top-posting
for once, to make this easily accessible to everyone.

Tariq Toukan: it looks like you never provided the data Vincent asked
for (see below). Did you stop caring, did this discussion continue
somewhere else (it doesn't look like it on lore), did the problem vanish,
or was it fixed somehow? For now I assume it's one of the two latter
options and will stop tracking this. If that was a bad assumption and the
issue is worth continuing to track, please let me know -- otherwise
consider this a "JFYI" mail.

#regzbot inconclusive: lack of data to debug, as it looks like the
reporter stopped caring
#regzbot ignore-activity

Ciao, Thorsten (wearing his 'the Linux kernel's regression tracker' hat)
--
Everything you wanna know about Linux kernel regression tracking:
https://linux-regtracking.leemhuis.info/about/#tldr
If I did something stupid, please tell me, as explained on that page.

On 22.02.23 17:51, Vincent Guittot wrote:
> On Wed, 22 Feb 2023 at 09:49, Tariq Toukan <[email protected]> wrote:
>> On 12/02/2023 13:50, Tariq Toukan wrote:
>>> On 08/02/2023 16:12, Vincent Guittot wrote:
>>>> On Wed, 8 Feb 2023 at 12:09, Tariq Toukan <[email protected]> wrote:
>>>>>
>>>>> Our performance verification team spotted a degradation of up to ~20% in
>>>>> UDP performance, for a specific combination of parameters.
>>>>>
>>>>> Our matrix covers several parameters values, like:
>>>>> IP version: 4/6
>>>>> MTU: 1500/9000
>>>>> Msg size: 64/1452/8952 (only when applicable while avoiding ip
>>>>> fragmentation).
>>>>> Num of streams: 1/8/16/24.
>>>>> Num of directions: unidir/bidir.
>>>>>
>>>>> Surprisingly, the issue exists only with this specific combination:
>>>>> 8 streams,
>>>>> MTU 9000,
>>>>> Msg size 8952,
>>>>> both ipv4/6,
>>>>> bidir.
>>>>> (in unidir it repros only with ipv4)
>>>>>
>>>>> The reproduction is consistent on all the different setups we tested
>>>>> with.
>>>>>
>>>>> Bisect [2] was done between these two points, v5.19 (Good), and v6.0-rc1
>>>>> (Bad), with ConnectX-6DX NIC.
>>>>>
>>>>> c82a69629c53eda5233f13fc11c3c01585ef48a2 is the first bad commit [1].
>>>>>
>>>>> We couldn't come up with a good explanation how this patch causes this
>>>>> issue. We also looked for related changes in the networking/UDP stack,
>>>>> but nothing looked suspicious.
>>>>>
>>>>> Maybe someone here can help with this.
>>>>> We can provide more details or do further tests/experiments to progress
>>>>> with the debug.
>>>>
>>>> Could you share more details about your system and the cpu topology ?
>>>>
>>>
>>> output for 'lscpu':
>>>
>>> Architecture: x86_64
>>> CPU op-mode(s): 32-bit, 64-bit
>>> Address sizes: 40 bits physical, 57 bits virtual
>>> Byte Order: Little Endian
>>> CPU(s): 24
>>> On-line CPU(s) list: 0-23
>>> Vendor ID: GenuineIntel
>>> BIOS Vendor ID: QEMU
>>> Model name: Intel(R) Xeon(R) Platinum 8380 CPU @
>>> 2.30GHz
>>> BIOS Model name: pc-q35-5.0
>>> CPU family: 6
>>> Model: 106
>>> Thread(s) per core: 1
>>> Core(s) per socket: 1
>>> Socket(s): 24
>>> Stepping: 6
>>> BogoMIPS: 4589.21
>>> Flags: fpu vme de pse tsc msr pae mce cx8 apic
>>> sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ss syscall nx
>>> pdpe1gb rdtscp lm constant_tsc arch_perfmon rep_good nopl xtopology
>>> cpuid tsc_known_freq pni pclmulqdq vmx ssse3 fma cx16 pdcm pcid sse4_1
>>> sse4_2 x2apic movbe popcnt tsc_deadline_timer aes xsave avx f16c rdrand
>>> hypervisor lahf_lm abm 3dnowprefetch cpuid_fault invpcid_single ssbd
>>> ibrs ibpb stibp ibrs_enhanced tpr_shadow vnmi flexpriority ept vpid
>>> ept_ad fsgsbase tsc_adjust bmi1 avx2 smep bmi2 erms invpcid avx512f
>>> avx512dq rdseed adx smap avx512ifma clflushopt clwb avx512cd sha_ni
>>> avx512bw avx512vl xsaveopt xsavec xgetbv1 xsaves wbnoinvd arat
>>> avx512vbmi umip pku ospke avx512_vbmi2 gfni vaes vpclmulqdq avx512_vnni
>>> avx512_bitalg avx512_vpopcntdq rdpid md_clear arch_capabilities
>>> Virtualization: VT-x
>>> Hypervisor vendor: KVM
>>> Virtualization type: full
>>> L1d cache: 768 KiB (24 instances)
>>> L1i cache: 768 KiB (24 instances)
>>> L2 cache: 96 MiB (24 instances)
>>> L3 cache: 384 MiB (24 instances)
>>> NUMA node(s): 1
>>> NUMA node0 CPU(s): 0-23
>>> Vulnerability Itlb multihit: Not affected
>>> Vulnerability L1tf: Not affected
>>> Vulnerability Mds: Not affected
>>> Vulnerability Meltdown: Not affected
>>> Vulnerability Mmio stale data: Vulnerable: Clear CPU buffers
>>> attempted, no microcode; SMT Host state unknown
>>> Vulnerability Retbleed: Not affected
>>> Vulnerability Spec store bypass: Mitigation; Speculative Store Bypass
>>> disabled via prctl
>>> Vulnerability Spectre v1: Mitigation; usercopy/swapgs barriers
>>> and __user pointer sanitization
>>> Vulnerability Spectre v2: Vulnerable: eIBRS with unprivileged eBPF
>>> Vulnerability Srbds: Not affected
>>> Vulnerability Tsx async abort: Not affected
>>>
>>>> The commit c82a69629c53 migrates a task on an idle cpu when the task
>>>> is the only one running on local cpu but the time spent by this local
>>>> cpu under interrupt or RT context becomes significant (10%-17%)
>>>> I can imagine that 16/24 stream overload your system so load_balance
>>>> doesn't end up in this case and the cpus are busy with several
>>>> threads. On the other hand, 1 stream is small enough to keep your
>>>> system lightly loaded but 8 streams make your system significantly
>>>> loaded to trigger the reduced capacity case but still not overloaded.
>>>>
>>>
>>> I see. Makes sense.
>>> 1. How do you check this theory? Any suggested tests/experiments?
>
> Could you get some statistics about the threads involved in your tests
> ? Like the number of migrations as an example.
>
> Or a trace but that might be a lot of data. I haven't tried to
> reproduce this on a local system yet.
>
> You can fall in the situation where the tasks of your bench are
> periodically moved to the next cpu becoming idle.
>
> Which cpufreq driver and governor are you using ? Could you also check
> the average frequency of your cpu ? Another cause could be that we
> spread tasks and irq on different cpus which then trigger a freq
> decrease
>
>>> 2. How do you suggest this degradation should be fixed?
>>>
>>
>> Hi,
>> A kind reminder.