2024-02-08 10:32:40

by Mikhail Gavrilov

Subject: Re: This is the fourth time I’ve tried to find what led to the regression of outgoing network speed and each time I find the merge commit 8c94ccc7cd691472461448f98e2372c75849406c

On Thu, Feb 8, 2024 at 2:23 PM Mathias Nyman
<[email protected]> wrote:
>
> My guess is that CPU0 spends more time with interrupts disabled than other CPUs.
> Either because it's handling interrupts from some other hardware, or running
> code that disables interrupts (for example kernel code inside spin_lock_irq),
> and thus not able to handle network adapter interrupts at the same rate as CPU23
>

Can this be fixed?
Can I help you here with anything else?

--
Best Regards,
Mike Gavrilov.
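
One way to look into the guess quoted above is to see which interrupt sources land on CPU0 by reading /proc/interrupts. The following is a minimal sketch, not something posted in this thread; the helper names `parse_interrupts` and `busiest_sources` are made up for illustration. Note that interrupt counts only show which sources are serviced by which CPU; they do not measure how long a CPU runs with interrupts disabled.

```python
def parse_interrupts(text):
    """Parse /proc/interrupts-style text into (cpu_names, rows).

    rows maps the IRQ label (for example "87" or "NMI") to a tuple of
    (per_cpu_counts, description).
    """
    lines = text.splitlines()
    cpus = lines[0].split()  # header line: CPU0 CPU1 ...
    rows = {}
    for line in lines[1:]:
        if ":" not in line:
            continue
        label, rest = line.split(":", 1)
        fields = rest.split()
        counts = []
        for field in fields[:len(cpus)]:
            if not field.isdigit():
                break
            counts.append(int(field))
        # Some rows (such as ERR) carry a single total instead of one
        # count per CPU; skip anything that does not cover every CPU.
        if len(counts) != len(cpus):
            continue
        rows[label.strip()] = (counts, " ".join(fields[len(cpus):]))
    return cpus, rows


def busiest_sources(cpus, rows, cpu_name, top=5):
    """Return up to `top` (count, label, description) entries for one CPU."""
    idx = cpus.index(cpu_name)
    ranked = sorted(
        ((counts[idx], label, desc) for label, (counts, desc) in rows.items()),
        reverse=True,
    )
    return [entry for entry in ranked[:top] if entry[0] > 0]


# Possible usage on a Linux system:
#
#     with open("/proc/interrupts") as handle:
#         cpus, rows = parse_interrupts(handle.read())
#     for count, label, desc in busiest_sources(cpus, rows, "CPU0"):
#         print(count, label, desc)
```

Running this twice a few seconds apart and comparing the counts would show which sources are currently active on a CPU, rather than the totals since boot.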


2024-02-08 15:42:34

by Mathias Nyman

Subject: Re: This is the fourth time I’ve tried to find what led to the regression of outgoing network speed and each time I find the merge commit 8c94ccc7cd691472461448f98e2372c75849406c

On 8.2.2024 12.32, Mikhail Gavrilov wrote:
> On Thu, Feb 8, 2024 at 2:23 PM Mathias Nyman
> <[email protected]> wrote:
>>
>> My guess is that CPU0 spends more time with interrupts disabled than other CPUs.
>> Either because it's handling interrupts from some other hardware, or running
>> code that disables interrupts (for example kernel code inside spin_lock_irq),
>> and thus not able to handle network adapter interrupts at the same rate as CPU23
>>
>
> Can this be fixed?

Not sure, I'm not that familiar with this area.
Maybe running irqbalance could help?

Thanks
Mathias

Subject: Re: This is the fourth time I’ve tried to find what led to the regression of outgoing network speed and each time I find the merge commit 8c94ccc7cd691472461448f98e2372c75849406c

[CCing the regression list, as it should be in the loop for regressions:
https://docs.kernel.org/admin-guide/reporting-regressions.html]

On 08.02.24 16:43, Mathias Nyman wrote:
> On 8.2.2024 12.32, Mikhail Gavrilov wrote:
>> On Thu, Feb 8, 2024 at 2:23 PM Mathias Nyman
>> <[email protected]> wrote:
>>>
>>> My guess is that CPU0 spends more time with interrupts disabled than
>>> other CPUs.
>>> Either because it's handling interrupts from some other hardware, or
>>> running
>>> code that disables interrupts (for example kernel code inside
>>> spin_lock_irq),
>>> and thus not able to handle network adapter interrupts at the same
>>> rate as CPU23
>>
>> Can this be fixed?
>
> Not sure, I'm not that familiar with this area.
> Maybe running irqbalance could help?

Mikhail, what's the status of this? I wonder if I should track this as a
regression to ensure Linus is aware of this.

Ciao, Thorsten (wearing his 'the Linux kernel's regression tracker' hat)
--
Everything you wanna know about Linux kernel regression tracking:
https://linux-regtracking.leemhuis.info/about/#tldr
If I did something stupid, please tell me, as explained on that page.

2024-02-19 09:43:42

by Mikhail Gavrilov

Subject: Re: This is the fourth time I’ve tried to find what led to the regression of outgoing network speed and each time I find the merge commit 8c94ccc7cd691472461448f98e2372c75849406c

On Thu, Feb 8, 2024 at 8:42 PM Mathias Nyman
<[email protected]> wrote:
>
> On 8.2.2024 12.32, Mikhail Gavrilov wrote:
> > On Thu, Feb 8, 2024 at 2:23 PM Mathias Nyman
> > <[email protected]> wrote:
> >>
> >> My guess is that CPU0 spends more time with interrupts disabled than other CPUs.
> >> Either because it's handling interrupts from some other hardware, or running
> >> code that disables interrupts (for example kernel code inside spin_lock_irq),
> >> and thus not able to handle network adapter interrupts at the same rate as CPU23
> >>
> >
> > Can this be fixed?
>
> Not sure, I'm not that familiar with this area.
> Maybe running irqbalance could help?

I installed irqbalance daemon and nothing changed.
So who is responsible for irq balancing?

--
Best Regards,
Mike Gavrilov.


Attachments:
measuaments-irqbalance.zip (1.19 kB)
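
To check whether irqbalance (or anything else) has actually moved an interrupt, its current affinity can be read from procfs. The sketch below assumes the kernel exposes /proc/irq/<N>/smp_affinity_list, which holds a CPU list such as "0-3,8". The helper names are hypothetical, and only plain numbers and ranges are handled.

```python
def parse_cpulist(text):
    """Expand a kernel CPU list such as "0-3,8" into a sorted list of ints."""
    cpus = set()
    for part in text.strip().split(","):
        if not part:
            continue
        if "-" in part:
            first, last = part.split("-", 1)
            cpus.update(range(int(first), int(last) + 1))
        else:
            cpus.add(int(part))
    return sorted(cpus)


def irq_affinity(irq, proc_root="/proc"):
    """Return the CPUs an IRQ is allowed to be delivered to."""
    path = f"{proc_root}/irq/{irq}/smp_affinity_list"
    with open(path) as handle:
        return parse_cpulist(handle.read())


# Possible usage on a Linux system (the IRQ number is system specific):
#
#     print(irq_affinity(87))
```

If the list is unchanged before and after starting irqbalance, the daemon did not rebalance that interrupt.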

2024-02-20 23:20:03

by Mikhail Gavrilov

Subject: Re: This is the fourth time I’ve tried to find what led to the regression of outgoing network speed and each time I find the merge commit 8c94ccc7cd691472461448f98e2372c75849406c

On Mon, Feb 19, 2024 at 2:41 PM Mikhail Gavrilov
<[email protected]> wrote:
>
> I installed irqbalance daemon and nothing changed.
> So who is responsible for irq balancing?

Sorry for the noise. Can anyone give me an answer?
Who is responsible for distributing interrupts in Linux?
I spotted network performance regression and it turned out, this was
due to the network card getting other interrupt. It is a side effect
of commit 57e153dfd0e7a080373fe5853c5609443d97fa5a.
Installing irqbalance daemon did not help. Maybe someone experienced
such a problem?

--
Best Regards,
Mike Gavrilov.

2024-02-21 00:59:41

by Randy Dunlap

Subject: Re: This is the fourth time I’ve tried to find what led to the regression of outgoing network speed and each time I find the merge commit 8c94ccc7cd691472461448f98e2372c75849406c

[+ tglx]

On 2/20/24 15:19, Mikhail Gavrilov wrote:
> On Mon, Feb 19, 2024 at 2:41 PM Mikhail Gavrilov
> <[email protected]> wrote:
>>
>> I installed irqbalance daemon and nothing changed.
>> So who is responsible for irq balancing?
>
> Sorry for the noise. Can anyone give me an answer?
> Who is responsible for distributing interrupts in Linux?
> I spotted network performance regression and it turned out, this was
> due to the network card getting other interrupt. It is a side effect
> of commit 57e153dfd0e7a080373fe5853c5609443d97fa5a.

That's a merge commit (AFAIK, maybe not so much). The commit in mainline is:

commit f977f4c9301c
Author: Niklas Neronin <[email protected]>
Date: Fri Dec 1 17:06:40 2023 +0200

xhci: add handler for only one interrupt line

> Installing irqbalance daemon did not help. Maybe someone experienced
> such a problem?
>

Thomas, would you look at this, please?

A network device and xhci (USB) driver are now sharing interrupts.
This causes a large performance decrease for the networking device.

The thread begins here:
https://lore.kernel.org/lkml/CABXGCsNnUfCCYVSb_-j-a-cAdONu1r6Fe8p2OtQ5op_wskOfpw@mail.gmail.com/


motherboard:
"My motherboard is MPG-B650I-EDGE-WIFI looks like it is related to the
mentioned commit.
https://www.msi.com/Motherboard/MPG-B650I-EDGE-WIFI"

network device:
Network: RTL8125 2.5GbE Controller (rev 05)


thanks.
--
#Randy

2024-02-21 06:52:03

by Randy Dunlap

Subject: Re: This is the fourth time I’ve tried to find what led to the regression of outgoing network speed and each time I find the merge commit 8c94ccc7cd691472461448f98e2372c75849406c



On 2/20/24 15:41, Randy Dunlap wrote:
> [+ tglx]

(this time for real)

>
> On 2/20/24 15:19, Mikhail Gavrilov wrote:
>> On Mon, Feb 19, 2024 at 2:41 PM Mikhail Gavrilov
>> <[email protected]> wrote:
>>>
>>> I installed irqbalance daemon and nothing changed.
>>> So who is responsible for irq balancing?
>>
>> Sorry for the noise. Can anyone give me an answer?
>> Who is responsible for distributing interrupts in Linux?
>> I spotted network performance regression and it turned out, this was
>> due to the network card getting other interrupt. It is a side effect
>> of commit 57e153dfd0e7a080373fe5853c5609443d97fa5a.
>
> That's a merge commit (AFAIK, maybe not so much). The commit in mainline is:
>
> commit f977f4c9301c
> Author: Niklas Neronin <[email protected]>
> Date: Fri Dec 1 17:06:40 2023 +0200
>
> xhci: add handler for only one interrupt line
>
>> Installing irqbalance daemon did not help. Maybe someone experienced
>> such a problem?
>>
>
> Thomas, would you look at this, please?
>
> A network device and xhci (USB) driver are now sharing interrupts.
> This causes a large performance decrease for the networking device.
>
> The thread begins here:
> https://lore.kernel.org/lkml/CABXGCsNnUfCCYVSb_-j-a-cAdONu1r6Fe8p2OtQ5op_wskOfpw@mail.gmail.com/
>
>
> motherboard:
> "My motherboard is MPG-B650I-EDGE-WIFI looks like it is related to the
> mentioned commit.
> https://www.msi.com/Motherboard/MPG-B650I-EDGE-WIFI"
>
> network device:
> Network: RTL8125 2.5GbE Controller (rev 05)
>
>
> thanks.

--
#Randy

2024-02-21 16:10:25

by Mathias Nyman

Subject: Re: This is the fourth time I’ve tried to find what led to the regression of outgoing network speed and each time I find the merge commit 8c94ccc7cd691472461448f98e2372c75849406c

On 21.2.2024 1.43, Randy Dunlap wrote:
>
>
> On 2/20/24 15:41, Randy Dunlap wrote:
>> [+ tglx]
>
> (this time for real)
>
>>
>> On 2/20/24 15:19, Mikhail Gavrilov wrote:
>>> On Mon, Feb 19, 2024 at 2:41 PM Mikhail Gavrilov
>>> <[email protected]> wrote:
>>>>
>>>> I installed irqbalance daemon and nothing changed.
>>>> So who is responsible for irq balancing?
>>>
>>> Sorry for the noise. Can anyone give me an answer?
>>> Who is responsible for distributing interrupts in Linux?
>>> I spotted network performance regression and it turned out, this was
>>> due to the network card getting other interrupt. It is a side effect
>>> of commit 57e153dfd0e7a080373fe5853c5609443d97fa5a.
>>
>> That's a merge commit (AFAIK, maybe not so much). The commit in mainline is:
>>
>> commit f977f4c9301c
>> Author: Niklas Neronin <[email protected]>
>> Date: Fri Dec 1 17:06:40 2023 +0200
>>
>> xhci: add handler for only one interrupt line
>>
>>> Installing irqbalance daemon did not help. Maybe someone experienced
>>> such a problem?
>>>
>>
>> Thomas, would you look at this, please?
>>
>> A network device and xhci (USB) driver are now sharing interrupts.
>> This causes a large performance decrease for the networking device.

Short recap:

xhci (USB) and network device didn't share interrupts, or even interrupt the
same CPU in either good or bad case.

A change in how many interrupts xhci driver requests changed which CPU
the network device interrupts.

In the bad case Mikhail Gavrilov's network device was interrupting CPU0
together with:
- IR-IO-APIC 2-edge timer
- IR-PCI-MSIX-0000:07:00.0 1-edge nvme1q1

In the good case network device was interrupting CPU27 together with:
- IR-PCI-MSIX-0000:04:00.0 27-edge nvme0q27
- IR-PCI-MSIX-0000:07:00.0 28-edge nvme1q28

Manually moving network device irq 87 from CPU0 to CPU23 helped.
(echo 800000 > /proc/irq/87/smp_affinity)

Thanks
-Mathias
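
For reference, the value written to smp_affinity in the recap above is a hexadecimal CPU bitmask: bit N set means CPU N is allowed, so 800000 (hex) selects CPU23. The sketch below shows the conversion in both directions. The function names are made up for illustration, and the comma-separated 32-bit grouping used for masks wider than 32 CPUs is handled as the kernel prints it.

```python
def cpus_to_affinity_mask(cpus):
    """Format CPU numbers as a hex mask for /proc/irq/N/smp_affinity."""
    bits = 0
    for cpu in cpus:
        bits |= 1 << cpu
    hex_digits = f"{bits:x}"
    # Masks wider than 32 bits are written as comma-separated groups of
    # eight hex digits, most significant group first.
    groups = []
    while hex_digits:
        groups.append(hex_digits[-8:])
        hex_digits = hex_digits[:-8]
    return ",".join(reversed(groups))


def affinity_mask_to_cpus(mask):
    """Expand a hex affinity mask back into a sorted list of CPU numbers."""
    bits = 0
    for group in mask.strip().split(","):
        # Each comma-separated group is one 32-bit word.
        bits = (bits << 32) | int(group, 16)
    return [cpu for cpu in range(bits.bit_length()) if (bits >> cpu) & 1]


# Possible usage:
#
#     print(cpus_to_affinity_mask([23]))
#     print(affinity_mask_to_cpus("800000"))
```

Writing the resulting string to /proc/irq/<N>/smp_affinity requires root, and the setting does not persist across reboots.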


Subject: Re: This is the fourth time I've tried to find what led to the regression of outgoing network speed and each time I find the merge commit 8c94ccc7cd691472461448f98e2372c75849406c

On 21.02.24 14:44, Mathias Nyman wrote:
> On 21.2.2024 1.43, Randy Dunlap wrote:
>> On 2/20/24 15:41, Randy Dunlap wrote:
>>> [+ tglx]
>>> On 2/20/24 15:19, Mikhail Gavrilov wrote:
>>>> On Mon, Feb 19, 2024 at 2:41 PM Mikhail Gavrilov
>>>> <[email protected]> wrote:
>>>> I spotted network performance regression and it turned out, this was
>>>> due to the network card getting other interrupt. It is a side effect
>>>> of commit 57e153dfd0e7a080373fe5853c5609443d97fa5a.
>>> That's a merge commit (AFAIK, maybe not so much). The commit in
>>> mainline is:
>>>
>>> commit f977f4c9301c
>>> Author: Niklas Neronin <[email protected]>
>>> Date:   Fri Dec 1 17:06:40 2023 +0200
>>>
>>>      xhci: add handler for only one interrupt line
>>>
>>>> Installing irqbalance daemon did not help. Maybe someone experienced
>>>> such a problem?
>>>
>>> Thomas, would you look at this, please?
>>>
>>> A network device and xhci (USB) driver are now sharing interrupts.
>>> This causes a large performance decrease for the networking device.
>
> Short recap:

Thx for that. As the 6.8 release is merely two or three weeks away while
a fix is nowhere near in sight yet (afaics!) I start to wonder if we
should consider a revert here and try reapplying the culprit in a later
cycle when this problem is fixed.

Mathias, would that be an option? Or is there still hope that we see a
fix for this regression before the release of 6.8?

Ciao, Thorsten (wearing his 'the Linux kernel's regression tracker' hat)
--
Everything you wanna know about Linux kernel regression tracking:
https://linux-regtracking.leemhuis.info/about/#tldr
If I did something stupid, please tell me, as explained on that page.

#regzbot poke

> xhci (USB) and network device didn't share interrupts, or even interrupt
> the
> same CPU in either good or bad case.
>
> A change in how many interrupts xhci driver requests changed which CPU
> the network device interrupts.
>
> In the bad case Mikhail Gavrilov's network device was interrupting CPU0
> together with:
> - IR-IO-APIC    2-edge      timer
> - IR-PCI-MSIX-0000:07:00.0    1-edge      nvme1q1
>
> In the good case network device was interrupting CPU27 together with:
> - IR-PCI-MSIX-0000:04:00.0   27-edge      nvme0q27
> - IR-PCI-MSIX-0000:07:00.0   28-edge      nvme1q28
>
> Manually moving network device irq 87 from CPU0 to CPU23 helped.
> (echo 800000 > /proc/irq/87/smp_affinity)
>
> Thanks
> -Mathias
>

2024-02-26 12:20:24

by Mathias Nyman

Subject: Re: This is the fourth time I've tried to find what led to the regression of outgoing network speed and each time I find the merge commit 8c94ccc7cd691472461448f98e2372c75849406c

On 26.2.2024 7.45, Linux regression tracking (Thorsten Leemhuis) wrote:
> On 21.02.24 14:44, Mathias Nyman wrote:
>> On 21.2.2024 1.43, Randy Dunlap wrote:
>>> On 2/20/24 15:41, Randy Dunlap wrote:
>>>> [+ tglx]
>>>> On 2/20/24 15:19, Mikhail Gavrilov wrote:
>>>>> On Mon, Feb 19, 2024 at 2:41 PM Mikhail Gavrilov
>>>>> <[email protected]> wrote:
>>>>> I spotted network performance regression and it turned out, this was
>>>>> due to the network card getting other interrupt. It is a side effect
>>>>> of commit 57e153dfd0e7a080373fe5853c5609443d97fa5a.
>>>> That's a merge commit (AFAIK, maybe not so much). The commit in
>>>> mainline is:
>>>>
>>>> commit f977f4c9301c
>>>> Author: Niklas Neronin <[email protected]>
>>>> Date:   Fri Dec 1 17:06:40 2023 +0200
>>>>
>>>>      xhci: add handler for only one interrupt line
>>>>
>>>>> Installing irqbalance daemon did not help. Maybe someone experienced
>>>>> such a problem?
>>>>
>>>> Thomas, would you look at this, please?
>>>>
>>>> A network device and xhci (USB) driver are now sharing interrupts.
>>>> This causes a large performance decrease for the networking device.
>>
>> Short recap:
>
> Thx for that. As the 6.8 release is merely two or three weeks away while
> a fix is nowhere near in sight yet (afaics!) I start to wonder if we
> should consider a revert here and try reapplying the culprit in a later
> cycle when this problem is fixed.

I don't think reverting this series is a solution.

This isn't really about those usb xhci patches.
This is about which interrupt gets assigned to which CPU.

Mikhail got unlucky when the network adapter interrupts on that system were
assigned to CPU0, clearly a more "clogged" CPU, thus causing a drop in max
bandwidth.

Thanks
Mathias