Hi,
I'm trying to find the first bad commit that led to a decreased
network outgoing speed.
And every time I come to a huge merge [Merge tag 'usb-6.8-rc1' of
git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/usb]
I have already triple-checked all my answers and speed measurements.
I don't understand where I'm making a mistake.
Let's try to figure it out together.
Input data:
Two computers connected 1Gbps link.
Both have the same hardware.
Network: RTL8125 2.5GbE Controller (rev 05)
When I copy files from one computer to another and kernel snapshot
builded from commit 296455ade1fd I have 97-110MB/sec which is almost
max speed of 1Gbps link.
When I move to commit 9d1694dc91ce I have only 66-70MB/sec which is
significantly slower.
I bisected the issue by measuring network speed on each step.
I save all results to file [1]
[1] file is attached as a zip archive.
# first bad commit: [8c94ccc7cd691472461448f98e2372c75849406c] Merge
tag 'usb-6.8-rc1' of
git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/usb
--
Best Regards,
Mike Gavrilov.
On Sat, 03 Feb, 2024 06:02:15 +0500 Mikhail Gavrilov <[email protected]> wrote:
> Hi,
> I'm trying to find the first bad commit that led to a decreased
> network outgoing speed.
> And every time I come to a huge merge [Merge tag 'usb-6.8-rc1' of
> git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/usb]
> I have already triple-checked all my answers and speed measurements.
> I don't understand where I'm making a mistake.
Have you tried using --first-parent when you git bisect to see if that
helps you find the culprit aside from the merge commit you keep hitting?
https://git-scm.com/docs/git-bisect#Documentation/git-bisect.txt---first-parent
>
> Let's try to figure it out together.
>
> Input data:
> Two computers connected 1Gbps link.
> Both have the same hardware.
> Network: RTL8125 2.5GbE Controller (rev 05)
>
> When I copy files from one computer to another and kernel snapshot
> builded from commit 296455ade1fd I have 97-110MB/sec which is almost
> max speed of 1Gbps link.
> When I move to commit 9d1694dc91ce I have only 66-70MB/sec which is
> significantly slower.
>
> I bisected the issue by measuring network speed on each step.
> I save all results to file [1]
>
> [1] file is attached as a zip archive.
>
> # first bad commit: [8c94ccc7cd691472461448f98e2372c75849406c] Merge
> tag 'usb-6.8-rc1' of
> git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/usb
--
Thanks,
Rahul Rameshbabu
Hi,
On 2/2/24 17:02, Mikhail Gavrilov wrote:
> Hi,
> I'm trying to find the first bad commit that led to a decreased
> network outgoing speed.
> And every time I come to a huge merge [Merge tag 'usb-6.8-rc1' of
> git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/usb]
> I have already triple-checked all my answers and speed measurements.
> I don't understand where I'm making a mistake.
>
> Let's try to figure it out together.
>
> Input data:
> Two computers connected 1Gbps link.
> Both have the same hardware.
> Network: RTL8125 2.5GbE Controller (rev 05)
>
> When I copy files from one computer to another and kernel snapshot
> builded from commit 296455ade1fd I have 97-110MB/sec which is almost
> max speed of 1Gbps link.
> When I move to commit 9d1694dc91ce I have only 66-70MB/sec which is
> significantly slower.
>
> I bisected the issue by measuring network speed on each step.
> I save all results to file [1]
>
> [1] file is attached as a zip archive.
>
> # first bad commit: [8c94ccc7cd691472461448f98e2372c75849406c] Merge
> tag 'usb-6.8-rc1' of
> git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/usb
a. Do you clean the object files between each test run?
or at least clean net/* and drivers/net/ethernet/* ?
b. I am far from a git expert, but in the bisects that I have
done, after each test run, I just say
$ git bisect good
or
$ git bisect bad
It looks like you are typing
$ git bisect [good | bad] hashID
Is that correct?
Anyway, I am interested in your outcome just to learn
how to handle this problem.
Good luck.
--
#Randy
[correct the netdev mailing list address]
On 2/2/24 17:15, Randy Dunlap wrote:
> Hi,
>
> On 2/2/24 17:02, Mikhail Gavrilov wrote:
>> Hi,
>> I'm trying to find the first bad commit that led to a decreased
>> network outgoing speed.
>> And every time I come to a huge merge [Merge tag 'usb-6.8-rc1' of
>> git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/usb]
>> I have already triple-checked all my answers and speed measurements.
>> I don't understand where I'm making a mistake.
>>
>> Let's try to figure it out together.
>>
>> Input data:
>> Two computers connected 1Gbps link.
>> Both have the same hardware.
>> Network: RTL8125 2.5GbE Controller (rev 05)
>>
>> When I copy files from one computer to another and kernel snapshot
>> builded from commit 296455ade1fd I have 97-110MB/sec which is almost
>> max speed of 1Gbps link.
>> When I move to commit 9d1694dc91ce I have only 66-70MB/sec which is
>> significantly slower.
>>
>> I bisected the issue by measuring network speed on each step.
>> I save all results to file [1]
>>
>> [1] file is attached as a zip archive.
>>
>> # first bad commit: [8c94ccc7cd691472461448f98e2372c75849406c] Merge
>> tag 'usb-6.8-rc1' of
>> git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/usb
>
> a. Do you clean the object files between each test run?
> or at least clean net/* and drivers/net/ethernet/* ?
>
> b. I am far from a git expert, but in the bisects that I have
> done, after each test run, I just say
> $ git bisect good
> or
> $ git bisect bad
>
> It looks like you are typing
> $ git bisect [good | bad] hashID
>
> Is that correct?
>
> Anyway, I am interested in your outcome just to learn
> how to handle this problem.
>
> Good luck.
>
--
#Randy
Thanks for the forward!
On Fri, 2 Feb 2024 17:16:41 -0800 Randy Dunlap wrote:
> >> When I copy files from one computer to another and kernel snapshot
> >> builded from commit 296455ade1fd I have 97-110MB/sec which is almost
> >> max speed of 1Gbps link.
> >> When I move to commit 9d1694dc91ce I have only 66-70MB/sec which is
> >> significantly slower.
There isn't that much networking in between the two.
Is any of the CPU cores at 100% when you are transferring the data on
the bad commit?
Do you have any iptables / nftables rules?
Are you using TLS in the transfer?
Did you try reverting f1172f3ee3a98754?
On Sat, Feb 03, 2024 at 06:02:15AM +0500, Mikhail Gavrilov wrote:
> Hi,
> I'm trying to find the first bad commit that led to a decreased
> network outgoing speed.
> And every time I come to a huge merge [Merge tag 'usb-6.8-rc1' of
> git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/usb]
> I have already triple-checked all my answers and speed measurements.
> I don't understand where I'm making a mistake.
>
> Let's try to figure it out together.
>
> Input data:
> Two computers connected 1Gbps link.
> Both have the same hardware.
> Network: RTL8125 2.5GbE Controller (rev 05)
>
> When I copy files from one computer to another and kernel snapshot
> builded from commit 296455ade1fd I have 97-110MB/sec which is almost
> max speed of 1Gbps link.
> When I move to commit 9d1694dc91ce I have only 66-70MB/sec which is
> significantly slower.
>
> I bisected the issue by measuring network speed on each step.
> I save all results to file [1]
>
> [1] file is attached as a zip archive.
>
> # first bad commit: [8c94ccc7cd691472461448f98e2372c75849406c] Merge
> tag 'usb-6.8-rc1' of
> git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/usb
So (simplified) the change history looks something like this:
branch point 296455ade1fd (good) 9d1694dc91ce (merge, bad)
| | |
----- * ----------- * ----------------------- * -------------- <- master
\ /
----------- * --------------------- <- development branch
|
Bug introduced here
The straight line is Linus' master branch. At some point in the
past a development branch was forked from Linus's branch (the
lower line). During that development the regression was introduced
and when the branch was merged back the master branch started to
have the regression.
To further pin point the bug in the development branch you'll have to
bisect along the lower line.
So how do you do this:
Look at the merge:
$ git cat-file -p 9d1694dc91ce | head -n3
tree d9093aecb9261cccaea1f0a58887fcd9db542172
parent e9a5a78d1ad8ceb4e3df6d6ad93360094c84ac40
parent b2e792ae883a0aa976d4176dfa7dc933263440ea
So the merge commit has two parent commits, one is on the master
branch, the other is on the development branch. To find out which
of the parents is on the development branch you can ask git to find
the common ancestor of the two parent commits and your good commit
on the master branch:
$ git merge-base 296455ade1fd e9a5a78d1ad8ceb4e3df6d6ad93360094c84ac40
296455ade1fdcf5f8f8c033201633b60946c589a
$ git merge-base 296455ade1fd b2e792ae883a0aa976d4176dfa7dc933263440ea
587371ed783b046f22ba7a5e1cc9a19ae35123b4
So the second parent is not on the master branch and the merge base
(i.e. the point where the development branch was forked from the master
branch) is 587371ed783b046f22ba7a5e1cc9a19ae35123b4.
So I'd assume that 587371ed783b046f22ba7a5e1cc9a19ae35123b4 is a good
commit and b2e792ae883a0aa976d4176dfa7dc933263440ea is a bad commit.
You can verify this and then start bisecting like this for better
results:
$ git bisect good 587371ed783b046f22ba7a5e1cc9a19ae35123b4
$ git bisect bad b2e792ae883a0aa976d4176dfa7dc933263440ea
regards Christian
Hi,
[ sorry, replying to myself ]
On Sat, Feb 03, 2024 at 07:20:47PM +0100, Christian A. Ehrhardt wrote:
> On Sat, Feb 03, 2024 at 06:02:15AM +0500, Mikhail Gavrilov wrote:
> > Hi,
> > I'm trying to find the first bad commit that led to a decreased
> > network outgoing speed.
> > And every time I come to a huge merge [Merge tag 'usb-6.8-rc1' of
> > git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/usb]
> > I have already triple-checked all my answers and speed measurements.
> > I don't understand where I'm making a mistake.
> >
> > Let's try to figure it out together.
> >
> > Input data:
> > Two computers connected 1Gbps link.
> > Both have the same hardware.
> > Network: RTL8125 2.5GbE Controller (rev 05)
> >
> > When I copy files from one computer to another and kernel snapshot
> > builded from commit 296455ade1fd I have 97-110MB/sec which is almost
> > max speed of 1Gbps link.
> > When I move to commit 9d1694dc91ce I have only 66-70MB/sec which is
> > significantly slower.
> >
> > I bisected the issue by measuring network speed on each step.
> > I save all results to file [1]
> >
> > [1] file is attached as a zip archive.
> >
> > # first bad commit: [8c94ccc7cd691472461448f98e2372c75849406c] Merge
> > tag 'usb-6.8-rc1' of
> > git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/usb
>
> So (simplified) the change history looks something like this:
> [ ... ]
Sorry, I was looking at the wrong merge commit and when using
the commit pinpointed by your bisect your log shows
that _both_ parents of the bad merge commit are marked as good
which is somewhat strange.
However, it should be possible to bisect further if you do a rebase
like this:
$ git cat-file -p
8c94ccc7cd691472461448f98e2372c75849406c | head -n 3
tree d3907cad2a1fbbbcf71847274fdbdcf5a2aeb9a2
parent bd736f38c014ba70ba7ec3bdc6af6fe5368d6612
parent 933bb7b878ddd0f8c094db45551a7daddf806e00
$ git branch m bd736f38c014ba70ba7ec3bdc6af6fe5368d6612
$ git branch d933bb7b878ddd0f8c094db45551a7daddf806e000
$ git checkout d
Updating files: 100% (11666/11666), done.
Switched to branch 'd'
$ git rebase m
Successfully rebased and updated refs/heads/d.
Now, "m" must be good as per your bisect log and "d" must be bad
because it is the same tree as the bad merge commit (8c94ccc7cd69).
Due to the rebase there's a liner history between the two, thus
starting a bisect like this might yield more information:
$ git bisect good m
$ git bisect bad d
regards Christian
On Mon, Feb 5, 2024 at 1:47 AM Christian A. Ehrhardt <[email protected]> wrote:
>
>
> Sorry, I was looking at the wrong merge commit and when using
> the commit pinpointed by your bisect your log shows
> that _both_ parents of the bad merge commit are marked as good
> which is somewhat strange.
>
> However, it should be possible to bisect further if you do a rebase
> like this:
>
> $ git cat-file -p
> 8c94ccc7cd691472461448f98e2372c75849406c | head -n 3
> tree d3907cad2a1fbbbcf71847274fdbdcf5a2aeb9a2
> parent bd736f38c014ba70ba7ec3bdc6af6fe5368d6612
> parent 933bb7b878ddd0f8c094db45551a7daddf806e00
> $ git branch m bd736f38c014ba70ba7ec3bdc6af6fe5368d6612
> $ git branch d933bb7b878ddd0f8c094db45551a7daddf806e000
> $ git checkout d
> Updating files: 100% (11666/11666), done.
> Switched to branch 'd'
> $ git rebase m
> Successfully rebased and updated refs/heads/d.
>
> Now, "m" must be good as per your bisect log and "d" must be bad
> because it is the same tree as the bad merge commit (8c94ccc7cd69).
>
> Due to the rebase there's a liner history between the two, thus
> starting a bisect like this might yield more information:
>
> $ git bisect good m
> $ git bisect bad d
>
> regards Christian
>
>
Thanks for real help.
Now I spotted a really bad commit.
57e153dfd0e7a080373fe5853c5609443d97fa5a is the first bad commit
commit 57e153dfd0e7a080373fe5853c5609443d97fa5a
Author: Niklas Neronin <[email protected]>
Date: Fri Dec 1 17:06:40 2023 +0200
xhci: add handler for only one interrupt line
Current xHCI driver only supports one "interrupter", meaning we will
only use one MSI/MSI-X interrupt line. Thus, add handler only to the
first interrupt line.
Signed-off-by: Niklas Neronin <[email protected]>
Co-developed-by: Mathias Nyman <[email protected]>
Signed-off-by: Mathias Nyman <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Greg Kroah-Hartman <[email protected]>
drivers/usb/host/xhci-pci.c | 35 ++++++++++-------------------------
1 file changed, 10 insertions(+), 25 deletions(-)
Niklas, Mathias I spotted decreased network speed on sending when
transferring files via sftp between my workstations in the local
network.
And bisection of issue leads me to this commit.
My motherboard is MPG-B650I-EDGE-WIFI looks like it is related to the
mentioned commit.
https://www.msi.com/Motherboard/MPG-B650I-EDGE-WIFI
--
Best Regards,
Mike Gavrilov.
On 5.2.2024 23.08, Mikhail Gavrilov wrote:
> On Mon, Feb 5, 2024 at 1:47 AM Christian A. Ehrhardt <[email protected]> wrote:
>
> Thanks for real help.
> Now I spotted a really bad commit.
>
> 57e153dfd0e7a080373fe5853c5609443d97fa5a is the first bad commit
> commit 57e153dfd0e7a080373fe5853c5609443d97fa5a
> Author: Niklas Neronin <[email protected]>
> Date: Fri Dec 1 17:06:40 2023 +0200
>
> xhci: add handler for only one interrupt line
>
> Current xHCI driver only supports one "interrupter", meaning we will
> only use one MSI/MSI-X interrupt line. Thus, add handler only to the
> first interrupt line.
>
> Signed-off-by: Niklas Neronin <[email protected]>
> Co-developed-by: Mathias Nyman <[email protected]>
> Signed-off-by: Mathias Nyman <[email protected]>
> Link: https://lore.kernel.org/r/[email protected]
> Signed-off-by: Greg Kroah-Hartman <[email protected]>
>
> drivers/usb/host/xhci-pci.c | 35 ++++++++++-------------------------
> 1 file changed, 10 insertions(+), 25 deletions(-)
>
> Niklas, Mathias I spotted decreased network speed on sending when
> transferring files via sftp between my workstations in the local
> network.
> And bisection of issue leads me to this commit.
> My motherboard is MPG-B650I-EDGE-WIFI looks like it is related to the
> mentioned commit.
> https://www.msi.com/Motherboard/MPG-B650I-EDGE-WIFI
>
This seems odd, not sure how this usb host change would impact your network speed.
Could you try reverting that patch from 6.8-rc1 and see if it helps?
There are some other patches on top of it that needs to be reverted first.
These should be enough:
36b24ebf9a04 xhci: minor coding style cleanup in 'xhci_try_enable_msi()
9831960df237 xhci: rework 'xhci_try_enable_msi()' MSI and MSI-X setup code
dfbf4441f2d3 xhci: change 'msix_count' to encompass MSI or MSI-X vectors
a795f708b284 xhci: refactor static MSI function
74554e9c2276 xhci: refactor static MSI-X function
f977f4c9301c xhci: add handler for only one interrupt line
That patch changes how we request MSI/MSI-X interrupt(s) for xhci.
Is there any change is /proc/interrupts between a good and bad case?
Such as xhci_hcd using MSI-X instead of MSI, or eth0 and xhci_hcd
interrupting on the same CPU?
Thanks
Mathias
On Tue, Feb 6, 2024 at 4:24 PM Mathias Nyman
<[email protected]> wrote:
>
>
> This seems odd, not sure how this usb host change would impact your network speed.
>
> Could you try reverting that patch from 6.8-rc1 and see if it helps?
>
> There are some other patches on top of it that needs to be reverted first.
> These should be enough:
>
> 36b24ebf9a04 xhci: minor coding style cleanup in 'xhci_try_enable_msi()
> 9831960df237 xhci: rework 'xhci_try_enable_msi()' MSI and MSI-X setup code
> dfbf4441f2d3 xhci: change 'msix_count' to encompass MSI or MSI-X vectors
> a795f708b284 xhci: refactor static MSI function
> 74554e9c2276 xhci: refactor static MSI-X function
> f977f4c9301c xhci: add handler for only one interrupt line
I confirm after reverting all listed commits and 57e153dfd0e7
performance of the network returned to theoretical maximum.
> That patch changes how we request MSI/MSI-X interrupt(s) for xhci.
>
> Is there any change is /proc/interrupts between a good and bad case?
> Such as xhci_hcd using MSI-X instead of MSI, or eth0 and xhci_hcd
> interrupting on the same CPU?
On the good kernel I have - 32 xhci_hcd, and bad only - 4.
In both scenarios using PCI-MSIX.
I attached both interrupt output as archives to this message.
[1] https://postimg.cc/zL2RYgYZ
--
Best Regards,
Mike Gavrilov.
On 6.2.2024 18.12, Mikhail Gavrilov wrote:
> On Tue, Feb 6, 2024 at 4:24 PM Mathias Nyman
> <[email protected]> wrote:
>
> I confirm after reverting all listed commits and 57e153dfd0e7
> performance of the network returned to theoretical maximum.
>
>> That patch changes how we request MSI/MSI-X interrupt(s) for xhci.
>>
>> Is there any change is /proc/interrupts between a good and bad case?
>> Such as xhci_hcd using MSI-X instead of MSI, or eth0 and xhci_hcd
>> interrupting on the same CPU?
>
> On the good kernel I have - 32 xhci_hcd, and bad only - 4.
> In both scenarios using PCI-MSIX.
> I attached both interrupt output as archives to this message.
>
Thanks,
Looks like your network adapter ends up interrupting CPU0 in the bad case due
to the change in how many interrupts are requested by xhci_hcd before it.
bad case:
CPU0 CPU1 ... CPU31
87: 18213809 0 ... 0 IR-PCI-MSIX-0000:0e:00.0 0-edge enp14s0
Does manually changing it to some other CPU help? picking one that doesn't already
handle a lot of interrupts. CPU0 could also in general be more busy, possibly spending
more time with interrupts disabled.
For example change to CPU23 in the bad case:
echo 800000 > /proc/irq/87/smp_affinity
Check from proc/interrupts that enp14s0 interrupts actually go to CPU23 after this.
Thanks
Mathias
On 03.02.24 02:02, Mikhail Gavrilov wrote:
> Hi,
> I'm trying to find the first bad commit that led to a decreased
> network outgoing speed.
> And every time I come to a huge merge [Merge tag 'usb-6.8-rc1' of
> git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/usb]
> I have already triple-checked all my answers and speed measurements.
> I don't understand where I'm making a mistake.
>
To be sure the issue doesn't fall through the cracks unnoticed, I'm
adding it to regzbot, the Linux kernel regression tracking bot:
#regzbot ^introduced f977f4c9301c
#regzbot title irq/net/usb: performance decrease now that network device
and xhci share IRQs
#regzbot ignore-activity
Ciao, Thorsten (wearing his 'the Linux kernel's regression tracker' hat)
--
Everything you wanna know about Linux kernel regression tracking:
https://linux-regtracking.leemhuis.info/about/#tldr
That page also explains what to do if mails like this annoy you.