2023-09-27 13:40:33

by Vishvambar Panth S

[permalink] [raw]
Subject: [PATCH net] net: microchip: lan743x : bidirectional throughuput improvement

The LAN743x/PCI11xxx DMA descriptors are always 4 dwords long, but the
device supports placing the descriptors in memory back to back or
reserving space in between them using its DMA_DESCRIPTOR_SPACE (DSPACE)
configurable hardware setting. Currently DSPACE is unnecessarily set to
match the host's L1 cache line size, resulting in space reserved in
between descriptors in most platforms and causing a suboptimal behavior
(single PCIe Mem transaction per descriptor). By changing the setting
to DSPACE=16 many descriptors can be packed in a single PCIe Mem
transaction resulting in a massive performance improvement in
bidirectional tests without any negative effects.
Tested and verified improvements on x64 PC and several ARM platforms
(typical data below)

Test setup 1: x64 PC with LAN7430 ---> x64 PC

iperf3 UDP bidirectional with DSPACE set to L1 CACHE Size:
- - - - - - - - - - - - - - - - - - - - - - - - -
[ ID][Role] Interval Transfer Bitrate
[ 5][TX-C] 0.00-10.00 sec 170 MBytes 143 Mbits/sec sender
[ 5][TX-C] 0.00-10.04 sec 169 MBytes 141 Mbits/sec receiver
[ 7][RX-C] 0.00-10.00 sec 1.02 GBytes 876 Mbits/sec sender
[ 7][RX-C] 0.00-10.04 sec 1.02 GBytes 870 Mbits/sec receiver

iperf3 UDP bidirectional with DSPACE set to 16 Bytes
- - - - - - - - - - - - - - - - - - - - - - - - -
[ ID][Role] Interval Transfer Bitrate
[ 5][TX-C] 0.00-10.00 sec 1.11 GBytes 956 Mbits/sec sender
[ 5][TX-C] 0.00-10.04 sec 1.11 GBytes 951 Mbits/sec receiver
[ 7][RX-C] 0.00-10.00 sec 1.10 GBytes 948 Mbits/sec sender
[ 7][RX-C] 0.00-10.04 sec 1.10 GBytes 942 Mbits/sec receiver

Test setup 2 : RK3399 with LAN7430 ---> x64 PC

RK3399 Spec:
The SOM-RK3399 is ARM module designed and developed by FriendlyElec.
Cores: 64-bit Dual Core Cortex-A72 + Quad Core Cortex-A53
Frequency: Cortex-A72(up to 2.0GHz), Cortex-A53(up to 1.5GHz)
PCIe: PCIe x4, compatible with PCIe 2.1, Dual operation mode

iperf3 UDP bidirectional with DSPACE set to L1 CACHE Size:
- - - - - - - - - - - - - - - - - - - - - - - - -
[ ID][Role] Interval Transfer Bitrate
[ 5][TX-C] 0.00-10.00 sec 534 MBytes 448 Mbits/sec sender
[ 5][TX-C] 0.00-10.05 sec 534 MBytes 446 Mbits/sec receiver
[ 7][RX-C] 0.00-10.00 sec 1.12 GBytes 961 Mbits/sec sender
[ 7][RX-C] 0.00-10.05 sec 1.11 GBytes 946 Mbits/sec receiver

iperf3 UDP bidirectional with DSPACE set to 16 Bytes
- - - - - - - - - - - - - - - - - - - - - - - - -
[ ID][Role] Interval Transfer Bitrate
[ 5][TX-C] 0.00-10.00 sec 966 MBytes 810 Mbits/sec sender
[ 5][TX-C] 0.00-10.04 sec 965 MBytes 806 Mbits/sec receiver
[ 7][RX-C] 0.00-10.00 sec 1.11 GBytes 956 Mbits/sec sender
[ 7][RX-C] 0.00-10.04 sec 1.07 GBytes 919 Mbits/sec receiver

Fixes: 23f0703c125b ("lan743x: Add main source files for new lan743x driver")
Signed-off-by: Vishvambar Panth S <[email protected]>
---
drivers/net/ethernet/microchip/lan743x_main.h | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/drivers/net/ethernet/microchip/lan743x_main.h b/drivers/net/ethernet/microchip/lan743x_main.h
index 52609fc13ad9..6dac6fef7d24 100644
--- a/drivers/net/ethernet/microchip/lan743x_main.h
+++ b/drivers/net/ethernet/microchip/lan743x_main.h
@@ -1067,7 +1067,7 @@ struct lan743x_adapter {
#define DMA_DESCRIPTOR_SPACING_32 (32)
#define DMA_DESCRIPTOR_SPACING_64 (64)
#define DMA_DESCRIPTOR_SPACING_128 (128)
-#define DEFAULT_DMA_DESCRIPTOR_SPACING (L1_CACHE_BYTES)
+#define DEFAULT_DMA_DESCRIPTOR_SPACING (DMA_DESCRIPTOR_SPACING_16)

#define DMAC_CHANNEL_STATE_SET(start_bit, stop_bit) \
(((start_bit) ? 2 : 0) | ((stop_bit) ? 1 : 0))
--
2.25.1


2023-09-30 14:41:07

by Jacob Keller

[permalink] [raw]
Subject: Re: [PATCH net] net: microchip: lan743x : bidirectional throughuput improvement



On 9/27/2023 4:16 AM, Vishvambar Panth S wrote:
> The LAN743x/PCI11xxx DMA descriptors are always 4 dwords long, but the
> device supports placing the descriptors in memory back to back or
> reserving space in between them using its DMA_DESCRIPTOR_SPACE (DSPACE)
> configurable hardware setting. Currently DSPACE is unnecessarily set to
> match the host's L1 cache line size, resulting in space reserved in
> between descriptors in most platforms and causing a suboptimal behavior
> (single PCIe Mem transaction per descriptor). By changing the setting
> to DSPACE=16 many descriptors can be packed in a single PCIe Mem
> transaction resulting in a massive performance improvement in
> bidirectional tests without any negative effects.
> Tested and verified improvements on x64 PC and several ARM platforms
> (typical data below)
>

I assume here the choice of 16 is to get the 16 bytes from 4 dwords?

> Test setup 1: x64 PC with LAN7430 ---> x64 PC
>
> iperf3 UDP bidirectional with DSPACE set to L1 CACHE Size:
> - - - - - - - - - - - - - - - - - - - - - - - - -
> [ ID][Role] Interval Transfer Bitrate
> [ 5][TX-C] 0.00-10.00 sec 170 MBytes 143 Mbits/sec sender
> [ 5][TX-C] 0.00-10.04 sec 169 MBytes 141 Mbits/sec receiver
> [ 7][RX-C] 0.00-10.00 sec 1.02 GBytes 876 Mbits/sec sender
> [ 7][RX-C] 0.00-10.04 sec 1.02 GBytes 870 Mbits/sec receiver
>
> iperf3 UDP bidirectional with DSPACE set to 16 Bytes
> - - - - - - - - - - - - - - - - - - - - - - - - -
> [ ID][Role] Interval Transfer Bitrate
> [ 5][TX-C] 0.00-10.00 sec 1.11 GBytes 956 Mbits/sec sender
> [ 5][TX-C] 0.00-10.04 sec 1.11 GBytes 951 Mbits/sec receiver
> [ 7][RX-C] 0.00-10.00 sec 1.10 GBytes 948 Mbits/sec sender
> [ 7][RX-C] 0.00-10.04 sec 1.10 GBytes 942 Mbits/sec receiver
>

Going from barely transmitting to hitting the line rate *and* improving
both Tx and Rx slightly is fantastic. Huge win just avoiding all that
unnecessary wasted PCIe bandwidth.

> Test setup 2 : RK3399 with LAN7430 ---> x64 PC
>
> RK3399 Spec:
> The SOM-RK3399 is ARM module designed and developed by FriendlyElec.
> Cores: 64-bit Dual Core Cortex-A72 + Quad Core Cortex-A53
> Frequency: Cortex-A72(up to 2.0GHz), Cortex-A53(up to 1.5GHz)
> PCIe: PCIe x4, compatible with PCIe 2.1, Dual operation mode
>
> iperf3 UDP bidirectional with DSPACE set to L1 CACHE Size:
> - - - - - - - - - - - - - - - - - - - - - - - - -
> [ ID][Role] Interval Transfer Bitrate
> [ 5][TX-C] 0.00-10.00 sec 534 MBytes 448 Mbits/sec sender
> [ 5][TX-C] 0.00-10.05 sec 534 MBytes 446 Mbits/sec receiver
> [ 7][RX-C] 0.00-10.00 sec 1.12 GBytes 961 Mbits/sec sender
> [ 7][RX-C] 0.00-10.05 sec 1.11 GBytes 946 Mbits/sec receiver
>
> iperf3 UDP bidirectional with DSPACE set to 16 Bytes
> - - - - - - - - - - - - - - - - - - - - - - - - -
> [ ID][Role] Interval Transfer Bitrate
> [ 5][TX-C] 0.00-10.00 sec 966 MBytes 810 Mbits/sec sender
> [ 5][TX-C] 0.00-10.04 sec 965 MBytes 806 Mbits/sec receiver
> [ 7][RX-C] 0.00-10.00 sec 1.11 GBytes 956 Mbits/sec sender
> [ 7][RX-C] 0.00-10.04 sec 1.07 GBytes 919 Mbits/sec receiver
>
> Fixes: 23f0703c125b ("lan743x: Add main source files for new lan743x driver")
> Signed-off-by: Vishvambar Panth S <[email protected]>
> ---

Always fun to see a massive win like this! Even just the 2x improvement
on the RK3399 is huge, let alone 6x improvement on the x86 platform.

Reviewed-by: Jacob Keller <[email protected]>

> drivers/net/ethernet/microchip/lan743x_main.h | 2 +-
> 1 file changed, 1 insertion(+), 1 deletion(-)
>
> diff --git a/drivers/net/ethernet/microchip/lan743x_main.h b/drivers/net/ethernet/microchip/lan743x_main.h
> index 52609fc13ad9..6dac6fef7d24 100644
> --- a/drivers/net/ethernet/microchip/lan743x_main.h
> +++ b/drivers/net/ethernet/microchip/lan743x_main.h
> @@ -1067,7 +1067,7 @@ struct lan743x_adapter {
> #define DMA_DESCRIPTOR_SPACING_32 (32)
> #define DMA_DESCRIPTOR_SPACING_64 (64)
> #define DMA_DESCRIPTOR_SPACING_128 (128)
> -#define DEFAULT_DMA_DESCRIPTOR_SPACING (L1_CACHE_BYTES)
> +#define DEFAULT_DMA_DESCRIPTOR_SPACING (DMA_DESCRIPTOR_SPACING_16)
>
> #define DMAC_CHANNEL_STATE_SET(start_bit, stop_bit) \
> (((start_bit) ? 2 : 0) | ((stop_bit) ? 1 : 0))

2023-10-04 19:23:56

by Jakub Kicinski

[permalink] [raw]
Subject: Re: [PATCH net] net: microchip: lan743x : bidirectional throughuput improvement

On Wed, 27 Sep 2023 16:46:23 +0530 Vishvambar Panth S wrote:
> The LAN743x/PCI11xxx DMA descriptors are always 4 dwords long, but the
> device supports placing the descriptors in memory back to back or
> reserving space in between them using its DMA_DESCRIPTOR_SPACE (DSPACE)
> configurable hardware setting. Currently DSPACE is unnecessarily set to
> match the host's L1 cache line size, resulting in space reserved in
> between descriptors in most platforms and causing a suboptimal behavior
> (single PCIe Mem transaction per descriptor). By changing the setting
> to DSPACE=16 many descriptors can be packed in a single PCIe Mem
> transaction resulting in a massive performance improvement in
> bidirectional tests without any negative effects.
> Tested and verified improvements on x64 PC and several ARM platforms
> (typical data below)

Nobody complained for 5 years, and it's not a regression.
Let's not treat this as a fix, please repost without the Fixes tag for
net-next.
--
pw-bot: cr

2023-10-04 20:02:41

by Florian Fainelli

[permalink] [raw]
Subject: Re: [PATCH net] net: microchip: lan743x : bidirectional throughuput improvement

On 10/4/23 12:20, Jakub Kicinski wrote:
> On Wed, 27 Sep 2023 16:46:23 +0530 Vishvambar Panth S wrote:
>> The LAN743x/PCI11xxx DMA descriptors are always 4 dwords long, but the
>> device supports placing the descriptors in memory back to back or
>> reserving space in between them using its DMA_DESCRIPTOR_SPACE (DSPACE)
>> configurable hardware setting. Currently DSPACE is unnecessarily set to
>> match the host's L1 cache line size, resulting in space reserved in
>> between descriptors in most platforms and causing a suboptimal behavior
>> (single PCIe Mem transaction per descriptor). By changing the setting
>> to DSPACE=16 many descriptors can be packed in a single PCIe Mem
>> transaction resulting in a massive performance improvement in
>> bidirectional tests without any negative effects.
>> Tested and verified improvements on x64 PC and several ARM platforms
>> (typical data below)
>
> Nobody complained for 5 years, and it's not a regression.
> Let's not treat this as a fix, please repost without the Fixes tag for
> net-next.

As a driver maintainer, you may want to provide some guarantees to your
end users/customers that from stable version X.Y.Z the performance
issues have been fixed. Performance improvements are definitively border
line in terms of being considered as bug fixes though.
--
Florian

2023-10-04 20:10:04

by Jakub Kicinski

[permalink] [raw]
Subject: Re: [PATCH net] net: microchip: lan743x : bidirectional throughuput improvement

On Wed, 4 Oct 2023 13:02:17 -0700 Florian Fainelli wrote:
> > Nobody complained for 5 years, and it's not a regression.
> > Let's not treat this as a fix, please repost without the Fixes tag for
> > net-next.
>
> As a driver maintainer, you may want to provide some guarantees to your
> end users/customers that from stable version X.Y.Z the performance
> issues have been fixed. Performance improvements are definitively border
> line in terms of being considered as bug fixes though.

I understand that, but too often people just "feel like a device which
advertises X Mbps / Gbps should reach line rate" while no end user
cares.

Luckily stable rules are pretty clear about this (search for
"performance"):
https://docs.kernel.org/process/stable-kernel-rules.html

As posted it doesn't fulfill the requirements ????️

2023-10-05 15:53:14

by Vishvambar Panth S

[permalink] [raw]
Subject: Re: [PATCH net] net: microchip: lan743x : bidirectional throughuput improvement

On Fri, 2023-09-29 at 10:35 -0700, Jacob Keller wrote:
> EXTERNAL EMAIL: Do not click links or open attachments unless you
> know the content is safe
>
> On 9/27/2023 4:16 AM, Vishvambar Panth S wrote:
> > The LAN743x/PCI11xxx DMA descriptors are always 4 dwords long, but
> > the
> > device supports placing the descriptors in memory back to back or
> > reserving space in between them using its DMA_DESCRIPTOR_SPACE
> > (DSPACE)
> > configurable hardware setting. Currently DSPACE is unnecessarily
> > set to
> > match the host's L1 cache line size, resulting in space reserved in
> > between descriptors in most platforms and causing a suboptimal
> > behavior
> > (single PCIe Mem transaction per descriptor). By changing the
> > setting
> > to DSPACE=16 many descriptors can be packed in a single PCIe Mem
> > transaction resulting in a massive performance improvement in
> > bidirectional tests without any negative effects.
> > Tested and verified improvements on x64 PC and several ARM
> > platforms
> > (typical data below)
> >
>
> I assume here the choice of 16 is to get the 16 bytes from 4 dwords?

Thanks for the comments. Yes, it is set to 16 bytes to match the size
of the descriptors which are 4 dwords long.
>
> > Test setup 1: x64 PC with LAN7430 ---> x64 PC
> >
> > iperf3 UDP bidirectional with DSPACE set to L1 CACHE Size:
> > - - - - - - - - - - - - - - - - - - - - - - - - -
> > [ ID][Role] Interval           Transfer     Bitrate
> > [  5][TX-C]   0.00-10.00  sec   170 MBytes   143 Mbits/sec  sender
> > [  5][TX-C]   0.00-10.04  sec   169 MBytes   141 Mbits/sec 
> > receiver
> > [  7][RX-C]   0.00-10.00  sec  1.02 GBytes   876 Mbits/sec  sender
> > [  7][RX-C]   0.00-10.04  sec  1.02 GBytes   870 Mbits/sec 
> > receiver
> >
> > iperf3 UDP bidirectional with DSPACE set to 16 Bytes
> > - - - - - - - - - - - - - - - - - - - - - - - - -
> > [ ID][Role] Interval           Transfer     Bitrate
> > [  5][TX-C]   0.00-10.00  sec  1.11 GBytes   956 Mbits/sec  sender
> > [  5][TX-C]   0.00-10.04  sec  1.11 GBytes   951 Mbits/sec 
> > receiver
> > [  7][RX-C]   0.00-10.00  sec  1.10 GBytes   948 Mbits/sec  sender
> > [  7][RX-C]   0.00-10.04  sec  1.10 GBytes   942 Mbits/sec 
> > receiver
> >
>
> Going from barely transmitting to hitting the line rate *and*
> improving
> both Tx and Rx slightly is fantastic. Huge win just avoiding all that
> unnecessary wasted PCIe bandwidth.
>
> > Test setup 2 : RK3399 with LAN7430 ---> x64 PC
> >
> > RK3399 Spec:
> > The SOM-RK3399 is ARM module designed and developed by
> > FriendlyElec.
> > Cores: 64-bit Dual Core Cortex-A72 + Quad Core Cortex-A53
> > Frequency: Cortex-A72(up to 2.0GHz), Cortex-A53(up to 1.5GHz)
> > PCIe: PCIe x4, compatible with PCIe 2.1, Dual operation mode
> >
> > iperf3 UDP bidirectional with DSPACE set to L1 CACHE Size:
> > - - - - - - - - - - - - - - - - - - - - - - - - -
> > [ ID][Role] Interval           Transfer     Bitrate
> > [  5][TX-C]   0.00-10.00  sec   534 MBytes   448 Mbits/sec  sender
> > [  5][TX-C]   0.00-10.05  sec   534 MBytes   446 Mbits/sec 
> > receiver
> > [  7][RX-C]   0.00-10.00  sec  1.12 GBytes   961 Mbits/sec  sender
> > [  7][RX-C]   0.00-10.05  sec  1.11 GBytes   946 Mbits/sec 
> > receiver
> >
> > iperf3 UDP bidirectional with DSPACE set to 16 Bytes
> > - - - - - - - - - - - - - - - - - - - - - - - - -
> > [ ID][Role] Interval           Transfer     Bitrate
> > [  5][TX-C]   0.00-10.00  sec   966 MBytes   810 Mbits/sec   sender
> > [  5][TX-C]   0.00-10.04  sec   965 MBytes   806 Mbits/sec  
> > receiver
> > [  7][RX-C]   0.00-10.00  sec  1.11 GBytes   956 Mbits/sec   sender
> > [  7][RX-C]   0.00-10.04  sec  1.07 GBytes   919 Mbits/sec  
> > receiver
> >
> > Fixes: 23f0703c125b ("lan743x: Add main source files for new
> > lan743x driver")
> > Signed-off-by: Vishvambar Panth S <[email protected]>
> > ---
>
> Always fun to see a massive win like this! Even just the 2x
> improvement
> on the RK3399 is huge, let alone 6x improvement on the x86 platform.
>
> Reviewed-by: Jacob Keller <[email protected]>
>
> >  drivers/net/ethernet/microchip/lan743x_main.h | 2 +-
> >  1 file changed, 1 insertion(+), 1 deletion(-)
> >
> > diff --git a/drivers/net/ethernet/microchip/lan743x_main.h
> > b/drivers/net/ethernet/microchip/lan743x_main.h
> > index 52609fc13ad9..6dac6fef7d24 100644
> > --- a/drivers/net/ethernet/microchip/lan743x_main.h
> > +++ b/drivers/net/ethernet/microchip/lan743x_main.h
> > @@ -1067,7 +1067,7 @@ struct lan743x_adapter {
> >  #define DMA_DESCRIPTOR_SPACING_32       (32)
> >  #define DMA_DESCRIPTOR_SPACING_64       (64)
> >  #define DMA_DESCRIPTOR_SPACING_128      (128)
> > -#define DEFAULT_DMA_DESCRIPTOR_SPACING  (L1_CACHE_BYTES)
> > +#define DEFAULT_DMA_DESCRIPTOR_SPACING 
> > (DMA_DESCRIPTOR_SPACING_16)
> >
> >  #define DMAC_CHANNEL_STATE_SET(start_bit, stop_bit) \
> >       (((start_bit) ? 2 : 0) | ((stop_bit) ? 1 : 0))

2023-11-01 07:21:23

by Vishvambar Panth S

[permalink] [raw]
Subject: Re: [PATCH net] net: microchip: lan743x : bidirectional throughuput improvement

On Wed, 2023-10-04 at 13:09 -0700, Jakub Kicinski wrote:
> EXTERNAL EMAIL: Do not click links or open attachments unless you
> know the content is safe
>
> On Wed, 4 Oct 2023 13:02:17 -0700 Florian Fainelli wrote:
> > > Nobody complained for 5 years, and it's not a regression.
> > > Let's not treat this as a fix, please repost without the Fixes
> > > tag for
> > > net-next.
> >
> > As a driver maintainer, you may want to provide some guarantees to
> > your
> > end users/customers that from stable version X.Y.Z the performance
> > issues have been fixed. Performance improvements are definitively
> > border
> > line in terms of being considered as bug fixes though.
>
> I understand that, but too often people just "feel like a device
> which
> advertises X Mbps / Gbps should reach line rate" while no end user
> cares.
>
> Luckily stable rules are pretty clear about this (search for
> "performance"):
> https://docs.kernel.org/process/stable-kernel-rules.html
>
> As posted it doesn't fulfill the requirements ????️

Thanks for your feedback. I apologize for the delayed response.

The data presented in the patch description was aimed to convince a
reviewer with the visible impact of the performance boosts in both x64
and ARM platforms. However, the main motivation behind the patch was
not merely a "good-to-have" improvement but a solution to the
throughput issues reported by multiple customers in several platforms.
We received lots of customer requests through our ticket site system
urging us to address the performance issues on multiple kernel versions
including LTS. While it's acknowledged that stable branch rules
typically do not consider performance fixes that are not documented in
public Bugzilla, this performance enhancement is essential to many of
our customers and their end users and we believe should therefore be
considered for stable branch on the basis of it’s visible user impact.

Few issues reported by our customers are mentioned below, even though
these issues have existed for a long time, the data presented below is
collected from the customer within last 3 months.

Customer-A using lan743x with Hisilicon- Kirin 990 processor in 5.10
kernel, reported a mere ~300Mbps in Rx UDP. The fix significantly
improved the performance to ~900Mbps Rx in their platform.

Customer-B using lan743x with v5.10 has an issue with Tx UDP being only
157Mbps in their platform. Including the fix in the patch boosts the
performance to ~600Mbps in Tx UDP.

Customer-C using lan743x with ADAS Ref Design in v5.10 reported UDP
Tx/Rx to be 126/723 Mbps and the fix improved the performance to
828/956 Mbps.

Customer-D using lan743x with Qcom 6490 with v5.4 wanted improvements
for their platform from UDP Rx 200Mbps. The fix along with few other
changes helped us to bring Rx perf to 800Mbps in customer’s platform

This is a kind request for considering the acceptance of this patch
into the net branch, as it has a significant positive impact on users
and does not have any adverse effects.

Thanks,
Vishvambar Panth S


2023-11-16 05:49:56

by Vishvambar Panth S

[permalink] [raw]
Subject: Re: [PATCH net] net: microchip: lan743x : bidirectional throughuput improvement

On Thu, 2023-11-09 at 15:04 -0800, Jakub Kicinski wrote:
> EXTERNAL EMAIL: Do not click links or open attachments unless you
> know the content is safe
>
> On Thu, 9 Nov 2023 10:53:26 +0000 [email protected]
> wrote:
> > Thanks for your feedback. I apologize for the delayed response.
> >
> > The data presented in the patch description was aimed to convince a
> > reviewer with the visible impact of the performance boosts in both
> > x64
> > and ARM platforms. However, the main motivation behind the patch
> > was
> > not merely a "good-to-have" improvement but a solution to the
> > throughput issues reported by multiple customers in several
> > platforms.
> > We received lots of customer requests through our ticket site
> > system
> > urging us to address the performance issues on multiple kernel
> > versions
> > including LTS. While it's acknowledged that stable branch rules
> > typically do not consider performance fixes that are not documented
> > in
> > public Bugzilla, this performance enhancement is essential to many
> > of
> > our customers and their end users and we believe should therefore
> > be
> > considered for stable branch on the basis of it’s visible user
> > impact.
> > Few issues reported by our customers are mentioned below, even
> > though
> > these issues have existed for a long time, the data presented below
> > is
> > collected from the customer within last 3 months.
> >
> > Customer-A using lan743x with Hisilicon- Kirin 990 processor in
> > 5.10
> > kernel, reported a mere ~300Mbps in Rx UDP. The fix significantly
> > improved the performance to ~900Mbps Rx in their platform.
> >
> > Customer-B using lan743x with v5.10 has an issue with Tx UDP being
> > only
> > 157Mbps in their platform. Including the fix in the patch boosts
> > the
> > performance to ~600Mbps in Tx UDP.
> >
> > Customer-C using lan743x with ADAS Ref Design in v5.10 reported UDP
> > Tx/Rx to be 126/723 Mbps and the fix improved the performance to
> > 828/956 Mbps.
> >
> > Customer-D using lan743x with Qcom 6490 with v5.4 wanted
> > improvements
> > for their platform from UDP Rx 200Mbps. The fix along with few
> > other
> > changes helped us to bring Rx perf to 800Mbps in customer’s
> > platform
> >
> > This is a kind request for considering the acceptance of this patch
> > into the net branch, as it has a significant positive impact on
> > users
> > and does not have any adverse effects.
>
> Thanks a lot for the details. Unfortunately after further
> consideration
> I can't accept this patch as a fix with clear conscience. The code
> has
> been this way for a long time, performance improvements should end up
> in new kernels and people who want to benefit from faster kernels
> should
> not be sticking to old LTS releases.
>
> So please repost for net-next next week, when it's open again.

Hi Jakub,
Thanks for your inputs. Have submitted this patch to the net-next
branch.