2022-03-08 18:11:29

by Horatiu Vultur

[permalink] [raw]
Subject: [PATCH net-next] net: lan966x: Improve the CPU TX bitrate.

Replace 'readx_poll_timeout_atomic' with a simple while loop + timeout
when checking if it is possible to write to the HW the next word of the
frame.
Doing this will improve the TX bitrate by 65%. The measurements were
done using iperf3.

Before:
[ ID] Interval Transfer Bitrate Retr
[ 5] 0.00-10.03 sec 55.2 MBytes 46.2 Mbits/sec 0 sender
[ 5] 0.00-10.04 sec 53.8 MBytes 45.0 Mbits/sec receiver

After:
[ ID] Interval Transfer Bitrate Retr
[ 5] 0.00-10.02 sec 91.4 MBytes 76.6 Mbits/sec 0 sender
[ 5] 0.00-10.03 sec 90.2 MBytes 75.5 Mbits/sec receiver

Signed-off-by: Horatiu Vultur <[email protected]>
---
.../ethernet/microchip/lan966x/lan966x_main.c | 20 ++++++++++---------
1 file changed, 11 insertions(+), 9 deletions(-)

diff --git a/drivers/net/ethernet/microchip/lan966x/lan966x_main.c b/drivers/net/ethernet/microchip/lan966x/lan966x_main.c
index 81c01665d01e..f6cef29b9d36 100644
--- a/drivers/net/ethernet/microchip/lan966x/lan966x_main.c
+++ b/drivers/net/ethernet/microchip/lan966x/lan966x_main.c
@@ -176,18 +176,20 @@ static int lan966x_port_stop(struct net_device *dev)
return 0;
}

-static int lan966x_port_inj_status(struct lan966x *lan966x)
-{
- return lan_rd(lan966x, QS_INJ_STATUS);
-}
-
static int lan966x_port_inj_ready(struct lan966x *lan966x, u8 grp)
{
- u32 val;
+ unsigned long time = jiffies + usecs_to_jiffies(READL_TIMEOUT_US);
+ int ret = 0;

- return readx_poll_timeout_atomic(lan966x_port_inj_status, lan966x, val,
- QS_INJ_STATUS_FIFO_RDY_GET(val) & BIT(grp),
- READL_SLEEP_US, READL_TIMEOUT_US);
+ while (!(lan_rd(lan966x, QS_INJ_STATUS) &
+ QS_INJ_STATUS_FIFO_RDY_SET(BIT(grp)))) {
+ if (time_after(jiffies, time)) {
+ ret = -ETIMEDOUT;
+ break;
+ }
+ }
+
+ return ret;
}

static int lan966x_port_ifh_xmit(struct sk_buff *skb,
--
2.33.0


2022-03-08 23:48:36

by Horatiu Vultur

[permalink] [raw]
Subject: Re: [PATCH net-next] net: lan966x: Improve the CPU TX bitrate.

The 03/08/2022 22:36, Andrew Lunn wrote:
>
> > static int lan966x_port_inj_ready(struct lan966x *lan966x, u8 grp)
> > {
> > - u32 val;
> > + unsigned long time = jiffies + usecs_to_jiffies(READL_TIMEOUT_US);
> > + int ret = 0;
> >
> > - return readx_poll_timeout_atomic(lan966x_port_inj_status, lan966x, val,
> > - QS_INJ_STATUS_FIFO_RDY_GET(val) & BIT(grp),
> > - READL_SLEEP_US, READL_TIMEOUT_US);
> > + while (!(lan_rd(lan966x, QS_INJ_STATUS) &
> > + QS_INJ_STATUS_FIFO_RDY_SET(BIT(grp)))) {
> > + if (time_after(jiffies, time)) {
> > + ret = -ETIMEDOUT;
> > + break;
> > + }
>
> Did you try setting READL_SLEEP_US to 0? readx_poll_timeout_atomic()
> explicitly supports that.

I have tried but it didn't improve. It was the same as before.

>
> Andrew

--
/Horatiu

2022-03-09 00:18:15

by Andrew Lunn

[permalink] [raw]
Subject: Re: [PATCH net-next] net: lan966x: Improve the CPU TX bitrate.

> static int lan966x_port_inj_ready(struct lan966x *lan966x, u8 grp)
> {
> - u32 val;
> + unsigned long time = jiffies + usecs_to_jiffies(READL_TIMEOUT_US);
> + int ret = 0;
>
> - return readx_poll_timeout_atomic(lan966x_port_inj_status, lan966x, val,
> - QS_INJ_STATUS_FIFO_RDY_GET(val) & BIT(grp),
> - READL_SLEEP_US, READL_TIMEOUT_US);
> + while (!(lan_rd(lan966x, QS_INJ_STATUS) &
> + QS_INJ_STATUS_FIFO_RDY_SET(BIT(grp)))) {
> + if (time_after(jiffies, time)) {
> + ret = -ETIMEDOUT;
> + break;
> + }

Did you try setting READL_SLEEP_US to 0? readx_poll_timeout_atomic()
explicitly supports that.

Andrew

2022-03-09 00:53:42

by David Laight

[permalink] [raw]
Subject: RE: [PATCH net-next] net: lan966x: Improve the CPU TX bitrate.

From: Horatiu Vultur
> Sent: 08 March 2022 22:30
>
> The 03/08/2022 22:36, Andrew Lunn wrote:
> >
> > > static int lan966x_port_inj_ready(struct lan966x *lan966x, u8 grp)
> > > {
> > > - u32 val;
> > > + unsigned long time = jiffies + usecs_to_jiffies(READL_TIMEOUT_US);
> > > + int ret = 0;
> > >
> > > - return readx_poll_timeout_atomic(lan966x_port_inj_status, lan966x, val,
> > > - QS_INJ_STATUS_FIFO_RDY_GET(val) & BIT(grp),
> > > - READL_SLEEP_US, READL_TIMEOUT_US);
> > > + while (!(lan_rd(lan966x, QS_INJ_STATUS) &
> > > + QS_INJ_STATUS_FIFO_RDY_SET(BIT(grp)))) {
> > > + if (time_after(jiffies, time)) {
> > > + ret = -ETIMEDOUT;
> > > + break;
> > > + }
> >
> > Did you try setting READL_SLEEP_US to 0? readx_poll_timeout_atomic()
> > explicitly supports that.
>
> I have tried but it didn't improve. It was the same as before.

How many times round the loop is it going ?

It might be that by the time readx_poll_timeout_atomic()
gets around to reading the register the fifo is actually ready.

The there is the delay between detecting 'ready' and writing
the next data.
That delay might be cumulative and affect performance.

OTOH spinning waiting for fifo space is just plain horrid.
Reminds me of 3C509 drivers :-)

David

-
Registered Address Lakeside, Bramley Road, Mount Farm, Milton Keynes, MK1 1PT, UK
Registration No: 1397386 (Wales)

2022-03-09 02:12:52

by Jakub Kicinski

[permalink] [raw]
Subject: Re: [PATCH net-next] net: lan966x: Improve the CPU TX bitrate.

On Tue, 8 Mar 2022 23:30:00 +0100 Horatiu Vultur wrote:
> > > static int lan966x_port_inj_ready(struct lan966x *lan966x, u8 grp)
> > > {
> > > - u32 val;
> > > + unsigned long time = jiffies + usecs_to_jiffies(READL_TIMEOUT_US);
> > > + int ret = 0;
> > >
> > > - return readx_poll_timeout_atomic(lan966x_port_inj_status, lan966x, val,
> > > - QS_INJ_STATUS_FIFO_RDY_GET(val) & BIT(grp),
> > > - READL_SLEEP_US, READL_TIMEOUT_US);
> > > + while (!(lan_rd(lan966x, QS_INJ_STATUS) &
> > > + QS_INJ_STATUS_FIFO_RDY_SET(BIT(grp)))) {
> > > + if (time_after(jiffies, time)) {
> > > + ret = -ETIMEDOUT;
> > > + break;
> > > + }
> >
> > Did you try setting READL_SLEEP_US to 0? readx_poll_timeout_atomic()
> > explicitly supports that.
>
> I have tried but it didn't improve. It was the same as before.

Huh, is ktime_get() super expensive on that platform?
jiffies vs ktime seems to be the main difference?

2022-03-09 09:29:07

by Horatiu Vultur

[permalink] [raw]
Subject: Re: [PATCH net-next] net: lan966x: Improve the CPU TX bitrate.

The 03/08/2022 22:46, David Laight wrote:
>
> From: Horatiu Vultur
> > Sent: 08 March 2022 22:30
> >
> > The 03/08/2022 22:36, Andrew Lunn wrote:
> > >
> > > > static int lan966x_port_inj_ready(struct lan966x *lan966x, u8 grp)
> > > > {
> > > > - u32 val;
> > > > + unsigned long time = jiffies + usecs_to_jiffies(READL_TIMEOUT_US);
> > > > + int ret = 0;
> > > >
> > > > - return readx_poll_timeout_atomic(lan966x_port_inj_status, lan966x, val,
> > > > - QS_INJ_STATUS_FIFO_RDY_GET(val) & BIT(grp),
> > > > - READL_SLEEP_US, READL_TIMEOUT_US);
> > > > + while (!(lan_rd(lan966x, QS_INJ_STATUS) &
> > > > + QS_INJ_STATUS_FIFO_RDY_SET(BIT(grp)))) {
> > > > + if (time_after(jiffies, time)) {
> > > > + ret = -ETIMEDOUT;
> > > > + break;
> > > > + }
> > >
> > > Did you try setting READL_SLEEP_US to 0? readx_poll_timeout_atomic()
> > > explicitly supports that.
> >
> > I have tried but it didn't improve. It was the same as before.
>
> How many times round the loop is it going ?

In the tests that I have done, I have never seen entering in the loop.

>
> It might be that by the time readx_poll_timeout_atomic()
> gets around to reading the register the fifo is actually ready.
>
> The there is the delay between detecting 'ready' and writing
> the next data.
> That delay might be cumulative and affect performance.

There is also a small delay between writting the word and checking if
more words can be written.

>
> OTOH spinning waiting for fifo space is just plain horrid.

Yes, I agree with you.

> Reminds me of 3C509 drivers :-)

>
> David
>
> -
> Registered Address Lakeside, Bramley Road, Mount Farm, Milton Keynes, MK1 1PT, UK
> Registration No: 1397386 (Wales)

--
/Horatiu

2022-03-09 10:41:40

by Horatiu Vultur

[permalink] [raw]
Subject: Re: [PATCH net-next] net: lan966x: Improve the CPU TX bitrate.

The 03/08/2022 16:40, Jakub Kicinski wrote:
>
> On Tue, 8 Mar 2022 23:30:00 +0100 Horatiu Vultur wrote:
> > > > static int lan966x_port_inj_ready(struct lan966x *lan966x, u8 grp)
> > > > {
> > > > - u32 val;
> > > > + unsigned long time = jiffies + usecs_to_jiffies(READL_TIMEOUT_US);
> > > > + int ret = 0;
> > > >
> > > > - return readx_poll_timeout_atomic(lan966x_port_inj_status, lan966x, val,
> > > > - QS_INJ_STATUS_FIFO_RDY_GET(val) & BIT(grp),
> > > > - READL_SLEEP_US, READL_TIMEOUT_US);
> > > > + while (!(lan_rd(lan966x, QS_INJ_STATUS) &
> > > > + QS_INJ_STATUS_FIFO_RDY_SET(BIT(grp)))) {
> > > > + if (time_after(jiffies, time)) {
> > > > + ret = -ETIMEDOUT;
> > > > + break;
> > > > + }
> > >
> > > Did you try setting READL_SLEEP_US to 0? readx_poll_timeout_atomic()
> > > explicitly supports that.
> >
> > I have tried but it didn't improve. It was the same as before.
>
> Huh, is ktime_get() super expensive on that platform?

Hm.. it looks like. Just adding ktime_get() before the while loop, then
the performance drops like before.
I am using SOC_LAN966 which has an ARMv7 CPU. So I am not sure how
expensive is ktime_get().

> jiffies vs ktime seems to be the main difference?

--
/Horatiu

2022-03-09 10:58:29

by David Laight

[permalink] [raw]
Subject: RE: [PATCH net-next] net: lan966x: Improve the CPU TX bitrate.

From: 'Horatiu Vultur'
> Sent: 09 March 2022 09:11
>
> The 03/08/2022 22:46, David Laight wrote:
> >
> > From: Horatiu Vultur
> > > Sent: 08 March 2022 22:30
> > >
> > > The 03/08/2022 22:36, Andrew Lunn wrote:
> > > >
> > > > > static int lan966x_port_inj_ready(struct lan966x *lan966x, u8 grp)
> > > > > {
> > > > > - u32 val;
> > > > > + unsigned long time = jiffies + usecs_to_jiffies(READL_TIMEOUT_US);
> > > > > + int ret = 0;
> > > > >
> > > > > - return readx_poll_timeout_atomic(lan966x_port_inj_status, lan966x, val,
> > > > > - QS_INJ_STATUS_FIFO_RDY_GET(val) & BIT(grp),
> > > > > - READL_SLEEP_US, READL_TIMEOUT_US);
> > > > > + while (!(lan_rd(lan966x, QS_INJ_STATUS) &
> > > > > + QS_INJ_STATUS_FIFO_RDY_SET(BIT(grp)))) {
> > > > > + if (time_after(jiffies, time)) {
> > > > > + ret = -ETIMEDOUT;
> > > > > + break;
> > > > > + }
> > > >
> > > > Did you try setting READL_SLEEP_US to 0? readx_poll_timeout_atomic()
> > > > explicitly supports that.
> > >
> > > I have tried but it didn't improve. It was the same as before.
> >
> > How many times round the loop is it going ?
>
> In the tests that I have done, I have never seen entering in the loop.

In which case I'd do an initial status check before even
faffing with 'jiffies'.

It might even be that the status read is so slow that space
is always available by the time it is processed.
PCIe reads can be horribly slow.
Into our fgpa they end up being slower than old ISA bus cycles.
Probably several thousand cpu clocks.

David

-
Registered Address Lakeside, Bramley Road, Mount Farm, Milton Keynes, MK1 1PT, UK
Registration No: 1397386 (Wales)

2022-03-09 16:09:44

by Andrew Lunn

[permalink] [raw]
Subject: Re: [PATCH net-next] net: lan966x: Improve the CPU TX bitrate.

On Tue, Mar 08, 2022 at 11:30:00PM +0100, Horatiu Vultur wrote:
> The 03/08/2022 22:36, Andrew Lunn wrote:
> >
> > > static int lan966x_port_inj_ready(struct lan966x *lan966x, u8 grp)
> > > {
> > > - u32 val;
> > > + unsigned long time = jiffies + usecs_to_jiffies(READL_TIMEOUT_US);
> > > + int ret = 0;
> > >
> > > - return readx_poll_timeout_atomic(lan966x_port_inj_status, lan966x, val,
> > > - QS_INJ_STATUS_FIFO_RDY_GET(val) & BIT(grp),
> > > - READL_SLEEP_US, READL_TIMEOUT_US);
> > > + while (!(lan_rd(lan966x, QS_INJ_STATUS) &
> > > + QS_INJ_STATUS_FIFO_RDY_SET(BIT(grp)))) {
> > > + if (time_after(jiffies, time)) {
> > > + ret = -ETIMEDOUT;
> > > + break;
> > > + }
> >
> > Did you try setting READL_SLEEP_US to 0? readx_poll_timeout_atomic()
> > explicitly supports that.
>
> I have tried but it didn't improve. It was the same as before.

The reason i recommend ipoll.h is that your implementation has the
usual bug, which iopoll does not have. Since you are using _atomic()
it is less of an issue, but it still exists.

while (!(lan_rd(lan966x, QS_INJ_STATUS) &
QS_INJ_STATUS_FIFO_RDY_SET(BIT(grp)))) {

Say you take an interrupt here

if (time_after(jiffies, time)) {
ret = -ETIMEDOUT;
break;
}


The interrupt takes a while, so that by the time you get to
time_after(), you have reached your timeout. So -ETIMEDOUT is
returned. But in fact, the hardware has done its thing, and if you
where to read the status the bit would be set, and you should actually
return O.K, not an error.

iopoll does another check of the status before deciding to return
-ETIMEDOUT or O.K.

If you decide to simply check the status directly after the write, i
suggest you then use readx_poll_timeout_atomic() if you do need to
poll.

Andrew

2022-03-09 16:15:37

by Horatiu Vultur

[permalink] [raw]
Subject: Re: [PATCH net-next] net: lan966x: Improve the CPU TX bitrate.

The 03/09/2022 09:57, David Laight wrote:
> EXTERNAL EMAIL: Do not click links or open attachments unless you know the content is safe
>
> From: 'Horatiu Vultur'
> > Sent: 09 March 2022 09:11
> >
> > The 03/08/2022 22:46, David Laight wrote:
> > >
> > > From: Horatiu Vultur
> > > > Sent: 08 March 2022 22:30
> > > >
> > > > The 03/08/2022 22:36, Andrew Lunn wrote:
> > > > >
> > > > > > static int lan966x_port_inj_ready(struct lan966x *lan966x, u8 grp)
> > > > > > {
> > > > > > - u32 val;
> > > > > > + unsigned long time = jiffies + usecs_to_jiffies(READL_TIMEOUT_US);
> > > > > > + int ret = 0;
> > > > > >
> > > > > > - return readx_poll_timeout_atomic(lan966x_port_inj_status, lan966x, val,
> > > > > > - QS_INJ_STATUS_FIFO_RDY_GET(val) & BIT(grp),
> > > > > > - READL_SLEEP_US, READL_TIMEOUT_US);
> > > > > > + while (!(lan_rd(lan966x, QS_INJ_STATUS) &
> > > > > > + QS_INJ_STATUS_FIFO_RDY_SET(BIT(grp)))) {
> > > > > > + if (time_after(jiffies, time)) {
> > > > > > + ret = -ETIMEDOUT;
> > > > > > + break;
> > > > > > + }
> > > > >
> > > > > Did you try setting READL_SLEEP_US to 0? readx_poll_timeout_atomic()
> > > > > explicitly supports that.
> > > >
> > > > I have tried but it didn't improve. It was the same as before.
> > >
> > > How many times round the loop is it going ?
> >
> > In the tests that I have done, I have never seen entering in the loop.
>
> In which case I'd do an initial status check before even
> faffing with 'jiffies'.

That is a good idea. I will do this change in the next version.

>
> It might even be that the status read is so slow that space
> is always available by the time it is processed.
> PCIe reads can be horribly slow.
> Into our fgpa they end up being slower than old ISA bus cycles.
> Probably several thousand cpu clocks.
>
> David
>
> -
> Registered Address Lakeside, Bramley Road, Mount Farm, Milton Keynes, MK1 1PT, UK
> Registration No: 1397386 (Wales)

--
/Horatiu

2022-03-10 02:52:11

by Andrew Lunn

[permalink] [raw]
Subject: Re: [PATCH net-next] net: lan966x: Improve the CPU TX bitrate.

On Wed, Mar 09, 2022 at 11:05:16PM +0100, Horatiu Vultur wrote:
> The 03/09/2022 14:11, Andrew Lunn wrote:
> >
> > On Tue, Mar 08, 2022 at 11:30:00PM +0100, Horatiu Vultur wrote:
> > > The 03/08/2022 22:36, Andrew Lunn wrote:
> > > >
> > > > > static int lan966x_port_inj_ready(struct lan966x *lan966x, u8 grp)
> > > > > {
> > > > > - u32 val;
> > > > > + unsigned long time = jiffies + usecs_to_jiffies(READL_TIMEOUT_US);
> > > > > + int ret = 0;
> > > > >
> > > > > - return readx_poll_timeout_atomic(lan966x_port_inj_status, lan966x, val,
> > > > > - QS_INJ_STATUS_FIFO_RDY_GET(val) & BIT(grp),
> > > > > - READL_SLEEP_US, READL_TIMEOUT_US);
> > > > > + while (!(lan_rd(lan966x, QS_INJ_STATUS) &
> > > > > + QS_INJ_STATUS_FIFO_RDY_SET(BIT(grp)))) {
> > > > > + if (time_after(jiffies, time)) {
> > > > > + ret = -ETIMEDOUT;
> > > > > + break;
> > > > > + }
> > > >
> > > > Did you try setting READL_SLEEP_US to 0? readx_poll_timeout_atomic()
> > > > explicitly supports that.
> > >
> > > I have tried but it didn't improve. It was the same as before.
> >
> > The reason i recommend ipoll.h is that your implementation has the
> > usual bug, which iopoll does not have. Since you are using _atomic()
> > it is less of an issue, but it still exists.
> >
> > while (!(lan_rd(lan966x, QS_INJ_STATUS) &
> > QS_INJ_STATUS_FIFO_RDY_SET(BIT(grp)))) {
> >
> > Say you take an interrupt here
> >
> > if (time_after(jiffies, time)) {
> > ret = -ETIMEDOUT;
> > break;
> > }
> >
> >
> > The interrupt takes a while, so that by the time you get to
> > time_after(), you have reached your timeout. So -ETIMEDOUT is
> > returned. But in fact, the hardware has done its thing, and if you
> > where to read the status the bit would be set, and you should actually
> > return O.K, not an error.
>
> That is a good catch and really nice explanation!
> Then if I add also the check at the end, then it should be also OK.

You are then just repeating code which is already in the core. That is
generally not liked. If you find reading the status once works 99% of
the time, then i suggest you call readx_poll_timeout_atomic() when the
status does indicate you need to poll.

Andrew

2022-03-10 14:36:06

by Horatiu Vultur

[permalink] [raw]
Subject: Re: [PATCH net-next] net: lan966x: Improve the CPU TX bitrate.

The 03/09/2022 14:11, Andrew Lunn wrote:
>
> On Tue, Mar 08, 2022 at 11:30:00PM +0100, Horatiu Vultur wrote:
> > The 03/08/2022 22:36, Andrew Lunn wrote:
> > >
> > > > static int lan966x_port_inj_ready(struct lan966x *lan966x, u8 grp)
> > > > {
> > > > - u32 val;
> > > > + unsigned long time = jiffies + usecs_to_jiffies(READL_TIMEOUT_US);
> > > > + int ret = 0;
> > > >
> > > > - return readx_poll_timeout_atomic(lan966x_port_inj_status, lan966x, val,
> > > > - QS_INJ_STATUS_FIFO_RDY_GET(val) & BIT(grp),
> > > > - READL_SLEEP_US, READL_TIMEOUT_US);
> > > > + while (!(lan_rd(lan966x, QS_INJ_STATUS) &
> > > > + QS_INJ_STATUS_FIFO_RDY_SET(BIT(grp)))) {
> > > > + if (time_after(jiffies, time)) {
> > > > + ret = -ETIMEDOUT;
> > > > + break;
> > > > + }
> > >
> > > Did you try setting READL_SLEEP_US to 0? readx_poll_timeout_atomic()
> > > explicitly supports that.
> >
> > I have tried but it didn't improve. It was the same as before.
>
> The reason i recommend ipoll.h is that your implementation has the
> usual bug, which iopoll does not have. Since you are using _atomic()
> it is less of an issue, but it still exists.
>
> while (!(lan_rd(lan966x, QS_INJ_STATUS) &
> QS_INJ_STATUS_FIFO_RDY_SET(BIT(grp)))) {
>
> Say you take an interrupt here
>
> if (time_after(jiffies, time)) {
> ret = -ETIMEDOUT;
> break;
> }
>
>
> The interrupt takes a while, so that by the time you get to
> time_after(), you have reached your timeout. So -ETIMEDOUT is
> returned. But in fact, the hardware has done its thing, and if you
> where to read the status the bit would be set, and you should actually
> return O.K, not an error.

That is a good catch and really nice explanation!
Then if I add also the check at the end, then it should be also OK.

>
> iopoll does another check of the status before deciding to return
> -ETIMEDOUT or O.K.
>
> If you decide to simply check the status directly after the write, i
> suggest you then use readx_poll_timeout_atomic() if you do need to
> poll.
>
> Andrew

--
/Horatiu