2014-12-28 14:57:34

by Beniamino Galvani

[permalink] [raw]
Subject: [PATCH net-next] net: stmmac: add BQL support

Add support for Byte Queue Limits to the STMicro MAC driver.

Tested on a Amlogic S805 Cortex-A5 board, where the use of BQL
slightly decreases the ping latency from ~10ms to ~3ms when the
100Mbps link is saturated by TCP streams. No difference is
observed at 1Gbps.

Signed-off-by: Beniamino Galvani <[email protected]>
---
drivers/net/ethernet/stmicro/stmmac/stmmac_main.c | 9 +++++++++
1 file changed, 9 insertions(+)

diff --git a/drivers/net/ethernet/stmicro/stmmac/stmmac_main.c b/drivers/net/ethernet/stmicro/stmmac/stmmac_main.c
index 118a427..c5af3d8 100644
--- a/drivers/net/ethernet/stmicro/stmmac/stmmac_main.c
+++ b/drivers/net/ethernet/stmicro/stmmac/stmmac_main.c
@@ -1097,6 +1097,7 @@ static int init_dma_desc_rings(struct net_device *dev, gfp_t flags)

priv->dirty_tx = 0;
priv->cur_tx = 0;
+ netdev_reset_queue(priv->dev);

stmmac_clear_descriptors(priv);

@@ -1300,6 +1301,7 @@ static void stmmac_dma_operation_mode(struct stmmac_priv *priv)
static void stmmac_tx_clean(struct stmmac_priv *priv)
{
unsigned int txsize = priv->dma_tx_size;
+ unsigned int bytes_compl = 0, pkts_compl = 0;

spin_lock(&priv->tx_lock);

@@ -1356,6 +1358,8 @@ static void stmmac_tx_clean(struct stmmac_priv *priv)
priv->hw->mode->clean_desc3(priv, p);

if (likely(skb != NULL)) {
+ pkts_compl++;
+ bytes_compl += skb->len;
dev_consume_skb_any(skb);
priv->tx_skbuff[entry] = NULL;
}
@@ -1364,6 +1368,9 @@ static void stmmac_tx_clean(struct stmmac_priv *priv)

priv->dirty_tx++;
}
+
+ netdev_completed_queue(priv->dev, pkts_compl, bytes_compl);
+
if (unlikely(netif_queue_stopped(priv->dev) &&
stmmac_tx_avail(priv) > STMMAC_TX_THRESH(priv))) {
netif_tx_lock(priv->dev);
@@ -1418,6 +1425,7 @@ static void stmmac_tx_err(struct stmmac_priv *priv)
(i == txsize - 1));
priv->dirty_tx = 0;
priv->cur_tx = 0;
+ netdev_reset_queue(priv->dev);
priv->hw->dma->start_tx(priv->ioaddr);

priv->dev->stats.tx_errors++;
@@ -2049,6 +2057,7 @@ static netdev_tx_t stmmac_xmit(struct sk_buff *skb, struct net_device *dev)
skb_tx_timestamp(skb);

priv->hw->dma->enable_dma_transmission(priv->ioaddr);
+ netdev_sent_queue(dev, skb->len);

spin_unlock(&priv->tx_lock);
return NETDEV_TX_OK;
--
2.1.4


2014-12-28 16:25:44

by Dave Taht

[permalink] [raw]
Subject: Re: [PATCH net-next] net: stmmac: add BQL support

On Sun, Dec 28, 2014 at 6:57 AM, Beniamino Galvani <[email protected]> wrote:
> Add support for Byte Queue Limits to the STMicro MAC driver.

Thank you!

> Tested on a Amlogic S805 Cortex-A5 board, where the use of BQL
> slightly decreases the ping latency from ~10ms to ~3ms when the
> 100Mbps link is saturated by TCP streams. No difference is
> observed at 1Gbps.

I see the plural. With TSQ in place it is hard (without something like
the rrul test driving multiple streams) to drive a driver to
saturation with small numbers of flows. This was with pfifo_fast, not
sch_fq, at 100mbit?

Can this board actually drive a full gigabit in the first place? Until
now most of the low end arm boards I have seen only came with
a 100mbit mac, and the gig ones lacking offloads seemed to peak
out at about 600mbit.

Under my christmas tree landed a quad core A5 (odroid-c1), also an
xgene and zedboard - both of the latter are a-needing BQL,
and I haven't booted the udroid yet. Hopefully it is the
same driver you just improved.

(https://plus.google.com/u/0/107942175615993706558/posts/f1D43umhm7E )

> Signed-off-by: Beniamino Galvani <[email protected]>
> ---
> drivers/net/ethernet/stmicro/stmmac/stmmac_main.c | 9 +++++++++
> 1 file changed, 9 insertions(+)
>
> diff --git a/drivers/net/ethernet/stmicro/stmmac/stmmac_main.c b/drivers/net/ethernet/stmicro/stmmac/stmmac_main.c
> index 118a427..c5af3d8 100644
> --- a/drivers/net/ethernet/stmicro/stmmac/stmmac_main.c
> +++ b/drivers/net/ethernet/stmicro/stmmac/stmmac_main.c
> @@ -1097,6 +1097,7 @@ static int init_dma_desc_rings(struct net_device *dev, gfp_t flags)
>
> priv->dirty_tx = 0;
> priv->cur_tx = 0;
> + netdev_reset_queue(priv->dev);
>
> stmmac_clear_descriptors(priv);
>
> @@ -1300,6 +1301,7 @@ static void stmmac_dma_operation_mode(struct stmmac_priv *priv)
> static void stmmac_tx_clean(struct stmmac_priv *priv)
> {
> unsigned int txsize = priv->dma_tx_size;
> + unsigned int bytes_compl = 0, pkts_compl = 0;
>
> spin_lock(&priv->tx_lock);
>
> @@ -1356,6 +1358,8 @@ static void stmmac_tx_clean(struct stmmac_priv *priv)
> priv->hw->mode->clean_desc3(priv, p);
>
> if (likely(skb != NULL)) {
> + pkts_compl++;
> + bytes_compl += skb->len;
> dev_consume_skb_any(skb);
> priv->tx_skbuff[entry] = NULL;
> }
> @@ -1364,6 +1368,9 @@ static void stmmac_tx_clean(struct stmmac_priv *priv)
>
> priv->dirty_tx++;
> }
> +
> + netdev_completed_queue(priv->dev, pkts_compl, bytes_compl);
> +
> if (unlikely(netif_queue_stopped(priv->dev) &&
> stmmac_tx_avail(priv) > STMMAC_TX_THRESH(priv))) {
> netif_tx_lock(priv->dev);
> @@ -1418,6 +1425,7 @@ static void stmmac_tx_err(struct stmmac_priv *priv)
> (i == txsize - 1));
> priv->dirty_tx = 0;
> priv->cur_tx = 0;
> + netdev_reset_queue(priv->dev);
> priv->hw->dma->start_tx(priv->ioaddr);
>
> priv->dev->stats.tx_errors++;
> @@ -2049,6 +2057,7 @@ static netdev_tx_t stmmac_xmit(struct sk_buff *skb, struct net_device *dev)
> skb_tx_timestamp(skb);
>
> priv->hw->dma->enable_dma_transmission(priv->ioaddr);
> + netdev_sent_queue(dev, skb->len);
>
> spin_unlock(&priv->tx_lock);
> return NETDEV_TX_OK;
> --
> 2.1.4
>
> --
> To unsubscribe from this list: send the line "unsubscribe netdev" in
> the body of a message to [email protected]
> More majordomo info at http://vger.kernel.org/majordomo-info.html



--
Dave Täht

thttp://www.bufferbloat.net/projects/bloat/wiki/Upcoming_Talks

2014-12-28 21:48:52

by Beniamino Galvani

[permalink] [raw]
Subject: Re: [PATCH net-next] net: stmmac: add BQL support

On Sun, Dec 28, 2014 at 08:25:40AM -0800, Dave Taht wrote:
> On Sun, Dec 28, 2014 at 6:57 AM, Beniamino Galvani <[email protected]> wrote:
> > Add support for Byte Queue Limits to the STMicro MAC driver.
>
> Thank you!
>
> > Tested on a Amlogic S805 Cortex-A5 board, where the use of BQL
> > slightly decreases the ping latency from ~10ms to ~3ms when the
> > 100Mbps link is saturated by TCP streams. No difference is
> > observed at 1Gbps.
>
> I see the plural. With TSQ in place it is hard (without something like
> the rrul test driving multiple streams) to drive a driver to
> saturation with small numbers of flows. This was with pfifo_fast, not
> sch_fq, at 100mbit?

Hi Dave,

yes, this was with pfifo_fast and I used 4 iperf TCP streams. The total
throughput didn't seem to increase adding more streams.

>
> Can this board actually drive a full gigabit in the first place? Until
> now most of the low end arm boards I have seen only came with
> a 100mbit mac, and the gig ones lacking offloads seemed to peak
> out at about 600mbit.

I measured a throughput of 650mbit in rx and 600mbit in tx.

>
> Under my christmas tree landed a quad core A5 (odroid-c1), also an
> xgene and zedboard - both of the latter are a-needing BQL,
> and I haven't booted the udroid yet. Hopefully it is the
> same driver you just improved.

I'm using the odroid-c1 too, with this tree based on the recent
Amlogic mainline work:

https://github.com/bengal/linux/tree/meson8b

Unfortunately at the moment the support for the board is very basic
(for example, SMP is not working yet) but it's enough to do some NIC
tests.

Beniamino

2014-12-29 07:48:58

by Florian Fainelli

[permalink] [raw]
Subject: Re: [PATCH net-next] net: stmmac: add BQL support

2014-12-28 6:57 GMT-08:00 Beniamino Galvani <[email protected]>:
> Add support for Byte Queue Limits to the STMicro MAC driver.
>
> Tested on a Amlogic S805 Cortex-A5 board, where the use of BQL
> slightly decreases the ping latency from ~10ms to ~3ms when the
> 100Mbps link is saturated by TCP streams. No difference is
> observed at 1Gbps.
>
> Signed-off-by: Beniamino Galvani <[email protected]>
> ---

[snip]

> priv->dev->stats.tx_errors++;
> @@ -2049,6 +2057,7 @@ static netdev_tx_t stmmac_xmit(struct sk_buff *skb, struct net_device *dev)
> skb_tx_timestamp(skb);
>
> priv->hw->dma->enable_dma_transmission(priv->ioaddr);
> + netdev_sent_queue(dev, skb->len);

You are introducing a potential use after free here in case tx_lock is
eliminated one day and your TX reclaim logic kicks in and frees the
freshly transmitted SKB, it would be safer to just cache skb->len in a
local variable, and use it here.

>
> spin_unlock(&priv->tx_lock);
> return NETDEV_TX_OK;
--
Florian

2014-12-29 17:42:04

by Dave Taht

[permalink] [raw]
Subject: Re: [PATCH net-next] net: stmmac: add BQL support

On Sun, Dec 28, 2014 at 1:48 PM, Beniamino Galvani <[email protected]> wrote:
> On Sun, Dec 28, 2014 at 08:25:40AM -0800, Dave Taht wrote:
>> On Sun, Dec 28, 2014 at 6:57 AM, Beniamino Galvani <[email protected]> wrote:
>> > Add support for Byte Queue Limits to the STMicro MAC driver.
>>
>> Thank you!
>>
>> > Tested on a Amlogic S805 Cortex-A5 board, where the use of BQL
>> > slightly decreases the ping latency from ~10ms to ~3ms when the
>> > 100Mbps link is saturated by TCP streams. No difference is
>> > observed at 1Gbps.
>>
>> I see the plural. With TSQ in place it is hard (without something like
>> the rrul test driving multiple streams) to drive a driver to
>> saturation with small numbers of flows. This was with pfifo_fast, not
>> sch_fq, at 100mbit?
>
> Hi Dave,
>
> yes, this was with pfifo_fast and I used 4 iperf TCP streams. The total
> throughput didn't seem to increase adding more streams.

>>
>> Can this board actually drive a full gigabit in the first place? Until
>> now most of the low end arm boards I have seen only came with
>> a 100mbit mac, and the gig ones lacking offloads seemed to peak
>> out at about 600mbit.
>
> I measured a throughput of 650mbit in rx and 600mbit in tx.

You might want to try the rrul test which tests both directions and
latency at the same time.

In my case I have been trying to find a low-cost chip that could do soft
rate limiting (htb) + fq_codel at up to 300mbit/sec, as that is about
the peak speed
we will be getting from cable modems, and these are horribly overbuffered,
at these speeds too, with 1.2sec of bidirectional latency observed at
120mbit/12mbit.

I'm open to crazy ideas like trying to find a use for the gpu, etc, to
get there.

>
>>
>> Under my christmas tree landed a quad core A5 (odroid-c1), also an
>> xgene and zedboard - both of the latter are a-needing BQL,
>> and I haven't booted the udroid yet. Hopefully it is the
>> same driver you just improved.
>
> I'm using the odroid-c1 too, with this tree based on the recent
> Amlogic mainline work:
>
> https://github.com/bengal/linux/tree/meson8b

Oh, cool, thx!

> Unfortunately at the moment the support for the board is very basic
> (for example, SMP is not working yet) but it's enough to do some NIC
> tests.

Good to know. Have you looked at xmit_more yet?

http://lwn.net/Articles/615238/


> Beniamino



--
Dave Täht

http://www.bufferbloat.net/projects/bloat/wiki/Upcoming_Talks

2014-12-30 21:33:02

by Beniamino Galvani

[permalink] [raw]
Subject: Re: [PATCH net-next] net: stmmac: add BQL support

On Sun, Dec 28, 2014 at 11:48:14PM -0800, Florian Fainelli wrote:
> 2014-12-28 6:57 GMT-08:00 Beniamino Galvani <[email protected]>:
> > Add support for Byte Queue Limits to the STMicro MAC driver.
> >
> > Tested on a Amlogic S805 Cortex-A5 board, where the use of BQL
> > slightly decreases the ping latency from ~10ms to ~3ms when the
> > 100Mbps link is saturated by TCP streams. No difference is
> > observed at 1Gbps.
> >
> > Signed-off-by: Beniamino Galvani <[email protected]>
> > ---
>
> [snip]
>
> > priv->dev->stats.tx_errors++;
> > @@ -2049,6 +2057,7 @@ static netdev_tx_t stmmac_xmit(struct sk_buff *skb, struct net_device *dev)
> > skb_tx_timestamp(skb);
> >
> > priv->hw->dma->enable_dma_transmission(priv->ioaddr);
> > + netdev_sent_queue(dev, skb->len);
>
> You are introducing a potential use after free here in case tx_lock is
> eliminated one day and your TX reclaim logic kicks in and frees the
> freshly transmitted SKB, it would be safer to just cache skb->len in a
> local variable, and use it here.

Ok, I will change this part; probably a simpler solution is to call
netdev_sent_queue() before enabling the DMA like in other drivers.

BTW, I'm not sure this lock is really needed, since it should be
possible to safely access the ring without locks from both the
transmit and the reclaim functions if the pointers are updated
carefully. So maybe it will be really removed one day.

Beniamino

>
> >
> > spin_unlock(&priv->tx_lock);
> > return NETDEV_TX_OK;

2014-12-30 23:13:49

by Beniamino Galvani

[permalink] [raw]
Subject: Re: [PATCH net-next] net: stmmac: add BQL support

On Mon, Dec 29, 2014 at 09:42:01AM -0800, Dave Taht wrote:
> On Sun, Dec 28, 2014 at 1:48 PM, Beniamino Galvani <[email protected]> wrote:
> > On Sun, Dec 28, 2014 at 08:25:40AM -0800, Dave Taht wrote:
> >> On Sun, Dec 28, 2014 at 6:57 AM, Beniamino Galvani <[email protected]> wrote:
> >> > Add support for Byte Queue Limits to the STMicro MAC driver.
> >>
> >> Thank you!
> >>
> >> > Tested on a Amlogic S805 Cortex-A5 board, where the use of BQL
> >> > slightly decreases the ping latency from ~10ms to ~3ms when the
> >> > 100Mbps link is saturated by TCP streams. No difference is
> >> > observed at 1Gbps.
> >>
> >> I see the plural. With TSQ in place it is hard (without something like
> >> the rrul test driving multiple streams) to drive a driver to
> >> saturation with small numbers of flows. This was with pfifo_fast, not
> >> sch_fq, at 100mbit?
> >
> > Hi Dave,
> >
> > yes, this was with pfifo_fast and I used 4 iperf TCP streams. The total
> > throughput didn't seem to increase adding more streams.
>
> >>
> >> Can this board actually drive a full gigabit in the first place? Until
> >> now most of the low end arm boards I have seen only came with
> >> a 100mbit mac, and the gig ones lacking offloads seemed to peak
> >> out at about 600mbit.
> >
> > I measured a throughput of 650mbit in rx and 600mbit in tx.
>
> You might want to try the rrul test which tests both directions and
> latency at the same time.

I will try it, thanks.

>
> In my case I have been trying to find a low-cost chip that could do soft
> rate limiting (htb) + fq_codel at up to 300mbit/sec, as that is about
> the peak speed
> we will be getting from cable modems, and these are horribly overbuffered,
> at these speeds too, with 1.2sec of bidirectional latency observed at
> 120mbit/12mbit.
>
> I'm open to crazy ideas like trying to find a use for the gpu, etc, to
> get there.
>
> >
> >>
> >> Under my christmas tree landed a quad core A5 (odroid-c1), also an
> >> xgene and zedboard - both of the latter are a-needing BQL,
> >> and I haven't booted the udroid yet. Hopefully it is the
> >> same driver you just improved.
> >
> > I'm using the odroid-c1 too, with this tree based on the recent
> > Amlogic mainline work:
> >
> > https://github.com/bengal/linux/tree/meson8b
>
> Oh, cool, thx!
>
> > Unfortunately at the moment the support for the board is very basic
> > (for example, SMP is not working yet) but it's enough to do some NIC
> > tests.
>
> Good to know. Have you looked at xmit_more yet?
>
> http://lwn.net/Articles/615238/

I don't know if I have implemented it correctly, but I found that the
improvement with xmit_more is so small to be barely observable, maybe
because the cost for starting the hardware transmission is very low
(it's a single mmio write).

Beniamino