From: Sven Van Asbroeck <[email protected]>
Even if the rx ring is completely full, and there is more rx data
waiting on the chip, the rx napi poll fn will never run more than
once - it will always immediately bail out and re-enable interrupts.
Which results in ping-pong between napi and interrupt.
This defeats the purpose of napi, and is bad for performance.
Fix by addressing two separate issues:
1. Ensure the rx napi poll fn always updates the rx ring tail
when returning, even when not re-enabling interrupts.
2. Up to half of elements in a full rx ring are extension
frames, which do not generate any skbs. Limit the default
napi weight to the smallest no. of skbs that can be generated
by a full rx ring.
Tested-by: Sven Van Asbroeck <[email protected]> # lan7430
Signed-off-by: Sven Van Asbroeck <[email protected]>
---
Tree: git://git.kernel.org/pub/scm/linux/kernel/git/davem/net.git # 905b2032fa42
To: Bryan Whitehead <[email protected]>
To: Microchip Linux Driver Support <[email protected]>
To: "David S. Miller" <[email protected]>
To: Jakub Kicinski <[email protected]>
Cc: Andrew Lunn <[email protected]>
Cc: [email protected]
Cc: [email protected]
drivers/net/ethernet/microchip/lan743x_main.c | 11 +++++++++--
1 file changed, 9 insertions(+), 2 deletions(-)
diff --git a/drivers/net/ethernet/microchip/lan743x_main.c b/drivers/net/ethernet/microchip/lan743x_main.c
index 87b6c59a1e03..ebb5e0bc516b 100644
--- a/drivers/net/ethernet/microchip/lan743x_main.c
+++ b/drivers/net/ethernet/microchip/lan743x_main.c
@@ -2260,10 +2260,11 @@ static int lan743x_rx_napi_poll(struct napi_struct *napi, int weight)
INT_BIT_DMA_RX_(rx->channel_number));
}
+done:
/* update RX_TAIL */
lan743x_csr_write(adapter, RX_TAIL(rx->channel_number),
rx_tail_flags | rx->last_tail);
-done:
+
return count;
}
@@ -2405,9 +2406,15 @@ static int lan743x_rx_open(struct lan743x_rx *rx)
if (ret)
goto return_error;
+ /* up to half of elements in a full rx ring are
+ * extension frames. these do not generate skbs.
+ * to prevent napi/interrupt ping-pong, limit default
+ * weight to the smallest no. of skbs that can be
+ * generated by a full rx ring.
+ */
netif_napi_add(adapter->netdev,
&rx->napi, lan743x_rx_napi_poll,
- rx->ring_size - 1);
+ (rx->ring_size - 1) / 2);
lan743x_csr_write(adapter, DMAC_CMD,
DMAC_CMD_RX_SWR_(rx->channel_number));
--
2.17.1
From: Sven Van Asbroeck <[email protected]>
To support jumbo frames, each rx ring dma buffer is 9K in size.
But the chip only stores a single frame per dma buffer.
When the chip is working with the default 1500 byte MTU, a 9K
dma buffer goes from chip -> cpu per 1500 byte frame. This means
that to get 1G/s ethernet bandwidth, we need 6G/s PCIe bandwidth !
Fix by limiting the rx ring dma buffer size to the current MTU
size.
Tested with iperf3 on a freescale imx6 + lan7430, both sides
set to mtu 1500 bytes.
Before:
[ ID] Interval Transfer Bandwidth Retr
[ 4] 0.00-20.00 sec 483 MBytes 203 Mbits/sec 0
After:
[ ID] Interval Transfer Bandwidth Retr
[ 4] 0.00-20.00 sec 1.15 GBytes 496 Mbits/sec 0
And with both sides set to MTU 9000 bytes:
Before:
[ ID] Interval Transfer Bandwidth Retr
[ 4] 0.00-20.00 sec 1.87 GBytes 803 Mbits/sec 27
After:
[ ID] Interval Transfer Bandwidth Retr
[ 4] 0.00-20.00 sec 1.98 GBytes 849 Mbits/sec 0
Tested-by: Sven Van Asbroeck <[email protected]> # lan7430
Signed-off-by: Sven Van Asbroeck <[email protected]>
---
Tree: git://git.kernel.org/pub/scm/linux/kernel/git/davem/net.git # 905b2032fa42
To: Bryan Whitehead <[email protected]>
To: Microchip Linux Driver Support <[email protected]>
To: "David S. Miller" <[email protected]>
To: Jakub Kicinski <[email protected]>
Cc: Andrew Lunn <[email protected]>
Cc: [email protected]
Cc: [email protected]
drivers/net/ethernet/microchip/lan743x_main.c | 21 ++++++++++++-------
1 file changed, 13 insertions(+), 8 deletions(-)
diff --git a/drivers/net/ethernet/microchip/lan743x_main.c b/drivers/net/ethernet/microchip/lan743x_main.c
index ebb5e0bc516b..2bded1c46784 100644
--- a/drivers/net/ethernet/microchip/lan743x_main.c
+++ b/drivers/net/ethernet/microchip/lan743x_main.c
@@ -1957,11 +1957,11 @@ static int lan743x_rx_next_index(struct lan743x_rx *rx, int index)
static struct sk_buff *lan743x_rx_allocate_skb(struct lan743x_rx *rx)
{
- int length = 0;
+ struct net_device *netdev = rx->adapter->netdev;
- length = (LAN743X_MAX_FRAME_SIZE + ETH_HLEN + 4 + RX_HEAD_PADDING);
- return __netdev_alloc_skb(rx->adapter->netdev,
- length, GFP_ATOMIC | GFP_DMA);
+ return __netdev_alloc_skb(netdev,
+ netdev->mtu + ETH_HLEN + 4 + RX_HEAD_PADDING,
+ GFP_ATOMIC | GFP_DMA);
}
static int lan743x_rx_init_ring_element(struct lan743x_rx *rx, int index,
@@ -1969,9 +1969,10 @@ static int lan743x_rx_init_ring_element(struct lan743x_rx *rx, int index,
{
struct lan743x_rx_buffer_info *buffer_info;
struct lan743x_rx_descriptor *descriptor;
- int length = 0;
+ struct net_device *netdev = rx->adapter->netdev;
+ int length;
- length = (LAN743X_MAX_FRAME_SIZE + ETH_HLEN + 4 + RX_HEAD_PADDING);
+ length = netdev->mtu + ETH_HLEN + 4 + RX_HEAD_PADDING;
descriptor = &rx->ring_cpu_ptr[index];
buffer_info = &rx->buffer_info[index];
buffer_info->skb = skb;
@@ -2157,8 +2158,8 @@ static int lan743x_rx_process_packet(struct lan743x_rx *rx)
int index = first_index;
/* multi buffer packet not supported */
- /* this should not happen since
- * buffers are allocated to be at least jumbo size
+ /* this should not happen since buffers are allocated
+ * to be at least the mtu size configured in the mac.
*/
/* clean up buffers */
@@ -2632,9 +2633,13 @@ static int lan743x_netdev_change_mtu(struct net_device *netdev, int new_mtu)
struct lan743x_adapter *adapter = netdev_priv(netdev);
int ret = 0;
+ if (netif_running(netdev))
+ return -EBUSY;
+
ret = lan743x_mac_set_mtu(adapter, new_mtu);
if (!ret)
netdev->mtu = new_mtu;
+
return ret;
}
--
2.17.1
On Sat, 5 Dec 2020 22:44:07 -0500 Sven Van Asbroeck wrote:
> From: Sven Van Asbroeck <[email protected]>
>
> Even if the rx ring is completely full, and there is more rx data
> waiting on the chip, the rx napi poll fn will never run more than
> once - it will always immediately bail out and re-enable interrupts.
> Which results in ping-pong between napi and interrupt.
>
> This defeats the purpose of napi, and is bad for performance.
>
> Fix by addressing two separate issues:
>
> 1. Ensure the rx napi poll fn always updates the rx ring tail
> when returning, even when not re-enabling interrupts.
>
> 2. Up to half of elements in a full rx ring are extension
> frames, which do not generate any skbs. Limit the default
> napi weight to the smallest no. of skbs that can be generated
> by a full rx ring.
>
> Tested-by: Sven Van Asbroeck <[email protected]> # lan7430
> Signed-off-by: Sven Van Asbroeck <[email protected]>
> ---
>
> Tree: git://git.kernel.org/pub/scm/linux/kernel/git/davem/net.git # 905b2032fa42
>
> To: Bryan Whitehead <[email protected]>
> To: Microchip Linux Driver Support <[email protected]>
> To: "David S. Miller" <[email protected]>
> To: Jakub Kicinski <[email protected]>
> Cc: Andrew Lunn <[email protected]>
> Cc: [email protected]
> Cc: [email protected]
>
> drivers/net/ethernet/microchip/lan743x_main.c | 11 +++++++++--
> 1 file changed, 9 insertions(+), 2 deletions(-)
>
> diff --git a/drivers/net/ethernet/microchip/lan743x_main.c b/drivers/net/ethernet/microchip/lan743x_main.c
> index 87b6c59a1e03..ebb5e0bc516b 100644
> --- a/drivers/net/ethernet/microchip/lan743x_main.c
> +++ b/drivers/net/ethernet/microchip/lan743x_main.c
> @@ -2260,10 +2260,11 @@ static int lan743x_rx_napi_poll(struct napi_struct *napi, int weight)
> INT_BIT_DMA_RX_(rx->channel_number));
> }
>
> +done:
> /* update RX_TAIL */
> lan743x_csr_write(adapter, RX_TAIL(rx->channel_number),
> rx_tail_flags | rx->last_tail);
> -done:
> +
I assume this rings the doorbell to let the device know that more
buffers are available? If so it's a little unusual to do this at the
end of NAPI poll. The more usual place would be to do this every n
times a new buffer is allocated (in lan743x_rx_init_ring_element()?)
That's to say for example ring the doorbell every time a buffer is put
at an index divisible by 16.
> return count;
> }
>
> @@ -2405,9 +2406,15 @@ static int lan743x_rx_open(struct lan743x_rx *rx)
> if (ret)
> goto return_error;
>
> + /* up to half of elements in a full rx ring are
> + * extension frames. these do not generate skbs.
> + * to prevent napi/interrupt ping-pong, limit default
> + * weight to the smallest no. of skbs that can be
> + * generated by a full rx ring.
> + */
> netif_napi_add(adapter->netdev,
> &rx->napi, lan743x_rx_napi_poll,
> - rx->ring_size - 1);
> + (rx->ring_size - 1) / 2);
This is rather unusual, drivers should generally pass NAPI_POLL_WEIGHT
here.
Hi Jakub, thank you so much for reviewing this patchset !
On Tue, Dec 8, 2020 at 2:43 PM Jakub Kicinski <[email protected]> wrote:
>
> > When the chip is working with the default 1500 byte MTU, a 9K
> > dma buffer goes from chip -> cpu per 1500 byte frame. This means
> > that to get 1G/s ethernet bandwidth, we need 6G/s PCIe bandwidth !
> >
> > Fix by limiting the rx ring dma buffer size to the current MTU
> > size.
>
> I'd guess this is a memory allocate issue, not a bandwidth thing.
> for 9K frames the driver needs to do order-2 allocations of 16K.
> For 1500 2K allocations are sufficient (which is < 1 page, hence
> a lot cheaper).
That's a good question. I used perf to create a flame graph of what
the cpu was doing when receiving data at high speed. It showed that
__dma_page_dev_to_cpu took up most of the cpu time. Which is triggered
by dma_unmap_single(9K, DMA_FROM_DEVICE).
So I assumed that it's a PCIe dma bandwidth issue, but I could be wrong -
I didn't do any PCIe bandwidth measurements.
>
> > Tested with iperf3 on a freescale imx6 + lan7430, both sides
> > set to mtu 1500 bytes.
> >
> > Before:
> > [ ID] Interval Transfer Bandwidth Retr
> > [ 4] 0.00-20.00 sec 483 MBytes 203 Mbits/sec 0
> > After:
> > [ ID] Interval Transfer Bandwidth Retr
> > [ 4] 0.00-20.00 sec 1.15 GBytes 496 Mbits/sec 0
> >
> > And with both sides set to MTU 9000 bytes:
> > Before:
> > [ ID] Interval Transfer Bandwidth Retr
> > [ 4] 0.00-20.00 sec 1.87 GBytes 803 Mbits/sec 27
> > After:
> > [ ID] Interval Transfer Bandwidth Retr
> > [ 4] 0.00-20.00 sec 1.98 GBytes 849 Mbits/sec 0
> >
> > Tested-by: Sven Van Asbroeck <[email protected]> # lan7430
> > Signed-off-by: Sven Van Asbroeck <[email protected]>
>
> This is a performance improvement, not a fix, it really needs to target
> net-next.
I thought it'd be cool if 'historic' kernels could benefit from this performance
improvement too, but yeah if it's against policy it should go into net-next.
What about the other patch in the patchset (ping-pong). Should it go into
net-next as well?
>
> > diff --git a/drivers/net/ethernet/microchip/lan743x_main.c b/drivers/net/ethernet/microchip/lan743x_main.c
> > index ebb5e0bc516b..2bded1c46784 100644
> > --- a/drivers/net/ethernet/microchip/lan743x_main.c
> > +++ b/drivers/net/ethernet/microchip/lan743x_main.c
> > @@ -1957,11 +1957,11 @@ static int lan743x_rx_next_index(struct lan743x_rx *rx, int index)
> >
> > static struct sk_buff *lan743x_rx_allocate_skb(struct lan743x_rx *rx)
> > {
> > - int length = 0;
> > + struct net_device *netdev = rx->adapter->netdev;
> >
> > - length = (LAN743X_MAX_FRAME_SIZE + ETH_HLEN + 4 + RX_HEAD_PADDING);
> > - return __netdev_alloc_skb(rx->adapter->netdev,
> > - length, GFP_ATOMIC | GFP_DMA);
> > + return __netdev_alloc_skb(netdev,
> > + netdev->mtu + ETH_HLEN + 4 + RX_HEAD_PADDING,
> > + GFP_ATOMIC | GFP_DMA);
> > }
> >
> > static int lan743x_rx_init_ring_element(struct lan743x_rx *rx, int index,
> > @@ -1969,9 +1969,10 @@ static int lan743x_rx_init_ring_element(struct lan743x_rx *rx, int index,
> > {
> > struct lan743x_rx_buffer_info *buffer_info;
> > struct lan743x_rx_descriptor *descriptor;
> > - int length = 0;
> > + struct net_device *netdev = rx->adapter->netdev;
> > + int length;
> >
> > - length = (LAN743X_MAX_FRAME_SIZE + ETH_HLEN + 4 + RX_HEAD_PADDING);
> > + length = netdev->mtu + ETH_HLEN + 4 + RX_HEAD_PADDING;
> > descriptor = &rx->ring_cpu_ptr[index];
> > buffer_info = &rx->buffer_info[index];
> > buffer_info->skb = skb;
> > @@ -2157,8 +2158,8 @@ static int lan743x_rx_process_packet(struct lan743x_rx *rx)
> > int index = first_index;
> >
> > /* multi buffer packet not supported */
> > - /* this should not happen since
> > - * buffers are allocated to be at least jumbo size
> > + /* this should not happen since buffers are allocated
> > + * to be at least the mtu size configured in the mac.
> > */
> >
> > /* clean up buffers */
> > @@ -2632,9 +2633,13 @@ static int lan743x_netdev_change_mtu(struct net_device *netdev, int new_mtu)
> > struct lan743x_adapter *adapter = netdev_priv(netdev);
> > int ret = 0;
> >
> > + if (netif_running(netdev))
> > + return -EBUSY;
>
> That may cause a regression to users of the driver who expect to be
> able to set the MTU when the device is running. You need to disable
> the NAPI, pause the device, swap the buffers for smaller / bigger ones
> and restart the device.
That's what I tried first, but I quickly ran into a spot of trouble:
restarting the device may fail (unlikely but possible). So when the user tries
to change the mtu and that errors out, they might end up with a stopped device.
Is that acceptable behaviour? If so, I'll add it to the patch.
>
> > ret = lan743x_mac_set_mtu(adapter, new_mtu);
> > if (!ret)
> > netdev->mtu = new_mtu;
> > +
> > return ret;
> > }
> >
>
On Tue, Dec 8, 2020 at 2:50 PM Jakub Kicinski <[email protected]> wrote:
>
> >
> > +done:
> > /* update RX_TAIL */
> > lan743x_csr_write(adapter, RX_TAIL(rx->channel_number),
> > rx_tail_flags | rx->last_tail);
> > -done:
> > +
>
> I assume this rings the doorbell to let the device know that more
> buffers are available? If so it's a little unusual to do this at the
> end of NAPI poll. The more usual place would be to do this every n
> times a new buffer is allocated (in lan743x_rx_init_ring_element()?)
> That's to say for example ring the doorbell every time a buffer is put
> at an index divisible by 16.
Yes, I believe it tells the device that new buffers have become available.
I wonder why it's so unusual to do this at the end of a napi poll? Omitting
this could result in sub-optimal use of buffers, right?
Example:
- tail is at position 0
- core calls napi_poll(weight=64)
- napi poll consumes 15 buffers and pushes 15 skbs, then ring empty
- index not divisible by 16, so tail is not updated
- weight not reached, so napi poll re-enables interrupts and bails out
Result: now there are 15 buffers which the device could potentially use, but
because the tail wasn't updated, it doesn't know about them.
It does make sense to update the tail more frequently than only at the end
of the napi poll, though?
I'm new to napi polling, so I'm quite interested to learn about this.
> >
> > + /* up to half of elements in a full rx ring are
> > + * extension frames. these do not generate skbs.
> > + * to prevent napi/interrupt ping-pong, limit default
> > + * weight to the smallest no. of skbs that can be
> > + * generated by a full rx ring.
> > + */
> > netif_napi_add(adapter->netdev,
> > &rx->napi, lan743x_rx_napi_poll,
> > - rx->ring_size - 1);
> > + (rx->ring_size - 1) / 2);
>
> This is rather unusual, drivers should generally pass NAPI_POLL_WEIGHT
> here.
I agree. The problem is that a full ring buffer of 64 buffers will only
contain 32 buffers with network data - the others are timestamps.
So napi_poll(weight=64) can never reach its full weight. Even with a full
buffer, it always assumes that it has to stop polling, and re-enable
interrupts, which results in a ping-pong.
Would it be better to fix the weight counting? Increase the count
for every buffer consumed, instead of for every skb pushed?
Hi Andrew,
On Tue, Dec 8, 2020 at 5:51 PM Andrew Lunn <[email protected]> wrote:
>
> >
> > So I assumed that it's a PCIe dma bandwidth issue, but I could be wrong -
> > I didn't do any PCIe bandwidth measurements.
>
> Sometimes it is actually cache operations which take all the
> time. This needs to invalidate the cache, so that when the memory is
> then accessed, it get fetched from RAM. On SMP machines, cache
> invalidation can be expensive, due to all the cross CPU operations.
> I've actually got better performance by building a UP kernel on some
> low core count ARM CPUs.
>
> There are some tricks which can be played. Do you actually need all
> 9K? Does the descriptor tell you actually how much is used? You can
> get a nice speed up if you just unmap 64 bytes for a TCP ACK, rather
> than the full 9K.
>
Thank you for the suggestion! The original driver developer chose 9K because
presumably that's the largest frame size supported by the chip.
Yes, I believe the chip will tell us via the descriptor how much it has
written, I would have to double-check. I was already looking for a
"trick" to transfer only the required number of bytes, but I was led to
believe that dma_map_single() and dma_unmap_single() always needed to match.
So:
dma_map_single(9K) followed by dma_unmap_single(9K) is correct, and
dma_map_single(9K) followed by dma_unmap_single(1500 bytes) means trouble.
How can we get around that?
On Tue, 8 Dec 2020 17:23:08 -0500 Sven Van Asbroeck wrote:
> On Tue, Dec 8, 2020 at 2:50 PM Jakub Kicinski <[email protected]> wrote:
> >
> > >
> > > +done:
> > > /* update RX_TAIL */
> > > lan743x_csr_write(adapter, RX_TAIL(rx->channel_number),
> > > rx_tail_flags | rx->last_tail);
> > > -done:
> > > +
> >
> > I assume this rings the doorbell to let the device know that more
> > buffers are available? If so it's a little unusual to do this at the
> > end of NAPI poll. The more usual place would be to do this every n
> > times a new buffer is allocated (in lan743x_rx_init_ring_element()?)
> > That's to say for example ring the doorbell every time a buffer is put
> > at an index divisible by 16.
>
> Yes, I believe it tells the device that new buffers have become available.
>
> I wonder why it's so unusual to do this at the end of a napi poll? Omitting
> this could result in sub-optimal use of buffers, right?
>
> Example:
> - tail is at position 0
> - core calls napi_poll(weight=64)
> - napi poll consumes 15 buffers and pushes 15 skbs, then ring empty
> - index not divisible by 16, so tail is not updated
> - weight not reached, so napi poll re-enables interrupts and bails out
>
> Result: now there are 15 buffers which the device could potentially use, but
> because the tail wasn't updated, it doesn't know about them.
Perhaps 16 for a device with 64 descriptors is rather high indeed.
Let's say 8. If the device misses 8 packet buffers on the ring,
that should be negligible.
Depends on the cost of the CSR write, usually packet processing is
putting a lot of pressure on the memory subsystem of the CPU, hence
amortizing the write over multiple descriptors helps. The other thing
is that you could delay the descriptor writes to write full cache lines,
but I don't think that will help on IMX6.
> It does make sense to update the tail more frequently than only at the end
> of the napi poll, though?
>
> I'm new to napi polling, so I'm quite interested to learn about this.
There is a tracepoint which records how many packets NAPI has polled:
napi:napi_poll, you can see easily what your system is doing.
What you want to avoid is the situation where the device already used
up all the descriptors by the time driver finishes the Rx processing.
That'd result in drops. So the driver should push the buffers back to
the device reasonably early.
With a ring of 64 descriptors and NAPI budget of 64 it's not unlikely
that the ring will be completely used when processing runs.
> > > + /* up to half of elements in a full rx ring are
> > > + * extension frames. these do not generate skbs.
> > > + * to prevent napi/interrupt ping-pong, limit default
> > > + * weight to the smallest no. of skbs that can be
> > > + * generated by a full rx ring.
> > > + */
> > > netif_napi_add(adapter->netdev,
> > > &rx->napi, lan743x_rx_napi_poll,
> > > - rx->ring_size - 1);
> > > + (rx->ring_size - 1) / 2);
> >
> > This is rather unusual, drivers should generally pass NAPI_POLL_WEIGHT
> > here.
>
> I agree. The problem is that a full ring buffer of 64 buffers will only
> contain 32 buffers with network data - the others are timestamps.
>
> So napi_poll(weight=64) can never reach its full weight. Even with a full
> buffer, it always assumes that it has to stop polling, and re-enable
> interrupts, which results in a ping-pong.
Interesting I don't think we ever had this problem before. Let me CC
Eric to see if he has any thoughts on the case. AFAIU you should think
of the weight as way of arbitrating between devices (if there is more
than one).
NAPI does not do any deferral (in wall clock time terms) of processing,
so the only difference you may get for lower weight is that softirq
kthread will get a chance to kick in earlier.
> Would it be better to fix the weight counting? Increase the count
> for every buffer consumed, instead of for every skb pushed?
On 12/8/20 3:02 PM, Sven Van Asbroeck wrote:
> Hi Andrew,
>
> On Tue, Dec 8, 2020 at 5:51 PM Andrew Lunn <[email protected]> wrote:
>>
>>>
>>> So I assumed that it's a PCIe dma bandwidth issue, but I could be wrong -
>>> I didn't do any PCIe bandwidth measurements.
>>
>> Sometimes it is actually cache operations which take all the
>> time. This needs to invalidate the cache, so that when the memory is
>> then accessed, it get fetched from RAM. On SMP machines, cache
>> invalidation can be expensive, due to all the cross CPU operations.
>> I've actually got better performance by building a UP kernel on some
>> low core count ARM CPUs.
>>
>> There are some tricks which can be played. Do you actually need all
>> 9K? Does the descriptor tell you actually how much is used? You can
>> get a nice speed up if you just unmap 64 bytes for a TCP ACK, rather
>> than the full 9K.
>>
>
> Thank you for the suggestion! The original driver developer chose 9K because
> presumably that's the largest frame size supported by the chip.
>
> Yes, I believe the chip will tell us via the descriptor how much it has
> written, I would have to double-check. I was already looking for a
> "trick" to transfer only the required number of bytes, but I was led to
> believe that dma_map_single() and dma_unmap_single() always needed to match.
>
> So:
> dma_map_single(9K) followed by dma_unmap_single(9K) is correct, and
> dma_map_single(9K) followed by dma_unmap_single(1500 bytes) means trouble.
>
> How can we get around that?
dma_sync_single_for_{cpu,device} is what you would need in order to make
a partial cache line invalidation. You would still need to unmap the
same address+length pair that was used for the initial mapping otherwise
the DMA-API debugging will rightfully complain.
--
Florian
On Wed, Dec 9, 2020 at 12:29 AM Jakub Kicinski <[email protected]> wrote:
>
> On Tue, 8 Dec 2020 17:23:08 -0500 Sven Van Asbroeck wrote:
> > On Tue, Dec 8, 2020 at 2:50 PM Jakub Kicinski <[email protected]> wrote:
> > >
> > > >
> > > > +done:
> > > > /* update RX_TAIL */
> > > > lan743x_csr_write(adapter, RX_TAIL(rx->channel_number),
> > > > rx_tail_flags | rx->last_tail);
> > > > -done:
> > > > +
> > >
> > > I assume this rings the doorbell to let the device know that more
> > > buffers are available? If so it's a little unusual to do this at the
> > > end of NAPI poll. The more usual place would be to do this every n
> > > times a new buffer is allocated (in lan743x_rx_init_ring_element()?)
> > > That's to say for example ring the doorbell every time a buffer is put
> > > at an index divisible by 16.
> >
> > Yes, I believe it tells the device that new buffers have become available.
> >
> > I wonder why it's so unusual to do this at the end of a napi poll? Omitting
> > this could result in sub-optimal use of buffers, right?
> >
> > Example:
> > - tail is at position 0
> > - core calls napi_poll(weight=64)
> > - napi poll consumes 15 buffers and pushes 15 skbs, then ring empty
> > - index not divisible by 16, so tail is not updated
> > - weight not reached, so napi poll re-enables interrupts and bails out
> >
> > Result: now there are 15 buffers which the device could potentially use, but
> > because the tail wasn't updated, it doesn't know about them.
>
> Perhaps 16 for a device with 64 descriptors is rather high indeed.
> Let's say 8. If the device misses 8 packet buffers on the ring,
> that should be negligible.
>
mlx4 uses 8 as the threshold ( mlx4_en_refill_rx_buffers())
> Depends on the cost of the CSR write, usually packet processing is
> putting a lot of pressure on the memory subsystem of the CPU, hence
> amortizing the write over multiple descriptors helps. The other thing
> is that you could delay the descriptor writes to write full cache lines,
> but I don't think that will help on IMX6.
>
> > It does make sense to update the tail more frequently than only at the end
> > of the napi poll, though?
> >
> > I'm new to napi polling, so I'm quite interested to learn about this.
>
> There is a tracepoint which records how many packets NAPI has polled:
> napi:napi_poll, you can see easily what your system is doing.
>
> What you want to avoid is the situation where the device already used
> up all the descriptors by the time driver finishes the Rx processing.
> That'd result in drops. So the driver should push the buffers back to
> the device reasonably early.
>
> With a ring of 64 descriptors and NAPI budget of 64 it's not unlikely
> that the ring will be completely used when processing runs.
>
> > > > + /* up to half of elements in a full rx ring are
> > > > + * extension frames. these do not generate skbs.
> > > > + * to prevent napi/interrupt ping-pong, limit default
> > > > + * weight to the smallest no. of skbs that can be
> > > > + * generated by a full rx ring.
> > > > + */
> > > > netif_napi_add(adapter->netdev,
> > > > &rx->napi, lan743x_rx_napi_poll,
> > > > - rx->ring_size - 1);
> > > > + (rx->ring_size - 1) / 2);
> > >
> > > This is rather unusual, drivers should generally pass NAPI_POLL_WEIGHT
> > > here.
> >
> > I agree. The problem is that a full ring buffer of 64 buffers will only
> > contain 32 buffers with network data - the others are timestamps.
> >
> > So napi_poll(weight=64) can never reach its full weight. Even with a full
> > buffer, it always assumes that it has to stop polling, and re-enable
> > interrupts, which results in a ping-pong.
>
> Interesting I don't think we ever had this problem before. Let me CC
> Eric to see if he has any thoughts on the case. AFAIU you should think
> of the weight as way of arbitrating between devices (if there is more
> than one).
Driver could be called with an arbitrary budget (of 64),
and if its ring buffer has been depleted, return @budget instead of skb counts,
and not ream the interrupt
if (count < budget && !rx_ring_fully_processed) {
if (napi_complete_done(napi, count))
ream_irqs();
return count;
}
return budget;
>
> NAPI does not do any deferral (in wall clock time terms) of processing,
> so the only difference you may get for lower weight is that softirq
> kthread will get a chance to kick in earlier.
>
> > Would it be better to fix the weight counting? Increase the count
> > for every buffer consumed, instead of for every skb pushed?
>
On Tue, Dec 8, 2020 at 6:50 PM Eric Dumazet <[email protected]> wrote:
>
> Driver could be called with an arbitrary budget (of 64),
> and if its ring buffer has been depleted, return @budget instead of skb counts,
> and not ream the interrupt
>
Aha, so the decision to re-arm the interrupts is made by looking
at whether the device has run out of ring buffers to fill... instead
of checking whether the weight was reached !
That makes complete sense.
Thank you Eric and Jakub for your expert suggestions.
On Sat, 5 Dec 2020 22:44:08 -0500 Sven Van Asbroeck wrote:
> From: Sven Van Asbroeck <[email protected]>
>
> To support jumbo frames, each rx ring dma buffer is 9K in size.
> But the chip only stores a single frame per dma buffer.
>
> When the chip is working with the default 1500 byte MTU, a 9K
> dma buffer goes from chip -> cpu per 1500 byte frame. This means
> that to get 1G/s ethernet bandwidth, we need 6G/s PCIe bandwidth !
>
> Fix by limiting the rx ring dma buffer size to the current MTU
> size.
I'd guess this is a memory allocate issue, not a bandwidth thing.
for 9K frames the driver needs to do order-2 allocations of 16K.
For 1500 2K allocations are sufficient (which is < 1 page, hence
a lot cheaper).
> Tested with iperf3 on a freescale imx6 + lan7430, both sides
> set to mtu 1500 bytes.
>
> Before:
> [ ID] Interval Transfer Bandwidth Retr
> [ 4] 0.00-20.00 sec 483 MBytes 203 Mbits/sec 0
> After:
> [ ID] Interval Transfer Bandwidth Retr
> [ 4] 0.00-20.00 sec 1.15 GBytes 496 Mbits/sec 0
>
> And with both sides set to MTU 9000 bytes:
> Before:
> [ ID] Interval Transfer Bandwidth Retr
> [ 4] 0.00-20.00 sec 1.87 GBytes 803 Mbits/sec 27
> After:
> [ ID] Interval Transfer Bandwidth Retr
> [ 4] 0.00-20.00 sec 1.98 GBytes 849 Mbits/sec 0
>
> Tested-by: Sven Van Asbroeck <[email protected]> # lan7430
> Signed-off-by: Sven Van Asbroeck <[email protected]>
This is a performance improvement, not a fix, it really needs to target
net-next.
> diff --git a/drivers/net/ethernet/microchip/lan743x_main.c b/drivers/net/ethernet/microchip/lan743x_main.c
> index ebb5e0bc516b..2bded1c46784 100644
> --- a/drivers/net/ethernet/microchip/lan743x_main.c
> +++ b/drivers/net/ethernet/microchip/lan743x_main.c
> @@ -1957,11 +1957,11 @@ static int lan743x_rx_next_index(struct lan743x_rx *rx, int index)
>
> static struct sk_buff *lan743x_rx_allocate_skb(struct lan743x_rx *rx)
> {
> - int length = 0;
> + struct net_device *netdev = rx->adapter->netdev;
>
> - length = (LAN743X_MAX_FRAME_SIZE + ETH_HLEN + 4 + RX_HEAD_PADDING);
> - return __netdev_alloc_skb(rx->adapter->netdev,
> - length, GFP_ATOMIC | GFP_DMA);
> + return __netdev_alloc_skb(netdev,
> + netdev->mtu + ETH_HLEN + 4 + RX_HEAD_PADDING,
> + GFP_ATOMIC | GFP_DMA);
> }
>
> static int lan743x_rx_init_ring_element(struct lan743x_rx *rx, int index,
> @@ -1969,9 +1969,10 @@ static int lan743x_rx_init_ring_element(struct lan743x_rx *rx, int index,
> {
> struct lan743x_rx_buffer_info *buffer_info;
> struct lan743x_rx_descriptor *descriptor;
> - int length = 0;
> + struct net_device *netdev = rx->adapter->netdev;
> + int length;
>
> - length = (LAN743X_MAX_FRAME_SIZE + ETH_HLEN + 4 + RX_HEAD_PADDING);
> + length = netdev->mtu + ETH_HLEN + 4 + RX_HEAD_PADDING;
> descriptor = &rx->ring_cpu_ptr[index];
> buffer_info = &rx->buffer_info[index];
> buffer_info->skb = skb;
> @@ -2157,8 +2158,8 @@ static int lan743x_rx_process_packet(struct lan743x_rx *rx)
> int index = first_index;
>
> /* multi buffer packet not supported */
> - /* this should not happen since
> - * buffers are allocated to be at least jumbo size
> + /* this should not happen since buffers are allocated
> + * to be at least the mtu size configured in the mac.
> */
>
> /* clean up buffers */
> @@ -2632,9 +2633,13 @@ static int lan743x_netdev_change_mtu(struct net_device *netdev, int new_mtu)
> struct lan743x_adapter *adapter = netdev_priv(netdev);
> int ret = 0;
>
> + if (netif_running(netdev))
> + return -EBUSY;
That may cause a regression to users of the driver who expect to be
able to set the MTU when the device is running. You need to disable
the NAPI, pause the device, swap the buffers for smaller / bigger ones
and restart the device.
> ret = lan743x_mac_set_mtu(adapter, new_mtu);
> if (!ret)
> netdev->mtu = new_mtu;
> +
> return ret;
> }
>
> That's a good question. I used perf to create a flame graph of what
> the cpu was doing when receiving data at high speed. It showed that
> __dma_page_dev_to_cpu took up most of the cpu time. Which is triggered
> by dma_unmap_single(9K, DMA_FROM_DEVICE).
>
> So I assumed that it's a PCIe dma bandwidth issue, but I could be wrong -
> I didn't do any PCIe bandwidth measurements.
Sometimes it is actually cache operations which take all the
time. This needs to invalidate the cache, so that when the memory is
then accessed, it get fetched from RAM. On SMP machines, cache
invalidation can be expensive, due to all the cross CPU operations.
I've actually got better performance by building a UP kernel on some
low core count ARM CPUs.
There are some tricks which can be played. Do you actually need all
9K? Does the descriptor tell you actually how much is used? You can
get a nice speed up if you just unmap 64 bytes for a TCP ACK, rather
than the full 9K.
Andrew
On Tue, 8 Dec 2020 16:54:33 -0500 Sven Van Asbroeck wrote:
> > > Tested with iperf3 on a freescale imx6 + lan7430, both sides
> > > set to mtu 1500 bytes.
> > >
> > > Before:
> > > [ ID] Interval Transfer Bandwidth Retr
> > > [ 4] 0.00-20.00 sec 483 MBytes 203 Mbits/sec 0
> > > After:
> > > [ ID] Interval Transfer Bandwidth Retr
> > > [ 4] 0.00-20.00 sec 1.15 GBytes 496 Mbits/sec 0
> > >
> > > And with both sides set to MTU 9000 bytes:
> > > Before:
> > > [ ID] Interval Transfer Bandwidth Retr
> > > [ 4] 0.00-20.00 sec 1.87 GBytes 803 Mbits/sec 27
> > > After:
> > > [ ID] Interval Transfer Bandwidth Retr
> > > [ 4] 0.00-20.00 sec 1.98 GBytes 849 Mbits/sec 0
> > >
> > > Tested-by: Sven Van Asbroeck <[email protected]> # lan7430
> > > Signed-off-by: Sven Van Asbroeck <[email protected]>
> >
> > This is a performance improvement, not a fix, it really needs to target
> > net-next.
>
> I thought it'd be cool if 'historic' kernels could benefit from this performance
> improvement too, but yeah if it's against policy it should go into net-next.
>
> What about the other patch in the patchset (ping-pong). Should it go into
> net-next as well?
The jury is out on that one. Using ring size for netif_napi_add()
and updating RX_TAIL at the end of NAPI is pretty broken. So that
one can qualify as a fix IMHO.
> > > @@ -2632,9 +2633,13 @@ static int lan743x_netdev_change_mtu(struct net_device *netdev, int new_mtu)
> > > struct lan743x_adapter *adapter = netdev_priv(netdev);
> > > int ret = 0;
> > >
> > > + if (netif_running(netdev))
> > > + return -EBUSY;
> >
> > That may cause a regression to users of the driver who expect to be
> > able to set the MTU when the device is running. You need to disable
> > the NAPI, pause the device, swap the buffers for smaller / bigger ones
> > and restart the device.
>
> That's what I tried first, but I quickly ran into a spot of trouble:
> restarting the device may fail (unlikely but possible). So when the user tries
> to change the mtu and that errors out, they might end up with a stopped device.
> Is that acceptable behaviour? If so, I'll add it to the patch.
Fail because of memory allocation failures?
The best way to work around that is to allocate all the memory for new
configuration before you free the old memory. This also makes the
change a lot less disturbing to the traffic because you can do all the
allocations before the device is disabled, do the swap, start the
device, and then free the old set.
On Tue, 8 Dec 2020 18:02:30 -0500 Sven Van Asbroeck wrote:
> On Tue, Dec 8, 2020 at 5:51 PM Andrew Lunn <[email protected]> wrote:
> > > So I assumed that it's a PCIe dma bandwidth issue, but I could be wrong -
> > > I didn't do any PCIe bandwidth measurements.
> >
> > Sometimes it is actually cache operations which take all the
> > time. This needs to invalidate the cache, so that when the memory is
> > then accessed, it get fetched from RAM. On SMP machines, cache
> > invalidation can be expensive, due to all the cross CPU operations.
> > I've actually got better performance by building a UP kernel on some
> > low core count ARM CPUs.
> >
> > There are some tricks which can be played. Do you actually need all
> > 9K? Does the descriptor tell you actually how much is used? You can
> > get a nice speed up if you just unmap 64 bytes for a TCP ACK, rather
> > than the full 9K.
Good point!
> Thank you for the suggestion! The original driver developer chose 9K because
> presumably that's the largest frame size supported by the chip.
>
> Yes, I believe the chip will tell us via the descriptor how much it has
> written, I would have to double-check. I was already looking for a
> "trick" to transfer only the required number of bytes, but I was led to
> believe that dma_map_single() and dma_unmap_single() always needed to match.
>
> So:
> dma_map_single(9K) followed by dma_unmap_single(9K) is correct, and
> dma_map_single(9K) followed by dma_unmap_single(1500 bytes) means trouble.
>
> How can we get around that?
You can set DMA_ATTR_SKIP_CPU_SYNC and then sync only the part of the
buffer that got written.
> dma_sync_single_for_{cpu,device} is what you would need in order to make
> a partial cache line invalidation. You would still need to unmap the
> same address+length pair that was used for the initial mapping otherwise
> the DMA-API debugging will rightfully complain.
But often you don't unmap it, you call dma_sync_single_for_device and
put it back into the ring.
Andrew
On Tue, Dec 8, 2020 at 6:36 PM Florian Fainelli <[email protected]> wrote:
>
> dma_sync_single_for_{cpu,device} is what you would need in order to make
> a partial cache line invalidation. You would still need to unmap the
> same address+length pair that was used for the initial mapping otherwise
> the DMA-API debugging will rightfully complain.
I tried replacing
dma_unmap_single(9K, DMA_FROM_DEVICE);
with
dma_sync_single_for_cpu(received_size=1500 bytes, DMA_FROM_DEVICE);
dma_unmap_single_attrs(9K, DMA_FROM_DEVICE, DMA_ATTR_SKIP_CPU_SYNC);
and that works! But the bandwidth is still pretty bad, because the cpu
now spends most of its time doing
dma_map_single(9K, DMA_FROM_DEVICE);
which spends a lot of time doing __dma_page_cpu_to_dev.
When I try and replace that with
dma_map_single_attrs(9K, DMA_FROM_DEVICE, DMA_ATTR_SKIP_CPU_SYNC);
Then I get lots of dropped packets, which seems to indicate data corruption.
Interestingly, when I do
dma_map_single_attrs(9K, DMA_FROM_DEVICE, DMA_ATTR_SKIP_CPU_SYNC);
dma_sync_single_for_{cpu|device}(9K, DMA_FROM_DEVICE);
then the dropped packets disappear, but things are still very slow.
What am I missing?
On Tue, Dec 08, 2020 at 10:49:16PM -0500, Sven Van Asbroeck wrote:
> On Tue, Dec 8, 2020 at 6:36 PM Florian Fainelli <[email protected]> wrote:
> >
> > dma_sync_single_for_{cpu,device} is what you would need in order to make
> > a partial cache line invalidation. You would still need to unmap the
> > same address+length pair that was used for the initial mapping otherwise
> > the DMA-API debugging will rightfully complain.
>
> I tried replacing
> dma_unmap_single(9K, DMA_FROM_DEVICE);
> with
> dma_sync_single_for_cpu(received_size=1500 bytes, DMA_FROM_DEVICE);
> dma_unmap_single_attrs(9K, DMA_FROM_DEVICE, DMA_ATTR_SKIP_CPU_SYNC);
>
> and that works! But the bandwidth is still pretty bad, because the cpu
> now spends most of its time doing
> dma_map_single(9K, DMA_FROM_DEVICE);
> which spends a lot of time doing __dma_page_cpu_to_dev.
9K is not a nice number, since for each allocation it probably has to
find 4 contiguous pages. See what the performance difference is with
2K, 4K and 8K. If there is a big difference, you might want to special
case when the MTU is set for jumbo packets, or check if the hardware
can do scatter/gather.
You also need to be careful with caches and speculation. As you have
seen, bad things can happen. And it can be a lot more subtle. If some
code is accessing the page before the buffer and gets towards the end
of the page, the CPU might speculatively bring in the next page, i.e
the start of the buffer. If that happens before the DMA operation, and
you don't invalidate the cache correctly, you get hard to find
corruption.
Andrew
Hi Andrew,
On Wed, Dec 9, 2020 at 9:10 AM Andrew Lunn <[email protected]> wrote:
>
> 9K is not a nice number, since for each allocation it probably has to
> find 4 contiguous pages. See what the performance difference is with
> 2K, 4K and 8K. If there is a big difference, you might want to special
> case when the MTU is set for jumbo packets, or check if the hardware
> can do scatter/gather.
>
> You also need to be careful with caches and speculation. As you have
> seen, bad things can happen. And it can be a lot more subtle. If some
> code is accessing the page before the buffer and gets towards the end
> of the page, the CPU might speculatively bring in the next page, i.e
> the start of the buffer. If that happens before the DMA operation, and
> you don't invalidate the cache correctly, you get hard to find
> corruption.
Thank you for the guidance. When I keep the 9K buffers, and sync
only the buffer space that is being used (mtu when mapping, received
packet size when unmapping), then there is no more corruption, and
performance improves. But setting the buffer size to the mtu size
still provides much better performance. I do not understand why
(yet).
It seems that caching and dma behaviour/performance on arm32
(armv7) is very different compared to x86.
On 12/16/20 4:57 PM, Sven Van Asbroeck wrote:
> Hi Andrew,
>
> On Wed, Dec 9, 2020 at 9:10 AM Andrew Lunn <[email protected]> wrote:
>>
>> 9K is not a nice number, since for each allocation it probably has to
>> find 4 contiguous pages. See what the performance difference is with
>> 2K, 4K and 8K. If there is a big difference, you might want to special
>> case when the MTU is set for jumbo packets, or check if the hardware
>> can do scatter/gather.
>>
>> You also need to be careful with caches and speculation. As you have
>> seen, bad things can happen. And it can be a lot more subtle. If some
>> code is accessing the page before the buffer and gets towards the end
>> of the page, the CPU might speculatively bring in the next page, i.e
>> the start of the buffer. If that happens before the DMA operation, and
>> you don't invalidate the cache correctly, you get hard to find
>> corruption.
>
> Thank you for the guidance. When I keep the 9K buffers, and sync
> only the buffer space that is being used (mtu when mapping, received
> packet size when unmapping), then there is no more corruption, and
> performance improves. But setting the buffer size to the mtu size
> still provides much better performance. I do not understand why
> (yet).
>
> It seems that caching and dma behaviour/performance on arm32
> (armv7) is very different compared to x86.
x86 is a fully cache and device coherent memory architecture and there
are smarts like DDIO to bring freshly DMA'd data into the L3 cache
directly. For ARMv7, it depends on the hardware you have, most ARMv7
SoCs do not have hardware maintained coherency at all, this means that
doing the cache maintenance operations is costly. This is even true on
platforms that use an external cache controller (PL310).
--
Florian
On Wed, Dec 16, 2020 at 8:01 PM Florian Fainelli <[email protected]> wrote:
>
> x86 is a fully cache and device coherent memory architecture and there
> are smarts like DDIO to bring freshly DMA'd data into the L3 cache
> directly. For ARMv7, it depends on the hardware you have, most ARMv7
> SoCs do not have hardware maintained coherency at all, this means that
> doing the cache maintenance operations is costly. This is even true on
> platforms that use an external cache controller (PL310).
Thank you, that's quite fascinating. The functions my armv7 spends most
time in during ethernet receive (eg. v7_dma_inv_range) do appear to be
just nops on x86.