On 29/05/2024 21:52, Sergey Shtylyov wrote:
> On 5/28/24 6:03 PM, Paul Barker wrote:
>
>> This patch makes multiple changes that can't be separated:
>>
>> 1) Allocate plain RX buffers via a page pool instead of allocating
>> SKBs, then use build_skb() when a packet is received.
>> 2) For GbEth IP, reduce the RX buffer size to 2kB.
>> 3) For GbEth IP, merge packets which span more than one RX descriptor
>> as SKB fragments instead of copying data.
>>
>> Implementing (1) without (2) would require the use of an order-1 page
>> pool (instead of an order-0 page pool split into page fragments) for
>> GbEth.
>>
>> Implementing (2) without (3) would leave us no space to re-assemble
>> packets which span more than one RX descriptor.
>>
>> Implementing (3) without (1) would not be possible as the network stack
>> expects to use put_page() or page_pool_put_page() to free SKB fragments
>> after an SKB is consumed.
>>
>> RX checksum offload support is adjusted to handle both linear and
>> nonlinear (fragmented) packets.
>>
>> This patch gives the following improvements during testing with iperf3.
>>
>> * RZ/G2L:
>> * TCP RX: same bandwidth at -43% CPU load (70% -> 40%)
>> * UDP RX: same bandwidth at -17% CPU load (88% -> 74%)
>>
>> * RZ/G2UL:
>> * TCP RX: +30% bandwidth (726Mbps -> 941Mbps)
>> * UDP RX: +417% bandwidth (108Mbps -> 558Mbps)
>>
>> * RZ/G3S:
>> * TCP RX: +64% bandwidth (562Mbps -> 920Mbps)
>> * UDP RX: +420% bandwidth (90Mbps -> 468Mbps)
>>
>> * RZ/Five:
>> * TCP RX: +217% bandwidth (145Mbps -> 459Mbps)
>> * UDP RX: +470% bandwidth (20Mbps -> 114Mbps)
>>
>> There is no significant impact on bandwidth or CPU load in testing on
>> RZ/G2H or R-Car M3N.
>>
>> Signed-off-by: Paul Barker <[email protected]>
>> ---
>> Changes v3->v4:
>> * Used a separate page pool for each RX queue.
>> * Passed struct ravb_rx_desc to ravb_alloc_rx_buffer() so that we can
>> simplify the calling function.
>> * Explained the calculation of rx_desc->ds_cc.
>> * Added handling of nonlinear SKBs in ravb_rx_csum_gbeth().
>>
>> drivers/net/ethernet/renesas/ravb.h | 10 +-
>> drivers/net/ethernet/renesas/ravb_main.c | 230 ++++++++++++++---------
>> 2 files changed, 146 insertions(+), 94 deletions(-)
>>
>> diff --git a/drivers/net/ethernet/renesas/ravb.h b/drivers/net/ethernet/renesas/ravb.h
>> index 6a7aa7dd17e6..f2091a17fcf7 100644
>> --- a/drivers/net/ethernet/renesas/ravb.h
>> +++ b/drivers/net/ethernet/renesas/ravb.h
> [...]> @@ -1094,7 +1099,8 @@ struct ravb_private {
>> struct ravb_tx_desc *tx_ring[NUM_TX_QUEUE];
>> void *tx_align[NUM_TX_QUEUE];
>> struct sk_buff *rx_1st_skb;
>> - struct sk_buff **rx_skb[NUM_RX_QUEUE];
>> + struct page_pool *rx_pool[NUM_RX_QUEUE];
>
> Don't we need #include <net/page_pool/types.h>

Yes. I got away with it as ravb_main.c includes
<net/page_pool/helpers.h> before including "ravb.h", but the header
shouldn't assume that.

>
> [...]
>> diff --git a/drivers/net/ethernet/renesas/ravb_main.c b/drivers/net/ethernet/renesas/ravb_main.c
>> index dd92f074881a..bb7f7d44be6e 100644
>> --- a/drivers/net/ethernet/renesas/ravb_main.c
>> +++ b/drivers/net/ethernet/renesas/ravb_main.c
> [...]
>> @@ -317,35 +289,56 @@ static void ravb_ring_free(struct net_device *ndev, int q)
>> priv->tx_skb[q] = NULL;
>> }
>>
>> +static int
>> +ravb_alloc_rx_buffer(struct net_device *ndev, int q, u32 entry, gfp_t gfp_mask,
>> + struct ravb_rx_desc *rx_desc)
>> +{
>> + struct ravb_private *priv = netdev_priv(ndev);
>> + const struct ravb_hw_info *info = priv->info;
>> + struct ravb_rx_buffer *rx_buff = &priv->rx_buffers[q][entry];
>> + dma_addr_t dma_addr;
>> + unsigned int size;
>> +
>> + size = info->rx_buffer_size;
>> + rx_buff->page = page_pool_alloc(priv->rx_pool[q], &rx_buff->offset, &size,
>> + gfp_mask);
>> + if (unlikely(!rx_buff->page)) {
>> + /* We just set the data size to 0 for a failed mapping
>> + * which should prevent DMA from happening...
>> + */
>> + rx_desc->ds_cc = cpu_to_le16(0);
>> + return -ENOMEM;
>> + }
>> +
>> + dma_addr = page_pool_get_dma_addr(rx_buff->page) + rx_buff->offset;
>> + dma_sync_single_for_device(ndev->dev.parent, dma_addr,
>> + info->rx_buffer_size, DMA_FROM_DEVICE);
>
> Do we really need this call?

Looking at .config I see CONFIG_DMA_NEED_SYNC=y so yes I think this is
needed.

>
>> + rx_desc->dptr = cpu_to_le32(dma_addr);
>> +
>> + /* The end of the RX buffer is used to store skb shared data, so we need
>> + * to ensure that the hardware leaves enough space for this.
>> + */
>> + rx_desc->ds_cc = cpu_to_le16(info->rx_buffer_size
>> + - SKB_DATA_ALIGN(sizeof(struct skb_shared_info))
>
> Please leave the - operator on the previous line...

Ack.

>
>> + - ETH_FCS_LEN + sizeof(__sum16));
>
> Here as well...

Ack.

>
>> + return 0;
>> +}
>> +
>> static u32
>> ravb_rx_ring_refill(struct net_device *ndev, int q, u32 count, gfp_t gfp_mask)
>> {
>> struct ravb_private *priv = netdev_priv(ndev);
>> - const struct ravb_hw_info *info = priv->info;
>> struct ravb_rx_desc *rx_desc;
>> - dma_addr_t dma_addr;
>> u32 i, entry;
>>
>> for (i = 0; i < count; i++) {
>> entry = (priv->dirty_rx[q] + i) % priv->num_rx_ring[q];
>> rx_desc = ravb_rx_get_desc(priv, q, entry);
>> - rx_desc->ds_cc = cpu_to_le16(info->rx_max_desc_use);
>>
>> - if (!priv->rx_skb[q][entry]) {
>> - priv->rx_skb[q][entry] = ravb_alloc_skb(ndev, info, gfp_mask);
>> - if (!priv->rx_skb[q][entry])
>> + if (!priv->rx_buffers[q][entry].page) {
>> + if (unlikely(ravb_alloc_rx_buffer(ndev, q, entry,
>
> Well, IIRC Greg KH is against using unlikely() unless you have actually
> instrumented the code and this gives an improvement... have you? :-)

My understanding was that we should use unlikely() for error checking in
hot code paths where we want the "good" path to be optimised. I can drop
this if I'm wrong though.

>
> [...]
>> @@ -727,12 +739,22 @@ static void ravb_rx_csum_gbeth(struct sk_buff *skb)
>> if (unlikely(skb->len < sizeof(__sum16) * 2))
>> return;
>>
>> - hw_csum = skb_tail_pointer(skb) - sizeof(__sum16);
>> + if (skb_is_nonlinear(skb)) {
>> + last_frag = &shinfo->frags[shinfo->nr_frags - 1];
>> + hw_csum = skb_frag_address(last_frag) + skb_frag_size(last_frag) - sizeof(__sum16);
>> + } else {
>> + hw_csum = skb_tail_pointer(skb) - sizeof(__sum16);
>> + }
>
> We can do the subtraction only once here...

Ack. I'll pull that out of the if.

>
> [...]
>> @@ -816,14 +824,26 @@ static int ravb_rx_gbeth(struct net_device *ndev, int budget, int q)
>> if (desc_status & MSC_CEEF)
>> stats->rx_missed_errors++;
>> } else {
>> + struct ravb_rx_buffer *rx_buff = &priv->rx_buffers[q][entry];
>> + void *rx_addr = page_address(rx_buff->page) + rx_buff->offset;
>
> Need an empty line here...

Ack.

>
>> die_dt = desc->die_dt & 0xF0;
>> - skb = ravb_get_skb_gbeth(ndev, entry, desc);
>> + dma_sync_single_for_cpu(ndev->dev.parent, le32_to_cpu(desc->dptr),
>> + desc_len, DMA_FROM_DEVICE);
>> +
>> switch (die_dt) {
>> case DT_FSINGLE:
>> case DT_FSTART:
>> /* Start of packet:
>> - * Set initial data length.
>> + * Prepare an SKB and add initial data.
>
> I'd prefer calling it skb in the comments...

Ack.

>
> [...]
>> @@ -865,7 +894,16 @@ static int ravb_rx_gbeth(struct net_device *ndev, int budget, int q)
>> stats->rx_bytes += skb->len;
>> napi_gro_receive(&priv->napi[q], skb);
>> rx_packets++;
>> +
>> + /* Clear rx_1st_skb so that it will only be
>> + * non-NULL when valid.
>> + */
>> + if (die_dt == DT_FEND)
>> + priv->rx_1st_skb = NULL;
>
> Hm, can't we do this under *case* DT_FEND above?

It makes more logical sense to me to do this as the last step, but I
guess it's a little more optimal to do it earlier. I'll move it.

Thanks,

--
Paul Barker

Attachments:

OpenPGP_0x27F4B3459F002257.asc (3.49 kB)
OpenPGP public key OpenPGP_signature.asc (243.00 B)
OpenPGP digital signature Download all attachments

2024-05-30 10:29:27

On 01/06/2024 11:13, Simon Horman wrote:
> On Tue, May 28, 2024 at 04:03:39PM +0100, Paul Barker wrote:
>> This patch makes multiple changes that can't be separated:
>>
>> 1) Allocate plain RX buffers via a page pool instead of allocating
>> SKBs, then use build_skb() when a packet is received.
>> 2) For GbEth IP, reduce the RX buffer size to 2kB.
>> 3) For GbEth IP, merge packets which span more than one RX descriptor
>> as SKB fragments instead of copying data.
>>
>> Implementing (1) without (2) would require the use of an order-1 page
>> pool (instead of an order-0 page pool split into page fragments) for
>> GbEth.
>>
>> Implementing (2) without (3) would leave us no space to re-assemble
>> packets which span more than one RX descriptor.
>>
>> Implementing (3) without (1) would not be possible as the network stack
>> expects to use put_page() or page_pool_put_page() to free SKB fragments
>> after an SKB is consumed.
>>
>> RX checksum offload support is adjusted to handle both linear and
>> nonlinear (fragmented) packets.
>>
>> This patch gives the following improvements during testing with iperf3.
>>
>> * RZ/G2L:
>> * TCP RX: same bandwidth at -43% CPU load (70% -> 40%)
>> * UDP RX: same bandwidth at -17% CPU load (88% -> 74%)
>>
>> * RZ/G2UL:
>> * TCP RX: +30% bandwidth (726Mbps -> 941Mbps)
>> * UDP RX: +417% bandwidth (108Mbps -> 558Mbps)
>>
>> * RZ/G3S:
>> * TCP RX: +64% bandwidth (562Mbps -> 920Mbps)
>> * UDP RX: +420% bandwidth (90Mbps -> 468Mbps)
>>
>> * RZ/Five:
>> * TCP RX: +217% bandwidth (145Mbps -> 459Mbps)
>> * UDP RX: +470% bandwidth (20Mbps -> 114Mbps)
>>
>> There is no significant impact on bandwidth or CPU load in testing on
>> RZ/G2H or R-Car M3N.
>>
>> Signed-off-by: Paul Barker <[email protected]>
>
> Hi Paul,
>
> Some minor feedback from my side.
>
> ...
>
>> diff --git a/drivers/net/ethernet/renesas/ravb_main.c b/drivers/net/ethernet/renesas/ravb_main.c
>
> ...
>
>> @@ -298,13 +269,14 @@ static void ravb_ring_free(struct net_device *ndev, int q)
>> priv->tx_ring[q] = NULL;
>> }
>>
>> - /* Free RX skb ringbuffer */
>> - if (priv->rx_skb[q]) {
>> - for (i = 0; i < priv->num_rx_ring[q]; i++)
>> - dev_kfree_skb(priv->rx_skb[q][i]);
>> + /* Free RX buffers */
>> + for (i = 0; i < priv->num_rx_ring[q]; i++) {
>> + if (priv->rx_buffers[q][i].page)
>> + page_pool_put_page(priv->rx_pool[q], priv->rx_buffers[q][i].page, 0, true);
>
> nit: Networking still prefers code to be 80 columns wide or less.
> It looks like that can be trivially achieved here.
>
> Flagged by checkpatch.pl --max-line-length=80

Sergey has asked me to wrap to 100 cols [1]. I can only find a reference
to 80 in the docs though [2], so I guess you may be right.

[1]: https://lore.kernel.org/all/[email protected]/
[2]: https://www.kernel.org/doc/html/latest/process/coding-style.html

>
>> }
>> - kfree(priv->rx_skb[q]);
>> - priv->rx_skb[q] = NULL;
>> + kfree(priv->rx_buffers[q]);
>> + priv->rx_buffers[q] = NULL;
>> + page_pool_destroy(priv->rx_pool[q]);
>>
>> /* Free aligned TX buffers */
>> kfree(priv->tx_align[q]);
>> @@ -317,35 +289,56 @@ static void ravb_ring_free(struct net_device *ndev, int q)
>> priv->tx_skb[q] = NULL;
>> }
>>
>> +static int
>> +ravb_alloc_rx_buffer(struct net_device *ndev, int q, u32 entry, gfp_t gfp_mask,
>> + struct ravb_rx_desc *rx_desc)
>> +{
>> + struct ravb_private *priv = netdev_priv(ndev);
>> + const struct ravb_hw_info *info = priv->info;
>> + struct ravb_rx_buffer *rx_buff = &priv->rx_buffers[q][entry];
>> + dma_addr_t dma_addr;
>> + unsigned int size;
>
> nit: I would appreciate it if some consideration could be given to
> moving this driver towards rather than away from reverse xmas
> tree - longest line to shortest - for local variable declarations.
>
> I'm not suggesting a clean-up patch. Rather, that in cases
> like this where new code is added, and also in cases where
> code is modified, reverse xmas tree is preferred.
>
> Here I would suggest separating the assinment of rx_buf from
> it's declaration (completely untested!):
>
> struct ravb_private *priv = netdev_priv(ndev);
> const struct ravb_hw_info *info = priv->info;
> struct ravb_rx_buffer *rx_buff;
> dma_addr_t dma_addr;
> unsigned int size;
>
> rx_buff = &priv->rx_buffers[q][entry];
>
> Edward Cree's xmastree tool can be helpful here:
> https://github.com/ecree-solarflare/xmastree

Ack.

>
>> +
>> + size = info->rx_buffer_size;
>> + rx_buff->page = page_pool_alloc(priv->rx_pool[q], &rx_buff->offset, &size,
>> + gfp_mask);
>> + if (unlikely(!rx_buff->page)) {
>> + /* We just set the data size to 0 for a failed mapping
>> + * which should prevent DMA from happening...
>> + */
>> + rx_desc->ds_cc = cpu_to_le16(0);
>> + return -ENOMEM;
>> + }
>> +
>> + dma_addr = page_pool_get_dma_addr(rx_buff->page) + rx_buff->offset;
>> + dma_sync_single_for_device(ndev->dev.parent, dma_addr,
>> + info->rx_buffer_size, DMA_FROM_DEVICE);
>> + rx_desc->dptr = cpu_to_le32(dma_addr);
>> +
>> + /* The end of the RX buffer is used to store skb shared data, so we need
>> + * to ensure that the hardware leaves enough space for this.
>> + */
>> + rx_desc->ds_cc = cpu_to_le16(info->rx_buffer_size
>> + - SKB_DATA_ALIGN(sizeof(struct skb_shared_info))
>> + - ETH_FCS_LEN + sizeof(__sum16));
>> + return 0;
>> +}
>
> ...
>
>> @@ -816,14 +824,26 @@ static int ravb_rx_gbeth(struct net_device *ndev, int budget, int q)
>> if (desc_status & MSC_CEEF)
>> stats->rx_missed_errors++;
>> } else {
>> + struct ravb_rx_buffer *rx_buff = &priv->rx_buffers[q][entry];
>> + void *rx_addr = page_address(rx_buff->page) + rx_buff->offset;
>> die_dt = desc->die_dt & 0xF0;
>> - skb = ravb_get_skb_gbeth(ndev, entry, desc);
>> + dma_sync_single_for_cpu(ndev->dev.parent, le32_to_cpu(desc->dptr),
>> + desc_len, DMA_FROM_DEVICE);
>> +
>> switch (die_dt) {
>> case DT_FSINGLE:
>> case DT_FSTART:
>> /* Start of packet:
>> - * Set initial data length.
>> + * Prepare an SKB and add initial data.
>> */
>> + skb = napi_build_skb(rx_addr, info->rx_buffer_size);
>> + if (unlikely(!skb)) {
>> + stats->rx_errors++;
>> + page_pool_put_page(priv->rx_pool[q],
>> + rx_buff->page, 0, true);
>
> Here skb is NULL.
>
>> + break;
>> + }
>> + skb_mark_for_recycle(skb);
>> skb_put(skb, desc_len);
>>
>> /* Save this SKB if the packet spans multiple
>> @@ -836,14 +856,23 @@ static int ravb_rx_gbeth(struct net_device *ndev, int budget, int q)
>> case DT_FMID:
>> case DT_FEND:
>> /* Continuing a packet:
>> - * Move data into the saved SKB.
>> + * Add this buffer as an RX frag.
>> */
>> - skb_copy_to_linear_data_offset(priv->rx_1st_skb,
>> - priv->rx_1st_skb->len,
>> - skb->data,
>> - desc_len);
>> - skb_put(priv->rx_1st_skb, desc_len);
>> - dev_kfree_skb(skb);
>> +
>> + /* rx_1st_skb will be NULL if napi_build_skb()
>> + * failed for the first descriptor of a
>> + * multi-descriptor packet.
>> + */
>> + if (unlikely(!priv->rx_1st_skb)) {
>> + stats->rx_errors++;
>> + page_pool_put_page(priv->rx_pool[q],
>> + rx_buff->page, 0, true);
>
> And here skb seems to be uninitialised.
>
>> + break;
>> + }
>> + skb_add_rx_frag(priv->rx_1st_skb,
>> + skb_shinfo(priv->rx_1st_skb)->nr_frags,
>> + rx_buff->page, rx_buff->offset,
>> + desc_len, info->rx_buffer_size);
>>
>> /* Set skb to point at the whole packet so that
>> * we only need one code path for finishing a
>
> The code between the hunk above and the hunk below is:
>
> /* Set skb to point at the whole packet so that
> * we only need one code path for finishing a
> * packet.
> */
> skb = priv->rx_1st_skb;
> }
> switch (die_dt) {
> case DT_FSINGLE:
> case DT_FEND:
> /* Finishing a packet:
> * Determine protocol & checksum, hand off to
> * NAPI and update our stats.
> */
> skb->protocol = eth_type_trans(skb, ndev);
> if (ndev->features & NETIF_F_RXCSUM)
> ravb_rx_csum_gbeth(skb);
> stats->rx_bytes += skb->len;
> napi_gro_receive(&priv->napi[q], skb);
> rx_packets++;
>
> It seems that the inter-hunk code above may dereference skb when it is NULL
> or uninitialised.
>
> Flagged by Smatch.

I see what has happened here. I wrote this using if statements first,
then changed to switch statements in response to Sergey's review. So the
break statements were intended to break out of the outer for loop, not
the switch statement. I'll need to replace them with goto statements.

Thanks for your review!

--
Paul Barker

Attachments:

OpenPGP_0x27F4B3459F002257.asc (3.49 kB)
OpenPGP public key OpenPGP_signature.asc (243.00 B)
OpenPGP digital signature Download all attachments

2024-06-03 12:08:50

by Simon Horman

[permalink] [raw]

Subject: Re: [net-next PATCH v4 7/7] net: ravb: Allocate RX buffers via page pool

On Mon, Jun 03, 2024 at 09:02:51AM +0100, Paul Barker wrote:
> On 01/06/2024 11:13, Simon Horman wrote:
> > On Tue, May 28, 2024 at 04:03:39PM +0100, Paul Barker wrote:

...

> >> @@ -298,13 +269,14 @@ static void ravb_ring_free(struct net_device *ndev, int q)
> >> priv->tx_ring[q] = NULL;
> >> }
> >>
> >> - /* Free RX skb ringbuffer */
> >> - if (priv->rx_skb[q]) {
> >> - for (i = 0; i < priv->num_rx_ring[q]; i++)
> >> - dev_kfree_skb(priv->rx_skb[q][i]);
> >> + /* Free RX buffers */
> >> + for (i = 0; i < priv->num_rx_ring[q]; i++) {
> >> + if (priv->rx_buffers[q][i].page)
> >> + page_pool_put_page(priv->rx_pool[q], priv->rx_buffers[q][i].page, 0, true);
> >
> > nit: Networking still prefers code to be 80 columns wide or less.
> > It looks like that can be trivially achieved here.
> >
> > Flagged by checkpatch.pl --max-line-length=80
>
> Sergey has asked me to wrap to 100 cols [1]. I can only find a reference
> to 80 in the docs though [2], so I guess you may be right.
>
> [1]: https://lore.kernel.org/all/[email protected]/
> [2]: https://www.kernel.org/doc/html/latest/process/coding-style.html

Hi Paul,

If Sergey prefers 100 then I won't argue :)

FWIIW, think what has happened here relates to the Kernel, at some point,
going from 80 to 100 columns as the preferred maximum width, while Networking
stuck with 80.

...

2024-06-03 12:16:06

On 6/3/24 3:15 PM, Paul Barker wrote:
[...]

>>>>> @@ -298,13 +269,14 @@ static void ravb_ring_free(struct net_device *ndev, int q)
>>>>> priv->tx_ring[q] = NULL;
>>>>> }
>>>>>
>>>>> - /* Free RX skb ringbuffer */
>>>>> - if (priv->rx_skb[q]) {
>>>>> - for (i = 0; i < priv->num_rx_ring[q]; i++)
>>>>> - dev_kfree_skb(priv->rx_skb[q][i]);
>>>>> + /* Free RX buffers */
>>>>> + for (i = 0; i < priv->num_rx_ring[q]; i++) {
>>>>> + if (priv->rx_buffers[q][i].page)
>>>>> + page_pool_put_page(priv->rx_pool[q], priv->rx_buffers[q][i].page, 0, true);
>>>>
>>>> nit: Networking still prefers code to be 80 columns wide or less.
>>>> It looks like that can be trivially achieved here.
>>>>
>>>> Flagged by checkpatch.pl --max-line-length=80
>>>
>>> Sergey has asked me to wrap to 100 cols [1]. I can only find a reference
>>> to 80 in the docs though [2], so I guess you may be right.
>>>
>>> [1]: https://lore.kernel.org/all/[email protected]/
>>> [2]: https://www.kernel.org/doc/html/latest/process/coding-style.html
>>
>> Hi Paul,
>>
>> If Sergey prefers 100 then I won't argue :)
>>
>> FWIIW, think what has happened here relates to the Kernel, at some point,
>> going from 80 to 100 columns as the preferred maximum width, while Networking
>> stuck with 80.
>
> I saw that netdevbpf patchwork is configured for 80 cols and it has
> warnings for v4 of this patch [1], so I've already re-wrapped the
> changes in this series to 80 cols (excluding a couple of lines where
> using slightly more than 80 cols significantly improves readability).
> I'm planning to send that in the next hour or so, assuming my tests
> pass.
>
> [1]: https://patchwork.kernel.org/project/netdevbpf/patch/[email protected]/

Sorry for misinforming you about 100 coulmns -- I had no idea netdev stuck
to 80! :-)

MBR, Sergey