On 2021-03-25 10:45, Rakesh Pillai wrote:
> Hi Felix / Ben,
>
> In case of ath10k (snoc based targets), we have a lot of processing in the NAPI context.
> Even moving this to threaded NAPI is not helping much due to the load.
>
> Breaking the tasks into multiple context (with the patch series I posted) is helping in improving the throughput.
> With the current rx_thread based approach, the rx processing is broken into two parallel contexts
> 1) reaping the packets from the HW
> 2) processing these packets list and handing it over to mac80211 (and later to the network stack)
>
> This is the primary reason for choosing the rx thread approach.
Have you considered the possibility that maybe the problem is that the
driver doing too much work?
One example is that you could take advantage of the new 802.3 decap
offload to simplify rx processing. Worked for me on mt76 where a
dual-core 1.3 GHz A64 can easily handle >1.8 Gbps local TCP rx on a
single card, without the rx NAPI thread being the biggest consumer of
CPU cycles.
And if you can't do that and still consider all of the metric tons of
processing work necessary, you could still do this:
On interrupts, spawn a processing thread that traverses the ring and
does the preparation work (instead of NAPI).
From that thread you schedule the threaded NAPI handler that processes
these packets further and hands them to mac80211.
To keep the load somewhat balanced, you can limit the number of
pre-processed packets in the ring.
- Felix