From: Felix Fietkau <nbd@nbd.name>
Date: Fri, 17 Feb 2023 16:26:37 +0100
To: Eric Dumazet
Cc: netdev@vger.kernel.org, "David S. Miller", Jakub Kicinski,
 Paolo Abeni, linux-kernel@vger.kernel.org
Subject: Re: [RFC v2] net/core: add optional threading for rps backlog processing

On 17.02.23 15:38, Eric Dumazet wrote:
> On Fri, Feb 17, 2023 at 2:40 PM Felix Fietkau wrote:
>>
>> On 17.02.23 13:57, Eric Dumazet wrote:
>> > On Fri, Feb 17, 2023 at 1:35 PM Felix Fietkau wrote:
>> >>
>> >> On 17.02.23 13:23, Eric Dumazet wrote:
>> >> > On Fri, Feb 17, 2023 at 11:06 AM Felix Fietkau wrote:
>> >> >>
>> >> >> When dealing with few flows or an imbalance in CPU utilization,
>> >> >> static RPS CPU assignment can be too inflexible. Add support for
>> >> >> enabling threaded NAPI for RPS backlog processing in order to
>> >> >> allow the scheduler to better balance processing. This helps
>> >> >> spread the load across idle CPUs.
>> >> >>
>> >> >> Signed-off-by: Felix Fietkau <nbd@nbd.name>
>> >> >> ---
>> >> >>
>> >> >> RFC v2:
>> >> >>  - fix rebase error in rps locking
>> >> >
>> >> > Why only deal with RPS?
>> >> >
>> >> > It seems you propose that the softnet_data backlog be processed by
>> >> > a thread, rather than from softirq?
>> >> Right. I originally wanted to mainly improve RPS, but my patch does
>> >> cover the backlog in general. I will update the description in the
>> >> next version. Does the approach in general make sense to you?
>> >
>> > I do not know, this seems to lack some (perf) numbers, and
>> > descriptions of added max latencies and stuff like that :)
>> I just ran some tests where I used an MT7621 device (dual-core 800 MHz
>> MIPS, 4 threads) as a router doing NAT without flow offloading.
>>
>> Using the flent RRUL test between 2 PCs connected through the router,
>> I get these results:
>>
>> rps_threaded=0: (combined CPU idle time around 27%)
>>                            avg      median    99th %      # data pts
>>  Ping (ms) ICMP   :      26.08       28.70     54.74 ms          199
>>  Ping (ms) UDP BE :       1.96       24.12     37.28 ms          200
>>  Ping (ms) UDP BK :       1.88       15.86     27.30 ms          200
>>  Ping (ms) UDP EF :       1.98       31.77     54.10 ms          200
>>  Ping (ms) avg    :       1.94         N/A       N/A ms          200
>>  TCP download BE  :      69.25       70.20    139.55 Mbits/s     200
>>  TCP download BK  :      95.15       92.51    163.93 Mbits/s     200
>>  TCP download CS5 :     133.64      129.10    292.46 Mbits/s     200
>>  TCP download EF  :     129.86      127.70    254.47 Mbits/s     200
>>  TCP download avg :     106.97         N/A       N/A Mbits/s     200
>>  TCP download sum :     427.90         N/A       N/A Mbits/s     200
>>  TCP totals       :     864.43         N/A       N/A Mbits/s     200
>>  TCP upload BE    :      97.54       96.67    163.99 Mbits/s     200
>>  TCP upload BK    :     139.76      143.88    190.37 Mbits/s     200
>>  TCP upload CS5   :      97.52       94.70    206.60 Mbits/s     200
>>  TCP upload EF    :     101.71      106.72    147.88 Mbits/s     200
>>  TCP upload avg   :     109.13         N/A       N/A Mbits/s     200
>>  TCP upload sum   :     436.53         N/A       N/A Mbits/s     200
>>
>> rps_threaded=1: (combined CPU idle time around 16%)
>>                            avg      median    99th %      # data pts
>>  Ping (ms) ICMP   :      13.70       16.10     27.60 ms          199
>>  Ping (ms) UDP BE :       2.03       18.35     24.16 ms          200
>>  Ping (ms) UDP BK :       2.03       18.36     29.13 ms          200
>>  Ping (ms) UDP EF :       2.36       25.20     41.50 ms          200
>>  Ping (ms) avg    :       2.14         N/A       N/A ms          200
>>  TCP download BE  :     118.69      120.94    160.12 Mbits/s     200
>>  TCP download BK  :     134.67      137.81    177.14 Mbits/s     200
>>  TCP download CS5 :     126.15      127.81    174.84 Mbits/s     200
>>  TCP download EF  :      78.36       79.41    143.31 Mbits/s     200
>>  TCP download avg :     114.47         N/A       N/A Mbits/s     200
>>  TCP download sum :     457.87         N/A       N/A Mbits/s     200
>>  TCP totals       :     918.19         N/A       N/A Mbits/s     200
>>  TCP upload BE    :     112.20      111.55    164.38 Mbits/s     200
>>  TCP upload BK    :     144.99      139.24    205.12 Mbits/s     200
>>  TCP upload CS5   :      93.09       95.50    132.39 Mbits/s     200
>>  TCP upload EF    :     110.04      108.21    207.00 Mbits/s     200
>>  TCP upload avg   :     115.08         N/A       N/A Mbits/s     200
>>  TCP upload sum   :     460.32         N/A       N/A Mbits/s     200
>>
>> As you can see, both throughput and latency improve because the load
>> can be spread across CPU cores more effectively.
>>
>
> What happens if user threads are competing with your kthreads?
>
> It seems you are adding another variant of ksoftirqd.
>
> More threads might look better in some cases, if we accept being at
> the mercy of process scheduling decisions.
>
> NUMA affinities matter; I do not see how you are dealing with this.
>
> Your patch assumes all cpus can participate in network processing,
> but RPS is fine grained:
> boxes with 128 or 256 cpus usually have a different rps_mask per
> receive queue, with 2 or 4 bits set in the per-rx-queue rps_mask.

I'm assuming that in cases where this matters, user space can set the
affinities and priority of the rps threads. The number in the rps-%d
process name matches the bit number for rps_mask, so user space can
control things in a fine-grained way (see the sketch below).

Would you prefer having support for selectively enabling threading on
individual backlogs with a mask?
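For illustration, pinning and prioritizing one backlog thread from user
space could look roughly like this (untested sketch; it assumes the
rps-%d kthread naming from this patch, with the thread's PID looked up
beforehand, e.g. from /proc):

/* pin_rps.c - untested sketch: pin an rps-N kthread to CPU N and give
 * it SCHED_FIFO priority. PID and CPU number come from the command
 * line; the CPU number is the bit index N from rps_mask. */
#define _GNU_SOURCE
#include <sched.h>
#include <stdio.h>
#include <stdlib.h>

int main(int argc, char **argv)
{
        struct sched_param param = { .sched_priority = 1 };
        cpu_set_t mask;
        pid_t pid;
        int cpu;

        if (argc < 3) {
                fprintf(stderr, "usage: %s <pid-of-rps-N> <cpu>\n", argv[0]);
                return 1;
        }
        pid = atoi(argv[1]);    /* PID of the rps-N kthread */
        cpu = atoi(argv[2]);    /* bit number N from rps_mask */

        /* restrict the kthread to one CPU ... */
        CPU_ZERO(&mask);
        CPU_SET(cpu, &mask);
        if (sched_setaffinity(pid, sizeof(mask), &mask))
                perror("sched_setaffinity");

        /* ... and raise its priority above regular user threads */
        if (sched_setscheduler(pid, SCHED_FIFO, &param))
                perror("sched_setscheduler");
        return 0;
}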
> Then, process_backlog() has been designed to run only from the cpu
> tied to the per-cpu data (softnet_data).
> There are multiple comments about this assumption, and various things
> that would need to be changed
> (e.g. sd_has_rps_ipi_waiting() would be wrong in its current
> implementation).
That's why I added the NAPI_STATE_THREADED check in napi_schedule_rps(),
so that sd_has_rps_ipi_waiting() always returns false for a threaded
backlog; the idea is sketched below. Or are you worried about a race
when enabling threading?
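Roughly like this (a reconstructed sketch of napi_schedule_rps() in
net/core/dev.c, not the literal hunk from this patch):

/* Sketch: if the target CPU's backlog NAPI is threaded, wake its
 * kthread directly instead of adding sd to rps_ipi_list, so
 * sd_has_rps_ipi_waiting() keeps returning false and no RPS IPI is
 * ever queued for it. */
static int napi_schedule_rps(struct softnet_data *sd)
{
        struct softnet_data *mysd = this_cpu_ptr(&softnet_data);

#ifdef CONFIG_RPS
        if (test_bit(NAPI_STATE_THREADED, &sd->backlog.state)) {
                /* wakes the rps-N kthread via the threaded NAPI path */
                __napi_schedule_irqoff(&sd->backlog);
                return 0;
        }

        if (sd != mysd) {
                /* legacy path: queue an IPI to the remote CPU */
                sd->rps_ipi_next = mysd->rps_ipi_list;
                mysd->rps_ipi_list = sd;

                __raise_softirq_irqoff(NET_RX_SOFTIRQ);
                return 1;
        }
#endif /* CONFIG_RPS */
        __napi_schedule_irqoff(&mysd->backlog);
        return 0;
}

- Felix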