From: Sagi Grimberg
Date: Mon, 17 Apr 2023 16:33:40 +0300
Subject: Re: [PATCH v2] nvme/tcp: Add support to set the tcp worker cpu affinity
To: Ming Lei, Li Feng
Cc: Keith Busch, Jens Axboe, Christoph Hellwig, "open list:NVM EXPRESS DRIVER", linux-kernel
References: <20230413062339.2454616-1-fengli@smartx.com> <20230413132941.2489795-1-fengli@smartx.com>

>>> On Thu, Apr 13, 2023 at 09:29:41PM +0800, Li Feng wrote:
>>>> The default worker affinity policy uses all online cpus, e.g. from 0
>>>> to N-1. However, if some cpus are busy with other jobs, nvme-tcp
>>>> performance suffers.
>>>
>>> Can you explain in detail how nvme-tcp performs worse in this
>>> situation?
>>>
>>> If some CPUs are known to be busy, you can submit the nvme-tcp io jobs
>>> on other non-busy CPUs via taskset, or the scheduler is supposed to
>>> choose proper CPUs for you. And usually an nvme-tcp device should be
>>> saturated with limited io depth or jobs/cpus.
>>>
>>> Thanks,
>>> Ming
>>
>> Taskset can't work on nvme-tcp io-queues, because the worker cpu is
>> decided at the nvme-tcp 'connect' stage, not at the io sending stage.
>> Assume there is only one io-queue bound to CPU0: it stays there no
>> matter which cpus the io jobs run on.
>
> OK, it looks like the problem is queue->io_cpu, see
> nvme_tcp_queue_request().
>
> But I am wondering why nvme-tcp doesn't queue the io work on the
> current cpu? And why is queue->io_cpu introduced? Given blk-mq defines
> cpu affinities for each hw queue, the driver is supposed to submit IO
> requests to hardware on the local CPU.
>
> Sagi and guys, any ideas about introducing queue->io_cpu?

Hey Ming,

I have some vague memories wrt this, but from what I recall:
- The number of queues depends on both the controller and the user
(not a reason/motivation on its own, just clarifying).
- It simply matches what pci does (to some extent, outside of rx side
entropy that may exist); it just happens to take more cpu cycles due to
the network stack overhead.
- I didn't want io threads to change CPUs because of the RFS/aRFS
optimizations that people use, which allow the NIC to steer interrupts
(and napi context) to where the io thread is running, and thus minimize
latency due to improved locality. That on its own was shown to be worth
over a 30% reduction.
- At some point nvme-tcp rx context used to run in softirq, and having
to synchronize different cores (on different numa nodes maybe, depending
on what RSS decided) when processing the socket resulted in high latency
as well. This is not the case today (due to some nics back then that
surfaced various issues with this) but it may come back in the future
(if shown to provide value).
- Also, today when there is a sync network send from the .queue_rq path,
it is only executed when the running cpu == queue->io_cpu, to avoid high
contention; see the sketch below. My concern is that if the io context
is not bound to a specific cpu, it may create heavier contention on
queue serialization. Today there are at most 2 contexts that compete:
the io context (triggered from network rx or scheduled in the submission
path) and the .queue_rq sync network send path. I'd prefer not to
introduce more contention with an increasing number of threads accessing
an nvme controller.
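For reference, the serialization described above looks roughly like this
in nvme_tcp_queue_request(), where queue->io_cpu has been picked once per
queue at connect time and never changes afterwards (abridged from
drivers/nvme/host/tcp.c around the time of this thread; comments
rewritten, details may differ):

static inline void nvme_tcp_queue_request(struct nvme_tcp_request *req,
		bool sync, bool last)
{
	struct nvme_tcp_queue *queue = req->queue;
	bool empty;

	/* empty == no other request is queued or being sent right now */
	empty = llist_add(&req->lentry, &queue->req_list) &&
		list_empty(&queue->send_list) && !queue->request;

	/*
	 * Send inline only when already running on queue->io_cpu, so at
	 * most two contexts (this path and io_work) contend on send_mutex.
	 */
	if (queue->io_cpu == raw_smp_processor_id() &&
	    sync && empty && mutex_trylock(&queue->send_mutex)) {
		nvme_tcp_send_all(queue);
		mutex_unlock(&queue->send_mutex);
	}

	/* io_work always runs on the cpu chosen at connect time */
	if (last && nvme_tcp_queue_more(queue))
		queue_work_on(queue->io_cpu, nvme_tcp_wq, &queue->io_work);
}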
Having said that, I don't think there is a fundamental issue with using
queue_work(), queue_work_node(cur_cpu), or
queue_work_node(netdev_home_cpu), if that does not introduce additional
latency in the common cases. Although having io threads bounce around is
going to regress users that use RFS/aRFS...
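To make those alternatives concrete, only the dispatch call would change.
A rough sketch, noting that queue_work_node() takes a NUMA node (not a
cpu) and warns on workqueues that are not WQ_UNBOUND, and that
netdev_node below is a hypothetical placeholder for the NIC's home node,
not existing driver state:

	/* today: always the cpu fixed at connect time */
	queue_work_on(queue->io_cpu, nvme_tcp_wq, &queue->io_work);

	/*
	 * local dispatch: a per-cpu (bound) workqueue typically runs
	 * the work on the submitting cpu
	 */
	queue_work(nvme_tcp_wq, &queue->io_work);

	/*
	 * NUMA-local dispatch: nvme_tcp_wq would need WQ_UNBOUND for
	 * this; netdev_node is hypothetical (the NIC's NUMA node)
	 */
	queue_work_node(netdev_node, nvme_tcp_wq, &queue->io_work);

Whether the last two regress RFS/aRFS users as described above is the
open question.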