Subject: Re: [PATCH v2] nvme/tcp: Add support to set the tcp worker cpu affinity
From: Li Feng
Date: Tue, 18 Apr 2023 11:29:54 +0800
To: Sagi Grimberg
Cc: Ming Lei, Keith Busch, Jens Axboe, Christoph Hellwig, "open list:NVM EXPRESS DRIVER", linux-kernel
Message-Id: <5976C351-B2EB-4C25-A2AA-2A64EDD8CCDE@gmail.com>
References: <20230413062339.2454616-1-fengli@smartx.com> <20230413132941.2489795-1-fengli@smartx.com>
X-Mailing-List: linux-kernel@vger.kernel.org

> On 17 Apr 2023, at 21:33, Sagi Grimberg wrote:
>
>>>> On Thu, Apr 13, 2023 at 09:29:41PM +0800, Li Feng wrote:
>>>>> The default worker affinity policy uses all online cpus, e.g. from 0
>>>>> to N-1. However, some cpus are busy with other jobs, and then nvme-tcp
>>>>> will have bad performance.
>>>>
>>>> Can you explain in detail how nvme-tcp performs worse in this situation?
>>>>
>>>> If some of the CPUs are known to be busy, you can submit the nvme-tcp io
>>>> jobs on other non-busy CPUs via taskset, or the scheduler is supposed to
>>>> choose proper CPUs for you. And usually an nvme-tcp device should be
>>>> saturated with limited io depth or jobs/cpus.
>>>>
>>>> Thanks,
>>>> Ming
>>>>
>>>
>>> Taskset cannot work on nvme-tcp io-queues, because the worker cpu is
>>> decided at the nvme-tcp 'connect' stage, not at the io submission stage.
>>> Assume there is only one io-queue: its bound cpu is CPU0, no matter which
>>> cpus the io jobs run on.
>>
>> OK, it looks like the problem is queue->io_cpu, see nvme_tcp_queue_request().
>> But I am wondering why nvme-tcp doesn't queue the io work on the current
>> cpu. And why was queue->io_cpu introduced? Given that blk-mq defines cpu
>> affinities for each hw queue, the driver is supposed to submit the IO
>> request to hardware on the local CPU.
>>
>> Sagi and guys, any ideas about introducing queue->io_cpu?
>
> Hey Ming,
>
> I have some vague memories wrt this, but from what I recall:
>
> - The number of queues depends on both the controller and
> the user (not a reason/motivation on its own, just clarifying).
>
> - It simply matches what pci does (to some extent, outside of rx side
> entropy that may exist); it just happens to take more cpu cycles due to
> the network stack overhead.
>
> - I didn't want io threads to change CPUs because of the RFS/aRFS
> optimizations that people use, which allow the NIC to steer interrupts
> (and napi context) to where the io thread is running, and thus minimize
> latency due to improved locality. That on its own was shown to be worth
> over a 30% reduction.
>

RFS does not work well here. On my aarch64 machine, the NIC irq is handled
on a NUMA node 2 CPU, while the nvme-tcp io-queue is busy on CPU0.

> - At some point the nvme-tcp rx context used to run in softirq, and having
> to synchronize different cores (on different numa nodes maybe, depending
> on what RSS decided) when processing the socket resulted in high latency
> as well. This is not the case today (due to some nics back then that
> surfaced various issues with this), but it may come back in the future
> (if shown to provide value).
>
> - Also, today, when there is a sync network send from the .queue_rq path,
> it is only executed when the running cpu == queue->io_cpu, to avoid high
> contention. My concern is that if the io context is not bound to a
> specific cpu, it may create heavier contention on queue serialization.
> Today there are at most 2 contexts that compete: the io context (triggered
> from network rx or scheduled in the submission path) and the .queue_rq
> sync network send path. I'd prefer not to introduce more contention with
> an increasing number of threads accessing an nvme controller.
>
> Having said that, I don't think there is a fundamental issue with
> using queue_work, or queue_work_node(cur_cpu) or
> queue_work_node(netdev_home_cpu), if that does not introduce
> additional latency in the common cases. Although having io threads
> bounce around is going to regress users that use RFS/aRFS...
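
To make the taskset/queue->io_cpu point above concrete, here is a rough
sketch of the queueing path. It is paraphrased from memory of
drivers/nvme/host/tcp.c and is not the exact upstream code (the real
nvme_tcp_queue_request() also handles request batching and the "last"
logic), so take the details with a grain of salt:

	/*
	 * Rough sketch only -- paraphrased, not the exact upstream code.
	 * queue->io_cpu is picked once when the queue is connected, and
	 * io_work is always queued on that cpu, which is why taskset on
	 * the submitting task does not move the network send work.
	 */
	static inline void nvme_tcp_queue_request_sketch(struct nvme_tcp_request *req,
							 bool sync)
	{
		struct nvme_tcp_queue *queue = req->queue;

		/* Direct send only if we already run on the queue's pinned cpu. */
		if (sync && queue->io_cpu == raw_smp_processor_id() &&
		    mutex_trylock(&queue->send_mutex)) {
			nvme_tcp_send_all(queue);
			mutex_unlock(&queue->send_mutex);
			return;
		}

		/* Otherwise the send is done by io_work, on queue->io_cpu. */
		queue_work_on(queue->io_cpu, nvme_tcp_wq, &queue->io_work);
	}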
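
And for the queue_work_node() idea at the end, a purely hypothetical
variant could look like the following (nvme_tcp_queue_io_work_unpinned is
a made-up name, not an existing function; queue_work_node() takes a NUMA
node, hence numa_node_id()). Whether it helps or hurts would depend on the
RFS/aRFS locality trade-off described above:

	/*
	 * Hypothetical variant only -- not a real patch.  Instead of pinning
	 * io_work to the cpu chosen at connect time, queue it NUMA-local, or
	 * let the workqueue pick any cpu, trading away the RFS/aRFS locality
	 * benefit described above.
	 */
	static inline void nvme_tcp_queue_io_work_unpinned(struct nvme_tcp_queue *queue)
	{
		/* Keep the work on the submitter's NUMA node... */
		queue_work_node(numa_node_id(), nvme_tcp_wq, &queue->io_work);

		/* ...or, alternatively, let the workqueue choose a cpu:
		 *	queue_work(nvme_tcp_wq, &queue->io_work);
		 */
	}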