From: Li Feng
Date: Mon, 17 Apr 2023 16:32:56 +0800
Subject: Re: [PATCH v2] nvme/tcp: Add support to set the tcp worker cpu affinity
To: Hannes Reinecke
Cc: David Laight, Keith Busch, Jens Axboe, Christoph Hellwig, Sagi Grimberg,
    "open list:NVM EXPRESS DRIVER", open list, lifeng1519@gmail.com

On Mon, Apr 17, 2023 at 2:27 PM Hannes Reinecke wrote:
>
> On 4/15/23 23:06, David Laight wrote:
> > From: Li Feng
> >> Sent: 14 April 2023 10:35
> >>>
> >>> On 4/13/23 15:29, Li Feng wrote:
> >>>> The default worker affinity policy uses all online CPUs, i.e. from 0
> >>>> to N-1. However, when some of those CPUs are busy with other jobs,
> >>>> nvme-tcp performance suffers.
> >>>>
> >>>> This patch adds a module parameter to set the CPU affinity of the
> >>>> nvme-tcp socket worker threads. The parameter is a comma-separated
> >>>> list of CPU numbers. The list is parsed and the resulting cpumask is
> >>>> used to set the affinity of the socket worker threads. If the list is
> >>>> empty or the parsing fails, the default affinity is used.
> >>>>
> > ...
> >>> I am not in favour of this.
> >>> NVMe-over-Fabrics has _virtual_ queues, which really have no
> >>> relationship to the underlying hardware.
> >>> So trying to be clever here by tacking queues to CPUs sort of works if
> >>> you have one subsystem to talk to, but if you have several where each
> >>> exposes a _different_ number of queues you end up with a quite
> >>> suboptimal setting (ie you rely on the resulting cpu sets to overlap,
> >>> but there is no guarantee that they do).
> >>
> >> Thanks for your comment.
> >> The current io-queue/CPU mapping is not optimal: it simply walks from
> >> CPU 0 to the last CPU, and it is not configurable.
> >
> > Module parameters suck, and passing the buck to the user
> > when you can't decide how to do something isn't a good idea either.
> >
> > If the system is busy, pinning threads to CPUs is very hard to
> > get right.
> >
> > It can be better to run the threads at the lowest RT priority - so
> > they have priority over all 'normal' threads - and to give them a very
> > sticky (but not fixed) CPU affinity, so that such threads tend to get
> > spread out by the scheduler. This works best if the number of RT
> > threads isn't greater than the number of physical CPUs.
> >
> And the problem is that you cannot give an 'optimal' performance metric
> here. With NVMe-over-Fabrics the number of queues is negotiated during
> the initial 'connect' call, and the resulting number of queues strongly
> depends on target preferences (e.g. a NetApp array will expose only 4
> queues, while with Dell/EMC you end up with up to 128 queues).
> And these queues need to be mapped onto the underlying hardware, which
> has its own issues wrt NUMA affinity.
>
> To give you an example:
> Given a 4-node NUMA machine, one NIC attached to one NUMA node, each
> socket having 24 threads, the NIC exposing up to 32 interrupts, and
> connections to a NetApp _and_ an EMC, what exactly would the 'best'
> layout look like?
> And, what _is_ the 'best' layout?
> You cannot satisfy the queue requirements of NetApp _and_ EMC, as you
> only have one NIC, and you cannot change the interrupt affinity for
> each I/O.
>
Not all users have enough NICs to dedicate one per NUMA node; a setup with
only a single NIC is quite common.

There is no 'best' layout that fits all cases, so this parameter lets users
select whatever fits theirs.
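To make the idea concrete, here is a rough sketch of the mechanism: a
cpulist module parameter is parsed into a cpumask, and that mask is
consulted when picking the CPU a queue's io_work runs on. The names below
(wq_affinity, nvme_tcp_setup_wq_cpumask, nvme_tcp_pick_io_cpu) are
illustrative only, not necessarily the identifiers used in the actual
patch.

/*
 * Rough sketch only: "wq_affinity", nvme_tcp_setup_wq_cpumask() and
 * nvme_tcp_pick_io_cpu() are illustrative names, not the identifiers
 * used in the real patch.
 */
#include <linux/cpumask.h>
#include <linux/module.h>

static char wq_affinity[128];
module_param_string(wq_affinity, wq_affinity, sizeof(wq_affinity), 0444);
MODULE_PARM_DESC(wq_affinity,
		 "comma-separated list of CPUs for the nvme-tcp socket workers");

static struct cpumask wq_cpumask;

/* Parse the cpulist once; fall back to all online CPUs on any error. */
static void nvme_tcp_setup_wq_cpumask(void)
{
	if (!wq_affinity[0] ||
	    cpulist_parse(wq_affinity, &wq_cpumask) ||
	    !cpumask_and(&wq_cpumask, &wq_cpumask, cpu_online_mask))
		cpumask_copy(&wq_cpumask, cpu_online_mask);
}

/* Spread the queues round-robin across the configured mask. */
static int nvme_tcp_pick_io_cpu(int qid)
{
	unsigned int cpu = cpumask_first(&wq_cpumask);
	int i;

	for (i = 0; i < qid % cpumask_weight(&wq_cpumask); i++)
		cpu = cpumask_next(cpu, &wq_cpumask);
	return cpu;
}

Loading the module with something like wq_affinity=0,2,8-11 (it is a
cpulist, so ranges work) would then confine the socket workers, which
nvme-tcp already queues via queue_work_on(), to those CPUs, while leaving
the parameter empty keeps the current behaviour.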
> Cheers,
>
> Hannes
> --
> Dr. Hannes Reinecke                  Kernel Storage Architect
> hare@suse.de                                +49 911 74053 688
> SUSE Software Solutions GmbH, Maxfeldstr. 5, 90409 Nürnberg
> HRB 36809 (AG Nürnberg), Geschäftsführer: Ivo Totev, Andrew
> Myers, Andrew McDonald, Martje Boudien Moerman
>