From: Li Feng
Date: Mon, 17 Apr 2023 16:32:56 +0800
Subject: Re: [PATCH v2] nvme/tcp: Add support to set the tcp worker cpu affinity
To: Hannes Reinecke
Cc: David Laight, Keith Busch, Jens Axboe, Christoph Hellwig, Sagi Grimberg,
    "open list:NVM EXPRESS DRIVER", open list, lifeng1519@gmail.com

On Mon, Apr 17, 2023 at 2:27 PM Hannes Reinecke wrote:
>
> On 4/15/23 23:06, David Laight wrote:
> > From: Li Feng
> >> Sent: 14 April 2023 10:35
> >>>
> >>> On 4/13/23 15:29, Li Feng wrote:
> >>>> The default worker affinity policy uses all online CPUs, i.e. from 0
> >>>> to N-1. However, when some of those CPUs are busy with other jobs,
> >>>> nvme-tcp performance suffers.
> >>>>
> >>>> This patch adds a module parameter to set the CPU affinity of the
> >>>> nvme-tcp socket worker threads. The parameter is a comma-separated
> >>>> list of CPU numbers. The list is parsed and the resulting cpumask is
> >>>> used to set the affinity of the socket worker threads. If the list is
> >>>> empty or the parsing fails, the default affinity is used.
> >>>>
> > ...
> >>> I am not in favour of this.
> >>> NVMe-over-Fabrics has _virtual_ queues, which really have no
> >>> relationship to the underlying hardware.
> >>> So trying to be clever here by tacking queues to CPUs sort of works if
> >>> you have one subsystem to talk to, but if you have several where each
> >>> exposes a _different_ number of queues you end up with a quite
> >>> suboptimal setting (ie you rely on the resulting cpu sets to overlap,
> >>> but there is no guarantee that they do).
> >>
> >> Thanks for your comment.
> >> The current io-queue/CPU mapping is not optimal: it simply walks from
> >> CPU 0 to the last CPU, and it is not configurable.
> >
> > Module parameters suck, and passing the buck to the user
> > when you can't decide how to do something isn't a good idea either.
> >
> > If the system is busy, pinning threads to CPUs is very hard to
> > get right.
> >
> > It can be better to run the threads at the lowest RT priority - so
> > they have priority over all 'normal' threads - and to give them a very
> > sticky (but not fixed) CPU affinity, so that such threads tend to get
> > spread out by the scheduler. This works best if the number of RT
> > threads isn't greater than the number of physical CPUs.
> >
> And the problem is that you cannot give an 'optimal' performance metric
> here. With NVMe-over-Fabrics the number of queues is negotiated during
> the initial 'connect' call, and the resulting number of queues strongly
> depends on target preferences (e.g. a NetApp array will expose only 4
> queues, while with Dell/EMC you end up with up to 128 queues).
> And these queues need to be mapped onto the underlying hardware, which
> has its own issues wrt NUMA affinity.
>
> To give you an example:
> Given a 4-node NUMA machine, one NIC attached to one NUMA node, each
> socket having 24 threads, the NIC exposing up to 32 interrupts, and
> connections to a NetApp _and_ an EMC, what exactly would the 'best'
> layout look like?
> And, what _is_ the 'best' layout?
> You cannot satisfy the queue requirements of NetApp _and_ EMC, as you
> only have one NIC, and you cannot change the interrupt affinity for
> each I/O.
>
Not all users have enough NICs to dedicate one per NUMA node; a setup with
only a single NIC is quite common.

There is no 'best' layout that fits all cases, so this parameter lets users
select whatever fits theirs.
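To make the idea concrete, here is a rough sketch of the mechanism: a
cpulist module parameter is parsed into a cpumask, and that mask is
consulted when picking the CPU a queue's io_work runs on. The names below
(wq_affinity, nvme_tcp_setup_wq_cpumask, nvme_tcp_pick_io_cpu) are
illustrative only, not necessarily the identifiers used in the actual
patch.

/*
 * Rough sketch only: "wq_affinity", nvme_tcp_setup_wq_cpumask() and
 * nvme_tcp_pick_io_cpu() are illustrative names, not the identifiers
 * used in the real patch.
 */
#include <linux/cpumask.h>
#include <linux/module.h>

static char wq_affinity[128];
module_param_string(wq_affinity, wq_affinity, sizeof(wq_affinity), 0444);
MODULE_PARM_DESC(wq_affinity,
		 "comma-separated list of CPUs for the nvme-tcp socket workers");

static struct cpumask wq_cpumask;

/* Parse the cpulist once; fall back to all online CPUs on any error. */
static void nvme_tcp_setup_wq_cpumask(void)
{
	if (!wq_affinity[0] ||
	    cpulist_parse(wq_affinity, &wq_cpumask) ||
	    !cpumask_and(&wq_cpumask, &wq_cpumask, cpu_online_mask))
		cpumask_copy(&wq_cpumask, cpu_online_mask);
}

/* Spread the queues round-robin across the configured mask. */
static int nvme_tcp_pick_io_cpu(int qid)
{
	unsigned int cpu = cpumask_first(&wq_cpumask);
	int i;

	for (i = 0; i < qid % cpumask_weight(&wq_cpumask); i++)
		cpu = cpumask_next(cpu, &wq_cpumask);
	return cpu;
}

Loading the module with something like wq_affinity=0,2,8-11 (it is a
cpulist, so ranges work) would then confine the socket workers, which
nvme-tcp already queues via queue_work_on(), to those CPUs, while leaving
the parameter empty keeps the current behaviour.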
> Cheers,
>
> Hannes
> --
> Dr. Hannes Reinecke                  Kernel Storage Architect
> hare@suse.de                                +49 911 74053 688
> SUSE Software Solutions GmbH, Maxfeldstr. 5, 90409 Nürnberg
> HRB 36809 (AG Nürnberg), Geschäftsführer: Ivo Totev, Andrew
> Myers, Andrew McDonald, Martje Boudien Moerman
>