From: Nitesh Narayan Lal <nitesh@redhat.com>
To: linux-kernel@vger.kernel.org, netdev@vger.kernel.org,
    linux-pci@vger.kernel.org, intel-wired-lan@lists.osuosl.org,
    frederic@kernel.org, mtosatti@redhat.com, sassmann@redhat.com,
    jesse.brandeburg@intel.com, lihong.yang@intel.com, helgaas@kernel.org,
    nitesh@redhat.com, jeffrey.t.kirsher@intel.com, jacob.e.keller@intel.com,
    jlelli@redhat.com, hch@infradead.org, bhelgaas@google.com,
    mike.marciniszyn@intel.com, dennis.dalessandro@intel.com,
    thomas.lendacky@amd.com, jiri@nvidia.com, mingo@redhat.com,
    peterz@infradead.org, juri.lelli@redhat.com, vincent.guittot@linaro.org,
    lgoncalv@redhat.com
Subject: [PATCH v3 0/4] isolation: limit msix vectors to housekeeping CPUs
Date: Fri, 25 Sep 2020 14:26:50 -0400
Message-Id: <20200925182654.224004-1-nitesh@redhat.com>

This is a follow-up posting for "[PATCH v2 0/4] isolation: limit msix
vectors based on housekeeping CPUs".

Issue
=====
While creating their MSIX vectors, device drivers currently only take
num_online_cpus() into consideration. This works quite well for a non-RT
environment, but in an RT environment that has a large number of isolated
CPUs and very few housekeeping CPUs it can lead to a problem. The problem
is triggered when something like tuned tries to move all the IRQs from the
isolated CPUs to the limited number of housekeeping CPUs, to prevent
interruptions to a latency-sensitive workload running on the isolated CPUs.
This move fails because of the per-CPU vector limitation.

Proposed Fix
============
In this patch-set, the following changes are proposed (a sketch of the
generic pieces follows the Future Work section below):

- A generic API, housekeeping_num_online_cpus(), which returns the number
  of online housekeeping CPUs based on the hk_flags passed by the caller.

- i40e: specifically for the i40e driver, the num_online_cpus() used in
  i40e_init_msix() to calculate the number of MSIX vectors is replaced
  with the above API, which returns the online housekeeping CPUs that are
  meant to handle managed IRQ jobs.

- pci_alloc_irq_vectors(): with the help of housekeeping_num_online_cpus(),
  the max_vecs passed to pci_alloc_irq_vectors() is restricted to the
  online housekeeping CPUs (designated for managed IRQ jobs) in an RT
  environment. However, if min_vecs exceeds the number of online
  housekeeping CPUs, max_vecs is limited based on min_vecs instead.

Future Work
===========
- In the previous upstream discussion [1], it was decided that it would be
  better to have a generic framework that can be consumed by all drivers
  to fix this kind of issue. However, that will be long-term work, and
  since there are RT workloads currently impacted by the reported issue,
  we agreed upon the proposed per-device approach for now.
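To make the shape of the generic pieces concrete, below is a minimal
sketch of the helper (patch 1, in include/linux/sched/isolation.h per the
diffstat) and the pci_alloc_irq_vectors() clamping (patch 4, in
include/linux/pci.h), reconstructed from the description above. The
CONFIG_CPU_ISOLATION guard, the housekeeping_overridden static key, and
the exact clamp expression are my assumptions based on the existing
isolation code, not the literal hunks:

	/*
	 * Sketch of patch 1: number of online housekeeping CPUs for a
	 * given isolation flag; falls back to all online CPUs when CPU
	 * isolation is not in effect.
	 */
	static inline unsigned int
	housekeeping_num_online_cpus(enum hk_flags flags)
	{
	#ifdef CONFIG_CPU_ISOLATION
		if (static_branch_unlikely(&housekeeping_overridden))
			return cpumask_weight(housekeeping_cpumask(flags));
	#endif
		return num_online_cpus();
	}

	/*
	 * Sketch of patch 4: restrict max_vecs to the online housekeeping
	 * CPUs designated for managed IRQs; when min_vecs exceeds that
	 * count, clamp() falls back to min_vecs so the driver's stated
	 * minimum is still honored.
	 */
	static inline int
	pci_alloc_irq_vectors(struct pci_dev *dev, unsigned int min_vecs,
			      unsigned int max_vecs, unsigned int flags)
	{
		unsigned int hk_cpus;

		hk_cpus = housekeeping_num_online_cpus(HK_FLAG_MANAGED_IRQ);
		if (hk_cpus < num_online_cpus())
			max_vecs = clamp(hk_cpus, min_vecs, max_vecs);

		return pci_alloc_irq_vectors_affinity(dev, min_vecs, max_vecs,
						      flags, NULL);
	}

The clamp() keeps the call safe for drivers whose min_vecs is larger than
the housekeeping set, at the cost of still allocating min_vecs vectors in
that case.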
Testing
=======
Functionality:
- To test that the i40e change resolves the issue, I added a tracepoint in
  i40e_init_msix() to find the number of CPUs used for vector creation,
  with and without tuned's realtime-virtual-host profile. As expected,
  with the profile applied I was only getting the number of housekeeping
  CPUs, and all online CPUs without it. Another way to verify is to check
  the number of IRQs that get created for an impacted device. I similarly
  did a few more tests in different modes, e.g. with only nohz_full,
  isolcpus, etc.

Performance:
- To analyze the performance impact, I targeted the change introduced in
  pci_alloc_irq_vectors() and compared the results against vanilla kernel
  (5.9.0-rc3) results.

  Setup Information:
  + A couple of 24-core machines connected back to back via a couple of
    mlx5 NICs; I analyzed the average bitrate for server-client TCP and
    UDP transmission via iperf.
  + To minimize the bitrate variation of the iperf TCP and UDP stream
    tests, I applied tuned's network-throughput profile and disabled HT.

  Test Information:
  + For the environment that had no isolated CPUs:
    I tested with a single stream and 24 streams (same as the number of
    online CPUs).
  + For the environment that had 20 isolated CPUs:
    I tested with a single stream, 4 streams (same as the number of
    housekeeping CPUs) and 24 streams (same as the number of online CPUs).

  Results:
  # UDP Stream Test:
  + No degradation was observed in the UDP stream tests in either
    environment (with and without isolated CPUs) after the introduction
    of the patches.
  # TCP Stream Test - No isolated CPUs:
  + No noticeable degradation was observed.
  # TCP Stream Test - With isolated CPUs:
  + Multiple Stream (4)  - Average degradation of around 5-6%
  + Multiple Stream (24) - Average degradation of around 2-3%
  + Single Stream        - Even on a vanilla kernel, the bitrate observed
    for a TCP single stream test varies significantly across runs (e.g.
    the variation between the best and the worst case on a vanilla kernel
    was around 8-10%). A similar variation was observed with the kernel
    that included my patches; no additional degradation was observed.

If there are any suggestions for further performance evaluation, I would
be happy to discuss/perform them.

Changes from v2[2]:
==================
- Renamed hk_num_online_cpus() to housekeeping_num_online_cpus() to keep
  the naming convention consistent (based on a suggestion from Peter
  Zijlstra and Frederic Weisbecker).
- Added an "enum hk_flags" argument to the housekeeping_num_online_cpus()
  API to make it more usable in different use-cases (based on a suggestion
  from Frederic Weisbecker).
- Replaced cpumask_weight(cpu_online_mask) with num_online_cpus()
  (suggestion from Bjorn Helgaas).
- Modified patch commit messages and comments based on Bjorn Helgaas's
  suggestions.

Changes from v1[3]:
==================
Patch1:
- Replaced num_housekeeping_cpus() with hk_num_online_cpus() and started
  using the cpumask corresponding to HK_FLAG_MANAGED_IRQ to derive the
  number of online housekeeping CPUs. This is based on Frederic
  Weisbecker's suggestion.
- Since hk_num_online_cpus() is self-explanatory, got rid of the comment
  that was added previously.
Patch2:
- Added a new patch that enables managed IRQ isolation for nohz_full CPUs.
  This is based on Frederic Weisbecker's suggestion.

Patch4 (PCI):
- For cases where min_vecs exceeds the number of online housekeeping CPUs,
  instead of skipping the modification to max_vecs, started restricting it
  based on min_vecs. This is based on a suggestion from Marcelo Tosatti.

[1] https://lore.kernel.org/lkml/20200922095440.GA5217@lenoir/
[2] https://lore.kernel.org/lkml/20200923181126.223766-1-nitesh@redhat.com/
[3] https://lore.kernel.org/lkml/20200909150818.313699-1-nitesh@redhat.com/

Nitesh Narayan Lal (4):
  sched/isolation: API to get number of housekeeping CPUs
  sched/isolation: Extend nohz_full to isolate managed IRQs
  i40e: Limit msix vectors to housekeeping CPUs
  PCI: Limit pci_alloc_irq_vectors() to housekeeping CPUs

 drivers/net/ethernet/intel/i40e/i40e_main.c |  3 ++-
 include/linux/pci.h                         | 17 +++++++++++++++++
 include/linux/sched/isolation.h             |  9 +++++++++
 kernel/sched/isolation.c                    |  2 +-
 4 files changed, 29 insertions(+), 2 deletions(-)

--
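As a closing illustration, the i40e portion of the series (patch 3)
amounts to a substitution along the following lines in i40e_init_msix().
pf->num_lan_msix and i40e_init_msix() are from the driver, but
"vectors_left" here is a hypothetical stand-in for the driver's
remaining-vector bookkeeping, not the literal hunk:

	/* Before: LAN queue vectors sized against every online CPU. */
	pf->num_lan_msix = min_t(int, num_online_cpus(), vectors_left);

	/*
	 * After (sketch): sized against the online housekeeping CPUs that
	 * are designated to handle managed IRQ jobs.
	 */
	pf->num_lan_msix =
		min_t(int,
		      housekeeping_num_online_cpus(HK_FLAG_MANAGED_IRQ),
		      vectors_left);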