Received: by 10.192.165.148 with SMTP id m20csp772161imm; Fri, 20 Apr 2018 15:43:28 -0700 (PDT) X-Google-Smtp-Source: AIpwx49wviU/LMB9QR5w+0RE3zrmfFsp5VyLniL5n0usJhmiDKzAGUAGdB6WbdAoqQr2/47HsaEL X-Received: by 2002:a17:902:6b87:: with SMTP id p7-v6mr11741422plk.101.1524264208200; Fri, 20 Apr 2018 15:43:28 -0700 (PDT) ARC-Seal: i=1; a=rsa-sha256; t=1524264208; cv=none; d=google.com; s=arc-20160816; b=kJ6F0Owf3laL1MhWR26FdpstkdAbiEzqxM/GtBWOuUZPkRyAcDjOWiMoDy+AWHpIlc FNDAoXtDqrnjDIimkvCDqC4GuVgzat7gxaDCfhfgpeCUD92exmdmxHA881rnI51lYlfN Z23T2uZ6JCFWyXDul90rNowgleCET2kKQlfgHpmLlnXfugLWDcaQy2izeTfRfj3xZmvV NCDFBaAmTTAxBiDHnsUDVgjIUDs8hzCi8sGzfcGOjZpb/CE87Q69wFGGShAAvZRZOj5t nCiPUWipjjVSBMI1subfDBGaXigo3T825XeznQjCe7AC402bikJ0DZG6ifj7Kf+HLhVC mBmg== ARC-Message-Signature: i=1; a=rsa-sha256; c=relaxed/relaxed; d=google.com; s=arc-20160816; h=list-id:precedence:sender:references:in-reply-to:message-id:date :subject:cc:to:from:arc-authentication-results; bh=h7DWZdR3tOVJ5oJX31C6/rU4I+7pjYJAwHEQJTfaEFQ=; b=u5s2F9lNOeGhM8MCyhdzkYg386l7NQeIPaZraejbHY9jGvv30OeRCPXqgB3i44Smti WQCzefOrFBUrEZ/ucscySQg6CdBFke8PxaaPhJEe2hcskHbxBnBXI9CYhbB+lE/nnUvV 2UlFpUAD5b0HyJZOwZnEBWbad2MK6HHaYz1W+lXMXNbtBwwsMkXqECZEVGIu7ozDPBni /ku2rw4Z2CKHddA81koPpf/Jji1794G8l4MAwTpELI5OYejYQBNyBGlU4UmW759tmqWF qMv2i3LV0qTf1jLK24/Q918rFeUH6FlmnleX+Cmb0t7HDSDZlqN1RgRI2cU010JfqI2k t3cA== ARC-Authentication-Results: i=1; mx.google.com; spf=pass (google.com: best guess record for domain of linux-kernel-owner@vger.kernel.org designates 209.132.180.67 as permitted sender) smtp.mailfrom=linux-kernel-owner@vger.kernel.org Return-Path: Received: from vger.kernel.org (vger.kernel.org. [209.132.180.67]) by mx.google.com with ESMTP id e4-v6si6702032pln.331.2018.04.20.15.42.50; Fri, 20 Apr 2018 15:43:28 -0700 (PDT) Received-SPF: pass (google.com: best guess record for domain of linux-kernel-owner@vger.kernel.org designates 209.132.180.67 as permitted sender) client-ip=209.132.180.67; Authentication-Results: mx.google.com; spf=pass (google.com: best guess record for domain of linux-kernel-owner@vger.kernel.org designates 209.132.180.67 as permitted sender) smtp.mailfrom=linux-kernel-owner@vger.kernel.org Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1753059AbeDTWjL (ORCPT + 99 others); Fri, 20 Apr 2018 18:39:11 -0400 Received: from mga01.intel.com ([192.55.52.88]:60382 "EHLO mga01.intel.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1752712AbeDTWjJ (ORCPT ); Fri, 20 Apr 2018 18:39:09 -0400 X-Amp-Result: SKIPPED(no attachment in message) X-Amp-File-Uploaded: False Received: from orsmga007.jf.intel.com ([10.7.209.58]) by fmsmga101.fm.intel.com with ESMTP/TLS/DHE-RSA-AES256-GCM-SHA384; 20 Apr 2018 15:39:08 -0700 X-ExtLoop1: 1 X-IronPort-AV: E=Sophos;i="5.49,303,1520924400"; d="scan'208";a="34982630" Received: from vshiva-udesk.sc.intel.com ([10.3.52.52]) by orsmga007.jf.intel.com with ESMTP; 20 Apr 2018 15:39:08 -0700 From: Vikas Shivappa To: vikas.shivappa@intel.com, tony.luck@intel.com, ravi.v.shankar@intel.com, fenghua.yu@intel.com, x86@kernel.org, tglx@linutronix.de, hpa@zytor.com Cc: linux-kernel@vger.kernel.org, ak@linux.intel.com, vikas.shivappa@linux.intel.com Subject: [PATCH 1/6] x86/intel_rdt/mba_sc: Documentation for MBA software controller(mba_sc) Date: Fri, 20 Apr 2018 15:36:16 -0700 Message-Id: <1524263781-14267-2-git-send-email-vikas.shivappa@linux.intel.com> X-Mailer: git-send-email 1.9.1 In-Reply-To: <1524263781-14267-1-git-send-email-vikas.shivappa@linux.intel.com> References: <1524263781-14267-1-git-send-email-vikas.shivappa@linux.intel.com> Sender: linux-kernel-owner@vger.kernel.org Precedence: bulk List-ID: X-Mailing-List: linux-kernel@vger.kernel.org Add documentation about the feedback loop mechanism (MBA software controller) which lets the user specify the memory bandwidth allocation in MBps. This includes some changes to "schemata" formati with examples. Signed-off-by: Vikas Shivappa --- Documentation/x86/intel_rdt_ui.txt | 75 ++++++++++++++++++++++++++++++++++---- 1 file changed, 67 insertions(+), 8 deletions(-) diff --git a/Documentation/x86/intel_rdt_ui.txt b/Documentation/x86/intel_rdt_ui.txt index 71c3098..a16aa21 100644 --- a/Documentation/x86/intel_rdt_ui.txt +++ b/Documentation/x86/intel_rdt_ui.txt @@ -17,12 +17,14 @@ MBA (Memory Bandwidth Allocation) - "mba" To use the feature mount the file system: - # mount -t resctrl resctrl [-o cdp[,cdpl2]] /sys/fs/resctrl + # mount -t resctrl resctrl [-o cdp[,cdpl2][,mba_MBps]] /sys/fs/resctrl mount options are: "cdp": Enable code/data prioritization in L3 cache allocations. "cdpl2": Enable code/data prioritization in L2 cache allocations. +"mba_MBps": Enable the MBA Software Controller(mba_sc) to specify MBA + bandwidth in MBps L2 and L3 CDP are controlled seperately. @@ -270,10 +272,11 @@ and 0xA are not. On a system with a 20-bit mask each bit represents 5% of the capacity of the cache. You could partition the cache into four equal parts with masks: 0x1f, 0x3e0, 0x7c00, 0xf8000. -Memory bandwidth(b/w) percentage --------------------------------- -For Memory b/w resource, user controls the resource by indicating the -percentage of total memory b/w. +Memory bandwidth Allocation and monitoring +------------------------------------------ + +For Memory bandwidth resource, by default the user controls the resource +by indicating the percentage of total memory bandwidth. The minimum bandwidth percentage value for each cpu model is predefined and can be looked up through "info/MB/min_bandwidth". The bandwidth @@ -285,7 +288,47 @@ to the next control step available on the hardware. The bandwidth throttling is a core specific mechanism on some of Intel SKUs. Using a high bandwidth and a low bandwidth setting on two threads sharing a core will result in both threads being throttled to use the -low bandwidth. +low bandwidth. The fact that Memory bandwidth allocation(MBA) is a core +specific mechanism where as memory bandwidth monitoring(MBM) is done at +the package level may lead to confusion when users try to apply control +via the MBA and then monitor the bandwidth to see if the controls are +effective. Below are such scenarios: + +1. User may *not* see increase in actual bandwidth when percentage + values are increased: + +This can occur when aggregate L2 external bandwidth is more than L3 +external bandwidth. Consider an SKL SKU with 24 cores on a package and +where L2 external is 10GBps (hence aggregate L2 external bandwidth is +240GBps) and L3 external bandwidth is 100GBps. Now a workload with '20 +threads, having 50% bandwidth, each consuming 5GBps' consumes the max L3 +bandwidth of 100GBps although the percentage value specified is only 50% +<< 100%. Hence increasing the bandwidth percentage will not yeild any +more bandwidth. This is because although the L2 external bandwidth still +has capacity, the L3 external bandwidth is fully used. Also note that +this would be dependent on number of cores the benchmark is run on. + +2. Same bandwidth percentage may mean different actual bandwidth + depending on # of threads: + +For the same SKU in #1, a 'single thread, with 10% bandwidth' and '4 +thread, with 10% bandwidth' can consume upto 10GBps and 40GBps although +they have same percentage bandwidth of 10%. This is simply because as +threads start using more cores in an rdtgroup, the actual bandwidth may +increase or vary although user specified bandwidth percentage is same. + +In order to mitigate this and make the interface more user friendly, +resctrl added support for specifying the bandwidth in MBps as well. The +kernel underneath would use a software feedback mechanism or a "Software +Controller(mba_sc)" which reads the actual bandwidth using MBM counters +and adjust the memowy bandwidth percentages to ensure + + "actual bandwidth < user specified bandwidth". + +By default, the schemata would take the bandwidth percentage values +where as user can switch to the "MBA software controller" mode using +a mount option 'mba_MBps'. The schemata format is specified in the below +sections. L3 schemata file details (code and data prioritization disabled) ---------------------------------------------------------------- @@ -308,13 +351,20 @@ schemata format is always: L2:=;=;... -Memory b/w Allocation details ------------------------------ +Memory bandwidth Allocation (default mode) +------------------------------------------ Memory b/w domain is L3 cache. MB:=bandwidth0;=bandwidth1;... +Memory bandwidth Allocation specified in MBps +--------------------------------------------- + +Memory bandwidth domain is L3 cache. + + MB:=bw_MBps0;=bw_MBps1;... + Reading/writing the schemata file --------------------------------- Reading the schemata file will show the state of all resources @@ -358,6 +408,15 @@ allocations can overlap or not. The allocations specifies the maximum b/w that the group may be able to use and the system admin can configure the b/w accordingly. +If the MBA is specified in MB(megabytes) then user can enter the max b/w in MB +rather than the percentage values. + +# echo "L3:0=3;1=c\nMB:0=1024;1=500" > /sys/fs/resctrl/p0/schemata +# echo "L3:0=3;1=3\nMB:0=1024;1=500" > /sys/fs/resctrl/p1/schemata + +In the above example the tasks in "p1" and "p0" on socket 0 would use a max b/w +of 1024MB where as on socket 1 they would use 500MB. + Example 2 --------- Again two sockets, but this time with a more realistic 20-bit mask. -- 1.9.1