The memory bandwidth software controller uses 2^20 units rather than
10^6. See mbm_bw_count() which computes bandwidth using the "SZ_1M"
Linux define for 0x00100000.
Update the documentation to use MiB when describing this feature.
It's too late to fix the mount option "mba_MBps" as that is now an
established user interface.
Signed-off-by: Tony Luck <[email protected]>
---
Documentation/arch/x86/resctrl.rst | 8 ++++----
1 file changed, 4 insertions(+), 4 deletions(-)
diff --git a/Documentation/arch/x86/resctrl.rst b/Documentation/arch/x86/resctrl.rst
index a6279df64a9d..3712d81cb50c 100644
--- a/Documentation/arch/x86/resctrl.rst
+++ b/Documentation/arch/x86/resctrl.rst
@@ -45,7 +45,7 @@ mount options are:
Enable code/data prioritization in L2 cache allocations.
"mba_MBps":
Enable the MBA Software Controller(mba_sc) to specify MBA
- bandwidth in MBps
+ bandwidth in MiBps
"debug":
Make debug files accessible. Available debug files are annotated with
"Available only with debug option".
@@ -526,7 +526,7 @@ threads start using more cores in an rdtgroup, the actual bandwidth may
increase or vary although user specified bandwidth percentage is same.
In order to mitigate this and make the interface more user friendly,
-resctrl added support for specifying the bandwidth in MBps as well. The
+resctrl added support for specifying the bandwidth in MiBps as well. The
kernel underneath would use a software feedback mechanism or a "Software
Controller(mba_sc)" which reads the actual bandwidth using MBM counters
and adjust the memory bandwidth percentages to ensure::
@@ -573,13 +573,13 @@ Memory b/w domain is L3 cache.
MB:<cache_id0>=bandwidth0;<cache_id1>=bandwidth1;...
-Memory bandwidth Allocation specified in MBps
+Memory bandwidth Allocation specified in MiBps
---------------------------------------------
Memory bandwidth domain is L3 cache.
::
- MB:<cache_id0>=bw_MBps0;<cache_id1>=bw_MBps1;...
+ MB:<cache_id0>=bw_MiBps0;<cache_id1>=bw_MiBps1;...
Slow Memory Bandwidth Allocation (SMBA)
---------------------------------------
--
2.44.0
The following commit has been merged into the x86/urgent branch of tip:
Commit-ID: a8ed59a3a8de2648e69dd5936f5771ac4c92d085
Gitweb: https://git.kernel.org/tip/a8ed59a3a8de2648e69dd5936f5771ac4c92d085
Author: Tony Luck <[email protected]>
AuthorDate: Fri, 22 Mar 2024 11:20:15 -07:00
Committer: Ingo Molnar <[email protected]>
CommitterDate: Sun, 24 Mar 2024 03:58:43 +01:00
Documentation/x86: Document that resctrl bandwidth control units are MiB
The memory bandwidth software controller uses 2^20 units rather than
10^6. See mbm_bw_count() which computes bandwidth using the "SZ_1M"
Linux define for 0x00100000.
Update the documentation to use MiB when describing this feature.
It's too late to fix the mount option "mba_MBps" as that is now an
established user interface.
Signed-off-by: Tony Luck <[email protected]>
Signed-off-by: Ingo Molnar <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
---
Documentation/arch/x86/resctrl.rst | 8 ++++----
1 file changed, 4 insertions(+), 4 deletions(-)
diff --git a/Documentation/arch/x86/resctrl.rst b/Documentation/arch/x86/resctrl.rst
index a6279df..3712d81 100644
--- a/Documentation/arch/x86/resctrl.rst
+++ b/Documentation/arch/x86/resctrl.rst
@@ -45,7 +45,7 @@ mount options are:
Enable code/data prioritization in L2 cache allocations.
"mba_MBps":
Enable the MBA Software Controller(mba_sc) to specify MBA
- bandwidth in MBps
+ bandwidth in MiBps
"debug":
Make debug files accessible. Available debug files are annotated with
"Available only with debug option".
@@ -526,7 +526,7 @@ threads start using more cores in an rdtgroup, the actual bandwidth may
increase or vary although user specified bandwidth percentage is same.
In order to mitigate this and make the interface more user friendly,
-resctrl added support for specifying the bandwidth in MBps as well. The
+resctrl added support for specifying the bandwidth in MiBps as well. The
kernel underneath would use a software feedback mechanism or a "Software
Controller(mba_sc)" which reads the actual bandwidth using MBM counters
and adjust the memory bandwidth percentages to ensure::
@@ -573,13 +573,13 @@ Memory b/w domain is L3 cache.
MB:<cache_id0>=bandwidth0;<cache_id1>=bandwidth1;...
-Memory bandwidth Allocation specified in MBps
+Memory bandwidth Allocation specified in MiBps
---------------------------------------------
Memory bandwidth domain is L3 cache.
::
- MB:<cache_id0>=bw_MBps0;<cache_id1>=bw_MBps1;...
+ MB:<cache_id0>=bw_MiBps0;<cache_id1>=bw_MiBps1;...
Slow Memory Bandwidth Allocation (SMBA)
---------------------------------------
Hi Tony,
On 3/22/2024 11:20 AM, Tony Luck wrote:
> The memory bandwidth software controller uses 2^20 units rather than
> 10^6. See mbm_bw_count() which computes bandwidth using the "SZ_1M"
> Linux define for 0x00100000.
>
> Update the documentation to use MiB when describing this feature.
> It's too late to fix the mount option "mba_MBps" as that is now an
> established user interface.
I see that this is merged already but I do not think this is correct.
Shouldn't the implementation be fixed instead? Looking at the implementation
the intent appears to be clear that the goal is to have bandwidth be
MBps .... that is when looking from documentation to the define
(MBA_MAX_MBPS) to the comments of the function you reference above
mbm_bw_count(). For example, "...and delta bandwidth in MBps ..."
and "...maintain values in MBps..."
To me this change creates significant confusion since it now contradicts
with the source code and comments I reference above. Not to mention the
discrepancy with user documentation.
If you believe that this should be MiB then should the
source and comments not also be changed to reflect that? Or alternatively,
why not just fix mbm_bw_count() to support the documentation and what
it appears to be intended to do. If users have been using the interface
expecting MBps then this seems more like a needed bugfix than
a needed documentation change.
Finally, if you make documentation changes, please do build the
documentation afterwards. This change introduces a warning:
Memory bandwidth Allocation specified in MiBps
---------------------------------------------
../linux/Documentation/arch/x86/resctrl.rst:583: WARNING: Title underline too short.
Reinette
On Thu, Mar 28, 2024 at 06:01:33PM -0700, Reinette Chatre wrote:
> Hi Tony,
>
> On 3/22/2024 11:20 AM, Tony Luck wrote:
> > The memory bandwidth software controller uses 2^20 units rather than
> > 10^6. See mbm_bw_count() which computes bandwidth using the "SZ_1M"
> > Linux define for 0x00100000.
> >
> > Update the documentation to use MiB when describing this feature.
> > It's too late to fix the mount option "mba_MBps" as that is now an
> > established user interface.
>
> I see that this is merged already but I do not think this is correct.
I was surprised that Ingo merged it without giving folks a chance to
comment.
> Shouldn't the implementation be fixed instead? Looking at the implementation
> the intent appears to be clear that the goal is to have bandwidth be
> MBps .... that is when looking from documentation to the define
> (MBA_MAX_MBPS) to the comments of the function you reference above
> mbm_bw_count(). For example, "...and delta bandwidth in MBps ..."
> and "...maintain values in MBps..."
Difficult to be sure of intent. But in general when people talk about
"megabytes" in the context of memory they mean 2^20. Storage capacity
on computers was originally in 2^20 units until the marketing teams
at disk drive manufacturers realized they could print numbers 4.8% bigger
on their products by using SI unit 10^6 Mega prefix (rising to 7.3% with
Giga and 10% with Tera).
It is clear that the code uses 2^20 as it converts from bytes using
a right shift by 20.
Fixing the code would change the legacy API. Folks with a schemata
file that sets a limit of 5000 MB/s would find their applications
throttled by an additional 4.8% on upgrading to a kernel with this
"fix".
> To me this change creates significant confusion since it now contradicts
> with the source code and comments I reference above. Not to mention the
> discrepancy with user documentation.
>
> If you believe that this should be MiB then should the
> source and comments not also be changed to reflect that? Or alternatively,
> why not just fix mbm_bw_count() to support the documentation and what
> it appears to be intended to do. If users have been using the interface
> expecting MBps then this seems more like a needed bugfix than
> a needed documentation change.
I agree that the comments need to be fixed. I will spin up a patch.
> Finally, if you make documentation changes, please do build the
> documentation afterwards. This change introduces a warning:
>
> Memory bandwidth Allocation specified in MiBps
> ---------------------------------------------
> .../linux/Documentation/arch/x86/resctrl.rst:583: WARNING: Title underline too short.
My bad. Ingo has already applied a fix to TIP x86/urgent. I assume that
will be merged to Linus soon.
https://git.kernel.org/pub/scm/linux/kernel/git/tip/tip.git/commit/?h=x86/urgent&id=91491e5fb09624116950f9f2e1767a42e1da786
-Tony
Hi Tony,
On 3/29/2024 8:31 AM, Tony Luck wrote:
> On Thu, Mar 28, 2024 at 06:01:33PM -0700, Reinette Chatre wrote:
>> Hi Tony,
>>
>> On 3/22/2024 11:20 AM, Tony Luck wrote:
>>> The memory bandwidth software controller uses 2^20 units rather than
>>> 10^6. See mbm_bw_count() which computes bandwidth using the "SZ_1M"
>>> Linux define for 0x00100000.
>>>
>>> Update the documentation to use MiB when describing this feature.
>>> It's too late to fix the mount option "mba_MBps" as that is now an
>>> established user interface.
>>
>> I see that this is merged already but I do not think this is correct.
>
> I was surprised that Ingo merged it without giving folks a chance to
> comment.
>
>> Shouldn't the implementation be fixed instead? Looking at the implementation
>> the intent appears to be clear that the goal is to have bandwidth be
>> MBps .... that is when looking from documentation to the define
>> (MBA_MAX_MBPS) to the comments of the function you reference above
>> mbm_bw_count(). For example, "...and delta bandwidth in MBps ..."
>> and "...maintain values in MBps..."
>
> Difficult to be sure of intent. But in general when people talk about
> "megabytes" in the context of memory they mean 2^20. Storage capacity
> on computers was originally in 2^20 units until the marketing teams
> at disk drive manufacturers realized they could print numbers 4.8% bigger
> on their products by using SI unit 10^6 Mega prefix (rising to 7.3% with
> Giga and 10% with Tera).
This is not so obvious to me. I hear what you are saying about storage
capacity but the topic here is memory bandwidth and here I find the custom
to be that MB/s means 10^6 bytes per second. That is looking from how DDR
bandwidth is documented to how benchmarks like
https://github.com/intel/memory-bandwidth-benchmarks report the data, to
what wikipedia says in https://en.wikipedia.org/wiki/Memory_bandwidth.
I also took a sample of what the perf side of things may look like
and, for example, when looking at;
tools/perf/pmu-events/arch/x86/sapphirerapids/spr-metrics.json
I understand that the custom for bandwidth is MB/s. For example:
{
"BriefDescription": "DDR memory read bandwidth (MB/sec)",
"MetricExpr": "UNC_M_CAS_COUNT.RD * 64 / 1e6 / duration_time",
"MetricName": "memory_bandwidth_read",
"ScaleUnit": "1MB/s"
},
>
> It is clear that the code uses 2^20 as it converts from bytes using
> a right shift by 20.
Right. This appears to be a bug.
>
> Fixing the code would change the legacy API. Folks with a schemata
> file that sets a limit of 5000 MB/s would find their applications
> throttled by an addtional 4.8% on upgrading to a kernel with this
> "fix".
Wouldn't this be the right thing to do though? If the user sets a limit
of 5000MB/s then I believe the expectation is that it should not exceed
that bandwidth.
>> To me this change creates significant confusion since it now contradicts
>> with the source code and comments I reference above. Not to mention the
>> discrepancy with user documentation.
>>
>> If you believe that this should be MiB then should the
>> source and comments not also be changed to reflect that? Or alternatively,
>> why not just fix mbm_bw_count() to support the documentation and what
>> it appears to be intended to do. If users have been using the interface
>> expecting MBps then this seems more like a needed bugfix than
>> a needed documentation change.
>
> I agree that the comments need to be fixed. I will spin up a patch.
It is not just the comments but the constants and variables also. All point
to MBps.
Also note that there remain several examples in the resctrl doc (see
section "Memory bandwidth Allocation and monitoring") that continue to
use GB/s. Are you really proposing that all should be changed to GiB/s?
In addition to all of the above, this change would fragment resctrl even
more, with AMD's memory bandwidth control clearly being in GB/s.
I continue to find this change a source of confusion, and one that diverges
resctrl from the custom.
Reinette
Hi Tony,
On 3/29/2024 9:37 AM, Reinette Chatre wrote:
> On 3/29/2024 8:31 AM, Tony Luck wrote:
>> On Thu, Mar 28, 2024 at 06:01:33PM -0700, Reinette Chatre wrote:
>>> On 3/22/2024 11:20 AM, Tony Luck wrote:
>>>> The memory bandwidth software controller uses 2^20 units rather than
>>>> 10^6. See mbm_bw_count() which computes bandwidth using the "SZ_1M"
>>>> Linux define for 0x00100000.
>>>>
>>>> Update the documentation to use MiB when describing this feature.
>>>> It's too late to fix the mount option "mba_MBps" as that is now an
>>>> established user interface.
>>>
>>> I see that this is merged already but I do not think this is correct.
>>
>> I was surprised that Ingo merged it without giving folks a chance to
>> comment.
>>
>>> Shouldn't the implementation be fixed instead? Looking at the implementation
>>> the intent appears to be clear that the goal is to have bandwidth be
>>> MBps .... that is when looking from documentation to the define
>>> (MBA_MAX_MBPS) to the comments of the function you reference above
>>> mbm_bw_count(). For example, "...and delta bandwidth in MBps ..."
>>> and "...maintain values in MBps..."
>>
>> Difficult to be sure of intent. But in general when people talk about
>> "megabytes" in the context of memory they mean 2^20. Storage capacity
>> on computers was originally in 2^20 units until the marketing teams
>> at disk drive manufacturers realized they could print numbers 4.8% bigger
>> on their products by using SI unit 10^6 Mega prefix (rising to 7.3% with
>> Giga and 10% with Tera).
>
> This is not so obvious to me. I hear what you are saying about storage
> capacity but the topic here is memory bandwidth and here I find the custom
> to be that MB/s means 10^6 bytes per second. That is looking from how DDR
> bandwidth is documented to how benchmarks like
> https://github.com/intel/memory-bandwidth-benchmarks report the data, to
> what wikipedia says in https://en.wikipedia.org/wiki/Memory_bandwidth.
>
> I also took a sample of what the perf side of things may look like
> and, for example, when looking at;
> tools/perf/pmu-events/arch/x86/sapphirerapids/spr-metrics.json
> I understand that the custom for bandwidth is MB/s. For example:
>
> {
> "BriefDescription": "DDR memory read bandwidth (MB/sec)",
> "MetricExpr": "UNC_M_CAS_COUNT.RD * 64 / 1e6 / duration_time",
> "MetricName": "memory_bandwidth_read",
> "ScaleUnit": "1MB/s"
> },
>
(Thanks to Kan Liang for explaining this to me.)
As an update on this, the perf side does not seem to be as consistent as
I first interpreted it to be. There appears to be a "kernel side" and
"user side" related to memory bandwidth data.
On the kernel side, users can refer directly to:
/sys/bus/event_source/devices/uncore_imc_*/events to read the
UNC_M_CAS_COUNT.RD and UNC_M_CAS_COUNT.WR data and this appears to
be intended to be consumed as MiB/s as per:
$ /sys/bus/event_source/devices/uncore_imc_0/events/cas_count_read.unit
MiB
On the user side, when using perf from userspace the metrics are obtained
from the relevant JSON file quoted above, so data on the command line is
reported in MB/sec, for example:
$ perf list
[SNIP]
llc_miss_local_memory_bandwidth_read
[Bandwidth (MB/sec) of read requests that miss the last level cache (LLC) and go to local memory]
llc_miss_local_memory_bandwidth_write
[Bandwidth (MB/sec) of write requests that miss the last level cache (LLC) and go to local memory]
llc_miss_remote_memory_bandwidth_read
[Bandwidth (MB/sec) of read requests that miss the last level cache (LLC) and go to remote memory]
llc_miss_remote_memory_bandwidth_write
[Bandwidth (MB/sec) of write requests that miss the last level cache (LLC) and go to remote memory]
loads_per_instr
[The ratio of number of completed memory load instructions to the total number completed instructions]
memory_bandwidth_read
[DDR memory read bandwidth (MB/sec)]
memory_bandwidth_total
[DDR memory bandwidth (MB/sec)]
memory_bandwidth_write
[DDR memory write bandwidth (MB/sec)]
[SNIP]
It appears that there is no custom here and it may just be somebody's preference?
Reinette
> It appears that there is no custom here and it may just be somebody's preference?
Reinette,
Thanks for digging around. I had thought there was general consensus that
memory was measured in 2^20, storage in 10^6 and networking in either
10^6 or 10^9 (but bits rather than bytes).
But, as you've found, there doesn't seem to be even that much of
a custom.
Maybe a case for https://xkcd.com/927/ (since it is April 1st, I propose
everyone standardize on Teranibbles per fortnight[1] :-) )
But back to the patch. As there is no standard, changing the documentation
to accurately represent the code looks like a good option.
-Tony
[1] 1 MiB/s == 2.5367 Tnibbles/fortnight
Hi Tony,
On 4/1/2024 4:03 PM, Luck, Tony wrote:
>> It appears that there is no custom here and it may just be somebody's preference?
>
> Reinette,
>
> Thanks for digging around. I had thought there was general consensus that
> memory was measured in 2^20, storage in 10^6 and networking in either
> 10^6 or 10^9 (but bits rather than bytes).
>
> But, as you've found, there doesn't seem to be even that much of
> a custom.
>
> Maybe a case for https://xkcd.com/927/ (since it is April 1st, I propose
> everyone standardize on Teranibbles per fortnight[1] :-) )
>
> But back to the patch. As there is no standard, changing the documentation
> to accurately represent the code looks like a good option.
It is not obvious to me what the right thing is to do. The documentation has
stated since the inception of the software controller that it accepts
bandwidth in MBps. You demonstrated this is the case but it is not obvious to
me that this is a documentation problem or an issue that needs to
be fixed in the code. I am not familiar with a precedent in this regard.
Changing the documentation does seem like the least controversial approach.
The consequence is that resctrl documentation itself now switches back and
forth between the units ... it uses MiBps for the software controller and
GBps when referring to AMD and talking about memory bandwidth in general
(see section Memory bandwidth Allocation and monitoring). I hope that it
is clear enough that MiBps is just related to the software
controller.
Reinette
> Changing the documentation does seem like the least controversial approach.
> The consequence is that resctrl documentation itself now switches back and
> forth between the units ... it uses MiBps for the software controller and
> GBps when referring to AMD and talking about memory bandwidth in general
> (see section Memory bandwidth Allocation and monitoring). I hope that it
> is clear enough that MiBps is just related to the software
> controller.
Babu,
In this section that Reinette refers to above:
Reading/writing the schemata file (on AMD systems)
--------------------------------------------------
Reading the schemata file will show the current bandwidth limit on all
domains. The allocated resources are in multiples of one eighth GB/s.
When writing to the file, you need to specify what cache id you wish to
configure the bandwidth limit.
Does "one eighth GB/s" mean 134217728 bytes/s or 125000000 bytes/s?
I.e. is the documentation accurate, or sloppily using GB/s when it should
have said GiB/s?
Thanks
-Tony