2018-03-29 22:31:21

by Shivappa Vikas

Subject: [PATCH RFC 0/6] Memory b/w allocation software controller

Intel RDT memory bandwidth allocation (MBA) currently uses the resctrl
interface: the schemata file in each rdtgroup specifies the max b/w
percentage that the "threads" and "cpus" in the rdtgroup are allowed to
use. These values are specified "per package" in each rdtgroup's
schemata file as below:

$ cat /sys/fs/resctrl/p1/schemata
L3:0=7ff;1=7ff
MB:0=100;1=50

In the above example, MB is the memory bandwidth percentage and "0"
and "1" are the package/socket ids. The threads in rdtgroup "p1"
would get 100% memory b/w on socket0 and 50% b/w on socket1.

However, Memory bandwidth allocation (MBA) is a core specific mechanism
which means that when the memory b/w percentage is specified in the
schemata per package it actually is applied on a per core basis via the
IA32_MBA_THRTL_MSR interface. This may lead to confusion in scenarios
below:

1. The user may not see an increase in actual b/w when percentage
values are increased:

This can occur when aggregate L2 external b/w is more than L3 external
b/w. Consider an SKL SKU with 24 cores on a package where the L2
external b/w is 10GBps (hence the aggregate L2 external b/w is 240GBps)
and the L3 external b/w is 100GBps. Now a workload with '20 threads,
having 50% b/w, each consuming 5GBps' consumes the max L3 b/w of
100GBps although the percentage value specified is only 50% << 100%.
Hence increasing the b/w percentage will not yield any more b/w. This
is because although the L2 external b/w still has capacity, the L3
external b/w is fully used. Also note that this would be dependent on
the number of cores the benchmark is run on.

2. The same b/w percentage may mean different actual b/w depending on
the # of threads:

For the same SKU as in #1, a 'single thread, with 10% b/w' and '4
threads, with 10% b/w' can consume up to 10GBps and 40GBps respectively
although they have the same percentage b/w of 10%. This is simply
because as threads start using more cores in an rdtgroup, the actual
b/w may increase or vary although the user specified b/w percentage is
the same.

In order to mitigate this and make the interface more user friendly, we
can let the user specify the max bandwidth per rdtgroup in bytes (or
megabytes). The kernel underneath would use a software feedback
mechanism, or "Software Controller", which reads the actual b/w using
MBM counters and adjusts the memory bandwidth percentages to ensure
that "actual b/w < user b/w".

The legacy behaviour is the default and the user can switch to the "MBA
software controller" mode using the mount option 'mba_MB'.

To use the feature, mount the file system with the mba_MB option:

$ mount -t resctrl resctrl [-o cdp[,cdpl2][,mba_MB]] /sys/fs/resctrl

We could also use a config option as suggested by Fenghua. This may be
useful in situations where other resources need similar options, so that
we don't have to keep growing the if-else chain in the mount code.
However, it needs enough isolation with respect to resetting the values
when implemented.

If MBA is specified in MB (megabytes), the user can enter the max b/w
in MB rather than percentage values. The default when mounted is
max_u32.

$ echo "L3:0=3;1=c\nMB:0=1024;1=500" > /sys/fs/resctrl/p0/schemata
$ echo "L3:0=3;1=3\nMB:0=1024;1=500" > /sys/fs/resctrl/p1/schemata

In the above example the tasks in the "p0" and "p1" rdtgroups
would use a max b/w of 1024MBps on socket0 and 500MBps on socket1.
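
A full session might look like this (the L3 masks shown are
illustrative; the max_u32 default renders as 4294967295):

$ mount -t resctrl -o mba_MB resctrl /sys/fs/resctrl
$ mkdir /sys/fs/resctrl/p1
$ cat /sys/fs/resctrl/p1/schemata
L3:0=fff;1=fff
MB:0=4294967295;1=4294967295
$ echo "MB:0=1024;1=500" > /sys/fs/resctrl/p1/schemata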

Vikas Shivappa (6):
x86/intel_rdt/mba_sc: Add documentation for MBA software controller
x86/intel_rdt/mba_sc: Add support to enable/disable via mount option
x86/intel_rdt/mba_sc: Add initialization support
x86/intel_rdt/mba_sc: Add schemata support
x86/intel_rdt/mba_sc: Add counting for MBA software controller
x86/intel_rdt/mba_sc: Add support to dynamically update the memory b/w

Documentation/x86/intel_rdt_ui.txt | 63 +++++++++++++++++
arch/x86/kernel/cpu/intel_rdt.c | 50 +++++++++----
arch/x86/kernel/cpu/intel_rdt.h | 34 ++++++++-
arch/x86/kernel/cpu/intel_rdt_ctrlmondata.c | 10 ++-
arch/x86/kernel/cpu/intel_rdt_monitor.c | 105 +++++++++++++++++++++++++---
arch/x86/kernel/cpu/intel_rdt_rdtgroup.c | 34 ++++++++-
6 files changed, 268 insertions(+), 28 deletions(-)

--
1.9.1



2018-03-29 22:31:01

by Shivappa Vikas

Subject: [PATCH 1/6] x86/intel_rdt/mba_sc: Add documentation for MBA software controller

Add documentation about the usage, including the "schemata" format and
the use case for the MBA software controller.

Signed-off-by: Vikas Shivappa <[email protected]>
---
Documentation/x86/intel_rdt_ui.txt | 63 ++++++++++++++++++++++++++++++++++++++
1 file changed, 63 insertions(+)

diff --git a/Documentation/x86/intel_rdt_ui.txt b/Documentation/x86/intel_rdt_ui.txt
index 71c3098..3b9634e 100644
--- a/Documentation/x86/intel_rdt_ui.txt
+++ b/Documentation/x86/intel_rdt_ui.txt
@@ -315,6 +315,60 @@ Memory b/w domain is L3 cache.

MB:<cache_id0>=bandwidth0;<cache_id1>=bandwidth1;...

+Memory bandwidth(b/w) in MegaBytes
+----------------------------------
+
+Memory bandwidth allocation is a core specific mechanism: when the
+memory b/w percentage is specified in the schemata per package, it is
+actually applied on a per core basis via the IA32_MBA_THRTL_MSR
+interface. This may lead to confusion in scenarios below:
+
+1. The user may not see an increase in actual b/w when percentage
+ values are increased:
+
+This can occur when aggregate L2 external b/w is more than L3 external
+b/w. Consider an SKL SKU with 24 cores on a package where the L2
+external b/w is 10GBps (hence the aggregate L2 external b/w is 240GBps)
+and the L3 external b/w is 100GBps. Now a workload with '20 threads,
+having 50% b/w, each consuming 5GBps' consumes the max L3 b/w of
+100GBps although the percentage value specified is only 50% << 100%.
+Hence increasing the b/w percentage will not yield any more b/w. This
+is because although the L2 external b/w still has capacity, the L3
+external b/w is fully used. Also note that this would be dependent on
+the number of cores the benchmark is run on.
+
+2. The same b/w percentage may mean different actual b/w depending on
+ the # of threads:
+
+For the same SKU as in #1, a 'single thread, with 10% b/w' and '4
+threads, with 10% b/w' can consume up to 10GBps and 40GBps although
+they have the same percentage b/w of 10%. This is simply because as
+threads start using more cores in an rdtgroup, the actual b/w may
+increase or vary although the user specified b/w percentage is the same.
+
+In order to mitigate this and make the interface more user friendly, we
+can let the user specify the max bandwidth per rdtgroup in bytes (or
+megabytes). The kernel underneath would use a software feedback
+mechanism, or "Software Controller", which reads the actual b/w using
+MBM counters and adjusts the memory bandwidth percentages to ensure
+that "actual b/w < user b/w".
+
+The legacy behaviour is the default and the user can switch to the
+"MBA software controller" mode using the mount option 'mba_MB'.
+
+To use the feature, mount the file system with the mba_MB option:
+
+# mount -t resctrl resctrl [-o cdp[,cdpl2][,mba_MB]] /sys/fs/resctrl
+
+The schemata format is below:
+
+Memory b/w Allocation in Megabytes
+----------------------------------
+
+Memory b/w domain is L3 cache.
+
+ MB:<cache_id0>=bw_MB0;<cache_id1>=bw_MB1;...
+
Reading/writing the schemata file
---------------------------------
Reading the schemata file will show the state of all resources
@@ -358,6 +412,15 @@ allocations can overlap or not. The allocations specifies the maximum
b/w that the group may be able to use and the system admin can configure
the b/w accordingly.

+If MBA is specified in MB (megabytes), the user can enter the max b/w in MB
+rather than percentage values.
+
+# echo "L3:0=3;1=c\nMB:0=1024;1=500" > /sys/fs/resctrl/p0/schemata
+# echo "L3:0=3;1=3\nMB:0=1024;1=500" > /sys/fs/resctrl/p1/schemata
+
+In the above example the tasks in "p0" and "p1" on socket 0 would use a max b/w
+of 1024MBps whereas on socket 1 they would use 500MBps.
+
Example 2
---------
Again two sockets, but this time with a more realistic 20-bit mask.
--
1.9.1


2018-03-29 22:31:27

by Shivappa Vikas

Subject: [PATCH 6/6] x86/intel_rdt/mba_sc: Add support to dynamically update the memory b/w

The software controller uses the existing MBM overflow timer calls
(once per second currently) to read the bandwidth and ensure that
always

"current b/w < user specified max b/w"

Similarly, when we see that the current b/w is too low, we also try to
increase the b/w. We use a threshold b/w, or delta b/w, which is
calculated dynamically to determine what is too low. The OS then uses
the "IA32_MBA_THRTL_MSR" to change the b/w. The change itself is
currently linear and is in increments or decrements of the "b/w
granularity" specified by the SKU. The values written to the MSR are
also cached so that we do not do an rdmsr every second.
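
As a worked example (numbers purely illustrative): with a b/w
granularity of 10, a cached MSR value of 50, a user max of 1000MBps and
a measured b/w of 1200MBps, the next timer call lowers the cached value
to 40 and programs the corresponding delay via delay_bw_map(); if
instead the measured b/w were below 1000MBps by more than the delta
b/w, the value would be raised to 60.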

Signed-off-by: Vikas Shivappa <[email protected]>
---
arch/x86/kernel/cpu/intel_rdt.c | 2 +-
arch/x86/kernel/cpu/intel_rdt.h | 1 +
arch/x86/kernel/cpu/intel_rdt_monitor.c | 71 +++++++++++++++++++++++++++++++--
3 files changed, 69 insertions(+), 5 deletions(-)

diff --git a/arch/x86/kernel/cpu/intel_rdt.c b/arch/x86/kernel/cpu/intel_rdt.c
index 78beb64..700e957 100644
--- a/arch/x86/kernel/cpu/intel_rdt.c
+++ b/arch/x86/kernel/cpu/intel_rdt.c
@@ -341,7 +341,7 @@ static int get_cache_id(int cpu, int level)
* that can be written to QOS_MSRs.
* There are currently no SKUs which support non linear delay values.
*/
-static u32 delay_bw_map(unsigned long bw, struct rdt_resource *r)
+u32 delay_bw_map(unsigned long bw, struct rdt_resource *r)
{
if (r->membw.delay_linear)
return MAX_MBA_BW - bw;
diff --git a/arch/x86/kernel/cpu/intel_rdt.h b/arch/x86/kernel/cpu/intel_rdt.h
index b74619d..aafbc4b 100644
--- a/arch/x86/kernel/cpu/intel_rdt.h
+++ b/arch/x86/kernel/cpu/intel_rdt.h
@@ -474,6 +474,7 @@ void mbm_setup_overflow_handler(struct rdt_domain *dom,
unsigned long delay_ms);
void mbm_handle_overflow(struct work_struct *work);
void setup_ctrlval(struct rdt_resource *r, u32 *dc, u32 *dm);
+u32 delay_bw_map(unsigned long bw, struct rdt_resource *r);
void cqm_setup_limbo_handler(struct rdt_domain *dom, unsigned long delay_ms);
void cqm_handle_limbo(struct work_struct *work);
bool has_busy_rmid(struct rdt_resource *r, struct rdt_domain *d);
diff --git a/arch/x86/kernel/cpu/intel_rdt_monitor.c b/arch/x86/kernel/cpu/intel_rdt_monitor.c
index 509f338..13b6eff 100644
--- a/arch/x86/kernel/cpu/intel_rdt_monitor.c
+++ b/arch/x86/kernel/cpu/intel_rdt_monitor.c
@@ -272,6 +272,10 @@ static int __mon_event_count(u32 rmid, struct rmid_read *rr, struct mbm_state **

rr->val += m->chunks;

+ /*
+ * We only do the bw calculations for the mbm overflow
+ * periodic timer calls and for local events only.
+ */
if(!md)
goto out;

@@ -320,10 +324,61 @@ void mon_event_count(void *info)
}
}

-static void mbm_update(struct rdt_domain *d, int rmid)
+/*
+ * Check the current b/w using the MBM counters to always ensure that
+ * current b/w < user specified b/w. If the current b/w is way less than
+ * the user defined b/w (determined by the delta b/w),
+ * then try to increase the b/w.
+ */
+static void update_mba_bw(struct rdtgroup *rgrp, struct mbm_state *m)
{
+ u32 closid, rmid, cur_msr, cur_msr_val, new_msr_val;
+ struct rdt_resource *r_mba;
+ u64 cur_bw, user_bw, thrshl_bw;
+ struct rdt_domain *dom_mba;
+
+ r_mba = &rdt_resources_all[RDT_RESOURCE_MBA];
+ closid = rgrp->closid;
+ rmid = rgrp->mon.rmid;
+
+ dom_mba = get_domain_from_cpu(smp_processor_id(), r_mba);
+ if (!dom_mba) {
+ pr_warn_once("Failure to get domain for MBA update\n");
+ return;
+ }
+
+ cur_bw = m->prev_bw;
+ user_bw = dom_mba->ctrl_val[closid];
+ thrshl_bw = m->delta_bw;
+ cur_msr_val = dom_mba->msr_val[closid];
+ /*
+ * Scale up/down the b/w linearly.
+ */
+ if (user_bw < cur_bw && cur_msr_val > r_mba->membw.min_bw) {
+ new_msr_val = cur_msr_val - r_mba->membw.bw_gran;
+ } else if ((cur_bw && user_bw > (cur_bw + thrshl_bw)) &&
+ cur_msr_val < MAX_MBA_BW) {
+ new_msr_val = cur_msr_val + r_mba->membw.bw_gran;
+ } else {
+ return;
+ }

+ cur_msr = r_mba->msr_base + closid;
+ wrmsrl(cur_msr, delay_bw_map(new_msr_val, r_mba));
+ dom_mba->msr_val[closid] = new_msr_val;
+
+ /*
+ * When the counter is read next time, recalibrate the
+ * threshold since we changed the MSR value.
+ */
+ m->thrshl_calib = true;
+}
+
+static void mbm_update(struct rdt_domain *d, struct rdtgroup *rgrp, bool prgrp)
+{
+ int rmid = rgrp->mon.rmid;
struct rmid_read rr;
+ struct mbm_state *m;

rr.first = false;
rr.d = d;
@@ -338,7 +393,15 @@ static void mbm_update(struct rdt_domain *d, int rmid)
}
if (is_mbm_local_enabled()) {
rr.evtid = QOS_L3_MBM_LOCAL_EVENT_ID;
- __mon_event_count(rmid, &rr, NULL);
+ __mon_event_count(rmid, &rr, &m);
+
+ /*
+ * Call the MBA software controller core function
+ * only for the control groups and when user has enabled
+ * the software controller explicitly.
+ */
+ if (prgrp && is_mba_MBctrl())
+ update_mba_bw(rgrp, m);
}
}

@@ -404,11 +467,11 @@ void mbm_handle_overflow(struct work_struct *work)
goto out_unlock;

list_for_each_entry(prgrp, &rdt_all_groups, rdtgroup_list) {
- mbm_update(d, prgrp->mon.rmid);
+ mbm_update(d, prgrp, true);

head = &prgrp->mon.crdtgrp_list;
list_for_each_entry(crgrp, head, mon.crdtgrp_list)
- mbm_update(d, crgrp->mon.rmid);
+ mbm_update(d, crgrp, false);
}

schedule_delayed_work_on(cpu, &d->mbm_over, delay);
--
1.9.1


2018-03-29 22:31:49

by Shivappa Vikas

Subject: [PATCH 2/6] x86/intel_rdt/mba_sc: Add support to enable/disable via mount option

Specify a new mount option "mba_MB" to enable the user to specify MBA
bandwidth in Megabytes (Software Controller/SC) instead of a b/w
percentage:

$ mount -t resctrl resctrl [-o cdp[,cdpl2][,mba_MB]] /sys/fs/resctrl

Signed-off-by: Vikas Shivappa <[email protected]>
---
arch/x86/kernel/cpu/intel_rdt.h | 5 +++++
arch/x86/kernel/cpu/intel_rdt_rdtgroup.c | 31 +++++++++++++++++++++++++++++--
2 files changed, 34 insertions(+), 2 deletions(-)

diff --git a/arch/x86/kernel/cpu/intel_rdt.h b/arch/x86/kernel/cpu/intel_rdt.h
index 3fd7a70..3e9bc3f 100644
--- a/arch/x86/kernel/cpu/intel_rdt.h
+++ b/arch/x86/kernel/cpu/intel_rdt.h
@@ -259,6 +259,7 @@ struct rdt_cache {
* @min_bw: Minimum memory bandwidth percentage user can request
* @bw_gran: Granularity at which the memory bandwidth is allocated
* @delay_linear: True if memory B/W delay is in linear scale
+ * @bw_byte: True if memory B/W is specified in bytes
* @mb_map: Mapping of memory B/W percentage to memory B/W delay
*/
struct rdt_membw {
@@ -266,6 +267,7 @@ struct rdt_membw {
u32 min_bw;
u32 bw_gran;
u32 delay_linear;
+ bool bw_byte;
u32 *mb_map;
};

@@ -295,6 +297,9 @@ static inline bool is_mbm_event(int e)
e <= QOS_L3_MBM_LOCAL_EVENT_ID);
}

+#define is_mba_linear() rdt_resources_all[RDT_RESOURCE_MBA].membw.delay_linear
+#define is_mba_MBctrl() rdt_resources_all[RDT_RESOURCE_MBA].membw.bw_byte
+
/**
* struct rdt_resource - attributes of an RDT resource
* @rid: The index of the resource
diff --git a/arch/x86/kernel/cpu/intel_rdt_rdtgroup.c b/arch/x86/kernel/cpu/intel_rdt_rdtgroup.c
index fca759d..0707191 100644
--- a/arch/x86/kernel/cpu/intel_rdt_rdtgroup.c
+++ b/arch/x86/kernel/cpu/intel_rdt_rdtgroup.c
@@ -1041,6 +1041,24 @@ static int set_cache_qos_cfg(int level, bool enable)
return 0;
}

+static void __set_mba_byte_ctrl(bool byte_ctrl)
+{
+ struct rdt_resource *r = &rdt_resources_all[RDT_RESOURCE_MBA];
+
+ r->membw.bw_byte = byte_ctrl;
+}
+
+/*
+ * MBA allocation in bytes is only supported if
+ * MBM is supported and MBA is in linear scale.
+*/
+static void set_mba_byte_ctrl(bool byte_ctrl)
+{
+ if ((is_mbm_enabled() && is_mba_linear()) &&
+ byte_ctrl != is_mba_MBctrl())
+ __set_mba_byte_ctrl(byte_ctrl);
+}
+
static int cdp_enable(int level, int data_type, int code_type)
{
struct rdt_resource *r_ldata = &rdt_resources_all[data_type];
@@ -1104,7 +1122,7 @@ static void cdp_disable_all(void)
cdpl2_disable();
}

-static int parse_rdtgroupfs_options(char *data)
+static int parse_rdtgroupfs_options(char *data, bool *mba_MBctrl)
{
char *token, *o = data;
int ret = 0;
@@ -1123,6 +1141,8 @@ static int parse_rdtgroupfs_options(char *data)
ret = cdpl2_enable();
if (ret)
goto out;
+ } else if (!strcmp(token, "mba_MB")) {
+ *mba_MBctrl = true;
} else {
ret = -EINVAL;
goto out;
@@ -1209,6 +1229,7 @@ static struct dentry *rdt_mount(struct file_system_type *fs_type,
int flags, const char *unused_dev_name,
void *data)
{
+ bool mba_MBctrl = false;
struct rdt_domain *dom;
struct rdt_resource *r;
struct dentry *dentry;
@@ -1224,7 +1245,7 @@ static struct dentry *rdt_mount(struct file_system_type *fs_type,
goto out;
}

- ret = parse_rdtgroupfs_options(data);
+ ret = parse_rdtgroupfs_options(data, &mba_MBctrl);
if (ret) {
dentry = ERR_PTR(ret);
goto out_cdp;
@@ -1277,6 +1298,9 @@ static struct dentry *rdt_mount(struct file_system_type *fs_type,
mbm_setup_overflow_handler(dom, MBM_OVERFLOW_INTERVAL);
}

+ if (mba_MBctrl)
+ set_mba_byte_ctrl(true);
+
goto out;

out_mondata:
@@ -1445,6 +1469,9 @@ static void rdt_kill_sb(struct super_block *sb)
cpus_read_lock();
mutex_lock(&rdtgroup_mutex);

+ /*Set the control values before the rest of reset*/
+ set_mba_byte_ctrl(false);
+
/*Put everything back to default values. */
for_each_alloc_enabled_rdt_resource(r)
reset_all_ctrls(r);
--
1.9.1


2018-03-29 22:32:07

by Shivappa Vikas

Subject: [PATCH 4/6] x86/intel_rdt/mba_sc: Add schemata support

Currently the user specifies the maximum memory bandwidth percentage
per domain in the "schemata" file. When the user updates the schemata,
the kernel writes the corresponding b/w percentage values to the
IA32_MBA_THRTL_MSR.

When MBA is expressed in MegaBytes (MB), the schemata format is changed
to have the per package memory b/w in MB instead of a b/w percentage.
The b/w percentage values are only <= 100, while the b/w in MB could be
up to U32_MAX. We do not write the MSRs when the schemata is updated,
as that is handled separately after the b/w in MB is converted to b/w
percentage values.

Signed-off-by: Vikas Shivappa <[email protected]>
---
arch/x86/kernel/cpu/intel_rdt.c | 10 +++++++++-
arch/x86/kernel/cpu/intel_rdt_ctrlmondata.c | 10 ++++++++--
2 files changed, 17 insertions(+), 3 deletions(-)

diff --git a/arch/x86/kernel/cpu/intel_rdt.c b/arch/x86/kernel/cpu/intel_rdt.c
index 8a32561..8a12d26 100644
--- a/arch/x86/kernel/cpu/intel_rdt.c
+++ b/arch/x86/kernel/cpu/intel_rdt.c
@@ -179,7 +179,7 @@ struct rdt_resource rdt_resources_all[] = {
.msr_update = mba_wrmsr,
.cache_level = 3,
.parse_ctrlval = parse_bw,
- .format_str = "%d=%*d",
+ .format_str = "%d=%*u",
.fflags = RFTYPE_RES_MB,
},
};
@@ -356,6 +356,14 @@ static u32 delay_bw_map(unsigned long bw, struct rdt_resource *r)
{
unsigned int i;

+ /*
+ * The ctrl_val should not be written when
+ * MBA is expressed in Megabytes because the Megabyte value
+ * needs to first be converted to delay values that can be
+ * programmed to the MSR.
+ */
+ WARN_ON(is_mba_MBctrl());
+
/* Write the delay values for mba. */
for (i = m->low; i < m->high; i++)
wrmsrl(r->msr_base + i, delay_bw_map(d->ctrl_val[i], r));
diff --git a/arch/x86/kernel/cpu/intel_rdt_ctrlmondata.c b/arch/x86/kernel/cpu/intel_rdt_ctrlmondata.c
index 23e1d5c..6372e4f 100644
--- a/arch/x86/kernel/cpu/intel_rdt_ctrlmondata.c
+++ b/arch/x86/kernel/cpu/intel_rdt_ctrlmondata.c
@@ -53,7 +53,8 @@ static bool bw_validate(char *buf, unsigned long *data, struct rdt_resource *r)
return false;
}

- if (bw < r->membw.min_bw || bw > r->default_ctrl) {
+ if ((bw < r->membw.min_bw || bw > r->default_ctrl) &&
+ !is_mba_MBctrl()) {
rdt_last_cmd_printf("MB value %ld out of range [%d,%d]\n", bw,
r->membw.min_bw, r->default_ctrl);
return false;
@@ -194,7 +195,12 @@ static int update_domains(struct rdt_resource *r, int closid)
d->ctrl_val[closid] = d->new_ctrl;
}
}
- if (cpumask_empty(cpu_mask))
+
+ /*
+ * Avoid writing the control msr with control values when
+ * MBA is expressed in Megabytes.
+ */
+ if (cpumask_empty(cpu_mask) || is_mba_MBctrl())
goto done;
cpu = get_cpu();
/* Update CBM on this cpu if it's in cpu_mask. */
--
1.9.1


2018-03-29 22:32:35

by Shivappa Vikas

Subject: [PATCH 3/6] x86/intel_rdt/mba_sc: Add initialization support

When the MBA software controller is enabled, we need per-domain storage
for the user specified bandwidth in MB and for the raw b/w percentage
values which are programmed into the MSR. Add support for these data
structures and their initialization.

Signed-off-by: Vikas Shivappa <[email protected]>
---
arch/x86/kernel/cpu/intel_rdt.c | 37 +++++++++++++++++++++++---------
arch/x86/kernel/cpu/intel_rdt.h | 4 ++++
arch/x86/kernel/cpu/intel_rdt_rdtgroup.c | 3 +++
3 files changed, 34 insertions(+), 10 deletions(-)

diff --git a/arch/x86/kernel/cpu/intel_rdt.c b/arch/x86/kernel/cpu/intel_rdt.c
index 2b65601..8a32561 100644
--- a/arch/x86/kernel/cpu/intel_rdt.c
+++ b/arch/x86/kernel/cpu/intel_rdt.c
@@ -35,6 +35,7 @@

#define MAX_MBA_BW 100u
#define MBA_IS_LINEAR 0x4
+#define MBA_BW_MAX_MB U32_MAX

/* Mutex to protect rdtgroup access. */
DEFINE_MUTEX(rdtgroup_mutex);
@@ -431,25 +432,40 @@ struct rdt_domain *rdt_find_domain(struct rdt_resource *r, int id,
return NULL;
}

+void setup_ctrlval(struct rdt_resource *r, u32 *dc, u32 *dm)
+{
+ int i;
+
+ /*
+ * Initialize the Control MSRs to having no control.
+ * For Cache Allocation: Set all bits in cbm
+ * For Memory Allocation: Set b/w requested to 100%
+ * and the b/w in MB to U32_MAX
+ */
+ for (i = 0; i < r->num_closid; i++, dc++, dm++) {
+ *dc = r->membw.bw_byte ? MBA_BW_MAX_MB : r->default_ctrl;
+ *dm = r->default_ctrl;
+ }
+}
+
static int domain_setup_ctrlval(struct rdt_resource *r, struct rdt_domain *d)
{
struct msr_param m;
- u32 *dc;
- int i;
+ u32 *dc, *dm;

dc = kmalloc_array(r->num_closid, sizeof(*d->ctrl_val), GFP_KERNEL);
if (!dc)
return -ENOMEM;

- d->ctrl_val = dc;
+ dm = kmalloc_array(r->num_closid, sizeof(*d->msr_val), GFP_KERNEL);
+ if (!dm) {
+ kfree(dc);
+ return -ENOMEM;
+ }

- /*
- * Initialize the Control MSRs to having no control.
- * For Cache Allocation: Set all bits in cbm
- * For Memory Allocation: Set b/w requested to 100
- */
- for (i = 0; i < r->num_closid; i++, dc++)
- *dc = r->default_ctrl;
+ d->ctrl_val = dc;
+ d->msr_val = dm;
+ setup_ctrlval(r, dc, dm);

m.low = 0;
m.high = r->num_closid;
@@ -588,6 +604,7 @@ static void domain_remove_cpu(int cpu, struct rdt_resource *r)
}

kfree(d->ctrl_val);
+ kfree(d->msr_val);
kfree(d->rmid_busy_llc);
kfree(d->mbm_total);
kfree(d->mbm_local);
diff --git a/arch/x86/kernel/cpu/intel_rdt.h b/arch/x86/kernel/cpu/intel_rdt.h
index 3e9bc3f..68c7da0 100644
--- a/arch/x86/kernel/cpu/intel_rdt.h
+++ b/arch/x86/kernel/cpu/intel_rdt.h
@@ -202,6 +202,8 @@ struct mbm_state {
* @cqm_work_cpu:
* worker cpu for CQM h/w counters
* @ctrl_val: array of cache or mem ctrl values (indexed by CLOSID)
+ * When MBA is expressed in MB, this holds number of MegaBytes
+ * @msr_val: When MBA is expressed in MB, this holds the control MSR value
* @new_ctrl: new ctrl value to be loaded
* @have_new_ctrl: did user provide new_ctrl for this domain
*/
@@ -217,6 +219,7 @@ struct rdt_domain {
int mbm_work_cpu;
int cqm_work_cpu;
u32 *ctrl_val;
+ u32 *msr_val;
u32 new_ctrl;
bool have_new_ctrl;
};
@@ -450,6 +453,7 @@ void mon_event_read(struct rmid_read *rr, struct rdt_domain *d,
void mbm_setup_overflow_handler(struct rdt_domain *dom,
unsigned long delay_ms);
void mbm_handle_overflow(struct work_struct *work);
+void setup_ctrlval(struct rdt_resource *r, u32 *dc, u32 *dm);
void cqm_setup_limbo_handler(struct rdt_domain *dom, unsigned long delay_ms);
void cqm_handle_limbo(struct work_struct *work);
bool has_busy_rmid(struct rdt_resource *r, struct rdt_domain *d);
diff --git a/arch/x86/kernel/cpu/intel_rdt_rdtgroup.c b/arch/x86/kernel/cpu/intel_rdt_rdtgroup.c
index 0707191..d4e8412 100644
--- a/arch/x86/kernel/cpu/intel_rdt_rdtgroup.c
+++ b/arch/x86/kernel/cpu/intel_rdt_rdtgroup.c
@@ -1044,8 +1044,11 @@ static int set_cache_qos_cfg(int level, bool enable)
static void __set_mba_byte_ctrl(bool byte_ctrl)
{
struct rdt_resource *r = &rdt_resources_all[RDT_RESOURCE_MBA];
+ struct rdt_domain *d;

r->membw.bw_byte = byte_ctrl;
+ list_for_each_entry(d, &r->domains, list)
+ setup_ctrlval(r, d->ctrl_val, d->msr_val);
}

/*
--
1.9.1


2018-03-29 22:32:52

by Shivappa Vikas

Subject: [PATCH 5/6] x86/intel_rdt/mba_sc: Add counting for MBA software controller

Currently we store the per package "total bytes" for each rdtgroup for
memory bandwidth management, which is exposed via
"/sys/fs/resctrl/<rdtgrpx>/mon_data/mon_L3_00/mbm_local_bytes".

The above user interface remains, while we also add support to measure
the per package b/w in Megabytes and the delta b/w when the b/w MSR
values change. We do this by taking a time stamp every time the counter
is read and then keeping a history of b/w. This will be used to support
internal queries for the bandwidth in Megabytes.
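
The conversion itself (in __mon_event_count() below) is:

	cur_bw = (chunks * r->mon_scale) / delta_time;	/* delta_time in usecs */

Since chunks * mon_scale is a byte count and the interval is in
microseconds, bytes/usec works out to MB/s directly (both carry a
factor of 10^6). Note this is a 64-bit division; the 0-day robot
reports further down flag exactly this line ('__udivdi3') on 32-bit
builds, where something like div_u64() would be needed instead.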

Signed-off-by: Vikas Shivappa <[email protected]>
---
arch/x86/kernel/cpu/intel_rdt.c | 1 -
arch/x86/kernel/cpu/intel_rdt.h | 24 ++++++++++++++++++++--
arch/x86/kernel/cpu/intel_rdt_monitor.c | 36 +++++++++++++++++++++++++++------
3 files changed, 52 insertions(+), 9 deletions(-)

diff --git a/arch/x86/kernel/cpu/intel_rdt.c b/arch/x86/kernel/cpu/intel_rdt.c
index 8a12d26..78beb64 100644
--- a/arch/x86/kernel/cpu/intel_rdt.c
+++ b/arch/x86/kernel/cpu/intel_rdt.c
@@ -33,7 +33,6 @@
#include <asm/intel_rdt_sched.h>
#include "intel_rdt.h"

-#define MAX_MBA_BW 100u
#define MBA_IS_LINEAR 0x4
#define MBA_BW_MAX_MB U32_MAX

diff --git a/arch/x86/kernel/cpu/intel_rdt.h b/arch/x86/kernel/cpu/intel_rdt.h
index 68c7da0..b74619d 100644
--- a/arch/x86/kernel/cpu/intel_rdt.h
+++ b/arch/x86/kernel/cpu/intel_rdt.h
@@ -28,6 +28,18 @@

#define MBM_CNTR_WIDTH 24
#define MBM_OVERFLOW_INTERVAL 1000
+#define MAX_MBA_BW 100u
+
+/*
+ * This measures a tolerable delta value in MegaBytes between
+ * the expected bandwidth and the actual bandwidth.
+ * This is done so that we don't keep flipping the control
+ * bandwidth above and below the expected bandwidth.
+ *
+ * However, note that this is only the initial threshold value;
+ * it is adjusted dynamically per package for each rdtgrp.
+ */
+#define MBA_BW_MB_THRSHL 1024

#define RMID_VAL_ERROR BIT_ULL(63)
#define RMID_VAL_UNAVAIL BIT_ULL(62)
@@ -180,10 +192,18 @@ struct rftype {
* struct mbm_state - status for each MBM counter in each domain
* @chunks: Total data moved (multiply by rdt_group.mon_scale to get bytes)
* @prev_msr Value of IA32_QM_CTR for this RMID last time we read it
+ * @prev_read_time: The last time the counter was read
+ * @prev_bw: The most recent bandwidth in Megabytes
+ * @delta_bw: Difference between the current b/w and previous b/w
+ * @thrshl_calib: Indicates when to calculate the delta_bw
*/
struct mbm_state {
- u64 chunks;
- u64 prev_msr;
+ u64 chunks;
+ u64 prev_msr;
+ unsigned long prev_read_time;
+ u64 prev_bw;
+ u64 delta_bw;
+ bool thrshl_calib;
};

/**
diff --git a/arch/x86/kernel/cpu/intel_rdt_monitor.c b/arch/x86/kernel/cpu/intel_rdt_monitor.c
index 681450e..509f338 100644
--- a/arch/x86/kernel/cpu/intel_rdt_monitor.c
+++ b/arch/x86/kernel/cpu/intel_rdt_monitor.c
@@ -225,9 +225,11 @@ void free_rmid(u32 rmid)
list_add_tail(&entry->list, &rmid_free_lru);
}

-static int __mon_event_count(u32 rmid, struct rmid_read *rr)
+static int __mon_event_count(u32 rmid, struct rmid_read *rr, struct mbm_state **md)
{
- u64 chunks, shift, tval;
+ struct rdt_resource *r = &rdt_resources_all[RDT_RESOURCE_L3];
+ u64 chunks, shift, tval, cur_bw = 0;
+ unsigned long delta_time, now;
struct mbm_state *m;

tval = __rmid_read(rmid, rr->evtid);
@@ -256,6 +258,9 @@ static int __mon_event_count(u32 rmid, struct rmid_read *rr)
if (rr->first) {
m->prev_msr = tval;
m->chunks = 0;
+ m->prev_read_time = jiffies;
+ m->prev_bw = 0;
+ m->delta_bw = MBA_BW_MB_THRSHL;
return 0;
}

@@ -266,6 +271,24 @@ static int __mon_event_count(u32 rmid, struct rmid_read *rr)
m->prev_msr = tval;

rr->val += m->chunks;
+
+ if(!md)
+ goto out;
+
+ now = jiffies;
+ delta_time = jiffies_to_usecs(now - m->prev_read_time);
+ if (delta_time)
+ cur_bw = (chunks * r->mon_scale) / delta_time;
+
+ if (m->thrshl_calib)
+ m->delta_bw = abs(cur_bw - m->prev_bw);
+ m->thrshl_calib = false;
+ m->prev_bw = cur_bw;
+ m->prev_read_time = now;
+
+ *md = m;
+out:
+
return 0;
}

@@ -281,7 +304,7 @@ void mon_event_count(void *info)

rdtgrp = rr->rgrp;

- if (__mon_event_count(rdtgrp->mon.rmid, rr))
+ if (__mon_event_count(rdtgrp->mon.rmid, rr, NULL))
return;

/*
@@ -291,7 +314,7 @@ void mon_event_count(void *info)

if (rdtgrp->type == RDTCTRL_GROUP) {
list_for_each_entry(entry, head, mon.crdtgrp_list) {
- if (__mon_event_count(entry->mon.rmid, rr))
+ if (__mon_event_count(entry->mon.rmid, rr, NULL))
return;
}
}
@@ -299,6 +322,7 @@ void mon_event_count(void *info)

static void mbm_update(struct rdt_domain *d, int rmid)
{
+
struct rmid_read rr;

rr.first = false;
@@ -310,11 +334,11 @@ static void mbm_update(struct rdt_domain *d, int rmid)
*/
if (is_mbm_total_enabled()) {
rr.evtid = QOS_L3_MBM_TOTAL_EVENT_ID;
- __mon_event_count(rmid, &rr);
+ __mon_event_count(rmid, &rr, NULL);
}
if (is_mbm_local_enabled()) {
rr.evtid = QOS_L3_MBM_LOCAL_EVENT_ID;
- __mon_event_count(rmid, &rr);
+ __mon_event_count(rmid, &rr, NULL);
}
}

--
1.9.1


2018-03-30 09:34:17

by Thomas Gleixner

Subject: Re: [PATCH 2/6] x86/intel_rdt/mba_sc: Add support to enable/disable via mount option

On Thu, 29 Mar 2018, Vikas Shivappa wrote:

Subject: x86/intel_rdt/mba_sc: Add support to enable/disable via mount option

Huch? From Documentation:

The ``summary phrase`` in the email's Subject should concisely
describe the patch which that email contains.

You're introducing something new: mba_sc

It's completely unclear what that is and what it means.

x86/intel_rdt: Add mount option for bandwidth allocation in MB/s

or something like that.

> Specify a new mount option "mba_MB" to enable the user to specify MBA
> bandwidth in Megabytes(Software Controller/SC) instead of b/w

You cannot specify bandwidth in Megabytes. Bandwidth is a bit-rate and the
units are multiples of bits per second and not Megabytes.

> --- a/arch/x86/kernel/cpu/intel_rdt.h
> +++ b/arch/x86/kernel/cpu/intel_rdt.h
> @@ -259,6 +259,7 @@ struct rdt_cache {
> * @min_bw: Minimum memory bandwidth percentage user can request
> * @bw_gran: Granularity at which the memory bandwidth is allocated
> * @delay_linear: True if memory B/W delay is in linear scale
> + * @bw_byte: True if memory B/W is specified in bytes

So the mount parameter says Megabytes, but here you say bytes? What?

And bw_byte is a misnomer. bw_bytes if you really mean bytes. bw_mb if it's megabytes.

> +#define is_mba_linear() rdt_resources_all[RDT_RESOURCE_MBA].membw.delay_linear
> +#define is_mba_MBctrl() rdt_resources_all[RDT_RESOURCE_MBA].membw.bw_byte

Please use inlines and no camel case. That's horrible.

> +
> /**
> * struct rdt_resource - attributes of an RDT resource
> * @rid: The index of the resource
> diff --git a/arch/x86/kernel/cpu/intel_rdt_rdtgroup.c b/arch/x86/kernel/cpu/intel_rdt_rdtgroup.c
> index fca759d..0707191 100644
> --- a/arch/x86/kernel/cpu/intel_rdt_rdtgroup.c
> +++ b/arch/x86/kernel/cpu/intel_rdt_rdtgroup.c
> @@ -1041,6 +1041,24 @@ static int set_cache_qos_cfg(int level, bool enable)
> return 0;
> }
>
> +static void __set_mba_byte_ctrl(bool byte_ctrl)
> +{
> + struct rdt_resource *r = &rdt_resources_all[RDT_RESOURCE_MBA];
> +
> + r->membw.bw_byte = byte_ctrl;

I don't see the point of this extra function. It has exactly one user.

> +}
> +
> +/*
> + * MBA allocation in bytes is only supported if
> + * MBM is supported and MBA is in linear scale.
> +*/

Hint: checkpatch.pl is not optional

> +static void set_mba_byte_ctrl(bool byte_ctrl)
> +{
> + if ((is_mbm_enabled() && is_mba_linear()) &&
> + byte_ctrl != is_mba_MBctrl())
> + __set_mba_byte_ctrl(byte_ctrl);

And that user is a small enough function. To avoid indentation you can
simply return when the condition is false.

Also if the user wants to mount with the MB option and it's not supported,
why are you not returning an error code and refuse the mount? That's just
wrong.

> +
> static int cdp_enable(int level, int data_type, int code_type)
> {
> struct rdt_resource *r_ldata = &rdt_resources_all[data_type];
> @@ -1104,7 +1122,7 @@ static void cdp_disable_all(void)
> cdpl2_disable();
> }
>
> -static int parse_rdtgroupfs_options(char *data)
> +static int parse_rdtgroupfs_options(char *data, bool *mba_MBctrl)

What?

> {
> char *token, *o = data;
> int ret = 0;
> @@ -1123,6 +1141,8 @@ static int parse_rdtgroupfs_options(char *data)
> ret = cdpl2_enable();
> if (ret)
> goto out;
> + } else if (!strcmp(token, "mba_MB")) {
> + *mba_MBctrl = true;

That's mindless hackery. Really. What's wrong with setting the flag in the
resource and then add the actual register fiddling right in the

if (is_mbm_enabled()) {

section in rdt_mount()? That would be too obvious and fit into the existing
code, right?

> + /*Set the control values before the rest of reset*/

Space after '/*' and before '*/'

Aside of that the comment is pretty useless. 'the control values' ??? Which
control values?

> + set_mba_byte_ctrl(false);

Thanks,

tglx

2018-03-30 17:24:47

by Shivappa Vikas

Subject: Re: [PATCH 2/6] x86/intel_rdt/mba_sc: Add support to enable/disable via mount option


Hello Thomas,

On Fri, 30 Mar 2018, Thomas Gleixner wrote:

> On Thu, 29 Mar 2018, Vikas Shivappa wrote:
>
> Subject: x86/intel_rdt/mba_sc: Add support to enable/disable via mount option
>
> Huch? From Documentation:
>
> The ``summary phrase`` in the email's Subject should concisely
> describe the patch which that email contains.
>
> You're introducing something new: mba_sc
>
> It's completely unclear what that is and what it means.
>
> x86/intel_rdt: Add mount option for bandwidth allocation in MB/s
>
> or something like that.

Would 'Mount option to enable the MBA software controller' be better? Given that
I have a documentation patch which explains what the MBA software controller is.

>
>> Specify a new mount option "mba_MB" to enable the user to specify MBA
>> bandwidth in Megabytes(Software Controller/SC) instead of b/w
>
> You cannot specify bandwidth in Megabytes. Bandwidth is a bit-rate and the
> units are multiples of bits per second and not Megabytes.
>
>> --- a/arch/x86/kernel/cpu/intel_rdt.h
>> +++ b/arch/x86/kernel/cpu/intel_rdt.h
>> @@ -259,6 +259,7 @@ struct rdt_cache {
>> * @min_bw: Minimum memory bandwidth percentage user can request
>> * @bw_gran: Granularity at which the memory bandwidth is allocated
>> * @delay_linear: True if memory B/W delay is in linear scale
>> + * @bw_byte: True if memory B/W is specified in bytes
>
> So the mount parameter says Megabytes, but here you say bytes? What?
>
> And bw_byte is a misnomer. bw_bytes if you really mean bytes. bw_mb if it's megabytes.

Will fix the naming. Thanks for pointing out that it should be MBps everywhere.

>
>> +#define is_mba_linear() rdt_resources_all[RDT_RESOURCE_MBA].membw.delay_linear
>> +#define is_mba_MBctrl() rdt_resources_all[RDT_RESOURCE_MBA].membw.bw_byte
>
> Please use inlines and no camel case. That's horrible.

Will fix..

>
>> +
>> /**
>> * struct rdt_resource - attributes of an RDT resource
>> * @rid: The index of the resource
>> diff --git a/arch/x86/kernel/cpu/intel_rdt_rdtgroup.c b/arch/x86/kernel/cpu/intel_rdt_rdtgroup.c
>> index fca759d..0707191 100644
>> --- a/arch/x86/kernel/cpu/intel_rdt_rdtgroup.c
>> +++ b/arch/x86/kernel/cpu/intel_rdt_rdtgroup.c
>> @@ -1041,6 +1041,24 @@ static int set_cache_qos_cfg(int level, bool enable)
>> return 0;
>> }
>>
>> +static void __set_mba_byte_ctrl(bool byte_ctrl)
>> +{
>> + struct rdt_resource *r = &rdt_resources_all[RDT_RESOURCE_MBA];
>> +
>> + r->membw.bw_byte = byte_ctrl;
>
> I don't see the point of this extra function. It has exactly one user.
>
>> +}
>> +
>> +/*
>> + * MBA allocation in bytes is only supported if
>> + * MBM is supported and MBA is in linear scale.
>> +*/
>
> Hint: checkpatch.pl is not optional
>
>> +static void set_mba_byte_ctrl(bool byte_ctrl)
>> +{
>> + if ((is_mbm_enabled() && is_mba_linear()) &&
>> + byte_ctrl != is_mba_MBctrl())
>> + __set_mba_byte_ctrl(byte_ctrl);
>
> And that user is a small enough function. To avoid indentation you can
> simply return when the condition is false.
>
> Also if the user wants to mount with the MB option and it's not supported,
> why are you not returning an error code and refuse the mount? That's just
> wrong.

Will fix. can merge into one function and return error when not available.

>
>> +
>> static int cdp_enable(int level, int data_type, int code_type)
>> {
>> struct rdt_resource *r_ldata = &rdt_resources_all[data_type];
>> @@ -1104,7 +1122,7 @@ static void cdp_disable_all(void)
>> cdpl2_disable();
>> }
>>
>> -static int parse_rdtgroupfs_options(char *data)
>> +static int parse_rdtgroupfs_options(char *data, bool *mba_MBctrl)
>
> What?
>
>> {
>> char *token, *o = data;
>> int ret = 0;
>> @@ -1123,6 +1141,8 @@ static int parse_rdtgroupfs_options(char *data)
>> ret = cdpl2_enable();
>> if (ret)
>> goto out;
>> + } else if (!strcmp(token, "mba_MB")) {
>> + *mba_MBctrl = true;
>
> That's mindless hackery. Really. What's wrong with setting the flag in the
> resource and then add the actual register fiddling right in the
>
> if (is_mbm_enabled()) {
>
> section in rdt_mount()? That would be too obvious and fit into the existing
> code, right?

Will fix.

>
>> + /*Set the control values before the rest of reset*/
>
> Space after '/*' and before '*/
>
> Aside of that the comment is pretty useless. 'the control values' ??? Which
> control values?
>

Will fix the comment or remove it. Wanted to point out here that we reset the
control values (the delay values that go into the IA32_MBA_THRTL_MSRs), but
that's done anyway in the reset_all_ctrls call after this, so the comment can
be removed.

Will fix the checkpatch issues as pointed.

In general I wanted to know if it is a sane idea to have a software feedback
mechanism and let the user specify b/w in MBps rather than the confusing
percentage values. The typical confusing scenarios are documented with examples
in the documentation patch. The use case can occur in any rdtgroup which groups
jobs where different numbers of threads are active. Say you want to create an
rdtgroup with low priority jobs and give them 10% of b/w: the actual raw b/w in
MBps used can vary and increase if more threads are spawned (because the new
threads spawned belong to the same rdtgroup and each thread can use up 10% of
the 'per core' memory b/w).

Thanks,
Vikas

>> + set_mba_byte_ctrl(false);
>
> Thanks,
>
> tglx
>

2018-03-30 21:23:06

by kernel test robot

Subject: Re: [PATCH 6/6] x86/intel_rdt/mba_sc: Add support to dynamically update the memory b/w

Hi Vikas,

Thank you for the patch! Yet something to improve:

[auto build test ERROR on v4.16-rc7]
[also build test ERROR on next-20180329]
[cannot apply to tip/x86/core]
[if your patch is applied to the wrong git tree, please drop us a note to help improve the system]

url: https://github.com/0day-ci/linux/commits/Vikas-Shivappa/Memory-b-w-allocation-software-controller/20180331-040536
config: i386-randconfig-a0-201812 (attached as .config)
compiler: gcc-4.9 (Debian 4.9.4-2) 4.9.4
reproduce:
# save the attached .config to linux build tree
make ARCH=i386

All errors (new ones prefixed by >>):

arch/x86/kernel/cpu/intel_rdt_monitor.o: In function `__mon_event_count':
>> arch/x86/kernel/cpu/intel_rdt_monitor.c:285: undefined reference to `__udivdi3'

vim +285 arch/x86/kernel/cpu/intel_rdt_monitor.c

edf6fa1c Vikas Shivappa 2017-07-25 227
2bbfc129 Vikas Shivappa 2018-03-29 228 static int __mon_event_count(u32 rmid, struct rmid_read *rr, struct mbm_state **md)
d89b7379 Vikas Shivappa 2017-07-25 229 {
2bbfc129 Vikas Shivappa 2018-03-29 230 struct rdt_resource *r = &rdt_resources_all[RDT_RESOURCE_L3];
2bbfc129 Vikas Shivappa 2018-03-29 231 u64 chunks, shift, tval, cur_bw = 0;
2bbfc129 Vikas Shivappa 2018-03-29 232 unsigned long delta_time, now;
9f52425b Tony Luck 2017-07-25 233 struct mbm_state *m;
d89b7379 Vikas Shivappa 2017-07-25 234
d89b7379 Vikas Shivappa 2017-07-25 235 tval = __rmid_read(rmid, rr->evtid);
d89b7379 Vikas Shivappa 2017-07-25 236 if (tval & (RMID_VAL_ERROR | RMID_VAL_UNAVAIL)) {
d89b7379 Vikas Shivappa 2017-07-25 237 rr->val = tval;
d89b7379 Vikas Shivappa 2017-07-25 238 return -EINVAL;
d89b7379 Vikas Shivappa 2017-07-25 239 }
d89b7379 Vikas Shivappa 2017-07-25 240 switch (rr->evtid) {
d89b7379 Vikas Shivappa 2017-07-25 241 case QOS_L3_OCCUP_EVENT_ID:
d89b7379 Vikas Shivappa 2017-07-25 242 rr->val += tval;
d89b7379 Vikas Shivappa 2017-07-25 243 return 0;
9f52425b Tony Luck 2017-07-25 244 case QOS_L3_MBM_TOTAL_EVENT_ID:
9f52425b Tony Luck 2017-07-25 245 m = &rr->d->mbm_total[rmid];
9f52425b Tony Luck 2017-07-25 246 break;
9f52425b Tony Luck 2017-07-25 247 case QOS_L3_MBM_LOCAL_EVENT_ID:
9f52425b Tony Luck 2017-07-25 248 m = &rr->d->mbm_local[rmid];
9f52425b Tony Luck 2017-07-25 249 break;
d89b7379 Vikas Shivappa 2017-07-25 250 default:
d89b7379 Vikas Shivappa 2017-07-25 251 /*
d89b7379 Vikas Shivappa 2017-07-25 252 * Code would never reach here because
d89b7379 Vikas Shivappa 2017-07-25 253 * an invalid event id would fail the __rmid_read.
d89b7379 Vikas Shivappa 2017-07-25 254 */
d89b7379 Vikas Shivappa 2017-07-25 255 return -EINVAL;
d89b7379 Vikas Shivappa 2017-07-25 256 }
a4de1dfd Vikas Shivappa 2017-07-25 257
a4de1dfd Vikas Shivappa 2017-07-25 258 if (rr->first) {
a4de1dfd Vikas Shivappa 2017-07-25 259 m->prev_msr = tval;
a4de1dfd Vikas Shivappa 2017-07-25 260 m->chunks = 0;
2bbfc129 Vikas Shivappa 2018-03-29 261 m->prev_read_time = jiffies;
2bbfc129 Vikas Shivappa 2018-03-29 262 m->prev_bw = 0;
2bbfc129 Vikas Shivappa 2018-03-29 263 m->delta_bw = MBA_BW_MB_THRSHL;
a4de1dfd Vikas Shivappa 2017-07-25 264 return 0;
a4de1dfd Vikas Shivappa 2017-07-25 265 }
a4de1dfd Vikas Shivappa 2017-07-25 266
9f52425b Tony Luck 2017-07-25 267 shift = 64 - MBM_CNTR_WIDTH;
9f52425b Tony Luck 2017-07-25 268 chunks = (tval << shift) - (m->prev_msr << shift);
9f52425b Tony Luck 2017-07-25 269 chunks >>= shift;
9f52425b Tony Luck 2017-07-25 270 m->chunks += chunks;
9f52425b Tony Luck 2017-07-25 271 m->prev_msr = tval;
9f52425b Tony Luck 2017-07-25 272
9f52425b Tony Luck 2017-07-25 273 rr->val += m->chunks;
2bbfc129 Vikas Shivappa 2018-03-29 274
6138a999 Vikas Shivappa 2018-03-29 275 /*
6138a999 Vikas Shivappa 2018-03-29 276 * We only do the bw calculations for the mbm overflow
6138a999 Vikas Shivappa 2018-03-29 277 * periodic timer calls and for local events only.
6138a999 Vikas Shivappa 2018-03-29 278 */
2bbfc129 Vikas Shivappa 2018-03-29 279 if(!md)
2bbfc129 Vikas Shivappa 2018-03-29 280 goto out;
2bbfc129 Vikas Shivappa 2018-03-29 281
2bbfc129 Vikas Shivappa 2018-03-29 282 now = jiffies;
2bbfc129 Vikas Shivappa 2018-03-29 283 delta_time = jiffies_to_usecs(now - m->prev_read_time);
2bbfc129 Vikas Shivappa 2018-03-29 284 if (delta_time)
2bbfc129 Vikas Shivappa 2018-03-29 @285 cur_bw = (chunks * r->mon_scale) / delta_time;
2bbfc129 Vikas Shivappa 2018-03-29 286
2bbfc129 Vikas Shivappa 2018-03-29 287 if (m->thrshl_calib)
2bbfc129 Vikas Shivappa 2018-03-29 288 m->delta_bw = abs(cur_bw - m->prev_bw);
2bbfc129 Vikas Shivappa 2018-03-29 289 m->thrshl_calib = false;
2bbfc129 Vikas Shivappa 2018-03-29 290 m->prev_bw = cur_bw;
2bbfc129 Vikas Shivappa 2018-03-29 291 m->prev_read_time = now;
2bbfc129 Vikas Shivappa 2018-03-29 292
2bbfc129 Vikas Shivappa 2018-03-29 293 *md = m;
2bbfc129 Vikas Shivappa 2018-03-29 294 out:
2bbfc129 Vikas Shivappa 2018-03-29 295
9f52425b Tony Luck 2017-07-25 296 return 0;
d89b7379 Vikas Shivappa 2017-07-25 297 }
d89b7379 Vikas Shivappa 2017-07-25 298

:::::: The code at line 285 was first introduced by commit
:::::: 2bbfc12978bb70164a0fa01307798973a4e2c80d x86/intel_rdt/mba_sc: Add counting for MBA software controller

:::::: TO: Vikas Shivappa <[email protected]>
:::::: CC: 0day robot <[email protected]>

---
0-DAY kernel test infrastructure Open Source Technology Center
https://lists.01.org/pipermail/kbuild-all Intel Corporation



2018-03-31 01:39:04

by kernel test robot

Subject: Re: [PATCH 6/6] x86/intel_rdt/mba_sc: Add support to dynamically update the memory b/w

Hi Vikas,

Thank you for the patch! Yet something to improve:

[auto build test ERROR on v4.16-rc7]
[also build test ERROR on next-20180329]
[cannot apply to tip/x86/core]
[if your patch is applied to the wrong git tree, please drop us a note to help improve the system]

url: https://github.com/0day-ci/linux/commits/Vikas-Shivappa/Memory-b-w-allocation-software-controller/20180331-040536
config: i386-allmodconfig (attached as .config)
compiler: gcc-7 (Debian 7.3.0-1) 7.3.0
reproduce:
# save the attached .config to linux build tree
make ARCH=i386

All errors (new ones prefixed by >>):

arch/x86/kernel/cpu/intel_rdt_monitor.o: In function `__mon_event_count':
>> intel_rdt_monitor.c:(.text+0x1b8): undefined reference to `__udivdi3'

---
0-DAY kernel test infrastructure Open Source Technology Center
https://lists.01.org/pipermail/kbuild-all Intel Corporation



2018-04-03 09:48:10

by Thomas Gleixner

Subject: Re: [PATCH 1/6] x86/intel_rdt/mba_sc: Add documentation for MBA software controller

On Thu, 29 Mar 2018, Vikas Shivappa wrote:
> +Memory bandwidth(b/w) in MegaBytes
> +----------------------------------
> +
> +Memory bandwidth allocation is a core specific mechanism: when the
> +memory b/w percentage is specified in the schemata per package, it is
> +actually applied on a per core basis via the IA32_MBA_THRTL_MSR
> +interface. This may lead to confusion in scenarios below:
> +
> +1. The user may not see an increase in actual b/w when percentage
> + values are increased:
> +
> +This can occur when aggregate L2 external b/w is more than L3 external
> +b/w. Consider an SKL SKU with 24 cores on a package where the L2
> +external b/w is 10GBps (hence the aggregate L2 external b/w is 240GBps)
> +and the L3 external b/w is 100GBps. Now a workload with '20 threads,
> +having 50% b/w, each consuming 5GBps' consumes the max L3 b/w of
> +100GBps although the percentage value specified is only 50% << 100%.
> +Hence increasing the b/w percentage will not yield any more b/w. This
> +is because although the L2 external b/w still has capacity, the L3
> +external b/w is fully used. Also note that this would be dependent on
> +the number of cores the benchmark is run on.
> +
> +2. The same b/w percentage may mean different actual b/w depending on
> + the # of threads:
> +
> +For the same SKU as in #1, a 'single thread, with 10% b/w' and '4
> +threads, with 10% b/w' can consume up to 10GBps and 40GBps although
> +they have the same percentage b/w of 10%. This is simply because as
> +threads start using more cores in an rdtgroup, the actual b/w may
> +increase or vary although the user specified b/w percentage is the same.
> +
> +In order to mitigate this and make the interface more user friendly, we
> +can let the user specify the max bandwidth per rdtgroup in bytes (or
> +megabytes). The kernel underneath would use a software feedback
> +mechanism, or "Software Controller", which reads the actual b/w using
> +MBM counters and adjusts the memory bandwidth percentages to ensure
> +that "actual b/w < user b/w".
> +
> +The legacy behaviour is the default and the user can switch to the
> +"MBA software controller" mode using the mount option 'mba_MB'.

You said above:

> This may lead to confusion in scenarios below:

Reading the blurb after that creates even more confusion than being
helpful.

First of all this information should not be under the section 'Memory
bandwidth in MB/s'.

Also please write bandwidth. The weird acronym b/w (band per width???) is
really not increasing legibility.

What you really want is a general section about memory bandwidth allocation
where you explain the technical background in purely technical terms w/o
fairy tale mode. Technical descriptions have to be factual and not
'could/may/would'.

If I decode the above correctly then the current percentage based
implementation was buggy from the very beginning in several ways.

Now the obvious question which is in no way answered by the cover letter is
why the current percentage based implementation cannot be fixed and we need
some feedback driven magic to achieve that. I assume you spent some brain
cycles on that question, so it would be really helpful if you shared that.

If I understand it correctly then the problem is that the throttling
mechanism is per core and affects the L2 external bandwidth.

Is this really per core? What about hyper threads. Both threads have that
MSR. How is that working?

The L2 external bandwidth is higher than the L3 external bandwidth.

Is there any information available from CPUID or whatever source which
allows us to retrieve the bandwidth ratio or the absolute maximum
bandwidth per level?

What's also missing from your explanation is how that feedback loop behaves
under different workloads.

Is this assuming that the involved threads/cpus actually try to utilize
the bandwidth completely?

What happens if the threads/cpus are only using a small set because they
are idle or their computations are mostly cache local and do not need
external bandwidth? Looking at the implementation I don't see how that is
taken into account.

Thanks,

tglx







2018-04-03 09:54:38

by Thomas Gleixner

Subject: Re: [PATCH 3/6] x86/intel_rdt/mba_sc: Add initialization support

On Thu, 29 Mar 2018, Vikas Shivappa wrote:
> +void setup_ctrlval(struct rdt_resource *r, u32 *dc, u32 *dm)
> +{
> + int i;
> +
> + /*
> + * Initialize the Control MSRs to having no control.
> + * For Cache Allocation: Set all bits in cbm
> + * For Memory Allocation: Set b/w requested to 100%
> + * and the b/w in MB to U32_MAX
> + */
> + for (i = 0; i < r->num_closid; i++, dc++, dm++) {
> + *dc = r->membw.bw_byte ? MBA_BW_MAX_MB : r->default_ctrl;
> + *dm = r->default_ctrl;

No! Please stop duct taping your stuff into the existing code. So far the
ctrl value was the same as the value which was actually written into the
MSR. With your new mode you have to split that up into the user supplied
value and the value which gets written into the MSR.

So the right thing to do is to separate the user value and the MSR value
first and independent of the mode. Then the new mode falls into place
naturally because r->default_ctrl and r->default_msrval are set up at mount
time with the values which correspond to the mount mode.

Thanks,

tglx

2018-04-03 14:31:40

by Thomas Gleixner

Subject: Re: [PATCH 1/6] x86/intel_rdt/mba_sc: Add documentation for MBA software controller

On Tue, 3 Apr 2018, Thomas Gleixner wrote:
> On Thu, 29 Mar 2018, Vikas Shivappa wrote:
> You said above:
>
> > This may lead to confusion in scenarios below:
>
> Reading the blurb after that creates even more confusion than being
> helpful.
>
> First of all this information should not be under the section 'Memory
> bandwidth in MB/s'.
>
> Also please write bandwidth. The weird acronym b/w (band per width???) is
> really not increasing legibility.
>
> What you really want is a general section about memory bandwidth allocation
> where you explain the technical background in purely technical terms w/o
> fairy tale mode. Technical descriptions have to be factual and not
> 'could/may/would'.
>
> If I decode the above correctly then the current percentage based
> implementation was buggy from the very beginning in several ways.
>
> Now the obvious question which is in no way answered by the cover letter is
> why the current percentage based implementation cannot be fixed and we need
> some feedback driven magic to achieve that. I assume you spent some brain
> cycles on that question, so it would be really helpful if you shared that.
>
> If I understand it correctly then the problem is that the throttling
> mechanism is per core and affects the L2 external bandwidth.
>
> Is this really per core? What about hyper threads. Both threads have that
> MSR. How is that working?
>
> The L2 external bandwidth is higher than the L3 external bandwidth.
>
> Is there any information available from CPUID or whatever source which
> allows us to retrieve the bandwidth ratio or the absolute maximum
> bandwidth per level?
>
> What's also missing from your explanation is how that feedback loop behaves
> under different workloads.
>
> Is this assuming that the involved threads/cpus actually try to utilize
> the bandwidth completely?
>
> What happens if the threads/cpus are only using a small set because they
> are idle or their computations are mostly cache local and do not need
> external bandwidth? Looking at the implementation I don't see how that is
> taken into account.

Forgot to mention the following:

The proposed new interface has no upper limit. The existing percentage
based implementation has at least some notion of limit and scale; not
really helpful either because of the hardware implementation.

How is the poor admin supposed to configure that new thing without
knowing what the actual hardware limits are in the first place?

Thanks,

tglx



2018-04-03 18:50:10

by Shivappa Vikas

Subject: Re: [PATCH 1/6] x86/intel_rdt/mba_sc: Add documentation for MBA software controller



On Tue, 3 Apr 2018, Thomas Gleixner wrote:

> On Thu, 29 Mar 2018, Vikas Shivappa wrote:
>> +Memory bandwidth(b/w) in MegaBytes
>> +----------------------------------
>> +
>> +Memory bandwidth allocation is a core specific mechanism: when the
>> +memory b/w percentage is specified in the schemata per package, it is
>> +actually applied on a per core basis via the IA32_MBA_THRTL_MSR
>> +interface. This may lead to confusion in scenarios below:
>> +
>> +1. The user may not see an increase in actual b/w when percentage
>> + values are increased:
>> +
>> +This can occur when aggregate L2 external b/w is more than L3 external
>> +b/w. Consider an SKL SKU with 24 cores on a package where the L2
>> +external b/w is 10GBps (hence the aggregate L2 external b/w is 240GBps)
>> +and the L3 external b/w is 100GBps. Now a workload with '20 threads,
>> +having 50% b/w, each consuming 5GBps' consumes the max L3 b/w of
>> +100GBps although the percentage value specified is only 50% << 100%.
>> +Hence increasing the b/w percentage will not yield any more b/w. This
>> +is because although the L2 external b/w still has capacity, the L3
>> +external b/w is fully used. Also note that this would be dependent on
>> +the number of cores the benchmark is run on.
>> +
>> +2. The same b/w percentage may mean different actual b/w depending on
>> + the # of threads:
>> +
>> +For the same SKU as in #1, a 'single thread, with 10% b/w' and '4
>> +threads, with 10% b/w' can consume up to 10GBps and 40GBps although
>> +they have the same percentage b/w of 10%. This is simply because as
>> +threads start using more cores in an rdtgroup, the actual b/w may
>> +increase or vary although the user specified b/w percentage is the same.
>> +
>> +In order to mitigate this and make the interface more user friendly, we
>> +can let the user specify the max bandwidth per rdtgroup in bytes (or
>> +megabytes). The kernel underneath would use a software feedback
>> +mechanism, or "Software Controller", which reads the actual b/w using
>> +MBM counters and adjusts the memory bandwidth percentages to ensure
>> +that "actual b/w < user b/w".
>> +
>> +The legacy behaviour is the default and the user can switch to the
>> +"MBA software controller" mode using the mount option 'mba_MB'.
>
> You said above:
>
>> This may lead to confusion in scenarios below:
>
> Reading the blurb after that creates even more confusion than being
> helpful.
>
> First of all this information should not be under the section 'Memory
> bandwidth in MB/s'.
>
> Also please write bandwidth. The weird acronym b/w (band per width???) is
> really not increasing legibility.

Ok, will fix and add a separate section.

>
> What you really want is a general section about memory bandwidth allocation
> where you explain the technical background in purely technical terms w/o
> fairy tale mode. Technical descriptions have to be factual and not
> 'could/may/would'.
>
> If I decode the above correctly then the current percentage based
> implementation was buggy from the very beginning in several ways.
>
> Now the obvious question which is in no way answered by the cover letter is
> why the current percentage based implementation cannot be fixed and we need
> some feedback driven magic to achieve that. I assume you spent some brain
> cycles on that question, so it would be really helpful if you shared that.
>
> If I understand it correctly then the problem is that the throttling
> mechanism is per core and affects the L2 external bandwidth.
>
> Is this really per core? What about hyper threads. Both threads have that
> MSR. How is that working?

It is a per-core mechanism. With hyperthreads, the core simply gets the lowest
bandwidth setting among the thread siblings. We have the text below to explain
this - I can add more description if needed:

"The bandwidth throttling is a core specific mechanism on some of Intel
SKUs. Using a high bandwidth and a low bandwidth setting on two threads
sharing a core will result in both threads being throttled to use the
low bandwidth."
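
Illustratively (just a sketch of the effective result, not actual hardware
documentation), with two sibling threads t0 and t1 sharing one core:

        /* The core is throttled to the lower of the two settings: */
        effective_bw_percent = min(bw_percent[t0], bw_percent[t1]);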

>
> The L2 external bandwidth is higher than the L3 external bandwidth.
>
> Is there any information available from CPUID or whatever source which
> allows us to retrieve the bandwidth ratio or the absolute maximum
> bandwidth per level?

There is no information in CPUID on the bandwidth available. Also, we have seen
from our experiments that the increase is not perfectly linear (the delta
bandwidth increase from 30% to 40% may not be the same as from 70% to 80%). So
we currently dynamically caliberate this delta for the software controller.

>
> What's also missing from your explanation is how that feedback loop behaves
> under different workloads.
>
> Is this assuming that the involved threads/cpus actually try to utilize
> the bandwidth completely?

No, the feedback loop only guarantees that the usage will not exceed what the
user specifies as the max bandwidth. If the group is using less than the max
value, it does not matter how much less it is using.
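
To make it concrete, one iteration of the loop looks roughly like this
(hypothetical names; the actual patches differ in details):

        /* Run periodically for each rdtgroup/domain: */
        cur_bw = mbm_local_bytes_delta();        /* measured via MBM */
        if (cur_bw > user_bw && cur_pct > min_pct)
                cur_pct -= bw_gran;     /* over the limit: throttle harder */
        else if (cur_bw + delta_bw < user_bw && cur_pct < 100)
                cur_pct += bw_gran;     /* safely below: relax throttling */
        update_mba_msr(domain, closid, cur_pct); /* hypothetical helper */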

>
> What happens if the threads/cpus are only using a small set because they
> are idle or their computations are mostly cache local and do not need
> external bandwidth? Looking at the implementation I don't see how that is
> taken into account.

The feedback only kicks into action if an rdtgroup uses more bandwidth than
the max specified by the user. I specified that it will always "ensure the
actual b/w < user b/w" and can add more explanation of these scenarios.

Also note that we are using the MBM counters for this feedback loop, which
makes the interface much more useful because the same rdtgroup is being
monitored and controlled. (With perf MBM, the group of threads in the resctrl
MBA group and in the MBM group could be different, and it would be hard to
measure what the threads/cpus in the resctrl group are using.) So resctrl
being used for both monitoring and control is a requirement for this, and we
always check that.
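
For reference, the per-group bandwidth the controller acts on is the same data
that is visible via the resctrl monitoring files, e.g.:

$ cat /sys/fs/resctrl/p1/mon_data/mon_L3_00/mbm_local_bytes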

Thanks,
Vikas

>
> Thanks,
>
> tglx
>

2018-04-03 18:54:01

by Shivappa Vikas

[permalink] [raw]
Subject: Re: [PATCH 1/6] x86/intel_rdt/mba_sc: Add documentation for MBA software controller



On Tue, 3 Apr 2018, Thomas Gleixner wrote:

> On Tue, 3 Apr 2018, Thomas Gleixner wrote:
>> On Thu, 29 Mar 2018, Vikas Shivappa wrote:
>> You said above:
>>
>>> This may lead to confusion in scenarios below:
>>
>> Reading the blurb after that creates even more confusion than being
>> helpful.
>>
>> First of all this information should not be under the section 'Memory
>> bandwidth in MB/s'.
>>
>> Also please write bandwidth. The weird acronym b/w (band per width???) is
>> really not increasing legibility.
>>
>> What you really want is a general section about memory bandwidth allocation
>> where you explain the technical background in purely technical terms w/o
>> fairy tale mode. Technical descriptions have to be factual and not
>> 'could/may/would'.
>>
>> If I decode the above correctly then the current percentage based
>> implementation was buggy from the very beginning in several ways.
>>
>> Now the obvious question which is in no way answered by the cover letter is
>> why the current percentage based implementation cannot be fixed and we need
>> some feedback driven magic to achieve that. I assume you spent some brain
>> cycles on that question, so it would be really helpful if you shared that.
>>
>> If I understand it correctly then the problem is that the throttling
>> mechanism is per core and affects the L2 external bandwidth.
>>
>> Is this really per core? What about hyper threads. Both threads have that
>> MSR. How is that working?
>>
>> The L2 external bandwidth is higher than the L3 external bandwidth.
>>
>> Is there any information available from CPUID or whatever source which
>> allows us to retrieve the bandwidth ratio or the absolute maximum
>> bandwidth per level?
>>
>> What's also missing from your explanation is how that feedback loop behaves
>> under different workloads.
>>
>> Is this assuming that the involved threads/cpus actually try to utilize
>> the bandwidth completely?
>>
>> What happens if the threads/cpus are only using a small set because they
>> are idle or their computations are mostly cache local and do not need
>> external bandwidth? Looking at the implementation I don't see how that is
>> taken into account.
>
> Forgot to mention the following:
>
> The proposed new interface has no upper limit. The existing percentage
> based implementation has at least some notion of limit and scale; not
> really helpful either because of the hardware implementation.
>
> How is the poor admin supposed to configure that new thing without
> knowing what the actual hardware limits are in the first place?

That is true. The default values only set it to a very high bandwidth, which
means the user gets to use everything. There seems to be no way other than
caliberating to know the actual max bandwidth in bytes. That could be a better
default value, so the admin knows the limit. I will explore whether there is a
way to calculate it without caliberating.
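
For example, a freshly mounted filesystem with the mba_MB option would show
the (effectively unlimited) U32_MAX default, something like the below (only
the MB line shown):

$ cat /sys/fs/resctrl/schemata
MB:0=4294967295;1=4294967295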

Thanks,
Vikas

>
> Thanks,
>
> tglx
>

2018-04-03 18:55:38

by Shivappa Vikas

[permalink] [raw]
Subject: Re: [PATCH 3/6] x86/intel_rdt/mba_sc: Add initialization support



On Tue, 3 Apr 2018, Thomas Gleixner wrote:

> On Thu, 29 Mar 2018, Vikas Shivappa wrote:
>> +void setup_ctrlval(struct rdt_resource *r, u32 *dc, u32 *dm)
>> +{
>> + int i;
>> +
>> + /*
>> + * Initialize the Control MSRs to having no control.
>> + * For Cache Allocation: Set all bits in cbm
>> + * For Memory Allocation: Set b/w requested to 100%
>> + * and the b/w in MB to U32_MAX
>> + */
>> + for (i = 0; i < r->num_closid; i++, dc++, dm++) {
>> + *dc = r->membw.bw_byte ? MBA_BW_MAX_MB : r->default_ctrl;
>> + *dm = r->default_ctrl;
>
> No! Please stop duct taping your stuff into the existing code. So far the
> ctrl value was the same as the value which was actually written into the
> MSR. With your new mode you have to split that up into the user supplied
> value and the value which gets written into the MSR.
>
> So the right thing to do is to separate the user value and the MSR value
> first and independent of the mode. Then the new mode falls into place
> naturally because r->default_ctrl and r->default_msrval are set up at mount
> time with the values which correspond to the mount mode.

Will fix. I tried both, and this implementation assumes that what the user
modifies is the control value (because then schemata read and write are easy,
as the user does it directly), but I agree we can change that.
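
A minimal sketch of the suggested separation (invented names, just to
illustrate the direction; MBA_BW_MAX_MB is from the quoted patch):

        /*
         * Keep the user-visible control value and the value written
         * to the MSR separate; in mba_MB mode the user value is in
         * MBps while the MSR value is always a percentage.
         */
        for (i = 0; i < r->num_closid; i++) {
                ctrl_val[i] = mba_sc_enabled ? MBA_BW_MAX_MB : r->default_ctrl;
                msr_val[i]  = r->default_ctrl;
        }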

Thanks,
Vikas

>
> Thanks,
>
> tglx
>

2018-04-04 09:12:49

by Thomas Gleixner

[permalink] [raw]
Subject: Re: [PATCH 1/6] x86/intel_rdt/mba_sc: Add documentation for MBA software controller

On Tue, 3 Apr 2018, Shivappa Vikas wrote:
> On Tue, 3 Apr 2018, Thomas Gleixner wrote:
> > On Thu, 29 Mar 2018, Vikas Shivappa wrote:
> > The L2 external bandwidth is higher than the L3 external bandwidth.
> >
> > Is there any information available from CPUID or whatever source which
> > allows us to retrieve the bandwidth ratio or the absolute maximum
> > bandwidth per level?
>
> There is no information in CPUID on the bandwidth available. Also, we have seen
> from our experiments that the increase is not perfectly linear (the delta
> bandwidth increase from 30% to 40% may not be the same as from 70% to 80%). So
> we currently dynamically caliberate this delta for the software controller.

I assume you mean: calibrate

Though I don't see anything which looks remotely like calibration.
Calibration means that you determine the exact parameters by observation and
then can use the calibrated values afterwards. But that's not what you are
doing. So please don't claim it's calibration.

You observe behaviour which depends on the workload and other
factors. That's not calibration. If you change the MSR by a granularity
value then you calculate the bandwidth delta vs. the previous MSR
value. That only makes sense and works when the application has the
same memory access patterns across both observation periods.

And of course, this won't necessarily be linear because if you throttle the
application then it gets less work done per CPU time slice and the
resulting stalls will also have side effects on the requested amount of
memory and therefore distort the measurement. Ditto the other way
around.

There are too many factors influencing this, so claiming that it's
calibration is window dressing at best. Even worse, it suggests that it's
something accurate, which subverts your goal of reducing confusion.

Adaptive control might be an acceptable description, though given the
number of factors which play into it, it's still a euphemism for
'heuristic'.

> > What's also missing from your explanation is how that feedback loop behaves
> > under different workloads.
> >
> > Is this assuming that the involved threads/cpus actually try to utilize
> > the bandwidth completely?
>
> No, the feedback loop only guarantees that the usage will not exceed what the
> user specifies as the max bandwidth. If the group is using less than the max
> value, it does not matter how much less it is using.
> >
> > What happens if the threads/cpus are only using a small set because they
> > are idle or their computations are mostly cache local and do not need
> > external bandwidth? Looking at the implementation I don't see how that is
> > taken into account.
>
> The feedback only kicks into action if an rdtgroup uses more bandwidth than
> the max specified by the user. I specified that it will always "ensure the
> actual b/w < user b/w" and can add more explanation of these scenarios.

Please finally stop using this horrible 'b/w' thingy. It makes my eyes bleed
every time.

> Also note that we are using the MBM counters for this feedback loop, which
> makes the interface much more useful because the same rdtgroup is being
> monitored and controlled. (With perf MBM, the group of threads in the resctrl
> MBA group and in the MBM group could be different, and it would be hard to
> measure what the threads/cpus in the resctrl group are using.)

Why does that make me smile?

Thanks,

tglx

2018-04-04 09:32:14

by Thomas Gleixner

[permalink] [raw]
Subject: Re: [PATCH 1/6] x86/intel_rdt/mba_sc: Add documentation for MBA software controller

On Tue, 3 Apr 2018, Shivappa Vikas wrote:
> On Tue, 3 Apr 2018, Thomas Gleixner wrote:
> > The proposed new interface has no upper limit. The existing percentage
> > based implementation has at least some notion of limit and scale; not
> > really helpful either because of the hardware implementation.
> >
> > How is the poor admin supposed to configure that new thing without
> > knowing what the actual hardware limits are in the first place?
>
> That is true. The default values only set it to a very high bandwidth, which
> means the user gets to use everything. There seems to be no way other than
> caliberating to know the actual max bandwidth in bytes. That could be a better
> default value, so the admin knows the limit. I will explore whether there is a
> way to calculate it without caliberating.

Right, ideally we get that information from the hardware.

Thanks,

tglx





2018-04-04 19:00:52

by Shivappa Vikas

[permalink] [raw]
Subject: Re: [PATCH 1/6] x86/intel_rdt/mba_sc: Add documentation for MBA software controller



On Wed, 4 Apr 2018, Thomas Gleixner wrote:

> On Tue, 3 Apr 2018, Shivappa Vikas wrote:
> > On Tue, 3 Apr 2018, Thomas Gleixner wrote:
> > > On Thu, 29 Mar 2018, Vikas Shivappa wrote:
> > > The L2 external bandwidth is higher than the L3 external bandwidth.
> > >
> > > Is there any information available from CPUID or whatever source which
> > > allows us to retrieve the bandwidth ratio or the absolute maximum
> > > bandwidth per level?
> >
> > There is no information in CPUID on the bandwidth available. Also, we have seen
> > from our experiments that the increase is not perfectly linear (the delta
> > bandwidth increase from 30% to 40% may not be the same as from 70% to 80%). So
> > we currently dynamically caliberate this delta for the software controller.
>
> I assume you mean: calibrate
>
> Though I don't see anything which looks remotely like calibration.
> Calibration means that you determine the exact parameters by observation and
> then can use the calibrated values afterwards. But that's not what you are
> doing. So please don't claim it's calibration.
>
> You observe behaviour which depends on the workload and other
> factors. That's not calibration. If you change the MSR by a granularity
> value then you calculate the bandwidth delta vs. the previous MSR
> value. That only makes sense and works when the application has the
> same memory access patterns across both observation periods.
>
> And of course, this won't necessarily be linear because if you throttle the
> application then it gets less work done per CPU time slice and the
> resulting stalls will also have side effects on the requested amount of
> memory and therefore distort the measurement. Ditto the other way
> around.
>
> There are too many factors influencing this, so claiming that it's
> calibration is window dressing at best. Even worse, it suggests that it's
> something accurate, which subverts your goal of reducing confusion.
>
> Adaptive control might be an acceptable description, though given the
> number of factors which play into it, it's still a euphemism for
> 'heuristic'.

Agreed, we do not really calibrate, and the only thing we guarantee is that
the actual bandwidth in bytes < the user-specified bandwidth in bytes. This is
what the hardware guaranteed when we specified the values in percentage
as well, just that it was confusing.

>
> > > What's also missing from your explanation is how that feedback loop behaves
> > > under different workloads.
> > >
> > > Is this assuming that the involved threads/cpus actually try to utilize
> > > the bandwidth completely?
> >
> > No, the feedback loop only guarantees that the usage will not exceed what the
> > user specifies as the max bandwidth. If the group is using less than the max
> > value, it does not matter how much less it is using.
> > >
> > > What happens if the threads/cpus are only using a small set because they
> > > are idle or their computations are mostly cache local and do not need
> > > external bandwidth? Looking at the implementation I don't see how that is
> > > taken into account.
> >
> > The feedback only kicks into action if an rdtgroup uses more bandwidth than
> > the max specified by the user. I specified that it will always "ensure the
> > actual b/w < user b/w" and can add more explanation of these scenarios.
>
> Please finally stop using this horrible 'b/w' thingy. It makes my eyes bleed
> every time.

Will fix - this was text from the already existing documentation.

>
> > Also note that we are using the MBM counters for this feedback loop, which
> > makes the interface much more useful because the same rdtgroup is being
> > monitored and controlled. (With perf MBM, the group of threads in the resctrl
> > MBA group and in the MBM group could be different, and it would be hard to
> > measure what the threads/cpus in the resctrl group are using.)
>
> Why does that make me smile?

I know why :) Full credit to you, as you had suggested rewriting the
cqm/mbm code in resctrl, which is definitely very good in the long term!

Thanks,
Vikas

>
> Thanks,
>
> tglx
>