LinuxLists.cc - [PATCH 1/1] x86/cqm: Cqm requirements

2017-03-07 20:42:22

Subject: [PATCH 1/1] x86/cqm: Cqm requirements

Sending the CQM requirements as per Thomas' comments on the previous
version of the CQM patch series -
https://marc.info/?l=linux-kernel&m=148374167726502

This is a modified version of the requirements sent by Tony in the same
thread, with input from David and Stephane.

Reviewed-by: Tony Luck <[email protected]>
Reviewed-by: David Carrillo-Cisneros <[email protected]>
Reviewed-by: Yu Fenghua <[email protected]>
Reviewed-by: Stephane Eranian <[email protected]>
Signed-off-by: Vikas Shivappa <[email protected]>

General

1) Able to measure all RDT events (currently L3 occupancy,
Total B/W, Local B/W).
2) Report separate results per domain (currently L3 caches only).
3) Able to keep track of occupancy without requiring the creation of a
perf_event (avoid overhead of perf_events on ctxsw)

Thread Monitoring

4) Measure per thread, including kernel threads.
5) Put multiple threads into a single measurement group. No need for
cgroup-like hierarchy in measurement group.
6) Newly forked threads inherit parent's measurement group.

CAT/Allocation based monitoring

7) Able to monitor an existing resctrl CAT group (threads and cpus)
8) Can get measurements for subsets of tasks in a CAT group (to find the
threads hogging the resources).

Interface

9) Interface should be extensible to support changes of CLOSID in
measurement groups without affecting correctness of pre-existing measurement
groups. (See use case 5).
10) Interface should be extensible to work with perf's kernel API and be
compatible "as far as practical" with perf tool.
11) Interface should be extensible to support perf-like CPU filtering (i.e.
measure for all threads that run in a CPU, regardless of allocation group).

Use cases:
---------
1) Use RDT to size a job's cache footprint to drive CAT partition sizing.

2) Once a CAT partition is established, monitor the actual resource usage of
jobs inside that partition, find resource hogs, or resize the CAT partition.

Change a job's CAT partition without affecting monitoring. This is
useful when the allocation requirements of a task group change dynamically
and the number of distinct task groups is larger than the number of CLOSIDs.

3) Monitoring real time tasks. These may include tasks belonging to default
CAT group but run on a set of CPUs.
4) Monitor all allocations on a CPU.

5) When the number of desired allocation groups is less than number of available
CLOSIDs, jobs with different dynamic allocation needs may be combined into the
same allocation group. If dynamic allocation needs of one of the jobs change,
the job will change to a different allocation group. Monitoring of these jobs
should remain correct during allocation group moving.

2017-03-07 20:07:48

by Luck, Tony

Subject: RE: [PATCH 1/1] x86/cqm: Cqm requirements

> That's all nice and good, but I still have no coherent explanation why
> measuring across allocation domains makes sense.

Is this in reaction to this one?

>> 5) Put multiple threads into a single measurement group

If we fix it to say "threads from the same CAT group" does it fix things?

We'd like to have measurement groups use a single RMID ... if we
allowed tasks from different CAT groups in the same measurement
group we wouldn't be able to split the numbers back to report the
right overall total for each of the CAT groups.

-Tony

2017-03-07 20:12:13

by Thomas Gleixner

Subject: Re: [PATCH 1/1] x86/cqm: Cqm requirements

On Tue, 7 Mar 2017, Vikas Shivappa wrote:

> Sending the CQM requirements as per Thomas' comments on the previous
> version of the CQM patch series -
> https://marc.info/?l=linux-kernel&m=148374167726502
>
> This is a modified version of the requirements sent by Tony in the same
> thread, with input from David and Stephane.
>
> Reviewed-by: Tony Luck <[email protected]>
> Reviewed-by: David Carrillo-Cisneros <[email protected]>
> Reviewed-by: Yu Fenghua <[email protected]>
> Reviewed-by: Stephane Eranian <[email protected]>
> Signed-off-by: Vikas Shivappa <[email protected]>

That's all nice and good, but I still have no coherent explanation why
measuring across allocation domains makes sense.

Repeating this requirement w/o explanation does not get us anywhere.

For the record:

I can understand the use case for bandwidth, but that's a pretty trivial
act of summing it up in software, so there is no actual requirement to do
that with a separate RMID.

For cache occupancy it still does not make any sense unless there is some
magic voodoo I'm not aware of to decipher the meaning of those numbers.

Thanks,

tglx

2017-03-07 20:41:46

by Thomas Gleixner

Subject: RE: [PATCH 1/1] x86/cqm: Cqm requirements

On Tue, 7 Mar 2017, Luck, Tony wrote:

> > That's all nice and good, but I still have no coherent explanation why
> > measuring across allocation domains makes sense.
>
> Is this in reaction to this one?
>
> >> 5) Put multiple threads into a single measurement group
>
> If we fix it to say "threads from the same CAT group" does it fix things?
>
> We'd like to have measurement groups use a single RMID ... if we
> allowed tasks from different CAT groups in the same measurement
> group we wouldn't be able to split the numbers back to report the
> right overall total for each of the CAT groups.

Right. And the same applies to CPU measurements. If we have threads from 3
CAT groups running on a CPU then the aggregate value (except for bandwidth
which can be computed by software) is useless.

Thanks,

tglx

2017-03-07 23:29:31

by Stephane Eranian

Subject: Re: [PATCH 1/1] x86/cqm: Cqm requirements

Hi,

On Tue, Mar 7, 2017 at 12:04 PM, Luck, Tony <[email protected]> wrote:
>> That's all nice and good, but I still have no coherent explanation why
>> measuring across allocation domains makes sense.
>
> Is this in reaction to this one?
>
>>> 5) Put multiple threads into a single measurement group
>
> If we fix it to say "threads from the same CAT group" does it fix things?
>
Inside a CAT partition, there may be multiple tasks split into
different cgroups.
We need the ability to monitor groups of tasks individually within that CAT
partition. I think this is what this bullet is about.

> We'd like to have measurement groups use a single RMID ... if we
> allowed tasks from different CAT groups in the same measurement
> group we wouldn't be able to split the numbers back to report the
> right overall total for each of the CAT groups.
>
> -Tony

2017-03-08 00:10:58

by Shivappa Vikas

Subject: Re: [PATCH 1/1] x86/cqm: Cqm requirements

On Tue, 7 Mar 2017, Stephane Eranian wrote:

> Hi,
>
> On Tue, Mar 7, 2017 at 12:04 PM, Luck, Tony <[email protected]> wrote:
>>> That's all nice and good, but I still have no coherent explanation why
>>> measuring across allocation domains makes sense.
>>
>> Is this in reaction to this one?
>>
>>>> 5) Put multiple threads into a single measurement group
>>
>> If we fix it to say "threads from the same CAT group" does it fix things?
>>
> Inside a CAT partition, there may be multiple tasks split into
> different cgroups.
> We need the ability to monitor groups of tasks individually within that CAT
> partition. I think this is what this bullet is about.
>

I think #8 covers that (or at least that is what we intended for 5)?

8) Can get measurements for subsets of tasks in a CAT group (to find the
threads hogging the resources).

Thanks,
Vikas

>
>> We'd like to have measurement groups use a single RMID ... if we
>> allowed tasks from different CAT groups in the same measurement
>> group we wouldn't be able to split the numbers back to report the
>> right overall total for each of the CAT groups.
>>
>> -Tony
>

2017-03-08 09:42:36

by Thomas Gleixner

Subject: Re: [PATCH 1/1] x86/cqm: Cqm requirements

Stephane,

On Tue, 7 Mar 2017, Stephane Eranian wrote:
> On Tue, Mar 7, 2017 at 12:04 PM, Luck, Tony <[email protected]> wrote:
> >> That's all nice and good, but I still have no coherent explanation why
> >> measuring across allocation domains makes sense.
> >
> > Is this in reaction to this one?
> >
> >>> 5) Put multiple threads into a single measurement group
> >
> > If we fix it to say "threads from the same CAT group" does it fix things?
> >
> Inside a CAT partition, there may be multiple tasks split into different
> cgroups. We need the ability to monitor groups of tasks individually
> within that CAT partition. I think this is what this bullet is about.

I completely understand that. That's fine and I never debated that one, but
the requirements list is too vague about what you want to measure.

> >>> 5) Put multiple threads into a single measurement group

That can be:

A) threads within a CAT group

B) threads which belong to different CAT groups

A) is fine. B) does not make any sense to me

Same applies for per CPU measurements.

Thanks,

tglx

2017-03-08 17:58:33

by David Carrillo-Cisneros

Subject: Re: [PATCH 1/1] x86/cqm: Cqm requirements

On Wed, Mar 8, 2017 at 12:30 AM, Thomas Gleixner <[email protected]> wrote:
> Stephane,
>
> On Tue, 7 Mar 2017, Stephane Eranian wrote:
>> On Tue, Mar 7, 2017 at 12:04 PM, Luck, Tony <[email protected]> wrote:
>> >> That's all nice and good, but I still have no coherent explanation why
>> >> measuring across allocation domains makes sense.
>> >
>> > Is this in reaction to this one?
>> >
>> >>> 5) Put multiple threads into a single measurement group
>> >
>> > If we fix it to say "threads from the same CAT group" does it fix things?
>> >
>> Inside a CAT partition, there may be multiple tasks split into different
>> cgroups. We need the ability to monitor groups of tasks individually
>> within that CAT partition. I think this is what this bullet is about.
>
> I completely understand that. That's fine and I never debated that one, but
> the requirements list is too vague about what you want to measure.
>
>> >>> 5) Put multiple threads into a single measurement group
>
> That can be:
>
> A) threads within a CAT group
>
> B) threads which belong to different CAT groups
>
> A) is fine. B) does not make any sense to me

It's A). As Tony suggested in a previous email, we can rephrase it to:

5) Put a subset of threads from the same CAT group into a single
measurement group.
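
As a minimal sketch of that rephrased requirement (not taken from any posted
patch): each measurement group owns one RMID and is tied to exactly one CAT
group, so per-RMID readings can be summed back into a per-CAT-group total.
The class names, the add_task() check and the readings dictionary are
hypothetical and only for illustration. The total is meaningful only if every
task of the CAT group is covered by exactly one measurement group.

```python
from dataclasses import dataclass, field


@dataclass
class CatGroup:
    """Hypothetical allocation (CAT) group, identified by its CLOSID."""
    closid: int
    tasks: set = field(default_factory=set)


@dataclass
class MeasurementGroup:
    """Hypothetical measurement group: one RMID, tied to exactly one CAT group."""
    rmid: int
    parent: CatGroup
    tasks: set = field(default_factory=set)

    def add_task(self, tid: int) -> None:
        # Case A) from the thread: only threads of the parent CAT group are allowed.
        if tid not in self.parent.tasks:
            raise ValueError(f"task {tid} is not in CAT group {self.parent.closid}")
        self.tasks.add(tid)


def cat_group_total(groups, readings, closid):
    """Sum the per-RMID readings of all measurement groups under one CAT group."""
    return sum(readings[g.rmid] for g in groups if g.parent.closid == closid)
```

Mixing threads from different CAT groups (case B) is rejected up front here,
which is what keeps the per-CAT-group sum well defined.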

>
> Same applies for per CPU measurements.

For CPU measurements, we need perf-like CPU filtering to support tools
that perform low overhead monitoring by polling CPU events. These
tools approximate per-cgroup/task events by reconciling CPU events
with logs of which job ran when on which CPU.

>
> Thanks,
>
> tglx

2017-03-09 11:02:40

by Thomas Gleixner

Subject: Re: [PATCH 1/1] x86/cqm: Cqm requirements

On Wed, 8 Mar 2017, David Carrillo-Cisneros wrote:
> On Wed, Mar 8, 2017 at 12:30 AM, Thomas Gleixner <[email protected]> wrote:
> > Same applies for per CPU measurements.
>
> For CPU measurements, we need perf-like CPU filtering to support tools
> that perform low overhead monitoring by polling CPU events. These
> tools approximate per-cgroup/task events by reconciling CPU events
> with logs of which job ran when on which CPU.

Sorry, but for CQM that's just voodoo analysis. Let's look at an example:

CPU default is CAT group 0 (20% of cache)
T1 belongs to CAT group 1 (40% of cache)
T2 belongs to CAT group 2 (40% of cache)

Now you do low overhead samples of the CPU (all groups accounted) with 1
second period.

Let's assume that T1 runs 50% and T2 runs 20% of the time; the rest of the
time is utilized by random other things and the kernel itself (using CAT
group 0).

What is the accumulated value telling you?

How do you approximate that back to T1/T2 and the rest?

How do you do that when the tasks are switching between the samples several
times?

I really have no idea how that should work and what the value of this would
be.

Thanks,

tglx

2017-03-09 18:05:40

by David Carrillo-Cisneros

Subject: Re: [PATCH 1/1] x86/cqm: Cqm requirements

On Thu, Mar 9, 2017 at 3:01 AM, Thomas Gleixner <[email protected]> wrote:
> On Wed, 8 Mar 2017, David Carrillo-Cisneros wrote:
>> On Wed, Mar 8, 2017 at 12:30 AM, Thomas Gleixner <[email protected]> wrote:
>> > Same applies for per CPU measurements.
>>
>> For CPU measurements, we need perf-like CPU filtering to support tools
>> that perform low overhead monitoring by polling CPU events. These
>> tools approximate per-cgroup/task events by reconciling CPU events
>> with logs of which job ran when on which CPU.
>
> Sorry, but for CQM that's just voodoo analysis.

I'd dispute that. Still, perf-like CPU filtering is also needed for MBM, a
less contentious scenario, I believe.

>
> CPU default is CAT group 0 (20% of cache)
> T1 belongs to CAT group 1 (40% of cache)
> T2 belongs to CAT group 2 (40% of cache)
>
> Now you do low overhead samples of the CPU (all groups accounted) with 1
> second period.
>
> Let's assume that T1 runs 50% and T2 runs 20% of the time; the rest of the
> time is utilized by random other things and the kernel itself (using CAT
> group 0).
>
> What is the accumulated value telling you?

In this single example not much, only the sum of occupancies. But
assume I have T1...T10000 different jobs, and I randomly select a pair
of those jobs to run together in a machine, (they become the T1 and T2
in your example). Then I repeat that hundreds of thousands of times.

I can collect all data with (tasks run, time run, occupancy) and build
a simple regression to estimate the expected occupancy (and some
confidence interval). That inaccurate but approximate value is very
useful to feed into a job scheduler. Furthermore, it can be correlated
with values of other events that are currently sampled this way.
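
A minimal sketch of such a regression, under simplifying assumptions that are
not from this thread: every sample is one co-run of two distinct tasks, the
measured aggregate is modelled as occ[a] + occ[b], and run time is ignored.
The function name and the Gauss-Seidel solver are illustrative choices only.
A real pipeline would also weight by run time and produce confidence
intervals.

```python
from collections import defaultdict


def estimate_occupancy(samples, n_tasks, sweeps=200):
    """Least-squares fit of aggregate ~= occ[a] + occ[b] over co-run samples.

    samples: iterable of (task_a, task_b, aggregate_occupancy), task_a != task_b.
    Solves the normal equations (X^T X) occ = X^T y with Gauss-Seidel sweeps.
    """
    diag = [0.0] * n_tasks        # number of samples each task appears in
    rhs = [0.0] * n_tasks         # sum of the aggregates seen by each task
    pair = defaultdict(float)     # how often two tasks ran together
    for a, b, aggregate in samples:
        for task in (a, b):
            diag[task] += 1.0
            rhs[task] += aggregate
        pair[(a, b)] += 1.0
        pair[(b, a)] += 1.0

    occ = [0.0] * n_tasks
    for _ in range(sweeps):
        for i in range(n_tasks):
            if diag[i] == 0.0:
                continue          # task never sampled: no estimate possible
            coupled = sum(count * occ[j]
                          for (k, j), count in pair.items() if k == i)
            occ[i] = (rhs[i] - coupled) / diag[i]
    return occ
```

With noisy data the estimate only becomes useful once each task has been
observed alongside many different partners, which is the large-sample
argument made above.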

>
> How do you approximate that back to T1/T2 and the rest?

Described above for large numbers and random samples. More
sophisticated (voodoo?) statistical techniques are employed in practice
to account for almost all issues I could think of (selection bias,
missing values, interaction between tasks, etc). They seem to work
fine.

>
> How do you do that when the tasks are switching between the samples several
> times?

It does not work well for a single run (your example). But for the
example I gave, one can just rely on Random Sampling, Law of Large
Numbers, and Central Limit Theorem.

Thanks,
David

2017-03-10 14:53:44

by Thomas Gleixner

Subject: Re: [PATCH 1/1] x86/cqm: Cqm requirements

On Thu, 9 Mar 2017, David Carrillo-Cisneros wrote:
> On Thu, Mar 9, 2017 at 3:01 AM, Thomas Gleixner <[email protected]> wrote:
> > On Wed, 8 Mar 2017, David Carrillo-Cisneros wrote:
> >> On Wed, Mar 8, 2017 at 12:30 AM, Thomas Gleixner <[email protected]> wrote:
> >> > Same applies for per CPU measurements.
> >>
> >> For CPU measurements, we need perf-like CPU filtering to support tools
> >> that perform low overhead monitoring by polling CPU events. These
> >> tools approximate per-cgroup/task events by reconciling CPU events
> >> with logs of which job ran when on which CPU.
> >
> > Sorry, but for CQM that's just voodoo analysis.
>
> I'd dispute that. Still, perf-like CPU filtering is also needed for MBM, a
> less contentious scenario, I believe.

MBM is a different playground (albeit related due to the RMID stuff).

> It does not work well for a single run (your example). But for the
> example I gave, one can just rely on Random Sampling, Law of Large
> Numbers, and Central Limit Theorem.

Fine. So we need this for ONE particular use case. And if that is not well
documented including the underlying mechanics to analyze the data then this
will be a nice source of confusion for Joe User.

I still think that this can be done differently while keeping the overhead
small.

You look at this from the existing perf mechanics which require high
overhead context switching machinery. But that's just wrong because that's
not how the cache and bandwidth monitoring works.

Contrary to the other perf counters, CQM and MBM are based on a context
selectable set of counters which do not require readout and reconfiguration
when the switch happens.

Especially with CAT in play, the context switch overhead is there already
when CAT partitions need to be switched. So switching the RMID at the same
time is basically free, if we are smart enough to do an equivalent to the
CLOSID context switch mechanics and ideally combine both into a single MSR
write.
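
A minimal sketch of the combined update, assuming the IA32_PQR_ASSOC layout
documented in the Intel SDM (MSR 0xC8F, RMID in the low half, CLOSID in the
high half). The Cpu class and the write_msr callback are hypothetical
stand-ins for per-CPU state and a real WRMSR. This is not the kernel's
implementation.

```python
IA32_PQR_ASSOC = 0xC8F  # MSR holding the active RMID and CLOSID of a logical CPU


def pqr_assoc_value(rmid: int, closid: int) -> int:
    """Pack RMID into the low half and CLOSID into the high half of the MSR."""
    if not (0 <= rmid < (1 << 32) and 0 <= closid < (1 << 32)):
        raise ValueError("rmid and closid must each fit in 32 bits")
    return (closid << 32) | rmid


class Cpu:
    """Hypothetical per-CPU state that caches the last value written."""

    def __init__(self, write_msr):
        self.write_msr = write_msr  # callback standing in for a real WRMSR
        self.cached = None

    def sched_in(self, rmid: int, closid: int) -> None:
        value = pqr_assoc_value(rmid, closid)
        if value != self.cached:    # skip the write when nothing changed
            self.write_msr(IA32_PQR_ASSOC, value)
            self.cached = value
```

Because both IDs live in the same register, switching the RMID alongside the
CLOSID costs at most one write per context switch, and none when the incoming
task uses the same pair.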

With that the low overhead periodic sampling can read N counters which are
related to the monitored set and provide N separate results. For bandwidth
the aggregation is a simple ADD and for cache residency it's pointless.
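
A minimal sketch of that sampling step, with hypothetical names throughout:
read_counters stands in for whatever reads one RMID's counters and is assumed
to return an (occupancy, bandwidth) pair. The N results stay separate and
only bandwidth is ever summed.

```python
def sample_groups(rmids, read_counters):
    """Return one (occupancy_bytes, bandwidth_bytes) result per monitored RMID."""
    return {rmid: read_counters(rmid) for rmid in rmids}


def total_bandwidth(results):
    """Bandwidth aggregates with a plain sum over the N separate results."""
    return sum(bandwidth for _occupancy, bandwidth in results.values())


# Usage with made-up readings for two RMIDs.
fake_counters = {1: (4096, 100), 2: (8192, 250)}
results = sample_groups([1, 2], fake_counters.__getitem__)
combined = total_bandwidth(results)
```

No matching helper is provided for occupancy on purpose: the per-RMID
occupancy values are reported individually.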

Just because perf was designed with the regular performance counters in
mind (way before that CQM/MBM stuff came around) does not mean that we
cannot change/extend that if it makes sense.

And looking at the way Cache/Bandwidth allocation and monitoring works, it
makes a lot of sense. Definitely more than shoving it into the current modus
operandi with duct tape just because we can.

Thanks,

tglx

2017-03-11 01:53:35

by David Carrillo-Cisneros

Subject: Re: [PATCH 1/1] x86/cqm: Cqm requirements

>
> Fine. So we need this for ONE particular use case. And if that is not well
> documented including the underlying mechanics to analyze the data then this
> will be a nice source of confusion for Joe User.
>
> I still think that this can be done differently while keeping the overhead
> small.
>
> You look at this from the existing perf mechanics which require high
> overhead context switching machinery. But that's just wrong because that's
> not how the cache and bandwidth monitoring works.
>
> Contrary to the other perf counters, CQM and MBM are based on a context
> selectable set of counters which do not require readout and reconfiguration
> when the switch happens.
>
> Especially with CAT in play, the context switch overhead is there already
> when CAT partitions need to be switched. So switching the RMID at the same
> time is basically free, if we are smart enough to do an equivalent to the
> CLOSID context switch mechanics and ideally combine both into a single MSR
> write.
>
> With that the low overhead periodic sampling can read N counters which are
> related to the monitored set and provide N separate results. For bandwidth
> the aggregation is a simple ADD and for cache residency it's pointless.
>
> Just because perf was designed with the regular performance counters in
> mind (way before that CQM/MBM stuff came around) does not mean that we
> cannot change/extend that if it makes sense.
>
> And looking at the way Cache/Bandwidth allocation and monitoring works, it
> makes a lot of sense. Definitely more than shoving it into the current modus
> operandi with duct tape just because we can.
>

You made a point. The use case I described can be better served with
the low overhead monitoring groups that Fenghua is working on. Then
that info can be merged with the per-CPU profile collected for non-RDT
events.

I am ok removing the perf-like CPU filtering from the requirements.

Thanks,
David

2017-03-13 19:10:50

by Thomas Gleixner

Subject: Re: [PATCH 1/1] x86/cqm: Cqm requirements

On Fri, 10 Mar 2017, David Carrillo-Cisneros wrote:
> > Fine. So we need this for ONE particular use case. And if that is not well
> > documented including the underlying mechanics to analyze the data then this
> > will be a nice source of confusion for Joe User.
> >
> > I still think that this can be done differently while keeping the overhead
> > small.
> >
> > You look at this from the existing perf mechanics which require high
> > overhead context switching machinery. But that's just wrong because that's
> > not how the cache and bandwidth monitoring works.
> >
> > Contrary to the other perf counters, CQM and MBM are based on a context
> > selectable set of counters which do not require readout and reconfiguration
> > when the switch happens.
> >
> > Especially with CAT in play, the context switch overhead is there already
> > when CAT partitions need to be switched. So switching the RMID at the same
> > time is basically free, if we are smart enough to do an equivalent to the
> > CLOSID context switch mechanics and ideally combine both into a single MSR
> > write.
> >
> > With that the low overhead periodic sampling can read N counters which are
> > related to the monitored set and provide N separate results. For bandwidth
> > the aggregation is a simple ADD and for cache residency it's pointless.
> >
> > Just because perf was designed with the regular performance counters in
> > mind (way before that CQM/MBM stuff came around) does not mean that we
> > cannot change/extend that if it makes sense.
> >
> > And looking at the way Cache/Bandwidth allocation and monitoring works, it
> > makes a lot of sense. Definitely more than shoving it into the current modus
> > operandi with duct tape just because we can.
> >
>
> You made a point. The use case I described can be better served with
> the low overhead monitoring groups that Fenghua is working on. Then
> that info can be merged with the per-CPU profile collected for non-RDT
> events.
>
> I am ok removing the perf-like CPU filtering from the requirements.

So if I'm not missing something then ALL remaining requirements can be
solved with the RDT integrated monitoring mechanics, right?

Thanks,

tglx

2017-03-13 19:58:33

by David Carrillo-Cisneros

Subject: Re: [PATCH 1/1] x86/cqm: Cqm requirements

>> I am ok removing the perf-like CPU filtering from the requirements.
>
> So if I'm not missing something then ALL remaining requirements can be
> solved with the RDT integrated monitoring mechanics, right?
>

Right.

2017-03-13 20:21:37

by Thomas Gleixner

Subject: Re: [PATCH 1/1] x86/cqm: Cqm requirements

On Mon, 13 Mar 2017, David Carrillo-Cisneros wrote:

> >> I am ok removing the perf-like CPU filtering from the requirements.
> >
> > So if I'm not missing something then ALL remaining requirements can be
> > solved with the RDT integrated monitoring mechanics, right?
> >
>
> Right.

Excellent.