2018-01-14 22:54:35

by Thomas Gleixner

Subject: Re: [RFC PATCH 00/20] Intel(R) Resource Director Technology Cache Pseudo-Locking enabling

On Fri, 17 Nov 2017, Reinette Chatre wrote:

Sorry for the delay. You know why :)

> On 11/17/2017 4:48 PM, Thomas Gleixner wrote:
> > On Mon, 13 Nov 2017, Reinette Chatre wrote:
> > Did you compare that against the good old cache coloring mechanism,
> > e.g. palloc ?
>

> I understand where your question originates. I have not compared against
> PALLOC for two reasons:
>
> 1) PALLOC is not upstream and while inquiring about the status of this
> work (please see https://github.com/heechul/palloc/issues/4 for details)
> we learned that one reason for this is that recent Intel processors are
> not well supported.

So if I understand Heechul correctly then recent CPUs cannot be supported
easily due to changes in the memory controllers and the cache. I assume the
latter is related to CAT.

> 2) The most recent kernel supported by PALLOC is v4.4 and also mentioned
> in the above link there is currently no plan to upstream this work for a
> less divergent comparison of PALLOC and the more recent RDT/CAT enabling
> on which Cache Pseudo-Locking is built.

Well, that's not a really good excuse for not trying. You at Intel should
be able to get to the parameters easy enough :)

> >> The cache pseudo-locking approach relies on generation-specific behavior
> >> of processors. It may provide benefits on certain processor generations,
> >> but is not guaranteed to be supported in the future.
> >
> > Hmm, are you saying that the CAT mechanism might change radically in the
> > future so that access to cached data in an allocated area which does not
> > belong to the current executing context won't work anymore?
>
> Most devices that publicly support CAT in the Linux mainline can take
> advantage of Cache Pseudo-Locking. However, Cache Pseudo-Locking is a
> model-specific feature so there may be some variation in if, or to what
> extent, current and future devices can support Cache Pseudo-Locking. CAT
> remains architectural.

Sure, but that does NOT answer my question at all.

> >> It is not a guarantee that data will remain in the cache. It is not a
> >> guarantee that data will remain in certain levels or certain regions of
> >> the cache. Rather, cache pseudo-locking increases the probability that
> >> data will remain in a certain level of the cache via carefully
> >> configuring the CAT feature and carefully controlling application
> >> behavior.
> >
> > Which kind of applications are you targeting with that?
> >
> > Are there real world use cases which actually can benefit from this and
>
> To ensure I answer your question I will consider two views. First, the
>"carefully controlling application behavior" referred to above refers to
> applications/OS/VMs running after the pseudo-locked regions have been set
> up. These applications should take care to not do anything, for example
> call wbinvd, that would affect the Cache Pseudo-Locked regions. Second,
> what you are also asking about is the applications using these Cache
> Pseudo-Locked regions. We do see a clear performance benefit to
> applications using these pseudo-locked regions. Latency sensitive
> applications could relocate their code as well as data to pseudo-locked
> regions for improved performance.

This is again a marketing pitch and not answering my question about real
world use cases.

> > what are those applications supposed to do once the feature breaks with
> > future generations of processors?
>
> This feature is model specific with a few platforms supporting it at this
> time. Only platforms known to support Cache Pseudo-Locking will expose
> its resctrl interface.

And you deliberately avoided answering my question again.

Thanks,

tglx


2018-01-15 16:23:31

by Hindman, Gavin

Subject: RE: [RFC PATCH 00/20] Intel(R) Resource Director Technology Cache Pseudo-Locking enabling

Thanks for the feedback, Thomas.

> -----Original Message-----
> From: [email protected] [mailto:linux-kernel-
> [email protected]] On Behalf Of Thomas Gleixner
> Sent: Sunday, January 14, 2018 2:54 PM
> To: Chatre, Reinette <[email protected]>
> Cc: Yu, Fenghua <[email protected]>; Luck, Tony
> <[email protected]>; [email protected]; Hansen, Dave
> <[email protected]>; [email protected]; [email protected];
> [email protected]; [email protected]
> Subject: Re: [RFC PATCH 00/20] Intel(R) Resource Director Technology Cache
> Pseudo-Locking enabling
>
> On Fri, 17 Nov 2017, Reinette Chatre wrote:
>
> Sorry for the delay. You know why :)
>
> > On 11/17/2017 4:48 PM, Thomas Gleixner wrote:
> > > On Mon, 13 Nov 2017, Reinette Chatre wrote:
> > > Did you compare that against the good old cache coloring mechanism,
> > > e.g. palloc ?
> >
>
> > I understand where your question originates. I have not compared
> > against PALLOC for two reasons:
> >
> > 1) PALLOC is not upstream and while inquiring about the status of this
> > work (please see https://github.com/heechul/palloc/issues/4 for
> > details) we learned that one reason for this is that recent Intel
> > processors are not well supported.
>
> So if I understand Heechul correctly then recent CPUs cannot be supported
> easily due to changes in the memory controllers and the cache. I assume the
> latter is related to CAT.
>
> > 2) The most recent kernel supported by PALLOC is v4.4 and also
> > mentioned in the above link there is currently no plan to upstream
> > this work for a less divergent comparison of PALLOC and the more
> > recent RDT/CAT enabling on which Cache Pseudo-Locking is built.
>
> Well, that's not a really good excuse for not trying. You at Intel should be able
> to get to the parameters easy enough :)
>
We can run the comparison, but I'm not sure that I understand the intent - my understanding of Palloc is that it's intended to allow allocation of memory to specific physical memory banks. While that might result in reduced cache-misses since processes are more separated, it's not explicitly intended to reduce cache-misses, and Palloc's benefits would only hold as long as you have few enough processes to be able to dedicate/isolate memory accordingly. Am I misunderstanding the intent/usage of palloc?

> > >> The cache pseudo-locking approach relies on generation-specific
> > >> behavior of processors. It may provide benefits on certain
> > >> processor generations, but is not guaranteed to be supported in the
> future.
> > >
> > > Hmm, are you saying that the CAT mechanism might change radically in
> > > the future so that access to cached data in an allocated area which
> > > does not belong to the current executing context won't work anymore?
> >

No, I don't see any scenario in which devices that currently support pseudo-locking would stop working, but until support is architectural, support in a current generation of a product line doesn't imply support in a future generation. Certainly we'll make every effort to carry support forward, and would adjust to any changes in CAT support, but we can't account for unforeseen future architectural changes that might block pseudo-locking use-cases on top of CAT.

> > Most devices that publicly support CAT in the Linux mainline can take
> > advantage of Cache Pseudo-Locking. However, Cache Pseudo-Locking is a
> > model-specific feature so there may be some variation in if, or to
> > what extent, current and future devices can support Cache
> > Pseudo-Locking. CAT remains architectural.
>
> Sure, but that does NOT answer my question at all.
>
> > >> It is not a guarantee that data will remain in the cache. It is not
> > >> a guarantee that data will remain in certain levels or certain
> > >> regions of the cache. Rather, cache pseudo-locking increases the
> > >> probability that data will remain in a certain level of the cache
> > >> via carefully configuring the CAT feature and carefully controlling
> > >> application behavior.
> > >
> > > Which kind of applications are you targeting with that?
> > >
> > > Are there real world use cases which actually can benefit from this
> > > and
> >
> > To ensure I answer your question I will consider two views. First, the
> >"carefully controlling application behavior" referred to above refers
> >to applications/OS/VMs running after the pseudo-locked regions have
> >been set up. These applications should take care to not do anything,
> >for example call wbinvd, that would affect the Cache Pseudo-Locked
> >regions. Second, what you are also asking about is the applications
> >using these Cache Pseudo-Locked regions. We do see a clear performance
> >benefit to applications using these pseudo-locked regions. Latency
> >sensitive applications could relocate their code as well as data to
> >pseudo-locked regions for improved performance.
>
> This is again a marketing pitch and not answering my question about real
> world use cases.
>
There are a number of real-world use-cases that are already making use of hacked-up ad-hoc versions of pseudo-locking - this corner case has been available in hardware for some time - and this patch-set is intended to bring it into the mainstream and make it more supportable. Primary usages right now are industrial PLCs/automation and high-frequency trading/financial enterprise systems, but anything with relatively small repeating data structures should see a benefit.

> > > what are those applications supposed to do once the feature breaks
> > > with future generations of processors?
> >
> > This feature is model specific with a few platforms supporting it at
> > this time. Only platforms known to support Cache Pseudo-Locking will
> > expose its resctrl interface.
>
> And you deliberately avoided answering my question again.
>
Reinette's not trying to avoid the questions; we just don't necessarily have definitive answers at this time. Currently pseudo-locking requires manual setup on the part of the integrator, so there will not be any invisible breakage when trying to port software expecting pseudo-locking to new devices, and we'll certainly do everything we can to minimize user-space/configuration impact on migration if things change going forward, but these are unknowns. We are in a bit of chicken/egg where people aren't broadly using it because it's not architectural, and it's not architectural because people aren't broadly using it. We could publicly carry the patches out of mainline, but our intent for pushing the patches to mainline is to a) increase exposure/usage, b) reduce divergence across people already using hacked versions, and c) ease the overhead of keeping patches in sync with the larger CAT infrastructure as it evolves - we are clear on the potential support burden being incurred by submitting a non-architectural feature, and there's certainly no intent to dump a science-experiment into mainline.

> Thanks,
>
> tglx

Thanks,
Gavin

2018-01-16 11:38:31

by Thomas Gleixner

Subject: RE: [RFC PATCH 00/20] Intel(R) Resource Director Technology Cache Pseudo-Locking enabling

On Mon, 15 Jan 2018, Hindman, Gavin wrote:
> > From: [email protected] [mailto:linux-kernel-
> > [email protected]] On Behalf Of Thomas Gleixner
> > On Fri, 17 Nov 2017, Reinette Chatre wrote:
> > >
> > > 1) PALLOC is not upstream and while inquiring about the status of this
> > > work (please see https://github.com/heechul/palloc/issues/4 for
> > > details) we learned that one reason for this is that recent Intel
> > > processors are not well supported.
> >
> > So if I understand Heechul correctly then recent CPUs cannot be supported
> > easily due to changes in the memory controllers and the cache. I assume the
> > latter is related to CAT.

Is that assumption correct?

> > > 2) The most recent kernel supported by PALLOC is v4.4 and also
> > > mentioned in the above link there is currently no plan to upstream
> > > this work for a less divergent comparison of PALLOC and the more
> > > recent RDT/CAT enabling on which Cache Pseudo-Locking is built.
> >
> > Well, that's not a really good excuse for not trying. You at Intel should be able
> > to get to the parameters easy enough :)
> >
> We can run the comparison, but I'm not sure that I understand the intent
> - my understanding of Palloc is that it's intended to allow allocation of
> memory to specific physical memory banks. While that might result in
> reduced cache-misses since processes are more separated, it's not
> explicitly intended to reduce cache-misses, and Palloc's benefits would
> only hold as long as you have few enough processes to be able to
> dedicate/isolate memory accordingly. Am I misunderstanding the
> intent/usage of palloc?

Right. It comes with its own set of restrictions as does the pseudo-locking.

> > > >> The cache pseudo-locking approach relies on generation-specific
> > > >> behavior of processors. It may provide benefits on certain
> > > >> processor generations, but is not guaranteed to be supported in the
> > future.
> > > >
> > > > Hmm, are you saying that the CAT mechanism might change radically in
> > > > the future so that access to cached data in an allocated area which
> > > > does not belong to the current executing context won't work anymore?
> > >
>
> No, I don't see any scenario in which devices that currently support
> pseudo-locking would stop working, but until support is architectural,
> support in a current generation of a product line doesn't imply support
> in a future generation. Certainly we'll make every effort to carry
> support forward, and would adjust to any changes in CAT support, but we
> can't account for unforeseen future architectural changes that might
> block pseudo-locking use-cases on top of CAT.

And that's the real problem. We add something which gives us some form
of isolation, but we don't know whether it will still work on
next-generation CPUs. From a maintainability and usefulness POV that's
not a really great prospect.

> > This is again a marketing pitch and not answering my question about real
> > world use cases.
> >
> There are a number of real-world use-cases that are already making use of
> hacked-up ad-hoc versions of pseudo-locking - this corner case has been
> available in hardware for some time - and this patch-set is intended to
> bring it into the mainstream and make it more supportable. Primary usages
> right now are industrial PLCs/automation and high-frequency
> trading/financial enterprise systems, but anything with relatively small
> repeating data structures should see a benefit.

Ok,

> > > > what are those applications supposed to do once the feature breaks
> > > > with future generations of processors?
> > >
> > > This feature is model specific with a few platforms supporting it at
> > > this time. Only platforms known to support Cache Pseudo-Locking will
> > > expose its resctrl interface.
> >
> > And you deliberately avoided answering my question again.
> >
> Reinette's not trying to avoid the questions; we just don't necessarily
> have definitive answers at this time. Currently pseudo-locking requires
> manual setup on the part of the integrator, so there will not be any
> invisible breakage when trying to port software expecting pseudo-locking
> to new devices, and we'll certainly do everything we can to minimize
> user-space/configuration impact on migration if things change going
> forward, but these are unknowns. We are in a bit of chicken/egg where
> people aren't broadly using it because it's not architectural, and it's
> not architectural because people aren't broadly using it. We could
> publicly carry the patches out of mainline, but our intent for pushing
> the patches to mainline is to a) increase exposure/usage, b) reduce
> divergence across people already using hacked versions, and c) ease the
> overhead of keeping patches in sync with the larger CAT infrastructure as it
> evolves - we are clear on the potential support burden being incurred by
> submitting a non-architectural feature, and there's certainly no intent
> to dump a science-experiment into mainline.

Ok. So what you are saying is that 'official' support should broaden the
user base which in turn might push it into the architectural realm.

I'll go through the patch set with this in mind.

Thanks,

tglx

2018-01-17 00:53:08

by Reinette Chatre

Subject: Re: [RFC PATCH 00/20] Intel(R) Resource Director Technology Cache Pseudo-Locking enabling

Hi Thomas,

On 1/16/2018 3:38 AM, Thomas Gleixner wrote:
> On Mon, 15 Jan 2018, Hindman, Gavin wrote:
>>> From: [email protected] [mailto:linux-kernel-
>>> [email protected]] On Behalf Of Thomas Gleixner
>>> On Fri, 17 Nov 2017, Reinette Chatre wrote:
>>>>
>>>> 1) PALLOC is not upstream and while inquiring about the status of this
>>>> work (please see https://github.com/heechul/palloc/issues/4 for
>>>> details) we learned that one reason for this is that recent Intel
>>>> processors are not well supported.
>>>
>>> So if I understand Heechul correctly then recent CPUs cannot be supported
>>> easily due to changes in the memory controllers and the cache. I assume the
>>> latter is related to CAT.
>
> Is that assumption correct?

From what I understand, to be able to allocate memory from a specific
DRAM bank or cache set, PALLOC requires knowing exactly which DRAM bank
or cache set a physical address maps to. The PALLOC implementation
relies on user space code that times a variety of memory accesses to
guess which physical address bits determine DRAM bank or cache set
placement. These bits are then provided to the kernel implementation as
the page coloring input.

The comments at https://github.com/heechul/palloc/issues/4 point out
that it is this user space guessing of the physical address to DRAM
bank and cache set mapping that is harder on recent Intel processors.
This is not related to CAT. CAT could be used to limit the number of
ways to which the contents of a physical address can be allocated, but
it does not modify the set to which the physical address maps.
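
To illustrate that distinction, below is a minimal user-space sketch
(not from PALLOC or from this patch series; the geometry constants are
assumptions matching the 1MB, 16-way, 64-byte-line L2 of the test
platform used for the measurements in this thread). The set index is a
pure function of the physical address, which is what page coloring
exploits and what a CAT way mask leaves untouched:

/*
 * Illustration only: set selection vs. way allocation.
 * Assumed geometry: 1MB L2, 16 ways, 64-byte lines => 1024 sets,
 * 4KB pages => set-index bits 12..15 are the only set-index bits
 * above the page offset (the bits that overlap between page and
 * cache set addressing).
 */
#include <stdio.h>
#include <stdint.h>

#define LINE_SIZE 64u
#define NUM_SETS  1024u

/* The L2 set is selected purely by physical address bits 6..15. */
static unsigned int l2_set(uint64_t paddr)
{
    return (unsigned int)((paddr / LINE_SIZE) % NUM_SETS);
}

/* A page "color" is the value of the overlapping bits 12..15. */
static unsigned int page_color(uint64_t paddr)
{
    return (unsigned int)((paddr >> 12) & 0xf);
}

int main(void)
{
    uint64_t a = 0x12345000;    /* two hypothetical physical addresses */
    uint64_t b = 0x12346000;    /* in adjacent 4KB page frames         */

    /*
     * A CAT capacity bitmask restricts which ways a CLOS may fill,
     * but it never changes these values: the set (and color) is
     * fixed by the address alone.
     */
    printf("set(a)=%u color(a)=%u\n", l2_set(a), page_color(a));
    printf("set(b)=%u color(b)=%u\n", l2_set(b), page_color(b));
    return 0;
}

A page-coloring allocator such as PALLOC steers an application onto
page frames of a chosen color, i.e. onto a subset of sets, while CAT
instead constrains the ways within every set.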

Without the possibility of using PALLOC I do not currently know how to
answer your request for a comparison with a cache coloring mechanism. I
will surely ask around and do more research.

Reinette

2018-02-12 19:08:46

by Reinette Chatre

Subject: Re: [RFC PATCH 00/20] Intel(R) Resource Director Technology Cache Pseudo-Locking enabling

Hi Thomas,

On 1/16/2018 3:38 AM, Thomas Gleixner wrote:
> On Mon, 15 Jan 2018, Hindman, Gavin wrote:
>>> From: [email protected] [mailto:linux-kernel-
>>> [email protected]] On Behalf Of Thomas Gleixner
>>> On Fri, 17 Nov 2017, Reinette Chatre wrote:

>>>> 2) The most recent kernel supported by PALLOC is v4.4 and also
>>>> mentioned in the above link there is currently no plan to upstream
>>>> this work for a less divergent comparison of PALLOC and the more
>>>> recent RDT/CAT enabling on which Cache Pseudo-Locking is built.
>>>
>>> Well, that's not a really good excuse for not trying. You at Intel should be able
>>> to get to the parameters easy enough :)
>>>
>> We can run the comparison, but I'm not sure that I understand the intent
>> - my understanding of Palloc is that it's intended to allow allocation of
>> memory to specific physical memory banks. While that might result in
>> reduced cache-misses since processes are more separated, it's not
>> explicitly intended to reduce cache-misses, and Palloc's benefits would
>> only hold as long as you have few enough processes to be able to
>> dedicate/isolate memory accordingly. Am I misunderstanding the
>> intent/usage of palloc?
>
> Right. It comes with its own set of restrictions as does the pseudo-locking.

Reporting results of a comparison between PALLOC, CAT, and Cache
Pseudo-Locking. CAT is a hardware-supported and Linux-enabled cache
partitioning mechanism, while PALLOC is an out-of-tree software cache
partitioning mechanism. Neither CAT nor PALLOC protects against
eviction from a cache partition. Cache Pseudo-Locking builds on CAT by
adding protection against eviction from the cache.

The latest PALLOC available is a patch against kernel v4.4. PALLOC
data was collected with the latest PALLOC v4.4 patch(*) applied against
v4.4.113. CAT and Cache Pseudo-Locking data were collected with a
rebase of this patch series against x86/cache of tip.git (based on
v4.15-rc8) when the HEAD was:

commit 31516de306c0c9235156cdc7acb976ea21f1f646
Author: Fenghua Yu <[email protected]>
Date: Wed Dec 20 14:57:24 2017 -0800

x86/intel_rdt: Add command line parameter to control L2_CDP

All tests involve a user space application that allocates (malloc()
with mlockall()) or, in the case of Cache Pseudo-Locking, maps (using
mmap()) a 256KB region of memory. The application then randomly
accesses this region, 32 bytes at a time, measuring the latency in
cycles of each access using the rdtsc instruction. Each time a test is
run it is repeated ten times.
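
For reference, a minimal sketch of such a measurement loop is shown
below. This is not the actual test application (which was not posted);
the buffer size, stride and use of rdtsc follow the description above,
while the fencing, the PRNG and the iteration count are assumptions.
For the Cache Pseudo-Locking case the buffer would instead be mmap()ed
from the pseudo-locked region exposed by the patch series.

/* Sketch only: not the test application used for the results below. */
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <stdint.h>
#include <sys/mman.h>
#include <x86intrin.h>              /* __rdtsc(), _mm_lfence() */

#define BUF_SIZE (256 * 1024)       /* 256KB region, as described above */
#define STRIDE   32                 /* 32 bytes per access              */
#define ACCESSES 100000             /* assumption: not stated in thread */

static uint64_t lat[ACCESSES];

static inline uint64_t cycles(void)
{
    _mm_lfence();                   /* keep rdtsc from reordering */
    uint64_t t = __rdtsc();
    _mm_lfence();
    return t;
}

int main(void)
{
    volatile char *buf = malloc(BUF_SIZE);

    if (!buf || mlockall(MCL_CURRENT | MCL_FUTURE)) {
        perror("setup");
        return 1;
    }
    memset((void *)buf, 1, BUF_SIZE);   /* fault in every page once */

    srand(42);
    for (int i = 0; i < ACCESSES; i++) {
        /* random 32-byte-aligned offset within the 256KB region */
        size_t off = ((size_t)rand() % (BUF_SIZE / STRIDE)) * STRIDE;
        uint64_t t0 = cycles();
        (void)buf[off];             /* one read pulls in the cache line */
        uint64_t t1 = cycles();
        lat[i] = t1 - t0;
    }

    for (int i = 0; i < ACCESSES; i++)
        printf("%llu\n", (unsigned long long)lat[i]);

    free((void *)buf);
    return 0;
}

Each such run would be repeated ten times and the per-access cycle
counts aggregated, matching the visualizations referenced below.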

As with the previous tests from this thread, testing was done on an
Intel(R) NUC NUC6CAYS (it has an Intel(R) Celeron(R) Processor J3455).
The system has two 1MB L2 caches (1024 sets and 16 ways each).

A few extra tests were done with PALLOC to establish a baseline and
confirm that I had it working correctly before comparing it against CAT
and Cache Pseudo-Locking. Each test was run on an idle system as well
as on a system where significant interference was introduced on a core
sharing the L2 cache with the core on which the test was running
(referred to as "noisy neighbor").


TEST1) PALLOC: Enable PALLOC but do not do any cache partitioning.

TEST2) PALLOC: Designate four bits to be used for page coloring, thus
creating four bins. Bits were chosen as the only four bits that overlap
between page and cache set addressing. Run application in a cgroup that
has access to one bin with rest of system accessing the three remaining
bins.

TEST3) PALLOC: With the same four bits used for page coloring as in
TEST2. Let application run in cgroup with dedicated access to two bins,
rest of system the remaining two bins.

TEST4) CAT: Same CAT test as in the original cover letter, where the
application runs with a dedicated CLOS with a CBM of 0xf and the
default CLOS CBM is changed to the non-overlapping 0xf0 (a sketch of
this resctrl configuration follows the test list).

TEST5) Cache Pseudo-Locking: Application reads from a 256KB Cache
Pseudo-Locked region.
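
The resctrl configuration behind TEST4 could look roughly like the
following sketch (shown in C for self-containment, though the same is
normally done with mkdir/echo from a shell). The group name, the PID
and the L2 domain ids are placeholders, and resctrl is assumed to be
mounted at /sys/fs/resctrl:

/* Sketch of the TEST4 CAT setup: CBM 0xf for the test group, 0xf0 for
 * the default group.  Group name, PID and L2 domain ids (0 and 1) are
 * placeholders that depend on the actual system.
 */
#include <stdio.h>
#include <sys/types.h>
#include <sys/stat.h>

static int write_str(const char *path, const char *val)
{
    FILE *f = fopen(path, "w");

    if (!f) {
        perror(path);
        return -1;
    }
    fprintf(f, "%s\n", val);
    return fclose(f);
}

int main(void)
{
    /* Create a resource group for the test application. */
    mkdir("/sys/fs/resctrl/cat_test", 0755);

    /* Default group keeps the non-overlapping upper part of the CBM. */
    write_str("/sys/fs/resctrl/schemata", "L2:0=f0;1=f0");

    /* Test group gets a dedicated CBM of 0xf. */
    write_str("/sys/fs/resctrl/cat_test/schemata", "L2:0=f;1=f");

    /* Move the measurement process (placeholder PID) into the group. */
    write_str("/sys/fs/resctrl/cat_test/tasks", "12345");

    return 0;
}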

Data visualizations plot, cumulatively over the ten test runs, the
number of instances (y axis) in which a particular access latency in
cycles (x axis) was measured. Each plot is accompanied by a boxplot
used to visualize the descriptive statistics (whiskers represent the
0 to 99th percentile, the interquartile range q1 to q3 is the black
rectangle, the median is the orange line, green is the average).

Visualization
https://github.com/rchatre/data/blob/master/cache_pseudo_locking/palloc/palloc_baseline.png
presents the PALLOC-only results for TEST1 through TEST3. The most
prominent improvement when using PALLOC is when the application obtains
dedicated access to two bins (half of the cache, double the size of the
memory being accessed) - in this environment its first quartile is
significantly lower than with any of the other partitionings. The
application thus experiences more instances where memory access latency
is low. We can see, though, that the average latency experienced by the
application is not affected significantly by this.

Visualization
https://github.com/rchatre/data/blob/master/cache_pseudo_locking/palloc/palloc_cat_pseudo.png
presents the same PALLOC two bins (TEST3) seen in previous visualization
with the CAT and Cache Pseudo-Locking results. The visualization shows
with all descriptive statistics a significant improved latency when
using CAT compared to PALLOC. The additional comparison with Cache
Pseudo-Locking shows the improved average access latency when compared
to both CAT and PALLOC.

In both the PALLOC and CAT tests there was improvement (CAT most
significant) in the latency of accessing a 256KB memory region, but in
both cases (PALLOC and CAT) 512KB of cache was set aside for the
application to obtain these results. When using Cache Pseudo-Locking to
access the 256KB memory region, only 256KB of cache was set aside,
while the access latency was also reduced compared to both PALLOC and
CAT.

I do hope these results establish the value of Cache Pseudo-Locking to
you. The rebased patch series used in this testing will be sent out
this week.

Regards,

Reinette

(*) A one line change was made as documented in
https://github.com/heechul/palloc/issues/8


2018-02-13 10:28:56

by Thomas Gleixner

Subject: Re: [RFC PATCH 00/20] Intel(R) Resource Director Technology Cache Pseudo-Locking enabling

On Mon, 12 Feb 2018, Reinette Chatre wrote:
> On 1/16/2018 3:38 AM, Thomas Gleixner wrote:
> All tests involve a user space application that allocates (malloc() with
> mlockall()) or, in the case of Cache Pseudo-Locking, maps (using mmap()) a
> 256KB region of memory. The application then randomly accesses this
> region, 32 bytes at a time, measuring the latency in cycles of each
> access using the rdtsc instruction. Each time a test is run it is
> repeated ten times.
> In both the PALLOC and CAT tests there was improvement (CAT most
> significant) in the latency of accessing a 256KB memory region, but in
> both cases (PALLOC and CAT) 512KB of cache was set aside for the
> application to obtain these results. When using Cache Pseudo-Locking to
> access the 256KB memory region, only 256KB of cache was set aside,
> while the access latency was also reduced compared to both PALLOC and
> CAT.
>
> I do hope these results establish the value of Cache Pseudo-Locking to
> you.

Very nice. Thank you so much for doing this. That kind of data is really
valuable.

My takeaway from this: All of the mechanisms are only delivering best
effort and the real benefit is the reduction of average latency. The
worst case outliers are in the same ballpark, it seems.

> The rebased patch series used in this testing will be sent out
> this week.

I'll make sure to have cycles available for review.

Thanks,

tglx