On Tue, Sep 29, 2020 at 05:04:44PM +0200, Michal Hocko wrote:
> On Mon 28-09-20 17:02:16, Johannes Weiner wrote:
> [...]
> > My take is that a proactive reclaim feature, whose goal is never to
> > thrash or punish but to keep the LRUs warm and the workingset trimmed,
> > would ideally have:
> >
> > - a pressure or size target specified by userspace but with
> > enforcement driven inside the kernel from the allocation path
> >
> > - the enforcement work NOT be done synchronously by the workload
> > (something I'd argue we want for *all* memory limits)
> >
> > - the enforcement work ACCOUNTED to the cgroup, though, since it's the
> > cgroup's memory allocations causing the work (again something I'd
> > argue we want in general)
> >
> > - a delegatable knob that is independent of setting the maximum size
> > of a container, as that expresses a different type of policy
> >
> > - if size target, self-limiting (ha) enforcement on a pressure
> > threshold or stop enforcement when the userspace component dies
> >
> > Thoughts?
>
> Agreed with above points. What do you think about
> http://lkml.kernel.org/r/[email protected].
I definitely agree with what you wrote in this email for background
reclaim. Indeed, your description sounds like what I proposed in
https://lore.kernel.org/linux-mm/[email protected]/
- what's missing from that patch is proper work attribution.
> I assume that you do not want to override memory.high to implement
> this because that tends to be tricky from the configuration POV as
> you mentioned above. But a new limit (memory.middle, for lack of a
> better name) to define the background reclaim sounds like a good fit
> with above points.
I can see that with a new memory.middle you could kind of sort of do
both - background reclaim and proactive reclaim.
That said, I do see advantages in keeping them separate:
1. Background reclaim is essentially an allocation optimization that
we may want to provide by default, just like kswapd.
Kswapd is tweakable of course, but I think few users actually tweak
it, and it works pretty well out of the box. It would be nice to
provide the same thing on a per-cgroup basis by default and not
ask users to make decisions that we are generally better at making.
2. Proactive reclaim may actually be better configured through a
pressure threshold rather than a size target.
As per above, the goal is not to be punitive or containing. The
goal is to keep the LRUs warm and move the colder pages to disk.
But how aggressively do you run reclaim for this purpose? What
target value should a user write to such a memory.middle file?
For one, it depends on the job. A batch job, or a less important
background job, may tolerate higher paging overhead than an
interactive job. That means more of its pages could be trimmed from
RAM and reloaded on-demand from disk.
But also, it depends on the storage device. If you move a workload
from a machine with a slow disk to a machine with a fast disk, you
can page more data in the same amount of time. That means while
your workload tolerance stays the same, the faster the disk, the
more aggressively you can do reclaim and offload memory.
So again, what should a user write to such a control file?
Of course, you can approximate an optimal target size for the
workload. You can run a manual workingset analysis with page_idle,
damon, or similar, determine a hot/cold cutoff based on what you
know about the storage characteristics, then echo a number of pages
or a size target into a cgroup file and let the kernel do the reclaim
accordingly. The drawbacks are that the kernel LRU may do a
different hot/cold classification than you did and evict the wrong
pages, the storage device latencies may vary based on overall IO
pattern, and two equally warm pages may have very different paging
overhead depending on whether readahead can avert a major fault or
not. So it's easy to overshoot the tolerance target and disrupt the
workload, or undershoot and have stale LRU data, waste memory etc.
You can also do a feedback loop, where you guess an optimal size,
then adjust based on how much paging overhead the workload is
experiencing, i.e. memory pressure. The drawbacks are that you have
to monitor pressure closely and react quickly when the workload is
expanding, as it can be potentially sensitive to latencies in the
usec range. This can be tricky to do from userspace.
So instead of asking users for a target size whose suitability
heavily depends on the kernel's LRU implementation, the readahead
code, the IO device's capability and general load, why not directly
ask the user for a pressure level that the workload is comfortable
with and which captures all of the above factors implicitly? Then
let the kernel do this feedback loop from a per-cgroup worker.
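As a rough illustration of the userspace feedback loop described above, here
is a sketch that uses only existing cgroup2 interfaces: memory.pressure (PSI),
memory.current and memory.high. The cgroup path, tolerance, step size and
polling interval are made-up example values, and memory.high is used simply
because it is the reclaim knob that exists today; as discussed above and in
the replies below, driving it from userspace has exactly the reaction-time
problem and can inadvertently push the workload into direct reclaim.

#include <stdio.h>
#include <unistd.h>

#define CG        "/sys/fs/cgroup/workload"   /* made-up cgroup path */
#define TOLERANCE 1.0            /* acceptable "some" avg10, in percent */
#define STEP      (64UL << 20)   /* adjust memory.high in 64M increments */

/* Parse the "some avg10=" value from the cgroup's memory.pressure file. */
static double read_some_avg10(void)
{
        double avg10 = 0.0;
        FILE *f = fopen(CG "/memory.pressure", "r");

        if (f) {
                if (fscanf(f, "some avg10=%lf", &avg10) != 1)
                        avg10 = 0.0;
                fclose(f);
        }
        return avg10;
}

static unsigned long read_usage(void)
{
        unsigned long bytes = 0;
        FILE *f = fopen(CG "/memory.current", "r");

        if (f) {
                if (fscanf(f, "%lu", &bytes) != 1)
                        bytes = 0;
                fclose(f);
        }
        return bytes;
}

static void write_high(unsigned long bytes)
{
        FILE *f = fopen(CG "/memory.high", "w");

        if (f) {
                fprintf(f, "%lu\n", bytes);
                fclose(f);
        }
}

int main(void)
{
        for (;;) {
                double pressure = read_some_avg10();
                unsigned long usage = read_usage();

                if (pressure < TOLERANCE)
                        /* workload is comfortable: trim a little harder */
                        write_high(usage > STEP ? usage - STEP : STEP);
                else
                        /* paging overhead too high: back off */
                        write_high(usage + STEP);

                sleep(10);
        }
}

The pressure-target proposal would effectively move this loop into a
per-cgroup kernel worker: the user writes the tolerance once, and the kernel
does the measuring and adjusting.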
On Tue, Sep 29, 2020 at 2:55 PM Johannes Weiner <[email protected]> wrote:
>
> [...]
>
> You can also do a feedback loop, where you guess an optimal size,
> then adjust based on how much paging overhead the workload is
> experiencing, i.e. memory pressure. The drawbacks are that you have
> to monitor pressure closely and react quickly when the workload is
> expanding, as it can be potentially sensitive to latencies in the
> usec range. This can be tricky to do from userspace.
>
This is actually what we do in our production environment, i.e. a
feedback loop that adjusts the next iteration of proactive reclaim.
We eliminated the IO and slow-disk issues you mentioned by focusing
only on anon memory and using zswap.
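For context, zswap keeps reclaimed anon pages compressed in a RAM pool in
front of the regular swap device, so most of the paging never reaches a slow
disk. Below is a minimal sketch of enabling it at runtime via its module
parameters, assuming a kernel built with CONFIG_ZSWAP and an existing swap
device to back the pool; the 20% pool cap is an arbitrary example, and how
reclaim is focused on anon memory in this setup is not specified here.

#include <stdio.h>

/* Write a string to a sysfs/module-parameter file. */
static void write_str(const char *path, const char *val)
{
        FILE *f = fopen(path, "w");

        if (f) {
                fputs(val, f);
                fclose(f);
        }
}

int main(void)
{
        /* enable the compressed cache in front of swap */
        write_str("/sys/module/zswap/parameters/enabled", "Y");
        /* cap the compressed pool at 20% of system memory (example value) */
        write_str("/sys/module/zswap/parameters/max_pool_percent", "20");
        return 0;
}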
> So instead of asking users for a target size whose suitability
> heavily depends on the kernel's LRU implementation, the readahead
> code, the IO device's capability and general load, why not directly
> ask the user for a pressure level that the workload is comfortable
> with and which captures all of the above factors implicitly? Then
> let the kernel do this feedback loop from a per-cgroup worker.
I am assuming here that by pressure level you are referring to a
PSI-like interface, e.g. allowing users to specify that X amount of
stall in a fixed time window is tolerable for their jobs.
Seems promising, though I would like some flexibility in how the
resources are given to the per-cgroup worker.
Are you planning to work on this or should I give it a try?
On Wed, Sep 30, 2020 at 08:45:17AM -0700, Shakeel Butt wrote:
> On Tue, Sep 29, 2020 at 2:55 PM Johannes Weiner <[email protected]> wrote:
> >
> > [...]
>
> This is actually what we do in our production environment, i.e. a
> feedback loop that adjusts the next iteration of proactive reclaim.
That's what we also do right now. It works reasonably well; the only
two pain points have been the reaction time under quick workload
expansion and inadvertently forcing the workload into direct reclaim.
> We eliminated the IO and slow-disk issues you mentioned by focusing
> only on anon memory and using zswap.
Interesting, may I ask how the file cache is managed in this setup?
> > [...]
>
> I am assuming here that by pressure level you are referring to a
> PSI-like interface, e.g. allowing users to specify that X amount of
> stall in a fixed time window is tolerable for their jobs.
Right, essentially the same parameters that psi poll() would take.
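To make those parameters concrete: a psi trigger is armed by writing
"<some|full> <stall threshold in us> <window in us>" to a pressure file and
then poll()ing it for POLLPRI. Below is a minimal sketch against a cgroup's
memory.pressure file (the cgroup path is a made-up example); the proposal in
this thread would hand essentially the same threshold/window pair to a
kernel-side reclaim worker instead of a polling process.

#include <errno.h>
#include <fcntl.h>
#include <poll.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

int main(void)
{
        /* wake up when the cgroup stalls for >150ms within any 1s window */
        const char trig[] = "some 150000 1000000";
        struct pollfd fds;

        fds.fd = open("/sys/fs/cgroup/workload/memory.pressure",
                      O_RDWR | O_NONBLOCK);
        if (fds.fd < 0) {
                fprintf(stderr, "open: %s\n", strerror(errno));
                return 1;
        }
        fds.events = POLLPRI;

        /* arm the trigger: threshold and window in microseconds */
        if (write(fds.fd, trig, strlen(trig) + 1) < 0) {
                fprintf(stderr, "arming trigger: %s\n", strerror(errno));
                return 1;
        }

        for (;;) {
                int n = poll(&fds, 1, -1);

                if (n < 0) {
                        fprintf(stderr, "poll: %s\n", strerror(errno));
                        return 1;
                }
                if (fds.revents & POLLERR) {
                        /* the monitored cgroup went away */
                        return 0;
                }
                if (fds.revents & POLLPRI)
                        printf("memory pressure threshold crossed\n");
        }
}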
On Thu, Oct 1, 2020 at 7:33 AM Johannes Weiner <[email protected]> wrote:
>
[snip]
> Right, essentially the same parameters that psi poll() would take.
I thought a bit more about the semantics of using psi for proactive
reclaim.
Suppose I have a top-level cgroup A on which I want to enable
proactive reclaim. Which memory psi events should the proactive
reclaim consider?
The simplest would be the memory.psi at 'A'. However memory.psi is
hierarchical, and I would not really want the pressure due to limits
in children of 'A' to impact the proactive reclaim.
PSI due to refaults and slow IO should be included, or maybe only
those which are caused by the proactive reclaim itself. I am
undecided on the PSI due to compaction. PSI due to global reclaim
for 'A' is even more complicated: this is a stall due to reclaiming
from the system, including self, and it might not really cause more
refaults and IOs for 'A'. Should proactive reclaim ignore the
pressure due to global reclaim when tuning its aggressiveness?
Am I overthinking here?
On Tue, Oct 06, 2020 at 09:55:43AM -0700, Shakeel Butt wrote:
> On Thu, Oct 1, 2020 at 7:33 AM Johannes Weiner <[email protected]> wrote:
> >
> [snip]
>
> I thought a bit more about the semantics of using psi for proactive
> reclaim.
>
> Suppose I have a top-level cgroup A on which I want to enable
> proactive reclaim. Which memory psi events should the proactive
> reclaim consider?
>
> The simplest would be the memory.psi at 'A'. However memory.psi is
> hierarchical, and I would not really want the pressure due to limits
> in children of 'A' to impact the proactive reclaim.
I don't think pressure from limits down the tree can be separated out,
generally. All events are accounted recursively as well. Of course, we
remember the reclaim level for evicted entries - but if there is
reclaim triggered at A and A/B concurrently, the distribution of who
ends up reclaiming the physical pages in A/B is pretty arbitrary/racy.
If A/B decides to do its own proactive reclaim with the sublimit, and
ends up consuming the pressure budget assigned to proactive reclaim in
A, there isn't much that can be done.
It's also possible that proactive reclaim in A keeps A/B from hitting
its limit in the first place.
I have to say, the configuration doesn't really strike me as sensible,
though. Limits make sense for doing fixed partitioning: A gets 4G, A/B
gets 2G out of that. But if you do proactive reclaim on A you're
essentially saying A as a whole is auto-sizing dynamically based on
its memory access pattern. I'm not sure what it means to then start
doing fixed partitions in the sublevel.
> PSI due to refaults and slow IO should be included, or maybe only
> those which are caused by the proactive reclaim itself. I am
> undecided on the PSI due to compaction. PSI due to global reclaim
> for 'A' is even more complicated: this is a stall due to reclaiming
> from the system, including self, and it might not really cause more
> refaults and IOs for 'A'. Should proactive reclaim ignore the
> pressure due to global reclaim when tuning its aggressiveness?
Yeah, I think they should all be included, because ultimately what
matters is what the workload can tolerate without sacrificing
performance.
Proactive reclaim can destroy THPs, so the cost of recreating them
should be reflected. Otherwise you can easily overpressurize.
For global reclaim, if you say you want a workload pressurized to X
percent in order to drive the LRUs and chop off all cold pages the
workload can live without, it doesn't matter who does the work. If
there is an abundance of physical memory, it's going to be proactive
reclaim. If physical memory is already tight enough that global
reclaim does it for you, there is nothing to be done in addition, and
proactive reclaim should hang back. Otherwise you can again easily
overpressurize the workload.