2023-11-22 09:38:53

by Chengming Zhou

Subject: Question: memcg dirty throttle caused by low per-memcg dirty thresh

Hello all,

Sorry to bother you, we encountered a problem related to the memcg dirty
throttle after migrating from cgroup v1 to v2, so we want to ask for some
comments or suggestions.

1. Problem

We have the "containerd" service running under system.slice, with
its memory.max set to 5GB. It will be constantly throttled in the
balance_dirty_pages() since the memcg has more dirty memory than
the memcg dirty thresh.

We didn't have this problem on cgroup v1, because cgroup v1 doesn't have
the per-memcg writeback and per-memcg dirty thresh. Only the global
dirty thresh will be checked in balance_dirty_pages().

2. Thinking

So we wonder if we can support the per-memcg dirty thresh interface?
Now the memcg dirty thresh is just calculated from memcg max * ratio,
which can be set from /proc/sys/vm/dirty_ratio.

We have to set it to 60 instead of the default 20 as a workaround for now,
but worry about the potential side effects.

If we can support the per-memcg dirty thresh interface, we can set
some containers to a much higher dirty_ratio, especially for hungry
dirtier workloads like "containerd".

3. Solution?

But we couldn't think of a good solution to support this. The current
memcg dirty thresh is calculated from a complex rule:

memcg dirty thresh = memcg avail * dirty_ratio

memcg avail is derived from a combination of memcg max/high and memcg file
pages, and is capped by the system-wide clean memory excluding the amount
being used in the memcg.
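
Roughly, in simplified pseudo-C (a sketch of my reading of mdtc_calc_avail()
plus the ratio, not the exact kernel code; the helper and parameter names
below are made up):

#define MIN(a, b)	((a) < (b) ? (a) : (b))

static unsigned long sketch_memcg_dirty_thresh(
		unsigned long filepages,	/* file pages charged to the memcg */
		unsigned long headroom,		/* room left up to memory.max/high */
		unsigned long memcg_dirty,	/* dirty+writeback pages in the memcg */
		unsigned long global_avail,	/* global dirtyable memory */
		unsigned long global_dirty,	/* global dirty+writeback pages */
		unsigned long dirty_ratio)	/* /proc/sys/vm/dirty_ratio */
{
	/* clean memory in the rest of the system, excluding this memcg */
	unsigned long clean = filepages - MIN(filepages, memcg_dirty);
	unsigned long global_clean = global_avail - MIN(global_avail, global_dirty);
	unsigned long other_clean = global_clean - MIN(global_clean, clean);

	/* memcg avail: own file pages plus headroom, capped by system clean memory */
	unsigned long avail = filepages + MIN(headroom, other_clean);

	/* memcg dirty thresh = memcg avail * dirty_ratio */
	return avail * dirty_ratio / 100;
}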

Although we may find a way to calculate the per-memcg dirty thresh,
we can't use it directly, since we still need to calculate/distribute
dirty thresh to the per-wb dirty thresh share.

R - A - B
\-- C

For example, if we know the dirty thresh of A, but wb is in C, we
have no way to distribute the dirty thresh shares to the wb in C.

But we have to get the dirty thresh of the wb in C, since we need it
to control the throttling of the wb in balance_dirty_pages().
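
For reference, the per-wb share within a domain is today distributed roughly
in proportion to the wb's recent writeback completions (a simplified sketch
from my reading, not the exact __wb_calc_thresh() code, which also applies
the bdi min/max ratios):

static unsigned long sketch_wb_thresh(unsigned long domain_thresh,
				      unsigned long long wb_completions,
				      unsigned long long dom_completions)
{
	if (!dom_completions)
		return 0;
	/* this wb's proportional share of the domain dirty thresh */
	return (unsigned long)(domain_thresh * wb_completions / dom_completions);
}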

I may have missed something above, but the problem seems clear IMHO.
Looking forward to any comment or suggestion.

Thanks!


2023-11-22 10:03:12

by Michal Hocko

Subject: Re: Question: memcg dirty throttle caused by low per-memcg dirty thresh

On Wed 22-11-23 17:38:25, Chengming Zhou wrote:
> Hello all,
>
> Sorry to bother you, we encountered a problem related to the memcg dirty
> throttle after migrating from cgroup v1 to v2, so we want to ask for some
> comments or suggestions.
>
> 1. Problem
>
> We have the "containerd" service running under system.slice, with
> its memory.max set to 5GB. It will be constantly throttled in the
> balance_dirty_pages() since the memcg has more dirty memory than
> the memcg dirty thresh.
>
> We didn't have this problem on cgroup v1, because cgroup v1 doesn't have
> the per-memcg writeback and per-memcg dirty thresh. Only the global
> dirty thresh will be checked in balance_dirty_pages().

Yes, v1 didn't have any sensible IO throttling and so we had to rely on
an ugly hack to wait for writeback to finish from the memcg memory reclaim
path. This is really suboptimal because it makes memcg reclaim stalls
hard to predict. So it is essentially only a poor man's OOM prevention.

V2 on the other hand has memcg aware dirty memory throttling which is a
much better solution as it throttles at the moment when the memory is
being dirtied.

Why do you consider that to be a problem? Constant throttling as you
suggest might be a result of the limit being too small?

>
> 2. Thinking
>
> So we wonder if we can support the per-memcg dirty thresh interface?
> Now the memcg dirty thresh is just calculated from memcg max * ratio,
> which can be set from /proc/sys/vm/dirty_ratio.

In general I would recommend using dirty_bytes instead as the ratio
doesn't scale all that well on larger systems.
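
Just to illustrate the difference between the two global knobs (a rough
sketch, not verbatim kernel code; vm_dirty_bytes takes precedence over
vm_dirty_ratio when it is set):

static unsigned long sketch_global_dirty_thresh(unsigned long avail_pages,
						unsigned long dirty_bytes,
						unsigned long dirty_ratio,
						unsigned long page_size)
{
	if (dirty_bytes)
		/* an absolute limit is independent of how much RAM the box has */
		return (dirty_bytes + page_size - 1) / page_size;

	/* a ratio scales with dirtyable memory, so big machines allow a lot */
	return avail_pages * dirty_ratio / 100;
}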

> We have to set it to 60 instead of the default 20 as a workaround for now,
> but worry about the potential side effects.
>
> If we can support the per-memcg dirty thresh interface, we can set
> some containers to a much higher dirty_ratio, especially for hungry
> dirtier workloads like "containerd".

But why would you want that? If you allow heavy writers to dirty a lot
of memory then flushing that to the backing store will take more time.
That could starve small writers as well because they could end up queued
behind huge amount of data to be flushed.

I am no expert on the writeback so others could give you better
arguments but from my POV the dirty data flushing and throttling is
mostly a global mechanism to optimize the IO pattern and is a function of
the storage much more than of the specific workload. If your heavy writer hits
throttling too much then either the limit is too low or you should start
background flushing earlier.

--
Michal Hocko
SUSE Labs

2023-11-22 14:49:53

by Jan Kara

Subject: Re: Question: memcg dirty throttle caused by low per-memcg dirty thresh

Hello!

On Wed 22-11-23 17:38:25, Chengming Zhou wrote:
> Sorry to bother you, we encountered a problem related to the memcg dirty
> throttle after migrating from cgroup v1 to v2, so we want to ask for some
> comments or suggestions.
>
> 1. Problem
>
> We have the "containerd" service running under system.slice, with
> its memory.max set to 5GB. It will be constantly throttled in the
> balance_dirty_pages() since the memcg has more dirty memory than
> the memcg dirty thresh.
>
> We didn't have this problem on cgroup v1, because cgroup v1 doesn't have
> the per-memcg writeback and per-memcg dirty thresh. Only the global
> dirty thresh will be checked in balance_dirty_pages().

As Michal writes, if you allow too many memcg pages to become dirty, you
might be facing issues with page reclaim so there are actually good reasons
why you want amount of dirty pages in each memcg reasonably limited. Also
generally increasing number of available dirty pages beyond say 1GB is not
going to bring any benefit in the overall writeback performance. It may
still be useful in case you generate a lot of (or large) temporary files
which get quickly deleted and thus with high enough dirty limit they don't
have to be written to the disk at all. Similarly if the generation of dirty
data is very bursty (i.e. you generate a lot of dirty data in a short while
and then don't dirty anything for a long time), having higher dirty limit
may be useful. What is your usecase that you think you'll benefit from
higher dirty limit?

> 2. Thinking
>
> So we wonder if we can support the per-memcg dirty thresh interface?
> Now the memcg dirty thresh is just calculated from memcg max * ratio,
> which can be set from /proc/sys/vm/dirty_ratio.
>
> We have to set it to 60 instead of the default 20 as a workaround for now,
> but worry about the potential side effects.
>
> If we can support the per-memcg dirty thresh interface, we can set
> some containers to a much higher dirty_ratio, especially for hungry
> dirtier workloads like "containerd".

As Michal wrote, if this ought to be configurable per memcg, then
configuring dirty amount directly in bytes may be more sensible.

> 3. Solution?
>
> But we couldn't think of a good solution to support this. The current
> memcg dirty thresh is calculated from a complex rule:
>
> memcg dirty thresh = memcg avail * dirty_ratio
>
> memcg avail is derived from a combination of memcg max/high and memcg file
> pages, and is capped by the system-wide clean memory excluding the amount
> being used in the memcg.
>
> Although we may find a way to calculate the per-memcg dirty thresh,
> we can't use it directly, since we still need to calculate/distribute
> dirty thresh to the per-wb dirty thresh share.
>
> R - A - B
> \-- C
>
> For example, if we know the dirty thresh of A, but wb is in C, we
> have no way to distribute the dirty thresh shares to the wb in C.
>
> But we have to get the dirty thresh of the wb in C, since we need it
> to control the throttling of the wb in balance_dirty_pages().
>
> I may have missed something above, but the problem seems clear IMHO.
> Looking forward to any comment or suggestion.

I'm not sure I follow what is the problem here. In balance_dirty_pages() we
have global dirty threshold (tracked in gdtc) and memcg dirty threshold
(tracked in mdtc). This can get further scaled down based on the device
throughput (that is the difference between 'thresh' and 'wb_thresh') but
that is independent of the way mdtc->thresh is calculated. So if we provide
a different way of calculating mdtc->thresh, technically everything should
keep working as is.
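
Schematically (a heavily simplified sketch, not the actual
balance_dirty_pages() code; the names below are made up), the decision is:

struct sketch_dtc {
	unsigned long thresh;		/* domain dirty thresh */
	unsigned long wb_thresh;	/* this wb's share of the thresh */
	unsigned long dirty;		/* dirty pages in the domain */
	unsigned long wb_dirty;		/* dirty pages of this wb in the domain */
};

static int sketch_over_limit(const struct sketch_dtc *dtc)
{
	return dtc->dirty > dtc->thresh || dtc->wb_dirty > dtc->wb_thresh;
}

static int sketch_needs_throttle(const struct sketch_dtc *gdtc,
				 const struct sketch_dtc *mdtc)
{
	/* mdtc may be absent when cgroup writeback is not in use */
	return sketch_over_limit(gdtc) || (mdtc && sketch_over_limit(mdtc));
}

So however mdtc->thresh is obtained, the wb scaling and the throttling loop
stay the same.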

Honza
--
Jan Kara <[email protected]>
SUSE Labs, CR

2023-11-22 14:59:52

by Chengming Zhou

Subject: Re: Question: memcg dirty throttle caused by low per-memcg dirty thresh

On 2023/11/22 18:02, Michal Hocko wrote:
> On Wed 22-11-23 17:38:25, Chengming Zhou wrote:
>> Hello all,
>>
>> Sorry to bother you, we encountered a problem related to the memcg dirty
>> throttle after migrating from cgroup v1 to v2, so we want to ask for some
>> comments or suggestions.
>>
>> 1. Problem
>>
>> We have the "containerd" service running under system.slice, with
>> its memory.max set to 5GB. It will be constantly throttled in the
>> balance_dirty_pages() since the memcg has more dirty memory than
>> the memcg dirty thresh.
>>
>> We didn't have this problem on cgroup v1, because cgroup v1 doesn't have
>> the per-memcg writeback and per-memcg dirty thresh. Only the global
>> dirty thresh will be checked in balance_dirty_pages().
>
> Yes, v1 didn't have any sensible IO throttling and so we had to rely on
> an ugly hack to wait for writeback to finish from the memcg memory reclaim
> path. This is really suboptimal because it makes memcg reclaim stalls
> hard to predict. So it is essentially only a poor man's OOM prevention.
>
> V2 on the other hand has memcg aware dirty memory throttling which is a
> much better solution as it throttles at the moment when the memory is
> being dirtied.
>
> Why do you consider that to be a problem? Constant throttling as you
> suggest might be a result of the limit being too small?

Right, v2 is better at limiting the dirty memory in one memcg, which is
better for the memcg reclaim path.

The problem we encountered is that the global dirty_ratio is too small (20%)
for some cgroup workloads. For "containerd" preparing a big image file,
we want most of its memory.max (5GB in our case) to be dirtiable, to speed
up the process.

And yes, this may back up more dirty pages in that memcg, and the longer
writeback IO may then interfere with other memcgs' writeback IO. But we also
have per-blkcg IO throttling, so it's not that bad?

For now we have to raise the global dirty_ratio to achieve this result,
but that's not good for every memcg workload, and could be bad for some
memcgs' reclaim paths, as you noted.

>
>>
>> 2. Thinking
>>
>> So we wonder if we can support the per-memcg dirty thresh interface?
>> Now the memcg dirty thresh is just calculated from memcg max * ratio,
>> which can be set from /proc/sys/vm/dirty_ratio.
>
> In general I would recommend using dirty_bytes instead as the ratio
> doesn't scale all that well on larger systems.
>
>> We have to set it to 60 instead of the default 20 as a workaround for now,
>> but worry about the potential side effects.
>>
>> If we can support the per-memcg dirty thresh interface, we can set
>> some containers to a much higher dirty_ratio, especially for hungry
>> dirtier workloads like "containerd".
>
> But why would you want that? If you allow heavy writers to dirty a lot
> of memory then flushing that to the backing store will take more time.
> That could starve small writers as well because they could end up queued
> behind huge amount of data to be flushed.
>

Yes, we also need per-blkcg IO throttling to distribute writeback IO bandwidth.

> I am no expert on the writeback so others could give you better
> arguments but from my POV the dirty data flushing and throttling is
> mostly a global mechanism to optimize the IO pattern and is a function of
> the storage much more than of the specific workload. If your heavy writer hits

Maybe the per-bdi ratio is worth trying instead of the global dirty_ratio,
which could affect all devices.

> throttling too much then either the limit is too low or you should start
> background flushing earlier.
>

The global dirty_ratio is too low for "containerd" in this case, so we
want more control over the memcg dirty_ratio.

Thanks!

2023-11-22 15:35:01

by Chengming Zhou

Subject: Re: Question: memcg dirty throttle caused by low per-memcg dirty thresh

On 2023/11/22 22:49, Jan Kara wrote:
> Hello!
>
> On Wed 22-11-23 17:38:25, Chengming Zhou wrote:
>> Sorry to bother you, we encountered a problem related to the memcg dirty
>> throttle after migrating from cgroup v1 to v2, so we want to ask for some
>> comments or suggestions.
>>
>> 1. Problem
>>
>> We have the "containerd" service running under system.slice, with
>> its memory.max set to 5GB. It will be constantly throttled in the
>> balance_dirty_pages() since the memcg has more dirty memory than
>> the memcg dirty thresh.
>>
>> We didn't have this problem on cgroup v1, because cgroup v1 doesn't have
>> the per-memcg writeback and per-memcg dirty thresh. Only the global
>> dirty thresh will be checked in balance_dirty_pages().
>
> As Michal writes, if you allow too many memcg pages to become dirty, you
> might be facing issues with page reclaim so there are actually good reasons
> why you want amount of dirty pages in each memcg reasonably limited. Also

Yes, the memcg dirty limit (20%) is good for the memcg reclaim path.
But for some workloads (like a bursty dirtier) which may only create many dirty
pages in a short time, we want 60% of their memory.max to be dirtiable without
being throttled. And this is not very harmful for their memcg reclaim path.

> generally increasing number of available dirty pages beyond say 1GB is not
> going to bring any benefit in the overall writeback performance. It may
> still be useful in case you generate a lot of (or large) temporary files
> which get quickly deleted and thus with high enough dirty limit they don't
> have to be written to the disk at all. Similarly if the generation of dirty
> data is very bursty (i.e. you generate a lot of dirty data in a short while
> and then don't dirty anything for a long time), having higher dirty limit
> may be useful. What is your usecase that you think you'll benefit from
> higher dirty limit?

I think it's the bursty dirtier case for us, and we see a good performance
improvement if we change the global dirty_ratio to 60 just for testing.

>
>> 2. Thinking
>>
>> So we wonder if we can support the per-memcg dirty thresh interface?
>> Now the memcg dirty thresh is just calculated from memcg max * ratio,
>> which can be set from /proc/sys/vm/dirty_ratio.
>>
>> We have to set it to 60 instead of the default 20 as a workaround for now,
>> but worry about the potential side effects.
>>
>> If we can support the per-memcg dirty thresh interface, we can set
>> some containers to a much higher dirty_ratio, especially for hungry
>> dirtier workloads like "containerd".
>
> As Michal wrote, if this ought to be configurable per memcg, then
> configuring dirty amount directly in bytes may be more sensible.
>

Yes, "memory.dirty_limit" should be more sensible than "memory.dirty_ratio".

>> 3. Solution?
>>
>> But we couldn't think of a good solution to support this. The current
>> memcg dirty thresh is calculated from a complex rule:
>>
>> memcg dirty thresh = memcg avail * dirty_ratio
>>
>> memcg avail is derived from a combination of memcg max/high and memcg file
>> pages, and is capped by the system-wide clean memory excluding the amount
>> being used in the memcg.
>>
>> Although we may find a way to calculate the per-memcg dirty thresh,
>> we can't use it directly, since we still need to calculate/distribute
>> dirty thresh to the per-wb dirty thresh share.
>>
>> R - A - B
>> \-- C
>>
>> For example, if we know the dirty thresh of A, but wb is in C, we
>> have no way to distribute the dirty thresh shares to the wb in C.
>>
>> But we have to get the dirty thresh of the wb in C, since we need it
>> to control the throttling of the wb in balance_dirty_pages().
>>
>> I may have missed something above, but the problem seems clear IMHO.
>> Looking forward to any comment or suggestion.
>
> I'm not sure I follow what is the problem here. In balance_dirty_pages() we
> have global dirty threshold (tracked in gdtc) and memcg dirty threshold
> (tracked in mdtc). This can get further scaled down based on the device
> throughput (that is the difference between 'thresh' and 'wb_thresh') but
> that is independent of the way mdtc->thresh is calculated. So if we provide
> a different way of calculating mdtc->thresh, technically everything should
> keep working as is.
>

Sorry for the confusion. The problem is exactly how to calculate mdtc->thresh.

R - A - B
\-- C

Case 1:

Suppose C has "memory.dirty_limit" set; should we just use it as mdtc->thresh?
I see the current code also considers the system clean memory in mdtc_calc_avail();
do we also need to consider that when "memory.dirty_limit" is set?

Case 2:

Suppose C doesn't have "memory.dirty_limit" set, but A does; how do we
calculate C's mdtc->thresh?

Obviously we can't directly use A's "memory.dirty_limit", since it should be
distributed between B and C?

So the problem is that I don't know how to reasonably calculate the mdtc->thresh,
even given a memcg tree in which some memcgs have "memory.dirty_limit" set. :\

Thanks!

2023-11-24 04:21:03

by Jan Kara

Subject: Re: Question: memcg dirty throttle caused by low per-memcg dirty thresh

On Wed 22-11-23 23:32:50, Chengming Zhou wrote:
> On 2023/11/22 22:49, Jan Kara wrote:
> > Hello!
> >
> > On Wed 22-11-23 17:38:25, Chengming Zhou wrote:
> >> Sorry to bother you, we encountered a problem related to the memcg dirty
> >> throttle after migrating from cgroup v1 to v2, so we want to ask for some
> >> comments or suggestions.
> >>
> >> 1. Problem
> >>
> >> We have the "containerd" service running under system.slice, with
> >> its memory.max set to 5GB. It will be constantly throttled in the
> >> balance_dirty_pages() since the memcg has more dirty memory than
> >> the memcg dirty thresh.
> >>
> >> We didn't have this problem on cgroup v1, because cgroup v1 doesn't have
> >> the per-memcg writeback and per-memcg dirty thresh. Only the global
> >> dirty thresh will be checked in balance_dirty_pages().
> >
> > As Michal writes, if you allow too many memcg pages to become dirty, you
> > might be facing issues with page reclaim so there are actually good reasons
> > why you want amount of dirty pages in each memcg reasonably limited. Also
>
> Yes, the memcg dirty limit (20%) is good for the memcg reclaim path.
> But for some workloads (like a bursty dirtier) which may only create many dirty
> pages in a short time, we want 60% of their memory.max to be dirtiable without
> being throttled. And this is not very harmful for their memcg reclaim path.

Well, I'd rather say that your memcg likely doesn't hit the reclaim path too
much (the memory is reasonably sized for the task) and thus a high fraction of
dirty pagecache pages does not really matter much.

> > generally increasing number of available dirty pages beyond say 1GB is not
> > going to bring any benefit in the overall writeback performance. It may
> > still be useful in case you generate a lot of (or large) temporary files
> > which get quickly deleted and thus with high enough dirty limit they don't
> > have to be written to the disk at all. Similarly if the generation of dirty
> > data is very bursty (i.e. you generate a lot of dirty data in a short while
> > and then don't dirty anything for a long time), having higher dirty limit
> > may be useful. What is your usecase that you think you'll benefit from
> > higher dirty limit?
>
> I think it's the bursty dirtier case for us, and we see a good performance
> improvement if we change the global dirty_ratio to 60 just for testing.

OK.

> >> 3. Solution?
> >>
> >> But we couldn't think of a good solution to support this. The current
> >> memcg dirty thresh is calculated from a complex rule:
> >>
> >> memcg dirty thresh = memcg avail * dirty_ratio
> >>
> >> memcg avail is derived from a combination of memcg max/high and memcg file
> >> pages, and is capped by the system-wide clean memory excluding the amount
> >> being used in the memcg.
> >>
> >> Although we may find a way to calculate the per-memcg dirty thresh,
> >> we can't use it directly, since we still need to calculate/distribute
> >> dirty thresh to the per-wb dirty thresh share.
> >>
> >> R - A - B
> >> \-- C
> >>
> >> For example, if we know the dirty thresh of A, but wb is in C, we
> >> have no way to distribute the dirty thresh shares to the wb in C.
> >>
> >> But we have to get the dirty thresh of the wb in C, since we need it
> >> to control the throttling of the wb in balance_dirty_pages().
> >>
> >> I may have missed something above, but the problem seems clear IMHO.
> >> Looking forward to any comment or suggestion.
> >
> > I'm not sure I follow what is the problem here. In balance_dirty_pages() we
> > have global dirty threshold (tracked in gdtc) and memcg dirty threshold
> > (tracked in mdtc). This can get further scaled down based on the device
> > throughput (that is the difference between 'thresh' and 'wb_thresh') but
> > that is independent of the way mdtc->thresh is calculated. So if we provide
> > a different way of calculating mdtc->thresh, technically everything should
> > keep working as is.
> >
>
> Sorry for the confusion. The problem is exactly how to calculate mdtc->thresh.
>
> R - A - B
> \-- C
>
> Case 1:
>
> Suppose C has "memory.dirty_limit" set; should we just use it as mdtc->thresh?
> I see the current code also considers the system clean memory in mdtc_calc_avail();
> do we also need to consider that when "memory.dirty_limit" is set?
>
> Case 2:
>
> Suppose C doesn't have "memory.dirty_limit" set, but A does; how do we
> calculate C's mdtc->thresh?
>
> Obviously we can't directly use A's "memory.dirty_limit", since it should be
> distributed between B and C?
>
> So the problem is that I don't know how to reasonably calculate the mdtc->thresh,
> even given a memcg tree in which some memcgs have "memory.dirty_limit" set. :\

I see, thanks for the explanation. I guess we would need to redistribute
dirtiable memory in a hierarchical manner like we do for other resources.
The most natural would probably be to somehow follow the behavior of other
memcg memory limits - but I know close to nothing about how that works, so
Michal would have to elaborate.
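
Very roughly, something like the following could be one way to do it (purely
a hypothetical sketch assuming a new memory.dirty_limit knob; all names are
made up and the capping against system clean memory is left out):

struct sketch_memcg {
	struct sketch_memcg *parent;
	unsigned long dirty_limit;	/* pages, 0 == not set */
	unsigned long avail;		/* dirtyable memory as computed today */
};

static unsigned long sketch_effective_dirty_thresh(struct sketch_memcg *memcg,
						   unsigned long dirty_ratio)
{
	/* default: today's behaviour, avail * dirty_ratio */
	unsigned long thresh = memcg->avail * dirty_ratio / 100;
	struct sketch_memcg *iter;

	/* clamp by every limit configured up the hierarchy */
	for (iter = memcg; iter; iter = iter->parent)
		if (iter->dirty_limit && iter->dirty_limit < thresh)
			thresh = iter->dirty_limit;

	return thresh;
}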

Honza
--
Jan Kara <[email protected]>
SUSE Labs, CR