From: Chengming Zhou
To: Michal Hocko
Cc: LKML, linux-mm, jack@suse.cz, Tejun Heo, Johannes Weiner,
    Christoph Hellwig, shr@devkernel.io, neilb@suse.de
Subject: Re: Question: memcg dirty throttle caused by low per-memcg dirty thresh
Date: Wed, 22 Nov 2023 22:59:02 +0800
References: <109029e0-1772-4102-a2a8-ab9076462454@linux.dev>

On 2023/11/22 18:02, Michal Hocko wrote:
> On Wed 22-11-23 17:38:25, Chengming Zhou wrote:
>> Hello all,
>>
>> Sorry to bother you. We encountered a problem related to the memcg dirty
>> throttle after migrating from cgroup v1 to v2, so we want to ask for some
>> comments or suggestions.
>>
>> 1. Problem
>>
>> We have the "containerd" service running under system.slice, with its
>> memory.max set to 5GB. It is constantly throttled in
>> balance_dirty_pages(), since the memcg has more dirty memory than the
>> memcg dirty thresh.
>>
>> We didn't have this problem on cgroup v1, because cgroup v1 has neither
>> per-memcg writeback nor a per-memcg dirty thresh; only the global dirty
>> thresh is checked in balance_dirty_pages().
>
> Yes, v1 didn't have any sensible IO throttling, so we had to rely on an
> ugly hack to wait for writeback to finish from the memcg memory reclaim
> path. This is really suboptimal because it makes memcg reclaim stalls
> hard to predict. So it is essentially only a poor man's OOM prevention.
>
> V2 on the other hand has memcg-aware dirty memory throttling, which is a
> much better solution as it throttles at the moment the memory is being
> dirtied.
>
> Why do you consider that to be a problem? Constant throttling as you
> describe might be a result of the limit being too small?

Right, v2 is better at limiting the dirty memory in one memcg, which is
better for the memcg reclaim path.

The problem we encountered is that the global dirty_ratio (20%) is too
small for some cgroup workloads. For example, when "containerd" prepares
a big image file, we want most of its memory.max (5GB in our case) to be
usable for dirty pages, to speed up the process.

And yes, this may back up more dirty pages in that memcg, and the longer
writeback IO may then interfere with other memcgs' writeback IO. But we
also have per-blkcg IO throttling, so it's not that bad?

Right now we have to raise the global dirty_ratio to achieve this, but
that is not good for every memcg workload, and it could be bad for some
memcg reclaim paths, as you noted.

>>
>> 2. Thinking
>>
>> So we wonder if we could support a per-memcg dirty thresh interface.
>> Now the memcg dirty thresh is just calculated from memcg max * ratio,
>> which can be set from /proc/sys/vm/dirty_ratio.
>
> In general I would recommend using dirty_bytes instead, as the ratio
> doesn't scale all that well on larger systems.
>
>> We have to set it to 60 instead of the default 20 as a workaround now,
>> but we worry about the potential side effects.
>>
>> If we could support a per-memcg dirty thresh interface, we could set
>> some containers to a much higher dirty_ratio, especially for hungry
>> dirtier workloads like "containerd".
>
> But why would you want that? If you allow heavy writers to dirty a lot
> of memory then flushing that to the backing store will take more time.
> That could starve small writers as well, because they could end up
> queued behind a huge amount of data to be flushed.

Yes, we also need per-blkcg IO throttling to distribute the writeback IO
bandwidth.

> I am no expert on writeback, so others could give you better arguments,
> but from my POV dirty data flushing and throttling are mostly a global
> mechanism to optimize the IO pattern, and are a function of the storage
> much more than of the workload.

Maybe the per-bdi ratio is worth trying instead of the global
dirty_ratio, which affects all devices.

> If your heavy writer hits throttling too much, then either the limit is
> too low or you should start background flushing earlier.

The global dirty_ratio is too low for "containerd" in this case, so we
want more control over the memcg dirty_ratio.

Thanks!
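
To make the numbers in this thread concrete, below is a small userspace
sketch of the simplified arithmetic mentioned above (memcg dirty thresh
roughly equals memory.max * dirty_ratio / 100). It is only an
illustration under that simplification, not the kernel's actual
calculation in balance_dirty_pages(), and the helper name is made up:

#include <stdio.h>

/*
 * Back-of-the-envelope estimate only: thresh ~= memcg limit * ratio / 100.
 * The real kernel computation takes more inputs into account.
 */
static unsigned long long memcg_dirty_thresh(unsigned long long memcg_max,
					     unsigned int dirty_ratio)
{
	return memcg_max * dirty_ratio / 100;
}

int main(void)
{
	const unsigned long long GiB = 1024ULL * 1024 * 1024;
	const unsigned long long memcg_max = 5 * GiB;	/* containerd memory.max */

	/* Default vm.dirty_ratio = 20 vs. the 60 we use as a workaround. */
	printf("thresh at ratio 20: %llu MiB\n",
	       memcg_dirty_thresh(memcg_max, 20) / (1024 * 1024));
	printf("thresh at ratio 60: %llu MiB\n",
	       memcg_dirty_thresh(memcg_max, 60) / (1024 * 1024));
	return 0;
}

With memory.max = 5GB, a ratio of 20 allows roughly 1GB of dirty pages in
the memcg, while the ratio of 60 we set as a workaround allows roughly
3GB, which is consistent with the constant throttling we see at the
default setting.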