LinuxLists.cc - Re: [PATCH 0/2] hugetlb memcg accounting

2023-09-26 23:14:35

Subject: Re: [PATCH 0/2] hugetlb memcg accounting

Hi Frank,

On Tue, Sep 26, 2023 at 01:50:10PM -0700, Frank van der Linden wrote:
> On Tue, Sep 26, 2023 at 12:49 PM Nhat Pham wrote:
> >
> > Currently, hugetlb memory usage is not accounted for in the memory
> > controller, which could lead to memory overprotection for cgroups with
> > hugetlb-backed memory. This has been observed in our production system.
> >
> > This patch series rectifies this issue by charging the memcg when the
> > hugetlb folio is allocated, and uncharging when the folio is freed. In
> > addition, a new selftest is added to demonstrate and verify this new
> > behavior.
> >
> > Nhat Pham (2):
> >   hugetlb: memcg: account hugetlb-backed memory in memory controller
> >   selftests: add a selftest to verify hugetlb usage in memcg
> >
> >  MAINTAINERS                                   |   2 +
> >  fs/hugetlbfs/inode.c                          |   2 +-
> >  include/linux/hugetlb.h                       |   6 +-
> >  include/linux/memcontrol.h                    |   8 +
> >  mm/hugetlb.c                                  |  23 +-
> >  mm/memcontrol.c                               |  40 ++++
> >  tools/testing/selftests/cgroup/.gitignore     |   1 +
> >  tools/testing/selftests/cgroup/Makefile       |   2 +
> >  .../selftests/cgroup/test_hugetlb_memcg.c     | 222 ++++++++++++++++++
> >  9 files changed, 297 insertions(+), 9 deletions(-)
> >  create mode 100644 tools/testing/selftests/cgroup/test_hugetlb_memcg.c
> >
> > --
> > 2.34.1
> >
>
> We've had this behavior at Google for a long time, and we're actually
> getting rid of it. hugetlb pages are a precious resource that should
> be accounted for separately. They are not just any memory; they are
> physically contiguous memory. Charging them the same as any other
> region of the same size ended up not making sense, especially not for
> larger hugetlb page sizes.

I agree that on one hand they're a limited resource, and some form of
access control makes sense. There is the hugetlb cgroup controller
that allows for tracking and apportioning them per cgroup.

But on the other hand they're also still just host memory that a
cgroup can consume, which is the domain of memcg.

Those two aren't mutually exclusive. It makes sense to set a limit on
a cgroup's access to hugetlb. It also makes sense that the huge pages
a cgroup IS using count toward its memory limit, where they displace
file cache and anonymous pages under pressure. Or that they're
considered when determining degree of protection from global pressure.
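
As a rough illustration of how the two controls can coexist, here is a
toy model in Python. The class and attribute names are made up for this
sketch and are not kernel interfaces. It also simplifies one thing on
purpose: a real memcg would try to reclaim before refusing a charge,
while this model simply refuses.

```python
class ToyGroup:
    """Toy cgroup with two independent counters (illustration only)."""

    def __init__(self, hugetlb_limit, memory_limit):
        self.hugetlb_limit = hugetlb_limit  # access control on huge pages
        self.memory_limit = memory_limit    # overall host memory budget
        self.hugetlb_usage = 0
        self.memory_usage = 0

    def charge(self, nbytes, is_hugetlb):
        """Return True if the charge fits under every limit that applies."""
        # A hugetlb charge must first fit under the hugetlb-specific limit.
        if is_hugetlb and self.hugetlb_usage + nbytes > self.hugetlb_limit:
            return False
        # Every charge, hugetlb or not, counts toward the memory limit.
        if self.memory_usage + nbytes > self.memory_limit:
            return False
        if is_hugetlb:
            self.hugetlb_usage += nbytes
        self.memory_usage += nbytes
        return True
```

In this model, huge pages a group is actually using shrink the room left
for its other memory, while the hugetlb limit still caps how many huge
pages it can take.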

This isn't unlike e.g. kernel memory being special in that it consumes
lowmem and isn't reclaimable. This shows up in total memory, while it
was also tracked and limited separately. (Separate control disappeared
for lack of a good enforcement mechanism - but hugetlb has that.)

The fact that memory consumed by hugetlb is currently not considered
inside memcg (host memory accounting and control) is inconsistent. It
has been quite confusing to our service owners and complicating things
for our containers team.

For example, jobs need to describe their overall memory size in order
to match them to machines and co-locate them. Based on that parameter,
the container limits as well as protection (memory.low) from global
pressure are set. Currently, there are ugly hacks in place to subtract
any hugetlb quota from the container config - otherwise the limits and
protection settings would be way too big if a large part of the host
memory consumption isn't a part of it. This has been quite cumbersome
and error-prone.
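
To make the subtraction hack concrete, here is a minimal sketch in
Python. The function name, its parameters, and the example numbers are
invented for illustration and are not taken from any real container
tooling. Setting memory.max and memory.low to the same budget is also an
assumption of the sketch.

```python
GIB = 1 << 30


def container_memcg_settings(job_memory_bytes, hugetlb_quota_bytes,
                             memcg_counts_hugetlb):
    """Derive memcg settings for a job (hypothetical helper).

    job_memory_bytes: overall memory size the job declares, hugetlb included.
    hugetlb_quota_bytes: the part of that size backed by hugetlb pages.
    memcg_counts_hugetlb: whether memcg charges hugetlb pages to the cgroup.
    """
    if hugetlb_quota_bytes > job_memory_bytes:
        raise ValueError("hugetlb quota exceeds the job's declared memory size")
    if memcg_counts_hugetlb:
        # hugetlb usage is charged to memcg: use the declared size as-is.
        budget = job_memory_bytes
    else:
        # The hack: memcg never sees the hugetlb pages, so take them out.
        budget = job_memory_bytes - hugetlb_quota_bytes
    return {"memory.max": budget, "memory.low": budget}
```

For a job that declares 64 GiB, 48 GiB of it hugetlb, the sketch yields a
16 GiB budget when memcg does not count hugetlb and the full 64 GiB when
it does.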

> Additionally, if this behavior is changed just like that, there will
> be quite a few workloads that will break badly because they'll hit
> their limits immediately - imagine a container that uses 1G hugetlb
> pages to back something large (a database, a VM), and 'plain' memory
> for control processes.

I agree with you there. This could break existing setups. We've added
new consumers to memcg in the past without thinking too hard about it,
but hugetlb often makes up a huge portion of a group's overall memory
footprint. And we *do* have those subtraction hacks in place that
would then fail in the other direction.

A cgroup mountflag makes sense for this to ease the transition.

2023-09-28 03:47:07

by Michal Hocko

Subject: Re: [PATCH 0/2] hugetlb memcg accounting

On Tue 26-09-23 18:14:14, Johannes Weiner wrote:
[...]
> The fact that memory consumed by hugetlb is currently not considered
> inside memcg (host memory accounting and control) is inconsistent. It
> has been quite confusing to our service owners and complicating things
> for our containers team.

I do understand how that is confusing and inconsistent as well. Hugetlb
has been bringing that throughout its existence, I am afraid.

As noted in another reply, though, I am not sure the hugetlb pool can be
reasonably incorporated with sane semantics. Neither the regular
allocation nor the hugetlb reservation/actual use can fall back to the
pool of the other. This makes them two different things, each hitting its
own failure cases that require dedicated handling.

Just off the top of my head, these are cases I do not see an easy way out of:
        - hugetlb charge failure has two failure modes: pool empty
          or memcg limit reached. The former is not recoverable and
          should fail without any further intervention; the latter might
          benefit from reclaim.
        - !hugetlb memory charge failure cannot consider any hugetlb
          pages. They are implicitly memory.min protected, so it is
          impossible to manage reclaim protection without
          knowledge of the hugetlb use.
        - there is no way to control the hugetlb pool distribution by
          memcg limits. How do we distinguish reservations from actual
          use?
        - the pre-allocated pool is consuming memory without any actual
          owner until it is actually used, and even that has two stages
          (reserved and really used). This makes it really hard to
          manage memory as a whole when there is a considerable amount of
          hugetlb memory preallocated.

I am pretty sure there are many more interesting cases.
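
The two failure modes of a hugetlb charge can be illustrated with a toy
model in Python. All names here are invented for the sketch and none of
them are kernel code. Checking the pool before the memcg limit is an
assumption of the model, not a statement about how the patch orders
things.

```python
from enum import Enum


class ChargeResult(Enum):
    OK = "ok"
    POOL_EMPTY = "pool empty"            # reclaim cannot help here
    MEMCG_LIMIT = "memcg limit reached"  # reclaim in the cgroup might help


class ToyCgroup:
    def __init__(self, limit, usage=0):
        self.limit = limit
        self.usage = usage


class ToyHugetlbPool:
    def __init__(self, free_pages, page_size):
        self.free_pages = free_pages
        self.page_size = page_size


def charge_hugetlb_page(pool, cgroup):
    """Take one page from the pool and charge it to the cgroup."""
    if pool.free_pages == 0:
        # Regular memory cannot refill the pool, so fail right away.
        return ChargeResult.POOL_EMPTY
    if cgroup.usage + pool.page_size > cgroup.limit:
        # The pool has pages, but the cgroup has no room left for one.
        return ChargeResult.MEMCG_LIMIT
    pool.free_pages -= 1
    cgroup.usage += pool.page_size
    return ChargeResult.OK
```

A caller has to tell the two failures apart: only MEMCG_LIMIT is worth
retrying after reclaiming from the cgroup.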
--
Michal Hocko
SUSE Labs