2023-09-27 01:24:57

by Nhat Pham

Subject: [PATCH 0/2] hugetlb memcg accounting

Currently, hugetlb memory usage is not accounted for in the memory
controller, which could lead to memory overprotection for cgroups with
hugetlb-backed memory. This has been observed in our production system.

This patch series rectifies this issue by charging the memcg when the
hugetlb folio is allocated, and uncharging when the folio is freed. In
addition, a new selftest is added to demonstrate and verify this new
behavior.
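
At a high level, the charging looks roughly like the sketch below. This is
an illustration only, not the exact patch: the entry point name
mem_cgroup_hugetlb_charge_folio() matches the call chain discussed later in
the thread, but its body and the call sites shown here are simplified
assumptions.

  /* mm/memcontrol.c -- sketch of a hugetlb-specific charge entry point */
  int mem_cgroup_hugetlb_charge_folio(struct folio *folio, gfp_t gfp)
  {
          struct mem_cgroup *memcg;
          int ret;

          if (mem_cgroup_disabled())
                  return 0;

          /* charge the allocating task's memcg, like any other user memory */
          memcg = get_mem_cgroup_from_mm(current->mm);
          ret = charge_memcg(folio, memcg, gfp);  /* existing memcg charge path */
          mem_cgroup_put(memcg);
          return ret;
  }

  /* mm/hugetlb.c -- in alloc_hugetlb_folio(), once a folio has been obtained */
          if (mem_cgroup_hugetlb_charge_folio(folio, GFP_KERNEL)) {
                  /* back out the reservation/allocation on charge failure */
          }

  /* mm/hugetlb.c -- in free_huge_folio(), before the folio goes back to the pool */
          mem_cgroup_uncharge(folio);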

Nhat Pham (2):
hugetlb: memcg: account hugetlb-backed memory in memory controller
selftests: add a selftest to verify hugetlb usage in memcg

MAINTAINERS | 2 +
fs/hugetlbfs/inode.c | 2 +-
include/linux/hugetlb.h | 6 +-
include/linux/memcontrol.h | 8 +
mm/hugetlb.c | 23 +-
mm/memcontrol.c | 40 ++++
tools/testing/selftests/cgroup/.gitignore | 1 +
tools/testing/selftests/cgroup/Makefile | 2 +
.../selftests/cgroup/test_hugetlb_memcg.c | 222 ++++++++++++++++++
9 files changed, 297 insertions(+), 9 deletions(-)
create mode 100644 tools/testing/selftests/cgroup/test_hugetlb_memcg.c

--
2.34.1


2023-09-27 09:12:23

by Nhat Pham

Subject: Re: [PATCH 0/2] hugetlb memcg accounting

On Tue, Sep 26, 2023 at 1:50 PM Frank van der Linden <[email protected]> wrote:
>
> On Tue, Sep 26, 2023 at 12:49 PM Nhat Pham <[email protected]> wrote:
> >
> > Currently, hugetlb memory usage is not accounted for in the memory
> > controller, which could lead to memory overprotection for cgroups with
> > hugetlb-backed memory. This has been observed in our production system.
> >
> > This patch series rectifies this issue by charging the memcg when the
> > hugetlb folio is allocated, and uncharging when the folio is freed. In
> > addition, a new selftest is added to demonstrate and verify this new
> > behavior.
> >
> > Nhat Pham (2):
> > hugetlb: memcg: account hugetlb-backed memory in memory controller
> > selftests: add a selftest to verify hugetlb usage in memcg
> >
> > MAINTAINERS | 2 +
> > fs/hugetlbfs/inode.c | 2 +-
> > include/linux/hugetlb.h | 6 +-
> > include/linux/memcontrol.h | 8 +
> > mm/hugetlb.c | 23 +-
> > mm/memcontrol.c | 40 ++++
> > tools/testing/selftests/cgroup/.gitignore | 1 +
> > tools/testing/selftests/cgroup/Makefile | 2 +
> > .../selftests/cgroup/test_hugetlb_memcg.c | 222 ++++++++++++++++++
> > 9 files changed, 297 insertions(+), 9 deletions(-)
> > create mode 100644 tools/testing/selftests/cgroup/test_hugetlb_memcg.c
> >
> > --
> > 2.34.1
> >
>
> We've had this behavior at Google for a long time, and we're actually
> getting rid of it. hugetlb pages are a precious resource that should
> be accounted for separately. They are not just any memory, they are
> physically contiguous memory; charging them the same as any other
> region of the same size ended up not making sense, especially not for
> larger hugetlb page sizes.

I agree hugetlb is a special kind of resource. But as Johannes
pointed out, it is still a form of memory. Semantically, its usage
should be modulated by the memory controller.

We do have the HugeTLB controller for hugetlb-specific
restrictions, and where appropriate we definitely should take
advantage of it. But it does not fix the hole we have in memory
usage reporting, or in (over)protection and reclaim dynamics.
Hence the need for the userspace hack (as Johannes described):
manually adding/subtracting HugeTLB usage where applicable.
This is not only inelegant, but also cumbersome and buggy.
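
For illustration, the userspace arithmetic amounts to something like the
sketch below (assuming cgroup v2 and 2MB huge pages; the cgroup path and the
counter selection are placeholders, not our actual tooling):

  #include <stdio.h>

  /* read a single integer counter from a cgroup v2 file */
  static long long read_counter(const char *path)
  {
          long long val = 0;
          FILE *f = fopen(path, "r");

          if (!f)
                  return 0;
          if (fscanf(f, "%lld", &val) != 1)
                  val = 0;
          fclose(f);
          return val;
  }

  int main(void)
  {
          const char *grp = "/sys/fs/cgroup/mygroup";  /* placeholder cgroup */
          char path[256];
          long long mem, huge;

          snprintf(path, sizeof(path), "%s/memory.current", grp);
          mem = read_counter(path);

          /* memory.current does not include hugetlb, so add it back by hand */
          snprintf(path, sizeof(path), "%s/hugetlb.2MB.current", grp);
          huge = read_counter(path);

          printf("effective usage: %lld bytes\n", mem + huge);
          return 0;
  }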

>
> Additionally, if this behavior is changed just like that, there will
> be quite a few workloads that will break badly because they'll hit
> their limits immediately - imagine a container that uses 1G hugetlb
> pages to back something large (a database, a VM), and 'plain' memory
> for control processes.
>
> What do your workloads do? Is it not possible for you to account for
> hugetlb pages separately? Sure, it can be annoying to have to deal
> with 2 separate totals that you need to take into account, but again,
> hugetlb pages are a resource that is best dealt with separately.
>

Johannes beat me to it - he described our use case, and what we
have hacked together to temporarily get around the issue.

A knob/flag to turn on/off this behavior sounds good to me.

> - Frank
Thanks for the comments, Frank!

2023-09-27 16:16:32

by Michal Hocko

Subject: Re: [PATCH 0/2] hugetlb memcg accounting

On Tue 26-09-23 12:49:47, Nhat Pham wrote:
> Currently, hugetlb memory usage is not accounted for in the memory
> controller, which could lead to memory overprotection for cgroups with
> hugetlb-backed memory. This has been observed in our production system.
>
> This patch series rectifies this issue by charging the memcg when the
> hugetlb folio is allocated, and uncharging when the folio is freed. In
> addition, a new selftest is added to demonstrate and verify this new
> behavior.

The primary reason why hugetlb is living outside of memcg (and the core
MM as well) is that it doesn't really fit the whole scheme. In several
aspects. First and foremost, it is an independently managed resource
with its own pool management, use and lifetime.

There is no notion of memory reclaim and this makes a huge difference
for the pool that might consume a considerable amount of memory. While
this is the case for many kernel allocations as well, they usually do not
consume considerable portions of the accounted memory. This makes it
really tricky to handle limit enforcement gracefully.

Another important aspect comes from the lifetime semantics, where proper
reservation accounting and managing needs to happen at mmap time rather
than in the usual allocation path. While pages are allocated they do not
belong to anybody, and only later, at #PF time (or read for the fs
backed mapping), is the ownership established. That makes it really hard
to manage memory as a whole under the memcg anyway, as a large part of
that pool sits without an owner yet cannot be used for any other
purpose.

These and more reasons were behind the earlier decision to have a
dedicated hugetlb controller.

I will also Nack hugetlb pages being accounted by
default. This would break any setups which mix normal and hugetlb memory
with memcg limits applied.
--
Michal Hocko
SUSE Labs

2023-09-27 23:34:27

by Nhat Pham

Subject: Re: [PATCH 0/2] hugetlb memcg accounting

On Wed, Sep 27, 2023 at 4:21 AM Michal Hocko <[email protected]> wrote:
>
> On Tue 26-09-23 12:49:47, Nhat Pham wrote:
> > Currently, hugetlb memory usage is not accounted for in the memory
> > controller, which could lead to memory overprotection for cgroups with
> > hugetlb-backed memory. This has been observed in our production system.
> >
> > This patch series rectifies this issue by charging the memcg when the
> > hugetlb folio is allocated, and uncharging when the folio is freed. In
> > addition, a new selftest is added to demonstrate and verify this new
> > behavior.
>
> The primary reason why hugetlb is living outside of memcg (and the core
> MM as well) is that it doesn't really fit the whole scheme. In several
> aspects. First and foremost, it is an independently managed resource
> with its own pool management, use and lifetime.
>
> There is no notion of memory reclaim and this makes a huge difference
> for the pool that might consume a considerable amount of memory. While
> this is the case for many kernel allocations as well, they usually do not
> consume considerable portions of the accounted memory. This makes it
> really tricky to handle limit enforcement gracefully.
>
> Another important aspect comes from the lifetime semantics, where proper
> reservation accounting and managing needs to happen at mmap time rather
> than in the usual allocation path. While pages are allocated they do not
> belong to anybody, and only later, at #PF time (or read for the fs
> backed mapping), is the ownership established. That makes it really hard
> to manage memory as a whole under the memcg anyway, as a large part of
> that pool sits without an owner yet cannot be used for any other
> purpose.
>
> These and more reasons were behind the earlier decision to have a
> dedicated hugetlb controller.

While I believe all of these are true, I think they are not reasons not to
have memcg accounting. As everyone has pointed out, memcg
accounting by itself cannot handle all situations - it is not a fix-all.
Other mechanisms, such as the HugeTLB controller, could be the better
solution in these cases, and hugetlb memcg accounting is definitely not
an attempt to infringe upon these control domains.

However, memcg accounting is still necessary for certain memory limit
enforcement to work cleanly and properly - such as the use cases we have
(as Johannes has beautifully described). It will certainly help
administrators simplify their control workflow a lot (assuming we do not
surprise them with this change - a new mount option to opt in should
help with the transition).
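
As a rough sketch of what such an opt-in could look like on the kernel side,
assuming it ends up as a cgroup2 mount option that sets a root flag (the flag
and helper names below are placeholders, not a final interface):

  /* hypothetical helper; the flag name is a placeholder for the mount option */
  static inline bool memcg_accounts_hugetlb(void)
  {
          return !mem_cgroup_disabled() &&
                 (cgrp_dfl_root.flags & CGRP_ROOT_MEMORY_HUGETLB_ACCOUNTING);
  }

  /* the hugetlb charge path would then bail out early when the option is off */
          if (!memcg_accounts_hugetlb())
                  return 0;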

>
> I will also Nack hugetlb pages being accounted by
> default. This would break any setups which mix normal and hugetlb memory
> with memcg limits applied.

Got it! I'll introduce some opt-in mechanisms in the next version. This is
my oversight.


> --
> Michal Hocko
> SUSE Labs

2023-09-28 00:21:10

by Johannes Weiner

Subject: Re: [PATCH 0/2] hugetlb memcg accounting

On Wed, Sep 27, 2023 at 01:21:20PM +0200, Michal Hocko wrote:
> On Tue 26-09-23 12:49:47, Nhat Pham wrote:
> > Currently, hugetlb memory usage is not accounted for in the memory
> > controller, which could lead to memory overprotection for cgroups with
> > hugetlb-backed memory. This has been observed in our production system.
> >
> > This patch series rectifies this issue by charging the memcg when the
> > hugetlb folio is allocated, and uncharging when the folio is freed. In
> > addition, a new selftest is added to demonstrate and verify this new
> > behavior.
>
> The primary reason why hugetlb is living outside of memcg (and the core
> MM as well) is that it doesn't really fit the whole scheme. In several
> aspects. First and foremost, it is an independently managed resource
> with its own pool management, use and lifetime.

Honestly, the simpler explanation is that few people have used hugetlb
in regular, containerized non-HPC workloads.

Hugetlb has historically been much more special, and it retains a
specialness that warrants e.g. the hugetlb cgroup container. But it
has also made strides with hugetlb_cma, migratability, madvise support
etc. that allows much more on-demand use. It's no longer the case that
you just put a static pool of memory aside during boot and only a few
blessed applications are using it.

For example, we're using hugetlb_cma very broadly with generic
containers. The CMA region is fully usable by movable non-huge stuff
until huge pages are allocated in it. With the hugetlb controller you
can define a maximum number of hugetlb pages that can be used per
container. But what if that container isn't using any? Why shouldn't
it be allowed to use its overall memory allowance for anon and cache
instead?

With hugetlb being more dynamic, it becomes the same problem that we
had with dedicated tcp and kmem pools. It didn't make sense to fail a
random slab allocation when you still have memory headroom or can
reclaim some cache. Nowadays, the same problem applies to hugetlb.

> There is no notion of memory reclaim and this makes a huge difference
> for the pool that might consume a considerable amount of memory. While
> this is the case for many kernel allocations as well, they usually do not
> consume considerable portions of the accounted memory. This makes it
> really tricky to handle limit enforcement gracefully.

I don't think that's true. For some workloads, network buffers can
absolutely dominate. And they work just fine with cgroup limits. It
isn't a problem that they aren't reclaimable themselves, it's just
important that they put pressure on stuff that is.

So that if you use 80% hugetlb, the other memory is forced to stay in
the remaining 20%, or it OOMs; and that if you don't use hugetlb, the
group is still allowed to use the full 100% of its host memory
allowance, without requiring some outside agent continuously
monitoring and adjusting the container limits.

> Another important aspect comes from the lifetime semantics, where proper
> reservation accounting and managing needs to happen at mmap time rather
> than in the usual allocation path. While pages are allocated they do not
> belong to anybody, and only later, at #PF time (or read for the fs
> backed mapping), is the ownership established. That makes it really hard
> to manage memory as a whole under the memcg anyway, as a large part of
> that pool sits without an owner yet cannot be used for any other
> purpose.
>
> These and more reasons were behind the earlier decision to have a
> dedicated hugetlb controller.

Yeah, there is still a need for an actual hugetlb controller for the
static use cases (and even for dynamic access to hugetlb_cma).

But you need memcg coverage as well for the more dynamic cases to work
as expected. And having that doesn't really interfere with the static
usecases.

> I will also Nack hugetlb pages being accounted by
> default. This would break any setups which mix normal and hugetlb memory
> with memcg limits applied.

Yes, no disagreement there. I think we're all on the same page that this
needs to be opt-in, say with a cgroup mount option.

2023-09-28 00:32:56

by Johannes Weiner

Subject: Re: [PATCH 0/2] hugetlb memcg accounting

On Wed, Sep 27, 2023 at 02:50:10PM +0200, Michal Hocko wrote:
> On Tue 26-09-23 18:14:14, Johannes Weiner wrote:
> [...]
> > The fact that memory consumed by hugetlb is currently not considered
> > inside memcg (host memory accounting and control) is inconsistent. It
> > has been quite confusing to our service owners and complicating things
> > for our containers team.
>
> I do understand how that is confusing and inconsistent as well. Hugetlb
> has been bringing confusion throughout its existence, I am afraid.
>
> As noted in the other reply, though, I am not sure the hugetlb pool can be
> reasonably incorporated with a sane semantic. Neither the regular
> allocation nor the hugetlb reservation/actual use can fall back to the
> pool of the other. This makes them 2 different things, each hitting their
> own failure cases that require dedicated handling.
>
> Just off the top of my head, these are cases I do not see an easy way out from:
> - hugetlb charge failure has two failure modes - pool empty
> or memcg limit reached. The former is not recoverable and
> should fail without any further intervention; the latter might
> benefit from reclaiming.
> - !hugetlb memory charge failure cannot consider any hugetlb
> pages - they are implicit memory.min protection so it is
> impossible to manage reclaim protection without having
> knowledge of the hugetlb use.
> - there is no way to control the hugetlb pool distribution by
> memcg limits. How do we distinguish reservations from actual
> use?
> - pre-allocated pool is consuming memory without any actual
> owner until it is actually used and even that has two stages
> (reserved and really used). This makes it really hard to
> manage memory as a whole when there is a considerable amount of
> hugetlb memory preallocated.

It's important to distinguish hugetlb access policy from memory use
policy. This patch isn't about hugetlb access, it's about general
memory use.

Hugetlb access policy is a separate domain with separate
answers. Preallocating is a privileged operation; for access control
there is the hugetlb cgroup controller etc.

What's missing is that once you get past the access restrictions and
legitimately get your hands on huge pages, that memory use gets
reflected in memory.current and exerts pressure on *other* memory
inside the group, such as anon or optimistic cache allocations.

Note that hugetlb *can* be allocated on demand. It's unexpected that
when an application optimistically allocates a couple of 2M hugetlb
pages those aren't reflected in its memory.current. The same is true
for hugetlb_cma. If the gigantic pages aren't currently allocated to a
cgroup, that CMA memory can be used for movable memory elsewhere.

The points you and Frank raise are reasons and scenarios where
additional hugetlb access control is necessary - preallocation,
limited availability of 1G pages etc. But they're not reasons against
charging faulted in hugetlb to the memcg *as well*.

My point is we need both. One to manage competition over hugetlb,
because it has unique limitations. The other to manage competition
over host memory which hugetlb is a part of.

Here is a usecase from our fleet.

Imagine a configuration with two 32G containers. The machine is booted
with hugetlb_cma=6G, and each container may or may not use up to 3
gigantic pages, depending on the workload within it. The rest is anon,
cache, slab, etc. You set the hugetlb cgroup limit of each cgroup to
3G to enforce hugetlb fairness. But how do you configure memory.max to
keep *overall* consumption, including anon, cache, slab etc. fair?

If used hugetlb is charged, you can just set memory.max=32G regardless
of the workload inside.

Without it, you'd have to constantly poll hugetlb usage and readjust
memory.max!

2023-09-28 02:50:58

by Frank van der Linden

Subject: Re: [PATCH 0/2] hugetlb memcg accounting

On Tue, Sep 26, 2023 at 12:49 PM Nhat Pham <[email protected]> wrote:
>
> Currently, hugetlb memory usage is not accounted for in the memory
> controller, which could lead to memory overprotection for cgroups with
> hugetlb-backed memory. This has been observed in our production system.
>
> This patch series rectifies this issue by charging the memcg when the
> hugetlb folio is allocated, and uncharging when the folio is freed. In
> addition, a new selftest is added to demonstrate and verify this new
> behavior.
>
> Nhat Pham (2):
> hugetlb: memcg: account hugetlb-backed memory in memory controller
> selftests: add a selftest to verify hugetlb usage in memcg
>
> MAINTAINERS | 2 +
> fs/hugetlbfs/inode.c | 2 +-
> include/linux/hugetlb.h | 6 +-
> include/linux/memcontrol.h | 8 +
> mm/hugetlb.c | 23 +-
> mm/memcontrol.c | 40 ++++
> tools/testing/selftests/cgroup/.gitignore | 1 +
> tools/testing/selftests/cgroup/Makefile | 2 +
> .../selftests/cgroup/test_hugetlb_memcg.c | 222 ++++++++++++++++++
> 9 files changed, 297 insertions(+), 9 deletions(-)
> create mode 100644 tools/testing/selftests/cgroup/test_hugetlb_memcg.c
>
> --
> 2.34.1
>

We've had this behavior at Google for a long time, and we're actually
getting rid of it. hugetlb pages are a precious resource that should
be accounted for separately. They are not just any memory, they are
physically contiguous memory; charging them the same as any other
region of the same size ended up not making sense, especially not for
larger hugetlb page sizes.

Additionally, if this behavior is changed just like that, there will
be quite a few workloads that will break badly because they'll hit
their limits immediately - imagine a container that uses 1G hugetlb
pages to back something large (a database, a VM), and 'plain' memory
for control processes.

What do your workloads do? Is it not possible for you to account for
hugetlb pages separately? Sure, it can be annoying to have to deal
with 2 separate totals that you need to take into account, but again,
hugetlb pages are a resource that is best dealt with separately.

- Frank

2023-09-28 04:06:54

by Nhat Pham

Subject: Re: [PATCH 0/2] hugetlb memcg accounting

On Tue, Sep 26, 2023 at 12:49 PM Nhat Pham <[email protected]> wrote:
>
> Currently, hugetlb memory usage is not accounted for in the memory
> controller, which could lead to memory overprotection for cgroups with
> hugetlb-backed memory. This has been observed in our production system.
>
> This patch series rectifies this issue by charging the memcg when the
> hugetlb folio is allocated, and uncharging when the folio is freed. In
> addition, a new selftest is added to demonstrate and verify this new
> behavior.
>
> Nhat Pham (2):
> hugetlb: memcg: account hugetlb-backed memory in memory controller
> selftests: add a selftest to verify hugetlb usage in memcg
>
> MAINTAINERS | 2 +
> fs/hugetlbfs/inode.c | 2 +-
> include/linux/hugetlb.h | 6 +-
> include/linux/memcontrol.h | 8 +
> mm/hugetlb.c | 23 +-
> mm/memcontrol.c | 40 ++++
> tools/testing/selftests/cgroup/.gitignore | 1 +
> tools/testing/selftests/cgroup/Makefile | 2 +
> .../selftests/cgroup/test_hugetlb_memcg.c | 222 ++++++++++++++++++
> 9 files changed, 297 insertions(+), 9 deletions(-)
> create mode 100644 tools/testing/selftests/cgroup/test_hugetlb_memcg.c
>
> --
> 2.34.1

Thanks for all the comments and suggestions everyone!
FYI, I have sent out a second version of the patch series with the new
mount flag:

https://lore.kernel.org/lkml/[email protected]/T/#t

Feel free to check it out and discuss it over there too!

2023-09-28 06:43:05

by Nhat Pham

Subject: Re: [PATCH 0/2] hugetlb memcg accounting

On Wed, Sep 27, 2023 at 9:44 AM Johannes Weiner <[email protected]> wrote:
>
> On Wed, Sep 27, 2023 at 02:50:10PM +0200, Michal Hocko wrote:
> > On Tue 26-09-23 18:14:14, Johannes Weiner wrote:
> > [...]
> > > The fact that memory consumed by hugetlb is currently not considered
> > > inside memcg (host memory accounting and control) is inconsistent. It
> > > has been quite confusing to our service owners and complicating things
> > > for our containers team.
> >
> > I do understand how that is confusing and inconsistent as well. Hugetlb
> > has been bringing confusion throughout its existence, I am afraid.
> >
> > As noted in the other reply, though, I am not sure the hugetlb pool can be
> > reasonably incorporated with a sane semantic. Neither the regular
> > allocation nor the hugetlb reservation/actual use can fall back to the
> > pool of the other. This makes them 2 different things, each hitting their
> > own failure cases that require dedicated handling.
> >
> > Just off the top of my head, these are cases I do not see an easy way out from:
> > - hugetlb charge failure has two failure modes - pool empty
> > or memcg limit reached. The former is not recoverable and
> > should fail without any further intervention; the latter might
> > benefit from reclaiming.
> > - !hugetlb memory charge failure cannot consider any hugetlb
> > pages - they are implicit memory.min protection so it is
> > impossible to manage reclaim protection without having
> > knowledge of the hugetlb use.
> > - there is no way to control the hugetlb pool distribution by
> > memcg limits. How do we distinguish reservations from actual
> > use?
> > - pre-allocated pool is consuming memory without any actual
> > owner until it is actually used and even that has two stages
> > (reserved and really used). This makes it really hard to
> > manage memory as a whole when there is a considerable amount of
> > hugetlb memory preallocated.
>
> It's important to distinguish hugetlb access policy from memory use
> policy. This patch isn't about hugetlb access, it's about general
> memory use.
>
> Hugetlb access policy is a separate domain with separate
> answers. Preallocating is a privileged operation; for access control
> there is the hugetlb cgroup controller etc.
>
> What's missing is that once you get past the access restrictions and
> legitimately get your hands on huge pages, that memory use gets
> reflected in memory.current and exerts pressure on *other* memory
> inside the group, such as anon or optimistic cache allocations.
>
> Note that hugetlb *can* be allocated on demand. It's unexpected that
> when an application optimistically allocates a couple of 2M hugetlb
> pages those aren't reflected in its memory.current. The same is true
> for hugetlb_cma. If the gigantic pages aren't currently allocated to a
> cgroup, that CMA memory can be used for movable memory elsewhere.
>
> The points you and Frank raise are reasons and scenarios where
> additional hugetlb access control is necessary - preallocation,
> limited availability of 1G pages etc. But they're not reasons against
> charging faulted in hugetlb to the memcg *as well*.
>
> My point is we need both. One to manage competition over hugetlb,
> because it has unique limitations. The other to manage competition
> over host memory which hugetlb is a part of.
>
> Here is a usecase from our fleet.
>
> Imagine a configuration with two 32G containers. The machine is booted
> with hugetlb_cma=6G, and each container may or may not use up to 3
> gigantic pages, depending on the workload within it. The rest is anon,
> cache, slab, etc. You set the hugetlb cgroup limit of each cgroup to
> 3G to enforce hugetlb fairness. But how do you configure memory.max to
> keep *overall* consumption, including anon, cache, slab etc. fair?
>
> If used hugetlb is charged, you can just set memory.max=32G regardless
> of the workload inside.
>
> Without it, you'd have to constantly poll hugetlb usage and readjust
> memory.max!

Yep, and I'd like to add that this could and has caused issues in
our production system when there is a delay in correcting memory limits
(low or max). The userspace agent in charge of correcting
these only runs periodically, and between consecutive runs the system
could be in an over/underprotected state. An instantaneous charge
towards the memory controller would close this gap.

I think we need both a HugeTLB controller and memory controller
accounting.

2023-09-28 07:19:19

by Roman Gushchin

Subject: Re: [PATCH 0/2] hugetlb memcg accounting

On Wed, Sep 27, 2023 at 02:47:38PM -0400, Johannes Weiner wrote:
> On Wed, Sep 27, 2023 at 01:21:20PM +0200, Michal Hocko wrote:
> > On Tue 26-09-23 12:49:47, Nhat Pham wrote:
> > > Currently, hugetlb memory usage is not accounted for in the memory
> > > controller, which could lead to memory overprotection for cgroups with
> > > hugetlb-backed memory. This has been observed in our production system.
> > >
> > > This patch series rectifies this issue by charging the memcg when the
> > > hugetlb folio is allocated, and uncharging when the folio is freed. In
> > > addition, a new selftest is added to demonstrate and verify this new
> > > behavior.
> >
> > The primary reason why hugetlb is living outside of memcg (and the core
> > MM as well) is that it doesn't really fit the whole scheme. In several
> > aspects. First and foremost, it is an independently managed resource
> > with its own pool management, use and lifetime.
>
> Honestly, the simpler explanation is that few people have used hugetlb
> in regular, containerized non-HPC workloads.
>
> Hugetlb has historically been much more special, and it retains a
> specialness that warrants e.g. the hugetlb cgroup controller. But it
> has also made strides with hugetlb_cma, migratability, madvise support
> etc. that allow much more on-demand use. It's no longer the case that
> you just put a static pool of memory aside during boot and only a few
> blessed applications are using it.
>
> For example, we're using hugetlb_cma very broadly with generic
> containers. The CMA region is fully usable by movable non-huge stuff
> until huge pages are allocated in it. With the hugetlb controller you
> can define a maximum number of hugetlb pages that can be used per
> container. But what if that container isn't using any? Why shouldn't
> it be allowed to use its overall memory allowance for anon and cache
> instead?

Cool, I remember proposing hugetlb memcg stats several years ago and if
I remember correctly at that time you were opposing it based on the idea
that huge pages are not a part of the overall memcg flow: they are not
subject to memory pressure, can't be evicted, etc. And thp's were seen
as a long-term replacement. Even though all of the above is true, hugetlb has
its niche and I don't think thp's will realistically replace it any time
soon.

So I'm glad to see this effort (and am very supportive of it) to make hugetlb
more convenient and transparent for an end user.

Thanks!

2023-09-28 22:27:36

by Johannes Weiner

Subject: Re: [PATCH 0/2] hugetlb memcg accounting

On Wed, Sep 27, 2023 at 02:37:47PM -0700, Roman Gushchin wrote:
> On Wed, Sep 27, 2023 at 02:47:38PM -0400, Johannes Weiner wrote:
> > On Wed, Sep 27, 2023 at 01:21:20PM +0200, Michal Hocko wrote:
> > > On Tue 26-09-23 12:49:47, Nhat Pham wrote:
> > > > Currently, hugetlb memory usage is not accounted for in the memory
> > > > controller, which could lead to memory overprotection for cgroups with
> > > > hugetlb-backed memory. This has been observed in our production system.
> > > >
> > > > This patch series rectifies this issue by charging the memcg when the
> > > > hugetlb folio is allocated, and uncharging when the folio is freed. In
> > > > addition, a new selftest is added to demonstrate and verify this new
> > > > behavior.
> > >
> > > The primary reason why hugetlb is living outside of memcg (and the core
> > > MM as well) is that it doesn't really fit the whole scheme. In several
> > > aspects. First and foremost, it is an independently managed resource
> > > with its own pool management, use and lifetime.
> >
> > Honestly, the simpler explanation is that few people have used hugetlb
> > in regular, containerized non-HPC workloads.
> >
> > Hugetlb has historically been much more special, and it retains a
> > specialness that warrants e.g. the hugetlb cgroup controller. But it
> > has also made strides with hugetlb_cma, migratability, madvise support
> > etc. that allow much more on-demand use. It's no longer the case that
> > you just put a static pool of memory aside during boot and only a few
> > blessed applications are using it.
> >
> > For example, we're using hugetlb_cma very broadly with generic
> > containers. The CMA region is fully usable by movable non-huge stuff
> > until huge pages are allocated in it. With the hugetlb controller you
> > can define a maximum number of hugetlb pages that can be used per
> > container. But what if that container isn't using any? Why shouldn't
> > it be allowed to use its overall memory allowance for anon and cache
> > instead?
>
> Cool, I remember proposing hugetlb memcg stats several years ago and if
> I remember correctly at that time you were opposing it based on the idea
> that huge pages are not a part of the overall memcg flow: they are not
> subject to memory pressure, can't be evicted, etc. And thp's were seen
> as a long-term replacement. Even though all of the above is true, hugetlb has
> its niche and I don't think thp's will realistically replace it any time
> soon.

Yeah, Michal's arguments very much reminded me of my stance then. I
stand corrected.

I'm still hopeful that we can make 2M work transparently. I'd expect
1G to remain in the hugetlb domain for some time to come, but even
those are mostly dynamic now with your hugetlb_cma feature!

> So I'm glad to see this effort (and am very supportive of it) to make hugetlb
> more convenient and transparent for an end user.

Thanks!

2023-10-02 07:01:12

by Mike Kravetz

Subject: Re: [PATCH 0/2] hugetlb memcg accounting

On 09/27/23 14:47, Johannes Weiner wrote:
> On Wed, Sep 27, 2023 at 01:21:20PM +0200, Michal Hocko wrote:
> > On Tue 26-09-23 12:49:47, Nhat Pham wrote:
>
> So that if you use 80% hugetlb, the other memory is forced to stay in
> the remaining 20%, or it OOMs; and that if you don't use hugetlb, the
> group is still allowed to use the full 100% of its host memory
> allowance, without requiring some outside agent continuously
> monitoring and adjusting the container limits.

Jumping in late here as I was traveling last week. In addition, I want
to state my limited cgroup knowledge up front.

I was thinking of your scenario above a little differently. Suppose a
group is up and running at almost 100% memory usage. However, the majority
of that memory is reclaimable. Now, someone wants to allocate a 2M hugetlb
page. There is not 2MB free, but we could easily reclaim 2MB to make room
for the hugetlb page. I may be missing something, but I do not see how that
is going to happen. It seems like we would really want that behavior.
--
Mike Kravetz

2023-10-02 22:26:01

by Johannes Weiner

Subject: Re: [PATCH 0/2] hugetlb memcg accounting

On Sun, Oct 01, 2023 at 04:27:30PM -0700, Mike Kravetz wrote:
> On 09/27/23 14:47, Johannes Weiner wrote:
> > On Wed, Sep 27, 2023 at 01:21:20PM +0200, Michal Hocko wrote:
> > > On Tue 26-09-23 12:49:47, Nhat Pham wrote:
> >
> > So that if you use 80% hugetlb, the other memory is forced to stay in
> > the remaining 20%, or it OOMs; and that if you don't use hugetlb, the
> > group is still allowed to use the full 100% of its host memory
> > allowance, without requiring some outside agent continuously
> > monitoring and adjusting the container limits.
>
> Jumping in late here as I was traveling last week. In addition, I want
> to state my limited cgroup knowledge up front.
>
> I was thinking of your scenario above a little differently. Suppose a
> group is up and running at almost 100% memory usage. However, the majority
> of that memory is reclaimable. Now, someone wants to allocate a 2M hugetlb
> page. There is not 2MB free, but we could easily reclaim 2MB to make room
> for the hugetlb page. I may be missing something, but I do not see how that
> is going to happen. It seems like we would really want that behavior.

But that is actually what it does, no?

alloc_hugetlb_folio
mem_cgroup_hugetlb_charge_folio
charge_memcg
try_charge
!page_counter_try_charge ?
!try_to_free_mem_cgroup_pages ?
mem_cgroup_oom

So it does reclaim when the hugetlb hits the cgroup limit. And if that
fails to make room, it OOMs the cgroup.

Or maybe I'm missing something?

2023-10-02 22:50:35

by Michal Hocko

Subject: Re: [PATCH 0/2] hugetlb memcg accounting

On Mon 02-10-23 10:42:50, Johannes Weiner wrote:
> On Sun, Oct 01, 2023 at 04:27:30PM -0700, Mike Kravetz wrote:
> > On 09/27/23 14:47, Johannes Weiner wrote:
> > > On Wed, Sep 27, 2023 at 01:21:20PM +0200, Michal Hocko wrote:
> > > > On Tue 26-09-23 12:49:47, Nhat Pham wrote:
> > >
> > > So that if you use 80% hugetlb, the other memory is forced to stay in
> > > the remaining 20%, or it OOMs; and that if you don't use hugetlb, the
> > > group is still allowed to use the full 100% of its host memory
> > > allowance, without requiring some outside agent continuously
> > > monitoring and adjusting the container limits.
> >
> > Jumping in late here as I was traveling last week. In addition, I want
> > to state my limited cgroup knowledge up front.
> >
> > I was thinking of your scenario above a little differently. Suppose a
> > group is up and running at almost 100% memory usage. However, the majority
> > of that memory is reclaimable. Now, someone wants to allocate a 2M hugetlb
> > page. There is not 2MB free, but we could easily reclaim 2MB to make room
> > for the hugetlb page. I may be missing something, but I do not see how that
> > is going to happen. It seems like we would really want that behavior.
>
> But that is actually what it does, no?
>
> alloc_hugetlb_folio
> mem_cgroup_hugetlb_charge_folio
> charge_memcg
> try_charge
> !page_counter_try_charge ?
> !try_to_free_mem_cgroup_pages ?
> mem_cgroup_oom
>
> So it does reclaim when the hugetlb hits the cgroup limit. And if that
> fails to make room, it OOMs the cgroup.
>
> Or maybe I'm missing something?

I believe that Mike alludes to what I have pointed out in another email:
http://lkml.kernel.org/r/ZRrI90KcRBwVZn/[email protected] and a situation
where the hugetlb request results in an actual hugetlb allocation rather
than consumption from the pre-allocated pool. In that case memcg is not
involved because the charge happens only after the allocation happens.
That btw. means that this request could disrupt a different memcg even
if the current one is at the limit or it could be reclaimed instead.

Also, there is no OOM as hugetlb pages are costly requests and we do not
invoke the oom killer.

--
Michal Hocko
SUSE Labs

2023-10-02 22:55:51

by Johannes Weiner

Subject: Re: [PATCH 0/2] hugetlb memcg accounting

On Mon, Oct 02, 2023 at 04:58:21PM +0200, Michal Hocko wrote:
> Also, there is no OOM as hugetlb pages are costly requests and we do not
> invoke the oom killer.

Ah good point.

That seems like a policy choice we could make. However, since hugetlb
users are already set up for, and have come to expect, SIGBUS for physical
failure as well as for hugetlb_cgroup limits, we should have memcg follow
established precedent and leave the OOM killer out.

Agree that a sentence in the changelog about this makes sense though.