The kvmalloc* allocation functions can fall back to vmalloc allocations,
and do so more often on long-running machines. In addition, the kernel
has __GFP_ACCOUNT kvmalloc* calls. So, on long-running machines,
memory.stat often does not give a complete picture of which types of
memory are charged to the memcg. So add a per-memcg vmalloc stat.
Signed-off-by: Shakeel Butt <[email protected]>
---
Changelog since v1:
- page_memcg() within rcu lock as suggested by Muchun.
Documentation/admin-guide/cgroup-v2.rst | 3 +++
include/linux/memcontrol.h | 21 +++++++++++++++++++++
mm/memcontrol.c | 1 +
mm/vmalloc.c | 5 +++++
4 files changed, 30 insertions(+)
diff --git a/Documentation/admin-guide/cgroup-v2.rst b/Documentation/admin-guide/cgroup-v2.rst
index 82c8dc91b2be..5aa368d165da 100644
--- a/Documentation/admin-guide/cgroup-v2.rst
+++ b/Documentation/admin-guide/cgroup-v2.rst
@@ -1314,6 +1314,9 @@ PAGE_SIZE multiple when read back.
sock (npn)
Amount of memory used in network transmission buffers
+ vmalloc (npn)
+	  Amount of memory used for vmap-backed memory.
+
shmem
Amount of cached filesystem data that is swap-backed,
such as tmpfs, shm segments, shared anonymous mmap()s
diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
index d76dad703580..b72d75141e12 100644
--- a/include/linux/memcontrol.h
+++ b/include/linux/memcontrol.h
@@ -33,6 +33,7 @@ enum memcg_stat_item {
MEMCG_SWAP = NR_VM_NODE_STAT_ITEMS,
MEMCG_SOCK,
MEMCG_PERCPU_B,
+ MEMCG_VMALLOC,
MEMCG_NR_STAT,
};
@@ -944,6 +945,21 @@ static inline void mod_memcg_state(struct mem_cgroup *memcg,
local_irq_restore(flags);
}
+static inline void mod_memcg_page_state(struct page *page,
+ int idx, int val)
+{
+ struct mem_cgroup *memcg;
+
+ if (mem_cgroup_disabled())
+ return;
+
+ rcu_read_lock();
+ memcg = page_memcg(page);
+ if (memcg)
+ mod_memcg_state(memcg, idx, val);
+ rcu_read_unlock();
+}
+
static inline unsigned long memcg_page_state(struct mem_cgroup *memcg, int idx)
{
return READ_ONCE(memcg->vmstats.state[idx]);
@@ -1399,6 +1415,11 @@ static inline void mod_memcg_state(struct mem_cgroup *memcg,
{
}
+static inline void mod_memcg_page_state(struct page *page,
+ int idx, int val)
+{
+}
+
static inline unsigned long memcg_page_state(struct mem_cgroup *memcg, int idx)
{
return 0;
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 7ae77608847e..7027a3cc416f 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -1375,6 +1375,7 @@ static const struct memory_stat memory_stats[] = {
{ "pagetables", NR_PAGETABLE },
{ "percpu", MEMCG_PERCPU_B },
{ "sock", MEMCG_SOCK },
+ { "vmalloc", MEMCG_VMALLOC },
{ "shmem", NR_SHMEM },
{ "file_mapped", NR_FILE_MAPPED },
{ "file_dirty", NR_FILE_DIRTY },
diff --git a/mm/vmalloc.c b/mm/vmalloc.c
index eb6e527a6b77..af67ce4fd402 100644
--- a/mm/vmalloc.c
+++ b/mm/vmalloc.c
@@ -39,6 +39,7 @@
#include <linux/uaccess.h>
#include <linux/hugetlb.h>
#include <linux/sched/mm.h>
+#include <linux/memcontrol.h>
#include <asm/tlbflush.h>
#include <asm/shmparam.h>
@@ -2626,6 +2627,9 @@ static void __vunmap(const void *addr, int deallocate_pages)
unsigned int page_order = vm_area_page_order(area);
int i;
+ mod_memcg_page_state(area->pages[0], MEMCG_VMALLOC,
+ -(int)area->nr_pages);
+
for (i = 0; i < area->nr_pages; i += 1U << page_order) {
struct page *page = area->pages[i];
@@ -2964,6 +2968,7 @@ static void *__vmalloc_area_node(struct vm_struct *area, gfp_t gfp_mask,
page_order, nr_small_pages, area->pages);
atomic_long_add(area->nr_pages, &nr_vmalloc_pages);
+ mod_memcg_page_state(area->pages[0], MEMCG_VMALLOC, area->nr_pages);
/*
* If not enough pages were obtained to accomplish an
--
2.34.1.307.g9b7440fafd-goog
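To illustrate the kind of allocation the new counter tracks, here is a
minimal sketch (not part of the patch; the struct, its size, and the
function names are made up for the example):

#include <linux/mm.h>
#include <linux/slab.h>

/* Hypothetical table, sized so kvmalloc() may fall back to vmalloc(). */
struct big_table {
	unsigned long entries[1 << 16];
};

static struct big_table *alloc_big_table(void)
{
	/*
	 * Charged to the current memcg. If the kmalloc attempt fails,
	 * e.g. due to fragmentation on a long-running machine, the
	 * fallback vmalloc pages show up under the new "vmalloc" entry
	 * in memory.stat.
	 */
	return kvmalloc(sizeof(struct big_table),
			GFP_KERNEL | __GFP_ACCOUNT);
}

static void free_big_table(struct big_table *t)
{
	kvfree(t);	/* handles both kmalloc- and vmalloc-backed memory */
}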
On Wed, Dec 22, 2021 at 1:25 PM Shakeel Butt <[email protected]> wrote:
>
[...]
> @@ -2626,6 +2627,9 @@ static void __vunmap(const void *addr, int deallocate_pages)
> unsigned int page_order = vm_area_page_order(area);
> int i;
>
> + mod_memcg_page_state(area->pages[0], MEMCG_VMALLOC,
> + -(int)area->nr_pages);
The cast can go away since the type of the 3rd parameter of
mod_memcg_page_state() is int, not long. I suggest removing it.
Anyway,
Reviewed-by: Muchun Song <[email protected]>
Thanks.
On Tue, Dec 21, 2021 at 9:50 PM Muchun Song <[email protected]> wrote:
>
> On Wed, Dec 22, 2021 at 1:25 PM Shakeel Butt <[email protected]> wrote:
> >
> [...]
> > @@ -2626,6 +2627,9 @@ static void __vunmap(const void *addr, int deallocate_pages)
> > unsigned int page_order = vm_area_page_order(area);
> > int i;
> >
> > + mod_memcg_page_state(area->pages[0], MEMCG_VMALLOC,
> > + -(int)area->nr_pages);
>
> The cast can go away since the type of the 3rd parameter of
> mod_memcg_page_state() is int, not long. I suggest removing it.
> Anyway,
Andrew, please remove this cast of the third param when you add this
patch to the mm tree.
>
> Reviewed-by: Muchun Song <[email protected]>
Thanks.
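For reference, the call without the cast would read as follows (just an
illustration of the suggested change):

	mod_memcg_page_state(area->pages[0], MEMCG_VMALLOC,
			     -area->nr_pages);

nr_pages is unsigned, so the negation wraps, but the implicit conversion
to the int parameter recovers the intended negative count.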
On Tue, Dec 21, 2021 at 09:24:57PM -0800, Shakeel Butt wrote:
> The kvmalloc* allocation functions can fall back to vmalloc allocations,
> and do so more often on long-running machines. In addition, the kernel
> has __GFP_ACCOUNT kvmalloc* calls. So, on long-running machines,
> memory.stat often does not give a complete picture of which types of
> memory are charged to the memcg. So add a per-memcg vmalloc stat.
>
> Signed-off-by: Shakeel Butt <[email protected]>
>
> ---
> Changelog since v1:
> - page_memcg() within rcu lock as suggested by Muchun.
>
> Documentation/admin-guide/cgroup-v2.rst | 3 +++
> include/linux/memcontrol.h | 21 +++++++++++++++++++++
> mm/memcontrol.c | 1 +
> mm/vmalloc.c | 5 +++++
> 4 files changed, 30 insertions(+)
>
> diff --git a/Documentation/admin-guide/cgroup-v2.rst b/Documentation/admin-guide/cgroup-v2.rst
> index 82c8dc91b2be..5aa368d165da 100644
> --- a/Documentation/admin-guide/cgroup-v2.rst
> +++ b/Documentation/admin-guide/cgroup-v2.rst
> @@ -1314,6 +1314,9 @@ PAGE_SIZE multiple when read back.
> sock (npn)
> Amount of memory used in network transmission buffers
>
> + vmalloc (npn)
> +	  Amount of memory used for vmap-backed memory.
> +
It's a bit sad that this counter will partially intersect with others
(e.g. percpu and stack), but I don't see how it can be easily fixed.
But I agree that the counter is useful and worth adding.
Acked-by: Roman Gushchin <[email protected]>
Thanks!
On Wed, Dec 22, 2021 at 06:02:55PM -0800, Roman Gushchin wrote:
> It's a bit sad that this counter will partially intersect with others
> (e.g. percpu and stack), but I don't see how it can be easily fixed.
If it's worth fixing, we could add a VM_NO_ACCT flag?
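A rough sketch of that idea; VM_NO_ACCT is hypothetical (it does not
exist upstream) and the bit value is picked only for illustration.
Callers whose pages are accounted through another layer would set the
flag and vmalloc would skip the stat update:

/* include/linux/vmalloc.h: next free VM_* bit, for illustration only */
#define VM_NO_ACCT		0x00000800	/* pages accounted by the caller */

	/* __vmalloc_area_node(): guard the stat update with the new flag */
	if (!(area->flags & VM_NO_ACCT))
		mod_memcg_page_state(area->pages[0], MEMCG_VMALLOC,
				     area->nr_pages);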
On Wed, Dec 22, 2021 at 6:03 PM Roman Gushchin <[email protected]> wrote:
>
> On Tue, Dec 21, 2021 at 09:24:57PM -0800, Shakeel Butt wrote:
> > The kvmalloc* allocation functions can fall back to vmalloc allocations,
> > and do so more often on long-running machines. In addition, the kernel
> > has __GFP_ACCOUNT kvmalloc* calls. So, on long-running machines,
> > memory.stat often does not give a complete picture of which types of
> > memory are charged to the memcg. So add a per-memcg vmalloc stat.
> >
> > Signed-off-by: Shakeel Butt <[email protected]>
> >
> > ---
> > Changelog since v1:
> > - page_memcg() within rcu lock as suggested by Muchun.
> >
> > Documentation/admin-guide/cgroup-v2.rst | 3 +++
> > include/linux/memcontrol.h | 21 +++++++++++++++++++++
> > mm/memcontrol.c | 1 +
> > mm/vmalloc.c | 5 +++++
> > 4 files changed, 30 insertions(+)
> >
> > diff --git a/Documentation/admin-guide/cgroup-v2.rst b/Documentation/admin-guide/cgroup-v2.rst
> > index 82c8dc91b2be..5aa368d165da 100644
> > --- a/Documentation/admin-guide/cgroup-v2.rst
> > +++ b/Documentation/admin-guide/cgroup-v2.rst
> > @@ -1314,6 +1314,9 @@ PAGE_SIZE multiple when read back.
> > sock (npn)
> > Amount of memory used in network transmission buffers
> >
> > + vmalloc (npn)
> > +	  Amount of memory used for vmap-backed memory.
> > +
>
> > It's a bit sad that this counter will partially intersect with others
> (e.g. percpu and stack), but I don't see how it can be easily fixed.
I checked those again. For vmap-based stacks we do vmalloc() without
__GFP_ACCOUNT and charge the stack pages afterwards through the
memcg_kmem_charge_page() interface.

I think we do the same for percpu as well, i.e. we do not use
__GFP_ACCOUNT for the underlying memory but later use the objcg
infrastructure to charge at a finer grain.

So, I think at least percpu and stack should not intersect with the
vmalloc per-memcg stat.
>
> But I agree that the counter is useful and worth adding.
>
> Acked-by: Roman Gushchin <[email protected]>
Thanks.
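The vmap-stack pattern described above, condensed into a sketch loosely
following memcg_charge_kernel_stack() in kernel/fork.c (simplified, not
the exact upstream code):

static int charge_vmap_stack(struct vm_struct *vm)
{
	unsigned int i;
	int ret;

	/* The stack pages were vmalloc'ed without __GFP_ACCOUNT ... */
	for (i = 0; i < vm->nr_pages; i++) {
		/* ... and are charged to the memcg one by one here. */
		ret = memcg_kmem_charge_page(vm->pages[i], GFP_KERNEL, 0);
		if (ret)
			return ret;
	}
	return 0;
}

Because the charge is applied (and reverted) outside of vmalloc, these
pages do not feed the vmalloc stat, which is what keeps the two counters
disjoint.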
On Thu, Dec 23, 2021 at 09:44:26AM -0800, Shakeel Butt wrote:
> On Wed, Dec 22, 2021 at 6:03 PM Roman Gushchin <[email protected]> wrote:
> >
> > On Tue, Dec 21, 2021 at 09:24:57PM -0800, Shakeel Butt wrote:
> > > The kvmalloc* allocation functions can fall back to vmalloc allocations,
> > > and do so more often on long-running machines. In addition, the kernel
> > > has __GFP_ACCOUNT kvmalloc* calls. So, on long-running machines,
> > > memory.stat often does not give a complete picture of which types of
> > > memory are charged to the memcg. So add a per-memcg vmalloc stat.
> > >
> > > Signed-off-by: Shakeel Butt <[email protected]>
> > >
> > > ---
> > > Changelog since v1:
> > > - page_memcg() within rcu lock as suggested by Muchun.
> > >
> > > Documentation/admin-guide/cgroup-v2.rst | 3 +++
> > > include/linux/memcontrol.h | 21 +++++++++++++++++++++
> > > mm/memcontrol.c | 1 +
> > > mm/vmalloc.c | 5 +++++
> > > 4 files changed, 30 insertions(+)
> > >
> > > diff --git a/Documentation/admin-guide/cgroup-v2.rst b/Documentation/admin-guide/cgroup-v2.rst
> > > index 82c8dc91b2be..5aa368d165da 100644
> > > --- a/Documentation/admin-guide/cgroup-v2.rst
> > > +++ b/Documentation/admin-guide/cgroup-v2.rst
> > > @@ -1314,6 +1314,9 @@ PAGE_SIZE multiple when read back.
> > > sock (npn)
> > > Amount of memory used in network transmission buffers
> > >
> > > + vmalloc (npn)
> > > +	  Amount of memory used for vmap-backed memory.
> > > +
> >
> > It's a bit sad that this counter will partially intersect with others
> > (e.g. percpu and stack), but I don't see how it can be easily fixed.
>
> I checked those again. For vmap-based stacks we do vmalloc() without
> __GFP_ACCOUNT and charge the stack pages afterwards through the
> memcg_kmem_charge_page() interface.
>
> I think we do the same for percpu as well, i.e. we do not use
> __GFP_ACCOUNT for the underlying memory but later use the objcg
> infrastructure to charge at a finer grain.
>
> So, I think at least percpu and stack should not intersect with the
> vmalloc per-memcg stat.
Ah, ok then, thanks for checking!
In general, it seems that if an allocation is backed by multiple layers
of mm code (e.g. percpu on top of vmalloc), we choose one layer to do
both memcg statistics and accounting. This makes total sense.
Thanks!
On Tue 21-12-21 21:24:57, Shakeel Butt wrote:
> The kvmalloc* allocation functions can fallback to vmalloc allocations
> and more often on long running machines. In addition the kernel does
> have __GFP_ACCOUNT kvmalloc* calls. So, often on long running machines,
> the memory.stat does not tell the complete picture which type of memory
> is charged to the memcg. So add a per-memcg vmalloc stat.
>
> Signed-off-by: Shakeel Butt <[email protected]>
The counter is useful IMHO. I just have one implementation-specific
question.
[...]
> @@ -2626,6 +2627,9 @@ static void __vunmap(const void *addr, int deallocate_pages)
> unsigned int page_order = vm_area_page_order(area);
> int i;
>
> + mod_memcg_page_state(area->pages[0], MEMCG_VMALLOC,
> + -(int)area->nr_pages);
> +
> for (i = 0; i < area->nr_pages; i += 1U << page_order) {
> struct page *page = area->pages[i];
>
> @@ -2964,6 +2968,7 @@ static void *__vmalloc_area_node(struct vm_struct *area, gfp_t gfp_mask,
> page_order, nr_small_pages, area->pages);
>
> atomic_long_add(area->nr_pages, &nr_vmalloc_pages);
> + mod_memcg_page_state(area->pages[0], MEMCG_VMALLOC, area->nr_pages);
>
> /*
> * If not enough pages were obtained to accomplish an
Is it safe to assume that the whole area is always charged to the same
memcg? I am not really deeply familiar with vmalloc internals but is it
possible that an area could get resized/partially reused with a
different charging context?
A clarification comment would be really handy.
--
Michal Hocko
SUSE Labs
Hi Michal,
On Mon, Dec 27, 2021 at 2:48 AM Michal Hocko <[email protected]> wrote:
>
[...]
> > atomic_long_add(area->nr_pages, &nr_vmalloc_pages);
> > + mod_memcg_page_state(area->pages[0], MEMCG_VMALLOC, area->nr_pages);
> >
> > /*
> > * If not enough pages were obtained to accomplish an
>
> Is it safe to assume that the whole area is always charged to the same
> memcg? I am not really deeply familiar with vmalloc internals but is it
> possible that an area could get resized/partially reused with a
> different charging context?
From what I understand, vmalloc areas are not resized or partially
reused at the moment. There is some ongoing discussion on caching them,
but caching would also require updating the accounting.

Regarding the whole area being charged to the same memcg, the only way
it can get charged to different memcgs is if the process in which the
allocations are happening is migrated to a different memcg. We can
resolve this by traversing the pages in the area->pages array (and
using lruvec-based stats instead).

I did contemplate making this a lruvec stat but decided to start
simple; if we ever need a per-node stat we can easily move to
lruvec-based stats. Let me know what you think.
thanks,
Shakeel
On Thu 30-12-21 11:06:14, Shakeel Butt wrote:
> Hi Michal,
>
> On Mon, Dec 27, 2021 at 2:48 AM Michal Hocko <[email protected]> wrote:
> >
> [...]
> > > atomic_long_add(area->nr_pages, &nr_vmalloc_pages);
> > > + mod_memcg_page_state(area->pages[0], MEMCG_VMALLOC, area->nr_pages);
> > >
> > > /*
> > > * If not enough pages were obtained to accomplish an
> >
> > Is it safe to assume that the whole area is always charged to the same
> > memcg? I am not really deeply familiar with vmalloc internals but is it
> > possible that an area could get resized/partially reused with a
> > different charging context?
>
> From what I understand, vmalloc areas are not resized or partially
> reused at the moment. There is some ongoing discussion on caching them,
> but caching would also require updating the accounting.
OK.
> Regarding the whole area being charged to the same memcg, the only way
> it can get charged to different memcgs is if the process in which the
> allocations are happening is migrated to a different memcg. We can
> resolve this by traversing the pages in the area->pages array (and
> using lruvec-based stats instead).
I haven't even thought of task migration. I expect this is not a very
likely scenario, but it would lead to weird numbers and I guess we would
like to prevent that. A loop over all pages in the area, accounting each
of them separately, should be good enough to cover that as well as the
existing problem that has already been observed by syzbot.
> I did contemplate making this a lruvec stat but decided to start
> simple; if we ever need a per-node stat we can easily move to
> lruvec-based stats. Let me know what you think.
I am not really sure here. For now I would go with page-by-page stats
gathering.
Thanks!
--
Michal Hocko
SUSE Labs
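To make the agreed direction concrete, here is a sketch of page-by-page
stats gathering (an illustration of the discussion above, not the posted
v2 patch; it steps over the head page of each order-sized block, since
that is where the charge lives):

	/* __vmalloc_area_node(): charge once the pages are in place. */
	for (i = 0; i < area->nr_pages; i += 1U << page_order)
		mod_memcg_page_state(area->pages[i], MEMCG_VMALLOC,
				     1 << page_order);

	/* __vunmap(): the matching decrement inside the existing free loop. */
	for (i = 0; i < area->nr_pages; i += 1U << page_order) {
		struct page *page = area->pages[i];

		mod_memcg_page_state(page, MEMCG_VMALLOC,
				     -(1 << page_order));
		__free_pages(page, page_order);
	}

Each block is charged against its own page's memcg, so a task migrating
between allocations no longer skews another cgroup's counter.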