2022-11-26 13:39:20

by Yongqiang Liu

Subject: [QUESTION] memcg page_counter seems broken in MADV_DONTNEED with THP enabled

Hi,

We use mm_counter to track how much physical memory a process uses,
while the page_counter of a memcg counts how much physical memory a
cgroup uses. If a cgroup contains only one process, the two look almost
the same. But with THP enabled, memory.usage_in_bytes of the memcg can
sometimes be twice (or more) the Rss reported in
/proc/[pid]/smaps_rollup, as follows:

[root@localhost sda]# cat /sys/fs/cgroup/memory/test/memory.usage_in_bytes
1080930304
[root@localhost sda]# cat /sys/fs/cgroup/memory/test/cgroup.procs
1290
[root@localhost sda]# cat /proc/1290/smaps_rollup
55ba80600000-ffffffffff601000 ---p 00000000 00:00 0                     
[rollup]
Rss:              500648 kB
Pss:              498337 kB
Shared_Clean:       2732 kB
Shared_Dirty:          0 kB
Private_Clean:       364 kB
Private_Dirty:    497552 kB
Referenced:       500648 kB
Anonymous:        492016 kB
LazyFree:              0 kB
AnonHugePages:    129024 kB
ShmemPmdMapped:        0 kB
Shared_Hugetlb:        0 kB
Private_Hugetlb:       0 kB
Swap:                  0 kB
SwapPss:               0 kB
Locked:                0 kB
THPeligible:    0

I found that the difference arises because __split_huge_pmd decreases
the mm_counter, but the memcg page_counter is not decreased as long as
the refcount of the head page is not zero. The call path is as follows:

do_madvise
  madvise_dontneed_free
    zap_page_range
      unmap_single_vma
        zap_pud_range
          zap_pmd_range
            __split_huge_pmd
              __split_huge_pmd_locked
                __mod_lruvec_page_state
            zap_pte_range
              add_mm_rss_vec
                add_mm_counter                -> decreases the mm_counter
      tlb_finish_mmu
        arch_tlb_finish_mmu
          tlb_flush_mmu_free
            free_pages_and_swap_cache
              release_pages
                folio_put_testzero(page)      -> not zero, skip
                  continue;
                __folio_put_large
                  free_transhuge_page
                    free_compound_page
                      mem_cgroup_uncharge
                        page_counter_uncharge -> decreases the page_counter

The node page stats shown in meminfo were also decreased.
__split_huge_pmd seems to free no physical memory unless the whole THP
is freed. I am confused about which counter reflects the true physical
memory used by a process.
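
For reference, a simplified sketch of the kind of test I am running
(not the exact workload; it assumes 2 MiB PMD-sized THP and that the
process already sits in the test memcg) looks roughly like this:

/* Sketch: fault in THP-backed anonymous memory, then MADV_DONTNEED part
 * of every huge page and compare Rss in /proc/<pid>/smaps_rollup with
 * memory.usage_in_bytes of the cgroup. */
#include <stdio.h>
#include <string.h>
#include <unistd.h>
#include <sys/mman.h>

#define HPAGE_SIZE (2UL << 20)          /* PMD-sized THP */
#define NR_HPAGES  128UL                /* 256 MiB mapping */

int main(void)
{
        size_t len = NR_HPAGES * HPAGE_SIZE;
        char *raw, *buf;

        /* over-allocate so we can pick a 2 MiB aligned start address */
        raw = mmap(NULL, len + HPAGE_SIZE, PROT_READ | PROT_WRITE,
                   MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
        if (raw == MAP_FAILED) {
                perror("mmap");
                return 1;
        }
        buf = (char *)(((unsigned long)raw + HPAGE_SIZE - 1) &
                       ~(HPAGE_SIZE - 1));
        madvise(buf, len, MADV_HUGEPAGE);

        /* touch every byte so the mapping is faulted in as huge pages */
        memset(buf, 0x5a, len);

        /* zap the second half of every huge page: the PMD is split and
         * the mm_counter (Rss) drops, but memory.usage_in_bytes does not */
        for (size_t i = 0; i < NR_HPAGES; i++)
                madvise(buf + i * HPAGE_SIZE + HPAGE_SIZE / 2,
                        HPAGE_SIZE / 2, MADV_DONTNEED);

        printf("pid %d: compare smaps_rollup and memory.usage_in_bytes now\n",
               (int)getpid());
        pause();
        return 0;
}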


Kind regards,

Yongqiang Liu


2022-11-28 20:37:06

by Yang Shi

Subject: Re: [QUESTION] memcg page_counter seems broken in MADV_DONTNEED with THP enabled

On Sat, Nov 26, 2022 at 5:10 AM Yongqiang Liu <[email protected]> wrote:
>
> Hi,
>
> We use mm_counter to track how much physical memory a process uses,
> while the page_counter of a memcg counts how much physical memory a
> cgroup uses. If a cgroup contains only one process, the two look almost
> the same. But with THP enabled, memory.usage_in_bytes of the memcg can
> sometimes be twice (or more) the Rss reported in
> /proc/[pid]/smaps_rollup, as follows:
>
> [root@localhost sda]# cat /sys/fs/cgroup/memory/test/memory.usage_in_bytes
> 1080930304
> [root@localhost sda]# cat /sys/fs/cgroup/memory/test/cgroup.procs
> 1290
> [root@localhost sda]# cat /proc/1290/smaps_rollup
> 55ba80600000-ffffffffff601000 ---p 00000000 00:00 0
> [rollup]
> Rss: 500648 kB
> Pss: 498337 kB
> Shared_Clean: 2732 kB
> Shared_Dirty: 0 kB
> Private_Clean: 364 kB
> Private_Dirty: 497552 kB
> Referenced: 500648 kB
> Anonymous: 492016 kB
> LazyFree: 0 kB
> AnonHugePages: 129024 kB
> ShmemPmdMapped: 0 kB
> Shared_Hugetlb: 0 kB
> Private_Hugetlb: 0 kB
> Swap: 0 kB
> SwapPss: 0 kB
> Locked: 0 kB
> THPeligible: 0
>
> I found that the difference arises because __split_huge_pmd decreases
> the mm_counter, but the memcg page_counter is not decreased as long as
> the refcount of the head page is not zero. The call path is as follows:
>
> do_madvise
>   madvise_dontneed_free
>     zap_page_range
>       unmap_single_vma
>         zap_pud_range
>           zap_pmd_range
>             __split_huge_pmd
>               __split_huge_pmd_locked
>                 __mod_lruvec_page_state
>             zap_pte_range
>               add_mm_rss_vec
>                 add_mm_counter                -> decreases the mm_counter
>       tlb_finish_mmu
>         arch_tlb_finish_mmu
>           tlb_flush_mmu_free
>             free_pages_and_swap_cache
>               release_pages
>                 folio_put_testzero(page)      -> not zero, skip
>                   continue;
>                 __folio_put_large
>                   free_transhuge_page
>                     free_compound_page
>                       mem_cgroup_uncharge
>                         page_counter_uncharge -> decreases the page_counter
>
> The node page stats shown in meminfo were also decreased.
> __split_huge_pmd seems to free no physical memory unless the whole THP
> is freed. I am confused about which counter reflects the true physical
> memory used by a process.

This should be caused by the deferred split of THP. When MADV_DONTNEED
is called on part of the mapping, the huge PMD is split, but the THP
itself will not be split until memory pressure is hit (global or the
memcg limit). So the unmapped subpages are not actually freed until
that point. The mm counter is decreased due to the zapping, but the
physical pages are not actually freed, and therefore not yet uncharged
from the memcg.
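
One way to see this from userspace (just a sketch; the counter names
below are what /proc/vmstat exposes on recent kernels): thp_split_pmd
goes up at MADV_DONTNEED time, while thp_split_page only goes up later,
when reclaim (for example after lowering the memcg limit) runs the
deferred split shrinker and the usage finally drops:

/* sketch: print the relevant /proc/vmstat counters around the test */
#include <stdio.h>
#include <string.h>

static long vmstat_read(const char *name)
{
        char key[64] = "";
        long val = -1;
        FILE *f = fopen("/proc/vmstat", "r");

        if (!f)
                return -1;
        while (fscanf(f, "%63s %ld", key, &val) == 2)
                if (!strcmp(key, name))
                        break;
        fclose(f);
        return strcmp(key, name) ? -1 : val;
}

int main(void)
{
        /* bumped when the PMD is split at MADV_DONTNEED time */
        printf("thp_split_pmd           %ld\n", vmstat_read("thp_split_pmd"));
        /* bumped when the THP is queued on the deferred split list */
        printf("thp_deferred_split_page %ld\n",
               vmstat_read("thp_deferred_split_page"));
        /* only bumped when the shrinker actually splits the THP */
        printf("thp_split_page          %ld\n", vmstat_read("thp_split_page"));
        return 0;
}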

>
>
> Kind regards,
>
> Yongqiang Liu
>
>

2022-11-29 09:29:21

by Michal Hocko

Subject: Re: [QUESTION] memcg page_counter seems broken in MADV_DONTNEED with THP enabled

On Mon 28-11-22 12:01:37, Yang Shi wrote:
> On Sat, Nov 26, 2022 at 5:10 AM Yongqiang Liu <[email protected]> wrote:
> >
> > Hi,
> >
> > We use mm_counter to track how much physical memory a process uses,
> > while the page_counter of a memcg counts how much physical memory a
> > cgroup uses. If a cgroup contains only one process, the two look
> > almost the same. But with THP enabled, memory.usage_in_bytes of the
> > memcg can sometimes be twice (or more) the Rss reported in
> > /proc/[pid]/smaps_rollup, as follows:
[...]
> > The node page stats shown in meminfo were also decreased.
> > __split_huge_pmd seems to free no physical memory unless the whole
> > THP is freed. I am confused about which counter reflects the true
> > physical memory used by a process.
>
> This should be caused by the deferred split of THP. When MADV_DONTNEED
> is called on part of the mapping, the huge PMD is split, but the THP
> itself will not be split until memory pressure is hit (global or the
> memcg limit). So the unmapped subpages are not actually freed until
> that point. The mm counter is decreased due to the zapping, but the
> physical pages are not actually freed, and therefore not yet uncharged
> from the memcg.

Yes, and this is not really bound to THP. Consider page cache: it can
be accessed via syscalls while it doesn't correspond to rss at all, yet
it is still charged to a memcg. Or it can be mapped and then later
unmapped, so it disappears from rss while it stays charged until it
gets reclaimed under memory pressure. Or it can be an in-memory object
that is not bound to any process lifetime (e.g. tmpfs). Or it can be
kernel memory charged to a memcg, which is not covered by rss because
it is either not mapped or unknown to the rss counters.
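
As a concrete illustration of the page cache case (just a sketch): a
process inside a memcg that reads a large file charges the resulting
page cache to that memcg, while nothing ever shows up in its Rss:

/* sketch: populate page cache from inside the memcg, then compare
 * memory.usage_in_bytes (grows by roughly the file size) with the Rss
 * in /proc/<pid>/smaps_rollup (does not grow at all). */
#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

int main(int argc, char **argv)
{
        char buf[1 << 16];
        ssize_t n;
        /* any large file works; the path is just a placeholder */
        int fd = open(argc > 1 ? argv[1] : "/var/tmp/bigfile", O_RDONLY);

        if (fd < 0) {
                perror("open");
                return 1;
        }
        /* the pages brought into page cache here are charged to the
         * memcg of this task, but they are never mapped into it */
        while ((n = read(fd, buf, sizeof(buf))) > 0)
                ;
        close(fd);

        pause();        /* inspect the counters while the task is alive */
        return 0;
}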
--
Michal Hocko
SUSE Labs

2022-11-29 18:03:46

by Yang Shi

Subject: Re: [QUESTION] memcg page_counter seems broken in MADV_DONTNEED with THP enabled

On Tue, Nov 29, 2022 at 12:10 AM Michal Hocko <[email protected]> wrote:
>
> On Mon 28-11-22 12:01:37, Yang Shi wrote:
> > On Sat, Nov 26, 2022 at 5:10 AM Yongqiang Liu <[email protected]> wrote:
> > >
> > > Hi,
> > >
> > > We use mm_counter to track how much physical memory a process uses,
> > > while the page_counter of a memcg counts how much physical memory a
> > > cgroup uses. If a cgroup contains only one process, the two look
> > > almost the same. But with THP enabled, memory.usage_in_bytes of the
> > > memcg can sometimes be twice (or more) the Rss reported in
> > > /proc/[pid]/smaps_rollup, as follows:
> [...]
> > > The node page stats shown in meminfo were also decreased.
> > > __split_huge_pmd seems to free no physical memory unless the whole
> > > THP is freed. I am confused about which counter reflects the true
> > > physical memory used by a process.
> >
> > This should be caused by the deferred split of THP. When MADV_DONTNEED
> > is called on part of the mapping, the huge PMD is split, but the THP
> > itself will not be split until memory pressure is hit (global or the
> > memcg limit). So the unmapped subpages are not actually freed until
> > that point. The mm counter is decreased due to the zapping, but the
> > physical pages are not actually freed, and therefore not yet uncharged
> > from the memcg.
>
> Yes, and this is not really bound to THP. Consider page cache: it can
> be accessed via syscalls while it doesn't correspond to rss at all, yet
> it is still charged to a memcg. Or it can be mapped and then later
> unmapped, so it disappears from rss while it stays charged until it
> gets reclaimed under memory pressure. Or it can be an in-memory object
> that is not bound to any process lifetime (e.g. tmpfs). Or it can be
> kernel memory charged to a memcg, which is not covered by rss because
> it is either not mapped or unknown to the rss counters.

Yes, good points. Thanks, Michal. And one more thing worth mentioning
is that the RSS shown by ps or smaps is different from the RSS shown
by memcg.
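
For example (a rough sketch; the memcg path and the cgroup v1 field
names are taken from the example earlier in the thread), VmRSS in
/proc/<pid>/status counts the anon, file and shmem pages mapped by that
one process, while "rss" in the memcg's memory.stat counts the anon
memory charged to the whole cgroup, mapped or not:

/* sketch: print the two "rss" notions side by side */
#include <stdio.h>
#include <string.h>

static void dump(const char *path, const char *keys[], int nkeys)
{
        char line[256];
        FILE *f = fopen(path, "r");

        if (!f)
                return;
        /* print only the lines whose key we care about */
        while (fgets(line, sizeof(line), f))
                for (int i = 0; i < nkeys; i++)
                        if (!strncmp(line, keys[i], strlen(keys[i])))
                                printf("%s: %s", path, line);
        fclose(f);
}

int main(void)
{
        const char *status_keys[] = { "VmRSS", "RssAnon", "RssFile" };
        const char *memcg_keys[]  = { "rss ", "rss_huge ", "mapped_file " };

        dump("/proc/self/status", status_keys, 3);
        dump("/sys/fs/cgroup/memory/test/memory.stat", memcg_keys, 3);
        return 0;
}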

> --
> Michal Hocko
> SUSE Labs

2022-11-29 18:04:49

by Yang Shi

Subject: Re: [QUESTION] memcg page_counter seems broken in MADV_DONTNEED with THP enabled

On Tue, Nov 29, 2022 at 5:14 AM Yongqiang Liu <[email protected]> wrote:
>
>
> On 2022/11/29 4:01, Yang Shi wrote:
> > On Sat, Nov 26, 2022 at 5:10 AM Yongqiang Liu <[email protected]> wrote:
> >> Hi,
> >>
> >> We use mm_counter to track how much physical memory a process uses,
> >> while the page_counter of a memcg counts how much physical memory a
> >> cgroup uses. If a cgroup contains only one process, the two look
> >> almost the same. But with THP enabled, memory.usage_in_bytes of the
> >> memcg can sometimes be twice (or more) the Rss reported in
> >> /proc/[pid]/smaps_rollup, as follows:
> >>
> >> [root@localhost sda]# cat /sys/fs/cgroup/memory/test/memory.usage_in_bytes
> >> 1080930304
> >> [root@localhost sda]# cat /sys/fs/cgroup/memory/test/cgroup.procs
> >> 1290
> >> [root@localhost sda]# cat /proc/1290/smaps_rollup
> >> 55ba80600000-ffffffffff601000 ---p 00000000 00:00 0
> >> [rollup]
> >> Rss: 500648 kB
> >> Pss: 498337 kB
> >> Shared_Clean: 2732 kB
> >> Shared_Dirty: 0 kB
> >> Private_Clean: 364 kB
> >> Private_Dirty: 497552 kB
> >> Referenced: 500648 kB
> >> Anonymous: 492016 kB
> >> LazyFree: 0 kB
> >> AnonHugePages: 129024 kB
> >> ShmemPmdMapped: 0 kB
> >> Shared_Hugetlb: 0 kB
> >> Private_Hugetlb: 0 kB
> >> Swap: 0 kB
> >> SwapPss: 0 kB
> >> Locked: 0 kB
> >> THPeligible: 0
> >>
> >> I found that the difference arises because __split_huge_pmd decreases
> >> the mm_counter, but the memcg page_counter is not decreased as long as
> >> the refcount of the head page is not zero. The call path is as follows:
> >>
> >> do_madvise
> >>   madvise_dontneed_free
> >>     zap_page_range
> >>       unmap_single_vma
> >>         zap_pud_range
> >>           zap_pmd_range
> >>             __split_huge_pmd
> >>               __split_huge_pmd_locked
> >>                 __mod_lruvec_page_state
> >>             zap_pte_range
> >>               add_mm_rss_vec
> >>                 add_mm_counter                -> decreases the mm_counter
> >>       tlb_finish_mmu
> >>         arch_tlb_finish_mmu
> >>           tlb_flush_mmu_free
> >>             free_pages_and_swap_cache
> >>               release_pages
> >>                 folio_put_testzero(page)      -> not zero, skip
> >>                   continue;
> >>                 __folio_put_large
> >>                   free_transhuge_page
> >>                     free_compound_page
> >>                       mem_cgroup_uncharge
> >>                         page_counter_uncharge -> decreases the page_counter
> >>
> >> The node page stats shown in meminfo were also decreased.
> >> __split_huge_pmd seems to free no physical memory unless the whole
> >> THP is freed. I am confused about which counter reflects the true
> >> physical memory used by a process.
> > This should be caused by the deferred split of THP. When MADV_DONTNEED
> > is called on part of the mapping, the huge PMD is split, but the THP
> > itself will not be split until memory pressure is hit (global or the
> > memcg limit). So the unmapped subpages are not actually freed until
> > that point. The mm counter is decreased due to the zapping, but the
> > physical pages are not actually freed, and therefore not yet uncharged
> > from the memcg.
>
> Thanks!
>
> I don't know how much memory a real workload will cost. So I just
> tested the max_usage_in_bytes of the memcg with THP disabled and added
> a little bit more to the limit_in_bytes of the memcg with THP enabled,
> which triggered an OOM... (it actually cost 100M more with THP
> enabled). Another testcase, for which I know how much memory it will
> cost, doesn't trigger an OOM with a suitable memcg limit, and there I
> see the THP split when the memory hits the limit.
>
> I have another concern: k8s usually uses (rss - files) to estimate

Do you mean the "workingset" used by some 3rd party k8s monitoring
tools? I recall that depends on which monitoring tools you use; for
example, some monitoring tools use active_anon + active_file.

>
> the memory workload, but the anon THP sitting on the deferred split
> list and still charged to the memcg will make it look higher than the
> actual usage. And it seems the

Yes, but the deferred split shrinker should handle this quite gracefully.

>
> container will be killed without an OOM...

If you have some userspace daemons which monitor memory usage by rss
and try to be smarter about killing the container by looking at rss
alone, you may kill the container prematurely.

>
> Would it be suitable to add a meminfo entry for the THP deferred split
> list?

We could, but I can't think of how it would be used to improve this
use case. Any more thoughts?

>
> >>
> >> Kind regards,
> >>
> >> Yongqiang Liu
> >>
> >>
> > .