v2:
1. Rename mem_cgroup_split_huge_fixup to split_page_memcg.
2. Use split_page_memcg() when splitting a page.
Zhou Guanghui (2):
mm/memcg: rename mem_cgroup_split_huge_fixup to split_page_memcg
mm/memcg: set memcg when split pages
include/linux/memcontrol.h | 6 ++----
mm/huge_memory.c | 2 +-
mm/memcontrol.c | 15 ++++++---------
mm/page_alloc.c | 1 +
4 files changed, 10 insertions(+), 14 deletions(-)
--
2.25.0
As described in the split_page() function comment, for a non-compound
high-order page, the sub-pages must be freed individually. If the
memcg of the first page is valid, the tail pages cannot be uncharged
when they are freed.
For example, when alloc_pages_exact() is used to allocate 1MB of
contiguous physical memory, 2MB is charged (kmemcg is enabled and
__GFP_ACCOUNT is set). When make_alloc_exact() frees the unused 1MB
and free_pages_exact() frees the requested 1MB, only 4KB (one page)
is actually uncharged.
Therefore, the memcg of the tail pages needs to be set when the page
is split.
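A minimal sketch of this scenario follows; the wrapper function and the
SZ_1M size are illustrative assumptions, not call sites from this series:
#include <linux/gfp.h>
#include <linux/sizes.h>

/*
 * Illustrative only: with kmemcg enabled, any __GFP_ACCOUNT user of
 * alloc_pages_exact() hits the same accounting imbalance.
 */
static void accounting_imbalance_sketch(void)
{
	/*
	 * alloc_pages_exact() grabs a non-compound high-order block and,
	 * with __GFP_ACCOUNT, charges all of it, but only the first page
	 * gets page->memcg_data set.
	 */
	void *buf = alloc_pages_exact(SZ_1M, GFP_KERNEL | __GFP_ACCOUNT);

	if (!buf)
		return;

	/*
	 * free_pages_exact() frees the area page by page. The tail pages
	 * carry no memcg_data, so only the first page (4KB) is uncharged
	 * even though the whole block was charged.
	 */
	free_pages_exact(buf, SZ_1M);
}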
Signed-off-by: Zhou Guanghui <[email protected]>
---
mm/page_alloc.c | 1 +
1 file changed, 1 insertion(+)
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 3e4b29ee2b1e..3ed783e25c3c 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -3310,6 +3310,7 @@ void split_page(struct page *page, unsigned int order)
for (i = 1; i < (1 << order); i++)
set_page_refcounted(page + i);
split_page_owner(page, 1 << order);
+ split_page_memcg(page, 1 << order);
}
EXPORT_SYMBOL_GPL(split_page);
--
2.25.0
On Thu, Mar 04, 2021 at 07:40:53AM +0000, Zhou Guanghui wrote:
> As described in the split_page function comment, for the non-compound
> high order page, the sub-pages must be freed individually. If the
> memcg of the fisrt page is valid, the tail pages cannot be uncharged
> when be freed.
>
> For example, when alloc_pages_exact is used to allocate 1MB continuous
> physical memory, 2MB is charged(kmemcg is enabled and __GFP_ACCOUNT is
> set). When make_alloc_exact free the unused 1MB and free_pages_exact
> free the applied 1MB, actually, only 4KB(one page) is uncharged.
>
> Therefore, the memcg of the tail page needs to be set when split page.
>
> Signed-off-by: Zhou Guanghui <[email protected]>
Acked-by: Johannes Weiner <[email protected]>
On 4 Mar 2021, at 2:40, Zhou Guanghui wrote:
> As described in the split_page function comment, for the non-compound
> high order page, the sub-pages must be freed individually. If the
> memcg of the fisrt page is valid, the tail pages cannot be uncharged
s/fisrt/first/
> when be freed.
>
> For example, when alloc_pages_exact is used to allocate 1MB continuous
> physical memory, 2MB is charged(kmemcg is enabled and __GFP_ACCOUNT is
> set). When make_alloc_exact free the unused 1MB and free_pages_exact
> free the applied 1MB, actually, only 4KB(one page) is uncharged.
>
> Therefore, the memcg of the tail page needs to be set when split page.
>
> Signed-off-by: Zhou Guanghui <[email protected]>
> ---
> mm/page_alloc.c | 1 +
> 1 file changed, 1 insertion(+)
>
> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> index 3e4b29ee2b1e..3ed783e25c3c 100644
> --- a/mm/page_alloc.c
> +++ b/mm/page_alloc.c
> @@ -3310,6 +3310,7 @@ void split_page(struct page *page, unsigned int order)
> for (i = 1; i < (1 << order); i++)
> set_page_refcounted(page + i);
> split_page_owner(page, 1 << order);
> + split_page_memcg(page, 1 << order);
> }
> EXPORT_SYMBOL_GPL(split_page);
>
> --
> 2.25.0
LGTM. Thanks.
Reviewed-by: Zi Yan <[email protected]>
--
Best Regards,
Yan Zi
On Wed, Mar 3, 2021 at 11:57 PM Zhou Guanghui <[email protected]> wrote:
>
> As described in the split_page function comment, for the non-compound
> high order page, the sub-pages must be freed individually. If the
> memcg of the fisrt page is valid, the tail pages cannot be uncharged
> when be freed.
>
> For example, when alloc_pages_exact is used to allocate 1MB continuous
> physical memory, 2MB is charged(kmemcg is enabled and __GFP_ACCOUNT is
> set). When make_alloc_exact free the unused 1MB and free_pages_exact
> free the applied 1MB, actually, only 4KB(one page) is uncharged.
>
> Therefore, the memcg of the tail page needs to be set when split page.
>
> Signed-off-by: Zhou Guanghui <[email protected]>
Reviewed-by: Shakeel Butt <[email protected]>
Rename mem_cgroup_split_huge_fixup() to split_page_memcg() and
explicitly pass in the number of pages as an argument.
This makes the interface more generic, so it can be used by other
potential callers. In addition, the complete memcg information
(the memcg pointer and the flags) needs to be set on the tail pages.
Signed-off-by: Zhou Guanghui <[email protected]>
---
include/linux/memcontrol.h | 6 ++----
mm/huge_memory.c | 2 +-
mm/memcontrol.c | 15 ++++++---------
3 files changed, 9 insertions(+), 14 deletions(-)
diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
index e6dc793d587d..0c04d39a7967 100644
--- a/include/linux/memcontrol.h
+++ b/include/linux/memcontrol.h
@@ -1061,9 +1061,7 @@ static inline void memcg_memory_event_mm(struct mm_struct *mm,
rcu_read_unlock();
}
-#ifdef CONFIG_TRANSPARENT_HUGEPAGE
-void mem_cgroup_split_huge_fixup(struct page *head);
-#endif
+void split_page_memcg(struct page *head, unsigned int nr);
#else /* CONFIG_MEMCG */
@@ -1400,7 +1398,7 @@ unsigned long mem_cgroup_soft_limit_reclaim(pg_data_t *pgdat, int order,
return 0;
}
-static inline void mem_cgroup_split_huge_fixup(struct page *head)
+static inline void split_page_memcg(struct page *head, unsigned int nr)
{
}
diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index 395c75111d33..e7f29308ebc8 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -2471,7 +2471,7 @@ static void __split_huge_page(struct page *page, struct list_head *list,
int i;
/* complete memcg works before add pages to LRU */
- mem_cgroup_split_huge_fixup(head);
+ split_page_memcg(head, nr);
if (PageAnon(head) && PageSwapCache(head)) {
swp_entry_t entry = { .val = page_private(head) };
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 845eec01ef9d..e064ac0d850a 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -3287,24 +3287,21 @@ void obj_cgroup_uncharge(struct obj_cgroup *objcg, size_t size)
#endif /* CONFIG_MEMCG_KMEM */
-#ifdef CONFIG_TRANSPARENT_HUGEPAGE
/*
- * Because page_memcg(head) is not set on compound tails, set it now.
+ * Because page_memcg(head) is not set on tails, set it now.
*/
-void mem_cgroup_split_huge_fixup(struct page *head)
+void split_page_memcg(struct page *head, unsigned int nr)
{
struct mem_cgroup *memcg = page_memcg(head);
int i;
- if (mem_cgroup_disabled())
+ if (mem_cgroup_disabled() || !memcg)
return;
- for (i = 1; i < HPAGE_PMD_NR; i++) {
- css_get(&memcg->css);
- head[i].memcg_data = (unsigned long)memcg;
- }
+ for (i = 1; i < nr; i++)
+ head[i].memcg_data = head->memcg_data;
+ css_get_many(&memcg->css, nr - 1);
}
-#endif /* CONFIG_TRANSPARENT_HUGEPAGE */
#ifdef CONFIG_MEMCG_SWAP
/**
--
2.25.0
On Thu 04-03-21 07:40:52, Zhou Guanghui wrote:
> Rename mem_cgroup_split_huge_fixup to split_page_memcg and explicitly
> pass in page number argument.
>
> In this way, the interface name is more common and can be used by
> potential users. In addition, the complete info(memcg and flag) of
> the memcg needs to be set to the tail pages.
I think it is good to call out the css_get -> css_get_many change as a
micro-optimization explicitly. Even though I do not expect it to be
easily visible, it makes sense to save an RCU section for each tail
page in general.
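For reference, the micro-optimization being called out amounts to roughly
the following (memcg, head and nr as in split_page_memcg from the patch;
this is a restatement of the diff, not new code):
	int i;

	/* Before: one css reference (and its own RCU section) per tail page. */
	for (i = 1; i < HPAGE_PMD_NR; i++) {
		css_get(&memcg->css);
		head[i].memcg_data = (unsigned long)memcg;
	}

	/* After: copy memcg_data to each tail page, then take all references at once. */
	for (i = 1; i < nr; i++)
		head[i].memcg_data = head->memcg_data;
	css_get_many(&memcg->css, nr - 1);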
> Signed-off-by: Zhou Guanghui <[email protected]>
Acked-by: Michal Hocko <[email protected]>
> ---
> include/linux/memcontrol.h | 6 ++----
> mm/huge_memory.c | 2 +-
> mm/memcontrol.c | 15 ++++++---------
> 3 files changed, 9 insertions(+), 14 deletions(-)
>
> diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
> index e6dc793d587d..0c04d39a7967 100644
> --- a/include/linux/memcontrol.h
> +++ b/include/linux/memcontrol.h
> @@ -1061,9 +1061,7 @@ static inline void memcg_memory_event_mm(struct mm_struct *mm,
> rcu_read_unlock();
> }
>
> -#ifdef CONFIG_TRANSPARENT_HUGEPAGE
> -void mem_cgroup_split_huge_fixup(struct page *head);
> -#endif
> +void split_page_memcg(struct page *head, unsigned int nr);
>
> #else /* CONFIG_MEMCG */
>
> @@ -1400,7 +1398,7 @@ unsigned long mem_cgroup_soft_limit_reclaim(pg_data_t *pgdat, int order,
> return 0;
> }
>
> -static inline void mem_cgroup_split_huge_fixup(struct page *head)
> +static inline void split_page_memcg(struct page *head, unsigned int nr)
> {
> }
>
> diff --git a/mm/huge_memory.c b/mm/huge_memory.c
> index 395c75111d33..e7f29308ebc8 100644
> --- a/mm/huge_memory.c
> +++ b/mm/huge_memory.c
> @@ -2471,7 +2471,7 @@ static void __split_huge_page(struct page *page, struct list_head *list,
> int i;
>
> /* complete memcg works before add pages to LRU */
> - mem_cgroup_split_huge_fixup(head);
> + split_page_memcg(head, nr);
>
> if (PageAnon(head) && PageSwapCache(head)) {
> swp_entry_t entry = { .val = page_private(head) };
> diff --git a/mm/memcontrol.c b/mm/memcontrol.c
> index 845eec01ef9d..e064ac0d850a 100644
> --- a/mm/memcontrol.c
> +++ b/mm/memcontrol.c
> @@ -3287,24 +3287,21 @@ void obj_cgroup_uncharge(struct obj_cgroup *objcg, size_t size)
>
> #endif /* CONFIG_MEMCG_KMEM */
>
> -#ifdef CONFIG_TRANSPARENT_HUGEPAGE
> /*
> - * Because page_memcg(head) is not set on compound tails, set it now.
> + * Because page_memcg(head) is not set on tails, set it now.
> */
> -void mem_cgroup_split_huge_fixup(struct page *head)
> +void split_page_memcg(struct page *head, unsigned int nr)
> {
> struct mem_cgroup *memcg = page_memcg(head);
> int i;
>
> - if (mem_cgroup_disabled())
> + if (mem_cgroup_disabled() || !memcg)
> return;
>
> - for (i = 1; i < HPAGE_PMD_NR; i++) {
> - css_get(&memcg->css);
> - head[i].memcg_data = (unsigned long)memcg;
> - }
> + for (i = 1; i < nr; i++)
> + head[i].memcg_data = head->memcg_data;
> + css_get_many(&memcg->css, nr - 1);
> }
> -#endif /* CONFIG_TRANSPARENT_HUGEPAGE */
>
> #ifdef CONFIG_MEMCG_SWAP
> /**
> --
> 2.25.0
>
--
Michal Hocko
SUSE Labs
On Thu 04-03-21 07:40:53, Zhou Guanghui wrote:
> As described in the split_page function comment, for the non-compound
> high order page, the sub-pages must be freed individually. If the
> memcg of the fisrt page is valid, the tail pages cannot be uncharged
> when be freed.
>
> For example, when alloc_pages_exact is used to allocate 1MB continuous
> physical memory, 2MB is charged(kmemcg is enabled and __GFP_ACCOUNT is
> set). When make_alloc_exact free the unused 1MB and free_pages_exact
> free the applied 1MB, actually, only 4KB(one page) is uncharged.
>
> Therefore, the memcg of the tail page needs to be set when split page.
>
As already mentioned, there are at least two explicit users of
__GFP_ACCOUNT with alloc_pages_exact() added recently. It would be good
to mention that explicitly and maybe even reference 7efe8ef274024 and
c419621873713 so that it is clear this is not just a theoretical issue.
> Signed-off-by: Zhou Guanghui <[email protected]>
Acked-by: Michal Hocko <[email protected]>
> ---
> mm/page_alloc.c | 1 +
> 1 file changed, 1 insertion(+)
>
> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> index 3e4b29ee2b1e..3ed783e25c3c 100644
> --- a/mm/page_alloc.c
> +++ b/mm/page_alloc.c
> @@ -3310,6 +3310,7 @@ void split_page(struct page *page, unsigned int order)
> for (i = 1; i < (1 << order); i++)
> set_page_refcounted(page + i);
> split_page_owner(page, 1 << order);
> + split_page_memcg(page, 1 << order);
> }
> EXPORT_SYMBOL_GPL(split_page);
>
> --
> 2.25.0
>
--
Michal Hocko
SUSE Labs
On Fri, 5 Mar 2021 12:52:52 +0100 Michal Hocko <[email protected]> wrote:
> On Thu 04-03-21 07:40:53, Zhou Guanghui wrote:
> > As described in the split_page function comment, for the non-compound
> > high order page, the sub-pages must be freed individually. If the
> > memcg of the fisrt page is valid, the tail pages cannot be uncharged
> > when be freed.
> >
> > For example, when alloc_pages_exact is used to allocate 1MB continuous
> > physical memory, 2MB is charged(kmemcg is enabled and __GFP_ACCOUNT is
> > set). When make_alloc_exact free the unused 1MB and free_pages_exact
> > free the applied 1MB, actually, only 4KB(one page) is uncharged.
> >
> > Therefore, the memcg of the tail page needs to be set when split page.
> >
>
> As already mentioned there are at least two explicit users of
> __GFP_ACCOUNT with alloc_exact_pages added recently. It would be good to
> mention that explicitly and maybe even mention 7efe8ef274024 resp.
> c419621873713 so that it is clear this is not just a theoretical issue.
I added
: Michel:
:
: There are at least two explicit users of __GFP_ACCOUNT with
: alloc_exact_pages added recently. See 7efe8ef274024 ("KVM: arm64:
: Allocate stage-2 pgd pages with GFP_KERNEL_ACCOUNT") and c419621873713
: ("KVM: s390: Add memcg accounting to KVM allocations"), so this is not
: just a theoretical issue.
And should we cc:stable on this one?
On Fri 05-03-21 15:58:40, Andrew Morton wrote:
> On Fri, 5 Mar 2021 12:52:52 +0100 Michal Hocko <[email protected]> wrote:
>
> > On Thu 04-03-21 07:40:53, Zhou Guanghui wrote:
> > > As described in the split_page function comment, for the non-compound
> > > high order page, the sub-pages must be freed individually. If the
> > > memcg of the fisrt page is valid, the tail pages cannot be uncharged
> > > when be freed.
> > >
> > > For example, when alloc_pages_exact is used to allocate 1MB continuous
> > > physical memory, 2MB is charged(kmemcg is enabled and __GFP_ACCOUNT is
> > > set). When make_alloc_exact free the unused 1MB and free_pages_exact
> > > free the applied 1MB, actually, only 4KB(one page) is uncharged.
> > >
> > > Therefore, the memcg of the tail page needs to be set when split page.
> > >
> >
> > As already mentioned there are at least two explicit users of
> > __GFP_ACCOUNT with alloc_exact_pages added recently. It would be good to
> > mention that explicitly and maybe even mention 7efe8ef274024 resp.
> > c419621873713 so that it is clear this is not just a theoretical issue.
>
> I added
>
> : Michel:
> :
> : There are at least two explicit users of __GFP_ACCOUNT with
> : alloc_exact_pages added recently. See 7efe8ef274024 ("KVM: arm64:
> : Allocate stage-2 pgd pages with GFP_KERNEL_ACCOUNT") and c419621873713
> : ("KVM: s390: Add memcg accounting to KVM allocations"), so this is not
> : just a theoretical issue.
>
> And should we cc:stable on this one?
Somebody more familiar with the iommu dma allocation layer should have a
look as well (__iommu_dma_alloc_pages) so that we know whether there are
kernels beyond the two mentioned above that need a fix.
But in general this sounds like a good fit for the stable tree.
--
Michal Hocko
SUSE Labs
On Mon, 8 Mar 2021 09:41:38 +0100 Michal Hocko <[email protected]> wrote:
> On Fri 05-03-21 15:58:40, Andrew Morton wrote:
> > On Fri, 5 Mar 2021 12:52:52 +0100 Michal Hocko <[email protected]> wrote:
> >
> > > On Thu 04-03-21 07:40:53, Zhou Guanghui wrote:
> > > > As described in the split_page function comment, for the non-compound
> > > > high order page, the sub-pages must be freed individually. If the
> > > > memcg of the fisrt page is valid, the tail pages cannot be uncharged
> > > > when be freed.
> > > >
> > > > For example, when alloc_pages_exact is used to allocate 1MB continuous
> > > > physical memory, 2MB is charged(kmemcg is enabled and __GFP_ACCOUNT is
> > > > set). When make_alloc_exact free the unused 1MB and free_pages_exact
> > > > free the applied 1MB, actually, only 4KB(one page) is uncharged.
> > > >
> > > > Therefore, the memcg of the tail page needs to be set when split page.
> > > >
> > >
> > > As already mentioned there are at least two explicit users of
> > > __GFP_ACCOUNT with alloc_exact_pages added recently. It would be good to
> > > mention that explicitly and maybe even mention 7efe8ef274024 resp.
> > > c419621873713 so that it is clear this is not just a theoretical issue.
> >
> > I added
> >
> > : Michel:
> > :
> > : There are at least two explicit users of __GFP_ACCOUNT with
> > : alloc_exact_pages added recently. See 7efe8ef274024 ("KVM: arm64:
> > : Allocate stage-2 pgd pages with GFP_KERNEL_ACCOUNT") and c419621873713
> > : ("KVM: s390: Add memcg accounting to KVM allocations"), so this is not
> > : just a theoretical issue.
> >
> > And should we cc:stable on this one?
>
> Somebody more familiar with iommu dma allocation layer should have a
> look as well (__iommu_dma_alloc_pages) so that we know whether there are
> kernels outside of the above two ones mentioned above that need a fix.
> But in general this sounds like a good fit for the stable tree.
OK. I reversed the order of these two patches so we don't need to
burden -stable with a cosmetic rename.
On Mon, Mar 08, 2021 at 12:42:27PM -0800, Andrew Morton wrote:
> On Mon, 8 Mar 2021 09:41:38 +0100 Michal Hocko <[email protected]> wrote:
>
> > On Fri 05-03-21 15:58:40, Andrew Morton wrote:
> > > On Fri, 5 Mar 2021 12:52:52 +0100 Michal Hocko <[email protected]> wrote:
> > >
> > > > On Thu 04-03-21 07:40:53, Zhou Guanghui wrote:
> > > > > As described in the split_page function comment, for the non-compound
> > > > > high order page, the sub-pages must be freed individually. If the
> > > > > memcg of the fisrt page is valid, the tail pages cannot be uncharged
> > > > > when be freed.
> > > > >
> > > > > For example, when alloc_pages_exact is used to allocate 1MB continuous
> > > > > physical memory, 2MB is charged(kmemcg is enabled and __GFP_ACCOUNT is
> > > > > set). When make_alloc_exact free the unused 1MB and free_pages_exact
> > > > > free the applied 1MB, actually, only 4KB(one page) is uncharged.
> > > > >
> > > > > Therefore, the memcg of the tail page needs to be set when split page.
> > > > >
> > > >
> > > > As already mentioned there are at least two explicit users of
> > > > __GFP_ACCOUNT with alloc_exact_pages added recently. It would be good to
> > > > mention that explicitly and maybe even mention 7efe8ef274024 resp.
> > > > c419621873713 so that it is clear this is not just a theoretical issue.
> > >
> > > I added
> > >
> > > : Michel:
> > > :
> > > : There are at least two explicit users of __GFP_ACCOUNT with
> > > : alloc_exact_pages added recently. See 7efe8ef274024 ("KVM: arm64:
> > > : Allocate stage-2 pgd pages with GFP_KERNEL_ACCOUNT") and c419621873713
> > > : ("KVM: s390: Add memcg accounting to KVM allocations"), so this is not
> > > : just a theoretical issue.
> > >
> > > And should we cc:stable on this one?
> >
> > Somebody more familiar with iommu dma allocation layer should have a
> > look as well (__iommu_dma_alloc_pages) so that we know whether there are
> > kernels outside of the above two ones mentioned above that need a fix.
> > But in general this sounds like a good fit for the stable tree.
>
> OK. I reversed the order of these two patches so we don't need to
> burden -stable with a cosmetic rename.
Eek, no.
The alloc_pages_exact() is done to pages that _aren't_ compound.
So you have to pass the number of pages to the memcg split function,
because a non-compound page doesn't know the size of its allocation.
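A rough sketch of that point (assumed example code, not from this thread):
the order of a compound allocation can be read back from its head page,
while a non-compound high-order allocation reports order 0, so callers
such as split_page_memcg() must be told the size.
#include <linux/bug.h>
#include <linux/gfp.h>
#include <linux/mm.h>

static void order_visibility_sketch(void)
{
	struct page *chead = alloc_pages(GFP_KERNEL | __GFP_COMP, 2);
	struct page *plain = alloc_pages(GFP_KERNEL, 2);

	if (chead) {
		/* A compound head records its own order (2 here). */
		WARN_ON(compound_order(chead) != 2);
		__free_pages(chead, 2);
	}
	if (plain) {
		/* A non-compound page does not; the caller must track the size. */
		WARN_ON(compound_order(plain) != 0);
		__free_pages(plain, 2);
	}
}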
On Thu, Mar 04, 2021 at 07:40:53AM +0000, Zhou Guanghui wrote:
> As described in the split_page function comment, for the non-compound
> high order page, the sub-pages must be freed individually. If the
> memcg of the fisrt page is valid, the tail pages cannot be uncharged
> when be freed.
>
> For example, when alloc_pages_exact is used to allocate 1MB continuous
> physical memory, 2MB is charged(kmemcg is enabled and __GFP_ACCOUNT is
> set). When make_alloc_exact free the unused 1MB and free_pages_exact
> free the applied 1MB, actually, only 4KB(one page) is uncharged.
>
> Therefore, the memcg of the tail page needs to be set when split page.
There's another place we need to do this to ...
+++ b/mm/page_alloc.c
@@ -5081,9 +5081,15 @@ void __free_pages(struct page *page, unsigned int order)
{
if (put_page_testzero(page))
free_the_page(page, order);
- else if (!PageHead(page))
- while (order-- > 0)
- free_the_page(page + (1 << order), order);
+ else if (!PageHead(page)) {
+ while (order-- > 0) {
+ struct page *tail = page + (1 << order);
+#ifdef CONFIG_MEMCG
+ tail->memcg_data = page->memcg_data;
+#endif
+ free_the_page(tail, order);
+ }
+ }
}
EXPORT_SYMBOL(__free_pages);
I wonder if we shouldn't initialise memcg_data on all subsequent pages
of non-compound allocations instead? Because I'm not sure this is the
only place that needs to be fixed.
On 4/3/21 6:40 pm, Zhou Guanghui wrote:
> Rename mem_cgroup_split_huge_fixup to split_page_memcg and explicitly
> pass in page number argument.
>
> In this way, the interface name is more common and can be used by
> potential users. In addition, the complete info(memcg and flag) of
> the memcg needs to be set to the tail pages.
>
> Signed-off-by: Zhou Guanghui <[email protected]>
> ---
> include/linux/memcontrol.h | 6 ++----
> mm/huge_memory.c | 2 +-
> mm/memcontrol.c | 15 ++++++---------
> 3 files changed, 9 insertions(+), 14 deletions(-)
>
> diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
> index e6dc793d587d..0c04d39a7967 100644
> --- a/include/linux/memcontrol.h
> +++ b/include/linux/memcontrol.h
> @@ -1061,9 +1061,7 @@ static inline void memcg_memory_event_mm(struct mm_struct *mm,
> rcu_read_unlock();
> }
>
> -#ifdef CONFIG_TRANSPARENT_HUGEPAGE
> -void mem_cgroup_split_huge_fixup(struct page *head);
> -#endif
> +void split_page_memcg(struct page *head, unsigned int nr);
>
> #else /* CONFIG_MEMCG */
>
> @@ -1400,7 +1398,7 @@ unsigned long mem_cgroup_soft_limit_reclaim(pg_data_t *pgdat, int order,
> return 0;
> }
>
> -static inline void mem_cgroup_split_huge_fixup(struct page *head)
> +static inline void split_page_memcg(struct page *head, unsigned int nr)
> {
> }
>
> diff --git a/mm/huge_memory.c b/mm/huge_memory.c
> index 395c75111d33..e7f29308ebc8 100644
> --- a/mm/huge_memory.c
> +++ b/mm/huge_memory.c
> @@ -2471,7 +2471,7 @@ static void __split_huge_page(struct page *page, struct list_head *list,
> int i;
>
> /* complete memcg works before add pages to LRU */
> - mem_cgroup_split_huge_fixup(head);
> + split_page_memcg(head, nr);
>
> if (PageAnon(head) && PageSwapCache(head)) {
> swp_entry_t entry = { .val = page_private(head) };
> diff --git a/mm/memcontrol.c b/mm/memcontrol.c
> index 845eec01ef9d..e064ac0d850a 100644
> --- a/mm/memcontrol.c
> +++ b/mm/memcontrol.c
> @@ -3287,24 +3287,21 @@ void obj_cgroup_uncharge(struct obj_cgroup *objcg, size_t size)
>
> #endif /* CONFIG_MEMCG_KMEM */
>
> -#ifdef CONFIG_TRANSPARENT_HUGEPAGE
> /*
> - * Because page_memcg(head) is not set on compound tails, set it now.
> + * Because page_memcg(head) is not set on tails, set it now.
> */
> -void mem_cgroup_split_huge_fixup(struct page *head)
> +void split_page_memcg(struct page *head, unsigned int nr)
> {
Do we need input validation on nr? Can nr be arbitrary or can we enforce
VM_BUG_ON(!is_power_of_2(nr));
Balbir Singh
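For concreteness, the proposed check would sit at the top of the helper,
roughly like this (a sketch of the suggestion applied to split_page_memcg
from patch 1, not something posted in this series):
void split_page_memcg(struct page *head, unsigned int nr)
{
	struct mem_cgroup *memcg = page_memcg(head);
	int i;

	/* Suggested DEBUG_VM-only sanity check on the new argument. */
	VM_BUG_ON(!is_power_of_2(nr));

	if (mem_cgroup_disabled() || !memcg)
		return;

	for (i = 1; i < nr; i++)
		head[i].memcg_data = head->memcg_data;
	css_get_many(&memcg->css, nr - 1);
}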
On Mon, 8 Mar 2021 20:47:31 +0000 Matthew Wilcox <[email protected]> wrote:
> On Mon, Mar 08, 2021 at 12:42:27PM -0800, Andrew Morton wrote:
> > On Mon, 8 Mar 2021 09:41:38 +0100 Michal Hocko <[email protected]> wrote:
> >
> > > On Fri 05-03-21 15:58:40, Andrew Morton wrote:
> > > > On Fri, 5 Mar 2021 12:52:52 +0100 Michal Hocko <[email protected]> wrote:
> > > >
> > > > > On Thu 04-03-21 07:40:53, Zhou Guanghui wrote:
> > > > > > As described in the split_page function comment, for the non-compound
> > > > > > high order page, the sub-pages must be freed individually. If the
> > > > > > memcg of the fisrt page is valid, the tail pages cannot be uncharged
> > > > > > when be freed.
> > > > > >
> > > > > > For example, when alloc_pages_exact is used to allocate 1MB continuous
> > > > > > physical memory, 2MB is charged(kmemcg is enabled and __GFP_ACCOUNT is
> > > > > > set). When make_alloc_exact free the unused 1MB and free_pages_exact
> > > > > > free the applied 1MB, actually, only 4KB(one page) is uncharged.
> > > > > >
> > > > > > Therefore, the memcg of the tail page needs to be set when split page.
> > > > > >
> > > > >
> > > > > As already mentioned there are at least two explicit users of
> > > > > __GFP_ACCOUNT with alloc_exact_pages added recently. It would be good to
> > > > > mention that explicitly and maybe even mention 7efe8ef274024 resp.
> > > > > c419621873713 so that it is clear this is not just a theoretical issue.
> > > >
> > > > I added
> > > >
> > > > : Michel:
> > > > :
> > > > : There are at least two explicit users of __GFP_ACCOUNT with
> > > > : alloc_exact_pages added recently. See 7efe8ef274024 ("KVM: arm64:
> > > > : Allocate stage-2 pgd pages with GFP_KERNEL_ACCOUNT") and c419621873713
> > > > : ("KVM: s390: Add memcg accounting to KVM allocations"), so this is not
> > > > : just a theoretical issue.
> > > >
> > > > And should we cc:stable on this one?
> > >
> > > Somebody more familiar with iommu dma allocation layer should have a
> > > look as well (__iommu_dma_alloc_pages) so that we know whether there are
> > > kernels outside of the above two ones mentioned above that need a fix.
> > > But in general this sounds like a good fit for the stable tree.
> >
> > OK. I reversed the order of these two patches so we don't need to
> > burden -stable with a cosmetic rename.
>
> Eek, no.
>
> The alloc_pages_exact() is done to pages that _aren't_ compound.
> So you have to pass the number of pages to the memcg split function,
> because a non-compound page doesn't know the size of its allocation.
Ah, OK, the patch title fooled me.
It should have been a three-patch series, really
1: add nr_pages arg to mem_cgroup_split_huge_fixup()
2: call mem_cgroup_split_huge_fixup() when splitting
3: rename mem_cgroup_split_huge_fixup() to split_page_memcg()
That way, the third cosmetic patch could be deferred so we don't feed
the cosmetic renaming into -stable.
But whatever, the rename isn't a big deal so I'll go with the 2-patch
series as sent for -stable.
On Tue 09-03-21 09:37:29, Balbir Singh wrote:
> On 4/3/21 6:40 pm, Zhou Guanghui wrote:
[...]
> > -#ifdef CONFIG_TRANSPARENT_HUGEPAGE
> > /*
> > - * Because page_memcg(head) is not set on compound tails, set it now.
> > + * Because page_memcg(head) is not set on tails, set it now.
> > */
> > -void mem_cgroup_split_huge_fixup(struct page *head)
> > +void split_page_memcg(struct page *head, unsigned int nr)
> > {
>
> Do we need input validation on nr? Can nr be aribtrary or can we enforce
>
> VM_BUG_ON(!is_power_of_2(nr));
In practice this will be a power of 2, but why should we bother to
sanitize that?
--
Michal Hocko
SUSE Labs
On Mon 08-03-21 21:02:25, Matthew Wilcox wrote:
> On Thu, Mar 04, 2021 at 07:40:53AM +0000, Zhou Guanghui wrote:
> > As described in the split_page function comment, for the non-compound
> > high order page, the sub-pages must be freed individually. If the
> > memcg of the fisrt page is valid, the tail pages cannot be uncharged
> > when be freed.
> >
> > For example, when alloc_pages_exact is used to allocate 1MB continuous
> > physical memory, 2MB is charged(kmemcg is enabled and __GFP_ACCOUNT is
> > set). When make_alloc_exact free the unused 1MB and free_pages_exact
> > free the applied 1MB, actually, only 4KB(one page) is uncharged.
> >
> > Therefore, the memcg of the tail page needs to be set when split page.
>
> There's another place we need to do this to ...
>
> +++ b/mm/page_alloc.c
> @@ -5081,9 +5081,15 @@ void __free_pages(struct page *page, unsigned int order)
> {
> if (put_page_testzero(page))
> free_the_page(page, order);
> - else if (!PageHead(page))
> - while (order-- > 0)
> - free_the_page(page + (1 << order), order);
> + else if (!PageHead(page)) {
> + while (order-- > 0) {
> + struct page *tail = page + (1 << order);
> +#ifdef CONFIG_MEMCG
> + tail->memcg_data = page->memcg_data;
> +#endif
> + free_the_page(tail, order);
> + }
> + }
> }
> EXPORT_SYMBOL(__free_pages);
Hmm, I was not aware of this code. This is really tricky code.
> I wonder if we shouldn't initialise memcg_data on all subsequent pages
> of non-compound allocations instead? Because I'm not sure this is the
> only place that needs to be fixed.
That would be safer for sure. Do you mean this as a replacement for the
original patch?
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 913c2b9e5c72..d44dea2b8d22 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -3135,8 +3135,21 @@ int __memcg_kmem_charge_page(struct page *page, gfp_t gfp, int order)
if (memcg && !mem_cgroup_is_root(memcg)) {
ret = __memcg_kmem_charge(memcg, gfp, 1 << order);
if (!ret) {
+ int nr_pages = 1 << order;
page->memcg_data = (unsigned long)memcg |
MEMCG_DATA_KMEM;
+
+ /*
+ * Compound pages are normally split or freed
+ * via their head pages so memcg_data in the
+ * head page should be sufficient but there
+ * are exceptions to the rule (see __free_pages).
+ * Non compound pages would need to copy memcg anyway.
+ */
+ for (i = 1; i < nr_pages; i++) {
+ struct page *p = page + i;
+ p->memcg_data = page->memcg_data;
+ }
return 0;
}
css_put(&memcg->css);
--
Michal Hocko
SUSE Labs
On Tue, Mar 09, 2021 at 10:02:00AM +0100, Michal Hocko wrote:
> On Mon 08-03-21 21:02:25, Matthew Wilcox wrote:
> > On Thu, Mar 04, 2021 at 07:40:53AM +0000, Zhou Guanghui wrote:
> > > For example, when alloc_pages_exact is used to allocate 1MB continuous
> > > physical memory, 2MB is charged(kmemcg is enabled and __GFP_ACCOUNT is
> > > set). When make_alloc_exact free the unused 1MB and free_pages_exact
> > > free the applied 1MB, actually, only 4KB(one page) is uncharged.
> > @@ -5081,9 +5081,15 @@ void __free_pages(struct page *page, unsigned int order)
> > {
> > if (put_page_testzero(page))
> > free_the_page(page, order);
> > - else if (!PageHead(page))
> > - while (order-- > 0)
> > - free_the_page(page + (1 << order), order);
> > + else if (!PageHead(page)) {
> > + while (order-- > 0) {
> > + struct page *tail = page + (1 << order);
> > +#ifdef CONFIG_MEMCG
> > + tail->memcg_data = page->memcg_data;
> > +#endif
> > + free_the_page(tail, order);
> > + }
> > + }
> > }
> > EXPORT_SYMBOL(__free_pages);
>
> Hmm, I was not aware of this code. This is really a tricky code.
Yes. I only added it recently. I don't see a better way to solve this
problem. We could turn the non-compound page into a compound page at
this point, but I'm not sure that's really less tricky.
> > I wonder if we shouldn't initialise memcg_data on all subsequent pages
> > of non-compound allocations instead? Because I'm not sure this is the
> > only place that needs to be fixed.
>
> That would be safer for sure. Do you mean this as a replacement to the
> original patch?
>
> diff --git a/mm/memcontrol.c b/mm/memcontrol.c
> index 913c2b9e5c72..d44dea2b8d22 100644
> --- a/mm/memcontrol.c
> +++ b/mm/memcontrol.c
> @@ -3135,8 +3135,21 @@ int __memcg_kmem_charge_page(struct page *page, gfp_t gfp, int order)
> if (memcg && !mem_cgroup_is_root(memcg)) {
> ret = __memcg_kmem_charge(memcg, gfp, 1 << order);
> if (!ret) {
> + int nr_pages = 1 << order;
> page->memcg_data = (unsigned long)memcg |
> MEMCG_DATA_KMEM;
> +
> + /*
> + * Compound pages are normally split or freed
> + * via their head pages so memcg_data in in the
> + * head page should be sufficient but there
> + * are exceptions to the rule (see __free_pages).
> + * Non compound pages would need to copy memcg anyway.
> + */
> + for (i = 1; i < nr_pages; i++) {
> + struct page * p = page + i;
> + p->memcg_data = page->memcg_data
> + }
> return 0;
I would condition this loop on if (!(gfp & __GFP_COMP)), but yes, something
along these lines. I might phrase the comment a little differently ...
/*
* Compound pages are treated as a single unit,
* but non-compound pages can be freed individually
* so each page needs to have its memcg set to get
* the accounting right.
*/
On Tue 09-03-21 12:32:55, Matthew Wilcox wrote:
> On Tue, Mar 09, 2021 at 10:02:00AM +0100, Michal Hocko wrote:
[...]
> > diff --git a/mm/memcontrol.c b/mm/memcontrol.c
> > index 913c2b9e5c72..d44dea2b8d22 100644
> > --- a/mm/memcontrol.c
> > +++ b/mm/memcontrol.c
> > @@ -3135,8 +3135,21 @@ int __memcg_kmem_charge_page(struct page *page, gfp_t gfp, int order)
> > if (memcg && !mem_cgroup_is_root(memcg)) {
> > ret = __memcg_kmem_charge(memcg, gfp, 1 << order);
> > if (!ret) {
> > + int nr_pages = 1 << order;
> > page->memcg_data = (unsigned long)memcg |
> > MEMCG_DATA_KMEM;
> > +
> > + /*
> > + * Compound pages are normally split or freed
> > + * via their head pages so memcg_data in in the
> > + * head page should be sufficient but there
> > + * are exceptions to the rule (see __free_pages).
> > + * Non compound pages would need to copy memcg anyway.
> > + */
> > + for (i = 1; i < nr_pages; i++) {
> > + struct page * p = page + i;
> > + p->memcg_data = page->memcg_data
> > + }
> > return 0;
>
> I would condition this loop on if (!(gfp & __GFP_COMP)), but yes, something
> along these lines. I might phrase the comment a little differently ...
>
> /*
> * Compound pages are treated as a single unit,
> * but non-compound pages can be freed individually
> * so each page needs to have its memcg set to get
> * the accounting right.
> */
OK, I must have misunderstood your __free_pages fix then. I thought this
was about compound pages. Btw. again I forgot about css ref counting so
here is an updated version.
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 913c2b9e5c72..ec2c705f38fa 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -3133,10 +3133,22 @@ int __memcg_kmem_charge_page(struct page *page, gfp_t gfp, int order)
memcg = get_mem_cgroup_from_current();
if (memcg && !mem_cgroup_is_root(memcg)) {
- ret = __memcg_kmem_charge(memcg, gfp, 1 << order);
+ int nr_pages = 1 << order;
+ ret = __memcg_kmem_charge(memcg, gfp, nr_pages);
if (!ret) {
page->memcg_data = (unsigned long)memcg |
MEMCG_DATA_KMEM;
+ if (nr_pages > 1) {
+ /*
+ * comment goes here
+ */
+ for (i = 1; i < nr_pages; i++) {
+ struct page *p = page + i;
+ p->memcg_data = page->memcg_data;
+ }
+ /* Head page reference from get_mem_cgroup_from_current */
+ css_get_many(&memcg->css, nr_pages - 1);
+ }
return 0;
}
css_put(&memcg->css);
--
Michal Hocko
SUSE Labs
On 9/3/21 7:28 pm, Michal Hocko wrote:
> On Tue 09-03-21 09:37:29, Balbir Singh wrote:
>> On 4/3/21 6:40 pm, Zhou Guanghui wrote:
> [...]
>>> -#ifdef CONFIG_TRANSPARENT_HUGEPAGE
>>> /*
>>> - * Because page_memcg(head) is not set on compound tails, set it now.
>>> + * Because page_memcg(head) is not set on tails, set it now.
>>> */
>>> -void mem_cgroup_split_huge_fixup(struct page *head)
>>> +void split_page_memcg(struct page *head, unsigned int nr)
>>> {
>>
>> Do we need input validation on nr? Can nr be aribtrary or can we enforce
>>
>> VM_BUG_ON(!is_power_of_2(nr));
>
> In practice this will be power of 2 but why should we bother to sanitze
> that?
>
Just when DEBUG_VM is enabled, to ensure the contract is valid. Given
that nr is now variable, we could end up with subtle bugs unless we can
audit all callers. Even the power-of-2 check does not catch the fact
that nr is indeed what we expect, but it still catches a large range of
invalid inputs.
Balbir Singh.
On Thu, 11 Mar 2021, Singh, Balbir wrote:
> On 9/3/21 7:28 pm, Michal Hocko wrote:
> > On Tue 09-03-21 09:37:29, Balbir Singh wrote:
> >> On 4/3/21 6:40 pm, Zhou Guanghui wrote:
> > [...]
> >>> -#ifdef CONFIG_TRANSPARENT_HUGEPAGE
> >>> /*
> >>> - * Because page_memcg(head) is not set on compound tails, set it now.
> >>> + * Because page_memcg(head) is not set on tails, set it now.
> >>> */
> >>> -void mem_cgroup_split_huge_fixup(struct page *head)
> >>> +void split_page_memcg(struct page *head, unsigned int nr)
> >>> {
> >>
> >> Do we need input validation on nr? Can nr be aribtrary or can we enforce
> >>
> >> VM_BUG_ON(!is_power_of_2(nr));
> >
> > In practice this will be power of 2 but why should we bother to sanitze
> > that?
> >
>
> Just when DEBUG_VM is enabled to ensure the contract is valid, given that
> nr is now variable, we could end up with subtle bugs unless we can audit
> all callers. Even the power of 2 check does not catch the fact that nr
> is indeed what we expect, but it still checks a large range of invalid
> inputs.
I think you imagine this is something it's not.
"all callers" are __split_huge_page() and split_page() (maybe Matthew
will have a third caller, maybe not). It is not something drivers will
be calling directly themselves, and it won't ever get EXPORTed to them.
Hugh
On 11/3/21 9:00 am, Hugh Dickins wrote:
> On Thu, 11 Mar 2021, Singh, Balbir wrote:
>> On 9/3/21 7:28 pm, Michal Hocko wrote:
>>> On Tue 09-03-21 09:37:29, Balbir Singh wrote:
>>>> On 4/3/21 6:40 pm, Zhou Guanghui wrote:
>>> [...]
>>>>> -#ifdef CONFIG_TRANSPARENT_HUGEPAGE
>>>>> /*
>>>>> - * Because page_memcg(head) is not set on compound tails, set it now.
>>>>> + * Because page_memcg(head) is not set on tails, set it now.
>>>>> */
>>>>> -void mem_cgroup_split_huge_fixup(struct page *head)
>>>>> +void split_page_memcg(struct page *head, unsigned int nr)
>>>>> {
>>>>
>>>> Do we need input validation on nr? Can nr be aribtrary or can we enforce
>>>>
>>>> VM_BUG_ON(!is_power_of_2(nr));
>>>
>>> In practice this will be power of 2 but why should we bother to sanitze
>>> that?
>>>
>>
>> Just when DEBUG_VM is enabled to ensure the contract is valid, given that
>> nr is now variable, we could end up with subtle bugs unless we can audit
>> all callers. Even the power of 2 check does not catch the fact that nr
>> is indeed what we expect, but it still checks a large range of invalid
>> inputs.
>
> I think you imagine this is something it's not.
>
> "all callers" are __split_huge_page() and split_page() (maybe Matthew
> will have a third caller, maybe not). It is not something drivers will
> be calling directly themselves, and it won't ever get EXPORTed to them.
>
Don't feel strongly about it if that is the case.
Thanks,
Balbir Singh
Johannes, Hugh,
what do you think about this approach? If we want to stick with the
split_page approach then we need to update the missing place Matthew
has pointed out.
On Tue 09-03-21 14:03:36, Michal Hocko wrote:
> On Tue 09-03-21 12:32:55, Matthew Wilcox wrote:
> > On Tue, Mar 09, 2021 at 10:02:00AM +0100, Michal Hocko wrote:
> [...]
> > > diff --git a/mm/memcontrol.c b/mm/memcontrol.c
> > > index 913c2b9e5c72..d44dea2b8d22 100644
> > > --- a/mm/memcontrol.c
> > > +++ b/mm/memcontrol.c
> > > @@ -3135,8 +3135,21 @@ int __memcg_kmem_charge_page(struct page *page, gfp_t gfp, int order)
> > > if (memcg && !mem_cgroup_is_root(memcg)) {
> > > ret = __memcg_kmem_charge(memcg, gfp, 1 << order);
> > > if (!ret) {
> > > + int nr_pages = 1 << order;
> > > page->memcg_data = (unsigned long)memcg |
> > > MEMCG_DATA_KMEM;
> > > +
> > > + /*
> > > + * Compound pages are normally split or freed
> > > + * via their head pages so memcg_data in in the
> > > + * head page should be sufficient but there
> > > + * are exceptions to the rule (see __free_pages).
> > > + * Non compound pages would need to copy memcg anyway.
> > > + */
> > > + for (i = 1; i < nr_pages; i++) {
> > > + struct page * p = page + i;
> > > + p->memcg_data = page->memcg_data
> > > + }
> > > return 0;
> >
> > I would condition this loop on if (!(gfp & __GFP_COMP)), but yes, something
> > along these lines. I might phrase the comment a little differently ...
> >
> > /*
> > * Compound pages are treated as a single unit,
> > * but non-compound pages can be freed individually
> > * so each page needs to have its memcg set to get
> > * the accounting right.
> > */
>
> OK, I must have misunderstood your __free_pages fix then. I thought this
> was about compound pages. Btw. again I forgot about css ref counting so
> here is an updated version.
>
> diff --git a/mm/memcontrol.c b/mm/memcontrol.c
> index 913c2b9e5c72..ec2c705f38fa 100644
> --- a/mm/memcontrol.c
> +++ b/mm/memcontrol.c
> @@ -3133,10 +3133,22 @@ int __memcg_kmem_charge_page(struct page *page, gfp_t gfp, int order)
>
> memcg = get_mem_cgroup_from_current();
> if (memcg && !mem_cgroup_is_root(memcg)) {
> - ret = __memcg_kmem_charge(memcg, gfp, 1 << order);
> + int nr_pages = 1 << order;
> + ret = __memcg_kmem_charge(memcg, gfp, nr_pages);
> if (!ret) {
> page->memcg_data = (unsigned long)memcg |
> MEMCG_DATA_KMEM;
> + if (nr_pages > 1) {
> + /*
> + * comment goes here
> + */
> + for (i = 1; i < nr_pages; i++) {
> + struct page * p = page + i;
> + p->memcg_data = page->memcg_data
> + }
> + /* Head page reference from get_mem_cgroup_from_current */
> + css_get_many(&memcg->css, nr_pages - 1);
> + }
> return 0;
> }
> css_put(&memcg->css);
--
Michal Hocko
SUSE Labs
On Thu, Mar 11, 2021 at 09:37:02AM +0100, Michal Hocko wrote:
> Johannes, Hugh,
>
> what do you think about this approach? If we want to stick with
> split_page approach then we need to update the missing place Matthew has
> pointed out.
I find the __free_pages() code quite tricky as well. But for that
reason I would actually prefer to initiate the splitting in there,
since that's the place where we actually split the page, rather than
spread the handling of this situation further out.
The race condition shouldn't be hot, so I don't think we need to be as
efficient about setting page->memcg_data only on the higher-order
buddies as in Willy's scratch patch. We can call split_page_memcg(),
which IMO should actually help document what's happening to the page.
I think that function could also benefit a bit more from step-by-step
documentation about what's going on. The kerneldoc is helpful, but I
don't think it does justice to how tricky this race condition is.
Something like this?
void __free_pages(struct page *page, unsigned int order)
{
/*
* Drop the base reference from __alloc_pages and free. In
* case there is an outstanding speculative reference, from
* e.g. the page cache, it will put and free the page later.
*/
if (likely(put_page_testzero(page))) {
free_the_page(page, order);
return;
}
/*
* The speculative reference will put and free the page.
*
* However, if the speculation was into a higher-order page
* that isn't marked compound, the other side will know
* nothing about our buddy pages and only free the order-0
* page at the start of our chunk! We must split off and free
* the buddy pages here.
*
* The buddy pages aren't individually refcounted, so they
* can't have any pending speculative references themselves.
*/
if (!PageHead(page) && order > 0) {
split_page_memcg(page, 1 << order);
while (order-- > 0)
free_the_page(page + (1 << order), order);
}
}
On Thu, Mar 11, 2021 at 10:21:39AM -0500, Johannes Weiner wrote:
> On Thu, Mar 11, 2021 at 09:37:02AM +0100, Michal Hocko wrote:
> > Johannes, Hugh,
> >
> > what do you think about this approach? If we want to stick with
> > split_page approach then we need to update the missing place Matthew has
> > pointed out.
>
> I find the __free_pages() code quite tricky as well. But for that
> reason I would actually prefer to initiate the splitting in there,
> since that's the place where we actually split the page, rather than
> spread the handling of this situation further out.
Mmm. The thing is, we don't actually split the page because it was
never compound. I don't know whether anybody actually does this,
but it's legitimate to write:
struct page *p = alloc_pages(GFP_KERNEL, 2);
free_unref_page(p + 1);
free_unref_page(p + 3);
free_unref_page(p + 2);
__free_page(p);
The good news is that I recently made free_unref_page() local to
mm/internal.h, so we don't need to worry about device drivers doing this.
As far as I can tell, we don't have any exposure to this kind of thing
today through functions exported from mm, but I might have missed
something.
I'd really like to get rid of non-compound high-order pages. Slab,
filesystems and anonymous memory all use compound pages. I think
it's just crusty old device drivers that don't. And alloc_pages_exact(),
of course, but that's kind of internal.
> The race condition shouldn't be hot, so I don't think we need to be as
> efficient about setting page->memcg_data only on the higher-order
> buddies as in Willy's scratch patch. We can call split_page_memcg(),
> which IMO should actually help document what's happening to the page.
I'm cool with that. I agree, this is not a performance case!
> I think that function could also benefit a bit more from step-by-step
> documentation about what's going on. The kerneldoc is helpful, but I
> don't think it does justice to how tricky this race condition is.
Always good to have other people read over your explanation ...
the kernel-doc could probably be simplified as a result.
On Thu 11-03-21 10:21:39, Johannes Weiner wrote:
> On Thu, Mar 11, 2021 at 09:37:02AM +0100, Michal Hocko wrote:
> > Johannes, Hugh,
> >
> > what do you think about this approach? If we want to stick with
> > split_page approach then we need to update the missing place Matthew has
> > pointed out.
>
> I find the __free_pages() code quite tricky as well. But for that
> reason I would actually prefer to initiate the splitting in there,
> since that's the place where we actually split the page, rather than
> spread the handling of this situation further out.
>
> The race condition shouldn't be hot, so I don't think we need to be as
> efficient about setting page->memcg_data only on the higher-order
> buddies as in Willy's scratch patch. We can call split_page_memcg(),
> which IMO should actually help document what's happening to the page.
>
> I think that function could also benefit a bit more from step-by-step
> documentation about what's going on. The kerneldoc is helpful, but I
> don't think it does justice to how tricky this race condition is.
>
> Something like this?
>
> void __free_pages(struct page *page, unsigned int order)
> {
> /*
> * Drop the base reference from __alloc_pages and free. In
> * case there is an outstanding speculative reference, from
> * e.g. the page cache, it will put and free the page later.
> */
> if (likely(put_page_testzero(page))) {
> free_the_page(page, order);
> return;
> }
>
> /*
> * The speculative reference will put and free the page.
> *
> * However, if the speculation was into a higher-order page
> * that isn't marked compound, the other side will know
> * nothing about our buddy pages and only free the order-0
> * page at the start of our chunk! We must split off and free
> * the buddy pages here.
> *
> * The buddy pages aren't individually refcounted, so they
> * can't have any pending speculative references themselves.
> */
> if (!PageHead(page) && order > 0) {
> split_page_memcg(page, 1 << order);
> while (order-- > 0)
> free_the_page(page + (1 << order), order);
> }
> }
Fine with me. Matthew was concerned about more places that do something
similar, but I would say that if we find more such places we might
reconsider; for now we stay with a reasonably clear model where only the
head page carries the memcg information and split_page_memcg is
necessary to break such a page into smaller pieces.
--
Michal Hocko
SUSE Labs
On Thu, 11 Mar 2021, Michal Hocko wrote:
> On Thu 11-03-21 10:21:39, Johannes Weiner wrote:
> > On Thu, Mar 11, 2021 at 09:37:02AM +0100, Michal Hocko wrote:
> > > Johannes, Hugh,
> > >
> > > what do you think about this approach? If we want to stick with
> > > split_page approach then we need to update the missing place Matthew has
> > > pointed out.
> >
> > I find the __free_pages() code quite tricky as well. But for that
> > reason I would actually prefer to initiate the splitting in there,
> > since that's the place where we actually split the page, rather than
> > spread the handling of this situation further out.
> >
> > The race condition shouldn't be hot, so I don't think we need to be as
> > efficient about setting page->memcg_data only on the higher-order
> > buddies as in Willy's scratch patch. We can call split_page_memcg(),
> > which IMO should actually help document what's happening to the page.
> >
> > I think that function could also benefit a bit more from step-by-step
> > documentation about what's going on. The kerneldoc is helpful, but I
> > don't think it does justice to how tricky this race condition is.
> >
> > Something like this?
> >
> > void __free_pages(struct page *page, unsigned int order)
> > {
> > /*
> > * Drop the base reference from __alloc_pages and free. In
> > * case there is an outstanding speculative reference, from
> > * e.g. the page cache, it will put and free the page later.
> > */
> > if (likely(put_page_testzero(page))) {
> > free_the_page(page, order);
> > return;
> > }
> >
> > /*
> > * The speculative reference will put and free the page.
> > *
> > * However, if the speculation was into a higher-order page
> > * that isn't marked compound, the other side will know
> > * nothing about our buddy pages and only free the order-0
> > * page at the start of our chunk! We must split off and free
> > * the buddy pages here.
> > *
> > * The buddy pages aren't individually refcounted, so they
> > * can't have any pending speculative references themselves.
> > */
> > if (!PageHead(page) && order > 0) {
> > split_page_memcg(page, 1 << order);
> > while (order-- > 0)
> > free_the_page(page + (1 << order), order);
> > }
> > }
>
> Fine with me. Mathew was concerned about more places that do something
> similar but I would say that if we find out more places we might
> reconsider and currently stay with a reasonably clear model that it is
> only head patch that carries the memcg information and split_page_memcg
> is necessary to break such page into smaller pieces.
I agree: I do like Johannes' suggestion best, now that we already
have split_page_memcg(). Not too worried about contrived use of
free_unref_page() here; and whether non-compound high-order pages
should be perpetuated is a different discussion.
Hugh
On Thu 11-03-21 12:37:20, Hugh Dickins wrote:
> On Thu, 11 Mar 2021, Michal Hocko wrote:
> > On Thu 11-03-21 10:21:39, Johannes Weiner wrote:
> > > On Thu, Mar 11, 2021 at 09:37:02AM +0100, Michal Hocko wrote:
> > > > Johannes, Hugh,
> > > >
> > > > what do you think about this approach? If we want to stick with
> > > > split_page approach then we need to update the missing place Matthew has
> > > > pointed out.
> > >
> > > I find the __free_pages() code quite tricky as well. But for that
> > > reason I would actually prefer to initiate the splitting in there,
> > > since that's the place where we actually split the page, rather than
> > > spread the handling of this situation further out.
> > >
> > > The race condition shouldn't be hot, so I don't think we need to be as
> > > efficient about setting page->memcg_data only on the higher-order
> > > buddies as in Willy's scratch patch. We can call split_page_memcg(),
> > > which IMO should actually help document what's happening to the page.
> > >
> > > I think that function could also benefit a bit more from step-by-step
> > > documentation about what's going on. The kerneldoc is helpful, but I
> > > don't think it does justice to how tricky this race condition is.
> > >
> > > Something like this?
> > >
> > > void __free_pages(struct page *page, unsigned int order)
> > > {
> > > /*
> > > * Drop the base reference from __alloc_pages and free. In
> > > * case there is an outstanding speculative reference, from
> > > * e.g. the page cache, it will put and free the page later.
> > > */
> > > if (likely(put_page_testzero(page))) {
> > > free_the_page(page, order);
> > > return;
> > > }
> > >
> > > /*
> > > * The speculative reference will put and free the page.
> > > *
> > > * However, if the speculation was into a higher-order page
> > > * that isn't marked compound, the other side will know
> > > * nothing about our buddy pages and only free the order-0
> > > * page at the start of our chunk! We must split off and free
> > > * the buddy pages here.
> > > *
> > > * The buddy pages aren't individually refcounted, so they
> > > * can't have any pending speculative references themselves.
> > > */
> > > if (!PageHead(page) && order > 0) {
> > > split_page_memcg(page, 1 << order);
> > > while (order-- > 0)
> > > free_the_page(page + (1 << order), order);
> > > }
> > > }
> >
> > Fine with me. Mathew was concerned about more places that do something
> > similar but I would say that if we find out more places we might
> > reconsider and currently stay with a reasonably clear model that it is
> > only head patch that carries the memcg information and split_page_memcg
> > is necessary to break such page into smaller pieces.
>
> I agree: I do like Johannes' suggestion best, now that we already
> have split_page_memcg(). Not too worried about contrived use of
> free_unref_page() here; and whether non-compound high-order pages
> should be perpetuated is a different discussion.
Matthew, are you planning to post a patch with suggested changes or
should I do it?
--
Michal Hocko
SUSE Labs
On Thu, Mar 18, 2021 at 03:05:00PM +0100, Michal Hocko wrote:
> On Thu 11-03-21 12:37:20, Hugh Dickins wrote:
> > On Thu, 11 Mar 2021, Michal Hocko wrote:
> > > On Thu 11-03-21 10:21:39, Johannes Weiner wrote:
> > > > On Thu, Mar 11, 2021 at 09:37:02AM +0100, Michal Hocko wrote:
> > > > > Johannes, Hugh,
> > > > >
> > > > > what do you think about this approach? If we want to stick with
> > > > > split_page approach then we need to update the missing place Matthew has
> > > > > pointed out.
> > > >
> > > > I find the __free_pages() code quite tricky as well. But for that
> > > > reason I would actually prefer to initiate the splitting in there,
> > > > since that's the place where we actually split the page, rather than
> > > > spread the handling of this situation further out.
> > > >
> > > > The race condition shouldn't be hot, so I don't think we need to be as
> > > > efficient about setting page->memcg_data only on the higher-order
> > > > buddies as in Willy's scratch patch. We can call split_page_memcg(),
> > > > which IMO should actually help document what's happening to the page.
> > > >
> > > > I think that function could also benefit a bit more from step-by-step
> > > > documentation about what's going on. The kerneldoc is helpful, but I
> > > > don't think it does justice to how tricky this race condition is.
> > > >
> > > > Something like this?
> > > >
> > > > void __free_pages(struct page *page, unsigned int order)
> > > > {
> > > > /*
> > > > * Drop the base reference from __alloc_pages and free. In
> > > > * case there is an outstanding speculative reference, from
> > > > * e.g. the page cache, it will put and free the page later.
> > > > */
> > > > if (likely(put_page_testzero(page))) {
> > > > free_the_page(page, order);
> > > > return;
> > > > }
> > > >
> > > > /*
> > > > * The speculative reference will put and free the page.
> > > > *
> > > > * However, if the speculation was into a higher-order page
> > > > * that isn't marked compound, the other side will know
> > > > * nothing about our buddy pages and only free the order-0
> > > > * page at the start of our chunk! We must split off and free
> > > > * the buddy pages here.
> > > > *
> > > > * The buddy pages aren't individually refcounted, so they
> > > > * can't have any pending speculative references themselves.
> > > > */
> > > > if (!PageHead(page) && order > 0) {
> > > > split_page_memcg(page, 1 << order);
> > > > while (order-- > 0)
> > > > free_the_page(page + (1 << order), order);
> > > > }
> > > > }
> > >
> > > Fine with me. Mathew was concerned about more places that do something
> > > similar but I would say that if we find out more places we might
> > > reconsider and currently stay with a reasonably clear model that it is
> > > only head patch that carries the memcg information and split_page_memcg
> > > is necessary to break such page into smaller pieces.
> >
> > I agree: I do like Johannes' suggestion best, now that we already
> > have split_page_memcg(). Not too worried about contrived use of
> > free_unref_page() here; and whether non-compound high-order pages
> > should be perpetuated is a different discussion.
>
> Matthew, are you planning to post a patch with suggested changes or
> should I do it?
I'm busy with the folio work; could you do it please?
On Thu, Mar 18, 2021 at 03:05:00PM +0100, Michal Hocko wrote:
> On Thu 11-03-21 12:37:20, Hugh Dickins wrote:
> > On Thu, 11 Mar 2021, Michal Hocko wrote:
> > > On Thu 11-03-21 10:21:39, Johannes Weiner wrote:
> > > > On Thu, Mar 11, 2021 at 09:37:02AM +0100, Michal Hocko wrote:
> > > > > Johannes, Hugh,
> > > > >
> > > > > what do you think about this approach? If we want to stick with
> > > > > split_page approach then we need to update the missing place Matthew has
> > > > > pointed out.
> > > >
> > > > I find the __free_pages() code quite tricky as well. But for that
> > > > reason I would actually prefer to initiate the splitting in there,
> > > > since that's the place where we actually split the page, rather than
> > > > spread the handling of this situation further out.
> > > >
> > > > The race condition shouldn't be hot, so I don't think we need to be as
> > > > efficient about setting page->memcg_data only on the higher-order
> > > > buddies as in Willy's scratch patch. We can call split_page_memcg(),
> > > > which IMO should actually help document what's happening to the page.
> > > >
> > > > I think that function could also benefit a bit more from step-by-step
> > > > documentation about what's going on. The kerneldoc is helpful, but I
> > > > don't think it does justice to how tricky this race condition is.
> > > >
> > > > Something like this?
> > > >
> > > > void __free_pages(struct page *page, unsigned int order)
> > > > {
> > > > /*
> > > > * Drop the base reference from __alloc_pages and free. In
> > > > * case there is an outstanding speculative reference, from
> > > > * e.g. the page cache, it will put and free the page later.
> > > > */
> > > > if (likely(put_page_testzero(page))) {
> > > > free_the_page(page, order);
> > > > return;
> > > > }
> > > >
> > > > /*
> > > > * The speculative reference will put and free the page.
> > > > *
> > > > * However, if the speculation was into a higher-order page
> > > > * that isn't marked compound, the other side will know
> > > > * nothing about our buddy pages and only free the order-0
> > > > * page at the start of our chunk! We must split off and free
> > > > * the buddy pages here.
> > > > *
> > > > * The buddy pages aren't individually refcounted, so they
> > > > * can't have any pending speculative references themselves.
> > > > */
> > > > if (!PageHead(page) && order > 0) {
> > > > split_page_memcg(page, 1 << order);
> > > > while (order-- > 0)
> > > > free_the_page(page + (1 << order), order);
> > > > }
> > > > }
> > >
> > > Fine with me. Mathew was concerned about more places that do something
> > > similar but I would say that if we find out more places we might
> > > reconsider and currently stay with a reasonably clear model that it is
> > > only head patch that carries the memcg information and split_page_memcg
> > > is necessary to break such page into smaller pieces.
> >
> > I agree: I do like Johannes' suggestion best, now that we already
> > have split_page_memcg(). Not too worried about contrived use of
> > free_unref_page() here; and whether non-compound high-order pages
> > should be perpetuated is a different discussion.
>
> Matthew, are you planning to post a patch with suggested changes or
> should I do it?
I'll post a proper patch.