2015-05-21 13:27:36

by Michal Hocko

[permalink] [raw]
Subject: [PATCH] hugetlb: Do not account hugetlb pages as NR_FILE_PAGES

hugetlb pages use add_to_page_cache to track shared mappings. This
is OK from the data structure point of view but less so for the
NR_FILE_PAGES accounting:
- huge pages are accounted as 4k, which is clearly wrong
- the counter is used as the amount of reclaimable page
cache, which is incorrect as well because hugetlb pages are
special and not reclaimable
- the counter is then exported to userspace via /proc/meminfo
(in Cached:), /proc/vmstat and /proc/zoneinfo as
nr_file_pages, which is confusing to say the least:
Cached: 8883504 kB
HugePages_Free: 8348
...
Cached: 8916048 kB
HugePages_Free: 156
...
that's 8192 huge pages allocated, which is ~16G, accounted as ~32M

There are usually not that many huge pages in the system for this to
make any visible difference, e.g. by fooling __vm_enough_memory or
zone_pagecache_reclaimable.

Fix this by special casing huge pages in both __delete_from_page_cache
and __add_to_page_cache_locked. replace_page_cache_page is currently
only used by fuse, which shouldn't touch hugetlb pages AFAICS, but it
is more robust to special case there as well.

Hugetlb pages shouldn't reach any other paths where we do accounting:
- migration - we have special handling via
hugetlbfs_migrate_page
- shmem - doesn't handle hugetlb pages directly even for
SHM_HUGETLB resp. MAP_HUGETLB
- swapcache - hugetlb is not swappable

This has a user visible effect but I believe it is reasonable because
the previously exported number is simply bogus.

An alternative would be to account hugetlb pages with their real size
and treat them similarly to shmem. But this has some drawbacks.

First, we would have to special case kernel users of NR_FILE_PAGES,
and considering how special hugetlb is we would have to do it
everywhere. We do not want Cached exported by /proc/meminfo to include
it because the value would be even more misleading.
__vm_enough_memory and zone_pagecache_reclaimable would have to do
the same thing because those pages are simply not reclaimable. The
correction is not even trivial because we would have to consider all
active hugetlb page sizes properly. Users of the counter outside of
the kernel would have to do the same.
So the question is why account something that needs to be excluded
for basically every reasonable usage. This doesn't make much sense to me.

It seems that this has been broken since hugetlb was introduced but I
haven't checked the whole history.

Signed-off-by: Michal Hocko <[email protected]>
---
mm/filemap.c | 17 ++++++++++++++---
1 file changed, 14 insertions(+), 3 deletions(-)

diff --git a/mm/filemap.c b/mm/filemap.c
index 413eb20abfa7..249bde91c273 100644
--- a/mm/filemap.c
+++ b/mm/filemap.c
@@ -197,7 +197,9 @@ void __delete_from_page_cache(struct page *page, void *shadow)
page->mapping = NULL;
/* Leave page->index set: truncation lookup relies upon it */

- __dec_zone_page_state(page, NR_FILE_PAGES);
+ /* hugetlb pages do not participate into page cache accounting. */
+ if (!PageHuge(page))
+ __dec_zone_page_state(page, NR_FILE_PAGES);
if (PageSwapBacked(page))
__dec_zone_page_state(page, NR_SHMEM);
BUG_ON(page_mapped(page));
@@ -484,7 +486,13 @@ int replace_page_cache_page(struct page *old, struct page *new, gfp_t gfp_mask)
error = radix_tree_insert(&mapping->page_tree, offset, new);
BUG_ON(error);
mapping->nrpages++;
- __inc_zone_page_state(new, NR_FILE_PAGES);
+
+ /*
+ * hugetlb pages do not participate into page cache
+ * accounting.
+ */
+ if (!PageHuge(new))
+ __inc_zone_page_state(new, NR_FILE_PAGES);
if (PageSwapBacked(new))
__inc_zone_page_state(new, NR_SHMEM);
spin_unlock_irq(&mapping->tree_lock);
@@ -576,7 +584,10 @@ static int __add_to_page_cache_locked(struct page *page,
radix_tree_preload_end();
if (unlikely(error))
goto err_insert;
- __inc_zone_page_state(page, NR_FILE_PAGES);
+
+ /* hugetlb pages do not participate into page cache accounting. */
+ if (!huge)
+ __inc_zone_page_state(page, NR_FILE_PAGES);
spin_unlock_irq(&mapping->tree_lock);
if (!huge)
mem_cgroup_commit_charge(page, memcg, false);
--
2.1.4


2015-05-21 13:52:47

by Mel Gorman

[permalink] [raw]
Subject: Re: [PATCH] hugetlb: Do not account hugetlb pages as NR_FILE_PAGES

On Thu, May 21, 2015 at 03:27:22PM +0200, Michal Hocko wrote:
> hugetlb pages uses add_to_page_cache to track shared mappings. This
[...]
> Signed-off-by: Michal Hocko <[email protected]>

Acked-by: Mel Gorman <[email protected]>

--
Mel Gorman
SUSE Labs

2015-05-21 16:19:12

by Mike Kravetz

[permalink] [raw]
Subject: Re: [PATCH] hugetlb: Do not account hugetlb pages as NR_FILE_PAGES

On 05/21/2015 06:27 AM, Michal Hocko wrote:
> hugetlb pages uses add_to_page_cache to track shared mappings. This
[...]
> Signed-off-by: Michal Hocko <[email protected]>

Just for grins, I added this to my hugetlbfs fallocate stress testing
which really exercises hugetlb add and delete from page cache.
Everything is as expected.

Tested-by: Mike Kravetz <[email protected]>

--
Mike Kravetz

2015-05-21 17:09:28

by Johannes Weiner

[permalink] [raw]
Subject: Re: [PATCH] hugetlb: Do not account hugetlb pages as NR_FILE_PAGES

On Thu, May 21, 2015 at 03:27:22PM +0200, Michal Hocko wrote:
> hugetlb pages uses add_to_page_cache to track shared mappings. This
[...]
> Signed-off-by: Michal Hocko <[email protected]>

Acked-by: Johannes Weiner <[email protected]>

This makes a lot of sense to me. The only thing I worry about is the
proliferation of PageHuge(), a function call, in relatively hot paths.

Naoya-san, would there be a strong reason to make this function a
static inline in the header?

2015-05-22 05:10:23

by Naoya Horiguchi

[permalink] [raw]
Subject: Re: [PATCH] hugetlb: Do not account hugetlb pages as NR_FILE_PAGES

On Thu, May 21, 2015 at 03:27:22PM +0200, Michal Hocko wrote:
> hugetlb pages uses add_to_page_cache to track shared mappings. This
[...]
> Signed-off-by: Michal Hocko <[email protected]>

looks good to me,

Reviewed-by: Naoya Horiguchi <[email protected]>

2015-05-22 06:01:42

by Michal Hocko

[permalink] [raw]
Subject: Re: [PATCH] hugetlb: Do not account hugetlb pages as NR_FILE_PAGES

On Thu 21-05-15 09:18:59, Mike Kravetz wrote:
> On 05/21/2015 06:27 AM, Michal Hocko wrote:
> >hugetlb pages uses add_to_page_cache to track shared mappings. This
[...]
> >Signed-off-by: Michal Hocko <[email protected]>
>
> Just for grins, I added this to my hugetlbfs fallocate stress testing
> which really exercises hugetlb add and delete from page cache.
> Everything is as expected.
>
> Tested-by: Mike Kravetz <[email protected]>

Thanks for your testing!

--
Michal Hocko
SUSE Labs

2015-05-22 14:21:48

by Michal Hocko

[permalink] [raw]
Subject: Re: [PATCH] hugetlb: Do not account hugetlb pages as NR_FILE_PAGES

On Thu 21-05-15 13:09:09, Johannes Weiner wrote:
> On Thu, May 21, 2015 at 03:27:22PM +0200, Michal Hocko wrote:
> > hugetlb pages uses add_to_page_cache to track shared mappings. This
[...]
> > Signed-off-by: Michal Hocko <[email protected]>
>
> Acked-by: Johannes Weiner <[email protected]>

Thanks!

> This makes a lot of sense to me. The only thing I worry about is the
> proliferation of PageHuge(), a function call, in relatively hot paths.

I've tried that (see the patch below) but it enlarged the code by almost
1k
text data bss dec hex filename
510323 74273 44440 629036 9992c mm/built-in.o.before
511248 74273 44440 629961 99cc9 mm/built-in.o.after

I am not sure the code size increase is worth it. Maybe we can reduce
the check to only PageCompound(page) as huge pages are not in the page
cache (yet).

---
diff --git a/include/linux/hugetlb.h b/include/linux/hugetlb.h
index 205026175c42..2e36251ad31b 100644
--- a/include/linux/hugetlb.h
+++ b/include/linux/hugetlb.h
@@ -84,7 +84,6 @@ void hugetlb_unreserve_pages(struct inode *inode, long offset, long freed);
int dequeue_hwpoisoned_huge_page(struct page *page);
bool isolate_huge_page(struct page *page, struct list_head *list);
void putback_active_hugepage(struct page *page);
-void free_huge_page(struct page *page);

#ifdef CONFIG_ARCH_WANT_HUGE_PMD_SHARE
pte_t *huge_pmd_share(struct mm_struct *mm, unsigned long addr, pud_t *pud);
diff --git a/include/linux/mm.h b/include/linux/mm.h
index 2923a51979e9..5b6c49e55f80 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -531,23 +531,6 @@ void put_pages_list(struct list_head *pages);
void split_page(struct page *page, unsigned int order);
int split_free_page(struct page *page);

-/*
- * Compound pages have a destructor function. Provide a
- * prototype for that function and accessor functions.
- * These are _only_ valid on the head of a PG_compound page.
- */
-
-static inline void set_compound_page_dtor(struct page *page,
- compound_page_dtor *dtor)
-{
- page[1].compound_dtor = dtor;
-}
-
-static inline compound_page_dtor *get_compound_page_dtor(struct page *page)
-{
- return page[1].compound_dtor;
-}
-
static inline int compound_order(struct page *page)
{
if (!PageHead(page))
diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h
index 8d37e26a1007..a9ecaebb5392 100644
--- a/include/linux/mm_types.h
+++ b/include/linux/mm_types.h
@@ -226,6 +226,23 @@ struct page_frag {
#endif
};

+/*
+ * Compound pages have a destructor function. Provide a
+ * prototype for that function and accessor functions.
+ * These are _only_ valid on the head of a PG_compound page.
+ */
+
+static inline void set_compound_page_dtor(struct page *page,
+ compound_page_dtor *dtor)
+{
+ page[1].compound_dtor = dtor;
+}
+
+static inline compound_page_dtor *get_compound_page_dtor(struct page *page)
+{
+ return page[1].compound_dtor;
+}
+
typedef unsigned long __nocast vm_flags_t;

/*
diff --git a/include/linux/page-flags.h b/include/linux/page-flags.h
index 91b7f9b2b774..41329fbb5890 100644
--- a/include/linux/page-flags.h
+++ b/include/linux/page-flags.h
@@ -547,7 +547,21 @@ static inline void ClearPageCompound(struct page *page)
#endif /* !PAGEFLAGS_EXTENDED */

#ifdef CONFIG_HUGETLB_PAGE
-int PageHuge(struct page *page);
+void free_huge_page(struct page *page);
+
+/*
+ * PageHuge() only returns true for hugetlbfs pages, but not for normal or
+ * transparent huge pages. See the PageTransHuge() documentation for more
+ * details.
+ */
+static inline int PageHuge(struct page *page)
+{
+ if (!PageCompound(page))
+ return 0;
+
+ page = compound_head(page);
+ return get_compound_page_dtor(page) == free_huge_page;
+}
int PageHeadHuge(struct page *page);
bool page_huge_active(struct page *page);
#else
diff --git a/mm/hugetlb.c b/mm/hugetlb.c
index 54f129dc37f6..406913f3b234 100644
--- a/mm/hugetlb.c
+++ b/mm/hugetlb.c
@@ -1041,21 +1041,6 @@ static void prep_compound_gigantic_page(struct page *page, unsigned long order)
}

/*
- * PageHuge() only returns true for hugetlbfs pages, but not for normal or
- * transparent huge pages. See the PageTransHuge() documentation for more
- * details.
- */
-int PageHuge(struct page *page)
-{
- if (!PageCompound(page))
- return 0;
-
- page = compound_head(page);
- return get_compound_page_dtor(page) == free_huge_page;
-}
-EXPORT_SYMBOL_GPL(PageHuge);
-
-/*
* PageHeadHuge() only returns true for hugetlbfs head page, but not for
* normal or transparent huge pages.
*/

--
Michal Hocko
SUSE Labs

2015-05-22 14:28:28

by Michal Hocko

[permalink] [raw]
Subject: Re: [PATCH] hugetlb: Do not account hugetlb pages as NR_FILE_PAGES

On Fri 22-05-15 16:21:43, Michal Hocko wrote:
> On Thu 21-05-15 13:09:09, Johannes Weiner wrote:
[...]
> > This makes a lot of sense to me. The only thing I worry about is the
> > proliferation of PageHuge(), a function call, in relatively hot paths.
>
> I've tried that (see the patch below) but it enlarged the code by almost
> 1k
> text data bss dec hex filename
> 510323 74273 44440 629036 9992c mm/built-in.o.before
> 511248 74273 44440 629961 99cc9 mm/built-in.o.after
>
> I am not sure the code size increase is worth it. Maybe we can reduce
> the check to only PageCompound(page) as huge pages are no in the page
> cache (yet).

Just to prevent confusion: I meant to reduce the check only for
this particular case. But that is probably not worth the trouble
either...
--
Michal Hocko
SUSE Labs

2015-05-22 14:36:08

by Mel Gorman

[permalink] [raw]
Subject: Re: [PATCH] hugetlb: Do not account hugetlb pages as NR_FILE_PAGES

On Fri, May 22, 2015 at 04:21:43PM +0200, Michal Hocko wrote:
> On Thu 21-05-15 13:09:09, Johannes Weiner wrote:
> > On Thu, May 21, 2015 at 03:27:22PM +0200, Michal Hocko wrote:
> > > hugetlb pages uses add_to_page_cache to track shared mappings. This
[...]
> > > Signed-off-by: Michal Hocko <[email protected]>
> >
> > Acked-by: Johannes Weiner <[email protected]>
>
> Thanks!
>
> > This makes a lot of sense to me. The only thing I worry about is the
> > proliferation of PageHuge(), a function call, in relatively hot paths.
>
> I've tried that (see the patch below) but it enlarged the code by almost
> 1k
> text data bss dec hex filename
> 510323 74273 44440 629036 9992c mm/built-in.o.before
> 511248 74273 44440 629961 99cc9 mm/built-in.o.after
>
> I am not sure the code size increase is worth it. Maybe we can reduce
> the check to only PageCompound(page) as huge pages are no in the page
> cache (yet).
>

That would be a more sensible route because it also avoids exposing the
hugetlbfs destructor unnecessarily.

--
Mel Gorman
SUSE Labs

2015-05-25 15:24:34

by Vlastimil Babka

[permalink] [raw]
Subject: Re: [PATCH] hugetlb: Do not account hugetlb pages as NR_FILE_PAGES

On 05/22/2015 04:35 PM, Mel Gorman wrote:
>>
>> Thanks!
>>
>> > This makes a lot of sense to me. The only thing I worry about is the
>> > proliferation of PageHuge(), a function call, in relatively hot paths.
>>
>> I've tried that (see the patch below) but it enlarged the code by almost
>> 1k
>> text data bss dec hex filename
>> 510323 74273 44440 629036 9992c mm/built-in.o.before
>> 511248 74273 44440 629961 99cc9 mm/built-in.o.after
>>
>> I am not sure the code size increase is worth it. Maybe we can reduce
>> the check to only PageCompound(page) as huge pages are no in the page
>> cache (yet).
>>
>
> That would be a more sensible route because it also avoids exposing the
> hugetlbfs destructor unnecessarily.

You could maybe do test such as (PageCompound(page) && PageHuge(page)) to
short-circuit the call while remaining future-proof.
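The suggested short-circuit can be sketched in plain userspace C. Everything below (the struct layout, the `dtor` field, the call counter) is an illustrative stand-in, not the kernel's real definitions:

```c
#include <assert.h>

/* Userspace sketch of the short-circuit: an inlined PageCompound() test
 * sits in front of the out-of-line hugetlb check, so normal pages never
 * pay for the function call.  Names mimic the kernel's but are demo-only. */

#define PG_compound (1UL << 0)

struct page {
	unsigned long flags;
	void (*dtor)(struct page *);	/* stand-in for the compound dtor */
};

static void free_huge_page(struct page *page) { (void)page; }

static int slow_path_calls;		/* counts out-of-line calls (demo only) */

static int PageCompound(const struct page *page)
{
	return (page->flags & PG_compound) != 0;
}

/* Out-of-line part: only reached for compound pages. */
static int __PageHuge(const struct page *page)
{
	slow_path_calls++;
	return page->dtor == free_huge_page;
}

/* Inlined fast path, mirroring (PageCompound(page) && PageHuge(page)). */
static inline int PageHuge(const struct page *page)
{
	if (!PageCompound(page))
		return 0;
	return __PageHuge(page);
}
```

The point of the split is that the compiler inlines only the one-bit flags test at every call site, while the rarely taken hugetlb check stays behind a single out-of-line function.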

2015-06-02 09:25:45

by Michal Hocko

Subject: Re: [PATCH] hugetlb: Do not account hugetlb pages as NR_FILE_PAGES

On Mon 25-05-15 17:24:28, Vlastimil Babka wrote:
> On 05/22/2015 04:35 PM, Mel Gorman wrote:
> >>
> >> Thanks!
> >>
> >> > This makes a lot of sense to me. The only thing I worry about is the
> >> > proliferation of PageHuge(), a function call, in relatively hot paths.
> >>
> >> I've tried that (see the patch below) but it enlarged the code by almost
> >> 1k
> >> text data bss dec hex filename
> >> 510323 74273 44440 629036 9992c mm/built-in.o.before
> >> 511248 74273 44440 629961 99cc9 mm/built-in.o.after
> >>
> >> I am not sure the code size increase is worth it. Maybe we can reduce
> >> the check to only PageCompound(page) as huge pages are not in the page
> >> cache (yet).
> >>
> >
> > That would be a more sensible route because it also avoids exposing the
> > hugetlbfs destructor unnecessarily.
>
> You could maybe do a test such as (PageCompound(page) && PageHuge(page)) to
> short-circuit the call while remaining future-proof.

How about this?
---
diff --git a/include/linux/page-flags.h b/include/linux/page-flags.h
index 91b7f9b2b774..bb8a70e8fc77 100644
--- a/include/linux/page-flags.h
+++ b/include/linux/page-flags.h
@@ -547,7 +547,13 @@ static inline void ClearPageCompound(struct page *page)
#endif /* !PAGEFLAGS_EXTENDED */

#ifdef CONFIG_HUGETLB_PAGE
-int PageHuge(struct page *page);
+int __PageHuge(struct page *page);
+static inline int PageHuge(struct page *page)
+{
+ if (!PageCompound(page))
+ return 0;
+ return __PageHuge(page);
+}
int PageHeadHuge(struct page *page);
bool page_huge_active(struct page *page);
#else
diff --git a/mm/hugetlb.c b/mm/hugetlb.c
index 33defbe1897f..648c0c32857c 100644
--- a/mm/hugetlb.c
+++ b/mm/hugetlb.c
@@ -1107,19 +1107,17 @@ static void prep_compound_gigantic_page(struct page *page, unsigned long order)
}

/*
- * PageHuge() only returns true for hugetlbfs pages, but not for normal or
+ * __PageHuge() only returns true for hugetlbfs pages, but not for normal or
* transparent huge pages. See the PageTransHuge() documentation for more
* details.
*/
-int PageHuge(struct page *page)
+int __PageHuge(struct page *page)
{
- if (!PageCompound(page))
- return 0;
-
+ VM_BUG_ON(!PageCompound(page));
page = compound_head(page);
return get_compound_page_dtor(page) == free_huge_page;
}
-EXPORT_SYMBOL_GPL(PageHuge);
+EXPORT_SYMBOL_GPL(__PageHuge);

/*
* PageHeadHuge() only returns true for hugetlbfs head page, but not for

--
Michal Hocko
SUSE Labs

2015-06-02 09:33:13

by Vlastimil Babka

Subject: Re: [PATCH] hugetlb: Do not account hugetlb pages as NR_FILE_PAGES

On 06/02/2015 11:25 AM, Michal Hocko wrote:
> On Mon 25-05-15 17:24:28, Vlastimil Babka wrote:
>> On 05/22/2015 04:35 PM, Mel Gorman wrote:
>>>>
>>>> Thanks!
>>>>
>>>>> This makes a lot of sense to me. The only thing I worry about is the
>>>>> proliferation of PageHuge(), a function call, in relatively hot paths.
>>>>
>>>> I've tried that (see the patch below) but it enlarged the code by almost
>>>> 1k
>>>> text data bss dec hex filename
>>>> 510323 74273 44440 629036 9992c mm/built-in.o.before
>>>> 511248 74273 44440 629961 99cc9 mm/built-in.o.after
>>>>
>>>> I am not sure the code size increase is worth it. Maybe we can reduce
>>>> the check to only PageCompound(page) as huge pages are not in the page
>>>> cache (yet).
>>>>
>>>
>>> That would be a more sensible route because it also avoids exposing the
>>> hugetlbfs destructor unnecessarily.
>>
>> You could maybe do a test such as (PageCompound(page) && PageHuge(page)) to
>> short-circuit the call while remaining future-proof.
>
> How about this?

Yeah (see below)

> ---
> diff --git a/include/linux/page-flags.h b/include/linux/page-flags.h
> index 91b7f9b2b774..bb8a70e8fc77 100644
> --- a/include/linux/page-flags.h
> +++ b/include/linux/page-flags.h
> @@ -547,7 +547,13 @@ static inline void ClearPageCompound(struct page *page)
> #endif /* !PAGEFLAGS_EXTENDED */
>
> #ifdef CONFIG_HUGETLB_PAGE
> -int PageHuge(struct page *page);
> +int __PageHuge(struct page *page);
> +static inline int PageHuge(struct page *page)
> +{
> + if (!PageCompound(page))

Perhaps the above as likely()?

[...]

> -EXPORT_SYMBOL_GPL(PageHuge);
> +EXPORT_SYMBOL_GPL(__PageHuge);
>
> /*
> * PageHeadHuge() only returns true for hugetlbfs head page, but not for
>

Do the same thing here by inlining the PageHead() test?
I guess the page_to_pgoff and __compound_tail_refcounted callers are
rather hot?

2015-06-02 09:38:33

by Michal Hocko

Subject: Re: [PATCH] hugetlb: Do not account hugetlb pages as NR_FILE_PAGES

On Tue 02-06-15 11:33:05, Vlastimil Babka wrote:
> On 06/02/2015 11:25 AM, Michal Hocko wrote:
> >On Mon 25-05-15 17:24:28, Vlastimil Babka wrote:
> >>On 05/22/2015 04:35 PM, Mel Gorman wrote:
> >>>>
> >>>>Thanks!
> >>>>
> >>>>>This makes a lot of sense to me. The only thing I worry about is the
> >>>>>proliferation of PageHuge(), a function call, in relatively hot paths.
> >>>>
> >>>>I've tried that (see the patch below) but it enlarged the code by almost
> >>>>1k
> >>>> text data bss dec hex filename
> >>>> 510323 74273 44440 629036 9992c mm/built-in.o.before
> >>>> 511248 74273 44440 629961 99cc9 mm/built-in.o.after
> >>>>
> >>>>I am not sure the code size increase is worth it. Maybe we can reduce
> >>>>the check to only PageCompound(page) as huge pages are not in the page
> >>>>cache (yet).
> >>>>
> >>>
> >>>That would be a more sensible route because it also avoids exposing the
> >>>hugetlbfs destructor unnecessarily.
> >>
> >>You could maybe do a test such as (PageCompound(page) && PageHuge(page)) to
> >>short-circuit the call while remaining future-proof.
> >
> >How about this?
>
> Yeah (see below)
>
> >---
> >diff --git a/include/linux/page-flags.h b/include/linux/page-flags.h
> >index 91b7f9b2b774..bb8a70e8fc77 100644
> >--- a/include/linux/page-flags.h
> >+++ b/include/linux/page-flags.h
> >@@ -547,7 +547,13 @@ static inline void ClearPageCompound(struct page *page)
> > #endif /* !PAGEFLAGS_EXTENDED */
> >
> > #ifdef CONFIG_HUGETLB_PAGE
> >-int PageHuge(struct page *page);
> >+int __PageHuge(struct page *page);
> >+static inline int PageHuge(struct page *page)
> >+{
> >+ if (!PageCompound(page))
>
> Perhaps the above as likely()?

I had already added it when writing the changelog.

> [...]
>
> >-EXPORT_SYMBOL_GPL(PageHuge);
> >+EXPORT_SYMBOL_GPL(__PageHuge);
> >
> > /*
> > * PageHeadHuge() only returns true for hugetlbfs head page, but not for
> >
>
> Do the same thing here by inlining the PageHead() test?
> I guess the page_to_pgoff and __compound_tail_refcounted callers are rather
> hot?

Yes, that sounds like a good idea.

--
Michal Hocko
SUSE Labs

2015-06-02 10:00:46

by Michal Hocko

Subject: Re: [PATCH] hugetlb: Do not account hugetlb pages as NR_FILE_PAGES

On Tue 02-06-15 11:38:05, Michal Hocko wrote:
> On Tue 02-06-15 11:33:05, Vlastimil Babka wrote:
> > On 06/02/2015 11:25 AM, Michal Hocko wrote:
[...]
> > >diff --git a/include/linux/page-flags.h b/include/linux/page-flags.h
> > >index 91b7f9b2b774..bb8a70e8fc77 100644
> > >--- a/include/linux/page-flags.h
> > >+++ b/include/linux/page-flags.h
> > >@@ -547,7 +547,13 @@ static inline void ClearPageCompound(struct page *page)
> > > #endif /* !PAGEFLAGS_EXTENDED */
> > >
> > > #ifdef CONFIG_HUGETLB_PAGE
> > >-int PageHuge(struct page *page);
> > >+int __PageHuge(struct page *page);
> > >+static inline int PageHuge(struct page *page)
> > >+{
> > >+ if (!PageCompound(page))
> >
> > Perhaps the above as likely()?
>
> I have added it already when writing the changelog.
>
> > [...]
> >
> > >-EXPORT_SYMBOL_GPL(PageHuge);
> > >+EXPORT_SYMBOL_GPL(__PageHuge);
> > >
> > > /*
> > > * PageHeadHuge() only returns true for hugetlbfs head page, but not for
> > >
> >
> > Do the same thing here by inlining the PageHead() test?
> > I guess the page_to_pgoff and __compound_tail_refcounted callers are rather
> > hot?
>
> Yes, that sounds like a good idea.

So the overall code size (with defconfig) has still grown with the patch:
text data bss dec hex filename
443075 59217 25604 527896 80e18 mm/built-in.o.before
443477 59217 25604 528298 80faa mm/built-in.o.PageHuge
443653 59217 25604 528474 8105a mm/built-in.o.both

It is still below the ~1K growth of the fully inlined version, but the
increase is quite large on its own. So I am not sure it makes sense to
fiddle with this without actually seeing some penalty in profiles.

Here is what I have if somebody wants to play with it:
---
From a1d33be3ac209da1721864c2a5a1d43635c17444 Mon Sep 17 00:00:00 2001
From: Michal Hocko <[email protected]>
Date: Tue, 2 Jun 2015 11:25:48 +0200
Subject: [PATCH] hugetlb: split PageHuge into fast and slow path

Johannes Weiner has noted that PageHuge proliferates into relatively hot
paths like __add_to_page_cache_locked or __delete_from_page_cache, which
means we unconditionally do a function call even though hugetlb pages in
those paths are rare compared to normal pages. Let's split PageHuge into
two parts. The fast path checks PageCompound(), gets inlined, and
returns false for the common case; __PageHuge does the function call and
performs the remaining hugetlb-specific checks.

PageHeadHuge is a similar case, as pointed out by Vlastimil Babka. It is
called from page_to_pgoff, which is in turn used in the rmap walk. Do
the same thing there: check PageHead inline and do the rest via a
function call (__PageHeadHuge).

We could get rid of the function call completely because both __PageHuge
and __PageHeadHuge are trivial, but we would have to export other
hugetlb internals and that would increase the generated code size quite
a lot. This patch has a smaller code size footprint:
443075 59217 25604 527896 80e18 mm/built-in.o.before
443653 59217 25604 528474 8105a mm/built-in.o.after

Signed-off-by: Michal Hocko <[email protected]>
---
include/linux/page-flags.h | 17 +++++++++++++++--
mm/hugetlb.c | 18 +++++++-----------
2 files changed, 22 insertions(+), 13 deletions(-)

diff --git a/include/linux/page-flags.h b/include/linux/page-flags.h
index 91b7f9b2b774..151fb6b4e95b 100644
--- a/include/linux/page-flags.h
+++ b/include/linux/page-flags.h
@@ -547,8 +547,21 @@ static inline void ClearPageCompound(struct page *page)
#endif /* !PAGEFLAGS_EXTENDED */

#ifdef CONFIG_HUGETLB_PAGE
-int PageHuge(struct page *page);
-int PageHeadHuge(struct page *page);
+int __PageHuge(struct page *page);
+static inline int PageHuge(struct page *page)
+{
+ if (likely(!PageCompound(page)))
+ return 0;
+ return __PageHuge(page);
+}
+
+int __PageHeadHuge(struct page *page);
+static inline int PageHeadHuge(struct page *page_head)
+{
+ if (likely(!PageHead(page_head)))
+ return 0;
+ return __PageHeadHuge(page_head);
+}
bool page_huge_active(struct page *page);
#else
TESTPAGEFLAG_FALSE(Huge)
diff --git a/mm/hugetlb.c b/mm/hugetlb.c
index 33defbe1897f..d28024c7ca73 100644
--- a/mm/hugetlb.c
+++ b/mm/hugetlb.c
@@ -1107,29 +1107,25 @@ static void prep_compound_gigantic_page(struct page *page, unsigned long order)
}

/*
- * PageHuge() only returns true for hugetlbfs pages, but not for normal or
+ * __PageHuge() only returns true for hugetlbfs pages, but not for normal or
* transparent huge pages. See the PageTransHuge() documentation for more
* details.
*/
-int PageHuge(struct page *page)
+int __PageHuge(struct page *page)
{
- if (!PageCompound(page))
- return 0;
-
+ VM_BUG_ON(!PageCompound(page));
page = compound_head(page);
return get_compound_page_dtor(page) == free_huge_page;
}
-EXPORT_SYMBOL_GPL(PageHuge);
+EXPORT_SYMBOL_GPL(__PageHuge);

/*
- * PageHeadHuge() only returns true for hugetlbfs head page, but not for
+ * __PageHeadHuge() only returns true for hugetlbfs head page, but not for
* normal or transparent huge pages.
*/
-int PageHeadHuge(struct page *page_head)
+int __PageHeadHuge(struct page *page_head)
{
- if (!PageHead(page_head))
- return 0;
-
+ VM_BUG_ON(!PageHead(page_head));
return get_compound_page_dtor(page_head) == free_huge_page;
}

--
2.1.4

--
Michal Hocko
SUSE Labs