HugeTLB pages may soon support being mapped with PTEs. To allow for this
case, merge HugeTLB's mapcount scheme with THP's.
The first patch of this series comes from the HugeTLB high-granularity
mapping series[1], though with some updates, as the original version
was buggy[2] and incomplete.
I am sending this change as part of this smaller series in hopes that it
can be more thoroughly scrutinized.
I haven't run any THP performance tests with this series applied.
HugeTLB pages don't currently support being mapped with
`compound=false`, but this mapcount scheme will make collapsing
compound=false mappings in HugeTLB pages quite slow. This can be
optimized with future patches (likely by taking advantage of HugeTLB's
alignment guarantees).
Matthew Wilcox is working on a mapcounting scheme[3] that will avoid
the use of each subpage's mapcount. If this series is applied, Matthew's
new scheme will automatically apply to HugeTLB pages.
[1]: https://lore.kernel.org/linux-mm/[email protected]/
[2]: https://lore.kernel.org/linux-mm/CACw3F538H+bYcvSY-qG4-gmrgGPRBgTScDzrX9suLyp_q+v_bQ@mail.gmail.com/
[3]: https://lore.kernel.org/linux-mm/Y9Afwds%[email protected]/
James Houghton (2):
mm: rmap: make hugetlb pages participate in _nr_pages_mapped
mm: rmap: increase COMPOUND_MAPPED to support 512G HugeTLB pages
include/linux/mm.h | 7 +------
mm/hugetlb.c | 4 ++--
mm/internal.h | 9 ++++-----
mm/migrate.c | 2 +-
mm/rmap.c | 35 ++++++++++++++++++++---------------
5 files changed, 28 insertions(+), 29 deletions(-)
base-commit: 9caa15b8a49949342bdf495bd47660267a3bd371
--
2.40.0.rc0.216.gc4246ad0f0-goog
For compound mappings (compound=true), _nr_pages_mapped will now be
incremented by COMPOUND_MAPPED when the first compound mapping is
created.
For small mappings, _nr_pages_mapped is incremented by 1 when the
particular small page is mapped for the first time. This is incompatible
with HPageVmemmapOptimize()ed folios, as most of the tail page structs
will be mapped read-only.
Currently HugeTLB always passes compound=true, but in the future,
HugeTLB pages may be mapped with small mappings.
To implement this change:
1. Replace most of HugeTLB's calls to page_dup_file_rmap() with
page_add_file_rmap(). The call in copy_hugetlb_page_range() is kept.
2. Update page_add_file_rmap() and page_remove_rmap() to support
HugeTLB folios.
3. Update hugepage_add_anon_rmap() and hugepage_add_new_anon_rmap() to
also increment _nr_pages_mapped properly.
With these changes, folio_large_is_mapped() no longer needs to check
_entire_mapcount.
HugeTLB doesn't use LRU or mlock, so page_add_file_rmap() and
page_remove_rmap() excludes those pieces. It is also important that
the folio_test_pmd_mappable() check is removed (or changed), as it's
possible to have a HugeTLB page whose order is not >= HPAGE_PMD_ORDER,
like arm64's CONT_PTE_SIZE HugeTLB pages.
This patch limits HugeTLB pages to 16G in size. That limit can be
increased if COMPOUND_MAPPED is raised.
Signed-off-by: James Houghton <[email protected]>
diff --git a/include/linux/mm.h b/include/linux/mm.h
index ce1590933995..25c898c51129 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -1143,12 +1143,7 @@ static inline int total_mapcount(struct page *page)
static inline bool folio_large_is_mapped(struct folio *folio)
{
- /*
- * Reading _entire_mapcount below could be omitted if hugetlb
- * participated in incrementing nr_pages_mapped when compound mapped.
- */
- return atomic_read(&folio->_nr_pages_mapped) > 0 ||
- atomic_read(&folio->_entire_mapcount) >= 0;
+ return atomic_read(&folio->_nr_pages_mapped) > 0;
}
/**
diff --git a/mm/hugetlb.c b/mm/hugetlb.c
index 07abcb6eb203..bf3d327cd1b9 100644
--- a/mm/hugetlb.c
+++ b/mm/hugetlb.c
@@ -5911,7 +5911,7 @@ static vm_fault_t hugetlb_no_page(struct mm_struct *mm,
if (anon_rmap)
hugepage_add_new_anon_rmap(folio, vma, haddr);
else
- page_dup_file_rmap(&folio->page, true);
+ page_add_file_rmap(&folio->page, vma, true);
new_pte = make_huge_pte(vma, &folio->page, ((vma->vm_flags & VM_WRITE)
&& (vma->vm_flags & VM_SHARED)));
/*
@@ -6293,7 +6293,7 @@ int hugetlb_mcopy_atomic_pte(struct mm_struct *dst_mm,
goto out_release_unlock;
if (folio_in_pagecache)
- page_dup_file_rmap(&folio->page, true);
+ page_add_file_rmap(&folio->page, dst_vma, true);
else
hugepage_add_new_anon_rmap(folio, dst_vma, dst_addr);
diff --git a/mm/internal.h b/mm/internal.h
index fce94775819c..400451a4dd49 100644
--- a/mm/internal.h
+++ b/mm/internal.h
@@ -55,8 +55,7 @@ void page_writeback_init(void);
/*
* If a 16GB hugetlb folio were mapped by PTEs of all of its 4kB pages,
* its nr_pages_mapped would be 0x400000: choose the COMPOUND_MAPPED bit
- * above that range, instead of 2*(PMD_SIZE/PAGE_SIZE). Hugetlb currently
- * leaves nr_pages_mapped at 0, but avoid surprise if it participates later.
+ * above that range, instead of 2*(PMD_SIZE/PAGE_SIZE).
*/
#define COMPOUND_MAPPED 0x800000
#define FOLIO_PAGES_MAPPED (COMPOUND_MAPPED - 1)
diff --git a/mm/migrate.c b/mm/migrate.c
index 476cd24e8f32..c95c1cbc7a47 100644
--- a/mm/migrate.c
+++ b/mm/migrate.c
@@ -254,7 +254,7 @@ static bool remove_migration_pte(struct folio *folio,
hugepage_add_anon_rmap(new, vma, pvmw.address,
rmap_flags);
else
- page_dup_file_rmap(new, true);
+ page_add_file_rmap(new, vma, true);
set_huge_pte_at(vma->vm_mm, pvmw.address, pvmw.pte, pte);
} else
#endif
diff --git a/mm/rmap.c b/mm/rmap.c
index ba901c416785..4a975429b91a 100644
--- a/mm/rmap.c
+++ b/mm/rmap.c
@@ -1316,19 +1316,21 @@ void page_add_file_rmap(struct page *page, struct vm_area_struct *vma,
int nr = 0, nr_pmdmapped = 0;
bool first;
- VM_BUG_ON_PAGE(compound && !PageTransHuge(page), page);
+ VM_BUG_ON_PAGE(compound && !PageTransHuge(page)
+ && !folio_test_hugetlb(folio), page);
/* Is page being mapped by PTE? Is this its first map to be added? */
if (likely(!compound)) {
+ if (unlikely(folio_test_hugetlb(folio)))
+ VM_BUG_ON_PAGE(HPageVmemmapOptimized(&folio->page),
+ page);
first = atomic_inc_and_test(&page->_mapcount);
nr = first;
if (first && folio_test_large(folio)) {
nr = atomic_inc_return_relaxed(mapped);
nr = (nr < COMPOUND_MAPPED);
}
- } else if (folio_test_pmd_mappable(folio)) {
- /* That test is redundant: it's for safety or to optimize out */
-
+ } else {
first = atomic_inc_and_test(&folio->_entire_mapcount);
if (first) {
nr = atomic_add_return_relaxed(COMPOUND_MAPPED, mapped);
@@ -1345,6 +1347,9 @@ void page_add_file_rmap(struct page *page, struct vm_area_struct *vma,
}
}
+ if (folio_test_hugetlb(folio))
+ return;
+
if (nr_pmdmapped)
__lruvec_stat_mod_folio(folio, folio_test_swapbacked(folio) ?
NR_SHMEM_PMDMAPPED : NR_FILE_PMDMAPPED, nr_pmdmapped);
@@ -1373,24 +1378,18 @@ void page_remove_rmap(struct page *page, struct vm_area_struct *vma,
VM_BUG_ON_PAGE(compound && !PageHead(page), page);
- /* Hugetlb pages are not counted in NR_*MAPPED */
- if (unlikely(folio_test_hugetlb(folio))) {
- /* hugetlb pages are always mapped with pmds */
- atomic_dec(&folio->_entire_mapcount);
- return;
- }
-
/* Is page being unmapped by PTE? Is this its last map to be removed? */
if (likely(!compound)) {
+ if (unlikely(folio_test_hugetlb(folio)))
+ VM_BUG_ON_PAGE(HPageVmemmapOptimized(&folio->page),
+ page);
last = atomic_add_negative(-1, &page->_mapcount);
nr = last;
if (last && folio_test_large(folio)) {
nr = atomic_dec_return_relaxed(mapped);
nr = (nr < COMPOUND_MAPPED);
}
- } else if (folio_test_pmd_mappable(folio)) {
- /* That test is redundant: it's for safety or to optimize out */
-
+ } else {
last = atomic_add_negative(-1, &folio->_entire_mapcount);
if (last) {
nr = atomic_sub_return_relaxed(COMPOUND_MAPPED, mapped);
@@ -1407,6 +1406,9 @@ void page_remove_rmap(struct page *page, struct vm_area_struct *vma,
}
}
+ if (folio_test_hugetlb(folio))
+ return;
+
if (nr_pmdmapped) {
if (folio_test_anon(folio))
idx = NR_ANON_THPS;
@@ -2541,9 +2543,11 @@ void hugepage_add_anon_rmap(struct page *page, struct vm_area_struct *vma,
first = atomic_inc_and_test(&folio->_entire_mapcount);
VM_BUG_ON_PAGE(!first && (flags & RMAP_EXCLUSIVE), page);
VM_BUG_ON_PAGE(!first && PageAnonExclusive(page), page);
- if (first)
+ if (first) {
+ atomic_add(COMPOUND_MAPPED, &folio->_nr_pages_mapped);
__page_set_anon_rmap(folio, page, vma, address,
!!(flags & RMAP_EXCLUSIVE));
+ }
}
void hugepage_add_new_anon_rmap(struct folio *folio,
@@ -2552,6 +2556,7 @@ void hugepage_add_new_anon_rmap(struct folio *folio,
BUG_ON(address < vma->vm_start || address >= vma->vm_end);
/* increment count (starts at -1) */
atomic_set(&folio->_entire_mapcount, 0);
+ atomic_set(&folio->_nr_pages_mapped, COMPOUND_MAPPED);
folio_clear_hugetlb_restore_reserve(folio);
__page_set_anon_rmap(folio, &folio->page, vma, address, 1);
}
--
2.40.0.rc0.216.gc4246ad0f0-goog
Some architectures may want to support 512G HugeTLB pages in the future.
Adjust COMPOUND_MAPPED to support this page size.
Signed-off-by: James Houghton <[email protected]>
diff --git a/mm/internal.h b/mm/internal.h
index 400451a4dd49..994381fd5155 100644
--- a/mm/internal.h
+++ b/mm/internal.h
@@ -53,11 +53,11 @@ struct folio_batch;
void page_writeback_init(void);
/*
- * If a 16GB hugetlb folio were mapped by PTEs of all of its 4kB pages,
- * its nr_pages_mapped would be 0x400000: choose the COMPOUND_MAPPED bit
- * above that range, instead of 2*(PMD_SIZE/PAGE_SIZE).
+ * If a 512G hugetlb folio were mapped by PTEs of all of its 4kB pages,
+ * its nr_pages_mapped would be 2^27 (0x8000000): choose the COMPOUND_MAPPED bit
+ * above that range.
*/
-#define COMPOUND_MAPPED 0x800000
+#define COMPOUND_MAPPED (1 << 28)
#define FOLIO_PAGES_MAPPED (COMPOUND_MAPPED - 1)
/*
--
2.40.0.rc0.216.gc4246ad0f0-goog
On 03/06/23 23:00, James Houghton wrote:
> For compound mappings (compound=true), _nr_pages_mapped will now be
> incremented by COMPOUND_MAPPED when the first compound mapping is
> created.
This sentence makes it sound like incrementing by COMPOUND_MAPPED for
compound pages is introduced by this patch. Rather, it is just for
hugetlb (now always) compound mappings. Perhaps change that to read:
For hugetlb mappings ...
> For small mappings, _nr_pages_mapped is incremented by 1 when the
> particular small page is mapped for the first time. This is incompatible
> with HPageVmemmapOptimize()ed folios, as most of the tail page structs
> will be mapped read-only.
>
> Currently HugeTLB always passes compound=true, but in the future,
> HugeTLB pages may be mapped with small mappings.
>
> To implement this change:
> 1. Replace most of HugeTLB's calls to page_dup_file_rmap() with
> page_add_file_rmap(). The call in copy_hugetlb_page_range() is kept.
> 2. Update page_add_file_rmap() and page_remove_rmap() to support
> HugeTLB folios.
> 3. Update hugepage_add_anon_rmap() and hugepage_add_new_anon_rmap() to
> also increment _nr_pages_mapped properly.
>
> With these changes, folio_large_is_mapped() no longer needs to check
> _entire_mapcount.
>
> HugeTLB doesn't use LRU or mlock, so page_add_file_rmap() and
> page_remove_rmap() excludes those pieces. It is also important that
> the folio_test_pmd_mappable() check is removed (or changed), as it's
> possible to have a HugeTLB page whose order is not >= HPAGE_PMD_ORDER,
> like arm64's CONT_PTE_SIZE HugeTLB pages.
>
> This patch limits HugeTLB pages to 16G in size. That limit can be
> increased if COMPOUND_MAPPED is raised.
>
> Signed-off-by: James Houghton <[email protected]>
>
Thanks!
This is a step in the direction of having hugetlb use the same mapcount
scheme as elsewhere. As you mention, with this in place future mapcount
changes should mostly 'just work' for hugetlb.
Because of this,
Acked-by: Mike Kravetz <[email protected]>
I have a few nits below, and I'm sure others will chime in later.
> diff --git a/mm/rmap.c b/mm/rmap.c
> index ba901c416785..4a975429b91a 100644
> --- a/mm/rmap.c
> +++ b/mm/rmap.c
> @@ -1316,19 +1316,21 @@ void page_add_file_rmap(struct page *page, struct vm_area_struct *vma,
> int nr = 0, nr_pmdmapped = 0;
> bool first;
>
> - VM_BUG_ON_PAGE(compound && !PageTransHuge(page), page);
> + VM_BUG_ON_PAGE(compound && !PageTransHuge(page)
> + && !folio_test_hugetlb(folio), page);
>
> /* Is page being mapped by PTE? Is this its first map to be added? */
> if (likely(!compound)) {
> + if (unlikely(folio_test_hugetlb(folio)))
> + VM_BUG_ON_PAGE(HPageVmemmapOptimized(&folio->page),
> + page);
> first = atomic_inc_and_test(&page->_mapcount);
> nr = first;
> if (first && folio_test_large(folio)) {
> nr = atomic_inc_return_relaxed(mapped);
> nr = (nr < COMPOUND_MAPPED);
> }
> - } else if (folio_test_pmd_mappable(folio)) {
> - /* That test is redundant: it's for safety or to optimize out */
I 'think' removing this check is OK. It would seem that the caller
knows if the folio is mappable. If we want a similar test, we might be
able to use something like:
arch_hugetlb_valid_size(folio_size(folio))
> -
> + } else {
> first = atomic_inc_and_test(&folio->_entire_mapcount);
> if (first) {
> nr = atomic_add_return_relaxed(COMPOUND_MAPPED, mapped);
> @@ -1345,6 +1347,9 @@ void page_add_file_rmap(struct page *page, struct vm_area_struct *vma,
> }
> }
>
> + if (folio_test_hugetlb(folio))
> + return;
IMO, a comment saying hugetlb is special and does not participate in lru
would be appropriate here.
> +
> if (nr_pmdmapped)
> __lruvec_stat_mod_folio(folio, folio_test_swapbacked(folio) ?
> NR_SHMEM_PMDMAPPED : NR_FILE_PMDMAPPED, nr_pmdmapped);
> @@ -1373,24 +1378,18 @@ void page_remove_rmap(struct page *page, struct vm_area_struct *vma,
>
> VM_BUG_ON_PAGE(compound && !PageHead(page), page);
>
> - /* Hugetlb pages are not counted in NR_*MAPPED */
> - if (unlikely(folio_test_hugetlb(folio))) {
> - /* hugetlb pages are always mapped with pmds */
> - atomic_dec(&folio->_entire_mapcount);
> - return;
> - }
> -
> /* Is page being unmapped by PTE? Is this its last map to be removed? */
> if (likely(!compound)) {
> + if (unlikely(folio_test_hugetlb(folio)))
> + VM_BUG_ON_PAGE(HPageVmemmapOptimized(&folio->page),
> + page);
> last = atomic_add_negative(-1, &page->_mapcount);
> nr = last;
> if (last && folio_test_large(folio)) {
> nr = atomic_dec_return_relaxed(mapped);
> nr = (nr < COMPOUND_MAPPED);
> }
> - } else if (folio_test_pmd_mappable(folio)) {
> - /* That test is redundant: it's for safety or to optimize out */
> -
> + } else {
> last = atomic_add_negative(-1, &folio->_entire_mapcount);
> if (last) {
> nr = atomic_sub_return_relaxed(COMPOUND_MAPPED, mapped);
> @@ -1407,6 +1406,9 @@ void page_remove_rmap(struct page *page, struct vm_area_struct *vma,
> }
> }
>
> + if (folio_test_hugetlb(folio))
> + return;
Same as above in page_add_file_rmap.
> +
> if (nr_pmdmapped) {
> if (folio_test_anon(folio))
> idx = NR_ANON_THPS;
> @@ -2541,9 +2543,11 @@ void hugepage_add_anon_rmap(struct page *page, struct vm_area_struct *vma,
> first = atomic_inc_and_test(&folio->_entire_mapcount);
> VM_BUG_ON_PAGE(!first && (flags & RMAP_EXCLUSIVE), page);
> VM_BUG_ON_PAGE(!first && PageAnonExclusive(page), page);
> - if (first)
> + if (first) {
> + atomic_add(COMPOUND_MAPPED, &folio->_nr_pages_mapped);
> __page_set_anon_rmap(folio, page, vma, address,
> !!(flags & RMAP_EXCLUSIVE));
> + }
> }
>
> void hugepage_add_new_anon_rmap(struct folio *folio,
> @@ -2552,6 +2556,7 @@ void hugepage_add_new_anon_rmap(struct folio *folio,
> BUG_ON(address < vma->vm_start || address >= vma->vm_end);
> /* increment count (starts at -1) */
> atomic_set(&folio->_entire_mapcount, 0);
> + atomic_set(&folio->_nr_pages_mapped, COMPOUND_MAPPED);
> folio_clear_hugetlb_restore_reserve(folio);
> __page_set_anon_rmap(folio, &folio->page, vma, address, 1);
> }
Should we look at perhaps modifying page_add_anon_rmap and
folio_add_new_anon_rmap as well?
--
Mike Kravetz
On Tue, Mar 7, 2023 at 1:54 PM Mike Kravetz <[email protected]> wrote:
>
> On 03/06/23 23:00, James Houghton wrote:
> > For compound mappings (compound=true), _nr_pages_mapped will now be
> > incremented by COMPOUND_MAPPED when the first compound mapping is
> > created.
>
> This sentence makes it sound like incrementing by COMPOUND_MAPPED for
> compound pages is introduced by this patch. Rather, it is just for
> hugetlb (now always) compound mappings. Perhaps change that to read:
> For hugetlb mappings ...
Yes this is kind of confusing. I'll fix it like you suggest.
>
> > For small mappings, _nr_pages_mapped is incremented by 1 when the
> > particular small page is mapped for the first time. This is incompatible
> > with HPageVmemmapOptimize()ed folios, as most of the tail page structs
> > will be mapped read-only.
> >
> > Currently HugeTLB always passes compound=true, but in the future,
> > HugeTLB pages may be mapped with small mappings.
> >
> > To implement this change:
> > 1. Replace most of HugeTLB's calls to page_dup_file_rmap() with
> > page_add_file_rmap(). The call in copy_hugetlb_page_range() is kept.
> > 2. Update page_add_file_rmap() and page_remove_rmap() to support
> > HugeTLB folios.
> > 3. Update hugepage_add_anon_rmap() and hugepage_add_new_anon_rmap() to
> > also increment _nr_pages_mapped properly.
> >
> > With these changes, folio_large_is_mapped() no longer needs to check
> > _entire_mapcount.
> >
> > HugeTLB doesn't use LRU or mlock, so page_add_file_rmap() and
> > page_remove_rmap() excludes those pieces. It is also important that
> > the folio_test_pmd_mappable() check is removed (or changed), as it's
> > possible to have a HugeTLB page whose order is not >= HPAGE_PMD_ORDER,
> > like arm64's CONT_PTE_SIZE HugeTLB pages.
> >
> > This patch limits HugeTLB pages to 16G in size. That limit can be
> > increased if COMPOUND_MAPPED is raised.
> >
> > Signed-off-by: James Houghton <[email protected]>
> >
>
> Thanks!
>
> This is a step in the direction of having hugetlb use the same mapcount
> scheme as elsewhere. As you mention, with this in place future mapcount
> changes should mostly 'just work' for hugetlb.
>
> Because of this,
> Acked-by: Mike Kravetz <[email protected]>
Thanks!
>
> I have a few nits below, and I'm sure others will chime in later.
>
> > diff --git a/mm/rmap.c b/mm/rmap.c
> > index ba901c416785..4a975429b91a 100644
> > --- a/mm/rmap.c
> > +++ b/mm/rmap.c
> > @@ -1316,19 +1316,21 @@ void page_add_file_rmap(struct page *page, struct vm_area_struct *vma,
> > int nr = 0, nr_pmdmapped = 0;
> > bool first;
> >
> > - VM_BUG_ON_PAGE(compound && !PageTransHuge(page), page);
> > + VM_BUG_ON_PAGE(compound && !PageTransHuge(page)
> > + && !folio_test_hugetlb(folio), page);
> >
> > /* Is page being mapped by PTE? Is this its first map to be added? */
> > if (likely(!compound)) {
> > + if (unlikely(folio_test_hugetlb(folio)))
> > + VM_BUG_ON_PAGE(HPageVmemmapOptimized(&folio->page),
> > + page);
> > first = atomic_inc_and_test(&page->_mapcount);
> > nr = first;
> > if (first && folio_test_large(folio)) {
> > nr = atomic_inc_return_relaxed(mapped);
> > nr = (nr < COMPOUND_MAPPED);
> > }
> > - } else if (folio_test_pmd_mappable(folio)) {
> > - /* That test is redundant: it's for safety or to optimize out */
>
> I 'think' removing this check is OK. It would seem that the caller
> knows if the folio is mappable. If we want a similar test, we might be
> able to use something like:
>
> arch_hugetlb_valid_size(folio_size(folio))
>
Ack. I think leaving the check(s) removed is fine.
> > -
> > + } else {
> > first = atomic_inc_and_test(&folio->_entire_mapcount);
> > if (first) {
> > nr = atomic_add_return_relaxed(COMPOUND_MAPPED, mapped);
> > @@ -1345,6 +1347,9 @@ void page_add_file_rmap(struct page *page, struct vm_area_struct *vma,
> > }
> > }
> >
> > + if (folio_test_hugetlb(folio))
> > + return;
>
> IMO, a comment saying hugetlb is special and does not participate in lru
> would be appropriate here.
Will do.
>
> > +
> > if (nr_pmdmapped)
> > __lruvec_stat_mod_folio(folio, folio_test_swapbacked(folio) ?
> > NR_SHMEM_PMDMAPPED : NR_FILE_PMDMAPPED, nr_pmdmapped);
> > @@ -1373,24 +1378,18 @@ void page_remove_rmap(struct page *page, struct vm_area_struct *vma,
> >
> > VM_BUG_ON_PAGE(compound && !PageHead(page), page);
> >
> > - /* Hugetlb pages are not counted in NR_*MAPPED */
> > - if (unlikely(folio_test_hugetlb(folio))) {
> > - /* hugetlb pages are always mapped with pmds */
> > - atomic_dec(&folio->_entire_mapcount);
> > - return;
> > - }
> > -
> > /* Is page being unmapped by PTE? Is this its last map to be removed? */
> > if (likely(!compound)) {
> > + if (unlikely(folio_test_hugetlb(folio)))
> > + VM_BUG_ON_PAGE(HPageVmemmapOptimized(&folio->page),
> > + page);
> > last = atomic_add_negative(-1, &page->_mapcount);
> > nr = last;
> > if (last && folio_test_large(folio)) {
> > nr = atomic_dec_return_relaxed(mapped);
> > nr = (nr < COMPOUND_MAPPED);
> > }
> > - } else if (folio_test_pmd_mappable(folio)) {
> > - /* That test is redundant: it's for safety or to optimize out */
> > -
> > + } else {
> > last = atomic_add_negative(-1, &folio->_entire_mapcount);
> > if (last) {
> > nr = atomic_sub_return_relaxed(COMPOUND_MAPPED, mapped);
> > @@ -1407,6 +1406,9 @@ void page_remove_rmap(struct page *page, struct vm_area_struct *vma,
> > }
> > }
> >
> > + if (folio_test_hugetlb(folio))
> > + return;
>
> Same as above in page_add_file_rmap.
>
> > +
> > if (nr_pmdmapped) {
> > if (folio_test_anon(folio))
> > idx = NR_ANON_THPS;
> > @@ -2541,9 +2543,11 @@ void hugepage_add_anon_rmap(struct page *page, struct vm_area_struct *vma,
> > first = atomic_inc_and_test(&folio->_entire_mapcount);
> > VM_BUG_ON_PAGE(!first && (flags & RMAP_EXCLUSIVE), page);
> > VM_BUG_ON_PAGE(!first && PageAnonExclusive(page), page);
> > - if (first)
> > + if (first) {
> > + atomic_add(COMPOUND_MAPPED, &folio->_nr_pages_mapped);
> > __page_set_anon_rmap(folio, page, vma, address,
> > !!(flags & RMAP_EXCLUSIVE));
> > + }
> > }
> >
> > void hugepage_add_new_anon_rmap(struct folio *folio,
> > @@ -2552,6 +2556,7 @@ void hugepage_add_new_anon_rmap(struct folio *folio,
> > BUG_ON(address < vma->vm_start || address >= vma->vm_end);
> > /* increment count (starts at -1) */
> > atomic_set(&folio->_entire_mapcount, 0);
> > + atomic_set(&folio->_nr_pages_mapped, COMPOUND_MAPPED);
> > folio_clear_hugetlb_restore_reserve(folio);
> > __page_set_anon_rmap(folio, &folio->page, vma, address, 1);
> > }
>
> Should we look at perhaps modifying page_add_anon_rmap and
> folio_add_new_anon_rmap as well?
I think I can merge hugepage_add_anon_rmap with page_add_anon_rmap and
hugepage_add_new_anon_rmap with folio_add_new_anon_rmap. With them
merged, it's pretty easy to see what HugeTLB does differently from
generic mm, which is nice. :)
On Tue, Mar 07, 2023 at 04:36:51PM -0800, James Houghton wrote:
> > > if (likely(!compound)) {
> > > + if (unlikely(folio_test_hugetlb(folio)))
> > > + VM_BUG_ON_PAGE(HPageVmemmapOptimized(&folio->page),
> > > + page);
How about moving folio_test_hugetlb() into the BUG_ON()?
VM_BUG_ON_PAGE(folio_test_hugetlb(folio) &&
HPageVmemmapOptimized(&folio->page),
page);
Note that BUG_ON() already contains an "unlikely".
> > > first = atomic_inc_and_test(&page->_mapcount);
> > > nr = first;
> > > if (first && folio_test_large(folio)) {
> > > nr = atomic_inc_return_relaxed(mapped);
> > > nr = (nr < COMPOUND_MAPPED);
> > > }
> > > - } else if (folio_test_pmd_mappable(folio)) {
> > > - /* That test is redundant: it's for safety or to optimize out */
> >
> > I 'think' removing this check is OK. It would seem that the caller
> > knows if the folio is mappable. If we want a similar test, we might be
> > able to use something like:
> >
> > arch_hugetlb_valid_size(folio_size(folio))
> >
>
> Ack. I think leaving the check(s) removed is fine.
Would it still be good to keep that as another BUG_ON()?
--
Peter Xu
On Mon, Mar 06, 2023 at 11:00:02PM +0000, James Houghton wrote:
> HugeTLB pages may soon support being mapped with PTEs. To allow for this
> case, merge HugeTLB's mapcount scheme with THP's.
>
> The first patch of this series comes from the HugeTLB high-granularity
> mapping series[1], though with some updates, as the original version
> was buggy[2] and incomplete.
>
> I am sending this change as part of this smaller series in hopes that it
> can be more thoroughly scrutinized.
>
> I haven't run any THP performance tests with this series applied.
> HugeTLB pages don't currently support being mapped with
> `compound=false`, but this mapcount scheme will make collapsing
> compound=false mappings in HugeTLB pages quite slow. This can be
> optimized with future patches (likely by taking advantage of HugeTLB's
> alignment guarantees).
>
> Matthew Wilcox is working on a mapcounting scheme[3] that will avoid
> the use of each subpage's mapcount. If this series is applied, Matthew's
> new scheme will automatically apply to HugeTLB pages.
Is this the plan?
I may have not followed closely on the latest development of Matthew's
idea. The thing is if the design requires ptes being installed / removed
at the same time for the whole folio, then it may not work directly for HGM
if HGM wants to support at least postcopy, iiuc, because if we install the
whole folio ptes at the same time it seems to beat the whole purpose of
having HGM..
The patch (especially patch 1) looks good. So it's a pure question just to
make sure we're on the same page. IIUC your other mapcount proposal may
work, but it still needs to be able to take care of ptes in less-than-folio
sizes whatever it'll look like at last.
A trivial comment on patch 2 since we're at it: does "a future plan on some
arch to support 512GB huge page" justify itself? It would be better
justified, IMHO, when that support is added (and decided to use HGM)?
What I feel like is missing (rather than patch 2 itself) is some guard to
make sure thp mapcountings will not be abused with new hugetlb sizes
coming.
How about another BUG_ON() squashed into patch 1 (probably somewhere in
page_add_file|anon_rmap()) to make sure folio_size() is always smaller than
COMPOUND_MAPPED / 2)?
--
Peter Xu
On Wed, Mar 8, 2023 at 2:10 PM Peter Xu <[email protected]> wrote:
>
> On Mon, Mar 06, 2023 at 11:00:02PM +0000, James Houghton wrote:
> > HugeTLB pages may soon support being mapped with PTEs. To allow for this
> > case, merge HugeTLB's mapcount scheme with THP's.
> >
> > The first patch of this series comes from the HugeTLB high-granularity
> > mapping series[1], though with some updates, as the original version
> > was buggy[2] and incomplete.
> >
> > I am sending this change as part of this smaller series in hopes that it
> > can be more thoroughly scrutinized.
> >
> > I haven't run any THP performance tests with this series applied.
> > HugeTLB pages don't currently support being mapped with
> > `compound=false`, but this mapcount scheme will make collapsing
> > compound=false mappings in HugeTLB pages quite slow. This can be
> > optimized with future patches (likely by taking advantage of HugeTLB's
> > alignment guarantees).
> >
> > Matthew Wilcox is working on a mapcounting scheme[3] that will avoid
> > the use of each subpage's mapcount. If this series is applied, Matthew's
> > new scheme will automatically apply to HugeTLB pages.
>
> Is this the plan?
>
> I may have not followed closely on the latest development of Matthew's
> idea. The thing is if the design requires ptes being installed / removed
> at the same time for the whole folio, then it may not work directly for HGM
> if HGM wants to support at least postcopy, iiuc, because if we install the
> whole folio ptes at the same time it seems to beat the whole purpose of
> having HGM..
My understanding is that it doesn't *require* all the PTEs in a folio
to be mapped at the same time. I don't see how it possibly could,
given that UFFDIO_CONTINUE exists (which can already create PTE-mapped
THPs today). It would be faster to populate all the PTEs at the same
time (you would only need to traverse the page table once for the
entire group to see if you should be incrementing mapcount).
Though, with respect to unmapping, if PTEs aren't all unmapped at the
same time, then you could end up with a case where mapcount is still
incremented but nothing is really mapped. I'm not really sure what
should be done there, but this problem applies to PTE-mapped THPs the
same way that it applies to HGMed HugeTLB pages.
> The patch (especially patch 1) looks good. So it's a pure question just to
> make sure we're on the same page. IIUC your other mapcount proposal may
> work, but it still needs to be able to take care of ptes in less-than-folio
> sizes whatever it'll look like at last.
By my "other mapcount proposal", I assume you mean the "using the
PAGE_SPECIAL bit to track if mapcount has been incremented or not". It
really only serves as an optimization for Matthew's scheme (see below
[2] for some more thoughts), and it doesn't have to only apply to
HugeTLB.
I originally thought[1] that Matthew's scheme would be really painful
for postcopy for HGM without this optimization, but it's actually not
so bad. Let's assume the worst case, that we're UFFDIO_CONTINUEing
from the end to the beginning, like in [1]:
First CONTINUE: pvmw finds an empty PUD, so quickly returns false.
Second CONTINUE: pvmw finds 511 empty PMDs, then finds 511 empty PTEs,
then finds a present PTE (from the first CONTINUE).
Third CONTINUE: pvmw finds 511 empty PMDs, then finds 510 empty PTEs.
...
514th CONTINUE: pvmw finds 510 empty PMDs, then finds 511 empty PTEs.
So it'll be slow, but it won't have to check 262k empty PTEs per
CONTINUE (though you could make this possible with MADV_DONTNEED).
Even with an HGM implementation that only allows PTE-mapping of
HugeTLB pages, it should still behave just like this, too.
> A trivial comment on patch 2 since we're at it: does "a future plan on some
> arch to support 512GB huge page" justify itself? It would be better
> justified, IMHO, when that support is added (and decided to use HGM)?
That's fine with me. I'm happy to drop that patch.
> What I feel like is missing (rather than patch 2 itself) is some guard to
> make sure thp mapcountings will not be abused with new hugetlb sizes
> coming.
>
> How about another BUG_ON() squashed into patch 1 (probably somewhere in
> page_add_file|anon_rmap()) to make sure folio_size() is always smaller than
> COMPOUND_MAPPED / 2)?
Sure, I can add that.
Thanks, Peter!
- James
[1]: https://lore.kernel.org/linux-mm/CADrL8HUrEgt+1qAtEsOHuQeA+WWnggGfLj8_nqHF0k-pqPi52w@mail.gmail.com/
[2]: Some details on what the optimization might look like:
So an excerpt of Matthew's scheme would look something like this:
/* if we're mapping < folio_nr_pages(folio) worth of PTEs. */
if (!folio_has_ptes(folio, vma))
atomic_inc(folio->_mapcount);
where folio_has_ptes() is defined like:
if (!page_vma_mapped_walk(...))
return false;
page_vma_mapped_walk_done(...);
return true;
You might be able to optimize folio_has_ptes() with a block like this
at the beginning:
if (folio_is_naturally_aligned(folio, vma)) {
/* optimization for naturally-aligned folios. */
if (folio_test_hugetlb(folio)) {
/* check hstate-level PTE, and do a similar check as below. */
}
/* for naturally-aligned THPs: */
pmdp = mm_find_pmd(...); /* or just pass it in. */
pmd = READ_ONCE(*pmdp);
BUG_ON(!pmd_present(pmd) || pmd_leaf(pmd));
if (pmd_special(pmd))
return true;
/* we already hold the PTL for the PTE. */
ptl = pmd_lock(mm, pmdp);
/* test and set pmd_special */
pmd_unlock(ptl)
return if_we_set_pmd_special;
}
(pmd_special() doesn't currently exist.) If HugeTLB walking code can
be merged with generic mm, then HugeTLB wouldn't have a special case
at all here.
On Thu, Mar 09, 2023 at 10:05:12AM -0800, James Houghton wrote:
> On Wed, Mar 8, 2023 at 2:10 PM Peter Xu <[email protected]> wrote:
> >
> > On Mon, Mar 06, 2023 at 11:00:02PM +0000, James Houghton wrote:
> > > HugeTLB pages may soon support being mapped with PTEs. To allow for this
> > > case, merge HugeTLB's mapcount scheme with THP's.
> > >
> > > The first patch of this series comes from the HugeTLB high-granularity
> > > mapping series[1], though with some updates, as the original version
> > > was buggy[2] and incomplete.
> > >
> > > I am sending this change as part of this smaller series in hopes that it
> > > can be more thoroughly scrutinized.
> > >
> > > I haven't run any THP performance tests with this series applied.
> > > HugeTLB pages don't currently support being mapped with
> > > `compound=false`, but this mapcount scheme will make collapsing
> > > compound=false mappings in HugeTLB pages quite slow. This can be
> > > optimized with future patches (likely by taking advantage of HugeTLB's
> > > alignment guarantees).
> > >
> > > Matthew Wilcox is working on a mapcounting scheme[3] that will avoid
> > > the use of each subpage's mapcount. If this series is applied, Matthew's
> > > new scheme will automatically apply to HugeTLB pages.
> >
> > Is this the plan?
> >
> > I may have not followed closely on the latest development of Matthew's
> > idea. The thing is if the design requires ptes being installed / removed
> > at the same time for the whole folio, then it may not work directly for HGM
> > if HGM wants to support at least postcopy, iiuc, because if we install the
> > whole folio ptes at the same time it seems to beat the whole purpose of
> > having HGM..
>
> My understanding is that it doesn't *require* all the PTEs in a folio
> to be mapped at the same time. I don't see how it possibly could,
> given that UFFDIO_CONTINUE exists (which can already create PTE-mapped
> THPs today). It would be faster to populate all the PTEs at the same
> time (you would only need to traverse the page table once for the
> entire group to see if you should be incrementing mapcount).
>
> Though, with respect to unmapping, if PTEs aren't all unmapped at the
> same time, then you could end up with a case where mapcount is still
> incremented but nothing is really mapped. I'm not really sure what
> should be done there, but this problem applies to PTE-mapped THPs the
> same way that it applies to HGMed HugeTLB pages.
>
> > The patch (especially patch 1) looks good. So it's a pure question just to
> > make sure we're on the same page. IIUC your other mapcount proposal may
> > work, but it still needs to be able to take care of ptes in less-than-folio
> > sizes whatever it'll look like at last.
>
> By my "other mapcount proposal", I assume you mean the "using the
> PAGE_SPECIAL bit to track if mapcount has been incremented or not". It
> really only serves as an optimization for Matthew's scheme (see below
> [2] for some more thoughts), and it doesn't have to only apply to
> HugeTLB.
>
> I originally thought[1] that Matthew's scheme would be really painful
> for postcopy for HGM without this optimization, but it's actually not
> so bad. Let's assume the worst case, that we're UFFDIO_CONTINUEing
> from the end to the beginning, like in [1]:
>
> First CONTINUE: pvmw finds an empty PUD, so quickly returns false.
> Second CONTINUE: pvmw finds 511 empty PMDs, then finds 511 empty PTEs,
> then finds a present PTE (from the first CONTINUE).
> Third CONTINUE: pvmw finds 511 empty PMDs, then finds 510 empty PTEs.
> ...
> 514th CONTINUE: pvmw finds 510 empty PMDs, then finds 511 empty PTEs.
>
> So it'll be slow, but it won't have to check 262k empty PTEs per
> CONTINUE (though you could make this possible with MADV_DONTNEED).
> Even with an HGM implementation that only allows PTE-mapping of
> HugeTLB pages, it should still behave just like this, too.
>
> > A trivial comment on patch 2 since we're at it: does "a future plan on some
> > arch to support 512GB huge page" justify itself? It would be better
> > justified, IMHO, when that support is added (and decided to use HGM)?
>
> That's fine with me. I'm happy to drop that patch.
>
> > What I feel like is missing (rather than patch 2 itself) is some guard to
> > make sure thp mapcountings will not be abused with new hugetlb sizes
> > coming.
> >
> > How about another BUG_ON() squashed into patch 1 (probably somewhere in
> > page_add_file|anon_rmap()) to make sure folio_size() is always smaller than
> > COMPOUND_MAPPED / 2)?
>
> Sure, I can add that.
>
> Thanks, Peter!
>
> - James
>
> [1]: https://lore.kernel.org/linux-mm/CADrL8HUrEgt+1qAtEsOHuQeA+WWnggGfLj8_nqHF0k-pqPi52w@mail.gmail.com/
>
> [2]: Some details on what the optimization might look like:
>
> So an excerpt of Matthew's scheme would look something like this:
>
> /* if we're mapping < folio_nr_pages(folio) worth of PTEs. */
> if (!folio_has_ptes(folio, vma))
> atomic_inc(folio->_mapcount);
>
> where folio_has_ptes() is defined like:
>
> if (!page_vma_mapped_walk(...))
> return false;
> page_vma_mapped_walk_done(...);
> return true;
>
> You might be able to optimize folio_has_ptes() with a block like this
> at the beginning:
>
> if (folio_is_naturally_aligned(folio, vma)) {
> /* optimization for naturally-aligned folios. */
> if (folio_test_hugetlb(folio)) {
> /* check hstate-level PTE, and do a similar check as below. */
> }
> /* for naturally-aligned THPs: */
> pmdp = mm_find_pmd(...); /* or just pass it in. */
> pmd = READ_ONCE(*pmdp);
> BUG_ON(!pmd_present(pmd) || pmd_leaf(pmd));
> if (pmd_special(pmd))
> return true;
> /* we already hold the PTL for the PTE. */
> ptl = pmd_lock(mm, pmdp);
> /* test and set pmd_special */
> pmd_unlock(ptl)
> return if_we_set_pmd_special;
> }
>
> (pmd_special() doesn't currently exist.) If HugeTLB walking code can
> be merged with generic mm, then HugeTLB wouldn't have a special case
> at all here.
I see what you mean now, thanks. That looks fine. I just suspect the
pte_special trick will still be needed if this will start to apply to HGM,
as it seems to not suite perfectly with a large folio size, still. The
MADV_DONTNEED worst case of having it loop over ~folio_size() times of none
pte is still possible.
--
Peter Xu
On Wed, Mar 8, 2023 at 1:56 PM Peter Xu <[email protected]> wrote:
>
> On Tue, Mar 07, 2023 at 04:36:51PM -0800, James Houghton wrote:
> > > > if (likely(!compound)) {
> > > > + if (unlikely(folio_test_hugetlb(folio)))
> > > > + VM_BUG_ON_PAGE(HPageVmemmapOptimized(&folio->page),
> > > > + page);
>
> How about moving folio_test_hugetlb() into the BUG_ON()?
>
> VM_BUG_ON_PAGE(folio_test_hugetlb(folio) &&
> HPageVmemmapOptimized(&folio->page),
> page);
>
> Note that BUG_ON() already contains an "unlikely".
Ok I can do that. It's a little cleaner.
> > > > first = atomic_inc_and_test(&page->_mapcount);
> > > > nr = first;
> > > > if (first && folio_test_large(folio)) {
> > > > nr = atomic_inc_return_relaxed(mapped);
> > > > nr = (nr < COMPOUND_MAPPED);
> > > > }
> > > > - } else if (folio_test_pmd_mappable(folio)) {
> > > > - /* That test is redundant: it's for safety or to optimize out */
> > >
> > > I 'think' removing this check is OK. It would seem that the caller
> > > knows if the folio is mappable. If we want a similar test, we might be
> > > able to use something like:
> > >
> > > arch_hugetlb_valid_size(folio_size(folio))
> > >
> >
> > Ack. I think leaving the check(s) removed is fine.
>
> Would it still be good to keep that as another BUG_ON()?
Sure, that sounds reasonable to me. I'll add it unless someone disagrees.
As you suggested in your other email, I'll also add a BUG_ON() if we
attempt to do a non-compound mapping of a folio that is larger than
COMPOUND_MAPPED / 2. (Maybe a BUG_ON() in alloc_hugetlb_folio() to
check that the size of the folio we're allocating is less than
COMPOUND_MAPPED / 2 makes sense instead. Just an idea.)
- James