This RFC patch series attempts to simplify the page cache code by removing
the special-casing code for hugetlb pages. Normal pages in the page cache are
indexed by PAGE_SIZE while hugetlb pages are indexed by their huge page
size. This was previously tried, but at the time the xarray was not performant
enough for the changes.
This series fails many of the hugetlb LTP test cases due to bugs in
accounting, and I was hoping to get help/suggestions about why my changes
break the page accounting. The basic mmap tests pass, but the advanced ones,
which involve overcommitting pages, fail.
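As a rough illustration of the indexing change (a sketch only, not taken
from the diffs; the function names are made up and 'h' is the hstate
backing the mapping):

static pgoff_t hugetlb_index_old(struct hstate *h, loff_t offset)
{
	/* page cache index in huge-page-sized units (current behavior) */
	return offset >> huge_page_shift(h);
}

static pgoff_t hugetlb_index_new(loff_t offset)
{
	/* page cache index in base-page (PAGE_SIZE) units (after this series) */
	return offset >> PAGE_SHIFT;
}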
Rebased on mm-unstable as of 4/13/2023.
Sidhartha Kumar (4):
mm/filemap: remove hugetlb special casing in filemap.c
mm/hugetlb: remove hugetlb_basepage_index()
mm/hugetlbfs: remove huge_page_shift in hugetlbfs_file_mmap
mm/hugetlb: add hpage_shift to alloc_hugetlb_folio
fs/hugetlbfs/inode.c | 4 +--
include/linux/pagemap.h | 13 --------
mm/filemap.c | 36 +++++++---------------
mm/hugetlb.c | 68 ++++++++++++-----------------------------
4 files changed, 33 insertions(+), 88 deletions(-)
--
2.39.2
Remove the shifting of the vma->vm_pgoff and len arguments in
the call to hugetlb_reserve_pages() within hugetlbfs_file_mmap().
Adjust the chg variable within hugetlb_reserve_pages() to match the
previous values expected by the cgroup accounting code.
Signed-off-by: Sidhartha Kumar <[email protected]>
---
fs/hugetlbfs/inode.c | 4 ++--
mm/hugetlb.c | 7 ++++---
2 files changed, 6 insertions(+), 5 deletions(-)
diff --git a/fs/hugetlbfs/inode.c b/fs/hugetlbfs/inode.c
index 702d79639c0df..9f2e71f2e9f52 100644
--- a/fs/hugetlbfs/inode.c
+++ b/fs/hugetlbfs/inode.c
@@ -165,8 +165,8 @@ static int hugetlbfs_file_mmap(struct file *file, struct vm_area_struct *vma)
ret = -ENOMEM;
if (!hugetlb_reserve_pages(inode,
- vma->vm_pgoff >> huge_page_order(h),
- len >> huge_page_shift(h), vma,
+ vma->vm_pgoff,
+ len, vma,
vma->vm_flags))
goto out;
diff --git a/mm/hugetlb.c b/mm/hugetlb.c
index 011020a30f4ac..a28fbdff886ff 100644
--- a/mm/hugetlb.c
+++ b/mm/hugetlb.c
@@ -6752,6 +6752,7 @@ bool hugetlb_reserve_pages(struct inode *inode,
{
long chg = -1, add = -1;
struct hstate *h = hstate_inode(inode);
+ unsigned long hpage_shift = huge_page_shift(h);
struct hugepage_subpool *spool = subpool_inode(inode);
struct resv_map *resv_map;
struct hugetlb_cgroup *h_cg = NULL;
@@ -6791,14 +6792,14 @@ bool hugetlb_reserve_pages(struct inode *inode,
*/
resv_map = inode_resv_map(inode);
- chg = region_chg(resv_map, from, to, &regions_needed);
+ chg = region_chg(resv_map, from, to, &regions_needed) >> hpage_shift;
} else {
/* Private mapping. */
resv_map = resv_map_alloc();
if (!resv_map)
goto out_err;
- chg = to - from;
+ chg = (to - from) >> hpage_shift;
set_vma_resv_map(vma, resv_map);
set_vma_resv_flags(vma, HPAGE_RESV_OWNER);
@@ -6823,7 +6824,7 @@ bool hugetlb_reserve_pages(struct inode *inode,
* the subpool has a minimum size, there may be some global
* reservations already in place (gbl_reserve).
*/
- gbl_reserve = hugepage_subpool_get_pages(spool, chg);
+ gbl_reserve = hugepage_subpool_get_pages(spool, chg) >> hpage_shift;
if (gbl_reserve < 0)
goto out_uncharge_cgroup;
--
2.39.2
Remove the special-cased hugetlb handling code within the page cache by
changing the granularity of each index to the base page size rather than
the huge page size.
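For example, with 2MB huge pages and 4K base pages, a hugetlb folio at
file offset 4MB now has ->index == 1024 (rather than 2) while
folio_nr_pages() == 512, so the generic containment check covers it
unchanged (sketch mirroring the folio_contains() hunk below; the function
name is only for illustration):

static inline bool example_folio_contains(struct folio *folio, pgoff_t index)
{
	/* index lies in [folio->index, folio->index + folio_nr_pages(folio)) */
	return index - folio_index(folio) < folio_nr_pages(folio);
}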
Signed-off-by: Sidhartha Kumar <[email protected]>
---
include/linux/pagemap.h | 6 ------
mm/filemap.c | 36 +++++++++++-------------------------
2 files changed, 11 insertions(+), 31 deletions(-)
diff --git a/include/linux/pagemap.h b/include/linux/pagemap.h
index fdcd595d22944..330b1db913f5a 100644
--- a/include/linux/pagemap.h
+++ b/include/linux/pagemap.h
@@ -717,9 +717,6 @@ static inline struct page *folio_file_page(struct folio *folio, pgoff_t index)
*/
static inline bool folio_contains(struct folio *folio, pgoff_t index)
{
- /* HugeTLBfs indexes the page cache in units of hpage_size */
- if (folio_test_hugetlb(folio))
- return folio->index == index;
return index - folio_index(folio) < folio_nr_pages(folio);
}
@@ -844,12 +841,9 @@ static inline loff_t folio_file_pos(struct folio *folio)
/*
* Get the offset in PAGE_SIZE (even for hugetlb folios).
- * (TODO: hugetlb folios should have ->index in PAGE_SIZE)
*/
static inline pgoff_t folio_pgoff(struct folio *folio)
{
- if (unlikely(folio_test_hugetlb(folio)))
- return hugetlb_basepage_index(&folio->page);
return folio->index;
}
diff --git a/mm/filemap.c b/mm/filemap.c
index a34abfe8c6543..fadc8ca9b9695 100644
--- a/mm/filemap.c
+++ b/mm/filemap.c
@@ -131,11 +131,8 @@ static void page_cache_delete(struct address_space *mapping,
mapping_set_update(&xas, mapping);
- /* hugetlb pages are represented by a single entry in the xarray */
- if (!folio_test_hugetlb(folio)) {
- xas_set_order(&xas, folio->index, folio_order(folio));
- nr = folio_nr_pages(folio);
- }
+ xas_set_order(&xas, folio->index, folio_order(folio));
+ nr = folio_nr_pages(folio);
VM_BUG_ON_FOLIO(!folio_test_locked(folio), folio);
@@ -234,7 +231,7 @@ void filemap_free_folio(struct address_space *mapping, struct folio *folio)
if (free_folio)
free_folio(folio);
- if (folio_test_large(folio) && !folio_test_hugetlb(folio))
+ if (folio_test_large(folio))
refs = folio_nr_pages(folio);
folio_put_refs(folio, refs);
}
@@ -855,14 +852,15 @@ noinline int __filemap_add_folio(struct address_space *mapping,
if (!huge) {
int error = mem_cgroup_charge(folio, NULL, gfp);
- VM_BUG_ON_FOLIO(index & (folio_nr_pages(folio) - 1), folio);
if (error)
return error;
charged = true;
- xas_set_order(&xas, index, folio_order(folio));
- nr = folio_nr_pages(folio);
}
+ VM_BUG_ON_FOLIO(index & (folio_nr_pages(folio) - 1), folio);
+ xas_set_order(&xas, index, folio_order(folio));
+ nr = folio_nr_pages(folio);
+
gfp &= GFP_RECLAIM_MASK;
folio_ref_add(folio, nr);
folio->mapping = mapping;
@@ -2069,7 +2067,7 @@ unsigned find_get_entries(struct address_space *mapping, pgoff_t *start,
int idx = folio_batch_count(fbatch) - 1;
folio = fbatch->folios[idx];
- if (!xa_is_value(folio) && !folio_test_hugetlb(folio))
+ if (!xa_is_value(folio))
nr = folio_nr_pages(folio);
*start = indices[idx] + nr;
}
@@ -2133,7 +2131,7 @@ unsigned find_lock_entries(struct address_space *mapping, pgoff_t *start,
int idx = folio_batch_count(fbatch) - 1;
folio = fbatch->folios[idx];
- if (!xa_is_value(folio) && !folio_test_hugetlb(folio))
+ if (!xa_is_value(folio))
nr = folio_nr_pages(folio);
*start = indices[idx] + nr;
}
@@ -2174,9 +2172,6 @@ unsigned filemap_get_folios(struct address_space *mapping, pgoff_t *start,
continue;
if (!folio_batch_add(fbatch, folio)) {
unsigned long nr = folio_nr_pages(folio);
-
- if (folio_test_hugetlb(folio))
- nr = 1;
*start = folio->index + nr;
goto out;
}
@@ -2202,7 +2197,7 @@ EXPORT_SYMBOL(filemap_get_folios);
static inline
bool folio_more_pages(struct folio *folio, pgoff_t index, pgoff_t max)
{
- if (!folio_test_large(folio) || folio_test_hugetlb(folio))
+ if (!folio_test_large(folio))
return false;
if (index >= max)
return false;
@@ -2252,9 +2247,6 @@ unsigned filemap_get_folios_contig(struct address_space *mapping,
if (!folio_batch_add(fbatch, folio)) {
nr = folio_nr_pages(folio);
-
- if (folio_test_hugetlb(folio))
- nr = 1;
*start = folio->index + nr;
goto out;
}
@@ -2271,10 +2263,7 @@ unsigned filemap_get_folios_contig(struct address_space *mapping,
if (nr) {
folio = fbatch->folios[nr - 1];
- if (folio_test_hugetlb(folio))
- *start = folio->index + 1;
- else
- *start = folio->index + folio_nr_pages(folio);
+ *start = folio->index + folio_nr_pages(folio);
}
out:
rcu_read_unlock();
@@ -2312,9 +2301,6 @@ unsigned filemap_get_folios_tag(struct address_space *mapping, pgoff_t *start,
continue;
if (!folio_batch_add(fbatch, folio)) {
unsigned long nr = folio_nr_pages(folio);
-
- if (folio_test_hugetlb(folio))
- nr = 1;
*start = folio->index + nr;
goto out;
}
--
2.39.2
Adjust the return value of hugepage_subpool_{put,get}_pages() by
hpage_shift to be consistent with the previous values.
Signed-off-by: Sidhartha Kumar <[email protected]>
---
mm/hugetlb.c | 11 +++++++----
1 file changed, 7 insertions(+), 4 deletions(-)
diff --git a/mm/hugetlb.c b/mm/hugetlb.c
index a28fbdff886ff..258c211020c3a 100644
--- a/mm/hugetlb.c
+++ b/mm/hugetlb.c
@@ -2992,6 +2992,7 @@ struct folio *alloc_hugetlb_folio(struct vm_area_struct *vma,
{
struct hugepage_subpool *spool = subpool_vma(vma);
struct hstate *h = hstate_vma(vma);
+ unsigned long hpage_shift = huge_page_shift(h);
struct folio *folio;
long map_chg, map_commit;
long gbl_chg;
@@ -3017,7 +3018,7 @@ struct folio *alloc_hugetlb_folio(struct vm_area_struct *vma,
* checked against any subpool limit.
*/
if (map_chg || avoid_reserve) {
- gbl_chg = hugepage_subpool_get_pages(spool, 1);
+ gbl_chg = hugepage_subpool_get_pages(spool, 1) >> hpage_shift;
if (gbl_chg < 0) {
vma_end_reservation(h, vma, addr);
return ERR_PTR(-ENOSPC);
@@ -4778,6 +4779,7 @@ static void hugetlb_vm_op_close(struct vm_area_struct *vma)
{
struct hstate *h = hstate_vma(vma);
unsigned int order = huge_page_order(h);
+ unsigned long hpage_shift = huge_page_shift(h);
struct resv_map *resv;
struct hugepage_subpool *spool = subpool_vma(vma);
unsigned long reserve, start, end;
@@ -4799,7 +4801,7 @@ static void hugetlb_vm_op_close(struct vm_area_struct *vma)
* Decrement reserve counts. The global reserve count may be
* adjusted if the subpool has a minimum size.
*/
- gbl_reserve = hugepage_subpool_put_pages(spool, reserve);
+ gbl_reserve = hugepage_subpool_put_pages(spool, reserve) >> hpage_shift;
hugetlb_acct_memory(h, -gbl_reserve);
}
@@ -6871,7 +6873,7 @@ bool hugetlb_reserve_pages(struct inode *inode,
(chg - add) * pages_per_huge_page(h), h_cg);
rsv_adjust = hugepage_subpool_put_pages(spool,
- chg - add);
+ chg - add) >> hpage_shift;
hugetlb_acct_memory(h, -rsv_adjust);
} else if (h_cg) {
/*
@@ -6908,6 +6910,7 @@ long hugetlb_unreserve_pages(struct inode *inode, long start, long end,
long freed)
{
struct hstate *h = hstate_inode(inode);
+ unsigned long hpage_shift = huge_page_shift(h);
struct resv_map *resv_map = inode_resv_map(inode);
long chg = 0;
struct hugepage_subpool *spool = subpool_inode(inode);
@@ -6939,7 +6942,7 @@ long hugetlb_unreserve_pages(struct inode *inode, long start, long end,
* Note that !resv_map implies freed == 0. So (chg - freed)
* won't go negative.
*/
- gbl_reserve = hugepage_subpool_put_pages(spool, (chg - freed));
+ gbl_reserve = hugepage_subpool_put_pages(spool, (chg - freed)) >> hpage_shift;
hugetlb_acct_memory(h, -gbl_reserve);
return 0;
--
2.39.2
hugetlb_basepage_index() can now be removed as hugetlb pages have ->index
in PAGE_SIZE units. This also allows the removal of vma_hugecache_offset()
and linear_hugepage_index(), which are replaced by calls to
linear_page_index().
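For reference, the removed helper and its replacement compute the same
offset in different units (sketch only, condensed from the hunks below;
the function names here are illustrative):

static pgoff_t idx_in_huge_pages(struct hstate *h, struct vm_area_struct *vma,
				 unsigned long address)
{
	/* what vma_hugecache_offset() computed: huge-page-sized index */
	return ((address - vma->vm_start) >> huge_page_shift(h)) +
		(vma->vm_pgoff >> huge_page_order(h));
}

static pgoff_t idx_in_base_pages(struct vm_area_struct *vma,
				 unsigned long address)
{
	/* what linear_page_index() computes: base-page (PAGE_SIZE) index */
	return ((address - vma->vm_start) >> PAGE_SHIFT) + vma->vm_pgoff;
}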
Signed-off-by: Sidhartha Kumar <[email protected]>
---
include/linux/pagemap.h | 7 ------
mm/hugetlb.c | 50 ++++++++---------------------------------
2 files changed, 9 insertions(+), 48 deletions(-)
diff --git a/include/linux/pagemap.h b/include/linux/pagemap.h
index 330b1db913f5a..bb60282317875 100644
--- a/include/linux/pagemap.h
+++ b/include/linux/pagemap.h
@@ -792,16 +792,11 @@ static inline pgoff_t page_to_index(struct page *page)
return head->index + page - head;
}
-extern pgoff_t hugetlb_basepage_index(struct page *page);
-
/*
* Get the offset in PAGE_SIZE (even for hugetlb pages).
- * (TODO: hugetlb pages should have ->index in PAGE_SIZE)
*/
static inline pgoff_t page_to_pgoff(struct page *page)
{
- if (unlikely(PageHuge(page)))
- return hugetlb_basepage_index(page);
return page_to_index(page);
}
@@ -854,8 +849,6 @@ static inline pgoff_t linear_page_index(struct vm_area_struct *vma,
unsigned long address)
{
pgoff_t pgoff;
- if (unlikely(is_vm_hugetlb_page(vma)))
- return linear_hugepage_index(vma, address);
pgoff = (address - vma->vm_start) >> PAGE_SHIFT;
pgoff += vma->vm_pgoff;
return pgoff;
diff --git a/mm/hugetlb.c b/mm/hugetlb.c
index f16b25b1a6b93..011020a30f4ac 100644
--- a/mm/hugetlb.c
+++ b/mm/hugetlb.c
@@ -949,24 +949,6 @@ static long region_count(struct resv_map *resv, long f, long t)
return chg;
}
-/*
- * Convert the address within this vma to the page offset within
- * the mapping, in pagecache page units; huge pages here.
- */
-static pgoff_t vma_hugecache_offset(struct hstate *h,
- struct vm_area_struct *vma, unsigned long address)
-{
- return ((address - vma->vm_start) >> huge_page_shift(h)) +
- (vma->vm_pgoff >> huge_page_order(h));
-}
-
-pgoff_t linear_hugepage_index(struct vm_area_struct *vma,
- unsigned long address)
-{
- return vma_hugecache_offset(hstate_vma(vma), vma, address);
-}
-EXPORT_SYMBOL_GPL(linear_hugepage_index);
-
/*
* Return the size of the pages allocated when backing a VMA. In the majority
* cases this will be same size as used by the page table entries.
@@ -2087,21 +2069,6 @@ struct address_space *hugetlb_page_mapping_lock_write(struct page *hpage)
return NULL;
}
-
-pgoff_t hugetlb_basepage_index(struct page *page)
-{
- struct page *page_head = compound_head(page);
- pgoff_t index = page_index(page_head);
- unsigned long compound_idx;
-
- if (compound_order(page_head) > MAX_ORDER)
- compound_idx = page_to_pfn(page) - page_to_pfn(page_head);
- else
- compound_idx = page - page_head;
-
- return (index << compound_order(page_head)) + compound_idx;
-}
-
static struct folio *alloc_buddy_hugetlb_folio(struct hstate *h,
gfp_t gfp_mask, int nid, nodemask_t *nmask,
nodemask_t *node_alloc_noretry)
@@ -2703,7 +2670,7 @@ static long __vma_reservation_common(struct hstate *h,
if (!resv)
return 1;
- idx = vma_hugecache_offset(h, vma, addr);
+ idx = linear_page_index(vma, addr);
switch (mode) {
case VMA_NEEDS_RESV:
ret = region_chg(resv, idx, idx + 1, &dummy_out_regions_needed);
@@ -4810,6 +4777,7 @@ static void hugetlb_vm_op_open(struct vm_area_struct *vma)
static void hugetlb_vm_op_close(struct vm_area_struct *vma)
{
struct hstate *h = hstate_vma(vma);
+ unsigned int order = huge_page_order(h);
struct resv_map *resv;
struct hugepage_subpool *spool = subpool_vma(vma);
unsigned long reserve, start, end;
@@ -4821,11 +4789,11 @@ static void hugetlb_vm_op_close(struct vm_area_struct *vma)
if (!resv || !is_vma_resv_set(vma, HPAGE_RESV_OWNER))
return;
- start = vma_hugecache_offset(h, vma, vma->vm_start);
- end = vma_hugecache_offset(h, vma, vma->vm_end);
+ start = linear_page_index(vma, vma->vm_start);
+ end = linear_page_index(vma, vma->vm_end);
reserve = (end - start) - region_count(resv, start, end);
- hugetlb_cgroup_uncharge_counter(resv, start, end);
+ hugetlb_cgroup_uncharge_counter(resv, start >> order, end >> order);
if (reserve) {
/*
* Decrement reserve counts. The global reserve count may be
@@ -5582,7 +5550,7 @@ static vm_fault_t hugetlb_wp(struct mm_struct *mm, struct vm_area_struct *vma,
*
* Reacquire both after unmap operation.
*/
- idx = vma_hugecache_offset(h, vma, haddr);
+ idx = linear_page_index(vma, address);
hash = hugetlb_fault_mutex_hash(mapping, idx);
hugetlb_vma_unlock_read(vma);
mutex_unlock(&hugetlb_fault_mutex_table[hash]);
@@ -5669,7 +5637,7 @@ static bool hugetlbfs_pagecache_present(struct hstate *h,
struct vm_area_struct *vma, unsigned long address)
{
struct address_space *mapping = vma->vm_file->f_mapping;
- pgoff_t idx = vma_hugecache_offset(h, vma, address);
+ pgoff_t idx = linear_page_index(vma, address);
bool present;
rcu_read_lock();
@@ -6014,7 +5982,7 @@ vm_fault_t hugetlb_fault(struct mm_struct *mm, struct vm_area_struct *vma,
* the same page in the page cache.
*/
mapping = vma->vm_file->f_mapping;
- idx = vma_hugecache_offset(h, vma, haddr);
+ idx = linear_page_index(vma, address);
hash = hugetlb_fault_mutex_hash(mapping, idx);
mutex_lock(&hugetlb_fault_mutex_table[hash]);
@@ -6185,7 +6153,7 @@ int hugetlb_mfill_atomic_pte(pte_t *dst_pte,
bool wp_enabled = (flags & MFILL_ATOMIC_WP);
struct hstate *h = hstate_vma(dst_vma);
struct address_space *mapping = dst_vma->vm_file->f_mapping;
- pgoff_t idx = vma_hugecache_offset(h, dst_vma, dst_addr);
+ pgoff_t idx = linear_page_index(dst_vma, dst_addr);
unsigned long size;
int vm_shared = dst_vma->vm_flags & VM_SHARED;
pte_t _dst_pte;
--
2.39.2
On 04/13/23 16:14, Sidhartha Kumar wrote:
> This RFC patch series attempts to simplify the page cache code by removing
> the special-casing code for hugetlb pages. Normal pages in the page cache are
> indexed by PAGE_SIZE while hugetlb pages are indexed by their huge page
> size. This was previously tried, but at the time the xarray was not performant
> enough for the changes.
>
> This series fails many of the hugetlb LTP test cases due to bugs in
> accounting, and I was hoping to get help/suggestions about why my changes
> break the page accounting. The basic mmap tests pass, but the advanced ones,
> which involve overcommitting pages, fail.
Sorry for the late reply.
I can appreciate the desire for removing hugetlb special cases from page
cache code. As you note above, hugetlb tracks page indices based on the
huge page size. Page cache indices are based on the base page size.
Within the hugetlb code, the huge page sized indices are used in at
least two places:
- huge page reservations. There is a rather ugly set of code managing
  hugetlb mapping reservation maps. Since a reservation is for a single
  huge page, it makes sense to use huge page sized indices in this code.
- hugetlb mutex table. The table is hashed by the values 'mapping' and
  index. This guarantees that all code performing operations on a huge
  page will use the same mutex. So, using the huge page index is a must.
I think this means there is a need to maintain/use both huge page and
base page indices: huge page indices within hugetlb code and base page
indices within the page cache.
One approach might be to add the conversion from huge page index to base
page index for all calls into the page cache. This could be done with
hugetlb specific wrappers. There already are hugetlb_add_to_page_cache,
hugetlbfs_pagecache_present, and hugetlb_delete_from_page_cache.
New wrappers would be needed for at least filemap_get_folios and
filemap_lock_folio.
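A very rough sketch of what such a wrapper could look like (the name and
exact conversions below are just assumptions to show the idea, not an
existing API):

static inline unsigned hugetlb_filemap_get_folios(struct address_space *mapping,
		struct hstate *h, pgoff_t *hstart, pgoff_t hend,
		struct folio_batch *fbatch)
{
	/* convert huge-page-sized indices to base-page indices */
	pgoff_t start = *hstart << huge_page_order(h);
	pgoff_t end = (hend << huge_page_order(h)) + pages_per_huge_page(h) - 1;
	unsigned ret;

	ret = filemap_get_folios(mapping, &start, end, fbatch);
	/* hand a huge-page-sized index back to the hugetlb caller */
	*hstart = start >> huge_page_order(h);
	return ret;
}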
--
Mike Kravetz