Anonymous pages have already been supported for multi-size (mTHP) allocation
through commit 19eaf44954df, that can allow THP to be configured through the
sysfs interface located at '/sys/kernel/mm/transparent_hugepage/hugepage-XXkb/enabled'.
However, the anonymous shmem will ignore the anonymous mTHP rule configured
through the sysfs interface, and can only use the PMD-mapped THP, that is not
reasonable. Many implement anonymous page sharing through mmap(MAP_SHARED |
MAP_ANONYMOUS), especially in database usage scenarios, therefore, users expect
to apply an unified mTHP strategy for anonymous pages, also including the
anonymous shared pages, in order to enjoy the benefits of mTHP. For example,
lower latency than PMD-mapped THP, smaller memory bloat than PMD-mapped THP,
contiguous PTEs on ARM architecture to reduce TLB miss etc.
As discussed in the bi-weekly MM meeting[1], the mTHP controls should control
all of shmem, not only anonymous shmem, but support will be added iteratively.
Therefore, this patch set starts with support for anonymous shmem.
The primary strategy is similar to supporting anonymous mTHP. Introduce
a new interface '/mm/transparent_hugepage/hugepage-XXkb/shmem_enabled',
which can have almost the same values as the top-level
'/sys/kernel/mm/transparent_hugepage/shmem_enabled', with adding a new
additional "inherit" option and dropping the testing options 'force' and
'deny'. By default all sizes will be set to "never" except PMD size, which
is set to "inherit". This ensures backward compatibility with the anonymous
shmem enabled of the top level, meanwhile also allows independent control of
anonymous shmem enabled for each mTHP.
Use the page fault latency tool to measure the performance of 1G anonymous shmem
with 32 threads on my machine environment with: ARM64 Architecture, 32 cores,
125G memory:
base: mm-unstable
user-time sys_time faults_per_sec_per_cpu faults_per_sec
0.04s 3.10s 83516.416 2669684.890
mm-unstable + patchset, anon shmem mTHP disabled
user-time sys_time faults_per_sec_per_cpu faults_per_sec
0.02s 3.14s 82936.359 2630746.027
mm-unstable + patchset, anon shmem 64K mTHP enabled
user-time sys_time faults_per_sec_per_cpu faults_per_sec
0.08s 0.31s 678630.231 17082522.495
From the data above, it is observed that the patchset has a minimal impact when
mTHP is not enabled (some fluctuations observed during testing). When enabling 64K
mTHP, there is a significant improvement of the page fault latency.
[1] https://lore.kernel.org/all/[email protected]/
Changes from v4:
- Fix the unused variable warning reported by kernel test robot.
- Drop the 'anon' prefix for variables and functions, per Daniel.
Changes from v3:
- Drop 'force' and 'deny' testing options for each mTHP.
- Use new helper update_mmu_tlb_range(), per Lance.
- Update documentation to drop "anonymous thp" terminology, per David.
- Initialize the 'suitable_orders' in shmem_alloc_and_add_folio(),
reported by kernel test robot.
- Fix the highest mTHP order in shmem_get_unmapped_area().
- Update some commit message.
Changes from v2:
- Rebased to mm/mm-unstable.
- Remove 'huge' parameter for shmem_alloc_and_add_folio(), per Lance.
Changes from v1:
- Drop the patch that re-arranges the position of highest_order() and
next_order(), per Ryan.
- Modify the finish_fault() to fix VA alignment issue, per Ryan and
David.
- Fix some building issues, reported by Lance and kernel test robot.
- Update some commit message.
Changes from RFC:
- Rebase the patch set against the new mm-unstable branch, per Lance.
- Add a new patch to export highest_order() and next_order().
- Add a new patch to align mTHP size in shmem_get_unmapped_area().
- Handle the uffd case and the VMA limits case when building mapping for
large folio in the finish_fault() function, per Ryan.
- Remove unnecessary 'order' variable in patch 3, per Kefeng.
- Keep the anon shmem counters' name consistency.
- Modify the strategy to support mTHP for anonymous shmem, discussed with
Ryan and David.
- Add reviewed tag from Barry.
- Update the commit message.
Baolin Wang (6):
mm: memory: extend finish_fault() to support large folio
mm: shmem: add THP validation for PMD-mapped THP related statistics
mm: shmem: add multi-size THP sysfs interface for anonymous shmem
mm: shmem: add mTHP support for anonymous shmem
mm: shmem: add mTHP size alignment in shmem_get_unmapped_area
mm: shmem: add mTHP counters for anonymous shmem
Documentation/admin-guide/mm/transhuge.rst | 23 ++
include/linux/huge_mm.h | 23 ++
mm/huge_memory.c | 17 +-
mm/memory.c | 57 +++-
mm/shmem.c | 344 ++++++++++++++++++---
5 files changed, 403 insertions(+), 61 deletions(-)
--
2.39.3
Commit 19eaf44954df adds multi-size THP (mTHP) for anonymous pages, that
can allow THP to be configured through the sysfs interface located at
'/sys/kernel/mm/transparent_hugepage/hugepage-XXkb/enabled'.
However, the anonymous shmem will ignore the anonymous mTHP rule
configured through the sysfs interface, and can only use the PMD-mapped
THP, that is not reasonable. Users expect to apply the mTHP rule for
all anonymous pages, including the anonymous shmem, in order to enjoy
the benefits of mTHP. For example, lower latency than PMD-mapped THP,
smaller memory bloat than PMD-mapped THP, contiguous PTEs on ARM architecture
to reduce TLB miss etc. In addition, the mTHP interfaces can be extended
to support all shmem/tmpfs scenarios in the future, especially for the
shmem mmap() case.
The primary strategy is similar to supporting anonymous mTHP. Introduce
a new interface '/mm/transparent_hugepage/hugepage-XXkb/shmem_enabled',
which can have almost the same values as the top-level
'/sys/kernel/mm/transparent_hugepage/shmem_enabled', with adding a new
additional "inherit" option and dropping the testing options 'force' and
'deny'. By default all sizes will be set to "never" except PMD size,
which is set to "inherit". This ensures backward compatibility with the
anonymous shmem enabled of the top level, meanwhile also allows independent
control of anonymous shmem enabled for each mTHP.
Signed-off-by: Baolin Wang <[email protected]>
---
include/linux/huge_mm.h | 10 +++
mm/shmem.c | 187 +++++++++++++++++++++++++++++++++-------
2 files changed, 167 insertions(+), 30 deletions(-)
diff --git a/include/linux/huge_mm.h b/include/linux/huge_mm.h
index fac21548c5de..909cfc67521d 100644
--- a/include/linux/huge_mm.h
+++ b/include/linux/huge_mm.h
@@ -575,6 +575,16 @@ static inline bool thp_migration_supported(void)
{
return false;
}
+
+static inline int highest_order(unsigned long orders)
+{
+ return 0;
+}
+
+static inline int next_order(unsigned long *orders, int prev)
+{
+ return 0;
+}
#endif /* CONFIG_TRANSPARENT_HUGEPAGE */
static inline int split_folio_to_list_to_order(struct folio *folio,
diff --git a/mm/shmem.c b/mm/shmem.c
index bb19ea2770e3..e849c88452b2 100644
--- a/mm/shmem.c
+++ b/mm/shmem.c
@@ -1611,6 +1611,107 @@ static gfp_t limit_gfp_mask(gfp_t huge_gfp, gfp_t limit_gfp)
return result;
}
+#ifdef CONFIG_TRANSPARENT_HUGEPAGE
+static unsigned long shmem_allowable_huge_orders(struct inode *inode,
+ struct vm_area_struct *vma, pgoff_t index,
+ bool global_huge)
+{
+ unsigned long mask = READ_ONCE(huge_shmem_orders_always);
+ unsigned long within_size_orders = READ_ONCE(huge_shmem_orders_within_size);
+ unsigned long vm_flags = vma->vm_flags;
+ /*
+ * Check all the (large) orders below HPAGE_PMD_ORDER + 1 that
+ * are enabled for this vma.
+ */
+ unsigned long orders = BIT(PMD_ORDER + 1) - 1;
+ loff_t i_size;
+ int order;
+
+ if ((vm_flags & VM_NOHUGEPAGE) ||
+ test_bit(MMF_DISABLE_THP, &vma->vm_mm->flags))
+ return 0;
+
+ /* If the hardware/firmware marked hugepage support disabled. */
+ if (transparent_hugepage_flags & (1 << TRANSPARENT_HUGEPAGE_UNSUPPORTED))
+ return 0;
+
+ /*
+ * Following the 'deny' semantics of the top level, force the huge
+ * option off from all mounts.
+ */
+ if (shmem_huge == SHMEM_HUGE_DENY)
+ return 0;
+
+ /*
+ * Only allow inherit orders if the top-level value is 'force', which
+ * means non-PMD sized THP can not override 'huge' mount option now.
+ */
+ if (shmem_huge == SHMEM_HUGE_FORCE)
+ return READ_ONCE(huge_shmem_orders_inherit);
+
+ /* Allow mTHP that will be fully within i_size. */
+ order = highest_order(within_size_orders);
+ while (within_size_orders) {
+ index = round_up(index + 1, order);
+ i_size = round_up(i_size_read(inode), PAGE_SIZE);
+ if (i_size >> PAGE_SHIFT >= index) {
+ mask |= within_size_orders;
+ break;
+ }
+
+ order = next_order(&within_size_orders, order);
+ }
+
+ if (vm_flags & VM_HUGEPAGE)
+ mask |= READ_ONCE(huge_shmem_orders_madvise);
+
+ if (global_huge)
+ mask |= READ_ONCE(huge_shmem_orders_inherit);
+
+ return orders & mask;
+}
+
+static unsigned long shmem_suitable_orders(struct inode *inode, struct vm_fault *vmf,
+ struct address_space *mapping, pgoff_t index,
+ unsigned long orders)
+{
+ struct vm_area_struct *vma = vmf->vma;
+ unsigned long pages;
+ int order;
+
+ orders = thp_vma_suitable_orders(vma, vmf->address, orders);
+ if (!orders)
+ return 0;
+
+ /* Find the highest order that can add into the page cache */
+ order = highest_order(orders);
+ while (orders) {
+ pages = 1UL << order;
+ index = round_down(index, pages);
+ if (!xa_find(&mapping->i_pages, &index,
+ index + pages - 1, XA_PRESENT))
+ break;
+ order = next_order(&orders, order);
+ }
+
+ return orders;
+}
+#else
+static unsigned long shmem_allowable_huge_orders(struct inode *inode,
+ struct vm_area_struct *vma, pgoff_t index,
+ bool global_huge)
+{
+ return 0;
+}
+
+static unsigned long shmem_suitable_orders(struct inode *inode, struct vm_fault *vmf,
+ struct address_space *mapping, pgoff_t index,
+ unsigned long orders)
+{
+ return 0;
+}
+#endif /* CONFIG_TRANSPARENT_HUGEPAGE */
+
static struct folio *shmem_alloc_folio(gfp_t gfp, int order,
struct shmem_inode_info *info, pgoff_t index)
{
@@ -1625,38 +1726,55 @@ static struct folio *shmem_alloc_folio(gfp_t gfp, int order,
return folio;
}
-static struct folio *shmem_alloc_and_add_folio(gfp_t gfp,
- struct inode *inode, pgoff_t index,
- struct mm_struct *fault_mm, bool huge)
+static struct folio *shmem_alloc_and_add_folio(struct vm_fault *vmf,
+ gfp_t gfp, struct inode *inode, pgoff_t index,
+ struct mm_struct *fault_mm, unsigned long orders)
{
struct address_space *mapping = inode->i_mapping;
struct shmem_inode_info *info = SHMEM_I(inode);
- struct folio *folio;
+ struct vm_area_struct *vma = vmf ? vmf->vma : NULL;
+ unsigned long suitable_orders = 0;
+ struct folio *folio = NULL;
long pages;
- int error;
+ int error, order;
if (!IS_ENABLED(CONFIG_TRANSPARENT_HUGEPAGE))
- huge = false;
+ orders = 0;
- if (huge) {
- pages = HPAGE_PMD_NR;
- index = round_down(index, HPAGE_PMD_NR);
+ if (orders > 0) {
+ if (vma && vma_is_anon_shmem(vma)) {
+ suitable_orders = shmem_suitable_orders(inode, vmf,
+ mapping, index, orders);
+ } else if (orders & BIT(HPAGE_PMD_ORDER)) {
+ pages = HPAGE_PMD_NR;
+ suitable_orders = BIT(HPAGE_PMD_ORDER);
+ index = round_down(index, HPAGE_PMD_NR);
- /*
- * Check for conflict before waiting on a huge allocation.
- * Conflict might be that a huge page has just been allocated
- * and added to page cache by a racing thread, or that there
- * is already at least one small page in the huge extent.
- * Be careful to retry when appropriate, but not forever!
- * Elsewhere -EEXIST would be the right code, but not here.
- */
- if (xa_find(&mapping->i_pages, &index,
- index + HPAGE_PMD_NR - 1, XA_PRESENT))
- return ERR_PTR(-E2BIG);
+ /*
+ * Check for conflict before waiting on a huge allocation.
+ * Conflict might be that a huge page has just been allocated
+ * and added to page cache by a racing thread, or that there
+ * is already at least one small page in the huge extent.
+ * Be careful to retry when appropriate, but not forever!
+ * Elsewhere -EEXIST would be the right code, but not here.
+ */
+ if (xa_find(&mapping->i_pages, &index,
+ index + HPAGE_PMD_NR - 1, XA_PRESENT))
+ return ERR_PTR(-E2BIG);
+ }
- folio = shmem_alloc_folio(gfp, HPAGE_PMD_ORDER, info, index);
- if (!folio && pages == HPAGE_PMD_NR)
- count_vm_event(THP_FILE_FALLBACK);
+ order = highest_order(suitable_orders);
+ while (suitable_orders) {
+ pages = 1UL << order;
+ index = round_down(index, pages);
+ folio = shmem_alloc_folio(gfp, order, info, index);
+ if (folio)
+ goto allocated;
+
+ if (pages == HPAGE_PMD_NR)
+ count_vm_event(THP_FILE_FALLBACK);
+ order = next_order(&suitable_orders, order);
+ }
} else {
pages = 1;
folio = shmem_alloc_folio(gfp, 0, info, index);
@@ -1664,6 +1782,7 @@ static struct folio *shmem_alloc_and_add_folio(gfp_t gfp,
if (!folio)
return ERR_PTR(-ENOMEM);
+allocated:
__folio_set_locked(folio);
__folio_set_swapbacked(folio);
@@ -1958,7 +2077,8 @@ static int shmem_get_folio_gfp(struct inode *inode, pgoff_t index,
struct mm_struct *fault_mm;
struct folio *folio;
int error;
- bool alloced;
+ bool alloced, huge;
+ unsigned long orders = 0;
if (WARN_ON_ONCE(!shmem_mapping(inode->i_mapping)))
return -EINVAL;
@@ -2030,14 +2150,21 @@ static int shmem_get_folio_gfp(struct inode *inode, pgoff_t index,
return 0;
}
- if (shmem_is_huge(inode, index, false, fault_mm,
- vma ? vma->vm_flags : 0)) {
+ huge = shmem_is_huge(inode, index, false, fault_mm,
+ vma ? vma->vm_flags : 0);
+ /* Find hugepage orders that are allowed for anonymous shmem. */
+ if (vma && vma_is_anon_shmem(vma))
+ orders = shmem_allowable_huge_orders(inode, vma, index, huge);
+ else if (huge)
+ orders = BIT(HPAGE_PMD_ORDER);
+
+ if (orders > 0) {
gfp_t huge_gfp;
huge_gfp = vma_thp_gfp_mask(vma);
huge_gfp = limit_gfp_mask(huge_gfp, gfp);
- folio = shmem_alloc_and_add_folio(huge_gfp,
- inode, index, fault_mm, true);
+ folio = shmem_alloc_and_add_folio(vmf, huge_gfp,
+ inode, index, fault_mm, orders);
if (!IS_ERR(folio)) {
if (folio_test_pmd_mappable(folio))
count_vm_event(THP_FILE_ALLOC);
@@ -2047,7 +2174,7 @@ static int shmem_get_folio_gfp(struct inode *inode, pgoff_t index,
goto repeat;
}
- folio = shmem_alloc_and_add_folio(gfp, inode, index, fault_mm, false);
+ folio = shmem_alloc_and_add_folio(vmf, gfp, inode, index, fault_mm, 0);
if (IS_ERR(folio)) {
error = PTR_ERR(folio);
if (error == -EEXIST)
@@ -2058,7 +2185,7 @@ static int shmem_get_folio_gfp(struct inode *inode, pgoff_t index,
alloced:
alloced = true;
- if (folio_test_pmd_mappable(folio) &&
+ if (folio_test_large(folio) &&
DIV_ROUND_UP(i_size_read(inode), PAGE_SIZE) <
folio_next_index(folio) - 1) {
struct shmem_sb_info *sbinfo = SHMEM_SB(inode->i_sb);
--
2.39.3