Anonymous pages already support multi-size THP (mTHP) allocation since commit
19eaf44954df, which allows THP to be configured through the sysfs interface
located at '/sys/kernel/mm/transparent_hugepage/hugepages-XXkB/enabled'.
However, anonymous shared pages ignore the anonymous mTHP rule configured
through the sysfs interface and can only use PMD-mapped THP, which is not
reasonable. Many applications implement anonymous page sharing through
mmap(MAP_SHARED | MAP_ANONYMOUS), especially in database usage scenarios,
so users expect to apply a unified mTHP strategy to anonymous pages,
including anonymous shared pages, in order to enjoy the benefits of mTHP:
for example, lower latency than PMD-mapped THP, smaller memory bloat than
PMD-mapped THP, and contiguous PTEs on the ARM architecture to reduce TLB misses.
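As a minimal, hypothetical userspace sketch of that sharing pattern (the
buffer size and names below are this example's own, not part of the series):

#include <string.h>
#include <sys/mman.h>

/*
 * A parent creates 1M of anonymous shared memory; after fork(), parent and
 * child share these pages. With this series, faults on such a mapping can
 * be served with mTHP-sized folios when the per-size shmem_enabled knob
 * allows it.
 */
int main(void)
{
	size_t len = 1024 * 1024;
	char *buf = mmap(NULL, len, PROT_READ | PROT_WRITE,
			 MAP_SHARED | MAP_ANONYMOUS, -1, 0);

	if (buf == MAP_FAILED)
		return 1;
	memset(buf, 0, len);	/* fault in the shared anonymous pages */
	munmap(buf, len);
	return 0;
}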
The primary strategy is similar to supporting anonymous mTHP. Introduce a new
interface, '/sys/kernel/mm/transparent_hugepage/hugepages-XXkB/shmem_enabled',
which can accept all the same values as the top-level
'/sys/kernel/mm/transparent_hugepage/shmem_enabled', plus an additional
"inherit" option. By default, all sizes are set to "never" except the PMD
size, which is set to "inherit". This ensures backward compatibility with the
top-level shmem_enabled setting, while also allowing independent control of
shmem_enabled for each mTHP size.
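For completeness, a hedged userspace sketch of driving the new knob (the
64kB path follows the naming described above and in the documentation patch;
the helper name and error handling are this example's own):

#include <fcntl.h>
#include <string.h>
#include <unistd.h>

/* Write a policy string to a per-size shmem_enabled knob. */
static int set_shmem_policy(const char *path, const char *policy)
{
	int fd = open(path, O_WRONLY);
	ssize_t ret;

	if (fd < 0)
		return -1;
	ret = write(fd, policy, strlen(policy));
	close(fd);
	return ret < 0 ? -1 : 0;
}

int main(void)
{
	/* Enable 64K mTHP for anonymous shmem; other sizes keep their defaults. */
	return set_shmem_policy(
		"/sys/kernel/mm/transparent_hugepage/hugepages-64kB/shmem_enabled",
		"always");
}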
The page fault latency tool was used to measure the performance of 1G of
anonymous shmem with 32 threads on my machine (ARM64 architecture, 32 cores,
125G memory):
base: mm-unstable
user-time sys_time faults_per_sec_per_cpu faults_per_sec
0.04s 3.10s 83516.416 2669684.890
mm-unstable + patchset, anon shmem mTHP disabled
user-time sys_time faults_per_sec_per_cpu faults_per_sec
0.02s 3.14s 82936.359 2630746.027
mm-unstable + patchset, anon shmem 64K mTHP enabled
user-time sys_time faults_per_sec_per_cpu faults_per_sec
0.08s 0.31s 678630.231 17082522.495
From the data above, the patchset has minimal impact when mTHP is not enabled
(some fluctuation was observed during testing). When 64K mTHP is enabled,
there is a significant improvement in page fault latency.
TODO:
- Support mTHP for tmpfs.
- Do not split large folios when shared memory is swapped out.
- Swap in large folios for shared memory.
Changes from RFC:
- Rebase the patch set against the new mm-unstable branch, per Lance.
- Add a new patch to export highest_order() and next_order().
- Add a new patch to align mTHP size in shmem_get_unmapped_area().
- Handle the uffd case and the VMA limits case when building the mapping for
a large folio in finish_fault(), per Ryan.
- Remove unnecessary 'order' variable in patch 3, per Kefeng.
- Keep the anon shmem counters' names consistent.
- Modify the strategy to support mTHP for anonymous shmem, discussed with
Ryan and David.
- Add reviewed tag from Barry.
- Update the commit message.
Baolin Wang (8):
mm: move highest_order() and next_order() out of the THP config
mm: memory: extend finish_fault() to support large folio
mm: shmem: add an 'order' parameter for shmem_alloc_hugefolio()
mm: shmem: add THP validation for PMD-mapped THP related statistics
mm: shmem: add multi-size THP sysfs interface for anonymous shmem
mm: shmem: add mTHP support for anonymous shmem
mm: shmem: add mTHP size alignment in shmem_get_unmapped_area
mm: shmem: add mTHP counters for anonymous shmem
Documentation/admin-guide/mm/transhuge.rst | 29 ++
include/linux/huge_mm.h | 35 ++-
mm/huge_memory.c | 17 +-
mm/memory.c | 43 ++-
mm/shmem.c | 335 ++++++++++++++++++---
5 files changed, 387 insertions(+), 72 deletions(-)
--
2.39.3
Add support for establishing large folio mappings in finish_fault() as a
preparation for supporting multi-size THP allocation of anonymous shmem pages
in the following patches.
Signed-off-by: Baolin Wang <[email protected]>
---
mm/memory.c | 43 +++++++++++++++++++++++++++++++++----------
1 file changed, 33 insertions(+), 10 deletions(-)
diff --git a/mm/memory.c b/mm/memory.c
index eea6e4984eae..936377220b77 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -4747,9 +4747,12 @@ vm_fault_t finish_fault(struct vm_fault *vmf)
{
struct vm_area_struct *vma = vmf->vma;
struct page *page;
+ struct folio *folio;
vm_fault_t ret;
bool is_cow = (vmf->flags & FAULT_FLAG_WRITE) &&
!(vma->vm_flags & VM_SHARED);
+ int type, nr_pages, i;
+ unsigned long addr = vmf->address;
/* Did we COW the page? */
if (is_cow)
@@ -4780,24 +4783,44 @@ vm_fault_t finish_fault(struct vm_fault *vmf)
return VM_FAULT_OOM;
}
+ folio = page_folio(page);
+ nr_pages = folio_nr_pages(folio);
+
+ if (unlikely(userfaultfd_armed(vma))) {
+ nr_pages = 1;
+ } else if (nr_pages > 1) {
+ unsigned long start = ALIGN_DOWN(vmf->address, nr_pages * PAGE_SIZE);
+ unsigned long end = start + nr_pages * PAGE_SIZE;
+
+ /* In case the folio size in page cache beyond the VMA limits. */
+ addr = max(start, vma->vm_start);
+ nr_pages = (min(end, vma->vm_end) - addr) >> PAGE_SHIFT;
+
+ page = folio_page(folio, (addr - start) >> PAGE_SHIFT);
+ }
vmf->pte = pte_offset_map_lock(vma->vm_mm, vmf->pmd,
- vmf->address, &vmf->ptl);
+ addr, &vmf->ptl);
if (!vmf->pte)
return VM_FAULT_NOPAGE;
/* Re-check under ptl */
- if (likely(!vmf_pte_changed(vmf))) {
- struct folio *folio = page_folio(page);
- int type = is_cow ? MM_ANONPAGES : mm_counter_file(folio);
-
- set_pte_range(vmf, folio, page, 1, vmf->address);
- add_mm_counter(vma->vm_mm, type, 1);
- ret = 0;
- } else {
- update_mmu_tlb(vma, vmf->address, vmf->pte);
+ if (nr_pages == 1 && unlikely(vmf_pte_changed(vmf))) {
+ update_mmu_tlb(vma, addr, vmf->pte);
+ ret = VM_FAULT_NOPAGE;
+ goto unlock;
+ } else if (nr_pages > 1 && !pte_range_none(vmf->pte, nr_pages)) {
+ for (i = 0; i < nr_pages; i++)
+ update_mmu_tlb(vma, addr + PAGE_SIZE * i, vmf->pte + i);
ret = VM_FAULT_NOPAGE;
+ goto unlock;
}
+ set_pte_range(vmf, folio, page, nr_pages, addr);
+ type = is_cow ? MM_ANONPAGES : mm_counter_file(folio);
+ add_mm_counter(vma->vm_mm, type, nr_pages);
+ ret = 0;
+
+unlock:
pte_unmap_unlock(vmf->pte, vmf->ptl);
return ret;
}
--
2.39.3
In order to extend support to mTHP, validate that an allocation is PMD-sized
before updating the PMD-mapped THP related statistics, to avoid statistical
confusion.
Signed-off-by: Baolin Wang <[email protected]>
Reviewed-by: Barry Song <[email protected]>
---
mm/shmem.c | 7 ++++---
1 file changed, 4 insertions(+), 3 deletions(-)
diff --git a/mm/shmem.c b/mm/shmem.c
index e4483c4596a8..a383ea9a89a5 100644
--- a/mm/shmem.c
+++ b/mm/shmem.c
@@ -1661,7 +1661,7 @@ static struct folio *shmem_alloc_and_add_folio(gfp_t gfp,
return ERR_PTR(-E2BIG);
folio = shmem_alloc_hugefolio(gfp, info, index, HPAGE_PMD_ORDER);
- if (!folio)
+ if (!folio && pages == HPAGE_PMD_NR)
count_vm_event(THP_FILE_FALLBACK);
} else {
pages = 1;
@@ -1679,7 +1679,7 @@ static struct folio *shmem_alloc_and_add_folio(gfp_t gfp,
if (xa_find(&mapping->i_pages, &index,
index + pages - 1, XA_PRESENT)) {
error = -EEXIST;
- } else if (huge) {
+ } else if (pages == HPAGE_PMD_NR) {
count_vm_event(THP_FILE_FALLBACK);
count_vm_event(THP_FILE_FALLBACK_CHARGE);
}
@@ -2045,7 +2045,8 @@ static int shmem_get_folio_gfp(struct inode *inode, pgoff_t index,
folio = shmem_alloc_and_add_folio(huge_gfp,
inode, index, fault_mm, true);
if (!IS_ERR(folio)) {
- count_vm_event(THP_FILE_ALLOC);
+ if (folio_test_pmd_mappable(folio))
+ count_vm_event(THP_FILE_ALLOC);
goto alloced;
}
if (PTR_ERR(folio) == -EEXIST)
--
2.39.3
Add a new parameter to shmem_alloc_hugefolio() to specify the huge page
order, as a preparation for supporting mTHP.
Signed-off-by: Baolin Wang <[email protected]>
---
mm/shmem.c | 8 ++++----
1 file changed, 4 insertions(+), 4 deletions(-)
diff --git a/mm/shmem.c b/mm/shmem.c
index fa2a0ed97507..e4483c4596a8 100644
--- a/mm/shmem.c
+++ b/mm/shmem.c
@@ -1604,14 +1604,14 @@ static gfp_t limit_gfp_mask(gfp_t huge_gfp, gfp_t limit_gfp)
}
static struct folio *shmem_alloc_hugefolio(gfp_t gfp,
- struct shmem_inode_info *info, pgoff_t index)
+ struct shmem_inode_info *info, pgoff_t index, int order)
{
struct mempolicy *mpol;
pgoff_t ilx;
struct page *page;
- mpol = shmem_get_pgoff_policy(info, index, HPAGE_PMD_ORDER, &ilx);
- page = alloc_pages_mpol(gfp, HPAGE_PMD_ORDER, mpol, ilx, numa_node_id());
+ mpol = shmem_get_pgoff_policy(info, index, order, &ilx);
+ page = alloc_pages_mpol(gfp, order, mpol, ilx, numa_node_id());
mpol_cond_put(mpol);
return page_rmappable_folio(page);
@@ -1660,7 +1660,7 @@ static struct folio *shmem_alloc_and_add_folio(gfp_t gfp,
index + HPAGE_PMD_NR - 1, XA_PRESENT))
return ERR_PTR(-E2BIG);
- folio = shmem_alloc_hugefolio(gfp, info, index);
+ folio = shmem_alloc_hugefolio(gfp, info, index, HPAGE_PMD_ORDER);
if (!folio)
count_vm_event(THP_FILE_FALLBACK);
} else {
--
2.39.3
Commit 19eaf44954df added multi-size THP (mTHP) for anonymous pages, allowing
THP to be configured through the sysfs interface located at
'/sys/kernel/mm/transparent_hugepage/hugepages-XXkB/enabled'.
However, anonymous shared pages ignore the anonymous mTHP rule configured
through the sysfs interface and can only use PMD-mapped THP, which is not
reasonable. Users expect to apply the mTHP rule to all anonymous pages,
including anonymous shared pages, in order to enjoy the benefits of mTHP:
for example, lower latency than PMD-mapped THP, smaller memory bloat than
PMD-mapped THP, and contiguous PTEs on the ARM architecture to reduce TLB
misses.
The primary strategy is similar to supporting anonymous mTHP. Introduce a new
interface, '/sys/kernel/mm/transparent_hugepage/hugepages-XXkB/shmem_enabled',
which can accept all the same values as the top-level
'/sys/kernel/mm/transparent_hugepage/shmem_enabled', plus an additional
"inherit" option. By default, all sizes are set to "never" except the PMD
size, which is set to "inherit". This ensures backward compatibility with the
top-level shmem_enabled setting, while also allowing independent control of
shmem_enabled for each mTHP size.
Signed-off-by: Baolin Wang <[email protected]>
---
mm/shmem.c | 177 +++++++++++++++++++++++++++++++++++++++++++++--------
1 file changed, 150 insertions(+), 27 deletions(-)
diff --git a/mm/shmem.c b/mm/shmem.c
index 59cc26d44344..08ccea5170a1 100644
--- a/mm/shmem.c
+++ b/mm/shmem.c
@@ -1611,6 +1611,106 @@ static gfp_t limit_gfp_mask(gfp_t huge_gfp, gfp_t limit_gfp)
return result;
}
+#ifdef CONFIG_TRANSPARENT_HUGEPAGE
+static unsigned long anon_shmem_allowable_huge_orders(struct inode *inode,
+ struct vm_area_struct *vma, pgoff_t index,
+ bool global_huge)
+{
+ unsigned long mask = READ_ONCE(huge_anon_shmem_orders_always);
+ unsigned long within_size_orders = READ_ONCE(huge_anon_shmem_orders_within_size);
+ unsigned long vm_flags = vma->vm_flags;
+ /*
+ * Check all the (large) orders below HPAGE_PMD_ORDER + 1 that
+ * are enabled for this vma.
+ */
+ unsigned long orders = BIT(PMD_ORDER + 1) - 1;
+ loff_t i_size;
+ int order;
+
+ if ((vm_flags & VM_NOHUGEPAGE) ||
+ test_bit(MMF_DISABLE_THP, &vma->vm_mm->flags))
+ return 0;
+
+ /* If the hardware/firmware marked hugepage support disabled. */
+ if (transparent_hugepage_flags & (1 << TRANSPARENT_HUGEPAGE_UNSUPPORTED))
+ return 0;
+
+ /*
+ * Following the 'deny' semantics of the top level, force the huge
+ * option off from all mounts.
+ */
+ if (shmem_huge == SHMEM_HUGE_DENY)
+ return 0;
+ /*
+ * Only allow inherit orders if the top-level value is 'force', which
+ * means non-PMD sized THP can not override 'huge' mount option now.
+ */
+ if (shmem_huge == SHMEM_HUGE_FORCE)
+ return READ_ONCE(huge_anon_shmem_orders_inherit);
+
+ /* Allow mTHP that will be fully within i_size. */
+ order = highest_order(within_size_orders);
+ while (within_size_orders) {
+ index = round_up(index + 1, 1UL << order);
+ i_size = round_up(i_size_read(inode), PAGE_SIZE);
+ if (i_size >> PAGE_SHIFT >= index) {
+ mask |= within_size_orders;
+ break;
+ }
+
+ order = next_order(&within_size_orders, order);
+ }
+
+ if (vm_flags & VM_HUGEPAGE)
+ mask |= READ_ONCE(huge_anon_shmem_orders_madvise);
+
+ if (global_huge)
+ mask |= READ_ONCE(huge_anon_shmem_orders_inherit);
+
+ return orders & mask;
+}
+
+static unsigned long anon_shmem_suitable_orders(struct inode *inode, struct vm_fault *vmf,
+ struct address_space *mapping, pgoff_t index,
+ unsigned long orders)
+{
+ struct vm_area_struct *vma = vmf->vma;
+ unsigned long pages;
+ int order;
+
+ orders = thp_vma_suitable_orders(vma, vmf->address, orders);
+ if (!orders)
+ return 0;
+
+ /* Find the highest order that can add into the page cache */
+ order = highest_order(orders);
+ while (orders) {
+ pages = 1UL << order;
+ index = round_down(index, pages);
+ if (!xa_find(&mapping->i_pages, &index,
+ index + pages - 1, XA_PRESENT))
+ break;
+ order = next_order(&orders, order);
+ }
+
+ return orders;
+}
+#else
+static unsigned long anon_shmem_allowable_huge_orders(struct inode *inode,
+ struct vm_area_struct *vma, pgoff_t index,
+ bool global_huge)
+{
+ return 0;
+}
+
+static unsigned long anon_shmem_suitable_orders(struct inode *inode, struct vm_fault *vmf,
+ struct address_space *mapping, pgoff_t index,
+ unsigned long orders)
+{
+ return 0;
+}
+#endif /* CONFIG_TRANSPARENT_HUGEPAGE */
+
static struct folio *shmem_alloc_hugefolio(gfp_t gfp,
struct shmem_inode_info *info, pgoff_t index, int order)
{
@@ -1639,38 +1739,55 @@ static struct folio *shmem_alloc_folio(gfp_t gfp,
return (struct folio *)page;
}
-static struct folio *shmem_alloc_and_add_folio(gfp_t gfp,
- struct inode *inode, pgoff_t index,
- struct mm_struct *fault_mm, bool huge)
+static struct folio *shmem_alloc_and_add_folio(struct vm_fault *vmf,
+ gfp_t gfp, struct inode *inode, pgoff_t index,
+ struct mm_struct *fault_mm, bool huge, unsigned long orders)
{
struct address_space *mapping = inode->i_mapping;
struct shmem_inode_info *info = SHMEM_I(inode);
+ struct vm_area_struct *vma = vmf ? vmf->vma : NULL;
+ unsigned long suitable_orders;
struct folio *folio;
long pages;
- int error;
+ int error, order;
if (!IS_ENABLED(CONFIG_TRANSPARENT_HUGEPAGE))
huge = false;
- if (huge) {
- pages = HPAGE_PMD_NR;
- index = round_down(index, HPAGE_PMD_NR);
+ if (huge || orders > 0) {
+ if (vma && vma_is_anon_shmem(vma) && orders) {
+ suitable_orders = anon_shmem_suitable_orders(inode, vmf,
+ mapping, index, orders);
+ } else {
+ pages = HPAGE_PMD_NR;
+ suitable_orders = BIT(HPAGE_PMD_ORDER);
+ index = round_down(index, HPAGE_PMD_NR);
- /*
- * Check for conflict before waiting on a huge allocation.
- * Conflict might be that a huge page has just been allocated
- * and added to page cache by a racing thread, or that there
- * is already at least one small page in the huge extent.
- * Be careful to retry when appropriate, but not forever!
- * Elsewhere -EEXIST would be the right code, but not here.
- */
- if (xa_find(&mapping->i_pages, &index,
+ /*
+ * Check for conflict before waiting on a huge allocation.
+ * Conflict might be that a huge page has just been allocated
+ * and added to page cache by a racing thread, or that there
+ * is already at least one small page in the huge extent.
+ * Be careful to retry when appropriate, but not forever!
+ * Elsewhere -EEXIST would be the right code, but not here.
+ */
+ if (xa_find(&mapping->i_pages, &index,
index + HPAGE_PMD_NR - 1, XA_PRESENT))
- return ERR_PTR(-E2BIG);
+ return ERR_PTR(-E2BIG);
+ }
- folio = shmem_alloc_hugefolio(gfp, info, index, HPAGE_PMD_ORDER);
- if (!folio && pages == HPAGE_PMD_NR)
- count_vm_event(THP_FILE_FALLBACK);
+ order = highest_order(suitable_orders);
+ while (suitable_orders) {
+ pages = 1 << order;
+ index = round_down(index, pages);
+ folio = shmem_alloc_hugefolio(gfp, info, index, order);
+ if (folio)
+ goto allocated;
+
+ if (pages == HPAGE_PMD_NR)
+ count_vm_event(THP_FILE_FALLBACK);
+ order = next_order(&suitable_orders, order);
+ }
} else {
pages = 1;
folio = shmem_alloc_folio(gfp, info, index);
@@ -1678,6 +1795,7 @@ static struct folio *shmem_alloc_and_add_folio(gfp_t gfp,
if (!folio)
return ERR_PTR(-ENOMEM);
+allocated:
__folio_set_locked(folio);
__folio_set_swapbacked(folio);
@@ -1972,7 +2090,8 @@ static int shmem_get_folio_gfp(struct inode *inode, pgoff_t index,
struct mm_struct *fault_mm;
struct folio *folio;
int error;
- bool alloced;
+ bool alloced, huge;
+ unsigned long orders = 0;
if (WARN_ON_ONCE(!shmem_mapping(inode->i_mapping)))
return -EINVAL;
@@ -2044,14 +2163,18 @@ static int shmem_get_folio_gfp(struct inode *inode, pgoff_t index,
return 0;
}
- if (shmem_is_huge(inode, index, false, fault_mm,
- vma ? vma->vm_flags : 0)) {
+ huge = shmem_is_huge(inode, index, false, fault_mm,
+ vma ? vma->vm_flags : 0);
+ /* Find hugepage orders that are allowed for anonymous shmem. */
+ if (vma && vma_is_anon_shmem(vma))
+ orders = anon_shmem_allowable_huge_orders(inode, vma, index, huge);
+ if (huge || orders > 0) {
gfp_t huge_gfp;
huge_gfp = vma_thp_gfp_mask(vma);
huge_gfp = limit_gfp_mask(huge_gfp, gfp);
- folio = shmem_alloc_and_add_folio(huge_gfp,
- inode, index, fault_mm, true);
+ folio = shmem_alloc_and_add_folio(vmf, huge_gfp,
+ inode, index, fault_mm, true, orders);
if (!IS_ERR(folio)) {
if (folio_test_pmd_mappable(folio))
count_vm_event(THP_FILE_ALLOC);
@@ -2061,7 +2184,7 @@ static int shmem_get_folio_gfp(struct inode *inode, pgoff_t index,
goto repeat;
}
- folio = shmem_alloc_and_add_folio(gfp, inode, index, fault_mm, false);
+ folio = shmem_alloc_and_add_folio(vmf, gfp, inode, index, fault_mm, false, 0);
if (IS_ERR(folio)) {
error = PTR_ERR(folio);
if (error == -EEXIST)
@@ -2072,7 +2195,7 @@ static int shmem_get_folio_gfp(struct inode *inode, pgoff_t index,
alloced:
alloced = true;
- if (folio_test_pmd_mappable(folio) &&
+ if (folio_test_large(folio) &&
DIV_ROUND_UP(i_size_read(inode), PAGE_SIZE) <
folio_next_index(folio) - 1) {
struct shmem_sb_info *sbinfo = SHMEM_SB(inode->i_sb);
--
2.39.3
Although the top-level hugepage allocation can be turned off, anonymous shmem
can still use mTHP by configuring the sysfs interface located at
'/sys/kernel/mm/transparent_hugepage/hugepages-XXkB/shmem_enabled'. Therefore,
add mTHP size alignment so that shmem_get_unmapped_area() can return a
suitably aligned address.
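As a toy userspace illustration of what "suitably aligned" means here
(hypothetical values; a 4K base page and 64K mTHP are assumed):

#include <stdio.h>

/* An address is suitable for 64K folios when its offset within a 64K
 * extent matches the file offset's, i.e.
 * (addr & (64K - 1)) == ((pgoff << PAGE_SHIFT) & (64K - 1)). */
int main(void)
{
	unsigned long hpage_size = 64 * 1024;
	unsigned long pgoff = 3;			/* file page offset  */
	unsigned long addr = 0x7f0000013000UL;		/* candidate address */
	unsigned long offset = (pgoff << 12) & (hpage_size - 1);

	printf("offset in extent: %#lx, addr offset: %#lx -> %s\n",
	       offset, addr & (hpage_size - 1),
	       (addr & (hpage_size - 1)) == offset ? "aligned" : "needs inflation");
	return 0;
}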
Signed-off-by: Baolin Wang <[email protected]>
---
mm/shmem.c | 35 ++++++++++++++++++++++++++---------
1 file changed, 26 insertions(+), 9 deletions(-)
diff --git a/mm/shmem.c b/mm/shmem.c
index 08ccea5170a1..27107afdff9e 100644
--- a/mm/shmem.c
+++ b/mm/shmem.c
@@ -2404,6 +2404,7 @@ unsigned long shmem_get_unmapped_area(struct file *file,
unsigned long inflated_len;
unsigned long inflated_addr;
unsigned long inflated_offset;
+ unsigned long hpage_size = HPAGE_PMD_SIZE;
if (len > TASK_SIZE)
return -ENOMEM;
@@ -2422,8 +2423,6 @@ unsigned long shmem_get_unmapped_area(struct file *file,
if (shmem_huge == SHMEM_HUGE_DENY)
return addr;
- if (len < HPAGE_PMD_SIZE)
- return addr;
if (flags & MAP_FIXED)
return addr;
/*
@@ -2437,6 +2436,8 @@ unsigned long shmem_get_unmapped_area(struct file *file,
if (shmem_huge != SHMEM_HUGE_FORCE) {
struct super_block *sb;
+ unsigned long __maybe_unused hpage_orders;
+ int order = 0;
if (file) {
VM_BUG_ON(file->f_op != &shmem_file_operations);
@@ -2449,18 +2450,34 @@ unsigned long shmem_get_unmapped_area(struct file *file,
if (IS_ERR(shm_mnt))
return addr;
sb = shm_mnt->mnt_sb;
+
+ /*
+ * Find the highest mTHP order used for anonymous shmem to
+ * provide a suitable alignment address.
+ */
+#ifdef CONFIG_TRANSPARENT_HUGEPAGE
+ hpage_orders = READ_ONCE(huge_anon_shmem_orders_always);
+ hpage_orders |= READ_ONCE(huge_anon_shmem_orders_within_size);
+ hpage_orders |= READ_ONCE(huge_anon_shmem_orders_madvise);
+ hpage_orders |= READ_ONCE(huge_anon_shmem_orders_inherit);
+ order = highest_order(hpage_orders);
+ hpage_size = PAGE_SIZE << order;
+#endif
}
- if (SHMEM_SB(sb)->huge == SHMEM_HUGE_NEVER)
+ if (SHMEM_SB(sb)->huge == SHMEM_HUGE_NEVER && !order)
return addr;
}
- offset = (pgoff << PAGE_SHIFT) & (HPAGE_PMD_SIZE-1);
- if (offset && offset + len < 2 * HPAGE_PMD_SIZE)
+ if (len < hpage_size)
+ return addr;
+
+ offset = (pgoff << PAGE_SHIFT) & (hpage_size - 1);
+ if (offset && offset + len < 2 * hpage_size)
return addr;
- if ((addr & (HPAGE_PMD_SIZE-1)) == offset)
+ if ((addr & (hpage_size - 1)) == offset)
return addr;
- inflated_len = len + HPAGE_PMD_SIZE - PAGE_SIZE;
+ inflated_len = len + hpage_size - PAGE_SIZE;
if (inflated_len > TASK_SIZE)
return addr;
if (inflated_len < len)
@@ -2473,10 +2490,10 @@ unsigned long shmem_get_unmapped_area(struct file *file,
if (inflated_addr & ~PAGE_MASK)
return addr;
- inflated_offset = inflated_addr & (HPAGE_PMD_SIZE-1);
+ inflated_offset = inflated_addr & (hpage_size - 1);
inflated_addr += offset - inflated_offset;
if (inflated_offset > offset)
- inflated_addr += HPAGE_PMD_SIZE;
+ inflated_addr += hpage_size;
if (inflated_addr > TASK_SIZE - len)
return addr;
--
2.39.3
Add mTHP counters for anonymous shmem.
Signed-off-by: Baolin Wang <[email protected]>
---
include/linux/huge_mm.h | 3 +++
mm/huge_memory.c | 6 ++++++
mm/shmem.c | 18 +++++++++++++++---
3 files changed, 24 insertions(+), 3 deletions(-)
diff --git a/include/linux/huge_mm.h b/include/linux/huge_mm.h
index dbd6b3f56210..c15bebb2cf53 100644
--- a/include/linux/huge_mm.h
+++ b/include/linux/huge_mm.h
@@ -281,6 +281,9 @@ enum mthp_stat_item {
MTHP_STAT_ANON_FAULT_FALLBACK_CHARGE,
MTHP_STAT_ANON_SWPOUT,
MTHP_STAT_ANON_SWPOUT_FALLBACK,
+ MTHP_STAT_FILE_ALLOC,
+ MTHP_STAT_FILE_FALLBACK,
+ MTHP_STAT_FILE_FALLBACK_CHARGE,
__MTHP_STAT_COUNT
};
diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index d3080a8843f2..fcda6ae604f6 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -555,6 +555,9 @@ DEFINE_MTHP_STAT_ATTR(anon_fault_fallback, MTHP_STAT_ANON_FAULT_FALLBACK);
DEFINE_MTHP_STAT_ATTR(anon_fault_fallback_charge, MTHP_STAT_ANON_FAULT_FALLBACK_CHARGE);
DEFINE_MTHP_STAT_ATTR(anon_swpout, MTHP_STAT_ANON_SWPOUT);
DEFINE_MTHP_STAT_ATTR(anon_swpout_fallback, MTHP_STAT_ANON_SWPOUT_FALLBACK);
+DEFINE_MTHP_STAT_ATTR(file_alloc, MTHP_STAT_FILE_ALLOC);
+DEFINE_MTHP_STAT_ATTR(file_fallback, MTHP_STAT_FILE_FALLBACK);
+DEFINE_MTHP_STAT_ATTR(file_fallback_charge, MTHP_STAT_FILE_FALLBACK_CHARGE);
static struct attribute *stats_attrs[] = {
&anon_fault_alloc_attr.attr,
@@ -562,6 +565,9 @@ static struct attribute *stats_attrs[] = {
&anon_fault_fallback_charge_attr.attr,
&anon_swpout_attr.attr,
&anon_swpout_fallback_attr.attr,
+ &file_alloc_attr.attr,
+ &file_fallback_attr.attr,
+ &file_fallback_charge_attr.attr,
NULL,
};
diff --git a/mm/shmem.c b/mm/shmem.c
index 27107afdff9e..1af2f0aa384d 100644
--- a/mm/shmem.c
+++ b/mm/shmem.c
@@ -1786,6 +1786,9 @@ static struct folio *shmem_alloc_and_add_folio(struct vm_fault *vmf,
if (pages == HPAGE_PMD_NR)
count_vm_event(THP_FILE_FALLBACK);
+#ifdef CONFIG_TRANSPARENT_HUGEPAGE
+ count_mthp_stat(order, MTHP_STAT_FILE_FALLBACK);
+#endif
order = next_order(&suitable_orders, order);
}
} else {
@@ -1805,9 +1808,15 @@ static struct folio *shmem_alloc_and_add_folio(struct vm_fault *vmf,
if (xa_find(&mapping->i_pages, &index,
index + pages - 1, XA_PRESENT)) {
error = -EEXIST;
- } else if (pages == HPAGE_PMD_NR) {
- count_vm_event(THP_FILE_FALLBACK);
- count_vm_event(THP_FILE_FALLBACK_CHARGE);
+ } else if (pages > 1) {
+ if (pages == HPAGE_PMD_NR) {
+ count_vm_event(THP_FILE_FALLBACK);
+ count_vm_event(THP_FILE_FALLBACK_CHARGE);
+ }
+#ifdef CONFIG_TRANSPARENT_HUGEPAGE
+ count_mthp_stat(folio_order(folio), MTHP_STAT_FILE_FALLBACK);
+ count_mthp_stat(folio_order(folio), MTHP_STAT_FILE_FALLBACK_CHARGE);
+#endif
}
goto unlock;
}
@@ -2178,6 +2187,9 @@ static int shmem_get_folio_gfp(struct inode *inode, pgoff_t index,
if (!IS_ERR(folio)) {
if (folio_test_pmd_mappable(folio))
count_vm_event(THP_FILE_ALLOC);
+#ifdef CONFIG_TRANSPARENT_HUGEPAGE
+ count_mthp_stat(folio_order(folio), MTHP_STAT_FILE_ALLOC);
+#endif
goto alloced;
}
if (PTR_ERR(folio) == -EEXIST)
--
2.39.3
To support the use of mTHP with anonymous shmem, add a new sysfs interface
'shmem_enabled' in the '/sys/kernel/mm/transparent_hugepage/hugepages-<size>kB/'
directory for each mTHP, to control whether shmem is enabled for that mTHP,
with a value similar to the top-level 'shmem_enabled'. It can be set to:
"always", "inherit" (to inherit the top-level setting), "within_size",
"advise", "never", "deny" or "force". These values follow the same semantics
as the top level, except that 'deny' is equivalent to 'never' and 'force' is
equivalent to 'always', to keep compatibility.
By default, PMD-sized hugepages have enabled="inherit" and all other hugepage
sizes have enabled="never" for
'/sys/kernel/mm/transparent_hugepage/hugepages-<size>kB/shmem_enabled'.
In addition, if the top-level value is 'force', then only PMD-sized hugepages
can have enabled="inherit"; otherwise the configuration will fail, and vice
versa. That means we now avoid using non-PMD-sized THP to override the global
huge allocation.
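A simplified model of how the per-size value is resolved (illustrative only;
the bitmask layout mirrors the approach in this patch, but the function, enum
and the demo values in main() are hypothetical):

#include <stdio.h>

enum policy { NEVER, ALWAYS, WITHIN_SIZE, ADVISE };

/* Each policy keeps a bitmask of enabled orders; "inherit" defers to the
 * top-level shmem_enabled setting for the orders that have it set. */
static enum policy effective_policy(int order, unsigned long always_mask,
				    unsigned long within_size_mask,
				    unsigned long advise_mask,
				    unsigned long inherit_mask,
				    enum policy top_level)
{
	if (always_mask & (1UL << order))
		return ALWAYS;
	if (within_size_mask & (1UL << order))
		return WITHIN_SIZE;
	if (advise_mask & (1UL << order))
		return ADVISE;
	if (inherit_mask & (1UL << order))
		return top_level;
	return NEVER;
}

int main(void)
{
	/* Default described above: only the PMD order (9 with 4K pages) inherits. */
	unsigned long inherit_mask = 1UL << 9;

	printf("order 9 -> %d, order 4 -> %d\n",
	       effective_policy(9, 0, 0, 0, inherit_mask, ALWAYS),
	       effective_policy(4, 0, 0, 0, inherit_mask, ALWAYS));
	return 0;
}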
Signed-off-by: Baolin Wang <[email protected]>
---
Documentation/admin-guide/mm/transhuge.rst | 29 +++++++
include/linux/huge_mm.h | 10 +++
mm/huge_memory.c | 11 +--
mm/shmem.c | 96 ++++++++++++++++++++++
4 files changed, 138 insertions(+), 8 deletions(-)
diff --git a/Documentation/admin-guide/mm/transhuge.rst b/Documentation/admin-guide/mm/transhuge.rst
index 076443cc10a6..a28496e15bdb 100644
--- a/Documentation/admin-guide/mm/transhuge.rst
+++ b/Documentation/admin-guide/mm/transhuge.rst
@@ -332,6 +332,35 @@ deny
force
Force the huge option on for all - very useful for testing;
+Anonymous shmem can also use "multi-size THP" (mTHP) by adding a new sysfs knob
+to control mTHP allocation: /sys/kernel/mm/transparent_hugepage/hugepages-<size>kB/shmem_enabled.
+Its value for each mTHP is essentially consistent with the global setting, except
+for the addition of 'inherit' to ensure compatibility with the global settings.
+always
+ Attempt to allocate <size> huge pages every time we need a new page;
+
+inherit
+ Inherit the top-level "shmem_enabled" value. By default, PMD-sized hugepages
+ have enabled="inherit" and all other hugepage sizes have enabled="never";
+
+never
+ Do not allocate <size> huge pages;
+
+within_size
+ Only allocate <size> huge page if it will be fully within i_size.
+ Also respect fadvise()/madvise() hints;
+
+advise
+ Only allocate <size> huge pages if requested with fadvise()/madvise();
+
+deny
+ Has the same semantics as 'never'; the mTHP allocation policy is only
+ used for anonymous shmem and does not override tmpfs.
+
+force
+ Has the same semantics as 'always'; the mTHP allocation policy is only
+ used for anonymous shmem and does not override tmpfs.
+
Need of application restart
===========================
diff --git a/include/linux/huge_mm.h b/include/linux/huge_mm.h
index e49b56c40a11..dbd6b3f56210 100644
--- a/include/linux/huge_mm.h
+++ b/include/linux/huge_mm.h
@@ -6,6 +6,7 @@
#include <linux/mm_types.h>
#include <linux/fs.h> /* only for vma_is_dax() */
+#include <linux/kobject.h>
vm_fault_t do_huge_pmd_anonymous_page(struct vm_fault *vmf);
int copy_huge_pmd(struct mm_struct *dst_mm, struct mm_struct *src_mm,
@@ -63,6 +64,7 @@ ssize_t single_hugepage_flag_show(struct kobject *kobj,
struct kobj_attribute *attr, char *buf,
enum transparent_hugepage_flag flag);
extern struct kobj_attribute shmem_enabled_attr;
+extern struct kobj_attribute thpsize_shmem_enabled_attr;
/*
* Mask of all large folio orders supported for anonymous THP; all orders up to
@@ -265,6 +267,14 @@ unsigned long thp_vma_allowable_orders(struct vm_area_struct *vma,
return __thp_vma_allowable_orders(vma, vm_flags, tva_flags, orders);
}
+struct thpsize {
+ struct kobject kobj;
+ struct list_head node;
+ int order;
+};
+
+#define to_thpsize(kobj) container_of(kobj, struct thpsize, kobj)
+
enum mthp_stat_item {
MTHP_STAT_ANON_FAULT_ALLOC,
MTHP_STAT_ANON_FAULT_FALLBACK,
diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index 9efb6fefc391..d3080a8843f2 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -449,14 +449,6 @@ static void thpsize_release(struct kobject *kobj);
static DEFINE_SPINLOCK(huge_anon_orders_lock);
static LIST_HEAD(thpsize_list);
-struct thpsize {
- struct kobject kobj;
- struct list_head node;
- int order;
-};
-
-#define to_thpsize(kobj) container_of(kobj, struct thpsize, kobj)
-
static ssize_t thpsize_enabled_show(struct kobject *kobj,
struct kobj_attribute *attr, char *buf)
{
@@ -517,6 +509,9 @@ static struct kobj_attribute thpsize_enabled_attr =
static struct attribute *thpsize_attrs[] = {
&thpsize_enabled_attr.attr,
+#ifdef CONFIG_SHMEM
+ &thpsize_shmem_enabled_attr.attr,
+#endif
NULL,
};
diff --git a/mm/shmem.c b/mm/shmem.c
index a383ea9a89a5..59cc26d44344 100644
--- a/mm/shmem.c
+++ b/mm/shmem.c
@@ -131,6 +131,14 @@ struct shmem_options {
#define SHMEM_SEEN_QUOTA 32
};
+#ifdef CONFIG_TRANSPARENT_HUGEPAGE
+static unsigned long huge_anon_shmem_orders_always __read_mostly;
+static unsigned long huge_anon_shmem_orders_madvise __read_mostly;
+static unsigned long huge_anon_shmem_orders_inherit __read_mostly;
+static unsigned long huge_anon_shmem_orders_within_size __read_mostly;
+static DEFINE_SPINLOCK(huge_anon_shmem_orders_lock);
+#endif
+
#ifdef CONFIG_TMPFS
static unsigned long shmem_default_max_blocks(void)
{
@@ -4687,6 +4695,12 @@ void __init shmem_init(void)
SHMEM_SB(shm_mnt->mnt_sb)->huge = shmem_huge;
else
shmem_huge = SHMEM_HUGE_NEVER; /* just in case it was patched */
+
+ /*
+ * Default to setting PMD-sized THP to inherit the global setting and
+ * disable all other multi-size THPs, when anonymous shmem uses mTHP.
+ */
+ huge_anon_shmem_orders_inherit = BIT(HPAGE_PMD_ORDER);
#endif
return;
@@ -4746,6 +4760,11 @@ static ssize_t shmem_enabled_store(struct kobject *kobj,
huge != SHMEM_HUGE_NEVER && huge != SHMEM_HUGE_DENY)
return -EINVAL;
+ /* Do not override huge allocation policy with non-PMD sized mTHP */
+ if (huge == SHMEM_HUGE_FORCE &&
+ huge_anon_shmem_orders_inherit != BIT(HPAGE_PMD_ORDER))
+ return -EINVAL;
+
shmem_huge = huge;
if (shmem_huge > SHMEM_HUGE_DENY)
SHMEM_SB(shm_mnt->mnt_sb)->huge = shmem_huge;
@@ -4753,6 +4772,83 @@ static ssize_t shmem_enabled_store(struct kobject *kobj,
}
struct kobj_attribute shmem_enabled_attr = __ATTR_RW(shmem_enabled);
+
+static ssize_t thpsize_shmem_enabled_show(struct kobject *kobj,
+ struct kobj_attribute *attr, char *buf)
+{
+ int order = to_thpsize(kobj)->order;
+ const char *output;
+
+ if (test_bit(order, &huge_anon_shmem_orders_always))
+ output = "[always] inherit within_size advise never deny [force]";
+ else if (test_bit(order, &huge_anon_shmem_orders_inherit))
+ output = "always [inherit] within_size advise never deny force";
+ else if (test_bit(order, &huge_anon_shmem_orders_within_size))
+ output = "always inherit [within_size] advise never deny force";
+ else if (test_bit(order, &huge_anon_shmem_orders_madvise))
+ output = "always inherit within_size [advise] never deny force";
+ else
+ output = "always inherit within_size advise [never] [deny] force";
+
+ return sysfs_emit(buf, "%s\n", output);
+}
+
+static ssize_t thpsize_shmem_enabled_store(struct kobject *kobj,
+ struct kobj_attribute *attr,
+ const char *buf, size_t count)
+{
+ int order = to_thpsize(kobj)->order;
+ ssize_t ret = count;
+
+ if (sysfs_streq(buf, "always") || sysfs_streq(buf, "force")) {
+ spin_lock(&huge_anon_shmem_orders_lock);
+ clear_bit(order, &huge_anon_shmem_orders_inherit);
+ clear_bit(order, &huge_anon_shmem_orders_madvise);
+ clear_bit(order, &huge_anon_shmem_orders_within_size);
+ set_bit(order, &huge_anon_shmem_orders_always);
+ spin_unlock(&huge_anon_shmem_orders_lock);
+ } else if (sysfs_streq(buf, "inherit")) {
+ /* Do not override huge allocation policy with non-PMD sized mTHP */
+ if (shmem_huge == SHMEM_HUGE_FORCE &&
+ order != HPAGE_PMD_ORDER)
+ return -EINVAL;
+
+ spin_lock(&huge_anon_shmem_orders_lock);
+ clear_bit(order, &huge_anon_shmem_orders_always);
+ clear_bit(order, &huge_anon_shmem_orders_madvise);
+ clear_bit(order, &huge_anon_shmem_orders_within_size);
+ set_bit(order, &huge_anon_shmem_orders_inherit);
+ spin_unlock(&huge_anon_shmem_orders_lock);
+ } else if (sysfs_streq(buf, "within_size")) {
+ spin_lock(&huge_anon_shmem_orders_lock);
+ clear_bit(order, &huge_anon_shmem_orders_always);
+ clear_bit(order, &huge_anon_shmem_orders_inherit);
+ clear_bit(order, &huge_anon_shmem_orders_madvise);
+ set_bit(order, &huge_anon_shmem_orders_within_size);
+ spin_unlock(&huge_anon_shmem_orders_lock);
+ } else if (sysfs_streq(buf, "madvise")) {
+ spin_lock(&huge_anon_shmem_orders_lock);
+ clear_bit(order, &huge_anon_shmem_orders_always);
+ clear_bit(order, &huge_anon_shmem_orders_inherit);
+ clear_bit(order, &huge_anon_shmem_orders_within_size);
+ set_bit(order, &huge_anon_shmem_orders_madvise);
+ spin_unlock(&huge_anon_shmem_orders_lock);
+ } else if (sysfs_streq(buf, "never") || sysfs_streq(buf, "deny")) {
+ spin_lock(&huge_anon_shmem_orders_lock);
+ clear_bit(order, &huge_anon_shmem_orders_always);
+ clear_bit(order, &huge_anon_shmem_orders_inherit);
+ clear_bit(order, &huge_anon_shmem_orders_within_size);
+ clear_bit(order, &huge_anon_shmem_orders_madvise);
+ spin_unlock(&huge_anon_shmem_orders_lock);
+ } else {
+ ret = -EINVAL;
+ }
+
+ return ret;
+}
+
+struct kobj_attribute thpsize_shmem_enabled_attr =
+ __ATTR(shmem_enabled, 0644, thpsize_shmem_enabled_show, thpsize_shmem_enabled_store);
#endif /* CONFIG_TRANSPARENT_HUGEPAGE && CONFIG_SYSFS */
#else /* !CONFIG_SHMEM */
--
2.39.3
Hey Baolin,
I found a compilation issue that failed one[1] of my configurations
after applying this series. The error message is as follows:
mm/shmem.c: In function ‘shmem_get_unmapped_area’:
/./include/linux/compiler_types.h:460:45: error: call to ‘__compiletime_assert_481’ declared with attribute error: BUILD_BUG failed
_compiletime_assert(condition, msg, __compiletime_assert_, __COUNTER__)
^
/./include/linux/compiler_types.h:441:25: note: in definition of macro ‘__compiletime_assert’
prefix ## suffix(); \
^~~~~~
/./include/linux/compiler_types.h:460:9: note: in expansion of macro ‘_compiletime_assert’
_compiletime_assert(condition, msg, __compiletime_assert_, __COUNTER__)
^~~~~~~~~~~~~~~~~~~
/include/linux/build_bug.h:39:37: note: in expansion of macro ‘compiletime_assert’
#define BUILD_BUG_ON_MSG(cond, msg) compiletime_assert(!(cond), msg)
^~~~~~~~~~~~~~~~~~
/include/linux/build_bug.h:59:21: note: in expansion of macro ‘BUILD_BUG_ON_MSG’
#define BUILD_BUG() BUILD_BUG_ON_MSG(1, "BUILD_BUG failed")
^~~~~~~~~~~~~~~~
/include/linux/huge_mm.h:97:28: note: in expansion of macro ‘BUILD_BUG’
#define HPAGE_PMD_SHIFT ({ BUILD_BUG(); 0; })
^~~~~~~~~
/include/linux/huge_mm.h:104:35: note: in expansion of macro ‘HPAGE_PMD_SHIFT’
#define HPAGE_PMD_SIZE ((1UL) << HPAGE_PMD_SHIFT)
^~~~~~~~~~~~~~~
mm/shmem.c:2419:36: note: in expansion of macro ‘HPAGE_PMD_SIZE’
unsigned long hpage_size = HPAGE_PMD_SIZE;
^~~~~~~~~~~~~~~
It seems like we need to handle the case where CONFIG_PGTABLE_HAS_HUGE_LEAVES
is undefined.
[1] export ARCH=arm64 && make allnoconfig && make olddefconfig && make -j$(nproc)
Thanks,
Lance
Hi Lance,
On 2024/5/6 18:54, Lance Yang wrote:
> Hey Baolin,
>
> I found a compilation issue that failed one[1] of my configurations
> after applying this series. The error message is as follows:
>
> mm/shmem.c: In function ‘shmem_get_unmapped_area’:
> ././include/linux/compiler_types.h:460:45: error: call to ‘__compiletime_assert_481’ declared with attribute error: BUILD_BUG failed
> _compiletime_assert(condition, msg, __compiletime_assert_, __COUNTER__)
> ^
> ././include/linux/compiler_types.h:441:25: note: in definition of macro ‘__compiletime_assert’
> prefix ## suffix(); \
> ^~~~~~
> ././include/linux/compiler_types.h:460:9: note: in expansion of macro ‘_compiletime_assert’
> _compiletime_assert(condition, msg, __compiletime_assert_, __COUNTER__)
> ^~~~~~~~~~~~~~~~~~~
> ./include/linux/build_bug.h:39:37: note: in expansion of macro ‘compiletime_assert’
> #define BUILD_BUG_ON_MSG(cond, msg) compiletime_assert(!(cond), msg)
> ^~~~~~~~~~~~~~~~~~
> ./include/linux/build_bug.h:59:21: note: in expansion of macro ‘BUILD_BUG_ON_MSG’
> #define BUILD_BUG() BUILD_BUG_ON_MSG(1, "BUILD_BUG failed")
> ^~~~~~~~~~~~~~~~
> ./include/linux/huge_mm.h:97:28: note: in expansion of macro ‘BUILD_BUG’
> #define HPAGE_PMD_SHIFT ({ BUILD_BUG(); 0; })
> ^~~~~~~~~
> ./include/linux/huge_mm.h:104:35: note: in expansion of macro ‘HPAGE_PMD_SHIFT’
> #define HPAGE_PMD_SIZE ((1UL) << HPAGE_PMD_SHIFT)
> ^~~~~~~~~~~~~~~
> mm/shmem.c:2419:36: note: in expansion of macro ‘HPAGE_PMD_SIZE’
> unsigned long hpage_size = HPAGE_PMD_SIZE;
> ^~~~~~~~~~~~~~~
>
> It seems like we need to handle the case where CONFIG_PGTABLE_HAS_HUGE_LEAVES
> is undefined.
>
> [1] export ARCH=arm64 && make allnoconfig && make olddefconfig && make -j$(nproc)
Thanks for reporting. I can move the use of HPAGE_PMD_SIZE to after the
check for CONFIG_TRANSPARENT_HUGEPAGE, which avoids the build error:
diff --git a/mm/shmem.c b/mm/shmem.c
index 1af2f0aa384d..d603e36e0f4f 100644
--- a/mm/shmem.c
+++ b/mm/shmem.c
@@ -2416,7 +2416,7 @@ unsigned long shmem_get_unmapped_area(struct file *file,
unsigned long inflated_len;
unsigned long inflated_addr;
unsigned long inflated_offset;
- unsigned long hpage_size = HPAGE_PMD_SIZE;
+ unsigned long hpage_size;
if (len > TASK_SIZE)
return -ENOMEM;
@@ -2446,6 +2446,7 @@ unsigned long shmem_get_unmapped_area(struct file *file,
if (uaddr == addr)
return addr;
+ hpage_size = HPAGE_PMD_SIZE;
if (shmem_huge != SHMEM_HUGE_FORCE) {
struct super_block *sb;
unsigned long __maybe_unused hpage_orders;
On Tue, May 7, 2024 at 9:47 AM Baolin Wang
<[email protected]> wrote:
>
> Hi Lance,
>
> On 2024/5/6 18:54, Lance Yang wrote:
> > Hey Baolin,
> >
> > I found a compilation issue that failed one[1] of my configurations
> > after applying this series. The error message is as follows:
> >
> > mm/shmem.c: In function ‘shmem_get_unmapped_area’:
> > ././include/linux/compiler_types.h:460:45: error: call to ‘__compiletime_assert_481’ declared with attribute error: BUILD_BUG failed
> > _compiletime_assert(condition, msg, __compiletime_assert_, __COUNTER__)
> > ^
> > ././include/linux/compiler_types.h:441:25: note: in definition of macro ‘__compiletime_assert’
> > prefix ## suffix(); \
> > ^~~~~~
> > ././include/linux/compiler_types.h:460:9: note: in expansion of macro ‘_compiletime_assert’
> > _compiletime_assert(condition, msg, __compiletime_assert_, __COUNTER__)
> > ^~~~~~~~~~~~~~~~~~~
> > ./include/linux/build_bug.h:39:37: note: in expansion of macro ‘compiletime_assert’
> > #define BUILD_BUG_ON_MSG(cond, msg) compiletime_assert(!(cond), msg)
> > ^~~~~~~~~~~~~~~~~~
> > ./include/linux/build_bug.h:59:21: note: in expansion of macro ‘BUILD_BUG_ON_MSG’
> > #define BUILD_BUG() BUILD_BUG_ON_MSG(1, "BUILD_BUG failed")
> > ^~~~~~~~~~~~~~~~
> > ./include/linux/huge_mm.h:97:28: note: in expansion of macro ‘BUILD_BUG’
> > #define HPAGE_PMD_SHIFT ({ BUILD_BUG(); 0; })
> > ^~~~~~~~~
> > ./include/linux/huge_mm.h:104:35: note: in expansion of macro ‘HPAGE_PMD_SHIFT’
> > #define HPAGE_PMD_SIZE ((1UL) << HPAGE_PMD_SHIFT)
> > ^~~~~~~~~~~~~~~
> > mm/shmem.c:2419:36: note: in expansion of macro ‘HPAGE_PMD_SIZE’
> > unsigned long hpage_size = HPAGE_PMD_SIZE;
> > ^~~~~~~~~~~~~~~
> >
> > It seems like we need to handle the case where CONFIG_PGTABLE_HAS_HUGE_LEAVES
> > is undefined.
> >
> > [1] export ARCH=arm64 && make allnoconfig && make olddefconfig && make -j$(nproc)
>
> Thanks for reporting. I can move the use of HPAGE_PMD_SIZE to after the
> check for CONFIG_TRANSPARENT_HUGEPAGE, which can avoid the building error:
I confirmed that the issue I reported before has disappeared after applying
this change. For the fix,
Tested-by: Lance Yang <[email protected]>
Thanks,
Lance
>
> diff --git a/mm/shmem.c b/mm/shmem.c
> index 1af2f0aa384d..d603e36e0f4f 100644
> --- a/mm/shmem.c
> +++ b/mm/shmem.c
> @@ -2416,7 +2416,7 @@ unsigned long shmem_get_unmapped_area(struct file
> *file,
> unsigned long inflated_len;
> unsigned long inflated_addr;
> unsigned long inflated_offset;
> - unsigned long hpage_size = HPAGE_PMD_SIZE;
> + unsigned long hpage_size;
>
> if (len > TASK_SIZE)
> return -ENOMEM;
> @@ -2446,6 +2446,7 @@ unsigned long shmem_get_unmapped_area(struct file
> *file,
> if (uaddr == addr)
> return addr;
>
> + hpage_size = HPAGE_PMD_SIZE;
> if (shmem_huge != SHMEM_HUGE_FORCE) {
> struct super_block *sb;
> unsigned long __maybe_unused hpage_orders;
On 06/05/2024 09:46, Baolin Wang wrote:
> Anonymous pages have already been supported for multi-size (mTHP) allocation
> through commit 19eaf44954df, that can allow THP to be configured through the
> sysfs interface located at '/sys/kernel/mm/transparent_hugepage/hugepage-XXkb/enabled'.
>
> However, the anonymous shared pages will ignore the anonymous mTHP rule
> configured through the sysfs interface, and can only use the PMD-mapped
> THP, that is not reasonable. Many implement anonymous page sharing through
> mmap(MAP_SHARED | MAP_ANONYMOUS), especially in database usage scenarios,
> therefore, users expect to apply an unified mTHP strategy for anonymous pages,
> also including the anonymous shared pages, in order to enjoy the benefits of
> mTHP. For example, lower latency than PMD-mapped THP, smaller memory bloat
> than PMD-mapped THP, contiguous PTEs on ARM architecture to reduce TLB miss etc.
>
> The primary strategy is similar to supporting anonymous mTHP. Introduce
> a new interface '/mm/transparent_hugepage/hugepage-XXkb/shmem_enabled',
> which can have all the same values as the top-level
Didn't we agree that "force" would not be supported for now, and would return an
error when attempting to set for a non-PMD-size hugepage-XXkb/shmem_enabled (or
indirectly through inheritance)?
> '/sys/kernel/mm/transparent_hugepage/shmem_enabled', with adding a new
> additional "inherit" option. By default all sizes will be set to "never"
> except PMD size, which is set to "inherit". This ensures backward compatibility
> with the shmem enabled of the top level, meanwhile also allows independent
> control of shmem enabled for each mTHP.
>
> Use the page fault latency tool to measure the performance of 1G anonymous shmem
> with 32 threads on my machine environment with: ARM64 Architecture, 32 cores,
> 125G memory:
> base: mm-unstable
> user-time sys_time faults_per_sec_per_cpu faults_per_sec
> 0.04s 3.10s 83516.416 2669684.890
>
> mm-unstable + patchset, anon shmem mTHP disabled
> user-time sys_time faults_per_sec_per_cpu faults_per_sec
> 0.02s 3.14s 82936.359 2630746.027
>
> mm-unstable + patchset, anon shmem 64K mTHP enabled
> user-time sys_time faults_per_sec_per_cpu faults_per_sec
> 0.08s 0.31s 678630.231 17082522.495
>
> From the data above, it is observed that the patchset has a minimal impact when
> mTHP is not enabled (some fluctuations observed during testing). When enabling 64K
> mTHP, there is a significant improvement of the page fault latency.
>
> TODO:
> - Support mTHP for tmpfs.
> - Do not split the large folio when share memory swap out.
> - Can swap in a large folio for share memory.
>
> Changes from RFC:
> - Rebase the patch set against the new mm-unstable branch, per Lance.
> - Add a new patch to export highest_order() and next_order().
> - Add a new patch to align mTHP size in shmem_get_unmapped_area().
> - Handle the uffd case and the VMA limits case when building mapping for
> large folio in the finish_fault() function, per Ryan.
> - Remove unnecessary 'order' variable in patch 3, per Kefeng.
> - Keep the anon shmem counters' name consistency.
> - Modify the strategy to support mTHP for anonymous shmem, discussed with
> Ryan and David.
> - Add reviewed tag from Barry.
> - Update the commit message.
>
> Baolin Wang (8):
> mm: move highest_order() and next_order() out of the THP config
> mm: memory: extend finish_fault() to support large folio
> mm: shmem: add an 'order' parameter for shmem_alloc_hugefolio()
> mm: shmem: add THP validation for PMD-mapped THP related statistics
> mm: shmem: add multi-size THP sysfs interface for anonymous shmem
> mm: shmem: add mTHP support for anonymous shmem
> mm: shmem: add mTHP size alignment in shmem_get_unmapped_area
> mm: shmem: add mTHP counters for anonymous shmem
>
> Documentation/admin-guide/mm/transhuge.rst | 29 ++
> include/linux/huge_mm.h | 35 ++-
> mm/huge_memory.c | 17 +-
> mm/memory.c | 43 ++-
> mm/shmem.c | 335 ++++++++++++++++++---
> 5 files changed, 387 insertions(+), 72 deletions(-)
>
On 06/05/2024 09:46, Baolin Wang wrote:
> Add large folio mapping establishment support for finish_fault() as a preparation,
> to support multi-size THP allocation of anonymous shmem pages in the following
> patches.
>
> Signed-off-by: Baolin Wang <[email protected]>
> ---
> mm/memory.c | 43 +++++++++++++++++++++++++++++++++----------
> 1 file changed, 33 insertions(+), 10 deletions(-)
>
> diff --git a/mm/memory.c b/mm/memory.c
> index eea6e4984eae..936377220b77 100644
> --- a/mm/memory.c
> +++ b/mm/memory.c
> @@ -4747,9 +4747,12 @@ vm_fault_t finish_fault(struct vm_fault *vmf)
> {
> struct vm_area_struct *vma = vmf->vma;
> struct page *page;
> + struct folio *folio;
> vm_fault_t ret;
> bool is_cow = (vmf->flags & FAULT_FLAG_WRITE) &&
> !(vma->vm_flags & VM_SHARED);
> + int type, nr_pages, i;
> + unsigned long addr = vmf->address;
>
> /* Did we COW the page? */
> if (is_cow)
> @@ -4780,24 +4783,44 @@ vm_fault_t finish_fault(struct vm_fault *vmf)
> return VM_FAULT_OOM;
> }
>
> + folio = page_folio(page);
> + nr_pages = folio_nr_pages(folio);
> +
> + if (unlikely(userfaultfd_armed(vma))) {
> + nr_pages = 1;
> + } else if (nr_pages > 1) {
> + unsigned long start = ALIGN_DOWN(vmf->address, nr_pages * PAGE_SIZE);
> + unsigned long end = start + nr_pages * PAGE_SIZE;
> +
> + /* In case the folio size in page cache beyond the VMA limits. */
> + addr = max(start, vma->vm_start);
> + nr_pages = (min(end, vma->vm_end) - addr) >> PAGE_SHIFT;
> +
> + page = folio_page(folio, (addr - start) >> PAGE_SHIFT);
I still don't really follow the logic in this else if block. Isn't it possible
that finish_fault() gets called with a page from a folio that isn't aligned with
vmf->address?
For example, let's say we have a file whose size is 64K and which is cached in a
single large folio in the page cache. But the file is mapped into a process at
VA 16K to 80K. Let's say we fault on the first page (VA=16K). You will calculate
start=0 and end=64K I think?
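(For reference, a toy userspace reproduction of the hunk's arithmetic for
that scenario; the numbers come from the example above, everything else is
hypothetical:)

#include <stdio.h>

#define PAGE_SIZE	4096UL
#define ALIGN_DOWN(x, a)	((x) & ~((a) - 1))

/* A 64K folio (16 pages) mapped at VA 16K..80K, faulting at VA 16K. */
int main(void)
{
	unsigned long vm_start = 16 * 1024, vm_end = 80 * 1024;
	unsigned long address = 16 * 1024;	/* vmf->address     */
	unsigned long nr_pages = 16;		/* folio_nr_pages() */
	unsigned long start = ALIGN_DOWN(address, nr_pages * PAGE_SIZE);
	unsigned long end = start + nr_pages * PAGE_SIZE;
	unsigned long addr = start > vm_start ? start : vm_start;
	unsigned long clamped_end = end < vm_end ? end : vm_end;

	printf("start=%#lx end=%#lx addr=%#lx nr_pages=%lu\n",
	       start, end, addr, (clamped_end - addr) / PAGE_SIZE);
	return 0;
}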
Additionally, I think this path will end up mapping the entire folio (as long as
it fits in the VMA). But this bypasses the fault-around configuration. As I
think I mentioned against the RFC, this will inflate the RSS of the process and
can cause behavioural changes as a result. I believe the current advice is to
disable fault-around to prevent this kind of bloat when needed.
It might be that you need a special variant of finish_fault() for shmem?
> + }
> vmf->pte = pte_offset_map_lock(vma->vm_mm, vmf->pmd,
> - vmf->address, &vmf->ptl);
> + addr, &vmf->ptl);
> if (!vmf->pte)
> return VM_FAULT_NOPAGE;
>
> /* Re-check under ptl */
> - if (likely(!vmf_pte_changed(vmf))) {
> - struct folio *folio = page_folio(page);
> - int type = is_cow ? MM_ANONPAGES : mm_counter_file(folio);
> -
> - set_pte_range(vmf, folio, page, 1, vmf->address);
> - add_mm_counter(vma->vm_mm, type, 1);
> - ret = 0;
> - } else {
> - update_mmu_tlb(vma, vmf->address, vmf->pte);
> + if (nr_pages == 1 && unlikely(vmf_pte_changed(vmf))) {
> + update_mmu_tlb(vma, addr, vmf->pte);
> + ret = VM_FAULT_NOPAGE;
> + goto unlock;
> + } else if (nr_pages > 1 && !pte_range_none(vmf->pte, nr_pages)) {
> + for (i = 0; i < nr_pages; i++)
> + update_mmu_tlb(vma, addr + PAGE_SIZE * i, vmf->pte + i);
> ret = VM_FAULT_NOPAGE;
> + goto unlock;
> }
>
> + set_pte_range(vmf, folio, page, nr_pages, addr);
> + type = is_cow ? MM_ANONPAGES : mm_counter_file(folio);
> + add_mm_counter(vma->vm_mm, type, nr_pages);
> + ret = 0;
> +
> +unlock:
> pte_unmap_unlock(vmf->pte, vmf->ptl);
> return ret;
> }
Hi Baolin,
kernel test robot noticed the following build warnings:
[auto build test WARNING on akpm-mm/mm-everything]
[also build test WARNING on next-20240506]
[cannot apply to linus/master v6.9-rc7]
[If your patch is applied to the wrong git tree, kindly drop us a note.
And when submitting patch, we suggest to use '--base' as documented in
https://git-scm.com/docs/git-format-patch#_base_tree_information]
url: https://github.com/intel-lab-lkp/linux/commits/Baolin-Wang/mm-move-highest_order-and-next_order-out-of-the-THP-config/20240506-164838
base: https://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm.git mm-everything
patch link: https://lore.kernel.org/r/adc64bf0f150bdc614c6c06fc313adeef7dbbbff.1714978902.git.baolin.wang%40linux.alibaba.com
patch subject: [PATCH 6/8] mm: shmem: add mTHP support for anonymous shmem
config: s390-allnoconfig (https://download.01.org/0day-ci/archive/20240507/[email protected]/config)
compiler: clang version 19.0.0git (https://github.com/llvm/llvm-project 0ab4458df0688955620b72cc2c72a32dffad3615)
reproduce (this is a W=1 build): (https://download.01.org/0day-ci/archive/20240507/[email protected]/reproduce)
If you fix the issue in a separate patch/commit (i.e. not just a new version of
the same patch/commit), kindly add following tags
| Reported-by: kernel test robot <[email protected]>
| Closes: https://lore.kernel.org/oe-kbuild-all/[email protected]/
All warnings (new ones prefixed by >>):
In file included from mm/shmem.c:28:
In file included from include/linux/ramfs.h:5:
In file included from include/linux/fs_parser.h:11:
In file included from include/linux/fs_context.h:14:
In file included from include/linux/security.h:33:
In file included from include/linux/mm.h:2253:
include/linux/vmstat.h:514:36: warning: arithmetic between different enumeration types ('enum node_stat_item' and 'enum lru_list') [-Wenum-enum-conversion]
514 | return node_stat_name(NR_LRU_BASE + lru) + 3; // skip "nr_"
| ~~~~~~~~~~~ ^ ~~~
>> mm/shmem.c:1780:10: warning: variable 'folio' is used uninitialized whenever 'while' loop exits because its condition is false [-Wsometimes-uninitialized]
1780 | while (suitable_orders) {
| ^~~~~~~~~~~~~~~
mm/shmem.c:1795:7: note: uninitialized use occurs here
1795 | if (!folio)
| ^~~~~
mm/shmem.c:1780:10: note: remove the condition if it is always true
1780 | while (suitable_orders) {
| ^~~~~~~~~~~~~~~
| 1
mm/shmem.c:1750:21: note: initialize the variable 'folio' to silence this warning
1750 | struct folio *folio;
| ^
| = NULL
mm/shmem.c:1564:20: warning: unused function 'shmem_show_mpol' [-Wunused-function]
1564 | static inline void shmem_show_mpol(struct seq_file *seq, struct mempolicy *mpol)
| ^~~~~~~~~~~~~~~
3 warnings generated.
vim +1780 mm/shmem.c
1741
1742 static struct folio *shmem_alloc_and_add_folio(struct vm_fault *vmf,
1743 gfp_t gfp, struct inode *inode, pgoff_t index,
1744 struct mm_struct *fault_mm, bool huge, unsigned long orders)
1745 {
1746 struct address_space *mapping = inode->i_mapping;
1747 struct shmem_inode_info *info = SHMEM_I(inode);
1748 struct vm_area_struct *vma = vmf ? vmf->vma : NULL;
1749 unsigned long suitable_orders;
1750 struct folio *folio;
1751 long pages;
1752 int error, order;
1753
1754 if (!IS_ENABLED(CONFIG_TRANSPARENT_HUGEPAGE))
1755 huge = false;
1756
1757 if (huge || orders > 0) {
1758 if (vma && vma_is_anon_shmem(vma) && orders) {
1759 suitable_orders = anon_shmem_suitable_orders(inode, vmf,
1760 mapping, index, orders);
1761 } else {
1762 pages = HPAGE_PMD_NR;
1763 suitable_orders = BIT(HPAGE_PMD_ORDER);
1764 index = round_down(index, HPAGE_PMD_NR);
1765
1766 /*
1767 * Check for conflict before waiting on a huge allocation.
1768 * Conflict might be that a huge page has just been allocated
1769 * and added to page cache by a racing thread, or that there
1770 * is already at least one small page in the huge extent.
1771 * Be careful to retry when appropriate, but not forever!
1772 * Elsewhere -EEXIST would be the right code, but not here.
1773 */
1774 if (xa_find(&mapping->i_pages, &index,
1775 index + HPAGE_PMD_NR - 1, XA_PRESENT))
1776 return ERR_PTR(-E2BIG);
1777 }
1778
1779 order = highest_order(suitable_orders);
> 1780 while (suitable_orders) {
1781 pages = 1 << order;
1782 index = round_down(index, pages);
1783 folio = shmem_alloc_hugefolio(gfp, info, index, order);
1784 if (folio)
1785 goto allocated;
1786
1787 if (pages == HPAGE_PMD_NR)
1788 count_vm_event(THP_FILE_FALLBACK);
1789 order = next_order(&suitable_orders, order);
1790 }
1791 } else {
1792 pages = 1;
1793 folio = shmem_alloc_folio(gfp, info, index);
1794 }
1795 if (!folio)
1796 return ERR_PTR(-ENOMEM);
1797
1798 allocated:
1799 __folio_set_locked(folio);
1800 __folio_set_swapbacked(folio);
1801
1802 gfp &= GFP_RECLAIM_MASK;
1803 error = mem_cgroup_charge(folio, fault_mm, gfp);
1804 if (error) {
1805 if (xa_find(&mapping->i_pages, &index,
1806 index + pages - 1, XA_PRESENT)) {
1807 error = -EEXIST;
1808 } else if (pages == HPAGE_PMD_NR) {
1809 count_vm_event(THP_FILE_FALLBACK);
1810 count_vm_event(THP_FILE_FALLBACK_CHARGE);
1811 }
1812 goto unlock;
1813 }
1814
1815 error = shmem_add_to_page_cache(folio, mapping, index, NULL, gfp);
1816 if (error)
1817 goto unlock;
1818
1819 error = shmem_inode_acct_blocks(inode, pages);
1820 if (error) {
1821 struct shmem_sb_info *sbinfo = SHMEM_SB(inode->i_sb);
1822 long freed;
1823 /*
1824 * Try to reclaim some space by splitting a few
1825 * large folios beyond i_size on the filesystem.
1826 */
1827 shmem_unused_huge_shrink(sbinfo, NULL, 2);
1828 /*
1829 * And do a shmem_recalc_inode() to account for freed pages:
1830 * except our folio is there in cache, so not quite balanced.
1831 */
1832 spin_lock(&info->lock);
1833 freed = pages + info->alloced - info->swapped -
1834 READ_ONCE(mapping->nrpages);
1835 if (freed > 0)
1836 info->alloced -= freed;
1837 spin_unlock(&info->lock);
1838 if (freed > 0)
1839 shmem_inode_unacct_blocks(inode, freed);
1840 error = shmem_inode_acct_blocks(inode, pages);
1841 if (error) {
1842 filemap_remove_folio(folio);
1843 goto unlock;
1844 }
1845 }
1846
1847 shmem_recalc_inode(inode, pages, 0);
1848 folio_add_lru(folio);
1849 return folio;
1850
1851 unlock:
1852 folio_unlock(folio);
1853 folio_put(folio);
1854 return ERR_PTR(error);
1855 }
1856
--
0-DAY CI Kernel Test Service
https://github.com/intel/lkp-tests/wiki
On 06/05/2024 09:46, Baolin Wang wrote:
> To support the use of mTHP with anonymous shmem, add a new sysfs interface
> 'shmem_enabled' in the '/sys/kernel/mm/transparent_hugepage/hugepages-kB/'
> directory for each mTHP to control whether shmem is enabled for that mTHP,
> with a value similar to the top level 'shmem_enabled', which can be set to:
> "always", "inherit (to inherit the top level setting)", "within_size", "advise",
> "never", "deny", "force". These values follow the same semantics as the top
> level, except the 'deny' is equivalent to 'never', and 'force' is equivalent
> to 'always' to keep compatibility.
We decided at [1] to not allow 'force' for non-PMD-sizes.
[1]
https://lore.kernel.org/linux-mm/[email protected]/
However, thinking about this a bit more, I wonder if the decision we made to
allow all hugepages-xxkB/enabled controls to take "inherit" was the wrong one.
Perhaps we should have only allowed the PMD-sized enable=inherit (this is just
for legacy back compat after all, I don't think there is any use case where
changing multiple mTHP size controls atomically is actually useful). Applying
that pattern here, it means the top level can always take "force" without any
weird error checking. And we would allow "force" on the PMD-sized control but
not on the others - again this is easy to error check.
Does this pattern make more sense? If so, is it too late to change
hugepages-xxkB/enabled interface?
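(A hypothetical sketch of the validation rule this would imply, not code from
the series, just to make the shape concrete:)

#include <errno.h>
#include <string.h>

/* "inherit" and "force" only make sense for the PMD-sized control, so
 * every other size can simply reject them. */
static int validate_per_size_value(int order, int pmd_order, const char *val)
{
	if (!strcmp(val, "inherit") || !strcmp(val, "force"))
		return order == pmd_order ? 0 : -EINVAL;
	/* always/within_size/advise/never are acceptable for any size */
	return 0;
}

int main(void)
{
	return validate_per_size_value(4, 9, "force") == -EINVAL ? 0 : 1;
}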
>
> By default, PMD-sized hugepages have enabled="inherit" and all other hugepage
> sizes have enabled="never" for '/sys/kernel/mm/transparent_hugepage/hugepages-xxkB/shmem_enabled'.
>
> In addition, if top level value is 'force', then only PMD-sized hugepages
> have enabled="inherit", otherwise configuration will be failed and vice versa.
> That means now we will avoid using non-PMD sized THP to override the global
> huge allocation.
>
> Signed-off-by: Baolin Wang <[email protected]>
> ---
> Documentation/admin-guide/mm/transhuge.rst | 29 +++++++
> include/linux/huge_mm.h | 10 +++
> mm/huge_memory.c | 11 +--
> mm/shmem.c | 96 ++++++++++++++++++++++
> 4 files changed, 138 insertions(+), 8 deletions(-)
>
> diff --git a/Documentation/admin-guide/mm/transhuge.rst b/Documentation/admin-guide/mm/transhuge.rst
> index 076443cc10a6..a28496e15bdb 100644
> --- a/Documentation/admin-guide/mm/transhuge.rst
> +++ b/Documentation/admin-guide/mm/transhuge.rst
> @@ -332,6 +332,35 @@ deny
> force
> Force the huge option on for all - very useful for testing;
>
> +Anonymous shmem can also use "multi-size THP" (mTHP) by adding a new sysfs knob
> +to control mTHP allocation: /sys/kernel/mm/transparent_hugepage/hugepages-<size>kB/shmem_enabled.
> +Its value for each mTHP is essentially consistent with the global setting, except
> +for the addition of 'inherit' to ensure compatibility with the global settings.
> +always
> + Attempt to allocate <size> huge pages every time we need a new page;
> +
> +inherit
> + Inherit the top-level "shmem_enabled" value. By default, PMD-sized hugepages
> + have enabled="inherit" and all other hugepage sizes have enabled="never";
> +
> +never
> + Do not allocate <size> huge pages;
> +
> +within_size
> + Only allocate <size> huge page if it will be fully within i_size.
> + Also respect fadvise()/madvise() hints;
> +
> +advise
> + Only allocate <size> huge pages if requested with fadvise()/madvise();
> +
> +deny
> + Has the same semantics as 'never'; now the mTHP allocation policy is only
> + used for anonymous shmem and will not override tmpfs.
> +
> +force
> + Has the same semantics as 'always'; now the mTHP allocation policy is only
> + used for anonymous shmem and will not override tmpfs.
> +
> Need of application restart
> ===========================
>
> diff --git a/include/linux/huge_mm.h b/include/linux/huge_mm.h
> index e49b56c40a11..dbd6b3f56210 100644
> --- a/include/linux/huge_mm.h
> +++ b/include/linux/huge_mm.h
> @@ -6,6 +6,7 @@
> #include <linux/mm_types.h>
>
> #include <linux/fs.h> /* only for vma_is_dax() */
> +#include <linux/kobject.h>
>
> vm_fault_t do_huge_pmd_anonymous_page(struct vm_fault *vmf);
> int copy_huge_pmd(struct mm_struct *dst_mm, struct mm_struct *src_mm,
> @@ -63,6 +64,7 @@ ssize_t single_hugepage_flag_show(struct kobject *kobj,
> struct kobj_attribute *attr, char *buf,
> enum transparent_hugepage_flag flag);
> extern struct kobj_attribute shmem_enabled_attr;
> +extern struct kobj_attribute thpsize_shmem_enabled_attr;
>
> /*
> * Mask of all large folio orders supported for anonymous THP; all orders up to
> @@ -265,6 +267,14 @@ unsigned long thp_vma_allowable_orders(struct vm_area_struct *vma,
> return __thp_vma_allowable_orders(vma, vm_flags, tva_flags, orders);
> }
>
> +struct thpsize {
> + struct kobject kobj;
> + struct list_head node;
> + int order;
> +};
> +
> +#define to_thpsize(kobj) container_of(kobj, struct thpsize, kobj)
> +
> enum mthp_stat_item {
> MTHP_STAT_ANON_FAULT_ALLOC,
> MTHP_STAT_ANON_FAULT_FALLBACK,
> diff --git a/mm/huge_memory.c b/mm/huge_memory.c
> index 9efb6fefc391..d3080a8843f2 100644
> --- a/mm/huge_memory.c
> +++ b/mm/huge_memory.c
> @@ -449,14 +449,6 @@ static void thpsize_release(struct kobject *kobj);
> static DEFINE_SPINLOCK(huge_anon_orders_lock);
> static LIST_HEAD(thpsize_list);
>
> -struct thpsize {
> - struct kobject kobj;
> - struct list_head node;
> - int order;
> -};
> -
> -#define to_thpsize(kobj) container_of(kobj, struct thpsize, kobj)
> -
> static ssize_t thpsize_enabled_show(struct kobject *kobj,
> struct kobj_attribute *attr, char *buf)
> {
> @@ -517,6 +509,9 @@ static struct kobj_attribute thpsize_enabled_attr =
>
> static struct attribute *thpsize_attrs[] = {
> &thpsize_enabled_attr.attr,
> +#ifdef CONFIG_SHMEM
> + &thpsize_shmem_enabled_attr.attr,
> +#endif
> NULL,
> };
>
> diff --git a/mm/shmem.c b/mm/shmem.c
> index a383ea9a89a5..59cc26d44344 100644
> --- a/mm/shmem.c
> +++ b/mm/shmem.c
> @@ -131,6 +131,14 @@ struct shmem_options {
> #define SHMEM_SEEN_QUOTA 32
> };
>
> +#ifdef CONFIG_TRANSPARENT_HUGEPAGE
> +static unsigned long huge_anon_shmem_orders_always __read_mostly;
> +static unsigned long huge_anon_shmem_orders_madvise __read_mostly;
> +static unsigned long huge_anon_shmem_orders_inherit __read_mostly;
> +static unsigned long huge_anon_shmem_orders_within_size __read_mostly;
> +static DEFINE_SPINLOCK(huge_anon_shmem_orders_lock);
> +#endif
> +
> #ifdef CONFIG_TMPFS
> static unsigned long shmem_default_max_blocks(void)
> {
> @@ -4687,6 +4695,12 @@ void __init shmem_init(void)
> SHMEM_SB(shm_mnt->mnt_sb)->huge = shmem_huge;
> else
> shmem_huge = SHMEM_HUGE_NEVER; /* just in case it was patched */
> +
> + /*
> + * Default to setting PMD-sized THP to inherit the global setting and
> + * disable all other multi-size THPs, when anonymous shmem uses mTHP.
> + */
> + huge_anon_shmem_orders_inherit = BIT(HPAGE_PMD_ORDER);
> #endif
> return;
>
> @@ -4746,6 +4760,11 @@ static ssize_t shmem_enabled_store(struct kobject *kobj,
> huge != SHMEM_HUGE_NEVER && huge != SHMEM_HUGE_DENY)
> return -EINVAL;
>
> + /* Do not override huge allocation policy with non-PMD sized mTHP */
> + if (huge == SHMEM_HUGE_FORCE &&
> + huge_anon_shmem_orders_inherit != BIT(HPAGE_PMD_ORDER))
> + return -EINVAL;
> +
> shmem_huge = huge;
> if (shmem_huge > SHMEM_HUGE_DENY)
> SHMEM_SB(shm_mnt->mnt_sb)->huge = shmem_huge;
> @@ -4753,6 +4772,83 @@ static ssize_t shmem_enabled_store(struct kobject *kobj,
> }
>
> struct kobj_attribute shmem_enabled_attr = __ATTR_RW(shmem_enabled);
> +
> +static ssize_t thpsize_shmem_enabled_show(struct kobject *kobj,
> + struct kobj_attribute *attr, char *buf)
> +{
> + int order = to_thpsize(kobj)->order;
> + const char *output;
> +
> + if (test_bit(order, &huge_anon_shmem_orders_always))
> + output = "[always] inherit within_size advise never deny [force]";
> + else if (test_bit(order, &huge_anon_shmem_orders_inherit))
> + output = "always [inherit] within_size advise never deny force";
> + else if (test_bit(order, &huge_anon_shmem_orders_within_size))
> + output = "always inherit [within_size] advise never deny force";
> + else if (test_bit(order, &huge_anon_shmem_orders_madvise))
> + output = "always inherit within_size [advise] never deny force";
> + else
> + output = "always inherit within_size advise [never] [deny] force";
> +
> + return sysfs_emit(buf, "%s\n", output);
> +}
> +
> +static ssize_t thpsize_shmem_enabled_store(struct kobject *kobj,
> + struct kobj_attribute *attr,
> + const char *buf, size_t count)
> +{
> + int order = to_thpsize(kobj)->order;
> + ssize_t ret = count;
> +
> + if (sysfs_streq(buf, "always") || sysfs_streq(buf, "force")) {
> + spin_lock(&huge_anon_shmem_orders_lock);
> + clear_bit(order, &huge_anon_shmem_orders_inherit);
> + clear_bit(order, &huge_anon_shmem_orders_madvise);
> + clear_bit(order, &huge_anon_shmem_orders_within_size);
> + set_bit(order, &huge_anon_shmem_orders_always);
> + spin_unlock(&huge_anon_shmem_orders_lock);
> + } else if (sysfs_streq(buf, "inherit")) {
> + /* Do not override huge allocation policy with non-PMD sized mTHP */
> + if (shmem_huge == SHMEM_HUGE_FORCE &&
> + order != HPAGE_PMD_ORDER)
> + return -EINVAL;
> +
> + spin_lock(&huge_anon_shmem_orders_lock);
> + clear_bit(order, &huge_anon_shmem_orders_always);
> + clear_bit(order, &huge_anon_shmem_orders_madvise);
> + clear_bit(order, &huge_anon_shmem_orders_within_size);
> + set_bit(order, &huge_anon_shmem_orders_inherit);
> + spin_unlock(&huge_anon_shmem_orders_lock);
> + } else if (sysfs_streq(buf, "within_size")) {
> + spin_lock(&huge_anon_shmem_orders_lock);
> + clear_bit(order, &huge_anon_shmem_orders_always);
> + clear_bit(order, &huge_anon_shmem_orders_inherit);
> + clear_bit(order, &huge_anon_shmem_orders_madvise);
> + set_bit(order, &huge_anon_shmem_orders_within_size);
> + spin_unlock(&huge_anon_shmem_orders_lock);
> + } else if (sysfs_streq(buf, "madvise")) {
> + spin_lock(&huge_anon_shmem_orders_lock);
> + clear_bit(order, &huge_anon_shmem_orders_always);
> + clear_bit(order, &huge_anon_shmem_orders_inherit);
> + clear_bit(order, &huge_anon_shmem_orders_within_size);
> + set_bit(order, &huge_anon_shmem_orders_madvise);
> + spin_unlock(&huge_anon_shmem_orders_lock);
> + } else if (sysfs_streq(buf, "never") || sysfs_streq(buf, "deny")) {
> + spin_lock(&huge_anon_shmem_orders_lock);
> + clear_bit(order, &huge_anon_shmem_orders_always);
> + clear_bit(order, &huge_anon_shmem_orders_inherit);
> + clear_bit(order, &huge_anon_shmem_orders_within_size);
> + clear_bit(order, &huge_anon_shmem_orders_madvise);
> + spin_unlock(&huge_anon_shmem_orders_lock);
> + } else {
> + ret = -EINVAL;
> + }
> +
> + return ret;
> +}
> +
> +struct kobj_attribute thpsize_shmem_enabled_attr =
> + __ATTR(shmem_enabled, 0644, thpsize_shmem_enabled_show, thpsize_shmem_enabled_store);
> #endif /* CONFIG_TRANSPARENT_HUGEPAGE && CONFIG_SYSFS */
>
> #else /* !CONFIG_SHMEM */
On 2024/5/7 18:37, Ryan Roberts wrote:
> On 06/05/2024 09:46, Baolin Wang wrote:
>> Add large folio mapping establishment support for finish_fault() as a preparation,
>> to support multi-size THP allocation of anonymous shmem pages in the following
>> patches.
>>
>> Signed-off-by: Baolin Wang <[email protected]>
>> ---
>> mm/memory.c | 43 +++++++++++++++++++++++++++++++++----------
>> 1 file changed, 33 insertions(+), 10 deletions(-)
>>
>> diff --git a/mm/memory.c b/mm/memory.c
>> index eea6e4984eae..936377220b77 100644
>> --- a/mm/memory.c
>> +++ b/mm/memory.c
>> @@ -4747,9 +4747,12 @@ vm_fault_t finish_fault(struct vm_fault *vmf)
>> {
>> struct vm_area_struct *vma = vmf->vma;
>> struct page *page;
>> + struct folio *folio;
>> vm_fault_t ret;
>> bool is_cow = (vmf->flags & FAULT_FLAG_WRITE) &&
>> !(vma->vm_flags & VM_SHARED);
>> + int type, nr_pages, i;
>> + unsigned long addr = vmf->address;
>>
>> /* Did we COW the page? */
>> if (is_cow)
>> @@ -4780,24 +4783,44 @@ vm_fault_t finish_fault(struct vm_fault *vmf)
>> return VM_FAULT_OOM;
>> }
>>
>> + folio = page_folio(page);
>> + nr_pages = folio_nr_pages(folio);
>> +
>> + if (unlikely(userfaultfd_armed(vma))) {
>> + nr_pages = 1;
>> + } else if (nr_pages > 1) {
>> + unsigned long start = ALIGN_DOWN(vmf->address, nr_pages * PAGE_SIZE);
>> + unsigned long end = start + nr_pages * PAGE_SIZE;
>> +
>> + /* In case the folio size in page cache beyond the VMA limits. */
>> + addr = max(start, vma->vm_start);
>> + nr_pages = (min(end, vma->vm_end) - addr) >> PAGE_SHIFT;
>> +
>> + page = folio_page(folio, (addr - start) >> PAGE_SHIFT);
>
> I still don't really follow the logic in this else if block. Isn't it possible
> that finish_fault() gets called with a page from a folio that isn't aligned with
> vmf->address?
>
> For example, let's say we have a file who's size is 64K and which is cached in a
> single large folio in the page cache. But the file is mapped into a process at
> VA 16K to 80K. Let's say we fault on the first page (VA=16K). You will calculate
For shmem, this doesn't happen because the VA is aligned with the
hugepage size in the shmem_get_unmapped_area() function. See patch 7.
> start=0 and end=64K I think?
Yes. Unfortunately, some file systems that support large mappings do not
perform alignment for multi-size THP (non-PMD sized, for example: 64K).
I think this requires modifying
__get_unmapped_area--->thp_get_unmapped_area_vmflags() or
file->f_op->get_unmapped_area() to align the VA for multi-size THP in the future.
So before adding those VA alignment changes, only allow building the
large folio mapping for anonymous shmem:
diff --git a/mm/memory.c b/mm/memory.c
index 936377220b77..9e4d51826d23 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -4786,7 +4786,7 @@ vm_fault_t finish_fault(struct vm_fault *vmf)
folio = page_folio(page);
nr_pages = folio_nr_pages(folio);
- if (unlikely(userfaultfd_armed(vma))) {
+ if (unlikely(userfaultfd_armed(vma)) || !vma_is_anon_shmem(vma)) {
nr_pages = 1;
} else if (nr_pages > 1) {
                unsigned long start = ALIGN_DOWN(vmf->address, nr_pages * PAGE_SIZE);
> Additionally, I think this path will end up mapping the entire folio (as long as
> it fits in the VMA). But this bypasses the fault-around configuration. As I
> think I mentioned against the RFC, this will inflate the RSS of the process and
> can cause behavioural changes as a result. I believe the current advice is to
> disable fault-around to prevent this kind of bloat when needed.
With the above change, I do not think this is a problem, since users already
want to use mTHP for anonymous shmem.
> It might be that you need a special variant of finish_fault() for shmem?
>
>
>> + }
>> vmf->pte = pte_offset_map_lock(vma->vm_mm, vmf->pmd,
>> - vmf->address, &vmf->ptl);
>> + addr, &vmf->ptl);
>> if (!vmf->pte)
>> return VM_FAULT_NOPAGE;
>>
>> /* Re-check under ptl */
>> - if (likely(!vmf_pte_changed(vmf))) {
>> - struct folio *folio = page_folio(page);
>> - int type = is_cow ? MM_ANONPAGES : mm_counter_file(folio);
>> -
>> - set_pte_range(vmf, folio, page, 1, vmf->address);
>> - add_mm_counter(vma->vm_mm, type, 1);
>> - ret = 0;
>> - } else {
>> - update_mmu_tlb(vma, vmf->address, vmf->pte);
>> + if (nr_pages == 1 && unlikely(vmf_pte_changed(vmf))) {
>> + update_mmu_tlb(vma, addr, vmf->pte);
>> + ret = VM_FAULT_NOPAGE;
>> + goto unlock;
>> + } else if (nr_pages > 1 && !pte_range_none(vmf->pte, nr_pages)) {
>> + for (i = 0; i < nr_pages; i++)
>> + update_mmu_tlb(vma, addr + PAGE_SIZE * i, vmf->pte + i);
>> ret = VM_FAULT_NOPAGE;
>> + goto unlock;
>> }
>>
>> + set_pte_range(vmf, folio, page, nr_pages, addr);
>> + type = is_cow ? MM_ANONPAGES : mm_counter_file(folio);
>> + add_mm_counter(vma->vm_mm, type, nr_pages);
>> + ret = 0;
>> +
>> +unlock:
>> pte_unmap_unlock(vmf->pte, vmf->ptl);
>> return ret;
>> }
On 2024/5/7 18:52, Ryan Roberts wrote:
> On 06/05/2024 09:46, Baolin Wang wrote:
>> To support the use of mTHP with anonymous shmem, add a new sysfs interface
>> 'shmem_enabled' in the '/sys/kernel/mm/transparent_hugepage/hugepages-kB/'
>> directory for each mTHP to control whether shmem is enabled for that mTHP,
>> with a value similar to the top level 'shmem_enabled', which can be set to:
>> "always", "inherit (to inherit the top level setting)", "within_size", "advise",
>> "never", "deny", "force". These values follow the same semantics as the top
>> level, except the 'deny' is equivalent to 'never', and 'force' is equivalent
>> to 'always' to keep compatibility.
>
> We decided at [1] to not allow 'force' for non-PMD-sizes.
>
> [1]
> https://lore.kernel.org/linux-mm/[email protected]/
>
> However, thinking about this a bit more, I wonder if the decision we made to
> allow all hugepages-xxkB/enabled controls to take "inherit" was the wrong one.
> Perhaps we should have only allowed the PMD-sized enable=inherit (this is just
> for legacy back compat after all, I don't think there is any use case where
> changing multiple mTHP size controls atomically is actually useful). Applying
Agree. This is also our usage of 'inherit'.
> that pattern here, it means the top level can always take "force" without any
> weird error checking. And we would allow "force" on the PMD-sized control but
> not on the others - again this is easy to error check.
>
> Does this pattern make more sense? If so, is it too late to change
> hugepages-xxkB/enabled interface?
IMO, this sounds reasonable to me. Let's see what others think, David?
On 2024/5/7 18:20, Ryan Roberts wrote:
> On 06/05/2024 09:46, Baolin Wang wrote:
>> Anonymous pages have already been supported for multi-size (mTHP) allocation
>> through commit 19eaf44954df, that can allow THP to be configured through the
>> sysfs interface located at '/sys/kernel/mm/transparent_hugepage/hugepage-XXkb/enabled'.
>>
>> However, the anonymous shared pages will ignore the anonymous mTHP rule
>> configured through the sysfs interface, and can only use the PMD-mapped
>> THP, that is not reasonable. Many implement anonymous page sharing through
>> mmap(MAP_SHARED | MAP_ANONYMOUS), especially in database usage scenarios,
>> therefore, users expect to apply an unified mTHP strategy for anonymous pages,
>> also including the anonymous shared pages, in order to enjoy the benefits of
>> mTHP. For example, lower latency than PMD-mapped THP, smaller memory bloat
>> than PMD-mapped THP, contiguous PTEs on ARM architecture to reduce TLB miss etc.
>>
>> The primary strategy is similar to supporting anonymous mTHP. Introduce
>> a new interface '/mm/transparent_hugepage/hugepage-XXkb/shmem_enabled',
>> which can have all the same values as the top-level
>
> Didn't we agree that "force" would not be supported for now, and would return an
> error when attempting to set for a non-PMD-size hugepage-XXkb/shmem_enabled (or
> indirectly through inheritance)?
Yes. Sorry, I did not explain it in detail in the cover letter. Please
see patch 5, which you have already commented on.
Hi,
On 2024/5/7 18:46, kernel test robot wrote:
>>> mm/shmem.c:1780:10: warning: variable 'folio' is used uninitialized whenever 'while' loop exits because its condition is false [-Wsometimes-uninitialized]
> 1780 | while (suitable_orders) {
> | ^~~~~~~~~~~~~~~
> mm/shmem.c:1795:7: note: uninitialized use occurs here
> 1795 | if (!folio)
> | ^~~~~
> mm/shmem.c:1780:10: note: remove the condition if it is always true
> 1780 | while (suitable_orders) {
> | ^~~~~~~~~~~~~~~
> | 1
> mm/shmem.c:1750:21: note: initialize the variable 'folio' to silence this warning
> 1750 | struct folio *folio;
> | ^
> | = NULL
> mm/shmem.c:1564:20: warning: unused function 'shmem_show_mpol' [-Wunused-function]
> 1564 | static inline void shmem_show_mpol(struct seq_file *seq, struct mempolicy *mpol)
Thanks for reporting. I will add the below change to avoid the warning:
diff --git a/mm/shmem.c b/mm/shmem.c
index d603e36e0f4f..fd2cb2e73a21 100644
--- a/mm/shmem.c
+++ b/mm/shmem.c
@@ -1747,7 +1747,7 @@ static struct folio *shmem_alloc_and_add_folio(struct vm_fault *vmf,
struct shmem_inode_info *info = SHMEM_I(inode);
struct vm_area_struct *vma = vmf ? vmf->vma : NULL;
unsigned long suitable_orders;
- struct folio *folio;
+ struct folio *folio = NULL;
long pages;
int error, order;
On 08.05.24 06:45, Baolin Wang wrote:
>
>
> On 2024/5/7 18:52, Ryan Roberts wrote:
>> On 06/05/2024 09:46, Baolin Wang wrote:
>>> To support the use of mTHP with anonymous shmem, add a new sysfs interface
>>> 'shmem_enabled' in the '/sys/kernel/mm/transparent_hugepage/hugepages-kB/'
>>> directory for each mTHP to control whether shmem is enabled for that mTHP,
>>> with a value similar to the top level 'shmem_enabled', which can be set to:
>>> "always", "inherit (to inherit the top level setting)", "within_size", "advise",
>>> "never", "deny", "force". These values follow the same semantics as the top
>>> level, except the 'deny' is equivalent to 'never', and 'force' is equivalent
>>> to 'always' to keep compatibility.
>>
>> We decided at [1] to not allow 'force' for non-PMD-sizes.
>>
>> [1]
>> https://lore.kernel.org/linux-mm/[email protected]/
>>
>> However, thinking about this a bit more, I wonder if the decision we made to
>> allow all hugepages-xxkB/enabled controls to take "inherit" was the wrong one.
>> Perhaps we should have only allowed the PMD-sized enable=inherit (this is just
>> for legacy back compat after all, I don't think there is any use case where
>> changing multiple mTHP size controls atomically is actually useful). Applying
>
> Agree. This is also our usage of 'inherit'.
>
>> that pattern here, it means the top level can always take "force" without any
>> weird error checking. And we would allow "force" on the PMD-sized control but
>> not on the others - again this is easy to error check.
>>
>> Does this pattern make more sense? If so, is it too late to change
>> hugepages-xxkB/enabled interface?
>
> IMO, this sounds reasonable to me. Let's see what others think, David?
Likely too late and we should try not to diverge too much from "enabled"
for "shmem_enabled".
Having that said, to me it's much cleaner to just have "inherit" for all
sizes and disallow invalid configurations as discussed.
Error checking cannot be too hard unless I am missing something important :)
--
Cheers,
David / dhildenb
On 08.05.24 09:08, David Hildenbrand wrote:
> On 08.05.24 06:45, Baolin Wang wrote:
>>
>>
>> On 2024/5/7 18:52, Ryan Roberts wrote:
>>> On 06/05/2024 09:46, Baolin Wang wrote:
>>>> To support the use of mTHP with anonymous shmem, add a new sysfs interface
>>>> 'shmem_enabled' in the '/sys/kernel/mm/transparent_hugepage/hugepages-kB/'
>>>> directory for each mTHP to control whether shmem is enabled for that mTHP,
>>>> with a value similar to the top level 'shmem_enabled', which can be set to:
>>>> "always", "inherit (to inherit the top level setting)", "within_size", "advise",
>>>> "never", "deny", "force". These values follow the same semantics as the top
>>>> level, except the 'deny' is equivalent to 'never', and 'force' is equivalent
>>>> to 'always' to keep compatibility.
>>>
>>> We decided at [1] to not allow 'force' for non-PMD-sizes.
>>>
>>> [1]
>>> https://lore.kernel.org/linux-mm/[email protected]/
>>>
>>> However, thinking about this a bit more, I wonder if the decision we made to
>>> allow all hugepages-xxkB/enabled controls to take "inherit" was the wrong one.
>>> Perhaps we should have only allowed the PMD-sized enable=inherit (this is just
>>> for legacy back compat after all, I don't think there is any use case where
>>> changing multiple mTHP size controls atomically is actually useful). Applying
>>
>> Agree. This is also our usage of 'inherit'.
Missed that one: there might be use cases in the future once we would
start defaulting to "inherit" for all knobs (a distro might default to
that) and default-enable THP in the global knob. Then, it would be easy
to disable any THP by disabling the global knob. (I think that's the
future we're heading to, where we'd have an "auto" mode that can be set
on the global toggle).
But I am just making up use cases ;) I think it will be valuable and
just doing it consistently now might be cleaner.
--
Cheers,
David / dhildenb
On 08.05.24 05:44, Baolin Wang wrote:
>
>
> On 2024/5/7 18:37, Ryan Roberts wrote:
>> On 06/05/2024 09:46, Baolin Wang wrote:
>>> Add large folio mapping establishment support for finish_fault() as a preparation,
>>> to support multi-size THP allocation of anonymous shmem pages in the following
>>> patches.
>>>
>>> Signed-off-by: Baolin Wang <[email protected]>
>>> ---
>>> mm/memory.c | 43 +++++++++++++++++++++++++++++++++----------
>>> 1 file changed, 33 insertions(+), 10 deletions(-)
>>>
>>> diff --git a/mm/memory.c b/mm/memory.c
>>> index eea6e4984eae..936377220b77 100644
>>> --- a/mm/memory.c
>>> +++ b/mm/memory.c
>>> @@ -4747,9 +4747,12 @@ vm_fault_t finish_fault(struct vm_fault *vmf)
>>> {
>>> struct vm_area_struct *vma = vmf->vma;
>>> struct page *page;
>>> + struct folio *folio;
>>> vm_fault_t ret;
>>> bool is_cow = (vmf->flags & FAULT_FLAG_WRITE) &&
>>> !(vma->vm_flags & VM_SHARED);
>>> + int type, nr_pages, i;
>>> + unsigned long addr = vmf->address;
>>>
>>> /* Did we COW the page? */
>>> if (is_cow)
>>> @@ -4780,24 +4783,44 @@ vm_fault_t finish_fault(struct vm_fault *vmf)
>>> return VM_FAULT_OOM;
>>> }
>>>
>>> + folio = page_folio(page);
>>> + nr_pages = folio_nr_pages(folio);
>>> +
>>> + if (unlikely(userfaultfd_armed(vma))) {
>>> + nr_pages = 1;
>>> + } else if (nr_pages > 1) {
>>> + unsigned long start = ALIGN_DOWN(vmf->address, nr_pages * PAGE_SIZE);
>>> + unsigned long end = start + nr_pages * PAGE_SIZE;
>>> +
>>> + /* In case the folio size in page cache beyond the VMA limits. */
>>> + addr = max(start, vma->vm_start);
>>> + nr_pages = (min(end, vma->vm_end) - addr) >> PAGE_SHIFT;
>>> +
>>> + page = folio_page(folio, (addr - start) >> PAGE_SHIFT);
>>
>> I still don't really follow the logic in this else if block. Isn't it possible
>> that finish_fault() gets called with a page from a folio that isn't aligned with
>> vmf->address?
>>
>> For example, let's say we have a file who's size is 64K and which is cached in a
>> single large folio in the page cache. But the file is mapped into a process at
>> VA 16K to 80K. Let's say we fault on the first page (VA=16K). You will calculate
>
> For shmem, this doesn't happen because the VA is aligned with the
> hugepage size in the shmem_get_unmapped_area() function. See patch 7.
Does that cover mremap() and MAP_FIXED as well?
We should try doing this as cleanly as possible, to prepare for the
future / corner cases.
--
Cheers,
David / dhildenb
On 08/05/2024 04:44, Baolin Wang wrote:
>
>
> On 2024/5/7 18:37, Ryan Roberts wrote:
>> On 06/05/2024 09:46, Baolin Wang wrote:
>>> Add large folio mapping establishment support for finish_fault() as a
>>> preparation,
>>> to support multi-size THP allocation of anonymous shmem pages in the following
>>> patches.
>>>
>>> Signed-off-by: Baolin Wang <[email protected]>
>>> ---
>>> mm/memory.c | 43 +++++++++++++++++++++++++++++++++----------
>>> 1 file changed, 33 insertions(+), 10 deletions(-)
>>>
>>> diff --git a/mm/memory.c b/mm/memory.c
>>> index eea6e4984eae..936377220b77 100644
>>> --- a/mm/memory.c
>>> +++ b/mm/memory.c
>>> @@ -4747,9 +4747,12 @@ vm_fault_t finish_fault(struct vm_fault *vmf)
>>> {
>>> struct vm_area_struct *vma = vmf->vma;
>>> struct page *page;
>>> + struct folio *folio;
>>> vm_fault_t ret;
>>> bool is_cow = (vmf->flags & FAULT_FLAG_WRITE) &&
>>> !(vma->vm_flags & VM_SHARED);
>>> + int type, nr_pages, i;
>>> + unsigned long addr = vmf->address;
>>> /* Did we COW the page? */
>>> if (is_cow)
>>> @@ -4780,24 +4783,44 @@ vm_fault_t finish_fault(struct vm_fault *vmf)
>>> return VM_FAULT_OOM;
>>> }
>>> + folio = page_folio(page);
>>> + nr_pages = folio_nr_pages(folio);
>>> +
>>> + if (unlikely(userfaultfd_armed(vma))) {
>>> + nr_pages = 1;
>>> + } else if (nr_pages > 1) {
>>> + unsigned long start = ALIGN_DOWN(vmf->address, nr_pages * PAGE_SIZE);
>>> + unsigned long end = start + nr_pages * PAGE_SIZE;
>>> +
>>> + /* In case the folio size in page cache beyond the VMA limits. */
>>> + addr = max(start, vma->vm_start);
>>> + nr_pages = (min(end, vma->vm_end) - addr) >> PAGE_SHIFT;
>>> +
>>> + page = folio_page(folio, (addr - start) >> PAGE_SHIFT);
>>
>> I still don't really follow the logic in this else if block. Isn't it possible
>> that finish_fault() gets called with a page from a folio that isn't aligned with
>> vmf->address?
>>
>> For example, let's say we have a file who's size is 64K and which is cached in a
>> single large folio in the page cache. But the file is mapped into a process at
>> VA 16K to 80K. Let's say we fault on the first page (VA=16K). You will calculate
>
> For shmem, this doesn't happen because the VA is aligned with the hugepage size
> in the shmem_get_unmapped_area() function. See patch 7.
Certainly agree that shmem can always make sure that it packs a vma in a way
such that its folios are naturally aligned in VA when faulting in memory. If you
mremap it, that alignment will be lost; I don't think that would be a problem
for a single process; mremap will take care of moving the ptes correctly and
this path is not involved.
But what about the case when a process mmaps a shmem region, then forks, then
the child mremaps the shmem region. Then the parent faults in a THP into the
region (nicely aligned). Then the child faults in the same offset in the region
and gets the THP that the parent allocated; that THP will be aligned in the
parent's VM space but not in the child's.
>
>> start=0 and end=64K I think?
>
> Yes. Unfortunately, some file systems that support large mappings do not perform
> alignment for multi-size THP (non-PMD sized, for example: 64K). I think this
> requires modification to __get_unmapped_area--->thp_get_unmapped_area_vmflags()
> or file->f_op->get_unmapped_area() to align VA for multi-size THP in future.
By nature of the fact that a file mapping is shared between multiple processes
and each process can map it wherever it wants down to 1 page granularity, it's
impossible for any THP containing a part of that file to be VA-aligned in every
process it is mapped in.
>
> So before adding that VA alignment changes, only allow building the large folio
> mapping for anonymous shmem:
>
> diff --git a/mm/memory.c b/mm/memory.c
> index 936377220b77..9e4d51826d23 100644
> --- a/mm/memory.c
> +++ b/mm/memory.c
> @@ -4786,7 +4786,7 @@ vm_fault_t finish_fault(struct vm_fault *vmf)
> folio = page_folio(page);
> nr_pages = folio_nr_pages(folio);
>
> - if (unlikely(userfaultfd_armed(vma))) {
> + if (unlikely(userfaultfd_armed(vma)) || !vma_is_anon_shmem(vma)) {
If the above theoretical flow for fork & mremap is valid, then I don't think
this is sufficient.
> nr_pages = 1;
> } else if (nr_pages > 1) {
> unsigned long start = ALIGN_DOWN(vmf->address, nr_pages *
> PAGE_SIZE);
>
>> Additionally, I think this path will end up mapping the entire folio (as long as
>> it fits in the VMA). But this bypasses the fault-around configuration. As I
>> think I mentioned against the RFC, this will inflate the RSS of the process and
>> can cause behavioural changes as a result. I believe the current advice is to
>> disable fault-around to prevent this kind of bloat when needed.
>
> With above change, I do not think this is a problem? since users already want to
> use mTHP for anonymous shmem.
>
>> It might be that you need a special variant of finish_fault() for shmem?
>>
>>
>>> + }
>>> vmf->pte = pte_offset_map_lock(vma->vm_mm, vmf->pmd,
>>> - vmf->address, &vmf->ptl);
>>> + addr, &vmf->ptl);
>>> if (!vmf->pte)
>>> return VM_FAULT_NOPAGE;
>>> /* Re-check under ptl */
>>> - if (likely(!vmf_pte_changed(vmf))) {
>>> - struct folio *folio = page_folio(page);
>>> - int type = is_cow ? MM_ANONPAGES : mm_counter_file(folio);
>>> -
>>> - set_pte_range(vmf, folio, page, 1, vmf->address);
>>> - add_mm_counter(vma->vm_mm, type, 1);
>>> - ret = 0;
>>> - } else {
>>> - update_mmu_tlb(vma, vmf->address, vmf->pte);
>>> + if (nr_pages == 1 && unlikely(vmf_pte_changed(vmf))) {
>>> + update_mmu_tlb(vma, addr, vmf->pte);
>>> + ret = VM_FAULT_NOPAGE;
>>> + goto unlock;
>>> + } else if (nr_pages > 1 && !pte_range_none(vmf->pte, nr_pages)) {
>>> + for (i = 0; i < nr_pages; i++)
>>> + update_mmu_tlb(vma, addr + PAGE_SIZE * i, vmf->pte + i);
>>> ret = VM_FAULT_NOPAGE;
>>> + goto unlock;
>>> }
>>> + set_pte_range(vmf, folio, page, nr_pages, addr);
>>> + type = is_cow ? MM_ANONPAGES : mm_counter_file(folio);
>>> + add_mm_counter(vma->vm_mm, type, nr_pages);
>>> + ret = 0;
>>> +
>>> +unlock:
>>> pte_unmap_unlock(vmf->pte, vmf->ptl);
>>> return ret;
>>> }
On 08/05/2024 08:12, David Hildenbrand wrote:
> On 08.05.24 09:08, David Hildenbrand wrote:
>> On 08.05.24 06:45, Baolin Wang wrote:
>>>
>>>
>>> On 2024/5/7 18:52, Ryan Roberts wrote:
>>>> On 06/05/2024 09:46, Baolin Wang wrote:
>>>>> To support the use of mTHP with anonymous shmem, add a new sysfs interface
>>>>> 'shmem_enabled' in the '/sys/kernel/mm/transparent_hugepage/hugepages-kB/'
>>>>> directory for each mTHP to control whether shmem is enabled for that mTHP,
>>>>> with a value similar to the top level 'shmem_enabled', which can be set to:
>>>>> "always", "inherit (to inherit the top level setting)", "within_size",
>>>>> "advise",
>>>>> "never", "deny", "force". These values follow the same semantics as the top
>>>>> level, except the 'deny' is equivalent to 'never', and 'force' is equivalent
>>>>> to 'always' to keep compatibility.
>>>>
>>>> We decided at [1] to not allow 'force' for non-PMD-sizes.
>>>>
>>>> [1]
>>>> https://lore.kernel.org/linux-mm/[email protected]/
>>>>
>>>> However, thinking about this a bit more, I wonder if the decision we made to
>>>> allow all hugepages-xxkB/enabled controls to take "inherit" was the wrong one.
>>>> Perhaps we should have only allowed the PMD-sized enable=inherit (this is just
>>>> for legacy back compat after all, I don't think there is any use case where
>>>> changing multiple mTHP size controls atomically is actually useful). Applying
>>>
>>> Agree. This is also our usage of 'inherit'.
>
> Missed that one: there might be use cases in the future once we would start
> defaulting to "inherit" for all knobs (a distro might default to that) and
> default-enable THP in the global knob. Then, it would be easy to disable any THP
> by disabling the global knob. (I think that's the future we're heading to, where
> we'd have an "auto" mode that can be set on the global toggle).
>
> But I am just making up use cases ;) I think it will be valuable and just doing
> it consistently now might be cleaner.
I agree that consistency between enabled and shmem_enabled is top priority. And
yes, I had forgotten about the glorious "auto" future. So probably continuing to
allow all sizes to select "inherit" is best.
But for shmem_enabled, that means we need the following error checking:
- It is an error to set "force" for any size except PMD-size
- It is an error to set "force" for the global control if any size except PMD-
size is set to "inherit"
- It is an error to set "inherit" for any size except PMD-size if the global
control is set to "force".
Certainly not too difficult to code and prove to be correct, but not the nicest
UX from the user's point of view when they start seeing errors.
I think we previously said this would likely be temporary, and if/when tmpfs
gets mTHP support, we could simplify and allow all sizes to be set to "force".
But I wonder if tmpfs would ever need explicit mTHP control? Maybe it would be
more suited to the approach the page cache takes to transparently ramp up the
folio size as it faults more in. (Just saying there is a chance that this error
checking becomes permanent).
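
(Purely for illustration, not code from the patchset: the three checks above
could be centralised roughly like the sketch below. It reuses shmem_huge,
huge_anon_shmem_orders_inherit and SHMEM_HUGE_FORCE from patch 5; the helper
name, the is_top_level/value encoding and SHMEM_HUGE_INHERIT are made up.)

/*
 * Sketch only: the three error checks listed above. SHMEM_HUGE_INHERIT is a
 * hypothetical stand-in for "this per-size control is being set to inherit".
 */
static int shmem_huge_setting_valid(bool is_top_level, int order, int value)
{
	if (is_top_level) {
		/* "force" at the top level only if no non-PMD size is set to "inherit" */
		if (value == SHMEM_HUGE_FORCE &&
		    (huge_anon_shmem_orders_inherit & ~BIT(HPAGE_PMD_ORDER)))
			return -EINVAL;
		return 0;
	}

	/* per-size "force" is only accepted on the PMD-sized control */
	if (value == SHMEM_HUGE_FORCE && order != HPAGE_PMD_ORDER)
		return -EINVAL;

	/* non-PMD sizes may not take "inherit" while the top level is "force" */
	if (value == SHMEM_HUGE_INHERIT && order != HPAGE_PMD_ORDER &&
	    shmem_huge == SHMEM_HUGE_FORCE)
		return -EINVAL;

	return 0;
}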
On 2024/5/8 15:15, David Hildenbrand wrote:
> On 08.05.24 05:44, Baolin Wang wrote:
>>
>>
>> On 2024/5/7 18:37, Ryan Roberts wrote:
>>> On 06/05/2024 09:46, Baolin Wang wrote:
>>>> Add large folio mapping establishment support for finish_fault() as
>>>> a preparation,
>>>> to support multi-size THP allocation of anonymous shmem pages in the
>>>> following
>>>> patches.
>>>>
>>>> Signed-off-by: Baolin Wang <[email protected]>
>>>> ---
>>>> mm/memory.c | 43 +++++++++++++++++++++++++++++++++----------
>>>> 1 file changed, 33 insertions(+), 10 deletions(-)
>>>>
>>>> diff --git a/mm/memory.c b/mm/memory.c
>>>> index eea6e4984eae..936377220b77 100644
>>>> --- a/mm/memory.c
>>>> +++ b/mm/memory.c
>>>> @@ -4747,9 +4747,12 @@ vm_fault_t finish_fault(struct vm_fault *vmf)
>>>> {
>>>> struct vm_area_struct *vma = vmf->vma;
>>>> struct page *page;
>>>> + struct folio *folio;
>>>> vm_fault_t ret;
>>>> bool is_cow = (vmf->flags & FAULT_FLAG_WRITE) &&
>>>> !(vma->vm_flags & VM_SHARED);
>>>> + int type, nr_pages, i;
>>>> + unsigned long addr = vmf->address;
>>>> /* Did we COW the page? */
>>>> if (is_cow)
>>>> @@ -4780,24 +4783,44 @@ vm_fault_t finish_fault(struct vm_fault *vmf)
>>>> return VM_FAULT_OOM;
>>>> }
>>>> + folio = page_folio(page);
>>>> + nr_pages = folio_nr_pages(folio);
>>>> +
>>>> + if (unlikely(userfaultfd_armed(vma))) {
>>>> + nr_pages = 1;
>>>> + } else if (nr_pages > 1) {
>>>> + unsigned long start = ALIGN_DOWN(vmf->address, nr_pages *
>>>> PAGE_SIZE);
>>>> + unsigned long end = start + nr_pages * PAGE_SIZE;
>>>> +
>>>> + /* In case the folio size in page cache beyond the VMA
>>>> limits. */
>>>> + addr = max(start, vma->vm_start);
>>>> + nr_pages = (min(end, vma->vm_end) - addr) >> PAGE_SHIFT;
>>>> +
>>>> + page = folio_page(folio, (addr - start) >> PAGE_SHIFT);
>>>
>>> I still don't really follow the logic in this else if block. Isn't it
>>> possible
>>> that finish_fault() gets called with a page from a folio that isn't
>>> aligned with
>>> vmf->address?
>>>
>>> For example, let's say we have a file who's size is 64K and which is
>>> cached in a
>>> single large folio in the page cache. But the file is mapped into a
>>> process at
>>> VA 16K to 80K. Let's say we fault on the first page (VA=16K). You
>>> will calculate
>>
>> For shmem, this doesn't happen because the VA is aligned with the
>> hugepage size in the shmem_get_unmapped_area() function. See patch 7.
>
> Does that cover mremap() and MAP_FIXED as well.
Good point. Thanks for pointing this out.
> We should try doing this as cleanly as possible, to prepare for the
> future / corner cases.
Sure. Let me re-think about the algorithm.
On 2024/5/8 16:53, Ryan Roberts wrote:
> On 08/05/2024 04:44, Baolin Wang wrote:
>>
>>
>> On 2024/5/7 18:37, Ryan Roberts wrote:
>>> On 06/05/2024 09:46, Baolin Wang wrote:
>>>> Add large folio mapping establishment support for finish_fault() as a
>>>> preparation,
>>>> to support multi-size THP allocation of anonymous shmem pages in the following
>>>> patches.
>>>>
>>>> Signed-off-by: Baolin Wang <[email protected]>
>>>> ---
>>>> mm/memory.c | 43 +++++++++++++++++++++++++++++++++----------
>>>> 1 file changed, 33 insertions(+), 10 deletions(-)
>>>>
>>>> diff --git a/mm/memory.c b/mm/memory.c
>>>> index eea6e4984eae..936377220b77 100644
>>>> --- a/mm/memory.c
>>>> +++ b/mm/memory.c
>>>> @@ -4747,9 +4747,12 @@ vm_fault_t finish_fault(struct vm_fault *vmf)
>>>> {
>>>> struct vm_area_struct *vma = vmf->vma;
>>>> struct page *page;
>>>> + struct folio *folio;
>>>> vm_fault_t ret;
>>>> bool is_cow = (vmf->flags & FAULT_FLAG_WRITE) &&
>>>> !(vma->vm_flags & VM_SHARED);
>>>> + int type, nr_pages, i;
>>>> + unsigned long addr = vmf->address;
>>>> /* Did we COW the page? */
>>>> if (is_cow)
>>>> @@ -4780,24 +4783,44 @@ vm_fault_t finish_fault(struct vm_fault *vmf)
>>>> return VM_FAULT_OOM;
>>>> }
>>>> + folio = page_folio(page);
>>>> + nr_pages = folio_nr_pages(folio);
>>>> +
>>>> + if (unlikely(userfaultfd_armed(vma))) {
>>>> + nr_pages = 1;
>>>> + } else if (nr_pages > 1) {
>>>> + unsigned long start = ALIGN_DOWN(vmf->address, nr_pages * PAGE_SIZE);
>>>> + unsigned long end = start + nr_pages * PAGE_SIZE;
>>>> +
>>>> + /* In case the folio size in page cache beyond the VMA limits. */
>>>> + addr = max(start, vma->vm_start);
>>>> + nr_pages = (min(end, vma->vm_end) - addr) >> PAGE_SHIFT;
>>>> +
>>>> + page = folio_page(folio, (addr - start) >> PAGE_SHIFT);
>>>
>>> I still don't really follow the logic in this else if block. Isn't it possible
>>> that finish_fault() gets called with a page from a folio that isn't aligned with
>>> vmf->address?
>>>
>>> For example, let's say we have a file who's size is 64K and which is cached in a
>>> single large folio in the page cache. But the file is mapped into a process at
>>> VA 16K to 80K. Let's say we fault on the first page (VA=16K). You will calculate
>>
>> For shmem, this doesn't happen because the VA is aligned with the hugepage size
>> in the shmem_get_unmapped_area() function. See patch 7.
>
> Certainly agree that shmem can always make sure that it packs a vma in a way
> such that its folios are naturally aligned in VA when faulting in memory. If you
> mremap it, that alignment will be lost; I don't think that would be a problem
When mremapping it, mremap() will also call shmem_get_unmapped_area() to
align the VA, but for mremap() with the MAP_FIXED flag, as David pointed
out, yes, this patch may not work perfectly.
> for a single process; mremap will take care of moving the ptes correctly and
> this path is not involved.
>
> But what about the case when a process mmaps a shmem region, then forks, then
> the child mremaps the shmem region. Then the parent faults in a THP into the
> region (nicely aligned). Then the child faults in the same offset in the region
> and gets the THP that the parent allocated; that THP will be aligned in the
> parent's VM space but not in the child's.
Sorry, I did not get your point here. IIUC, the child's VA will also be
aligned if the child's mremap does not set MAP_FIXED, since the child's
mremap will still call shmem_get_unmapped_area() to find an aligned new
VA. Please correct me if I missed your point.
>>> start=0 and end=64K I think?
>>
>> Yes. Unfortunately, some file systems that support large mappings do not perform
>> alignment for multi-size THP (non-PMD sized, for example: 64K). I think this
>> requires modification to __get_unmapped_area--->thp_get_unmapped_area_vmflags()
>> or file->f_op->get_unmapped_area() to align VA for multi-size THP in future.
>
> By nature of the fact that a file mapping is shared between multiple processes
> and each process can map it where ever it wants down to 1 page granularity, its
> impossible for any THP containing a part of that file to be VA-aligned in every
> process it is mapped in.
Yes, so let me re-polish this patch. Thanks.
>> So before adding that VA alignment changes, only allow building the large folio
>> mapping for anonymous shmem:
>>
>> diff --git a/mm/memory.c b/mm/memory.c
>> index 936377220b77..9e4d51826d23 100644
>> --- a/mm/memory.c
>> +++ b/mm/memory.c
>> @@ -4786,7 +4786,7 @@ vm_fault_t finish_fault(struct vm_fault *vmf)
>> folio = page_folio(page);
>> nr_pages = folio_nr_pages(folio);
>>
>> - if (unlikely(userfaultfd_armed(vma))) {
>> + if (unlikely(userfaultfd_armed(vma)) || !vma_is_anon_shmem(vma)) {
>
> If the above theoretical flow for fork & mremap is valid, then I don't think
> this is sufficient.
>
>> nr_pages = 1;
>> } else if (nr_pages > 1) {
>> unsigned long start = ALIGN_DOWN(vmf->address, nr_pages *
>> PAGE_SIZE);
>>
>>> Additionally, I think this path will end up mapping the entire folio (as long as
>>> it fits in the VMA). But this bypasses the fault-around configuration. As I
>>> think I mentioned against the RFC, this will inflate the RSS of the process and
>>> can cause behavioural changes as a result. I believe the current advice is to
>>> disable fault-around to prevent this kind of bloat when needed.
>>
>> With above change, I do not think this is a problem? since users already want to
>> use mTHP for anonymous shmem.
>>
>>> It might be that you need a special variant of finish_fault() for shmem?
>>>
>>>
>>>> + }
>>>> vmf->pte = pte_offset_map_lock(vma->vm_mm, vmf->pmd,
>>>> - vmf->address, &vmf->ptl);
>>>> + addr, &vmf->ptl);
>>>> if (!vmf->pte)
>>>> return VM_FAULT_NOPAGE;
>>>> /* Re-check under ptl */
>>>> - if (likely(!vmf_pte_changed(vmf))) {
>>>> - struct folio *folio = page_folio(page);
>>>> - int type = is_cow ? MM_ANONPAGES : mm_counter_file(folio);
>>>> -
>>>> - set_pte_range(vmf, folio, page, 1, vmf->address);
>>>> - add_mm_counter(vma->vm_mm, type, 1);
>>>> - ret = 0;
>>>> - } else {
>>>> - update_mmu_tlb(vma, vmf->address, vmf->pte);
>>>> + if (nr_pages == 1 && unlikely(vmf_pte_changed(vmf))) {
>>>> + update_mmu_tlb(vma, addr, vmf->pte);
>>>> + ret = VM_FAULT_NOPAGE;
>>>> + goto unlock;
>>>> + } else if (nr_pages > 1 && !pte_range_none(vmf->pte, nr_pages)) {
>>>> + for (i = 0; i < nr_pages; i++)
>>>> + update_mmu_tlb(vma, addr + PAGE_SIZE * i, vmf->pte + i);
>>>> ret = VM_FAULT_NOPAGE;
>>>> + goto unlock;
>>>> }
>>>> + set_pte_range(vmf, folio, page, nr_pages, addr);
>>>> + type = is_cow ? MM_ANONPAGES : mm_counter_file(folio);
>>>> + add_mm_counter(vma->vm_mm, type, nr_pages);
>>>> + ret = 0;
>>>> +
>>>> +unlock:
>>>> pte_unmap_unlock(vmf->pte, vmf->ptl);
>>>> return ret;
>>>> }
On 2024/5/8 17:02, Ryan Roberts wrote:
> On 08/05/2024 08:12, David Hildenbrand wrote:
>> On 08.05.24 09:08, David Hildenbrand wrote:
>>> On 08.05.24 06:45, Baolin Wang wrote:
>>>>
>>>>
>>>> On 2024/5/7 18:52, Ryan Roberts wrote:
>>>>> On 06/05/2024 09:46, Baolin Wang wrote:
>>>>>> To support the use of mTHP with anonymous shmem, add a new sysfs interface
>>>>>> 'shmem_enabled' in the '/sys/kernel/mm/transparent_hugepage/hugepages-kB/'
>>>>>> directory for each mTHP to control whether shmem is enabled for that mTHP,
>>>>>> with a value similar to the top level 'shmem_enabled', which can be set to:
>>>>>> "always", "inherit (to inherit the top level setting)", "within_size",
>>>>>> "advise",
>>>>>> "never", "deny", "force". These values follow the same semantics as the top
>>>>>> level, except the 'deny' is equivalent to 'never', and 'force' is equivalent
>>>>>> to 'always' to keep compatibility.
>>>>>
>>>>> We decided at [1] to not allow 'force' for non-PMD-sizes.
>>>>>
>>>>> [1]
>>>>> https://lore.kernel.org/linux-mm/[email protected]/
>>>>>
>>>>> However, thinking about this a bit more, I wonder if the decision we made to
>>>>> allow all hugepages-xxkB/enabled controls to take "inherit" was the wrong one.
>>>>> Perhaps we should have only allowed the PMD-sized enable=inherit (this is just
>>>>> for legacy back compat after all, I don't think there is any use case where
>>>>> changing multiple mTHP size controls atomically is actually useful). Applying
>>>>
>>>> Agree. This is also our usage of 'inherit'.
>>
>> Missed that one: there might be use cases in the future once we would start
>> defaulting to "inherit" for all knobs (a distro might default to that) and
>> default-enable THP in the global knob. Then, it would be easy to disable any THP
>> by disabling the global knob. (I think that's the future we're heading to, where
>> we'd have an "auto" mode that can be set on the global toggle).
>>
>> But I am just making up use cases ;) I think it will be valuable and just doing
>> it consistently now might be cleaner.
>
> I agree that consistency between enabled and shmem_enabled is top priority. And
> yes, I had forgotten about the glorious "auto" future. So probably continuing
> all sizes to select "inherit" is best.
>
> But for shmem_enabled, that means we need the following error checking:
>
> - It is an error to set "force" for any size except PMD-size
>
> - It is an error to set "force" for the global control if any size except PMD-
> size is set to "inherit"
>
> - It is an error to set "inherit" for any size except PMD-size if the global
> control is set to "force".
>
> Certainly not too difficult to code and prove to be correct, but not the nicest
> UX from the user's point of view when they start seeing errors.
>
> I think we previously said this would likely be temporary, and if/when tmpfs
> gets mTHP support, we could simplify and allow all sizes to be set to "force".
> But I wonder if tmpfs would ever need explicit mTHP control? Maybe it would be
> more suited to the approach the page cache takes to transparently ramp up the
> folio size as it faults more in. (Just saying there is a chance that this error
> checking becomes permanent).
The strategy for tmpfs supporting mTHP will require more discussion and
evaluation in the future. However, regardless of the strategy (explicit
mTHP control or page cache control), I think it would still be possible to
use 'force' to override previous strategies for testing purposes. This
appears to be permissible according to the explanation in the current
documentation: "force the huge option on for all - very useful for
testing". So it seems it would not be permanent?
On 08/05/2024 10:31, Baolin Wang wrote:
>
>
> On 2024/5/8 16:53, Ryan Roberts wrote:
>> On 08/05/2024 04:44, Baolin Wang wrote:
>>>
>>>
>>> On 2024/5/7 18:37, Ryan Roberts wrote:
>>>> On 06/05/2024 09:46, Baolin Wang wrote:
>>>>> Add large folio mapping establishment support for finish_fault() as a
>>>>> preparation,
>>>>> to support multi-size THP allocation of anonymous shmem pages in the following
>>>>> patches.
>>>>>
>>>>> Signed-off-by: Baolin Wang <[email protected]>
>>>>> ---
>>>>> mm/memory.c | 43 +++++++++++++++++++++++++++++++++----------
>>>>> 1 file changed, 33 insertions(+), 10 deletions(-)
>>>>>
>>>>> diff --git a/mm/memory.c b/mm/memory.c
>>>>> index eea6e4984eae..936377220b77 100644
>>>>> --- a/mm/memory.c
>>>>> +++ b/mm/memory.c
>>>>> @@ -4747,9 +4747,12 @@ vm_fault_t finish_fault(struct vm_fault *vmf)
>>>>> {
>>>>> struct vm_area_struct *vma = vmf->vma;
>>>>> struct page *page;
>>>>> + struct folio *folio;
>>>>> vm_fault_t ret;
>>>>> bool is_cow = (vmf->flags & FAULT_FLAG_WRITE) &&
>>>>> !(vma->vm_flags & VM_SHARED);
>>>>> + int type, nr_pages, i;
>>>>> + unsigned long addr = vmf->address;
>>>>> /* Did we COW the page? */
>>>>> if (is_cow)
>>>>> @@ -4780,24 +4783,44 @@ vm_fault_t finish_fault(struct vm_fault *vmf)
>>>>> return VM_FAULT_OOM;
>>>>> }
>>>>> + folio = page_folio(page);
>>>>> + nr_pages = folio_nr_pages(folio);
>>>>> +
>>>>> + if (unlikely(userfaultfd_armed(vma))) {
>>>>> + nr_pages = 1;
>>>>> + } else if (nr_pages > 1) {
>>>>> + unsigned long start = ALIGN_DOWN(vmf->address, nr_pages * PAGE_SIZE);
>>>>> + unsigned long end = start + nr_pages * PAGE_SIZE;
>>>>> +
>>>>> + /* In case the folio size in page cache beyond the VMA limits. */
>>>>> + addr = max(start, vma->vm_start);
>>>>> + nr_pages = (min(end, vma->vm_end) - addr) >> PAGE_SHIFT;
>>>>> +
>>>>> + page = folio_page(folio, (addr - start) >> PAGE_SHIFT);
>>>>
>>>> I still don't really follow the logic in this else if block. Isn't it possible
>>>> that finish_fault() gets called with a page from a folio that isn't aligned
>>>> with
>>>> vmf->address?
>>>>
>>>> For example, let's say we have a file who's size is 64K and which is cached
>>>> in a
>>>> single large folio in the page cache. But the file is mapped into a process at
>>>> VA 16K to 80K. Let's say we fault on the first page (VA=16K). You will
>>>> calculate
>>>
>>> For shmem, this doesn't happen because the VA is aligned with the hugepage size
>>> in the shmem_get_unmapped_area() function. See patch 7.
>>
>> Certainly agree that shmem can always make sure that it packs a vma in a way
>> such that its folios are naturally aligned in VA when faulting in memory. If you
>> mremap it, that alignment will be lost; I don't think that would be a problem
>
> When mremap it, it will also call shmem_get_unmapped_area() to align the VA, but
> for mremap() with MAP_FIXED flag as David pointed out, yes, this patch may be
> not work perfectly.
Assuming this works similarly to anon mTHP, remapping to an arbitrary address
shouldn't be a problem within a single process; the previously allocated folios
will now be unaligned, but they will be correctly mapped so it doesn't break
anything. And new faults will allocate folios so that they are as large as
allowed by the sysfs interface AND which do not overlap with any non-none pte
AND which are naturally aligned. It's when you start sharing with other
processes that the fun and games start...
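
(To illustrate those constraints with a rough sketch, not code from the
patches: picking an order this way mirrors what alloc_anon_folio() does for
anon mTHP. pick_fault_order() and pte_range_none_at() are made-up names; the
real code maps the PTEs and uses pte_range_none().)

/*
 * Sketch only: choose the largest enabled order that is naturally aligned at
 * 'addr', fits inside the VMA, and whose PTE range is still entirely none.
 */
static int pick_fault_order(struct vm_area_struct *vma, unsigned long addr,
			    unsigned long orders)
{
	int order;

	for (order = highest_order(orders); orders;
	     order = next_order(&orders, order)) {
		unsigned long start = ALIGN_DOWN(addr, PAGE_SIZE << order);
		unsigned long end = start + (PAGE_SIZE << order);

		if (start < vma->vm_start || end > vma->vm_end)
			continue;	/* must fit the VMA, naturally aligned */
		if (!pte_range_none_at(vma, start, 1 << order))
			continue;	/* hypothetical: any populated PTE disqualifies */
		return order;
	}

	return 0;	/* fall back to a single page */
}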
>
>> for a single process; mremap will take care of moving the ptes correctly and
>> this path is not involved.
>>
>> But what about the case when a process mmaps a shmem region, then forks, then
>> the child mremaps the shmem region. Then the parent faults in a THP into the
>> region (nicely aligned). Then the child faults in the same offset in the region
>> and gets the THP that the parent allocated; that THP will be aligned in the
>> parent's VM space but not in the child's.
>
> Sorry, I did not get your point here. IIUC, the child's VA will also be aligned
> if the child mremap is not set MAP_FIXED, since the child's mremap will still
> call shmem_get_unmapped_area() to find an aligned new VA.
In general, you shouldn't be relying on the vma bounds being aligned to a THP
boundary.
> Please correct me if I missed your point.
(I'm not 100% sure this is definitely how it works, but seems the only sane way
to me):
Let's imagine we have a process that maps 4 pages of shared anon memory at VA=64K:
mmap(64K, 16K, PROT_X, MAP_SHARED | MAP_ANONYMOUS | MAP_FIXED, ...)
Then it forks a child, and the child moves the mapping to VA=68K:
mremap(64K, 16K, 16K, MREMAP_FIXED | MREMAP_MAYMOVE, 68K)
Then the parent writes to address 64K (offset 0 in the shared region); this will
fault and cause a 16K mTHP to be allocated and mapped, covering the whole region
at 64K-80K in the parent.
Then the child reads address 68K (offset 0 in the shared region); this will
fault and cause the previously allocated 16K folio to be looked up and it must
be mapped in the child between 68K-84K. This is not naturally aligned in the child.
For the child, your code will incorrectly calculate start/end as 64K-80K.
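
(A minimal userspace sketch of that sequence, for illustration only; whether a
16K folio is actually allocated depends on the sysfs settings, and the fixed
addresses 0x10000/0x11000 are assumed to be free in both processes.)

#define _GNU_SOURCE
#include <sys/mman.h>
#include <sys/wait.h>
#include <unistd.h>

int main(void)
{
	size_t len = 4 * 4096;		/* 16K shared anonymous region at 64K */
	char *p = mmap((void *)0x10000, len, PROT_READ | PROT_WRITE,
		       MAP_SHARED | MAP_ANONYMOUS | MAP_FIXED, -1, 0);

	if (p == MAP_FAILED)
		return 1;

	if (fork() == 0) {
		/* child: move the region to 68K, no longer 16K-aligned */
		char *q = mremap(p, len, len,
				 MREMAP_MAYMOVE | MREMAP_FIXED, (void *)0x11000);

		if (q == MAP_FAILED)
			_exit(1);
		sleep(1);		/* let the parent fault in the folio first */
		/* fault offset 0: the parent's folio gets mapped at an unaligned VA */
		(void)*(volatile char *)q;
		_exit(0);
	}

	p[0] = 1;			/* parent: fault offset 0, may allocate a 16K mTHP */
	wait(NULL);
	return 0;
}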
>
>>>> start=0 and end=64K I think?
>>>
>>> Yes. Unfortunately, some file systems that support large mappings do not perform
>>> alignment for multi-size THP (non-PMD sized, for example: 64K). I think this
>>> requires modification to __get_unmapped_area--->thp_get_unmapped_area_vmflags()
>>> or file->f_op->get_unmapped_area() to align VA for multi-size THP in future.
>>
>> By nature of the fact that a file mapping is shared between multiple processes
>> and each process can map it where ever it wants down to 1 page granularity, its
>> impossible for any THP containing a part of that file to be VA-aligned in every
>> process it is mapped in.
>
> Yes, so let me re-polish this patch. Thanks.
>
>>> So before adding that VA alignment changes, only allow building the large folio
>>> mapping for anonymous shmem:
>>>
>>> diff --git a/mm/memory.c b/mm/memory.c
>>> index 936377220b77..9e4d51826d23 100644
>>> --- a/mm/memory.c
>>> +++ b/mm/memory.c
>>> @@ -4786,7 +4786,7 @@ vm_fault_t finish_fault(struct vm_fault *vmf)
>>> folio = page_folio(page);
>>> nr_pages = folio_nr_pages(folio);
>>>
>>> - if (unlikely(userfaultfd_armed(vma))) {
>>> + if (unlikely(userfaultfd_armed(vma)) || !vma_is_anon_shmem(vma)) {
>>
>> If the above theoretical flow for fork & mremap is valid, then I don't think
>> this is sufficient.
>>
>>> nr_pages = 1;
>>> } else if (nr_pages > 1) {
>>> unsigned long start = ALIGN_DOWN(vmf->address, nr_pages *
>>> PAGE_SIZE);
>>>
>>>> Additionally, I think this path will end up mapping the entire folio (as
>>>> long as
>>>> it fits in the VMA). But this bypasses the fault-around configuration. As I
>>>> think I mentioned against the RFC, this will inflate the RSS of the process and
>>>> can cause behavioural changes as a result. I believe the current advice is to
>>>> disable fault-around to prevent this kind of bloat when needed.
>>>
>>> With above change, I do not think this is a problem? since users already want to
>>> use mTHP for anonymous shmem.
>>>
>>>> It might be that you need a special variant of finish_fault() for shmem?
>>>>
>>>>
>>>>> + }
>>>>> vmf->pte = pte_offset_map_lock(vma->vm_mm, vmf->pmd,
>>>>> - vmf->address, &vmf->ptl);
>>>>> + addr, &vmf->ptl);
>>>>> if (!vmf->pte)
>>>>> return VM_FAULT_NOPAGE;
>>>>> /* Re-check under ptl */
>>>>> - if (likely(!vmf_pte_changed(vmf))) {
>>>>> - struct folio *folio = page_folio(page);
>>>>> - int type = is_cow ? MM_ANONPAGES : mm_counter_file(folio);
>>>>> -
>>>>> - set_pte_range(vmf, folio, page, 1, vmf->address);
>>>>> - add_mm_counter(vma->vm_mm, type, 1);
>>>>> - ret = 0;
>>>>> - } else {
>>>>> - update_mmu_tlb(vma, vmf->address, vmf->pte);
>>>>> + if (nr_pages == 1 && unlikely(vmf_pte_changed(vmf))) {
>>>>> + update_mmu_tlb(vma, addr, vmf->pte);
>>>>> + ret = VM_FAULT_NOPAGE;
>>>>> + goto unlock;
>>>>> + } else if (nr_pages > 1 && !pte_range_none(vmf->pte, nr_pages)) {
>>>>> + for (i = 0; i < nr_pages; i++)
>>>>> + update_mmu_tlb(vma, addr + PAGE_SIZE * i, vmf->pte + i);
>>>>> ret = VM_FAULT_NOPAGE;
>>>>> + goto unlock;
>>>>> }
>>>>> + set_pte_range(vmf, folio, page, nr_pages, addr);
>>>>> + type = is_cow ? MM_ANONPAGES : mm_counter_file(folio);
>>>>> + add_mm_counter(vma->vm_mm, type, nr_pages);
>>>>> + ret = 0;
>>>>> +
>>>>> +unlock:
>>>>> pte_unmap_unlock(vmf->pte, vmf->ptl);
>>>>> return ret;
>>>>> }
On 08/05/2024 10:56, Baolin Wang wrote:
>
>
> On 2024/5/8 17:02, Ryan Roberts wrote:
>> On 08/05/2024 08:12, David Hildenbrand wrote:
>>> On 08.05.24 09:08, David Hildenbrand wrote:
>>>> On 08.05.24 06:45, Baolin Wang wrote:
>>>>>
>>>>>
>>>>> On 2024/5/7 18:52, Ryan Roberts wrote:
>>>>>> On 06/05/2024 09:46, Baolin Wang wrote:
>>>>>>> To support the use of mTHP with anonymous shmem, add a new sysfs interface
>>>>>>> 'shmem_enabled' in the '/sys/kernel/mm/transparent_hugepage/hugepages-kB/'
>>>>>>> directory for each mTHP to control whether shmem is enabled for that mTHP,
>>>>>>> with a value similar to the top level 'shmem_enabled', which can be set to:
>>>>>>> "always", "inherit (to inherit the top level setting)", "within_size",
>>>>>>> "advise",
>>>>>>> "never", "deny", "force". These values follow the same semantics as the top
>>>>>>> level, except the 'deny' is equivalent to 'never', and 'force' is equivalent
>>>>>>> to 'always' to keep compatibility.
>>>>>>
>>>>>> We decided at [1] to not allow 'force' for non-PMD-sizes.
>>>>>>
>>>>>> [1]
>>>>>> https://lore.kernel.org/linux-mm/[email protected]/
>>>>>>
>>>>>> However, thinking about this a bit more, I wonder if the decision we made to
>>>>>> allow all hugepages-xxkB/enabled controls to take "inherit" was the wrong
>>>>>> one.
>>>>>> Perhaps we should have only allowed the PMD-sized enable=inherit (this is
>>>>>> just
>>>>>> for legacy back compat after all, I don't think there is any use case where
>>>>>> changing multiple mTHP size controls atomically is actually useful). Applying
>>>>>
>>>>> Agree. This is also our usage of 'inherit'.
>>>
>>> Missed that one: there might be use cases in the future once we would start
>>> defaulting to "inherit" for all knobs (a distro might default to that) and
>>> default-enable THP in the global knob. Then, it would be easy to disable any THP
>>> by disabling the global knob. (I think that's the future we're heading to, where
>>> we'd have an "auto" mode that can be set on the global toggle).
>>>
>>> But I am just making up use cases ;) I think it will be valuable and just doing
>>> it consistently now might be cleaner.
>>
>> I agree that consistency between enabled and shmem_enabled is top priority. And
>> yes, I had forgotten about the glorious "auto" future. So probably continuing
>> all sizes to select "inherit" is best.
>>
>> But for shmem_enabled, that means we need the following error checking:
>>
>> - It is an error to set "force" for any size except PMD-size
>>
>> - It is an error to set "force" for the global control if any size except PMD-
>> size is set to "inherit"
>>
>> - It is an error to set "inherit" for any size except PMD-size if the global
>> control is set to "force".
>>
>> Certainly not too difficult to code and prove to be correct, but not the nicest
>> UX from the user's point of view when they start seeing errors.
>>
>> I think we previously said this would likely be temporary, and if/when tmpfs
>> gets mTHP support, we could simplify and allow all sizes to be set to "force".
>> But I wonder if tmpfs would ever need explicit mTHP control? Maybe it would be
>> more suited to the approach the page cache takes to transparently ramp up the
>> folio size as it faults more in. (Just saying there is a chance that this error
>> checking becomes permanent).
>
> The strategy for tmpfs supporting mTHP will require more discussions and
> evaluations in the future. However, regardless of the strategy (explicit mTHP
> control or page cache control), I think it would be possible to use 'force' to
> override previous strategies for some testing purposes. This appears to be
> permissible according to the explanation in the current documentation: "force
> the huge option on for all - very useful for testing". So it seems not permanent?
Yeah ok, makes sense to me.
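(For concreteness, the three checks listed above would boil down to
something like the sketch below. All identifiers here are invented
purely to illustrate the shape of the validation; they are not the
patchset's actual symbols.)

    /* Per-size 'shmem_enabled' store handler (illustrative only). */
    static int mthp_shmem_enabled_store(int order, const char *val)
    {
            /* Rule 1: "force" is only valid for the PMD size. */
            if (!strcmp(val, "force") && order != PMD_ORDER)
                    return -EINVAL;

            /* Rule 3: a non-PMD size may not "inherit" while the global
             * control is set to "force". */
            if (!strcmp(val, "inherit") && order != PMD_ORDER &&
                global_shmem_enabled_is_force())
                    return -EINVAL;

            /* ... record the new per-order setting ... */
            return 0;
    }

    /* Global 'shmem_enabled' store handler (illustrative only). */
    static int shmem_enabled_store(const char *val)
    {
            /* Rule 2: "force" is rejected while any non-PMD size is
             * set to "inherit". */
            if (!strcmp(val, "force") &&
                (shmem_orders_inherit & ~BIT(PMD_ORDER)))
                    return -EINVAL;

            /* ... apply the global setting ... */
            return 0;
    }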
On Mon, May 06, 2024 at 04:46:24PM +0800, Baolin Wang wrote:
> Anonymous pages have already been supported for multi-size (mTHP) allocation
> through commit 19eaf44954df, that can allow THP to be configured through the
> sysfs interface located at '/sys/kernel/mm/transparent_hugepage/hugepage-XXkb/enabled'.
>
> However, the anonymous shared pages will ignore the anonymous mTHP rule
> configured through the sysfs interface, and can only use the PMD-mapped
> THP, that is not reasonable. Many implement anonymous page sharing through
> mmap(MAP_SHARED | MAP_ANONYMOUS), especially in database usage scenarios,
> therefore, users expect to apply an unified mTHP strategy for anonymous pages,
> also including the anonymous shared pages, in order to enjoy the benefits of
> mTHP. For example, lower latency than PMD-mapped THP, smaller memory bloat
> than PMD-mapped THP, contiguous PTEs on ARM architecture to reduce TLB miss etc.
>
> The primary strategy is similar to supporting anonymous mTHP. Introduce
> a new interface '/mm/transparent_hugepage/hugepage-XXkb/shmem_enabled',
> which can have all the same values as the top-level
> '/sys/kernel/mm/transparent_hugepage/shmem_enabled', with adding a new
> additional "inherit" option. By default all sizes will be set to "never"
> except PMD size, which is set to "inherit". This ensures backward compatibility
> with the shmem enabled of the top level, meanwhile also allows independent
> control of shmem enabled for each mTHP.
I'm trying to understand the adoption of mTHP and how it fits into the adoption
of (large) folios that the kernel is moving towards. Can you, or anyone involved
here, explain this? How much do they overlap, and can we benefit from having
both? Is there any argument against the adoption of large folios here that I
might have missed?
>
> Use the page fault latency tool to measure the performance of 1G anonymous shmem
> with 32 threads on my machine environment with: ARM64 Architecture, 32 cores,
> 125G memory:
> base: mm-unstable
> user-time sys_time faults_per_sec_per_cpu faults_per_sec
> 0.04s 3.10s 83516.416 2669684.890
>
> mm-unstable + patchset, anon shmem mTHP disabled
> user-time sys_time faults_per_sec_per_cpu faults_per_sec
> 0.02s 3.14s 82936.359 2630746.027
>
> mm-unstable + patchset, anon shmem 64K mTHP enabled
> user-time sys_time faults_per_sec_per_cpu faults_per_sec
> 0.08s 0.31s 678630.231 17082522.495
>
> From the data above, it is observed that the patchset has a minimal impact when
> mTHP is not enabled (some fluctuations observed during testing). When enabling 64K
> mTHP, there is a significant improvement of the page fault latency.
>
> TODO:
> - Support mTHP for tmpfs.
> - Do not split the large folio when share memory swap out.
> - Can swap in a large folio for share memory.
>
> Changes from RFC:
> - Rebase the patch set against the new mm-unstable branch, per Lance.
> - Add a new patch to export highest_order() and next_order().
> - Add a new patch to align mTHP size in shmem_get_unmapped_area().
> - Handle the uffd case and the VMA limits case when building mapping for
> large folio in the finish_fault() function, per Ryan.
> - Remove unnecessary 'order' variable in patch 3, per Kefeng.
> - Keep the anon shmem counters' name consistency.
> - Modify the strategy to support mTHP for anonymous shmem, discussed with
> Ryan and David.
> - Add reviewed tag from Barry.
> - Update the commit message.
>
> Baolin Wang (8):
> mm: move highest_order() and next_order() out of the THP config
> mm: memory: extend finish_fault() to support large folio
> mm: shmem: add an 'order' parameter for shmem_alloc_hugefolio()
> mm: shmem: add THP validation for PMD-mapped THP related statistics
> mm: shmem: add multi-size THP sysfs interface for anonymous shmem
> mm: shmem: add mTHP support for anonymous shmem
> mm: shmem: add mTHP size alignment in shmem_get_unmapped_area
> mm: shmem: add mTHP counters for anonymous shmem
>
> Documentation/admin-guide/mm/transhuge.rst | 29 ++
> include/linux/huge_mm.h | 35 ++-
> mm/huge_memory.c | 17 +-
> mm/memory.c | 43 ++-
> mm/shmem.c | 335 ++++++++++++++++++---
> 5 files changed, 387 insertions(+), 72 deletions(-)
>
> --
> 2.39.3
>
On 08.05.24 13:39, Daniel Gomez wrote:
> On Mon, May 06, 2024 at 04:46:24PM +0800, Baolin Wang wrote:
>> Anonymous pages have already been supported for multi-size (mTHP) allocation
>> through commit 19eaf44954df, that can allow THP to be configured through the
>> sysfs interface located at '/sys/kernel/mm/transparent_hugepage/hugepage-XXkb/enabled'.
>>
>> However, the anonymous shared pages will ignore the anonymous mTHP rule
>> configured through the sysfs interface, and can only use the PMD-mapped
>> THP, that is not reasonable. Many implement anonymous page sharing through
>> mmap(MAP_SHARED | MAP_ANONYMOUS), especially in database usage scenarios,
>> therefore, users expect to apply an unified mTHP strategy for anonymous pages,
>> also including the anonymous shared pages, in order to enjoy the benefits of
>> mTHP. For example, lower latency than PMD-mapped THP, smaller memory bloat
>> than PMD-mapped THP, contiguous PTEs on ARM architecture to reduce TLB miss etc.
>>
>> The primary strategy is similar to supporting anonymous mTHP. Introduce
>> a new interface '/mm/transparent_hugepage/hugepage-XXkb/shmem_enabled',
>> which can have all the same values as the top-level
>> '/sys/kernel/mm/transparent_hugepage/shmem_enabled', with adding a new
>> additional "inherit" option. By default all sizes will be set to "never"
>> except PMD size, which is set to "inherit". This ensures backward compatibility
>> with the shmem enabled of the top level, meanwhile also allows independent
>> control of shmem enabled for each mTHP.
>
> I'm trying to understand the adoption of mTHP and how it fits into the adoption
> of (large) folios that the kernel is moving towards. Can you, or anyone involved
> here, explain this? How much do they overlap, and can we benefit from having
> both? Is there any argument against the adoption of large folios here that I
> might have missed?
mTHP are implemented using large folios, just like traditional PMD-sized
THP are. (You really should explore the history of mTHP and how it all
works internally.)
The biggest challenge with memory that cannot be evicted and reclaimed
under memory pressure (in contrast to your ordinary files in the
pagecache) is memory waste and, more generally, the placement of large
chunks of memory during page faults.
In the worst case (no swap), you allocate a large chunk of memory once
and it will stick around until freed: no reclaim of that memory.
That's the reason why THP for anonymous memory and SHMEM have toggles to
manually enable and configure them, in contrast to the pagecache. The
same was done for mTHP for anonymous memory, and now (anon) shmem follows.
There are plans to have it all working automatically at some point, but
a lot of what that requires for anonymous memory (and similarly for
shmem) is still missing and unclear.
--
Cheers,
David / dhildenb
On 08.05.24 14:02, David Hildenbrand wrote:
> On 08.05.24 11:02, Ryan Roberts wrote:
>> On 08/05/2024 08:12, David Hildenbrand wrote:
>>> On 08.05.24 09:08, David Hildenbrand wrote:
>>>> On 08.05.24 06:45, Baolin Wang wrote:
>>>>>
>>>>>
>>>>> On 2024/5/7 18:52, Ryan Roberts wrote:
>>>>>> On 06/05/2024 09:46, Baolin Wang wrote:
>>>>>>> To support the use of mTHP with anonymous shmem, add a new sysfs interface
>>>>>>> 'shmem_enabled' in the '/sys/kernel/mm/transparent_hugepage/hugepages-kB/'
>>>>>>> directory for each mTHP to control whether shmem is enabled for that mTHP,
>>>>>>> with a value similar to the top level 'shmem_enabled', which can be set to:
>>>>>>> "always", "inherit (to inherit the top level setting)", "within_size",
>>>>>>> "advise",
>>>>>>> "never", "deny", "force". These values follow the same semantics as the top
>>>>>>> level, except the 'deny' is equivalent to 'never', and 'force' is equivalent
>>>>>>> to 'always' to keep compatibility.
>>>>>>
>>>>>> We decided at [1] to not allow 'force' for non-PMD-sizes.
>>>>>>
>>>>>> [1]
>>>>>> https://lore.kernel.org/linux-mm/[email protected]/
>>>>>>
>>>>>> However, thinking about this a bit more, I wonder if the decision we made to
>>>>>> allow all hugepages-xxkB/enabled controls to take "inherit" was the wrong one.
>>>>>> Perhaps we should have only allowed the PMD-sized enable=inherit (this is just
>>>>>> for legacy back compat after all, I don't think there is any use case where
>>>>>> changing multiple mTHP size controls atomically is actually useful). Applying
>>>>>
>>>>> Agree. This is also our usage of 'inherit'.
>>>
>>> Missed that one: there might be use cases in the future once we would start
>>> defaulting to "inherit" for all knobs (a distro might default to that) and
>>> default-enable THP in the global knob. Then, it would be easy to disable any THP
>>> by disabling the global knob. (I think that's the future we're heading to, where
>>> we'd have an "auto" mode that can be set on the global toggle).
>>>
>>> But I am just making up use cases ;) I think it will be valuable and just doing
>>> it consistently now might be cleaner.
>>
>> I agree that consistency between enabled and shmem_enabled is top priority. And
>> yes, I had forgotten about the glorious "auto" future. So probably continuing
>> all sizes to select "inherit" is best.
>>
>> But for shmem_enabled, that means we need the following error checking:
>>
>> - It is an error to set "force" for any size except PMD-size
>>
>> - It is an error to set "force" for the global control if any size except PMD-
>> size is set to "inherit"
>>
>> - It is an error to set "inherit" for any size except PMD-size if the global
>> control is set to "force".
>>
>> Certainly not too difficult to code and prove to be correct, but not the nicest
>> UX from the user's point of view when they start seeing errors.
>>
>> I think we previously said this would likely be temporary, and if/when tmpfs
>> gets mTHP support, we could simplify and allow all sizes to be set to "force".
>> But I wonder if tmpfs would ever need explicit mTHP control? Maybe it would be
>> more suited to the approach the page cache takes to transparently ramp up the
>> folio size as it faults more in. (Just saying there is a chance that this error
>> checking becomes permanent).
>
> Note that with shmem you're inherently facing the same memory waste
> issues etc as you would with anonymous memory. (sometimes even worse, if
> you're running shmem that's configured to be unswappable!).
Also noting that memory waste is not really a problem when a write to a
shmem file allocates a large folio that stays within the boundaries of
that write; issues only pop up if you end up over-allocating,
especially during page faults, where you do not have much of a clue
about what to do (single address, no real range provided).
There is the other issue of wasting large chunks of contiguous memory
on stuff that barely benefits from it. With memory that maybe never
gets evicted, there is no automatic "handing back" of that memory to
the system to be used by something else. With ordinary files, that's a
bit different. But I have not looked closer into that issue yet; it's
one of the reasons MADV_HUGEPAGE was added, IIRC.
--
Cheers,
David / dhildenb
On 08.05.24 11:02, Ryan Roberts wrote:
> On 08/05/2024 08:12, David Hildenbrand wrote:
>> On 08.05.24 09:08, David Hildenbrand wrote:
>>> On 08.05.24 06:45, Baolin Wang wrote:
>>>>
>>>>
>>>> On 2024/5/7 18:52, Ryan Roberts wrote:
>>>>> On 06/05/2024 09:46, Baolin Wang wrote:
>>>>>> To support the use of mTHP with anonymous shmem, add a new sysfs interface
>>>>>> 'shmem_enabled' in the '/sys/kernel/mm/transparent_hugepage/hugepages-kB/'
>>>>>> directory for each mTHP to control whether shmem is enabled for that mTHP,
>>>>>> with a value similar to the top level 'shmem_enabled', which can be set to:
>>>>>> "always", "inherit (to inherit the top level setting)", "within_size",
>>>>>> "advise",
>>>>>> "never", "deny", "force". These values follow the same semantics as the top
>>>>>> level, except the 'deny' is equivalent to 'never', and 'force' is equivalent
>>>>>> to 'always' to keep compatibility.
>>>>>
>>>>> We decided at [1] to not allow 'force' for non-PMD-sizes.
>>>>>
>>>>> [1]
>>>>> https://lore.kernel.org/linux-mm/[email protected]/
>>>>>
>>>>> However, thinking about this a bit more, I wonder if the decision we made to
>>>>> allow all hugepages-xxkB/enabled controls to take "inherit" was the wrong one.
>>>>> Perhaps we should have only allowed the PMD-sized enable=inherit (this is just
>>>>> for legacy back compat after all, I don't think there is any use case where
>>>>> changing multiple mTHP size controls atomically is actually useful). Applying
>>>>
>>>> Agree. This is also our usage of 'inherit'.
>>
>> Missed that one: there might be use cases in the future once we would start
>> defaulting to "inherit" for all knobs (a distro might default to that) and
>> default-enable THP in the global knob. Then, it would be easy to disable any THP
>> by disabling the global knob. (I think that's the future we're heading to, where
>> we'd have an "auto" mode that can be set on the global toggle).
>>
>> But I am just making up use cases ;) I think it will be valuable and just doing
>> it consistently now might be cleaner.
>
> I agree that consistency between enabled and shmem_enabled is top priority. And
> yes, I had forgotten about the glorious "auto" future. So probably continuing
> all sizes to select "inherit" is best.
>
> But for shmem_enabled, that means we need the following error checking:
>
> - It is an error to set "force" for any size except PMD-size
>
> - It is an error to set "force" for the global control if any size except PMD-
> size is set to "inherit"
>
> - It is an error to set "inherit" for any size except PMD-size if the global
> control is set to "force".
>
> Certainly not too difficult to code and prove to be correct, but not the nicest
> UX from the user's point of view when they start seeing errors.
>
> I think we previously said this would likely be temporary, and if/when tmpfs
> gets mTHP support, we could simplify and allow all sizes to be set to "force".
> But I wonder if tmpfs would ever need explicit mTHP control? Maybe it would be
> more suited to the approach the page cache takes to transparently ramp up the
> folio size as it faults more in. (Just saying there is a chance that this error
> checking becomes permanent).
Note that with shmem you're inherently facing the same memory waste
issues, etc., as you would with anonymous memory (sometimes even worse,
if you're running shmem that's configured to be unswappable!).
--
Cheers,
David / dhildenb
On 08/05/2024 13:10, David Hildenbrand wrote:
> On 08.05.24 14:02, David Hildenbrand wrote:
>> On 08.05.24 11:02, Ryan Roberts wrote:
>>> On 08/05/2024 08:12, David Hildenbrand wrote:
>>>> On 08.05.24 09:08, David Hildenbrand wrote:
>>>>> On 08.05.24 06:45, Baolin Wang wrote:
>>>>>>
>>>>>>
>>>>>> On 2024/5/7 18:52, Ryan Roberts wrote:
>>>>>>> On 06/05/2024 09:46, Baolin Wang wrote:
>>>>>>>> To support the use of mTHP with anonymous shmem, add a new sysfs interface
>>>>>>>> 'shmem_enabled' in the '/sys/kernel/mm/transparent_hugepage/hugepages-kB/'
>>>>>>>> directory for each mTHP to control whether shmem is enabled for that mTHP,
>>>>>>>> with a value similar to the top level 'shmem_enabled', which can be set to:
>>>>>>>> "always", "inherit (to inherit the top level setting)", "within_size",
>>>>>>>> "advise",
>>>>>>>> "never", "deny", "force". These values follow the same semantics as the top
>>>>>>>> level, except the 'deny' is equivalent to 'never', and 'force' is
>>>>>>>> equivalent
>>>>>>>> to 'always' to keep compatibility.
>>>>>>>
>>>>>>> We decided at [1] to not allow 'force' for non-PMD-sizes.
>>>>>>>
>>>>>>> [1]
>>>>>>> https://lore.kernel.org/linux-mm/[email protected]/
>>>>>>>
>>>>>>> However, thinking about this a bit more, I wonder if the decision we made to
>>>>>>> allow all hugepages-xxkB/enabled controls to take "inherit" was the wrong
>>>>>>> one.
>>>>>>> Perhaps we should have only allowed the PMD-sized enable=inherit (this is
>>>>>>> just
>>>>>>> for legacy back compat after all, I don't think there is any use case where
>>>>>>> changing multiple mTHP size controls atomically is actually useful).
>>>>>>> Applying
>>>>>>
>>>>>> Agree. This is also our usage of 'inherit'.
>>>>
>>>> Missed that one: there might be use cases in the future once we would start
>>>> defaulting to "inherit" for all knobs (a distro might default to that) and
>>>> default-enable THP in the global knob. Then, it would be easy to disable any
>>>> THP
>>>> by disabling the global knob. (I think that's the future we're heading to,
>>>> where
>>>> we'd have an "auto" mode that can be set on the global toggle).
>>>>
>>>> But I am just making up use cases ;) I think it will be valuable and just doing
>>>> it consistently now might be cleaner.
>>>
>>> I agree that consistency between enabled and shmem_enabled is top priority. And
>>> yes, I had forgotten about the glorious "auto" future. So probably continuing
>>> all sizes to select "inherit" is best.
>>>
>>> But for shmem_enabled, that means we need the following error checking:
>>>
>>> - It is an error to set "force" for any size except PMD-size
>>>
>>> - It is an error to set "force" for the global control if any size except
>>> PMD-
>>> size is set to "inherit"
>>>
>>> - It is an error to set "inherit" for any size except PMD-size if the global
>>> control is set to "force".
>>>
>>> Certainly not too difficult to code and prove to be correct, but not the nicest
>>> UX from the user's point of view when they start seeing errors.
>>>
>>> I think we previously said this would likely be temporary, and if/when tmpfs
>>> gets mTHP support, we could simplify and allow all sizes to be set to "force".
>>> But I wonder if tmpfs would ever need explicit mTHP control? Maybe it would be
>>> more suited to the approach the page cache takes to transparently ramp up the
>>> folio size as it faults more in. (Just saying there is a chance that this error
>>> checking becomes permanent).
>>
>> Note that with shmem you're inherently facing the same memory waste
>> issues etc as you would with anonymous memory. (sometimes even worse, if
>> you're running shmem that's configured to be unswappable!).
>
> Also noting that memory waste is not really a problem when a write to a shmem
> file allocates a large folio that stays within boundaries of that write; issues
> only pop up if you end up over-allocating, especially, during page faults where
> you have not that much clue about what to do (single address, no real range
> provided).
>
> There is the other issue that wasting large chunks of contiguous memory on stuff
> that barely benefits from it. With memory that maybe never gets evicted, there
> is no automatic "handing back" of that memory to the system to be used by
> something else. With ordinary files, that's a bit different. But I did not look
> closer into that issue yet, it's one of the reasons MADV_HUGEPAGE was added IIRC.
OK, understood. Although with tmpfs you're not going to mmap it and then
randomly extend the file through page faults - mmap doesn't permit that,
I don't think? So presumably the user must explicitly set the size of
the file first? Are you suggesting there are a lot of use cases where a
large tmpfs file is created, mmapped and then only accessed sparsely?
On 08.05.24 14:43, Ryan Roberts wrote:
> On 08/05/2024 13:10, David Hildenbrand wrote:
>> On 08.05.24 14:02, David Hildenbrand wrote:
>>> On 08.05.24 11:02, Ryan Roberts wrote:
>>>> On 08/05/2024 08:12, David Hildenbrand wrote:
>>>>> On 08.05.24 09:08, David Hildenbrand wrote:
>>>>>> On 08.05.24 06:45, Baolin Wang wrote:
>>>>>>>
>>>>>>>
>>>>>>> On 2024/5/7 18:52, Ryan Roberts wrote:
>>>>>>>> On 06/05/2024 09:46, Baolin Wang wrote:
>>>>>>>>> To support the use of mTHP with anonymous shmem, add a new sysfs interface
>>>>>>>>> 'shmem_enabled' in the '/sys/kernel/mm/transparent_hugepage/hugepages-kB/'
>>>>>>>>> directory for each mTHP to control whether shmem is enabled for that mTHP,
>>>>>>>>> with a value similar to the top level 'shmem_enabled', which can be set to:
>>>>>>>>> "always", "inherit (to inherit the top level setting)", "within_size",
>>>>>>>>> "advise",
>>>>>>>>> "never", "deny", "force". These values follow the same semantics as the top
>>>>>>>>> level, except the 'deny' is equivalent to 'never', and 'force' is
>>>>>>>>> equivalent
>>>>>>>>> to 'always' to keep compatibility.
>>>>>>>>
>>>>>>>> We decided at [1] to not allow 'force' for non-PMD-sizes.
>>>>>>>>
>>>>>>>> [1]
>>>>>>>> https://lore.kernel.org/linux-mm/[email protected]/
>>>>>>>>
>>>>>>>> However, thinking about this a bit more, I wonder if the decision we made to
>>>>>>>> allow all hugepages-xxkB/enabled controls to take "inherit" was the wrong
>>>>>>>> one.
>>>>>>>> Perhaps we should have only allowed the PMD-sized enable=inherit (this is
>>>>>>>> just
>>>>>>>> for legacy back compat after all, I don't think there is any use case where
>>>>>>>> changing multiple mTHP size controls atomically is actually useful).
>>>>>>>> Applying
>>>>>>>
>>>>>>> Agree. This is also our usage of 'inherit'.
>>>>>
>>>>> Missed that one: there might be use cases in the future once we would start
>>>>> defaulting to "inherit" for all knobs (a distro might default to that) and
>>>>> default-enable THP in the global knob. Then, it would be easy to disable any
>>>>> THP
>>>>> by disabling the global knob. (I think that's the future we're heading to,
>>>>> where
>>>>> we'd have an "auto" mode that can be set on the global toggle).
>>>>>
>>>>> But I am just making up use cases ;) I think it will be valuable and just doing
>>>>> it consistently now might be cleaner.
>>>>
>>>> I agree that consistency between enabled and shmem_enabled is top priority. And
>>>> yes, I had forgotten about the glorious "auto" future. So probably continuing
>>>> all sizes to select "inherit" is best.
>>>>
>>>> But for shmem_enabled, that means we need the following error checking:
>>>>
>>>> - It is an error to set "force" for any size except PMD-size
>>>>
>>>> - It is an error to set "force" for the global control if any size except
>>>> PMD-
>>>> size is set to "inherit"
>>>>
>>>> - It is an error to set "inherit" for any size except PMD-size if the global
>>>> control is set to "force".
>>>>
>>>> Certainly not too difficult to code and prove to be correct, but not the nicest
>>>> UX from the user's point of view when they start seeing errors.
>>>>
>>>> I think we previously said this would likely be temporary, and if/when tmpfs
>>>> gets mTHP support, we could simplify and allow all sizes to be set to "force".
>>>> But I wonder if tmpfs would ever need explicit mTHP control? Maybe it would be
>>>> more suited to the approach the page cache takes to transparently ramp up the
>>>> folio size as it faults more in. (Just saying there is a chance that this error
>>>> checking becomes permanent).
>>>
>>> Note that with shmem you're inherently facing the same memory waste
>>> issues etc as you would with anonymous memory. (sometimes even worse, if
>>> you're running shmem that's configured to be unswappable!).
>>
>> Also noting that memory waste is not really a problem when a write to a shmem
>> file allocates a large folio that stays within boundaries of that write; issues
>> only pop up if you end up over-allocating, especially, during page faults where
>> you have not that much clue about what to do (single address, no real range
>> provided).
>>
>> There is the other issue that wasting large chunks of contiguous memory on stuff
>> that barely benefits from it. With memory that maybe never gets evicted, there
>> is no automatic "handing back" of that memory to the system to be used by
>> something else. With ordinary files, that's a bit different. But I did not look
>> closer into that issue yet, it's one of the reasons MADV_HUGEPAGE was added IIRC.
>
> OK understood. Although, with tmpfs you're not going to mmap it then randomly
> extend the file through page faults - mmap doesn't permit that, I don't think?
> So presumably the user must explicitly set the size of the file first? Are you
> suggesting there are a lot of use cases where a large tmpfs file is created,
> mmaped then only accessed sparsely?
I don't know about "a lot of use cases", but for VMs that's certainly
how it's used.
--
Cheers,
David / dhildenb
On 08/05/2024 13:43, Ryan Roberts wrote:
> On 08/05/2024 13:10, David Hildenbrand wrote:
>> On 08.05.24 14:02, David Hildenbrand wrote:
>>> On 08.05.24 11:02, Ryan Roberts wrote:
>>>> On 08/05/2024 08:12, David Hildenbrand wrote:
>>>>> On 08.05.24 09:08, David Hildenbrand wrote:
>>>>>> On 08.05.24 06:45, Baolin Wang wrote:
>>>>>>>
>>>>>>>
>>>>>>> On 2024/5/7 18:52, Ryan Roberts wrote:
>>>>>>>> On 06/05/2024 09:46, Baolin Wang wrote:
>>>>>>>>> To support the use of mTHP with anonymous shmem, add a new sysfs interface
>>>>>>>>> 'shmem_enabled' in the '/sys/kernel/mm/transparent_hugepage/hugepages-kB/'
>>>>>>>>> directory for each mTHP to control whether shmem is enabled for that mTHP,
>>>>>>>>> with a value similar to the top level 'shmem_enabled', which can be set to:
>>>>>>>>> "always", "inherit (to inherit the top level setting)", "within_size",
>>>>>>>>> "advise",
>>>>>>>>> "never", "deny", "force". These values follow the same semantics as the top
>>>>>>>>> level, except the 'deny' is equivalent to 'never', and 'force' is
>>>>>>>>> equivalent
>>>>>>>>> to 'always' to keep compatibility.
>>>>>>>>
>>>>>>>> We decided at [1] to not allow 'force' for non-PMD-sizes.
>>>>>>>>
>>>>>>>> [1]
>>>>>>>> https://lore.kernel.org/linux-mm/[email protected]/
>>>>>>>>
>>>>>>>> However, thinking about this a bit more, I wonder if the decision we made to
>>>>>>>> allow all hugepages-xxkB/enabled controls to take "inherit" was the wrong
>>>>>>>> one.
>>>>>>>> Perhaps we should have only allowed the PMD-sized enable=inherit (this is
>>>>>>>> just
>>>>>>>> for legacy back compat after all, I don't think there is any use case where
>>>>>>>> changing multiple mTHP size controls atomically is actually useful).
>>>>>>>> Applying
>>>>>>>
>>>>>>> Agree. This is also our usage of 'inherit'.
>>>>>
>>>>> Missed that one: there might be use cases in the future once we would start
>>>>> defaulting to "inherit" for all knobs (a distro might default to that) and
>>>>> default-enable THP in the global knob. Then, it would be easy to disable any
>>>>> THP
>>>>> by disabling the global knob. (I think that's the future we're heading to,
>>>>> where
>>>>> we'd have an "auto" mode that can be set on the global toggle).
>>>>>
>>>>> But I am just making up use cases ;) I think it will be valuable and just doing
>>>>> it consistently now might be cleaner.
>>>>
>>>> I agree that consistency between enabled and shmem_enabled is top priority. And
>>>> yes, I had forgotten about the glorious "auto" future. So probably continuing
>>>> all sizes to select "inherit" is best.
>>>>
>>>> But for shmem_enabled, that means we need the following error checking:
>>>>
>>>> - It is an error to set "force" for any size except PMD-size
>>>>
>>>> - It is an error to set "force" for the global control if any size except
>>>> PMD-
>>>> size is set to "inherit"
>>>>
>>>> - It is an error to set "inherit" for any size except PMD-size if the global
>>>> control is set to "force".
>>>>
>>>> Certainly not too difficult to code and prove to be correct, but not the nicest
>>>> UX from the user's point of view when they start seeing errors.
>>>>
>>>> I think we previously said this would likely be temporary, and if/when tmpfs
>>>> gets mTHP support, we could simplify and allow all sizes to be set to "force".
>>>> But I wonder if tmpfs would ever need explicit mTHP control? Maybe it would be
>>>> more suited to the approach the page cache takes to transparently ramp up the
>>>> folio size as it faults more in. (Just saying there is a chance that this error
>>>> checking becomes permanent).
>>>
>>> Note that with shmem you're inherently facing the same memory waste
>>> issues etc as you would with anonymous memory. (sometimes even worse, if
>>> you're running shmem that's configured to be unswappable!).
>>
>> Also noting that memory waste is not really a problem when a write to a shmem
>> file allocates a large folio that stays within boundaries of that write; issues
>> only pop up if you end up over-allocating, especially, during page faults where
>> you have not that much clue about what to do (single address, no real range
>> provided).
>>
>> There is the other issue that wasting large chunks of contiguous memory on stuff
>> that barely benefits from it. With memory that maybe never gets evicted, there
>> is no automatic "handing back" of that memory to the system to be used by
>> something else. With ordinary files, that's a bit different. But I did not look
>> closer into that issue yet, it's one of the reasons MADV_HUGEPAGE was added IIRC.
>
> OK understood. Although, with tmpfs you're not going to mmap it then randomly
> extend the file through page faults - mmap doesn't permit that, I don't think?
> So presumably the user must explicitly set the size of the file first? Are you
> suggesting there are a lot of use cases where a large tmpfs file is created,
> mmaped then only accessed sparsely?
I know that's often the case for anon memory, but I'm not sure if you
would expect the same pattern with an explicit file?
On 08/05/2024 13:45, David Hildenbrand wrote:
> On 08.05.24 14:43, Ryan Roberts wrote:
>> On 08/05/2024 13:10, David Hildenbrand wrote:
>>> On 08.05.24 14:02, David Hildenbrand wrote:
>>>> On 08.05.24 11:02, Ryan Roberts wrote:
>>>>> On 08/05/2024 08:12, David Hildenbrand wrote:
>>>>>> On 08.05.24 09:08, David Hildenbrand wrote:
>>>>>>> On 08.05.24 06:45, Baolin Wang wrote:
>>>>>>>>
>>>>>>>>
>>>>>>>> On 2024/5/7 18:52, Ryan Roberts wrote:
>>>>>>>>> On 06/05/2024 09:46, Baolin Wang wrote:
>>>>>>>>>> To support the use of mTHP with anonymous shmem, add a new sysfs
>>>>>>>>>> interface
>>>>>>>>>> 'shmem_enabled' in the
>>>>>>>>>> '/sys/kernel/mm/transparent_hugepage/hugepages-kB/'
>>>>>>>>>> directory for each mTHP to control whether shmem is enabled for that
>>>>>>>>>> mTHP,
>>>>>>>>>> with a value similar to the top level 'shmem_enabled', which can be
>>>>>>>>>> set to:
>>>>>>>>>> "always", "inherit (to inherit the top level setting)", "within_size",
>>>>>>>>>> "advise",
>>>>>>>>>> "never", "deny", "force". These values follow the same semantics as
>>>>>>>>>> the top
>>>>>>>>>> level, except the 'deny' is equivalent to 'never', and 'force' is
>>>>>>>>>> equivalent
>>>>>>>>>> to 'always' to keep compatibility.
>>>>>>>>>
>>>>>>>>> We decided at [1] to not allow 'force' for non-PMD-sizes.
>>>>>>>>>
>>>>>>>>> [1]
>>>>>>>>> https://lore.kernel.org/linux-mm/[email protected]/
>>>>>>>>>
>>>>>>>>> However, thinking about this a bit more, I wonder if the decision we
>>>>>>>>> made to
>>>>>>>>> allow all hugepages-xxkB/enabled controls to take "inherit" was the wrong
>>>>>>>>> one.
>>>>>>>>> Perhaps we should have only allowed the PMD-sized enable=inherit (this is
>>>>>>>>> just
>>>>>>>>> for legacy back compat after all, I don't think there is any use case
>>>>>>>>> where
>>>>>>>>> changing multiple mTHP size controls atomically is actually useful).
>>>>>>>>> Applying
>>>>>>>>
>>>>>>>> Agree. This is also our usage of 'inherit'.
>>>>>>
>>>>>> Missed that one: there might be use cases in the future once we would start
>>>>>> defaulting to "inherit" for all knobs (a distro might default to that) and
>>>>>> default-enable THP in the global knob. Then, it would be easy to disable any
>>>>>> THP
>>>>>> by disabling the global knob. (I think that's the future we're heading to,
>>>>>> where
>>>>>> we'd have an "auto" mode that can be set on the global toggle).
>>>>>>
>>>>>> But I am just making up use cases ;) I think it will be valuable and just
>>>>>> doing
>>>>>> it consistently now might be cleaner.
>>>>>
>>>>> I agree that consistency between enabled and shmem_enabled is top priority.
>>>>> And
>>>>> yes, I had forgotten about the glorious "auto" future. So probably continuing
>>>>> all sizes to select "inherit" is best.
>>>>>
>>>>> But for shmem_enabled, that means we need the following error checking:
>>>>>
>>>>> - It is an error to set "force" for any size except PMD-size
>>>>>
>>>>> - It is an error to set "force" for the global control if any size except
>>>>> PMD-
>>>>> size is set to "inherit"
>>>>>
>>>>> - It is an error to set "inherit" for any size except PMD-size if the
>>>>> global
>>>>> control is set to "force".
>>>>>
>>>>> Certainly not too difficult to code and prove to be correct, but not the
>>>>> nicest
>>>>> UX from the user's point of view when they start seeing errors.
>>>>>
>>>>> I think we previously said this would likely be temporary, and if/when tmpfs
>>>>> gets mTHP support, we could simplify and allow all sizes to be set to "force".
>>>>> But I wonder if tmpfs would ever need explicit mTHP control? Maybe it would be
>>>>> more suited to the approach the page cache takes to transparently ramp up the
>>>>> folio size as it faults more in. (Just saying there is a chance that this
>>>>> error
>>>>> checking becomes permanent).
>>>>
>>>> Note that with shmem you're inherently facing the same memory waste
>>>> issues etc as you would with anonymous memory. (sometimes even worse, if
>>>> you're running shmem that's configured to be unswappable!).
>>>
>>> Also noting that memory waste is not really a problem when a write to a shmem
>>> file allocates a large folio that stays within boundaries of that write; issues
>>> only pop up if you end up over-allocating, especially, during page faults where
>>> you have not that much clue about what to do (single address, no real range
>>> provided).
>>>
>>> There is the other issue that wasting large chunks of contiguous memory on stuff
>>> that barely benefits from it. With memory that maybe never gets evicted, there
>>> is no automatic "handing back" of that memory to the system to be used by
>>> something else. With ordinary files, that's a bit different. But I did not look
>>> closer into that issue yet, it's one of the reasons MADV_HUGEPAGE was added
>>> IIRC.
>>
>> OK understood. Although, with tmpfs you're not going to mmap it then randomly
>> extend the file through page faults - mmap doesn't permit that, I don't think?
>> So presumably the user must explicitly set the size of the file first? Are you
>> suggesting there are a lot of use cases where a large tmpfs file is created,
>> mmaped then only accessed sparsely?
>
> I don't know about "a lot of use cases", but for VMs that's certainly how it's
> used.
Gotcha, thanks. And out of curiosity, what's the benefit of using tmpfs
rather than private (or shared) anonymous memory for VMs?
On 08.05.24 14:54, Ryan Roberts wrote:
> On 08/05/2024 13:45, David Hildenbrand wrote:
>> On 08.05.24 14:43, Ryan Roberts wrote:
>>> On 08/05/2024 13:10, David Hildenbrand wrote:
>>>> On 08.05.24 14:02, David Hildenbrand wrote:
>>>>> On 08.05.24 11:02, Ryan Roberts wrote:
>>>>>> On 08/05/2024 08:12, David Hildenbrand wrote:
>>>>>>> On 08.05.24 09:08, David Hildenbrand wrote:
>>>>>>>> On 08.05.24 06:45, Baolin Wang wrote:
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> On 2024/5/7 18:52, Ryan Roberts wrote:
>>>>>>>>>> On 06/05/2024 09:46, Baolin Wang wrote:
>>>>>>>>>>> To support the use of mTHP with anonymous shmem, add a new sysfs
>>>>>>>>>>> interface
>>>>>>>>>>> 'shmem_enabled' in the
>>>>>>>>>>> '/sys/kernel/mm/transparent_hugepage/hugepages-kB/'
>>>>>>>>>>> directory for each mTHP to control whether shmem is enabled for that
>>>>>>>>>>> mTHP,
>>>>>>>>>>> with a value similar to the top level 'shmem_enabled', which can be
>>>>>>>>>>> set to:
>>>>>>>>>>> "always", "inherit (to inherit the top level setting)", "within_size",
>>>>>>>>>>> "advise",
>>>>>>>>>>> "never", "deny", "force". These values follow the same semantics as
>>>>>>>>>>> the top
>>>>>>>>>>> level, except the 'deny' is equivalent to 'never', and 'force' is
>>>>>>>>>>> equivalent
>>>>>>>>>>> to 'always' to keep compatibility.
>>>>>>>>>>
>>>>>>>>>> We decided at [1] to not allow 'force' for non-PMD-sizes.
>>>>>>>>>>
>>>>>>>>>> [1]
>>>>>>>>>> https://lore.kernel.org/linux-mm/[email protected]/
>>>>>>>>>>
>>>>>>>>>> However, thinking about this a bit more, I wonder if the decision we
>>>>>>>>>> made to
>>>>>>>>>> allow all hugepages-xxkB/enabled controls to take "inherit" was the wrong
>>>>>>>>>> one.
>>>>>>>>>> Perhaps we should have only allowed the PMD-sized enable=inherit (this is
>>>>>>>>>> just
>>>>>>>>>> for legacy back compat after all, I don't think there is any use case
>>>>>>>>>> where
>>>>>>>>>> changing multiple mTHP size controls atomically is actually useful).
>>>>>>>>>> Applying
>>>>>>>>>
>>>>>>>>> Agree. This is also our usage of 'inherit'.
>>>>>>>
>>>>>>> Missed that one: there might be use cases in the future once we would start
>>>>>>> defaulting to "inherit" for all knobs (a distro might default to that) and
>>>>>>> default-enable THP in the global knob. Then, it would be easy to disable any
>>>>>>> THP
>>>>>>> by disabling the global knob. (I think that's the future we're heading to,
>>>>>>> where
>>>>>>> we'd have an "auto" mode that can be set on the global toggle).
>>>>>>>
>>>>>>> But I am just making up use cases ;) I think it will be valuable and just
>>>>>>> doing
>>>>>>> it consistently now might be cleaner.
>>>>>>
>>>>>> I agree that consistency between enabled and shmem_enabled is top priority.
>>>>>> And
>>>>>> yes, I had forgotten about the glorious "auto" future. So probably continuing
>>>>>> all sizes to select "inherit" is best.
>>>>>>
>>>>>> But for shmem_enabled, that means we need the following error checking:
>>>>>>
>>>>>> - It is an error to set "force" for any size except PMD-size
>>>>>>
>>>>>> - It is an error to set "force" for the global control if any size except
>>>>>> PMD-
>>>>>> size is set to "inherit"
>>>>>>
>>>>>> - It is an error to set "inherit" for any size except PMD-size if the
>>>>>> global
>>>>>> control is set to "force".
>>>>>>
>>>>>> Certainly not too difficult to code and prove to be correct, but not the
>>>>>> nicest
>>>>>> UX from the user's point of view when they start seeing errors.
>>>>>>
>>>>>> I think we previously said this would likely be temporary, and if/when tmpfs
>>>>>> gets mTHP support, we could simplify and allow all sizes to be set to "force".
>>>>>> But I wonder if tmpfs would ever need explicit mTHP control? Maybe it would be
>>>>>> more suited to the approach the page cache takes to transparently ramp up the
>>>>>> folio size as it faults more in. (Just saying there is a chance that this
>>>>>> error
>>>>>> checking becomes permanent).
>>>>>
>>>>> Note that with shmem you're inherently facing the same memory waste
>>>>> issues etc as you would with anonymous memory. (sometimes even worse, if
>>>>> you're running shmem that's configured to be unswappable!).
>>>>
>>>> Also noting that memory waste is not really a problem when a write to a shmem
>>>> file allocates a large folio that stays within boundaries of that write; issues
>>>> only pop up if you end up over-allocating, especially, during page faults where
>>>> you have not that much clue about what to do (single address, no real range
>>>> provided).
>>>>
>>>> There is the other issue that wasting large chunks of contiguous memory on stuff
>>>> that barely benefits from it. With memory that maybe never gets evicted, there
>>>> is no automatic "handing back" of that memory to the system to be used by
>>>> something else. With ordinary files, that's a bit different. But I did not look
>>>> closer into that issue yet, it's one of the reasons MADV_HUGEPAGE was added
>>>> IIRC.
>>>
>>> OK understood. Although, with tmpfs you're not going to mmap it then randomly
>>> extend the file through page faults - mmap doesn't permit that, I don't think?
>>> So presumably the user must explicitly set the size of the file first? Are you
>>> suggesting there are a lot of use cases where a large tmpfs file is created,
>>> mmaped then only accessed sparsely?
>>
>> I don't know about "a lot of use cases", but for VMs that's certainly how it's
>> used.
>
There are more details around that and the sparsity (memory ballooning,
virtio-mem, free page reporting), but it might distract here :) I'll
note that shmem+THP is known to be problematic with memory ballooning.
> Gottya, thanks. And out of curiosity, what's the benefit of using tmpfs rather
> than private (or shared) anonymous memory for VMs?
The primary use case I know of is sharing VM memory with other processes
(usually not child processes): DPDK/SPDK and other vhost-user variants
(such as virtiofs) mmap() all guest memory to access it directly (some
sort of multi-process hypervisors). They either use real-file-based
shmem or memfd (essentially the same without a named file) for that.
Then, there is live-hypervisor upgrade, whereby you start a second
hypervisor process that will take over. People use shmem for that, so
you can minimize downtime by migrating guest memory simply by mmap'ing
the shmem file into the new hypervisor.
Shared anonymous memory is basically never used (I only know of one
corner case in QEMU).
I would assume that there are also DBs making use of rather sparse
shmem? But I'm no expert on that.
--
Cheers,
David / dhildenb
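(As an aside, the memfd-based sharing described above looks roughly like
this from userspace; the receiving process would get the fd over a unix
socket or via /proc/<pid>/fd and mmap() the same pages. A minimal
sketch, not taken from any of the hypervisors mentioned.)

    /* Minimal sketch of sharing "guest" memory via memfd, in the style
     * of the vhost-user setups mentioned above. Error handling is
     * abbreviated. */
    #define _GNU_SOURCE
    #include <string.h>
    #include <sys/mman.h>
    #include <unistd.h>

    int main(void)
    {
            size_t size = 64UL << 20;       /* 64M of "guest" memory */
            int fd = memfd_create("guest-mem", MFD_CLOEXEC);

            if (fd < 0 || ftruncate(fd, size) < 0)
                    return 1;

            /* The hypervisor maps and touches the memory ... */
            char *mem = mmap(NULL, size, PROT_READ | PROT_WRITE,
                             MAP_SHARED, fd, 0);
            if (mem == MAP_FAILED)
                    return 1;
            strcpy(mem, "hello from the hypervisor");

            /* ... and any process that receives 'fd' can mmap() the
             * same file and see the same pages. */
            munmap(mem, size);
            close(fd);
            return 0;
    }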
On 08/05/2024 14:07, David Hildenbrand wrote:
> On 08.05.24 14:54, Ryan Roberts wrote:
>> On 08/05/2024 13:45, David Hildenbrand wrote:
>>> On 08.05.24 14:43, Ryan Roberts wrote:
>>>> On 08/05/2024 13:10, David Hildenbrand wrote:
>>>>> On 08.05.24 14:02, David Hildenbrand wrote:
>>>>>> On 08.05.24 11:02, Ryan Roberts wrote:
>>>>>>> On 08/05/2024 08:12, David Hildenbrand wrote:
>>>>>>>> On 08.05.24 09:08, David Hildenbrand wrote:
>>>>>>>>> On 08.05.24 06:45, Baolin Wang wrote:
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> On 2024/5/7 18:52, Ryan Roberts wrote:
>>>>>>>>>>> On 06/05/2024 09:46, Baolin Wang wrote:
>>>>>>>>>>>> To support the use of mTHP with anonymous shmem, add a new sysfs
>>>>>>>>>>>> interface
>>>>>>>>>>>> 'shmem_enabled' in the
>>>>>>>>>>>> '/sys/kernel/mm/transparent_hugepage/hugepages-kB/'
>>>>>>>>>>>> directory for each mTHP to control whether shmem is enabled for that
>>>>>>>>>>>> mTHP,
>>>>>>>>>>>> with a value similar to the top level 'shmem_enabled', which can be
>>>>>>>>>>>> set to:
>>>>>>>>>>>> "always", "inherit (to inherit the top level setting)", "within_size",
>>>>>>>>>>>> "advise",
>>>>>>>>>>>> "never", "deny", "force". These values follow the same semantics as
>>>>>>>>>>>> the top
>>>>>>>>>>>> level, except the 'deny' is equivalent to 'never', and 'force' is
>>>>>>>>>>>> equivalent
>>>>>>>>>>>> to 'always' to keep compatibility.
>>>>>>>>>>>
>>>>>>>>>>> We decided at [1] to not allow 'force' for non-PMD-sizes.
>>>>>>>>>>>
>>>>>>>>>>> [1]
>>>>>>>>>>> https://lore.kernel.org/linux-mm/[email protected]/
>>>>>>>>>>>
>>>>>>>>>>> However, thinking about this a bit more, I wonder if the decision we
>>>>>>>>>>> made to
>>>>>>>>>>> allow all hugepages-xxkB/enabled controls to take "inherit" was the
>>>>>>>>>>> wrong
>>>>>>>>>>> one.
>>>>>>>>>>> Perhaps we should have only allowed the PMD-sized enable=inherit
>>>>>>>>>>> (this is
>>>>>>>>>>> just
>>>>>>>>>>> for legacy back compat after all, I don't think there is any use case
>>>>>>>>>>> where
>>>>>>>>>>> changing multiple mTHP size controls atomically is actually useful).
>>>>>>>>>>> Applying
>>>>>>>>>>
>>>>>>>>>> Agree. This is also our usage of 'inherit'.
>>>>>>>>
>>>>>>>> Missed that one: there might be use cases in the future once we would start
>>>>>>>> defaulting to "inherit" for all knobs (a distro might default to that) and
>>>>>>>> default-enable THP in the global knob. Then, it would be easy to disable
>>>>>>>> any
>>>>>>>> THP
>>>>>>>> by disabling the global knob. (I think that's the future we're heading to,
>>>>>>>> where
>>>>>>>> we'd have an "auto" mode that can be set on the global toggle).
>>>>>>>>
>>>>>>>> But I am just making up use cases ;) I think it will be valuable and just
>>>>>>>> doing
>>>>>>>> it consistently now might be cleaner.
>>>>>>>
>>>>>>> I agree that consistency between enabled and shmem_enabled is top priority.
>>>>>>> And
>>>>>>> yes, I had forgotten about the glorious "auto" future. So probably
>>>>>>> continuing
>>>>>>> all sizes to select "inherit" is best.
>>>>>>>
>>>>>>> But for shmem_enabled, that means we need the following error checking:
>>>>>>>
>>>>>>> - It is an error to set "force" for any size except PMD-size
>>>>>>>
>>>>>>> - It is an error to set "force" for the global control if any size
>>>>>>> except
>>>>>>> PMD-
>>>>>>> size is set to "inherit"
>>>>>>>
>>>>>>> - It is an error to set "inherit" for any size except PMD-size if the
>>>>>>> global
>>>>>>> control is set to "force".
>>>>>>>
>>>>>>> Certainly not too difficult to code and prove to be correct, but not the
>>>>>>> nicest
>>>>>>> UX from the user's point of view when they start seeing errors.
>>>>>>>
>>>>>>> I think we previously said this would likely be temporary, and if/when tmpfs
>>>>>>> gets mTHP support, we could simplify and allow all sizes to be set to
>>>>>>> "force".
>>>>>>> But I wonder if tmpfs would ever need explicit mTHP control? Maybe it
>>>>>>> would be
>>>>>>> more suited to the approach the page cache takes to transparently ramp up
>>>>>>> the
>>>>>>> folio size as it faults more in. (Just saying there is a chance that this
>>>>>>> error
>>>>>>> checking becomes permanent).
>>>>>>
>>>>>> Note that with shmem you're inherently facing the same memory waste
>>>>>> issues etc as you would with anonymous memory. (sometimes even worse, if
>>>>>> you're running shmem that's configured to be unswappable!).
>>>>>
>>>>> Also noting that memory waste is not really a problem when a write to a shmem
>>>>> file allocates a large folio that stays within boundaries of that write;
>>>>> issues
>>>>> only pop up if you end up over-allocating, especially, during page faults
>>>>> where
>>>>> you have not that much clue about what to do (single address, no real range
>>>>> provided).
>>>>>
>>>>> There is the other issue that wasting large chunks of contiguous memory on
>>>>> stuff
>>>>> that barely benefits from it. With memory that maybe never gets evicted, there
>>>>> is no automatic "handing back" of that memory to the system to be used by
>>>>> something else. With ordinary files, that's a bit different. But I did not
>>>>> look
>>>>> closer into that issue yet, it's one of the reasons MADV_HUGEPAGE was added
>>>>> IIRC.
>>>>
>>>> OK understood. Although, with tmpfs you're not going to mmap it then randomly
>>>> extend the file through page faults - mmap doesn't permit that, I don't think?
>>>> So presumably the user must explicitly set the size of the file first? Are you
>>>> suggesting there are a lot of use cases where a large tmpfs file is created,
>>>> mmaped then only accessed sparsely?
>>>
>>> I don't know about "a lot of use cases", but for VMs that's certainly how it's
>>> used.
>>
>
> There are more details around that and the sparsity (memory ballooning,
> virtio-mem, free page reporting), but it might distract here :) I'll note that
> shmem+THP is known to be problematic with memory ballooning.
>
>> Gottya, thanks. And out of curiosity, what's the benefit of using tmpfs rather
>> than private (or shared) anonymous memory for VMs?
>
> The primary use case I know of is sharing VM memory with other processes
> (usually not child processes): DPDK/SPDK and other vhost-user variants (such as
> virtiofs) mmap() all guest memory to access it directly (some sort of
> multi-process hypervisors). They either use real-file-based shmem or memfd
> (essentially the same without a named file) for that.
>
> Then, there is live-hypervisor upgrade, whereby you start a second hypervisor
> process that will take over. People use shmem for that, so you can minimize
> downtime by migrating guest memory simply by mmap'ing the shmem file into the
> new hypervisor.
>
> Shared anonymous memory is basically never used (I only know of one corner case
> in QEMU).
>
> I would assume that there are also DBs making use of rather sparse shmem? But no
> expert on that.
>
That all makes sense - thanks for the lesson!
On Wed, May 08, 2024 at 01:58:19PM +0200, David Hildenbrand wrote:
> On 08.05.24 13:39, Daniel Gomez wrote:
> > On Mon, May 06, 2024 at 04:46:24PM +0800, Baolin Wang wrote:
> > > Anonymous pages have already been supported for multi-size (mTHP) allocation
> > > through commit 19eaf44954df, that can allow THP to be configured through the
> > > sysfs interface located at '/sys/kernel/mm/transparent_hugepage/hugepage-XXkb/enabled'.
> > >
> > > However, the anonymous shared pages will ignore the anonymous mTHP rule
> > > configured through the sysfs interface, and can only use the PMD-mapped
> > > THP, that is not reasonable. Many implement anonymous page sharing through
> > > mmap(MAP_SHARED | MAP_ANONYMOUS), especially in database usage scenarios,
> > > therefore, users expect to apply an unified mTHP strategy for anonymous pages,
> > > also including the anonymous shared pages, in order to enjoy the benefits of
> > > mTHP. For example, lower latency than PMD-mapped THP, smaller memory bloat
> > > than PMD-mapped THP, contiguous PTEs on ARM architecture to reduce TLB miss etc.
> > >
> > > The primary strategy is similar to supporting anonymous mTHP. Introduce
> > > a new interface '/mm/transparent_hugepage/hugepage-XXkb/shmem_enabled',
> > > which can have all the same values as the top-level
> > > '/sys/kernel/mm/transparent_hugepage/shmem_enabled', with adding a new
> > > additional "inherit" option. By default all sizes will be set to "never"
> > > except PMD size, which is set to "inherit". This ensures backward compatibility
> > > with the shmem enabled of the top level, meanwhile also allows independent
> > > control of shmem enabled for each mTHP.
> >
> > I'm trying to understand the adoption of mTHP and how it fits into the adoption
> > of (large) folios that the kernel is moving towards. Can you, or anyone involved
> > here, explain this? How much do they overlap, and can we benefit from having
> > both? Is there any argument against the adoption of large folios here that I
> > might have missed?
>
> mTHP are implemented using large folios, just like traditional PMD-sized THP
> are. (you really should explore the history of mTHP and how it all works
> internally)
I'll check the code more in depth. By any chance, are any of you going
to be at LSFMM this year? I have this session [1] scheduled for
Wednesday, and it would be nice to get your feedback on it and hear
whether you see this working together with mTHP/THP.
[1] https://lore.kernel.org/all/4ktpayu66noklllpdpspa3vm5gbmb5boxskcj2q6qn7md3pwwt@kvlu64pqwjzl/
>
> The biggest challenge with memory that cannot be evicted on memory pressure
> to be reclaimed (in contrast to your ordinary files in the pagecache) is
> memory waste, well, and placement of large chunks of memory in general,
> during page faults.
>
> In the worst case (no swap), you allocate a large chunk of memory once and
> it will stick around until freed: no reclaim of that memory.
I can see that path being triggered by some fstests but only for THP (where we
can actually reclaim memory).
>
> That's the reason why THP for anonymous memory and SHMEM have toggles to
> manually enable and configure them, in contrast to the pagecache. The same
> was done for mTHP for anonymous memory, and now (anon) shmem follows.
>
> There are plans to have, at some point, have it all working automatically,
> but a lot for that for anonymous memory (and shmem similarly) is still
> missing and unclear.
Thanks.
>
> --
> Cheers,
>
> David / dhildenb
>
On 08.05.24 16:28, Daniel Gomez wrote:
> On Wed, May 08, 2024 at 01:58:19PM +0200, David Hildenbrand wrote:
>> On 08.05.24 13:39, Daniel Gomez wrote:
>>> On Mon, May 06, 2024 at 04:46:24PM +0800, Baolin Wang wrote:
>>>> Anonymous pages have already been supported for multi-size (mTHP) allocation
>>>> through commit 19eaf44954df, that can allow THP to be configured through the
>>>> sysfs interface located at '/sys/kernel/mm/transparent_hugepage/hugepage-XXkb/enabled'.
>>>>
>>>> However, the anonymous shared pages will ignore the anonymous mTHP rule
>>>> configured through the sysfs interface, and can only use the PMD-mapped
>>>> THP, that is not reasonable. Many implement anonymous page sharing through
>>>> mmap(MAP_SHARED | MAP_ANONYMOUS), especially in database usage scenarios,
>>>> therefore, users expect to apply an unified mTHP strategy for anonymous pages,
>>>> also including the anonymous shared pages, in order to enjoy the benefits of
>>>> mTHP. For example, lower latency than PMD-mapped THP, smaller memory bloat
>>>> than PMD-mapped THP, contiguous PTEs on ARM architecture to reduce TLB miss etc.
>>>>
>>>> The primary strategy is similar to supporting anonymous mTHP. Introduce
>>>> a new interface '/mm/transparent_hugepage/hugepage-XXkb/shmem_enabled',
>>>> which can have all the same values as the top-level
>>>> '/sys/kernel/mm/transparent_hugepage/shmem_enabled', with adding a new
>>>> additional "inherit" option. By default all sizes will be set to "never"
>>>> except PMD size, which is set to "inherit". This ensures backward compatibility
>>>> with the shmem enabled of the top level, meanwhile also allows independent
>>>> control of shmem enabled for each mTHP.
>>>
>>> I'm trying to understand the adoption of mTHP and how it fits into the adoption
>>> of (large) folios that the kernel is moving towards. Can you, or anyone involved
>>> here, explain this? How much do they overlap, and can we benefit from having
>>> both? Is there any argument against the adoption of large folios here that I
>>> might have missed?
>>
>> mTHP are implemented using large folios, just like traditional PMD-sized THP
>> are. (you really should explore the history of mTHP and how it all works
>> internally)
>
> I'll check more in deep the code. By any chance are any of you going to be at
> LSFMM this year? I have this session [1] scheduled for Wednesday and it would
> be nice to get your feedback on it and if you see this working together with
> mTHP/THP.
>
I'll be around and will attend that session! But note that I am still
scratching my head about what to do with "ordinary" shmem, especially
because of the weird way shmem behaves in contrast to real files
(below). Some input from Hugh might be very helpful.
Example: you write() to a shmem file and populate a 2M THP. Then,
nobody touches that file for a long time. There are certainly other
mmap() users that could better benefit from that THP ... and without
swap, that THP will be trapped there, possibly for a long time (unless
I am missing an important piece of shmem THP design :) )? Sure, if we
only have THPs it's nice, but that's just not the reality,
unfortunately. IIRC, that's one of the reasons why THP for shmem can be
enabled/disabled. But again, still scratching my head ...
Note that this patch set only tackles anonymous shmem
(MAP_SHARED|MAP_ANON), which is in 99.999% of all cases only accessed
via page tables (memory allocated during page faults). I think there are
ways to grab the fd (/proc/self/fd), but IIRC only corner cases
read/write that.
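(For reference, the kind of mapping meant here is simply a shared
anonymous mmap(); a trivial illustration, not code from the patch set:)

    #define _GNU_SOURCE
    #include <stddef.h>
    #include <sys/mman.h>

    /* Anonymous shmem as targeted by this series: a shared anonymous
     * mapping, typically inherited across fork() and populated only
     * through page faults. */
    static void *alloc_shared_anon(size_t size)
    {
            return mmap(NULL, size, PROT_READ | PROT_WRITE,
                        MAP_SHARED | MAP_ANONYMOUS, -1, 0);
    }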
So in that sense, anonymous shmem (this patch set) behaves mostly like
ordinary anonymous memory, and likely there is not much overlap with
other "allocate large folios during read/write/fallocate" as in [1].
swap might have an overlap.
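(A minimal user-space sketch of the "anonymous shmem" case this patch set
targets; the 64K size is arbitrary and whether an mTHP folio is actually
used depends on the sysfs configuration:)
#include <string.h>
#include <sys/mman.h>
int main(void)
{
	size_t len = 64 * 1024;
	/* "Anonymous shmem": a shared anonymous mapping, no fd involved. */
	char *p = mmap(NULL, len, PROT_READ | PROT_WRITE,
		       MAP_SHARED | MAP_ANONYMOUS, -1, 0);
	if (p == MAP_FAILED)
		return 1;
	/* Only ever touched through page faults, never via read()/write(). */
	memset(p, 0, len);
	munmap(p, len);
	return 0;
}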
The real confusion begins when we have ordinary shmem: some users never
mmap it and only read/write, some users never read/write it and only
mmap it and some (less common?) users do both.
And shmem really is special: it looks like "just another file", but
memory-consumption and reclaim wise it behaves just like anonymous
memory. It might be swappable ("usually very limited backing disk space
available") or it might not.
In a subthread here we are discussing what to do with that special
"shmem_enabled = force" mode ... and it's all complicated I think.
> [1] https://lore.kernel.org/all/4ktpayu66noklllpdpspa3vm5gbmb5boxskcj2q6qn7md3pwwt@kvlu64pqwjzl/
>
>>
>> The biggest challenge with memory that cannot be evicted on memory pressure
>> to be reclaimed (in contrast to your ordinary files in the pagecache) is
>> memory waste, well, and placement of large chunks of memory in general,
>> during page faults.
>>
>> In the worst case (no swap), you allocate a large chunk of memory once and
>> it will stick around until freed: no reclaim of that memory.
>
> I can see that path being triggered by some fstests but only for THP (where we
> can actually reclaim memory).
Is that when we punch-hole a partial THP and split it? I'd be interested
in what that test does.
--
Cheers,
David / dhildenb
On Wed, May 08, 2024 at 01:58:19PM +0200, David Hildenbrand wrote:
> On 08.05.24 13:39, Daniel Gomez wrote:
> > On Mon, May 06, 2024 at 04:46:24PM +0800, Baolin Wang wrote:
> > > The primary strategy is similar to supporting anonymous mTHP. Introduce
> > > a new interface '/mm/transparent_hugepage/hugepage-XXkb/shmem_enabled',
> > > which can have all the same values as the top-level
> > > '/sys/kernel/mm/transparent_hugepage/shmem_enabled', with adding a new
> > > additional "inherit" option. By default all sizes will be set to "never"
> > > except PMD size, which is set to "inherit". This ensures backward compatibility
> > > with the shmem enabled of the top level, meanwhile also allows independent
> > > control of shmem enabled for each mTHP.
> >
> > I'm trying to understand the adoption of mTHP and how it fits into the adoption
> > of (large) folios that the kernel is moving towards. Can you, or anyone involved
> > here, explain this? How much do they overlap, and can we benefit from having
> > both? Is there any argument against the adoption of large folios here that I
> > might have missed?
>
> mTHP are implemented using large folios, just like traditional PMD-sized THP
> are.
>
> The biggest challenge with memory that cannot be evicted on memory pressure
> to be reclaimed (in contrast to your ordinary files in the pagecache) is
> memory waste, well, and placement of large chunks of memory in general,
> during page faults.
>
> In the worst case (no swap), you allocate a large chunk of memory once and
> it will stick around until freed: no reclaim of that memory.
>
> That's the reason why THP for anonymous memory and SHMEM have toggles to
> manually enable and configure them, in contrast to the pagecache. The same
> was done for mTHP for anonymous memory, and now (anon) shmem follows.
>
> There are plans to have, at some point, have it all working automatically,
> but a lot for that for anonymous memory (and shmem similarly) is still
> missing and unclear.
Whereas the use of large folios for filesystems is already automatic,
so long as the filesystem supports them. We already do this in the readahead
and write paths for iomap: we opportunistically use large folios if we can,
otherwise we fall back to smaller folios.
So an approach recommended by Matthew was to use the readahead and write
paths, just as iomap does, to determine the size of the folio to use [0].
The use of large folios would then also be automatic and not require any
knobs at all.
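(A rough sketch of that "opportunistic" sizing; the helper below is made up
for illustration and is not the actual readahead/iomap code, which also caps
the order and respects the mapping's limits:)
static struct folio *alloc_best_effort_folio(unsigned int ideal_order, gfp_t gfp)
{
	unsigned int order;
	/* Try the largest order that covers the request first ... */
	for (order = ideal_order; order > 0; order--) {
		struct folio *folio = folio_alloc(gfp, order);
		if (folio)
			return folio;
	}
	/* ... and always fall back to a single page. */
	return folio_alloc(gfp, 0);
}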
The mTHP approach would grow "THP" use in filesystems via the only
filesystem that currently uses THP. Meanwhile, use of large folios is already
automatic with the approach taken by iomap.
We're at a crux where it begs the question whether we should continue to
chug on with tmpfs being special and doing things differently, extending
the old THP interface with mTHP, or whether it should just use large folios
with the same approach iomap took.
From my perspective, the more shared code the better, and the more shared
paths the better. There is a chance to help test swap with large folios
instead of splitting the folios for swap, and that could be done first with
tmpfs. I have not evaluated the difference in testing, or how we could get
the most shared code if we take the mTHP approach or the iomap approach for
tmpfs; that should be considered.
Are there other things to consider? Does this require some dialog at
LSFMM?
[0] https://lore.kernel.org/all/[email protected]/
Luis
On 2024/5/8 18:47, Ryan Roberts wrote:
> On 08/05/2024 10:31, Baolin Wang wrote:
>>
>>
>> On 2024/5/8 16:53, Ryan Roberts wrote:
>>> On 08/05/2024 04:44, Baolin Wang wrote:
>>>>
>>>>
>>>> On 2024/5/7 18:37, Ryan Roberts wrote:
>>>>> On 06/05/2024 09:46, Baolin Wang wrote:
>>>>>> Add large folio mapping establishment support for finish_fault() as a
>>>>>> preparation,
>>>>>> to support multi-size THP allocation of anonymous shmem pages in the following
>>>>>> patches.
>>>>>>
>>>>>> Signed-off-by: Baolin Wang <[email protected]>
>>>>>> ---
>>>>>> mm/memory.c | 43 +++++++++++++++++++++++++++++++++----------
>>>>>> 1 file changed, 33 insertions(+), 10 deletions(-)
>>>>>>
>>>>>> diff --git a/mm/memory.c b/mm/memory.c
>>>>>> index eea6e4984eae..936377220b77 100644
>>>>>> --- a/mm/memory.c
>>>>>> +++ b/mm/memory.c
>>>>>> @@ -4747,9 +4747,12 @@ vm_fault_t finish_fault(struct vm_fault *vmf)
>>>>>> {
>>>>>> struct vm_area_struct *vma = vmf->vma;
>>>>>> struct page *page;
>>>>>> + struct folio *folio;
>>>>>> vm_fault_t ret;
>>>>>> bool is_cow = (vmf->flags & FAULT_FLAG_WRITE) &&
>>>>>> !(vma->vm_flags & VM_SHARED);
>>>>>> + int type, nr_pages, i;
>>>>>> + unsigned long addr = vmf->address;
>>>>>> /* Did we COW the page? */
>>>>>> if (is_cow)
>>>>>> @@ -4780,24 +4783,44 @@ vm_fault_t finish_fault(struct vm_fault *vmf)
>>>>>> return VM_FAULT_OOM;
>>>>>> }
>>>>>> + folio = page_folio(page);
>>>>>> + nr_pages = folio_nr_pages(folio);
>>>>>> +
>>>>>> + if (unlikely(userfaultfd_armed(vma))) {
>>>>>> + nr_pages = 1;
>>>>>> + } else if (nr_pages > 1) {
>>>>>> + unsigned long start = ALIGN_DOWN(vmf->address, nr_pages * PAGE_SIZE);
>>>>>> + unsigned long end = start + nr_pages * PAGE_SIZE;
>>>>>> +
>>>>>> + /* In case the folio size in page cache beyond the VMA limits. */
>>>>>> + addr = max(start, vma->vm_start);
>>>>>> + nr_pages = (min(end, vma->vm_end) - addr) >> PAGE_SHIFT;
>>>>>> +
>>>>>> + page = folio_page(folio, (addr - start) >> PAGE_SHIFT);
>>>>>
>>>>> I still don't really follow the logic in this else if block. Isn't it possible
>>>>> that finish_fault() gets called with a page from a folio that isn't aligned
>>>>> with
>>>>> vmf->address?
>>>>>
>>>>> For example, let's say we have a file who's size is 64K and which is cached
>>>>> in a
>>>>> single large folio in the page cache. But the file is mapped into a process at
>>>>> VA 16K to 80K. Let's say we fault on the first page (VA=16K). You will
>>>>> calculate
>>>>
>>>> For shmem, this doesn't happen because the VA is aligned with the hugepage size
>>>> in the shmem_get_unmapped_area() function. See patch 7.
>>>
>>> Certainly agree that shmem can always make sure that it packs a vma in a way
>>> such that its folios are naturally aligned in VA when faulting in memory. If you
>>> mremap it, that alignment will be lost; I don't think that would be a problem
>>
>> When mremap it, it will also call shmem_get_unmapped_area() to align the VA, but
>> for mremap() with MAP_FIXED flag as David pointed out, yes, this patch may be
>> not work perfectly.
>
> Assuming this works similarly to anon mTHP, remapping to an arbitrary address
> shouldn't be a problem within a single process; the previously allocated folios
> will now be unaligned, but they will be correctly mapped so it doesn't break
> anything. And new faults will allocate folios so that they are as large as
> allowed by the sysfs interface AND which do not overlap with any non-none pte
> AND which are naturally aligned. It's when you start sharing with other
> processes that the fun and games start...
>
>>
>>> for a single process; mremap will take care of moving the ptes correctly and
>>> this path is not involved.
>>>
>>> But what about the case when a process mmaps a shmem region, then forks, then
>>> the child mremaps the shmem region. Then the parent faults in a THP into the
>>> region (nicely aligned). Then the child faults in the same offset in the region
>>> and gets the THP that the parent allocated; that THP will be aligned in the
>>> parent's VM space but not in the child's.
>>
>> Sorry, I did not get your point here. IIUC, the child's VA will also be aligned
>> if the child mremap is not set MAP_FIXED, since the child's mremap will still
>> call shmem_get_unmapped_area() to find an aligned new VA.
>
> In general, you shouldn't be relying on the vma bounds being aligned to a THP
> boundary.
>
>> Please correct me if I missed your point.
>
> (I'm not 100% sure this is definitely how it works, but seems the only sane way
> to me):
>
> Let's imagine we have a process that maps 4 pages of shared anon memory at VA=64K:
>
> mmap(64K, 16K, PROT_X, MAP_SHARED | MAP_ANONYMOUS | MAP_FIXED, ...)
>
> Then it forks a child, and the child moves the mapping to VA=68K:
>
> mremap(64K, 16K, 16K, MREMAP_FIXED | MREMAP_MAYMOVE, 68K)
>
> Then the parent writes to address 64K (offset 0 in the shared region); this will
> fault and cause a 16K mTHP to be allocated and mapped, covering the whole region
> at 64K-80K in the parent.
>
> Then the child reads address 68K (offset 0 in the shared region); this will
> fault and cause the previously allocated 16K folio to be looked up and it must
> be mapped in the child between 68K-84K. This is not naturally aligned in the child.
>
> For the child, your code will incorrectly calculate start/end as 64K-80K.
OK, so you set the MREMAP_FIXED flag, just as David pointed out. Yes, it
will not be aligned in the child in this case. Thanks for the explanation.
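(For reference, a minimal user-space sketch of the scenario described above;
the addresses and sizes follow Ryan's example, and whether the parent's fault
actually allocates a 16K mTHP depends on the sysfs settings:)
#define _GNU_SOURCE
#include <sys/mman.h>
#include <sys/wait.h>
#include <unistd.h>
int main(void)
{
	size_t len = 16 * 1024;
	/* Parent maps 16K of shared anonymous memory at VA 64K. */
	char *p = mmap((void *)0x10000, len, PROT_READ | PROT_WRITE,
		       MAP_SHARED | MAP_ANONYMOUS | MAP_FIXED, -1, 0);
	if (p == MAP_FAILED)
		return 1;
	if (fork() == 0) {
		/* Child moves its copy of the mapping to VA 68K. */
		char *q = mremap(p, len, len, MREMAP_MAYMOVE | MREMAP_FIXED,
				 (void *)0x11000);
		if (q == MAP_FAILED)
			return 1;
		sleep(1);	/* let the parent fault first */
		/*
		 * Reading offset 0 maps the folio the parent allocated, but
		 * at 68K-84K: it is no longer naturally aligned in the child.
		 */
		return q[0];
	}
	/* Parent faults at 64K; a 16K folio would cover 64K-80K, aligned. */
	p[0] = 1;
	wait(NULL);
	return 0;
}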
On 2024/5/8 22:28, Daniel Gomez wrote:
> On Wed, May 08, 2024 at 01:58:19PM +0200, David Hildenbrand wrote:
>> On 08.05.24 13:39, Daniel Gomez wrote:
>>> On Mon, May 06, 2024 at 04:46:24PM +0800, Baolin Wang wrote:
>>>> Anonymous pages have already been supported for multi-size (mTHP) allocation
>>>> through commit 19eaf44954df, that can allow THP to be configured through the
>>>> sysfs interface located at '/sys/kernel/mm/transparent_hugepage/hugepage-XXkb/enabled'.
>>>>
>>>> However, the anonymous shared pages will ignore the anonymous mTHP rule
>>>> configured through the sysfs interface, and can only use the PMD-mapped
>>>> THP, that is not reasonable. Many implement anonymous page sharing through
>>>> mmap(MAP_SHARED | MAP_ANONYMOUS), especially in database usage scenarios,
>>>> therefore, users expect to apply an unified mTHP strategy for anonymous pages,
>>>> also including the anonymous shared pages, in order to enjoy the benefits of
>>>> mTHP. For example, lower latency than PMD-mapped THP, smaller memory bloat
>>>> than PMD-mapped THP, contiguous PTEs on ARM architecture to reduce TLB miss etc.
>>>>
>>>> The primary strategy is similar to supporting anonymous mTHP. Introduce
>>>> a new interface '/mm/transparent_hugepage/hugepage-XXkb/shmem_enabled',
>>>> which can have all the same values as the top-level
>>>> '/sys/kernel/mm/transparent_hugepage/shmem_enabled', with adding a new
>>>> additional "inherit" option. By default all sizes will be set to "never"
>>>> except PMD size, which is set to "inherit". This ensures backward compatibility
>>>> with the shmem enabled of the top level, meanwhile also allows independent
>>>> control of shmem enabled for each mTHP.
>>>
>>> I'm trying to understand the adoption of mTHP and how it fits into the adoption
>>> of (large) folios that the kernel is moving towards. Can you, or anyone involved
>>> here, explain this? How much do they overlap, and can we benefit from having
>>> both? Is there any argument against the adoption of large folios here that I
>>> might have missed?
>>
>> mTHP are implemented using large folios, just like traditional PMD-sized THP
>> are. (you really should explore the history of mTHP and how it all works
>> internally)
>
> I'll check more in deep the code. By any chance are any of you going to be at
> LSFMM this year? I have this session [1] scheduled for Wednesday and it would
> be nice to get your feedback on it and if you see this working together with
> mTHP/THP.
>
> [1] https://lore.kernel.org/all/4ktpayu66noklllpdpspa3vm5gbmb5boxskcj2q6qn7md3pwwt@kvlu64pqwjzl/
Great. I'm also interested in tmpfs support for large folios (or mTHP),
so please CC me if you plan to send a new version.
As David mentioned, this patchset is mainly about adding mTHP support
for anonymous shmem, and I think that some of the swap support for large
folios could work together.
On 08.05.24 21:23, Luis Chamberlain wrote:
> On Wed, May 08, 2024 at 01:58:19PM +0200, David Hildenbrand wrote:
>> On 08.05.24 13:39, Daniel Gomez wrote:
>>> On Mon, May 06, 2024 at 04:46:24PM +0800, Baolin Wang wrote:
>>>> The primary strategy is similar to supporting anonymous mTHP. Introduce
>>>> a new interface '/mm/transparent_hugepage/hugepage-XXkb/shmem_enabled',
>>>> which can have all the same values as the top-level
>>>> '/sys/kernel/mm/transparent_hugepage/shmem_enabled', with adding a new
>>>> additional "inherit" option. By default all sizes will be set to "never"
>>>> except PMD size, which is set to "inherit". This ensures backward compatibility
>>>> with the shmem enabled of the top level, meanwhile also allows independent
>>>> control of shmem enabled for each mTHP.
>>>
>>> I'm trying to understand the adoption of mTHP and how it fits into the adoption
>>> of (large) folios that the kernel is moving towards. Can you, or anyone involved
>>> here, explain this? How much do they overlap, and can we benefit from having
>>> both? Is there any argument against the adoption of large folios here that I
>>> might have missed?
>>
>> mTHP are implemented using large folios, just like traditional PMD-sized THP
>> are.
>>
>> The biggest challenge with memory that cannot be evicted on memory pressure
>> to be reclaimed (in contrast to your ordinary files in the pagecache) is
>> memory waste, well, and placement of large chunks of memory in general,
>> during page faults.
>>
>> In the worst case (no swap), you allocate a large chunk of memory once and
>> it will stick around until freed: no reclaim of that memory.
>>
>> That's the reason why THP for anonymous memory and SHMEM have toggles to
>> manually enable and configure them, in contrast to the pagecache. The same
>> was done for mTHP for anonymous memory, and now (anon) shmem follows.
>>
>> There are plans to have, at some point, have it all working automatically,
>> but a lot for that for anonymous memory (and shmem similarly) is still
>> missing and unclear.
>
> Whereas the use of large folios for filesystems is already automatic,
> so long as the filesystem supports them. We already do this in the readahead
> and write paths for iomap: we opportunistically use large folios if we can,
> otherwise we fall back to smaller folios.
>
> So an approach recommended by Matthew was to use the readahead and write
> paths, just as iomap does, to determine the size of the folio to use [0].
> The use of large folios would then also be automatic and not require any
> knobs at all.
Yes, I remember discussing that with Willy at some point, including why
shmem is unfortunately a bit more "special", because you might not even
have a disk backend ("swap") at all where you could easily reclaim memory.
In the extreme form, you can consider SHMEM as memory that might be
always mlocked, even without the user requiring special mlock limits ...
>
> The mTHP approach would grow "THP" use in filesystems via the only
> filesystem that currently uses THP. Meanwhile, use of large folios is already
> automatic with the approach taken by iomap.
Yes, it's the extension of existing shmem_enabled (that -- I'm afraid --
was added for good reasons).
>
> We're at a crux where it begs the question whether we should continue to
> chug on with tmpfs being special and doing things differently, extending
> the old THP interface with mTHP, or whether it should just use large folios
> with the same approach iomap took.
I'm afraid shmem will remain to some degree special. Fortunately it's
not alone, hugetlbfs is even more special ;)
>
> From my perspective, the more shared code the better, and the more shared
> paths the better. There is a chance to help test swap with large folios
> instead of splitting the folios for swap, and that could be done first with
> tmpfs. I have not evaluated the difference in testing, or how we could get
> the most shared code if we take the mTHP approach or the iomap approach for
> tmpfs; that should be considered.
I don't have a clear picture yet of what might be best for ordinary
shmem (IOW, not MAP_SHARED|MAP_ANON), and I'm afraid there is no easy
answer.
As long as we don't end up wasting memory, it's not obviously bad. But
some things might be tricky (see my example about large folios stranding
in shmem and never being able to be really reclaimed+reused for better
purposes)
I'll note that mTHP really is just (supposed to be) a user interface to
enable the various folio sizes (well, and to expose better per-size
stats), not more.
From that point of view, it's just a filter. Enable all, and you get
the same behavior as you likely would in the pagecache mode.
From a shared-code and testing point of view, there really wouldn't be
a lot of differences. Again, essentially just a filter.
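(A hedged sketch of that "filter" view; the helper and bitmask below are made
up for illustration, not actual kernel interfaces: the per-size knobs simply
mask which orders the normal allocation path may pick, and enabling all of
them leaves the choice entirely to that path:)
/* Illustrative only: pick the largest enabled order at or below the ideal. */
static int pick_allowed_order(unsigned long enabled_orders, int ideal_order)
{
	int order;
	for (order = ideal_order; order > 0; order--)
		if (enabled_orders & (1UL << order))
			return order;
	return 0;	/* order-0 (a single page) is always allowed */
}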
>
> Are there other things to consider? Does this require some dialog at
> LSFMM?
As raised in my reply to Daniel, I'll be at LSF/MM and happy to discuss.
I'm also not a SHMEM expert, so I'm hoping at some point we'd get
feedback from Hugh.
--
Cheers,
David / dhildenb
On Wed, May 08, 2024 at 07:03:57PM +0200, David Hildenbrand wrote:
> On 08.05.24 16:28, Daniel Gomez wrote:
> > On Wed, May 08, 2024 at 01:58:19PM +0200, David Hildenbrand wrote:
> > > On 08.05.24 13:39, Daniel Gomez wrote:
> > > > On Mon, May 06, 2024 at 04:46:24PM +0800, Baolin Wang wrote:
> > > > > Anonymous pages have already been supported for multi-size (mTHP) allocation
> > > > > through commit 19eaf44954df, that can allow THP to be configured through the
> > > > > sysfs interface located at '/sys/kernel/mm/transparent_hugepage/hugepage-XXkb/enabled'.
> > > > >
> > > > > However, the anonymous shared pages will ignore the anonymous mTHP rule
> > > > > configured through the sysfs interface, and can only use the PMD-mapped
> > > > > THP, that is not reasonable. Many implement anonymous page sharing through
> > > > > mmap(MAP_SHARED | MAP_ANONYMOUS), especially in database usage scenarios,
> > > > > therefore, users expect to apply an unified mTHP strategy for anonymous pages,
> > > > > also including the anonymous shared pages, in order to enjoy the benefits of
> > > > > mTHP. For example, lower latency than PMD-mapped THP, smaller memory bloat
> > > > > than PMD-mapped THP, contiguous PTEs on ARM architecture to reduce TLB miss etc.
> > > > >
> > > > > The primary strategy is similar to supporting anonymous mTHP. Introduce
> > > > > a new interface '/mm/transparent_hugepage/hugepage-XXkb/shmem_enabled',
> > > > > which can have all the same values as the top-level
> > > > > '/sys/kernel/mm/transparent_hugepage/shmem_enabled', with adding a new
> > > > > additional "inherit" option. By default all sizes will be set to "never"
> > > > > except PMD size, which is set to "inherit". This ensures backward compatibility
> > > > > with the shmem enabled of the top level, meanwhile also allows independent
> > > > > control of shmem enabled for each mTHP.
> > > >
> > > > I'm trying to understand the adoption of mTHP and how it fits into the adoption
> > > > of (large) folios that the kernel is moving towards. Can you, or anyone involved
> > > > here, explain this? How much do they overlap, and can we benefit from having
> > > > both? Is there any argument against the adoption of large folios here that I
> > > > might have missed?
> > >
> > > mTHP are implemented using large folios, just like traditional PMD-sized THP
> > > are. (you really should explore the history of mTHP and how it all works
> > > internally)
> >
> > I'll check more in deep the code. By any chance are any of you going to be at
> > LSFMM this year? I have this session [1] scheduled for Wednesday and it would
> > be nice to get your feedback on it and if you see this working together with
> > mTHP/THP.
> >
>
> I'll be around and will attend that session! But note that I am still
> scratching my head what to do with "ordinary" shmem, especially because of
> the weird way shmem behaves in contrast to real files (below). Some input
> from Hugh might be very helpful.
I'm looking forward to meeting you there and getting your feedback!
>
> Example: you write() to a shmem file and populate a 2M THP. Then, nobody
> touches that file for a long time. There are certainly other mmap() users
> that could better benefit from that THP ... and without swap that THP will
> be trapped there possibly a long time (unless I am missing an important
> piece of shmem THP design :) )? Sure, if we only have THPs it's nice,
> that's just not the reality unfortunately. IIRC, that's one of the reasons
> why THP for shmem can be enabled/disabled. But again, still scratching my
> head ...
>
>
> Note that this patch set only tackles anonymous shmem (MAP_SHARED|MAP_ANON),
> which is in 99.999% of all cases only accessed via page tables (memory
> allocated during page faults). I think there are ways to grab the fd
> (/proc/self/fd), but IIRC only corner cases read/write that.
>
> So in that sense, anonymous shmem (this patch set) behaves mostly like
> ordinary anonymous memory, and likely there is not much overlap with other
> "allocate large folios during read/write/fallocate" as in [1]. swap might
> have an overlap.
>
>
> The real confusion begins when we have ordinary shmem: some users never mmap
> it and only read/write, some users never read/write it and only mmap it and
> some (less common?) users do both.
>
> And shmem really is special: it looks like "just another file", but
> memory-consumption and reclaim wise it behaves just like anonymous memory.
> It might be swappable ("usually very limited backing disk space available")
> or it might not.
>
> In a subthread here we are discussing what to do with that special
> "shmem_enabled = force" mode ... and it's all complicated I think.
>
> > [1] https://lore.kernel.org/all/4ktpayu66noklllpdpspa3vm5gbmb5boxskcj2q6qn7md3pwwt@kvlu64pqwjzl/
> >
> > >
> > > The biggest challenge with memory that cannot be evicted on memory pressure
> > > to be reclaimed (in contrast to your ordinary files in the pagecache) is
> > > memory waste, well, and placement of large chunks of memory in general,
> > > during page faults.
> > >
> > > In the worst case (no swap), you allocate a large chunk of memory once and
> > > it will stick around until freed: no reclaim of that memory.
> >
> > I can see that path being triggered by some fstests but only for THP (where we
> > can actually reclaim memory).
>
> Is that when we punch-hole a partial THP and split it? I'd be interested in
> what that test does.
The reclaim path I'm referring to is triggered when we reach max capacity
(-ENOSPC) in shmem_alloc_and_add_folio(). We reclaim space by splitting large
folios (regardless of their dirty or uptodate condition).
One of the tests that hits this path is generic/100 (with the huge option enabled):
- First, it creates a directory structure in $TEMP_DIR (/tmp). The directory
is around 26M in size.
- Then, it tars it up into $TEMP_DIR/temp.tar.
- Finally, it untars the archive into $TEST_DIR (/media/test, which is the
huge tmpfs mountdir).
What happens in generic/100 under the huge=always case is that you fill up
the dedicated space very quickly (this is 1G in xfstests for tmpfs) and then
you start reclaiming.
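(A minimal sketch of hitting that path by hand; the /media/test mount point
is taken from the description above, and generic/100 itself fills the space
by untarring a directory tree rather than writing a single file:)
#include <errno.h>
#include <fcntl.h>
#include <string.h>
#include <unistd.h>
int main(void)
{
	static char buf[1 << 20];
	ssize_t n;
	int fd = open("/media/test/fill", O_CREAT | O_WRONLY | O_TRUNC, 0600);
	if (fd < 0)
		return 1;
	memset(buf, 'x', sizeof(buf));
	/*
	 * Keep writing until the huge tmpfs is full; per the description
	 * above, shmem reclaims space by splitting large folios before it
	 * finally fails the write with -ENOSPC.
	 */
	do {
		n = write(fd, buf, sizeof(buf));
	} while (n > 0);
	int enospc = (n < 0 && errno == ENOSPC);
	close(fd);
	return enospc ? 0 : 1;
}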
On Thu, May 09, 2024 at 07:48:46PM +0200, David Hildenbrand wrote:
> On 08.05.24 21:23, Luis Chamberlain wrote:
> > From my perspective, the more shared code the better, and the more shared
> > paths the better. There is a chance to help test swap with large folios
> > instead of splitting the folios for swap, and that could be done first with
> > tmpfs. I have not evaluated the difference in testing, or how we could get
> > the most shared code if we take the mTHP approach or the iomap approach for
> > tmpfs; that should be considered.
>
> I don't have a clear picture yet of what might be best for ordinary shmem
> (IOW, not MAP_SHARED|MAP_ANON), and I'm afraid there is no easy answer.
OK, so it sounds like the different options need to be thought out and
reviewed.
> As long as we don't end up wasting memory, it's not obviously bad.
Sure.
> But some
> things might be tricky (see my example about large folios stranding in shmem
> and never being able to be really reclaimed+reused for better purposes)
Where is that stated BTW? Could that be resolved?
> I'll note that mTHP really is just (supposed to be) a user interface to
> enable the various folio sizes (well, and to expose better per-size stats),
> not more.
Sure, but given that filesystems using large folios don't have silly APIs
for choosing which large folios to enable, it just seems odd for tmpfs to
take a different approach.
> From that point of view, it's just a filter. Enable all, and you get the
> same behavior as you likely would in the pagecache mode.
Which begs the question: *why* have an API just to constrain us to certain
large folios, which diverges from what filesystems are doing with large
folios?
> > Are there other things to consider? Does this require some dialog at
> > LSFMM?
>
> As raised in my reply to Daniel, I'll be at LSF/MM and happy to discuss. I'm
> also not a SHMEM expert, so I'm hoping at some point we'd get feedback from
> Hugh.
Hugh, will you be at LSFMM?
Luis