2024-05-30 02:04:48

by Baolin Wang

Subject: [PATCH v3 0/6] add mTHP support for anonymous shmem

Multi-size THP (mTHP) allocation is already supported for anonymous pages via
commit 19eaf44954df, which allows THP to be configured through the sysfs
interface located at '/sys/kernel/mm/transparent_hugepage/hugepages-XXkB/enabled'.

However, anonymous shmem ignores the mTHP rules configured through that sysfs
interface and can only use PMD-mapped THP, which is not reasonable. Many
applications implement anonymous page sharing through mmap(MAP_SHARED |
MAP_ANONYMOUS), especially in database usage scenarios, so users expect a
unified mTHP strategy for anonymous pages, including anonymous shared pages,
in order to enjoy the benefits of mTHP: for example, lower allocation latency
than PMD-mapped THP, smaller memory bloat than PMD-mapped THP, and contiguous
PTEs on ARM architectures to reduce TLB misses.

The primary strategy is similar to the existing anonymous mTHP support.
Introduce a new interface
'/sys/kernel/mm/transparent_hugepage/hugepages-XXkB/shmem_enabled', which
accepts all the same values as the top-level
'/sys/kernel/mm/transparent_hugepage/shmem_enabled', plus an additional
"inherit" option. By default all sizes are set to "never" except the PMD size,
which is set to "inherit". This preserves backward compatibility with the
top-level anonymous shmem setting, while also allowing independent control of
anonymous shmem for each mTHP size.
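
For example, to let anonymous shmem use 64K mTHP (the configuration used for
the 64K numbers below), something like the following should work; the
'hugepages-64kB' directory name is an assumption based on a 4K base page size:

  # enable 64K mTHP for anonymous shmem regardless of the top-level setting
  echo always > /sys/kernel/mm/transparent_hugepage/hugepages-64kB/shmem_enabled
  # or make this size follow the top-level 'shmem_enabled' setting instead
  echo inherit > /sys/kernel/mm/transparent_hugepage/hugepages-64kB/shmem_enabled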

I used a page fault latency tool to measure the performance of 1G of anonymous
shmem with 32 threads on my machine (ARM64 architecture, 32 cores, 125G memory):
base: mm-unstable
user-time sys_time faults_per_sec_per_cpu faults_per_sec
0.04s 3.10s 83516.416 2669684.890

mm-unstable + patchset, anon shmem mTHP disabled
user-time sys_time faults_per_sec_per_cpu faults_per_sec
0.02s 3.14s 82936.359 2630746.027

mm-unstable + patchset, anon shmem 64K mTHP enabled
user-time sys_time faults_per_sec_per_cpu faults_per_sec
0.08s 0.31s 678630.231 17082522.495

From the data above, the patchset has minimal impact when mTHP is not enabled
(some fluctuation was observed during testing). When 64K mTHP is enabled, there
is a significant improvement in page fault latency.

Changes from v2:
- Rebased to mm/mm-unstable.
- Remove 'huge' parameter for shmem_alloc_and_add_folio(), per Lance.

Changes from v1:
- Drop the patch that re-arranges the position of highest_order() and
next_order(), per Ryan.
- Modify the finish_fault() to fix VA alignment issue, per Ryan and
David.
- Fix some build issues, reported by Lance and the kernel test robot.
- Update some commit messages.

Changes from RFC:
- Rebase the patch set against the new mm-unstable branch, per Lance.
- Add a new patch to export highest_order() and next_order().
- Add a new patch to align mTHP size in shmem_get_unmapped_area().
- Handle the uffd case and the VMA limits case when building mapping for
large folio in the finish_fault() function, per Ryan.
- Remove unnecessary 'order' variable in patch 3, per Kefeng.
- Keep the anon shmem counters' name consistency.
- Modify the strategy to support mTHP for anonymous shmem, discussed with
Ryan and David.
- Add reviewed tag from Barry.
- Update the commit message.

Baolin Wang (6):
mm: memory: extend finish_fault() to support large folio
mm: shmem: add THP validation for PMD-mapped THP related statistics
mm: shmem: add multi-size THP sysfs interface for anonymous shmem
mm: shmem: add mTHP support for anonymous shmem
mm: shmem: add mTHP size alignment in shmem_get_unmapped_area
mm: shmem: add mTHP counters for anonymous shmem

Documentation/admin-guide/mm/transhuge.rst | 29 ++
include/linux/huge_mm.h | 23 ++
mm/huge_memory.c | 17 +-
mm/memory.c | 58 +++-
mm/shmem.c | 340 ++++++++++++++++++---
5 files changed, 406 insertions(+), 61 deletions(-)

--
2.39.3



2024-05-30 02:05:06

by Baolin Wang

Subject: [PATCH v3 1/6] mm: memory: extend finish_fault() to support large folio

Add support for establishing large folio mappings in finish_fault(), as
preparation for supporting multi-size THP allocation of anonymous shmem pages
in the following patches.

Signed-off-by: Baolin Wang <[email protected]>
---
mm/memory.c | 58 ++++++++++++++++++++++++++++++++++++++++++++---------
1 file changed, 48 insertions(+), 10 deletions(-)

diff --git a/mm/memory.c b/mm/memory.c
index eef4e482c0c2..435187ff7ea4 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -4831,9 +4831,12 @@ vm_fault_t finish_fault(struct vm_fault *vmf)
{
struct vm_area_struct *vma = vmf->vma;
struct page *page;
+ struct folio *folio;
vm_fault_t ret;
bool is_cow = (vmf->flags & FAULT_FLAG_WRITE) &&
!(vma->vm_flags & VM_SHARED);
+ int type, nr_pages, i;
+ unsigned long addr = vmf->address;

/* Did we COW the page? */
if (is_cow)
@@ -4864,24 +4867,59 @@ vm_fault_t finish_fault(struct vm_fault *vmf)
return VM_FAULT_OOM;
}

+ folio = page_folio(page);
+ nr_pages = folio_nr_pages(folio);
+
+ /*
+ * Using per-page fault to maintain the uffd semantics, and same
+ * approach also applies to non-anonymous-shmem faults to avoid
+ * inflating the RSS of the process.
+ */
+ if (!vma_is_anon_shmem(vma) || unlikely(userfaultfd_armed(vma))) {
+ nr_pages = 1;
+ } else if (nr_pages > 1) {
+ pgoff_t idx = folio_page_idx(folio, page);
+ /* The page offset of vmf->address within the VMA. */
+ pgoff_t vma_off = vmf->pgoff - vmf->vma->vm_pgoff;
+
+ /*
+ * Fallback to per-page fault in case the folio size in page
+ * cache beyond the VMA limits.
+ */
+ if (unlikely(vma_off < idx ||
+ vma_off + (nr_pages - idx) > vma_pages(vma))) {
+ nr_pages = 1;
+ } else {
+ /* Now we can set mappings for the whole large folio. */
+ addr = vmf->address - idx * PAGE_SIZE;
+ page = &folio->page;
+ }
+ }
+
vmf->pte = pte_offset_map_lock(vma->vm_mm, vmf->pmd,
- vmf->address, &vmf->ptl);
+ addr, &vmf->ptl);
if (!vmf->pte)
return VM_FAULT_NOPAGE;

/* Re-check under ptl */
- if (likely(!vmf_pte_changed(vmf))) {
- struct folio *folio = page_folio(page);
- int type = is_cow ? MM_ANONPAGES : mm_counter_file(folio);
-
- set_pte_range(vmf, folio, page, 1, vmf->address);
- add_mm_counter(vma->vm_mm, type, 1);
- ret = 0;
- } else {
- update_mmu_tlb(vma, vmf->address, vmf->pte);
+ if (nr_pages == 1 && unlikely(vmf_pte_changed(vmf))) {
+ update_mmu_tlb(vma, addr, vmf->pte);
ret = VM_FAULT_NOPAGE;
+ goto unlock;
+ } else if (nr_pages > 1 && !pte_range_none(vmf->pte, nr_pages)) {
+ for (i = 0; i < nr_pages; i++)
+ update_mmu_tlb(vma, addr + PAGE_SIZE * i, vmf->pte + i);
+ ret = VM_FAULT_NOPAGE;
+ goto unlock;
}

+ folio_ref_add(folio, nr_pages - 1);
+ set_pte_range(vmf, folio, page, nr_pages, addr);
+ type = is_cow ? MM_ANONPAGES : mm_counter_file(folio);
+ add_mm_counter(vma->vm_mm, type, nr_pages);
+ ret = 0;
+
+unlock:
pte_unmap_unlock(vmf->pte, vmf->ptl);
return ret;
}
--
2.39.3


2024-05-30 02:05:10

by Baolin Wang

Subject: [PATCH v3 2/6] mm: shmem: add THP validation for PMD-mapped THP related statistics

In preparation for extending mTHP support, only update the PMD-mapped THP
related statistics when the folio is actually PMD-sized, to avoid statistical
confusion.

Signed-off-by: Baolin Wang <[email protected]>
Reviewed-by: Barry Song <[email protected]>
---
mm/shmem.c | 7 ++++---
1 file changed, 4 insertions(+), 3 deletions(-)

diff --git a/mm/shmem.c b/mm/shmem.c
index 6868c0af3a69..ae358efc397a 100644
--- a/mm/shmem.c
+++ b/mm/shmem.c
@@ -1647,7 +1647,7 @@ static struct folio *shmem_alloc_and_add_folio(gfp_t gfp,
return ERR_PTR(-E2BIG);

folio = shmem_alloc_folio(gfp, HPAGE_PMD_ORDER, info, index);
- if (!folio)
+ if (!folio && pages == HPAGE_PMD_NR)
count_vm_event(THP_FILE_FALLBACK);
} else {
pages = 1;
@@ -1665,7 +1665,7 @@ static struct folio *shmem_alloc_and_add_folio(gfp_t gfp,
if (xa_find(&mapping->i_pages, &index,
index + pages - 1, XA_PRESENT)) {
error = -EEXIST;
- } else if (huge) {
+ } else if (pages == HPAGE_PMD_NR) {
count_vm_event(THP_FILE_FALLBACK);
count_vm_event(THP_FILE_FALLBACK_CHARGE);
}
@@ -2031,7 +2031,8 @@ static int shmem_get_folio_gfp(struct inode *inode, pgoff_t index,
folio = shmem_alloc_and_add_folio(huge_gfp,
inode, index, fault_mm, true);
if (!IS_ERR(folio)) {
- count_vm_event(THP_FILE_ALLOC);
+ if (folio_test_pmd_mappable(folio))
+ count_vm_event(THP_FILE_ALLOC);
goto alloced;
}
if (PTR_ERR(folio) == -EEXIST)
--
2.39.3


2024-05-30 02:05:20

by Baolin Wang

Subject: [PATCH v3 4/6] mm: shmem: add mTHP support for anonymous shmem

Commit 19eaf44954df adds multi-size THP (mTHP) for anonymous pages, allowing
THP to be configured through the sysfs interface located at
'/sys/kernel/mm/transparent_hugepage/hugepages-XXkB/enabled'.

However, anonymous shared pages ignore the mTHP rules configured through the
sysfs interface and can only use PMD-mapped THP, which is not reasonable.
Users expect the mTHP rules to apply to all anonymous pages, including
anonymous shared pages, in order to enjoy the benefits of mTHP: for example,
lower allocation latency than PMD-mapped THP, smaller memory bloat than
PMD-mapped THP, and contiguous PTEs on ARM architectures to reduce TLB misses.

The primary strategy is similar to the existing anonymous mTHP support.
Introduce a new interface
'/sys/kernel/mm/transparent_hugepage/hugepages-XXkB/shmem_enabled', which
accepts all the same values as the top-level
'/sys/kernel/mm/transparent_hugepage/shmem_enabled', plus an additional
"inherit" option. By default all sizes are set to "never" except the PMD size,
which is set to "inherit". This preserves backward compatibility with the
top-level anonymous shmem setting, while also allowing independent control of
anonymous shmem for each mTHP size.
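
As a rough sketch of the resulting behavior (the size names below are
assumptions for a 4K base page size with a 2M PMD):

  echo always > /sys/kernel/mm/transparent_hugepage/hugepages-2048kB/shmem_enabled
  echo always > /sys/kernel/mm/transparent_hugepage/hugepages-64kB/shmem_enabled
  # Anonymous shmem faults now try the highest enabled order first (2M here),
  # then fall back to the next enabled order (64K), and finally to small pages.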

Signed-off-by: Baolin Wang <[email protected]>
---
include/linux/huge_mm.h | 10 +++
mm/shmem.c | 187 +++++++++++++++++++++++++++++++++-------
2 files changed, 167 insertions(+), 30 deletions(-)

diff --git a/include/linux/huge_mm.h b/include/linux/huge_mm.h
index fac21548c5de..909cfc67521d 100644
--- a/include/linux/huge_mm.h
+++ b/include/linux/huge_mm.h
@@ -575,6 +575,16 @@ static inline bool thp_migration_supported(void)
{
return false;
}
+
+static inline int highest_order(unsigned long orders)
+{
+ return 0;
+}
+
+static inline int next_order(unsigned long *orders, int prev)
+{
+ return 0;
+}
#endif /* CONFIG_TRANSPARENT_HUGEPAGE */

static inline int split_folio_to_list_to_order(struct folio *folio,
diff --git a/mm/shmem.c b/mm/shmem.c
index d5ab5e211100..493873d7246c 100644
--- a/mm/shmem.c
+++ b/mm/shmem.c
@@ -1611,6 +1611,107 @@ static gfp_t limit_gfp_mask(gfp_t huge_gfp, gfp_t limit_gfp)
return result;
}

+#ifdef CONFIG_TRANSPARENT_HUGEPAGE
+static unsigned long anon_shmem_allowable_huge_orders(struct inode *inode,
+ struct vm_area_struct *vma, pgoff_t index,
+ bool global_huge)
+{
+ unsigned long mask = READ_ONCE(huge_anon_shmem_orders_always);
+ unsigned long within_size_orders = READ_ONCE(huge_anon_shmem_orders_within_size);
+ unsigned long vm_flags = vma->vm_flags;
+ /*
+ * Check all the (large) orders below HPAGE_PMD_ORDER + 1 that
+ * are enabled for this vma.
+ */
+ unsigned long orders = BIT(PMD_ORDER + 1) - 1;
+ loff_t i_size;
+ int order;
+
+ if ((vm_flags & VM_NOHUGEPAGE) ||
+ test_bit(MMF_DISABLE_THP, &vma->vm_mm->flags))
+ return 0;
+
+ /* If the hardware/firmware marked hugepage support disabled. */
+ if (transparent_hugepage_flags & (1 << TRANSPARENT_HUGEPAGE_UNSUPPORTED))
+ return 0;
+
+ /*
+ * Following the 'deny' semantics of the top level, force the huge
+ * option off from all mounts.
+ */
+ if (shmem_huge == SHMEM_HUGE_DENY)
+ return 0;
+
+ /*
+ * Only allow inherit orders if the top-level value is 'force', which
+ * means non-PMD sized THP can not override 'huge' mount option now.
+ */
+ if (shmem_huge == SHMEM_HUGE_FORCE)
+ return READ_ONCE(huge_anon_shmem_orders_inherit);
+
+ /* Allow mTHP that will be fully within i_size. */
+ order = highest_order(within_size_orders);
+ while (within_size_orders) {
+ index = round_up(index + 1, order);
+ i_size = round_up(i_size_read(inode), PAGE_SIZE);
+ if (i_size >> PAGE_SHIFT >= index) {
+ mask |= within_size_orders;
+ break;
+ }
+
+ order = next_order(&within_size_orders, order);
+ }
+
+ if (vm_flags & VM_HUGEPAGE)
+ mask |= READ_ONCE(huge_anon_shmem_orders_madvise);
+
+ if (global_huge)
+ mask |= READ_ONCE(huge_anon_shmem_orders_inherit);
+
+ return orders & mask;
+}
+
+static unsigned long anon_shmem_suitable_orders(struct inode *inode, struct vm_fault *vmf,
+ struct address_space *mapping, pgoff_t index,
+ unsigned long orders)
+{
+ struct vm_area_struct *vma = vmf->vma;
+ unsigned long pages;
+ int order;
+
+ orders = thp_vma_suitable_orders(vma, vmf->address, orders);
+ if (!orders)
+ return 0;
+
+ /* Find the highest order that can add into the page cache */
+ order = highest_order(orders);
+ while (orders) {
+ pages = 1UL << order;
+ index = round_down(index, pages);
+ if (!xa_find(&mapping->i_pages, &index,
+ index + pages - 1, XA_PRESENT))
+ break;
+ order = next_order(&orders, order);
+ }
+
+ return orders;
+}
+#else
+static unsigned long anon_shmem_allowable_huge_orders(struct inode *inode,
+ struct vm_area_struct *vma, pgoff_t index,
+ bool global_huge)
+{
+ return 0;
+}
+
+static unsigned long anon_shmem_suitable_orders(struct inode *inode, struct vm_fault *vmf,
+ struct address_space *mapping, pgoff_t index,
+ unsigned long orders)
+{
+ return 0;
+}
+#endif /* CONFIG_TRANSPARENT_HUGEPAGE */
+
static struct folio *shmem_alloc_folio(gfp_t gfp, int order,
struct shmem_inode_info *info, pgoff_t index)
{
@@ -1625,38 +1726,55 @@ static struct folio *shmem_alloc_folio(gfp_t gfp, int order,
return folio;
}

-static struct folio *shmem_alloc_and_add_folio(gfp_t gfp,
- struct inode *inode, pgoff_t index,
- struct mm_struct *fault_mm, bool huge)
+static struct folio *shmem_alloc_and_add_folio(struct vm_fault *vmf,
+ gfp_t gfp, struct inode *inode, pgoff_t index,
+ struct mm_struct *fault_mm, unsigned long orders)
{
struct address_space *mapping = inode->i_mapping;
struct shmem_inode_info *info = SHMEM_I(inode);
- struct folio *folio;
+ struct vm_area_struct *vma = vmf ? vmf->vma : NULL;
+ unsigned long suitable_orders;
+ struct folio *folio = NULL;
long pages;
- int error;
+ int error, order;

if (!IS_ENABLED(CONFIG_TRANSPARENT_HUGEPAGE))
- huge = false;
+ orders = 0;

- if (huge) {
- pages = HPAGE_PMD_NR;
- index = round_down(index, HPAGE_PMD_NR);
+ if (orders > 0) {
+ if (vma && vma_is_anon_shmem(vma)) {
+ suitable_orders = anon_shmem_suitable_orders(inode, vmf,
+ mapping, index, orders);
+ } else if (orders & BIT(HPAGE_PMD_ORDER)) {
+ pages = HPAGE_PMD_NR;
+ suitable_orders = BIT(HPAGE_PMD_ORDER);
+ index = round_down(index, HPAGE_PMD_NR);

- /*
- * Check for conflict before waiting on a huge allocation.
- * Conflict might be that a huge page has just been allocated
- * and added to page cache by a racing thread, or that there
- * is already at least one small page in the huge extent.
- * Be careful to retry when appropriate, but not forever!
- * Elsewhere -EEXIST would be the right code, but not here.
- */
- if (xa_find(&mapping->i_pages, &index,
- index + HPAGE_PMD_NR - 1, XA_PRESENT))
- return ERR_PTR(-E2BIG);
+ /*
+ * Check for conflict before waiting on a huge allocation.
+ * Conflict might be that a huge page has just been allocated
+ * and added to page cache by a racing thread, or that there
+ * is already at least one small page in the huge extent.
+ * Be careful to retry when appropriate, but not forever!
+ * Elsewhere -EEXIST would be the right code, but not here.
+ */
+ if (xa_find(&mapping->i_pages, &index,
+ index + HPAGE_PMD_NR - 1, XA_PRESENT))
+ return ERR_PTR(-E2BIG);
+ }

- folio = shmem_alloc_folio(gfp, HPAGE_PMD_ORDER, info, index);
- if (!folio && pages == HPAGE_PMD_NR)
- count_vm_event(THP_FILE_FALLBACK);
+ order = highest_order(suitable_orders);
+ while (suitable_orders) {
+ pages = 1UL << order;
+ index = round_down(index, pages);
+ folio = shmem_alloc_folio(gfp, order, info, index);
+ if (folio)
+ goto allocated;
+
+ if (pages == HPAGE_PMD_NR)
+ count_vm_event(THP_FILE_FALLBACK);
+ order = next_order(&suitable_orders, order);
+ }
} else {
pages = 1;
folio = shmem_alloc_folio(gfp, 0, info, index);
@@ -1664,6 +1782,7 @@ static struct folio *shmem_alloc_and_add_folio(gfp_t gfp,
if (!folio)
return ERR_PTR(-ENOMEM);

+allocated:
__folio_set_locked(folio);
__folio_set_swapbacked(folio);

@@ -1958,7 +2077,8 @@ static int shmem_get_folio_gfp(struct inode *inode, pgoff_t index,
struct mm_struct *fault_mm;
struct folio *folio;
int error;
- bool alloced;
+ bool alloced, huge;
+ unsigned long orders = 0;

if (WARN_ON_ONCE(!shmem_mapping(inode->i_mapping)))
return -EINVAL;
@@ -2030,14 +2150,21 @@ static int shmem_get_folio_gfp(struct inode *inode, pgoff_t index,
return 0;
}

- if (shmem_is_huge(inode, index, false, fault_mm,
- vma ? vma->vm_flags : 0)) {
+ huge = shmem_is_huge(inode, index, false, fault_mm,
+ vma ? vma->vm_flags : 0);
+ /* Find hugepage orders that are allowed for anonymous shmem. */
+ if (vma && vma_is_anon_shmem(vma))
+ orders = anon_shmem_allowable_huge_orders(inode, vma, index, huge);
+ else if (huge)
+ orders = BIT(HPAGE_PMD_ORDER);
+
+ if (orders > 0) {
gfp_t huge_gfp;

huge_gfp = vma_thp_gfp_mask(vma);
huge_gfp = limit_gfp_mask(huge_gfp, gfp);
- folio = shmem_alloc_and_add_folio(huge_gfp,
- inode, index, fault_mm, true);
+ folio = shmem_alloc_and_add_folio(vmf, huge_gfp,
+ inode, index, fault_mm, orders);
if (!IS_ERR(folio)) {
if (folio_test_pmd_mappable(folio))
count_vm_event(THP_FILE_ALLOC);
@@ -2047,7 +2174,7 @@ static int shmem_get_folio_gfp(struct inode *inode, pgoff_t index,
goto repeat;
}

- folio = shmem_alloc_and_add_folio(gfp, inode, index, fault_mm, false);
+ folio = shmem_alloc_and_add_folio(vmf, gfp, inode, index, fault_mm, 0);
if (IS_ERR(folio)) {
error = PTR_ERR(folio);
if (error == -EEXIST)
@@ -2058,7 +2185,7 @@ static int shmem_get_folio_gfp(struct inode *inode, pgoff_t index,

alloced:
alloced = true;
- if (folio_test_pmd_mappable(folio) &&
+ if (folio_test_large(folio) &&
DIV_ROUND_UP(i_size_read(inode), PAGE_SIZE) <
folio_next_index(folio) - 1) {
struct shmem_sb_info *sbinfo = SHMEM_SB(inode->i_sb);
--
2.39.3


2024-05-30 02:05:32

by Baolin Wang

Subject: [PATCH v3 3/6] mm: shmem: add multi-size THP sysfs interface for anonymous shmem

To support the use of mTHP with anonymous shmem, add a new sysfs interface
'shmem_enabled' in the '/sys/kernel/mm/transparent_hugepage/hugepages-<size>kB/'
directory for each mTHP to control whether shmem is enabled for that mTHP,
with values similar to the top-level 'shmem_enabled', which can be set to:
"always", "inherit" (to inherit the top-level setting), "within_size", "advise",
"never", "deny", "force". These values follow the same semantics as the top
level, except that 'deny' is equivalent to 'never' and 'force' is equivalent
to 'always', to keep compatibility.

By default, the PMD-sized hugepage has shmem_enabled="inherit" and all other
hugepage sizes have shmem_enabled="never" in
'/sys/kernel/mm/transparent_hugepage/hugepages-<size>kB/shmem_enabled'.

In addition, if the top-level value is 'force', only the PMD-sized hugepage may
be set to "inherit"; conversely, while a non-PMD size is set to "inherit", the
top level cannot be set to 'force'. In both cases the configuration fails. This
avoids non-PMD sized THP overriding the global huge allocation policy.
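
A rough usage sketch (the 'hugepages-64kB' name and the exact error output are
assumptions, based on a 4K base page size):

  $ cat /sys/kernel/mm/transparent_hugepage/hugepages-64kB/shmem_enabled
  always inherit within_size advise [never] [deny] force
  $ echo within_size > /sys/kernel/mm/transparent_hugepage/hugepages-64kB/shmem_enabled
  $ echo inherit > /sys/kernel/mm/transparent_hugepage/hugepages-64kB/shmem_enabled
  # While a non-PMD size is set to 'inherit', forcing the top level is rejected:
  $ echo force > /sys/kernel/mm/transparent_hugepage/shmem_enabled
  bash: echo: write error: Invalid argument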

Signed-off-by: Baolin Wang <[email protected]>
---
Documentation/admin-guide/mm/transhuge.rst | 29 +++++++
include/linux/huge_mm.h | 10 +++
mm/huge_memory.c | 11 +--
mm/shmem.c | 96 ++++++++++++++++++++++
4 files changed, 138 insertions(+), 8 deletions(-)

diff --git a/Documentation/admin-guide/mm/transhuge.rst b/Documentation/admin-guide/mm/transhuge.rst
index d414d3f5592a..659459374032 100644
--- a/Documentation/admin-guide/mm/transhuge.rst
+++ b/Documentation/admin-guide/mm/transhuge.rst
@@ -332,6 +332,35 @@ deny
force
Force the huge option on for all - very useful for testing;

+Anonymous shmem can also use "multi-size THP" (mTHP) by adding a new sysfs knob
+to control mTHP allocation: /sys/kernel/mm/transparent_hugepage/hugepages-<size>kB/shmem_enabled.
+Its value for each mTHP is essentially consistent with the global setting, except
+for the addition of 'inherit' to ensure compatibility with the global settings.
+always
+ Attempt to allocate <size> huge pages every time we need a new page;
+
+inherit
+ Inherit the top-level "shmem_enabled" value. By default, PMD-sized hugepages
+ have enabled="inherit" and all other hugepage sizes have enabled="never";
+
+never
+ Do not allocate <size> huge pages;
+
+within_size
+ Only allocate <size> huge page if it will be fully within i_size.
+ Also respect fadvise()/madvise() hints;
+
+advise
+ Only allocate <size> huge pages if requested with fadvise()/madvise();
+
+deny
+ Has the same semantics as 'never'; the mTHP allocation policy is only
+ used for anonymous shmem and does not override tmpfs.
+
+force
+ Has the same semantics as 'always'; the mTHP allocation policy is only
+ used for anonymous shmem and does not override tmpfs.
+
Need of application restart
===========================

diff --git a/include/linux/huge_mm.h b/include/linux/huge_mm.h
index 020e2344eb86..fac21548c5de 100644
--- a/include/linux/huge_mm.h
+++ b/include/linux/huge_mm.h
@@ -6,6 +6,7 @@
#include <linux/mm_types.h>

#include <linux/fs.h> /* only for vma_is_dax() */
+#include <linux/kobject.h>

vm_fault_t do_huge_pmd_anonymous_page(struct vm_fault *vmf);
int copy_huge_pmd(struct mm_struct *dst_mm, struct mm_struct *src_mm,
@@ -63,6 +64,7 @@ ssize_t single_hugepage_flag_show(struct kobject *kobj,
struct kobj_attribute *attr, char *buf,
enum transparent_hugepage_flag flag);
extern struct kobj_attribute shmem_enabled_attr;
+extern struct kobj_attribute thpsize_shmem_enabled_attr;

/*
* Mask of all large folio orders supported for anonymous THP; all orders up to
@@ -265,6 +267,14 @@ unsigned long thp_vma_allowable_orders(struct vm_area_struct *vma,
return __thp_vma_allowable_orders(vma, vm_flags, tva_flags, orders);
}

+struct thpsize {
+ struct kobject kobj;
+ struct list_head node;
+ int order;
+};
+
+#define to_thpsize(kobj) container_of(kobj, struct thpsize, kobj)
+
enum mthp_stat_item {
MTHP_STAT_ANON_FAULT_ALLOC,
MTHP_STAT_ANON_FAULT_FALLBACK,
diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index 8e49f402d7c7..1360a1903b66 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -449,14 +449,6 @@ static void thpsize_release(struct kobject *kobj);
static DEFINE_SPINLOCK(huge_anon_orders_lock);
static LIST_HEAD(thpsize_list);

-struct thpsize {
- struct kobject kobj;
- struct list_head node;
- int order;
-};
-
-#define to_thpsize(kobj) container_of(kobj, struct thpsize, kobj)
-
static ssize_t thpsize_enabled_show(struct kobject *kobj,
struct kobj_attribute *attr, char *buf)
{
@@ -517,6 +509,9 @@ static struct kobj_attribute thpsize_enabled_attr =

static struct attribute *thpsize_attrs[] = {
&thpsize_enabled_attr.attr,
+#ifdef CONFIG_SHMEM
+ &thpsize_shmem_enabled_attr.attr,
+#endif
NULL,
};

diff --git a/mm/shmem.c b/mm/shmem.c
index ae358efc397a..d5ab5e211100 100644
--- a/mm/shmem.c
+++ b/mm/shmem.c
@@ -131,6 +131,14 @@ struct shmem_options {
#define SHMEM_SEEN_QUOTA 32
};

+#ifdef CONFIG_TRANSPARENT_HUGEPAGE
+static unsigned long huge_anon_shmem_orders_always __read_mostly;
+static unsigned long huge_anon_shmem_orders_madvise __read_mostly;
+static unsigned long huge_anon_shmem_orders_inherit __read_mostly;
+static unsigned long huge_anon_shmem_orders_within_size __read_mostly;
+static DEFINE_SPINLOCK(huge_anon_shmem_orders_lock);
+#endif
+
#ifdef CONFIG_TMPFS
static unsigned long shmem_default_max_blocks(void)
{
@@ -4672,6 +4680,12 @@ void __init shmem_init(void)
SHMEM_SB(shm_mnt->mnt_sb)->huge = shmem_huge;
else
shmem_huge = SHMEM_HUGE_NEVER; /* just in case it was patched */
+
+ /*
+ * Default to setting PMD-sized THP to inherit the global setting and
+ * disable all other multi-size THPs, when anonymous shmem uses mTHP.
+ */
+ huge_anon_shmem_orders_inherit = BIT(HPAGE_PMD_ORDER);
#endif
return;

@@ -4731,6 +4745,11 @@ static ssize_t shmem_enabled_store(struct kobject *kobj,
huge != SHMEM_HUGE_NEVER && huge != SHMEM_HUGE_DENY)
return -EINVAL;

+ /* Do not override huge allocation policy with non-PMD sized mTHP */
+ if (huge == SHMEM_HUGE_FORCE &&
+ huge_anon_shmem_orders_inherit != BIT(HPAGE_PMD_ORDER))
+ return -EINVAL;
+
shmem_huge = huge;
if (shmem_huge > SHMEM_HUGE_DENY)
SHMEM_SB(shm_mnt->mnt_sb)->huge = shmem_huge;
@@ -4738,6 +4757,83 @@ static ssize_t shmem_enabled_store(struct kobject *kobj,
}

struct kobj_attribute shmem_enabled_attr = __ATTR_RW(shmem_enabled);
+
+static ssize_t thpsize_shmem_enabled_show(struct kobject *kobj,
+ struct kobj_attribute *attr, char *buf)
+{
+ int order = to_thpsize(kobj)->order;
+ const char *output;
+
+ if (test_bit(order, &huge_anon_shmem_orders_always))
+ output = "[always] inherit within_size advise never deny [force]";
+ else if (test_bit(order, &huge_anon_shmem_orders_inherit))
+ output = "always [inherit] within_size advise never deny force";
+ else if (test_bit(order, &huge_anon_shmem_orders_within_size))
+ output = "always inherit [within_size] advise never deny force";
+ else if (test_bit(order, &huge_anon_shmem_orders_madvise))
+ output = "always inherit within_size [advise] never deny force";
+ else
+ output = "always inherit within_size advise [never] [deny] force";
+
+ return sysfs_emit(buf, "%s\n", output);
+}
+
+static ssize_t thpsize_shmem_enabled_store(struct kobject *kobj,
+ struct kobj_attribute *attr,
+ const char *buf, size_t count)
+{
+ int order = to_thpsize(kobj)->order;
+ ssize_t ret = count;
+
+ if (sysfs_streq(buf, "always") || sysfs_streq(buf, "force")) {
+ spin_lock(&huge_anon_shmem_orders_lock);
+ clear_bit(order, &huge_anon_shmem_orders_inherit);
+ clear_bit(order, &huge_anon_shmem_orders_madvise);
+ clear_bit(order, &huge_anon_shmem_orders_within_size);
+ set_bit(order, &huge_anon_shmem_orders_always);
+ spin_unlock(&huge_anon_shmem_orders_lock);
+ } else if (sysfs_streq(buf, "inherit")) {
+ /* Do not override huge allocation policy with non-PMD sized mTHP */
+ if (shmem_huge == SHMEM_HUGE_FORCE &&
+ order != HPAGE_PMD_ORDER)
+ return -EINVAL;
+
+ spin_lock(&huge_anon_shmem_orders_lock);
+ clear_bit(order, &huge_anon_shmem_orders_always);
+ clear_bit(order, &huge_anon_shmem_orders_madvise);
+ clear_bit(order, &huge_anon_shmem_orders_within_size);
+ set_bit(order, &huge_anon_shmem_orders_inherit);
+ spin_unlock(&huge_anon_shmem_orders_lock);
+ } else if (sysfs_streq(buf, "within_size")) {
+ spin_lock(&huge_anon_shmem_orders_lock);
+ clear_bit(order, &huge_anon_shmem_orders_always);
+ clear_bit(order, &huge_anon_shmem_orders_inherit);
+ clear_bit(order, &huge_anon_shmem_orders_madvise);
+ set_bit(order, &huge_anon_shmem_orders_within_size);
+ spin_unlock(&huge_anon_shmem_orders_lock);
+ } else if (sysfs_streq(buf, "madvise")) {
+ spin_lock(&huge_anon_shmem_orders_lock);
+ clear_bit(order, &huge_anon_shmem_orders_always);
+ clear_bit(order, &huge_anon_shmem_orders_inherit);
+ clear_bit(order, &huge_anon_shmem_orders_within_size);
+ set_bit(order, &huge_anon_shmem_orders_madvise);
+ spin_unlock(&huge_anon_shmem_orders_lock);
+ } else if (sysfs_streq(buf, "never") || sysfs_streq(buf, "deny")) {
+ spin_lock(&huge_anon_shmem_orders_lock);
+ clear_bit(order, &huge_anon_shmem_orders_always);
+ clear_bit(order, &huge_anon_shmem_orders_inherit);
+ clear_bit(order, &huge_anon_shmem_orders_within_size);
+ clear_bit(order, &huge_anon_shmem_orders_madvise);
+ spin_unlock(&huge_anon_shmem_orders_lock);
+ } else {
+ ret = -EINVAL;
+ }
+
+ return ret;
+}
+
+struct kobj_attribute thpsize_shmem_enabled_attr =
+ __ATTR(shmem_enabled, 0644, thpsize_shmem_enabled_show, thpsize_shmem_enabled_store);
#endif /* CONFIG_TRANSPARENT_HUGEPAGE && CONFIG_SYSFS */

#else /* !CONFIG_SHMEM */
--
2.39.3


2024-05-30 02:05:38

by Baolin Wang

Subject: [PATCH v3 5/6] mm: shmem: add mTHP size alignment in shmem_get_unmapped_area

Although the top-level hugepage allocation can be turned off, anonymous shmem
can still use mTHP by configuring the sysfs interface located at
'/sys/kernel/mm/transparent_hugepage/hugepages-XXkB/shmem_enabled'. Therefore,
take the enabled mTHP sizes into account in shmem_get_unmapped_area() so that
it can return a suitably aligned address.

Signed-off-by: Baolin Wang <[email protected]>
Tested-by: Lance Yang <[email protected]>
---
mm/shmem.c | 36 +++++++++++++++++++++++++++---------
1 file changed, 27 insertions(+), 9 deletions(-)

diff --git a/mm/shmem.c b/mm/shmem.c
index 493873d7246c..efc093ee7d4d 100644
--- a/mm/shmem.c
+++ b/mm/shmem.c
@@ -2394,6 +2394,7 @@ unsigned long shmem_get_unmapped_area(struct file *file,
unsigned long inflated_len;
unsigned long inflated_addr;
unsigned long inflated_offset;
+ unsigned long hpage_size;

if (len > TASK_SIZE)
return -ENOMEM;
@@ -2412,8 +2413,6 @@ unsigned long shmem_get_unmapped_area(struct file *file,

if (shmem_huge == SHMEM_HUGE_DENY)
return addr;
- if (len < HPAGE_PMD_SIZE)
- return addr;
if (flags & MAP_FIXED)
return addr;
/*
@@ -2425,8 +2424,11 @@ unsigned long shmem_get_unmapped_area(struct file *file,
if (uaddr == addr)
return addr;

+ hpage_size = HPAGE_PMD_SIZE;
if (shmem_huge != SHMEM_HUGE_FORCE) {
struct super_block *sb;
+ unsigned long __maybe_unused hpage_orders;
+ int order = 0;

if (file) {
VM_BUG_ON(file->f_op != &shmem_file_operations);
@@ -2439,18 +2441,34 @@ unsigned long shmem_get_unmapped_area(struct file *file,
if (IS_ERR(shm_mnt))
return addr;
sb = shm_mnt->mnt_sb;
+
+ /*
+ * Find the highest mTHP order used for anonymous shmem to
+ * provide a suitable alignment address.
+ */
+#ifdef CONFIG_TRANSPARENT_HUGEPAGE
+ hpage_orders = READ_ONCE(huge_anon_shmem_orders_always);
+ hpage_orders |= READ_ONCE(huge_anon_shmem_orders_within_size);
+ hpage_orders |= READ_ONCE(huge_anon_shmem_orders_madvise);
+ hpage_orders |= READ_ONCE(huge_anon_shmem_orders_inherit);
+ order = highest_order(hpage_orders);
+ hpage_size = PAGE_SIZE << order;
+#endif
}
- if (SHMEM_SB(sb)->huge == SHMEM_HUGE_NEVER)
+ if (SHMEM_SB(sb)->huge == SHMEM_HUGE_NEVER && !order)
return addr;
}

- offset = (pgoff << PAGE_SHIFT) & (HPAGE_PMD_SIZE-1);
- if (offset && offset + len < 2 * HPAGE_PMD_SIZE)
+ if (len < hpage_size)
+ return addr;
+
+ offset = (pgoff << PAGE_SHIFT) & (hpage_size - 1);
+ if (offset && offset + len < 2 * hpage_size)
return addr;
- if ((addr & (HPAGE_PMD_SIZE-1)) == offset)
+ if ((addr & (hpage_size - 1)) == offset)
return addr;

- inflated_len = len + HPAGE_PMD_SIZE - PAGE_SIZE;
+ inflated_len = len + hpage_size - PAGE_SIZE;
if (inflated_len > TASK_SIZE)
return addr;
if (inflated_len < len)
@@ -2463,10 +2481,10 @@ unsigned long shmem_get_unmapped_area(struct file *file,
if (inflated_addr & ~PAGE_MASK)
return addr;

- inflated_offset = inflated_addr & (HPAGE_PMD_SIZE-1);
+ inflated_offset = inflated_addr & (hpage_size - 1);
inflated_addr += offset - inflated_offset;
if (inflated_offset > offset)
- inflated_addr += HPAGE_PMD_SIZE;
+ inflated_addr += hpage_size;

if (inflated_addr > TASK_SIZE - len)
return addr;
--
2.39.3


2024-05-30 02:05:50

by Baolin Wang

Subject: [PATCH v3 6/6] mm: shmem: add mTHP counters for anonymous shmem

Add mTHP counters for anonymous shmem.
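
Assuming the new counters appear alongside the existing per-size mTHP stats
(a sketch; the 'hugepages-64kB' name assumes a 4K base page size):

  cat /sys/kernel/mm/transparent_hugepage/hugepages-64kB/stats/file_alloc
  cat /sys/kernel/mm/transparent_hugepage/hugepages-64kB/stats/file_fallback
  cat /sys/kernel/mm/transparent_hugepage/hugepages-64kB/stats/file_fallback_charge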

Signed-off-by: Baolin Wang <[email protected]>
---
include/linux/huge_mm.h | 3 +++
mm/huge_memory.c | 6 ++++++
mm/shmem.c | 18 +++++++++++++++---
3 files changed, 24 insertions(+), 3 deletions(-)

diff --git a/include/linux/huge_mm.h b/include/linux/huge_mm.h
index 909cfc67521d..212cca384d7e 100644
--- a/include/linux/huge_mm.h
+++ b/include/linux/huge_mm.h
@@ -281,6 +281,9 @@ enum mthp_stat_item {
MTHP_STAT_ANON_FAULT_FALLBACK_CHARGE,
MTHP_STAT_SWPOUT,
MTHP_STAT_SWPOUT_FALLBACK,
+ MTHP_STAT_FILE_ALLOC,
+ MTHP_STAT_FILE_FALLBACK,
+ MTHP_STAT_FILE_FALLBACK_CHARGE,
__MTHP_STAT_COUNT
};

diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index 1360a1903b66..3fbcd77f5957 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -555,6 +555,9 @@ DEFINE_MTHP_STAT_ATTR(anon_fault_fallback, MTHP_STAT_ANON_FAULT_FALLBACK);
DEFINE_MTHP_STAT_ATTR(anon_fault_fallback_charge, MTHP_STAT_ANON_FAULT_FALLBACK_CHARGE);
DEFINE_MTHP_STAT_ATTR(swpout, MTHP_STAT_SWPOUT);
DEFINE_MTHP_STAT_ATTR(swpout_fallback, MTHP_STAT_SWPOUT_FALLBACK);
+DEFINE_MTHP_STAT_ATTR(file_alloc, MTHP_STAT_FILE_ALLOC);
+DEFINE_MTHP_STAT_ATTR(file_fallback, MTHP_STAT_FILE_FALLBACK);
+DEFINE_MTHP_STAT_ATTR(file_fallback_charge, MTHP_STAT_FILE_FALLBACK_CHARGE);

static struct attribute *stats_attrs[] = {
&anon_fault_alloc_attr.attr,
@@ -562,6 +565,9 @@ static struct attribute *stats_attrs[] = {
&anon_fault_fallback_charge_attr.attr,
&swpout_attr.attr,
&swpout_fallback_attr.attr,
+ &file_alloc_attr.attr,
+ &file_fallback_attr.attr,
+ &file_fallback_charge_attr.attr,
NULL,
};

diff --git a/mm/shmem.c b/mm/shmem.c
index efc093ee7d4d..cdcc3b919bca 100644
--- a/mm/shmem.c
+++ b/mm/shmem.c
@@ -1773,6 +1773,9 @@ static struct folio *shmem_alloc_and_add_folio(struct vm_fault *vmf,

if (pages == HPAGE_PMD_NR)
count_vm_event(THP_FILE_FALLBACK);
+#ifdef CONFIG_TRANSPARENT_HUGEPAGE
+ count_mthp_stat(order, MTHP_STAT_FILE_FALLBACK);
+#endif
order = next_order(&suitable_orders, order);
}
} else {
@@ -1792,9 +1795,15 @@ static struct folio *shmem_alloc_and_add_folio(struct vm_fault *vmf,
if (xa_find(&mapping->i_pages, &index,
index + pages - 1, XA_PRESENT)) {
error = -EEXIST;
- } else if (pages == HPAGE_PMD_NR) {
- count_vm_event(THP_FILE_FALLBACK);
- count_vm_event(THP_FILE_FALLBACK_CHARGE);
+ } else if (pages > 1) {
+ if (pages == HPAGE_PMD_NR) {
+ count_vm_event(THP_FILE_FALLBACK);
+ count_vm_event(THP_FILE_FALLBACK_CHARGE);
+ }
+#ifdef CONFIG_TRANSPARENT_HUGEPAGE
+ count_mthp_stat(folio_order(folio), MTHP_STAT_FILE_FALLBACK);
+ count_mthp_stat(folio_order(folio), MTHP_STAT_FILE_FALLBACK_CHARGE);
+#endif
}
goto unlock;
}
@@ -2168,6 +2177,9 @@ static int shmem_get_folio_gfp(struct inode *inode, pgoff_t index,
if (!IS_ERR(folio)) {
if (folio_test_pmd_mappable(folio))
count_vm_event(THP_FILE_ALLOC);
+#ifdef CONFIG_TRANSPARENT_HUGEPAGE
+ count_mthp_stat(folio_order(folio), MTHP_STAT_FILE_ALLOC);
+#endif
goto alloced;
}
if (PTR_ERR(folio) == -EEXIST)
--
2.39.3


2024-05-30 06:37:26

by kernel test robot

Subject: Re: [PATCH v3 4/6] mm: shmem: add mTHP support for anonymous shmem

Hi Baolin,

kernel test robot noticed the following build warnings:

[auto build test WARNING on akpm-mm/mm-everything]
[also build test WARNING on next-20240529]
[cannot apply to linus/master v6.10-rc1]
[If your patch is applied to the wrong git tree, kindly drop us a note.
And when submitting patch, we suggest to use '--base' as documented in
https://git-scm.com/docs/git-format-patch#_base_tree_information]

url: https://github.com/intel-lab-lkp/linux/commits/Baolin-Wang/mm-memory-extend-finish_fault-to-support-large-folio/20240530-100805
base: https://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm.git mm-everything
patch link: https://lore.kernel.org/r/ec35a23026dd016705d211e85163cabe07681516.1717033868.git.baolin.wang%40linux.alibaba.com
patch subject: [PATCH v3 4/6] mm: shmem: add mTHP support for anonymous shmem
config: riscv-defconfig (https://download.01.org/0day-ci/archive/20240530/[email protected]/config)
compiler: clang version 19.0.0git (https://github.com/llvm/llvm-project bafda89a0944d947fc4b3b5663185e07a397ac30)
reproduce (this is a W=1 build): (https://download.01.org/0day-ci/archive/20240530/[email protected]/reproduce)

If you fix the issue in a separate patch/commit (i.e. not just a new version of
the same patch/commit), kindly add following tags
| Reported-by: kernel test robot <[email protected]>
| Closes: https://lore.kernel.org/oe-kbuild-all/[email protected]/

All warnings (new ones prefixed by >>):

In file included from mm/shmem.c:28:
In file included from include/linux/ramfs.h:5:
In file included from include/linux/fs_parser.h:11:
In file included from include/linux/fs_context.h:14:
In file included from include/linux/security.h:33:
In file included from include/linux/mm.h:2245:
include/linux/vmstat.h:514:36: warning: arithmetic between different enumeration types ('enum node_stat_item' and 'enum lru_list') [-Wenum-enum-conversion]
514 | return node_stat_name(NR_LRU_BASE + lru) + 3; // skip "nr_"
| ~~~~~~~~~~~ ^ ~~~
>> mm/shmem.c:1748:14: warning: variable 'suitable_orders' is used uninitialized whenever 'if' condition is false [-Wsometimes-uninitialized]
1748 | } else if (orders & BIT(HPAGE_PMD_ORDER)) {
| ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~
mm/shmem.c:1766:25: note: uninitialized use occurs here
1766 | order = highest_order(suitable_orders);
| ^~~~~~~~~~~~~~~
mm/shmem.c:1748:10: note: remove the 'if' if its condition is always true
1748 | } else if (orders & BIT(HPAGE_PMD_ORDER)) {
| ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
mm/shmem.c:1736:31: note: initialize the variable 'suitable_orders' to silence this warning
1736 | unsigned long suitable_orders;
| ^
| = 0
2 warnings generated.


vim +1748 mm/shmem.c

1728
1729 static struct folio *shmem_alloc_and_add_folio(struct vm_fault *vmf,
1730 gfp_t gfp, struct inode *inode, pgoff_t index,
1731 struct mm_struct *fault_mm, unsigned long orders)
1732 {
1733 struct address_space *mapping = inode->i_mapping;
1734 struct shmem_inode_info *info = SHMEM_I(inode);
1735 struct vm_area_struct *vma = vmf ? vmf->vma : NULL;
1736 unsigned long suitable_orders;
1737 struct folio *folio = NULL;
1738 long pages;
1739 int error, order;
1740
1741 if (!IS_ENABLED(CONFIG_TRANSPARENT_HUGEPAGE))
1742 orders = 0;
1743
1744 if (orders > 0) {
1745 if (vma && vma_is_anon_shmem(vma)) {
1746 suitable_orders = anon_shmem_suitable_orders(inode, vmf,
1747 mapping, index, orders);
> 1748 } else if (orders & BIT(HPAGE_PMD_ORDER)) {
1749 pages = HPAGE_PMD_NR;
1750 suitable_orders = BIT(HPAGE_PMD_ORDER);
1751 index = round_down(index, HPAGE_PMD_NR);
1752
1753 /*
1754 * Check for conflict before waiting on a huge allocation.
1755 * Conflict might be that a huge page has just been allocated
1756 * and added to page cache by a racing thread, or that there
1757 * is already at least one small page in the huge extent.
1758 * Be careful to retry when appropriate, but not forever!
1759 * Elsewhere -EEXIST would be the right code, but not here.
1760 */
1761 if (xa_find(&mapping->i_pages, &index,
1762 index + HPAGE_PMD_NR - 1, XA_PRESENT))
1763 return ERR_PTR(-E2BIG);
1764 }
1765
1766 order = highest_order(suitable_orders);
1767 while (suitable_orders) {
1768 pages = 1UL << order;
1769 index = round_down(index, pages);
1770 folio = shmem_alloc_folio(gfp, order, info, index);
1771 if (folio)
1772 goto allocated;
1773
1774 if (pages == HPAGE_PMD_NR)
1775 count_vm_event(THP_FILE_FALLBACK);
1776 order = next_order(&suitable_orders, order);
1777 }
1778 } else {
1779 pages = 1;
1780 folio = shmem_alloc_folio(gfp, 0, info, index);
1781 }
1782 if (!folio)
1783 return ERR_PTR(-ENOMEM);
1784
1785 allocated:
1786 __folio_set_locked(folio);
1787 __folio_set_swapbacked(folio);
1788
1789 gfp &= GFP_RECLAIM_MASK;
1790 error = mem_cgroup_charge(folio, fault_mm, gfp);
1791 if (error) {
1792 if (xa_find(&mapping->i_pages, &index,
1793 index + pages - 1, XA_PRESENT)) {
1794 error = -EEXIST;
1795 } else if (pages == HPAGE_PMD_NR) {
1796 count_vm_event(THP_FILE_FALLBACK);
1797 count_vm_event(THP_FILE_FALLBACK_CHARGE);
1798 }
1799 goto unlock;
1800 }
1801
1802 error = shmem_add_to_page_cache(folio, mapping, index, NULL, gfp);
1803 if (error)
1804 goto unlock;
1805
1806 error = shmem_inode_acct_blocks(inode, pages);
1807 if (error) {
1808 struct shmem_sb_info *sbinfo = SHMEM_SB(inode->i_sb);
1809 long freed;
1810 /*
1811 * Try to reclaim some space by splitting a few
1812 * large folios beyond i_size on the filesystem.
1813 */
1814 shmem_unused_huge_shrink(sbinfo, NULL, 2);
1815 /*
1816 * And do a shmem_recalc_inode() to account for freed pages:
1817 * except our folio is there in cache, so not quite balanced.
1818 */
1819 spin_lock(&info->lock);
1820 freed = pages + info->alloced - info->swapped -
1821 READ_ONCE(mapping->nrpages);
1822 if (freed > 0)
1823 info->alloced -= freed;
1824 spin_unlock(&info->lock);
1825 if (freed > 0)
1826 shmem_inode_unacct_blocks(inode, freed);
1827 error = shmem_inode_acct_blocks(inode, pages);
1828 if (error) {
1829 filemap_remove_folio(folio);
1830 goto unlock;
1831 }
1832 }
1833
1834 shmem_recalc_inode(inode, pages, 0);
1835 folio_add_lru(folio);
1836 return folio;
1837
1838 unlock:
1839 folio_unlock(folio);
1840 folio_put(folio);
1841 return ERR_PTR(error);
1842 }
1843

--
0-DAY CI Kernel Test Service
https://github.com/intel/lkp-tests/wiki

2024-05-31 09:36:21

by David Hildenbrand

Subject: Re: [PATCH v3 0/6] add mTHP support for anonymous shmem

On 30.05.24 04:04, Baolin Wang wrote:
> Anonymous pages have already been supported for multi-size (mTHP) allocation
> through commit 19eaf44954df, that can allow THP to be configured through the
> sysfs interface located at '/sys/kernel/mm/transparent_hugepage/hugepage-XXkb/enabled'.
>
> However, the anonymous shmem will ignore the anonymous mTHP rule configured
> through the sysfs interface, and can only use the PMD-mapped THP, that is not
> reasonable. Many implement anonymous page sharing through mmap(MAP_SHARED |
> MAP_ANONYMOUS), especially in database usage scenarios, therefore, users expect
> to apply an unified mTHP strategy for anonymous pages, also including the
> anonymous shared pages, in order to enjoy the benefits of mTHP. For example,
> lower latency than PMD-mapped THP, smaller memory bloat than PMD-mapped THP,
> contiguous PTEs on ARM architecture to reduce TLB miss etc.
>
> The primary strategy is similar to supporting anonymous mTHP. Introduce
> a new interface '/mm/transparent_hugepage/hugepage-XXkb/shmem_enabled',
> which can have all the same values as the top-level
> '/sys/kernel/mm/transparent_hugepage/shmem_enabled', with adding a new
> additional "inherit" option. By default all sizes will be set to "never"
> except PMD size, which is set to "inherit". This ensures backward compatibility
> with the anonymous shmem enabled of the top level, meanwhile also allows
> independent control of anonymous shmem enabled for each mTHP.
>
> Use the page fault latency tool to measure the performance of 1G anonymous shmem
> with 32 threads on my machine environment with: ARM64 Architecture, 32 cores,
> 125G memory:
> base: mm-unstable
> user-time sys_time faults_per_sec_per_cpu faults_per_sec
> 0.04s 3.10s 83516.416 2669684.890
>
> mm-unstable + patchset, anon shmem mTHP disabled
> user-time sys_time faults_per_sec_per_cpu faults_per_sec
> 0.02s 3.14s 82936.359 2630746.027
>
> mm-unstable + patchset, anon shmem 64K mTHP enabled
> user-time sys_time faults_per_sec_per_cpu faults_per_sec
> 0.08s 0.31s 678630.231 17082522.495
>
> From the data above, it is observed that the patchset has a minimal impact when
> mTHP is not enabled (some fluctuations observed during testing). When enabling 64K
> mTHP, there is a significant improvement of the page fault latency.

Let me summarize the takeaway from the bi-weekly MM meeting as I
understood it, which includes Hugh's feedback on per-block tracking vs. mTHP:

(1) Per-block tracking

Per-block tracking is currently considered unwarranted complexity in
shmem.c. We should try to get it done without that. For any test cases
that fail, we should consider if they are actually valid for shmem.

To optimize FALLOC_FL_PUNCH_HOLE for the cases where splitting+freeing
is not possible at fallocate() time, detecting zeropages later and
retrying to split+free might be an option, without per-block tracking.

(2) mTHP controls

As a default, we should not be using large folios / mTHP for any shmem,
just like we did with THP via shmem_enabled. This is what this series
currently does, and is part of the whole mTHP user-space interface design.

Further, the mTHP controls should control all of shmem, not only
"anonymous shmem".

Also, we should properly fall back within the configured sizes, and not
jump "over" configured sizes. Unless there is a good reason.

(3) khugepaged

khugepaged needs to handle larger folios properly as well. Until fixed,
using smaller THP sizes as fallback might prohibit collapsing a
PMD-sized THP later. But really, khugepaged needs to be fixed to handle
that.

(4) force/disable

These settings are rather testing artifacts from the old ages. We should
not add them to the per-size toggles. We might "inherit" it from the
global one, though.

"within_size" might have value, and especially for consistency, we
should have them per size.



So, this series only tackles anonymous shmem, which is a good starting
point. Ideally, we'd get support for other shmem (especially during
fault time) soon afterwards, because we won't be adding separate toggles
for that from the interface POV, and having inconsistent behavior
between kernel versions would be a bit unfortunate.


@Baolin, this series likely does not consider (4) yet. And I suggest we
have to take a lot of the "anonymous thp" terminology out of this
series, especially when it comes to documentation.

@Daniel, Pankaj, what are your plans regarding that? It would be great
if we could get an understanding on the next steps on !anon shmem.

--
Cheers,

David / dhildenb


2024-05-31 10:13:28

by Baolin Wang

Subject: Re: [PATCH v3 0/6] add mTHP support for anonymous shmem



On 2024/5/31 17:35, David Hildenbrand wrote:
> On 30.05.24 04:04, Baolin Wang wrote:
>> Anonymous pages have already been supported for multi-size (mTHP)
>> allocation
>> through commit 19eaf44954df, that can allow THP to be configured
>> through the
>> sysfs interface located at
>> '/sys/kernel/mm/transparent_hugepage/hugepage-XXkb/enabled'.
>>
>> However, the anonymous shmem will ignore the anonymous mTHP rule
>> configured
>> through the sysfs interface, and can only use the PMD-mapped THP, that
>> is not
>> reasonable. Many implement anonymous page sharing through
>> mmap(MAP_SHARED |
>> MAP_ANONYMOUS), especially in database usage scenarios, therefore,
>> users expect
>> to apply an unified mTHP strategy for anonymous pages, also including the
>> anonymous shared pages, in order to enjoy the benefits of mTHP. For
>> example,
>> lower latency than PMD-mapped THP, smaller memory bloat than
>> PMD-mapped THP,
>> contiguous PTEs on ARM architecture to reduce TLB miss etc.
>>
>> The primary strategy is similar to supporting anonymous mTHP. Introduce
>> a new interface '/mm/transparent_hugepage/hugepage-XXkb/shmem_enabled',
>> which can have all the same values as the top-level
>> '/sys/kernel/mm/transparent_hugepage/shmem_enabled', with adding a new
>> additional "inherit" option. By default all sizes will be set to "never"
>> except PMD size, which is set to "inherit". This ensures backward
>> compatibility
>> with the anonymous shmem enabled of the top level, meanwhile also allows
>> independent control of anonymous shmem enabled for each mTHP.
>>
>> Use the page fault latency tool to measure the performance of 1G
>> anonymous shmem
>> with 32 threads on my machine environment with: ARM64 Architecture, 32
>> cores,
>> 125G memory:
>> base: mm-unstable
>> user-time    sys_time    faults_per_sec_per_cpu     faults_per_sec
>> 0.04s        3.10s       83516.416                  2669684.890
>>
>> mm-unstable + patchset, anon shmem mTHP disabled
>> user-time    sys_time    faults_per_sec_per_cpu     faults_per_sec
>> 0.02s        3.14s       82936.359                  2630746.027
>>
>> mm-unstable + patchset, anon shmem 64K mTHP enabled
>> user-time    sys_time    faults_per_sec_per_cpu     faults_per_sec
>> 0.08s        0.31s       678630.231                 17082522.495
>>
>>  From the data above, it is observed that the patchset has a minimal
>> impact when
>> mTHP is not enabled (some fluctuations observed during testing). When
>> enabling 64K
>> mTHP, there is a significant improvement of the page fault latency.
>
> Let me summarize the takeaway from the bi-weekly MM meeting as I
> understood it, that includes Hugh's feedback on per-block tracking vs.

Thanks David for the summarization.

> mTHP:
>
> (1) Per-block tracking
>
> Per-block tracking is currently considered unwarranted complexity in
> shmem.c. We should try to get it done without that. For any test cases
> that fail, we should consider if they are actually valid for shmem.
>
> To optimize FALLOC_FL_PUNCH_HOLE for the cases where splitting+freeing
> is not possible at fallcoate() time, detecting zeropages later and
> retrying to split+free might be an option, without per-block tracking.
>
> (2) mTHP controls
>
> As a default, we should not be using large folios / mTHP for any shmem,
> just like we did with THP via shmem_enabled. This is what this series
> currently does, and is aprt of the whole mTHP user-space interface design.
>
> Further, the mTHP controls should control all of shmem, not only
> "anonymous shmem".

Yes, that's what I thought and in my TODO list.

>
> Also, we should properly fallback within the configured sizes, and not
> jump "over" configured sizes. Unless there is a good reason.
>
> (3) khugepaged
>
> khugepaged needs to handle larger folios properly as well. Until fixed,
> using smaller THP sizes as fallback might prohibit collapsing a
> PMD-sized THP later. But really, khugepaged needs to be fixed to handle
> that.
>
> (4) force/disable
>
> These settings are rather testing artifacts from the old ages. We should
> not add them to the per-size toggles. We might "inherit" it from the
> global one, though.

Sorry, I missed this. So I should remove the 'force' and 'deny' options
for each mTHP, right?

>
> "within_size" might have value, and especially for consistency, we
> should have them per size.
>
>
>
> So, this series only tackles anonymous shmem, which is a good starting
> point. Ideally, we'd get support for other shmem (especially during
> fault time) soon afterwards, because we won't be adding separate toggles
> for that from the interface POV, and having inconsistent behavior
> between kernel versions would be a bit unfortunate.
>
>
> @Baolin, this series likely does not consider (4) yet. And I suggest we
> have to take a lot of the "anonymous thp" terminology out of this
> series, especially when it comes to documentation.

Sure. I will remove the "anonymous thp" terminology from the
documentation, but I still want to keep it in the commit messages, because
I want to start from anonymous shmem.

>
> @Daniel, Pankaj, what are your plans regarding that? It would be great
> if we could get an understanding on the next steps on !anon shmem.
>

2024-05-31 11:14:04

by David Hildenbrand

Subject: Re: [PATCH v3 0/6] add mTHP support for anonymous shmem

>>
>> As a default, we should not be using large folios / mTHP for any shmem,
>> just like we did with THP via shmem_enabled. This is what this series
>> currently does, and is aprt of the whole mTHP user-space interface design.
>>
>> Further, the mTHP controls should control all of shmem, not only
>> "anonymous shmem".
>
> Yes, that's what I thought and in my TODO list.

Good, it would be helpful to coordinate with Daniel and Pankaj.

>
>>
>> Also, we should properly fallback within the configured sizes, and not
>> jump "over" configured sizes. Unless there is a good reason.
>>
>> (3) khugepaged
>>
>> khugepaged needs to handle larger folios properly as well. Until fixed,
>> using smaller THP sizes as fallback might prohibit collapsing a
>> PMD-sized THP later. But really, khugepaged needs to be fixed to handle
>> that.
>>
>> (4) force/disable
>>
>> These settings are rather testing artifacts from the old ages. We should
>> not add them to the per-size toggles. We might "inherit" it from the
>> global one, though.
>
> Sorry, I missed this. So I thould remove the 'force' and 'deny' option
> for each mTHP, right?

Yes, that's my understanding. But we have to keep them on the top level
for any possible user out there.

>
>>
>> "within_size" might have value, and especially for consistency, we
>> should have them per size.
>>
>>
>>
>> So, this series only tackles anonymous shmem, which is a good starting
>> point. Ideally, we'd get support for other shmem (especially during
>> fault time) soon afterwards, because we won't be adding separate toggles
>> for that from the interface POV, and having inconsistent behavior
>> between kernel versions would be a bit unfortunate.
>>
>>
>> @Baolin, this series likely does not consider (4) yet. And I suggest we
>> have to take a lot of the "anonymous thp" terminology out of this
>> series, especially when it comes to documentation.
>
> Sure. I will remove the "anonymous thp" terminology from the
> documentation, but want to still keep it in the commit message, cause I
> want to start from the anonymous shmem.

For the commit message and friends that makes sense. The story should be
"controls all of shmem/tmpfs, but support will be added iteratively. The
first step is anonymous shmem."

--
Cheers,

David / dhildenb


2024-05-31 13:20:19

by Daniel Gomez

Subject: Re: [PATCH v3 0/6] add mTHP support for anonymous shmem

On Fri, May 31, 2024 at 11:35:30AM +0200, David Hildenbrand wrote:
> On 30.05.24 04:04, Baolin Wang wrote:
> > Anonymous pages have already been supported for multi-size (mTHP) allocation
> > through commit 19eaf44954df, that can allow THP to be configured through the
> > sysfs interface located at '/sys/kernel/mm/transparent_hugepage/hugepage-XXkb/enabled'.
> >
> > However, the anonymous shmem will ignore the anonymous mTHP rule configured
> > through the sysfs interface, and can only use the PMD-mapped THP, that is not
> > reasonable. Many implement anonymous page sharing through mmap(MAP_SHARED |
> > MAP_ANONYMOUS), especially in database usage scenarios, therefore, users expect
> > to apply an unified mTHP strategy for anonymous pages, also including the
> > anonymous shared pages, in order to enjoy the benefits of mTHP. For example,
> > lower latency than PMD-mapped THP, smaller memory bloat than PMD-mapped THP,
> > contiguous PTEs on ARM architecture to reduce TLB miss etc.
> >
> > The primary strategy is similar to supporting anonymous mTHP. Introduce
> > a new interface '/mm/transparent_hugepage/hugepage-XXkb/shmem_enabled',
> > which can have all the same values as the top-level
> > '/sys/kernel/mm/transparent_hugepage/shmem_enabled', with adding a new
> > additional "inherit" option. By default all sizes will be set to "never"
> > except PMD size, which is set to "inherit". This ensures backward compatibility
> > with the anonymous shmem enabled of the top level, meanwhile also allows
> > independent control of anonymous shmem enabled for each mTHP.
> >
> > Use the page fault latency tool to measure the performance of 1G anonymous shmem
> > with 32 threads on my machine environment with: ARM64 Architecture, 32 cores,
> > 125G memory:
> > base: mm-unstable
> > user-time sys_time faults_per_sec_per_cpu faults_per_sec
> > 0.04s 3.10s 83516.416 2669684.890
> >
> > mm-unstable + patchset, anon shmem mTHP disabled
> > user-time sys_time faults_per_sec_per_cpu faults_per_sec
> > 0.02s 3.14s 82936.359 2630746.027
> >
> > mm-unstable + patchset, anon shmem 64K mTHP enabled
> > user-time sys_time faults_per_sec_per_cpu faults_per_sec
> > 0.08s 0.31s 678630.231 17082522.495
> >
> > From the data above, it is observed that the patchset has a minimal impact when
> > mTHP is not enabled (some fluctuations observed during testing). When enabling 64K
> > mTHP, there is a significant improvement of the page fault latency.
>
> Let me summarize the takeaway from the bi-weekly MM meeting as I understood
> it, that includes Hugh's feedback on per-block tracking vs. mTHP:

Thanks David for the summary. Please find below some follow-up questions.

I want to understand whether the zeropage scanning overhead is preferred over
per-block tracking complexity, or if we still need to verify this.

>
> (1) Per-block tracking
>
> Per-block tracking is currently considered unwarranted complexity in
> shmem.c. We should try to get it done without that. For any test cases that
> fail, we should consider if they are actually valid for shmem.

I agree it is unwarranted complexity, but only if this is just to fix lseek():
we can simply make the test pass by checking whether holes are reported as data.
That would be the minimum required for lseek() to be compliant with the syscall.

How can we use per-block tracking for reclaiming memory and what changes would
be needed? Or is per-block really a non-viable option?

Clearly, if per-block is a viable option, the shmem_fault() bug would need to be
fixed first. Any ideas on how to make it reproducible?

The alternatives discussed where sub-page refcounting and zeropage scanning.
The first one is not possible (IIUC) because of a refactor years back that
simplified the code and also requires extra complexity. The second option would
require additional overhead as we would involve scanning.
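
For reference, the scanning overhead would boil down to a per-page zero check
along these lines -- kmap_local_folio() and memchr_inv() are existing helpers,
but the function itself is only an illustration, not code from any posted
series:

/* Illustrative only: is this page of a large folio zero-filled? */
static bool shmem_folio_page_is_zero(struct folio *folio, size_t offset)
{
	void *kaddr = kmap_local_folio(folio, offset);
	bool zero = !memchr_inv(kaddr, 0, PAGE_SIZE);

	kunmap_local(kaddr);
	return zero;
}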

>
> To optimize FALLOC_FL_PUNCH_HOLE for the cases where splitting+freeing
> is not possible at fallocate() time, detecting zeropages later and
> retrying to split+free might be an option, without per-block tracking.

>
> (2) mTHP controls
>
> As a default, we should not be using large folios / mTHP for any shmem, just
> like we did with THP via shmem_enabled. This is what this series currently
> does, and is part of the whole mTHP user-space interface design.

That was clear for me too. But what is the reason we want to boot in 'safe
mode'? What are the implications of not respecting that?

>
> Further, the mTHP controls should control all of shmem, not only "anonymous
> shmem".

As I understood from the call, mTHP with sysctl knobs is preferred over
optimistic falloc/write allocation? But it is still unclear to me why the former
is preferred.

Are large folios a non-viable option?

>
> Also, we should properly fallback within the configured sizes, and not jump
> "over" configured sizes. Unless there is a good reason.
>
> (3) khugepaged
>
> khugepaged needs to handle larger folios properly as well. Until fixed,
> using smaller THP sizes as fallback might prohibit collapsing a PMD-sized
> THP later. But really, khugepaged needs to be fixed to handle that.
>
> (4) force/disable
>
> These settings are rather testing artifacts from the old ages. We should not
> add them to the per-size toggles. We might "inherit" it from the global one,
> though.
>
> "within_size" might have value, and especially for consistency, we should
> have them per size.
>
>
>
> So, this series only tackles anonymous shmem, which is a good starting
> point. Ideally, we'd get support for other shmem (especially during fault
> time) soon afterwards, because we won't be adding separate toggles for that
> from the interface POV, and having inconsistent behavior between kernel
> versions would be a bit unfortunate.
>
>
> @Baolin, this series likely does not consider (4) yet. And I suggest we have
> to take a lot of the "anonymous thp" terminology out of this series,
> especially when it comes to documentation.
>
> @Daniel, Pankaj, what are your plans regarding that? It would be great if we
> could get an understanding on the next steps on !anon shmem.

I realize I've raised so many questions, but it's essential for us to grasp the
mm concerns and usage scenarios. This understanding will provide clarity on the
direction regarding folios for !anon shmem.

>
> --
> Cheers,
>
> David / dhildenb
>

Daniel

2024-05-31 14:43:55

by David Hildenbrand

[permalink] [raw]
Subject: Re: [PATCH v3 0/6] add mTHP support for anonymous shmem

Hi Daniel,

>> Let me summarize the takeaway from the bi-weekly MM meeting as I understood
>> it, that includes Hugh's feedback on per-block tracking vs. mTHP:
>
> Thanks David for the summary. Please, find below some follow up questions.
>
> I want to understand if zeropage scanning overhead is preferred over per-block
> tracking complexity or if we still need to verify this.
>
>>
>> (1) Per-block tracking
>>
>> Per-block tracking is currently considered unwarranted complexity in
>> shmem.c. We should try to get it done without that. For any test cases that
>> fail, we should consider if they are actually valid for shmem.
>
> I agree it was unwarranted complexity but only if this is just to fix lseek() as
> we can simply make the test pass by checking if holes are reported as data. That
> would be the minimum required for lseek() to be compliant with the syscall.
>
> How can we use per-block tracking for reclaiming memory and what changes would
> be needed? Or is per-block really a non-viable option?

The interesting thing is: with mTHP toggles it is opt-in -- like for
PMD-sized THP with shmem_enabled -- and we don't have to be that
concerned about this problem right now.

>
> Clearly, if per-block is a viable option, the shmem_fault() bug would need to be
> fixed first. Any ideas on how to make it reproducible?
>
> The alternatives discussed were sub-page refcounting and zeropage scanning.

Yeah, I don't think sub-page refcounting is a feasible (and certainly
not desired ;) ) option in the folio world.

> The first one is not possible (IIUC) because of a refactor years back that
> simplified the code and also requires extra complexity. The second option would
> require additional overhead as we would involve scanning.

We'll likely need something similar (scanning, tracking?) for anonymous
memory as well. There was a proposal for a THP shrinker some time ago,
that would solve part of the problem.

For example, for shmem you could simply flag folios that failed
splitting during fallocate() as reclaim candidates and try to reclaim
memory later. So you don't have to scan arbitrary folios (which might
also be desired, though).

>
>>
>> To optimize FALLOC_FL_PUNCH_HOLE for the cases where splitting+freeing
>> is not possible at fallocate() time, detecting zeropages later and
>> retrying to split+free might be an option, without per-block tracking.
>
>>
>> (2) mTHP controls
>>
>> As a default, we should not be using large folios / mTHP for any shmem, just
>> like we did with THP via shmem_enabled. This is what this series currently
>> does, and is part of the whole mTHP user-space interface design.
>
> That was clear for me too. But what is the reason we want to boot in 'safe
> mode'? What are the implications of not respecting that?

[...]

>
> As I understood from the call, mTHP with sysctl knobs is preferred over
> optimistic falloc/write allocation? But it is still unclear to me why the former
> is preferred.

I think Hugh's point was that this should be an opt-in feature, just
like PMD-sized THP started out, and still is, an opt-in feature.

Problematic interaction with khugepaged (that will be fixed) is one
thing, interaction with memory reclaim (without any kind of memory
reclaim mechanisms in place) might be another one. Controlling and
tuning for specific folio sizes might be another one Hugh raised.
[summarizing what I recall from the discussion, there might be more].

>
> Are large folios a non-viable option?

I think you mean "allocating large folios without user space control".

Because mTHP gives user space full control, to the point where you can
enable all sizes and obtain the same result.
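
Concretely, "enable all sizes" just means writing the per-size knobs this
series adds. A user-space sketch, assuming that sysfs layout (error handling
trimmed):

#include <glob.h>
#include <stdio.h>

int main(void)
{
	glob_t g;
	size_t i;

	if (glob("/sys/kernel/mm/transparent_hugepage/hugepages-*kB/shmem_enabled",
		 0, NULL, &g))
		return 1;

	/* Write "always" into every per-size shmem_enabled knob. */
	for (i = 0; i < g.gl_pathc; i++) {
		FILE *f = fopen(g.gl_pathv[i], "w");

		if (f) {
			fputs("always", f);
			fclose(f);
		}
	}
	globfree(&g);
	return 0;
}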

>
>>
>> Also, we should properly fallback within the configured sizes, and not jump
>> "over" configured sizes. Unless there is a good reason.
>>
>> (3) khugepaged
>>
>> khugepaged needs to handle larger folios properly as well. Until fixed,
>> using smaller THP sizes as fallback might prohibit collapsing a PMD-sized
>> THP later. But really, khugepaged needs to be fixed to handle that.
>>
>> (4) force/disable
>>
>> These settings are rather testing artifacts from the old ages. We should not
>> add them to the per-size toggles. We might "inherit" it from the global one,
>> though.
>>
>> "within_size" might have value, and especially for consistency, we should
>> have them per size.
>>
>>
>>
>> So, this series only tackles anonymous shmem, which is a good starting
>> point. Ideally, we'd get support for other shmem (especially during fault
>> time) soon afterwards, because we won't be adding separate toggles for that
>> from the interface POV, and having inconsistent behavior between kernel
>> versions would be a bit unfortunate.
>>
>>
>> @Baolin, this series likely does not consider (4) yet. And I suggest we have
>> to take a lot of the "anonymous thp" terminology out of this series,
>> especially when it comes to documentation.
>>
>> @Daniel, Pankaj, what are your plans regarding that? It would be great if we
>> could get an understanding on the next steps on !anon shmem.
>
> I realize I've raised so many questions, but it's essential for us to grasp the
> mm concerns and usage scenarios. This understanding will provide clarity on the
> direction regarding folios for !anon shmem.

If I understood correctly, Hugh had strong feelings against not
respecting mTHP toggles for shmem. Without per-block tracking, I agree
with that.

--
Cheers,

David / dhildenb


2024-06-02 04:16:01

by Baolin Wang

[permalink] [raw]
Subject: Re: [PATCH v3 0/6] add mTHP support for anonymous shmem



On 2024/5/31 19:13, David Hildenbrand wrote:
>>>
>>> As a default, we should not be using large folios / mTHP for any shmem,
>>> just like we did with THP via shmem_enabled. This is what this series
>>> currently does, and is part of the whole mTHP user-space interface
>>> design.
>>>
>>> Further, the mTHP controls should control all of shmem, not only
>>> "anonymous shmem".
>>
>> Yes, that's what I thought and in my TODO list.
>
> Good, it would be helpful to coordinate with Daniel and Pankaj.
>
>>
>>>
>>> Also, we should properly fallback within the configured sizes, and not
>>> jump "over" configured sizes. Unless there is a good reason.
>>>
>>> (3) khugepaged
>>>
>>> khugepaged needs to handle larger folios properly as well. Until fixed,
>>> using smaller THP sizes as fallback might prohibit collapsing a
>>> PMD-sized THP later. But really, khugepaged needs to be fixed to handle
>>> that.
>>> (4) force/disable
>>>
>>> These settings are rather testing artifacts from the old ages. We should
>>> not add them to the per-size toggles. We might "inherit" it from the
>>> global one, though.
>>
>> Sorry, I missed this. So I should remove the 'force' and 'deny' option
>> for each mTHP, right?
>
> Yes, that's my understanding. But we have to keep them on the top level
> for any possible user out there.

OK.

>>>
>>> "within_size" might have value, and especially for consistency, we
>>> should have them per size.
>>>
>>>
>>>
>>> So, this series only tackles anonymous shmem, which is a good starting
>>> point. Ideally, we'd get support for other shmem (especially during
>>> fault time) soon afterwards, because we won't be adding separate toggles
>>> for that from the interface POV, and having inconsistent behavior
>>> between kernel versions would be a bit unfortunate.
>>>
>>>
>>> @Baolin, this series likely does not consider (4) yet. And I suggest we
>>> have to take a lot of the "anonymous thp" terminology out of this
>>> series, especially when it comes to documentation.
>>
>> Sure. I will remove the "anonymous thp" terminology from the
>> documentation, but want to still keep it in the commit message, cause I
>> want to start from the anonymous shmem.
>
> For commit message and friends makes sense. The story should be
> "controls all of shmem/tmpfs, but support will be added iteratively. The
> first step is anonymous shmem."

Sure. Thanks.

2024-06-02 04:17:07

by Baolin Wang

[permalink] [raw]
Subject: Re: [PATCH v3 4/6] mm: shmem: add mTHP support for anonymous shmem



On 2024/5/30 14:36, kernel test robot wrote:
> Hi Baolin,
>
> kernel test robot noticed the following build warnings:
>
> [auto build test WARNING on akpm-mm/mm-everything]
> [also build test WARNING on next-20240529]
> [cannot apply to linus/master v6.10-rc1]
> [If your patch is applied to the wrong git tree, kindly drop us a note.
> And when submitting patch, we suggest to use '--base' as documented in
> https://git-scm.com/docs/git-format-patch#_base_tree_information]
>
> url: https://github.com/intel-lab-lkp/linux/commits/Baolin-Wang/mm-memory-extend-finish_fault-to-support-large-folio/20240530-100805
> base: https://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm.git mm-everything
> patch link: https://lore.kernel.org/r/ec35a23026dd016705d211e85163cabe07681516.1717033868.git.baolin.wang%40linux.alibaba.com
> patch subject: [PATCH v3 4/6] mm: shmem: add mTHP support for anonymous shmem
> config: riscv-defconfig (https://download.01.org/0day-ci/archive/20240530/[email protected]/config)
> compiler: clang version 19.0.0git (https://github.com/llvm/llvm-project bafda89a0944d947fc4b3b5663185e07a397ac30)
> reproduce (this is a W=1 build): (https://download.01.org/0day-ci/archive/20240530/[email protected]/reproduce)
>
> If you fix the issue in a separate patch/commit (i.e. not just a new version of
> the same patch/commit), kindly add following tags
> | Reported-by: kernel test robot <[email protected]>
> | Closes: https://lore.kernel.org/oe-kbuild-all/[email protected]/
>
> All warnings (new ones prefixed by >>):
>
> In file included from mm/shmem.c:28:
> In file included from include/linux/ramfs.h:5:
> In file included from include/linux/fs_parser.h:11:
> In file included from include/linux/fs_context.h:14:
> In file included from include/linux/security.h:33:
> In file included from include/linux/mm.h:2245:
> include/linux/vmstat.h:514:36: warning: arithmetic between different enumeration types ('enum node_stat_item' and 'enum lru_list') [-Wenum-enum-conversion]
> 514 | return node_stat_name(NR_LRU_BASE + lru) + 3; // skip "nr_"
> | ~~~~~~~~~~~ ^ ~~~
>>> mm/shmem.c:1748:14: warning: variable 'suitable_orders' is used uninitialized whenever 'if' condition is false [-Wsometimes-uninitialized]
> 1748 | } else if (orders & BIT(HPAGE_PMD_ORDER)) {
> | ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~
> mm/shmem.c:1766:25: note: uninitialized use occurs here
> 1766 | order = highest_order(suitable_orders);
> | ^~~~~~~~~~~~~~~
> mm/shmem.c:1748:10: note: remove the 'if' if its condition is always true
> 1748 | } else if (orders & BIT(HPAGE_PMD_ORDER)) {
> | ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
> mm/shmem.c:1736:31: note: initialize the variable 'suitable_orders' to silence this warning
> 1736 | unsigned long suitable_orders;
> | ^
> | = 0
> 2 warnings generated.

Thanks for reporting. Will fix the warning in next version.

2024-06-02 04:36:46

by Baolin Wang

[permalink] [raw]
Subject: Re: [PATCH v3 3/6] mm: shmem: add multi-size THP sysfs interface for anonymous shmem



On 2024/6/1 11:29, wang wei wrote:
> At 2024-05-30 10:04:14, "Baolin Wang" <[email protected]> wrote:
>
>>To support the use of mTHP with anonymous shmem, add a new sysfs interface
>>'shmem_enabled' in the '/sys/kernel/mm/transparent_hugepage/hugepages-kB/'
>>directory for each mTHP to control whether shmem is enabled for that mTHP,
>>with a value similar to the top level 'shmem_enabled', which can be set to:
>>"always", "inherit (to inherit the top level setting)", "within_size", "advise",
>>"never", "deny", "force". These values follow the same semantics as the top
>>level, except the 'deny' is equivalent to 'never', and 'force' is equivalent
>>to 'always' to keep compatibility.
>>
>>By default, PMD-sized hugepages have enabled="inherit" and all other hugepage
>>sizes have enabled="never" for '/sys/kernel/mm/transparent_hugepage/hugepages-xxkB/shmem_enabled'.
>>
>>In addition, if top level value is 'force', then only PMD-sized hugepages
>>have enabled="inherit", otherwise configuration will be failed and vice versa.
>>That means now we will avoid using non-PMD sized THP to override the global
>>huge allocation.
>>
>>Signed-off-by: Baolin Wang <[email protected]>
>>---
>> Documentation/admin-guide/mm/transhuge.rst | 29 +++++++
>> include/linux/huge_mm.h | 10 +++
>> mm/huge_memory.c | 11 +--
>> mm/shmem.c | 96 ++++++++++++++++++++++
>> 4 files changed, 138 insertions(+), 8 deletions(-)
>>
>>diff --git a/Documentation/admin-guide/mm/transhuge.rst b/Documentation/admin-guide/mm/transhuge.rst
>>index d414d3f5592a..659459374032 100644
>>--- a/Documentation/admin-guide/mm/transhuge.rst
>>+++ b/Documentation/admin-guide/mm/transhuge.rst
>>@@ -332,6 +332,35 @@ deny
>> force
>> Force the huge option on for all - very useful for testing;
>>
>>+Anonymous shmem can also use "multi-size THP" (mTHP) by adding a new sysfs knob
>>+to control mTHP allocation: /sys/kernel/mm/transparent_hugepage/hugepages-<size>kB/shmem_enabled.
>>+Its value for each mTHP is essentially consistent with the global setting, except
>>+for the addition of 'inherit' to ensure compatibility with the global settings.
>>+always
>>+ Attempt to allocate <size> huge pages every time we need a new page;
>>+
>>+inherit
>>+ Inherit the top-level "shmem_enabled" value. By default, PMD-sized hugepages
>>+ have enabled="inherit" and all other hugepage sizes have enabled="never";
>>+
>>+never
>>+ Do not allocate <size> huge pages;
>>+
>>+within_size
>>+ Only allocate <size> huge page if it will be fully within i_size.
>>+ Also respect fadvise()/madvise() hints;
>>+
>>+advise
>>+ Only allocate <size> huge pages if requested with fadvise()/madvise();
>>+
>>+deny
>>+ Has the same semantics as 'never', now mTHP allocation policy is only
>>+ used for anonymous shmem and no not override tmpfs.
>>+
>>+force
>>+ Has the same semantics as 'always', now mTHP allocation policy is only
>>+ used for anonymous shmem and no not override tmpfs.
> >+
>
> I just briefly reviewed the discussion about the value of
> hugepages-<size>kB/shmem_enabled
> in V1 [PATCH 5/8]. Is there a conclusion now? Maybe I left out some
> important information.

You can refer to this patch's commit message and documentation,
which are based on the conclusions of previous discussions.

In addition, you can also read more discussions from the last bi-weekly
MM meeting[1], summarized by David.

[1]
https://lore.kernel.org/all/[email protected]/#t

2024-06-03 04:45:01

by Lance Yang

[permalink] [raw]
Subject: Re: [PATCH v3 1/6] mm: memory: extend finish_fault() to support large folio

On Thu, May 30, 2024 at 10:04 AM Baolin Wang
<[email protected]> wrote:
>
> Add large folio mapping establishment support for finish_fault() as a preparation,
> to support multi-size THP allocation of anonymous shmem pages in the following
> patches.
>
> Signed-off-by: Baolin Wang <[email protected]>
> ---
> mm/memory.c | 58 ++++++++++++++++++++++++++++++++++++++++++++---------
> 1 file changed, 48 insertions(+), 10 deletions(-)
>
> diff --git a/mm/memory.c b/mm/memory.c
> index eef4e482c0c2..435187ff7ea4 100644
> --- a/mm/memory.c
> +++ b/mm/memory.c
> @@ -4831,9 +4831,12 @@ vm_fault_t finish_fault(struct vm_fault *vmf)
> {
> struct vm_area_struct *vma = vmf->vma;
> struct page *page;
> + struct folio *folio;
> vm_fault_t ret;
> bool is_cow = (vmf->flags & FAULT_FLAG_WRITE) &&
> !(vma->vm_flags & VM_SHARED);
> + int type, nr_pages, i;
> + unsigned long addr = vmf->address;
>
> /* Did we COW the page? */
> if (is_cow)
> @@ -4864,24 +4867,59 @@ vm_fault_t finish_fault(struct vm_fault *vmf)
> return VM_FAULT_OOM;
> }
>
> + folio = page_folio(page);
> + nr_pages = folio_nr_pages(folio);
> +
> + /*
> + * Using per-page fault to maintain the uffd semantics, and same
> + * approach also applies to non-anonymous-shmem faults to avoid
> + * inflating the RSS of the process.
> + */
> + if (!vma_is_anon_shmem(vma) || unlikely(userfaultfd_armed(vma))) {
> + nr_pages = 1;
> + } else if (nr_pages > 1) {
> + pgoff_t idx = folio_page_idx(folio, page);
> + /* The page offset of vmf->address within the VMA. */
> + pgoff_t vma_off = vmf->pgoff - vmf->vma->vm_pgoff;
> +
> + /*
> + * Fallback to per-page fault in case the folio size in page
> + * cache beyond the VMA limits.
> + */
> + if (unlikely(vma_off < idx ||
> + vma_off + (nr_pages - idx) > vma_pages(vma))) {
> + nr_pages = 1;
> + } else {
> + /* Now we can set mappings for the whole large folio. */
> + addr = vmf->address - idx * PAGE_SIZE;
> + page = &folio->page;
> + }
> + }
> +
> vmf->pte = pte_offset_map_lock(vma->vm_mm, vmf->pmd,
> - vmf->address, &vmf->ptl);
> + addr, &vmf->ptl);
> if (!vmf->pte)
> return VM_FAULT_NOPAGE;
>
> /* Re-check under ptl */
> - if (likely(!vmf_pte_changed(vmf))) {
> - struct folio *folio = page_folio(page);
> - int type = is_cow ? MM_ANONPAGES : mm_counter_file(folio);
> -
> - set_pte_range(vmf, folio, page, 1, vmf->address);
> - add_mm_counter(vma->vm_mm, type, 1);
> - ret = 0;
> - } else {
> - update_mmu_tlb(vma, vmf->address, vmf->pte);
> + if (nr_pages == 1 && unlikely(vmf_pte_changed(vmf))) {
> + update_mmu_tlb(vma, addr, vmf->pte);
> ret = VM_FAULT_NOPAGE;
> + goto unlock;
> + } else if (nr_pages > 1 && !pte_range_none(vmf->pte, nr_pages)) {
> + for (i = 0; i < nr_pages; i++)
> + update_mmu_tlb(vma, addr + PAGE_SIZE * i, vmf->pte + i);

Just a friendly reminder: Bang has added update_mmu_tlb_range()[1], which
updates the TLB in batches, so we can use it instead of the update_mmu_tlb()
loop.

[1] https://lore.kernel.org/linux-mm/[email protected]/
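
For illustration, assuming that helper lands with the signature it has in [1]
(vma, address, ptep, nr), the loop above would collapse to something like:

	} else if (nr_pages > 1 && !pte_range_none(vmf->pte, nr_pages)) {
		/* One batched TLB update instead of nr_pages single calls. */
		update_mmu_tlb_range(vma, addr, vmf->pte, nr_pages);
		ret = VM_FAULT_NOPAGE;
		goto unlock;
	}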

Thanks,
Lance

> + ret = VM_FAULT_NOPAGE;
> + goto unlock;
> }
>
> + folio_ref_add(folio, nr_pages - 1);
> + set_pte_range(vmf, folio, page, nr_pages, addr);
> + type = is_cow ? MM_ANONPAGES : mm_counter_file(folio);
> + add_mm_counter(vma->vm_mm, type, nr_pages);
> + ret = 0;
> +
> +unlock:
> pte_unmap_unlock(vmf->pte, vmf->ptl);
> return ret;
> }
> --
> 2.39.3
>

2024-06-03 05:29:18

by Barry Song

[permalink] [raw]
Subject: Re: [PATCH v3 1/6] mm: memory: extend finish_fault() to support large folio

On Thu, May 30, 2024 at 2:04 PM Baolin Wang
<[email protected]> wrote:
>
> Add large folio mapping establishment support for finish_fault() as a preparation,
> to support multi-size THP allocation of anonymous shmem pages in the following
> patches.
>
> Signed-off-by: Baolin Wang <[email protected]>
> ---
> mm/memory.c | 58 ++++++++++++++++++++++++++++++++++++++++++++---------
> 1 file changed, 48 insertions(+), 10 deletions(-)
>
> diff --git a/mm/memory.c b/mm/memory.c
> index eef4e482c0c2..435187ff7ea4 100644
> --- a/mm/memory.c
> +++ b/mm/memory.c
> @@ -4831,9 +4831,12 @@ vm_fault_t finish_fault(struct vm_fault *vmf)
> {
> struct vm_area_struct *vma = vmf->vma;
> struct page *page;
> + struct folio *folio;
> vm_fault_t ret;
> bool is_cow = (vmf->flags & FAULT_FLAG_WRITE) &&
> !(vma->vm_flags & VM_SHARED);
> + int type, nr_pages, i;
> + unsigned long addr = vmf->address;
>
> /* Did we COW the page? */
> if (is_cow)
> @@ -4864,24 +4867,59 @@ vm_fault_t finish_fault(struct vm_fault *vmf)
> return VM_FAULT_OOM;
> }
>
> + folio = page_folio(page);
> + nr_pages = folio_nr_pages(folio);
> +
> + /*
> + * Using per-page fault to maintain the uffd semantics, and same
> + * approach also applies to non-anonymous-shmem faults to avoid
> + * inflating the RSS of the process.

I don't feel the comment explains the root cause.
For non-shmem, anyway we have allocated the memory? Avoiding inflating
RSS seems not so useful as we have occupied the memory. the memory footprint
is what we really care about. so we want to rely on read-ahead hints of subpage
to determine read-ahead size? that is why we don't map nr_pages for non-shmem
files though we can potentially reduce nr_pages - 1 page faults?

> + */
> + if (!vma_is_anon_shmem(vma) || unlikely(userfaultfd_armed(vma))) {
> + nr_pages = 1;
> + } else if (nr_pages > 1) {
> + pgoff_t idx = folio_page_idx(folio, page);
> + /* The page offset of vmf->address within the VMA. */
> + pgoff_t vma_off = vmf->pgoff - vmf->vma->vm_pgoff;
> +
> + /*
> + * Fallback to per-page fault in case the folio size in page
> + * cache beyond the VMA limits.
> + */
> + if (unlikely(vma_off < idx ||
> + vma_off + (nr_pages - idx) > vma_pages(vma))) {
> + nr_pages = 1;
> + } else {
> + /* Now we can set mappings for the whole large folio. */
> + addr = vmf->address - idx * PAGE_SIZE;
> + page = &folio->page;
> + }
> + }
> +
> vmf->pte = pte_offset_map_lock(vma->vm_mm, vmf->pmd,
> - vmf->address, &vmf->ptl);
> + addr, &vmf->ptl);
> if (!vmf->pte)
> return VM_FAULT_NOPAGE;
>
> /* Re-check under ptl */
> - if (likely(!vmf_pte_changed(vmf))) {
> - struct folio *folio = page_folio(page);
> - int type = is_cow ? MM_ANONPAGES : mm_counter_file(folio);
> -
> - set_pte_range(vmf, folio, page, 1, vmf->address);
> - add_mm_counter(vma->vm_mm, type, 1);
> - ret = 0;
> - } else {
> - update_mmu_tlb(vma, vmf->address, vmf->pte);
> + if (nr_pages == 1 && unlikely(vmf_pte_changed(vmf))) {
> + update_mmu_tlb(vma, addr, vmf->pte);
> ret = VM_FAULT_NOPAGE;
> + goto unlock;
> + } else if (nr_pages > 1 && !pte_range_none(vmf->pte, nr_pages)) {

In what case can't we use !pte_range_none(vmf->pte, 1) for nr_pages == 1,
and then unify the code for nr_pages == 1 and nr_pages > 1?

It seems this has been discussed before, but I forget the reason.

> + for (i = 0; i < nr_pages; i++)
> + update_mmu_tlb(vma, addr + PAGE_SIZE * i, vmf->pte + i);
> + ret = VM_FAULT_NOPAGE;
> + goto unlock;
> }
>
> + folio_ref_add(folio, nr_pages - 1);
> + set_pte_range(vmf, folio, page, nr_pages, addr);
> + type = is_cow ? MM_ANONPAGES : mm_counter_file(folio);
> + add_mm_counter(vma->vm_mm, type, nr_pages);
> + ret = 0;
> +
> +unlock:
> pte_unmap_unlock(vmf->pte, vmf->ptl);
> return ret;
> }
> --
> 2.39.3
>

Thanks
Barry

2024-06-03 08:04:54

by Baolin Wang

[permalink] [raw]
Subject: Re: [PATCH v3 1/6] mm: memory: extend finish_fault() to support large folio



On 2024/6/3 12:44, Lance Yang wrote:
> On Thu, May 30, 2024 at 10:04 AM Baolin Wang
> <[email protected]> wrote:
>>
>> Add large folio mapping establishment support for finish_fault() as a preparation,
>> to support multi-size THP allocation of anonymous shmem pages in the following
>> patches.
>>
>> Signed-off-by: Baolin Wang <[email protected]>
>> ---
>> mm/memory.c | 58 ++++++++++++++++++++++++++++++++++++++++++++---------
>> 1 file changed, 48 insertions(+), 10 deletions(-)
>>
>> diff --git a/mm/memory.c b/mm/memory.c
>> index eef4e482c0c2..435187ff7ea4 100644
>> --- a/mm/memory.c
>> +++ b/mm/memory.c
>> @@ -4831,9 +4831,12 @@ vm_fault_t finish_fault(struct vm_fault *vmf)
>> {
>> struct vm_area_struct *vma = vmf->vma;
>> struct page *page;
>> + struct folio *folio;
>> vm_fault_t ret;
>> bool is_cow = (vmf->flags & FAULT_FLAG_WRITE) &&
>> !(vma->vm_flags & VM_SHARED);
>> + int type, nr_pages, i;
>> + unsigned long addr = vmf->address;
>>
>> /* Did we COW the page? */
>> if (is_cow)
>> @@ -4864,24 +4867,59 @@ vm_fault_t finish_fault(struct vm_fault *vmf)
>> return VM_FAULT_OOM;
>> }
>>
>> + folio = page_folio(page);
>> + nr_pages = folio_nr_pages(folio);
>> +
>> + /*
>> + * Using per-page fault to maintain the uffd semantics, and same
>> + * approach also applies to non-anonymous-shmem faults to avoid
>> + * inflating the RSS of the process.
>> + */
>> + if (!vma_is_anon_shmem(vma) || unlikely(userfaultfd_armed(vma))) {
>> + nr_pages = 1;
>> + } else if (nr_pages > 1) {
>> + pgoff_t idx = folio_page_idx(folio, page);
>> + /* The page offset of vmf->address within the VMA. */
>> + pgoff_t vma_off = vmf->pgoff - vmf->vma->vm_pgoff;
>> +
>> + /*
>> + * Fallback to per-page fault in case the folio size in page
>> + * cache beyond the VMA limits.
>> + */
>> + if (unlikely(vma_off < idx ||
>> + vma_off + (nr_pages - idx) > vma_pages(vma))) {
>> + nr_pages = 1;
>> + } else {
>> + /* Now we can set mappings for the whole large folio. */
>> + addr = vmf->address - idx * PAGE_SIZE;
>> + page = &folio->page;
>> + }
>> + }
>> +
>> vmf->pte = pte_offset_map_lock(vma->vm_mm, vmf->pmd,
>> - vmf->address, &vmf->ptl);
>> + addr, &vmf->ptl);
>> if (!vmf->pte)
>> return VM_FAULT_NOPAGE;
>>
>> /* Re-check under ptl */
>> - if (likely(!vmf_pte_changed(vmf))) {
>> - struct folio *folio = page_folio(page);
>> - int type = is_cow ? MM_ANONPAGES : mm_counter_file(folio);
>> -
>> - set_pte_range(vmf, folio, page, 1, vmf->address);
>> - add_mm_counter(vma->vm_mm, type, 1);
>> - ret = 0;
>> - } else {
>> - update_mmu_tlb(vma, vmf->address, vmf->pte);
>> + if (nr_pages == 1 && unlikely(vmf_pte_changed(vmf))) {
>> + update_mmu_tlb(vma, addr, vmf->pte);
>> ret = VM_FAULT_NOPAGE;
>> + goto unlock;
>> + } else if (nr_pages > 1 && !pte_range_none(vmf->pte, nr_pages)) {
>> + for (i = 0; i < nr_pages; i++)
>> + update_mmu_tlb(vma, addr + PAGE_SIZE * i, vmf->pte + i);
>
> Just a friendly reminder: Bang has added the update_mmu_tlb_range()[1] batch
> function to update TLB in batches, so we can use it instead of the
> update_mmu_tlb()
> loop.
>
> [1] https://lore.kernel.org/linux-mm/[email protected]/

Good point, I will use the new helper instead.

2024-06-03 08:30:36

by Baolin Wang

[permalink] [raw]
Subject: Re: [PATCH v3 1/6] mm: memory: extend finish_fault() to support large folio



On 2024/6/3 13:28, Barry Song wrote:
> On Thu, May 30, 2024 at 2:04 PM Baolin Wang
> <[email protected]> wrote:
>>
>> Add large folio mapping establishment support for finish_fault() as a preparation,
>> to support multi-size THP allocation of anonymous shmem pages in the following
>> patches.
>>
>> Signed-off-by: Baolin Wang <[email protected]>
>> ---
>> mm/memory.c | 58 ++++++++++++++++++++++++++++++++++++++++++++---------
>> 1 file changed, 48 insertions(+), 10 deletions(-)
>>
>> diff --git a/mm/memory.c b/mm/memory.c
>> index eef4e482c0c2..435187ff7ea4 100644
>> --- a/mm/memory.c
>> +++ b/mm/memory.c
>> @@ -4831,9 +4831,12 @@ vm_fault_t finish_fault(struct vm_fault *vmf)
>> {
>> struct vm_area_struct *vma = vmf->vma;
>> struct page *page;
>> + struct folio *folio;
>> vm_fault_t ret;
>> bool is_cow = (vmf->flags & FAULT_FLAG_WRITE) &&
>> !(vma->vm_flags & VM_SHARED);
>> + int type, nr_pages, i;
>> + unsigned long addr = vmf->address;
>>
>> /* Did we COW the page? */
>> if (is_cow)
>> @@ -4864,24 +4867,59 @@ vm_fault_t finish_fault(struct vm_fault *vmf)
>> return VM_FAULT_OOM;
>> }
>>
>> + folio = page_folio(page);
>> + nr_pages = folio_nr_pages(folio);
>> +
>> + /*
>> + * Using per-page fault to maintain the uffd semantics, and same
>> + * approach also applies to non-anonymous-shmem faults to avoid
>> + * inflating the RSS of the process.
>
> I don't feel the comment explains the root cause.
> For non-shmem, anyway we have allocated the memory? Avoiding inflating
> RSS seems not so useful as we have occupied the memory. the memory footprint

This is also to keep the same behavior as before for non-anon-shmem, and
will be discussed in the future.

> is what we really care about. so we want to rely on read-ahead hints of subpage
> to determine read-ahead size? that is why we don't map nr_pages for non-shmem
> files though we can potentially reduce nr_pages - 1 page faults?

IMHO, there are 2 cases for non-anon-shmem:
(1) read mmap() faults: we can rely on the 'fault_around_bytes'
interface to determine what size of mapping to build.
(2) writable mmap() faults: I want to keep the same behavior as before
(per-page fault), but we can talk about this when I send new patches to
use mTHP to control large folio allocation for writable mmap().
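
For reference, 'fault_around_bytes' is the existing debugfs knob; a small
user-space sketch (not part of this series) that reads it and then sets it to
4096 to get the per-page behavior:

#include <stdio.h>

int main(void)
{
	const char *knob = "/sys/kernel/debug/fault_around_bytes";
	unsigned long bytes = 0;
	FILE *f = fopen(knob, "r");

	if (f) {
		if (fscanf(f, "%lu", &bytes) == 1)
			printf("fault_around_bytes = %lu\n", bytes);
		fclose(f);
	}

	/* 4096 maps only the faulting page, i.e. the per-page behavior above. */
	f = fopen(knob, "w");
	if (f) {
		fputs("4096", f);
		fclose(f);
	}
	return 0;
}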

>> + */
>> + if (!vma_is_anon_shmem(vma) || unlikely(userfaultfd_armed(vma))) {
>> + nr_pages = 1;
>> + } else if (nr_pages > 1) {
>> + pgoff_t idx = folio_page_idx(folio, page);
>> + /* The page offset of vmf->address within the VMA. */
>> + pgoff_t vma_off = vmf->pgoff - vmf->vma->vm_pgoff;
>> +
>> + /*
>> + * Fallback to per-page fault in case the folio size in page
>> + * cache beyond the VMA limits.
>> + */
>> + if (unlikely(vma_off < idx ||
>> + vma_off + (nr_pages - idx) > vma_pages(vma))) {
>> + nr_pages = 1;
>> + } else {
>> + /* Now we can set mappings for the whole large folio. */
>> + addr = vmf->address - idx * PAGE_SIZE;
>> + page = &folio->page;
>> + }
>> + }
>> +
>> vmf->pte = pte_offset_map_lock(vma->vm_mm, vmf->pmd,
>> - vmf->address, &vmf->ptl);
>> + addr, &vmf->ptl);
>> if (!vmf->pte)
>> return VM_FAULT_NOPAGE;
>>
>> /* Re-check under ptl */
>> - if (likely(!vmf_pte_changed(vmf))) {
>> - struct folio *folio = page_folio(page);
>> - int type = is_cow ? MM_ANONPAGES : mm_counter_file(folio);
>> -
>> - set_pte_range(vmf, folio, page, 1, vmf->address);
>> - add_mm_counter(vma->vm_mm, type, 1);
>> - ret = 0;
>> - } else {
>> - update_mmu_tlb(vma, vmf->address, vmf->pte);
>> + if (nr_pages == 1 && unlikely(vmf_pte_changed(vmf))) {
>> + update_mmu_tlb(vma, addr, vmf->pte);
>> ret = VM_FAULT_NOPAGE;
>> + goto unlock;
>> + } else if (nr_pages > 1 && !pte_range_none(vmf->pte, nr_pages)) {
>
> In what case we can't use !pte_range_none(vmf->pte, 1) for nr_pages == 1
> then unify the code for nr_pages==1 and nr_pages > 1?
>
> It seems this has been discussed before, but I forget the reason.

IIUC, this is for the uffd case, which is not a none pte entry.

2024-06-03 08:58:51

by Barry Song

[permalink] [raw]
Subject: Re: [PATCH v3 1/6] mm: memory: extend finish_fault() to support large folio

On Mon, Jun 3, 2024 at 8:29 PM Baolin Wang
<[email protected]> wrote:
>
>
>
> On 2024/6/3 13:28, Barry Song wrote:
> > On Thu, May 30, 2024 at 2:04 PM Baolin Wang
> > <[email protected]> wrote:
> >>
> >> Add large folio mapping establishment support for finish_fault() as a preparation,
> >> to support multi-size THP allocation of anonymous shmem pages in the following
> >> patches.
> >>
> >> Signed-off-by: Baolin Wang <[email protected]>
> >> ---
> >> mm/memory.c | 58 ++++++++++++++++++++++++++++++++++++++++++++---------
> >> 1 file changed, 48 insertions(+), 10 deletions(-)
> >>
> >> diff --git a/mm/memory.c b/mm/memory.c
> >> index eef4e482c0c2..435187ff7ea4 100644
> >> --- a/mm/memory.c
> >> +++ b/mm/memory.c
> >> @@ -4831,9 +4831,12 @@ vm_fault_t finish_fault(struct vm_fault *vmf)
> >> {
> >> struct vm_area_struct *vma = vmf->vma;
> >> struct page *page;
> >> + struct folio *folio;
> >> vm_fault_t ret;
> >> bool is_cow = (vmf->flags & FAULT_FLAG_WRITE) &&
> >> !(vma->vm_flags & VM_SHARED);
> >> + int type, nr_pages, i;
> >> + unsigned long addr = vmf->address;
> >>
> >> /* Did we COW the page? */
> >> if (is_cow)
> >> @@ -4864,24 +4867,59 @@ vm_fault_t finish_fault(struct vm_fault *vmf)
> >> return VM_FAULT_OOM;
> >> }
> >>
> >> + folio = page_folio(page);
> >> + nr_pages = folio_nr_pages(folio);
> >> +
> >> + /*
> >> + * Using per-page fault to maintain the uffd semantics, and same
> >> + * approach also applies to non-anonymous-shmem faults to avoid
> >> + * inflating the RSS of the process.
> >
> > I don't feel the comment explains the root cause.
> > For non-shmem, anyway we have allocated the memory? Avoiding inflating
> > RSS seems not so useful as we have occupied the memory. the memory footprint
>
> This is also to keep the same behavior as before for non-anon-shmem, and
> will be discussed in the future.

OK.

>
> > is what we really care about. so we want to rely on read-ahead hints of subpage
> > to determine read-ahead size? that is why we don't map nr_pages for non-shmem
> > files though we can potentially reduce nr_pages - 1 page faults?
>
> IMHO, there is 2 cases for non-anon-shmem:
> (1) read mmap() faults: we can rely on the 'fault_around_bytes'
> > interface to determine what size of mapping to build.
> (2) writable mmap() faults: I want to keep the same behavior as before
> (per-page fault), but we can talk about this when I send new patches to
> use mTHP to control large folio allocation for writable mmap().

OK.

>
> >> + */
> >> + if (!vma_is_anon_shmem(vma) || unlikely(userfaultfd_armed(vma))) {
> >> + nr_pages = 1;
> >> + } else if (nr_pages > 1) {
> >> + pgoff_t idx = folio_page_idx(folio, page);
> >> + /* The page offset of vmf->address within the VMA. */
> >> + pgoff_t vma_off = vmf->pgoff - vmf->vma->vm_pgoff;
> >> +
> >> + /*
> >> + * Fallback to per-page fault in case the folio size in page
> >> + * cache beyond the VMA limits.
> >> + */
> >> + if (unlikely(vma_off < idx ||
> >> + vma_off + (nr_pages - idx) > vma_pages(vma))) {
> >> + nr_pages = 1;
> >> + } else {
> >> + /* Now we can set mappings for the whole large folio. */
> >> + addr = vmf->address - idx * PAGE_SIZE;
> >> + page = &folio->page;
> >> + }
> >> + }
> >> +
> >> vmf->pte = pte_offset_map_lock(vma->vm_mm, vmf->pmd,
> >> - vmf->address, &vmf->ptl);
> >> + addr, &vmf->ptl);
> >> if (!vmf->pte)
> >> return VM_FAULT_NOPAGE;
> >>
> >> /* Re-check under ptl */
> >> - if (likely(!vmf_pte_changed(vmf))) {
> >> - struct folio *folio = page_folio(page);
> >> - int type = is_cow ? MM_ANONPAGES : mm_counter_file(folio);
> >> -
> >> - set_pte_range(vmf, folio, page, 1, vmf->address);
> >> - add_mm_counter(vma->vm_mm, type, 1);
> >> - ret = 0;
> >> - } else {
> >> - update_mmu_tlb(vma, vmf->address, vmf->pte);
> >> + if (nr_pages == 1 && unlikely(vmf_pte_changed(vmf))) {
> >> + update_mmu_tlb(vma, addr, vmf->pte);
> >> ret = VM_FAULT_NOPAGE;
> >> + goto unlock;
> >> + } else if (nr_pages > 1 && !pte_range_none(vmf->pte, nr_pages)) {
> >
> > In what case we can't use !pte_range_none(vmf->pte, 1) for nr_pages == 1
> > then unify the code for nr_pages==1 and nr_pages > 1?
> >
> > It seems this has been discussed before, but I forget the reason.
>
> IIUC, this is for uffd case, which is not a none pte entry.

Is it possible to have a COW case for shmem? For example, if someone
maps a shmem
file as read-only and then writes to it, would that prevent the use of
pte_range_none?

Furthermore, if we encounter a large folio in shmem while reading,
does it necessarily
mean we can map the entire folio? Is it possible for some processes to
only map part
of large folios? For instance, if process A allocates large folios and
process B maps
only part of this shmem file or partially unmaps a large folio, how
would that be handled?

Apologies for not debugging this thoroughly, but these two corner
cases seem worth
considering. If these scenarios have already been addressed, please disregard my
comments.

Thanks
Barry

2024-06-03 09:01:47

by Barry Song

[permalink] [raw]
Subject: Re: [PATCH v3 1/6] mm: memory: extend finish_fault() to support large folio

On Mon, Jun 3, 2024 at 8:58 PM Barry Song <[email protected]> wrote:
>
> On Mon, Jun 3, 2024 at 8:29 PM Baolin Wang
> <[email protected]> wrote:
> >
> >
> >
> > On 2024/6/3 13:28, Barry Song wrote:
> > > On Thu, May 30, 2024 at 2:04 PM Baolin Wang
> > > <[email protected]> wrote:
> > >>
> > >> Add large folio mapping establishment support for finish_fault() as a preparation,
> > >> to support multi-size THP allocation of anonymous shmem pages in the following
> > >> patches.
> > >>
> > >> Signed-off-by: Baolin Wang <[email protected]>
> > >> ---
> > >> mm/memory.c | 58 ++++++++++++++++++++++++++++++++++++++++++++---------
> > >> 1 file changed, 48 insertions(+), 10 deletions(-)
> > >>
> > >> diff --git a/mm/memory.c b/mm/memory.c
> > >> index eef4e482c0c2..435187ff7ea4 100644
> > >> --- a/mm/memory.c
> > >> +++ b/mm/memory.c
> > >> @@ -4831,9 +4831,12 @@ vm_fault_t finish_fault(struct vm_fault *vmf)
> > >> {
> > >> struct vm_area_struct *vma = vmf->vma;
> > >> struct page *page;
> > >> + struct folio *folio;
> > >> vm_fault_t ret;
> > >> bool is_cow = (vmf->flags & FAULT_FLAG_WRITE) &&
> > >> !(vma->vm_flags & VM_SHARED);
> > >> + int type, nr_pages, i;
> > >> + unsigned long addr = vmf->address;
> > >>
> > >> /* Did we COW the page? */
> > >> if (is_cow)
> > >> @@ -4864,24 +4867,59 @@ vm_fault_t finish_fault(struct vm_fault *vmf)
> > >> return VM_FAULT_OOM;
> > >> }
> > >>
> > >> + folio = page_folio(page);
> > >> + nr_pages = folio_nr_pages(folio);
> > >> +
> > >> + /*
> > >> + * Using per-page fault to maintain the uffd semantics, and same
> > >> + * approach also applies to non-anonymous-shmem faults to avoid
> > >> + * inflating the RSS of the process.
> > >
> > > I don't feel the comment explains the root cause.
> > > For non-shmem, anyway we have allocated the memory? Avoiding inflating
> > > RSS seems not so useful as we have occupied the memory. the memory footprint
> >
> > This is also to keep the same behavior as before for non-anon-shmem, and
> > will be discussed in the future.
>
> OK.
>
> >
> > > is what we really care about. so we want to rely on read-ahead hints of subpage
> > > to determine read-ahead size? that is why we don't map nr_pages for non-shmem
> > > files though we can potentially reduce nr_pages - 1 page faults?
> >
> > IMHO, there is 2 cases for non-anon-shmem:
> > (1) read mmap() faults: we can rely on the 'fault_around_bytes'
> > > interface to determine what size of mapping to build.
> > (2) writable mmap() faults: I want to keep the same behavior as before
> > (per-page fault), but we can talk about this when I send new patches to
> > use mTHP to control large folio allocation for writable mmap().
>
> OK.
>
> >
> > >> + */
> > >> + if (!vma_is_anon_shmem(vma) || unlikely(userfaultfd_armed(vma))) {
> > >> + nr_pages = 1;
> > >> + } else if (nr_pages > 1) {
> > >> + pgoff_t idx = folio_page_idx(folio, page);
> > >> + /* The page offset of vmf->address within the VMA. */
> > >> + pgoff_t vma_off = vmf->pgoff - vmf->vma->vm_pgoff;
> > >> +
> > >> + /*
> > >> + * Fallback to per-page fault in case the folio size in page
> > >> + * cache beyond the VMA limits.
> > >> + */
> > >> + if (unlikely(vma_off < idx ||
> > >> + vma_off + (nr_pages - idx) > vma_pages(vma))) {
> > >> + nr_pages = 1;
> > >> + } else {
> > >> + /* Now we can set mappings for the whole large folio. */
> > >> + addr = vmf->address - idx * PAGE_SIZE;
> > >> + page = &folio->page;
> > >> + }
> > >> + }
> > >> +
> > >> vmf->pte = pte_offset_map_lock(vma->vm_mm, vmf->pmd,
> > >> - vmf->address, &vmf->ptl);
> > >> + addr, &vmf->ptl);
> > >> if (!vmf->pte)
> > >> return VM_FAULT_NOPAGE;
> > >>
> > >> /* Re-check under ptl */
> > >> - if (likely(!vmf_pte_changed(vmf))) {
> > >> - struct folio *folio = page_folio(page);
> > >> - int type = is_cow ? MM_ANONPAGES : mm_counter_file(folio);
> > >> -
> > >> - set_pte_range(vmf, folio, page, 1, vmf->address);
> > >> - add_mm_counter(vma->vm_mm, type, 1);
> > >> - ret = 0;
> > >> - } else {
> > >> - update_mmu_tlb(vma, vmf->address, vmf->pte);
> > >> + if (nr_pages == 1 && unlikely(vmf_pte_changed(vmf))) {
> > >> + update_mmu_tlb(vma, addr, vmf->pte);
> > >> ret = VM_FAULT_NOPAGE;
> > >> + goto unlock;
> > >> + } else if (nr_pages > 1 && !pte_range_none(vmf->pte, nr_pages)) {
> > >
> > > In what case we can't use !pte_range_none(vmf->pte, 1) for nr_pages == 1
> > > then unify the code for nr_pages==1 and nr_pages > 1?
> > >
> > > It seems this has been discussed before, but I forget the reason.
> >
> > IIUC, this is for uffd case, which is not a none pte entry.
>
> Is it possible to have a COW case for shmem? For example, if someone
> maps a shmem
> file as read-only and then writes to it, would that prevent the use of
> pte_range_none?

Sorry, I meant PRIVATE, not READ-ONLY.

>
> Furthermore, if we encounter a large folio in shmem while reading,
> does it necessarily
> mean we can map the entire folio? Is it possible for some processes to
> only map part
> of large folios? For instance, if process A allocates large folios and
> process B maps
> only part of this shmem file or partially unmaps a large folio, how
> would that be handled?
>
> Apologies for not debugging this thoroughly, but these two corner
> cases seem worth
> considering. If these scenarios have already been addressed, please disregard my
> comments.
>
> Thanks
> Barry

2024-06-03 09:43:05

by Baolin Wang

[permalink] [raw]
Subject: Re: [PATCH v3 1/6] mm: memory: extend finish_fault() to support large folio



On 2024/6/3 17:01, Barry Song wrote:
> On Mon, Jun 3, 2024 at 8:58 PM Barry Song <[email protected]> wrote:
>>
>> On Mon, Jun 3, 2024 at 8:29 PM Baolin Wang
>> <[email protected]> wrote:
>>>
>>>
>>>
>>> On 2024/6/3 13:28, Barry Song wrote:
>>>> On Thu, May 30, 2024 at 2:04 PM Baolin Wang
>>>> <[email protected]> wrote:
>>>>>
>>>>> Add large folio mapping establishment support for finish_fault() as a preparation,
>>>>> to support multi-size THP allocation of anonymous shmem pages in the following
>>>>> patches.
>>>>>
>>>>> Signed-off-by: Baolin Wang <[email protected]>
>>>>> ---
>>>>> mm/memory.c | 58 ++++++++++++++++++++++++++++++++++++++++++++---------
>>>>> 1 file changed, 48 insertions(+), 10 deletions(-)
>>>>>
>>>>> diff --git a/mm/memory.c b/mm/memory.c
>>>>> index eef4e482c0c2..435187ff7ea4 100644
>>>>> --- a/mm/memory.c
>>>>> +++ b/mm/memory.c
>>>>> @@ -4831,9 +4831,12 @@ vm_fault_t finish_fault(struct vm_fault *vmf)
>>>>> {
>>>>> struct vm_area_struct *vma = vmf->vma;
>>>>> struct page *page;
>>>>> + struct folio *folio;
>>>>> vm_fault_t ret;
>>>>> bool is_cow = (vmf->flags & FAULT_FLAG_WRITE) &&
>>>>> !(vma->vm_flags & VM_SHARED);
>>>>> + int type, nr_pages, i;
>>>>> + unsigned long addr = vmf->address;
>>>>>
>>>>> /* Did we COW the page? */
>>>>> if (is_cow)
>>>>> @@ -4864,24 +4867,59 @@ vm_fault_t finish_fault(struct vm_fault *vmf)
>>>>> return VM_FAULT_OOM;
>>>>> }
>>>>>
>>>>> + folio = page_folio(page);
>>>>> + nr_pages = folio_nr_pages(folio);
>>>>> +
>>>>> + /*
>>>>> + * Using per-page fault to maintain the uffd semantics, and same
>>>>> + * approach also applies to non-anonymous-shmem faults to avoid
>>>>> + * inflating the RSS of the process.
>>>>
>>>> I don't feel the comment explains the root cause.
>>>> For non-shmem, anyway we have allocated the memory? Avoiding inflating
>>>> RSS seems not so useful as we have occupied the memory. the memory footprint
>>>
>>> This is also to keep the same behavior as before for non-anon-shmem, and
>>> will be discussed in the future.
>>
>> OK.
>>
>>>
>>>> is what we really care about. so we want to rely on read-ahead hints of subpage
>>>> to determine read-ahead size? that is why we don't map nr_pages for non-shmem
>>>> files though we can potentially reduce nr_pages - 1 page faults?
>>>
>>> IMHO, there is 2 cases for non-anon-shmem:
>>> (1) read mmap() faults: we can rely on the 'fault_around_bytes'
>>> interface to determine what size of mapping to build.
>>> (2) writable mmap() faults: I want to keep the same behavior as before
>>> (per-page fault), but we can talk about this when I send new patches to
>>> use mTHP to control large folio allocation for writable mmap().
>>
>> OK.
>>
>>>
>>>>> + */
>>>>> + if (!vma_is_anon_shmem(vma) || unlikely(userfaultfd_armed(vma))) {
>>>>> + nr_pages = 1;
>>>>> + } else if (nr_pages > 1) {
>>>>> + pgoff_t idx = folio_page_idx(folio, page);
>>>>> + /* The page offset of vmf->address within the VMA. */
>>>>> + pgoff_t vma_off = vmf->pgoff - vmf->vma->vm_pgoff;
>>>>> +
>>>>> + /*
>>>>> + * Fallback to per-page fault in case the folio size in page
>>>>> + * cache beyond the VMA limits.
>>>>> + */
>>>>> + if (unlikely(vma_off < idx ||
>>>>> + vma_off + (nr_pages - idx) > vma_pages(vma))) {
>>>>> + nr_pages = 1;
>>>>> + } else {
>>>>> + /* Now we can set mappings for the whole large folio. */
>>>>> + addr = vmf->address - idx * PAGE_SIZE;
>>>>> + page = &folio->page;
>>>>> + }
>>>>> + }
>>>>> +
>>>>> vmf->pte = pte_offset_map_lock(vma->vm_mm, vmf->pmd,
>>>>> - vmf->address, &vmf->ptl);
>>>>> + addr, &vmf->ptl);
>>>>> if (!vmf->pte)
>>>>> return VM_FAULT_NOPAGE;
>>>>>
>>>>> /* Re-check under ptl */
>>>>> - if (likely(!vmf_pte_changed(vmf))) {
>>>>> - struct folio *folio = page_folio(page);
>>>>> - int type = is_cow ? MM_ANONPAGES : mm_counter_file(folio);
>>>>> -
>>>>> - set_pte_range(vmf, folio, page, 1, vmf->address);
>>>>> - add_mm_counter(vma->vm_mm, type, 1);
>>>>> - ret = 0;
>>>>> - } else {
>>>>> - update_mmu_tlb(vma, vmf->address, vmf->pte);
>>>>> + if (nr_pages == 1 && unlikely(vmf_pte_changed(vmf))) {
>>>>> + update_mmu_tlb(vma, addr, vmf->pte);
>>>>> ret = VM_FAULT_NOPAGE;
>>>>> + goto unlock;
>>>>> + } else if (nr_pages > 1 && !pte_range_none(vmf->pte, nr_pages)) {
>>>>
>>>> In what case we can't use !pte_range_none(vmf->pte, 1) for nr_pages == 1
>>>> then unify the code for nr_pages==1 and nr_pages > 1?
>>>>
>>>> It seems this has been discussed before, but I forget the reason.
>>>
>>> IIUC, this is for uffd case, which is not a none pte entry.
>>
>> Is it possible to have a COW case for shmem? For example, if someone
>> maps a shmem
>> file as read-only and then writes to it, would that prevent the use of
>> pte_range_none?
>
> sorry, i mean PRIVATE but not READ-ONLY.

Yes, I think so. Currently the CoW case still uses per-page faults in do_cow_fault().

>> Furthermore, if we encounter a large folio in shmem while reading,
>> does it necessarily
>> mean we can map the entire folio? Is it possible for some processes to

Now this will depend on the 'fault_around_bytes' interface.

>> only map part
>> of large folios? For instance, if process A allocates large folios and
>> process B maps
>> only part of this shmem file or partially unmaps a large folio, how
>> would that be handled?

This is certainly possible.

For tmpfs:
(1) If 'fault_around_bytes' is enabled, filemap_map_pages() will handle
partial mapping of the large folio for process B.

(2) If 'fault_around_bytes' is set to 0, finish_fault() will fall back to
per-page faults.

For anonymous shmem, process B should be the child of process A in your
case, then:
(1) If 'fault_around_bytes' is enabled, the behavior is the same as for tmpfs.

(2) If 'fault_around_bytes' is set to 0, finish_fault() will build the
whole large folio mapping for process B. Since process B copies the
shared VMA from parent process A, the mTHP mapping is shared.

>> Apologies for not debugging this thoroughly, but these two corner
>> cases seem worth
>> considering. If these scenarios have already been addressed, please disregard my
>> comments.

No worries:) Thanks for your valuable input.

2024-06-04 08:19:05

by Daniel Gomez

[permalink] [raw]
Subject: Re: [PATCH v3 0/6] add mTHP support for anonymous shmem

On Fri, May 31, 2024 at 01:13:48PM +0200, David Hildenbrand wrote:
> > >
> > > As a default, we should not be using large folios / mTHP for any shmem,
> > > just like we did with THP via shmem_enabled. This is what this series
> > > currently does, and is part of the whole mTHP user-space interface design.
> > >
> > > Further, the mTHP controls should control all of shmem, not only
> > > "anonymous shmem".
> >
> > Yes, that's what I thought and in my TODO list.
>
> Good, it would be helpful to coordinate with Daniel and Pankaj.

I've integrated patches 11 and 12 from the lsf RFC thread [1] on top of Baolin's
v3 patches. You may find a version in my integration branch here [2]. I can
attach them here if it's preferred.

[1] https://lore.kernel.org/all/[email protected]/
[2] https://gitlab.com/dkruces/linux-next/-/commits/next-20240604-shmem-mthp

The point here is to combine the large folios strategy I proposed with mTHP
user controls. Would it make sense to limit the orders to the mapping order
calculated based on the size and index?

@@ -1765,6 +1798,10 @@ static struct folio *shmem_alloc_and_add_folio(struct vm_fault *vmf,

order = highest_order(suitable_orders);
while (suitable_orders) {
+ if (order > mapping_order) {
+ order = next_order(&suitable_orders, order);
+ continue;
+ }
pages = 1UL << order;
index = round_down(index, pages);
folio = shmem_alloc_folio(gfp, order, info, index);

Note: The branch still needs to be adapted to include !anon mm.
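
For clarity, the mapping_order above could come from a helper along these
lines (hypothetical name and shape, only to illustrate capping the order by
i_size at the faulting index):

/*
 * Illustrative only: the largest naturally aligned order that keeps a
 * folio at 'index' within i_size.  Not code from the branch.
 */
static unsigned int shmem_mapping_order(struct inode *inode, pgoff_t index)
{
	pgoff_t end = DIV_ROUND_UP(i_size_read(inode), PAGE_SIZE);
	unsigned int order = 0;

	if (end <= index)
		return 0;

	while (IS_ALIGNED(index, 1UL << (order + 1)) &&
	       (1UL << (order + 1)) <= end - index)
		order++;

	return order;
}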

>
> >
> > >
> > > Also, we should properly fallback within the configured sizes, and not
> > > jump "over" configured sizes. Unless there is a good reason.
> > >
> > > (3) khugepaged
> > >
> > > khugepaged needs to handle larger folios properly as well. Until fixed,
> > > using smaller THP sizes as fallback might prohibit collapsing a
> > > PMD-sized THP later. But really, khugepaged needs to be fixed to handle
> > > that.
> > > (4) force/disable
> > >
> > > These settings are rather testing artifacts from the old ages. We should
> > > not add them to the per-size toggles. We might "inherit" it from the
> > > global one, though.
> >
> > Sorry, I missed this. So I should remove the 'force' and 'deny' option
> > for each mTHP, right?
>
> Yes, that's my understanding. But we have to keep them on the top level for
> any possible user out there.
>
> >
> > >
> > > "within_size" might have value, and especially for consistency, we
> > > should have them per size.
> > >
> > >
> > >
> > > So, this series only tackles anonymous shmem, which is a good starting
> > > point. Ideally, we'd get support for other shmem (especially during
> > > fault time) soon afterwards, because we won't be adding separate toggles
> > > for that from the interface POV, and having inconsistent behavior
> > > between kernel versions would be a bit unfortunate.
> > >
> > >
> > > @Baolin, this series likely does not consider (4) yet. And I suggest we
> > > have to take a lot of the "anonymous thp" terminology out of this
> > > series, especially when it comes to documentation.
> >
> > Sure. I will remove the "anonymous thp" terminology from the
> > documentation, but want to still keep it in the commit message, cause I
> > want to start from the anonymous shmem.
>
> For commit message and friends makes sense. The story should be "controls
> all of shmem/tmpfs, but support will be added iteratively. The first step is
> anonymous shmem."
>
> --
> Cheers,
>
> David / dhildenb
>

2024-06-04 09:24:07

by Dan Carpenter

[permalink] [raw]
Subject: Re: [PATCH v3 4/6] mm: shmem: add mTHP support for anonymous shmem

Hi Baolin,

kernel test robot noticed the following build warnings:

url: https://github.com/intel-lab-lkp/linux/commits/Baolin-Wang/mm-memory-extend-finish_fault-to-support-large-folio/20240530-100805
base: https://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm.git mm-everything
patch link: https://lore.kernel.org/r/ec35a23026dd016705d211e85163cabe07681516.1717033868.git.baolin.wang%40linux.alibaba.com
patch subject: [PATCH v3 4/6] mm: shmem: add mTHP support for anonymous shmem
config: powerpc64-randconfig-r071-20240531 (https://download.01.org/0day-ci/archive/20240602/[email protected]/config)
compiler: clang version 19.0.0git (https://github.com/llvm/llvm-project bafda89a0944d947fc4b3b5663185e07a397ac30)

If you fix the issue in a separate patch/commit (i.e. not just a new version of
the same patch/commit), kindly add following tags
| Reported-by: kernel test robot <[email protected]>
| Reported-by: Dan Carpenter <[email protected]>
| Closes: https://lore.kernel.org/r/[email protected]/

smatch warnings:
mm/shmem.c:1766 shmem_alloc_and_add_folio() error: uninitialized symbol 'suitable_orders'.

vim +/suitable_orders +1766 mm/shmem.c

ededbc2c2f28a1 Baolin Wang 2024-05-30 1729 static struct folio *shmem_alloc_and_add_folio(struct vm_fault *vmf,
ededbc2c2f28a1 Baolin Wang 2024-05-30 1730 gfp_t gfp, struct inode *inode, pgoff_t index,
ededbc2c2f28a1 Baolin Wang 2024-05-30 1731 struct mm_struct *fault_mm, unsigned long orders)
800d8c63b2e989 Kirill A. Shutemov 2016-07-26 1732 {
3022fd7af9604d Hugh Dickins 2023-09-29 1733 struct address_space *mapping = inode->i_mapping;
0f0796945614b7 Mike Rapoport 2017-09-06 1734 struct shmem_inode_info *info = SHMEM_I(inode);
ededbc2c2f28a1 Baolin Wang 2024-05-30 1735 struct vm_area_struct *vma = vmf ? vmf->vma : NULL;
ededbc2c2f28a1 Baolin Wang 2024-05-30 1736 unsigned long suitable_orders;
ededbc2c2f28a1 Baolin Wang 2024-05-30 1737 struct folio *folio = NULL;
3022fd7af9604d Hugh Dickins 2023-09-29 1738 long pages;
ededbc2c2f28a1 Baolin Wang 2024-05-30 1739 int error, order;
800d8c63b2e989 Kirill A. Shutemov 2016-07-26 1740
396bcc5299c281 Matthew Wilcox (Oracle 2020-04-06 1741) if (!IS_ENABLED(CONFIG_TRANSPARENT_HUGEPAGE))
ededbc2c2f28a1 Baolin Wang 2024-05-30 1742 orders = 0;
800d8c63b2e989 Kirill A. Shutemov 2016-07-26 1743
ededbc2c2f28a1 Baolin Wang 2024-05-30 1744 if (orders > 0) {
ededbc2c2f28a1 Baolin Wang 2024-05-30 1745 if (vma && vma_is_anon_shmem(vma)) {
ededbc2c2f28a1 Baolin Wang 2024-05-30 1746 suitable_orders = anon_shmem_suitable_orders(inode, vmf,
ededbc2c2f28a1 Baolin Wang 2024-05-30 1747 mapping, index, orders);
ededbc2c2f28a1 Baolin Wang 2024-05-30 1748 } else if (orders & BIT(HPAGE_PMD_ORDER)) {
3022fd7af9604d Hugh Dickins 2023-09-29 1749 pages = HPAGE_PMD_NR;
ededbc2c2f28a1 Baolin Wang 2024-05-30 1750 suitable_orders = BIT(HPAGE_PMD_ORDER);
3022fd7af9604d Hugh Dickins 2023-09-29 1751 index = round_down(index, HPAGE_PMD_NR);
3022fd7af9604d Hugh Dickins 2023-09-29 1752
3022fd7af9604d Hugh Dickins 2023-09-29 1753 /*
3022fd7af9604d Hugh Dickins 2023-09-29 1754 * Check for conflict before waiting on a huge allocation.
3022fd7af9604d Hugh Dickins 2023-09-29 1755 * Conflict might be that a huge page has just been allocated
3022fd7af9604d Hugh Dickins 2023-09-29 1756 * and added to page cache by a racing thread, or that there
3022fd7af9604d Hugh Dickins 2023-09-29 1757 * is already at least one small page in the huge extent.
3022fd7af9604d Hugh Dickins 2023-09-29 1758 * Be careful to retry when appropriate, but not forever!
3022fd7af9604d Hugh Dickins 2023-09-29 1759 * Elsewhere -EEXIST would be the right code, but not here.
3022fd7af9604d Hugh Dickins 2023-09-29 1760 */
3022fd7af9604d Hugh Dickins 2023-09-29 1761 if (xa_find(&mapping->i_pages, &index,
3022fd7af9604d Hugh Dickins 2023-09-29 1762 index + HPAGE_PMD_NR - 1, XA_PRESENT))
3022fd7af9604d Hugh Dickins 2023-09-29 1763 return ERR_PTR(-E2BIG);
ededbc2c2f28a1 Baolin Wang 2024-05-30 1764 }

suitable_orders uninitialized on else path.

52cd3b074050dd Lee Schermerhorn 2008-04-28 1765
ededbc2c2f28a1 Baolin Wang 2024-05-30 @1766 order = highest_order(suitable_orders);
ededbc2c2f28a1 Baolin Wang 2024-05-30 1767 while (suitable_orders) {
ededbc2c2f28a1 Baolin Wang 2024-05-30 1768 pages = 1UL << order;
ededbc2c2f28a1 Baolin Wang 2024-05-30 1769 index = round_down(index, pages);
ededbc2c2f28a1 Baolin Wang 2024-05-30 1770 folio = shmem_alloc_folio(gfp, order, info, index);
ededbc2c2f28a1 Baolin Wang 2024-05-30 1771 if (folio)
ededbc2c2f28a1 Baolin Wang 2024-05-30 1772 goto allocated;
ededbc2c2f28a1 Baolin Wang 2024-05-30 1773
ededbc2c2f28a1 Baolin Wang 2024-05-30 1774 if (pages == HPAGE_PMD_NR)
3022fd7af9604d Hugh Dickins 2023-09-29 1775 count_vm_event(THP_FILE_FALLBACK);
ededbc2c2f28a1 Baolin Wang 2024-05-30 1776 order = next_order(&suitable_orders, order);
ededbc2c2f28a1 Baolin Wang 2024-05-30 1777 }
3022fd7af9604d Hugh Dickins 2023-09-29 1778 } else {
3022fd7af9604d Hugh Dickins 2023-09-29 1779 pages = 1;

--
0-DAY CI Kernel Test Service
https://github.com/intel/lkp-tests/wiki


2024-06-04 09:29:45

by Daniel Gomez

[permalink] [raw]
Subject: Re: [PATCH v3 0/6] add mTHP support for anonymous shmem

On Fri, May 31, 2024 at 04:43:32PM +0200, David Hildenbrand wrote:
> Hi Daniel,
>
> > > Let me summarize the takeaway from the bi-weekly MM meeting as I understood
> > > it, that includes Hugh's feedback on per-block tracking vs. mTHP:
> >
> > Thanks David for the summary. Please, find below some follow up questions.
> >
> > I want to understand if zeropage scanning overhead is preferred over per-block
> > tracking complexity or if we still need to verify this.
> >
> > >
> > > (1) Per-block tracking
> > >
> > > Per-block tracking is currently considered unwarranted complexity in
> > > shmem.c. We should try to get it done without that. For any test cases that
> > > fail, we should consider if they are actually valid for shmem.
> >
> > I agree it was unwarranted complexity but only if this is just to fix lseek() as
> > we can simply make the test pass by checking if holes are reported as data. That
> > would be the minimum required for lseek() to be compliant with the syscall.
> >
> > How can we use per-block tracking for reclaiming memory and what changes would
> > be needed? Or is per-block really a non-viable option?
>
> The interesting thing is: with mTHP toggles it is opt-in -- like for
> PMD-sized THP with shmem_enabled -- and we don't have to be that concerned
> about this problem right now.

Without respecting the size when allocating large folios, mTHP toggles would
over-allocate. My proposal added earlier in this thread is to combine the two
to avoid that case. Otherwise, shouldn't we try to find a solution for the
reclaiming path?

>
> >
> > Clearly, if per-block is viable option, shmem_fault() bug would required to be
> > fixed first. Any ideas on how to make it reproducible?
> >
> > The alternatives discussed where sub-page refcounting and zeropage scanning.
>
> Yeah, I don't think sub-page refcounting is a feasible (and certainly not
> desired ;) ) option in the folio world.
>
> > The first one is not possible (IIUC) because of a refactor years back that
> > simplified the code and also requires extra complexity. The second option would
> > require additional overhead as we would involve scanning.
>
> We'll likely need something similar (scanning, tracking?) for anonymous
> memory as well. There was a proposal for a THP shrinker some time ago, that
> would solve part of the problem.

It's good to know we have the same problem in different places. I'm more
inclined to keep the information rather than adding extra overhead, unless
the complexity is really overwhelming. Considering the concerns here, I'm not
sure how much we should try merging with iomap as per Ritesh's suggestion.

Regarding the THP shrinker, could you please confirm whether it is the following thread?

https://lore.kernel.org/all/[email protected]/

>
> For example, for shmem you could simply flag folios that failed splitting
> during fallocate() as reclaim candidates and try to reclaim memory later. So
> you don't have to scan arbitrary folios (which might also be desired,
> though).

Thanks for the suggestion. I'll look into this.

>
> >
> > >
> > > To optimize FALLOC_FL_PUNCH_HOLE for the cases where splitting+freeing
> > > is not possible at fallcoate() time, detecting zeropages later and
> > > retrying to split+free might be an option, without per-block tracking.
> >
> > >
> > > (2) mTHP controls
> > >
> > > As a default, we should not be using large folios / mTHP for any shmem, just
> > > like we did with THP via shmem_enabled. This is what this series currently
> > > does, and is aprt of the whole mTHP user-space interface design.
> >
> > That was clear for me too. But what is the reason we want to boot in 'safe
> > mode'? What are the implications of not respecing that?
>
> [...]
>
> >
> > As I understood from the call, mTHP with sysctl knobs is preferred over
> > optimistic falloc/write allocation? But is still unclear to me why the former
> > is preferred.
>
> I think Hugh's point was that this should be an opt-in feature, just like
> PMD-sized THP started out, and still is, an opt-in feature.

I'd be keen to understand the use case for this, even for the current THP controls
we have in tmpfs. I guess these are just scenarios with no swap involved?
Are these use cases the same for both tmpfs and anonymous shmem?

>
> Problematic interaction with khugepaged (that will be fixed) is one thing,
> interaction with memory reclaim (without any kind of memory reclaim
> mechanisms in place) might be another one. Controlling and tuning for
> specific folio sizes might be another one Hugh raised. [summarizing what I
> recall from the discussion, there might be more].
>
> >
> > Is large folios a non-viable option?
>
> I think you mean "allocating large folios without user space control".

Yes.

>
> Because mTHP gives user space full control, to the point where you can
> enable all sizes and obtain the same result.

Agree.

>
> >
> > >
> > > Also, we should properly fallback within the configured sizes, and not jump
> > > "over" configured sizes. Unless there is a good reason.
> > >
> > > (3) khugepaged
> > >
> > > khugepaged needs to handle larger folios properly as well. Until fixed,
> > > using smaller THP sizes as fallback might prohibit collapsing a PMD-sized
> > > THP later. But really, khugepaged needs to be fixed to handle that.
> > >
> > > (4) force/disable
> > >
> > > These settings are rather testing artifacts from the old ages. We should not
> > > add them to the per-size toggles. We might "inherit" it from the global one,
> > > though.
> > >
> > > "within_size" might have value, and especially for consistency, we should
> > > have them per size.
> > >
> > >
> > >
> > > So, this series only tackles anonymous shmem, which is a good starting
> > > point. Ideally, we'd get support for other shmem (especially during fault
> > > time) soon afterwards, because we won't be adding separate toggles for that
> > > from the interface POV, and having inconsistent behavior between kernel
> > > versions would be a bit unfortunate.
> > >
> > >
> > > @Baolin, this series likely does not consider (4) yet. And I suggest we have
> > > to take a lot of the "anonymous thp" terminology out of this series,
> > > especially when it comes to documentation.
> > >
> > > @Daniel, Pankaj, what are your plans regarding that? It would be great if we
> > > could get an understanding on the next steps on !anon shmem.
> >
> > I realize I've raised so many questions, but it's essential for us to grasp the
> > mm concerns and usage scenarios. This understanding will provide clarity on the
> > direction regarding folios for !anon shmem.
>
> If I understood correctly, Hugh had strong feelings against not respecting
> mTHP toggles for shmem. Without per-block tracking, I agree with that.

My understanding was the same. But I have a follow-up question: should
we respect mTHP toggles without considering mapping constraints (size and
index)? Or is within_size where this intermediate approach fits, as long as
mTHP granularity is respected?

Daniel

>
> --
> Cheers,
>
> David / dhildenb
>

2024-06-04 09:45:35

by Baolin Wang

[permalink] [raw]
Subject: Re: [PATCH v3 0/6] add mTHP support for anonymous shmem



On 2024/6/4 16:18, Daniel Gomez wrote:
> On Fri, May 31, 2024 at 01:13:48PM +0200, David Hildenbrand wrote:
>>>>
>>>> As a default, we should not be using large folios / mTHP for any shmem,
>>>> just like we did with THP via shmem_enabled. This is what this series
>>>> currently does, and is aprt of the whole mTHP user-space interface design.
>>>>
>>>> Further, the mTHP controls should control all of shmem, not only
>>>> "anonymous shmem".
>>>
>>> Yes, that's what I thought and in my TODO list.
>>
>> Good, it would be helpful to coordinate with Daniel and Pankaj.
>
> I've integrated patches 11 and 12 from the lsf RFC thread [1] on top of Baolin's
> v3 patches. You may find a version in my integration branch here [2]. I can
> attach them here if it's preferred.
>
> [1] https://lore.kernel.org/all/[email protected]/
> [2] https://gitlab.com/dkruces/linux-next/-/commits/next-20240604-shmem-mthp
>
> The point here is to combine the large folios strategy I proposed with mTHP
> user controls. Would it make sense to limit the orders to the mapping order
> calculated based on the size and index?

IMO, for !anon shmem, this change makes sense to me. We should respect
the size, and mTHP should act as an order filter.

For anon shmem, we should ignore the length, which you always set to
PAGE_SIZE in patch [1].

[1]
https://gitlab.com/dkruces/linux-next/-/commit/edf02311fd6d86b355d3aeb74e67c8da6de3c569

> @@ -1765,6 +1798,10 @@ static struct folio *shmem_alloc_and_add_folio(struct vm_fault *vmf,
>
> order = highest_order(suitable_orders);
> while (suitable_orders) {
> + if (order > mapping_order) {
> + order = next_order(&suitable_orders, order);
> + continue;
> + }
> pages = 1UL << order;
> index = round_down(index, pages);
> folio = shmem_alloc_folio(gfp, order, info, index);
>
> Note: The branch still need to be adapted to include !anon mm.

2024-06-04 09:48:35

by Baolin Wang

[permalink] [raw]
Subject: Re: [PATCH v3 4/6] mm: shmem: add mTHP support for anonymous shmem



On 2024/6/4 17:23, Dan Carpenter wrote:
> Hi Baolin,
>
> kernel test robot noticed the following build warnings:
>
> https://git-scm.com/docs/git-format-patch#_base_tree_information]
>
> url: https://github.com/intel-lab-lkp/linux/commits/Baolin-Wang/mm-memory-extend-finish_fault-to-support-large-folio/20240530-100805
> base: https://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm.git mm-everything
> patch link: https://lore.kernel.org/r/ec35a23026dd016705d211e85163cabe07681516.1717033868.git.baolin.wang%40linux.alibaba.com
> patch subject: [PATCH v3 4/6] mm: shmem: add mTHP support for anonymous shmem
> config: powerpc64-randconfig-r071-20240531 (https://download.01.org/0day-ci/archive/20240602/[email protected]/config)
> compiler: clang version 19.0.0git (https://github.com/llvm/llvm-project bafda89a0944d947fc4b3b5663185e07a397ac30)
>
> If you fix the issue in a separate patch/commit (i.e. not just a new version of
> the same patch/commit), kindly add following tags
> | Reported-by: kernel test robot <[email protected]>
> | Reported-by: Dan Carpenter <[email protected]>
> | Closes: https://lore.kernel.org/r/[email protected]/
>
> smatch warnings:
> mm/shmem.c:1766 shmem_alloc_and_add_folio() error: uninitialized symbol 'suitable_orders'.

Thanks Dan. LKP also reported this warning [1]. Will fix it.

[1] https://lore.kernel.org/all/[email protected]/
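
For readers following the report, a minimal sketch of the kind of fix the
warning points at (assuming simply initializing the variable is acceptable;
the actual fix is of course up to the author):

	/* Sketch only: leave the order mask empty when neither branch
	 * below fills it, instead of reading an uninitialized value. */
	unsigned long suitable_orders = 0;

With an empty mask, the 'while (suitable_orders)' loop at line 1767 never
executes, so the value highest_order() computes at the flagged line 1766 is
never used.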

2024-06-04 10:00:14

by David Hildenbrand

[permalink] [raw]
Subject: Re: [PATCH v3 0/6] add mTHP support for anonymous shmem

[...]

>>> How can we use per-block tracking for reclaiming memory and what changes would
>>> be needed? Or is per-block really a non-viable option?
>>
>> The interesting thing is: with mTHP toggles it is opt-in -- like for
>> PMD-sized THP with shmem_enabled -- and we don't have to be that concerned
>> about this problem right now.
>
> Without respecting the size when allocating large folios, mTHP toggles would
> over allocate. My proposal added earlier to this thread is to combine the 2
> to avoid that case. Otherwise, shouldn't we try to find a solution for the
> reclaiming path?

I think at some point we'll really have to do a better job at reclaiming
(memory over-allocation, PUNCH_HOLE ranges that couldn't be split, but maybe
also pages that are now all-zero again and could be reclaimed).

>
>>
>>>
>>> Clearly, if per-block is viable option, shmem_fault() bug would required to be
>>> fixed first. Any ideas on how to make it reproducible?
>>>
>>> The alternatives discussed where sub-page refcounting and zeropage scanning.
>>
>> Yeah, I don't think sub-page refcounting is a feasible (and certainly not
>> desired ;) ) option in the folio world.
>>
>>> The first one is not possible (IIUC) because of a refactor years back that
>>> simplified the code and also requires extra complexity. The second option would
>>> require additional overhead as we would involve scanning.
>>
>> We'll likely need something similar (scanning, tracking?) for anonymous
>> memory as well. There was a proposal for a THP shrinker some time ago, that
>> would solve part of the problem.
>
> It's good to know we have the same problem in different places. I'm more
> inclined to keep the information rather than adding an extra overhead. Unless
> the complexity is really overwhelming. Considering the concerns here, not sure
> how much should we try merging with iomap as per Ritesh's suggestion.

As raised in the meeting, I do see value in maintaining the information;
but I also see why Hugh and Kirill think this might be unwarranted
complexity in shmem.c. Meaning: we might be able to achieve something
similar without it, and we don't have to solve the problem right now to
benefit from large folios.

>
> The THP shrinker, could you please confirm if it is this following thread?
>
> https://lore.kernel.org/all/[email protected]/

Yes, although there has been no follow-up so far. Possibly because, in the
current khugepaged approach, there will be a constant back-and-forth
between khugepaged collapsing memory (and wasting memory in the default
setting) and the THP shrinker reclaiming memory; that doesn't sound quite
effective :) It needs more thought IMHO.

[...]

>>>> To optimize FALLOC_FL_PUNCH_HOLE for the cases where splitting+freeing
>>>> is not possible at fallcoate() time, detecting zeropages later and
>>>> retrying to split+free might be an option, without per-block tracking.
>>>
>>>>
>>>> (2) mTHP controls
>>>>
>>>> As a default, we should not be using large folios / mTHP for any shmem, just
>>>> like we did with THP via shmem_enabled. This is what this series currently
>>>> does, and is aprt of the whole mTHP user-space interface design.
>>>
>>> That was clear for me too. But what is the reason we want to boot in 'safe
>>> mode'? What are the implications of not respecing that?
>>
>> [...]
>>
>>>
>>> As I understood from the call, mTHP with sysctl knobs is preferred over
>>> optimistic falloc/write allocation? But is still unclear to me why the former
>>> is preferred.
>>
>> I think Hugh's point was that this should be an opt-in feature, just like
>> PMD-sized THP started out, and still is, an opt-in feature.
>
> I'd be keen to understand the use case for this. Even the current THP controls
> we have in tmpfs. I guess these are just scenarios with no swap involved?
> Are these use cases the same for both tmpfs and shmem anon mm?

Systems without swap are one case, I think. The other reason for a toggle
in the past was to be able to disable it to troubleshoot issues (Hugh
mentioned in the meeting unlocking new code paths in shmem.c only behind
a toggle).

Staring at my Fedora system:

$ cat /sys/kernel/mm/transparent_hugepage/shmem_enabled
always within_size advise [never] deny force

Maybe because it uses tmpfs to mount /tmp (there is an interesting article on
lwn.net about that) and people are concerned about the side effects
(which can currently waste memory, or result in more reclaim work being
required when exceeding file sizes).

For VMs, I know that shmem+THP with memory ballooning is problematic and
not really recommended.

[...]

>>>
>>>>
>>>> Also, we should properly fallback within the configured sizes, and not jump
>>>> "over" configured sizes. Unless there is a good reason.
>>>>
>>>> (3) khugepaged
>>>>
>>>> khugepaged needs to handle larger folios properly as well. Until fixed,
>>>> using smaller THP sizes as fallback might prohibit collapsing a PMD-sized
>>>> THP later. But really, khugepaged needs to be fixed to handle that.
>>>>
>>>> (4) force/disable
>>>>
>>>> These settings are rather testing artifacts from the old ages. We should not
>>>> add them to the per-size toggles. We might "inherit" it from the global one,
>>>> though.
>>>>
>>>> "within_size" might have value, and especially for consistency, we should
>>>> have them per size.
>>>>
>>>>
>>>>
>>>> So, this series only tackles anonymous shmem, which is a good starting
>>>> point. Ideally, we'd get support for other shmem (especially during fault
>>>> time) soon afterwards, because we won't be adding separate toggles for that
>>>> from the interface POV, and having inconsistent behavior between kernel
>>>> versions would be a bit unfortunate.
>>>>
>>>>
>>>> @Baolin, this series likely does not consider (4) yet. And I suggest we have
>>>> to take a lot of the "anonymous thp" terminology out of this series,
>>>> especially when it comes to documentation.
>>>>
>>>> @Daniel, Pankaj, what are your plans regarding that? It would be great if we
>>>> could get an understanding on the next steps on !anon shmem.
>>>
>>> I realize I've raised so many questions, but it's essential for us to grasp the
>>> mm concerns and usage scenarios. This understanding will provide clarity on the
>>> direction regarding folios for !anon shmem.
>>
>> If I understood correctly, Hugh had strong feelings against not respecting
>> mTHP toggles for shmem. Without per-block tracking, I agree with that.
>
> My understanding was the same. But I have this follow-up question: should
> we respect mTHP toggles without considering mapping constraints (size and
> index)? Or perhaps we should use within_size where we can fit this intermediate
> approach, as long as mTHP granularity is respected?

Likely we should do exactly the same as we would do with the existing
PMD-sized THPs.

I recall in the meeting that we discussed that always vs. within_size
seems to have some value, and that we should respect that setting like
we did for PMD-sized THP.

Which other mapping constraints could we have?

--
Cheers,

David / dhildenb


2024-06-04 12:05:21

by Daniel Gomez

[permalink] [raw]
Subject: Re: [PATCH v3 0/6] add mTHP support for anonymous shmem

On Tue, Jun 04, 2024 at 05:45:20PM +0800, Baolin Wang wrote:
>
>
> On 2024/6/4 16:18, Daniel Gomez wrote:
> > On Fri, May 31, 2024 at 01:13:48PM +0200, David Hildenbrand wrote:
> > > > >
> > > > > As a default, we should not be using large folios / mTHP for any shmem,
> > > > > just like we did with THP via shmem_enabled. This is what this series
> > > > > currently does, and is aprt of the whole mTHP user-space interface design.
> > > > >
> > > > > Further, the mTHP controls should control all of shmem, not only
> > > > > "anonymous shmem".
> > > >
> > > > Yes, that's what I thought and in my TODO list.
> > >
> > > Good, it would be helpful to coordinate with Daniel and Pankaj.
> >
> > I've integrated patches 11 and 12 from the lsf RFC thread [1] on top of Baolin's
> > v3 patches. You may find a version in my integration branch here [2]. I can
> > attach them here if it's preferred.
> >
> > [1] https://lore.kernel.org/all/[email protected]/
> > [2] https://gitlab.com/dkruces/linux-next/-/commits/next-20240604-shmem-mthp
> >
> > The point here is to combine the large folios strategy I proposed with mTHP
> > user controls. Would it make sense to limit the orders to the mapping order
> > calculated based on the size and index?
>
> IMO, for !anon shmem, this change makes sense to me. We should respect the
> size and mTHP should act as a order filter.

What about respecting the size when the within_size flag is enabled? Then 'always'
would allocate mTHP-enabled folios regardless of the size, and 'never'
would ignore both mTHP and size. So 'never' can be used for the 'safe' boot case
mentioned in the discussion.

>
> For anon shmem, we should ignore the length, which you always set it to
> PAGE_SIZE in patch [1].
>
> [1] https://gitlab.com/dkruces/linux-next/-/commit/edf02311fd6d86b355d3aeb74e67c8da6de3c569

Since we are ignoring the length, we should ignore any value being passed.

>
> > @@ -1765,6 +1798,10 @@ static struct folio *shmem_alloc_and_add_folio(struct vm_fault *vmf,
> >
> > order = highest_order(suitable_orders);
> > while (suitable_orders) {
> > + if (order > mapping_order) {
> > + order = next_order(&suitable_orders, order);
> > + continue;
> > + }
> > pages = 1UL << order;
> > index = round_down(index, pages);
> > folio = shmem_alloc_folio(gfp, order, info, index);
> >
> > Note: The branch still need to be adapted to include !anon mm.

2024-06-04 12:31:30

by Daniel Gomez

[permalink] [raw]
Subject: Re: [PATCH v3 0/6] add mTHP support for anonymous shmem

On Tue, Jun 04, 2024 at 11:59:51AM +0200, David Hildenbrand wrote:
> [...]
>
> > > > How can we use per-block tracking for reclaiming memory and what changes would
> > > > be needed? Or is per-block really a non-viable option?
> > >
> > > The interesting thing is: with mTHP toggles it is opt-in -- like for
> > > PMD-sized THP with shmem_enabled -- and we don't have to be that concerned
> > > about this problem right now.
> >
> > Without respecting the size when allocating large folios, mTHP toggles would
> > over allocate. My proposal added earlier to this thread is to combine the 2
> > to avoid that case. Otherwise, shouldn't we try to find a solution for the
> > reclaiming path?
>
> I think at some point we'll really have to do a better job at reclaiming
> (either memory-overallocation, PUNCHHOLE that couldn't split, but maybe also
> pages that are now all-zero again and could be reclaimed again).
>
> >
> > >
> > > >
> > > > Clearly, if per-block is viable option, shmem_fault() bug would required to be
> > > > fixed first. Any ideas on how to make it reproducible?
> > > >
> > > > The alternatives discussed where sub-page refcounting and zeropage scanning.
> > >
> > > Yeah, I don't think sub-page refcounting is a feasible (and certainly not
> > > desired ;) ) option in the folio world.
> > >
> > > > The first one is not possible (IIUC) because of a refactor years back that
> > > > simplified the code and also requires extra complexity. The second option would
> > > > require additional overhead as we would involve scanning.
> > >
> > > We'll likely need something similar (scanning, tracking?) for anonymous
> > > memory as well. There was a proposal for a THP shrinker some time ago, that
> > > would solve part of the problem.
> >
> > It's good to know we have the same problem in different places. I'm more
> > inclined to keep the information rather than adding an extra overhead. Unless
> > the complexity is really overwhelming. Considering the concerns here, not sure
> > how much should we try merging with iomap as per Ritesh's suggestion.
>
> As raised in the meeting, I do see value in maintaining the information; but
> I also see why Hugh and Kirill think this might be unwarranted complexity in
> shmem.c. Meaning: we might be able to achieve something similar without it,
> and we don't have to solve the problem right now to benefit from large
> folios.
>
> >
> > The THP shrinker, could you please confirm if it is this following thread?
> >
> > https://lore.kernel.org/all/[email protected]/
>
> Yes, although there was no follow up so far. Possibly, because in the
> current khugepaged approach, there will be a constant back-and-forth between
> khugepaged collapsing memory (and wasting memory in the default setting),
> and the THP shrinker reclaiming memory; doesn't sound quite effective :)
> That needs more thought IMHO.
>
> [...]
>
> > > > > To optimize FALLOC_FL_PUNCH_HOLE for the cases where splitting+freeing
> > > > > is not possible at fallcoate() time, detecting zeropages later and
> > > > > retrying to split+free might be an option, without per-block tracking.
> > > >
> > > > >
> > > > > (2) mTHP controls
> > > > >
> > > > > As a default, we should not be using large folios / mTHP for any shmem, just
> > > > > like we did with THP via shmem_enabled. This is what this series currently
> > > > > does, and is aprt of the whole mTHP user-space interface design.
> > > >
> > > > That was clear for me too. But what is the reason we want to boot in 'safe
> > > > mode'? What are the implications of not respecing that?
> > >
> > > [...]
> > >
> > > >
> > > > As I understood from the call, mTHP with sysctl knobs is preferred over
> > > > optimistic falloc/write allocation? But is still unclear to me why the former
> > > > is preferred.
> > >
> > > I think Hugh's point was that this should be an opt-in feature, just like
> > > PMD-sized THP started out, and still is, an opt-in feature.
> >
> > I'd be keen to understand the use case for this. Even the current THP controls
> > we have in tmpfs. I guess these are just scenarios with no swap involved?
> > Are these use cases the same for both tmpfs and shmem anon mm?
>
> Systems without swap are one case I think. The other reason for a toggle in
> the past was to be able to disable it to troubleshoot issues (Hugh mentioned
> in the meeting about unlocking new code paths in shmem.c only with a
> toggle).
>
> Staring at my Fedora system:
>
> $ cat /sys/kernel/mm/transparent_hugepage/shmem_enabled
> always within_size advise [never] deny force
>
> Maybe because it uses tmpfs to mount /tmp (interesting article on lwn.net
> about that) and people are concerned about the side-effects (that can
> currently waste memory, or result in more reclaim work being required when
> exceeding file sizes).
>
> For VMs, I know that shmem+THP with memory ballooning is problematic and not
> really recommended.

Agree. We cannot punch holes when using THP unless the punch is PMD-sized.
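
To make the granularity point concrete, here is an illustrative userspace
sketch (not from this thread; the file name and sizes are arbitrary). A
sub-PMD punch inside a PMD-sized THP cannot release memory unless the kernel
manages to split the huge folio, while a PMD-aligned, PMD-sized punch can
free the whole folio directly:

#define _GNU_SOURCE
#include <fcntl.h>
#include <sys/mman.h>
#include <unistd.h>

int main(void)
{
	int fd = memfd_create("thp-demo", MFD_CLOEXEC);

	if (fd < 0)
		return 1;
	ftruncate(fd, 4 << 20);		/* 4 MiB shmem file */

	/* 4 KiB punch inside a (potentially) 2 MiB huge folio: the
	 * backing memory may stay allocated until the folio is split. */
	fallocate(fd, FALLOC_FL_PUNCH_HOLE | FALLOC_FL_KEEP_SIZE, 0, 4096);

	/* PMD-sized, PMD-aligned punch: the whole huge folio can go. */
	fallocate(fd, FALLOC_FL_PUNCH_HOLE | FALLOC_FL_KEEP_SIZE,
		  2 << 20, 2 << 20);

	close(fd);
	return 0;
}

This is exactly why ballooning, which typically works in small page units,
interacts badly with shmem+THP.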

>
> [...]
>
> > > >
> > > > >
> > > > > Also, we should properly fallback within the configured sizes, and not jump
> > > > > "over" configured sizes. Unless there is a good reason.
> > > > >
> > > > > (3) khugepaged
> > > > >
> > > > > khugepaged needs to handle larger folios properly as well. Until fixed,
> > > > > using smaller THP sizes as fallback might prohibit collapsing a PMD-sized
> > > > > THP later. But really, khugepaged needs to be fixed to handle that.
> > > > >
> > > > > (4) force/disable
> > > > >
> > > > > These settings are rather testing artifacts from the old ages. We should not
> > > > > add them to the per-size toggles. We might "inherit" it from the global one,
> > > > > though.
> > > > >
> > > > > "within_size" might have value, and especially for consistency, we should
> > > > > have them per size.
> > > > >
> > > > >
> > > > >
> > > > > So, this series only tackles anonymous shmem, which is a good starting
> > > > > point. Ideally, we'd get support for other shmem (especially during fault
> > > > > time) soon afterwards, because we won't be adding separate toggles for that
> > > > > from the interface POV, and having inconsistent behavior between kernel
> > > > > versions would be a bit unfortunate.
> > > > >
> > > > >
> > > > > @Baolin, this series likely does not consider (4) yet. And I suggest we have
> > > > > to take a lot of the "anonymous thp" terminology out of this series,
> > > > > especially when it comes to documentation.
> > > > >
> > > > > @Daniel, Pankaj, what are your plans regarding that? It would be great if we
> > > > > could get an understanding on the next steps on !anon shmem.
> > > >
> > > > I realize I've raised so many questions, but it's essential for us to grasp the
> > > > mm concerns and usage scenarios. This understanding will provide clarity on the
> > > > direction regarding folios for !anon shmem.
> > >
> > > If I understood correctly, Hugh had strong feelings against not respecting
> > > mTHP toggles for shmem. Without per-block tracking, I agree with that.
> >
> > My understanding was the same. But I have this follow-up question: should
> > we respect mTHP toggles without considering mapping constraints (size and
> > index)? Or perhaps we should use within_size where we can fit this intermediate
> > approach, as long as mTHP granularity is respected?
>
> Likely we should do exactly the same as we would do with the existing
> PMD-sized THPs.
>
> I recall in the meeting that we discussed that always vs. within_size seems
> to have some value, and that we should respect that setting like we did for
> PMD-sized THP.
>
> Which other mapping constraints could we have?

Patch 12 in my RFC for LSF [1] adds shmem_mapping_size_order() (updated
to support get_order()) to get the maximum order 'based on the file size which
the mapping currently allows at the given index'. The same check is done in
filemap.c [2].

[1] https://lore.kernel.org/all/[email protected]/
[2] https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/mm/filemap.c?h=v6.10-rc2#n1940

@@ -1726,6 +1726,37 @@ static struct folio *shmem_alloc_folio(gfp_t gfp, int order,
 	return folio;
 }

+/**
+ * shmem_mapping_size_order - Get maximum folio order for the given file size.
+ * @mapping: Target address_space.
+ * @index: The page index.
+ * @size: The suggested size of the folio to create.
+ *
+ * This returns a high order for folios (when supported) based on the file size
+ * which the mapping currently allows at the given index. The index is relevant
+ * due to alignment considerations the mapping might have. The returned order
+ * may be less than the size passed.
+ *
+ * Like __filemap_get_folio order calculation.
+ *
+ * Return: The order.
+ */
+static inline unsigned int
+shmem_mapping_size_order(struct address_space *mapping, pgoff_t index,
+			 size_t size)
+{
+	unsigned int order = get_order(max_t(size_t, size, PAGE_SIZE));
+
+	if (!mapping_large_folio_support(mapping))
+		return 0;
+
+	/* If we're not aligned, allocate a smaller folio */
+	if (index & ((1UL << order) - 1))
+		order = __ffs(index);
+
+	return min_t(size_t, order, MAX_PAGECACHE_ORDER);
+}
+
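
For illustration, here is a hypothetical helper (the name and placement are
mine, not part of the posted patches) showing how the order computed above
could be combined with the bitmask of mTHP-enabled orders, so that mTHP acts
as the order filter discussed earlier in the thread:

/* Sketch: drop every enabled order larger than what the mapping allows
 * for this index/size before walking the candidates. */
static unsigned long shmem_limit_orders(struct address_space *mapping,
					pgoff_t index, size_t len,
					unsigned long orders)
{
	unsigned int max_order = shmem_mapping_size_order(mapping, index, len);

	return orders & (BIT(max_order + 1) - 1);
}

The allocation loop in shmem_alloc_and_add_folio() would then simply walk the
reduced mask, which is equivalent to the 'order > mapping_order' skip in the
hunk quoted earlier.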

>
> --
> Cheers,
>
> David / dhildenb
>

2024-06-06 03:31:16

by Baolin Wang

[permalink] [raw]
Subject: Re: [PATCH v3 0/6] add mTHP support for anonymous shmem



On 2024/6/4 20:05, Daniel Gomez wrote:
> On Tue, Jun 04, 2024 at 05:45:20PM +0800, Baolin Wang wrote:
>>
>>
>> On 2024/6/4 16:18, Daniel Gomez wrote:
>>> On Fri, May 31, 2024 at 01:13:48PM +0200, David Hildenbrand wrote:
>>>>>>
>>>>>> As a default, we should not be using large folios / mTHP for any shmem,
>>>>>> just like we did with THP via shmem_enabled. This is what this series
>>>>>> currently does, and is aprt of the whole mTHP user-space interface design.
>>>>>>
>>>>>> Further, the mTHP controls should control all of shmem, not only
>>>>>> "anonymous shmem".
>>>>>
>>>>> Yes, that's what I thought and in my TODO list.
>>>>
>>>> Good, it would be helpful to coordinate with Daniel and Pankaj.
>>>
>>> I've integrated patches 11 and 12 from the lsf RFC thread [1] on top of Baolin's
>>> v3 patches. You may find a version in my integration branch here [2]. I can
>>> attach them here if it's preferred.
>>>
>>> [1] https://lore.kernel.org/all/[email protected]/
>>> [2] https://gitlab.com/dkruces/linux-next/-/commits/next-20240604-shmem-mthp
>>>
>>> The point here is to combine the large folios strategy I proposed with mTHP
>>> user controls. Would it make sense to limit the orders to the mapping order
>>> calculated based on the size and index?
>>
>> IMO, for !anon shmem, this change makes sense to me. We should respect the
>> size and mTHP should act as a order filter.
>
> What about respecing the size when within_size flag is enabled? Then, 'always'
> would allocate mTHP enabled folios, regardless of the size. And 'never'
> would ignore mTHP and size. So, 'never' can be used for this 'safe' boot case
> mentioned in the discussion.

Looks reasonable to me. What do you think, David?

And what about the 'advise' option? Similar to 'within_size'?

>> For anon shmem, we should ignore the length, which you always set it to
>> PAGE_SIZE in patch [1].
>>
>> [1] https://gitlab.com/dkruces/linux-next/-/commit/edf02311fd6d86b355d3aeb74e67c8da6de3c569
>
> Since we are ignoring the length, we should ignore any value being passed.
>
>>
>>> @@ -1765,6 +1798,10 @@ static struct folio *shmem_alloc_and_add_folio(struct vm_fault *vmf,
>>>
>>> order = highest_order(suitable_orders);
>>> while (suitable_orders) {
>>> + if (order > mapping_order) {
>>> + order = next_order(&suitable_orders, order);
>>> + continue;
>>> + }
>>> pages = 1UL << order;
>>> index = round_down(index, pages);
>>> folio = shmem_alloc_folio(gfp, order, info, index);
>>>
>>> Note: The branch still need to be adapted to include !anon mm.

2024-06-06 08:49:43

by David Hildenbrand

[permalink] [raw]
Subject: Re: [PATCH v3 0/6] add mTHP support for anonymous shmem

On 06.06.24 05:31, Baolin Wang wrote:
>
>
> On 2024/6/4 20:05, Daniel Gomez wrote:
>> On Tue, Jun 04, 2024 at 05:45:20PM +0800, Baolin Wang wrote:
>>>
>>>
>>> On 2024/6/4 16:18, Daniel Gomez wrote:
>>>> On Fri, May 31, 2024 at 01:13:48PM +0200, David Hildenbrand wrote:
>>>>>>>
>>>>>>> As a default, we should not be using large folios / mTHP for any shmem,
>>>>>>> just like we did with THP via shmem_enabled. This is what this series
>>>>>>> currently does, and is aprt of the whole mTHP user-space interface design.
>>>>>>>
>>>>>>> Further, the mTHP controls should control all of shmem, not only
>>>>>>> "anonymous shmem".
>>>>>>
>>>>>> Yes, that's what I thought and in my TODO list.
>>>>>
>>>>> Good, it would be helpful to coordinate with Daniel and Pankaj.
>>>>
>>>> I've integrated patches 11 and 12 from the lsf RFC thread [1] on top of Baolin's
>>>> v3 patches. You may find a version in my integration branch here [2]. I can
>>>> attach them here if it's preferred.
>>>>
>>>> [1] https://lore.kernel.org/all/[email protected]/
>>>> [2] https://gitlab.com/dkruces/linux-next/-/commits/next-20240604-shmem-mthp
>>>>
>>>> The point here is to combine the large folios strategy I proposed with mTHP
>>>> user controls. Would it make sense to limit the orders to the mapping order
>>>> calculated based on the size and index?
>>>
>>> IMO, for !anon shmem, this change makes sense to me. We should respect the
>>> size and mTHP should act as a order filter.
>>
>> What about respecing the size when within_size flag is enabled? Then, 'always'
>> would allocate mTHP enabled folios, regardless of the size. And 'never'
>> would ignore mTHP and size. So, 'never' can be used for this 'safe' boot case
>> mentioned in the discussion.
>
> Looks reasonable to me. What do you think, David?
>

That mimics existing PMD-THP behavior, right?

With "within_size", we must not exceed the size, with "always", we may
exceed the size.
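
As a concrete reading of that rule, a hypothetical per-order check could look
like the following (the function name is illustrative and not from the series;
the constants are the existing shmem_enabled values). 'always' admits any
enabled order, while 'within_size' only admits an order whose aligned range
stays inside i_size:

static bool shmem_order_within_policy(struct inode *inode, pgoff_t index,
				      unsigned int order, int policy)
{
	pgoff_t aligned_end;
	loff_t i_size;

	if (policy == SHMEM_HUGE_ALWAYS)
		return true;

	if (policy == SHMEM_HUGE_WITHIN_SIZE) {
		/* Same shape as the PMD-sized check in shmem_is_huge(),
		 * with the candidate order instead of HPAGE_PMD_NR. */
		aligned_end = round_up(index + 1, 1UL << order);
		i_size = round_up(i_size_read(inode), PAGE_SIZE);
		return i_size >> PAGE_SHIFT >= aligned_end;
	}

	/* 'never'; other values ('advise' etc.) are out of scope here. */
	return false;
}

(How 'advise' should behave is exactly the question below.)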

> And what about 'advise' option? Silimar to 'within_size'?

Good question. What's the behavior of PMD-THP? I would assume it behaves
like "within_size", because in the common case we mmap (+advise) only
within the size of the file, not exceeding it.

(the always option, as I learned during the meeting, is primarily
helpful when writing to tmpfs files in an append-fashion. With
mmap()+madvise() that doesn't quite happen)

--
Cheers,

David / dhildenb


2024-06-06 09:33:38

by Baolin Wang

[permalink] [raw]
Subject: Re: [PATCH v3 0/6] add mTHP support for anonymous shmem



On 2024/6/6 16:38, David Hildenbrand wrote:
> On 06.06.24 05:31, Baolin Wang wrote:
>>
>>
>> On 2024/6/4 20:05, Daniel Gomez wrote:
>>> On Tue, Jun 04, 2024 at 05:45:20PM +0800, Baolin Wang wrote:
>>>>
>>>>
>>>> On 2024/6/4 16:18, Daniel Gomez wrote:
>>>>> On Fri, May 31, 2024 at 01:13:48PM +0200, David Hildenbrand wrote:
>>>>>>>>
>>>>>>>> As a default, we should not be using large folios / mTHP for any
>>>>>>>> shmem,
>>>>>>>> just like we did with THP via shmem_enabled. This is what this
>>>>>>>> series
>>>>>>>> currently does, and is aprt of the whole mTHP user-space
>>>>>>>> interface design.
>>>>>>>>
>>>>>>>> Further, the mTHP controls should control all of shmem, not only
>>>>>>>> "anonymous shmem".
>>>>>>>
>>>>>>> Yes, that's what I thought and in my TODO list.
>>>>>>
>>>>>> Good, it would be helpful to coordinate with Daniel and Pankaj.
>>>>>
>>>>> I've integrated patches 11 and 12 from the lsf RFC thread [1] on
>>>>> top of Baolin's
>>>>> v3 patches. You may find a version in my integration branch here
>>>>> [2]. I can
>>>>> attach them here if it's preferred.
>>>>>
>>>>> [1]
>>>>> https://lore.kernel.org/all/[email protected]/
>>>>> [2]
>>>>> https://gitlab.com/dkruces/linux-next/-/commits/next-20240604-shmem-mthp
>>>>>
>>>>> The point here is to combine the large folios strategy I proposed
>>>>> with mTHP
>>>>> user controls. Would it make sense to limit the orders to the
>>>>> mapping order
>>>>> calculated based on the size and index?
>>>>
>>>> IMO, for !anon shmem, this change makes sense to me. We should
>>>> respect the
>>>> size and mTHP should act as a order filter.
>>>
>>> What about respecing the size when within_size flag is enabled? Then,
>>> 'always'
>>> would allocate mTHP enabled folios, regardless of the size. And 'never'
>>> would ignore mTHP and size. So, 'never' can be used for this 'safe'
>>> boot case
>>> mentioned in the discussion.
>>
>> Looks reasonable to me. What do you think, David?
>>
>
> That mimics existing PMD-THP behavior, right?
>
> With "within_size", we must not exceed the size, with "always", we may
> exceed the size.

Yes, I think so.

>> And what about 'advise' option? Silimar to 'within_size'?
>
> Good question. What's the behavior of PMD-THP? I would assume it behaves
> like "within_size", because in the common case we mmap (+advise) only
> within the size of the file, not exceeding it.

Yes, that is also my understanding.

> (the always option, as I learned during the meeting, is primarily
> helpful when writing to tmpfs files in an append-fashion. With
> mmap()+madvise() that doesn't quite happen)
>

2024-06-07 09:06:23

by Daniel Gomez

[permalink] [raw]
Subject: Re: [PATCH v3 0/6] add mTHP support for anonymous shmem

On Thu, Jun 06, 2024 at 10:38:22AM +0200, David Hildenbrand wrote:
> On 06.06.24 05:31, Baolin Wang wrote:
> >
> >
> > On 2024/6/4 20:05, Daniel Gomez wrote:
> > > On Tue, Jun 04, 2024 at 05:45:20PM +0800, Baolin Wang wrote:
> > > >
> > > >
> > > > On 2024/6/4 16:18, Daniel Gomez wrote:
> > > > > On Fri, May 31, 2024 at 01:13:48PM +0200, David Hildenbrand wrote:
> > > > > > > >
> > > > > > > > As a default, we should not be using large folios / mTHP for any shmem,
> > > > > > > > just like we did with THP via shmem_enabled. This is what this series
> > > > > > > > currently does, and is aprt of the whole mTHP user-space interface design.
> > > > > > > >
> > > > > > > > Further, the mTHP controls should control all of shmem, not only
> > > > > > > > "anonymous shmem".
> > > > > > >
> > > > > > > Yes, that's what I thought and in my TODO list.
> > > > > >
> > > > > > Good, it would be helpful to coordinate with Daniel and Pankaj.
> > > > >
> > > > > I've integrated patches 11 and 12 from the lsf RFC thread [1] on top of Baolin's
> > > > > v3 patches. You may find a version in my integration branch here [2]. I can
> > > > > attach them here if it's preferred.
> > > > >
> > > > > [1] https://lore.kernel.org/all/[email protected]/
> > > > > [2] https://gitlab.com/dkruces/linux-next/-/commits/next-20240604-shmem-mthp
> > > > >
> > > > > The point here is to combine the large folios strategy I proposed with mTHP
> > > > > user controls. Would it make sense to limit the orders to the mapping order
> > > > > calculated based on the size and index?
> > > >
> > > > IMO, for !anon shmem, this change makes sense to me. We should respect the
> > > > size and mTHP should act as a order filter.
> > >
> > > What about respecing the size when within_size flag is enabled? Then, 'always'
> > > would allocate mTHP enabled folios, regardless of the size. And 'never'
> > > would ignore mTHP and size. So, 'never' can be used for this 'safe' boot case
> > > mentioned in the discussion.
> >
> > Looks reasonable to me. What do you think, David?
> >
>
> That mimics existing PMD-THP behavior, right?
>
> With "within_size", we must not exceed the size, with "always", we may
> exceed the size.

But right now we only check the inode size. With large folio support in
write_iter() we can have access to the length as well. I think this would solve
(partially?) the cases where we don't have access to the file size and where we
perform writes in bigger chunks.

E.g. xfs_io -t -f -c "pwrite -b 2M -S 0x58 0 2M" /mnt/test/file

For 'within_size', the example above would allocate 512 order-0 pages
(2 MiB / 4 KiB) instead of one huge page. After patches [1] [2] we can get the
size of the write and allocate whichever mTHP/THP size makes more sense for
the length being passed.

[1] https://lore.kernel.org/all/[email protected]/
[2] https://lore.kernel.org/all/[email protected]/

Here is a quick hack for THP:

@@ -561,7 +561,8 @@ bool shmem_is_huge(struct inode *inode, pgoff_t index, bool shmem_huge_force,
 	case SHMEM_HUGE_WITHIN_SIZE:
 		index = round_up(index + 1, HPAGE_PMD_NR);
 		i_size = round_up(i_size_read(inode), PAGE_SIZE);
-		if (i_size >> PAGE_SHIFT >= index)
+		if ((i_size >> PAGE_SHIFT >= index) ||
+		    (len >> PAGE_SHIFT >= index))
 			return true;
 		fallthrough;


>
> > And what about 'advise' option? Silimar to 'within_size'?
>
> Good question. What's the behavior of PMD-THP? I would assume it behaves
> like "within_size", because in the common case we mmap (+advise) only within
> the size of the file, not exceeding it.

It allocates a huge page on request when MADV_HUGEPAGE is set (regardless of the size).

>
> (the always option, as I learned during the meeting, is primarily helpful
> when writing to tmpfs files in an append-fashion. With mmap()+madvise() that
> doesn't quite happen)

Let's benefit from larger writes as well if the chunk matches any of the
mTHP/THP sizes.

>
> --
> Cheers,
>
> David / dhildenb
>

2024-06-07 10:40:14

by David Hildenbrand

[permalink] [raw]
Subject: Re: [PATCH v3 0/6] add mTHP support for anonymous shmem

On 07.06.24 11:05, Daniel Gomez wrote:
> On Thu, Jun 06, 2024 at 10:38:22AM +0200, David Hildenbrand wrote:
>> On 06.06.24 05:31, Baolin Wang wrote:
>>>
>>>
>>> On 2024/6/4 20:05, Daniel Gomez wrote:
>>>> On Tue, Jun 04, 2024 at 05:45:20PM +0800, Baolin Wang wrote:
>>>>>
>>>>>
>>>>> On 2024/6/4 16:18, Daniel Gomez wrote:
>>>>>> On Fri, May 31, 2024 at 01:13:48PM +0200, David Hildenbrand wrote:
>>>>>>>>>
>>>>>>>>> As a default, we should not be using large folios / mTHP for any shmem,
>>>>>>>>> just like we did with THP via shmem_enabled. This is what this series
>>>>>>>>> currently does, and is aprt of the whole mTHP user-space interface design.
>>>>>>>>>
>>>>>>>>> Further, the mTHP controls should control all of shmem, not only
>>>>>>>>> "anonymous shmem".
>>>>>>>>
>>>>>>>> Yes, that's what I thought and in my TODO list.
>>>>>>>
>>>>>>> Good, it would be helpful to coordinate with Daniel and Pankaj.
>>>>>>
>>>>>> I've integrated patches 11 and 12 from the lsf RFC thread [1] on top of Baolin's
>>>>>> v3 patches. You may find a version in my integration branch here [2]. I can
>>>>>> attach them here if it's preferred.
>>>>>>
>>>>>> [1] https://lore.kernel.org/all/[email protected]/
>>>>>> [2] https://gitlab.com/dkruces/linux-next/-/commits/next-20240604-shmem-mthp
>>>>>>
>>>>>> The point here is to combine the large folios strategy I proposed with mTHP
>>>>>> user controls. Would it make sense to limit the orders to the mapping order
>>>>>> calculated based on the size and index?
>>>>>
>>>>> IMO, for !anon shmem, this change makes sense to me. We should respect the
>>>>> size and mTHP should act as a order filter.
>>>>
>>>> What about respecing the size when within_size flag is enabled? Then, 'always'
>>>> would allocate mTHP enabled folios, regardless of the size. And 'never'
>>>> would ignore mTHP and size. So, 'never' can be used for this 'safe' boot case
>>>> mentioned in the discussion.
>>>
>>> Looks reasonable to me. What do you think, David?
>>>
>>
>> That mimics existing PMD-THP behavior, right?
>>
>> With "within_size", we must not exceed the size, with "always", we may
>> exceed the size.
>
> But right now we only check the inode size. With large folio support in
> write_iter() we can have access to the length as well. I think this would solve
> (paratially?) the cases where we don't have access to the file size and if we
> perform writes in bigger chunks.
>
> E.g. xfs_io -t -f -c "pwrite -b 2M -S 0x58 0 2M" /mnt/test/file
>
> For 'within_size', the example above would allocate 512 pages instead of one
> huge page. After patches [1] [2] we can get the size of the write to allocate
> whatever mTHP/THP makes more sense for the length being passed.
>

Yes, although this sounds like an optimization/change we should be doing
separately I guess.

> [1] https://lore.kernel.org/all/[email protected]/
> [2] https://lore.kernel.org/all/[email protected]/
>
> Here a quick hack for THP:
>
> @@ -561,7 +561,8 @@ bool shmem_is_huge(struct inode *inode, pgoff_t index, bool shmem_huge_force,
> case SHMEM_HUGE_WITHIN_SIZE:
> index = round_up(index + 1, HPAGE_PMD_NR);
> i_size = round_up(i_size_read(inode), PAGE_SIZE);
> - if (i_size >> PAGE_SHIFT >= index)
> + if ((i_size >> PAGE_SHIFT >= index) ||
> + (len >> PAGE_SHIFT >= index))
> return true;
> fallthrough;
>
>
>>
>>> And what about 'advise' option? Silimar to 'within_size'?
>>
>> Good question. What's the behavior of PMD-THP? I would assume it behaves
>> like "within_size", because in the common case we mmap (+advise) only within
>> the size of the file, not exceeding it.
>
> It allocates a huge page on request when MADV_HUGEPAGE (regardless of the size).
>

Interesting, so we should do the same. Thanks!

--
Cheers,

David / dhildenb