2024-04-22 07:03:20

by Baolin Wang

Subject: [RFC PATCH 0/5] add mTHP support for anonymous share pages

Multi-size THP (mTHP) allocation has already been supported for anonymous pages
since commit 19eaf44954df, which allows THP sizes to be configured through the
sysfs interface located at '/sys/kernel/mm/transparent_hugepage/hugepage-XXkb/enabled'.

However, anonymous shared pages ignore the anonymous mTHP rules configured
through the sysfs interface and can only use PMD-mapped THP, which is not
reasonable. Many applications implement anonymous page sharing through
mmap(MAP_SHARED | MAP_ANONYMOUS), especially in database usage scenarios;
therefore, users expect a unified mTHP strategy for anonymous pages, including
anonymous shared pages, in order to enjoy the benefits of mTHP: for example,
lower allocation latency than PMD-mapped THP, smaller memory bloat than
PMD-mapped THP, and contiguous PTEs on ARM architectures to reduce TLB misses.

The primary strategy is that the use of huge pages for anonymous shared pages
still follows the global control determined by the "huge=" mount option or the
sysfs interface at '/sys/kernel/mm/transparent_hugepage/shmem_enabled'.
mTHP may be used only when the global 'huge' switch is enabled. The mTHP sysfs
interface (/sys/kernel/mm/transparent_hugepage/hugepage-XXkb/enabled) is then
checked to determine which mTHP sizes can be used for large folio allocation
for these anonymous shared pages.

TODO:
- More testing and some performance data.
- More discussion is needed about the large folio allocation strategy for a
'regular file' created by memfd_create(), for example when using ftruncate(fd)
to specify the 'file' size: does it need to follow the anonymous mTHP rules too?
- Do not split large folios when shared memory is swapped out.
- Support swapping in large folios for shared memory.

Baolin Wang (5):
mm: memory: extend finish_fault() to support large folio
mm: shmem: add an 'order' parameter for shmem_alloc_hugefolio()
mm: shmem: add THP validation for PMD-mapped THP related statistics
mm: shmem: add mTHP support for anonymous share pages
mm: shmem: add anonymous share mTHP counters

include/linux/huge_mm.h | 4 +-
mm/huge_memory.c | 8 ++-
mm/memory.c | 25 +++++++---
mm/shmem.c | 107 ++++++++++++++++++++++++++++++----------
4 files changed, 108 insertions(+), 36 deletions(-)

--
2.39.3



2024-04-22 07:03:59

by Baolin Wang

Subject: [RFC PATCH 4/5] mm: shmem: add mTHP support for anonymous share pages

Commit 19eaf44954df added multi-size THP (mTHP) for anonymous pages, allowing
THP sizes to be configured through the sysfs interface located at
'/sys/kernel/mm/transparent_hugepage/hugepage-XXkb/enabled'.

However, anonymous shared pages ignore the anonymous mTHP rules configured
through the sysfs interface and can only use PMD-mapped THP, which is not
reasonable. Users expect the mTHP rules to apply to all anonymous pages,
including anonymous shared pages, in order to enjoy the benefits of mTHP:
for example, lower allocation latency than PMD-mapped THP, smaller memory
bloat than PMD-mapped THP, and contiguous PTEs on ARM architectures to
reduce TLB misses.

The primary strategy is that the use of huge pages for anonymous shared pages
still follows the global control determined by the "huge=" mount option or the
sysfs interface at '/sys/kernel/mm/transparent_hugepage/shmem_enabled'.
mTHP may be used only when the global 'huge' switch is enabled. The mTHP sysfs
interface (/sys/kernel/mm/transparent_hugepage/hugepage-XXkb/enabled) is then
checked to determine which mTHP sizes can be used for large folio allocation
for these anonymous shared pages.

Signed-off-by: Baolin Wang <[email protected]>
---
include/linux/huge_mm.h | 2 +-
mm/huge_memory.c | 4 +-
mm/shmem.c | 92 +++++++++++++++++++++++++++++++----------
3 files changed, 74 insertions(+), 24 deletions(-)

diff --git a/include/linux/huge_mm.h b/include/linux/huge_mm.h
index b67294d5814f..26b6fa98d8ac 100644
--- a/include/linux/huge_mm.h
+++ b/include/linux/huge_mm.h
@@ -246,7 +246,7 @@ unsigned long thp_vma_allowable_orders(struct vm_area_struct *vma,
unsigned long orders)
{
/* Optimization to check if required orders are enabled early. */
- if (enforce_sysfs && vma_is_anonymous(vma)) {
+ if (enforce_sysfs && (vma_is_anonymous(vma) || vma_is_anon_shmem(vma))) {
unsigned long mask = READ_ONCE(huge_anon_orders_always);

if (vm_flags & VM_HUGEPAGE)
diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index 9a1b57ef9c60..9e52c0db7580 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -86,7 +86,7 @@ unsigned long __thp_vma_allowable_orders(struct vm_area_struct *vma,
unsigned long orders)
{
/* Check the intersection of requested and supported orders. */
- orders &= vma_is_anonymous(vma) ?
+ orders &= (vma_is_anonymous(vma) || vma_is_anon_shmem(vma)) ?
THP_ORDERS_ALL_ANON : THP_ORDERS_ALL_FILE;
if (!orders)
return 0;
@@ -152,7 +152,7 @@ unsigned long __thp_vma_allowable_orders(struct vm_area_struct *vma,
!enforce_sysfs, vma->vm_mm, vm_flags)
? orders : 0;

- if (!vma_is_anonymous(vma)) {
+ if (!vma_is_anonymous(vma) && !vma_is_anon_shmem(vma)) {
/*
* Enforce sysfs THP requirements as necessary. Anonymous vmas
* were already handled in thp_vma_allowable_orders().
diff --git a/mm/shmem.c b/mm/shmem.c
index b4afda71a3f0..8b009e7040b2 100644
--- a/mm/shmem.c
+++ b/mm/shmem.c
@@ -1603,6 +1603,39 @@ static gfp_t limit_gfp_mask(gfp_t huge_gfp, gfp_t limit_gfp)
return result;
}

+static unsigned long anon_shmem_suitable_orders(struct vm_fault *vmf,
+ struct address_space *mapping, pgoff_t index)
+{
+ struct vm_area_struct *vma = vmf ? vmf->vma : NULL;
+ unsigned long orders, pages;
+ int order;
+
+ /*
+ * Get a list of all the (large) orders below PMD_ORDER + 1 that are enabled
+ * for this vma. Then filter out the orders that can't be allocated over
+ * the faulting address and still be fully contained in the vma.
+ */
+ orders = thp_vma_allowable_orders(vma, vma->vm_flags, false, true, true,
+ BIT(PMD_ORDER + 1) - 1);
+ orders = thp_vma_suitable_orders(vma, vmf->address, orders);
+
+ if (!orders)
+ return orders;
+
+ /* Find the highest order that can add into the page cache */
+ order = highest_order(orders);
+ while (orders) {
+ pages = 1UL << order;
+ index = round_down(index, pages);
+ if (!xa_find(&mapping->i_pages, &index,
+ index + pages - 1, XA_PRESENT))
+ break;
+ order = next_order(&orders, order);
+ }
+
+ return orders;
+}
+
static struct folio *shmem_alloc_hugefolio(gfp_t gfp,
struct shmem_inode_info *info, pgoff_t index, int order)
{
@@ -1631,39 +1664,55 @@ static struct folio *shmem_alloc_folio(gfp_t gfp,
return (struct folio *)page;
}

-static struct folio *shmem_alloc_and_add_folio(gfp_t gfp,
- struct inode *inode, pgoff_t index,
+static struct folio *shmem_alloc_and_add_folio(struct vm_fault *vmf,
+ gfp_t gfp, struct inode *inode, pgoff_t index,
struct mm_struct *fault_mm, bool huge)
{
struct address_space *mapping = inode->i_mapping;
struct shmem_inode_info *info = SHMEM_I(inode);
+ struct vm_area_struct *vma = vmf ? vmf->vma : NULL;
struct folio *folio;
long pages;
int error, order;
+ unsigned long orders;

if (!IS_ENABLED(CONFIG_TRANSPARENT_HUGEPAGE))
huge = false;

if (huge) {
- pages = HPAGE_PMD_NR;
- order = HPAGE_PMD_ORDER;
- index = round_down(index, HPAGE_PMD_NR);
+ if (vma && vma_is_anon_shmem(vma)) {
+ orders = anon_shmem_suitable_orders(vmf, mapping, index);
+ WARN_ON_ONCE(!orders);
+ } else {
+ pages = HPAGE_PMD_NR;
+ orders = BIT(HPAGE_PMD_ORDER);
+ index = round_down(index, HPAGE_PMD_NR);

- /*
- * Check for conflict before waiting on a huge allocation.
- * Conflict might be that a huge page has just been allocated
- * and added to page cache by a racing thread, or that there
- * is already at least one small page in the huge extent.
- * Be careful to retry when appropriate, but not forever!
- * Elsewhere -EEXIST would be the right code, but not here.
- */
- if (xa_find(&mapping->i_pages, &index,
+ /*
+ * Check for conflict before waiting on a huge allocation.
+ * Conflict might be that a huge page has just been allocated
+ * and added to page cache by a racing thread, or that there
+ * is already at least one small page in the huge extent.
+ * Be careful to retry when appropriate, but not forever!
+ * Elsewhere -EEXIST would be the right code, but not here.
+ */
+ if (xa_find(&mapping->i_pages, &index,
index + HPAGE_PMD_NR - 1, XA_PRESENT))
- return ERR_PTR(-E2BIG);
+ return ERR_PTR(-E2BIG);
+ }

- folio = shmem_alloc_hugefolio(gfp, info, index, order);
- if (!folio && pages == HPAGE_PMD_NR)
- count_vm_event(THP_FILE_FALLBACK);
+ order = highest_order(orders);
+ while (orders) {
+ pages = 1 << order;
+ index = round_down(index, pages);
+ folio = shmem_alloc_hugefolio(gfp, info, index, order);
+ if (folio)
+ goto allocated;
+
+ if (pages == HPAGE_PMD_NR)
+ count_vm_event(THP_FILE_FALLBACK);
+ order = next_order(&orders, order);
+ }
} else {
pages = 1;
folio = shmem_alloc_folio(gfp, info, index);
@@ -1671,6 +1720,7 @@ static struct folio *shmem_alloc_and_add_folio(gfp_t gfp,
if (!folio)
return ERR_PTR(-ENOMEM);

+allocated:
__folio_set_locked(folio);
__folio_set_swapbacked(folio);

@@ -2043,7 +2093,7 @@ static int shmem_get_folio_gfp(struct inode *inode, pgoff_t index,

huge_gfp = vma_thp_gfp_mask(vma);
huge_gfp = limit_gfp_mask(huge_gfp, gfp);
- folio = shmem_alloc_and_add_folio(huge_gfp,
+ folio = shmem_alloc_and_add_folio(vmf, huge_gfp,
inode, index, fault_mm, true);
if (!IS_ERR(folio)) {
if (folio_test_pmd_mappable(folio))
@@ -2054,7 +2104,7 @@ static int shmem_get_folio_gfp(struct inode *inode, pgoff_t index,
goto repeat;
}

- folio = shmem_alloc_and_add_folio(gfp, inode, index, fault_mm, false);
+ folio = shmem_alloc_and_add_folio(vmf, gfp, inode, index, fault_mm, false);
if (IS_ERR(folio)) {
error = PTR_ERR(folio);
if (error == -EEXIST)
@@ -2065,7 +2115,7 @@ static int shmem_get_folio_gfp(struct inode *inode, pgoff_t index,

alloced:
alloced = true;
- if (folio_test_pmd_mappable(folio) &&
+ if (folio_test_large(folio) &&
DIV_ROUND_UP(i_size_read(inode), PAGE_SIZE) <
folio_next_index(folio) - 1) {
struct shmem_sb_info *sbinfo = SHMEM_SB(inode->i_sb);
--
2.39.3


2024-04-22 07:04:11

by Baolin Wang

Subject: [RFC PATCH 5/5] mm: shmem: add anonymous share mTHP counters

Signed-off-by: Baolin Wang <[email protected]>
---
include/linux/huge_mm.h | 2 ++
mm/huge_memory.c | 4 ++++
mm/shmem.c | 5 ++++-
3 files changed, 10 insertions(+), 1 deletion(-)

diff --git a/include/linux/huge_mm.h b/include/linux/huge_mm.h
index 26b6fa98d8ac..67b9c1acad31 100644
--- a/include/linux/huge_mm.h
+++ b/include/linux/huge_mm.h
@@ -270,6 +270,8 @@ enum mthp_stat_item {
MTHP_STAT_ANON_SWPOUT,
MTHP_STAT_ANON_SWPOUT_FALLBACK,
MTHP_STAT_ANON_SWPIN_REFAULT,
+ MTHP_STAT_SHMEM_ANON_ALLOC,
+ MTHP_STAT_SHMEM_ANON_ALLOC_FALLBACK,
__MTHP_STAT_COUNT
};

diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index 9e52c0db7580..dc15240c1ab3 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -557,6 +557,8 @@ DEFINE_MTHP_STAT_ATTR(anon_alloc_fallback, MTHP_STAT_ANON_ALLOC_FALLBACK);
DEFINE_MTHP_STAT_ATTR(anon_swpout, MTHP_STAT_ANON_SWPOUT);
DEFINE_MTHP_STAT_ATTR(anon_swpout_fallback, MTHP_STAT_ANON_SWPOUT_FALLBACK);
DEFINE_MTHP_STAT_ATTR(anon_swpin_refault, MTHP_STAT_ANON_SWPIN_REFAULT);
+DEFINE_MTHP_STAT_ATTR(shmem_anon_alloc, MTHP_STAT_SHMEM_ANON_ALLOC);
+DEFINE_MTHP_STAT_ATTR(shmem_anon_alloc_fallback, MTHP_STAT_SHMEM_ANON_ALLOC_FALLBACK);

static struct attribute *stats_attrs[] = {
&anon_alloc_attr.attr,
@@ -564,6 +566,8 @@ static struct attribute *stats_attrs[] = {
&anon_swpout_attr.attr,
&anon_swpout_fallback_attr.attr,
&anon_swpin_refault_attr.attr,
+ &shmem_anon_alloc_attr.attr,
+ &shmem_anon_alloc_fallback_attr.attr,
NULL,
};

diff --git a/mm/shmem.c b/mm/shmem.c
index 8b009e7040b2..4a0aa75ab29c 100644
--- a/mm/shmem.c
+++ b/mm/shmem.c
@@ -1706,11 +1706,14 @@ static struct folio *shmem_alloc_and_add_folio(struct vm_fault *vmf,
pages = 1 << order;
index = round_down(index, pages);
folio = shmem_alloc_hugefolio(gfp, info, index, order);
- if (folio)
+ if (folio) {
+ count_mthp_stat(order, MTHP_STAT_SHMEM_ANON_ALLOC);
goto allocated;
+ }

if (pages == HPAGE_PMD_NR)
count_vm_event(THP_FILE_FALLBACK);
+ count_mthp_stat(order, MTHP_STAT_SHMEM_ANON_ALLOC_FALLBACK);
order = next_order(&orders, order);
}
} else {
--
2.39.3


2024-04-22 07:04:34

by Baolin Wang

Subject: [RFC PATCH 3/5] mm: shmem: add THP validation for PMD-mapped THP related statistics

In order to extend support for mTHP, validate that a folio is actually
PMD-mapped before updating the PMD-mapped THP related statistics, so that
smaller mTHP allocations do not get counted as PMD-sized THP events.

Signed-off-by: Baolin Wang <[email protected]>
---
mm/shmem.c | 7 ++++---
1 file changed, 4 insertions(+), 3 deletions(-)

diff --git a/mm/shmem.c b/mm/shmem.c
index 893c88efc45f..b4afda71a3f0 100644
--- a/mm/shmem.c
+++ b/mm/shmem.c
@@ -1662,7 +1662,7 @@ static struct folio *shmem_alloc_and_add_folio(gfp_t gfp,
return ERR_PTR(-E2BIG);

folio = shmem_alloc_hugefolio(gfp, info, index, order);
- if (!folio)
+ if (!folio && pages == HPAGE_PMD_NR)
count_vm_event(THP_FILE_FALLBACK);
} else {
pages = 1;
@@ -1680,7 +1680,7 @@ static struct folio *shmem_alloc_and_add_folio(gfp_t gfp,
if (xa_find(&mapping->i_pages, &index,
index + pages - 1, XA_PRESENT)) {
error = -EEXIST;
- } else if (huge) {
+ } else if (pages == HPAGE_PMD_NR) {
count_vm_event(THP_FILE_FALLBACK);
count_vm_event(THP_FILE_FALLBACK_CHARGE);
}
@@ -2046,7 +2046,8 @@ static int shmem_get_folio_gfp(struct inode *inode, pgoff_t index,
folio = shmem_alloc_and_add_folio(huge_gfp,
inode, index, fault_mm, true);
if (!IS_ERR(folio)) {
- count_vm_event(THP_FILE_ALLOC);
+ if (folio_test_pmd_mappable(folio))
+ count_vm_event(THP_FILE_ALLOC);
goto alloced;
}
if (PTR_ERR(folio) == -EEXIST)
--
2.39.3


2024-04-22 07:04:55

by Baolin Wang

Subject: [RFC PATCH 2/5] mm: shmem: add an 'order' parameter for shmem_alloc_hugefolio()

Add a new parameter to specify the huge page order for shmem_alloc_hugefolio(),
as a preparation to support mTHP.

Signed-off-by: Baolin Wang <[email protected]>
---
mm/shmem.c | 11 ++++++-----
1 file changed, 6 insertions(+), 5 deletions(-)

diff --git a/mm/shmem.c b/mm/shmem.c
index fa2a0ed97507..893c88efc45f 100644
--- a/mm/shmem.c
+++ b/mm/shmem.c
@@ -1604,14 +1604,14 @@ static gfp_t limit_gfp_mask(gfp_t huge_gfp, gfp_t limit_gfp)
}

static struct folio *shmem_alloc_hugefolio(gfp_t gfp,
- struct shmem_inode_info *info, pgoff_t index)
+ struct shmem_inode_info *info, pgoff_t index, int order)
{
struct mempolicy *mpol;
pgoff_t ilx;
struct page *page;

- mpol = shmem_get_pgoff_policy(info, index, HPAGE_PMD_ORDER, &ilx);
- page = alloc_pages_mpol(gfp, HPAGE_PMD_ORDER, mpol, ilx, numa_node_id());
+ mpol = shmem_get_pgoff_policy(info, index, order, &ilx);
+ page = alloc_pages_mpol(gfp, order, mpol, ilx, numa_node_id());
mpol_cond_put(mpol);

return page_rmappable_folio(page);
@@ -1639,13 +1639,14 @@ static struct folio *shmem_alloc_and_add_folio(gfp_t gfp,
struct shmem_inode_info *info = SHMEM_I(inode);
struct folio *folio;
long pages;
- int error;
+ int error, order;

if (!IS_ENABLED(CONFIG_TRANSPARENT_HUGEPAGE))
huge = false;

if (huge) {
pages = HPAGE_PMD_NR;
+ order = HPAGE_PMD_ORDER;
index = round_down(index, HPAGE_PMD_NR);

/*
@@ -1660,7 +1661,7 @@ static struct folio *shmem_alloc_and_add_folio(gfp_t gfp,
index + HPAGE_PMD_NR - 1, XA_PRESENT))
return ERR_PTR(-E2BIG);

- folio = shmem_alloc_hugefolio(gfp, info, index);
+ folio = shmem_alloc_hugefolio(gfp, info, index, order);
if (!folio)
count_vm_event(THP_FILE_FALLBACK);
} else {
--
2.39.3


2024-04-23 01:13:46

by Barry Song

Subject: Re: [RFC PATCH 3/5] mm: shmem: add THP validation for PMD-mapped THP related statistics

On Mon, Apr 22, 2024 at 3:03 PM Baolin Wang
<[email protected]> wrote:
>
> In order to extend support for mTHP, add THP validation for PMD-mapped THP
> related statistics to avoid statistical confusion.
>
> Signed-off-by: Baolin Wang <[email protected]>
> ---

Reviewed-by: Barry Song <[email protected]>

> mm/shmem.c | 7 ++++---
> 1 file changed, 4 insertions(+), 3 deletions(-)
>
> diff --git a/mm/shmem.c b/mm/shmem.c
> index 893c88efc45f..b4afda71a3f0 100644
> --- a/mm/shmem.c
> +++ b/mm/shmem.c
> @@ -1662,7 +1662,7 @@ static struct folio *shmem_alloc_and_add_folio(gfp_t gfp,
> return ERR_PTR(-E2BIG);
>
> folio = shmem_alloc_hugefolio(gfp, info, index, order);
> - if (!folio)
> + if (!folio && pages == HPAGE_PMD_NR)
> count_vm_event(THP_FILE_FALLBACK);
> } else {
> pages = 1;
> @@ -1680,7 +1680,7 @@ static struct folio *shmem_alloc_and_add_folio(gfp_t gfp,
> if (xa_find(&mapping->i_pages, &index,
> index + pages - 1, XA_PRESENT)) {
> error = -EEXIST;
> - } else if (huge) {
> + } else if (pages == HPAGE_PMD_NR) {
> count_vm_event(THP_FILE_FALLBACK);
> count_vm_event(THP_FILE_FALLBACK_CHARGE);
> }
> @@ -2046,7 +2046,8 @@ static int shmem_get_folio_gfp(struct inode *inode, pgoff_t index,
> folio = shmem_alloc_and_add_folio(huge_gfp,
> inode, index, fault_mm, true);
> if (!IS_ERR(folio)) {
> - count_vm_event(THP_FILE_ALLOC);
> + if (folio_test_pmd_mappable(folio))
> + count_vm_event(THP_FILE_ALLOC);
> goto alloced;
> }
> if (PTR_ERR(folio) == -EEXIST)
> --
> 2.39.3
>

2024-04-23 01:17:49

by Barry Song

Subject: Re: [RFC PATCH 5/5] mm: shmem: add anonymous share mTHP counters

On Mon, Apr 22, 2024 at 3:03 PM Baolin Wang
<[email protected]> wrote:
>
> Signed-off-by: Baolin Wang <[email protected]>
> ---
> include/linux/huge_mm.h | 2 ++
> mm/huge_memory.c | 4 ++++
> mm/shmem.c | 5 ++++-
> 3 files changed, 10 insertions(+), 1 deletion(-)
>
> diff --git a/include/linux/huge_mm.h b/include/linux/huge_mm.h
> index 26b6fa98d8ac..67b9c1acad31 100644
> --- a/include/linux/huge_mm.h
> +++ b/include/linux/huge_mm.h
> @@ -270,6 +270,8 @@ enum mthp_stat_item {
> MTHP_STAT_ANON_SWPOUT,
> MTHP_STAT_ANON_SWPOUT_FALLBACK,
> MTHP_STAT_ANON_SWPIN_REFAULT,
> + MTHP_STAT_SHMEM_ANON_ALLOC,
> + MTHP_STAT_SHMEM_ANON_ALLOC_FALLBACK,

not quite sure about this. for 2MB pmd-mapped THP shmem, we count them
as FILE_THP.
here we are counting as SHMEM_ANON. To me, SHMEM_ANON is more correct but
it doesn't align with pmd-mapped THP. David, Ryan, what do you think?


> __MTHP_STAT_COUNT
> };
>
> diff --git a/mm/huge_memory.c b/mm/huge_memory.c
> index 9e52c0db7580..dc15240c1ab3 100644
> --- a/mm/huge_memory.c
> +++ b/mm/huge_memory.c
> @@ -557,6 +557,8 @@ DEFINE_MTHP_STAT_ATTR(anon_alloc_fallback, MTHP_STAT_ANON_ALLOC_FALLBACK);
> DEFINE_MTHP_STAT_ATTR(anon_swpout, MTHP_STAT_ANON_SWPOUT);
> DEFINE_MTHP_STAT_ATTR(anon_swpout_fallback, MTHP_STAT_ANON_SWPOUT_FALLBACK);
> DEFINE_MTHP_STAT_ATTR(anon_swpin_refault, MTHP_STAT_ANON_SWPIN_REFAULT);
> +DEFINE_MTHP_STAT_ATTR(shmem_anon_alloc, MTHP_STAT_SHMEM_ANON_ALLOC);
> +DEFINE_MTHP_STAT_ATTR(shmem_anon_alloc_fallback, MTHP_STAT_SHMEM_ANON_ALLOC_FALLBACK);
>
> static struct attribute *stats_attrs[] = {
> &anon_alloc_attr.attr,
> @@ -564,6 +566,8 @@ static struct attribute *stats_attrs[] = {
> &anon_swpout_attr.attr,
> &anon_swpout_fallback_attr.attr,
> &anon_swpin_refault_attr.attr,
> + &shmem_anon_alloc_attr.attr,
> + &shmem_anon_alloc_fallback_attr.attr,
> NULL,
> };
>
> diff --git a/mm/shmem.c b/mm/shmem.c
> index 8b009e7040b2..4a0aa75ab29c 100644
> --- a/mm/shmem.c
> +++ b/mm/shmem.c
> @@ -1706,11 +1706,14 @@ static struct folio *shmem_alloc_and_add_folio(struct vm_fault *vmf,
> pages = 1 << order;
> index = round_down(index, pages);
> folio = shmem_alloc_hugefolio(gfp, info, index, order);
> - if (folio)
> + if (folio) {
> + count_mthp_stat(order, MTHP_STAT_SHMEM_ANON_ALLOC);
> goto allocated;
> + }
>
> if (pages == HPAGE_PMD_NR)
> count_vm_event(THP_FILE_FALLBACK);
> + count_mthp_stat(order, MTHP_STAT_SHMEM_ANON_ALLOC_FALLBACK);
> order = next_order(&orders, order);
> }
> } else {
> --
> 2.39.3
>

Thanks
Barry

2024-04-23 01:46:30

by Baolin Wang

Subject: Re: [RFC PATCH 5/5] mm: shmem: add anonymous share mTHP counters



On 2024/4/23 09:17, Barry Song wrote:
> On Mon, Apr 22, 2024 at 3:03 PM Baolin Wang
> <[email protected]> wrote:
>>
>> Signed-off-by: Baolin Wang <[email protected]>
>> ---
>> include/linux/huge_mm.h | 2 ++
>> mm/huge_memory.c | 4 ++++
>> mm/shmem.c | 5 ++++-
>> 3 files changed, 10 insertions(+), 1 deletion(-)
>>
>> diff --git a/include/linux/huge_mm.h b/include/linux/huge_mm.h
>> index 26b6fa98d8ac..67b9c1acad31 100644
>> --- a/include/linux/huge_mm.h
>> +++ b/include/linux/huge_mm.h
>> @@ -270,6 +270,8 @@ enum mthp_stat_item {
>> MTHP_STAT_ANON_SWPOUT,
>> MTHP_STAT_ANON_SWPOUT_FALLBACK,
>> MTHP_STAT_ANON_SWPIN_REFAULT,
>> + MTHP_STAT_SHMEM_ANON_ALLOC,
>> + MTHP_STAT_SHMEM_ANON_ALLOC_FALLBACK,
>
> not quite sure about this. for 2MB pmd-mapped THP shmem, we count them
> as FILE_THP.
> here we are counting as SHMEM_ANON. To me, SHMEM_ANON is more correct but
> it doesn't align with pmd-mapped THP. David, Ryan, what do you think?

Thanks for reviewing.

IMO, I think both approaches are acceptable, which also reflects the
dual nature of anonymous shared pages: on the one hand they are
anonymous pages, and on the other hand, they are backed by a pseudo
file. From the user's perspective, I prefer to use the term "anonymous
shmem", which can be distinguished from the real file-backed THP.

Anyway, let's see what others think.

>> __MTHP_STAT_COUNT
>> };
>>
>> diff --git a/mm/huge_memory.c b/mm/huge_memory.c
>> index 9e52c0db7580..dc15240c1ab3 100644
>> --- a/mm/huge_memory.c
>> +++ b/mm/huge_memory.c
>> @@ -557,6 +557,8 @@ DEFINE_MTHP_STAT_ATTR(anon_alloc_fallback, MTHP_STAT_ANON_ALLOC_FALLBACK);
>> DEFINE_MTHP_STAT_ATTR(anon_swpout, MTHP_STAT_ANON_SWPOUT);
>> DEFINE_MTHP_STAT_ATTR(anon_swpout_fallback, MTHP_STAT_ANON_SWPOUT_FALLBACK);
>> DEFINE_MTHP_STAT_ATTR(anon_swpin_refault, MTHP_STAT_ANON_SWPIN_REFAULT);
>> +DEFINE_MTHP_STAT_ATTR(shmem_anon_alloc, MTHP_STAT_SHMEM_ANON_ALLOC);
>> +DEFINE_MTHP_STAT_ATTR(shmem_anon_alloc_fallback, MTHP_STAT_SHMEM_ANON_ALLOC_FALLBACK);
>>
>> static struct attribute *stats_attrs[] = {
>> &anon_alloc_attr.attr,
>> @@ -564,6 +566,8 @@ static struct attribute *stats_attrs[] = {
>> &anon_swpout_attr.attr,
>> &anon_swpout_fallback_attr.attr,
>> &anon_swpin_refault_attr.attr,
>> + &shmem_anon_alloc_attr.attr,
>> + &shmem_anon_alloc_fallback_attr.attr,
>> NULL,
>> };
>>
>> diff --git a/mm/shmem.c b/mm/shmem.c
>> index 8b009e7040b2..4a0aa75ab29c 100644
>> --- a/mm/shmem.c
>> +++ b/mm/shmem.c
>> @@ -1706,11 +1706,14 @@ static struct folio *shmem_alloc_and_add_folio(struct vm_fault *vmf,
>> pages = 1 << order;
>> index = round_down(index, pages);
>> folio = shmem_alloc_hugefolio(gfp, info, index, order);
>> - if (folio)
>> + if (folio) {
>> + count_mthp_stat(order, MTHP_STAT_SHMEM_ANON_ALLOC);
>> goto allocated;
>> + }
>>
>> if (pages == HPAGE_PMD_NR)
>> count_vm_event(THP_FILE_FALLBACK);
>> + count_mthp_stat(order, MTHP_STAT_SHMEM_ANON_ALLOC_FALLBACK);
>> order = next_order(&orders, order);
>> }
>> } else {
>> --
>> 2.39.3
>>
>
> Thanks
> Barry

2024-04-23 09:46:26

by Lance Yang

Subject: Re: [RFC PATCH 5/5] mm: shmem: add anonymous share mTHP counters

On 2024/4/23 09:17, Barry Song wrote:
[...]
>> diff --git a/include/linux/huge_mm.h b/include/linux/huge_mm.h
>> index 26b6fa98d8ac..67b9c1acad31 100644
>> --- a/include/linux/huge_mm.h
>> +++ b/include/linux/huge_mm.h
>> @@ -270,6 +270,8 @@ enum mthp_stat_item {
>> MTHP_STAT_ANON_SWPOUT,
>> MTHP_STAT_ANON_SWPOUT_FALLBACK,
>> MTHP_STAT_ANON_SWPIN_REFAULT,
>> + MTHP_STAT_SHMEM_ANON_ALLOC,
>> + MTHP_STAT_SHMEM_ANON_ALLOC_FALLBACK,
>
> not quite sure about this. for 2MB pmd-mapped THP shmem, we count them
> as FILE_THP.
> here we are counting as SHMEM_ANON. To me, SHMEM_ANON is more correct but
> it doesn't align with pmd-mapped THP. David, Ryan, what do you think?

+1

IMO, shmem isn't actually file-backed, but it has file-backed-like
characteristics :)

FWIW, perhaps MTHP_STAT_FILE_ALLOC and MTHP_STAT_FILE_ALLOC_FALLBACK
would better align with PMD/PTE-mapped THP.

Thanks,
Lance

2024-04-23 10:42:43

by Ryan Roberts

Subject: Re: [RFC PATCH 0/5] add mTHP support for anonymous share pages

On 22/04/2024 08:02, Baolin Wang wrote:
> Anonymous pages have already been supported for multi-size (mTHP) allocation
> through commit 19eaf44954df, that can allow THP to be configured through the
> sysfs interface located at '/sys/kernel/mm/transparent_hugepage/hugepage-XXkb/enabled'.
>
> However, the anonymous shared pages will ignore the anonymous mTHP rule
> configured through the sysfs interface, and can only use the PMD-mapped
> THP, that is not reasonable. Many implement anonymous page sharing through
> mmap(MAP_SHARED | MAP_ANONYMOUS), especially in database usage scenarios,
> therefore, users expect to apply an unified mTHP strategy for anonymous pages,
> also including the anonymous shared pages, in order to enjoy the benefits of
> mTHP. For example, lower latency than PMD-mapped THP, smaller memory bloat
> than PMD-mapped THP, contiguous PTEs on ARM architecture to reduce TLB miss etc.

This sounds like a very useful addition!

Out of interest, can you point me at any workloads (and off-the-shelf benchmarks
for those workloads) that predominantly use shared anon memory?

>
> The primary strategy is that, the use of huge pages for anonymous shared pages
> still follows the global control determined by the mount option "huge=" parameter
> or the sysfs interface at '/sys/kernel/mm/transparent_hugepage/shmem_enabled'.
> The utilization of mTHP is allowed only when the global 'huge' switch is enabled.
> Subsequently, the mTHP sysfs interface (/sys/kernel/mm/transparent_hugepage/hugepage-XXkb/enabled)
> is checked to determine the mTHP size that can be used for large folio allocation
> for these anonymous shared pages.

I'm not sure about this proposed control mechanism; won't it break
compatibility? I could be wrong, but I don't think shmem's use of THP used to
depend upon the value of /sys/kernel/mm/transparent_hugepage/enabled? So it
doesn't make sense to me that we now depend upon the
/sys/kernel/mm/transparent_hugepage/hugepage-XXkb/enabled values (which by
default disables all sizes except 2M, which is set to "inherit" from
/sys/kernel/mm/transparent_hugepage/enabled).

The other problem is that shmem_enabled has a different set of options
(always/never/within_size/advise/deny/force) to enabled (always/madvise/never)

Perhaps it would be cleaner to do the same trick we did for enabled; Introduce
/mm/transparent_hugepage/hugepage-XXkb/shmem_enabled, which can have all the
same values as the top-level /sys/kernel/mm/transparent_hugepage/shmem_enabled,
plus the additional "inherit" option. By default all sizes will be set to
"never" except 2M, which is set to "inherit".

Of course the huge= mount option would also need to take a per-size option in
this case. e.g. huge=2048kB:advise,64kB:always

>
> TODO:
> - More testing and provide some performance data.
> - Need more discussion about the large folio allocation strategy for a 'regular
> file' operation created by memfd_create(), for example using ftruncate(fd) to specify
> the 'file' size, which need to follow the anonymous mTHP rule too?
> - Do not split the large folio when share memory swap out.
> - Can swap in a large folio for share memory.
>
> Baolin Wang (5):
> mm: memory: extend finish_fault() to support large folio
> mm: shmem: add an 'order' parameter for shmem_alloc_hugefolio()
> mm: shmem: add THP validation for PMD-mapped THP related statistics
> mm: shmem: add mTHP support for anonymous share pages
> mm: shmem: add anonymous share mTHP counters
>
> include/linux/huge_mm.h | 4 +-
> mm/huge_memory.c | 8 ++-
> mm/memory.c | 25 +++++++---
> mm/shmem.c | 107 ++++++++++++++++++++++++++++++----------
> 4 files changed, 108 insertions(+), 36 deletions(-)
>


2024-04-23 11:11:48

by David Hildenbrand

Subject: Re: [RFC PATCH 0/5] add mTHP support for anonymous share pages

On 23.04.24 12:41, Ryan Roberts wrote:
> On 22/04/2024 08:02, Baolin Wang wrote:
>> Anonymous pages have already been supported for multi-size (mTHP) allocation
>> through commit 19eaf44954df, that can allow THP to be configured through the
>> sysfs interface located at '/sys/kernel/mm/transparent_hugepage/hugepage-XXkb/enabled'.
>>
>> However, the anonymous shared pages will ignore the anonymous mTHP rule
>> configured through the sysfs interface, and can only use the PMD-mapped
>> THP, that is not reasonable. Many implement anonymous page sharing through
>> mmap(MAP_SHARED | MAP_ANONYMOUS), especially in database usage scenarios,
>> therefore, users expect to apply an unified mTHP strategy for anonymous pages,
>> also including the anonymous shared pages, in order to enjoy the benefits of
>> mTHP. For example, lower latency than PMD-mapped THP, smaller memory bloat
>> than PMD-mapped THP, contiguous PTEs on ARM architecture to reduce TLB miss etc.
>
> This sounds like a very useful addition!
>
> Out of interest, can you point me at any workloads (and off-the-shelf benchmarks
> for those workloads) that predominantly use shared anon memory?
>
>>
>> The primary strategy is that, the use of huge pages for anonymous shared pages
>> still follows the global control determined by the mount option "huge=" parameter
>> or the sysfs interface at '/sys/kernel/mm/transparent_hugepage/shmem_enabled'.
>> The utilization of mTHP is allowed only when the global 'huge' switch is enabled.
>> Subsequently, the mTHP sysfs interface (/sys/kernel/mm/transparent_hugepage/hugepage-XXkb/enabled)
>> is checked to determine the mTHP size that can be used for large folio allocation
>> for these anonymous shared pages.
>
> I'm not sure about this proposed control mechanism; won't it break
> compatibility? I could be wrong, but I don't think shmem's use of THP used to
> depend upon the value of /sys/kernel/mm/transparent_hugepage/enabled? So it
> doesn't make sense to me that we now depend upon the
> /sys/kernel/mm/transparent_hugepage/hugepage-XXkb/enabled values (which by
> default disables all sizes except 2M, which is set to "inherit" from
> /sys/kernel/mm/transparent_hugepage/enabled).
>
> The other problem is that shmem_enabled has a different set of options
> (always/never/within_size/advise/deny/force) to enabled (always/madvise/never)
>
> Perhaps it would be cleaner to do the same trick we did for enabled; Introduce
> /mm/transparent_hugepage/hugepage-XXkb/shmem_enabled, which can have all the
> same values as the top-level /sys/kernel/mm/transparent_hugepage/shmem_enabled,
> plus the additional "inherit" option. By default all sizes will be set to
> "never" except 2M, which is set to "inherit".

Matches what I had in mind.

--
Cheers,

David / dhildenb


2024-04-23 11:24:46

by Lance Yang

Subject: Re: [RFC PATCH 5/5] mm: shmem: add anonymous share mTHP counters

On Tue, Apr 23, 2024 at 5:46 PM Lance Yang <[email protected]> wrote:
>
> On 2024/4/23 09:17, Barry Song wrote:
> [...]
> >> diff --git a/include/linux/huge_mm.h b/include/linux/huge_mm.h
> >> index 26b6fa98d8ac..67b9c1acad31 100644
> >> --- a/include/linux/huge_mm.h
> >> +++ b/include/linux/huge_mm.h
> >> @@ -270,6 +270,8 @@ enum mthp_stat_item {
> >> MTHP_STAT_ANON_SWPOUT,
> >> MTHP_STAT_ANON_SWPOUT_FALLBACK,
> >> MTHP_STAT_ANON_SWPIN_REFAULT,
> >> + MTHP_STAT_SHMEM_ANON_ALLOC,
> >> + MTHP_STAT_SHMEM_ANON_ALLOC_FALLBACK,

Seems like you forgot to add the FILE_FALLBACK_CHARGE counter
in this patch :)

IIUC, you've excluded the THP_FILE_FALLBACK_CHARGE counter
for PTE-mapped mTHP that size < PMD in patch3.

Thanks,
Lance

2024-04-23 11:40:15

by Ryan Roberts

Subject: Re: [RFC PATCH 5/5] mm: shmem: add anonymous share mTHP counters

On 23/04/2024 02:46, Baolin Wang wrote:
>
>
> On 2024/4/23 09:17, Barry Song wrote:
>> On Mon, Apr 22, 2024 at 3:03 PM Baolin Wang
>> <[email protected]> wrote:
>>>
>>> Signed-off-by: Baolin Wang <[email protected]>
>>> ---
>>>   include/linux/huge_mm.h | 2 ++
>>>   mm/huge_memory.c        | 4 ++++
>>>   mm/shmem.c              | 5 ++++-
>>>   3 files changed, 10 insertions(+), 1 deletion(-)
>>>
>>> diff --git a/include/linux/huge_mm.h b/include/linux/huge_mm.h
>>> index 26b6fa98d8ac..67b9c1acad31 100644
>>> --- a/include/linux/huge_mm.h
>>> +++ b/include/linux/huge_mm.h
>>> @@ -270,6 +270,8 @@ enum mthp_stat_item {
>>>          MTHP_STAT_ANON_SWPOUT,
>>>          MTHP_STAT_ANON_SWPOUT_FALLBACK,
>>>          MTHP_STAT_ANON_SWPIN_REFAULT,
>>> +       MTHP_STAT_SHMEM_ANON_ALLOC,
>>> +       MTHP_STAT_SHMEM_ANON_ALLOC_FALLBACK,
>>
>> not quite sure about this. for 2MB pmd-mapped THP shmem, we count them
>> as FILE_THP.
>> here we are counting as SHMEM_ANON. To me, SHMEM_ANON is more correct but
>> it doesn't align with pmd-mapped THP. David, Ryan, what do you think?
>
> Thanks for reviewing.
>
> IMO, I think both approaches are acceptable, which also reflects the dual nature
> of anonymous shared pages: on the one hand they are anonymous pages, and on the
> other hand, they are backed by a pseudo file. From the user's perspective, I
> prefer to use the term "anonymous shmem", which can be distinguished from the
> real file-backed THP.
>
> Anyway, let's see what others think.

From a quick look at the code, it looks like the shmem alloc/fallback/charge
events are all lumped in with FILE_THP. But the instantaneous "how many are
allocated" and "how many are mapped" have their own NR_SHMEM_THPS and
NR_SHMEM_PMDMAPPED counters? So it's a bit inconsistent today.

My preference would be to add these to be consistent with the anon stats:

MTHP_STAT_SHMEM_FAULT_ALLOC,
MTHP_STAT_SHMEM_FAULT_FALLBACK,
MTHP_STAT_SHMEM_FAULT_FALLBACK_CHARGE,

But it looks like these aren't always allocated due to faults? So perhaps:

MTHP_STAT_SHMEM_ALLOC,
MTHP_STAT_SHMEM_FALLBACK,
MTHP_STAT_SHMEM_FALLBACK_CHARGE,

If I've understood the code correctly (I know nothing about shmem), the
allocation can be for both mmap(SHARED|ANON) and for tmpfs? So "SHMEM_ANON"
probably isn't quite right?


>
>>>          __MTHP_STAT_COUNT
>>>   };
>>>
>>> diff --git a/mm/huge_memory.c b/mm/huge_memory.c
>>> index 9e52c0db7580..dc15240c1ab3 100644
>>> --- a/mm/huge_memory.c
>>> +++ b/mm/huge_memory.c
>>> @@ -557,6 +557,8 @@ DEFINE_MTHP_STAT_ATTR(anon_alloc_fallback,
>>> MTHP_STAT_ANON_ALLOC_FALLBACK);
>>>   DEFINE_MTHP_STAT_ATTR(anon_swpout, MTHP_STAT_ANON_SWPOUT);
>>>   DEFINE_MTHP_STAT_ATTR(anon_swpout_fallback, MTHP_STAT_ANON_SWPOUT_FALLBACK);
>>>   DEFINE_MTHP_STAT_ATTR(anon_swpin_refault, MTHP_STAT_ANON_SWPIN_REFAULT);
>>> +DEFINE_MTHP_STAT_ATTR(shmem_anon_alloc, MTHP_STAT_SHMEM_ANON_ALLOC);
>>> +DEFINE_MTHP_STAT_ATTR(shmem_anon_alloc_fallback,
>>> MTHP_STAT_SHMEM_ANON_ALLOC_FALLBACK);
>>>
>>>   static struct attribute *stats_attrs[] = {
>>>          &anon_alloc_attr.attr,
>>> @@ -564,6 +566,8 @@ static struct attribute *stats_attrs[] = {
>>>          &anon_swpout_attr.attr,
>>>          &anon_swpout_fallback_attr.attr,
>>>          &anon_swpin_refault_attr.attr,
>>> +       &shmem_anon_alloc_attr.attr,
>>> +       &shmem_anon_alloc_fallback_attr.attr,
>>>          NULL,
>>>   };
>>>
>>> diff --git a/mm/shmem.c b/mm/shmem.c
>>> index 8b009e7040b2..4a0aa75ab29c 100644
>>> --- a/mm/shmem.c
>>> +++ b/mm/shmem.c
>>> @@ -1706,11 +1706,14 @@ static struct folio *shmem_alloc_and_add_folio(struct
>>> vm_fault *vmf,
>>>                          pages = 1 << order;
>>>                          index = round_down(index, pages);
>>>                          folio = shmem_alloc_hugefolio(gfp, info, index, order);
>>> -                       if (folio)
>>> +                       if (folio) {
>>> +                               count_mthp_stat(order,
>>> MTHP_STAT_SHMEM_ANON_ALLOC);

is there any reason why this can't go next to the existing PMD-size stat?

>>>                                  goto allocated;
>>> +                       }
>>>
>>>                          if (pages == HPAGE_PMD_NR)
>>>                                  count_vm_event(THP_FILE_FALLBACK);
>>> +                       count_mthp_stat(order,
>>> MTHP_STAT_SHMEM_ANON_ALLOC_FALLBACK);
>>>                          order = next_order(&orders, order);
>>>                  }
>>>          } else {
>>> --
>>> 2.39.3
>>>
>>
>> Thanks
>> Barry


2024-04-23 11:43:29

by David Hildenbrand

Subject: Re: [RFC PATCH 5/5] mm: shmem: add anonymous share mTHP counters

On 23.04.24 03:17, Barry Song wrote:
> On Mon, Apr 22, 2024 at 3:03 PM Baolin Wang
> <[email protected]> wrote:
>>
>> Signed-off-by: Baolin Wang <[email protected]>
>> ---
>> include/linux/huge_mm.h | 2 ++
>> mm/huge_memory.c | 4 ++++
>> mm/shmem.c | 5 ++++-
>> 3 files changed, 10 insertions(+), 1 deletion(-)
>>
>> diff --git a/include/linux/huge_mm.h b/include/linux/huge_mm.h
>> index 26b6fa98d8ac..67b9c1acad31 100644
>> --- a/include/linux/huge_mm.h
>> +++ b/include/linux/huge_mm.h
>> @@ -270,6 +270,8 @@ enum mthp_stat_item {
>> MTHP_STAT_ANON_SWPOUT,
>> MTHP_STAT_ANON_SWPOUT_FALLBACK,
>> MTHP_STAT_ANON_SWPIN_REFAULT,
>> + MTHP_STAT_SHMEM_ANON_ALLOC,
>> + MTHP_STAT_SHMEM_ANON_ALLOC_FALLBACK,
>
> not quite sure about this. for 2MB pmd-mapped THP shmem, we count them
> as FILE_THP.
> here we are counting as SHMEM_ANON. To me, SHMEM_ANON is more correct but
> it doesn't align with pmd-mapped THP. David, Ryan, what do you think?

The term "anonymous share" in the patch subject is weird to begin with
;) Easy to confuse with anonymous cow-shared memory. Let's just call it
"anonymous shmem", which it is under the hood.

... regarding the question: if we add FILE_ALLOC and friends, at least
initially, we wouldn't account other large pagecache folios.

... likely we should add that then as well so the counter matches the
actual name?

If we later realize that we need separate FILE vs. SHMEM vs. WHATEVER
counters, we can always add more fine-grained counters later. Doing it
consistently w.r.t. traditional THPs first sounds reasonable.

--
Cheers,

David / dhildenb


2024-04-24 03:49:27

by Baolin Wang

Subject: Re: [RFC PATCH 5/5] mm: shmem: add anonymous share mTHP counters



On 2024/4/23 19:39, Ryan Roberts wrote:
> On 23/04/2024 02:46, Baolin Wang wrote:
>>
>>
>> On 2024/4/23 09:17, Barry Song wrote:
>>> On Mon, Apr 22, 2024 at 3:03 PM Baolin Wang
>>> <[email protected]> wrote:
>>>>
>>>> Signed-off-by: Baolin Wang <[email protected]>
>>>> ---
>>>>   include/linux/huge_mm.h | 2 ++
>>>>   mm/huge_memory.c        | 4 ++++
>>>>   mm/shmem.c              | 5 ++++-
>>>>   3 files changed, 10 insertions(+), 1 deletion(-)
>>>>
>>>> diff --git a/include/linux/huge_mm.h b/include/linux/huge_mm.h
>>>> index 26b6fa98d8ac..67b9c1acad31 100644
>>>> --- a/include/linux/huge_mm.h
>>>> +++ b/include/linux/huge_mm.h
>>>> @@ -270,6 +270,8 @@ enum mthp_stat_item {
>>>>          MTHP_STAT_ANON_SWPOUT,
>>>>          MTHP_STAT_ANON_SWPOUT_FALLBACK,
>>>>          MTHP_STAT_ANON_SWPIN_REFAULT,
>>>> +       MTHP_STAT_SHMEM_ANON_ALLOC,
>>>> +       MTHP_STAT_SHMEM_ANON_ALLOC_FALLBACK,
>>>
>>> not quite sure about this. for 2MB pmd-mapped THP shmem, we count them
>>> as FILE_THP.
>>> here we are counting as SHMEM_ANON. To me, SHMEM_ANON is more correct but
>>> it doesn't align with pmd-mapped THP. David, Ryan, what do you think?
>>
>> Thanks for reviewing.
>>
>> IMO, I think both approaches are acceptable, which also reflects the dual nature
>> of anonymous shared pages: on the one hand they are anonymous pages, and on the
>> other hand, they are backed by a pseudo file. From the user's perspective, I
>> prefer to use the term "anonymous shmem", which can be distinguished from the
>> real file-backed THP.
>>
>> Anyway, let's see what others think.
>
> From a quick look at the code, it looks like the shmem alloc/fallback/charge
> events are all lumped in with FILE_THP. But the instantaneous "how many are
> allocated" and "how many are mapped" have their own NR_SHMEM_THPS and
> NR_SHMEM_PMDMAPPED counters? So its a bit inconsistent today.
>
> My preference would be to add these to be consistent with the anon stats:
>
> MTHP_STAT_SHMEM_FAULT_ALLOC,
> MTHP_STAT_SHMEM_FAULT_FALLBACK,
> MTHP_STAT_SHMEM_FAULT_FALLBACK_CHARGE,
>
> But it looks like these aren't always allocated due to faults? So perhaps:
>
> MTHP_STAT_SHMEM_ALLOC,
> MTHP_STAT_SHMEM_FALLBACK,
> MTHP_STAT_SHMEM_FALLBACK_CHARGE,

This looks good to me.

> If I've understood the code correctly (I know nothing about shmem), the
> allocation can be for both mmap(SHARED|ANON) and for tmpfs? So "SHMEM_ANON"

This is allowed, but the 'fd' for tmpfs will be ignored (see
ksys_mmap_pgoff()), which is the same as for anonymous shmem.

> probably isn't quite right?
>
>
>>
>>>>          __MTHP_STAT_COUNT
>>>>   };
>>>>
>>>> diff --git a/mm/huge_memory.c b/mm/huge_memory.c
>>>> index 9e52c0db7580..dc15240c1ab3 100644
>>>> --- a/mm/huge_memory.c
>>>> +++ b/mm/huge_memory.c
>>>> @@ -557,6 +557,8 @@ DEFINE_MTHP_STAT_ATTR(anon_alloc_fallback,
>>>> MTHP_STAT_ANON_ALLOC_FALLBACK);
>>>>   DEFINE_MTHP_STAT_ATTR(anon_swpout, MTHP_STAT_ANON_SWPOUT);
>>>>   DEFINE_MTHP_STAT_ATTR(anon_swpout_fallback, MTHP_STAT_ANON_SWPOUT_FALLBACK);
>>>>   DEFINE_MTHP_STAT_ATTR(anon_swpin_refault, MTHP_STAT_ANON_SWPIN_REFAULT);
>>>> +DEFINE_MTHP_STAT_ATTR(shmem_anon_alloc, MTHP_STAT_SHMEM_ANON_ALLOC);
>>>> +DEFINE_MTHP_STAT_ATTR(shmem_anon_alloc_fallback,
>>>> MTHP_STAT_SHMEM_ANON_ALLOC_FALLBACK);
>>>>
>>>>   static struct attribute *stats_attrs[] = {
>>>>          &anon_alloc_attr.attr,
>>>> @@ -564,6 +566,8 @@ static struct attribute *stats_attrs[] = {
>>>>          &anon_swpout_attr.attr,
>>>>          &anon_swpout_fallback_attr.attr,
>>>>          &anon_swpin_refault_attr.attr,
>>>> +       &shmem_anon_alloc_attr.attr,
>>>> +       &shmem_anon_alloc_fallback_attr.attr,
>>>>          NULL,
>>>>   };
>>>>
>>>> diff --git a/mm/shmem.c b/mm/shmem.c
>>>> index 8b009e7040b2..4a0aa75ab29c 100644
>>>> --- a/mm/shmem.c
>>>> +++ b/mm/shmem.c
>>>> @@ -1706,11 +1706,14 @@ static struct folio *shmem_alloc_and_add_folio(struct
>>>> vm_fault *vmf,
>>>>                          pages = 1 << order;
>>>>                          index = round_down(index, pages);
>>>>                          folio = shmem_alloc_hugefolio(gfp, info, index, order);
>>>> -                       if (folio)
>>>> +                       if (folio) {
>>>> +                               count_mthp_stat(order,
>>>> MTHP_STAT_SHMEM_ANON_ALLOC);
>
> is there any reason why this can't go next to the existing PMD-size stat?

No, will move to the existing PMD-size stat.

2024-04-24 03:52:58

by Baolin Wang

Subject: Re: [RFC PATCH 5/5] mm: shmem: add anonymous share mTHP counters



On 2024/4/23 19:22, Lance Yang wrote:
> On Tue, Apr 23, 2024 at 5:46 PM Lance Yang <[email protected]> wrote:
>>
>> On 2024/4/23 09:17, Barry Song wrote:
>> [...]
>>>> diff --git a/include/linux/huge_mm.h b/include/linux/huge_mm.h
>>>> index 26b6fa98d8ac..67b9c1acad31 100644
>>>> --- a/include/linux/huge_mm.h
>>>> +++ b/include/linux/huge_mm.h
>>>> @@ -270,6 +270,8 @@ enum mthp_stat_item {
>>>> MTHP_STAT_ANON_SWPOUT,
>>>> MTHP_STAT_ANON_SWPOUT_FALLBACK,
>>>> MTHP_STAT_ANON_SWPIN_REFAULT,
>>>> + MTHP_STAT_SHMEM_ANON_ALLOC,
>>>> + MTHP_STAT_SHMEM_ANON_ALLOC_FALLBACK,
>
> Seems like you forgot to add the FILE_FALLBACK_CHARGE counter
> in this patch :)
>
> IIUC, you've excluded the THP_FILE_FALLBACK_CHARGE counter
> for PTE-mapped mTHP that size < PMD in patch3.

Yes, will add in next version.

2024-04-24 06:10:39

by Baolin Wang

Subject: Re: [RFC PATCH 5/5] mm: shmem: add anonymous share mTHP counters



On 2024/4/23 19:37, David Hildenbrand wrote:
> On 23.04.24 03:17, Barry Song wrote:
>> On Mon, Apr 22, 2024 at 3:03 PM Baolin Wang
>> <[email protected]> wrote:
>>>
>>> Signed-off-by: Baolin Wang <[email protected]>
>>> ---
>>>   include/linux/huge_mm.h | 2 ++
>>>   mm/huge_memory.c        | 4 ++++
>>>   mm/shmem.c              | 5 ++++-
>>>   3 files changed, 10 insertions(+), 1 deletion(-)
>>>
>>> diff --git a/include/linux/huge_mm.h b/include/linux/huge_mm.h
>>> index 26b6fa98d8ac..67b9c1acad31 100644
>>> --- a/include/linux/huge_mm.h
>>> +++ b/include/linux/huge_mm.h
>>> @@ -270,6 +270,8 @@ enum mthp_stat_item {
>>>          MTHP_STAT_ANON_SWPOUT,
>>>          MTHP_STAT_ANON_SWPOUT_FALLBACK,
>>>          MTHP_STAT_ANON_SWPIN_REFAULT,
>>> +       MTHP_STAT_SHMEM_ANON_ALLOC,
>>> +       MTHP_STAT_SHMEM_ANON_ALLOC_FALLBACK,
>>
>> not quite sure about this. for 2MB pmd-mapped THP shmem, we count them
>> as FILE_THP.
>> here we are counting as SHMEM_ANON. To me, SHMEM_ANON is more correct but
>> it doesn't align with pmd-mapped THP. David, Ryan, what do you think?
>
> The term "anonymous share" in the patch subject is weird to begin with
> ;) Easy to confuse with anonymous cow-shared memory. Let's just call it
> "anonymous shmem", which it is under the hood.

Sure.

> ... regarding the question: if we add FILE_ALLOC and friends, at least
> initially, we wouldn't account other large pagecache folios.
>
> ... likely we should add that then as well so the counter matches the
> actual name?
>
> If we later realize that we need separate FILE vs. SHMEM vs. WHATEVER
> counters, we can always add more fine-grained counters later. Doing it
> consistently w.r.t. traditional THPs first sounds reasonable.

Um, once we expose it to userspace through the sysfs interface, that
interface should be as explicit as possible and avoid confusing users;
otherwise it will be difficult to change this kind of interface in the
future. Personally, I prefer Ryan's suggestion.

2024-04-24 06:58:58

by Baolin Wang

Subject: Re: [RFC PATCH 0/5] add mTHP support for anonymous share pages



On 2024/4/23 18:41, Ryan Roberts wrote:
> On 22/04/2024 08:02, Baolin Wang wrote:
>> Anonymous pages have already been supported for multi-size (mTHP) allocation
>> through commit 19eaf44954df, that can allow THP to be configured through the
>> sysfs interface located at '/sys/kernel/mm/transparent_hugepage/hugepage-XXkb/enabled'.
>>
>> However, the anonymous shared pages will ignore the anonymous mTHP rule
>> configured through the sysfs interface, and can only use the PMD-mapped
>> THP, that is not reasonable. Many implement anonymous page sharing through
>> mmap(MAP_SHARED | MAP_ANONYMOUS), especially in database usage scenarios,
>> therefore, users expect to apply an unified mTHP strategy for anonymous pages,
>> also including the anonymous shared pages, in order to enjoy the benefits of
>> mTHP. For example, lower latency than PMD-mapped THP, smaller memory bloat
>> than PMD-mapped THP, contiguous PTEs on ARM architecture to reduce TLB miss etc.
>
> This sounds like a very useful addition!
>
> Out of interest, can you point me at any workloads (and off-the-shelf benchmarks
> for those workloads) that predominantly use shared anon memory?

As far as I know, some database related workloads make extensive use of
shared anonymous pages, such as PolarDB[1] in our Alibaba fleet; MySQL
likely also uses shared anonymous memory. I still need to do some
investigation to measure the performance.

[1] https://github.com/ApsaraDB/PolarDB-for-PostgreSQL

>> The primary strategy is that, the use of huge pages for anonymous shared pages
>> still follows the global control determined by the mount option "huge=" parameter
>> or the sysfs interface at '/sys/kernel/mm/transparent_hugepage/shmem_enabled'.
>> The utilization of mTHP is allowed only when the global 'huge' switch is enabled.
>> Subsequently, the mTHP sysfs interface (/sys/kernel/mm/transparent_hugepage/hugepage-XXkb/enabled)
>> is checked to determine the mTHP size that can be used for large folio allocation
>> for these anonymous shared pages.
>
> I'm not sure about this proposed control mechanism; won't it break
> compatibility? I could be wrong, but I don't think shmem's use of THP used to
> depend upon the value of /sys/kernel/mm/transparent_hugepage/enabled? So it

Yes, I realized this after more testing.

> doesn't make sense to me that we now depend upon the
> /sys/kernel/mm/transparent_hugepage/hugepage-XXkb/enabled values (which by
> default disables all sizes except 2M, which is set to "inherit" from
> /sys/kernel/mm/transparent_hugepage/enabled).
>
> The other problem is that shmem_enabled has a different set of options
> (always/never/within_size/advise/deny/force) to enabled (always/madvise/never)
>
> Perhaps it would be cleaner to do the same trick we did for enabled; Introduce
> /mm/transparent_hugepage/hugepage-XXkb/shmem_enabled, which can have all the
> same values as the top-level /sys/kernel/mm/transparent_hugepage/shmem_enabled,
> plus the additional "inherit" option. By default all sizes will be set to
> "never" except 2M, which is set to "inherit".

Sounds good to me. But I do not want to copy all of the same values from
the top-level '/sys/kernel/mm/transparent_hugepage/shmem_enabled':
always within_size advise never deny force

For mTHP's shmem_enabled interface, we can just keep the values below:
always within_size advise never

Because when checking if mTHP can be used for anon shmem, 'deny' is
equivalent to 'never', and 'force' is equivalent to 'always'.

> Of course the huge= mount option would also need to take a per-size option in
> this case. e.g. huge=2048kB:advise,64kB:always

IMO, I do not want to change the global 'huge=' mount option, which can
control both anon shmem and tmpfs, while mTHP is currently only applied
to anon shmem. So let's keep it the same as the global sysfs interface:
/sys/kernel/mm/transparent_hugepage/shmem_enabled.

As for the tmpfs large folio strategy, I plan to address it later; we may
need more discussion to determine whether it should follow the file large
folio strategy (no investigation yet).

Thanks for reviewing.

>> TODO:
>> - More testing and provide some performance data.
>> - Need more discussion about the large folio allocation strategy for a 'regular
>> file' operation created by memfd_create(), for example using ftruncate(fd) to specify
>> the 'file' size, which need to follow the anonymous mTHP rule too?
>> - Do not split the large folio when share memory swap out.
>> - Can swap in a large folio for share memory.
>>
>> Baolin Wang (5):
>> mm: memory: extend finish_fault() to support large folio
>> mm: shmem: add an 'order' parameter for shmem_alloc_hugefolio()
>> mm: shmem: add THP validation for PMD-mapped THP related statistics
>> mm: shmem: add mTHP support for anonymous share pages
>> mm: shmem: add anonymous share mTHP counters
>>
>> include/linux/huge_mm.h | 4 +-
>> mm/huge_memory.c | 8 ++-
>> mm/memory.c | 25 +++++++---
>> mm/shmem.c | 107 ++++++++++++++++++++++++++++++----------
>> 4 files changed, 108 insertions(+), 36 deletions(-)
>>

2024-04-24 06:59:39

by Baolin Wang

Subject: Re: [RFC PATCH 2/5] mm: shmem: add an 'order' parameter for shmem_alloc_hugefolio()



On 2024/4/24 14:28, Kefeng Wang wrote:
>
>
> On 2024/4/22 15:02, Baolin Wang wrote:
>> Add a new parameter to specify the huge page order for
>> shmem_alloc_hugefolio(),
>> as a preparation to support mTHP.
>>
>> Signed-off-by: Baolin Wang <[email protected]>
>> ---
>>   mm/shmem.c | 11 ++++++-----
>>   1 file changed, 6 insertions(+), 5 deletions(-)
>>
>> diff --git a/mm/shmem.c b/mm/shmem.c
>> index fa2a0ed97507..893c88efc45f 100644
>> --- a/mm/shmem.c
>> +++ b/mm/shmem.c
>> @@ -1604,14 +1604,14 @@ static gfp_t limit_gfp_mask(gfp_t huge_gfp,
>> gfp_t limit_gfp)
>>   }
>>   static struct folio *shmem_alloc_hugefolio(gfp_t gfp,
>> -        struct shmem_inode_info *info, pgoff_t index)
>> +        struct shmem_inode_info *info, pgoff_t index, int order)
>>   {
>>       struct mempolicy *mpol;
>>       pgoff_t ilx;
>>       struct page *page;
>> -    mpol = shmem_get_pgoff_policy(info, index, HPAGE_PMD_ORDER, &ilx);
>> -    page = alloc_pages_mpol(gfp, HPAGE_PMD_ORDER, mpol, ilx,
>> numa_node_id());
>> +    mpol = shmem_get_pgoff_policy(info, index, order, &ilx);
>> +    page = alloc_pages_mpol(gfp, order, mpol, ilx, numa_node_id());
>>       mpol_cond_put(mpol);
>>       return page_rmappable_folio(page);
>> @@ -1639,13 +1639,14 @@ static struct folio
>> *shmem_alloc_and_add_folio(gfp_t gfp,
>>       struct shmem_inode_info *info = SHMEM_I(inode);
>>       struct folio *folio;
>>       long pages;
>> -    int error;
>> +    int error, order;
>>       if (!IS_ENABLED(CONFIG_TRANSPARENT_HUGEPAGE))
>>           huge = false;
>>       if (huge) {
>>           pages = HPAGE_PMD_NR;
>> +        order = HPAGE_PMD_ORDER;
>>           index = round_down(index, HPAGE_PMD_NR);
>>           /*
>> @@ -1660,7 +1661,7 @@ static struct folio
>> *shmem_alloc_and_add_folio(gfp_t gfp,
>>                   index + HPAGE_PMD_NR - 1, XA_PRESENT))
>>               return ERR_PTR(-E2BIG);
>> -        folio = shmem_alloc_hugefolio(gfp, info, index);
>> +        folio = shmem_alloc_hugefolio(gfp, info, index, order);
>
> Avoid order variable, we can directly use HPAGE_PMD_NR here.

Yes, sure. Thanks.

2024-04-24 07:12:05

by David Hildenbrand

Subject: Re: [RFC PATCH 5/5] mm: shmem: add anonymous share mTHP counters

On 24.04.24 08:10, Baolin Wang wrote:
>
>
> On 2024/4/23 19:37, David Hildenbrand wrote:
>> On 23.04.24 03:17, Barry Song wrote:
>>> On Mon, Apr 22, 2024 at 3:03 PM Baolin Wang
>>> <[email protected]> wrote:
>>>>
>>>> Signed-off-by: Baolin Wang <[email protected]>
>>>> ---
>>>>   include/linux/huge_mm.h | 2 ++
>>>>   mm/huge_memory.c        | 4 ++++
>>>>   mm/shmem.c              | 5 ++++-
>>>>   3 files changed, 10 insertions(+), 1 deletion(-)
>>>>
>>>> diff --git a/include/linux/huge_mm.h b/include/linux/huge_mm.h
>>>> index 26b6fa98d8ac..67b9c1acad31 100644
>>>> --- a/include/linux/huge_mm.h
>>>> +++ b/include/linux/huge_mm.h
>>>> @@ -270,6 +270,8 @@ enum mthp_stat_item {
>>>>          MTHP_STAT_ANON_SWPOUT,
>>>>          MTHP_STAT_ANON_SWPOUT_FALLBACK,
>>>>          MTHP_STAT_ANON_SWPIN_REFAULT,
>>>> +       MTHP_STAT_SHMEM_ANON_ALLOC,
>>>> +       MTHP_STAT_SHMEM_ANON_ALLOC_FALLBACK,
>>>
>>> not quite sure about this. for 2MB pmd-mapped THP shmem, we count them
>>> as FILE_THP.
>>> here we are counting as SHMEM_ANON. To me, SHMEM_ANON is more correct but
>>> it doesn't align with pmd-mapped THP. David, Ryan, what do you think?
>>
>> The term "anonymous share" in the patch subject is weird to begin with
>> ;) Easy to confuse with anonymous cow-shared memory. Let's just call it
>> "anonymous shmem", which it is under the hood.
>
> Sure.
>
>> ... regarding the question: if we add FILE_ALLOC and friends, at least
>> initially, we wouldn't account other large pagecache folios.
>>
>> ... likely we should add that then as well so the counter matches the
>> actual name?
>>
>> If we later realize that we need separate FILE vs. SHMEM vs. WHATEVER
>> counters, we can always add more fine-grained counters later. Doing it
>> consistently w.r.t. traditional THPs first sounds reasonable.
>
> Um, once we expose it to userspace through the sysfs interface, the
> sysfs interface should be explicit as much as possible and avoid
> confusing users, otherwise it will be difficult to change this kind of
> interface in the future. Personally, I prefer to Ryan's suggestion.

Inconsistency is confusing. As long as you avoid that, I don't
particularly care.

--
Cheers,

David / dhildenb


2024-04-24 08:15:45

by Ryan Roberts

Subject: Re: [RFC PATCH 5/5] mm: shmem: add anonymous share mTHP counters

On 24/04/2024 08:11, David Hildenbrand wrote:
> On 24.04.24 08:10, Baolin Wang wrote:
>>
>>
>> On 2024/4/23 19:37, David Hildenbrand wrote:
>>> On 23.04.24 03:17, Barry Song wrote:
>>>> On Mon, Apr 22, 2024 at 3:03 PM Baolin Wang
>>>> <[email protected]> wrote:
>>>>>
>>>>> Signed-off-by: Baolin Wang <[email protected]>
>>>>> ---
>>>>>    include/linux/huge_mm.h | 2 ++
>>>>>    mm/huge_memory.c        | 4 ++++
>>>>>    mm/shmem.c              | 5 ++++-
>>>>>    3 files changed, 10 insertions(+), 1 deletion(-)
>>>>>
>>>>> diff --git a/include/linux/huge_mm.h b/include/linux/huge_mm.h
>>>>> index 26b6fa98d8ac..67b9c1acad31 100644
>>>>> --- a/include/linux/huge_mm.h
>>>>> +++ b/include/linux/huge_mm.h
>>>>> @@ -270,6 +270,8 @@ enum mthp_stat_item {
>>>>>           MTHP_STAT_ANON_SWPOUT,
>>>>>           MTHP_STAT_ANON_SWPOUT_FALLBACK,
>>>>>           MTHP_STAT_ANON_SWPIN_REFAULT,
>>>>> +       MTHP_STAT_SHMEM_ANON_ALLOC,
>>>>> +       MTHP_STAT_SHMEM_ANON_ALLOC_FALLBACK,
>>>>
>>>> not quite sure about this. for 2MB pmd-mapped THP shmem, we count them
>>>> as FILE_THP.
>>>> here we are counting as SHMEM_ANON. To me, SHMEM_ANON is more correct but
>>>> it doesn't align with pmd-mapped THP. David, Ryan, what do you think?
>>>
>>> The term "anonymous share" in the patch subject is weird to begin with
>>> ;) Easy to confuse with anonymous cow-shared memory. Let's just call it
>>> "anonymous shmem", which it is under the hood.
>>
>> Sure.
>>
>>> ... regarding the question: if we add FILE_ALLOC and friends, at least
>>> initially, we wouldn't account other large pagecache folios.
>>>
>>> ... likely we should add that then as well so the counter matches the
>>> actual name?
>>>
>>> If we later realize that we need separate FILE vs. SHMEM vs. WHATEVER
>>> counters, we can always add more fine-grained counters later. Doing it
>>> consistently w.r.t. traditional THPs first sounds reasonable.
>>
>> Um, once we expose it to userspace through the sysfs interface, the
>> sysfs interface should be explicit as much as possible and avoid
>> confusing users, otherwise it will be difficult to change this kind of
>> interface in the future. Personally, I prefer to Ryan's suggestion.
>
> Inconsistency is confusing. As long as you avoid that, I don't particularly care.

This is a good point. We have been careful to make sure the 2M ANON mTHP stats
match the existing PMD-size stats. So we should definitely make sure that any
future 2M FILE mTHP stats match too, which I guess means counting both SHMEM and
FILE events.

So perhaps it makes more sense to add FILE counters to start with. If we need
the SHMEM-specific counters, we could add them later?

I'm happy to go with the crowd on this...

2024-04-24 08:27:24

by Ryan Roberts

Subject: Re: [RFC PATCH 0/5] add mTHP support for anonymous share pages

On 24/04/2024 07:55, Baolin Wang wrote:
>
>
> On 2024/4/23 18:41, Ryan Roberts wrote:
>> On 22/04/2024 08:02, Baolin Wang wrote:
>>> Anonymous pages have already been supported for multi-size (mTHP) allocation
>>> through commit 19eaf44954df, that can allow THP to be configured through the
>>> sysfs interface located at
>>> '/sys/kernel/mm/transparent_hugepage/hugepage-XXkb/enabled'.
>>>
>>> However, the anonymous shared pages will ignore the anonymous mTHP rule
>>> configured through the sysfs interface, and can only use the PMD-mapped
>>> THP, that is not reasonable. Many implement anonymous page sharing through
>>> mmap(MAP_SHARED | MAP_ANONYMOUS), especially in database usage scenarios,
>>> therefore, users expect to apply an unified mTHP strategy for anonymous pages,
>>> also including the anonymous shared pages, in order to enjoy the benefits of
>>> mTHP. For example, lower latency than PMD-mapped THP, smaller memory bloat
>>> than PMD-mapped THP, contiguous PTEs on ARM architecture to reduce TLB miss etc.
>>
>> This sounds like a very useful addition!
>>
>> Out of interest, can you point me at any workloads (and off-the-shelf benchmarks
>> for those workloads) that predominantly use shared anon memory?
>
> As far as I know, some database related workloads make extensive use of shared
> anonymous page, such as PolarDB[1] in our Alibaba fleet, or MySQL likely also
> uses shared anonymous memory. And I still need to do some investigation to
> measure the performance.
>
> [1] https://github.com/ApsaraDB/PolarDB-for-PostgreSQL

Thanks for the pointer!

>
>>> The primary strategy is that, the use of huge pages for anonymous shared pages
>>> still follows the global control determined by the mount option "huge="
>>> parameter
>>> or the sysfs interface at '/sys/kernel/mm/transparent_hugepage/shmem_enabled'.
>>> The utilization of mTHP is allowed only when the global 'huge' switch is
>>> enabled.
>>> Subsequently, the mTHP sysfs interface
>>> (/sys/kernel/mm/transparent_hugepage/hugepage-XXkb/enabled)
>>> is checked to determine the mTHP size that can be used for large folio
>>> allocation
>>> for these anonymous shared pages.
>>
>> I'm not sure about this proposed control mechanism; won't it break
>> compatibility? I could be wrong, but I don't think shmem's use of THP used to
>> depend upon the value of /sys/kernel/mm/transparent_hugepage/enabled? So it
>
> Yes, I realized this after more testing.
>
>> doesn't make sense to me that we now depend upon the
>> /sys/kernel/mm/transparent_hugepage/hugepage-XXkb/enabled values (which by
>> default disables all sizes except 2M, which is set to "inherit" from
>> /sys/kernel/mm/transparent_hugepage/enabled).
>>
>> The other problem is that shmem_enabled has a different set of options
>> (always/never/within_size/advise/deny/force) to enabled (always/madvise/never)
>>
>> Perhaps it would be cleaner to do the same trick we did for enabled; Introduce
>> /mm/transparent_hugepage/hugepage-XXkb/shmem_enabled, which can have all the
>> same values as the top-level /sys/kernel/mm/transparent_hugepage/shmem_enabled,
>> plus the additional "inherit" option. By default all sizes will be set to
>> "never" except 2M, which is set to "inherit".
>
> Sounds good to me. But I do not want to copy all same values from top-level
> '/sys/kernel/mm/transparent_hugepage/shmem_enabled':
> always within_size advise never deny force
>
> For mTHP's shmem_enabled interface, we can just keep below values:
> always within_size advise never
>
> Cause when checking if mTHP can be used for anon shmem, 'deny' is equal to
> 'never', and 'force' is equal to 'always'.

I'll admit it wasn't completely clear to me after reading the docs, but my rough
understanding is:

- /sys/kernel/mm/transparent_hugepage/shmem_enabled controls
mmap(SHARED|ANON) allocations (mostly; see rule 3)
- huge=... controls tmpfs allocations
- deny and force in shmem_enabled are equivalent to never and always for
mmap(SHARED|ANON) but additionally override all tmpfs mounts so they act as
if they were mounted with huge=never or huge=always

Is that correct? If so, then I think it still makes sense to support per-size
deny/force. Certainly if a per-size control is set to "inherit" and the
top-level control is set to deny or force, you would need that to mean something.

>
>> Of course the huge= mount option would also need to take a per-size option in
>> this case. e.g. huge=2048kB:advise,64kB:always
>
> IMO, I do not want to change the global 'huge=' mount option, which can control
> both anon shmem and tmpfs, but mTHP now is only applied for anon shmem. So let's

How does huge= control anon shmem? I thought it was only for mounted
filesystems; so tmpfs? Perhaps my mental model for how this works is broken...

> keep it be same with the global sysfs interface:
> /sys/kernel/mm/transparent_hugepage/shmem_enabled.
>
> For tmpfs large folio strategy, I plan to address it later, and we may need more
> discussion to determine if it should follow the file large folio strategy or not
> (no investigation now).

OK. But until you get to tmpfs, you'll need an interim definition for what it
means if a per-size control is set to "inherit" and the top-level control is set
to deny/force.

>
> Thanks for reviewing.

No problem! Thanks for doing the work!

>
>>> TODO:
>>>   - More testing and provide some performance data.
>>>   - Need more discussion about the large folio allocation strategy for a
>>> 'regular
>>> file' operation created by memfd_create(), for example using ftruncate(fd) to
>>> specify
>>> the 'file' size, which need to follow the anonymous mTHP rule too?
>>>   - Do not split the large folio when share memory swap out.
>>>   - Can swap in a large folio for share memory.
>>>
>>> Baolin Wang (5):
>>>    mm: memory: extend finish_fault() to support large folio
>>>    mm: shmem: add an 'order' parameter for shmem_alloc_hugefolio()
>>>    mm: shmem: add THP validation for PMD-mapped THP related statistics
>>>    mm: shmem: add mTHP support for anonymous share pages
>>>    mm: shmem: add anonymous share mTHP counters
>>>
>>>   include/linux/huge_mm.h |   4 +-
>>>   mm/huge_memory.c        |   8 ++-
>>>   mm/memory.c             |  25 +++++++---
>>>   mm/shmem.c              | 107 ++++++++++++++++++++++++++++++----------
>>>   4 files changed, 108 insertions(+), 36 deletions(-)
>>>


2024-04-24 09:31:37

by Baolin Wang

Subject: Re: [RFC PATCH 5/5] mm: shmem: add anonymous share mTHP counters



On 2024/4/24 16:15, Ryan Roberts wrote:
> On 24/04/2024 08:11, David Hildenbrand wrote:
>> On 24.04.24 08:10, Baolin Wang wrote:
>>>
>>>
>>> On 2024/4/23 19:37, David Hildenbrand wrote:
>>>> On 23.04.24 03:17, Barry Song wrote:
>>>>> On Mon, Apr 22, 2024 at 3:03 PM Baolin Wang
>>>>> <[email protected]> wrote:
>>>>>>
>>>>>> Signed-off-by: Baolin Wang <[email protected]>
>>>>>> ---
>>>>>>    include/linux/huge_mm.h | 2 ++
>>>>>>    mm/huge_memory.c        | 4 ++++
>>>>>>    mm/shmem.c              | 5 ++++-
>>>>>>    3 files changed, 10 insertions(+), 1 deletion(-)
>>>>>>
>>>>>> diff --git a/include/linux/huge_mm.h b/include/linux/huge_mm.h
>>>>>> index 26b6fa98d8ac..67b9c1acad31 100644
>>>>>> --- a/include/linux/huge_mm.h
>>>>>> +++ b/include/linux/huge_mm.h
>>>>>> @@ -270,6 +270,8 @@ enum mthp_stat_item {
>>>>>>           MTHP_STAT_ANON_SWPOUT,
>>>>>>           MTHP_STAT_ANON_SWPOUT_FALLBACK,
>>>>>>           MTHP_STAT_ANON_SWPIN_REFAULT,
>>>>>> +       MTHP_STAT_SHMEM_ANON_ALLOC,
>>>>>> +       MTHP_STAT_SHMEM_ANON_ALLOC_FALLBACK,
>>>>>
>>>>> not quite sure about this. for 2MB pmd-mapped THP shmem, we count them
>>>>> as FILE_THP.
>>>>> here we are counting as SHMEM_ANON. To me, SHMEM_ANON is more correct but
>>>>> it doesn't align with pmd-mapped THP. David, Ryan, what do you think?
>>>>
>>>> The term "anonymous share" in the patch subject is weird to begin with
>>>> ;) Easy to confuse with anonymous cow-shared memory. Let's just call it
>>>> "anonymous shmem", which it is under the hood.
>>>
>>> Sure.
>>>
>>>> ... regarding the question: if we add FILE_ALLOC and friends, at least
>>>> initially, we wouldn't account other large pagecache folios.
>>>>
>>>> ... likely we should add that then as well so the counter matches the
>>>> actual name?
>>>>
>>>> If we later realize that we need separate FILE vs. SHMEM vs. WHATEVER
>>>> counters, we can always add more fine-grained counters later. Doing it
>>>> consistently w.r.t. traditional THPs first sounds reasonable.
>>>
>>> Um, once we expose it to userspace through the sysfs interface, the
>>> sysfs interface should be explicit as much as possible and avoid
>>> confusing users, otherwise it will be difficult to change this kind of
>>> interface in the future. Personally, I prefer to Ryan's suggestion.
>>
>> Inconsistency is confusing. As long as you avoid that, I don't particularly care.
>
> This is a good point. We have been careful to make sure the 2M ANON mTHP stats
> match the existing PMD-size stats. So we should definitely make sure that any
> future 2M FILE mTHP stats match too, which I guess means counting both SHMEM and
> FILE events.
>
> So perhaps it makes more sense to add FILE counters to start with. If we need
> the SHMEM-specific counters, we could add them later?
>
> I'm happy to go with the crowd on this...

(It seems I'm the only one who prefers the 'SHMEM_' term now.) Fine, I
have no strong preference, so let's prioritize consistency. Thanks, guys.

2024-04-24 09:56:34

by Baolin Wang

Subject: Re: [RFC PATCH 0/5] add mTHP support for anonymous share pages



On 2024/4/24 16:26, Ryan Roberts wrote:
> On 24/04/2024 07:55, Baolin Wang wrote:
>>
>>
>> On 2024/4/23 18:41, Ryan Roberts wrote:
>>> On 22/04/2024 08:02, Baolin Wang wrote:
>>>> Anonymous pages have already been supported for multi-size (mTHP) allocation
>>>> through commit 19eaf44954df, that can allow THP to be configured through the
>>>> sysfs interface located at
>>>> '/sys/kernel/mm/transparent_hugepage/hugepage-XXkb/enabled'.
>>>>
>>>> However, the anonymous shared pages will ignore the anonymous mTHP rule
>>>> configured through the sysfs interface, and can only use the PMD-mapped
>>>> THP, that is not reasonable. Many implement anonymous page sharing through
>>>> mmap(MAP_SHARED | MAP_ANONYMOUS), especially in database usage scenarios,
>>>> therefore, users expect to apply an unified mTHP strategy for anonymous pages,
>>>> also including the anonymous shared pages, in order to enjoy the benefits of
>>>> mTHP. For example, lower latency than PMD-mapped THP, smaller memory bloat
>>>> than PMD-mapped THP, contiguous PTEs on ARM architecture to reduce TLB miss etc.
>>>
>>> This sounds like a very useful addition!
>>>
>>> Out of interest, can you point me at any workloads (and off-the-shelf benchmarks
>>> for those workloads) that predominantly use shared anon memory?
>>
>> As far as I know, some database related workloads make extensive use of shared
>> anonymous page, such as PolarDB[1] in our Alibaba fleet, or MySQL likely also
>> uses shared anonymous memory. And I still need to do some investigation to
>> measure the performance.
>>
>> [1] https://github.com/ApsaraDB/PolarDB-for-PostgreSQL
>
> Thanks for the pointer!
>
>>
>>>> The primary strategy is that, the use of huge pages for anonymous shared pages
>>>> still follows the global control determined by the mount option "huge="
>>>> parameter
>>>> or the sysfs interface at '/sys/kernel/mm/transparent_hugepage/shmem_enabled'.
>>>> The utilization of mTHP is allowed only when the global 'huge' switch is
>>>> enabled.
>>>> Subsequently, the mTHP sysfs interface
>>>> (/sys/kernel/mm/transparent_hugepage/hugepage-XXkb/enabled)
>>>> is checked to determine the mTHP size that can be used for large folio
>>>> allocation
>>>> for these anonymous shared pages.
>>>
>>> I'm not sure about this proposed control mechanism; won't it break
>>> compatibility? I could be wrong, but I don't think shmem's use of THP used to
>>> depend upon the value of /sys/kernel/mm/transparent_hugepage/enabled? So it
>>
>> Yes, I realized this after more testing.
>>
>>> doesn't make sense to me that we now depend upon the
>>> /sys/kernel/mm/transparent_hugepage/hugepage-XXkb/enabled values (which by
>>> default disables all sizes except 2M, which is set to "inherit" from
>>> /sys/kernel/mm/transparent_hugepage/enabled).
>>>
>>> The other problem is that shmem_enabled has a different set of options
>>> (always/never/within_size/advise/deny/force) to enabled (always/madvise/never)
>>>
>>> Perhaps it would be cleaner to do the same trick we did for enabled; Introduce
>>> /mm/transparent_hugepage/hugepage-XXkb/shmem_enabled, which can have all the
>>> same values as the top-level /sys/kernel/mm/transparent_hugepage/shmem_enabled,
>>> plus the additional "inherit" option. By default all sizes will be set to
>>> "never" except 2M, which is set to "inherit".
>>
>> Sounds good to me. But I do not want to copy all same values from top-level
>> '/sys/kernel/mm/transparent_hugepage/shmem_enabled':
>> always within_size advise never deny force
>>
>> For mTHP's shmem_enabled interface, we can just keep below values:
>> always within_size advise never
>>
>> Cause when checking if mTHP can be used for anon shmem, 'deny' is equal to
>> 'never', and 'force' is equal to 'always'.
>
> I'll admit it wasn't completely clear to me after reading the docs, but my rough
> understanding is:
>
> - /sys/kernel/mm/transparent_hugepage/shmem_enabled controls
> mmap(SHARED|ANON) allocations (mostly; see rule 3)
> - huge=... controls tmpfs allocations
> - deny and force in shmem_enabled are equivalent to never and always for
> mmap(SHARED|ANON) but additionally override all tmpfs mounts so they act as
> if they were mounted with huge=never or huge=always
>
> Is that correct? If so, then I think it still makes sense to support per-size

Correct.

> deny/force. Certainly if a per-size control is set to "inherit" and the
> top-level control is set to deny or force, you would need that to mean something.

IMHO, the '/mm/transparent_hugepage/hugepage-XXkb/shmem_enabled'
interface should only control anonymous shmem, while 'huge=' controls
tmpfs allocation, so we should not use the anonymous control to override
the tmpfs control; that seems a little messy, no?

>>> Of course the huge= mount option would also need to take a per-size option in
>>> this case. e.g. huge=2048kB:advise,64kB:always
>>
>> IMO, I do not want to change the global 'huge=' mount option, which can control
>> both anon shmem and tmpfs, but mTHP now is only applied for anon shmem. So let's
>
> How does huge= control anon shmem? I thought it was only for mounted
> filesystems; so tmpfs? Perhaps my mental model for how this works is broken...

Sorry for the noise, you are right. So this is exactly the reason I don't
want to change the semantics of 'huge=', which is used to control tmpfs.

>> keep it be same with the global sysfs interface:
>> /sys/kernel/mm/transparent_hugepage/shmem_enabled.
>>
>> For tmpfs large folio strategy, I plan to address it later, and we may need more
>> discussion to determine if it should follow the file large folio strategy or not
>> (no investigation now).
>
> OK. But until you get to tmpfs, you'll need an interim definition for what it
> means if a per-size control is set to "inherit" and the top-level control is set
> to deny/force.
>
>>
>> Thanks for reviewing.
>
> No problem! Thanks for doing the work!
>
>>
>>>> TODO:
>>>>   - More testing and provide some performance data.
>>>>   - Need more discussion about the large folio allocation strategy for a
>>>> 'regular
>>>> file' operation created by memfd_create(), for example using ftruncate(fd) to
>>>> specify
>>>> the 'file' size, which need to follow the anonymous mTHP rule too?
>>>>   - Do not split the large folio when share memory swap out.
>>>>   - Can swap in a large folio for share memory.
>>>>
>>>> Baolin Wang (5):
>>>>    mm: memory: extend finish_fault() to support large folio
>>>>    mm: shmem: add an 'order' parameter for shmem_alloc_hugefolio()
>>>>    mm: shmem: add THP validation for PMD-mapped THP related statistics
>>>>    mm: shmem: add mTHP support for anonymous share pages
>>>>    mm: shmem: add anonymous share mTHP counters
>>>>
>>>>   include/linux/huge_mm.h |   4 +-
>>>>   mm/huge_memory.c        |   8 ++-
>>>>   mm/memory.c             |  25 +++++++---
>>>>   mm/shmem.c              | 107 ++++++++++++++++++++++++++++++----------
>>>>   4 files changed, 108 insertions(+), 36 deletions(-)
>>>>

2024-04-24 10:01:24

by Ryan Roberts

Subject: Re: [RFC PATCH 0/5] add mTHP support for anonymous share pages

On 24/04/2024 10:55, Baolin Wang wrote:
>
>
> On 2024/4/24 16:26, Ryan Roberts wrote:
>> On 24/04/2024 07:55, Baolin Wang wrote:
>>>
>>>
>>> On 2024/4/23 18:41, Ryan Roberts wrote:
>>>> On 22/04/2024 08:02, Baolin Wang wrote:
>>>>> Anonymous pages have already been supported for multi-size (mTHP) allocation
>>>>> through commit 19eaf44954df, that can allow THP to be configured through the
>>>>> sysfs interface located at
>>>>> '/sys/kernel/mm/transparent_hugepage/hugepage-XXkb/enabled'.
>>>>>
>>>>> However, the anonymous shared pages will ignore the anonymous mTHP rule
>>>>> configured through the sysfs interface, and can only use the PMD-mapped
>>>>> THP, that is not reasonable. Many implement anonymous page sharing through
>>>>> mmap(MAP_SHARED | MAP_ANONYMOUS), especially in database usage scenarios,
>>>>> therefore, users expect to apply an unified mTHP strategy for anonymous pages,
>>>>> also including the anonymous shared pages, in order to enjoy the benefits of
>>>>> mTHP. For example, lower latency than PMD-mapped THP, smaller memory bloat
>>>>> than PMD-mapped THP, contiguous PTEs on ARM architecture to reduce TLB miss
>>>>> etc.
>>>>
>>>> This sounds like a very useful addition!
>>>>
>>>> Out of interest, can you point me at any workloads (and off-the-shelf
>>>> benchmarks
>>>> for those workloads) that predominantly use shared anon memory?
>>>
>>> As far as I know, some database related workloads make extensive use of shared
>>> anonymous page, such as PolarDB[1] in our Alibaba fleet, or MySQL likely also
>>> uses shared anonymous memory. And I still need to do some investigation to
>>> measure the performance.
>>>
>>> [1] https://github.com/ApsaraDB/PolarDB-for-PostgreSQL
>>
>> Thanks for the pointer!
>>
>>>
>>>>> The primary strategy is that, the use of huge pages for anonymous shared pages
>>>>> still follows the global control determined by the mount option "huge="
>>>>> parameter
>>>>> or the sysfs interface at '/sys/kernel/mm/transparent_hugepage/shmem_enabled'.
>>>>> The utilization of mTHP is allowed only when the global 'huge' switch is
>>>>> enabled.
>>>>> Subsequently, the mTHP sysfs interface
>>>>> (/sys/kernel/mm/transparent_hugepage/hugepage-XXkb/enabled)
>>>>> is checked to determine the mTHP size that can be used for large folio
>>>>> allocation
>>>>> for these anonymous shared pages.
>>>>
>>>> I'm not sure about this proposed control mechanism; won't it break
>>>> compatibility? I could be wrong, but I don't think shmem's use of THP used to
>>>> depend upon the value of /sys/kernel/mm/transparent_hugepage/enabled? So it
>>>
>>> Yes, I realized this after more testing.
>>>
>>>> doesn't make sense to me that we now depend upon the
>>>> /sys/kernel/mm/transparent_hugepage/hugepage-XXkb/enabled values (which by
>>>> default disables all sizes except 2M, which is set to "inherit" from
>>>> /sys/kernel/mm/transparent_hugepage/enabled).
>>>>
>>>> The other problem is that shmem_enabled has a different set of options
>>>> (always/never/within_size/advise/deny/force) to enabled (always/madvise/never)
>>>>
>>>> Perhaps it would be cleaner to do the same trick we did for enabled; Introduce
>>>> /mm/transparent_hugepage/hugepage-XXkb/shmem_enabled, which can have all the
>>>> same values as the top-level /sys/kernel/mm/transparent_hugepage/shmem_enabled,
>>>> plus the additional "inherit" option. By default all sizes will be set to
>>>> "never" except 2M, which is set to "inherit".
>>>
>>> Sounds good to me. But I do not want to copy all same values from top-level
>>> '/sys/kernel/mm/transparent_hugepage/shmem_enabled':
>>> always within_size advise never deny force
>>>
>>> For mTHP's shmem_enabled interface, we can just keep below values:
>>> always within_size advise never
>>>
>>> Cause when checking if mTHP can be used for anon shmem, 'deny' is equal to
>>> 'never', and 'force' is equal to 'always'.
>>
>> I'll admit it wasn't completely clear to me after reading the docs, but my rough
>> understanding is:
>>
>>   - /sys/kernel/mm/transparent_hugepage/shmem_enabled controls
>>     mmap(SHARED|ANON) allocations (mostly; see rule 3)
>>   - huge=... controls tmpfs allocations
>>   - deny and force in shmem_enabled are equivalent to never and always for
>>     mmap(SHARED|ANON) but additionally override all tmpfs mounts so they act as
>>     if they were mounted with huge=never or huge=always
>>
>> Is that correct? If so, then I think it still makes sense to support per-size
>
> Correct.
>
>> deny/force. Certainly if a per-size control is set to "inherit" and the
>> top-level control is set to deny or force, you would need that to mean something.
>
> IMHO, the '/mm/transparent_hugepage/hugepage-XXkb/shmem_enabled' interface
> should only control the anonymous shmem. And 'huge=' controls tmpfs allocation,
> so we should not use anonymous control to override tmpfs control, which seems a
> little mess?

I agree it would be cleaner to only handle mmap(SHARED|ANON) here, and leave the
tmpfs stuff for another time. But my point is that
/mm/transparent_hugepage/shmem_enabled already interferes with tmpfs if the
value is deny or force. So if you have:

echo deny > /mm/transparent_hugepage/shmem_enabled
echo inherit > /mm/transparent_hugepage/hugepage-64kB/shmem_enabled

What does that mean?

>
>>>> Of course the huge= mount option would also need to take a per-size option in
>>>> this case. e.g. huge=2048kB:advise,64kB:always
>>>
>>> IMO, I do not want to change the global 'huge=' mount option, which can control
>>> both anon shmem and tmpfs, but mTHP now is only applied for anon shmem. So let's
>>
>> How does huge= control anon shmem? I thought it was only for mounted
>> filesystems; so tmpfs? Perhaps my mental model for how this works is broken...
>
> Sorry for noise, you are right. So this is still the reason I don't want to
> change the semantics of 'huge=', which is used to control tmpfs.
>
>>> keep it be same with the global sysfs interface:
>>> /sys/kernel/mm/transparent_hugepage/shmem_enabled.
>>>
>>> For tmpfs large folio strategy, I plan to address it later, and we may need more
>>> discussion to determine if it should follow the file large folio strategy or not
>>> (no investigation now).
>>
>> OK. But until you get to tmpfs, you'll need an interim definition for what it
>> means if a per-size control is set to "inherit" and the top-level control is set
>> to deny/force.
>>
>>>
>>> Thanks for reviewing.
>>
>> No problem! Thanks for doing the work!
>>
>>>
>>>>> TODO:
>>>>>    - More testing and provide some performance data.
>>>>>    - Need more discussion about the large folio allocation strategy for a
>>>>> 'regular
>>>>> file' operation created by memfd_create(), for example using ftruncate(fd) to
>>>>> specify
>>>>> the 'file' size, which need to follow the anonymous mTHP rule too?
>>>>>    - Do not split the large folio when share memory swap out.
>>>>>    - Can swap in a large folio for share memory.
>>>>>
>>>>> Baolin Wang (5):
>>>>>     mm: memory: extend finish_fault() to support large folio
>>>>>     mm: shmem: add an 'order' parameter for shmem_alloc_hugefolio()
>>>>>     mm: shmem: add THP validation for PMD-mapped THP related statistics
>>>>>     mm: shmem: add mTHP support for anonymous share pages
>>>>>     mm: shmem: add anonymous share mTHP counters
>>>>>
>>>>>    include/linux/huge_mm.h |   4 +-
>>>>>    mm/huge_memory.c        |   8 ++-
>>>>>    mm/memory.c             |  25 +++++++---
>>>>>    mm/shmem.c              | 107 ++++++++++++++++++++++++++++++----------
>>>>>    4 files changed, 108 insertions(+), 36 deletions(-)
>>>>>


2024-04-24 13:49:56

by Baolin Wang

Subject: Re: [RFC PATCH 0/5] add mTHP support for anonymous share pages



On 2024/4/24 18:01, Ryan Roberts wrote:
> On 24/04/2024 10:55, Baolin Wang wrote:
>>
>>
>> On 2024/4/24 16:26, Ryan Roberts wrote:
>>> On 24/04/2024 07:55, Baolin Wang wrote:
>>>>
>>>>
>>>> On 2024/4/23 18:41, Ryan Roberts wrote:
>>>>> On 22/04/2024 08:02, Baolin Wang wrote:
>>>>>> Anonymous pages have already been supported for multi-size (mTHP) allocation
>>>>>> through commit 19eaf44954df, that can allow THP to be configured through the
>>>>>> sysfs interface located at
>>>>>> '/sys/kernel/mm/transparent_hugepage/hugepage-XXkb/enabled'.
>>>>>>
>>>>>> However, the anonymous shared pages will ignore the anonymous mTHP rule
>>>>>> configured through the sysfs interface, and can only use the PMD-mapped
>>>>>> THP, that is not reasonable. Many implement anonymous page sharing through
>>>>>> mmap(MAP_SHARED | MAP_ANONYMOUS), especially in database usage scenarios,
>>>>>> therefore, users expect to apply an unified mTHP strategy for anonymous pages,
>>>>>> also including the anonymous shared pages, in order to enjoy the benefits of
>>>>>> mTHP. For example, lower latency than PMD-mapped THP, smaller memory bloat
>>>>>> than PMD-mapped THP, contiguous PTEs on ARM architecture to reduce TLB miss
>>>>>> etc.
>>>>>
>>>>> This sounds like a very useful addition!
>>>>>
>>>>> Out of interest, can you point me at any workloads (and off-the-shelf
>>>>> benchmarks
>>>>> for those workloads) that predominantly use shared anon memory?
>>>>
>>>> As far as I know, some database related workloads make extensive use of shared
>>>> anonymous page, such as PolarDB[1] in our Alibaba fleet, or MySQL likely also
>>>> uses shared anonymous memory. And I still need to do some investigation to
>>>> measure the performance.
>>>>
>>>> [1] https://github.com/ApsaraDB/PolarDB-for-PostgreSQL
>>>
>>> Thanks for the pointer!
>>>
>>>>
>>>>>> The primary strategy is that, the use of huge pages for anonymous shared pages
>>>>>> still follows the global control determined by the mount option "huge="
>>>>>> parameter
>>>>>> or the sysfs interface at '/sys/kernel/mm/transparent_hugepage/shmem_enabled'.
>>>>>> The utilization of mTHP is allowed only when the global 'huge' switch is
>>>>>> enabled.
>>>>>> Subsequently, the mTHP sysfs interface
>>>>>> (/sys/kernel/mm/transparent_hugepage/hugepage-XXkb/enabled)
>>>>>> is checked to determine the mTHP size that can be used for large folio
>>>>>> allocation
>>>>>> for these anonymous shared pages.
>>>>>
>>>>> I'm not sure about this proposed control mechanism; won't it break
>>>>> compatibility? I could be wrong, but I don't think shmem's use of THP used to
>>>>> depend upon the value of /sys/kernel/mm/transparent_hugepage/enabled? So it
>>>>
>>>> Yes, I realized this after more testing.
>>>>
>>>>> doesn't make sense to me that we now depend upon the
>>>>> /sys/kernel/mm/transparent_hugepage/hugepage-XXkb/enabled values (which by
>>>>> default disables all sizes except 2M, which is set to "inherit" from
>>>>> /sys/kernel/mm/transparent_hugepage/enabled).
>>>>>
>>>>> The other problem is that shmem_enabled has a different set of options
>>>>> (always/never/within_size/advise/deny/force) to enabled (always/madvise/never)
>>>>>
>>>>> Perhaps it would be cleaner to do the same trick we did for enabled; Introduce
>>>>> /mm/transparent_hugepage/hugepage-XXkb/shmem_enabled, which can have all the
>>>>> same values as the top-level /sys/kernel/mm/transparent_hugepage/shmem_enabled,
>>>>> plus the additional "inherit" option. By default all sizes will be set to
>>>>> "never" except 2M, which is set to "inherit".
>>>>
>>>> Sounds good to me. But I do not want to copy all same values from top-level
>>>> '/sys/kernel/mm/transparent_hugepage/shmem_enabled':
>>>> always within_size advise never deny force
>>>>
>>>> For mTHP's shmem_enabled interface, we can just keep below values:
>>>> always within_size advise never
>>>>
>>>> Cause when checking if mTHP can be used for anon shmem, 'deny' is equal to
>>>> 'never', and 'force' is equal to 'always'.
>>>
>>> I'll admit it wasn't completely clear to me after reading the docs, but my rough
>>> understanding is:
>>>
>>>   - /sys/kernel/mm/transparent_hugepage/shmem_enabled controls
>>>     mmap(SHARED|ANON) allocations (mostly; see rule 3)
>>>   - huge=... controls tmpfs allocations
>>>   - deny and force in shmem_enabled are equivalent to never and always for
>>>     mmap(SHARED|ANON) but additionally override all tmpfs mounts so they act as
>>>     if they were mounted with huge=never or huge=always
>>>
>>> Is that correct? If so, then I think it still makes sense to support per-size
>>
>> Correct.
>>
>>> deny/force. Certainly if a per-size control is set to "inherit" and the
>>> top-level control is set to deny or force, you would need that to mean something.
>>
>> IMHO, the '/mm/transparent_hugepage/hugepage-XXkb/shmem_enabled' interface
>> should only control the anonymous shmem. And 'huge=' controls tmpfs allocation,
>> so we should not use anonymous control to override tmpfs control, which seems a
>> little mess?
>
> I agree it would be cleaner to only handle mmap(SHARED|ANON) here, and leave the
> tmpfs stuff for another time. But my point is that
> /mm/transparent_hugepage/shmem_enabled already interferes with tmpfs if the
> value is deny or force. So if you have:
>
> echo deny > /mm/transparent_hugepage/shmem_enabled

IIUC, this global control will cause shmem_is_huge() to always return
false, so no matter how
'/mm/transparent_hugepage/hugepage-xxxkB/shmem_enabled' is set,
anonymous shmem will not use mTHP. No?

> echo inherit > /mm/transparent_hugepage/hugepage-64kB/shmem_enabled
>
> What does that mean?
>
>>
>>>>> Of course the huge= mount option would also need to take a per-size option in
>>>>> this case. e.g. huge=2048kB:advise,64kB:always
>>>>
>>>> IMO, I do not want to change the global 'huge=' mount option, which can control
>>>> both anon shmem and tmpfs, but mTHP now is only applied for anon shmem. So let's
>>>
>>> How does huge= control anon shmem? I thought it was only for mounted
>>> filesystems; so tmpfs? Perhaps my mental model for how this works is broken...
>>
>> Sorry for noise, you are right. So this is still the reason I don't want to
>> change the semantics of 'huge=', which is used to control tmpfs.
>>
>>>> keep it be same with the global sysfs interface:
>>>> /sys/kernel/mm/transparent_hugepage/shmem_enabled.
>>>>
>>>> For tmpfs large folio strategy, I plan to address it later, and we may need more
>>>> discussion to determine if it should follow the file large folio strategy or not
>>>> (no investigation now).
>>>
>>> OK. But until you get to tmpfs, you'll need an interim definition for what it
>>> means if a per-size control is set to "inherit" and the top-level control is set
>>> to deny/force.
>>>
>>>>
>>>> Thanks for reviewing.
>>>
>>> No problem! Thanks for doing the work!
>>>
>>>>
>>>>>> TODO:
>>>>>>    - More testing and provide some performance data.
>>>>>>    - Need more discussion about the large folio allocation strategy for a
>>>>>> 'regular
>>>>>> file' operation created by memfd_create(), for example using ftruncate(fd) to
>>>>>> specify
>>>>>> the 'file' size, which need to follow the anonymous mTHP rule too?
>>>>>>    - Do not split the large folio when share memory swap out.
>>>>>>    - Can swap in a large folio for share memory.
>>>>>>
>>>>>> Baolin Wang (5):
>>>>>>     mm: memory: extend finish_fault() to support large folio
>>>>>>     mm: shmem: add an 'order' parameter for shmem_alloc_hugefolio()
>>>>>>     mm: shmem: add THP validation for PMD-mapped THP related statistics
>>>>>>     mm: shmem: add mTHP support for anonymous share pages
>>>>>>     mm: shmem: add anonymous share mTHP counters
>>>>>>
>>>>>>    include/linux/huge_mm.h |   4 +-
>>>>>>    mm/huge_memory.c        |   8 ++-
>>>>>>    mm/memory.c             |  25 +++++++---
>>>>>>    mm/shmem.c              | 107 ++++++++++++++++++++++++++++++----------
>>>>>>    4 files changed, 108 insertions(+), 36 deletions(-)
>>>>>>

2024-04-24 14:28:01

by Ryan Roberts

Subject: Re: [RFC PATCH 0/5] add mTHP support for anonymous share pages

On 24/04/2024 14:49, Baolin Wang wrote:
>
>
> On 2024/4/24 18:01, Ryan Roberts wrote:
>> On 24/04/2024 10:55, Baolin Wang wrote:
>>>
>>>
>>> On 2024/4/24 16:26, Ryan Roberts wrote:
>>>> On 24/04/2024 07:55, Baolin Wang wrote:
>>>>>
>>>>>
>>>>> On 2024/4/23 18:41, Ryan Roberts wrote:
>>>>>> On 22/04/2024 08:02, Baolin Wang wrote:
>>>>>>> Anonymous pages have already been supported for multi-size (mTHP) allocation
>>>>>>> through commit 19eaf44954df, that can allow THP to be configured through the
>>>>>>> sysfs interface located at
>>>>>>> '/sys/kernel/mm/transparent_hugepage/hugepage-XXkb/enabled'.
>>>>>>>
>>>>>>> However, the anonymous shared pages will ignore the anonymous mTHP rule
>>>>>>> configured through the sysfs interface, and can only use the PMD-mapped
>>>>>>> THP, that is not reasonable. Many implement anonymous page sharing through
>>>>>>> mmap(MAP_SHARED | MAP_ANONYMOUS), especially in database usage scenarios,
>>>>>>> therefore, users expect to apply an unified mTHP strategy for anonymous
>>>>>>> pages,
>>>>>>> also including the anonymous shared pages, in order to enjoy the benefits of
>>>>>>> mTHP. For example, lower latency than PMD-mapped THP, smaller memory bloat
>>>>>>> than PMD-mapped THP, contiguous PTEs on ARM architecture to reduce TLB miss
>>>>>>> etc.
>>>>>>
>>>>>> This sounds like a very useful addition!
>>>>>>
>>>>>> Out of interest, can you point me at any workloads (and off-the-shelf
>>>>>> benchmarks
>>>>>> for those workloads) that predominantly use shared anon memory?
>>>>>
>>>>> As far as I know, some database related workloads make extensive use of shared
>>>>> anonymous page, such as PolarDB[1] in our Alibaba fleet, or MySQL likely also
>>>>> uses shared anonymous memory. And I still need to do some investigation to
>>>>> measure the performance.
>>>>>
>>>>> [1] https://github.com/ApsaraDB/PolarDB-for-PostgreSQL
>>>>
>>>> Thanks for the pointer!
>>>>
>>>>>
>>>>>>> The primary strategy is that, the use of huge pages for anonymous shared
>>>>>>> pages
>>>>>>> still follows the global control determined by the mount option "huge="
>>>>>>> parameter
>>>>>>> or the sysfs interface at
>>>>>>> '/sys/kernel/mm/transparent_hugepage/shmem_enabled'.
>>>>>>> The utilization of mTHP is allowed only when the global 'huge' switch is
>>>>>>> enabled.
>>>>>>> Subsequently, the mTHP sysfs interface
>>>>>>> (/sys/kernel/mm/transparent_hugepage/hugepage-XXkb/enabled)
>>>>>>> is checked to determine the mTHP size that can be used for large folio
>>>>>>> allocation
>>>>>>> for these anonymous shared pages.
>>>>>>
>>>>>> I'm not sure about this proposed control mechanism; won't it break
>>>>>> compatibility? I could be wrong, but I don't think shmem's use of THP used to
>>>>>> depend upon the value of /sys/kernel/mm/transparent_hugepage/enabled? So it
>>>>>
>>>>> Yes, I realized this after more testing.
>>>>>
>>>>>> doesn't make sense to me that we now depend upon the
>>>>>> /sys/kernel/mm/transparent_hugepage/hugepage-XXkb/enabled values (which by
>>>>>> default disables all sizes except 2M, which is set to "inherit" from
>>>>>> /sys/kernel/mm/transparent_hugepage/enabled).
>>>>>>
>>>>>> The other problem is that shmem_enabled has a different set of options
>>>>>> (always/never/within_size/advise/deny/force) to enabled
>>>>>> (always/madvise/never)
>>>>>>
>>>>>> Perhaps it would be cleaner to do the same trick we did for enabled;
>>>>>> Introduce
>>>>>> /mm/transparent_hugepage/hugepage-XXkb/shmem_enabled, which can have all the
>>>>>> same values as the top-level
>>>>>> /sys/kernel/mm/transparent_hugepage/shmem_enabled,
>>>>>> plus the additional "inherit" option. By default all sizes will be set to
>>>>>> "never" except 2M, which is set to "inherit".
>>>>>
>>>>> Sounds good to me. But I do not want to copy all same values from top-level
>>>>> '/sys/kernel/mm/transparent_hugepage/shmem_enabled':
>>>>> always within_size advise never deny force
>>>>>
>>>>> For mTHP's shmem_enabled interface, we can just keep below values:
>>>>> always within_size advise never
>>>>>
>>>>> Cause when checking if mTHP can be used for anon shmem, 'deny' is equal to
>>>>> 'never', and 'force' is equal to 'always'.
>>>>
>>>> I'll admit it wasn't completely clear to me after reading the docs, but my
>>>> rough
>>>> understanding is:
>>>>
>>>>    - /sys/kernel/mm/transparent_hugepage/shmem_enabled controls
>>>>      mmap(SHARED|ANON) allocations (mostly; see rule 3)
>>>>    - huge=... controls tmpfs allocations
>>>>    - deny and force in shmem_enabled are equivalent to never and always for
>>>>      mmap(SHARED|ANON) but additionally override all tmpfs mounts so they
>>>> act as
>>>>      if they were mounted with huge=never or huge=always
>>>>
>>>> Is that correct? If so, then I think it still makes sense to support per-size
>>>
>>> Correct.
>>>
>>>> deny/force. Certainly if a per-size control is set to "inherit" and the
>>>> top-level control is set to deny or force, you would need that to mean
>>>> something.
>>>
>>> IMHO, the '/mm/transparent_hugepage/hugepage-XXkb/shmem_enabled' interface
>>> should only control the anonymous shmem. And 'huge=' controls tmpfs allocation,
>>> so we should not use anonymous control to override tmpfs control, which seems a
>>> little mess?
>>
>> I agree it would be cleaner to only handle mmap(SHARED|ANON) here, and leave the
>> tmpfs stuff for another time. But my point is that
>> /mm/transparent_hugepage/shmem_enabled already interferes with tmpfs if the
>> value is deny or force. So if you have:
>>
>> echo deny > /mm/transparent_hugepage/shmem_enabled
>
> IIUC, this global control will cause shmem_is_huge() to always return false, so
> no matter how '/mm/transparent_hugepage/hugepage-xxxkB/shmem_enabled' is set,
> anonymous shmem will not use mTHP. No?

No, that's not how '/mm/transparent_hugepage/hugepage-xxxkB/enabled' works, and
I think '/mm/transparent_hugepage/hugepage-xxxkB/shmem_enabled' should follow
the established pattern.

For anon-private, each size is controlled by its
/mm/transparent_hugepage/hugepage-xxxkB/enabled value. Unless that value is
"inherit", in which case the value in /mm/transparent_hugepage/enabled is used
for that size.

That approach enables us to 1) maintain back-compat and 2) control each size
independently.

1) is met because the default is that all sizes are initially set to "never",
except the PMD-size (e.g. /mm/transparent_hugepage/hugepage-2048kB/enabled),
which is initially set to "inherit". So any mTHP-unaware SW can still modify
/mm/transparent_hugepage/enabled and it will still only apply to the PMD size.

2) is met because mTHP-aware SW can come along and e.g. enable the 64K size
(echo always > /mm/transparent_hugepage/hugepage-64kB/enabled) without having to
modify the value in /mm/transparent_hugepage/enabled.
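The fallback behaviour described above (each size uses its own control, with
"inherit" deferring to the top-level value) can be sketched roughly as follows.
This is only an illustrative model, not kernel code; the dict keys and values
are hypothetical stand-ins for the sysfs files:

```python
# Illustrative sketch of per-size mTHP control resolution: each size
# has its own policy, and "inherit" defers to the top-level control
# (/mm/transparent_hugepage/enabled).

def effective_policy(per_size_value, top_level_value):
    """Resolve the effective policy for one mTHP size."""
    if per_size_value == "inherit":
        return top_level_value
    return per_size_value

# Default configuration: all sizes "never" except the PMD size
# (2048kB), which is "inherit" -- so mTHP-unaware software that only
# touches the top-level control still affects only the PMD size.
controls = {"64kB": "never", "2048kB": "inherit"}
top_level = "madvise"

print(effective_policy(controls["2048kB"], top_level))  # madvise
print(effective_policy(controls["64kB"], top_level))    # never
```

With this model, enabling 64K explicitly ("always") takes effect without
touching the top-level control, which is what preserves back-compat.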

>
>> echo inherit > /mm/transparent_hugepage/hugepage-64kB/shmem_enabled
>>
>> What does that mean?

So I think /mm/transparent_hugepage/hugepage-xxxkB/shmem_enabled will need to
support the deny and force values. When applied to non-PMD sizes, "deny" can
just be a noop for now, because there was no way to configure a tmpfs mount for
non-PMD size THP in the first place. But I'm not sure what to do with "force"?

>>
>>>
>>>>>> Of course the huge= mount option would also need to take a per-size option in
>>>>>> this case. e.g. huge=2048kB:advise,64kB:always
>>>>>
>>>>> IMO, I do not want to change the global 'huge=' mount option, which can
>>>>> control
>>>>> both anon shmem and tmpfs, but mTHP now is only applied for anon shmem. So
>>>>> let's
>>>>
>>>> How does huge= control anon shmem? I thought it was only for mounted
>>>> filesystems; so tmpfs? Perhaps my mental model for how this works is broken...
>>>
>>> Sorry for noise, you are right. So this is still the reason I don't want to
>>> change the semantics of 'huge=', which is used to control tmpfs.
>>>
>>>>> keep it be same with the global sysfs interface:
>>>>> /sys/kernel/mm/transparent_hugepage/shmem_enabled.
>>>>>
>>>>> For tmpfs large folio strategy, I plan to address it later, and we may need
>>>>> more
>>>>> discussion to determine if it should follow the file large folio strategy
>>>>> or not
>>>>> (no investigation now).
>>>>
>>>> OK. But until you get to tmpfs, you'll need an interim definition for what it
>>>> means if a per-size control is set to "inherit" and the top-level control is
>>>> set
>>>> to deny/force.
>>>>
>>>>>
>>>>> Thanks for reviewing.
>>>>
>>>> No problem! Thanks for doing the work!
>>>>
>>>>>
>>>>>>> TODO:
>>>>>>>     - More testing and provide some performance data.
>>>>>>>     - Need more discussion about the large folio allocation strategy for a
>>>>>>> 'regular
>>>>>>> file' operation created by memfd_create(), for example using
>>>>>>> ftruncate(fd) to
>>>>>>> specify
>>>>>>> the 'file' size, which need to follow the anonymous mTHP rule too?
>>>>>>>     - Do not split the large folio when share memory swap out.
>>>>>>>     - Can swap in a large folio for share memory.
>>>>>>>
>>>>>>> Baolin Wang (5):
>>>>>>>      mm: memory: extend finish_fault() to support large folio
>>>>>>>      mm: shmem: add an 'order' parameter for shmem_alloc_hugefolio()
>>>>>>>      mm: shmem: add THP validation for PMD-mapped THP related statistics
>>>>>>>      mm: shmem: add mTHP support for anonymous share pages
>>>>>>>      mm: shmem: add anonymous share mTHP counters
>>>>>>>
>>>>>>>     include/linux/huge_mm.h |   4 +-
>>>>>>>     mm/huge_memory.c        |   8 ++-
>>>>>>>     mm/memory.c             |  25 +++++++---
>>>>>>>     mm/shmem.c              | 107 ++++++++++++++++++++++++++++++----------
>>>>>>>     4 files changed, 108 insertions(+), 36 deletions(-)
>>>>>>>


2024-04-25 06:20:27

by Baolin Wang

Subject: Re: [RFC PATCH 0/5] add mTHP support for anonymous share pages



On 2024/4/24 22:20, Ryan Roberts wrote:
> On 24/04/2024 14:49, Baolin Wang wrote:
>>
>>
>> On 2024/4/24 18:01, Ryan Roberts wrote:
>>> On 24/04/2024 10:55, Baolin Wang wrote:
>>>>
>>>>
>>>> On 2024/4/24 16:26, Ryan Roberts wrote:
>>>>> On 24/04/2024 07:55, Baolin Wang wrote:
>>>>>>
>>>>>>
>>>>>> On 2024/4/23 18:41, Ryan Roberts wrote:
>>>>>>> On 22/04/2024 08:02, Baolin Wang wrote:
>>>>>>>> Anonymous pages have already been supported for multi-size (mTHP) allocation
>>>>>>>> through commit 19eaf44954df, that can allow THP to be configured through the
>>>>>>>> sysfs interface located at
>>>>>>>> '/sys/kernel/mm/transparent_hugepage/hugepage-XXkb/enabled'.
>>>>>>>>
>>>>>>>> However, the anonymous shared pages will ignore the anonymous mTHP rule
>>>>>>>> configured through the sysfs interface, and can only use the PMD-mapped
>>>>>>>> THP, that is not reasonable. Many implement anonymous page sharing through
>>>>>>>> mmap(MAP_SHARED | MAP_ANONYMOUS), especially in database usage scenarios,
>>>>>>>> therefore, users expect to apply an unified mTHP strategy for anonymous
>>>>>>>> pages,
>>>>>>>> also including the anonymous shared pages, in order to enjoy the benefits of
>>>>>>>> mTHP. For example, lower latency than PMD-mapped THP, smaller memory bloat
>>>>>>>> than PMD-mapped THP, contiguous PTEs on ARM architecture to reduce TLB miss
>>>>>>>> etc.
>>>>>>>
>>>>>>> This sounds like a very useful addition!
>>>>>>>
>>>>>>> Out of interest, can you point me at any workloads (and off-the-shelf
>>>>>>> benchmarks
>>>>>>> for those workloads) that predominantly use shared anon memory?
>>>>>>
>>>>>> As far as I know, some database related workloads make extensive use of shared
>>>>>> anonymous page, such as PolarDB[1] in our Alibaba fleet, or MySQL likely also
>>>>>> uses shared anonymous memory. And I still need to do some investigation to
>>>>>> measure the performance.
>>>>>>
>>>>>> [1] https://github.com/ApsaraDB/PolarDB-for-PostgreSQL
>>>>>
>>>>> Thanks for the pointer!
>>>>>
>>>>>>
>>>>>>>> The primary strategy is that, the use of huge pages for anonymous shared
>>>>>>>> pages
>>>>>>>> still follows the global control determined by the mount option "huge="
>>>>>>>> parameter
>>>>>>>> or the sysfs interface at
>>>>>>>> '/sys/kernel/mm/transparent_hugepage/shmem_enabled'.
>>>>>>>> The utilization of mTHP is allowed only when the global 'huge' switch is
>>>>>>>> enabled.
>>>>>>>> Subsequently, the mTHP sysfs interface
>>>>>>>> (/sys/kernel/mm/transparent_hugepage/hugepage-XXkb/enabled)
>>>>>>>> is checked to determine the mTHP size that can be used for large folio
>>>>>>>> allocation
>>>>>>>> for these anonymous shared pages.
>>>>>>>
>>>>>>> I'm not sure about this proposed control mechanism; won't it break
>>>>>>> compatibility? I could be wrong, but I don't think shmem's use of THP used to
>>>>>>> depend upon the value of /sys/kernel/mm/transparent_hugepage/enabled? So it
>>>>>>
>>>>>> Yes, I realized this after more testing.
>>>>>>
>>>>>>> doesn't make sense to me that we now depend upon the
>>>>>>> /sys/kernel/mm/transparent_hugepage/hugepage-XXkb/enabled values (which by
>>>>>>> default disables all sizes except 2M, which is set to "inherit" from
>>>>>>> /sys/kernel/mm/transparent_hugepage/enabled).
>>>>>>>
>>>>>>> The other problem is that shmem_enabled has a different set of options
>>>>>>> (always/never/within_size/advise/deny/force) to enabled
>>>>>>> (always/madvise/never)
>>>>>>>
>>>>>>> Perhaps it would be cleaner to do the same trick we did for enabled;
>>>>>>> Introduce
>>>>>>> /mm/transparent_hugepage/hugepage-XXkb/shmem_enabled, which can have all the
>>>>>>> same values as the top-level
>>>>>>> /sys/kernel/mm/transparent_hugepage/shmem_enabled,
>>>>>>> plus the additional "inherit" option. By default all sizes will be set to
>>>>>>> "never" except 2M, which is set to "inherit".
>>>>>>
>>>>>> Sounds good to me. But I do not want to copy all same values from top-level
>>>>>> '/sys/kernel/mm/transparent_hugepage/shmem_enabled':
>>>>>> always within_size advise never deny force
>>>>>>
>>>>>> For mTHP's shmem_enabled interface, we can just keep below values:
>>>>>> always within_size advise never
>>>>>>
>>>>>> Cause when checking if mTHP can be used for anon shmem, 'deny' is equal to
>>>>>> 'never', and 'force' is equal to 'always'.
>>>>>
>>>>> I'll admit it wasn't completely clear to me after reading the docs, but my
>>>>> rough
>>>>> understanding is:
>>>>>
>>>>>    - /sys/kernel/mm/transparent_hugepage/shmem_enabled controls
>>>>>      mmap(SHARED|ANON) allocations (mostly; see rule 3)
>>>>>    - huge=... controls tmpfs allocations
>>>>>    - deny and force in shmem_enabled are equivalent to never and always for
>>>>>      mmap(SHARED|ANON) but additionally override all tmpfs mounts so they
>>>>> act as
>>>>>      if they were mounted with huge=never or huge=always
>>>>>
>>>>> Is that correct? If so, then I think it still makes sense to support per-size
>>>>
>>>> Correct.
>>>>
>>>>> deny/force. Certainly if a per-size control is set to "inherit" and the
>>>>> top-level control is set to deny or force, you would need that to mean
>>>>> something.
>>>>
>>>> IMHO, the '/mm/transparent_hugepage/hugepage-XXkb/shmem_enabled' interface
>>>> should only control the anonymous shmem. And 'huge=' controls tmpfs allocation,
>>>> so we should not use anonymous control to override tmpfs control, which seems a
>>>> little mess?
>>>
>>> I agree it would be cleaner to only handle mmap(SHARED|ANON) here, and leave the
>>> tmpfs stuff for another time. But my point is that
>>> /mm/transparent_hugepage/shmem_enabled already interferes with tmpfs if the
>>> value is deny or force. So if you have:
>>>
>>> echo deny > /mm/transparent_hugepage/shmem_enabled
>>
>> IIUC, this global control will cause shmem_is_huge() to always return false, so
>> no matter how '/mm/transparent_hugepage/hugepage-xxxkB/shmem_enabled' is set,
>> anonymous shmem will not use mTHP. No?
>
> No, that's not how '/mm/transparent_hugepage/hugepage-xxxkB/enabled' works, and
> I think '/mm/transparent_hugepage/hugepage-xxxkB/shmem_enabled' should follow
> the established pattern.
>
> For anon-private, each size is controlled by its
> /mm/transparent_hugepage/hugepage-xxxkB/enabled value. Unless that value is
> "inherit", in which case the value in /mm/transparent_hugepage/enabled is used
> for that size.
>
> That approach enables us to 1) maintain back-compat and 2) control each size
> independently
>
> 1) is met because the default is that all sizes are initially set to "never",
> except the PMD-size (e.g. /mm/transparent_hugepage/hugepage-2048kB/enabled)
> which is initially set to inherit. So any mTHP unaware SW can still modify
> /mm/transparent_hugepage/enabled and it will still only apply to PMD size.
>
> 2) is met because mTHP aware SW can come along and e.g. enable the 64K size
> (echo always > /mm/transparent_hugepage/hugepage-64kB/enabled) without having to
> modify the value in /mm/transparent_hugepage/enabled.

Thanks for the explanation. Initially, I wanted to make
'/mm/transparent_hugepage/shmem_enabled' a global control for huge
pages, but I agree it should follow the same strategy as anon mTHP, as you
said.

>>> echo inherit > /mm/transparent_hugepage/hugepage-64kB/shmem_enabled
>>>
>>> What does that mean?
>
> So I think /mm/transparent_hugepage/hugepage-xxxkB/shmem_enabled will need to
> support the deny and force values. When applied to non-PMD sizes, "deny" can
> just be a noop for now, because there was no way to configure a tmpfs mount for
> non-PMD size THP in the first place. But I'm not sure what to do with "force"?

OK. And I also prefer that "force" should be a noop too, since anon
shmem control should not configure tmpfs huge page allocation.

2024-04-25 08:18:02

by Ryan Roberts

Subject: Re: [RFC PATCH 0/5] add mTHP support for anonymous share pages

On 25/04/2024 07:20, Baolin Wang wrote:
>
>
> On 2024/4/24 22:20, Ryan Roberts wrote:
>> On 24/04/2024 14:49, Baolin Wang wrote:
>>>
>>>
>>> On 2024/4/24 18:01, Ryan Roberts wrote:
>>>> On 24/04/2024 10:55, Baolin Wang wrote:
>>>>>
>>>>>
>>>>> On 2024/4/24 16:26, Ryan Roberts wrote:
>>>>>> On 24/04/2024 07:55, Baolin Wang wrote:
>>>>>>>
>>>>>>>
>>>>>>> On 2024/4/23 18:41, Ryan Roberts wrote:
>>>>>>>> On 22/04/2024 08:02, Baolin Wang wrote:
>>>>>>>>> Anonymous pages have already been supported for multi-size (mTHP)
>>>>>>>>> allocation
>>>>>>>>> through commit 19eaf44954df, that can allow THP to be configured
>>>>>>>>> through the
>>>>>>>>> sysfs interface located at
>>>>>>>>> '/sys/kernel/mm/transparent_hugepage/hugepage-XXkb/enabled'.
>>>>>>>>>
>>>>>>>>> However, the anonymous shared pages will ignore the anonymous mTHP rule
>>>>>>>>> configured through the sysfs interface, and can only use the PMD-mapped
>>>>>>>>> THP, that is not reasonable. Many implement anonymous page sharing through
>>>>>>>>> mmap(MAP_SHARED | MAP_ANONYMOUS), especially in database usage scenarios,
>>>>>>>>> therefore, users expect to apply an unified mTHP strategy for anonymous
>>>>>>>>> pages,
>>>>>>>>> also including the anonymous shared pages, in order to enjoy the
>>>>>>>>> benefits of
>>>>>>>>> mTHP. For example, lower latency than PMD-mapped THP, smaller memory bloat
>>>>>>>>> than PMD-mapped THP, contiguous PTEs on ARM architecture to reduce TLB
>>>>>>>>> miss
>>>>>>>>> etc.
>>>>>>>>
>>>>>>>> This sounds like a very useful addition!
>>>>>>>>
>>>>>>>> Out of interest, can you point me at any workloads (and off-the-shelf
>>>>>>>> benchmarks
>>>>>>>> for those workloads) that predominantly use shared anon memory?
>>>>>>>
>>>>>>> As far as I know, some database related workloads make extensive use of
>>>>>>> shared
>>>>>>> anonymous page, such as PolarDB[1] in our Alibaba fleet, or MySQL likely
>>>>>>> also
>>>>>>> uses shared anonymous memory. And I still need to do some investigation to
>>>>>>> measure the performance.
>>>>>>>
>>>>>>> [1] https://github.com/ApsaraDB/PolarDB-for-PostgreSQL
>>>>>>
>>>>>> Thanks for the pointer!
>>>>>>
>>>>>>>
>>>>>>>>> The primary strategy is that, the use of huge pages for anonymous shared
>>>>>>>>> pages
>>>>>>>>> still follows the global control determined by the mount option "huge="
>>>>>>>>> parameter
>>>>>>>>> or the sysfs interface at
>>>>>>>>> '/sys/kernel/mm/transparent_hugepage/shmem_enabled'.
>>>>>>>>> The utilization of mTHP is allowed only when the global 'huge' switch is
>>>>>>>>> enabled.
>>>>>>>>> Subsequently, the mTHP sysfs interface
>>>>>>>>> (/sys/kernel/mm/transparent_hugepage/hugepage-XXkb/enabled)
>>>>>>>>> is checked to determine the mTHP size that can be used for large folio
>>>>>>>>> allocation
>>>>>>>>> for these anonymous shared pages.
>>>>>>>>
>>>>>>>> I'm not sure about this proposed control mechanism; won't it break
>>>>>>>> compatibility? I could be wrong, but I don't think shmem's use of THP
>>>>>>>> used to
>>>>>>>> depend upon the value of /sys/kernel/mm/transparent_hugepage/enabled? So it
>>>>>>>
>>>>>>> Yes, I realized this after more testing.
>>>>>>>
>>>>>>>> doesn't make sense to me that we now depend upon the
>>>>>>>> /sys/kernel/mm/transparent_hugepage/hugepage-XXkb/enabled values (which by
>>>>>>>> default disables all sizes except 2M, which is set to "inherit" from
>>>>>>>> /sys/kernel/mm/transparent_hugepage/enabled).
>>>>>>>>
>>>>>>>> The other problem is that shmem_enabled has a different set of options
>>>>>>>> (always/never/within_size/advise/deny/force) to enabled
>>>>>>>> (always/madvise/never)
>>>>>>>>
>>>>>>>> Perhaps it would be cleaner to do the same trick we did for enabled;
>>>>>>>> Introduce
>>>>>>>> /mm/transparent_hugepage/hugepage-XXkb/shmem_enabled, which can have all
>>>>>>>> the
>>>>>>>> same values as the top-level
>>>>>>>> /sys/kernel/mm/transparent_hugepage/shmem_enabled,
>>>>>>>> plus the additional "inherit" option. By default all sizes will be set to
>>>>>>>> "never" except 2M, which is set to "inherit".
>>>>>>>
>>>>>>> Sounds good to me. But I do not want to copy all same values from top-level
>>>>>>> '/sys/kernel/mm/transparent_hugepage/shmem_enabled':
>>>>>>> always within_size advise never deny force
>>>>>>>
>>>>>>> For mTHP's shmem_enabled interface, we can just keep below values:
>>>>>>> always within_size advise never
>>>>>>>
>>>>>>> Cause when checking if mTHP can be used for anon shmem, 'deny' is equal to
>>>>>>> 'never', and 'force' is equal to 'always'.
>>>>>>
>>>>>> I'll admit it wasn't completely clear to me after reading the docs, but my
>>>>>> rough
>>>>>> understanding is:
>>>>>>
>>>>>>     - /sys/kernel/mm/transparent_hugepage/shmem_enabled controls
>>>>>>       mmap(SHARED|ANON) allocations (mostly; see rule 3)
>>>>>>     - huge=... controls tmpfs allocations
>>>>>>     - deny and force in shmem_enabled are equivalent to never and always for
>>>>>>       mmap(SHARED|ANON) but additionally override all tmpfs mounts so they
>>>>>> act as
>>>>>>       if they were mounted with huge=never or huge=always
>>>>>>
>>>>>> Is that correct? If so, then I think it still makes sense to support per-size
>>>>>
>>>>> Correct.
>>>>>
>>>>>> deny/force. Certainly if a per-size control is set to "inherit" and the
>>>>>> top-level control is set to deny or force, you would need that to mean
>>>>>> something.
>>>>>
>>>>> IMHO, the '/mm/transparent_hugepage/hugepage-XXkb/shmem_enabled' interface
>>>>> should only control the anonymous shmem. And 'huge=' controls tmpfs
>>>>> allocation,
>>>>> so we should not use anonymous control to override tmpfs control, which
>>>>> seems a
>>>>> little mess?
>>>>
>>>> I agree it would be cleaner to only handle mmap(SHARED|ANON) here, and leave
>>>> the
>>>> tmpfs stuff for another time. But my point is that
>>>> /mm/transparent_hugepage/shmem_enabled already interferes with tmpfs if the
>>>> value is deny or force. So if you have:
>>>>
>>>> echo deny > /mm/transparent_hugepage/shmem_enabled
>>>
>>> IIUC, this global control will cause shmem_is_huge() to always return false, so
>>> no matter how '/mm/transparent_hugepage/hugepage-xxxkB/shmem_enabled' is set,
>>> anonymous shmem will not use mTHP. No?
>>
>> No, that's not how '/mm/transparent_hugepage/hugepage-xxxkB/enabled' works, and
>> I think '/mm/transparent_hugepage/hugepage-xxxkB/shmem_enabled' should follow
>> the established pattern.
>>
>> For anon-private, each size is controlled by its
>> /mm/transparent_hugepage/hugepage-xxxkB/enabled value. Unless that value is
>> "inherit", in which case the value in /mm/transparent_hugepage/enabled is used
>> for that size.
>>
>> That approach enables us to 1) maintain back-compat and 2) control each size
>> independently
>>
>> 1) is met because the default is that all sizes are initially set to "never",
>> except the PMD-size (e.g. /mm/transparent_hugepage/hugepage-2048kB/enabled)
>> which is initially set to inherit. So any mTHP unaware SW can still modify
>> /mm/transparent_hugepage/enabled and it will still only apply to PMD size.
>>
>> 2) is met because mTHP aware SW can come along and e.g. enable the 64K size
>> (echo always > /mm/transparent_hugepage/hugepage-64kB/enabled) without having to
>> modify the value in /mm/transparent_hugepage/enabled.
>
> Thanks for explanation. Initially, I want to make
> ‘/mm/transparent_hugepage/shmem_enabled’ be a global control for huge page, but
> I think it should follow the same strategy as anon mTHP as you said.
>
>>>> echo inherit > /mm/transparent_hugepage/hugepage-64kB/shmem_enabled
>>>>
>>>> What does that mean?
>>
>> So I think /mm/transparent_hugepage/hugepage-xxxkB/shmem_enabled will need to
>> support the deny and force values. When applied to non-PMD sizes, "deny" can
>> just be a noop for now, because there was no way to configure a tmpfs mount for
>> non-PMD size THP in the first place. But I'm not sure what to do with "force"?
>
> OK. And I also prefer that "force" should be a noop too, since anon shmem
> control should not configure tmpfs huge page allocation.

I guess technically they won't be noops, but (for the non-PMD-sizes) "force"
will be an alias for "always" and "deny" will be an alias for "never"?

I was just a bit concerned about later changing that behavior to also impact
tmpfs once tmpfs supports mTHP; could that cause breakage? But thinking about
it, I don't see that as a problem.
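The alias behaviour agreed here can be sketched as below. This is a
hypothetical model for discussion, not the actual shmem_is_huge() logic: for
non-PMD sizes, "force" acts as "always" and "deny" as "never", since no tmpfs
mount can currently be configured for non-PMD THP sizes:

```python
# Illustrative sketch: resolve a per-size shmem_enabled value. For
# non-PMD sizes, "force" and "deny" degenerate to "always" and "never"
# because they have no tmpfs mount to override.

def shmem_policy(value, is_pmd_size):
    if value == "force":
        return "force" if is_pmd_size else "always"
    if value == "deny":
        return "deny" if is_pmd_size else "never"
    return value

print(shmem_policy("force", is_pmd_size=False))  # always
print(shmem_policy("deny", is_pmd_size=False))   # never
print(shmem_policy("force", is_pmd_size=True))   # force (tmpfs override kept)
```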


2024-04-25 08:26:53

by David Hildenbrand

Subject: Re: [RFC PATCH 0/5] add mTHP support for anonymous share pages

On 25.04.24 10:17, Ryan Roberts wrote:
> On 25/04/2024 07:20, Baolin Wang wrote:
>>
>>
>> On 2024/4/24 22:20, Ryan Roberts wrote:
>>> On 24/04/2024 14:49, Baolin Wang wrote:
>>>>
>>>>
>>>> On 2024/4/24 18:01, Ryan Roberts wrote:
>>>>> On 24/04/2024 10:55, Baolin Wang wrote:
>>>>>>
>>>>>>
>>>>>> On 2024/4/24 16:26, Ryan Roberts wrote:
>>>>>>> On 24/04/2024 07:55, Baolin Wang wrote:
>>>>>>>>
>>>>>>>>
>>>>>>>> On 2024/4/23 18:41, Ryan Roberts wrote:
>>>>>>>>> On 22/04/2024 08:02, Baolin Wang wrote:
>>>>>>>>>> Anonymous pages have already been supported for multi-size (mTHP)
>>>>>>>>>> allocation
>>>>>>>>>> through commit 19eaf44954df, that can allow THP to be configured
>>>>>>>>>> through the
>>>>>>>>>> sysfs interface located at
>>>>>>>>>> '/sys/kernel/mm/transparent_hugepage/hugepage-XXkb/enabled'.
>>>>>>>>>>
>>>>>>>>>> However, the anonymous shared pages will ignore the anonymous mTHP rule
>>>>>>>>>> configured through the sysfs interface, and can only use the PMD-mapped
>>>>>>>>>> THP, that is not reasonable. Many implement anonymous page sharing through
>>>>>>>>>> mmap(MAP_SHARED | MAP_ANONYMOUS), especially in database usage scenarios,
>>>>>>>>>> therefore, users expect to apply an unified mTHP strategy for anonymous
>>>>>>>>>> pages,
>>>>>>>>>> also including the anonymous shared pages, in order to enjoy the
>>>>>>>>>> benefits of
>>>>>>>>>> mTHP. For example, lower latency than PMD-mapped THP, smaller memory bloat
>>>>>>>>>> than PMD-mapped THP, contiguous PTEs on ARM architecture to reduce TLB
>>>>>>>>>> miss
>>>>>>>>>> etc.
>>>>>>>>>
>>>>>>>>> This sounds like a very useful addition!
>>>>>>>>>
>>>>>>>>> Out of interest, can you point me at any workloads (and off-the-shelf
>>>>>>>>> benchmarks
>>>>>>>>> for those workloads) that predominantly use shared anon memory?
>>>>>>>>
>>>>>>>> As far as I know, some database related workloads make extensive use of
>>>>>>>> shared
>>>>>>>> anonymous page, such as PolarDB[1] in our Alibaba fleet, or MySQL likely
>>>>>>>> also
>>>>>>>> uses shared anonymous memory. And I still need to do some investigation to
>>>>>>>> measure the performance.
>>>>>>>>
>>>>>>>> [1] https://github.com/ApsaraDB/PolarDB-for-PostgreSQL
>>>>>>>
>>>>>>> Thanks for the pointer!
>>>>>>>
>>>>>>>>
>>>>>>>>>> The primary strategy is that, the use of huge pages for anonymous shared
>>>>>>>>>> pages
>>>>>>>>>> still follows the global control determined by the mount option "huge="
>>>>>>>>>> parameter
>>>>>>>>>> or the sysfs interface at
>>>>>>>>>> '/sys/kernel/mm/transparent_hugepage/shmem_enabled'.
>>>>>>>>>> The utilization of mTHP is allowed only when the global 'huge' switch is
>>>>>>>>>> enabled.
>>>>>>>>>> Subsequently, the mTHP sysfs interface
>>>>>>>>>> (/sys/kernel/mm/transparent_hugepage/hugepage-XXkb/enabled)
>>>>>>>>>> is checked to determine the mTHP size that can be used for large folio
>>>>>>>>>> allocation
>>>>>>>>>> for these anonymous shared pages.
>>>>>>>>>
>>>>>>>>> I'm not sure about this proposed control mechanism; won't it break
>>>>>>>>> compatibility? I could be wrong, but I don't think shmem's use of THP
>>>>>>>>> used to
>>>>>>>>> depend upon the value of /sys/kernel/mm/transparent_hugepage/enabled? So it
>>>>>>>>
>>>>>>>> Yes, I realized this after more testing.
>>>>>>>>
>>>>>>>>> doesn't make sense to me that we now depend upon the
>>>>>>>>> /sys/kernel/mm/transparent_hugepage/hugepage-XXkb/enabled values (which by
>>>>>>>>> default disables all sizes except 2M, which is set to "inherit" from
>>>>>>>>> /sys/kernel/mm/transparent_hugepage/enabled).
>>>>>>>>>
>>>>>>>>> The other problem is that shmem_enabled has a different set of options
>>>>>>>>> (always/never/within_size/advise/deny/force) to enabled
>>>>>>>>> (always/madvise/never)
>>>>>>>>>
>>>>>>>>> Perhaps it would be cleaner to do the same trick we did for enabled;
>>>>>>>>> Introduce
>>>>>>>>> /mm/transparent_hugepage/hugepage-XXkb/shmem_enabled, which can have all
>>>>>>>>> the
>>>>>>>>> same values as the top-level
>>>>>>>>> /sys/kernel/mm/transparent_hugepage/shmem_enabled,
>>>>>>>>> plus the additional "inherit" option. By default all sizes will be set to
>>>>>>>>> "never" except 2M, which is set to "inherit".
>>>>>>>>
>>>>>>>> Sounds good to me. But I do not want to copy all same values from top-level
>>>>>>>> '/sys/kernel/mm/transparent_hugepage/shmem_enabled':
>>>>>>>> always within_size advise never deny force
>>>>>>>>
>>>>>>>> For mTHP's shmem_enabled interface, we can just keep below values:
>>>>>>>> always within_size advise never
>>>>>>>>
>>>>>>>> Cause when checking if mTHP can be used for anon shmem, 'deny' is equal to
>>>>>>>> 'never', and 'force' is equal to 'always'.
>>>>>>>
>>>>>>> I'll admit it wasn't completely clear to me after reading the docs, but my
>>>>>>> rough
>>>>>>> understanding is:
>>>>>>>
>>>>>>>     - /sys/kernel/mm/transparent_hugepage/shmem_enabled controls
>>>>>>>       mmap(SHARED|ANON) allocations (mostly; see rule 3)
>>>>>>>     - huge=... controls tmpfs allocations
>>>>>>>     - deny and force in shmem_enabled are equivalent to never and always for
>>>>>>>       mmap(SHARED|ANON) but additionally override all tmpfs mounts so they
>>>>>>> act as
>>>>>>>       if they were mounted with huge=never or huge=always
>>>>>>>
>>>>>>> Is that correct? If so, then I think it still makes sense to support per-size
>>>>>>
>>>>>> Correct.
>>>>>>
>>>>>>> deny/force. Certainly if a per-size control is set to "inherit" and the
>>>>>>> top-level control is set to deny or force, you would need that to mean
>>>>>>> something.
>>>>>>
>>>>>> IMHO, the '/mm/transparent_hugepage/hugepage-XXkb/shmem_enabled' interface
>>>>>> should only control the anonymous shmem. And 'huge=' controls tmpfs
>>>>>> allocation,
>>>>>> so we should not use anonymous control to override tmpfs control, which
>>>>>> seems a
>>>>>> little mess?
>>>>>
>>>>> I agree it would be cleaner to only handle mmap(SHARED|ANON) here, and leave
>>>>> the
>>>>> tmpfs stuff for another time. But my point is that
>>>>> /mm/transparent_hugepage/shmem_enabled already interferes with tmpfs if the
>>>>> value is deny or force. So if you have:
>>>>>
>>>>> echo deny > /mm/transparent_hugepage/shmem_enabled
>>>>
>>>> IIUC, this global control will cause shmem_is_huge() to always return false, so
>>>> no matter how '/mm/transparent_hugepage/hugepage-xxxkB/shmem_enabled' is set,
>>>> anonymous shmem will not use mTHP. No?
>>>
>>> No, that's not how '/mm/transparent_hugepage/hugepage-xxxkB/enabled' works, and
>>> I think '/mm/transparent_hugepage/hugepage-xxxkB/shmem_enabled' should follow
>>> the established pattern.
>>>
>>> For anon-private, each size is controlled by its
>>> /mm/transparent_hugepage/hugepage-xxxkB/enabled value. Unless that value is
>>> "inherit", in which case the value in /mm/transparent_hugepage/enabled is used
>>> for that size.
>>>
>>> That approach enables us to 1) maintain back-compat and 2) control each size
>>> independently
>>>
>>> 1) is met because the default is that all sizes are initially set to "never",
>>> except the PMD-size (e.g. /mm/transparent_hugepage/hugepage-2048kB/enabled)
>>> which is initially set to inherit. So any mTHP unaware SW can still modify
>>> /mm/transparent_hugepage/enabled and it will still only apply to PMD size.
>>>
>>> 2) is met because mTHP aware SW can come along and e.g. enable the 64K size
>>> (echo always > /mm/transparent_hugepage/hugepage-64kB/enabled) without having to
>>> modify the value in /mm/transparent_hugepage/enabled.
>>
>> Thanks for explanation. Initially, I want to make
>> ‘/mm/transparent_hugepage/shmem_enabled’ be a global control for huge page, but
>> I think it should follow the same strategy as anon mTHP as you said.
>>
>>>>> echo inherit > /mm/transparent_hugepage/hugepage-64kB/shmem_enabled
>>>>>
>>>>> What does that mean?
>>>
>>> So I think /mm/transparent_hugepage/hugepage-xxxkB/shmem_enabled will need to
>>> support the deny and force values. When applied to non-PMD sizes, "deny" can
>>> just be a noop for now, because there was no way to configure a tmpfs mount for
>>> non-PMD size THP in the first place. But I'm not sure what to do with "force"?
>>
>> OK. And I also prefer that "force" should be a noop too, since anon shmem
>> control should not configure tmpfs huge page allocation.
>
> I guess technically they won't be noops, but (for the non-PMD-sizes) "force"
> will be an alias for "always" and "deny" will be an alias for "never"?
>
> I was just a bit concerned about later changing that behavior to also impact
> tmpfs once tmpfs supports mTHP; could that cause breaks? But thinking about it,
> I don't see that as a problem.

Is the question what should happen if we "inherit" "force", or if someone
specifies "force" for an mTHP size explicitly?

--
Cheers,

David / dhildenb


2024-04-25 08:52:57

by Ryan Roberts

Subject: Re: [RFC PATCH 0/5] add mTHP support for anonymous share pages

On 25/04/2024 09:26, David Hildenbrand wrote:
> On 25.04.24 10:17, Ryan Roberts wrote:
>> On 25/04/2024 07:20, Baolin Wang wrote:
>>>
>>>
>>> On 2024/4/24 22:20, Ryan Roberts wrote:
>>>> On 24/04/2024 14:49, Baolin Wang wrote:
>>>>>
>>>>>
>>>>> On 2024/4/24 18:01, Ryan Roberts wrote:
>>>>>> On 24/04/2024 10:55, Baolin Wang wrote:
>>>>>>>
>>>>>>>
>>>>>>> On 2024/4/24 16:26, Ryan Roberts wrote:
>>>>>>>> On 24/04/2024 07:55, Baolin Wang wrote:
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> On 2024/4/23 18:41, Ryan Roberts wrote:
>>>>>>>>>> On 22/04/2024 08:02, Baolin Wang wrote:
>>>>>>>>>>> Anonymous pages have already been supported for multi-size (mTHP)
>>>>>>>>>>> allocation
>>>>>>>>>>> through commit 19eaf44954df, that can allow THP to be configured
>>>>>>>>>>> through the
>>>>>>>>>>> sysfs interface located at
>>>>>>>>>>> '/sys/kernel/mm/transparent_hugepage/hugepage-XXkb/enabled'.
>>>>>>>>>>>
>>>>>>>>>>> However, the anonymous shared pages will ignore the anonymous mTHP rule
>>>>>>>>>>> configured through the sysfs interface, and can only use the PMD-mapped
>>>>>>>>>>> THP, that is not reasonable. Many implement anonymous page sharing
>>>>>>>>>>> through
>>>>>>>>>>> mmap(MAP_SHARED | MAP_ANONYMOUS), especially in database usage
>>>>>>>>>>> scenarios,
>>>>>>>>>>> therefore, users expect to apply an unified mTHP strategy for anonymous
>>>>>>>>>>> pages,
>>>>>>>>>>> also including the anonymous shared pages, in order to enjoy the
>>>>>>>>>>> benefits of
>>>>>>>>>>> mTHP. For example, lower latency than PMD-mapped THP, smaller memory
>>>>>>>>>>> bloat
>>>>>>>>>>> than PMD-mapped THP, contiguous PTEs on ARM architecture to reduce TLB
>>>>>>>>>>> miss
>>>>>>>>>>> etc.
>>>>>>>>>>
>>>>>>>>>> This sounds like a very useful addition!
>>>>>>>>>>
>>>>>>>>>> Out of interest, can you point me at any workloads (and off-the-shelf
>>>>>>>>>> benchmarks
>>>>>>>>>> for those workloads) that predominantly use shared anon memory?
>>>>>>>>>
>>>>>>>>> As far as I know, some database related workloads make extensive use of
>>>>>>>>> shared
>>>>>>>>> anonymous page, such as PolarDB[1] in our Alibaba fleet, or MySQL likely
>>>>>>>>> also
>>>>>>>>> uses shared anonymous memory. And I still need to do some investigation to
>>>>>>>>> measure the performance.
>>>>>>>>>
>>>>>>>>> [1] https://github.com/ApsaraDB/PolarDB-for-PostgreSQL
>>>>>>>>
>>>>>>>> Thanks for the pointer!
>>>>>>>>
>>>>>>>>>
>>>>>>>>>>> The primary strategy is that, the use of huge pages for anonymous shared
>>>>>>>>>>> pages
>>>>>>>>>>> still follows the global control determined by the mount option "huge="
>>>>>>>>>>> parameter
>>>>>>>>>>> or the sysfs interface at
>>>>>>>>>>> '/sys/kernel/mm/transparent_hugepage/shmem_enabled'.
>>>>>>>>>>> The utilization of mTHP is allowed only when the global 'huge' switch is
>>>>>>>>>>> enabled.
>>>>>>>>>>> Subsequently, the mTHP sysfs interface
>>>>>>>>>>> (/sys/kernel/mm/transparent_hugepage/hugepage-XXkb/enabled)
>>>>>>>>>>> is checked to determine the mTHP size that can be used for large folio
>>>>>>>>>>> allocation
>>>>>>>>>>> for these anonymous shared pages.
>>>>>>>>>>
>>>>>>>>>> I'm not sure about this proposed control mechanism; won't it break
>>>>>>>>>> compatibility? I could be wrong, but I don't think shmem's use of THP
>>>>>>>>>> used to
>>>>>>>>>> depend upon the value of /sys/kernel/mm/transparent_hugepage/enabled?
>>>>>>>>>> So it
>>>>>>>>>
>>>>>>>>> Yes, I realized this after more testing.
>>>>>>>>>
>>>>>>>>>> doesn't make sense to me that we now depend upon the
>>>>>>>>>> /sys/kernel/mm/transparent_hugepage/hugepage-XXkb/enabled values
>>>>>>>>>> (which by
>>>>>>>>>> default disables all sizes except 2M, which is set to "inherit" from
>>>>>>>>>> /sys/kernel/mm/transparent_hugepage/enabled).
>>>>>>>>>>
>>>>>>>>>> The other problem is that shmem_enabled has a different set of options
>>>>>>>>>> (always/never/within_size/advise/deny/force) to enabled
>>>>>>>>>> (always/madvise/never)
>>>>>>>>>>
>>>>>>>>>> Perhaps it would be cleaner to do the same trick we did for enabled;
>>>>>>>>>> Introduce
>>>>>>>>>> /mm/transparent_hugepage/hugepage-XXkb/shmem_enabled, which can have all
>>>>>>>>>> the
>>>>>>>>>> same values as the top-level
>>>>>>>>>> /sys/kernel/mm/transparent_hugepage/shmem_enabled,
>>>>>>>>>> plus the additional "inherit" option. By default all sizes will be set to
>>>>>>>>>> "never" except 2M, which is set to "inherit".
>>>>>>>>>
>>>>>>>>> Sounds good to me. But I do not want to copy all same values from
>>>>>>>>> top-level
>>>>>>>>> '/sys/kernel/mm/transparent_hugepage/shmem_enabled':
>>>>>>>>> always within_size advise never deny force
>>>>>>>>>
>>>>>>>>> For mTHP's shmem_enabled interface, we can just keep below values:
>>>>>>>>> always within_size advise never
>>>>>>>>>
>>>>>>>>> Cause when checking if mTHP can be used for anon shmem, 'deny' is equal to
>>>>>>>>> 'never', and 'force' is equal to 'always'.
>>>>>>>>
>>>>>>>> I'll admit it wasn't completely clear to me after reading the docs, but my
>>>>>>>> rough
>>>>>>>> understanding is:
>>>>>>>>
>>>>>>>>      - /sys/kernel/mm/transparent_hugepage/shmem_enabled controls
>>>>>>>>        mmap(SHARED|ANON) allocations (mostly; see rule 3)
>>>>>>>>      - huge=... controls tmpfs allocations
>>>>>>>>      - deny and force in shmem_enabled are equivalent to never and
>>>>>>>> always for
>>>>>>>>        mmap(SHARED|ANON) but additionally override all tmpfs mounts so they
>>>>>>>> act as
>>>>>>>>        if they were mounted with huge=never or huge=always
>>>>>>>>
>>>>>>>> Is that correct? If so, then I think it still makes sense to support
>>>>>>>> per-size
>>>>>>>
>>>>>>> Correct.
>>>>>>>
>>>>>>>> deny/force. Certainly if a per-size control is set to "inherit" and the
>>>>>>>> top-level control is set to deny or force, you would need that to mean
>>>>>>>> something.
>>>>>>>
>>>>>>> IMHO, the '/mm/transparent_hugepage/hugepage-XXkb/shmem_enabled' interface
>>>>>>> should only control the anonymous shmem. And 'huge=' controls tmpfs
>>>>>>> allocation,
>>>>>>> so we should not use anonymous control to override tmpfs control, which
>>>>>>> seems a
>>>>>>> little mess?
>>>>>>
>>>>>> I agree it would be cleaner to only handle mmap(SHARED|ANON) here, and leave
>>>>>> the
>>>>>> tmpfs stuff for another time. But my point is that
>>>>>> /mm/transparent_hugepage/shmem_enabled already interferes with tmpfs if the
>>>>>> value is deny or force. So if you have:
>>>>>>
>>>>>> echo deny > /mm/transparent_hugepage/shmem_enabled
>>>>>
>>>>> IIUC, this global control will cause shmem_is_huge() to always return
>>>>> false, so
>>>>> no matter how '/mm/transparent_hugepage/hugepage-xxxkB/shmem_enabled' is set,
>>>>> anonymous shmem will not use mTHP. No?
>>>>
>>>> No, that's not how '/mm/transparent_hugepage/hugepage-xxxkB/enabled' works, and
>>>> I think '/mm/transparent_hugepage/hugepage-xxxkB/shmem_enabled' should follow
>>>> the established pattern.
>>>>
>>>> For anon-private, each size is controlled by its
>>>> /mm/transparent_hugepage/hugepage-xxxkB/enabled value. Unless that value is
>>>> "inherit", in which case the value in /mm/transparent_hugepage/enabled is used
>>>> for that size.
>>>>
>>>> That approach enables us to 1) maintain back-compat and 2) control each size
>>>> independently
>>>>
>>>> 1) is met because the default is that all sizes are initially set to "never",
>>>> except the PMD-size (e.g. /mm/transparent_hugepage/hugepage-2048kB/enabled)
>>>> which is initially set to inherit. So any mTHP unaware SW can still modify
>>>> /mm/transparent_hugepage/enabled and it will still only apply to PMD size.
>>>>
>>>> 2) is met because mTHP aware SW can come along and e.g. enable the 64K size
>>>> (echo always > /mm/transparent_hugepage/hugepage-64kB/enabled) without
>>>> having to
>>>> modify the value in /mm/transparent_hugepage/enabled.
>>>
>>> Thanks for explanation. Initially, I want to make
>>> ‘/mm/transparent_hugepage/shmem_enabled’ be a global control for huge page, but
>>> I think it should follow the same strategy as anon mTHP as you said.
>>>
>>>>>> echo inherit > /mm/transparent_hugepage/hugepage-64kB/shmem_enabled
>>>>>>
>>>>>> What does that mean?
>>>>
>>>> So I think /mm/transparent_hugepage/hugepage-xxxkB/shmem_enabled will need to
>>>> support the deny and force values. When applied to non-PMD sizes, "deny" can
>>>> just be a noop for now, because there was no way to configure a tmpfs mount for
>>>> non-PMD size THP in the first place. But I'm not sure what to do with "force"?
>>>
>>> OK. And I also prefer that "force" should be a noop too, since anon shmem
>>> control should not configure tmpfs huge page allocation.
>>
>> I guess technically they won't be noops, but (for the non-PMD-sizes) "force"
>> will be an alias for "always" and "deny" will be an alias for "never"?
>>
>> I was just a bit concerned about later changing that behavior to also impact
>> tmpfs once tmpfs supports mTHP; could that cause breaks? But thinking about it,
>> I don't see that as a problem.
>
> Is the question what should happen if we "inherit" "force" or if someone
> specifies "force" for an mTHP size explicitly?

Well I think it amounts to the same thing; there isn't much point in forbidding
"force" to be set directly because it can still be set indirectly through
"inherit". We can't forbid indirectly setting it, because "inherit" could be set
first, then the top-level shmem_enabled changed to "force" after - and we
wouldn't want to fail that.

So I think the question is just 'what should happen when "force" is configured
for a non-PMD-sized mTHP'?

I think the answer is 'for now, it behaves like "always", but in future it will
also force any tmpfs mounts to consider that mTHP size even if that size wasn't
specified by mount='
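The inherit/alias resolution discussed above can be sketched as a small model (a hypothetical sketch with made-up helper names; the per-size shmem_enabled files are only a proposal at this point, not an existing kernel interface):

```python
# Model of the proposed per-size shmem_enabled resolution (assumed semantics).
# Per the discussion: a per-size value of "inherit" takes the top-level value;
# for non-PMD sizes, "force" behaves like "always" and "deny" like "never"
# for now, since tmpfs cannot yet use non-PMD THP sizes.

PMD_SIZE_KB = 2048

def effective_shmem_policy(size_kb, per_size, top_level):
    """Resolve the effective policy for one mTHP size."""
    value = top_level if per_size == "inherit" else per_size
    if size_kb != PMD_SIZE_KB:
        # Non-PMD sizes: force/deny fall back to their anon-shmem aliases.
        value = {"force": "always", "deny": "never"}.get(value, value)
    return value

# Defaults: PMD-size is "inherit", all other sizes are "never".
assert effective_shmem_policy(2048, "inherit", "always") == "always"
assert effective_shmem_policy(64, "never", "always") == "never"
# For 64K, an inherited "force" acts as "always" for now.
assert effective_shmem_policy(64, "inherit", "force") == "always"
```

This is only a model of the behaviour being debated; the actual sysfs plumbing would live in the kernel's per-size transparent_hugepage attribute handlers.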



2024-04-25 08:59:53

by David Hildenbrand

[permalink] [raw]
Subject: Re: [RFC PATCH 0/5] add mTHP support for anonymous share pages

On 25.04.24 10:46, Ryan Roberts wrote:
> On 25/04/2024 09:26, David Hildenbrand wrote:
>> On 25.04.24 10:17, Ryan Roberts wrote:
>>> [...]
>>> I guess technically they won't be noops, but (for the non-PMD-sizes) "force"
>>> will be an alias for "always" and "deny" will be an alias for "never"?
>>>
>>> I was just a bit concerned about later changing that behavior to also impact
>>> tmpfs once tmpfs supports mTHP; could that cause breaks? But thinking about it,
>>> I don't see that as a problem.
>>
>> Is the question what should happen if we "inherit" "force" or if someone
>> specifies "force" for an mTHP size explicitly?
>
> Well I think it amounts to the same thing; there isn't much point in forbidding
> "force" to be set directly because it can still be set indirectly through
> "inherit". We can't forbid indirectly setting it, because "inherit" could be set
> first, then the top-level shmem_enabled changed to "force" after - and we
> wouldn't want to fail that.

The default for PMD should be "inherit", for the other mTHP sizes it
should be "never".

So we should fail if:
* Setting top-level to "force" when any non-PMD size is "inherit"
* Setting "inherit" of a non-PMD size when the top-level is force

Both will only happen if someone messes with the mTHP configuration
manually.

And we should only offer "force" as an option for PMD-sized mTHP as long
as the others are not supported. See below.

>
> So I think the question is just 'what should happen when "force" is configured
> for a non-PMD-sized mTHP'?

We should hide it and not offer a configuration toggle that is inactive.

If someone wants to sense support for other mTHP "force" settings in the
future, they can just parse if the "shmem_enabled" toggle offers "force"
as an option. Then they know that it can actually be enabled and will
also do what is promised.

--
Cheers,

David / dhildenb
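The two failure rules David proposes above can be expressed as a small validation sketch (hypothetical function names; this models the proposal's write-time checks, not existing kernel code):

```python
# Sketch of the proposed sysfs write-time validation (assumed semantics):
# forbid combinations where a non-PMD size would "inherit" a top-level "force".

PMD_SIZE_KB = 2048

def may_set_top_level(new_value, per_size_values):
    """Reject "force" at top level while any non-PMD size is "inherit"."""
    if new_value == "force":
        return not any(v == "inherit" for size, v in per_size_values.items()
                       if size != PMD_SIZE_KB)
    return True

def may_set_per_size(size_kb, new_value, top_level):
    """Reject "inherit" on a non-PMD size while the top level is "force"."""
    if new_value == "inherit" and size_kb != PMD_SIZE_KB:
        return top_level != "force"
    return True

# The defaults (PMD inherits, others "never") stay valid under "force":
assert may_set_top_level("force", {2048: "inherit", 64: "never"})
# But a 64K "inherit" blocks it, and vice versa; PMD "inherit" is always fine:
assert not may_set_top_level("force", {2048: "inherit", 64: "inherit"})
assert not may_set_per_size(64, "inherit", "force")
assert may_set_per_size(2048, "inherit", "force")
```

As the thread notes, both rejected cases can only arise if someone has already changed the mTHP configuration by hand, so failing the write is cheap.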


2024-04-25 09:14:34

by Baolin Wang

[permalink] [raw]
Subject: Re: [RFC PATCH 0/5] add mTHP support for anonymous share pages



On 2024/4/25 16:57, David Hildenbrand wrote:
> On 25.04.24 10:46, Ryan Roberts wrote:
>> [...]
>
> The default for PMD should be "inherit", for the other mTHP sizes it
> should be "never".
>
> So we should fail if:
> * Setting top-level to "force" when any non-PMD size is "inherit"
> * Setting "inherit" of a non-PMD size when the top-level is force

IMO, for tmpfs this is true, but for anon shmem, these 2 cases should not
fail.

So I think we should allow this configuration, but for tmpfs huge page
allocation, we will not check the mTHP controls.

> Both will only happen if someone messes with the mTHP configuration
> manually.
>
> And we should only offer "force" as an option for PMD-sized mTHP as long
> as the others are not supported. See below.
>
>>
>> So I think the question is just 'what should happen when "force" is
>> configured
>> for a non-PMD-sized mTHP'?
>
> We should hide it and not offer a configuration toggle that is inactive.
>
> If someone wants to sense support for other mTHP "force" settings in the
> future, they can just parse if the "shmem_enabled" toggle offers "force"
> as an option. Then they know that it can actually be enabled and will
> also do what is promised.

Sounds good to me.

2024-04-25 09:37:12

by David Hildenbrand

[permalink] [raw]
Subject: Re: [RFC PATCH 0/5] add mTHP support for anonymous share pages

On 25.04.24 11:05, Baolin Wang wrote:
>
>
> On 2024/4/25 16:57, David Hildenbrand wrote:
>> On 25.04.24 10:46, Ryan Roberts wrote:
>>> On 25/04/2024 09:26, David Hildenbrand wrote:
>>>> On 25.04.24 10:17, Ryan Roberts wrote:
>>>>> On 25/04/2024 07:20, Baolin Wang wrote:
>>>>>>
>>>>>>
>>>>>> On 2024/4/24 22:20, Ryan Roberts wrote:
>>>>>>> On 24/04/2024 14:49, Baolin Wang wrote:
>>>>>>>>
>>>>>>>>
>>>>>>>> On 2024/4/24 18:01, Ryan Roberts wrote:
>>>>>>>>> On 24/04/2024 10:55, Baolin Wang wrote:
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> On 2024/4/24 16:26, Ryan Roberts wrote:
>>>>>>>>>>> On 24/04/2024 07:55, Baolin Wang wrote:
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>> On 2024/4/23 18:41, Ryan Roberts wrote:
>>>>>>>>>>>>> On 22/04/2024 08:02, Baolin Wang wrote:
>>>>>>>>>>>>>> Anonymous pages have already been supported for multi-size
>>>>>>>>>>>>>> (mTHP)
>>>>>>>>>>>>>> allocation
>>>>>>>>>>>>>> through commit 19eaf44954df, that can allow THP to be
>>>>>>>>>>>>>> configured
>>>>>>>>>>>>>> through the
>>>>>>>>>>>>>> sysfs interface located at
>>>>>>>>>>>>>> '/sys/kernel/mm/transparent_hugepage/hugepage-XXkb/enabled'.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> However, the anonymous shared pages will ignore the
>>>>>>>>>>>>>> anonymous mTHP rule
>>>>>>>>>>>>>> configured through the sysfs interface, and can only use
>>>>>>>>>>>>>> the PMD-mapped
>>>>>>>>>>>>>> THP, that is not reasonable. Many implement anonymous page
>>>>>>>>>>>>>> sharing
>>>>>>>>>>>>>> through
>>>>>>>>>>>>>> mmap(MAP_SHARED | MAP_ANONYMOUS), especially in database usage
>>>>>>>>>>>>>> scenarios,
>>>>>>>>>>>>>> therefore, users expect to apply an unified mTHP strategy
>>>>>>>>>>>>>> for anonymous
>>>>>>>>>>>>>> pages,
>>>>>>>>>>>>>> also including the anonymous shared pages, in order to
>>>>>>>>>>>>>> enjoy the
>>>>>>>>>>>>>> benefits of
>>>>>>>>>>>>>> mTHP. For example, lower latency than PMD-mapped THP,
>>>>>>>>>>>>>> smaller memory
>>>>>>>>>>>>>> bloat
>>>>>>>>>>>>>> than PMD-mapped THP, contiguous PTEs on ARM architecture to
>>>>>>>>>>>>>> reduce TLB
>>>>>>>>>>>>>> miss
>>>>>>>>>>>>>> etc.
>>>>>>>>>>>>>
>>>>>>>>>>>>> This sounds like a very useful addition!
>>>>>>>>>>>>>
>>>>>>>>>>>>> Out of interest, can you point me at any workloads (and
>>>>>>>>>>>>> off-the-shelf
>>>>>>>>>>>>> benchmarks
>>>>>>>>>>>>> for those workloads) that predominantly use shared anon memory?
>>>>>>>>>>>>
>>>>>>>>>>>> As far as I know, some database related workloads make
>>>>>>>>>>>> extensive use of
>>>>>>>>>>>> shared
>>>>>>>>>>>> anonymous page, such as PolarDB[1] in our Alibaba fleet, or
>>>>>>>>>>>> MySQL likely
>>>>>>>>>>>> also
>>>>>>>>>>>> uses shared anonymous memory. And I still need to do some
>>>>>>>>>>>> investigation to
>>>>>>>>>>>> measure the performance.
>>>>>>>>>>>>
>>>>>>>>>>>> [1] https://github.com/ApsaraDB/PolarDB-for-PostgreSQL
>>>>>>>>>>>
>>>>>>>>>>> Thanks for the pointer!
>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>>>> The primary strategy is that, the use of huge pages for
>>>>>>>>>>>>>> anonymous shared
>>>>>>>>>>>>>> pages
>>>>>>>>>>>>>> still follows the global control determined by the mount
>>>>>>>>>>>>>> option "huge="
>>>>>>>>>>>>>> parameter
>>>>>>>>>>>>>> or the sysfs interface at
>>>>>>>>>>>>>> '/sys/kernel/mm/transparent_hugepage/shmem_enabled'.
>>>>>>>>>>>>>> The utilization of mTHP is allowed only when the global
>>>>>>>>>>>>>> 'huge' switch is
>>>>>>>>>>>>>> enabled.
>>>>>>>>>>>>>> Subsequently, the mTHP sysfs interface
>>>>>>>>>>>>>> (/sys/kernel/mm/transparent_hugepage/hugepage-XXkb/enabled)
>>>>>>>>>>>>>> is checked to determine the mTHP size that can be used for
>>>>>>>>>>>>>> large folio
>>>>>>>>>>>>>> allocation
>>>>>>>>>>>>>> for these anonymous shared pages.
>>>>>>>>>>>>>
>>>>>>>>>>>>> I'm not sure about this proposed control mechanism; won't it
>>>>>>>>>>>>> break
>>>>>>>>>>>>> compatibility? I could be wrong, but I don't think shmem's
>>>>>>>>>>>>> use of THP
>>>>>>>>>>>>> used to
>>>>>>>>>>>>> depend upon the value of
>>>>>>>>>>>>> /sys/kernel/mm/transparent_hugepage/enabled?
>>>>>>>>>>>>> So it
>>>>>>>>>>>>
>>>>>>>>>>>> Yes, I realized this after more testing.
>>>>>>>>>>>>
>>>>>>>>>>>>> doesn't make sense to me that we now depend upon the
>>>>>>>>>>>>> /sys/kernel/mm/transparent_hugepage/hugepage-XXkb/enabled
>>>>>>>>>>>>> values
>>>>>>>>>>>>> (which by
>>>>>>>>>>>>> default disables all sizes except 2M, which is set to
>>>>>>>>>>>>> "inherit" from
>>>>>>>>>>>>> /sys/kernel/mm/transparent_hugepage/enabled).
>>>>>>>>>>>>>
>>>>>>>>>>>>> The other problem is that shmem_enabled has a different set
>>>>>>>>>>>>> of options
>>>>>>>>>>>>> (always/never/within_size/advise/deny/force) to enabled
>>>>>>>>>>>>> (always/madvise/never)
>>>>>>>>>>>>>
>>>>>>>>>>>>> Perhaps it would be cleaner to do the same trick we did for
>>>>>>>>>>>>> enabled;
>>>>>>>>>>>>> Introduce
>>>>>>>>>>>>> /mm/transparent_hugepage/hugepage-XXkb/shmem_enabled, which
>>>>>>>>>>>>> can have all
>>>>>>>>>>>>> the
>>>>>>>>>>>>> same values as the top-level
>>>>>>>>>>>>> /sys/kernel/mm/transparent_hugepage/shmem_enabled,
>>>>>>>>>>>>> plus the additional "inherit" option. By default all sizes
>>>>>>>>>>>>> will be set to
>>>>>>>>>>>>> "never" except 2M, which is set to "inherit".
>>>>>>>>>>>>
>>>>>>>>>>>> Sounds good to me. But I do not want to copy all same values
>>>>>>>>>>>> from
>>>>>>>>>>>> top-level
>>>>>>>>>>>> '/sys/kernel/mm/transparent_hugepage/shmem_enabled':
>>>>>>>>>>>> always within_size advise never deny force
>>>>>>>>>>>>
>>>>>>>>>>>> For mTHP's shmem_enabled interface, we can just keep below
>>>>>>>>>>>> values:
>>>>>>>>>>>> always within_size advise never
>>>>>>>>>>>>
>>>>>>>>>>>> Cause when checking if mTHP can be used for anon shmem,
>>>>>>>>>>>> 'deny' is equal to
>>>>>>>>>>>> 'never', and 'force' is equal to 'always'.
>>>>>>>>>>>
>>>>>>>>>>> I'll admit it wasn't completely clear to me after reading the
>>>>>>>>>>> docs, but my
>>>>>>>>>>> rough
>>>>>>>>>>> understanding is:
>>>>>>>>>>>
>>>>>>>>>>>       - /sys/kernel/mm/transparent_hugepage/shmem_enabled
>>>>>>>>>>> controls
>>>>>>>>>>>         mmap(SHARED|ANON) allocations (mostly; see rule 3)
>>>>>>>>>>>       - huge=... controls tmpfs allocations
>>>>>>>>>>>       - deny and force in shmem_enabled are equivalent to
>>>>>>>>>>> never and
>>>>>>>>>>> always for
>>>>>>>>>>>         mmap(SHARED|ANON) but additionally override all tmpfs
>>>>>>>>>>> mounts so they
>>>>>>>>>>> act as
>>>>>>>>>>>         if they were mounted with huge=never or huge=always
>>>>>>>>>>>
>>>>>>>>>>> Is that correct? If so, then I think it still makes sense to
>>>>>>>>>>> support
>>>>>>>>>>> per-size
>>>>>>>>>>
>>>>>>>>>> Correct.
>>>>>>>>>>
>>>>>>>>>>> deny/force. Certainly if a per-size control is set to
>>>>>>>>>>> "inherit" and the
>>>>>>>>>>> top-level control is set to deny or force, you would need that
>>>>>>>>>>> to mean
>>>>>>>>>>> something.
>>>>>>>>>>
>>>>>>>>>> IMHO, the
>>>>>>>>>> '/mm/transparent_hugepage/hugepage-XXkb/shmem_enabled' interface
>>>>>>>>>> should only control the anonymous shmem. And 'huge=' controls
>>>>>>>>>> tmpfs
>>>>>>>>>> allocation,
>>>>>>>>>> so we should not use anonymous control to override tmpfs
>>>>>>>>>> control, which
>>>>>>>>>> seems a
>>>>>>>>>> little mess?
>>>>>>>>>
>>>>>>>>> I agree it would be cleaner to only handle mmap(SHARED|ANON)
>>>>>>>>> here, and leave
>>>>>>>>> the
>>>>>>>>> tmpfs stuff for another time. But my point is that
>>>>>>>>> /mm/transparent_hugepage/shmem_enabled already interferes with
>>>>>>>>> tmpfs if the
>>>>>>>>> value is deny or force. So if you have:
>>>>>>>>>
>>>>>>>>> echo deny > /mm/transparent_hugepage/shmem_enabled
>>>>>>>>
>>>>>>>> IIUC, this global control will cause shmem_is_huge() to always
>>>>>>>> return
>>>>>>>> false, so
>>>>>>>> no matter how
>>>>>>>> '/mm/transparent_hugepage/hugepage-xxxkB/shmem_enabled' is set,
>>>>>>>> anonymous shmem will not use mTHP. No?
>>>>>>>
>>>>>>> No, that's not how
>>>>>>> '/mm/transparent_hugepage/hugepage-xxxkB/enabled' works, and
>>>>>>> I think '/mm/transparent_hugepage/hugepage-xxxkB/shmem_enabled'
>>>>>>> should follow
>>>>>>> the established pattern.
>>>>>>>
>>>>>>> For anon-private, each size is controlled by its
>>>>>>> /mm/transparent_hugepage/hugepage-xxxkB/enabled value. Unless that
>>>>>>> value is
>>>>>>> "inherit", in which case the value in
>>>>>>> /mm/transparent_hugepage/enabled is used
>>>>>>> for that size.
>>>>>>>
>>>>>>> That approach enables us to 1) maintain back-compat and 2) control
>>>>>>> each size
>>>>>>> independently
>>>>>>>
>>>>>>> 1) is met because the default is that all sizes are initially set
>>>>>>> to "never",
>>>>>>> except the PMD-size (e.g.
>>>>>>> /mm/transparent_hugepage/hugepage-2048kB/enabled)
>>>>>>> which is initially set to inherit. So any mTHP unaware SW can
>>>>>>> still modify
>>>>>>> /mm/transparent_hugepage/enabled and it will still only apply to
>>>>>>> PMD size.
>>>>>>>
>>>>>>> 2) is met because mTHP aware SW can come along and e.g. enable the
>>>>>>> 64K size
>>>>>>> (echo always > /mm/transparent_hugepage/hugepage-64kB/enabled)
>>>>>>> without
>>>>>>> having to
>>>>>>> modify the value in /mm/transparent_hugepage/enabled.
>>>>>>
>>>>>> Thanks for explanation. Initially, I want to make
>>>>>> ‘/mm/transparent_hugepage/shmem_enabled’ be a global control for
>>>>>> huge page, but
>>>>>> I think it should follow the same strategy as anon mTHP as you said.
>>>>>>
>>>>>>>>> echo inherit > /mm/transparent_hugepage/hugepage-64kB/shmem_enabled
>>>>>>>>>
>>>>>>>>> What does that mean?
>>>>>>>
>>>>>>> So I think /mm/transparent_hugepage/hugepage-xxxkB/shmem_enabled
>>>>>>> will need to
>>>>>>> support the deny and force values. When applied to non-PMD sizes,
>>>>>>> "deny" can
>>>>>>> just be a noop for now, because there was no way to configure a
>>>>>>> tmpfs mount for
>>>>>>> non-PMD size THP in the first place. But I'm not sure what to do
>>>>>>> with "force"?
>>>>>>
>>>>>> OK. And I also prefer that "force" should be a noop too, since anon
>>>>>> shmem
>>>>>> control should not configure tmpfs huge page allocation.
>>>>>
>>>>> I guess technically they won't be noops, but (for the non-PMD-sizes)
>>>>> "force"
>>>>> will be an alias for "always" and "deny" will be an alias for "never"?
>>>>>
>>>>> I was just a bit concerned about later changing that behavior to
>>>>> also impact
>>>>> tmpfs once tmpfs supports mTHP; could that cause breaks? But
>>>>> thinking about it,
>>>>> I don't see that as a problem.
>>>>
>>>> Is the question what should happen if we "inherit" "force" or if someone
>>>> specifies "force" for a mTP size explicitly?
>>>
>>> Well I think it amounts to the same thing; there isn't much point in
>>> forbidding
>>> "force" to be set directly because it can still be set indirectly through
>>> "inherit". We can't forbid indirectly setting it, because "inherit"
>>> could be set
>>> first, then the top-level shmem_enabled changed to "force" after - and we
>>> wouldn't want to fail that.
>>
>> The default for PMD should be "inherit", for the other mTHP sizes it
>> should be "never".
>>
>> So we should fail if:
>> * Setting top-level to "force" when any non-PMD size is "inherit"
>> * Setting "inherit" of a non-PMD size when the top-level is force
>
> IMO, for tmpfs this is true, but for anon shmem, this 2 cases should not
> fail.

If force does not apply to an mTHP size, it should fail if it would get
inherited. Until it applies and we enable it.

I'm still confused about all the toggles here, so could be I am missing
something.
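The two failure rules under discussion could be sketched as below, modeled with plain shell variables standing in for the real sysfs writes (all names here are illustrative, not kernel interfaces).

```shell
#!/bin/sh
# top: top-level shmem_enabled value; size_vals: per-size values of the
# non-PMD mTHP sizes. EINVAL stands in for a failed sysfs write.

# Rule 1: writing "force" to the top level fails if any non-PMD size
# is set to "inherit" (it would inherit a value that does not apply).
set_top() {
    if [ "$1" = "force" ]; then
        for v in $size_vals; do
            [ "$v" = "inherit" ] && { echo EINVAL; return 1; }
        done
    fi
    top="$1"; echo ok
}

# Rule 2: writing "inherit" to a non-PMD size fails while the top
# level is "force".
set_size() {
    if [ "$1" = "inherit" ] && [ "$top" = "force" ]; then
        echo EINVAL; return 1
    fi
    size_vals="$size_vals $1"; echo ok
}

top=never; size_vals="never never"
set_top force        # prints "ok": no non-PMD size is inheriting
set_size inherit     # prints "EINVAL": top level is now "force"
set_size never       # prints "ok": an explicit value is always allowed
```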

--
Cheers,

David / dhildenb


2024-04-25 09:51:02

by Baolin Wang

[permalink] [raw]
Subject: Re: [RFC PATCH 0/5] add mTHP support for anonymous share pages



On 2024/4/25 17:20, David Hildenbrand wrote:
> On 25.04.24 11:05, Baolin Wang wrote:
>>
>>
>> On 2024/4/25 16:57, David Hildenbrand wrote:
>>> On 25.04.24 10:46, Ryan Roberts wrote:
>>>> On 25/04/2024 09:26, David Hildenbrand wrote:
>>>>> On 25.04.24 10:17, Ryan Roberts wrote:
>>>>>> On 25/04/2024 07:20, Baolin Wang wrote:
>>>>>>>
>>>>>>>
>>>>>>> On 2024/4/24 22:20, Ryan Roberts wrote:
>>>>>>>> On 24/04/2024 14:49, Baolin Wang wrote:
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> On 2024/4/24 18:01, Ryan Roberts wrote:
>>>>>>>>>> On 24/04/2024 10:55, Baolin Wang wrote:
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>> On 2024/4/24 16:26, Ryan Roberts wrote:
>>>>>>>>>>>> On 24/04/2024 07:55, Baolin Wang wrote:
>>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>>> On 2024/4/23 18:41, Ryan Roberts wrote:
>>>>>>>>>>>>>> On 22/04/2024 08:02, Baolin Wang wrote:
>>>>>>>>>>>>>>> Anonymous pages have already been supported for multi-size
>>>>>>>>>>>>>>> (mTHP)
>>>>>>>>>>>>>>> allocation
>>>>>>>>>>>>>>> through commit 19eaf44954df, that can allow THP to be
>>>>>>>>>>>>>>> configured
>>>>>>>>>>>>>>> through the
>>>>>>>>>>>>>>> sysfs interface located at
>>>>>>>>>>>>>>> '/sys/kernel/mm/transparent_hugepage/hugepage-XXkb/enabled'.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> However, the anonymous shared pages will ignore the
>>>>>>>>>>>>>>> anonymous mTHP rule
>>>>>>>>>>>>>>> configured through the sysfs interface, and can only use
>>>>>>>>>>>>>>> the PMD-mapped
>>>>>>>>>>>>>>> THP, that is not reasonable. Many implement anonymous page
>>>>>>>>>>>>>>> sharing
>>>>>>>>>>>>>>> through
>>>>>>>>>>>>>>> mmap(MAP_SHARED | MAP_ANONYMOUS), especially in database
>>>>>>>>>>>>>>> usage
>>>>>>>>>>>>>>> scenarios,
>>>>>>>>>>>>>>> therefore, users expect to apply an unified mTHP strategy
>>>>>>>>>>>>>>> for anonymous
>>>>>>>>>>>>>>> pages,
>>>>>>>>>>>>>>> also including the anonymous shared pages, in order to
>>>>>>>>>>>>>>> enjoy the
>>>>>>>>>>>>>>> benefits of
>>>>>>>>>>>>>>> mTHP. For example, lower latency than PMD-mapped THP,
>>>>>>>>>>>>>>> smaller memory
>>>>>>>>>>>>>>> bloat
>>>>>>>>>>>>>>> than PMD-mapped THP, contiguous PTEs on ARM architecture to
>>>>>>>>>>>>>>> reduce TLB
>>>>>>>>>>>>>>> miss
>>>>>>>>>>>>>>> etc.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> This sounds like a very useful addition!
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Out of interest, can you point me at any workloads (and
>>>>>>>>>>>>>> off-the-shelf
>>>>>>>>>>>>>> benchmarks
>>>>>>>>>>>>>> for those workloads) that predominantly use shared anon
>>>>>>>>>>>>>> memory?
>>>>>>>>>>>>>
>>>>>>>>>>>>> As far as I know, some database related workloads make
>>>>>>>>>>>>> extensive use of
>>>>>>>>>>>>> shared
>>>>>>>>>>>>> anonymous page, such as PolarDB[1] in our Alibaba fleet, or
>>>>>>>>>>>>> MySQL likely
>>>>>>>>>>>>> also
>>>>>>>>>>>>> uses shared anonymous memory. And I still need to do some
>>>>>>>>>>>>> investigation to
>>>>>>>>>>>>> measure the performance.
>>>>>>>>>>>>>
>>>>>>>>>>>>> [1] https://github.com/ApsaraDB/PolarDB-for-PostgreSQL
>>>>>>>>>>>>
>>>>>>>>>>>> Thanks for the pointer!
>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>>>>> The primary strategy is that, the use of huge pages for
>>>>>>>>>>>>>>> anonymous shared
>>>>>>>>>>>>>>> pages
>>>>>>>>>>>>>>> still follows the global control determined by the mount
>>>>>>>>>>>>>>> option "huge="
>>>>>>>>>>>>>>> parameter
>>>>>>>>>>>>>>> or the sysfs interface at
>>>>>>>>>>>>>>> '/sys/kernel/mm/transparent_hugepage/shmem_enabled'.
>>>>>>>>>>>>>>> The utilization of mTHP is allowed only when the global
>>>>>>>>>>>>>>> 'huge' switch is
>>>>>>>>>>>>>>> enabled.
>>>>>>>>>>>>>>> Subsequently, the mTHP sysfs interface
>>>>>>>>>>>>>>> (/sys/kernel/mm/transparent_hugepage/hugepage-XXkb/enabled)
>>>>>>>>>>>>>>> is checked to determine the mTHP size that can be used for
>>>>>>>>>>>>>>> large folio
>>>>>>>>>>>>>>> allocation
>>>>>>>>>>>>>>> for these anonymous shared pages.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> I'm not sure about this proposed control mechanism; won't it
>>>>>>>>>>>>>> break
>>>>>>>>>>>>>> compatibility? I could be wrong, but I don't think shmem's
>>>>>>>>>>>>>> use of THP
>>>>>>>>>>>>>> used to
>>>>>>>>>>>>>> depend upon the value of
>>>>>>>>>>>>>> /sys/kernel/mm/transparent_hugepage/enabled?
>>>>>>>>>>>>>> So it
>>>>>>>>>>>>>
>>>>>>>>>>>>> Yes, I realized this after more testing.
>>>>>>>>>>>>>
>>>>>>>>>>>>>> doesn't make sense to me that we now depend upon the
>>>>>>>>>>>>>> /sys/kernel/mm/transparent_hugepage/hugepage-XXkb/enabled
>>>>>>>>>>>>>> values
>>>>>>>>>>>>>> (which by
>>>>>>>>>>>>>> default disables all sizes except 2M, which is set to
>>>>>>>>>>>>>> "inherit" from
>>>>>>>>>>>>>> /sys/kernel/mm/transparent_hugepage/enabled).
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> The other problem is that shmem_enabled has a different set
>>>>>>>>>>>>>> of options
>>>>>>>>>>>>>> (always/never/within_size/advise/deny/force) to enabled
>>>>>>>>>>>>>> (always/madvise/never)
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Perhaps it would be cleaner to do the same trick we did for
>>>>>>>>>>>>>> enabled;
>>>>>>>>>>>>>> Introduce
>>>>>>>>>>>>>> /mm/transparent_hugepage/hugepage-XXkb/shmem_enabled, which
>>>>>>>>>>>>>> can have all
>>>>>>>>>>>>>> the
>>>>>>>>>>>>>> same values as the top-level
>>>>>>>>>>>>>> /sys/kernel/mm/transparent_hugepage/shmem_enabled,
>>>>>>>>>>>>>> plus the additional "inherit" option. By default all sizes
>>>>>>>>>>>>>> will be set to
>>>>>>>>>>>>>> "never" except 2M, which is set to "inherit".
>>>>>>>>>>>>>
>>>>>>>>>>>>> Sounds good to me. But I do not want to copy all same values
>>>>>>>>>>>>> from
>>>>>>>>>>>>> top-level
>>>>>>>>>>>>> '/sys/kernel/mm/transparent_hugepage/shmem_enabled':
>>>>>>>>>>>>> always within_size advise never deny force
>>>>>>>>>>>>>
>>>>>>>>>>>>> For mTHP's shmem_enabled interface, we can just keep below
>>>>>>>>>>>>> values:
>>>>>>>>>>>>> always within_size advise never
>>>>>>>>>>>>>
>>>>>>>>>>>>> Cause when checking if mTHP can be used for anon shmem,
>>>>>>>>>>>>> 'deny' is equal to
>>>>>>>>>>>>> 'never', and 'force' is equal to 'always'.
>>>>>>>>>>>>
>>>>>>>>>>>> I'll admit it wasn't completely clear to me after reading the
>>>>>>>>>>>> docs, but my
>>>>>>>>>>>> rough
>>>>>>>>>>>> understanding is:
>>>>>>>>>>>>
>>>>>>>>>>>>        - /sys/kernel/mm/transparent_hugepage/shmem_enabled
>>>>>>>>>>>> controls
>>>>>>>>>>>>          mmap(SHARED|ANON) allocations (mostly; see rule 3)
>>>>>>>>>>>>        - huge=... controls tmpfs allocations
>>>>>>>>>>>>        - deny and force in shmem_enabled are equivalent to
>>>>>>>>>>>> never and
>>>>>>>>>>>> always for
>>>>>>>>>>>>          mmap(SHARED|ANON) but additionally override all tmpfs
>>>>>>>>>>>> mounts so they
>>>>>>>>>>>> act as
>>>>>>>>>>>>          if they were mounted with huge=never or huge=always
>>>>>>>>>>>>
>>>>>>>>>>>> Is that correct? If so, then I think it still makes sense to
>>>>>>>>>>>> support
>>>>>>>>>>>> per-size
>>>>>>>>>>>
>>>>>>>>>>> Correct.
>>>>>>>>>>>
>>>>>>>>>>>> deny/force. Certainly if a per-size control is set to
>>>>>>>>>>>> "inherit" and the
>>>>>>>>>>>> top-level control is set to deny or force, you would need that
>>>>>>>>>>>> to mean
>>>>>>>>>>>> something.
>>>>>>>>>>>
>>>>>>>>>>> IMHO, the
>>>>>>>>>>> '/mm/transparent_hugepage/hugepage-XXkb/shmem_enabled' interface
>>>>>>>>>>> should only control the anonymous shmem. And 'huge=' controls
>>>>>>>>>>> tmpfs
>>>>>>>>>>> allocation,
>>>>>>>>>>> so we should not use anonymous control to override tmpfs
>>>>>>>>>>> control, which
>>>>>>>>>>> seems a
>>>>>>>>>>> little mess?
>>>>>>>>>>
>>>>>>>>>> I agree it would be cleaner to only handle mmap(SHARED|ANON)
>>>>>>>>>> here, and leave
>>>>>>>>>> the
>>>>>>>>>> tmpfs stuff for another time. But my point is that
>>>>>>>>>> /mm/transparent_hugepage/shmem_enabled already interferes with
>>>>>>>>>> tmpfs if the
>>>>>>>>>> value is deny or force. So if you have:
>>>>>>>>>>
>>>>>>>>>> echo deny > /mm/transparent_hugepage/shmem_enabled
>>>>>>>>>
>>>>>>>>> IIUC, this global control will cause shmem_is_huge() to always
>>>>>>>>> return
>>>>>>>>> false, so
>>>>>>>>> no matter how
>>>>>>>>> '/mm/transparent_hugepage/hugepage-xxxkB/shmem_enabled' is set,
>>>>>>>>> anonymous shmem will not use mTHP. No?
>>>>>>>>
>>>>>>>> No, that's not how
>>>>>>>> '/mm/transparent_hugepage/hugepage-xxxkB/enabled' works, and
>>>>>>>> I think '/mm/transparent_hugepage/hugepage-xxxkB/shmem_enabled'
>>>>>>>> should follow
>>>>>>>> the established pattern.
>>>>>>>>
>>>>>>>> For anon-private, each size is controlled by its
>>>>>>>> /mm/transparent_hugepage/hugepage-xxxkB/enabled value. Unless that
>>>>>>>> value is
>>>>>>>> "inherit", in which case the value in
>>>>>>>> /mm/transparent_hugepage/enabled is used
>>>>>>>> for that size.
>>>>>>>>
>>>>>>>> That approach enables us to 1) maintain back-compat and 2) control
>>>>>>>> each size
>>>>>>>> independently
>>>>>>>>
>>>>>>>> 1) is met because the default is that all sizes are initially set
>>>>>>>> to "never",
>>>>>>>> except the PMD-size (e.g.
>>>>>>>> /mm/transparent_hugepage/hugepage-2048kB/enabled)
>>>>>>>> which is initially set to inherit. So any mTHP unaware SW can
>>>>>>>> still modify
>>>>>>>> /mm/transparent_hugepage/enabled and it will still only apply to
>>>>>>>> PMD size.
>>>>>>>>
>>>>>>>> 2) is met because mTHP aware SW can come along and e.g. enable the
>>>>>>>> 64K size
>>>>>>>> (echo always > /mm/transparent_hugepage/hugepage-64kB/enabled)
>>>>>>>> without
>>>>>>>> having to
>>>>>>>> modify the value in /mm/transparent_hugepage/enabled.
>>>>>>>
>>>>>>> Thanks for explanation. Initially, I want to make
>>>>>>> ‘/mm/transparent_hugepage/shmem_enabled’ be a global control for
>>>>>>> huge page, but
>>>>>>> I think it should follow the same strategy as anon mTHP as you said.
>>>>>>>
>>>>>>>>>> echo inherit >
>>>>>>>>>> /mm/transparent_hugepage/hugepage-64kB/shmem_enabled
>>>>>>>>>>
>>>>>>>>>> What does that mean?
>>>>>>>>
>>>>>>>> So I think /mm/transparent_hugepage/hugepage-xxxkB/shmem_enabled
>>>>>>>> will need to
>>>>>>>> support the deny and force values. When applied to non-PMD sizes,
>>>>>>>> "deny" can
>>>>>>>> just be a noop for now, because there was no way to configure a
>>>>>>>> tmpfs mount for
>>>>>>>> non-PMD size THP in the first place. But I'm not sure what to do
>>>>>>>> with "force"?
>>>>>>>
>>>>>>> OK. And I also prefer that "force" should be a noop too, since anon
>>>>>>> shmem
>>>>>>> control should not configure tmpfs huge page allocation.
>>>>>>
>>>>>> I guess technically they won't be noops, but (for the non-PMD-sizes)
>>>>>> "force"
>>>>>> will be an alias for "always" and "deny" will be an alias for
>>>>>> "never"?
>>>>>>
>>>>>> I was just a bit concerned about later changing that behavior to
>>>>>> also impact
>>>>>> tmpfs once tmpfs supports mTHP; could that cause breaks? But
>>>>>> thinking about it,
>>>>>> I don't see that as a problem.
>>>>>
>>>>> Is the question what should happen if we "inherit" "force" or if
>>>>> someone
>>>>> specifies "force" for a mTP size explicitly?
>>>>
>>>> Well I think it amounts to the same thing; there isn't much point in
>>>> forbidding
>>>> "force" to be set directly because it can still be set indirectly
>>>> through
>>>> "inherit". We can't forbid indirectly setting it, because "inherit"
>>>> could be set
>>>> first, then the top-level shmem_enabled changed to "force" after -
>>>> and we
>>>> wouldn't want to fail that.
>>>
>>> The default for PMD should be "inherit", for the other mTHP sizes it
>>> should be "never".
>>>
>>> So we should fail if:
>>> * Setting top-level to "force" when any non-PMD size is "inherit"
>>> * Setting "inherit" of a non-PMD size when the top-level is force
>>
>> IMO, for tmpfs this is true, but for anon shmem, this 2 cases should not
>> fail.
>
> If force does not apply to an mTHP size, it should fail if it would get
> inherited. Until it applies and we enable it.
>
> I'm still confused about all the toggles here, so could be I am missing
> something.

Yes, this is a little messy :(

After thinking about it more, considering that 'force' is used to
override the tmpfs mount option and 'inherit' inherits the global
setting, your suggestion will make the logic easy to understand (even
though 'force' targets tmpfs, which is not part of this anon shmem
mTHP scenario). Ryan, what do you think?
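The inheritance and aliasing behavior being converged on in this thread could be sketched as follows. This is illustrative only: the per-size shmem_enabled file is still a proposal, and treating "force"/"deny" as aliases of "always"/"never" for non-PMD sizes is the behavior suggested earlier by Ryan, not settled semantics.

```shell
#!/bin/sh
# Resolve the effective anon shmem policy for one mTHP size, following the
# pattern of hugepage-XXkb/enabled: "inherit" defers to the top-level value,
# and for non-PMD sizes "force"/"deny" act as "always"/"never".
effective_policy() {
    top="$1"; per_size="$2"
    val="$per_size"
    [ "$val" = "inherit" ] && val="$top"
    case "$val" in
        force) echo always ;;   # non-PMD: no tmpfs mount to override
        deny)  echo never ;;    # non-PMD: nothing extra to deny
        *)     echo "$val" ;;
    esac
}

effective_policy force inherit      # prints "always"
effective_policy never within_size  # prints "within_size"
```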

2024-04-25 10:19:37

by Ryan Roberts

[permalink] [raw]
Subject: Re: [RFC PATCH 0/5] add mTHP support for anonymous share pages

On 25/04/2024 10:50, Baolin Wang wrote:
>
>
> On 2024/4/25 17:20, David Hildenbrand wrote:
>> On 25.04.24 11:05, Baolin Wang wrote:
>>>
>>>
>>> On 2024/4/25 16:57, David Hildenbrand wrote:
>>>> On 25.04.24 10:46, Ryan Roberts wrote:
>>>>> On 25/04/2024 09:26, David Hildenbrand wrote:
>>>>>> On 25.04.24 10:17, Ryan Roberts wrote:
>>>>>>> On 25/04/2024 07:20, Baolin Wang wrote:
>>>>>>>>
>>>>>>>>
>>>>>>>> On 2024/4/24 22:20, Ryan Roberts wrote:
>>>>>>>>> On 24/04/2024 14:49, Baolin Wang wrote:
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> On 2024/4/24 18:01, Ryan Roberts wrote:
>>>>>>>>>>> On 24/04/2024 10:55, Baolin Wang wrote:
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>> On 2024/4/24 16:26, Ryan Roberts wrote:
>>>>>>>>>>>>> On 24/04/2024 07:55, Baolin Wang wrote:
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> On 2024/4/23 18:41, Ryan Roberts wrote:
>>>>>>>>>>>>>>> On 22/04/2024 08:02, Baolin Wang wrote:
>>>>>>>>>>>>>>>> Anonymous pages have already been supported for multi-size
>>>>>>>>>>>>>>>> (mTHP)
>>>>>>>>>>>>>>>> allocation
>>>>>>>>>>>>>>>> through commit 19eaf44954df, that can allow THP to be
>>>>>>>>>>>>>>>> configured
>>>>>>>>>>>>>>>> through the
>>>>>>>>>>>>>>>> sysfs interface located at
>>>>>>>>>>>>>>>> '/sys/kernel/mm/transparent_hugepage/hugepage-XXkb/enabled'.
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> However, the anonymous shared pages will ignore the
>>>>>>>>>>>>>>>> anonymous mTHP rule
>>>>>>>>>>>>>>>> configured through the sysfs interface, and can only use
>>>>>>>>>>>>>>>> the PMD-mapped
>>>>>>>>>>>>>>>> THP, that is not reasonable. Many implement anonymous page
>>>>>>>>>>>>>>>> sharing
>>>>>>>>>>>>>>>> through
>>>>>>>>>>>>>>>> mmap(MAP_SHARED | MAP_ANONYMOUS), especially in database usage
>>>>>>>>>>>>>>>> scenarios,
>>>>>>>>>>>>>>>> therefore, users expect to apply an unified mTHP strategy
>>>>>>>>>>>>>>>> for anonymous
>>>>>>>>>>>>>>>> pages,
>>>>>>>>>>>>>>>> also including the anonymous shared pages, in order to
>>>>>>>>>>>>>>>> enjoy the
>>>>>>>>>>>>>>>> benefits of
>>>>>>>>>>>>>>>> mTHP. For example, lower latency than PMD-mapped THP,
>>>>>>>>>>>>>>>> smaller memory
>>>>>>>>>>>>>>>> bloat
>>>>>>>>>>>>>>>> than PMD-mapped THP, contiguous PTEs on ARM architecture to
>>>>>>>>>>>>>>>> reduce TLB
>>>>>>>>>>>>>>>> miss
>>>>>>>>>>>>>>>> etc.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> This sounds like a very useful addition!
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Out of interest, can you point me at any workloads (and
>>>>>>>>>>>>>>> off-the-shelf
>>>>>>>>>>>>>>> benchmarks
>>>>>>>>>>>>>>> for those workloads) that predominantly use shared anon memory?
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> As far as I know, some database related workloads make
>>>>>>>>>>>>>> extensive use of
>>>>>>>>>>>>>> shared
>>>>>>>>>>>>>> anonymous page, such as PolarDB[1] in our Alibaba fleet, or
>>>>>>>>>>>>>> MySQL likely
>>>>>>>>>>>>>> also
>>>>>>>>>>>>>> uses shared anonymous memory. And I still need to do some
>>>>>>>>>>>>>> investigation to
>>>>>>>>>>>>>> measure the performance.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> [1] https://github.com/ApsaraDB/PolarDB-for-PostgreSQL
>>>>>>>>>>>>>
>>>>>>>>>>>>> Thanks for the pointer!
>>>>>>>>>>>>>
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> The primary strategy is that, the use of huge pages for
>>>>>>>>>>>>>>>> anonymous shared
>>>>>>>>>>>>>>>> pages
>>>>>>>>>>>>>>>> still follows the global control determined by the mount
>>>>>>>>>>>>>>>> option "huge="
>>>>>>>>>>>>>>>> parameter
>>>>>>>>>>>>>>>> or the sysfs interface at
>>>>>>>>>>>>>>>> '/sys/kernel/mm/transparent_hugepage/shmem_enabled'.
>>>>>>>>>>>>>>>> The utilization of mTHP is allowed only when the global
>>>>>>>>>>>>>>>> 'huge' switch is
>>>>>>>>>>>>>>>> enabled.
>>>>>>>>>>>>>>>> Subsequently, the mTHP sysfs interface
>>>>>>>>>>>>>>>> (/sys/kernel/mm/transparent_hugepage/hugepage-XXkb/enabled)
>>>>>>>>>>>>>>>> is checked to determine the mTHP size that can be used for
>>>>>>>>>>>>>>>> large folio
>>>>>>>>>>>>>>>> allocation
>>>>>>>>>>>>>>>> for these anonymous shared pages.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> I'm not sure about this proposed control mechanism; won't it
>>>>>>>>>>>>>>> break
>>>>>>>>>>>>>>> compatibility? I could be wrong, but I don't think shmem's
>>>>>>>>>>>>>>> use of THP
>>>>>>>>>>>>>>> used to
>>>>>>>>>>>>>>> depend upon the value of
>>>>>>>>>>>>>>> /sys/kernel/mm/transparent_hugepage/enabled?
>>>>>>>>>>>>>>> So it
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Yes, I realized this after more testing.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> doesn't make sense to me that we now depend upon the
>>>>>>>>>>>>>>> /sys/kernel/mm/transparent_hugepage/hugepage-XXkb/enabled
>>>>>>>>>>>>>>> values
>>>>>>>>>>>>>>> (which by
>>>>>>>>>>>>>>> default disables all sizes except 2M, which is set to
>>>>>>>>>>>>>>> "inherit" from
>>>>>>>>>>>>>>> /sys/kernel/mm/transparent_hugepage/enabled).
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> The other problem is that shmem_enabled has a different set
>>>>>>>>>>>>>>> of options
>>>>>>>>>>>>>>> (always/never/within_size/advise/deny/force) to enabled
>>>>>>>>>>>>>>> (always/madvise/never)
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Perhaps it would be cleaner to do the same trick we did for
>>>>>>>>>>>>>>> enabled;
>>>>>>>>>>>>>>> Introduce
>>>>>>>>>>>>>>> /mm/transparent_hugepage/hugepage-XXkb/shmem_enabled, which
>>>>>>>>>>>>>>> can have all
>>>>>>>>>>>>>>> the
>>>>>>>>>>>>>>> same values as the top-level
>>>>>>>>>>>>>>> /sys/kernel/mm/transparent_hugepage/shmem_enabled,
>>>>>>>>>>>>>>> plus the additional "inherit" option. By default all sizes
>>>>>>>>>>>>>>> will be set to
>>>>>>>>>>>>>>> "never" except 2M, which is set to "inherit".
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Sounds good to me. But I do not want to copy all same values
>>>>>>>>>>>>>> from
>>>>>>>>>>>>>> top-level
>>>>>>>>>>>>>> '/sys/kernel/mm/transparent_hugepage/shmem_enabled':
>>>>>>>>>>>>>> always within_size advise never deny force
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> For mTHP's shmem_enabled interface, we can just keep below
>>>>>>>>>>>>>> values:
>>>>>>>>>>>>>> always within_size advise never
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Cause when checking if mTHP can be used for anon shmem,
>>>>>>>>>>>>>> 'deny' is equal to
>>>>>>>>>>>>>> 'never', and 'force' is equal to 'always'.
>>>>>>>>>>>>>
>>>>>>>>>>>>> I'll admit it wasn't completely clear to me after reading the
>>>>>>>>>>>>> docs, but my
>>>>>>>>>>>>> rough
>>>>>>>>>>>>> understanding is:
>>>>>>>>>>>>>
>>>>>>>>>>>>>        - /sys/kernel/mm/transparent_hugepage/shmem_enabled
>>>>>>>>>>>>> controls
>>>>>>>>>>>>>          mmap(SHARED|ANON) allocations (mostly; see rule 3)
>>>>>>>>>>>>>        - huge=... controls tmpfs allocations
>>>>>>>>>>>>>        - deny and force in shmem_enabled are equivalent to
>>>>>>>>>>>>> never and
>>>>>>>>>>>>> always for
>>>>>>>>>>>>>          mmap(SHARED|ANON) but additionally override all tmpfs
>>>>>>>>>>>>> mounts so they
>>>>>>>>>>>>> act as
>>>>>>>>>>>>>          if they were mounted with huge=never or huge=always
>>>>>>>>>>>>>
>>>>>>>>>>>>> Is that correct? If so, then I think it still makes sense to
>>>>>>>>>>>>> support
>>>>>>>>>>>>> per-size
>>>>>>>>>>>>
>>>>>>>>>>>> Correct.
>>>>>>>>>>>>
>>>>>>>>>>>>> deny/force. Certainly if a per-size control is set to
>>>>>>>>>>>>> "inherit" and the
>>>>>>>>>>>>> top-level control is set to deny or force, you would need that
>>>>>>>>>>>>> to mean
>>>>>>>>>>>>> something.
>>>>>>>>>>>>
>>>>>>>>>>>> IMHO, the
>>>>>>>>>>>> '/mm/transparent_hugepage/hugepage-XXkb/shmem_enabled' interface
>>>>>>>>>>>> should only control the anonymous shmem. And 'huge=' controls
>>>>>>>>>>>> tmpfs
>>>>>>>>>>>> allocation,
>>>>>>>>>>>> so we should not use anonymous control to override tmpfs
>>>>>>>>>>>> control, which
>>>>>>>>>>>> seems a
>>>>>>>>>>>> little mess?
>>>>>>>>>>>
>>>>>>>>>>> I agree it would be cleaner to only handle mmap(SHARED|ANON)
>>>>>>>>>>> here, and leave
>>>>>>>>>>> the
>>>>>>>>>>> tmpfs stuff for another time. But my point is that
>>>>>>>>>>> /mm/transparent_hugepage/shmem_enabled already interferes with
>>>>>>>>>>> tmpfs if the
>>>>>>>>>>> value is deny or force. So if you have:
>>>>>>>>>>>
>>>>>>>>>>> echo deny > /mm/transparent_hugepage/shmem_enabled
>>>>>>>>>>
>>>>>>>>>> IIUC, this global control will cause shmem_is_huge() to always
>>>>>>>>>> return
>>>>>>>>>> false, so
>>>>>>>>>> no matter how
>>>>>>>>>> '/mm/transparent_hugepage/hugepage-xxxkB/shmem_enabled' is set,
>>>>>>>>>> anonymous shmem will not use mTHP. No?
>>>>>>>>>
>>>>>>>>> No, that's not how
>>>>>>>>> '/mm/transparent_hugepage/hugepage-xxxkB/enabled' works, and
>>>>>>>>> I think '/mm/transparent_hugepage/hugepage-xxxkB/shmem_enabled'
>>>>>>>>> should follow
>>>>>>>>> the established pattern.
>>>>>>>>>
>>>>>>>>> For anon-private, each size is controlled by its
>>>>>>>>> /mm/transparent_hugepage/hugepage-xxxkB/enabled value. Unless that
>>>>>>>>> value is
>>>>>>>>> "inherit", in which case the value in
>>>>>>>>> /mm/transparent_hugepage/enabled is used
>>>>>>>>> for that size.
>>>>>>>>>
>>>>>>>>> That approach enables us to 1) maintain back-compat and 2) control
>>>>>>>>> each size
>>>>>>>>> independently
>>>>>>>>>
>>>>>>>>> 1) is met because the default is that all sizes are initially set
>>>>>>>>> to "never",
>>>>>>>>> except the PMD-size (e.g.
>>>>>>>>> /mm/transparent_hugepage/hugepage-2048kB/enabled)
>>>>>>>>> which is initially set to inherit. So any mTHP unaware SW can
>>>>>>>>> still modify
>>>>>>>>> /mm/transparent_hugepage/enabled and it will still only apply to
>>>>>>>>> PMD size.
>>>>>>>>>
>>>>>>>>> 2) is met because mTHP aware SW can come along and e.g. enable the
>>>>>>>>> 64K size
>>>>>>>>> (echo always > /mm/transparent_hugepage/hugepage-64kB/enabled)
>>>>>>>>> without
>>>>>>>>> having to
>>>>>>>>> modify the value in /mm/transparent_hugepage/enabled.
>>>>>>>>
>>>>>>>> Thanks for the explanation. Initially, I wanted to make
>>>>>>>> ‘/mm/transparent_hugepage/shmem_enabled’ a global control for
>>>>>>>> huge pages, but
>>>>>>>> I think it should follow the same strategy as anon mTHP as you said.
>>>>>>>>
>>>>>>>>>>> echo inherit > /mm/transparent_hugepage/hugepage-64kB/shmem_enabled
>>>>>>>>>>>
>>>>>>>>>>> What does that mean?
>>>>>>>>>
>>>>>>>>> So I think /mm/transparent_hugepage/hugepage-xxxkB/shmem_enabled
>>>>>>>>> will need to
>>>>>>>>> support the deny and force values. When applied to non-PMD sizes,
>>>>>>>>> "deny" can
>>>>>>>>> just be a noop for now, because there was no way to configure a
>>>>>>>>> tmpfs mount for
>>>>>>>>> non-PMD size THP in the first place. But I'm not sure what to do
>>>>>>>>> with "force"?
>>>>>>>>
>>>>>>>> OK. And I also prefer that "force" should be a noop too, since anon
>>>>>>>> shmem
>>>>>>>> control should not configure tmpfs huge page allocation.
>>>>>>>
>>>>>>> I guess technically they won't be noops, but (for the non-PMD-sizes)
>>>>>>> "force"
>>>>>>> will be an alias for "always" and "deny" will be an alias for "never"?
>>>>>>>
>>>>>>> I was just a bit concerned about later changing that behavior to
>>>>>>> also impact
>>>>>>> tmpfs once tmpfs supports mTHP; could that cause breaks? But
>>>>>>> thinking about it,
>>>>>>> I don't see that as a problem.
>>>>>>
>>>>>> Is the question what should happen if we "inherit" "force" or if someone
>>>>>> specifies "force" for an mTHP size explicitly?
>>>>>
>>>>> Well I think it amounts to the same thing; there isn't much point in
>>>>> forbidding
>>>>> "force" to be set directly because it can still be set indirectly through
>>>>> "inherit". We can't forbid indirectly setting it, because "inherit"
>>>>> could be set
>>>>> first, then the top-level shmem_enabled changed to "force" after - and we
>>>>> wouldn't want to fail that.
>>>>
>>>> The default for PMD should be "inherit", for the other mTHP sizes it
>>>> should be "never".
>>>>
>>>> So we should fail if:
>>>> * Setting top-level to "force" when any non-PMD size is "inherit"
>>>> * Setting "inherit" of a non-PMD size when the top-level is force
>>>
>>> IMO, for tmpfs this is true, but for anon shmem, these 2 cases should not
>>> fail.
>>
>> If force does not apply to an mTHP size, it should fail if it would get
>> inherited. Until it applies and we enable it.
>>
>> I'm still confused about all the toggles here, so could be I am missing
>> something.
>
> Yes, this is a little messy:(
>
> After thinking more, considering that 'force' is used to override the tmpfs
> mount option and 'inherit' will inherit the global setting, your suggestion
> will make the logic easy to understand (though it is valid for anon shmem mTHP
> allocations, which are not part of this scenario). Ryan, what do you think?

I was thinking that ever returning an error code when trying to set "force" for
/mm/transparent_hugepage/shmem_enabled would be weird behavior and to be avoided
if possible.

But as always, David makes a clear and compelling case. I'm happy to go with his
suggestion.

I wonder if we should do the same for "deny"? The argument I previously made was
that non-PMD-size mTHP is always effectively denied while tmpfs doesn't support
mTHP, so we can allow "deny" and for now it would just be an alias for "never".
The semantics don't change when you introduce tmpfs support.

Personally I lean towards allowing "deny" from day 1. David?
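
[Editorial note: the "inherit" fallback and the force/deny aliasing discussed in
this thread can be modeled as a short sketch. The per-size
`hugepage-XXkb/shmem_enabled` interface is only proposed here, not a merged
kernel ABI, and the helper names below are hypothetical.]

```python
PMD_SIZE_KB = 2048

# Default state proposed in the thread: every size starts as "never"
# except the PMD size, which starts as "inherit" for back-compat.
per_size = {64: "never", 2048: "inherit"}
top_level = "always"  # models /sys/kernel/mm/transparent_hugepage/shmem_enabled


def effective_policy(size_kb: int) -> str:
    """Resolve the effective policy for one mTHP size.

    "inherit" falls back to the top-level shmem_enabled value. For
    non-PMD sizes, "force" and "deny" act as aliases of "always" and
    "never" while tmpfs has no mTHP support, as discussed above.
    """
    value = per_size[size_kb]
    if value == "inherit":
        value = top_level
    if size_kb != PMD_SIZE_KB:
        value = {"force": "always", "deny": "never"}.get(value, value)
    return value


print(effective_policy(2048))  # PMD size inherits the top-level "always"
print(effective_policy(64))    # 64K stays "never" until enabled explicitly
```

Under this model, mTHP-unaware software that only touches the top-level knob
keeps affecting just the PMD size, while mTHP-aware software can enable other
sizes independently.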




2024-04-25 10:47:10

by David Hildenbrand

Subject: Re: [RFC PATCH 0/5] add mTHP support for anonymous share pages

On 25.04.24 12:17, Ryan Roberts wrote:
> On 25/04/2024 10:50, Baolin Wang wrote:
>>
>>
>> On 2024/4/25 17:20, David Hildenbrand wrote:
>>> On 25.04.24 11:05, Baolin Wang wrote:
>>>>
>>>>
>>>> On 2024/4/25 16:57, David Hildenbrand wrote:
>>>>> On 25.04.24 10:46, Ryan Roberts wrote:
>>>>>> On 25/04/2024 09:26, David Hildenbrand wrote:
>>>>>>> On 25.04.24 10:17, Ryan Roberts wrote:
>>>>>>>> On 25/04/2024 07:20, Baolin Wang wrote:
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> On 2024/4/24 22:20, Ryan Roberts wrote:
>>>>>>>>>> On 24/04/2024 14:49, Baolin Wang wrote:
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>> On 2024/4/24 18:01, Ryan Roberts wrote:
>>>>>>>>>>>> On 24/04/2024 10:55, Baolin Wang wrote:
>>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>>> On 2024/4/24 16:26, Ryan Roberts wrote:
>>>>>>>>>>>>>> On 24/04/2024 07:55, Baolin Wang wrote:
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> On 2024/4/23 18:41, Ryan Roberts wrote:
>>>>>>>>>>>>>>>> On 22/04/2024 08:02, Baolin Wang wrote:
>>>>>>>>>>>>>>>>> Anonymous pages have already been supported for multi-size
>>>>>>>>>>>>>>>>> (mTHP)
>>>>>>>>>>>>>>>>> allocation
>>>>>>>>>>>>>>>>> through commit 19eaf44954df, that can allow THP to be
>>>>>>>>>>>>>>>>> configured
>>>>>>>>>>>>>>>>> through the
>>>>>>>>>>>>>>>>> sysfs interface located at
>>>>>>>>>>>>>>>>> '/sys/kernel/mm/transparent_hugepage/hugepage-XXkb/enabled'.
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> However, the anonymous shared pages will ignore the
>>>>>>>>>>>>>>>>> anonymous mTHP rule
>>>>>>>>>>>>>>>>> configured through the sysfs interface, and can only use
>>>>>>>>>>>>>>>>> the PMD-mapped
>>>>>>>>>>>>>>>>> THP, which is not reasonable. Many applications implement anonymous page
>>>>>>>>>>>>>>>>> sharing
>>>>>>>>>>>>>>>>> through
>>>>>>>>>>>>>>>>> mmap(MAP_SHARED | MAP_ANONYMOUS), especially in database usage
>>>>>>>>>>>>>>>>> scenarios,
>>>>>>>>>>>>>>>>> therefore, users expect to apply a unified mTHP strategy
>>>>>>>>>>>>>>>>> for anonymous
>>>>>>>>>>>>>>>>> pages,
>>>>>>>>>>>>>>>>> also including the anonymous shared pages, in order to
>>>>>>>>>>>>>>>>> enjoy the
>>>>>>>>>>>>>>>>> benefits of
>>>>>>>>>>>>>>>>> mTHP. For example, lower latency than PMD-mapped THP,
>>>>>>>>>>>>>>>>> smaller memory
>>>>>>>>>>>>>>>>> bloat
>>>>>>>>>>>>>>>>> than PMD-mapped THP, contiguous PTEs on ARM architecture to
>>>>>>>>>>>>>>>>> reduce TLB
>>>>>>>>>>>>>>>>> miss
>>>>>>>>>>>>>>>>> etc.
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> This sounds like a very useful addition!
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> Out of interest, can you point me at any workloads (and
>>>>>>>>>>>>>>>> off-the-shelf
>>>>>>>>>>>>>>>> benchmarks
>>>>>>>>>>>>>>>> for those workloads) that predominantly use shared anon memory?
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> As far as I know, some database related workloads make
>>>>>>>>>>>>>>> extensive use of
>>>>>>>>>>>>>>> shared
>>>>>>>>>>>>>>> anonymous page, such as PolarDB[1] in our Alibaba fleet, or
>>>>>>>>>>>>>>> MySQL likely
>>>>>>>>>>>>>>> also
>>>>>>>>>>>>>>> uses shared anonymous memory. And I still need to do some
>>>>>>>>>>>>>>> investigation to
>>>>>>>>>>>>>>> measure the performance.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> [1] https://github.com/ApsaraDB/PolarDB-for-PostgreSQL
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Thanks for the pointer!
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> The primary strategy is that, the use of huge pages for
>>>>>>>>>>>>>>>>> anonymous shared
>>>>>>>>>>>>>>>>> pages
>>>>>>>>>>>>>>>>> still follows the global control determined by the mount
>>>>>>>>>>>>>>>>> option "huge="
>>>>>>>>>>>>>>>>> parameter
>>>>>>>>>>>>>>>>> or the sysfs interface at
>>>>>>>>>>>>>>>>> '/sys/kernel/mm/transparent_hugepage/shmem_enabled'.
>>>>>>>>>>>>>>>>> The utilization of mTHP is allowed only when the global
>>>>>>>>>>>>>>>>> 'huge' switch is
>>>>>>>>>>>>>>>>> enabled.
>>>>>>>>>>>>>>>>> Subsequently, the mTHP sysfs interface
>>>>>>>>>>>>>>>>> (/sys/kernel/mm/transparent_hugepage/hugepage-XXkb/enabled)
>>>>>>>>>>>>>>>>> is checked to determine the mTHP size that can be used for
>>>>>>>>>>>>>>>>> large folio
>>>>>>>>>>>>>>>>> allocation
>>>>>>>>>>>>>>>>> for these anonymous shared pages.
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> I'm not sure about this proposed control mechanism; won't it
>>>>>>>>>>>>>>>> break
>>>>>>>>>>>>>>>> compatibility? I could be wrong, but I don't think shmem's
>>>>>>>>>>>>>>>> use of THP
>>>>>>>>>>>>>>>> used to
>>>>>>>>>>>>>>>> depend upon the value of
>>>>>>>>>>>>>>>> /sys/kernel/mm/transparent_hugepage/enabled?
>>>>>>>>>>>>>>>> So it
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Yes, I realized this after more testing.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> doesn't make sense to me that we now depend upon the
>>>>>>>>>>>>>>>> /sys/kernel/mm/transparent_hugepage/hugepage-XXkb/enabled
>>>>>>>>>>>>>>>> values
>>>>>>>>>>>>>>>> (which by
>>>>>>>>>>>>>>>> default disables all sizes except 2M, which is set to
>>>>>>>>>>>>>>>> "inherit" from
>>>>>>>>>>>>>>>> /sys/kernel/mm/transparent_hugepage/enabled).
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> The other problem is that shmem_enabled has a different set
>>>>>>>>>>>>>>>> of options
>>>>>>>>>>>>>>>> (always/never/within_size/advise/deny/force) to enabled
>>>>>>>>>>>>>>>> (always/madvise/never)
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> Perhaps it would be cleaner to do the same trick we did for
>>>>>>>>>>>>>>>> enabled;
>>>>>>>>>>>>>>>> Introduce
>>>>>>>>>>>>>>>> /mm/transparent_hugepage/hugepage-XXkb/shmem_enabled, which
>>>>>>>>>>>>>>>> can have all
>>>>>>>>>>>>>>>> the
>>>>>>>>>>>>>>>> same values as the top-level
>>>>>>>>>>>>>>>> /sys/kernel/mm/transparent_hugepage/shmem_enabled,
>>>>>>>>>>>>>>>> plus the additional "inherit" option. By default all sizes
>>>>>>>>>>>>>>>> will be set to
>>>>>>>>>>>>>>>> "never" except 2M, which is set to "inherit".
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Sounds good to me. But I do not want to copy all same values
>>>>>>>>>>>>>>> from
>>>>>>>>>>>>>>> top-level
>>>>>>>>>>>>>>> '/sys/kernel/mm/transparent_hugepage/shmem_enabled':
>>>>>>>>>>>>>>> always within_size advise never deny force
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> For mTHP's shmem_enabled interface, we can just keep below
>>>>>>>>>>>>>>> values:
>>>>>>>>>>>>>>> always within_size advise never
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Because when checking if mTHP can be used for anon shmem,
>>>>>>>>>>>>>>> 'deny' is equal to
>>>>>>>>>>>>>>> 'never', and 'force' is equal to 'always'.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> I'll admit it wasn't completely clear to me after reading the
>>>>>>>>>>>>>> docs, but my
>>>>>>>>>>>>>> rough
>>>>>>>>>>>>>> understanding is:
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>        - /sys/kernel/mm/transparent_hugepage/shmem_enabled
>>>>>>>>>>>>>> controls
>>>>>>>>>>>>>>          mmap(SHARED|ANON) allocations (mostly; see rule 3)
>>>>>>>>>>>>>>        - huge=... controls tmpfs allocations
>>>>>>>>>>>>>>        - deny and force in shmem_enabled are equivalent to
>>>>>>>>>>>>>> never and
>>>>>>>>>>>>>> always for
>>>>>>>>>>>>>>          mmap(SHARED|ANON) but additionally override all tmpfs
>>>>>>>>>>>>>> mounts so they
>>>>>>>>>>>>>> act as
>>>>>>>>>>>>>>          if they were mounted with huge=never or huge=always
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Is that correct? If so, then I think it still makes sense to
>>>>>>>>>>>>>> support
>>>>>>>>>>>>>> per-size
>>>>>>>>>>>>>
>>>>>>>>>>>>> Correct.
>>>>>>>>>>>>>
>>>>>>>>>>>>>> deny/force. Certainly if a per-size control is set to
>>>>>>>>>>>>>> "inherit" and the
>>>>>>>>>>>>>> top-level control is set to deny or force, you would need that
>>>>>>>>>>>>>> to mean
>>>>>>>>>>>>>> something.
>>>>>>>>>>>>>
>>>>>>>>>>>>> IMHO, the
>>>>>>>>>>>>> '/mm/transparent_hugepage/hugepage-XXkb/shmem_enabled' interface
>>>>>>>>>>>>> should only control the anonymous shmem. And 'huge=' controls
>>>>>>>>>>>>> tmpfs
>>>>>>>>>>>>> allocation,
>>>>>>>>>>>>> so we should not use anonymous control to override tmpfs
>>>>>>>>>>>>> control, which
>>>>>>>>>>>>> seems a
>>>>>>>>>>>>> little mess?
>>>>>>>>>>>>
>>>>>>>>>>>> I agree it would be cleaner to only handle mmap(SHARED|ANON)
>>>>>>>>>>>> here, and leave
>>>>>>>>>>>> the
>>>>>>>>>>>> tmpfs stuff for another time. But my point is that
>>>>>>>>>>>> /mm/transparent_hugepage/shmem_enabled already interferes with
>>>>>>>>>>>> tmpfs if the
>>>>>>>>>>>> value is deny or force. So if you have:
>>>>>>>>>>>>
>>>>>>>>>>>> echo deny > /mm/transparent_hugepage/shmem_enabled
>>>>>>>>>>>
>>>>>>>>>>> IIUC, this global control will cause shmem_is_huge() to always
>>>>>>>>>>> return
>>>>>>>>>>> false, so
>>>>>>>>>>> no matter how
>>>>>>>>>>> '/mm/transparent_hugepage/hugepage-xxxkB/shmem_enabled' is set,
>>>>>>>>>>> anonymous shmem will not use mTHP. No?
>>>>>>>>>>
>>>>>>>>>> No, that's not how
>>>>>>>>>> '/mm/transparent_hugepage/hugepage-xxxkB/enabled' works, and
>>>>>>>>>> I think '/mm/transparent_hugepage/hugepage-xxxkB/shmem_enabled'
>>>>>>>>>> should follow
>>>>>>>>>> the established pattern.
>>>>>>>>>>
>>>>>>>>>> For anon-private, each size is controlled by its
>>>>>>>>>> /mm/transparent_hugepage/hugepage-xxxkB/enabled value. Unless that
>>>>>>>>>> value is
>>>>>>>>>> "inherit", in which case the value in
>>>>>>>>>> /mm/transparent_hugepage/enabled is used
>>>>>>>>>> for that size.
>>>>>>>>>>
>>>>>>>>>> That approach enables us to 1) maintain back-compat and 2) control
>>>>>>>>>> each size
>>>>>>>>>> independently
>>>>>>>>>>
>>>>>>>>>> 1) is met because the default is that all sizes are initially set
>>>>>>>>>> to "never",
>>>>>>>>>> except the PMD-size (e.g.
>>>>>>>>>> /mm/transparent_hugepage/hugepage-2048kB/enabled)
>>>>>>>>>> which is initially set to inherit. So any mTHP unaware SW can
>>>>>>>>>> still modify
>>>>>>>>>> /mm/transparent_hugepage/enabled and it will still only apply to
>>>>>>>>>> PMD size.
>>>>>>>>>>
>>>>>>>>>> 2) is met because mTHP aware SW can come along and e.g. enable the
>>>>>>>>>> 64K size
>>>>>>>>>> (echo always > /mm/transparent_hugepage/hugepage-64kB/enabled)
>>>>>>>>>> without
>>>>>>>>>> having to
>>>>>>>>>> modify the value in /mm/transparent_hugepage/enabled.
>>>>>>>>>
>>>>>>>>> Thanks for the explanation. Initially, I wanted to make
>>>>>>>>> ‘/mm/transparent_hugepage/shmem_enabled’ a global control for
>>>>>>>>> huge pages, but
>>>>>>>>> I think it should follow the same strategy as anon mTHP as you said.
>>>>>>>>>
>>>>>>>>>>>> echo inherit > /mm/transparent_hugepage/hugepage-64kB/shmem_enabled
>>>>>>>>>>>>
>>>>>>>>>>>> What does that mean?
>>>>>>>>>>
>>>>>>>>>> So I think /mm/transparent_hugepage/hugepage-xxxkB/shmem_enabled
>>>>>>>>>> will need to
>>>>>>>>>> support the deny and force values. When applied to non-PMD sizes,
>>>>>>>>>> "deny" can
>>>>>>>>>> just be a noop for now, because there was no way to configure a
>>>>>>>>>> tmpfs mount for
>>>>>>>>>> non-PMD size THP in the first place. But I'm not sure what to do
>>>>>>>>>> with "force"?
>>>>>>>>>
>>>>>>>>> OK. And I also prefer that "force" should be a noop too, since anon
>>>>>>>>> shmem
>>>>>>>>> control should not configure tmpfs huge page allocation.
>>>>>>>>
>>>>>>>> I guess technically they won't be noops, but (for the non-PMD-sizes)
>>>>>>>> "force"
>>>>>>>> will be an alias for "always" and "deny" will be an alias for "never"?
>>>>>>>>
>>>>>>>> I was just a bit concerned about later changing that behavior to
>>>>>>>> also impact
>>>>>>>> tmpfs once tmpfs supports mTHP; could that cause breaks? But
>>>>>>>> thinking about it,
>>>>>>>> I don't see that as a problem.
>>>>>>>
>>>>>>> Is the question what should happen if we "inherit" "force" or if someone
>>>>>>> specifies "force" for an mTHP size explicitly?
>>>>>>
>>>>>> Well I think it amounts to the same thing; there isn't much point in
>>>>>> forbidding
>>>>>> "force" to be set directly because it can still be set indirectly through
>>>>>> "inherit". We can't forbid indirectly setting it, because "inherit"
>>>>>> could be set
>>>>>> first, then the top-level shmem_enabled changed to "force" after - and we
>>>>>> wouldn't want to fail that.
>>>>>
>>>>> The default for PMD should be "inherit", for the other mTHP sizes it
>>>>> should be "never".
>>>>>
>>>>> So we should fail if:
>>>>> * Setting top-level to "force" when any non-PMD size is "inherit"
>>>>> * Setting "inherit" of a non-PMD size when the top-level is force
>>>>
>>>> IMO, for tmpfs this is true, but for anon shmem, these 2 cases should not
>>>> fail.
>>>
>>> If force does not apply to an mTHP size, it should fail if it would get
>>> inherited. Until it applies and we enable it.
>>>
>>> I'm still confused about all the toggles here, so could be I am missing
>>> something.
>>
>> Yes, this is a little messy:(
>>
>> After thinking more, considering that 'force' is used to override the tmpfs
>> mount option and 'inherit' will inherit the global setting, your suggestion
>> will make the logic easy to understand (though it is valid for anon shmem mTHP
>> allocations, which are not part of this scenario). Ryan, what do you think?
>
> I was thinking that ever returning an error code when trying to set "force" for
> /mm/transparent_hugepage/shmem_enabled would be weird behavior and to be avoided
> if possible.
>
> But as always, David makes a clear and compelling case. I'm happy to go with his
> suggestion.

I'm afraid "as always" is far from true :D

>
> I wonder if we should do the same for "deny"? The argument I previously made was
> that non-PMD-size mTHP is always effectively denied while tmpfs doesn't support
> mTHP, so we can allow "deny" and for now it would just be an alias for "never".
> The semantics don't change when you introduce tmpfs support.
>
> Personally I lean towards allowing "deny" from day 1. David?

If it has clear semantics that can be fulfilled as of today, sure, we
can have it.

--
Cheers,

David / dhildenb
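
[Editorial note: the two validation rules David proposes above (fail setting the
top level to "force" while any non-PMD size is "inherit", and fail setting a
non-PMD size to "inherit" while the top level is "force") can be sketched as
below. The interface is hypothetical at this point, and the helper names are
invented for illustration.]

```python
PMD_SIZE_KB = 2048


def set_top_level(state: dict, value: str) -> bool:
    """Try to change the top-level shmem_enabled; False models -EINVAL."""
    if value == "force" and any(
        v == "inherit"
        for size, v in state["per_size"].items()
        if size != PMD_SIZE_KB
    ):
        # A non-PMD size would inherit "force", which does not apply
        # to it until tmpfs supports mTHP.
        return False
    state["top_level"] = value
    return True


def set_per_size(state: dict, size_kb: int, value: str) -> bool:
    """Try to change one per-size shmem_enabled; False models -EINVAL."""
    if (value == "inherit" and size_kb != PMD_SIZE_KB
            and state["top_level"] == "force"):
        return False
    state["per_size"][size_kb] = value
    return True


state = {"top_level": "never", "per_size": {64: "never", 2048: "inherit"}}
print(set_top_level(state, "force"))       # True: only the PMD size inherits
print(set_per_size(state, 64, "inherit"))  # False: top level is "force"
```

This keeps "force" strictly a tmpfs-override concept until the semantics for
non-PMD sizes are settled, instead of silently aliasing it.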