2024-03-19 09:41:52

by Kairui Song

Subject: [PATCH 0/4] mm/filemap: optimize folio adding and splitting

From: Kairui Song <[email protected]>

Currently, at least 3 tree walks are needed when adding a previously
evicted folio back to the filemap: one for getting the order, one for the
ranged conflict check, and one more for retrieving the order again. If a
split is needed, even more walks are required.

This series merges these walks and speeds up filemap_add_folio.

Instead of doing multiple tree walks, do one optimistic range check
with the lock held, and exit early if we raced with another insertion.
If a shadow entry exists, check its order with the new xas_get_order
helper before releasing the lock, to avoid a redundant tree walk just
for getting its order.

Drop the lock and do the allocation only if a split is needed.

In the best case, the tree only needs to be walked once. If an
allocation and split are needed, 3 walks are issued (one for the first
ranged conflict check and order retrieval, one for the second check
after the allocation, and one for the insert after the split).
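
A simplified sketch of the fast path (illustrative only; error handling
and the alloc/split retry path are in patch 4):

        void *entry, *shadow = NULL;
        int order = 0;

        xas_lock_irq(&xas);
        xas_for_each_conflict(&xas, entry) {
                /* A non-value entry means we raced with another insertion */
                if (!xa_is_value(entry))
                        return -EEXIST; /* caller drops the lock */
                shadow = entry; /* shadow of a previously evicted folio */
        }
        if (shadow)
                /* Reuse the walk state instead of another xa_get_order walk */
                order = xas_get_order(&xas);
        if (order <= folio_order(folio))
                /* Best case: conflict check, order check and store in one walk */
                xas_store(&xas, folio);
        /* else: unlock, xas_split_alloc(), relock, re-check, then xas_split() */
        xas_unlock_irq(&xas);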

Testing with 4K pages, in an 8G cgroup, with 20G brd as the block device:

fio -name=cached --numjobs=16 --filename=/mnt/test.img \
--buffered=1 --ioengine=mmap --rw=randread --time_based \
--ramp_time=30s --runtime=5m --group_reporting

Before:
bw ( MiB/s): min= 790, max= 3665, per=100.00%, avg=2499.17, stdev=20.64, samples=8698
iops : min=202295, max=938417, avg=639785.81, stdev=5284.08, samples=8698

After (+4%):
bw ( MiB/s): min= 451, max= 3868, per=100.00%, avg=2599.83, stdev=23.39, samples=8653
iops : min=115596, max=990364, avg=665556.34, stdev=5988.20, samples=8653

Test results with THP (run a THP test first, then switch to 4K pages,
in the hope that this triggers a lot of splitting):

fio -name=cached --numjobs=16 --filename=/mnt/test.img \
--buffered=1 --ioengine mmap -thp=1 --readonly \
--rw=randread --random_distribution=random \
--time_based --runtime=5m --group_reporting

fio -name=cached --numjobs=16 --filename=/mnt/test.img \
--buffered=1 --ioengine mmap --readonly \
--rw=randread --random_distribution=random \
--time_based --runtime=5s --group_reporting

Before:
bw ( KiB/s): min=28071, max=62359, per=100.00%, avg=53542.44, stdev=179.77, samples=9520
iops : min= 7012, max=15586, avg=13379.39, stdev=44.94, samples=9520
bw ( MiB/s): min= 2457, max= 6193, per=100.00%, avg=3923.21, stdev=82.48, samples=144
iops : min=629220, max=1585642, avg=1004340.78, stdev=21116.07, samples=144

After (+-0.0%):
bw ( KiB/s): min=30561, max=63064, per=100.00%, avg=53635.82, stdev=177.21, samples=9520
iops : min= 7636, max=15762, avg=13402.82, stdev=44.29, samples=9520
bw ( MiB/s): min= 2449, max= 6145, per=100.00%, avg=3914.68, stdev=81.15, samples=144
iops : min=627106, max=1573156, avg=1002158.11, stdev=20774.77, samples=144

Performance is better (+4%) for 4K cached reads and unchanged for THP.

Kairui Song (4):
mm/filemap: return early if failed to allocate memory for split
mm/filemap: clean up hugetlb exclusion code
lib/xarray: introduce a new helper xas_get_order
mm/filemap: optimize filemap folio adding

include/linux/xarray.h | 6 ++
lib/xarray.c | 49 +++++++++-----
mm/filemap.c | 145 ++++++++++++++++++++++++-----------------
3 files changed, 121 insertions(+), 79 deletions(-)

--
2.43.0



2024-03-19 09:42:04

by Kairui Song

Subject: [PATCH 1/4] mm/filemap: return early if failed to allocate memory for split

From: Kairui Song <[email protected]>

xas_split_alloc can fail with -ENOMEM, and in that case it should abort
early instead of carrying on and failing at the xas_split below.

Signed-off-by: Kairui Song <[email protected]>
---
mm/filemap.c | 5 ++++-
1 file changed, 4 insertions(+), 1 deletion(-)

diff --git a/mm/filemap.c b/mm/filemap.c
index 7437b2bd75c1..f07ea0b97698 100644
--- a/mm/filemap.c
+++ b/mm/filemap.c
@@ -880,9 +880,12 @@ noinline int __filemap_add_folio(struct address_space *mapping,
unsigned int order = xa_get_order(xas.xa, xas.xa_index);
void *entry, *old = NULL;

- if (order > folio_order(folio))
+ if (order > folio_order(folio)) {
xas_split_alloc(&xas, xa_load(xas.xa, xas.xa_index),
order, gfp);
+ if (xas_error(&xas))
+ goto error;
+ }
xas_lock_irq(&xas);
xas_for_each_conflict(&xas, entry) {
old = entry;
--
2.43.0


2024-03-19 09:42:13

by Kairui Song

Subject: [PATCH 2/4] mm/filemap: clean up hugetlb exclusion code

From: Kairui Song <[email protected]>

__filemap_add_folio only has two callers: one never passes a hugetlb
folio and the other always passes in a hugetlb folio. So move the
hugetlb-related cgroup charging out of it to make the code cleaner.

Signed-off-by: Kairui Song <[email protected]>
---
mm/filemap.c | 21 ++++++++-------------
1 file changed, 8 insertions(+), 13 deletions(-)

diff --git a/mm/filemap.c b/mm/filemap.c
index f07ea0b97698..6bbec8783793 100644
--- a/mm/filemap.c
+++ b/mm/filemap.c
@@ -853,20 +853,12 @@ noinline int __filemap_add_folio(struct address_space *mapping,
{
XA_STATE(xas, &mapping->i_pages, index);
bool huge = folio_test_hugetlb(folio);
- bool charged = false;
- long nr = 1;
+ long nr;

VM_BUG_ON_FOLIO(!folio_test_locked(folio), folio);
VM_BUG_ON_FOLIO(folio_test_swapbacked(folio), folio);
mapping_set_update(&xas, mapping);

- if (!huge) {
- int error = mem_cgroup_charge(folio, NULL, gfp);
- if (error)
- return error;
- charged = true;
- }
-
VM_BUG_ON_FOLIO(index & (folio_nr_pages(folio) - 1), folio);
xas_set_order(&xas, index, folio_order(folio));
nr = folio_nr_pages(folio);
@@ -931,8 +923,6 @@ noinline int __filemap_add_folio(struct address_space *mapping,
trace_mm_filemap_add_to_page_cache(folio);
return 0;
error:
- if (charged)
- mem_cgroup_uncharge(folio);
folio->mapping = NULL;
/* Leave page->index set: truncation relies upon it */
folio_put_refs(folio, nr);
@@ -946,11 +936,16 @@ int filemap_add_folio(struct address_space *mapping, struct folio *folio,
void *shadow = NULL;
int ret;

+ ret = mem_cgroup_charge(folio, NULL, gfp);
+ if (ret)
+ return ret;
+
__folio_set_locked(folio);
ret = __filemap_add_folio(mapping, folio, index, gfp, &shadow);
- if (unlikely(ret))
+ if (unlikely(ret)) {
+ mem_cgroup_uncharge(folio);
__folio_clear_locked(folio);
- else {
+ } else {
/*
* The folio might have been evicted from cache only
* recently, in which case it should be activated like
--
2.43.0


2024-03-19 09:42:25

by Kairui Song

Subject: [PATCH 3/4] lib/xarray: introduce a new helper xas_get_order

From: Kairui Song <[email protected]>

It can be used after xas_load to check the order of the loaded entry.
Compared to xa_get_order, it saves an XA_STATE and avoids a rewalk.
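
For example (an illustrative sketch only, not part of this patch), a
lookup that needs both the entry and its order can now do it with a
single walk:

        XA_STATE(xas, &mapping->i_pages, index);
        void *entry;
        int order = 0;

        rcu_read_lock();
        entry = xas_load(&xas);
        if (entry)
                order = xas_get_order(&xas);
        rcu_read_unlock();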

Signed-off-by: Kairui Song <[email protected]>
---
include/linux/xarray.h | 6 ++++++
lib/xarray.c | 49 ++++++++++++++++++++++++++----------------
2 files changed, 37 insertions(+), 18 deletions(-)

diff --git a/include/linux/xarray.h b/include/linux/xarray.h
index cb571dfcf4b1..d9d479334c9e 100644
--- a/include/linux/xarray.h
+++ b/include/linux/xarray.h
@@ -1548,6 +1548,7 @@ void xas_create_range(struct xa_state *);

#ifdef CONFIG_XARRAY_MULTI
int xa_get_order(struct xarray *, unsigned long index);
+int xas_get_order(struct xa_state *xas);
void xas_split(struct xa_state *, void *entry, unsigned int order);
void xas_split_alloc(struct xa_state *, void *entry, unsigned int order, gfp_t);
#else
@@ -1556,6 +1557,11 @@ static inline int xa_get_order(struct xarray *xa, unsigned long index)
return 0;
}

+static inline int xas_get_order(struct xa_state *xas)
+{
+ return 0;
+}
+
static inline void xas_split(struct xa_state *xas, void *entry,
unsigned int order)
{
diff --git a/lib/xarray.c b/lib/xarray.c
index 39f07bfc4dcc..88fbea481e1b 100644
--- a/lib/xarray.c
+++ b/lib/xarray.c
@@ -1750,39 +1750,52 @@ void *xa_store_range(struct xarray *xa, unsigned long first,
EXPORT_SYMBOL(xa_store_range);

/**
- * xa_get_order() - Get the order of an entry.
- * @xa: XArray.
- * @index: Index of the entry.
+ * xas_get_order() - Get the order of a loaded entry after xas_load.
+ * @xas: XArray operation state.
+ *
+ * Called after xas_load, the xas should not be in an error state.
*
* Return: A number between 0 and 63 indicating the order of the entry.
*/
-int xa_get_order(struct xarray *xa, unsigned long index)
+int xas_get_order(struct xa_state *xas)
{
- XA_STATE(xas, xa, index);
- void *entry;
int order = 0;

- rcu_read_lock();
- entry = xas_load(&xas);
-
- if (!entry)
- goto unlock;
-
- if (!xas.xa_node)
- goto unlock;
+ if (!xas->xa_node)
+ return 0;

for (;;) {
- unsigned int slot = xas.xa_offset + (1 << order);
+ unsigned int slot = xas->xa_offset + (1 << order);

if (slot >= XA_CHUNK_SIZE)
break;
- if (!xa_is_sibling(xas.xa_node->slots[slot]))
+ if (!xa_is_sibling(xas->xa_node->slots[slot]))
break;
order++;
}

- order += xas.xa_node->shift;
-unlock:
+ order += xas->xa_node->shift;
+ return order;
+}
+EXPORT_SYMBOL(xas_get_order);
+
+/**
+ * xa_get_order() - Get the order of an entry.
+ * @xa: XArray.
+ * @index: Index of the entry.
+ *
+ * Return: A number between 0 and 63 indicating the order of the entry.
+ */
+int xa_get_order(struct xarray *xa, unsigned long index)
+{
+ XA_STATE(xas, xa, index);
+ int order = 0;
+ void *entry;
+
+ rcu_read_lock();
+ entry = xas_load(&xas);
+ if (entry)
+ order = xas_get_order(&xas);
rcu_read_unlock();

return order;
--
2.43.0


2024-03-19 09:42:39

by Kairui Song

Subject: [PATCH 4/4] mm/filemap: optimize filemap folio adding

From: Kairui Song <[email protected]>

Instead of doing multiple tree walks, do one optimistic range check
with the lock held, and exit early if we raced with another insertion.
If a shadow entry exists, check its order with the new xas_get_order
helper before releasing the lock, to avoid a redundant tree walk just
for getting its order.

Drop the lock and do the allocation only if a split is needed.

In the best case, the tree only needs to be walked once. If an
allocation and split are needed, 3 walks are issued (one for the first
ranged conflict check and order retrieval, one for the second check
after the allocation, and one for the insert after the split).

Testing with 4K pages, in an 8G cgroup, with 20G brd as the block device:

fio -name=cached --numjobs=16 --filename=/mnt/test.img \
--buffered=1 --ioengine=mmap --rw=randread --time_based \
--ramp_time=30s --runtime=5m --group_reporting

Before:
bw ( MiB/s): min= 790, max= 3665, per=100.00%, avg=2499.17, stdev=20.64, samples=8698
iops : min=202295, max=938417, avg=639785.81, stdev=5284.08, samples=8698

After (+4%):
bw ( MiB/s): min= 451, max= 3868, per=100.00%, avg=2599.83, stdev=23.39, samples=8653
iops : min=115596, max=990364, avg=665556.34, stdev=5988.20, samples=8653

Test results with THP (do a THP randread first, then switch to 4K
pages, in the hope that this triggers a lot of splitting):

fio -name=cached --numjobs=16 --filename=/mnt/test.img \
--buffered=1 --ioengine mmap -thp=1 --readonly \
--rw=randread --random_distribution=random \
--time_based --runtime=5m --group_reporting

fio -name=cached --numjobs=16 --filename=/mnt/test.img \
--buffered=1 --ioengine mmap --readonly \
--rw=randread --random_distribution=random \
--time_based --runtime=5s --group_reporting

Before:
bw ( KiB/s): min=28071, max=62359, per=100.00%, avg=53542.44, stdev=179.77, samples=9520
iops : min= 7012, max=15586, avg=13379.39, stdev=44.94, samples=9520
bw ( MiB/s): min= 2457, max= 6193, per=100.00%, avg=3923.21, stdev=82.48, samples=144
iops : min=629220, max=1585642, avg=1004340.78, stdev=21116.07, samples=144

After (+-0.0%):
bw ( KiB/s): min=30561, max=63064, per=100.00%, avg=53635.82, stdev=177.21, samples=9520
iops : min= 7636, max=15762, avg=13402.82, stdev=44.29, samples=9520
bw ( MiB/s): min= 2449, max= 6145, per=100.00%, avg=3914.68, stdev=81.15, samples=144
iops : min=627106, max=1573156, avg=1002158.11, stdev=20774.77, samples=144

Performance is better (+4%) for 4K cached reads and unchanged for THP.

Signed-off-by: Kairui Song <[email protected]>
---
mm/filemap.c | 127 ++++++++++++++++++++++++++++++---------------------
1 file changed, 76 insertions(+), 51 deletions(-)

diff --git a/mm/filemap.c b/mm/filemap.c
index 6bbec8783793..c1484bcdbddb 100644
--- a/mm/filemap.c
+++ b/mm/filemap.c
@@ -848,12 +848,77 @@ void replace_page_cache_folio(struct folio *old, struct folio *new)
}
EXPORT_SYMBOL_GPL(replace_page_cache_folio);

+static int __split_add_folio_locked(struct xa_state *xas, struct folio *folio,
+ pgoff_t index, gfp_t gfp, void **shadowp)
+{
+ void *entry, *shadow, *alloced_shadow = NULL;
+ int order, alloced_order = 0;
+
+ gfp &= GFP_RECLAIM_MASK;
+ for (;;) {
+ shadow = NULL;
+ order = 0;
+
+ xas_for_each_conflict(xas, entry) {
+ if (!xa_is_value(entry))
+ return -EEXIST;
+ shadow = entry;
+ }
+
+ if (shadow) {
+ if (shadow == xas_reload(xas)) {
+ order = xas_get_order(xas);
+ if (order && order > folio_order(folio)) {
+ /* entry may have been split before we acquired lock */
+ if (shadow != alloced_shadow || order != alloced_order)
+ goto unlock;
+ xas_split(xas, shadow, order);
+ xas_reset(xas);
+ }
+ order = 0;
+ }
+ if (shadowp)
+ *shadowp = shadow;
+ }
+
+ xas_store(xas, folio);
+ /* Success, return with mapping locked */
+ if (!xas_error(xas))
+ return 0;
+unlock:
+ /*
+ * Unlock path, if errored, return unlocked.
+ * If allocation needed, alloc and retry.
+ */
+ xas_unlock_irq(xas);
+ if (order) {
+ if (unlikely(alloced_order))
+ xas_destroy(xas);
+ xas_split_alloc(xas, shadow, order, gfp);
+ if (!xas_error(xas)) {
+ alloced_shadow = shadow;
+ alloced_order = order;
+ }
+ goto next;
+ }
+ /* xas_nomem result checked by xas_error below */
+ xas_nomem(xas, gfp);
+next:
+ xas_lock_irq(xas);
+ if (xas_error(xas))
+ return xas_error(xas);
+
+ xas_reset(xas);
+ }
+}
+
noinline int __filemap_add_folio(struct address_space *mapping,
struct folio *folio, pgoff_t index, gfp_t gfp, void **shadowp)
{
XA_STATE(xas, &mapping->i_pages, index);
bool huge = folio_test_hugetlb(folio);
long nr;
+ int ret;

VM_BUG_ON_FOLIO(!folio_test_locked(folio), folio);
VM_BUG_ON_FOLIO(folio_test_swapbacked(folio), folio);
@@ -863,70 +928,30 @@ noinline int __filemap_add_folio(struct address_space *mapping,
xas_set_order(&xas, index, folio_order(folio));
nr = folio_nr_pages(folio);

- gfp &= GFP_RECLAIM_MASK;
folio_ref_add(folio, nr);
folio->mapping = mapping;
folio->index = xas.xa_index;

- do {
- unsigned int order = xa_get_order(xas.xa, xas.xa_index);
- void *entry, *old = NULL;
-
- if (order > folio_order(folio)) {
- xas_split_alloc(&xas, xa_load(xas.xa, xas.xa_index),
- order, gfp);
- if (xas_error(&xas))
- goto error;
- }
- xas_lock_irq(&xas);
- xas_for_each_conflict(&xas, entry) {
- old = entry;
- if (!xa_is_value(entry)) {
- xas_set_err(&xas, -EEXIST);
- goto unlock;
- }
- }
-
- if (old) {
- if (shadowp)
- *shadowp = old;
- /* entry may have been split before we acquired lock */
- order = xa_get_order(xas.xa, xas.xa_index);
- if (order > folio_order(folio)) {
- /* How to handle large swap entries? */
- BUG_ON(shmem_mapping(mapping));
- xas_split(&xas, old, order);
- xas_reset(&xas);
- }
- }
-
- xas_store(&xas, folio);
- if (xas_error(&xas))
- goto unlock;
-
+ xas_lock_irq(&xas);
+ ret = __split_add_folio_locked(&xas, folio, index, gfp, shadowp);
+ if (likely(!ret)) {
mapping->nrpages += nr;
-
- /* hugetlb pages do not participate in page cache accounting */
if (!huge) {
__lruvec_stat_mod_folio(folio, NR_FILE_PAGES, nr);
if (folio_test_pmd_mappable(folio))
__lruvec_stat_mod_folio(folio,
NR_FILE_THPS, nr);
}
-unlock:
xas_unlock_irq(&xas);
- } while (xas_nomem(&xas, gfp));
-
- if (xas_error(&xas))
- goto error;
+ trace_mm_filemap_add_to_page_cache(folio);
+ } else {
+ xas_unlock_irq(&xas);
+ folio->mapping = NULL;
+ /* Leave page->index set: truncation relies upon it */
+ folio_put_refs(folio, nr);
+ }

- trace_mm_filemap_add_to_page_cache(folio);
- return 0;
-error:
- folio->mapping = NULL;
- /* Leave page->index set: truncation relies upon it */
- folio_put_refs(folio, nr);
- return xas_error(&xas);
+ return ret;
}
ALLOW_ERROR_INJECTION(__filemap_add_folio, ERRNO);

--
2.43.0


2024-03-19 16:46:22

by Matthew Wilcox

Subject: Re: [PATCH 1/4] mm/filemap: return early if failed to allocate memory for split

On Tue, Mar 19, 2024 at 05:27:30PM +0800, Kairui Song wrote:
> From: Kairui Song <[email protected]>
>
> xas_split_alloc could fail with NOMEM, and in such case, it should abort
> early instead of keep going and fail the xas_split below.
>
> Signed-off-by: Kairui Song <[email protected]>

The usual way of programming with an xa_state is to not do any error
handling; the xas_ functions check for error and become no-ops, so we
only have to check at the end.
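
i.e. something like this idiom (just a sketch with placeholder
xa/index/item, not this patch):

        XA_STATE(xas, xa, index);

        do {
                xas_lock(&xas);
                xas_store(&xas, item);  /* becomes a no-op once xas holds an error */
                xas_unlock(&xas);
        } while (xas_nomem(&xas, GFP_KERNEL));

        return xas_error(&xas);         /* single error check at the end */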

I think this case is worth making an exception for because we avoid
acquiring/releasing the lock. Not that we should be getting ENOMEM
very often; we only allocate about 8 nodes usually and we get 7
nodes/page.
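
(Roughly: splitting a single order-9 entry down to order 0 needs
512 / XA_CHUNK_SIZE = 512 / 64 = 8 nodes at the bottom level, and with
struct xa_node at around 576 bytes, each 4K page gives about 7 nodes.)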

Acked-by: Matthew Wilcox (Oracle) <[email protected]>

> mm/filemap.c | 5 ++++-
> 1 file changed, 4 insertions(+), 1 deletion(-)
>
> diff --git a/mm/filemap.c b/mm/filemap.c
> index 7437b2bd75c1..f07ea0b97698 100644
> --- a/mm/filemap.c
> +++ b/mm/filemap.c
> @@ -880,9 +880,12 @@ noinline int __filemap_add_folio(struct address_space *mapping,
> unsigned int order = xa_get_order(xas.xa, xas.xa_index);
> void *entry, *old = NULL;
>
> - if (order > folio_order(folio))
> + if (order > folio_order(folio)) {
> xas_split_alloc(&xas, xa_load(xas.xa, xas.xa_index),
> order, gfp);
> + if (xas_error(&xas))
> + goto error;
> + }
> xas_lock_irq(&xas);
> xas_for_each_conflict(&xas, entry) {
> old = entry;
> --
> 2.43.0
>

2024-03-19 16:52:24

by Matthew Wilcox

Subject: Re: [PATCH 3/4] lib/xarray: introduce a new helper xas_get_order

On Tue, Mar 19, 2024 at 05:27:32PM +0800, Kairui Song wrote:
> From: Kairui Song <[email protected]>
>
> It can be used after xas_load to check the order of loaded entries.
> Compared to xa_get_order, it saves an XA_STATE and avoid a rewalk.
>
> Signed-off-by: Kairui Song <[email protected]>

Brilliant; yes. I was just looking at this the other day and wondering
why I hadn't done this.

Acked-by: Matthew Wilcox (Oracle) <[email protected]>

> +EXPORT_SYMBOL(xas_get_order);

We don't have a module user yet, so I'd hold off on this. It's an
unusual thing to want to do, and we may never have a modular user.
Also, xas functions are EXPORT_SYMBOL_GPL, not EXPORT_SYMBOL.

2024-03-19 17:52:24

by Matthew Wilcox

Subject: Re: [PATCH 2/4] mm/filemap: clean up hugetlb exclusion code

On Tue, Mar 19, 2024 at 05:27:31PM +0800, Kairui Song wrote:
> From: Kairui Song <[email protected]>
>
> __filemap_add_folio only has two callers, one never passes hugetlb
> folio and one always passes in hugetlb folio. So move the hugetlb
> related cgroup charging out of it to make the code cleaner.
>
> Signed-off-by: Kairui Song <[email protected]>

Ah; excellent. I once had a patch along these lines, but it never made
it in. Happy things got refactored to the point where it's now easy,
and mildly annoyed at myself that I hadn't spotted it yet.

Acked-by: Matthew Wilcox (Oracle) <[email protected]>

(I'm acking these on the assumption that Andrew's just going to take
them; I can collect them myself if that makes anybody else's life
easier)

2024-03-19 22:20:20

by Matthew Wilcox

Subject: Re: [PATCH 4/4] mm/filemap: optimize filemap folio adding

On Tue, Mar 19, 2024 at 05:27:33PM +0800, Kairui Song wrote:
> From: Kairui Song <[email protected]>
>
> Instead of doing multiple tree walks, do one optimism range check
> with lock hold, and exit if raced with another insertion. If a shadow
> exists, check it with a new xas_get_order helper before releasing the
> lock to avoid redundant tree walks for getting its order.
>
> Drop the lock and do the allocation only if a split is needed.
>
> In the best case, it only need to walk the tree once. If it needs
> to alloc and split, 3 walks are issued (One for first ranced
> conflict check and order retrieving, one for the second check after
> allocation, one for the insert after split).
>
> Testing with 4k pages, in an 8G cgroup, with 20G brd as block device:
>
> fio -name=cached --numjobs=16 --filename=/mnt/test.img \
> --buffered=1 --ioengine=mmap --rw=randread --time_based \
> --ramp_time=30s --runtime=5m --group_reporting
>
> Before:
> bw ( MiB/s): min= 790, max= 3665, per=100.00%, avg=2499.17, stdev=20.64, samples=8698
> iops : min=202295, max=938417, avg=639785.81, stdev=5284.08, samples=8698
>
> After (+4%):
> bw ( MiB/s): min= 451, max= 3868, per=100.00%, avg=2599.83, stdev=23.39, samples=8653
> iops : min=115596, max=990364, avg=665556.34, stdev=5988.20, samples=8653
>
> Test result with THP (do a THP randread then switch to 4K page in hope it
> issues a lot of splitting):
>
> fio -name=cached --numjobs=16 --filename=/mnt/test.img \
> --buffered=1 --ioengine mmap -thp=1 --readonly \
> --rw=randread --random_distribution=random \
> --time_based --runtime=5m --group_reporting
>
> fio -name=cached --numjobs=16 --filename=/mnt/test.img \
> --buffered=1 --ioengine mmap --readonly \
> --rw=randread --random_distribution=random \
> --time_based --runtime=5s --group_reporting
>
> Before:
> bw ( KiB/s): min=28071, max=62359, per=100.00%, avg=53542.44, stdev=179.77, samples=9520
> iops : min= 7012, max=15586, avg=13379.39, stdev=44.94, samples=9520
> bw ( MiB/s): min= 2457, max= 6193, per=100.00%, avg=3923.21, stdev=82.48, samples=144
> iops : min=629220, max=1585642, avg=1004340.78, stdev=21116.07, samples=144
>
> After (+-0.0%):
> bw ( KiB/s): min=30561, max=63064, per=100.00%, avg=53635.82, stdev=177.21, samples=9520
> iops : min= 7636, max=15762, avg=13402.82, stdev=44.29, samples=9520
> bw ( MiB/s): min= 2449, max= 6145, per=100.00%, avg=3914.68, stdev=81.15, samples=144
> iops : min=627106, max=1573156, avg=1002158.11, stdev=20774.77, samples=144
>
> The performance is better (+4%) for 4K cached read and unchanged for THP.
>
> Signed-off-by: Kairui Song <[email protected]>
> ---
> mm/filemap.c | 127 ++++++++++++++++++++++++++++++---------------------
> 1 file changed, 76 insertions(+), 51 deletions(-)
>
> diff --git a/mm/filemap.c b/mm/filemap.c
> index 6bbec8783793..c1484bcdbddb 100644
> --- a/mm/filemap.c
> +++ b/mm/filemap.c
> @@ -848,12 +848,77 @@ void replace_page_cache_folio(struct folio *old, struct folio *new)
> }
> EXPORT_SYMBOL_GPL(replace_page_cache_folio);
>
> +static int __split_add_folio_locked(struct xa_state *xas, struct folio *folio,
> + pgoff_t index, gfp_t gfp, void **shadowp)

I don't love the name of this function. Splitting is a rare thing that
it does. I'd suggest it's more filemap_store().

> +{
> + void *entry, *shadow, *alloced_shadow = NULL;
> + int order, alloced_order = 0;
> +
> + gfp &= GFP_RECLAIM_MASK;
> + for (;;) {
> + shadow = NULL;
> + order = 0;
> +
> + xas_for_each_conflict(xas, entry) {
> + if (!xa_is_value(entry))
> + return -EEXIST;
> + shadow = entry;
> + }
> +
> + if (shadow) {
> + if (shadow == xas_reload(xas)) {

Why do you need the xas_reload here?

> + order = xas_get_order(xas);
> + if (order && order > folio_order(folio)) {
> + /* entry may have been split before we acquired lock */
> + if (shadow != alloced_shadow || order != alloced_order)
> + goto unlock;
> + xas_split(xas, shadow, order);
> + xas_reset(xas);
> + }
> + order = 0;
> + }

I don't think this is right. I think we can end up skipping a split
and storing a folio into a slot which is of greater order than the folio
we're storing.

> + if (shadowp)
> + *shadowp = shadow;
> + }
> +
> + xas_store(xas, folio);
> + /* Success, return with mapping locked */
> + if (!xas_error(xas))
> + return 0;
> +unlock:
> + /*
> + * Unlock path, if errored, return unlocked.
> + * If allocation needed, alloc and retry.
> + */
> + xas_unlock_irq(xas);
> + if (order) {
> + if (unlikely(alloced_order))
> + xas_destroy(xas);
> + xas_split_alloc(xas, shadow, order, gfp);
> + if (!xas_error(xas)) {
> + alloced_shadow = shadow;
> + alloced_order = order;
> + }
> + goto next;
> + }
> + /* xas_nomem result checked by xas_error below */
> + xas_nomem(xas, gfp);
> +next:
> + xas_lock_irq(xas);
> + if (xas_error(xas))
> + return xas_error(xas);
> +
> + xas_reset(xas);
> + }
> +}

Splitting this out into a different function while changing the logic
really makes this hard to review ;-(

I don't object to splitting the function, but maybe two patches; one
to move the logic and a second to change it?

2024-03-20 09:07:26

by Kairui Song

Subject: Re: [PATCH 4/4] mm/filemap: optimize filemap folio adding

On Wed, Mar 20, 2024 at 6:20 AM Matthew Wilcox <[email protected]> wrote:
>
> On Tue, Mar 19, 2024 at 05:27:33PM +0800, Kairui Song wrote:
> > From: Kairui Song <[email protected]>
> >
> > Instead of doing multiple tree walks, do one optimism range check
> > with lock hold, and exit if raced with another insertion. If a shadow
> > exists, check it with a new xas_get_order helper before releasing the
> > lock to avoid redundant tree walks for getting its order.
> >
> > Drop the lock and do the allocation only if a split is needed.
> >
> > In the best case, it only need to walk the tree once. If it needs
> > to alloc and split, 3 walks are issued (One for first ranced
> > conflict check and order retrieving, one for the second check after
> > allocation, one for the insert after split).
> >
> > Testing with 4k pages, in an 8G cgroup, with 20G brd as block device:
> >
> > fio -name=cached --numjobs=16 --filename=/mnt/test.img \
> > --buffered=1 --ioengine=mmap --rw=randread --time_based \
> > --ramp_time=30s --runtime=5m --group_reporting
> >
> > Before:
> > bw ( MiB/s): min= 790, max= 3665, per=100.00%, avg=2499.17, stdev=20.64, samples=8698
> > iops : min=202295, max=938417, avg=639785.81, stdev=5284.08, samples=8698
> >
> > After (+4%):
> > bw ( MiB/s): min= 451, max= 3868, per=100.00%, avg=2599.83, stdev=23.39, samples=8653
> > iops : min=115596, max=990364, avg=665556.34, stdev=5988.20, samples=8653
> >
> > Test result with THP (do a THP randread then switch to 4K page in hope it
> > issues a lot of splitting):
> >
> > fio -name=cached --numjobs=16 --filename=/mnt/test.img \
> > --buffered=1 --ioengine mmap -thp=1 --readonly \
> > --rw=randread --random_distribution=random \
> > --time_based --runtime=5m --group_reporting
> >
> > fio -name=cached --numjobs=16 --filename=/mnt/test.img \
> > --buffered=1 --ioengine mmap --readonly \
> > --rw=randread --random_distribution=random \
> > --time_based --runtime=5s --group_reporting
> >
> > Before:
> > bw ( KiB/s): min=28071, max=62359, per=100.00%, avg=53542.44, stdev=179.77, samples=9520
> > iops : min= 7012, max=15586, avg=13379.39, stdev=44.94, samples=9520
> > bw ( MiB/s): min= 2457, max= 6193, per=100.00%, avg=3923.21, stdev=82.48, samples=144
> > iops : min=629220, max=1585642, avg=1004340.78, stdev=21116.07, samples=144
> >
> > After (+-0.0%):
> > bw ( KiB/s): min=30561, max=63064, per=100.00%, avg=53635.82, stdev=177.21, samples=9520
> > iops : min= 7636, max=15762, avg=13402.82, stdev=44.29, samples=9520
> > bw ( MiB/s): min= 2449, max= 6145, per=100.00%, avg=3914.68, stdev=81.15, samples=144
> > iops : min=627106, max=1573156, avg=1002158.11, stdev=20774.77, samples=144
> >
> > The performance is better (+4%) for 4K cached read and unchanged for THP.
> >
> > Signed-off-by: Kairui Song <[email protected]>
> > ---
> > mm/filemap.c | 127 ++++++++++++++++++++++++++++++---------------------
> > 1 file changed, 76 insertions(+), 51 deletions(-)
> >
> > diff --git a/mm/filemap.c b/mm/filemap.c
> > index 6bbec8783793..c1484bcdbddb 100644
> > --- a/mm/filemap.c
> > +++ b/mm/filemap.c
> > @@ -848,12 +848,77 @@ void replace_page_cache_folio(struct folio *old, struct folio *new)
> > }
> > EXPORT_SYMBOL_GPL(replace_page_cache_folio);
> >
> > +static int __split_add_folio_locked(struct xa_state *xas, struct folio *folio,
> > + pgoff_t index, gfp_t gfp, void **shadowp)
>

Thanks for the very helpful review!

> I don't love the name of this function. Splitting is a rare thing that
> it does. I'd suggest it's more filemap_store().

Yes, the function name is a bit misleading indeed. I can rename it as
you suggested, e.g. __filemap_store_locked?

>
> > +{
> > + void *entry, *shadow, *alloced_shadow = NULL;
> > + int order, alloced_order = 0;
> > +
> > + gfp &= GFP_RECLAIM_MASK;
> > + for (;;) {
> > + shadow = NULL;
> > + order = 0;
> > +
> > + xas_for_each_conflict(xas, entry) {
> > + if (!xa_is_value(entry))
> > + return -EEXIST;
> > + shadow = entry;
> > + }
> > +
> > + if (shadow) {
> > + if (shadow == xas_reload(xas)) {
>
> Why do you need the xas_reload here?

This part of the code relies on the guarantee that if there is a
larger entry, it will be the first and only entry iterated by
xas_for_each_conflict/xas_find_conflict. I added the xas_reload here to
ensure that, but on second thought it is indeed not needed.

Would it be better if I write this part this way?

+ shadow = NULL;
+ order = -1;
+ xas_for_each_conflict(xas, entry) {
+ if (!xa_is_value(entry))
+ return -EEXIST;
+ /*
+ * If there is a larger entry, it will be the first
+ * and only entry iterated.
+ */
+ if (order == -1)
+ order = xas_get_order(xas);
+ shadow = entry;
+ }
+
+ if (shadow) {
+ /* check if alloc & split needed, or if previous alloc is still valid */
+ if (order > 0 && order > folio_order(folio)) {
+ if (shadow != alloced_shadow || order != alloced_order)
+ goto unlock;
+ xas_split(xas, shadow, order);
+ xas_reset(xas);
+ }
+ order = -1;
+ if (shadowp)
+ *shadowp = shadow;
+ }

>
> > + order = xas_get_order(xas);
> > + if (order && order > folio_order(folio)) {
> > + /* entry may have been split before we acquired lock */
> > + if (shadow != alloced_shadow || order != alloced_order)
> > + goto unlock;
> > + xas_split(xas, shadow, order);
> > + xas_reset(xas);
> > + }
> > + order = 0;
> > + }
>
> I don't think this is right. I think we can end up skipping a split
> and storing a folio into a slot which is of greater order than the folio
> we're storing.

If there is a larger slot, xas_for_each_conflict and the check above
should catch that?

>
> > + if (shadowp)
> > + *shadowp = shadow;
> > + }
> > +
> > + xas_store(xas, folio);
> > + /* Success, return with mapping locked */
> > + if (!xas_error(xas))
> > + return 0;
> > +unlock:
> > + /*
> > + * Unlock path, if errored, return unlocked.
> > + * If allocation needed, alloc and retry.
> > + */
> > + xas_unlock_irq(xas);
> > + if (order) {
> > + if (unlikely(alloced_order))
> > + xas_destroy(xas);
> > + xas_split_alloc(xas, shadow, order, gfp);
> > + if (!xas_error(xas)) {
> > + alloced_shadow = shadow;
> > + alloced_order = order;
> > + }
> > + goto next;
> > + }
> > + /* xas_nomem result checked by xas_error below */
> > + xas_nomem(xas, gfp);
> > +next:
> > + xas_lock_irq(xas);
> > + if (xas_error(xas))
> > + return xas_error(xas);
> > +
> > + xas_reset(xas);
> > + }
> > +}
>
> Splitting this out into a different function while changing the logic
> really makes this hard to review ;-(

Sorry about this :(

This patch basically rewrites the logic of __filemap_add_folio, and
the function is getting long, so I thought it would be easier to
understand if we split it out. I initially updated the code in place,
but that diff seemed messier to me.

>
> I don't object to splitting the function, but maybe two patches; one
> to move the logic and a second to change it?
>

I can keep it in place in V2.

2024-03-21 18:36:23

by Kairui Song

Subject: Re: [PATCH 4/4] mm/filemap: optimize filemap folio adding

On Wed, Mar 20, 2024 at 5:06 PM Kairui Song <[email protected]> wrote:
>
> On Wed, Mar 20, 2024 at 6:20 AM Matthew Wilcox <[email protected]> wrote:
> >
> > On Tue, Mar 19, 2024 at 05:27:33PM +0800, Kairui Song wrote:
> > > From: Kairui Song <[email protected]>
> > >
> > > Instead of doing multiple tree walks, do one optimism range check
> > > with lock hold, and exit if raced with another insertion. If a shadow
> > > exists, check it with a new xas_get_order helper before releasing the
> > > lock to avoid redundant tree walks for getting its order.
> > >
> > > Drop the lock and do the allocation only if a split is needed.
> > >
> > > In the best case, it only need to walk the tree once. If it needs
> > > to alloc and split, 3 walks are issued (One for first ranced
> > > conflict check and order retrieving, one for the second check after
> > > allocation, one for the insert after split).
> > >
> > > Testing with 4k pages, in an 8G cgroup, with 20G brd as block device:
> > >
> > > fio -name=cached --numjobs=16 --filename=/mnt/test.img \
> > > --buffered=1 --ioengine=mmap --rw=randread --time_based \
> > > --ramp_time=30s --runtime=5m --group_reporting
> > >
> > > Before:
> > > bw ( MiB/s): min= 790, max= 3665, per=100.00%, avg=2499.17, stdev=20.64, samples=8698
> > > iops : min=202295, max=938417, avg=639785.81, stdev=5284.08, samples=8698
> > >
> > > After (+4%):
> > > bw ( MiB/s): min= 451, max= 3868, per=100.00%, avg=2599.83, stdev=23.39, samples=8653
> > > iops : min=115596, max=990364, avg=665556.34, stdev=5988.20, samples=8653
> > >
> > > Test result with THP (do a THP randread then switch to 4K page in hope it
> > > issues a lot of splitting):
> > >
> > > fio -name=cached --numjobs=16 --filename=/mnt/test.img \
> > > --buffered=1 --ioengine mmap -thp=1 --readonly \
> > > --rw=randread --random_distribution=random \
> > > --time_based --runtime=5m --group_reporting
> > >
> > > fio -name=cached --numjobs=16 --filename=/mnt/test.img \
> > > --buffered=1 --ioengine mmap --readonly \
> > > --rw=randread --random_distribution=random \
> > > --time_based --runtime=5s --group_reporting
> > >
> > > Before:
> > > bw ( KiB/s): min=28071, max=62359, per=100.00%, avg=53542.44, stdev=179.77, samples=9520
> > > iops : min= 7012, max=15586, avg=13379.39, stdev=44.94, samples=9520
> > > bw ( MiB/s): min= 2457, max= 6193, per=100.00%, avg=3923.21, stdev=82.48, samples=144
> > > iops : min=629220, max=1585642, avg=1004340.78, stdev=21116.07, samples=144
> > >
> > > After (+-0.0%):
> > > bw ( KiB/s): min=30561, max=63064, per=100.00%, avg=53635.82, stdev=177.21, samples=9520
> > > iops : min= 7636, max=15762, avg=13402.82, stdev=44.29, samples=9520
> > > bw ( MiB/s): min= 2449, max= 6145, per=100.00%, avg=3914.68, stdev=81.15, samples=144
> > > iops : min=627106, max=1573156, avg=1002158.11, stdev=20774.77, samples=144
> > >
> > > The performance is better (+4%) for 4K cached read and unchanged for THP.
> > >
> > > Signed-off-by: Kairui Song <[email protected]>
> > > ---
> > > mm/filemap.c | 127 ++++++++++++++++++++++++++++++---------------------
> > > 1 file changed, 76 insertions(+), 51 deletions(-)
> > >
> > > diff --git a/mm/filemap.c b/mm/filemap.c
> > > index 6bbec8783793..c1484bcdbddb 100644
> > > --- a/mm/filemap.c
> > > +++ b/mm/filemap.c
> > > @@ -848,12 +848,77 @@ void replace_page_cache_folio(struct folio *old, struct folio *new)
> > > }
> > > EXPORT_SYMBOL_GPL(replace_page_cache_folio);
> > >
> > > +static int __split_add_folio_locked(struct xa_state *xas, struct folio *folio,
> > > + pgoff_t index, gfp_t gfp, void **shadowp)
> >
>
> Thanks for the very helpful review!
>
> > I don't love the name of this function. Splitting is a rare thing that
> > it does. I'd suggest it's more filemap_store().
>
> Yes, the function name is a bit misleading indeed, I can rename it as
> you suggested, eg. __filemap_store_locked ?
>
> >
> > > +{
> > > + void *entry, *shadow, *alloced_shadow = NULL;
> > > + int order, alloced_order = 0;
> > > +
> > > + gfp &= GFP_RECLAIM_MASK;
> > > + for (;;) {
> > > + shadow = NULL;
> > > + order = 0;
> > > +
> > > + xas_for_each_conflict(xas, entry) {
> > > + if (!xa_is_value(entry))
> > > + return -EEXIST;
> > > + shadow = entry;
> > > + }
> > > +
> > > + if (shadow) {
> > > + if (shadow == xas_reload(xas)) {
> >
> > Why do you need the xas_reload here?
>
> This part of code works based on the guarantee that If there is a
> larger entry, it will be the first and only entry iterated by
> xas_for_each_conflict/xas_find_conflict. I added an xas_reload is here
> to ensure that. But on second thought, this seems not needed indeed.
>
> Will it be better if I write this part this way?
>
> + shadow = NULL;
> + order = -1;
> + xas_for_each_conflict(xas, entry) {
> + if (!xa_is_value(entry))
> + return -EEXIST;

I noticed this should release any previously allocated xas data; will
fix this in v2.

> + /*
> + * If there is a larger entry, it will be the first
> + * and only entry iterated.
> + */
> + if (order == -1)
> + order = xas_get_order(xas);
> + shadow = entry;
> + }
> +
> + if (shadow) {
> + /* check if alloc & split needed, or if previous alloc is still valid */
> + if (order > 0 && order > folio_order(folio)) {
> + if (shadow != alloced_shadow || order != alloced_order)
> + goto unlock;
> + xas_split(xas, shadow, order);
> + xas_reset(xas);
> + }
> + order = -1;
> + if (shadowp)
> + *shadowp = shadow;
> + }
>

Besides this, this should be OK? I think I can add more tests for
xas_for_each_conflict and xas_get_order to ensure this works; that will
need xas_get_order exported as a GPL symbol.

>
> If there is a larger slot, xas_for_each_conflict and check above
> should catch that?
>
> >
> > > + if (shadowp)
> > > + *shadowp = shadow;
> > > + }
> > > +
> > > + xas_store(xas, folio);
> > > + /* Success, return with mapping locked */
> > > + if (!xas_error(xas))
> > > + return 0;
> > > +unlock:
> > > + /*
> > > + * Unlock path, if errored, return unlocked.
> > > + * If allocation needed, alloc and retry.
> > > + */
> > > + xas_unlock_irq(xas);
> > > + if (order) {
> > > + if (unlikely(alloced_order))
> > > + xas_destroy(xas);
> > > + xas_split_alloc(xas, shadow, order, gfp);
> > > + if (!xas_error(xas)) {
> > > + alloced_shadow = shadow;
> > > + alloced_order = order;
> > > + }
> > > + goto next;
> > > + }
> > > + /* xas_nomem result checked by xas_error below */
> > > + xas_nomem(xas, gfp);
> > > +next:
> > > + xas_lock_irq(xas);
> > > + if (xas_error(xas))
> > > + return xas_error(xas);
> > > +
> > > + xas_reset(xas);
> > > + }
> > > +}
> >
> > Splitting this out into a different function while changing the logic
> > really makes this hard to review ;-(
>
> Sorry about this :(
>
> This patch basically rewrites the logic of __filemap_add_folio and the
> function is getting long, so I thought it would be easier to
> understand if we split it out.
> I initially updated the code in place but that change diff seems more
> messy to me.
>
> >
> > I don't object to splitting the function, but maybe two patches; one
> > to move the logic and a second to change it?
> >
>
> I can keep it in place in V2.