2021-03-20 05:46:15

by Matthew Wilcox

Subject: [PATCH v5 00/27] Memory Folios

Managing memory in 4KiB pages is a serious overhead. Many benchmarks
exist which show the benefits of a larger "page size". As an example,
an earlier iteration of this idea which used compound pages got a 7%
performance boost when compiling the kernel using kernbench without any
particular tuning.

Using compound pages or THPs exposes a serious weakness in our type
system. Functions are often unprepared for compound pages to be passed
to them, and may only act on PAGE_SIZE chunks. Even functions which are
aware of compound pages may expect a head page, and do the wrong thing
if passed a tail page.

There have been efforts to label function parameters as 'head' instead
of 'page' to indicate that the function expects a head page, but this
leaves us with runtime assertions instead of using the compiler to prove
that nobody has mistakenly passed a tail page. Calling a struct page
'head' is also inaccurate, as these functions work perfectly well on
base pages. The term 'nottail' has not proven popular.

We also waste a lot of instructions ensuring that we're not looking at
a tail page. Almost every call to PageFoo() contains one or more hidden
calls to compound_head(). This also happens for get_page(), put_page()
and many more functions. There does not appear to be a way to tell gcc
that it can cache the result of compound_head(), nor is there a way to
tell it that compound_head() is idempotent.
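
To make the hidden cost concrete, here is a rough sketch of the shape
these helpers take today (simplified for illustration; the real
definitions are generated by macros in include/linux/page-flags.h):

    /* Sketch only -- not the literal kernel definitions. */
    static inline int PageDirty(struct page *page)
    {
            /* compound_head() is paid on every call, even for base pages */
            return test_bit(PG_dirty, &compound_head(page)->flags);
    }

    static inline void get_page(struct page *page)
    {
            page = compound_head(page);     /* another hidden lookup */
            page_ref_inc(page);
    }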

This series introduces the 'struct folio' as a replacement for
head-or-base pages. This initial set reduces the kernel size by
approximately 6kB, although its real purpose is adding infrastructure
to enable further use of the folio.
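
As a minimal sketch of the idea (the real definition is introduced in
patch 04, "mm: Introduce struct folio", and the real page_folio() is
more careful than this), a folio is a struct page that is guaranteed
not to be a tail page:

    /* Minimal sketch, assuming a folio simply wraps a head or base page. */
    struct folio {
            struct page page;
    };

    /* Convert any page pointer to its enclosing folio. */
    static inline struct folio *page_folio(struct page *page)
    {
            return (struct folio *)compound_head(page);
    }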

The intent is to convert all filesystems and some device drivers to work
in terms of folios. This series contains a lot of explicit conversions,
but it's important to realise it's removing a lot of implicit conversions
in some relatively hot paths. There will be very few conversions from
folios back to pages once this work is completed; filesystems, the page
cache, the LRU and so on will generally deal only with folios.

I analysed the text size reduction using a config based on Oracle UEK
with all modules changed to built-in. That's obviously not a kernel
which makes sense to run, but it serves to compare the effects on (many
common) filesystems & drivers, not just the core.

add/remove: 33645/33632 grow/shrink: 1850/1924 up/down: 894474/-899674 (-5200)

Current tree at:
https://git.infradead.org/users/willy/pagecache.git/shortlog/refs/heads/folio

(contains another ~100 patches on top of this batch, not all of which are
in good shape for submission)

v5:
- Rebase on next-20210319
- Pull out three bug-fix patches to the front of the series, allowing
them to be applied earlier.
- Fix folio_page() against pages being moved between swap & page cache
- Fix FolioDoubleMap to use the right page flags
- Rename next_folio() to folio_next() (akpm)
- Renamed folio stat functions (akpm)
- Add 'mod' versions of the folio stats for users that already have 'nr'
- Renamed folio_page to folio_file_page() (akpm)
- Added kernel-doc for struct folio, folio_next(), folio_index(),
folio_file_page(), folio_contains(), folio_order(), folio_nr_pages(),
folio_shift(), folio_size(), page_folio(), get_folio(), put_folio()
- Make folio_private() work in terms of void * instead of unsigned long
- Used page_folio() in attach/detach page_private() (hch)
- Drop afs_page_mkwrite folio conversion from this series
- Add wait_on_folio_writeback_killable()
- Convert add_page_wait_queue() to add_folio_wait_queue()
- Add folio_swap_entry() helper
- Drop the additions of *FolioFsCache
- Simplify the addition of lock_folio_memcg() et al
- Drop test_clear_page_writeback() conversion from this series
- Add FolioTransHuge() definition
- Rename __folio_file_mapping() to swapcache_mapping()
- Added swapcache_index() helper
- Removed lock_folio_async()
- Made __lock_folio_async() static to filemap.c
- Converted unlock_page_private_2() to use a folio internally
v4:
- Rebase on current Linus tree (including swap fix)
- Analyse each patch in terms of its effects on kernel text size.
A few were modified to improve their effect. In particular, where
pushing calls to page_folio() into the callers resulted in unacceptable
size increases, the wrapper was placed in mm/folio-compat.c. This lets
us see all the places which are good targets for conversion to folios.
- Some of the patches were reordered, split or merged in order to make
more logical sense.
- Use nth_page() for folio_next() if we're using SPARSEMEM and not
VMEMMAP (Zi Yan)
- Increment and decrement page stats in units of pages instead of units
of folios (Zi Yan)
v3:
- Rebase on next-20210127. Two major sources of conflict, the
generic_file_buffered_read refactoring (in akpm tree) and the
fscache work (in dhowells tree).
v2:
- Pare patch series back to just infrastructure and the page waiting
parts.

Matthew Wilcox (Oracle) (27):
fs/cachefiles: Remove wait_bit_key layout dependency
mm/writeback: Add wait_on_page_writeback_killable
afs: Use wait_on_page_writeback_killable
mm: Introduce struct folio
mm: Add folio_pgdat and folio_zone
mm/vmstat: Add functions to account folio statistics
mm/debug: Add VM_BUG_ON_FOLIO and VM_WARN_ON_ONCE_FOLIO
mm: Add put_folio
mm: Add get_folio
mm: Create FolioFlags
mm: Handle per-folio private data
mm: Add folio_index, folio_file_page and folio_contains
mm/util: Add folio_mapping and folio_file_mapping
mm/memcg: Add folio wrappers for various functions
mm/filemap: Add unlock_folio
mm/filemap: Add lock_folio
mm/filemap: Add lock_folio_killable
mm/filemap: Add __lock_folio_async
mm/filemap: Add __lock_folio_or_retry
mm/filemap: Add wait_on_folio_locked
mm/filemap: Add end_folio_writeback
mm/writeback: Add wait_on_folio_writeback
mm/writeback: Add wait_for_stable_folio
mm/filemap: Convert wait_on_page_bit to wait_on_folio_bit
mm/filemap: Convert wake_up_page_bit to wake_up_folio_bit
mm/filemap: Convert page wait queues to be folios
mm/doc: Build kerneldoc for various mm files

Documentation/core-api/mm-api.rst | 7 +
fs/afs/write.c | 3 +-
fs/cachefiles/rdwr.c | 19 ++-
fs/io_uring.c | 2 +-
include/linux/memcontrol.h | 21 +++
include/linux/mm.h | 156 +++++++++++++++----
include/linux/mm_types.h | 52 +++++++
include/linux/mmdebug.h | 20 +++
include/linux/netfs.h | 2 +-
include/linux/page-flags.h | 120 +++++++++++---
include/linux/pagemap.h | 249 ++++++++++++++++++++++--------
include/linux/swap.h | 6 +
include/linux/vmstat.h | 107 +++++++++++++
mm/Makefile | 2 +-
mm/filemap.c | 237 ++++++++++++++--------------
mm/folio-compat.c | 37 +++++
mm/memory.c | 8 +-
mm/page-writeback.c | 62 ++++++--
mm/swapfile.c | 8 +-
mm/util.c | 30 ++--
20 files changed, 857 insertions(+), 291 deletions(-)
create mode 100644 mm/folio-compat.c

--
2.30.2


2021-03-20 05:46:41

by Matthew Wilcox

Subject: [PATCH v5 01/27] fs/cachefiles: Remove wait_bit_key layout dependency

Cachefiles was relying on wait_page_key and wait_bit_key being the
same layout, which is fragile. Now that wait_page_key is exposed in
the pagemap.h header, we can remove that fragility.
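
For reference, the two structures look roughly like this (abridged
sketches; only the leading members are relevant), which is why the cast
in cachefiles_read_waiter() happened to work:

    /* Abridged sketch of include/linux/wait_bit.h */
    struct wait_bit_key {
            void *flags;
            int bit_nr;
            /* ... */
    };

    /* Abridged sketch of include/linux/pagemap.h */
    struct wait_page_key {
            struct page *page;
            int bit_nr;
            /* ... */
    };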

Signed-off-by: Matthew Wilcox (Oracle) <[email protected]>
---
fs/cachefiles/rdwr.c | 7 +++----
include/linux/pagemap.h | 1 -
2 files changed, 3 insertions(+), 5 deletions(-)

diff --git a/fs/cachefiles/rdwr.c b/fs/cachefiles/rdwr.c
index e027c718ca01..8ffc40e84a59 100644
--- a/fs/cachefiles/rdwr.c
+++ b/fs/cachefiles/rdwr.c
@@ -24,17 +24,16 @@ static int cachefiles_read_waiter(wait_queue_entry_t *wait, unsigned mode,
container_of(wait, struct cachefiles_one_read, monitor);
struct cachefiles_object *object;
struct fscache_retrieval *op = monitor->op;
- struct wait_bit_key *key = _key;
+ struct wait_page_key *key = _key;
struct page *page = wait->private;

ASSERT(key);

_enter("{%lu},%u,%d,{%p,%u}",
monitor->netfs_page->index, mode, sync,
- key->flags, key->bit_nr);
+ key->page, key->bit_nr);

- if (key->flags != &page->flags ||
- key->bit_nr != PG_locked)
+ if (key->page != page || key->bit_nr != PG_locked)
return 0;

_debug("--- monitor %p %lx ---", page, page->flags);
diff --git a/include/linux/pagemap.h b/include/linux/pagemap.h
index f68fe61c1dec..139678f382ff 100644
--- a/include/linux/pagemap.h
+++ b/include/linux/pagemap.h
@@ -574,7 +574,6 @@ static inline pgoff_t linear_page_index(struct vm_area_struct *vma,
return pgoff;
}

-/* This has the same layout as wait_bit_key - see fs/cachefiles/rdwr.c */
struct wait_page_key {
struct page *page;
int bit_nr;
--
2.30.2

2021-03-20 10:53:07

by Matthew Wilcox

Subject: [PATCH v5 02/27] mm/writeback: Add wait_on_page_writeback_killable

This is the killable version of wait_on_page_writeback.
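
A typical caller backs out if a fatal signal arrives while waiting;
patch 03 converts afs_page_mkwrite() to exactly this pattern:

    /* Sketch of the calling pattern (see the afs conversion in patch 03). */
    if (wait_on_page_writeback_killable(page))
            return VM_FAULT_RETRY;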

Signed-off-by: Matthew Wilcox (Oracle) <[email protected]>
---
include/linux/pagemap.h | 1 +
mm/page-writeback.c | 16 ++++++++++++++++
2 files changed, 17 insertions(+)

diff --git a/include/linux/pagemap.h b/include/linux/pagemap.h
index 139678f382ff..8c844ba67785 100644
--- a/include/linux/pagemap.h
+++ b/include/linux/pagemap.h
@@ -698,6 +698,7 @@ static inline int wait_on_page_locked_killable(struct page *page)

int put_and_wait_on_page_locked(struct page *page, int state);
void wait_on_page_writeback(struct page *page);
+int wait_on_page_writeback_killable(struct page *page);
extern void end_page_writeback(struct page *page);
void wait_for_stable_page(struct page *page);

diff --git a/mm/page-writeback.c b/mm/page-writeback.c
index f6c2c3165d4d..5e761fb62800 100644
--- a/mm/page-writeback.c
+++ b/mm/page-writeback.c
@@ -2830,6 +2830,22 @@ void wait_on_page_writeback(struct page *page)
}
EXPORT_SYMBOL_GPL(wait_on_page_writeback);

+/*
+ * Wait for a page to complete writeback. Returns -EINTR if we get a
+ * fatal signal while waiting.
+ */
+int wait_on_page_writeback_killable(struct page *page)
+{
+ while (PageWriteback(page)) {
+ trace_wait_on_page_writeback(page, page_mapping(page));
+ if (wait_on_page_bit_killable(page, PG_writeback))
+ return -EINTR;
+ }
+
+ return 0;
+}
+EXPORT_SYMBOL_GPL(wait_on_page_writeback_killable);
+
/**
* wait_for_stable_page() - wait for writeback to finish, if necessary.
* @page: The page to wait on.
--
2.30.2

2021-03-20 10:53:07

by Matthew Wilcox

Subject: [PATCH v5 09/27] mm: Add get_folio

If we know we have a folio, we can call get_folio() instead of get_page()
and save the overhead of calling compound_head().
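
A hedged sketch of the intended calling pattern for code that already
holds a folio (the helper name here is made up for illustration;
put_folio() is added in patch 08):

    /* Hypothetical helper: take a temporary reference on a known folio. */
    static void frob_folio(struct folio *folio)
    {
            get_folio(folio);       /* no compound_head() needed */
            /* ... operate on the folio ... */
            put_folio(folio);       /* added in patch 08 */
    }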

Signed-off-by: Matthew Wilcox (Oracle) <[email protected]>
Reviewed-by: Zi Yan <[email protected]>
---
include/linux/mm.h | 26 +++++++++++++++++---------
1 file changed, 17 insertions(+), 9 deletions(-)

diff --git a/include/linux/mm.h b/include/linux/mm.h
index 5052479febc7..8fc7b04a1438 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -1198,18 +1198,26 @@ static inline bool is_pci_p2pdma_page(const struct page *page)
}

/* 127: arbitrary random number, small enough to assemble well */
-#define page_ref_zero_or_close_to_overflow(page) \
- ((unsigned int) page_ref_count(page) + 127u <= 127u)
+#define folio_ref_zero_or_close_to_overflow(folio) \
+ ((unsigned int) page_ref_count(&folio->page) + 127u <= 127u)
+
+/**
+ * get_folio - Increment the reference count on a folio.
+ * @folio: The folio.
+ *
+ * Context: May be called in any context, as long as you know that
+ * you have a refcount on the folio. If you do not already have one,
+ * try_grab_page() may be the right interface for you to use.
+ */
+static inline void get_folio(struct folio *folio)
+{
+ VM_BUG_ON_FOLIO(folio_ref_zero_or_close_to_overflow(folio), folio);
+ page_ref_inc(&folio->page);
+}

static inline void get_page(struct page *page)
{
- page = compound_head(page);
- /*
- * Getting a normal page or the head of a compound page
- * requires to already have an elevated page->_refcount.
- */
- VM_BUG_ON_PAGE(page_ref_zero_or_close_to_overflow(page), page);
- page_ref_inc(page);
+ get_folio(page_folio(page));
}

bool __must_check try_grab_page(struct page *page, unsigned int flags);
--
2.30.2

2021-03-20 10:53:07

by Matthew Wilcox

Subject: [PATCH v5 05/27] mm: Add folio_pgdat and folio_zone

These are just convenience wrappers for callers with folios; pgdat and
zone can be reached from tail pages as well as head pages.

Signed-off-by: Matthew Wilcox (Oracle) <[email protected]>
Reviewed-by: Zi Yan <[email protected]>
---
include/linux/mm.h | 10 ++++++++++
1 file changed, 10 insertions(+)

diff --git a/include/linux/mm.h b/include/linux/mm.h
index 9b7e3fa12fd3..e176e9c9990f 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -1544,6 +1544,16 @@ static inline pg_data_t *page_pgdat(const struct page *page)
return NODE_DATA(page_to_nid(page));
}

+static inline struct zone *folio_zone(const struct folio *folio)
+{
+ return page_zone(&folio->page);
+}
+
+static inline pg_data_t *folio_pgdat(const struct folio *folio)
+{
+ return page_pgdat(&folio->page);
+}
+
#ifdef SECTION_IN_PAGE_FLAGS
static inline void set_page_section(struct page *page, unsigned long section)
{
--
2.30.2

2021-03-20 10:53:07

by Matthew Wilcox

Subject: [PATCH v5 03/27] afs: Use wait_on_page_writeback_killable

Open-coding this function meant it missed out on the recent bugfix
for waiters being woken by a delayed wake event from a previous
instantiation of the page.

Signed-off-by: Matthew Wilcox (Oracle) <[email protected]>
---
fs/afs/write.c | 3 +--
1 file changed, 1 insertion(+), 2 deletions(-)

diff --git a/fs/afs/write.c b/fs/afs/write.c
index b2e03de09c24..106a864b6a93 100644
--- a/fs/afs/write.c
+++ b/fs/afs/write.c
@@ -850,8 +850,7 @@ vm_fault_t afs_page_mkwrite(struct vm_fault *vmf)
return VM_FAULT_RETRY;
#endif

- if (PageWriteback(page) &&
- wait_on_page_bit_killable(page, PG_writeback) < 0)
+ if (wait_on_page_writeback_killable(page))
return VM_FAULT_RETRY;

if (lock_page_killable(page) < 0)
--
2.30.2

2021-03-20 10:53:07

by Matthew Wilcox

Subject: [PATCH v5 07/27] mm/debug: Add VM_BUG_ON_FOLIO and VM_WARN_ON_ONCE_FOLIO

These are the folio equivalents of VM_BUG_ON_PAGE and VM_WARN_ON_ONCE_PAGE.

Signed-off-by: Matthew Wilcox (Oracle) <[email protected]>
Reviewed-by: Zi Yan <[email protected]>
---
include/linux/mmdebug.h | 20 ++++++++++++++++++++
1 file changed, 20 insertions(+)

diff --git a/include/linux/mmdebug.h b/include/linux/mmdebug.h
index 5d0767cb424a..77d24e1dcaec 100644
--- a/include/linux/mmdebug.h
+++ b/include/linux/mmdebug.h
@@ -23,6 +23,13 @@ void dump_mm(const struct mm_struct *mm);
BUG(); \
} \
} while (0)
+#define VM_BUG_ON_FOLIO(cond, folio) \
+ do { \
+ if (unlikely(cond)) { \
+ dump_page(&folio->page, "VM_BUG_ON_FOLIO(" __stringify(cond)")");\
+ BUG(); \
+ } \
+ } while (0)
#define VM_BUG_ON_VMA(cond, vma) \
do { \
if (unlikely(cond)) { \
@@ -48,6 +55,17 @@ void dump_mm(const struct mm_struct *mm);
} \
unlikely(__ret_warn_once); \
})
+#define VM_WARN_ON_ONCE_FOLIO(cond, folio) ({ \
+ static bool __section(".data.once") __warned; \
+ int __ret_warn_once = !!(cond); \
+ \
+ if (unlikely(__ret_warn_once && !__warned)) { \
+ dump_page(&folio->page, "VM_WARN_ON_ONCE_FOLIO(" __stringify(cond)")");\
+ __warned = true; \
+ WARN_ON(1); \
+ } \
+ unlikely(__ret_warn_once); \
+})

#define VM_WARN_ON(cond) (void)WARN_ON(cond)
#define VM_WARN_ON_ONCE(cond) (void)WARN_ON_ONCE(cond)
@@ -56,11 +74,13 @@ void dump_mm(const struct mm_struct *mm);
#else
#define VM_BUG_ON(cond) BUILD_BUG_ON_INVALID(cond)
#define VM_BUG_ON_PAGE(cond, page) VM_BUG_ON(cond)
+#define VM_BUG_ON_FOLIO(cond, folio) VM_BUG_ON(cond)
#define VM_BUG_ON_VMA(cond, vma) VM_BUG_ON(cond)
#define VM_BUG_ON_MM(cond, mm) VM_BUG_ON(cond)
#define VM_WARN_ON(cond) BUILD_BUG_ON_INVALID(cond)
#define VM_WARN_ON_ONCE(cond) BUILD_BUG_ON_INVALID(cond)
#define VM_WARN_ON_ONCE_PAGE(cond, page) BUILD_BUG_ON_INVALID(cond)
+#define VM_WARN_ON_ONCE_FOLIO(cond, folio) BUILD_BUG_ON_INVALID(cond)
#define VM_WARN_ONCE(cond, format...) BUILD_BUG_ON_INVALID(cond)
#define VM_WARN(cond, format...) BUILD_BUG_ON_INVALID(cond)
#endif
--
2.30.2

2021-03-20 10:53:07

by Matthew Wilcox

Subject: [PATCH v5 08/27] mm: Add put_folio

If we know we have a folio, we can call put_folio() instead of put_page()
and save the overhead of calling compound_head(). It also skips the
devmap checks.

This commit looks like it should be a no-op, but actually saves 1714 bytes
of text with the distro-derived config that I'm testing. Some functions
grow a little while others shrink. I presume the compiler is making
different inlining decisions.

Signed-off-by: Matthew Wilcox (Oracle) <[email protected]>
Reviewed-by: Zi Yan <[email protected]>
---
include/linux/mm.h | 28 +++++++++++++++++++++++-----
1 file changed, 23 insertions(+), 5 deletions(-)

diff --git a/include/linux/mm.h b/include/linux/mm.h
index e176e9c9990f..5052479febc7 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -1226,9 +1226,28 @@ static inline __must_check bool try_get_page(struct page *page)
return true;
}

+/**
+ * put_folio - Decrement the reference count on a folio.
+ * @folio: The folio.
+ *
+ * If the folio's reference count reaches zero, the memory will be
+ * released back to the page allocator and may be used by another
+ * allocation immediately. Do not access the memory or the struct folio
+ * after calling put_folio() unless you can be sure that it wasn't the
+ * last reference.
+ *
+ * Context: May be called in process or interrupt context, but not in NMI
+ * context. May be called while holding a spinlock.
+ */
+static inline void put_folio(struct folio *folio)
+{
+ if (put_page_testzero(&folio->page))
+ __put_page(&folio->page);
+}
+
static inline void put_page(struct page *page)
{
- page = compound_head(page);
+ struct folio *folio = page_folio(page);

/*
* For devmap managed pages we need to catch refcount transition from
@@ -1236,13 +1255,12 @@ static inline void put_page(struct page *page)
* need to inform the device driver through callback. See
* include/linux/memremap.h and HMM for details.
*/
- if (page_is_devmap_managed(page)) {
- put_devmap_managed_page(page);
+ if (page_is_devmap_managed(&folio->page)) {
+ put_devmap_managed_page(&folio->page);
return;
}

- if (put_page_testzero(page))
- __put_page(page);
+ put_folio(folio);
}

/*
--
2.30.2

2021-03-20 10:53:07

by Matthew Wilcox

Subject: [PATCH v5 06/27] mm/vmstat: Add functions to account folio statistics

Allow page counters to be more readily modified by callers which have
a folio. Name these wrappers with 'stat' instead of 'state' as requested
by Linus here:
https://lore.kernel.org/linux-mm/CAHk-=wj847SudR-kt+46fT3+xFFgiwpgThvm7DJWGdi4cVrbnQ@mail.gmail.com/
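
As a hedged example of the intended use (the call site is hypothetical;
real conversions come in later patches, and NR_FILE_PAGES is an
existing node_stat_item):

    /* Hypothetical sketch: account all pages of a folio in node stats. */
    __node_stat_add_folio(folio, NR_FILE_PAGES);
    /* ... and the matching decrement when the folio is removed: */
    __node_stat_sub_folio(folio, NR_FILE_PAGES);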

Signed-off-by: Matthew Wilcox (Oracle) <[email protected]>
---
include/linux/vmstat.h | 107 +++++++++++++++++++++++++++++++++++++++++
1 file changed, 107 insertions(+)

diff --git a/include/linux/vmstat.h b/include/linux/vmstat.h
index 3299cd69e4ca..d287d7c31b8f 100644
--- a/include/linux/vmstat.h
+++ b/include/linux/vmstat.h
@@ -402,6 +402,78 @@ static inline void drain_zonestat(struct zone *zone,
struct per_cpu_pageset *pset) { }
#endif /* CONFIG_SMP */

+static inline void __zone_stat_mod_folio(struct folio *folio,
+ enum zone_stat_item item, long nr)
+{
+ __mod_zone_page_state(folio_zone(folio), item, nr);
+}
+
+static inline void __zone_stat_add_folio(struct folio *folio,
+ enum zone_stat_item item)
+{
+ __mod_zone_page_state(folio_zone(folio), item, folio_nr_pages(folio));
+}
+
+static inline void __zone_stat_sub_folio(struct folio *folio,
+ enum zone_stat_item item)
+{
+ __mod_zone_page_state(folio_zone(folio), item, -folio_nr_pages(folio));
+}
+
+static inline void zone_stat_mod_folio(struct folio *folio,
+ enum zone_stat_item item, long nr)
+{
+ mod_zone_page_state(folio_zone(folio), item, nr);
+}
+
+static inline void zone_stat_add_folio(struct folio *folio,
+ enum zone_stat_item item)
+{
+ mod_zone_page_state(folio_zone(folio), item, folio_nr_pages(folio));
+}
+
+static inline void zone_stat_sub_folio(struct folio *folio,
+ enum zone_stat_item item)
+{
+ mod_zone_page_state(folio_zone(folio), item, -folio_nr_pages(folio));
+}
+
+static inline void __node_stat_mod_folio(struct folio *folio,
+ enum node_stat_item item, long nr)
+{
+ __mod_node_page_state(folio_pgdat(folio), item, nr);
+}
+
+static inline void __node_stat_add_folio(struct folio *folio,
+ enum node_stat_item item)
+{
+ __mod_node_page_state(folio_pgdat(folio), item, folio_nr_pages(folio));
+}
+
+static inline void __node_stat_sub_folio(struct folio *folio,
+ enum node_stat_item item)
+{
+ __mod_node_page_state(folio_pgdat(folio), item, -folio_nr_pages(folio));
+}
+
+static inline void node_stat_mod_folio(struct folio *folio,
+ enum node_stat_item item, long nr)
+{
+ mod_node_page_state(folio_pgdat(folio), item, nr);
+}
+
+static inline void node_stat_add_folio(struct folio *folio,
+ enum node_stat_item item)
+{
+ mod_node_page_state(folio_pgdat(folio), item, folio_nr_pages(folio));
+}
+
+static inline void node_stat_sub_folio(struct folio *folio,
+ enum node_stat_item item)
+{
+ mod_node_page_state(folio_pgdat(folio), item, -folio_nr_pages(folio));
+}
+
static inline void __mod_zone_freepage_state(struct zone *zone, int nr_pages,
int migratetype)
{
@@ -530,6 +602,24 @@ static inline void __dec_lruvec_page_state(struct page *page,
__mod_lruvec_page_state(page, idx, -1);
}

+static inline void __lruvec_stat_mod_folio(struct folio *folio,
+ enum node_stat_item idx, int val)
+{
+ __mod_lruvec_page_state(&folio->page, idx, val);
+}
+
+static inline void __lruvec_stat_add_folio(struct folio *folio,
+ enum node_stat_item idx)
+{
+ __lruvec_stat_mod_folio(folio, idx, folio_nr_pages(folio));
+}
+
+static inline void __lruvec_stat_sub_folio(struct folio *folio,
+ enum node_stat_item idx)
+{
+ __lruvec_stat_mod_folio(folio, idx, -folio_nr_pages(folio));
+}
+
static inline void inc_lruvec_page_state(struct page *page,
enum node_stat_item idx)
{
@@ -542,4 +632,21 @@ static inline void dec_lruvec_page_state(struct page *page,
mod_lruvec_page_state(page, idx, -1);
}

+static inline void lruvec_stat_mod_folio(struct folio *folio,
+ enum node_stat_item idx, int val)
+{
+ mod_lruvec_page_state(&folio->page, idx, val);
+}
+
+static inline void lruvec_stat_add_folio(struct folio *folio,
+ enum node_stat_item idx)
+{
+ lruvec_stat_mod_folio(folio, idx, folio_nr_pages(folio));
+}
+
+static inline void lruvec_stat_sub_folio(struct folio *folio,
+ enum node_stat_item idx)
+{
+ lruvec_stat_mod_folio(folio, idx, -folio_nr_pages(folio));
+}
#endif /* _LINUX_VMSTAT_H */
--
2.30.2

2021-03-20 10:53:07

by Matthew Wilcox

Subject: [PATCH v5 10/27] mm: Create FolioFlags

These new functions are the folio analogues of the PageFlags functions.
If CONFIG_DEBUG_VM_PGFLAGS is enabled, we check the folio is not a tail
page at every invocation. Note that this will also catch the PagePoisoned
case as a poisoned page has every bit set, which would include PageTail.

This saves 1727 bytes of text with the distro-derived config that
I'm testing due to removing a double call to compound_head() in
PageSwapCache().
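
To illustrate what the macros below generate: a flag declared with
PAGEFLAG(Dirty, dirty, PF_HEAD) gains folio accessors along these lines
(a sketch of the expansion, not a literal copy):

    /* Roughly what TESTPAGEFLAG/SETPAGEFLAG expand to for 'Dirty'. */
    static __always_inline int FolioDirty(struct folio *folio)
    {
            return test_bit(PG_dirty, folio_flags(folio, FOLIO_PF_HEAD));
    }

    static __always_inline void SetFolioDirty(struct folio *folio)
    {
            set_bit(PG_dirty, folio_flags(folio, FOLIO_PF_HEAD));
    }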

Signed-off-by: Matthew Wilcox (Oracle) <[email protected]>
---
include/linux/page-flags.h | 120 ++++++++++++++++++++++++++++++-------
1 file changed, 100 insertions(+), 20 deletions(-)

diff --git a/include/linux/page-flags.h b/include/linux/page-flags.h
index 04a34c08e0a6..ec0e3eb6b85a 100644
--- a/include/linux/page-flags.h
+++ b/include/linux/page-flags.h
@@ -212,6 +212,15 @@ static inline void page_init_poison(struct page *page, size_t size)
}
#endif

+static unsigned long *folio_flags(struct folio *folio, unsigned n)
+{
+ struct page *page = &folio->page;
+
+ VM_BUG_ON_PGFLAGS(PageTail(page), page);
+ VM_BUG_ON_PGFLAGS(n > 0 && !test_bit(PG_head, &page->flags), page);
+ return &page[n].flags;
+}
+
/*
* Page flags policies wrt compound pages
*
@@ -256,34 +265,56 @@ static inline void page_init_poison(struct page *page, size_t size)
VM_BUG_ON_PGFLAGS(!PageHead(page), page); \
PF_POISONED_CHECK(&page[1]); })

+/* Which page is the flag stored in */
+#define FOLIO_PF_ANY 0
+#define FOLIO_PF_HEAD 0
+#define FOLIO_PF_ONLY_HEAD 0
+#define FOLIO_PF_NO_TAIL 0
+#define FOLIO_PF_NO_COMPOUND 0
+#define FOLIO_PF_SECOND 1
+
/*
* Macros to create function definitions for page flags
*/
#define TESTPAGEFLAG(uname, lname, policy) \
+static __always_inline int Folio##uname(struct folio *folio) \
+ { return test_bit(PG_##lname, folio_flags(folio, FOLIO_##policy)); } \
static __always_inline int Page##uname(struct page *page) \
{ return test_bit(PG_##lname, &policy(page, 0)->flags); }

#define SETPAGEFLAG(uname, lname, policy) \
+static __always_inline void SetFolio##uname(struct folio *folio) \
+ { set_bit(PG_##lname, folio_flags(folio, FOLIO_##policy)); } \
static __always_inline void SetPage##uname(struct page *page) \
{ set_bit(PG_##lname, &policy(page, 1)->flags); }

#define CLEARPAGEFLAG(uname, lname, policy) \
+static __always_inline void ClearFolio##uname(struct folio *folio) \
+ { clear_bit(PG_##lname, folio_flags(folio, FOLIO_##policy)); } \
static __always_inline void ClearPage##uname(struct page *page) \
{ clear_bit(PG_##lname, &policy(page, 1)->flags); }

#define __SETPAGEFLAG(uname, lname, policy) \
+static __always_inline void __SetFolio##uname(struct folio *folio) \
+ { __set_bit(PG_##lname, folio_flags(folio, FOLIO_##policy)); } \
static __always_inline void __SetPage##uname(struct page *page) \
{ __set_bit(PG_##lname, &policy(page, 1)->flags); }

#define __CLEARPAGEFLAG(uname, lname, policy) \
+static __always_inline void __ClearFolio##uname(struct folio *folio) \
+ { __clear_bit(PG_##lname, folio_flags(folio, FOLIO_##policy)); } \
static __always_inline void __ClearPage##uname(struct page *page) \
{ __clear_bit(PG_##lname, &policy(page, 1)->flags); }

#define TESTSETFLAG(uname, lname, policy) \
+static __always_inline int TestSetFolio##uname(struct folio *folio) \
+ { return test_and_set_bit(PG_##lname, folio_flags(folio, FOLIO_##policy)); } \
static __always_inline int TestSetPage##uname(struct page *page) \
{ return test_and_set_bit(PG_##lname, &policy(page, 1)->flags); }

#define TESTCLEARFLAG(uname, lname, policy) \
+static __always_inline int TestClearFolio##uname(struct folio *folio) \
+ { return test_and_clear_bit(PG_##lname, folio_flags(folio, FOLIO_##policy)); } \
static __always_inline int TestClearPage##uname(struct page *page) \
{ return test_and_clear_bit(PG_##lname, &policy(page, 1)->flags); }

@@ -302,21 +333,27 @@ static __always_inline int TestClearPage##uname(struct page *page) \
TESTCLEARFLAG(uname, lname, policy)

#define TESTPAGEFLAG_FALSE(uname) \
+static inline int Folio##uname(const struct folio *folio) { return 0; } \
static inline int Page##uname(const struct page *page) { return 0; }

#define SETPAGEFLAG_NOOP(uname) \
+static inline void SetFolio##uname(struct folio *folio) { } \
static inline void SetPage##uname(struct page *page) { }

#define CLEARPAGEFLAG_NOOP(uname) \
+static inline void ClearFolio##uname(struct folio *folio) { } \
static inline void ClearPage##uname(struct page *page) { }

#define __CLEARPAGEFLAG_NOOP(uname) \
+static inline void __ClearFolio##uname(struct folio *folio) { } \
static inline void __ClearPage##uname(struct page *page) { }

#define TESTSETFLAG_FALSE(uname) \
+static inline int TestSetFolio##uname(struct folio *folio) { return 0; } \
static inline int TestSetPage##uname(struct page *page) { return 0; }

#define TESTCLEARFLAG_FALSE(uname) \
+static inline int TestClearFolio##uname(struct folio *folio) { return 0; } \
static inline int TestClearPage##uname(struct page *page) { return 0; }

#define PAGEFLAG_FALSE(uname) TESTPAGEFLAG_FALSE(uname) \
@@ -393,14 +430,18 @@ PAGEFLAG_FALSE(HighMem)
#endif

#ifdef CONFIG_SWAP
-static __always_inline int PageSwapCache(struct page *page)
+static __always_inline bool FolioSwapCache(struct folio *folio)
{
-#ifdef CONFIG_THP_SWAP
- page = compound_head(page);
-#endif
- return PageSwapBacked(page) && test_bit(PG_swapcache, &page->flags);
+ return FolioSwapBacked(folio) &&
+ test_bit(PG_swapcache, folio_flags(folio, 0));
+
+}

+static __always_inline bool PageSwapCache(struct page *page)
+{
+ return FolioSwapCache(page_folio(page));
}
+
SETPAGEFLAG(SwapCache, swapcache, PF_NO_TAIL)
CLEARPAGEFLAG(SwapCache, swapcache, PF_NO_TAIL)
#else
@@ -478,10 +519,14 @@ static __always_inline int PageMappingFlags(struct page *page)
return ((unsigned long)page->mapping & PAGE_MAPPING_FLAGS) != 0;
}

-static __always_inline int PageAnon(struct page *page)
+static __always_inline bool FolioAnon(struct folio *folio)
{
- page = compound_head(page);
- return ((unsigned long)page->mapping & PAGE_MAPPING_ANON) != 0;
+ return ((unsigned long)folio->page.mapping & PAGE_MAPPING_ANON) != 0;
+}
+
+static __always_inline bool PageAnon(struct page *page)
+{
+ return FolioAnon(page_folio(page));
}

static __always_inline int __PageMovable(struct page *page)
@@ -509,18 +554,16 @@ TESTPAGEFLAG_FALSE(Ksm)

u64 stable_page_flags(struct page *page);

-static inline int PageUptodate(struct page *page)
+static inline int FolioUptodate(struct folio *folio)
{
- int ret;
- page = compound_head(page);
- ret = test_bit(PG_uptodate, &(page)->flags);
+ int ret = test_bit(PG_uptodate, folio_flags(folio, 0));
/*
* Must ensure that the data we read out of the page is loaded
* _after_ we've loaded page->flags to check for PageUptodate.
* We can skip the barrier if the page is not uptodate, because
* we wouldn't be reading anything from it.
*
- * See SetPageUptodate() for the other side of the story.
+ * See SetFolioUptodate() for the other side of the story.
*/
if (ret)
smp_rmb();
@@ -528,23 +571,36 @@ static inline int PageUptodate(struct page *page)
return ret;
}

-static __always_inline void __SetPageUptodate(struct page *page)
+static inline int PageUptodate(struct page *page)
+{
+ return FolioUptodate(page_folio(page));
+}
+
+static __always_inline void __SetFolioUptodate(struct folio *folio)
{
- VM_BUG_ON_PAGE(PageTail(page), page);
smp_wmb();
- __set_bit(PG_uptodate, &page->flags);
+ __set_bit(PG_uptodate, folio_flags(folio, 0));
}

-static __always_inline void SetPageUptodate(struct page *page)
+static __always_inline void SetFolioUptodate(struct folio *folio)
{
- VM_BUG_ON_PAGE(PageTail(page), page);
/*
* Memory barrier must be issued before setting the PG_uptodate bit,
* so that all previous stores issued in order to bring the page
* uptodate are actually visible before PageUptodate becomes true.
*/
smp_wmb();
- set_bit(PG_uptodate, &page->flags);
+ set_bit(PG_uptodate, folio_flags(folio, 0));
+}
+
+static __always_inline void __SetPageUptodate(struct page *page)
+{
+ __SetFolioUptodate((struct folio *)page);
+}
+
+static __always_inline void SetPageUptodate(struct page *page)
+{
+ SetFolioUptodate((struct folio *)page);
}

CLEARPAGEFLAG(Uptodate, uptodate, PF_NO_TAIL)
@@ -569,6 +625,17 @@ static inline void set_page_writeback_keepwrite(struct page *page)

__PAGEFLAG(Head, head, PF_ANY) CLEARPAGEFLAG(Head, head, PF_ANY)

+/* Whether there are one or multiple pages in a folio */
+static inline bool FolioSingle(struct folio *folio)
+{
+ return !FolioHead(folio);
+}
+
+static inline bool FolioMulti(struct folio *folio)
+{
+ return FolioHead(folio);
+}
+
static __always_inline void set_compound_head(struct page *page, struct page *head)
{
WRITE_ONCE(page->compound_head, (unsigned long)head + 1);
@@ -592,12 +659,15 @@ static inline void ClearPageCompound(struct page *page)
#ifdef CONFIG_HUGETLB_PAGE
int PageHuge(struct page *page);
int PageHeadHuge(struct page *page);
+static inline bool FolioHuge(struct folio *folio)
+{
+ return PageHeadHuge(&folio->page);
+}
#else
TESTPAGEFLAG_FALSE(Huge)
TESTPAGEFLAG_FALSE(HeadHuge)
#endif

-
#ifdef CONFIG_TRANSPARENT_HUGEPAGE
/*
* PageHuge() only returns true for hugetlbfs pages, but not for
@@ -613,6 +683,11 @@ static inline int PageTransHuge(struct page *page)
return PageHead(page);
}

+static inline bool FolioTransHuge(struct folio *folio)
+{
+ return FolioHead(folio);
+}
+
/*
* PageTransCompound returns true for both transparent huge pages
* and hugetlbfs pages, so it should only be called when it's known
@@ -844,6 +919,11 @@ static inline int page_has_private(struct page *page)
return !!(page->flags & PAGE_FLAGS_PRIVATE);
}

+static inline bool folio_has_private(struct folio *folio)
+{
+ return page_has_private(&folio->page);
+}
+
#undef PF_ANY
#undef PF_HEAD
#undef PF_ONLY_HEAD
--
2.30.2

2021-03-20 11:13:42

by Matthew Wilcox

Subject: [PATCH v5 18/27] mm/filemap: Add __lock_folio_async

There aren't any actual callers of lock_page_async(), so remove it.
Convert filemap_update_page() to call __lock_folio_async().

__lock_folio_async() is 21 bytes smaller than __lock_page_async(),
but the real savings come from using a folio in filemap_update_page(),
shrinking it from 514 bytes to 403 bytes, saving 111 bytes. The text
shrinks by 132 bytes in total.

Signed-off-by: Matthew Wilcox (Oracle) <[email protected]>
---
fs/io_uring.c | 2 +-
include/linux/pagemap.h | 17 -----------------
mm/filemap.c | 31 ++++++++++++++++---------------
3 files changed, 17 insertions(+), 33 deletions(-)

diff --git a/fs/io_uring.c b/fs/io_uring.c
index b882bc4c5af7..ad0dc9afd194 100644
--- a/fs/io_uring.c
+++ b/fs/io_uring.c
@@ -3201,7 +3201,7 @@ static int io_read_prep(struct io_kiocb *req, const struct io_uring_sqe *sqe)
}

/*
- * This is our waitqueue callback handler, registered through lock_page_async()
+ * This is our waitqueue callback handler, registered through lock_folio_async()
* when we initially tried to do the IO with the iocb armed our waitqueue.
* This gets called when the page is unlocked, and we generally expect that to
* happen when the page IO is completed and the page is now uptodate. This will
diff --git a/include/linux/pagemap.h b/include/linux/pagemap.h
index aa7f564e5ecf..3cd1b5e28593 100644
--- a/include/linux/pagemap.h
+++ b/include/linux/pagemap.h
@@ -695,7 +695,6 @@ static inline bool wake_page_match(struct wait_page_queue *wait_page,

void __lock_folio(struct folio *folio);
int __lock_folio_killable(struct folio *folio);
-extern int __lock_page_async(struct page *page, struct wait_page_queue *wait);
extern int __lock_page_or_retry(struct page *page, struct mm_struct *mm,
unsigned int flags);
void unlock_page(struct page *page);
@@ -753,22 +752,6 @@ static inline int lock_page_killable(struct page *page)
return lock_folio_killable(page_folio(page));
}

-/*
- * lock_page_async - Lock the page, unless this would block. If the page
- * is already locked, then queue a callback when the page becomes unlocked.
- * This callback can then retry the operation.
- *
- * Returns 0 if the page is locked successfully, or -EIOCBQUEUED if the page
- * was already locked and the callback defined in 'wait' was queued.
- */
-static inline int lock_page_async(struct page *page,
- struct wait_page_queue *wait)
-{
- if (!trylock_page(page))
- return __lock_page_async(page, wait);
- return 0;
-}
-
/*
* lock_page_or_retry - Lock the page, unless this would block and the
* caller indicated that it can handle a retry.
diff --git a/mm/filemap.c b/mm/filemap.c
index 7cac47db78a5..12dc672adc2e 100644
--- a/mm/filemap.c
+++ b/mm/filemap.c
@@ -1554,18 +1554,18 @@ int __lock_folio_killable(struct folio *folio)
}
EXPORT_SYMBOL_GPL(__lock_folio_killable);

-int __lock_page_async(struct page *page, struct wait_page_queue *wait)
+static int __lock_folio_async(struct folio *folio, struct wait_page_queue *wait)
{
- struct wait_queue_head *q = page_waitqueue(page);
+ struct wait_queue_head *q = page_waitqueue(&folio->page);
int ret = 0;

- wait->page = page;
+ wait->page = &folio->page;
wait->bit_nr = PG_locked;

spin_lock_irq(&q->lock);
__add_wait_queue_entry_tail(q, &wait->wait);
- SetPageWaiters(page);
- ret = !trylock_page(page);
+ SetFolioWaiters(folio);
+ ret = !trylock_folio(folio);
/*
* If we were successful now, we know we're still on the
* waitqueue as we're still under the lock. This means it's
@@ -2312,41 +2312,42 @@ static int filemap_update_page(struct kiocb *iocb,
struct address_space *mapping, struct iov_iter *iter,
struct page *page)
{
+ struct folio *folio = page_folio(page);
int error;

- if (!trylock_page(page)) {
+ if (!trylock_folio(folio)) {
if (iocb->ki_flags & (IOCB_NOWAIT | IOCB_NOIO))
return -EAGAIN;
if (!(iocb->ki_flags & IOCB_WAITQ)) {
- put_and_wait_on_page_locked(page, TASK_KILLABLE);
+ put_and_wait_on_page_locked(&folio->page, TASK_KILLABLE);
return AOP_TRUNCATED_PAGE;
}
- error = __lock_page_async(page, iocb->ki_waitq);
+ error = __lock_folio_async(folio, iocb->ki_waitq);
if (error)
return error;
}

- if (!page->mapping)
+ if (!folio->page.mapping)
goto truncated;

error = 0;
- if (filemap_range_uptodate(mapping, iocb->ki_pos, iter, page))
+ if (filemap_range_uptodate(mapping, iocb->ki_pos, iter, &folio->page))
goto unlock;

error = -EAGAIN;
if (iocb->ki_flags & (IOCB_NOIO | IOCB_NOWAIT | IOCB_WAITQ))
goto unlock;

- error = filemap_read_page(iocb->ki_filp, mapping, page);
+ error = filemap_read_page(iocb->ki_filp, mapping, &folio->page);
if (error == AOP_TRUNCATED_PAGE)
- put_page(page);
+ put_folio(folio);
return error;
truncated:
- unlock_page(page);
- put_page(page);
+ unlock_folio(folio);
+ put_folio(folio);
return AOP_TRUNCATED_PAGE;
unlock:
- unlock_page(page);
+ unlock_folio(folio);
return error;
}

--
2.30.2

2021-03-20 11:13:42

by Matthew Wilcox

Subject: [PATCH v5 12/27] mm: Add folio_index, folio_file_page and folio_contains

folio_index() is the equivalent of page_index() for folios.
folio_file_page() is the equivalent of find_subpage().
folio_contains() is the equivalent of thp_contains().
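
A hedged sketch of how these combine in a caller that has looked up a
folio and needs the precise page for an index (the helper name is made
up for illustration):

    /* Hypothetical helper: return the exact page backing 'index'. */
    static struct page *folio_page_at(struct folio *folio, pgoff_t index)
    {
            VM_BUG_ON_FOLIO(!folio_contains(folio, index), folio);
            return folio_file_page(folio, index);
    }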

Signed-off-by: Matthew Wilcox (Oracle) <[email protected]>
---
include/linux/pagemap.h | 53 +++++++++++++++++++++++++++++++++++++++++
1 file changed, 53 insertions(+)

diff --git a/include/linux/pagemap.h b/include/linux/pagemap.h
index 6676210addf6..f29c96ed3721 100644
--- a/include/linux/pagemap.h
+++ b/include/linux/pagemap.h
@@ -462,6 +462,59 @@ static inline bool thp_contains(struct page *head, pgoff_t index)
return page_index(head) == (index & ~(thp_nr_pages(head) - 1UL));
}

+#define swapcache_index(folio) __page_file_index(&(folio)->page)
+
+/**
+ * folio_index - File index of a folio.
+ * @folio: The folio.
+ *
+ * For a folio which is either in the page cache or the swap cache,
+ * return its index within the address_space it belongs to. If you know
+ * the page is definitely in the page cache, you can look at the folio's
+ * index directly.
+ *
+ * Return: The index (offset in units of pages) of a folio in its file.
+ */
+static inline pgoff_t folio_index(struct folio *folio)
+{
+ if (unlikely(FolioSwapCache(folio)))
+ return swapcache_index(folio);
+ return folio->page.index;
+}
+
+/**
+ * folio_file_page - The page for a particular index.
+ * @folio: The folio which contains this index.
+ * @index: The index we want to look up.
+ *
+ * Sometimes after looking up a folio in the page cache, we need to
+ * obtain the specific page for an index (eg a page fault).
+ *
+ * Return: The page containing the file data for this index.
+ */
+static inline struct page *folio_file_page(struct folio *folio, pgoff_t index)
+{
+ return &folio->page + (index & (folio_nr_pages(folio) - 1));
+}
+
+/**
+ * folio_contains - Does this folio contain this index?
+ * @folio: The folio.
+ * @index: The page index within the file.
+ *
+ * Context: The caller should have the page locked in order to prevent
+ * (eg) shmem from moving the page between the page cache and swap cache
+ * and changing its index in the middle of the operation.
+ * Return: true or false.
+ */
+static inline bool folio_contains(struct folio *folio, pgoff_t index)
+{
+ /* HugeTLBfs indexes the page cache in units of hpage_size */
+ if (PageHuge(&folio->page))
+ return folio->page.index == index;
+ return index - folio_index(folio) < folio_nr_pages(folio);
+}
+
/*
* Given the page we found in the page cache, return the page corresponding
* to this index in the file
--
2.30.2

2021-03-20 11:13:42

by Matthew Wilcox

Subject: [PATCH v5 15/27] mm/filemap: Add unlock_folio

Convert unlock_page() to call unlock_folio(). By using a folio we
avoid a call to compound_head(). This shortens the function from 39
bytes to 25 and removes 4 instructions on x86-64. Because we still
have unlock_page(), it's a net increase of 24 bytes of text for the
kernel as a whole, but any path that uses unlock_folio() will execute
4 fewer instructions.

Signed-off-by: Matthew Wilcox (Oracle) <[email protected]>
---
include/linux/pagemap.h | 3 ++-
mm/filemap.c | 27 ++++++++++-----------------
mm/folio-compat.c | 6 ++++++
3 files changed, 18 insertions(+), 18 deletions(-)

diff --git a/include/linux/pagemap.h b/include/linux/pagemap.h
index 90e970f48039..c211868086e0 100644
--- a/include/linux/pagemap.h
+++ b/include/linux/pagemap.h
@@ -698,7 +698,8 @@ extern int __lock_page_killable(struct page *page);
extern int __lock_page_async(struct page *page, struct wait_page_queue *wait);
extern int __lock_page_or_retry(struct page *page, struct mm_struct *mm,
unsigned int flags);
-extern void unlock_page(struct page *page);
+void unlock_page(struct page *page);
+void unlock_folio(struct folio *folio);
void unlock_page_private_2(struct page *page);

/*
diff --git a/mm/filemap.c b/mm/filemap.c
index eeeb8e2cc36a..47ac8126a12e 100644
--- a/mm/filemap.c
+++ b/mm/filemap.c
@@ -1435,29 +1435,22 @@ static inline bool clear_bit_unlock_is_negative_byte(long nr, volatile void *mem
#endif

/**
- * unlock_page - unlock a locked page
- * @page: the page
+ * unlock_folio - Unlock a locked folio.
+ * @folio: The folio.
*
- * Unlocks the page and wakes up sleepers in wait_on_page_locked().
- * Also wakes sleepers in wait_on_page_writeback() because the wakeup
- * mechanism between PageLocked pages and PageWriteback pages is shared.
- * But that's OK - sleepers in wait_on_page_writeback() just go back to sleep.
+ * Unlocks the folio and wakes up any thread sleeping on the page lock.
*
- * Note that this depends on PG_waiters being the sign bit in the byte
- * that contains PG_locked - thus the BUILD_BUG_ON(). That allows us to
- * clear the PG_locked bit and test PG_waiters at the same time fairly
- * portably (architectures that do LL/SC can test any bit, while x86 can
- * test the sign bit).
+ * Context: May be called from interrupt or process context. May not be
+ * called from NMI context.
*/
-void unlock_page(struct page *page)
+void unlock_folio(struct folio *folio)
{
BUILD_BUG_ON(PG_waiters != 7);
- page = compound_head(page);
- VM_BUG_ON_PAGE(!PageLocked(page), page);
- if (clear_bit_unlock_is_negative_byte(PG_locked, &page->flags))
- wake_up_page_bit(page, PG_locked);
+ VM_BUG_ON_FOLIO(!FolioLocked(folio), folio);
+ if (clear_bit_unlock_is_negative_byte(PG_locked, folio_flags(folio, 0)))
+ wake_up_page_bit(&folio->page, PG_locked);
}
-EXPORT_SYMBOL(unlock_page);
+EXPORT_SYMBOL(unlock_folio);

/**
* unlock_page_private_2 - Unlock a page that's locked with PG_private_2
diff --git a/mm/folio-compat.c b/mm/folio-compat.c
index 5e107aa30a62..02798abf19a1 100644
--- a/mm/folio-compat.c
+++ b/mm/folio-compat.c
@@ -11,3 +11,9 @@ struct address_space *page_mapping(struct page *page)
return folio_mapping(page_folio(page));
}
EXPORT_SYMBOL(page_mapping);
+
+void unlock_page(struct page *page)
+{
+ return unlock_folio(page_folio(page));
+}
+EXPORT_SYMBOL(unlock_page);
--
2.30.2

2021-03-20 11:13:42

by Matthew Wilcox

Subject: [PATCH v5 16/27] mm/filemap: Add lock_folio

This is like lock_page() but for use by callers who know they have a folio.
Convert __lock_page() to be __lock_folio(). This saves one call to
compound_head() per contended call to lock_page().

Saves 362 bytes of text; mostly from improved register allocation and
inlining decisions. __lock_folio is 59 bytes while __lock_page was 79.
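
The expected calling pattern for code that already has a folio is
simply the following (unlock_folio() was added in the previous patch):

    /* Sketch: lock and unlock a folio with no compound_head() calls. */
    lock_folio(folio);
    /* ... the folio is now locked; examine or modify it ... */
    unlock_folio(folio);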

Signed-off-by: Matthew Wilcox (Oracle) <[email protected]>
---
include/linux/pagemap.h | 24 +++++++++++++++++++-----
mm/filemap.c | 29 +++++++++++++++--------------
2 files changed, 34 insertions(+), 19 deletions(-)

diff --git a/include/linux/pagemap.h b/include/linux/pagemap.h
index c211868086e0..c96ba0dfe111 100644
--- a/include/linux/pagemap.h
+++ b/include/linux/pagemap.h
@@ -693,7 +693,7 @@ static inline bool wake_page_match(struct wait_page_queue *wait_page,
return true;
}

-extern void __lock_page(struct page *page);
+void __lock_folio(struct folio *folio);
extern int __lock_page_killable(struct page *page);
extern int __lock_page_async(struct page *page, struct wait_page_queue *wait);
extern int __lock_page_or_retry(struct page *page, struct mm_struct *mm,
@@ -702,13 +702,24 @@ void unlock_page(struct page *page);
void unlock_folio(struct folio *folio);
void unlock_page_private_2(struct page *page);

+static inline bool trylock_folio(struct folio *folio)
+{
+ return likely(!test_and_set_bit_lock(PG_locked, folio_flags(folio, 0)));
+}
+
/*
* Return true if the page was successfully locked
*/
static inline int trylock_page(struct page *page)
{
- page = compound_head(page);
- return (likely(!test_and_set_bit_lock(PG_locked, &page->flags)));
+ return trylock_folio(page_folio(page));
+}
+
+static inline void lock_folio(struct folio *folio)
+{
+ might_sleep();
+ if (!trylock_folio(folio))
+ __lock_folio(folio);
}

/*
@@ -716,9 +727,12 @@ static inline int trylock_page(struct page *page)
*/
static inline void lock_page(struct page *page)
{
+ struct folio *folio;
might_sleep();
- if (!trylock_page(page))
- __lock_page(page);
+
+ folio = page_folio(page);
+ if (!trylock_folio(folio))
+ __lock_folio(folio);
}

/*
diff --git a/mm/filemap.c b/mm/filemap.c
index 47ac8126a12e..99c05e2c0eea 100644
--- a/mm/filemap.c
+++ b/mm/filemap.c
@@ -1187,7 +1187,7 @@ static void wake_up_page(struct page *page, int bit)
*/
enum behavior {
EXCLUSIVE, /* Hold ref to page and take the bit when woken, like
- * __lock_page() waiting on then setting PG_locked.
+ * __lock_folio() waiting on then setting PG_locked.
*/
SHARED, /* Hold ref to page and check the bit when woken, like
* wait_on_page_writeback() waiting on PG_writeback.
@@ -1535,17 +1535,16 @@ void page_endio(struct page *page, bool is_write, int err)
EXPORT_SYMBOL_GPL(page_endio);

/**
- * __lock_page - get a lock on the page, assuming we need to sleep to get it
- * @__page: the page to lock
+ * __lock_folio - Get a lock on the folio, assuming we need to sleep to get it.
+ * @folio: The folio to lock
*/
-void __lock_page(struct page *__page)
+void __lock_folio(struct folio *folio)
{
- struct page *page = compound_head(__page);
- wait_queue_head_t *q = page_waitqueue(page);
- wait_on_page_bit_common(q, page, PG_locked, TASK_UNINTERRUPTIBLE,
+ wait_queue_head_t *q = page_waitqueue(&folio->page);
+ wait_on_page_bit_common(q, &folio->page, PG_locked, TASK_UNINTERRUPTIBLE,
EXCLUSIVE);
}
-EXPORT_SYMBOL(__lock_page);
+EXPORT_SYMBOL(__lock_folio);

int __lock_page_killable(struct page *__page)
{
@@ -1620,10 +1619,10 @@ int __lock_page_or_retry(struct page *page, struct mm_struct *mm,
return 0;
}
} else {
- __lock_page(page);
+ __lock_folio(page_folio(page));
}
- return 1;

+ return 1;
}

/**
@@ -2767,7 +2766,9 @@ loff_t mapping_seek_hole_data(struct address_space *mapping, loff_t start,
static int lock_page_maybe_drop_mmap(struct vm_fault *vmf, struct page *page,
struct file **fpin)
{
- if (trylock_page(page))
+ struct folio *folio = page_folio(page);
+
+ if (trylock_folio(folio))
return 1;

/*
@@ -2780,7 +2781,7 @@ static int lock_page_maybe_drop_mmap(struct vm_fault *vmf, struct page *page,

*fpin = maybe_unlock_mmap_for_io(vmf, *fpin);
if (vmf->flags & FAULT_FLAG_KILLABLE) {
- if (__lock_page_killable(page)) {
+ if (__lock_page_killable(&folio->page)) {
/*
* We didn't have the right flags to drop the mmap_lock,
* but all fault_handlers only check for fatal signals
@@ -2792,11 +2793,11 @@ static int lock_page_maybe_drop_mmap(struct vm_fault *vmf, struct page *page,
return 0;
}
} else
- __lock_page(page);
+ __lock_folio(folio);
+
return 1;
}

-
/*
* Synchronous readahead happens when we don't even find a page in the page
* cache at all. We don't want to perform IO under the mmap sem, so if we have
--
2.30.2

2021-03-20 11:13:43

by Matthew Wilcox

Subject: [PATCH v5 17/27] mm/filemap: Add lock_folio_killable

This is like lock_page_killable() but for use by callers who
know they have a folio. Convert __lock_page_killable() to be
__lock_folio_killable(). This saves one call to compound_head() per
contended call to lock_page_killable().

__lock_folio_killable() is 20 bytes smaller than __lock_page_killable()
was. lock_page_maybe_drop_mmap() shrinks by 68 bytes and
__lock_page_or_retry() shrinks by 66 bytes. That's a total of 154 bytes
of text saved.

Signed-off-by: Matthew Wilcox (Oracle) <[email protected]>
---
include/linux/pagemap.h | 15 ++++++++++-----
mm/filemap.c | 17 +++++++++--------
2 files changed, 19 insertions(+), 13 deletions(-)

diff --git a/include/linux/pagemap.h b/include/linux/pagemap.h
index c96ba0dfe111..aa7f564e5ecf 100644
--- a/include/linux/pagemap.h
+++ b/include/linux/pagemap.h
@@ -694,7 +694,7 @@ static inline bool wake_page_match(struct wait_page_queue *wait_page,
}

void __lock_folio(struct folio *folio);
-extern int __lock_page_killable(struct page *page);
+int __lock_folio_killable(struct folio *folio);
extern int __lock_page_async(struct page *page, struct wait_page_queue *wait);
extern int __lock_page_or_retry(struct page *page, struct mm_struct *mm,
unsigned int flags);
@@ -735,6 +735,14 @@ static inline void lock_page(struct page *page)
__lock_folio(folio);
}

+static inline int lock_folio_killable(struct folio *folio)
+{
+ might_sleep();
+ if (!trylock_folio(folio))
+ return __lock_folio_killable(folio);
+ return 0;
+}
+
/*
* lock_page_killable is like lock_page but can be interrupted by fatal
* signals. It returns 0 if it locked the page and -EINTR if it was
@@ -742,10 +750,7 @@ static inline void lock_page(struct page *page)
*/
static inline int lock_page_killable(struct page *page)
{
- might_sleep();
- if (!trylock_page(page))
- return __lock_page_killable(page);
- return 0;
+ return lock_folio_killable(page_folio(page));
}

/*
diff --git a/mm/filemap.c b/mm/filemap.c
index 99c05e2c0eea..7cac47db78a5 100644
--- a/mm/filemap.c
+++ b/mm/filemap.c
@@ -1546,14 +1546,13 @@ void __lock_folio(struct folio *folio)
}
EXPORT_SYMBOL(__lock_folio);

-int __lock_page_killable(struct page *__page)
+int __lock_folio_killable(struct folio *folio)
{
- struct page *page = compound_head(__page);
- wait_queue_head_t *q = page_waitqueue(page);
- return wait_on_page_bit_common(q, page, PG_locked, TASK_KILLABLE,
+ wait_queue_head_t *q = page_waitqueue(&folio->page);
+ return wait_on_page_bit_common(q, &folio->page, PG_locked, TASK_KILLABLE,
EXCLUSIVE);
}
-EXPORT_SYMBOL_GPL(__lock_page_killable);
+EXPORT_SYMBOL_GPL(__lock_folio_killable);

int __lock_page_async(struct page *page, struct wait_page_queue *wait)
{
@@ -1595,6 +1594,8 @@ int __lock_page_async(struct page *page, struct wait_page_queue *wait)
int __lock_page_or_retry(struct page *page, struct mm_struct *mm,
unsigned int flags)
{
+ struct folio *folio = page_folio(page);
+
if (fault_flag_allow_retry_first(flags)) {
/*
* CAUTION! In this case, mmap_lock is not released
@@ -1613,13 +1614,13 @@ int __lock_page_or_retry(struct page *page, struct mm_struct *mm,
if (flags & FAULT_FLAG_KILLABLE) {
int ret;

- ret = __lock_page_killable(page);
+ ret = __lock_folio_killable(folio);
if (ret) {
mmap_read_unlock(mm);
return 0;
}
} else {
- __lock_folio(page_folio(page));
+ __lock_folio(folio);
}

return 1;
@@ -2781,7 +2782,7 @@ static int lock_page_maybe_drop_mmap(struct vm_fault *vmf, struct page *page,

*fpin = maybe_unlock_mmap_for_io(vmf, *fpin);
if (vmf->flags & FAULT_FLAG_KILLABLE) {
- if (__lock_page_killable(&folio->page)) {
+ if (__lock_folio_killable(folio)) {
/*
* We didn't have the right flags to drop the mmap_lock,
* but all fault_handlers only check for fatal signals
--
2.30.2

2021-03-20 11:13:43

by Matthew Wilcox

Subject: [PATCH v5 13/27] mm/util: Add folio_mapping and folio_file_mapping

These are the folio equivalent of page_mapping() and page_file_mapping().
Add an out-of-line page_mapping() wrapper around folio_mapping()
in order to prevent the page_folio() call from bloating every caller
of page_mapping(). Adjust page_file_mapping() and page_mapping_file()
to use folios internally. Rename __page_file_mapping() to
swapcache_mapping() and change it to take a folio.

This ends up saving 186 bytes of text overall. folio_mapping() is
45 bytes shorter than page_mapping() was, but the new page_mapping()
wrapper is 30 bytes. The major reduction is a few bytes less in dozens
of nfs functions (which call page_file_mapping()). Most of these appear
to be a slight change in gcc's register allocation decisions, which allow:

48 8b 56 08 mov 0x8(%rsi),%rdx
48 8d 42 ff lea -0x1(%rdx),%rax
83 e2 01 and $0x1,%edx
48 0f 44 c6 cmove %rsi,%rax

to become:

48 8b 46 08 mov 0x8(%rsi),%rax
48 8d 78 ff lea -0x1(%rax),%rdi
a8 01 test $0x1,%al
48 0f 44 fe cmove %rsi,%rdi

for a reduction of a single byte. Once the NFS client is converted to
use folios, this entire sequence will disappear.

Signed-off-by: Matthew Wilcox (Oracle) <[email protected]>
---
include/linux/mm.h | 14 --------------
include/linux/pagemap.h | 35 +++++++++++++++++++++++++++++++++--
include/linux/swap.h | 6 ++++++
mm/Makefile | 2 +-
mm/folio-compat.c | 13 +++++++++++++
mm/swapfile.c | 8 ++++----
mm/util.c | 30 ++++++++++++++++++------------
7 files changed, 75 insertions(+), 33 deletions(-)
create mode 100644 mm/folio-compat.c

diff --git a/include/linux/mm.h b/include/linux/mm.h
index 8fc7b04a1438..bc626c19f9f8 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -1732,19 +1732,6 @@ void page_address_init(void);

extern void *page_rmapping(struct page *page);
extern struct anon_vma *page_anon_vma(struct page *page);
-extern struct address_space *page_mapping(struct page *page);
-
-extern struct address_space *__page_file_mapping(struct page *);
-
-static inline
-struct address_space *page_file_mapping(struct page *page)
-{
- if (unlikely(PageSwapCache(page)))
- return __page_file_mapping(page);
-
- return page->mapping;
-}
-
extern pgoff_t __page_file_index(struct page *page);

/*
@@ -1759,7 +1746,6 @@ static inline pgoff_t page_index(struct page *page)
}

bool page_mapped(struct page *page);
-struct address_space *page_mapping(struct page *page);

/*
* Return true only if the page has been allocated with
diff --git a/include/linux/pagemap.h b/include/linux/pagemap.h
index f29c96ed3721..90e970f48039 100644
--- a/include/linux/pagemap.h
+++ b/include/linux/pagemap.h
@@ -162,14 +162,45 @@ static inline void filemap_nr_thps_dec(struct address_space *mapping)

void release_pages(struct page **pages, int nr);

+struct address_space *page_mapping(struct page *);
+struct address_space *folio_mapping(struct folio *);
+struct address_space *swapcache_mapping(struct folio *);
+
+/**
+ * folio_file_mapping - Find the mapping this folio belongs to.
+ * @folio: The folio.
+ *
+ * For folios which are in the page cache, return the mapping that this
+ * page belongs to. Folios in the swap cache return the mapping of the
+ * swap file or swap device where the data is stored. This is different
+ * from the mapping returned by folio_mapping(). The only reason to
+ * use it is if, like NFS, you return 0 from ->activate_swapfile.
+ *
+ * Do not call this for folios which aren't in the page cache or swap cache.
+ */
+static inline struct address_space *folio_file_mapping(struct folio *folio)
+{
+ if (unlikely(FolioSwapCache(folio)))
+ return swapcache_mapping(folio);
+
+ return folio->page.mapping;
+}
+
+static inline struct address_space *page_file_mapping(struct page *page)
+{
+ return folio_file_mapping(page_folio(page));
+}
+
/*
* For file cache pages, return the address_space, otherwise return NULL
*/
static inline struct address_space *page_mapping_file(struct page *page)
{
- if (unlikely(PageSwapCache(page)))
+ struct folio *folio = page_folio(page);
+
+ if (unlikely(FolioSwapCache(folio)))
return NULL;
- return page_mapping(page);
+ return folio_mapping(folio);
}

/*
diff --git a/include/linux/swap.h b/include/linux/swap.h
index 4c3a844ac9b4..09316a5c33e9 100644
--- a/include/linux/swap.h
+++ b/include/linux/swap.h
@@ -314,6 +314,12 @@ struct vma_swap_readahead {
#endif
};

+static inline swp_entry_t folio_swap_entry(struct folio *folio)
+{
+ swp_entry_t entry = { .val = page_private(&folio->page) };
+ return entry;
+}
+
/* linux/mm/workingset.c */
void workingset_age_nonresident(struct lruvec *lruvec, unsigned long nr_pages);
void *workingset_eviction(struct page *page, struct mem_cgroup *target_memcg);
diff --git a/mm/Makefile b/mm/Makefile
index 788c5ce5c0ef..9d6a7e8b5a3c 100644
--- a/mm/Makefile
+++ b/mm/Makefile
@@ -46,7 +46,7 @@ mmu-$(CONFIG_MMU) += process_vm_access.o
endif

obj-y := filemap.o mempool.o oom_kill.o fadvise.o \
- maccess.o page-writeback.o \
+ maccess.o page-writeback.o folio-compat.o \
readahead.o swap.o truncate.o vmscan.o shmem.o \
util.o mmzone.o vmstat.o backing-dev.o \
mm_init.o percpu.o slab_common.o \
diff --git a/mm/folio-compat.c b/mm/folio-compat.c
new file mode 100644
index 000000000000..5e107aa30a62
--- /dev/null
+++ b/mm/folio-compat.c
@@ -0,0 +1,13 @@
+/*
+ * Compatibility functions which bloat the callers too much to make inline.
+ * All of the callers of these functions should be converted to use folios
+ * eventually.
+ */
+
+#include <linux/pagemap.h>
+
+struct address_space *page_mapping(struct page *page)
+{
+ return folio_mapping(page_folio(page));
+}
+EXPORT_SYMBOL(page_mapping);
diff --git a/mm/swapfile.c b/mm/swapfile.c
index 149e77454e3c..d0ee24239a83 100644
--- a/mm/swapfile.c
+++ b/mm/swapfile.c
@@ -3533,13 +3533,13 @@ struct swap_info_struct *page_swap_info(struct page *page)
}

/*
- * out-of-line __page_file_ methods to avoid include hell.
+ * out-of-line methods to avoid include hell.
*/
-struct address_space *__page_file_mapping(struct page *page)
+struct address_space *swapcache_mapping(struct folio *folio)
{
- return page_swap_info(page)->swap_file->f_mapping;
+ return page_swap_info(&folio->page)->swap_file->f_mapping;
}
-EXPORT_SYMBOL_GPL(__page_file_mapping);
+EXPORT_SYMBOL_GPL(swapcache_mapping);

pgoff_t __page_file_index(struct page *page)
{
diff --git a/mm/util.c b/mm/util.c
index 0b6dd9d81da7..c4ed5b919c7d 100644
--- a/mm/util.c
+++ b/mm/util.c
@@ -686,30 +686,36 @@ struct anon_vma *page_anon_vma(struct page *page)
return __page_rmapping(page);
}

-struct address_space *page_mapping(struct page *page)
+/**
+ * folio_mapping - Find the mapping where this folio is stored.
+ * @folio: The folio.
+ *
+ * For folios which are in the page cache, return the mapping that this
+ * page belongs to. Folios in the swap cache return the swap mapping
+ * this page is stored in (which is different from the mapping for the
+ * swap file or swap device where the data is stored).
+ *
+ * You can call this for folios which aren't in the swap cache or page
+ * cache and it will return NULL.
+ */
+struct address_space *folio_mapping(struct folio *folio)
{
struct address_space *mapping;

- page = compound_head(page);
-
/* This happens if someone calls flush_dcache_page on slab page */
- if (unlikely(PageSlab(page)))
+ if (unlikely(FolioSlab(folio)))
return NULL;

- if (unlikely(PageSwapCache(page))) {
- swp_entry_t entry;
-
- entry.val = page_private(page);
- return swap_address_space(entry);
- }
+ if (unlikely(FolioSwapCache(folio)))
+ return swap_address_space(folio_swap_entry(folio));

- mapping = page->mapping;
+ mapping = folio->page.mapping;
if ((unsigned long)mapping & PAGE_MAPPING_ANON)
return NULL;

return (void *)((unsigned long)mapping & ~PAGE_MAPPING_FLAGS);
}
-EXPORT_SYMBOL(page_mapping);
+EXPORT_SYMBOL(folio_mapping);

/* Slow path of page_mapcount() for compound pages */
int __page_mapcount(struct page *page)
--
2.30.2

2021-03-20 11:13:43

by Matthew Wilcox

[permalink] [raw]
Subject: [PATCH v5 11/27] mm: Handle per-folio private data

Add folio_private() and set_folio_private() which mirror page_private()
and set_page_private() -- ie folio private data is the same as page
private data. The only difference is that these return a void *
instead of an unsigned long, which matches the majority of users.

Turn attach_page_private() into attach_folio_private() and reimplement
attach_page_private() as a wrapper. No filesystem which uses page private
data currently supports compound pages, so we're free to define the rules.
attach_page_private() may only be called on a head page; if you want
to add private data to a tail page, call set_page_private() directly,
and do not increment the page refcount -- the reference should only be
taken when attaching private data to the head page / folio.

This saves 597 bytes of text with the distro-derived config that I'm
testing due to removing the calls to compound_head() in get_page()
& put_page().
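
As a rough illustration (not part of this patch), a filesystem that keeps
per-folio bookkeeping might use the new interfaces as below; myfs_head,
myfs_attach() and myfs_detach() are made-up names:

        struct myfs_head {
                unsigned long state;
        };

        static void myfs_attach(struct folio *folio, struct myfs_head *head)
        {
                /* Takes a folio reference and sets the Private flag. */
                attach_folio_private(folio, head);
        }

        static void myfs_detach(struct folio *folio)
        {
                /* Returns NULL if no private data was attached. */
                struct myfs_head *head = detach_folio_private(folio);

                kfree(head);
        }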

Signed-off-by: Matthew Wilcox (Oracle) <[email protected]>
---
include/linux/mm_types.h | 16 ++++++++++++++
include/linux/pagemap.h | 48 ++++++++++++++++++++++++----------------
2 files changed, 45 insertions(+), 19 deletions(-)

diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h
index 4fc0b230d3ea..90086f93e9de 100644
--- a/include/linux/mm_types.h
+++ b/include/linux/mm_types.h
@@ -278,6 +278,12 @@ static inline atomic_t *compound_pincount_ptr(struct page *page)
#define PAGE_FRAG_CACHE_MAX_SIZE __ALIGN_MASK(32768, ~PAGE_MASK)
#define PAGE_FRAG_CACHE_MAX_ORDER get_order(PAGE_FRAG_CACHE_MAX_SIZE)

+/*
+ * page_private can be used on tail pages. However, PagePrivate is only
+ * checked by the VM on the head page. So page_private on the tail pages
+ * should be used for data that's ancillary to the head page (eg attaching
+ * buffer heads to tail pages after attaching buffer heads to the head page)
+ */
#define page_private(page) ((page)->private)

static inline void set_page_private(struct page *page, unsigned long private)
@@ -285,6 +291,16 @@ static inline void set_page_private(struct page *page, unsigned long private)
page->private = private;
}

+static inline void *folio_private(struct folio *folio)
+{
+ return (void *)folio->page.private;
+}
+
+static inline void set_folio_private(struct folio *folio, void *v)
+{
+ folio->page.private = (unsigned long)v;
+}
+
struct page_frag_cache {
void * va;
#if (PAGE_SIZE < PAGE_FRAG_CACHE_MAX_SIZE)
diff --git a/include/linux/pagemap.h b/include/linux/pagemap.h
index 8c844ba67785..6676210addf6 100644
--- a/include/linux/pagemap.h
+++ b/include/linux/pagemap.h
@@ -260,42 +260,52 @@ static inline int page_cache_add_speculative(struct page *page, int count)
}

/**
- * attach_page_private - Attach private data to a page.
- * @page: Page to attach data to.
- * @data: Data to attach to page.
+ * attach_folio_private - Attach private data to a folio.
+ * @folio: Folio to attach data to.
+ * @data: Data to attach to folio.
*
- * Attaching private data to a page increments the page's reference count.
- * The data must be detached before the page will be freed.
+ * Attaching private data to a folio increments the page's reference count.
+ * The data must be detached before the folio will be freed.
*/
-static inline void attach_page_private(struct page *page, void *data)
+static inline void attach_folio_private(struct folio *folio, void *data)
{
- get_page(page);
- set_page_private(page, (unsigned long)data);
- SetPagePrivate(page);
+ get_folio(folio);
+ set_folio_private(folio, data);
+ SetFolioPrivate(folio);
}

/**
- * detach_page_private - Detach private data from a page.
- * @page: Page to detach data from.
+ * detach_folio_private - Detach private data from a folio.
+ * @folio: Folio to detach data from.
*
- * Removes the data that was previously attached to the page and decrements
+ * Removes the data that was previously attached to the folio and decrements
* the refcount on the page.
*
- * Return: Data that was attached to the page.
+ * Return: Data that was attached to the folio.
*/
-static inline void *detach_page_private(struct page *page)
+static inline void *detach_folio_private(struct folio *folio)
{
- void *data = (void *)page_private(page);
+ void *data = folio_private(folio);

- if (!PagePrivate(page))
+ if (!FolioPrivate(folio))
return NULL;
- ClearPagePrivate(page);
- set_page_private(page, 0);
- put_page(page);
+ ClearFolioPrivate(folio);
+ set_folio_private(folio, NULL);
+ put_folio(folio);

return data;
}

+static inline void attach_page_private(struct page *page, void *data)
+{
+ attach_folio_private(page_folio(page), data);
+}
+
+static inline void *detach_page_private(struct page *page)
+{
+ return detach_folio_private(page_folio(page));
+}
+
#ifdef CONFIG_NUMA
extern struct page *__page_cache_alloc(gfp_t gfp);
#else
--
2.30.2

2021-03-20 11:13:54

by Matthew Wilcox

[permalink] [raw]
Subject: [PATCH v5 20/27] mm/filemap: Add wait_on_folio_locked

Also add wait_on_folio_locked_killable(). Turn wait_on_page_locked()
and wait_on_page_locked_killable() into wrappers. This eliminates a
call to compound_head() from each call-site, reducing text size by 200
bytes for me.
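
Purely as an illustrative sketch (myfs_await_folio() is a made-up name),
a caller which holds a reference to a folio that someone else has locked
would now write:

        /* The caller must hold a reference so the folio cannot be freed. */
        static int myfs_await_folio(struct folio *folio)
        {
                /* Returns 0 once unlocked, or -EINTR on a fatal signal. */
                return wait_on_folio_locked_killable(folio);
        }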

Signed-off-by: Matthew Wilcox (Oracle) <[email protected]>
---
include/linux/pagemap.h | 26 ++++++++++++++++++--------
mm/filemap.c | 4 ++--
2 files changed, 20 insertions(+), 10 deletions(-)

diff --git a/include/linux/pagemap.h b/include/linux/pagemap.h
index 38f4ee28a3a5..a8e19e4e0b09 100644
--- a/include/linux/pagemap.h
+++ b/include/linux/pagemap.h
@@ -777,23 +777,33 @@ extern void wait_on_page_bit(struct page *page, int bit_nr);
extern int wait_on_page_bit_killable(struct page *page, int bit_nr);

/*
- * Wait for a page to be unlocked.
+ * Wait for a folio to be unlocked.
*
- * This must be called with the caller "holding" the page,
- * ie with increased "page->count" so that the page won't
+ * This must be called with the caller "holding" the folio,
+ * ie with increased "page->count" so that the folio won't
* go away during the wait..
*/
+static inline void wait_on_folio_locked(struct folio *folio)
+{
+ if (FolioLocked(folio))
+ wait_on_page_bit(&folio->page, PG_locked);
+}
+
+static inline int wait_on_folio_locked_killable(struct folio *folio)
+{
+ if (!FolioLocked(folio))
+ return 0;
+ return wait_on_page_bit_killable(&folio->page, PG_locked);
+}
+
static inline void wait_on_page_locked(struct page *page)
{
- if (PageLocked(page))
- wait_on_page_bit(compound_head(page), PG_locked);
+ wait_on_folio_locked(page_folio(page));
}

static inline int wait_on_page_locked_killable(struct page *page)
{
- if (!PageLocked(page))
- return 0;
- return wait_on_page_bit_killable(compound_head(page), PG_locked);
+ return wait_on_folio_locked_killable(page_folio(page));
}

int put_and_wait_on_page_locked(struct page *page, int state);
diff --git a/mm/filemap.c b/mm/filemap.c
index 35e16db2e2be..99758045ec2d 100644
--- a/mm/filemap.c
+++ b/mm/filemap.c
@@ -1604,9 +1604,9 @@ int __lock_folio_or_retry(struct folio *folio, struct mm_struct *mm,

mmap_read_unlock(mm);
if (flags & FAULT_FLAG_KILLABLE)
- wait_on_page_locked_killable(page);
+ wait_on_folio_locked_killable(folio);
else
- wait_on_page_locked(page);
+ wait_on_folio_locked(folio);
return 0;
}
if (flags & FAULT_FLAG_KILLABLE) {
--
2.30.2

2021-03-20 11:14:41

by Matthew Wilcox

[permalink] [raw]
Subject: [PATCH v5 19/27] mm/filemap: Add __lock_folio_or_retry

Convert __lock_page_or_retry() to __lock_folio_or_retry(). This actually
saves 4 bytes in the only caller of lock_page_or_retry() (due to better
register allocation) and saves the 20-byte cost of calling page_folio()
in __lock_folio_or_retry(), for a total saving of 24 bytes.
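
For illustration only, the calling convention seen from a fault handler
looks like the fragment below ('page' and 'vmf' come from the surrounding,
hypothetical handler; error handling is elided):

        if (!lock_page_or_retry(page, vmf->vma->vm_mm, vmf->flags)) {
                /*
                 * mmap_lock has been dropped (unless FAULT_FLAG_RETRY_NOWAIT
                 * was set), so tell the caller to retry the fault.
                 */
                put_page(page);
                return VM_FAULT_RETRY;
        }
        /* The page is locked and mmap_lock is still held. */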

Signed-off-by: Matthew Wilcox (Oracle) <[email protected]>
---
include/linux/pagemap.h | 9 ++++++---
mm/filemap.c | 10 ++++------
mm/memory.c | 8 ++++----
3 files changed, 14 insertions(+), 13 deletions(-)

diff --git a/include/linux/pagemap.h b/include/linux/pagemap.h
index 3cd1b5e28593..38f4ee28a3a5 100644
--- a/include/linux/pagemap.h
+++ b/include/linux/pagemap.h
@@ -695,7 +695,7 @@ static inline bool wake_page_match(struct wait_page_queue *wait_page,

void __lock_folio(struct folio *folio);
int __lock_folio_killable(struct folio *folio);
-extern int __lock_page_or_retry(struct page *page, struct mm_struct *mm,
+int __lock_folio_or_retry(struct folio *folio, struct mm_struct *mm,
unsigned int flags);
void unlock_page(struct page *page);
void unlock_folio(struct folio *folio);
@@ -757,13 +757,16 @@ static inline int lock_page_killable(struct page *page)
* caller indicated that it can handle a retry.
*
* Return value and mmap_lock implications depend on flags; see
- * __lock_page_or_retry().
+ * __lock_folio_or_retry().
*/
static inline int lock_page_or_retry(struct page *page, struct mm_struct *mm,
unsigned int flags)
{
+ struct folio *folio;
might_sleep();
- return trylock_page(page) || __lock_page_or_retry(page, mm, flags);
+
+ folio = page_folio(page);
+ return trylock_folio(folio) || __lock_folio_or_retry(folio, mm, flags);
}

/*
diff --git a/mm/filemap.c b/mm/filemap.c
index 12dc672adc2e..35e16db2e2be 100644
--- a/mm/filemap.c
+++ b/mm/filemap.c
@@ -1582,20 +1582,18 @@ static int __lock_folio_async(struct folio *folio, struct wait_page_queue *wait)

/*
* Return values:
- * 1 - page is locked; mmap_lock is still held.
- * 0 - page is not locked.
+ * 1 - folio is locked; mmap_lock is still held.
+ * 0 - folio is not locked.
* mmap_lock has been released (mmap_read_unlock(), unless flags had both
* FAULT_FLAG_ALLOW_RETRY and FAULT_FLAG_RETRY_NOWAIT set, in
* which case mmap_lock is still held.
*
* If neither ALLOW_RETRY nor KILLABLE are set, will always return 1
- * with the page locked and the mmap_lock unperturbed.
+ * with the folio locked and the mmap_lock unperturbed.
*/
-int __lock_page_or_retry(struct page *page, struct mm_struct *mm,
+int __lock_folio_or_retry(struct folio *folio, struct mm_struct *mm,
unsigned int flags)
{
- struct folio *folio = page_folio(page);
-
if (fault_flag_allow_retry_first(flags)) {
/*
* CAUTION! In this case, mmap_lock is not released
diff --git a/mm/memory.c b/mm/memory.c
index d3273bd69dbb..9c3554972e2d 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -4056,7 +4056,7 @@ static vm_fault_t do_shared_fault(struct vm_fault *vmf)
* We enter with non-exclusive mmap_lock (to exclude vma changes,
* but allow concurrent faults).
* The mmap_lock may have been released depending on flags and our
- * return value. See filemap_fault() and __lock_page_or_retry().
+ * return value. See filemap_fault() and __lock_folio_or_retry().
* If mmap_lock is released, vma may become invalid (for example
* by other thread calling munmap()).
*/
@@ -4288,7 +4288,7 @@ static vm_fault_t wp_huge_pud(struct vm_fault *vmf, pud_t orig_pud)
* concurrent faults).
*
* The mmap_lock may have been released depending on flags and our return value.
- * See filemap_fault() and __lock_page_or_retry().
+ * See filemap_fault() and __lock_folio_or_retry().
*/
static vm_fault_t handle_pte_fault(struct vm_fault *vmf)
{
@@ -4392,7 +4392,7 @@ static vm_fault_t handle_pte_fault(struct vm_fault *vmf)
* By the time we get here, we already hold the mm semaphore
*
* The mmap_lock may have been released depending on flags and our
- * return value. See filemap_fault() and __lock_page_or_retry().
+ * return value. See filemap_fault() and __lock_folio_or_retry().
*/
static vm_fault_t __handle_mm_fault(struct vm_area_struct *vma,
unsigned long address, unsigned int flags)
@@ -4548,7 +4548,7 @@ static inline void mm_account_fault(struct pt_regs *regs,
* By the time we get here, we already hold the mm semaphore
*
* The mmap_lock may have been released depending on flags and our
- * return value. See filemap_fault() and __lock_page_or_retry().
+ * return value. See filemap_fault() and __lock_folio_or_retry().
*/
vm_fault_t handle_mm_fault(struct vm_area_struct *vma, unsigned long address,
unsigned int flags, struct pt_regs *regs)
--
2.30.2

2021-03-20 11:16:06

by Matthew Wilcox

[permalink] [raw]
Subject: [PATCH v5 23/27] mm/writeback: Add wait_for_stable_folio

Move wait_for_stable_page() into the folio compatibility file.
wait_for_stable_folio() avoids a call to compound_head() and is 14 bytes
smaller than wait_for_stable_page() was. The net text size grows by 24
bytes as a result of this patch.
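
By way of a hedged example, the typical caller is a ->page_mkwrite()
handler; a minimal fragment (setup and error handling omitted, 'vmf'
supplied by the caller) might read:

        struct folio *folio = page_folio(vmf->page);

        lock_page(&folio->page);
        /* Sleeps only if the backing device requires stable page contents. */
        wait_for_stable_folio(folio);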

Signed-off-by: Matthew Wilcox (Oracle) <[email protected]>
---
include/linux/pagemap.h | 1 +
mm/folio-compat.c | 6 ++++++
mm/page-writeback.c | 17 ++++++++---------
3 files changed, 15 insertions(+), 9 deletions(-)

diff --git a/include/linux/pagemap.h b/include/linux/pagemap.h
index a6adf69ea5c5..c92782b77d98 100644
--- a/include/linux/pagemap.h
+++ b/include/linux/pagemap.h
@@ -813,6 +813,7 @@ int wait_on_folio_writeback_killable(struct folio *folio);
void end_page_writeback(struct page *page);
void end_folio_writeback(struct folio *folio);
void wait_for_stable_page(struct page *page);
+void wait_for_stable_folio(struct folio *folio);

void page_endio(struct page *page, bool is_write, int err);

diff --git a/mm/folio-compat.c b/mm/folio-compat.c
index 6aadecc39fba..335594fe414e 100644
--- a/mm/folio-compat.c
+++ b/mm/folio-compat.c
@@ -29,3 +29,9 @@ void wait_on_page_writeback(struct page *page)
return wait_on_folio_writeback(page_folio(page));
}
EXPORT_SYMBOL_GPL(wait_on_page_writeback);
+
+void wait_for_stable_page(struct page *page)
+{
+ return wait_for_stable_folio(page_folio(page));
+}
+EXPORT_SYMBOL_GPL(wait_for_stable_page);
diff --git a/mm/page-writeback.c b/mm/page-writeback.c
index a08e77abcf12..c222f88cf06b 100644
--- a/mm/page-writeback.c
+++ b/mm/page-writeback.c
@@ -2862,17 +2862,16 @@ int wait_on_folio_writeback_killable(struct folio *folio)
EXPORT_SYMBOL_GPL(wait_on_folio_writeback_killable);

/**
- * wait_for_stable_page() - wait for writeback to finish, if necessary.
- * @page: The page to wait on.
+ * wait_for_stable_folio() - wait for writeback to finish, if necessary.
+ * @folio: The folio to wait on.
*
- * This function determines if the given page is related to a backing device
- * that requires page contents to be held stable during writeback. If so, then
+ * This function determines if the given folio is related to a backing device
+ * that requires folio contents to be held stable during writeback. If so, then
* it will wait for any pending writeback to complete.
*/
-void wait_for_stable_page(struct page *page)
+void wait_for_stable_folio(struct folio *folio)
{
- page = thp_head(page);
- if (page->mapping->host->i_sb->s_iflags & SB_I_STABLE_WRITES)
- wait_on_page_writeback(page);
+ if (folio->page.mapping->host->i_sb->s_iflags & SB_I_STABLE_WRITES)
+ wait_on_folio_writeback(folio);
}
-EXPORT_SYMBOL_GPL(wait_for_stable_page);
+EXPORT_SYMBOL_GPL(wait_for_stable_folio);
--
2.30.2

2021-03-20 11:16:24

by Matthew Wilcox

[permalink] [raw]
Subject: [PATCH v5 27/27] mm/doc: Build kerneldoc for various mm files

These files weren't included in the html docs.

Signed-off-by: Matthew Wilcox (Oracle) <[email protected]>
---
Documentation/core-api/mm-api.rst | 7 +++++++
1 file changed, 7 insertions(+)

diff --git a/Documentation/core-api/mm-api.rst b/Documentation/core-api/mm-api.rst
index 874ae1250258..3af5875a1d9e 100644
--- a/Documentation/core-api/mm-api.rst
+++ b/Documentation/core-api/mm-api.rst
@@ -93,3 +93,10 @@ More Memory Management Functions

.. kernel-doc:: mm/page_alloc.c
.. kernel-doc:: mm/mempolicy.c
+
+.. kernel-doc:: include/linux/mm_types.h
+ :internal:
+.. kernel-doc:: include/linux/mm.h
+ :internal:
+.. kernel-doc:: mm/util.c
+ :functions: folio_mapping
--
2.30.2

2021-03-20 11:16:57

by Matthew Wilcox

[permalink] [raw]
Subject: [PATCH v5 21/27] mm/filemap: Add end_folio_writeback

Add an end_page_writeback() wrapper function for users that are not yet
converted to folios.

end_folio_writeback() is less than half the size of end_page_writeback()
at just 105 bytes compared to 213 bytes, due to removing all the
compound_head() calls. The 30-byte wrapper function makes this a net
saving of 70 bytes.
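
As a sketch of a converted caller (myfs_end_bio() is hypothetical), a
write-completion handler would end writeback on the folio rather than
the page:

        static void myfs_end_bio(struct bio *bio)
        {
                struct folio *folio = page_folio(bio_first_page_all(bio));

                /* Clears the writeback bit and wakes any waiters. */
                end_folio_writeback(folio);
                bio_put(bio);
        }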

Signed-off-by: Matthew Wilcox (Oracle) <[email protected]>
---
include/linux/pagemap.h | 3 ++-
mm/filemap.c | 38 +++++++++++++++++++-------------------
mm/folio-compat.c | 6 ++++++
3 files changed, 27 insertions(+), 20 deletions(-)

diff --git a/include/linux/pagemap.h b/include/linux/pagemap.h
index a8e19e4e0b09..2ee6b1b9561c 100644
--- a/include/linux/pagemap.h
+++ b/include/linux/pagemap.h
@@ -809,7 +809,8 @@ static inline int wait_on_page_locked_killable(struct page *page)
int put_and_wait_on_page_locked(struct page *page, int state);
void wait_on_page_writeback(struct page *page);
int wait_on_page_writeback_killable(struct page *page);
-extern void end_page_writeback(struct page *page);
+void end_page_writeback(struct page *page);
+void end_folio_writeback(struct folio *folio);
void wait_for_stable_page(struct page *page);

void page_endio(struct page *page, bool is_write, int err);
diff --git a/mm/filemap.c b/mm/filemap.c
index 99758045ec2d..dc7deb8c36ee 100644
--- a/mm/filemap.c
+++ b/mm/filemap.c
@@ -1175,11 +1175,11 @@ static void wake_up_page_bit(struct page *page, int bit_nr)
spin_unlock_irqrestore(&q->lock, flags);
}

-static void wake_up_page(struct page *page, int bit)
+static void wake_up_folio(struct folio *folio, int bit)
{
- if (!PageWaiters(page))
+ if (!FolioWaiters(folio))
return;
- wake_up_page_bit(page, bit);
+ wake_up_page_bit(&folio->page, bit);
}

/*
@@ -1473,38 +1473,38 @@ void unlock_page_private_2(struct page *page)
EXPORT_SYMBOL(unlock_page_private_2);

/**
- * end_page_writeback - end writeback against a page
- * @page: the page
+ * end_folio_writeback - End writeback against a folio.
+ * @folio: The folio.
*/
-void end_page_writeback(struct page *page)
+void end_folio_writeback(struct folio *folio)
{
/*
* TestClearPageReclaim could be used here but it is an atomic
* operation and overkill in this particular case. Failing to
- * shuffle a page marked for immediate reclaim is too mild to
+ * shuffle a folio marked for immediate reclaim is too mild to
* justify taking an atomic operation penalty at the end of
- * ever page writeback.
+ * every folio writeback.
*/
- if (PageReclaim(page)) {
- ClearPageReclaim(page);
- rotate_reclaimable_page(page);
+ if (FolioReclaim(folio)) {
+ ClearFolioReclaim(folio);
+ rotate_reclaimable_page(&folio->page);
}

/*
- * Writeback does not hold a page reference of its own, relying
+ * Writeback does not hold a folio reference of its own, relying
* on truncation to wait for the clearing of PG_writeback.
- * But here we must make sure that the page is not freed and
- * reused before the wake_up_page().
+ * But here we must make sure that the folio is not freed and
+ * reused before the wake_up_folio().
*/
- get_page(page);
- if (!test_clear_page_writeback(page))
+ get_folio(folio);
+ if (!test_clear_page_writeback(&folio->page))
BUG();

smp_mb__after_atomic();
- wake_up_page(page, PG_writeback);
- put_page(page);
+ wake_up_folio(folio, PG_writeback);
+ put_folio(folio);
}
-EXPORT_SYMBOL(end_page_writeback);
+EXPORT_SYMBOL(end_folio_writeback);

/*
* After completing I/O on a page, call this routine to update the page
diff --git a/mm/folio-compat.c b/mm/folio-compat.c
index 02798abf19a1..d1a1dfe52589 100644
--- a/mm/folio-compat.c
+++ b/mm/folio-compat.c
@@ -17,3 +17,9 @@ void unlock_page(struct page *page)
return unlock_folio(page_folio(page));
}
EXPORT_SYMBOL(unlock_page);
+
+void end_page_writeback(struct page *page)
+{
+ return end_folio_writeback(page_folio(page));
+}
+EXPORT_SYMBOL(end_page_writeback);
--
2.30.2

2021-03-20 11:16:57

by Matthew Wilcox

[permalink] [raw]
Subject: [PATCH v5 22/27] mm/writeback: Add wait_on_folio_writeback

wait_on_page_writeback_killable() only has one caller, so convert it to
call wait_on_folio_writeback_killable(). For the wait_on_page_writeback()
callers, add a compatibility wrapper around wait_on_folio_writeback().

Turning PageWriteback() into FolioWriteback() eliminates a call to
compound_head(), saving 8 bytes and 15 bytes in the two functions
respectively. That saving is more than offset by adding the
wait_on_page_writeback() compatibility wrapper, for a net increase in
text of 15 bytes.
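
A minimal sketch of a converted caller, assuming 'folio' is a folio we
hold a reference to and simply want any in-flight writeback to finish
before modifying:

        lock_page(&folio->page);
        /* May sleep; must not be called with spinlocks held. */
        wait_on_folio_writeback(folio);
        /* ... modify the folio contents ... */
        unlock_folio(folio);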

Signed-off-by: Matthew Wilcox (Oracle) <[email protected]>
---
fs/afs/write.c | 2 +-
include/linux/pagemap.h | 3 ++-
mm/folio-compat.c | 6 ++++++
mm/page-writeback.c | 43 +++++++++++++++++++++++++++--------------
4 files changed, 38 insertions(+), 16 deletions(-)

diff --git a/fs/afs/write.c b/fs/afs/write.c
index 106a864b6a93..4b70b0e7fcfa 100644
--- a/fs/afs/write.c
+++ b/fs/afs/write.c
@@ -850,7 +850,7 @@ vm_fault_t afs_page_mkwrite(struct vm_fault *vmf)
return VM_FAULT_RETRY;
#endif

- if (wait_on_page_writeback_killable(page))
+ if (wait_on_folio_writeback_killable(page_folio(page)))
return VM_FAULT_RETRY;

if (lock_page_killable(page) < 0)
diff --git a/include/linux/pagemap.h b/include/linux/pagemap.h
index 2ee6b1b9561c..a6adf69ea5c5 100644
--- a/include/linux/pagemap.h
+++ b/include/linux/pagemap.h
@@ -808,7 +808,8 @@ static inline int wait_on_page_locked_killable(struct page *page)

int put_and_wait_on_page_locked(struct page *page, int state);
void wait_on_page_writeback(struct page *page);
-int wait_on_page_writeback_killable(struct page *page);
+void wait_on_folio_writeback(struct folio *folio);
+int wait_on_folio_writeback_killable(struct folio *folio);
void end_page_writeback(struct page *page);
void end_folio_writeback(struct folio *folio);
void wait_for_stable_page(struct page *page);
diff --git a/mm/folio-compat.c b/mm/folio-compat.c
index d1a1dfe52589..6aadecc39fba 100644
--- a/mm/folio-compat.c
+++ b/mm/folio-compat.c
@@ -23,3 +23,9 @@ void end_page_writeback(struct page *page)
return end_folio_writeback(page_folio(page));
}
EXPORT_SYMBOL(end_page_writeback);
+
+void wait_on_page_writeback(struct page *page)
+{
+ return wait_on_folio_writeback(page_folio(page));
+}
+EXPORT_SYMBOL_GPL(wait_on_page_writeback);
diff --git a/mm/page-writeback.c b/mm/page-writeback.c
index 5e761fb62800..a08e77abcf12 100644
--- a/mm/page-writeback.c
+++ b/mm/page-writeback.c
@@ -2818,33 +2818,48 @@ int __test_set_page_writeback(struct page *page, bool keep_write)
}
EXPORT_SYMBOL(__test_set_page_writeback);

-/*
- * Wait for a page to complete writeback
+/**
+ * wait_on_folio_writeback - Wait for a folio to complete writeback.
+ * @folio: The folio to wait for.
+ *
+ * If the folio is currently being written back to storage, wait for the
+ * I/O to complete.
+ *
+ * Context: Sleeps; must be called in process context and with no spinlocks
+ * held.
*/
-void wait_on_page_writeback(struct page *page)
+void wait_on_folio_writeback(struct folio *folio)
{
- while (PageWriteback(page)) {
- trace_wait_on_page_writeback(page, page_mapping(page));
- wait_on_page_bit(page, PG_writeback);
+ while (FolioWriteback(folio)) {
+ trace_wait_on_page_writeback(&folio->page, folio_mapping(folio));
+ wait_on_page_bit(&folio->page, PG_writeback);
}
}
-EXPORT_SYMBOL_GPL(wait_on_page_writeback);
+EXPORT_SYMBOL_GPL(wait_on_folio_writeback);

-/*
- * Wait for a page to complete writeback. Returns -EINTR if we get a
+/**
+ * wait_on_folio_writeback_killable - Wait for a folio to complete writeback.
+ * @folio: The folio to wait for.
+ *
+ * If the folio is currently being written back to storage, wait for the
+ * I/O to complete or a fatal signal to arrive.
+ *
+ * Context: Sleeps; must be called in process context and with no spinlocks
+ * held.
+ * Return: 0 if the folio has completed writeback. -EINTR if we get a
* fatal signal while waiting.
*/
-int wait_on_page_writeback_killable(struct page *page)
+int wait_on_folio_writeback_killable(struct folio *folio)
{
- while (PageWriteback(page)) {
- trace_wait_on_page_writeback(page, page_mapping(page));
- if (wait_on_page_bit_killable(page, PG_writeback))
+ while (FolioWriteback(folio)) {
+ trace_wait_on_page_writeback(&folio->page, folio_mapping(folio));
+ if (wait_on_page_bit_killable(&folio->page, PG_writeback))
return -EINTR;
}

return 0;
}
-EXPORT_SYMBOL_GPL(wait_on_page_writeback_killable);
+EXPORT_SYMBOL_GPL(wait_on_folio_writeback_killable);

/**
* wait_for_stable_page() - wait for writeback to finish, if necessary.
--
2.30.2

2021-03-20 11:17:11

by Matthew Wilcox

[permalink] [raw]
Subject: [PATCH v5 25/27] mm/filemap: Convert wake_up_page_bit to wake_up_folio_bit

All callers have a folio, so use it directly.

Signed-off-by: Matthew Wilcox (Oracle) <[email protected]>
---
mm/filemap.c | 20 ++++++++++----------
1 file changed, 10 insertions(+), 10 deletions(-)

diff --git a/mm/filemap.c b/mm/filemap.c
index f8746c149562..f5bacbe702ff 100644
--- a/mm/filemap.c
+++ b/mm/filemap.c
@@ -1121,14 +1121,14 @@ static int wake_page_function(wait_queue_entry_t *wait, unsigned mode, int sync,
return (flags & WQ_FLAG_EXCLUSIVE) != 0;
}

-static void wake_up_page_bit(struct page *page, int bit_nr)
+static void wake_up_folio_bit(struct folio *folio, int bit_nr)
{
- wait_queue_head_t *q = page_waitqueue(page);
+ wait_queue_head_t *q = page_waitqueue(&folio->page);
struct wait_page_key key;
unsigned long flags;
wait_queue_entry_t bookmark;

- key.page = page;
+ key.page = &folio->page;
key.bit_nr = bit_nr;
key.page_match = 0;

@@ -1163,7 +1163,7 @@ static void wake_up_page_bit(struct page *page, int bit_nr)
* page waiters.
*/
if (!waitqueue_active(q) || !key.page_match) {
- ClearPageWaiters(page);
+ ClearFolioWaiters(folio);
/*
* It's possible to miss clearing Waiters here, when we woke
* our page waiters, but the hashed waitqueue has waiters for
@@ -1179,7 +1179,7 @@ static void wake_up_folio(struct folio *folio, int bit)
{
if (!FolioWaiters(folio))
return;
- wake_up_page_bit(&folio->page, bit);
+ wake_up_folio_bit(folio, bit);
}

/*
@@ -1444,7 +1444,7 @@ void unlock_folio(struct folio *folio)
BUILD_BUG_ON(PG_waiters != 7);
VM_BUG_ON_FOLIO(!FolioLocked(folio), folio);
if (clear_bit_unlock_is_negative_byte(PG_locked, folio_flags(folio, 0)))
- wake_up_page_bit(&folio->page, PG_locked);
+ wake_up_folio_bit(folio, PG_locked);
}
EXPORT_SYMBOL(unlock_folio);

@@ -1461,10 +1461,10 @@ EXPORT_SYMBOL(unlock_folio);
*/
void unlock_page_private_2(struct page *page)
{
- page = compound_head(page);
- VM_BUG_ON_PAGE(!PagePrivate2(page), page);
- clear_bit_unlock(PG_private_2, &page->flags);
- wake_up_page_bit(page, PG_private_2);
+ struct folio *folio = page_folio(page);
+ VM_BUG_ON_FOLIO(!FolioPrivate2(folio), folio);
+ clear_bit_unlock(PG_private_2, folio_flags(folio, 0));
+ wake_up_folio_bit(folio, PG_private_2);
}
EXPORT_SYMBOL(unlock_page_private_2);

--
2.30.2

2021-03-20 11:17:29

by Matthew Wilcox

[permalink] [raw]
Subject: [PATCH v5 24/27] mm/filemap: Convert wait_on_page_bit to wait_on_folio_bit

We must always wait on the folio, otherwise we won't be woken up:
wake-ups are issued against the folio (the head page), so a waiter that
hashed a tail page into the waitqueue table would sleep on the wrong
queue and never match the wake key.

This commit shrinks the kernel by 691 bytes, mostly due to moving
the page waitqueue lookup into wait_on_folio_bit_common().

Signed-off-by: Matthew Wilcox (Oracle) <[email protected]>
---
include/linux/netfs.h | 2 +-
include/linux/pagemap.h | 10 ++++----
mm/filemap.c | 56 ++++++++++++++++++-----------------------
mm/page-writeback.c | 4 +--
4 files changed, 33 insertions(+), 39 deletions(-)

diff --git a/include/linux/netfs.h b/include/linux/netfs.h
index 9d3fbed4e30a..f44142dca767 100644
--- a/include/linux/netfs.h
+++ b/include/linux/netfs.h
@@ -54,7 +54,7 @@ static inline void unlock_page_fscache(struct page *page)
static inline void wait_on_page_fscache(struct page *page)
{
if (PageFsCache(page))
- wait_on_page_bit(compound_head(page), PG_fscache);
+ wait_on_folio_bit(page_folio(page), PG_fscache);
}

enum netfs_read_source {
diff --git a/include/linux/pagemap.h b/include/linux/pagemap.h
index c92782b77d98..7ddaabbd1ddb 100644
--- a/include/linux/pagemap.h
+++ b/include/linux/pagemap.h
@@ -770,11 +770,11 @@ static inline int lock_page_or_retry(struct page *page, struct mm_struct *mm,
}

/*
- * This is exported only for wait_on_page_locked/wait_on_page_writeback, etc.,
+ * This is exported only for wait_on_folio_locked/wait_on_folio_writeback, etc.,
* and should not be used directly.
*/
-extern void wait_on_page_bit(struct page *page, int bit_nr);
-extern int wait_on_page_bit_killable(struct page *page, int bit_nr);
+extern void wait_on_folio_bit(struct folio *folio, int bit_nr);
+extern int wait_on_folio_bit_killable(struct folio *folio, int bit_nr);

/*
* Wait for a folio to be unlocked.
@@ -786,14 +786,14 @@ extern int wait_on_page_bit_killable(struct page *page, int bit_nr);
static inline void wait_on_folio_locked(struct folio *folio)
{
if (FolioLocked(folio))
- wait_on_page_bit(&folio->page, PG_locked);
+ wait_on_folio_bit(folio, PG_locked);
}

static inline int wait_on_folio_locked_killable(struct folio *folio)
{
if (!FolioLocked(folio))
return 0;
- return wait_on_page_bit_killable(&folio->page, PG_locked);
+ return wait_on_folio_bit_killable(folio, PG_locked);
}

static inline void wait_on_page_locked(struct page *page)
diff --git a/mm/filemap.c b/mm/filemap.c
index dc7deb8c36ee..f8746c149562 100644
--- a/mm/filemap.c
+++ b/mm/filemap.c
@@ -1102,7 +1102,7 @@ static int wake_page_function(wait_queue_entry_t *wait, unsigned mode, int sync,
*
* So update the flags atomically, and wake up the waiter
* afterwards to avoid any races. This store-release pairs
- * with the load-acquire in wait_on_page_bit_common().
+ * with the load-acquire in wait_on_folio_bit_common().
*/
smp_store_release(&wait->flags, flags | WQ_FLAG_WOKEN);
wake_up_state(wait->private, mode);
@@ -1183,7 +1183,7 @@ static void wake_up_folio(struct folio *folio, int bit)
}

/*
- * A choice of three behaviors for wait_on_page_bit_common():
+ * A choice of three behaviors for wait_on_folio_bit_common():
*/
enum behavior {
EXCLUSIVE, /* Hold ref to page and take the bit when woken, like
@@ -1217,9 +1217,10 @@ static inline bool trylock_page_bit_common(struct page *page, int bit_nr,
/* How many times do we accept lock stealing from under a waiter? */
int sysctl_page_lock_unfairness = 5;

-static inline int wait_on_page_bit_common(wait_queue_head_t *q,
- struct page *page, int bit_nr, int state, enum behavior behavior)
+static inline int wait_on_folio_bit_common(struct folio *folio, int bit_nr,
+ int state, enum behavior behavior)
{
+ wait_queue_head_t *q = page_waitqueue(&folio->page);
int unfairness = sysctl_page_lock_unfairness;
struct wait_page_queue wait_page;
wait_queue_entry_t *wait = &wait_page.wait;
@@ -1228,8 +1229,8 @@ static inline int wait_on_page_bit_common(wait_queue_head_t *q,
unsigned long pflags;

if (bit_nr == PG_locked &&
- !PageUptodate(page) && PageWorkingset(page)) {
- if (!PageSwapBacked(page)) {
+ !FolioUptodate(folio) && FolioWorkingset(folio)) {
+ if (!FolioSwapBacked(folio)) {
delayacct_thrashing_start();
delayacct = true;
}
@@ -1239,7 +1240,7 @@ static inline int wait_on_page_bit_common(wait_queue_head_t *q,

init_wait(wait);
wait->func = wake_page_function;
- wait_page.page = page;
+ wait_page.page = &folio->page;
wait_page.bit_nr = bit_nr;

repeat:
@@ -1254,7 +1255,7 @@ static inline int wait_on_page_bit_common(wait_queue_head_t *q,
* Do one last check whether we can get the
* page bit synchronously.
*
- * Do the SetPageWaiters() marking before that
+ * Do the SetFolioWaiters() marking before that
* to let any waker we _just_ missed know they
* need to wake us up (otherwise they'll never
* even go to the slow case that looks at the
@@ -1265,8 +1266,8 @@ static inline int wait_on_page_bit_common(wait_queue_head_t *q,
* lock to avoid races.
*/
spin_lock_irq(&q->lock);
- SetPageWaiters(page);
- if (!trylock_page_bit_common(page, bit_nr, wait))
+ SetFolioWaiters(folio);
+ if (!trylock_page_bit_common(&folio->page, bit_nr, wait))
__add_wait_queue_entry_tail(q, wait);
spin_unlock_irq(&q->lock);

@@ -1276,10 +1277,10 @@ static inline int wait_on_page_bit_common(wait_queue_head_t *q,
* see whether the page bit testing has already
* been done by the wake function.
*
- * We can drop our reference to the page.
+ * We can drop our reference to the folio.
*/
if (behavior == DROP)
- put_page(page);
+ put_folio(folio);

/*
* Note that until the "finish_wait()", or until
@@ -1316,7 +1317,7 @@ static inline int wait_on_page_bit_common(wait_queue_head_t *q,
*
* And if that fails, we'll have to retry this all.
*/
- if (unlikely(test_and_set_bit(bit_nr, &page->flags)))
+ if (unlikely(test_and_set_bit(bit_nr, folio_flags(folio, 0))))
goto repeat;

wait->flags |= WQ_FLAG_DONE;
@@ -1325,7 +1326,7 @@ static inline int wait_on_page_bit_common(wait_queue_head_t *q,

/*
* If a signal happened, this 'finish_wait()' may remove the last
- * waiter from the wait-queues, but the PageWaiters bit will remain
+ * waiter from the wait-queues, but the FolioWaiters bit will remain
* set. That's ok. The next wakeup will take care of it, and trying
* to do it here would be difficult and prone to races.
*/
@@ -1356,19 +1357,17 @@ static inline int wait_on_page_bit_common(wait_queue_head_t *q,
return wait->flags & WQ_FLAG_WOKEN ? 0 : -EINTR;
}

-void wait_on_page_bit(struct page *page, int bit_nr)
+void wait_on_folio_bit(struct folio *folio, int bit_nr)
{
- wait_queue_head_t *q = page_waitqueue(page);
- wait_on_page_bit_common(q, page, bit_nr, TASK_UNINTERRUPTIBLE, SHARED);
+ wait_on_folio_bit_common(folio, bit_nr, TASK_UNINTERRUPTIBLE, SHARED);
}
-EXPORT_SYMBOL(wait_on_page_bit);
+EXPORT_SYMBOL(wait_on_folio_bit);

-int wait_on_page_bit_killable(struct page *page, int bit_nr)
+int wait_on_folio_bit_killable(struct folio *folio, int bit_nr)
{
- wait_queue_head_t *q = page_waitqueue(page);
- return wait_on_page_bit_common(q, page, bit_nr, TASK_KILLABLE, SHARED);
+ return wait_on_folio_bit_common(folio, bit_nr, TASK_KILLABLE, SHARED);
}
-EXPORT_SYMBOL(wait_on_page_bit_killable);
+EXPORT_SYMBOL(wait_on_folio_bit_killable);

/**
* put_and_wait_on_page_locked - Drop a reference and wait for it to be unlocked
@@ -1385,11 +1384,8 @@ EXPORT_SYMBOL(wait_on_page_bit_killable);
*/
int put_and_wait_on_page_locked(struct page *page, int state)
{
- wait_queue_head_t *q;
-
- page = compound_head(page);
- q = page_waitqueue(page);
- return wait_on_page_bit_common(q, page, PG_locked, state, DROP);
+ return wait_on_folio_bit_common(page_folio(page), PG_locked, state,
+ DROP);
}

/**
@@ -1540,16 +1536,14 @@ EXPORT_SYMBOL_GPL(page_endio);
*/
void __lock_folio(struct folio *folio)
{
- wait_queue_head_t *q = page_waitqueue(&folio->page);
- wait_on_page_bit_common(q, &folio->page, PG_locked, TASK_UNINTERRUPTIBLE,
+ wait_on_folio_bit_common(folio, PG_locked, TASK_UNINTERRUPTIBLE,
EXCLUSIVE);
}
EXPORT_SYMBOL(__lock_folio);

int __lock_folio_killable(struct folio *folio)
{
- wait_queue_head_t *q = page_waitqueue(&folio->page);
- return wait_on_page_bit_common(q, &folio->page, PG_locked, TASK_KILLABLE,
+ return wait_on_folio_bit_common(folio, PG_locked, TASK_KILLABLE,
EXCLUSIVE);
}
EXPORT_SYMBOL_GPL(__lock_folio_killable);
diff --git a/mm/page-writeback.c b/mm/page-writeback.c
index c222f88cf06b..b29737cd8049 100644
--- a/mm/page-writeback.c
+++ b/mm/page-writeback.c
@@ -2832,7 +2832,7 @@ void wait_on_folio_writeback(struct folio *folio)
{
while (FolioWriteback(folio)) {
trace_wait_on_page_writeback(&folio->page, folio_mapping(folio));
- wait_on_page_bit(&folio->page, PG_writeback);
+ wait_on_folio_bit(folio, PG_writeback);
}
}
EXPORT_SYMBOL_GPL(wait_on_folio_writeback);
@@ -2853,7 +2853,7 @@ int wait_on_folio_writeback_killable(struct folio *folio)
{
while (FolioWriteback(folio)) {
trace_wait_on_page_writeback(&folio->page, folio_mapping(folio));
- if (wait_on_page_bit_killable(&folio->page, PG_writeback))
+ if (wait_on_folio_bit_killable(folio, PG_writeback))
return -EINTR;
}

--
2.30.2

2021-03-20 11:18:17

by Matthew Wilcox

[permalink] [raw]
Subject: [PATCH v5 26/27] mm/filemap: Convert page wait queues to be folios

Reinforce that when we wait for a bit in a struct page, that bit is
actually in the head page, by changing the type from page to folio.
This increases the size of cachefiles by two bytes, but the kernel core
is unchanged in size.
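
For illustration, a driver-private wake function (my_read_waiter() is a
made-up name, loosely modelled on the cachefiles change below) now
compares against key->folio instead of key->page:

        static int my_read_waiter(wait_queue_entry_t *wait, unsigned mode,
                                  int sync, void *_key)
        {
                struct wait_page_key *key = _key;
                struct folio *folio = wait->private;

                if (key->folio != folio || key->bit_nr != PG_locked)
                        return 0;

                list_del(&wait->entry);
                /* ... hand the folio off to a worker here ... */
                return 1;
        }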

Signed-off-by: Matthew Wilcox (Oracle) <[email protected]>
---
fs/cachefiles/rdwr.c | 16 ++++++++--------
include/linux/pagemap.h | 8 ++++----
mm/filemap.c | 33 +++++++++++++++++----------------
3 files changed, 29 insertions(+), 28 deletions(-)

diff --git a/fs/cachefiles/rdwr.c b/fs/cachefiles/rdwr.c
index 8ffc40e84a59..ef50bd80ae74 100644
--- a/fs/cachefiles/rdwr.c
+++ b/fs/cachefiles/rdwr.c
@@ -25,20 +25,20 @@ static int cachefiles_read_waiter(wait_queue_entry_t *wait, unsigned mode,
struct cachefiles_object *object;
struct fscache_retrieval *op = monitor->op;
struct wait_page_key *key = _key;
- struct page *page = wait->private;
+ struct folio *folio = wait->private;

ASSERT(key);

_enter("{%lu},%u,%d,{%p,%u}",
monitor->netfs_page->index, mode, sync,
- key->page, key->bit_nr);
+ key->folio, key->bit_nr);

- if (key->page != page || key->bit_nr != PG_locked)
+ if (key->folio != folio || key->bit_nr != PG_locked)
return 0;

- _debug("--- monitor %p %lx ---", page, page->flags);
+ _debug("--- monitor %p %lx ---", folio, folio->page.flags);

- if (!PageUptodate(page) && !PageError(page)) {
+ if (!FolioUptodate(folio) && !FolioError(folio)) {
/* unlocked, not uptodate and not erronous? */
_debug("page probably truncated");
}
@@ -107,7 +107,7 @@ static int cachefiles_read_reissue(struct cachefiles_object *object,
put_page(backpage2);

INIT_LIST_HEAD(&monitor->op_link);
- add_page_wait_queue(backpage, &monitor->monitor);
+ add_folio_wait_queue(page_folio(backpage), &monitor->monitor);

if (trylock_page(backpage)) {
ret = -EIO;
@@ -294,7 +294,7 @@ static int cachefiles_read_backing_file_one(struct cachefiles_object *object,
get_page(backpage);
monitor->back_page = backpage;
monitor->monitor.private = backpage;
- add_page_wait_queue(backpage, &monitor->monitor);
+ add_folio_wait_queue(page_folio(backpage), &monitor->monitor);
monitor = NULL;

/* but the page may have been read before the monitor was installed, so
@@ -548,7 +548,7 @@ static int cachefiles_read_backing_file(struct cachefiles_object *object,
get_page(backpage);
monitor->back_page = backpage;
monitor->monitor.private = backpage;
- add_page_wait_queue(backpage, &monitor->monitor);
+ add_folio_wait_queue(page_folio(backpage), &monitor->monitor);
monitor = NULL;

/* but the page may have been read before the monitor was
diff --git a/include/linux/pagemap.h b/include/linux/pagemap.h
index 7ddaabbd1ddb..78d865c2f2da 100644
--- a/include/linux/pagemap.h
+++ b/include/linux/pagemap.h
@@ -669,13 +669,13 @@ static inline pgoff_t linear_page_index(struct vm_area_struct *vma,
}

struct wait_page_key {
- struct page *page;
+ struct folio *folio;
int bit_nr;
int page_match;
};

struct wait_page_queue {
- struct page *page;
+ struct folio *folio;
int bit_nr;
wait_queue_entry_t wait;
};
@@ -683,7 +683,7 @@ struct wait_page_queue {
static inline bool wake_page_match(struct wait_page_queue *wait_page,
struct wait_page_key *key)
{
- if (wait_page->page != key->page)
+ if (wait_page->folio != key->folio)
return false;
key->page_match = 1;

@@ -820,7 +820,7 @@ void page_endio(struct page *page, bool is_write, int err);
/*
* Add an arbitrary waiter to a page's wait queue
*/
-extern void add_page_wait_queue(struct page *page, wait_queue_entry_t *waiter);
+void add_folio_wait_queue(struct folio *folio, wait_queue_entry_t *waiter);

/*
* Fault everything in given userspace address range in.
diff --git a/mm/filemap.c b/mm/filemap.c
index f5bacbe702ff..d9238d921009 100644
--- a/mm/filemap.c
+++ b/mm/filemap.c
@@ -1019,11 +1019,11 @@ EXPORT_SYMBOL(__page_cache_alloc);
*/
#define PAGE_WAIT_TABLE_BITS 8
#define PAGE_WAIT_TABLE_SIZE (1 << PAGE_WAIT_TABLE_BITS)
-static wait_queue_head_t page_wait_table[PAGE_WAIT_TABLE_SIZE] __cacheline_aligned;
+static wait_queue_head_t folio_wait_table[PAGE_WAIT_TABLE_SIZE] __cacheline_aligned;

-static wait_queue_head_t *page_waitqueue(struct page *page)
+static wait_queue_head_t *folio_waitqueue(struct folio *folio)
{
- return &page_wait_table[hash_ptr(page, PAGE_WAIT_TABLE_BITS)];
+ return &folio_wait_table[hash_ptr(folio, PAGE_WAIT_TABLE_BITS)];
}

void __init pagecache_init(void)
@@ -1031,7 +1031,7 @@ void __init pagecache_init(void)
int i;

for (i = 0; i < PAGE_WAIT_TABLE_SIZE; i++)
- init_waitqueue_head(&page_wait_table[i]);
+ init_waitqueue_head(&folio_wait_table[i]);

page_writeback_init();
}
@@ -1086,10 +1086,11 @@ static int wake_page_function(wait_queue_entry_t *wait, unsigned mode, int sync,
*/
flags = wait->flags;
if (flags & WQ_FLAG_EXCLUSIVE) {
- if (test_bit(key->bit_nr, &key->page->flags))
+ if (test_bit(key->bit_nr, &key->folio->page.flags))
return -1;
if (flags & WQ_FLAG_CUSTOM) {
- if (test_and_set_bit(key->bit_nr, &key->page->flags))
+ if (test_and_set_bit(key->bit_nr,
+ &key->folio->page.flags))
return -1;
flags |= WQ_FLAG_DONE;
}
@@ -1123,12 +1124,12 @@ static int wake_page_function(wait_queue_entry_t *wait, unsigned mode, int sync,

static void wake_up_folio_bit(struct folio *folio, int bit_nr)
{
- wait_queue_head_t *q = page_waitqueue(&folio->page);
+ wait_queue_head_t *q = folio_waitqueue(folio);
struct wait_page_key key;
unsigned long flags;
wait_queue_entry_t bookmark;

- key.page = &folio->page;
+ key.folio = folio;
key.bit_nr = bit_nr;
key.page_match = 0;

@@ -1220,7 +1221,7 @@ int sysctl_page_lock_unfairness = 5;
static inline int wait_on_folio_bit_common(struct folio *folio, int bit_nr,
int state, enum behavior behavior)
{
- wait_queue_head_t *q = page_waitqueue(&folio->page);
+ wait_queue_head_t *q = folio_waitqueue(folio);
int unfairness = sysctl_page_lock_unfairness;
struct wait_page_queue wait_page;
wait_queue_entry_t *wait = &wait_page.wait;
@@ -1240,7 +1241,7 @@ static inline int wait_on_folio_bit_common(struct folio *folio, int bit_nr,

init_wait(wait);
wait->func = wake_page_function;
- wait_page.page = &folio->page;
+ wait_page.folio = folio;
wait_page.bit_nr = bit_nr;

repeat:
@@ -1395,17 +1396,17 @@ int put_and_wait_on_page_locked(struct page *page, int state)
*
* Add an arbitrary @waiter to the wait queue for the nominated @page.
*/
-void add_page_wait_queue(struct page *page, wait_queue_entry_t *waiter)
+void add_folio_wait_queue(struct folio *folio, wait_queue_entry_t *waiter)
{
- wait_queue_head_t *q = page_waitqueue(page);
+ wait_queue_head_t *q = folio_waitqueue(folio);
unsigned long flags;

spin_lock_irqsave(&q->lock, flags);
__add_wait_queue_entry_tail(q, waiter);
- SetPageWaiters(page);
+ SetFolioWaiters(folio);
spin_unlock_irqrestore(&q->lock, flags);
}
-EXPORT_SYMBOL_GPL(add_page_wait_queue);
+EXPORT_SYMBOL_GPL(add_folio_wait_queue);

#ifndef clear_bit_unlock_is_negative_byte

@@ -1550,10 +1551,10 @@ EXPORT_SYMBOL_GPL(__lock_folio_killable);

static int __lock_folio_async(struct folio *folio, struct wait_page_queue *wait)
{
- struct wait_queue_head *q = page_waitqueue(&folio->page);
+ struct wait_queue_head *q = folio_waitqueue(folio);
int ret = 0;

- wait->page = &folio->page;
+ wait->folio = folio;
wait->bit_nr = PG_locked;

spin_lock_irq(&q->lock);
--
2.30.2

2021-03-20 11:20:20

by kernel test robot

[permalink] [raw]
Subject: Re: [PATCH v5 26/27] mm/filemap: Convert page wait queues to be folios

Hi "Matthew,

Thank you for the patch! Perhaps something to improve:

[auto build test WARNING on next-20210319]
[cannot apply to linux/master linus/master hnaz-linux-mm/master v5.12-rc3 v5.12-rc2 v5.12-rc1 v5.12-rc3]
[If your patch is applied to the wrong git tree, kindly drop us a note.
And when submitting patch, we suggest to use '--base' as documented in
https://git-scm.com/docs/git-format-patch]

url: https://github.com/0day-ci/linux/commits/Matthew-Wilcox-Oracle/Memory-Folios/20210320-134732
base: f00397ee41c79b6155b9b44abd0055b2c0621349
config: nds32-defconfig (attached as .config)
compiler: nds32le-linux-gcc (GCC) 9.3.0
reproduce (this is a W=1 build):
wget https://raw.githubusercontent.com/intel/lkp-tests/master/sbin/make.cross -O ~/bin/make.cross
chmod +x ~/bin/make.cross
# https://github.com/0day-ci/linux/commit/93822cea6776a7c6c5b1341ed1c3fdbd1e5eeaab
git remote add linux-review https://github.com/0day-ci/linux
git fetch --no-tags linux-review Matthew-Wilcox-Oracle/Memory-Folios/20210320-134732
git checkout 93822cea6776a7c6c5b1341ed1c3fdbd1e5eeaab
# save the attached .config to linux build tree
COMPILER_INSTALL_PATH=$HOME/0day COMPILER=gcc-9.3.0 make.cross ARCH=nds32

If you fix the issue, kindly add following tag as appropriate
Reported-by: kernel test robot <[email protected]>

All warnings (new ones prefixed by >>):

mm/filemap.c:1400: warning: Function parameter or member 'folio' not described in 'add_folio_wait_queue'
>> mm/filemap.c:1400: warning: expecting prototype for add_page_wait_queue(). Prototype was for add_folio_wait_queue() instead


vim +1400 mm/filemap.c

9a1ea439b16b92 Hugh Dickins 2018-12-28 1391
385e1ca5f21c46 David Howells 2009-04-03 1392 /**
385e1ca5f21c46 David Howells 2009-04-03 1393 * add_page_wait_queue - Add an arbitrary waiter to a page's wait queue
697f619fc87aa9 Randy Dunlap 2009-04-13 1394 * @page: Page defining the wait queue of interest
697f619fc87aa9 Randy Dunlap 2009-04-13 1395 * @waiter: Waiter to add to the queue
385e1ca5f21c46 David Howells 2009-04-03 1396 *
385e1ca5f21c46 David Howells 2009-04-03 1397 * Add an arbitrary @waiter to the wait queue for the nominated @page.
385e1ca5f21c46 David Howells 2009-04-03 1398 */
93822cea6776a7 Matthew Wilcox (Oracle 2021-03-20 1399) void add_folio_wait_queue(struct folio *folio, wait_queue_entry_t *waiter)
385e1ca5f21c46 David Howells 2009-04-03 @1400 {
93822cea6776a7 Matthew Wilcox (Oracle 2021-03-20 1401) wait_queue_head_t *q = folio_waitqueue(folio);
385e1ca5f21c46 David Howells 2009-04-03 1402 unsigned long flags;
385e1ca5f21c46 David Howells 2009-04-03 1403
385e1ca5f21c46 David Howells 2009-04-03 1404 spin_lock_irqsave(&q->lock, flags);
9c3a815f471a84 Linus Torvalds 2017-08-28 1405 __add_wait_queue_entry_tail(q, waiter);
93822cea6776a7 Matthew Wilcox (Oracle 2021-03-20 1406) SetFolioWaiters(folio);
385e1ca5f21c46 David Howells 2009-04-03 1407 spin_unlock_irqrestore(&q->lock, flags);
385e1ca5f21c46 David Howells 2009-04-03 1408 }
93822cea6776a7 Matthew Wilcox (Oracle 2021-03-20 1409) EXPORT_SYMBOL_GPL(add_folio_wait_queue);
385e1ca5f21c46 David Howells 2009-04-03 1410

---
0-DAY CI Kernel Test Service, Intel Corporation
https://lists.01.org/hyperkitty/list/[email protected]




2021-03-20 12:14:12

by Matthew Wilcox

[permalink] [raw]
Subject: [PATCH v5 04/27] mm: Introduce struct folio

A struct folio is a new abstraction for a head-or-single page. A function
which takes a struct folio argument declares that it will operate on the
entire (possibly compound) page, not just PAGE_SIZE bytes. In return,
the caller guarantees that the pointer it is passing does not point to
a tail page.
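
As a hedged example of what the type buys us (myfs_zero_folio() is a
made-up helper), a function taking a folio can iterate over every
constituent page without worrying about being handed a tail page:

        static void myfs_zero_folio(struct folio *folio)
        {
                unsigned long i;

                /* folio_nr_pages() is 1 for a base page, 2^order otherwise. */
                for (i = 0; i < folio_nr_pages(folio); i++)
                        clear_highpage(nth_page(&folio->page, i));
        }

A caller which only has a struct page converts at the boundary with
myfs_zero_folio(page_folio(page)).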

Signed-off-by: Matthew Wilcox (Oracle) <[email protected]>
---
include/linux/mm.h | 78 ++++++++++++++++++++++++++++++++++++++++
include/linux/mm_types.h | 36 +++++++++++++++++++
2 files changed, 114 insertions(+)

diff --git a/include/linux/mm.h b/include/linux/mm.h
index cb1e191da319..9b7e3fa12fd3 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -934,6 +934,20 @@ static inline unsigned int compound_order(struct page *page)
return page[1].compound_order;
}

+/**
+ * folio_order - The allocation order of a folio.
+ * @folio: The folio.
+ *
+ * A folio is composed of 2^order pages. See get_order() for the definition
+ * of order.
+ *
+ * Return: The order of the folio.
+ */
+static inline unsigned int folio_order(struct folio *folio)
+{
+ return compound_order(&folio->page);
+}
+
static inline bool hpage_pincount_available(struct page *page)
{
/*
@@ -1579,6 +1593,69 @@ static inline void set_page_links(struct page *page, enum zone_type zone,
#endif
}

+/**
+ * folio_nr_pages - The number of pages in the folio.
+ * @folio: The folio.
+ *
+ * Return: A number which is a power of two.
+ */
+static inline unsigned long folio_nr_pages(struct folio *folio)
+{
+ return compound_nr(&folio->page);
+}
+
+/**
+ * folio_next - Move to the next physical folio.
+ * @folio: The folio we're currently operating on.
+ *
+ * If you have physically contiguous memory which may span more than
+ * one folio (eg a &struct bio_vec), use this function to move from one
+ * folio to the next. Do not use it if the memory is only virtually
+ * contiguous as the folios are almost certainly not adjacent to each
+ * other. This is the folio equivalent to writing ``page++``.
+ *
+ * Context: We assume that the folios are refcounted and/or locked at a
+ * higher level and do not adjust the reference counts.
+ * Return: The next struct folio.
+ */
+static inline struct folio *folio_next(struct folio *folio)
+{
+#if defined(CONFIG_SPARSEMEM) && !defined(CONFIG_SPARSEMEM_VMEMMAP)
+ return (struct folio *)nth_page(&folio->page, folio_nr_pages(folio));
+#else
+ return folio + folio_nr_pages(folio);
+#endif
+}
+
+/**
+ * folio_shift - The number of bits covered by this folio.
+ * @folio: The folio.
+ *
+ * A folio contains a number of bytes which is a power-of-two in size.
+ * This function tells you which power-of-two the folio is.
+ *
+ * Context: The caller should have a reference on the folio to prevent
+ * it from being split. It is not necessary for the folio to be locked.
+ * Return: The base-2 logarithm of the size of this folio.
+ */
+static inline unsigned int folio_shift(struct folio *folio)
+{
+ return PAGE_SHIFT + folio_order(folio);
+}
+
+/**
+ * folio_size - The number of bytes in a folio.
+ * @folio: The folio.
+ *
+ * Context: The caller should have a reference on the folio to prevent
+ * it from being split. It is not necessary for the folio to be locked.
+ * Return: The number of bytes in this folio.
+ */
+static inline size_t folio_size(struct folio *folio)
+{
+ return PAGE_SIZE << folio_order(folio);
+}
+
/*
* Some inline functions in vmstat.h depend on page_zone()
*/
@@ -1683,6 +1760,7 @@ extern void pagefault_out_of_memory(void);

#define offset_in_page(p) ((unsigned long)(p) & ~PAGE_MASK)
#define offset_in_thp(page, p) ((unsigned long)(p) & (thp_size(page) - 1))
+#define offset_in_folio(folio, p) ((unsigned long)(p) & (folio_size(folio) - 1))

/*
* Flags passed to show_mem() and show_free_areas() to suppress output in
diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h
index 6613b26a8894..4fc0b230d3ea 100644
--- a/include/linux/mm_types.h
+++ b/include/linux/mm_types.h
@@ -224,6 +224,42 @@ struct page {
#endif
} _struct_page_alignment;

+/**
+ * struct folio - Represents a contiguous set of bytes.
+ * @page: Either a base (order-0) page or the head page of a compound page.
+ *
+ * A folio is a physically, virtually and logically contiguous set
+ * of bytes. It is a power-of-two in size, and it is aligned to that
+ * same power-of-two. If it is found in the page cache, it is at a file
+ * offset which is a multiple of that power-of-two. It is at least as
+ * large as PAGE_SIZE.
+ */
+struct folio {
+ struct page page;
+};
+
+/**
+ * page_folio - Converts from page to folio.
+ * @page: The page.
+ *
+ * Every page is part of a folio. This function cannot be called on a
+ * NULL pointer.
+ *
+ * Context: No reference, nor lock is required on @page. If the caller
+ * does not hold a reference, this call may race with a folio split, so
+ * it should re-check the folio still contains this page after gaining
+ * a reference on the folio.
+ * Return: The folio which contains this page.
+ */
+static inline struct folio *page_folio(struct page *page)
+{
+ unsigned long head = READ_ONCE(page->compound_head);
+
+ if (unlikely(head & 1))
+ return (struct folio *)(head - 1);
+ return (struct folio *)page;
+}
+
static inline atomic_t *compound_mapcount_ptr(struct page *page)
{
return &page[1].compound_mapcount;
--
2.30.2

2021-03-20 12:21:11

by Matthew Wilcox

[permalink] [raw]
Subject: [PATCH v5 14/27] mm/memcg: Add folio wrappers for various functions

Add new wrapper functions folio_memcg(), lock_folio_memcg(),
unlock_folio_memcg() and mem_cgroup_folio_lruvec().
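
To illustrate (the surrounding code is hypothetical and 'folio' is
supplied by the caller), code which already has a folio can now avoid
the hidden compound_head() calls:

        /* Which memcg is this folio charged to? */
        struct mem_cgroup *memcg = folio_memcg(folio);

        /* Hold the folio's memcg binding stable while updating state. */
        lock_folio_memcg(folio);
        /* ... update per-folio state charged to memcg ... */
        unlock_folio_memcg(folio);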

Signed-off-by: Matthew Wilcox (Oracle) <[email protected]>
---
include/linux/memcontrol.h | 21 +++++++++++++++++++++
1 file changed, 21 insertions(+)

diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
index 4064c9dda534..493136f495b6 100644
--- a/include/linux/memcontrol.h
+++ b/include/linux/memcontrol.h
@@ -397,6 +397,11 @@ static inline struct mem_cgroup *page_memcg(struct page *page)
return (struct mem_cgroup *)(memcg_data & ~MEMCG_DATA_FLAGS_MASK);
}

+static inline struct mem_cgroup *folio_memcg(struct folio *folio)
+{
+ return page_memcg(&folio->page);
+}
+
/*
* page_memcg_rcu - locklessly get the memory cgroup associated with a page
* @page: a pointer to the page struct
@@ -1400,6 +1405,22 @@ unsigned long mem_cgroup_soft_limit_reclaim(pg_data_t *pgdat, int order,
}
#endif /* CONFIG_MEMCG */

+static inline void lock_folio_memcg(struct folio *folio)
+{
+ lock_page_memcg(&folio->page);
+}
+
+static inline void unlock_folio_memcg(struct folio *folio)
+{
+ unlock_page_memcg(&folio->page);
+}
+
+static inline struct lruvec *mem_cgroup_folio_lruvec(struct folio *folio,
+ struct pglist_data *pgdat)
+{
+ return mem_cgroup_page_lruvec(&folio->page, pgdat);
+}
+
static inline void __inc_lruvec_kmem_state(void *p, enum node_stat_item idx)
{
__mod_lruvec_kmem_state(p, idx, 1);
--
2.30.2

2021-03-21 07:26:21

by kernel test robot

[permalink] [raw]
Subject: Re: [PATCH v5 24/27] mm/filemap: Convert wait_on_page_bit to wait_on_folio_bit

Hi "Matthew,

Thank you for the patch! Yet something to improve:

[auto build test ERROR on next-20210319]
[cannot apply to linux/master linus/master hnaz-linux-mm/master v5.12-rc3 v5.12-rc2 v5.12-rc1 v5.12-rc3]
[If your patch is applied to the wrong git tree, kindly drop us a note.
And when submitting patch, we suggest to use '--base' as documented in
https://git-scm.com/docs/git-format-patch]

url: https://github.com/0day-ci/linux/commits/Matthew-Wilcox-Oracle/Memory-Folios/20210320-134732
base: f00397ee41c79b6155b9b44abd0055b2c0621349
config: x86_64-allyesconfig (attached as .config)
compiler: gcc-9 (Debian 9.3.0-22) 9.3.0
reproduce (this is a W=1 build):
# https://github.com/0day-ci/linux/commit/39199d654ac6a6bbaba1620337574ec74adee8fe
git remote add linux-review https://github.com/0day-ci/linux
git fetch --no-tags linux-review Matthew-Wilcox-Oracle/Memory-Folios/20210320-134732
git checkout 39199d654ac6a6bbaba1620337574ec74adee8fe
# save the attached .config to linux build tree
make W=1 ARCH=x86_64

If you fix the issue, kindly add following tag as appropriate
Reported-by: kernel test robot <[email protected]>

All errors (new ones prefixed by >>):

fs/afs/write.c: In function 'afs_page_mkwrite':
>> fs/afs/write.c:849:6: error: implicit declaration of function 'wait_on_page_bit_killable'; did you mean 'wait_on_folio_bit_killable'? [-Werror=implicit-function-declaration]
849 | wait_on_page_bit_killable(page, PG_fscache) < 0)
| ^~~~~~~~~~~~~~~~~~~~~~~~~
| wait_on_folio_bit_killable
cc1: some warnings being treated as errors


vim +849 fs/afs/write.c

9b3f26c9110dce David Howells 2009-04-03 827
9b3f26c9110dce David Howells 2009-04-03 828 /*
9b3f26c9110dce David Howells 2009-04-03 829 * notification that a previously read-only page is about to become writable
9b3f26c9110dce David Howells 2009-04-03 830 * - if it returns an error, the caller will deliver a bus error signal
9b3f26c9110dce David Howells 2009-04-03 831 */
0722f186205976 Souptick Joarder 2018-08-23 832 vm_fault_t afs_page_mkwrite(struct vm_fault *vmf)
9b3f26c9110dce David Howells 2009-04-03 833 {
721597fd1aa668 David Howells 2020-10-20 834 struct page *page = thp_head(vmf->page);
1cf7a1518aefa6 David Howells 2017-11-02 835 struct file *file = vmf->vma->vm_file;
1cf7a1518aefa6 David Howells 2017-11-02 836 struct inode *inode = file_inode(file);
1cf7a1518aefa6 David Howells 2017-11-02 837 struct afs_vnode *vnode = AFS_FS_I(inode);
1cf7a1518aefa6 David Howells 2017-11-02 838 unsigned long priv;
9b3f26c9110dce David Howells 2009-04-03 839
721597fd1aa668 David Howells 2020-10-20 840 _enter("{{%llx:%llu}},{%lx}", vnode->fid.vid, vnode->fid.vnode, page->index);
9b3f26c9110dce David Howells 2009-04-03 841
1cf7a1518aefa6 David Howells 2017-11-02 842 sb_start_pagefault(inode->i_sb);
9b3f26c9110dce David Howells 2009-04-03 843
1cf7a1518aefa6 David Howells 2017-11-02 844 /* Wait for the page to be written to the cache before we allow it to
1cf7a1518aefa6 David Howells 2017-11-02 845 * be modified. We then assume the entire page will need writing back.
1cf7a1518aefa6 David Howells 2017-11-02 846 */
77837f50249aa4 David Howells 2020-02-06 847 #ifdef CONFIG_AFS_FSCACHE
721597fd1aa668 David Howells 2020-10-20 848 if (PageFsCache(page) &&
721597fd1aa668 David Howells 2020-10-20 @849 wait_on_page_bit_killable(page, PG_fscache) < 0)
77837f50249aa4 David Howells 2020-02-06 850 return VM_FAULT_RETRY;
77837f50249aa4 David Howells 2020-02-06 851 #endif
9b3f26c9110dce David Howells 2009-04-03 852
5dc1af598f0274 Matthew Wilcox (Oracle 2021-03-20 853) if (wait_on_folio_writeback_killable(page_folio(page)))
1cf7a1518aefa6 David Howells 2017-11-02 854 return VM_FAULT_RETRY;
1cf7a1518aefa6 David Howells 2017-11-02 855
721597fd1aa668 David Howells 2020-10-20 856 if (lock_page_killable(page) < 0)
1cf7a1518aefa6 David Howells 2017-11-02 857 return VM_FAULT_RETRY;
1cf7a1518aefa6 David Howells 2017-11-02 858
1cf7a1518aefa6 David Howells 2017-11-02 859 /* We mustn't change page->private until writeback is complete as that
1cf7a1518aefa6 David Howells 2017-11-02 860 * details the portion of the page we need to write back and we might
1cf7a1518aefa6 David Howells 2017-11-02 861 * need to redirty the page if there's a problem.
1cf7a1518aefa6 David Howells 2017-11-02 862 */
721597fd1aa668 David Howells 2020-10-20 863 wait_on_page_writeback(page);
1cf7a1518aefa6 David Howells 2017-11-02 864
721597fd1aa668 David Howells 2020-10-20 865 priv = afs_page_dirty(page, 0, thp_size(page));
f86726a69dec5d David Howells 2020-10-22 866 priv = afs_page_dirty_mmapped(priv);
721597fd1aa668 David Howells 2020-10-20 867 if (PagePrivate(page)) {
721597fd1aa668 David Howells 2020-10-20 868 set_page_private(page, priv);
721597fd1aa668 David Howells 2020-10-20 869 trace_afs_page_dirty(vnode, tracepoint_string("mkwrite+"), page);
721597fd1aa668 David Howells 2020-10-20 870 } else {
721597fd1aa668 David Howells 2020-10-20 871 attach_page_private(page, (void *)priv);
721597fd1aa668 David Howells 2020-10-20 872 trace_afs_page_dirty(vnode, tracepoint_string("mkwrite"), page);
721597fd1aa668 David Howells 2020-10-20 873 }
bb413489288e4e David Howells 2020-06-12 874 file_update_time(file);
1cf7a1518aefa6 David Howells 2017-11-02 875
1cf7a1518aefa6 David Howells 2017-11-02 876 sb_end_pagefault(inode->i_sb);
1cf7a1518aefa6 David Howells 2017-11-02 877 return VM_FAULT_LOCKED;
9b3f26c9110dce David Howells 2009-04-03 878 }
4343d00872e1de David Howells 2017-11-02 879
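
The fix that the robot's suggestion points at would presumably be the
folio conversion of this call site, something along these lines (a sketch
only; the actual fix in the series may differ, e.g. by keeping a
page-based compatibility wrapper):

#ifdef CONFIG_AFS_FSCACHE
	/* Sketch: wait on the folio, as the wait_on_folio_writeback_killable()
	 * call just below already does. */
	if (PageFsCache(page) &&
	    wait_on_folio_bit_killable(page_folio(page), PG_fscache) < 0)
		return VM_FAULT_RETRY;
#endif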

---
0-DAY CI Kernel Test Service, Intel Corporation
https://lists.01.org/hyperkitty/list/[email protected]



2021-03-22 03:30:05

by Matthew Wilcox

[permalink] [raw]
Subject: Re: [PATCH v5 00/27] Memory Folios

On Sat, Mar 20, 2021 at 05:40:37AM +0000, Matthew Wilcox (Oracle) wrote:
> Current tree at:
> https://git.infradead.org/users/willy/pagecache.git/shortlog/refs/heads/folio
>
> (contains another ~100 patches on top of this batch, not all of which are
> in good shape for submission)

I've fixed the two buildbot bugs. I also resplit the docs work, and
did a bunch of other things to the patches that I haven't posted yet.

I'll send the first three patches as a separate series tomorrow,
and then the next four as their own series, then I'll repost the
rest (up to and including "Convert page wait queues to be folios")
later in the week.

2021-03-22 08:09:27

by Christoph Hellwig

[permalink] [raw]
Subject: Re: [PATCH v5 01/27] fs/cachefiles: Remove wait_bit_key layout dependency

On Sat, Mar 20, 2021 at 05:40:38AM +0000, Matthew Wilcox (Oracle) wrote:
> Cachefiles was relying on wait_page_key and wait_bit_key being the
> same layout, which is fragile. Now that wait_page_key is exposed in
> the pagemap.h header, we can remove that fragility
>
> Signed-off-by: Matthew Wilcox (Oracle) <[email protected]>

Looks good,

Reviewed-by: Christoph Hellwig <[email protected]>

2021-03-22 08:10:11

by Christoph Hellwig

[permalink] [raw]
Subject: Re: [PATCH v5 02/27] mm/writeback: Add wait_on_page_writeback_killable

On Sat, Mar 20, 2021 at 05:40:39AM +0000, Matthew Wilcox (Oracle) wrote:
> This is the killable version of wait_on_page_writeback.
>
> Signed-off-by: Matthew Wilcox (Oracle) <[email protected]>

Looks good,

Reviewed-by: Christoph Hellwig <[email protected]>

2021-03-22 08:14:21

by Christoph Hellwig

[permalink] [raw]
Subject: Re: [PATCH v5 03/27] afs: Use wait_on_page_writeback_killable

On Sat, Mar 20, 2021 at 05:40:40AM +0000, Matthew Wilcox (Oracle) wrote:
> Open-coding this function meant it missed out on the recent bugfix
> for waiters being woken by a delayed wake event from a previous
> instantiation of the page.
>
> Signed-off-by: Matthew Wilcox (Oracle) <[email protected]>

Looks good,

Reviewed-by: Christoph Hellwig <[email protected]>

2021-03-22 09:28:02

by David Howells

[permalink] [raw]
Subject: Re: [PATCH v5 02/27] mm/writeback: Add wait_on_page_writeback_killable

Matthew Wilcox (Oracle) <[email protected]> wrote:

> This is the killable version of wait_on_page_writeback.
>
> Signed-off-by: Matthew Wilcox (Oracle) <[email protected]>

Acked-and-tested-by: David Howells <[email protected]>

2021-03-22 09:29:27

by David Howells

[permalink] [raw]
Subject: Re: [PATCH v5 01/27] fs/cachefiles: Remove wait_bit_key layout dependency

Matthew Wilcox (Oracle) <[email protected]> wrote:

> Cachefiles was relying on wait_page_key and wait_bit_key being the
> same layout, which is fragile. Now that wait_page_key is exposed in
> the pagemap.h header, we can remove that fragility
>
> Signed-off-by: Matthew Wilcox (Oracle) <[email protected]>

Acked-and-tested-by: David Howells <[email protected]>

I wonder if this could be pushed directly to Linus now since we're relying on
two different structs being compatible.

2021-03-22 09:31:03

by David Howells

[permalink] [raw]
Subject: Re: [PATCH v5 03/27] afs: Use wait_on_page_writeback_killable

Matthew Wilcox (Oracle) <[email protected]> wrote:

> Open-coding this function meant it missed out on the recent bugfix
> for waiters being woken by a delayed wake event from a previous
> instantiation of the page.
>
> Signed-off-by: Matthew Wilcox (Oracle) <[email protected]>

Acked-and-tested-by: David Howells <[email protected]>

Should this be pushed upstream now as well if it's missing out on a bugfix?

2021-03-22 18:01:10

by Johannes Weiner

[permalink] [raw]
Subject: Re: [PATCH v5 00/27] Memory Folios

On Sat, Mar 20, 2021 at 05:40:37AM +0000, Matthew Wilcox (Oracle) wrote:
> Managing memory in 4KiB pages is a serious overhead. Many benchmarks
> exist which show the benefits of a larger "page size". As an example,
> an earlier iteration of this idea which used compound pages got a 7%
> performance boost when compiling the kernel using kernbench without any
> particular tuning.
>
> Using compound pages or THPs exposes a serious weakness in our type
> system. Functions are often unprepared for compound pages to be passed
> to them, and may only act on PAGE_SIZE chunks. Even functions which are
> aware of compound pages may expect a head page, and do the wrong thing
> if passed a tail page.
>
> There have been efforts to label function parameters as 'head' instead
> of 'page' to indicate that the function expects a head page, but this
> leaves us with runtime assertions instead of using the compiler to prove
> that nobody has mistakenly passed a tail page. Calling a struct page
> 'head' is also inaccurate as they will work perfectly well on base pages.
> The term 'nottail' has not proven popular.
>
> We also waste a lot of instructions ensuring that we're not looking at
> a tail page. Almost every call to PageFoo() contains one or more hidden
> calls to compound_head(). This also happens for get_page(), put_page()
> and many more functions. There does not appear to be a way to tell gcc
> that it can cache the result of compound_head(), nor is there a way to
> tell it that compound_head() is idempotent.
>
> This series introduces the 'struct folio' as a replacement for
> head-or-base pages. This initial set reduces the kernel size by
> approximately 6kB, although its real purpose is adding infrastructure
> to enable further use of the folio.
>
> The intent is to convert all filesystems and some device drivers to work
> in terms of folios. This series contains a lot of explicit conversions,
> but it's important to realise it's removing a lot of implicit conversions
> in some relatively hot paths. There will be very few conversions from
> folios when this work is completed; filesystems, the page cache, the
> LRU and so on will generally only deal with folios.

If that is the case, shouldn't there in the long term only be very
few, easy to review instances of things like compound_head(),
PAGE_SIZE etc. deep in the heart of MM? And everybody else should 1)
never see tail pages and 2) never assume a compile-time page size?

What are the higher-level places that in the long-term should be
dealing with tail pages at all? Are there legit ones besides the page
allocator, THP splitting internals & pte-mapped compound pages?

I do agree that the current confusion around which layer sees which
types of pages is a problem. But I also think a lot of it is the
result of us being in a transitional period where we've added THP in
more places but not all code and data structures are or were fully
native yet, and so we had things leak out or into where maybe they
shouldn't be to make things work in the short term.

But this part is already getting better, and has gotten better, with
the page cache (largely?) going native for example.

Some compound_head() that are currently in the codebase are already
unnecessary. Like the one in activate_page().

And looking at grep, I wouldn't be surprised if only the page table
walkers need the page_compound() that mark_page_accessed() does. We
would be better off if they did the translation once and explicitly in
the outer scope, where it's clear they're dealing with a pte-mapped
compound page, instead of having a series of rather low level helpers
(page flags testing, refcount operations, LRU operations, stat
accounting) all trying to be clever but really just obscuring things
and imposing unnecessary costs on the vast majority of cases.

So I fully agree with the motivation behind this patch. But I do
wonder why it's special-casing the common case instead of the rare
case. It comes at a huge cost. Short term, the churn of replacing
'page' with 'folio' in pretty much all instances is enormous.

And longer term, I'm not convinced folio is the abstraction we want
throughout the kernel. If nobody should be dealing with tail pages in
the first place, why are we making everybody think in 'folios'? Why
does a filesystem care that huge pages are composed of multiple base
pages internally? This feels like an implementation detail leaking out
of the MM code. The vast majority of places should be thinking 'page'
with a size of 'page_size()'. Including most parts of the MM itself.

The compile-time check is nice, but I'm not sure it would be that much
more effective at catching things than a few centrally placed warns
inside PageFoo(), get_page() etc. and other things that should not
encounter tail pages in the first place (with __helpers for the few
instances that do). And given the invasiveness of this change, they
ought to be very drastically better at it, and obviously so, IMO.

> Documentation/core-api/mm-api.rst | 7 +
> fs/afs/write.c | 3 +-
> fs/cachefiles/rdwr.c | 19 ++-
> fs/io_uring.c | 2 +-
> include/linux/memcontrol.h | 21 +++
> include/linux/mm.h | 156 +++++++++++++++----
> include/linux/mm_types.h | 52 +++++++
> include/linux/mmdebug.h | 20 +++
> include/linux/netfs.h | 2 +-
> include/linux/page-flags.h | 120 +++++++++++---
> include/linux/pagemap.h | 249 ++++++++++++++++++++++--------
> include/linux/swap.h | 6 +
> include/linux/vmstat.h | 107 +++++++++++++
> mm/Makefile | 2 +-
> mm/filemap.c | 237 ++++++++++++++--------------
> mm/folio-compat.c | 37 +++++
> mm/memory.c | 8 +-
> mm/page-writeback.c | 62 ++++++--
> mm/swapfile.c | 8 +-
> mm/util.c | 30 ++--
> 20 files changed, 857 insertions(+), 291 deletions(-)
> create mode 100644 mm/folio-compat.c

2021-03-22 18:52:22

by Matthew Wilcox

[permalink] [raw]
Subject: Re: [PATCH v5 00/27] Memory Folios

On Mon, Mar 22, 2021 at 01:59:24PM -0400, Johannes Weiner wrote:
> On Sat, Mar 20, 2021 at 05:40:37AM +0000, Matthew Wilcox (Oracle) wrote:
> > This series introduces the 'struct folio' as a replacement for
> > head-or-base pages. This initial set reduces the kernel size by
> > approximately 6kB, although its real purpose is adding infrastructure
> > to enable further use of the folio.
> >
> > The intent is to convert all filesystems and some device drivers to work
> > in terms of folios. This series contains a lot of explicit conversions,
> > but it's important to realise it's removing a lot of implicit conversions
> > in some relatively hot paths. There will be very few conversions from
> > folios when this work is completed; filesystems, the page cache, the
> > LRU and so on will generally only deal with folios.
>
> If that is the case, shouldn't there in the long term only be very
> few, easy to review instances of things like compound_head(),
> PAGE_SIZE etc. deep in the heart of MM? And everybody else should 1)
> never see tail pages and 2) never assume a compile-time page size?

I don't know exactly where we get to eventually. There are definitely
some aspects of the filesystem<->mm interface which are page-based
(eg ->fault needs to look up the exact page, regardless of its
head/tail/base nature), while ->readpage needs to talk in terms of
folios.

> What are the higher-level places that in the long-term should be
> dealing with tail pages at all? Are there legit ones besides the page
> allocator, THP splitting internals & pte-mapped compound pages?

I can't tell. I think this patch maybe illustrates some of the
problems, but maybe it's just an intermediate problem:

https://git.infradead.org/users/willy/pagecache.git/commitdiff/047e9185dc146b18f56c6df0b49fe798f1805c7b

It deals mostly in terms of folios, but when it needs to kmap() and
memcmp(), then it needs to work in terms of pages. I don't think it's
avoidable (maybe we bury the "dealing with pages" inside a kmap()
wrapper somewhere, but I'm not sure that's better).

> I do agree that the current confusion around which layer sees which
> types of pages is a problem. But I also think a lot of it is the
> result of us being in a transitional period where we've added THP in
> more places but not all code and data structures are or were fully
> native yet, and so we had things leak out or into where maybe they
> shouldn't be to make things work in the short term.
>
> But this part is already getting better, and has gotten better, with
> the page cache (largely?) going native for example.

Thanks ;-) There's still more work to do on that (ie storing one
entry to cover 512 indices instead of 512 identical entries), but it
is getting better. What can't be made better is the CPU page tables;
they really do need to point to tail pages.
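
As an aside on the "one entry to cover 512 indices" part: the XArray
already has a primitive for multi-index entries. The fragment below is a
sketch only (not from this series); it assumes CONFIG_XARRAY_MULTI and
that 'mapping', 'index', 'order' and 'folio' are in scope, with locking
and error handling omitted.

	XA_STATE(xas, &mapping->i_pages, index);

	/* One multi-index entry covering [index, index + (1 << order)) */
	xas_set_order(&xas, index, order);
	xas_store(&xas, folio);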

One of my longer-term goals is to support largeish pages on ARM (and
other CPUs). Instead of these silly config options to have 16KiB
or 64KiB pages, support "add PTEs for these 16 consecutive, aligned pages".
And I'm not sure how we do that without folios. The notion that a
page is PAGE_SIZE is really, really ingrained. I tried the page_size()
macro to make things easier, but there's 17000 instances of PAGE_SIZE
in the tree, and they just aren't going to go away.

> Some compound_head() that are currently in the codebase are already
> unnecessary. Like the one in activate_page().

Right! And it's hard to find & remove them without very careful analysis,
or particularly deep knowledge. With folios, we can remove them without
terribly deep thought.

> And looking at grep, I wouldn't be surprised if only the page table
> walkers need the page_compound() that mark_page_accessed() does. We
> would be better off if they did the translation once and explicitly in
> the outer scope, where it's clear they're dealing with a pte-mapped
> compound page, instead of having a series of rather low level helpers
> (page flags testing, refcount operations, LRU operations, stat
> accounting) all trying to be clever but really just obscuring things
> and imposing unnecessary costs on the vast majority of cases.
>
> So I fully agree with the motivation behind this patch. But I do
> wonder why it's special-casing the common case instead of the rare
> case. It comes at a huge cost. Short term, the churn of replacing
> 'page' with 'folio' in pretty much all instances is enormous.

Because people (think they) know what a page is. It's PAGE_SIZE bytes
long, it occupies one PTE, etc, etc. A folio is new and instead of
changing how something familiar (a page) behaves, we're asking them
to think about something new instead that behaves a lot like a page,
but has differences.

> And longer term, I'm not convinced folio is the abstraction we want
> throughout the kernel. If nobody should be dealing with tail pages in
> the first place, why are we making everybody think in 'folios'? Why
> does a filesystem care that huge pages are composed of multiple base
> pages internally? This feels like an implementation detail leaking out
> of the MM code. The vast majority of places should be thinking 'page'
> with a size of 'page_size()'. Including most parts of the MM itself.

I think pages already leaked out of the MM and into filesystems (and
most of the filesystem writers seem pretty unknowledgeable about how
pages and the page cache work, TBH). That's OK! Or it should be OK.
Filesystem authors should be experts on how their filesystem works.
Everywhere that they have to learn about the page cache is a distraction
and annoyance for them.

I mean, I already tried what you're suggesting. It's really freaking
hard. It's hard to do, it's hard to explain, it's hard to know if you
got it right. With folios, I've got the compiler working for me, telling
me that I got some of the low-level bits right (or wrong), leaving me
free to notice "Oh, wait, we got the accounting wrong because writeback
assumes that a page is only PAGE_SIZE bytes". I would _never_ have
noticed that with the THP tree. I only noticed it because transitioning
things to folios made me read the writeback code and wonder about the
'inc_wb_stat' call, see that it's measuring something in 'number of pages'
and realise that the wb_stat accounting needs to be fixed.

> The compile-time check is nice, but I'm not sure it would be that much
> more effective at catching things than a few centrally placed warns
> inside PageFoo(), get_page() etc. and other things that should not
> encounter tail pages in the first place (with __helpers for the few
> instances that do). And given the invasiveness of this change, they
> ought to be very drastically better at it, and obviously so, IMO.

We should have come up with a new type 15 years ago instead of doing THP.
But the second best time to invent a new type for "memory objects which
are at least as big as a page" is right now. Because it only gets more
painful over time.

2021-03-22 19:44:37

by Matthew Wilcox

[permalink] [raw]
Subject: Re: [PATCH v5 03/27] afs: Use wait_on_page_writeback_killable

On Mon, Mar 22, 2021 at 09:27:26AM +0000, David Howells wrote:
> Matthew Wilcox (Oracle) <[email protected]> wrote:
>
> > Open-coding this function meant it missed out on the recent bugfix
> > for waiters being woken by a delayed wake event from a previous
> > instantiation of the page.
> >
> > Signed-off-by: Matthew Wilcox (Oracle) <[email protected]>
>
> Acked-and-tested-by: David Howells <[email protected]>
>
> Should this be pushed upstream now as well if it's missing out on a bugfix?

I would be delighted for you to take these first three patches through
your tree and push them for, say, -rc5.

2021-03-23 11:32:08

by David Howells

[permalink] [raw]
Subject: Re: [PATCH v5 03/27] afs: Use wait_on_page_writeback_killable

Matthew Wilcox (Oracle) <[email protected]> wrote:

> Open-coding this function meant it missed out on the recent bugfix

Would that be:

c2407cf7d22d0c0d94cf20342b3b8f06f1d904e7
mm: make wait_on_page_writeback() wait for multiple pending writebacks

David

2021-03-23 16:04:08

by Christoph Hellwig

[permalink] [raw]
Subject: Re: [PATCH v5 00/27] Memory Folios

On Mon, Mar 22, 2021 at 01:59:24PM -0400, Johannes Weiner wrote:
> If that is the case, shouldn't there in the long term only be very
> few, easy to review instances of things like compound_head(),
> PAGE_SIZE etc. deep in the heart of MM? And everybody else should 1)
> never see tail pages and 2) never assume a compile-time page size?

Probably.

> But this part is already getting better, and has gotten better, with
> the page cache (largely?) going native for example.

As long as there is no strong typing it is going to remain a mess.

> So I fully agree with the motivation behind this patch. But I do
> wonder why it's special-casing the common case instead of the rare
> case. It comes at a huge cost. Short term, the churn of replacing
> 'page' with 'folio' in pretty much all instances is enormous.

The special case is in the eye of the beholder. I suspect we'll end
up using the folio in most FS/VM interactions eventually, which makes it
the common case. But I don't see how it is the special case? Yes, changing
from page to folio just about everywhere causes more change, but it also
allows us to:

a) do this gradually
b) thus actually audit everything to make sure we do the right thing

And I think willy's whole series (the git branch, not just the few
patches sent out) very clearly shows the benefit of that.

> And longer term, I'm not convinced folio is the abstraction we want
> throughout the kernel. If nobody should be dealing with tail pages in
> the first place, why are we making everybody think in 'folios'? Why
> does a filesystem care that huge pages are composed of multiple base
> pages internally? This feels like an implementation detail leaking out
> of the MM code. The vast majority of places should be thinking 'page'
> with a size of 'page_size()'. Including most parts of the MM itself.

Why does the name matter? While there are arguments both ways, the
clean break certainly helps to remind everyone that this is not
your grandfather's fixed-size page.

>
> The compile-time check is nice, but I'm not sure it would be that much
> more effective at catching things than a few centrally placed warns
> inside PageFoo(), get_page() etc. and other things that should not
> encounter tail pages in the first place (with __helpers for the few
> instances that do).

Eeek, no. No amount of runtime checks is going to replace compile
time type safety.

2021-03-23 17:53:49

by David Howells

[permalink] [raw]
Subject: Re: [PATCH v5 00/27] Memory Folios

Johannes Weiner <[email protected]> wrote:

> So I fully agree with the motivation behind this patch. But I do
> wonder why it's special-casing the common case instead of the rare
> case. It comes at a huge cost. Short term, the churn of replacing
> 'page' with 'folio' in pretty much all instances is enormous.
>
> And longer term, I'm not convinced folio is the abstraction we want
> throughout the kernel. If nobody should be dealing with tail pages in
> the first place, why are we making everybody think in 'folios'? Why
> does a filesystem care that huge pages are composed of multiple base
> pages internally? This feels like an implementation detail leaking out
> of the MM code. The vast majority of places should be thinking 'page'
> with a size of 'page_size()'. Including most parts of the MM itself.

I like the idea of logically separating individual hardware pages from
abstract bundles of pages by using a separate type for them - at least in
filesystem code. I'm trying to abstract some of the handling out of the
network filesystems and into a common library plus ITER_XARRAY to insulate
those filesystems from the VM.

David

2021-03-24 08:48:13

by Johannes Weiner

[permalink] [raw]
Subject: Re: [PATCH v5 00/27] Memory Folios

On Mon, Mar 22, 2021 at 06:47:44PM +0000, Matthew Wilcox wrote:
> On Mon, Mar 22, 2021 at 01:59:24PM -0400, Johannes Weiner wrote:
> > On Sat, Mar 20, 2021 at 05:40:37AM +0000, Matthew Wilcox (Oracle) wrote:
> > > This series introduces the 'struct folio' as a replacement for
> > > head-or-base pages. This initial set reduces the kernel size by
> > > approximately 6kB, although its real purpose is adding infrastructure
> > > to enable further use of the folio.
> > >
> > > The intent is to convert all filesystems and some device drivers to work
> > > in terms of folios. This series contains a lot of explicit conversions,
> > > but it's important to realise it's removing a lot of implicit conversions
> > > in some relatively hot paths. There will be very few conversions from
> > > folios when this work is completed; filesystems, the page cache, the
> > > LRU and so on will generally only deal with folios.
> >
> > If that is the case, shouldn't there in the long term only be very
> > few, easy to review instances of things like compound_head(),
> > PAGE_SIZE etc. deep in the heart of MM? And everybody else should 1)
> > never see tail pages and 2) never assume a compile-time page size?
>
> I don't know exactly where we get to eventually. There are definitely
> some aspects of the filesystem<->mm interface which are page-based
> (eg ->fault needs to look up the exact page, regardless of its
> head/tail/base nature), while ->readpage needs to talk in terms of
> folios.

I can imagine we'd eventually want fault handlers that can also fill
in larger chunks of data if the file is of the right size and the MM
is able to (and policy/heuristics determine to) go with a huge page.

> > What are the higher-level places that in the long-term should be
> > dealing with tail pages at all? Are there legit ones besides the page
> > allocator, THP splitting internals & pte-mapped compound pages?
>
> I can't tell. I think this patch maybe illustrates some of the
> problems, but maybe it's just an intermediate problem:
>
> https://git.infradead.org/users/willy/pagecache.git/commitdiff/047e9185dc146b18f56c6df0b49fe798f1805c7b
>
> It deals mostly in terms of folios, but when it needs to kmap() and
> memcmp(), then it needs to work in terms of pages. I don't think it's
> avoidable (maybe we bury the "dealing with pages" inside a kmap()
> wrapper somewhere, but I'm not sure that's better).

Yeah it'd be nice to get low-level, PAGE_SIZE pages out of there. We
may be able to just kmap whole folios too, which are more likely to be
small pages on highmem systems anyway.

> > Some compound_head() that are currently in the codebase are already
> > unnecessary. Like the one in activate_page().
>
> Right! And it's hard to find & remove them without very careful analysis,
> or particularly deep knowledge. With folios, we can remove them without
> terribly deep thought.

True. It definitely also helps mark the places that have been
converted from the top down and which ones haven't. Without that you
need to think harder about the context ("How would a tail page even
get here?" vs. "No page can get here, only folios" ;-))

Again, I think that's something that would automatically be better in
the long term when compound_page() and PAGE_SIZE themselves would
stand out like sore thumbs. But you raise a good point: there is such
an overwhelming amount of them right now that it's difficult to do
this without a clearer marker and help from the type system.

> > And looking at grep, I wouldn't be surprised if only the page table
> > walkers need the page_compound() that mark_page_accessed() does. We
> > would be better off if they did the translation once and explicitly in
> > the outer scope, where it's clear they're dealing with a pte-mapped
> > compound page, instead of having a series of rather low level helpers
> > (page flags testing, refcount operations, LRU operations, stat
> > accounting) all trying to be clever but really just obscuring things
> > and imposing unnecessary costs on the vast majority of cases.
> >
> > So I fully agree with the motivation behind this patch. But I do
> > wonder why it's special-casing the common case instead of the rare
> > case. It comes at a huge cost. Short term, the churn of replacing
> > 'page' with 'folio' in pretty much all instances is enormous.
>
> Because people (think they) know what a page is. It's PAGE_SIZE bytes
> long, it occupies one PTE, etc, etc. A folio is new and instead of
> changing how something familiar (a page) behaves, we're asking them
> to think about something new instead that behaves a lot like a page,
> but has differences.

Yeah, that makes sense.

> > And longer term, I'm not convinced folio is the abstraction we want
> > throughout the kernel. If nobody should be dealing with tail pages in
> > the first place, why are we making everybody think in 'folios'? Why
> > does a filesystem care that huge pages are composed of multiple base
> > pages internally? This feels like an implementation detail leaking out
> > of the MM code. The vast majority of places should be thinking 'page'
> > with a size of 'page_size()'. Including most parts of the MM itself.
>
> I think pages already leaked out of the MM and into filesystems (and
> most of the filesystem writers seem pretty unknowledgeable about how
> pages and the page cache work, TBH). That's OK! Or it should be OK.
> Filesystem authors should be experts on how their filesystem works.
> Everywhere that they have to learn about the page cache is a distraction
> and annoyance for them.
>
> I mean, I already tried what you're suggesting. It's really freaking
> hard. It's hard to do, it's hard to explain, it's hard to know if you
> got it right. With folios, I've got the compiler working for me, telling
> me that I got some of the low-level bits right (or wrong), leaving me
> free to notice "Oh, wait, we got the accounting wrong because writeback
> assumes that a page is only PAGE_SIZE bytes". I would _never_ have
> noticed that with the THP tree. I only noticed it because transitioning
> things to folios made me read the writeback code and wonder about the
> 'inc_wb_stat' call, see that it's measuring something in 'number of pages'
> and realise that the wb_stat accounting needs to be fixed.

I agree with all of this whole-heartedly.

The reason I asked about who would deal with tail pages in the long
term is because I think optimally most places would just think of
these things as descriptors for variable lengths of memory. And only
the allocator looks behind the curtain and deals with the (current!)
reality that they're stitched together from fixed-size objects.

To me, folios seem to further highlight this implementation detail,
more so than saying a page is now page_size() - although I readily
accept that the latter didn't turn out to be a viable mid-term
strategy in practice at all, and that a clean break is necessary
sooner rather than later (instead of cleaning up the page api now and
replacing the backing pages with struct hwpage or something later).

The name of the abstraction indicates how we think we're supposed to
use it, what behavior stands out as undesirable.

For example, you brought up kmap/memcpy/usercopy, which is a pretty
common operation. Should they continue to deal with individual tail
pages, and thereby perpetuate the exposure of these low-level MM
building blocks to drivers and filesystems?

It means folio -> page lookups will remain common - and certainly
the concept of the folio suggests thinking of it as a couple of pages
strung together. And the more this is the case, the less it stands out
when somebody is dealing with low-level pages when really they
shouldn't be - the thing this is trying to fix. Granted it's narrowing
the channel quite a bit. But it's also so pervasively used that I do
wonder if it's possible to keep up with creative new abuses.

But I also worry about the longevity of the concept in general. This
is one of the most central and fundamental concepts in the kernel. Is
this going to make sense in the future? In 5 years even?

> > The compile-time check is nice, but I'm not sure it would be that much
> > more effective at catching things than a few centrally placed warns
> > inside PageFoo(), get_page() etc. and other things that should not
> > encounter tail pages in the first place (with __helpers for the few
> > instances that do). And given the invasiveness of this change, they
> > ought to be very drastically better at it, and obviously so, IMO.
>
> We should have come up with a new type 15 years ago instead of doing THP.
> But the second best time to invent a new type for "memory objects which
> are at least as big as a page" is right now. Because it only gets more
> painful over time.

Yes and no.

Yes because I fully agree that too much detail of the pages has
leaked into all kinds of places where it shouldn't be, and a new
abstraction for what most places interact with is a good idea IMO.

But we're also headed in a direction with the VM that gives me pause
about the folios-are-multiple-pages abstraction.

How long are we going to have multiple pages behind a huge page?

Common storage drives are getting fast enough that simple buffered IO
workloads are becoming limited by CPU, just because it's too many
individual pages to push through the cache. We have pending patches to
rewrite the reclaim algorithm because rmap is falling apart with the
rate of paging we're doing. We'll need larger pages in the VM not just
for optimizing TLB access, but to cut transaction overhead for paging
in general (I know you're already onboard with this, especially on the
page cache side, just stating it for completeness).

But for that to work, we'll need the allocator to produce huge pages
at the necessary rate, too. The current implementation likely won't
scale. Compaction is expensive enough that we have to weigh when to
allocate huge pages for long-lived anon regions, let alone allocate
them for streaming IO cache entries.

But if the overwhelming number of requests going to the page allocator
are larger than 4k pages - anon regions? check. page cache? likely a
sizable share. slub? check. network? check - does it even make sense
to have that as the default block size for the page allocator anymore?
Or even allocate struct page at this granularity?

So I think transitioning away from ye olde page is a great idea. I
wonder this: have we mapped out the near future of the VM enough to
say that the folio is the right abstraction?

What does 'folio' mean when it corresponds to either a single page or
some slab-type object with no dedicated page?

If we go through with all the churn now anyway, IMO it makes at least
sense to ditch all association and conceptual proximity to the
hardware page or collections thereof. Simply say it's some length of
memory, and keep thing-to-page translations out of the public API from
the start. I mean, is there a good reason to keep this baggage?

mem_t or something.

mem = find_get_mem(mapping, offset);
p = kmap(mem, offset - mem_file_offset(mem), len);
copy_from_user(p, buf, len);
kunmap(mem);
SetMemDirty(mem);
put_mem(mem);

There are 10k instances of 'page' in mm/ outside the page allocator, a
majority of which will be the new thing. 14k in fs. I don't think I
have the strength to type shrink_folio_list(), or explain to new
people what it means, years after it has stopped making sense.

2021-03-24 10:11:45

by Matthew Wilcox

[permalink] [raw]
Subject: Re: [PATCH v5 00/27] Memory Folios

On Tue, Mar 23, 2021 at 08:29:16PM -0400, Johannes Weiner wrote:
> On Mon, Mar 22, 2021 at 06:47:44PM +0000, Matthew Wilcox wrote:
> > On Mon, Mar 22, 2021 at 01:59:24PM -0400, Johannes Weiner wrote:
> > > On Sat, Mar 20, 2021 at 05:40:37AM +0000, Matthew Wilcox (Oracle) wrote:
> > > > This series introduces the 'struct folio' as a replacement for
> > > > head-or-base pages. This initial set reduces the kernel size by
> > > > approximately 6kB, although its real purpose is adding infrastructure
> > > > to enable further use of the folio.
> > > >
> > > > The intent is to convert all filesystems and some device drivers to work
> > > > in terms of folios. This series contains a lot of explicit conversions,
> > > > but it's important to realise it's removing a lot of implicit conversions
> > > > in some relatively hot paths. There will be very few conversions from
> > > > folios when this work is completed; filesystems, the page cache, the
> > > > LRU and so on will generally only deal with folios.
> > >
> > > If that is the case, shouldn't there in the long term only be very
> > > few, easy to review instances of things like compound_head(),
> > > PAGE_SIZE etc. deep in the heart of MM? And everybody else should 1)
> > > never see tail pages and 2) never assume a compile-time page size?
> >
> > I don't know exactly where we get to eventually. There are definitely
> > some aspects of the filesystem<->mm interface which are page-based
> > (eg ->fault needs to look up the exact page, regardless of its
> > head/tail/base nature), while ->readpage needs to talk in terms of
> > folios.
>
> I can imagine we'd eventually want fault handlers that can also fill
> in larger chunks of data if the file is of the right size and the MM
> is able to (and policy/heuristics determine to) go with a huge page.

Oh yes, me too!

The way I think this works is that the VM asks for the specific
page, just as it does today and the ->fault handler returns the page.
Then the VM looks up the folio for that page, and asks the arch to map
the entire folio. How the arch does that is up to the arch -- if it's
PMD sized and aligned, it can do that; if the arch knows that it should
use 8 consecutive PTE entries to map 32KiB all at once, it can do that.

But I think we need the ->fault handler to return the specific page,
because that's how we can figure out whether this folio is mapped at the
appropriate alignment to make this work. If the fault handler returns
the folio, I don't think we can figure out if the alignment is correct.
Maybe we can for the page cache, but a device driver might have a compound
page allocated for its own purposes, and it might not be amenable to
the same rules as the page cache.

> > https://git.infradead.org/users/willy/pagecache.git/commitdiff/047e9185dc146b18f56c6df0b49fe798f1805c7b
> >
> > It deals mostly in terms of folios, but when it needs to kmap() and
> > memcmp(), then it needs to work in terms of pages. I don't think it's
> > avoidable (maybe we bury the "dealing with pages" inside a kmap()
> > wrapper somewhere, but I'm not sure that's better).
>
> Yeah it'd be nice to get low-level, PAGE_SIZE pages out of there. We
> may be able to just kmap whole folios too, which are more likely to be
> small pages on highmem systems anyway.

I got told "no" when asking for kmap_local() of a compound page.
Maybe that's changeable, but I'm assuming that kmap() space will
continue to be tight for the foreseeable future (until we can
kill highmem forever).

> > > Some compound_head() that are currently in the codebase are already
> > > unnecessary. Like the one in activate_page().
> >
> > Right! And it's hard to find & remove them without very careful analysis,
> > or particularly deep knowledge. With folios, we can remove them without
> > terribly deep thought.
>
> True. It definitely also helps mark the places that have been
> converted from the top down and which ones haven't. Without that you
> need to think harder about the context ("How would a tail page even
> get here?" vs. "No page can get here, only folios" ;-))

Exactly! Take a look at page_mkclean(). Its implementation strongly
suggests that it expects a head page, but I think it'll unmap a single
page if passed a tail page ... and it's not clear to me that isn't the
behaviour that pagecache_isize_extended() would prefer. Tricky.

> > I mean, I already tried what you're suggesting. It's really freaking
> > hard. It's hard to do, it's hard to explain, it's hard to know if you
> > got it right. With folios, I've got the compiler working for me, telling
> > me that I got some of the low-level bits right (or wrong), leaving me
> > free to notice "Oh, wait, we got the accounting wrong because writeback
> > assumes that a page is only PAGE_SIZE bytes". I would _never_ have
> > noticed that with the THP tree. I only noticed it because transitioning
> > things to folios made me read the writeback code and wonder about the
> > 'inc_wb_stat' call, see that it's measuring something in 'number of pages'
> > and realise that the wb_stat accounting needs to be fixed.
>
> I agree with all of this whole-heartedly.
>
> The reason I asked about who would deal with tail pages in the long
> term is because I think optimally most places would just think of
> these things as descriptors for variable lengths of memory. And only
> the allocator looks behind the curtain and deals with the (current!)
> reality that they're stitched together from fixed-size objects.
>
> To me, folios seem to further highlight this implementation detail,
> more so than saying a page is now page_size() - although I readily
> accept that the latter didn't turn out to be a viable mid-term
> strategy in practice at all, and that a clean break is necessary
> sooner rather than later (instead of cleaning up the page api now and
> replacing the backing pages with struct hwpage or something later).
>
> The name of the abstraction indicates how we think we're supposed to
> use it, what behavior stands out as undesirable.
>
> For example, you brought up kmap/memcpy/usercopy, which is a pretty
> common operation. Should they continue to deal with individual tail
> pages, and thereby perpetuate the exposure of these low-level MM
> building blocks to drivers and filesystems?
>
> It means folio -> page lookups will remain common - and certainly
> the concept of the folio suggests thinking of it as a couple of pages
> strung together. And the more this is the case, the less it stands out
> when somebody is dealing with low-level pages when really they
> shouldn't be - the thing this is trying to fix. Granted it's narrowing
> the channel quite a bit. But it's also so pervasively used that I do
> wonder if it's possible to keep up with creative new abuses.
>
> But I also worry about the longevity of the concept in general. This
> is one of the most central and fundamental concepts in the kernel. Is
> this going to make sense in the future? In 5 years even?

One of the patches I haven't posted yet starts to try to deal with kmap()/mem*()/kunmap():

mm: Add kmap_local_folio

This allows us to map a portion of a folio. Callers can only expect
to access up to the next page boundary.

Signed-off-by: Matthew Wilcox (Oracle) <[email protected]>

diff --git a/include/linux/highmem-internal.h b/include/linux/highmem-internal.h
index 7902c7d8b55f..55a29c9d562f 100644
--- a/include/linux/highmem-internal.h
+++ b/include/linux/highmem-internal.h
@@ -73,6 +73,12 @@ static inline void *kmap_local_page(struct page *page)
return __kmap_local_page_prot(page, kmap_prot);
}

+static inline void *kmap_local_folio(struct folio *folio, size_t offset)
+{
+ struct page *page = &folio->page + offset / PAGE_SIZE;
+ return __kmap_local_page_prot(page, kmap_prot) + offset % PAGE_SIZE;
+}
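
A usage sketch (not part of the patch) pairing it with the existing
kunmap_local(), assuming 'folio', 'offset', 'src' and 'len' are in scope
and the caller honours the "up to the next page boundary" rule:

	char *addr = kmap_local_folio(folio, offset);

	/* [offset, offset + len) must not cross a page boundary */
	memcpy(addr, src, len);
	flush_dcache_page(&folio->page + offset / PAGE_SIZE);
	kunmap_local(addr);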

Partly I haven't shared that one because I'm not 100% sure that 'byte
offset relative to start of folio' is the correct interface. I'm looking
at some users and thinking that maybe 'byte offset relative to start
of file' might be better. Or perhaps that's just filesystem-centric
thinking.

> > > The compile-time check is nice, but I'm not sure it would be that much
> > > more effective at catching things than a few centrally placed warns
> > > inside PageFoo(), get_page() etc. and other things that should not
> > > encounter tail pages in the first place (with __helpers for the few
> > > instances that do). And given the invasiveness of this change, they
> > > ought to be very drastically better at it, and obviously so, IMO.
> >
> > We should have come up with a new type 15 years ago instead of doing THP.
> > But the second best time to invent a new type for "memory objects which
> > are at least as big as a page" is right now. Because it only gets more
> > painful over time.
>
> Yes and no.
>
> Yes because I fully agree that too much detail of the pages has
> leaked into all kinds of places where it shouldn't be, and a new
> abstraction for what most places interact with is a good idea IMO.
>
> But we're also headed in a direction with the VM that gives me pause
> about the folios-are-multiple-pages abstraction.
>
> How long are we going to have multiple pages behind a huge page?

Yes, that's a really good question. I think Muchun Song's patches
are an interesting and practical way of freeing up memory _now_, but
long-term we'll need something different. Maybe we end up with
dynamically allocated pages (perhaps when we break a 2MB page into
1MB pages in the buddy allocator).

> Common storage drives are getting fast enough that simple buffered IO
> workloads are becoming limited by CPU, just because it's too many
> individual pages to push through the cache. We have pending patches to
> rewrite the reclaim algorithm because rmap is falling apart with the
> rate of paging we're doing. We'll need larger pages in the VM not just
> for optimizing TLB access, but to cut transaction overhead for paging
> in general (I know you're already onboard with this, especially on the
> page cache side, just stating it for completeness).

yes, yes, yes and yes. Dave Chinner produced a fantastic perf report
for me illustrating how kswapd and the page cache completely fall apart
under what must be a common streaming load. Just create a file 2x the
size of memory, then cat it to /dev/null. cat tries to allocate memory
in readahead and ends up contending on the i_pages lock with kswapd
who's trying to free pages from the LRU list one at a time.

Larger pages will help with that because more work gets done with each
lock acquisition, but I can't help but feel that the real solution is for
the page cache to notice that this is a streaming workload and have cat
eagerly recycle pages from this file. That's a biggish project; we know
how many pages there are in this mapping, but how to know when to switch
from "allocate memory from the page allocator" to "just delete a page
from early in the file and reuse it at the current position in the file"?

> But for that to work, we'll need the allocator to produce huge pages
> at the necessary rate, too. The current implementation likely won't
> scale. Compaction is expensive enough that we have to weigh when to
> allocate huge pages for long-lived anon regions, let alone allocate
> them for streaming IO cache entries.

Heh, I have that as a work item for later this year -- give the page
allocator per-cpu lists of compound pages, not just order-0 pages.
That'll save us turning compound pages back into buddy pages, only to
turn them into compound pages again.

I also have a feeling that the page allocator either needs to become a
sub-allocator of an allocator that deals in, say, 1GB chunks of memory,
or it needs to become reluctant to break up larger orders. eg if the
dcache asks for just one more dentry, it should have to go through at
least one round of reclaim before we choose to break up a high-order
page to satisfy that request.

> But if the overwhelming number of requests going to the page allocator
> are larger than 4k pages - anon regions? check. page cache? likely a
> sizable share. slub? check. network? check - does it even make sense
> to have that as the default block size for the page allocator anymore?
> Or even allocate struct page at this granularity?

Yep, others have talked about that as well. I think I may even have said
a few times at LSFMM, "What if we just make PAGE_SIZE 2MB?". After all,
my first 386 Linux system had 4-8MB of RAM (it got upgraded). The 16GB
laptop that I now have has 2048 times more RAM, so with 2MB pages it
would have 4x the number of pages that system had.

But people seem attached to being able to use smaller page sizes.
There's that pesky "compatibility" argument.

> So I think transitioning away from ye olde page is a great idea. I
> wonder this: have we mapped out the near future of the VM enough to
> say that the folio is the right abstraction?
>
> What does 'folio' mean when it corresponds to either a single page or
> some slab-type object with no dedicated page?
>
> If we go through with all the churn now anyway, IMO it makes at least
> sense to ditch all association and conceptual proximity to the
> hardware page or collections thereof. Simply say it's some length of
> memory, and keep thing-to-page translations out of the public API from
> the start. I mean, is there a good reason to keep this baggage?
>
> mem_t or something.
>
> mem = find_get_mem(mapping, offset);
> p = kmap(mem, offset - mem_file_offset(mem), len);
> copy_from_user(p, buf, len);
> kunmap(mem);
> SetMemDirty(mem);
> put_mem(mem);

I think there's still value to the "new thing" being a power of two
in size. I'm not sure you were suggesting otherwise, but it's worth
putting on the table as something we explicitly agree on (or not!)

I mean what you've written there looks a _lot_ like where I get to
in the iomap code.

		status = iomap_write_begin(inode, pos, bytes, 0, &folio, iomap,
				srcmap);
		if (unlikely(status))
			break;

		if (mapping_writably_mapped(inode->i_mapping))
			flush_dcache_folio(folio);

		/* We may be part-way through a folio */
		offset = offset_in_folio(folio, pos);
		copied = iov_iter_copy_from_user_atomic(folio, i, offset,
				bytes);

		copied = iomap_write_end(inode, pos, bytes, copied, folio,
				iomap, srcmap);

(which eventually calls TestSetFolioDirty)

It doesn't copy more than PAGE_SIZE bytes per iteration because
iov_iter_copy_from_user_atomic() isn't safe to do that yet.
But in *principle*, it should be able to.

> There are 10k instances of 'page' in mm/ outside the page allocator, a
> majority of which will be the new thing. 14k in fs. I don't think I
> have the strength to type shrink_folio_list(), or explain to new
> people what it means, years after it has stopped making sense.

One of the things I don't like about the current iteration of folio
is that getting to things is folio->page.mapping. I think it does want
to be folio->mapping, and I'm playing around with this:

 struct folio {
-	struct page page;
+	union {
+		struct page page;
+		struct {
+			unsigned long flags;
+			struct list_head lru;
+			struct address_space *mapping;
+			pgoff_t index;
+			unsigned long private;
+			atomic_t _mapcount;
+			atomic_t _refcount;
+		};
+	};
 };

+static inline void folio_build_bug(void)
+{
+#define FOLIO_MATCH(pg, fl) \
+BUILD_BUG_ON(offsetof(struct page, pg) != offsetof(struct folio, fl));
+
+	FOLIO_MATCH(flags, flags);
+	FOLIO_MATCH(lru, lru);
+	FOLIO_MATCH(mapping, mapping);
+	FOLIO_MATCH(index, index);
+	FOLIO_MATCH(private, private);
+	FOLIO_MATCH(_mapcount, _mapcount);
+	FOLIO_MATCH(_refcount, _refcount);
+#undef FOLIO_MATCH
+	BUILD_BUG_ON(sizeof(struct page) != sizeof(struct folio));
+}

with the intent of eventually renaming page->mapping to page->__mapping
so people can't look at page->mapping on a tail page. If we even have
tail pages eventually. I could see a future where we have pte_to_pfn(),
pfn_to_folio() and are completely page-free (... the vm_fault would
presumably return a pfn instead of a page at that point ...). But that's
too ambitious a project to succeed any time soon.

There's a lot of transitional stuff in these patches where I do
&folio->page. I cringe a little every time I write that.

So yes, let's ask the question of "Is this the right short term, medium
term or long term approach?" I think it is, at least in broad strokes.
Let's keep refining it.

Thanks for your contribution here; it's really useful.

2021-03-26 17:49:53

by Johannes Weiner

[permalink] [raw]
Subject: Re: [PATCH v5 00/27] Memory Folios

On Wed, Mar 24, 2021 at 06:24:21AM +0000, Matthew Wilcox wrote:
> On Tue, Mar 23, 2021 at 08:29:16PM -0400, Johannes Weiner wrote:
> > On Mon, Mar 22, 2021 at 06:47:44PM +0000, Matthew Wilcox wrote:
> > > On Mon, Mar 22, 2021 at 01:59:24PM -0400, Johannes Weiner wrote:
> One of the patches I haven't posted yet starts to try to deal with kmap()/mem*()/kunmap():
>
> mm: Add kmap_local_folio
>
> This allows us to map a portion of a folio. Callers can only expect
> to access up to the next page boundary.
>
> Signed-off-by: Matthew Wilcox (Oracle) <[email protected]>
>
> diff --git a/include/linux/highmem-internal.h b/include/linux/highmem-internal.h
> index 7902c7d8b55f..55a29c9d562f 100644
> --- a/include/linux/highmem-internal.h
> +++ b/include/linux/highmem-internal.h
> @@ -73,6 +73,12 @@ static inline void *kmap_local_page(struct page *page)
> return __kmap_local_page_prot(page, kmap_prot);
> }
>
> +static inline void *kmap_local_folio(struct folio *folio, size_t offset)
> +{
> + struct page *page = &folio->page + offset / PAGE_SIZE;
> + return __kmap_local_page_prot(page, kmap_prot) + offset % PAGE_SIZE;
> +}
>
> Partly I haven't shared that one because I'm not 100% sure that 'byte
> offset relative to start of folio' is the correct interface. I'm looking
> at some users and thinking that maybe 'byte offset relative to start
> of file' might be better. Or perhaps that's just filesystem-centric
> thinking.

Right, this doesn't seem specific to files just because they would be
the primary users of it.

> > But for that to work, we'll need the allocator to produce huge pages
> > at the necessary rate, too. The current implementation likely won't
> > scale. Compaction is expensive enough that we have to weigh when to
> > allocate huge pages for long-lived anon regions, let alone allocate
> > them for streaming IO cache entries.
>
> Heh, I have that as a work item for later this year -- give the page
> allocator per-cpu lists of compound pages, not just order-0 pages.
> That'll save us turning compound pages back into buddy pages, only to
> turn them into compound pages again.
>
> I also have a feeling that the page allocator either needs to become a
> sub-allocator of an allocator that deals in, say, 1GB chunks of memory,
> or it needs to become reluctant to break up larger orders. eg if the
> dcache asks for just one more dentry, it should have to go through at
> least one round of reclaim before we choose to break up a high-order
> page to satisfy that request.

Slub already allocates higher-order pages for dentries:

slabinfo - version: 2.1
# name <active_objs> <num_objs> <objsize> <objperslab> <pagesperslab> : tunables <limit> <batchcount> <sharedfactor> : slabdata <active_slabs> <num_slabs> <sharedavail>
dentry 133350 133350 192 42 2 : tunables 0 0 0 : slabdata 3175 3175 0

^ here (pagesperslab = 2, i.e. order-1 allocations)

and it could avoid even more internal fragmentation with bigger
orders. It only doesn't because of the overhead of allocating them.

If the default block size in the allocator were 2M, we'd also get slab
packing at that granularity, and we wouldn't have to worry about small
objects breaking huge pages any more than we worry about slab objects
fragmenting 4k pages today.

> > But if the overwhelming number of requests going to the page allocator
> > are larger than 4k pages - anon regions? check. page cache? likely a
> > sizable share. slub? check. network? check - does it even make sense
> > to have that as the default block size for the page allocator anymore?
> > Or even allocate struct page at this granularity?
>
> Yep, others have talked about that as well. I think I may even have said
> a few times at LSFMM, "What if we just make PAGE_SIZE 2MB?". After all,
> my first 386 Linux system was 4-8MB of RAM (it got upgraded). The 16GB
> laptop that I now have is 2048 times more RAM, so 4x the number of pages
> that system had.
>
> But people seem attached to being able to use smaller page sizes.
> There's that pesky "compatibility" argument.

Right, that's why I'm NOT saying we should eliminate the support for
4k chunks in the page cache and page tables. That's still useful if
you have lots of small files.

I'm just saying it doesn't have to be the default that everything is
primarily optimized for. We can make the default allocation size of
the allocator correspond to a hugepage and have a secondary allocator
level for 4k chunks. Like slab, but fixed-size and highmem-aware.

It makes sense to make struct page 2M as well. It would save a ton of
memory on average and reduce the pressure we have on struct page's
size today.

And we really don't need struct page at 4k just to support this unit
of paging when necessary: page tables don't care, they use pfns and can
point to any 4k offset, struct page or no struct page. For the page
cache, we can move mapping, index, lru, etc. from today's struct page
into an entry descriptor that could either sit in a native 2M struct
page (just like today), or be allocated on demand and point into a
chunked struct page. Same for <2M anonymous mappings.

Hey, didn't you just move EXACTLY those fields into the folio? ;)

All this to reiterate, I really do agree with the concept of a new
type of object for paging, page cache entries, etc. But I think there
are good reasons to assume that this unit of paging needs to support
sizes smaller than the standard page size used by the kernel at large,
and so 'bundle of pages' is not a good way of defining it.

It can easily cause problems down the line again if people continue to
assume that there is at least one PAGE_SIZE struct page in a folio.

And it's not obvious to me why it really NEEDS to be 'bundle of pages'
instead of just 'chunk of memory'.

> > So I think transitioning away from ye olde page is a great idea. I
> > wonder this: have we mapped out the near future of the VM enough to
> > say that the folio is the right abstraction?
> >
> > What does 'folio' mean when it corresponds to either a single page or
> > some slab-type object with no dedicated page?
> >
> > If we go through with all the churn now anyway, IMO it makes at least
> > sense to ditch all association and conceptual proximity to the
> > hardware page or collections thereof. Simply say it's some length of
> > memory, and keep thing-to-page translations out of the public API from
> > the start. I mean, is there a good reason to keep this baggage?
> >
> > mem_t or something.
> >
> > mem = find_get_mem(mapping, offset);
> > p = kmap(mem, offset - mem_file_offset(mem), len);
> > copy_from_user(p, buf, len);
> > kunmap(mem);
> > SetMemDirty(mem);
> > put_mem(mem);
>
> I think there's still value to the "new thing" being a power of two
> in size. I'm not sure you were suggesting otherwise, but it's worth
> putting on the table as something we explicitly agree on (or not!)

Ha, I wasn't thinking about minimum alignment. I used the byte offsets
because I figured that's what's natural to the fs and saw no reason to
have it think in terms of page size in this example.

From an implementation pov, since anything in the page cache can end
up in a page table, it probably doesn't make a whole lot of sense to
allow quantities smaller than the smallest unit of paging supported by
the processor. But I wonder if that's mostly something the MM would
care about when it allocates these objects, not necessarily something
that needs to be reflected in the interface or the filesystem.

The other point I was trying to make was just the alternate name. As I
said above, I think 'bundle of pages' as a concept is a strategic
error that will probably come back to haunt us.

I also have to admit, I really hate the name. We may want to stop
people thinking of PAGE_SIZE, but this term doesn't give people any
clue WHAT to think of. Ten years down the line, when the possible
confusion between folio and page and PAGE_SIZE has been eradicated,
people still will have to google what a folio is, and then have a hard
time retaining a mental image. I *know* what it is and I still have a
hard time reading code that uses it.

That's why I drafted the code above, to see if it would go
down easier. I think it does. It's simple, self-explanatory, but
abstract enough not to make assumptions about its implementation.
Filesystems look up cache memory, write data into it, mark the memory dirty.

Maybe folio makes more sense to native speakers, but I have never
heard this term. Of course when you look it up, it's "something to do
with pages" :D

As a strategy to unseat the obsolete mental model around pages, IMO
redirection would be preferable to confusion.

> > There are 10k instances of 'page' in mm/ outside the page allocator, a
> > majority of which will be the new thing. 14k in fs. I don't think I
> > have the strength to type shrink_folio_list(), or explain to new
> > people what it means, years after it has stopped making sense.
>
> One of the things I don't like about the current iteration of folio
> is that getting to things is folio->page.mapping. I think it does want
> to be folio->mapping, and I'm playing around with this:
>
> struct folio {
> - struct page page;
> + union {
> + struct page page;
> + struct {
> + unsigned long flags;
> + struct list_head lru;
> + struct address_space *mapping;
> + pgoff_t index;
> + unsigned long private;
> + atomic_t _mapcount;
> + atomic_t _refcount;
> + };
> + };
> };
>
> +static inline void folio_build_bug(void)
> +{
> +#define FOLIO_MATCH(pg, fl) \
> +BUILD_BUG_ON(offsetof(struct page, pg) != offsetof(struct folio, fl));
> +
> + FOLIO_MATCH(flags, flags);
> + FOLIO_MATCH(lru, lru);
> + FOLIO_MATCH(mapping, mapping);
> + FOLIO_MATCH(index, index);
> + FOLIO_MATCH(private, private);
> + FOLIO_MATCH(_mapcount, _mapcount);
> + FOLIO_MATCH(_refcount, _refcount);
> +#undef FOLIO_MATCH
> + BUILD_BUG_ON(sizeof(struct page) != sizeof(struct folio));
> +}
>
> with the intent of eventually renaming page->mapping to page->__mapping
> so people can't look at page->mapping on a tail page. If we even have
> tail pages eventually. I could see a future where we have pte_to_pfn(),
> pfn_to_folio() and are completely page-free (... the vm_fault would
> presumably return a pfn instead of a page at that point ...). But that's
> too ambitious a project to succeed any time soon.
>
> There's a lot of transitional stuff in these patches where I do
> &folio->page. I cringe a little every time I write that.

Instead of the union in there, could you do this?

struct thing {
	struct address_space *mapping;
	pgoff_t index;
	...
};

struct page {
	union {
		struct thing thing;
		...
	};
};

and use container_of() to get to the page in those places?
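
Just to illustrate what I mean -- struct thing and the 'thing' member
here are only the placeholder names from above, nothing that exists in
the tree:

/* container_of() is the usual <linux/kernel.h> helper */
static inline struct page *thing_page(struct thing *t)
{
	return container_of(t, struct page, thing);
}

static inline struct thing *page_thing(struct page *page)
{
	return &page->thing;
}

Core code that only has a struct thing could then do
thing_page(t)->flags, and the cache side would hand out
page_thing(page).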

> So yes, let's ask the question of "Is this the right short term, medium
> term or long term approach?" I think it is, at least in broad strokes.
> Let's keep refining it.

Yes, yes, and yes. :)

2021-03-29 17:03:06

by Matthew Wilcox

[permalink] [raw]
Subject: Re: [PATCH v5 00/27] Memory Folios

I'm going to respond to some points in detail below, but there are a
couple of overarching themes that I want to bring out up here.

Grand Vision
~~~~~~~~~~~~

I haven't outlined my long-term plan. Partly because it is a _very_
long way off, and partly because I think what I'm doing stands on its
own. But some of the points below bear on this, so I'll do it now.

Eventually, I want to make struct page optional for allocations. It's too
small for some things (allocating page tables, for example), and overly
large for others (allocating a 2MB page, networking page_pool). I don't
want to change its size in the meantime; having a struct page refer to
PAGE_SIZE bytes is something that's quite deeply baked in.

In broad strokes, I think that having a Power Of Two Allocator
with Descriptor (POTAD) is a useful foundational allocator to have.
The specific allocator that we call the buddy allocator is very clever for
the 1990s, but touches too many cachelines to be good with today's CPUs.
The generalisation of the buddy allocator to the POTAD lets us allocate
smaller quantities (eg a 512 byte block) and allocate descriptors which
differ in size from a struct page. For an extreme example, see xfs_buf
which is 360 bytes and is the descriptor for an allocation between 512
and 65536 bytes.

There are times when we need to get from the physical address to
the descriptor, eg memory-failure.c or get_user_pages(). This is the
equivalent of phys_to_page(), and it's going to have to be a lookup tree.
I think this is a role for the Maple Tree, but it's not ready yet.
I don't know if it'll be fast enough for this case. There's also the
need (particularly for memory-failure) to determine exactly what kind
of descriptor we're dealing with, and also its size. Even its owner,
so we can notify them of memory failure.

There's still a role for the slab allocator, eg allocating objects
which aren't a power of two, or allocating things for which the user
doesn't need a descriptor of its own. We can even keep the 'alloc_page'
interface around; it's just a specialisation of the POTAD.

Anyway, there's a lot of work here, and I'm sure there are many holes
to be poked in it, but eventually I want the concept of tail pages to
go away, and for pages to become not-the-unit of memory management in
Linux any more.

Naming
~~~~~~

The fun thing about the word folio is that it actually has several
meanings. Quoting wikipedia,

: it is firstly a term for a common method of arranging sheets of paper
: into book form, folding the sheet only once, and a term for a book
: made in this way; secondly it is a general term for a sheet, leaf or
: page in (especially) manuscripts and old books; and thirdly it is an
: approximate term for the size of a book, and for a book of this size.

So while it is a collection of pages in the first sense, in the second
sense it's also its own term for a "sheet, leaf or page". I (still)
don't insist on the word folio, but I do insist that it be _a_ word.
The word "slab" was a great coin by Bonwick -- it didn't really mean
anything in the context of memory before he used it, and now we all know
exactly what it means. I just don't want us to end up with

struct uma { /* unit of memory allocation */

We could choose another (short, not-used-in-kernel) word almost at random.
How about 'kerb'?


What I haven't touched on anywhere in this, is whether a folio is the
descriptor for all POTA or whether it's specifically the page cache
descriptor. I like the idea of having separate descriptors for objects
in the page cache from anonymous or other allocations. But I'm not very
familiar with the rmap code, and that wants to do things like manipulate
the refcount on a descriptor without knowing whether it's a file or
anon page. Or neither (eg device driver memory mapped to userspace.
Or vmalloc memory mapped to userspace. Or ...)

We could get terribly carried away with this ...

struct mappable { /* any mappable object must be LRU */
	struct list_head lru;
	int refcount;
	int mapcount;
};

struct folio { /* for page cache */
	unsigned long flags;
	struct mappable map;
	struct address_space *mapping;
	pgoff_t index;
	void *private;
};

struct quarto { /* for anon pages */
	unsigned long flags;
	struct mappable map;
	swp_entry_t swp;
	struct anon_vma *vma;
};

but I'm not sure we want to go there.

On Fri, Mar 26, 2021 at 01:48:15PM -0400, Johannes Weiner wrote:
> On Wed, Mar 24, 2021 at 06:24:21AM +0000, Matthew Wilcox wrote:
> > On Tue, Mar 23, 2021 at 08:29:16PM -0400, Johannes Weiner wrote:
> > > On Mon, Mar 22, 2021 at 06:47:44PM +0000, Matthew Wilcox wrote:
> > > > On Mon, Mar 22, 2021 at 01:59:24PM -0400, Johannes Weiner wrote:
> > One of the patches I haven't posted yet starts to try to deal with kmap()/mem*()/kunmap():
> >
> > mm: Add kmap_local_folio
> >
> > This allows us to map a portion of a folio. Callers can only expect
> > to access up to the next page boundary.
> >
> > Signed-off-by: Matthew Wilcox (Oracle) <[email protected]>
> >
> > diff --git a/include/linux/highmem-internal.h b/include/linux/highmem-internal.h
> > index 7902c7d8b55f..55a29c9d562f 100644
> > --- a/include/linux/highmem-internal.h
> > +++ b/include/linux/highmem-internal.h
> > @@ -73,6 +73,12 @@ static inline void *kmap_local_page(struct page *page)
> > return __kmap_local_page_prot(page, kmap_prot);
> > }
> >
> > +static inline void *kmap_local_folio(struct folio *folio, size_t offset)
> > +{
> > + struct page *page = &folio->page + offset / PAGE_SIZE;
> > + return __kmap_local_page_prot(page, kmap_prot) + offset % PAGE_SIZE;
> > +}
> >
> > Partly I haven't shared that one because I'm not 100% sure that 'byte
> > offset relative to start of folio' is the correct interface. I'm looking
> > at some users and thinking that maybe 'byte offset relative to start
> > of file' might be better. Or perhaps that's just filesystem-centric
> > thinking.
>
> Right, this doesn't seem specific to files just because they would be
> the primary users of it.

Yeah. I think I forgot to cc you on this:

https://lore.kernel.org/linux-fsdevel/[email protected]/

and "byte offset relative to the start of the folio" works just fine:

+ offset = offset_in_folio(folio, diter->pos);
+
+map:
+ diter->entry = kmap_local_folio(folio, offset);
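
Filled out a bit more -- the caller below is made up, not from the
posted series -- the pattern for staying within the "up to the next
page boundary" rule would be something like:

static void folio_copy_out(struct folio *folio, size_t offset,
			   void *dst, size_t len)
{
	while (len) {
		/* a local mapping is only valid to the end of this page */
		size_t chunk = min_t(size_t, len,
				     PAGE_SIZE - offset % PAGE_SIZE);
		void *src = kmap_local_folio(folio, offset);

		memcpy(dst, src, chunk);
		kunmap_local(src);

		dst += chunk;
		offset += chunk;
		len -= chunk;
	}
}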

> > > But for that to work, we'll need the allocator to produce huge pages
> > > at the necessary rate, too. The current implementation likely won't
> > > scale. Compaction is expensive enough that we have to weigh when to
> > > allocate huge pages for long-lived anon regions, let alone allocate
> > > them for streaming IO cache entries.
> >
> > Heh, I have that as a work item for later this year -- give the page
> > allocator per-cpu lists of compound pages, not just order-0 pages.
> > That'll save us turning compound pages back into buddy pages, only to
> > turn them into compound pages again.
> >
> > I also have a feeling that the page allocator either needs to become a
> > sub-allocator of an allocator that deals in, say, 1GB chunks of memory,
> > or it needs to become reluctant to break up larger orders. eg if the
> > dcache asks for just one more dentry, it should have to go through at
> > least one round of reclaim before we choose to break up a high-order
> > page to satisfy that request.
>
> Slub already allocates higher-order pages for dentries:
>
> slabinfo - version: 2.1
> # name <active_objs> <num_objs> <objsize> <objperslab> <pagesperslab> : tunables <limit> <batchcount> <sharedfactor> : slabdata <active_slabs> <num_slabs> <sharedavail>
> dentry 133350 133350 192 42 2 : tunables 0 0 0 : slabdata 3175 3175 0
>
> ^ here
>
> and it could avoid even more internal fragmentation with bigger
> orders. It only doesn't because of the overhead of allocating them.

Oh, yes. Sorry, I didn't explain myself properly. If we have a
lightly-loaded system with terabytes of memory (perhaps all the jobs
it is running are CPU intensive and don't need much memory), the system
has a tendency to clog up with negative dentries. Hundreds of millions
of them. We rely on memory pressure to get rid of them, and when there
finally is memory pressure, it takes literally hours.

If there were a slight amount of pressure to trim the dcache at the point
when we'd otherwise break up an order-4 page to get an order-2 page,
the system would work much better. Obviously, we do want the dcache to
be able to expand to the point where it's useful, but at the point that
it's no longer useful, we need to trim it.

It'd probably be better to have the dcache realise that its old entries
aren't useful any more and age them out instead of relying on memory
pressure to remove old entries, so this is probably an unnecessary
digression.

> If the default block size in the allocator were 2M, we'd also get slab
> packing at that granularity, and we wouldn't have to worry about small
> objects breaking huge pages any more than we worry about slab objects
> fragmenting 4k pages today.

Yup. I definitely see the attraction of letting the slab allocator
allocate in larger units. On the other hand, you have to start worrying
about underutilisation of the memory at _some_ size, and I'd argue the
sweet spot is somewhere between 4kB and 2MB today. For example:

fat_inode_cache 110 110 744 22 4 : tunables 0 0 0 : slabdata 5 5 0

That's currently using 20 pages. If slab were only allocating 2MB slabs
from the page allocator, I'd have 1.9MB of ram unused in that cache.
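
(Working that out from the slabinfo columns: 22 objects of 744 bytes is
~16kB, i.e. an order-2 slab of 4 pages, and 5 slabs x 4 pages = 20
pages; the 110 objects themselves are only ~80kB, so a single 2MB slab
holding them would leave roughly 2MB - 80kB ~= 1.9MB idle.)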

> > But people seem attached to being able to use smaller page sizes.
> > There's that pesky "compatibility" argument.
>
> Right, that's why I'm NOT saying we should eliminate the support for
> 4k chunks in the page cache and page tables. That's still useful if
> you have lots of small files.
>
> I'm just saying it doesn't have to be the default that everything is
> primarily optimized for. We can make the default allocation size of
> the allocator correspond to a hugepage and have a secondary allocator
> level for 4k chunks. Like slab, but fixed-size and highmem-aware.
>
> It makes sense to make struct page 2M as well. It would save a ton of
> memory on average and reduce the pressure we have on struct page's
> size today.
>
> And we really don't need struct page at 4k just to support this unit
> of paging when necessary: page tables don't care, they use pfns and can
> point to any 4k offset, struct page or no struct page. For the page
> cache, we can move mapping, index, lru, etc. from today's struct page
> into an entry descriptor that could either sit in a native 2M struct
> page (just like today), or be allocated on demand and point into a
> chunked struct page. Same for <2M anonymous mappings.
>
> Hey, didn't you just move EXACTLY those fields into the folio? ;)

You say page tables don't actually need a struct page, but we do use it.

struct {	/* Page table pages */
	unsigned long _pt_pad_1;	/* compound_head */
	pgtable_t pmd_huge_pte;		/* protected by page->ptl */
	unsigned long _pt_pad_2;	/* mapping */
	union {
		struct mm_struct *pt_mm;	/* x86 pgds only */
		atomic_t pt_frag_refcount;	/* powerpc */
	};
#if ALLOC_SPLIT_PTLOCKS
	spinlock_t *ptl;
#else
	spinlock_t ptl;
#endif
};

It's a problem because some architectures would really rather
allocate 2KiB page tables (s390) or would like to support 4KiB page
tables on a 64KiB base page size kernel (ppc).

[actually i misread your comment initially; you meant that page
tables point to PFNs and don't care what struct backs them ... i'm
leaving this in here because it illustrates a problem with changing
struct-page-size-to-2MB]

2021-03-29 17:58:49

by Matthew Wilcox

[permalink] [raw]
Subject: Re: [PATCH v5 00/27] Memory Folios

On Mon, Mar 29, 2021 at 05:58:32PM +0100, Matthew Wilcox wrote:
> In broad strokes, I think that having a Power Of Two Allocator
> with Descriptor (POTAD) is a useful foundational allocator to have.
> The specific allocator that we call the buddy allocator is very clever for
> the 1990s, but touches too many cachelines to be good with today's CPUs.
> The generalisation of the buddy allocator to the POTAD lets us allocate
> smaller quantities (eg a 512 byte block) and allocate descriptors which
> differ in size from a struct page. For an extreme example, see xfs_buf
> which is 360 bytes and is the descriptor for an allocation between 512
> and 65536 bytes.
>
> There are times when we need to get from the physical address to
> the descriptor, eg memory-failure.c or get_user_pages(). This is the
> equivalent of phys_to_page(), and it's going to have to be a lookup tree.
> I think this is a role for the Maple Tree, but it's not ready yet.
> I don't know if it'll be fast enough for this case. There's also the
> need (particularly for memory-failure) to determine exactly what kind
> of descriptor we're dealing with, and also its size. Even its owner,
> so we can notify them of memory failure.

A couple of things I forgot to mention ...

I'd like the POTAD to be not necessarily tied to allocating memory.
For example, I think it could be used to allocate swap space. eg the swap
code could register the space in a swap file as allocatable through the
POTAD, and then later ask the POTAD to allocate a POT from the swap space.

The POTAD wouldn't need to be limited to MAX_ORDER. It should be
perfectly capable of allocating 1TB if your machine has 1.5TB of RAM
in it (... and things haven't got too fragmented)

I think the POTAD can be used to replace the CMA. The CMA supports
weirdo things like "Allocate 8MB of memory at a 1MB alignment", and I
think that's doable within the data structures that I'm thinking about
for the POTAD. It'd first try to allocate an 8MB chunk at 8MB alignment,
and then if that's not possible, try to allocate two adjacent 4MB chunks;
continuing down until it finds that there aren't 8x1MB chunks, at which
point it can give up.
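
Roughly, the fallback walk I have in mind -- potad_alloc_adjacent() and
the POTAD itself are hypothetical, this is only to illustrate stepping
down from one naturally-aligned chunk to several adjacent smaller ones:

/* hypothetical: allocate 'count' physically adjacent, naturally-aligned
 * chunks of 'chunk' bytes each */
void *potad_alloc_adjacent(size_t chunk, unsigned int count);

void *potad_alloc_range(size_t size, size_t align)
{
	size_t chunk;

	/* one chunk covering the whole request, then 2 adjacent
	 * half-size chunks, then 4, ... down to the required alignment */
	for (chunk = size; chunk >= align; chunk /= 2) {
		void *p = potad_alloc_adjacent(chunk, size / chunk);

		if (p)
			return p;
	}
	return NULL;	/* not even size/align adjacent chunks available */
}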

2021-03-30 19:35:46

by Johannes Weiner

[permalink] [raw]
Subject: Re: [PATCH v5 00/27] Memory Folios

Hi Willy,

On Mon, Mar 29, 2021 at 05:58:32PM +0100, Matthew Wilcox wrote:
> I'm going to respond to some points in detail below, but there are a
> couple of overarching themes that I want to bring out up here.
>
> Grand Vision
> ~~~~~~~~~~~~
>
> I haven't outlined my long-term plan. Partly because it is a _very_
> long way off, and partly because I think what I'm doing stands on its
> own. But some of the points below bear on this, so I'll do it now.
>
> Eventually, I want to make struct page optional for allocations. It's too
> small for some things (allocating page tables, for example), and overly
> large for others (allocating a 2MB page, networking page_pool). I don't
> want to change its size in the meantime; having a struct page refer to
> PAGE_SIZE bytes is something that's quite deeply baked in.

Right, I think it's overloaded and it needs to go away from many
contexts it's used in today.

I think it describes a real physical thing, though, and won't go away
as a concept. More on that below.

> In broad strokes, I think that having a Power Of Two Allocator
> with Descriptor (POTAD) is a useful foundational allocator to have.
> The specific allocator that we call the buddy allocator is very clever for
> the 1990s, but touches too many cachelines to be good with today's CPUs.
> The generalisation of the buddy allocator to the POTAD lets us allocate
> smaller quantities (eg a 512 byte block) and allocate descriptors which
> differ in size from a struct page. For an extreme example, see xfs_buf
> which is 360 bytes and is the descriptor for an allocation between 512
> and 65536 bytes.

I actually disagree with this rather strongly. If anything, the buddy
allocator has turned out to be a pretty poor fit for the foundational
allocator.

On paper, it is elegant and versatile in serving essentially arbitrary
memory blocks. In practice, we mostly just need 4k and 2M chunks from
it. And it sucks at the 2M ones because of the fragmentation caused by
the ungrouped 4k blocks.

The great thing about the slab allocator isn't just that it manages
internal fragmentation of the larger underlying blocks. It also groups
related objects by lifetime/age and reclaimability, which dramatically
mitigates the external fragmentation of the memory space.

The buddy allocator on the other hand has no idea what you want that
4k block for, and whether it pairs up well with the 4k block it just
handed to somebody else. But the decision it makes in that moment is
crucial for its ability to serve larger blocks later on.

We do some mobility grouping based on how reclaimable or migratable
the memory is, but it's not the full answer.

A variable size allocator without object type grouping will always
have difficulties producing anything but the smallest block size after
some uptime. It's inherently flawed that way.

What HAS proven itself is having the base block size correspond to a
reasonable transaction unit for paging and page reclaim, then fill in
smaller ranges with lifetime-aware slabbing, larger ranges with
vmalloc and SG schemes, and absurdly large requests with CMA.

We might be stuck with serving order-1, order-2 etc. for a little
while longer for the few users who can't go to kvmalloc(), but IMO
it's the wrong direction to expand into.

Optimally the foundational allocator would just do one block size.

> There are times when we need to get from the physical address to
> the descriptor, eg memory-failure.c or get_user_pages(). This is the
> equivalent of phys_to_page(), and it's going to have to be a lookup tree.
> I think this is a role for the Maple Tree, but it's not ready yet.
> I don't know if it'll be fast enough for this case. There's also the
> need (particularly for memory-failure) to determine exactly what kind
> of descriptor we're dealing with, and also its size. Even its owner,
> so we can notify them of memory failure.

A tree could be more memory efficient in the long term, but for
starters a 2M page could have a

struct smallpage *smallpages[512];

member that points to any allocated/mapped 4k descriptors. The page
table level would tell you what you're looking at: a pmd is simple; a
pte would map to a 4k pfn, whose upper bits identify a struct page, and
then a page flag would tell you whether we have a pte-mapped 2M page
or whether the lower pfn bits identify an offset in smallpages[].

It's one pointer for every 4k of RAM, which is a bit dumb, but not as
dumb as having an entire struct page for each of those ;)
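
A minimal sketch of that lookup, with everything in it made up (a
2M-granularity struct page stand-in, the struct smallpage descriptor,
an invented PG_has_smallpages flag):

struct smallpage;

struct page2m {				/* stand-in for a 2M struct page */
	unsigned long flags;
	struct smallpage *smallpages[512];	/* one slot per 4k sub-block */
};

#define PG_has_smallpages	20		/* invented flag bit */

static inline struct smallpage *pfn_to_smallpage(struct page2m *page,
						 unsigned long pfn)
{
	/* pte-mapped 2M page: no 4k descriptors were allocated for it */
	if (!test_bit(PG_has_smallpages, &page->flags))
		return NULL;
	/* low 9 bits of the pfn select the 4k offset in the 2M block */
	return page->smallpages[pfn & 511];
}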

> What I haven't touched on anywhere in this, is whether a folio is the
> descriptor for all POTA or whether it's specifically the page cache
> descriptor. I like the idea of having separate descriptors for objects
> in the page cache from anonymous or other allocations. But I'm not very
> familiar with the rmap code, and that wants to do things like manipulate
> the refcount on a descriptor without knowing whether it's a file or
> anon page. Or neither (eg device driver memory mapped to userspace.
> Or vmalloc memory mapped to userspace. Or ...)

The rmap code is all about the page type specifics, but once you get
into mmap, page reclaim, page migration, we're dealing with fully
fungible blocks of memory.

I do like the idea of using actual language typing for the different
things struct page can be today (fs page), but with a common type to
manage the fungible block of memory backing it (allocation state, LRU
& aging state, mmap state etc.)

New types for the former are an easier sell. We all agree that there
are too many details of the page - including the compound page
implementation detail - inside the cache library, fs code and drivers.

It's a slightly tougher sell to say that the core VM code itself
(outside the cache library) needs a tighter abstraction for the struct
page building block and the compound page structure. At least at this
time while we're still sorting out how it all may work down the line.
Certainly, we need something to describe fungible memory blocks:
either a struct page that can be 4k and 2M compound, or a new thing
that can be backed by a 2M struct page or a 4k struct smallpage. We
don't know yet, so I would table the new abstraction type for this.

I generally don't think we want a new type that does everything that
the overloaded struct page already does PLUS the compound
abstraction. Whatever name we pick for it, it'll always be difficult
to wrap your head around such a beast.

IMO starting with an explicit page cache descriptor that resolves to
struct page inside core VM code (and maybe ->fault) for now makes the
most sense: it greatly mitigates the PAGE_SIZE and tail page issue
right away, and it's not in conflict with, but rather helps work
toward, replacing the fungible memory unit behind it.

There isn't too much overlap or generic code between cache and anon
pages such that sharing a common descriptor would be a huge win (most
overlap is at the fungible memory block level, and the physical struct
page layout of course), so I don't think we should aim for a generic
abstraction for both.

As drivers go, I think there are slightly different requirements to
filesystems, too. For filesystems, when the VM can finally do it (and
the file range permits it), I assume we want to rather transparently
increase the unit of data transfer from 4k to 2M. Most drivers that
currently hardcode alloc_page() or PAGE_SIZE OTOH probably don't want
us to bump their allocation sizes.

There ARE instances where drivers allocate pages based on buffer_size
/ PAGE_SIZE and then interact with virtual memory. Those are true VM
objects that could grow transparently if PAGE_SIZE grows, and IMO they
should share the "fungible memory block" abstraction the VM uses.

But there are also many instances where PAGE_SIZE just means 4096 is a
good size for me, and struct page is useful for refcounting. Those
just shouldn't use whatever the VM or the cache layer are using and
stop putting additional burden on an already tricky abstraction.

> On Fri, Mar 26, 2021 at 01:48:15PM -0400, Johannes Weiner wrote:
> > On Wed, Mar 24, 2021 at 06:24:21AM +0000, Matthew Wilcox wrote:
> > > On Tue, Mar 23, 2021 at 08:29:16PM -0400, Johannes Weiner wrote:
> > > > On Mon, Mar 22, 2021 at 06:47:44PM +0000, Matthew Wilcox wrote:
> > > > > On Mon, Mar 22, 2021 at 01:59:24PM -0400, Johannes Weiner wrote:
> > > One of the patches I haven't posted yet starts to try to deal with kmap()/mem*()/kunmap():
> > >
> > > mm: Add kmap_local_folio
> > >
> > > This allows us to map a portion of a folio. Callers can only expect
> > > to access up to the next page boundary.
> > >
> > > Signed-off-by: Matthew Wilcox (Oracle) <[email protected]>
> > >
> > > diff --git a/include/linux/highmem-internal.h b/include/linux/highmem-internal.h
> > > index 7902c7d8b55f..55a29c9d562f 100644
> > > --- a/include/linux/highmem-internal.h
> > > +++ b/include/linux/highmem-internal.h
> > > @@ -73,6 +73,12 @@ static inline void *kmap_local_page(struct page *page)
> > > return __kmap_local_page_prot(page, kmap_prot);
> > > }
> > >
> > > +static inline void *kmap_local_folio(struct folio *folio, size_t offset)
> > > +{
> > > + struct page *page = &folio->page + offset / PAGE_SIZE;
> > > + return __kmap_local_page_prot(page, kmap_prot) + offset % PAGE_SIZE;
> > > +}
> > >
> > > Partly I haven't shared that one because I'm not 100% sure that 'byte
> > > offset relative to start of folio' is the correct interface. I'm looking
> > > at some users and thinking that maybe 'byte offset relative to start
> > > of file' might be better. Or perhaps that's just filesystem-centric
> > > thinking.
> >
> > Right, this doesn't seem specific to files just because they would be
> > the primary users of it.
>
> Yeah. I think I forgot to cc you on this:
>
> https://lore.kernel.org/linux-fsdevel/[email protected]/
>
> and "byte offset relative to the start of the folio" works just fine:
>
> + offset = offset_in_folio(folio, diter->pos);
> +
> +map:
> + diter->entry = kmap_local_folio(folio, offset);

Yeah, that looks great to me!

> > > > But for that to work, we'll need the allocator to produce huge pages
> > > > at the necessary rate, too. The current implementation likely won't
> > > > scale. Compaction is expensive enough that we have to weigh when to
> > > > allocate huge pages for long-lived anon regions, let alone allocate
> > > > them for streaming IO cache entries.
> > >
> > > Heh, I have that as a work item for later this year -- give the page
> > > allocator per-cpu lists of compound pages, not just order-0 pages.
> > > That'll save us turning compound pages back into buddy pages, only to
> > > turn them into compound pages again.
> > >
> > > I also have a feeling that the page allocator either needs to become a
> > > sub-allocator of an allocator that deals in, say, 1GB chunks of memory,
> > > or it needs to become reluctant to break up larger orders. eg if the
> > > dcache asks for just one more dentry, it should have to go through at
> > > least one round of reclaim before we choose to break up a high-order
> > > page to satisfy that request.
> >
> > Slub already allocates higher-order pages for dentries:
> >
> > slabinfo - version: 2.1
> > # name <active_objs> <num_objs> <objsize> <objperslab> <pagesperslab> : tunables <limit> <batchcount> <sharedfactor> : slabdata <active_slabs> <num_slabs> <sharedavail>
> > dentry 133350 133350 192 42 2 : tunables 0 0 0 : slabdata 3175 3175 0
> >
> > ^ here
> >
> > and it could avoid even more internal fragmentation with bigger
> > orders. It only doesn't because of the overhead of allocating them.
>
> Oh, yes. Sorry, I didn't explain myself properly. If we have a
> lightly-loaded system with terabytes of memory (perhaps all the jobs
> it is running are CPU intensive and don't need much memory), the system
> has a tendency to clog up with negative dentries. Hundreds of millions
> of them. We rely on memory pressure to get rid of them, and when there
> finally is memory pressure, it takes literally hours.
>
> If there were a slight amount of pressure to trim the dcache at the point
> when we'd otherwise break up an order-4 page to get an order-2 page,
> the system would work much better. Obviously, we do want the dcache to
> be able to expand to the point where it's useful, but at the point that
> it's no longer useful, we need to trim it.
>
> It'd probably be better to have the dcache realise that its old entries
> aren't useful any more and age them out instead of relying on memory
> pressure to remove old entries, so this is probably an unnecessary
> digression.

It's difficult to identify a universally acceptable line for
usefulness of caches other than physical memory pressure.

The good thing about the memory pressure threshold is that you KNOW
somebody else has immediate use for the memory, and you're justified
in recycling and reallocating caches from the cold end.

Without that, you'd either have to set an arbitrary size cutoff or an
arbitrary aging cutoff (not used in the last minute e.g.). But optimal
settings for either of those depend on the workload, and aren't very
intuitive to configure.

Such a large gap between the smallest object and the overall size of
memory is just inherently difficult to manage. More below.

> > If the default block size in the allocator were 2M, we'd also get slab
> > packing at that granularity, and we wouldn't have to worry about small
> > objects breaking huge pages any more than we worry about slab objects
> > fragmenting 4k pages today.
>
> Yup. I definitely see the attraction of letting the slab allocator
> allocate in larger units. On the other hand, you have to start worrying
> about underutilisation of the memory at _some_ size, and I'd argue the
> sweet spot is somewhere between 4kB and 2MB today. For example:
>
> fat_inode_cache 110 110 744 22 4 : tunables 0 0 0 : slabdata 5 5 0
>
> That's currently using 20 pages. If slab were only allocating 2MB slabs
> from the page allocator, I'd have 1.9MB of ram unused in that cache.

Right, we'd raise internal fragmentation to a worst case of 2M (minus
minimum object size) per slab cache. As a ratio of overall memory,
this isn't unprecedented, though: my desktop machine has 32G and my
phone has 8G. Divide those by 512 for a 4k base page comparison and
you get memory sizes common in the mid to late 90s.

Our levels of internal fragmentation are historically low, which of
course is nice by itself. But that's also what's causing problems in
the form of external fragmentation, and why we struggle to produce 2M
blocks. It's multitudes easier to free one 2M slab page of
consecutively allocated inodes than it is to free 512 batches of
different objects with conflicting lifetimes, ages, or potentially
even reclaimability.

I don't think we'll have much of a choice when it comes to trading
some internal fragmentation to deal with our mounting external
fragmentation problem.

[ Because of the way fragmentation works I also don't think that 1G
would be a good foundational block size. It either wastes a crazy
amount of memory on internal fragmentation, or you allow external
fragmentation and the big blocks deteriorate with uptime anyway.

There really is such a thing as a page: a goldilocks quantity of
memory, given the overall amount of RAM in a system, that is optimal
as a paging unit and intersection point for the fragmentation axes.

This never went away. It just isn't 4k anymore on modern systems.
And we're creating a bit of a mess by adapting various places (page
allocator, slab, page cache, swap code) to today's goldilocks size
while struct page lags behind and doesn't track reality anymore.

I think there is a lot of value in disconnecting places from struct
page that don't need it, but IMO all in the context of the broader
goal of being able to catch up struct page to what the real page is.

We may be able to get rid of the 4k backward-compatible paging units
eventually when we all have 1TB of RAM. But the concept of a page in
a virtual memory system isn't really going anywhere. ]

> > > But people seem attached to being able to use smaller page sizes.
> > > There's that pesky "compatibility" argument.
> >
> > Right, that's why I'm NOT saying we should eliminate the support for
> > 4k chunks in the page cache and page tables. That's still useful if
> > you have lots of small files.
> >
> > I'm just saying it doesn't have to be the default that everything is
> > primarily optimized for. We can make the default allocation size of
> > the allocator correspond to a hugepage and have a secondary allocator
> > level for 4k chunks. Like slab, but fixed-size and highmem-aware.
> >
> > It makes sense to make struct page 2M as well. It would save a ton of
> > memory on average and reduce the pressure we have on struct page's
> > size today.
> >
> > And we really don't need struct page at 4k just to support this unit
> > of paging when necessary: page tables don't care, they use pfns and can
> > point to any 4k offset, struct page or no struct page. For the page
> > cache, we can move mapping, index, lru, etc. from today's struct page
> > into an entry descriptor that could either sit in a native 2M struct
> > page (just like today), or be allocated on demand and point into a
> > chunked struct page. Same for <2M anonymous mappings.
> >
> > Hey, didn't you just move EXACTLY those fields into the folio? ;)
>
> You say page tables don't actually need a struct page, but we do use it.
>
> struct { /* Page table pages */
> unsigned long _pt_pad_1; /* compound_head */
> pgtable_t pmd_huge_pte; /* protected by page->ptl */
> unsigned long _pt_pad_2; /* mapping */
> union {
> struct mm_struct *pt_mm; /* x86 pgds only */
> atomic_t pt_frag_refcount; /* powerpc */
> };
> #if ALLOC_SPLIT_PTLOCKS
> spinlock_t *ptl;
> #else
> spinlock_t ptl;
> #endif
> };
>
> It's a problem because some architectures would really rather
> allocate 2KiB page tables (s390) or would like to support 4KiB page
> tables on a 64KiB base page size kernel (ppc).
>
> [actually i misread your comment initially; you meant that page
> tables point to PFNs and don't care what struct backs them ... i'm
> leaving this in here because it illustrates a problem with changing
> struct-page-size-to-2MB]

Yes, I meant what page table entries point to.

The page table (directories) themselves are still 4k as per the
architecture, and they'd also have to use smallpage descriptors.

I don't immediately see why they couldn't, though. It's not that many,
especially if pmd mappings are common (a 4k pmd can map 1G worth of
address space).

2021-03-30 21:11:50

by Matthew Wilcox

[permalink] [raw]
Subject: Re: [PATCH v5 00/27] Memory Folios

On Tue, Mar 30, 2021 at 03:30:54PM -0400, Johannes Weiner wrote:
> Hi Willy,
>
> On Mon, Mar 29, 2021 at 05:58:32PM +0100, Matthew Wilcox wrote:
> > I'm going to respond to some points in detail below, but there are a
> > couple of overarching themes that I want to bring out up here.
> >
> > Grand Vision
> > ~~~~~~~~~~~~
> >
> > I haven't outlined my long-term plan. Partly because it is a _very_
> > long way off, and partly because I think what I'm doing stands on its
> > own. But some of the points below bear on this, so I'll do it now.
> >
> > Eventually, I want to make struct page optional for allocations. It's too
> > small for some things (allocating page tables, for example), and overly
> > large for others (allocating a 2MB page, networking page_pool). I don't
> > want to change its size in the meantime; having a struct page refer to
> > PAGE_SIZE bytes is something that's quite deeply baked in.
>
> Right, I think it's overloaded and it needs to go away from many
> contexts it's used in today.
>
> I think it describes a real physical thing, though, and won't go away
> as a concept. More on that below.

I'm at least 90% with you on this, and we're just quibbling over details
at this point, I think.

> > In broad strokes, I think that having a Power Of Two Allocator
> > with Descriptor (POTAD) is a useful foundational allocator to have.
> > The specific allocator that we call the buddy allocator is very clever for
> > the 1990s, but touches too many cachelines to be good with today's CPUs.
> > The generalisation of the buddy allocator to the POTAD lets us allocate
> > smaller quantities (eg a 512 byte block) and allocate descriptors which
> > differ in size from a struct page. For an extreme example, see xfs_buf
> > which is 360 bytes and is the descriptor for an allocation between 512
> > and 65536 bytes.
>
> I actually disagree with this rather strongly. If anything, the buddy
> allocator has turned out to be a pretty poor fit for the foundational
> allocator.
>
> On paper, it is elegant and versatile in serving essentially arbitrary
> memory blocks. In practice, we mostly just need 4k and 2M chunks from
> it. And it sucks at the 2M ones because of the fragmentation caused by
> the ungrouped 4k blocks.

That's a very Intel-centric way of looking at it. Other architectures
support a multitude of page sizes, from the insane ia64 (4k, 8k, 16k, then
every power of four up to 4GB) to more reasonable options like (4k, 32k,
256k, 2M, 16M, 128M). But we (in software) shouldn't constrain ourselves
to thinking in terms of what the hardware currently supports. Google
have data showing that for their workloads, 32kB is the goldilocks size.
I'm sure for some workloads, it's much higher and for others it's lower.
But for almost no workload is 4kB the right choice any more, and probably
hasn't been since the late 90s.

> The great thing about the slab allocator isn't just that it manages
> internal fragmentation of the larger underlying blocks. It also groups
> related objects by lifetime/age and reclaimability, which dramatically
> mitigates the external fragmentation of the memory space.
>
> The buddy allocator on the other hand has no idea what you want that
> 4k block for, and whether it pairs up well with the 4k block it just
> handed to somebody else. But the decision it makes in that moment is
> crucial for its ability to serve larger blocks later on.
>
> We do some mobility grouping based on how reclaimable or migratable
> the memory is, but it's not the full answer.

I don't think that's entirely true. The vast majority of memory in any
machine is either anonymous or page cache. The problem is that right now,
all anonymous and page cache allocations are order-0 (... or order-9).
So the buddy allocator can't know anything useful about the pages and will
often allocate one order-0 page to the page cache, then allocate its buddy
to the slab cache in order to allocate the radix_tree_node to store the
pointer to the page in (ok, radix tree nodes come from an order-2 cache,
but it still prevents this order-9 page from being assembled).

If the movable allocations suddenly start being order-3 and order-4,
the unmovable, unreclaimable allocations are naturally going to group
down in the lower orders, and we won't have the problem that a single
dentry blocks the allocation of an entire 2MB page.

The problem, for me, with the ZONE_MOVABLE stuff is that it requires
sysadmin intervention to set up. I don't have a ZONE_MOVABLE on
my laptop. The allocator should be automatically handling movability
hints without my intervention.

> A variable size allocator without object type grouping will always
> have difficulties producing anything but the smallest block size after
> some uptime. It's inherently flawed that way.

I think our buddy allocator is flawed, to be sure, but only because
it doesn't handle movable hints more aggressively. For example, at
the point that a largeish block gets a single non-movable allocation,
all the movable allocations within that block should be migrated out.
If the offending allocation is freed quickly, it all collapses into a
large, useful chunk, or if not, then it provides a sponge to soak up
other non-movable allocations.

> > What I haven't touched on anywhere in this, is whether a folio is the
> > descriptor for all POTA or whether it's specifically the page cache
> > descriptor. I like the idea of having separate descriptors for objects
> > in the page cache from anonymous or other allocations. But I'm not very
> > familiar with the rmap code, and that wants to do things like manipulate
> > the refcount on a descriptor without knowing whether it's a file or
> > anon page. Or neither (eg device driver memory mapped to userspace.
> > Or vmalloc memory mapped to userspace. Or ...)
>
> The rmap code is all about the page type specifics, but once you get
> into mmap, page reclaim, page migration, we're dealing with fully
> fungible blocks of memory.
>
> I do like the idea of using actual language typing for the different
> things struct page can be today (fs page), but with a common type to
> manage the fungible block of memory backing it (allocation state, LRU
> & aging state, mmap state etc.)
>
> New types for the former are an easier sell. We all agree that there
> are too many details of the page - including the compound page
> implementation detail - inside the cache library, fs code and drivers.
>
> It's a slightly tougher sell to say that the core VM code itself
> (outside the cache library) needs a tighter abstraction for the struct
> page building block and the compound page structure. At least at this
> time while we're still sorting out how it all may work down the line.
> Certainly, we need something to describe fungible memory blocks:
> either a struct page that can be 4k and 2M compound, or a new thing
> that can be backed by a 2M struct page or a 4k struct smallpage. We
> don't know yet, so I would table the new abstraction type for this.
>
> I generally don't think we want a new type that does everything that
> the overloaded struct page already does PLUS the compound
> abstraction. Whatever name we pick for it, it'll always be difficult
> to wrap your head around such a beast.
>
> IMO starting with an explicit page cache descriptor that resolves to
> struct page inside core VM code (and maybe ->fault) for now makes the
> most sense: it greatly mitigates the PAGE_SIZE and tail page issue
> right away, and it's not in conflict with, but rather helps work
> toward, replacing the fungible memory unit behind it.

Right, and that's what struct folio is today. It eliminates tail pages
from consideration in a lot of paths. I think it also makes sense for
struct folio to be used for anonymous memory. But I think that's where it
stops; it isn't for Slab, it isn't for page table pages, and it's not
for ZONE_DEVICE pages.

> There isn't too much overlap or generic code between cache and anon
> pages such that sharing a common descriptor would be a huge win (most
> overlap is at the fungible memory block level, and the physical struct
> page layout of course), so I don't think we should aim for a generic
> abstraction for both.

They're both on the LRU list, they use a lot of the same PageFlags,
they both have a mapcount and refcount, and they both have memcg_data.
The only things they really use differently are mapping, index and
private. And then we have to consider shmem which uses both in a
pretty eldritch way.

> As drivers go, I think there are slightly different requirements to
> filesystems, too. For filesystems, when the VM can finally do it (and
> the file range permits it), I assume we want to rather transparently
> increase the unit of data transfer from 4k to 2M. Most drivers that
> currently hardcode alloc_page() or PAGE_SIZE OTOH probably don't want
> us to bump their allocation sizes.

If you take a look at my earlier work, you'll see me using a range of
sizes in the page cache, starting at 16kB and gradually increasing to
(theoretically) 2MB, although the algorithm tended to top out around
256kB. Doing particularly large reads could see 512kB/1MB reads, but
it was very hard to hit 2MB in practice. I wasn't too concerned at the
time, but my point is that we do want to automatically tune the size
of the allocation unit to the workload. An application which reads in
64kB chunks is giving us a pretty clear signal that they want to manage
memory in 64kB chunks.

> > It'd probably be better to have the dcache realise that its old entries
> > aren't useful any more and age them out instead of relying on memory
> > pressure to remove old entries, so this is probably an unnecessary
> > digression.
>
> It's difficult to identify a universally acceptable line for
> usefulness of caches other than physical memory pressure.
>
> The good thing about the memory pressure threshold is that you KNOW
> somebody else has immediate use for the memory, and you're justified
> in recycling and reallocating caches from the cold end.
>
> Without that, you'd either have to set an arbitrary size cutoff or an
> arbitrary aging cutoff (not used in the last minute e.g.). But optimal
> settings for either of those depend on the workload, and aren't very
> intuitive to configure.

For the dentry cache, I think there is a more useful metric, and that's
length of the hash chain. If it gets too long, we're spending more time
walking it than we're saving by having entries cached. Starting reclaim
based on "this bucket of the dcache has twenty entries in it" would
probably work quite well.
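
Something like this, purely as an illustration -- hlist_bl_head is the
real bucket type the dcache hashes into, but both helpers below are
made-up names, not existing interfaces:

#define DENTRY_CHAIN_MAX	20

unsigned int dentry_chain_len(struct hlist_bl_head *bucket);	/* hypothetical */
void prune_dcache_bucket(struct hlist_bl_head *bucket);		/* hypothetical */

static void d_hash_check_pressure(struct hlist_bl_head *bucket)
{
	/* trim this bucket's cold end instead of waiting for global
	 * memory pressure to finally kick in */
	if (dentry_chain_len(bucket) > DENTRY_CHAIN_MAX)
		prune_dcache_bucket(bucket);
}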

> Our levels of internal fragmentation are historically low, which of
> course is nice by itself. But that's also what's causing problems in
> the form of external fragmentation, and why we struggle to produce 2M
> blocks. It's multitudes easier to free one 2M slab page of
> consecutively allocated inodes than it is to free 512 batches of
> different objects with conflicting lifetimes, ages, or potentially
> even reclaimability.

Unf. I don't think freeing 2MB worth of _anything_ is ever going to be
easy enough to rely on. My actual root filesystem:

xfs_inode 143134 144460 1024 32 8 : tunables 0 0 0 : slabdata 4517 4517 0

So we'd have to be able to free 2048 of those 143k inodes, and they all
have to be consecutive (and aligned). I suppose we could model that and
try to work out how many we'd have to be able to free in order to get all
2048 in any page free, but I bet it's a variant of the Birthday Paradox,
and we'd find it's something crazy like half of them.

Without slab gaining the ability to ask users to relocate allocations,
I think any memory sent to slab is never coming back.


So ... even if I accept every part of your vision as the way things
are going to be, I think the folio patchset I have now is a step in the
right direction. I'm going to send a v6 now and hope it's not too late
for this merge window.

2021-03-31 14:57:34

by Christoph Hellwig

[permalink] [raw]
Subject: Re: [PATCH v5 00/27] Memory Folios

On Tue, Mar 30, 2021 at 03:30:54PM -0400, Johannes Weiner wrote:
> > Eventually, I want to make struct page optional for allocations. It's too
> > small for some things (allocating page tables, for example), and overly
> > large for others (allocating a 2MB page, networking page_pool). I don't
> > want to change its size in the meantime; having a struct page refer to
> > PAGE_SIZE bytes is something that's quite deeply baked in.
>
> Right, I think it's overloaded and it needs to go away from many
> contexts it's used in today.

FYI, one unrelated usage is that in many contexts we use a struct page and
an offset to describe locations for I/O (block layer, networking, DMA
API). With huge pages and merged I/O buffers this representation
actually becomes increasingly painful.

And a little bit back to the topic: I think the folio as in the
current patchset is incredibly useful and something we need like
yesterday to help file systems and the block layer to cope with
huge and compound pages of all sorts. Once willy sends out a new
version with the accumulated fixes I'm ready to ACK the whole thing.

2021-03-31 18:16:11

by Johannes Weiner

[permalink] [raw]
Subject: Re: [PATCH v5 00/27] Memory Folios

On Tue, Mar 30, 2021 at 10:09:29PM +0100, Matthew Wilcox wrote:
> On Tue, Mar 30, 2021 at 03:30:54PM -0400, Johannes Weiner wrote:
> > Hi Willy,
> >
> > On Mon, Mar 29, 2021 at 05:58:32PM +0100, Matthew Wilcox wrote:
> > > I'm going to respond to some points in detail below, but there are a
> > > couple of overarching themes that I want to bring out up here.
> > >
> > > Grand Vision
> > > ~~~~~~~~~~~~
> > >
> > > I haven't outlined my long-term plan. Partly because it is a _very_
> > > long way off, and partly because I think what I'm doing stands on its
> > > own. But some of the points below bear on this, so I'll do it now.
> > >
> > > Eventually, I want to make struct page optional for allocations. It's too
> > > small for some things (allocating page tables, for example), and overly
> > > large for others (allocating a 2MB page, networking page_pool). I don't
> > > want to change its size in the meantime; having a struct page refer to
> > > PAGE_SIZE bytes is something that's quite deeply baked in.
> >
> > Right, I think it's overloaded and it needs to go away from many
> > contexts it's used in today.
> >
> > I think it describes a real physical thing, though, and won't go away
> > as a concept. More on that below.
>
> I'm at least 90% with you on this, and we're just quibbling over details
> at this point, I think.
>
> > > In broad strokes, I think that having a Power Of Two Allocator
> > > with Descriptor (POTAD) is a useful foundational allocator to have.
> > > The specific allocator that we call the buddy allocator is very clever for
> > > the 1990s, but touches too many cachelines to be good with today's CPUs.
> > > The generalisation of the buddy allocator to the POTAD lets us allocate
> > > smaller quantities (eg a 512 byte block) and allocate descriptors which
> > > differ in size from a struct page. For an extreme example, see xfs_buf
> > > which is 360 bytes and is the descriptor for an allocation between 512
> > > and 65536 bytes.
> >
> > I actually disagree with this rather strongly. If anything, the buddy
> > allocator has turned out to be a pretty poor fit for the foundational
> > allocator.
> >
> > On paper, it is elegant and versatile in serving essentially arbitrary
> > memory blocks. In practice, we mostly just need 4k and 2M chunks from
> > it. And it sucks at the 2M ones because of the fragmentation caused by
> > the ungrouped 4k blocks.
>
> That's a very Intel-centric way of looking at it. Other architectures
> support a multitude of page sizes, from the insane ia64 (4k, 8k, 16k, then
> every power of four up to 4GB) to more reasonable options like (4k, 32k,
> 256k, 2M, 16M, 128M). But we (in software) shouldn't constrain ourselves
> to thinking in terms of what the hardware currently supports. Google
> have data showing that for their workloads, 32kB is the goldilocks size.
> I'm sure for some workloads, it's much higher and for others it's lower.
> But for almost no workload is 4kB the right choice any more, and probably
> hasn't been since the late 90s.

You missed my point entirely.

It's not about the exact page sizes, it's about the fragmentation
issue when you mix variable-sized blocks without lifetime grouping.

Anyway, we digressed quite far here. My argument was simply that it's
conceivable we'll switch to a default allocation block and page size
that is larger than the smallest paging size supported by the CPU and
the kernel. (Various architectures might support multiple page sizes,
but once you pick one, that's the smallest quantity the kernel pages.)

That makes "bundle of pages" a short-sighted abstraction, and folio a
poor name for pageable units.

I might be wrong about what happens to PAGE_SIZE eventually (even
though your broader arguments around allocator behavior and
fragmentation don't seem to line up with my observations from
production systems, or the evolution of how we manage allocations of
different sizes) - but you also haven't made a good argument why the
API *should* continue to imply we're dealing with one or more pages.

Yes, it's a bit bikesheddy. But you're proposing an abstraction for
one of the most fundamental data structures in the operating system,
with tens of thousands of instances in almost all core subsystems.

"Bundle of pages (for now) with filesystem data (and maybe anon data
since it's sort of convenient in terms of data structure, for now)"
just doesn't make me go "Yeah, that's it."

I would understand cache_entry for the cache; mem for cache and file
(that discussion trailed off); pageable if we want to imply sizing
and alignment constraints based on the underlying MMU.

I would even prefer kerb, because at least it wouldn't be misleading
if we do have non-struct page backing in the future.

> > The great thing about the slab allocator isn't just that it manages
> > internal fragmentation of the larger underlying blocks. It also groups
> > related objects by lifetime/age and reclaimability, which dramatically
> > mitigates the external fragmentation of the memory space.
> >
> > The buddy allocator on the other hand has no idea what you want that
> > 4k block for, and whether it pairs up well with the 4k block it just
> > handed to somebody else. But the decision it makes in that moment is
> > crucial for its ability to serve larger blocks later on.
> >
> > We do some mobility grouping based on how reclaimable or migratable
> > the memory is, but it's not the full answer.
>
> I don't think that's entirely true. The vast majority of memory in any
> machine is either anonymous or page cache. The problem is that right now,
> all anonymous and page cache allocations are order-0 (... or order-9).
> So the buddy allocator can't know anything useful about the pages and will
> often allocate one order-0 page to the page cache, then allocate its buddy
> to the slab cache in order to allocate the radix_tree_node to store the
> pointer to the page in (ok, radix tree nodes come from an order-2 cache,
> but it still prevents this order-9 page from being assembled).
>
> If the movable allocations suddenly start being order-3 and order-4,
> the unmovable, unreclaimable allocations are naturally going to group
> down in the lower orders, and we won't have the problem that a single
> dentry blocks the allocation of an entire 2MB page.

I don't follow what you're saying here.

> > A variable size allocator without object type grouping will always
> > have difficulties producing anything but the smallest block size after
> > some uptime. It's inherently flawed that way.
>
> I think our buddy allocator is flawed, to be sure, but only because
> it doesn't handle movable hints more aggressively. For example, at
> the point that a largeish block gets a single non-movable allocation,
> all the movable allocations within that block should be migrated out.
> If the offending allocation is freed quickly, it all collapses into a
> large, useful chunk, or if not, then it provides a sponge to soak up
> other non-movable allocations.

The object type implies aging rules and typical access patterns that
are not going to be captured purely by migratability. As such, the
migratetype alone will always perform worse than full type grouping.

E.g. a burst of inode and dentry allocations can claim a large
number of blocks from movable to reclaimable, which will then also be
used to serve concurrent allocations of a different type that may have
much longer lifetimes. After the inodes and dentries disappear again,
you're stuck with very sparsely populated reclaimable blocks.

They can still be reclaimed, but they won't free up as easily as a
contiguous run of bulk-aged inodes and dentries.

You also cannot easily move reclaimable objects out of the block when
an unmovable allocation claims it the same way, so this is sort of a
moot proposal anyway.

The slab allocator isn't a guarantee, but I don't see why you're
arguing we should leave additional lifetime/usage hints on the table.

> > As drivers go, I think there are slightly different requirements to
> > filesystems, too. For filesystems, when the VM can finally do it (and
> > the file range permits it), I assume we want to rather transparently
> > increase the unit of data transfer from 4k to 2M. Most drivers that
> > currently hardcode alloc_page() or PAGE_SIZE OTOH probably don't want
> > us to bump their allocation sizes.
>
> If you take a look at my earlier work, you'll see me using a range of
> sizes in the page cache, starting at 16kB and gradually increasing to
> (theoretically) 2MB, although the algorithm tended to top out around
> 256kB. Doing particularly large reads could see 512kB/1MB reads, but
> it was very hard to hit 2MB in practice. I wasn't too concerned at the
> time, but my point is that we do want to automatically tune the size
> of the allocation unit to the workload. An application which reads in
> 64kB chunks is giving us a pretty clear signal that they want to manage
> memory in 64kB chunks.

You missed my point here, but it sounds like we agree that drivers who
just want a fixed buffer should not use the same type that filesystems
use for dynamic paging units.

> > > It'd probably be better to have the dcache realise that its old entries
> > > aren't useful any more and age them out instead of relying on memory
> > > pressure to remove old entries, so this is probably an unnecessary
> > > digression.
> >
> > It's difficult to identify a universally acceptable line for
> > usefulness of caches other than physical memory pressure.
> >
> > The good thing about the memory pressure threshold is that you KNOW
> > somebody else has immediate use for the memory, and you're justified
> > in recycling and reallocating caches from the cold end.
> >
> > Without that, you'd either have to set an arbitrary size cutoff or an
> > arbitrary aging cutoff (not used in the last minute e.g.). But optimal
> > settings for either of those depend on the workload, and aren't very
> > intuitive to configure.
>
> For the dentry cache, I think there is a more useful metric, and that's
> length of the hash chain. If it gets too long, we're spending more time
> walking it than we're saving by having entries cached. Starting reclaim
> based on "this bucket of the dcache has twenty entries in it" would
> probably work quite well.

That might work for this cache, but it's not a generic solution to
fragmentation caused by cache positions building up in the absence of
memory pressure.
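
For concreteness, here is a sketch of the bucket-length heuristic quoted
above; the types, the list walk and the threshold of 20 are invented for
this sketch and are not taken from the real dcache.

#define BUCKET_RECLAIM_LEN 20                   /* invented threshold */

struct dentry_stub {
        struct dentry_stub *next;               /* hash chain link */
        unsigned long       last_used;          /* coarse age, e.g. in jiffies */
};

/* Pick a victim from one hash bucket once the chain has grown past the
 * threshold; return NULL while the chain is still cheap to walk. */
static struct dentry_stub *bucket_pick_victim(struct dentry_stub *head)
{
        struct dentry_stub *d, *oldest = head;
        unsigned int len = 0;

        for (d = head; d; d = d->next) {
                len++;
                if (d->last_used < oldest->last_used)
                        oldest = d;
        }
        return len > BUCKET_RECLAIM_LEN ? oldest : NULL;
}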

> > Our levels of internal fragmentation are historically low, which of
> > course is nice by itself. But that's also what's causing problems in
> > the form of external fragmentation, and why we struggle to produce 2M
> > blocks. It's multitudes easier to free one 2M slab page of
> > consecutively allocated inodes than it is to free 512 batches of
> > different objects with conflicting lifetimes, ages, or potentially
> > even reclaimability.
>
> Unf. I don't think freeing 2MB worth of _anything_ is ever going to be
> easy enough to rely on. My actual root filesystem:
>
> xfs_inode 143134 144460 1024 32 8 : tunables 0 0 0 : slabdata 4517 4517 0
>
> So we'd have to be able to free 2048 of those 143k inodes, and they all
> have to be consecutive (and aligned). I suppose we could model that and
> try to work out how many we'd have to be able to free in order to get all
> 2048 in any page free, but I bet it's a variant of the Birthday Paradox,
> and we'd find it's something crazy like half of them.

How is it different than freeing a 4k page in 1995?

The descriptor size itself may not have scaled at the same rate as
overall memory size. But that also means the cache position itself is
much less of a concern in terms of memory consumed and fragmented.

Case in point, this is 141M. Yes, probably with a mixture of some hot
and a long tail of cold entries. It's not really an interesting
reclaim target.

When slab cache positions become a reclaim concern, it's usually when
they spike due to a change in the workload. And then you tend to get
contiguous runs of objects with a similar age.

> Without slab gaining the ability to ask users to relocate allocations,
> I think any memory sent to slab is never coming back.

Not sure what data you're basing this on.

> So ... even if I accept every part of your vision as the way things
> are going to be, I think the folio patchset I have now is a step in the
> right direction. I'm going to send a v6 now and hope it's not too late
> for this merge window.

I don't think folio as an abstraction is cooked enough to replace such
a major part of the kernel with it, so I'm against merging it now.

I would really like to see a better definition of what it actually
represents, instead of a fluid combination of implementation details
and conveniences.

2021-03-31 18:30:52

by Matthew Wilcox

[permalink] [raw]
Subject: Re: [PATCH v5 00/27] Memory Folios

On Wed, Mar 31, 2021 at 02:14:00PM -0400, Johannes Weiner wrote:
> Anyway, we digressed quite far here. My argument was simply that it's
> conceivable we'll switch to a default allocation block and page size
> that is larger than the smallest paging size supported by the CPU and
> the kernel. (Various architectures might support multiple page sizes,
> but once you pick one, that's the smallest quantity the kernel pages.)

We've had several attempts in the past to make 'struct page' refer to
a different number of bytes than the-size-of-a-single-pte, and they've
all failed in one way or another. I don't think changing PAGE_SIZE to
any other size is reasonable.

Maybe we have a larger allocation unit in the future, maybe we do
something else, but that should have its own name, not 'struct page'.
I think the shortest path to getting what you want is having a superpage
allocator that the current page allocator can allocate from. When a
superpage is allocated from the superpage allocator, we allocate an
array of struct pages for it.
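
A toy sketch of that split, with invented names and a userspace stand-in
for the descriptor allocation, just to make the shape of the idea
concrete:

#include <stdlib.h>

#define SUBPAGES 512                            /* 2MB / 4KB */

struct page_desc { unsigned long flags; };      /* stand-in for struct page */

struct superpage {
        unsigned long     base_pfn;             /* first frame of the 2MB unit */
        struct page_desc *descs;                /* NULL while used as one unit */
};

/* Called when the small-page allocator takes over this unit: descriptors
 * for the 512 subpages are only materialised at this point. */
static struct page_desc *superpage_split(struct superpage *sp)
{
        if (!sp->descs)
                sp->descs = calloc(SUBPAGES, sizeof(*sp->descs));
        return sp->descs;
}

/* Called when all subpages are free again and the unit returns to the
 * superpage allocator: the descriptors can simply be dropped. */
static void superpage_merge(struct superpage *sp)
{
        free(sp->descs);
        sp->descs = NULL;
}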

> I don't think folio as an abstraction is cooked enough to replace such
> a major part of the kernel with it, so I'm against merging it now.
>
> I would really like to see a better definition of what it actually
> represents, instead of a fluid combination of implementation details
> and conveniences.

Here's the current kernel-doc for it:

/**
* struct folio - Represents a contiguous set of bytes.
* @flags: Identical to the page flags.
* @lru: Least Recently Used list; tracks how recently this folio was used.
* @mapping: The file this page belongs to, or refers to the anon_vma for
* anonymous pages.
* @index: Offset within the file, in units of pages. For anonymous pages,
* this is the index from the beginning of the mmap.
* @private: Filesystem per-folio data (see attach_folio_private()).
* Used for swp_entry_t if FolioSwapCache().
* @_mapcount: How many times this folio is mapped to userspace. Use
* folio_mapcount() to access it.
* @_refcount: Number of references to this folio. Use folio_ref_count()
* to read it.
* @memcg_data: Memory Control Group data.
*
* A folio is a physically, virtually and logically contiguous set
* of bytes. It is a power-of-two in size, and it is aligned to that
* same power-of-two. It is at least as large as %PAGE_SIZE. If it is
* in the page cache, it is at a file offset which is a multiple of that
* power-of-two.
*/
struct folio {
        /* private: don't document the anon union */
        union {
                struct {
        /* public: */
                        unsigned long flags;
                        struct list_head lru;
                        struct address_space *mapping;
                        pgoff_t index;
                        unsigned long private;
                        atomic_t _mapcount;
                        atomic_t _refcount;
#ifdef CONFIG_MEMCG
                        unsigned long memcg_data;
#endif
        /* private: the union with struct page is transitional */
                };
                struct page page;
        };
};
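
And a minimal usage sketch: the wrapper function below is invented for
the example, but page_folio(), get_folio(), put_folio() and
folio_contains() are the accessors added by this series, so the caller
never has to think about head vs tail pages.

static bool example_covers(struct page *page, loff_t pos)
{
        struct folio *folio = page_folio(page); /* tail-safe: any page -> its folio */
        bool ret;

        get_folio(folio);                       /* one reference for the whole folio */
        ret = folio_contains(folio, pos >> PAGE_SHIFT);
        put_folio(folio);
        return ret;
}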

2021-04-01 05:09:41

by Al Viro

[permalink] [raw]
Subject: Re: [PATCH v5 00/27] Memory Folios

On Tue, Mar 30, 2021 at 10:09:29PM +0100, Matthew Wilcox wrote:

> That's a very Intel-centric way of looking at it. Other architectures
> support a multitude of page sizes, from the insane ia64 (4k, 8k, 16k, then
> every power of four up to 4GB) to more reasonable options like (4k, 32k,
> 256k, 2M, 16M, 128M). But we (in software) shouldn't constrain ourselves
> to thinking in terms of what the hardware currently supports. Google
> have data showing that for their workloads, 32kB is the goldilocks size.
> I'm sure for some workloads, it's much higher and for others it's lower.
> But for almost no workload is 4kB the right choice any more, and probably
> hasn't been since the late 90s.

Out of curiosity I looked at the distribution of file sizes in the
kernel tree:
71455 files total
0--4Kb 36702
4--8Kb 11820
8--16Kb 10066
16--32Kb 6984
32--64Kb 3804
64--128Kb 1498
128--256Kb 393
256--512Kb 108
512Kb--1Mb 35
1--2Mb 25
2--4Mb 5
4--6Mb 7
6--8Mb 4
12Mb 2
14Mb 1
16Mb 1

... incidentally, everything bigger than 1.2Mb lives^Wshambles under
drivers/gpu/drm/amd/include/asic_reg/

Page size Footprint
4Kb 1128Mb
8Kb 1324Mb
16Kb 1764Mb
32Kb 2739Mb
64Kb 4832Mb
128Kb 9191Mb
256Kb 18062Mb
512Kb 35883Mb
1Mb 71570Mb
2Mb 142958Mb

So for kernel builds (as well as grep over the tree, etc.) uniform 2Mb pages
would be... interesting.
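
A small sketch for reproducing the footprint column, on the assumption
(not stated above) that the footprint is simply each file size rounded
up to the candidate page size and summed; feed it the output of
`find . -type f -printf '%s\n'` on stdin:

#include <stdio.h>

int main(void)
{
        static const long psize[] = { 4096, 8192, 16384, 32768, 65536, 131072,
                                      262144, 524288, 1048576, 2097152 };
        long long footprint[sizeof(psize) / sizeof(psize[0])] = { 0 };
        long long sz;

        while (scanf("%lld", &sz) == 1)
                for (unsigned int i = 0; i < sizeof(psize) / sizeof(psize[0]); i++)
                        footprint[i] += (sz + psize[i] - 1) / psize[i] * psize[i];

        for (unsigned int i = 0; i < sizeof(psize) / sizeof(psize[0]); i++)
                printf("%7ldKb %lldMb\n", psize[i] / 1024, footprint[i] >> 20);
        return 0;
}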

2021-04-01 17:53:33

by Matthew Wilcox

[permalink] [raw]
Subject: Re: [PATCH v5 00/27] Memory Folios

On Thu, Apr 01, 2021 at 05:05:37AM +0000, Al Viro wrote:
> On Tue, Mar 30, 2021 at 10:09:29PM +0100, Matthew Wilcox wrote:
>
> > That's a very Intel-centric way of looking at it. Other architectures
> > support a multitude of page sizes, from the insane ia64 (4k, 8k, 16k, then
> > every power of four up to 4GB) to more reasonable options like (4k, 32k,
> > 256k, 2M, 16M, 128M). But we (in software) shouldn't constrain ourselves
> > to thinking in terms of what the hardware currently supports. Google
> > have data showing that for their workloads, 32kB is the goldilocks size.
> > I'm sure for some workloads, it's much higher and for others it's lower.
> > But for almost no workload is 4kB the right choice any more, and probably
> > hasn't been since the late 90s.
>
> Out of curiosity I looked at the distribution of file sizes in the
> kernel tree:
> 71455 files total
> 0--4Kb 36702
> 4--8Kb 11820
> 8--16Kb 10066
> 16--32Kb 6984
> 32--64Kb 3804
> 64--128Kb 1498
> 128--256Kb 393
> 256--512Kb 108
> 512Kb--1Mb 35
> 1--2Mb 25
> 2--4Mb 5
> 4--6Mb 7
> 6--8Mb 4
> 12Mb 2
> 14Mb 1
> 16Mb 1
>
> ... incidentally, everything bigger than 1.2Mb lives^Wshambles under
> drivers/gpu/drm/amd/include/asic_reg/

I'm just going to edit this table to add a column indicating ratio
to previous size:

> Page size Footprint
> 4Kb 1128Mb
> 8Kb 1324Mb 1.17
> 16Kb 1764Mb 1.33
> 32Kb 2739Mb 1.55
> 64Kb 4832Mb 1.76
> 128Kb 9191Mb 1.90
> 256Kb 18062Mb 1.96
> 512Kb 35883Mb 1.98
> 1Mb 71570Mb 1.994
> 2Mb 142958Mb 1.997
>
> So for kernel builds (as well as grep over the tree, etc.) uniform 2Mb pages
> would be... interesting.

Yep, that's why I opted for a "start out slowly and let readahead tell me
when to increase the page size" approach.

I think Johannes' real problem is that slab and page cache / anon pages
are getting intermingled. We could solve this by having slab allocate
2MB pages from the page allocator and then split them up internally
(so not all of that 2MB necessarily goes to a single slab cache, but all
of that 2MB goes to some slab cache).
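
A toy sketch of that carving, with invented names and userspace
stand-ins for the page allocator, just to illustrate the grouping (every
32KB chunk handed to a slab cache comes from a 2MB arena that only ever
feeds slab):

#include <stdlib.h>

#define ARENA_SIZE      (2u << 20)              /* one 2MB unit */
#define CHUNK_SIZE      (32u << 10)             /* what a slab cache asks for */

struct slab_arena {
        char              *base;                /* stand-in for a 2MB page */
        unsigned int       next;                /* next unused chunk */
        struct slab_arena *prev;
};

static struct slab_arena *arenas;

/* Hand out one chunk for some slab cache, starting a new 2MB arena only
 * when the current one is exhausted. */
static void *slab_chunk_alloc(void)
{
        struct slab_arena *a = arenas;

        if (!a || a->next == ARENA_SIZE / CHUNK_SIZE) {
                a = malloc(sizeof(*a));
                if (!a)
                        return NULL;
                a->base = aligned_alloc(ARENA_SIZE, ARENA_SIZE);
                if (!a->base) {
                        free(a);
                        return NULL;
                }
                a->next = 0;
                a->prev = arenas;
                arenas = a;
        }
        return a->base + (size_t)CHUNK_SIZE * a->next++;
}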

2021-04-01 18:04:48

by Johannes Weiner

[permalink] [raw]
Subject: Re: [PATCH v5 00/27] Memory Folios

On Thu, Apr 01, 2021 at 05:05:37AM +0000, Al Viro wrote:
> On Tue, Mar 30, 2021 at 10:09:29PM +0100, Matthew Wilcox wrote:
>
> > That's a very Intel-centric way of looking at it. Other architectures
> > support a multitude of page sizes, from the insane ia64 (4k, 8k, 16k, then
> > every power of four up to 4GB) to more reasonable options like (4k, 32k,
> > 256k, 2M, 16M, 128M). But we (in software) shouldn't constrain ourselves
> > to thinking in terms of what the hardware currently supports. Google
> > have data showing that for their workloads, 32kB is the goldilocks size.
> > I'm sure for some workloads, it's much higher and for others it's lower.
> > But for almost no workload is 4kB the right choice any more, and probably
> > hasn't been since the late 90s.
>
> Out of curiosity I looked at the distribution of file sizes in the
> kernel tree:
> 71455 files total
> 0--4Kb 36702
> 4--8Kb 11820
> 8--16Kb 10066
> 16--32Kb 6984
> 32--64Kb 3804
> 64--128Kb 1498
> 128--256Kb 393
> 256--512Kb 108
> 512Kb--1Mb 35
> 1--2Mb 25
> 2--4Mb 5
> 4--6Mb 7
> 6--8Mb 4
> 12Mb 2
> 14Mb 1
> 16Mb 1
>
> ... incidentally, everything bigger than 1.2Mb lives^Wshambles under
> drivers/gpu/drm/amd/include/asic_reg/
>
> Page size Footprint
> 4Kb 1128Mb
> 8Kb 1324Mb
> 16Kb 1764Mb
> 32Kb 2739Mb
> 64Kb 4832Mb
> 128Kb 9191Mb
> 256Kb 18062Mb
> 512Kb 35883Mb
> 1Mb 71570Mb
> 2Mb 142958Mb
>
> So for kernel builds (as well as grep over the tree, etc.) uniform 2Mb pages
> would be... interesting.

Right, I don't see us getting rid of 4k cache entries anytime
soon. Even 32k pages would double the footprint here.

The issue is just that at the other end of the spectrum we have IO
devices that do 10GB/s, which corresponds to 2.6 million pages per
second. At such data rates we are currently CPU-limited because of the
pure transaction overhead in page reclaim. Workloads like this tend to
use much larger files, and would benefit from a larger paging unit.

Likewise, most production workloads in cloud servers have enormous
anonymous regions and large executables that greatly benefit from
fewer page table levels and bigger TLB entries.

Today, fragmentation prevents the page allocator from producing 2MB
blocks at a satisfactory rate and allocation latency. It's not
feasible to allocate 2M inside page faults for example; getting huge
page coverage for the page cache will be even more difficult.

I'm not saying we should get rid of 4k cache entries. Rather, I'm
wondering out loud whether longer-term we'd want to change the default
page size to 2M, and implement the 4k cache entries, which we clearly
continue to need, with a slab style allocator on top. The idea being
that it'll do a better job at grouping cache entries with other cache
entries of a similar lifetime than the untyped page allocator does
naturally, and so make fragmentation a whole lot more manageable.

(I'm using x86 page sizes as examples because they matter to me. But
there is an architecture independent discrepancy between the smallest
cache entries we must continue to support, and larger blocks / huge
pages that we increasingly rely on as first class pages.)