Managing memory in 4KiB pages is a serious overhead. Many benchmarks
benefit from a larger "page size". As an example, an earlier iteration
of this idea which used compound pages (and wasn't particularly tuned)
got a 7% performance boost when compiling the kernel.
Using compound pages or THPs exposes a weakness of our type system.
Functions are often unprepared for compound pages to be passed to them,
and may only act on PAGE_SIZE chunks. Even functions which are aware of
compound pages may expect a head page, and do the wrong thing if passed
a tail page.
We also waste a lot of instructions ensuring that we're not looking at
a tail page. Almost every call to PageFoo() contains one or more hidden
calls to compound_head(). This also happens for get_page(), put_page()
and many more functions.
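To illustrate the hidden work (a simplified sketch only; PageFoo_sketch() and
PG_foo are illustrative names, not real kernel symbols; the real macros live in
include/linux/page-flags.h), a page flag test on a possibly-tail page first has
to chase the pointer back to the head page:

        static inline int PageFoo_sketch(struct page *page)
        {
                /* tail pages store a pointer to their head page */
                page = compound_head(page);
                return test_bit(PG_foo, &page->flags);
        }

The same compound_head() call is hidden inside get_page() and trylock_page(),
as the conversions later in this series show.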
This patch series uses a new type, the struct folio, to manage memory.
It converts enough of the page cache, iomap and XFS to use folios instead
of pages, and then adds support for multi-page folios. It passes xfstests
(running on XFS) with no regressions compared to v5.14-rc1.
Git: https://git.infradead.org/users/willy/pagecache.git/shortlog/refs/tags/folio_14
v14:
- Defined folio_memcg() for !CONFIG_MEMCG builds
- Fixed typo in folio_activate() for !SMP builds
- Fixed missed conversion of folio_rmapping to folio_raw_mapping in KSM
- Fixed hugetlb's new dependency on copy_huge_page() by introducing
folio_copy()
- Changed the LRU page API to be entirely wrappers around the folio versions
- Removed page_lru() entirely
- Renamed folio_add_to_lru_list() -> lruvec_add_folio()
- Renamed folio_add_to_lru_list_tail() -> lruvec_add_folio_tail()
- Renamed folio_del_from_lru_list() -> lruvec_del_folio()
- Changed folio flag operations to be:
  - folio_test_foo()
  - folio_test_set_foo()
  - folio_test_clear_foo()
  - folio_set_foo()
  - folio_clear_foo()
  - __folio_set_foo()
  - __folio_clear_foo()
- Converted trace_mm_lru_activate() to take a folio
- Converted trace_wait_on_page_writeback() to trace_folio_wait_writeback()
- Converted trace_writeback_dirty_page() to trace_writeback_dirty_folio()
- Converted trace_mm_lru_insertion() to take a folio
- Renamed alloc_folio() -> folio_alloc()
- Renamed __alloc_folio() -> __folio_alloc()
- Renamed __alloc_folio_node() -> __folio_alloc_node()
Matthew Wilcox (Oracle) (138):
mm: Convert get_page_unless_zero() to return bool
mm: Introduce struct folio
mm: Add folio_pgdat(), folio_zone() and folio_zonenum()
mm/vmstat: Add functions to account folio statistics
mm/debug: Add VM_BUG_ON_FOLIO() and VM_WARN_ON_ONCE_FOLIO()
mm: Add folio reference count functions
mm: Add folio_put()
mm: Add folio_get()
mm: Add folio_try_get_rcu()
mm: Add folio flag manipulation functions
mm/lru: Add folio LRU functions
mm: Handle per-folio private data
mm/filemap: Add folio_index(), folio_file_page() and folio_contains()
mm/filemap: Add folio_next_index()
mm/filemap: Add folio_pos() and folio_file_pos()
mm/util: Add folio_mapping() and folio_file_mapping()
mm/filemap: Add folio_unlock()
mm/filemap: Add folio_lock()
mm/filemap: Add folio_lock_killable()
mm/filemap: Add __folio_lock_async()
mm/filemap: Add folio_wait_locked()
mm/filemap: Add __folio_lock_or_retry()
mm/swap: Add folio_rotate_reclaimable()
mm/filemap: Add folio_end_writeback()
mm/writeback: Add folio_wait_writeback()
mm/writeback: Add folio_wait_stable()
mm/filemap: Add folio_wait_bit()
mm/filemap: Add folio_wake_bit()
mm/filemap: Convert page wait queues to be folios
mm/filemap: Add folio private_2 functions
fs/netfs: Add folio fscache functions
mm: Add folio_mapped()
mm: Add folio_nid()
mm/memcg: Remove 'page' parameter to mem_cgroup_charge_statistics()
mm/memcg: Use the node id in mem_cgroup_update_tree()
mm/memcg: Remove soft_limit_tree_node()
mm/memcg: Convert memcg_check_events to take a node ID
mm/memcg: Add folio_memcg() and related functions
mm/memcg: Convert commit_charge() to take a folio
mm/memcg: Convert mem_cgroup_charge() to take a folio
mm/memcg: Convert uncharge_page() to uncharge_folio()
mm/memcg: Convert mem_cgroup_uncharge() to take a folio
mm/memcg: Convert mem_cgroup_migrate() to take folios
mm/memcg: Convert mem_cgroup_track_foreign_dirty_slowpath() to folio
mm/memcg: Add folio_memcg_lock() and folio_memcg_unlock()
mm/memcg: Convert mem_cgroup_move_account() to use a folio
mm/memcg: Add folio_lruvec()
mm/memcg: Add folio_lruvec_lock() and similar functions
mm/memcg: Add folio_lruvec_relock_irq() and folio_lruvec_relock_irqsave()
mm/workingset: Convert workingset_activation to take a folio
mm: Add folio_pfn()
mm: Add folio_raw_mapping()
mm: Add flush_dcache_folio()
mm: Add kmap_local_folio()
mm: Add arch_make_folio_accessible()
mm: Add folio_young and folio_idle
mm/swap: Add folio_activate()
mm/swap: Add folio_mark_accessed()
mm/rmap: Add folio_mkclean()
mm/migrate: Add folio_migrate_mapping()
mm/migrate: Add folio_migrate_flags()
mm/migrate: Add folio_migrate_copy()
mm/writeback: Rename __add_wb_stat() to wb_stat_mod()
flex_proportions: Allow N events instead of 1
mm/writeback: Change __wb_writeout_inc() to __wb_writeout_add()
mm/writeback: Add __folio_end_writeback()
mm/writeback: Add folio_start_writeback()
mm/writeback: Add folio_mark_dirty()
mm/writeback: Add __folio_mark_dirty()
mm/writeback: Convert tracing writeback_page_template to folios
mm/writeback: Add filemap_dirty_folio()
mm/writeback: Add folio_account_cleaned()
mm/writeback: Add folio_cancel_dirty()
mm/writeback: Add folio_clear_dirty_for_io()
mm/writeback: Add folio_account_redirty()
mm/writeback: Add folio_redirty_for_writepage()
mm/filemap: Add i_blocks_per_folio()
mm/filemap: Add folio_mkwrite_check_truncate()
mm/filemap: Add readahead_folio()
mm/workingset: Convert workingset_refault() to take a folio
mm: Add folio_evictable()
mm/lru: Convert __pagevec_lru_add_fn to take a folio
mm/lru: Add folio_add_lru()
mm/page_alloc: Add folio allocation functions
mm/filemap: Add filemap_alloc_folio
mm/filemap: Add filemap_add_folio()
mm/filemap: Convert mapping_get_entry to return a folio
mm/filemap: Add filemap_get_folio
mm/filemap: Add FGP_STABLE
block: Add bio_add_folio()
block: Add bio_for_each_folio_all()
iomap: Convert to_iomap_page to take a folio
iomap: Convert iomap_page_create to take a folio
iomap: Convert iomap_page_release to take a folio
iomap: Convert iomap_releasepage to use a folio
iomap: Convert iomap_invalidatepage to use a folio
iomap: Pass the iomap_page into iomap_set_range_uptodate
iomap: Use folio offsets instead of page offsets
iomap: Convert bio completions to use folios
iomap: Convert readahead and readpage to use a folio
iomap: Convert iomap_page_mkwrite to use a folio
iomap: Convert iomap_write_begin and iomap_write_end to folios
iomap: Convert iomap_read_inline_data to take a folio
iomap: Convert iomap_write_end_inline to take a folio
iomap: Convert iomap_add_to_ioend to take a folio
iomap: Convert iomap_do_writepage to use a folio
iomap: Convert iomap_migrate_page to use folios
mm/filemap: Convert page_cache_delete to take a folio
mm/filemap: Convert unaccount_page_cache_page to filemap_unaccount_folio
mm/filemap: Add filemap_remove_folio and __filemap_remove_folio
mm/filemap: Convert find_get_entry to return a folio
mm/filemap: Convert filemap_get_read_batch to use folios
mm/filemap: Convert find_get_pages_contig to folios
mm/filemap: Convert filemap_read_page to take a folio
mm/filemap: Convert filemap_create_page to folio
mm/filemap: Convert filemap_range_uptodate to folios
mm/filemap: Convert filemap_fault to folio
mm/filemap: Add read_cache_folio and read_mapping_folio
mm/filemap: Convert filemap_get_pages to use folios
mm/filemap: Convert page_cache_delete_batch to folios
mm/filemap: Remove PageHWPoison check from next_uptodate_page()
mm/filemap: Use folios in next_uptodate_page
mm/filemap: Use a folio in filemap_map_pages
fs: Convert vfs_dedupe_file_range_compare to folios
mm/truncate,shmem: Handle truncates that split THPs
mm/filemap: Return only head pages from find_get_entries
mm: Use multi-index entries in the page cache
iomap: Support multi-page folios in invalidatepage
xfs: Support THPs
mm/truncate: Convert invalidate_inode_pages2_range to folios
mm/truncate: Fix invalidate_complete_page2 for THPs
mm/vmscan: Free non-shmem THPs without splitting them
mm: Fix READ_ONLY_THP warning
mm: Support arbitrary THP sizes
mm/filemap: Allow multi-page folios to be added to the page cache
mm/vmscan: Optimise shrink_page_list for smaller THPs
mm/readahead: Convert page_cache_async_ra() to take a folio
mm/readahead: Add multi-page folio readahead
Documentation/core-api/cachetlb.rst | 6 +
Documentation/core-api/mm-api.rst | 4 +
Documentation/filesystems/netfs_library.rst | 2 +
arch/nds32/include/asm/cacheflush.h | 1 +
block/bio.c | 21 +
fs/afs/write.c | 9 +-
fs/cachefiles/rdwr.c | 16 +-
fs/io_uring.c | 2 +-
fs/iomap/buffered-io.c | 524 ++++----
fs/jfs/jfs_metapage.c | 1 +
fs/remap_range.c | 116 +-
fs/xfs/xfs_aops.c | 11 +-
fs/xfs/xfs_super.c | 3 +-
include/asm-generic/cacheflush.h | 6 +
include/linux/backing-dev.h | 6 +-
include/linux/bio.h | 46 +-
include/linux/flex_proportions.h | 9 +-
include/linux/gfp.h | 22 +-
include/linux/highmem-internal.h | 11 +
include/linux/highmem.h | 38 +
include/linux/huge_mm.h | 23 +-
include/linux/ksm.h | 4 +-
include/linux/memcontrol.h | 226 ++--
include/linux/migrate.h | 4 +
include/linux/mm.h | 268 +++-
include/linux/mm_inline.h | 98 +-
include/linux/mm_types.h | 77 ++
include/linux/mmdebug.h | 20 +
include/linux/netfs.h | 77 +-
include/linux/page-flags.h | 267 ++--
include/linux/page_idle.h | 99 +-
include/linux/page_owner.h | 8 +-
include/linux/page_ref.h | 158 ++-
include/linux/pagemap.h | 615 +++++----
include/linux/rmap.h | 10 +-
include/linux/swap.h | 17 +-
include/linux/vmstat.h | 107 ++
include/linux/writeback.h | 9 +-
include/trace/events/pagemap.h | 46 +-
include/trace/events/writeback.h | 28 +-
kernel/bpf/verifier.c | 2 +-
kernel/events/uprobes.c | 3 +-
lib/flex_proportions.c | 28 +-
mm/Makefile | 2 +-
mm/compaction.c | 4 +-
mm/filemap.c | 1285 +++++++++----------
mm/folio-compat.c | 147 +++
mm/huge_memory.c | 27 +-
mm/hugetlb.c | 2 +-
mm/internal.h | 40 +-
mm/khugepaged.c | 20 +-
mm/ksm.c | 34 +-
mm/memcontrol.c | 323 +++--
mm/memory-failure.c | 2 +-
mm/memory.c | 20 +-
mm/mempolicy.c | 10 +
mm/memremap.c | 2 +-
mm/migrate.c | 193 ++-
mm/mlock.c | 3 +-
mm/page-writeback.c | 447 ++++---
mm/page_alloc.c | 14 +-
mm/page_io.c | 4 +-
mm/page_owner.c | 10 +-
mm/readahead.c | 108 +-
mm/rmap.c | 14 +-
mm/shmem.c | 115 +-
mm/swap.c | 180 +--
mm/swap_state.c | 2 +-
mm/swapfile.c | 8 +-
mm/truncate.c | 193 +--
mm/userfaultfd.c | 2 +-
mm/util.c | 98 +-
mm/vmscan.c | 15 +-
mm/workingset.c | 44 +-
74 files changed, 3865 insertions(+), 2551 deletions(-)
create mode 100644 mm/folio-compat.c
--
2.30.2
atomic_add_unless() returns bool, so remove the widening casts to int
in page_ref_add_unless() and get_page_unless_zero(). This causes gcc
to produce slightly larger code in isolate_migratepages_block(), but
it's not clear that it's worse code. Net +19 bytes of text.
Signed-off-by: Matthew Wilcox (Oracle) <[email protected]>
Reviewed-by: Christoph Hellwig <[email protected]>
Acked-by: Kirill A. Shutemov <[email protected]>
---
include/linux/mm.h | 2 +-
include/linux/page_ref.h | 4 ++--
2 files changed, 3 insertions(+), 3 deletions(-)
diff --git a/include/linux/mm.h b/include/linux/mm.h
index 7ca22e6e694a..8dd65290bac0 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -755,7 +755,7 @@ static inline int put_page_testzero(struct page *page)
* This can be called when MMU is off so it must not access
* any of the virtual mappings.
*/
-static inline int get_page_unless_zero(struct page *page)
+static inline bool get_page_unless_zero(struct page *page)
{
return page_ref_add_unless(page, 1, 0);
}
diff --git a/include/linux/page_ref.h b/include/linux/page_ref.h
index 7ad46f45df39..3a799de8ad52 100644
--- a/include/linux/page_ref.h
+++ b/include/linux/page_ref.h
@@ -161,9 +161,9 @@ static inline int page_ref_dec_return(struct page *page)
return ret;
}
-static inline int page_ref_add_unless(struct page *page, int nr, int u)
+static inline bool page_ref_add_unless(struct page *page, int nr, int u)
{
- int ret = atomic_add_unless(&page->_refcount, nr, u);
+ bool ret = atomic_add_unless(&page->_refcount, nr, u);
if (page_ref_tracepoint_active(page_ref_mod_unless))
__page_ref_mod_unless(page, nr, ret);
--
2.30.2
These are just convenience wrappers for callers with folios; pgdat and
zone can be reached from tail pages as well as head pages. No change
to generated code.
Signed-off-by: Matthew Wilcox (Oracle) <[email protected]>
Reviewed-by: Zi Yan <[email protected]>
Reviewed-by: Christoph Hellwig <[email protected]>
Acked-by: Jeff Layton <[email protected]>
Acked-by: Kirill A. Shutemov <[email protected]>
Acked-by: Vlastimil Babka <[email protected]>
Reviewed-by: William Kucharski <[email protected]>
Reviewed-by: David Howells <[email protected]>
---
include/linux/mm.h | 15 +++++++++++++++
1 file changed, 15 insertions(+)
diff --git a/include/linux/mm.h b/include/linux/mm.h
index 5071084a71b9..0e14ac29a3e9 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -1144,6 +1144,11 @@ static inline enum zone_type page_zonenum(const struct page *page)
return (page->flags >> ZONES_PGSHIFT) & ZONES_MASK;
}
+static inline enum zone_type folio_zonenum(const struct folio *folio)
+{
+ return page_zonenum(&folio->page);
+}
+
#ifdef CONFIG_ZONE_DEVICE
static inline bool is_zone_device_page(const struct page *page)
{
@@ -1559,6 +1564,16 @@ static inline pg_data_t *page_pgdat(const struct page *page)
return NODE_DATA(page_to_nid(page));
}
+static inline struct zone *folio_zone(const struct folio *folio)
+{
+ return page_zone(&folio->page);
+}
+
+static inline pg_data_t *folio_pgdat(const struct folio *folio)
+{
+ return page_pgdat(&folio->page);
+}
+
#ifdef SECTION_IN_PAGE_FLAGS
static inline void set_page_section(struct page *page, unsigned long section)
{
--
2.30.2
A struct folio is a new abstraction to replace the venerable struct page.
A function which takes a struct folio argument declares that it will
operate on the entire (possibly compound) page, not just PAGE_SIZE bytes.
In return, the caller guarantees that the pointer it is passing does
not point to a tail page. No change to generated code.
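As a usage sketch (a hypothetical caller; it uses only the helpers added
below), a function which converts to a folio up front can then reason about
the whole allocation without worrying about tail pages:

        static void inspect_folio(struct page *page)    /* illustrative name */
        {
                struct folio *folio = page_folio(page); /* never a tail page */

                pr_debug("order=%u pages=%lu bytes=%zu\n",
                         folio_order(folio), folio_nr_pages(folio),
                         folio_size(folio));
        }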
Signed-off-by: Matthew Wilcox (Oracle) <[email protected]>
Acked-by: Jeff Layton <[email protected]>
Acked-by: Kirill A. Shutemov <[email protected]>
Acked-by: Vlastimil Babka <[email protected]>
Reviewed-by: William Kucharski <[email protected]>
Reviewed-by: Christoph Hellwig <[email protected]>
Reviewed-by: David Howells <[email protected]>
---
Documentation/core-api/mm-api.rst | 1 +
include/linux/mm.h | 74 +++++++++++++++++++++++++++++++
include/linux/mm_types.h | 60 +++++++++++++++++++++++++
include/linux/page-flags.h | 28 ++++++++++++
4 files changed, 163 insertions(+)
diff --git a/Documentation/core-api/mm-api.rst b/Documentation/core-api/mm-api.rst
index a42f9baddfbf..2a94e6164f80 100644
--- a/Documentation/core-api/mm-api.rst
+++ b/Documentation/core-api/mm-api.rst
@@ -95,6 +95,7 @@ More Memory Management Functions
.. kernel-doc:: mm/mempolicy.c
.. kernel-doc:: include/linux/mm_types.h
:internal:
+.. kernel-doc:: include/linux/page-flags.h
.. kernel-doc:: include/linux/mm.h
:internal:
.. kernel-doc:: include/linux/mmzone.h
diff --git a/include/linux/mm.h b/include/linux/mm.h
index 8dd65290bac0..5071084a71b9 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -949,6 +949,20 @@ static inline unsigned int compound_order(struct page *page)
return page[1].compound_order;
}
+/**
+ * folio_order - The allocation order of a folio.
+ * @folio: The folio.
+ *
+ * A folio is composed of 2^order pages. See get_order() for the definition
+ * of order.
+ *
+ * Return: The order of the folio.
+ */
+static inline unsigned int folio_order(struct folio *folio)
+{
+ return compound_order(&folio->page);
+}
+
static inline bool hpage_pincount_available(struct page *page)
{
/*
@@ -1594,6 +1608,65 @@ static inline void set_page_links(struct page *page, enum zone_type zone,
#endif
}
+/**
+ * folio_nr_pages - The number of pages in the folio.
+ * @folio: The folio.
+ *
+ * Return: A number which is a power of two.
+ */
+static inline unsigned long folio_nr_pages(struct folio *folio)
+{
+ return compound_nr(&folio->page);
+}
+
+/**
+ * folio_next - Move to the next physical folio.
+ * @folio: The folio we're currently operating on.
+ *
+ * If you have physically contiguous memory which may span more than
+ * one folio (eg a &struct bio_vec), use this function to move from one
+ * folio to the next. Do not use it if the memory is only virtually
+ * contiguous as the folios are almost certainly not adjacent to each
+ * other. This is the folio equivalent to writing ``page++``.
+ *
+ * Context: We assume that the folios are refcounted and/or locked at a
+ * higher level and do not adjust the reference counts.
+ * Return: The next struct folio.
+ */
+static inline struct folio *folio_next(struct folio *folio)
+{
+ return (struct folio *)folio_page(folio, folio_nr_pages(folio));
+}
+
+/**
+ * folio_shift - The number of bits covered by this folio.
+ * @folio: The folio.
+ *
+ * A folio contains a number of bytes which is a power-of-two in size.
+ * This function tells you which power-of-two the folio is.
+ *
+ * Context: The caller should have a reference on the folio to prevent
+ * it from being split. It is not necessary for the folio to be locked.
+ * Return: The base-2 logarithm of the size of this folio.
+ */
+static inline unsigned int folio_shift(struct folio *folio)
+{
+ return PAGE_SHIFT + folio_order(folio);
+}
+
+/**
+ * folio_size - The number of bytes in a folio.
+ * @folio: The folio.
+ *
+ * Context: The caller should have a reference on the folio to prevent
+ * it from being split. It is not necessary for the folio to be locked.
+ * Return: The number of bytes in this folio.
+ */
+static inline size_t folio_size(struct folio *folio)
+{
+ return PAGE_SIZE << folio_order(folio);
+}
+
/*
* Some inline functions in vmstat.h depend on page_zone()
*/
@@ -1699,6 +1772,7 @@ extern void pagefault_out_of_memory(void);
#define offset_in_page(p) ((unsigned long)(p) & ~PAGE_MASK)
#define offset_in_thp(page, p) ((unsigned long)(p) & (thp_size(page) - 1))
+#define offset_in_folio(folio, p) ((unsigned long)(p) & (folio_size(folio) - 1))
/*
* Flags passed to show_mem() and show_free_areas() to suppress output in
diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h
index 52bbd2b7cb46..f023eaa866fe 100644
--- a/include/linux/mm_types.h
+++ b/include/linux/mm_types.h
@@ -231,6 +231,66 @@ struct page {
#endif
} _struct_page_alignment;
+/**
+ * struct folio - Represents a contiguous set of bytes.
+ * @flags: Identical to the page flags.
+ * @lru: Least Recently Used list; tracks how recently this folio was used.
+ * @mapping: The file this page belongs to, or refers to the anon_vma for
+ * anonymous pages.
+ * @index: Offset within the file, in units of pages. For anonymous pages,
+ * this is the index from the beginning of the mmap.
+ * @private: Filesystem per-folio data (see folio_attach_private()).
+ * Used for swp_entry_t if folio_test_swapcache().
+ * @_mapcount: Do not access this member directly. Use folio_mapcount() to
+ * find out how many times this folio is mapped by userspace.
+ * @_refcount: Do not access this member directly. Use folio_ref_count()
+ * to find how many references there are to this folio.
+ * @memcg_data: Memory Control Group data.
+ *
+ * A folio is a physically, virtually and logically contiguous set
+ * of bytes. It is a power-of-two in size, and it is aligned to that
+ * same power-of-two. It is at least as large as %PAGE_SIZE. If it is
+ * in the page cache, it is at a file offset which is a multiple of that
+ * power-of-two. It may be mapped into userspace at an address which is
+ * at an arbitrary page offset, but its kernel virtual address is aligned
+ * to its size.
+ */
+struct folio {
+ /* private: don't document the anon union */
+ union {
+ struct {
+ /* public: */
+ unsigned long flags;
+ struct list_head lru;
+ struct address_space *mapping;
+ pgoff_t index;
+ void *private;
+ atomic_t _mapcount;
+ atomic_t _refcount;
+#ifdef CONFIG_MEMCG
+ unsigned long memcg_data;
+#endif
+ /* private: the union with struct page is transitional */
+ };
+ struct page page;
+ };
+};
+
+static_assert(sizeof(struct page) == sizeof(struct folio));
+#define FOLIO_MATCH(pg, fl) \
+ static_assert(offsetof(struct page, pg) == offsetof(struct folio, fl))
+FOLIO_MATCH(flags, flags);
+FOLIO_MATCH(lru, lru);
+FOLIO_MATCH(compound_head, lru);
+FOLIO_MATCH(index, index);
+FOLIO_MATCH(private, private);
+FOLIO_MATCH(_mapcount, _mapcount);
+FOLIO_MATCH(_refcount, _refcount);
+#ifdef CONFIG_MEMCG
+FOLIO_MATCH(memcg_data, memcg_data);
+#endif
+#undef FOLIO_MATCH
+
static inline atomic_t *compound_mapcount_ptr(struct page *page)
{
return &page[1].compound_mapcount;
diff --git a/include/linux/page-flags.h b/include/linux/page-flags.h
index 5922031ffab6..70ede8345538 100644
--- a/include/linux/page-flags.h
+++ b/include/linux/page-flags.h
@@ -191,6 +191,34 @@ static inline unsigned long _compound_head(const struct page *page)
#define compound_head(page) ((typeof(page))_compound_head(page))
+/**
+ * page_folio - Converts from page to folio.
+ * @p: The page.
+ *
+ * Every page is part of a folio. This function cannot be called on a
+ * NULL pointer.
+ *
+ * Context: No reference, nor lock is required on @page. If the caller
+ * does not hold a reference, this call may race with a folio split, so
+ * it should re-check the folio still contains this page after gaining
+ * a reference on the folio.
+ * Return: The folio which contains this page.
+ */
+#define page_folio(p) (_Generic((p), \
+ const struct page *: (const struct folio *)_compound_head(p), \
+ struct page *: (struct folio *)_compound_head(p)))
+
+/**
+ * folio_page - Return a page from a folio.
+ * @folio: The folio.
+ * @n: The page number to return.
+ *
+ * @n is relative to the start of the folio. This function does not
+ * check that the page number lies within @folio; the caller is presumed
+ * to have a reference to the page.
+ */
+#define folio_page(folio, n) nth_page(&(folio)->page, n)
+
static __always_inline int PageTail(struct page *page)
{
return READ_ONCE(page->compound_head) & 1;
--
2.30.2
Allow page counters to be more readily modified by callers which have
a folio. Name these wrappers with 'stat' instead of 'state' as requested
by Linus here:
https://lore.kernel.org/linux-mm/CAHk-=wj847SudR-kt+46fT3+xFFgiwpgThvm7DJWGdi4cVrbnQ@mail.gmail.com/
No change to generated code.
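For example, a hypothetical caller (the function name is illustrative;
NR_FILE_PAGES is an existing node_stat_item) can account all the pages of a
folio in a single call:

        static void account_folio_added(struct folio *folio)
        {
                /* adds folio_nr_pages(folio) to the node's counter */
                node_stat_add_folio(folio, NR_FILE_PAGES);
        }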
Signed-off-by: Matthew Wilcox (Oracle) <[email protected]>
Reviewed-by: Christoph Hellwig <[email protected]>
Acked-by: Jeff Layton <[email protected]>
Acked-by: Kirill A. Shutemov <[email protected]>
Acked-by: Vlastimil Babka <[email protected]>
Reviewed-by: William Kucharski <[email protected]>
Reviewed-by: David Howells <[email protected]>
---
include/linux/vmstat.h | 107 +++++++++++++++++++++++++++++++++++++++++
1 file changed, 107 insertions(+)
diff --git a/include/linux/vmstat.h b/include/linux/vmstat.h
index d6a6cf53b127..241bd0f53fb9 100644
--- a/include/linux/vmstat.h
+++ b/include/linux/vmstat.h
@@ -415,6 +415,78 @@ static inline void drain_zonestat(struct zone *zone,
struct per_cpu_zonestat *pzstats) { }
#endif /* CONFIG_SMP */
+static inline void __zone_stat_mod_folio(struct folio *folio,
+ enum zone_stat_item item, long nr)
+{
+ __mod_zone_page_state(folio_zone(folio), item, nr);
+}
+
+static inline void __zone_stat_add_folio(struct folio *folio,
+ enum zone_stat_item item)
+{
+ __mod_zone_page_state(folio_zone(folio), item, folio_nr_pages(folio));
+}
+
+static inline void __zone_stat_sub_folio(struct folio *folio,
+ enum zone_stat_item item)
+{
+ __mod_zone_page_state(folio_zone(folio), item, -folio_nr_pages(folio));
+}
+
+static inline void zone_stat_mod_folio(struct folio *folio,
+ enum zone_stat_item item, long nr)
+{
+ mod_zone_page_state(folio_zone(folio), item, nr);
+}
+
+static inline void zone_stat_add_folio(struct folio *folio,
+ enum zone_stat_item item)
+{
+ mod_zone_page_state(folio_zone(folio), item, folio_nr_pages(folio));
+}
+
+static inline void zone_stat_sub_folio(struct folio *folio,
+ enum zone_stat_item item)
+{
+ mod_zone_page_state(folio_zone(folio), item, -folio_nr_pages(folio));
+}
+
+static inline void __node_stat_mod_folio(struct folio *folio,
+ enum node_stat_item item, long nr)
+{
+ __mod_node_page_state(folio_pgdat(folio), item, nr);
+}
+
+static inline void __node_stat_add_folio(struct folio *folio,
+ enum node_stat_item item)
+{
+ __mod_node_page_state(folio_pgdat(folio), item, folio_nr_pages(folio));
+}
+
+static inline void __node_stat_sub_folio(struct folio *folio,
+ enum node_stat_item item)
+{
+ __mod_node_page_state(folio_pgdat(folio), item, -folio_nr_pages(folio));
+}
+
+static inline void node_stat_mod_folio(struct folio *folio,
+ enum node_stat_item item, long nr)
+{
+ mod_node_page_state(folio_pgdat(folio), item, nr);
+}
+
+static inline void node_stat_add_folio(struct folio *folio,
+ enum node_stat_item item)
+{
+ mod_node_page_state(folio_pgdat(folio), item, folio_nr_pages(folio));
+}
+
+static inline void node_stat_sub_folio(struct folio *folio,
+ enum node_stat_item item)
+{
+ mod_node_page_state(folio_pgdat(folio), item, -folio_nr_pages(folio));
+}
+
static inline void __mod_zone_freepage_state(struct zone *zone, int nr_pages,
int migratetype)
{
@@ -543,6 +615,24 @@ static inline void __dec_lruvec_page_state(struct page *page,
__mod_lruvec_page_state(page, idx, -1);
}
+static inline void __lruvec_stat_mod_folio(struct folio *folio,
+ enum node_stat_item idx, int val)
+{
+ __mod_lruvec_page_state(&folio->page, idx, val);
+}
+
+static inline void __lruvec_stat_add_folio(struct folio *folio,
+ enum node_stat_item idx)
+{
+ __lruvec_stat_mod_folio(folio, idx, folio_nr_pages(folio));
+}
+
+static inline void __lruvec_stat_sub_folio(struct folio *folio,
+ enum node_stat_item idx)
+{
+ __lruvec_stat_mod_folio(folio, idx, -folio_nr_pages(folio));
+}
+
static inline void inc_lruvec_page_state(struct page *page,
enum node_stat_item idx)
{
@@ -555,4 +645,21 @@ static inline void dec_lruvec_page_state(struct page *page,
mod_lruvec_page_state(page, idx, -1);
}
+static inline void lruvec_stat_mod_folio(struct folio *folio,
+ enum node_stat_item idx, int val)
+{
+ mod_lruvec_page_state(&folio->page, idx, val);
+}
+
+static inline void lruvec_stat_add_folio(struct folio *folio,
+ enum node_stat_item idx)
+{
+ lruvec_stat_mod_folio(folio, idx, folio_nr_pages(folio));
+}
+
+static inline void lruvec_stat_sub_folio(struct folio *folio,
+ enum node_stat_item idx)
+{
+ lruvec_stat_mod_folio(folio, idx, -folio_nr_pages(folio));
+}
#endif /* _LINUX_VMSTAT_H */
--
2.30.2
These functions mirror their page reference counterparts. Also add
the kernel-doc to the mm-api and correct the return type of
page_ref_add_unless() to bool. No change to generated code.
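As an example of how callers are expected to build on these, a hypothetical
helper mirroring get_page_unless_zero() from the first patch in this series
might look like:

        static inline bool folio_get_unless_zero_sketch(struct folio *folio)
        {
                /* take a reference unless the refcount is already zero */
                return folio_ref_add_unless(folio, 1, 0);
        }

(The real speculative-get interface, folio_try_get_rcu(), is added in a later
patch.)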
Signed-off-by: Matthew Wilcox (Oracle) <[email protected]>
Reviewed-by: Christoph Hellwig <[email protected]>
Acked-by: Jeff Layton <[email protected]>
Acked-by: Kirill A. Shutemov <[email protected]>
Acked-by: Vlastimil Babka <[email protected]>
Reviewed-by: William Kucharski <[email protected]>
Reviewed-by: David Howells <[email protected]>
---
Documentation/core-api/mm-api.rst | 1 +
include/linux/page_ref.h | 88 ++++++++++++++++++++++++++++++-
2 files changed, 88 insertions(+), 1 deletion(-)
diff --git a/Documentation/core-api/mm-api.rst b/Documentation/core-api/mm-api.rst
index 2a94e6164f80..5c459ee2acce 100644
--- a/Documentation/core-api/mm-api.rst
+++ b/Documentation/core-api/mm-api.rst
@@ -98,4 +98,5 @@ More Memory Management Functions
.. kernel-doc:: include/linux/page-flags.h
.. kernel-doc:: include/linux/mm.h
:internal:
+.. kernel-doc:: include/linux/page_ref.h
.. kernel-doc:: include/linux/mmzone.h
diff --git a/include/linux/page_ref.h b/include/linux/page_ref.h
index 3a799de8ad52..717d53c9ddf1 100644
--- a/include/linux/page_ref.h
+++ b/include/linux/page_ref.h
@@ -67,9 +67,31 @@ static inline int page_ref_count(const struct page *page)
return atomic_read(&page->_refcount);
}
+/**
+ * folio_ref_count - The reference count on this folio.
+ * @folio: The folio.
+ *
+ * The refcount is usually incremented by calls to folio_get() and
+ * decremented by calls to folio_put(). Some typical users of the
+ * folio refcount:
+ *
+ * - Each reference from a page table
+ * - The page cache
+ * - Filesystem private data
+ * - The LRU list
+ * - Pipes
+ * - Direct IO which references this page in the process address space
+ *
+ * Return: The number of references to this folio.
+ */
+static inline int folio_ref_count(const struct folio *folio)
+{
+ return page_ref_count(&folio->page);
+}
+
static inline int page_count(const struct page *page)
{
- return atomic_read(&compound_head(page)->_refcount);
+ return folio_ref_count(page_folio(page));
}
static inline void set_page_count(struct page *page, int v)
@@ -79,6 +101,11 @@ static inline void set_page_count(struct page *page, int v)
__page_ref_set(page, v);
}
+static inline void folio_set_count(struct folio *folio, int v)
+{
+ set_page_count(&folio->page, v);
+}
+
/*
* Setup the page count before being freed into the page allocator for
* the first time (boot or memory hotplug)
@@ -95,6 +122,11 @@ static inline void page_ref_add(struct page *page, int nr)
__page_ref_mod(page, nr);
}
+static inline void folio_ref_add(struct folio *folio, int nr)
+{
+ page_ref_add(&folio->page, nr);
+}
+
static inline void page_ref_sub(struct page *page, int nr)
{
atomic_sub(nr, &page->_refcount);
@@ -102,6 +134,11 @@ static inline void page_ref_sub(struct page *page, int nr)
__page_ref_mod(page, -nr);
}
+static inline void folio_ref_sub(struct folio *folio, int nr)
+{
+ page_ref_sub(&folio->page, nr);
+}
+
static inline int page_ref_sub_return(struct page *page, int nr)
{
int ret = atomic_sub_return(nr, &page->_refcount);
@@ -111,6 +148,11 @@ static inline int page_ref_sub_return(struct page *page, int nr)
return ret;
}
+static inline int folio_ref_sub_return(struct folio *folio, int nr)
+{
+ return page_ref_sub_return(&folio->page, nr);
+}
+
static inline void page_ref_inc(struct page *page)
{
atomic_inc(&page->_refcount);
@@ -118,6 +160,11 @@ static inline void page_ref_inc(struct page *page)
__page_ref_mod(page, 1);
}
+static inline void folio_ref_inc(struct folio *folio)
+{
+ page_ref_inc(&folio->page);
+}
+
static inline void page_ref_dec(struct page *page)
{
atomic_dec(&page->_refcount);
@@ -125,6 +172,11 @@ static inline void page_ref_dec(struct page *page)
__page_ref_mod(page, -1);
}
+static inline void folio_ref_dec(struct folio *folio)
+{
+ page_ref_dec(&folio->page);
+}
+
static inline int page_ref_sub_and_test(struct page *page, int nr)
{
int ret = atomic_sub_and_test(nr, &page->_refcount);
@@ -134,6 +186,11 @@ static inline int page_ref_sub_and_test(struct page *page, int nr)
return ret;
}
+static inline int folio_ref_sub_and_test(struct folio *folio, int nr)
+{
+ return page_ref_sub_and_test(&folio->page, nr);
+}
+
static inline int page_ref_inc_return(struct page *page)
{
int ret = atomic_inc_return(&page->_refcount);
@@ -143,6 +200,11 @@ static inline int page_ref_inc_return(struct page *page)
return ret;
}
+static inline int folio_ref_inc_return(struct folio *folio)
+{
+ return page_ref_inc_return(&folio->page);
+}
+
static inline int page_ref_dec_and_test(struct page *page)
{
int ret = atomic_dec_and_test(&page->_refcount);
@@ -152,6 +214,11 @@ static inline int page_ref_dec_and_test(struct page *page)
return ret;
}
+static inline int folio_ref_dec_and_test(struct folio *folio)
+{
+ return page_ref_dec_and_test(&folio->page);
+}
+
static inline int page_ref_dec_return(struct page *page)
{
int ret = atomic_dec_return(&page->_refcount);
@@ -161,6 +228,11 @@ static inline int page_ref_dec_return(struct page *page)
return ret;
}
+static inline int folio_ref_dec_return(struct folio *folio)
+{
+ return page_ref_dec_return(&folio->page);
+}
+
static inline bool page_ref_add_unless(struct page *page, int nr, int u)
{
bool ret = atomic_add_unless(&page->_refcount, nr, u);
@@ -170,6 +242,11 @@ static inline bool page_ref_add_unless(struct page *page, int nr, int u)
return ret;
}
+static inline bool folio_ref_add_unless(struct folio *folio, int nr, int u)
+{
+ return page_ref_add_unless(&folio->page, nr, u);
+}
+
static inline int page_ref_freeze(struct page *page, int count)
{
int ret = likely(atomic_cmpxchg(&page->_refcount, count, 0) == count);
@@ -179,6 +256,11 @@ static inline int page_ref_freeze(struct page *page, int count)
return ret;
}
+static inline int folio_ref_freeze(struct folio *folio, int count)
+{
+ return page_ref_freeze(&folio->page, count);
+}
+
static inline void page_ref_unfreeze(struct page *page, int count)
{
VM_BUG_ON_PAGE(page_count(page) != 0, page);
@@ -189,4 +271,8 @@ static inline void page_ref_unfreeze(struct page *page, int count)
__page_ref_unfreeze(page, count);
}
+static inline void folio_ref_unfreeze(struct folio *folio, int count)
+{
+ page_ref_unfreeze(&folio->page, count);
+}
#endif
--
2.30.2
If we know we have a folio, we can call folio_get() instead
of get_page() and save the overhead of calling compound_head().
No change to generated code.
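A before/after sketch for a hypothetical caller that already has a folio in
hand:

        /* before: get_page() re-derives the head page internally */
        get_page(&folio->page);

        /* after: no hidden compound_head() call */
        folio_get(folio);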
Signed-off-by: Matthew Wilcox (Oracle) <[email protected]>
Reviewed-by: Zi Yan <[email protected]>
Reviewed-by: Christoph Hellwig <[email protected]>
Acked-by: Jeff Layton <[email protected]>
Acked-by: Kirill A. Shutemov <[email protected]>
Acked-by: Vlastimil Babka <[email protected]>
Reviewed-by: William Kucharski <[email protected]>
Reviewed-by: David Howells <[email protected]>
---
include/linux/mm.h | 26 +++++++++++++++++---------
1 file changed, 17 insertions(+), 9 deletions(-)
diff --git a/include/linux/mm.h b/include/linux/mm.h
index 9eca5da04dec..788fbc4cde0c 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -1223,18 +1223,26 @@ static inline bool is_pci_p2pdma_page(const struct page *page)
}
/* 127: arbitrary random number, small enough to assemble well */
-#define page_ref_zero_or_close_to_overflow(page) \
- ((unsigned int) page_ref_count(page) + 127u <= 127u)
+#define folio_ref_zero_or_close_to_overflow(folio) \
+ ((unsigned int) folio_ref_count(folio) + 127u <= 127u)
+
+/**
+ * folio_get - Increment the reference count on a folio.
+ * @folio: The folio.
+ *
+ * Context: May be called in any context, as long as you know that
+ * you have a refcount on the folio. If you do not already have one,
+ * folio_try_get() may be the right interface for you to use.
+ */
+static inline void folio_get(struct folio *folio)
+{
+ VM_BUG_ON_FOLIO(folio_ref_zero_or_close_to_overflow(folio), folio);
+ folio_ref_inc(folio);
+}
static inline void get_page(struct page *page)
{
- page = compound_head(page);
- /*
- * Getting a normal page or the head of a compound page
- * requires to already have an elevated page->_refcount.
- */
- VM_BUG_ON_PAGE(page_ref_zero_or_close_to_overflow(page), page);
- page_ref_inc(page);
+ folio_get(page_folio(page));
}
bool __must_check try_grab_page(struct page *page, unsigned int flags);
--
2.30.2
These are just wrappers around page_offset() and page_file_offset()
respectively. No change to generated code.
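For example (a hypothetical caller; the variable names are illustrative), a
filesystem holding a folio can compute the byte range it covers in the file:

        loff_t pos = folio_pos(folio);   /* byte offset of the folio's start */
        size_t len = folio_size(folio);  /* number of bytes in the folio */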
Signed-off-by: Matthew Wilcox (Oracle) <[email protected]>
Reviewed-by: Christoph Hellwig <[email protected]>
Acked-by: Jeff Layton <[email protected]>
Acked-by: Kirill A. Shutemov <[email protected]>
Acked-by: Vlastimil Babka <[email protected]>
Reviewed-by: William Kucharski <[email protected]>
Reviewed-by: David Howells <[email protected]>
---
include/linux/pagemap.h | 21 +++++++++++++++++++++
1 file changed, 21 insertions(+)
diff --git a/include/linux/pagemap.h b/include/linux/pagemap.h
index bd0e7e91bfd4..aa71fa82d6be 100644
--- a/include/linux/pagemap.h
+++ b/include/linux/pagemap.h
@@ -562,6 +562,27 @@ static inline loff_t page_file_offset(struct page *page)
return ((loff_t)page_index(page)) << PAGE_SHIFT;
}
+/**
+ * folio_pos - Returns the byte position of this folio in its file.
+ * @folio: The folio.
+ */
+static inline loff_t folio_pos(struct folio *folio)
+{
+ return page_offset(&folio->page);
+}
+
+/**
+ * folio_file_pos - Returns the byte position of this folio in its file.
+ * @folio: The folio.
+ *
+ * This differs from folio_pos() for folios which belong to a swap file.
+ * NFS is the only filesystem today which needs to use folio_file_pos().
+ */
+static inline loff_t folio_file_pos(struct folio *folio)
+{
+ return page_file_offset(&folio->page);
+}
+
extern pgoff_t linear_hugepage_index(struct vm_area_struct *vma,
unsigned long address);
--
2.30.2
This is like lock_page() but for use by callers who know they have a folio.
Convert __lock_page() to be __folio_lock(). This saves one call to
compound_head() per contended call to lock_page().
Saves 455 bytes of text; mostly from improved register allocation and
inlining decisions. __folio_lock is 59 bytes while __lock_page was 79.
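A usage sketch for a hypothetical caller that already has a folio, using only
folio_lock() and folio_unlock() from this and the preceding patches:

        folio_lock(folio);      /* sleeps in __folio_lock() if contended */
        /* ... operate on the locked folio ... */
        folio_unlock(folio);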
Signed-off-by: Matthew Wilcox (Oracle) <[email protected]>
Reviewed-by: Christoph Hellwig <[email protected]>
Acked-by: Jeff Layton <[email protected]>
Acked-by: Kirill A. Shutemov <[email protected]>
Acked-by: Vlastimil Babka <[email protected]>
Reviewed-by: William Kucharski <[email protected]>
Reviewed-by: David Howells <[email protected]>
---
include/linux/pagemap.h | 24 +++++++++++++++++++-----
mm/filemap.c | 29 +++++++++++++++--------------
2 files changed, 34 insertions(+), 19 deletions(-)
diff --git a/include/linux/pagemap.h b/include/linux/pagemap.h
index a13edc7a2916..c3673c55125b 100644
--- a/include/linux/pagemap.h
+++ b/include/linux/pagemap.h
@@ -653,7 +653,7 @@ static inline bool wake_page_match(struct wait_page_queue *wait_page,
return true;
}
-extern void __lock_page(struct page *page);
+void __folio_lock(struct folio *folio);
extern int __lock_page_killable(struct page *page);
extern int __lock_page_async(struct page *page, struct wait_page_queue *wait);
extern int __lock_page_or_retry(struct page *page, struct mm_struct *mm,
@@ -661,13 +661,24 @@ extern int __lock_page_or_retry(struct page *page, struct mm_struct *mm,
void unlock_page(struct page *page);
void folio_unlock(struct folio *folio);
+static inline bool folio_trylock(struct folio *folio)
+{
+ return likely(!test_and_set_bit_lock(PG_locked, folio_flags(folio, 0)));
+}
+
/*
* Return true if the page was successfully locked
*/
static inline int trylock_page(struct page *page)
{
- page = compound_head(page);
- return (likely(!test_and_set_bit_lock(PG_locked, &page->flags)));
+ return folio_trylock(page_folio(page));
+}
+
+static inline void folio_lock(struct folio *folio)
+{
+ might_sleep();
+ if (!folio_trylock(folio))
+ __folio_lock(folio);
}
/*
@@ -675,9 +686,12 @@ static inline int trylock_page(struct page *page)
*/
static inline void lock_page(struct page *page)
{
+ struct folio *folio;
might_sleep();
- if (!trylock_page(page))
- __lock_page(page);
+
+ folio = page_folio(page);
+ if (!folio_trylock(folio))
+ __folio_lock(folio);
}
/*
diff --git a/mm/filemap.c b/mm/filemap.c
index 1af67ef94e4c..95f89656f126 100644
--- a/mm/filemap.c
+++ b/mm/filemap.c
@@ -1187,7 +1187,7 @@ static void wake_up_page(struct page *page, int bit)
*/
enum behavior {
EXCLUSIVE, /* Hold ref to page and take the bit when woken, like
- * __lock_page() waiting on then setting PG_locked.
+ * __folio_lock() waiting on then setting PG_locked.
*/
SHARED, /* Hold ref to page and check the bit when woken, like
* wait_on_page_writeback() waiting on PG_writeback.
@@ -1578,17 +1578,16 @@ void page_endio(struct page *page, bool is_write, int err)
EXPORT_SYMBOL_GPL(page_endio);
/**
- * __lock_page - get a lock on the page, assuming we need to sleep to get it
- * @__page: the page to lock
+ * __folio_lock - Get a lock on the folio, assuming we need to sleep to get it.
+ * @folio: The folio to lock
*/
-void __lock_page(struct page *__page)
+void __folio_lock(struct folio *folio)
{
- struct page *page = compound_head(__page);
- wait_queue_head_t *q = page_waitqueue(page);
- wait_on_page_bit_common(q, page, PG_locked, TASK_UNINTERRUPTIBLE,
+ wait_queue_head_t *q = page_waitqueue(&folio->page);
+ wait_on_page_bit_common(q, &folio->page, PG_locked, TASK_UNINTERRUPTIBLE,
EXCLUSIVE);
}
-EXPORT_SYMBOL(__lock_page);
+EXPORT_SYMBOL(__folio_lock);
int __lock_page_killable(struct page *__page)
{
@@ -1663,10 +1662,10 @@ int __lock_page_or_retry(struct page *page, struct mm_struct *mm,
return 0;
}
} else {
- __lock_page(page);
+ __folio_lock(page_folio(page));
}
- return 1;
+ return 1;
}
/**
@@ -2837,7 +2836,9 @@ loff_t mapping_seek_hole_data(struct address_space *mapping, loff_t start,
static int lock_page_maybe_drop_mmap(struct vm_fault *vmf, struct page *page,
struct file **fpin)
{
- if (trylock_page(page))
+ struct folio *folio = page_folio(page);
+
+ if (folio_trylock(folio))
return 1;
/*
@@ -2850,7 +2851,7 @@ static int lock_page_maybe_drop_mmap(struct vm_fault *vmf, struct page *page,
*fpin = maybe_unlock_mmap_for_io(vmf, *fpin);
if (vmf->flags & FAULT_FLAG_KILLABLE) {
- if (__lock_page_killable(page)) {
+ if (__lock_page_killable(&folio->page)) {
/*
* We didn't have the right flags to drop the mmap_lock,
* but all fault_handlers only check for fatal signals
@@ -2862,11 +2863,11 @@ static int lock_page_maybe_drop_mmap(struct vm_fault *vmf, struct page *page,
return 0;
}
} else
- __lock_page(page);
+ __folio_lock(folio);
+
return 1;
}
-
/*
* Synchronous readahead happens when we don't even find a page in the page
* cache at all. We don't want to perform IO under the mmap sem, so if we have
--
2.30.2
There aren't any actual callers of lock_page_async(), so remove it.
Convert filemap_update_page() to call __folio_lock_async().
__folio_lock_async() is 21 bytes smaller than __lock_page_async(),
but the real savings come from using a folio in filemap_update_page(),
shrinking it from 515 bytes to 404 bytes, saving 111 bytes. The text
shrinks by 132 bytes in total.
Signed-off-by: Matthew Wilcox (Oracle) <[email protected]>
Reviewed-by: Christoph Hellwig <[email protected]>
Acked-by: Jeff Layton <[email protected]>
Acked-by: Kirill A. Shutemov <[email protected]>
Acked-by: Vlastimil Babka <[email protected]>
Reviewed-by: William Kucharski <[email protected]>
Reviewed-by: David Howells <[email protected]>
---
fs/io_uring.c | 2 +-
include/linux/pagemap.h | 17 -----------------
mm/filemap.c | 31 ++++++++++++++++---------------
3 files changed, 17 insertions(+), 33 deletions(-)
diff --git a/fs/io_uring.c b/fs/io_uring.c
index d94fb5835a20..7e30c7c361e6 100644
--- a/fs/io_uring.c
+++ b/fs/io_uring.c
@@ -3149,7 +3149,7 @@ static int io_read_prep(struct io_kiocb *req, const struct io_uring_sqe *sqe)
}
/*
- * This is our waitqueue callback handler, registered through lock_page_async()
+ * This is our waitqueue callback handler, registered through __folio_lock_async()
* when we initially tried to do the IO with the iocb armed our waitqueue.
* This gets called when the page is unlocked, and we generally expect that to
* happen when the page IO is completed and the page is now uptodate. This will
diff --git a/include/linux/pagemap.h b/include/linux/pagemap.h
index 88727c74e059..6f631a3e42dc 100644
--- a/include/linux/pagemap.h
+++ b/include/linux/pagemap.h
@@ -655,7 +655,6 @@ static inline bool wake_page_match(struct wait_page_queue *wait_page,
void __folio_lock(struct folio *folio);
int __folio_lock_killable(struct folio *folio);
-extern int __lock_page_async(struct page *page, struct wait_page_queue *wait);
extern int __lock_page_or_retry(struct page *page, struct mm_struct *mm,
unsigned int flags);
void unlock_page(struct page *page);
@@ -712,22 +711,6 @@ static inline int lock_page_killable(struct page *page)
return folio_lock_killable(page_folio(page));
}
-/*
- * lock_page_async - Lock the page, unless this would block. If the page
- * is already locked, then queue a callback when the page becomes unlocked.
- * This callback can then retry the operation.
- *
- * Returns 0 if the page is locked successfully, or -EIOCBQUEUED if the page
- * was already locked and the callback defined in 'wait' was queued.
- */
-static inline int lock_page_async(struct page *page,
- struct wait_page_queue *wait)
-{
- if (!trylock_page(page))
- return __lock_page_async(page, wait);
- return 0;
-}
-
/*
* lock_page_or_retry - Lock the page, unless this would block and the
* caller indicated that it can handle a retry.
diff --git a/mm/filemap.c b/mm/filemap.c
index 962db5c38cd7..c97b804811fc 100644
--- a/mm/filemap.c
+++ b/mm/filemap.c
@@ -1597,18 +1597,18 @@ int __folio_lock_killable(struct folio *folio)
}
EXPORT_SYMBOL_GPL(__folio_lock_killable);
-int __lock_page_async(struct page *page, struct wait_page_queue *wait)
+static int __folio_lock_async(struct folio *folio, struct wait_page_queue *wait)
{
- struct wait_queue_head *q = page_waitqueue(page);
+ struct wait_queue_head *q = page_waitqueue(&folio->page);
int ret = 0;
- wait->page = page;
+ wait->page = &folio->page;
wait->bit_nr = PG_locked;
spin_lock_irq(&q->lock);
__add_wait_queue_entry_tail(q, &wait->wait);
- SetPageWaiters(page);
- ret = !trylock_page(page);
+ folio_set_waiters(folio);
+ ret = !folio_trylock(folio);
/*
* If we were successful now, we know we're still on the
* waitqueue as we're still under the lock. This means it's
@@ -2381,41 +2381,42 @@ static int filemap_update_page(struct kiocb *iocb,
struct address_space *mapping, struct iov_iter *iter,
struct page *page)
{
+ struct folio *folio = page_folio(page);
int error;
- if (!trylock_page(page)) {
+ if (!folio_trylock(folio)) {
if (iocb->ki_flags & (IOCB_NOWAIT | IOCB_NOIO))
return -EAGAIN;
if (!(iocb->ki_flags & IOCB_WAITQ)) {
- put_and_wait_on_page_locked(page, TASK_KILLABLE);
+ put_and_wait_on_page_locked(&folio->page, TASK_KILLABLE);
return AOP_TRUNCATED_PAGE;
}
- error = __lock_page_async(page, iocb->ki_waitq);
+ error = __folio_lock_async(folio, iocb->ki_waitq);
if (error)
return error;
}
- if (!page->mapping)
+ if (!folio->mapping)
goto truncated;
error = 0;
- if (filemap_range_uptodate(mapping, iocb->ki_pos, iter, page))
+ if (filemap_range_uptodate(mapping, iocb->ki_pos, iter, &folio->page))
goto unlock;
error = -EAGAIN;
if (iocb->ki_flags & (IOCB_NOIO | IOCB_NOWAIT | IOCB_WAITQ))
goto unlock;
- error = filemap_read_page(iocb->ki_filp, mapping, page);
+ error = filemap_read_page(iocb->ki_filp, mapping, &folio->page);
if (error == AOP_TRUNCATED_PAGE)
- put_page(page);
+ folio_put(folio);
return error;
truncated:
- unlock_page(page);
- put_page(page);
+ folio_unlock(folio);
+ folio_put(folio);
return AOP_TRUNCATED_PAGE;
unlock:
- unlock_page(page);
+ folio_unlock(folio);
return error;
}
--
2.30.2
Add an end_page_writeback() wrapper function for users that are not yet
converted to folios.
folio_end_writeback() is less than half the size of end_page_writeback()
at just 105 bytes compared to 228 bytes, due to removing all the
compound_head() calls. The 30 byte wrapper function makes this a net
saving of 93 bytes.
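A sketch of a hypothetical I/O completion path for a filesystem that writes
back whole folios (the function name is illustrative):

        static void my_writeback_end(struct folio *folio)
        {
                /* clears PG_writeback and wakes any waiters on the folio */
                folio_end_writeback(folio);
        }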
Signed-off-by: Matthew Wilcox (Oracle) <[email protected]>
Reviewed-by: Christoph Hellwig <[email protected]>
Acked-by: Jeff Layton <[email protected]>
Acked-by: Kirill A. Shutemov <[email protected]>
Acked-by: Vlastimil Babka <[email protected]>
Reviewed-by: William Kucharski <[email protected]>
Reviewed-by: David Howells <[email protected]>
---
include/linux/pagemap.h | 3 ++-
mm/filemap.c | 43 ++++++++++++++++++++---------------------
mm/folio-compat.c | 6 ++++++
3 files changed, 29 insertions(+), 23 deletions(-)
diff --git a/include/linux/pagemap.h b/include/linux/pagemap.h
index 626dbccbfb90..66a019178550 100644
--- a/include/linux/pagemap.h
+++ b/include/linux/pagemap.h
@@ -768,7 +768,8 @@ static inline int wait_on_page_locked_killable(struct page *page)
int put_and_wait_on_page_locked(struct page *page, int state);
void wait_on_page_writeback(struct page *page);
int wait_on_page_writeback_killable(struct page *page);
-extern void end_page_writeback(struct page *page);
+void end_page_writeback(struct page *page);
+void folio_end_writeback(struct folio *folio);
void wait_for_stable_page(struct page *page);
void __set_page_dirty(struct page *, struct address_space *, int warn);
diff --git a/mm/filemap.c b/mm/filemap.c
index 4ce2b22b64f8..b5a0d546e436 100644
--- a/mm/filemap.c
+++ b/mm/filemap.c
@@ -1175,11 +1175,11 @@ static void wake_up_page_bit(struct page *page, int bit_nr)
spin_unlock_irqrestore(&q->lock, flags);
}
-static void wake_up_page(struct page *page, int bit)
+static void folio_wake(struct folio *folio, int bit)
{
- if (!PageWaiters(page))
+ if (!folio_test_waiters(folio))
return;
- wake_up_page_bit(page, bit);
+ wake_up_page_bit(&folio->page, bit);
}
/*
@@ -1516,39 +1516,38 @@ int wait_on_page_private_2_killable(struct page *page)
EXPORT_SYMBOL(wait_on_page_private_2_killable);
/**
- * end_page_writeback - end writeback against a page
- * @page: the page
+ * folio_end_writeback - End writeback against a folio.
+ * @folio: The folio.
*/
-void end_page_writeback(struct page *page)
+void folio_end_writeback(struct folio *folio)
{
/*
- * TestClearPageReclaim could be used here but it is an atomic
- * operation and overkill in this particular case. Failing to
- * shuffle a page marked for immediate reclaim is too mild to
- * justify taking an atomic operation penalty at the end of
- * ever page writeback.
+ * folio_test_clear_reclaim() could be used here but it is an
+ * atomic operation and overkill in this particular case. Failing
+ * to shuffle a folio marked for immediate reclaim is too mild
+ * a gain to justify taking an atomic operation penalty at the
+ * end of every folio writeback.
*/
- if (PageReclaim(page)) {
- struct folio *folio = page_folio(page);
- ClearPageReclaim(page);
+ if (folio_test_reclaim(folio)) {
+ folio_clear_reclaim(folio);
folio_rotate_reclaimable(folio);
}
/*
- * Writeback does not hold a page reference of its own, relying
+ * Writeback does not hold a folio reference of its own, relying
* on truncation to wait for the clearing of PG_writeback.
- * But here we must make sure that the page is not freed and
- * reused before the wake_up_page().
+ * But here we must make sure that the folio is not freed and
+ * reused before the folio_wake().
*/
- get_page(page);
- if (!test_clear_page_writeback(page))
+ folio_get(folio);
+ if (!test_clear_page_writeback(&folio->page))
BUG();
smp_mb__after_atomic();
- wake_up_page(page, PG_writeback);
- put_page(page);
+ folio_wake(folio, PG_writeback);
+ folio_put(folio);
}
-EXPORT_SYMBOL(end_page_writeback);
+EXPORT_SYMBOL(folio_end_writeback);
/*
* After completing I/O on a page, call this routine to update the page
diff --git a/mm/folio-compat.c b/mm/folio-compat.c
index 91b3d00a92f7..526843d03d58 100644
--- a/mm/folio-compat.c
+++ b/mm/folio-compat.c
@@ -17,3 +17,9 @@ void unlock_page(struct page *page)
return folio_unlock(page_folio(page));
}
EXPORT_SYMBOL(unlock_page);
+
+void end_page_writeback(struct page *page)
+{
+ return folio_end_writeback(page_folio(page));
+}
+EXPORT_SYMBOL(end_page_writeback);
--
2.30.2
By using the node id in mem_cgroup_update_tree(), we can delete
soft_limit_tree_from_page() and mem_cgroup_page_nodeinfo(). Saves 42
bytes of kernel text on my config.
Signed-off-by: Matthew Wilcox (Oracle) <[email protected]>
Acked-by: Michal Hocko <[email protected]>
Acked-by: Johannes Weiner <[email protected]>
Reviewed-by: Christoph Hellwig <[email protected]>
---
mm/memcontrol.c | 24 ++++--------------------
1 file changed, 4 insertions(+), 20 deletions(-)
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index ee892daecb8b..d57ff5c5d330 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -451,28 +451,12 @@ ino_t page_cgroup_ino(struct page *page)
return ino;
}
-static struct mem_cgroup_per_node *
-mem_cgroup_page_nodeinfo(struct mem_cgroup *memcg, struct page *page)
-{
- int nid = page_to_nid(page);
-
- return memcg->nodeinfo[nid];
-}
-
static struct mem_cgroup_tree_per_node *
soft_limit_tree_node(int nid)
{
return soft_limit_tree.rb_tree_per_node[nid];
}
-static struct mem_cgroup_tree_per_node *
-soft_limit_tree_from_page(struct page *page)
-{
- int nid = page_to_nid(page);
-
- return soft_limit_tree.rb_tree_per_node[nid];
-}
-
static void __mem_cgroup_insert_exceeded(struct mem_cgroup_per_node *mz,
struct mem_cgroup_tree_per_node *mctz,
unsigned long new_usage_in_excess)
@@ -543,13 +527,13 @@ static unsigned long soft_limit_excess(struct mem_cgroup *memcg)
return excess;
}
-static void mem_cgroup_update_tree(struct mem_cgroup *memcg, struct page *page)
+static void mem_cgroup_update_tree(struct mem_cgroup *memcg, int nid)
{
unsigned long excess;
struct mem_cgroup_per_node *mz;
struct mem_cgroup_tree_per_node *mctz;
- mctz = soft_limit_tree_from_page(page);
+ mctz = soft_limit_tree_node(nid);
if (!mctz)
return;
/*
@@ -557,7 +541,7 @@ static void mem_cgroup_update_tree(struct mem_cgroup *memcg, struct page *page)
* because their event counter is not touched.
*/
for (; memcg; memcg = parent_mem_cgroup(memcg)) {
- mz = mem_cgroup_page_nodeinfo(memcg, page);
+ mz = memcg->nodeinfo[nid];
excess = soft_limit_excess(memcg);
/*
* We have to update the tree if mz is on RB-tree or
@@ -884,7 +868,7 @@ static void memcg_check_events(struct mem_cgroup *memcg, struct page *page)
MEM_CGROUP_TARGET_SOFTLIMIT);
mem_cgroup_threshold(memcg);
if (unlikely(do_softlimit))
- mem_cgroup_update_tree(memcg, page);
+ mem_cgroup_update_tree(memcg, page_to_nid(page));
}
}
--
2.30.2
memcg information is only stored in the head page, so the memcg
subsystem needs to assure that all accesses are to the head page.
The first step is converting page_memcg() to folio_memcg().
The callers of page_memcg() and PageMemcgKmem() are not yet ready to be
converted to use folios, so retain them as wrappers around folio_memcg()
and folio_memcg_kmem(). They will be converted in a later patch set.
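A hypothetical caller sketch (the function name is illustrative; the binding
stability rules are spelled out in the kernel-doc below):

        static bool folio_charged_sketch(struct folio *folio)
        {
                /* e.g. hold the folio lock so the memcg binding is stable */
                return folio_memcg(folio) != NULL;
        }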
Signed-off-by: Matthew Wilcox (Oracle) <[email protected]>
Reviewed-by: Christoph Hellwig <[email protected]>
---
include/linux/memcontrol.h | 109 ++++++++++++++++++++++---------------
mm/memcontrol.c | 21 ++++---
2 files changed, 77 insertions(+), 53 deletions(-)
diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
index bfe5c486f4ad..eabae5874161 100644
--- a/include/linux/memcontrol.h
+++ b/include/linux/memcontrol.h
@@ -372,6 +372,7 @@ enum page_memcg_data_flags {
#define MEMCG_DATA_FLAGS_MASK (__NR_MEMCG_DATA_FLAGS - 1)
static inline bool PageMemcgKmem(struct page *page);
+static inline bool folio_memcg_kmem(struct folio *folio);
/*
* After the initialization objcg->memcg is always pointing at
@@ -386,73 +387,77 @@ static inline struct mem_cgroup *obj_cgroup_memcg(struct obj_cgroup *objcg)
}
/*
- * __page_memcg - get the memory cgroup associated with a non-kmem page
- * @page: a pointer to the page struct
+ * __folio_memcg - Get the memory cgroup associated with a non-kmem folio
+ * @folio: Pointer to the folio.
*
- * Returns a pointer to the memory cgroup associated with the page,
- * or NULL. This function assumes that the page is known to have a
+ * Returns a pointer to the memory cgroup associated with the folio,
+ * or NULL. This function assumes that the folio is known to have a
* proper memory cgroup pointer. It's not safe to call this function
- * against some type of pages, e.g. slab pages or ex-slab pages or
- * kmem pages.
+ * against some type of folios, e.g. slab folios or ex-slab folios or
+ * kmem folios.
*/
-static inline struct mem_cgroup *__page_memcg(struct page *page)
+static inline struct mem_cgroup *__folio_memcg(struct folio *folio)
{
- unsigned long memcg_data = page->memcg_data;
+ unsigned long memcg_data = folio->memcg_data;
- VM_BUG_ON_PAGE(PageSlab(page), page);
- VM_BUG_ON_PAGE(memcg_data & MEMCG_DATA_OBJCGS, page);
- VM_BUG_ON_PAGE(memcg_data & MEMCG_DATA_KMEM, page);
+ VM_BUG_ON_FOLIO(folio_test_slab(folio), folio);
+ VM_BUG_ON_FOLIO(memcg_data & MEMCG_DATA_OBJCGS, folio);
+ VM_BUG_ON_FOLIO(memcg_data & MEMCG_DATA_KMEM, folio);
return (struct mem_cgroup *)(memcg_data & ~MEMCG_DATA_FLAGS_MASK);
}
/*
- * __page_objcg - get the object cgroup associated with a kmem page
- * @page: a pointer to the page struct
+ * __folio_objcg - get the object cgroup associated with a kmem folio.
+ * @folio: Pointer to the folio.
*
- * Returns a pointer to the object cgroup associated with the page,
- * or NULL. This function assumes that the page is known to have a
+ * Returns a pointer to the object cgroup associated with the folio,
+ * or NULL. This function assumes that the folio is known to have a
* proper object cgroup pointer. It's not safe to call this function
- * against some type of pages, e.g. slab pages or ex-slab pages or
- * LRU pages.
+ * against some type of folios, e.g. slab folios or ex-slab folios or
+ * LRU folios.
*/
-static inline struct obj_cgroup *__page_objcg(struct page *page)
+static inline struct obj_cgroup *__folio_objcg(struct folio *folio)
{
- unsigned long memcg_data = page->memcg_data;
+ unsigned long memcg_data = folio->memcg_data;
- VM_BUG_ON_PAGE(PageSlab(page), page);
- VM_BUG_ON_PAGE(memcg_data & MEMCG_DATA_OBJCGS, page);
- VM_BUG_ON_PAGE(!(memcg_data & MEMCG_DATA_KMEM), page);
+ VM_BUG_ON_FOLIO(folio_test_slab(folio), folio);
+ VM_BUG_ON_FOLIO(memcg_data & MEMCG_DATA_OBJCGS, folio);
+ VM_BUG_ON_FOLIO(!(memcg_data & MEMCG_DATA_KMEM), folio);
return (struct obj_cgroup *)(memcg_data & ~MEMCG_DATA_FLAGS_MASK);
}
/*
- * page_memcg - get the memory cgroup associated with a page
- * @page: a pointer to the page struct
+ * folio_memcg - Get the memory cgroup associated with a folio.
+ * @folio: Pointer to the folio.
*
- * Returns a pointer to the memory cgroup associated with the page,
- * or NULL. This function assumes that the page is known to have a
+ * Returns a pointer to the memory cgroup associated with the folio,
+ * or NULL. This function assumes that the folio is known to have a
* proper memory cgroup pointer. It's not safe to call this function
- * against some type of pages, e.g. slab pages or ex-slab pages.
+ * against some type of folios, e.g. slab folios or ex-slab folios.
*
- * For a non-kmem page any of the following ensures page and memcg binding
+ * For a non-kmem folio any of the following ensures folio and memcg binding
* stability:
*
- * - the page lock
+ * - the folio lock
* - LRU isolation
* - lock_page_memcg()
* - exclusive reference
*
- * For a kmem page a caller should hold an rcu read lock to protect memcg
- * associated with a kmem page from being released.
+ * For a kmem folio a caller should hold an rcu read lock to protect memcg
+ * associated with a kmem folio from being released.
*/
+static inline struct mem_cgroup *folio_memcg(struct folio *folio)
+{
+ if (folio_memcg_kmem(folio))
+ return obj_cgroup_memcg(__folio_objcg(folio));
+ return __folio_memcg(folio);
+}
+
static inline struct mem_cgroup *page_memcg(struct page *page)
{
- if (PageMemcgKmem(page))
- return obj_cgroup_memcg(__page_objcg(page));
- else
- return __page_memcg(page);
+ return folio_memcg(page_folio(page));
}
/*
@@ -525,17 +530,18 @@ static inline struct mem_cgroup *page_memcg_check(struct page *page)
#ifdef CONFIG_MEMCG_KMEM
/*
- * PageMemcgKmem - check if the page has MemcgKmem flag set
- * @page: a pointer to the page struct
+ * folio_memcg_kmem - Check if the folio has the memcg_kmem flag set.
+ * @folio: Pointer to the folio.
*
- * Checks if the page has MemcgKmem flag set. The caller must ensure that
- * the page has an associated memory cgroup. It's not safe to call this function
- * against some types of pages, e.g. slab pages.
+ * Checks if the folio has MemcgKmem flag set. The caller must ensure
+ * that the folio has an associated memory cgroup. It's not safe to call
+ * this function against some types of folios, e.g. slab folios.
*/
-static inline bool PageMemcgKmem(struct page *page)
+static inline bool folio_memcg_kmem(struct folio *folio)
{
- VM_BUG_ON_PAGE(page->memcg_data & MEMCG_DATA_OBJCGS, page);
- return page->memcg_data & MEMCG_DATA_KMEM;
+ VM_BUG_ON_PGFLAGS(PageTail(&folio->page), &folio->page);
+ VM_BUG_ON_FOLIO(folio->memcg_data & MEMCG_DATA_OBJCGS, folio);
+ return folio->memcg_data & MEMCG_DATA_KMEM;
}
/*
@@ -579,7 +585,7 @@ static inline struct obj_cgroup **page_objcgs_check(struct page *page)
}
#else
-static inline bool PageMemcgKmem(struct page *page)
+static inline bool folio_memcg_kmem(struct folio *folio)
{
return false;
}
@@ -595,6 +601,11 @@ static inline struct obj_cgroup **page_objcgs_check(struct page *page)
}
#endif
+static inline bool PageMemcgKmem(struct page *page)
+{
+ return folio_memcg_kmem(page_folio(page));
+}
+
static __always_inline bool memcg_stat_item_in_bytes(int idx)
{
if (idx == MEMCG_PERCPU_B)
@@ -1106,6 +1117,11 @@ unsigned long mem_cgroup_soft_limit_reclaim(pg_data_t *pgdat, int order,
#define MEM_CGROUP_ID_SHIFT 0
#define MEM_CGROUP_ID_MAX 0
+static inline struct mem_cgroup *folio_memcg(struct folio *folio)
+{
+ return NULL;
+}
+
static inline struct mem_cgroup *page_memcg(struct page *page)
{
return NULL;
@@ -1122,6 +1138,11 @@ static inline struct mem_cgroup *page_memcg_check(struct page *page)
return NULL;
}
+static inline bool folio_memcg_kmem(struct folio *folio)
+{
+ return false;
+}
+
static inline bool PageMemcgKmem(struct page *page)
{
return false;
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 1a049bfa0e0a..f0f781dde37a 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -3050,15 +3050,16 @@ int __memcg_kmem_charge_page(struct page *page, gfp_t gfp, int order)
*/
void __memcg_kmem_uncharge_page(struct page *page, int order)
{
+ struct folio *folio = page_folio(page);
struct obj_cgroup *objcg;
unsigned int nr_pages = 1 << order;
- if (!PageMemcgKmem(page))
+ if (!folio_memcg_kmem(folio))
return;
- objcg = __page_objcg(page);
+ objcg = __folio_objcg(folio);
obj_cgroup_uncharge_pages(objcg, nr_pages);
- page->memcg_data = 0;
+ folio->memcg_data = 0;
obj_cgroup_put(objcg);
}
@@ -3290,17 +3291,18 @@ void obj_cgroup_uncharge(struct obj_cgroup *objcg, size_t size)
*/
void split_page_memcg(struct page *head, unsigned int nr)
{
- struct mem_cgroup *memcg = page_memcg(head);
+ struct folio *folio = page_folio(head);
+ struct mem_cgroup *memcg = folio_memcg(folio);
int i;
if (mem_cgroup_disabled() || !memcg)
return;
for (i = 1; i < nr; i++)
- head[i].memcg_data = head->memcg_data;
+ folio_page(folio, i)->memcg_data = folio->memcg_data;
- if (PageMemcgKmem(head))
- obj_cgroup_get_many(__page_objcg(head), nr - 1);
+ if (folio_memcg_kmem(folio))
+ obj_cgroup_get_many(__folio_objcg(folio), nr - 1);
else
css_get_many(&memcg->css, nr - 1);
}
@@ -6835,6 +6837,7 @@ static void uncharge_batch(const struct uncharge_gather *ug)
static void uncharge_page(struct page *page, struct uncharge_gather *ug)
{
+ struct folio *folio = page_folio(page);
unsigned long nr_pages;
struct mem_cgroup *memcg;
struct obj_cgroup *objcg;
@@ -6848,14 +6851,14 @@ static void uncharge_page(struct page *page, struct uncharge_gather *ug)
* exclusive access to the page.
*/
if (use_objcg) {
- objcg = __page_objcg(page);
+ objcg = __folio_objcg(folio);
/*
* This get matches the put at the end of the function and
* kmem pages do not hold memcg references anymore.
*/
memcg = get_mem_cgroup_from_objcg(objcg);
} else {
- memcg = __page_memcg(page);
+ memcg = __folio_memcg(folio);
}
if (!memcg)
--
2.30.2
Convert all callers of mem_cgroup_charge() to call page_folio() on the
page they're currently passing in. Many of them will be converted to
use folios themselves soon.
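To illustrate the end state this is working towards (a sketch, not part of
this patch): once a caller has itself been converted to folios, the
page_folio() call at the charge site disappears entirely. The helper below
is hypothetical.

    #include <linux/gfp.h>
    #include <linux/memcontrol.h>

    /* Hypothetical fully-converted caller: the folio is passed straight through. */
    static int charge_new_folio(struct folio *folio, struct mm_struct *mm)
    {
            return mem_cgroup_charge(folio, mm, GFP_KERNEL);
    }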
Signed-off-by: Matthew Wilcox (Oracle) <[email protected]>
Reviewed-by: Christoph Hellwig <[email protected]>
---
include/linux/memcontrol.h | 6 +++---
kernel/events/uprobes.c | 3 ++-
mm/filemap.c | 2 +-
mm/huge_memory.c | 2 +-
mm/khugepaged.c | 4 ++--
mm/ksm.c | 3 ++-
mm/memcontrol.c | 26 +++++++++++++-------------
mm/memory.c | 9 +++++----
mm/migrate.c | 2 +-
mm/shmem.c | 2 +-
mm/userfaultfd.c | 2 +-
11 files changed, 32 insertions(+), 29 deletions(-)
diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
index eabae5874161..fb7b87d8e794 100644
--- a/include/linux/memcontrol.h
+++ b/include/linux/memcontrol.h
@@ -704,7 +704,7 @@ static inline bool mem_cgroup_below_min(struct mem_cgroup *memcg)
page_counter_read(&memcg->memory);
}
-int mem_cgroup_charge(struct page *page, struct mm_struct *mm, gfp_t gfp_mask);
+int mem_cgroup_charge(struct folio *, struct mm_struct *, gfp_t);
int mem_cgroup_swapin_charge_page(struct page *page, struct mm_struct *mm,
gfp_t gfp, swp_entry_t entry);
void mem_cgroup_swapin_uncharge_swap(swp_entry_t entry);
@@ -1190,8 +1190,8 @@ static inline bool mem_cgroup_below_min(struct mem_cgroup *memcg)
return false;
}
-static inline int mem_cgroup_charge(struct page *page, struct mm_struct *mm,
- gfp_t gfp_mask)
+static inline int mem_cgroup_charge(struct folio *folio,
+ struct mm_struct *mm, gfp_t gfp)
{
return 0;
}
diff --git a/kernel/events/uprobes.c b/kernel/events/uprobes.c
index af24dc3febbe..6357c3580d07 100644
--- a/kernel/events/uprobes.c
+++ b/kernel/events/uprobes.c
@@ -167,7 +167,8 @@ static int __replace_page(struct vm_area_struct *vma, unsigned long addr,
addr + PAGE_SIZE);
if (new_page) {
- err = mem_cgroup_charge(new_page, vma->vm_mm, GFP_KERNEL);
+ err = mem_cgroup_charge(page_folio(new_page), vma->vm_mm,
+ GFP_KERNEL);
if (err)
return err;
}
diff --git a/mm/filemap.c b/mm/filemap.c
index a5d02ec62eb6..525f69316522 100644
--- a/mm/filemap.c
+++ b/mm/filemap.c
@@ -872,7 +872,7 @@ noinline int __add_to_page_cache_locked(struct page *page,
page->index = offset;
if (!huge) {
- error = mem_cgroup_charge(page, NULL, gfp);
+ error = mem_cgroup_charge(page_folio(page), NULL, gfp);
if (error)
goto error;
charged = true;
diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index afff3ac87067..ecb1fb1f5f3e 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -603,7 +603,7 @@ static vm_fault_t __do_huge_pmd_anonymous_page(struct vm_fault *vmf,
VM_BUG_ON_PAGE(!PageCompound(page), page);
- if (mem_cgroup_charge(page, vma->vm_mm, gfp)) {
+ if (mem_cgroup_charge(page_folio(page), vma->vm_mm, gfp)) {
put_page(page);
count_vm_event(THP_FAULT_FALLBACK);
count_vm_event(THP_FAULT_FALLBACK_CHARGE);
diff --git a/mm/khugepaged.c b/mm/khugepaged.c
index b0412be08fa2..8f6d7fdea9f4 100644
--- a/mm/khugepaged.c
+++ b/mm/khugepaged.c
@@ -1087,7 +1087,7 @@ static void collapse_huge_page(struct mm_struct *mm,
goto out_nolock;
}
- if (unlikely(mem_cgroup_charge(new_page, mm, gfp))) {
+ if (unlikely(mem_cgroup_charge(page_folio(new_page), mm, gfp))) {
result = SCAN_CGROUP_CHARGE_FAIL;
goto out_nolock;
}
@@ -1658,7 +1658,7 @@ static void collapse_file(struct mm_struct *mm,
goto out;
}
- if (unlikely(mem_cgroup_charge(new_page, mm, gfp))) {
+ if (unlikely(mem_cgroup_charge(page_folio(new_page), mm, gfp))) {
result = SCAN_CGROUP_CHARGE_FAIL;
goto out;
}
diff --git a/mm/ksm.c b/mm/ksm.c
index 3fa9bc8a67cf..23d36b59f997 100644
--- a/mm/ksm.c
+++ b/mm/ksm.c
@@ -2580,7 +2580,8 @@ struct page *ksm_might_need_to_copy(struct page *page,
return page; /* let do_swap_page report the error */
new_page = alloc_page_vma(GFP_HIGHUSER_MOVABLE, vma, address);
- if (new_page && mem_cgroup_charge(new_page, vma->vm_mm, GFP_KERNEL)) {
+ if (new_page &&
+ mem_cgroup_charge(page_folio(new_page), vma->vm_mm, GFP_KERNEL)) {
put_page(new_page);
new_page = NULL;
}
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index c2ffad021e09..03283d97b62a 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -6681,10 +6681,9 @@ void mem_cgroup_calculate_protection(struct mem_cgroup *root,
atomic_long_read(&parent->memory.children_low_usage)));
}
-static int __mem_cgroup_charge(struct page *page, struct mem_cgroup *memcg,
+static int __mem_cgroup_charge(struct folio *folio, struct mem_cgroup *memcg,
gfp_t gfp)
{
- struct folio *folio = page_folio(page);
unsigned int nr_pages = folio_nr_pages(folio);
int ret;
@@ -6697,27 +6696,27 @@ static int __mem_cgroup_charge(struct page *page, struct mem_cgroup *memcg,
local_irq_disable();
mem_cgroup_charge_statistics(memcg, nr_pages);
- memcg_check_events(memcg, page_to_nid(page));
+ memcg_check_events(memcg, folio_nid(folio));
local_irq_enable();
out:
return ret;
}
/**
- * mem_cgroup_charge - charge a newly allocated page to a cgroup
- * @page: page to charge
- * @mm: mm context of the victim
- * @gfp_mask: reclaim mode
+ * mem_cgroup_charge - Charge a newly allocated folio to a cgroup.
+ * @folio: Folio to charge.
+ * @mm: mm context of the allocating task.
+ * @gfp: reclaim mode
*
- * Try to charge @page to the memcg that @mm belongs to, reclaiming
- * pages according to @gfp_mask if necessary. if @mm is NULL, try to
+ * Try to charge @folio to the memcg that @mm belongs to, reclaiming
+ * pages according to @gfp if necessary. If @mm is NULL, try to
* charge to the active memcg.
*
- * Do not use this for pages allocated for swapin.
+ * Do not use this for folios allocated for swapin.
*
* Returns 0 on success. Otherwise, an error code is returned.
*/
-int mem_cgroup_charge(struct page *page, struct mm_struct *mm, gfp_t gfp_mask)
+int mem_cgroup_charge(struct folio *folio, struct mm_struct *mm, gfp_t gfp)
{
struct mem_cgroup *memcg;
int ret;
@@ -6726,7 +6725,7 @@ int mem_cgroup_charge(struct page *page, struct mm_struct *mm, gfp_t gfp_mask)
return 0;
memcg = get_mem_cgroup_from_mm(mm);
- ret = __mem_cgroup_charge(page, memcg, gfp_mask);
+ ret = __mem_cgroup_charge(folio, memcg, gfp);
css_put(&memcg->css);
return ret;
@@ -6747,6 +6746,7 @@ int mem_cgroup_charge(struct page *page, struct mm_struct *mm, gfp_t gfp_mask)
int mem_cgroup_swapin_charge_page(struct page *page, struct mm_struct *mm,
gfp_t gfp, swp_entry_t entry)
{
+ struct folio *folio = page_folio(page);
struct mem_cgroup *memcg;
unsigned short id;
int ret;
@@ -6761,7 +6761,7 @@ int mem_cgroup_swapin_charge_page(struct page *page, struct mm_struct *mm,
memcg = get_mem_cgroup_from_mm(mm);
rcu_read_unlock();
- ret = __mem_cgroup_charge(page, memcg, gfp);
+ ret = __mem_cgroup_charge(folio, memcg, gfp);
css_put(&memcg->css);
return ret;
diff --git a/mm/memory.c b/mm/memory.c
index 2f111f9b3dbc..614418e26e2c 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -990,7 +990,7 @@ page_copy_prealloc(struct mm_struct *src_mm, struct vm_area_struct *vma,
if (!new_page)
return NULL;
- if (mem_cgroup_charge(new_page, src_mm, GFP_KERNEL)) {
+ if (mem_cgroup_charge(page_folio(new_page), src_mm, GFP_KERNEL)) {
put_page(new_page);
return NULL;
}
@@ -3019,7 +3019,7 @@ static vm_fault_t wp_page_copy(struct vm_fault *vmf)
}
}
- if (mem_cgroup_charge(new_page, mm, GFP_KERNEL))
+ if (mem_cgroup_charge(page_folio(new_page), mm, GFP_KERNEL))
goto oom_free_new;
cgroup_throttle_swaprate(new_page, GFP_KERNEL);
@@ -3768,7 +3768,7 @@ static vm_fault_t do_anonymous_page(struct vm_fault *vmf)
if (!page)
goto oom;
- if (mem_cgroup_charge(page, vma->vm_mm, GFP_KERNEL))
+ if (mem_cgroup_charge(page_folio(page), vma->vm_mm, GFP_KERNEL))
goto oom_free_page;
cgroup_throttle_swaprate(page, GFP_KERNEL);
@@ -4183,7 +4183,8 @@ static vm_fault_t do_cow_fault(struct vm_fault *vmf)
if (!vmf->cow_page)
return VM_FAULT_OOM;
- if (mem_cgroup_charge(vmf->cow_page, vma->vm_mm, GFP_KERNEL)) {
+ if (mem_cgroup_charge(page_folio(vmf->cow_page), vma->vm_mm,
+ GFP_KERNEL)) {
put_page(vmf->cow_page);
return VM_FAULT_OOM;
}
diff --git a/mm/migrate.c b/mm/migrate.c
index 34a9ad3e0a4f..b5bdae748f82 100644
--- a/mm/migrate.c
+++ b/mm/migrate.c
@@ -2763,7 +2763,7 @@ static void migrate_vma_insert_page(struct migrate_vma *migrate,
if (unlikely(anon_vma_prepare(vma)))
goto abort;
- if (mem_cgroup_charge(page, vma->vm_mm, GFP_KERNEL))
+ if (mem_cgroup_charge(page_folio(page), vma->vm_mm, GFP_KERNEL))
goto abort;
/*
diff --git a/mm/shmem.c b/mm/shmem.c
index 70d9ce294bb4..3931fed5c8d8 100644
--- a/mm/shmem.c
+++ b/mm/shmem.c
@@ -685,7 +685,7 @@ static int shmem_add_to_page_cache(struct page *page,
page->index = index;
if (!PageSwapCache(page)) {
- error = mem_cgroup_charge(page, charge_mm, gfp);
+ error = mem_cgroup_charge(page_folio(page), charge_mm, gfp);
if (error) {
if (PageTransHuge(page)) {
count_vm_event(THP_FILE_FALLBACK);
diff --git a/mm/userfaultfd.c b/mm/userfaultfd.c
index 0e2132834bc7..5d0f55f3c0ed 100644
--- a/mm/userfaultfd.c
+++ b/mm/userfaultfd.c
@@ -164,7 +164,7 @@ static int mcopy_atomic_pte(struct mm_struct *dst_mm,
__SetPageUptodate(page);
ret = -ENOMEM;
- if (mem_cgroup_charge(page, dst_mm, GFP_KERNEL))
+ if (mem_cgroup_charge(page_folio(page), dst_mm, GFP_KERNEL))
goto out_release;
ret = mfill_atomic_install_pte(dst_mm, dst_pmd, dst_vma, dst_addr,
--
2.30.2
Convert all callers of mem_cgroup_migrate() to call page_folio() first.
They all look like they're using head pages already, but this proves it.
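A sketch of why the new prototype proves it (illustrative only, not from
this patch): page_folio() always resolves to the head page's folio, so once
mem_cgroup_migrate() takes struct folio pointers, a tail page can no longer
slip through -- the compiler will complain about a raw struct page *.

    #include <linux/memcontrol.h>
    #include <linux/mm.h>

    /* Hypothetical wrapper showing the conversion pattern applied to each
     * caller: page_folio() returns the head page's folio even for a tail page.
     */
    static void migrate_memcg_charge(struct page *oldpage, struct page *newpage)
    {
            mem_cgroup_migrate(page_folio(oldpage), page_folio(newpage));
    }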
Signed-off-by: Matthew Wilcox (Oracle) <[email protected]>
Reviewed-by: Christoph Hellwig <[email protected]>
---
include/linux/memcontrol.h | 4 ++--
mm/filemap.c | 4 +++-
mm/memcontrol.c | 35 +++++++++++++++++------------------
mm/migrate.c | 4 +++-
mm/shmem.c | 5 ++++-
5 files changed, 29 insertions(+), 23 deletions(-)
diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
index 941a1a7131c9..d75a708eac13 100644
--- a/include/linux/memcontrol.h
+++ b/include/linux/memcontrol.h
@@ -712,7 +712,7 @@ void mem_cgroup_swapin_uncharge_swap(swp_entry_t entry);
void mem_cgroup_uncharge(struct folio *folio);
void mem_cgroup_uncharge_list(struct list_head *page_list);
-void mem_cgroup_migrate(struct page *oldpage, struct page *newpage);
+void mem_cgroup_migrate(struct folio *old, struct folio *new);
/**
* mem_cgroup_lruvec - get the lru list vector for a memcg & node
@@ -1214,7 +1214,7 @@ static inline void mem_cgroup_uncharge_list(struct list_head *page_list)
{
}
-static inline void mem_cgroup_migrate(struct page *old, struct page *new)
+static inline void mem_cgroup_migrate(struct folio *old, struct folio *new)
{
}
diff --git a/mm/filemap.c b/mm/filemap.c
index 31d4ecd4268e..5c4e3185ecb3 100644
--- a/mm/filemap.c
+++ b/mm/filemap.c
@@ -817,6 +817,8 @@ EXPORT_SYMBOL(file_write_and_wait_range);
*/
void replace_page_cache_page(struct page *old, struct page *new)
{
+ struct folio *fold = page_folio(old);
+ struct folio *fnew = page_folio(new);
struct address_space *mapping = old->mapping;
void (*freepage)(struct page *) = mapping->a_ops->freepage;
pgoff_t offset = old->index;
@@ -831,7 +833,7 @@ void replace_page_cache_page(struct page *old, struct page *new)
new->mapping = mapping;
new->index = offset;
- mem_cgroup_migrate(old, new);
+ mem_cgroup_migrate(fold, fnew);
xas_lock_irqsave(&xas, flags);
xas_store(&xas, new);
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index fc94048e6451..92bbced86bdb 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -6941,36 +6941,35 @@ void mem_cgroup_uncharge_list(struct list_head *page_list)
}
/**
- * mem_cgroup_migrate - charge a page's replacement
- * @oldpage: currently circulating page
- * @newpage: replacement page
+ * mem_cgroup_migrate - Charge a folio's replacement.
+ * @old: Currently circulating folio.
+ * @new: Replacement folio.
*
- * Charge @newpage as a replacement page for @oldpage. @oldpage will
+ * Charge @new as a replacement folio for @old. @old will
* be uncharged upon free.
*
- * Both pages must be locked, @newpage->mapping must be set up.
+ * Both folios must be locked, @new->mapping must be set up.
*/
-void mem_cgroup_migrate(struct page *oldpage, struct page *newpage)
+void mem_cgroup_migrate(struct folio *old, struct folio *new)
{
- struct folio *newfolio = page_folio(newpage);
struct mem_cgroup *memcg;
- unsigned int nr_pages = folio_nr_pages(newfolio);
+ unsigned int nr_pages = folio_nr_pages(new);
unsigned long flags;
- VM_BUG_ON_PAGE(!PageLocked(oldpage), oldpage);
- VM_BUG_ON_FOLIO(!folio_test_locked(newfolio), newfolio);
- VM_BUG_ON_FOLIO(PageAnon(oldpage) != folio_test_anon(newfolio), newfolio);
- VM_BUG_ON_FOLIO(compound_nr(oldpage) != nr_pages, newfolio);
+ VM_BUG_ON_FOLIO(!folio_test_locked(old), old);
+ VM_BUG_ON_FOLIO(!folio_test_locked(new), new);
+ VM_BUG_ON_FOLIO(folio_test_anon(old) != folio_test_anon(new), new);
+ VM_BUG_ON_FOLIO(folio_nr_pages(old) != nr_pages, new);
if (mem_cgroup_disabled())
return;
- /* Page cache replacement: new page already charged? */
- if (folio_memcg(newfolio))
+ /* Page cache replacement: new folio already charged? */
+ if (folio_memcg(new))
return;
- memcg = page_memcg(oldpage);
- VM_WARN_ON_ONCE_PAGE(!memcg, oldpage);
+ memcg = folio_memcg(old);
+ VM_WARN_ON_ONCE_FOLIO(!memcg, old);
if (!memcg)
return;
@@ -6982,11 +6981,11 @@ void mem_cgroup_migrate(struct page *oldpage, struct page *newpage)
}
css_get(&memcg->css);
- commit_charge(newfolio, memcg);
+ commit_charge(new, memcg);
local_irq_save(flags);
mem_cgroup_charge_statistics(memcg, nr_pages);
- memcg_check_events(memcg, page_to_nid(newpage));
+ memcg_check_events(memcg, folio_nid(new));
local_irq_restore(flags);
}
diff --git a/mm/migrate.c b/mm/migrate.c
index b5bdae748f82..910552318df3 100644
--- a/mm/migrate.c
+++ b/mm/migrate.c
@@ -541,6 +541,8 @@ int migrate_huge_page_move_mapping(struct address_space *mapping,
*/
void migrate_page_states(struct page *newpage, struct page *page)
{
+ struct folio *folio = page_folio(page);
+ struct folio *newfolio = page_folio(newpage);
int cpupid;
if (PageError(page))
@@ -608,7 +610,7 @@ void migrate_page_states(struct page *newpage, struct page *page)
copy_page_owner(page, newpage);
if (!PageHuge(page))
- mem_cgroup_migrate(page, newpage);
+ mem_cgroup_migrate(folio, newfolio);
}
EXPORT_SYMBOL(migrate_page_states);
diff --git a/mm/shmem.c b/mm/shmem.c
index 3931fed5c8d8..2fd75b4d4974 100644
--- a/mm/shmem.c
+++ b/mm/shmem.c
@@ -1619,6 +1619,7 @@ static int shmem_replace_page(struct page **pagep, gfp_t gfp,
struct shmem_inode_info *info, pgoff_t index)
{
struct page *oldpage, *newpage;
+ struct folio *old, *new;
struct address_space *swap_mapping;
swp_entry_t entry;
pgoff_t swap_index;
@@ -1655,7 +1656,9 @@ static int shmem_replace_page(struct page **pagep, gfp_t gfp,
xa_lock_irq(&swap_mapping->i_pages);
error = shmem_replace_entry(swap_mapping, swap_index, oldpage, newpage);
if (!error) {
- mem_cgroup_migrate(oldpage, newpage);
+ old = page_folio(oldpage);
+ new = page_folio(newpage);
+ mem_cgroup_migrate(old, new);
__inc_lruvec_page_state(newpage, NR_FILE_PAGES);
__dec_lruvec_page_state(oldpage, NR_FILE_PAGES);
}
--
2.30.2
These are the folio equivalents of lock_page_memcg() and
unlock_page_memcg().
lock_page_memcg() and unlock_page_memcg() have too many callers to be
easily replaced in a single patch, so reimplement them as wrappers for
now; they can be cleaned up later, once enough callers have been
converted to use folios.
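A minimal usage sketch (hypothetical caller, not part of this patch),
assuming the caller is responsible for keeping the folio alive:

    #include <linux/memcontrol.h>

    /* Hypothetical: keep the folio/memcg binding stable while updating
     * state that must not race with cgroup migration.
     */
    static void update_folio_memcg_state(struct folio *folio)
    {
            folio_memcg_lock(folio);
            /* ... modify memcg-tracked per-folio state here ... */
            folio_memcg_unlock(folio);
    }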
Signed-off-by: Matthew Wilcox (Oracle) <[email protected]>
Reviewed-by: Christoph Hellwig <[email protected]>
---
include/linux/memcontrol.h | 10 +++++++++
mm/memcontrol.c | 45 ++++++++++++++++++++++++--------------
2 files changed, 39 insertions(+), 16 deletions(-)
diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
index 20084c47d2ca..4b79dd6b3a9c 100644
--- a/include/linux/memcontrol.h
+++ b/include/linux/memcontrol.h
@@ -950,6 +950,8 @@ void mem_cgroup_print_oom_group(struct mem_cgroup *memcg);
extern bool cgroup_memory_noswap;
#endif
+void folio_memcg_lock(struct folio *folio);
+void folio_memcg_unlock(struct folio *folio);
void lock_page_memcg(struct page *page);
void unlock_page_memcg(struct page *page);
@@ -1367,6 +1369,14 @@ static inline void unlock_page_memcg(struct page *page)
{
}
+static inline void folio_memcg_lock(struct folio *folio)
+{
+}
+
+static inline void folio_memcg_unlock(struct folio *folio)
+{
+}
+
static inline void mem_cgroup_handle_over_high(void)
{
}
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 96c34357fbca..0dd40ea67a90 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -1965,18 +1965,17 @@ void mem_cgroup_print_oom_group(struct mem_cgroup *memcg)
}
/**
- * lock_page_memcg - lock a page and memcg binding
- * @page: the page
+ * folio_memcg_lock - Bind a folio to its memcg.
+ * @folio: The folio.
*
- * This function protects unlocked LRU pages from being moved to
+ * This function prevents unlocked LRU folios from being moved to
* another cgroup.
*
- * It ensures lifetime of the locked memcg. Caller is responsible
- * for the lifetime of the page.
+ * It ensures lifetime of the bound memcg. The caller is responsible
+ * for the lifetime of the folio.
*/
-void lock_page_memcg(struct page *page)
+void folio_memcg_lock(struct folio *folio)
{
- struct page *head = compound_head(page); /* rmap on tail pages */
struct mem_cgroup *memcg;
unsigned long flags;
@@ -1990,7 +1989,7 @@ void lock_page_memcg(struct page *page)
if (mem_cgroup_disabled())
return;
again:
- memcg = page_memcg(head);
+ memcg = folio_memcg(folio);
if (unlikely(!memcg))
return;
@@ -2004,7 +2003,7 @@ void lock_page_memcg(struct page *page)
return;
spin_lock_irqsave(&memcg->move_lock, flags);
- if (memcg != page_memcg(head)) {
+ if (memcg != folio_memcg(folio)) {
spin_unlock_irqrestore(&memcg->move_lock, flags);
goto again;
}
@@ -2018,9 +2017,15 @@ void lock_page_memcg(struct page *page)
memcg->move_lock_task = current;
memcg->move_lock_flags = flags;
}
+EXPORT_SYMBOL(folio_memcg_lock);
+
+void lock_page_memcg(struct page *page)
+{
+ folio_memcg_lock(page_folio(page));
+}
EXPORT_SYMBOL(lock_page_memcg);
-static void __unlock_page_memcg(struct mem_cgroup *memcg)
+static void __folio_memcg_unlock(struct mem_cgroup *memcg)
{
if (memcg && memcg->move_lock_task == current) {
unsigned long flags = memcg->move_lock_flags;
@@ -2035,14 +2040,22 @@ static void __unlock_page_memcg(struct mem_cgroup *memcg)
}
/**
- * unlock_page_memcg - unlock a page and memcg binding
- * @page: the page
+ * folio_memcg_unlock - Release the binding between a folio and its memcg.
+ * @folio: The folio.
+ *
+ * This releases the binding created by folio_memcg_lock(). This does
+ * not change the accounting of this folio to its memcg, but it does
+ * permit others to change it.
*/
-void unlock_page_memcg(struct page *page)
+void folio_memcg_unlock(struct folio *folio)
{
- struct page *head = compound_head(page);
+ __folio_memcg_unlock(folio_memcg(folio));
+}
+EXPORT_SYMBOL(folio_memcg_unlock);
- __unlock_page_memcg(page_memcg(head));
+void unlock_page_memcg(struct page *page)
+{
+ folio_memcg_unlock(page_folio(page));
}
EXPORT_SYMBOL(unlock_page_memcg);
@@ -5666,7 +5679,7 @@ static int mem_cgroup_move_account(struct page *page,
page->memcg_data = (unsigned long)to;
- __unlock_page_memcg(from);
+ __folio_memcg_unlock(from);
ret = 0;
nid = page_to_nid(page);
--
2.30.2
This replaces mem_cgroup_page_lruvec(). All callers have been converted.
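For illustration only (not part of this patch), a sketch of the
lookup-and-lock pattern; it mirrors what lock_page_lruvec_irq() does in the
diff below, minus the memcg debug check. The helper name is made up.

    #include <linux/memcontrol.h>

    /* Hypothetical helper: look up and lock the lruvec for a folio.  The
     * folio's memcg must be stable for the lookup to be valid.
     */
    static struct lruvec *folio_lruvec_lock_irq_sketch(struct folio *folio)
    {
            struct lruvec *lruvec = folio_lruvec(folio);

            spin_lock_irq(&lruvec->lru_lock);
            return lruvec;
    }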
Signed-off-by: Matthew Wilcox (Oracle) <[email protected]>
Reviewed-by: Christoph Hellwig <[email protected]>
---
include/linux/memcontrol.h | 20 +++++++++-----------
mm/compaction.c | 2 +-
mm/memcontrol.c | 9 ++++++---
mm/swap.c | 3 ++-
mm/workingset.c | 3 ++-
5 files changed, 20 insertions(+), 17 deletions(-)
diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
index 4b79dd6b3a9c..4eb329b5d183 100644
--- a/include/linux/memcontrol.h
+++ b/include/linux/memcontrol.h
@@ -751,18 +751,17 @@ static inline struct lruvec *mem_cgroup_lruvec(struct mem_cgroup *memcg,
}
/**
- * mem_cgroup_page_lruvec - return lruvec for isolating/putting an LRU page
- * @page: the page
+ * folio_lruvec - return lruvec for isolating/putting an LRU folio
+ * @folio: Pointer to the folio.
*
- * This function relies on page->mem_cgroup being stable.
+ * This function relies on folio->mem_cgroup being stable.
*/
-static inline struct lruvec *mem_cgroup_page_lruvec(struct page *page)
+static inline struct lruvec *folio_lruvec(struct folio *folio)
{
- pg_data_t *pgdat = page_pgdat(page);
- struct mem_cgroup *memcg = page_memcg(page);
+ struct mem_cgroup *memcg = folio_memcg(folio);
- VM_WARN_ON_ONCE_PAGE(!memcg && !mem_cgroup_disabled(), page);
- return mem_cgroup_lruvec(memcg, pgdat);
+ VM_WARN_ON_ONCE_FOLIO(!memcg && !mem_cgroup_disabled(), folio);
+ return mem_cgroup_lruvec(memcg, folio_pgdat(folio));
}
struct mem_cgroup *mem_cgroup_from_task(struct task_struct *p);
@@ -1226,10 +1225,9 @@ static inline struct lruvec *mem_cgroup_lruvec(struct mem_cgroup *memcg,
return &pgdat->__lruvec;
}
-static inline struct lruvec *mem_cgroup_page_lruvec(struct page *page)
+static inline struct lruvec *folio_lruvec(struct folio *folio)
{
- pg_data_t *pgdat = page_pgdat(page);
-
+ struct pglist_data *pgdat = folio_pgdat(folio);
return &pgdat->__lruvec;
}
diff --git a/mm/compaction.c b/mm/compaction.c
index 621508e0ecd5..a88f7b893f80 100644
--- a/mm/compaction.c
+++ b/mm/compaction.c
@@ -1028,7 +1028,7 @@ isolate_migratepages_block(struct compact_control *cc, unsigned long low_pfn,
if (!TestClearPageLRU(page))
goto isolate_fail_put;
- lruvec = mem_cgroup_page_lruvec(page);
+ lruvec = folio_lruvec(page_folio(page));
/* If we already hold the lock, we can skip some rechecking */
if (lruvec != locked) {
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 96d6e6c0a65d..fd578d70b579 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -1186,9 +1186,10 @@ void lruvec_memcg_debug(struct lruvec *lruvec, struct page *page)
*/
struct lruvec *lock_page_lruvec(struct page *page)
{
+ struct folio *folio = page_folio(page);
struct lruvec *lruvec;
- lruvec = mem_cgroup_page_lruvec(page);
+ lruvec = folio_lruvec(folio);
spin_lock(&lruvec->lru_lock);
lruvec_memcg_debug(lruvec, page);
@@ -1198,9 +1199,10 @@ struct lruvec *lock_page_lruvec(struct page *page)
struct lruvec *lock_page_lruvec_irq(struct page *page)
{
+ struct folio *folio = page_folio(page);
struct lruvec *lruvec;
- lruvec = mem_cgroup_page_lruvec(page);
+ lruvec = folio_lruvec(folio);
spin_lock_irq(&lruvec->lru_lock);
lruvec_memcg_debug(lruvec, page);
@@ -1210,9 +1212,10 @@ struct lruvec *lock_page_lruvec_irq(struct page *page)
struct lruvec *lock_page_lruvec_irqsave(struct page *page, unsigned long *flags)
{
+ struct folio *folio = page_folio(page);
struct lruvec *lruvec;
- lruvec = mem_cgroup_page_lruvec(page);
+ lruvec = folio_lruvec(folio);
spin_lock_irqsave(&lruvec->lru_lock, *flags);
lruvec_memcg_debug(lruvec, page);
diff --git a/mm/swap.c b/mm/swap.c
index 11ff40104a2c..4ba77fc8da4f 100644
--- a/mm/swap.c
+++ b/mm/swap.c
@@ -315,7 +315,8 @@ void lru_note_cost(struct lruvec *lruvec, bool file, unsigned int nr_pages)
void lru_note_cost_page(struct page *page)
{
- lru_note_cost(mem_cgroup_page_lruvec(page),
+ struct folio *folio = page_folio(page);
+ lru_note_cost(folio_lruvec(folio),
page_is_file_lru(page), thp_nr_pages(page));
}
diff --git a/mm/workingset.c b/mm/workingset.c
index 5ba3e42446fa..e62c0f2084a2 100644
--- a/mm/workingset.c
+++ b/mm/workingset.c
@@ -396,6 +396,7 @@ void workingset_refault(struct page *page, void *shadow)
*/
void workingset_activation(struct page *page)
{
+ struct folio *folio = page_folio(page);
struct mem_cgroup *memcg;
struct lruvec *lruvec;
@@ -410,7 +411,7 @@ void workingset_activation(struct page *page)
memcg = page_memcg_rcu(page);
if (!mem_cgroup_disabled() && !memcg)
goto out;
- lruvec = mem_cgroup_page_lruvec(page);
+ lruvec = folio_lruvec(folio);
workingset_age_nonresident(lruvec, thp_nr_pages(page));
out:
rcu_read_unlock();
--
2.30.2
Add flush_dcache_folio(), a default implementation which calls
flush_dcache_page() on each page in the folio.  If an architecture can
do better, it should implement its own version.
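A hedged usage sketch (not part of this patch): after modifying folio
contents through a temporary kernel mapping, a filesystem can flush the
whole folio in one call instead of iterating over its pages by hand.
kmap_local_folio() is only added later in this series and the helper name
is made up; the caller must keep the copy within one page.

    #include <linux/highmem.h>
    #include <linux/string.h>

    /* Hypothetical: copy data into a folio and make the data cache coherent
     * before anyone else looks at it.  len must not cross a page boundary.
     */
    static void copy_to_folio(struct folio *folio, size_t offset,
                              const void *src, size_t len)
    {
            void *addr = kmap_local_folio(folio, offset);

            memcpy(addr, src, len);
            flush_dcache_folio(folio);
            kunmap_local(addr);
    }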
Signed-off-by: Matthew Wilcox (Oracle) <[email protected]>
---
Documentation/core-api/cachetlb.rst | 6 ++++++
arch/nds32/include/asm/cacheflush.h | 1 +
include/asm-generic/cacheflush.h | 6 ++++++
mm/util.c | 13 +++++++++++++
4 files changed, 26 insertions(+)
diff --git a/Documentation/core-api/cachetlb.rst b/Documentation/core-api/cachetlb.rst
index fe4290e26729..29682f69a915 100644
--- a/Documentation/core-api/cachetlb.rst
+++ b/Documentation/core-api/cachetlb.rst
@@ -325,6 +325,12 @@ maps this page at its virtual address.
dirty. Again, see sparc64 for examples of how
to deal with this.
+ ``void flush_dcache_folio(struct folio *folio)``
+ This function is called under the same circumstances as
+ flush_dcache_page(). It allows the architecture to
+ optimise for flushing the entire folio of pages instead
+ of flushing one page at a time.
+
``void copy_to_user_page(struct vm_area_struct *vma, struct page *page,
unsigned long user_vaddr, void *dst, void *src, int len)``
``void copy_from_user_page(struct vm_area_struct *vma, struct page *page,
diff --git a/arch/nds32/include/asm/cacheflush.h b/arch/nds32/include/asm/cacheflush.h
index 7d6824f7c0e8..f10d13af4ae5 100644
--- a/arch/nds32/include/asm/cacheflush.h
+++ b/arch/nds32/include/asm/cacheflush.h
@@ -38,6 +38,7 @@ void flush_anon_page(struct vm_area_struct *vma,
#define ARCH_HAS_FLUSH_KERNEL_DCACHE_PAGE
void flush_kernel_dcache_page(struct page *page);
+void flush_dcache_folio(struct folio *folio);
void flush_kernel_vmap_range(void *addr, int size);
void invalidate_kernel_vmap_range(void *addr, int size);
#define flush_dcache_mmap_lock(mapping) xa_lock_irq(&(mapping)->i_pages)
diff --git a/include/asm-generic/cacheflush.h b/include/asm-generic/cacheflush.h
index 4a674db4e1fa..fedc0dfa4877 100644
--- a/include/asm-generic/cacheflush.h
+++ b/include/asm-generic/cacheflush.h
@@ -49,9 +49,15 @@ static inline void flush_cache_page(struct vm_area_struct *vma,
static inline void flush_dcache_page(struct page *page)
{
}
+
+static inline void flush_dcache_folio(struct folio *folio) { }
#define ARCH_IMPLEMENTS_FLUSH_DCACHE_PAGE 0
+#define ARCH_IMPLEMENTS_FLUSH_DCACHE_FOLIO
#endif
+#ifndef ARCH_IMPLEMENTS_FLUSH_DCACHE_FOLIO
+void flush_dcache_folio(struct folio *folio);
+#endif
#ifndef flush_dcache_mmap_lock
static inline void flush_dcache_mmap_lock(struct address_space *mapping)
diff --git a/mm/util.c b/mm/util.c
index d0aa1d9c811e..149537120a91 100644
--- a/mm/util.c
+++ b/mm/util.c
@@ -1057,3 +1057,16 @@ void page_offline_end(void)
up_write(&page_offline_rwsem);
}
EXPORT_SYMBOL(page_offline_end);
+
+#ifndef ARCH_IMPLEMENTS_FLUSH_DCACHE_FOLIO
+void flush_dcache_folio(struct folio *folio)
+{
+ unsigned int n = folio_nr_pages(folio);
+
+ do {
+ n--;
+ flush_dcache_page(folio_page(folio, n));
+ } while (n);
+}
+EXPORT_SYMBOL(flush_dcache_folio);
+#endif
--
2.30.2
Add kmap_local_folio(), which allows us to map a portion of a folio.
Callers can only expect to access memory up to the next page boundary.
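Because the mapping is only valid up to the next page boundary, a caller
that wants to touch an arbitrary byte range of a multi-page folio has to
remap at each boundary. A hypothetical zeroing helper, for illustration
only:

    #include <linux/highmem.h>
    #include <linux/minmax.h>
    #include <linux/mm.h>
    #include <linux/string.h>

    /* Hypothetical: zero a byte range of a folio one page-sized mapping at
     * a time, remapping whenever the range crosses into the next page.
     */
    static void folio_zero_range_sketch(struct folio *folio,
                                        size_t offset, size_t len)
    {
            while (len) {
                    size_t chunk = min_t(size_t, len,
                                         PAGE_SIZE - offset_in_page(offset));
                    void *addr = kmap_local_folio(folio, offset);

                    memset(addr, 0, chunk);
                    kunmap_local(addr);
                    offset += chunk;
                    len -= chunk;
            }
    }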
Signed-off-by: Matthew Wilcox (Oracle) <[email protected]>
Reviewed-by: Christoph Hellwig <[email protected]>
---
include/linux/highmem-internal.h | 11 +++++++++
include/linux/highmem.h | 38 ++++++++++++++++++++++++++++++++
2 files changed, 49 insertions(+)
diff --git a/include/linux/highmem-internal.h b/include/linux/highmem-internal.h
index 7902c7d8b55f..d5d6f930ae1d 100644
--- a/include/linux/highmem-internal.h
+++ b/include/linux/highmem-internal.h
@@ -73,6 +73,12 @@ static inline void *kmap_local_page(struct page *page)
return __kmap_local_page_prot(page, kmap_prot);
}
+static inline void *kmap_local_folio(struct folio *folio, size_t offset)
+{
+ struct page *page = folio_page(folio, offset / PAGE_SIZE);
+ return __kmap_local_page_prot(page, kmap_prot) + offset % PAGE_SIZE;
+}
+
static inline void *kmap_local_page_prot(struct page *page, pgprot_t prot)
{
return __kmap_local_page_prot(page, prot);
@@ -160,6 +166,11 @@ static inline void *kmap_local_page(struct page *page)
return page_address(page);
}
+static inline void *kmap_local_folio(struct folio *folio, size_t offset)
+{
+ return page_address(&folio->page) + offset;
+}
+
static inline void *kmap_local_page_prot(struct page *page, pgprot_t prot)
{
return kmap_local_page(page);
diff --git a/include/linux/highmem.h b/include/linux/highmem.h
index 8c6e8e996c87..85de3bd0b47d 100644
--- a/include/linux/highmem.h
+++ b/include/linux/highmem.h
@@ -96,6 +96,44 @@ static inline void kmap_flush_unused(void);
*/
static inline void *kmap_local_page(struct page *page);
+/**
+ * kmap_local_folio - Map a page in this folio for temporary usage
+ * @folio: The folio to be mapped.
+ * @offset: The byte offset within the folio.
+ *
+ * Returns: The virtual address of the mapping
+ *
+ * Can be invoked from any context.
+ *
+ * Requires careful handling when nesting multiple mappings because the map
+ * management is stack based. The unmap has to be in the reverse order of
+ * the map operation:
+ *
+ * addr1 = kmap_local_folio(page1, offset1);
+ * addr2 = kmap_local_folio(page2, offset2);
+ * ...
+ * kunmap_local(addr2);
+ * kunmap_local(addr1);
+ *
+ * Unmapping addr1 before addr2 is invalid and causes malfunction.
+ *
+ * Contrary to kmap() mappings the mapping is only valid in the context of
+ * the caller and cannot be handed to other contexts.
+ *
+ * On CONFIG_HIGHMEM=n kernels and for low memory pages this returns the
+ * virtual address of the direct mapping. Only real highmem pages are
+ * temporarily mapped.
+ *
+ * While it is significantly faster than kmap() for the higmem case it
+ * comes with restrictions about the pointer validity. Only use when really
+ * necessary.
+ *
+ * On HIGHMEM enabled systems mapping a highmem page has the side effect of
+ * disabling migration in order to keep the virtual address stable across
+ * preemption. No caller of kmap_local_folio() can rely on this side effect.
+ */
+static inline void *kmap_local_folio(struct folio *folio, size_t offset);
+
/**
* kmap_atomic - Atomically map a page for temporary usage - Deprecated!
* @page: Pointer to the page to be mapped
--
2.30.2
Add arch_make_folio_accessible(), with a default implementation that
calls arch_make_page_accessible() once for each page in the folio.
If an architecture can do better, it can override this.
Also move the default implementation of arch_make_page_accessible()
from gfp.h to mm.h.
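A hedged sketch of a possible caller (hypothetical, not taken from this
series): make every constituent page of a folio accessible before starting
I/O on it.

    #include <linux/mm.h>
    #include <linux/printk.h>

    /* Hypothetical: a non-zero return means at least one page of the folio
     * could not be made accessible, so the caller should not start I/O.
     */
    static int prepare_folio_for_io(struct folio *folio)
    {
            int ret = arch_make_folio_accessible(folio);

            if (ret)
                    pr_warn("folio %p not accessible for I/O (%d)\n", folio, ret);
            return ret;
    }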
Signed-off-by: Matthew Wilcox (Oracle) <[email protected]>
---
include/linux/gfp.h | 6 ------
include/linux/mm.h | 21 +++++++++++++++++++++
2 files changed, 21 insertions(+), 6 deletions(-)
diff --git a/include/linux/gfp.h b/include/linux/gfp.h
index 55b2ec1f965a..dc5ff40608ce 100644
--- a/include/linux/gfp.h
+++ b/include/linux/gfp.h
@@ -520,12 +520,6 @@ static inline void arch_free_page(struct page *page, int order) { }
#ifndef HAVE_ARCH_ALLOC_PAGE
static inline void arch_alloc_page(struct page *page, int order) { }
#endif
-#ifndef HAVE_ARCH_MAKE_PAGE_ACCESSIBLE
-static inline int arch_make_page_accessible(struct page *page)
-{
- return 0;
-}
-#endif
struct page *__alloc_pages(gfp_t gfp, unsigned int order, int preferred_nid,
nodemask_t *nodemask);
diff --git a/include/linux/mm.h b/include/linux/mm.h
index 89daae93aa9b..deb0f5efaa65 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -1732,6 +1732,27 @@ static inline size_t folio_size(struct folio *folio)
return PAGE_SIZE << folio_order(folio);
}
+#ifndef HAVE_ARCH_MAKE_PAGE_ACCESSIBLE
+static inline int arch_make_page_accessible(struct page *page)
+{
+ return 0;
+}
+#endif
+
+#ifndef HAVE_ARCH_MAKE_FOLIO_ACCESSIBLE
+static inline int arch_make_folio_accessible(struct folio *folio)
+{
+ int ret, i;
+ for (i = 0; i < folio_nr_pages(folio); i++) {
+ ret = arch_make_page_accessible(folio_page(folio, i));
+ if (ret)
+ break;
+ }
+
+ return ret;
+}
+#endif
+
/*
* Some inline functions in vmstat.h depend on page_zone()
*/
--
2.30.2
Reimplement migrate_page_move_mapping() as a wrapper around
folio_migrate_mapping(). Saves 193 bytes of kernel text.
Signed-off-by: Matthew Wilcox (Oracle) <[email protected]>
Reviewed-by: Christoph Hellwig <[email protected]>
---
include/linux/migrate.h | 2 +
mm/folio-compat.c | 11 ++++++
mm/migrate.c | 85 +++++++++++++++++++++--------------------
3 files changed, 57 insertions(+), 41 deletions(-)
diff --git a/include/linux/migrate.h b/include/linux/migrate.h
index 23dadf7aeba8..eb14495a1f46 100644
--- a/include/linux/migrate.h
+++ b/include/linux/migrate.h
@@ -51,6 +51,8 @@ extern int migrate_huge_page_move_mapping(struct address_space *mapping,
struct page *newpage, struct page *page);
extern int migrate_page_move_mapping(struct address_space *mapping,
struct page *newpage, struct page *page, int extra_count);
+int folio_migrate_mapping(struct address_space *mapping,
+ struct folio *newfolio, struct folio *folio, int extra_count);
#else
static inline void putback_movable_pages(struct list_head *l) {}
diff --git a/mm/folio-compat.c b/mm/folio-compat.c
index a374747ae1c6..d883d964fd52 100644
--- a/mm/folio-compat.c
+++ b/mm/folio-compat.c
@@ -4,6 +4,7 @@
* eventually.
*/
+#include <linux/migrate.h>
#include <linux/pagemap.h>
#include <linux/swap.h>
@@ -48,3 +49,13 @@ void mark_page_accessed(struct page *page)
folio_mark_accessed(page_folio(page));
}
EXPORT_SYMBOL(mark_page_accessed);
+
+#ifdef CONFIG_MIGRATION
+int migrate_page_move_mapping(struct address_space *mapping,
+ struct page *newpage, struct page *page, int extra_count)
+{
+ return folio_migrate_mapping(mapping, page_folio(newpage),
+ page_folio(page), extra_count);
+}
+EXPORT_SYMBOL(migrate_page_move_mapping);
+#endif
diff --git a/mm/migrate.c b/mm/migrate.c
index 910552318df3..aa4f2310c5bb 100644
--- a/mm/migrate.c
+++ b/mm/migrate.c
@@ -363,7 +363,7 @@ static int expected_page_refs(struct address_space *mapping, struct page *page)
*/
expected_count += is_device_private_page(page);
if (mapping)
- expected_count += thp_nr_pages(page) + page_has_private(page);
+ expected_count += compound_nr(page) + page_has_private(page);
return expected_count;
}
@@ -376,74 +376,75 @@ static int expected_page_refs(struct address_space *mapping, struct page *page)
* 2 for pages with a mapping
* 3 for pages with a mapping and PagePrivate/PagePrivate2 set.
*/
-int migrate_page_move_mapping(struct address_space *mapping,
- struct page *newpage, struct page *page, int extra_count)
+int folio_migrate_mapping(struct address_space *mapping,
+ struct folio *newfolio, struct folio *folio, int extra_count)
{
- XA_STATE(xas, &mapping->i_pages, page_index(page));
+ XA_STATE(xas, &mapping->i_pages, folio_index(folio));
struct zone *oldzone, *newzone;
int dirty;
- int expected_count = expected_page_refs(mapping, page) + extra_count;
- int nr = thp_nr_pages(page);
+ int expected_count = expected_page_refs(mapping, &folio->page) + extra_count;
+ int nr = folio_nr_pages(folio);
if (!mapping) {
/* Anonymous page without mapping */
- if (page_count(page) != expected_count)
+ if (folio_ref_count(folio) != expected_count)
return -EAGAIN;
/* No turning back from here */
- newpage->index = page->index;
- newpage->mapping = page->mapping;
- if (PageSwapBacked(page))
- __SetPageSwapBacked(newpage);
+ newfolio->index = folio->index;
+ newfolio->mapping = folio->mapping;
+ if (folio_test_swapbacked(folio))
+ __folio_set_swapbacked(newfolio);
return MIGRATEPAGE_SUCCESS;
}
- oldzone = page_zone(page);
- newzone = page_zone(newpage);
+ oldzone = folio_zone(folio);
+ newzone = folio_zone(newfolio);
xas_lock_irq(&xas);
- if (page_count(page) != expected_count || xas_load(&xas) != page) {
+ if (folio_ref_count(folio) != expected_count ||
+ xas_load(&xas) != folio) {
xas_unlock_irq(&xas);
return -EAGAIN;
}
- if (!page_ref_freeze(page, expected_count)) {
+ if (!folio_ref_freeze(folio, expected_count)) {
xas_unlock_irq(&xas);
return -EAGAIN;
}
/*
- * Now we know that no one else is looking at the page:
+ * Now we know that no one else is looking at the folio:
* no turning back from here.
*/
- newpage->index = page->index;
- newpage->mapping = page->mapping;
- page_ref_add(newpage, nr); /* add cache reference */
- if (PageSwapBacked(page)) {
- __SetPageSwapBacked(newpage);
- if (PageSwapCache(page)) {
- SetPageSwapCache(newpage);
- set_page_private(newpage, page_private(page));
+ newfolio->index = folio->index;
+ newfolio->mapping = folio->mapping;
+ folio_ref_add(newfolio, nr); /* add cache reference */
+ if (folio_test_swapbacked(folio)) {
+ __folio_set_swapbacked(newfolio);
+ if (folio_test_swapcache(folio)) {
+ folio_set_swapcache(newfolio);
+ newfolio->private = folio_get_private(folio);
}
} else {
- VM_BUG_ON_PAGE(PageSwapCache(page), page);
+ VM_BUG_ON_FOLIO(folio_test_swapcache(folio), folio);
}
/* Move dirty while page refs frozen and newpage not yet exposed */
- dirty = PageDirty(page);
+ dirty = folio_test_dirty(folio);
if (dirty) {
- ClearPageDirty(page);
- SetPageDirty(newpage);
+ folio_clear_dirty(folio);
+ folio_set_dirty(newfolio);
}
- xas_store(&xas, newpage);
- if (PageTransHuge(page)) {
+ xas_store(&xas, newfolio);
+ if (nr > 1) {
int i;
for (i = 1; i < nr; i++) {
xas_next(&xas);
- xas_store(&xas, newpage);
+ xas_store(&xas, newfolio);
}
}
@@ -452,7 +453,7 @@ int migrate_page_move_mapping(struct address_space *mapping,
* to one less reference.
* We know this isn't the last reference.
*/
- page_ref_unfreeze(page, expected_count - nr);
+ folio_ref_unfreeze(folio, expected_count - nr);
xas_unlock(&xas);
/* Leave irq disabled to prevent preemption while updating stats */
@@ -471,18 +472,18 @@ int migrate_page_move_mapping(struct address_space *mapping,
struct lruvec *old_lruvec, *new_lruvec;
struct mem_cgroup *memcg;
- memcg = page_memcg(page);
+ memcg = folio_memcg(folio);
old_lruvec = mem_cgroup_lruvec(memcg, oldzone->zone_pgdat);
new_lruvec = mem_cgroup_lruvec(memcg, newzone->zone_pgdat);
__mod_lruvec_state(old_lruvec, NR_FILE_PAGES, -nr);
__mod_lruvec_state(new_lruvec, NR_FILE_PAGES, nr);
- if (PageSwapBacked(page) && !PageSwapCache(page)) {
+ if (folio_test_swapbacked(folio) && !folio_test_swapcache(folio)) {
__mod_lruvec_state(old_lruvec, NR_SHMEM, -nr);
__mod_lruvec_state(new_lruvec, NR_SHMEM, nr);
}
#ifdef CONFIG_SWAP
- if (PageSwapCache(page)) {
+ if (folio_test_swapcache(folio)) {
__mod_lruvec_state(old_lruvec, NR_SWAPCACHE, -nr);
__mod_lruvec_state(new_lruvec, NR_SWAPCACHE, nr);
}
@@ -498,11 +499,11 @@ int migrate_page_move_mapping(struct address_space *mapping,
return MIGRATEPAGE_SUCCESS;
}
-EXPORT_SYMBOL(migrate_page_move_mapping);
+EXPORT_SYMBOL(folio_migrate_mapping);
/*
* The expected number of remaining references is the same as that
- * of migrate_page_move_mapping().
+ * of folio_migrate_mapping().
*/
int migrate_huge_page_move_mapping(struct address_space *mapping,
struct page *newpage, struct page *page)
@@ -563,7 +564,7 @@ void migrate_page_states(struct page *newpage, struct page *page)
if (PageMappedToDisk(page))
SetPageMappedToDisk(newpage);
- /* Move dirty on pages not done by migrate_page_move_mapping() */
+ /* Move dirty on pages not done by folio_migrate_mapping() */
if (PageDirty(page))
SetPageDirty(newpage);
@@ -639,11 +640,13 @@ int migrate_page(struct address_space *mapping,
struct page *newpage, struct page *page,
enum migrate_mode mode)
{
+ struct folio *newfolio = page_folio(newpage);
+ struct folio *folio = page_folio(page);
int rc;
- BUG_ON(PageWriteback(page)); /* Writeback must be complete */
+ BUG_ON(folio_test_writeback(folio)); /* Writeback must be complete */
- rc = migrate_page_move_mapping(mapping, newpage, page, 0);
+ rc = folio_migrate_mapping(mapping, newfolio, folio, 0);
if (rc != MIGRATEPAGE_SUCCESS)
return rc;
@@ -2387,7 +2390,7 @@ static void migrate_vma_collect(struct migrate_vma *migrate)
* @page: struct page to check
*
* Pinned pages cannot be migrated. This is the same test as in
- * migrate_page_move_mapping(), except that here we allow migration of a
+ * folio_migrate_mapping(), except that here we allow migration of a
* ZONE_DEVICE page.
*/
static bool migrate_vma_check_page(struct page *page)
--
2.30.2
This replaces activate_page() and eliminates lots of calls to
compound_head(). This saves a net 118 bytes of kernel text. There are still
some redundant calls to page_folio() here which will be removed when
pagevec_lru_move_fn() is converted to use folios.
Signed-off-by: Matthew Wilcox (Oracle) <[email protected]>
Reviewed-by: Christoph Hellwig <[email protected]>
---
include/trace/events/pagemap.h | 14 +++++-------
mm/swap.c | 41 ++++++++++++++++++----------------
2 files changed, 28 insertions(+), 27 deletions(-)
diff --git a/include/trace/events/pagemap.h b/include/trace/events/pagemap.h
index 92ad176210ff..1fd0185d66e8 100644
--- a/include/trace/events/pagemap.h
+++ b/include/trace/events/pagemap.h
@@ -60,23 +60,21 @@ TRACE_EVENT(mm_lru_insertion,
TRACE_EVENT(mm_lru_activate,
- TP_PROTO(struct page *page),
+ TP_PROTO(struct folio *folio),
- TP_ARGS(page),
+ TP_ARGS(folio),
TP_STRUCT__entry(
- __field(struct page *, page )
+ __field(struct folio *, folio )
__field(unsigned long, pfn )
),
TP_fast_assign(
- __entry->page = page;
- __entry->pfn = page_to_pfn(page);
+ __entry->folio = folio;
+ __entry->pfn = folio_pfn(folio);
),
- /* Flag format is based on page-types.c formatting for pagemap */
- TP_printk("page=%p pfn=0x%lx", __entry->page, __entry->pfn)
-
+ TP_printk("folio=%p pfn=0x%lx", __entry->folio, __entry->pfn)
);
#endif /* _TRACE_PAGEMAP_H */
diff --git a/mm/swap.c b/mm/swap.c
index 85969b36b636..c3137e4e1cd8 100644
--- a/mm/swap.c
+++ b/mm/swap.c
@@ -322,15 +322,15 @@ void lru_note_cost_page(struct page *page)
page_is_file_lru(page), thp_nr_pages(page));
}
-static void __activate_page(struct page *page, struct lruvec *lruvec)
+static void __folio_activate(struct folio *folio, struct lruvec *lruvec)
{
- if (!PageActive(page) && !PageUnevictable(page)) {
- int nr_pages = thp_nr_pages(page);
+ if (!folio_test_active(folio) && !folio_test_unevictable(folio)) {
+ int nr_pages = folio_nr_pages(folio);
- del_page_from_lru_list(page, lruvec);
- SetPageActive(page);
- add_page_to_lru_list(page, lruvec);
- trace_mm_lru_activate(page);
+ lruvec_del_folio(lruvec, folio);
+ folio_set_active(folio);
+ lruvec_add_folio(lruvec, folio);
+ trace_mm_lru_activate(folio);
__count_vm_events(PGACTIVATE, nr_pages);
__count_memcg_events(lruvec_memcg(lruvec), PGACTIVATE,
@@ -339,6 +339,11 @@ static void __activate_page(struct page *page, struct lruvec *lruvec)
}
#ifdef CONFIG_SMP
+static void __activate_page(struct page *page, struct lruvec *lruvec)
+{
+ return __folio_activate(page_folio(page), lruvec);
+}
+
static void activate_page_drain(int cpu)
{
struct pagevec *pvec = &per_cpu(lru_pvecs.activate_page, cpu);
@@ -352,16 +357,16 @@ static bool need_activate_page_drain(int cpu)
return pagevec_count(&per_cpu(lru_pvecs.activate_page, cpu)) != 0;
}
-static void activate_page(struct page *page)
+static void folio_activate(struct folio *folio)
{
- page = compound_head(page);
- if (PageLRU(page) && !PageActive(page) && !PageUnevictable(page)) {
+ if (folio_test_lru(folio) && !folio_test_active(folio) &&
+ !folio_test_unevictable(folio)) {
struct pagevec *pvec;
+ folio_get(folio);
local_lock(&lru_pvecs.lock);
pvec = this_cpu_ptr(&lru_pvecs.activate_page);
- get_page(page);
- if (pagevec_add_and_need_flush(pvec, page))
+ if (pagevec_add_and_need_flush(pvec, &folio->page))
pagevec_lru_move_fn(pvec, __activate_page);
local_unlock(&lru_pvecs.lock);
}
@@ -372,17 +377,15 @@ static inline void activate_page_drain(int cpu)
{
}
-static void activate_page(struct page *page)
+static void folio_activate(struct folio *folio)
{
- struct folio *folio = page_folio(page);
struct lruvec *lruvec;
- page = &folio->page;
- if (TestClearPageLRU(page)) {
+ if (folio_test_clear_lru(folio)) {
lruvec = folio_lruvec_lock_irq(folio);
- __activate_page(page, lruvec);
+ __folio_activate(folio, lruvec);
unlock_page_lruvec_irq(lruvec);
- SetPageLRU(page);
+ folio_set_lru(folio);
}
}
#endif
@@ -447,7 +450,7 @@ void mark_page_accessed(struct page *page)
* LRU on the next drain.
*/
if (PageLRU(page))
- activate_page(page);
+ folio_activate(page_folio(page));
else
__lru_cache_activate_page(page);
ClearPageReferenced(page);
--
2.30.2
Turn migrate_page_states() into a wrapper around folio_migrate_flags().
Also convert two functions only called from folio_migrate_flags() to
be folio-based. ksm_migrate_page() becomes folio_migrate_ksm() and
copy_page_owner() becomes folio_copy_owner(). folio_migrate_flags()
alone shrinks by two thirds -- 1967 bytes down to 642 bytes.
Signed-off-by: Matthew Wilcox (Oracle) <[email protected]>
---
include/linux/ksm.h | 4 +-
include/linux/migrate.h | 1 +
include/linux/page_owner.h | 8 ++--
mm/folio-compat.c | 6 +++
mm/ksm.c | 31 ++++++++------
mm/migrate.c | 84 +++++++++++++++++++-------------------
mm/page_owner.c | 10 ++---
7 files changed, 77 insertions(+), 67 deletions(-)
diff --git a/include/linux/ksm.h b/include/linux/ksm.h
index 161e8164abcf..a38a5bca1ba5 100644
--- a/include/linux/ksm.h
+++ b/include/linux/ksm.h
@@ -52,7 +52,7 @@ struct page *ksm_might_need_to_copy(struct page *page,
struct vm_area_struct *vma, unsigned long address);
void rmap_walk_ksm(struct page *page, struct rmap_walk_control *rwc);
-void ksm_migrate_page(struct page *newpage, struct page *oldpage);
+void folio_migrate_ksm(struct folio *newfolio, struct folio *folio);
#else /* !CONFIG_KSM */
@@ -83,7 +83,7 @@ static inline void rmap_walk_ksm(struct page *page,
{
}
-static inline void ksm_migrate_page(struct page *newpage, struct page *oldpage)
+static inline void folio_migrate_ksm(struct folio *newfolio, struct folio *old)
{
}
#endif /* CONFIG_MMU */
diff --git a/include/linux/migrate.h b/include/linux/migrate.h
index eb14495a1f46..ba0a554b3eae 100644
--- a/include/linux/migrate.h
+++ b/include/linux/migrate.h
@@ -51,6 +51,7 @@ extern int migrate_huge_page_move_mapping(struct address_space *mapping,
struct page *newpage, struct page *page);
extern int migrate_page_move_mapping(struct address_space *mapping,
struct page *newpage, struct page *page, int extra_count);
+void folio_migrate_flags(struct folio *newfolio, struct folio *folio);
int folio_migrate_mapping(struct address_space *mapping,
struct folio *newfolio, struct folio *folio, int extra_count);
#else
diff --git a/include/linux/page_owner.h b/include/linux/page_owner.h
index 719bfe5108c5..43c638c51c1f 100644
--- a/include/linux/page_owner.h
+++ b/include/linux/page_owner.h
@@ -12,7 +12,7 @@ extern void __reset_page_owner(struct page *page, unsigned int order);
extern void __set_page_owner(struct page *page,
unsigned int order, gfp_t gfp_mask);
extern void __split_page_owner(struct page *page, unsigned int nr);
-extern void __copy_page_owner(struct page *oldpage, struct page *newpage);
+extern void __folio_copy_owner(struct folio *newfolio, struct folio *old);
extern void __set_page_owner_migrate_reason(struct page *page, int reason);
extern void __dump_page_owner(const struct page *page);
extern void pagetypeinfo_showmixedcount_print(struct seq_file *m,
@@ -36,10 +36,10 @@ static inline void split_page_owner(struct page *page, unsigned int nr)
if (static_branch_unlikely(&page_owner_inited))
__split_page_owner(page, nr);
}
-static inline void copy_page_owner(struct page *oldpage, struct page *newpage)
+static inline void folio_copy_owner(struct folio *newfolio, struct folio *old)
{
if (static_branch_unlikely(&page_owner_inited))
- __copy_page_owner(oldpage, newpage);
+ __folio_copy_owner(newfolio, old);
}
static inline void set_page_owner_migrate_reason(struct page *page, int reason)
{
@@ -63,7 +63,7 @@ static inline void split_page_owner(struct page *page,
unsigned int order)
{
}
-static inline void copy_page_owner(struct page *oldpage, struct page *newpage)
+static inline void folio_copy_owner(struct folio *newfolio, struct folio *folio)
{
}
static inline void set_page_owner_migrate_reason(struct page *page, int reason)
diff --git a/mm/folio-compat.c b/mm/folio-compat.c
index d883d964fd52..3f00ad92d1ff 100644
--- a/mm/folio-compat.c
+++ b/mm/folio-compat.c
@@ -58,4 +58,10 @@ int migrate_page_move_mapping(struct address_space *mapping,
page_folio(page), extra_count);
}
EXPORT_SYMBOL(migrate_page_move_mapping);
+
+void migrate_page_states(struct page *newpage, struct page *page)
+{
+ folio_migrate_flags(page_folio(newpage), page_folio(page));
+}
+EXPORT_SYMBOL(migrate_page_states);
#endif
diff --git a/mm/ksm.c b/mm/ksm.c
index 23d36b59f997..3a70786906eb 100644
--- a/mm/ksm.c
+++ b/mm/ksm.c
@@ -753,7 +753,7 @@ static struct page *get_ksm_page(struct stable_node *stable_node,
/*
* We come here from above when page->mapping or !PageSwapCache
* suggests that the node is stale; but it might be under migration.
- * We need smp_rmb(), matching the smp_wmb() in ksm_migrate_page(),
+ * We need smp_rmb(), matching the smp_wmb() in folio_migrate_ksm(),
* before checking whether node->kpfn has been changed.
*/
smp_rmb();
@@ -854,9 +854,14 @@ static int unmerge_ksm_pages(struct vm_area_struct *vma,
return err;
}
+static inline struct stable_node *folio_stable_node(struct folio *folio)
+{
+ return folio_test_ksm(folio) ? folio_raw_mapping(folio) : NULL;
+}
+
static inline struct stable_node *page_stable_node(struct page *page)
{
- return PageKsm(page) ? page_rmapping(page) : NULL;
+ return folio_stable_node(page_folio(page));
}
static inline void set_page_stable_node(struct page *page,
@@ -2661,26 +2666,26 @@ void rmap_walk_ksm(struct page *page, struct rmap_walk_control *rwc)
}
#ifdef CONFIG_MIGRATION
-void ksm_migrate_page(struct page *newpage, struct page *oldpage)
+void folio_migrate_ksm(struct folio *newfolio, struct folio *folio)
{
struct stable_node *stable_node;
- VM_BUG_ON_PAGE(!PageLocked(oldpage), oldpage);
- VM_BUG_ON_PAGE(!PageLocked(newpage), newpage);
- VM_BUG_ON_PAGE(newpage->mapping != oldpage->mapping, newpage);
+ VM_BUG_ON_FOLIO(!folio_test_locked(folio), folio);
+ VM_BUG_ON_FOLIO(!folio_test_locked(newfolio), newfolio);
+ VM_BUG_ON_FOLIO(newfolio->mapping != folio->mapping, newfolio);
- stable_node = page_stable_node(newpage);
+ stable_node = folio_stable_node(folio);
if (stable_node) {
- VM_BUG_ON_PAGE(stable_node->kpfn != page_to_pfn(oldpage), oldpage);
- stable_node->kpfn = page_to_pfn(newpage);
+ VM_BUG_ON_FOLIO(stable_node->kpfn != folio_pfn(folio), folio);
+ stable_node->kpfn = folio_pfn(newfolio);
/*
- * newpage->mapping was set in advance; now we need smp_wmb()
+ * newfolio->mapping was set in advance; now we need smp_wmb()
* to make sure that the new stable_node->kpfn is visible
- * to get_ksm_page() before it can see that oldpage->mapping
- * has gone stale (or that PageSwapCache has been cleared).
+ * to get_ksm_page() before it can see that folio->mapping
+ * has gone stale (or that folio_test_swapcache has been cleared).
*/
smp_wmb();
- set_page_stable_node(oldpage, NULL);
+ set_page_stable_node(&folio->page, NULL);
}
}
#endif /* CONFIG_MIGRATION */
diff --git a/mm/migrate.c b/mm/migrate.c
index aa4f2310c5bb..a86be2bfc9a1 100644
--- a/mm/migrate.c
+++ b/mm/migrate.c
@@ -538,82 +538,80 @@ int migrate_huge_page_move_mapping(struct address_space *mapping,
}
/*
- * Copy the page to its new location
+ * Copy the flags and some other ancillary information
*/
-void migrate_page_states(struct page *newpage, struct page *page)
+void folio_migrate_flags(struct folio *newfolio, struct folio *folio)
{
- struct folio *folio = page_folio(page);
- struct folio *newfolio = page_folio(newpage);
int cpupid;
- if (PageError(page))
- SetPageError(newpage);
- if (PageReferenced(page))
- SetPageReferenced(newpage);
- if (PageUptodate(page))
- SetPageUptodate(newpage);
- if (TestClearPageActive(page)) {
- VM_BUG_ON_PAGE(PageUnevictable(page), page);
- SetPageActive(newpage);
- } else if (TestClearPageUnevictable(page))
- SetPageUnevictable(newpage);
- if (PageWorkingset(page))
- SetPageWorkingset(newpage);
- if (PageChecked(page))
- SetPageChecked(newpage);
- if (PageMappedToDisk(page))
- SetPageMappedToDisk(newpage);
+ if (folio_test_error(folio))
+ folio_set_error(newfolio);
+ if (folio_test_referenced(folio))
+ folio_set_referenced(newfolio);
+ if (folio_test_uptodate(folio))
+ folio_mark_uptodate(newfolio);
+ if (folio_test_clear_active(folio)) {
+ VM_BUG_ON_FOLIO(folio_test_unevictable(folio), folio);
+ folio_set_active(newfolio);
+ } else if (folio_test_clear_unevictable(folio))
+ folio_set_unevictable(newfolio);
+ if (folio_test_workingset(folio))
+ folio_set_workingset(newfolio);
+ if (folio_test_checked(folio))
+ folio_set_checked(newfolio);
+ if (folio_test_mappedtodisk(folio))
+ folio_set_mappedtodisk(newfolio);
/* Move dirty on pages not done by folio_migrate_mapping() */
- if (PageDirty(page))
- SetPageDirty(newpage);
+ if (folio_test_dirty(folio))
+ folio_set_dirty(newfolio);
- if (page_is_young(page))
- set_page_young(newpage);
- if (page_is_idle(page))
- set_page_idle(newpage);
+ if (folio_test_young(folio))
+ folio_set_young(newfolio);
+ if (folio_test_idle(folio))
+ folio_set_idle(newfolio);
/*
* Copy NUMA information to the new page, to prevent over-eager
* future migrations of this same page.
*/
- cpupid = page_cpupid_xchg_last(page, -1);
- page_cpupid_xchg_last(newpage, cpupid);
+ cpupid = page_cpupid_xchg_last(&folio->page, -1);
+ page_cpupid_xchg_last(&newfolio->page, cpupid);
- ksm_migrate_page(newpage, page);
+ folio_migrate_ksm(newfolio, folio);
/*
* Please do not reorder this without considering how mm/ksm.c's
* get_ksm_page() depends upon ksm_migrate_page() and PageSwapCache().
*/
- if (PageSwapCache(page))
- ClearPageSwapCache(page);
- ClearPagePrivate(page);
+ if (folio_test_swapcache(folio))
+ folio_clear_swapcache(folio);
+ folio_clear_private(folio);
/* page->private contains hugetlb specific flags */
- if (!PageHuge(page))
- set_page_private(page, 0);
+ if (!folio_test_hugetlb(folio))
+ folio->private = NULL;
/*
* If any waiters have accumulated on the new page then
* wake them up.
*/
- if (PageWriteback(newpage))
- end_page_writeback(newpage);
+ if (folio_test_writeback(newfolio))
+ folio_end_writeback(newfolio);
/*
* PG_readahead shares the same bit with PG_reclaim. The above
* end_page_writeback() may clear PG_readahead mistakenly, so set the
* bit after that.
*/
- if (PageReadahead(page))
- SetPageReadahead(newpage);
+ if (folio_test_readahead(folio))
+ folio_set_readahead(newfolio);
- copy_page_owner(page, newpage);
+ folio_copy_owner(newfolio, folio);
- if (!PageHuge(page))
+ if (!folio_test_hugetlb(folio))
mem_cgroup_migrate(folio, newfolio);
}
-EXPORT_SYMBOL(migrate_page_states);
+EXPORT_SYMBOL(folio_migrate_flags);
void migrate_page_copy(struct page *newpage, struct page *page)
{
@@ -654,7 +652,7 @@ int migrate_page(struct address_space *mapping,
if (mode != MIGRATE_SYNC_NO_COPY)
migrate_page_copy(newpage, page);
else
- migrate_page_states(newpage, page);
+ folio_migrate_flags(newfolio, folio);
return MIGRATEPAGE_SUCCESS;
}
EXPORT_SYMBOL(migrate_page);
diff --git a/mm/page_owner.c b/mm/page_owner.c
index f51a57e92aa3..23bfb074ca3f 100644
--- a/mm/page_owner.c
+++ b/mm/page_owner.c
@@ -210,10 +210,10 @@ void __split_page_owner(struct page *page, unsigned int nr)
}
}
-void __copy_page_owner(struct page *oldpage, struct page *newpage)
+void __folio_copy_owner(struct folio *newfolio, struct folio *old)
{
- struct page_ext *old_ext = lookup_page_ext(oldpage);
- struct page_ext *new_ext = lookup_page_ext(newpage);
+ struct page_ext *old_ext = lookup_page_ext(&old->page);
+ struct page_ext *new_ext = lookup_page_ext(&newfolio->page);
struct page_owner *old_page_owner, *new_page_owner;
if (unlikely(!old_ext || !new_ext))
@@ -231,11 +231,11 @@ void __copy_page_owner(struct page *oldpage, struct page *newpage)
new_page_owner->free_ts_nsec = old_page_owner->ts_nsec;
/*
- * We don't clear the bit on the oldpage as it's going to be freed
+ * We don't clear the bit on the old folio as it's going to be freed
* after migration. Until then, the info can be useful in case of
* a bug, and the overall stats will be off a bit only temporarily.
* Also, migrate_misplaced_transhuge_page() can still fail the
- * migration and then we want the oldpage to retain the info. But
+ * migration and then we want the old folio to retain the info. But
* in that case we also don't need to explicitly clear the info from
* the new page, which will be freed.
*/
--
2.30.2
Reimplement set_page_dirty() as a wrapper around folio_mark_dirty().
There is no change to filesystems as they were already being called
with the compound_head of the page being marked dirty. We avoid
several calls to compound_head(), both statically (by using
folio_test_dirty() instead of PageDirty()) and dynamically (by
calling folio_mapping() instead of page_mapping()).
Also return bool instead of int to show the range of values actually
returned, and add kernel-doc.
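As a hedged usage sketch (not part of the diff below; 'folio' and 'page'
are assumed to come from the caller), folio-aware code can dirty a folio
directly while page-based callers keep working through the compat wrapper:

	/* Hypothetical caller sketch. */
	if (folio_mark_dirty(folio))
		;	/* newly dirtied; dirty accounting was done for us */

	set_page_dirty(page);	/* unchanged: folio_mark_dirty(page_folio(page)) */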
Signed-off-by: Matthew Wilcox (Oracle) <[email protected]>
Reviewed-by: Christoph Hellwig <[email protected]>
---
include/linux/mm.h | 3 ++-
mm/folio-compat.c | 6 ++++++
mm/page-writeback.c | 35 +++++++++++++++++++----------------
3 files changed, 27 insertions(+), 17 deletions(-)
diff --git a/include/linux/mm.h b/include/linux/mm.h
index 23276330ef4f..43c1b5731c7f 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -2005,7 +2005,8 @@ int redirty_page_for_writepage(struct writeback_control *wbc,
struct page *page);
void account_page_cleaned(struct page *page, struct address_space *mapping,
struct bdi_writeback *wb);
-int set_page_dirty(struct page *page);
+bool folio_mark_dirty(struct folio *folio);
+bool set_page_dirty(struct page *page);
int set_page_dirty_lock(struct page *page);
void __cancel_dirty_page(struct page *page);
static inline void cancel_dirty_page(struct page *page)
diff --git a/mm/folio-compat.c b/mm/folio-compat.c
index 10ce5582d869..2c2b3917b5dc 100644
--- a/mm/folio-compat.c
+++ b/mm/folio-compat.c
@@ -77,3 +77,9 @@ bool set_page_writeback(struct page *page)
return folio_start_writeback(page_folio(page));
}
EXPORT_SYMBOL(set_page_writeback);
+
+bool set_page_dirty(struct page *page)
+{
+ return folio_mark_dirty(page_folio(page));
+}
+EXPORT_SYMBOL(set_page_dirty);
diff --git a/mm/page-writeback.c b/mm/page-writeback.c
index 0336273154fb..d7c0cad6a57f 100644
--- a/mm/page-writeback.c
+++ b/mm/page-writeback.c
@@ -2564,18 +2564,21 @@ int redirty_page_for_writepage(struct writeback_control *wbc, struct page *page)
}
EXPORT_SYMBOL(redirty_page_for_writepage);
-/*
- * Dirty a page.
+/**
+ * folio_mark_dirty - Mark a folio as being modified.
+ * @folio: The folio.
+ *
+ * For folios with a mapping this should be done under the page lock
+ * for the benefit of asynchronous memory errors who prefer a consistent
+ * dirty state. This rule can be broken in some special cases,
+ * but should be better not to.
*
- * For pages with a mapping this should be done under the page lock for the
- * benefit of asynchronous memory errors who prefer a consistent dirty state.
- * This rule can be broken in some special cases, but should be better not to.
+ * Return: True if the folio was newly dirtied, false if it was already dirty.
*/
-int set_page_dirty(struct page *page)
+bool folio_mark_dirty(struct folio *folio)
{
- struct address_space *mapping = page_mapping(page);
+ struct address_space *mapping = folio_mapping(folio);
- page = compound_head(page);
if (likely(mapping)) {
/*
* readahead/lru_deactivate_page could remain
@@ -2587,17 +2590,17 @@ int set_page_dirty(struct page *page)
* it will confuse readahead and make it restart the size rampup
* process. But it's a trivial problem.
*/
- if (PageReclaim(page))
- ClearPageReclaim(page);
- return mapping->a_ops->set_page_dirty(page);
+ if (folio_test_reclaim(folio))
+ folio_clear_reclaim(folio);
+ return mapping->a_ops->set_page_dirty(&folio->page);
}
- if (!PageDirty(page)) {
- if (!TestSetPageDirty(page))
- return 1;
+ if (!folio_test_dirty(folio)) {
+ if (!folio_test_set_dirty(folio))
+ return true;
}
- return 0;
+ return false;
}
-EXPORT_SYMBOL(set_page_dirty);
+EXPORT_SYMBOL(folio_mark_dirty);
/*
* set_page_dirty() is racy if the caller has no reference against
--
2.30.2
Turn __set_page_dirty() into a wrapper around __folio_mark_dirty().
Convert account_page_dirtied() into folio_account_dirtied() and account
the number of pages in the folio to support multi-page folios.
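To restate the accounting change with assumed numbers (an order-2, 16KiB
folio; this is an illustration of the hunks below, not additional code),
each counter now moves by the number of pages in the folio in one call:

	long nr = folio_nr_pages(folio);		/* 4 for an order-2 folio */
	__lruvec_stat_mod_folio(folio, NR_FILE_DIRTY, nr);
	task_io_account_write(nr * PAGE_SIZE);		/* 4 * PAGE_SIZE bytes */
	current->nr_dirtied += nr;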
Signed-off-by: Matthew Wilcox (Oracle) <[email protected]>
---
include/linux/memcontrol.h | 5 ++---
include/linux/pagemap.h | 7 ++++++-
mm/page-writeback.c | 41 +++++++++++++++++++-------------------
3 files changed, 29 insertions(+), 24 deletions(-)
diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
index 2dd660185bb3..c20adc22ea24 100644
--- a/include/linux/memcontrol.h
+++ b/include/linux/memcontrol.h
@@ -1574,10 +1574,9 @@ void mem_cgroup_wb_stats(struct bdi_writeback *wb, unsigned long *pfilepages,
void mem_cgroup_track_foreign_dirty_slowpath(struct folio *folio,
struct bdi_writeback *wb);
-static inline void mem_cgroup_track_foreign_dirty(struct page *page,
+static inline void mem_cgroup_track_foreign_dirty(struct folio *folio,
struct bdi_writeback *wb)
{
- struct folio *folio = page_folio(page);
if (mem_cgroup_disabled())
return;
@@ -1602,7 +1601,7 @@ static inline void mem_cgroup_wb_stats(struct bdi_writeback *wb,
{
}
-static inline void mem_cgroup_track_foreign_dirty(struct page *page,
+static inline void mem_cgroup_track_foreign_dirty(struct folio *folio,
struct bdi_writeback *wb)
{
}
diff --git a/include/linux/pagemap.h b/include/linux/pagemap.h
index 08f40e004d97..3d88c17fedc9 100644
--- a/include/linux/pagemap.h
+++ b/include/linux/pagemap.h
@@ -773,8 +773,13 @@ void end_page_writeback(struct page *page);
void folio_end_writeback(struct folio *folio);
void wait_for_stable_page(struct page *page);
void folio_wait_stable(struct folio *folio);
+void __folio_mark_dirty(struct folio *folio, struct address_space *, int warn);
+static inline void __set_page_dirty(struct page *page,
+ struct address_space *mapping, int warn)
+{
+ __folio_mark_dirty(page_folio(page), mapping, warn);
+}
-void __set_page_dirty(struct page *, struct address_space *, int warn);
int __set_page_dirty_nobuffers(struct page *page);
int __set_page_dirty_no_writeback(struct page *page);
diff --git a/mm/page-writeback.c b/mm/page-writeback.c
index d7c0cad6a57f..3e02c86eb445 100644
--- a/mm/page-writeback.c
+++ b/mm/page-writeback.c
@@ -2421,29 +2421,30 @@ EXPORT_SYMBOL(__set_page_dirty_no_writeback);
*
* NOTE: This relies on being atomic wrt interrupts.
*/
-static void account_page_dirtied(struct page *page,
+static void folio_account_dirtied(struct folio *folio,
struct address_space *mapping)
{
struct inode *inode = mapping->host;
- trace_writeback_dirty_page(page, mapping);
+ trace_writeback_dirty_page(&folio->page, mapping);
if (mapping_can_writeback(mapping)) {
struct bdi_writeback *wb;
+ long nr = folio_nr_pages(folio);
- inode_attach_wb(inode, page);
+ inode_attach_wb(inode, &folio->page);
wb = inode_to_wb(inode);
- __inc_lruvec_page_state(page, NR_FILE_DIRTY);
- __inc_zone_page_state(page, NR_ZONE_WRITE_PENDING);
- __inc_node_page_state(page, NR_DIRTIED);
- inc_wb_stat(wb, WB_RECLAIMABLE);
- inc_wb_stat(wb, WB_DIRTIED);
- task_io_account_write(PAGE_SIZE);
- current->nr_dirtied++;
- __this_cpu_inc(bdp_ratelimits);
+ __lruvec_stat_mod_folio(folio, NR_FILE_DIRTY, nr);
+ __zone_stat_mod_folio(folio, NR_ZONE_WRITE_PENDING, nr);
+ __node_stat_mod_folio(folio, NR_DIRTIED, nr);
+ wb_stat_mod(wb, WB_RECLAIMABLE, nr);
+ wb_stat_mod(wb, WB_DIRTIED, nr);
+ task_io_account_write(nr * PAGE_SIZE);
+ current->nr_dirtied += nr;
+ __this_cpu_add(bdp_ratelimits, nr);
- mem_cgroup_track_foreign_dirty(page, wb);
+ mem_cgroup_track_foreign_dirty(folio, wb);
}
}
@@ -2464,24 +2465,24 @@ void account_page_cleaned(struct page *page, struct address_space *mapping,
}
/*
- * Mark the page dirty, and set it dirty in the page cache, and mark the inode
- * dirty.
+ * Mark the folio dirty, and set it dirty in the page cache, and mark
+ * the inode dirty.
*
- * If warn is true, then emit a warning if the page is not uptodate and has
+ * If warn is true, then emit a warning if the folio is not uptodate and has
* not been truncated.
*
* The caller must hold lock_page_memcg().
*/
-void __set_page_dirty(struct page *page, struct address_space *mapping,
+void __folio_mark_dirty(struct folio *folio, struct address_space *mapping,
int warn)
{
unsigned long flags;
xa_lock_irqsave(&mapping->i_pages, flags);
- if (page->mapping) { /* Race with truncate? */
- WARN_ON_ONCE(warn && !PageUptodate(page));
- account_page_dirtied(page, mapping);
- __xa_set_mark(&mapping->i_pages, page_index(page),
+ if (folio->mapping) { /* Race with truncate? */
+ WARN_ON_ONCE(warn && !folio_test_uptodate(folio));
+ folio_account_dirtied(folio, mapping);
+ __xa_set_mark(&mapping->i_pages, folio_index(folio),
PAGECACHE_TAG_DIRTY);
}
xa_unlock_irqrestore(&mapping->i_pages, flags);
--
2.30.2
Reimplement __set_page_dirty_nobuffers() as a wrapper around
filemap_dirty_folio().
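As a sketch of how a filesystem that does not use buffer_heads might be
wired up (the function name below is hypothetical; only filemap_dirty_folio()
and the compat wrapper are real), the page-based aop simply forwards to the
folio version:

	/* Hypothetical ->set_page_dirty() for a buffer_head-less filesystem. */
	static int example_set_page_dirty(struct page *page)
	{
		return filemap_dirty_folio(page_mapping(page), page_folio(page));
	}

which is exactly what the __set_page_dirty_nobuffers() wrapper below does.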
Signed-off-by: Matthew Wilcox (Oracle) <[email protected]>
Reviewed-by: Christoph Hellwig <[email protected]>
---
include/linux/writeback.h | 1 +
mm/folio-compat.c | 6 ++++
mm/page-writeback.c | 60 ++++++++++++++++++++-------------------
3 files changed, 38 insertions(+), 29 deletions(-)
diff --git a/include/linux/writeback.h b/include/linux/writeback.h
index 667e86cfbdcf..eda9cc778ef6 100644
--- a/include/linux/writeback.h
+++ b/include/linux/writeback.h
@@ -398,6 +398,7 @@ void writeback_set_ratelimit(void);
void tag_pages_for_writeback(struct address_space *mapping,
pgoff_t start, pgoff_t end);
+bool filemap_dirty_folio(struct address_space *mapping, struct folio *folio);
void account_page_redirty(struct page *page);
void sb_mark_inode_writeback(struct inode *inode);
diff --git a/mm/folio-compat.c b/mm/folio-compat.c
index 2c2b3917b5dc..dad962b920e5 100644
--- a/mm/folio-compat.c
+++ b/mm/folio-compat.c
@@ -83,3 +83,9 @@ bool set_page_dirty(struct page *page)
return folio_mark_dirty(page_folio(page));
}
EXPORT_SYMBOL(set_page_dirty);
+
+int __set_page_dirty_nobuffers(struct page *page)
+{
+ return filemap_dirty_folio(page_mapping(page), page_folio(page));
+}
+EXPORT_SYMBOL(__set_page_dirty_nobuffers);
diff --git a/mm/page-writeback.c b/mm/page-writeback.c
index 2dc410b110ff..bd97c461d499 100644
--- a/mm/page-writeback.c
+++ b/mm/page-writeback.c
@@ -2488,41 +2488,43 @@ void __folio_mark_dirty(struct folio *folio, struct address_space *mapping,
xa_unlock_irqrestore(&mapping->i_pages, flags);
}
-/*
- * For address_spaces which do not use buffers. Just tag the page as dirty in
- * the xarray.
- *
- * This is also used when a single buffer is being dirtied: we want to set the
- * page dirty in that case, but not all the buffers. This is a "bottom-up"
- * dirtying, whereas __set_page_dirty_buffers() is a "top-down" dirtying.
- *
- * The caller must ensure this doesn't race with truncation. Most will simply
- * hold the page lock, but e.g. zap_pte_range() calls with the page mapped and
- * the pte lock held, which also locks out truncation.
+/**
+ * filemap_dirty_folio - Mark a folio dirty for filesystems which do not use buffer_heads.
+ * @mapping: Address space this folio belongs to.
+ * @folio: Folio to be marked as dirty.
+ *
+ * Filesystems which do not use buffer heads should call this function
+ * from their set_page_dirty address space operation. It ignores the
+ * contents of folio_get_private(), so if the filesystem marks individual
+ * blocks as dirty, the filesystem should handle that itself.
+ *
+ * This is also sometimes used by filesystems which use buffer_heads when
+ * a single buffer is being dirtied: we want to set the folio dirty in
+ * that case, but not all the buffers. This is a "bottom-up" dirtying,
+ * whereas __set_page_dirty_buffers() is a "top-down" dirtying.
+ *
+ * The caller must ensure this doesn't race with truncation. Most will
+ * simply hold the folio lock, but e.g. zap_pte_range() calls with the
+ * folio mapped and the pte lock held, which also locks out truncation.
*/
-int __set_page_dirty_nobuffers(struct page *page)
+bool filemap_dirty_folio(struct address_space *mapping, struct folio *folio)
{
- lock_page_memcg(page);
- if (!TestSetPageDirty(page)) {
- struct address_space *mapping = page_mapping(page);
+ folio_memcg_lock(folio);
+ if (folio_test_set_dirty(folio)) {
+ folio_memcg_unlock(folio);
+ return false;
+ }
- if (!mapping) {
- unlock_page_memcg(page);
- return 1;
- }
- __set_page_dirty(page, mapping, !PagePrivate(page));
- unlock_page_memcg(page);
+ __folio_mark_dirty(folio, mapping, !folio_test_private(folio));
+ folio_memcg_unlock(folio);
- if (mapping->host) {
- /* !PageAnon && !swapper_space */
- __mark_inode_dirty(mapping->host, I_DIRTY_PAGES);
- }
- return 1;
+ if (mapping->host) {
+ /* !PageAnon && !swapper_space */
+ __mark_inode_dirty(mapping->host, I_DIRTY_PAGES);
}
- unlock_page_memcg(page);
- return 0;
+ return true;
}
-EXPORT_SYMBOL(__set_page_dirty_nobuffers);
+EXPORT_SYMBOL(filemap_dirty_folio);
/*
* Call this whenever redirtying a page, to de-account the dirty counters
--
2.30.2
Turn __cancel_dirty_page() into __folio_cancel_dirty() and add wrappers.
Move the prototypes into pagemap.h since this is page cache functionality.
Saves 44 bytes of kernel text in total; 33 bytes from __folio_cancel_dirty
and 11 from two callers of cancel_dirty_page().
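A minimal usage sketch (hypothetical caller; not part of this patch):

	/* Folio-aware caller, e.g. when invalidating a whole locked folio. */
	folio_cancel_dirty(folio);	/* no-op unless the dirty flag is set */

	/* Page-based callers keep the old name: */
	cancel_dirty_page(page);	/* == folio_cancel_dirty(page_folio(page)) */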
Signed-off-by: Matthew Wilcox (Oracle) <[email protected]>
Reviewed-by: Christoph Hellwig <[email protected]>
---
include/linux/mm.h | 7 -------
include/linux/pagemap.h | 11 +++++++++++
mm/page-writeback.c | 16 ++++++++--------
3 files changed, 19 insertions(+), 15 deletions(-)
diff --git a/include/linux/mm.h b/include/linux/mm.h
index 481019481d10..07ba22351d15 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -2005,13 +2005,6 @@ int redirty_page_for_writepage(struct writeback_control *wbc,
bool folio_mark_dirty(struct folio *folio);
bool set_page_dirty(struct page *page);
int set_page_dirty_lock(struct page *page);
-void __cancel_dirty_page(struct page *page);
-static inline void cancel_dirty_page(struct page *page)
-{
- /* Avoid atomic ops, locking, etc. when not actually needed. */
- if (PageDirty(page))
- __cancel_dirty_page(page);
-}
int clear_page_dirty_for_io(struct page *page);
int get_cmdline(struct task_struct *task, char *buffer, int buflen);
diff --git a/include/linux/pagemap.h b/include/linux/pagemap.h
index 665ba6a67385..a4d0aeaf884d 100644
--- a/include/linux/pagemap.h
+++ b/include/linux/pagemap.h
@@ -786,6 +786,17 @@ static inline void account_page_cleaned(struct page *page,
{
return folio_account_cleaned(page_folio(page), mapping, wb);
}
+void __folio_cancel_dirty(struct folio *folio);
+static inline void folio_cancel_dirty(struct folio *folio)
+{
+ /* Avoid atomic ops, locking, etc. when not actually needed. */
+ if (folio_test_dirty(folio))
+ __folio_cancel_dirty(folio);
+}
+static inline void cancel_dirty_page(struct page *page)
+{
+ folio_cancel_dirty(page_folio(page));
+}
int __set_page_dirty_nobuffers(struct page *page);
int __set_page_dirty_no_writeback(struct page *page);
diff --git a/mm/page-writeback.c b/mm/page-writeback.c
index 792a83bd3917..0854ef768d06 100644
--- a/mm/page-writeback.c
+++ b/mm/page-writeback.c
@@ -2640,28 +2640,28 @@ EXPORT_SYMBOL(set_page_dirty_lock);
* page without actually doing it through the VM. Can you say "ext3 is
* horribly ugly"? Thought you could.
*/
-void __cancel_dirty_page(struct page *page)
+void __folio_cancel_dirty(struct folio *folio)
{
- struct address_space *mapping = page_mapping(page);
+ struct address_space *mapping = folio_mapping(folio);
if (mapping_can_writeback(mapping)) {
struct inode *inode = mapping->host;
struct bdi_writeback *wb;
struct wb_lock_cookie cookie = {};
- lock_page_memcg(page);
+ folio_memcg_lock(folio);
wb = unlocked_inode_to_wb_begin(inode, &cookie);
- if (TestClearPageDirty(page))
- account_page_cleaned(page, mapping, wb);
+ if (folio_test_clear_dirty(folio))
+ folio_account_cleaned(folio, mapping, wb);
unlocked_inode_to_wb_end(inode, &cookie);
- unlock_page_memcg(page);
+ folio_memcg_unlock(folio);
} else {
- ClearPageDirty(page);
+ folio_clear_dirty(folio);
}
}
-EXPORT_SYMBOL(__cancel_dirty_page);
+EXPORT_SYMBOL(__folio_cancel_dirty);
/*
* Clear a page's dirty flag, while caring for dirty memory accounting.
--
2.30.2
Reimplement i_blocks_per_page() as a wrapper around i_blocks_per_folio().
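A worked example with assumed sizes (4KiB filesystem blocks, so
inode->i_blkbits == 12; illustration only):

	/* 16KiB (order-2) folio: 16384 >> 12 == 4 blocks. */
	unsigned int blocks = i_blocks_per_folio(inode, folio);

	/* A single-page folio gives PAGE_SIZE >> 12 == 1 on a 4KiB-page system. */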
Signed-off-by: Matthew Wilcox (Oracle) <[email protected]>
Reviewed-by: Christoph Hellwig <[email protected]>
---
include/linux/pagemap.h | 18 ++++++++++++------
1 file changed, 12 insertions(+), 6 deletions(-)
diff --git a/include/linux/pagemap.h b/include/linux/pagemap.h
index 006de2d84d06..412db88b8d0c 100644
--- a/include/linux/pagemap.h
+++ b/include/linux/pagemap.h
@@ -1150,19 +1150,25 @@ static inline int page_mkwrite_check_truncate(struct page *page,
}
/**
- * i_blocks_per_page - How many blocks fit in this page.
+ * i_blocks_per_folio - How many blocks fit in this folio.
* @inode: The inode which contains the blocks.
- * @page: The page (head page if the page is a THP).
+ * @folio: The folio.
*
- * If the block size is larger than the size of this page, return zero.
+ * If the block size is larger than the size of this folio, return zero.
*
- * Context: The caller should hold a refcount on the page to prevent it
+ * Context: The caller should hold a refcount on the folio to prevent it
* from being split.
- * Return: The number of filesystem blocks covered by this page.
+ * Return: The number of filesystem blocks covered by this folio.
*/
+static inline
+unsigned int i_blocks_per_folio(struct inode *inode, struct folio *folio)
+{
+ return folio_size(folio) >> inode->i_blkbits;
+}
+
static inline
unsigned int i_blocks_per_page(struct inode *inode, struct page *page)
{
- return thp_size(page) >> inode->i_blkbits;
+ return i_blocks_per_folio(inode, page_folio(page));
}
#endif /* _LINUX_PAGEMAP_H */
--
2.30.2
This function already assumed it was being passed a head page, so
just formalise that.
Signed-off-by: Matthew Wilcox (Oracle) <[email protected]>
---
fs/iomap/buffered-io.c | 18 ++++++++++--------
1 file changed, 10 insertions(+), 8 deletions(-)
diff --git a/fs/iomap/buffered-io.c b/fs/iomap/buffered-io.c
index cd5c2f24cb7e..c15a0ac52a32 100644
--- a/fs/iomap/buffered-io.c
+++ b/fs/iomap/buffered-io.c
@@ -42,11 +42,10 @@ static inline struct iomap_page *to_iomap_page(struct folio *folio)
static struct bio_set iomap_ioend_bioset;
static struct iomap_page *
-iomap_page_create(struct inode *inode, struct page *page)
+iomap_page_create(struct inode *inode, struct folio *folio)
{
- struct folio *folio = page_folio(page);
struct iomap_page *iop = to_iomap_page(folio);
- unsigned int nr_blocks = i_blocks_per_page(inode, page);
+ unsigned int nr_blocks = i_blocks_per_folio(inode, folio);
if (iop || nr_blocks <= 1)
return iop;
@@ -54,9 +53,9 @@ iomap_page_create(struct inode *inode, struct page *page)
iop = kzalloc(struct_size(iop, uptodate, BITS_TO_LONGS(nr_blocks)),
GFP_NOFS | __GFP_NOFAIL);
spin_lock_init(&iop->uptodate_lock);
- if (PageUptodate(page))
+ if (folio_test_uptodate(folio))
bitmap_fill(iop->uptodate, nr_blocks);
- attach_page_private(page, iop);
+ folio_attach_private(folio, iop);
return iop;
}
@@ -235,7 +234,8 @@ iomap_readpage_actor(struct inode *inode, loff_t pos, loff_t length, void *data,
{
struct iomap_readpage_ctx *ctx = data;
struct page *page = ctx->cur_page;
- struct iomap_page *iop = iomap_page_create(inode, page);
+ struct folio *folio = page_folio(page);
+ struct iomap_page *iop = iomap_page_create(inode, folio);
bool same_page = false, is_contig = false;
loff_t orig_pos = pos;
unsigned poff, plen;
@@ -547,7 +547,8 @@ static int
__iomap_write_begin(struct inode *inode, loff_t pos, unsigned len, int flags,
struct page *page, struct iomap *srcmap)
{
- struct iomap_page *iop = iomap_page_create(inode, page);
+ struct folio *folio = page_folio(page);
+ struct iomap_page *iop = iomap_page_create(inode, folio);
loff_t block_size = i_blocksize(inode);
loff_t block_start = round_down(pos, block_size);
loff_t block_end = round_up(pos + len, block_size);
@@ -955,6 +956,7 @@ iomap_page_mkwrite_actor(struct inode *inode, loff_t pos, loff_t length,
void *data, struct iomap *iomap, struct iomap *srcmap)
{
struct page *page = data;
+ struct folio *folio = page_folio(page);
int ret;
if (iomap->flags & IOMAP_F_BUFFER_HEAD) {
@@ -964,7 +966,7 @@ iomap_page_mkwrite_actor(struct inode *inode, loff_t pos, loff_t length,
block_commit_write(page, 0, length);
} else {
WARN_ON_ONCE(!PageUptodate(page));
- iomap_page_create(inode, page);
+ iomap_page_create(inode, folio);
set_page_dirty(page);
}
--
2.30.2
Allow callers to iterate over each folio instead of each page. The
bio need not have been constructed using folios originally.
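A hedged usage sketch (the callee name is hypothetical; the iterator is
what this patch adds): a completion handler can walk a bio folio by folio,
however the bio was built:

	struct folio_iter fi;

	bio_for_each_folio_all(fi, bio) {
		/* fi.folio, fi.offset and fi.length describe one folio chunk */
		example_finish_folio(fi.folio, fi.offset, fi.length);
	}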
Signed-off-by: Matthew Wilcox (Oracle) <[email protected]>
---
include/linux/bio.h | 43 ++++++++++++++++++++++++++++++++++++++++++-
1 file changed, 42 insertions(+), 1 deletion(-)
diff --git a/include/linux/bio.h b/include/linux/bio.h
index ade93e2de6a1..d462bbc95c4b 100644
--- a/include/linux/bio.h
+++ b/include/linux/bio.h
@@ -189,7 +189,7 @@ static inline void bio_advance_iter_single(const struct bio *bio,
*/
#define bio_for_each_bvec_all(bvl, bio, i) \
for (i = 0, bvl = bio_first_bvec_all(bio); \
- i < (bio)->bi_vcnt; i++, bvl++) \
+ i < (bio)->bi_vcnt; i++, bvl++)
#define bio_iter_last(bvec, iter) ((iter).bi_size == (bvec).bv_len)
@@ -314,6 +314,47 @@ static inline struct bio_vec *bio_last_bvec_all(struct bio *bio)
return &bio->bi_io_vec[bio->bi_vcnt - 1];
}
+struct folio_iter {
+ struct folio *folio;
+ size_t offset;
+ size_t length;
+ size_t _seg_count;
+ int _i;
+};
+
+static inline
+void bio_first_folio(struct folio_iter *fi, struct bio *bio, int i)
+{
+ struct bio_vec *bvec = bio_first_bvec_all(bio) + i;
+
+ fi->folio = page_folio(bvec->bv_page);
+ fi->offset = bvec->bv_offset +
+ PAGE_SIZE * (bvec->bv_page - &fi->folio->page);
+ fi->_seg_count = bvec->bv_len;
+ fi->length = min(folio_size(fi->folio) - fi->offset, fi->_seg_count);
+ fi->_i = i;
+}
+
+static inline void bio_next_folio(struct folio_iter *fi, struct bio *bio)
+{
+ fi->_seg_count -= fi->length;
+ if (fi->_seg_count) {
+ fi->folio = folio_next(fi->folio);
+ fi->offset = 0;
+ fi->length = min(folio_size(fi->folio), fi->_seg_count);
+ } else if (fi->_i + 1 < bio->bi_vcnt) {
+ bio_first_folio(fi, bio, fi->_i + 1);
+ } else {
+ fi->folio = NULL;
+ }
+}
+
+/*
+ * Iterate over each folio in a bio.
+ */
+#define bio_for_each_folio_all(fi, bio) \
+ for (bio_first_folio(&fi, bio, 0); fi.folio; bio_next_folio(&fi, bio))
+
enum bip_flags {
BIP_BLOCK_INTEGRITY = 1 << 0, /* block layer owns integrity data */
BIP_MAPPED_INTEGRITY = 1 << 1, /* ref tag has been remapped */
--
2.30.2
iomap_page_release() was also assuming that it was being passed a
head page.
Signed-off-by: Matthew Wilcox (Oracle) <[email protected]>
---
fs/iomap/buffered-io.c | 18 +++++++++++-------
1 file changed, 11 insertions(+), 7 deletions(-)
diff --git a/fs/iomap/buffered-io.c b/fs/iomap/buffered-io.c
index c15a0ac52a32..251ec45426aa 100644
--- a/fs/iomap/buffered-io.c
+++ b/fs/iomap/buffered-io.c
@@ -59,18 +59,18 @@ iomap_page_create(struct inode *inode, struct folio *folio)
return iop;
}
-static void
-iomap_page_release(struct page *page)
+static void iomap_page_release(struct folio *folio)
{
- struct iomap_page *iop = detach_page_private(page);
- unsigned int nr_blocks = i_blocks_per_page(page->mapping->host, page);
+ struct iomap_page *iop = folio_detach_private(folio);
+ unsigned int nr_blocks = i_blocks_per_folio(folio->mapping->host,
+ folio);
if (!iop)
return;
WARN_ON_ONCE(atomic_read(&iop->read_bytes_pending));
WARN_ON_ONCE(atomic_read(&iop->write_bytes_pending));
WARN_ON_ONCE(bitmap_full(iop->uptodate, nr_blocks) !=
- PageUptodate(page));
+ folio_test_uptodate(folio));
kfree(iop);
}
@@ -456,6 +456,8 @@ EXPORT_SYMBOL_GPL(iomap_is_partially_uptodate);
int
iomap_releasepage(struct page *page, gfp_t gfp_mask)
{
+ struct folio *folio = page_folio(page);
+
trace_iomap_releasepage(page->mapping->host, page_offset(page),
PAGE_SIZE);
@@ -466,7 +468,7 @@ iomap_releasepage(struct page *page, gfp_t gfp_mask)
*/
if (PageDirty(page) || PageWriteback(page))
return 0;
- iomap_page_release(page);
+ iomap_page_release(folio);
return 1;
}
EXPORT_SYMBOL_GPL(iomap_releasepage);
@@ -474,6 +476,8 @@ EXPORT_SYMBOL_GPL(iomap_releasepage);
void
iomap_invalidatepage(struct page *page, unsigned int offset, unsigned int len)
{
+ struct folio *folio = page_folio(page);
+
trace_iomap_invalidatepage(page->mapping->host, offset, len);
/*
@@ -483,7 +487,7 @@ iomap_invalidatepage(struct page *page, unsigned int offset, unsigned int len)
if (offset == 0 && len == PAGE_SIZE) {
WARN_ON_ONCE(PageWriteback(page));
cancel_dirty_page(page);
- iomap_page_release(page);
+ iomap_page_release(folio);
}
}
EXPORT_SYMBOL_GPL(iomap_invalidatepage);
--
2.30.2
This is an address_space operation, so its argument must remain as a
struct page, but we can use a folio internally.
Signed-off-by: Matthew Wilcox (Oracle) <[email protected]>
---
fs/iomap/buffered-io.c | 8 ++++----
1 file changed, 4 insertions(+), 4 deletions(-)
diff --git a/fs/iomap/buffered-io.c b/fs/iomap/buffered-io.c
index abb065b50e38..6b41019a51a3 100644
--- a/fs/iomap/buffered-io.c
+++ b/fs/iomap/buffered-io.c
@@ -478,15 +478,15 @@ iomap_invalidatepage(struct page *page, unsigned int offset, unsigned int len)
{
struct folio *folio = page_folio(page);
- trace_iomap_invalidatepage(page->mapping->host, offset, len);
+ trace_iomap_invalidatepage(folio->mapping->host, offset, len);
/*
* If we are invalidating the entire page, clear the dirty state from it
* and release it to avoid unnecessary buildup of the LRU.
*/
- if (offset == 0 && len == PAGE_SIZE) {
- WARN_ON_ONCE(PageWriteback(page));
- cancel_dirty_page(page);
+ if (offset == 0 && len == folio_size(folio)) {
+ WARN_ON_ONCE(folio_test_writeback(folio));
+ folio_cancel_dirty(folio);
iomap_page_release(folio);
}
}
--
2.30.2
If we write to any page in a folio, we have to mark the entire
folio as dirty, and potentially COW the entire folio, because it'll
all get written back as one unit.
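In other words (illustration only; the hunks below are the real change, and
EOF truncation is still handled by folio_mkwrite_check_truncate()), the
faulting page's folio now defines both the position and the length of the
mkwrite range:

	struct folio *folio = page_folio(vmf->page);
	loff_t pos = folio_pos(folio);		/* start of the whole folio */
	size_t length = folio_size(folio);	/* may be larger than PAGE_SIZE */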
Signed-off-by: Matthew Wilcox (Oracle) <[email protected]>
---
fs/iomap/buffered-io.c | 42 +++++++++++++++++++++---------------------
1 file changed, 21 insertions(+), 21 deletions(-)
diff --git a/fs/iomap/buffered-io.c b/fs/iomap/buffered-io.c
index 7c702d6c2f64..a3fe0d36c739 100644
--- a/fs/iomap/buffered-io.c
+++ b/fs/iomap/buffered-io.c
@@ -951,23 +951,23 @@ iomap_truncate_page(struct inode *inode, loff_t pos, bool *did_zero,
}
EXPORT_SYMBOL_GPL(iomap_truncate_page);
-static loff_t
-iomap_page_mkwrite_actor(struct inode *inode, loff_t pos, loff_t length,
- void *data, struct iomap *iomap, struct iomap *srcmap)
+static loff_t iomap_folio_mkwrite_actor(struct inode *inode, loff_t pos,
+ loff_t length, void *data, struct iomap *iomap,
+ struct iomap *srcmap)
{
- struct page *page = data;
- struct folio *folio = page_folio(page);
+ struct folio *folio = data;
int ret;
if (iomap->flags & IOMAP_F_BUFFER_HEAD) {
- ret = __block_write_begin_int(page, pos, length, NULL, iomap);
+ ret = __block_write_begin_int(&folio->page, pos, length, NULL,
+ iomap);
if (ret)
return ret;
- block_commit_write(page, 0, length);
+ block_commit_write(&folio->page, 0, length);
} else {
- WARN_ON_ONCE(!PageUptodate(page));
+ WARN_ON_ONCE(!folio_test_uptodate(folio));
iomap_page_create(inode, folio);
- set_page_dirty(page);
+ folio_mark_dirty(folio);
}
return length;
@@ -975,33 +975,33 @@ iomap_page_mkwrite_actor(struct inode *inode, loff_t pos, loff_t length,
vm_fault_t iomap_page_mkwrite(struct vm_fault *vmf, const struct iomap_ops *ops)
{
- struct page *page = vmf->page;
+ struct folio *folio = page_folio(vmf->page);
struct inode *inode = file_inode(vmf->vma->vm_file);
- unsigned long length;
- loff_t offset;
+ size_t length;
+ loff_t pos;
ssize_t ret;
- lock_page(page);
- ret = page_mkwrite_check_truncate(page, inode);
+ folio_lock(folio);
+ ret = folio_mkwrite_check_truncate(folio, inode);
if (ret < 0)
goto out_unlock;
length = ret;
- offset = page_offset(page);
+ pos = folio_pos(folio);
while (length > 0) {
- ret = iomap_apply(inode, offset, length,
- IOMAP_WRITE | IOMAP_FAULT, ops, page,
- iomap_page_mkwrite_actor);
+ ret = iomap_apply(inode, pos, length,
+ IOMAP_WRITE | IOMAP_FAULT, ops, folio,
+ iomap_folio_mkwrite_actor);
if (unlikely(ret <= 0))
goto out_unlock;
- offset += ret;
+ pos += ret;
length -= ret;
}
- wait_for_stable_page(page);
+ folio_wait_stable(folio);
return VM_FAULT_LOCKED;
out_unlock:
- unlock_page(page);
+ folio_unlock(folio);
return block_page_mkwrite_return(ret);
}
EXPORT_SYMBOL_GPL(iomap_page_mkwrite);
--
2.30.2
We still iterate one block at a time, but now we call compound_head()
less often. Rename file_offset to pos to fit the rest of the file.
Signed-off-by: Matthew Wilcox (Oracle) <[email protected]>
---
fs/iomap/buffered-io.c | 66 +++++++++++++++++++-----------------------
1 file changed, 30 insertions(+), 36 deletions(-)
diff --git a/fs/iomap/buffered-io.c b/fs/iomap/buffered-io.c
index ac33f19325ab..8e767aec8d07 100644
--- a/fs/iomap/buffered-io.c
+++ b/fs/iomap/buffered-io.c
@@ -1252,36 +1252,29 @@ iomap_can_add_to_ioend(struct iomap_writepage_ctx *wpc, loff_t offset,
* first, otherwise finish off the current ioend and start another.
*/
static void
-iomap_add_to_ioend(struct inode *inode, loff_t offset, struct page *page,
+iomap_add_to_ioend(struct inode *inode, loff_t pos, struct folio *folio,
struct iomap_page *iop, struct iomap_writepage_ctx *wpc,
struct writeback_control *wbc, struct list_head *iolist)
{
- sector_t sector = iomap_sector(&wpc->iomap, offset);
+ sector_t sector = iomap_sector(&wpc->iomap, pos);
unsigned len = i_blocksize(inode);
- unsigned poff = offset & (PAGE_SIZE - 1);
- bool merged, same_page = false;
+ size_t poff = offset_in_folio(folio, pos);
- if (!wpc->ioend || !iomap_can_add_to_ioend(wpc, offset, sector)) {
+ if (!wpc->ioend || !iomap_can_add_to_ioend(wpc, pos, sector)) {
if (wpc->ioend)
list_add(&wpc->ioend->io_list, iolist);
- wpc->ioend = iomap_alloc_ioend(inode, wpc, offset, sector, wbc);
+ wpc->ioend = iomap_alloc_ioend(inode, wpc, pos, sector, wbc);
}
- merged = __bio_try_merge_page(wpc->ioend->io_bio, page, len, poff,
- &same_page);
if (iop)
atomic_add(len, &iop->write_bytes_pending);
-
- if (!merged) {
- if (bio_full(wpc->ioend->io_bio, len)) {
- wpc->ioend->io_bio =
- iomap_chain_bio(wpc->ioend->io_bio);
- }
- bio_add_page(wpc->ioend->io_bio, page, len, poff);
+ if (!bio_add_folio(wpc->ioend->io_bio, folio, len, poff)) {
+ wpc->ioend->io_bio = iomap_chain_bio(wpc->ioend->io_bio);
+ bio_add_folio(wpc->ioend->io_bio, folio, len, poff);
}
wpc->ioend->io_size += len;
- wbc_account_cgroup_owner(wbc, page, len);
+ wbc_account_cgroup_owner(wbc, &folio->page, len);
}
/*
@@ -1309,40 +1302,41 @@ iomap_writepage_map(struct iomap_writepage_ctx *wpc,
struct iomap_page *iop = to_iomap_page(folio);
struct iomap_ioend *ioend, *next;
unsigned len = i_blocksize(inode);
- u64 file_offset; /* file offset of page */
+ unsigned nblocks = i_blocks_per_folio(inode, folio);
+ loff_t pos = folio_pos(folio);
int error = 0, count = 0, i;
LIST_HEAD(submit_list);
- WARN_ON_ONCE(i_blocks_per_page(inode, page) > 1 && !iop);
+ WARN_ON_ONCE(nblocks > 1 && !iop);
WARN_ON_ONCE(iop && atomic_read(&iop->write_bytes_pending) != 0);
/*
- * Walk through the page to find areas to write back. If we run off the
- * end of the current map or find the current map invalid, grab a new
- * one.
+ * Walk through the folio to find areas to write back. If we
+ * run off the end of the current map or find the current map
+ * invalid, grab a new one.
*/
- for (i = 0, file_offset = page_offset(page);
- i < (PAGE_SIZE >> inode->i_blkbits) && file_offset < end_offset;
- i++, file_offset += len) {
+ for (i = 0; i < nblocks; i++, pos += len) {
+ if (pos >= end_offset)
+ break;
if (iop && !test_bit(i, iop->uptodate))
continue;
- error = wpc->ops->map_blocks(wpc, inode, file_offset);
+ error = wpc->ops->map_blocks(wpc, inode, pos);
if (error)
break;
if (WARN_ON_ONCE(wpc->iomap.type == IOMAP_INLINE))
continue;
if (wpc->iomap.type == IOMAP_HOLE)
continue;
- iomap_add_to_ioend(inode, file_offset, page, iop, wpc, wbc,
+ iomap_add_to_ioend(inode, pos, folio, iop, wpc, wbc,
&submit_list);
count++;
}
WARN_ON_ONCE(!wpc->ioend && !list_empty(&submit_list));
- WARN_ON_ONCE(!PageLocked(page));
- WARN_ON_ONCE(PageWriteback(page));
- WARN_ON_ONCE(PageDirty(page));
+ WARN_ON_ONCE(!folio_test_locked(folio));
+ WARN_ON_ONCE(folio_test_writeback(folio));
+ WARN_ON_ONCE(folio_test_dirty(folio));
/*
* We cannot cancel the ioend directly here on error. We may have
@@ -1358,16 +1352,16 @@ iomap_writepage_map(struct iomap_writepage_ctx *wpc,
* now.
*/
if (wpc->ops->discard_page)
- wpc->ops->discard_page(page, file_offset);
+ wpc->ops->discard_page(&folio->page, pos);
if (!count) {
- ClearPageUptodate(page);
- unlock_page(page);
+ folio_clear_uptodate(folio);
+ folio_unlock(folio);
goto done;
}
}
- set_page_writeback(page);
- unlock_page(page);
+ folio_start_writeback(folio);
+ folio_unlock(folio);
/*
* Preserve the original error if there was one, otherwise catch
@@ -1388,9 +1382,9 @@ iomap_writepage_map(struct iomap_writepage_ctx *wpc,
* with a partial page truncate on a sub-page block sized filesystem.
*/
if (!count)
- end_page_writeback(page);
+ folio_end_writeback(folio);
done:
- mapping_set_error(page->mapping, error);
+ mapping_set_error(folio->mapping, error);
return error;
}
--
2.30.2
Writeback an entire folio at a time, and adjust some of the variables
to have more familiar names.
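A worked example with assumed numbers (a 16KiB folio at file offset 64KiB;
illustration only):

	loff_t end_pos = folio_pos(folio) + folio_size(folio);
	/*
	 * == 65536 + 16384 == 81920, computed once for the whole folio,
	 * where the old code derived (page->index + 1) << PAGE_SHIFT for
	 * each PAGE_SIZE page.
	 */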
Signed-off-by: Matthew Wilcox (Oracle) <[email protected]>
---
fs/iomap/buffered-io.c | 49 +++++++++++++++++++-----------------------
1 file changed, 22 insertions(+), 27 deletions(-)
diff --git a/fs/iomap/buffered-io.c b/fs/iomap/buffered-io.c
index 8e767aec8d07..0731e2c3f44b 100644
--- a/fs/iomap/buffered-io.c
+++ b/fs/iomap/buffered-io.c
@@ -1296,9 +1296,8 @@ iomap_add_to_ioend(struct inode *inode, loff_t pos, struct folio *folio,
static int
iomap_writepage_map(struct iomap_writepage_ctx *wpc,
struct writeback_control *wbc, struct inode *inode,
- struct page *page, u64 end_offset)
+ struct folio *folio, loff_t end_pos)
{
- struct folio *folio = page_folio(page);
struct iomap_page *iop = to_iomap_page(folio);
struct iomap_ioend *ioend, *next;
unsigned len = i_blocksize(inode);
@@ -1316,7 +1315,7 @@ iomap_writepage_map(struct iomap_writepage_ctx *wpc,
* invalid, grab a new one.
*/
for (i = 0; i < nblocks; i++, pos += len) {
- if (pos >= end_offset)
+ if (pos >= end_pos)
break;
if (iop && !test_bit(i, iop->uptodate))
continue;
@@ -1398,16 +1397,15 @@ iomap_writepage_map(struct iomap_writepage_ctx *wpc,
static int
iomap_do_writepage(struct page *page, struct writeback_control *wbc, void *data)
{
+ struct folio *folio = page_folio(page);
struct iomap_writepage_ctx *wpc = data;
- struct inode *inode = page->mapping->host;
- pgoff_t end_index;
- u64 end_offset;
- loff_t offset;
+ struct inode *inode = folio->mapping->host;
+ loff_t end_pos, isize;
- trace_iomap_writepage(inode, page_offset(page), PAGE_SIZE);
+ trace_iomap_writepage(inode, folio_pos(folio), folio_size(folio));
/*
- * Refuse to write the page out if we are called from reclaim context.
+ * Refuse to write the folio out if we are called from reclaim context.
*
* This avoids stack overflows when called from deeply used stacks in
* random callers for direct reclaim or memcg reclaim. We explicitly
@@ -1421,10 +1419,10 @@ iomap_do_writepage(struct page *page, struct writeback_control *wbc, void *data)
goto redirty;
/*
- * Is this page beyond the end of the file?
+ * Is this folio beyond the end of the file?
*
- * The page index is less than the end_index, adjust the end_offset
- * to the highest offset that this page should represent.
+ * The folio index is less than the end_index, adjust the end_pos
+ * to the highest offset that this folio should represent.
* -----------------------------------------------------
* | file mapping | <EOF> |
* -----------------------------------------------------
@@ -1433,11 +1431,9 @@ iomap_do_writepage(struct page *page, struct writeback_control *wbc, void *data)
* | desired writeback range | see else |
* ---------------------------------^------------------|
*/
- offset = i_size_read(inode);
- end_index = offset >> PAGE_SHIFT;
- if (page->index < end_index)
- end_offset = (loff_t)(page->index + 1) << PAGE_SHIFT;
- else {
+ isize = i_size_read(inode);
+ end_pos = folio_pos(folio) + folio_size(folio);
+ if (end_pos - 1 >= isize) {
/*
* Check whether the page to write out is beyond or straddles
* i_size or not.
@@ -1449,7 +1445,8 @@ iomap_do_writepage(struct page *page, struct writeback_control *wbc, void *data)
* | | Straddles |
* ---------------------------------^-----------|--------|
*/
- unsigned offset_into_page = offset & (PAGE_SIZE - 1);
+ size_t poff = offset_in_folio(folio, isize);
+ pgoff_t end_index = isize >> PAGE_SHIFT;
/*
* Skip the page if it is fully outside i_size, e.g. due to a
@@ -1468,8 +1465,8 @@ iomap_do_writepage(struct page *page, struct writeback_control *wbc, void *data)
* if the page to write is totally beyond the i_size or if it's
* offset is just equal to the EOF.
*/
- if (page->index > end_index ||
- (page->index == end_index && offset_into_page == 0))
+ if (folio->index > end_index ||
+ (folio->index == end_index && poff == 0))
goto redirty;
/*
@@ -1480,17 +1477,15 @@ iomap_do_writepage(struct page *page, struct writeback_control *wbc, void *data)
* memory is zeroed when mapped, and writes to that region are
* not written out to the file."
*/
- zero_user_segment(page, offset_into_page, PAGE_SIZE);
-
- /* Adjust the end_offset to the end of file */
- end_offset = offset;
+ zero_user_segment(&folio->page, poff, folio_size(folio));
+ end_pos = isize;
}
- return iomap_writepage_map(wpc, wbc, inode, page, end_offset);
+ return iomap_writepage_map(wpc, wbc, inode, folio, end_pos);
redirty:
- redirty_page_for_writepage(wbc, page);
- unlock_page(page);
+ folio_redirty_for_writepage(wbc, folio);
+ folio_unlock(folio);
return 0;
}
--
2.30.2
It was already assuming a head page, so this is a straightforward
conversion. Convert the one caller to call page_folio(), even though
it must currently be passing in a head page.
Signed-off-by: Matthew Wilcox (Oracle) <[email protected]>
---
mm/filemap.c | 20 ++++++++++----------
1 file changed, 10 insertions(+), 10 deletions(-)
diff --git a/mm/filemap.c b/mm/filemap.c
index 0434c5a55fec..c96febad32fc 100644
--- a/mm/filemap.c
+++ b/mm/filemap.c
@@ -120,27 +120,26 @@
*/
static void page_cache_delete(struct address_space *mapping,
- struct page *page, void *shadow)
+ struct folio *folio, void *shadow)
{
- XA_STATE(xas, &mapping->i_pages, page->index);
+ XA_STATE(xas, &mapping->i_pages, folio->index);
unsigned int nr = 1;
mapping_set_update(&xas, mapping);
/* hugetlb pages are represented by a single entry in the xarray */
- if (!PageHuge(page)) {
- xas_set_order(&xas, page->index, compound_order(page));
- nr = compound_nr(page);
+ if (!folio_test_hugetlb(folio)) {
+ xas_set_order(&xas, folio->index, folio_order(folio));
+ nr = folio_nr_pages(folio);
}
- VM_BUG_ON_PAGE(!PageLocked(page), page);
- VM_BUG_ON_PAGE(PageTail(page), page);
- VM_BUG_ON_PAGE(nr != 1 && shadow, page);
+ VM_BUG_ON_FOLIO(!folio_test_locked(folio), folio);
+ VM_BUG_ON_FOLIO(nr != 1 && shadow, folio);
xas_store(&xas, shadow);
xas_init_marks(&xas);
- page->mapping = NULL;
+ folio->mapping = NULL;
/* Leave page->index set: truncation lookup relies upon it */
mapping->nrpages -= nr;
}
@@ -222,12 +221,13 @@ static void unaccount_page_cache_page(struct address_space *mapping,
*/
void __delete_from_page_cache(struct page *page, void *shadow)
{
+ struct folio *folio = page_folio(page);
struct address_space *mapping = page->mapping;
trace_mm_filemap_delete_from_page_cache(page);
unaccount_page_cache_page(mapping, page);
- page_cache_delete(mapping, page, shadow);
+ page_cache_delete(mapping, folio, shadow);
}
static void page_cache_free_page(struct address_space *mapping,
--
2.30.2
None of the callers of find_get_pages_contig() want tail pages. They all
use order-0 pages today, but if they were converted, they'd want folios.
So just remove the call to find_subpage() instead of replacing it with
folio_page().
Signed-off-by: Matthew Wilcox (Oracle) <[email protected]>
---
mm/filemap.c | 17 ++++++++---------
1 file changed, 8 insertions(+), 9 deletions(-)
diff --git a/mm/filemap.c b/mm/filemap.c
index 04501bf50448..5a273d07eae6 100644
--- a/mm/filemap.c
+++ b/mm/filemap.c
@@ -2141,36 +2141,35 @@ unsigned find_get_pages_contig(struct address_space *mapping, pgoff_t index,
unsigned int nr_pages, struct page **pages)
{
XA_STATE(xas, &mapping->i_pages, index);
- struct page *page;
+ struct folio *folio;
unsigned int ret = 0;
if (unlikely(!nr_pages))
return 0;
rcu_read_lock();
- for (page = xas_load(&xas); page; page = xas_next(&xas)) {
- if (xas_retry(&xas, page))
+ for (folio = xas_load(&xas); folio; folio = xas_next(&xas)) {
+ if (xas_retry(&xas, folio))
continue;
/*
* If the entry has been swapped out, we can stop looking.
* No current caller is looking for DAX entries.
*/
- if (xa_is_value(page))
+ if (xa_is_value(folio))
break;
- if (!page_cache_get_speculative(page))
+ if (!folio_try_get_rcu(folio))
goto retry;
- /* Has the page moved or been split? */
- if (unlikely(page != xas_reload(&xas)))
+ if (unlikely(folio != xas_reload(&xas)))
goto put_page;
- pages[ret] = find_subpage(page, xas.xa_index);
+ pages[ret] = &folio->page;
if (++ret == nr_pages)
break;
continue;
put_page:
- put_page(page);
+ folio_put(folio);
retry:
xas_reset(&xas);
}
--
2.30.2
This is all internal to filemap and saves 100 bytes of text.
Signed-off-by: Matthew Wilcox (Oracle) <[email protected]>
---
mm/filemap.c | 18 +++++++++---------
1 file changed, 9 insertions(+), 9 deletions(-)
diff --git a/mm/filemap.c b/mm/filemap.c
index 5e2a2db1c715..7eda9afb0600 100644
--- a/mm/filemap.c
+++ b/mm/filemap.c
@@ -2396,32 +2396,32 @@ static int filemap_update_page(struct kiocb *iocb,
return error;
}
-static int filemap_create_page(struct file *file,
+static int filemap_create_folio(struct file *file,
struct address_space *mapping, pgoff_t index,
struct pagevec *pvec)
{
- struct page *page;
+ struct folio *folio;
int error;
- page = page_cache_alloc(mapping);
- if (!page)
+ folio = filemap_alloc_folio(mapping_gfp_mask(mapping), 0);
+ if (!folio)
return -ENOMEM;
- error = add_to_page_cache_lru(page, mapping, index,
+ error = filemap_add_folio(mapping, folio, index,
mapping_gfp_constraint(mapping, GFP_KERNEL));
if (error == -EEXIST)
error = AOP_TRUNCATED_PAGE;
if (error)
goto error;
- error = filemap_read_folio(file, mapping, page_folio(page));
+ error = filemap_read_folio(file, mapping, folio);
if (error)
goto error;
- pagevec_add(pvec, page);
+ pagevec_add(pvec, &folio->page);
return 0;
error:
- put_page(page);
+ folio_put(folio);
return error;
}
@@ -2463,7 +2463,7 @@ static int filemap_get_pages(struct kiocb *iocb, struct iov_iter *iter,
if (!pagevec_count(pvec)) {
if (iocb->ki_flags & (IOCB_NOWAIT | IOCB_WAITQ))
return -EAGAIN;
- err = filemap_create_page(filp, mapping,
+ err = filemap_create_folio(filp, mapping,
iocb->ki_pos >> PAGE_SHIFT, pvec);
if (err == AOP_TRUNCATED_PAGE)
goto retry;
--
2.30.2
Reimplement __delete_from_page_cache() as a wrapper around
__filemap_remove_folio() and delete_from_page_cache() as a wrapper
around filemap_remove_folio(). Remove the EXPORT_SYMBOL as
delete_from_page_cache() was not used by any in-tree modules.
Convert page_cache_free_page() into filemap_free_folio().
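A minimal usage sketch (hypothetical caller, not from this patch):

	/* The folio must be locked, in the page cache, and held by the caller. */
	folio_lock(folio);
	if (folio->mapping)			/* not already truncated */
		filemap_remove_folio(folio);
	folio_unlock(folio);
	folio_put(folio);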
Signed-off-by: Matthew Wilcox (Oracle) <[email protected]>
---
include/linux/pagemap.h | 9 +++++++--
mm/filemap.c | 44 ++++++++++++++++++++---------------------
mm/folio-compat.c | 5 +++++
3 files changed, 33 insertions(+), 25 deletions(-)
diff --git a/include/linux/pagemap.h b/include/linux/pagemap.h
index f6a2a2589009..245554ce6b12 100644
--- a/include/linux/pagemap.h
+++ b/include/linux/pagemap.h
@@ -877,8 +877,13 @@ int add_to_page_cache_lru(struct page *page, struct address_space *mapping,
pgoff_t index, gfp_t gfp);
int filemap_add_folio(struct address_space *mapping, struct folio *folio,
pgoff_t index, gfp_t gfp);
-extern void delete_from_page_cache(struct page *page);
-extern void __delete_from_page_cache(struct page *page, void *shadow);
+void filemap_remove_folio(struct folio *folio);
+void delete_from_page_cache(struct page *page);
+void __filemap_remove_folio(struct folio *folio, void *shadow);
+static inline void __delete_from_page_cache(struct page *page, void *shadow)
+{
+ __filemap_remove_folio(page_folio(page), shadow);
+}
void replace_page_cache_page(struct page *old, struct page *new);
void delete_from_page_cache_batch(struct address_space *mapping,
struct pagevec *pvec);
diff --git a/mm/filemap.c b/mm/filemap.c
index 6e8b195edf19..4a81eaff363e 100644
--- a/mm/filemap.c
+++ b/mm/filemap.c
@@ -219,55 +219,53 @@ static void filemap_unaccount_folio(struct address_space *mapping,
* sure the page is locked and that nobody else uses it - or that usage
* is safe. The caller must hold the i_pages lock.
*/
-void __delete_from_page_cache(struct page *page, void *shadow)
+void __filemap_remove_folio(struct folio *folio, void *shadow)
{
- struct folio *folio = page_folio(page);
- struct address_space *mapping = page->mapping;
+ struct address_space *mapping = folio->mapping;
- trace_mm_filemap_delete_from_page_cache(page);
+ trace_mm_filemap_delete_from_page_cache(&folio->page);
filemap_unaccount_folio(mapping, folio);
page_cache_delete(mapping, folio, shadow);
}
-static void page_cache_free_page(struct address_space *mapping,
- struct page *page)
+static void filemap_free_folio(struct address_space *mapping,
+ struct folio *folio)
{
void (*freepage)(struct page *);
freepage = mapping->a_ops->freepage;
if (freepage)
- freepage(page);
+ freepage(&folio->page);
- if (PageTransHuge(page) && !PageHuge(page)) {
- page_ref_sub(page, thp_nr_pages(page));
- VM_BUG_ON_PAGE(page_count(page) <= 0, page);
+ if (folio_multi(folio) && !folio_test_hugetlb(folio)) {
+ folio_ref_sub(folio, folio_nr_pages(folio));
+ VM_BUG_ON_FOLIO(folio_ref_count(folio) <= 0, folio);
} else {
- put_page(page);
+ folio_put(folio);
}
}
/**
- * delete_from_page_cache - delete page from page cache
- * @page: the page which the kernel is trying to remove from page cache
+ * filemap_remove_folio - Remove folio from page cache.
+ * @folio: The folio.
*
- * This must be called only on pages that have been verified to be in the page
- * cache and locked. It will never put the page into the free list, the caller
- * has a reference on the page.
+ * This must be called only on folios that are locked and have been
+ * verified to be in the page cache. It will never put the folio into
+ * the free list because the caller has a reference on the page.
*/
-void delete_from_page_cache(struct page *page)
+void filemap_remove_folio(struct folio *folio)
{
- struct address_space *mapping = page_mapping(page);
+ struct address_space *mapping = folio->mapping;
unsigned long flags;
- BUG_ON(!PageLocked(page));
+ BUG_ON(!folio_test_locked(folio));
xa_lock_irqsave(&mapping->i_pages, flags);
- __delete_from_page_cache(page, NULL);
+ __filemap_remove_folio(folio, NULL);
xa_unlock_irqrestore(&mapping->i_pages, flags);
- page_cache_free_page(mapping, page);
+ filemap_free_folio(mapping, folio);
}
-EXPORT_SYMBOL(delete_from_page_cache);
/*
* page_cache_delete_batch - delete several pages from page cache
@@ -350,7 +348,7 @@ void delete_from_page_cache_batch(struct address_space *mapping,
xa_unlock_irqrestore(&mapping->i_pages, flags);
for (i = 0; i < pagevec_count(pvec); i++)
- page_cache_free_page(mapping, pvec->pages[i]);
+ filemap_free_folio(mapping, page_folio(pvec->pages[i]));
}
int filemap_check_errors(struct address_space *mapping)
diff --git a/mm/folio-compat.c b/mm/folio-compat.c
index 5b6ae1da314e..749a695b4217 100644
--- a/mm/folio-compat.c
+++ b/mm/folio-compat.c
@@ -140,3 +140,8 @@ struct page *grab_cache_page_write_begin(struct address_space *mapping,
mapping_gfp_mask(mapping));
}
EXPORT_SYMBOL(grab_cache_page_write_begin);
+
+void delete_from_page_cache(struct page *page)
+{
+ return filemap_remove_folio(page_folio(page));
+}
--
2.30.2
This saves 105 bytes of text.
Signed-off-by: Matthew Wilcox (Oracle) <[email protected]>
---
mm/filemap.c | 30 +++++++++++++++---------------
1 file changed, 15 insertions(+), 15 deletions(-)
diff --git a/mm/filemap.c b/mm/filemap.c
index 717b0d262306..545323a77c1c 100644
--- a/mm/filemap.c
+++ b/mm/filemap.c
@@ -3105,43 +3105,43 @@ static bool filemap_map_pmd(struct vm_fault *vmf, struct page *page)
return false;
}
-static struct page *next_uptodate_page(struct page *page,
+static struct page *next_uptodate_page(struct folio *folio,
struct address_space *mapping,
struct xa_state *xas, pgoff_t end_pgoff)
{
unsigned long max_idx;
do {
- if (!page)
+ if (!folio)
return NULL;
- if (xas_retry(xas, page))
+ if (xas_retry(xas, folio))
continue;
- if (xa_is_value(page))
+ if (xa_is_value(folio))
continue;
- if (PageLocked(page))
+ if (folio_test_locked(folio))
continue;
- if (!page_cache_get_speculative(page))
+ if (!folio_try_get_rcu(folio))
continue;
/* Has the page moved or been split? */
- if (unlikely(page != xas_reload(xas)))
+ if (unlikely(folio != xas_reload(xas)))
goto skip;
- if (!PageUptodate(page) || PageReadahead(page))
+ if (!folio_test_uptodate(folio) || folio_test_readahead(folio))
goto skip;
- if (!trylock_page(page))
+ if (!folio_trylock(folio))
goto skip;
- if (page->mapping != mapping)
+ if (folio->mapping != mapping)
goto unlock;
- if (!PageUptodate(page))
+ if (!folio_test_uptodate(folio))
goto unlock;
max_idx = DIV_ROUND_UP(i_size_read(mapping->host), PAGE_SIZE);
if (xas->xa_index >= max_idx)
goto unlock;
- return page;
+ return &folio->page;
unlock:
- unlock_page(page);
+ folio_unlock(folio);
skip:
- put_page(page);
- } while ((page = xas_next_entry(xas, end_pgoff)) != NULL);
+ folio_put(folio);
+ } while ((folio = xas_next_entry(xas, end_pgoff)) != NULL);
return NULL;
}
--
2.30.2
All callers now expect head (and base) pages, and can handle multiple
head pages in a single batch, so make find_get_entries() behave that way.
Also take the opportunity to make it use the pagevec infrastructure
instead of open-coding how pvecs behave. This has the side-effect of
being able to append to a pagevec with existing contents, although we
don't make use of that functionality anywhere yet.
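A usage sketch in the style of the truncate/shmem callers (hypothetical;
error handling and locking of the returned pages are omitted):

	struct pagevec pvec;
	pgoff_t indices[PAGEVEC_SIZE];
	pgoff_t index = start;
	int i;

	pagevec_init(&pvec);
	while (find_get_entries(mapping, index, end, &pvec, indices)) {
		for (i = 0; i < pagevec_count(&pvec); i++) {
			struct page *page = pvec.pages[i];

			index = indices[i];
			if (xa_is_value(page))
				continue;	/* shadow or swap entry */
			/* ... operate on the head page / folio ... */
		}
		pagevec_remove_exceptionals(&pvec);
		pagevec_release(&pvec);
		index++;
	}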
Signed-off-by: Matthew Wilcox (Oracle) <[email protected]>
Reviewed-by: Jan Kara <[email protected]>
Reviewed-by: William Kucharski <[email protected]>
---
include/linux/pagemap.h | 2 --
mm/filemap.c | 40 ++++++++++------------------------------
mm/internal.h | 2 ++
3 files changed, 12 insertions(+), 32 deletions(-)
diff --git a/include/linux/pagemap.h b/include/linux/pagemap.h
index 842d130fd6d3..bf8e978a48f2 100644
--- a/include/linux/pagemap.h
+++ b/include/linux/pagemap.h
@@ -502,8 +502,6 @@ static inline struct page *find_subpage(struct page *head, pgoff_t index)
return head + (index & (thp_nr_pages(head) - 1));
}
-unsigned find_get_entries(struct address_space *mapping, pgoff_t start,
- pgoff_t end, struct pagevec *pvec, pgoff_t *indices);
unsigned find_get_pages_range(struct address_space *mapping, pgoff_t *start,
pgoff_t end, unsigned int nr_pages,
struct page **pages);
diff --git a/mm/filemap.c b/mm/filemap.c
index 1c0c2663c57d..20434d7bdad8 100644
--- a/mm/filemap.c
+++ b/mm/filemap.c
@@ -1955,49 +1955,29 @@ static inline struct folio *find_get_entry(struct xa_state *xas, pgoff_t max,
* the mapping. The entries are placed in @pvec. find_get_entries()
* takes a reference on any actual pages it returns.
*
- * The search returns a group of mapping-contiguous page cache entries
- * with ascending indexes. There may be holes in the indices due to
- * not-present pages.
+ * The entries have ascending indexes. The indices may not be consecutive
+ * due to not-present entries or THPs.
*
* Any shadow entries of evicted pages, or swap entries from
* shmem/tmpfs, are included in the returned array.
*
- * If it finds a Transparent Huge Page, head or tail, find_get_entries()
- * stops at that page: the caller is likely to have a better way to handle
- * the compound page as a whole, and then skip its extent, than repeatedly
- * calling find_get_entries() to return all its tails.
- *
- * Return: the number of pages and shadow entries which were found.
+ * Return: The number of entries which were found.
*/
unsigned find_get_entries(struct address_space *mapping, pgoff_t start,
pgoff_t end, struct pagevec *pvec, pgoff_t *indices)
{
XA_STATE(xas, &mapping->i_pages, start);
- struct page *page;
- unsigned int ret = 0;
- unsigned nr_entries = PAGEVEC_SIZE;
+ struct folio *folio;
rcu_read_lock();
- while ((page = &find_get_entry(&xas, end, XA_PRESENT)->page)) {
- /*
- * Terminate early on finding a THP, to allow the caller to
- * handle it all at once; but continue if this is hugetlbfs.
- */
- if (!xa_is_value(page) && PageTransHuge(page) &&
- !PageHuge(page)) {
- page = find_subpage(page, xas.xa_index);
- nr_entries = ret + 1;
- }
-
- indices[ret] = xas.xa_index;
- pvec->pages[ret] = page;
- if (++ret == nr_entries)
+ while ((folio = find_get_entry(&xas, end, XA_PRESENT)) != NULL) {
+ indices[pvec->nr] = xas.xa_index;
+ if (!pagevec_add(pvec, &folio->page))
break;
}
rcu_read_unlock();
- pvec->nr = ret;
- return ret;
+ return pagevec_count(pvec);
}
/**
@@ -2016,8 +1996,8 @@ unsigned find_get_entries(struct address_space *mapping, pgoff_t start,
* not returned.
*
* The entries have ascending indexes. The indices may not be consecutive
- * due to not-present entries, THP pages, pages which could not be locked
- * or pages under writeback.
+ * due to not-present entries, THPs, pages which could not be locked or
+ * pages under writeback.
*
* Return: The number of entries which were found.
*/
diff --git a/mm/internal.h b/mm/internal.h
index 3c0c807eddc6..3e32064df18d 100644
--- a/mm/internal.h
+++ b/mm/internal.h
@@ -70,6 +70,8 @@ static inline void force_page_cache_readahead(struct address_space *mapping,
unsigned find_lock_entries(struct address_space *mapping, pgoff_t start,
pgoff_t end, struct pagevec *pvec, pgoff_t *indices);
+unsigned find_get_entries(struct address_space *mapping, pgoff_t start,
+ pgoff_t end, struct pagevec *pvec, pgoff_t *indices);
bool truncate_inode_partial_page(struct page *page, loff_t start, loff_t end);
/**
--
2.30.2
If we're punching a hole in a multi-page folio, we need to release the
per-page iomap data, as the folio is about to be split and each page will
need its own. This means writepage can now encounter a page with no iop
allocated, so drop the assertion that one already exists and simply
create one (with the uptodate bits set) when it is missing.
Signed-off-by: Matthew Wilcox (Oracle) <[email protected]>
---
fs/iomap/buffered-io.c | 11 +++++++----
1 file changed, 7 insertions(+), 4 deletions(-)
diff --git a/fs/iomap/buffered-io.c b/fs/iomap/buffered-io.c
index 48de198c5603..7f78256fc0ba 100644
--- a/fs/iomap/buffered-io.c
+++ b/fs/iomap/buffered-io.c
@@ -474,13 +474,17 @@ iomap_invalidatepage(struct page *page, unsigned int offset, unsigned int len)
trace_iomap_invalidatepage(folio->mapping->host, offset, len);
/*
- * If we are invalidating the entire page, clear the dirty state from it
- * and release it to avoid unnecessary buildup of the LRU.
+ * If we are invalidating the entire folio, clear the dirty state
+ * from it and release it to avoid unnecessary buildup of the LRU.
*/
if (offset == 0 && len == folio_size(folio)) {
WARN_ON_ONCE(folio_test_writeback(folio));
folio_cancel_dirty(folio);
iomap_page_release(folio);
+ } else if (folio_multi(folio)) {
+ /* Must release the iop so the page can be split */
+ WARN_ON_ONCE(!folio_test_uptodate(folio) && folio_test_dirty(folio));
+ iomap_page_release(folio);
}
}
EXPORT_SYMBOL_GPL(iomap_invalidatepage);
@@ -1300,7 +1304,7 @@ iomap_writepage_map(struct iomap_writepage_ctx *wpc,
struct writeback_control *wbc, struct inode *inode,
struct folio *folio, loff_t end_pos)
{
- struct iomap_page *iop = to_iomap_page(folio);
+ struct iomap_page *iop = iomap_page_create(inode, folio);
struct iomap_ioend *ioend, *next;
unsigned len = i_blocksize(inode);
unsigned nblocks = i_blocks_per_folio(inode, folio);
@@ -1308,7 +1312,6 @@ iomap_writepage_map(struct iomap_writepage_ctx *wpc,
int error = 0, count = 0, i;
LIST_HEAD(submit_list);
- WARN_ON_ONCE(nblocks > 1 && !iop);
WARN_ON_ONCE(iop && atomic_read(&iop->write_bytes_pending) != 0);
/*
--
2.30.2
If we're going to unmap a folio, we have to be sure to unmap the entire
folio, not just the part of it which lies after the search index.
Signed-off-by: Matthew Wilcox (Oracle) <[email protected]>
---
mm/truncate.c | 62 ++++++++++++++++++++++++++-------------------------
1 file changed, 32 insertions(+), 30 deletions(-)
diff --git a/mm/truncate.c b/mm/truncate.c
index b8c9d2fbd9b5..d068f22fe422 100644
--- a/mm/truncate.c
+++ b/mm/truncate.c
@@ -599,42 +599,43 @@ void invalidate_mapping_pagevec(struct address_space *mapping,
* shrink_page_list() has a temp ref on them, or because they're transiently
* sitting in the lru_cache_add() pagevecs.
*/
-static int
-invalidate_complete_page2(struct address_space *mapping, struct page *page)
+static int invalidate_complete_folio2(struct address_space *mapping,
+ struct folio *folio)
{
unsigned long flags;
- if (page->mapping != mapping)
+ if (folio->mapping != mapping)
return 0;
- if (page_has_private(page) && !try_to_release_page(page, GFP_KERNEL))
+ if (folio_has_private(folio) &&
+ !try_to_release_page(&folio->page, GFP_KERNEL))
return 0;
xa_lock_irqsave(&mapping->i_pages, flags);
- if (PageDirty(page))
+ if (folio_test_dirty(folio))
goto failed;
- BUG_ON(page_has_private(page));
- __delete_from_page_cache(page, NULL);
+ BUG_ON(folio_has_private(folio));
+ __filemap_remove_folio(folio, NULL);
xa_unlock_irqrestore(&mapping->i_pages, flags);
if (mapping->a_ops->freepage)
- mapping->a_ops->freepage(page);
+ mapping->a_ops->freepage(&folio->page);
- put_page(page); /* pagecache ref */
+ folio_ref_sub(folio, folio_nr_pages(folio)); /* pagecache ref */
return 1;
failed:
xa_unlock_irqrestore(&mapping->i_pages, flags);
return 0;
}
-static int do_launder_page(struct address_space *mapping, struct page *page)
+static int do_launder_folio(struct address_space *mapping, struct folio *folio)
{
- if (!PageDirty(page))
+ if (!folio_test_dirty(folio))
return 0;
- if (page->mapping != mapping || mapping->a_ops->launder_page == NULL)
+ if (folio->mapping != mapping || mapping->a_ops->launder_page == NULL)
return 0;
- return mapping->a_ops->launder_page(page);
+ return mapping->a_ops->launder_page(&folio->page);
}
/**
@@ -666,21 +667,21 @@ int invalidate_inode_pages2_range(struct address_space *mapping,
index = start;
while (find_get_entries(mapping, index, end, &pvec, indices)) {
for (i = 0; i < pagevec_count(&pvec); i++) {
- struct page *page = pvec.pages[i];
+ struct folio *folio = (struct folio *)pvec.pages[i];
- /* We rely upon deletion not changing page->index */
+ /* We rely upon deletion not changing folio->index */
index = indices[i];
- if (xa_is_value(page)) {
+ if (xa_is_value(folio)) {
if (!invalidate_exceptional_entry2(mapping,
- index, page))
+ index, folio))
ret = -EBUSY;
continue;
}
- if (!did_range_unmap && page_mapped(page)) {
+ if (!did_range_unmap && folio_mapped(folio)) {
/*
- * If page is mapped, before taking its lock,
+ * If folio is mapped, before taking its lock,
* zap the rest of the file in one hit.
*/
unmap_mapping_pages(mapping, index,
@@ -688,26 +689,27 @@ int invalidate_inode_pages2_range(struct address_space *mapping,
did_range_unmap = 1;
}
- lock_page(page);
- WARN_ON(page_to_index(page) != index);
- if (page->mapping != mapping) {
- unlock_page(page);
+ folio_lock(folio);
+ VM_WARN_ON_ONCE_FOLIO(!folio_contains(folio, index),
+ folio);
+ if (folio->mapping != mapping) {
+ folio_unlock(folio);
continue;
}
- wait_on_page_writeback(page);
+ folio_wait_writeback(folio);
- if (page_mapped(page))
- unmap_mapping_page(page);
- BUG_ON(page_mapped(page));
+ if (folio_mapped(folio))
+ unmap_mapping_page(&folio->page);
+ BUG_ON(folio_mapped(folio));
- ret2 = do_launder_page(mapping, page);
+ ret2 = do_launder_folio(mapping, folio);
if (ret2 == 0) {
- if (!invalidate_complete_page2(mapping, page))
+ if (!invalidate_complete_folio2(mapping, folio))
ret2 = -EBUSY;
}
if (ret2 < 0)
ret = ret2;
- unlock_page(page);
+ folio_unlock(folio);
}
pagevec_remove_exceptionals(&pvec);
pagevec_release(&pvec);
--
2.30.2
These counters only exist if CONFIG_READ_ONLY_THP_FOR_FS is defined,
but we should not warn if the filesystem natively supports THPs.
Signed-off-by: Matthew Wilcox (Oracle) <[email protected]>
---
include/linux/pagemap.h | 4 ++--
1 file changed, 2 insertions(+), 2 deletions(-)
diff --git a/include/linux/pagemap.h b/include/linux/pagemap.h
index 25b1bf3b1cdb..81cccb708df7 100644
--- a/include/linux/pagemap.h
+++ b/include/linux/pagemap.h
@@ -146,7 +146,7 @@ static inline void filemap_nr_thps_inc(struct address_space *mapping)
if (!mapping_thp_support(mapping))
atomic_inc(&mapping->nr_thps);
#else
- WARN_ON_ONCE(1);
+ WARN_ON_ONCE(!mapping_thp_support(mapping));
#endif
}
@@ -156,7 +156,7 @@ static inline void filemap_nr_thps_dec(struct address_space *mapping)
if (!mapping_thp_support(mapping))
atomic_dec(&mapping->nr_thps);
#else
- WARN_ON_ONCE(1);
+ WARN_ON_ONCE(!mapping_thp_support(mapping));
#endif
}
--
2.30.2
Use the compound size of the page instead of assuming PTE or PMD size.
Signed-off-by: Matthew Wilcox (Oracle) <[email protected]>
---
include/linux/huge_mm.h | 8 ++------
include/linux/mm.h | 42 ++++++++++++++++++++---------------------
2 files changed, 23 insertions(+), 27 deletions(-)
diff --git a/include/linux/huge_mm.h b/include/linux/huge_mm.h
index f280f33ff223..b70318fe7863 100644
--- a/include/linux/huge_mm.h
+++ b/include/linux/huge_mm.h
@@ -257,9 +257,7 @@ static inline spinlock_t *pud_trans_huge_lock(pud_t *pud,
static inline unsigned int thp_order(struct page *page)
{
VM_BUG_ON_PGFLAGS(PageTail(page), page);
- if (PageHead(page))
- return HPAGE_PMD_ORDER;
- return 0;
+ return compound_order(page);
}
/**
@@ -269,9 +267,7 @@ static inline unsigned int thp_order(struct page *page)
static inline int thp_nr_pages(struct page *page)
{
VM_BUG_ON_PGFLAGS(PageTail(page), page);
- if (PageHead(page))
- return HPAGE_PMD_NR;
- return 1;
+ return compound_nr(page);
}
struct page *follow_devmap_pmd(struct vm_area_struct *vma, unsigned long addr,
diff --git a/include/linux/mm.h b/include/linux/mm.h
index 99f5f736be64..df1f4c4976df 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -715,6 +715,27 @@ int vma_is_stack_for_current(struct vm_area_struct *vma);
struct mmu_gather;
struct inode;
+static inline unsigned int compound_order(struct page *page)
+{
+ if (!PageHead(page))
+ return 0;
+ return page[1].compound_order;
+}
+
+/* Returns the number of pages in this potentially compound page. */
+static inline unsigned long compound_nr(struct page *page)
+{
+ if (!PageHead(page))
+ return 1;
+ return page[1].compound_nr;
+}
+
+static inline void set_compound_order(struct page *page, unsigned int order)
+{
+ page[1].compound_order = order;
+ page[1].compound_nr = 1U << order;
+}
+
#include <linux/huge_mm.h>
/*
@@ -937,13 +958,6 @@ static inline void destroy_compound_page(struct page *page)
compound_page_dtors[page[1].compound_dtor](page);
}
-static inline unsigned int compound_order(struct page *page)
-{
- if (!PageHead(page))
- return 0;
- return page[1].compound_order;
-}
-
/**
* folio_order - The allocation order of a folio.
* @folio: The folio.
@@ -981,20 +995,6 @@ static inline int compound_pincount(struct page *page)
return head_compound_pincount(page);
}
-static inline void set_compound_order(struct page *page, unsigned int order)
-{
- page[1].compound_order = order;
- page[1].compound_nr = 1U << order;
-}
-
-/* Returns the number of pages in this potentially compound page. */
-static inline unsigned long compound_nr(struct page *page)
-{
- if (!PageHead(page))
- return 1;
- return page[1].compound_nr;
-}
-
/* Returns the number of bytes in this potentially compound page. */
static inline unsigned long page_size(struct page *page)
{
--
2.30.2
We return -EEXIST if there are any non-shadow entries in the page
cache in the range covered by the folio. If there are multiple
shadow entries in the range, we set *shadowp to one of them (currently
the one at the highest index). If that turns out to be the wrong
answer, we can implement something more complex. This is mostly
modelled after the equivalent function in the shmem code.
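The conflict check boils down to something like this sketch (not the
literal function body; the real code below also splits larger entries
before storing the folio):

        void *entry, *old = NULL;

        xas_for_each_conflict(&xas, entry) {
                old = entry;
                if (!xa_is_value(entry)) {
                        /* a real page already occupies part of the range */
                        xas_set_err(&xas, -EEXIST);
                        goto unlock;
                }
        }
        if (old && shadowp)
                *shadowp = old; /* ends up as the highest-index shadow entry */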
Signed-off-by: Matthew Wilcox (Oracle) <[email protected]>
---
mm/filemap.c | 40 +++++++++++++++++++++++-----------------
1 file changed, 23 insertions(+), 17 deletions(-)
diff --git a/mm/filemap.c b/mm/filemap.c
index d5787502c3be..1ff21b3346d3 100644
--- a/mm/filemap.c
+++ b/mm/filemap.c
@@ -848,26 +848,27 @@ noinline int __filemap_add_folio(struct address_space *mapping,
{
XA_STATE(xas, &mapping->i_pages, index);
int huge = folio_test_hugetlb(folio);
- int error;
bool charged = false;
+ unsigned int nr = 1;
VM_BUG_ON_FOLIO(!folio_test_locked(folio), folio);
VM_BUG_ON_FOLIO(folio_test_swapbacked(folio), folio);
mapping_set_update(&xas, mapping);
- folio_get(folio);
- folio->mapping = mapping;
- folio->index = index;
-
if (!huge) {
- error = mem_cgroup_charge(folio, NULL, gfp);
+ int error = mem_cgroup_charge(folio, NULL, gfp);
VM_BUG_ON_FOLIO(index & (folio_nr_pages(folio) - 1), folio);
if (error)
- goto error;
+ return error;
charged = true;
+ xas_set_order(&xas, index, folio_order(folio));
+ nr = folio_nr_pages(folio);
}
gfp &= GFP_RECLAIM_MASK;
+ folio_ref_add(folio, nr);
+ folio->mapping = mapping;
+ folio->index = xas.xa_index;
do {
unsigned int order = xa_get_order(xas.xa, xas.xa_index);
@@ -891,6 +892,8 @@ noinline int __filemap_add_folio(struct address_space *mapping,
/* entry may have been split before we acquired lock */
order = xa_get_order(xas.xa, xas.xa_index);
if (order > folio_order(folio)) {
+ /* How to handle large swap entries? */
+ BUG_ON(shmem_mapping(mapping));
xas_split(&xas, old, order);
xas_reset(&xas);
}
@@ -900,29 +903,32 @@ noinline int __filemap_add_folio(struct address_space *mapping,
if (xas_error(&xas))
goto unlock;
- mapping->nrpages++;
+ mapping->nrpages += nr;
/* hugetlb pages do not participate in page cache accounting */
- if (!huge)
- __lruvec_stat_add_folio(folio, NR_FILE_PAGES);
+ if (!huge) {
+ __lruvec_stat_mod_folio(folio, NR_FILE_PAGES, nr);
+ if (nr > 1)
+ __lruvec_stat_mod_folio(folio,
+ NR_FILE_THPS, nr);
+ }
unlock:
xas_unlock_irq(&xas);
} while (xas_nomem(&xas, gfp));
- if (xas_error(&xas)) {
- error = xas_error(&xas);
- if (charged)
- mem_cgroup_uncharge(folio);
+ if (xas_error(&xas))
goto error;
- }
trace_mm_filemap_add_to_page_cache(&folio->page);
return 0;
error:
+ if (charged)
+ mem_cgroup_uncharge(folio);
folio->mapping = NULL;
/* Leave page->index set: truncation relies upon it */
- folio_put(folio);
- return error;
+ folio_ref_sub(folio, nr);
+ VM_BUG_ON_FOLIO(folio_ref_count(folio) <= 0, folio);
+ return xas_error(&xas);
}
ALLOW_ERROR_INJECTION(__filemap_add_folio, ERRNO);
--
2.30.2
A THP which is smaller than a PMD does not need to do the extra work
in try_to_unmap() of trying to split a PMD entry.
Signed-off-by: Matthew Wilcox (Oracle) <[email protected]>
---
mm/vmscan.c | 3 ++-
1 file changed, 2 insertions(+), 1 deletion(-)
diff --git a/mm/vmscan.c b/mm/vmscan.c
index 8b17e46dbf32..433956675107 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -1496,7 +1496,8 @@ static unsigned int shrink_page_list(struct list_head *page_list,
enum ttu_flags flags = TTU_BATCH_FLUSH;
bool was_swapbacked = PageSwapBacked(page);
- if (unlikely(PageTransHuge(page)))
+ if (PageTransHuge(page) &&
+ thp_order(page) >= HPAGE_PMD_ORDER)
flags |= TTU_SPLIT_HUGE_PMD;
try_to_unmap(page, flags);
--
2.30.2
If the filesystem supports multi-page folios, allocate larger pages in
the readahead code when it seems worth doing. The heuristic for choosing
larger page sizes will surely need some tuning, but this aggressive
ramp-up has been good for testing.
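As a rough worked example of the ramp-up below (assuming HPAGE_PMD_ORDER
is 9 and ra->size stays large enough): the first call arrives with an
order of 0 and allocates order-2 folios; when the readahead mark in one
of those folios is hit, the next call starts from order 2 and allocates
order-4 folios, then order-6, order-8 and finally PMD-sized folios.
Along the way, the alignment check drops the order where the start index
isn't aligned to the chosen size, and the EOF check shrinks it so that
no folio is allocated past the end of the file.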
Signed-off-by: Matthew Wilcox (Oracle) <[email protected]>
---
mm/readahead.c | 102 +++++++++++++++++++++++++++++++++++++++++++++----
1 file changed, 95 insertions(+), 7 deletions(-)
diff --git a/mm/readahead.c b/mm/readahead.c
index e1df44ad57ed..27e76cc2a9ba 100644
--- a/mm/readahead.c
+++ b/mm/readahead.c
@@ -149,7 +149,7 @@ static void read_pages(struct readahead_control *rac, struct list_head *pages,
blk_finish_plug(&plug);
- BUG_ON(!list_empty(pages));
+ BUG_ON(pages && !list_empty(pages));
BUG_ON(readahead_count(rac));
out:
@@ -430,11 +430,99 @@ static int try_context_readahead(struct address_space *mapping,
return 1;
}
+#ifdef CONFIG_TRANSPARENT_HUGEPAGE
+static inline int ra_alloc_folio(struct readahead_control *ractl, pgoff_t index,
+ pgoff_t mark, unsigned int order, gfp_t gfp)
+{
+ int err;
+ struct folio *folio = filemap_alloc_folio(gfp, order);
+
+ if (!folio)
+ return -ENOMEM;
+ if (mark - index < (1UL << order))
+ folio_set_readahead(folio);
+ err = filemap_add_folio(ractl->mapping, folio, index, gfp);
+ if (err)
+ folio_put(folio);
+ else
+ ractl->_nr_pages += 1UL << order;
+ return err;
+}
+
+static void page_cache_ra_order(struct readahead_control *ractl,
+ struct file_ra_state *ra, unsigned int new_order)
+{
+ struct address_space *mapping = ractl->mapping;
+ pgoff_t index = readahead_index(ractl);
+ pgoff_t limit = (i_size_read(mapping->host) - 1) >> PAGE_SHIFT;
+ pgoff_t mark = index + ra->size - ra->async_size;
+ int err = 0;
+ gfp_t gfp = readahead_gfp_mask(mapping);
+
+ if (!mapping_thp_support(mapping) || ra->size < 4)
+ goto fallback;
+
+ limit = min(limit, index + ra->size - 1);
+
+ /* Grow page size up to PMD size */
+ if (new_order < HPAGE_PMD_ORDER) {
+ new_order += 2;
+ if (new_order > HPAGE_PMD_ORDER)
+ new_order = HPAGE_PMD_ORDER;
+ while ((1 << new_order) > ra->size)
+ new_order--;
+ }
+
+ while (index <= limit) {
+ unsigned int order = new_order;
+
+ /* Align with smaller pages if needed */
+ if (index & ((1UL << order) - 1)) {
+ order = __ffs(index);
+ if (order == 1)
+ order = 0;
+ }
+ /* Don't allocate pages past EOF */
+ while (index + (1UL << order) - 1 > limit) {
+ if (--order == 1)
+ order = 0;
+ }
+ err = ra_alloc_folio(ractl, index, mark, order, gfp);
+ if (err)
+ break;
+ index += 1UL << order;
+ }
+
+ if (index > limit) {
+ ra->size += index - limit - 1;
+ ra->async_size += index - limit - 1;
+ }
+
+ read_pages(ractl, NULL, false);
+
+ /*
+ * If there were already pages in the page cache, then we may have
+ * left some gaps. Let the regular readahead code take care of this
+ * situation.
+ */
+ if (!err)
+ return;
+fallback:
+ do_page_cache_ra(ractl, ra->size, ra->async_size);
+}
+#else
+static void page_cache_ra_order(struct readahead_control *ractl,
+ struct file_ra_state *ra, unsigned int order)
+{
+ do_page_cache_ra(ractl, ra->size, ra->async_size);
+}
+#endif
+
/*
* A minimal readahead algorithm for trivial sequential/random reads.
*/
static void ondemand_readahead(struct readahead_control *ractl,
- bool hit_readahead_marker, unsigned long req_size)
+ struct folio *folio, unsigned long req_size)
{
struct backing_dev_info *bdi = inode_to_bdi(ractl->mapping->host);
struct file_ra_state *ra = ractl->ra;
@@ -469,12 +557,12 @@ static void ondemand_readahead(struct readahead_control *ractl,
}
/*
- * Hit a marked page without valid readahead state.
+ * Hit a marked folio without valid readahead state.
* E.g. interleaved reads.
* Query the pagecache for async_size, which normally equals to
* readahead size. Ramp it up and use it as the new readahead size.
*/
- if (hit_readahead_marker) {
+ if (folio) {
pgoff_t start;
rcu_read_lock();
@@ -547,7 +635,7 @@ static void ondemand_readahead(struct readahead_control *ractl,
}
ractl->_index = ra->start;
- do_page_cache_ra(ractl, ra->size, ra->async_size);
+ page_cache_ra_order(ractl, ra, folio ? folio_order(folio) : 0);
}
void page_cache_sync_ra(struct readahead_control *ractl,
@@ -575,7 +663,7 @@ void page_cache_sync_ra(struct readahead_control *ractl,
}
/* do read-ahead */
- ondemand_readahead(ractl, false, req_count);
+ ondemand_readahead(ractl, NULL, req_count);
}
EXPORT_SYMBOL_GPL(page_cache_sync_ra);
@@ -604,7 +692,7 @@ void page_cache_async_ra(struct readahead_control *ractl,
return;
/* do read-ahead */
- ondemand_readahead(ractl, true, req_count);
+ ondemand_readahead(ractl, folio, req_count);
}
EXPORT_SYMBOL_GPL(page_cache_async_ra);
--
2.30.2
This is the equivalent of page_cache_get_speculative(). Also add
folio_ref_try_add_rcu() (the equivalent of page_cache_add_speculative())
and folio_try_get() (the equivalent of get_page_unless_zero()).
The new kernel-doc attempts to explain from the user's point of view
when to use folio_try_get_rcu() and when to use folio_try_get(),
because there seems to be some confusion currently between the users of
page_cache_get_speculative() and get_page_unless_zero().
Reimplement page_cache_add_speculative() and page_cache_get_speculative()
as wrappers around the folio equivalents, but leave get_page_unless_zero()
alone for now. This commit reduces text size by 3 bytes due to slightly
different register allocation and instruction selection.
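As a usage sketch (assuming the caller holds rcu_read_lock() and xas is
an XA_STATE over mapping->i_pages), the lookup side of the lockless page
cache protocol looks roughly like this:

        struct folio *folio;

repeat:
        folio = xas_load(&xas);
        if (xas_retry(&xas, folio))
                goto repeat;
        if (folio && !xa_is_value(folio)) {
                if (!folio_try_get_rcu(folio))
                        goto repeat;            /* freed or frozen; retry */
                if (unlikely(folio != xas_reload(&xas))) {
                        folio_put(folio);       /* moved or split under us */
                        goto repeat;
                }
        }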
Signed-off-by: Matthew Wilcox (Oracle) <[email protected]>
Acked-by: Vlastimil Babka <[email protected]>
Reviewed-by: William Kucharski <[email protected]>
Reviewed-by: Christoph Hellwig <[email protected]>
Acked-by: Kirill A. Shutemov <[email protected]>
---
include/linux/page_ref.h | 66 +++++++++++++++++++++++++++++++
include/linux/pagemap.h | 84 ++--------------------------------------
mm/filemap.c | 20 ++++++++++
3 files changed, 90 insertions(+), 80 deletions(-)
diff --git a/include/linux/page_ref.h b/include/linux/page_ref.h
index 717d53c9ddf1..2e677e6ad09f 100644
--- a/include/linux/page_ref.h
+++ b/include/linux/page_ref.h
@@ -247,6 +247,72 @@ static inline bool folio_ref_add_unless(struct folio *folio, int nr, int u)
return page_ref_add_unless(&folio->page, nr, u);
}
+/**
+ * folio_try_get - Attempt to increase the refcount on a folio.
+ * @folio: The folio.
+ *
+ * If you do not already have a reference to a folio, you can attempt to
+ * get one using this function. It may fail if, for example, the folio
+ * has been freed since you found a pointer to it, or it is frozen for
+ * the purposes of splitting or migration.
+ *
+ * Return: True if the reference count was successfully incremented.
+ */
+static inline bool folio_try_get(struct folio *folio)
+{
+ return folio_ref_add_unless(folio, 1, 0);
+}
+
+static inline bool folio_ref_try_add_rcu(struct folio *folio, int count)
+{
+#ifdef CONFIG_TINY_RCU
+ /*
+ * The caller guarantees the folio will not be freed from interrupt
+ * context, so (on !SMP) we only need preemption to be disabled
+ * and TINY_RCU does that for us.
+ */
+# ifdef CONFIG_PREEMPT_COUNT
+ VM_BUG_ON(!in_atomic() && !irqs_disabled());
+# endif
+ VM_BUG_ON_FOLIO(folio_ref_count(folio) == 0, folio);
+ folio_ref_add(folio, count);
+#else
+ if (unlikely(!folio_ref_add_unless(folio, count, 0))) {
+ /* Either the folio has been freed, or will be freed. */
+ return false;
+ }
+#endif
+ return true;
+}
+
+/**
+ * folio_try_get_rcu - Attempt to increase the refcount on a folio.
+ * @folio: The folio.
+ *
+ * This is a version of folio_try_get() optimised for non-SMP kernels.
+ * If you are still holding the rcu_read_lock() after looking up the
+ * page and know that the page cannot have its refcount decreased to
+ * zero in interrupt context, you can use this instead of folio_try_get().
+ *
+ * Example users include get_user_pages_fast() (as pages are not unmapped
+ * from interrupt context) and the page cache lookups (as pages are not
+ * truncated from interrupt context). We also know that pages are not
+ * frozen in interrupt context for the purposes of splitting or migration.
+ *
+ * You can also use this function if you're holding a lock that prevents
+ * pages being frozen & removed; eg the i_pages lock for the page cache
+ * or the mmap_sem or page table lock for page tables. In this case,
+ * it will always succeed, and you could have used a plain folio_get(),
+ * but it's sometimes more convenient to have a common function called
+ * from both locked and RCU-protected contexts.
+ *
+ * Return: True if the reference count was successfully incremented.
+ */
+static inline bool folio_try_get_rcu(struct folio *folio)
+{
+ return folio_ref_try_add_rcu(folio, 1);
+}
+
static inline int page_ref_freeze(struct page *page, int count)
{
int ret = likely(atomic_cmpxchg(&page->_refcount, count, 0) == count);
diff --git a/include/linux/pagemap.h b/include/linux/pagemap.h
index ed02aa522263..db1726b1bc1c 100644
--- a/include/linux/pagemap.h
+++ b/include/linux/pagemap.h
@@ -172,91 +172,15 @@ static inline struct address_space *page_mapping_file(struct page *page)
return page_mapping(page);
}
-/*
- * speculatively take a reference to a page.
- * If the page is free (_refcount == 0), then _refcount is untouched, and 0
- * is returned. Otherwise, _refcount is incremented by 1 and 1 is returned.
- *
- * This function must be called inside the same rcu_read_lock() section as has
- * been used to lookup the page in the pagecache radix-tree (or page table):
- * this allows allocators to use a synchronize_rcu() to stabilize _refcount.
- *
- * Unless an RCU grace period has passed, the count of all pages coming out
- * of the allocator must be considered unstable. page_count may return higher
- * than expected, and put_page must be able to do the right thing when the
- * page has been finished with, no matter what it is subsequently allocated
- * for (because put_page is what is used here to drop an invalid speculative
- * reference).
- *
- * This is the interesting part of the lockless pagecache (and lockless
- * get_user_pages) locking protocol, where the lookup-side (eg. find_get_page)
- * has the following pattern:
- * 1. find page in radix tree
- * 2. conditionally increment refcount
- * 3. check the page is still in pagecache (if no, goto 1)
- *
- * Remove-side that cares about stability of _refcount (eg. reclaim) has the
- * following (with the i_pages lock held):
- * A. atomically check refcount is correct and set it to 0 (atomic_cmpxchg)
- * B. remove page from pagecache
- * C. free the page
- *
- * There are 2 critical interleavings that matter:
- * - 2 runs before A: in this case, A sees elevated refcount and bails out
- * - A runs before 2: in this case, 2 sees zero refcount and retries;
- * subsequently, B will complete and 1 will find no page, causing the
- * lookup to return NULL.
- *
- * It is possible that between 1 and 2, the page is removed then the exact same
- * page is inserted into the same position in pagecache. That's OK: the
- * old find_get_page using a lock could equally have run before or after
- * such a re-insertion, depending on order that locks are granted.
- *
- * Lookups racing against pagecache insertion isn't a big problem: either 1
- * will find the page or it will not. Likewise, the old find_get_page could run
- * either before the insertion or afterwards, depending on timing.
- */
-static inline int __page_cache_add_speculative(struct page *page, int count)
+static inline bool page_cache_add_speculative(struct page *page, int count)
{
-#ifdef CONFIG_TINY_RCU
-# ifdef CONFIG_PREEMPT_COUNT
- VM_BUG_ON(!in_atomic() && !irqs_disabled());
-# endif
- /*
- * Preempt must be disabled here - we rely on rcu_read_lock doing
- * this for us.
- *
- * Pagecache won't be truncated from interrupt context, so if we have
- * found a page in the radix tree here, we have pinned its refcount by
- * disabling preempt, and hence no need for the "speculative get" that
- * SMP requires.
- */
- VM_BUG_ON_PAGE(page_count(page) == 0, page);
- page_ref_add(page, count);
-
-#else
- if (unlikely(!page_ref_add_unless(page, count, 0))) {
- /*
- * Either the page has been freed, or will be freed.
- * In either case, retry here and the caller should
- * do the right thing (see comments above).
- */
- return 0;
- }
-#endif
VM_BUG_ON_PAGE(PageTail(page), page);
-
- return 1;
-}
-
-static inline int page_cache_get_speculative(struct page *page)
-{
- return __page_cache_add_speculative(page, 1);
+ return folio_ref_try_add_rcu((struct folio *)page, count);
}
-static inline int page_cache_add_speculative(struct page *page, int count)
+static inline bool page_cache_get_speculative(struct page *page)
{
- return __page_cache_add_speculative(page, count);
+ return page_cache_add_speculative(page, 1);
}
/**
diff --git a/mm/filemap.c b/mm/filemap.c
index d1458ecf2f51..634adeacc4c1 100644
--- a/mm/filemap.c
+++ b/mm/filemap.c
@@ -1746,6 +1746,26 @@ pgoff_t page_cache_prev_miss(struct address_space *mapping,
}
EXPORT_SYMBOL(page_cache_prev_miss);
+/*
+ * Lockless page cache protocol:
+ * On the lookup side:
+ * 1. Load the folio from i_pages
+ * 2. Increment the refcount if it's not zero
+ * 3. If the folio is not found by xas_reload(), put the refcount and retry
+ *
+ * On the removal side:
+ * A. Freeze the page (by zeroing the refcount if nobody else has a reference)
+ * B. Remove the page from i_pages
+ * C. Return the page to the page allocator
+ *
+ * This means that any page may have its reference count temporarily
+ * increased by a speculative page cache (or fast GUP) lookup as it can
+ * be allocated by another user before the RCU grace period expires.
+ * Because the refcount temporarily acquired here may end up being the
+ * last refcount on the page, any page allocation must be freeable by
+ * put_folio().
+ */
+
/*
* mapping_get_entry - Get a page cache entry.
* @mapping: the address_space to search
--
2.30.2
Handle arbitrary-order folios being added to the LRU. By definition,
all pages being added to the LRU were already head or base pages, but
call page_folio() on them anyway to get the type right and avoid the
buried calls to compound_head().
Saves 783 bytes of kernel text; no functions grow.
Signed-off-by: Matthew Wilcox (Oracle) <[email protected]>
Reviewed-by: Yu Zhao <[email protected]>
Reviewed-by: Christoph Hellwig <[email protected]>
Reviewed-by: David Howells <[email protected]>
Acked-by: Kirill A. Shutemov <[email protected]>
---
include/linux/mm_inline.h | 98 ++++++++++++++++++++++------------
include/trace/events/pagemap.h | 2 +-
2 files changed, 65 insertions(+), 35 deletions(-)
diff --git a/include/linux/mm_inline.h b/include/linux/mm_inline.h
index 355ea1ee32bd..ee155d19885e 100644
--- a/include/linux/mm_inline.h
+++ b/include/linux/mm_inline.h
@@ -6,22 +6,27 @@
#include <linux/swap.h>
/**
- * page_is_file_lru - should the page be on a file LRU or anon LRU?
- * @page: the page to test
+ * folio_is_file_lru - should the folio be on a file LRU or anon LRU?
+ * @folio: the folio to test
*
- * Returns 1 if @page is a regular filesystem backed page cache page or a lazily
- * freed anonymous page (e.g. via MADV_FREE). Returns 0 if @page is a normal
- * anonymous page, a tmpfs page or otherwise ram or swap backed page. Used by
- * functions that manipulate the LRU lists, to sort a page onto the right LRU
- * list.
+ * Returns 1 if @folio is a regular filesystem backed page cache folio
+ * or a lazily freed anonymous folio (e.g. via MADV_FREE). Returns 0 if
+ * @folio is a normal anonymous folio, a tmpfs folio or otherwise ram or
+ * swap backed folio. Used by functions that manipulate the LRU lists,
+ * to sort a folio onto the right LRU list.
*
* We would like to get this info without a page flag, but the state
- * needs to survive until the page is last deleted from the LRU, which
+ * needs to survive until the folio is last deleted from the LRU, which
* could be as far down as __page_cache_release.
*/
+static inline int folio_is_file_lru(struct folio *folio)
+{
+ return !folio_test_swapbacked(folio);
+}
+
static inline int page_is_file_lru(struct page *page)
{
- return !PageSwapBacked(page);
+ return folio_is_file_lru(page_folio(page));
}
static __always_inline void update_lru_size(struct lruvec *lruvec,
@@ -39,69 +44,94 @@ static __always_inline void update_lru_size(struct lruvec *lruvec,
}
/**
- * __clear_page_lru_flags - clear page lru flags before releasing a page
- * @page: the page that was on lru and now has a zero reference
+ * __folio_clear_lru_flags - clear page lru flags before releasing a page
+ * @folio: The folio that was on lru and now has a zero reference
*/
-static __always_inline void __clear_page_lru_flags(struct page *page)
+static __always_inline void __folio_clear_lru_flags(struct folio *folio)
{
- VM_BUG_ON_PAGE(!PageLRU(page), page);
+ VM_BUG_ON_FOLIO(!folio_test_lru(folio), folio);
- __ClearPageLRU(page);
+ __folio_clear_lru(folio);
/* this shouldn't happen, so leave the flags to bad_page() */
- if (PageActive(page) && PageUnevictable(page))
+ if (folio_test_active(folio) && folio_test_unevictable(folio))
return;
- __ClearPageActive(page);
- __ClearPageUnevictable(page);
+ __folio_clear_active(folio);
+ __folio_clear_unevictable(folio);
+}
+
+static __always_inline void __clear_page_lru_flags(struct page *page)
+{
+ __folio_clear_lru_flags(page_folio(page));
}
/**
- * page_lru - which LRU list should a page be on?
- * @page: the page to test
+ * folio_lru_list - which LRU list should a folio be on?
+ * @folio: the folio to test
*
- * Returns the LRU list a page should be on, as an index
+ * Returns the LRU list a folio should be on, as an index
* into the array of LRU lists.
*/
-static __always_inline enum lru_list page_lru(struct page *page)
+static __always_inline enum lru_list folio_lru_list(struct folio *folio)
{
enum lru_list lru;
- VM_BUG_ON_PAGE(PageActive(page) && PageUnevictable(page), page);
+ VM_BUG_ON_FOLIO(folio_test_active(folio) && folio_test_unevictable(folio), folio);
- if (PageUnevictable(page))
+ if (folio_test_unevictable(folio))
return LRU_UNEVICTABLE;
- lru = page_is_file_lru(page) ? LRU_INACTIVE_FILE : LRU_INACTIVE_ANON;
- if (PageActive(page))
+ lru = folio_is_file_lru(folio) ? LRU_INACTIVE_FILE : LRU_INACTIVE_ANON;
+ if (folio_test_active(folio))
lru += LRU_ACTIVE;
return lru;
}
+static __always_inline
+void lruvec_add_folio(struct lruvec *lruvec, struct folio *folio)
+{
+ enum lru_list lru = folio_lru_list(folio);
+
+ update_lru_size(lruvec, lru, folio_zonenum(folio),
+ folio_nr_pages(folio));
+ list_add(&folio->lru, &lruvec->lists[lru]);
+}
+
static __always_inline void add_page_to_lru_list(struct page *page,
struct lruvec *lruvec)
{
- enum lru_list lru = page_lru(page);
+ lruvec_add_folio(lruvec, page_folio(page));
+}
+
+static __always_inline
+void lruvec_add_folio_tail(struct lruvec *lruvec, struct folio *folio)
+{
+ enum lru_list lru = folio_lru_list(folio);
- update_lru_size(lruvec, lru, page_zonenum(page), thp_nr_pages(page));
- list_add(&page->lru, &lruvec->lists[lru]);
+ update_lru_size(lruvec, lru, folio_zonenum(folio),
+ folio_nr_pages(folio));
+ list_add_tail(&folio->lru, &lruvec->lists[lru]);
}
static __always_inline void add_page_to_lru_list_tail(struct page *page,
struct lruvec *lruvec)
{
- enum lru_list lru = page_lru(page);
+ lruvec_add_folio_tail(lruvec, page_folio(page));
+}
- update_lru_size(lruvec, lru, page_zonenum(page), thp_nr_pages(page));
- list_add_tail(&page->lru, &lruvec->lists[lru]);
+static __always_inline
+void lruvec_del_folio(struct lruvec *lruvec, struct folio *folio)
+{
+ list_del(&folio->lru);
+ update_lru_size(lruvec, folio_lru_list(folio), folio_zonenum(folio),
+ -folio_nr_pages(folio));
}
static __always_inline void del_page_from_lru_list(struct page *page,
struct lruvec *lruvec)
{
- list_del(&page->lru);
- update_lru_size(lruvec, page_lru(page), page_zonenum(page),
- -thp_nr_pages(page));
+ lruvec_del_folio(lruvec, page_folio(page));
}
#endif
diff --git a/include/trace/events/pagemap.h b/include/trace/events/pagemap.h
index 1d28431e85bd..92ad176210ff 100644
--- a/include/trace/events/pagemap.h
+++ b/include/trace/events/pagemap.h
@@ -41,7 +41,7 @@ TRACE_EVENT(mm_lru_insertion,
TP_fast_assign(
__entry->page = page;
__entry->pfn = page_to_pfn(page);
- __entry->lru = page_lru(page);
+ __entry->lru = folio_lru_list(page_folio(page));
__entry->flags = trace_pagemap_flags(page);
),
--
2.30.2
Add folio_get_private() which mirrors page_private() -- ie folio private
data is the same as page private data. The only difference is that these
return a void * instead of an unsigned long, which matches the majority
of users.
Turn attach_page_private() into folio_attach_private() and reimplement
attach_page_private() as a wrapper. No filesystem which uses page private
data currently supports compound pages, so we're free to define the rules.
attach_page_private() may only be called on a head page; if you want
to add private data to a tail page, you can call set_page_private()
directly (and shouldn't increment the page refcount! That should be
done when adding private data to the head page / folio).
This saves 813 bytes of text with the distro-derived config that I'm
testing due to removing the calls to compound_head() in get_page()
& put_page().
Signed-off-by: Matthew Wilcox (Oracle) <[email protected]>
Reviewed-by: Christoph Hellwig <[email protected]>
Acked-by: Jeff Layton <[email protected]>
Acked-by: Kirill A. Shutemov <[email protected]>
Acked-by: Vlastimil Babka <[email protected]>
Reviewed-by: William Kucharski <[email protected]>
Reviewed-by: David Howells <[email protected]>
---
include/linux/mm_types.h | 11 +++++++++
include/linux/pagemap.h | 48 ++++++++++++++++++++++++----------------
2 files changed, 40 insertions(+), 19 deletions(-)
diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h
index f023eaa866fe..c4dd41bb1019 100644
--- a/include/linux/mm_types.h
+++ b/include/linux/mm_types.h
@@ -309,6 +309,12 @@ static inline atomic_t *compound_pincount_ptr(struct page *page)
#define PAGE_FRAG_CACHE_MAX_SIZE __ALIGN_MASK(32768, ~PAGE_MASK)
#define PAGE_FRAG_CACHE_MAX_ORDER get_order(PAGE_FRAG_CACHE_MAX_SIZE)
+/*
+ * page_private can be used on tail pages. However, PagePrivate is only
+ * checked by the VM on the head page. So page_private on the tail pages
+ * should be used for data that's ancillary to the head page (eg attaching
+ * buffer heads to tail pages after attaching buffer heads to the head page)
+ */
#define page_private(page) ((page)->private)
static inline void set_page_private(struct page *page, unsigned long private)
@@ -316,6 +322,11 @@ static inline void set_page_private(struct page *page, unsigned long private)
page->private = private;
}
+static inline void *folio_get_private(struct folio *folio)
+{
+ return folio->private;
+}
+
struct page_frag_cache {
void * va;
#if (PAGE_SIZE < PAGE_FRAG_CACHE_MAX_SIZE)
diff --git a/include/linux/pagemap.h b/include/linux/pagemap.h
index db1726b1bc1c..3279c731ee04 100644
--- a/include/linux/pagemap.h
+++ b/include/linux/pagemap.h
@@ -184,42 +184,52 @@ static inline bool page_cache_get_speculative(struct page *page)
}
/**
- * attach_page_private - Attach private data to a page.
- * @page: Page to attach data to.
- * @data: Data to attach to page.
+ * folio_attach_private - Attach private data to a folio.
+ * @folio: Folio to attach data to.
+ * @data: Data to attach to folio.
*
- * Attaching private data to a page increments the page's reference count.
- * The data must be detached before the page will be freed.
+ * Attaching private data to a folio increments the page's reference count.
+ * The data must be detached before the folio will be freed.
*/
-static inline void attach_page_private(struct page *page, void *data)
+static inline void folio_attach_private(struct folio *folio, void *data)
{
- get_page(page);
- set_page_private(page, (unsigned long)data);
- SetPagePrivate(page);
+ folio_get(folio);
+ folio->private = data;
+ folio_set_private(folio);
}
/**
- * detach_page_private - Detach private data from a page.
- * @page: Page to detach data from.
+ * folio_detach_private - Detach private data from a folio.
+ * @folio: Folio to detach data from.
*
- * Removes the data that was previously attached to the page and decrements
+ * Removes the data that was previously attached to the folio and decrements
* the refcount on the page.
*
- * Return: Data that was attached to the page.
+ * Return: Data that was attached to the folio.
*/
-static inline void *detach_page_private(struct page *page)
+static inline void *folio_detach_private(struct folio *folio)
{
- void *data = (void *)page_private(page);
+ void *data = folio_get_private(folio);
- if (!PagePrivate(page))
+ if (!folio_test_private(folio))
return NULL;
- ClearPagePrivate(page);
- set_page_private(page, 0);
- put_page(page);
+ folio_clear_private(folio);
+ folio->private = NULL;
+ folio_put(folio);
return data;
}
+static inline void attach_page_private(struct page *page, void *data)
+{
+ folio_attach_private(page_folio(page), data);
+}
+
+static inline void *detach_page_private(struct page *page)
+{
+ return folio_detach_private(page_folio(page));
+}
+
#ifdef CONFIG_NUMA
extern struct page *__page_cache_alloc(gfp_t gfp);
#else
--
2.30.2
This helper returns the page index of the next folio in the file (ie
the end of this folio, plus one).
No changes to generated code.
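For example, a loop walking a file folio by folio can advance like this
(a sketch; my_get_folio() is a hypothetical lookup helper and last_index
is the final index of interest):

        pgoff_t index = 0;

        while (index <= last_index) {
                struct folio *folio = my_get_folio(mapping, index);

                if (!folio)
                        break;
                /* ... operate on the whole folio ... */
                index = folio_next_index(folio);
                folio_put(folio);
        }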
Signed-off-by: Matthew Wilcox (Oracle) <[email protected]>
Reviewed-by: Christoph Hellwig <[email protected]>
Acked-by: Jeff Layton <[email protected]>
Acked-by: Kirill A. Shutemov <[email protected]>
Acked-by: Vlastimil Babka <[email protected]>
Reviewed-by: William Kucharski <[email protected]>
Reviewed-by: David Howells <[email protected]>
---
include/linux/pagemap.h | 11 +++++++++++
1 file changed, 11 insertions(+)
diff --git a/include/linux/pagemap.h b/include/linux/pagemap.h
index f7c165b5991f..bd0e7e91bfd4 100644
--- a/include/linux/pagemap.h
+++ b/include/linux/pagemap.h
@@ -406,6 +406,17 @@ static inline pgoff_t folio_index(struct folio *folio)
return folio->index;
}
+/**
+ * folio_next_index - Get the index of the next folio.
+ * @folio: The current folio.
+ *
+ * Return: The index of the folio which follows this folio in the file.
+ */
+static inline pgoff_t folio_next_index(struct folio *folio)
+{
+ return folio->index + folio_nr_pages(folio);
+}
+
/**
* folio_file_page - The page for a particular index.
* @folio: The folio which contains this index.
--
2.30.2
These are the folio equivalents of page_mapping() and page_file_mapping().
Add an out-of-line page_mapping() wrapper around folio_mapping()
in order to prevent the page_folio() call from bloating every caller
of page_mapping(). Adjust page_file_mapping() and page_mapping_file()
to use folios internally. Rename __page_file_mapping() to
swapcache_mapping() and change it to take a folio.
This ends up saving 122 bytes of text overall. folio_mapping() is
45 bytes shorter than page_mapping() was, but the new page_mapping()
wrapper is 30 bytes. The major reduction is a few bytes less in dozens
of nfs functions (which call page_file_mapping()). Most of these appear
to be a slight change in gcc's register allocation decisions, which allow:
48 8b 56 08 mov 0x8(%rsi),%rdx
48 8d 42 ff lea -0x1(%rdx),%rax
83 e2 01 and $0x1,%edx
48 0f 44 c6 cmove %rsi,%rax
to become:
48 8b 46 08 mov 0x8(%rsi),%rax
48 8d 78 ff lea -0x1(%rax),%rdi
a8 01 test $0x1,%al
48 0f 44 fe cmove %rsi,%rdi
for a reduction of a single byte. Once the NFS client is converted to
use folios, this entire sequence will disappear.
Also add folio_mapping() documentation.
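To illustrate the distinction documented there: for a folio in the swap
cache (a sketch),

        struct address_space *a = folio_mapping(folio);        /* the swap address_space */
        struct address_space *b = folio_file_mapping(folio);   /* the swap file's ->f_mapping */

while for an ordinary page cache folio both simply return folio->mapping.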
Signed-off-by: Matthew Wilcox (Oracle) <[email protected]>
Reviewed-by: Christoph Hellwig <[email protected]>
Acked-by: Jeff Layton <[email protected]>
Acked-by: Kirill A. Shutemov <[email protected]>
Acked-by: Vlastimil Babka <[email protected]>
Reviewed-by: William Kucharski <[email protected]>
Reviewed-by: David Howells <[email protected]>
---
Documentation/core-api/mm-api.rst | 2 ++
include/linux/mm.h | 14 -------------
include/linux/pagemap.h | 35 +++++++++++++++++++++++++++++--
include/linux/swap.h | 6 ++++++
mm/Makefile | 2 +-
mm/folio-compat.c | 13 ++++++++++++
mm/swapfile.c | 8 +++----
mm/util.c | 30 +++++++++++++++-----------
8 files changed, 77 insertions(+), 33 deletions(-)
create mode 100644 mm/folio-compat.c
diff --git a/Documentation/core-api/mm-api.rst b/Documentation/core-api/mm-api.rst
index 5c459ee2acce..dcce6605947a 100644
--- a/Documentation/core-api/mm-api.rst
+++ b/Documentation/core-api/mm-api.rst
@@ -100,3 +100,5 @@ More Memory Management Functions
:internal:
.. kernel-doc:: include/linux/page_ref.h
.. kernel-doc:: include/linux/mmzone.h
+.. kernel-doc:: mm/util.c
+ :functions: folio_mapping
diff --git a/include/linux/mm.h b/include/linux/mm.h
index 788fbc4cde0c..9d28f5b2e983 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -1753,19 +1753,6 @@ void page_address_init(void);
extern void *page_rmapping(struct page *page);
extern struct anon_vma *page_anon_vma(struct page *page);
-extern struct address_space *page_mapping(struct page *page);
-
-extern struct address_space *__page_file_mapping(struct page *);
-
-static inline
-struct address_space *page_file_mapping(struct page *page)
-{
- if (unlikely(PageSwapCache(page)))
- return __page_file_mapping(page);
-
- return page->mapping;
-}
-
extern pgoff_t __page_file_index(struct page *page);
/*
@@ -1780,7 +1767,6 @@ static inline pgoff_t page_index(struct page *page)
}
bool page_mapped(struct page *page);
-struct address_space *page_mapping(struct page *page);
/*
* Return true only if the page has been allocated with
diff --git a/include/linux/pagemap.h b/include/linux/pagemap.h
index aa71fa82d6be..a0925a89ba11 100644
--- a/include/linux/pagemap.h
+++ b/include/linux/pagemap.h
@@ -162,14 +162,45 @@ static inline void filemap_nr_thps_dec(struct address_space *mapping)
void release_pages(struct page **pages, int nr);
+struct address_space *page_mapping(struct page *);
+struct address_space *folio_mapping(struct folio *);
+struct address_space *swapcache_mapping(struct folio *);
+
+/**
+ * folio_file_mapping - Find the mapping this folio belongs to.
+ * @folio: The folio.
+ *
+ * For folios which are in the page cache, return the mapping that this
+ * page belongs to. Folios in the swap cache return the mapping of the
+ * swap file or swap device where the data is stored. This is different
+ * from the mapping returned by folio_mapping(). The only reason to
+ * use it is if, like NFS, you return 0 from ->activate_swapfile.
+ *
+ * Do not call this for folios which aren't in the page cache or swap cache.
+ */
+static inline struct address_space *folio_file_mapping(struct folio *folio)
+{
+ if (unlikely(folio_test_swapcache(folio)))
+ return swapcache_mapping(folio);
+
+ return folio->mapping;
+}
+
+static inline struct address_space *page_file_mapping(struct page *page)
+{
+ return folio_file_mapping(page_folio(page));
+}
+
/*
* For file cache pages, return the address_space, otherwise return NULL
*/
static inline struct address_space *page_mapping_file(struct page *page)
{
- if (unlikely(PageSwapCache(page)))
+ struct folio *folio = page_folio(page);
+
+ if (unlikely(folio_test_swapcache(folio)))
return NULL;
- return page_mapping(page);
+ return folio_mapping(folio);
}
static inline bool page_cache_add_speculative(struct page *page, int count)
diff --git a/include/linux/swap.h b/include/linux/swap.h
index 6f5a43251593..3d3d85354026 100644
--- a/include/linux/swap.h
+++ b/include/linux/swap.h
@@ -320,6 +320,12 @@ struct vma_swap_readahead {
#endif
};
+static inline swp_entry_t folio_swap_entry(struct folio *folio)
+{
+ swp_entry_t entry = { .val = page_private(&folio->page) };
+ return entry;
+}
+
/* linux/mm/workingset.c */
void workingset_age_nonresident(struct lruvec *lruvec, unsigned long nr_pages);
void *workingset_eviction(struct page *page, struct mem_cgroup *target_memcg);
diff --git a/mm/Makefile b/mm/Makefile
index e3436741d539..d7488bcbbb2b 100644
--- a/mm/Makefile
+++ b/mm/Makefile
@@ -46,7 +46,7 @@ mmu-$(CONFIG_MMU) += process_vm_access.o
endif
obj-y := filemap.o mempool.o oom_kill.o fadvise.o \
- maccess.o page-writeback.o \
+ maccess.o page-writeback.o folio-compat.o \
readahead.o swap.o truncate.o vmscan.o shmem.o \
util.o mmzone.o vmstat.o backing-dev.o \
mm_init.o percpu.o slab_common.o \
diff --git a/mm/folio-compat.c b/mm/folio-compat.c
new file mode 100644
index 000000000000..5e107aa30a62
--- /dev/null
+++ b/mm/folio-compat.c
@@ -0,0 +1,13 @@
+/*
+ * Compatibility functions which bloat the callers too much to make inline.
+ * All of the callers of these functions should be converted to use folios
+ * eventually.
+ */
+
+#include <linux/pagemap.h>
+
+struct address_space *page_mapping(struct page *page)
+{
+ return folio_mapping(page_folio(page));
+}
+EXPORT_SYMBOL(page_mapping);
diff --git a/mm/swapfile.c b/mm/swapfile.c
index 1e07d1c776f2..3a6c094310da 100644
--- a/mm/swapfile.c
+++ b/mm/swapfile.c
@@ -3528,13 +3528,13 @@ struct swap_info_struct *page_swap_info(struct page *page)
}
/*
- * out-of-line __page_file_ methods to avoid include hell.
+ * out-of-line methods to avoid include hell.
*/
-struct address_space *__page_file_mapping(struct page *page)
+struct address_space *swapcache_mapping(struct folio *folio)
{
- return page_swap_info(page)->swap_file->f_mapping;
+ return page_swap_info(&folio->page)->swap_file->f_mapping;
}
-EXPORT_SYMBOL_GPL(__page_file_mapping);
+EXPORT_SYMBOL_GPL(swapcache_mapping);
pgoff_t __page_file_index(struct page *page)
{
diff --git a/mm/util.c b/mm/util.c
index 9043d03750a7..1cde6218d6d1 100644
--- a/mm/util.c
+++ b/mm/util.c
@@ -686,30 +686,36 @@ struct anon_vma *page_anon_vma(struct page *page)
return __page_rmapping(page);
}
-struct address_space *page_mapping(struct page *page)
+/**
+ * folio_mapping - Find the mapping where this folio is stored.
+ * @folio: The folio.
+ *
+ * For folios which are in the page cache, return the mapping that this
+ * page belongs to. Folios in the swap cache return the swap mapping
+ * this page is stored in (which is different from the mapping for the
+ * swap file or swap device where the data is stored).
+ *
+ * You can call this for folios which aren't in the swap cache or page
+ * cache and it will return NULL.
+ */
+struct address_space *folio_mapping(struct folio *folio)
{
struct address_space *mapping;
- page = compound_head(page);
-
/* This happens if someone calls flush_dcache_page on slab page */
- if (unlikely(PageSlab(page)))
+ if (unlikely(folio_test_slab(folio)))
return NULL;
- if (unlikely(PageSwapCache(page))) {
- swp_entry_t entry;
-
- entry.val = page_private(page);
- return swap_address_space(entry);
- }
+ if (unlikely(folio_test_swapcache(folio)))
+ return swap_address_space(folio_swap_entry(folio));
- mapping = page->mapping;
+ mapping = folio->mapping;
if ((unsigned long)mapping & PAGE_MAPPING_ANON)
return NULL;
return (void *)((unsigned long)mapping & ~PAGE_MAPPING_FLAGS);
}
-EXPORT_SYMBOL(page_mapping);
+EXPORT_SYMBOL(folio_mapping);
/* Slow path of page_mapcount() for compound pages */
int __page_mapcount(struct page *page)
--
2.30.2
Convert unlock_page() to call folio_unlock(). By using a folio we
avoid a call to compound_head(). This shortens the function from 39
bytes to 25 and removes 4 instructions on x86-64. Because we still
have unlock_page(), it's a net increase of 16 bytes of text for the
kernel as a whole, but any path that uses folio_unlock() will execute
4 fewer instructions.
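A converted caller typically does the page_folio() conversion once and
then stays in folio space, e.g. (a sketch, using the folio_lock() helper
added earlier in the series):

        struct folio *folio = page_folio(page);

        folio_lock(folio);
        /* ... work on the locked folio ... */
        folio_unlock(folio);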
Signed-off-by: Matthew Wilcox (Oracle) <[email protected]>
Reviewed-by: Christoph Hellwig <[email protected]>
Acked-by: Jeff Layton <[email protected]>
Acked-by: Kirill A. Shutemov <[email protected]>
Reviewed-by: William Kucharski <[email protected]>
Reviewed-by: David Howells <[email protected]>
---
include/linux/pagemap.h | 3 ++-
mm/filemap.c | 29 ++++++++++++-----------------
mm/folio-compat.c | 6 ++++++
3 files changed, 20 insertions(+), 18 deletions(-)
diff --git a/include/linux/pagemap.h b/include/linux/pagemap.h
index a0925a89ba11..a13edc7a2916 100644
--- a/include/linux/pagemap.h
+++ b/include/linux/pagemap.h
@@ -658,7 +658,8 @@ extern int __lock_page_killable(struct page *page);
extern int __lock_page_async(struct page *page, struct wait_page_queue *wait);
extern int __lock_page_or_retry(struct page *page, struct mm_struct *mm,
unsigned int flags);
-extern void unlock_page(struct page *page);
+void unlock_page(struct page *page);
+void folio_unlock(struct folio *folio);
/*
* Return true if the page was successfully locked
diff --git a/mm/filemap.c b/mm/filemap.c
index 634adeacc4c1..1af67ef94e4c 100644
--- a/mm/filemap.c
+++ b/mm/filemap.c
@@ -1435,29 +1435,24 @@ static inline bool clear_bit_unlock_is_negative_byte(long nr, volatile void *mem
#endif
/**
- * unlock_page - unlock a locked page
- * @page: the page
+ * folio_unlock - Unlock a locked folio.
+ * @folio: The folio.
*
- * Unlocks the page and wakes up sleepers in wait_on_page_locked().
- * Also wakes sleepers in wait_on_page_writeback() because the wakeup
- * mechanism between PageLocked pages and PageWriteback pages is shared.
- * But that's OK - sleepers in wait_on_page_writeback() just go back to sleep.
+ * Unlocks the folio and wakes up any thread sleeping on the page lock.
*
- * Note that this depends on PG_waiters being the sign bit in the byte
- * that contains PG_locked - thus the BUILD_BUG_ON(). That allows us to
- * clear the PG_locked bit and test PG_waiters at the same time fairly
- * portably (architectures that do LL/SC can test any bit, while x86 can
- * test the sign bit).
+ * Context: May be called from interrupt or process context. May not be
+ * called from NMI context.
*/
-void unlock_page(struct page *page)
+void folio_unlock(struct folio *folio)
{
+ /* Bit 7 allows x86 to check the byte's sign bit */
BUILD_BUG_ON(PG_waiters != 7);
- page = compound_head(page);
- VM_BUG_ON_PAGE(!PageLocked(page), page);
- if (clear_bit_unlock_is_negative_byte(PG_locked, &page->flags))
- wake_up_page_bit(page, PG_locked);
+ BUILD_BUG_ON(PG_locked > 7);
+ VM_BUG_ON_FOLIO(!folio_test_locked(folio), folio);
+ if (clear_bit_unlock_is_negative_byte(PG_locked, folio_flags(folio, 0)))
+ wake_up_page_bit(&folio->page, PG_locked);
}
-EXPORT_SYMBOL(unlock_page);
+EXPORT_SYMBOL(folio_unlock);
/**
* end_page_private_2 - Clear PG_private_2 and release any waiters
diff --git a/mm/folio-compat.c b/mm/folio-compat.c
index 5e107aa30a62..91b3d00a92f7 100644
--- a/mm/folio-compat.c
+++ b/mm/folio-compat.c
@@ -11,3 +11,9 @@ struct address_space *page_mapping(struct page *page)
return folio_mapping(page_folio(page));
}
EXPORT_SYMBOL(page_mapping);
+
+void unlock_page(struct page *page)
+{
+ return folio_unlock(page_folio(page));
+}
+EXPORT_SYMBOL(unlock_page);
--
2.30.2
This is like lock_page_killable() but for use by callers who
know they have a folio. Convert __lock_page_killable() to be
__folio_lock_killable(). This saves one call to compound_head() per
contended call to lock_page_killable().
__folio_lock_killable() is 19 bytes smaller than __lock_page_killable()
was. filemap_fault() shrinks by 74 bytes and __lock_page_or_retry()
shrinks by 71 bytes. That's a total of 164 bytes of text saved.
Signed-off-by: Matthew Wilcox (Oracle) <[email protected]>
Reviewed-by: Christoph Hellwig <[email protected]>
Acked-by: Jeff Layton <[email protected]>
Acked-by: Kirill A. Shutemov <[email protected]>
Acked-by: Vlastimil Babka <[email protected]>
Reviewed-by: William Kucharski <[email protected]>
---
include/linux/pagemap.h | 15 ++++++++++-----
mm/filemap.c | 17 +++++++++--------
2 files changed, 19 insertions(+), 13 deletions(-)
diff --git a/include/linux/pagemap.h b/include/linux/pagemap.h
index c3673c55125b..88727c74e059 100644
--- a/include/linux/pagemap.h
+++ b/include/linux/pagemap.h
@@ -654,7 +654,7 @@ static inline bool wake_page_match(struct wait_page_queue *wait_page,
}
void __folio_lock(struct folio *folio);
-extern int __lock_page_killable(struct page *page);
+int __folio_lock_killable(struct folio *folio);
extern int __lock_page_async(struct page *page, struct wait_page_queue *wait);
extern int __lock_page_or_retry(struct page *page, struct mm_struct *mm,
unsigned int flags);
@@ -694,6 +694,14 @@ static inline void lock_page(struct page *page)
__folio_lock(folio);
}
+static inline int folio_lock_killable(struct folio *folio)
+{
+ might_sleep();
+ if (!folio_trylock(folio))
+ return __folio_lock_killable(folio);
+ return 0;
+}
+
/*
* lock_page_killable is like lock_page but can be interrupted by fatal
* signals. It returns 0 if it locked the page and -EINTR if it was
@@ -701,10 +709,7 @@ static inline void lock_page(struct page *page)
*/
static inline int lock_page_killable(struct page *page)
{
- might_sleep();
- if (!trylock_page(page))
- return __lock_page_killable(page);
- return 0;
+ return folio_lock_killable(page_folio(page));
}
/*
diff --git a/mm/filemap.c b/mm/filemap.c
index 95f89656f126..962db5c38cd7 100644
--- a/mm/filemap.c
+++ b/mm/filemap.c
@@ -1589,14 +1589,13 @@ void __folio_lock(struct folio *folio)
}
EXPORT_SYMBOL(__folio_lock);
-int __lock_page_killable(struct page *__page)
+int __folio_lock_killable(struct folio *folio)
{
- struct page *page = compound_head(__page);
- wait_queue_head_t *q = page_waitqueue(page);
- return wait_on_page_bit_common(q, page, PG_locked, TASK_KILLABLE,
+ wait_queue_head_t *q = page_waitqueue(&folio->page);
+ return wait_on_page_bit_common(q, &folio->page, PG_locked, TASK_KILLABLE,
EXCLUSIVE);
}
-EXPORT_SYMBOL_GPL(__lock_page_killable);
+EXPORT_SYMBOL_GPL(__folio_lock_killable);
int __lock_page_async(struct page *page, struct wait_page_queue *wait)
{
@@ -1638,6 +1637,8 @@ int __lock_page_async(struct page *page, struct wait_page_queue *wait)
int __lock_page_or_retry(struct page *page, struct mm_struct *mm,
unsigned int flags)
{
+ struct folio *folio = page_folio(page);
+
if (fault_flag_allow_retry_first(flags)) {
/*
* CAUTION! In this case, mmap_lock is not released
@@ -1656,13 +1657,13 @@ int __lock_page_or_retry(struct page *page, struct mm_struct *mm,
if (flags & FAULT_FLAG_KILLABLE) {
int ret;
- ret = __lock_page_killable(page);
+ ret = __folio_lock_killable(folio);
if (ret) {
mmap_read_unlock(mm);
return 0;
}
} else {
- __folio_lock(page_folio(page));
+ __folio_lock(folio);
}
return 1;
@@ -2851,7 +2852,7 @@ static int lock_page_maybe_drop_mmap(struct vm_fault *vmf, struct page *page,
*fpin = maybe_unlock_mmap_for_io(vmf, *fpin);
if (vmf->flags & FAULT_FLAG_KILLABLE) {
- if (__lock_page_killable(&folio->page)) {
+ if (__folio_lock_killable(folio)) {
/*
* We didn't have the right flags to drop the mmap_lock,
* but all fault_handlers only check for fatal signals
--
2.30.2
Add folio_wait_locked() and folio_wait_locked_killable(), and turn
wait_on_page_locked() and wait_on_page_locked_killable() into wrappers
around them. This eliminates a call to compound_head() from each call
site, reducing text size by 193 bytes in my build.
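A minimal sketch of the new waiters; as the comment in the diff notes,
the caller must hold a folio reference across the wait (the helper name
is illustrative):

#include <linux/pagemap.h>

static int example_wait_for_unlock(struct folio *folio, bool killable)
{
	if (killable)
		return folio_wait_locked_killable(folio); /* 0 or -EINTR */
	folio_wait_locked(folio);
	return 0;
}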
Signed-off-by: Matthew Wilcox (Oracle) <[email protected]>
Reviewed-by: Christoph Hellwig <[email protected]>
Acked-by: Jeff Layton <[email protected]>
Acked-by: Kirill A. Shutemov <[email protected]>
Acked-by: Vlastimil Babka <[email protected]>
Reviewed-by: William Kucharski <[email protected]>
Reviewed-by: David Howells <[email protected]>
---
include/linux/pagemap.h | 26 ++++++++++++++++++--------
mm/filemap.c | 4 ++--
2 files changed, 20 insertions(+), 10 deletions(-)
diff --git a/include/linux/pagemap.h b/include/linux/pagemap.h
index 6f631a3e42dc..03fea8bbfd8e 100644
--- a/include/linux/pagemap.h
+++ b/include/linux/pagemap.h
@@ -733,23 +733,33 @@ extern void wait_on_page_bit(struct page *page, int bit_nr);
extern int wait_on_page_bit_killable(struct page *page, int bit_nr);
/*
- * Wait for a page to be unlocked.
+ * Wait for a folio to be unlocked.
*
- * This must be called with the caller "holding" the page,
- * ie with increased "page->count" so that the page won't
+ * This must be called with the caller "holding" the folio,
+ * ie with increased "page->count" so that the folio won't
* go away during the wait..
*/
+static inline void folio_wait_locked(struct folio *folio)
+{
+ if (folio_test_locked(folio))
+ wait_on_page_bit(&folio->page, PG_locked);
+}
+
+static inline int folio_wait_locked_killable(struct folio *folio)
+{
+ if (!folio_test_locked(folio))
+ return 0;
+ return wait_on_page_bit_killable(&folio->page, PG_locked);
+}
+
static inline void wait_on_page_locked(struct page *page)
{
- if (PageLocked(page))
- wait_on_page_bit(compound_head(page), PG_locked);
+ folio_wait_locked(page_folio(page));
}
static inline int wait_on_page_locked_killable(struct page *page)
{
- if (!PageLocked(page))
- return 0;
- return wait_on_page_bit_killable(compound_head(page), PG_locked);
+ return folio_wait_locked_killable(page_folio(page));
}
int put_and_wait_on_page_locked(struct page *page, int state);
diff --git a/mm/filemap.c b/mm/filemap.c
index c97b804811fc..04fb4a84cf0d 100644
--- a/mm/filemap.c
+++ b/mm/filemap.c
@@ -1649,9 +1649,9 @@ int __lock_page_or_retry(struct page *page, struct mm_struct *mm,
mmap_read_unlock(mm);
if (flags & FAULT_FLAG_KILLABLE)
- wait_on_page_locked_killable(page);
+ folio_wait_locked_killable(folio);
else
- wait_on_page_locked(page);
+ folio_wait_locked(folio);
return 0;
}
if (flags & FAULT_FLAG_KILLABLE) {
--
2.30.2
Convert __lock_page_or_retry() to __folio_lock_or_retry(). This actually
saves 4 bytes in the only caller of lock_page_or_retry() (due to better
register allocation) and saves the 14-byte cost of calling page_folio()
in __folio_lock_or_retry(), for a total saving of 18 bytes. Also change
the return type to bool.
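A sketch of the call-site pattern, loosely modelled on the do_swap_page()
caller; error handling is simplified and "page" stands for whatever page
the fault handler looked up:

#include <linux/mm.h>
#include <linux/pagemap.h>

static vm_fault_t example_lock_for_fault(struct vm_fault *vmf, struct page *page)
{
	if (!lock_page_or_retry(page, vmf->vma->vm_mm, vmf->flags))
		return VM_FAULT_RETRY;	/* mmap_lock may already have been dropped */
	/* ... page is locked; handle the fault ... */
	unlock_page(page);
	return 0;
}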
Signed-off-by: Matthew Wilcox (Oracle) <[email protected]>
Reviewed-by: Christoph Hellwig <[email protected]>
Acked-by: Jeff Layton <[email protected]>
Acked-by: Kirill A. Shutemov <[email protected]>
Reviewed-by: William Kucharski <[email protected]>
---
include/linux/pagemap.h | 11 +++++++----
mm/filemap.c | 20 +++++++++-----------
mm/memory.c | 8 ++++----
3 files changed, 20 insertions(+), 19 deletions(-)
diff --git a/include/linux/pagemap.h b/include/linux/pagemap.h
index 03fea8bbfd8e..626dbccbfb90 100644
--- a/include/linux/pagemap.h
+++ b/include/linux/pagemap.h
@@ -655,7 +655,7 @@ static inline bool wake_page_match(struct wait_page_queue *wait_page,
void __folio_lock(struct folio *folio);
int __folio_lock_killable(struct folio *folio);
-extern int __lock_page_or_retry(struct page *page, struct mm_struct *mm,
+bool __folio_lock_or_retry(struct folio *folio, struct mm_struct *mm,
unsigned int flags);
void unlock_page(struct page *page);
void folio_unlock(struct folio *folio);
@@ -716,13 +716,16 @@ static inline int lock_page_killable(struct page *page)
* caller indicated that it can handle a retry.
*
* Return value and mmap_lock implications depend on flags; see
- * __lock_page_or_retry().
+ * __folio_lock_or_retry().
*/
-static inline int lock_page_or_retry(struct page *page, struct mm_struct *mm,
+static inline bool lock_page_or_retry(struct page *page, struct mm_struct *mm,
unsigned int flags)
{
+ struct folio *folio;
might_sleep();
- return trylock_page(page) || __lock_page_or_retry(page, mm, flags);
+
+ folio = page_folio(page);
+ return folio_trylock(folio) || __folio_lock_or_retry(folio, mm, flags);
}
/*
diff --git a/mm/filemap.c b/mm/filemap.c
index 04fb4a84cf0d..fb6398a532e5 100644
--- a/mm/filemap.c
+++ b/mm/filemap.c
@@ -1625,48 +1625,46 @@ static int __folio_lock_async(struct folio *folio, struct wait_page_queue *wait)
/*
* Return values:
- * 1 - page is locked; mmap_lock is still held.
- * 0 - page is not locked.
+ * true - folio is locked; mmap_lock is still held.
+ * false - folio is not locked.
* mmap_lock has been released (mmap_read_unlock(), unless flags had both
* FAULT_FLAG_ALLOW_RETRY and FAULT_FLAG_RETRY_NOWAIT set, in
* which case mmap_lock is still held.
*
* If neither ALLOW_RETRY nor KILLABLE are set, will always return 1
- * with the page locked and the mmap_lock unperturbed.
+ * with the folio locked and the mmap_lock unperturbed.
*/
-int __lock_page_or_retry(struct page *page, struct mm_struct *mm,
+bool __folio_lock_or_retry(struct folio *folio, struct mm_struct *mm,
unsigned int flags)
{
- struct folio *folio = page_folio(page);
-
if (fault_flag_allow_retry_first(flags)) {
/*
* CAUTION! In this case, mmap_lock is not released
* even though return 0.
*/
if (flags & FAULT_FLAG_RETRY_NOWAIT)
- return 0;
+ return false;
mmap_read_unlock(mm);
if (flags & FAULT_FLAG_KILLABLE)
folio_wait_locked_killable(folio);
else
folio_wait_locked(folio);
- return 0;
+ return false;
}
if (flags & FAULT_FLAG_KILLABLE) {
- int ret;
+ bool ret;
ret = __folio_lock_killable(folio);
if (ret) {
mmap_read_unlock(mm);
- return 0;
+ return false;
}
} else {
__folio_lock(folio);
}
- return 1;
+ return true;
}
/**
diff --git a/mm/memory.c b/mm/memory.c
index 747a01d495f2..2f111f9b3dbc 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -4248,7 +4248,7 @@ static vm_fault_t do_shared_fault(struct vm_fault *vmf)
* We enter with non-exclusive mmap_lock (to exclude vma changes,
* but allow concurrent faults).
* The mmap_lock may have been released depending on flags and our
- * return value. See filemap_fault() and __lock_page_or_retry().
+ * return value. See filemap_fault() and __folio_lock_or_retry().
* If mmap_lock is released, vma may become invalid (for example
* by other thread calling munmap()).
*/
@@ -4489,7 +4489,7 @@ static vm_fault_t wp_huge_pud(struct vm_fault *vmf, pud_t orig_pud)
* concurrent faults).
*
* The mmap_lock may have been released depending on flags and our return value.
- * See filemap_fault() and __lock_page_or_retry().
+ * See filemap_fault() and __folio_lock_or_retry().
*/
static vm_fault_t handle_pte_fault(struct vm_fault *vmf)
{
@@ -4593,7 +4593,7 @@ static vm_fault_t handle_pte_fault(struct vm_fault *vmf)
* By the time we get here, we already hold the mm semaphore
*
* The mmap_lock may have been released depending on flags and our
- * return value. See filemap_fault() and __lock_page_or_retry().
+ * return value. See filemap_fault() and __folio_lock_or_retry().
*/
static vm_fault_t __handle_mm_fault(struct vm_area_struct *vma,
unsigned long address, unsigned int flags)
@@ -4749,7 +4749,7 @@ static inline void mm_account_fault(struct pt_regs *regs,
* By the time we get here, we already hold the mm semaphore
*
* The mmap_lock may have been released depending on flags and our
- * return value. See filemap_fault() and __lock_page_or_retry().
+ * return value. See filemap_fault() and __folio_lock_or_retry().
*/
vm_fault_t handle_mm_fault(struct vm_area_struct *vma, unsigned long address,
unsigned int flags, struct pt_regs *regs)
--
2.30.2
Convert rotate_reclaimable_page() to folio_rotate_reclaimable(). This
eliminates all five calls to compound_head() in the function, saving 75
bytes at the cost of adding 15 bytes to its one caller,
end_page_writeback(). We also save 36 bytes in pagevec_move_tail_fn() by
using folios there, for a net saving of 96 bytes.
Also move the declaration to mm/internal.h, as the function is only used
by filemap.c.
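A sketch of the caller pattern; folio_test_reclaim() and
folio_clear_reclaim() are assumed from the folio flag helpers added
earlier in the series:

	/* Writeback just finished on a folio marked PG_reclaim: move it
	 * to the tail of the inactive LRU so it is reclaimed promptly.
	 */
	if (folio_test_reclaim(folio)) {
		folio_clear_reclaim(folio);
		folio_rotate_reclaimable(folio);
	}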
Signed-off-by: Matthew Wilcox (Oracle) <[email protected]>
Acked-by: Vlastimil Babka <[email protected]>
Reviewed-by: William Kucharski <[email protected]>
Reviewed-by: Christoph Hellwig <[email protected]>
Acked-by: Kirill A. Shutemov <[email protected]>
---
include/linux/swap.h | 1 -
mm/filemap.c | 3 ++-
mm/internal.h | 1 +
mm/page_io.c | 4 ++--
mm/swap.c | 30 ++++++++++++++++--------------
5 files changed, 21 insertions(+), 18 deletions(-)
diff --git a/include/linux/swap.h b/include/linux/swap.h
index 3d3d85354026..8394716a002b 100644
--- a/include/linux/swap.h
+++ b/include/linux/swap.h
@@ -371,7 +371,6 @@ extern void lru_add_drain(void);
extern void lru_add_drain_cpu(int cpu);
extern void lru_add_drain_cpu_zone(struct zone *zone);
extern void lru_add_drain_all(void);
-extern void rotate_reclaimable_page(struct page *page);
extern void deactivate_file_page(struct page *page);
extern void deactivate_page(struct page *page);
extern void mark_page_lazyfree(struct page *page);
diff --git a/mm/filemap.c b/mm/filemap.c
index fb6398a532e5..4ce2b22b64f8 100644
--- a/mm/filemap.c
+++ b/mm/filemap.c
@@ -1529,8 +1529,9 @@ void end_page_writeback(struct page *page)
* ever page writeback.
*/
if (PageReclaim(page)) {
+ struct folio *folio = page_folio(page);
ClearPageReclaim(page);
- rotate_reclaimable_page(page);
+ folio_rotate_reclaimable(folio);
}
/*
diff --git a/mm/internal.h b/mm/internal.h
index 31ff935b2547..1a8851b73031 100644
--- a/mm/internal.h
+++ b/mm/internal.h
@@ -35,6 +35,7 @@
void page_writeback_init(void);
vm_fault_t do_swap_page(struct vm_fault *vmf);
+void folio_rotate_reclaimable(struct folio *folio);
void free_pgtables(struct mmu_gather *tlb, struct vm_area_struct *start_vma,
unsigned long floor, unsigned long ceiling);
diff --git a/mm/page_io.c b/mm/page_io.c
index c493ce9ebcf5..d597bc6e6e45 100644
--- a/mm/page_io.c
+++ b/mm/page_io.c
@@ -38,7 +38,7 @@ void end_swap_bio_write(struct bio *bio)
* Also print a dire warning that things will go BAD (tm)
* very quickly.
*
- * Also clear PG_reclaim to avoid rotate_reclaimable_page()
+ * Also clear PG_reclaim to avoid folio_rotate_reclaimable()
*/
set_page_dirty(page);
pr_alert_ratelimited("Write-error on swap-device (%u:%u:%llu)\n",
@@ -317,7 +317,7 @@ int __swap_writepage(struct page *page, struct writeback_control *wbc,
* temporary failure if the system has limited
* memory for allocating transmit buffers.
* Mark the page dirty and avoid
- * rotate_reclaimable_page but rate-limit the
+ * folio_rotate_reclaimable but rate-limit the
* messages but do not flag PageError like
* the normal direct-to-bio case as it could
* be temporary.
diff --git a/mm/swap.c b/mm/swap.c
index 19600430e536..095a5ec6f986 100644
--- a/mm/swap.c
+++ b/mm/swap.c
@@ -228,11 +228,13 @@ static void pagevec_lru_move_fn(struct pagevec *pvec,
static void pagevec_move_tail_fn(struct page *page, struct lruvec *lruvec)
{
- if (!PageUnevictable(page)) {
- del_page_from_lru_list(page, lruvec);
- ClearPageActive(page);
- add_page_to_lru_list_tail(page, lruvec);
- __count_vm_events(PGROTATED, thp_nr_pages(page));
+ struct folio *folio = page_folio(page);
+
+ if (!folio_test_unevictable(folio)) {
+ lruvec_del_folio(lruvec, folio);
+ folio_clear_active(folio);
+ lruvec_add_folio_tail(lruvec, folio);
+ __count_vm_events(PGROTATED, folio_nr_pages(folio));
}
}
@@ -249,23 +251,23 @@ static bool pagevec_add_and_need_flush(struct pagevec *pvec, struct page *page)
}
/*
- * Writeback is about to end against a page which has been marked for immediate
- * reclaim. If it still appears to be reclaimable, move it to the tail of the
- * inactive list.
+ * Writeback is about to end against a folio which has been marked for
+ * immediate reclaim. If it still appears to be reclaimable, move it
+ * to the tail of the inactive list.
*
- * rotate_reclaimable_page() must disable IRQs, to prevent nasty races.
+ * folio_rotate_reclaimable() must disable IRQs, to prevent nasty races.
*/
-void rotate_reclaimable_page(struct page *page)
+void folio_rotate_reclaimable(struct folio *folio)
{
- if (!PageLocked(page) && !PageDirty(page) &&
- !PageUnevictable(page) && PageLRU(page)) {
+ if (!folio_test_locked(folio) && !folio_test_dirty(folio) &&
+ !folio_test_unevictable(folio) && folio_test_lru(folio)) {
struct pagevec *pvec;
unsigned long flags;
- get_page(page);
+ folio_get(folio);
local_lock_irqsave(&lru_rotate.lock, flags);
pvec = this_cpu_ptr(&lru_rotate.pvec);
- if (pagevec_add_and_need_flush(pvec, page))
+ if (pagevec_add_and_need_flush(pvec, &folio->page))
pagevec_lru_move_fn(pvec, pagevec_move_tail_fn);
local_unlock_irqrestore(&lru_rotate.lock, flags);
}
--
2.30.2
Move wait_for_stable_page() into the folio compatibility file.
folio_wait_stable() avoids a call to compound_head() and is 14 bytes
smaller than wait_for_stable_page() was. The net text size grows by 16
bytes as a result of this patch. We can also remove thp_head() as this
was the last user.
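A usage sketch for a path that must not modify a folio while a
stable-writes backing device may still be writing it out (folio_lock()
is assumed from earlier in the series):

	folio_lock(folio);
	folio_wait_stable(folio);	/* no-op unless the device needs stable writes */
	/* ... safe to modify and re-dirty the folio ... */
	folio_unlock(folio);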
Signed-off-by: Matthew Wilcox (Oracle) <[email protected]>
Reviewed-by: Christoph Hellwig <[email protected]>
Acked-by: Jeff Layton <[email protected]>
Acked-by: Kirill A. Shutemov <[email protected]>
Acked-by: Vlastimil Babka <[email protected]>
Reviewed-by: William Kucharski <[email protected]>
Reviewed-by: David Howells <[email protected]>
---
include/linux/huge_mm.h | 15 ---------------
include/linux/pagemap.h | 1 +
mm/folio-compat.c | 6 ++++++
mm/page-writeback.c | 24 ++++++++++++++----------
4 files changed, 21 insertions(+), 25 deletions(-)
diff --git a/include/linux/huge_mm.h b/include/linux/huge_mm.h
index f123e15d966e..f280f33ff223 100644
--- a/include/linux/huge_mm.h
+++ b/include/linux/huge_mm.h
@@ -250,15 +250,6 @@ static inline spinlock_t *pud_trans_huge_lock(pud_t *pud,
return NULL;
}
-/**
- * thp_head - Head page of a transparent huge page.
- * @page: Any page (tail, head or regular) found in the page cache.
- */
-static inline struct page *thp_head(struct page *page)
-{
- return compound_head(page);
-}
-
/**
* thp_order - Order of a transparent huge page.
* @page: Head page of a transparent huge page.
@@ -336,12 +327,6 @@ static inline struct list_head *page_deferred_list(struct page *page)
#define HPAGE_PUD_MASK ({ BUILD_BUG(); 0; })
#define HPAGE_PUD_SIZE ({ BUILD_BUG(); 0; })
-static inline struct page *thp_head(struct page *page)
-{
- VM_BUG_ON_PGFLAGS(PageTail(page), page);
- return page;
-}
-
static inline unsigned int thp_order(struct page *page)
{
VM_BUG_ON_PGFLAGS(PageTail(page), page);
diff --git a/include/linux/pagemap.h b/include/linux/pagemap.h
index 0c5f53368fe9..96b62a2331fb 100644
--- a/include/linux/pagemap.h
+++ b/include/linux/pagemap.h
@@ -772,6 +772,7 @@ int folio_wait_writeback_killable(struct folio *folio);
void end_page_writeback(struct page *page);
void folio_end_writeback(struct folio *folio);
void wait_for_stable_page(struct page *page);
+void folio_wait_stable(struct folio *folio);
void __set_page_dirty(struct page *, struct address_space *, int warn);
int __set_page_dirty_nobuffers(struct page *page);
diff --git a/mm/folio-compat.c b/mm/folio-compat.c
index 41275dac7a92..3c83f03b80d7 100644
--- a/mm/folio-compat.c
+++ b/mm/folio-compat.c
@@ -29,3 +29,9 @@ void wait_on_page_writeback(struct page *page)
return folio_wait_writeback(page_folio(page));
}
EXPORT_SYMBOL_GPL(wait_on_page_writeback);
+
+void wait_for_stable_page(struct page *page)
+{
+ return folio_wait_stable(page_folio(page));
+}
+EXPORT_SYMBOL_GPL(wait_for_stable_page);
diff --git a/mm/page-writeback.c b/mm/page-writeback.c
index c2c00e1533ad..a078e9786cc4 100644
--- a/mm/page-writeback.c
+++ b/mm/page-writeback.c
@@ -2877,17 +2877,21 @@ int folio_wait_writeback_killable(struct folio *folio)
EXPORT_SYMBOL_GPL(folio_wait_writeback_killable);
/**
- * wait_for_stable_page() - wait for writeback to finish, if necessary.
- * @page: The page to wait on.
+ * folio_wait_stable() - wait for writeback to finish, if necessary.
+ * @folio: The folio to wait on.
*
- * This function determines if the given page is related to a backing device
- * that requires page contents to be held stable during writeback. If so, then
- * it will wait for any pending writeback to complete.
+ * This function determines if the given folio is related to a backing
+ * device that requires folio contents to be held stable during writeback.
+ * If so, then it will wait for any pending writeback to complete.
+ *
+ * Context: Sleeps. Must be called in process context and with
+ * no spinlocks held. Caller should hold a reference on the folio.
+ * If the folio is not locked, writeback may start again after writeback
+ * has finished.
*/
-void wait_for_stable_page(struct page *page)
+void folio_wait_stable(struct folio *folio)
{
- page = thp_head(page);
- if (page->mapping->host->i_sb->s_iflags & SB_I_STABLE_WRITES)
- wait_on_page_writeback(page);
+ if (folio->mapping->host->i_sb->s_iflags & SB_I_STABLE_WRITES)
+ folio_wait_writeback(folio);
}
-EXPORT_SYMBOL_GPL(wait_for_stable_page);
+EXPORT_SYMBOL_GPL(folio_wait_stable);
--
2.30.2
wait_on_page_writeback_killable() only has one caller, so convert it to
call folio_wait_writeback_killable(). For the wait_on_page_writeback()
callers, add a compatibility wrapper around folio_wait_writeback().
Turning PageWriteback() into folio_test_writeback() eliminates a call to
compound_head(), saving 8 and 15 bytes respectively in the two functions.
Unfortunately, that is more than offset by adding the
wait_on_page_writeback() compatibility wrapper, for a net increase in
text of 7 bytes.
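A minimal sketch of the two waiters (illustrative fragment):

	int err;

	/* Uninterruptible wait for any writeback in flight. */
	folio_wait_writeback(folio);

	/* Interruptible variant: -EINTR if a fatal signal arrives. */
	err = folio_wait_writeback_killable(folio);
	if (err)
		return err;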
Signed-off-by: Matthew Wilcox (Oracle) <[email protected]>
Reviewed-by: Christoph Hellwig <[email protected]>
Acked-by: Jeff Layton <[email protected]>
Acked-by: Kirill A. Shutemov <[email protected]>
Acked-by: Vlastimil Babka <[email protected]>
Reviewed-by: William Kucharski <[email protected]>
---
fs/afs/write.c | 9 ++++----
include/linux/pagemap.h | 3 ++-
mm/folio-compat.c | 6 ++++++
mm/page-writeback.c | 48 ++++++++++++++++++++++++++++-------------
4 files changed, 46 insertions(+), 20 deletions(-)
diff --git a/fs/afs/write.c b/fs/afs/write.c
index 3104b62c2082..fb7d5c1cabde 100644
--- a/fs/afs/write.c
+++ b/fs/afs/write.c
@@ -839,7 +839,8 @@ int afs_fsync(struct file *file, loff_t start, loff_t end, int datasync)
*/
vm_fault_t afs_page_mkwrite(struct vm_fault *vmf)
{
- struct page *page = thp_head(vmf->page);
+ struct folio *folio = page_folio(vmf->page);
+ struct page *page = &folio->page;
struct file *file = vmf->vma->vm_file;
struct inode *inode = file_inode(file);
struct afs_vnode *vnode = AFS_FS_I(inode);
@@ -859,7 +860,7 @@ vm_fault_t afs_page_mkwrite(struct vm_fault *vmf)
goto out;
#endif
- if (wait_on_page_writeback_killable(page))
+ if (folio_wait_writeback_killable(folio))
goto out;
if (lock_page_killable(page) < 0)
@@ -869,8 +870,8 @@ vm_fault_t afs_page_mkwrite(struct vm_fault *vmf)
* details the portion of the page we need to write back and we might
* need to redirty the page if there's a problem.
*/
- if (wait_on_page_writeback_killable(page) < 0) {
- unlock_page(page);
+ if (folio_wait_writeback_killable(folio) < 0) {
+ folio_unlock(folio);
goto out;
}
diff --git a/include/linux/pagemap.h b/include/linux/pagemap.h
index 66a019178550..0c5f53368fe9 100644
--- a/include/linux/pagemap.h
+++ b/include/linux/pagemap.h
@@ -767,7 +767,8 @@ static inline int wait_on_page_locked_killable(struct page *page)
int put_and_wait_on_page_locked(struct page *page, int state);
void wait_on_page_writeback(struct page *page);
-int wait_on_page_writeback_killable(struct page *page);
+void folio_wait_writeback(struct folio *folio);
+int folio_wait_writeback_killable(struct folio *folio);
void end_page_writeback(struct page *page);
void folio_end_writeback(struct folio *folio);
void wait_for_stable_page(struct page *page);
diff --git a/mm/folio-compat.c b/mm/folio-compat.c
index 526843d03d58..41275dac7a92 100644
--- a/mm/folio-compat.c
+++ b/mm/folio-compat.c
@@ -23,3 +23,9 @@ void end_page_writeback(struct page *page)
return folio_end_writeback(page_folio(page));
}
EXPORT_SYMBOL(end_page_writeback);
+
+void wait_on_page_writeback(struct page *page)
+{
+ return folio_wait_writeback(page_folio(page));
+}
+EXPORT_SYMBOL_GPL(wait_on_page_writeback);
diff --git a/mm/page-writeback.c b/mm/page-writeback.c
index 9f63548f247c..c2c00e1533ad 100644
--- a/mm/page-writeback.c
+++ b/mm/page-writeback.c
@@ -2830,33 +2830,51 @@ int __test_set_page_writeback(struct page *page, bool keep_write)
}
EXPORT_SYMBOL(__test_set_page_writeback);
-/*
- * Wait for a page to complete writeback
+/**
+ * folio_wait_writeback - Wait for a folio to finish writeback.
+ * @folio: The folio to wait for.
+ *
+ * If the folio is currently being written back to storage, wait for the
+ * I/O to complete.
+ *
+ * Context: Sleeps. Must be called in process context and with
+ * no spinlocks held. Caller should hold a reference on the folio.
+ * If the folio is not locked, writeback may start again after writeback
+ * has finished.
*/
-void wait_on_page_writeback(struct page *page)
+void folio_wait_writeback(struct folio *folio)
{
- while (PageWriteback(page)) {
- trace_wait_on_page_writeback(page, page_mapping(page));
- wait_on_page_bit(page, PG_writeback);
+ while (folio_test_writeback(folio)) {
+ trace_wait_on_page_writeback(&folio->page, folio_mapping(folio));
+ wait_on_page_bit(&folio->page, PG_writeback);
}
}
-EXPORT_SYMBOL_GPL(wait_on_page_writeback);
+EXPORT_SYMBOL_GPL(folio_wait_writeback);
-/*
- * Wait for a page to complete writeback. Returns -EINTR if we get a
- * fatal signal while waiting.
+/**
+ * folio_wait_writeback_killable - Wait for a folio to finish writeback.
+ * @folio: The folio to wait for.
+ *
+ * If the folio is currently being written back to storage, wait for the
+ * I/O to complete or a fatal signal to arrive.
+ *
+ * Context: Sleeps. Must be called in process context and with
+ * no spinlocks held. Caller should hold a reference on the folio.
+ * If the folio is not locked, writeback may start again after writeback
+ * has finished.
+ * Return: 0 on success, -EINTR if we get a fatal signal while waiting.
*/
-int wait_on_page_writeback_killable(struct page *page)
+int folio_wait_writeback_killable(struct folio *folio)
{
- while (PageWriteback(page)) {
- trace_wait_on_page_writeback(page, page_mapping(page));
- if (wait_on_page_bit_killable(page, PG_writeback))
+ while (folio_test_writeback(folio)) {
+ trace_wait_on_page_writeback(&folio->page, folio_mapping(folio));
+ if (wait_on_page_bit_killable(&folio->page, PG_writeback))
return -EINTR;
}
return 0;
}
-EXPORT_SYMBOL_GPL(wait_on_page_writeback_killable);
+EXPORT_SYMBOL_GPL(folio_wait_writeback_killable);
/**
* wait_for_stable_page() - wait for writeback to finish, if necessary.
--
2.30.2
Rename wait_on_page_bit() to folio_wait_bit(). We must always wait on
the folio, otherwise we won't be woken up due to the tail page hashing
to a different bucket from the head page.
This commit shrinks the kernel by 770 bytes, mostly due to moving
the page waitqueue lookup into folio_wait_bit_common().
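A sketch of a low-level caller; this is essentially what
folio_wait_writeback() becomes later in the series:

	while (folio_test_writeback(folio))
		folio_wait_bit(folio, PG_writeback);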
Signed-off-by: Matthew Wilcox (Oracle) <[email protected]>
Reviewed-by: Christoph Hellwig <[email protected]>
Acked-by: Jeff Layton <[email protected]>
Acked-by: Kirill A. Shutemov <[email protected]>
Acked-by: Vlastimil Babka <[email protected]>
Reviewed-by: William Kucharski <[email protected]>
Reviewed-by: David Howells <[email protected]>
---
include/linux/pagemap.h | 10 +++---
mm/filemap.c | 77 +++++++++++++++++++----------------------
mm/page-writeback.c | 4 +--
3 files changed, 43 insertions(+), 48 deletions(-)
diff --git a/include/linux/pagemap.h b/include/linux/pagemap.h
index 96b62a2331fb..7eb02baf6f9f 100644
--- a/include/linux/pagemap.h
+++ b/include/linux/pagemap.h
@@ -729,11 +729,11 @@ static inline bool lock_page_or_retry(struct page *page, struct mm_struct *mm,
}
/*
- * This is exported only for wait_on_page_locked/wait_on_page_writeback, etc.,
+ * This is exported only for folio_wait_locked/folio_wait_writeback, etc.,
* and should not be used directly.
*/
-extern void wait_on_page_bit(struct page *page, int bit_nr);
-extern int wait_on_page_bit_killable(struct page *page, int bit_nr);
+void folio_wait_bit(struct folio *folio, int bit_nr);
+int folio_wait_bit_killable(struct folio *folio, int bit_nr);
/*
* Wait for a folio to be unlocked.
@@ -745,14 +745,14 @@ extern int wait_on_page_bit_killable(struct page *page, int bit_nr);
static inline void folio_wait_locked(struct folio *folio)
{
if (folio_test_locked(folio))
- wait_on_page_bit(&folio->page, PG_locked);
+ folio_wait_bit(folio, PG_locked);
}
static inline int folio_wait_locked_killable(struct folio *folio)
{
if (!folio_test_locked(folio))
return 0;
- return wait_on_page_bit_killable(&folio->page, PG_locked);
+ return folio_wait_bit_killable(folio, PG_locked);
}
static inline void wait_on_page_locked(struct page *page)
diff --git a/mm/filemap.c b/mm/filemap.c
index b5a0d546e436..b55c89d7997f 100644
--- a/mm/filemap.c
+++ b/mm/filemap.c
@@ -1102,7 +1102,7 @@ static int wake_page_function(wait_queue_entry_t *wait, unsigned mode, int sync,
*
* So update the flags atomically, and wake up the waiter
* afterwards to avoid any races. This store-release pairs
- * with the load-acquire in wait_on_page_bit_common().
+ * with the load-acquire in folio_wait_bit_common().
*/
smp_store_release(&wait->flags, flags | WQ_FLAG_WOKEN);
wake_up_state(wait->private, mode);
@@ -1183,7 +1183,7 @@ static void folio_wake(struct folio *folio, int bit)
}
/*
- * A choice of three behaviors for wait_on_page_bit_common():
+ * A choice of three behaviors for folio_wait_bit_common():
*/
enum behavior {
EXCLUSIVE, /* Hold ref to page and take the bit when woken, like
@@ -1198,16 +1198,16 @@ enum behavior {
};
/*
- * Attempt to check (or get) the page bit, and mark us done
+ * Attempt to check (or get) the folio flag, and mark us done
* if successful.
*/
-static inline bool trylock_page_bit_common(struct page *page, int bit_nr,
+static inline bool folio_trylock_flag(struct folio *folio, int bit_nr,
struct wait_queue_entry *wait)
{
if (wait->flags & WQ_FLAG_EXCLUSIVE) {
- if (test_and_set_bit(bit_nr, &page->flags))
+ if (test_and_set_bit(bit_nr, &folio->flags))
return false;
- } else if (test_bit(bit_nr, &page->flags))
+ } else if (test_bit(bit_nr, &folio->flags))
return false;
wait->flags |= WQ_FLAG_WOKEN | WQ_FLAG_DONE;
@@ -1217,9 +1217,10 @@ static inline bool trylock_page_bit_common(struct page *page, int bit_nr,
/* How many times do we accept lock stealing from under a waiter? */
int sysctl_page_lock_unfairness = 5;
-static inline int wait_on_page_bit_common(wait_queue_head_t *q,
- struct page *page, int bit_nr, int state, enum behavior behavior)
+static inline int folio_wait_bit_common(struct folio *folio, int bit_nr,
+ int state, enum behavior behavior)
{
+ wait_queue_head_t *q = page_waitqueue(&folio->page);
int unfairness = sysctl_page_lock_unfairness;
struct wait_page_queue wait_page;
wait_queue_entry_t *wait = &wait_page.wait;
@@ -1228,8 +1229,8 @@ static inline int wait_on_page_bit_common(wait_queue_head_t *q,
unsigned long pflags;
if (bit_nr == PG_locked &&
- !PageUptodate(page) && PageWorkingset(page)) {
- if (!PageSwapBacked(page)) {
+ !folio_test_uptodate(folio) && folio_test_workingset(folio)) {
+ if (!folio_test_swapbacked(folio)) {
delayacct_thrashing_start();
delayacct = true;
}
@@ -1239,7 +1240,7 @@ static inline int wait_on_page_bit_common(wait_queue_head_t *q,
init_wait(wait);
wait->func = wake_page_function;
- wait_page.page = page;
+ wait_page.page = &folio->page;
wait_page.bit_nr = bit_nr;
repeat:
@@ -1254,7 +1255,7 @@ static inline int wait_on_page_bit_common(wait_queue_head_t *q,
* Do one last check whether we can get the
* page bit synchronously.
*
- * Do the SetPageWaiters() marking before that
+ * Do the folio_set_waiters() marking before that
* to let any waker we _just_ missed know they
* need to wake us up (otherwise they'll never
* even go to the slow case that looks at the
@@ -1265,8 +1266,8 @@ static inline int wait_on_page_bit_common(wait_queue_head_t *q,
* lock to avoid races.
*/
spin_lock_irq(&q->lock);
- SetPageWaiters(page);
- if (!trylock_page_bit_common(page, bit_nr, wait))
+ folio_set_waiters(folio);
+ if (!folio_trylock_flag(folio, bit_nr, wait))
__add_wait_queue_entry_tail(q, wait);
spin_unlock_irq(&q->lock);
@@ -1276,10 +1277,10 @@ static inline int wait_on_page_bit_common(wait_queue_head_t *q,
* see whether the page bit testing has already
* been done by the wake function.
*
- * We can drop our reference to the page.
+ * We can drop our reference to the folio.
*/
if (behavior == DROP)
- put_page(page);
+ folio_put(folio);
/*
* Note that until the "finish_wait()", or until
@@ -1316,7 +1317,7 @@ static inline int wait_on_page_bit_common(wait_queue_head_t *q,
*
* And if that fails, we'll have to retry this all.
*/
- if (unlikely(test_and_set_bit(bit_nr, &page->flags)))
+ if (unlikely(test_and_set_bit(bit_nr, folio_flags(folio, 0))))
goto repeat;
wait->flags |= WQ_FLAG_DONE;
@@ -1325,7 +1326,7 @@ static inline int wait_on_page_bit_common(wait_queue_head_t *q,
/*
* If a signal happened, this 'finish_wait()' may remove the last
- * waiter from the wait-queues, but the PageWaiters bit will remain
+ * waiter from the wait-queues, but the folio waiters bit will remain
* set. That's ok. The next wakeup will take care of it, and trying
* to do it here would be difficult and prone to races.
*/
@@ -1356,19 +1357,17 @@ static inline int wait_on_page_bit_common(wait_queue_head_t *q,
return wait->flags & WQ_FLAG_WOKEN ? 0 : -EINTR;
}
-void wait_on_page_bit(struct page *page, int bit_nr)
+void folio_wait_bit(struct folio *folio, int bit_nr)
{
- wait_queue_head_t *q = page_waitqueue(page);
- wait_on_page_bit_common(q, page, bit_nr, TASK_UNINTERRUPTIBLE, SHARED);
+ folio_wait_bit_common(folio, bit_nr, TASK_UNINTERRUPTIBLE, SHARED);
}
-EXPORT_SYMBOL(wait_on_page_bit);
+EXPORT_SYMBOL(folio_wait_bit);
-int wait_on_page_bit_killable(struct page *page, int bit_nr)
+int folio_wait_bit_killable(struct folio *folio, int bit_nr)
{
- wait_queue_head_t *q = page_waitqueue(page);
- return wait_on_page_bit_common(q, page, bit_nr, TASK_KILLABLE, SHARED);
+ return folio_wait_bit_common(folio, bit_nr, TASK_KILLABLE, SHARED);
}
-EXPORT_SYMBOL(wait_on_page_bit_killable);
+EXPORT_SYMBOL(folio_wait_bit_killable);
/**
* put_and_wait_on_page_locked - Drop a reference and wait for it to be unlocked
@@ -1385,11 +1384,8 @@ EXPORT_SYMBOL(wait_on_page_bit_killable);
*/
int put_and_wait_on_page_locked(struct page *page, int state)
{
- wait_queue_head_t *q;
-
- page = compound_head(page);
- q = page_waitqueue(page);
- return wait_on_page_bit_common(q, page, PG_locked, state, DROP);
+ return folio_wait_bit_common(page_folio(page), PG_locked, state,
+ DROP);
}
/**
@@ -1483,9 +1479,10 @@ EXPORT_SYMBOL(end_page_private_2);
*/
void wait_on_page_private_2(struct page *page)
{
- page = compound_head(page);
- while (PagePrivate2(page))
- wait_on_page_bit(page, PG_private_2);
+ struct folio *folio = page_folio(page);
+
+ while (folio_test_private_2(folio))
+ folio_wait_bit(folio, PG_private_2);
}
EXPORT_SYMBOL(wait_on_page_private_2);
@@ -1502,11 +1499,11 @@ EXPORT_SYMBOL(wait_on_page_private_2);
*/
int wait_on_page_private_2_killable(struct page *page)
{
+ struct folio *folio = page_folio(page);
int ret = 0;
- page = compound_head(page);
- while (PagePrivate2(page)) {
- ret = wait_on_page_bit_killable(page, PG_private_2);
+ while (folio_test_private_2(folio)) {
+ ret = folio_wait_bit_killable(folio, PG_private_2);
if (ret < 0)
break;
}
@@ -1583,16 +1580,14 @@ EXPORT_SYMBOL_GPL(page_endio);
*/
void __folio_lock(struct folio *folio)
{
- wait_queue_head_t *q = page_waitqueue(&folio->page);
- wait_on_page_bit_common(q, &folio->page, PG_locked, TASK_UNINTERRUPTIBLE,
+ folio_wait_bit_common(folio, PG_locked, TASK_UNINTERRUPTIBLE,
EXCLUSIVE);
}
EXPORT_SYMBOL(__folio_lock);
int __folio_lock_killable(struct folio *folio)
{
- wait_queue_head_t *q = page_waitqueue(&folio->page);
- return wait_on_page_bit_common(q, &folio->page, PG_locked, TASK_KILLABLE,
+ return folio_wait_bit_common(folio, PG_locked, TASK_KILLABLE,
EXCLUSIVE);
}
EXPORT_SYMBOL_GPL(__folio_lock_killable);
diff --git a/mm/page-writeback.c b/mm/page-writeback.c
index a078e9786cc4..b34278d05395 100644
--- a/mm/page-writeback.c
+++ b/mm/page-writeback.c
@@ -2846,7 +2846,7 @@ void folio_wait_writeback(struct folio *folio)
{
while (folio_test_writeback(folio)) {
trace_wait_on_page_writeback(&folio->page, folio_mapping(folio));
- wait_on_page_bit(&folio->page, PG_writeback);
+ folio_wait_bit(folio, PG_writeback);
}
}
EXPORT_SYMBOL_GPL(folio_wait_writeback);
@@ -2868,7 +2868,7 @@ int folio_wait_writeback_killable(struct folio *folio)
{
while (folio_test_writeback(folio)) {
trace_wait_on_page_writeback(&folio->page, folio_mapping(folio));
- if (wait_on_page_bit_killable(&folio->page, PG_writeback))
+ if (folio_wait_bit_killable(folio, PG_writeback))
return -EINTR;
}
--
2.30.2
Convert wake_up_page_bit() to folio_wake_bit(). All callers have a folio,
so use it directly. Saves 66 bytes of text in end_page_private_2().
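folio_wake_bit() remains internal to mm/filemap.c; a sketch of the
clear-then-wake pattern it serves (this is the shape of
folio_end_private_2() in the diff below):

	clear_bit_unlock(PG_private_2, folio_flags(folio, 0));
	folio_wake_bit(folio, PG_private_2);
	folio_put(folio);	/* drop the ref taken when the flag was set */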
Signed-off-by: Matthew Wilcox (Oracle) <[email protected]>
Reviewed-by: Christoph Hellwig <[email protected]>
Acked-by: Jeff Layton <[email protected]>
Acked-by: Kirill A. Shutemov <[email protected]>
Acked-by: Vlastimil Babka <[email protected]>
Reviewed-by: William Kucharski <[email protected]>
Reviewed-by: David Howells <[email protected]>
---
mm/filemap.c | 23 ++++++++++++-----------
1 file changed, 12 insertions(+), 11 deletions(-)
diff --git a/mm/filemap.c b/mm/filemap.c
index b55c89d7997f..a3ef9abcbcde 100644
--- a/mm/filemap.c
+++ b/mm/filemap.c
@@ -1121,14 +1121,14 @@ static int wake_page_function(wait_queue_entry_t *wait, unsigned mode, int sync,
return (flags & WQ_FLAG_EXCLUSIVE) != 0;
}
-static void wake_up_page_bit(struct page *page, int bit_nr)
+static void folio_wake_bit(struct folio *folio, int bit_nr)
{
- wait_queue_head_t *q = page_waitqueue(page);
+ wait_queue_head_t *q = page_waitqueue(&folio->page);
struct wait_page_key key;
unsigned long flags;
wait_queue_entry_t bookmark;
- key.page = page;
+ key.page = &folio->page;
key.bit_nr = bit_nr;
key.page_match = 0;
@@ -1163,7 +1163,7 @@ static void wake_up_page_bit(struct page *page, int bit_nr)
* page waiters.
*/
if (!waitqueue_active(q) || !key.page_match) {
- ClearPageWaiters(page);
+ folio_clear_waiters(folio);
/*
* It's possible to miss clearing Waiters here, when we woke
* our page waiters, but the hashed waitqueue has waiters for
@@ -1179,7 +1179,7 @@ static void folio_wake(struct folio *folio, int bit)
{
if (!folio_test_waiters(folio))
return;
- wake_up_page_bit(&folio->page, bit);
+ folio_wake_bit(folio, bit);
}
/*
@@ -1446,7 +1446,7 @@ void folio_unlock(struct folio *folio)
BUILD_BUG_ON(PG_locked > 7);
VM_BUG_ON_FOLIO(!folio_test_locked(folio), folio);
if (clear_bit_unlock_is_negative_byte(PG_locked, folio_flags(folio, 0)))
- wake_up_page_bit(&folio->page, PG_locked);
+ folio_wake_bit(folio, PG_locked);
}
EXPORT_SYMBOL(folio_unlock);
@@ -1463,11 +1463,12 @@ EXPORT_SYMBOL(folio_unlock);
*/
void end_page_private_2(struct page *page)
{
- page = compound_head(page);
- VM_BUG_ON_PAGE(!PagePrivate2(page), page);
- clear_bit_unlock(PG_private_2, &page->flags);
- wake_up_page_bit(page, PG_private_2);
- put_page(page);
+ struct folio *folio = page_folio(page);
+
+ VM_BUG_ON_FOLIO(!folio_test_private_2(folio), folio);
+ clear_bit_unlock(PG_private_2, folio_flags(folio, 0));
+ folio_wake_bit(folio, PG_private_2);
+ folio_put(folio);
}
EXPORT_SYMBOL(end_page_private_2);
--
2.30.2
Reinforce that page flags are actually in the head page by changing the
type stored in the page wait queues (wait_page_key and wait_page_queue)
from struct page to struct folio. This increases the size of cachefiles
by two bytes, but the kernel core is unchanged in size.
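A sketch of a custom wait-queue callback comparing against the folio now
carried in struct wait_page_key, modelled on cachefiles_read_waiter()
(the function name is illustrative):

static int example_read_waiter(wait_queue_entry_t *wait, unsigned mode,
			       int sync, void *_key)
{
	struct wait_page_key *key = _key;
	struct folio *folio = wait->private;

	if (key->folio != folio || key->bit_nr != PG_locked)
		return 0;	/* not our folio, or not an unlock event */
	/* ... the folio was unlocked: queue the follow-up work ... */
	return 0;
}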
Signed-off-by: Matthew Wilcox (Oracle) <[email protected]>
Reviewed-by: Christoph Hellwig <[email protected]>
Acked-by: Jeff Layton <[email protected]>
Acked-by: Kirill A. Shutemov <[email protected]>
Acked-by: Vlastimil Babka <[email protected]>
Reviewed-by: William Kucharski <[email protected]>
Reviewed-by: David Howells <[email protected]>
---
fs/cachefiles/rdwr.c | 16 ++++++++--------
include/linux/pagemap.h | 8 ++++----
mm/filemap.c | 38 +++++++++++++++++++-------------------
3 files changed, 31 insertions(+), 31 deletions(-)
diff --git a/fs/cachefiles/rdwr.c b/fs/cachefiles/rdwr.c
index 8ffc40e84a59..fcf4f3b72923 100644
--- a/fs/cachefiles/rdwr.c
+++ b/fs/cachefiles/rdwr.c
@@ -25,20 +25,20 @@ static int cachefiles_read_waiter(wait_queue_entry_t *wait, unsigned mode,
struct cachefiles_object *object;
struct fscache_retrieval *op = monitor->op;
struct wait_page_key *key = _key;
- struct page *page = wait->private;
+ struct folio *folio = wait->private;
ASSERT(key);
_enter("{%lu},%u,%d,{%p,%u}",
monitor->netfs_page->index, mode, sync,
- key->page, key->bit_nr);
+ key->folio, key->bit_nr);
- if (key->page != page || key->bit_nr != PG_locked)
+ if (key->folio != folio || key->bit_nr != PG_locked)
return 0;
- _debug("--- monitor %p %lx ---", page, page->flags);
+ _debug("--- monitor %p %lx ---", folio, folio->flags);
- if (!PageUptodate(page) && !PageError(page)) {
+ if (!folio_test_uptodate(folio) && !folio_test_error(folio)) {
/* unlocked, not uptodate and not erronous? */
_debug("page probably truncated");
}
@@ -107,7 +107,7 @@ static int cachefiles_read_reissue(struct cachefiles_object *object,
put_page(backpage2);
INIT_LIST_HEAD(&monitor->op_link);
- add_page_wait_queue(backpage, &monitor->monitor);
+ folio_add_wait_queue(page_folio(backpage), &monitor->monitor);
if (trylock_page(backpage)) {
ret = -EIO;
@@ -294,7 +294,7 @@ static int cachefiles_read_backing_file_one(struct cachefiles_object *object,
get_page(backpage);
monitor->back_page = backpage;
monitor->monitor.private = backpage;
- add_page_wait_queue(backpage, &monitor->monitor);
+ folio_add_wait_queue(page_folio(backpage), &monitor->monitor);
monitor = NULL;
/* but the page may have been read before the monitor was installed, so
@@ -548,7 +548,7 @@ static int cachefiles_read_backing_file(struct cachefiles_object *object,
get_page(backpage);
monitor->back_page = backpage;
monitor->monitor.private = backpage;
- add_page_wait_queue(backpage, &monitor->monitor);
+ folio_add_wait_queue(page_folio(backpage), &monitor->monitor);
monitor = NULL;
/* but the page may have been read before the monitor was
diff --git a/include/linux/pagemap.h b/include/linux/pagemap.h
index 7eb02baf6f9f..c8e74d67b01f 100644
--- a/include/linux/pagemap.h
+++ b/include/linux/pagemap.h
@@ -629,13 +629,13 @@ static inline pgoff_t linear_page_index(struct vm_area_struct *vma,
}
struct wait_page_key {
- struct page *page;
+ struct folio *folio;
int bit_nr;
int page_match;
};
struct wait_page_queue {
- struct page *page;
+ struct folio *folio;
int bit_nr;
wait_queue_entry_t wait;
};
@@ -643,7 +643,7 @@ struct wait_page_queue {
static inline bool wake_page_match(struct wait_page_queue *wait_page,
struct wait_page_key *key)
{
- if (wait_page->page != key->page)
+ if (wait_page->folio != key->folio)
return false;
key->page_match = 1;
@@ -803,7 +803,7 @@ int wait_on_page_private_2_killable(struct page *page);
/*
* Add an arbitrary waiter to a page's wait queue
*/
-extern void add_page_wait_queue(struct page *page, wait_queue_entry_t *waiter);
+void folio_add_wait_queue(struct folio *folio, wait_queue_entry_t *waiter);
/*
* Fault everything in given userspace address range in.
diff --git a/mm/filemap.c b/mm/filemap.c
index a3ef9abcbcde..1ecaece68019 100644
--- a/mm/filemap.c
+++ b/mm/filemap.c
@@ -1019,11 +1019,11 @@ EXPORT_SYMBOL(__page_cache_alloc);
*/
#define PAGE_WAIT_TABLE_BITS 8
#define PAGE_WAIT_TABLE_SIZE (1 << PAGE_WAIT_TABLE_BITS)
-static wait_queue_head_t page_wait_table[PAGE_WAIT_TABLE_SIZE] __cacheline_aligned;
+static wait_queue_head_t folio_wait_table[PAGE_WAIT_TABLE_SIZE] __cacheline_aligned;
-static wait_queue_head_t *page_waitqueue(struct page *page)
+static wait_queue_head_t *folio_waitqueue(struct folio *folio)
{
- return &page_wait_table[hash_ptr(page, PAGE_WAIT_TABLE_BITS)];
+ return &folio_wait_table[hash_ptr(folio, PAGE_WAIT_TABLE_BITS)];
}
void __init pagecache_init(void)
@@ -1031,7 +1031,7 @@ void __init pagecache_init(void)
int i;
for (i = 0; i < PAGE_WAIT_TABLE_SIZE; i++)
- init_waitqueue_head(&page_wait_table[i]);
+ init_waitqueue_head(&folio_wait_table[i]);
page_writeback_init();
}
@@ -1086,10 +1086,10 @@ static int wake_page_function(wait_queue_entry_t *wait, unsigned mode, int sync,
*/
flags = wait->flags;
if (flags & WQ_FLAG_EXCLUSIVE) {
- if (test_bit(key->bit_nr, &key->page->flags))
+ if (test_bit(key->bit_nr, &key->folio->flags))
return -1;
if (flags & WQ_FLAG_CUSTOM) {
- if (test_and_set_bit(key->bit_nr, &key->page->flags))
+ if (test_and_set_bit(key->bit_nr, &key->folio->flags))
return -1;
flags |= WQ_FLAG_DONE;
}
@@ -1123,12 +1123,12 @@ static int wake_page_function(wait_queue_entry_t *wait, unsigned mode, int sync,
static void folio_wake_bit(struct folio *folio, int bit_nr)
{
- wait_queue_head_t *q = page_waitqueue(&folio->page);
+ wait_queue_head_t *q = folio_waitqueue(folio);
struct wait_page_key key;
unsigned long flags;
wait_queue_entry_t bookmark;
- key.page = &folio->page;
+ key.folio = folio;
key.bit_nr = bit_nr;
key.page_match = 0;
@@ -1220,7 +1220,7 @@ int sysctl_page_lock_unfairness = 5;
static inline int folio_wait_bit_common(struct folio *folio, int bit_nr,
int state, enum behavior behavior)
{
- wait_queue_head_t *q = page_waitqueue(&folio->page);
+ wait_queue_head_t *q = folio_waitqueue(folio);
int unfairness = sysctl_page_lock_unfairness;
struct wait_page_queue wait_page;
wait_queue_entry_t *wait = &wait_page.wait;
@@ -1240,7 +1240,7 @@ static inline int folio_wait_bit_common(struct folio *folio, int bit_nr,
init_wait(wait);
wait->func = wake_page_function;
- wait_page.page = &folio->page;
+ wait_page.folio = folio;
wait_page.bit_nr = bit_nr;
repeat:
@@ -1389,23 +1389,23 @@ int put_and_wait_on_page_locked(struct page *page, int state)
}
/**
- * add_page_wait_queue - Add an arbitrary waiter to a page's wait queue
- * @page: Page defining the wait queue of interest
+ * folio_add_wait_queue - Add an arbitrary waiter to a folio's wait queue
+ * @folio: Folio defining the wait queue of interest
* @waiter: Waiter to add to the queue
*
- * Add an arbitrary @waiter to the wait queue for the nominated @page.
+ * Add an arbitrary @waiter to the wait queue for the nominated @folio.
*/
-void add_page_wait_queue(struct page *page, wait_queue_entry_t *waiter)
+void folio_add_wait_queue(struct folio *folio, wait_queue_entry_t *waiter)
{
- wait_queue_head_t *q = page_waitqueue(page);
+ wait_queue_head_t *q = folio_waitqueue(folio);
unsigned long flags;
spin_lock_irqsave(&q->lock, flags);
__add_wait_queue_entry_tail(q, waiter);
- SetPageWaiters(page);
+ folio_set_waiters(folio);
spin_unlock_irqrestore(&q->lock, flags);
}
-EXPORT_SYMBOL_GPL(add_page_wait_queue);
+EXPORT_SYMBOL_GPL(folio_add_wait_queue);
#ifndef clear_bit_unlock_is_negative_byte
@@ -1595,10 +1595,10 @@ EXPORT_SYMBOL_GPL(__folio_lock_killable);
static int __folio_lock_async(struct folio *folio, struct wait_page_queue *wait)
{
- struct wait_queue_head *q = page_waitqueue(&folio->page);
+ struct wait_queue_head *q = folio_waitqueue(folio);
int ret = 0;
- wait->page = &folio->page;
+ wait->folio = folio;
wait->bit_nr = PG_locked;
spin_lock_irq(&q->lock);
--
2.30.2
end_page_private_2() becomes folio_end_private_2(),
wait_on_page_private_2() becomes folio_wait_private_2() and
wait_on_page_private_2_killable() becomes folio_wait_private_2_killable().
Adjust the fscache equivalents to call page_folio() before calling these
functions to avoid adding wrappers. Ends up costing 1 byte of text
in ceph & netfs, but the core shrinks by three calls to page_folio().
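A usage sketch of the renamed helpers from a filesystem tracking a folio
being written to a local cache (the two fragments are independent call
sites):

	/* Before reusing the folio, wait out any cache write in flight. */
	folio_wait_private_2(folio);

	/* When the cache write completes: clear PG_private_2, wake any
	 * waiters and drop the reference taken when the flag was set.
	 */
	folio_end_private_2(folio);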
Signed-off-by: Matthew Wilcox (Oracle) <[email protected]>
Acked-by: Vlastimil Babka <[email protected]>
Reviewed-by: William Kucharski <[email protected]>
Reviewed-by: Christoph Hellwig <[email protected]>
Reviewed-by: David Howells <[email protected]>
Acked-by: Kirill A. Shutemov <[email protected]>
---
include/linux/netfs.h | 6 +++---
include/linux/pagemap.h | 6 +++---
mm/filemap.c | 37 ++++++++++++++++---------------------
3 files changed, 22 insertions(+), 27 deletions(-)
diff --git a/include/linux/netfs.h b/include/linux/netfs.h
index 9062adfa2fb9..fad8c6209edd 100644
--- a/include/linux/netfs.h
+++ b/include/linux/netfs.h
@@ -55,7 +55,7 @@ static inline void set_page_fscache(struct page *page)
*/
static inline void end_page_fscache(struct page *page)
{
- end_page_private_2(page);
+ folio_end_private_2(page_folio(page));
}
/**
@@ -66,7 +66,7 @@ static inline void end_page_fscache(struct page *page)
*/
static inline void wait_on_page_fscache(struct page *page)
{
- wait_on_page_private_2(page);
+ folio_wait_private_2(page_folio(page));
}
/**
@@ -82,7 +82,7 @@ static inline void wait_on_page_fscache(struct page *page)
*/
static inline int wait_on_page_fscache_killable(struct page *page)
{
- return wait_on_page_private_2_killable(page);
+ return folio_wait_private_2_killable(page_folio(page));
}
enum netfs_read_source {
diff --git a/include/linux/pagemap.h b/include/linux/pagemap.h
index c8e74d67b01f..edf58a581bce 100644
--- a/include/linux/pagemap.h
+++ b/include/linux/pagemap.h
@@ -796,9 +796,9 @@ static inline void set_page_private_2(struct page *page)
SetPagePrivate2(page);
}
-void end_page_private_2(struct page *page);
-void wait_on_page_private_2(struct page *page);
-int wait_on_page_private_2_killable(struct page *page);
+void folio_end_private_2(struct folio *folio);
+void folio_wait_private_2(struct folio *folio);
+int folio_wait_private_2_killable(struct folio *folio);
/*
* Add an arbitrary waiter to a page's wait queue
diff --git a/mm/filemap.c b/mm/filemap.c
index 1ecaece68019..a5d02ec62eb6 100644
--- a/mm/filemap.c
+++ b/mm/filemap.c
@@ -1451,56 +1451,51 @@ void folio_unlock(struct folio *folio)
EXPORT_SYMBOL(folio_unlock);
/**
- * end_page_private_2 - Clear PG_private_2 and release any waiters
- * @page: The page
+ * folio_end_private_2 - Clear PG_private_2 and wake any waiters.
+ * @folio: The folio.
*
- * Clear the PG_private_2 bit on a page and wake up any sleepers waiting for
- * this. The page ref held for PG_private_2 being set is released.
+ * Clear the PG_private_2 bit on a folio and wake up any sleepers waiting for
+ * it. The page ref held for PG_private_2 being set is released.
*
* This is, for example, used when a netfs page is being written to a local
* disk cache, thereby allowing writes to the cache for the same page to be
* serialised.
*/
-void end_page_private_2(struct page *page)
+void folio_end_private_2(struct folio *folio)
{
- struct folio *folio = page_folio(page);
-
VM_BUG_ON_FOLIO(!folio_test_private_2(folio), folio);
clear_bit_unlock(PG_private_2, folio_flags(folio, 0));
folio_wake_bit(folio, PG_private_2);
folio_put(folio);
}
-EXPORT_SYMBOL(end_page_private_2);
+EXPORT_SYMBOL(folio_end_private_2);
/**
- * wait_on_page_private_2 - Wait for PG_private_2 to be cleared on a page
- * @page: The page to wait on
+ * folio_wait_private_2 - Wait for PG_private_2 to be cleared on a page.
+ * @folio: The folio to wait on.
*
- * Wait for PG_private_2 (aka PG_fscache) to be cleared on a page.
+ * Wait for PG_private_2 (aka PG_fscache) to be cleared on a folio.
*/
-void wait_on_page_private_2(struct page *page)
+void folio_wait_private_2(struct folio *folio)
{
- struct folio *folio = page_folio(page);
-
while (folio_test_private_2(folio))
folio_wait_bit(folio, PG_private_2);
}
-EXPORT_SYMBOL(wait_on_page_private_2);
+EXPORT_SYMBOL(folio_wait_private_2);
/**
- * wait_on_page_private_2_killable - Wait for PG_private_2 to be cleared on a page
- * @page: The page to wait on
+ * folio_wait_private_2_killable - Wait for PG_private_2 to be cleared on a folio.
+ * @folio: The folio to wait on.
*
- * Wait for PG_private_2 (aka PG_fscache) to be cleared on a page or until a
+ * Wait for PG_private_2 (aka PG_fscache) to be cleared on a folio or until a
* fatal signal is received by the calling task.
*
* Return:
* - 0 if successful.
* - -EINTR if a fatal signal was encountered.
*/
-int wait_on_page_private_2_killable(struct page *page)
+int folio_wait_private_2_killable(struct folio *folio)
{
- struct folio *folio = page_folio(page);
int ret = 0;
while (folio_test_private_2(folio)) {
@@ -1511,7 +1506,7 @@ int wait_on_page_private_2_killable(struct page *page)
return ret;
}
-EXPORT_SYMBOL(wait_on_page_private_2_killable);
+EXPORT_SYMBOL(folio_wait_private_2_killable);
/**
* folio_end_writeback - End writeback against a folio.
--
2.30.2
Match the page writeback functions by adding
folio_start_fscache(), folio_end_fscache(), folio_wait_fscache() and
folio_wait_fscache_killable(). Remove set_page_private_2(). Also rewrite
the kernel-doc to describe when to use the function rather than what the
function does, and include the kernel-doc in the appropriate rst file.
Saves 31 bytes of text in netfs_rreq_unlock() due to set_page_fscache()
calling page_folio() once instead of three times.
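A sketch of the calling pattern described by the new kernel-doc:

	folio_start_fscache(folio);	/* takes a folio ref, sets PG_fscache */
	/* ... submit the write to the local cache ... */

	/* On completion of the cache write: */
	folio_end_fscache(folio);	/* clears PG_fscache, wakes waiters, drops the ref */

	/* Anyone about to modify the folio waits first: */
	if (folio_wait_fscache_killable(folio))
		return -EINTR;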
Signed-off-by: Matthew Wilcox (Oracle) <[email protected]>
Acked-by: Vlastimil Babka <[email protected]>
Reviewed-by: William Kucharski <[email protected]>
Reviewed-by: Christoph Hellwig <[email protected]>
Acked-by: Kirill A. Shutemov <[email protected]>
---
Documentation/filesystems/netfs_library.rst | 2 +
include/linux/netfs.h | 75 +++++++++++++--------
include/linux/pagemap.h | 16 -----
3 files changed, 50 insertions(+), 43 deletions(-)
diff --git a/Documentation/filesystems/netfs_library.rst b/Documentation/filesystems/netfs_library.rst
index 57a641847818..bb68d39f03b7 100644
--- a/Documentation/filesystems/netfs_library.rst
+++ b/Documentation/filesystems/netfs_library.rst
@@ -524,3 +524,5 @@ Note that these methods are passed a pointer to the cache resource structure,
not the read request structure as they could be used in other situations where
there isn't a read request structure as well, such as writing dirty data to the
cache.
+
+.. kernel-doc:: include/linux/netfs.h
diff --git a/include/linux/netfs.h b/include/linux/netfs.h
index fad8c6209edd..e03ed7fc3aef 100644
--- a/include/linux/netfs.h
+++ b/include/linux/netfs.h
@@ -22,6 +22,7 @@
* Overload PG_private_2 to give us PG_fscache - this is used to indicate that
* a page is currently backed by a local disk cache
*/
+#define folio_test_fscache(folio) folio_test_private_2(folio)
#define PageFsCache(page) PagePrivate2((page))
#define SetPageFsCache(page) SetPagePrivate2((page))
#define ClearPageFsCache(page) ClearPagePrivate2((page))
@@ -29,57 +30,77 @@
#define TestClearPageFsCache(page) TestClearPagePrivate2((page))
/**
- * set_page_fscache - Set PG_fscache on a page and take a ref
- * @page: The page.
+ * folio_start_fscache - Start an fscache write on a folio.
+ * @folio: The folio.
*
- * Set the PG_fscache (PG_private_2) flag on a page and take the reference
- * needed for the VM to handle its lifetime correctly. This sets the flag and
- * takes the reference unconditionally, so care must be taken not to set the
- * flag again if it's already set.
+ * Call this function before writing a folio to a local cache. Starting a
+ * second write before the first one finishes is not allowed.
*/
-static inline void set_page_fscache(struct page *page)
+static inline void folio_start_fscache(struct folio *folio)
{
- set_page_private_2(page);
+ VM_BUG_ON_FOLIO(folio_test_private_2(folio), folio);
+ folio_get(folio);
+ folio_set_private_2_flag(folio);
}
/**
- * end_page_fscache - Clear PG_fscache and release any waiters
- * @page: The page
- *
- * Clear the PG_fscache (PG_private_2) bit on a page and wake up any sleepers
- * waiting for this. The page ref held for PG_private_2 being set is released.
+ * folio_end_fscache - End an fscache write on a folio.
+ * @folio: The folio.
*
- * This is, for example, used when a netfs page is being written to a local
- * disk cache, thereby allowing writes to the cache for the same page to be
- * serialised.
+ * Call this function after the folio has been written to the local cache.
+ * This will wake any sleepers waiting on this folio.
*/
-static inline void end_page_fscache(struct page *page)
+static inline void folio_end_fscache(struct folio *folio)
{
- folio_end_private_2(page_folio(page));
+ folio_end_private_2(folio);
}
/**
- * wait_on_page_fscache - Wait for PG_fscache to be cleared on a page
- * @page: The page to wait on
+ * folio_wait_fscache - Wait for an fscache write on this folio to end.
+ * @folio: The folio.
*
- * Wait for PG_fscache (aka PG_private_2) to be cleared on a page.
+ * If this folio is currently being written to a local cache, wait for
+ * the write to finish. Another write may start after this one finishes,
+ * unless the caller holds the folio lock.
*/
-static inline void wait_on_page_fscache(struct page *page)
+static inline void folio_wait_fscache(struct folio *folio)
{
- folio_wait_private_2(page_folio(page));
+ folio_wait_private_2(folio);
}
/**
- * wait_on_page_fscache_killable - Wait for PG_fscache to be cleared on a page
- * @page: The page to wait on
+ * folio_wait_fscache_killable - Wait for an fscache write on this folio to end.
+ * @folio: The folio.
*
- * Wait for PG_fscache (aka PG_private_2) to be cleared on a page or until a
- * fatal signal is received by the calling task.
+ * If this folio is currently being written to a local cache, wait
+ * for the write to finish or for a fatal signal to be received.
+ * Another write may start after this one finishes, unless the caller
+ * holds the folio lock.
*
* Return:
* - 0 if successful.
* - -EINTR if a fatal signal was encountered.
*/
+static inline int folio_wait_fscache_killable(struct folio *folio)
+{
+ return folio_wait_private_2_killable(folio);
+}
+
+static inline void set_page_fscache(struct page *page)
+{
+ folio_start_fscache(page_folio(page));
+}
+
+static inline void end_page_fscache(struct page *page)
+{
+ folio_end_private_2(page_folio(page));
+}
+
+static inline void wait_on_page_fscache(struct page *page)
+{
+ folio_wait_private_2(page_folio(page));
+}
+
static inline int wait_on_page_fscache_killable(struct page *page)
{
return folio_wait_private_2_killable(page_folio(page));
diff --git a/include/linux/pagemap.h b/include/linux/pagemap.h
index edf58a581bce..08f40e004d97 100644
--- a/include/linux/pagemap.h
+++ b/include/linux/pagemap.h
@@ -780,22 +780,6 @@ int __set_page_dirty_no_writeback(struct page *page);
void page_endio(struct page *page, bool is_write, int err);
-/**
- * set_page_private_2 - Set PG_private_2 on a page and take a ref
- * @page: The page.
- *
- * Set the PG_private_2 flag on a page and take the reference needed for the VM
- * to handle its lifetime correctly. This sets the flag and takes the
- * reference unconditionally, so care must be taken not to set the flag again
- * if it's already set.
- */
-static inline void set_page_private_2(struct page *page)
-{
- page = compound_head(page);
- get_page(page);
- SetPagePrivate2(page);
-}
-
void folio_end_private_2(struct folio *folio);
void folio_wait_private_2(struct folio *folio);
int folio_wait_private_2_killable(struct folio *folio);
--
2.30.2
folio_mapped() is the equivalent of page_mapped(). It is slightly
shorter as we do not need to handle the PageTail() case. Reimplement
page_mapped() as a wrapper around folio_mapped(). folio_mapped()
is 13 bytes smaller than page_mapped(), but the page_mapped() wrapper
is 30 bytes, for a net increase of 17 bytes of text.
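For illustration only (example_can_reclaim() is a made-up name, not part
of this patch), a caller that already has a folio can test the mapped
state directly, while page-based callers keep going through the
page_mapped() wrapper:

    static bool example_can_reclaim(struct folio *folio)
    {
        /* True if any page of this folio is in a user page table */
        if (folio_mapped(folio))
            return false;
        return !folio_test_dirty(folio);
    }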
Signed-off-by: Matthew Wilcox (Oracle) <[email protected]>
Acked-by: Vlastimil Babka <[email protected]>
Reviewed-by: William Kucharski <[email protected]>
Reviewed-by: Christoph Hellwig <[email protected]>
Reviewed-by: David Howells <[email protected]>
Acked-by: Kirill A. Shutemov <[email protected]>
---
include/linux/mm.h | 1 +
include/linux/mm_types.h | 6 ++++++
mm/folio-compat.c | 6 ++++++
mm/util.c | 29 ++++++++++++++++-------------
4 files changed, 29 insertions(+), 13 deletions(-)
diff --git a/include/linux/mm.h b/include/linux/mm.h
index 9d28f5b2e983..8b79d9dfa6cb 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -1767,6 +1767,7 @@ static inline pgoff_t page_index(struct page *page)
}
bool page_mapped(struct page *page);
+bool folio_mapped(struct folio *folio);
/*
* Return true only if the page has been allocated with
diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h
index c4dd41bb1019..f763aa273d82 100644
--- a/include/linux/mm_types.h
+++ b/include/linux/mm_types.h
@@ -291,6 +291,12 @@ FOLIO_MATCH(memcg_data, memcg_data);
#endif
#undef FOLIO_MATCH
+static inline atomic_t *folio_mapcount_ptr(struct folio *folio)
+{
+ struct page *tail = &folio->page + 1;
+ return &tail->compound_mapcount;
+}
+
static inline atomic_t *compound_mapcount_ptr(struct page *page)
{
return &page[1].compound_mapcount;
diff --git a/mm/folio-compat.c b/mm/folio-compat.c
index 3c83f03b80d7..7044fcc8a8aa 100644
--- a/mm/folio-compat.c
+++ b/mm/folio-compat.c
@@ -35,3 +35,9 @@ void wait_for_stable_page(struct page *page)
return folio_wait_stable(page_folio(page));
}
EXPORT_SYMBOL_GPL(wait_for_stable_page);
+
+bool page_mapped(struct page *page)
+{
+ return folio_mapped(page_folio(page));
+}
+EXPORT_SYMBOL(page_mapped);
diff --git a/mm/util.c b/mm/util.c
index 1cde6218d6d1..e8c12350b3eb 100644
--- a/mm/util.c
+++ b/mm/util.c
@@ -652,28 +652,31 @@ void *page_rmapping(struct page *page)
return __page_rmapping(page);
}
-/*
- * Return true if this page is mapped into pagetables.
- * For compound page it returns true if any subpage of compound page is mapped.
+/**
+ * folio_mapped - Is this folio mapped into userspace?
+ * @folio: The folio.
+ *
+ * Return: True if any page in this folio is referenced by user page tables.
*/
-bool page_mapped(struct page *page)
+bool folio_mapped(struct folio *folio)
{
- int i;
+ int i, nr;
- if (likely(!PageCompound(page)))
- return atomic_read(&page->_mapcount) >= 0;
- page = compound_head(page);
- if (atomic_read(compound_mapcount_ptr(page)) >= 0)
+ if (folio_single(folio))
+ return atomic_read(&folio->_mapcount) >= 0;
+ if (atomic_read(folio_mapcount_ptr(folio)) >= 0)
return true;
- if (PageHuge(page))
+ if (folio_test_hugetlb(folio))
return false;
- for (i = 0; i < compound_nr(page); i++) {
- if (atomic_read(&page[i]._mapcount) >= 0)
+
+ nr = folio_nr_pages(folio);
+ for (i = 0; i < nr; i++) {
+ if (atomic_read(&folio_page(folio, i)->_mapcount) >= 0)
return true;
}
return false;
}
-EXPORT_SYMBOL(page_mapped);
+EXPORT_SYMBOL(folio_mapped);
struct anon_vma *page_anon_vma(struct page *page)
{
--
2.30.2
memcg_check_events only uses the page's nid, so call page_to_nid in the
callers to make the interface easier to understand.
Signed-off-by: Matthew Wilcox (Oracle) <[email protected]>
Acked-by: Michal Hocko <[email protected]>
Reviewed-by: Christoph Hellwig <[email protected]>
---
mm/memcontrol.c | 23 ++++++++++++-----------
1 file changed, 12 insertions(+), 11 deletions(-)
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index f70e33d691aa..1a049bfa0e0a 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -851,7 +851,7 @@ static bool mem_cgroup_event_ratelimit(struct mem_cgroup *memcg,
* Check events in order.
*
*/
-static void memcg_check_events(struct mem_cgroup *memcg, struct page *page)
+static void memcg_check_events(struct mem_cgroup *memcg, int nid)
{
/* threshold event is triggered in finer grain than soft limit */
if (unlikely(mem_cgroup_event_ratelimit(memcg,
@@ -862,7 +862,7 @@ static void memcg_check_events(struct mem_cgroup *memcg, struct page *page)
MEM_CGROUP_TARGET_SOFTLIMIT);
mem_cgroup_threshold(memcg);
if (unlikely(do_softlimit))
- mem_cgroup_update_tree(memcg, page_to_nid(page));
+ mem_cgroup_update_tree(memcg, nid);
}
}
@@ -5578,7 +5578,7 @@ static int mem_cgroup_move_account(struct page *page,
struct lruvec *from_vec, *to_vec;
struct pglist_data *pgdat;
unsigned int nr_pages = compound ? thp_nr_pages(page) : 1;
- int ret;
+ int nid, ret;
VM_BUG_ON(from == to);
VM_BUG_ON_PAGE(PageLRU(page), page);
@@ -5667,12 +5667,13 @@ static int mem_cgroup_move_account(struct page *page,
__unlock_page_memcg(from);
ret = 0;
+ nid = page_to_nid(page);
local_irq_disable();
mem_cgroup_charge_statistics(to, nr_pages);
- memcg_check_events(to, page);
+ memcg_check_events(to, nid);
mem_cgroup_charge_statistics(from, -nr_pages);
- memcg_check_events(from, page);
+ memcg_check_events(from, nid);
local_irq_enable();
out_unlock:
unlock_page(page);
@@ -6693,7 +6694,7 @@ static int __mem_cgroup_charge(struct page *page, struct mem_cgroup *memcg,
local_irq_disable();
mem_cgroup_charge_statistics(memcg, nr_pages);
- memcg_check_events(memcg, page);
+ memcg_check_events(memcg, page_to_nid(page));
local_irq_enable();
out:
return ret;
@@ -6801,7 +6802,7 @@ struct uncharge_gather {
unsigned long nr_memory;
unsigned long pgpgout;
unsigned long nr_kmem;
- struct page *dummy_page;
+ int nid;
};
static inline void uncharge_gather_clear(struct uncharge_gather *ug)
@@ -6825,7 +6826,7 @@ static void uncharge_batch(const struct uncharge_gather *ug)
local_irq_save(flags);
__count_memcg_events(ug->memcg, PGPGOUT, ug->pgpgout);
__this_cpu_add(ug->memcg->vmstats_percpu->nr_page_events, ug->nr_memory);
- memcg_check_events(ug->memcg, ug->dummy_page);
+ memcg_check_events(ug->memcg, ug->nid);
local_irq_restore(flags);
/* drop reference from uncharge_page */
@@ -6866,7 +6867,7 @@ static void uncharge_page(struct page *page, struct uncharge_gather *ug)
uncharge_gather_clear(ug);
}
ug->memcg = memcg;
- ug->dummy_page = page;
+ ug->nid = page_to_nid(page);
/* pairs with css_put in uncharge_batch */
css_get(&memcg->css);
@@ -6984,7 +6985,7 @@ void mem_cgroup_migrate(struct page *oldpage, struct page *newpage)
local_irq_save(flags);
mem_cgroup_charge_statistics(memcg, nr_pages);
- memcg_check_events(memcg, newpage);
+ memcg_check_events(memcg, page_to_nid(newpage));
local_irq_restore(flags);
}
@@ -7214,7 +7215,7 @@ void mem_cgroup_swapout(struct page *page, swp_entry_t entry)
*/
VM_BUG_ON(!irqs_disabled());
mem_cgroup_charge_statistics(memcg, -nr_entries);
- memcg_check_events(memcg, page);
+ memcg_check_events(memcg, page_to_nid(page));
css_put(&memcg->css);
}
--
2.30.2
The memcg_data is only set on the head page, so enforce that by
converting commit_charge() to take a folio.
Signed-off-by: Matthew Wilcox (Oracle) <[email protected]>
Reviewed-by: Christoph Hellwig <[email protected]>
Acked-by: Michal Hocko <[email protected]>
---
mm/memcontrol.c | 27 +++++++++++++--------------
1 file changed, 13 insertions(+), 14 deletions(-)
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index f0f781dde37a..c2ffad021e09 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -2769,9 +2769,9 @@ static void cancel_charge(struct mem_cgroup *memcg, unsigned int nr_pages)
}
#endif
-static void commit_charge(struct page *page, struct mem_cgroup *memcg)
+static void commit_charge(struct folio *folio, struct mem_cgroup *memcg)
{
- VM_BUG_ON_PAGE(page_memcg(page), page);
+ VM_BUG_ON_FOLIO(folio_memcg(folio), folio);
/*
* Any of the following ensures page's memcg stability:
*
@@ -2780,7 +2780,7 @@ static void commit_charge(struct page *page, struct mem_cgroup *memcg)
* - lock_page_memcg()
* - exclusive reference
*/
- page->memcg_data = (unsigned long)memcg;
+ folio->memcg_data = (unsigned long)memcg;
}
static struct mem_cgroup *get_mem_cgroup_from_objcg(struct obj_cgroup *objcg)
@@ -6684,7 +6684,8 @@ void mem_cgroup_calculate_protection(struct mem_cgroup *root,
static int __mem_cgroup_charge(struct page *page, struct mem_cgroup *memcg,
gfp_t gfp)
{
- unsigned int nr_pages = thp_nr_pages(page);
+ struct folio *folio = page_folio(page);
+ unsigned int nr_pages = folio_nr_pages(folio);
int ret;
ret = try_charge(memcg, gfp, nr_pages);
@@ -6692,7 +6693,7 @@ static int __mem_cgroup_charge(struct page *page, struct mem_cgroup *memcg,
goto out;
css_get(&memcg->css);
- commit_charge(page, memcg);
+ commit_charge(folio, memcg);
local_irq_disable();
mem_cgroup_charge_statistics(memcg, nr_pages);
@@ -6952,21 +6953,21 @@ void mem_cgroup_uncharge_list(struct list_head *page_list)
*/
void mem_cgroup_migrate(struct page *oldpage, struct page *newpage)
{
+ struct folio *newfolio = page_folio(newpage);
struct mem_cgroup *memcg;
- unsigned int nr_pages;
+ unsigned int nr_pages = folio_nr_pages(newfolio);
unsigned long flags;
VM_BUG_ON_PAGE(!PageLocked(oldpage), oldpage);
- VM_BUG_ON_PAGE(!PageLocked(newpage), newpage);
- VM_BUG_ON_PAGE(PageAnon(oldpage) != PageAnon(newpage), newpage);
- VM_BUG_ON_PAGE(PageTransHuge(oldpage) != PageTransHuge(newpage),
- newpage);
+ VM_BUG_ON_FOLIO(!folio_test_locked(newfolio), newfolio);
+ VM_BUG_ON_FOLIO(PageAnon(oldpage) != folio_test_anon(newfolio), newfolio);
+ VM_BUG_ON_FOLIO(compound_nr(oldpage) != nr_pages, newfolio);
if (mem_cgroup_disabled())
return;
/* Page cache replacement: new page already charged? */
- if (page_memcg(newpage))
+ if (folio_memcg(newfolio))
return;
memcg = page_memcg(oldpage);
@@ -6975,8 +6976,6 @@ void mem_cgroup_migrate(struct page *oldpage, struct page *newpage)
return;
/* Force-charge the new page. The old one will be freed soon */
- nr_pages = thp_nr_pages(newpage);
-
if (!mem_cgroup_is_root(memcg)) {
page_counter_charge(&memcg->memory, nr_pages);
if (do_memsw_account())
@@ -6984,7 +6983,7 @@ void mem_cgroup_migrate(struct page *oldpage, struct page *newpage)
}
css_get(&memcg->css);
- commit_charge(newpage, memcg);
+ commit_charge(newfolio, memcg);
local_irq_save(flags);
mem_cgroup_charge_statistics(memcg, nr_pages);
--
2.30.2
Convert uncharge_page() to uncharge_folio() to ensure that we're only
operating on base or head pages, and not tail pages.
Signed-off-by: Matthew Wilcox (Oracle) <[email protected]>
Reviewed-by: Christoph Hellwig <[email protected]>
---
mm/memcontrol.c | 29 ++++++++++++++---------------
1 file changed, 14 insertions(+), 15 deletions(-)
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 03283d97b62a..c257cb71a3b0 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -6832,24 +6832,23 @@ static void uncharge_batch(const struct uncharge_gather *ug)
memcg_check_events(ug->memcg, ug->nid);
local_irq_restore(flags);
- /* drop reference from uncharge_page */
+ /* drop reference from uncharge_folio */
css_put(&ug->memcg->css);
}
-static void uncharge_page(struct page *page, struct uncharge_gather *ug)
+static void uncharge_folio(struct folio *folio, struct uncharge_gather *ug)
{
- struct folio *folio = page_folio(page);
unsigned long nr_pages;
struct mem_cgroup *memcg;
struct obj_cgroup *objcg;
- bool use_objcg = PageMemcgKmem(page);
+ bool use_objcg = folio_memcg_kmem(folio);
- VM_BUG_ON_PAGE(PageLRU(page), page);
+ VM_BUG_ON_FOLIO(folio_test_lru(folio), folio);
/*
* Nobody should be changing or seriously looking at
- * page memcg or objcg at this point, we have fully
- * exclusive access to the page.
+ * folio memcg or objcg at this point, we have fully
+ * exclusive access to the folio.
*/
if (use_objcg) {
objcg = __folio_objcg(folio);
@@ -6871,19 +6870,19 @@ static void uncharge_page(struct page *page, struct uncharge_gather *ug)
uncharge_gather_clear(ug);
}
ug->memcg = memcg;
- ug->nid = page_to_nid(page);
+ ug->nid = folio_nid(folio);
/* pairs with css_put in uncharge_batch */
css_get(&memcg->css);
}
- nr_pages = compound_nr(page);
+ nr_pages = folio_nr_pages(folio);
if (use_objcg) {
ug->nr_memory += nr_pages;
ug->nr_kmem += nr_pages;
- page->memcg_data = 0;
+ folio->memcg_data = 0;
obj_cgroup_put(objcg);
} else {
/* LRU pages aren't accounted at the root level */
@@ -6891,7 +6890,7 @@ static void uncharge_page(struct page *page, struct uncharge_gather *ug)
ug->nr_memory += nr_pages;
ug->pgpgout++;
- page->memcg_data = 0;
+ folio->memcg_data = 0;
}
css_put(&memcg->css);
@@ -6915,7 +6914,7 @@ void mem_cgroup_uncharge(struct page *page)
return;
uncharge_gather_clear(&ug);
- uncharge_page(page, &ug);
+ uncharge_folio(page_folio(page), &ug);
uncharge_batch(&ug);
}
@@ -6929,14 +6928,14 @@ void mem_cgroup_uncharge(struct page *page)
void mem_cgroup_uncharge_list(struct list_head *page_list)
{
struct uncharge_gather ug;
- struct page *page;
+ struct folio *folio;
if (mem_cgroup_disabled())
return;
uncharge_gather_clear(&ug);
- list_for_each_entry(page, page_list, lru)
- uncharge_page(page, &ug);
+ list_for_each_entry(folio, page_list, lru)
+ uncharge_folio(folio, &ug);
if (ug.memcg)
uncharge_batch(&ug);
}
--
2.30.2
Convert mem_cgroup_uncharge() to take a folio, and convert all the
callers to call page_folio(). Most of them were already
using a head page, but a few of them I can't prove were, so this may
actually fix a bug.
Signed-off-by: Matthew Wilcox (Oracle) <[email protected]>
Reviewed-by: Christoph Hellwig <[email protected]>
---
include/linux/memcontrol.h | 4 ++--
mm/filemap.c | 2 +-
mm/khugepaged.c | 4 ++--
mm/memcontrol.c | 14 +++++++-------
mm/memory-failure.c | 2 +-
mm/memremap.c | 2 +-
mm/page_alloc.c | 2 +-
mm/swap.c | 2 +-
8 files changed, 16 insertions(+), 16 deletions(-)
diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
index fb7b87d8e794..941a1a7131c9 100644
--- a/include/linux/memcontrol.h
+++ b/include/linux/memcontrol.h
@@ -709,7 +709,7 @@ int mem_cgroup_swapin_charge_page(struct page *page, struct mm_struct *mm,
gfp_t gfp, swp_entry_t entry);
void mem_cgroup_swapin_uncharge_swap(swp_entry_t entry);
-void mem_cgroup_uncharge(struct page *page);
+void mem_cgroup_uncharge(struct folio *folio);
void mem_cgroup_uncharge_list(struct list_head *page_list);
void mem_cgroup_migrate(struct page *oldpage, struct page *newpage);
@@ -1206,7 +1206,7 @@ static inline void mem_cgroup_swapin_uncharge_swap(swp_entry_t entry)
{
}
-static inline void mem_cgroup_uncharge(struct page *page)
+static inline void mem_cgroup_uncharge(struct folio *folio)
{
}
diff --git a/mm/filemap.c b/mm/filemap.c
index 525f69316522..31d4ecd4268e 100644
--- a/mm/filemap.c
+++ b/mm/filemap.c
@@ -923,7 +923,7 @@ noinline int __add_to_page_cache_locked(struct page *page,
if (xas_error(&xas)) {
error = xas_error(&xas);
if (charged)
- mem_cgroup_uncharge(page);
+ mem_cgroup_uncharge(page_folio(page));
goto error;
}
diff --git a/mm/khugepaged.c b/mm/khugepaged.c
index 8f6d7fdea9f4..6b9c98ddcd09 100644
--- a/mm/khugepaged.c
+++ b/mm/khugepaged.c
@@ -1211,7 +1211,7 @@ static void collapse_huge_page(struct mm_struct *mm,
mmap_write_unlock(mm);
out_nolock:
if (!IS_ERR_OR_NULL(*hpage))
- mem_cgroup_uncharge(*hpage);
+ mem_cgroup_uncharge(page_folio(*hpage));
trace_mm_collapse_huge_page(mm, isolated, result);
return;
}
@@ -1975,7 +1975,7 @@ static void collapse_file(struct mm_struct *mm,
out:
VM_BUG_ON(!list_empty(&pagelist));
if (!IS_ERR_OR_NULL(*hpage))
- mem_cgroup_uncharge(*hpage);
+ mem_cgroup_uncharge(page_folio(*hpage));
/* TODO: tracepoints */
}
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index c257cb71a3b0..fc94048e6451 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -6897,24 +6897,24 @@ static void uncharge_folio(struct folio *folio, struct uncharge_gather *ug)
}
/**
- * mem_cgroup_uncharge - uncharge a page
- * @page: page to uncharge
+ * mem_cgroup_uncharge - Uncharge a folio.
+ * @folio: Folio to uncharge.
*
- * Uncharge a page previously charged with mem_cgroup_charge().
+ * Uncharge a folio previously charged with folio_charge_cgroup().
*/
-void mem_cgroup_uncharge(struct page *page)
+void mem_cgroup_uncharge(struct folio *folio)
{
struct uncharge_gather ug;
if (mem_cgroup_disabled())
return;
- /* Don't touch page->lru of any random page, pre-check: */
- if (!page_memcg(page))
+ /* Don't touch folio->lru of any random page, pre-check: */
+ if (!folio_memcg(folio))
return;
uncharge_gather_clear(&ug);
- uncharge_folio(page_folio(page), &ug);
+ uncharge_folio(folio, &ug);
uncharge_batch(&ug);
}
diff --git a/mm/memory-failure.c b/mm/memory-failure.c
index eefd823deb67..9ae7a57a4cc0 100644
--- a/mm/memory-failure.c
+++ b/mm/memory-failure.c
@@ -763,7 +763,7 @@ static int delete_from_lru_cache(struct page *p)
* Poisoned page might never drop its ref count to 0 so we have
* to uncharge it manually from its memcg.
*/
- mem_cgroup_uncharge(p);
+ mem_cgroup_uncharge(page_folio(p));
/*
* drop the page count elevated by isolate_lru_page()
diff --git a/mm/memremap.c b/mm/memremap.c
index 15a074ffb8d7..6eac40f9f62a 100644
--- a/mm/memremap.c
+++ b/mm/memremap.c
@@ -508,7 +508,7 @@ void free_devmap_managed_page(struct page *page)
__ClearPageWaiters(page);
- mem_cgroup_uncharge(page);
+ mem_cgroup_uncharge(page_folio(page));
/*
* When a device_private page is freed, the page->mapping field
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 3b97e17806be..d72a0d9d4184 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -726,7 +726,7 @@ static inline void free_the_page(struct page *page, unsigned int order)
void free_compound_page(struct page *page)
{
- mem_cgroup_uncharge(page);
+ mem_cgroup_uncharge(page_folio(page));
free_the_page(page, compound_order(page));
}
diff --git a/mm/swap.c b/mm/swap.c
index 095a5ec6f986..11ff40104a2c 100644
--- a/mm/swap.c
+++ b/mm/swap.c
@@ -94,7 +94,7 @@ static void __page_cache_release(struct page *page)
static void __put_single_page(struct page *page)
{
__page_cache_release(page);
- mem_cgroup_uncharge(page);
+ mem_cgroup_uncharge(page_folio(page));
free_unref_page(page, 0);
}
--
2.30.2
Opencode soft_limit_tree_node() in its three callers.
Signed-off-by: Matthew Wilcox (Oracle) <[email protected]>
Acked-by: Michal Hocko <[email protected]>
Acked-by: Johannes Weiner <[email protected]>
Reviewed-by: Christoph Hellwig <[email protected]>
---
mm/memcontrol.c | 12 +++---------
1 file changed, 3 insertions(+), 9 deletions(-)
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index d57ff5c5d330..f70e33d691aa 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -451,12 +451,6 @@ ino_t page_cgroup_ino(struct page *page)
return ino;
}
-static struct mem_cgroup_tree_per_node *
-soft_limit_tree_node(int nid)
-{
- return soft_limit_tree.rb_tree_per_node[nid];
-}
-
static void __mem_cgroup_insert_exceeded(struct mem_cgroup_per_node *mz,
struct mem_cgroup_tree_per_node *mctz,
unsigned long new_usage_in_excess)
@@ -533,7 +527,7 @@ static void mem_cgroup_update_tree(struct mem_cgroup *memcg, int nid)
struct mem_cgroup_per_node *mz;
struct mem_cgroup_tree_per_node *mctz;
- mctz = soft_limit_tree_node(nid);
+ mctz = soft_limit_tree.rb_tree_per_node[nid];
if (!mctz)
return;
/*
@@ -572,7 +566,7 @@ static void mem_cgroup_remove_from_trees(struct mem_cgroup *memcg)
for_each_node(nid) {
mz = memcg->nodeinfo[nid];
- mctz = soft_limit_tree_node(nid);
+ mctz = soft_limit_tree.rb_tree_per_node[nid];
if (mctz)
mem_cgroup_remove_exceeded(mz, mctz);
}
@@ -3420,7 +3414,7 @@ unsigned long mem_cgroup_soft_limit_reclaim(pg_data_t *pgdat, int order,
if (order > 0)
return 0;
- mctz = soft_limit_tree_node(pgdat->node_id);
+ mctz = soft_limit_tree.rb_tree_per_node[pgdat->node_id];
/*
* Do not even bother to check the largest node if the root
--
2.30.2
These are the folio equivalents of relock_page_lruvec_irq() and
relock_page_lruvec_irqsave(). Also convert page_matches_lruvec()
to folio_matches_lruvec().
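As a minimal sketch of the batched relock pattern (modelled on
pagevec_lru_move_fn() as converted in this patch; the loop body is a
placeholder):

    struct lruvec *lruvec = NULL;
    unsigned long flags;
    int i;

    for (i = 0; i < pagevec_count(pvec); i++) {
        struct folio *folio = page_folio(pvec->pages[i]);

        /* Only drops and retakes the lock if the lruvec changes */
        lruvec = folio_lruvec_relock_irqsave(folio, lruvec, &flags);
        /* ... operate on the folio under the lruvec lock ... */
    }
    if (lruvec)
        unlock_page_lruvec_irqrestore(lruvec, flags);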
Signed-off-by: Matthew Wilcox (Oracle) <[email protected]>
Reviewed-by: Christoph Hellwig <[email protected]>
---
include/linux/memcontrol.h | 17 ++++++++---------
mm/mlock.c | 3 ++-
mm/swap.c | 11 +++++++----
mm/vmscan.c | 5 +++--
4 files changed, 20 insertions(+), 16 deletions(-)
diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
index ffb591920241..6511f89ad454 100644
--- a/include/linux/memcontrol.h
+++ b/include/linux/memcontrol.h
@@ -1529,19 +1529,19 @@ static inline void unlock_page_lruvec_irqrestore(struct lruvec *lruvec,
}
/* Test requires a stable page->memcg binding, see page_memcg() */
-static inline bool page_matches_lruvec(struct page *page, struct lruvec *lruvec)
+static inline bool folio_matches_lruvec(struct folio *folio,
+ struct lruvec *lruvec)
{
- return lruvec_pgdat(lruvec) == page_pgdat(page) &&
- lruvec_memcg(lruvec) == page_memcg(page);
+ return lruvec_pgdat(lruvec) == folio_pgdat(folio) &&
+ lruvec_memcg(lruvec) == folio_memcg(folio);
}
/* Don't lock again iff page's lruvec locked */
-static inline struct lruvec *relock_page_lruvec_irq(struct page *page,
+static inline struct lruvec *folio_lruvec_relock_irq(struct folio *folio,
struct lruvec *locked_lruvec)
{
- struct folio *folio = page_folio(page);
if (locked_lruvec) {
- if (page_matches_lruvec(page, locked_lruvec))
+ if (folio_matches_lruvec(folio, locked_lruvec))
return locked_lruvec;
unlock_page_lruvec_irq(locked_lruvec);
@@ -1551,12 +1551,11 @@ static inline struct lruvec *relock_page_lruvec_irq(struct page *page,
}
/* Don't lock again iff page's lruvec locked */
-static inline struct lruvec *relock_page_lruvec_irqsave(struct page *page,
+static inline struct lruvec *folio_lruvec_relock_irqsave(struct folio *folio,
struct lruvec *locked_lruvec, unsigned long *flags)
{
- struct folio *folio = page_folio(page);
if (locked_lruvec) {
- if (page_matches_lruvec(page, locked_lruvec))
+ if (folio_matches_lruvec(folio, locked_lruvec))
return locked_lruvec;
unlock_page_lruvec_irqrestore(locked_lruvec, *flags);
diff --git a/mm/mlock.c b/mm/mlock.c
index 16d2ee160d43..e263d62ae2d0 100644
--- a/mm/mlock.c
+++ b/mm/mlock.c
@@ -271,6 +271,7 @@ static void __munlock_pagevec(struct pagevec *pvec, struct zone *zone)
/* Phase 1: page isolation */
for (i = 0; i < nr; i++) {
struct page *page = pvec->pages[i];
+ struct folio *folio = page_folio(page);
if (TestClearPageMlocked(page)) {
/*
@@ -278,7 +279,7 @@ static void __munlock_pagevec(struct pagevec *pvec, struct zone *zone)
* so we can spare the get_page() here.
*/
if (TestClearPageLRU(page)) {
- lruvec = relock_page_lruvec_irq(page, lruvec);
+ lruvec = folio_lruvec_relock_irq(folio, lruvec);
del_page_from_lru_list(page, lruvec);
continue;
} else
diff --git a/mm/swap.c b/mm/swap.c
index 6d0d2bfca48e..aa9c32b714c5 100644
--- a/mm/swap.c
+++ b/mm/swap.c
@@ -211,12 +211,13 @@ static void pagevec_lru_move_fn(struct pagevec *pvec,
for (i = 0; i < pagevec_count(pvec); i++) {
struct page *page = pvec->pages[i];
+ struct folio *folio = page_folio(page);
/* block memcg migration during page moving between lru */
if (!TestClearPageLRU(page))
continue;
- lruvec = relock_page_lruvec_irqsave(page, lruvec, &flags);
+ lruvec = folio_lruvec_relock_irqsave(folio, lruvec, &flags);
(*move_fn)(page, lruvec);
SetPageLRU(page);
@@ -907,6 +908,7 @@ void release_pages(struct page **pages, int nr)
for (i = 0; i < nr; i++) {
struct page *page = pages[i];
+ struct folio *folio = page_folio(page);
/*
* Make sure the IRQ-safe lock-holding time does not get
@@ -918,7 +920,7 @@ void release_pages(struct page **pages, int nr)
lruvec = NULL;
}
- page = compound_head(page);
+ page = &folio->page;
if (is_huge_zero_page(page))
continue;
@@ -957,7 +959,7 @@ void release_pages(struct page **pages, int nr)
if (PageLRU(page)) {
struct lruvec *prev_lruvec = lruvec;
- lruvec = relock_page_lruvec_irqsave(page, lruvec,
+ lruvec = folio_lruvec_relock_irqsave(folio, lruvec,
&flags);
if (prev_lruvec != lruvec)
lock_batch = 0;
@@ -1061,8 +1063,9 @@ void __pagevec_lru_add(struct pagevec *pvec)
for (i = 0; i < pagevec_count(pvec); i++) {
struct page *page = pvec->pages[i];
+ struct folio *folio = page_folio(page);
- lruvec = relock_page_lruvec_irqsave(page, lruvec, &flags);
+ lruvec = folio_lruvec_relock_irqsave(folio, lruvec, &flags);
__pagevec_lru_add_fn(page, lruvec);
}
if (lruvec)
diff --git a/mm/vmscan.c b/mm/vmscan.c
index 0d48306d37dc..7a2f25b904d9 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -2075,7 +2075,7 @@ static unsigned int move_pages_to_lru(struct lruvec *lruvec,
* All pages were isolated from the same lruvec (and isolation
* inhibits memcg migration).
*/
- VM_BUG_ON_PAGE(!page_matches_lruvec(page, lruvec), page);
+ VM_BUG_ON_PAGE(!folio_matches_lruvec(page_folio(page), lruvec), page);
add_page_to_lru_list(page, lruvec);
nr_pages = thp_nr_pages(page);
nr_moved += nr_pages;
@@ -4514,6 +4514,7 @@ void check_move_unevictable_pages(struct pagevec *pvec)
for (i = 0; i < pvec->nr; i++) {
struct page *page = pvec->pages[i];
+ struct folio *folio = page_folio(page);
int nr_pages;
if (PageTransTail(page))
@@ -4526,7 +4527,7 @@ void check_move_unevictable_pages(struct pagevec *pvec)
if (!TestClearPageLRU(page))
continue;
- lruvec = relock_page_lruvec_irq(page, lruvec);
+ lruvec = folio_lruvec_relock_irq(folio, lruvec);
if (page_evictable(page) && PageUnevictable(page)) {
del_page_from_lru_list(page, lruvec);
ClearPageUnevictable(page);
--
2.30.2
workingset_activation() already assumed it was being passed a head page. No real
change here, except that thp_nr_pages() compiles away on kernels with
THP compiled out while folio_nr_pages() is always present. Also convert
page_memcg_rcu() to folio_memcg_rcu().
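A simplified sketch of the calling convention (it omits the
mem_cgroup_disabled()/!CONFIG_MEMCG handling that workingset_activation()
keeps): the lookup is lockless, so the caller must be inside an RCU
read-side critical section and must not use the memcg beyond it:

    struct mem_cgroup *memcg;

    rcu_read_lock();
    memcg = folio_memcg_rcu(folio);    /* may be NULL; no reference taken */
    if (memcg) {
        /* Use memcg only while still under rcu_read_lock() */
        workingset_age_nonresident(folio_lruvec(folio), folio_nr_pages(folio));
    }
    rcu_read_unlock();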
Signed-off-by: Matthew Wilcox (Oracle) <[email protected]>
Reviewed-by: Christoph Hellwig <[email protected]>
---
include/linux/memcontrol.h | 18 +++++++++---------
include/linux/swap.h | 2 +-
mm/swap.c | 2 +-
mm/workingset.c | 11 ++++-------
4 files changed, 15 insertions(+), 18 deletions(-)
diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
index 6511f89ad454..2dd660185bb3 100644
--- a/include/linux/memcontrol.h
+++ b/include/linux/memcontrol.h
@@ -461,19 +461,19 @@ static inline struct mem_cgroup *page_memcg(struct page *page)
}
/*
- * page_memcg_rcu - locklessly get the memory cgroup associated with a page
- * @page: a pointer to the page struct
+ * folio_memcg_rcu - Locklessly get the memory cgroup associated with a folio.
+ * @folio: Pointer to the folio.
*
- * Returns a pointer to the memory cgroup associated with the page,
- * or NULL. This function assumes that the page is known to have a
+ * Returns a pointer to the memory cgroup associated with the folio,
+ * or NULL. This function assumes that the folio is known to have a
* proper memory cgroup pointer. It's not safe to call this function
- * against some type of pages, e.g. slab pages or ex-slab pages.
+ * against some type of folios, e.g. slab folios or ex-slab folios.
*/
-static inline struct mem_cgroup *page_memcg_rcu(struct page *page)
+static inline struct mem_cgroup *folio_memcg_rcu(struct folio *folio)
{
- unsigned long memcg_data = READ_ONCE(page->memcg_data);
+ unsigned long memcg_data = READ_ONCE(folio->memcg_data);
- VM_BUG_ON_PAGE(PageSlab(page), page);
+ VM_BUG_ON_FOLIO(folio_test_slab(folio), folio);
WARN_ON_ONCE(!rcu_read_lock_held());
if (memcg_data & MEMCG_DATA_KMEM) {
@@ -1129,7 +1129,7 @@ static inline struct mem_cgroup *page_memcg(struct page *page)
return NULL;
}
-static inline struct mem_cgroup *page_memcg_rcu(struct page *page)
+static inline struct mem_cgroup *folio_memcg_rcu(struct folio *folio)
{
WARN_ON_ONCE(!rcu_read_lock_held());
return NULL;
diff --git a/include/linux/swap.h b/include/linux/swap.h
index 8394716a002b..989d8f78c256 100644
--- a/include/linux/swap.h
+++ b/include/linux/swap.h
@@ -330,7 +330,7 @@ static inline swp_entry_t folio_swap_entry(struct folio *folio)
void workingset_age_nonresident(struct lruvec *lruvec, unsigned long nr_pages);
void *workingset_eviction(struct page *page, struct mem_cgroup *target_memcg);
void workingset_refault(struct page *page, void *shadow);
-void workingset_activation(struct page *page);
+void workingset_activation(struct folio *folio);
/* Only track the nodes of mappings with shadow entries */
void workingset_update_node(struct xa_node *node);
diff --git a/mm/swap.c b/mm/swap.c
index aa9c32b714c5..85969b36b636 100644
--- a/mm/swap.c
+++ b/mm/swap.c
@@ -451,7 +451,7 @@ void mark_page_accessed(struct page *page)
else
__lru_cache_activate_page(page);
ClearPageReferenced(page);
- workingset_activation(page);
+ workingset_activation(page_folio(page));
}
if (page_is_idle(page))
clear_page_idle(page);
diff --git a/mm/workingset.c b/mm/workingset.c
index e62c0f2084a2..39bb60d50217 100644
--- a/mm/workingset.c
+++ b/mm/workingset.c
@@ -392,13 +392,11 @@ void workingset_refault(struct page *page, void *shadow)
/**
* workingset_activation - note a page activation
- * @page: page that is being activated
+ * @folio: Folio that is being activated.
*/
-void workingset_activation(struct page *page)
+void workingset_activation(struct folio *folio)
{
- struct folio *folio = page_folio(page);
struct mem_cgroup *memcg;
- struct lruvec *lruvec;
rcu_read_lock();
/*
@@ -408,11 +406,10 @@ void workingset_activation(struct page *page)
* XXX: See workingset_refault() - this should return
* root_mem_cgroup even for !CONFIG_MEMCG.
*/
- memcg = page_memcg_rcu(page);
+ memcg = folio_memcg_rcu(folio);
if (!mem_cgroup_disabled() && !memcg)
goto out;
- lruvec = folio_lruvec(folio);
- workingset_age_nonresident(lruvec, thp_nr_pages(page));
+ workingset_age_nonresident(folio_lruvec(folio), folio_nr_pages(folio));
out:
rcu_read_unlock();
}
--
2.30.2
Converting mem_cgroup_move_account() to use a folio internally saves
dozens of bytes of text by eliminating a lot of calls to compound_head().
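For illustration (a made-up helper, not part of this patch), the saving
comes from each folio_test_*() call reading the folio's flags directly
instead of deriving the head page first:

    static bool example_needs_flush(struct folio *folio)
    {
        /* Direct flag reads; no hidden compound_head() per test */
        return folio_test_dirty(folio) || folio_test_writeback(folio);
    }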
Signed-off-by: Matthew Wilcox (Oracle) <[email protected]>
Reviewed-by: Christoph Hellwig <[email protected]>
---
mm/memcontrol.c | 37 +++++++++++++++++++------------------
1 file changed, 19 insertions(+), 18 deletions(-)
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 0dd40ea67a90..96d6e6c0a65d 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -5590,38 +5590,39 @@ static int mem_cgroup_move_account(struct page *page,
struct mem_cgroup *from,
struct mem_cgroup *to)
{
+ struct folio *folio = page_folio(page);
struct lruvec *from_vec, *to_vec;
struct pglist_data *pgdat;
- unsigned int nr_pages = compound ? thp_nr_pages(page) : 1;
+ unsigned int nr_pages = compound ? folio_nr_pages(folio) : 1;
int nid, ret;
VM_BUG_ON(from == to);
- VM_BUG_ON_PAGE(PageLRU(page), page);
- VM_BUG_ON(compound && !PageTransHuge(page));
+ VM_BUG_ON_FOLIO(folio_test_lru(folio), folio);
+ VM_BUG_ON(compound && !folio_multi(folio));
/*
* Prevent mem_cgroup_migrate() from looking at
* page's memory cgroup of its source page while we change it.
*/
ret = -EBUSY;
- if (!trylock_page(page))
+ if (!folio_trylock(folio))
goto out;
ret = -EINVAL;
- if (page_memcg(page) != from)
+ if (folio_memcg(folio) != from)
goto out_unlock;
- pgdat = page_pgdat(page);
+ pgdat = folio_pgdat(folio);
from_vec = mem_cgroup_lruvec(from, pgdat);
to_vec = mem_cgroup_lruvec(to, pgdat);
- lock_page_memcg(page);
+ folio_memcg_lock(folio);
- if (PageAnon(page)) {
- if (page_mapped(page)) {
+ if (folio_test_anon(folio)) {
+ if (folio_mapped(folio)) {
__mod_lruvec_state(from_vec, NR_ANON_MAPPED, -nr_pages);
__mod_lruvec_state(to_vec, NR_ANON_MAPPED, nr_pages);
- if (PageTransHuge(page)) {
+ if (folio_test_transhuge(folio)) {
__mod_lruvec_state(from_vec, NR_ANON_THPS,
-nr_pages);
__mod_lruvec_state(to_vec, NR_ANON_THPS,
@@ -5632,18 +5633,18 @@ static int mem_cgroup_move_account(struct page *page,
__mod_lruvec_state(from_vec, NR_FILE_PAGES, -nr_pages);
__mod_lruvec_state(to_vec, NR_FILE_PAGES, nr_pages);
- if (PageSwapBacked(page)) {
+ if (folio_test_swapbacked(folio)) {
__mod_lruvec_state(from_vec, NR_SHMEM, -nr_pages);
__mod_lruvec_state(to_vec, NR_SHMEM, nr_pages);
}
- if (page_mapped(page)) {
+ if (folio_mapped(folio)) {
__mod_lruvec_state(from_vec, NR_FILE_MAPPED, -nr_pages);
__mod_lruvec_state(to_vec, NR_FILE_MAPPED, nr_pages);
}
- if (PageDirty(page)) {
- struct address_space *mapping = page_mapping(page);
+ if (folio_test_dirty(folio)) {
+ struct address_space *mapping = folio_mapping(folio);
if (mapping_can_writeback(mapping)) {
__mod_lruvec_state(from_vec, NR_FILE_DIRTY,
@@ -5654,7 +5655,7 @@ static int mem_cgroup_move_account(struct page *page,
}
}
- if (PageWriteback(page)) {
+ if (folio_test_writeback(folio)) {
__mod_lruvec_state(from_vec, NR_WRITEBACK, -nr_pages);
__mod_lruvec_state(to_vec, NR_WRITEBACK, nr_pages);
}
@@ -5677,12 +5678,12 @@ static int mem_cgroup_move_account(struct page *page,
css_get(&to->css);
css_put(&from->css);
- page->memcg_data = (unsigned long)to;
+ folio->memcg_data = (unsigned long)to;
__folio_memcg_unlock(from);
ret = 0;
- nid = page_to_nid(page);
+ nid = folio_nid(folio);
local_irq_disable();
mem_cgroup_charge_statistics(to, nr_pages);
@@ -5691,7 +5692,7 @@ static int mem_cgroup_move_account(struct page *page,
memcg_check_events(from, nid);
local_irq_enable();
out_unlock:
- unlock_page(page);
+ folio_unlock(folio);
out:
return ret;
}
--
2.30.2
folio_pfn() is the folio equivalent of page_to_pfn().
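A trivial illustration of the contract (example_folio_end_pfn() is a
hypothetical helper, not added by this patch): the pages of a folio have
consecutive PFNs starting at folio_pfn():

    static inline unsigned long example_folio_end_pfn(struct folio *folio)
    {
        /* PFN of the last page in the folio */
        return folio_pfn(folio) + folio_nr_pages(folio) - 1;
    }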
Signed-off-by: Matthew Wilcox (Oracle) <[email protected]>
---
include/linux/mm.h | 14 ++++++++++++++
1 file changed, 14 insertions(+)
diff --git a/include/linux/mm.h b/include/linux/mm.h
index c6e2a1682a6d..89daae93aa9b 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -1623,6 +1623,20 @@ static inline unsigned long page_to_section(const struct page *page)
}
#endif
+/**
+ * folio_pfn - Return the Page Frame Number of a folio.
+ * @folio: The folio.
+ *
+ * A folio may contain multiple pages. The pages have consecutive
+ * Page Frame Numbers.
+ *
+ * Return: The Page Frame Number of the first page in the folio.
+ */
+static inline unsigned long folio_pfn(struct folio *folio)
+{
+ return page_to_pfn(&folio->page);
+}
+
/* MIGRATE_CMA and ZONE_MOVABLE do not allow pin pages */
#ifdef CONFIG_MIGRATION
static inline bool is_pinnable_page(struct page *page)
--
2.30.2
Idle page tracking is handled through page_ext on 32-bit architectures.
Add folio equivalents for 32-bit and move all the page compatibility
parts to common code.
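For illustration only, the layering after this patch: new code can
operate on folios directly, while existing callers keep the page
interface via the wrappers now shared at the bottom of the header:

    /* Folio-aware code: */
    if (folio_test_idle(folio))
        folio_clear_idle(folio);

    /* Unconverted code, unchanged, now routed through page_folio(): */
    if (page_is_idle(page))
        clear_page_idle(page);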
Signed-off-by: Matthew Wilcox (Oracle) <[email protected]>
Acked-by: Vlastimil Babka <[email protected]>
Reviewed-by: William Kucharski <[email protected]>
Reviewed-by: Christoph Hellwig <[email protected]>
---
include/linux/page_idle.h | 99 +++++++++++++++++++--------------------
1 file changed, 49 insertions(+), 50 deletions(-)
diff --git a/include/linux/page_idle.h b/include/linux/page_idle.h
index 1e894d34bdce..1bcb1365b1d0 100644
--- a/include/linux/page_idle.h
+++ b/include/linux/page_idle.h
@@ -8,46 +8,16 @@
#ifdef CONFIG_IDLE_PAGE_TRACKING
-#ifdef CONFIG_64BIT
-static inline bool page_is_young(struct page *page)
-{
- return PageYoung(page);
-}
-
-static inline void set_page_young(struct page *page)
-{
- SetPageYoung(page);
-}
-
-static inline bool test_and_clear_page_young(struct page *page)
-{
- return TestClearPageYoung(page);
-}
-
-static inline bool page_is_idle(struct page *page)
-{
- return PageIdle(page);
-}
-
-static inline void set_page_idle(struct page *page)
-{
- SetPageIdle(page);
-}
-
-static inline void clear_page_idle(struct page *page)
-{
- ClearPageIdle(page);
-}
-#else /* !CONFIG_64BIT */
+#ifndef CONFIG_64BIT
/*
* If there is not enough space to store Idle and Young bits in page flags, use
* page ext flags instead.
*/
extern struct page_ext_operations page_idle_ops;
-static inline bool page_is_young(struct page *page)
+static inline bool folio_test_young(struct folio *folio)
{
- struct page_ext *page_ext = lookup_page_ext(page);
+ struct page_ext *page_ext = lookup_page_ext(&folio->page);
if (unlikely(!page_ext))
return false;
@@ -55,9 +25,9 @@ static inline bool page_is_young(struct page *page)
return test_bit(PAGE_EXT_YOUNG, &page_ext->flags);
}
-static inline void set_page_young(struct page *page)
+static inline void folio_set_young(struct folio *folio)
{
- struct page_ext *page_ext = lookup_page_ext(page);
+ struct page_ext *page_ext = lookup_page_ext(&folio->page);
if (unlikely(!page_ext))
return;
@@ -65,9 +35,9 @@ static inline void set_page_young(struct page *page)
set_bit(PAGE_EXT_YOUNG, &page_ext->flags);
}
-static inline bool test_and_clear_page_young(struct page *page)
+static inline bool folio_test_clear_young(struct folio *folio)
{
- struct page_ext *page_ext = lookup_page_ext(page);
+ struct page_ext *page_ext = lookup_page_ext(&folio->page);
if (unlikely(!page_ext))
return false;
@@ -75,9 +45,9 @@ static inline bool test_and_clear_page_young(struct page *page)
return test_and_clear_bit(PAGE_EXT_YOUNG, &page_ext->flags);
}
-static inline bool page_is_idle(struct page *page)
+static inline bool folio_test_idle(struct folio *folio)
{
- struct page_ext *page_ext = lookup_page_ext(page);
+ struct page_ext *page_ext = lookup_page_ext(&folio->page);
if (unlikely(!page_ext))
return false;
@@ -85,9 +55,9 @@ static inline bool page_is_idle(struct page *page)
return test_bit(PAGE_EXT_IDLE, &page_ext->flags);
}
-static inline void set_page_idle(struct page *page)
+static inline void folio_set_idle(struct folio *folio)
{
- struct page_ext *page_ext = lookup_page_ext(page);
+ struct page_ext *page_ext = lookup_page_ext(&folio->page);
if (unlikely(!page_ext))
return;
@@ -95,46 +65,75 @@ static inline void set_page_idle(struct page *page)
set_bit(PAGE_EXT_IDLE, &page_ext->flags);
}
-static inline void clear_page_idle(struct page *page)
+static inline void folio_clear_idle(struct folio *folio)
{
- struct page_ext *page_ext = lookup_page_ext(page);
+ struct page_ext *page_ext = lookup_page_ext(&folio->page);
if (unlikely(!page_ext))
return;
clear_bit(PAGE_EXT_IDLE, &page_ext->flags);
}
-#endif /* CONFIG_64BIT */
+#endif /* !CONFIG_64BIT */
#else /* !CONFIG_IDLE_PAGE_TRACKING */
-static inline bool page_is_young(struct page *page)
+static inline bool folio_test_young(struct folio *folio)
{
return false;
}
-static inline void set_page_young(struct page *page)
+static inline void folio_set_young(struct folio *folio)
{
}
-static inline bool test_and_clear_page_young(struct page *page)
+static inline bool folio_test_clear_young(struct folio *folio)
{
return false;
}
-static inline bool page_is_idle(struct page *page)
+static inline bool folio_test_idle(struct folio *folio)
{
return false;
}
-static inline void set_page_idle(struct page *page)
+static inline void folio_set_idle(struct folio *folio)
{
}
-static inline void clear_page_idle(struct page *page)
+static inline void folio_clear_idle(struct folio *folio)
{
}
#endif /* CONFIG_IDLE_PAGE_TRACKING */
+static inline bool page_is_young(struct page *page)
+{
+ return folio_test_young(page_folio(page));
+}
+
+static inline void set_page_young(struct page *page)
+{
+ folio_set_young(page_folio(page));
+}
+
+static inline bool test_and_clear_page_young(struct page *page)
+{
+ return folio_test_clear_young(page_folio(page));
+}
+
+static inline bool page_is_idle(struct page *page)
+{
+ return folio_test_idle(page_folio(page));
+}
+
+static inline void set_page_idle(struct page *page)
+{
+ folio_set_idle(page_folio(page));
+}
+
+static inline void clear_page_idle(struct page *page)
+{
+ folio_clear_idle(page_folio(page));
+}
#endif /* _LINUX_MM_PAGE_IDLE_H */
--
2.30.2
Convert __page_rmapping to folio_raw_mapping and move it to mm/internal.h.
It's only a couple of instructions (load and mask), so it's definitely
going to be cheaper to inline it than call it. Leave page_rmapping
out of line.
Signed-off-by: Matthew Wilcox (Oracle) <[email protected]>
---
mm/internal.h | 7 +++++++
mm/util.c | 20 ++++----------------
2 files changed, 11 insertions(+), 16 deletions(-)
diff --git a/mm/internal.h b/mm/internal.h
index 1a8851b73031..fa31a7f0ed79 100644
--- a/mm/internal.h
+++ b/mm/internal.h
@@ -34,6 +34,13 @@
void page_writeback_init(void);
+static inline void *folio_raw_mapping(struct folio *folio)
+{
+ unsigned long mapping = (unsigned long)folio->mapping;
+
+ return (void *)(mapping & ~PAGE_MAPPING_FLAGS);
+}
+
vm_fault_t do_swap_page(struct vm_fault *vmf);
void folio_rotate_reclaimable(struct folio *folio);
diff --git a/mm/util.c b/mm/util.c
index e8c12350b3eb..d0aa1d9c811e 100644
--- a/mm/util.c
+++ b/mm/util.c
@@ -635,21 +635,10 @@ void kvfree_sensitive(const void *addr, size_t len)
}
EXPORT_SYMBOL(kvfree_sensitive);
-static inline void *__page_rmapping(struct page *page)
-{
- unsigned long mapping;
-
- mapping = (unsigned long)page->mapping;
- mapping &= ~PAGE_MAPPING_FLAGS;
-
- return (void *)mapping;
-}
-
/* Neutral page->mapping pointer to address_space or anon_vma or other */
void *page_rmapping(struct page *page)
{
- page = compound_head(page);
- return __page_rmapping(page);
+ return folio_raw_mapping(page_folio(page));
}
/**
@@ -680,13 +669,12 @@ EXPORT_SYMBOL(folio_mapped);
struct anon_vma *page_anon_vma(struct page *page)
{
- unsigned long mapping;
+ struct folio *folio = page_folio(page);
+ unsigned long mapping = (unsigned long)folio->mapping;
- page = compound_head(page);
- mapping = (unsigned long)page->mapping;
if ((mapping & PAGE_MAPPING_FLAGS) != PAGE_MAPPING_ANON)
return NULL;
- return __page_rmapping(page);
+ return (void *)(mapping - PAGE_MAPPING_ANON);
}
/**
--
2.30.2
Rename __add_wb_stat() to wb_stat_mod() to make it look like the newly
renamed vmstat functions.
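The new name also reads better once callers pass amounts other than
+1/-1, as later patches in this series do; for example (sketched here
with a multi-page folio):

    /* Account every page of the folio in a single per-cpu update */
    wb_stat_mod(wb, WB_RECLAIMABLE, -folio_nr_pages(folio));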
Signed-off-by: Matthew Wilcox (Oracle) <[email protected]>
Reviewed-by: Christoph Hellwig <[email protected]>
---
include/linux/backing-dev.h | 6 +++---
1 file changed, 3 insertions(+), 3 deletions(-)
diff --git a/include/linux/backing-dev.h b/include/linux/backing-dev.h
index 44df4fcef65c..a852876bb6e2 100644
--- a/include/linux/backing-dev.h
+++ b/include/linux/backing-dev.h
@@ -64,7 +64,7 @@ static inline bool bdi_has_dirty_io(struct backing_dev_info *bdi)
return atomic_long_read(&bdi->tot_write_bandwidth);
}
-static inline void __add_wb_stat(struct bdi_writeback *wb,
+static inline void wb_stat_mod(struct bdi_writeback *wb,
enum wb_stat_item item, s64 amount)
{
percpu_counter_add_batch(&wb->stat[item], amount, WB_STAT_BATCH);
@@ -72,12 +72,12 @@ static inline void __add_wb_stat(struct bdi_writeback *wb,
static inline void inc_wb_stat(struct bdi_writeback *wb, enum wb_stat_item item)
{
- __add_wb_stat(wb, item, 1);
+ wb_stat_mod(wb, item, 1);
}
static inline void dec_wb_stat(struct bdi_writeback *wb, enum wb_stat_item item)
{
- __add_wb_stat(wb, item, -1);
+ wb_stat_mod(wb, item, -1);
}
static inline s64 wb_stat(struct bdi_writeback *wb, enum wb_stat_item item)
--
2.30.2
When batching events (such as writing back N pages in a single I/O), it
is better to do one flex_proportion operation instead of N. There is
only one caller of __fprop_inc_percpu_max(), and it's the one we're
going to change in the next patch, so rename it instead of adding a
compatibility wrapper.
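A sketch of the intended use (the actual conversion is in the next
patch; nr_written is a placeholder name): after writing back nr_written
pages in a single I/O, one proportion update replaces nr_written
individual ones:

    /* One flex_proportion update for the whole batch */
    __fprop_add_percpu_max(&dom->completions, completions,
                           max_prop_frac, nr_written);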
Signed-off-by: Matthew Wilcox (Oracle) <[email protected]>
Reviewed-by: Christoph Hellwig <[email protected]>
Reviewed-by: Jan Kara <[email protected]>
---
include/linux/flex_proportions.h | 9 +++++----
lib/flex_proportions.c | 28 +++++++++++++++++++---------
mm/page-writeback.c | 4 ++--
3 files changed, 26 insertions(+), 15 deletions(-)
diff --git a/include/linux/flex_proportions.h b/include/linux/flex_proportions.h
index c12df59d3f5f..3e378b1fb0bc 100644
--- a/include/linux/flex_proportions.h
+++ b/include/linux/flex_proportions.h
@@ -83,9 +83,10 @@ struct fprop_local_percpu {
int fprop_local_init_percpu(struct fprop_local_percpu *pl, gfp_t gfp);
void fprop_local_destroy_percpu(struct fprop_local_percpu *pl);
-void __fprop_inc_percpu(struct fprop_global *p, struct fprop_local_percpu *pl);
-void __fprop_inc_percpu_max(struct fprop_global *p, struct fprop_local_percpu *pl,
- int max_frac);
+void __fprop_add_percpu(struct fprop_global *p, struct fprop_local_percpu *pl,
+ long nr);
+void __fprop_add_percpu_max(struct fprop_global *p,
+ struct fprop_local_percpu *pl, int max_frac, long nr);
void fprop_fraction_percpu(struct fprop_global *p,
struct fprop_local_percpu *pl, unsigned long *numerator,
unsigned long *denominator);
@@ -96,7 +97,7 @@ void fprop_inc_percpu(struct fprop_global *p, struct fprop_local_percpu *pl)
unsigned long flags;
local_irq_save(flags);
- __fprop_inc_percpu(p, pl);
+ __fprop_add_percpu(p, pl, 1);
local_irq_restore(flags);
}
diff --git a/lib/flex_proportions.c b/lib/flex_proportions.c
index 451543937524..53e7eb1dd76c 100644
--- a/lib/flex_proportions.c
+++ b/lib/flex_proportions.c
@@ -217,11 +217,12 @@ static void fprop_reflect_period_percpu(struct fprop_global *p,
}
/* Event of type pl happened */
-void __fprop_inc_percpu(struct fprop_global *p, struct fprop_local_percpu *pl)
+void __fprop_add_percpu(struct fprop_global *p, struct fprop_local_percpu *pl,
+ long nr)
{
fprop_reflect_period_percpu(p, pl);
- percpu_counter_add_batch(&pl->events, 1, PROP_BATCH);
- percpu_counter_add(&p->events, 1);
+ percpu_counter_add_batch(&pl->events, nr, PROP_BATCH);
+ percpu_counter_add(&p->events, nr);
}
void fprop_fraction_percpu(struct fprop_global *p,
@@ -253,20 +254,29 @@ void fprop_fraction_percpu(struct fprop_global *p,
}
/*
- * Like __fprop_inc_percpu() except that event is counted only if the given
+ * Like __fprop_add_percpu() except that event is counted only if the given
* type has fraction smaller than @max_frac/FPROP_FRAC_BASE
*/
-void __fprop_inc_percpu_max(struct fprop_global *p,
- struct fprop_local_percpu *pl, int max_frac)
+void __fprop_add_percpu_max(struct fprop_global *p,
+ struct fprop_local_percpu *pl, int max_frac, long nr)
{
if (unlikely(max_frac < FPROP_FRAC_BASE)) {
unsigned long numerator, denominator;
+ s64 tmp;
fprop_fraction_percpu(p, pl, &numerator, &denominator);
- if (numerator >
- (((u64)denominator) * max_frac) >> FPROP_FRAC_SHIFT)
+ /* Adding 'nr' to fraction exceeds max_frac/FPROP_FRAC_BASE? */
+ tmp = (u64)denominator * max_frac -
+ ((u64)numerator << FPROP_FRAC_SHIFT);
+ if (tmp < 0) {
+ /* Maximum fraction already exceeded? */
return;
+ } else if (tmp < nr * (FPROP_FRAC_BASE - max_frac)) {
+ /* Add just enough for the fraction to saturate */
+ nr = div_u64(tmp + FPROP_FRAC_BASE - max_frac - 1,
+ FPROP_FRAC_BASE - max_frac);
+ }
}
- __fprop_inc_percpu(p, pl);
+ __fprop_add_percpu(p, pl, nr);
}
diff --git a/mm/page-writeback.c b/mm/page-writeback.c
index b34278d05395..f55f2ebdd9a9 100644
--- a/mm/page-writeback.c
+++ b/mm/page-writeback.c
@@ -566,8 +566,8 @@ static void wb_domain_writeout_inc(struct wb_domain *dom,
struct fprop_local_percpu *completions,
unsigned int max_prop_frac)
{
- __fprop_inc_percpu_max(&dom->completions, completions,
- max_prop_frac);
+ __fprop_add_percpu_max(&dom->completions, completions,
+ max_prop_frac, 1);
/* First event after period switching was turned off? */
if (unlikely(!dom->period_time)) {
/*
--
2.30.2
folio_migrate_copy() is the folio equivalent of migrate_page_copy(), which is retained
as a wrapper for filesystems which are not yet converted to folios.
Also convert copy_huge_page() to folio_copy().
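The resulting layering, for illustration (mirroring migrate_page() as
converted in this patch): filesystems that have been converted call the
folio functions directly, while mm/folio-compat.c keeps the page-based
migrate_page_copy() working for the rest:

    if (mode != MIGRATE_SYNC_NO_COPY)
        folio_migrate_copy(newfolio, folio);    /* folio_copy() + folio_migrate_flags() */
    else
        folio_migrate_flags(newfolio, folio);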
Signed-off-by: Matthew Wilcox (Oracle) <[email protected]>
---
include/linux/migrate.h | 1 +
include/linux/mm.h | 2 +-
mm/folio-compat.c | 6 ++++++
mm/hugetlb.c | 2 +-
mm/migrate.c | 14 +++++---------
mm/util.c | 6 +++---
6 files changed, 17 insertions(+), 14 deletions(-)
diff --git a/include/linux/migrate.h b/include/linux/migrate.h
index ba0a554b3eae..6a01de9faff5 100644
--- a/include/linux/migrate.h
+++ b/include/linux/migrate.h
@@ -52,6 +52,7 @@ extern int migrate_huge_page_move_mapping(struct address_space *mapping,
extern int migrate_page_move_mapping(struct address_space *mapping,
struct page *newpage, struct page *page, int extra_count);
void folio_migrate_flags(struct folio *newfolio, struct folio *folio);
+void folio_migrate_copy(struct folio *newfolio, struct folio *folio);
int folio_migrate_mapping(struct address_space *mapping,
struct folio *newfolio, struct folio *folio, int extra_count);
#else
diff --git a/include/linux/mm.h b/include/linux/mm.h
index deb0f5efaa65..23276330ef4f 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -911,7 +911,7 @@ void __put_page(struct page *page);
void put_pages_list(struct list_head *pages);
void split_page(struct page *page, unsigned int order);
-void copy_huge_page(struct page *dst, struct page *src);
+void folio_copy(struct folio *dst, struct folio *src);
/*
* Compound pages have a destructor function. Provide a
diff --git a/mm/folio-compat.c b/mm/folio-compat.c
index 3f00ad92d1ff..2ccd8f213fc4 100644
--- a/mm/folio-compat.c
+++ b/mm/folio-compat.c
@@ -64,4 +64,10 @@ void migrate_page_states(struct page *newpage, struct page *page)
folio_migrate_flags(page_folio(newpage), page_folio(page));
}
EXPORT_SYMBOL(migrate_page_states);
+
+void migrate_page_copy(struct page *newpage, struct page *page)
+{
+ folio_migrate_copy(page_folio(newpage), page_folio(page));
+}
+EXPORT_SYMBOL(migrate_page_copy);
#endif
diff --git a/mm/hugetlb.c b/mm/hugetlb.c
index 924553aa8f78..b46f9d09aa94 100644
--- a/mm/hugetlb.c
+++ b/mm/hugetlb.c
@@ -5200,7 +5200,7 @@ int hugetlb_mcopy_atomic_pte(struct mm_struct *dst_mm,
*pagep = NULL;
goto out;
}
- copy_huge_page(page, *pagep);
+ folio_copy(page_folio(page), page_folio(*pagep));
put_page(*pagep);
*pagep = NULL;
}
diff --git a/mm/migrate.c b/mm/migrate.c
index a86be2bfc9a1..36cdae0a1235 100644
--- a/mm/migrate.c
+++ b/mm/migrate.c
@@ -613,16 +613,12 @@ void folio_migrate_flags(struct folio *newfolio, struct folio *folio)
}
EXPORT_SYMBOL(folio_migrate_flags);
-void migrate_page_copy(struct page *newpage, struct page *page)
+void folio_migrate_copy(struct folio *newfolio, struct folio *folio)
{
- if (PageHuge(page) || PageTransHuge(page))
- copy_huge_page(newpage, page);
- else
- copy_highpage(newpage, page);
-
- migrate_page_states(newpage, page);
+ folio_copy(newfolio, folio);
+ folio_migrate_flags(newfolio, folio);
}
-EXPORT_SYMBOL(migrate_page_copy);
+EXPORT_SYMBOL(folio_migrate_copy);
/************************************************************
* Migration functions
@@ -650,7 +646,7 @@ int migrate_page(struct address_space *mapping,
return rc;
if (mode != MIGRATE_SYNC_NO_COPY)
- migrate_page_copy(newpage, page);
+ folio_migrate_copy(newfolio, folio);
else
folio_migrate_flags(newfolio, folio);
return MIGRATEPAGE_SUCCESS;
diff --git a/mm/util.c b/mm/util.c
index 149537120a91..904a75612307 100644
--- a/mm/util.c
+++ b/mm/util.c
@@ -728,13 +728,13 @@ int __page_mapcount(struct page *page)
}
EXPORT_SYMBOL_GPL(__page_mapcount);
-void copy_huge_page(struct page *dst, struct page *src)
+void folio_copy(struct folio *dst, struct folio *src)
{
- unsigned i, nr = compound_nr(src);
+ unsigned i, nr = folio_nr_pages(src);
for (i = 0; i < nr; i++) {
cond_resched();
- copy_highpage(nth_page(dst, i), nth_page(src, i));
+ copy_highpage(folio_page(dst, i), folio_page(src, i));
}
}
--
2.30.2
Rename the writeback_dirty_page tracepoint to writeback_dirty_folio and
the wait_on_page_writeback tracepoint to folio_wait_writeback.
Signed-off-by: Matthew Wilcox (Oracle) <[email protected]>
---
include/trace/events/writeback.h | 20 ++++++++++----------
mm/page-writeback.c | 6 +++---
2 files changed, 13 insertions(+), 13 deletions(-)
diff --git a/include/trace/events/writeback.h b/include/trace/events/writeback.h
index 297871ca0004..7dccb66474f7 100644
--- a/include/trace/events/writeback.h
+++ b/include/trace/events/writeback.h
@@ -52,11 +52,11 @@ WB_WORK_REASON
struct wb_writeback_work;
-DECLARE_EVENT_CLASS(writeback_page_template,
+DECLARE_EVENT_CLASS(writeback_folio_template,
- TP_PROTO(struct page *page, struct address_space *mapping),
+ TP_PROTO(struct folio *folio, struct address_space *mapping),
- TP_ARGS(page, mapping),
+ TP_ARGS(folio, mapping),
TP_STRUCT__entry (
__array(char, name, 32)
@@ -69,7 +69,7 @@ DECLARE_EVENT_CLASS(writeback_page_template,
bdi_dev_name(mapping ? inode_to_bdi(mapping->host) :
NULL), 32);
__entry->ino = mapping ? mapping->host->i_ino : 0;
- __entry->index = page->index;
+ __entry->index = folio->index;
),
TP_printk("bdi %s: ino=%lu index=%lu",
@@ -79,18 +79,18 @@ DECLARE_EVENT_CLASS(writeback_page_template,
)
);
-DEFINE_EVENT(writeback_page_template, writeback_dirty_page,
+DEFINE_EVENT(writeback_folio_template, writeback_dirty_folio,
- TP_PROTO(struct page *page, struct address_space *mapping),
+ TP_PROTO(struct folio *folio, struct address_space *mapping),
- TP_ARGS(page, mapping)
+ TP_ARGS(folio, mapping)
);
-DEFINE_EVENT(writeback_page_template, wait_on_page_writeback,
+DEFINE_EVENT(writeback_folio_template, folio_wait_writeback,
- TP_PROTO(struct page *page, struct address_space *mapping),
+ TP_PROTO(struct folio *folio, struct address_space *mapping),
- TP_ARGS(page, mapping)
+ TP_ARGS(folio, mapping)
);
DECLARE_EVENT_CLASS(writeback_dirty_inode_template,
diff --git a/mm/page-writeback.c b/mm/page-writeback.c
index 3e02c86eb445..2dc410b110ff 100644
--- a/mm/page-writeback.c
+++ b/mm/page-writeback.c
@@ -2426,7 +2426,7 @@ static void folio_account_dirtied(struct folio *folio,
{
struct inode *inode = mapping->host;
- trace_writeback_dirty_page(&folio->page, mapping);
+ trace_writeback_dirty_folio(folio, mapping);
if (mapping_can_writeback(mapping)) {
struct bdi_writeback *wb;
@@ -2852,7 +2852,7 @@ EXPORT_SYMBOL(__folio_start_writeback);
void folio_wait_writeback(struct folio *folio)
{
while (folio_test_writeback(folio)) {
- trace_wait_on_page_writeback(&folio->page, folio_mapping(folio));
+ trace_folio_wait_writeback(folio, folio_mapping(folio));
folio_wait_bit(folio, PG_writeback);
}
}
@@ -2874,7 +2874,7 @@ EXPORT_SYMBOL_GPL(folio_wait_writeback);
int folio_wait_writeback_killable(struct folio *folio)
{
while (folio_test_writeback(folio)) {
- trace_wait_on_page_writeback(&folio->page, folio_mapping(folio));
+ trace_folio_wait_writeback(folio, folio_mapping(folio));
if (folio_wait_bit_killable(folio, PG_writeback))
return -EINTR;
}
--
2.30.2
Convert account_page_cleaned() to folio_account_cleaned() and get the
statistics right; compound pages were being accounted as a single
page. This didn't matter before now as no filesystem which
supported compound pages did writeback. Also move the declaration
to filemap.h since this is part of the page cache. Add a wrapper for
account_page_cleaned().
Signed-off-by: Matthew Wilcox (Oracle) <[email protected]>
Reviewed-by: Christoph Hellwig <[email protected]>
---
include/linux/mm.h | 3 ---
include/linux/pagemap.h | 7 +++++++
mm/page-writeback.c | 11 ++++++-----
3 files changed, 13 insertions(+), 8 deletions(-)
diff --git a/include/linux/mm.h b/include/linux/mm.h
index 43c1b5731c7f..481019481d10 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -39,7 +39,6 @@ struct anon_vma_chain;
struct file_ra_state;
struct user_struct;
struct writeback_control;
-struct bdi_writeback;
struct pt_regs;
extern int sysctl_page_lock_unfairness;
@@ -2003,8 +2002,6 @@ extern void do_invalidatepage(struct page *page, unsigned int offset,
int redirty_page_for_writepage(struct writeback_control *wbc,
struct page *page);
-void account_page_cleaned(struct page *page, struct address_space *mapping,
- struct bdi_writeback *wb);
bool folio_mark_dirty(struct folio *folio);
bool set_page_dirty(struct page *page);
int set_page_dirty_lock(struct page *page);
diff --git a/include/linux/pagemap.h b/include/linux/pagemap.h
index 3d88c17fedc9..665ba6a67385 100644
--- a/include/linux/pagemap.h
+++ b/include/linux/pagemap.h
@@ -779,6 +779,13 @@ static inline void __set_page_dirty(struct page *page,
{
__folio_mark_dirty(page_folio(page), mapping, warn);
}
+void folio_account_cleaned(struct folio *folio, struct address_space *mapping,
+ struct bdi_writeback *wb);
+static inline void account_page_cleaned(struct page *page,
+ struct address_space *mapping, struct bdi_writeback *wb)
+{
+ return folio_account_cleaned(page_folio(page), mapping, wb);
+}
int __set_page_dirty_nobuffers(struct page *page);
int __set_page_dirty_no_writeback(struct page *page);
diff --git a/mm/page-writeback.c b/mm/page-writeback.c
index bd97c461d499..792a83bd3917 100644
--- a/mm/page-writeback.c
+++ b/mm/page-writeback.c
@@ -2453,14 +2453,15 @@ static void folio_account_dirtied(struct folio *folio,
*
* Caller must hold lock_page_memcg().
*/
-void account_page_cleaned(struct page *page, struct address_space *mapping,
+void folio_account_cleaned(struct folio *folio, struct address_space *mapping,
struct bdi_writeback *wb)
{
if (mapping_can_writeback(mapping)) {
- dec_lruvec_page_state(page, NR_FILE_DIRTY);
- dec_zone_page_state(page, NR_ZONE_WRITE_PENDING);
- dec_wb_stat(wb, WB_RECLAIMABLE);
- task_io_account_cancelled_write(PAGE_SIZE);
+ long nr = folio_nr_pages(folio);
+ lruvec_stat_mod_folio(folio, NR_FILE_DIRTY, -nr);
+ zone_stat_mod_folio(folio, NR_ZONE_WRITE_PENDING, -nr);
+ wb_stat_mod(wb, WB_RECLAIMABLE, -nr);
+ task_io_account_cancelled_write(folio_size(folio));
}
}
--
2.30.2
Transform clear_page_dirty_for_io() into folio_clear_dirty_for_io()
and add a compatibility wrapper. Also move the declaration to pagemap.h
as this is page cache functionality that doesn't need to be used by the
rest of the kernel.
Increases the size of the kernel by 79 bytes. While we remove a few
calls to compound_head(), we add a call to folio_nr_pages() to get the
stats correct for the eventual support of multi-page folios.
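A rough sketch of the intended calling pattern (fs_submit_folio() is a
made-up placeholder, not something this patch adds):

    folio_lock(folio);
    if (folio_clear_dirty_for_io(folio)) {
            /* The xarray keeps PAGECACHE_TAG_DIRTY; folio_start_writeback()
             * or folio_mark_dirty() later brings flag and tag back in sync.
             */
            fs_submit_folio(folio);
    }
    folio_unlock(folio);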
Signed-off-by: Matthew Wilcox (Oracle) <[email protected]>
Reviewed-by: Christoph Hellwig <[email protected]>
---
include/linux/mm.h | 1 -
include/linux/pagemap.h | 2 ++
mm/folio-compat.c | 6 ++++
mm/page-writeback.c | 63 +++++++++++++++++++++--------------------
4 files changed, 40 insertions(+), 32 deletions(-)
diff --git a/include/linux/mm.h b/include/linux/mm.h
index 07ba22351d15..26883ea28349 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -2005,7 +2005,6 @@ int redirty_page_for_writepage(struct writeback_control *wbc,
bool folio_mark_dirty(struct folio *folio);
bool set_page_dirty(struct page *page);
int set_page_dirty_lock(struct page *page);
-int clear_page_dirty_for_io(struct page *page);
int get_cmdline(struct task_struct *task, char *buffer, int buflen);
diff --git a/include/linux/pagemap.h b/include/linux/pagemap.h
index a4d0aeaf884d..006de2d84d06 100644
--- a/include/linux/pagemap.h
+++ b/include/linux/pagemap.h
@@ -797,6 +797,8 @@ static inline void cancel_dirty_page(struct page *page)
{
folio_cancel_dirty(page_folio(page));
}
+bool folio_clear_dirty_for_io(struct folio *folio);
+bool clear_page_dirty_for_io(struct page *page);
int __set_page_dirty_nobuffers(struct page *page);
int __set_page_dirty_no_writeback(struct page *page);
diff --git a/mm/folio-compat.c b/mm/folio-compat.c
index dad962b920e5..39f5a8d963b1 100644
--- a/mm/folio-compat.c
+++ b/mm/folio-compat.c
@@ -89,3 +89,9 @@ int __set_page_dirty_nobuffers(struct page *page)
return filemap_dirty_folio(page_mapping(page), page_folio(page));
}
EXPORT_SYMBOL(__set_page_dirty_nobuffers);
+
+bool clear_page_dirty_for_io(struct page *page)
+{
+ return folio_clear_dirty_for_io(page_folio(page));
+}
+EXPORT_SYMBOL(clear_page_dirty_for_io);
diff --git a/mm/page-writeback.c b/mm/page-writeback.c
index 0854ef768d06..66060bbf6aad 100644
--- a/mm/page-writeback.c
+++ b/mm/page-writeback.c
@@ -2664,25 +2664,25 @@ void __folio_cancel_dirty(struct folio *folio)
EXPORT_SYMBOL(__folio_cancel_dirty);
/*
- * Clear a page's dirty flag, while caring for dirty memory accounting.
- * Returns true if the page was previously dirty.
- *
- * This is for preparing to put the page under writeout. We leave the page
- * tagged as dirty in the xarray so that a concurrent write-for-sync
- * can discover it via a PAGECACHE_TAG_DIRTY walk. The ->writepage
- * implementation will run either set_page_writeback() or set_page_dirty(),
- * at which stage we bring the page's dirty flag and xarray dirty tag
- * back into sync.
- *
- * This incoherency between the page's dirty flag and xarray tag is
- * unfortunate, but it only exists while the page is locked.
+ * Clear a folio's dirty flag, while caring for dirty memory accounting.
+ * Returns true if the folio was previously dirty.
+ *
+ * This is for preparing to put the folio under writeout. We leave
+ * the folio tagged as dirty in the xarray so that a concurrent
+ * write-for-sync can discover it via a PAGECACHE_TAG_DIRTY walk.
+ * The ->writepage implementation will run either folio_start_writeback()
+ * or folio_mark_dirty(), at which stage we bring the folio's dirty flag
+ * and xarray dirty tag back into sync.
+ *
+ * This incoherency between the folio's dirty flag and xarray tag is
+ * unfortunate, but it only exists while the folio is locked.
*/
-int clear_page_dirty_for_io(struct page *page)
+bool folio_clear_dirty_for_io(struct folio *folio)
{
- struct address_space *mapping = page_mapping(page);
- int ret = 0;
+ struct address_space *mapping = folio_mapping(folio);
+ bool ret = false;
- VM_BUG_ON_PAGE(!PageLocked(page), page);
+ VM_BUG_ON_FOLIO(!folio_test_locked(folio), folio);
if (mapping && mapping_can_writeback(mapping)) {
struct inode *inode = mapping->host;
@@ -2695,48 +2695,49 @@ int clear_page_dirty_for_io(struct page *page)
* We use this sequence to make sure that
* (a) we account for dirty stats properly
* (b) we tell the low-level filesystem to
- * mark the whole page dirty if it was
+ * mark the whole folio dirty if it was
* dirty in a pagetable. Only to then
- * (c) clean the page again and return 1 to
+ * (c) clean the folio again and return 1 to
* cause the writeback.
*
* This way we avoid all nasty races with the
* dirty bit in multiple places and clearing
* them concurrently from different threads.
*
- * Note! Normally the "set_page_dirty(page)"
+ * Note! Normally the "folio_mark_dirty(folio)"
* has no effect on the actual dirty bit - since
* that will already usually be set. But we
* need the side effects, and it can help us
* avoid races.
*
- * We basically use the page "master dirty bit"
+ * We basically use the folio "master dirty bit"
* as a serialization point for all the different
* threads doing their things.
*/
- if (page_mkclean(page))
- set_page_dirty(page);
+ if (folio_mkclean(folio))
+ folio_mark_dirty(folio);
/*
* We carefully synchronise fault handlers against
- * installing a dirty pte and marking the page dirty
+ * installing a dirty pte and marking the folio dirty
* at this point. We do this by having them hold the
- * page lock while dirtying the page, and pages are
+ * page lock while dirtying the folio, and folios are
* always locked coming in here, so we get the desired
* exclusion.
*/
wb = unlocked_inode_to_wb_begin(inode, &cookie);
- if (TestClearPageDirty(page)) {
- dec_lruvec_page_state(page, NR_FILE_DIRTY);
- dec_zone_page_state(page, NR_ZONE_WRITE_PENDING);
- dec_wb_stat(wb, WB_RECLAIMABLE);
- ret = 1;
+ if (folio_test_clear_dirty(folio)) {
+ long nr = folio_nr_pages(folio);
+ lruvec_stat_mod_folio(folio, NR_FILE_DIRTY, -nr);
+ zone_stat_mod_folio(folio, NR_ZONE_WRITE_PENDING, -nr);
+ wb_stat_mod(wb, WB_RECLAIMABLE, -nr);
+ ret = true;
}
unlocked_inode_to_wb_end(inode, &cookie);
return ret;
}
- return TestClearPageDirty(page);
+ return folio_test_clear_dirty(folio);
}
-EXPORT_SYMBOL(clear_page_dirty_for_io);
+EXPORT_SYMBOL(folio_clear_dirty_for_io);
bool __folio_end_writeback(struct folio *folio)
{
--
2.30.2
Reimplement redirty_page_for_writepage() as a wrapper around
folio_redirty_for_writepage(). Account the number of pages in the
folio, add kernel-doc and move the prototype to writeback.h.
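For illustration (not part of this patch; example_writepage() and
fs_wants_to_skip() are made-up names), a ->writepage implementation that
declines to write would now look like:

    static int example_writepage(struct page *page, struct writeback_control *wbc)
    {
            struct folio *folio = page_folio(page);

            if (fs_wants_to_skip(folio)) {          /* hypothetical condition */
                    folio_redirty_for_writepage(wbc, folio);
                    folio_unlock(folio);
                    return 0;
            }
            return fs_write_out_folio(folio, wbc);  /* hypothetical I/O path */
    }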
Signed-off-by: Matthew Wilcox (Oracle) <[email protected]>
Reviewed-by: Christoph Hellwig <[email protected]>
---
fs/jfs/jfs_metapage.c | 1 +
include/linux/mm.h | 4 ----
include/linux/writeback.h | 2 ++
mm/folio-compat.c | 7 +++++++
mm/page-writeback.c | 30 ++++++++++++++++++++----------
5 files changed, 30 insertions(+), 14 deletions(-)
diff --git a/fs/jfs/jfs_metapage.c b/fs/jfs/jfs_metapage.c
index 176580f54af9..104ae698443e 100644
--- a/fs/jfs/jfs_metapage.c
+++ b/fs/jfs/jfs_metapage.c
@@ -13,6 +13,7 @@
#include <linux/buffer_head.h>
#include <linux/mempool.h>
#include <linux/seq_file.h>
+#include <linux/writeback.h>
#include "jfs_incore.h"
#include "jfs_superblock.h"
#include "jfs_filsys.h"
diff --git a/include/linux/mm.h b/include/linux/mm.h
index 26883ea28349..4803f2c01367 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -36,9 +36,7 @@
struct mempolicy;
struct anon_vma;
struct anon_vma_chain;
-struct file_ra_state;
struct user_struct;
-struct writeback_control;
struct pt_regs;
extern int sysctl_page_lock_unfairness;
@@ -2000,8 +1998,6 @@ extern int try_to_release_page(struct page * page, gfp_t gfp_mask);
extern void do_invalidatepage(struct page *page, unsigned int offset,
unsigned int length);
-int redirty_page_for_writepage(struct writeback_control *wbc,
- struct page *page);
bool folio_mark_dirty(struct folio *folio);
bool set_page_dirty(struct page *page);
int set_page_dirty_lock(struct page *page);
diff --git a/include/linux/writeback.h b/include/linux/writeback.h
index 50cb6e25ab9e..5383f7e39816 100644
--- a/include/linux/writeback.h
+++ b/include/linux/writeback.h
@@ -404,6 +404,8 @@ static inline void account_page_redirty(struct page *page)
{
folio_account_redirty(page_folio(page));
}
+bool folio_redirty_for_writepage(struct writeback_control *, struct folio *);
+bool redirty_page_for_writepage(struct writeback_control *, struct page *);
void sb_mark_inode_writeback(struct inode *inode);
void sb_clear_inode_writeback(struct inode *inode);
diff --git a/mm/folio-compat.c b/mm/folio-compat.c
index 39f5a8d963b1..c1e01bc36d32 100644
--- a/mm/folio-compat.c
+++ b/mm/folio-compat.c
@@ -95,3 +95,10 @@ bool clear_page_dirty_for_io(struct page *page)
return folio_clear_dirty_for_io(page_folio(page));
}
EXPORT_SYMBOL(clear_page_dirty_for_io);
+
+bool redirty_page_for_writepage(struct writeback_control *wbc,
+ struct page *page)
+{
+ return folio_redirty_for_writepage(wbc, page_folio(page));
+}
+EXPORT_SYMBOL(redirty_page_for_writepage);
diff --git a/mm/page-writeback.c b/mm/page-writeback.c
index d7bd5580c91e..c2987f05c944 100644
--- a/mm/page-writeback.c
+++ b/mm/page-writeback.c
@@ -2558,21 +2558,31 @@ void folio_account_redirty(struct folio *folio)
}
EXPORT_SYMBOL(folio_account_redirty);
-/*
- * When a writepage implementation decides that it doesn't want to write this
- * page for some reason, it should redirty the locked page via
- * redirty_page_for_writepage() and it should then unlock the page and return 0
+/**
+ * folio_redirty_for_writepage - Decline to write a dirty folio.
+ * @wbc: The writeback control.
+ * @folio: The folio.
+ *
+ * When a writepage implementation decides that it doesn't want to write
+ * @folio for some reason, it should call this function, unlock @folio and
+ * return 0.
+ *
+ * Return: True if we redirtied the folio. False if someone else dirtied
+ * it first.
*/
-int redirty_page_for_writepage(struct writeback_control *wbc, struct page *page)
+bool folio_redirty_for_writepage(struct writeback_control *wbc,
+ struct folio *folio)
{
- int ret;
+ bool ret;
+ unsigned nr = folio_nr_pages(folio);
+
+ wbc->pages_skipped += nr;
+ ret = filemap_dirty_folio(folio->mapping, folio);
+ folio_account_redirty(folio);
- wbc->pages_skipped++;
- ret = __set_page_dirty_nobuffers(page);
- account_page_redirty(page);
return ret;
}
-EXPORT_SYMBOL(redirty_page_for_writepage);
+EXPORT_SYMBOL(folio_redirty_for_writepage);
/**
* folio_mark_dirty - Mark a folio as being modified.
--
2.30.2
This is the folio equivalent of page_mkwrite_check_truncate().
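A minimal usage sketch (not from this patch; the surrounding handler is
hypothetical), mirroring how filesystems use the page version in their
->page_mkwrite implementations:

    static vm_fault_t example_page_mkwrite(struct vm_fault *vmf)
    {
            struct folio *folio = page_folio(vmf->page);
            struct inode *inode = file_inode(vmf->vma->vm_file);
            ssize_t len;

            folio_lock(folio);
            len = folio_mkwrite_check_truncate(folio, inode);
            if (len < 0) {
                    folio_unlock(folio);
                    return VM_FAULT_NOPAGE;
            }
            /* make the first 'len' bytes of the folio writable here */
            return VM_FAULT_LOCKED;
    }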
Signed-off-by: Matthew Wilcox (Oracle) <[email protected]>
---
include/linux/pagemap.h | 28 ++++++++++++++++++++++++++++
1 file changed, 28 insertions(+)
diff --git a/include/linux/pagemap.h b/include/linux/pagemap.h
index 412db88b8d0c..18c06c3e42c3 100644
--- a/include/linux/pagemap.h
+++ b/include/linux/pagemap.h
@@ -1121,6 +1121,34 @@ static inline unsigned long dir_pages(struct inode *inode)
PAGE_SHIFT;
}
+/**
+ * folio_mkwrite_check_truncate - check if folio was truncated
+ * @folio: the folio to check
+ * @inode: the inode to check the folio against
+ *
+ * Return: the number of bytes in the folio up to EOF,
+ * or -EFAULT if the folio was truncated.
+ */
+static inline ssize_t folio_mkwrite_check_truncate(struct folio *folio,
+ struct inode *inode)
+{
+ loff_t size = i_size_read(inode);
+ pgoff_t index = size >> PAGE_SHIFT;
+ size_t offset = offset_in_folio(folio, size);
+
+ if (!folio->mapping)
+ return -EFAULT;
+
+ /* folio is wholly inside EOF */
+ if (folio_next_index(folio) - 1 < index)
+ return folio_size(folio);
+ /* folio is wholly past EOF */
+ if (folio->index > index || !offset)
+ return -EFAULT;
+ /* folio is partially inside EOF */
+ return offset;
+}
+
/**
* page_mkwrite_check_truncate - check if page was truncated
* @page: the page to check
--
2.30.2
The pointers stored in the page cache are folios, by definition.
This comes with one behaviour change: callers of readahead_folio() are
no longer required to put the folio reference themselves, because
readahead_folio() drops it for them. This matches how ->readpage works,
rather than how ->readpages used to work.
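A sketch of the new calling convention (example_read_folio_async() is a
hypothetical helper); note there is no folio_put() in the loop because
readahead_folio() already dropped the reference:

    static void example_readahead(struct readahead_control *ractl)
    {
            struct folio *folio;

            while ((folio = readahead_folio(ractl))) {
                    /* Start the read; the I/O completion path is expected
                     * to call folio_mark_uptodate() and folio_unlock().
                     */
                    example_read_folio_async(folio);
            }
    }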
Signed-off-by: Matthew Wilcox (Oracle) <[email protected]>
Reviewed-by: Christoph Hellwig <[email protected]>
---
include/linux/pagemap.h | 53 +++++++++++++++++++++++++++++------------
1 file changed, 38 insertions(+), 15 deletions(-)
diff --git a/include/linux/pagemap.h b/include/linux/pagemap.h
index 18c06c3e42c3..bd4daebaf70e 100644
--- a/include/linux/pagemap.h
+++ b/include/linux/pagemap.h
@@ -988,33 +988,56 @@ void page_cache_async_readahead(struct address_space *mapping,
page_cache_async_ra(&ractl, page, req_count);
}
+static inline struct folio *__readahead_folio(struct readahead_control *ractl)
+{
+ struct folio *folio;
+
+ BUG_ON(ractl->_batch_count > ractl->_nr_pages);
+ ractl->_nr_pages -= ractl->_batch_count;
+ ractl->_index += ractl->_batch_count;
+
+ if (!ractl->_nr_pages) {
+ ractl->_batch_count = 0;
+ return NULL;
+ }
+
+ folio = xa_load(&ractl->mapping->i_pages, ractl->_index);
+ VM_BUG_ON_FOLIO(!folio_test_locked(folio), folio);
+ ractl->_batch_count = folio_nr_pages(folio);
+
+ return folio;
+}
+
/**
* readahead_page - Get the next page to read.
- * @rac: The current readahead request.
+ * @ractl: The current readahead request.
*
* Context: The page is locked and has an elevated refcount. The caller
* should decreases the refcount once the page has been submitted for I/O
* and unlock the page once all I/O to that page has completed.
* Return: A pointer to the next page, or %NULL if we are done.
*/
-static inline struct page *readahead_page(struct readahead_control *rac)
+static inline struct page *readahead_page(struct readahead_control *ractl)
{
- struct page *page;
+ struct folio *folio = __readahead_folio(ractl);
- BUG_ON(rac->_batch_count > rac->_nr_pages);
- rac->_nr_pages -= rac->_batch_count;
- rac->_index += rac->_batch_count;
-
- if (!rac->_nr_pages) {
- rac->_batch_count = 0;
- return NULL;
- }
+ return &folio->page;
+}
- page = xa_load(&rac->mapping->i_pages, rac->_index);
- VM_BUG_ON_PAGE(!PageLocked(page), page);
- rac->_batch_count = thp_nr_pages(page);
+/**
+ * readahead_folio - Get the next folio to read.
+ * @ractl: The current readahead request.
+ *
+ * Context: The folio is locked. The caller should unlock the folio once
+ * all I/O to that folio has completed.
+ * Return: A pointer to the next folio, or %NULL if we are done.
+ */
+static inline struct folio *readahead_folio(struct readahead_control *ractl)
+{
+ struct folio *folio = __readahead_folio(ractl);
- return page;
+ folio_put(folio);
+ return folio;
}
static inline unsigned int __readahead_batch(struct readahead_control *rac,
--
2.30.2
The __folio_alloc(), __folio_alloc_node() and folio_alloc() functions
are mostly for type safety, but they also ensure that the page allocator
allocates a compound page and initialises the deferred list if the page
is large enough to have one.
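A quick usage sketch (illustrative only, not from the patch):

    struct folio *folio = folio_alloc(GFP_KERNEL, 2);

    if (folio) {
            /* order 2: folio_nr_pages() == 4, folio_size() == 4 * PAGE_SIZE;
             * __GFP_COMP was added and the deferred list set up for us.
             */
            folio_put(folio);
    }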
Signed-off-by: Matthew Wilcox (Oracle) <[email protected]>
Reviewed-by: Christoph Hellwig <[email protected]>
---
include/linux/gfp.h | 16 ++++++++++++++++
mm/mempolicy.c | 10 ++++++++++
mm/page_alloc.c | 12 ++++++++++++
3 files changed, 38 insertions(+)
diff --git a/include/linux/gfp.h b/include/linux/gfp.h
index dc5ff40608ce..3745efd21cf6 100644
--- a/include/linux/gfp.h
+++ b/include/linux/gfp.h
@@ -523,6 +523,8 @@ static inline void arch_alloc_page(struct page *page, int order) { }
struct page *__alloc_pages(gfp_t gfp, unsigned int order, int preferred_nid,
nodemask_t *nodemask);
+struct folio *__folio_alloc(gfp_t gfp, unsigned int order, int preferred_nid,
+ nodemask_t *nodemask);
unsigned long __alloc_pages_bulk(gfp_t gfp, int preferred_nid,
nodemask_t *nodemask, int nr_pages,
@@ -564,6 +566,15 @@ __alloc_pages_node(int nid, gfp_t gfp_mask, unsigned int order)
return __alloc_pages(gfp_mask, order, nid, NULL);
}
+static inline
+struct folio *__folio_alloc_node(gfp_t gfp, unsigned int order, int nid)
+{
+ VM_BUG_ON(nid < 0 || nid >= MAX_NUMNODES);
+ VM_WARN_ON((gfp & __GFP_THISNODE) && !node_online(nid));
+
+ return __folio_alloc(gfp, order, nid, NULL);
+}
+
/*
* Allocate pages, preferring the node given as nid. When nid == NUMA_NO_NODE,
* prefer the current CPU's closest node. Otherwise node must be valid and
@@ -580,6 +591,7 @@ static inline struct page *alloc_pages_node(int nid, gfp_t gfp_mask,
#ifdef CONFIG_NUMA
struct page *alloc_pages(gfp_t gfp, unsigned int order);
+struct folio *folio_alloc(gfp_t gfp, unsigned order);
extern struct page *alloc_pages_vma(gfp_t gfp_mask, int order,
struct vm_area_struct *vma, unsigned long addr,
int node, bool hugepage);
@@ -590,6 +602,10 @@ static inline struct page *alloc_pages(gfp_t gfp_mask, unsigned int order)
{
return alloc_pages_node(numa_node_id(), gfp_mask, order);
}
+static inline struct folio *folio_alloc(gfp_t gfp, unsigned int order)
+{
+ return __folio_alloc_node(gfp, order, numa_node_id());
+}
#define alloc_pages_vma(gfp_mask, order, vma, addr, node, false)\
alloc_pages(gfp_mask, order)
#define alloc_hugepage_vma(gfp_mask, vma, addr, order) \
diff --git a/mm/mempolicy.c b/mm/mempolicy.c
index e32360e90274..95d0cf05f7ca 100644
--- a/mm/mempolicy.c
+++ b/mm/mempolicy.c
@@ -2249,6 +2249,16 @@ struct page *alloc_pages(gfp_t gfp, unsigned order)
}
EXPORT_SYMBOL(alloc_pages);
+struct folio *folio_alloc(gfp_t gfp, unsigned order)
+{
+ struct page *page = alloc_pages(gfp | __GFP_COMP, order);
+
+ if (page && order > 1)
+ prep_transhuge_page(page);
+ return (struct folio *)page;
+}
+EXPORT_SYMBOL(folio_alloc);
+
int vma_dup_policy(struct vm_area_struct *src, struct vm_area_struct *dst)
{
struct mempolicy *pol = mpol_dup(vma_policy(src));
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index d72a0d9d4184..d03145671934 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -5399,6 +5399,18 @@ struct page *__alloc_pages(gfp_t gfp, unsigned int order, int preferred_nid,
}
EXPORT_SYMBOL(__alloc_pages);
+struct folio *__folio_alloc(gfp_t gfp, unsigned int order, int preferred_nid,
+ nodemask_t *nodemask)
+{
+ struct page *page = __alloc_pages(gfp | __GFP_COMP, order,
+ preferred_nid, nodemask);
+
+ if (page && order > 1)
+ prep_transhuge_page(page);
+ return (struct folio *)page;
+}
+EXPORT_SYMBOL(__folio_alloc);
+
/*
* Common helper functions. Never use with __GFP_HIGHMEM because the returned
* address cannot represent highmem pages. Use alloc_pages and then kmap if
--
2.30.2
Reimplement __page_cache_alloc as a wrapper around filemap_alloc_folio
to allow filesystems to be converted at our leisure. Increases
kernel text size by 133 bytes, mostly in cachefiles_read_backing_file().
pagecache_get_page() shrinks by 32 bytes, though.
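Illustrative sketch (not part of the patch): a caller that wants a folio
for the page cache can now ask for it directly, with order 0 matching
what __page_cache_alloc() does today:

    struct folio *folio = filemap_alloc_folio(mapping_gfp_mask(mapping), 0);

    if (!folio)
            return -ENOMEM;
    /* then insert it, e.g. add_to_page_cache_lru(&folio->page, mapping, index, gfp) */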
Signed-off-by: Matthew Wilcox (Oracle) <[email protected]>
Reviewed-by: Christoph Hellwig <[email protected]>
---
include/linux/pagemap.h | 11 ++++++++---
mm/filemap.c | 14 +++++++-------
2 files changed, 15 insertions(+), 10 deletions(-)
diff --git a/include/linux/pagemap.h b/include/linux/pagemap.h
index bd4daebaf70e..848acb44ac80 100644
--- a/include/linux/pagemap.h
+++ b/include/linux/pagemap.h
@@ -262,14 +262,19 @@ static inline void *detach_page_private(struct page *page)
}
#ifdef CONFIG_NUMA
-extern struct page *__page_cache_alloc(gfp_t gfp);
+struct folio *filemap_alloc_folio(gfp_t gfp, unsigned int order);
#else
-static inline struct page *__page_cache_alloc(gfp_t gfp)
+static inline struct folio *filemap_alloc_folio(gfp_t gfp, unsigned int order)
{
- return alloc_pages(gfp, 0);
+ return folio_alloc(gfp, order);
}
#endif
+static inline struct page *__page_cache_alloc(gfp_t gfp)
+{
+ return &filemap_alloc_folio(gfp, 0)->page;
+}
+
static inline struct page *page_cache_alloc(struct address_space *x)
{
return __page_cache_alloc(mapping_gfp_mask(x));
diff --git a/mm/filemap.c b/mm/filemap.c
index 6bec995e69bd..54989a32d6a8 100644
--- a/mm/filemap.c
+++ b/mm/filemap.c
@@ -989,24 +989,24 @@ int add_to_page_cache_lru(struct page *page, struct address_space *mapping,
EXPORT_SYMBOL_GPL(add_to_page_cache_lru);
#ifdef CONFIG_NUMA
-struct page *__page_cache_alloc(gfp_t gfp)
+struct folio *filemap_alloc_folio(gfp_t gfp, unsigned int order)
{
int n;
- struct page *page;
+ struct folio *folio;
if (cpuset_do_page_mem_spread()) {
unsigned int cpuset_mems_cookie;
do {
cpuset_mems_cookie = read_mems_allowed_begin();
n = cpuset_mem_spread_node();
- page = __alloc_pages_node(n, gfp, 0);
- } while (!page && read_mems_allowed_retry(cpuset_mems_cookie));
+ folio = __folio_alloc_node(gfp, order, n);
+ } while (!folio && read_mems_allowed_retry(cpuset_mems_cookie));
- return page;
+ return folio;
}
- return alloc_pages(gfp, 0);
+ return folio_alloc(gfp, order);
}
-EXPORT_SYMBOL(__page_cache_alloc);
+EXPORT_SYMBOL(filemap_alloc_folio);
#endif
/*
--
2.30.2
Converting workingset_refault() to take a folio nets us 178 bytes of
savings from removing calls to compound_head(). The three callers all
grow a little, but each of them will be converted to use folios soon,
so that's fine.
Signed-off-by: Matthew Wilcox (Oracle) <[email protected]>
Reviewed-by: Christoph Hellwig <[email protected]>
---
include/linux/swap.h | 4 ++--
mm/filemap.c | 2 +-
mm/memory.c | 3 ++-
mm/swap.c | 7 +++----
mm/swap_state.c | 2 +-
mm/workingset.c | 34 +++++++++++++++++-----------------
6 files changed, 26 insertions(+), 26 deletions(-)
diff --git a/include/linux/swap.h b/include/linux/swap.h
index c7a4c0a5863d..5e01675af7ab 100644
--- a/include/linux/swap.h
+++ b/include/linux/swap.h
@@ -329,7 +329,7 @@ static inline swp_entry_t folio_swap_entry(struct folio *folio)
/* linux/mm/workingset.c */
void workingset_age_nonresident(struct lruvec *lruvec, unsigned long nr_pages);
void *workingset_eviction(struct page *page, struct mem_cgroup *target_memcg);
-void workingset_refault(struct page *page, void *shadow);
+void workingset_refault(struct folio *folio, void *shadow);
void workingset_activation(struct folio *folio);
/* Only track the nodes of mappings with shadow entries */
@@ -350,7 +350,7 @@ extern unsigned long nr_free_buffer_pages(void);
/* linux/mm/swap.c */
extern void lru_note_cost(struct lruvec *lruvec, bool file,
unsigned int nr_pages);
-extern void lru_note_cost_page(struct page *);
+extern void lru_note_cost_folio(struct folio *);
extern void lru_cache_add(struct page *);
void mark_page_accessed(struct page *);
void folio_mark_accessed(struct folio *);
diff --git a/mm/filemap.c b/mm/filemap.c
index a74c69a938ab..6bec995e69bd 100644
--- a/mm/filemap.c
+++ b/mm/filemap.c
@@ -981,7 +981,7 @@ int add_to_page_cache_lru(struct page *page, struct address_space *mapping,
*/
WARN_ON_ONCE(PageActive(page));
if (!(gfp_mask & __GFP_WRITE) && shadow)
- workingset_refault(page, shadow);
+ workingset_refault(page_folio(page), shadow);
lru_cache_add(page);
}
return ret;
diff --git a/mm/memory.c b/mm/memory.c
index 614418e26e2c..627e7836ade6 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -3538,7 +3538,8 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
shadow = get_shadow_from_swap_cache(entry);
if (shadow)
- workingset_refault(page, shadow);
+ workingset_refault(page_folio(page),
+ shadow);
lru_cache_add(page);
diff --git a/mm/swap.c b/mm/swap.c
index d32007fe23b3..6e80f30d2e5e 100644
--- a/mm/swap.c
+++ b/mm/swap.c
@@ -315,11 +315,10 @@ void lru_note_cost(struct lruvec *lruvec, bool file, unsigned int nr_pages)
} while ((lruvec = parent_lruvec(lruvec)));
}
-void lru_note_cost_page(struct page *page)
+void lru_note_cost_folio(struct folio *folio)
{
- struct folio *folio = page_folio(page);
- lru_note_cost(folio_lruvec(folio),
- page_is_file_lru(page), thp_nr_pages(page));
+ lru_note_cost(folio_lruvec(folio), folio_is_file_lru(folio),
+ folio_nr_pages(folio));
}
static void __folio_activate(struct folio *folio, struct lruvec *lruvec)
diff --git a/mm/swap_state.c b/mm/swap_state.c
index c56aa9ac050d..1a29b4f98208 100644
--- a/mm/swap_state.c
+++ b/mm/swap_state.c
@@ -498,7 +498,7 @@ struct page *__read_swap_cache_async(swp_entry_t entry, gfp_t gfp_mask,
mem_cgroup_swapin_uncharge_swap(entry);
if (shadow)
- workingset_refault(page, shadow);
+ workingset_refault(page_folio(page), shadow);
/* Caller will initiate read into locked page */
lru_cache_add(page);
diff --git a/mm/workingset.c b/mm/workingset.c
index 39bb60d50217..10830211a187 100644
--- a/mm/workingset.c
+++ b/mm/workingset.c
@@ -273,17 +273,17 @@ void *workingset_eviction(struct page *page, struct mem_cgroup *target_memcg)
}
/**
- * workingset_refault - evaluate the refault of a previously evicted page
- * @page: the freshly allocated replacement page
- * @shadow: shadow entry of the evicted page
+ * workingset_refault - evaluate the refault of a previously evicted folio
+ * @folio: the freshly allocated replacement folio
+ * @shadow: shadow entry of the evicted folio
*
* Calculates and evaluates the refault distance of the previously
- * evicted page in the context of the node and the memcg whose memory
+ * evicted folio in the context of the node and the memcg whose memory
* pressure caused the eviction.
*/
-void workingset_refault(struct page *page, void *shadow)
+void workingset_refault(struct folio *folio, void *shadow)
{
- bool file = page_is_file_lru(page);
+ bool file = folio_is_file_lru(folio);
struct mem_cgroup *eviction_memcg;
struct lruvec *eviction_lruvec;
unsigned long refault_distance;
@@ -301,10 +301,10 @@ void workingset_refault(struct page *page, void *shadow)
rcu_read_lock();
/*
* Look up the memcg associated with the stored ID. It might
- * have been deleted since the page's eviction.
+ * have been deleted since the folio's eviction.
*
* Note that in rare events the ID could have been recycled
- * for a new cgroup that refaults a shared page. This is
+ * for a new cgroup that refaults a shared folio. This is
* impossible to tell from the available data. However, this
* should be a rare and limited disturbance, and activations
* are always speculative anyway. Ultimately, it's the aging
@@ -340,14 +340,14 @@ void workingset_refault(struct page *page, void *shadow)
refault_distance = (refault - eviction) & EVICTION_MASK;
/*
- * The activation decision for this page is made at the level
+ * The activation decision for this folio is made at the level
* where the eviction occurred, as that is where the LRU order
- * during page reclaim is being determined.
+ * during folio reclaim is being determined.
*
- * However, the cgroup that will own the page is the one that
+ * However, the cgroup that will own the folio is the one that
* is actually experiencing the refault event.
*/
- memcg = page_memcg(page);
+ memcg = folio_memcg(folio);
lruvec = mem_cgroup_lruvec(memcg, pgdat);
inc_lruvec_state(lruvec, WORKINGSET_REFAULT_BASE + file);
@@ -375,15 +375,15 @@ void workingset_refault(struct page *page, void *shadow)
if (refault_distance > workingset_size)
goto out;
- SetPageActive(page);
- workingset_age_nonresident(lruvec, thp_nr_pages(page));
+ folio_set_active(folio);
+ workingset_age_nonresident(lruvec, folio_nr_pages(folio));
inc_lruvec_state(lruvec, WORKINGSET_ACTIVATE_BASE + file);
- /* Page was active prior to eviction */
+ /* Folio was active prior to eviction */
if (workingset) {
- SetPageWorkingset(page);
+ folio_set_workingset(folio);
/* XXX: Move to lru_cache_add() when it supports new vs putback */
- lru_note_cost_page(page);
+ lru_note_cost_folio(folio);
inc_lruvec_state(lruvec, WORKINGSET_RESTORE_BASE + file);
}
out:
--
2.30.2
This is an address_space operation, so its argument must remain as a
struct page, but we can use a folio internally.
Signed-off-by: Matthew Wilcox (Oracle) <[email protected]>
---
fs/iomap/buffered-io.c | 6 +++---
1 file changed, 3 insertions(+), 3 deletions(-)
diff --git a/fs/iomap/buffered-io.c b/fs/iomap/buffered-io.c
index 251ec45426aa..abb065b50e38 100644
--- a/fs/iomap/buffered-io.c
+++ b/fs/iomap/buffered-io.c
@@ -458,15 +458,15 @@ iomap_releasepage(struct page *page, gfp_t gfp_mask)
{
struct folio *folio = page_folio(page);
- trace_iomap_releasepage(page->mapping->host, page_offset(page),
- PAGE_SIZE);
+ trace_iomap_releasepage(folio->mapping->host, folio_pos(folio),
+ folio_size(folio));
/*
* mm accommodates an old ext3 case where clean pages might not have had
* the dirty bit cleared. Thus, it can send actual dirty pages to
* ->releasepage() via shrink_active_list(), skip those here.
*/
- if (PageDirty(page) || PageWriteback(page))
+ if (folio_test_dirty(folio) || folio_test_writeback(folio))
return 0;
iomap_page_release(folio);
return 1;
--
2.30.2
All but one caller already has the iomap_page, and we can avoid getting
it again.
Signed-off-by: Matthew Wilcox (Oracle) <[email protected]>
---
fs/iomap/buffered-io.c | 25 +++++++++++++------------
1 file changed, 13 insertions(+), 12 deletions(-)
diff --git a/fs/iomap/buffered-io.c b/fs/iomap/buffered-io.c
index 6b41019a51a3..fbe4ebc074ce 100644
--- a/fs/iomap/buffered-io.c
+++ b/fs/iomap/buffered-io.c
@@ -134,11 +134,9 @@ iomap_adjust_read_range(struct inode *inode, struct iomap_page *iop,
*lenp = plen;
}
-static void
-iomap_iop_set_range_uptodate(struct page *page, unsigned off, unsigned len)
+static void iomap_iop_set_range_uptodate(struct page *page,
+ struct iomap_page *iop, unsigned off, unsigned len)
{
- struct folio *folio = page_folio(page);
- struct iomap_page *iop = to_iomap_page(folio);
struct inode *inode = page->mapping->host;
unsigned first = off >> inode->i_blkbits;
unsigned last = (off + len - 1) >> inode->i_blkbits;
@@ -151,14 +149,14 @@ iomap_iop_set_range_uptodate(struct page *page, unsigned off, unsigned len)
spin_unlock_irqrestore(&iop->uptodate_lock, flags);
}
-static void
-iomap_set_range_uptodate(struct page *page, unsigned off, unsigned len)
+static void iomap_set_range_uptodate(struct page *page,
+ struct iomap_page *iop, unsigned off, unsigned len)
{
if (PageError(page))
return;
- if (page_has_private(page))
- iomap_iop_set_range_uptodate(page, off, len);
+ if (iop)
+ iomap_iop_set_range_uptodate(page, iop, off, len);
else
SetPageUptodate(page);
}
@@ -174,7 +172,8 @@ iomap_read_page_end_io(struct bio_vec *bvec, int error)
ClearPageUptodate(page);
SetPageError(page);
} else {
- iomap_set_range_uptodate(page, bvec->bv_offset, bvec->bv_len);
+ iomap_set_range_uptodate(page, iop, bvec->bv_offset,
+ bvec->bv_len);
}
if (!iop || atomic_sub_and_test(bvec->bv_len, &iop->read_bytes_pending))
@@ -254,7 +253,7 @@ iomap_readpage_actor(struct inode *inode, loff_t pos, loff_t length, void *data,
if (iomap_block_needs_zeroing(inode, iomap, pos)) {
zero_user(page, poff, plen);
- iomap_set_range_uptodate(page, poff, plen);
+ iomap_set_range_uptodate(page, iop, poff, plen);
goto done;
}
@@ -583,7 +582,7 @@ __iomap_write_begin(struct inode *inode, loff_t pos, unsigned len, int flags,
if (status)
return status;
}
- iomap_set_range_uptodate(page, poff, plen);
+ iomap_set_range_uptodate(page, iop, poff, plen);
} while ((block_start += plen) < block_end);
return 0;
@@ -645,6 +644,8 @@ iomap_write_begin(struct inode *inode, loff_t pos, unsigned len, unsigned flags,
static size_t __iomap_write_end(struct inode *inode, loff_t pos, size_t len,
size_t copied, struct page *page)
{
+ struct folio *folio = page_folio(page);
+ struct iomap_page *iop = to_iomap_page(folio);
flush_dcache_page(page);
/*
@@ -660,7 +661,7 @@ static size_t __iomap_write_end(struct inode *inode, loff_t pos, size_t len,
*/
if (unlikely(copied < len && !PageUptodate(page)))
return 0;
- iomap_set_range_uptodate(page, offset_in_page(pos), len);
+ iomap_set_range_uptodate(page, iop, offset_in_page(pos), len);
__set_page_dirty_nobuffers(page);
return copied;
}
--
2.30.2
Use bio_for_each_folio() to iterate over each folio in the bio
instead of iterating over each page.
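For illustration, a simplified read completion handler walking folios
(this sketch marks the whole folio uptodate and does not track sub-folio
uptodate state the way iomap_page does):

    static void example_read_end_io(struct bio *bio)
    {
            int error = blk_status_to_errno(bio->bi_status);
            struct folio_iter fi;

            bio_for_each_folio_all(fi, bio) {
                    if (error)
                            folio_set_error(fi.folio);
                    else
                            folio_mark_uptodate(fi.folio);
                    folio_unlock(fi.folio);
            }
            bio_put(bio);
    }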
Signed-off-by: Matthew Wilcox (Oracle) <[email protected]>
---
fs/iomap/buffered-io.c | 46 +++++++++++++++++-------------------------
1 file changed, 18 insertions(+), 28 deletions(-)
diff --git a/fs/iomap/buffered-io.c b/fs/iomap/buffered-io.c
index 707a96e36651..4732298f74e1 100644
--- a/fs/iomap/buffered-io.c
+++ b/fs/iomap/buffered-io.c
@@ -161,36 +161,29 @@ static void iomap_set_range_uptodate(struct folio *folio,
folio_mark_uptodate(folio);
}
-static void
-iomap_read_page_end_io(struct bio_vec *bvec, int error)
+static void iomap_finish_folio_read(struct folio *folio, size_t offset,
+ size_t len, int error)
{
- struct page *page = bvec->bv_page;
- struct folio *folio = page_folio(page);
struct iomap_page *iop = to_iomap_page(folio);
if (unlikely(error)) {
folio_clear_uptodate(folio);
folio_set_error(folio);
} else {
- size_t off = (page - &folio->page) * PAGE_SIZE +
- bvec->bv_offset;
-
- iomap_set_range_uptodate(folio, iop, off, bvec->bv_len);
+ iomap_set_range_uptodate(folio, iop, offset, len);
}
- if (!iop || atomic_sub_and_test(bvec->bv_len, &iop->read_bytes_pending))
+ if (!iop || atomic_sub_and_test(len, &iop->read_bytes_pending))
folio_unlock(folio);
}
-static void
-iomap_read_end_io(struct bio *bio)
+static void iomap_read_end_io(struct bio *bio)
{
int error = blk_status_to_errno(bio->bi_status);
- struct bio_vec *bvec;
- struct bvec_iter_all iter_all;
+ struct folio_iter fi;
- bio_for_each_segment_all(bvec, bio, iter_all)
- iomap_read_page_end_io(bvec, error);
+ bio_for_each_folio_all(fi, bio)
+ iomap_finish_folio_read(fi.folio, fi.offset, fi.length, error);
bio_put(bio);
}
@@ -1014,23 +1007,21 @@ vm_fault_t iomap_page_mkwrite(struct vm_fault *vmf, const struct iomap_ops *ops)
}
EXPORT_SYMBOL_GPL(iomap_page_mkwrite);
-static void
-iomap_finish_page_writeback(struct inode *inode, struct page *page,
- int error, unsigned int len)
+static void iomap_finish_folio_write(struct inode *inode, struct folio *folio,
+ size_t len, int error)
{
- struct folio *folio = page_folio(page);
struct iomap_page *iop = to_iomap_page(folio);
if (error) {
- SetPageError(page);
+ folio_set_error(folio);
mapping_set_error(inode->i_mapping, -EIO);
}
- WARN_ON_ONCE(i_blocks_per_page(inode, page) > 1 && !iop);
+ WARN_ON_ONCE(i_blocks_per_folio(inode, folio) > 1 && !iop);
WARN_ON_ONCE(iop && atomic_read(&iop->write_bytes_pending) <= 0);
if (!iop || atomic_sub_and_test(len, &iop->write_bytes_pending))
- end_page_writeback(page);
+ folio_end_writeback(folio);
}
/*
@@ -1049,8 +1040,7 @@ iomap_finish_ioend(struct iomap_ioend *ioend, int error)
bool quiet = bio_flagged(bio, BIO_QUIET);
for (bio = &ioend->io_inline_bio; bio; bio = next) {
- struct bio_vec *bv;
- struct bvec_iter_all iter_all;
+ struct folio_iter fi;
/*
* For the last bio, bi_private points to the ioend, so we
@@ -1061,10 +1051,10 @@ iomap_finish_ioend(struct iomap_ioend *ioend, int error)
else
next = bio->bi_private;
- /* walk each page on bio, ending page IO on them */
- bio_for_each_segment_all(bv, bio, iter_all)
- iomap_finish_page_writeback(inode, bv->bv_page, error,
- bv->bv_len);
+ /* walk all folios in bio, ending page IO on them */
+ bio_for_each_folio_all(fi, bio)
+ iomap_finish_folio_write(inode, fi.folio, fi.length,
+ error);
bio_put(bio);
}
/* The ioend has been freed by bio_put() */
--
2.30.2
Inline data only occupies a single page, but using a folio means that
we don't need to call compound_head() in PageUptodate().
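The mapping pattern used here, shown in isolation (dst and len are
placeholders for this sketch):

    void *addr = kmap_local_folio(folio, 0);    /* map the first page of the folio */

    memcpy(dst, addr, len);
    kunmap_local(addr);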
Signed-off-by: Matthew Wilcox (Oracle) <[email protected]>
---
fs/iomap/buffered-io.c | 12 ++++++------
1 file changed, 6 insertions(+), 6 deletions(-)
diff --git a/fs/iomap/buffered-io.c b/fs/iomap/buffered-io.c
index c616ef1feb21..ac33f19325ab 100644
--- a/fs/iomap/buffered-io.c
+++ b/fs/iomap/buffered-io.c
@@ -662,18 +662,18 @@ static size_t __iomap_write_end(struct inode *inode, loff_t pos, size_t len,
return copied;
}
-static size_t iomap_write_end_inline(struct inode *inode, struct page *page,
+static size_t iomap_write_end_inline(struct inode *inode, struct folio *folio,
struct iomap *iomap, loff_t pos, size_t copied)
{
void *addr;
- WARN_ON_ONCE(!PageUptodate(page));
+ WARN_ON_ONCE(!folio_test_uptodate(folio));
BUG_ON(pos + copied > PAGE_SIZE - offset_in_page(iomap->inline_data));
- flush_dcache_page(page);
- addr = kmap_atomic(page);
+ flush_dcache_folio(folio);
+ addr = kmap_local_folio(folio, 0);
memcpy(iomap->inline_data + pos, addr + pos, copied);
- kunmap_atomic(addr);
+ kunmap_local(addr);
mark_inode_dirty(inode);
return copied;
@@ -690,7 +690,7 @@ static size_t iomap_write_end(struct inode *inode, loff_t pos, size_t len,
size_t ret;
if (srcmap->type == IOMAP_INLINE) {
- ret = iomap_write_end_inline(inode, page, iomap, pos, copied);
+ ret = iomap_write_end_inline(inode, folio, iomap, pos, copied);
} else if (srcmap->flags & IOMAP_F_BUFFER_HEAD) {
ret = block_write_end(NULL, inode->i_mapping, pos, len, copied,
page, NULL);
--
2.30.2
Use folios throughout filemap_unaccount_folio(), except for the
bug-handling path, which would need total_mapcount(), a function that
is currently only defined for builds with THP enabled.
Signed-off-by: Matthew Wilcox (Oracle) <[email protected]>
---
include/linux/pagemap.h | 5 ---
mm/filemap.c | 68 ++++++++++++++++++++---------------------
2 files changed, 34 insertions(+), 39 deletions(-)
diff --git a/include/linux/pagemap.h b/include/linux/pagemap.h
index 83c1a798265f..f6a2a2589009 100644
--- a/include/linux/pagemap.h
+++ b/include/linux/pagemap.h
@@ -786,11 +786,6 @@ static inline void __set_page_dirty(struct page *page,
}
void folio_account_cleaned(struct folio *folio, struct address_space *mapping,
struct bdi_writeback *wb);
-static inline void account_page_cleaned(struct page *page,
- struct address_space *mapping, struct bdi_writeback *wb)
-{
- return folio_account_cleaned(page_folio(page), mapping, wb);
-}
void __folio_cancel_dirty(struct folio *folio);
static inline void folio_cancel_dirty(struct folio *folio)
{
diff --git a/mm/filemap.c b/mm/filemap.c
index c96febad32fc..6e8b195edf19 100644
--- a/mm/filemap.c
+++ b/mm/filemap.c
@@ -144,8 +144,8 @@ static void page_cache_delete(struct address_space *mapping,
mapping->nrpages -= nr;
}
-static void unaccount_page_cache_page(struct address_space *mapping,
- struct page *page)
+static void filemap_unaccount_folio(struct address_space *mapping,
+ struct folio *folio)
{
int nr;
@@ -154,64 +154,64 @@ static void unaccount_page_cache_page(struct address_space *mapping,
* invalidate any existing cleancache entries. We can't leave
* stale data around in the cleancache once our page is gone
*/
- if (PageUptodate(page) && PageMappedToDisk(page))
- cleancache_put_page(page);
+ if (folio_test_uptodate(folio) && folio_test_mappedtodisk(folio))
+ cleancache_put_page(&folio->page);
else
- cleancache_invalidate_page(mapping, page);
+ cleancache_invalidate_page(mapping, &folio->page);
- VM_BUG_ON_PAGE(PageTail(page), page);
- VM_BUG_ON_PAGE(page_mapped(page), page);
- if (!IS_ENABLED(CONFIG_DEBUG_VM) && unlikely(page_mapped(page))) {
+ VM_BUG_ON_FOLIO(folio_mapped(folio), folio);
+ if (!IS_ENABLED(CONFIG_DEBUG_VM) && unlikely(folio_mapped(folio))) {
int mapcount;
pr_alert("BUG: Bad page cache in process %s pfn:%05lx\n",
- current->comm, page_to_pfn(page));
- dump_page(page, "still mapped when deleted");
+ current->comm, folio_pfn(folio));
+ dump_page(&folio->page, "still mapped when deleted");
dump_stack();
add_taint(TAINT_BAD_PAGE, LOCKDEP_NOW_UNRELIABLE);
- mapcount = page_mapcount(page);
+ mapcount = page_mapcount(&folio->page);
if (mapping_exiting(mapping) &&
- page_count(page) >= mapcount + 2) {
+ folio_ref_count(folio) >= mapcount + 2) {
/*
* All vmas have already been torn down, so it's
- * a good bet that actually the page is unmapped,
+ * a good bet that actually the folio is unmapped,
* and we'd prefer not to leak it: if we're wrong,
* some other bad page check should catch it later.
*/
- page_mapcount_reset(page);
- page_ref_sub(page, mapcount);
+ page_mapcount_reset(&folio->page);
+ folio_ref_sub(folio, mapcount);
}
}
- /* hugetlb pages do not participate in page cache accounting. */
- if (PageHuge(page))
+ /* hugetlb folios do not participate in page cache accounting. */
+ if (folio_test_hugetlb(folio))
return;
- nr = thp_nr_pages(page);
+ nr = folio_nr_pages(folio);
- __mod_lruvec_page_state(page, NR_FILE_PAGES, -nr);
- if (PageSwapBacked(page)) {
- __mod_lruvec_page_state(page, NR_SHMEM, -nr);
- if (PageTransHuge(page))
- __mod_lruvec_page_state(page, NR_SHMEM_THPS, -nr);
- } else if (PageTransHuge(page)) {
- __mod_lruvec_page_state(page, NR_FILE_THPS, -nr);
+ __lruvec_stat_mod_folio(folio, NR_FILE_PAGES, -nr);
+ if (folio_test_swapbacked(folio)) {
+ __lruvec_stat_mod_folio(folio, NR_SHMEM, -nr);
+ if (folio_multi(folio))
+ __lruvec_stat_mod_folio(folio, NR_SHMEM_THPS, -nr);
+ } else if (folio_multi(folio)) {
+ __lruvec_stat_mod_folio(folio, NR_FILE_THPS, -nr);
filemap_nr_thps_dec(mapping);
}
/*
- * At this point page must be either written or cleaned by
- * truncate. Dirty page here signals a bug and loss of
+ * At this point folio must be either written or cleaned by
+ * truncate. Dirty folio here signals a bug and loss of
* unwritten data.
*
- * This fixes dirty accounting after removing the page entirely
- * but leaves PageDirty set: it has no effect for truncated
- * page and anyway will be cleared before returning page into
+ * This fixes dirty accounting after removing the folio entirely
+ * but leaves the dirty flag set: it has no effect for truncated
+ * folio and anyway will be cleared before returning folio to
* buddy allocator.
*/
- if (WARN_ON_ONCE(PageDirty(page)))
- account_page_cleaned(page, mapping, inode_to_wb(mapping->host));
+ if (WARN_ON_ONCE(folio_test_dirty(folio)))
+ folio_account_cleaned(folio, mapping,
+ inode_to_wb(mapping->host));
}
/*
@@ -226,7 +226,7 @@ void __delete_from_page_cache(struct page *page, void *shadow)
trace_mm_filemap_delete_from_page_cache(page);
- unaccount_page_cache_page(mapping, page);
+ filemap_unaccount_folio(mapping, folio);
page_cache_delete(mapping, folio, shadow);
}
@@ -344,7 +344,7 @@ void delete_from_page_cache_batch(struct address_space *mapping,
for (i = 0; i < pagevec_count(pvec); i++) {
trace_mm_filemap_delete_from_page_cache(pvec->pages[i]);
- unaccount_page_cache_page(mapping, pvec->pages[i]);
+ filemap_unaccount_folio(mapping, page_folio(pvec->pages[i]));
}
page_cache_delete_batch(mapping, pvec);
xa_unlock_irqrestore(&mapping->i_pages, flags);
--
2.30.2
The page cache only stores folios, never tail pages. Saves 29 bytes
due to removing calls to compound_head().
Signed-off-by: Matthew Wilcox (Oracle) <[email protected]>
---
mm/filemap.c | 23 +++++++++++------------
1 file changed, 11 insertions(+), 12 deletions(-)
diff --git a/mm/filemap.c b/mm/filemap.c
index c4190c0a6d86..04501bf50448 100644
--- a/mm/filemap.c
+++ b/mm/filemap.c
@@ -2272,32 +2272,31 @@ static void filemap_get_read_batch(struct address_space *mapping,
pgoff_t index, pgoff_t max, struct pagevec *pvec)
{
XA_STATE(xas, &mapping->i_pages, index);
- struct page *head;
+ struct folio *folio;
rcu_read_lock();
- for (head = xas_load(&xas); head; head = xas_next(&xas)) {
- if (xas_retry(&xas, head))
+ for (folio = xas_load(&xas); folio; folio = xas_next(&xas)) {
+ if (xas_retry(&xas, folio))
continue;
- if (xas.xa_index > max || xa_is_value(head))
+ if (xas.xa_index > max || xa_is_value(folio))
break;
- if (!page_cache_get_speculative(head))
+ if (!folio_try_get_rcu(folio))
goto retry;
- /* Has the page moved or been split? */
- if (unlikely(head != xas_reload(&xas)))
+ if (unlikely(folio != xas_reload(&xas)))
goto put_page;
- if (!pagevec_add(pvec, head))
+ if (!pagevec_add(pvec, &folio->page))
break;
- if (!PageUptodate(head))
+ if (!folio_test_uptodate(folio))
break;
- if (PageReadahead(head))
+ if (folio_test_readahead(folio))
break;
- xas.xa_index = head->index + thp_nr_pages(head) - 1;
+ xas.xa_index = folio->index + folio_nr_pages(folio) - 1;
xas.xa_offset = (xas.xa_index >> xas.xa_shift) & XA_CHUNK_MASK;
continue;
put_page:
- put_page(head);
+ folio_put(folio);
retry:
xas_reset(&xas);
}
--
2.30.2
Convert find_get_entry() to return a folio and convert its callers to
cope. Saves 580 bytes of kernel text; all five callers are reduced in
size.
Signed-off-by: Matthew Wilcox (Oracle) <[email protected]>
---
mm/filemap.c | 129 +++++++++++++++++++++++++--------------------------
1 file changed, 64 insertions(+), 65 deletions(-)
diff --git a/mm/filemap.c b/mm/filemap.c
index 4a81eaff363e..c4190c0a6d86 100644
--- a/mm/filemap.c
+++ b/mm/filemap.c
@@ -1907,37 +1907,36 @@ struct folio *__filemap_get_folio(struct address_space *mapping, pgoff_t index,
}
EXPORT_SYMBOL(__filemap_get_folio);
-static inline struct page *find_get_entry(struct xa_state *xas, pgoff_t max,
+static inline struct folio *find_get_entry(struct xa_state *xas, pgoff_t max,
xa_mark_t mark)
{
- struct page *page;
+ struct folio *folio;
retry:
if (mark == XA_PRESENT)
- page = xas_find(xas, max);
+ folio = xas_find(xas, max);
else
- page = xas_find_marked(xas, max, mark);
+ folio = xas_find_marked(xas, max, mark);
- if (xas_retry(xas, page))
+ if (xas_retry(xas, folio))
goto retry;
/*
* A shadow entry of a recently evicted page, a swap
* entry from shmem/tmpfs or a DAX entry. Return it
* without attempting to raise page count.
*/
- if (!page || xa_is_value(page))
- return page;
+ if (!folio || xa_is_value(folio))
+ return folio;
- if (!page_cache_get_speculative(page))
+ if (!folio_try_get_rcu(folio))
goto reset;
- /* Has the page moved or been split? */
- if (unlikely(page != xas_reload(xas))) {
- put_page(page);
+ if (unlikely(folio != xas_reload(xas))) {
+ folio_put(folio);
goto reset;
}
- return page;
+ return folio;
reset:
xas_reset(xas);
goto retry;
@@ -1978,7 +1977,7 @@ unsigned find_get_entries(struct address_space *mapping, pgoff_t start,
unsigned nr_entries = PAGEVEC_SIZE;
rcu_read_lock();
- while ((page = find_get_entry(&xas, end, XA_PRESENT))) {
+ while ((page = &find_get_entry(&xas, end, XA_PRESENT)->page)) {
/*
* Terminate early on finding a THP, to allow the caller to
* handle it all at once; but continue if this is hugetlbfs.
@@ -2025,38 +2024,38 @@ unsigned find_lock_entries(struct address_space *mapping, pgoff_t start,
pgoff_t end, struct pagevec *pvec, pgoff_t *indices)
{
XA_STATE(xas, &mapping->i_pages, start);
- struct page *page;
+ struct folio *folio;
rcu_read_lock();
- while ((page = find_get_entry(&xas, end, XA_PRESENT))) {
- if (!xa_is_value(page)) {
- if (page->index < start)
+ while ((folio = find_get_entry(&xas, end, XA_PRESENT))) {
+ if (!xa_is_value(folio)) {
+ if (folio->index < start)
goto put;
- VM_BUG_ON_PAGE(page->index != xas.xa_index, page);
- if (page->index + thp_nr_pages(page) - 1 > end)
+ VM_BUG_ON_FOLIO(folio->index != xas.xa_index,
+ folio);
+ if (folio->index + folio_nr_pages(folio) - 1 > end)
goto put;
- if (!trylock_page(page))
+ if (!folio_trylock(folio))
goto put;
- if (page->mapping != mapping || PageWriteback(page))
+ if (folio->mapping != mapping ||
+ folio_test_writeback(folio))
goto unlock;
- VM_BUG_ON_PAGE(!thp_contains(page, xas.xa_index),
- page);
+ VM_BUG_ON_FOLIO(!folio_contains(folio, xas.xa_index),
+ folio);
}
indices[pvec->nr] = xas.xa_index;
- if (!pagevec_add(pvec, page))
+ if (!pagevec_add(pvec, &folio->page))
break;
goto next;
unlock:
- unlock_page(page);
+ folio_unlock(folio);
put:
- put_page(page);
+ folio_put(folio);
next:
- if (!xa_is_value(page) && PageTransHuge(page)) {
- unsigned int nr_pages = thp_nr_pages(page);
-
- /* Final THP may cross MAX_LFS_FILESIZE on 32-bit */
- xas_set(&xas, page->index + nr_pages);
- if (xas.xa_index < nr_pages)
+ if (!xa_is_value(folio) && folio_multi(folio)) {
+ xas_set(&xas, folio->index + folio_nr_pages(folio));
+ /* Did we wrap on 32-bit? */
+ if (!xas.xa_index)
break;
}
}
@@ -2091,19 +2090,19 @@ unsigned find_get_pages_range(struct address_space *mapping, pgoff_t *start,
struct page **pages)
{
XA_STATE(xas, &mapping->i_pages, *start);
- struct page *page;
+ struct folio *folio;
unsigned ret = 0;
if (unlikely(!nr_pages))
return 0;
rcu_read_lock();
- while ((page = find_get_entry(&xas, end, XA_PRESENT))) {
+ while ((folio = find_get_entry(&xas, end, XA_PRESENT))) {
/* Skip over shadow, swap and DAX entries */
- if (xa_is_value(page))
+ if (xa_is_value(folio))
continue;
- pages[ret] = find_subpage(page, xas.xa_index);
+ pages[ret] = folio_file_page(folio, xas.xa_index);
if (++ret == nr_pages) {
*start = xas.xa_index + 1;
goto out;
@@ -2200,25 +2199,25 @@ unsigned find_get_pages_range_tag(struct address_space *mapping, pgoff_t *index,
struct page **pages)
{
XA_STATE(xas, &mapping->i_pages, *index);
- struct page *page;
+ struct folio *folio;
unsigned ret = 0;
if (unlikely(!nr_pages))
return 0;
rcu_read_lock();
- while ((page = find_get_entry(&xas, end, tag))) {
+ while ((folio = find_get_entry(&xas, end, tag))) {
/*
* Shadow entries should never be tagged, but this iteration
* is lockless so there is a window for page reclaim to evict
* a page we saw tagged. Skip over it.
*/
- if (xa_is_value(page))
+ if (xa_is_value(folio))
continue;
- pages[ret] = page;
+ pages[ret] = &folio->page;
if (++ret == nr_pages) {
- *index = page->index + thp_nr_pages(page);
+ *index = folio->index + folio_nr_pages(folio);
goto out;
}
}
@@ -2697,44 +2696,44 @@ generic_file_read_iter(struct kiocb *iocb, struct iov_iter *iter)
}
EXPORT_SYMBOL(generic_file_read_iter);
-static inline loff_t page_seek_hole_data(struct xa_state *xas,
- struct address_space *mapping, struct page *page,
+static inline loff_t folio_seek_hole_data(struct xa_state *xas,
+ struct address_space *mapping, struct folio *folio,
loff_t start, loff_t end, bool seek_data)
{
const struct address_space_operations *ops = mapping->a_ops;
size_t offset, bsz = i_blocksize(mapping->host);
- if (xa_is_value(page) || PageUptodate(page))
+ if (xa_is_value(folio) || folio_test_uptodate(folio))
return seek_data ? start : end;
if (!ops->is_partially_uptodate)
return seek_data ? end : start;
xas_pause(xas);
rcu_read_unlock();
- lock_page(page);
- if (unlikely(page->mapping != mapping))
+ folio_lock(folio);
+ if (unlikely(folio->mapping != mapping))
goto unlock;
- offset = offset_in_thp(page, start) & ~(bsz - 1);
+ offset = offset_in_folio(folio, start) & ~(bsz - 1);
do {
- if (ops->is_partially_uptodate(page, offset, bsz) == seek_data)
+ if (ops->is_partially_uptodate(&folio->page, offset, bsz) ==
+ seek_data)
break;
start = (start + bsz) & ~(bsz - 1);
offset += bsz;
- } while (offset < thp_size(page));
+ } while (offset < folio_size(folio));
unlock:
- unlock_page(page);
+ folio_unlock(folio);
rcu_read_lock();
return start;
}
-static inline
-unsigned int seek_page_size(struct xa_state *xas, struct page *page)
+static inline size_t seek_folio_size(struct xa_state *xas, struct folio *folio)
{
- if (xa_is_value(page))
+ if (xa_is_value(folio))
return PAGE_SIZE << xa_get_order(xas->xa, xas->xa_index);
- return thp_size(page);
+ return folio_size(folio);
}
/**
@@ -2761,15 +2760,15 @@ loff_t mapping_seek_hole_data(struct address_space *mapping, loff_t start,
XA_STATE(xas, &mapping->i_pages, start >> PAGE_SHIFT);
pgoff_t max = (end - 1) >> PAGE_SHIFT;
bool seek_data = (whence == SEEK_DATA);
- struct page *page;
+ struct folio *folio;
if (end <= start)
return -ENXIO;
rcu_read_lock();
- while ((page = find_get_entry(&xas, max, XA_PRESENT))) {
- loff_t pos = (u64)xas.xa_index << PAGE_SHIFT;
- unsigned int seek_size;
+ while ((folio = find_get_entry(&xas, max, XA_PRESENT))) {
+ loff_t pos = xas.xa_index * PAGE_SIZE;
+ size_t seek_size;
if (start < pos) {
if (!seek_data)
@@ -2777,9 +2776,9 @@ loff_t mapping_seek_hole_data(struct address_space *mapping, loff_t start,
start = pos;
}
- seek_size = seek_page_size(&xas, page);
- pos = round_up(pos + 1, seek_size);
- start = page_seek_hole_data(&xas, mapping, page, start, pos,
+ seek_size = seek_folio_size(&xas, folio);
+ pos = round_up((u64)pos + 1, seek_size);
+ start = folio_seek_hole_data(&xas, mapping, folio, start, pos,
seek_data);
if (start < pos)
goto unlock;
@@ -2787,15 +2786,15 @@ loff_t mapping_seek_hole_data(struct address_space *mapping, loff_t start,
break;
if (seek_size > PAGE_SIZE)
xas_set(&xas, pos >> PAGE_SHIFT);
- if (!xa_is_value(page))
- put_page(page);
+ if (!xa_is_value(folio))
+ folio_put(folio);
}
if (seek_data)
start = -ENXIO;
unlock:
rcu_read_unlock();
- if (page && !xa_is_value(page))
- put_page(page);
+ if (folio && !xa_is_value(folio))
+ folio_put(folio);
if (start > end)
return end;
return start;
--
2.30.2
The only caller was already passing a head page, so this simply avoids
a call to compound_head().
Signed-off-by: Matthew Wilcox (Oracle) <[email protected]>
---
mm/filemap.c | 16 ++++++++--------
1 file changed, 8 insertions(+), 8 deletions(-)
diff --git a/mm/filemap.c b/mm/filemap.c
index 7eda9afb0600..078c318e2f16 100644
--- a/mm/filemap.c
+++ b/mm/filemap.c
@@ -2328,29 +2328,29 @@ static int filemap_read_folio(struct file *file, struct address_space *mapping,
}
static bool filemap_range_uptodate(struct address_space *mapping,
- loff_t pos, struct iov_iter *iter, struct page *page)
+ loff_t pos, struct iov_iter *iter, struct folio *folio)
{
int count;
- if (PageUptodate(page))
+ if (folio_test_uptodate(folio))
return true;
/* pipes can't handle partially uptodate pages */
if (iov_iter_is_pipe(iter))
return false;
if (!mapping->a_ops->is_partially_uptodate)
return false;
- if (mapping->host->i_blkbits >= (PAGE_SHIFT + thp_order(page)))
+ if (mapping->host->i_blkbits >= (folio_shift(folio)))
return false;
count = iter->count;
- if (page_offset(page) > pos) {
- count -= page_offset(page) - pos;
+ if (folio_pos(folio) > pos) {
+ count -= folio_pos(folio) - pos;
pos = 0;
} else {
- pos -= page_offset(page);
+ pos -= folio_pos(folio);
}
- return mapping->a_ops->is_partially_uptodate(page, pos, count);
+ return mapping->a_ops->is_partially_uptodate(&folio->page, pos, count);
}
static int filemap_update_page(struct kiocb *iocb,
@@ -2376,7 +2376,7 @@ static int filemap_update_page(struct kiocb *iocb,
goto truncated;
error = 0;
- if (filemap_range_uptodate(mapping, iocb->ki_pos, iter, &folio->page))
+ if (filemap_range_uptodate(mapping, iocb->ki_pos, iter, folio))
goto unlock;
error = -EAGAIN;
--
2.30.2
Instead of converting back-and-forth between the actual page and
the head page, just convert once at the end of the function where we
set the vmf->page. Saves 241 bytes of text, or 15% of the size of
filemap_fault().
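The tail of the function (not shown in this excerpt) then presumably
boils down to a single conversion once we hold a locked, uptodate folio,
along the lines of:

    vmf->page = folio_file_page(folio, index);
    return ret | VM_FAULT_LOCKED;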
Signed-off-by: Matthew Wilcox (Oracle) <[email protected]>
---
mm/filemap.c | 78 +++++++++++++++++++++++++---------------------------
1 file changed, 38 insertions(+), 40 deletions(-)
diff --git a/mm/filemap.c b/mm/filemap.c
index 078c318e2f16..b0fe3234a20b 100644
--- a/mm/filemap.c
+++ b/mm/filemap.c
@@ -2801,21 +2801,20 @@ loff_t mapping_seek_hole_data(struct address_space *mapping, loff_t start,
#ifdef CONFIG_MMU
#define MMAP_LOTSAMISS (100)
/*
- * lock_page_maybe_drop_mmap - lock the page, possibly dropping the mmap_lock
+ * lock_folio_maybe_drop_mmap - lock the folio, possibly dropping the mmap_lock
* @vmf - the vm_fault for this fault.
- * @page - the page to lock.
+ * @folio - the folio to lock.
* @fpin - the pointer to the file we may pin (or is already pinned).
*
- * This works similar to lock_page_or_retry in that it can drop the mmap_lock.
- * It differs in that it actually returns the page locked if it returns 1 and 0
- * if it couldn't lock the page. If we did have to drop the mmap_lock then fpin
- * will point to the pinned file and needs to be fput()'ed at a later point.
+ * This works similar to lock_folio_or_retry in that it can drop the
+ * mmap_lock. It differs in that it actually returns the folio locked
+ * if it returns 1 and 0 if it couldn't lock the folio. If we did have
+ * to drop the mmap_lock then fpin will point to the pinned file and
+ * needs to be fput()'ed at a later point.
*/
-static int lock_page_maybe_drop_mmap(struct vm_fault *vmf, struct page *page,
+static int lock_folio_maybe_drop_mmap(struct vm_fault *vmf, struct folio *folio,
struct file **fpin)
{
- struct folio *folio = page_folio(page);
-
if (folio_trylock(folio))
return 1;
@@ -2904,7 +2903,7 @@ static struct file *do_sync_mmap_readahead(struct vm_fault *vmf)
* was pinned if we have to drop the mmap_lock in order to do IO.
*/
static struct file *do_async_mmap_readahead(struct vm_fault *vmf,
- struct page *page)
+ struct folio *folio)
{
struct file *file = vmf->vma->vm_file;
struct file_ra_state *ra = &file->f_ra;
@@ -2919,10 +2918,10 @@ static struct file *do_async_mmap_readahead(struct vm_fault *vmf,
mmap_miss = READ_ONCE(ra->mmap_miss);
if (mmap_miss)
WRITE_ONCE(ra->mmap_miss, --mmap_miss);
- if (PageReadahead(page)) {
+ if (folio_test_readahead(folio)) {
fpin = maybe_unlock_mmap_for_io(vmf, fpin);
page_cache_async_readahead(mapping, ra, file,
- page, offset, ra->ra_pages);
+ &folio->page, offset, ra->ra_pages);
}
return fpin;
}
@@ -2941,7 +2940,7 @@ static struct file *do_async_mmap_readahead(struct vm_fault *vmf,
* vma->vm_mm->mmap_lock must be held on entry.
*
* If our return value has VM_FAULT_RETRY set, it's because the mmap_lock
- * may be dropped before doing I/O or by lock_page_maybe_drop_mmap().
+ * may be dropped before doing I/O or by lock_folio_maybe_drop_mmap().
*
* If our return value does not have VM_FAULT_RETRY set, the mmap_lock
* has not been released.
@@ -2957,58 +2956,57 @@ vm_fault_t filemap_fault(struct vm_fault *vmf)
struct file *fpin = NULL;
struct address_space *mapping = file->f_mapping;
struct inode *inode = mapping->host;
- pgoff_t offset = vmf->pgoff;
- pgoff_t max_off;
- struct page *page;
+ pgoff_t max_idx, index = vmf->pgoff;
+ struct folio *folio;
vm_fault_t ret = 0;
- max_off = DIV_ROUND_UP(i_size_read(inode), PAGE_SIZE);
- if (unlikely(offset >= max_off))
+ max_idx = DIV_ROUND_UP(i_size_read(inode), PAGE_SIZE);
+ if (unlikely(index >= max_idx))
return VM_FAULT_SIGBUS;
/*
* Do we have something in the page cache already?
*/
- page = find_get_page(mapping, offset);
- if (likely(page) && !(vmf->flags & FAULT_FLAG_TRIED)) {
+ folio = filemap_get_folio(mapping, index);
+ if (likely(folio) && !(vmf->flags & FAULT_FLAG_TRIED)) {
/*
* We found the page, so try async readahead before
* waiting for the lock.
*/
- fpin = do_async_mmap_readahead(vmf, page);
- } else if (!page) {
+ fpin = do_async_mmap_readahead(vmf, folio);
+ } else if (!folio) {
/* No page in the page cache at all */
count_vm_event(PGMAJFAULT);
count_memcg_event_mm(vmf->vma->vm_mm, PGMAJFAULT);
ret = VM_FAULT_MAJOR;
fpin = do_sync_mmap_readahead(vmf);
retry_find:
- page = pagecache_get_page(mapping, offset,
+ folio = __filemap_get_folio(mapping, index,
FGP_CREAT|FGP_FOR_MMAP,
vmf->gfp_mask);
- if (!page) {
+ if (!folio) {
if (fpin)
goto out_retry;
return VM_FAULT_OOM;
}
}
- if (!lock_page_maybe_drop_mmap(vmf, page, &fpin))
+ if (!lock_folio_maybe_drop_mmap(vmf, folio, &fpin))
goto out_retry;
/* Did it get truncated? */
- if (unlikely(compound_head(page)->mapping != mapping)) {
- unlock_page(page);
- put_page(page);
+ if (unlikely(folio->mapping != mapping)) {
+ folio_unlock(folio);
+ folio_put(folio);
goto retry_find;
}
- VM_BUG_ON_PAGE(page_to_pgoff(page) != offset, page);
+ VM_BUG_ON_FOLIO(!folio_contains(folio, index), folio);
/*
* We have a locked page in the page cache, now we need to check
* that it's up-to-date. If not, it is going to be due to an error.
*/
- if (unlikely(!PageUptodate(page)))
+ if (unlikely(!folio_test_uptodate(folio)))
goto page_not_uptodate;
/*
@@ -3017,7 +3015,7 @@ vm_fault_t filemap_fault(struct vm_fault *vmf)
* redo the fault.
*/
if (fpin) {
- unlock_page(page);
+ folio_unlock(folio);
goto out_retry;
}
@@ -3025,14 +3023,14 @@ vm_fault_t filemap_fault(struct vm_fault *vmf)
* Found the page and have a reference on it.
* We must recheck i_size under page lock.
*/
- max_off = DIV_ROUND_UP(i_size_read(inode), PAGE_SIZE);
- if (unlikely(offset >= max_off)) {
- unlock_page(page);
- put_page(page);
+ max_idx = DIV_ROUND_UP(i_size_read(inode), PAGE_SIZE);
+ if (unlikely(index >= max_idx)) {
+ folio_unlock(folio);
+ folio_put(folio);
return VM_FAULT_SIGBUS;
}
- vmf->page = page;
+ vmf->page = folio_file_page(folio, index);
return ret | VM_FAULT_LOCKED;
page_not_uptodate:
@@ -3043,10 +3041,10 @@ vm_fault_t filemap_fault(struct vm_fault *vmf)
* and we need to check for errors.
*/
fpin = maybe_unlock_mmap_for_io(vmf, fpin);
- error = filemap_read_folio(file, mapping, page_folio(page));
+ error = filemap_read_folio(file, mapping, folio);
if (fpin)
goto out_retry;
- put_page(page);
+ folio_put(folio);
if (!error || error == AOP_TRUNCATED_PAGE)
goto retry_find;
@@ -3059,8 +3057,8 @@ vm_fault_t filemap_fault(struct vm_fault *vmf)
* re-find the vma and come back and find our hopefully still populated
* page.
*/
- if (page)
- put_page(page);
+ if (folio)
+ folio_put(folio);
if (fpin)
fput(fpin);
return ret | VM_FAULT_RETRY;
--
2.30.2
We currently store order-N THPs as 2^N consecutive entries. While this
consumes rather more memory than necessary, it also turns out to be buggy.
A writeback operation which starts in the middle of a dirty THP will not
notice it is dirty, as the dirty bit is only set on the head index. With
multi-index
entries, the dirty bit will be found no matter where in the THP the
iteration starts.
This does end up simplifying the page cache slightly, although not as
much as I had hoped.
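To illustrate (this snippet is not part of the diff below), storing a
folio as a single multi-index entry follows the same pattern the shmem
and khugepaged hunks use; error handling and the xas_nomem() retry loop
are omitted, and 'folio', 'index' and 'order' stand in for the caller's
context:

        /* One entry covers all 1 << order indices of the folio. */
        XA_STATE_ORDER(xas, &mapping->i_pages, index, order);

        xas_lock_irq(&xas);
        xas_store(&xas, folio);
        xas_unlock_irq(&xas);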
Signed-off-by: Matthew Wilcox (Oracle) <[email protected]>
---
include/linux/pagemap.h | 10 -------
mm/filemap.c | 63 +++++++++++++++++++++++++----------------
mm/huge_memory.c | 20 ++++++++++---
mm/khugepaged.c | 12 +++++++-
mm/migrate.c | 8 ------
mm/shmem.c | 11 ++-----
6 files changed, 68 insertions(+), 56 deletions(-)
diff --git a/include/linux/pagemap.h b/include/linux/pagemap.h
index bf8e978a48f2..25b1bf3b1cdb 100644
--- a/include/linux/pagemap.h
+++ b/include/linux/pagemap.h
@@ -1078,16 +1078,6 @@ static inline unsigned int __readahead_batch(struct readahead_control *rac,
VM_BUG_ON_PAGE(PageTail(page), page);
array[i++] = page;
rac->_batch_count += thp_nr_pages(page);
-
- /*
- * The page cache isn't using multi-index entries yet,
- * so the xas cursor needs to be manually moved to the
- * next index. This can be removed once the page cache
- * is converted.
- */
- if (PageHead(page))
- xas_set(&xas, rac->_index + rac->_batch_count);
-
if (i == array_sz)
break;
}
diff --git a/mm/filemap.c b/mm/filemap.c
index 20434d7bdad8..97d17e8c76aa 100644
--- a/mm/filemap.c
+++ b/mm/filemap.c
@@ -134,7 +134,6 @@ static void page_cache_delete(struct address_space *mapping,
}
VM_BUG_ON_FOLIO(!folio_test_locked(folio), folio);
- VM_BUG_ON_FOLIO(nr != 1 && shadow, folio);
xas_store(&xas, shadow);
xas_init_marks(&xas);
@@ -276,8 +275,7 @@ void filemap_remove_folio(struct folio *folio)
* from the mapping. The function expects @pvec to be sorted by page index
* and is optimised for it to be dense.
* It tolerates holes in @pvec (mapping entries at those indices are not
- * modified). The function expects only THP head pages to be present in the
- * @pvec.
+ * modified). The function expects only folios to be present in the @pvec.
*
* The function expects the i_pages lock to be held.
*/
@@ -312,20 +310,12 @@ static void page_cache_delete_batch(struct address_space *mapping,
WARN_ON_ONCE(!folio_test_locked(folio));
- if (folio->index == xas.xa_index)
- folio->mapping = NULL;
- /* Leave page->index set: truncation lookup relies on it */
+ folio->mapping = NULL;
+ /* Leave folio->index set: truncation lookup relies on it */
- /*
- * Move to the next page in the vector if this is a regular
- * page or the index is of the last sub-page of this compound
- * page.
- */
- if (folio->index + folio_nr_pages(folio) - 1 ==
- xas.xa_index)
- i++;
+ i++;
xas_store(&xas, NULL);
- total_pages++;
+ total_pages += folio_nr_pages(folio);
}
mapping->nrpages -= total_pages;
}
@@ -2027,24 +2017,27 @@ unsigned find_lock_entries(struct address_space *mapping, pgoff_t start,
indices[pvec->nr] = xas.xa_index;
if (!pagevec_add(pvec, &folio->page))
break;
- goto next;
+ continue;
unlock:
folio_unlock(folio);
put:
folio_put(folio);
-next:
- if (!xa_is_value(folio) && folio_multi(folio)) {
- xas_set(&xas, folio->index + folio_nr_pages(folio));
- /* Did we wrap on 32-bit? */
- if (!xas.xa_index)
- break;
- }
}
rcu_read_unlock();
return pagevec_count(pvec);
}
+static inline
+bool folio_more_pages(struct folio *folio, pgoff_t index, pgoff_t max)
+{
+ if (folio_single(folio) || folio_test_hugetlb(folio))
+ return false;
+ if (index >= max)
+ return false;
+ return index < folio->index + folio_nr_pages(folio) - 1;
+}
+
/**
* find_get_pages_range - gang pagecache lookup
* @mapping: The address_space to search
@@ -2083,11 +2076,17 @@ unsigned find_get_pages_range(struct address_space *mapping, pgoff_t *start,
if (xa_is_value(folio))
continue;
+again:
pages[ret] = folio_file_page(folio, xas.xa_index);
if (++ret == nr_pages) {
*start = xas.xa_index + 1;
goto out;
}
+ if (folio_more_pages(folio, xas.xa_index, end)) {
+ xas.xa_index++;
+ folio_ref_inc(folio);
+ goto again;
+ }
}
/*
@@ -2145,9 +2144,15 @@ unsigned find_get_pages_contig(struct address_space *mapping, pgoff_t index,
if (unlikely(folio != xas_reload(&xas)))
goto put_page;
- pages[ret] = &folio->page;
+again:
+ pages[ret] = folio_file_page(folio, xas.xa_index);
if (++ret == nr_pages)
break;
+ if (folio_more_pages(folio, xas.xa_index, ULONG_MAX)) {
+ xas.xa_index++;
+ folio_ref_inc(folio);
+ goto again;
+ }
continue;
put_page:
folio_put(folio);
@@ -3169,6 +3174,7 @@ vm_fault_t filemap_map_pages(struct vm_fault *vmf,
addr = vma->vm_start + ((start_pgoff - vma->vm_pgoff) << PAGE_SHIFT);
vmf->pte = pte_offset_map_lock(vma->vm_mm, vmf->pmd, addr, &vmf->ptl);
do {
+again:
page = folio_file_page(folio, xas.xa_index);
if (PageHWPoison(page))
goto unlock;
@@ -3190,9 +3196,18 @@ vm_fault_t filemap_map_pages(struct vm_fault *vmf,
do_set_pte(vmf, page, addr);
/* no need to invalidate: a not-present page won't be cached */
update_mmu_cache(vma, addr, vmf->pte);
+ if (folio_more_pages(folio, xas.xa_index, end_pgoff)) {
+ xas.xa_index++;
+ folio_ref_inc(folio);
+ goto again;
+ }
folio_unlock(folio);
continue;
unlock:
+ if (folio_more_pages(folio, xas.xa_index, end_pgoff)) {
+ xas.xa_index++;
+ goto again;
+ }
folio_unlock(folio);
folio_put(folio);
} while ((folio = next_map_page(mapping, &xas, end_pgoff)) != NULL);
diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index 763bf687ca92..7ea0052172a8 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -2638,6 +2638,7 @@ int split_huge_page_to_list(struct page *page, struct list_head *list)
{
struct page *head = compound_head(page);
struct deferred_split *ds_queue = get_deferred_split_queue(head);
+ XA_STATE(xas, &head->mapping->i_pages, head->index);
struct anon_vma *anon_vma = NULL;
struct address_space *mapping = NULL;
int extra_pins, ret;
@@ -2700,18 +2701,27 @@ int split_huge_page_to_list(struct page *page, struct list_head *list)
unmap_page(head);
+ if (mapping) {
+ xas_split_alloc(&xas, head, compound_order(head),
+ mapping_gfp_mask(mapping) & GFP_RECLAIM_MASK);
+ if (xas_error(&xas)) {
+ ret = xas_error(&xas);
+ goto out_unlock;
+ }
+ }
+
/* block interrupt reentry in xa_lock and spinlock */
local_irq_disable();
if (mapping) {
- XA_STATE(xas, &mapping->i_pages, page_index(head));
-
/*
* Check if the head page is present in page cache.
* We assume all tail are present too, if head is there.
*/
- xa_lock(&mapping->i_pages);
+ xas_lock(&xas);
+ xas_reset(&xas);
if (xas_load(&xas) != head)
goto fail;
+ xas_split(&xas, head, thp_order(head));
}
/* Prevent deferred_split_scan() touching ->_refcount */
@@ -2739,7 +2749,7 @@ int split_huge_page_to_list(struct page *page, struct list_head *list)
spin_unlock(&ds_queue->split_queue_lock);
fail:
if (mapping)
- xa_unlock(&mapping->i_pages);
+ xas_unlock(&xas);
local_irq_enable();
remap_page(head, thp_nr_pages(head));
ret = -EBUSY;
@@ -2753,6 +2763,8 @@ int split_huge_page_to_list(struct page *page, struct list_head *list)
if (mapping)
i_mmap_unlock_read(mapping);
out:
+ /* Free any memory we didn't use */
+ xas_nomem(&xas, 0);
count_vm_event(!ret ? THP_SPLIT_PAGE : THP_SPLIT_PAGE_FAILED);
return ret;
}
diff --git a/mm/khugepaged.c b/mm/khugepaged.c
index 6b9c98ddcd09..949b583f22c0 100644
--- a/mm/khugepaged.c
+++ b/mm/khugepaged.c
@@ -1664,7 +1664,10 @@ static void collapse_file(struct mm_struct *mm,
}
count_memcg_page_event(new_page, THP_COLLAPSE_ALLOC);
- /* This will be less messy when we use multi-index entries */
+ /*
+ * Ensure we have slots for all the pages in the range. This is
+ * almost certainly a no-op because most of the pages must be present
+ */
do {
xas_lock_irq(&xas);
xas_create_range(&xas);
@@ -1884,6 +1887,9 @@ static void collapse_file(struct mm_struct *mm,
__mod_lruvec_page_state(new_page, NR_SHMEM, nr_none);
}
+ /* Join all the small entries into a single multi-index entry */
+ xas_set_order(&xas, start, HPAGE_PMD_ORDER);
+ xas_store(&xas, new_page);
xa_locked:
xas_unlock_irq(&xas);
xa_unlocked:
@@ -2005,6 +2011,10 @@ static void khugepaged_scan_file(struct mm_struct *mm,
continue;
}
+ /*
+ * XXX: khugepaged should compact smaller compound pages
+ * into a PMD sized page
+ */
if (PageTransCompound(page)) {
result = SCAN_PAGE_COMPOUND;
break;
diff --git a/mm/migrate.c b/mm/migrate.c
index 36cdae0a1235..029b592a0066 100644
--- a/mm/migrate.c
+++ b/mm/migrate.c
@@ -439,14 +439,6 @@ int folio_migrate_mapping(struct address_space *mapping,
}
xas_store(&xas, newfolio);
- if (nr > 1) {
- int i;
-
- for (i = 1; i < nr; i++) {
- xas_next(&xas);
- xas_store(&xas, newfolio);
- }
- }
/*
* Drop cache reference from old page by unfreezing
diff --git a/mm/shmem.c b/mm/shmem.c
index 337680a01f2a..bdfa60416d68 100644
--- a/mm/shmem.c
+++ b/mm/shmem.c
@@ -670,7 +670,6 @@ static int shmem_add_to_page_cache(struct page *page,
struct mm_struct *charge_mm)
{
XA_STATE_ORDER(xas, &mapping->i_pages, index, compound_order(page));
- unsigned long i = 0;
unsigned long nr = compound_nr(page);
int error;
@@ -700,17 +699,11 @@ static int shmem_add_to_page_cache(struct page *page,
void *entry;
xas_lock_irq(&xas);
entry = xas_find_conflict(&xas);
- if (entry != expected)
+ if (entry != expected) {
xas_set_err(&xas, -EEXIST);
- xas_create_range(&xas);
- if (xas_error(&xas))
goto unlock;
-next:
- xas_store(&xas, page);
- if (++i < nr) {
- xas_next(&xas);
- goto next;
}
+ xas_store(&xas, page);
if (PageTransHuge(page)) {
count_vm_event(THP_FILE_ALLOC);
__mod_lruvec_page_state(page, NR_SHMEM_THPS, nr);
--
2.30.2
Transform page_mkclean() into folio_mkclean() and add a page_mkclean()
wrapper around folio_mkclean().
folio_mkclean() is 15 bytes smaller than page_mkclean(), but the kernel
is enlarged by 33 bytes due to inlining page_folio() into each caller.
This will go away once the callers are converted to use folio_mkclean().
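For illustration only (no caller is converted in this patch), a caller
which already has the folio would do:

        struct folio *folio = page_folio(page);
        int cleaned;

        /* Like page_mkclean(), this requires the folio to be locked. */
        cleaned = folio_mkclean(folio);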
Signed-off-by: Matthew Wilcox (Oracle) <[email protected]>
Reviewed-by: Christoph Hellwig <[email protected]>
---
include/linux/rmap.h | 10 ++++++----
mm/rmap.c | 12 ++++++------
2 files changed, 12 insertions(+), 10 deletions(-)
diff --git a/include/linux/rmap.h b/include/linux/rmap.h
index 83fb86133fe1..d45584310cde 100644
--- a/include/linux/rmap.h
+++ b/include/linux/rmap.h
@@ -235,7 +235,7 @@ unsigned long page_address_in_vma(struct page *, struct vm_area_struct *);
*
* returns the number of cleaned PTEs.
*/
-int page_mkclean(struct page *);
+int folio_mkclean(struct folio *);
/*
* called in munlock()/munmap() path to check for other vmas holding
@@ -293,12 +293,14 @@ static inline int page_referenced(struct page *page, int is_locked,
#define try_to_unmap(page, refs) false
-static inline int page_mkclean(struct page *page)
+static inline int folio_mkclean(struct folio *folio)
{
return 0;
}
-
-
#endif /* CONFIG_MMU */
+static inline int page_mkclean(struct page *page)
+{
+ return folio_mkclean(page_folio(page));
+}
#endif /* _LINUX_RMAP_H */
diff --git a/mm/rmap.c b/mm/rmap.c
index 1df8683c4c4c..b3aae8eeaeaf 100644
--- a/mm/rmap.c
+++ b/mm/rmap.c
@@ -980,7 +980,7 @@ static bool invalid_mkclean_vma(struct vm_area_struct *vma, void *arg)
return true;
}
-int page_mkclean(struct page *page)
+int folio_mkclean(struct folio *folio)
{
int cleaned = 0;
struct address_space *mapping;
@@ -990,20 +990,20 @@ int page_mkclean(struct page *page)
.invalid_vma = invalid_mkclean_vma,
};
- BUG_ON(!PageLocked(page));
+ BUG_ON(!folio_test_locked(folio));
- if (!page_mapped(page))
+ if (!folio_mapped(folio))
return 0;
- mapping = page_mapping(page);
+ mapping = folio_mapping(folio);
if (!mapping)
return 0;
- rmap_walk(page, &rwc);
+ rmap_walk(&folio->page, &rwc);
return cleaned;
}
-EXPORT_SYMBOL_GPL(page_mkclean);
+EXPORT_SYMBOL_GPL(folio_mkclean);
/**
* page_move_anon_rmap - move a page to our anon_vma
--
2.30.2
Convert mark_page_accessed() to folio_mark_accessed(). It already
operated on the entire compound page, but now we can avoid calling
compound_head() quite so many times. Shrinks the function from 424 bytes
to 295 bytes (a saving of 129 bytes). The compatibility wrapper is 30
bytes, plus the 8 bytes for the exported symbol, which means the kernel
shrinks by 91 bytes overall.
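A hypothetical converted caller (none are converted in this patch) looks
like:

        struct folio *folio = page_folio(page);

        /* One compound_head() call here, none hidden in the flag tests. */
        folio_mark_accessed(folio);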
Signed-off-by: Matthew Wilcox (Oracle) <[email protected]>
---
include/linux/swap.h | 3 ++-
mm/folio-compat.c | 7 +++++++
mm/swap.c | 34 ++++++++++++++++------------------
3 files changed, 25 insertions(+), 19 deletions(-)
diff --git a/include/linux/swap.h b/include/linux/swap.h
index 989d8f78c256..c7a4c0a5863d 100644
--- a/include/linux/swap.h
+++ b/include/linux/swap.h
@@ -352,7 +352,8 @@ extern void lru_note_cost(struct lruvec *lruvec, bool file,
unsigned int nr_pages);
extern void lru_note_cost_page(struct page *);
extern void lru_cache_add(struct page *);
-extern void mark_page_accessed(struct page *);
+void mark_page_accessed(struct page *);
+void folio_mark_accessed(struct folio *);
extern atomic_t lru_disable_count;
diff --git a/mm/folio-compat.c b/mm/folio-compat.c
index 7044fcc8a8aa..a374747ae1c6 100644
--- a/mm/folio-compat.c
+++ b/mm/folio-compat.c
@@ -5,6 +5,7 @@
*/
#include <linux/pagemap.h>
+#include <linux/swap.h>
struct address_space *page_mapping(struct page *page)
{
@@ -41,3 +42,9 @@ bool page_mapped(struct page *page)
return folio_mapped(page_folio(page));
}
EXPORT_SYMBOL(page_mapped);
+
+void mark_page_accessed(struct page *page)
+{
+ folio_mark_accessed(page_folio(page));
+}
+EXPORT_SYMBOL(mark_page_accessed);
diff --git a/mm/swap.c b/mm/swap.c
index c3137e4e1cd8..d32007fe23b3 100644
--- a/mm/swap.c
+++ b/mm/swap.c
@@ -390,7 +390,7 @@ static void folio_activate(struct folio *folio)
}
#endif
-static void __lru_cache_activate_page(struct page *page)
+static void __lru_cache_activate_folio(struct folio *folio)
{
struct pagevec *pvec;
int i;
@@ -411,8 +411,8 @@ static void __lru_cache_activate_page(struct page *page)
for (i = pagevec_count(pvec) - 1; i >= 0; i--) {
struct page *pagevec_page = pvec->pages[i];
- if (pagevec_page == page) {
- SetPageActive(page);
+ if (pagevec_page == &folio->page) {
+ folio_set_active(folio);
break;
}
}
@@ -430,36 +430,34 @@ static void __lru_cache_activate_page(struct page *page)
* When a newly allocated page is not yet visible, so safe for non-atomic ops,
* __SetPageReferenced(page) may be substituted for mark_page_accessed(page).
*/
-void mark_page_accessed(struct page *page)
+void folio_mark_accessed(struct folio *folio)
{
- page = compound_head(page);
-
- if (!PageReferenced(page)) {
- SetPageReferenced(page);
- } else if (PageUnevictable(page)) {
+ if (!folio_test_referenced(folio)) {
+ folio_set_referenced(folio);
+ } else if (folio_test_unevictable(folio)) {
/*
* Unevictable pages are on the "LRU_UNEVICTABLE" list. But,
* this list is never rotated or maintained, so marking an
* evictable page accessed has no effect.
*/
- } else if (!PageActive(page)) {
+ } else if (!folio_test_active(folio)) {
/*
* If the page is on the LRU, queue it for activation via
* lru_pvecs.activate_page. Otherwise, assume the page is on a
* pagevec, mark it active and it'll be moved to the active
* LRU on the next drain.
*/
- if (PageLRU(page))
- folio_activate(page_folio(page));
+ if (folio_test_lru(folio))
+ folio_activate(folio);
else
- __lru_cache_activate_page(page);
- ClearPageReferenced(page);
- workingset_activation(page_folio(page));
+ __lru_cache_activate_folio(folio);
+ folio_clear_referenced(folio);
+ workingset_activation(folio);
}
- if (page_is_idle(page))
- clear_page_idle(page);
+ if (folio_test_idle(folio))
+ folio_clear_idle(folio);
}
-EXPORT_SYMBOL(mark_page_accessed);
+EXPORT_SYMBOL(folio_mark_accessed);
/**
* lru_cache_add - add a page to a page list
--
2.30.2
Allow for accounting N pages at once instead of one page at a time.
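For example, ending writeback on a whole folio can then be accounted in
a single call, as a later patch in this series does (sketch;
wb_stat_mod() is the helper used by those later patches):

        long nr = folio_nr_pages(folio);

        wb_stat_mod(wb, WB_WRITEBACK, -nr);
        __wb_writeout_add(wb, nr);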
Signed-off-by: Matthew Wilcox (Oracle) <[email protected]>
Reviewed-by: Christoph Hellwig <[email protected]>
Reviewed-by: Jan Kara <[email protected]>
---
mm/page-writeback.c | 22 +++++++++++-----------
1 file changed, 11 insertions(+), 11 deletions(-)
diff --git a/mm/page-writeback.c b/mm/page-writeback.c
index f55f2ebdd9a9..e542ea37d605 100644
--- a/mm/page-writeback.c
+++ b/mm/page-writeback.c
@@ -562,12 +562,12 @@ static unsigned long wp_next_time(unsigned long cur_time)
return cur_time;
}
-static void wb_domain_writeout_inc(struct wb_domain *dom,
+static void wb_domain_writeout_add(struct wb_domain *dom,
struct fprop_local_percpu *completions,
- unsigned int max_prop_frac)
+ unsigned int max_prop_frac, long nr)
{
__fprop_add_percpu_max(&dom->completions, completions,
- max_prop_frac, 1);
+ max_prop_frac, nr);
/* First event after period switching was turned off? */
if (unlikely(!dom->period_time)) {
/*
@@ -585,18 +585,18 @@ static void wb_domain_writeout_inc(struct wb_domain *dom,
* Increment @wb's writeout completion count and the global writeout
* completion count. Called from test_clear_page_writeback().
*/
-static inline void __wb_writeout_inc(struct bdi_writeback *wb)
+static inline void __wb_writeout_add(struct bdi_writeback *wb, long nr)
{
struct wb_domain *cgdom;
- inc_wb_stat(wb, WB_WRITTEN);
- wb_domain_writeout_inc(&global_wb_domain, &wb->completions,
- wb->bdi->max_prop_frac);
+ wb_stat_mod(wb, WB_WRITTEN, nr);
+ wb_domain_writeout_add(&global_wb_domain, &wb->completions,
+ wb->bdi->max_prop_frac, nr);
cgdom = mem_cgroup_wb_domain(wb);
if (cgdom)
- wb_domain_writeout_inc(cgdom, wb_memcg_completions(wb),
- wb->bdi->max_prop_frac);
+ wb_domain_writeout_add(cgdom, wb_memcg_completions(wb),
+ wb->bdi->max_prop_frac, nr);
}
void wb_writeout_inc(struct bdi_writeback *wb)
@@ -604,7 +604,7 @@ void wb_writeout_inc(struct bdi_writeback *wb)
unsigned long flags;
local_irq_save(flags);
- __wb_writeout_inc(wb);
+ __wb_writeout_add(wb, 1);
local_irq_restore(flags);
}
EXPORT_SYMBOL_GPL(wb_writeout_inc);
@@ -2751,7 +2751,7 @@ int test_clear_page_writeback(struct page *page)
struct bdi_writeback *wb = inode_to_wb(inode);
dec_wb_stat(wb, WB_WRITEBACK);
- __wb_writeout_inc(wb);
+ __wb_writeout_add(wb, 1);
}
}
--
2.30.2
Rename set_page_writeback() to folio_start_writeback() to match
folio_end_writeback(). Do not bother with wrappers that return void;
callers are perfectly capable of ignoring return values.
Add wrappers for set_page_writeback(), set_page_writeback_keepwrite() and
test_set_page_writeback() for compatibility with existing filesystems.
The main advantage of this patch is getting the statistics right,
although it does eliminate a couple of calls to compound_head().
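For illustration, a filesystem already converted to folios would write
something like this (a sketch, not taken from a real conversion):

        folio_start_writeback(folio);
        /* ... submit I/O for the whole folio ... */

        /* and from the I/O completion path: */
        folio_end_writeback(folio);

Unconverted filesystems keep using set_page_writeback() and friends
through the compatibility wrappers.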
Signed-off-by: Matthew Wilcox (Oracle) <[email protected]>
Reviewed-by: Christoph Hellwig <[email protected]>
---
include/linux/page-flags.h | 19 +++++++++---------
mm/folio-compat.c | 6 ++++++
mm/page-writeback.c | 40 ++++++++++++++++++++------------------
3 files changed, 37 insertions(+), 28 deletions(-)
diff --git a/include/linux/page-flags.h b/include/linux/page-flags.h
index 6f9d1f26b1ef..54c4af35c628 100644
--- a/include/linux/page-flags.h
+++ b/include/linux/page-flags.h
@@ -655,21 +655,22 @@ static __always_inline void SetPageUptodate(struct page *page)
CLEARPAGEFLAG(Uptodate, uptodate, PF_NO_TAIL)
-int __test_set_page_writeback(struct page *page, bool keep_write);
+bool __folio_start_writeback(struct folio *folio, bool keep_write);
+bool set_page_writeback(struct page *page);
-#define test_set_page_writeback(page) \
- __test_set_page_writeback(page, false)
-#define test_set_page_writeback_keepwrite(page) \
- __test_set_page_writeback(page, true)
+#define folio_start_writeback(folio) \
+ __folio_start_writeback(folio, false)
+#define folio_start_writeback_keepwrite(folio) \
+ __folio_start_writeback(folio, true)
-static inline void set_page_writeback(struct page *page)
+static inline void set_page_writeback_keepwrite(struct page *page)
{
- test_set_page_writeback(page);
+ folio_start_writeback_keepwrite(page_folio(page));
}
-static inline void set_page_writeback_keepwrite(struct page *page)
+static inline bool test_set_page_writeback(struct page *page)
{
- test_set_page_writeback_keepwrite(page);
+ return set_page_writeback(page);
}
__PAGEFLAG(Head, head, PF_ANY) CLEARPAGEFLAG(Head, head, PF_ANY)
diff --git a/mm/folio-compat.c b/mm/folio-compat.c
index 2ccd8f213fc4..10ce5582d869 100644
--- a/mm/folio-compat.c
+++ b/mm/folio-compat.c
@@ -71,3 +71,9 @@ void migrate_page_copy(struct page *newpage, struct page *page)
}
EXPORT_SYMBOL(migrate_page_copy);
#endif
+
+bool set_page_writeback(struct page *page)
+{
+ return folio_start_writeback(page_folio(page));
+}
+EXPORT_SYMBOL(set_page_writeback);
diff --git a/mm/page-writeback.c b/mm/page-writeback.c
index 8d5d7921b157..0336273154fb 100644
--- a/mm/page-writeback.c
+++ b/mm/page-writeback.c
@@ -2773,21 +2773,23 @@ bool __folio_end_writeback(struct folio *folio)
return ret;
}
-int __test_set_page_writeback(struct page *page, bool keep_write)
+bool __folio_start_writeback(struct folio *folio, bool keep_write)
{
- struct address_space *mapping = page_mapping(page);
- int ret, access_ret;
+ long nr = folio_nr_pages(folio);
+ struct address_space *mapping = folio_mapping(folio);
+ bool ret;
+ int access_ret;
- lock_page_memcg(page);
+ folio_memcg_lock(folio);
if (mapping && mapping_use_writeback_tags(mapping)) {
- XA_STATE(xas, &mapping->i_pages, page_index(page));
+ XA_STATE(xas, &mapping->i_pages, folio_index(folio));
struct inode *inode = mapping->host;
struct backing_dev_info *bdi = inode_to_bdi(inode);
unsigned long flags;
xas_lock_irqsave(&xas, flags);
xas_load(&xas);
- ret = TestSetPageWriteback(page);
+ ret = folio_test_set_writeback(folio);
if (!ret) {
bool on_wblist;
@@ -2796,40 +2798,40 @@ int __test_set_page_writeback(struct page *page, bool keep_write)
xas_set_mark(&xas, PAGECACHE_TAG_WRITEBACK);
if (bdi->capabilities & BDI_CAP_WRITEBACK_ACCT)
- inc_wb_stat(inode_to_wb(inode), WB_WRITEBACK);
+ wb_stat_mod(inode_to_wb(inode), WB_WRITEBACK,
+ nr);
/*
- * We can come through here when swapping anonymous
- * pages, so we don't necessarily have an inode to track
- * for sync.
+ * We can come through here when swapping
+ * anonymous folios, so we don't necessarily
+ * have an inode to track for sync.
*/
if (mapping->host && !on_wblist)
sb_mark_inode_writeback(mapping->host);
}
- if (!PageDirty(page))
+ if (!folio_test_dirty(folio))
xas_clear_mark(&xas, PAGECACHE_TAG_DIRTY);
if (!keep_write)
xas_clear_mark(&xas, PAGECACHE_TAG_TOWRITE);
xas_unlock_irqrestore(&xas, flags);
} else {
- ret = TestSetPageWriteback(page);
+ ret = folio_test_set_writeback(folio);
}
if (!ret) {
- inc_lruvec_page_state(page, NR_WRITEBACK);
- inc_zone_page_state(page, NR_ZONE_WRITE_PENDING);
+ lruvec_stat_mod_folio(folio, NR_WRITEBACK, nr);
+ zone_stat_mod_folio(folio, NR_ZONE_WRITE_PENDING, nr);
}
- unlock_page_memcg(page);
- access_ret = arch_make_page_accessible(page);
+ folio_memcg_unlock(folio);
+ access_ret = arch_make_folio_accessible(folio);
/*
* If writeback has been triggered on a page that cannot be made
* accessible, it is too late to recover here.
*/
- VM_BUG_ON_PAGE(access_ret != 0, page);
+ VM_BUG_ON_FOLIO(access_ret != 0, folio);
return ret;
-
}
-EXPORT_SYMBOL(__test_set_page_writeback);
+EXPORT_SYMBOL(__folio_start_writeback);
/**
* folio_wait_writeback - Wait for a folio to finish writeback.
--
2.30.2
test_clear_page_writeback() is actually an mm-internal function, although
it's named as if it's a pagecache function. Move it to mm/internal.h,
rename it to __folio_end_writeback() and change the return type to bool.
The conversion from page to folio is mostly about accounting the number
of pages being written back, although it does eliminate a couple of
calls to compound_head().
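Callers don't change: folio_end_writeback(), converted below, still does

        folio_get(folio);
        if (!__folio_end_writeback(folio))
                BUG();

but ending writeback on a multi-page folio now adjusts NR_WRITEBACK and
the writeback domain counters by folio_nr_pages() in a single call
instead of once per page.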
Signed-off-by: Matthew Wilcox (Oracle) <[email protected]>
Reviewed-by: Christoph Hellwig <[email protected]>
---
include/linux/page-flags.h | 1 -
mm/filemap.c | 2 +-
mm/internal.h | 1 +
mm/page-writeback.c | 29 +++++++++++++++--------------
4 files changed, 17 insertions(+), 16 deletions(-)
diff --git a/include/linux/page-flags.h b/include/linux/page-flags.h
index ddb660688086..6f9d1f26b1ef 100644
--- a/include/linux/page-flags.h
+++ b/include/linux/page-flags.h
@@ -655,7 +655,6 @@ static __always_inline void SetPageUptodate(struct page *page)
CLEARPAGEFLAG(Uptodate, uptodate, PF_NO_TAIL)
-int test_clear_page_writeback(struct page *page);
int __test_set_page_writeback(struct page *page, bool keep_write);
#define test_set_page_writeback(page) \
diff --git a/mm/filemap.c b/mm/filemap.c
index 5c4e3185ecb3..a74c69a938ab 100644
--- a/mm/filemap.c
+++ b/mm/filemap.c
@@ -1535,7 +1535,7 @@ void folio_end_writeback(struct folio *folio)
* reused before the folio_wake().
*/
folio_get(folio);
- if (!test_clear_page_writeback(&folio->page))
+ if (!__folio_end_writeback(folio))
BUG();
smp_mb__after_atomic();
diff --git a/mm/internal.h b/mm/internal.h
index fa31a7f0ed79..08e8a28994d1 100644
--- a/mm/internal.h
+++ b/mm/internal.h
@@ -43,6 +43,7 @@ static inline void *folio_raw_mapping(struct folio *folio)
vm_fault_t do_swap_page(struct vm_fault *vmf);
void folio_rotate_reclaimable(struct folio *folio);
+bool __folio_end_writeback(struct folio *folio);
void free_pgtables(struct mmu_gather *tlb, struct vm_area_struct *start_vma,
unsigned long floor, unsigned long ceiling);
diff --git a/mm/page-writeback.c b/mm/page-writeback.c
index e542ea37d605..8d5d7921b157 100644
--- a/mm/page-writeback.c
+++ b/mm/page-writeback.c
@@ -583,7 +583,7 @@ static void wb_domain_writeout_add(struct wb_domain *dom,
/*
* Increment @wb's writeout completion count and the global writeout
- * completion count. Called from test_clear_page_writeback().
+ * completion count. Called from __folio_end_writeback().
*/
static inline void __wb_writeout_add(struct bdi_writeback *wb, long nr)
{
@@ -2731,27 +2731,28 @@ int clear_page_dirty_for_io(struct page *page)
}
EXPORT_SYMBOL(clear_page_dirty_for_io);
-int test_clear_page_writeback(struct page *page)
+bool __folio_end_writeback(struct folio *folio)
{
- struct address_space *mapping = page_mapping(page);
- int ret;
+ long nr = folio_nr_pages(folio);
+ struct address_space *mapping = folio_mapping(folio);
+ bool ret;
- lock_page_memcg(page);
+ folio_memcg_lock(folio);
if (mapping && mapping_use_writeback_tags(mapping)) {
struct inode *inode = mapping->host;
struct backing_dev_info *bdi = inode_to_bdi(inode);
unsigned long flags;
xa_lock_irqsave(&mapping->i_pages, flags);
- ret = TestClearPageWriteback(page);
+ ret = folio_test_clear_writeback(folio);
if (ret) {
- __xa_clear_mark(&mapping->i_pages, page_index(page),
+ __xa_clear_mark(&mapping->i_pages, folio_index(folio),
PAGECACHE_TAG_WRITEBACK);
if (bdi->capabilities & BDI_CAP_WRITEBACK_ACCT) {
struct bdi_writeback *wb = inode_to_wb(inode);
- dec_wb_stat(wb, WB_WRITEBACK);
- __wb_writeout_add(wb, 1);
+ wb_stat_mod(wb, WB_WRITEBACK, -nr);
+ __wb_writeout_add(wb, nr);
}
}
@@ -2761,14 +2762,14 @@ int test_clear_page_writeback(struct page *page)
xa_unlock_irqrestore(&mapping->i_pages, flags);
} else {
- ret = TestClearPageWriteback(page);
+ ret = folio_test_clear_writeback(folio);
}
if (ret) {
- dec_lruvec_page_state(page, NR_WRITEBACK);
- dec_zone_page_state(page, NR_ZONE_WRITE_PENDING);
- inc_node_page_state(page, NR_WRITTEN);
+ lruvec_stat_mod_folio(folio, NR_WRITEBACK, -nr);
+ zone_stat_mod_folio(folio, NR_ZONE_WRITE_PENDING, -nr);
+ node_stat_mod_folio(folio, NR_WRITTEN, nr);
}
- unlock_page_memcg(page);
+ folio_memcg_unlock(folio);
return ret;
}
--
2.30.2
Account the number of pages in the folio that we're redirtying.
Turn account_page_redirty() into a wrapper around it. Also turn
the comment on folio_account_redirty() into kernel-doc and
edit it slightly so it makes sense to its potential callers.
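A hypothetical direct caller (the exact dirtying call depends on the
filesystem) would pair it with re-dirtying the folio:

        /* Re-dirty the folio, then back out the dirtied accounting. */
        filemap_dirty_folio(folio->mapping, folio);
        folio_account_redirty(folio);

Most filesystems should keep calling folio_redirty_for_writepage(), as
the new kernel-doc says.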
Signed-off-by: Matthew Wilcox (Oracle) <[email protected]>
Reviewed-by: Christoph Hellwig <[email protected]>
---
include/linux/writeback.h | 6 +++++-
mm/page-writeback.c | 32 +++++++++++++++++++-------------
2 files changed, 24 insertions(+), 14 deletions(-)
diff --git a/include/linux/writeback.h b/include/linux/writeback.h
index eda9cc778ef6..50cb6e25ab9e 100644
--- a/include/linux/writeback.h
+++ b/include/linux/writeback.h
@@ -399,7 +399,11 @@ void tag_pages_for_writeback(struct address_space *mapping,
pgoff_t start, pgoff_t end);
bool filemap_dirty_folio(struct address_space *mapping, struct folio *folio);
-void account_page_redirty(struct page *page);
+void folio_account_redirty(struct folio *folio);
+static inline void account_page_redirty(struct page *page)
+{
+ folio_account_redirty(page_folio(page));
+}
void sb_mark_inode_writeback(struct inode *inode);
void sb_clear_inode_writeback(struct inode *inode);
diff --git a/mm/page-writeback.c b/mm/page-writeback.c
index 66060bbf6aad..d7bd5580c91e 100644
--- a/mm/page-writeback.c
+++ b/mm/page-writeback.c
@@ -1084,7 +1084,7 @@ static void wb_update_write_bandwidth(struct bdi_writeback *wb,
* write_bandwidth = ---------------------------------------------------
* period
*
- * @written may have decreased due to account_page_redirty().
+ * @written may have decreased due to folio_account_redirty().
* Avoid underflowing @bw calculation.
*/
bw = written - min(written, wb->written_stamp);
@@ -2527,30 +2527,36 @@ bool filemap_dirty_folio(struct address_space *mapping, struct folio *folio)
}
EXPORT_SYMBOL(filemap_dirty_folio);
-/*
- * Call this whenever redirtying a page, to de-account the dirty counters
- * (NR_DIRTIED, WB_DIRTIED, tsk->nr_dirtied), so that they match the written
- * counters (NR_WRITTEN, WB_WRITTEN) in long term. The mismatches will lead to
- * systematic errors in balanced_dirty_ratelimit and the dirty pages position
- * control.
+/**
+ * folio_account_redirty - Manually account for redirtying a page.
+ * @folio: The folio which is being redirtied.
+ *
+ * Most filesystems should call folio_redirty_for_writepage() instead
+ * of this function. If your filesystem is doing writeback outside the
+ * context of a writeback_control(), it can call this when redirtying
+ * a folio, to de-account the dirty counters (NR_DIRTIED, WB_DIRTIED,
+ * tsk->nr_dirtied), so that they match the written counters (NR_WRITTEN,
+ * WB_WRITTEN) in long term. The mismatches will lead to systematic errors
+ * in balanced_dirty_ratelimit and the dirty pages position control.
*/
-void account_page_redirty(struct page *page)
+void folio_account_redirty(struct folio *folio)
{
- struct address_space *mapping = page->mapping;
+ struct address_space *mapping = folio->mapping;
if (mapping && mapping_can_writeback(mapping)) {
struct inode *inode = mapping->host;
struct bdi_writeback *wb;
struct wb_lock_cookie cookie = {};
+ unsigned nr = folio_nr_pages(folio);
wb = unlocked_inode_to_wb_begin(inode, &cookie);
- current->nr_dirtied--;
- dec_node_page_state(page, NR_DIRTIED);
- dec_wb_stat(wb, WB_DIRTIED);
+ current->nr_dirtied -= nr;
+ node_stat_mod_folio(folio, NR_DIRTIED, -nr);
+ wb_stat_mod(wb, WB_DIRTIED, -nr);
unlocked_inode_to_wb_end(inode, &cookie);
}
}
-EXPORT_SYMBOL(account_page_redirty);
+EXPORT_SYMBOL(folio_account_redirty);
/*
* When a writepage implementation decides that it doesn't want to write this
--
2.30.2
This is the folio equivalent of page_evictable(). Unfortunately, it's
different from !folio_test_unevictable(), but I think it's used in places
where you have to be a VM expert and can reasonably be expected to know
the difference.
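A sketch of how it gets used, mirroring the __pagevec_lru_add_fn()
conversion later in this series:

        if (!folio_evictable(folio)) {
                folio_clear_active(folio);
                folio_set_unevictable(folio);
        }
        lruvec_add_folio(lruvec, folio);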
Signed-off-by: Matthew Wilcox (Oracle) <[email protected]>
Reviewed-by: Christoph Hellwig <[email protected]>
---
mm/internal.h | 27 +++++++++++++++++++--------
1 file changed, 19 insertions(+), 8 deletions(-)
diff --git a/mm/internal.h b/mm/internal.h
index 08e8a28994d1..0910efec5821 100644
--- a/mm/internal.h
+++ b/mm/internal.h
@@ -72,17 +72,28 @@ unsigned find_lock_entries(struct address_space *mapping, pgoff_t start,
pgoff_t end, struct pagevec *pvec, pgoff_t *indices);
/**
- * page_evictable - test whether a page is evictable
- * @page: the page to test
+ * folio_evictable - Test whether a folio is evictable.
+ * @folio: The folio to test.
*
- * Test whether page is evictable--i.e., should be placed on active/inactive
- * lists vs unevictable list.
- *
- * Reasons page might not be evictable:
- * (1) page's mapping marked unevictable
- * (2) page is part of an mlocked VMA
+ * Test whether @folio is evictable -- i.e., should be placed on
+ * active/inactive lists vs unevictable list.
*
+ * Reasons folio might not be evictable:
+ * 1. folio's mapping marked unevictable
+ * 2. One of the pages in the folio is part of an mlocked VMA
*/
+static inline bool folio_evictable(struct folio *folio)
+{
+ bool ret;
+
+ /* Prevent address_space of inode and swap cache from being freed */
+ rcu_read_lock();
+ ret = !mapping_unevictable(folio_mapping(folio)) &&
+ !folio_test_mlocked(folio);
+ rcu_read_unlock();
+ return ret;
+}
+
static inline bool page_evictable(struct page *page)
{
bool ret;
--
2.30.2
This saves five calls to compound_head(), totalling 60 bytes of text.
Signed-off-by: Matthew Wilcox (Oracle) <[email protected]>
Reviewed-by: Christoph Hellwig <[email protected]>
---
include/trace/events/pagemap.h | 32 ++++++++++++++++----------------
mm/swap.c | 34 +++++++++++++++++-----------------
2 files changed, 33 insertions(+), 33 deletions(-)
diff --git a/include/trace/events/pagemap.h b/include/trace/events/pagemap.h
index 1fd0185d66e8..171524d3526d 100644
--- a/include/trace/events/pagemap.h
+++ b/include/trace/events/pagemap.h
@@ -16,38 +16,38 @@
#define PAGEMAP_MAPPEDDISK 0x0020u
#define PAGEMAP_BUFFERS 0x0040u
-#define trace_pagemap_flags(page) ( \
- (PageAnon(page) ? PAGEMAP_ANONYMOUS : PAGEMAP_FILE) | \
- (page_mapped(page) ? PAGEMAP_MAPPED : 0) | \
- (PageSwapCache(page) ? PAGEMAP_SWAPCACHE : 0) | \
- (PageSwapBacked(page) ? PAGEMAP_SWAPBACKED : 0) | \
- (PageMappedToDisk(page) ? PAGEMAP_MAPPEDDISK : 0) | \
- (page_has_private(page) ? PAGEMAP_BUFFERS : 0) \
+#define trace_pagemap_flags(folio) ( \
+ (folio_test_anon(folio) ? PAGEMAP_ANONYMOUS : PAGEMAP_FILE) | \
+ (folio_mapped(folio) ? PAGEMAP_MAPPED : 0) | \
+ (folio_test_swapcache(folio) ? PAGEMAP_SWAPCACHE : 0) | \
+ (folio_test_swapbacked(folio) ? PAGEMAP_SWAPBACKED : 0) | \
+ (folio_test_mappedtodisk(folio) ? PAGEMAP_MAPPEDDISK : 0) | \
+ (folio_test_private(folio) ? PAGEMAP_BUFFERS : 0) \
)
TRACE_EVENT(mm_lru_insertion,
- TP_PROTO(struct page *page),
+ TP_PROTO(struct folio *folio),
- TP_ARGS(page),
+ TP_ARGS(folio),
TP_STRUCT__entry(
- __field(struct page *, page )
+ __field(struct folio *, folio )
__field(unsigned long, pfn )
__field(enum lru_list, lru )
__field(unsigned long, flags )
),
TP_fast_assign(
- __entry->page = page;
- __entry->pfn = page_to_pfn(page);
- __entry->lru = folio_lru_list(page_folio(page));
- __entry->flags = trace_pagemap_flags(page);
+ __entry->folio = folio;
+ __entry->pfn = folio_pfn(folio);
+ __entry->lru = folio_lru_list(folio);
+ __entry->flags = trace_pagemap_flags(folio);
),
/* Flag format is based on page-types.c formatting for pagemap */
- TP_printk("page=%p pfn=0x%lx lru=%d flags=%s%s%s%s%s%s",
- __entry->page,
+ TP_printk("folio=%p pfn=0x%lx lru=%d flags=%s%s%s%s%s%s",
+ __entry->folio,
__entry->pfn,
__entry->lru,
__entry->flags & PAGEMAP_MAPPED ? "M" : " ",
diff --git a/mm/swap.c b/mm/swap.c
index 6e80f30d2e5e..89d4471ceb80 100644
--- a/mm/swap.c
+++ b/mm/swap.c
@@ -1001,17 +1001,18 @@ void __pagevec_release(struct pagevec *pvec)
}
EXPORT_SYMBOL(__pagevec_release);
-static void __pagevec_lru_add_fn(struct page *page, struct lruvec *lruvec)
+static void __pagevec_lru_add_fn(struct folio *folio, struct lruvec *lruvec)
{
- int was_unevictable = TestClearPageUnevictable(page);
- int nr_pages = thp_nr_pages(page);
+ int was_unevictable = folio_test_clear_unevictable(folio);
+ int nr_pages = folio_nr_pages(folio);
- VM_BUG_ON_PAGE(PageLRU(page), page);
+ VM_BUG_ON_FOLIO(folio_test_lru(folio), folio);
/*
- * Page becomes evictable in two ways:
+ * Folio becomes evictable in two ways:
* 1) Within LRU lock [munlock_vma_page() and __munlock_pagevec()].
- * 2) Before acquiring LRU lock to put the page to correct LRU and then
+ * 2) Before acquiring LRU lock to put the folio on the correct LRU
+ * and then
* a) do PageLRU check with lock [check_move_unevictable_pages]
* b) do PageLRU check before lock [clear_page_mlock]
*
@@ -1020,10 +1021,10 @@ static void __pagevec_lru_add_fn(struct page *page, struct lruvec *lruvec)
*
* #0: __pagevec_lru_add_fn #1: clear_page_mlock
*
- * SetPageLRU() TestClearPageMlocked()
+ * folio_set_lru() folio_test_clear_mlocked()
* smp_mb() // explicit ordering // above provides strict
* // ordering
- * PageMlocked() PageLRU()
+ * folio_test_mlocked() folio_test_lru()
*
*
* if '#1' does not observe setting of PG_lru by '#0' and fails
@@ -1034,21 +1035,21 @@ static void __pagevec_lru_add_fn(struct page *page, struct lruvec *lruvec)
* looking at the same page) and the evictable page will be stranded
* in an unevictable LRU.
*/
- SetPageLRU(page);
+ folio_set_lru(folio);
smp_mb__after_atomic();
- if (page_evictable(page)) {
+ if (folio_evictable(folio)) {
if (was_unevictable)
__count_vm_events(UNEVICTABLE_PGRESCUED, nr_pages);
} else {
- ClearPageActive(page);
- SetPageUnevictable(page);
+ folio_clear_active(folio);
+ folio_set_unevictable(folio);
if (!was_unevictable)
__count_vm_events(UNEVICTABLE_PGCULLED, nr_pages);
}
- add_page_to_lru_list(page, lruvec);
- trace_mm_lru_insertion(page);
+ lruvec_add_folio(lruvec, folio);
+ trace_mm_lru_insertion(folio);
}
/*
@@ -1062,11 +1063,10 @@ void __pagevec_lru_add(struct pagevec *pvec)
unsigned long flags = 0;
for (i = 0; i < pagevec_count(pvec); i++) {
- struct page *page = pvec->pages[i];
- struct folio *folio = page_folio(page);
+ struct folio *folio = page_folio(pvec->pages[i]);
lruvec = folio_lruvec_relock_irqsave(folio, lruvec, &flags);
- __pagevec_lru_add_fn(page, lruvec);
+ __pagevec_lru_add_fn(folio, lruvec);
}
if (lruvec)
unlock_page_lruvec_irqrestore(lruvec, flags);
--
2.30.2
Reimplement lru_cache_add() as a wrapper around folio_add_lru().
Saves 159 bytes of kernel text due to removing calls to compound_head().
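Callers which already have a folio in hand can call the new function
directly; a hypothetical converted caller is simply:

        /* folio_add_lru() takes its own reference before queueing. */
        folio_add_lru(folio);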
Signed-off-by: Matthew Wilcox (Oracle) <[email protected]>
Reviewed-by: Christoph Hellwig <[email protected]>
---
include/linux/swap.h | 1 +
mm/folio-compat.c | 6 ++++++
mm/swap.c | 22 +++++++++++-----------
3 files changed, 18 insertions(+), 11 deletions(-)
diff --git a/include/linux/swap.h b/include/linux/swap.h
index 5e01675af7ab..81801ba78b1e 100644
--- a/include/linux/swap.h
+++ b/include/linux/swap.h
@@ -351,6 +351,7 @@ extern unsigned long nr_free_buffer_pages(void);
extern void lru_note_cost(struct lruvec *lruvec, bool file,
unsigned int nr_pages);
extern void lru_note_cost_folio(struct folio *);
+extern void folio_add_lru(struct folio *);
extern void lru_cache_add(struct page *);
void mark_page_accessed(struct page *);
void folio_mark_accessed(struct folio *);
diff --git a/mm/folio-compat.c b/mm/folio-compat.c
index c1e01bc36d32..6de3cd78a4ae 100644
--- a/mm/folio-compat.c
+++ b/mm/folio-compat.c
@@ -102,3 +102,9 @@ bool redirty_page_for_writepage(struct writeback_control *wbc,
return folio_redirty_for_writepage(wbc, page_folio(page));
}
EXPORT_SYMBOL(redirty_page_for_writepage);
+
+void lru_cache_add(struct page *page)
+{
+ folio_add_lru(page_folio(page));
+}
+EXPORT_SYMBOL(lru_cache_add);
diff --git a/mm/swap.c b/mm/swap.c
index 89d4471ceb80..6f382abeccf9 100644
--- a/mm/swap.c
+++ b/mm/swap.c
@@ -459,29 +459,29 @@ void folio_mark_accessed(struct folio *folio)
EXPORT_SYMBOL(folio_mark_accessed);
/**
- * lru_cache_add - add a page to a page list
- * @page: the page to be added to the LRU.
+ * folio_add_lru - Add a folio to an LRU list.
+ * @folio: The folio to be added to the LRU.
*
- * Queue the page for addition to the LRU via pagevec. The decision on whether
+ * Queue the folio for addition to the LRU. The decision on whether
* to add the page to the [in]active [file|anon] list is deferred until the
- * pagevec is drained. This gives a chance for the caller of lru_cache_add()
- * have the page added to the active list using mark_page_accessed().
+ * pagevec is drained. This gives the caller of folio_add_lru() a chance
+ * to have the folio added to the active list using folio_mark_accessed().
*/
-void lru_cache_add(struct page *page)
+void folio_add_lru(struct folio *folio)
{
struct pagevec *pvec;
- VM_BUG_ON_PAGE(PageActive(page) && PageUnevictable(page), page);
- VM_BUG_ON_PAGE(PageLRU(page), page);
+ VM_BUG_ON_FOLIO(folio_test_active(folio) && folio_test_unevictable(folio), folio);
+ VM_BUG_ON_FOLIO(folio_test_lru(folio), folio);
- get_page(page);
+ folio_get(folio);
local_lock(&lru_pvecs.lock);
pvec = this_cpu_ptr(&lru_pvecs.lru_add);
- if (pagevec_add_and_need_flush(pvec, page))
+ if (pagevec_add_and_need_flush(pvec, &folio->page))
__pagevec_lru_add(pvec);
local_unlock(&lru_pvecs.lock);
}
-EXPORT_SYMBOL(lru_cache_add);
+EXPORT_SYMBOL(folio_add_lru);
/**
* lru_cache_add_inactive_or_unevictable
--
2.30.2
Convert __add_to_page_cache_locked() into __filemap_add_folio().
Add an assertion to it that, for !hugetlbfs, the folio is naturally
aligned within the file. Move the prototype from mm.h to pagemap.h.
Convert add_to_page_cache_lru() into filemap_add_folio(). Add a
compatibility wrapper for unconverted callers.
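A hypothetical converted caller pairs it with filemap_alloc_folio()
(which exists elsewhere in this series); 'mapping', 'index', 'gfp' and
'order' stand in for the caller's context:

        struct folio *folio;
        int err;

        folio = filemap_alloc_folio(gfp, order);
        if (!folio)
                return -ENOMEM;
        err = filemap_add_folio(mapping, folio, index, gfp);
        if (err) {
                folio_put(folio);
                return err;
        }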
Signed-off-by: Matthew Wilcox (Oracle) <[email protected]>
Reviewed-by: Christoph Hellwig <[email protected]>
---
include/linux/mm.h | 7 -----
include/linux/pagemap.h | 10 ++++--
kernel/bpf/verifier.c | 2 +-
mm/filemap.c | 70 ++++++++++++++++++++---------------------
mm/folio-compat.c | 7 +++++
5 files changed, 50 insertions(+), 46 deletions(-)
diff --git a/include/linux/mm.h b/include/linux/mm.h
index 4803f2c01367..99f5f736be64 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -213,13 +213,6 @@ int overcommit_kbytes_handler(struct ctl_table *, int, void *, size_t *,
loff_t *);
int overcommit_policy_handler(struct ctl_table *, int, void *, size_t *,
loff_t *);
-/*
- * Any attempt to mark this function as static leads to build failure
- * when CONFIG_DEBUG_INFO_BTF is enabled because __add_to_page_cache_locked()
- * is referred to by BPF code. This must be visible for error injection.
- */
-int __add_to_page_cache_locked(struct page *page, struct address_space *mapping,
- pgoff_t index, gfp_t gfp, void **shadowp);
#if defined(CONFIG_SPARSEMEM) && !defined(CONFIG_SPARSEMEM_VMEMMAP)
#define nth_page(page,n) pfn_to_page(page_to_pfn((page)) + (n))
diff --git a/include/linux/pagemap.h b/include/linux/pagemap.h
index 848acb44ac80..19b2e3bea14c 100644
--- a/include/linux/pagemap.h
+++ b/include/linux/pagemap.h
@@ -877,9 +877,11 @@ static inline int fault_in_pages_readable(const char __user *uaddr, int size)
}
int add_to_page_cache_locked(struct page *page, struct address_space *mapping,
- pgoff_t index, gfp_t gfp_mask);
+ pgoff_t index, gfp_t gfp);
int add_to_page_cache_lru(struct page *page, struct address_space *mapping,
- pgoff_t index, gfp_t gfp_mask);
+ pgoff_t index, gfp_t gfp);
+int filemap_add_folio(struct address_space *mapping, struct folio *folio,
+ pgoff_t index, gfp_t gfp);
extern void delete_from_page_cache(struct page *page);
extern void __delete_from_page_cache(struct page *page, void *shadow);
void replace_page_cache_page(struct page *old, struct page *new);
@@ -904,6 +906,10 @@ static inline int add_to_page_cache(struct page *page,
return error;
}
+/* Must be non-static for BPF error injection */
+int __filemap_add_folio(struct address_space *mapping, struct folio *folio,
+ pgoff_t index, gfp_t gfp, void **shadowp);
+
/**
* struct readahead_control - Describes a readahead request.
*
diff --git a/kernel/bpf/verifier.c b/kernel/bpf/verifier.c
index 42a4063de7cd..f0a4f8b818e4 100644
--- a/kernel/bpf/verifier.c
+++ b/kernel/bpf/verifier.c
@@ -13015,7 +13015,7 @@ BTF_SET_START(btf_non_sleepable_error_inject)
/* Three functions below can be called from sleepable and non-sleepable context.
* Assume non-sleepable from bpf safety point of view.
*/
-BTF_ID(func, __add_to_page_cache_locked)
+BTF_ID(func, __filemap_add_folio)
BTF_ID(func, should_fail_alloc_page)
BTF_ID(func, should_failslab)
BTF_SET_END(btf_non_sleepable_error_inject)
diff --git a/mm/filemap.c b/mm/filemap.c
index 54989a32d6a8..4e34383fd894 100644
--- a/mm/filemap.c
+++ b/mm/filemap.c
@@ -855,26 +855,25 @@ void replace_page_cache_page(struct page *old, struct page *new)
}
EXPORT_SYMBOL_GPL(replace_page_cache_page);
-noinline int __add_to_page_cache_locked(struct page *page,
- struct address_space *mapping,
- pgoff_t offset, gfp_t gfp,
- void **shadowp)
+noinline int __filemap_add_folio(struct address_space *mapping,
+ struct folio *folio, pgoff_t index, gfp_t gfp, void **shadowp)
{
- XA_STATE(xas, &mapping->i_pages, offset);
- int huge = PageHuge(page);
+ XA_STATE(xas, &mapping->i_pages, index);
+ int huge = folio_test_hugetlb(folio);
int error;
bool charged = false;
- VM_BUG_ON_PAGE(!PageLocked(page), page);
- VM_BUG_ON_PAGE(PageSwapBacked(page), page);
+ VM_BUG_ON_FOLIO(!folio_test_locked(folio), folio);
+ VM_BUG_ON_FOLIO(folio_test_swapbacked(folio), folio);
mapping_set_update(&xas, mapping);
- get_page(page);
- page->mapping = mapping;
- page->index = offset;
+ folio_get(folio);
+ folio->mapping = mapping;
+ folio->index = index;
if (!huge) {
- error = mem_cgroup_charge(page_folio(page), NULL, gfp);
+ error = mem_cgroup_charge(folio, NULL, gfp);
+ VM_BUG_ON_FOLIO(index & (folio_nr_pages(folio) - 1), folio);
if (error)
goto error;
charged = true;
@@ -886,7 +885,7 @@ noinline int __add_to_page_cache_locked(struct page *page,
unsigned int order = xa_get_order(xas.xa, xas.xa_index);
void *entry, *old = NULL;
- if (order > thp_order(page))
+ if (order > folio_order(folio))
xas_split_alloc(&xas, xa_load(xas.xa, xas.xa_index),
order, gfp);
xas_lock_irq(&xas);
@@ -903,13 +902,13 @@ noinline int __add_to_page_cache_locked(struct page *page,
*shadowp = old;
/* entry may have been split before we acquired lock */
order = xa_get_order(xas.xa, xas.xa_index);
- if (order > thp_order(page)) {
+ if (order > folio_order(folio)) {
xas_split(&xas, old, order);
xas_reset(&xas);
}
}
- xas_store(&xas, page);
+ xas_store(&xas, folio);
if (xas_error(&xas))
goto unlock;
@@ -917,7 +916,7 @@ noinline int __add_to_page_cache_locked(struct page *page,
/* hugetlb pages do not participate in page cache accounting */
if (!huge)
- __inc_lruvec_page_state(page, NR_FILE_PAGES);
+ __lruvec_stat_add_folio(folio, NR_FILE_PAGES);
unlock:
xas_unlock_irq(&xas);
} while (xas_nomem(&xas, gfp));
@@ -925,19 +924,19 @@ noinline int __add_to_page_cache_locked(struct page *page,
if (xas_error(&xas)) {
error = xas_error(&xas);
if (charged)
- mem_cgroup_uncharge(page_folio(page));
+ mem_cgroup_uncharge(folio);
goto error;
}
- trace_mm_filemap_add_to_page_cache(page);
+ trace_mm_filemap_add_to_page_cache(&folio->page);
return 0;
error:
- page->mapping = NULL;
+ folio->mapping = NULL;
/* Leave page->index set: truncation relies upon it */
- put_page(page);
+ folio_put(folio);
return error;
}
-ALLOW_ERROR_INJECTION(__add_to_page_cache_locked, ERRNO);
+ALLOW_ERROR_INJECTION(__filemap_add_folio, ERRNO);
/**
* add_to_page_cache_locked - add a locked page to the pagecache
@@ -954,39 +953,38 @@ ALLOW_ERROR_INJECTION(__add_to_page_cache_locked, ERRNO);
int add_to_page_cache_locked(struct page *page, struct address_space *mapping,
pgoff_t offset, gfp_t gfp_mask)
{
- return __add_to_page_cache_locked(page, mapping, offset,
+ return __filemap_add_folio(mapping, page_folio(page), offset,
gfp_mask, NULL);
}
EXPORT_SYMBOL(add_to_page_cache_locked);
-int add_to_page_cache_lru(struct page *page, struct address_space *mapping,
- pgoff_t offset, gfp_t gfp_mask)
+int filemap_add_folio(struct address_space *mapping, struct folio *folio,
+ pgoff_t index, gfp_t gfp)
{
void *shadow = NULL;
int ret;
- __SetPageLocked(page);
- ret = __add_to_page_cache_locked(page, mapping, offset,
- gfp_mask, &shadow);
+ __folio_set_locked(folio);
+ ret = __filemap_add_folio(mapping, folio, index, gfp, &shadow);
if (unlikely(ret))
- __ClearPageLocked(page);
+ __folio_clear_locked(folio);
else {
/*
- * The page might have been evicted from cache only
+ * The folio might have been evicted from cache only
* recently, in which case it should be activated like
- * any other repeatedly accessed page.
- * The exception is pages getting rewritten; evicting other
+ * any other repeatedly accessed folio.
+ * The exception is folios getting rewritten; evicting other
* data from the working set, only to cache data that will
* get overwritten with something else, is a waste of memory.
*/
- WARN_ON_ONCE(PageActive(page));
- if (!(gfp_mask & __GFP_WRITE) && shadow)
- workingset_refault(page_folio(page), shadow);
- lru_cache_add(page);
+ WARN_ON_ONCE(folio_test_active(folio));
+ if (!(gfp & __GFP_WRITE) && shadow)
+ workingset_refault(folio, shadow);
+ folio_add_lru(folio);
}
return ret;
}
-EXPORT_SYMBOL_GPL(add_to_page_cache_lru);
+EXPORT_SYMBOL_GPL(filemap_add_folio);
#ifdef CONFIG_NUMA
struct folio *filemap_alloc_folio(gfp_t gfp, unsigned int order)
diff --git a/mm/folio-compat.c b/mm/folio-compat.c
index 6de3cd78a4ae..6b19bc4ed6b0 100644
--- a/mm/folio-compat.c
+++ b/mm/folio-compat.c
@@ -108,3 +108,10 @@ void lru_cache_add(struct page *page)
folio_add_lru(page_folio(page));
}
EXPORT_SYMBOL(lru_cache_add);
+
+int add_to_page_cache_lru(struct page *page, struct address_space *mapping,
+ pgoff_t index, gfp_t gfp)
+{
+ return filemap_add_folio(mapping, page_folio(page), index, gfp);
+}
+EXPORT_SYMBOL(add_to_page_cache_lru);
--
2.30.2
The pagecache only contains folios, so indicate that this is definitely
not a tail page. Shrinks mapping_get_entry() by 56 bytes, but grows
pagecache_get_page() by 21 bytes as gcc makes slightly different hot/cold
code decisions. A net reduction of 35 bytes of text.
Signed-off-by: Matthew Wilcox (Oracle) <[email protected]>
Reviewed-by: Christoph Hellwig <[email protected]>
---
mm/filemap.c | 35 ++++++++++++++---------------------
1 file changed, 14 insertions(+), 21 deletions(-)
diff --git a/mm/filemap.c b/mm/filemap.c
index 4e34383fd894..85a457c7b7a7 100644
--- a/mm/filemap.c
+++ b/mm/filemap.c
@@ -1755,49 +1755,42 @@ EXPORT_SYMBOL(page_cache_prev_miss);
* @mapping: the address_space to search
* @index: The page cache index.
*
- * Looks up the page cache slot at @mapping & @index. If there is a
- * page cache page, the head page is returned with an increased refcount.
+ * Looks up the page cache entry at @mapping & @index. If it is a folio,
+ * it is returned with an increased refcount. If it is a shadow entry
+ * of a previously evicted folio, or a swap entry from shmem/tmpfs,
+ * it is returned without further action.
*
- * If the slot holds a shadow entry of a previously evicted page, or a
- * swap entry from shmem/tmpfs, it is returned.
- *
- * Return: The head page or shadow entry, %NULL if nothing is found.
+ * Return: The folio, swap or shadow entry, %NULL if nothing is found.
*/
-static struct page *mapping_get_entry(struct address_space *mapping,
- pgoff_t index)
+static void *mapping_get_entry(struct address_space *mapping, pgoff_t index)
{
XA_STATE(xas, &mapping->i_pages, index);
- struct page *page;
+ struct folio *folio;
rcu_read_lock();
repeat:
xas_reset(&xas);
- page = xas_load(&xas);
- if (xas_retry(&xas, page))
+ folio = xas_load(&xas);
+ if (xas_retry(&xas, folio))
goto repeat;
/*
* A shadow entry of a recently evicted page, or a swap entry from
* shmem/tmpfs. Return it without attempting to raise page count.
*/
- if (!page || xa_is_value(page))
+ if (!folio || xa_is_value(folio))
goto out;
- if (!page_cache_get_speculative(page))
+ if (!folio_try_get_rcu(folio))
goto repeat;
- /*
- * Has the page moved or been split?
- * This is part of the lockless pagecache protocol. See
- * include/linux/pagemap.h for details.
- */
- if (unlikely(page != xas_reload(&xas))) {
- put_page(page);
+ if (unlikely(folio != xas_reload(&xas))) {
+ folio_put(folio);
goto repeat;
}
out:
rcu_read_unlock();
- return page;
+ return folio;
}
/**
--
2.30.2
filemap_get_folio() is a replacement for find_get_page().
Turn pagecache_get_page() into a wrapper around __filemap_get_folio().
Remove find_lock_head() as its use case is now covered by
__filemap_get_folio() with FGP_LOCK.
Reduces overall kernel size by 209 bytes. __filemap_get_folio() is
316 bytes shorter than pagecache_get_page() was, but the new
pagecache_get_page() is 99 bytes.
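For illustration only (not part of this patch), a caller that currently
uses find_get_page() could use the new interface roughly like this; the
helper name is hypothetical:

	/* Check whether the folio cached at @index is uptodate. */
	static bool example_folio_uptodate(struct address_space *mapping,
			pgoff_t index)
	{
		struct folio *folio = filemap_get_folio(mapping, index);
		bool uptodate;

		if (!folio)
			return false;
		uptodate = folio_test_uptodate(folio);
		folio_put(folio);
		return uptodate;
	}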
Signed-off-by: Matthew Wilcox (Oracle) <[email protected]>
---
include/linux/pagemap.h | 41 +++++++++---------
mm/filemap.c | 92 ++++++++++++++++++++---------------------
mm/folio-compat.c | 12 ++++++
3 files changed, 76 insertions(+), 69 deletions(-)
diff --git a/include/linux/pagemap.h b/include/linux/pagemap.h
index 19b2e3bea14c..b24933eced18 100644
--- a/include/linux/pagemap.h
+++ b/include/linux/pagemap.h
@@ -302,8 +302,26 @@ pgoff_t page_cache_prev_miss(struct address_space *mapping,
#define FGP_HEAD 0x00000080
#define FGP_ENTRY 0x00000100
-struct page *pagecache_get_page(struct address_space *mapping, pgoff_t offset,
- int fgp_flags, gfp_t cache_gfp_mask);
+struct folio *__filemap_get_folio(struct address_space *mapping, pgoff_t index,
+ int fgp_flags, gfp_t gfp);
+struct page *pagecache_get_page(struct address_space *mapping, pgoff_t index,
+ int fgp_flags, gfp_t gfp);
+
+/**
+ * filemap_get_folio - Find and get a folio.
+ * @mapping: The address_space to search.
+ * @index: The page index.
+ *
+ * Looks up the page cache entry at @mapping & @index. If a folio is
+ * present, it is returned with an increased refcount.
+ *
+ * Otherwise, %NULL is returned.
+ */
+static inline struct folio *filemap_get_folio(struct address_space *mapping,
+ pgoff_t index)
+{
+ return __filemap_get_folio(mapping, index, 0, 0);
+}
/**
* find_get_page - find and get a page reference
@@ -346,25 +364,6 @@ static inline struct page *find_lock_page(struct address_space *mapping,
return pagecache_get_page(mapping, index, FGP_LOCK, 0);
}
-/**
- * find_lock_head - Locate, pin and lock a pagecache page.
- * @mapping: The address_space to search.
- * @index: The page index.
- *
- * Looks up the page cache entry at @mapping & @index. If there is a
- * page cache page, its head page is returned locked and with an increased
- * refcount.
- *
- * Context: May sleep.
- * Return: A struct page which is !PageTail, or %NULL if there is no page
- * in the cache for this index.
- */
-static inline struct page *find_lock_head(struct address_space *mapping,
- pgoff_t index)
-{
- return pagecache_get_page(mapping, index, FGP_LOCK | FGP_HEAD, 0);
-}
-
/**
* find_or_create_page - locate or add a pagecache page
* @mapping: the page's address_space
diff --git a/mm/filemap.c b/mm/filemap.c
index 85a457c7b7a7..061e285aae21 100644
--- a/mm/filemap.c
+++ b/mm/filemap.c
@@ -1794,93 +1794,89 @@ static void *mapping_get_entry(struct address_space *mapping, pgoff_t index)
}
/**
- * pagecache_get_page - Find and get a reference to a page.
+ * __filemap_get_folio - Find and get a reference to a folio.
* @mapping: The address_space to search.
* @index: The page index.
- * @fgp_flags: %FGP flags modify how the page is returned.
- * @gfp_mask: Memory allocation flags to use if %FGP_CREAT is specified.
+ * @fgp_flags: %FGP flags modify how the folio is returned.
+ * @gfp: Memory allocation flags to use if %FGP_CREAT is specified.
*
* Looks up the page cache entry at @mapping & @index.
*
* @fgp_flags can be zero or more of these flags:
*
- * * %FGP_ACCESSED - The page will be marked accessed.
- * * %FGP_LOCK - The page is returned locked.
- * * %FGP_HEAD - If the page is present and a THP, return the head page
- * rather than the exact page specified by the index.
+ * * %FGP_ACCESSED - The folio will be marked accessed.
+ * * %FGP_LOCK - The folio is returned locked.
* * %FGP_ENTRY - If there is a shadow / swap / DAX entry, return it
- * instead of allocating a new page to replace it.
+ * instead of allocating a new folio to replace it.
* * %FGP_CREAT - If no page is present then a new page is allocated using
- * @gfp_mask and added to the page cache and the VM's LRU list.
+ * @gfp and added to the page cache and the VM's LRU list.
* The page is returned locked and with an increased refcount.
* * %FGP_FOR_MMAP - The caller wants to do its own locking dance if the
* page is already in cache. If the page was allocated, unlock it before
* returning so the caller can do the same dance.
- * * %FGP_WRITE - The page will be written
- * * %FGP_NOFS - __GFP_FS will get cleared in gfp mask
- * * %FGP_NOWAIT - Don't get blocked by page lock
+ * * %FGP_WRITE - The page will be written to by the caller.
+ * * %FGP_NOFS - __GFP_FS will get cleared in gfp.
+ * * %FGP_NOWAIT - Don't get blocked by page lock.
*
* If %FGP_LOCK or %FGP_CREAT are specified then the function may sleep even
* if the %GFP flags specified for %FGP_CREAT are atomic.
*
* If there is a page cache page, it is returned with an increased refcount.
*
- * Return: The found page or %NULL otherwise.
+ * Return: The found folio or %NULL otherwise.
*/
-struct page *pagecache_get_page(struct address_space *mapping, pgoff_t index,
- int fgp_flags, gfp_t gfp_mask)
+struct folio *__filemap_get_folio(struct address_space *mapping, pgoff_t index,
+ int fgp_flags, gfp_t gfp)
{
- struct page *page;
+ struct folio *folio;
repeat:
- page = mapping_get_entry(mapping, index);
- if (xa_is_value(page)) {
+ folio = mapping_get_entry(mapping, index);
+ if (xa_is_value(folio)) {
if (fgp_flags & FGP_ENTRY)
- return page;
- page = NULL;
+ return folio;
+ folio = NULL;
}
- if (!page)
+ if (!folio)
goto no_page;
if (fgp_flags & FGP_LOCK) {
if (fgp_flags & FGP_NOWAIT) {
- if (!trylock_page(page)) {
- put_page(page);
+ if (!folio_trylock(folio)) {
+ folio_put(folio);
return NULL;
}
} else {
- lock_page(page);
+ folio_lock(folio);
}
/* Has the page been truncated? */
- if (unlikely(page->mapping != mapping)) {
- unlock_page(page);
- put_page(page);
+ if (unlikely(folio->mapping != mapping)) {
+ folio_unlock(folio);
+ folio_put(folio);
goto repeat;
}
- VM_BUG_ON_PAGE(!thp_contains(page, index), page);
+ VM_BUG_ON_FOLIO(!folio_contains(folio, index), folio);
}
if (fgp_flags & FGP_ACCESSED)
- mark_page_accessed(page);
+ folio_mark_accessed(folio);
else if (fgp_flags & FGP_WRITE) {
/* Clear idle flag for buffer write */
- if (page_is_idle(page))
- clear_page_idle(page);
+ if (folio_test_idle(folio))
+ folio_clear_idle(folio);
}
- if (!(fgp_flags & FGP_HEAD))
- page = find_subpage(page, index);
no_page:
- if (!page && (fgp_flags & FGP_CREAT)) {
+ if (!folio && (fgp_flags & FGP_CREAT)) {
int err;
if ((fgp_flags & FGP_WRITE) && mapping_can_writeback(mapping))
- gfp_mask |= __GFP_WRITE;
+ gfp |= __GFP_WRITE;
if (fgp_flags & FGP_NOFS)
- gfp_mask &= ~__GFP_FS;
+ gfp &= ~__GFP_FS;
- page = __page_cache_alloc(gfp_mask);
- if (!page)
+ folio = filemap_alloc_folio(gfp, 0);
+ if (!folio)
return NULL;
if (WARN_ON_ONCE(!(fgp_flags & (FGP_LOCK | FGP_FOR_MMAP))))
@@ -1888,27 +1884,27 @@ struct page *pagecache_get_page(struct address_space *mapping, pgoff_t index,
/* Init accessed so avoid atomic mark_page_accessed later */
if (fgp_flags & FGP_ACCESSED)
- __SetPageReferenced(page);
+ __folio_set_referenced(folio);
- err = add_to_page_cache_lru(page, mapping, index, gfp_mask);
+ err = filemap_add_folio(mapping, folio, index, gfp);
if (unlikely(err)) {
- put_page(page);
- page = NULL;
+ folio_put(folio);
+ folio = NULL;
if (err == -EEXIST)
goto repeat;
}
/*
- * add_to_page_cache_lru locks the page, and for mmap we expect
- * an unlocked page.
+ * filemap_add_folio locks the page, and for mmap
+ * we expect an unlocked page.
*/
- if (page && (fgp_flags & FGP_FOR_MMAP))
- unlock_page(page);
+ if (folio && (fgp_flags & FGP_FOR_MMAP))
+ folio_unlock(folio);
}
- return page;
+ return folio;
}
-EXPORT_SYMBOL(pagecache_get_page);
+EXPORT_SYMBOL(__filemap_get_folio);
static inline struct page *find_get_entry(struct xa_state *xas, pgoff_t max,
xa_mark_t mark)
diff --git a/mm/folio-compat.c b/mm/folio-compat.c
index 6b19bc4ed6b0..e833e680e944 100644
--- a/mm/folio-compat.c
+++ b/mm/folio-compat.c
@@ -115,3 +115,15 @@ int add_to_page_cache_lru(struct page *page, struct address_space *mapping,
return filemap_add_folio(mapping, page_folio(page), index, gfp);
}
EXPORT_SYMBOL(add_to_page_cache_lru);
+
+struct page *pagecache_get_page(struct address_space *mapping, pgoff_t index,
+ int fgp_flags, gfp_t gfp)
+{
+ struct folio *folio;
+
+ folio = __filemap_get_folio(mapping, index, fgp_flags, gfp);
+ if ((fgp_flags & FGP_HEAD) || !folio || xa_is_value(folio))
+ return &folio->page;
+ return folio_file_page(folio, index);
+}
+EXPORT_SYMBOL(pagecache_get_page);
--
2.30.2
Allow filemap_get_folio() to wait for writeback to complete (if the
filesystem wants that behaviour). This is the folio equivalent of
grab_cache_page_write_begin(), which is moved into the folio-compat
file as a reminder to migrate all the code using it. This paves the
way for getting rid of AOP_FLAG_NOFS once grab_cache_page_write_begin()
is removed.
Kernel grows by 11 bytes. filemap_get_folio() grows by 33 bytes but
grab_cache_page_write_begin() shrinks by 22 bytes to make up for it.
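For illustration only, a sketch of how a filesystem's write path might
ask for a locked, writeback-stable folio once it converts away from
grab_cache_page_write_begin(); the helper name is hypothetical:

	static struct folio *example_grab_folio_for_write(
			struct address_space *mapping, pgoff_t index)
	{
		unsigned int fgp = FGP_LOCK | FGP_WRITE | FGP_CREAT |
				   FGP_STABLE;

		/* FGP_STABLE makes __filemap_get_folio() wait for writeback. */
		return __filemap_get_folio(mapping, index, fgp,
				mapping_gfp_mask(mapping));
	}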
Signed-off-by: Matthew Wilcox (Oracle) <[email protected]>
---
include/linux/pagemap.h | 1 +
mm/filemap.c | 25 +++----------------------
mm/folio-compat.c | 13 +++++++++++++
3 files changed, 17 insertions(+), 22 deletions(-)
diff --git a/include/linux/pagemap.h b/include/linux/pagemap.h
index b24933eced18..83c1a798265f 100644
--- a/include/linux/pagemap.h
+++ b/include/linux/pagemap.h
@@ -301,6 +301,7 @@ pgoff_t page_cache_prev_miss(struct address_space *mapping,
#define FGP_FOR_MMAP 0x00000040
#define FGP_HEAD 0x00000080
#define FGP_ENTRY 0x00000100
+#define FGP_STABLE 0x00000200
struct folio *__filemap_get_folio(struct address_space *mapping, pgoff_t index,
int fgp_flags, gfp_t gfp);
diff --git a/mm/filemap.c b/mm/filemap.c
index 061e285aae21..0434c5a55fec 100644
--- a/mm/filemap.c
+++ b/mm/filemap.c
@@ -1817,6 +1817,7 @@ static void *mapping_get_entry(struct address_space *mapping, pgoff_t index)
* * %FGP_WRITE - The page will be written to by the caller.
* * %FGP_NOFS - __GFP_FS will get cleared in gfp.
* * %FGP_NOWAIT - Don't get blocked by page lock.
+ * * %FGP_STABLE - Wait for the folio to be stable (finished writeback)
*
* If %FGP_LOCK or %FGP_CREAT are specified then the function may sleep even
* if the %GFP flags specified for %FGP_CREAT are atomic.
@@ -1867,6 +1868,8 @@ struct folio *__filemap_get_folio(struct address_space *mapping, pgoff_t index,
folio_clear_idle(folio);
}
+ if (fgp_flags & FGP_STABLE)
+ folio_wait_stable(folio);
no_page:
if (!folio && (fgp_flags & FGP_CREAT)) {
int err;
@@ -3590,28 +3593,6 @@ generic_file_direct_write(struct kiocb *iocb, struct iov_iter *from)
}
EXPORT_SYMBOL(generic_file_direct_write);
-/*
- * Find or create a page at the given pagecache position. Return the locked
- * page. This function is specifically for buffered writes.
- */
-struct page *grab_cache_page_write_begin(struct address_space *mapping,
- pgoff_t index, unsigned flags)
-{
- struct page *page;
- int fgp_flags = FGP_LOCK|FGP_WRITE|FGP_CREAT;
-
- if (flags & AOP_FLAG_NOFS)
- fgp_flags |= FGP_NOFS;
-
- page = pagecache_get_page(mapping, index, fgp_flags,
- mapping_gfp_mask(mapping));
- if (page)
- wait_for_stable_page(page);
-
- return page;
-}
-EXPORT_SYMBOL(grab_cache_page_write_begin);
-
ssize_t generic_perform_write(struct file *file,
struct iov_iter *i, loff_t pos)
{
diff --git a/mm/folio-compat.c b/mm/folio-compat.c
index e833e680e944..5b6ae1da314e 100644
--- a/mm/folio-compat.c
+++ b/mm/folio-compat.c
@@ -116,6 +116,7 @@ int add_to_page_cache_lru(struct page *page, struct address_space *mapping,
}
EXPORT_SYMBOL(add_to_page_cache_lru);
+noinline
struct page *pagecache_get_page(struct address_space *mapping, pgoff_t index,
int fgp_flags, gfp_t gfp)
{
@@ -127,3 +128,15 @@ struct page *pagecache_get_page(struct address_space *mapping, pgoff_t index,
return folio_file_page(folio, index);
}
EXPORT_SYMBOL(pagecache_get_page);
+
+struct page *grab_cache_page_write_begin(struct address_space *mapping,
+ pgoff_t index, unsigned flags)
+{
+ unsigned fgp_flags = FGP_LOCK | FGP_WRITE | FGP_CREAT | FGP_STABLE;
+
+ if (flags & AOP_FLAG_NOFS)
+ fgp_flags |= FGP_NOFS;
+ return pagecache_get_page(mapping, index, fgp_flags,
+ mapping_gfp_mask(mapping));
+}
+EXPORT_SYMBOL(grab_cache_page_write_begin);
--
2.30.2
This is a thin wrapper around bio_add_page(). The main advantage here
is the documentation that the submitter can expect to see folios in the
completion handler, and that stupidly large folios are not supported.
It's not currently possible to allocate stupidly large folios, but if
it ever becomes possible, this function will fail gracefully instead of
doing I/O to the wrong bytes.
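For illustration only, a hypothetical submitter that wants to add an
entire folio and detect a full bio might do:

	/* Returns true if the whole folio was added to the bio. */
	static bool example_bio_add_whole_folio(struct bio *bio,
			struct folio *folio)
	{
		size_t len = folio_size(folio);

		return bio_add_folio(bio, folio, len, 0) == len;
	}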
Signed-off-by: Matthew Wilcox (Oracle) <[email protected]>
---
block/bio.c | 21 +++++++++++++++++++++
include/linux/bio.h | 3 ++-
2 files changed, 23 insertions(+), 1 deletion(-)
diff --git a/block/bio.c b/block/bio.c
index 1fab762e079b..1b500611d25c 100644
--- a/block/bio.c
+++ b/block/bio.c
@@ -933,6 +933,27 @@ int bio_add_page(struct bio *bio, struct page *page,
}
EXPORT_SYMBOL(bio_add_page);
+/**
+ * bio_add_folio - Attempt to add part of a folio to a bio.
+ * @bio: Bio to add to.
+ * @folio: Folio to add.
+ * @len: How many bytes from the folio to add.
+ * @off: First byte in this folio to add.
+ *
+ * Always uses the head page of the folio in the bio. If a submitter
+ * only uses bio_add_folio(), it can count on never seeing tail pages
+ * in the completion routine. BIOs do not support folios larger than 2GiB.
+ *
+ * Return: The number of bytes from this folio added to the bio.
+ */
+size_t bio_add_folio(struct bio *bio, struct folio *folio, size_t len,
+ size_t off)
+{
+ if (len > UINT_MAX || off > UINT_MAX)
+ return 0;
+ return bio_add_page(bio, &folio->page, len, off);
+}
+
void bio_release_pages(struct bio *bio, bool mark_dirty)
{
struct bvec_iter_all iter_all;
diff --git a/include/linux/bio.h b/include/linux/bio.h
index 2203b686e1f0..ade93e2de6a1 100644
--- a/include/linux/bio.h
+++ b/include/linux/bio.h
@@ -462,7 +462,8 @@ extern void bio_uninit(struct bio *);
extern void bio_reset(struct bio *);
void bio_chain(struct bio *, struct bio *);
-extern int bio_add_page(struct bio *, struct page *, unsigned int,unsigned int);
+int bio_add_page(struct bio *, struct page *, unsigned len, unsigned off);
+size_t bio_add_folio(struct bio *, struct folio *, size_t len, size_t off);
extern int bio_add_pc_page(struct request_queue *, struct bio *, struct page *,
unsigned int, unsigned int);
int bio_add_zone_append_page(struct bio *bio, struct page *page,
--
2.30.2
The big comment about only using a head page can go away now that
to_iomap_page() takes a folio argument.
Signed-off-by: Matthew Wilcox (Oracle) <[email protected]>
---
fs/iomap/buffered-io.c | 35 +++++++++++++++++------------------
1 file changed, 17 insertions(+), 18 deletions(-)
diff --git a/fs/iomap/buffered-io.c b/fs/iomap/buffered-io.c
index 41da4f14c00b..cd5c2f24cb7e 100644
--- a/fs/iomap/buffered-io.c
+++ b/fs/iomap/buffered-io.c
@@ -22,8 +22,8 @@
#include "../internal.h"
/*
- * Structure allocated for each page or THP when block size < page size
- * to track sub-page uptodate status and I/O completions.
+ * Structure allocated for each folio when block size < folio size
+ * to track sub-folio uptodate status and I/O completions.
*/
struct iomap_page {
atomic_t read_bytes_pending;
@@ -32,17 +32,10 @@ struct iomap_page {
unsigned long uptodate[];
};
-static inline struct iomap_page *to_iomap_page(struct page *page)
+static inline struct iomap_page *to_iomap_page(struct folio *folio)
{
- /*
- * per-block data is stored in the head page. Callers should
- * not be dealing with tail pages (and if they are, they can
- * call thp_head() first.
- */
- VM_BUG_ON_PGFLAGS(PageTail(page), page);
-
- if (page_has_private(page))
- return (struct iomap_page *)page_private(page);
+ if (folio_test_private(folio))
+ return folio_get_private(folio);
return NULL;
}
@@ -51,7 +44,8 @@ static struct bio_set iomap_ioend_bioset;
static struct iomap_page *
iomap_page_create(struct inode *inode, struct page *page)
{
- struct iomap_page *iop = to_iomap_page(page);
+ struct folio *folio = page_folio(page);
+ struct iomap_page *iop = to_iomap_page(folio);
unsigned int nr_blocks = i_blocks_per_page(inode, page);
if (iop || nr_blocks <= 1)
@@ -144,7 +138,8 @@ iomap_adjust_read_range(struct inode *inode, struct iomap_page *iop,
static void
iomap_iop_set_range_uptodate(struct page *page, unsigned off, unsigned len)
{
- struct iomap_page *iop = to_iomap_page(page);
+ struct folio *folio = page_folio(page);
+ struct iomap_page *iop = to_iomap_page(folio);
struct inode *inode = page->mapping->host;
unsigned first = off >> inode->i_blkbits;
unsigned last = (off + len - 1) >> inode->i_blkbits;
@@ -173,7 +168,8 @@ static void
iomap_read_page_end_io(struct bio_vec *bvec, int error)
{
struct page *page = bvec->bv_page;
- struct iomap_page *iop = to_iomap_page(page);
+ struct folio *folio = page_folio(page);
+ struct iomap_page *iop = to_iomap_page(folio);
if (unlikely(error)) {
ClearPageUptodate(page);
@@ -433,7 +429,8 @@ int
iomap_is_partially_uptodate(struct page *page, unsigned long from,
unsigned long count)
{
- struct iomap_page *iop = to_iomap_page(page);
+ struct folio *folio = page_folio(page);
+ struct iomap_page *iop = to_iomap_page(folio);
struct inode *inode = page->mapping->host;
unsigned len, first, last;
unsigned i;
@@ -1011,7 +1008,8 @@ static void
iomap_finish_page_writeback(struct inode *inode, struct page *page,
int error, unsigned int len)
{
- struct iomap_page *iop = to_iomap_page(page);
+ struct folio *folio = page_folio(page);
+ struct iomap_page *iop = to_iomap_page(folio);
if (error) {
SetPageError(page);
@@ -1304,7 +1302,8 @@ iomap_writepage_map(struct iomap_writepage_ctx *wpc,
struct writeback_control *wbc, struct inode *inode,
struct page *page, u64 end_offset)
{
- struct iomap_page *iop = to_iomap_page(page);
+ struct folio *folio = page_folio(page);
+ struct iomap_page *iop = to_iomap_page(folio);
struct iomap_ioend *ioend, *next;
unsigned len = i_blocksize(inode);
u64 file_offset; /* file offset of page */
--
2.30.2
Pass a folio around instead of the page, and make sure the offset
is relative to the start of the folio instead of the start of a page.
Also use size_t for offset & length to make it clear that these are byte
counts, and to support >2GB folios in the future.
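As a sketch only (a hypothetical caller inside buffered-io.c), converting
a file position into a folio-relative byte offset now looks like this:

	static void example_set_pos_uptodate(struct folio *folio,
			struct iomap_page *iop, loff_t pos, size_t len)
	{
		/* The offset is relative to the folio, not to a page. */
		size_t off = offset_in_folio(folio, pos);

		iomap_set_range_uptodate(folio, iop, off, len);
	}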
Signed-off-by: Matthew Wilcox (Oracle) <[email protected]>
---
fs/iomap/buffered-io.c | 85 ++++++++++++++++++++++--------------------
1 file changed, 44 insertions(+), 41 deletions(-)
diff --git a/fs/iomap/buffered-io.c b/fs/iomap/buffered-io.c
index fbe4ebc074ce..707a96e36651 100644
--- a/fs/iomap/buffered-io.c
+++ b/fs/iomap/buffered-io.c
@@ -75,18 +75,18 @@ static void iomap_page_release(struct folio *folio)
}
/*
- * Calculate the range inside the page that we actually need to read.
+ * Calculate the range inside the folio that we actually need to read.
*/
-static void
-iomap_adjust_read_range(struct inode *inode, struct iomap_page *iop,
- loff_t *pos, loff_t length, unsigned *offp, unsigned *lenp)
+static void iomap_adjust_read_range(struct inode *inode, struct folio *folio,
+ loff_t *pos, loff_t length, size_t *offp, size_t *lenp)
{
+ struct iomap_page *iop = to_iomap_page(folio);
loff_t orig_pos = *pos;
loff_t isize = i_size_read(inode);
unsigned block_bits = inode->i_blkbits;
unsigned block_size = (1 << block_bits);
- unsigned poff = offset_in_page(*pos);
- unsigned plen = min_t(loff_t, PAGE_SIZE - poff, length);
+ size_t poff = offset_in_folio(folio, *pos);
+ size_t plen = min_t(loff_t, folio_size(folio) - poff, length);
unsigned first = poff >> block_bits;
unsigned last = (poff + plen - 1) >> block_bits;
@@ -124,7 +124,7 @@ iomap_adjust_read_range(struct inode *inode, struct iomap_page *iop,
* page cache for blocks that are entirely outside of i_size.
*/
if (orig_pos <= isize && orig_pos + length > isize) {
- unsigned end = offset_in_page(isize - 1) >> block_bits;
+ unsigned end = offset_in_folio(folio, isize - 1) >> block_bits;
if (first <= end && last > end)
plen -= (last - end) * block_size;
@@ -134,31 +134,31 @@ iomap_adjust_read_range(struct inode *inode, struct iomap_page *iop,
*lenp = plen;
}
-static void iomap_iop_set_range_uptodate(struct page *page,
- struct iomap_page *iop, unsigned off, unsigned len)
+static void iomap_iop_set_range_uptodate(struct folio *folio,
+ struct iomap_page *iop, size_t off, size_t len)
{
- struct inode *inode = page->mapping->host;
+ struct inode *inode = folio->mapping->host;
unsigned first = off >> inode->i_blkbits;
unsigned last = (off + len - 1) >> inode->i_blkbits;
unsigned long flags;
spin_lock_irqsave(&iop->uptodate_lock, flags);
bitmap_set(iop->uptodate, first, last - first + 1);
- if (bitmap_full(iop->uptodate, i_blocks_per_page(inode, page)))
- SetPageUptodate(page);
+ if (bitmap_full(iop->uptodate, i_blocks_per_folio(inode, folio)))
+ folio_mark_uptodate(folio);
spin_unlock_irqrestore(&iop->uptodate_lock, flags);
}
-static void iomap_set_range_uptodate(struct page *page,
- struct iomap_page *iop, unsigned off, unsigned len)
+static void iomap_set_range_uptodate(struct folio *folio,
+ struct iomap_page *iop, size_t off, size_t len)
{
- if (PageError(page))
+ if (folio_test_error(folio))
return;
if (iop)
- iomap_iop_set_range_uptodate(page, iop, off, len);
+ iomap_iop_set_range_uptodate(folio, iop, off, len);
else
- SetPageUptodate(page);
+ folio_mark_uptodate(folio);
}
static void
@@ -169,15 +169,17 @@ iomap_read_page_end_io(struct bio_vec *bvec, int error)
struct iomap_page *iop = to_iomap_page(folio);
if (unlikely(error)) {
- ClearPageUptodate(page);
- SetPageError(page);
+ folio_clear_uptodate(folio);
+ folio_set_error(folio);
} else {
- iomap_set_range_uptodate(page, iop, bvec->bv_offset,
- bvec->bv_len);
+ size_t off = (page - &folio->page) * PAGE_SIZE +
+ bvec->bv_offset;
+
+ iomap_set_range_uptodate(folio, iop, off, bvec->bv_len);
}
if (!iop || atomic_sub_and_test(bvec->bv_len, &iop->read_bytes_pending))
- unlock_page(page);
+ folio_unlock(folio);
}
static void
@@ -237,7 +239,7 @@ iomap_readpage_actor(struct inode *inode, loff_t pos, loff_t length, void *data,
struct iomap_page *iop = iomap_page_create(inode, folio);
bool same_page = false, is_contig = false;
loff_t orig_pos = pos;
- unsigned poff, plen;
+ size_t poff, plen;
sector_t sector;
if (iomap->type == IOMAP_INLINE) {
@@ -246,14 +248,14 @@ iomap_readpage_actor(struct inode *inode, loff_t pos, loff_t length, void *data,
return PAGE_SIZE;
}
- /* zero post-eof blocks as the page may be mapped */
- iomap_adjust_read_range(inode, iop, &pos, length, &poff, &plen);
+ /* zero post-eof blocks as the folio may be mapped */
+ iomap_adjust_read_range(inode, folio, &pos, length, &poff, &plen);
if (plen == 0)
goto done;
if (iomap_block_needs_zeroing(inode, iomap, pos)) {
- zero_user(page, poff, plen);
- iomap_set_range_uptodate(page, iop, poff, plen);
+ zero_user(&folio->page, poff, plen);
+ iomap_set_range_uptodate(folio, iop, poff, plen);
goto done;
}
@@ -264,7 +266,7 @@ iomap_readpage_actor(struct inode *inode, loff_t pos, loff_t length, void *data,
/* Try to merge into a previous segment if we can */
sector = iomap_sector(iomap, pos);
if (ctx->bio && bio_end_sector(ctx->bio) == sector) {
- if (__bio_try_merge_page(ctx->bio, page, plen, poff,
+ if (__bio_try_merge_page(ctx->bio, &folio->page, plen, poff,
&same_page))
goto done;
is_contig = true;
@@ -296,7 +298,7 @@ iomap_readpage_actor(struct inode *inode, loff_t pos, loff_t length, void *data,
ctx->bio->bi_end_io = iomap_read_end_io;
}
- bio_add_page(ctx->bio, page, plen, poff);
+ bio_add_folio(ctx->bio, folio, plen, poff);
done:
/*
* Move the caller beyond our range so that it keeps making progress.
@@ -531,9 +533,8 @@ iomap_write_failed(struct inode *inode, loff_t pos, unsigned len)
truncate_pagecache_range(inode, max(pos, i_size), pos + len);
}
-static int
-iomap_read_page_sync(loff_t block_start, struct page *page, unsigned poff,
- unsigned plen, struct iomap *iomap)
+static int iomap_read_folio_sync(loff_t block_start, struct folio *folio,
+ size_t poff, size_t plen, struct iomap *iomap)
{
struct bio_vec bvec;
struct bio bio;
@@ -542,7 +543,7 @@ iomap_read_page_sync(loff_t block_start, struct page *page, unsigned poff,
bio.bi_opf = REQ_OP_READ;
bio.bi_iter.bi_sector = iomap_sector(iomap, block_start);
bio_set_dev(&bio, iomap->bdev);
- __bio_add_page(&bio, page, plen, poff);
+ bio_add_folio(&bio, folio, plen, poff);
return submit_bio_wait(&bio);
}
@@ -555,14 +556,15 @@ __iomap_write_begin(struct inode *inode, loff_t pos, unsigned len, int flags,
loff_t block_size = i_blocksize(inode);
loff_t block_start = round_down(pos, block_size);
loff_t block_end = round_up(pos + len, block_size);
- unsigned from = offset_in_page(pos), to = from + len, poff, plen;
+ size_t from = offset_in_folio(folio, pos), to = from + len;
+ size_t poff, plen;
- if (PageUptodate(page))
+ if (folio_test_uptodate(folio))
return 0;
- ClearPageError(page);
+ folio_clear_error(folio);
do {
- iomap_adjust_read_range(inode, iop, &block_start,
+ iomap_adjust_read_range(inode, folio, &block_start,
block_end - block_start, &poff, &plen);
if (plen == 0)
break;
@@ -575,14 +577,15 @@ __iomap_write_begin(struct inode *inode, loff_t pos, unsigned len, int flags,
if (iomap_block_needs_zeroing(inode, srcmap, block_start)) {
if (WARN_ON_ONCE(flags & IOMAP_WRITE_F_UNSHARE))
return -EIO;
- zero_user_segments(page, poff, from, to, poff + plen);
+ zero_user_segments(&folio->page, poff, from, to,
+ poff + plen);
} else {
- int status = iomap_read_page_sync(block_start, page,
+ int status = iomap_read_folio_sync(block_start, folio,
poff, plen, srcmap);
if (status)
return status;
}
- iomap_set_range_uptodate(page, iop, poff, plen);
+ iomap_set_range_uptodate(folio, iop, poff, plen);
} while ((block_start += plen) < block_end);
return 0;
@@ -661,7 +664,7 @@ static size_t __iomap_write_end(struct inode *inode, loff_t pos, size_t len,
*/
if (unlikely(copied < len && !PageUptodate(page)))
return 0;
- iomap_set_range_uptodate(page, iop, offset_in_page(pos), len);
+ iomap_set_range_uptodate(folio, iop, offset_in_folio(folio, pos), len);
__set_page_dirty_nobuffers(page);
return copied;
}
--
2.30.2
Handle folios of arbitrary size instead of working in PAGE_SIZE units.
Signed-off-by: Matthew Wilcox (Oracle) <[email protected]>
---
fs/iomap/buffered-io.c | 61 +++++++++++++++++++++---------------------
1 file changed, 30 insertions(+), 31 deletions(-)
diff --git a/fs/iomap/buffered-io.c b/fs/iomap/buffered-io.c
index 4732298f74e1..7c702d6c2f64 100644
--- a/fs/iomap/buffered-io.c
+++ b/fs/iomap/buffered-io.c
@@ -188,8 +188,8 @@ static void iomap_read_end_io(struct bio *bio)
}
struct iomap_readpage_ctx {
- struct page *cur_page;
- bool cur_page_in_bio;
+ struct folio *cur_folio;
+ bool cur_folio_in_bio;
struct bio *bio;
struct readahead_control *rac;
};
@@ -227,8 +227,7 @@ iomap_readpage_actor(struct inode *inode, loff_t pos, loff_t length, void *data,
struct iomap *iomap, struct iomap *srcmap)
{
struct iomap_readpage_ctx *ctx = data;
- struct page *page = ctx->cur_page;
- struct folio *folio = page_folio(page);
+ struct folio *folio = ctx->cur_folio;
struct iomap_page *iop = iomap_page_create(inode, folio);
bool same_page = false, is_contig = false;
loff_t orig_pos = pos;
@@ -237,7 +236,7 @@ iomap_readpage_actor(struct inode *inode, loff_t pos, loff_t length, void *data,
if (iomap->type == IOMAP_INLINE) {
WARN_ON_ONCE(pos);
- iomap_read_inline_data(inode, page, iomap);
+ iomap_read_inline_data(inode, &folio->page, iomap);
return PAGE_SIZE;
}
@@ -252,7 +251,7 @@ iomap_readpage_actor(struct inode *inode, loff_t pos, loff_t length, void *data,
goto done;
}
- ctx->cur_page_in_bio = true;
+ ctx->cur_folio_in_bio = true;
if (iop)
atomic_add(plen, &iop->read_bytes_pending);
@@ -266,7 +265,7 @@ iomap_readpage_actor(struct inode *inode, loff_t pos, loff_t length, void *data,
}
if (!is_contig || bio_full(ctx->bio, plen)) {
- gfp_t gfp = mapping_gfp_constraint(page->mapping, GFP_KERNEL);
+ gfp_t gfp = mapping_gfp_constraint(folio->mapping, GFP_KERNEL);
gfp_t orig_gfp = gfp;
unsigned int nr_vecs = DIV_ROUND_UP(length, PAGE_SIZE);
@@ -305,30 +304,31 @@ iomap_readpage_actor(struct inode *inode, loff_t pos, loff_t length, void *data,
int
iomap_readpage(struct page *page, const struct iomap_ops *ops)
{
- struct iomap_readpage_ctx ctx = { .cur_page = page };
- struct inode *inode = page->mapping->host;
- unsigned poff;
+ struct folio *folio = page_folio(page);
+ struct iomap_readpage_ctx ctx = { .cur_folio = folio };
+ struct inode *inode = folio->mapping->host;
+ size_t poff;
loff_t ret;
+ size_t len = folio_size(folio);
- trace_iomap_readpage(page->mapping->host, 1);
+ trace_iomap_readpage(inode, 1);
- for (poff = 0; poff < PAGE_SIZE; poff += ret) {
- ret = iomap_apply(inode, page_offset(page) + poff,
- PAGE_SIZE - poff, 0, ops, &ctx,
- iomap_readpage_actor);
+ for (poff = 0; poff < len; poff += ret) {
+ ret = iomap_apply(inode, folio_pos(folio) + poff, len - poff,
+ 0, ops, &ctx, iomap_readpage_actor);
if (ret <= 0) {
WARN_ON_ONCE(ret == 0);
- SetPageError(page);
+ folio_set_error(folio);
break;
}
}
if (ctx.bio) {
submit_bio(ctx.bio);
- WARN_ON_ONCE(!ctx.cur_page_in_bio);
+ WARN_ON_ONCE(!ctx.cur_folio_in_bio);
} else {
- WARN_ON_ONCE(ctx.cur_page_in_bio);
- unlock_page(page);
+ WARN_ON_ONCE(ctx.cur_folio_in_bio);
+ folio_unlock(folio);
}
/*
@@ -348,15 +348,15 @@ iomap_readahead_actor(struct inode *inode, loff_t pos, loff_t length,
loff_t done, ret;
for (done = 0; done < length; done += ret) {
- if (ctx->cur_page && offset_in_page(pos + done) == 0) {
- if (!ctx->cur_page_in_bio)
- unlock_page(ctx->cur_page);
- put_page(ctx->cur_page);
- ctx->cur_page = NULL;
+ if (ctx->cur_folio &&
+ offset_in_folio(ctx->cur_folio, pos + done) == 0) {
+ if (!ctx->cur_folio_in_bio)
+ folio_unlock(ctx->cur_folio);
+ ctx->cur_folio = NULL;
}
- if (!ctx->cur_page) {
- ctx->cur_page = readahead_page(ctx->rac);
- ctx->cur_page_in_bio = false;
+ if (!ctx->cur_folio) {
+ ctx->cur_folio = readahead_folio(ctx->rac);
+ ctx->cur_folio_in_bio = false;
}
ret = iomap_readpage_actor(inode, pos + done, length - done,
ctx, iomap, srcmap);
@@ -404,10 +404,9 @@ void iomap_readahead(struct readahead_control *rac, const struct iomap_ops *ops)
if (ctx.bio)
submit_bio(ctx.bio);
- if (ctx.cur_page) {
- if (!ctx.cur_page_in_bio)
- unlock_page(ctx.cur_page);
- put_page(ctx.cur_page);
+ if (ctx.cur_folio) {
+ if (!ctx.cur_folio_in_bio)
+ folio_unlock(ctx.cur_folio);
}
}
EXPORT_SYMBOL_GPL(iomap_readahead);
--
2.30.2
The iomap write path still only works in PAGE_SIZE chunks, but there are
fewer conversions from head to tail pages as a result of this patch.
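For illustration only (a hypothetical caller inside buffered-io.c), the
begin/end pairing now hands a folio from one half to the other:

	static ssize_t example_write_block(struct inode *inode, loff_t pos,
			size_t len, struct iomap *iomap, struct iomap *srcmap)
	{
		struct folio *folio;
		int status;

		status = iomap_write_begin(inode, pos, len, 0, &folio,
				iomap, srcmap);
		if (status)
			return status;

		/* ... copy @len bytes of data into the folio here ... */

		return iomap_write_end(inode, pos, len, len, folio, iomap,
				srcmap);
	}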
Signed-off-by: Matthew Wilcox (Oracle) <[email protected]>
---
fs/iomap/buffered-io.c | 68 ++++++++++++++++++++++--------------------
1 file changed, 36 insertions(+), 32 deletions(-)
diff --git a/fs/iomap/buffered-io.c b/fs/iomap/buffered-io.c
index a3fe0d36c739..5e0aa23d4693 100644
--- a/fs/iomap/buffered-io.c
+++ b/fs/iomap/buffered-io.c
@@ -541,9 +541,8 @@ static int iomap_read_folio_sync(loff_t block_start, struct folio *folio,
static int
__iomap_write_begin(struct inode *inode, loff_t pos, unsigned len, int flags,
- struct page *page, struct iomap *srcmap)
+ struct folio *folio, struct iomap *srcmap)
{
- struct folio *folio = page_folio(page);
struct iomap_page *iop = iomap_page_create(inode, folio);
loff_t block_size = i_blocksize(inode);
loff_t block_start = round_down(pos, block_size);
@@ -583,12 +582,14 @@ __iomap_write_begin(struct inode *inode, loff_t pos, unsigned len, int flags,
return 0;
}
-static int
-iomap_write_begin(struct inode *inode, loff_t pos, unsigned len, unsigned flags,
- struct page **pagep, struct iomap *iomap, struct iomap *srcmap)
+static int iomap_write_begin(struct inode *inode, loff_t pos, size_t len,
+ unsigned flags, struct folio **foliop, struct iomap *iomap,
+ struct iomap *srcmap)
{
const struct iomap_page_ops *page_ops = iomap->page_ops;
+ struct folio *folio;
struct page *page;
+ unsigned fgp = FGP_LOCK | FGP_WRITE | FGP_CREAT | FGP_STABLE | FGP_NOFS;
int status = 0;
BUG_ON(pos + len > iomap->offset + iomap->length);
@@ -604,30 +605,31 @@ iomap_write_begin(struct inode *inode, loff_t pos, unsigned len, unsigned flags,
return status;
}
- page = grab_cache_page_write_begin(inode->i_mapping, pos >> PAGE_SHIFT,
- AOP_FLAG_NOFS);
- if (!page) {
+ folio = __filemap_get_folio(inode->i_mapping, pos >> PAGE_SHIFT, fgp,
+ mapping_gfp_mask(inode->i_mapping));
+ if (!folio) {
status = -ENOMEM;
goto out_no_page;
}
+ page = folio_file_page(folio, pos >> PAGE_SHIFT);
if (srcmap->type == IOMAP_INLINE)
iomap_read_inline_data(inode, page, srcmap);
else if (iomap->flags & IOMAP_F_BUFFER_HEAD)
status = __block_write_begin_int(page, pos, len, NULL, srcmap);
else
- status = __iomap_write_begin(inode, pos, len, flags, page,
+ status = __iomap_write_begin(inode, pos, len, flags, folio,
srcmap);
if (unlikely(status))
goto out_unlock;
- *pagep = page;
+ *foliop = folio;
return 0;
out_unlock:
- unlock_page(page);
- put_page(page);
+ folio_unlock(folio);
+ folio_put(folio);
iomap_write_failed(inode, pos, len);
out_no_page:
@@ -637,11 +639,10 @@ iomap_write_begin(struct inode *inode, loff_t pos, unsigned len, unsigned flags,
}
static size_t __iomap_write_end(struct inode *inode, loff_t pos, size_t len,
- size_t copied, struct page *page)
+ size_t copied, struct folio *folio)
{
- struct folio *folio = page_folio(page);
struct iomap_page *iop = to_iomap_page(folio);
- flush_dcache_page(page);
+ flush_dcache_folio(folio);
/*
* The blocks that were entirely written will now be uptodate, so we
@@ -654,10 +655,10 @@ static size_t __iomap_write_end(struct inode *inode, loff_t pos, size_t len,
* uptodate page as a zero-length write, and force the caller to redo
* the whole thing.
*/
- if (unlikely(copied < len && !PageUptodate(page)))
+ if (unlikely(copied < len && !folio_test_uptodate(folio)))
return 0;
iomap_set_range_uptodate(folio, iop, offset_in_folio(folio, pos), len);
- __set_page_dirty_nobuffers(page);
+ filemap_dirty_folio(inode->i_mapping, folio);
return copied;
}
@@ -680,9 +681,10 @@ static size_t iomap_write_end_inline(struct inode *inode, struct page *page,
/* Returns the number of bytes copied. May be 0. Cannot be an errno. */
static size_t iomap_write_end(struct inode *inode, loff_t pos, size_t len,
- size_t copied, struct page *page, struct iomap *iomap,
+ size_t copied, struct folio *folio, struct iomap *iomap,
struct iomap *srcmap)
{
+ struct page *page = folio_file_page(folio, pos / PAGE_SIZE);
const struct iomap_page_ops *page_ops = iomap->page_ops;
loff_t old_size = inode->i_size;
size_t ret;
@@ -693,7 +695,7 @@ static size_t iomap_write_end(struct inode *inode, loff_t pos, size_t len,
ret = block_write_end(NULL, inode->i_mapping, pos, len, copied,
page, NULL);
} else {
- ret = __iomap_write_end(inode, pos, len, copied, page);
+ ret = __iomap_write_end(inode, pos, len, copied, folio);
}
/*
@@ -705,13 +707,13 @@ static size_t iomap_write_end(struct inode *inode, loff_t pos, size_t len,
i_size_write(inode, pos + ret);
iomap->flags |= IOMAP_F_SIZE_CHANGED;
}
- unlock_page(page);
+ folio_unlock(folio);
if (old_size < pos)
pagecache_isize_extended(inode, old_size, pos);
if (page_ops && page_ops->page_done)
page_ops->page_done(inode, pos, ret, page, iomap);
- put_page(page);
+ folio_put(folio);
if (ret < len)
iomap_write_failed(inode, pos, len);
@@ -727,6 +729,7 @@ iomap_write_actor(struct inode *inode, loff_t pos, loff_t length, void *data,
ssize_t written = 0;
do {
+ struct folio *folio;
struct page *page;
unsigned long offset; /* Offset into pagecache page */
unsigned long bytes; /* Bytes to write to page */
@@ -750,18 +753,19 @@ iomap_write_actor(struct inode *inode, loff_t pos, loff_t length, void *data,
break;
}
- status = iomap_write_begin(inode, pos, bytes, 0, &page, iomap,
+ status = iomap_write_begin(inode, pos, bytes, 0, &folio, iomap,
srcmap);
if (unlikely(status))
break;
+ page = folio_file_page(folio, pos / PAGE_SIZE);
if (mapping_writably_mapped(inode->i_mapping))
flush_dcache_page(page);
copied = copy_page_from_iter_atomic(page, offset, bytes, i);
- status = iomap_write_end(inode, pos, bytes, copied, page, iomap,
- srcmap);
+ status = iomap_write_end(inode, pos, bytes, copied, folio,
+ iomap, srcmap);
if (unlikely(copied != status))
iov_iter_revert(i, copied - status);
@@ -825,14 +829,14 @@ iomap_unshare_actor(struct inode *inode, loff_t pos, loff_t length, void *data,
do {
unsigned long offset = offset_in_page(pos);
unsigned long bytes = min_t(loff_t, PAGE_SIZE - offset, length);
- struct page *page;
+ struct folio *folio;
status = iomap_write_begin(inode, pos, bytes,
- IOMAP_WRITE_F_UNSHARE, &page, iomap, srcmap);
+ IOMAP_WRITE_F_UNSHARE, &folio, iomap, srcmap);
if (unlikely(status))
return status;
- status = iomap_write_end(inode, pos, bytes, bytes, page, iomap,
+ status = iomap_write_end(inode, pos, bytes, bytes, folio, iomap,
srcmap);
if (WARN_ON_ONCE(status == 0))
return -EIO;
@@ -871,19 +875,19 @@ EXPORT_SYMBOL_GPL(iomap_file_unshare);
static s64 iomap_zero(struct inode *inode, loff_t pos, u64 length,
struct iomap *iomap, struct iomap *srcmap)
{
- struct page *page;
+ struct folio *folio;
int status;
unsigned offset = offset_in_page(pos);
unsigned bytes = min_t(u64, PAGE_SIZE - offset, length);
- status = iomap_write_begin(inode, pos, bytes, 0, &page, iomap, srcmap);
+ status = iomap_write_begin(inode, pos, bytes, 0, &folio, iomap, srcmap);
if (status)
return status;
- zero_user(page, offset, bytes);
- mark_page_accessed(page);
+ zero_user(folio_file_page(folio, pos / PAGE_SIZE), offset, bytes);
+ folio_mark_accessed(folio);
- return iomap_write_end(inode, pos, bytes, bytes, page, iomap, srcmap);
+ return iomap_write_end(inode, pos, bytes, bytes, folio, iomap, srcmap);
}
static loff_t iomap_zero_range_actor(struct inode *inode, loff_t pos,
--
2.30.2
Inline data is restricted to being less than a page in size, so we
don't need to handle multi-page folios.
Signed-off-by: Matthew Wilcox (Oracle) <[email protected]>
---
fs/iomap/buffered-io.c | 18 +++++++++---------
1 file changed, 9 insertions(+), 9 deletions(-)
diff --git a/fs/iomap/buffered-io.c b/fs/iomap/buffered-io.c
index 5e0aa23d4693..c616ef1feb21 100644
--- a/fs/iomap/buffered-io.c
+++ b/fs/iomap/buffered-io.c
@@ -194,24 +194,24 @@ struct iomap_readpage_ctx {
struct readahead_control *rac;
};
-static void
-iomap_read_inline_data(struct inode *inode, struct page *page,
+static void iomap_read_inline_data(struct inode *inode, struct folio *folio,
struct iomap *iomap)
{
size_t size = i_size_read(inode);
void *addr;
- if (PageUptodate(page))
+ if (folio_test_uptodate(folio))
return;
- BUG_ON(page->index);
+ BUG_ON(folio->index);
+ BUG_ON(folio_multi(folio));
BUG_ON(size > PAGE_SIZE - offset_in_page(iomap->inline_data));
- addr = kmap_atomic(page);
+ addr = kmap_local_folio(folio, 0);
memcpy(addr, iomap->inline_data, size);
memset(addr + size, 0, PAGE_SIZE - size);
- kunmap_atomic(addr);
- SetPageUptodate(page);
+ kunmap_local(addr);
+ folio_mark_uptodate(folio);
}
static inline bool iomap_block_needs_zeroing(struct inode *inode,
@@ -236,7 +236,7 @@ iomap_readpage_actor(struct inode *inode, loff_t pos, loff_t length, void *data,
if (iomap->type == IOMAP_INLINE) {
WARN_ON_ONCE(pos);
- iomap_read_inline_data(inode, &folio->page, iomap);
+ iomap_read_inline_data(inode, folio, iomap);
return PAGE_SIZE;
}
@@ -614,7 +614,7 @@ static int iomap_write_begin(struct inode *inode, loff_t pos, size_t len,
page = folio_file_page(folio, pos >> PAGE_SHIFT);
if (srcmap->type == IOMAP_INLINE)
- iomap_read_inline_data(inode, page, srcmap);
+ iomap_read_inline_data(inode, folio, srcmap);
else if (iomap->flags & IOMAP_F_BUFFER_HEAD)
status = __block_write_begin_int(page, pos, len, NULL, srcmap);
else
--
2.30.2
One of the callers already had a folio; the other two grow by a few
bytes, but the renamed filemap_read_folio() is 50 bytes smaller than
filemap_read_page() was, for a net reduction of 27 bytes.
Signed-off-by: Matthew Wilcox (Oracle) <[email protected]>
---
mm/filemap.c | 18 +++++++++---------
1 file changed, 9 insertions(+), 9 deletions(-)
diff --git a/mm/filemap.c b/mm/filemap.c
index 5a273d07eae6..5e2a2db1c715 100644
--- a/mm/filemap.c
+++ b/mm/filemap.c
@@ -2302,8 +2302,8 @@ static void filemap_get_read_batch(struct address_space *mapping,
rcu_read_unlock();
}
-static int filemap_read_page(struct file *file, struct address_space *mapping,
- struct page *page)
+static int filemap_read_folio(struct file *file, struct address_space *mapping,
+ struct folio *folio)
{
int error;
@@ -2312,16 +2312,16 @@ static int filemap_read_page(struct file *file, struct address_space *mapping,
* eg. multipath errors. PG_error will be set again if readpage
* fails.
*/
- ClearPageError(page);
+ folio_clear_error(folio);
/* Start the actual read. The read will unlock the page. */
- error = mapping->a_ops->readpage(file, page);
+ error = mapping->a_ops->readpage(file, &folio->page);
if (error)
return error;
- error = wait_on_page_locked_killable(page);
+ error = folio_wait_locked_killable(folio);
if (error)
return error;
- if (PageUptodate(page))
+ if (folio_test_uptodate(folio))
return 0;
shrink_readahead_size_eio(&file->f_ra);
return -EIO;
@@ -2383,7 +2383,7 @@ static int filemap_update_page(struct kiocb *iocb,
if (iocb->ki_flags & (IOCB_NOIO | IOCB_NOWAIT | IOCB_WAITQ))
goto unlock;
- error = filemap_read_page(iocb->ki_filp, mapping, &folio->page);
+ error = filemap_read_folio(iocb->ki_filp, mapping, folio);
if (error == AOP_TRUNCATED_PAGE)
folio_put(folio);
return error;
@@ -2414,7 +2414,7 @@ static int filemap_create_page(struct file *file,
if (error)
goto error;
- error = filemap_read_page(file, mapping, page);
+ error = filemap_read_folio(file, mapping, page_folio(page));
if (error)
goto error;
@@ -3043,7 +3043,7 @@ vm_fault_t filemap_fault(struct vm_fault *vmf)
* and we need to check for errors.
*/
fpin = maybe_unlock_mmap_for_io(vmf, fpin);
- error = filemap_read_page(file, mapping, page);
+ error = filemap_read_folio(file, mapping, page_folio(page));
if (fpin)
goto out_retry;
put_page(page);
--
2.30.2
The arguments are still pages for now, but we can use folios internally
and cut out a lot of calls to compound_head().
Signed-off-by: Matthew Wilcox (Oracle) <[email protected]>
---
fs/iomap/buffered-io.c | 12 +++++++-----
1 file changed, 7 insertions(+), 5 deletions(-)
diff --git a/fs/iomap/buffered-io.c b/fs/iomap/buffered-io.c
index 0731e2c3f44b..48de198c5603 100644
--- a/fs/iomap/buffered-io.c
+++ b/fs/iomap/buffered-io.c
@@ -490,19 +490,21 @@ int
iomap_migrate_page(struct address_space *mapping, struct page *newpage,
struct page *page, enum migrate_mode mode)
{
+ struct folio *folio = page_folio(page);
+ struct folio *newfolio = page_folio(newpage);
int ret;
- ret = migrate_page_move_mapping(mapping, newpage, page, 0);
+ ret = folio_migrate_mapping(mapping, newfolio, folio, 0);
if (ret != MIGRATEPAGE_SUCCESS)
return ret;
- if (page_has_private(page))
- attach_page_private(newpage, detach_page_private(page));
+ if (folio_test_private(folio))
+ folio_attach_private(newfolio, folio_detach_private(folio));
if (mode != MIGRATE_SYNC_NO_COPY)
- migrate_page_copy(newpage, page);
+ folio_migrate_copy(newfolio, folio);
else
- migrate_page_states(newpage, page);
+ folio_migrate_flags(newfolio, folio);
return MIGRATEPAGE_SUCCESS;
}
EXPORT_SYMBOL_GPL(iomap_migrate_page);
--
2.30.2
We still only operate on a single page of data at a time due to using
kmap(). A more complex implementation would work on each page in a folio,
but it's not clear that such a complex implementation would be worthwhile.
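For illustration only, the kmap_local_folio()/kunmap_local() pattern used
below can be summed up as follows (hypothetical helper; @len must not
cross a page boundary within either folio):

	static bool example_folio_bytes_equal(struct folio *f1, size_t off1,
			struct folio *f2, size_t off2, size_t len)
	{
		/* Each mapping covers at most one page of its folio. */
		void *a = kmap_local_folio(f1, off1);
		void *b = kmap_local_folio(f2, off2);
		bool same = !memcmp(a, b, len);

		kunmap_local(b);
		kunmap_local(a);
		return same;
	}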
Signed-off-by: Matthew Wilcox (Oracle) <[email protected]>
---
fs/remap_range.c | 116 ++++++++++++++++++++++-------------------------
1 file changed, 55 insertions(+), 61 deletions(-)
diff --git a/fs/remap_range.c b/fs/remap_range.c
index e4a5fdd7ad7b..886e6ed2c6c2 100644
--- a/fs/remap_range.c
+++ b/fs/remap_range.c
@@ -158,41 +158,41 @@ static int generic_remap_check_len(struct inode *inode_in,
}
/* Read a page's worth of file data into the page cache. */
-static struct page *vfs_dedupe_get_page(struct inode *inode, loff_t offset)
+static struct folio *vfs_dedupe_get_folio(struct inode *inode, loff_t pos)
{
- struct page *page;
+ struct folio *folio;
- page = read_mapping_page(inode->i_mapping, offset >> PAGE_SHIFT, NULL);
- if (IS_ERR(page))
- return page;
- if (!PageUptodate(page)) {
- put_page(page);
+ folio = read_mapping_folio(inode->i_mapping, pos >> PAGE_SHIFT, NULL);
+ if (IS_ERR(folio))
+ return folio;
+ if (!folio_test_uptodate(folio)) {
+ folio_put(folio);
return ERR_PTR(-EIO);
}
- return page;
+ return folio;
}
/*
- * Lock two pages, ensuring that we lock in offset order if the pages are from
- * the same file.
+ * Lock two folios, ensuring that we lock in offset order if the folios
+ * are from the same file.
*/
-static void vfs_lock_two_pages(struct page *page1, struct page *page2)
+static void vfs_lock_two_folios(struct folio *folio1, struct folio *folio2)
{
/* Always lock in order of increasing index. */
- if (page1->index > page2->index)
- swap(page1, page2);
+ if (folio1->index > folio2->index)
+ swap(folio1, folio2);
- lock_page(page1);
- if (page1 != page2)
- lock_page(page2);
+ folio_lock(folio1);
+ if (folio1 != folio2)
+ folio_lock(folio2);
}
-/* Unlock two pages, being careful not to unlock the same page twice. */
-static void vfs_unlock_two_pages(struct page *page1, struct page *page2)
+/* Unlock two folios, being careful not to unlock the same folio twice. */
+static void vfs_unlock_two_folios(struct folio *folio1, struct folio *folio2)
{
- unlock_page(page1);
- if (page1 != page2)
- unlock_page(page2);
+ folio_unlock(folio1);
+ if (folio1 != folio2)
+ folio_unlock(folio2);
}
/*
@@ -200,77 +200,71 @@ static void vfs_unlock_two_pages(struct page *page1, struct page *page2)
* Caller must have locked both inodes to prevent write races.
*/
static int vfs_dedupe_file_range_compare(struct inode *src, loff_t srcoff,
- struct inode *dest, loff_t destoff,
+ struct inode *dest, loff_t dstoff,
loff_t len, bool *is_same)
{
- loff_t src_poff;
- loff_t dest_poff;
- void *src_addr;
- void *dest_addr;
- struct page *src_page;
- struct page *dest_page;
- loff_t cmp_len;
- bool same;
- int error;
-
- error = -EINVAL;
- same = true;
+ bool same = true;
+ int error = -EINVAL;
+
while (len) {
- src_poff = srcoff & (PAGE_SIZE - 1);
- dest_poff = destoff & (PAGE_SIZE - 1);
- cmp_len = min(PAGE_SIZE - src_poff,
- PAGE_SIZE - dest_poff);
+ struct folio *src_folio, *dst_folio;
+ void *src_addr, *dst_addr;
+ loff_t cmp_len = min(PAGE_SIZE - offset_in_page(srcoff),
+ PAGE_SIZE - offset_in_page(dstoff));
+
cmp_len = min(cmp_len, len);
if (cmp_len <= 0)
goto out_error;
- src_page = vfs_dedupe_get_page(src, srcoff);
- if (IS_ERR(src_page)) {
- error = PTR_ERR(src_page);
+ src_folio = vfs_dedupe_get_folio(src, srcoff);
+ if (IS_ERR(src_folio)) {
+ error = PTR_ERR(src_folio);
goto out_error;
}
- dest_page = vfs_dedupe_get_page(dest, destoff);
- if (IS_ERR(dest_page)) {
- error = PTR_ERR(dest_page);
- put_page(src_page);
+ dst_folio = vfs_dedupe_get_folio(dest, dstoff);
+ if (IS_ERR(dst_folio)) {
+ error = PTR_ERR(dst_folio);
+ folio_put(src_folio);
goto out_error;
}
- vfs_lock_two_pages(src_page, dest_page);
+ vfs_lock_two_folios(src_folio, dst_folio);
/*
- * Now that we've locked both pages, make sure they're still
+ * Now that we've locked both folios, make sure they're still
* mapped to the file data we're interested in. If not,
* someone is invalidating pages on us and we lose.
*/
- if (!PageUptodate(src_page) || !PageUptodate(dest_page) ||
- src_page->mapping != src->i_mapping ||
- dest_page->mapping != dest->i_mapping) {
+ if (!folio_test_uptodate(src_folio) || !folio_test_uptodate(dst_folio) ||
+ src_folio->mapping != src->i_mapping ||
+ dst_folio->mapping != dest->i_mapping) {
same = false;
goto unlock;
}
- src_addr = kmap_atomic(src_page);
- dest_addr = kmap_atomic(dest_page);
+ src_addr = kmap_local_folio(src_folio,
+ offset_in_folio(src_folio, srcoff));
+ dst_addr = kmap_local_folio(dst_folio,
+ offset_in_folio(dst_folio, dstoff));
- flush_dcache_page(src_page);
- flush_dcache_page(dest_page);
+ flush_dcache_folio(src_folio);
+ flush_dcache_folio(dst_folio);
- if (memcmp(src_addr + src_poff, dest_addr + dest_poff, cmp_len))
+ if (memcmp(src_addr, dst_addr, cmp_len))
same = false;
- kunmap_atomic(dest_addr);
- kunmap_atomic(src_addr);
+ kunmap_local(dst_addr);
+ kunmap_local(src_addr);
unlock:
- vfs_unlock_two_pages(src_page, dest_page);
- put_page(dest_page);
- put_page(src_page);
+ vfs_unlock_two_folios(src_folio, dst_folio);
+ folio_put(dst_folio);
+ folio_put(src_folio);
if (!same)
break;
srcoff += cmp_len;
- destoff += cmp_len;
+ dstoff += cmp_len;
len -= cmp_len;
}
--
2.30.2
invalidate_complete_folio2() currently open-codes filemap_free_folio(),
except for the part where it handles THP. Rather than adding that,
call filemap_free_folio() from invalidate_complete_folio2().
Signed-off-by: Matthew Wilcox (Oracle) <[email protected]>
---
mm/filemap.c | 3 +--
mm/internal.h | 1 +
mm/truncate.c | 5 +----
3 files changed, 3 insertions(+), 6 deletions(-)
diff --git a/mm/filemap.c b/mm/filemap.c
index 97d17e8c76aa..d5787502c3be 100644
--- a/mm/filemap.c
+++ b/mm/filemap.c
@@ -228,8 +228,7 @@ void __filemap_remove_folio(struct folio *folio, void *shadow)
page_cache_delete(mapping, folio, shadow);
}
-static void filemap_free_folio(struct address_space *mapping,
- struct folio *folio)
+void filemap_free_folio(struct address_space *mapping, struct folio *folio)
{
void (*freepage)(struct page *);
diff --git a/mm/internal.h b/mm/internal.h
index 3e32064df18d..d63ef2595eff 100644
--- a/mm/internal.h
+++ b/mm/internal.h
@@ -73,6 +73,7 @@ unsigned find_lock_entries(struct address_space *mapping, pgoff_t start,
unsigned find_get_entries(struct address_space *mapping, pgoff_t start,
pgoff_t end, struct pagevec *pvec, pgoff_t *indices);
bool truncate_inode_partial_page(struct page *page, loff_t start, loff_t end);
+void filemap_free_folio(struct address_space *mapping, struct folio *folio);
/**
* folio_evictable - Test whether a folio is evictable.
diff --git a/mm/truncate.c b/mm/truncate.c
index d068f22fe422..e000402e817b 100644
--- a/mm/truncate.c
+++ b/mm/truncate.c
@@ -619,10 +619,7 @@ static int invalidate_complete_folio2(struct address_space *mapping,
__filemap_remove_folio(folio, NULL);
xa_unlock_irqrestore(&mapping->i_pages, flags);
- if (mapping->a_ops->freepage)
- mapping->a_ops->freepage(&folio->page);
-
- folio_ref_sub(folio, folio_nr_pages(folio)); /* pagecache ref */
+ filemap_free_folio(mapping, folio);
return 1;
failed:
xa_unlock_irqrestore(&mapping->i_pages, flags);
--
2.30.2
We have to allocate memory in order to split a file-backed page,
so it's not a good idea to split one. It also doesn't work for XFS
because pages have an extra reference count from page_has_private() and
split_huge_page() expects that reference to have already been removed.
Unfortunately, we still have to split shmem THPs because we can't handle
swapping out an entire THP yet.
Signed-off-by: Matthew Wilcox (Oracle) <[email protected]>
---
mm/vmscan.c | 4 ++--
1 file changed, 2 insertions(+), 2 deletions(-)
diff --git a/mm/vmscan.c b/mm/vmscan.c
index 7a2f25b904d9..8b17e46dbf32 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -1470,8 +1470,8 @@ static unsigned int shrink_page_list(struct list_head *page_list,
/* Adding to swap updated mapping */
mapping = page_mapping(page);
}
- } else if (unlikely(PageTransHuge(page))) {
- /* Split file THP */
+ } else if (PageSwapBacked(page) && PageTransHuge(page)) {
+ /* Split shmem THP */
if (split_huge_page_to_list(page, page_list))
goto keep_locked;
}
--
2.30.2
Reimplement read_cache_page() as a wrapper around read_cache_folio().
Saves over 400 bytes of text from do_read_cache_folio(), which more
than makes up for the extra 100 bytes of text added to the various
wrapper functions.
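For illustration only, a hypothetical filesystem caller might switch
from read_mapping_page() to the folio version like so (mirroring the
pattern used by vfs_dedupe_get_folio() earlier in this series):

	static struct folio *example_read_folio(struct address_space *mapping,
			pgoff_t index)
	{
		struct folio *folio = read_mapping_folio(mapping, index, NULL);

		if (IS_ERR(folio))
			return folio;
		if (!folio_test_uptodate(folio)) {
			folio_put(folio);
			return ERR_PTR(-EIO);
		}
		return folio;
	}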
Signed-off-by: Matthew Wilcox (Oracle) <[email protected]>
---
include/linux/pagemap.h | 12 +++++-
mm/filemap.c | 95 +++++++++++++++++++++--------------------
2 files changed, 59 insertions(+), 48 deletions(-)
diff --git a/include/linux/pagemap.h b/include/linux/pagemap.h
index 245554ce6b12..842d130fd6d3 100644
--- a/include/linux/pagemap.h
+++ b/include/linux/pagemap.h
@@ -539,8 +539,10 @@ static inline struct page *grab_cache_page(struct address_space *mapping,
return find_or_create_page(mapping, index, mapping_gfp_mask(mapping));
}
-extern struct page * read_cache_page(struct address_space *mapping,
- pgoff_t index, filler_t *filler, void *data);
+struct folio *read_cache_folio(struct address_space *, pgoff_t index,
+ filler_t *filler, void *data);
+struct page *read_cache_page(struct address_space *, pgoff_t index,
+ filler_t *filler, void *data);
extern struct page * read_cache_page_gfp(struct address_space *mapping,
pgoff_t index, gfp_t gfp_mask);
extern int read_cache_pages(struct address_space *mapping,
@@ -552,6 +554,12 @@ static inline struct page *read_mapping_page(struct address_space *mapping,
return read_cache_page(mapping, index, NULL, data);
}
+static inline struct folio *read_mapping_folio(struct address_space *mapping,
+ pgoff_t index, void *data)
+{
+ return read_cache_folio(mapping, index, NULL, data);
+}
+
/*
* Get index of the page within radix-tree (but not for hugetlb pages).
* (TODO: remove once hugetlb pages will have ->index in PAGE_SIZE)
diff --git a/mm/filemap.c b/mm/filemap.c
index b0fe3234a20b..dd54a52f8e84 100644
--- a/mm/filemap.c
+++ b/mm/filemap.c
@@ -3298,35 +3298,20 @@ EXPORT_SYMBOL(filemap_page_mkwrite);
EXPORT_SYMBOL(generic_file_mmap);
EXPORT_SYMBOL(generic_file_readonly_mmap);
-static struct page *wait_on_page_read(struct page *page)
+static struct folio *do_read_cache_folio(struct address_space *mapping,
+ pgoff_t index, filler_t filler, void *data, gfp_t gfp)
{
- if (!IS_ERR(page)) {
- wait_on_page_locked(page);
- if (!PageUptodate(page)) {
- put_page(page);
- page = ERR_PTR(-EIO);
- }
- }
- return page;
-}
-
-static struct page *do_read_cache_page(struct address_space *mapping,
- pgoff_t index,
- int (*filler)(void *, struct page *),
- void *data,
- gfp_t gfp)
-{
- struct page *page;
+ struct folio *folio;
int err;
repeat:
- page = find_get_page(mapping, index);
- if (!page) {
- page = __page_cache_alloc(gfp);
- if (!page)
+ folio = filemap_get_folio(mapping, index);
+ if (!folio) {
+ folio = filemap_alloc_folio(gfp, 0);
+ if (!folio)
return ERR_PTR(-ENOMEM);
- err = add_to_page_cache_lru(page, mapping, index, gfp);
+ err = filemap_add_folio(mapping, folio, index, gfp);
if (unlikely(err)) {
- put_page(page);
+ folio_put(folio);
if (err == -EEXIST)
goto repeat;
/* Presumably ENOMEM for xarray node */
@@ -3335,21 +3320,24 @@ static struct page *do_read_cache_page(struct address_space *mapping,
filler:
if (filler)
- err = filler(data, page);
+ err = filler(data, &folio->page);
else
- err = mapping->a_ops->readpage(data, page);
+ err = mapping->a_ops->readpage(data, &folio->page);
if (err < 0) {
- put_page(page);
+ folio_put(folio);
return ERR_PTR(err);
}
- page = wait_on_page_read(page);
- if (IS_ERR(page))
- return page;
+ folio_wait_locked(folio);
+ if (!folio_test_uptodate(folio)) {
+ folio_put(folio);
+ return ERR_PTR(-EIO);
+ }
+
goto out;
}
- if (PageUptodate(page))
+ if (folio_test_uptodate(folio))
goto out;
/*
@@ -3383,23 +3371,23 @@ static struct page *do_read_cache_page(struct address_space *mapping,
* avoid spurious serialisations and wakeups when multiple processes
* wait on the same page for IO to complete.
*/
- wait_on_page_locked(page);
- if (PageUptodate(page))
+ folio_wait_locked(folio);
+ if (folio_test_uptodate(folio))
goto out;
/* Distinguish between all the cases under the safety of the lock */
- lock_page(page);
+ folio_lock(folio);
/* Case c or d, restart the operation */
- if (!page->mapping) {
- unlock_page(page);
- put_page(page);
+ if (!folio->mapping) {
+ folio_unlock(folio);
+ folio_put(folio);
goto repeat;
}
/* Someone else locked and filled the page in a very small window */
- if (PageUptodate(page)) {
- unlock_page(page);
+ if (folio_test_uptodate(folio)) {
+ folio_unlock(folio);
goto out;
}
@@ -3409,16 +3397,16 @@ static struct page *do_read_cache_page(struct address_space *mapping,
* Clear page error before actual read, PG_error will be
* set again if read page fails.
*/
- ClearPageError(page);
+ folio_clear_error(folio);
goto filler;
out:
- mark_page_accessed(page);
- return page;
+ folio_mark_accessed(folio);
+ return folio;
}
/**
- * read_cache_page - read into page cache, fill it if needed
+ * read_cache_folio - read into page cache, fill it if needed
* @mapping: the page's address_space
* @index: the page index
* @filler: function to perform the read
@@ -3431,10 +3419,25 @@ static struct page *do_read_cache_page(struct address_space *mapping,
*
* Return: up to date page on success, ERR_PTR() on failure.
*/
+struct folio *read_cache_folio(struct address_space *mapping, pgoff_t index,
+ filler_t filler, void *data)
+{
+ return do_read_cache_folio(mapping, index, filler, data,
+ mapping_gfp_mask(mapping));
+}
+EXPORT_SYMBOL(read_cache_folio);
+
+static struct page *do_read_cache_page(struct address_space *mapping,
+ pgoff_t index, filler_t *filler, void *data, gfp_t gfp)
+{
+ struct folio *folio = do_read_cache_folio(mapping, index, filler, data, gfp);
+ if (IS_ERR(folio))
+ return &folio->page;
+ return folio_file_page(folio, index);
+}
+
struct page *read_cache_page(struct address_space *mapping,
- pgoff_t index,
- int (*filler)(void *, struct page *),
- void *data)
+ pgoff_t index, filler_t *filler, void *data)
{
return do_read_cache_page(mapping, index, filler, data,
mapping_gfp_mask(mapping));
--
2.30.2
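A minimal sketch of a caller for the read_cache_folio()/read_mapping_folio() helpers added above; the function below is hypothetical, not part of the series, and assumes the declarations from <linux/pagemap.h>:

	/* Hypothetical caller: read one folio through the page cache and drop it. */
	static int example_read_one_folio(struct address_space *mapping, pgoff_t index)
	{
		struct folio *folio = read_mapping_folio(mapping, index, NULL);

		if (IS_ERR(folio))
			return PTR_ERR(folio);
		/* On success the folio is uptodate, marked accessed, and referenced. */
		folio_put(folio);
		return 0;
	}

Callers that want the precise page for an index within a multi-page folio can use folio_file_page(folio, index), which is what the page-based read_cache_page() wrapper above now does.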
Saves one call to compound_head() and reduces text size by 15 bytes.
Signed-off-by: Matthew Wilcox (Oracle) <[email protected]>
---
mm/filemap.c | 21 +++++++++++----------
1 file changed, 11 insertions(+), 10 deletions(-)
diff --git a/mm/filemap.c b/mm/filemap.c
index 18b21d16c9de..8510f67dc749 100644
--- a/mm/filemap.c
+++ b/mm/filemap.c
@@ -287,15 +287,15 @@ static void page_cache_delete_batch(struct address_space *mapping,
XA_STATE(xas, &mapping->i_pages, pvec->pages[0]->index);
int total_pages = 0;
int i = 0;
- struct page *page;
+ struct folio *folio;
mapping_set_update(&xas, mapping);
- xas_for_each(&xas, page, ULONG_MAX) {
+ xas_for_each(&xas, folio, ULONG_MAX) {
if (i >= pagevec_count(pvec))
break;
/* A swap/dax/shadow entry got inserted? Skip it. */
- if (xa_is_value(page))
+ if (xa_is_value(folio))
continue;
/*
* A page got inserted in our range? Skip it. We have our
@@ -304,16 +304,16 @@ static void page_cache_delete_batch(struct address_space *mapping,
* means our page has been removed, which shouldn't be
* possible because we're holding the PageLock.
*/
- if (page != pvec->pages[i]) {
- VM_BUG_ON_PAGE(page->index > pvec->pages[i]->index,
- page);
+ if (&folio->page != pvec->pages[i]) {
+ VM_BUG_ON_FOLIO(folio->index >
+ pvec->pages[i]->index, folio);
continue;
}
- WARN_ON_ONCE(!PageLocked(page));
+ WARN_ON_ONCE(!folio_test_locked(folio));
- if (page->index == xas.xa_index)
- page->mapping = NULL;
+ if (folio->index == xas.xa_index)
+ folio->mapping = NULL;
/* Leave page->index set: truncation lookup relies on it */
/*
@@ -321,7 +321,8 @@ static void page_cache_delete_batch(struct address_space *mapping,
* page or the index is of the last sub-page of this compound
* page.
*/
- if (page->index + compound_nr(page) - 1 == xas.xa_index)
+ if (folio->index + folio_nr_pages(folio) - 1 ==
+ xas.xa_index)
i++;
xas_store(&xas, NULL);
total_pages++;
--
2.30.2
Pages are individually marked as suffering from hardware poisoning.
Checking that the head page is not hardware poisoned doesn't make
sense; we might be after a subpage. We check each page individually
before we use it, so this was an optimisation gone wrong.
Signed-off-by: Matthew Wilcox (Oracle) <[email protected]>
---
mm/filemap.c | 2 --
1 file changed, 2 deletions(-)
diff --git a/mm/filemap.c b/mm/filemap.c
index 8510f67dc749..717b0d262306 100644
--- a/mm/filemap.c
+++ b/mm/filemap.c
@@ -3127,8 +3127,6 @@ static struct page *next_uptodate_page(struct page *page,
goto skip;
if (!PageUptodate(page) || PageReadahead(page))
goto skip;
- if (PageHWPoison(page))
- goto skip;
if (!trylock_page(page))
goto skip;
if (page->mapping != mapping)
--
2.30.2
Saves 61 bytes due to fewer calls to compound_head().
Signed-off-by: Matthew Wilcox (Oracle) <[email protected]>
---
mm/filemap.c | 27 ++++++++++++++-------------
1 file changed, 14 insertions(+), 13 deletions(-)
diff --git a/mm/filemap.c b/mm/filemap.c
index 545323a77c1c..1c0c2663c57d 100644
--- a/mm/filemap.c
+++ b/mm/filemap.c
@@ -3105,7 +3105,7 @@ static bool filemap_map_pmd(struct vm_fault *vmf, struct page *page)
return false;
}
-static struct page *next_uptodate_page(struct folio *folio,
+static struct folio *next_uptodate_page(struct folio *folio,
struct address_space *mapping,
struct xa_state *xas, pgoff_t end_pgoff)
{
@@ -3136,7 +3136,7 @@ static struct page *next_uptodate_page(struct folio *folio,
max_idx = DIV_ROUND_UP(i_size_read(mapping->host), PAGE_SIZE);
if (xas->xa_index >= max_idx)
goto unlock;
- return &folio->page;
+ return folio;
unlock:
folio_unlock(folio);
skip:
@@ -3146,7 +3146,7 @@ static struct page *next_uptodate_page(struct folio *folio,
return NULL;
}
-static inline struct page *first_map_page(struct address_space *mapping,
+static inline struct folio *first_map_page(struct address_space *mapping,
struct xa_state *xas,
pgoff_t end_pgoff)
{
@@ -3154,7 +3154,7 @@ static inline struct page *first_map_page(struct address_space *mapping,
mapping, xas, end_pgoff);
}
-static inline struct page *next_map_page(struct address_space *mapping,
+static inline struct folio *next_map_page(struct address_space *mapping,
struct xa_state *xas,
pgoff_t end_pgoff)
{
@@ -3171,16 +3171,17 @@ vm_fault_t filemap_map_pages(struct vm_fault *vmf,
pgoff_t last_pgoff = start_pgoff;
unsigned long addr;
XA_STATE(xas, &mapping->i_pages, start_pgoff);
- struct page *head, *page;
+ struct folio *folio;
+ struct page *page;
unsigned int mmap_miss = READ_ONCE(file->f_ra.mmap_miss);
vm_fault_t ret = 0;
rcu_read_lock();
- head = first_map_page(mapping, &xas, end_pgoff);
- if (!head)
+ folio = first_map_page(mapping, &xas, end_pgoff);
+ if (!folio)
goto out;
- if (filemap_map_pmd(vmf, head)) {
+ if (filemap_map_pmd(vmf, &folio->page)) {
ret = VM_FAULT_NOPAGE;
goto out;
}
@@ -3188,7 +3189,7 @@ vm_fault_t filemap_map_pages(struct vm_fault *vmf,
addr = vma->vm_start + ((start_pgoff - vma->vm_pgoff) << PAGE_SHIFT);
vmf->pte = pte_offset_map_lock(vma->vm_mm, vmf->pmd, addr, &vmf->ptl);
do {
- page = find_subpage(head, xas.xa_index);
+ page = folio_file_page(folio, xas.xa_index);
if (PageHWPoison(page))
goto unlock;
@@ -3209,12 +3210,12 @@ vm_fault_t filemap_map_pages(struct vm_fault *vmf,
do_set_pte(vmf, page, addr);
/* no need to invalidate: a not-present page won't be cached */
update_mmu_cache(vma, addr, vmf->pte);
- unlock_page(head);
+ folio_unlock(folio);
continue;
unlock:
- unlock_page(head);
- put_page(head);
- } while ((head = next_map_page(mapping, &xas, end_pgoff)) != NULL);
+ folio_unlock(folio);
+ folio_put(folio);
+ } while ((folio = next_map_page(mapping, &xas, end_pgoff)) != NULL);
pte_unmap_unlock(vmf->pte, vmf->ptl);
out:
rcu_read_unlock();
--
2.30.2
There is one place which assumes the size of a page; fix it.
Signed-off-by: Matthew Wilcox (Oracle) <[email protected]>
---
fs/xfs/xfs_aops.c | 11 ++++++-----
fs/xfs/xfs_super.c | 3 ++-
2 files changed, 8 insertions(+), 6 deletions(-)
diff --git a/fs/xfs/xfs_aops.c b/fs/xfs/xfs_aops.c
index cb4e0fcf4c76..9ffbd116592a 100644
--- a/fs/xfs/xfs_aops.c
+++ b/fs/xfs/xfs_aops.c
@@ -432,10 +432,11 @@ xfs_discard_page(
struct page *page,
loff_t fileoff)
{
- struct inode *inode = page->mapping->host;
+ struct folio *folio = page_folio(page);
+ struct inode *inode = folio->mapping->host;
struct xfs_inode *ip = XFS_I(inode);
struct xfs_mount *mp = ip->i_mount;
- unsigned int pageoff = offset_in_page(fileoff);
+ size_t pageoff = offset_in_folio(folio, fileoff);
xfs_fileoff_t start_fsb = XFS_B_TO_FSBT(mp, fileoff);
xfs_fileoff_t pageoff_fsb = XFS_B_TO_FSBT(mp, pageoff);
int error;
@@ -445,14 +446,14 @@ xfs_discard_page(
xfs_alert_ratelimited(mp,
"page discard on page "PTR_FMT", inode 0x%llx, offset %llu.",
- page, ip->i_ino, fileoff);
+ folio, ip->i_ino, fileoff);
error = xfs_bmap_punch_delalloc_range(ip, start_fsb,
- i_blocks_per_page(inode, page) - pageoff_fsb);
+ i_blocks_per_folio(inode, folio) - pageoff_fsb);
if (error && !XFS_FORCED_SHUTDOWN(mp))
xfs_alert(mp, "page discard unable to remove delalloc mapping.");
out_invalidate:
- iomap_invalidatepage(page, pageoff, PAGE_SIZE - pageoff);
+ iomap_invalidatepage(&folio->page, pageoff, folio_size(folio) - pageoff);
}
static const struct iomap_writeback_ops xfs_writeback_ops = {
diff --git a/fs/xfs/xfs_super.c b/fs/xfs/xfs_super.c
index 2c9e26a44546..24adea02b887 100644
--- a/fs/xfs/xfs_super.c
+++ b/fs/xfs/xfs_super.c
@@ -1891,7 +1891,8 @@ static struct file_system_type xfs_fs_type = {
.init_fs_context = xfs_init_fs_context,
.parameters = xfs_fs_parameters,
.kill_sb = kill_block_super,
- .fs_flags = FS_REQUIRES_DEV | FS_ALLOW_IDMAP,
+ .fs_flags = FS_REQUIRES_DEV | FS_ALLOW_IDMAP | \
+ FS_THP_SUPPORT,
};
MODULE_ALIAS_FS("xfs");
--
2.30.2
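To make the fixed assumption concrete (values chosen for illustration, and assuming offset_in_folio() masks against folio_size() - 1 the same way offset_in_page() masks against PAGE_SIZE - 1): for a 2MiB folio starting at file offset 0 and fileoff = 0x103200, offset_in_page(fileoff) is 0x200 while offset_in_folio(folio, fileoff) is 0x103200, so the invalidation length passed to iomap_invalidatepage() becomes folio_size(folio) - pageoff = 0x200000 - 0x103200 rather than PAGE_SIZE - pageoff, covering the rest of the folio instead of the rest of a single 4KiB page.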
This saves a few calls to compound_head(), including one in
filemap_update_page(). Shrinks the kernel by 78 bytes.
Signed-off-by: Matthew Wilcox (Oracle) <[email protected]>
---
mm/filemap.c | 23 +++++++++++------------
1 file changed, 11 insertions(+), 12 deletions(-)
diff --git a/mm/filemap.c b/mm/filemap.c
index dd54a52f8e84..18b21d16c9de 100644
--- a/mm/filemap.c
+++ b/mm/filemap.c
@@ -2355,9 +2355,8 @@ static bool filemap_range_uptodate(struct address_space *mapping,
static int filemap_update_page(struct kiocb *iocb,
struct address_space *mapping, struct iov_iter *iter,
- struct page *page)
+ struct folio *folio)
{
- struct folio *folio = page_folio(page);
int error;
if (!folio_trylock(folio)) {
@@ -2426,13 +2425,13 @@ static int filemap_create_folio(struct file *file,
}
static int filemap_readahead(struct kiocb *iocb, struct file *file,
- struct address_space *mapping, struct page *page,
+ struct address_space *mapping, struct folio *folio,
pgoff_t last_index)
{
if (iocb->ki_flags & IOCB_NOIO)
return -EAGAIN;
- page_cache_async_readahead(mapping, &file->f_ra, file, page,
- page->index, last_index - page->index);
+ page_cache_async_readahead(mapping, &file->f_ra, file, &folio->page,
+ folio->index, last_index - folio->index);
return 0;
}
@@ -2444,7 +2443,7 @@ static int filemap_get_pages(struct kiocb *iocb, struct iov_iter *iter,
struct file_ra_state *ra = &filp->f_ra;
pgoff_t index = iocb->ki_pos >> PAGE_SHIFT;
pgoff_t last_index;
- struct page *page;
+ struct folio *folio;
int err = 0;
last_index = DIV_ROUND_UP(iocb->ki_pos + iter->count, PAGE_SIZE);
@@ -2470,16 +2469,16 @@ static int filemap_get_pages(struct kiocb *iocb, struct iov_iter *iter,
return err;
}
- page = pvec->pages[pagevec_count(pvec) - 1];
- if (PageReadahead(page)) {
- err = filemap_readahead(iocb, filp, mapping, page, last_index);
+ folio = page_folio(pvec->pages[pagevec_count(pvec) - 1]);
+ if (folio_test_readahead(folio)) {
+ err = filemap_readahead(iocb, filp, mapping, folio, last_index);
if (err)
goto err;
}
- if (!PageUptodate(page)) {
+ if (!folio_test_uptodate(folio)) {
if ((iocb->ki_flags & IOCB_WAITQ) && pagevec_count(pvec) > 1)
iocb->ki_flags |= IOCB_NOWAIT;
- err = filemap_update_page(iocb, mapping, iter, page);
+ err = filemap_update_page(iocb, mapping, iter, folio);
if (err)
goto err;
}
@@ -2487,7 +2486,7 @@ static int filemap_get_pages(struct kiocb *iocb, struct iov_iter *iter,
return 0;
err:
if (err < 0)
- put_page(page);
+ folio_put(folio);
if (likely(--pvec->nr))
return 0;
if (err == AOP_TRUNCATED_PAGE)
--
2.30.2
Handle THP splitting in the parts of the truncation functions which
already handle partial pages. Factor all that code out into a new
function called truncate_inode_partial_page().
We lose the easy 'bail out' path if a truncate or hole punch is entirely
within a single page. We can add some more complex logic to restore
the optimisation if it proves to be worthwhile.
Signed-off-by: Matthew Wilcox (Oracle) <[email protected]>
Reviewed-by: Jan Kara <[email protected]>
Reviewed-by: William Kucharski <[email protected]>
---
mm/internal.h | 1 +
mm/shmem.c | 97 +++++++++++++------------------------
mm/truncate.c | 130 +++++++++++++++++++++++++++++++++-----------------
3 files changed, 120 insertions(+), 108 deletions(-)
diff --git a/mm/internal.h b/mm/internal.h
index 0910efec5821..3c0c807eddc6 100644
--- a/mm/internal.h
+++ b/mm/internal.h
@@ -70,6 +70,7 @@ static inline void force_page_cache_readahead(struct address_space *mapping,
unsigned find_lock_entries(struct address_space *mapping, pgoff_t start,
pgoff_t end, struct pagevec *pvec, pgoff_t *indices);
+bool truncate_inode_partial_page(struct page *page, loff_t start, loff_t end);
/**
* folio_evictable - Test whether a folio is evictable.
diff --git a/mm/shmem.c b/mm/shmem.c
index 2fd75b4d4974..337680a01f2a 100644
--- a/mm/shmem.c
+++ b/mm/shmem.c
@@ -857,32 +857,6 @@ void shmem_unlock_mapping(struct address_space *mapping)
}
}
-/*
- * Check whether a hole-punch or truncation needs to split a huge page,
- * returning true if no split was required, or the split has been successful.
- *
- * Eviction (or truncation to 0 size) should never need to split a huge page;
- * but in rare cases might do so, if shmem_undo_range() failed to trylock on
- * head, and then succeeded to trylock on tail.
- *
- * A split can only succeed when there are no additional references on the
- * huge page: so the split below relies upon find_get_entries() having stopped
- * when it found a subpage of the huge page, without getting further references.
- */
-static bool shmem_punch_compound(struct page *page, pgoff_t start, pgoff_t end)
-{
- if (!PageTransCompound(page))
- return true;
-
- /* Just proceed to delete a huge page wholly within the range punched */
- if (PageHead(page) &&
- page->index >= start && page->index + HPAGE_PMD_NR <= end)
- return true;
-
- /* Try to split huge page, so we can truly punch the hole or truncate */
- return split_huge_page(page) >= 0;
-}
-
/*
* Remove range of pages and swap entries from page cache, and free them.
* If !unfalloc, truncate or punch hole; if unfalloc, undo failed fallocate.
@@ -894,13 +868,13 @@ static void shmem_undo_range(struct inode *inode, loff_t lstart, loff_t lend,
struct shmem_inode_info *info = SHMEM_I(inode);
pgoff_t start = (lstart + PAGE_SIZE - 1) >> PAGE_SHIFT;
pgoff_t end = (lend + 1) >> PAGE_SHIFT;
- unsigned int partial_start = lstart & (PAGE_SIZE - 1);
- unsigned int partial_end = (lend + 1) & (PAGE_SIZE - 1);
struct pagevec pvec;
pgoff_t indices[PAGEVEC_SIZE];
+ struct page *page;
long nr_swaps_freed = 0;
pgoff_t index;
int i;
+ bool partial_end;
if (lend == -1)
end = -1; /* unsigned, so actually very big */
@@ -910,7 +884,7 @@ static void shmem_undo_range(struct inode *inode, loff_t lstart, loff_t lend,
while (index < end && find_lock_entries(mapping, index, end - 1,
&pvec, indices)) {
for (i = 0; i < pagevec_count(&pvec); i++) {
- struct page *page = pvec.pages[i];
+ page = pvec.pages[i];
index = indices[i];
@@ -933,33 +907,37 @@ static void shmem_undo_range(struct inode *inode, loff_t lstart, loff_t lend,
index++;
}
- if (partial_start) {
- struct page *page = NULL;
- shmem_getpage(inode, start - 1, &page, SGP_READ);
- if (page) {
- unsigned int top = PAGE_SIZE;
- if (start > end) {
- top = partial_end;
- partial_end = 0;
- }
- zero_user_segment(page, partial_start, top);
- set_page_dirty(page);
- unlock_page(page);
- put_page(page);
+ partial_end = ((lend + 1) % PAGE_SIZE) > 0;
+ page = NULL;
+ shmem_getpage(inode, lstart >> PAGE_SHIFT, &page, SGP_READ);
+ if (page) {
+ bool same_page;
+
+ page = compound_head(page);
+ same_page = lend < page_offset(page) + thp_size(page);
+ if (same_page)
+ partial_end = false;
+ set_page_dirty(page);
+ if (!truncate_inode_partial_page(page, lstart, lend)) {
+ start = page->index + thp_nr_pages(page);
+ if (same_page)
+ end = page->index;
}
+ unlock_page(page);
+ put_page(page);
+ page = NULL;
}
- if (partial_end) {
- struct page *page = NULL;
+
+ if (partial_end)
shmem_getpage(inode, end, &page, SGP_READ);
- if (page) {
- zero_user_segment(page, 0, partial_end);
- set_page_dirty(page);
- unlock_page(page);
- put_page(page);
- }
+ if (page) {
+ page = compound_head(page);
+ set_page_dirty(page);
+ if (!truncate_inode_partial_page(page, lstart, lend))
+ end = page->index;
+ unlock_page(page);
+ put_page(page);
}
- if (start >= end)
- return;
index = start;
while (index < end) {
@@ -975,7 +953,7 @@ static void shmem_undo_range(struct inode *inode, loff_t lstart, loff_t lend,
continue;
}
for (i = 0; i < pagevec_count(&pvec); i++) {
- struct page *page = pvec.pages[i];
+ page = pvec.pages[i];
index = indices[i];
if (xa_is_value(page)) {
@@ -1000,18 +978,9 @@ static void shmem_undo_range(struct inode *inode, loff_t lstart, loff_t lend,
break;
}
VM_BUG_ON_PAGE(PageWriteback(page), page);
- if (shmem_punch_compound(page, start, end))
- truncate_inode_page(mapping, page);
- else if (IS_ENABLED(CONFIG_TRANSPARENT_HUGEPAGE)) {
- /* Wipe the page and don't get stuck */
- clear_highpage(page);
- flush_dcache_page(page);
- set_page_dirty(page);
- if (index <
- round_up(start, HPAGE_PMD_NR))
- start = index + 1;
- }
+ truncate_inode_page(mapping, page);
}
+ index = page->index + thp_nr_pages(page) - 1;
unlock_page(page);
}
pagevec_remove_exceptionals(&pvec);
diff --git a/mm/truncate.c b/mm/truncate.c
index 234ddd879caa..b8c9d2fbd9b5 100644
--- a/mm/truncate.c
+++ b/mm/truncate.c
@@ -220,6 +220,58 @@ int truncate_inode_page(struct address_space *mapping, struct page *page)
return 0;
}
+/*
+ * Handle partial (transparent) pages. The page may be entirely within the
+ * range if a split has raced with us. If not, we zero the part of the
+ * page that's within the [start, end] range, and then split the page if
+ * it's a THP. split_huge_page() will discard pages which now lie beyond
+ * i_size, and we rely on the caller to discard pages which lie within a
+ * newly created hole.
+ *
+ * Returns false if THP splitting failed so the caller can avoid
+ * discarding the entire page which is stubbornly unsplit.
+ */
+bool truncate_inode_partial_page(struct page *page, loff_t start, loff_t end)
+{
+ loff_t pos = page_offset(page);
+ unsigned int offset, length;
+
+ if (pos < start)
+ offset = start - pos;
+ else
+ offset = 0;
+ length = thp_size(page);
+ if (pos + length <= (u64)end)
+ length = length - offset;
+ else
+ length = end + 1 - pos - offset;
+
+ wait_on_page_writeback(page);
+ if (length == thp_size(page)) {
+ truncate_inode_page(page->mapping, page);
+ return true;
+ }
+
+ /*
+ * We may be zeroing pages we're about to discard, but it avoids
+ * doing a complex calculation here, and then doing the zeroing
+ * anyway if the page split fails.
+ */
+ zero_user(page, offset, length);
+
+ cleancache_invalidate_page(page->mapping, page);
+ if (page_has_private(page))
+ do_invalidatepage(page, offset, length);
+ if (!PageTransHuge(page))
+ return true;
+ if (split_huge_page(page) == 0)
+ return true;
+ if (PageDirty(page))
+ return false;
+ truncate_inode_page(page->mapping, page);
+ return true;
+}
+
/*
* Used to get rid of pages on hardware memory corruption.
*/
@@ -255,6 +307,13 @@ int invalidate_inode_page(struct page *page)
return invalidate_complete_page(mapping, page);
}
+static inline struct page *find_lock_head(struct address_space *mapping,
+ pgoff_t index)
+{
+ struct folio *folio = __filemap_get_folio(mapping, index, FGP_LOCK, 0);
+ return &folio->page;
+}
+
/**
* truncate_inode_pages_range - truncate range of pages specified by start & end byte offsets
* @mapping: mapping to truncate
@@ -284,20 +343,16 @@ void truncate_inode_pages_range(struct address_space *mapping,
{
pgoff_t start; /* inclusive */
pgoff_t end; /* exclusive */
- unsigned int partial_start; /* inclusive */
- unsigned int partial_end; /* exclusive */
struct pagevec pvec;
pgoff_t indices[PAGEVEC_SIZE];
pgoff_t index;
int i;
+ struct page * page;
+ bool partial_end;
if (mapping_empty(mapping))
goto out;
- /* Offsets within partial pages */
- partial_start = lstart & (PAGE_SIZE - 1);
- partial_end = (lend + 1) & (PAGE_SIZE - 1);
-
/*
* 'start' and 'end' always covers the range of pages to be fully
* truncated. Partial pages are covered with 'partial_start' at the
@@ -330,48 +385,35 @@ void truncate_inode_pages_range(struct address_space *mapping,
cond_resched();
}
- if (partial_start) {
- struct page *page = find_lock_page(mapping, start - 1);
- if (page) {
- unsigned int top = PAGE_SIZE;
- if (start > end) {
- /* Truncation within a single page */
- top = partial_end;
- partial_end = 0;
- }
- wait_on_page_writeback(page);
- zero_user_segment(page, partial_start, top);
- cleancache_invalidate_page(mapping, page);
- if (page_has_private(page))
- do_invalidatepage(page, partial_start,
- top - partial_start);
- unlock_page(page);
- put_page(page);
+ partial_end = ((lend + 1) % PAGE_SIZE) > 0;
+ page = find_lock_head(mapping, lstart >> PAGE_SHIFT);
+ if (page) {
+ bool same_page = lend < page_offset(page) + thp_size(page);
+ if (same_page)
+ partial_end = false;
+ if (!truncate_inode_partial_page(page, lstart, lend)) {
+ start = page->index + thp_nr_pages(page);
+ if (same_page)
+ end = page->index;
}
+ unlock_page(page);
+ put_page(page);
+ page = NULL;
}
- if (partial_end) {
- struct page *page = find_lock_page(mapping, end);
- if (page) {
- wait_on_page_writeback(page);
- zero_user_segment(page, 0, partial_end);
- cleancache_invalidate_page(mapping, page);
- if (page_has_private(page))
- do_invalidatepage(page, 0,
- partial_end);
- unlock_page(page);
- put_page(page);
- }
+
+ if (partial_end)
+ page = find_lock_head(mapping, end);
+ if (page) {
+ if (!truncate_inode_partial_page(page, lstart, lend))
+ end = page->index;
+ unlock_page(page);
+ put_page(page);
}
- /*
- * If the truncation happened within a single page no pages
- * will be released, just zeroed, so we can bail out now.
- */
- if (start >= end)
- goto out;
index = start;
- for ( ; ; ) {
+ while (index < end) {
cond_resched();
+
if (!find_get_entries(mapping, index, end - 1, &pvec,
indices)) {
/* If all gone from start onwards, we're done */
@@ -383,7 +425,7 @@ void truncate_inode_pages_range(struct address_space *mapping,
}
for (i = 0; i < pagevec_count(&pvec); i++) {
- struct page *page = pvec.pages[i];
+ page = pvec.pages[i];
/* We rely upon deletion not changing page->index */
index = indices[i];
@@ -392,7 +434,7 @@ void truncate_inode_pages_range(struct address_space *mapping,
continue;
lock_page(page);
- WARN_ON(page_to_index(page) != index);
+ index = page->index + thp_nr_pages(page) - 1;
wait_on_page_writeback(page);
truncate_inode_page(mapping, page);
unlock_page(page);
--
2.30.2
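As a worked example of the offset/length arithmetic in truncate_inode_partial_page() above (values chosen purely for illustration): punching the byte range [4096, 1048575] in a 2MiB compound page at file position 0 gives pos = 0 and offset = start - pos = 4096; since pos + thp_size(page) exceeds end, length = end + 1 - pos - offset = 1048576 - 4096 = 1044480. Because that is less than thp_size(page), the range is zeroed, do_invalidatepage() is called if the page has private data, and a split is attempted rather than truncating the whole page.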
This lets us pass the folio in directly from filemap_readahead(), but its
primary reason is to enable us to pass a folio to ondemand_readahead()
in the next patch.
Signed-off-by: Matthew Wilcox (Oracle) <[email protected]>
---
include/linux/pagemap.h | 4 ++--
mm/filemap.c | 5 +++--
mm/readahead.c | 6 +++---
3 files changed, 8 insertions(+), 7 deletions(-)
diff --git a/include/linux/pagemap.h b/include/linux/pagemap.h
index 81cccb708df7..d3f0b68ea3b1 100644
--- a/include/linux/pagemap.h
+++ b/include/linux/pagemap.h
@@ -955,7 +955,7 @@ struct readahead_control {
void page_cache_ra_unbounded(struct readahead_control *,
unsigned long nr_to_read, unsigned long lookahead_count);
void page_cache_sync_ra(struct readahead_control *, unsigned long req_count);
-void page_cache_async_ra(struct readahead_control *, struct page *,
+void page_cache_async_ra(struct readahead_control *, struct folio *,
unsigned long req_count);
void readahead_expand(struct readahead_control *ractl,
loff_t new_start, size_t new_len);
@@ -1002,7 +1002,7 @@ void page_cache_async_readahead(struct address_space *mapping,
struct page *page, pgoff_t index, unsigned long req_count)
{
DEFINE_READAHEAD(ractl, file, ra, mapping, index);
- page_cache_async_ra(&ractl, page, req_count);
+ page_cache_async_ra(&ractl, page_folio(page), req_count);
}
static inline struct folio *__readahead_folio(struct readahead_control *ractl)
diff --git a/mm/filemap.c b/mm/filemap.c
index 1ff21b3346d3..ee7c72b4edee 100644
--- a/mm/filemap.c
+++ b/mm/filemap.c
@@ -2419,10 +2419,11 @@ static int filemap_readahead(struct kiocb *iocb, struct file *file,
struct address_space *mapping, struct folio *folio,
pgoff_t last_index)
{
+ DEFINE_READAHEAD(ractl, file, &file->f_ra, mapping, folio->index);
+
if (iocb->ki_flags & IOCB_NOIO)
return -EAGAIN;
- page_cache_async_readahead(mapping, &file->f_ra, file, &folio->page,
- folio->index, last_index - folio->index);
+ page_cache_async_ra(&ractl, folio, last_index - folio->index);
return 0;
}
diff --git a/mm/readahead.c b/mm/readahead.c
index d589f147f4c2..e1df44ad57ed 100644
--- a/mm/readahead.c
+++ b/mm/readahead.c
@@ -580,7 +580,7 @@ void page_cache_sync_ra(struct readahead_control *ractl,
EXPORT_SYMBOL_GPL(page_cache_sync_ra);
void page_cache_async_ra(struct readahead_control *ractl,
- struct page *page, unsigned long req_count)
+ struct folio *folio, unsigned long req_count)
{
/* no read-ahead */
if (!ractl->ra->ra_pages)
@@ -589,10 +589,10 @@ void page_cache_async_ra(struct readahead_control *ractl,
/*
* Same bit is used for PG_readahead and PG_reclaim.
*/
- if (PageWriteback(page))
+ if (folio_test_writeback(folio))
return;
- ClearPageReadahead(page);
+ folio_clear_readahead(folio);
/*
* Defer asynchronous read-ahead on IO congestion.
--
2.30.2
On Thu, Jul 15, 2021 at 05:51:26PM +0800, kernel test robot wrote:
> include/linux/netfs.h: In function 'folio_start_fscache':
> >> include/linux/netfs.h:43:2: error: implicit declaration of function 'folio_set_private_2_flag'; did you mean 'folio_set_private_2'? [-Werror=implicit-function-declaration]
> 43 | folio_set_private_2_flag(folio);
> | ^~~~~~~~~~~~~~~~~~~~~~~~
> | folio_set_private_2
> cc1: some warnings being treated as errors
I'll be folding in this patch:
+++ b/include/linux/netfs.h
@@ -40,7 +40,7 @@ static inline void folio_start_fscache(struct folio *folio)
{
VM_BUG_ON_FOLIO(folio_test_private_2(folio), folio);
folio_get(folio);
- folio_set_private_2_flag(folio);
+ folio_set_private_2(folio);
}
/**
Hi Matthew,
Thank you for the patch! Yet something to improve:
[auto build test ERROR on linus/master]
[also build test ERROR on v5.14-rc1 next-20210715]
[cannot apply to hnaz-linux-mm/master xfs-linux/for-next tip/perf/core]
[If your patch is applied to the wrong git tree, kindly drop us a note.
And when submitting patch, we suggest to use '--base' as documented in
https://git-scm.com/docs/git-format-patch]
url: https://github.com/0day-ci/linux/commits/Matthew-Wilcox-Oracle/Memory-folios/20210715-133101
base: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git 8096acd7442e613fad0354fc8dfdb2003cceea0b
config: nios2-defconfig (attached as .config)
compiler: nios2-linux-gcc (GCC) 10.3.0
reproduce (this is a W=1 build):
wget https://raw.githubusercontent.com/intel/lkp-tests/master/sbin/make.cross -O ~/bin/make.cross
chmod +x ~/bin/make.cross
# https://github.com/0day-ci/linux/commit/8e4044529261dffc386ab56b6d90e8511c820605
git remote add linux-review https://github.com/0day-ci/linux
git fetch --no-tags linux-review Matthew-Wilcox-Oracle/Memory-folios/20210715-133101
git checkout 8e4044529261dffc386ab56b6d90e8511c820605
# save the attached .config to linux build tree
COMPILER_INSTALL_PATH=$HOME/0day COMPILER=gcc-10.3.0 make.cross ARCH=nios2
If you fix the issue, kindly add following tag as appropriate
Reported-by: kernel test robot <[email protected]>
All errors (new ones prefixed by >>):
In file included from include/linux/fscache.h:22,
from fs/nfs/fscache.h:14,
from fs/nfs/client.c:48:
include/linux/netfs.h: In function 'folio_start_fscache':
>> include/linux/netfs.h:43:2: error: implicit declaration of function 'folio_set_private_2_flag'; did you mean 'folio_set_private_2'? [-Werror=implicit-function-declaration]
43 | folio_set_private_2_flag(folio);
| ^~~~~~~~~~~~~~~~~~~~~~~~
| folio_set_private_2
cc1: some warnings being treated as errors
vim +43 include/linux/netfs.h
31
32 /**
33 * folio_start_fscache - Start an fscache write on a folio.
34 * @folio: The folio.
35 *
36 * Call this function before writing a folio to a local cache. Starting a
37 * second write before the first one finishes is not allowed.
38 */
39 static inline void folio_start_fscache(struct folio *folio)
40 {
41 VM_BUG_ON_FOLIO(folio_test_private_2(folio), folio);
42 folio_get(folio);
> 43 folio_set_private_2_flag(folio);
44 }
45
---
0-DAY CI Kernel Test Service, Intel Corporation
https://lists.01.org/hyperkitty/list/[email protected]
Hi Matthew,
Thank you for the patch! Yet something to improve:
[auto build test ERROR on linus/master]
[also build test ERROR on v5.14-rc1 next-20210715]
[cannot apply to hnaz-linux-mm/master xfs-linux/for-next tip/perf/core]
[If your patch is applied to the wrong git tree, kindly drop us a note.
And when submitting patch, we suggest to use '--base' as documented in
https://git-scm.com/docs/git-format-patch]
url: https://github.com/0day-ci/linux/commits/Matthew-Wilcox-Oracle/Memory-folios/20210715-133101
base: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git 8096acd7442e613fad0354fc8dfdb2003cceea0b
config: powerpc-randconfig-r033-20210715 (attached as .config)
compiler: clang version 13.0.0 (https://github.com/llvm/llvm-project 0e49c54a8cbd3e779e5526a5888c683c01cc3c50)
reproduce (this is a W=1 build):
wget https://raw.githubusercontent.com/intel/lkp-tests/master/sbin/make.cross -O ~/bin/make.cross
chmod +x ~/bin/make.cross
# install powerpc cross compiling tool for clang build
# apt-get install binutils-powerpc-linux-gnu
# https://github.com/0day-ci/linux/commit/8e4044529261dffc386ab56b6d90e8511c820605
git remote add linux-review https://github.com/0day-ci/linux
git fetch --no-tags linux-review Matthew-Wilcox-Oracle/Memory-folios/20210715-133101
git checkout 8e4044529261dffc386ab56b6d90e8511c820605
# save the attached .config to linux build tree
COMPILER_INSTALL_PATH=$HOME/0day COMPILER=clang make.cross ARCH=powerpc
If you fix the issue, kindly add following tag as appropriate
Reported-by: kernel test robot <[email protected]>
All errors (new ones prefixed by >>):
In file included from include/linux/pagemap.h:11:
In file included from include/linux/highmem.h:10:
In file included from include/linux/hardirq.h:11:
In file included from arch/powerpc/include/asm/hardirq.h:6:
In file included from include/linux/irq.h:20:
In file included from include/linux/io.h:13:
In file included from arch/powerpc/include/asm/io.h:619:
arch/powerpc/include/asm/io-defs.h:45:1: warning: performing pointer arithmetic on a null pointer has undefined behavior [-Wnull-pointer-arithmetic]
DEF_PCI_AC_NORET(insw, (unsigned long p, void *b, unsigned long c),
^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
arch/powerpc/include/asm/io.h:616:3: note: expanded from macro 'DEF_PCI_AC_NORET'
__do_##name al; \
^~~~~~~~~~~~~~
<scratch space>:16:1: note: expanded from here
__do_insw
^
arch/powerpc/include/asm/io.h:557:56: note: expanded from macro '__do_insw'
#define __do_insw(p, b, n) readsw((PCI_IO_ADDR)_IO_BASE+(p), (b), (n))
~~~~~~~~~~~~~~~~~~~~~^
[four further -Wnull-pointer-arithmetic warnings from arch/powerpc/include/asm/io-defs.h (__do_insl, __do_outsb, __do_outsw, __do_outsl) elided; identical in form to the warning above]
In file included from fs/netfs/read_helper.c:17:
>> include/linux/netfs.h:43:2: error: implicit declaration of function 'folio_set_private_2_flag' [-Werror,-Wimplicit-function-declaration]
folio_set_private_2_flag(folio);
^
include/linux/netfs.h:43:2: note: did you mean 'folio_set_private_2'?
include/linux/page-flags.h:445:1: note: 'folio_set_private_2' declared here
PAGEFLAG(Private2, private_2, PF_ANY) TESTSCFLAG(Private2, private_2, PF_ANY)
^
include/linux/page-flags.h:362:2: note: expanded from macro 'PAGEFLAG'
SETPAGEFLAG(uname, lname, policy) \
^
include/linux/page-flags.h:320:6: note: expanded from macro 'SETPAGEFLAG'
void folio_set_##lname(struct folio *folio) \
^
<scratch space>:70:1: note: expanded from here
folio_set_private_2
^
12 warnings and 1 error generated.
vim +/folio_set_private_2_flag +43 include/linux/netfs.h
31
32 /**
33 * folio_start_fscache - Start an fscache write on a folio.
34 * @folio: The folio.
35 *
36 * Call this function before writing a folio to a local cache. Starting a
37 * second write before the first one finishes is not allowed.
38 */
39 static inline void folio_start_fscache(struct folio *folio)
40 {
41 VM_BUG_ON_FOLIO(folio_test_private_2(folio), folio);
42 folio_get(folio);
> 43 folio_set_private_2_flag(folio);
44 }
45
---
0-DAY CI Kernel Test Service, Intel Corporation
https://lists.01.org/hyperkitty/list/[email protected]
Hi Matthew,
Thank you for the patch! Yet something to improve:
[auto build test ERROR on linus/master]
[also build test ERROR on next-20210715]
[cannot apply to hnaz-linux-mm/master xfs-linux/for-next tip/perf/core v5.14-rc1]
[If your patch is applied to the wrong git tree, kindly drop us a note.
And when submitting patch, we suggest to use '--base' as documented in
https://git-scm.com/docs/git-format-patch]
url: https://github.com/0day-ci/linux/commits/Matthew-Wilcox-Oracle/Memory-folios/20210715-133101
base: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git 8096acd7442e613fad0354fc8dfdb2003cceea0b
config: arm-randconfig-r014-20210715 (attached as .config)
compiler: clang version 13.0.0 (https://github.com/llvm/llvm-project 0e49c54a8cbd3e779e5526a5888c683c01cc3c50)
reproduce (this is a W=1 build):
wget https://raw.githubusercontent.com/intel/lkp-tests/master/sbin/make.cross -O ~/bin/make.cross
chmod +x ~/bin/make.cross
# install arm cross compiling tool for clang build
# apt-get install binutils-arm-linux-gnueabi
# https://github.com/0day-ci/linux/commit/fd265884da3f65758e8b5153d45537a4bbefbb70
git remote add linux-review https://github.com/0day-ci/linux
git fetch --no-tags linux-review Matthew-Wilcox-Oracle/Memory-folios/20210715-133101
git checkout fd265884da3f65758e8b5153d45537a4bbefbb70
# save the attached .config to linux build tree
COMPILER_INSTALL_PATH=$HOME/0day COMPILER=clang make.cross ARCH=arm
If you fix the issue, kindly add following tag as appropriate
Reported-by: kernel test robot <[email protected]>
All errors (new ones prefixed by >>):
>> fs/iomap/buffered-io.c:645:2: error: implicit declaration of function 'flush_dcache_folio' [-Werror,-Wimplicit-function-declaration]
flush_dcache_folio(folio);
^
fs/iomap/buffered-io.c:645:2: note: did you mean 'flush_dcache_page'?
arch/arm/include/asm/cacheflush.h:292:13: note: 'flush_dcache_page' declared here
extern void flush_dcache_page(struct page *);
^
1 error generated.
vim +/flush_dcache_folio +645 fs/iomap/buffered-io.c
640
641 static size_t __iomap_write_end(struct inode *inode, loff_t pos, size_t len,
642 size_t copied, struct folio *folio)
643 {
644 struct iomap_page *iop = to_iomap_page(folio);
> 645 flush_dcache_folio(folio);
646
647 /*
648 * The blocks that were entirely written will now be uptodate, so we
649 * don't have to worry about a readpage reading them and overwriting a
650 * partial write. However if we have encountered a short write and only
651 * partially written into a block, it will not be marked uptodate, so a
652 * readpage might come in and destroy our partial write.
653 *
654 * Do the simplest thing, and just treat any short write to a non
655 * uptodate page as a zero-length write, and force the caller to redo
656 * the whole thing.
657 */
658 if (unlikely(copied < len && !folio_test_uptodate(folio)))
659 return 0;
660 iomap_set_range_uptodate(folio, iop, offset_in_folio(folio, pos), len);
661 filemap_dirty_folio(inode->i_mapping, folio);
662 return copied;
663 }
664
---
0-DAY CI Kernel Test Service, Intel Corporation
https://lists.01.org/hyperkitty/list/[email protected]
On 14 Jul 2021, at 23:35, Matthew Wilcox (Oracle) wrote:
> Turn migrate_page_states() into a wrapper around folio_migrate_flags().
> Also convert two functions only called from folio_migrate_flags() to
> be folio-based. ksm_migrate_page() becomes folio_migrate_ksm() and
> copy_page_owner() becomes folio_copy_owner(). folio_migrate_flags()
> alone shrinks by two thirds -- 1967 bytes down to 642 bytes.
>
> Signed-off-by: Matthew Wilcox (Oracle) <[email protected]>
> ---
> include/linux/ksm.h | 4 +-
> include/linux/migrate.h | 1 +
> include/linux/page_owner.h | 8 ++--
> mm/folio-compat.c | 6 +++
> mm/ksm.c | 31 ++++++++------
> mm/migrate.c | 84 +++++++++++++++++++-------------------
> mm/page_owner.c | 10 ++---
> 7 files changed, 77 insertions(+), 67 deletions(-)
LGTM. Reviewed-by: Zi Yan <[email protected]>
—
Best Regards,
Yan, Zi
On Thu, Jul 15, 2021 at 04:34:46AM +0100, Matthew Wilcox (Oracle) wrote:
> Managing memory in 4KiB pages is a serious overhead. Many benchmarks
> benefit from a larger "page size". As an example, an earlier iteration
> of this idea which used compound pages (and wasn't particularly tuned)
> got a 7% performance boost when compiling the kernel.
>
> Using compound pages or THPs exposes a weakness of our type system.
> Functions are often unprepared for compound pages to be passed to them,
> and may only act on PAGE_SIZE chunks. Even functions which are aware of
> compound pages may expect a head page, and do the wrong thing if passed
> a tail page.
>
> We also waste a lot of instructions ensuring that we're not looking at
> a tail page. Almost every call to PageFoo() contains one or more hidden
> calls to compound_head(). This also happens for get_page(), put_page()
> and many more functions.
>
> This patch series uses a new type, the struct folio, to manage memory.
> It converts enough of the page cache, iomap and XFS to use folios instead
> of pages, and then adds support for multi-page folios. It passes xfstests
> (running on XFS) with no regressions compared to v5.14-rc1.
Hey Willy,
I must confess I've lost the thread of the plot in terms of how you
hope to get the Memory folio work merged upstream. There are some
partial patch sets that just have the mm core, and then there were
some larger patchsets include some in the past which as I recall,
would touch ext4 (but which isn't in this set).
I was wondering if you could perhaps post a roadmap for how this patch
set might be broken up, and which subsections you were hoping to
target for the upcoming merge window versus the following merge
windows.
Also I assume that for file systems that aren't converted to use
Folios, there won't be any performance regressions --- is that
correct? Or is that something we need to watch for? Put another way,
if we don't land all of the memory folio patches before the end of the
calendar year, and we cut an LTS release with some file systems
converted and some file systems not yet converted, are there any
potential problems in that eventuality?
Thanks!
- Ted
On Thu, Jul 15, 2021 at 11:56:07AM -0400, Theodore Y. Ts'o wrote:
> On Thu, Jul 15, 2021 at 04:34:46AM +0100, Matthew Wilcox (Oracle) wrote:
> > Managing memory in 4KiB pages is a serious overhead. Many benchmarks
> > benefit from a larger "page size". As an example, an earlier iteration
> > of this idea which used compound pages (and wasn't particularly tuned)
> > got a 7% performance boost when compiling the kernel.
> >
> > Using compound pages or THPs exposes a weakness of our type system.
> > Functions are often unprepared for compound pages to be passed to them,
> > and may only act on PAGE_SIZE chunks. Even functions which are aware of
> > compound pages may expect a head page, and do the wrong thing if passed
> > a tail page.
> >
> > We also waste a lot of instructions ensuring that we're not looking at
> > a tail page. Almost every call to PageFoo() contains one or more hidden
> > calls to compound_head(). This also happens for get_page(), put_page()
> > and many more functions.
> >
> > This patch series uses a new type, the struct folio, to manage memory.
> > It converts enough of the page cache, iomap and XFS to use folios instead
> > of pages, and then adds support for multi-page folios. It passes xfstests
> > (running on XFS) with no regressions compared to v5.14-rc1.
>
> Hey Willy,
>
> I must confess I've lost the thread of the plot in terms of how you
> hope to get the Memory folio work merged upstream. There are some
> partial patch sets that just have the mm core, and then there were
> some larger patchsets, including some in the past which, as I recall,
> would touch ext4 (but which isn't in this set).
>
> I was wondering if you could perhaps post a roadmap for how this patch
> set might be broken up, and which subsections you were hoping to
> target for the upcoming merge window versus the following merge
> windows.
Hi Ted! Great questions. This particular incarnation of the
patch set is the one Linus asked for -- show the performance win
of using compound pages in the page cache. I think of this patchset
as having six parts:
1-32: core; introduce struct folio, get/put, flags
33-50: memcg
51-89: page cache, part 1
90-107: block + iomap
108-124: page cache, part 2
125-138: actually use compound pages in the page cache
I'm hoping to get the first three parts (patches 1-89) into the
next merge window. That gets us to the point where filesystems
can start to use folios themselves (ie it does the initial Amdahl
step and then everything else can happen in parallel)
> Also I assume that for file systems that aren't converted to use
> Folios, there won't be any performance regressions --- is that
> correct? Or is that something we need to watch for? Put another way,
> if we don't land all of the memory folio patches before the end of the
> calendar year, and we cut an LTS release with some file systems
> converted and some file systems not yet converted, are there any
> potential problems in that eventuality?
I suppose I can't guarantee that there will be no performance
regressions as a result (eg 5899593f51e6 was a regression that
was seen as a result of some of the prep work for folios), but
I do not anticipate any for unconverted filesystems. There might
be a tiny performance penalty for supporting arbitrary-order pages
instead of just orders 0 and 9, but I haven't seen anything to
suggest it's noticeable. I would expect to see a tiny performance
win from removing all the compound_head() calls in the VFS core.
I have a proposal in to Plumbers filesystem track where I intend to
go over all the ways I'm going to want filesystems to change to take
advantage of folios. I think that will be a good venue to discuss how
to handle buffer_head based filesystems in a multi-page folio world.
I wouldn't expect anything to have to change before the end of the year.
I only have four patches in my extended tree which touch ext4, and
they're all in the context of making treewide changes to all
filesystems:
- Converting ->set_page_dirty to ->dirty_folio,
- Converting ->invalidatepage to ->invalidate_folio,
- Converting ->readpage to ->read_folio,
- Changing readahead_page() to readahead_folio()
None of those patches are in great shape at this point, and I wouldn't ask
anyone to review them. I am anticipating that some filesystems will never
be converted to multi-page folios (although all filesystems should be
converted to single-page folios so we can remove the folio compat code).
On Thu, Jul 15, 2021 at 09:53:26PM +0800, kernel test robot wrote:
> >> fs/iomap/buffered-io.c:645:2: error: implicit declaration of function 'flush_dcache_folio' [-Werror,-Wimplicit-function-declaration]
> flush_dcache_folio(folio);
Thanks. ARM doesn't include asm-generic/cacheflush.h so it needs
flush_dcache_folio() declared. Adding this:
+++ b/arch/arm/include/asm/cacheflush.h
@@ -290,6 +290,7 @@ extern void flush_cache_page(struct vm_area_struct *vma, unsigned long user_addr
*/
#define ARCH_IMPLEMENTS_FLUSH_DCACHE_PAGE 1
extern void flush_dcache_page(struct page *);
+void flush_dcache_folio(struct folio *folio);
static inline void flush_kernel_vmap_range(void *addr, int size)
{
On 14 Jul 2021, at 23:35, Matthew Wilcox (Oracle) wrote:
> This is the folio equivalent of migrate_page_copy(), which is retained
> as a wrapper for filesystems which are not yet converted to folios.
> Also convert copy_huge_page() to folio_copy().
>
> Signed-off-by: Matthew Wilcox (Oracle) <[email protected]>
> ---
> include/linux/migrate.h | 1 +
> include/linux/mm.h | 2 +-
> mm/folio-compat.c | 6 ++++++
> mm/hugetlb.c | 2 +-
> mm/migrate.c | 14 +++++---------
> mm/util.c | 6 +++---
> 6 files changed, 17 insertions(+), 14 deletions(-)
>
LGTM. Reviewed-by: Zi Yan <[email protected]>
—
Best Regards,
Yan, Zi
On Thu, Jul 15, 2021 at 04:36:16AM +0100, Matthew Wilcox (Oracle) wrote:
> This is a thin wrapper around bio_add_page(). The main advantage here
> is the documentation that the submitter can expect to see folios in the
> completion handler, and that stupidly large folios are not supported.
> It's not currently possible to allocate stupidly large folios, but if
> it ever becomes possible, this function will fail gracefully instead of
> doing I/O to the wrong bytes.
>
> Signed-off-by: Matthew Wilcox (Oracle) <[email protected]>
> ---
> block/bio.c | 21 +++++++++++++++++++++
> include/linux/bio.h | 3 ++-
> 2 files changed, 23 insertions(+), 1 deletion(-)
>
> diff --git a/block/bio.c b/block/bio.c
> index 1fab762e079b..1b500611d25c 100644
> --- a/block/bio.c
> +++ b/block/bio.c
> @@ -933,6 +933,27 @@ int bio_add_page(struct bio *bio, struct page *page,
> }
> EXPORT_SYMBOL(bio_add_page);
>
> +/**
> + * bio_add_folio - Attempt to add part of a folio to a bio.
> + * @bio: Bio to add to.
> + * @folio: Folio to add.
> + * @len: How many bytes from the folio to add.
> + * @off: First byte in this folio to add.
> + *
> + * Always uses the head page of the folio in the bio. If a submitter
> + * only uses bio_add_folio(), it can count on never seeing tail pages
> + * in the completion routine. BIOs do not support folios larger than 2GiB.
> + *
> + * Return: The number of bytes from this folio added to the bio.
> + */
> +size_t bio_add_folio(struct bio *bio, struct folio *folio, size_t len,
> + size_t off)
> +{
> + if (len > UINT_MAX || off > UINT_MAX)
Er... if bios don't support folios larger than 2GB, then why check @off
and @len against UINT_MAX, which is ~4GB?
--D
> + return 0;
> + return bio_add_page(bio, &folio->page, len, off);
> +}
> +
> void bio_release_pages(struct bio *bio, bool mark_dirty)
> {
> struct bvec_iter_all iter_all;
> diff --git a/include/linux/bio.h b/include/linux/bio.h
> index 2203b686e1f0..ade93e2de6a1 100644
> --- a/include/linux/bio.h
> +++ b/include/linux/bio.h
> @@ -462,7 +462,8 @@ extern void bio_uninit(struct bio *);
> extern void bio_reset(struct bio *);
> void bio_chain(struct bio *, struct bio *);
>
> -extern int bio_add_page(struct bio *, struct page *, unsigned int,unsigned int);
> +int bio_add_page(struct bio *, struct page *, unsigned len, unsigned off);
> +size_t bio_add_folio(struct bio *, struct folio *, size_t len, size_t off);
> extern int bio_add_pc_page(struct request_queue *, struct bio *, struct page *,
> unsigned int, unsigned int);
> int bio_add_zone_append_page(struct bio *bio, struct page *page,
> --
> 2.30.2
>
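A minimal sketch of how a submitter might use the new helper; the bio and folio below are assumed to exist already, and this is illustrative rather than the series' own usage:

	/* Illustrative only: try to add a whole folio to an already-allocated bio. */
	size_t len = bio_add_folio(bio, folio, folio_size(folio), 0);

	if (len < folio_size(folio)) {
		/* The bio was full (or the folio too large): submit and retry with a new bio. */
	}

As the kernel-doc above notes, a submitter that only uses bio_add_folio() can then rely on never seeing tail pages in its completion handler.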
On Thu, Jul 15, 2021 at 04:36:18AM +0100, Matthew Wilcox (Oracle) wrote:
> The big comment about only using a head page can go away now that
> it takes a folio argument.
>
> Signed-off-by: Matthew Wilcox (Oracle) <[email protected]>
Looks decent to me,
Reviewed-by: Darrick J. Wong <[email protected]>
--D
> ---
> fs/iomap/buffered-io.c | 35 +++++++++++++++++------------------
> 1 file changed, 17 insertions(+), 18 deletions(-)
>
> diff --git a/fs/iomap/buffered-io.c b/fs/iomap/buffered-io.c
> index 41da4f14c00b..cd5c2f24cb7e 100644
> --- a/fs/iomap/buffered-io.c
> +++ b/fs/iomap/buffered-io.c
> @@ -22,8 +22,8 @@
> #include "../internal.h"
>
> /*
> - * Structure allocated for each page or THP when block size < page size
> - * to track sub-page uptodate status and I/O completions.
> + * Structure allocated for each folio when block size < folio size
> + * to track sub-folio uptodate status and I/O completions.
> */
> struct iomap_page {
> atomic_t read_bytes_pending;
> @@ -32,17 +32,10 @@ struct iomap_page {
> unsigned long uptodate[];
> };
>
> -static inline struct iomap_page *to_iomap_page(struct page *page)
> +static inline struct iomap_page *to_iomap_page(struct folio *folio)
> {
> - /*
> - * per-block data is stored in the head page. Callers should
> - * not be dealing with tail pages (and if they are, they can
> - * call thp_head() first.
> - */
> - VM_BUG_ON_PGFLAGS(PageTail(page), page);
> -
> - if (page_has_private(page))
> - return (struct iomap_page *)page_private(page);
> + if (folio_test_private(folio))
> + return folio_get_private(folio);
> return NULL;
> }
>
> @@ -51,7 +44,8 @@ static struct bio_set iomap_ioend_bioset;
> static struct iomap_page *
> iomap_page_create(struct inode *inode, struct page *page)
> {
> - struct iomap_page *iop = to_iomap_page(page);
> + struct folio *folio = page_folio(page);
> + struct iomap_page *iop = to_iomap_page(folio);
> unsigned int nr_blocks = i_blocks_per_page(inode, page);
>
> if (iop || nr_blocks <= 1)
> @@ -144,7 +138,8 @@ iomap_adjust_read_range(struct inode *inode, struct iomap_page *iop,
> static void
> iomap_iop_set_range_uptodate(struct page *page, unsigned off, unsigned len)
> {
> - struct iomap_page *iop = to_iomap_page(page);
> + struct folio *folio = page_folio(page);
> + struct iomap_page *iop = to_iomap_page(folio);
> struct inode *inode = page->mapping->host;
> unsigned first = off >> inode->i_blkbits;
> unsigned last = (off + len - 1) >> inode->i_blkbits;
> @@ -173,7 +168,8 @@ static void
> iomap_read_page_end_io(struct bio_vec *bvec, int error)
> {
> struct page *page = bvec->bv_page;
> - struct iomap_page *iop = to_iomap_page(page);
> + struct folio *folio = page_folio(page);
> + struct iomap_page *iop = to_iomap_page(folio);
>
> if (unlikely(error)) {
> ClearPageUptodate(page);
> @@ -433,7 +429,8 @@ int
> iomap_is_partially_uptodate(struct page *page, unsigned long from,
> unsigned long count)
> {
> - struct iomap_page *iop = to_iomap_page(page);
> + struct folio *folio = page_folio(page);
> + struct iomap_page *iop = to_iomap_page(folio);
> struct inode *inode = page->mapping->host;
> unsigned len, first, last;
> unsigned i;
> @@ -1011,7 +1008,8 @@ static void
> iomap_finish_page_writeback(struct inode *inode, struct page *page,
> int error, unsigned int len)
> {
> - struct iomap_page *iop = to_iomap_page(page);
> + struct folio *folio = page_folio(page);
> + struct iomap_page *iop = to_iomap_page(folio);
>
> if (error) {
> SetPageError(page);
> @@ -1304,7 +1302,8 @@ iomap_writepage_map(struct iomap_writepage_ctx *wpc,
> struct writeback_control *wbc, struct inode *inode,
> struct page *page, u64 end_offset)
> {
> - struct iomap_page *iop = to_iomap_page(page);
> + struct folio *folio = page_folio(page);
> + struct iomap_page *iop = to_iomap_page(folio);
> struct iomap_ioend *ioend, *next;
> unsigned len = i_blocksize(inode);
> u64 file_offset; /* file offset of page */
> --
> 2.30.2
>
On Thu, Jul 15, 2021 at 04:36:17AM +0100, Matthew Wilcox (Oracle) wrote:
> Allow callers to iterate over each folio instead of each page. The
> bio need not have been constructed using folios originally.
>
> Signed-off-by: Matthew Wilcox (Oracle) <[email protected]>
> ---
> include/linux/bio.h | 43 ++++++++++++++++++++++++++++++++++++++++++-
> 1 file changed, 42 insertions(+), 1 deletion(-)
>
> diff --git a/include/linux/bio.h b/include/linux/bio.h
> index ade93e2de6a1..d462bbc95c4b 100644
> --- a/include/linux/bio.h
> +++ b/include/linux/bio.h
> @@ -189,7 +189,7 @@ static inline void bio_advance_iter_single(const struct bio *bio,
> */
> #define bio_for_each_bvec_all(bvl, bio, i) \
> for (i = 0, bvl = bio_first_bvec_all(bio); \
> - i < (bio)->bi_vcnt; i++, bvl++) \
> + i < (bio)->bi_vcnt; i++, bvl++)
>
> #define bio_iter_last(bvec, iter) ((iter).bi_size == (bvec).bv_len)
>
> @@ -314,6 +314,47 @@ static inline struct bio_vec *bio_last_bvec_all(struct bio *bio)
> return &bio->bi_io_vec[bio->bi_vcnt - 1];
> }
>
> +struct folio_iter {
> + struct folio *folio;
> + size_t offset;
> + size_t length;
Hm... so after every bio_{first,next}_folio call, we can access the
folio, the offset, and the length (both in units of bytes) within the
folio?
> + size_t _seg_count;
> + int _i;
And these are private variables that the iteration code should not
scribble over?
> +};
> +
> +static inline
> +void bio_first_folio(struct folio_iter *fi, struct bio *bio, int i)
> +{
> + struct bio_vec *bvec = bio_first_bvec_all(bio) + i;
> +
> + fi->folio = page_folio(bvec->bv_page);
> + fi->offset = bvec->bv_offset +
> + PAGE_SIZE * (bvec->bv_page - &fi->folio->page);
> + fi->_seg_count = bvec->bv_len;
> + fi->length = min(folio_size(fi->folio) - fi->offset, fi->_seg_count);
> + fi->_i = i;
> +}
> +
> +static inline void bio_next_folio(struct folio_iter *fi, struct bio *bio)
> +{
> + fi->_seg_count -= fi->length;
> + if (fi->_seg_count) {
> + fi->folio = folio_next(fi->folio);
> + fi->offset = 0;
> + fi->length = min(folio_size(fi->folio), fi->_seg_count);
> + } else if (fi->_i + 1 < bio->bi_vcnt) {
> + bio_first_folio(fi, bio, fi->_i + 1);
> + } else {
> + fi->folio = NULL;
> + }
> +}
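(Just to check that I'm reading the offset math right, with some made-up
numbers: say the folio is order 2 -- 16KiB with 4KiB pages -- and the bvec
has bv_page pointing at the folio's second page, bv_offset = 0x200 and
bv_len = 0x3000.  Then bio_first_folio() would give

	fi.offset     = 0x200 + PAGE_SIZE * 1        = 0x1200
	fi.length     = min(0x4000 - 0x1200, 0x3000) = 0x2e00
	fi._seg_count = 0x3000

and the first bio_next_folio() leaves _seg_count = 0x200, so we advance to
folio_next() with offset 0 and length 0x200 -- i.e. a bvec that runs past
the end of one folio gets split across the two folios for us.  That matches
what I'd expect.)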
> +
> +/*
> + * Iterate over each folio in a bio.
> + */
> +#define bio_for_each_folio_all(fi, bio) \
> + for (bio_first_folio(&fi, bio, 0); fi.folio; bio_next_folio(&fi, bio))
...so I guess a sample iteration loop would be something like:
	struct bio *bio = <get one from somewhere>;
	struct folio_iter fi;

	bio_for_each_folio_all(fi, bio) {
		if (folio_test_dirty(fi.folio)) {
			printk("folio idx 0x%lx is dirty, i hates dirty data!",
					folio_index(fi.folio));
			panic();
		}
	}
I'll go look through the rest of the patches, but this so far looks
pretty straightforward to me.
--D
> +
> enum bip_flags {
> BIP_BLOCK_INTEGRITY = 1 << 0, /* block layer owns integrity data */
> BIP_MAPPED_INTEGRITY = 1 << 1, /* ref tag has been remapped */
> --
> 2.30.2
>
On Thu, Jul 15, 2021 at 04:36:19AM +0100, Matthew Wilcox (Oracle) wrote:
> This function already assumed it was being passed a head page, so
> just formalise that.
>
> Signed-off-by: Matthew Wilcox (Oracle) <[email protected]>
Reviewed-by: Darrick J. Wong <[email protected]>
--D
> ---
> fs/iomap/buffered-io.c | 18 ++++++++++--------
> 1 file changed, 10 insertions(+), 8 deletions(-)
>
> diff --git a/fs/iomap/buffered-io.c b/fs/iomap/buffered-io.c
> index cd5c2f24cb7e..c15a0ac52a32 100644
> --- a/fs/iomap/buffered-io.c
> +++ b/fs/iomap/buffered-io.c
> @@ -42,11 +42,10 @@ static inline struct iomap_page *to_iomap_page(struct folio *folio)
> static struct bio_set iomap_ioend_bioset;
>
> static struct iomap_page *
> -iomap_page_create(struct inode *inode, struct page *page)
> +iomap_page_create(struct inode *inode, struct folio *folio)
> {
> - struct folio *folio = page_folio(page);
> struct iomap_page *iop = to_iomap_page(folio);
> - unsigned int nr_blocks = i_blocks_per_page(inode, page);
> + unsigned int nr_blocks = i_blocks_per_folio(inode, folio);
>
> if (iop || nr_blocks <= 1)
> return iop;
> @@ -54,9 +53,9 @@ iomap_page_create(struct inode *inode, struct page *page)
> iop = kzalloc(struct_size(iop, uptodate, BITS_TO_LONGS(nr_blocks)),
> GFP_NOFS | __GFP_NOFAIL);
> spin_lock_init(&iop->uptodate_lock);
> - if (PageUptodate(page))
> + if (folio_test_uptodate(folio))
> bitmap_fill(iop->uptodate, nr_blocks);
> - attach_page_private(page, iop);
> + folio_attach_private(folio, iop);
> return iop;
> }
>
> @@ -235,7 +234,8 @@ iomap_readpage_actor(struct inode *inode, loff_t pos, loff_t length, void *data,
> {
> struct iomap_readpage_ctx *ctx = data;
> struct page *page = ctx->cur_page;
> - struct iomap_page *iop = iomap_page_create(inode, page);
> + struct folio *folio = page_folio(page);
> + struct iomap_page *iop = iomap_page_create(inode, folio);
> bool same_page = false, is_contig = false;
> loff_t orig_pos = pos;
> unsigned poff, plen;
> @@ -547,7 +547,8 @@ static int
> __iomap_write_begin(struct inode *inode, loff_t pos, unsigned len, int flags,
> struct page *page, struct iomap *srcmap)
> {
> - struct iomap_page *iop = iomap_page_create(inode, page);
> + struct folio *folio = page_folio(page);
> + struct iomap_page *iop = iomap_page_create(inode, folio);
> loff_t block_size = i_blocksize(inode);
> loff_t block_start = round_down(pos, block_size);
> loff_t block_end = round_up(pos + len, block_size);
> @@ -955,6 +956,7 @@ iomap_page_mkwrite_actor(struct inode *inode, loff_t pos, loff_t length,
> void *data, struct iomap *iomap, struct iomap *srcmap)
> {
> struct page *page = data;
> + struct folio *folio = page_folio(page);
> int ret;
>
> if (iomap->flags & IOMAP_F_BUFFER_HEAD) {
> @@ -964,7 +966,7 @@ iomap_page_mkwrite_actor(struct inode *inode, loff_t pos, loff_t length,
> block_commit_write(page, 0, length);
> } else {
> WARN_ON_ONCE(!PageUptodate(page));
> - iomap_page_create(inode, page);
> + iomap_page_create(inode, folio);
> set_page_dirty(page);
> }
>
> --
> 2.30.2
>
On Thu, Jul 15, 2021 at 04:36:20AM +0100, Matthew Wilcox (Oracle) wrote:
> iomap_page_release() was also assuming that it was being passed a
> head page.
>
> Signed-off-by: Matthew Wilcox (Oracle) <[email protected]>
Eh, looks pretty straightforward to me...
Reviewed-by: Darrick J. Wong <[email protected]>
--D
> ---
> fs/iomap/buffered-io.c | 18 +++++++++++-------
> 1 file changed, 11 insertions(+), 7 deletions(-)
>
> diff --git a/fs/iomap/buffered-io.c b/fs/iomap/buffered-io.c
> index c15a0ac52a32..251ec45426aa 100644
> --- a/fs/iomap/buffered-io.c
> +++ b/fs/iomap/buffered-io.c
> @@ -59,18 +59,18 @@ iomap_page_create(struct inode *inode, struct folio *folio)
> return iop;
> }
>
> -static void
> -iomap_page_release(struct page *page)
> +static void iomap_page_release(struct folio *folio)
> {
> - struct iomap_page *iop = detach_page_private(page);
> - unsigned int nr_blocks = i_blocks_per_page(page->mapping->host, page);
> + struct iomap_page *iop = folio_detach_private(folio);
> + unsigned int nr_blocks = i_blocks_per_folio(folio->mapping->host,
> + folio);
>
> if (!iop)
> return;
> WARN_ON_ONCE(atomic_read(&iop->read_bytes_pending));
> WARN_ON_ONCE(atomic_read(&iop->write_bytes_pending));
> WARN_ON_ONCE(bitmap_full(iop->uptodate, nr_blocks) !=
> - PageUptodate(page));
> + folio_test_uptodate(folio));
> kfree(iop);
> }
>
> @@ -456,6 +456,8 @@ EXPORT_SYMBOL_GPL(iomap_is_partially_uptodate);
> int
> iomap_releasepage(struct page *page, gfp_t gfp_mask)
> {
> + struct folio *folio = page_folio(page);
> +
> trace_iomap_releasepage(page->mapping->host, page_offset(page),
> PAGE_SIZE);
>
> @@ -466,7 +468,7 @@ iomap_releasepage(struct page *page, gfp_t gfp_mask)
> */
> if (PageDirty(page) || PageWriteback(page))
> return 0;
> - iomap_page_release(page);
> + iomap_page_release(folio);
> return 1;
> }
> EXPORT_SYMBOL_GPL(iomap_releasepage);
> @@ -474,6 +476,8 @@ EXPORT_SYMBOL_GPL(iomap_releasepage);
> void
> iomap_invalidatepage(struct page *page, unsigned int offset, unsigned int len)
> {
> + struct folio *folio = page_folio(page);
> +
> trace_iomap_invalidatepage(page->mapping->host, offset, len);
>
> /*
> @@ -483,7 +487,7 @@ iomap_invalidatepage(struct page *page, unsigned int offset, unsigned int len)
> if (offset == 0 && len == PAGE_SIZE) {
> WARN_ON_ONCE(PageWriteback(page));
> cancel_dirty_page(page);
> - iomap_page_release(page);
> + iomap_page_release(folio);
> }
> }
> EXPORT_SYMBOL_GPL(iomap_invalidatepage);
> --
> 2.30.2
>
On Thu, Jul 15, 2021 at 04:36:23AM +0100, Matthew Wilcox (Oracle) wrote:
> All but one caller already has the iomap_page, and we can avoid getting
> it again.
>
> Signed-off-by: Matthew Wilcox (Oracle) <[email protected]>
Took me a while to distinguish iomap_iop_set_range_uptodate and
iomap_set_range_uptodate, but yes, this looks pretty simple.
Reviewed-by: Darrick J. Wong <[email protected]>
--D
> ---
> fs/iomap/buffered-io.c | 25 +++++++++++++------------
> 1 file changed, 13 insertions(+), 12 deletions(-)
>
> diff --git a/fs/iomap/buffered-io.c b/fs/iomap/buffered-io.c
> index 6b41019a51a3..fbe4ebc074ce 100644
> --- a/fs/iomap/buffered-io.c
> +++ b/fs/iomap/buffered-io.c
> @@ -134,11 +134,9 @@ iomap_adjust_read_range(struct inode *inode, struct iomap_page *iop,
> *lenp = plen;
> }
>
> -static void
> -iomap_iop_set_range_uptodate(struct page *page, unsigned off, unsigned len)
> +static void iomap_iop_set_range_uptodate(struct page *page,
> + struct iomap_page *iop, unsigned off, unsigned len)
> {
> - struct folio *folio = page_folio(page);
> - struct iomap_page *iop = to_iomap_page(folio);
> struct inode *inode = page->mapping->host;
> unsigned first = off >> inode->i_blkbits;
> unsigned last = (off + len - 1) >> inode->i_blkbits;
> @@ -151,14 +149,14 @@ iomap_iop_set_range_uptodate(struct page *page, unsigned off, unsigned len)
> spin_unlock_irqrestore(&iop->uptodate_lock, flags);
> }
>
> -static void
> -iomap_set_range_uptodate(struct page *page, unsigned off, unsigned len)
> +static void iomap_set_range_uptodate(struct page *page,
> + struct iomap_page *iop, unsigned off, unsigned len)
> {
> if (PageError(page))
> return;
>
> - if (page_has_private(page))
> - iomap_iop_set_range_uptodate(page, off, len);
> + if (iop)
> + iomap_iop_set_range_uptodate(page, iop, off, len);
> else
> SetPageUptodate(page);
> }
> @@ -174,7 +172,8 @@ iomap_read_page_end_io(struct bio_vec *bvec, int error)
> ClearPageUptodate(page);
> SetPageError(page);
> } else {
> - iomap_set_range_uptodate(page, bvec->bv_offset, bvec->bv_len);
> + iomap_set_range_uptodate(page, iop, bvec->bv_offset,
> + bvec->bv_len);
> }
>
> if (!iop || atomic_sub_and_test(bvec->bv_len, &iop->read_bytes_pending))
> @@ -254,7 +253,7 @@ iomap_readpage_actor(struct inode *inode, loff_t pos, loff_t length, void *data,
>
> if (iomap_block_needs_zeroing(inode, iomap, pos)) {
> zero_user(page, poff, plen);
> - iomap_set_range_uptodate(page, poff, plen);
> + iomap_set_range_uptodate(page, iop, poff, plen);
> goto done;
> }
>
> @@ -583,7 +582,7 @@ __iomap_write_begin(struct inode *inode, loff_t pos, unsigned len, int flags,
> if (status)
> return status;
> }
> - iomap_set_range_uptodate(page, poff, plen);
> + iomap_set_range_uptodate(page, iop, poff, plen);
> } while ((block_start += plen) < block_end);
>
> return 0;
> @@ -645,6 +644,8 @@ iomap_write_begin(struct inode *inode, loff_t pos, unsigned len, unsigned flags,
> static size_t __iomap_write_end(struct inode *inode, loff_t pos, size_t len,
> size_t copied, struct page *page)
> {
> + struct folio *folio = page_folio(page);
> + struct iomap_page *iop = to_iomap_page(folio);
> flush_dcache_page(page);
>
> /*
> @@ -660,7 +661,7 @@ static size_t __iomap_write_end(struct inode *inode, loff_t pos, size_t len,
> */
> if (unlikely(copied < len && !PageUptodate(page)))
> return 0;
> - iomap_set_range_uptodate(page, offset_in_page(pos), len);
> + iomap_set_range_uptodate(page, iop, offset_in_page(pos), len);
> __set_page_dirty_nobuffers(page);
> return copied;
> }
> --
> 2.30.2
>
On Thu, Jul 15, 2021 at 04:36:24AM +0100, Matthew Wilcox (Oracle) wrote:
> Pass a folio around instead of the page, and make sure the offset
> is relative to the start of the folio instead of the start of a page.
> Also use size_t for offset & length to make it clear that these are byte
> counts, and to support >2GB folios in the future.
>
> Signed-off-by: Matthew Wilcox (Oracle) <[email protected]>
> ---
> fs/iomap/buffered-io.c | 85 ++++++++++++++++++++++--------------------
> 1 file changed, 44 insertions(+), 41 deletions(-)
>
> diff --git a/fs/iomap/buffered-io.c b/fs/iomap/buffered-io.c
> index fbe4ebc074ce..707a96e36651 100644
> --- a/fs/iomap/buffered-io.c
> +++ b/fs/iomap/buffered-io.c
> @@ -75,18 +75,18 @@ static void iomap_page_release(struct folio *folio)
> }
>
> /*
> - * Calculate the range inside the page that we actually need to read.
> + * Calculate the range inside the folio that we actually need to read.
> */
> -static void
> -iomap_adjust_read_range(struct inode *inode, struct iomap_page *iop,
> - loff_t *pos, loff_t length, unsigned *offp, unsigned *lenp)
> +static void iomap_adjust_read_range(struct inode *inode, struct folio *folio,
> + loff_t *pos, loff_t length, size_t *offp, size_t *lenp)
> {
> + struct iomap_page *iop = to_iomap_page(folio);
> loff_t orig_pos = *pos;
> loff_t isize = i_size_read(inode);
> unsigned block_bits = inode->i_blkbits;
> unsigned block_size = (1 << block_bits);
> - unsigned poff = offset_in_page(*pos);
> - unsigned plen = min_t(loff_t, PAGE_SIZE - poff, length);
> + size_t poff = offset_in_folio(folio, *pos);
> + size_t plen = min_t(loff_t, folio_size(folio) - poff, length);
I'm confused about 'size_t poff' here vs. 'unsigned end' later -- why do
we need a 64-bit quantity for poff? I suppose some day we might want to
have folios larger than 4GB or so, but so far we don't need that large
of a byte offset within a page/folio, right?
Or are you merely moving the codebase towards using size_t for all byte
offsets?
The rest of the conversion code looked ok though.
--D
> unsigned first = poff >> block_bits;
> unsigned last = (poff + plen - 1) >> block_bits;
>
> @@ -124,7 +124,7 @@ iomap_adjust_read_range(struct inode *inode, struct iomap_page *iop,
> * page cache for blocks that are entirely outside of i_size.
> */
> if (orig_pos <= isize && orig_pos + length > isize) {
> - unsigned end = offset_in_page(isize - 1) >> block_bits;
> + unsigned end = offset_in_folio(folio, isize - 1) >> block_bits;
>
> if (first <= end && last > end)
> plen -= (last - end) * block_size;
> @@ -134,31 +134,31 @@ iomap_adjust_read_range(struct inode *inode, struct iomap_page *iop,
> *lenp = plen;
> }
>
> -static void iomap_iop_set_range_uptodate(struct page *page,
> - struct iomap_page *iop, unsigned off, unsigned len)
> +static void iomap_iop_set_range_uptodate(struct folio *folio,
> + struct iomap_page *iop, size_t off, size_t len)
> {
> - struct inode *inode = page->mapping->host;
> + struct inode *inode = folio->mapping->host;
> unsigned first = off >> inode->i_blkbits;
> unsigned last = (off + len - 1) >> inode->i_blkbits;
> unsigned long flags;
>
> spin_lock_irqsave(&iop->uptodate_lock, flags);
> bitmap_set(iop->uptodate, first, last - first + 1);
> - if (bitmap_full(iop->uptodate, i_blocks_per_page(inode, page)))
> - SetPageUptodate(page);
> + if (bitmap_full(iop->uptodate, i_blocks_per_folio(inode, folio)))
> + folio_mark_uptodate(folio);
> spin_unlock_irqrestore(&iop->uptodate_lock, flags);
> }
>
> -static void iomap_set_range_uptodate(struct page *page,
> - struct iomap_page *iop, unsigned off, unsigned len)
> +static void iomap_set_range_uptodate(struct folio *folio,
> + struct iomap_page *iop, size_t off, size_t len)
> {
> - if (PageError(page))
> + if (folio_test_error(folio))
> return;
>
> if (iop)
> - iomap_iop_set_range_uptodate(page, iop, off, len);
> + iomap_iop_set_range_uptodate(folio, iop, off, len);
> else
> - SetPageUptodate(page);
> + folio_mark_uptodate(folio);
> }
>
> static void
> @@ -169,15 +169,17 @@ iomap_read_page_end_io(struct bio_vec *bvec, int error)
> struct iomap_page *iop = to_iomap_page(folio);
>
> if (unlikely(error)) {
> - ClearPageUptodate(page);
> - SetPageError(page);
> + folio_clear_uptodate(folio);
> + folio_set_error(folio);
> } else {
> - iomap_set_range_uptodate(page, iop, bvec->bv_offset,
> - bvec->bv_len);
> + size_t off = (page - &folio->page) * PAGE_SIZE +
> + bvec->bv_offset;
> +
> + iomap_set_range_uptodate(folio, iop, off, bvec->bv_len);
> }
>
> if (!iop || atomic_sub_and_test(bvec->bv_len, &iop->read_bytes_pending))
> - unlock_page(page);
> + folio_unlock(folio);
> }
>
> static void
> @@ -237,7 +239,7 @@ iomap_readpage_actor(struct inode *inode, loff_t pos, loff_t length, void *data,
> struct iomap_page *iop = iomap_page_create(inode, folio);
> bool same_page = false, is_contig = false;
> loff_t orig_pos = pos;
> - unsigned poff, plen;
> + size_t poff, plen;
> sector_t sector;
>
> if (iomap->type == IOMAP_INLINE) {
> @@ -246,14 +248,14 @@ iomap_readpage_actor(struct inode *inode, loff_t pos, loff_t length, void *data,
> return PAGE_SIZE;
> }
>
> - /* zero post-eof blocks as the page may be mapped */
> - iomap_adjust_read_range(inode, iop, &pos, length, &poff, &plen);
> + /* zero post-eof blocks as the folio may be mapped */
> + iomap_adjust_read_range(inode, folio, &pos, length, &poff, &plen);
> if (plen == 0)
> goto done;
>
> if (iomap_block_needs_zeroing(inode, iomap, pos)) {
> - zero_user(page, poff, plen);
> - iomap_set_range_uptodate(page, iop, poff, plen);
> + zero_user(&folio->page, poff, plen);
> + iomap_set_range_uptodate(folio, iop, poff, plen);
> goto done;
> }
>
> @@ -264,7 +266,7 @@ iomap_readpage_actor(struct inode *inode, loff_t pos, loff_t length, void *data,
> /* Try to merge into a previous segment if we can */
> sector = iomap_sector(iomap, pos);
> if (ctx->bio && bio_end_sector(ctx->bio) == sector) {
> - if (__bio_try_merge_page(ctx->bio, page, plen, poff,
> + if (__bio_try_merge_page(ctx->bio, &folio->page, plen, poff,
> &same_page))
> goto done;
> is_contig = true;
> @@ -296,7 +298,7 @@ iomap_readpage_actor(struct inode *inode, loff_t pos, loff_t length, void *data,
> ctx->bio->bi_end_io = iomap_read_end_io;
> }
>
> - bio_add_page(ctx->bio, page, plen, poff);
> + bio_add_folio(ctx->bio, folio, plen, poff);
> done:
> /*
> * Move the caller beyond our range so that it keeps making progress.
> @@ -531,9 +533,8 @@ iomap_write_failed(struct inode *inode, loff_t pos, unsigned len)
> truncate_pagecache_range(inode, max(pos, i_size), pos + len);
> }
>
> -static int
> -iomap_read_page_sync(loff_t block_start, struct page *page, unsigned poff,
> - unsigned plen, struct iomap *iomap)
> +static int iomap_read_folio_sync(loff_t block_start, struct folio *folio,
> + size_t poff, size_t plen, struct iomap *iomap)
> {
> struct bio_vec bvec;
> struct bio bio;
> @@ -542,7 +543,7 @@ iomap_read_page_sync(loff_t block_start, struct page *page, unsigned poff,
> bio.bi_opf = REQ_OP_READ;
> bio.bi_iter.bi_sector = iomap_sector(iomap, block_start);
> bio_set_dev(&bio, iomap->bdev);
> - __bio_add_page(&bio, page, plen, poff);
> + bio_add_folio(&bio, folio, plen, poff);
> return submit_bio_wait(&bio);
> }
>
> @@ -555,14 +556,15 @@ __iomap_write_begin(struct inode *inode, loff_t pos, unsigned len, int flags,
> loff_t block_size = i_blocksize(inode);
> loff_t block_start = round_down(pos, block_size);
> loff_t block_end = round_up(pos + len, block_size);
> - unsigned from = offset_in_page(pos), to = from + len, poff, plen;
> + size_t from = offset_in_folio(folio, pos), to = from + len;
> + size_t poff, plen;
>
> - if (PageUptodate(page))
> + if (folio_test_uptodate(folio))
> return 0;
> - ClearPageError(page);
> + folio_clear_error(folio);
>
> do {
> - iomap_adjust_read_range(inode, iop, &block_start,
> + iomap_adjust_read_range(inode, folio, &block_start,
> block_end - block_start, &poff, &plen);
> if (plen == 0)
> break;
> @@ -575,14 +577,15 @@ __iomap_write_begin(struct inode *inode, loff_t pos, unsigned len, int flags,
> if (iomap_block_needs_zeroing(inode, srcmap, block_start)) {
> if (WARN_ON_ONCE(flags & IOMAP_WRITE_F_UNSHARE))
> return -EIO;
> - zero_user_segments(page, poff, from, to, poff + plen);
> + zero_user_segments(&folio->page, poff, from, to,
> + poff + plen);
> } else {
> - int status = iomap_read_page_sync(block_start, page,
> + int status = iomap_read_folio_sync(block_start, folio,
> poff, plen, srcmap);
> if (status)
> return status;
> }
> - iomap_set_range_uptodate(page, iop, poff, plen);
> + iomap_set_range_uptodate(folio, iop, poff, plen);
> } while ((block_start += plen) < block_end);
>
> return 0;
> @@ -661,7 +664,7 @@ static size_t __iomap_write_end(struct inode *inode, loff_t pos, size_t len,
> */
> if (unlikely(copied < len && !PageUptodate(page)))
> return 0;
> - iomap_set_range_uptodate(page, iop, offset_in_page(pos), len);
> + iomap_set_range_uptodate(folio, iop, offset_in_folio(folio, pos), len);
> __set_page_dirty_nobuffers(page);
> return copied;
> }
> --
> 2.30.2
>
On Thu, Jul 15, 2021 at 04:36:25AM +0100, Matthew Wilcox (Oracle) wrote:
> Use bio_for_each_folio() to iterate over each folio in the bio
> instead of iterating over each page.
>
> Signed-off-by: Matthew Wilcox (Oracle) <[email protected]>
Neat conversion,
Reviewed-by: Darrick J. Wong <[email protected]>
--D
> ---
> fs/iomap/buffered-io.c | 46 +++++++++++++++++-------------------------
> 1 file changed, 18 insertions(+), 28 deletions(-)
>
> diff --git a/fs/iomap/buffered-io.c b/fs/iomap/buffered-io.c
> index 707a96e36651..4732298f74e1 100644
> --- a/fs/iomap/buffered-io.c
> +++ b/fs/iomap/buffered-io.c
> @@ -161,36 +161,29 @@ static void iomap_set_range_uptodate(struct folio *folio,
> folio_mark_uptodate(folio);
> }
>
> -static void
> -iomap_read_page_end_io(struct bio_vec *bvec, int error)
> +static void iomap_finish_folio_read(struct folio *folio, size_t offset,
> + size_t len, int error)
> {
> - struct page *page = bvec->bv_page;
> - struct folio *folio = page_folio(page);
> struct iomap_page *iop = to_iomap_page(folio);
>
> if (unlikely(error)) {
> folio_clear_uptodate(folio);
> folio_set_error(folio);
> } else {
> - size_t off = (page - &folio->page) * PAGE_SIZE +
> - bvec->bv_offset;
> -
> - iomap_set_range_uptodate(folio, iop, off, bvec->bv_len);
> + iomap_set_range_uptodate(folio, iop, offset, len);
> }
>
> - if (!iop || atomic_sub_and_test(bvec->bv_len, &iop->read_bytes_pending))
> + if (!iop || atomic_sub_and_test(len, &iop->read_bytes_pending))
> folio_unlock(folio);
> }
>
> -static void
> -iomap_read_end_io(struct bio *bio)
> +static void iomap_read_end_io(struct bio *bio)
> {
> int error = blk_status_to_errno(bio->bi_status);
> - struct bio_vec *bvec;
> - struct bvec_iter_all iter_all;
> + struct folio_iter fi;
>
> - bio_for_each_segment_all(bvec, bio, iter_all)
> - iomap_read_page_end_io(bvec, error);
> + bio_for_each_folio_all(fi, bio)
> + iomap_finish_folio_read(fi.folio, fi.offset, fi.length, error);
> bio_put(bio);
> }
>
> @@ -1014,23 +1007,21 @@ vm_fault_t iomap_page_mkwrite(struct vm_fault *vmf, const struct iomap_ops *ops)
> }
> EXPORT_SYMBOL_GPL(iomap_page_mkwrite);
>
> -static void
> -iomap_finish_page_writeback(struct inode *inode, struct page *page,
> - int error, unsigned int len)
> +static void iomap_finish_folio_write(struct inode *inode, struct folio *folio,
> + size_t len, int error)
> {
> - struct folio *folio = page_folio(page);
> struct iomap_page *iop = to_iomap_page(folio);
>
> if (error) {
> - SetPageError(page);
> + folio_set_error(folio);
> mapping_set_error(inode->i_mapping, -EIO);
> }
>
> - WARN_ON_ONCE(i_blocks_per_page(inode, page) > 1 && !iop);
> + WARN_ON_ONCE(i_blocks_per_folio(inode, folio) > 1 && !iop);
> WARN_ON_ONCE(iop && atomic_read(&iop->write_bytes_pending) <= 0);
>
> if (!iop || atomic_sub_and_test(len, &iop->write_bytes_pending))
> - end_page_writeback(page);
> + folio_end_writeback(folio);
> }
>
> /*
> @@ -1049,8 +1040,7 @@ iomap_finish_ioend(struct iomap_ioend *ioend, int error)
> bool quiet = bio_flagged(bio, BIO_QUIET);
>
> for (bio = &ioend->io_inline_bio; bio; bio = next) {
> - struct bio_vec *bv;
> - struct bvec_iter_all iter_all;
> + struct folio_iter fi;
>
> /*
> * For the last bio, bi_private points to the ioend, so we
> @@ -1061,10 +1051,10 @@ iomap_finish_ioend(struct iomap_ioend *ioend, int error)
> else
> next = bio->bi_private;
>
> - /* walk each page on bio, ending page IO on them */
> - bio_for_each_segment_all(bv, bio, iter_all)
> - iomap_finish_page_writeback(inode, bv->bv_page, error,
> - bv->bv_len);
> + /* walk all folios in bio, ending page IO on them */
> + bio_for_each_folio_all(fi, bio)
> + iomap_finish_folio_write(inode, fi.folio, fi.length,
> + error);
> bio_put(bio);
> }
> /* The ioend has been freed by bio_put() */
> --
> 2.30.2
>
On Thu, Jul 15, 2021 at 04:36:26AM +0100, Matthew Wilcox (Oracle) wrote:
> Handle folios of arbitrary size instead of working in PAGE_SIZE units.
>
> Signed-off-by: Matthew Wilcox (Oracle) <[email protected]>
Reviewed-by: Darrick J. Wong <[email protected]>
--D
> ---
> fs/iomap/buffered-io.c | 61 +++++++++++++++++++++---------------------
> 1 file changed, 30 insertions(+), 31 deletions(-)
>
> diff --git a/fs/iomap/buffered-io.c b/fs/iomap/buffered-io.c
> index 4732298f74e1..7c702d6c2f64 100644
> --- a/fs/iomap/buffered-io.c
> +++ b/fs/iomap/buffered-io.c
> @@ -188,8 +188,8 @@ static void iomap_read_end_io(struct bio *bio)
> }
>
> struct iomap_readpage_ctx {
> - struct page *cur_page;
> - bool cur_page_in_bio;
> + struct folio *cur_folio;
> + bool cur_folio_in_bio;
> struct bio *bio;
> struct readahead_control *rac;
> };
> @@ -227,8 +227,7 @@ iomap_readpage_actor(struct inode *inode, loff_t pos, loff_t length, void *data,
> struct iomap *iomap, struct iomap *srcmap)
> {
> struct iomap_readpage_ctx *ctx = data;
> - struct page *page = ctx->cur_page;
> - struct folio *folio = page_folio(page);
> + struct folio *folio = ctx->cur_folio;
> struct iomap_page *iop = iomap_page_create(inode, folio);
> bool same_page = false, is_contig = false;
> loff_t orig_pos = pos;
> @@ -237,7 +236,7 @@ iomap_readpage_actor(struct inode *inode, loff_t pos, loff_t length, void *data,
>
> if (iomap->type == IOMAP_INLINE) {
> WARN_ON_ONCE(pos);
> - iomap_read_inline_data(inode, page, iomap);
> + iomap_read_inline_data(inode, &folio->page, iomap);
> return PAGE_SIZE;
> }
>
> @@ -252,7 +251,7 @@ iomap_readpage_actor(struct inode *inode, loff_t pos, loff_t length, void *data,
> goto done;
> }
>
> - ctx->cur_page_in_bio = true;
> + ctx->cur_folio_in_bio = true;
> if (iop)
> atomic_add(plen, &iop->read_bytes_pending);
>
> @@ -266,7 +265,7 @@ iomap_readpage_actor(struct inode *inode, loff_t pos, loff_t length, void *data,
> }
>
> if (!is_contig || bio_full(ctx->bio, plen)) {
> - gfp_t gfp = mapping_gfp_constraint(page->mapping, GFP_KERNEL);
> + gfp_t gfp = mapping_gfp_constraint(folio->mapping, GFP_KERNEL);
> gfp_t orig_gfp = gfp;
> unsigned int nr_vecs = DIV_ROUND_UP(length, PAGE_SIZE);
>
> @@ -305,30 +304,31 @@ iomap_readpage_actor(struct inode *inode, loff_t pos, loff_t length, void *data,
> int
> iomap_readpage(struct page *page, const struct iomap_ops *ops)
> {
> - struct iomap_readpage_ctx ctx = { .cur_page = page };
> - struct inode *inode = page->mapping->host;
> - unsigned poff;
> + struct folio *folio = page_folio(page);
> + struct iomap_readpage_ctx ctx = { .cur_folio = folio };
> + struct inode *inode = folio->mapping->host;
> + size_t poff;
> loff_t ret;
> + size_t len = folio_size(folio);
>
> - trace_iomap_readpage(page->mapping->host, 1);
> + trace_iomap_readpage(inode, 1);
>
> - for (poff = 0; poff < PAGE_SIZE; poff += ret) {
> - ret = iomap_apply(inode, page_offset(page) + poff,
> - PAGE_SIZE - poff, 0, ops, &ctx,
> - iomap_readpage_actor);
> + for (poff = 0; poff < len; poff += ret) {
> + ret = iomap_apply(inode, folio_pos(folio) + poff, len - poff,
> + 0, ops, &ctx, iomap_readpage_actor);
> if (ret <= 0) {
> WARN_ON_ONCE(ret == 0);
> - SetPageError(page);
> + folio_set_error(folio);
> break;
> }
> }
>
> if (ctx.bio) {
> submit_bio(ctx.bio);
> - WARN_ON_ONCE(!ctx.cur_page_in_bio);
> + WARN_ON_ONCE(!ctx.cur_folio_in_bio);
> } else {
> - WARN_ON_ONCE(ctx.cur_page_in_bio);
> - unlock_page(page);
> + WARN_ON_ONCE(ctx.cur_folio_in_bio);
> + folio_unlock(folio);
> }
>
> /*
> @@ -348,15 +348,15 @@ iomap_readahead_actor(struct inode *inode, loff_t pos, loff_t length,
> loff_t done, ret;
>
> for (done = 0; done < length; done += ret) {
> - if (ctx->cur_page && offset_in_page(pos + done) == 0) {
> - if (!ctx->cur_page_in_bio)
> - unlock_page(ctx->cur_page);
> - put_page(ctx->cur_page);
> - ctx->cur_page = NULL;
> + if (ctx->cur_folio &&
> + offset_in_folio(ctx->cur_folio, pos + done) == 0) {
> + if (!ctx->cur_folio_in_bio)
> + folio_unlock(ctx->cur_folio);
> + ctx->cur_folio = NULL;
> }
> - if (!ctx->cur_page) {
> - ctx->cur_page = readahead_page(ctx->rac);
> - ctx->cur_page_in_bio = false;
> + if (!ctx->cur_folio) {
> + ctx->cur_folio = readahead_folio(ctx->rac);
> + ctx->cur_folio_in_bio = false;
> }
> ret = iomap_readpage_actor(inode, pos + done, length - done,
> ctx, iomap, srcmap);
> @@ -404,10 +404,9 @@ void iomap_readahead(struct readahead_control *rac, const struct iomap_ops *ops)
>
> if (ctx.bio)
> submit_bio(ctx.bio);
> - if (ctx.cur_page) {
> - if (!ctx.cur_page_in_bio)
> - unlock_page(ctx.cur_page);
> - put_page(ctx.cur_page);
> + if (ctx.cur_folio) {
> + if (!ctx.cur_folio_in_bio)
> + folio_unlock(ctx.cur_folio);
> }
> }
> EXPORT_SYMBOL_GPL(iomap_readahead);
> --
> 2.30.2
>
On Thu, Jul 15, 2021 at 04:36:27AM +0100, Matthew Wilcox (Oracle) wrote:
> If we write to any page in a folio, we have to mark the entire
> folio as dirty, and potentially COW the entire folio, because it'll
> all get written back as one unit.
>
> Signed-off-by: Matthew Wilcox (Oracle) <[email protected]>
> ---
> fs/iomap/buffered-io.c | 42 +++++++++++++++++++++---------------------
> 1 file changed, 21 insertions(+), 21 deletions(-)
>
> diff --git a/fs/iomap/buffered-io.c b/fs/iomap/buffered-io.c
> index 7c702d6c2f64..a3fe0d36c739 100644
> --- a/fs/iomap/buffered-io.c
> +++ b/fs/iomap/buffered-io.c
> @@ -951,23 +951,23 @@ iomap_truncate_page(struct inode *inode, loff_t pos, bool *did_zero,
> }
> EXPORT_SYMBOL_GPL(iomap_truncate_page);
>
> -static loff_t
> -iomap_page_mkwrite_actor(struct inode *inode, loff_t pos, loff_t length,
> - void *data, struct iomap *iomap, struct iomap *srcmap)
> +static loff_t iomap_folio_mkwrite_actor(struct inode *inode, loff_t pos,
> + loff_t length, void *data, struct iomap *iomap,
> + struct iomap *srcmap)
> {
> - struct page *page = data;
> - struct folio *folio = page_folio(page);
> + struct folio *folio = data;
> int ret;
>
> if (iomap->flags & IOMAP_F_BUFFER_HEAD) {
> - ret = __block_write_begin_int(page, pos, length, NULL, iomap);
> + ret = __block_write_begin_int(&folio->page, pos, length, NULL,
> + iomap);
> if (ret)
> return ret;
> - block_commit_write(page, 0, length);
> + block_commit_write(&folio->page, 0, length);
> } else {
> - WARN_ON_ONCE(!PageUptodate(page));
> + WARN_ON_ONCE(!folio_test_uptodate(folio));
> iomap_page_create(inode, folio);
> - set_page_dirty(page);
> + folio_mark_dirty(folio);
> }
>
> return length;
> @@ -975,33 +975,33 @@ iomap_page_mkwrite_actor(struct inode *inode, loff_t pos, loff_t length,
>
> vm_fault_t iomap_page_mkwrite(struct vm_fault *vmf, const struct iomap_ops *ops)
> {
> - struct page *page = vmf->page;
> + struct folio *folio = page_folio(vmf->page);
If before the page fault the folio was a compound 2M page, will the
memory manager have split it into 4k pages before passing it to us?
That's a roundabout way of asking if we should expect folio_mkwrite at
some point. ;)
The conversion looks pretty straightforward though.
Reviewed-by: Darrick J. Wong <[email protected]>
--D
> struct inode *inode = file_inode(vmf->vma->vm_file);
> - unsigned long length;
> - loff_t offset;
> + size_t length;
> + loff_t pos;
> ssize_t ret;
>
> - lock_page(page);
> - ret = page_mkwrite_check_truncate(page, inode);
> + folio_lock(folio);
> + ret = folio_mkwrite_check_truncate(folio, inode);
> if (ret < 0)
> goto out_unlock;
> length = ret;
>
> - offset = page_offset(page);
> + pos = folio_pos(folio);
> while (length > 0) {
> - ret = iomap_apply(inode, offset, length,
> - IOMAP_WRITE | IOMAP_FAULT, ops, page,
> - iomap_page_mkwrite_actor);
> + ret = iomap_apply(inode, pos, length,
> + IOMAP_WRITE | IOMAP_FAULT, ops, folio,
> + iomap_folio_mkwrite_actor);
> if (unlikely(ret <= 0))
> goto out_unlock;
> - offset += ret;
> + pos += ret;
> length -= ret;
> }
>
> - wait_for_stable_page(page);
> + folio_wait_stable(folio);
> return VM_FAULT_LOCKED;
> out_unlock:
> - unlock_page(page);
> + folio_unlock(folio);
> return block_page_mkwrite_return(ret);
> }
> EXPORT_SYMBOL_GPL(iomap_page_mkwrite);
> --
> 2.30.2
>
On Thu, Jul 15, 2021 at 04:36:28AM +0100, Matthew Wilcox (Oracle) wrote:
> These functions still only work in PAGE_SIZE chunks, but there are
> fewer conversions from head to tail pages as a result of this patch.
>
> Signed-off-by: Matthew Wilcox (Oracle) <[email protected]>
> ---
> fs/iomap/buffered-io.c | 68 ++++++++++++++++++++++--------------------
> 1 file changed, 36 insertions(+), 32 deletions(-)
>
> diff --git a/fs/iomap/buffered-io.c b/fs/iomap/buffered-io.c
> index a3fe0d36c739..5e0aa23d4693 100644
> --- a/fs/iomap/buffered-io.c
> +++ b/fs/iomap/buffered-io.c
> @@ -541,9 +541,8 @@ static int iomap_read_folio_sync(loff_t block_start, struct folio *folio,
>
> static int
> __iomap_write_begin(struct inode *inode, loff_t pos, unsigned len, int flags,
> - struct page *page, struct iomap *srcmap)
> + struct folio *folio, struct iomap *srcmap)
> {
> - struct folio *folio = page_folio(page);
> struct iomap_page *iop = iomap_page_create(inode, folio);
> loff_t block_size = i_blocksize(inode);
> loff_t block_start = round_down(pos, block_size);
> @@ -583,12 +582,14 @@ __iomap_write_begin(struct inode *inode, loff_t pos, unsigned len, int flags,
> return 0;
> }
>
> -static int
> -iomap_write_begin(struct inode *inode, loff_t pos, unsigned len, unsigned flags,
> - struct page **pagep, struct iomap *iomap, struct iomap *srcmap)
> +static int iomap_write_begin(struct inode *inode, loff_t pos, size_t len,
> + unsigned flags, struct folio **foliop, struct iomap *iomap,
> + struct iomap *srcmap)
> {
> const struct iomap_page_ops *page_ops = iomap->page_ops;
> + struct folio *folio;
> struct page *page;
> + unsigned fgp = FGP_LOCK | FGP_WRITE | FGP_CREAT | FGP_STABLE | FGP_NOFS;
> int status = 0;
>
> BUG_ON(pos + len > iomap->offset + iomap->length);
> @@ -604,30 +605,31 @@ iomap_write_begin(struct inode *inode, loff_t pos, unsigned len, unsigned flags,
> return status;
> }
>
> - page = grab_cache_page_write_begin(inode->i_mapping, pos >> PAGE_SHIFT,
> - AOP_FLAG_NOFS);
> - if (!page) {
> + folio = __filemap_get_folio(inode->i_mapping, pos >> PAGE_SHIFT, fgp,
Ah, ok, so we're moving the FGP_* flags up to iomap now.
> + mapping_gfp_mask(inode->i_mapping));
> + if (!folio) {
> status = -ENOMEM;
> goto out_no_page;
> }
>
> + page = folio_file_page(folio, pos >> PAGE_SHIFT);
> if (srcmap->type == IOMAP_INLINE)
> iomap_read_inline_data(inode, page, srcmap);
> else if (iomap->flags & IOMAP_F_BUFFER_HEAD)
> status = __block_write_begin_int(page, pos, len, NULL, srcmap);
> else
> - status = __iomap_write_begin(inode, pos, len, flags, page,
> + status = __iomap_write_begin(inode, pos, len, flags, folio,
> srcmap);
>
> if (unlikely(status))
> goto out_unlock;
>
> - *pagep = page;
> + *foliop = folio;
> return 0;
>
> out_unlock:
> - unlock_page(page);
> - put_page(page);
> + folio_unlock(folio);
> + folio_put(folio);
> iomap_write_failed(inode, pos, len);
>
> out_no_page:
> @@ -637,11 +639,10 @@ iomap_write_begin(struct inode *inode, loff_t pos, unsigned len, unsigned flags,
> }
>
> static size_t __iomap_write_end(struct inode *inode, loff_t pos, size_t len,
> - size_t copied, struct page *page)
> + size_t copied, struct folio *folio)
> {
> - struct folio *folio = page_folio(page);
> struct iomap_page *iop = to_iomap_page(folio);
> - flush_dcache_page(page);
> + flush_dcache_folio(folio);
>
> /*
> * The blocks that were entirely written will now be uptodate, so we
> @@ -654,10 +655,10 @@ static size_t __iomap_write_end(struct inode *inode, loff_t pos, size_t len,
> * uptodate page as a zero-length write, and force the caller to redo
> * the whole thing.
> */
> - if (unlikely(copied < len && !PageUptodate(page)))
> + if (unlikely(copied < len && !folio_test_uptodate(folio)))
> return 0;
> iomap_set_range_uptodate(folio, iop, offset_in_folio(folio, pos), len);
> - __set_page_dirty_nobuffers(page);
> + filemap_dirty_folio(inode->i_mapping, folio);
> return copied;
> }
>
> @@ -680,9 +681,10 @@ static size_t iomap_write_end_inline(struct inode *inode, struct page *page,
>
> /* Returns the number of bytes copied. May be 0. Cannot be an errno. */
> static size_t iomap_write_end(struct inode *inode, loff_t pos, size_t len,
> - size_t copied, struct page *page, struct iomap *iomap,
> + size_t copied, struct folio *folio, struct iomap *iomap,
> struct iomap *srcmap)
> {
> + struct page *page = folio_file_page(folio, pos / PAGE_SIZE);
pos >> PAGE_SHIFT ?
(There's a few more of these elsewhere...)
--D
> const struct iomap_page_ops *page_ops = iomap->page_ops;
> loff_t old_size = inode->i_size;
> size_t ret;
> @@ -693,7 +695,7 @@ static size_t iomap_write_end(struct inode *inode, loff_t pos, size_t len,
> ret = block_write_end(NULL, inode->i_mapping, pos, len, copied,
> page, NULL);
> } else {
> - ret = __iomap_write_end(inode, pos, len, copied, page);
> + ret = __iomap_write_end(inode, pos, len, copied, folio);
> }
>
> /*
> @@ -705,13 +707,13 @@ static size_t iomap_write_end(struct inode *inode, loff_t pos, size_t len,
> i_size_write(inode, pos + ret);
> iomap->flags |= IOMAP_F_SIZE_CHANGED;
> }
> - unlock_page(page);
> + folio_unlock(folio);
>
> if (old_size < pos)
> pagecache_isize_extended(inode, old_size, pos);
> if (page_ops && page_ops->page_done)
> page_ops->page_done(inode, pos, ret, page, iomap);
> - put_page(page);
> + folio_put(folio);
>
> if (ret < len)
> iomap_write_failed(inode, pos, len);
> @@ -727,6 +729,7 @@ iomap_write_actor(struct inode *inode, loff_t pos, loff_t length, void *data,
> ssize_t written = 0;
>
> do {
> + struct folio *folio;
> struct page *page;
> unsigned long offset; /* Offset into pagecache page */
> unsigned long bytes; /* Bytes to write to page */
> @@ -750,18 +753,19 @@ iomap_write_actor(struct inode *inode, loff_t pos, loff_t length, void *data,
> break;
> }
>
> - status = iomap_write_begin(inode, pos, bytes, 0, &page, iomap,
> + status = iomap_write_begin(inode, pos, bytes, 0, &folio, iomap,
> srcmap);
> if (unlikely(status))
> break;
>
> + page = folio_file_page(folio, pos / PAGE_SIZE);
> if (mapping_writably_mapped(inode->i_mapping))
> flush_dcache_page(page);
>
> copied = copy_page_from_iter_atomic(page, offset, bytes, i);
>
> - status = iomap_write_end(inode, pos, bytes, copied, page, iomap,
> - srcmap);
> + status = iomap_write_end(inode, pos, bytes, copied, folio,
> + iomap, srcmap);
>
> if (unlikely(copied != status))
> iov_iter_revert(i, copied - status);
> @@ -825,14 +829,14 @@ iomap_unshare_actor(struct inode *inode, loff_t pos, loff_t length, void *data,
> do {
> unsigned long offset = offset_in_page(pos);
> unsigned long bytes = min_t(loff_t, PAGE_SIZE - offset, length);
> - struct page *page;
> + struct folio *folio;
>
> status = iomap_write_begin(inode, pos, bytes,
> - IOMAP_WRITE_F_UNSHARE, &page, iomap, srcmap);
> + IOMAP_WRITE_F_UNSHARE, &folio, iomap, srcmap);
> if (unlikely(status))
> return status;
>
> - status = iomap_write_end(inode, pos, bytes, bytes, page, iomap,
> + status = iomap_write_end(inode, pos, bytes, bytes, folio, iomap,
> srcmap);
> if (WARN_ON_ONCE(status == 0))
> return -EIO;
> @@ -871,19 +875,19 @@ EXPORT_SYMBOL_GPL(iomap_file_unshare);
> static s64 iomap_zero(struct inode *inode, loff_t pos, u64 length,
> struct iomap *iomap, struct iomap *srcmap)
> {
> - struct page *page;
> + struct folio *folio;
> int status;
> unsigned offset = offset_in_page(pos);
> unsigned bytes = min_t(u64, PAGE_SIZE - offset, length);
>
> - status = iomap_write_begin(inode, pos, bytes, 0, &page, iomap, srcmap);
> + status = iomap_write_begin(inode, pos, bytes, 0, &folio, iomap, srcmap);
> if (status)
> return status;
>
> - zero_user(page, offset, bytes);
> - mark_page_accessed(page);
> + zero_user(folio_file_page(folio, pos / PAGE_SIZE), offset, bytes);
> + folio_mark_accessed(folio);
>
> - return iomap_write_end(inode, pos, bytes, bytes, page, iomap, srcmap);
> + return iomap_write_end(inode, pos, bytes, bytes, folio, iomap, srcmap);
> }
>
> static loff_t iomap_zero_range_actor(struct inode *inode, loff_t pos,
> --
> 2.30.2
>
On Thu, Jul 15, 2021 at 04:36:29AM +0100, Matthew Wilcox (Oracle) wrote:
> Inline data is restricted to being less than a page in size, so we
$deity I hope so.
> don't need to handle multi-page folios.
>
> Signed-off-by: Matthew Wilcox (Oracle) <[email protected]>
Reviewed-by: Darrick J. Wong <[email protected]>
--D
> ---
> fs/iomap/buffered-io.c | 18 +++++++++---------
> 1 file changed, 9 insertions(+), 9 deletions(-)
>
> diff --git a/fs/iomap/buffered-io.c b/fs/iomap/buffered-io.c
> index 5e0aa23d4693..c616ef1feb21 100644
> --- a/fs/iomap/buffered-io.c
> +++ b/fs/iomap/buffered-io.c
> @@ -194,24 +194,24 @@ struct iomap_readpage_ctx {
> struct readahead_control *rac;
> };
>
> -static void
> -iomap_read_inline_data(struct inode *inode, struct page *page,
> +static void iomap_read_inline_data(struct inode *inode, struct folio *folio,
> struct iomap *iomap)
> {
> size_t size = i_size_read(inode);
> void *addr;
>
> - if (PageUptodate(page))
> + if (folio_test_uptodate(folio))
> return;
>
> - BUG_ON(page->index);
> + BUG_ON(folio->index);
> + BUG_ON(folio_multi(folio));
> BUG_ON(size > PAGE_SIZE - offset_in_page(iomap->inline_data));
>
> - addr = kmap_atomic(page);
> + addr = kmap_local_folio(folio, 0);
> memcpy(addr, iomap->inline_data, size);
> memset(addr + size, 0, PAGE_SIZE - size);
> - kunmap_atomic(addr);
> - SetPageUptodate(page);
> + kunmap_local(addr);
> + folio_mark_uptodate(folio);
> }
>
> static inline bool iomap_block_needs_zeroing(struct inode *inode,
> @@ -236,7 +236,7 @@ iomap_readpage_actor(struct inode *inode, loff_t pos, loff_t length, void *data,
>
> if (iomap->type == IOMAP_INLINE) {
> WARN_ON_ONCE(pos);
> - iomap_read_inline_data(inode, &folio->page, iomap);
> + iomap_read_inline_data(inode, folio, iomap);
> return PAGE_SIZE;
> }
>
> @@ -614,7 +614,7 @@ static int iomap_write_begin(struct inode *inode, loff_t pos, size_t len,
>
> page = folio_file_page(folio, pos >> PAGE_SHIFT);
> if (srcmap->type == IOMAP_INLINE)
> - iomap_read_inline_data(inode, page, srcmap);
> + iomap_read_inline_data(inode, folio, srcmap);
> else if (iomap->flags & IOMAP_F_BUFFER_HEAD)
> status = __block_write_begin_int(page, pos, len, NULL, srcmap);
> else
> --
> 2.30.2
>
On Thu, Jul 15, 2021 at 04:36:30AM +0100, Matthew Wilcox (Oracle) wrote:
> Inline data only occupies a single page, but using a folio means that
> we don't need to call compound_head() in PageUptodate().
>
> Signed-off-by: Matthew Wilcox (Oracle) <[email protected]>
/me isn't the expert on inlinedata, but this looks reasonable to me...
Reviewed-by: Darrick J. Wong <[email protected]>
--D
> ---
> fs/iomap/buffered-io.c | 12 ++++++------
> 1 file changed, 6 insertions(+), 6 deletions(-)
>
> diff --git a/fs/iomap/buffered-io.c b/fs/iomap/buffered-io.c
> index c616ef1feb21..ac33f19325ab 100644
> --- a/fs/iomap/buffered-io.c
> +++ b/fs/iomap/buffered-io.c
> @@ -662,18 +662,18 @@ static size_t __iomap_write_end(struct inode *inode, loff_t pos, size_t len,
> return copied;
> }
>
> -static size_t iomap_write_end_inline(struct inode *inode, struct page *page,
> +static size_t iomap_write_end_inline(struct inode *inode, struct folio *folio,
> struct iomap *iomap, loff_t pos, size_t copied)
> {
> void *addr;
>
> - WARN_ON_ONCE(!PageUptodate(page));
> + WARN_ON_ONCE(!folio_test_uptodate(folio));
> BUG_ON(pos + copied > PAGE_SIZE - offset_in_page(iomap->inline_data));
>
> - flush_dcache_page(page);
> - addr = kmap_atomic(page);
> + flush_dcache_folio(folio);
> + addr = kmap_local_folio(folio, 0);
> memcpy(iomap->inline_data + pos, addr + pos, copied);
> - kunmap_atomic(addr);
> + kunmap_local(addr);
>
> mark_inode_dirty(inode);
> return copied;
> @@ -690,7 +690,7 @@ static size_t iomap_write_end(struct inode *inode, loff_t pos, size_t len,
> size_t ret;
>
> if (srcmap->type == IOMAP_INLINE) {
> - ret = iomap_write_end_inline(inode, page, iomap, pos, copied);
> + ret = iomap_write_end_inline(inode, folio, iomap, pos, copied);
> } else if (srcmap->flags & IOMAP_F_BUFFER_HEAD) {
> ret = block_write_end(NULL, inode->i_mapping, pos, len, copied,
> page, NULL);
> --
> 2.30.2
>
On Thu, Jul 15, 2021 at 04:36:31AM +0100, Matthew Wilcox (Oracle) wrote:
> We still iterate one block at a time, but now we call compound_head()
> less often. Rename file_offset to pos to fit the rest of the file.
>
> Signed-off-by: Matthew Wilcox (Oracle) <[email protected]>
> ---
> fs/iomap/buffered-io.c | 66 +++++++++++++++++++-----------------------
> 1 file changed, 30 insertions(+), 36 deletions(-)
>
> diff --git a/fs/iomap/buffered-io.c b/fs/iomap/buffered-io.c
> index ac33f19325ab..8e767aec8d07 100644
> --- a/fs/iomap/buffered-io.c
> +++ b/fs/iomap/buffered-io.c
> @@ -1252,36 +1252,29 @@ iomap_can_add_to_ioend(struct iomap_writepage_ctx *wpc, loff_t offset,
> * first, otherwise finish off the current ioend and start another.
> */
> static void
> -iomap_add_to_ioend(struct inode *inode, loff_t offset, struct page *page,
> +iomap_add_to_ioend(struct inode *inode, loff_t pos, struct folio *folio,
> struct iomap_page *iop, struct iomap_writepage_ctx *wpc,
> struct writeback_control *wbc, struct list_head *iolist)
> {
> - sector_t sector = iomap_sector(&wpc->iomap, offset);
> + sector_t sector = iomap_sector(&wpc->iomap, pos);
> unsigned len = i_blocksize(inode);
> - unsigned poff = offset & (PAGE_SIZE - 1);
> - bool merged, same_page = false;
> + size_t poff = offset_in_folio(folio, pos);
>
> - if (!wpc->ioend || !iomap_can_add_to_ioend(wpc, offset, sector)) {
> + if (!wpc->ioend || !iomap_can_add_to_ioend(wpc, pos, sector)) {
> if (wpc->ioend)
> list_add(&wpc->ioend->io_list, iolist);
> - wpc->ioend = iomap_alloc_ioend(inode, wpc, offset, sector, wbc);
> + wpc->ioend = iomap_alloc_ioend(inode, wpc, pos, sector, wbc);
> }
>
> - merged = __bio_try_merge_page(wpc->ioend->io_bio, page, len, poff,
> - &same_page);
> if (iop)
> atomic_add(len, &iop->write_bytes_pending);
> -
> - if (!merged) {
> - if (bio_full(wpc->ioend->io_bio, len)) {
> - wpc->ioend->io_bio =
> - iomap_chain_bio(wpc->ioend->io_bio);
> - }
> - bio_add_page(wpc->ioend->io_bio, page, len, poff);
> + if (!bio_add_folio(wpc->ioend->io_bio, folio, len, poff)) {
> + wpc->ioend->io_bio = iomap_chain_bio(wpc->ioend->io_bio);
> + bio_add_folio(wpc->ioend->io_bio, folio, len, poff);
The paranoiac in me wonders if we ought to have some sort of error
checking here just in case we encounter double failures?
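Something like this, maybe (totally untested sketch; since we're in a void
function I'm assuming a WARN is about all we can do here):

	if (!bio_add_folio(wpc->ioend->io_bio, folio, len, poff)) {
		wpc->ioend->io_bio = iomap_chain_bio(wpc->ioend->io_bio);
		if (WARN_ON_ONCE(!bio_add_folio(wpc->ioend->io_bio,
						folio, len, poff)))
			return;
	}

though I suppose a freshly chained bio should never refuse a single block.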
> }
>
> wpc->ioend->io_size += len;
> - wbc_account_cgroup_owner(wbc, page, len);
> + wbc_account_cgroup_owner(wbc, &folio->page, len);
> }
>
> /*
> @@ -1309,40 +1302,41 @@ iomap_writepage_map(struct iomap_writepage_ctx *wpc,
> struct iomap_page *iop = to_iomap_page(folio);
> struct iomap_ioend *ioend, *next;
> unsigned len = i_blocksize(inode);
> - u64 file_offset; /* file offset of page */
> + unsigned nblocks = i_blocks_per_folio(inode, folio);
> + loff_t pos = folio_pos(folio);
> int error = 0, count = 0, i;
> LIST_HEAD(submit_list);
>
> - WARN_ON_ONCE(i_blocks_per_page(inode, page) > 1 && !iop);
> + WARN_ON_ONCE(nblocks > 1 && !iop);
> WARN_ON_ONCE(iop && atomic_read(&iop->write_bytes_pending) != 0);
>
> /*
> - * Walk through the page to find areas to write back. If we run off the
> - * end of the current map or find the current map invalid, grab a new
> - * one.
> + * Walk through the folio to find areas to write back. If we
> + * run off the end of the current map or find the current map
> + * invalid, grab a new one.
> */
> - for (i = 0, file_offset = page_offset(page);
> - i < (PAGE_SIZE >> inode->i_blkbits) && file_offset < end_offset;
> - i++, file_offset += len) {
> + for (i = 0; i < nblocks; i++, pos += len) {
> + if (pos >= end_offset)
> + break;
Any particular reason this isn't:
for (i = 0; i < nblocks && pos < end_offset; i++, pos += len) {
?
Everything from here on out looks decent to me.
--D
> if (iop && !test_bit(i, iop->uptodate))
> continue;
>
> - error = wpc->ops->map_blocks(wpc, inode, file_offset);
> + error = wpc->ops->map_blocks(wpc, inode, pos);
> if (error)
> break;
> if (WARN_ON_ONCE(wpc->iomap.type == IOMAP_INLINE))
> continue;
> if (wpc->iomap.type == IOMAP_HOLE)
> continue;
> - iomap_add_to_ioend(inode, file_offset, page, iop, wpc, wbc,
> + iomap_add_to_ioend(inode, pos, folio, iop, wpc, wbc,
> &submit_list);
> count++;
> }
>
> WARN_ON_ONCE(!wpc->ioend && !list_empty(&submit_list));
> - WARN_ON_ONCE(!PageLocked(page));
> - WARN_ON_ONCE(PageWriteback(page));
> - WARN_ON_ONCE(PageDirty(page));
> + WARN_ON_ONCE(!folio_test_locked(folio));
> + WARN_ON_ONCE(folio_test_writeback(folio));
> + WARN_ON_ONCE(folio_test_dirty(folio));
>
> /*
> * We cannot cancel the ioend directly here on error. We may have
> @@ -1358,16 +1352,16 @@ iomap_writepage_map(struct iomap_writepage_ctx *wpc,
> * now.
> */
> if (wpc->ops->discard_page)
> - wpc->ops->discard_page(page, file_offset);
> + wpc->ops->discard_page(&folio->page, pos);
> if (!count) {
> - ClearPageUptodate(page);
> - unlock_page(page);
> + folio_clear_uptodate(folio);
> + folio_unlock(folio);
> goto done;
> }
> }
>
> - set_page_writeback(page);
> - unlock_page(page);
> + folio_start_writeback(folio);
> + folio_unlock(folio);
>
> /*
> * Preserve the original error if there was one, otherwise catch
> @@ -1388,9 +1382,9 @@ iomap_writepage_map(struct iomap_writepage_ctx *wpc,
> * with a partial page truncate on a sub-page block sized filesystem.
> */
> if (!count)
> - end_page_writeback(page);
> + folio_end_writeback(folio);
> done:
> - mapping_set_error(page->mapping, error);
> + mapping_set_error(folio->mapping, error);
> return error;
> }
>
> --
> 2.30.2
>
On Thu, Jul 15, 2021 at 04:36:32AM +0100, Matthew Wilcox (Oracle) wrote:
> Writeback an entire folio at a time, and adjust some of the variables
> to have more familiar names.
>
> Signed-off-by: Matthew Wilcox (Oracle) <[email protected]>
> ---
> fs/iomap/buffered-io.c | 49 +++++++++++++++++++-----------------------
> 1 file changed, 22 insertions(+), 27 deletions(-)
>
> diff --git a/fs/iomap/buffered-io.c b/fs/iomap/buffered-io.c
> index 8e767aec8d07..0731e2c3f44b 100644
> --- a/fs/iomap/buffered-io.c
> +++ b/fs/iomap/buffered-io.c
> @@ -1296,9 +1296,8 @@ iomap_add_to_ioend(struct inode *inode, loff_t pos, struct folio *folio,
> static int
> iomap_writepage_map(struct iomap_writepage_ctx *wpc,
> struct writeback_control *wbc, struct inode *inode,
> - struct page *page, u64 end_offset)
> + struct folio *folio, loff_t end_pos)
> {
> - struct folio *folio = page_folio(page);
> struct iomap_page *iop = to_iomap_page(folio);
> struct iomap_ioend *ioend, *next;
> unsigned len = i_blocksize(inode);
> @@ -1316,7 +1315,7 @@ iomap_writepage_map(struct iomap_writepage_ctx *wpc,
> * invalid, grab a new one.
> */
> for (i = 0; i < nblocks; i++, pos += len) {
> - if (pos >= end_offset)
> + if (pos >= end_pos)
> break;
> if (iop && !test_bit(i, iop->uptodate))
> continue;
> @@ -1398,16 +1397,15 @@ iomap_writepage_map(struct iomap_writepage_ctx *wpc,
> static int
> iomap_do_writepage(struct page *page, struct writeback_control *wbc, void *data)
I imagine at some point this will become iomap_do_writefolio and there
will be some sort of write_cache_folios() call? Or the equivalent
while (get_next_folio_to_write()) iomap_write_folio(); type loop?
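i.e. something shaped like this, with entirely made-up function names, just
to sketch the shape I'm imagining:

	struct folio *folio;

	while ((folio = get_next_folio_to_write(mapping, wbc)) != NULL)
		iomap_write_folio(wpc, wbc, folio->mapping->host, folio);

but that's a question for a later series.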
Reviewed-by: Darrick J. Wong <[email protected]>
--D
> {
> + struct folio *folio = page_folio(page);
> struct iomap_writepage_ctx *wpc = data;
> - struct inode *inode = page->mapping->host;
> - pgoff_t end_index;
> - u64 end_offset;
> - loff_t offset;
> + struct inode *inode = folio->mapping->host;
> + loff_t end_pos, isize;
>
> - trace_iomap_writepage(inode, page_offset(page), PAGE_SIZE);
> + trace_iomap_writepage(inode, folio_pos(folio), folio_size(folio));
>
> /*
> - * Refuse to write the page out if we are called from reclaim context.
> + * Refuse to write the folio out if we are called from reclaim context.
> *
> * This avoids stack overflows when called from deeply used stacks in
> * random callers for direct reclaim or memcg reclaim. We explicitly
> @@ -1421,10 +1419,10 @@ iomap_do_writepage(struct page *page, struct writeback_control *wbc, void *data)
> goto redirty;
>
> /*
> - * Is this page beyond the end of the file?
> + * Is this folio beyond the end of the file?
> *
> - * The page index is less than the end_index, adjust the end_offset
> - * to the highest offset that this page should represent.
> + * The folio index is less than the end_index, adjust the end_pos
> + * to the highest offset that this folio should represent.
> * -----------------------------------------------------
> * | file mapping | <EOF> |
> * -----------------------------------------------------
> @@ -1433,11 +1431,9 @@ iomap_do_writepage(struct page *page, struct writeback_control *wbc, void *data)
> * | desired writeback range | see else |
> * ---------------------------------^------------------|
> */
> - offset = i_size_read(inode);
> - end_index = offset >> PAGE_SHIFT;
> - if (page->index < end_index)
> - end_offset = (loff_t)(page->index + 1) << PAGE_SHIFT;
> - else {
> + isize = i_size_read(inode);
> + end_pos = folio_pos(folio) + folio_size(folio);
> + if (end_pos - 1 >= isize) {
> /*
> * Check whether the page to write out is beyond or straddles
> * i_size or not.
> @@ -1449,7 +1445,8 @@ iomap_do_writepage(struct page *page, struct writeback_control *wbc, void *data)
> * | | Straddles |
> * ---------------------------------^-----------|--------|
> */
> - unsigned offset_into_page = offset & (PAGE_SIZE - 1);
> + size_t poff = offset_in_folio(folio, isize);
> + pgoff_t end_index = isize >> PAGE_SHIFT;
>
> /*
> * Skip the page if it is fully outside i_size, e.g. due to a
> @@ -1468,8 +1465,8 @@ iomap_do_writepage(struct page *page, struct writeback_control *wbc, void *data)
> * if the page to write is totally beyond the i_size or if it's
> * offset is just equal to the EOF.
> */
> - if (page->index > end_index ||
> - (page->index == end_index && offset_into_page == 0))
> + if (folio->index > end_index ||
> + (folio->index == end_index && poff == 0))
> goto redirty;
>
> /*
> @@ -1480,17 +1477,15 @@ iomap_do_writepage(struct page *page, struct writeback_control *wbc, void *data)
> * memory is zeroed when mapped, and writes to that region are
> * not written out to the file."
> */
> - zero_user_segment(page, offset_into_page, PAGE_SIZE);
> -
> - /* Adjust the end_offset to the end of file */
> - end_offset = offset;
> + zero_user_segment(&folio->page, poff, folio_size(folio));
> + end_pos = isize;
> }
>
> - return iomap_writepage_map(wpc, wbc, inode, page, end_offset);
> + return iomap_writepage_map(wpc, wbc, inode, folio, end_pos);
>
> redirty:
> - redirty_page_for_writepage(wbc, page);
> - unlock_page(page);
> + folio_redirty_for_writepage(wbc, folio);
> + folio_unlock(folio);
> return 0;
> }
>
> --
> 2.30.2
>
On Thu, Jul 15, 2021 at 04:36:33AM +0100, Matthew Wilcox (Oracle) wrote:
> The arguments are still pages for now, but we can use folios internally
> and cut out a lot of calls to compound_head().
>
> Signed-off-by: Matthew Wilcox (Oracle) <[email protected]>
Pretty straightforward conversion.
Reviewed-by: Darrick J. Wong <[email protected]>
--D
> ---
> fs/iomap/buffered-io.c | 12 +++++++-----
> 1 file changed, 7 insertions(+), 5 deletions(-)
>
> diff --git a/fs/iomap/buffered-io.c b/fs/iomap/buffered-io.c
> index 0731e2c3f44b..48de198c5603 100644
> --- a/fs/iomap/buffered-io.c
> +++ b/fs/iomap/buffered-io.c
> @@ -490,19 +490,21 @@ int
> iomap_migrate_page(struct address_space *mapping, struct page *newpage,
> struct page *page, enum migrate_mode mode)
> {
> + struct folio *folio = page_folio(page);
> + struct folio *newfolio = page_folio(newpage);
> int ret;
>
> - ret = migrate_page_move_mapping(mapping, newpage, page, 0);
> + ret = folio_migrate_mapping(mapping, newfolio, folio, 0);
> if (ret != MIGRATEPAGE_SUCCESS)
> return ret;
>
> - if (page_has_private(page))
> - attach_page_private(newpage, detach_page_private(page));
> + if (folio_test_private(folio))
> + folio_attach_private(newfolio, folio_detach_private(folio));
>
> if (mode != MIGRATE_SYNC_NO_COPY)
> - migrate_page_copy(newpage, page);
> + folio_migrate_copy(newfolio, folio);
> else
> - migrate_page_states(newpage, page);
> + folio_migrate_flags(newfolio, folio);
> return MIGRATEPAGE_SUCCESS;
> }
> EXPORT_SYMBOL_GPL(iomap_migrate_page);
> --
> 2.30.2
>
On Thu, Jul 15, 2021 at 04:36:50AM +0100, Matthew Wilcox (Oracle) wrote:
> We still only operate on a single page of data at a time due to using
> kmap(). A more complex implementation would work on each page in a folio,
> but it's not clear that such a complex implementation would be worthwhile.
Does this break up a compound folio into smaller pages?
> Signed-off-by: Matthew Wilcox (Oracle) <[email protected]>
> ---
> fs/remap_range.c | 116 ++++++++++++++++++++++-------------------------
> 1 file changed, 55 insertions(+), 61 deletions(-)
>
> diff --git a/fs/remap_range.c b/fs/remap_range.c
> index e4a5fdd7ad7b..886e6ed2c6c2 100644
> --- a/fs/remap_range.c
> +++ b/fs/remap_range.c
> @@ -158,41 +158,41 @@ static int generic_remap_check_len(struct inode *inode_in,
> }
>
> /* Read a page's worth of file data into the page cache. */
> -static struct page *vfs_dedupe_get_page(struct inode *inode, loff_t offset)
> +static struct folio *vfs_dedupe_get_folio(struct inode *inode, loff_t pos)
> {
> - struct page *page;
> + struct folio *folio;
>
> - page = read_mapping_page(inode->i_mapping, offset >> PAGE_SHIFT, NULL);
> - if (IS_ERR(page))
> - return page;
> - if (!PageUptodate(page)) {
> - put_page(page);
> + folio = read_mapping_folio(inode->i_mapping, pos >> PAGE_SHIFT, NULL);
> + if (IS_ERR(folio))
> + return folio;
> + if (!folio_test_uptodate(folio)) {
> + folio_put(folio);
> return ERR_PTR(-EIO);
> }
> - return page;
> + return folio;
> }
>
> /*
> - * Lock two pages, ensuring that we lock in offset order if the pages are from
> - * the same file.
> + * Lock two folios, ensuring that we lock in offset order if the folios
> + * are from the same file.
> */
> -static void vfs_lock_two_pages(struct page *page1, struct page *page2)
> +static void vfs_lock_two_folios(struct folio *folio1, struct folio *folio2)
> {
> /* Always lock in order of increasing index. */
> - if (page1->index > page2->index)
> - swap(page1, page2);
> + if (folio1->index > folio2->index)
> + swap(folio1, folio2);
>
> - lock_page(page1);
> - if (page1 != page2)
> - lock_page(page2);
> + folio_lock(folio1);
> + if (folio1 != folio2)
> + folio_lock(folio2);
> }
>
> -/* Unlock two pages, being careful not to unlock the same page twice. */
> -static void vfs_unlock_two_pages(struct page *page1, struct page *page2)
> +/* Unlock two folios, being careful not to unlock the same folio twice. */
> +static void vfs_unlock_two_folios(struct folio *folio1, struct folio *folio2)
> {
> - unlock_page(page1);
> - if (page1 != page2)
> - unlock_page(page2);
> + folio_unlock(folio1);
> + if (folio1 != folio2)
> + folio_unlock(folio2);
This could result in a lot of folio lock cycling. Do you think it's
worth the effort to minimize this by keeping the folio locked if the
next page is going to be from the same one?
--D
> }
>
> /*
> @@ -200,77 +200,71 @@ static void vfs_unlock_two_pages(struct page *page1, struct page *page2)
> * Caller must have locked both inodes to prevent write races.
> */
> static int vfs_dedupe_file_range_compare(struct inode *src, loff_t srcoff,
> - struct inode *dest, loff_t destoff,
> + struct inode *dest, loff_t dstoff,
> loff_t len, bool *is_same)
> {
> - loff_t src_poff;
> - loff_t dest_poff;
> - void *src_addr;
> - void *dest_addr;
> - struct page *src_page;
> - struct page *dest_page;
> - loff_t cmp_len;
> - bool same;
> - int error;
> -
> - error = -EINVAL;
> - same = true;
> + bool same = true;
> + int error = -EINVAL;
> +
> while (len) {
> - src_poff = srcoff & (PAGE_SIZE - 1);
> - dest_poff = destoff & (PAGE_SIZE - 1);
> - cmp_len = min(PAGE_SIZE - src_poff,
> - PAGE_SIZE - dest_poff);
> + struct folio *src_folio, *dst_folio;
> + void *src_addr, *dst_addr;
> + loff_t cmp_len = min(PAGE_SIZE - offset_in_page(srcoff),
> + PAGE_SIZE - offset_in_page(dstoff));
> +
> cmp_len = min(cmp_len, len);
> if (cmp_len <= 0)
> goto out_error;
>
> - src_page = vfs_dedupe_get_page(src, srcoff);
> - if (IS_ERR(src_page)) {
> - error = PTR_ERR(src_page);
> + src_folio = vfs_dedupe_get_folio(src, srcoff);
> + if (IS_ERR(src_folio)) {
> + error = PTR_ERR(src_folio);
> goto out_error;
> }
> - dest_page = vfs_dedupe_get_page(dest, destoff);
> - if (IS_ERR(dest_page)) {
> - error = PTR_ERR(dest_page);
> - put_page(src_page);
> + dst_folio = vfs_dedupe_get_folio(dest, dstoff);
> + if (IS_ERR(dst_folio)) {
> + error = PTR_ERR(dst_folio);
> + folio_put(src_folio);
> goto out_error;
> }
>
> - vfs_lock_two_pages(src_page, dest_page);
> + vfs_lock_two_folios(src_folio, dst_folio);
>
> /*
> - * Now that we've locked both pages, make sure they're still
> + * Now that we've locked both folios, make sure they're still
> * mapped to the file data we're interested in. If not,
> * someone is invalidating pages on us and we lose.
> */
> - if (!PageUptodate(src_page) || !PageUptodate(dest_page) ||
> - src_page->mapping != src->i_mapping ||
> - dest_page->mapping != dest->i_mapping) {
> + if (!folio_test_uptodate(src_folio) || !folio_test_uptodate(dst_folio) ||
> + src_folio->mapping != src->i_mapping ||
> + dst_folio->mapping != dest->i_mapping) {
> same = false;
> goto unlock;
> }
>
> - src_addr = kmap_atomic(src_page);
> - dest_addr = kmap_atomic(dest_page);
> + src_addr = kmap_local_folio(src_folio,
> + offset_in_folio(src_folio, srcoff));
> + dst_addr = kmap_local_folio(dst_folio,
> + offset_in_folio(dst_folio, dstoff));
>
> - flush_dcache_page(src_page);
> - flush_dcache_page(dest_page);
> + flush_dcache_folio(src_folio);
> + flush_dcache_folio(dst_folio);
>
> - if (memcmp(src_addr + src_poff, dest_addr + dest_poff, cmp_len))
> + if (memcmp(src_addr, dst_addr, cmp_len))
> same = false;
>
> - kunmap_atomic(dest_addr);
> - kunmap_atomic(src_addr);
> + kunmap_local(dst_addr);
> + kunmap_local(src_addr);
> unlock:
> - vfs_unlock_two_pages(src_page, dest_page);
> - put_page(dest_page);
> - put_page(src_page);
> + vfs_unlock_two_folios(src_folio, dst_folio);
> + folio_put(dst_folio);
> + folio_put(src_folio);
>
> if (!same)
> break;
>
> srcoff += cmp_len;
> - destoff += cmp_len;
> + dstoff += cmp_len;
> len -= cmp_len;
> }
>
> --
> 2.30.2
>
On Thu, Jul 15, 2021 at 04:36:55AM +0100, Matthew Wilcox (Oracle) wrote:
> There is one place which assumes the size of a page; fix it.
>
> Signed-off-by: Matthew Wilcox (Oracle) <[email protected]>
> ---
> fs/xfs/xfs_aops.c | 11 ++++++-----
> fs/xfs/xfs_super.c | 3 ++-
> 2 files changed, 8 insertions(+), 6 deletions(-)
>
> diff --git a/fs/xfs/xfs_aops.c b/fs/xfs/xfs_aops.c
> index cb4e0fcf4c76..9ffbd116592a 100644
> --- a/fs/xfs/xfs_aops.c
> +++ b/fs/xfs/xfs_aops.c
> @@ -432,10 +432,11 @@ xfs_discard_page(
> struct page *page,
> loff_t fileoff)
/me wonders if this parameter ought to become pos like most other
places, but as a straight-up conversion it looks fine to me.
Reviewed-by: Darrick J. Wong <[email protected]>
--D
> {
> - struct inode *inode = page->mapping->host;
> + struct folio *folio = page_folio(page);
> + struct inode *inode = folio->mapping->host;
> struct xfs_inode *ip = XFS_I(inode);
> struct xfs_mount *mp = ip->i_mount;
> - unsigned int pageoff = offset_in_page(fileoff);
> + size_t pageoff = offset_in_folio(folio, fileoff);
> xfs_fileoff_t start_fsb = XFS_B_TO_FSBT(mp, fileoff);
> xfs_fileoff_t pageoff_fsb = XFS_B_TO_FSBT(mp, pageoff);
> int error;
> @@ -445,14 +446,14 @@ xfs_discard_page(
>
> xfs_alert_ratelimited(mp,
> "page discard on page "PTR_FMT", inode 0x%llx, offset %llu.",
> - page, ip->i_ino, fileoff);
> + folio, ip->i_ino, fileoff);
>
> error = xfs_bmap_punch_delalloc_range(ip, start_fsb,
> - i_blocks_per_page(inode, page) - pageoff_fsb);
> + i_blocks_per_folio(inode, folio) - pageoff_fsb);
> if (error && !XFS_FORCED_SHUTDOWN(mp))
> xfs_alert(mp, "page discard unable to remove delalloc mapping.");
> out_invalidate:
> - iomap_invalidatepage(page, pageoff, PAGE_SIZE - pageoff);
> + iomap_invalidatepage(&folio->page, pageoff, folio_size(folio) - pageoff);
> }
>
> static const struct iomap_writeback_ops xfs_writeback_ops = {
> diff --git a/fs/xfs/xfs_super.c b/fs/xfs/xfs_super.c
> index 2c9e26a44546..24adea02b887 100644
> --- a/fs/xfs/xfs_super.c
> +++ b/fs/xfs/xfs_super.c
> @@ -1891,7 +1891,8 @@ static struct file_system_type xfs_fs_type = {
> .init_fs_context = xfs_init_fs_context,
> .parameters = xfs_fs_parameters,
> .kill_sb = kill_block_super,
> - .fs_flags = FS_REQUIRES_DEV | FS_ALLOW_IDMAP,
> + .fs_flags = FS_REQUIRES_DEV | FS_ALLOW_IDMAP | \
> + FS_THP_SUPPORT,
> };
> MODULE_ALIAS_FS("xfs");
>
> --
> 2.30.2
>
On Thu, Jul 15, 2021 at 04:36:54AM +0100, Matthew Wilcox (Oracle) wrote:
> If we're punching a hole in a multi-page folio, we need to remove the
> per-page iomap data as the folio is about to be split and each page will
> need its own. This means that writepage can now come across a page with
> no iop allocated, so remove the assertion that there is already one,
> and just create one (with the uptodate bits set) if there isn't one.
>
> Signed-off-by: Matthew Wilcox (Oracle) <[email protected]>
Lol, Andreas already did the bottom half of the change for you.
Reviewed-by: Darrick J. Wong <[email protected]>
--D
> ---
> fs/iomap/buffered-io.c | 11 +++++++----
> 1 file changed, 7 insertions(+), 4 deletions(-)
>
> diff --git a/fs/iomap/buffered-io.c b/fs/iomap/buffered-io.c
> index 48de198c5603..7f78256fc0ba 100644
> --- a/fs/iomap/buffered-io.c
> +++ b/fs/iomap/buffered-io.c
> @@ -474,13 +474,17 @@ iomap_invalidatepage(struct page *page, unsigned int offset, unsigned int len)
> trace_iomap_invalidatepage(folio->mapping->host, offset, len);
>
> /*
> - * If we are invalidating the entire page, clear the dirty state from it
> - * and release it to avoid unnecessary buildup of the LRU.
> + * If we are invalidating the entire folio, clear the dirty state
> + * from it and release it to avoid unnecessary buildup of the LRU.
> */
> if (offset == 0 && len == folio_size(folio)) {
> WARN_ON_ONCE(folio_test_writeback(folio));
> folio_cancel_dirty(folio);
> iomap_page_release(folio);
> + } else if (folio_multi(folio)) {
> + /* Must release the iop so the page can be split */
> + WARN_ON_ONCE(!folio_test_uptodate(folio) && folio_test_dirty(folio));
> + iomap_page_release(folio);
> }
> }
> EXPORT_SYMBOL_GPL(iomap_invalidatepage);
> @@ -1300,7 +1304,7 @@ iomap_writepage_map(struct iomap_writepage_ctx *wpc,
> struct writeback_control *wbc, struct inode *inode,
> struct folio *folio, loff_t end_pos)
> {
> - struct iomap_page *iop = to_iomap_page(folio);
> + struct iomap_page *iop = iomap_page_create(inode, folio);
> struct iomap_ioend *ioend, *next;
> unsigned len = i_blocksize(inode);
> unsigned nblocks = i_blocks_per_folio(inode, folio);
> @@ -1308,7 +1312,6 @@ iomap_writepage_map(struct iomap_writepage_ctx *wpc,
> int error = 0, count = 0, i;
> LIST_HEAD(submit_list);
>
> - WARN_ON_ONCE(nblocks > 1 && !iop);
> WARN_ON_ONCE(iop && atomic_read(&iop->write_bytes_pending) != 0);
>
> /*
> --
> 2.30.2
>
On Thu, Jul 15, 2021 at 03:08:40PM -0700, Darrick J. Wong wrote:
> On Thu, Jul 15, 2021 at 04:36:50AM +0100, Matthew Wilcox (Oracle) wrote:
> > We still only operate on a single page of data at a time due to using
> > kmap(). A more complex implementation would work on each page in a folio,
> > but it's not clear that such a complex implementation would be worthwhile.
>
> Does this break up a compound folio into smaller pages?
No. We just operate on each page in turn. Splitting a folio is an
expensive and unreliable thing to do, so we avoid it unless necessary.
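A minimal sketch of what that per-page walk might look like, purely for
illustration -- folio_cmp() and its signature are invented here; only
kmap_local_folio(), kunmap_local() and the offset helpers come from the
series, and dcache flushing is elided:

static bool folio_cmp(struct folio *f1, size_t off1,
		      struct folio *f2, size_t off2, size_t len)
{
	while (len) {
		/* Compare only up to the end of the current page in each folio */
		size_t chunk = min3(len,
				(size_t)(PAGE_SIZE - offset_in_page(off1)),
				(size_t)(PAGE_SIZE - offset_in_page(off2)));
		void *a = kmap_local_folio(f1, off1);
		void *b = kmap_local_folio(f2, off2);
		bool same = !memcmp(a, b, chunk);

		kunmap_local(b);
		kunmap_local(a);
		if (!same)
			return false;
		off1 += chunk;
		off2 += chunk;
		len -= chunk;
	}
	return true;
}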
> > +/* Unlock two folios, being careful not to unlock the same folio twice. */
> > +static void vfs_unlock_two_folios(struct folio *folio1, struct folio *folio2)
> > {
> > - unlock_page(page1);
> > - if (page1 != page2)
> > - unlock_page(page2);
> > + folio_unlock(folio1);
> > + if (folio1 != folio2)
> > + folio_unlock(folio2);
>
> This could result in a lot of folio lock cycling. Do you think it's
> worth the effort to minimize this by keeping the folio locked if the
> next page is going to be from the same one?
I think that might well be a worthwhile optimisation. I'd like to do
that as a separate patch, though (and maybe somebody other than me could
do it ;-)
On Thu, Jul 15, 2021 at 01:59:17PM -0700, Darrick J. Wong wrote:
> On Thu, Jul 15, 2021 at 04:36:16AM +0100, Matthew Wilcox (Oracle) wrote:
> > This is a thin wrapper around bio_add_page(). The main advantage here
> > is the documentation that the submitter can expect to see folios in the
> > completion handler, and that stupidly large folios are not supported.
> > It's not currently possible to allocate stupidly large folios, but if
> > it ever becomes possible, this function will fail gracefully instead of
> > doing I/O to the wrong bytes.
> >
> > Signed-off-by: Matthew Wilcox (Oracle) <[email protected]>
> > ---
> > block/bio.c | 21 +++++++++++++++++++++
> > include/linux/bio.h | 3 ++-
> > 2 files changed, 23 insertions(+), 1 deletion(-)
> >
> > diff --git a/block/bio.c b/block/bio.c
> > index 1fab762e079b..1b500611d25c 100644
> > --- a/block/bio.c
> > +++ b/block/bio.c
> > @@ -933,6 +933,27 @@ int bio_add_page(struct bio *bio, struct page *page,
> > }
> > EXPORT_SYMBOL(bio_add_page);
> >
> > +/**
> > + * bio_add_folio - Attempt to add part of a folio to a bio.
> > + * @bio: Bio to add to.
> > + * @folio: Folio to add.
> > + * @len: How many bytes from the folio to add.
> > + * @off: First byte in this folio to add.
> > + *
> > + * Always uses the head page of the folio in the bio. If a submitter
> > + * only uses bio_add_folio(), it can count on never seeing tail pages
> > + * in the completion routine. BIOs do not support folios larger than 2GiB.
> > + *
> > + * Return: The number of bytes from this folio added to the bio.
> > + */
> > +size_t bio_add_folio(struct bio *bio, struct folio *folio, size_t len,
> > + size_t off)
> > +{
> > + if (len > UINT_MAX || off > UINT_MAX)
>
> Er... if bios don't support folios larger than 2GB, then why check @off
> and @len against UINT_MAX, which is ~4GB?
I suppose that's mostly a documentation problem. The limit is:
struct bio_vec {
struct page *bv_page;
unsigned int bv_len;
unsigned int bv_offset;
};
so we can support folios which are 2GB (0x8000'0000 bytes) in size, but
if (theoretically, some day) we had a 4GB folio, we wouldn't be able to
handle it in a single bio_vec. So there isn't anything between a 2GB
folio and a 4GB folio; a 4GB folio isn't allowed and a 2GB one is.
I don't think anything's wrong here, but maybe things could be worded
better?
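A hedged illustration of that boundary (the helper name is made up;
SZ_2G is 0x8000'0000, the largest power of two that still fits in the
32-bit bv_len/bv_offset fields):

static inline bool folio_fits_one_bvec(struct folio *folio)
{
	return folio_size(folio) <= SZ_2G;
}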
On Thu, Jul 15, 2021 at 02:12:54PM -0700, Darrick J. Wong wrote:
> On Thu, Jul 15, 2021 at 04:36:17AM +0100, Matthew Wilcox (Oracle) wrote:
> > +struct folio_iter {
> > + struct folio *folio;
> > + size_t offset;
> > + size_t length;
>
> Hm... so after every bio_{first,next}_folio call, we can access the
> folio, the offset, and the length (both in units of bytes) within the
> folio?
Correct.
> > + size_t _seg_count;
> > + int _i;
>
> And these are private variables that the iteration code should not
> scribble over?
Indeed!
> > +/*
> > + * Iterate over each folio in a bio.
> > + */
> > +#define bio_for_each_folio_all(fi, bio) \
> > + for (bio_first_folio(&fi, bio, 0); fi.folio; bio_next_folio(&fi, bio))
>
> ...so I guess a sample iteration loop would be something like:
>
> struct bio *bio = <get one from somewhere>;
> struct folio_iter fi;
>
> bio_for_each_folio_all(fi, bio) {
> if (folio_test_dirty(fi.folio))
> printk("folio idx 0x%lx is dirty, i hates dirty data!",
> folio_index(fi.folio));
> panic();
> }
>
> I'll go look through the rest of the patches, but this so far looks
> pretty straightforward to me.
Something very much like that!
+static void iomap_read_end_io(struct bio *bio)
{
+ struct folio_iter fi;
+ bio_for_each_folio_all(fi, bio)
+ iomap_finish_folio_read(fi.folio, fi.offset, fi.length, error);
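A fuller sketch of that completion handler, roughly as it ends up in
the series -- blk_status_to_errno() supplies the error that the excerpt
above leaves implicit:

static void iomap_read_end_io(struct bio *bio)
{
	int error = blk_status_to_errno(bio->bi_status);
	struct folio_iter fi;

	bio_for_each_folio_all(fi, bio)
		iomap_finish_folio_read(fi.folio, fi.offset, fi.length, error);
	bio_put(bio);
}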
On Thu, Jul 15, 2021 at 02:26:57PM -0700, Darrick J. Wong wrote:
> > + size_t poff = offset_in_folio(folio, *pos);
> > + size_t plen = min_t(loff_t, folio_size(folio) - poff, length);
>
> I'm confused about 'size_t poff' here vs. 'unsigned end' later -- why do
> we need a 64-bit quantity for poff? I suppose some day we might want to
> have folios larger than 4GB or so, but so far we don't need that large
> of a byte offset within a page/folio, right?
>
> Or are you merely moving the codebase towards using size_t for all byte
> offsets?
Both. 'end' isn't a byte count -- it's a block count.
> > if (orig_pos <= isize && orig_pos + length > isize) {
> > - unsigned end = offset_in_page(isize - 1) >> block_bits;
> > + unsigned end = offset_in_folio(folio, isize - 1) >> block_bits;
That right shift makes it not-a-byte-count.
I don't especially want to do all the work needed to support folios >2GB,
but I do like using size_t to represent a byte count.
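An illustration of the distinction, with a made-up helper name: the
shift by i_blkbits turns a byte offset into a block index, so even a
2MiB folio with 4KiB blocks only yields indices 0..511, well within an
unsigned int.

static inline unsigned last_block_in_folio(struct folio *folio,
		struct inode *inode, loff_t isize)
{
	return offset_in_folio(folio, isize - 1) >> inode->i_blkbits;
}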
On Thu, Jul 15, 2021 at 11:48:00PM +0100, Matthew Wilcox wrote:
> On Thu, Jul 15, 2021 at 02:26:57PM -0700, Darrick J. Wong wrote:
> > > + size_t poff = offset_in_folio(folio, *pos);
> > > + size_t plen = min_t(loff_t, folio_size(folio) - poff, length);
> >
> > I'm confused about 'size_t poff' here vs. 'unsigned end' later -- why do
> > we need a 64-bit quantity for poff? I suppose some day we might want to
> > have folios larger than 4GB or so, but so far we don't need that large
> > of a byte offset within a page/folio, right?
> >
> > Or are you merely moving the codebase towards using size_t for all byte
> > offsets?
>
> Both. 'end' isn't a byte count -- it's a block count.
>
> > > if (orig_pos <= isize && orig_pos + length > isize) {
> > > - unsigned end = offset_in_page(isize - 1) >> block_bits;
> > > + unsigned end = offset_in_folio(folio, isize - 1) >> block_bits;
>
> That right shift makes it not-a-byte-count.
>
> I don't especially want to do all the work needed to support folios >2GB,
> but I do like using size_t to represent a byte count.
DOH. Yes, I just noticed that.
TBH I doubt anyone's really going to care about 4GB folios anyway.
Reviewed-by: Darrick J. Wong <[email protected]>
--D
On Thu, Jul 15, 2021 at 03:05:05PM -0700, Darrick J. Wong wrote:
> On Thu, Jul 15, 2021 at 04:36:32AM +0100, Matthew Wilcox (Oracle) wrote:
> > Writeback an entire folio at a time, and adjust some of the variables
> > to have more familiar names.
> > @@ -1398,16 +1397,15 @@ iomap_writepage_map(struct iomap_writepage_ctx *wpc,
> > static int
> > iomap_do_writepage(struct page *page, struct writeback_control *wbc, void *data)
>
> I imagine at some point this will become iomap_do_writefolio and there
> will be some sort of write_cache_folios() call? Or the equivalent
> while(get_next_folio_to_write()) iomap_write_folio(); type loop?
I hadn't quite got as far as planning out what to do next with a
replacement for write_cache_pages(). At a minimum, that function is
going to work on folios -- it does anyway; we don't tag tail pages in
the xarray, so the tagged lookup done by write_cache_pages() only finds
folios. So everything we do with a page there is definitely looking at
a folio.
I want to get a lot more filesystems converted to use folios before I
undertake the write_cache_pages() interface overhaul (and I'll probably
think of several things to do to it at the same time -- like working on
a batch of pages all at once instead of calling one indirect function
per folio).
On Thu, Jul 15, 2021 at 03:10:18PM -0700, Darrick J. Wong wrote:
> On Thu, Jul 15, 2021 at 04:36:54AM +0100, Matthew Wilcox (Oracle) wrote:
> > If we're punching a hole in a multi-page folio, we need to remove the
> > per-page iomap data as the folio is about to be split and each page will
> > need its own. This means that writepage can now come across a page with
> > no iop allocated, so remove the assertion that there is already one,
> > and just create one (with the uptodate bits set) if there isn't one.
> >
> > Signed-off-by: Matthew Wilcox (Oracle) <[email protected]>
>
> Lol, Andreas already did the bottom half of the change for you.
Heh, yes, I copy-and-pasted it from this patch ;-) Thanks for
merging it!
On Thu, Jul 15, 2021 at 03:01:20PM -0700, Darrick J. Wong wrote:
> On Thu, Jul 15, 2021 at 04:36:31AM +0100, Matthew Wilcox (Oracle) wrote:
> >
> > - merged = __bio_try_merge_page(wpc->ioend->io_bio, page, len, poff,
> > - &same_page);
> > if (iop)
> > atomic_add(len, &iop->write_bytes_pending);
> > -
> > - if (!merged) {
> > - if (bio_full(wpc->ioend->io_bio, len)) {
> > - wpc->ioend->io_bio =
> > - iomap_chain_bio(wpc->ioend->io_bio);
> > - }
> > - bio_add_page(wpc->ioend->io_bio, page, len, poff);
> > + if (!bio_add_folio(wpc->ioend->io_bio, folio, len, poff)) {
> > + wpc->ioend->io_bio = iomap_chain_bio(wpc->ioend->io_bio);
> > + bio_add_folio(wpc->ioend->io_bio, folio, len, poff);
>
> The paranoiac in me wonders if we ought to have some sort of error
> checking here just in case we encounter double failures?
Maybe? We didn't have it before, and it's just been allocated.
I'd defer to Christoph here.
> > - for (i = 0, file_offset = page_offset(page);
> > - i < (PAGE_SIZE >> inode->i_blkbits) && file_offset < end_offset;
> > - i++, file_offset += len) {
> > + for (i = 0; i < nblocks; i++, pos += len) {
> > + if (pos >= end_offset)
> > + break;
>
> Any particular reason this isn't:
>
> for (i = 0; i < nblocks && pos < end_offset; i++, pos += len) {
>
> ?
Just mild personal preference ... I don't even like having the pos +=
len in there. But you're the maintainer, so I'll shuffle that in.
> Everything from here on out looks decent to me.
Thanks!
On Thu, Jul 15, 2021 at 02:51:05PM -0700, Darrick J. Wong wrote:
> On Thu, Jul 15, 2021 at 04:36:28AM +0100, Matthew Wilcox (Oracle) wrote:
> > +static int iomap_write_begin(struct inode *inode, loff_t pos, size_t len,
> > + unsigned flags, struct folio **foliop, struct iomap *iomap,
> > + struct iomap *srcmap)
> > {
> > const struct iomap_page_ops *page_ops = iomap->page_ops;
> > + struct folio *folio;
> > struct page *page;
> > + unsigned fgp = FGP_LOCK | FGP_WRITE | FGP_CREAT | FGP_STABLE | FGP_NOFS;
> > int status = 0;
> >
> > BUG_ON(pos + len > iomap->offset + iomap->length);
> > @@ -604,30 +605,31 @@ iomap_write_begin(struct inode *inode, loff_t pos, unsigned len, unsigned flags,
> > return status;
> > }
> >
> > - page = grab_cache_page_write_begin(inode->i_mapping, pos >> PAGE_SHIFT,
> > - AOP_FLAG_NOFS);
> > - if (!page) {
> > + folio = __filemap_get_folio(inode->i_mapping, pos >> PAGE_SHIFT, fgp,
>
> Ah, ok, so we're moving the file_get_pages flags up to iomap now.
Right, saves us having a folio equivalent of
grab_cache_page_write_begin(). And lets us get rid of AOP_FLAG_NOFS
eventually (although that really should be obsoleted by scoped
allocations, but one windmill at a time).
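Put another way, the folio-flavoured wrapper being avoided would be
nothing more than this hypothetical one-liner, which is hardly worth
adding:

static struct folio *grab_cache_folio_write_begin(struct address_space *mapping,
		pgoff_t index)
{
	unsigned fgp = FGP_LOCK | FGP_WRITE | FGP_CREAT | FGP_STABLE | FGP_NOFS;

	return __filemap_get_folio(mapping, index, fgp,
			mapping_gfp_mask(mapping));
}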
> > + struct page *page = folio_file_page(folio, pos / PAGE_SIZE);
>
> pos >> PAGE_SHIFT ?
mmm. We're inconsistent:
willy@pepe:~/kernel/folio$ git grep '/ PAGE_SIZE' mm/ fs/ |wc
92 720 6475
willy@pepe:~/kernel/folio$ git grep '>> PAGE_SHIFT' mm/ fs/ |wc
635 4582 39394
That said, there's a clear preference. It's just that we had a bug the
other day where somebody shifted by PAGE_SHIFT in the wrong direction ...
But again, this is your code, so I'll change it to use the shift.
On Thu, Jul 15, 2021 at 02:41:06PM -0700, Darrick J. Wong wrote:
> > @@ -975,33 +975,33 @@ iomap_page_mkwrite_actor(struct inode *inode, loff_t pos, loff_t length,
> >
> > vm_fault_t iomap_page_mkwrite(struct vm_fault *vmf, const struct iomap_ops *ops)
> > {
> > - struct page *page = vmf->page;
> > + struct folio *folio = page_folio(vmf->page);
>
> If before the page fault the folio was a compound 2M page, will the
> memory manager will have split it into 4k pages before passing it to us?
>
> That's a roundabout way of asking if we should expect folio_mkwrite at
> some point. ;)
Faults are tricky. For ->fault, we need to know the precise page which
the fault occurred on (this detail is handled for you by filemap_fault()).
For mkwrite(), the page will not be split, so it's going to be a matter
of just marking the entire compound page as dirty (in the head page)
and making sure the filesystem is able to write back the entire folio.
Yes, there's going to be some write amplification here. I believe
this will turn out to be a worthwhile tradeoff. If I'm wrong, we can
implement some kind of split-on-fault.
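A rough sketch of the mkwrite shape being described, assuming the
folio_mark_dirty() and folio_wait_stable() helpers from this series;
the function name is illustrative, not a proposal:

static vm_fault_t example_page_mkwrite(struct vm_fault *vmf)
{
	struct folio *folio = page_folio(vmf->page);
	struct inode *inode = file_inode(vmf->vma->vm_file);

	folio_lock(folio);
	if (folio->mapping != inode->i_mapping) {
		folio_unlock(folio);
		return VM_FAULT_NOPAGE;
	}
	/* Dirtying the head page marks the whole folio dirty */
	folio_mark_dirty(folio);
	folio_wait_stable(folio);
	return VM_FAULT_LOCKED;
}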
On Thu, Jul 15, 2021 at 02:21:05PM -0700, Darrick J. Wong wrote:
> On Thu, Jul 15, 2021 at 04:36:23AM +0100, Matthew Wilcox (Oracle) wrote:
> > All but one caller already has the iomap_page, and we can avoid getting
> > it again.
> >
> > Signed-off-by: Matthew Wilcox (Oracle) <[email protected]>
>
> Took me a while to distinguish iomap_iop_set_range_uptodate and
> iomap_set_range_uptodate, but yes, this looks pretty simple.
Not my favourite naming, but it's a preexisting condition ;-)
Honestly I'd like to rename iomap to blkmap or something.
And iomap_page is now hilariously badly named. But that's kind
of tangential to everything else here.
On Fri, Jul 16, 2021 at 04:21:25AM +0100, Matthew Wilcox wrote:
> On Thu, Jul 15, 2021 at 02:21:05PM -0700, Darrick J. Wong wrote:
> > On Thu, Jul 15, 2021 at 04:36:23AM +0100, Matthew Wilcox (Oracle) wrote:
> > > All but one caller already has the iomap_page, and we can avoid getting
> > > it again.
> > >
> > > Signed-off-by: Matthew Wilcox (Oracle) <[email protected]>
> >
> > Took me a while to distinguish iomap_iop_set_range_uptodate and
> > iomap_set_range_uptodate, but yes, this looks pretty simple.
>
> Not my favourite naming, but it's a preexisting condition ;-)
>
> Honestly I'd like to rename iomap to blkmap or something.
> And iomap_page is now hilariously badly named. But that's kind
> of tangential to everything else here.
I guess we only use 'blkmap' in a few places in the kernel, and nobody's
going to confuse us with UFS.
Hmm, what kind of new name?
struct iomap_buffer_head *ibh; /* NO */
struct iomap_folio_state *ifs;
struct iomap_state *is; /* shorter, but what is 'state'? */
struct iomap_blkmap *ibm; /* lolz */
I think iomap_blkmap sounds fine, since we're probably going to end up
exporting it (and therefore need a clear namespace) as soon as one of
the filesystems that uses page->private to stash per-page info wants to
use iomap for buffered io.
--D
On Thu, Jul 15, 2021 at 04:34:48AM +0100, Matthew Wilcox (Oracle) wrote:
> A struct folio is a new abstraction to replace the venerable struct page.
> A function which takes a struct folio argument declares that it will
> operate on the entire (possibly compound) page, not just PAGE_SIZE bytes.
> In return, the caller guarantees that the pointer it is passing does
> not point to a tail page. No change to generated code.
>
> Signed-off-by: Matthew Wilcox (Oracle) <[email protected]>
> Acked-by: Jeff Layton <[email protected]>
> Acked-by: Kirill A. Shutemov <[email protected]>
> Acked-by: Vlastimil Babka <[email protected]>
> Reviewed-by: William Kucharski <[email protected]>
> Reviewed-by: Christoph Hellwig <[email protected]>
> Reviewed-by: David Howells <[email protected]>
Except for small nits below
Acked-by: Mike Rapoport <[email protected]>
> ---
> Documentation/core-api/mm-api.rst | 1 +
> include/linux/mm.h | 74 +++++++++++++++++++++++++++++++
> include/linux/mm_types.h | 60 +++++++++++++++++++++++++
> include/linux/page-flags.h | 28 ++++++++++++
> 4 files changed, 163 insertions(+)
>
> diff --git a/Documentation/core-api/mm-api.rst b/Documentation/core-api/mm-api.rst
> index a42f9baddfbf..2a94e6164f80 100644
> --- a/Documentation/core-api/mm-api.rst
> +++ b/Documentation/core-api/mm-api.rst
> @@ -95,6 +95,7 @@ More Memory Management Functions
> .. kernel-doc:: mm/mempolicy.c
> .. kernel-doc:: include/linux/mm_types.h
> :internal:
> +.. kernel-doc:: include/linux/page-flags.h
> .. kernel-doc:: include/linux/mm.h
> :internal:
> .. kernel-doc:: include/linux/mmzone.h
> diff --git a/include/linux/mm.h b/include/linux/mm.h
> index 8dd65290bac0..5071084a71b9 100644
> --- a/include/linux/mm.h
> +++ b/include/linux/mm.h
> @@ -949,6 +949,20 @@ static inline unsigned int compound_order(struct page *page)
> return page[1].compound_order;
> }
>
> +/**
> + * folio_order - The allocation order of a folio.
> + * @folio: The folio.
> + *
> + * A folio is composed of 2^order pages. See get_order() for the definition
> + * of order.
> + *
> + * Return: The order of the folio.
> + */
> +static inline unsigned int folio_order(struct folio *folio)
> +{
> + return compound_order(&folio->page);
> +}
> +
> static inline bool hpage_pincount_available(struct page *page)
> {
> /*
> @@ -1594,6 +1608,65 @@ static inline void set_page_links(struct page *page, enum zone_type zone,
> #endif
> }
>
> +/**
> + * folio_nr_pages - The number of pages in the folio.
> + * @folio: The folio.
> + *
> + * Return: A number which is a power of two.
> + */
> +static inline unsigned long folio_nr_pages(struct folio *folio)
> +{
> + return compound_nr(&folio->page);
> +}
> +
> +/**
> + * folio_next - Move to the next physical folio.
> + * @folio: The folio we're currently operating on.
> + *
> + * If you have physically contiguous memory which may span more than
> + * one folio (eg a &struct bio_vec), use this function to move from one
> + * folio to the next. Do not use it if the memory is only virtually
> + * contiguous as the folios are almost certainly not adjacent to each
> + * other. This is the folio equivalent to writing ``page++``.
> + *
> + * Context: We assume that the folios are refcounted and/or locked at a
> + * higher level and do not adjust the reference counts.
> + * Return: The next struct folio.
> + */
> +static inline struct folio *folio_next(struct folio *folio)
> +{
> + return (struct folio *)folio_page(folio, folio_nr_pages(folio));
> +}
> +
> +/**
> + * folio_shift - The number of bits covered by this folio.
To me, this sounds like the size of the folio in bits.
Maybe just repeat "The base-2 logarithm of the size of this folio" here and
in the return description?
> + * @folio: The folio.
> + *
> + * A folio contains a number of bytes which is a power-of-two in size.
> + * This function tells you which power-of-two the folio is.
> + *
> + * Context: The caller should have a reference on the folio to prevent
> + * it from being split. It is not necessary for the folio to be locked.
> + * Return: The base-2 logarithm of the size of this folio.
> + */
> +static inline unsigned int folio_shift(struct folio *folio)
> +{
> + return PAGE_SHIFT + folio_order(folio);
> +}
> +
> +/**
> + * folio_size - The number of bytes in a folio.
> + * @folio: The folio.
> + *
> + * Context: The caller should have a reference on the folio to prevent
> + * it from being split. It is not necessary for the folio to be locked.
> + * Return: The number of bytes in this folio.
> + */
> +static inline size_t folio_size(struct folio *folio)
> +{
> + return PAGE_SIZE << folio_order(folio);
> +}
> +
> /*
> * Some inline functions in vmstat.h depend on page_zone()
> */
> @@ -1699,6 +1772,7 @@ extern void pagefault_out_of_memory(void);
>
> #define offset_in_page(p) ((unsigned long)(p) & ~PAGE_MASK)
> #define offset_in_thp(page, p) ((unsigned long)(p) & (thp_size(page) - 1))
> +#define offset_in_folio(folio, p) ((unsigned long)(p) & (folio_size(folio) - 1))
>
> /*
> * Flags passed to show_mem() and show_free_areas() to suppress output in
> diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h
> index 52bbd2b7cb46..f023eaa866fe 100644
> --- a/include/linux/mm_types.h
> +++ b/include/linux/mm_types.h
> @@ -231,6 +231,66 @@ struct page {
> #endif
> } _struct_page_alignment;
>
> +/**
> + * struct folio - Represents a contiguous set of bytes.
> + * @flags: Identical to the page flags.
> + * @lru: Least Recently Used list; tracks how recently this folio was used.
> + * @mapping: The file this page belongs to, or refers to the anon_vma for
> + * anonymous pages.
> + * @index: Offset within the file, in units of pages. For anonymous pages,
Nit: maybe memory? ^
> + * this is the index from the beginning of the mmap.
> + * @private: Filesystem per-folio data (see folio_attach_private()).
> + * Used for swp_entry_t if folio_test_swapcache().
> + * @_mapcount: Do not access this member directly. Use folio_mapcount() to
> + * find out how many times this folio is mapped by userspace.
> + * @_refcount: Do not access this member directly. Use folio_ref_count()
> + * to find how many references there are to this folio.
> + * @memcg_data: Memory Control Group data.
> + *
> + * A folio is a physically, virtually and logically contiguous set
> + * of bytes. It is a power-of-two in size, and it is aligned to that
> + * same power-of-two. It is at least as large as %PAGE_SIZE. If it is
> + * in the page cache, it is at a file offset which is a multiple of that
> + * power-of-two. It may be mapped into userspace at an address which is
> + * at an arbitrary page offset, but its kernel virtual address is aligned
> + * to its size.
> + */
> +struct folio {
> + /* private: don't document the anon union */
> + union {
> + struct {
> + /* public: */
> + unsigned long flags;
> + struct list_head lru;
> + struct address_space *mapping;
> + pgoff_t index;
> + void *private;
> + atomic_t _mapcount;
> + atomic_t _refcount;
> +#ifdef CONFIG_MEMCG
> + unsigned long memcg_data;
> +#endif
> + /* private: the union with struct page is transitional */
> + };
> + struct page page;
> + };
> +};
> +
> +static_assert(sizeof(struct page) == sizeof(struct folio));
> +#define FOLIO_MATCH(pg, fl) \
> + static_assert(offsetof(struct page, pg) == offsetof(struct folio, fl))
> +FOLIO_MATCH(flags, flags);
> +FOLIO_MATCH(lru, lru);
> +FOLIO_MATCH(compound_head, lru);
> +FOLIO_MATCH(index, index);
> +FOLIO_MATCH(private, private);
> +FOLIO_MATCH(_mapcount, _mapcount);
> +FOLIO_MATCH(_refcount, _refcount);
> +#ifdef CONFIG_MEMCG
> +FOLIO_MATCH(memcg_data, memcg_data);
> +#endif
> +#undef FOLIO_MATCH
> +
> static inline atomic_t *compound_mapcount_ptr(struct page *page)
> {
> return &page[1].compound_mapcount;
> diff --git a/include/linux/page-flags.h b/include/linux/page-flags.h
> index 5922031ffab6..70ede8345538 100644
> --- a/include/linux/page-flags.h
> +++ b/include/linux/page-flags.h
> @@ -191,6 +191,34 @@ static inline unsigned long _compound_head(const struct page *page)
>
> #define compound_head(page) ((typeof(page))_compound_head(page))
>
> +/**
> + * page_folio - Converts from page to folio.
> + * @p: The page.
> + *
> + * Every page is part of a folio. This function cannot be called on a
> + * NULL pointer.
> + *
> + * Context: No reference, nor lock is required on @page. If the caller
> + * does not hold a reference, this call may race with a folio split, so
> + * it should re-check the folio still contains this page after gaining
> + * a reference on the folio.
> + * Return: The folio which contains this page.
> + */
> +#define page_folio(p) (_Generic((p), \
> + const struct page *: (const struct folio *)_compound_head(p), \
> + struct page *: (struct folio *)_compound_head(p)))
> +
> +/**
> + * folio_page - Return a page from a folio.
> + * @folio: The folio.
> + * @n: The page number to return.
> + *
> + * @n is relative to the start of the folio. This function does not
> + * check that the page number lies within @folio; the caller is presumed
> + * to have a reference to the page.
> + */
> +#define folio_page(folio, n) nth_page(&(folio)->page, n)
> +
> static __always_inline int PageTail(struct page *page)
> {
> return READ_ONCE(page->compound_head) & 1;
> --
> 2.30.2
>
>
--
Sincerely yours,
Mike.
On Thu, Jul 15, 2021 at 04:34:49AM +0100, Matthew Wilcox (Oracle) wrote:
> These are just convenience wrappers for callers with folios; pgdat and
> zone can be reached from tail pages as well as head pages. No change
> to generated code.
>
> Signed-off-by: Matthew Wilcox (Oracle) <[email protected]>
> Reviewed-by: Zi Yan <[email protected]>
> Reviewed-by: Christoph Hellwig <[email protected]>
> Acked-by: Jeff Layton <[email protected]>
> Acked-by: Kirill A. Shutemov <[email protected]>
> Acked-by: Vlastimil Babka <[email protected]>
> Reviewed-by: William Kucharski <[email protected]>
> Reviewed-by: David Howells <[email protected]>
> ---
> include/linux/mm.h | 15 +++++++++++++++
> 1 file changed, 15 insertions(+)
Acked-by: Mike Rapoport <[email protected]>
On Thu, Jul 15, 2021 at 04:34:52AM +0100, Matthew Wilcox (Oracle) wrote:
> These functions mirror their page reference counterparts. Also add
> the kernel-doc to the mm-api and correct the return type of
> page_ref_add_unless() to bool. No change to generated code.
>
> Signed-off-by: Matthew Wilcox (Oracle) <[email protected]>
> Reviewed-by: Christoph Hellwig <[email protected]>
> Acked-by: Jeff Layton <[email protected]>
> Acked-by: Kirill A. Shutemov <[email protected]>
> Acked-by: Vlastimil Babka <[email protected]>
> Reviewed-by: William Kucharski <[email protected]>
> Reviewed-by: David Howells <[email protected]>
> ---
> Documentation/core-api/mm-api.rst | 1 +
> include/linux/page_ref.h | 88 ++++++++++++++++++++++++++++++-
> 2 files changed, 88 insertions(+), 1 deletion(-)
Acked-by: Mike Rapoport <[email protected]>
On Thu, Jul 15, 2021 at 04:34:55AM +0100, Matthew Wilcox (Oracle) wrote:
> This is the equivalent of page_cache_get_speculative(). Also add
> folio_ref_try_add_rcu (the equivalent of page_cache_add_speculative)
> and folio_get_unless_zero() (the equivalent of get_page_unless_zero()).
>
> The new kernel-doc attempts to explain from the user's point of view
> when to use folio_try_get_rcu() and when to use folio_get_unless_zero(),
> because there seems to be some confusion currently between the users of
> page_cache_get_speculative() and get_page_unless_zero().
>
> Reimplement page_cache_add_speculative() and page_cache_get_speculative()
> as wrappers around the folio equivalents, but leave get_page_unless_zero()
> alone for now. This commit reduces text size by 3 bytes due to slightly
> different register allocation & instruction selections.
>
> Signed-off-by: Matthew Wilcox (Oracle) <[email protected]>
> Acked-by: Vlastimil Babka <[email protected]>
> Reviewed-by: William Kucharski <[email protected]>
> Reviewed-by: Christoph Hellwig <[email protected]>
> Acked-by: Kirill A. Shutemov <[email protected]>
> ---
> include/linux/page_ref.h | 66 +++++++++++++++++++++++++++++++
> include/linux/pagemap.h | 84 ++--------------------------------------
> mm/filemap.c | 20 ++++++++++
> 3 files changed, 90 insertions(+), 80 deletions(-)
Acked-by: Mike Rapoport <[email protected]>
> diff --git a/include/linux/page_ref.h b/include/linux/page_ref.h
> index 717d53c9ddf1..2e677e6ad09f 100644
> --- a/include/linux/page_ref.h
> +++ b/include/linux/page_ref.h
> @@ -247,6 +247,72 @@ static inline bool folio_ref_add_unless(struct folio *folio, int nr, int u)
> return page_ref_add_unless(&folio->page, nr, u);
> }
>
> +/**
> + * folio_try_get - Attempt to increase the refcount on a folio.
> + * @folio: The folio.
> + *
> + * If you do not already have a reference to a folio, you can attempt to
> + * get one using this function. It may fail if, for example, the folio
> + * has been freed since you found a pointer to it, or it is frozen for
> + * the purposes of splitting or migration.
> + *
> + * Return: True if the reference count was successfully incremented.
> + */
> +static inline bool folio_try_get(struct folio *folio)
> +{
> + return folio_ref_add_unless(folio, 1, 0);
> +}
> +
> +static inline bool folio_ref_try_add_rcu(struct folio *folio, int count)
> +{
> +#ifdef CONFIG_TINY_RCU
> + /*
> + * The caller guarantees the folio will not be freed from interrupt
> + * context, so (on !SMP) we only need preemption to be disabled
> + * and TINY_RCU does that for us.
> + */
> +# ifdef CONFIG_PREEMPT_COUNT
> + VM_BUG_ON(!in_atomic() && !irqs_disabled());
> +# endif
> + VM_BUG_ON_FOLIO(folio_ref_count(folio) == 0, folio);
> + folio_ref_add(folio, count);
> +#else
> + if (unlikely(!folio_ref_add_unless(folio, count, 0))) {
> + /* Either the folio has been freed, or will be freed. */
> + return false;
> + }
> +#endif
> + return true;
> +}
> +
> +/**
> + * folio_try_get_rcu - Attempt to increase the refcount on a folio.
> + * @folio: The folio.
> + *
> + * This is a version of folio_try_get() optimised for non-SMP kernels.
> + * If you are still holding the rcu_read_lock() after looking up the
> + * page and know that the page cannot have its refcount decreased to
> + * zero in interrupt context, you can use this instead of folio_try_get().
> + *
> + * Example users include get_user_pages_fast() (as pages are not unmapped
> + * from interrupt context) and the page cache lookups (as pages are not
> + * truncated from interrupt context). We also know that pages are not
> + * frozen in interrupt context for the purposes of splitting or migration.
> + *
> + * You can also use this function if you're holding a lock that prevents
> + * pages being frozen & removed; eg the i_pages lock for the page cache
> + * or the mmap_sem or page table lock for page tables. In this case,
> + * it will always succeed, and you could have used a plain folio_get(),
> + * but it's sometimes more convenient to have a common function called
> + * from both locked and RCU-protected contexts.
> + *
> + * Return: True if the reference count was successfully incremented.
> + */
> +static inline bool folio_try_get_rcu(struct folio *folio)
> +{
> + return folio_ref_try_add_rcu(folio, 1);
> +}
> +
> static inline int page_ref_freeze(struct page *page, int count)
> {
> int ret = likely(atomic_cmpxchg(&page->_refcount, count, 0) == count);
> diff --git a/include/linux/pagemap.h b/include/linux/pagemap.h
> index ed02aa522263..db1726b1bc1c 100644
> --- a/include/linux/pagemap.h
> +++ b/include/linux/pagemap.h
> @@ -172,91 +172,15 @@ static inline struct address_space *page_mapping_file(struct page *page)
> return page_mapping(page);
> }
>
> -/*
> - * speculatively take a reference to a page.
> - * If the page is free (_refcount == 0), then _refcount is untouched, and 0
> - * is returned. Otherwise, _refcount is incremented by 1 and 1 is returned.
> - *
> - * This function must be called inside the same rcu_read_lock() section as has
> - * been used to lookup the page in the pagecache radix-tree (or page table):
> - * this allows allocators to use a synchronize_rcu() to stabilize _refcount.
> - *
> - * Unless an RCU grace period has passed, the count of all pages coming out
> - * of the allocator must be considered unstable. page_count may return higher
> - * than expected, and put_page must be able to do the right thing when the
> - * page has been finished with, no matter what it is subsequently allocated
> - * for (because put_page is what is used here to drop an invalid speculative
> - * reference).
> - *
> - * This is the interesting part of the lockless pagecache (and lockless
> - * get_user_pages) locking protocol, where the lookup-side (eg. find_get_page)
> - * has the following pattern:
> - * 1. find page in radix tree
> - * 2. conditionally increment refcount
> - * 3. check the page is still in pagecache (if no, goto 1)
> - *
> - * Remove-side that cares about stability of _refcount (eg. reclaim) has the
> - * following (with the i_pages lock held):
> - * A. atomically check refcount is correct and set it to 0 (atomic_cmpxchg)
> - * B. remove page from pagecache
> - * C. free the page
> - *
> - * There are 2 critical interleavings that matter:
> - * - 2 runs before A: in this case, A sees elevated refcount and bails out
> - * - A runs before 2: in this case, 2 sees zero refcount and retries;
> - * subsequently, B will complete and 1 will find no page, causing the
> - * lookup to return NULL.
> - *
> - * It is possible that between 1 and 2, the page is removed then the exact same
> - * page is inserted into the same position in pagecache. That's OK: the
> - * old find_get_page using a lock could equally have run before or after
> - * such a re-insertion, depending on order that locks are granted.
> - *
> - * Lookups racing against pagecache insertion isn't a big problem: either 1
> - * will find the page or it will not. Likewise, the old find_get_page could run
> - * either before the insertion or afterwards, depending on timing.
> - */
> -static inline int __page_cache_add_speculative(struct page *page, int count)
> +static inline bool page_cache_add_speculative(struct page *page, int count)
> {
> -#ifdef CONFIG_TINY_RCU
> -# ifdef CONFIG_PREEMPT_COUNT
> - VM_BUG_ON(!in_atomic() && !irqs_disabled());
> -# endif
> - /*
> - * Preempt must be disabled here - we rely on rcu_read_lock doing
> - * this for us.
> - *
> - * Pagecache won't be truncated from interrupt context, so if we have
> - * found a page in the radix tree here, we have pinned its refcount by
> - * disabling preempt, and hence no need for the "speculative get" that
> - * SMP requires.
> - */
> - VM_BUG_ON_PAGE(page_count(page) == 0, page);
> - page_ref_add(page, count);
> -
> -#else
> - if (unlikely(!page_ref_add_unless(page, count, 0))) {
> - /*
> - * Either the page has been freed, or will be freed.
> - * In either case, retry here and the caller should
> - * do the right thing (see comments above).
> - */
> - return 0;
> - }
> -#endif
> VM_BUG_ON_PAGE(PageTail(page), page);
> -
> - return 1;
> -}
> -
> -static inline int page_cache_get_speculative(struct page *page)
> -{
> - return __page_cache_add_speculative(page, 1);
> + return folio_ref_try_add_rcu((struct folio *)page, count);
> }
>
> -static inline int page_cache_add_speculative(struct page *page, int count)
> +static inline bool page_cache_get_speculative(struct page *page)
> {
> - return __page_cache_add_speculative(page, count);
> + return page_cache_add_speculative(page, 1);
> }
>
> /**
> diff --git a/mm/filemap.c b/mm/filemap.c
> index d1458ecf2f51..634adeacc4c1 100644
> --- a/mm/filemap.c
> +++ b/mm/filemap.c
> @@ -1746,6 +1746,26 @@ pgoff_t page_cache_prev_miss(struct address_space *mapping,
> }
> EXPORT_SYMBOL(page_cache_prev_miss);
>
> +/*
> + * Lockless page cache protocol:
> + * On the lookup side:
> + * 1. Load the folio from i_pages
> + * 2. Increment the refcount if it's not zero
> + * 3. If the folio is not found by xas_reload(), put the refcount and retry
> + *
> + * On the removal side:
> + * A. Freeze the page (by zeroing the refcount if nobody else has a reference)
> + * B. Remove the page from i_pages
> + * C. Return the page to the page allocator
> + *
> + * This means that any page may have its reference count temporarily
> + * increased by a speculative page cache (or fast GUP) lookup as it can
> + * be allocated by another user before the RCU grace period expires.
> + * Because the refcount temporarily acquired here may end up being the
> + * last refcount on the page, any page allocation must be freeable by
> + * put_folio().
^ folio_put()
> + */
> +
> /*
> * mapping_get_entry - Get a page cache entry.
> * @mapping: the address_space to search
> --
> 2.30.2
>
>
--
Sincerely yours,
Mike.
On Thu, Jul 15, 2021 at 04:35:00AM +0100, Matthew Wilcox (Oracle) wrote:
> This helper returns the page index of the next folio in the file (ie
> the end of this folio, plus one).
>
> No changes to generated code.
>
> Signed-off-by: Matthew Wilcox (Oracle) <[email protected]>
> Reviewed-by: Christoph Hellwig <[email protected]>
> Acked-by: Jeff Layton <[email protected]>
> Acked-by: Kirill A. Shutemov <[email protected]>
> Acked-by: Vlastimil Babka <[email protected]>
> Reviewed-by: William Kucharski <[email protected]>
> Reviewed-by: David Howells <[email protected]>
> ---
> include/linux/pagemap.h | 11 +++++++++++
> 1 file changed, 11 insertions(+)
Acked-by: Mike Rapoport <[email protected]>
> diff --git a/include/linux/pagemap.h b/include/linux/pagemap.h
> index f7c165b5991f..bd0e7e91bfd4 100644
> --- a/include/linux/pagemap.h
> +++ b/include/linux/pagemap.h
> @@ -406,6 +406,17 @@ static inline pgoff_t folio_index(struct folio *folio)
> return folio->index;
> }
>
> +/**
> + * folio_next_index - Get the index of the next folio.
> + * @folio: The current folio.
> + *
> + * Return: The index of the folio which follows this folio in the file.
> + */
Maybe note that index is in units of pages?
> +static inline pgoff_t folio_next_index(struct folio *folio)
> +{
> + return folio->index + folio_nr_pages(folio);
> +}
> +
> /**
> * folio_file_page - The page for a particular index.
> * @folio: The folio which contains this index.
> --
> 2.30.2
>
>
--
Sincerely yours,
Mike.
On Thu, Jul 15, 2021 at 04:34:54AM +0100, Matthew Wilcox (Oracle) wrote:
> If we know we have a folio, we can call folio_get() instead
> of get_page() and save the overhead of calling compound_head().
> No change to generated code.
>
> Signed-off-by: Matthew Wilcox (Oracle) <[email protected]>
> Reviewed-by: Zi Yan <[email protected]>
> Reviewed-by: Christoph Hellwig <[email protected]>
> Acked-by: Jeff Layton <[email protected]>
> Acked-by: Kirill A. Shutemov <[email protected]>
> Acked-by: Vlastimil Babka <[email protected]>
> Reviewed-by: William Kucharski <[email protected]>
> Reviewed-by: David Howells <[email protected]>
> ---
> include/linux/mm.h | 26 +++++++++++++++++---------
> 1 file changed, 17 insertions(+), 9 deletions(-)
Acked-by: Mike Rapoport <[email protected]>
On Thu, Jul 15, 2021 at 04:34:50AM +0100, Matthew Wilcox (Oracle) wrote:
> Allow page counters to be more readily modified by callers which have
> a folio. Name these wrappers with 'stat' instead of 'state' as requested
> by Linus here:
> https://lore.kernel.org/linux-mm/CAHk-=wj847SudR-kt+46fT3+xFFgiwpgThvm7DJWGdi4cVrbnQ@mail.gmail.com/
> No change to generated code.
>
> Signed-off-by: Matthew Wilcox (Oracle) <[email protected]>
> Reviewed-by: Christoph Hellwig <[email protected]>
> Acked-by: Jeff Layton <[email protected]>
> Acked-by: Kirill A. Shutemov <[email protected]>
> Acked-by: Vlastimil Babka <[email protected]>
> Reviewed-by: William Kucharski <[email protected]>
> Reviewed-by: David Howells <[email protected]>
> ---
> include/linux/vmstat.h | 107 +++++++++++++++++++++++++++++++++++++++++
> 1 file changed, 107 insertions(+)
Acked-by: Mike Rapoport <[email protected]>
On Thu, Jul 15, 2021 at 04:35:01AM +0100, Matthew Wilcox (Oracle) wrote:
> These are just wrappers around page_offset() and page_file_offset()
> respectively. No change to generated code.
>
> Signed-off-by: Matthew Wilcox (Oracle) <[email protected]>
> Reviewed-by: Christoph Hellwig <[email protected]>
> Acked-by: Jeff Layton <[email protected]>
> Acked-by: Kirill A. Shutemov <[email protected]>
> Acked-by: Vlastimil Babka <[email protected]>
> Reviewed-by: William Kucharski <[email protected]>
> Reviewed-by: David Howells <[email protected]>
> ---
> include/linux/pagemap.h | 21 +++++++++++++++++++++
> 1 file changed, 21 insertions(+)
>
> diff --git a/include/linux/pagemap.h b/include/linux/pagemap.h
> index bd0e7e91bfd4..aa71fa82d6be 100644
> --- a/include/linux/pagemap.h
> +++ b/include/linux/pagemap.h
> @@ -562,6 +562,27 @@ static inline loff_t page_file_offset(struct page *page)
> return ((loff_t)page_index(page)) << PAGE_SHIFT;
> }
>
> +/**
> + * folio_pos - Returns the byte position of this folio in its file.
> + * @folio: The folio.
kerneldoc will warn about missing description of return value.
> + */
> +static inline loff_t folio_pos(struct folio *folio)
> +{
> + return page_offset(&folio->page);
> +}
> +
> +/**
> + * folio_file_pos - Returns the byte position of this folio in its file.
> + * @folio: The folio.
> + *
> + * This differs from folio_pos() for folios which belong to a swap file.
> + * NFS is the only filesystem today which needs to use folio_file_pos().
ditto
> + */
> +static inline loff_t folio_file_pos(struct folio *folio)
> +{
> + return page_file_offset(&folio->page);
> +}
> +
> extern pgoff_t linear_hugepage_index(struct vm_area_struct *vma,
> unsigned long address);
>
> --
> 2.30.2
>
>
--
Sincerely yours,
Mike.
On Thu, Jul 15, 2021 at 04:35:03AM +0100, Matthew Wilcox (Oracle) wrote:
> Convert unlock_page() to call folio_unlock(). By using a folio we
> avoid a call to compound_head(). This shortens the function from 39
> bytes to 25 and removes 4 instructions on x86-64. Because we still
> have unlock_page(), it's a net increase of 16 bytes of text for the
> kernel as a whole, but any path that uses folio_unlock() will execute
> 4 fewer instructions.
>
> Signed-off-by: Matthew Wilcox (Oracle) <[email protected]>
> Reviewed-by: Christoph Hellwig <[email protected]>
> Acked-by: Jeff Layton <[email protected]>
> Acked-by: Kirill A. Shutemov <[email protected]>
> Reviewed-by: William Kucharski <[email protected]>
> Reviewed-by: David Howells <[email protected]>
> ---
> include/linux/pagemap.h | 3 ++-
> mm/filemap.c | 29 ++++++++++++-----------------
> mm/folio-compat.c | 6 ++++++
> 3 files changed, 20 insertions(+), 18 deletions(-)
Acked-by: Mike Rapoport <[email protected]>
On Thu, Jul 15, 2021 at 04:34:58AM +0100, Matthew Wilcox (Oracle) wrote:
> Add folio_get_private() which mirrors page_private() -- ie folio private
> data is the same as page private data. The only difference is that these
> return a void * instead of an unsigned long, which matches the majority
> of users.
>
> Turn attach_page_private() into folio_attach_private() and reimplement
> attach_page_private() as a wrapper. No filesystem which uses page private
> data currently supports compound pages, so we're free to define the rules.
> attach_page_private() may only be called on a head page; if you want
> to add private data to a tail page, you can call set_page_private()
> directly (and shouldn't increment the page refcount! That should be
> done when adding private data to the head page / folio).
>
> This saves 813 bytes of text with the distro-derived config that I'm
> testing due to removing the calls to compound_head() in get_page()
> & put_page().
>
> Signed-off-by: Matthew Wilcox (Oracle) <[email protected]>
> Reviewed-by: Christoph Hellwig <[email protected]>
> Acked-by: Jeff Layton <[email protected]>
> Acked-by: Kirill A. Shutemov <[email protected]>
> Acked-by: Vlastimil Babka <[email protected]>
> Reviewed-by: William Kucharski <[email protected]>
> Reviewed-by: David Howells <[email protected]>
> ---
> include/linux/mm_types.h | 11 +++++++++
> include/linux/pagemap.h | 48 ++++++++++++++++++++++++----------------
> 2 files changed, 40 insertions(+), 19 deletions(-)
Acked-by: Mike Rapoport <[email protected]>
On Thu, Jul 15, 2021 at 04:35:08AM +0100, Matthew Wilcox (Oracle) wrote:
> Convert __lock_page_or_retry() to __folio_lock_or_retry(). This actually
> saves 4 bytes in the only caller of lock_page_or_retry() (due to better
> register allocation) and saves the 14 byte cost of calling page_folio()
> in __folio_lock_or_retry() for a total saving of 18 bytes. Also use
> a bool for the return type.
>
> Signed-off-by: Matthew Wilcox (Oracle) <[email protected]>
> Reviewed-by: Christoph Hellwig <[email protected]>
> Acked-by: Jeff Layton <[email protected]>
> Acked-by: Kirill A. Shutemov <[email protected]>
> Reviewed-by: William Kucharski <[email protected]>
> ---
> include/linux/pagemap.h | 11 +++++++----
> mm/filemap.c | 20 +++++++++-----------
> mm/memory.c | 8 ++++----
> 3 files changed, 20 insertions(+), 19 deletions(-)
Acked-by: Mike Rapoport <[email protected]>
On Thu, Jul 15, 2021 at 04:35:05AM +0100, Matthew Wilcox (Oracle) wrote:
> This is like lock_page_killable() but for use by callers who
> know they have a folio. Convert __lock_page_killable() to be
> __folio_lock_killable(). This saves one call to compound_head() per
> contended call to lock_page_killable().
>
> __folio_lock_killable() is 19 bytes smaller than __lock_page_killable()
> was. filemap_fault() shrinks by 74 bytes and __lock_page_or_retry()
> shrinks by 71 bytes. That's a total of 164 bytes of text saved.
>
> Signed-off-by: Matthew Wilcox (Oracle) <[email protected]>
> Reviewed-by: Christoph Hellwig <[email protected]>
> Acked-by: Jeff Layton <[email protected]>
> Acked-by: Kirill A. Shutemov <[email protected]>
> Acked-by: Vlastimil Babka <[email protected]>
> Reviewed-by: William Kucharski <[email protected]>
> ---
> include/linux/pagemap.h | 15 ++++++++++-----
> mm/filemap.c | 17 +++++++++--------
> 2 files changed, 19 insertions(+), 13 deletions(-)
Acked-by: Mike Rapoport <[email protected]>
> diff --git a/include/linux/pagemap.h b/include/linux/pagemap.h
> index c3673c55125b..88727c74e059 100644
> --- a/include/linux/pagemap.h
> +++ b/include/linux/pagemap.h
> @@ -654,7 +654,7 @@ static inline bool wake_page_match(struct wait_page_queue *wait_page,
> }
>
> void __folio_lock(struct folio *folio);
> -extern int __lock_page_killable(struct page *page);
> +int __folio_lock_killable(struct folio *folio);
> extern int __lock_page_async(struct page *page, struct wait_page_queue *wait);
> extern int __lock_page_or_retry(struct page *page, struct mm_struct *mm,
> unsigned int flags);
> @@ -694,6 +694,14 @@ static inline void lock_page(struct page *page)
> __folio_lock(folio);
> }
>
> +static inline int folio_lock_killable(struct folio *folio)
> +{
> + might_sleep();
> + if (!folio_trylock(folio))
> + return __folio_lock_killable(folio);
> + return 0;
> +}
> +
> /*
> * lock_page_killable is like lock_page but can be interrupted by fatal
> * signals. It returns 0 if it locked the page and -EINTR if it was
> @@ -701,10 +709,7 @@ static inline void lock_page(struct page *page)
> */
> static inline int lock_page_killable(struct page *page)
> {
> - might_sleep();
> - if (!trylock_page(page))
> - return __lock_page_killable(page);
> - return 0;
> + return folio_lock_killable(page_folio(page));
> }
>
> /*
> diff --git a/mm/filemap.c b/mm/filemap.c
> index 95f89656f126..962db5c38cd7 100644
> --- a/mm/filemap.c
> +++ b/mm/filemap.c
> @@ -1589,14 +1589,13 @@ void __folio_lock(struct folio *folio)
> }
> EXPORT_SYMBOL(__folio_lock);
>
> -int __lock_page_killable(struct page *__page)
> +int __folio_lock_killable(struct folio *folio)
> {
> - struct page *page = compound_head(__page);
> - wait_queue_head_t *q = page_waitqueue(page);
> - return wait_on_page_bit_common(q, page, PG_locked, TASK_KILLABLE,
> + wait_queue_head_t *q = page_waitqueue(&folio->page);
> + return wait_on_page_bit_common(q, &folio->page, PG_locked, TASK_KILLABLE,
> EXCLUSIVE);
> }
> -EXPORT_SYMBOL_GPL(__lock_page_killable);
> +EXPORT_SYMBOL_GPL(__folio_lock_killable);
>
> int __lock_page_async(struct page *page, struct wait_page_queue *wait)
> {
> @@ -1638,6 +1637,8 @@ int __lock_page_async(struct page *page, struct wait_page_queue *wait)
> int __lock_page_or_retry(struct page *page, struct mm_struct *mm,
> unsigned int flags)
> {
> + struct folio *folio = page_folio(page);
> +
> if (fault_flag_allow_retry_first(flags)) {
> /*
> * CAUTION! In this case, mmap_lock is not released
> @@ -1656,13 +1657,13 @@ int __lock_page_or_retry(struct page *page, struct mm_struct *mm,
> if (flags & FAULT_FLAG_KILLABLE) {
> int ret;
>
> - ret = __lock_page_killable(page);
> + ret = __folio_lock_killable(folio);
> if (ret) {
> mmap_read_unlock(mm);
> return 0;
> }
> } else {
> - __folio_lock(page_folio(page));
> + __folio_lock(folio);
> }
>
> return 1;
> @@ -2851,7 +2852,7 @@ static int lock_page_maybe_drop_mmap(struct vm_fault *vmf, struct page *page,
>
> *fpin = maybe_unlock_mmap_for_io(vmf, *fpin);
> if (vmf->flags & FAULT_FLAG_KILLABLE) {
> - if (__lock_page_killable(&folio->page)) {
> + if (__folio_lock_killable(folio)) {
> /*
> * We didn't have the right flags to drop the mmap_lock,
> * but all fault_handlers only check for fatal signals
> --
> 2.30.2
>
>
--
Sincerely yours,
Mike.
On Thu, Jul 15, 2021 at 04:35:06AM +0100, Matthew Wilcox (Oracle) wrote:
> There aren't any actual callers of lock_page_async(), so remove it.
> Convert filemap_update_page() to call __folio_lock_async().
>
> __folio_lock_async() is 21 bytes smaller than __lock_page_async(),
> but the real savings come from using a folio in filemap_update_page(),
> shrinking it from 515 bytes to 404 bytes, saving 111 bytes. The text
> shrinks by 132 bytes in total.
>
> Signed-off-by: Matthew Wilcox (Oracle) <[email protected]>
> Reviewed-by: Christoph Hellwig <[email protected]>
> Acked-by: Jeff Layton <[email protected]>
> Acked-by: Kirill A. Shutemov <[email protected]>
> Acked-by: Vlastimil Babka <[email protected]>
> Reviewed-by: William Kucharski <[email protected]>
> Reviewed-by: David Howells <[email protected]>
> ---
> fs/io_uring.c | 2 +-
> include/linux/pagemap.h | 17 -----------------
> mm/filemap.c | 31 ++++++++++++++++---------------
> 3 files changed, 17 insertions(+), 33 deletions(-)
Acked-by: Mike Rapoport <[email protected]>
On Thu, Jul 15, 2021 at 04:35:07AM +0100, Matthew Wilcox (Oracle) wrote:
> Also add folio_wait_locked_killable(). Turn wait_on_page_locked() and
> wait_on_page_locked_killable() into wrappers. This eliminates a call
> to compound_head() from each call-site, reducing text size by 193 bytes
> for me.
>
> Signed-off-by: Matthew Wilcox (Oracle) <[email protected]>
> Reviewed-by: Christoph Hellwig <[email protected]>
> Acked-by: Jeff Layton <[email protected]>
> Acked-by: Kirill A. Shutemov <[email protected]>
> Acked-by: Vlastimil Babka <[email protected]>
> Reviewed-by: William Kucharski <[email protected]>
> Reviewed-by: David Howells <[email protected]>
> ---
> include/linux/pagemap.h | 26 ++++++++++++++++++--------
> mm/filemap.c | 4 ++--
> 2 files changed, 20 insertions(+), 10 deletions(-)
Acked-by: Mike Rapoport <[email protected]>
On Thu, Jul 15, 2021 at 04:35:02AM +0100, Matthew Wilcox (Oracle) wrote:
> These are the folio equivalent of page_mapping() and page_file_mapping().
> Add an out-of-line page_mapping() wrapper around folio_mapping()
> in order to prevent the page_folio() call from bloating every caller
> of page_mapping(). Adjust page_file_mapping() and page_mapping_file()
> to use folios internally. Rename __page_file_mapping() to
> swapcache_mapping() and change it to take a folio.
>
> This ends up saving 122 bytes of text overall. folio_mapping() is
> 45 bytes shorter than page_mapping() was, but the new page_mapping()
> wrapper is 30 bytes. The major reduction is a few bytes less in dozens
> of nfs functions (which call page_file_mapping()). Most of these appear
> to be a slight change in gcc's register allocation decisions, which allow:
>
> 48 8b 56 08 mov 0x8(%rsi),%rdx
> 48 8d 42 ff lea -0x1(%rdx),%rax
> 83 e2 01 and $0x1,%edx
> 48 0f 44 c6 cmove %rsi,%rax
>
> to become:
>
> 48 8b 46 08 mov 0x8(%rsi),%rax
> 48 8d 78 ff lea -0x1(%rax),%rdi
> a8 01 test $0x1,%al
> 48 0f 44 fe cmove %rsi,%rdi
>
> for a reduction of a single byte. Once the NFS client is converted to
> use folios, this entire sequence will disappear.
>
> Also add folio_mapping() documentation.
>
> Signed-off-by: Matthew Wilcox (Oracle) <[email protected]>
> Reviewed-by: Christoph Hellwig <[email protected]>
> Acked-by: Jeff Layton <[email protected]>
> Acked-by: Kirill A. Shutemov <[email protected]>
> Acked-by: Vlastimil Babka <[email protected]>
> Reviewed-by: William Kucharski <[email protected]>
> Reviewed-by: David Howells <[email protected]>
> ---
> Documentation/core-api/mm-api.rst | 2 ++
> include/linux/mm.h | 14 -------------
> include/linux/pagemap.h | 35 +++++++++++++++++++++++++++++--
> include/linux/swap.h | 6 ++++++
> mm/Makefile | 2 +-
> mm/folio-compat.c | 13 ++++++++++++
> mm/swapfile.c | 8 +++----
> mm/util.c | 30 +++++++++++++++-----------
> 8 files changed, 77 insertions(+), 33 deletions(-)
> create mode 100644 mm/folio-compat.c
>
> diff --git a/Documentation/core-api/mm-api.rst b/Documentation/core-api/mm-api.rst
> index 5c459ee2acce..dcce6605947a 100644
> --- a/Documentation/core-api/mm-api.rst
> +++ b/Documentation/core-api/mm-api.rst
> @@ -100,3 +100,5 @@ More Memory Management Functions
> :internal:
> .. kernel-doc:: include/linux/page_ref.h
> .. kernel-doc:: include/linux/mmzone.h
> +.. kernel-doc:: mm/util.c
> + :functions: folio_mapping
> diff --git a/include/linux/mm.h b/include/linux/mm.h
> index 788fbc4cde0c..9d28f5b2e983 100644
> --- a/include/linux/mm.h
> +++ b/include/linux/mm.h
> @@ -1753,19 +1753,6 @@ void page_address_init(void);
>
> extern void *page_rmapping(struct page *page);
> extern struct anon_vma *page_anon_vma(struct page *page);
> -extern struct address_space *page_mapping(struct page *page);
> -
> -extern struct address_space *__page_file_mapping(struct page *);
> -
> -static inline
> -struct address_space *page_file_mapping(struct page *page)
> -{
> - if (unlikely(PageSwapCache(page)))
> - return __page_file_mapping(page);
> -
> - return page->mapping;
> -}
> -
> extern pgoff_t __page_file_index(struct page *page);
>
> /*
> @@ -1780,7 +1767,6 @@ static inline pgoff_t page_index(struct page *page)
> }
>
> bool page_mapped(struct page *page);
> -struct address_space *page_mapping(struct page *page);
>
> /*
> * Return true only if the page has been allocated with
> diff --git a/include/linux/pagemap.h b/include/linux/pagemap.h
> index aa71fa82d6be..a0925a89ba11 100644
> --- a/include/linux/pagemap.h
> +++ b/include/linux/pagemap.h
> @@ -162,14 +162,45 @@ static inline void filemap_nr_thps_dec(struct address_space *mapping)
>
> void release_pages(struct page **pages, int nr);
>
> +struct address_space *page_mapping(struct page *);
> +struct address_space *folio_mapping(struct folio *);
> +struct address_space *swapcache_mapping(struct folio *);
> +
> +/**
> + * folio_file_mapping - Find the mapping this folio belongs to.
> + * @folio: The folio.
> + *
> + * For folios which are in the page cache, return the mapping that this
> + * page belongs to. Folios in the swap cache return the mapping of the
> + * swap file or swap device where the data is stored. This is different
> + * from the mapping returned by folio_mapping(). The only reason to
> + * use it is if, like NFS, you return 0 from ->activate_swapfile.
> + *
> + * Do not call this for folios which aren't in the page cache or swap cache.
Missing return value description.
> + */
> +static inline struct address_space *folio_file_mapping(struct folio *folio)
> +{
> + if (unlikely(folio_test_swapcache(folio)))
> + return swapcache_mapping(folio);
> +
> + return folio->mapping;
> +}
> +
> +static inline struct address_space *page_file_mapping(struct page *page)
> +{
> + return folio_file_mapping(page_folio(page));
> +}
> +
[ snip ]
> diff --git a/mm/util.c b/mm/util.c
> index 9043d03750a7..1cde6218d6d1 100644
> --- a/mm/util.c
> +++ b/mm/util.c
> @@ -686,30 +686,36 @@ struct anon_vma *page_anon_vma(struct page *page)
> return __page_rmapping(page);
> }
>
> -struct address_space *page_mapping(struct page *page)
> +/**
> + * folio_mapping - Find the mapping where this folio is stored.
> + * @folio: The folio.
> + *
> + * For folios which are in the page cache, return the mapping that this
> + * page belongs to. Folios in the swap cache return the swap mapping
> + * this page is stored in (which is different from the mapping for the
> + * swap file or swap device where the data is stored).
> + *
> + * You can call this for folios which aren't in the swap cache or page
> + * cache and it will return NULL.
> + */
Missing return value description.
> +struct address_space *folio_mapping(struct folio *folio)
> {
> struct address_space *mapping;
>
> - page = compound_head(page);
> -
> /* This happens if someone calls flush_dcache_page on slab page */
> - if (unlikely(PageSlab(page)))
> + if (unlikely(folio_test_slab(folio)))
> return NULL;
>
> - if (unlikely(PageSwapCache(page))) {
> - swp_entry_t entry;
> -
> - entry.val = page_private(page);
> - return swap_address_space(entry);
> - }
> + if (unlikely(folio_test_swapcache(folio)))
> + return swap_address_space(folio_swap_entry(folio));
>
> - mapping = page->mapping;
> + mapping = folio->mapping;
> if ((unsigned long)mapping & PAGE_MAPPING_ANON)
> return NULL;
>
> return (void *)((unsigned long)mapping & ~PAGE_MAPPING_FLAGS);
> }
> -EXPORT_SYMBOL(page_mapping);
> +EXPORT_SYMBOL(folio_mapping);
>
> /* Slow path of page_mapcount() for compound pages */
> int __page_mapcount(struct page *page)
> --
> 2.30.2
>
>
--
Sincerely yours,
Mike.
On Thu, Jul 15, 2021 at 04:35:13AM +0100, Matthew Wilcox (Oracle) wrote:
> Rename wait_on_page_bit() to folio_wait_bit(). We must always wait on
> the folio, otherwise we won't be woken up due to the tail page hashing
> to a different bucket from the head page.
>
> This commit shrinks the kernel by 770 bytes, mostly due to moving
> the page waitqueue lookup into folio_wait_bit_common().
>
> Signed-off-by: Matthew Wilcox (Oracle) <[email protected]>
> Reviewed-by: Christoph Hellwig <[email protected]>
> Acked-by: Jeff Layton <[email protected]>
> Acked-by: Kirill A. Shutemov <[email protected]>
> Acked-by: Vlastimil Babka <[email protected]>
> Reviewed-by: William Kucharski <[email protected]>
> Reviewed-by: David Howells <[email protected]>
> ---
> include/linux/pagemap.h | 10 +++---
> mm/filemap.c | 77 +++++++++++++++++++----------------------
> mm/page-writeback.c | 4 +--
> 3 files changed, 43 insertions(+), 48 deletions(-)
Acked-by: Mike Rapoport <[email protected]>
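For context, the wakeup would be missed because the page wait queues are a
small hash table keyed by the page pointer: a waiter hashed on a tail page
and a waker hashed on the head page land in different buckets. A minimal
sketch of the lookup (the names and table size here are assumptions for
illustration, not quoted from the patch):

#include <linux/hash.h>
#include <linux/wait.h>

#define FOLIO_WAIT_TABLE_BITS 8
static wait_queue_head_t folio_wait_table[1 << FOLIO_WAIT_TABLE_BITS];

/* Waiters and wakers must hash the same pointer: always the folio. */
static wait_queue_head_t *folio_waitqueue(struct folio *folio)
{
	return &folio_wait_table[hash_ptr(folio, FOLIO_WAIT_TABLE_BITS)];
}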
On Thu, Jul 15, 2021 at 04:34:57AM +0100, Matthew Wilcox (Oracle) wrote:
> Handle arbitrary-order folios being added to the LRU. By definition,
> all pages being added to the LRU were already head or base pages, but
> call page_folio() on them anyway to get the type right and avoid the
> buried calls to compound_head().
>
> Saves 783 bytes of kernel text; no functions grow.
>
> Signed-off-by: Matthew Wilcox (Oracle) <[email protected]>
> Reviewed-by: Yu Zhao <[email protected]>
> Reviewed-by: Christoph Hellwig <[email protected]>
> Reviewed-by: David Howells <[email protected]>
> Acked-by: Kirill A. Shutemov <[email protected]>
> ---
> include/linux/mm_inline.h | 98 ++++++++++++++++++++++------------
> include/trace/events/pagemap.h | 2 +-
> 2 files changed, 65 insertions(+), 35 deletions(-)
Acked-by: Mike Rapoport <[email protected]>
> diff --git a/include/linux/mm_inline.h b/include/linux/mm_inline.h
> index 355ea1ee32bd..ee155d19885e 100644
> --- a/include/linux/mm_inline.h
> +++ b/include/linux/mm_inline.h
> @@ -6,22 +6,27 @@
> #include <linux/swap.h>
>
> /**
> - * page_is_file_lru - should the page be on a file LRU or anon LRU?
> - * @page: the page to test
> + * folio_is_file_lru - should the folio be on a file LRU or anon LRU?
> + * @folio: the folio to test
> *
> - * Returns 1 if @page is a regular filesystem backed page cache page or a lazily
> - * freed anonymous page (e.g. via MADV_FREE). Returns 0 if @page is a normal
> - * anonymous page, a tmpfs page or otherwise ram or swap backed page. Used by
> - * functions that manipulate the LRU lists, to sort a page onto the right LRU
> - * list.
> + * Returns 1 if @folio is a regular filesystem backed page cache folio
> + * or a lazily freed anonymous folio (e.g. via MADV_FREE). Returns 0 if
> + * @folio is a normal anonymous folio, a tmpfs folio or otherwise ram or
> + * swap backed folio. Used by functions that manipulate the LRU lists,
> + * to sort a folio onto the right LRU list.
> *
> * We would like to get this info without a page flag, but the state
> - * needs to survive until the page is last deleted from the LRU, which
> + * needs to survive until the folio is last deleted from the LRU, which
> * could be as far down as __page_cache_release.
It seems mm_inline.h is not a part of generated API docs, otherwise
kerneldoc would be unhappy about missing Return: description.
> */
> +static inline int folio_is_file_lru(struct folio *folio)
> +{
> + return !folio_test_swapbacked(folio);
> +}
> +
> static inline int page_is_file_lru(struct page *page)
> {
> - return !PageSwapBacked(page);
> + return folio_is_file_lru(page_folio(page));
> }
>
> static __always_inline void update_lru_size(struct lruvec *lruvec,
> @@ -39,69 +44,94 @@ static __always_inline void update_lru_size(struct lruvec *lruvec,
> }
>
> /**
> - * __clear_page_lru_flags - clear page lru flags before releasing a page
> - * @page: the page that was on lru and now has a zero reference
> + * __folio_clear_lru_flags - clear page lru flags before releasing a page
> + * @folio: The folio that was on lru and now has a zero reference
> */
> -static __always_inline void __clear_page_lru_flags(struct page *page)
> +static __always_inline void __folio_clear_lru_flags(struct folio *folio)
> {
> - VM_BUG_ON_PAGE(!PageLRU(page), page);
> + VM_BUG_ON_FOLIO(!folio_test_lru(folio), folio);
>
> - __ClearPageLRU(page);
> + __folio_clear_lru(folio);
>
> /* this shouldn't happen, so leave the flags to bad_page() */
> - if (PageActive(page) && PageUnevictable(page))
> + if (folio_test_active(folio) && folio_test_unevictable(folio))
> return;
>
> - __ClearPageActive(page);
> - __ClearPageUnevictable(page);
> + __folio_clear_active(folio);
> + __folio_clear_unevictable(folio);
> +}
> +
> +static __always_inline void __clear_page_lru_flags(struct page *page)
> +{
> + __folio_clear_lru_flags(page_folio(page));
> }
>
> /**
> - * page_lru - which LRU list should a page be on?
> - * @page: the page to test
> + * folio_lru_list - which LRU list should a folio be on?
> + * @folio: the folio to test
> *
> - * Returns the LRU list a page should be on, as an index
> + * Returns the LRU list a folio should be on, as an index
^ Return:
> * into the array of LRU lists.
> */
> -static __always_inline enum lru_list page_lru(struct page *page)
> +static __always_inline enum lru_list folio_lru_list(struct folio *folio)
> {
> enum lru_list lru;
>
> - VM_BUG_ON_PAGE(PageActive(page) && PageUnevictable(page), page);
> + VM_BUG_ON_FOLIO(folio_test_active(folio) && folio_test_unevictable(folio), folio);
>
> - if (PageUnevictable(page))
> + if (folio_test_unevictable(folio))
> return LRU_UNEVICTABLE;
>
> - lru = page_is_file_lru(page) ? LRU_INACTIVE_FILE : LRU_INACTIVE_ANON;
> - if (PageActive(page))
> + lru = folio_is_file_lru(folio) ? LRU_INACTIVE_FILE : LRU_INACTIVE_ANON;
> + if (folio_test_active(folio))
> lru += LRU_ACTIVE;
>
> return lru;
> }
. . .
On Thu, Jul 15, 2021 at 04:35:10AM +0100, Matthew Wilcox (Oracle) wrote:
> Add an end_page_writeback() wrapper function for users that are not yet
> converted to folios.
>
> folio_end_writeback() is less than half the size of end_page_writeback()
> at just 105 bytes compared to 228 bytes, due to removing all the
> compound_head() calls. The 30 byte wrapper function makes this a net
> saving of 93 bytes.
>
> Signed-off-by: Matthew Wilcox (Oracle) <[email protected]>
> Reviewed-by: Christoph Hellwig <[email protected]>
> Acked-by: Jeff Layton <[email protected]>
> Acked-by: Kirill A. Shutemov <[email protected]>
> Acked-by: Vlastimil Babka <[email protected]>
> Reviewed-by: William Kucharski <[email protected]>
> Reviewed-by: David Howells <[email protected]>
> ---
> include/linux/pagemap.h | 3 ++-
> mm/filemap.c | 43 ++++++++++++++++++++---------------------
> mm/folio-compat.c | 6 ++++++
> 3 files changed, 29 insertions(+), 23 deletions(-)
Acked-by: Mike Rapoport <[email protected]>
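Presumably the new end_page_writeback() wrapper follows the same folio-compat
pattern as the other compatibility shims in this series; a sketch (assumed,
not copied from the patch, since only the diffstat is quoted above):

void end_page_writeback(struct page *page)
{
	return folio_end_writeback(page_folio(page));
}
EXPORT_SYMBOL(end_page_writeback);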
On Thu, Jul 15, 2021 at 04:35:09AM +0100, Matthew Wilcox (Oracle) wrote:
> Convert rotate_reclaimable_page() to folio_rotate_reclaimable(). This
> eliminates all five of the calls to compound_head() in this function,
> saving 75 bytes at the cost of adding 15 bytes to its one caller,
> end_page_writeback(). We also save 36 bytes from pagevec_move_tail_fn()
> due to using folios there. Net 96 bytes savings.
>
> Also move its declaration to mm/internal.h as it's only used by filemap.c.
>
> Signed-off-by: Matthew Wilcox (Oracle) <[email protected]>
> Acked-by: Vlastimil Babka <[email protected]>
> Reviewed-by: William Kucharski <[email protected]>
> Reviewed-by: Christoph Hellwig <[email protected]>
> Acked-by: Kirill A. Shutemov <[email protected]>
> ---
> include/linux/swap.h | 1 -
> mm/filemap.c | 3 ++-
> mm/internal.h | 1 +
> mm/page_io.c | 4 ++--
> mm/swap.c | 30 ++++++++++++++++--------------
> 5 files changed, 21 insertions(+), 18 deletions(-)
Acked-by: Mike Rapoport <[email protected]>
On Thu, Jul 15, 2021 at 04:35:04AM +0100, Matthew Wilcox (Oracle) wrote:
> This is like lock_page() but for use by callers who know they have a folio.
> Convert __lock_page() to be __folio_lock(). This saves one call to
> compound_head() per contended call to lock_page().
>
> Saves 455 bytes of text; mostly from improved register allocation and
> inlining decisions. __folio_lock is 59 bytes while __lock_page was 79.
>
> Signed-off-by: Matthew Wilcox (Oracle) <[email protected]>
> Reviewed-by: Christoph Hellwig <[email protected]>
> Acked-by: Jeff Layton <[email protected]>
> Acked-by: Kirill A. Shutemov <[email protected]>
> Acked-by: Vlastimil Babka <[email protected]>
> Reviewed-by: William Kucharski <[email protected]>
> Reviewed-by: David Howells <[email protected]>
> ---
> include/linux/pagemap.h | 24 +++++++++++++++++++-----
> mm/filemap.c | 29 +++++++++++++++--------------
> 2 files changed, 34 insertions(+), 19 deletions(-)
Acked-by: Mike Rapoport <[email protected]>
On Thu, Jul 15, 2021 at 04:35:14AM +0100, Matthew Wilcox (Oracle) wrote:
> Convert wake_up_page_bit() to folio_wake_bit(). All callers have a folio,
> so use it directly. Saves 66 bytes of text in end_page_private_2().
>
> Signed-off-by: Matthew Wilcox (Oracle) <[email protected]>
> Reviewed-by: Christoph Hellwig <[email protected]>
> Acked-by: Jeff Layton <[email protected]>
> Acked-by: Kirill A. Shutemov <[email protected]>
> Acked-by: Vlastimil Babka <[email protected]>
> Reviewed-by: William Kucharski <[email protected]>
> Reviewed-by: David Howells <[email protected]>
> ---
> mm/filemap.c | 23 ++++++++++++-----------
> 1 file changed, 12 insertions(+), 11 deletions(-)
Acked-by: Mike Rapoport <[email protected]>
On Thu, Jul 15, 2021 at 04:35:11AM +0100, Matthew Wilcox (Oracle) wrote:
> wait_on_page_writeback_killable() only has one caller, so convert it to
> call folio_wait_writeback_killable(). For the wait_on_page_writeback()
> callers, add a compatibility wrapper around folio_wait_writeback().
>
> Turning PageWriteback() into folio_test_writeback() eliminates a call
> to compound_head() which saves 8 bytes and 15 bytes in the two
> functions. Unfortunately, that is more than offset by adding the
> wait_on_page_writeback compatibility wrapper for a net increase in text
> of 7 bytes.
>
> Signed-off-by: Matthew Wilcox (Oracle) <[email protected]>
> Reviewed-by: Christoph Hellwig <[email protected]>
> Acked-by: Jeff Layton <[email protected]>
> Acked-by: Kirill A. Shutemov <[email protected]>
> Acked-by: Vlastimil Babka <[email protected]>
> Reviewed-by: William Kucharski <[email protected]>
> ---
> fs/afs/write.c | 9 ++++----
> include/linux/pagemap.h | 3 ++-
> mm/folio-compat.c | 6 ++++++
> mm/page-writeback.c | 48 ++++++++++++++++++++++++++++-------------
> 4 files changed, 46 insertions(+), 20 deletions(-)
Acked-by: Mike Rapoport <[email protected]>
On Thu, Jul 15, 2021 at 04:35:12AM +0100, Matthew Wilcox (Oracle) wrote:
> Move wait_for_stable_page() into the folio compatibility file.
> folio_wait_stable() avoids a call to compound_head() and is 14 bytes
> smaller than wait_for_stable_page() was. The net text size grows by 16
> bytes as a result of this patch. We can also remove thp_head() as this
> was the last user.
>
> Signed-off-by: Matthew Wilcox (Oracle) <[email protected]>
> Reviewed-by: Christoph Hellwig <[email protected]>
> Acked-by: Jeff Layton <[email protected]>
> Acked-by: Kirill A. Shutemov <[email protected]>
> Acked-by: Vlastimil Babka <[email protected]>
> Reviewed-by: William Kucharski <[email protected]>
> Reviewed-by: David Howells <[email protected]>
> ---
> include/linux/huge_mm.h | 15 ---------------
> include/linux/pagemap.h | 1 +
> mm/folio-compat.c | 6 ++++++
> mm/page-writeback.c | 24 ++++++++++++++----------
> 4 files changed, 21 insertions(+), 25 deletions(-)
>
> diff --git a/include/linux/huge_mm.h b/include/linux/huge_mm.h
> index f123e15d966e..f280f33ff223 100644
> --- a/include/linux/huge_mm.h
> +++ b/include/linux/huge_mm.h
> @@ -250,15 +250,6 @@ static inline spinlock_t *pud_trans_huge_lock(pud_t *pud,
> return NULL;
> }
>
> -/**
> - * thp_head - Head page of a transparent huge page.
> - * @page: Any page (tail, head or regular) found in the page cache.
kerneldoc will warn about missing return description
> - */
> -static inline struct page *thp_head(struct page *page)
> -{
> - return compound_head(page);
> -}
> -
> /**
> * thp_order - Order of a transparent huge page.
> * @page: Head page of a transparent huge page.
> @@ -336,12 +327,6 @@ static inline struct list_head *page_deferred_list(struct page *page)
> #define HPAGE_PUD_MASK ({ BUILD_BUG(); 0; })
> #define HPAGE_PUD_SIZE ({ BUILD_BUG(); 0; })
>
> -static inline struct page *thp_head(struct page *page)
> -{
> - VM_BUG_ON_PGFLAGS(PageTail(page), page);
> - return page;
> -}
> -
> static inline unsigned int thp_order(struct page *page)
> {
> VM_BUG_ON_PGFLAGS(PageTail(page), page);
> diff --git a/include/linux/pagemap.h b/include/linux/pagemap.h
> index 0c5f53368fe9..96b62a2331fb 100644
> --- a/include/linux/pagemap.h
> +++ b/include/linux/pagemap.h
> @@ -772,6 +772,7 @@ int folio_wait_writeback_killable(struct folio *folio);
> void end_page_writeback(struct page *page);
> void folio_end_writeback(struct folio *folio);
> void wait_for_stable_page(struct page *page);
> +void folio_wait_stable(struct folio *folio);
>
> void __set_page_dirty(struct page *, struct address_space *, int warn);
> int __set_page_dirty_nobuffers(struct page *page);
> diff --git a/mm/folio-compat.c b/mm/folio-compat.c
> index 41275dac7a92..3c83f03b80d7 100644
> --- a/mm/folio-compat.c
> +++ b/mm/folio-compat.c
> @@ -29,3 +29,9 @@ void wait_on_page_writeback(struct page *page)
> return folio_wait_writeback(page_folio(page));
> }
> EXPORT_SYMBOL_GPL(wait_on_page_writeback);
> +
> +void wait_for_stable_page(struct page *page)
> +{
> + return folio_wait_stable(page_folio(page));
> +}
> +EXPORT_SYMBOL_GPL(wait_for_stable_page);
> diff --git a/mm/page-writeback.c b/mm/page-writeback.c
> index c2c00e1533ad..a078e9786cc4 100644
> --- a/mm/page-writeback.c
> +++ b/mm/page-writeback.c
> @@ -2877,17 +2877,21 @@ int folio_wait_writeback_killable(struct folio *folio)
> EXPORT_SYMBOL_GPL(folio_wait_writeback_killable);
>
> /**
> - * wait_for_stable_page() - wait for writeback to finish, if necessary.
> - * @page: The page to wait on.
> + * folio_wait_stable() - wait for writeback to finish, if necessary.
> + * @folio: The folio to wait on.
> *
> - * This function determines if the given page is related to a backing device
> - * that requires page contents to be held stable during writeback. If so, then
> - * it will wait for any pending writeback to complete.
> + * This function determines if the given folio is related to a backing
> + * device that requires folio contents to be held stable during writeback.
> + * If so, then it will wait for any pending writeback to complete.
> + *
> + * Context: Sleeps. Must be called in process context and with
> + * no spinlocks held. Caller should hold a reference on the folio.
> + * If the folio is not locked, writeback may start again after writeback
> + * has finished.
> */
> -void wait_for_stable_page(struct page *page)
> +void folio_wait_stable(struct folio *folio)
> {
> - page = thp_head(page);
> - if (page->mapping->host->i_sb->s_iflags & SB_I_STABLE_WRITES)
> - wait_on_page_writeback(page);
> + if (folio->mapping->host->i_sb->s_iflags & SB_I_STABLE_WRITES)
> + folio_wait_writeback(folio);
> }
> -EXPORT_SYMBOL_GPL(wait_for_stable_page);
> +EXPORT_SYMBOL_GPL(folio_wait_stable);
> --
> 2.30.2
>
>
--
Sincerely yours,
Mike.
Hi Matthew,
(Sorry for the late response, I could not find time earlier)
On Thu, Jul 15, 2021 at 04:34:46AM +0100, Matthew Wilcox (Oracle) wrote:
> Managing memory in 4KiB pages is a serious overhead. Many benchmarks
> benefit from a larger "page size". As an example, an earlier iteration
> of this idea which used compound pages (and wasn't particularly tuned)
> got a 7% performance boost when compiling the kernel.
>
> Using compound pages or THPs exposes a weakness of our type system.
> Functions are often unprepared for compound pages to be passed to them,
> and may only act on PAGE_SIZE chunks. Even functions which are aware of
> compound pages may expect a head page, and do the wrong thing if passed
> a tail page.
>
> We also waste a lot of instructions ensuring that we're not looking at
> a tail page. Almost every call to PageFoo() contains one or more hidden
> calls to compound_head(). This also happens for get_page(), put_page()
> and many more functions.
>
> This patch series uses a new type, the struct folio, to manage memory.
> It converts enough of the page cache, iomap and XFS to use folios instead
> of pages, and then adds support for multi-page folios. It passes xfstests
> (running on XFS) with no regressions compared to v5.14-rc1.
I like the idea of folios, and the first patches I've reviewed look good.
Most of the changelogs (at least at the first patches) mention reduction of
the kernel size for your configuration on x86. I wonder, what happens if
you build the kernel with "non-distro" configuration, e.g. defconfig or
tiny.config?
Also, what is the difference on !x86 builds?
> Git: https://git.infradead.org/users/willy/pagecache.git/shortlog/refs/tags/folio_14
--
Sincerely yours,
Mike.
On Tue, Jul 20, 2021 at 01:54:38PM +0300, Mike Rapoport wrote:
> Most of the changelogs (at least at the first patches) mention reduction of
> the kernel size for your configuration on x86. I wonder, what happens if
> you build the kernel with "non-distro" configuration, e.g. defconfig or
> tiny.config?
I did an allnoconfig build and that reduced in size by ~2KiB.
> Also, what is the difference on !x86 builds?
I don't generally do non-x86 builds ... feel free to compare for
yourself! I imagine it'll be 2-4 instructions per call to
compound_head(), i.e. something like:
load page into reg S
load reg S + 8 into reg T
test bottom bit of reg T
cond-move reg T - 1 to reg S
becomes
load folio into reg S
the exact spelling of those instructions will vary from architecture to
architecture; some will take more instructions than others. Possibly it
means we end up using one fewer register and so reducing the number of
registers spilled to the stack. Probably not, though.
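In C, the hidden check that those instructions implement is roughly the
following (a sketch of the logic, not a quote of the current header):

static inline struct page *compound_head(struct page *page)
{
	unsigned long head = READ_ONCE(page->compound_head);

	/* Tail pages store a pointer to their head page with bit 0 set;
	 * head and base pages are returned unchanged. */
	if (unlikely(head & 1))
		return (struct page *)(head - 1);
	return page;
}

With a struct folio, the type already guarantees we're not holding a tail
page, so the whole load/test/cmove sequence simply goes away.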
On Tue, Jul 20, 2021 at 01:44:10PM +0300, Mike Rapoport wrote:
> > /**
> > - * page_is_file_lru - should the page be on a file LRU or anon LRU?
> > - * @page: the page to test
> > + * folio_is_file_lru - should the folio be on a file LRU or anon LRU?
> > + * @folio: the folio to test
> > *
> > - * Returns 1 if @page is a regular filesystem backed page cache page or a lazily
> > - * freed anonymous page (e.g. via MADV_FREE). Returns 0 if @page is a normal
> > - * anonymous page, a tmpfs page or otherwise ram or swap backed page. Used by
> > - * functions that manipulate the LRU lists, to sort a page onto the right LRU
> > - * list.
> > + * Returns 1 if @folio is a regular filesystem backed page cache folio
> > + * or a lazily freed anonymous folio (e.g. via MADV_FREE). Returns 0 if
> > + * @folio is a normal anonymous folio, a tmpfs folio or otherwise ram or
> > + * swap backed folio. Used by functions that manipulate the LRU lists,
> > + * to sort a folio onto the right LRU list.
> > *
> > * We would like to get this info without a page flag, but the state
> > - * needs to survive until the page is last deleted from the LRU, which
> > + * needs to survive until the folio is last deleted from the LRU, which
> > * could be as far down as __page_cache_release.
>
> It seems mm_inline.h is not a part of generated API docs, otherwise
> kerneldoc would be unhappy about missing Return: description.
kernel-doc doesn't warn about that by default.
# This check emits a lot of warnings at the moment, because many
# functions don't have a 'Return' doc section. So until the number
# of warnings goes sufficiently down, the check is only performed in
# verbose mode.
# TODO: always perform the check.
if ($verbose && !$noret) {
	check_return_section($file, $declaration_name, $return_type);
}
On Tue, Jul 20, 2021 at 04:23:29PM +0100, Matthew Wilcox wrote:
> On Tue, Jul 20, 2021 at 06:17:26PM +0300, Mike Rapoport wrote:
> > On Tue, Jul 20, 2021 at 01:41:15PM +0100, Matthew Wilcox wrote:
> > > On Tue, Jul 20, 2021 at 01:54:38PM +0300, Mike Rapoport wrote:
> > > > Most of the changelogs (at least at the first patches) mention reduction of
> > > > the kernel size for your configuration on x86. I wonder, what happens if
> > > > you build the kernel with "non-distro" configuration, e.g. defconfig or
> > > > tiny.config?
> > >
> > > I did an allnoconfig build and that reduced in size by ~2KiB.
> > >
> > > > Also, what is the difference on !x86 builds?
> > >
> > > I don't generally do non-x86 builds ... feel free to compare for
> > > yourself!
> >
> > I did allnoconfig and defconfig for arm64 and powerpc.
> >
> > All except arm64::defconfig show a decrease of ~1KiB, while arm64::defconfig
> > actually increased by ~500 bytes.
>
> Which patch did you go up to for that? If you're going past patch 50 or
> so, then you're starting to add functionality (ie support for arbitrary
> order pages), so a certain amount of extra code size might be expected.
> I measured 6KB at patch 32 or so, then between patch 32 & 50 was pretty
> much a wash.
I've used folio_14 tag:
commit 480552d0322d855d146c0fa6fdf1e89ca8569037 (HEAD, tag: folio_14)
Author: Matthew Wilcox (Oracle) <[email protected]>
Date: Wed Feb 5 11:27:01 2020 -0500
mm/readahead: Add multi-page folio readahead
--
Sincerely yours,
Mike.
On Tue, Jul 20, 2021 at 06:17:26PM +0300, Mike Rapoport wrote:
> On Tue, Jul 20, 2021 at 01:41:15PM +0100, Matthew Wilcox wrote:
> > On Tue, Jul 20, 2021 at 01:54:38PM +0300, Mike Rapoport wrote:
> > > Most of the changelogs (at least at the first patches) mention reduction of
> > > the kernel size for your configuration on x86. I wonder, what happens if
> > > you build the kernel with "non-distro" configuration, e.g. defconfig or
> > > tiny.config?
> >
> > I did an allnoconfig build and that reduced in size by ~2KiB.
> >
> > > Also, what is the difference on !x86 builds?
> >
> > I don't generally do non-x86 builds ... feel free to compare for
> > yourself!
>
> I did allnoconfig and defconfig for arm64 and powerpc.
>
> All except arm64::defconfig show a decrease of ~1KiB, while arm64::defconfig
> actually increased by ~500 bytes.
Which patch did you go up to for that? If you're going past patch 50 or
so, then you're starting to add functionality (ie support for arbitrary
order pages), so a certain amount of extra code size might be expected.
I measured 6KB at patch 32 or so, then between patch 32 & 50 was pretty
much a wash.
> I didn't dig into objdumps yet.
>
> I also tried to build arm but it failed with:
>
> CC fs/remap_range.o
> fs/remap_range.c: In function 'vfs_dedupe_file_range_compare':
> fs/remap_range.c:250:3: error: implicit declaration of function 'flush_dcache_folio'; did you mean 'flush_cache_louis'? [-Werror=implicit-function-declaration]
> 250 | flush_dcache_folio(src_folio);
> | ^~~~~~~~~~~~~~~~~~
> | flush_cache_louis
> cc1: some warnings being treated as errors
Already complained about by the build bot; already fixed. You should
maybe look at the git tree if you're doing more than code review.
On Tue, Jul 20, 2021 at 01:41:15PM +0100, Matthew Wilcox wrote:
> On Tue, Jul 20, 2021 at 01:54:38PM +0300, Mike Rapoport wrote:
> > Most of the changelogs (at least at the first patches) mention reduction of
> > the kernel size for your configuration on x86. I wonder, what happens if
> > you build the kernel with "non-distro" configuration, e.g. defconfig or
> > tiny.config?
>
> I did an allnoconfig build and that reduced in size by ~2KiB.
>
> > Also, what is the difference on !x86 builds?
>
> I don't generally do non-x86 builds ... feel free to compare for
> yourself!
I did allnoconfig and defconfig for arm64 and powerpc.
All except arm64::defconfig show a decrease of ~1KiB, while arm64::defconfig
actually increased by ~500 bytes.
I didn't dig into objdumps yet.
I also tried to build arm but it failed with:
CC fs/remap_range.o
fs/remap_range.c: In function 'vfs_dedupe_file_range_compare':
fs/remap_range.c:250:3: error: implicit declaration of function 'flush_dcache_folio'; did you mean 'flush_cache_louis'? [-Werror=implicit-function-declaration]
250 | flush_dcache_folio(src_folio);
| ^~~~~~~~~~~~~~~~~~
| flush_cache_louis
cc1: some warnings being treated as errors
> I imagine it'll be 2-4 instructions per call to
> compound_head(), i.e. something like:
>
> load page into reg S
> load reg S + 8 into reg T
> test bottom bit of reg T
> cond-move reg T - 1 to reg S
> becomes
> load folio into reg S
>
> the exact spelling of those instructions will vary from architecture to
> architecture; some will take more instructions than others. Possibly it
> means we end up using one fewer register and so reducing the number of
> registers spilled to the stack. Probably not, though.
--
Sincerely yours,
Mike.
On Tue, Jul 20, 2021 at 06:35:50PM +0300, Mike Rapoport wrote:
> On Tue, Jul 20, 2021 at 04:23:29PM +0100, Matthew Wilcox wrote:
> > Which patch did you go up to for that? If you're going past patch 50 or
> > so, then you're starting to add functionality (ie support for arbitrary
> > order pages), so a certain amount of extra code size might be expected.
> > I measured 6KB at patch 32 or so, then between patch 32 & 50 was pretty
> > much a wash.
>
> I've used folio_14 tag:
>
> commit 480552d0322d855d146c0fa6fdf1e89ca8569037 (HEAD, tag: folio_14)
> Author: Matthew Wilcox (Oracle) <[email protected]>
> Date: Wed Feb 5 11:27:01 2020 -0500
>
> mm/readahead: Add multi-page folio readahead
Probably worth trying the for-next tag instead to get a meaningful
comparison of how much using folios saves over pages.
I don't want to give the impression that this is all that can be
saved by switching to folios. There are still hundreds of places that
call PageFoo(), SetPageFoo(), ClearPageFoo(), put_page(), get_page(),
lock_page() and so on. There's probably another 20KB of code that can
be removed that way.
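As a purely hypothetical illustration of that remaining work (the helper
below is invented for the example; the folio_*() names follow the pattern
used in this series):

static void example_touch(struct page *page)
{
	lock_page(page);			/* hidden compound_head() */
	if (PageDirty(page))			/* hidden compound_head() */
		wait_on_page_writeback(page);	/* hidden compound_head() */
	unlock_page(page);			/* hidden compound_head() */
}

static void example_touch_folio(struct folio *folio)
{
	folio_lock(folio);			/* no hidden calls */
	if (folio_test_dirty(folio))
		folio_wait_writeback(folio);
	folio_unlock(folio);
}

Each conversion like that removes a handful of compound_head() calls from
the generated code.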
On Tue, Jul 20, 2021 at 01:42:11PM +0300, Mike Rapoport wrote:
> > diff --git a/include/linux/pagemap.h b/include/linux/pagemap.h
> > index f7c165b5991f..bd0e7e91bfd4 100644
> > --- a/include/linux/pagemap.h
> > +++ b/include/linux/pagemap.h
> > @@ -406,6 +406,17 @@ static inline pgoff_t folio_index(struct folio *folio)
> > return folio->index;
> > }
> >
> > +/**
> > + * folio_next_index - Get the index of the next folio.
> > + * @folio: The current folio.
> > + *
> > + * Return: The index of the folio which follows this folio in the file.
> > + */
>
> Maybe note that index is in units of pages?
I don't think this is the place to explain that. Remember, we already
have:
* @index: Offset within the file, in units of pages. For anonymous pages,
* this is the index from the beginning of the mmap.
and I don't want to explain every term of art in every function
description. I think if you're reading this, you can follow the
link to the struct folio description and see what an index is.
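For reference, the helper being documented is tiny; something along these
lines, assuming folio_nr_pages() as used elsewhere in the series:

static inline pgoff_t folio_next_index(struct folio *folio)
{
	/* ->index is in units of PAGE_SIZE pages, so the next folio in
	 * the file starts this many pages further on. */
	return folio->index + folio_nr_pages(folio);
}

so the "units of pages" question is answered by the arithmetic itself.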
On Tue, Jul 20, 2021 at 01:40:05PM +0300, Mike Rapoport wrote:
> > +/**
> > + * folio_shift - The number of bits covered by this folio.
>
> For me this sounds like the size of the folio in bits.
> Maybe just repeat "The base-2 logarithm of the size of this folio" here and
> in return description?
>
> > + * @folio: The folio.
> > + *
> > + * A folio contains a number of bytes which is a power-of-two in size.
> > + * This function tells you which power-of-two the folio is.
> > + *
> > + * Context: The caller should have a reference on the folio to prevent
> > + * it from being split. It is not necessary for the folio to be locked.
> > + * Return: The base-2 logarithm of the size of this folio.
> > + */
I've gone with:
/**
- * folio_shift - The number of bits covered by this folio.
+ * folio_shift - The size of the memory described by this folio.
* @folio: The folio.
*
- * A folio contains a number of bytes which is a power-of-two in size.
- * This function tells you which power-of-two the folio is.
+ * A folio represents a number of bytes which is a power-of-two in size.
+ * This function tells you which power-of-two the folio is. See also
+ * folio_size() and folio_order().
*
* Context: The caller should have a reference on the folio to prevent
* it from being split. It is not necessary for the folio to be locked.
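For illustration, the three helpers relate to each other like this (assumed
relationships, written as a sketch rather than taken from the tree):

/* For a folio of order N, i.e. one backing 2^N pages: */
static inline bool folio_geometry_holds(struct folio *folio)
{
	unsigned int n = folio_order(folio);

	return folio_nr_pages(folio) == 1UL << n &&
	       folio_size(folio) == PAGE_SIZE << n &&
	       folio_shift(folio) == PAGE_SHIFT + n;
}

In other words, folio_size(folio) == 1UL << folio_shift(folio), which is why
"The base-2 logarithm of the size of this folio" is an accurate Return:
description.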
On Tue, Jul 20, 2021 at 01:44:10PM +0300, Mike Rapoport wrote:
> It seems mm_inline.h is not a part of generated API docs, otherwise
> kerneldoc would be unhappy about missing Return: description.
It isn't, but I did add mm_inline.h to Documentation as part of this
patch (thanks!) and made this change:
/**
- * folio_is_file_lru - should the folio be on a file LRU or anon LRU?
- * @folio: the folio to test
- *
- * Returns 1 if @folio is a regular filesystem backed page cache folio
- * or a lazily freed anonymous folio (e.g. via MADV_FREE). Returns 0 if
- * @folio is a normal anonymous folio, a tmpfs folio or otherwise ram or
- * swap backed folio. Used by functions that manipulate the LRU lists,
- * to sort a folio onto the right LRU list.
+ * folio_is_file_lru - Should the folio be on a file LRU or anon LRU?
+ * @folio: The folio to test.
*
* We would like to get this info without a page flag, but the state
* needs to survive until the folio is last deleted from the LRU, which
* could be as far down as __page_cache_release.
+ *
+ * Return: An integer (not a boolean!) used to sort a folio onto the
+ * right LRU list and to account folios correctly.
+ * 1 if @folio is a regular filesystem backed page cache folio
+ * or a lazily freed anonymous folio (e.g. via MADV_FREE).
+ * 0 if @folio is a normal anonymous folio, a tmpfs folio or otherwise
+ * ram or swap backed folio.
*/
I wanted to turn those last two sentences into a list, but my
kernel-doc-fu abandoned me. Feel free to submit a follow-on patch to
fix that ;-)
On Wed, Jul 21, 2021 at 05:08:44AM +0100, Matthew Wilcox wrote:
> On Tue, Jul 20, 2021 at 01:44:10PM +0300, Mike Rapoport wrote:
> > It seems mm_inline.h is not a part of generated API docs, otherwise
> > kerneldoc would be unhappy about missing Return: description.
>
> It isn't, but I did add mm_inline.h to Documentation as part of this
> patch (thanks!) and made this change:
>
> /**
> - * folio_is_file_lru - should the folio be on a file LRU or anon LRU?
> - * @folio: the folio to test
> - *
> - * Returns 1 if @folio is a regular filesystem backed page cache folio
> - * or a lazily freed anonymous folio (e.g. via MADV_FREE). Returns 0 if
> - * @folio is a normal anonymous folio, a tmpfs folio or otherwise ram or
> - * swap backed folio. Used by functions that manipulate the LRU lists,
> - * to sort a folio onto the right LRU list.
> + * folio_is_file_lru - Should the folio be on a file LRU or anon LRU?
> + * @folio: The folio to test.
> *
> * We would like to get this info without a page flag, but the state
> * needs to survive until the folio is last deleted from the LRU, which
> * could be as far down as __page_cache_release.
> + *
> + * Return: An integer (not a boolean!) used to sort a folio onto the
> + * right LRU list and to account folios correctly.
> + * 1 if @folio is a regular filesystem backed page cache folio
> + * or a lazily freed anonymous folio (e.g. via MADV_FREE).
> + * 0 if @folio is a normal anonymous folio, a tmpfs folio or otherwise
> + * ram or swap backed folio.
> */
>
> I wanted to turn those last two sentences into a list, but my
> kernel-doc-fu abandoned me. Feel free to submit a follow-on patch to
> fix that ;-)
Here it is ;-)
Feel free to fold it into the original commit if you'd like to.
From 636d1715252f7bd1e87219797153b8baa28774af Mon Sep 17 00:00:00 2001
From: Mike Rapoport <[email protected]>
Date: Wed, 21 Jul 2021 11:35:15 +0300
Subject: [PATCH] mm/docs: folio_is_file_lru: make return description a list
Reformat the return value description of folio_is_file_lru() so that it will
be presented as a list in the generated output.
Signed-off-by: Mike Rapoport <[email protected]>
---
include/linux/mm_inline.h | 9 +++++----
1 file changed, 5 insertions(+), 4 deletions(-)
diff --git a/include/linux/mm_inline.h b/include/linux/mm_inline.h
index d39537c5471b..b263ac0a2c3a 100644
--- a/include/linux/mm_inline.h
+++ b/include/linux/mm_inline.h
@@ -15,10 +15,11 @@
*
* Return: An integer (not a boolean!) used to sort a folio onto the
* right LRU list and to account folios correctly.
- * 1 if @folio is a regular filesystem backed page cache folio
- * or a lazily freed anonymous folio (e.g. via MADV_FREE).
- * 0 if @folio is a normal anonymous folio, a tmpfs folio or otherwise
- * ram or swap backed folio.
+ *
+ * - 1 if @folio is a regular filesystem backed page cache folio
+ * or a lazily freed anonymous folio (e.g. via MADV_FREE).
+ * - 0 if @folio is a normal anonymous folio, a tmpfs folio or otherwise
+ * ram or swap backed folio.
*/
static inline int folio_is_file_lru(struct folio *folio)
{
--
2.31.1
--
Sincerely yours,
Mike.
On Thu, Jul 15, 2021 at 04:35:18AM +0100, Matthew Wilcox (Oracle) wrote:
> This function is the equivalent of page_mapped(). It is slightly
> shorter as we do not need to handle the PageTail() case. Reimplement
> page_mapped() as a wrapper around folio_mapped(). folio_mapped()
> is 13 bytes smaller than page_mapped(), but the page_mapped() wrapper
> is 30 bytes, for a net increase of 17 bytes of text.
>
> Signed-off-by: Matthew Wilcox (Oracle) <[email protected]>
> Acked-by: Vlastimil Babka <[email protected]>
> Reviewed-by: William Kucharski <[email protected]>
> Reviewed-by: Christoph Hellwig <[email protected]>
> Reviewed-by: David Howells <[email protected]>
> Acked-by: Kirill A. Shutemov <[email protected]>
> ---
> include/linux/mm.h | 1 +
> include/linux/mm_types.h | 6 ++++++
> mm/folio-compat.c | 6 ++++++
> mm/util.c | 29 ++++++++++++++++-------------
> 4 files changed, 29 insertions(+), 13 deletions(-)
Acked-by: Mike Rapoport <[email protected]>
On Thu, Jul 15, 2021 at 04:35:26AM +0100, Matthew Wilcox (Oracle) wrote:
> Convert all callers of mem_cgroup_charge() to call page_folio() on the
> page they're currently passing in. Many of them will be converted to
> use folios themselves soon.
>
> Signed-off-by: Matthew Wilcox (Oracle) <[email protected]>
> Reviewed-by: Christoph Hellwig <[email protected]>
> ---
> include/linux/memcontrol.h | 6 +++---
> kernel/events/uprobes.c | 3 ++-
> mm/filemap.c | 2 +-
> mm/huge_memory.c | 2 +-
> mm/khugepaged.c | 4 ++--
> mm/ksm.c | 3 ++-
> mm/memcontrol.c | 26 +++++++++++++-------------
> mm/memory.c | 9 +++++----
> mm/migrate.c | 2 +-
> mm/shmem.c | 2 +-
> mm/userfaultfd.c | 2 +-
> 11 files changed, 32 insertions(+), 29 deletions(-)
>
> diff --git a/mm/memcontrol.c b/mm/memcontrol.c
> index c2ffad021e09..03283d97b62a 100644
> --- a/mm/memcontrol.c
> +++ b/mm/memcontrol.c
> @@ -6697,27 +6696,27 @@ static int __mem_cgroup_charge(struct page *page, struct mem_cgroup *memcg,
>
> local_irq_disable();
> mem_cgroup_charge_statistics(memcg, nr_pages);
> - memcg_check_events(memcg, page_to_nid(page));
> + memcg_check_events(memcg, folio_nid(folio));
> local_irq_enable();
> out:
> return ret;
> }
>
> /**
> - * mem_cgroup_charge - charge a newly allocated page to a cgroup
> - * @page: page to charge
> - * @mm: mm context of the victim
> - * @gfp_mask: reclaim mode
> + * mem_cgroup_charge - Charge a newly allocated folio to a cgroup.
> + * @folio: Folio to charge.
> + * @mm: mm context of the allocating task.
> + * @gfp: reclaim mode
> *
> - * Try to charge @page to the memcg that @mm belongs to, reclaiming
> - * pages according to @gfp_mask if necessary. if @mm is NULL, try to
> + * Try to charge @folio to the memcg that @mm belongs to, reclaiming
> + * pages according to @gfp if necessary. If @mm is NULL, try to
> * charge to the active memcg.
> *
> - * Do not use this for pages allocated for swapin.
> + * Do not use this for folios allocated for swapin.
> *
> * Returns 0 on success. Otherwise, an error code is returned.
Missing return description
> */
> -int mem_cgroup_charge(struct page *page, struct mm_struct *mm, gfp_t gfp_mask)
> +int mem_cgroup_charge(struct folio *folio, struct mm_struct *mm, gfp_t gfp)
> {
> struct mem_cgroup *memcg;
> int ret;
--
Sincerely yours,
Mike.
On Thu, Jul 15, 2021 at 04:35:16AM +0100, Matthew Wilcox (Oracle) wrote:
> end_page_private_2() becomes folio_end_private_2(),
> wait_on_page_private_2() becomes folio_wait_private_2() and
> wait_on_page_private_2_killable() becomes folio_wait_private_2_killable().
>
> Adjust the fscache equivalents to call page_folio() before calling these
> functions to avoid adding wrappers. Ends up costing 1 byte of text
> in ceph & netfs, but the core shrinks by three calls to page_folio().
>
> Signed-off-by: Matthew Wilcox (Oracle) <[email protected]>
> Acked-by: Vlastimil Babka <[email protected]>
> Reviewed-by: William Kucharski <[email protected]>
> Reviewed-by: Christoph Hellwig <[email protected]>
> Reviewed-by: David Howells <[email protected]>
> Acked-by: Kirill A. Shutemov <[email protected]>
> ---
> include/linux/netfs.h | 6 +++---
> include/linux/pagemap.h | 6 +++---
> mm/filemap.c | 37 ++++++++++++++++---------------------
> 3 files changed, 22 insertions(+), 27 deletions(-)
>
> diff --git a/include/linux/netfs.h b/include/linux/netfs.h
> index 9062adfa2fb9..fad8c6209edd 100644
> --- a/include/linux/netfs.h
> +++ b/include/linux/netfs.h
> @@ -55,7 +55,7 @@ static inline void set_page_fscache(struct page *page)
> */
> static inline void end_page_fscache(struct page *page)
> {
> - end_page_private_2(page);
> + folio_end_private_2(page_folio(page));
> }
>
> /**
> @@ -66,7 +66,7 @@ static inline void end_page_fscache(struct page *page)
> */
> static inline void wait_on_page_fscache(struct page *page)
> {
> - wait_on_page_private_2(page);
> + folio_wait_private_2(page_folio(page));
> }
>
> /**
> @@ -82,7 +82,7 @@ static inline void wait_on_page_fscache(struct page *page)
> */
> static inline int wait_on_page_fscache_killable(struct page *page)
> {
> - return wait_on_page_private_2_killable(page);
> + return folio_wait_private_2_killable(page_folio(page));
> }
>
> enum netfs_read_source {
> diff --git a/include/linux/pagemap.h b/include/linux/pagemap.h
> index c8e74d67b01f..edf58a581bce 100644
> --- a/include/linux/pagemap.h
> +++ b/include/linux/pagemap.h
> @@ -796,9 +796,9 @@ static inline void set_page_private_2(struct page *page)
> SetPagePrivate2(page);
> }
>
> -void end_page_private_2(struct page *page);
> -void wait_on_page_private_2(struct page *page);
> -int wait_on_page_private_2_killable(struct page *page);
> +void folio_end_private_2(struct folio *folio);
> +void folio_wait_private_2(struct folio *folio);
> +int folio_wait_private_2_killable(struct folio *folio);
>
> /*
> * Add an arbitrary waiter to a page's wait queue
> diff --git a/mm/filemap.c b/mm/filemap.c
> index 1ecaece68019..a5d02ec62eb6 100644
> --- a/mm/filemap.c
> +++ b/mm/filemap.c
> @@ -1451,56 +1451,51 @@ void folio_unlock(struct folio *folio)
> EXPORT_SYMBOL(folio_unlock);
>
> /**
> - * end_page_private_2 - Clear PG_private_2 and release any waiters
> - * @page: The page
> + * folio_end_private_2 - Clear PG_private_2 and wake any waiters.
> + * @folio: The folio.
> *
> - * Clear the PG_private_2 bit on a page and wake up any sleepers waiting for
> - * this. The page ref held for PG_private_2 being set is released.
> + * Clear the PG_private_2 bit on a folio and wake up any sleepers waiting for
> + * it. The page ref held for PG_private_2 being set is released.
^ folio reference
> *
> * This is, for example, used when a netfs page is being written to a local
> * disk cache, thereby allowing writes to the cache for the same page to be
> * serialised.
> */
> -void end_page_private_2(struct page *page)
> +void folio_end_private_2(struct folio *folio)
> {
> - struct folio *folio = page_folio(page);
> -
> VM_BUG_ON_FOLIO(!folio_test_private_2(folio), folio);
> clear_bit_unlock(PG_private_2, folio_flags(folio, 0));
> folio_wake_bit(folio, PG_private_2);
> folio_put(folio);
> }
> -EXPORT_SYMBOL(end_page_private_2);
> +EXPORT_SYMBOL(folio_end_private_2);
>
> /**
> - * wait_on_page_private_2 - Wait for PG_private_2 to be cleared on a page
> - * @page: The page to wait on
> + * folio_wait_private_2 - Wait for PG_private_2 to be cleared on a page.
^ folio
> + * @folio: The folio to wait on.
> *
> - * Wait for PG_private_2 (aka PG_fscache) to be cleared on a page.
> + * Wait for PG_private_2 (aka PG_fscache) to be cleared on a folio.
> */
> -void wait_on_page_private_2(struct page *page)
> +void folio_wait_private_2(struct folio *folio)
> {
> - struct folio *folio = page_folio(page);
> -
> while (folio_test_private_2(folio))
> folio_wait_bit(folio, PG_private_2);
> }
> -EXPORT_SYMBOL(wait_on_page_private_2);
> +EXPORT_SYMBOL(folio_wait_private_2);
>
> /**
> - * wait_on_page_private_2_killable - Wait for PG_private_2 to be cleared on a page
> - * @page: The page to wait on
> + * folio_wait_private_2_killable - Wait for PG_private_2 to be cleared on a folio.
> + * @folio: The folio to wait on.
> *
> - * Wait for PG_private_2 (aka PG_fscache) to be cleared on a page or until a
> + * Wait for PG_private_2 (aka PG_fscache) to be cleared on a folio or until a
> * fatal signal is received by the calling task.
> *
> * Return:
> * - 0 if successful.
> * - -EINTR if a fatal signal was encountered.
> */
> -int wait_on_page_private_2_killable(struct page *page)
> +int folio_wait_private_2_killable(struct folio *folio)
> {
> - struct folio *folio = page_folio(page);
> int ret = 0;
>
> while (folio_test_private_2(folio)) {
> @@ -1511,7 +1506,7 @@ int wait_on_page_private_2_killable(struct page *page)
>
> return ret;
> }
> -EXPORT_SYMBOL(wait_on_page_private_2_killable);
> +EXPORT_SYMBOL(folio_wait_private_2_killable);
>
> /**
> * folio_end_writeback - End writeback against a folio.
> --
> 2.30.2
>
>
--
Sincerely yours,
Mike.
On Thu, Jul 15, 2021 at 04:35:28AM +0100, Matthew Wilcox (Oracle) wrote:
> Convert all the callers to call page_folio(). Most of them were already
> using a head page, but a few of them I can't prove were, so this may
> actually fix a bug.
>
> Signed-off-by: Matthew Wilcox (Oracle) <[email protected]>
> Reviewed-by: Christoph Hellwig <[email protected]>
> ---
> include/linux/memcontrol.h | 4 ++--
> mm/filemap.c | 2 +-
> mm/khugepaged.c | 4 ++--
> mm/memcontrol.c | 14 +++++++-------
> mm/memory-failure.c | 2 +-
> mm/memremap.c | 2 +-
> mm/page_alloc.c | 2 +-
> mm/swap.c | 2 +-
> 8 files changed, 16 insertions(+), 16 deletions(-)
Acked-by: Mike Rapoport <[email protected]>
On Thu, Jul 15, 2021 at 04:35:17AM +0100, Matthew Wilcox (Oracle) wrote:
> Match the page writeback functions by adding
> folio_start_fscache(), folio_end_fscache(), folio_wait_fscache() and
> folio_wait_fscache_killable(). Remove set_page_private_2(). Also rewrite
> the kernel-doc to describe when to use the function rather than what the
> function does, and include the kernel-doc in the appropriate rst file.
> Saves 31 bytes of text in netfs_rreq_unlock() due to set_page_fscache()
> calling page_folio() once instead of three times.
>
> Signed-off-by: Matthew Wilcox (Oracle) <[email protected]>
> Acked-by: Vlastimil Babka <[email protected]>
> Reviewed-by: William Kucharski <[email protected]>
> Reviewed-by: Christoph Hellwig <[email protected]>
> Acked-by: Kirill A. Shutemov <[email protected]>
> ---
> Documentation/filesystems/netfs_library.rst | 2 +
> include/linux/netfs.h | 75 +++++++++++++--------
> include/linux/pagemap.h | 16 -----
> 3 files changed, 50 insertions(+), 43 deletions(-)
Acked-by: Mike Rapoport <[email protected]>
On Thu, Jul 15, 2021 at 04:35:31AM +0100, Matthew Wilcox (Oracle) wrote:
> These are the folio equivalents of lock_page_memcg() and
> unlock_page_memcg().
>
> lock_page_memcg() and unlock_page_memcg() have too many callers to be
> easily replaced in a single patch, so reimplement them as wrappers for
> now to be cleaned up later when enough callers have been converted to
> use folios.
>
> Signed-off-by: Matthew Wilcox (Oracle) <[email protected]>
> Reviewed-by: Christoph Hellwig <[email protected]>
> ---
> include/linux/memcontrol.h | 10 +++++++++
> mm/memcontrol.c | 45 ++++++++++++++++++++++++--------------
> 2 files changed, 39 insertions(+), 16 deletions(-)
Acked-by: Mike Rapoport <[email protected]>
On Thu, Jul 15, 2021 at 04:35:29AM +0100, Matthew Wilcox (Oracle) wrote:
> Convert all callers of mem_cgroup_migrate() to call page_folio() first.
> They all look like they're using head pages already, but this proves it.
>
> Signed-off-by: Matthew Wilcox (Oracle) <[email protected]>
> Reviewed-by: Christoph Hellwig <[email protected]>
> ---
> include/linux/memcontrol.h | 4 ++--
> mm/filemap.c | 4 +++-
> mm/memcontrol.c | 35 +++++++++++++++++------------------
> mm/migrate.c | 4 +++-
> mm/shmem.c | 5 ++++-
> 5 files changed, 29 insertions(+), 23 deletions(-)
Acked-by: Mike Rapoport <[email protected]>
On Thu, Jul 15, 2021 at 04:35:36AM +0100, Matthew Wilcox (Oracle) wrote:
> This function already assumed it was being passed a head page. No real
> change here, except that thp_nr_pages() compiles away on kernels with
> THP compiled out while folio_nr_pages() is always present. Also convert
> page_memcg_rcu() to folio_memcg_rcu().
>
> Signed-off-by: Matthew Wilcox (Oracle) <[email protected]>
> Reviewed-by: Christoph Hellwig <[email protected]>
> ---
> include/linux/memcontrol.h | 18 +++++++++---------
> include/linux/swap.h | 2 +-
> mm/swap.c | 2 +-
> mm/workingset.c | 11 ++++-------
> 4 files changed, 15 insertions(+), 18 deletions(-)
>
> diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
> index 6511f89ad454..2dd660185bb3 100644
> --- a/include/linux/memcontrol.h
> +++ b/include/linux/memcontrol.h
> @@ -461,19 +461,19 @@ static inline struct mem_cgroup *page_memcg(struct page *page)
> }
>
> /*
> - * page_memcg_rcu - locklessly get the memory cgroup associated with a page
> - * @page: a pointer to the page struct
> + * folio_memcg_rcu - Locklessly get the memory cgroup associated with a folio.
> + * @folio: Pointer to the folio.
> *
> - * Returns a pointer to the memory cgroup associated with the page,
> - * or NULL. This function assumes that the page is known to have a
> + * Returns a pointer to the memory cgroup associated with the folio,
> + * or NULL. This function assumes that the folio is known to have a
> * proper memory cgroup pointer. It's not safe to call this function
> - * against some type of pages, e.g. slab pages or ex-slab pages.
> + * against some type of folios, e.g. slab folios or ex-slab folios.
Maybe
- * Returns a pointer to the memory cgroup associated with the page,
- * or NULL. This function assumes that the page is known to have a
+ * This function assumes that the folio is known to have a
* proper memory cgroup pointer. It's not safe to call this function
- * against some type of pages, e.g. slab pages or ex-slab pages.
+ * against some type of folios, e.g. slab folios or ex-slab folios.
+ *
+ * Return: a pointer to the memory cgroup associated with the folio,
+ * or NULL.
> */
> -static inline struct mem_cgroup *page_memcg_rcu(struct page *page)
> +static inline struct mem_cgroup *folio_memcg_rcu(struct folio *folio)
> {
> - unsigned long memcg_data = READ_ONCE(page->memcg_data);
> + unsigned long memcg_data = READ_ONCE(folio->memcg_data);
>
> - VM_BUG_ON_PAGE(PageSlab(page), page);
> + VM_BUG_ON_FOLIO(folio_test_slab(folio), folio);
> WARN_ON_ONCE(!rcu_read_lock_held());
>
> if (memcg_data & MEMCG_DATA_KMEM) {
> @@ -1129,7 +1129,7 @@ static inline struct mem_cgroup *page_memcg(struct page *page)
> return NULL;
> }
>
> -static inline struct mem_cgroup *page_memcg_rcu(struct page *page)
> +static inline struct mem_cgroup *folio_memcg_rcu(struct folio *folio)
> {
> WARN_ON_ONCE(!rcu_read_lock_held());
> return NULL;
> diff --git a/include/linux/swap.h b/include/linux/swap.h
> index 8394716a002b..989d8f78c256 100644
> --- a/include/linux/swap.h
> +++ b/include/linux/swap.h
> @@ -330,7 +330,7 @@ static inline swp_entry_t folio_swap_entry(struct folio *folio)
> void workingset_age_nonresident(struct lruvec *lruvec, unsigned long nr_pages);
> void *workingset_eviction(struct page *page, struct mem_cgroup *target_memcg);
> void workingset_refault(struct page *page, void *shadow);
> -void workingset_activation(struct page *page);
> +void workingset_activation(struct folio *folio);
>
> /* Only track the nodes of mappings with shadow entries */
> void workingset_update_node(struct xa_node *node);
> diff --git a/mm/swap.c b/mm/swap.c
> index aa9c32b714c5..85969b36b636 100644
> --- a/mm/swap.c
> +++ b/mm/swap.c
> @@ -451,7 +451,7 @@ void mark_page_accessed(struct page *page)
> else
> __lru_cache_activate_page(page);
> ClearPageReferenced(page);
> - workingset_activation(page);
> + workingset_activation(page_folio(page));
> }
> if (page_is_idle(page))
> clear_page_idle(page);
> diff --git a/mm/workingset.c b/mm/workingset.c
> index e62c0f2084a2..39bb60d50217 100644
> --- a/mm/workingset.c
> +++ b/mm/workingset.c
> @@ -392,13 +392,11 @@ void workingset_refault(struct page *page, void *shadow)
>
> /**
> * workingset_activation - note a page activation
> - * @page: page that is being activated
> + * @folio: Folio that is being activated.
> */
> -void workingset_activation(struct page *page)
> +void workingset_activation(struct folio *folio)
> {
> - struct folio *folio = page_folio(page);
> struct mem_cgroup *memcg;
> - struct lruvec *lruvec;
>
> rcu_read_lock();
> /*
> @@ -408,11 +406,10 @@ void workingset_activation(struct page *page)
> * XXX: See workingset_refault() - this should return
> * root_mem_cgroup even for !CONFIG_MEMCG.
> */
> - memcg = page_memcg_rcu(page);
> + memcg = folio_memcg_rcu(folio);
> if (!mem_cgroup_disabled() && !memcg)
> goto out;
> - lruvec = folio_lruvec(folio);
> - workingset_age_nonresident(lruvec, thp_nr_pages(page));
> + workingset_age_nonresident(folio_lruvec(folio), folio_nr_pages(folio));
> out:
> rcu_read_unlock();
> }
> --
> 2.30.2
>
>
--
Sincerely yours,
Mike.
On Thu, Jul 15, 2021 at 04:35:33AM +0100, Matthew Wilcox (Oracle) wrote:
> This replaces mem_cgroup_page_lruvec(). All callers converted.
>
> Signed-off-by: Matthew Wilcox (Oracle) <[email protected]>
> Reviewed-by: Christoph Hellwig <[email protected]>
> ---
> include/linux/memcontrol.h | 20 +++++++++-----------
> mm/compaction.c | 2 +-
> mm/memcontrol.c | 9 ++++++---
> mm/swap.c | 3 ++-
> mm/workingset.c | 3 ++-
> 5 files changed, 20 insertions(+), 17 deletions(-)
Acked-by: Mike Rapoport <[email protected]>
On Thu, Jul 15, 2021 at 04:35:37AM +0100, Matthew Wilcox (Oracle) wrote:
> This is the folio equivalent of page_to_pfn().
>
> Signed-off-by: Matthew Wilcox (Oracle) <[email protected]>
> ---
> include/linux/mm.h | 14 ++++++++++++++
> 1 file changed, 14 insertions(+)
Acked-by: Mike Rapoport <[email protected]>
On Thu, Jul 15, 2021 at 04:35:40AM +0100, Matthew Wilcox (Oracle) wrote:
> This allows us to map a portion of a folio. Callers can only expect
> to access up to the next page boundary.
>
> Signed-off-by: Matthew Wilcox (Oracle) <[email protected]>
> Reviewed-by: Christoph Hellwig <[email protected]>
> ---
> include/linux/highmem-internal.h | 11 +++++++++
> include/linux/highmem.h | 38 ++++++++++++++++++++++++++++++++
> 2 files changed, 49 insertions(+)
>
> diff --git a/include/linux/highmem-internal.h b/include/linux/highmem-internal.h
> index 7902c7d8b55f..d5d6f930ae1d 100644
> --- a/include/linux/highmem-internal.h
> +++ b/include/linux/highmem-internal.h
> @@ -73,6 +73,12 @@ static inline void *kmap_local_page(struct page *page)
> return __kmap_local_page_prot(page, kmap_prot);
> }
>
> +static inline void *kmap_local_folio(struct folio *folio, size_t offset)
> +{
> + struct page *page = folio_page(folio, offset / PAGE_SIZE);
> + return __kmap_local_page_prot(page, kmap_prot) + offset % PAGE_SIZE;
> +}
> +
> static inline void *kmap_local_page_prot(struct page *page, pgprot_t prot)
> {
> return __kmap_local_page_prot(page, prot);
> @@ -160,6 +166,11 @@ static inline void *kmap_local_page(struct page *page)
> return page_address(page);
> }
>
> +static inline void *kmap_local_folio(struct folio *folio, size_t offset)
> +{
> + return page_address(&folio->page) + offset;
> +}
> +
> static inline void *kmap_local_page_prot(struct page *page, pgprot_t prot)
> {
> return kmap_local_page(page);
> diff --git a/include/linux/highmem.h b/include/linux/highmem.h
> index 8c6e8e996c87..85de3bd0b47d 100644
> --- a/include/linux/highmem.h
> +++ b/include/linux/highmem.h
> @@ -96,6 +96,44 @@ static inline void kmap_flush_unused(void);
> */
> static inline void *kmap_local_page(struct page *page);
>
> +/**
> + * kmap_local_folio - Map a page in this folio for temporary usage
> + * @folio: The folio to be mapped.
> + * @offset: The byte offset within the folio.
> + *
> + * Returns: The virtual address of the mapping
> + *
> + * Can be invoked from any context.
Context: Can be invoked from any context.
> + *
> + * Requires careful handling when nesting multiple mappings because the map
> + * management is stack based. The unmap has to be in the reverse order of
> + * the map operation:
> + *
> + * addr1 = kmap_local_folio(page1, offset1);
> + * addr2 = kmap_local_folio(page2, offset2);
Please s/page/folio/g here and in the description below
> + * ...
> + * kunmap_local(addr2);
> + * kunmap_local(addr1);
> + *
> + * Unmapping addr1 before addr2 is invalid and causes malfunction.
> + *
> + * Contrary to kmap() mappings the mapping is only valid in the context of
> + * the caller and cannot be handed to other contexts.
> + *
> + * On CONFIG_HIGHMEM=n kernels and for low memory pages this returns the
> + * virtual address of the direct mapping. Only real highmem pages are
> + * temporarily mapped.
> + *
> + * While it is significantly faster than kmap() for the higmem case it
> + * comes with restrictions about the pointer validity. Only use when really
> + * necessary.
> + *
> + * On HIGHMEM enabled systems mapping a highmem page has the side effect of
> + * disabling migration in order to keep the virtual address stable across
> + * preemption. No caller of kmap_local_folio() can rely on this side effect.
> + */
> +static inline void *kmap_local_folio(struct folio *folio, size_t offset);
> +
> /**
> * kmap_atomic - Atomically map a page for temporary usage - Deprecated!
> * @page: Pointer to the page to be mapped
> --
> 2.30.2
>
>
--
Sincerely yours,
Mike.
On Wed, Jul 21, 2021 at 11:39:15AM +0300, Mike Rapoport wrote:
> On Wed, Jul 21, 2021 at 05:08:44AM +0100, Matthew Wilcox wrote:
> > I wanted to turn those last two sentences into a list, but my
> > kernel-doc-fu abandoned me. Feel free to submit a follow-on patch to
> > fix that ;-)
>
> Here it is ;-)
Did you try it? Here's what that turns into with htmldoc:
Description
We would like to get this info without a page flag, but the state needs
to survive until the folio is last deleted from the LRU, which could be
as far down as __page_cache_release.
* 1 if folio is a regular filesystem backed page cache folio or a
lazily freed anonymous folio (e.g. via MADV_FREE).
* 0 if folio is a normal anonymous folio, a tmpfs folio or otherwise
ram or swap backed folio.
Return
An integer (not a boolean!) used to sort a folio onto the right LRU list
and to account folios correctly.
Yes, we get a bulleted list, but it's placed in the wrong section!
Adding linux-doc for additional insight into this problem.
For their reference, here's the input:
/**
* folio_is_file_lru - Should the folio be on a file LRU or anon LRU?
* @folio: The folio to test.
*
* We would like to get this info without a page flag, but the state
* needs to survive until the folio is last deleted from the LRU, which
* could be as far down as __page_cache_release.
*
* Return: An integer (not a boolean!) used to sort a folio onto the
* right LRU list and to account folios correctly.
*
* - 1 if @folio is a regular filesystem backed page cache folio
* or a lazily freed anonymous folio (e.g. via MADV_FREE).
* - 0 if @folio is a normal anonymous folio, a tmpfs folio or otherwise
* ram or swap backed folio.
*/
static inline int folio_is_file_lru(struct folio *folio)
On Wed, Jul 21, 2021 at 12:58:24PM +0300, Mike Rapoport wrote:
> > +/**
> > + * kmap_local_folio - Map a page in this folio for temporary usage
> > + * @folio: The folio to be mapped.
> > + * @offset: The byte offset within the folio.
> > + *
> > + * Returns: The virtual address of the mapping
> > + *
> > + * Can be invoked from any context.
>
> Context: Can be invoked from any context.
>
> > + *
> > + * Requires careful handling when nesting multiple mappings because the map
> > + * management is stack based. The unmap has to be in the reverse order of
> > + * the map operation:
> > + *
> > + * addr1 = kmap_local_folio(page1, offset1);
> > + * addr2 = kmap_local_folio(page2, offset2);
>
> Please s/page/folio/g here and in the description below
>
> > + * ...
> > + * kunmap_local(addr2);
> > + * kunmap_local(addr1);
> > + *
> > + * Unmapping addr1 before addr2 is invalid and causes malfunction.
> > + *
> > + * Contrary to kmap() mappings the mapping is only valid in the context of
> > + * the caller and cannot be handed to other contexts.
> > + *
> > + * On CONFIG_HIGHMEM=n kernels and for low memory pages this returns the
> > + * virtual address of the direct mapping. Only real highmem pages are
> > + * temporarily mapped.
> > + *
> > + * While it is significantly faster than kmap() for the higmem case it
> > + * comes with restrictions about the pointer validity. Only use when really
> > + * necessary.
> > + *
> > + * On HIGHMEM enabled systems mapping a highmem page has the side effect of
> > + * disabling migration in order to keep the virtual address stable across
> > + * preemption. No caller of kmap_local_folio() can rely on this side effect.
> > + */
kmap_local_folio() only maps one page from the folio. So it's not
appropriate to s/page/folio/g. I fiddled with the description a bit to
make this clearer:
/**
* kmap_local_folio - Map a page in this folio for temporary usage
- * @folio: The folio to be mapped.
- * @offset: The byte offset within the folio.
- *
- * Returns: The virtual address of the mapping
- *
- * Can be invoked from any context.
+ * @folio: The folio containing the page.
+ * @offset: The byte offset within the folio which identifies the page.
*
* Requires careful handling when nesting multiple mappings because the map
* management is stack based. The unmap has to be in the reverse order of
* the map operation:
*
- * addr1 = kmap_local_folio(page1, offset1);
- * addr2 = kmap_local_folio(page2, offset2);
+ * addr1 = kmap_local_folio(folio1, offset1);
+ * addr2 = kmap_local_folio(folio2, offset2);
* ...
* kunmap_local(addr2);
* kunmap_local(addr1);
@@ -131,6 +127,9 @@ static inline void *kmap_local_page(struct page *page);
* On HIGHMEM enabled systems mapping a highmem page has the side effect of
* disabling migration in order to keep the virtual address stable across
* preemption. No caller of kmap_local_folio() can rely on this side effect.
+ *
+ * Context: Can be invoked from any context.
+ * Return: The virtual address of @offset.
*/
static inline void *kmap_local_folio(struct folio *folio, size_t offset);
On Wed, Jul 21, 2021 at 03:12:03PM +0100, Matthew Wilcox wrote:
> On Wed, Jul 21, 2021 at 12:58:24PM +0300, Mike Rapoport wrote:
> > > +/**
> > > + * kmap_local_folio - Map a page in this folio for temporary usage
> > > + * @folio: The folio to be mapped.
> > > + * @offset: The byte offset within the folio.
> > > + *
> > > + * Returns: The virtual address of the mapping
> > > + *
> > > + * Can be invoked from any context.
> >
> > Context: Can be invoked from any context.
> >
> > > + *
> > > + * Requires careful handling when nesting multiple mappings because the map
> > > + * management is stack based. The unmap has to be in the reverse order of
> > > + * the map operation:
> > > + *
> > > + * addr1 = kmap_local_folio(page1, offset1);
> > > + * addr2 = kmap_local_folio(page2, offset2);
> >
> > Please s/page/folio/g here and in the description below
> >
> > > + * ...
> > > + * kunmap_local(addr2);
> > > + * kunmap_local(addr1);
> > > + *
> > > + * Unmapping addr1 before addr2 is invalid and causes malfunction.
> > > + *
> > > + * Contrary to kmap() mappings the mapping is only valid in the context of
> > > + * the caller and cannot be handed to other contexts.
> > > + *
> > > + * On CONFIG_HIGHMEM=n kernels and for low memory pages this returns the
> > > + * virtual address of the direct mapping. Only real highmem pages are
> > > + * temporarily mapped.
> > > + *
> > > + * While it is significantly faster than kmap() for the higmem case it
> > > + * comes with restrictions about the pointer validity. Only use when really
> > > + * necessary.
> > > + *
> > > + * On HIGHMEM enabled systems mapping a highmem page has the side effect of
> > > + * disabling migration in order to keep the virtual address stable across
> > > + * preemption. No caller of kmap_local_folio() can rely on this side effect.
> > > + */
>
> kmap_local_folio() only maps one page from the folio. So it's not
> appropriate to s/page/folio/g. I fiddled with the description a bit to
> make this clearer:
>
> /**
> * kmap_local_folio - Map a page in this folio for temporary usage
> - * @folio: The folio to be mapped.
> - * @offset: The byte offset within the folio.
> - *
> - * Returns: The virtual address of the mapping
> - *
> - * Can be invoked from any context.
> + * @folio: The folio containing the page.
> + * @offset: The byte offset within the folio which identifies the page.
> *
> * Requires careful handling when nesting multiple mappings because the map
> * management is stack based. The unmap has to be in the reverse order of
> * the map operation:
> *
> - * addr1 = kmap_local_folio(page1, offset1);
> - * addr2 = kmap_local_folio(page2, offset2);
> + * addr1 = kmap_local_folio(folio1, offset1);
> + * addr2 = kmap_local_folio(folio2, offset2);
> * ...
> * kunmap_local(addr2);
> * kunmap_local(addr1);
> @@ -131,6 +127,9 @@ static inline void *kmap_local_page(struct page *page);
> * On HIGHMEM enabled systems mapping a highmem page has the side effect of
> * disabling migration in order to keep the virtual address stable across
> * preemption. No caller of kmap_local_folio() can rely on this side effect.
> + *
> + * Context: Can be invoked from any context.
> + * Return: The virtual address of @offset.
> */
> static inline void *kmap_local_folio(struct folio *folio, size_t offset)
This is clearer, thanks!
Maybe just add page to Return: description:
* Return: The virtual address of page @offset.
--
Sincerely yours,
Mike.
On Wed, Jul 21, 2021 at 12:23:09PM +0100, Matthew Wilcox wrote:
> On Wed, Jul 21, 2021 at 11:39:15AM +0300, Mike Rapoport wrote:
> > On Wed, Jul 21, 2021 at 05:08:44AM +0100, Matthew Wilcox wrote:
> > > I wanted to turn those last two sentences into a list, but my
> > > kernel-doc-fu abandoned me. Feel free to submit a follow-on patch to
> > > fix that ;-)
> >
> > Here it is ;-)
>
> Did you try it? Here's what that turns into with htmldoc:
Yes, but I was so happy to see bullets that I missed the fact they are in
the wrong section :(
> Description
>
> We would like to get this info without a page flag, but the state needs
> to survive until the folio is last deleted from the LRU, which could be
> as far down as __page_cache_release.
>
> * 1 if folio is a regular filesystem backed page cache folio or a
> lazily freed anonymous folio (e.g. via MADV_FREE).
> * 0 if folio is a normal anonymous folio, a tmpfs folio or otherwise
> ram or swap backed folio.
>
> Return
>
> An integer (not a boolean!) used to sort a folio onto the right LRU list
> and to account folios correctly.
>
> Yes, we get a bulleted list, but it's placed in the wrong section!
>
> Adding linux-doc for additional insight into this problem.
> For their reference, here's the input:
>
> /**
> * folio_is_file_lru - Should the folio be on a file LRU or anon LRU?
> * @folio: The folio to test.
> *
> * We would like to get this info without a page flag, but the state
> * needs to survive until the folio is last deleted from the LRU, which
> * could be as far down as __page_cache_release.
> *
> * Return: An integer (not a boolean!) used to sort a folio onto the
> * right LRU list and to account folios correctly.
> *
> * - 1 if @folio is a regular filesystem backed page cache folio
> * or a lazily freed anonymous folio (e.g. via MADV_FREE).
> * - 0 if @folio is a normal anonymous folio, a tmpfs folio or otherwise
> * ram or swap backed folio.
> */
> static inline int folio_is_file_lru(struct folio *folio)
Hmm, there is some contradiction between kernel-doc's assumption that
anything after a blank line belongs to the default (i.e. Description)
section and Sphinx's ideas about where blank lines should go:
if ($state == STATE_BODY_WITH_BLANK_LINE && /^\s*\*\s?\S/) {
dump_section($file, $section, $contents);
$section = $section_default;
$new_start_line = $.;
$contents = "";
}
(from scripts/kernel-doc::process_body())
--
Sincerely yours,
Mike.
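Given the blank-line rule quoted above, one way to keep the bullets inside
the Return section would presumably be to drop the blank line between the
Return: text and the list, roughly like this (an untested sketch of the
kernel-doc comment, not something that was posted in this thread):

/**
 * folio_is_file_lru - Should the folio be on a file LRU or anon LRU?
 * @folio: The folio to test.
 *
 * We would like to get this info without a page flag, but the state
 * needs to survive until the folio is last deleted from the LRU, which
 * could be as far down as __page_cache_release.
 *
 * Return: An integer (not a boolean!) used to sort a folio onto the
 * right LRU list and to account folios correctly.
 * - 1 if @folio is a regular filesystem backed page cache folio
 *   or a lazily freed anonymous folio (e.g. via MADV_FREE).
 * - 0 if @folio is a normal anonymous folio, a tmpfs folio or otherwise
 *   ram or swap backed folio.
 */
static inline int folio_is_file_lru(struct folio *folio)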
On Wed, Jul 21, 2021 at 04:42:36AM +0100, Matthew Wilcox wrote:
> On Tue, Jul 20, 2021 at 01:40:05PM +0300, Mike Rapoport wrote:
> > > +/**
> > > + * folio_shift - The number of bits covered by this folio.
> >
> > For me this sounds like the size of the folio in bits.
> > Maybe just repeat "The base-2 logarithm of the size of this folio" here and
> > in return description?
> >
> > > + * @folio: The folio.
> > > + *
> > > + * A folio contains a number of bytes which is a power-of-two in size.
> > > + * This function tells you which power-of-two the folio is.
> > > + *
> > > + * Context: The caller should have a reference on the folio to prevent
> > > + * it from being split. It is not necessary for the folio to be locked.
> > > + * Return: The base-2 logarithm of the size of this folio.
> > > + */
>
> I've gone with:
>
> /**
> - * folio_shift - The number of bits covered by this folio.
> + * folio_shift - The size of the memory described by this folio.
> * @folio: The folio.
> *
> - * A folio contains a number of bytes which is a power-of-two in size.
> - * This function tells you which power-of-two the folio is.
> + * A folio represents a number of bytes which is a power-of-two in size.
> + * This function tells you which power-of-two the folio is. See also
> + * folio_size() and folio_order().
> *
> * Context: The caller should have a reference on the folio to prevent
> * it from being split. It is not necessary for the folio to be locked.
>
I like it. :)
--D
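To make the relationship between the size helpers concrete, here is a
minimal sketch assuming the folio_order(), folio_shift() and folio_size()
accessors from this series (the helper name below is illustrative only):

/* Sketch: the three size accessors describe the same power-of-two.
 * folio_shift() is PAGE_SHIFT plus the allocation order, and
 * folio_size() is the corresponding number of bytes.
 */
static inline void folio_size_relations(struct folio *folio)
{
	unsigned int order = folio_order(folio);	/* 0 for a single page */
	unsigned int shift = folio_shift(folio);	/* PAGE_SHIFT + order */
	size_t size = folio_size(folio);		/* PAGE_SIZE << order */

	VM_BUG_ON_FOLIO(shift != PAGE_SHIFT + order, folio);
	VM_BUG_ON_FOLIO(size != (1UL << shift), folio);
}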
On Wed, Jul 21, 2021 at 05:22:16PM +0300, Mike Rapoport wrote:
> On Wed, Jul 21, 2021 at 03:12:03PM +0100, Matthew Wilcox wrote:
> > On Wed, Jul 21, 2021 at 12:58:24PM +0300, Mike Rapoport wrote:
> > > > +/**
> > > > + * kmap_local_folio - Map a page in this folio for temporary usage
> > > > + * @folio: The folio to be mapped.
> > > > + * @offset: The byte offset within the folio.
> > > > + *
> > > > + * Returns: The virtual address of the mapping
> > > > + *
> > > > + * Can be invoked from any context.
> > >
> > > Context: Can be invoked from any context.
> > >
> > > > + *
> > > > + * Requires careful handling when nesting multiple mappings because the map
> > > > + * management is stack based. The unmap has to be in the reverse order of
> > > > + * the map operation:
> > > > + *
> > > > + * addr1 = kmap_local_folio(page1, offset1);
> > > > + * addr2 = kmap_local_folio(page2, offset2);
> > >
> > > Please s/page/folio/g here and in the description below
> > >
> > > > + * ...
> > > > + * kunmap_local(addr2);
> > > > + * kunmap_local(addr1);
> > > > + *
> > > > + * Unmapping addr1 before addr2 is invalid and causes malfunction.
> > > > + *
> > > > + * Contrary to kmap() mappings the mapping is only valid in the context of
> > > > + * the caller and cannot be handed to other contexts.
> > > > + *
> > > > + * On CONFIG_HIGHMEM=n kernels and for low memory pages this returns the
> > > > + * virtual address of the direct mapping. Only real highmem pages are
> > > > + * temporarily mapped.
> > > > + *
> > > > + * While it is significantly faster than kmap() for the higmem case it
> > > > + * comes with restrictions about the pointer validity. Only use when really
> > > > + * necessary.
> > > > + *
> > > > + * On HIGHMEM enabled systems mapping a highmem page has the side effect of
> > > > + * disabling migration in order to keep the virtual address stable across
> > > > + * preemption. No caller of kmap_local_folio() can rely on this side effect.
> > > > + */
> >
> > kmap_local_folio() only maps one page from the folio. So it's not
> > appropriate to s/page/folio/g. I fiddled with the description a bit to
> > make this clearer:
> >
> > /**
> > * kmap_local_folio - Map a page in this folio for temporary usage
> > - * @folio: The folio to be mapped.
> > - * @offset: The byte offset within the folio.
> > - *
> > - * Returns: The virtual address of the mapping
> > - *
> > - * Can be invoked from any context.
> > + * @folio: The folio containing the page.
> > + * @offset: The byte offset within the folio which identifies the page.
> > *
> > * Requires careful handling when nesting multiple mappings because the map
> > * management is stack based. The unmap has to be in the reverse order of
> > * the map operation:
> > *
> > - * addr1 = kmap_local_folio(page1, offset1);
> > - * addr2 = kmap_local_folio(page2, offset2);
> > + * addr1 = kmap_local_folio(folio1, offset1);
> > + * addr2 = kmap_local_folio(folio2, offset2);
> > * ...
> > * kunmap_local(addr2);
> > * kunmap_local(addr1);
> > @@ -131,6 +127,9 @@ static inline void *kmap_local_page(struct page *page);
> > * On HIGHMEM enabled systems mapping a highmem page has the side effect of
> > * disabling migration in order to keep the virtual address stable across
> > * preemption. No caller of kmap_local_folio() can rely on this side effect.
> > + *
> > + * Context: Can be invoked from any context.
> > + * Return: The virtual address of @offset.
> > */
> > static inline void *kmap_local_folio(struct folio *folio, size_t offset)
>
> This is clearer, thanks!
>
> Maybe just add page to Return: description:
>
> * Return: The virtual address of page @offset.
No, it really does return the virtual address of @offset. If you ask
for offset 0x1234 within a (sufficiently large) folio, it will map the
second page of that folio and return the address of the 0x234'th byte
within it.
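To make those byte-offset semantics concrete, here is a minimal usage
sketch built on the kmap_local_folio()/kunmap_local() API added by this
patch (the copy_from_folio() helper itself is illustrative and not part of
the series):

/* Copy 'len' bytes out of a folio starting at byte 'offset'.  Each
 * kmap_local_folio() call maps only the single page containing
 * 'offset', so the copy is chunked at page boundaries and each
 * mapping is dropped before the next one is taken.
 */
static void copy_from_folio(void *dst, struct folio *folio,
			    size_t offset, size_t len)
{
	while (len) {
		size_t chunk = min_t(size_t, len,
				     PAGE_SIZE - offset_in_page(offset));
		void *src = kmap_local_folio(folio, offset);

		memcpy(dst, src, chunk);
		kunmap_local(src);

		dst += chunk;
		offset += chunk;
		len -= chunk;
	}
}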
On 15.07.2021 06:35, Matthew Wilcox (Oracle) wrote:
> This is the folio equivalent of migrate_page_copy(), which is retained
> as a wrapper for filesystems which are not yet converted to folios.
> Also convert copy_huge_page() to folio_copy().
>
> Signed-off-by: Matthew Wilcox (Oracle) <[email protected]>
> ---
> include/linux/migrate.h | 1 +
> include/linux/mm.h | 2 +-
> mm/folio-compat.c | 6 ++++++
> mm/hugetlb.c | 2 +-
> mm/migrate.c | 14 +++++---------
> mm/util.c | 6 +++---
> 6 files changed, 17 insertions(+), 14 deletions(-)
Hi,
I'm getting warnings that might be related to this patch.
[37020.191023] BUG: sleeping function called from invalid context at mm/util.c:761
[37020.191383] in_atomic(): 1, irqs_disabled(): 0, non_block: 0, pid: 29, name: kcompactd0
[37020.191550] CPU: 1 PID: 29 Comm: kcompactd0 Tainted: G W 5.14.0-rc2-next-20210721-00201-g393e9d2093a1 #8880
[37020.191576] Hardware name: NVIDIA Tegra SoC (Flattened Device Tree)
[37020.191599] [<c010ce15>] (unwind_backtrace) from [<c0108fd5>] (show_stack+0x11/0x14)
[37020.191667] [<c0108fd5>] (show_stack) from [<c0a74b1f>] (dump_stack_lvl+0x2b/0x34)
[37020.191724] [<c0a74b1f>] (dump_stack_lvl) from [<c0141a41>] (___might_sleep+0xed/0x11c)
[37020.191779] [<c0141a41>] (___might_sleep) from [<c0241e07>] (folio_copy+0x3f/0x84)
[37020.191817] [<c0241e07>] (folio_copy) from [<c027a7b1>] (folio_migrate_copy+0x11/0x1c)
[37020.191856] [<c027a7b1>] (folio_migrate_copy) from [<c027ab65>] (__buffer_migrate_page.part.0+0x215/0x238)
[37020.191891] [<c027ab65>] (__buffer_migrate_page.part.0) from [<c027b73d>] (buffer_migrate_page_norefs+0x19/0x28)
[37020.191927] [<c027b73d>] (buffer_migrate_page_norefs) from [<c027affd>] (move_to_new_page+0x4d/0x200)
[37020.191960] [<c027affd>] (move_to_new_page) from [<c027bc91>] (migrate_pages+0x521/0x72c)
[37020.191993] [<c027bc91>] (migrate_pages) from [<c024dbc1>] (compact_zone+0x589/0xb60)
[37020.192031] [<c024dbc1>] (compact_zone) from [<c024e1eb>] (proactive_compact_node+0x53/0x6c)
[37020.192064] [<c024e1eb>] (proactive_compact_node) from [<c024e713>] (kcompactd+0x20b/0x238)
[37020.192096] [<c024e713>] (kcompactd) from [<c013b987>] (kthread+0x123/0x140)
[37020.192134] [<c013b987>] (kthread) from [<c0100155>] (ret_from_fork+0x11/0x1c)
[37020.192164] Exception stack(0xc1751fb0 to 0xc1751ff8)
On Thu, Jul 22, 2021 at 02:52:28PM +0300, Dmitry Osipenko wrote:
> I'm getting warnings that might be related to this patch.
Thank you! This is a good report. I've trimmed away some of the
unnecessary bits from below:
> BUG: sleeping function called from invalid context at mm/util.c:761
> in_atomic(): 1, irqs_disabled(): 0, non_block: 0, pid: 29, name: kcompactd0
This is absolutely a result of this patch:
for (i = 0; i < nr; i++) {
cond_resched();
copy_highpage(folio_page(dst, i), folio_page(src, i));
}
cond_resched() can sleep, of course. This is new; previously only
copying huge pages would call cond_resched(). Now every page copy
calls cond_resched().
> (___might_sleep) from (folio_copy+0x3f/0x84)
> (folio_copy) from (folio_migrate_copy+0x11/0x1c)
> (folio_migrate_copy) from (__buffer_migrate_page.part.0+0x215/0x238)
> (__buffer_migrate_page.part.0) from (buffer_migrate_page_norefs+0x19/0x28)
__buffer_migrate_page() is where we become atomic:
if (check_refs)
spin_lock(&mapping->private_lock);
...
migrate_page_copy(newpage, page);
...
if (check_refs)
spin_unlock(&mapping->private_lock);
> (buffer_migrate_page_norefs) from (move_to_new_page+0x4d/0x200)
> (move_to_new_page) from (migrate_pages+0x521/0x72c)
> (migrate_pages) from (compact_zone+0x589/0xb60)
The obvious solution is just to change folio_copy():
{
- unsigned i, nr = folio_nr_pages(src);
+ unsigned i = 0;
+ unsigned nr = folio_nr_pages(src);
- for (i = 0; i < nr; i++) {
- cond_resched();
+ for (;;) {
copy_highpage(folio_page(dst, i), folio_page(src, i));
+ if (i++ == nr)
+ break;
+ cond_resched();
}
}
now it only calls cond_resched() for multi-page folios.
But that leaves us with a bit of an ... impediment to using multi-page
folios for buffer-head based filesystems (and block devices). I must
admit to not knowing the buffer_head locking scheme quite as well as
I would like to. Is it possible to drop this spinlock earlier?
On 22.07.2021 15:29, Matthew Wilcox wrote:
> On Thu, Jul 22, 2021 at 02:52:28PM +0300, Dmitry Osipenko wrote:
...
> The obvious solution is just to change folio_copy():
>
> {
> - unsigned i, nr = folio_nr_pages(src);
> + unsigned i = 0;
> + unsigned nr = folio_nr_pages(src);
>
> - for (i = 0; i < nr; i++) {
> - cond_resched();
> + for (;;) {
> copy_highpage(folio_page(dst, i), folio_page(src, i));
> + if (i++ == nr)
This works once i++ is changed to the ++i pre-increment. Thanks!
> + break;
> + cond_resched();
> }
> }
>
> now it only calls cond_resched() for multi-page folios.
...
Thank you for the explanation and for the fix!
The fs/ and mm/ are mostly outside of my scope, hope you'll figure out
the buffer-head case soon.
On Thu, Jul 22, 2021 at 04:45:59PM +0300, Dmitry Osipenko wrote:
> On 22.07.2021 15:29, Matthew Wilcox wrote:
> > On Thu, Jul 22, 2021 at 02:52:28PM +0300, Dmitry Osipenko wrote:
> ...
> > The obvious solution is just to change folio_copy():
> >
> > {
> > - unsigned i, nr = folio_nr_pages(src);
> > + unsigned i = 0;
> > + unsigned nr = folio_nr_pages(src);
> >
> > - for (i = 0; i < nr; i++) {
> > - cond_resched();
> > + for (;;) {
> > copy_highpage(folio_page(dst, i), folio_page(src, i));
> > + if (i++ == nr)
>
> This works with the ++i precedence change. Thanks!
Thanks for testing! (and fixing my bug)
I just pushed out an update to for-next with this fix.
> The fs/ and mm/ are mostly outside of my scope, hope you'll figure out
> the buffer-head case soon.
Thanks. We don't need it fixed yet, but probably in the next six months.
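For reference, with the pre-increment fix folded in, the corrected loop
presumably ends up looking something like this sketch (the switch to
signed 'long' counters comes from a later fixup in this thread), which
only calls cond_resched() between the pages of a multi-page folio:

void folio_copy(struct folio *dst, struct folio *src)
{
	long i = 0;
	long nr = folio_nr_pages(src);

	for (;;) {
		copy_highpage(folio_page(dst, i), folio_page(src, i));
		if (++i == nr)
			break;
		/* Not reached for single-page folios, so the atomic
		 * buffer_head path above stays safe for those. */
		cond_resched();
	}
}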
On Thu, Jul 15, 2021 at 04:34:46AM +0100, Matthew Wilcox (Oracle) wrote:
> Managing memory in 4KiB pages is a serious overhead. Many benchmarks
> benefit from a larger "page size". As an example, an earlier iteration
> of this idea which used compound pages (and wasn't particularly tuned)
> got a 7% performance boost when compiling the kernel.
I want to thank Michael Larabel for his benchmarking effort:
https://www.phoronix.com/scan.php?page=news_item&px=Folios-v14-Testing-AMD-Linux
I'm not too surprised by the lack of performance change on the majority
of benchmarks. This patch series is only going to change things for
heavy users of the page cache (ie it'll do nothing for anon memory users),
and it's only really a benefit for programs that have good locality.
What blows me away is the 80% performance improvement for PostgreSQL.
I know they use the page cache extensively, so it's plausibly real.
I'm a bit surprised that it has such good locality, and the size of the
win far exceeds my expectations. We should probably dive into it and
figure out exactly what's going on.
Should we accelerate inclusion of this patchset? Right now, I have
89 mm patches queued up for the 5.15 merge window. My plan was to get
the 17 iomap + block patches, plus another 18 page cache patches into
5.16 and then get the 14 multi-page folio patches into 5.17. But I'm
mindful of the longterm release coming up "soon", and I'm not sure we're
best served by multiple distros trying to backport the multi-page folio
patches to either 5.15 or 5.16.
On Sat, 2021-07-24 at 18:27 +0100, Matthew Wilcox wrote:
> What blows me away is the 80% performance improvement for PostgreSQL.
> I know they use the page cache extensively, so it's plausibly real.
> I'm a bit surprised that it has such good locality, and the size of
> the win far exceeds my expectations. We should probably dive into it
> and figure out exactly what's going on.
Since none of the other tested databases showed more than a 3%
improvement, this looks like an anomalous result specific to something
in postgres ... although the next biggest db: mariadb wasn't part of
the tests so I'm not sure that's definitive. Perhaps the next step
should be to test mariadb? Since they're fairly similar in domain
(both full SQL) if mariadb shows this type of improvement, you can
safely assume it's something in the way SQL databases handle paging and
if it doesn't, it's likely fixing a postgres inefficiency.
James
On Sat, Jul 24, 2021 at 11:09:02AM -0700, James Bottomley wrote:
> On Sat, 2021-07-24 at 18:27 +0100, Matthew Wilcox wrote:
> > What blows me away is the 80% performance improvement for PostgreSQL.
> > I know they use the page cache extensively, so it's plausibly real.
> > I'm a bit surprised that it has such good locality, and the size of
> > the win far exceeds my expectations. We should probably dive into it
> > and figure out exactly what's going on.
>
> Since none of the other tested databases showed more than a 3%
> improvement, this looks like an anomalous result specific to something
> in postgres ... although the next biggest db: mariadb wasn't part of
> the tests so I'm not sure that's definitive. Perhaps the next step
> should be to test mariadb? Since they're fairly similar in domain
> (both full SQL) if mariadb shows this type of improvement, you can
> safely assume it's something in the way SQL databases handle paging and
> if it doesn't, it's likely fixing a postgres inefficiency.
I think the thing that's specific to PostgreSQL is that it's a heavy
user of the page cache. My understanding is that most databases use
direct IO and manage their own page cache, while PostgreSQL trusts
the kernel to get it right.
Regardless of whether postgres is "doing something wrong" or not,
do you not think that an 80% performance win would exert a certain
amount of pressure on distros to do the backport?
On Sat, 2021-07-24 at 19:14 +0100, Matthew Wilcox wrote:
> On Sat, Jul 24, 2021 at 11:09:02AM -0700, James Bottomley wrote:
> > On Sat, 2021-07-24 at 18:27 +0100, Matthew Wilcox wrote:
> > > What blows me away is the 80% performance improvement for
> > > PostgreSQL. I know they use the page cache extensively, so it's
> > > plausibly real. I'm a bit surprised that it has such good
> > > locality, and the size of the win far exceeds my
> > > expectations. We should probably dive into it and figure out
> > > exactly what's going on.
> >
> > Since none of the other tested databases showed more than a 3%
> > improvement, this looks like an anomalous result specific to
> > something in postgres ... although the next biggest db: mariadb
> > wasn't part of the tests so I'm not sure that's
> > > definitive. Perhaps the next step should be to
> > > test mariadb? Since they're fairly similar in domain (both full
> > SQL) if mariadb shows this type of improvement, you can
> > safely assume it's something in the way SQL databases handle paging
> > and if it doesn't, it's likely fixing a postgres inefficiency.
>
> I think the thing that's specific to PostgreSQL is that it's a heavy
> user of the page cache. My understanding is that most databases use
> direct IO and manage their own page cache, while PostgreSQL trusts
> the kernel to get it right.
That's testable with mariadb, at least for the innodb engine since the
flush_method is settable.
> Regardless of whether postgres is "doing something wrong" or not,
> do you not think that an 80% performance win would exert a certain
> amount of pressure on distros to do the backport?
Well, I cut the previous question deliberately, but if you're going to
force me to answer, my experience with storage tells me that one test
being 10x different from all the others usually indicates a problem
with the benchmark test itself rather than a baseline improvement, so
I'd wait for more data.
James
Hi,
On Sat, Jul 24, 2021, at 11:23, James Bottomley wrote:
> On Sat, 2021-07-24 at 19:14 +0100, Matthew Wilcox wrote:
> > On Sat, Jul 24, 2021 at 11:09:02AM -0700, James Bottomley wrote:
> > > On Sat, 2021-07-24 at 18:27 +0100, Matthew Wilcox wrote:
> > > > What blows me away is the 80% performance improvement for
> > > > PostgreSQL. I know they use the page cache extensively, so it's
> > > > plausibly real. I'm a bit surprised that it has such good
> > > > locality, and the size of the win far exceeds my
> > > > expectations. We should probably dive into it and figure out
> > > > exactly what's going on.
> > >
> > > Since none of the other tested databases showed more than a 3%
> > > improvement, this looks like an anomalous result specific to
> > > something in postgres ... although the next biggest db: mariadb
> > > wasn't part of the tests so I'm not sure that's
> > > > definitive. Perhaps the next step should be to
> > > > test mariadb? Since they're fairly similar in domain (both full
> > > SQL) if mariadb shows this type of improvement, you can
> > > safely assume it's something in the way SQL databases handle paging
> > > and if it doesn't, it's likely fixing a postgres inefficiency.
> >
> > I think the thing that's specific to PostgreSQL is that it's a heavy
> > user of the page cache. My understanding is that most databases use
> > direct IO and manage their own page cache, while PostgreSQL trusts
> > the kernel to get it right.
>
> That's testable with mariadb, at least for the innodb engine since the
> flush_method is settable.
>
> > Regardless of whether postgres is "doing something wrong" or not,
> > do you not think that an 80% performance win would exert a certain
> > amount of pressure on distros to do the backport?
>
> Well, I cut the previous question deliberately, but if you're going to
> force me to answer, my experience with storage tells me that one test
> being 10x different from all the others usually indicates a problem
> with the benchmark test itself rather than a baseline improvement, so
> I'd wait for more data.
I have a similar reaction - the large improvements are for a read/write pgbench benchmark at a scale that fits in memory. That's typically purely bound by the speed at which the WAL can be synced to disk. As far as I recall mariadb also uses buffered IO for WAL (but there was recent work in the area).
Is there a reason fdatasync() of 16MB files would have got a lot faster? Or a chance that it could be broken?
Some improvement for read-only wouldn't surprise me, particularly if the os/pg weren't configured for explicit huge pages. Pgbench has a uniform distribution, so it's *very* TLB-miss heavy with 4k pages.
Regards,
Andres
On Sat, Jul 24, 2021 at 11:23:25AM -0700, James Bottomley wrote:
> On Sat, 2021-07-24 at 19:14 +0100, Matthew Wilcox wrote:
> > On Sat, Jul 24, 2021 at 11:09:02AM -0700, James Bottomley wrote:
> > > On Sat, 2021-07-24 at 18:27 +0100, Matthew Wilcox wrote:
> > > > What blows me away is the 80% performance improvement for
> > > > PostgreSQL. I know they use the page cache extensively, so it's
> > > > plausibly real. I'm a bit surprised that it has such good
> > > > locality, and the size of the win far exceeds my
> > > > expectations. We should probably dive into it and figure out
> > > > exactly what's going on.
> > >
> > > Since none of the other tested databases showed more than a 3%
> > > improvement, this looks like an anomalous result specific to
> > > something in postgres ... although the next biggest db: mariadb
> > > wasn't part of the tests so I'm not sure that's
> > > > definitive. Perhaps the next step should be to
> > > > test mariadb? Since they're fairly similar in domain (both full
> > > SQL) if mariadb shows this type of improvement, you can
> > > safely assume it's something in the way SQL databases handle paging
> > > and if it doesn't, it's likely fixing a postgres inefficiency.
> >
> > I think the thing that's specific to PostgreSQL is that it's a heavy
> > user of the page cache. My understanding is that most databases use
> > direct IO and manage their own page cache, while PostgreSQL trusts
> > the kernel to get it right.
>
> That's testable with mariadb, at least for the innodb engine since the
> flush_method is settable.
We're still not communicating well. I'm not talking about writes,
I'm talking about reads. Postgres uses the page cache for reads.
InnoDB uses O_DIRECT (afaict). See articles like this one:
https://www.percona.com/blog/2018/02/08/fsync-performance-storage-devices/
: The first and most obvious type of IO are pages reads and writes from
: the tablespaces. The pages are most often read one at a time, as 16KB
: random read operations. Writes to the tablespaces are also typically
: 16KB random operations, but they are done in batches. After every batch,
: fsync is called on the tablespace file handle.
(the current folio patch set does not create multi-page folios for
writes, only for reads)
I downloaded the mariadb source package that's in Debian, and from
what I can glean, it does indeed set O_DIRECT on data files in Linux,
through os_file_set_nocache().
> > Regardless of whether postgres is "doing something wrong" or not,
> > do you not think that an 80% performance win would exert a certain
> > amount of pressure on distros to do the backport?
>
> Well, I cut the previous question deliberately, but if you're going to
> force me to answer, my experience with storage tells me that one test
> being 10x different from all the others usually indicates a problem
> with the benchmark test itself rather than a baseline improvement, so
> I'd wait for more data.
... or the two benchmarks use Linux in completely different ways such
that one sees a huge benefit while the other sees none. Which is what
you'd expect for a patchset that improves the page cache and using a
benchmark that doesn't use the page cache.
On Sat, Jul 24, 2021 at 11:45:26AM -0700, Andres Freund wrote:
> On Sat, Jul 24, 2021, at 11:23, James Bottomley wrote:
> > Well, I cut the previous question deliberately, but if you're going to
> > force me to answer, my experience with storage tells me that one test
> > being 10x different from all the others usually indicates a problem
> > with the benchmark test itself rather than a baseline improvement, so
> > I'd wait for more data.
>
> I have a similar reaction - the large improvements are for a read/write pgbench benchmark at a scale that fits in memory. That's typically purely bound by the speed at which the WAL can be synced to disk. As far as I recall mariadb also uses buffered IO for WAL (but there was recent work in the area).
>
> Is there a reason fdatasync() of 16MB files to have got a lot faster? Or a chance that could be broken?
>
> Some improvement for read-only wouldn't surprise me, particularly if the os/pg weren't configured for explicit huge pages. Pgbench has a uniform distribution so its *very* tlb miss heavy with 4k pages.
It's going to depend substantially on the access pattern. If the 16MB
file (oof, that's tiny!) was read in in large chunks or even in small
chunks, but consecutively, the folio changes will allocate larger pages
(16k, 64k, 256k, ...). Theoretically it might get up to 2MB pages and
start using PMDs, but I've never seen that in my testing.
fdatasync() could indeed have got much faster. If we're writing back a
256kB page as a unit, we're handling 64 times less metadata than writing
back 64x4kB pages. We'll track 64x less dirty bits. We'll find only
64 dirty pages per 16MB instead of 4096 dirty pages.
It's always possible I just broke something. The xfstests aren't
exhaustive, and no regressions doesn't mean no problems.
Can you guide Michael towards parameters for pgbench that might give
an indication of performance on a more realistic workload that doesn't
entirely fit in memory?
Hi,
On Sat, Jul 24, 2021, at 12:01, Matthew Wilcox wrote:
> On Sat, Jul 24, 2021 at 11:45:26AM -0700, Andres Freund wrote:
> > On Sat, Jul 24, 2021, at 11:23, James Bottomley wrote:
> > > Well, I cut the previous question deliberately, but if you're going to
> > > force me to answer, my experience with storage tells me that one test
> > > being 10x different from all the others usually indicates a problem
> > > with the benchmark test itself rather than a baseline improvement, so
> > > I'd wait for more data.
> >
> > I have a similar reaction - the large improvements are for a read/write pgbench benchmark at a scale that fits in memory. That's typically purely bound by the speed at which the WAL can be synced to disk. As far as I recall mariadb also uses buffered IO for WAL (but there was recent work in the area).
> >
> > Is there a reason fdatasync() of 16MB files to have got a lot faster? Or a chance that could be broken?
> >
> > Some improvement for read-only wouldn't surprise me, particularly if the os/pg weren't configured for explicit huge pages. Pgbench has a uniform distribution so its *very* tlb miss heavy with 4k pages.
>
> It's going to depend substantially on the access pattern. If the 16MB
> file (oof, that's tiny!) was read in in large chunks or even in small
> chunks, but consecutively, the folio changes will allocate larger pages
> (16k, 64k, 256k, ...). Theoretically it might get up to 2MB pages and
> start using PMDs, but I've never seen that in my testing.
The 16MB files are just for the WAL/journal, and are write-only in a benchmark like this. With pgbench they'll be written in small consecutive chunks (a few pages at a time, for each group commit). Each page is only written once; after a checkpoint the entire file is "recycled" (renamed into the future of the WAL stream) and reused from the start.
The data files are 1GB.
> fdatasync() could indeed have got much faster. If we're writing back a
> 256kB page as a unit, we're handling 64 times less metadata than writing
> back 64x4kB pages. We'll track 64x less dirty bits. We'll find only
> 64 dirty pages per 16MB instead of 4096 dirty pages.
The dirty writes will be 8-32k or so in this workload - the constant commits require the WAL to constantly be flushed.
> It's always possible I just broke something. The xfstests aren't
> exhaustive, and no regressions doesn't mean no problems.
>
> Can you guide Michael towards parameters for pgbench that might give
> an indication of performance on a more realistic workload that doesn't
> entirely fit in memory?
Fitting in memory isn't bad - that's a large part of real workloads. It just makes it hard to believe the performance improvement, given that we expect to be bound by disk sync speed...
Michael, where do I find more details about the configuration used during the run?
Regards,
Andres
On Sat, 2021-07-24 at 19:50 +0100, Matthew Wilcox wrote:
> On Sat, Jul 24, 2021 at 11:23:25AM -0700, James Bottomley wrote:
> > On Sat, 2021-07-24 at 19:14 +0100, Matthew Wilcox wrote:
> > > On Sat, Jul 24, 2021 at 11:09:02AM -0700, James Bottomley wrote:
> > > > On Sat, 2021-07-24 at 18:27 +0100, Matthew Wilcox wrote:
> > > > > What blows me away is the 80% performance improvement for
> > > > > PostgreSQL. I know they use the page cache extensively, so
> > > > > it's
> > > > > plausibly real. I'm a bit surprised that it has such good
> > > > > locality, and the size of the win far exceeds my
> > > > > expectations. We should probably dive into it and figure out
> > > > > exactly what's going on.
> > > >
> > > > Since none of the other tested databases showed more than a 3%
> > > > improvement, this looks like an anomalous result specific to
> > > > something in postgres ... although the next biggest db: mariadb
> > > > wasn't part of the tests so I'm not sure that's
> > > > definitive. Perhaps the next step should be to t
> > > > est mariadb? Since they're fairly similar in domain (both full
> > > > SQL) if mariadb shows this type of improvement, you can
> > > > safely assume it's something in the way SQL databases handle
> > > > paging and if it doesn't, it's likely fixing a postgres
> > > > inefficiency.
> > >
> > > I think the thing that's specific to PostgreSQL is that it's a
> > > heavy user of the page cache. My understanding is that most
> > > databases use direct IO and manage their own page cache, while
> > > PostgreSQL trusts the kernel to get it right.
> >
> > That's testable with mariadb, at least for the innodb engine since
> > the flush_method is settable.
>
> We're still not communicating well. I'm not talking about writes,
> I'm talking about reads. Postgres uses the page cache for reads.
> InnoDB uses O_DIRECT (afaict). See articles like this one:
> https://www.percona.com/blog/2018/02/08/fsync-performance-storage-devices/
If it were all about reads, wouldn't the Phoronix pgbench read-only
test have shown a better improvement than 7%? I think the Phoronix
data shows that whatever it is, it's to do with writes ... and that does
imply something in the way the log syncs data.
James
Hi,
On 2021-07-24 12:12:36 -0700, Andres Freund wrote:
> On Sat, Jul 24, 2021, at 12:01, Matthew Wilcox wrote:
> > On Sat, Jul 24, 2021 at 11:45:26AM -0700, Andres Freund wrote:
> > It's always possible I just broke something. The xfstests aren't
> > exhaustive, and no regressions doesn't mean no problems.
> >
> > Can you guide Michael towards parameters for pgbench that might give
> > an indication of performance on a more realistic workload that doesn't
> > entirely fit in memory?
>
> Fitting in memory isn't bad - that's a large part of real workloads. It just makes it hard to believe the performance improvement, given that we expect to be bound by disk sync speed...
I just tried to compare folio-14 against its baseline, testing commit 8096acd7442e
against 480552d0322d. This was in a VM, however (though at least with its memory
backed by huge pages and storage passed through). I got about a 7%
improvement with just some baseline tuning of postgres applied. I think 1-2%
of that is potentially runtime variance (I saw slightly different timings
around checkpointing that led to a slightly "unfair" advantage for the
folio run).
That's a *nice* win!
WRT the ~70% improvement:
> Michael, where do I find more details about the configuration used during the
> run?
After some digging I found https://github.com/phoronix-test-suite/phoronix-test-suite/blob/94562dd4a808637be526b639d220c7cd937e2aa1/ob-cache/test-profiles/pts/pgbench-1.10.1/install.sh
For one, the test says it's done on ext4, while I used xfs. But I think the
bigger thing is the following:
The phoronix test uses postgres with only one relevant setting adjusted
(increasing the max connection count). That will end up using a buffer pool of
128MB, no huge pages, and importantly is configured to aim for not more than
1GB for postgres' journal, which will lead to constant checkpointing. The test
also only runs for 15 seconds, which likely isn't even enough to "warm up"
(the creation of the data set here will take longer than the run).
Given that the dataset phoronix is using is about ~16GB (excluding
WAL), and that the test uses 256 concurrent clients running full tilt, those
limited postgres settings don't end up measuring anything particularly
interesting, in my opinion.
Without changing the filesystem, using a configuration more similar to
phoronix', I do get a bigger win. But the run-to-run variance is so high
(largely due to the short test duration) that I don't trust those results
much.
It does look like there's less of a slowdown due to checkpoints (i.e. fsyncing
all data files postgres modified since the last checkpoint) on the folio
branch, which does make some sense to me and would be a welcome improvement.
Greetings,
Andres Freund
On Sat, Jul 24, 2021 at 02:44:13PM -0700, Andres Freund wrote:
> The phoronix test uses postgres with only one relevant setting adjusted
> (increasing the max connection count). That will end up using a buffer pool of
> 128MB, no huge pages, and importantly is configured to aim for not more than
> 1GB for postgres' journal, which will lead to constant checkpointing. The test
> also only runs for 15 seconds, which likely isn't even enough to "warm up"
> (the creation of the data set here will take longer than the run).
>
> Given that the dataset phoronix is using is about ~16GB of data (excluding
> WAL), and uses 256 concurrent clients running full tilt, using that limited
> postgres settings doesn't end up measuring something particularly interesting
> in my opinion.
Hi Andres,
I tend to use the phoronix test suite for my performance runs when
testing ext4 changes simply because it's convenient. Can you suggest
a better set configuration settings that I should perhaps use that
might give more "real world" numbers that you would find more
significant?
Thanks,
- Ted
Hi,
On 2021-07-26 10:19:11 -0400, Theodore Ts'o wrote:
> On Sat, Jul 24, 2021 at 02:44:13PM -0700, Andres Freund wrote:
> > The phoronix test uses postgres with only one relevant setting adjusted
> > (increasing the max connection count). That will end up using a buffer pool of
> > 128MB, no huge pages, and importantly is configured to aim for not more than
> > 1GB for postgres' journal, which will lead to constant checkpointing. The test
> > also only runs for 15 seconds, which likely isn't even enough to "warm up"
> > (the creation of the data set here will take longer than the run).
> >
> > Given that the dataset phoronix is using is about ~16GB of data (excluding
> > WAL), and uses 256 concurrent clients running full tilt, using that limited
> > postgres settings doesn't end up measuring something particularly interesting
> > in my opinion.
> I tend to use the phoronix test suite for my performance runs when
> testing ext4 changes simply because it's convenient. Can you suggest
> a better set configuration settings that I should perhaps use that
> might give more "real world" numbers that you would find more
> significant?
It depends a bit on what you want to test, obviously...
At the very least you should set 'max_wal_size = 32GB' or such (it'll only
use that much if enough WAL is generated within the checkpoint timeout,
which defaults to 5min).
And unfortunately you're not going to get meaningful performance results
for a read/write test within 10s; you need to run for at least ~11min (so
that two checkpoints happen).
With the default shared_buffers setting of 128MB you are going to
simulate a much-larger-than-postgres's-memory workload, albeit one where
the page cache *is* big enough on most current machines, unless you
limit the size of the page cache considerably. Doing so can be useful to
approximate a workload that would take much longer to initialize due to
the size.
I suggest *not* disabling autovacuum as currently done for performance
testing - it's not something many real-world setups can afford to do, so
benchmarking FS performance with it disabled doesn't seem like a good
idea.
FWIW, depending on what kind of thing you want to test, it'd not be hard
to come up with a test that takes less time to initialize, e.g. an insert-only
workload without an initial dataset or such.
As long as you *do* initialize 16GB of data, I think it'd make sense to
measure the time that takes. There have definitely been filesystem-level
performance changes there, and it's often going to be more IO intensive.
Greetings,
Andres Freund
As noted by Nick, the following pattern:
unsigned int nr = ...;
mod_zone_page_state(x, y, -nr);
is buggy and will cause the page stats to be _increased_ by about 4
billion instead of decreased by a small number. I have audited the
for-next branch and found a few bugs of this type. I have also
changed a few other places to use 'long nr' so that people who
copy-and-paste don't get it wrong. Theoretically, this means that
we now support folio sizes beyond 16TB, but that would be a ridiculous
thing to do (other limitations that exist currently are that we can
only allocate folios up to order 10, and we can only perform IO on
folios smaller than 4GB). I think using 'int nr' would be safe
everywhere, but why make the compiler narrow the return value?
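To see the pitfall in isolation, here is a minimal user-space sketch
(mod_counter() is a stand-in for a kernel stats helper that takes a signed
'long' delta; the output assumes a 64-bit build):

#include <stdio.h>

/* Stand-in for a stats update that takes a signed 'long' delta. */
static void mod_counter(long *counter, long delta)
{
	*counter += delta;
}

int main(void)
{
	long counter = 1000;
	unsigned int unr = 16;
	long nr = 16;

	/* -unr is evaluated as 'unsigned int', wrapping to 4294967280,
	 * which is then converted to long: the counter goes *up*. */
	mod_counter(&counter, -unr);
	printf("unsigned int nr: %ld\n", counter);	/* 4294968280 */

	counter = 1000;
	mod_counter(&counter, -nr);			/* really -16 */
	printf("long nr:         %ld\n", counter);	/* 984 */

	return 0;
}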
In lieu of reposting the entire series, here's the diff between
the previous and the next for-next:
diff --git a/include/linux/mm.h b/include/linux/mm.h
index 52796adf7a2f..dc80039be60e 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -1667,9 +1667,9 @@ static inline void set_page_links(struct page *page, enum zone_type zone,
* folio_nr_pages - The number of pages in the folio.
* @folio: The folio.
*
- * Return: A number which is a power of two.
+ * Return: A number which is a non-negative power of two.
*/
-static inline unsigned long folio_nr_pages(struct folio *folio)
+static inline long folio_nr_pages(struct folio *folio)
{
return compound_nr(&folio->page);
}
@@ -1733,8 +1733,10 @@ static inline int arch_make_page_accessible(struct page *page)
#ifndef HAVE_ARCH_MAKE_FOLIO_ACCESSIBLE
static inline int arch_make_folio_accessible(struct folio *folio)
{
- int ret, i;
- for (i = 0; i < folio_nr_pages(folio); i++) {
+ int ret;
+ long i, nr = folio_nr_pages(folio);
+
+ for (i = 0; i < nr; i++) {
ret = arch_make_page_accessible(folio_page(folio, i));
if (ret)
break;
diff --git a/include/linux/mm_inline.h b/include/linux/mm_inline.h
index d39537c5471b..e2ec68b0515c 100644
--- a/include/linux/mm_inline.h
+++ b/include/linux/mm_inline.h
@@ -32,7 +32,7 @@ static inline int page_is_file_lru(struct page *page)
static __always_inline void update_lru_size(struct lruvec *lruvec,
enum lru_list lru, enum zone_type zid,
- int nr_pages)
+ long nr_pages)
{
struct pglist_data *pgdat = lruvec_pgdat(lruvec);
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 1aeeb4437ffd..9a018ac7defc 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -6693,7 +6693,7 @@ void mem_cgroup_calculate_protection(struct mem_cgroup *root,
static int charge_memcg(struct folio *folio, struct mem_cgroup *memcg,
gfp_t gfp)
{
- unsigned int nr_pages = folio_nr_pages(folio);
+ long nr_pages = folio_nr_pages(folio);
int ret;
ret = try_charge(memcg, gfp, nr_pages);
@@ -6847,7 +6847,7 @@ static void uncharge_batch(const struct uncharge_gather *ug)
static void uncharge_folio(struct folio *folio, struct uncharge_gather *ug)
{
- unsigned long nr_pages;
+ long nr_pages;
struct mem_cgroup *memcg;
struct obj_cgroup *objcg;
bool use_objcg = folio_memcg_kmem(folio);
@@ -6962,7 +6962,7 @@ void mem_cgroup_uncharge_list(struct list_head *page_list)
void mem_cgroup_migrate(struct folio *old, struct folio *new)
{
struct mem_cgroup *memcg;
- unsigned int nr_pages = folio_nr_pages(new);
+ long nr_pages = folio_nr_pages(new);
unsigned long flags;
VM_BUG_ON_FOLIO(!folio_test_locked(old), old);
diff --git a/mm/migrate.c b/mm/migrate.c
index 36cdae0a1235..c96d7a78a2f4 100644
--- a/mm/migrate.c
+++ b/mm/migrate.c
@@ -383,7 +383,7 @@ int folio_migrate_mapping(struct address_space *mapping,
struct zone *oldzone, *newzone;
int dirty;
int expected_count = expected_page_refs(mapping, &folio->page) + extra_count;
- int nr = folio_nr_pages(folio);
+ long nr = folio_nr_pages(folio);
if (!mapping) {
/* Anonymous page without mapping */
diff --git a/mm/page-writeback.c b/mm/page-writeback.c
index c2987f05c944..987a2f2efe81 100644
--- a/mm/page-writeback.c
+++ b/mm/page-writeback.c
@@ -2547,7 +2547,7 @@ void folio_account_redirty(struct folio *folio)
struct inode *inode = mapping->host;
struct bdi_writeback *wb;
struct wb_lock_cookie cookie = {};
- unsigned nr = folio_nr_pages(folio);
+ long nr = folio_nr_pages(folio);
wb = unlocked_inode_to_wb_begin(inode, &cookie);
current->nr_dirtied -= nr;
@@ -2574,7 +2574,7 @@ bool folio_redirty_for_writepage(struct writeback_control *wbc,
struct folio *folio)
{
bool ret;
- unsigned nr = folio_nr_pages(folio);
+ long nr = folio_nr_pages(folio);
wbc->pages_skipped += nr;
ret = filemap_dirty_folio(folio->mapping, folio);
diff --git a/mm/swap.c b/mm/swap.c
index 6f382abeccf9..ebd1bdd80b45 100644
--- a/mm/swap.c
+++ b/mm/swap.c
@@ -324,7 +324,7 @@ void lru_note_cost_folio(struct folio *folio)
static void __folio_activate(struct folio *folio, struct lruvec *lruvec)
{
if (!folio_test_active(folio) && !folio_test_unevictable(folio)) {
- int nr_pages = folio_nr_pages(folio);
+ long nr_pages = folio_nr_pages(folio);
lruvec_del_folio(lruvec, folio);
folio_set_active(folio);
@@ -1004,7 +1004,7 @@ EXPORT_SYMBOL(__pagevec_release);
static void __pagevec_lru_add_fn(struct folio *folio, struct lruvec *lruvec)
{
int was_unevictable = folio_test_clear_unevictable(folio);
- int nr_pages = folio_nr_pages(folio);
+ long nr_pages = folio_nr_pages(folio);
VM_BUG_ON_FOLIO(folio_test_lru(folio), folio);
diff --git a/mm/util.c b/mm/util.c
index a0e859def6a8..b57fb165f761 100644
--- a/mm/util.c
+++ b/mm/util.c
@@ -649,7 +649,7 @@ void *page_rmapping(struct page *page)
*/
bool folio_mapped(struct folio *folio)
{
- int i, nr;
+ long i, nr;
if (folio_single(folio))
return atomic_read(&folio->_mapcount) >= 0;
@@ -730,8 +730,8 @@ EXPORT_SYMBOL_GPL(__page_mapcount);
void folio_copy(struct folio *dst, struct folio *src)
{
- unsigned i = 0;
- unsigned nr = folio_nr_pages(src);
+ long i = 0;
+ long nr = folio_nr_pages(src);
for (;;) {
copy_highpage(folio_page(dst, i), folio_page(src, i));
@@ -1064,12 +1064,10 @@ EXPORT_SYMBOL(page_offline_end);
#ifndef ARCH_IMPLEMENTS_FLUSH_DCACHE_FOLIO
void flush_dcache_folio(struct folio *folio)
{
- unsigned int n = folio_nr_pages(folio);
+ long i, nr = folio_nr_pages(folio);
- do {
- n--;
- flush_dcache_page(folio_page(folio, n));
- } while (n);
+ for (i = 0; i < nr; i++)
+ flush_dcache_page(folio_page(folio, i));
}
EXPORT_SYMBOL(flush_dcache_folio);
#endif
On Thu, Jul 15, 2021 at 04:35:35AM +0100, Matthew Wilcox (Oracle) wrote:
> These are the folio equivalents of relock_page_lruvec_irq() and
> folio_lruvec_relock_irqsave(). Also convert page_matches_lruvec()
> to folio_matches_lruvec().
>
> Signed-off-by: Matthew Wilcox (Oracle) <[email protected]>
> Reviewed-by: Christoph Hellwig <[email protected]>
When build testing what you had in your for-next branch, I got a new
warning for powerpc defconfig
In file included from ./include/linux/mmzone.h:8,
from ./include/linux/gfp.h:6,
from ./include/linux/mm.h:10,
from mm/swap.c:17:
mm/swap.c: In function 'release_pages':
./include/linux/spinlock.h:290:3: warning: 'flags' may be used uninitialized in this function [-Wmaybe-uninitialized]
290 | _raw_spin_unlock_irqrestore(lock, flags); \
| ^~~~~~~~~~~~~~~~~~~~~~~~~~~
mm/swap.c:906:16: note: 'flags' was declared here
906 | unsigned long flags;
| ^~~~~
I'm fairly sure it's a false positive and the compiler just cannot figure
out that flags are only accessed when lruvec is !NULL and once lruvec is
!NULL, flags are valid
diff --git a/mm/swap.c b/mm/swap.c
index 6f382abeccf9..96a23af8d1c7 100644
--- a/mm/swap.c
+++ b/mm/swap.c
@@ -903,7 +903,7 @@ void release_pages(struct page **pages, int nr)
int i;
LIST_HEAD(pages_to_free);
struct lruvec *lruvec = NULL;
- unsigned long flags;
+ unsigned long flags = 0;
unsigned int lock_batch;
for (i = 0; i < nr; i++) {
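The pattern reduces to something like this userspace sketch (invented
names; whether gcc actually warns on the reduction depends on compiler
version and optimization level):
struct lock { int dummy; };
void lock_irqsave(struct lock *l, unsigned long *flags);
void unlock_irqrestore(struct lock *l, unsigned long flags);
void demo(struct lock **locks, int nr)
{
	struct lock *held = NULL;
	unsigned long flags;
	int i;
	for (i = 0; i < nr; i++) {
		if (!held && locks[i]) {
			held = locks[i];
			lock_irqsave(held, &flags);	/* flags written iff held != NULL */
		}
	}
	if (held)
		unlock_irqrestore(held, flags);	/* read iff held != NULL */
}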
On Thu, Jul 29, 2021 at 09:36:44AM +0100, Mel Gorman wrote:
> On Thu, Jul 15, 2021 at 04:35:35AM +0100, Matthew Wilcox (Oracle) wrote:
> > These are the folio equivalents of relock_page_lruvec_irq() and
> > folio_lruvec_relock_irqsave(). Also convert page_matches_lruvec()
> > to folio_matches_lruvec().
> >
> > Signed-off-by: Matthew Wilcox (Oracle) <[email protected]>
> > Reviewed-by: Christoph Hellwig <[email protected]>
>
> When build testing what you had in your for-next branch, I got a new
> warning for powerpc defconfig
>
> In file included from ./include/linux/mmzone.h:8,
> from ./include/linux/gfp.h:6,
> from ./include/linux/mm.h:10,
> from mm/swap.c:17:
> mm/swap.c: In function 'release_pages':
> ./include/linux/spinlock.h:290:3: warning: 'flags' may be used uninitialized in this function [-Wmaybe-uninitialized]
> 290 | _raw_spin_unlock_irqrestore(lock, flags); \
> | ^~~~~~~~~~~~~~~~~~~~~~~~~~~
> mm/swap.c:906:16: note: 'flags' was declared here
> 906 | unsigned long flags;
> | ^~~~~
>
> I'm fairly sure it's a false positive and the compiler just cannot figure
> out that flags are only accessed when lruvec is !NULL and once lruvec is
> !NULL, flags are valid
Yes, I read it over carefully and I can't see a way in which this
can happen. Weird that this change made the compiler unable to figure
that out. Pushed out a new for-next with your patch included. Thanks!
Matthew Wilcox (Oracle) <[email protected]> wrote:
> atomic_add_unless() returns bool, so remove the widening casts to int
> in page_ref_add_unless() and get_page_unless_zero(). This causes gcc
> to produce slightly larger code in isolate_migratepages_block(), but
> it's not clear that it's worse code. Net +19 bytes of text.
>
> Signed-off-by: Matthew Wilcox (Oracle) <[email protected]>
> Reviewed-by: Christoph Hellwig <[email protected]>
> Acked-by: Kirill A. Shutemov <[email protected]>
Reviewed-by: David Howells <[email protected]>
Matthew Wilcox (Oracle) <[email protected]> wrote:
> This is like lock_page_killable() but for use by callers who
> know they have a folio. Convert __lock_page_killable() to be
> __folio_lock_killable(). This saves one call to compound_head() per
> contended call to lock_page_killable().
>
> __folio_lock_killable() is 19 bytes smaller than __lock_page_killable()
> was. filemap_fault() shrinks by 74 bytes and __lock_page_or_retry()
> shrinks by 71 bytes. That's a total of 164 bytes of text saved.
>
> Signed-off-by: Matthew Wilcox (Oracle) <[email protected]>
> Reviewed-by: Christoph Hellwig <[email protected]>
> Acked-by: Jeff Layton <[email protected]>
> Acked-by: Kirill A. Shutemov <[email protected]>
> Acked-by: Vlastimil Babka <[email protected]>
> Reviewed-by: William Kucharski <[email protected]>
Reviewed-by: David Howells <[email protected]>
Matthew Wilcox (Oracle) <[email protected]> wrote:
> wait_on_page_writeback_killable() only has one caller, so convert it to
> call folio_wait_writeback_killable(). For the wait_on_page_writeback()
> callers, add a compatibility wrapper around folio_wait_writeback().
>
> Turning PageWriteback() into folio_test_writeback() eliminates a call
> to compound_head() which saves 8 bytes and 15 bytes in the two
> functions. Unfortunately, that is more than offset by adding the
> wait_on_page_writeback compatibility wrapper for a net increase in text
> of 7 bytes.
>
> Signed-off-by: Matthew Wilcox (Oracle) <[email protected]>
> Reviewed-by: Christoph Hellwig <[email protected]>
> Acked-by: Jeff Layton <[email protected]>
> Acked-by: Kirill A. Shutemov <[email protected]>
> Acked-by: Vlastimil Babka <[email protected]>
> Reviewed-by: William Kucharski <[email protected]>
Reviewed-by: David Howells <[email protected]>
Matthew Wilcox (Oracle) <[email protected]> wrote:
> Match the page writeback functions by adding
> folio_start_fscache(), folio_end_fscache(), folio_wait_fscache() and
> folio_wait_fscache_killable(). Remove set_page_private_2(). Also rewrite
> the kernel-doc to describe when to use the function rather than what the
> function does, and include the kernel-doc in the appropriate rst file.
> Saves 31 bytes of text in netfs_rreq_unlock() due to set_page_fscache()
> calling page_folio() once instead of three times.
>
> Signed-off-by: Matthew Wilcox (Oracle) <[email protected]>
> Acked-by: Vlastimil Babka <[email protected]>
> Reviewed-by: William Kucharski <[email protected]>
> Reviewed-by: Christoph Hellwig <[email protected]>
> Acked-by: Kirill A. Shutemov <[email protected]>
Assuming you fixed the kernel test robot report:
Reviewed-by: David Howells <[email protected]>
Matthew Wilcox (Oracle) <[email protected]> wrote:
> Convert rotate_reclaimable_page() to folio_rotate_reclaimable(). This
> eliminates all five of the calls to compound_head() in this function,
> saving 75 bytes at the cost of adding 15 bytes to its one caller,
> end_page_writeback(). We also save 36 bytes from pagevec_move_tail_fn()
> due to using folios there. Net 96 bytes savings.
>
> Also move its declaration to mm/internal.h as it's only used by filemap.c.
>
> Signed-off-by: Matthew Wilcox (Oracle) <[email protected]>
> Acked-by: Vlastimil Babka <[email protected]>
> Reviewed-by: William Kucharski <[email protected]>
> Reviewed-by: Christoph Hellwig <[email protected]>
> Acked-by: Kirill A. Shutemov <[email protected]>
Reviewed-by: David Howells <[email protected]>
Matthew Wilcox (Oracle) <[email protected]> wrote:
> Convert __lock_page_or_retry() to __folio_lock_or_retry(). This actually
> saves 4 bytes in the only caller of lock_page_or_retry() (due to better
> register allocation) and saves the 14 byte cost of calling page_folio()
> in __folio_lock_or_retry() for a total saving of 18 bytes. Also use
> a bool for the return type.
>
> Signed-off-by: Matthew Wilcox (Oracle) <[email protected]>
> Reviewed-by: Christoph Hellwig <[email protected]>
> Acked-by: Jeff Layton <[email protected]>
> Acked-by: Kirill A. Shutemov <[email protected]>
> Reviewed-by: William Kucharski <[email protected]>
Reviewed-by: David Howells <[email protected]>
On 7/15/21 5:34 AM, Matthew Wilcox (Oracle) wrote:
> atomic_add_unless() returns bool, so remove the widening casts to int
> in page_ref_add_unless() and get_page_unless_zero(). This causes gcc
> to produce slightly larger code in isolate_migratepages_block(), but
> it's not clear that it's worse code. Net +19 bytes of text.
>
> Signed-off-by: Matthew Wilcox (Oracle) <[email protected]>
> Reviewed-by: Christoph Hellwig <[email protected]>
> Acked-by: Kirill A. Shutemov <[email protected]>
Acked-by: Vlastimil Babka <[email protected]>
> ---
> include/linux/mm.h | 2 +-
> include/linux/page_ref.h | 4 ++--
> 2 files changed, 3 insertions(+), 3 deletions(-)
>
> diff --git a/include/linux/mm.h b/include/linux/mm.h
> index 7ca22e6e694a..8dd65290bac0 100644
> --- a/include/linux/mm.h
> +++ b/include/linux/mm.h
> @@ -755,7 +755,7 @@ static inline int put_page_testzero(struct page *page)
> * This can be called when MMU is off so it must not access
> * any of the virtual mappings.
> */
> -static inline int get_page_unless_zero(struct page *page)
> +static inline bool get_page_unless_zero(struct page *page)
> {
> return page_ref_add_unless(page, 1, 0);
> }
> diff --git a/include/linux/page_ref.h b/include/linux/page_ref.h
> index 7ad46f45df39..3a799de8ad52 100644
> --- a/include/linux/page_ref.h
> +++ b/include/linux/page_ref.h
> @@ -161,9 +161,9 @@ static inline int page_ref_dec_return(struct page *page)
> return ret;
> }
>
> -static inline int page_ref_add_unless(struct page *page, int nr, int u)
> +static inline bool page_ref_add_unless(struct page *page, int nr, int u)
> {
> - int ret = atomic_add_unless(&page->_refcount, nr, u);
> + bool ret = atomic_add_unless(&page->_refcount, nr, u);
>
> if (page_ref_tracepoint_active(page_ref_mod_unless))
> __page_ref_mod_unless(page, nr, ret);
>
On 7/15/21 5:34 AM, Matthew Wilcox (Oracle) wrote:
> Handle arbitrary-order folios being added to the LRU. By definition,
> all pages being added to the LRU were already head or base pages, but
> call page_folio() on them anyway to get the type right and avoid the
> buried calls to compound_head().
>
> Saves 783 bytes of kernel text; no functions grow.
>
> Signed-off-by: Matthew Wilcox (Oracle) <[email protected]>
> Reviewed-by: Yu Zhao <[email protected]>
> Reviewed-by: Christoph Hellwig <[email protected]>
> Reviewed-by: David Howells <[email protected]>
> Acked-by: Kirill A. Shutemov <[email protected]>
Acked-by: Vlastimil Babka <[email protected]>
Actually looking at the git version, which has also this:
static __always_inline void update_lru_size(struct lruvec *lruvec,
enum lru_list lru, enum zone_type zid,
- int nr_pages)
+ long nr_pages)
{
Why now and here? Some of the functions called from update_lru_size()
still take int so this looks arbitrary?
On 7/15/21 5:35 AM, Matthew Wilcox (Oracle) wrote:
> Convert unlock_page() to call folio_unlock(). By using a folio we
> avoid a call to compound_head(). This shortens the function from 39
> bytes to 25 and removes 4 instructions on x86-64. Because we still
> have unlock_page(), it's a net increase of 16 bytes of text for the
> kernel as a whole, but any path that uses folio_unlock() will execute
> 4 fewer instructions.
>
> Signed-off-by: Matthew Wilcox (Oracle) <[email protected]>
> Reviewed-by: Christoph Hellwig <[email protected]>
> Acked-by: Jeff Layton <[email protected]>
> Acked-by: Kirill A. Shutemov <[email protected]>
> Reviewed-by: William Kucharski <[email protected]>
> Reviewed-by: David Howells <[email protected]>
Acked-by: Vlastimil Babka <[email protected]>
On 7/15/21 5:35 AM, Matthew Wilcox (Oracle) wrote:
> Convert __lock_page_or_retry() to __folio_lock_or_retry(). This actually
> saves 4 bytes in the only caller of lock_page_or_retry() (due to better
> register allocation) and saves the 14 byte cost of calling page_folio()
> in __folio_lock_or_retry() for a total saving of 18 bytes. Also use
> a bool for the return type.
>
> Signed-off-by: Matthew Wilcox (Oracle) <[email protected]>
> Reviewed-by: Christoph Hellwig <[email protected]>
> Acked-by: Jeff Layton <[email protected]>
> Acked-by: Kirill A. Shutemov <[email protected]>
> Reviewed-by: William Kucharski <[email protected]>
Acked-by: Vlastimil Babka <[email protected]>
Nit:
> --- a/mm/filemap.c
> +++ b/mm/filemap.c
> @@ -1625,48 +1625,46 @@ static int __folio_lock_async(struct folio *folio, struct wait_page_queue *wait)
>
> /*
> * Return values:
> - * 1 - page is locked; mmap_lock is still held.
> - * 0 - page is not locked.
> + * true - folio is locked; mmap_lock is still held.
> + * false - folio is not locked.
> * mmap_lock has been released (mmap_read_unlock(), unless flags had both
> * FAULT_FLAG_ALLOW_RETRY and FAULT_FLAG_RETRY_NOWAIT set, in
> * which case mmap_lock is still held.
> *
> * If neither ALLOW_RETRY nor KILLABLE are set, will always return 1
s/1/true/ ? :)
> - * with the page locked and the mmap_lock unperturbed.
> + * with the folio locked and the mmap_lock unperturbed.
> */
> -int __lock_page_or_retry(struct page *page, struct mm_struct *mm,
> +bool __folio_lock_or_retry(struct folio *folio, struct mm_struct *mm,
> unsigned int flags)
> {
> - struct folio *folio = page_folio(page);
> -
> if (fault_flag_allow_retry_first(flags)) {
> /*
> * CAUTION! In this case, mmap_lock is not released
> * even though return 0.
> */
> if (flags & FAULT_FLAG_RETRY_NOWAIT)
> - return 0;
> + return false;
>
> mmap_read_unlock(mm);
> if (flags & FAULT_FLAG_KILLABLE)
> folio_wait_locked_killable(folio);
> else
> folio_wait_locked(folio);
> - return 0;
> + return false;
> }
> if (flags & FAULT_FLAG_KILLABLE) {
> - int ret;
> + bool ret;
>
> ret = __folio_lock_killable(folio);
> if (ret) {
> mmap_read_unlock(mm);
> - return 0;
> + return false;
> }
> } else {
> __folio_lock(folio);
> }
>
> - return 1;
> + return true;
> }
>
> /**
On Tue, Aug 10, 2021 at 06:01:16PM +0200, Vlastimil Babka wrote:
> Actually looking at the git version, which has also this:
>
> static __always_inline void update_lru_size(struct lruvec *lruvec,
> enum lru_list lru, enum zone_type zid,
> - int nr_pages)
> + long nr_pages)
> {
>
> Why now and here? Some of the functions called from update_lru_size()
> still take int so this looks arbitrary?
I'm still a little freaked out about the lack of warning for:
void f(long n);
void g(unsigned int n) { f(-n); }
so I've decided that the count of pages in a folio is always of type
long. The actual number is positive, and currently it's between 1 and
1024 (inclusive on both bounds), so it's always going to be
representable in an int. Narrowing it doesn't cause a bug, so we don't
need to change nr_pages anywhere, but it does no harm to make functions
take a long instead of an int (it may even cause slightly better code
generation, based on the sample of functions I've looked at).
Maybe changing update_lru_size() in this patch is wrong. I can drop it
if you like.
Matthew Wilcox (Oracle) <[email protected]> wrote:
> By using the node id in mem_cgroup_update_tree(), we can delete
> soft_limit_tree_from_page() and mem_cgroup_page_nodeinfo(). Saves 42
> bytes of kernel text on my config.
>
> Signed-off-by: Matthew Wilcox (Oracle) <[email protected]>
> Acked-by: Michal Hocko <[email protected]>
> Acked-by: Johannes Weiner <[email protected]>
> Reviewed-by: Christoph Hellwig <[email protected]>
Reviewed-by: David Howells <[email protected]>
Though I wonder if:
> - mz = mem_cgroup_page_nodeinfo(memcg, page);
> + mz = memcg->nodeinfo[nid];
should still have some sort of wrapper function.
Matthew Wilcox (Oracle) <[email protected]> wrote:
> Opencode this one-line function in its three callers.
>
> Signed-off-by: Matthew Wilcox (Oracle) <[email protected]>
> Acked-by: Michal Hocko <[email protected]>
> Acked-by: Johannes Weiner <[email protected]>
> Reviewed-by: Christoph Hellwig <[email protected]>
Reviewed-by: David Howells <[email protected]>
Matthew Wilcox (Oracle) <[email protected]> wrote:
> memcg_check_events only uses the page's nid, so call page_to_nid in the
> callers to make the interface easier to understand.
>
> Signed-off-by: Matthew Wilcox (Oracle) <[email protected]>
> Acked-by: Michal Hocko <[email protected]>
> Reviewed-by: Christoph Hellwig <[email protected]>
Reviewed-by: David Howells <[email protected]>
Matthew Wilcox (Oracle) <[email protected]> wrote:
> memcg information is only stored in the head page, so the memcg
> subsystem needs to assure that all accesses are to the head page.
> The first step is converting page_memcg() to folio_memcg().
>
> The callers of page_memcg() and PageMemcgKmem() are not yet ready to be
> converted to use folios, so retain them as wrappers around folio_memcg()
> and folio_memcg_kmem(). They will be converted in a later patch set.
>
> Signed-off-by: Matthew Wilcox (Oracle) <[email protected]>
> Reviewed-by: Christoph Hellwig <[email protected]>
Reviewed-by: David Howells <[email protected]>
Matthew Wilcox (Oracle) <[email protected]> wrote:
> The memcg_data is only set on the head page, so enforce that by
> typing it as a folio.
>
> Signed-off-by: Matthew Wilcox (Oracle) <[email protected]>
> Reviewed-by: Christoph Hellwig <[email protected]>
> Acked-by: Michal Hocko <[email protected]>
Reviewed-by: David Howells <[email protected]>
Matthew Wilcox (Oracle) <[email protected]> wrote:
> Convert all callers of mem_cgroup_charge() to call page_folio() on the
> page they're currently passing in. Many of them will be converted to
> use folios themselves soon.
>
> Signed-off-by: Matthew Wilcox (Oracle) <[email protected]>
> Reviewed-by: Christoph Hellwig <[email protected]>
Reviewed-by: David Howells <[email protected]>
Matthew Wilcox (Oracle) <[email protected]> wrote:
> Convert all the callers to call page_folio(). Most of them were already
> using a head page, but a few of them I can't prove were, so this may
> actually fix a bug.
>
> Signed-off-by: Matthew Wilcox (Oracle) <[email protected]>
> Reviewed-by: Christoph Hellwig <[email protected]>
Reviewed-by: David Howells <[email protected]>
Matthew Wilcox (Oracle) <[email protected]> wrote:
> Use a folio rather than a page to ensure that we're only operating on
> base or head pages, and not tail pages.
>
> Signed-off-by: Matthew Wilcox (Oracle) <[email protected]>
> Reviewed-by: Christoph Hellwig <[email protected]>
Reviewed-by: David Howells <[email protected]>
Matthew Wilcox (Oracle) <[email protected]> wrote:
> These are the folio equivalents of lock_page_memcg() and
> unlock_page_memcg().
>
> lock_page_memcg() and unlock_page_memcg() have too many callers to be
> easily replaced in a single patch, so reimplement them as wrappers for
> now to be cleaned up later when enough callers have been converted to
> use folios.
>
> Signed-off-by: Matthew Wilcox (Oracle) <[email protected]>
> Reviewed-by: Christoph Hellwig <[email protected]>
Reviewed-by: David Howells <[email protected]>
Matthew Wilcox (Oracle) <[email protected]> wrote:
> This replaces mem_cgroup_page_lruvec(). All callers converted.
>
> Signed-off-by: Matthew Wilcox (Oracle) <[email protected]>
> Reviewed-by: Christoph Hellwig <[email protected]>
Reviewed-by: David Howells <[email protected]>
Matthew Wilcox (Oracle) <[email protected]> wrote:
> This is the folio equivalent of page_to_pfn().
>
> Signed-off-by: Matthew Wilcox (Oracle) <[email protected]>
Reviewed-by: David Howells <[email protected]>
Matthew Wilcox (Oracle) <[email protected]> wrote:
> These are the folio equivalents of relock_page_lruvec_irq() and
> folio_lruvec_relock_irqsave(). Also convert page_matches_lruvec()
> to folio_matches_lruvec().
>
> Signed-off-by: Matthew Wilcox (Oracle) <[email protected]>
> Reviewed-by: Christoph Hellwig <[email protected]>
Reviewed-by: David Howells <[email protected]>
Matthew Wilcox (Oracle) <[email protected]> wrote:
> This function already assumed it was being passed a head page. No real
> change here, except that thp_nr_pages() compiles away on kernels with
> THP compiled out while folio_nr_pages() is always present. Also convert
> page_memcg_rcu() to folio_memcg_rcu().
>
> Signed-off-by: Matthew Wilcox (Oracle) <[email protected]>
> Reviewed-by: Christoph Hellwig <[email protected]>
Reviewed-by: David Howells <[email protected]>
Matthew Wilcox (Oracle) <[email protected]> wrote:
> Convert __page_rmapping to folio_raw_mapping and move it to mm/internal.h.
> It's only a couple of instructions (load and mask), so it's definitely
> going to be cheaper to inline it than call it. Leave page_rmapping
> out of line.
>
> Signed-off-by: Matthew Wilcox (Oracle) <[email protected]>
I assume you're going to call it from another source file at some point,
otherwise this is unnecessary.
Apart from that,
Reviewed-by: David Howells <[email protected]>
Matthew Wilcox (Oracle) <[email protected]> wrote:
> Idle page tracking is handled through page_ext on 32-bit architectures.
> Add folio equivalents for 32-bit and move all the page compatibility
> parts to common code.
>
> Signed-off-by: Matthew Wilcox (Oracle) <[email protected]>
> Acked-by: Vlastimil Babka <[email protected]>
> Reviewed-by: William Kucharski <[email protected]>
> Reviewed-by: Christoph Hellwig <[email protected]>
Reviewed-by: David Howells <[email protected]>
Matthew Wilcox (Oracle) <[email protected]> wrote:
> Transform page_mkclean() into folio_mkclean() and add a page_mkclean()
> wrapper around folio_mkclean().
>
> folio_mkclean is 15 bytes smaller than page_mkclean, but the kernel
> is enlarged by 33 bytes due to inlining page_folio() into each caller.
> This will go away once the callers are converted to use folio_mkclean().
>
> Signed-off-by: Matthew Wilcox (Oracle) <[email protected]>
> Reviewed-by: Christoph Hellwig <[email protected]>
Reviewed-by: David Howells <[email protected]>
Matthew Wilcox (Oracle) <[email protected]> wrote:
> + if (folio_test_error(folio))
> + folio_set_error(newfolio);
> + if (folio_test_referenced(folio))
> + folio_set_referenced(newfolio);
> + if (folio_test_uptodate(folio))
> + folio_mark_uptodate(newfolio);
> + if (folio_test_clear_active(folio)) {
> + VM_BUG_ON_FOLIO(folio_test_unevictable(folio), folio);
> + folio_set_active(newfolio);
> + } else if (folio_test_clear_unevictable(folio))
> + folio_set_unevictable(newfolio);
> + if (folio_test_workingset(folio))
> + folio_set_workingset(newfolio);
> + if (folio_test_checked(folio))
> + folio_set_checked(newfolio);
> + if (folio_test_mappedtodisk(folio))
> + folio_set_mappedtodisk(newfolio);
Since a bunch of these are bits in folio->flags and newfolio->flags, I wonder
if it's better to use a cmpxchg() loop or LL/SC construct to transfer all
the relevant flags in one go.
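For illustration, the kind of one-shot transfer being suggested could look
roughly like the sketch below. The helper name and the FOLIO_COPY_FLAGS
mask are invented here, and the real flag helpers may have ordering
requirements this ignores:
#define FOLIO_COPY_FLAGS	((1UL << PG_error) | (1UL << PG_referenced) | \
				 (1UL << PG_uptodate) | (1UL << PG_checked) | \
				 (1UL << PG_mappedtodisk) | (1UL << PG_workingset))
static void folio_transfer_flags(struct folio *folio, struct folio *newfolio)
{
	unsigned long copy = READ_ONCE(folio->flags) & FOLIO_COPY_FLAGS;
	unsigned long val, want;
	/* Set the copied bits in newfolio->flags in one atomic update,
	 * retrying if another CPU modifies the flags word meanwhile. */
	do {
		val = READ_ONCE(newfolio->flags);
		want = val | copy;
	} while (cmpxchg(&newfolio->flags, val, want) != val);
}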
Apart from that:
Reviewed-by: David Howells <[email protected]>
Matthew Wilcox (Oracle) <[email protected]> wrote:
> Make this look like the newly renamed vmstat functions.
>
> Signed-off-by: Matthew Wilcox (Oracle) <[email protected]>
> Reviewed-by: Christoph Hellwig <[email protected]>
Reviewed-by: David Howells <[email protected]>
Matthew Wilcox (Oracle) <[email protected]> wrote:
> Allow for accounting N pages at once instead of one page at a time.
>
> Signed-off-by: Matthew Wilcox (Oracle) <[email protected]>
> Reviewed-by: Christoph Hellwig <[email protected]>
> Reviewed-by: Jan Kara <[email protected]>
Reviewed-by: David Howells <[email protected]>
Matthew Wilcox (Oracle) <[email protected]> wrote:
> test_clear_page_writeback() is actually an mm-internal function, although
> it's named as if it's a pagecache function. Move it to mm/internal.h,
> rename it to __folio_end_writeback() and change the return type to bool.
>
> The conversion from page to folio is mostly about accounting the number
> of pages being written back, although it does eliminate a couple of
> calls to compound_head().
>
> Signed-off-by: Matthew Wilcox (Oracle) <[email protected]>
> Reviewed-by: Christoph Hellwig <[email protected]>
Reviewed-by: David Howells <[email protected]>
Matthew Wilcox (Oracle) <[email protected]> wrote:
> Rename set_page_writeback() to folio_start_writeback() to match
> folio_end_writeback(). Do not bother with wrappers that return void;
> callers are perfectly capable of ignoring return values.
>
> Add wrappers for set_page_writeback(), set_page_writeback_keepwrite() and
> test_set_page_writeback() for compatibililty with existing filesystems.
> The main advantage of this patch is getting the statistics right,
> although it does eliminate a couple of calls to compound_head().
>
> Signed-off-by: Matthew Wilcox (Oracle) <[email protected]>
> Reviewed-by: Christoph Hellwig <[email protected]>
Reviewed-by: David Howells <[email protected]>
Matthew Wilcox (Oracle) <[email protected]> wrote:
> Reimplement set_page_dirty() as a wrapper around folio_mark_dirty().
> There is no change to filesystems as they were already being called
> with the compound_head of the page being marked dirty. We avoid
> several calls to compound_head(), both statically (through
> using folio_test_dirty() instead of PageDirty() and dynamically by
> calling folio_mapping() instead of page_mapping().
>
> Also return bool instead of int to show the range of values actually
> returned, and add kernel-doc.
>
> Signed-off-by: Matthew Wilcox (Oracle) <[email protected]>
> Reviewed-by: Christoph Hellwig <[email protected]>
Reviewed-by: David Howells <[email protected]>
Matthew Wilcox (Oracle) <[email protected]> wrote:
> Turn __set_page_dirty() into a wrapper around __folio_mark_dirty().
> Convert account_page_dirtied() into folio_account_dirtied() and account
> the number of pages in the folio to support multi-page folios.
>
> Signed-off-by: Matthew Wilcox (Oracle) <[email protected]>
Reviewed-by: David Howells <[email protected]>
Matthew Wilcox (Oracle) <[email protected]> wrote:
> Rename writeback_dirty_page() to writeback_dirty_folio() and
> wait_on_page_writeback() to folio_wait_writeback().
>
> Signed-off-by: Matthew Wilcox (Oracle) <[email protected]>
Reviewed-by: David Howells <[email protected]>
Matthew Wilcox (Oracle) <[email protected]> wrote:
> Reimplement __set_page_dirty_nobuffers() as a wrapper around
> filemap_dirty_folio().
>
> Signed-off-by: Matthew Wilcox (Oracle) <[email protected]>
> Reviewed-by: Christoph Hellwig <[email protected]>
Reviewed-by: David Howells <[email protected]>
Matthew Wilcox (Oracle) <[email protected]> wrote:
> Get the statistics right; compound pages were being accounted as a
> single page. This didn't matter before now as no filesystem which
> supported compound pages did writeback. Also move the declaration
> to filemap.h since this is part of the page cache. Add a wrapper for
> account_page_cleaned().
>
> Signed-off-by: Matthew Wilcox (Oracle) <[email protected]>
> Reviewed-by: Christoph Hellwig <[email protected]>
Reviewed-by: David Howells <[email protected]>
Matthew Wilcox (Oracle) <[email protected]> wrote:
> Turn __cancel_dirty_page() into __folio_cancel_dirty() and add wrappers.
> Move the prototypes into pagemap.h since this is page cache functionality.
> Saves 44 bytes of kernel text in total; 33 bytes from __folio_cancel_dirty
> and 11 from two callers of cancel_dirty_page().
>
> Signed-off-by: Matthew Wilcox (Oracle) <[email protected]>
> Reviewed-by: Christoph Hellwig <[email protected]>
Reviewed-by: David Howells <[email protected]>
Matthew Wilcox (Oracle) <[email protected]> wrote:
> Account the number of pages in the folio that we're redirtying.
> Turn account_page_dirty() into a wrapper around it. Also turn
> the comment on folio_account_redirty() into kernel-doc and
> edit it slightly so it makes sense to its potential callers.
>
> Signed-off-by: Matthew Wilcox (Oracle) <[email protected]>
> Reviewed-by: Christoph Hellwig <[email protected]>
Reviewed-by: David Howells <[email protected]>
Matthew Wilcox (Oracle) <[email protected]> wrote:
> Reimplement redirty_page_for_writepage() as a wrapper around
> folio_redirty_for_writepage(). Account the number of pages in the
> folio, add kernel-doc and move the prototype to writeback.h.
>
> Signed-off-by: Matthew Wilcox (Oracle) <[email protected]>
> Reviewed-by: Christoph Hellwig <[email protected]>
Reviewed-by: David Howells <[email protected]>
Matthew Wilcox (Oracle) <[email protected]> wrote:
> Reimplement i_blocks_per_page() as a wrapper around i_blocks_per_folio().
>
> Signed-off-by: Matthew Wilcox (Oracle) <[email protected]>
> Reviewed-by: Christoph Hellwig <[email protected]>
Reviewed-by: David Howells <[email protected]>
Matthew Wilcox (Oracle) <[email protected]> wrote:
> This is the folio equivalent of page_mkwrite_check_truncate().
>
> Signed-off-by: Matthew Wilcox (Oracle) <[email protected]>
Reviewed-by: David Howells <[email protected]>
Matthew Wilcox (Oracle) <[email protected]> wrote:
> The pointers stored in the page cache are folios, by definition.
> This change comes with a behaviour change -- callers of readahead_folio()
> are no longer required to put the page reference themselves. This matches
> how readpage works, rather than matching how readpages used to work.
>
> Signed-off-by: Matthew Wilcox (Oracle) <[email protected]>
> Reviewed-by: Christoph Hellwig <[email protected]>
For some of the things I'm looking at, this is actually inconvenient, but I
guess I can take an extra ref if I need it.
Reviewed-by: David Howells <[email protected]>
Matthew Wilcox (Oracle) <[email protected]> wrote:
> This nets us 178 bytes of savings from removing calls to compound_head.
> The three callers all grow a little, but each of them will be converted
> to use folios soon, so that's fine.
>
> Signed-off-by: Matthew Wilcox (Oracle) <[email protected]>
> Reviewed-by: Christoph Hellwig <[email protected]>
Reviewed-by: David Howells <[email protected]>
Matthew Wilcox (Oracle) <[email protected]> wrote:
> This is the folio equivalent of page_evictable(). Unfortunately, it's
> different from !folio_test_unevictable(), but I think it's used in places
> where you have to be a VM expert and can reasonably be expected to know
> the difference.
It would be useful to say how it is different. I'm guessing it's because a
page is always entirely mlocked or not, but a folio might be partially
mlocked?
David
Matthew Wilcox (Oracle) <[email protected]> wrote:
> * looking at the same page) and the evictable page will be stranded
> * in an unevictable LRU.
Does that need converting to say 'folio'?
Other than that:
Reviewed-by: David Howells <[email protected]>
Matthew Wilcox (Oracle) <[email protected]> wrote:
> Reimplement lru_cache_add() as a wrapper around folio_add_lru().
> Saves 159 bytes of kernel text due to removing calls to compound_head().
>
> Signed-off-by: Matthew Wilcox (Oracle) <[email protected]>
> Reviewed-by: Christoph Hellwig <[email protected]>
Reviewed-by: David Howells <[email protected]>
Matthew Wilcox (Oracle) <[email protected]> wrote:
> +struct folio *folio_alloc(gfp_t gfp, unsigned order)
> +{
> + struct page *page = alloc_pages(gfp | __GFP_COMP, order);
> +
> + if (page && order > 1)
> + prep_transhuge_page(page);
Ummm... Shouldn't order==1 pages (two page folios) be prep'd also?
> + return (struct folio *)page;
> +}
Would it be better to just jump to alloc_pages() if order <= 1? E.g.:
struct folio *folio_alloc(gfp_t gfp, unsigned order)
{
	struct page *page;
	if (order <= 1)
		return (struct folio *)alloc_pages(gfp | __GFP_COMP, order);
	page = alloc_pages(gfp | __GFP_COMP, order);
	if (page)
		prep_transhuge_page(page);
	return (struct folio *)page;
}
David
Matthew Wilcox (Oracle) <[email protected]> wrote:
> Reimplement __page_cache_alloc as a wrapper around filemap_alloc_folio
> to allow filesystems to be converted at our leisure. Increases
> kernel text size by 133 bytes, mostly in cachefiles_read_backing_file().
> pagecache_get_page() shrinks by 32 bytes, though.
>
> Signed-off-by: Matthew Wilcox (Oracle) <[email protected]>
> Reviewed-by: Christoph Hellwig <[email protected]>
> ...
> +static inline struct page *__page_cache_alloc(gfp_t gfp)
> +{
> + return &filemap_alloc_folio(gfp, 0)->page;
> +}
Might be worth a note that this *will* return NULL if the allocation fails,
though I guess it's deprecated?
Reviewed-by: David Howells <[email protected]>
Matthew Wilcox (Oracle) <[email protected]> wrote:
> Convert __add_to_page_cache_locked() into __filemap_add_folio().
> Add an assertion to it that (for !hugetlbfs), the folio is naturally
> aligned within the file. Move the prototype from mm.h to pagemap.h.
> Convert add_to_page_cache_lru() into filemap_add_folio(). Add a
> compatibility wrapper for unconverted callers.
>
> Signed-off-by: Matthew Wilcox (Oracle) <[email protected]>
> Reviewed-by: Christoph Hellwig <[email protected]>
Reviewed-by: David Howells <[email protected]>
Matthew Wilcox (Oracle) <[email protected]> wrote:
> The pagecache only contains folios, so indicate that this is definitely
> not a tail page. Shrinks mapping_get_entry() by 56 bytes, but grows
> pagecache_get_page() by 21 bytes as gcc makes slightly different hot/cold
> code decisions. A net reduction of 35 bytes of text.
>
> Signed-off-by: Matthew Wilcox (Oracle) <[email protected]>
> Reviewed-by: Christoph Hellwig <[email protected]>
Reviewed-by: David Howells <[email protected]>
Matthew Wilcox (Oracle) <[email protected]> wrote:
> filemap_get_folio() is a replacement for find_get_page().
> Turn pagecache_get_page() into a wrapper around __filemap_get_folio().
> Remove find_lock_head() as this use case is now covered by
> filemap_get_folio().
>
> Reduces overall kernel size by 209 bytes. __filemap_get_folio() is
> 316 bytes shorter than pagecache_get_page() was, but the new
> pagecache_get_page() is 99 bytes
longer, one presumes.
> .
>
> Signed-off-by: Matthew Wilcox (Oracle) <[email protected]>
Reviewed-by: David Howells <[email protected]>
Matthew Wilcox (Oracle) <[email protected]> wrote:
> Allow filemap_get_folio() to wait for writeback to complete (if the
> filesystem wants that behaviour). This is the folio equivalent of
> grab_cache_page_write_begin(), which is moved into the folio-compat
> file as a reminder to migrate all the code using it. This paves the
> way for getting rid of AOP_FLAG_NOFS once grab_cache_page_write_begin()
> is removed.
>
> Kernel grows by 11 bytes. filemap_get_folio() grows by 33 bytes but
> grab_cache_page_write_begin() shrinks by 22 bytes to make up for it.
>
> Signed-off-by: Matthew Wilcox (Oracle) <[email protected]>
Reviewed-by: David Howells <[email protected]>
On Tue, Aug 10, 2021 at 10:09:25PM +0100, David Howells wrote:
> Matthew Wilcox (Oracle) <[email protected]> wrote:
>
> > + if (folio_test_error(folio))
> > + folio_set_error(newfolio);
> > + if (folio_test_referenced(folio))
> > + folio_set_referenced(newfolio);
> > + if (folio_test_uptodate(folio))
> > + folio_mark_uptodate(newfolio);
> > + if (folio_test_clear_active(folio)) {
> > + VM_BUG_ON_FOLIO(folio_test_unevictable(folio), folio);
> > + folio_set_active(newfolio);
> > + } else if (folio_test_clear_unevictable(folio))
> > + folio_set_unevictable(newfolio);
> > + if (folio_test_workingset(folio))
> > + folio_set_workingset(newfolio);
> > + if (folio_test_checked(folio))
> > + folio_set_checked(newfolio);
> > + if (folio_test_mappedtodisk(folio))
> > + folio_set_mappedtodisk(newfolio);
>
> Since a bunch of these are bits in folio->flags and newfolio->flags, I wonder
> if it's better to use a cmpxchg() loop or LL/SC construct to transfer all
> the relevant flags in one go.
I have plans for that, but they're on hold until the folio work is a bit
further progressed. It also helps code that does something like:
if (folio_test_dirty(folio) || folio_test_writeback(folio))
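For example, a single load of the flags word could answer both tests at
once; a hypothetical helper (the name and the exact pair of bits are made
up here) might look like:
static inline bool folio_dirty_or_writeback(struct folio *folio)
{
	unsigned long flags = READ_ONCE(folio->flags);
	/* One read of folio->flags instead of two folio_test_*() calls. */
	return flags & ((1UL << PG_dirty) | (1UL << PG_writeback));
}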
On 8/10/21 7:43 PM, Matthew Wilcox wrote:
> On Tue, Aug 10, 2021 at 06:01:16PM +0200, Vlastimil Babka wrote:
>> Actually looking at the git version, which has also this:
>>
>> static __always_inline void update_lru_size(struct lruvec *lruvec,
>> enum lru_list lru, enum zone_type zid,
>> - int nr_pages)
>> + long nr_pages)
>> {
>>
>> Why now and here? Some of the functions called from update_lru_size()
>> still take int so this looks arbitrary?
>
> I'm still a little freaked out about the lack of warning for:
>
> void f(long n);
> void g(unsigned int n) { f(-n); }
>
> so I've decided that the count of pages in a folio is always of type
> long. The actual number is positive, and currently it's between 1 and
> 1024 (inclusive on both bounds), so it's always going to be
> representable in an int. Narrowing it doesn't cause a bug, so we don't
> need to change nr_pages anywhere, but it does no harm to make functions
> take a long instead of an int (it may even cause slightly better code
> generation, based on the sample of functions I've looked at).
>
> Maybe changing update_lru_size() in this patch is wrong. I can drop it
> if you like.
It's fine, knowing it wasn't some rebasing error.
On 7/15/21 5:35 AM, Matthew Wilcox (Oracle) wrote:
> By using the node id in mem_cgroup_update_tree(), we can delete
> soft_limit_tree_from_page() and mem_cgroup_page_nodeinfo(). Saves 42
> bytes of kernel text on my config.
>
> Signed-off-by: Matthew Wilcox (Oracle) <[email protected]>
> Acked-by: Michal Hocko <[email protected]>
> Acked-by: Johannes Weiner <[email protected]>
> Reviewed-by: Christoph Hellwig <[email protected]>
Acked-by: Vlastimil Babka <[email protected]>
On 7/15/21 5:35 AM, Matthew Wilcox (Oracle) wrote:
> Opencode this one-line function in its three callers.
>
> Signed-off-by: Matthew Wilcox (Oracle) <[email protected]>
> Acked-by: Michal Hocko <[email protected]>
> Acked-by: Johannes Weiner <[email protected]>
> Reviewed-by: Christoph Hellwig <[email protected]>
Acked-by: Vlastimil Babka <[email protected]>
On 7/15/21 5:35 AM, Matthew Wilcox (Oracle) wrote:
> memcg_check_events only uses the page's nid, so call page_to_nid in the
> callers to make the interface easier to understand.
>
> Signed-off-by: Matthew Wilcox (Oracle) <[email protected]>
> Acked-by: Michal Hocko <[email protected]>
> Reviewed-by: Christoph Hellwig <[email protected]>
Acked-by: Vlastimil Babka <[email protected]>
On 7/15/21 5:35 AM, Matthew Wilcox (Oracle) wrote:
> memcg information is only stored in the head page, so the memcg
> subsystem needs to assure that all accesses are to the head page.
> The first step is converting page_memcg() to folio_memcg().
>
> The callers of page_memcg() and PageMemcgKmem() are not yet ready to be
> converted to use folios, so retain them as wrappers around folio_memcg()
> and folio_memcg_kmem(). They will be converted in a later patch set.
>
> Signed-off-by: Matthew Wilcox (Oracle) <[email protected]>
> Reviewed-by: Christoph Hellwig <[email protected]>
Acked-by: Vlastimil Babka <[email protected]>
Nit:
> ---
> include/linux/memcontrol.h | 109 ++++++++++++++++++++++---------------
> mm/memcontrol.c | 21 ++++---
> 2 files changed, 77 insertions(+), 53 deletions(-)
>
> diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
> index bfe5c486f4ad..eabae5874161 100644
> --- a/include/linux/memcontrol.h
> +++ b/include/linux/memcontrol.h
> @@ -372,6 +372,7 @@ enum page_memcg_data_flags {
> #define MEMCG_DATA_FLAGS_MASK (__NR_MEMCG_DATA_FLAGS - 1)
>
> static inline bool PageMemcgKmem(struct page *page);
I think this fwd declaration is no longer needed.
> +static inline bool folio_memcg_kmem(struct folio *folio);
>
On 7/15/21 5:35 AM, Matthew Wilcox (Oracle) wrote:
> The memcg_data is only set on the head page, so enforce that by
> typing it as a folio.
>
> Signed-off-by: Matthew Wilcox (Oracle) <[email protected]>
> Reviewed-by: Christoph Hellwig <[email protected]>
> Acked-by: Michal Hocko <[email protected]>
Acked-by: Vlastimil Babka <[email protected]>
On 7/15/21 5:35 AM, Matthew Wilcox (Oracle) wrote:
> Convert all callers of mem_cgroup_charge() to call page_folio() on the
> page they're currently passing in. Many of them will be converted to
> use folios themselves soon.
>
> Signed-off-by: Matthew Wilcox (Oracle) <[email protected]>
> Reviewed-by: Christoph Hellwig <[email protected]>
Acked-by: Vlastimil Babka <[email protected]>
> diff --git a/mm/memcontrol.c b/mm/memcontrol.c
> index c2ffad021e09..03283d97b62a 100644
> --- a/mm/memcontrol.c
> +++ b/mm/memcontrol.c
> @@ -6681,10 +6681,9 @@ void mem_cgroup_calculate_protection(struct mem_cgroup *root,
> atomic_long_read(&parent->memory.children_low_usage)));
> }
>
> -static int __mem_cgroup_charge(struct page *page, struct mem_cgroup *memcg,
> +static int __mem_cgroup_charge(struct folio *folio, struct mem_cgroup *memcg,
> gfp_t gfp)
The git/next version also renames this function to charge_memcg(), why? The new
name doesn't look as internal as the old one did. I don't have a strong opinion
but I'm CCing the memcg maintainers, who might.
> {
> - struct folio *folio = page_folio(page);
> unsigned int nr_pages = folio_nr_pages(folio);
> int ret;
>
> @@ -6697,27 +6696,27 @@ static int __mem_cgroup_charge(struct page *page, struct mem_cgroup *memcg,
>
> local_irq_disable();
> mem_cgroup_charge_statistics(memcg, nr_pages);
> - memcg_check_events(memcg, page_to_nid(page));
> + memcg_check_events(memcg, folio_nid(folio));
> local_irq_enable();
> out:
> return ret;
> }
>
Matthew Wilcox (Oracle) <[email protected]> wrote:
> +/**
> + * readahead_folio - Get the next folio to read.
> + * @ractl: The current readahead request.
> + *
> + * Context: The folio is locked. The caller should unlock the folio once
> + * all I/O to that folio has completed.
> + * Return: A pointer to the next folio, or %NULL if we are done.
> + */
> +static inline struct folio *readahead_folio(struct readahead_control *ractl)
> +{
> + struct folio *folio = __readahead_folio(ractl);
>
> - return page;
> + folio_put(folio);
This will oops if __readahead_folio() returns NULL.
> + return folio;
> }
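One possible guard, sketched under the assumption that __readahead_folio()
returning NULL simply marks the end of the batch:
static inline struct folio *readahead_folio(struct readahead_control *ractl)
{
	struct folio *folio = __readahead_folio(ractl);
	/* Only drop the reference if we actually got a folio back. */
	if (folio)
		folio_put(folio);
	return folio;
}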
On 7/15/21 5:35 AM, Matthew Wilcox (Oracle) wrote:
> Convert all the callers to call page_folio(). Most of them were already
> using a head page, but a few of them I can't prove were, so this may
> actually fix a bug.
>
> Signed-off-by: Matthew Wilcox (Oracle) <[email protected]>
> Reviewed-by: Christoph Hellwig <[email protected]>
Acked-by: Vlastimil Babka <[email protected]>
On 7/15/21 5:35 AM, Matthew Wilcox (Oracle) wrote:
> Use a folio rather than a page to ensure that we're only operating on
> base or head pages, and not tail pages.
>
> Signed-off-by: Matthew Wilcox (Oracle) <[email protected]>
> Reviewed-by: Christoph Hellwig <[email protected]>
Acked-by: Vlastimil Babka <[email protected]>
On 7/15/21 5:35 AM, Matthew Wilcox (Oracle) wrote:
> Convert all callers of mem_cgroup_migrate() to call page_folio() first.
> They all look like they're using head pages already, but this proves it.
>
> Signed-off-by: Matthew Wilcox (Oracle) <[email protected]>
> Reviewed-by: Christoph Hellwig <[email protected]>
Acked-by: Vlastimil Babka <[email protected]>
On 7/15/21 5:35 AM, Matthew Wilcox (Oracle) wrote:
> These are the folio equivalents of lock_page_memcg() and
> unlock_page_memcg().
>
> lock_page_memcg() and unlock_page_memcg() have too many callers to be
> easily replaced in a single patch, so reimplement them as wrappers for
> now to be cleaned up later when enough callers have been converted to
> use folios.
>
> Signed-off-by: Matthew Wilcox (Oracle) <[email protected]>
> Reviewed-by: Christoph Hellwig <[email protected]>
Acked-by: Vlastimil Babka <[email protected]>
On 7/15/21 5:35 AM, Matthew Wilcox (Oracle) wrote:
> This saves dozens of bytes of text by eliminating a lot of calls to
> compound_head().
>
> Signed-off-by: Matthew Wilcox (Oracle) <[email protected]>
> Reviewed-by: Christoph Hellwig <[email protected]>
Acked-by: Vlastimil Babka <[email protected]>
On 7/15/21 5:35 AM, Matthew Wilcox (Oracle) wrote:
> This replaces mem_cgroup_page_lruvec(). All callers converted.
>
> Signed-off-by: Matthew Wilcox (Oracle) <[email protected]>
> Reviewed-by: Christoph Hellwig <[email protected]>
Acked-by: Vlastimil Babka <[email protected]>
On 7/15/21 5:35 AM, Matthew Wilcox (Oracle) wrote:
> These are the folio equivalents of relock_page_lruvec_irq() and
> folio_lruvec_relock_irqsave(). Also convert page_matches_lruvec()
> to folio_matches_lruvec().
>
> Signed-off-by: Matthew Wilcox (Oracle) <[email protected]>
> Reviewed-by: Christoph Hellwig <[email protected]>
Acked-by: Vlastimil Babka <[email protected]>
On 7/15/21 5:35 AM, Matthew Wilcox (Oracle) wrote:
> This function already assumed it was being passed a head page. No real
> change here, except that thp_nr_pages() compiles away on kernels with
> THP compiled out while folio_nr_pages() is always present. Also convert
> page_memcg_rcu() to folio_memcg_rcu().
>
> Signed-off-by: Matthew Wilcox (Oracle) <[email protected]>
> Reviewed-by: Christoph Hellwig <[email protected]>
Acked-by: Vlastimil Babka <[email protected]>
On 7/15/21 5:35 AM, Matthew Wilcox (Oracle) wrote:
> This is the folio equivalent of page_to_pfn().
>
> Signed-off-by: Matthew Wilcox (Oracle) <[email protected]>
Acked-by: Vlastimil Babka <[email protected]>
On 7/15/21 5:35 AM, Matthew Wilcox (Oracle) wrote:
> Convert __page_rmapping to folio_raw_mapping and move it to mm/internal.h.
> It's only a couple of instructions (load and mask), so it's definitely
> going to be cheaper to inline it than call it. Leave page_rmapping
> out of line.
Maybe mention the page_anon_vma() in changelog too?
> Signed-off-by: Matthew Wilcox (Oracle) <[email protected]>
Acked-by: Vlastimil Babka <[email protected]>
On 7/15/21 5:35 AM, Matthew Wilcox (Oracle) wrote:
> This is a default implementation which calls flush_dcache_page() on
> each page in the folio. If architectures can do better, they should
> implement their own version of it.
>
> Signed-off-by: Matthew Wilcox (Oracle) <[email protected]>
Acked-by: Vlastimil Babka <[email protected]>
On 7/15/21 5:35 AM, Matthew Wilcox (Oracle) wrote:
> This allows us to map a portion of a folio. Callers can only expect
> to access up to the next page boundary.
>
> Signed-off-by: Matthew Wilcox (Oracle) <[email protected]>
> Reviewed-by: Christoph Hellwig <[email protected]>
Acked-by: Vlastimil Babka <[email protected]>
On 7/15/21 5:35 AM, Matthew Wilcox (Oracle) wrote:
> As a default implementation, call arch_make_page_accessible n times.
> If an architecture can do better, it can override this.
>
> Also move the default implementation of arch_make_page_accessible()
> from gfp.h to mm.h.
>
> Signed-off-by: Matthew Wilcox (Oracle) <[email protected]>
Acked-by: Vlastimil Babka <[email protected]>
On 7/15/21 5:35 AM, Matthew Wilcox (Oracle) wrote:
> This replaces activate_page() and eliminates lots of calls to
> compound_head(). Saves net 118 bytes of kernel text. There are still
> some redundant calls to page_folio() here which will be removed when
> pagevec_lru_move_fn() is converted to use folios.
>
> Signed-off-by: Matthew Wilcox (Oracle) <[email protected]>
> Reviewed-by: Christoph Hellwig <[email protected]>
Acked-by: Vlastimil Babka <[email protected]>
On 7/15/21 5:35 AM, Matthew Wilcox (Oracle) wrote:
> Convert mark_page_accessed() to folio_mark_accessed(). It already
> operated on the entire compound page, but now we can avoid calling
> compound_head quite so many times. Shrinks the function from 424 bytes
> to 295 bytes (shrinking by 129 bytes). The compatibility wrapper is 30
> bytes, plus the 8 bytes for the exported symbol means the kernel shrinks
> by 91 bytes.
>
> Signed-off-by: Matthew Wilcox (Oracle) <[email protected]>
Acked-by: Vlastimil Babka <[email protected]>
Question below:
> @@ -430,36 +430,34 @@ static void __lru_cache_activate_page(struct page *page)
> * When a newly allocated page is not yet visible, so safe for non-atomic ops,
> * __SetPageReferenced(page) may be substituted for mark_page_accessed(page).
> */
The other patches converting whole functions also rewrote the comments to be
about folios, but not this one?
> -void mark_page_accessed(struct page *page)
> +void folio_mark_accessed(struct folio *folio)
> {
> - page = compound_head(page);
> -
> - if (!PageReferenced(page)) {
> - SetPageReferenced(page);
> - } else if (PageUnevictable(page)) {
> + if (!folio_test_referenced(folio)) {
> + folio_set_referenced(folio);
> + } else if (folio_test_unevictable(folio)) {
> /*
> * Unevictable pages are on the "LRU_UNEVICTABLE" list. But,
> * this list is never rotated or maintained, so marking an
> * evictable page accessed has no effect.
> */
These comments too?
> - } else if (!PageActive(page)) {
> + } else if (!folio_test_active(folio)) {
> /*
> * If the page is on the LRU, queue it for activation via
> * lru_pvecs.activate_page. Otherwise, assume the page is on a
> * pagevec, mark it active and it'll be moved to the active
> * LRU on the next drain.
> */
> - if (PageLRU(page))
> - folio_activate(page_folio(page));
> + if (folio_test_lru(folio))
> + folio_activate(folio);
> else
> - __lru_cache_activate_page(page);
> - ClearPageReferenced(page);
> - workingset_activation(page_folio(page));
> + __lru_cache_activate_folio(folio);
> + folio_clear_referenced(folio);
> + workingset_activation(folio);
> }
> - if (page_is_idle(page))
> - clear_page_idle(page);
> + if (folio_test_idle(folio))
> + folio_clear_idle(folio);
> }
> -EXPORT_SYMBOL(mark_page_accessed);
> +EXPORT_SYMBOL(folio_mark_accessed);
>
> /**
> * lru_cache_add - add a page to a page list
>
On 7/15/21 5:35 AM, Matthew Wilcox (Oracle) wrote:
> Transform page_mkclean() into folio_mkclean() and add a page_mkclean()
> wrapper around folio_mkclean().
>
> folio_mkclean is 15 bytes smaller than page_mkclean, but the kernel
> is enlarged by 33 bytes due to inlining page_folio() into each caller.
> This will go away once the callers are converted to use folio_mkclean().
>
> Signed-off-by: Matthew Wilcox (Oracle) <[email protected]>
> Reviewed-by: Christoph Hellwig <[email protected]>
Acked-by: Vlastimil Babka <[email protected]>
On 7/15/21 5:35 AM, Matthew Wilcox (Oracle) wrote:
> Reimplement migrate_page_move_mapping() as a wrapper around
> folio_migrate_mapping(). Saves 193 bytes of kernel text.
>
> Signed-off-by: Matthew Wilcox (Oracle) <[email protected]>
> Reviewed-by: Christoph Hellwig <[email protected]>
Acked-by: Vlastimil Babka <[email protected]>
On 7/15/21 5:35 AM, Matthew Wilcox (Oracle) wrote:
> Turn migrate_page_states() into a wrapper around folio_migrate_flags().
> Also convert two functions only called from folio_migrate_flags() to
> be folio-based. ksm_migrate_page() becomes folio_migrate_ksm() and
> copy_page_owner() becomes folio_copy_owner(). folio_migrate_flags()
> alone shrinks by two thirds -- 1967 bytes down to 642 bytes.
>
> Signed-off-by: Matthew Wilcox (Oracle) <[email protected]>
After fixing the bug below,
Acked-by: Vlastimil Babka <[email protected]>
> --- a/include/linux/migrate.h
> +++ b/include/linux/migrate.h
...
> @@ -36,10 +36,10 @@ static inline void split_page_owner(struct page *page, unsigned int nr)
> if (static_branch_unlikely(&page_owner_inited))
> __split_page_owner(page, nr);
> }
> -static inline void copy_page_owner(struct page *oldpage, struct page *newpage)
> +static inline void folio_copy_owner(struct folio *newfolio, struct folio *old)
This changed the argument order so that new is first.
...
> --- a/mm/migrate.c
> +++ b/mm/migrate.c
> @@ -538,82 +538,80 @@ int migrate_huge_page_move_mapping(struct address_space *mapping,
> }
>
> /*
> - * Copy the page to its new location
> + * Copy the flags and some other ancillary information
> */
> -void migrate_page_states(struct page *newpage, struct page *page)
> +void folio_migrate_flags(struct folio *newfolio, struct folio *folio)
> {
...
> - copy_page_owner(page, newpage);
> + folio_copy_owner(folio, newfolio);
This passes old first.
>
> - if (!PageHuge(page))
> + if (!folio_test_hugetlb(folio))
> mem_cgroup_migrate(folio, newfolio);
> }
> -EXPORT_SYMBOL(migrate_page_states);
> +EXPORT_SYMBOL(folio_migrate_flags);
On 7/15/21 5:35 AM, Matthew Wilcox (Oracle) wrote:
> This is the folio equivalent of migrate_page_copy(), which is retained
> as a wrapper for filesystems which are not yet converted to folios.
> Also convert copy_huge_page() to folio_copy().
>
> Signed-off-by: Matthew Wilcox (Oracle) <[email protected]>
Acked-by: Vlastimil Babka <[email protected]>
The way folio_copy() avoids cond_resched() for a single-page folio would IMHO
deserve a comment though, so it's not buried only in this thread.
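A sketch of what such a comment could say, assuming the loop continues past
the hunk quoted earlier in this thread by incrementing and breaking once all
pages are copied:
void folio_copy(struct folio *dst, struct folio *src)
{
	long i = 0;
	long nr = folio_nr_pages(src);
	for (;;) {
		copy_highpage(folio_page(dst, i), folio_page(src, i));
		if (++i == nr)
			break;
		/*
		 * Copying a large folio can take a while, so yield between
		 * pages.  A single-page folio never reaches this point and
		 * therefore pays no cond_resched() cost.
		 */
		cond_resched();
	}
}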
On 7/15/21 5:35 AM, Matthew Wilcox (Oracle) wrote:
> Make this look like the newly renamed vmstat functions.
>
> Signed-off-by: Matthew Wilcox (Oracle) <[email protected]>
> Reviewed-by: Christoph Hellwig <[email protected]>
Acked-by: Vlastimil Babka <[email protected]>
> ---
> include/linux/backing-dev.h | 6 +++---
> 1 file changed, 3 insertions(+), 3 deletions(-)
>
> diff --git a/include/linux/backing-dev.h b/include/linux/backing-dev.h
> index 44df4fcef65c..a852876bb6e2 100644
> --- a/include/linux/backing-dev.h
> +++ b/include/linux/backing-dev.h
> @@ -64,7 +64,7 @@ static inline bool bdi_has_dirty_io(struct backing_dev_info *bdi)
> return atomic_long_read(&bdi->tot_write_bandwidth);
> }
>
> -static inline void __add_wb_stat(struct bdi_writeback *wb,
> +static inline void wb_stat_mod(struct bdi_writeback *wb,
> enum wb_stat_item item, s64 amount)
> {
> percpu_counter_add_batch(&wb->stat[item], amount, WB_STAT_BATCH);
> @@ -72,12 +72,12 @@ static inline void __add_wb_stat(struct bdi_writeback *wb,
>
> static inline void inc_wb_stat(struct bdi_writeback *wb, enum wb_stat_item item)
> {
> - __add_wb_stat(wb, item, 1);
> + wb_stat_mod(wb, item, 1);
> }
>
> static inline void dec_wb_stat(struct bdi_writeback *wb, enum wb_stat_item item)
> {
> - __add_wb_stat(wb, item, -1);
> + wb_stat_mod(wb, item, -1);
> }
>
> static inline s64 wb_stat(struct bdi_writeback *wb, enum wb_stat_item item)
>
On 7/15/21 5:35 AM, Matthew Wilcox (Oracle) wrote:
> test_clear_page_writeback() is actually an mm-internal function, although
> it's named as if it's a pagecache function. Move it to mm/internal.h,
> rename it to __folio_end_writeback() and change the return type to bool.
>
> The conversion from page to folio is mostly about accounting the number
> of pages being written back, although it does eliminate a couple of
> calls to compound_head().
>
> Signed-off-by: Matthew Wilcox (Oracle) <[email protected]>
> Reviewed-by: Christoph Hellwig <[email protected]>
Acked-by: Vlastimil Babka <[email protected]>
> ---
> include/linux/page-flags.h | 1 -
> mm/filemap.c | 2 +-
> mm/internal.h | 1 +
> mm/page-writeback.c | 29 +++++++++++++++--------------
> 4 files changed, 17 insertions(+), 16 deletions(-)
>
> diff --git a/include/linux/page-flags.h b/include/linux/page-flags.h
> index ddb660688086..6f9d1f26b1ef 100644
> --- a/include/linux/page-flags.h
> +++ b/include/linux/page-flags.h
> @@ -655,7 +655,6 @@ static __always_inline void SetPageUptodate(struct page *page)
>
> CLEARPAGEFLAG(Uptodate, uptodate, PF_NO_TAIL)
>
> -int test_clear_page_writeback(struct page *page);
> int __test_set_page_writeback(struct page *page, bool keep_write);
>
> #define test_set_page_writeback(page) \
> diff --git a/mm/filemap.c b/mm/filemap.c
> index 5c4e3185ecb3..a74c69a938ab 100644
> --- a/mm/filemap.c
> +++ b/mm/filemap.c
> @@ -1535,7 +1535,7 @@ void folio_end_writeback(struct folio *folio)
> * reused before the folio_wake().
> */
> folio_get(folio);
> - if (!test_clear_page_writeback(&folio->page))
> + if (!__folio_end_writeback(folio))
> BUG();
>
> smp_mb__after_atomic();
> diff --git a/mm/internal.h b/mm/internal.h
> index fa31a7f0ed79..08e8a28994d1 100644
> --- a/mm/internal.h
> +++ b/mm/internal.h
> @@ -43,6 +43,7 @@ static inline void *folio_raw_mapping(struct folio *folio)
>
> vm_fault_t do_swap_page(struct vm_fault *vmf);
> void folio_rotate_reclaimable(struct folio *folio);
> +bool __folio_end_writeback(struct folio *folio);
>
> void free_pgtables(struct mmu_gather *tlb, struct vm_area_struct *start_vma,
> unsigned long floor, unsigned long ceiling);
> diff --git a/mm/page-writeback.c b/mm/page-writeback.c
> index e542ea37d605..8d5d7921b157 100644
> --- a/mm/page-writeback.c
> +++ b/mm/page-writeback.c
> @@ -583,7 +583,7 @@ static void wb_domain_writeout_add(struct wb_domain *dom,
>
> /*
> * Increment @wb's writeout completion count and the global writeout
> - * completion count. Called from test_clear_page_writeback().
> + * completion count. Called from __folio_end_writeback().
> */
> static inline void __wb_writeout_add(struct bdi_writeback *wb, long nr)
> {
> @@ -2731,27 +2731,28 @@ int clear_page_dirty_for_io(struct page *page)
> }
> EXPORT_SYMBOL(clear_page_dirty_for_io);
>
> -int test_clear_page_writeback(struct page *page)
> +bool __folio_end_writeback(struct folio *folio)
> {
> - struct address_space *mapping = page_mapping(page);
> - int ret;
> + long nr = folio_nr_pages(folio);
> + struct address_space *mapping = folio_mapping(folio);
> + bool ret;
>
> - lock_page_memcg(page);
> + folio_memcg_lock(folio);
> if (mapping && mapping_use_writeback_tags(mapping)) {
> struct inode *inode = mapping->host;
> struct backing_dev_info *bdi = inode_to_bdi(inode);
> unsigned long flags;
>
> xa_lock_irqsave(&mapping->i_pages, flags);
> - ret = TestClearPageWriteback(page);
> + ret = folio_test_clear_writeback(folio);
> if (ret) {
> - __xa_clear_mark(&mapping->i_pages, page_index(page),
> + __xa_clear_mark(&mapping->i_pages, folio_index(folio),
> PAGECACHE_TAG_WRITEBACK);
> if (bdi->capabilities & BDI_CAP_WRITEBACK_ACCT) {
> struct bdi_writeback *wb = inode_to_wb(inode);
>
> - dec_wb_stat(wb, WB_WRITEBACK);
> - __wb_writeout_add(wb, 1);
> + wb_stat_mod(wb, WB_WRITEBACK, -nr);
> + __wb_writeout_add(wb, nr);
> }
> }
>
> @@ -2761,14 +2762,14 @@ int test_clear_page_writeback(struct page *page)
>
> xa_unlock_irqrestore(&mapping->i_pages, flags);
> } else {
> - ret = TestClearPageWriteback(page);
> + ret = folio_test_clear_writeback(folio);
> }
> if (ret) {
> - dec_lruvec_page_state(page, NR_WRITEBACK);
> - dec_zone_page_state(page, NR_ZONE_WRITE_PENDING);
> - inc_node_page_state(page, NR_WRITTEN);
> + lruvec_stat_mod_folio(folio, NR_WRITEBACK, -nr);
> + zone_stat_mod_folio(folio, NR_ZONE_WRITE_PENDING, -nr);
> + node_stat_mod_folio(folio, NR_WRITTEN, nr);
> }
> - unlock_page_memcg(page);
> + folio_memcg_unlock(folio);
> return ret;
> }
>
>
On 7/15/21 5:35 AM, Matthew Wilcox (Oracle) wrote:
> Rename set_page_writeback() to folio_start_writeback() to match
> folio_end_writeback(). Do not bother with wrappers that return void;
> callers are perfectly capable of ignoring return values.
>
> Add wrappers for set_page_writeback(), set_page_writeback_keepwrite() and
> test_set_page_writeback() for compatibility with existing filesystems.
> The main advantage of this patch is getting the statistics right,
> although it does eliminate a couple of calls to compound_head().
>
> Signed-off-by: Matthew Wilcox (Oracle) <[email protected]>
> Reviewed-by: Christoph Hellwig <[email protected]>
Acked-by: Vlastimil Babka <[email protected]>
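A rough sketch of what one of those compatibility wrappers looks like (the
in-tree version may differ in detail):

	/* Legacy entry point for filesystems not yet converted to folios. */
	bool set_page_writeback(struct page *page)
	{
		return folio_start_writeback(page_folio(page));
	}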
On 7/15/21 5:35 AM, Matthew Wilcox (Oracle) wrote:
> Turn __set_page_dirty() into a wrapper around __folio_mark_dirty().
> Convert account_page_dirtied() into folio_account_dirtied() and account
> the number of pages in the folio to support multi-page folios.
>
> Signed-off-by: Matthew Wilcox (Oracle) <[email protected]>
Acked-by: Vlastimil Babka <[email protected]>
On 7/15/21 5:35 AM, Matthew Wilcox (Oracle) wrote:
> Rename writeback_dirty_page() to writeback_dirty_folio() and
> wait_on_page_writeback() to folio_wait_writeback().
>
> Signed-off-by: Matthew Wilcox (Oracle) <[email protected]>
Acked-by: Vlastimil Babka <[email protected]>
On 7/15/21 5:35 AM, Matthew Wilcox (Oracle) wrote:
> Get the statistics right; compound pages were being accounted as a
> single page. This didn't matter before now as no filesystem which
> supported compound pages did writeback. Also move the declaration
> to filemap.h since this is part of the page cache. Add a wrapper for
Seems to be pagemap.h :)
> account_page_cleaned().
>
> Signed-off-by: Matthew Wilcox (Oracle) <[email protected]>
> Reviewed-by: Christoph Hellwig <[email protected]>
Acked-by: Vlastimil Babka <[email protected]>
Nit below:
> diff --git a/mm/page-writeback.c b/mm/page-writeback.c
> index bd97c461d499..792a83bd3917 100644
> --- a/mm/page-writeback.c
> +++ b/mm/page-writeback.c
> @@ -2453,14 +2453,15 @@ static void folio_account_dirtied(struct folio *folio,
> *
> * Caller must hold lock_page_memcg().
> */
> -void account_page_cleaned(struct page *page, struct address_space *mapping,
> +void folio_account_cleaned(struct folio *folio, struct address_space *mapping,
> struct bdi_writeback *wb)
> {
> if (mapping_can_writeback(mapping)) {
> - dec_lruvec_page_state(page, NR_FILE_DIRTY);
> - dec_zone_page_state(page, NR_ZONE_WRITE_PENDING);
> - dec_wb_stat(wb, WB_RECLAIMABLE);
> - task_io_account_cancelled_write(PAGE_SIZE);
> + long nr = folio_nr_pages(folio);
> + lruvec_stat_mod_folio(folio, NR_FILE_DIRTY, -nr);
> + zone_stat_mod_folio(folio, NR_ZONE_WRITE_PENDING, -nr);
> + wb_stat_mod(wb, WB_RECLAIMABLE, -nr);
> + task_io_account_cancelled_write(folio_size(folio));
In "mm/writeback: Add __folio_mark_dirty()" you used nr*PAGE_SIZE. Consistency?
> }
> }
>
>
On 7/15/21 5:35 AM, Matthew Wilcox (Oracle) wrote:
> Turn __cancel_dirty_page() into __folio_cancel_dirty() and add wrappers.
> Move the prototypes into pagemap.h since this is page cache functionality.
> Saves 44 bytes of kernel text in total; 33 bytes from __folio_cancel_dirty
> and 11 from two callers of cancel_dirty_page().
>
> Signed-off-by: Matthew Wilcox (Oracle) <[email protected]>
> Reviewed-by: Christoph Hellwig <[email protected]>
Acked-by: Vlastimil Babka <[email protected]>
On 7/15/21 5:36 AM, Matthew Wilcox (Oracle) wrote:
> Transform clear_page_dirty_for_io() into folio_clear_dirty_for_io()
> and add a compatibility wrapper. Also move the declaration to pagemap.h
> as this is page cache functionality that doesn't need to be used by the
> rest of the kernel.
>
> Increases the size of the kernel by 79 bytes. While we remove a few
> calls to compound_head(), we add a call to folio_nr_pages() to get the
> stats correct for the eventual support of multi-page folios.
>
> Signed-off-by: Matthew Wilcox (Oracle) <[email protected]>
> Reviewed-by: Christoph Hellwig <[email protected]>
Acked-by: Vlastimil Babka <[email protected]>
On 7/15/21 5:36 AM, Matthew Wilcox (Oracle) wrote:
> Reimplement redirty_page_for_writepage() as a wrapper around
> folio_redirty_for_writepage(). Account the number of pages in the
> folio, add kernel-doc and move the prototype to writeback.h.
>
> Signed-off-by: Matthew Wilcox (Oracle) <[email protected]>
> Reviewed-by: Christoph Hellwig <[email protected]>
Acked-by: Vlastimil Babka <[email protected]>
Nit:
> --- a/mm/page-writeback.c
> +++ b/mm/page-writeback.c
> @@ -2558,21 +2558,31 @@ void folio_account_redirty(struct folio *folio)
> }
> EXPORT_SYMBOL(folio_account_redirty);
>
> -/*
> - * When a writepage implementation decides that it doesn't want to write this
> - * page for some reason, it should redirty the locked page via
> - * redirty_page_for_writepage() and it should then unlock the page and return 0
> +/**
> + * folio_redirty_for_writepage - Decline to write a dirty folio.
> + * @wbc: The writeback control.
> + * @folio: The folio.
> + *
> + * When a writepage implementation decides that it doesn't want to write
> + * @folio for some reason, it should call this function, unlock @folio and
> + * return 0.
s/0/false
> + *
> + * Return: True if we redirtied the folio. False if someone else dirtied
> + * it first.
> */
> -int redirty_page_for_writepage(struct writeback_control *wbc, struct page *page)
> +bool folio_redirty_for_writepage(struct writeback_control *wbc,
> + struct folio *folio)
> {
> - int ret;
> + bool ret;
> + unsigned nr = folio_nr_pages(folio);
> +
> + wbc->pages_skipped += nr;
> + ret = filemap_dirty_folio(folio->mapping, folio);
> + folio_account_redirty(folio);
>
> - wbc->pages_skipped++;
> - ret = __set_page_dirty_nobuffers(page);
> - account_page_redirty(page);
> return ret;
> }
> -EXPORT_SYMBOL(redirty_page_for_writepage);
> +EXPORT_SYMBOL(folio_redirty_for_writepage);
>
> /**
> * folio_mark_dirty - Mark a folio as being modified.
>
On 7/15/21 5:35 AM, Matthew Wilcox (Oracle) wrote:
> Allow for accounting N pages at once instead of one page at a time.
>
> Signed-off-by: Matthew Wilcox (Oracle) <[email protected]>
> Reviewed-by: Christoph Hellwig <[email protected]>
> Reviewed-by: Jan Kara <[email protected]>
Acked-by: Vlastimil Babka <[email protected]>
> ---
> mm/page-writeback.c | 22 +++++++++++-----------
> 1 file changed, 11 insertions(+), 11 deletions(-)
>
> diff --git a/mm/page-writeback.c b/mm/page-writeback.c
> index f55f2ebdd9a9..e542ea37d605 100644
> --- a/mm/page-writeback.c
> +++ b/mm/page-writeback.c
> @@ -562,12 +562,12 @@ static unsigned long wp_next_time(unsigned long cur_time)
> return cur_time;
> }
>
> -static void wb_domain_writeout_inc(struct wb_domain *dom,
> +static void wb_domain_writeout_add(struct wb_domain *dom,
> struct fprop_local_percpu *completions,
> - unsigned int max_prop_frac)
> + unsigned int max_prop_frac, long nr)
> {
> __fprop_add_percpu_max(&dom->completions, completions,
> - max_prop_frac, 1);
> + max_prop_frac, nr);
> /* First event after period switching was turned off? */
> if (unlikely(!dom->period_time)) {
> /*
> @@ -585,18 +585,18 @@ static void wb_domain_writeout_inc(struct wb_domain *dom,
> * Increment @wb's writeout completion count and the global writeout
> * completion count. Called from test_clear_page_writeback().
> */
> -static inline void __wb_writeout_inc(struct bdi_writeback *wb)
> +static inline void __wb_writeout_add(struct bdi_writeback *wb, long nr)
> {
> struct wb_domain *cgdom;
>
> - inc_wb_stat(wb, WB_WRITTEN);
> - wb_domain_writeout_inc(&global_wb_domain, &wb->completions,
> - wb->bdi->max_prop_frac);
> + wb_stat_mod(wb, WB_WRITTEN, nr);
> + wb_domain_writeout_add(&global_wb_domain, &wb->completions,
> + wb->bdi->max_prop_frac, nr);
>
> cgdom = mem_cgroup_wb_domain(wb);
> if (cgdom)
> - wb_domain_writeout_inc(cgdom, wb_memcg_completions(wb),
> - wb->bdi->max_prop_frac);
> + wb_domain_writeout_add(cgdom, wb_memcg_completions(wb),
> + wb->bdi->max_prop_frac, nr);
> }
>
> void wb_writeout_inc(struct bdi_writeback *wb)
> @@ -604,7 +604,7 @@ void wb_writeout_inc(struct bdi_writeback *wb)
> unsigned long flags;
>
> local_irq_save(flags);
> - __wb_writeout_inc(wb);
> + __wb_writeout_add(wb, 1);
> local_irq_restore(flags);
> }
> EXPORT_SYMBOL_GPL(wb_writeout_inc);
> @@ -2751,7 +2751,7 @@ int test_clear_page_writeback(struct page *page)
> struct bdi_writeback *wb = inode_to_wb(inode);
>
> dec_wb_stat(wb, WB_WRITEBACK);
> - __wb_writeout_inc(wb);
> + __wb_writeout_add(wb, 1);
> }
> }
>
>
On 7/15/21 5:35 AM, Matthew Wilcox (Oracle) wrote:
> Reimplement set_page_dirty() as a wrapper around folio_mark_dirty().
> There is no change to filesystems as they were already being called
> with the compound_head of the page being marked dirty. We avoid
> several calls to compound_head(), both statically (through
> using folio_test_dirty() instead of PageDirty()) and dynamically (by
> calling folio_mapping() instead of page_mapping()).
>
> Also return bool instead of int to show the range of values actually
> returned, and add kernel-doc.
>
> Signed-off-by: Matthew Wilcox (Oracle) <[email protected]>
> Reviewed-by: Christoph Hellwig <[email protected]>
Acked-by: Vlastimil Babka <[email protected]>
On 7/15/21 5:36 AM, Matthew Wilcox (Oracle) wrote:
> Reimplement i_blocks_per_page() as a wrapper around i_blocks_per_folio().
>
> Signed-off-by: Matthew Wilcox (Oracle) <[email protected]>
> Reviewed-by: Christoph Hellwig <[email protected]>
Acked-by: Vlastimil Babka <[email protected]>
On 7/15/21 5:36 AM, Matthew Wilcox (Oracle) wrote:
> This nets us 178 bytes of savings from removing calls to compound_head.
> The three callers all grow a little, but each of them will be converted
> to use folios soon, so that's fine.
>
> Signed-off-by: Matthew Wilcox (Oracle) <[email protected]>
> Reviewed-by: Christoph Hellwig <[email protected]>
Acked-by: Vlastimil Babka <[email protected]>
On 7/15/21 5:36 AM, Matthew Wilcox (Oracle) wrote:
> Reimplement lru_cache_add() as a wrapper around folio_add_lru().
> Saves 159 bytes of kernel text due to removing calls to compound_head().
>
> Signed-off-by: Matthew Wilcox (Oracle) <[email protected]>
> Reviewed-by: Christoph Hellwig <[email protected]>
Acked-by: Vlastimil Babka <[email protected]>
On 7/15/21 5:36 AM, Matthew Wilcox (Oracle) wrote:
> This is the folio equivalent of page_evictable(). Unfortunately, it's
> different from !folio_test_unevictable(), but I think it's used in places
> where you have to be a VM expert and can reasonably be expected to know
> the difference.
>
> Signed-off-by: Matthew Wilcox (Oracle) <[email protected]>
> Reviewed-by: Christoph Hellwig <[email protected]>
Acked-by: Vlastimil Babka <[email protected]>
On 7/15/21 5:36 AM, Matthew Wilcox (Oracle) wrote:
> This saves five calls to compound_head(), totalling 60 bytes of text.
>
> Signed-off-by: Matthew Wilcox (Oracle) <[email protected]>
> Reviewed-by: Christoph Hellwig <[email protected]>
Acked-by: Vlastimil Babka <[email protected]>
On 7/15/21 5:36 AM, Matthew Wilcox (Oracle) wrote:
> Reimplement __page_cache_alloc as a wrapper around filemap_alloc_folio
> to allow filesystems to be converted at our leisure. Increases
> kernel text size by 133 bytes, mostly in cachefiles_read_backing_file().
> pagecache_get_page() shrinks by 32 bytes, though.
>
> Signed-off-by: Matthew Wilcox (Oracle) <[email protected]>
> Reviewed-by: Christoph Hellwig <[email protected]>
Acked-by: Vlastimil Babka <[email protected]>
On 7/15/21 5:36 AM, Matthew Wilcox (Oracle) wrote:
> The __folio_alloc(), __folio_alloc_node() and folio_alloc() functions
> are mostly for type safety, but they also ensure that the page allocator
> allocates a compound page and initialises the deferred list if the page
> is large enough to have one.
>
> Signed-off-by: Matthew Wilcox (Oracle) <[email protected]>
> Reviewed-by: Christoph Hellwig <[email protected]>
Acked-by: Vlastimil Babka <[email protected]>
On 7/15/21 5:36 AM, Matthew Wilcox (Oracle) wrote:
> The pagecache only contains folios, so indicate that this is definitely
> not a tail page. Shrinks mapping_get_entry() by 56 bytes, but grows
> pagecache_get_page() by 21 bytes as gcc makes slightly different hot/cold
> code decisions. A net reduction of 35 bytes of text.
>
> Signed-off-by: Matthew Wilcox (Oracle) <[email protected]>
> Reviewed-by: Christoph Hellwig <[email protected]>
Acked-by: Vlastimil Babka <[email protected]>
On 7/15/21 5:36 AM, Matthew Wilcox (Oracle) wrote:
> Allow filemap_get_folio() to wait for writeback to complete (if the
> filesystem wants that behaviour). This is the folio equivalent of
> grab_cache_page_write_begin(), which is moved into the folio-compat
> file as a reminder to migrate all the code using it. This paves the
> way for getting rid of AOP_FLAG_NOFS once grab_cache_page_write_begin()
> is removed.
>
> Kernel grows by 11 bytes. filemap_get_folio() grows by 33 bytes but
> grab_cache_page_write_begin() shrinks by 22 bytes to make up for it.
>
> Signed-off-by: Matthew Wilcox (Oracle) <[email protected]>
Acked-by: Vlastimil Babka <[email protected]>
On 7/15/21 5:35 AM, Matthew Wilcox (Oracle) wrote:
> Reimplement __set_page_dirty_nobuffers() as a wrapper around
> filemap_dirty_folio().
I assume it becomes obvious later why the new "mapping" parameter instead of
taking it from the folio, but maybe the changelog should say it here?
> Signed-off-by: Matthew Wilcox (Oracle) <[email protected]>
> Reviewed-by: Christoph Hellwig <[email protected]>
Acked-by: Vlastimil Babka <[email protected]>
On 7/15/21 5:36 AM, Matthew Wilcox (Oracle) wrote:
> Account the number of pages in the folio that we're redirtying.
> Turn account_page_redirty() into a wrapper around it. Also turn
> the comment on folio_account_redirty() into kernel-doc and
> edit it slightly so it makes sense to its potential callers.
>
> Signed-off-by: Matthew Wilcox (Oracle) <[email protected]>
> Reviewed-by: Christoph Hellwig <[email protected]>
Acked-by: Vlastimil Babka <[email protected]>
On 7/15/21 5:36 AM, Matthew Wilcox (Oracle) wrote:
> filemap_get_folio() is a replacement for find_get_page().
> Turn pagecache_get_page() into a wrapper around __filemap_get_folio().
> Remove find_lock_head() as this use case is now covered by
> filemap_get_folio().
>
> Reduces overall kernel size by 209 bytes. __filemap_get_folio() is
> 316 bytes shorter than pagecache_get_page() was, but the new
> pagecache_get_page() is 99 bytes.
>
> Signed-off-by: Matthew Wilcox (Oracle) <[email protected]>
Acked-by: Vlastimil Babka <[email protected]>
On 7/15/21 5:36 AM, Matthew Wilcox (Oracle) wrote:
> This is the folio equivalent of page_mkwrite_check_truncate().
>
> Signed-off-by: Matthew Wilcox (Oracle) <[email protected]>
> ---
> include/linux/pagemap.h | 28 ++++++++++++++++++++++++++++
> 1 file changed, 28 insertions(+)
>
> diff --git a/include/linux/pagemap.h b/include/linux/pagemap.h
> index 412db88b8d0c..18c06c3e42c3 100644
> --- a/include/linux/pagemap.h
> +++ b/include/linux/pagemap.h
> @@ -1121,6 +1121,34 @@ static inline unsigned long dir_pages(struct inode *inode)
> PAGE_SHIFT;
> }
>
> +/**
> + * folio_mkwrite_check_truncate - check if folio was truncated
> + * @folio: the folio to check
> + * @inode: the inode to check the folio against
> + *
> + * Return: the number of bytes in the folio up to EOF,
> + * or -EFAULT if the folio was truncated.
> + */
> +static inline ssize_t folio_mkwrite_check_truncate(struct folio *folio,
> + struct inode *inode)
> +{
> + loff_t size = i_size_read(inode);
> + pgoff_t index = size >> PAGE_SHIFT;
> + size_t offset = offset_in_folio(folio, size);
> +
> + if (!folio->mapping)
The check in the page_ version is
if (page->mapping != inode->i_mapping)
Why is the one above sufficient?
> + return -EFAULT;
> +
> + /* folio is wholly inside EOF */
> + if (folio_next_index(folio) - 1 < index)
> + return folio_size(folio);
> + /* folio is wholly past EOF */
> + if (folio->index > index || !offset)
> + return -EFAULT;
> + /* folio is partially inside EOF */
> + return offset;
> +}
> +
> /**
> * page_mkwrite_check_truncate - check if page was truncated
> * @page: the page to check
>
On 7/15/21 5:36 AM, Matthew Wilcox (Oracle) wrote:
> The pointers stored in the page cache are folios, by definition.
> This change comes with a behaviour change -- callers of readahead_folio()
> are no longer required to put the page reference themselves. This matches
> how readpage works, rather than matching how readpages used to work.
>
> Signed-off-by: Matthew Wilcox (Oracle) <[email protected]>
> Reviewed-by: Christoph Hellwig <[email protected]>
Acked-by: Vlastimil Babka <[email protected]>
On 7/15/21 5:36 AM, Matthew Wilcox (Oracle) wrote:
> Convert __add_to_page_cache_locked() into __filemap_add_folio().
> Add an assertion to it that (for !hugetlbfs), the folio is naturally
> aligned within the file. Move the prototype from mm.h to pagemap.h.
> Convert add_to_page_cache_lru() into filemap_add_folio(). Add a
> compatibility wrapper for unconverted callers.
>
> Signed-off-by: Matthew Wilcox (Oracle) <[email protected]>
> Reviewed-by: Christoph Hellwig <[email protected]>
Acked-by: Vlastimil Babka <[email protected]>
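As a rough illustration of the natural-alignment assertion mentioned above
(a sketch, not the exact in-tree check):

	/* A folio of 2^n pages must start at an index that is a multiple of 2^n. */
	VM_BUG_ON_FOLIO(!folio_test_hugetlb(folio) &&
			(index & (folio_nr_pages(folio) - 1)), folio);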
On Thu, Aug 12, 2021 at 01:56:24PM +0200, Vlastimil Babka wrote:
> On 7/15/21 5:35 AM, Matthew Wilcox (Oracle) wrote:
> > This is the folio equivalent of migrate_page_copy(), which is retained
> > as a wrapper for filesystems which are not yet converted to folios.
> > Also convert copy_huge_page() to folio_copy().
> >
> > Signed-off-by: Matthew Wilcox (Oracle) <[email protected]>
>
> Acked-by: Vlastimil Babka <[email protected]>
>
> The way folio_copy() avoids cond_resched() for single page would IMHO deserve a
> comment though, so it's not buried only in this thread.
I think folio_copy() deserves kernel-doc.
/**
* folio_copy - Copy the contents of one folio to another.
* @dst: Folio to copy to.
* @src: Folio to copy from.
*
* The bytes in the folio represented by @src are copied to @dst.
* Assumes the caller has validated that @dst is at least as large as @src.
* Can be called in atomic context for order-0 folios, but if the folio is
* larger, it may sleep.
*/
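For reference, a sketch of a folio_copy() implementation matching that
kernel-doc, showing why a single-page folio never reaches cond_resched():

	void folio_copy(struct folio *dst, struct folio *src)
	{
		long i = 0;
		long nr = folio_nr_pages(src);

		for (;;) {
			copy_highpage(folio_page(dst, i), folio_page(src, i));
			if (++i == nr)
				break;
			/* Only reached for multi-page folios; order-0 stays atomic-safe. */
			cond_resched();
		}
	}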
On 8/13/21 6:16 AM, Matthew Wilcox wrote:
> On Thu, Aug 12, 2021 at 01:56:24PM +0200, Vlastimil Babka wrote:
>> On 7/15/21 5:35 AM, Matthew Wilcox (Oracle) wrote:
>> > This is the folio equivalent of migrate_page_copy(), which is retained
>> > as a wrapper for filesystems which are not yet converted to folios.
>> > Also convert copy_huge_page() to folio_copy().
>> >
>> > Signed-off-by: Matthew Wilcox (Oracle) <[email protected]>
>>
>> Acked-by: Vlastimil Babka <[email protected]>
>>
>> The way folio_copy() avoids cond_resched() for single page would IMHO deserve a
>> comment though, so it's not buried only in this thread.
>
> I think folio_copy() deserves kernel-doc.
>
> /**
> * folio_copy - Copy the contents of one folio to another.
> * @dst: Folio to copy to.
> * @src: Folio to copy from.
> *
> * The bytes in the folio represented by @src are copied to @dst.
> * Assumes the caller has validated that @dst is at least as large as @src.
> * Can be called in atomic context for order-0 folios, but if the folio is
> * larger, it may sleep.
> */
>
LGTM.
On Tue, Aug 10, 2021 at 06:08:03PM +0200, Vlastimil Babka wrote:
> > * If neither ALLOW_RETRY nor KILLABLE are set, will always return 1
>
> s/1/true/ ? :)
Fixed; thanks.
On Tue, Aug 10, 2021 at 09:06:52PM +0100, David Howells wrote:
> Matthew Wilcox (Oracle) <[email protected]> wrote:
>
> > By using the node id in mem_cgroup_update_tree(), we can delete
> > soft_limit_tree_from_page() and mem_cgroup_page_nodeinfo(). Saves 42
> > bytes of kernel text on my config.
> >
> > Signed-off-by: Matthew Wilcox (Oracle) <[email protected]>
> > Acked-by: Michal Hocko <[email protected]>
> > Acked-by: Johannes Weiner <[email protected]>
> > Reviewed-by: Christoph Hellwig <[email protected]>
>
> Reviewed-by: David Howells <[email protected]>
>
> Though I wonder if:
>
> > - mz = mem_cgroup_page_nodeinfo(memcg, page);
> > + mz = memcg->nodeinfo[nid];
>
> should still have some sort of wrapper function.
I was asked to remove the wrapper function as it didn't provide enough
utility to warrant the indirection.
On Wed, Aug 11, 2021 at 12:32:45PM +0200, Vlastimil Babka wrote:
> > static inline bool PageMemcgKmem(struct page *page);
>
> I think this fwd declaration is no longer needed.
I agree. Deleted.
On Wed, Aug 11, 2021 at 12:54:05PM +0200, Vlastimil Babka wrote:
> > -static int __mem_cgroup_charge(struct page *page, struct mem_cgroup *memcg,
> > +static int __mem_cgroup_charge(struct folio *folio, struct mem_cgroup *memcg,
> > gfp_t gfp)
>
> The git/next version also renames this function to charge_memcg(), why? The new
> name doesn't look that internal as the old one. I don't have a strong opinion
> but CCing memcg maintainers who might.
Ah, this is Suren's fault :-)
https://lore.kernel.org/linux-mm/[email protected]/
Renaming it here makes the merge resolution cleaner.
On Wed, Jul 21, 2021 at 12:51:57PM +0300, Mike Rapoport wrote:
> > /*
> > - * page_memcg_rcu - locklessly get the memory cgroup associated with a page
> > - * @page: a pointer to the page struct
> > + * folio_memcg_rcu - Locklessly get the memory cgroup associated with a folio.
> > + * @folio: Pointer to the folio.
> > *
> > - * Returns a pointer to the memory cgroup associated with the page,
> > - * or NULL. This function assumes that the page is known to have a
> > + * Returns a pointer to the memory cgroup associated with the folio,
> > + * or NULL. This function assumes that the folio is known to have a
> > * proper memory cgroup pointer. It's not safe to call this function
> > - * against some type of pages, e.g. slab pages or ex-slab pages.
> > + * against some type of folios, e.g. slab folios or ex-slab folios.
>
> Maybe
>
> - * Returns a pointer to the memory cgroup associated with the page,
> - * or NULL. This function assumes that the page is known to have a
> + * This function assumes that the folio is known to have a
> * proper memory cgroup pointer. It's not safe to call this function
> - * against some type of pages, e.g. slab pages or ex-slab pages.
> + * against some type of folios, e.g. slab folios or ex-slab folios.
> + *
> + * Return: a pointer to the memory cgroup associated with the folio,
> + * or NULL.
I substantially included this change a few days ago and forgot to reply
to this email; sorry. It now reads:
/**
* folio_memcg_rcu - Locklessly get the memory cgroup associated with a folio.
* @folio: Pointer to the folio.
*
* This function assumes that the folio is known to have a
* proper memory cgroup pointer. It's not safe to call this function
* against some type of folios, e.g. slab folios or ex-slab folios.
*
* Return: A pointer to the memory cgroup associated with the folio,
* or NULL.
*/
On Thu, Jul 15, 2021 at 04:36:06AM +0100, Matthew Wilcox (Oracle) wrote:
> /**
> - * workingset_refault - evaluate the refault of a previously evicted page
> - * @page: the freshly allocated replacement page
> - * @shadow: shadow entry of the evicted page
> + * workingset_refault - evaluate the refault of a previously evicted folio
> + * @page: the freshly allocated replacement folio
Randy pointed out this doc mistake. Which got me looking at this
whole patch again, and I noticed that we're counting an entire folio as
a single page.
So I'm going to apply this patch on top of the below patch.
diff --git a/include/linux/vmstat.h b/include/linux/vmstat.h
index 241bd0f53fb9..bfe38869498d 100644
--- a/include/linux/vmstat.h
+++ b/include/linux/vmstat.h
@@ -597,12 +597,6 @@ static inline void mod_lruvec_page_state(struct page *page,
#endif /* CONFIG_MEMCG */
-static inline void inc_lruvec_state(struct lruvec *lruvec,
- enum node_stat_item idx)
-{
- mod_lruvec_state(lruvec, idx, 1);
-}
-
static inline void __inc_lruvec_page_state(struct page *page,
enum node_stat_item idx)
{
diff --git a/mm/workingset.c b/mm/workingset.c
index 10830211a187..9f91c28cc0ce 100644
--- a/mm/workingset.c
+++ b/mm/workingset.c
@@ -273,9 +273,9 @@ void *workingset_eviction(struct page *page, struct mem_cgroup *target_memcg)
}
/**
- * workingset_refault - evaluate the refault of a previously evicted folio
- * @page: the freshly allocated replacement folio
- * @shadow: shadow entry of the evicted folio
+ * workingset_refault - Evaluate the refault of a previously evicted folio.
+ * @folio: The freshly allocated replacement folio.
+ * @shadow: Shadow entry of the evicted folio.
*
* Calculates and evaluates the refault distance of the previously
* evicted folio in the context of the node and the memcg whose memory
@@ -295,6 +295,7 @@ void workingset_refault(struct folio *folio, void *shadow)
unsigned long refault;
bool workingset;
int memcgid;
+ long nr;
unpack_shadow(shadow, &memcgid, &pgdat, &eviction, &workingset);
@@ -347,10 +348,11 @@ void workingset_refault(struct folio *folio, void *shadow)
* However, the cgroup that will own the folio is the one that
* is actually experiencing the refault event.
*/
+ nr = folio_nr_pages(folio);
memcg = folio_memcg(folio);
lruvec = mem_cgroup_lruvec(memcg, pgdat);
- inc_lruvec_state(lruvec, WORKINGSET_REFAULT_BASE + file);
+ mod_lruvec_state(lruvec, WORKINGSET_REFAULT_BASE + file, nr);
/*
* Compare the distance to the existing workingset size. We
@@ -376,15 +378,15 @@ void workingset_refault(struct folio *folio, void *shadow)
goto out;
folio_set_active(folio);
- workingset_age_nonresident(lruvec, folio_nr_pages(folio));
- inc_lruvec_state(lruvec, WORKINGSET_ACTIVATE_BASE + file);
+ workingset_age_nonresident(lruvec, nr);
+ mod_lruvec_state(lruvec, WORKINGSET_ACTIVATE_BASE + file, nr);
/* Folio was active prior to eviction */
if (workingset) {
folio_set_workingset(folio);
/* XXX: Move to lru_cache_add() when it supports new vs putback */
lru_note_cost_folio(folio);
- inc_lruvec_state(lruvec, WORKINGSET_RESTORE_BASE + file);
+ mod_lruvec_state(lruvec, WORKINGSET_RESTORE_BASE + file, nr);
}
out:
rcu_read_unlock();
> + * @shadow: shadow entry of the evicted folio
> *
> * Calculates and evaluates the refault distance of the previously
> - * evicted page in the context of the node and the memcg whose memory
> + * evicted folio in the context of the node and the memcg whose memory
> * pressure caused the eviction.
> */
> -void workingset_refault(struct page *page, void *shadow)
> +void workingset_refault(struct folio *folio, void *shadow)
> {
> - bool file = page_is_file_lru(page);
> + bool file = folio_is_file_lru(folio);
> struct mem_cgroup *eviction_memcg;
> struct lruvec *eviction_lruvec;
> unsigned long refault_distance;
> @@ -301,10 +301,10 @@ void workingset_refault(struct page *page, void *shadow)
> rcu_read_lock();
> /*
> * Look up the memcg associated with the stored ID. It might
> - * have been deleted since the page's eviction.
> + * have been deleted since the folio's eviction.
> *
> * Note that in rare events the ID could have been recycled
> - * for a new cgroup that refaults a shared page. This is
> + * for a new cgroup that refaults a shared folio. This is
> * impossible to tell from the available data. However, this
> * should be a rare and limited disturbance, and activations
> * are always speculative anyway. Ultimately, it's the aging
> @@ -340,14 +340,14 @@ void workingset_refault(struct page *page, void *shadow)
> refault_distance = (refault - eviction) & EVICTION_MASK;
>
> /*
> - * The activation decision for this page is made at the level
> + * The activation decision for this folio is made at the level
> * where the eviction occurred, as that is where the LRU order
> - * during page reclaim is being determined.
> + * during folio reclaim is being determined.
> *
> - * However, the cgroup that will own the page is the one that
> + * However, the cgroup that will own the folio is the one that
> * is actually experiencing the refault event.
> */
> - memcg = page_memcg(page);
> + memcg = folio_memcg(folio);
> lruvec = mem_cgroup_lruvec(memcg, pgdat);
>
> inc_lruvec_state(lruvec, WORKINGSET_REFAULT_BASE + file);
> @@ -375,15 +375,15 @@ void workingset_refault(struct page *page, void *shadow)
> if (refault_distance > workingset_size)
> goto out;
>
> - SetPageActive(page);
> - workingset_age_nonresident(lruvec, thp_nr_pages(page));
> + folio_set_active(folio);
> + workingset_age_nonresident(lruvec, folio_nr_pages(folio));
> inc_lruvec_state(lruvec, WORKINGSET_ACTIVATE_BASE + file);
>
> - /* Page was active prior to eviction */
> + /* Folio was active prior to eviction */
> if (workingset) {
> - SetPageWorkingset(page);
> + folio_set_workingset(folio);
> /* XXX: Move to lru_cache_add() when it supports new vs putback */
> - lru_note_cost_page(page);
> + lru_note_cost_folio(folio);
> inc_lruvec_state(lruvec, WORKINGSET_RESTORE_BASE + file);
> }
> out:
> --
> 2.30.2
>
On Tue, Aug 10, 2021 at 09:42:13PM +0100, David Howells wrote:
> Matthew Wilcox (Oracle) <[email protected]> wrote:
>
> > Convert __page_rmapping to folio_raw_mapping and move it to mm/internal.h.
> > It's only a couple of instructions (load and mask), so it's definitely
> > going to be cheaper to inline it than call it. Leave page_rmapping
> > out of line.
> >
> > Signed-off-by: Matthew Wilcox (Oracle) <[email protected]>
>
> I assume you're going to call it from another source file at some point,
> otherwise this is unnecessary.
Yes, it gets called from mm/ksm.c in a later patch in this series.
__page_rmapping() assumes it's being passed a head page and
folio_raw_mapping() asserts that. Eventually, page_rmapping() can
go away (and maybe it should have been moved to folio-compat.c),
but I'm not inclined to do that now.
> Apart from that,
>
> Reviewed-by: David Howells <[email protected]>
>
On Wed, Aug 11, 2021 at 03:59:06PM +0200, Vlastimil Babka wrote:
> On 7/15/21 5:35 AM, Matthew Wilcox (Oracle) wrote:
> > Convert __page_rmapping to folio_raw_mapping and move it to mm/internal.h.
> > It's only a couple of instructions (load and mask), so it's definitely
> > going to be cheaper to inline it than call it. Leave page_rmapping
> > out of line.
>
> Maybe mention the page_anon_vma() in changelog too?
Convert __page_rmapping to folio_raw_mapping and move it to mm/internal.h.
It's only a couple of instructions (load and mask), so it's definitely
going to be cheaper to inline it than call it. Leave page_rmapping
out of line. Change page_anon_vma() to not call folio_raw_mapping() --
it's more efficient to do the subtraction than the mask.
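The "load and mask" being referred to is roughly this (a sketch of the
inlined helper):

	static inline void *folio_raw_mapping(struct folio *folio)
	{
		unsigned long mapping = (unsigned long)folio->mapping;

		/* Strip the PAGE_MAPPING_* flag bits encoded in the pointer. */
		return (void *)(mapping & ~PAGE_MAPPING_FLAGS);
	}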
On Thu, Aug 12, 2021 at 06:07:05PM +0200, Vlastimil Babka wrote:
> On 7/15/21 5:35 AM, Matthew Wilcox (Oracle) wrote:
> > Reimplement __set_page_dirty_nobuffers() as a wrapper around
> > filemap_dirty_folio().
>
> I assume it becomes obvious later why the new "mapping" parameter instead of
> taking it from the folio, but maybe the changelog should say it here?
---
mm/writeback: Add filemap_dirty_folio()
Reimplement __set_page_dirty_nobuffers() as a wrapper around
filemap_dirty_folio(). Eventually folio_mark_dirty() will pass
the folio's mapping to the address space's ->dirty_folio()
operation, so add the parameter to filemap_dirty_folio() now.
---
Nobody seems quite sure whether it's possible to truncate (or otherwise
remove) a page from a file while it's being marked as dirty. viz:
int set_page_dirty(struct page *page)
{
struct address_space *mapping = page_mapping(page);
if (likely(mapping)) {
...
return mapping->a_ops->set_page_dirty(page);
}
so ->set_page_dirty can only be called if page has a mapping (obviously,
otherwise we wouldn't know whose ->set_page_dirty to call). But then
in __set_page_dirty_nobuffers(), we check to see if mapping has
become unset:
if (!TestSetPageDirty(page)) {
struct address_space *mapping = page_mapping(page);
if (!mapping) {
unlock_page_memcg(page);
return 1;
}
Confusingly, the comment to __set_page_dirty_nobuffers says:
* The caller must ensure this doesn't race with truncation. Most will simply
* hold the page lock, but e.g. zap_pte_range() calls with the page mapped and
* the pte lock held, which also locks out truncation.
I believe this is left-over from commit 2d6d7f982846 in 2015.
Anyway, passing mapping as a parameter is something we already do for
just about every other address_space operation, and we already called
page_mapping() to get it, so why make the callee call it again? Not to
mention people get confused about whether to call page_mapping() or just
look at page->mapping. Changing the ->set_page_dirty() operation to
->dirty_folio() is something I've postponed until the 5.17/5.18 timeframe,
but we might as well pass the parameter to filemap_dirty_folio() now.
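Concretely, the compatibility wrapper can simply forward the mapping it has
already looked up (a sketch; the in-tree version may differ in detail):

	int __set_page_dirty_nobuffers(struct page *page)
	{
		/* Pass the mapping explicitly rather than re-deriving it in the callee. */
		return filemap_dirty_folio(page_mapping(page), page_folio(page));
	}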
On Sat, Aug 14, 2021 at 05:05:44AM +0100, Matthew Wilcox wrote:
> On Wed, Jul 21, 2021 at 12:51:57PM +0300, Mike Rapoport wrote:
> > > /*
> > > - * page_memcg_rcu - locklessly get the memory cgroup associated with a page
> > > - * @page: a pointer to the page struct
> > > + * folio_memcg_rcu - Locklessly get the memory cgroup associated with a folio.
> > > + * @folio: Pointer to the folio.
> > > *
> > > - * Returns a pointer to the memory cgroup associated with the page,
> > > - * or NULL. This function assumes that the page is known to have a
> > > + * Returns a pointer to the memory cgroup associated with the folio,
> > > + * or NULL. This function assumes that the folio is known to have a
> > > * proper memory cgroup pointer. It's not safe to call this function
> > > - * against some type of pages, e.g. slab pages or ex-slab pages.
> > > + * against some type of folios, e.g. slab folios or ex-slab folios.
> >
> > Maybe
> >
> > - * Returns a pointer to the memory cgroup associated with the page,
> > - * or NULL. This function assumes that the page is known to have a
> > + * This function assumes that the folio is known to have a
> > * proper memory cgroup pointer. It's not safe to call this function
> > - * against some type of pages, e.g. slab pages or ex-slab pages.
> > + * against some type of folios, e.g. slab folios or ex-slab folios.
> > + *
> > + * Return: a pointer to the memory cgroup associated with the folio,
> > + * or NULL.
>
> I substantially included this change a few days ago and forgot to reply
> to this email; sorry. It now reads:
>
> /**
> * folio_memcg_rcu - Locklessly get the memory cgroup associated with a folio.
> * @folio: Pointer to the folio.
> *
> * This function assumes that the folio is known to have a
> * proper memory cgroup pointer. It's not safe to call this function
> * against some type of folios, e.g. slab folios or ex-slab folios.
> *
> * Return: A pointer to the memory cgroup associated with the folio,
> * or NULL.
> */
I like it.
--
Sincerely yours,
Mike.
On Thu, Aug 12, 2021 at 06:14:22PM +0200, Vlastimil Babka wrote:
> On 7/15/21 5:35 AM, Matthew Wilcox (Oracle) wrote:
> > Get the statistics right; compound pages were being accounted as a
> > single page. This didn't matter before now as no filesystem which
> > supported compound pages did writeback. Also move the declaration
> > to filemap.h since this is part of the page cache. Add a wrapper for
>
> Seems to be pagemap.h :)
Ugh, right. filemap.c. pagemap.h. obviously.
> > + wb_stat_mod(wb, WB_RECLAIMABLE, -nr);
> > + task_io_account_cancelled_write(folio_size(folio));
>
> In "mm/writeback: Add __folio_mark_dirty()" you used nr*PAGE_SIZE. Consistency?
We don't have any ;-) I'll change that. Some places we use <<
PAGE_SHIFT, some places we use * PAGE_SIZE ... either is better than
calling folio_size() unnecessarily.
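For the record, all the spellings being compared compute the same byte count
once nr is already in hand (sketch):

	long nr = folio_nr_pages(folio);

	task_io_account_cancelled_write(nr * PAGE_SIZE);	/* or nr << PAGE_SHIFT */
	/* folio_size(folio) is equivalent, but re-derives the size we already have. */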
On Thu, Aug 12, 2021 at 06:30:51PM +0200, Vlastimil Babka wrote:
> > +/**
> > + * folio_redirty_for_writepage - Decline to write a dirty folio.
> > + * @wbc: The writeback control.
> > + * @folio: The folio.
> > + *
> > + * When a writepage implementation decides that it doesn't want to write
> > + * @folio for some reason, it should call this function, unlock @folio and
> > + * return 0.
>
> s/0/false
... no? This sentence describes what a writepage implementation should
do, and writepage returns an int, not bool.
On Thu, Aug 12, 2021 at 07:08:33PM +0200, Vlastimil Babka wrote:
> > +/**
> > + * folio_mkwrite_check_truncate - check if folio was truncated
> > + * @folio: the folio to check
> > + * @inode: the inode to check the folio against
> > + *
> > + * Return: the number of bytes in the folio up to EOF,
> > + * or -EFAULT if the folio was truncated.
> > + */
> > +static inline ssize_t folio_mkwrite_check_truncate(struct folio *folio,
> > + struct inode *inode)
> > +{
> > + loff_t size = i_size_read(inode);
> > + pgoff_t index = size >> PAGE_SHIFT;
> > + size_t offset = offset_in_folio(folio, size);
> > +
> > + if (!folio->mapping)
>
> The check in the page_ version is
> if (page->mapping != inode->i_mapping)
>
> Why is the one above sufficient?
Oh, good question!
We know that at some point this page belonged to this file. The caller
has a reference on it (and at the time they acquired a refcount on the
page, the page was part of the file). The caller also has the page
locked, but has not checked that the page is still part of the file.
That's where we come in.
The truncate path looks up the page, locks it, removes it from i_pages,
unmaps it, sets the page->mapping to NULL, unlocks it and puts the page.
Because the folio_mkwrite_check_truncate() caller holds a reference on
the page, the truncate path will not free the page. So there are only
two possibilities for the value of page->mapping; either it's the same
as inode->i_mapping, or it's NULL.
Now, maybe this is a bit subtle. For robustness, perhaps we should
check that it's definitely still part of this file instead of checking
whether it is currently part of no file. Perhaps at some point in the
future, we might get the reference to the page without checking that
it's still part of this file. Opinions?
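A sketch of the more defensive variant being considered:

	if (folio->mapping != inode->i_mapping)
		return -EFAULT;

which would also catch the folio having moved to a different file, at the
cost of an extra load of inode->i_mapping.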
On Tue, Aug 10, 2021 at 10:41:30PM +0100, David Howells wrote:
> Matthew Wilcox (Oracle) <[email protected]> wrote:
>
> > This is the folio equivalent of page_evictable(). Unfortunately, it's
> > different from !folio_test_unevictable(), but I think it's used in places
> > where you have to be a VM expert and can reasonably be expected to know
> > the difference.
>
> It would be useful to say how it is different. I'm guessing it's because a
> page is always entirely mlocked or not, but a folio might be partially
> mlocked?
folio_test_unevictable() checks the unevictable page flag.
folio_evictable() tests whether the folio is actually evictable: it is not
if any page in the folio is part of an mlocked VMA, or if the filesystem
has marked the inode as unevictable (e.g. ramfs).
It does end up looking a bit weird though.
if (page_evictable(page) && PageUnevictable(page)) {
del_page_from_lru_list(page, lruvec);
ClearPageUnevictable(page);
add_page_to_lru_list(page, lruvec);
but it's all mm-internal, not exposed to filesystems.
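A sketch of what folio_evictable() checks (assuming the helper names used
elsewhere in this series; the in-tree version may differ slightly):

	static inline bool folio_evictable(struct folio *folio)
	{
		bool ret;

		/* Prevent the address_space of inode and swap cache from being freed. */
		rcu_read_lock();
		ret = !mapping_unevictable(folio_mapping(folio)) &&
				!folio_test_mlocked(folio);
		rcu_read_unlock();
		return ret;
	}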
On Tue, Aug 10, 2021 at 10:44:27PM +0100, David Howells wrote:
> Matthew Wilcox (Oracle) <[email protected]> wrote:
>
> > * looking at the same page) and the evictable page will be stranded
> > * in an unevictable LRU.
>
> Does that need converting to say 'folio'?
Changed the paragraph (passed it through fmt too)
* if '#1' does not observe setting of PG_lru by '#0' and
* fails isolation, the explicit barrier will make sure that
* folio_evictable check will put the folio on the correct
* LRU. Without smp_mb(), folio_set_lru() can be reordered
* after folio_test_mlocked() check and can make '#1' fail the
* isolation of the folio whose mlocked bit is cleared (#0 is
* also looking at the same folio) and the evictable folio will
* be stranded on an unevictable LRU.
> Other than that:
>
> Reviewed-by: David Howells <[email protected]>
>
On Tue, Aug 10, 2021 at 10:51:23PM +0100, David Howells wrote:
> Matthew Wilcox (Oracle) <[email protected]> wrote:
>
> > +struct folio *folio_alloc(gfp_t gfp, unsigned order)
> > +{
> > + struct page *page = alloc_pages(gfp | __GFP_COMP, order);
> > +
> > + if (page && order > 1)
> > + prep_transhuge_page(page);
>
> Ummm... Shouldn't order==1 pages (two page folios) be prep'd also?
No. The deferred list is stored in the second tail page, so there's
nowhere to store one if there are only two pages.
The free_transhuge_page() dtor only handles the deferred list, so
it's fine to skip setting the DTOR in the page too.
> Would it be better to just jump to alloc_pages() if order <= 1? E.g.:
>
> struct folio *folio_alloc(gfp_t gfp, unsigned order)
> {
> struct page *page;
>
> if (order <= 1)
> return (struct folio *)alloc_pages(gfp | __GFP_COMP, order);
>
> page = alloc_pages(gfp | __GFP_COMP, order);
> if (page)
> prep_transhuge_page(page);
> return (struct folio *)page;
> }
That doesn't look simpler to me?
On Tue, Aug 10, 2021 at 11:05:33PM +0100, David Howells wrote:
> Matthew Wilcox (Oracle) <[email protected]> wrote:
>
> > filemap_get_folio() is a replacement for find_get_page().
> > Turn pagecache_get_page() into a wrapper around __filemap_get_folio().
> > Remove find_lock_head() as this use case is now covered by
> > filemap_get_folio().
> >
> > Reduces overall kernel size by 209 bytes. __filemap_get_folio() is
> > 316 bytes shorter than pagecache_get_page() was, but the new
> > pagecache_get_page() is 99 bytes
>
> longer, one presumes.
In total -- the old pagecache_get_page() turns into
__filemap_get_folio(), but the wrapper is 99 bytes in size.
Added the word "wrapper" to make this clearer.
On Thu, Jul 15, 2021 at 04:36:38AM +0100, Matthew Wilcox (Oracle) wrote:
> rcu_read_lock();
> - for (head = xas_load(&xas); head; head = xas_next(&xas)) {
> - if (xas_retry(&xas, head))
> + for (folio = xas_load(&xas); folio; folio = xas_next(&xas)) {
> + if (xas_retry(&xas, folio))
> continue;
> - if (xas.xa_index > max || xa_is_value(head))
> + if (xas.xa_index > max || xa_is_value(folio))
> break;
> - if (!page_cache_get_speculative(head))
> + if (!folio_try_get_rcu(folio))
> goto retry;
>
> - /* Has the page moved or been split? */
> - if (unlikely(head != xas_reload(&xas)))
> + if (unlikely(folio != xas_reload(&xas)))
> goto put_page;
>
> - if (!pagevec_add(pvec, head))
> + if (!pagevec_add(pvec, &folio->page))
> break;
> - if (!PageUptodate(head))
> + if (!folio_test_uptodate(folio))
> break;
> - if (PageReadahead(head))
> + if (folio_test_readahead(folio))
> break;
> - xas.xa_index = head->index + thp_nr_pages(head) - 1;
> + xas.xa_index = folio->index + folio_nr_pages(folio) - 1;
> xas.xa_offset = (xas.xa_index >> xas.xa_shift) & XA_CHUNK_MASK;
> continue;
It's not a bug in _this_ patch, but these last two lines become a bug
once the page cache is converted to store folios as multi-index entries
(as opposed to now when it replicates an order-N entry 2^N times).
I should not have used xas.xa_shift (which is the shift of the entry
we're looking for and is always 0), but xas.xa_node->shift (which is
the shift of the entry that we found).
If you have an order-7 page, occupying (say) indices 128-255, we set
xa_index to 255, but instead of setting xa_offset to 3, we set it to 63.
That tricks __xas_next() into going up to the parent node, and then back
down, which might mean that we terminate the scan early, or that we skip
over all the other entries in the node. What I actually noticed was a
crash where we ended up loading an internal entry out of the XArray.
It's all a bit complicated really. That calls for a helper, and this is
my current candidate:
+static inline void xas_advance(struct xa_state *xas, unsigned long index)
+{
+ unsigned char shift = xas_is_node(xas) ? xas->xa_node->shift : 0;
+
+ xas->xa_index = index;
+ xas->xa_offset = (index >> shift) & XA_CHUNK_MASK;
+}
...
- xas.xa_index = folio->index + folio_nr_pages(folio) - 1;
- xas.xa_offset = (xas.xa_index >> xas.xa_shift) & XA_CHUNK_MASK;
+ xas_advance(&xas, folio->index + folio_nr_pages(folio) - 1);
This is coming up on 4 hours of continuous testing using generic/559.
Without it, it would usually crash in about 40 minutes.