2020-11-12 21:29:51

by Matthew Wilcox

Subject: [PATCH v4 00/16] Overhaul multi-page lookups for THP

This THP prep patchset changes several page cache iteration APIs to only
return head pages.

- It's only possible to tag head pages in the page cache, so only
return head pages, not all their subpages.
- Factor a lot of common code out of the various batch lookup routines
- Add mapping_seek_hole_data()
- Unify find_get_entries() and pagevec_lookup_entries()
- Make find_get_entries only return head pages, like find_get_entry().

These are only loosely connected, but they seem to make sense together
as a series.

v4:
- Add FGP_ENTRY, remove find_get_entry and find_lock_entry
- Rename xas_find_get_entry to find_get_entry
- Add "Optimise get_shadow_from_swap_cache"
- Move "iomap: Use mapping_seek_hole_data" to this patch series
- Rebase against next-20201112

Matthew Wilcox (Oracle) (16):
mm: Make pagecache tagged lookups return only head pages
mm/shmem: Use pagevec_lookup in shmem_unlock_mapping
mm/swap: Optimise get_shadow_from_swap_cache
mm: Add FGP_ENTRY
mm/filemap: Rename find_get_entry to mapping_get_entry
mm/filemap: Add helper for finding pages
mm/filemap: Add mapping_seek_hole_data
iomap: Use mapping_seek_hole_data
mm: Add and use find_lock_entries
mm: Add an 'end' parameter to find_get_entries
mm: Add an 'end' parameter to pagevec_lookup_entries
mm: Remove nr_entries parameter from pagevec_lookup_entries
mm: Pass pvec directly to find_get_entries
mm: Remove pagevec_lookup_entries
mm/truncate,shmem: Handle truncates that split THPs
mm/filemap: Return only head pages from find_get_entries

fs/iomap/seek.c | 125 ++------------
include/linux/pagemap.h | 6 +-
include/linux/pagevec.h | 4 -
mm/filemap.c | 351 +++++++++++++++++++++++++---------------
mm/internal.h | 7 +-
mm/shmem.c | 218 ++++++-------------------
mm/swap.c | 38 +----
mm/swap_state.c | 7 +-
mm/truncate.c | 249 +++++++++++-----------------
9 files changed, 390 insertions(+), 615 deletions(-)

--
2.28.0


2020-11-12 21:29:55

by Matthew Wilcox

Subject: [PATCH v4 02/16] mm/shmem: Use pagevec_lookup in shmem_unlock_mapping

The comment shows that the reason for using find_get_entries() is now
stale; find_get_pages() will not return 0 if it hits a consecutive run
of swap entries, and I don't believe it has since 2011. pagevec_lookup()
is a simpler function to use than find_get_pages(), so use it instead.

Signed-off-by: Matthew Wilcox (Oracle) <[email protected]>
Reviewed-by: Jan Kara <[email protected]>
Reviewed-by: William Kucharski <[email protected]>
---
mm/shmem.c | 11 +----------
1 file changed, 1 insertion(+), 10 deletions(-)

diff --git a/mm/shmem.c b/mm/shmem.c
index 028f4596fc16..8076c171731c 100644
--- a/mm/shmem.c
+++ b/mm/shmem.c
@@ -842,7 +842,6 @@ unsigned long shmem_swap_usage(struct vm_area_struct *vma)
void shmem_unlock_mapping(struct address_space *mapping)
{
struct pagevec pvec;
- pgoff_t indices[PAGEVEC_SIZE];
pgoff_t index = 0;

pagevec_init(&pvec);
@@ -850,16 +849,8 @@ void shmem_unlock_mapping(struct address_space *mapping)
* Minor point, but we might as well stop if someone else SHM_LOCKs it.
*/
while (!mapping_unevictable(mapping)) {
- /*
- * Avoid pagevec_lookup(): find_get_pages() returns 0 as if it
- * has finished, if it hits a row of PAGEVEC_SIZE swap entries.
- */
- pvec.nr = find_get_entries(mapping, index,
- PAGEVEC_SIZE, pvec.pages, indices);
- if (!pvec.nr)
+ if (!pagevec_lookup(&pvec, mapping, &index))
break;
- index = indices[pvec.nr - 1] + 1;
- pagevec_remove_exceptionals(&pvec);
check_move_unevictable_pages(&pvec);
pagevec_release(&pvec);
cond_resched();
--
2.28.0

2020-11-12 21:30:04

by Matthew Wilcox

Subject: [PATCH v4 16/16] mm/filemap: Return only head pages from find_get_entries

All callers now expect head (and base) pages, and can handle multiple
head pages in a single batch, so make find_get_entries() behave that way.
Also take the opportunity to make it use the pagevec infrastructure
instead of open-coding how pvecs behave. This has the side-effect of
being able to append to a pagevec with existing contents, although we
don't make use of that functionality anywhere yet.
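
As a purely illustrative aside (nothing in the tree does this yet;
example_gather_two_ranges() and the index ranges are invented), the
append behaviour could be used like so:

#include <linux/pagemap.h>
#include <linux/pagevec.h>
#include "internal.h"       /* find_get_entries() now lives here */

/* Hypothetical: gather entries from two ranges into one pagevec. */
static void example_gather_two_ranges(struct address_space *mapping,
                struct pagevec *pvec, pgoff_t *indices)
{
        pagevec_init(pvec);
        /* Entries for indices 0-15 land in the first slots of @pvec. */
        find_get_entries(mapping, 0, 15, pvec, indices);
        /*
         * Because entries are now added with pagevec_add(), this call
         * appends after the existing contents (stopping when the
         * pagevec is full) instead of overwriting them.
         */
        find_get_entries(mapping, 100, 115, pvec, indices);
}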

Signed-off-by: Matthew Wilcox (Oracle) <[email protected]>
Reviewed-by: Jan Kara <[email protected]>
Reviewed-by: William Kucharski <[email protected]>
---
include/linux/pagemap.h | 2 --
mm/filemap.c | 36 ++++++++----------------------------
mm/internal.h | 2 ++
3 files changed, 10 insertions(+), 30 deletions(-)

diff --git a/include/linux/pagemap.h b/include/linux/pagemap.h
index 46d4b1704770..65ef8db8eaab 100644
--- a/include/linux/pagemap.h
+++ b/include/linux/pagemap.h
@@ -448,8 +448,6 @@ static inline struct page *find_subpage(struct page *head, pgoff_t index)
return head + (index & (thp_nr_pages(head) - 1));
}

-unsigned find_get_entries(struct address_space *mapping, pgoff_t start,
- pgoff_t end, struct pagevec *pvec, pgoff_t *indices);
unsigned find_get_pages_range(struct address_space *mapping, pgoff_t *start,
pgoff_t end, unsigned int nr_pages,
struct page **pages);
diff --git a/mm/filemap.c b/mm/filemap.c
index 479cbbadd93b..f8c294905e8d 100644
--- a/mm/filemap.c
+++ b/mm/filemap.c
@@ -1878,49 +1878,29 @@ static inline struct page *find_get_entry(struct xa_state *xas, pgoff_t max,
* the mapping. The entries are placed in @pvec. find_get_entries()
* takes a reference on any actual pages it returns.
*
- * The search returns a group of mapping-contiguous page cache entries
- * with ascending indexes. There may be holes in the indices due to
- * not-present pages.
+ * The entries have ascending indexes. The indices may not be consecutive
+ * due to not-present entries or THPs.
*
* Any shadow entries of evicted pages, or swap entries from
* shmem/tmpfs, are included in the returned array.
*
- * If it finds a Transparent Huge Page, head or tail, find_get_entries()
- * stops at that page: the caller is likely to have a better way to handle
- * the compound page as a whole, and then skip its extent, than repeatedly
- * calling find_get_entries() to return all its tails.
- *
- * Return: the number of pages and shadow entries which were found.
+ * Return: The number of entries which were found.
*/
unsigned find_get_entries(struct address_space *mapping, pgoff_t start,
pgoff_t end, struct pagevec *pvec, pgoff_t *indices)
{
XA_STATE(xas, &mapping->i_pages, start);
struct page *page;
- unsigned int ret = 0;
- unsigned nr_entries = PAGEVEC_SIZE;

rcu_read_lock();
while ((page = find_get_entry(&xas, end, XA_PRESENT))) {
- /*
- * Terminate early on finding a THP, to allow the caller to
- * handle it all at once; but continue if this is hugetlbfs.
- */
- if (!xa_is_value(page) && PageTransHuge(page) &&
- !PageHuge(page)) {
- page = find_subpage(page, xas.xa_index);
- nr_entries = ret + 1;
- }
-
- indices[ret] = xas.xa_index;
- pvec->pages[ret] = page;
- if (++ret == nr_entries)
+ indices[pvec->nr] = xas.xa_index;
+ if (!pagevec_add(pvec, page))
break;
}
rcu_read_unlock();

- pvec->nr = ret;
- return ret;
+ return pagevec_count(pvec);
}

/**
@@ -1939,8 +1919,8 @@ unsigned find_get_entries(struct address_space *mapping, pgoff_t start,
* not returned.
*
* The entries have ascending indexes. The indices may not be consecutive
- * due to not-present entries, THP pages, pages which could not be locked
- * or pages under writeback.
+ * due to not-present entries, THPs, pages which could not be locked or
+ * pages under writeback.
*
* Return: The number of entries which were found.
*/
diff --git a/mm/internal.h b/mm/internal.h
index cb7487efa856..1f137a5d66bb 100644
--- a/mm/internal.h
+++ b/mm/internal.h
@@ -60,6 +60,8 @@ static inline void force_page_cache_readahead(struct address_space *mapping,
force_page_cache_ra(&ractl, &file->f_ra, nr_to_read);
}

+unsigned find_get_entries(struct address_space *mapping, pgoff_t start,
+ pgoff_t end, struct pagevec *pvec, pgoff_t *indices);
unsigned find_lock_entries(struct address_space *mapping, pgoff_t start,
pgoff_t end, struct pagevec *pvec, pgoff_t *indices);

--
2.28.0

2020-11-12 21:30:08

by Matthew Wilcox

Subject: [PATCH v4 13/16] mm: Pass pvec directly to find_get_entries

All callers of find_get_entries() use a pvec, so pass it directly
instead of manipulating it in the caller.

Signed-off-by: Matthew Wilcox (Oracle) <[email protected]>
Reviewed-by: Jan Kara <[email protected]>
Reviewed-by: William Kucharski <[email protected]>
---
include/linux/pagemap.h | 3 +--
mm/filemap.c | 21 +++++++++------------
mm/shmem.c | 5 ++---
mm/swap.c | 4 +---
4 files changed, 13 insertions(+), 20 deletions(-)

diff --git a/include/linux/pagemap.h b/include/linux/pagemap.h
index c7c26a902743..46d4b1704770 100644
--- a/include/linux/pagemap.h
+++ b/include/linux/pagemap.h
@@ -449,8 +449,7 @@ static inline struct page *find_subpage(struct page *head, pgoff_t index)
}

unsigned find_get_entries(struct address_space *mapping, pgoff_t start,
- pgoff_t end, unsigned int nr_entries, struct page **entries,
- pgoff_t *indices);
+ pgoff_t end, struct pagevec *pvec, pgoff_t *indices);
unsigned find_get_pages_range(struct address_space *mapping, pgoff_t *start,
pgoff_t end, unsigned int nr_pages,
struct page **pages);
diff --git a/mm/filemap.c b/mm/filemap.c
index b3b89a62ab1a..479cbbadd93b 100644
--- a/mm/filemap.c
+++ b/mm/filemap.c
@@ -1871,14 +1871,12 @@ static inline struct page *find_get_entry(struct xa_state *xas, pgoff_t max,
* @mapping: The address_space to search
* @start: The starting page cache index
* @end: The final page index (inclusive).
- * @nr_entries: The maximum number of entries
- * @entries: Where the resulting entries are placed
+ * @pvec: Where the resulting entries are placed.
* @indices: The cache indices corresponding to the entries in @entries
*
- * find_get_entries() will search for and return a group of up to
- * @nr_entries entries in the mapping. The entries are placed at
- * @entries. find_get_entries() takes a reference against any actual
- * pages it returns.
+ * find_get_entries() will search for and return a batch of entries in
+ * the mapping. The entries are placed in @pvec. find_get_entries()
+ * takes a reference on any actual pages it returns.
*
* The search returns a group of mapping-contiguous page cache entries
* with ascending indexes. There may be holes in the indices due to
@@ -1895,15 +1893,12 @@ static inline struct page *find_get_entry(struct xa_state *xas, pgoff_t max,
* Return: the number of pages and shadow entries which were found.
*/
unsigned find_get_entries(struct address_space *mapping, pgoff_t start,
- pgoff_t end, unsigned int nr_entries, struct page **entries,
- pgoff_t *indices)
+ pgoff_t end, struct pagevec *pvec, pgoff_t *indices)
{
XA_STATE(xas, &mapping->i_pages, start);
struct page *page;
unsigned int ret = 0;
-
- if (!nr_entries)
- return 0;
+ unsigned nr_entries = PAGEVEC_SIZE;

rcu_read_lock();
while ((page = find_get_entry(&xas, end, XA_PRESENT))) {
@@ -1918,11 +1913,13 @@ unsigned find_get_entries(struct address_space *mapping, pgoff_t start,
}

indices[ret] = xas.xa_index;
- entries[ret] = page;
+ pvec->pages[ret] = page;
if (++ret == nr_entries)
break;
}
rcu_read_unlock();
+
+ pvec->nr = ret;
return ret;
}

diff --git a/mm/shmem.c b/mm/shmem.c
index 7a62dc967d7d..e01457988dd6 100644
--- a/mm/shmem.c
+++ b/mm/shmem.c
@@ -965,9 +965,8 @@ static void shmem_undo_range(struct inode *inode, loff_t lstart, loff_t lend,
while (index < end) {
cond_resched();

- pvec.nr = find_get_entries(mapping, index, end - 1,
- PAGEVEC_SIZE, pvec.pages, indices);
- if (!pvec.nr) {
+ if (!find_get_entries(mapping, index, end - 1, &pvec,
+ indices)) {
/* If all gone or hole-punch or unfalloc, we're done */
if (index == start || end != -1)
break;
diff --git a/mm/swap.c b/mm/swap.c
index 9a562f7fd200..7cf585223566 100644
--- a/mm/swap.c
+++ b/mm/swap.c
@@ -1102,9 +1102,7 @@ unsigned pagevec_lookup_entries(struct pagevec *pvec,
struct address_space *mapping, pgoff_t start, pgoff_t end,
pgoff_t *indices)
{
- pvec->nr = find_get_entries(mapping, start, end, PAGEVEC_SIZE,
- pvec->pages, indices);
- return pagevec_count(pvec);
+ return find_get_entries(mapping, start, end, pvec, indices);
}

/**
--
2.28.0

2020-11-12 21:30:17

by Matthew Wilcox

Subject: [PATCH v4 05/16] mm/filemap: Rename find_get_entry to mapping_get_entry

find_get_entry doesn't "find" anything. It returns the entry at
a particular index.

Signed-off-by: Matthew Wilcox (Oracle) <[email protected]>
---
mm/filemap.c | 7 ++++---
1 file changed, 4 insertions(+), 3 deletions(-)

diff --git a/mm/filemap.c b/mm/filemap.c
index 01603f021740..53073281f027 100644
--- a/mm/filemap.c
+++ b/mm/filemap.c
@@ -1663,7 +1663,7 @@ pgoff_t page_cache_prev_miss(struct address_space *mapping,
EXPORT_SYMBOL(page_cache_prev_miss);

/*
- * find_get_entry - find and get a page cache entry
+ * mapping_get_entry - Get a page cache entry.
* @mapping: the address_space to search
* @index: The page cache index.
*
@@ -1675,7 +1675,8 @@ EXPORT_SYMBOL(page_cache_prev_miss);
*
* Return: The head page or shadow entry, %NULL if nothing is found.
*/
-static struct page *find_get_entry(struct address_space *mapping, pgoff_t index)
+static struct page *mapping_get_entry(struct address_space *mapping,
+ pgoff_t index)
{
XA_STATE(xas, &mapping->i_pages, index);
struct page *page;
@@ -1751,7 +1752,7 @@ struct page *pagecache_get_page(struct address_space *mapping, pgoff_t index,
struct page *page;

repeat:
- page = find_get_entry(mapping, index);
+ page = mapping_get_entry(mapping, index);
if (xa_is_value(page)) {
if (fgp_flags & FGP_ENTRY)
return page;
--
2.28.0

2020-11-12 21:30:43

by Matthew Wilcox

Subject: [PATCH v4 08/16] iomap: Use mapping_seek_hole_data

Enhance mapping_seek_hole_data() to handle partially uptodate pages and
convert the iomap seek code to call it.
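
As a point of reference, a filesystem whose SEEK_HOLE/SEEK_DATA state
is fully described by the page cache could call the helper directly.
This is a hedged sketch only; example_llseek_hole_data() is invented
for illustration and is not part of this patch:

#include <linux/fs.h>
#include <linux/pagemap.h>

/* Hypothetical llseek helper handling only SEEK_HOLE / SEEK_DATA. */
static loff_t example_llseek_hole_data(struct file *file, loff_t offset,
                int whence)
{
        struct inode *inode = file_inode(file);
        loff_t isize = i_size_read(inode);

        if (offset >= isize)
                return -ENXIO;

        /* Search from @offset to EOF; errors come back as negatives. */
        offset = mapping_seek_hole_data(inode->i_mapping, offset, isize,
                        whence);
        if (offset < 0)
                return offset;
        return vfs_setpos(file, offset, inode->i_sb->s_maxbytes);
}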

Signed-off-by: Matthew Wilcox (Oracle) <[email protected]>
---
fs/iomap/seek.c | 125 +++++-------------------------------------------
mm/filemap.c | 37 ++++++++++++--
2 files changed, 43 insertions(+), 119 deletions(-)

diff --git a/fs/iomap/seek.c b/fs/iomap/seek.c
index 107ee80c3568..dab1b02eba5b 100644
--- a/fs/iomap/seek.c
+++ b/fs/iomap/seek.c
@@ -10,122 +10,17 @@
#include <linux/pagemap.h>
#include <linux/pagevec.h>

-/*
- * Seek for SEEK_DATA / SEEK_HOLE within @page, starting at @lastoff.
- * Returns true if found and updates @lastoff to the offset in file.
- */
-static bool
-page_seek_hole_data(struct inode *inode, struct page *page, loff_t *lastoff,
- int whence)
-{
- const struct address_space_operations *ops = inode->i_mapping->a_ops;
- unsigned int bsize = i_blocksize(inode), off;
- bool seek_data = whence == SEEK_DATA;
- loff_t poff = page_offset(page);
-
- if (WARN_ON_ONCE(*lastoff >= poff + PAGE_SIZE))
- return false;
-
- if (*lastoff < poff) {
- /*
- * Last offset smaller than the start of the page means we found
- * a hole:
- */
- if (whence == SEEK_HOLE)
- return true;
- *lastoff = poff;
- }
-
- /*
- * Just check the page unless we can and should check block ranges:
- */
- if (bsize == PAGE_SIZE || !ops->is_partially_uptodate)
- return PageUptodate(page) == seek_data;
-
- lock_page(page);
- if (unlikely(page->mapping != inode->i_mapping))
- goto out_unlock_not_found;
-
- for (off = 0; off < PAGE_SIZE; off += bsize) {
- if (offset_in_page(*lastoff) >= off + bsize)
- continue;
- if (ops->is_partially_uptodate(page, off, bsize) == seek_data) {
- unlock_page(page);
- return true;
- }
- *lastoff = poff + off + bsize;
- }
-
-out_unlock_not_found:
- unlock_page(page);
- return false;
-}
-
-/*
- * Seek for SEEK_DATA / SEEK_HOLE in the page cache.
- *
- * Within unwritten extents, the page cache determines which parts are holes
- * and which are data: uptodate buffer heads count as data; everything else
- * counts as a hole.
- *
- * Returns the resulting offset on successs, and -ENOENT otherwise.
- */
static loff_t
-page_cache_seek_hole_data(struct inode *inode, loff_t offset, loff_t length,
- int whence)
-{
- pgoff_t index = offset >> PAGE_SHIFT;
- pgoff_t end = DIV_ROUND_UP(offset + length, PAGE_SIZE);
- loff_t lastoff = offset;
- struct pagevec pvec;
-
- if (length <= 0)
- return -ENOENT;
-
- pagevec_init(&pvec);
-
- do {
- unsigned nr_pages, i;
-
- nr_pages = pagevec_lookup_range(&pvec, inode->i_mapping, &index,
- end - 1);
- if (nr_pages == 0)
- break;
-
- for (i = 0; i < nr_pages; i++) {
- struct page *page = pvec.pages[i];
-
- if (page_seek_hole_data(inode, page, &lastoff, whence))
- goto check_range;
- lastoff = page_offset(page) + PAGE_SIZE;
- }
- pagevec_release(&pvec);
- } while (index < end);
-
- /* When no page at lastoff and we are not done, we found a hole. */
- if (whence != SEEK_HOLE)
- goto not_found;
-
-check_range:
- if (lastoff < offset + length)
- goto out;
-not_found:
- lastoff = -ENOENT;
-out:
- pagevec_release(&pvec);
- return lastoff;
-}
-
-
-static loff_t
-iomap_seek_hole_actor(struct inode *inode, loff_t offset, loff_t length,
+iomap_seek_hole_actor(struct inode *inode, loff_t start, loff_t length,
void *data, struct iomap *iomap, struct iomap *srcmap)
{
+ loff_t offset = start;
+
switch (iomap->type) {
case IOMAP_UNWRITTEN:
- offset = page_cache_seek_hole_data(inode, offset, length,
- SEEK_HOLE);
- if (offset < 0)
+ offset = mapping_seek_hole_data(inode->i_mapping, start,
+ start + length, SEEK_HOLE);
+ if (offset == start + length)
return length;
fallthrough;
case IOMAP_HOLE:
@@ -164,15 +59,17 @@ iomap_seek_hole(struct inode *inode, loff_t offset, const struct iomap_ops *ops)
EXPORT_SYMBOL_GPL(iomap_seek_hole);

static loff_t
-iomap_seek_data_actor(struct inode *inode, loff_t offset, loff_t length,
+iomap_seek_data_actor(struct inode *inode, loff_t start, loff_t length,
void *data, struct iomap *iomap, struct iomap *srcmap)
{
+ loff_t offset = start;
+
switch (iomap->type) {
case IOMAP_HOLE:
return length;
case IOMAP_UNWRITTEN:
- offset = page_cache_seek_hole_data(inode, offset, length,
- SEEK_DATA);
+ offset = mapping_seek_hole_data(inode->i_mapping, start,
+ start + length, SEEK_DATA);
if (offset < 0)
return length;
fallthrough;
diff --git a/mm/filemap.c b/mm/filemap.c
index ab7103eb7e11..ef7411ea3f91 100644
--- a/mm/filemap.c
+++ b/mm/filemap.c
@@ -2586,11 +2586,36 @@ generic_file_read_iter(struct kiocb *iocb, struct iov_iter *iter)
}
EXPORT_SYMBOL(generic_file_read_iter);

-static inline bool page_seek_match(struct page *page, bool seek_data)
+static inline loff_t page_seek_hole_data(struct xa_state *xas,
+ struct address_space *mapping, struct page *page,
+ loff_t start, loff_t end, bool seek_data)
{
+ const struct address_space_operations *ops = mapping->a_ops;
+ size_t offset, bsz = i_blocksize(mapping->host);
+
if (xa_is_value(page) || PageUptodate(page))
- return seek_data;
- return !seek_data;
+ return seek_data ? start : end;
+ if (!ops->is_partially_uptodate)
+ return seek_data ? end : start;
+
+ xas_pause(xas);
+ rcu_read_unlock();
+ lock_page(page);
+ if (unlikely(page->mapping != mapping))
+ goto unlock;
+
+ offset = offset_in_thp(page, start) & ~(bsz - 1);
+
+ do {
+ if (ops->is_partially_uptodate(page, offset, bsz) == seek_data)
+ break;
+ start = (start + bsz) & ~(bsz - 1);
+ offset += bsz;
+ } while (offset < thp_size(page));
+unlock:
+ unlock_page(page);
+ rcu_read_lock();
+ return start;
}

static inline
@@ -2640,9 +2665,11 @@ loff_t mapping_seek_hole_data(struct address_space *mapping, loff_t start,
start = pos;
}

- if (page_seek_match(page, seek_data))
+ pos += seek_page_size(&xas, page);
+ start = page_seek_hole_data(&xas, mapping, page, start, pos,
+ seek_data);
+ if (start < pos)
goto unlock;
- start = pos + seek_page_size(&xas, page);
put_page(page);
}
rcu_read_unlock();
--
2.28.0

2020-11-12 21:30:46

by Matthew Wilcox

Subject: [PATCH v4 11/16] mm: Add an 'end' parameter to pagevec_lookup_entries

Simplifies the callers and uses the existing functionality
in find_get_entries(). We can also drop the final argument of
truncate_exceptional_pvec_entries() and simplify the logic in that
function.

Signed-off-by: Matthew Wilcox (Oracle) <[email protected]>
Reviewed-by: Jan Kara <[email protected]>
Reviewed-by: William Kucharski <[email protected]>
---
include/linux/pagevec.h | 5 ++---
mm/swap.c | 8 ++++----
mm/truncate.c | 41 ++++++++++-------------------------------
3 files changed, 16 insertions(+), 38 deletions(-)

diff --git a/include/linux/pagevec.h b/include/linux/pagevec.h
index 081d934eda64..4b245592262c 100644
--- a/include/linux/pagevec.h
+++ b/include/linux/pagevec.h
@@ -26,9 +26,8 @@ struct pagevec {
void __pagevec_release(struct pagevec *pvec);
void __pagevec_lru_add(struct pagevec *pvec);
unsigned pagevec_lookup_entries(struct pagevec *pvec,
- struct address_space *mapping,
- pgoff_t start, unsigned nr_entries,
- pgoff_t *indices);
+ struct address_space *mapping, pgoff_t start, pgoff_t end,
+ unsigned nr_entries, pgoff_t *indices);
void pagevec_remove_exceptionals(struct pagevec *pvec);
unsigned pagevec_lookup_range(struct pagevec *pvec,
struct address_space *mapping,
diff --git a/mm/swap.c b/mm/swap.c
index 39be55635ebd..5c2c12e3581e 100644
--- a/mm/swap.c
+++ b/mm/swap.c
@@ -1078,6 +1078,7 @@ void __pagevec_lru_add(struct pagevec *pvec)
* @pvec: Where the resulting entries are placed
* @mapping: The address_space to search
* @start: The starting entry index
+ * @end: The highest index to return (inclusive).
* @nr_entries: The maximum number of pages
* @indices: The cache indices corresponding to the entries in @pvec
*
@@ -1098,11 +1099,10 @@ void __pagevec_lru_add(struct pagevec *pvec)
* found.
*/
unsigned pagevec_lookup_entries(struct pagevec *pvec,
- struct address_space *mapping,
- pgoff_t start, unsigned nr_entries,
- pgoff_t *indices)
+ struct address_space *mapping, pgoff_t start, pgoff_t end,
+ unsigned nr_entries, pgoff_t *indices)
{
- pvec->nr = find_get_entries(mapping, start, ULONG_MAX, nr_entries,
+ pvec->nr = find_get_entries(mapping, start, end, nr_entries,
pvec->pages, indices);
return pagevec_count(pvec);
}
diff --git a/mm/truncate.c b/mm/truncate.c
index eefd62898db1..dcaee659fe1a 100644
--- a/mm/truncate.c
+++ b/mm/truncate.c
@@ -57,11 +57,10 @@ static void clear_shadow_entry(struct address_space *mapping, pgoff_t index,
* exceptional entries similar to what pagevec_remove_exceptionals does.
*/
static void truncate_exceptional_pvec_entries(struct address_space *mapping,
- struct pagevec *pvec, pgoff_t *indices,
- pgoff_t end)
+ struct pagevec *pvec, pgoff_t *indices)
{
int i, j;
- bool dax, lock;
+ bool dax;

/* Handled by shmem itself */
if (shmem_mapping(mapping))
@@ -75,8 +74,7 @@ static void truncate_exceptional_pvec_entries(struct address_space *mapping,
return;

dax = dax_mapping(mapping);
- lock = !dax && indices[j] < end;
- if (lock)
+ if (!dax)
xa_lock_irq(&mapping->i_pages);

for (i = j; i < pagevec_count(pvec); i++) {
@@ -88,9 +86,6 @@ static void truncate_exceptional_pvec_entries(struct address_space *mapping,
continue;
}

- if (index >= end)
- continue;
-
if (unlikely(dax)) {
dax_delete_mapping_entry(mapping, index);
continue;
@@ -99,7 +94,7 @@ static void truncate_exceptional_pvec_entries(struct address_space *mapping,
__clear_shadow_entry(mapping, index, page);
}

- if (lock)
+ if (!dax)
xa_unlock_irq(&mapping->i_pages);
pvec->nr = j;
}
@@ -329,7 +324,7 @@ void truncate_inode_pages_range(struct address_space *mapping,
while (index < end && find_lock_entries(mapping, index, end - 1,
&pvec, indices)) {
index = indices[pagevec_count(&pvec) - 1] + 1;
- truncate_exceptional_pvec_entries(mapping, &pvec, indices, end);
+ truncate_exceptional_pvec_entries(mapping, &pvec, indices);
for (i = 0; i < pagevec_count(&pvec); i++)
truncate_cleanup_page(mapping, pvec.pages[i]);
delete_from_page_cache_batch(mapping, &pvec);
@@ -381,8 +376,8 @@ void truncate_inode_pages_range(struct address_space *mapping,
index = start;
for ( ; ; ) {
cond_resched();
- if (!pagevec_lookup_entries(&pvec, mapping, index,
- min(end - index, (pgoff_t)PAGEVEC_SIZE), indices)) {
+ if (!pagevec_lookup_entries(&pvec, mapping, index, end - 1,
+ PAGEVEC_SIZE, indices)) {
/* If all gone from start onwards, we're done */
if (index == start)
break;
@@ -390,23 +385,12 @@ void truncate_inode_pages_range(struct address_space *mapping,
index = start;
continue;
}
- if (index == start && indices[0] >= end) {
- /* All gone out of hole to be punched, we're done */
- pagevec_remove_exceptionals(&pvec);
- pagevec_release(&pvec);
- break;
- }

for (i = 0; i < pagevec_count(&pvec); i++) {
struct page *page = pvec.pages[i];

/* We rely upon deletion not changing page->index */
index = indices[i];
- if (index >= end) {
- /* Restart punch to make sure all gone */
- index = start - 1;
- break;
- }

if (xa_is_value(page))
continue;
@@ -417,7 +401,7 @@ void truncate_inode_pages_range(struct address_space *mapping,
truncate_inode_page(mapping, page);
unlock_page(page);
}
- truncate_exceptional_pvec_entries(mapping, &pvec, indices, end);
+ truncate_exceptional_pvec_entries(mapping, &pvec, indices);
pagevec_release(&pvec);
index++;
}
@@ -513,8 +497,6 @@ static unsigned long __invalidate_mapping_pages(struct address_space *mapping,

/* We rely upon deletion not changing page->index */
index = indices[i];
- if (index > end)
- break;

if (xa_is_value(page)) {
invalidate_exceptional_entry(mapping, index,
@@ -650,16 +632,13 @@ int invalidate_inode_pages2_range(struct address_space *mapping,

pagevec_init(&pvec);
index = start;
- while (index <= end && pagevec_lookup_entries(&pvec, mapping, index,
- min(end - index, (pgoff_t)PAGEVEC_SIZE - 1) + 1,
- indices)) {
+ while (pagevec_lookup_entries(&pvec, mapping, index, end,
+ PAGEVEC_SIZE, indices)) {
for (i = 0; i < pagevec_count(&pvec); i++) {
struct page *page = pvec.pages[i];

/* We rely upon deletion not changing page->index */
index = indices[i];
- if (index > end)
- break;

if (xa_is_value(page)) {
if (!invalidate_exceptional_entry2(mapping,
--
2.28.0

2020-11-12 21:30:59

by Matthew Wilcox

Subject: [PATCH v4 09/16] mm: Add and use find_lock_entries

We have three functions (shmem_undo_range(), truncate_inode_pages_range()
and invalidate_mapping_pages()) which want exactly this function,
so add it to filemap.c. Before this patch, shmem_undo_range() would
split any compound page which overlaps either end of the range being
punched in both the first and second loops through the address space.
After this patch, that functionality is left for the second loop, which
is arguably more appropriate since the first loop is supposed to run
through all the pages quickly, and splitting a page can sleep.
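
The loop shape those three callers share ends up looking roughly like
the sketch below (illustrative only; example_walk() is invented and the
real callers differ in how they consume each entry):

#include <linux/pagemap.h>
#include <linux/pagevec.h>
#include <linux/sched.h>
#include <linux/xarray.h>
#include "internal.h"       /* find_lock_entries() is mm-internal */

static void example_walk(struct address_space *mapping, pgoff_t start,
                pgoff_t end)
{
        pgoff_t indices[PAGEVEC_SIZE];
        struct pagevec pvec;
        pgoff_t index = start;
        unsigned int i;

        pagevec_init(&pvec);
        while (index < end &&
               find_lock_entries(mapping, index, end - 1, &pvec, indices)) {
                index = indices[pagevec_count(&pvec) - 1] + 1;
                for (i = 0; i < pagevec_count(&pvec); i++) {
                        struct page *page = pvec.pages[i];

                        if (xa_is_value(page))
                                continue;       /* shadow/swap/DAX entry */
                        /* ... operate on the locked head page ... */
                        unlock_page(page);
                }
                pagevec_remove_exceptionals(&pvec);
                pagevec_release(&pvec);
                cond_resched();
        }
}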

Signed-off-by: Matthew Wilcox (Oracle) <[email protected]>
Reviewed-by: Jan Kara <[email protected]>
Reviewed-by: William Kucharski <[email protected]>
---
mm/filemap.c | 57 ++++++++++++++++++++++++++++++++
mm/internal.h | 3 ++
mm/shmem.c | 22 +++----------
mm/truncate.c | 91 +++++++--------------------------------------------
4 files changed, 76 insertions(+), 97 deletions(-)

diff --git a/mm/filemap.c b/mm/filemap.c
index ef7411ea3f91..f18c5074865d 100644
--- a/mm/filemap.c
+++ b/mm/filemap.c
@@ -1925,6 +1925,63 @@ unsigned find_get_entries(struct address_space *mapping,
return ret;
}

+/**
+ * find_lock_entries - Find a batch of pagecache entries.
+ * @mapping: The address_space to search.
+ * @start: The starting page cache index.
+ * @end: The final page index (inclusive).
+ * @pvec: Where the resulting entries are placed.
+ * @indices: The cache indices of the entries in @pvec.
+ *
+ * find_lock_entries() will return a batch of entries from @mapping.
+ * Swap, shadow and DAX entries are included. Pages are returned
+ * locked and with an incremented refcount. Pages which are locked by
+ * somebody else or under writeback are skipped. Only the head page of
+ * a THP is returned. Pages which are partially outside the range are
+ * not returned.
+ *
+ * The entries have ascending indexes. The indices may not be consecutive
+ * due to not-present entries, THP pages, pages which could not be locked
+ * or pages under writeback.
+ *
+ * Return: The number of entries which were found.
+ */
+unsigned find_lock_entries(struct address_space *mapping, pgoff_t start,
+ pgoff_t end, struct pagevec *pvec, pgoff_t *indices)
+{
+ XA_STATE(xas, &mapping->i_pages, start);
+ struct page *page;
+
+ rcu_read_lock();
+ while ((page = find_get_entry(&xas, end, XA_PRESENT))) {
+ if (!xa_is_value(page)) {
+ if (page->index < start)
+ goto put;
+ VM_BUG_ON_PAGE(page->index != xas.xa_index, page);
+ if (page->index + thp_nr_pages(page) - 1 > end)
+ goto put;
+ if (!trylock_page(page))
+ goto put;
+ if (page->mapping != mapping || PageWriteback(page))
+ goto unlock;
+ }
+ indices[pvec->nr] = xas.xa_index;
+ if (!pagevec_add(pvec, page))
+ break;
+ goto next;
+unlock:
+ unlock_page(page);
+put:
+ put_page(page);
+next:
+ if (!xa_is_value(page) && PageTransHuge(page))
+ xas_set(&xas, page->index + thp_nr_pages(page));
+ }
+ rcu_read_unlock();
+
+ return pagevec_count(pvec);
+}
+
/**
* find_get_pages_range - gang pagecache lookup
* @mapping: The address_space to search
diff --git a/mm/internal.h b/mm/internal.h
index 93880d460e12..3547fed59d51 100644
--- a/mm/internal.h
+++ b/mm/internal.h
@@ -60,6 +60,9 @@ static inline void force_page_cache_readahead(struct address_space *mapping,
force_page_cache_ra(&ractl, &file->f_ra, nr_to_read);
}

+unsigned find_lock_entries(struct address_space *mapping, pgoff_t start,
+ pgoff_t end, struct pagevec *pvec, pgoff_t *indices);
+
/**
* page_evictable - test whether a page is evictable
* @page: the page to test
diff --git a/mm/shmem.c b/mm/shmem.c
index 6afea99a0dc0..a4aa762a55f8 100644
--- a/mm/shmem.c
+++ b/mm/shmem.c
@@ -907,12 +907,8 @@ static void shmem_undo_range(struct inode *inode, loff_t lstart, loff_t lend,

pagevec_init(&pvec);
index = start;
- while (index < end) {
- pvec.nr = find_get_entries(mapping, index,
- min(end - index, (pgoff_t)PAGEVEC_SIZE),
- pvec.pages, indices);
- if (!pvec.nr)
- break;
+ while (index < end && find_lock_entries(mapping, index, end - 1,
+ &pvec, indices)) {
for (i = 0; i < pagevec_count(&pvec); i++) {
struct page *page = pvec.pages[i];

@@ -927,18 +923,10 @@ static void shmem_undo_range(struct inode *inode, loff_t lstart, loff_t lend,
index, page);
continue;
}
+ index += thp_nr_pages(page) - 1;

- VM_BUG_ON_PAGE(page_to_pgoff(page) != index, page);
-
- if (!trylock_page(page))
- continue;
-
- if ((!unfalloc || !PageUptodate(page)) &&
- page_mapping(page) == mapping) {
- VM_BUG_ON_PAGE(PageWriteback(page), page);
- if (shmem_punch_compound(page, start, end))
- truncate_inode_page(mapping, page);
- }
+ if (!unfalloc || !PageUptodate(page))
+ truncate_inode_page(mapping, page);
unlock_page(page);
}
pagevec_remove_exceptionals(&pvec);
diff --git a/mm/truncate.c b/mm/truncate.c
index 960edf5803ca..eefd62898db1 100644
--- a/mm/truncate.c
+++ b/mm/truncate.c
@@ -326,51 +326,19 @@ void truncate_inode_pages_range(struct address_space *mapping,

pagevec_init(&pvec);
index = start;
- while (index < end && pagevec_lookup_entries(&pvec, mapping, index,
- min(end - index, (pgoff_t)PAGEVEC_SIZE),
- indices)) {
- /*
- * Pagevec array has exceptional entries and we may also fail
- * to lock some pages. So we store pages that can be deleted
- * in a new pagevec.
- */
- struct pagevec locked_pvec;
-
- pagevec_init(&locked_pvec);
- for (i = 0; i < pagevec_count(&pvec); i++) {
- struct page *page = pvec.pages[i];
-
- /* We rely upon deletion not changing page->index */
- index = indices[i];
- if (index >= end)
- break;
-
- if (xa_is_value(page))
- continue;
-
- if (!trylock_page(page))
- continue;
- WARN_ON(page_to_index(page) != index);
- if (PageWriteback(page)) {
- unlock_page(page);
- continue;
- }
- if (page->mapping != mapping) {
- unlock_page(page);
- continue;
- }
- pagevec_add(&locked_pvec, page);
- }
- for (i = 0; i < pagevec_count(&locked_pvec); i++)
- truncate_cleanup_page(mapping, locked_pvec.pages[i]);
- delete_from_page_cache_batch(mapping, &locked_pvec);
- for (i = 0; i < pagevec_count(&locked_pvec); i++)
- unlock_page(locked_pvec.pages[i]);
+ while (index < end && find_lock_entries(mapping, index, end - 1,
+ &pvec, indices)) {
+ index = indices[pagevec_count(&pvec) - 1] + 1;
truncate_exceptional_pvec_entries(mapping, &pvec, indices, end);
+ for (i = 0; i < pagevec_count(&pvec); i++)
+ truncate_cleanup_page(mapping, pvec.pages[i]);
+ delete_from_page_cache_batch(mapping, &pvec);
+ for (i = 0; i < pagevec_count(&pvec); i++)
+ unlock_page(pvec.pages[i]);
pagevec_release(&pvec);
cond_resched();
- index++;
}
+
if (partial_start) {
struct page *page = find_lock_page(mapping, start - 1);
if (page) {
@@ -539,9 +507,7 @@ static unsigned long __invalidate_mapping_pages(struct address_space *mapping,
int i;

pagevec_init(&pvec);
- while (index <= end && pagevec_lookup_entries(&pvec, mapping, index,
- min(end - index, (pgoff_t)PAGEVEC_SIZE - 1) + 1,
- indices)) {
+ while (find_lock_entries(mapping, index, end, &pvec, indices)) {
for (i = 0; i < pagevec_count(&pvec); i++) {
struct page *page = pvec.pages[i];

@@ -555,39 +521,7 @@ static unsigned long __invalidate_mapping_pages(struct address_space *mapping,
page);
continue;
}
-
- if (!trylock_page(page))
- continue;
-
- WARN_ON(page_to_index(page) != index);
-
- /* Middle of THP: skip */
- if (PageTransTail(page)) {
- unlock_page(page);
- continue;
- } else if (PageTransHuge(page)) {
- index += HPAGE_PMD_NR - 1;
- i += HPAGE_PMD_NR - 1;
- /*
- * 'end' is in the middle of THP. Don't
- * invalidate the page as the part outside of
- * 'end' could be still useful.
- */
- if (index > end) {
- unlock_page(page);
- continue;
- }
-
- /* Take a pin outside pagevec */
- get_page(page);
-
- /*
- * Drop extra pins before trying to invalidate
- * the huge page.
- */
- pagevec_remove_exceptionals(&pvec);
- pagevec_release(&pvec);
- }
+ index += thp_nr_pages(page) - 1;

ret = invalidate_inode_page(page);
unlock_page(page);
@@ -601,9 +535,6 @@ static unsigned long __invalidate_mapping_pages(struct address_space *mapping,
if (nr_pagevec)
(*nr_pagevec)++;
}
-
- if (PageTransHuge(page))
- put_page(page);
count += ret;
}
pagevec_remove_exceptionals(&pvec);
--
2.28.0

2020-11-12 21:31:08

by Matthew Wilcox

Subject: [PATCH v4 01/16] mm: Make pagecache tagged lookups return only head pages

Pagecache tags are used for dirty page writeback. Since dirtiness is
tracked on a per-THP basis, we only want to return the head page rather
than each subpage of a tagged page. All the filesystems which use huge
pages today are in-memory, so there are no tagged huge pages today.
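
For a caller, the visible change is that each returned page now stands
for its whole compound range.  As a rough sketch only (the function
below and its loop are invented for illustration, not taken from this
patch), a writeback-style walk over dirty head pages might look like:

#include <linux/mm.h>
#include <linux/pagemap.h>
#include <linux/pagevec.h>

/* Hypothetical example: walk the dirty head pages in [index, end]. */
static void example_walk_dirty(struct address_space *mapping,
                pgoff_t index, pgoff_t end)
{
        struct page *pages[PAGEVEC_SIZE];
        unsigned int i, nr;

        while ((nr = find_get_pages_range_tag(mapping, &index, end,
                        PAGECACHE_TAG_DIRTY, PAGEVEC_SIZE, pages))) {
                for (i = 0; i < nr; i++) {
                        struct page *page = pages[i];

                        /*
                         * A head (or base) page covering thp_nr_pages(page)
                         * indices; @index has already been advanced past
                         * the last page returned.
                         */
                        /* ... write back the whole compound page ... */
                        put_page(page);
                }
        }
}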

Signed-off-by: Matthew Wilcox (Oracle) <[email protected]>
Reviewed-by: Jan Kara <[email protected]>
Reviewed-by: William Kucharski <[email protected]>
---
mm/filemap.c | 11 ++++++-----
1 file changed, 6 insertions(+), 5 deletions(-)

diff --git a/mm/filemap.c b/mm/filemap.c
index 249cf489f5df..bb6f2ae5a68c 100644
--- a/mm/filemap.c
+++ b/mm/filemap.c
@@ -2066,7 +2066,7 @@ unsigned find_get_pages_contig(struct address_space *mapping, pgoff_t index,
EXPORT_SYMBOL(find_get_pages_contig);

/**
- * find_get_pages_range_tag - find and return pages in given range matching @tag
+ * find_get_pages_range_tag - Find and return head pages matching @tag.
* @mapping: the address_space to search
* @index: the starting page index
* @end: The final page index (inclusive)
@@ -2074,8 +2074,9 @@ EXPORT_SYMBOL(find_get_pages_contig);
* @nr_pages: the maximum number of pages
* @pages: where the resulting pages are placed
*
- * Like find_get_pages, except we only return pages which are tagged with
- * @tag. We update @index to index the next page for the traversal.
+ * Like find_get_pages(), except we only return head pages which are tagged
+ * with @tag. @index is updated to the index immediately after the last
+ * page we return, ready for the next iteration.
*
* Return: the number of pages which were found.
*/
@@ -2109,9 +2110,9 @@ unsigned find_get_pages_range_tag(struct address_space *mapping, pgoff_t *index,
if (unlikely(page != xas_reload(&xas)))
goto put_page;

- pages[ret] = find_subpage(page, xas.xa_index);
+ pages[ret] = page;
if (++ret == nr_pages) {
- *index = xas.xa_index + 1;
+ *index = page->index + thp_nr_pages(page);
goto out;
}
continue;
--
2.28.0

2020-11-12 21:31:42

by Matthew Wilcox

Subject: [PATCH v4 15/16] mm/truncate,shmem: Handle truncates that split THPs

Handle THP splitting in the parts of the truncation functions which
already handle partial pages. Factor all that code out into a new
function called truncate_inode_partial_page().

We lose the easy 'bail out' path if a truncate or hole punch is entirely
within a single page. We can add some more complex logic to restore
the optimisation if it proves to be worthwhile.

Signed-off-by: Matthew Wilcox (Oracle) <[email protected]>
Reviewed-by: Jan Kara <[email protected]>
Reviewed-by: William Kucharski <[email protected]>
---
mm/internal.h | 1 +
mm/shmem.c | 97 ++++++++++++++---------------------------
mm/truncate.c | 118 +++++++++++++++++++++++++++++++-------------------
3 files changed, 108 insertions(+), 108 deletions(-)

diff --git a/mm/internal.h b/mm/internal.h
index 3547fed59d51..cb7487efa856 100644
--- a/mm/internal.h
+++ b/mm/internal.h
@@ -618,4 +618,5 @@ struct migration_target_control {
gfp_t gfp_mask;
};

+bool truncate_inode_partial_page(struct page *page, loff_t start, loff_t end);
#endif /* __MM_INTERNAL_H */
diff --git a/mm/shmem.c b/mm/shmem.c
index e01457988dd6..25fe257e56c0 100644
--- a/mm/shmem.c
+++ b/mm/shmem.c
@@ -857,32 +857,6 @@ void shmem_unlock_mapping(struct address_space *mapping)
}
}

-/*
- * Check whether a hole-punch or truncation needs to split a huge page,
- * returning true if no split was required, or the split has been successful.
- *
- * Eviction (or truncation to 0 size) should never need to split a huge page;
- * but in rare cases might do so, if shmem_undo_range() failed to trylock on
- * head, and then succeeded to trylock on tail.
- *
- * A split can only succeed when there are no additional references on the
- * huge page: so the split below relies upon find_get_entries() having stopped
- * when it found a subpage of the huge page, without getting further references.
- */
-static bool shmem_punch_compound(struct page *page, pgoff_t start, pgoff_t end)
-{
- if (!PageTransCompound(page))
- return true;
-
- /* Just proceed to delete a huge page wholly within the range punched */
- if (PageHead(page) &&
- page->index >= start && page->index + HPAGE_PMD_NR <= end)
- return true;
-
- /* Try to split huge page, so we can truly punch the hole or truncate */
- return split_huge_page(page) >= 0;
-}
-
/*
* Remove range of pages and swap entries from page cache, and free them.
* If !unfalloc, truncate or punch hole; if unfalloc, undo failed fallocate.
@@ -894,13 +868,13 @@ static void shmem_undo_range(struct inode *inode, loff_t lstart, loff_t lend,
struct shmem_inode_info *info = SHMEM_I(inode);
pgoff_t start = (lstart + PAGE_SIZE - 1) >> PAGE_SHIFT;
pgoff_t end = (lend + 1) >> PAGE_SHIFT;
- unsigned int partial_start = lstart & (PAGE_SIZE - 1);
- unsigned int partial_end = (lend + 1) & (PAGE_SIZE - 1);
struct pagevec pvec;
pgoff_t indices[PAGEVEC_SIZE];
+ struct page *page;
long nr_swaps_freed = 0;
pgoff_t index;
int i;
+ bool partial_end;

if (lend == -1)
end = -1; /* unsigned, so actually very big */
@@ -910,7 +884,7 @@ static void shmem_undo_range(struct inode *inode, loff_t lstart, loff_t lend,
while (index < end && find_lock_entries(mapping, index, end - 1,
&pvec, indices)) {
for (i = 0; i < pagevec_count(&pvec); i++) {
- struct page *page = pvec.pages[i];
+ page = pvec.pages[i];

index = indices[i];

@@ -933,33 +907,37 @@ static void shmem_undo_range(struct inode *inode, loff_t lstart, loff_t lend,
index++;
}

- if (partial_start) {
- struct page *page = NULL;
- shmem_getpage(inode, start - 1, &page, SGP_READ);
- if (page) {
- unsigned int top = PAGE_SIZE;
- if (start > end) {
- top = partial_end;
- partial_end = 0;
- }
- zero_user_segment(page, partial_start, top);
- set_page_dirty(page);
- unlock_page(page);
- put_page(page);
+ partial_end = ((lend + 1) % PAGE_SIZE) > 0;
+ page = NULL;
+ shmem_getpage(inode, lstart >> PAGE_SHIFT, &page, SGP_READ);
+ if (page) {
+ bool same_page;
+
+ page = thp_head(page);
+ same_page = lend < page_offset(page) + thp_size(page);
+ if (same_page)
+ partial_end = false;
+ set_page_dirty(page);
+ if (!truncate_inode_partial_page(page, lstart, lend)) {
+ start = page->index + thp_nr_pages(page);
+ if (same_page)
+ end = page->index;
}
+ unlock_page(page);
+ put_page(page);
+ page = NULL;
}
- if (partial_end) {
- struct page *page = NULL;
+
+ if (partial_end)
shmem_getpage(inode, end, &page, SGP_READ);
- if (page) {
- zero_user_segment(page, 0, partial_end);
- set_page_dirty(page);
- unlock_page(page);
- put_page(page);
- }
+ if (page) {
+ page = thp_head(page);
+ set_page_dirty(page);
+ if (!truncate_inode_partial_page(page, lstart, lend))
+ end = page->index;
+ unlock_page(page);
+ put_page(page);
}
- if (start >= end)
- return;

index = start;
while (index < end) {
@@ -975,7 +953,7 @@ static void shmem_undo_range(struct inode *inode, loff_t lstart, loff_t lend,
continue;
}
for (i = 0; i < pagevec_count(&pvec); i++) {
- struct page *page = pvec.pages[i];
+ page = pvec.pages[i];

index = indices[i];
if (xa_is_value(page)) {
@@ -1000,18 +978,9 @@ static void shmem_undo_range(struct inode *inode, loff_t lstart, loff_t lend,
break;
}
VM_BUG_ON_PAGE(PageWriteback(page), page);
- if (shmem_punch_compound(page, start, end))
- truncate_inode_page(mapping, page);
- else if (IS_ENABLED(CONFIG_TRANSPARENT_HUGEPAGE)) {
- /* Wipe the page and don't get stuck */
- clear_highpage(page);
- flush_dcache_page(page);
- set_page_dirty(page);
- if (index <
- round_up(start, HPAGE_PMD_NR))
- start = index + 1;
- }
+ truncate_inode_page(mapping, page);
}
+ index = page->index + thp_nr_pages(page) - 1;
unlock_page(page);
}
pagevec_remove_exceptionals(&pvec);
diff --git a/mm/truncate.c b/mm/truncate.c
index 68b7630e1fc4..288781e41a7b 100644
--- a/mm/truncate.c
+++ b/mm/truncate.c
@@ -224,6 +224,53 @@ int truncate_inode_page(struct address_space *mapping, struct page *page)
return 0;
}

+/*
+ * Handle partial (transparent) pages. The page may be entirely within the
+ * range if a split has raced with us. If not, we zero the part of the
+ * page that's within the [start, end] range, and then split the page if
+ * it's a THP. split_page_range() will discard pages which now lie beyond
+ * i_size, and we rely on the caller to discard pages which lie within a
+ * newly created hole.
+ *
+ * Returns false if THP splitting failed so the caller can avoid
+ * discarding the entire page which is stubbornly unsplit.
+ */
+bool truncate_inode_partial_page(struct page *page, loff_t start, loff_t end)
+{
+ loff_t pos = page_offset(page);
+ unsigned int offset, length;
+
+ if (pos < start)
+ offset = start - pos;
+ else
+ offset = 0;
+ length = thp_size(page);
+ if (pos + length <= (u64)end)
+ length = length - offset;
+ else
+ length = end + 1 - pos - offset;
+
+ wait_on_page_writeback(page);
+ if (length == thp_size(page)) {
+ truncate_inode_page(page->mapping, page);
+ return true;
+ }
+
+ /*
+ * We may be zeroing pages we're about to discard, but it avoids
+ * doing a complex calculation here, and then doing the zeroing
+ * anyway if the page split fails.
+ */
+ zero_user(page, offset, length);
+
+ cleancache_invalidate_page(page->mapping, page);
+ if (page_has_private(page))
+ do_invalidatepage(page, offset, length);
+ if (!PageTransHuge(page))
+ return true;
+ return split_huge_page(page) == 0;
+}
+
/*
* Used to get rid of pages on hardware memory corruption.
*/
@@ -288,20 +335,16 @@ void truncate_inode_pages_range(struct address_space *mapping,
{
pgoff_t start; /* inclusive */
pgoff_t end; /* exclusive */
- unsigned int partial_start; /* inclusive */
- unsigned int partial_end; /* exclusive */
struct pagevec pvec;
pgoff_t indices[PAGEVEC_SIZE];
pgoff_t index;
int i;
+ struct page * page;
+ bool partial_end;

if (mapping->nrpages == 0 && mapping->nrexceptional == 0)
goto out;

- /* Offsets within partial pages */
- partial_start = lstart & (PAGE_SIZE - 1);
- partial_end = (lend + 1) & (PAGE_SIZE - 1);
-
/*
* 'start' and 'end' always covers the range of pages to be fully
* truncated. Partial pages are covered with 'partial_start' at the
@@ -334,48 +377,35 @@ void truncate_inode_pages_range(struct address_space *mapping,
cond_resched();
}

- if (partial_start) {
- struct page *page = find_lock_page(mapping, start - 1);
- if (page) {
- unsigned int top = PAGE_SIZE;
- if (start > end) {
- /* Truncation within a single page */
- top = partial_end;
- partial_end = 0;
- }
- wait_on_page_writeback(page);
- zero_user_segment(page, partial_start, top);
- cleancache_invalidate_page(mapping, page);
- if (page_has_private(page))
- do_invalidatepage(page, partial_start,
- top - partial_start);
- unlock_page(page);
- put_page(page);
+ partial_end = ((lend + 1) % PAGE_SIZE) > 0;
+ page = find_lock_head(mapping, lstart >> PAGE_SHIFT);
+ if (page) {
+ bool same_page = lend < page_offset(page) + thp_size(page);
+ if (same_page)
+ partial_end = false;
+ if (!truncate_inode_partial_page(page, lstart, lend)) {
+ start = page->index + thp_nr_pages(page);
+ if (same_page)
+ end = page->index;
}
+ unlock_page(page);
+ put_page(page);
+ page = NULL;
}
- if (partial_end) {
- struct page *page = find_lock_page(mapping, end);
- if (page) {
- wait_on_page_writeback(page);
- zero_user_segment(page, 0, partial_end);
- cleancache_invalidate_page(mapping, page);
- if (page_has_private(page))
- do_invalidatepage(page, 0,
- partial_end);
- unlock_page(page);
- put_page(page);
- }
+
+ if (partial_end)
+ page = find_lock_head(mapping, end);
+ if (page) {
+ if (!truncate_inode_partial_page(page, lstart, lend))
+ end = page->index;
+ unlock_page(page);
+ put_page(page);
}
- /*
- * If the truncation happened within a single page no pages
- * will be released, just zeroed, so we can bail out now.
- */
- if (start >= end)
- goto out;

index = start;
- for ( ; ; ) {
+ while (index < end) {
cond_resched();
+
if (!find_get_entries(mapping, index, end - 1, &pvec,
indices)) {
/* If all gone from start onwards, we're done */
@@ -387,7 +417,7 @@ void truncate_inode_pages_range(struct address_space *mapping,
}

for (i = 0; i < pagevec_count(&pvec); i++) {
- struct page *page = pvec.pages[i];
+ page = pvec.pages[i];

/* We rely upon deletion not changing page->index */
index = indices[i];
@@ -396,7 +426,7 @@ void truncate_inode_pages_range(struct address_space *mapping,
continue;

lock_page(page);
- WARN_ON(page_to_index(page) != index);
+ index = page->index + thp_nr_pages(page) - 1;
wait_on_page_writeback(page);
truncate_inode_page(mapping, page);
unlock_page(page);
--
2.28.0

2020-11-12 21:32:04

by Matthew Wilcox

Subject: [PATCH v4 03/16] mm/swap: Optimise get_shadow_from_swap_cache

There's no need to get a reference to the page, just load the entry and
see if it's a shadow entry.

Signed-off-by: Matthew Wilcox (Oracle) <[email protected]>
---
mm/swap_state.c | 4 +---
1 file changed, 1 insertion(+), 3 deletions(-)

diff --git a/mm/swap_state.c b/mm/swap_state.c
index cf7b322a9abc..d2161154d873 100644
--- a/mm/swap_state.c
+++ b/mm/swap_state.c
@@ -113,11 +113,9 @@ void *get_shadow_from_swap_cache(swp_entry_t entry)
pgoff_t idx = swp_offset(entry);
struct page *page;

- page = find_get_entry(address_space, idx);
+ page = xa_load(&address_space->i_pages, idx);
if (xa_is_value(page))
return page;
- if (page)
- put_page(page);
return NULL;
}

--
2.28.0

2020-11-12 21:32:49

by Matthew Wilcox

Subject: [PATCH v4 10/16] mm: Add an 'end' parameter to find_get_entries

This simplifies the callers and leads to a more efficient implementation
since the XArray has this functionality already.

Signed-off-by: Matthew Wilcox (Oracle) <[email protected]>
Reviewed-by: Jan Kara <[email protected]>
Reviewed-by: William Kucharski <[email protected]>
---
include/linux/pagemap.h | 4 ++--
mm/filemap.c | 9 +++++----
mm/shmem.c | 10 ++--------
mm/swap.c | 2 +-
4 files changed, 10 insertions(+), 15 deletions(-)

diff --git a/include/linux/pagemap.h b/include/linux/pagemap.h
index 9f1f4ab9612a..c7c26a902743 100644
--- a/include/linux/pagemap.h
+++ b/include/linux/pagemap.h
@@ -449,8 +449,8 @@ static inline struct page *find_subpage(struct page *head, pgoff_t index)
}

unsigned find_get_entries(struct address_space *mapping, pgoff_t start,
- unsigned int nr_entries, struct page **entries,
- pgoff_t *indices);
+ pgoff_t end, unsigned int nr_entries, struct page **entries,
+ pgoff_t *indices);
unsigned find_get_pages_range(struct address_space *mapping, pgoff_t *start,
pgoff_t end, unsigned int nr_pages,
struct page **pages);
diff --git a/mm/filemap.c b/mm/filemap.c
index f18c5074865d..b3b89a62ab1a 100644
--- a/mm/filemap.c
+++ b/mm/filemap.c
@@ -1870,6 +1870,7 @@ static inline struct page *find_get_entry(struct xa_state *xas, pgoff_t max,
* find_get_entries - gang pagecache lookup
* @mapping: The address_space to search
* @start: The starting page cache index
+ * @end: The final page index (inclusive).
* @nr_entries: The maximum number of entries
* @entries: Where the resulting entries are placed
* @indices: The cache indices corresponding to the entries in @entries
@@ -1893,9 +1894,9 @@ static inline struct page *find_get_entry(struct xa_state *xas, pgoff_t max,
*
* Return: the number of pages and shadow entries which were found.
*/
-unsigned find_get_entries(struct address_space *mapping,
- pgoff_t start, unsigned int nr_entries,
- struct page **entries, pgoff_t *indices)
+unsigned find_get_entries(struct address_space *mapping, pgoff_t start,
+ pgoff_t end, unsigned int nr_entries, struct page **entries,
+ pgoff_t *indices)
{
XA_STATE(xas, &mapping->i_pages, start);
struct page *page;
@@ -1905,7 +1906,7 @@ unsigned find_get_entries(struct address_space *mapping,
return 0;

rcu_read_lock();
- while ((page = find_get_entry(&xas, ULONG_MAX, XA_PRESENT))) {
+ while ((page = find_get_entry(&xas, end, XA_PRESENT))) {
/*
* Terminate early on finding a THP, to allow the caller to
* handle it all at once; but continue if this is hugetlbfs.
diff --git a/mm/shmem.c b/mm/shmem.c
index a4aa762a55f8..7a62dc967d7d 100644
--- a/mm/shmem.c
+++ b/mm/shmem.c
@@ -913,8 +913,6 @@ static void shmem_undo_range(struct inode *inode, loff_t lstart, loff_t lend,
struct page *page = pvec.pages[i];

index = indices[i];
- if (index >= end)
- break;

if (xa_is_value(page)) {
if (unfalloc)
@@ -967,9 +965,8 @@ static void shmem_undo_range(struct inode *inode, loff_t lstart, loff_t lend,
while (index < end) {
cond_resched();

- pvec.nr = find_get_entries(mapping, index,
- min(end - index, (pgoff_t)PAGEVEC_SIZE),
- pvec.pages, indices);
+ pvec.nr = find_get_entries(mapping, index, end - 1,
+ PAGEVEC_SIZE, pvec.pages, indices);
if (!pvec.nr) {
/* If all gone or hole-punch or unfalloc, we're done */
if (index == start || end != -1)
@@ -982,9 +979,6 @@ static void shmem_undo_range(struct inode *inode, loff_t lstart, loff_t lend,
struct page *page = pvec.pages[i];

index = indices[i];
- if (index >= end)
- break;
-
if (xa_is_value(page)) {
if (unfalloc)
continue;
diff --git a/mm/swap.c b/mm/swap.c
index 29220174433b..39be55635ebd 100644
--- a/mm/swap.c
+++ b/mm/swap.c
@@ -1102,7 +1102,7 @@ unsigned pagevec_lookup_entries(struct pagevec *pvec,
pgoff_t start, unsigned nr_entries,
pgoff_t *indices)
{
- pvec->nr = find_get_entries(mapping, start, nr_entries,
+ pvec->nr = find_get_entries(mapping, start, ULONG_MAX, nr_entries,
pvec->pages, indices);
return pagevec_count(pvec);
}
--
2.28.0

2020-11-14 09:56:17

by Christoph Hellwig

Subject: Re: [PATCH v4 01/16] mm: Make pagecache tagged lookups return only head pages

On Thu, Nov 12, 2020 at 09:26:26PM +0000, Matthew Wilcox (Oracle) wrote:
> Pagecache tags are used for dirty page writeback. Since dirtiness is
> tracked on a per-THP basis, we only want to return the head page rather
> than each subpage of a tagged page. All the filesystems which use huge
> pages today are in-memory, so there are no tagged huge pages today.
>
> Signed-off-by: Matthew Wilcox (Oracle) <[email protected]>
> Reviewed-by: Jan Kara <[email protected]>
> Reviewed-by: William Kucharski <[email protected]>

Looks good,

Reviewed-by: Christoph Hellwig <[email protected]>

2020-11-14 09:57:34

by Christoph Hellwig

Subject: Re: [PATCH v4 02/16] mm/shmem: Use pagevec_lookup in shmem_unlock_mapping

On Thu, Nov 12, 2020 at 09:26:27PM +0000, Matthew Wilcox (Oracle) wrote:
> The comment shows that the reason for using find_get_entries() is now
> stale; find_get_pages() will not return 0 if it hits a consecutive run
> of swap entries, and I don't believe it has since 2011. pagevec_lookup()
> is a simpler function to use than find_get_pages(), so use it instead.
>
> Signed-off-by: Matthew Wilcox (Oracle) <[email protected]>
> Reviewed-by: Jan Kara <[email protected]>
> Reviewed-by: William Kucharski <[email protected]>

Looks good,

Reviewed-by: Christoph Hellwig <[email protected]>

2020-11-14 09:59:04

by Christoph Hellwig

Subject: Re: [PATCH v4 03/16] mm/swap: Optimise get_shadow_from_swap_cache

On Thu, Nov 12, 2020 at 09:26:28PM +0000, Matthew Wilcox (Oracle) wrote:
> There's no need to get a reference to the page, just load the entry and
> see if it's a shadow entry.
>
> Signed-off-by: Matthew Wilcox (Oracle) <[email protected]>

Looks good,

Reviewed-by: Christoph Hellwig <[email protected]>

2020-11-14 10:03:27

by Christoph Hellwig

Subject: Re: [PATCH v4 05/16] mm/filemap: Rename find_get_entry to mapping_get_entry

On Thu, Nov 12, 2020 at 09:26:30PM +0000, Matthew Wilcox (Oracle) wrote:
> find_get_entry doesn't "find" anything. It returns the entry at
> a particular index.
>
> Signed-off-by: Matthew Wilcox (Oracle) <[email protected]>

Looks good,

Reviewed-by: Christoph Hellwig <[email protected]>

2020-11-14 10:08:54

by Christoph Hellwig

Subject: Re: [PATCH v4 08/16] iomap: Use mapping_seek_hole_data

On Thu, Nov 12, 2020 at 09:26:33PM +0000, Matthew Wilcox (Oracle) wrote:
> Enhance mapping_seek_hole_data() to handle partially uptodate pages and
> convert the iomap seek code to call it.

Maybe split this into two patches for the mapping_seek_hole_data
enhancement (which could use a little more of a commit log anyway..)
and the iomap switch?

> -static inline bool page_seek_match(struct page *page, bool seek_data)
> +static inline loff_t page_seek_hole_data(struct xa_state *xas,
> + struct address_space *mapping, struct page *page,
> + loff_t start, loff_t end, bool seek_data)

This seems like quite a huge function to force inline..

2020-11-14 10:09:48

by Christoph Hellwig

Subject: Re: [PATCH v4 09/16] mm: Add and use find_lock_entries

On Thu, Nov 12, 2020 at 09:26:34PM +0000, Matthew Wilcox (Oracle) wrote:
> We have three functions (shmem_undo_range(), truncate_inode_pages_range()
> and invalidate_mapping_pages()) which want exactly this function,
> so add it to filemap.c. Before this patch, shmem_undo_range() would
> split any compound page which overlaps either end of the range being
> punched in both the first and second loops through the address space.
> After this patch, that functionality is left for the second loop, which
> is arguably more appropriate since the first loop is supposed to run
> through all the pages quickly, and splitting a page can sleep.
>
> Signed-off-by: Matthew Wilcox (Oracle) <[email protected]>
> Reviewed-by: Jan Kara <[email protected]>
> Reviewed-by: William Kucharski <[email protected]>

Looks good,

Reviewed-by: Christoph Hellwig <[email protected]>

2020-11-14 10:12:05

by Christoph Hellwig

Subject: Re: [PATCH v4 10/16] mm: Add an 'end' parameter to find_get_entries

On Thu, Nov 12, 2020 at 09:26:35PM +0000, Matthew Wilcox (Oracle) wrote:
> This simplifies the callers and leads to a more efficient implementation
> since the XArray has this functionality already.
>
> Signed-off-by: Matthew Wilcox (Oracle) <[email protected]>
> Reviewed-by: Jan Kara <[email protected]>
> Reviewed-by: William Kucharski <[email protected]>

Looks good,

Reviewed-by: Christoph Hellwig <[email protected]>

2020-11-14 10:22:23

by Christoph Hellwig

Subject: Re: [PATCH v4 11/16] mm: Add an 'end' parameter to pagevec_lookup_entries

On Thu, Nov 12, 2020 at 09:26:36PM +0000, Matthew Wilcox (Oracle) wrote:
> Simplifies the callers and uses the existing functionality
> in find_get_entries(). We can also drop the final argument of
> truncate_exceptional_pvec_entries() and simplify the logic in that
> function.

I'd say the truncate_exceptional_pvec_entries simplification is a separate
step and should normally be a separate patch.

But otherwise this looks good:

Reviewed-by: Christoph Hellwig <[email protected]>

2020-11-14 10:25:52

by Christoph Hellwig

[permalink] [raw]
Subject: Re: [PATCH v4 16/16] mm/filemap: Return only head pages from find_get_entries

On Thu, Nov 12, 2020 at 09:26:41PM +0000, Matthew Wilcox (Oracle) wrote:
> All callers now expect head (and base) pages, and can handle multiple
> head pages in a single batch, so make find_get_entries() behave that way.
> Also take the opportunity to make it use the pagevec infrastructure
> instead of open-coding how pvecs behave. This has the side-effect of
> being able to append to a pagevec with existing contents, although we
> don't make use of that functionality anywhere yet.

Looks good,

Reviewed-by: Christoph Hellwig <[email protected]>

2020-11-14 10:26:00

by Christoph Hellwig

[permalink] [raw]
Subject: Re: [PATCH v4 13/16] mm: Pass pvec directly to find_get_entries

On Thu, Nov 12, 2020 at 09:26:38PM +0000, Matthew Wilcox (Oracle) wrote:
> + pvec->nr = ret;
> return ret;

Do we need to return the number of found entries in addition to storing
it in the pagevec?

2020-11-14 15:25:52

by Matthew Wilcox

[permalink] [raw]
Subject: Re: [PATCH v4 13/16] mm: Pass pvec directly to find_get_entries

On Sat, Nov 14, 2020 at 11:21:33AM +0100, Christoph Hellwig wrote:
> On Thu, Nov 12, 2020 at 09:26:38PM +0000, Matthew Wilcox (Oracle) wrote:
> > + pvec->nr = ret;
> > return ret;
>
> Do we need to return the number of found entries in addition to storing
> it in the pagevec?

We really only need a bool -- is pvec->nr zero or not. But we have the
number calculated so we may as well return it. All the current find_*()
behave this way and I haven't seen enough reason to change it yet.
Making it return void just makes the callers uglier. I have a batch of
patches to do something similar for find_get_pages / find_get_pages_range
/ pagevec_lookup / pagevec_lookup_range, but I don't need them for the
THP patchset, so I haven't sent them yet.
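
For illustration only, a sketch of the usual caller pattern (not code
from this series), showing how the returned count keeps the familiar
loop shape:

	while (find_get_entries(mapping, index, end, &pvec, indices)) {
		for (i = 0; i < pagevec_count(&pvec); i++) {
			struct page *page = pvec.pages[i];
			/* handle page or shadow entry ... */
		}
		pagevec_remove_exceptionals(&pvec);
		pagevec_release(&pvec);
	}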

2020-11-16 10:37:12

by Hugh Dickins

[permalink] [raw]
Subject: Re: [PATCH v4 00/16] Overhaul multi-page lookups for THP

On Thu, 12 Nov 2020, Matthew Wilcox (Oracle) wrote:

> This THP prep patchset changes several page cache iteration APIs to only
> return head pages.
>
> - It's only possible to tag head pages in the page cache, so only
> return head pages, not all their subpages.
> - Factor a lot of common code out of the various batch lookup routines
> - Add mapping_seek_hole_data()
> - Unify find_get_entries() and pagevec_lookup_entries()
> - Make find_get_entries only return head pages, like find_get_entry().
>
> These are only loosely connected, but they seem to make sense together
> as a series.
>
> v4:
> - Add FGP_ENTRY, remove find_get_entry and find_lock_entry
> - Rename xas_find_get_entry to find_get_entry
> - Add "Optimise get_shadow_from_swap_cache"
> - Move "iomap: Use mapping_seek_hole_data" to this patch series
> - Rebase against next-20201112

I hope next-20201112 had nothing vital for this series, I applied
it to v5.10-rc3, and have been busy testing huge tmpfs on that.

Several fixes necessary. It was only a couple of hours ago that I
finally saw what was wrong in shmem_undo_range(); I'm tired now and
sending these off somewhat hastily, so I may have got something wrong,
but best to get them to you soonest, to rework and fold in as you see fit.
Please allow me to gather them here below, instead of replying to
individual patches.

Fix to [PATCH v4 06/16] mm/filemap: Add helper for finding pages.
I hit that VM_BUG_ON_PAGE(!thp_contains) when swapping, it is not
safe without page lock, during the interval when shmem is moving a
page between page cache and swap cache. It could be tightened by
passing in a new FGP to distinguish whether searching page or swap
cache, but I think never tight enough in the swap case - because there
is no rule for persisting page->private as there is for page->index.
The best I could do is:

--- 5103w/mm/filemap.c 2020-11-12 15:46:23.191275470 -0800
+++ 5103wh/mm/filemap.c 2020-11-16 01:09:35.427677277 -0800
@@ -1858,7 +1858,20 @@ retry:
put_page(page);
goto reset;
}
- VM_BUG_ON_PAGE(!thp_contains(page, xas->xa_index), page);
+
+#ifdef CONFIG_DEBUG_VM
+ /*
+ * thp_contains() is unreliable when shmem is swizzling between page
+ * and swap cache, when both PageSwapCache and page->mapping are set.
+ */
+ if (!thp_contains(page, xas->xa_index)) {
+ VM_BUG_ON_PAGE(!PageSwapBacked(page), page);
+ if (trylock_page(page)) {
+ VM_BUG_ON_PAGE(!thp_contains(page, xas->xa_index), page);
+ unlock_page(page);
+ }
+ }
+#endif

return page;
reset:

Fix to [PATCH v4 07/16] mm/filemap: Add mapping_seek_hole_data.
Crashed on a swap entry 0x2ff09, fairly obvious...

--- 5103w/mm/filemap.c 2020-11-12 15:46:23.191275470 -0800
+++ 5103wh/mm/filemap.c 2020-11-16 01:09:35.427677277 -0800
@@ -2632,7 +2632,8 @@ loff_t mapping_seek_hole_data(struct add
seek_data);
if (start < pos)
goto unlock;
- put_page(page);
+ if (!xa_is_value(page))
+ put_page(page);
}
rcu_read_unlock();

Fix to [PATCH v4 15/16] mm/truncate,shmem: Handle truncates that split THPs.
One machine ran fine, swapping and building in ext4 on loop0 on huge tmpfs;
one machine got occasional pages of zeros in its .os; one machine couldn't
get started because of ext4_find_dest_de errors on the newly mkfs'ed fs.
The partial_end case was decided by PAGE_SIZE, when there might be a THP
there. The below patch has run well (for not very long), but I could
easily have got it slightly wrong, off-by-one or whatever; and I have
not looked into the similar code in mm/truncate.c, maybe that will need
a similar fix or maybe not.

--- 5103w/mm/shmem.c 2020-11-12 15:46:21.075254036 -0800
+++ 5103wh/mm/shmem.c 2020-11-16 01:09:35.431677308 -0800
@@ -874,7 +874,7 @@ static void shmem_undo_range(struct inod
long nr_swaps_freed = 0;
pgoff_t index;
int i;
- bool partial_end;
+ bool same_page;

if (lend == -1)
end = -1; /* unsigned, so actually very big */
@@ -907,16 +907,12 @@ static void shmem_undo_range(struct inod
index++;
}

- partial_end = ((lend + 1) % PAGE_SIZE) > 0;
+ same_page = (lstart >> PAGE_SHIFT) == end;
page = NULL;
shmem_getpage(inode, lstart >> PAGE_SHIFT, &page, SGP_READ);
if (page) {
- bool same_page;
-
page = thp_head(page);
same_page = lend < page_offset(page) + thp_size(page);
- if (same_page)
- partial_end = false;
set_page_dirty(page);
if (!truncate_inode_partial_page(page, lstart, lend)) {
start = page->index + thp_nr_pages(page);
@@ -928,7 +924,7 @@ static void shmem_undo_range(struct inod
page = NULL;
}

- if (partial_end)
+ if (!same_page)
shmem_getpage(inode, end, &page, SGP_READ);
if (page) {
page = thp_head(page);

Fix to [PATCH v4 15/16] mm/truncate,shmem: Handle truncates that split THPs.
xfstests generic/012 on huge tmpfs hit this every time (when checking
xfs_io commands available: later decides "not run" because no "fiemap").
I grabbed this line unthinkingly from one of your later series, it fixes
the crash; but once I actually thought about it when trying to track down
weirder behaviours, I realized that the kmap_atomic() and flush_dcache_page()
in zero_user_segments() are not prepared for a THP - so highmem and
flush_dcache_page architectures will be disappointed. If I searched
through your other series, I might find the complete fix; or perhaps
it's already there in linux-next, I haven't looked.

--- 5103w/include/linux/highmem.h 2020-10-11 14:15:50.000000000 -0700
+++ 5103wh/include/linux/highmem.h 2020-11-16 01:09:35.423677246 -0800
@@ -290,7 +290,7 @@ static inline void zero_user_segments(st
{
void *kaddr = kmap_atomic(page);

- BUG_ON(end1 > PAGE_SIZE || end2 > PAGE_SIZE);
+ BUG_ON(end1 > page_size(page) || end2 > page_size(page));

if (end1 > start1)
memset(kaddr + start1, 0, end1 - start1);

I also had noise from the WARN_ON(page_to_index(page) != index)
in invalidate_inode_pages2_range(): but that's my problem, since
for testing I add a dummy shmem_direct_IO() (return 0): for that
I've now added a shmem_mapping() check at the head of pages2_range().

That's all for now: I'll fire off more overnight testing.

Hugh

2020-11-16 15:17:44

by Matthew Wilcox

[permalink] [raw]
Subject: Re: [PATCH v4 00/16] Overhaul multi-page lookups for THP

On Mon, Nov 16, 2020 at 02:34:34AM -0800, Hugh Dickins wrote:
> On Thu, 12 Nov 2020, Matthew Wilcox (Oracle) wrote:
>
> > This THP prep patchset changes several page cache iteration APIs to only
> > return head pages.
> >
> > - It's only possible to tag head pages in the page cache, so only
> > return head pages, not all their subpages.
> > - Factor a lot of common code out of the various batch lookup routines
> > - Add mapping_seek_hole_data()
> > - Unify find_get_entries() and pagevec_lookup_entries()
> > - Make find_get_entries only return head pages, like find_get_entry().
> >
> > These are only loosely connected, but they seem to make sense together
> > as a series.
> >
> > v4:
> > - Add FGP_ENTRY, remove find_get_entry and find_lock_entry
> > - Rename xas_find_get_entry to find_get_entry
> > - Add "Optimise get_shadow_from_swap_cache"
> > - Move "iomap: Use mapping_seek_hole_data" to this patch series
> > - Rebase against next-20201112
>
> I hope next-20201112 had nothing vital for this series, I applied
> it to v5.10-rc3, and have been busy testing huge tmpfs on that.

Thank you. It's plain I'm not able to hit these cases ... I do run
xfstests against shmem, but that's obviously not good enough. Can
you suggest something I should be running to improve my coverage?

> Fix to [PATCH v4 06/16] mm/filemap: Add helper for finding pages.
> I hit that VM_BUG_ON_PAGE(!thp_contains) when swapping, it is not
> safe without page lock, during the interval when shmem is moving a
> page between page cache and swap cache. It could be tightened by
> passing in a new FGP to distinguish whether searching page or swap
> cache, but I think never tight enough in the swap case - because there
> is no rule for persisting page->private as there is for page->index.
> The best I could do is:

I'll just move this out to the caller who actually locks the page:

+++ b/mm/filemap.c
@@ -1839,7 +1839,6 @@ static inline struct page *find_get_entry(struct xa_state *xas, pgoff_t max,
put_page(page);
goto reset;
}
- VM_BUG_ON_PAGE(!thp_contains(page, xas->xa_index), page);

return page;
reset:
@@ -1923,6 +1922,8 @@ unsigned find_lock_entries(struct address_space *mapping, pgoff_t start,
goto put;
if (page->mapping != mapping || PageWriteback(page))
goto unlock;
+ VM_BUG_ON_PAGE(!thp_contains(page, xas->xa_index),
+ page);
}
indices[pvec->nr] = xas.xa_index;
if (!pagevec_add(pvec, page))

> Fix to [PATCH v4 07/16] mm/filemap: Add mapping_seek_hole_data.
> Crashed on a swap entry 0x2ff09, fairly obvious...

Whoops. Thanks.

> Fix to [PATCH v4 15/16] mm/truncate,shmem: Handle truncates that split THPs.
> One machine ran fine, swapping and building in ext4 on loop0 on huge tmpfs;
> one machine got occasional pages of zeros in its .os; one machine couldn't
> get started because of ext4_find_dest_de errors on the newly mkfs'ed fs.
> The partial_end case was decided by PAGE_SIZE, when there might be a THP
> there. The below patch has run well (for not very long), but I could
> easily have got it slightly wrong, off-by-one or whatever; and I have
> not looked into the similar code in mm/truncate.c, maybe that will need
> a similar fix or maybe not.
>
> --- 5103w/mm/shmem.c 2020-11-12 15:46:21.075254036 -0800
> +++ 5103wh/mm/shmem.c 2020-11-16 01:09:35.431677308 -0800
> @@ -874,7 +874,7 @@ static void shmem_undo_range(struct inod
> long nr_swaps_freed = 0;
> pgoff_t index;
> int i;
> - bool partial_end;
> + bool same_page;
>
> if (lend == -1)
> end = -1; /* unsigned, so actually very big */
> @@ -907,16 +907,12 @@ static void shmem_undo_range(struct inod
> index++;
> }
>
> - partial_end = ((lend + 1) % PAGE_SIZE) > 0;
> + same_page = (lstart >> PAGE_SHIFT) == end;
> page = NULL;
> shmem_getpage(inode, lstart >> PAGE_SHIFT, &page, SGP_READ);
> if (page) {
> - bool same_page;
> -
> page = thp_head(page);
> same_page = lend < page_offset(page) + thp_size(page);
> - if (same_page)
> - partial_end = false;

I don't object to this patch at all, at least partly because it's shorter
and simpler! I don't understand what it's solving, though. The case
where there's a THP which covers partial_end is supposed to be handled
by the three lines above.

> Fix to [PATCH v4 15/16] mm/truncate,shmem: Handle truncates that split THPs.
> xfstests generic/012 on huge tmpfs hit this every time (when checking
> xfs_io commands available: later decides "not run" because no "fiemap").
> I grabbed this line unthinkingly from one of your later series, it fixes
> the crash; but once I actually thought about it when trying to track down
> weirder behaviours, I realized that the kmap_atomic() and flush_dcache_page()
> in zero_user_segments() are not prepared for a THP - so highmem and
> flush_dcache_page architectures will be disappointed. If I searched
> through your other series, I might find the complete fix; or perhaps
> it's already there in linux-next, I haven't looked.

zero_user_segments() is fixed by "mm: Support THPs in zero_user_segments".
I think most recently posted here:
https://lore.kernel.org/linux-mm/[email protected]/

My fault for not realising this patch depended on that patch. I did
test these patches stand-alone, but it didn't trigger this problem.

flush_dcache_page() needs to be called once for each sub-page. We
really need a flush_dcache_thp() so that architectures can optimise this.
Although maybe now that's going to be called flush_dcache_folio().
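
A trivial generic fallback could just walk the subpages - only a sketch,
and flush_dcache_thp() does not exist yet:

	static inline void flush_dcache_thp(struct page *page)
	{
		int i;

		/* flush each constituent small page of the compound page */
		for (i = 0; i < thp_nr_pages(page); i++)
			flush_dcache_page(page + i);
	}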

> I also had noise from the WARN_ON(page_to_index(page) != index)
> in invalidate_inode_pages2_range(): but that's my problem, since
> for testing I add a dummy shmem_direct_IO() (return 0): for that
> I've now added a shmem_mapping() check at the head of pages2_range().

Ah, I have a later fix for invalidate_inode_pages2_range():
https://lore.kernel.org/linux-mm/[email protected]/

I didn't post it earlier because there aren't any filesystems currently
which use THPs and directIO ;-)

> That's all for now: I'll fire off more overnight testing.

Thanks!

2020-11-16 22:51:57

by Hugh Dickins

[permalink] [raw]
Subject: Re: [PATCH v4 00/16] Overhaul multi-page lookups for THP

On Mon, 16 Nov 2020, Matthew Wilcox wrote:
> On Mon, Nov 16, 2020 at 02:34:34AM -0800, Hugh Dickins wrote:
> > On Thu, 12 Nov 2020, Matthew Wilcox (Oracle) wrote:
> >
> > > This THP prep patchset changes several page cache iteration APIs to only
> > > return head pages.
> > >
> > > - It's only possible to tag head pages in the page cache, so only
> > > return head pages, not all their subpages.
> > > - Factor a lot of common code out of the various batch lookup routines
> > > - Add mapping_seek_hole_data()
> > > - Unify find_get_entries() and pagevec_lookup_entries()
> > > - Make find_get_entries only return head pages, like find_get_entry().
> > >
> > > These are only loosely connected, but they seem to make sense together
> > > as a series.
> > >
> > > v4:
> > > - Add FGP_ENTRY, remove find_get_entry and find_lock_entry
> > > - Rename xas_find_get_entry to find_get_entry
> > > - Add "Optimise get_shadow_from_swap_cache"
> > > - Move "iomap: Use mapping_seek_hole_data" to this patch series
> > > - Rebase against next-20201112
> >
> > I hope next-20201112 had nothing vital for this series, I applied
> > it to v5.10-rc3, and have been busy testing huge tmpfs on that.
>
> Thank you. It's plain I'm not able to hit these cases ... I do run
> xfstests against shmem, but that's obviously not good enough. Can
> you suggest something I should be running to improve my coverage?

Not quickly.

I'll send you a builds.tar.xz of my tmpfs kernel builds swapping load,
like I sent Alex Shi for his lru_lock testing back in March: but no
point in resending exactly that, before I've updated its source patch
to suit less ancient compilers. Sometime in the next week. It's hard
to get working usefully in new environments: probably better as source
of ideas (like building kernel in ext4 on loop on huge tmpfs while
swapping) than as something to duplicate and run directly.

I haven't tried LTP on your series yet, will do so later today.
I install into a huge tmpfs mounted at /opt/ltp and run it there,
with /sys/kernel/mm/transparent_hugepage/shmem_enabled "force"
and khugepaged expedited. Doesn't usually find anything, but
well worth a go. I'll be doing it based on 5.10-rc3 or rc4:
avoiding mmotm or linux-next at the moment, because I've noticed
that Rik's shmem THP gfp_mask mods have severely limited the use
of huge pages in my testing there: maybe I just need to tweak defrag,
or maybe I'll want the patches adjusted - I just haven't looked yet,
making your series safe took priority. (And aside from Rik's, there
seemed to be something else affecting the use of huge pages in mmotm,
not investigated yet.)

xfstests: awkward time to ask, since I'm amidst transitioning from
my old internal hack for user xattrs to what I had hoped to get into
next release (enabling them, but restricting to the nr_inodes space),
but too late for that now. So I have a few workaround patches to the
xfstests tree, and a few tmpfs patches to the kernel tree, but in flux
at the moment. Not that user xattrs are much related to your series.
If you've got xfstests going on tmpfs, probably the easiest thing to
add is /sys/kernel/mm/transparent_hugepage/shmem_enabled "force",
so that mounts are huge=always by default.

I do have six xfstests failing with your series, that passed before;
but I have not had time to analyse those yet, and don't even want to
enumerate them yet. Some will probably be "not run"s in a standard
tree, so hard for you to try; and maybe all of them will turn out to
depend on different interpretations of small holes in huge pages -
not necessarily anything to worry about. But I wonder if they might
relate to failures to split: your truncate_inode_partial_page() work
is very nice, but for better or worse it does skip all the retries
implicit in the previous way of proceeding.

>
> > Fix to [PATCH v4 06/16] mm/filemap: Add helper for finding pages.
> > I hit that VM_BUG_ON_PAGE(!thp_contains) when swapping, it is not
> > safe without page lock, during the interval when shmem is moving a
> > page between page cache and swap cache. It could be tightened by
> > passing in a new FGP to distinguish whether searching page or swap
> > cache, but I think never tight enough in the swap case - because there
> > is no rule for persisting page->private as there is for page->index.
> > The best I could do is:
>
> I'll just move this out to the caller who actually locks the page:
>
> +++ b/mm/filemap.c
> @@ -1839,7 +1839,6 @@ static inline struct page *find_get_entry(struct xa_state *xas, pgoff_t max,
> put_page(page);
> goto reset;
> }
> - VM_BUG_ON_PAGE(!thp_contains(page, xas->xa_index), page);
>
> return page;
> reset:
> @@ -1923,6 +1922,8 @@ unsigned find_lock_entries(struct address_space *mapping, pgoff_t start,
> goto put;
> if (page->mapping != mapping || PageWriteback(page))
> goto unlock;
> + VM_BUG_ON_PAGE(!thp_contains(page, xas->xa_index),
> + page);
> }
> indices[pvec->nr] = xas.xa_index;
> if (!pagevec_add(pvec, page))
>

Okay, up to you: that's obviously much cleaner-looking than what I did;
and if it covers all the cases you're most concerned about, fine (but
your original placing did check some other usages, and I didn't want to
weaken them - though now I notice that e.g. mapping_get_entry() has no
equivalent check, so the find_get_entry() check was exceptional).

> > Fix to [PATCH v4 07/16] mm/filemap: Add mapping_seek_hole_data.
> > Crashed on a swap entry 0x2ff09, fairly obvious...
>
> Whoops. Thanks.
>
> > Fix to [PATCH v4 15/16] mm/truncate,shmem: Handle truncates that split THPs.
> > One machine ran fine, swapping and building in ext4 on loop0 on huge tmpfs;
> > one machine got occasional pages of zeros in its .os; one machine couldn't
> > get started because of ext4_find_dest_de errors on the newly mkfs'ed fs.
> > The partial_end case was decided by PAGE_SIZE, when there might be a THP
> > there. The below patch has run well (for not very long), but I could
> > easily have got it slightly wrong, off-by-one or whatever; and I have
> > not looked into the similar code in mm/truncate.c, maybe that will need
> > a similar fix or maybe not.
> >
> > --- 5103w/mm/shmem.c 2020-11-12 15:46:21.075254036 -0800
> > +++ 5103wh/mm/shmem.c 2020-11-16 01:09:35.431677308 -0800
> > @@ -874,7 +874,7 @@ static void shmem_undo_range(struct inod
> > long nr_swaps_freed = 0;
> > pgoff_t index;
> > int i;
> > - bool partial_end;
> > + bool same_page;
> >
> > if (lend == -1)
> > end = -1; /* unsigned, so actually very big */
> > @@ -907,16 +907,12 @@ static void shmem_undo_range(struct inod
> > index++;
> > }
> >
> > - partial_end = ((lend + 1) % PAGE_SIZE) > 0;
> > + same_page = (lstart >> PAGE_SHIFT) == end;
> > page = NULL;
> > shmem_getpage(inode, lstart >> PAGE_SHIFT, &page, SGP_READ);
> > if (page) {
> > - bool same_page;
> > -
> > page = thp_head(page);
> > same_page = lend < page_offset(page) + thp_size(page);
> > - if (same_page)
> > - partial_end = false;
>
> I don't object to this patch at all, at least partly because it's shorter
> and simpler! I don't understand what it's solving, though. The case
> where there's a THP which covers partial_end is supposed to be handled
> by the three lines above.

I'm glad to hear it's not obvious to you;
I was ashamed of how long it took me to see. The initial
partial_end = ((lend + 1) % PAGE_SIZE) > 0;
left partial_end false if the end is nicely PAGE_SIZE aligned.

If the lstart shmem_getpage() finds a page, THP or small,
and the lend is covered by the same page, all is well. But if
the lend is covered by a THP, PAGE_SIZE aligned but not page_size()
aligned (forgive my eliding off-by-ones wrt lend), then partial_end
remains false, so there's no shmem_getpage() of the end, no discovery
that it's a THP, and no attempt to split it as intended.

I think then "end" goes forward unadjusted, and that whole ending THP
will be truncated out: when it should have been split, and only its
first pages removed. Hence zeroes appearing where they should not.
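
Concretely, for example (assuming a hole punch of [0, 3MB-1] over a file
whose 2MB-4MB range is one THP): lend+1 = 3MB is PAGE_SIZE aligned, so
partial_end starts false; the lstart lookup finds the small page at
index 0, which does not cover lend, so partial_end is never cleared or
set on its account and stays false; thus there is no shmem_getpage() of
end = 768, the THP is never split, and the whole 2MB-4MB page gets
truncated - zeroing 3MB-4MB, which was outside the hole.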

>
> > Fix to [PATCH v4 15/16] mm/truncate,shmem: Handle truncates that split THPs.
> > xfstests generic/012 on huge tmpfs hit this every time (when checking
> > xfs_io commands available: later decides "not run" because no "fiemap").
> > I grabbed this line unthinkingly from one of your later series, it fixes
> > the crash; but once I actually thought about it when trying to track down
> > weirder behaviours, I realized that the kmap_atomic() and flush_dcache_page()
> > in zero_user_segments() are not prepared for a THP - so highmem and
> > flush_dcache_page architectures will be disappointed. If I searched
> > through your other series, I might find the complete fix; or perhaps
> > it's already there in linux-next, I haven't looked.
>
> zero_user_segments() is fixed by "mm: Support THPs in zero_user_segments".
> I think most recently posted here:
> https://lore.kernel.org/linux-mm/[email protected]/
>
> My fault for not realising this patch depended on that patch. I did
> test these patches stand-alone, but it didn't trigger this problem.
>
> flush_dcache_page() needs to be called once for each sub-page. We
> really need a flush_dcache_thp() so that architectures can optimise this.

Right.

> Although maybe now that's going to be called flush_dcache_folio().

Yes, I was delighted to notice the return of "folio":
https://lore.kernel.org/linux-mm/[email protected]/

>
> > I also had noise from the WARN_ON(page_to_index(page) != index)
> > in invalidate_inode_pages2_range(): but that's my problem, since
> > for testing I add a dummy shmem_direct_IO() (return 0): for that
> > I've now added a shmem_mapping() check at the head of pages2_range().
>
> Ah, I have a later fix for invalidate_inode_pages2_range():
> https://lore.kernel.org/linux-mm/[email protected]/
>
> I didn't post it earlier because there aren't any filesystems currently
> which use THPs and directIO ;-)

Yes, I'd seen your exchange with Jan about that. And initially I ported
in your later fix; but later decided it was better to decouple my testing
mods from whatever is decided in your series. If tmpfs does (pretend to)
support directIO, it is correct that invalidate_inode_pages2_range() do
nothing on it: because the pretended IO is to and from the tmpfs backing
store, which is the page cache.

>
> > That's all for now: I'll fire off more overnight testing.
>
> Thanks!

All running fine.

Hugh

2020-11-17 15:44:42

by Matthew Wilcox

[permalink] [raw]
Subject: Re: [PATCH v4 00/16] Overhaul multi-page lookups for THP

On Mon, Nov 16, 2020 at 02:34:34AM -0800, Hugh Dickins wrote:
> Fix to [PATCH v4 15/16] mm/truncate,shmem: Handle truncates that split THPs.
> One machine ran fine, swapping and building in ext4 on loop0 on huge tmpfs;
> one machine got occasional pages of zeros in its .os; one machine couldn't
> get started because of ext4_find_dest_de errors on the newly mkfs'ed fs.
> The partial_end case was decided by PAGE_SIZE, when there might be a THP
> there. The below patch has run well (for not very long), but I could
> easily have got it slightly wrong, off-by-one or whatever; and I have
> not looked into the similar code in mm/truncate.c, maybe that will need
> a similar fix or maybe not.

Thank you for the explanation in your later email! There is indeed an
off-by-one, although in the safe direction.

> --- 5103w/mm/shmem.c 2020-11-12 15:46:21.075254036 -0800
> +++ 5103wh/mm/shmem.c 2020-11-16 01:09:35.431677308 -0800
> @@ -874,7 +874,7 @@ static void shmem_undo_range(struct inod
> long nr_swaps_freed = 0;
> pgoff_t index;
> int i;
> - bool partial_end;
> + bool same_page;
>
> if (lend == -1)
> end = -1; /* unsigned, so actually very big */
> @@ -907,16 +907,12 @@ static void shmem_undo_range(struct inod
> index++;
> }
>
> - partial_end = ((lend + 1) % PAGE_SIZE) > 0;
> + same_page = (lstart >> PAGE_SHIFT) == end;

'end' is exclusive, so this is always false. Maybe something "obvious":

same_page = (lstart >> PAGE_SHIFT) == (lend >> PAGE_SHIFT);

(lend is inclusive, so lend in 0-4095 are all on the same page)

> page = NULL;
> shmem_getpage(inode, lstart >> PAGE_SHIFT, &page, SGP_READ);
> if (page) {
> - bool same_page;
> -
> page = thp_head(page);
> same_page = lend < page_offset(page) + thp_size(page);
> - if (same_page)
> - partial_end = false;
> set_page_dirty(page);
> if (!truncate_inode_partial_page(page, lstart, lend)) {
> start = page->index + thp_nr_pages(page);
> @@ -928,7 +924,7 @@ static void shmem_undo_range(struct inod
> page = NULL;
> }
>
> - if (partial_end)
> + if (!same_page)
> shmem_getpage(inode, end, &page, SGP_READ);
> if (page) {
> page = thp_head(page);

2020-11-17 16:29:25

by Hugh Dickins

[permalink] [raw]
Subject: Re: [PATCH v4 00/16] Overhaul multi-page lookups for THP

On Tue, 17 Nov 2020, Matthew Wilcox wrote:
> On Mon, Nov 16, 2020 at 02:34:34AM -0800, Hugh Dickins wrote:
> > Fix to [PATCH v4 15/16] mm/truncate,shmem: Handle truncates that split THPs.
> > One machine ran fine, swapping and building in ext4 on loop0 on huge tmpfs;
> > one machine got occasional pages of zeros in its .os; one machine couldn't
> > get started because of ext4_find_dest_de errors on the newly mkfs'ed fs.
> > The partial_end case was decided by PAGE_SIZE, when there might be a THP
> > there. The below patch has run well (for not very long), but I could
> > easily have got it slightly wrong, off-by-one or whatever; and I have
> > not looked into the similar code in mm/truncate.c, maybe that will need
> > a similar fix or maybe not.
>
> Thank you for the explanation in your later email! There is indeed an
> off-by-one, although in the safe direction.
>
> > --- 5103w/mm/shmem.c 2020-11-12 15:46:21.075254036 -0800
> > +++ 5103wh/mm/shmem.c 2020-11-16 01:09:35.431677308 -0800
> > @@ -874,7 +874,7 @@ static void shmem_undo_range(struct inod
> > long nr_swaps_freed = 0;
> > pgoff_t index;
> > int i;
> > - bool partial_end;
> > + bool same_page;
> >
> > if (lend == -1)
> > end = -1; /* unsigned, so actually very big */
> > @@ -907,16 +907,12 @@ static void shmem_undo_range(struct inod
> > index++;
> > }
> >
> > - partial_end = ((lend + 1) % PAGE_SIZE) > 0;
> > + same_page = (lstart >> PAGE_SHIFT) == end;
>
> 'end' is exclusive, so this is always false. Maybe something "obvious":
>
> same_page = (lstart >> PAGE_SHIFT) == (lend >> PAGE_SHIFT);
>
> (lend is inclusive, so lend in 0-4095 are all on the same page)

My brain is not yet in gear this morning, so I haven't given this the
necessary thought: but I do have to question what you say there, and
throw it back to you for the further thought -

the first shmem_getpage(inode, lstart >> PAGE_SHIFT, &page, SGP_READ);
the second shmem_getpage(inode, end, &page, SGP_READ).
So same_page = (lstart >> PAGE_SHIFT) == end
had seemed right to me.

>
> > page = NULL;
> > shmem_getpage(inode, lstart >> PAGE_SHIFT, &page, SGP_READ);
> > if (page) {
> > - bool same_page;
> > -
> > page = thp_head(page);
> > same_page = lend < page_offset(page) + thp_size(page);
> > - if (same_page)
> > - partial_end = false;
> > set_page_dirty(page);
> > if (!truncate_inode_partial_page(page, lstart, lend)) {
> > start = page->index + thp_nr_pages(page);
> > @@ -928,7 +924,7 @@ static void shmem_undo_range(struct inod
> > page = NULL;
> > }
> >
> > - if (partial_end)
> > + if (!same_page)
> > shmem_getpage(inode, end, &page, SGP_READ);
> > if (page) {
> > page = thp_head(page);
>

2020-11-17 19:20:03

by Matthew Wilcox

[permalink] [raw]
Subject: Re: [PATCH v4 00/16] Overhaul multi-page lookups for THP

On Tue, Nov 17, 2020 at 08:26:03AM -0800, Hugh Dickins wrote:
> On Tue, 17 Nov 2020, Matthew Wilcox wrote:
> > On Mon, Nov 16, 2020 at 02:34:34AM -0800, Hugh Dickins wrote:
> > > Fix to [PATCH v4 15/16] mm/truncate,shmem: Handle truncates that split THPs.
> > > One machine ran fine, swapping and building in ext4 on loop0 on huge tmpfs;
> > > one machine got occasional pages of zeros in its .os; one machine couldn't
> > > get started because of ext4_find_dest_de errors on the newly mkfs'ed fs.
> > > The partial_end case was decided by PAGE_SIZE, when there might be a THP
> > > there. The below patch has run well (for not very long), but I could
> > > easily have got it slightly wrong, off-by-one or whatever; and I have
> > > not looked into the similar code in mm/truncate.c, maybe that will need
> > > a similar fix or maybe not.
> >
> > Thank you for the explanation in your later email! There is indeed an
> > off-by-one, although in the safe direction.
> >
> > > --- 5103w/mm/shmem.c 2020-11-12 15:46:21.075254036 -0800
> > > +++ 5103wh/mm/shmem.c 2020-11-16 01:09:35.431677308 -0800
> > > @@ -874,7 +874,7 @@ static void shmem_undo_range(struct inod
> > > long nr_swaps_freed = 0;
> > > pgoff_t index;
> > > int i;
> > > - bool partial_end;
> > > + bool same_page;
> > >
> > > if (lend == -1)
> > > end = -1; /* unsigned, so actually very big */
> > > @@ -907,16 +907,12 @@ static void shmem_undo_range(struct inod
> > > index++;
> > > }
> > >
> > > - partial_end = ((lend + 1) % PAGE_SIZE) > 0;
> > > + same_page = (lstart >> PAGE_SHIFT) == end;
> >
> > 'end' is exclusive, so this is always false. Maybe something "obvious":
> >
> > same_page = (lstart >> PAGE_SHIFT) == (lend >> PAGE_SHIFT);
> >
> > (lend is inclusive, so lend in 0-4095 are all on the same page)
>
> My brain is not yet in gear this morning, so I haven't given this the
> necessary thought: but I do have to question what you say there, and
> throw it back to you for the further thought -
>
> the first shmem_getpage(inode, lstart >> PAGE_SHIFT, &page, SGP_READ);
> the second shmem_getpage(inode, end, &page, SGP_READ).
> So same_page = (lstart >> PAGE_SHIFT) == end
> had seemed right to me.

I find both of these functions exceptionally confusing. Does this
make it easier to understand?

@@ -859,22 +859,47 @@ static void shmem_undo_range(struct inode *inode, loff_t lstart, loff_t lend,
{
struct address_space *mapping = inode->i_mapping;
struct shmem_inode_info *info = SHMEM_I(inode);
- pgoff_t start = (lstart + PAGE_SIZE - 1) >> PAGE_SHIFT;
- pgoff_t end = (lend + 1) >> PAGE_SHIFT;
+ pgoff_t start = lstart >> PAGE_SHIFT;
+ pgoff_t end = lend >> PAGE_SHIFT;
struct pagevec pvec;
pgoff_t indices[PAGEVEC_SIZE];
struct page *page;
long nr_swaps_freed = 0;
pgoff_t index;
int i;
- bool same_page;
+ bool same_page = (start == end);

- if (lend == -1)
- end = -1; /* unsigned, so actually very big */
+ page = NULL;
+ shmem_getpage(inode, start, &page, SGP_READ);
+ if (page) {
+ page = thp_head(page);
+ same_page = lend < page_offset(page) + thp_size(page);
+ set_page_dirty(page);
+ if (truncate_inode_partial_page(page, lstart, lend))
+ start++;
+ else
+ start = page->index + thp_nr_pages(page);
+ unlock_page(page);
+ put_page(page);
+ page = NULL;
+ }
+
+ if (!same_page)
+ shmem_getpage(inode, end, &page, SGP_READ);
+ if (page) {
+ page = thp_head(page);
+ set_page_dirty(page);
+ if (truncate_inode_partial_page(page, lstart, lend))
+ end--;
+ else
+ end = page->index - 1;
+ unlock_page(page);
+ put_page(page);
+ }

pagevec_init(&pvec);
index = start;
- while (index < end && find_lock_entries(mapping, index, end - 1,
+ while (index <= end && find_lock_entries(mapping, index, end,
&pvec, indices)) {
for (i = 0; i < pagevec_count(&pvec); i++) {
page = pvec.pages[i];
@@ -900,40 +925,11 @@ static void shmem_undo_range(struct inode *inode, loff_t lstart, loff_t lend,
index++;
}

- same_page = (lend >> PAGE_SHIFT) == (lstart >> PAGE_SHIFT);
- page = NULL;
- shmem_getpage(inode, lstart >> PAGE_SHIFT, &page, SGP_READ);
- if (page) {
- page = thp_head(page);
- same_page = lend < page_offset(page) + thp_size(page);
- set_page_dirty(page);
- if (!truncate_inode_partial_page(page, lstart, lend)) {
- start = page->index + thp_nr_pages(page);
- if (same_page)
- end = page->index;
- }
- unlock_page(page);
- put_page(page);
- page = NULL;
- }
-
- if (!same_page)
- shmem_getpage(inode, end, &page, SGP_READ);
- if (page) {
- page = thp_head(page);
- set_page_dirty(page);
- if (!truncate_inode_partial_page(page, lstart, lend))
- end = page->index;
- unlock_page(page);
- put_page(page);
- }
-
index = start;
- while (index < end) {
+ while (index <= end) {
cond_resched();

- if (!find_get_entries(mapping, index, end - 1, &pvec,
- indices)) {
+ if (!find_get_entries(mapping, index, end, &pvec, indices)) {
/* If all gone or hole-punch or unfalloc, we're done */
if (index == start || end != -1)
break;

That is, we change the definitions of start and end to be the more natural
"index of page which contains the first/last byte". Then we deal with
the start and end of the range, and adjust the start & end appropriately.

I almost managed to get rid of 'same_page' until I thought about the case
where start was a compound page, and split succeeded. In this case, we
already dealt with the tail and don't want to deal with it again.

2020-11-17 23:46:03

by Matthew Wilcox

[permalink] [raw]
Subject: Re: [PATCH v4 00/16] Overhaul multi-page lookups for THP

On Tue, Nov 17, 2020 at 07:15:13PM +0000, Matthew Wilcox wrote:
> I find both of these functions exceptionally confusing. Does this
> make it easier to understand?

Never mind, this is buggy. I'll send something better tomorrow.

2020-11-25 02:35:01

by Matthew Wilcox

[permalink] [raw]
Subject: Re: [PATCH v4 00/16] Overhaul multi-page lookups for THP

On Tue, Nov 17, 2020 at 11:43:02PM +0000, Matthew Wilcox wrote:
> On Tue, Nov 17, 2020 at 07:15:13PM +0000, Matthew Wilcox wrote:
> > I find both of these functions exceptionally confusing. Does this
> > make it easier to understand?
>
> Never mind, this is buggy. I'll send something better tomorrow.

That took a week, not a day. *sigh*. At least this is shorter.

commit 1a02863ce04fd325922d6c3db6d01e18d55f966b
Author: Matthew Wilcox (Oracle) <[email protected]>
Date: Tue Nov 17 10:45:18 2020 -0500

fix mm-truncateshmem-handle-truncates-that-split-thps.patch

diff --git a/mm/internal.h b/mm/internal.h
index 75ae680d0a2c..4ac239e8c42c 100644
--- a/mm/internal.h
+++ b/mm/internal.h
@@ -622,5 +622,6 @@ struct migration_target_control {
gfp_t gfp_mask;
};

-bool truncate_inode_partial_page(struct page *page, loff_t start, loff_t end);
+pgoff_t truncate_inode_partial_page(struct address_space *mapping,
+ struct page *page, loff_t start, loff_t end);
#endif /* __MM_INTERNAL_H */
diff --git a/mm/shmem.c b/mm/shmem.c
index 5da4f1a3e663..6fd00948e592 100644
--- a/mm/shmem.c
+++ b/mm/shmem.c
@@ -866,26 +866,33 @@ static void shmem_undo_range(struct inode *inode, loff_t lstart, loff_t lend,
{
struct address_space *mapping = inode->i_mapping;
struct shmem_inode_info *info = SHMEM_I(inode);
- pgoff_t start = (lstart + PAGE_SIZE - 1) >> PAGE_SHIFT;
- pgoff_t end = (lend + 1) >> PAGE_SHIFT;
+ pgoff_t start, end;
struct pagevec pvec;
pgoff_t indices[PAGEVEC_SIZE];
struct page *page;
long nr_swaps_freed = 0;
pgoff_t index;
int i;
- bool partial_end;

- if (lend == -1)
- end = -1; /* unsigned, so actually very big */
+ page = NULL;
+ start = lstart >> PAGE_SHIFT;
+ shmem_getpage(inode, start, &page, SGP_READ);
+ if (page) {
+ page = thp_head(page);
+ set_page_dirty(page);
+ start = truncate_inode_partial_page(mapping, page, lstart,
+ lend);
+ }
+
+ /* 'end' includes a partial page */
+ end = lend / PAGE_SIZE;

pagevec_init(&pvec);
index = start;
while (index < end && find_lock_entries(mapping, index, end - 1,
- &pvec, indices)) {
+ &pvec, indices)) {
for (i = 0; i < pagevec_count(&pvec); i++) {
page = pvec.pages[i];
-
index = indices[i];

if (xa_is_value(page)) {
@@ -895,8 +902,6 @@ static void shmem_undo_range(struct inode *inode, loff_t lstart, loff_t lend,
index, page);
continue;
}
- index += thp_nr_pages(page) - 1;
-
if (!unfalloc || !PageUptodate(page))
truncate_inode_page(mapping, page);
unlock_page(page);
@@ -907,85 +912,60 @@ static void shmem_undo_range(struct inode *inode, loff_t lstart, loff_t lend,
index++;
}

- partial_end = ((lend + 1) % PAGE_SIZE) > 0;
- page = NULL;
- shmem_getpage(inode, lstart >> PAGE_SHIFT, &page, SGP_READ);
- if (page) {
- bool same_page;
-
- page = thp_head(page);
- same_page = lend < page_offset(page) + thp_size(page);
- if (same_page)
- partial_end = false;
- set_page_dirty(page);
- if (!truncate_inode_partial_page(page, lstart, lend)) {
- start = page->index + thp_nr_pages(page);
- if (same_page)
- end = page->index;
- }
- unlock_page(page);
- put_page(page);
- page = NULL;
- }
-
- if (partial_end)
- shmem_getpage(inode, end, &page, SGP_READ);
- if (page) {
- page = thp_head(page);
- set_page_dirty(page);
- if (!truncate_inode_partial_page(page, lstart, lend))
- end = page->index;
- unlock_page(page);
- put_page(page);
- }
-
index = start;
- while (index < end) {
+ while (index <= end) {
cond_resched();

- if (!find_get_entries(mapping, index, end - 1, &pvec,
- indices)) {
+ if (!find_get_entries(mapping, index, end, &pvec, indices)) {
/* If all gone or hole-punch or unfalloc, we're done */
- if (index == start || end != -1)
+ if (index == start || lend != (loff_t)-1)
break;
/* But if truncating, restart to make sure all gone */
index = start;
continue;
}
+
for (i = 0; i < pagevec_count(&pvec); i++) {
page = pvec.pages[i];

- index = indices[i];
if (xa_is_value(page)) {
if (unfalloc)
continue;
- if (shmem_free_swap(mapping, index, page)) {
- /* Swap was replaced by page: retry */
- index--;
- break;
+ index = indices[i];
+ if (index == end) {
+ /* Partial page swapped out? */
+ shmem_getpage(inode, end, &page,
+ SGP_READ);
+ } else {
+ if (shmem_free_swap(mapping, index,
+ page)) {
+ /* Swap replaced: retry */
+ break;
+ }
+ nr_swaps_freed++;
+ continue;
}
- nr_swaps_freed++;
- continue;
+ } else {
+ lock_page(page);
}

- lock_page(page);
-
if (!unfalloc || !PageUptodate(page)) {
if (page_mapping(page) != mapping) {
/* Page was replaced by swap: retry */
unlock_page(page);
- index--;
+ put_page(page);
break;
}
VM_BUG_ON_PAGE(PageWriteback(page), page);
- truncate_inode_page(mapping, page);
+ index = truncate_inode_partial_page(mapping,
+ page, lstart, lend);
+ if (index > end)
+ end = indices[i] - 1;
}
- index = page->index + thp_nr_pages(page) - 1;
- unlock_page(page);
}
+ index = indices[i - 1] + 1;
pagevec_remove_exceptionals(&pvec);
- pagevec_release(&pvec);
- index++;
+ pagevec_reinit(&pvec);
}

spin_lock_irq(&info->lock);
diff --git a/mm/truncate.c b/mm/truncate.c
index ddb94fc0bc9e..e9fb4f1db837 100644
--- a/mm/truncate.c
+++ b/mm/truncate.c
@@ -225,19 +225,20 @@ int truncate_inode_page(struct address_space *mapping, struct page *page)
}

/*
- * Handle partial (transparent) pages. The page may be entirely within the
- * range if a split has raced with us. If not, we zero the part of the
- * page that's within the [start, end] range, and then split the page if
- * it's a THP. split_page_range() will discard pages which now lie beyond
- * i_size, and we rely on the caller to discard pages which lie within a
+ * Handle partial (transparent) pages. If the page is entirely within
+ * the range, we discard it. If not, we split the page if it's a THP
+ * and zero the part of the page that's within the [start, end] range.
+ * split_page_range() will discard any of the subpages which now lie
+ * beyond i_size, and the caller will discard pages which lie within a
* newly created hole.
*
- * Returns false if THP splitting failed so the caller can avoid
- * discarding the entire page which is stubbornly unsplit.
+ * Return: The index after the current page.
*/
-bool truncate_inode_partial_page(struct page *page, loff_t start, loff_t end)
+pgoff_t truncate_inode_partial_page(struct address_space *mapping,
+ struct page *page, loff_t start, loff_t end)
{
loff_t pos = page_offset(page);
+ pgoff_t next_index = page->index + thp_nr_pages(page);
unsigned int offset, length;

if (pos < start)
@@ -251,24 +252,33 @@ bool truncate_inode_partial_page(struct page *page, loff_t start, loff_t end)
length = end + 1 - pos - offset;

wait_on_page_writeback(page);
- if (length == thp_size(page)) {
- truncate_inode_page(page->mapping, page);
- return true;
- }
-
- /*
- * We may be zeroing pages we're about to discard, but it avoids
- * doing a complex calculation here, and then doing the zeroing
- * anyway if the page split fails.
- */
- zero_user(page, offset, length);
+ if (length == thp_size(page))
+ goto truncate;

cleancache_invalidate_page(page->mapping, page);
if (page_has_private(page))
do_invalidatepage(page, offset, length);
if (!PageTransHuge(page))
- return true;
- return split_huge_page(page) == 0;
+ goto zero;
+ page += offset / PAGE_SIZE;
+ if (split_huge_page(page) < 0) {
+ page -= offset / PAGE_SIZE;
+ goto zero;
+ }
+ next_index = page->index + 1;
+ offset %= PAGE_SIZE;
+ if (offset == 0 && length >= PAGE_SIZE)
+ goto truncate;
+ length = PAGE_SIZE - offset;
+zero:
+ zero_user(page, offset, length);
+ goto out;
+truncate:
+ truncate_inode_page(mapping, page);
+out:
+ unlock_page(page);
+ put_page(page);
+ return next_index;
}

/*
@@ -322,10 +332,6 @@ int invalidate_inode_page(struct page *page)
* The first pass will remove most pages, so the search cost of the second pass
* is low.
*
- * We pass down the cache-hot hint to the page freeing code. Even if the
- * mapping is large, it is probably the case that the final pages are the most
- * recently touched, and freeing happens in ascending file offset order.
- *
* Note that since ->invalidatepage() accepts range to invalidate
* truncate_inode_pages_range is able to handle cases where lend + 1 is not
* page aligned properly.
@@ -333,34 +339,24 @@ int invalidate_inode_page(struct page *page)
void truncate_inode_pages_range(struct address_space *mapping,
loff_t lstart, loff_t lend)
{
- pgoff_t start; /* inclusive */
- pgoff_t end; /* exclusive */
+ pgoff_t start, end;
struct pagevec pvec;
pgoff_t indices[PAGEVEC_SIZE];
pgoff_t index;
int i;
struct page * page;
- bool partial_end;

if (mapping->nrpages == 0 && mapping->nrexceptional == 0)
goto out;

- /*
- * 'start' and 'end' always covers the range of pages to be fully
- * truncated. Partial pages are covered with 'partial_start' at the
- * start of the range and 'partial_end' at the end of the range.
- * Note that 'end' is exclusive while 'lend' is inclusive.
- */
- start = (lstart + PAGE_SIZE - 1) >> PAGE_SHIFT;
- if (lend == -1)
- /*
- * lend == -1 indicates end-of-file so we have to set 'end'
- * to the highest possible pgoff_t and since the type is
- * unsigned we're using -1.
- */
- end = -1;
- else
- end = (lend + 1) >> PAGE_SHIFT;
+ start = lstart >> PAGE_SHIFT;
+ page = find_lock_head(mapping, start);
+ if (page)
+ start = truncate_inode_partial_page(mapping, page, lstart,
+ lend);
+
+ /* 'end' includes a partial page */
+ end = lend / PAGE_SIZE;

pagevec_init(&pvec);
index = start;
@@ -377,37 +373,11 @@ void truncate_inode_pages_range(struct address_space *mapping,
cond_resched();
}

- partial_end = ((lend + 1) % PAGE_SIZE) > 0;
- page = find_lock_head(mapping, lstart >> PAGE_SHIFT);
- if (page) {
- bool same_page = lend < page_offset(page) + thp_size(page);
- if (same_page)
- partial_end = false;
- if (!truncate_inode_partial_page(page, lstart, lend)) {
- start = page->index + thp_nr_pages(page);
- if (same_page)
- end = page->index;
- }
- unlock_page(page);
- put_page(page);
- page = NULL;
- }
-
- if (partial_end)
- page = find_lock_head(mapping, end);
- if (page) {
- if (!truncate_inode_partial_page(page, lstart, lend))
- end = page->index;
- unlock_page(page);
- put_page(page);
- }
-
index = start;
- while (index < end) {
+ while (index <= end) {
cond_resched();

- if (!find_get_entries(mapping, index, end - 1, &pvec,
- indices)) {
+ if (!find_get_entries(mapping, index, end, &pvec, indices)) {
/* If all gone from start onwards, we're done */
if (index == start)
break;
@@ -418,22 +388,19 @@ void truncate_inode_pages_range(struct address_space *mapping,

for (i = 0; i < pagevec_count(&pvec); i++) {
page = pvec.pages[i];
-
- /* We rely upon deletion not changing page->index */
- index = indices[i];
-
if (xa_is_value(page))
continue;

lock_page(page);
- index = page->index + thp_nr_pages(page) - 1;
- wait_on_page_writeback(page);
- truncate_inode_page(mapping, page);
- unlock_page(page);
+ index = truncate_inode_partial_page(mapping, page,
+ lstart, lend);
+ /* Couldn't split a THP? */
+ if (index > end)
+ end = indices[i] - 1;
}
+ index = indices[i - 1] + 1;
truncate_exceptional_pvec_entries(mapping, &pvec, indices);
- pagevec_release(&pvec);
- index++;
+ pagevec_reinit(&pvec);
}

out:

2020-11-25 02:54:54

by Hugh Dickins

[permalink] [raw]
Subject: Re: [PATCH v4 00/16] Overhaul multi-page lookups for THP

On Wed, 25 Nov 2020, Matthew Wilcox wrote:
> On Tue, Nov 17, 2020 at 11:43:02PM +0000, Matthew Wilcox wrote:
> > On Tue, Nov 17, 2020 at 07:15:13PM +0000, Matthew Wilcox wrote:
> > > I find both of these functions exceptionally confusing. Does this
> > > make it easier to understand?
> >
> > Never mind, this is buggy. I'll send something better tomorrow.
>
> That took a week, not a day. *sigh*. At least this is shorter.

Thanks, I'll give it a try (along with the other 4, on top of the 12:
maybe on -rc5, maybe on today's mmotm, I'll decide that later).

Shorter you say, that's good: I was disheartened by the way it got
more complicated, after your initial truncate_inode_partial_page()
neatness. Any hints on what was wrong with my simple fixup to that?
(But I didn't spend any more time trying to prove or disprove it.)

Hugh

2020-11-25 03:01:29

by Hugh Dickins

[permalink] [raw]
Subject: Re: [PATCH v4 00/16] Overhaul multi-page lookups for THP

On Tue, 24 Nov 2020, Hugh Dickins wrote:
> On Wed, 25 Nov 2020, Matthew Wilcox wrote:
> > On Tue, Nov 17, 2020 at 11:43:02PM +0000, Matthew Wilcox wrote:
> > > On Tue, Nov 17, 2020 at 07:15:13PM +0000, Matthew Wilcox wrote:
> > > > I find both of these functions exceptionally confusing. Does this
> > > > make it easier to understand?
> > >
> > > Never mind, this is buggy. I'll send something better tomorrow.
> >
> > That took a week, not a day. *sigh*. At least this is shorter.
>
> Thanks, I'll give it a try (along with the other 4, on top of the 12:

s/12/16/

> maybe on -rc5, maybe on today's mmotm, I'll decide that later).
>
> Shorter you say, that's good: I was disheartened by the way it got
> more complicated, after your initial truncate_inode_partial_page()
> neatness. Any hints on what was wrong with my simple fixup to that?
> (But I didn't spend any more time trying to prove or disprove it.)
>
> Hugh
>

2020-11-26 00:14:58

by Hugh Dickins

[permalink] [raw]
Subject: Re: [PATCH v4 00/16] Overhaul multi-page lookups for THP

On Wed, Nov 25, 2020 at 3:09 PM Andrew Morton <[email protected]> wrote:
>
> On Wed, 25 Nov 2020 02:32:34 +0000 Matthew Wilcox <[email protected]> wrote:
>
> > On Tue, Nov 17, 2020 at 11:43:02PM +0000, Matthew Wilcox wrote:
> > > On Tue, Nov 17, 2020 at 07:15:13PM +0000, Matthew Wilcox wrote:
> > > > I find both of these functions exceptionally confusing. Does this
> > > > make it easier to understand?
> > >
> > > Never mind, this is buggy. I'll send something better tomorrow.
> >
> > That took a week, not a day. *sigh*. At least this is shorter.
> >
> > commit 1a02863ce04fd325922d6c3db6d01e18d55f966b
> > Author: Matthew Wilcox (Oracle) <[email protected]>
> > Date: Tue Nov 17 10:45:18 2020 -0500
> >
> > fix mm-truncateshmem-handle-truncates-that-split-thps.patch
> >
>
> That's a big patch. Big enough to put
> mm-truncateshmem-handle-truncates-that-split-thps.patch back into
> unreviewed state, methinks. And big enough to have a changelog!
>
> Below is the folded-together result for reviewers, please.

I have not reviewed the changes at all, but have been testing.

Responding hastily in gmail, which will probably garble the result
(sorry), because I think you may be working towards another mmotm and
there's one little fix definitely needed; the machine I usually mail
patches from is in a different hang (running with this patch) that I
need to examine before rebooting - but that's probably not something
the average user will ever encounter.

In general, this series behaves a lot better than it did nine days
ago: LTP results on shmem huge pages match how they were before the
series, only one of xfstests fails which did not fail before
(generic/539 - not yet analysed, may be of no importance), and until
that hang there were no problems seen in running my tmpfs swapping
loads.

Though I did see generic/476 stuck in shmem_undo_range() on one
machine, and will need to reproduce and investigate that.

The little fix definitely needed was shown by generic/083: each fsstress
process ends up waiting for a page lock; it happens even without forcing
huge pages. See below...

> Is the changelog still accurate and complete?
>
>
> From: "Matthew Wilcox (Oracle)" <[email protected]>
> Subject: mm/truncate,shmem: handle truncates that split THPs
>
> Handle THP splitting in the parts of the truncation functions which
> already handle partial pages. Factor all that code out into a new
> function called truncate_inode_partial_page().
>
> We lose the easy 'bail out' path if a truncate or hole punch is entirely
> within a single page. We can add some more complex logic to restore the
> optimisation if it proves to be worthwhile.
>
> [[email protected]: fix]
> Link: https://lkml.kernel.org/r/[email protected]
> Link: https://lkml.kernel.org/r/[email protected]
> Signed-off-by: Matthew Wilcox (Oracle) <[email protected]>
> Reviewed-by: Jan Kara <[email protected]>
> Reviewed-by: William Kucharski <[email protected]>
> Cc: Christoph Hellwig <[email protected]>
> Cc: Dave Chinner <[email protected]>
> Cc: Hugh Dickins <[email protected]>
> Cc: Johannes Weiner <[email protected]>
> Cc: Kirill A. Shutemov <[email protected]>
> Cc: Yang Shi <[email protected]>
> Signed-off-by: Andrew Morton <[email protected]>
> ---
>
> mm/internal.h | 2
> mm/shmem.c | 137 +++++++++++++----------------------------
> mm/truncate.c | 157 +++++++++++++++++++++++-------------------------
> 3 files changed, 122 insertions(+), 174 deletions(-)
>
> --- a/mm/internal.h~mm-truncateshmem-handle-truncates-that-split-thps
> +++ a/mm/internal.h
> @@ -623,4 +623,6 @@ struct migration_target_control {
> gfp_t gfp_mask;
> };
>
> +pgoff_t truncate_inode_partial_page(struct address_space *mapping,
> + struct page *page, loff_t start, loff_t end);
> #endif /* __MM_INTERNAL_H */
> --- a/mm/shmem.c~mm-truncateshmem-handle-truncates-that-split-thps
> +++ a/mm/shmem.c
> @@ -858,32 +858,6 @@ void shmem_unlock_mapping(struct address
> }
>
> /*
> - * Check whether a hole-punch or truncation needs to split a huge page,
> - * returning true if no split was required, or the split has been successful.
> - *
> - * Eviction (or truncation to 0 size) should never need to split a huge page;
> - * but in rare cases might do so, if shmem_undo_range() failed to trylock on
> - * head, and then succeeded to trylock on tail.
> - *
> - * A split can only succeed when there are no additional references on the
> - * huge page: so the split below relies upon find_get_entries() having stopped
> - * when it found a subpage of the huge page, without getting further references.
> - */
> -static bool shmem_punch_compound(struct page *page, pgoff_t start, pgoff_t end)
> -{
> - if (!PageTransCompound(page))
> - return true;
> -
> - /* Just proceed to delete a huge page wholly within the range punched */
> - if (PageHead(page) &&
> - page->index >= start && page->index + HPAGE_PMD_NR <= end)
> - return true;
> -
> - /* Try to split huge page, so we can truly punch the hole or truncate */
> - return split_huge_page(page) >= 0;
> -}
> -
> -/*
> * Remove range of pages and swap entries from page cache, and free them.
> * If !unfalloc, truncate or punch hole; if unfalloc, undo failed fallocate.
> */
> @@ -892,26 +866,33 @@ static void shmem_undo_range(struct inod
> {
> struct address_space *mapping = inode->i_mapping;
> struct shmem_inode_info *info = SHMEM_I(inode);
> - pgoff_t start = (lstart + PAGE_SIZE - 1) >> PAGE_SHIFT;
> - pgoff_t end = (lend + 1) >> PAGE_SHIFT;
> - unsigned int partial_start = lstart & (PAGE_SIZE - 1);
> - unsigned int partial_end = (lend + 1) & (PAGE_SIZE - 1);
> + pgoff_t start, end;
> struct pagevec pvec;
> pgoff_t indices[PAGEVEC_SIZE];
> + struct page *page;
> long nr_swaps_freed = 0;
> pgoff_t index;
> int i;
>
> - if (lend == -1)
> - end = -1; /* unsigned, so actually very big */
> + page = NULL;
> + start = lstart >> PAGE_SHIFT;
> + shmem_getpage(inode, start, &page, SGP_READ);
> + if (page) {
> + page = thp_head(page);
> + set_page_dirty(page);
> + start = truncate_inode_partial_page(mapping, page, lstart,
> + lend);
> + }
> +
> + /* 'end' includes a partial page */
> + end = lend / PAGE_SIZE;
>
> pagevec_init(&pvec);
> index = start;
> while (index < end && find_lock_entries(mapping, index, end - 1,
> - &pvec, indices)) {
> + &pvec, indices)) {
> for (i = 0; i < pagevec_count(&pvec); i++) {
> - struct page *page = pvec.pages[i];
> -
> + page = pvec.pages[i];
> index = indices[i];
>
> if (xa_is_value(page)) {
> @@ -921,8 +902,6 @@ static void shmem_undo_range(struct inod
> index, page);
> continue;
> }
> - index += thp_nr_pages(page) - 1;
> -
> if (!unfalloc || !PageUptodate(page))
> truncate_inode_page(mapping, page);
> unlock_page(page);
> @@ -933,90 +912,60 @@ static void shmem_undo_range(struct inod
> index++;
> }
>
> - if (partial_start) {
> - struct page *page = NULL;
> - shmem_getpage(inode, start - 1, &page, SGP_READ);
> - if (page) {
> - unsigned int top = PAGE_SIZE;
> - if (start > end) {
> - top = partial_end;
> - partial_end = 0;
> - }
> - zero_user_segment(page, partial_start, top);
> - set_page_dirty(page);
> - unlock_page(page);
> - put_page(page);
> - }
> - }
> - if (partial_end) {
> - struct page *page = NULL;
> - shmem_getpage(inode, end, &page, SGP_READ);
> - if (page) {
> - zero_user_segment(page, 0, partial_end);
> - set_page_dirty(page);
> - unlock_page(page);
> - put_page(page);
> - }
> - }
> - if (start >= end)
> - return;
> -
> index = start;
> - while (index < end) {
> + while (index <= end) {
> cond_resched();
>
> - if (!find_get_entries(mapping, index, end - 1, &pvec,
> - indices)) {
> + if (!find_get_entries(mapping, index, end, &pvec, indices)) {
> /* If all gone or hole-punch or unfalloc, we're done */
> - if (index == start || end != -1)
> + if (index == start || lend != (loff_t)-1)
> break;
> /* But if truncating, restart to make sure all gone */
> index = start;
> continue;
> }
> +
> for (i = 0; i < pagevec_count(&pvec); i++) {
> - struct page *page = pvec.pages[i];
> + page = pvec.pages[i];
>
> - index = indices[i];
> if (xa_is_value(page)) {
> if (unfalloc)
> continue;
> - if (shmem_free_swap(mapping, index, page)) {
> - /* Swap was replaced by page: retry */
> - index--;
> - break;
> + index = indices[i];
> + if (index == end) {
> + /* Partial page swapped out? */
> + shmem_getpage(inode, end, &page,
> + SGP_READ);
> + } else {
> + if (shmem_free_swap(mapping, index,
> + page)) {
> + /* Swap replaced: retry */
> + break;
> + }
> + nr_swaps_freed++;
> + continue;
> }
> - nr_swaps_freed++;
> - continue;
> + } else {
> + lock_page(page);
> }
>
> - lock_page(page);
> -
> if (!unfalloc || !PageUptodate(page)) {
> if (page_mapping(page) != mapping) {
> /* Page was replaced by swap: retry */
> unlock_page(page);
> - index--;
> + put_page(page);
> break;
> }
> VM_BUG_ON_PAGE(PageWriteback(page), page);
> - if (shmem_punch_compound(page, start, end))
> - truncate_inode_page(mapping, page);
> - else if (IS_ENABLED(CONFIG_TRANSPARENT_HUGEPAGE)) {
> - /* Wipe the page and don't get stuck */
> - clear_highpage(page);
> - flush_dcache_page(page);
> - set_page_dirty(page);
> - if (index <
> - round_up(start, HPAGE_PMD_NR))
> - start = index + 1;
> - }
> + index = truncate_inode_partial_page(mapping,
> + page, lstart, lend);
> + if (index > end)
> + end = indices[i] - 1;
> }
> - unlock_page(page);

The fix needed is here: instead of deleting that unlock_page(page)
line, it needs to be } else { unlock_page(page); }

> }
> + index = indices[i - 1] + 1;
> pagevec_remove_exceptionals(&pvec);
> - pagevec_release(&pvec);
> - index++;
> + pagevec_reinit(&pvec);
> }
>
> spin_lock_irq(&info->lock);
> --- a/mm/truncate.c~mm-truncateshmem-handle-truncates-that-split-thps
> +++ a/mm/truncate.c
> @@ -225,6 +225,63 @@ int truncate_inode_page(struct address_s
> }
>
> /*
> + * Handle partial (transparent) pages. If the page is entirely within
> + * the range, we discard it. If not, we split the page if it's a THP
> + * and zero the part of the page that's within the [start, end] range.
> + * split_huge_page() will discard any of the subpages which now lie
> + * beyond i_size, and the caller will discard pages which lie within a
> + * newly created hole.
> + *
> + * Return: The index after the current page.
> + */
> +pgoff_t truncate_inode_partial_page(struct address_space *mapping,
> + struct page *page, loff_t start, loff_t end)
> +{
> + loff_t pos = page_offset(page);
> + pgoff_t next_index = page->index + thp_nr_pages(page);
> + unsigned int offset, length;
> +
> + if (pos < start)
> + offset = start - pos;
> + else
> + offset = 0;
> + length = thp_size(page);
> + if (pos + length <= (u64)end)
> + length = length - offset;
> + else
> + length = end + 1 - pos - offset;
> +
> + wait_on_page_writeback(page);
> + if (length == thp_size(page))
> + goto truncate;
> +
> + cleancache_invalidate_page(page->mapping, page);
> + if (page_has_private(page))
> + do_invalidatepage(page, offset, length);
> + if (!PageTransHuge(page))
> + goto zero;
> + page += offset / PAGE_SIZE;
> + if (split_huge_page(page) < 0) {
> + page -= offset / PAGE_SIZE;
> + goto zero;
> + }
> + next_index = page->index + 1;
> + offset %= PAGE_SIZE;
> + if (offset == 0 && length >= PAGE_SIZE)
> + goto truncate;
> + length = PAGE_SIZE - offset;
> +zero:
> + zero_user(page, offset, length);
> + goto out;
> +truncate:
> + truncate_inode_page(mapping, page);
> +out:
> + unlock_page(page);
> + put_page(page);
> + return next_index;
> +}
> +
> +/*
> * Used to get rid of pages on hardware memory corruption.
> */
> int generic_error_remove_page(struct address_space *mapping, struct page *page)
> @@ -275,10 +332,6 @@ int invalidate_inode_page(struct page *p
> * The first pass will remove most pages, so the search cost of the second pass
> * is low.
> *
> - * We pass down the cache-hot hint to the page freeing code. Even if the
> - * mapping is large, it is probably the case that the final pages are the most
> - * recently touched, and freeing happens in ascending file offset order.
> - *
> * Note that since ->invalidatepage() accepts range to invalidate
> * truncate_inode_pages_range is able to handle cases where lend + 1 is not
> * page aligned properly.
> @@ -286,38 +339,24 @@ int invalidate_inode_page(struct page *p
> void truncate_inode_pages_range(struct address_space *mapping,
> loff_t lstart, loff_t lend)
> {
> - pgoff_t start; /* inclusive */
> - pgoff_t end; /* exclusive */
> - unsigned int partial_start; /* inclusive */
> - unsigned int partial_end; /* exclusive */
> + pgoff_t start, end;
> struct pagevec pvec;
> pgoff_t indices[PAGEVEC_SIZE];
> pgoff_t index;
> int i;
> + struct page * page;
>
> if (mapping->nrpages == 0 && mapping->nrexceptional == 0)
> goto out;
>
> - /* Offsets within partial pages */
> - partial_start = lstart & (PAGE_SIZE - 1);
> - partial_end = (lend + 1) & (PAGE_SIZE - 1);
> + start = lstart >> PAGE_SHIFT;
> + page = find_lock_head(mapping, start);
> + if (page)
> + start = truncate_inode_partial_page(mapping, page, lstart,
> + lend);
>
> - /*
> - * 'start' and 'end' always covers the range of pages to be fully
> - * truncated. Partial pages are covered with 'partial_start' at the
> - * start of the range and 'partial_end' at the end of the range.
> - * Note that 'end' is exclusive while 'lend' is inclusive.
> - */
> - start = (lstart + PAGE_SIZE - 1) >> PAGE_SHIFT;
> - if (lend == -1)
> - /*
> - * lend == -1 indicates end-of-file so we have to set 'end'
> - * to the highest possible pgoff_t and since the type is
> - * unsigned we're using -1.
> - */
> - end = -1;
> - else
> - end = (lend + 1) >> PAGE_SHIFT;
> + /* 'end' includes a partial page */
> + end = lend / PAGE_SIZE;
>
> pagevec_init(&pvec);
> index = start;
> @@ -334,50 +373,11 @@ void truncate_inode_pages_range(struct a
> cond_resched();
> }
>
> - if (partial_start) {
> - struct page *page = find_lock_page(mapping, start - 1);
> - if (page) {
> - unsigned int top = PAGE_SIZE;
> - if (start > end) {
> - /* Truncation within a single page */
> - top = partial_end;
> - partial_end = 0;
> - }
> - wait_on_page_writeback(page);
> - zero_user_segment(page, partial_start, top);
> - cleancache_invalidate_page(mapping, page);
> - if (page_has_private(page))
> - do_invalidatepage(page, partial_start,
> - top - partial_start);
> - unlock_page(page);
> - put_page(page);
> - }
> - }
> - if (partial_end) {
> - struct page *page = find_lock_page(mapping, end);
> - if (page) {
> - wait_on_page_writeback(page);
> - zero_user_segment(page, 0, partial_end);
> - cleancache_invalidate_page(mapping, page);
> - if (page_has_private(page))
> - do_invalidatepage(page, 0,
> - partial_end);
> - unlock_page(page);
> - put_page(page);
> - }
> - }
> - /*
> - * If the truncation happened within a single page no pages
> - * will be released, just zeroed, so we can bail out now.
> - */
> - if (start >= end)
> - goto out;
> -
> index = start;
> - for ( ; ; ) {
> + while (index <= end) {
> cond_resched();
> - if (!find_get_entries(mapping, index, end - 1, &pvec,
> - indices)) {
> +
> + if (!find_get_entries(mapping, index, end, &pvec, indices)) {
> /* If all gone from start onwards, we're done */
> if (index == start)
> break;
> @@ -387,23 +387,20 @@ void truncate_inode_pages_range(struct a
> }
>
> for (i = 0; i < pagevec_count(&pvec); i++) {
> - struct page *page = pvec.pages[i];
> -
> - /* We rely upon deletion not changing page->index */
> - index = indices[i];
> -
> + page = pvec.pages[i];
> if (xa_is_value(page))
> continue;
>
> lock_page(page);
> - WARN_ON(page_to_index(page) != index);
> - wait_on_page_writeback(page);
> - truncate_inode_page(mapping, page);
> - unlock_page(page);
> + index = truncate_inode_partial_page(mapping, page,
> + lstart, lend);
> + /* Couldn't split a THP? */
> + if (index > end)
> + end = indices[i] - 1;
> }
> + index = indices[i - 1] + 1;
> truncate_exceptional_pvec_entries(mapping, &pvec, indices);
> - pagevec_release(&pvec);
> - index++;
> + pagevec_reinit(&pvec);
> }
>
> out:
> _
>

2020-11-26 08:21:27

by Andrew Morton

[permalink] [raw]
Subject: Re: [PATCH v4 00/16] Overhaul multi-page lookups for THP

On Wed, 25 Nov 2020 02:32:34 +0000 Matthew Wilcox <[email protected]> wrote:

> On Tue, Nov 17, 2020 at 11:43:02PM +0000, Matthew Wilcox wrote:
> > On Tue, Nov 17, 2020 at 07:15:13PM +0000, Matthew Wilcox wrote:
> > > I find both of these functions exceptionally confusing. Does this
> > > make it easier to understand?
> >
> > Never mind, this is buggy. I'll send something better tomorrow.
>
> That took a week, not a day. *sigh*. At least this is shorter.
>
> commit 1a02863ce04fd325922d6c3db6d01e18d55f966b
> Author: Matthew Wilcox (Oracle) <[email protected]>
> Date: Tue Nov 17 10:45:18 2020 -0500
>
> fix mm-truncateshmem-handle-truncates-that-split-thps.patch
>

That's a big patch. Big enough to put
mm-truncateshmem-handle-truncates-that-split-thps.patch back into
unreviewed state, methinks. And big enough to have a changelog!

Below is the folded-together result; reviewers, please take a look.

Is the changelog still accurate and complete?


From: "Matthew Wilcox (Oracle)" <[email protected]>
Subject: mm/truncate,shmem: handle truncates that split THPs

Handle THP splitting in the parts of the truncation functions which
already handle partial pages. Factor all that code out into a new
function called truncate_inode_partial_page().

We lose the easy 'bail out' path if a truncate or hole punch is entirely
within a single page. We can add some more complex logic to restore the
optimisation if it proves to be worthwhile.

[[email protected]: fix]
Link: https://lkml.kernel.org/r/[email protected]
Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: Matthew Wilcox (Oracle) <[email protected]>
Reviewed-by: Jan Kara <[email protected]>
Reviewed-by: William Kucharski <[email protected]>
Cc: Christoph Hellwig <[email protected]>
Cc: Dave Chinner <[email protected]>
Cc: Hugh Dickins <[email protected]>
Cc: Johannes Weiner <[email protected]>
Cc: Kirill A. Shutemov <[email protected]>
Cc: Yang Shi <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
---

mm/internal.h | 2
mm/shmem.c | 137 +++++++++++++----------------------------
mm/truncate.c | 157 +++++++++++++++++++++++-------------------------
3 files changed, 122 insertions(+), 174 deletions(-)

--- a/mm/internal.h~mm-truncateshmem-handle-truncates-that-split-thps
+++ a/mm/internal.h
@@ -623,4 +623,6 @@ struct migration_target_control {
gfp_t gfp_mask;
};

+pgoff_t truncate_inode_partial_page(struct address_space *mapping,
+ struct page *page, loff_t start, loff_t end);
#endif /* __MM_INTERNAL_H */
--- a/mm/shmem.c~mm-truncateshmem-handle-truncates-that-split-thps
+++ a/mm/shmem.c
@@ -858,32 +858,6 @@ void shmem_unlock_mapping(struct address
}

/*
- * Check whether a hole-punch or truncation needs to split a huge page,
- * returning true if no split was required, or the split has been successful.
- *
- * Eviction (or truncation to 0 size) should never need to split a huge page;
- * but in rare cases might do so, if shmem_undo_range() failed to trylock on
- * head, and then succeeded to trylock on tail.
- *
- * A split can only succeed when there are no additional references on the
- * huge page: so the split below relies upon find_get_entries() having stopped
- * when it found a subpage of the huge page, without getting further references.
- */
-static bool shmem_punch_compound(struct page *page, pgoff_t start, pgoff_t end)
-{
- if (!PageTransCompound(page))
- return true;
-
- /* Just proceed to delete a huge page wholly within the range punched */
- if (PageHead(page) &&
- page->index >= start && page->index + HPAGE_PMD_NR <= end)
- return true;
-
- /* Try to split huge page, so we can truly punch the hole or truncate */
- return split_huge_page(page) >= 0;
-}
-
-/*
* Remove range of pages and swap entries from page cache, and free them.
* If !unfalloc, truncate or punch hole; if unfalloc, undo failed fallocate.
*/
@@ -892,26 +866,33 @@ static void shmem_undo_range(struct inod
{
struct address_space *mapping = inode->i_mapping;
struct shmem_inode_info *info = SHMEM_I(inode);
- pgoff_t start = (lstart + PAGE_SIZE - 1) >> PAGE_SHIFT;
- pgoff_t end = (lend + 1) >> PAGE_SHIFT;
- unsigned int partial_start = lstart & (PAGE_SIZE - 1);
- unsigned int partial_end = (lend + 1) & (PAGE_SIZE - 1);
+ pgoff_t start, end;
struct pagevec pvec;
pgoff_t indices[PAGEVEC_SIZE];
+ struct page *page;
long nr_swaps_freed = 0;
pgoff_t index;
int i;

- if (lend == -1)
- end = -1; /* unsigned, so actually very big */
+ page = NULL;
+ start = lstart >> PAGE_SHIFT;
+ shmem_getpage(inode, start, &page, SGP_READ);
+ if (page) {
+ page = thp_head(page);
+ set_page_dirty(page);
+ start = truncate_inode_partial_page(mapping, page, lstart,
+ lend);
+ }
+
+ /* 'end' includes a partial page */
+ end = lend / PAGE_SIZE;

pagevec_init(&pvec);
index = start;
while (index < end && find_lock_entries(mapping, index, end - 1,
- &pvec, indices)) {
+ &pvec, indices)) {
for (i = 0; i < pagevec_count(&pvec); i++) {
- struct page *page = pvec.pages[i];
-
+ page = pvec.pages[i];
index = indices[i];

if (xa_is_value(page)) {
@@ -921,8 +902,6 @@ static void shmem_undo_range(struct inod
index, page);
continue;
}
- index += thp_nr_pages(page) - 1;
-
if (!unfalloc || !PageUptodate(page))
truncate_inode_page(mapping, page);
unlock_page(page);
@@ -933,90 +912,60 @@ static void shmem_undo_range(struct inod
index++;
}

- if (partial_start) {
- struct page *page = NULL;
- shmem_getpage(inode, start - 1, &page, SGP_READ);
- if (page) {
- unsigned int top = PAGE_SIZE;
- if (start > end) {
- top = partial_end;
- partial_end = 0;
- }
- zero_user_segment(page, partial_start, top);
- set_page_dirty(page);
- unlock_page(page);
- put_page(page);
- }
- }
- if (partial_end) {
- struct page *page = NULL;
- shmem_getpage(inode, end, &page, SGP_READ);
- if (page) {
- zero_user_segment(page, 0, partial_end);
- set_page_dirty(page);
- unlock_page(page);
- put_page(page);
- }
- }
- if (start >= end)
- return;
-
index = start;
- while (index < end) {
+ while (index <= end) {
cond_resched();

- if (!find_get_entries(mapping, index, end - 1, &pvec,
- indices)) {
+ if (!find_get_entries(mapping, index, end, &pvec, indices)) {
/* If all gone or hole-punch or unfalloc, we're done */
- if (index == start || end != -1)
+ if (index == start || lend != (loff_t)-1)
break;
/* But if truncating, restart to make sure all gone */
index = start;
continue;
}
+
for (i = 0; i < pagevec_count(&pvec); i++) {
- struct page *page = pvec.pages[i];
+ page = pvec.pages[i];

- index = indices[i];
if (xa_is_value(page)) {
if (unfalloc)
continue;
- if (shmem_free_swap(mapping, index, page)) {
- /* Swap was replaced by page: retry */
- index--;
- break;
+ index = indices[i];
+ if (index == end) {
+ /* Partial page swapped out? */
+ shmem_getpage(inode, end, &page,
+ SGP_READ);
+ } else {
+ if (shmem_free_swap(mapping, index,
+ page)) {
+ /* Swap replaced: retry */
+ break;
+ }
+ nr_swaps_freed++;
+ continue;
}
- nr_swaps_freed++;
- continue;
+ } else {
+ lock_page(page);
}

- lock_page(page);
-
if (!unfalloc || !PageUptodate(page)) {
if (page_mapping(page) != mapping) {
/* Page was replaced by swap: retry */
unlock_page(page);
- index--;
+ put_page(page);
break;
}
VM_BUG_ON_PAGE(PageWriteback(page), page);
- if (shmem_punch_compound(page, start, end))
- truncate_inode_page(mapping, page);
- else if (IS_ENABLED(CONFIG_TRANSPARENT_HUGEPAGE)) {
- /* Wipe the page and don't get stuck */
- clear_highpage(page);
- flush_dcache_page(page);
- set_page_dirty(page);
- if (index <
- round_up(start, HPAGE_PMD_NR))
- start = index + 1;
- }
+ index = truncate_inode_partial_page(mapping,
+ page, lstart, lend);
+ if (index > end)
+ end = indices[i] - 1;
}
- unlock_page(page);
}
+ index = indices[i - 1] + 1;
pagevec_remove_exceptionals(&pvec);
- pagevec_release(&pvec);
- index++;
+ pagevec_reinit(&pvec);
}

spin_lock_irq(&info->lock);
--- a/mm/truncate.c~mm-truncateshmem-handle-truncates-that-split-thps
+++ a/mm/truncate.c
@@ -225,6 +225,63 @@ int truncate_inode_page(struct address_s
}

/*
+ * Handle partial (transparent) pages. If the page is entirely within
+ * the range, we discard it. If not, we split the page if it's a THP
+ * and zero the part of the page that's within the [start, end] range.
+ * split_huge_page() will discard any of the subpages which now lie
+ * beyond i_size, and the caller will discard pages which lie within a
+ * newly created hole.
+ *
+ * Return: The index after the current page.
+ */
+pgoff_t truncate_inode_partial_page(struct address_space *mapping,
+ struct page *page, loff_t start, loff_t end)
+{
+ loff_t pos = page_offset(page);
+ pgoff_t next_index = page->index + thp_nr_pages(page);
+ unsigned int offset, length;
+
+ if (pos < start)
+ offset = start - pos;
+ else
+ offset = 0;
+ length = thp_size(page);
+ if (pos + length <= (u64)end)
+ length = length - offset;
+ else
+ length = end + 1 - pos - offset;
+
+ wait_on_page_writeback(page);
+ if (length == thp_size(page))
+ goto truncate;
+
+ cleancache_invalidate_page(page->mapping, page);
+ if (page_has_private(page))
+ do_invalidatepage(page, offset, length);
+ if (!PageTransHuge(page))
+ goto zero;
+ page += offset / PAGE_SIZE;
+ if (split_huge_page(page) < 0) {
+ page -= offset / PAGE_SIZE;
+ goto zero;
+ }
+ next_index = page->index + 1;
+ offset %= PAGE_SIZE;
+ if (offset == 0 && length >= PAGE_SIZE)
+ goto truncate;
+ length = PAGE_SIZE - offset;
+zero:
+ zero_user(page, offset, length);
+ goto out;
+truncate:
+ truncate_inode_page(mapping, page);
+out:
+ unlock_page(page);
+ put_page(page);
+ return next_index;
+}
+
+/*
* Used to get rid of pages on hardware memory corruption.
*/
int generic_error_remove_page(struct address_space *mapping, struct page *page)
@@ -275,10 +332,6 @@ int invalidate_inode_page(struct page *p
* The first pass will remove most pages, so the search cost of the second pass
* is low.
*
- * We pass down the cache-hot hint to the page freeing code. Even if the
- * mapping is large, it is probably the case that the final pages are the most
- * recently touched, and freeing happens in ascending file offset order.
- *
* Note that since ->invalidatepage() accepts range to invalidate
* truncate_inode_pages_range is able to handle cases where lend + 1 is not
* page aligned properly.
@@ -286,38 +339,24 @@ int invalidate_inode_page(struct page *p
void truncate_inode_pages_range(struct address_space *mapping,
loff_t lstart, loff_t lend)
{
- pgoff_t start; /* inclusive */
- pgoff_t end; /* exclusive */
- unsigned int partial_start; /* inclusive */
- unsigned int partial_end; /* exclusive */
+ pgoff_t start, end;
struct pagevec pvec;
pgoff_t indices[PAGEVEC_SIZE];
pgoff_t index;
int i;
+ struct page * page;

if (mapping->nrpages == 0 && mapping->nrexceptional == 0)
goto out;

- /* Offsets within partial pages */
- partial_start = lstart & (PAGE_SIZE - 1);
- partial_end = (lend + 1) & (PAGE_SIZE - 1);
+ start = lstart >> PAGE_SHIFT;
+ page = find_lock_head(mapping, start);
+ if (page)
+ start = truncate_inode_partial_page(mapping, page, lstart,
+ lend);

- /*
- * 'start' and 'end' always covers the range of pages to be fully
- * truncated. Partial pages are covered with 'partial_start' at the
- * start of the range and 'partial_end' at the end of the range.
- * Note that 'end' is exclusive while 'lend' is inclusive.
- */
- start = (lstart + PAGE_SIZE - 1) >> PAGE_SHIFT;
- if (lend == -1)
- /*
- * lend == -1 indicates end-of-file so we have to set 'end'
- * to the highest possible pgoff_t and since the type is
- * unsigned we're using -1.
- */
- end = -1;
- else
- end = (lend + 1) >> PAGE_SHIFT;
+ /* 'end' includes a partial page */
+ end = lend / PAGE_SIZE;

pagevec_init(&pvec);
index = start;
@@ -334,50 +373,11 @@ void truncate_inode_pages_range(struct a
cond_resched();
}

- if (partial_start) {
- struct page *page = find_lock_page(mapping, start - 1);
- if (page) {
- unsigned int top = PAGE_SIZE;
- if (start > end) {
- /* Truncation within a single page */
- top = partial_end;
- partial_end = 0;
- }
- wait_on_page_writeback(page);
- zero_user_segment(page, partial_start, top);
- cleancache_invalidate_page(mapping, page);
- if (page_has_private(page))
- do_invalidatepage(page, partial_start,
- top - partial_start);
- unlock_page(page);
- put_page(page);
- }
- }
- if (partial_end) {
- struct page *page = find_lock_page(mapping, end);
- if (page) {
- wait_on_page_writeback(page);
- zero_user_segment(page, 0, partial_end);
- cleancache_invalidate_page(mapping, page);
- if (page_has_private(page))
- do_invalidatepage(page, 0,
- partial_end);
- unlock_page(page);
- put_page(page);
- }
- }
- /*
- * If the truncation happened within a single page no pages
- * will be released, just zeroed, so we can bail out now.
- */
- if (start >= end)
- goto out;
-
index = start;
- for ( ; ; ) {
+ while (index <= end) {
cond_resched();
- if (!find_get_entries(mapping, index, end - 1, &pvec,
- indices)) {
+
+ if (!find_get_entries(mapping, index, end, &pvec, indices)) {
/* If all gone from start onwards, we're done */
if (index == start)
break;
@@ -387,23 +387,20 @@ void truncate_inode_pages_range(struct a
}

for (i = 0; i < pagevec_count(&pvec); i++) {
- struct page *page = pvec.pages[i];
-
- /* We rely upon deletion not changing page->index */
- index = indices[i];
-
+ page = pvec.pages[i];
if (xa_is_value(page))
continue;

lock_page(page);
- WARN_ON(page_to_index(page) != index);
- wait_on_page_writeback(page);
- truncate_inode_page(mapping, page);
- unlock_page(page);
+ index = truncate_inode_partial_page(mapping, page,
+ lstart, lend);
+ /* Couldn't split a THP? */
+ if (index > end)
+ end = indices[i] - 1;
}
+ index = indices[i - 1] + 1;
truncate_exceptional_pvec_entries(mapping, &pvec, indices);
- pagevec_release(&pvec);
- index++;
+ pagevec_reinit(&pvec);
}

out:
_

2020-11-27 08:22:06

by Matthew Wilcox

[permalink] [raw]
Subject: Re: [PATCH v4 00/16] Overhaul multi-page lookups for THP

On Wed, Nov 25, 2020 at 04:11:57PM -0800, Hugh Dickins wrote:
> The little fix definitely needed was shown by generic/083: each
> fsstress waiting for page lock, happens even without forcing huge
> pages. See below...

Huh ... I need to look into why my xfstests run is skipping generic/083:

0006 generic/083 3s ... run fstests generic/083 at 2020-11-26 12:11:52
0006 [not run] this test requires a valid $SCRATCH_MNT and unique
0006 Ran: generic/083
0006 Not run: generic/083

> > if (!unfalloc || !PageUptodate(page)) {
> > if (page_mapping(page) != mapping) {
> > /* Page was replaced by swap: retry */
> > unlock_page(page);
> > - index--;
> > + put_page(page);
> > break;
> > }
> > VM_BUG_ON_PAGE(PageWriteback(page), page);
> > - if (shmem_punch_compound(page, start, end))
> > - truncate_inode_page(mapping, page);
> > - else if (IS_ENABLED(CONFIG_TRANSPARENT_HUGEPAGE)) {
> > - /* Wipe the page and don't get stuck */
> > - clear_highpage(page);
> > - flush_dcache_page(page);
> > - set_page_dirty(page);
> > - if (index <
> > - round_up(start, HPAGE_PMD_NR))
> > - start = index + 1;
> > - }
> > + index = truncate_inode_partial_page(mapping,
> > + page, lstart, lend);
> > + if (index > end)
> > + end = indices[i] - 1;
> > }
> > - unlock_page(page);
>
> The fix needed is here: instead of deleting that unlock_page(page)
> line, it needs to be } else { unlock_page(page); }

It also needs a put_page(page);

That's now taken care of by truncate_inode_partial_page(), so if we're
not calling that, we need to put the page as well. ie this:

+++ b/mm/shmem.c
@@ -954,6 +954,9 @@ static void shmem_undo_range(struct inode *inode, loff_t lstart, loff_t lend,
page, lstart, lend);
if (index > end)
end = indices[i] - 1;
+ } else {
+ unlock_page(page);
+ put_page(page);
}
}
index = indices[i - 1] + 1;

> > }
> > + index = indices[i - 1] + 1;
> > pagevec_remove_exceptionals(&pvec);
> > - pagevec_release(&pvec);
> > - index++;
> > + pagevec_reinit(&pvec);

2020-11-27 08:41:39

by Hugh Dickins

[permalink] [raw]
Subject: Re: [PATCH v4 00/16] Overhaul multi-page lookups for THP

On Thu, 26 Nov 2020, Matthew Wilcox wrote:
> On Wed, Nov 25, 2020 at 04:11:57PM -0800, Hugh Dickins wrote:
> > > + index = truncate_inode_partial_page(mapping,
> > > + page, lstart, lend);
> > > + if (index > end)
> > > + end = indices[i] - 1;
> > > }
> > > - unlock_page(page);
> >
> > The fix needed is here: instead of deleting that unlock_page(page)
> > line, it needs to be } else { unlock_page(page); }
>
> It also needs a put_page(page);

Oh yes indeed, sorry for getting that wrong. I'd misread the
pagevec_reinit() at the end as the old pagevec_release(). Do you
really need to do pagevec_remove_exceptionals() there if you're not
using pagevec_release()?
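
For reference, a simplified sketch of the three helpers in play here,
paraphrased from memory of the v5.10-era tree rather than quoted, so treat
the details as approximate: pagevec_release() ends up putting every entry,
which is the only reason the value entries have to be stripped first, while
pagevec_reinit() just resets the count and puts nothing.

static inline void pagevec_reinit(struct pagevec *pvec)
{
	pvec->nr = 0;			/* drops the entries, puts nothing */
}

static inline void pagevec_release(struct pagevec *pvec)
{
	if (pagevec_count(pvec))
		__pagevec_release(pvec);	/* release_pages() on pvec->pages */
}

void pagevec_remove_exceptionals(struct pagevec *pvec)
{
	int i, j;

	/*
	 * Compact out xarray value entries (shadow/swap entries) so that a
	 * later pagevec_release() never calls put_page() on a non-page.
	 */
	for (i = 0, j = 0; i < pagevec_count(pvec); i++) {
		struct page *page = pvec->pages[i];

		if (!xa_is_value(page))
			pvec->pages[j++] = page;
	}
	pvec->nr = j;
}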

>
> That's now taken care of by truncate_inode_partial_page(), so if we're
> not calling that, we need to put the page as well. ie this:

Right, but I do find it confusing that truncate_inode_partial_page()
does the unlock_page(),put_page() whereas truncate_inode_page() does
not: I think you would do better to leave them out of _partial_page(),
even if including them there saves a couple of lines somewhere else.

But right now it's the right fix that's important: ack to yours below.

I've not yet worked out the other issues I saw: will report when I have.
Rebooted this laptop, pretty sure it missed freeing a shmem swap entry,
not yet reproduced, will study the change there later; but the non-swap
hang in generic/476 (later seen also in generic/112) is more important.

Hugh

>
> +++ b/mm/shmem.c
> @@ -954,6 +954,9 @@ static void shmem_undo_range(struct inode *inode, loff_t lstart, loff_t lend,
> page, lstart, lend);
> if (index > end)
> end = indices[i] - 1;
> + } else {
> + unlock_page(page);
> + put_page(page);
> }
> }
> index = indices[i - 1] + 1;
>
> > > }
> > > + index = indices[i - 1] + 1;
> > > pagevec_remove_exceptionals(&pvec);
> > > - pagevec_release(&pvec);
> > > - index++;
> > > + pagevec_reinit(&pvec);

2020-11-27 08:44:39

by Matthew Wilcox

[permalink] [raw]
Subject: Re: [PATCH v4 00/16] Overhaul multi-page lookups for THP

On Thu, Nov 26, 2020 at 11:24:59AM -0800, Hugh Dickins wrote:
> On Thu, 26 Nov 2020, Matthew Wilcox wrote:
> > On Wed, Nov 25, 2020 at 04:11:57PM -0800, Hugh Dickins wrote:
> > > > + index = truncate_inode_partial_page(mapping,
> > > > + page, lstart, lend);
> > > > + if (index > end)
> > > > + end = indices[i] - 1;
> > > > }
> > > > - unlock_page(page);
> > >
> > > The fix needed is here: instead of deleting that unlock_page(page)
> > > line, it needs to be } else { unlock_page(page); }
> >
> > It also needs a put_page(page);
>
> Oh yes indeed, sorry for getting that wrong. I'd misread the
> pagevec_reinit() at the end as the old pagevec_release(). Do you
> really need to do pagevec_remove_exceptionals() there if you're not
> using pagevec_release()?

Oh, good point!

> > That's now taken care of by truncate_inode_partial_page(), so if we're
> > not calling that, we need to put the page as well. ie this:
>
> Right, but I do find it confusing that truncate_inode_partial_page()
> does the unlock_page(),put_page() whereas truncate_inode_page() does
> not: I think you would do better to leave them out of _partial_page(),
> even if including them there saves a couple of lines somewhere else.

I agree; I don't love it. The only reason I moved them in there was
because after calling split_huge_page(), we have to call unlock_page();
put_page(); on the former-tail-page-that-is-now-partial, not on the
head page.

Hm, what I could do is return the struct page which is now the partial
page. Why do I never think of these things before posting? I'll see
how that works out.
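
Just to sketch the shape of that idea (hypothetical only, not the eventual
patch): the helper could keep taking a locked, referenced page but hand back
whichever page is now the partial one, whether that is the original head page
or the former tail page after a successful split, leaving the unlock/put to
the caller in every case, so that it matches truncate_inode_page():

struct page *truncate_inode_partial_page(struct address_space *mapping,
		struct page *page, loff_t start, loff_t end);

	/* caller, sketched */
	page = truncate_inode_partial_page(mapping, page, lstart, lend);
	index = page->index + thp_nr_pages(page);	/* index after this page */
	unlock_page(page);
	put_page(page);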

> But right now it's the right fix that's important: ack to yours below.
>
> I've not yet worked out the other issues I saw: will report when I have.
> Rebooted this laptop, pretty sure it missed freeing a shmem swap entry,
> not yet reproduced, will study the change there later, but the non-swap
> hang in generic/476 (later seen also in generic/112) more important.

The good news is that I've sorted out the SCRATCH_DEV issue with
running xfstests. The bad news is that (even on an unmodified kernel),
generic/027 takes 19 hours to run. On xfs, it's 4 minutes. Any idea
why it's so slow on tmpfs?

2020-11-30 19:49:02

by Hugh Dickins

[permalink] [raw]
Subject: Re: [PATCH v4 00/16] Overhaul multi-page lookups for THP

On Thu, 26 Nov 2020, Matthew Wilcox wrote:
> On Thu, Nov 26, 2020 at 11:24:59AM -0800, Hugh Dickins wrote:
>
> > But right now it's the right fix that's important: ack to yours below.
> >
> > I've not yet worked out the other issues I saw: will report when I have.
> > Rebooted this laptop, pretty sure it missed freeing a shmem swap entry,
> > not yet reproduced, will study the change there later, but the non-swap
> > hang in generic/476 (later seen also in generic/112) more important.

It's been a struggle, but I do now have a version of shmem_undo_range()
that works reliably, no known bugs, with no changes to your last version
outside of shmem_undo_range().

But my testing so far has been with the initial optimization loop (of
trylocks in find_lock_entries()) "#if 0"ed out, to give the final loop
a harder time. Now I'll bring back that initial loop (maybe cleaning up
some start/end variables) and retest - hoping not to regress as I do so.

I'll send it late today: I expect I'll just send the body of
shmem_undo_range() itself, rather than a diff, since it's too confusing
to know what to diff against. Or, maybe you now have your own improved
version, and will want me to look at yours rather than sending mine.

Andrew, if you're planning another mmotm soon, please remove/comment
mm-truncateshmem-handle-truncates-that-split-thps.patch
and any of its "fixes" as to-be-updated: all versions to date
have been too buggy to keep, and a new version will require its own
review, as you noted. I think that means you also have to remove
mm-filemap-return-only-head-pages-from-find_get_entries.patch
which I think is blameless, but may depend on it logically.

>
> The good news is that I've sorted out the SCRATCH_DEV issue with
> running xfstests. The bad news is that (even on an unmodified kernel),
> generic/027 takes 19 hours to run. On xfs, it's 4 minutes. Any idea
> why it's so slow on tmpfs?

I sent a tarball of four xfstests patches on Thursday; they're valid,
but generic/075 and generic/112, very useful for this testing, suffer
from an fsx build bug which might or might not affect you: so I'll now
reply to that mail with the latest tarball.

Hugh

2020-12-01 05:11:57

by Hugh Dickins

[permalink] [raw]
Subject: Re: [PATCH v4 00/16] Overhaul multi-page lookups for THP

On Mon, 30 Nov 2020, Hugh Dickins wrote:
> On Thu, 26 Nov 2020, Matthew Wilcox wrote:
> > On Thu, Nov 26, 2020 at 11:24:59AM -0800, Hugh Dickins wrote:
> >
> > > But right now it's the right fix that's important: ack to yours below.
> > >
> > > I've not yet worked out the other issues I saw: will report when I have.
> > > Rebooted this laptop, pretty sure it missed freeing a shmem swap entry,
> > > not yet reproduced, will study the change there later, but the non-swap
> > > hang in generic/476 (later seen also in generic/112) more important.
>
> It's been a struggle, but I do now have a version of shmem_undo_range()
> that works reliably, no known bugs, with no changes to your last version
> outside of shmem_undo_range().

Here's my version of shmem_undo_range(), working fine,
to replace the shmem_undo_range() resulting from your
"mm/truncate,shmem: handle truncates that split THPs".

Some comments on the differences:-

The initial shmem_getpage(,,SGP_READ) was problematic and
surprising: the SGP_READ means that it returns NULL in some cases,
which might not be what we want e.g. on a fallocated but not yet
initialized page, and on a page beyond i_size (with i_size being
forced to 0 earlier when evicting): so start was not incremented
when I expected it to be. And without a "partial" check, it also
risked reading in a page from swap unnecessarily (though not when
evicting, as I had feared, because of the i_size then being 0).

I think it was the unincremented start which was responsible for
shmem_evict_inode()'s WARN_ON(inode->i_blocks) (and subsequent
hang in swapoff) which I sometimes saw: swap entries were
truncated away without being properly accounted.

I found it easier to do it, like the end one, in the later main loop.
But of course there's an important difference between start and end
when splitting: I think you have relied on the tails-to-be-truncated
following on at the end; whereas (after many unsuccessful attempts to
preserve the old "But if truncating, restart to make sure all gone",
what I think of as "pincer" behaviour) I ended up just retrying even
in the hole-punch case, but not retrying indefinitely. That allows
retrying the split if there was a momentary reason why the first
attempt failed, good, but not getting stuck there forever. (It's
not obvious, but I think my old shmem_punch_compound() technique
handled failed split by incrementing up the tails one by one:
good to retry, but 510 times was too generous.)

The initial, trylock, pass then depends for correctness (when meeting
a huge page) on the behaviour you document for find_lock_entries():
"Pages which are partially outside the range are not returned".

There were rare cases in the later passes which did "break" from the
inner loop: if those entries were followed by others in the pagevec,
without a pagevec_release() at the end, they would have been leaked.
I ended up restoring the pagevec_release() (with its prior
pagevec_remove_exceptionals()), but hacking in a value entry where
truncate_inode_partial_page() had already done the unlock+put.
Yet (now we have retry pass even for holepunch) also changed those
breaks to continues, to deal with whatever follows in the pagevec.
I didn't really decide whether it's better to pagevec_release()
there or put one by one.

I never did quite absorb the "if (index > end) end = indices[i] - 1"
but it's gone away now. I misread the "index = indices[i - 1] + 1",
not seeing that i must be non-zero there: I keep a running next_index
instead (though that does beg the comment, pointing out that it's
only used at the end of the pagevec). I guess next_index is better
in the "unfalloc" case, where we want to skip pre-existing entries.

Somehow I deleted the old VM_BUG_ON_PAGE(PageWriteback(page), page):
it did not seem interesting at all, particularly with your
wait_on_page_writeback() in truncate_inode_partial_page().
And removed the old set_page_dirty() which went with the initial
shmem_getpage(), but was not replicated inside the main loop:
it makes no difference, the only way in which shmem pages can not
be dirty is if they are filled with zeroes, so writing zeroes into
them does not need dirtying (but I am surprised that other filesystems
don't want a set_page_dirty() in truncate_inode_partial_page() -
perhaps they all do it themselves).

Here it is:-
---
static void shmem_undo_range(struct inode *inode, loff_t lstart, loff_t lend,
bool unfalloc)
{
struct address_space *mapping = inode->i_mapping;
struct shmem_inode_info *info = SHMEM_I(inode);
pgoff_t start, end, last_full;
bool start_is_partial, end_is_partial;
struct pagevec pvec;
pgoff_t indices[PAGEVEC_SIZE];
struct page *page;
long nr_swaps_freed = 0;
pgoff_t index, next_index;
int i, pass = 0;

pagevec_init(&pvec);

/* start and end are indices of first and last page, perhaps partial */
start = lstart >> PAGE_SHIFT;
start_is_partial = !!(lstart & (PAGE_SIZE - 1));
end = lend >> PAGE_SHIFT;
end_is_partial = !!((lend + 1) & (PAGE_SIZE - 1));

if (start >= end) {
/* we especially want to avoid end 0 and last_full -1UL */
goto next_pass;
}

/*
* Quick and easy first pass, using trylocks, hollowing out the middle.
*/
index = start + start_is_partial;
last_full = end - end_is_partial;
while (index <= last_full &&
find_lock_entries(mapping, index, last_full, &pvec, indices)) {
for (i = 0; i < pagevec_count(&pvec); i++) {
page = pvec.pages[i];
index = indices[i];

if (xa_is_value(page)) {
if (unfalloc)
continue;
nr_swaps_freed += !shmem_free_swap(mapping,
index, page);
index++;
continue;
}
if (!unfalloc || !PageUptodate(page))
truncate_inode_page(mapping, page);
index += thp_nr_pages(page);
unlock_page(page);
}
pagevec_remove_exceptionals(&pvec);
pagevec_release(&pvec);
cond_resched();
}

next_pass:
index = start;
for ( ; ; ) {
cond_resched();

if (index > end ||
!find_get_entries(mapping, index, end, &pvec, indices)) {
/* If all gone or unfalloc, we're done */
if (index == start || unfalloc)
break;
/* Otherwise restart to make sure all gone */
if (++pass == 1) {
/* The previous pass zeroed partial pages */
if (start_is_partial) {
start++;
start_is_partial = false;
}
if (end_is_partial) {
if (end)
end--;
end_is_partial = false;
}
}
/* Repeat to remove residue, but not indefinitely */
if (pass == 3)
break;
index = start;
continue;
}

for (i = 0; i < pagevec_count(&pvec); i++) {
page = pvec.pages[i];
index = indices[i];

if (xa_is_value(page)) {
next_index = index + 1;
if (unfalloc)
continue;
if ((index == start && start_is_partial) ||
(index == end && end_is_partial)) {
/* Partial page swapped out? */
page = NULL;
shmem_getpage(inode, index, &page,
SGP_READ);
if (!page)
continue;
pvec.pages[i] = page;
} else {
if (shmem_free_swap(mapping, index,
page)) {
/* Swap replaced: retry */
next_index = index;
continue;
}
nr_swaps_freed++;
continue;
}
} else {
lock_page(page);
}

if (!unfalloc || !PageUptodate(page)) {
if (page_mapping(page) != mapping) {
/* Page was replaced by swap: retry */
unlock_page(page);
next_index = index;
continue;
}
page = thp_head(page);
next_index = truncate_inode_partial_page(
mapping, page, lstart, lend);
/* which already unlocked and put the page */
pvec.pages[i] = xa_mk_value(0);
} else {
next_index = index + thp_nr_pages(page);
unlock_page(page);
}
}

pagevec_remove_exceptionals(&pvec);
pagevec_release(&pvec);
/* next_index is effective only when refilling the pagevec */
index = next_index;
}

spin_lock_irq(&info->lock);
info->swapped -= nr_swaps_freed;
shmem_recalc_inode(inode);
spin_unlock_irq(&info->lock);
}

2020-12-03 15:50:00

by Marek Szyprowski

[permalink] [raw]
Subject: Re: [PATCH v4 00/16] Overhaul multi-page lookups for THP

Hi

On 25.11.2020 03:32, Matthew Wilcox wrote:
> On Tue, Nov 17, 2020 at 11:43:02PM +0000, Matthew Wilcox wrote:
>> On Tue, Nov 17, 2020 at 07:15:13PM +0000, Matthew Wilcox wrote:
>>> I find both of these functions exceptionally confusing. Does this
>>> make it easier to understand?
>> Never mind, this is buggy. I'll send something better tomorrow.
> That took a week, not a day. *sigh*. At least this is shorter.
>
> commit 1a02863ce04fd325922d6c3db6d01e18d55f966b
> Author: Matthew Wilcox (Oracle) <[email protected]>
> Date: Tue Nov 17 10:45:18 2020 -0500
>
> fix mm-truncateshmem-handle-truncates-that-split-thps.patch

This patch landed in today's linux-next (20201203) as commit 8678b27f4b8b
("8678b27f4b8bfc130a13eb9e9f27171bcd8c0b3b"). Sadly it breaks booting of
ANY of my ARM 32-bit test systems, which use initrd. ARM 64-bit based
systems boot fine. Here is an example of the crash:

Waiting 2 sec before mounting root device...
RAMDISK: squashfs filesystem found at block 0
RAMDISK: Loading 37861KiB [1 disk] into ram disk... done.
using deprecated initrd support, will be removed in 2021.
------------[ cut here ]------------
kernel BUG at fs/inode.c:531!
Internal error: Oops - BUG: 0 [#1] PREEMPT SMP ARM
Modules linked in:
CPU: 0 PID: 12 Comm: kworker/0:1 Not tainted 5.10.0-rc6-next-20201203 #2131
Hardware name: Samsung Exynos (Flattened Device Tree)
Workqueue: events delayed_fput
PC is at clear_inode+0x74/0x88
LR is at clear_inode+0x14/0x88
pc : [<c02fb334>]    lr : [<c02fb2d4>]    psr: 200001d3
sp : c1d2be68  ip : c1736ff4  fp : c1208f14
r10: c1208ec8  r9 : c20020c0  r8 : c209b0d8
r7 : c02f759c  r6 : c0c13940  r5 : c209b244  r4 : c209b0d8
r3 : 000024f9  r2 : 00000000  r1 : 00000000  r0 : c209b244
Flags: nzCv  IRQs off  FIQs off  Mode SVC_32  ISA ARM  Segment none
Control: 10c5387d  Table: 4000404a  DAC: 00000051
Process kworker/0:1 (pid: 12, stack limit = 0x(ptrval))
Stack: (0xc1d2be68 to 0xc1d2c000)
...
[<c02fb334>] (clear_inode) from [<c02fc8a0>] (evict+0x12c/0x13c)
[<c02fc8a0>] (evict) from [<c02f648c>] (__dentry_kill+0xb0/0x188)
[<c02f648c>] (__dentry_kill) from [<c02f7714>] (dput+0x2d8/0x67c)
[<c02f7714>] (dput) from [<c02dd300>] (__fput+0xd4/0x24c)
[<c02dd300>] (__fput) from [<c02dd4b4>] (delayed_fput+0x3c/0x48)
[<c02dd4b4>] (delayed_fput) from [<c0149660>] (process_one_work+0x234/0x7e4)
[<c0149660>] (process_one_work) from [<c0149c54>] (worker_thread+0x44/0x51c)
[<c0149c54>] (worker_thread) from [<c0150a88>] (kthread+0x158/0x1a0)
[<c0150a88>] (kthread) from [<c010011c>] (ret_from_fork+0x14/0x38)
Exception stack(0xc1d2bfb0 to 0xc1d2bff8)
...
---[ end trace b3c68905048e7f9b ]---
note: kworker/0:1[12] exited with preempt_count 1
BUG: sleeping function called from invalid context at ./include/linux/percpu-rwsem.h:49
in_atomic(): 0, irqs_disabled(): 128, non_block: 0, pid: 12, name: kworker/0:1
INFO: lockdep is turned off.
irq event stamp: 7498
hardirqs last  enabled at (7497): [<c02b7fcc>] free_unref_page+0x80/0x88
hardirqs last disabled at (7498): [<c0b40b18>] _raw_spin_lock_irq+0x24/0x5c
softirqs last  enabled at (6234): [<c0966af4>] linkwatch_do_dev+0x20/0x80
softirqs last disabled at (6232): [<c0966a60>] rfc2863_policy+0x30/0xa4
CPU: 0 PID: 12 Comm: kworker/0:1 Tainted: G      D 5.10.0-rc6-next-20201203 #2131
Hardware name: Samsung Exynos (Flattened Device Tree)
Workqueue: events delayed_fput
[<c0111718>] (unwind_backtrace) from [<c010d050>] (show_stack+0x10/0x14)
[<c010d050>] (show_stack) from [<c0b34310>] (dump_stack+0xb4/0xd4)
[<c0b34310>] (dump_stack) from [<c015a9d4>] (___might_sleep+0x288/0x2d8)
[<c015a9d4>] (___might_sleep) from [<c013c744>] (exit_signals+0x38/0x428)
[<c013c744>] (exit_signals) from [<c012ce18>] (do_exit+0xe4/0xc88)
[<c012ce18>] (do_exit) from [<c010d28c>] (die+0x238/0x30c)
[<c010d28c>] (die) from [<c010d560>] (do_undefinstr+0xbc/0x26c)
[<c010d560>] (do_undefinstr) from [<c0100c1c>] (__und_svc_finish+0x0/0x44)
Exception stack(0xc1d2be18 to 0xc1d2be60)
VFS: Mounted root (squashfs filesystem) readonly on device 1:0.
be00: c209b244 00000000
be20: 00000000 000024f9 c209b0d8 c209b244 c0c13940 c02f759c c209b0d8 c20020c0
be40: c1208ec8 c1208f14 c1736ff4 c1d2be68 c02fb2d4 c02fb334 200001d3 ffffffff
[<c0100c1c>] (__und_svc_finish) from [<c02fb334>] (clear_inode+0x74/0x88)
[<c02fb334>] (clear_inode) from [<c02fc8a0>] (evict+0x12c/0x13c)
[<c02fc8a0>] (evict) from [<c02f648c>] (__dentry_kill+0xb0/0x188)
[<c02f648c>] (__dentry_kill) from [<c02f7714>] (dput+0x2d8/0x67c)
[<c02f7714>] (dput) from [<c02dd300>] (__fput+0xd4/0x24c)
[<c02dd300>] (__fput) from [<c02dd4b4>] (delayed_fput+0x3c/0x48)
[<c02dd4b4>] (delayed_fput) from [<c0149660>] (process_one_work+0x234/0x7e4)
[<c0149660>] (process_one_work) from [<c0149c54>] (worker_thread+0x44/0x51c)
[<c0149c54>] (worker_thread) from [<c0150a88>] (kthread+0x158/0x1a0)
[<c0150a88>] (kthread) from [<c010011c>] (ret_from_fork+0x14/0x38)
Exception stack(0xc1d2bfb0 to 0xc1d2bff8)
bfa0:                                     00000000 00000000 00000000 00000000
bfc0: 00000000 00000000 00000000 00000000 00000000 00000000 00000000 00000000
bfe0: 00000000 00000000 00000000 00000000 00000013 00000000
EXT4-fs (mmcblk0p6): INFO: recovery required on readonly filesystem
EXT4-fs (mmcblk0p6): write access will be enabled during recovery
EXT4-fs (mmcblk0p6): recovery complete
EXT4-fs (mmcblk0p6): mounted filesystem with ordered data mode. Opts: (null)
VFS: Mounted root (ext4 filesystem) readonly on device 179:6.
Trying to move old root to /initrd ...

I suppose this issue can be also reproduced with qemu.

Best regards

--
Marek Szyprowski, PhD
Samsung R&D Institute Poland

2020-12-03 17:33:18

by Marek Szyprowski

[permalink] [raw]
Subject: Re: [PATCH v4 00/16] Overhaul multi-page lookups for THP

Hi

On 03.12.2020 16:46, Marek Szyprowski wrote:
> On 25.11.2020 03:32, Matthew Wilcox wrote:
>> On Tue, Nov 17, 2020 at 11:43:02PM +0000, Matthew Wilcox wrote:
>>> On Tue, Nov 17, 2020 at 07:15:13PM +0000, Matthew Wilcox wrote:
>>>> I find both of these functions exceptionally confusing.  Does this
>>>> make it easier to understand?
>>> Never mind, this is buggy.  I'll send something better tomorrow.
>> That took a week, not a day.  *sigh*.  At least this is shorter.
>>
>> commit 1a02863ce04fd325922d6c3db6d01e18d55f966b
>> Author: Matthew Wilcox (Oracle) <[email protected]>
>> Date:   Tue Nov 17 10:45:18 2020 -0500
>>
>>      fix mm-truncateshmem-handle-truncates-that-split-thps.patch
>
> This patch landed in today's linux-next (20201203) as commit
> 8678b27f4b8b ("8678b27f4b8bfc130a13eb9e9f27171bcd8c0b3b"). Sadly it
> breaks booting of ANY of my ARM 32-bit test systems, which use initrd.
> ARM 64-bit based systems boot fine. Here is an example of the crash:

One more thing. Reverting those two:

1b1aa968b0b6 mm-truncateshmem-handle-truncates-that-split-thps-fix-fix

8678b27f4b8b mm-truncateshmem-handle-truncates-that-split-thps-fix

on top of linux next-20201203 fixes the boot issues.

Best regards
--
Marek Szyprowski, PhD
Samsung R&D Institute Poland

2020-12-03 21:31:30

by Qian Cai

[permalink] [raw]
Subject: Re: [PATCH v4 00/16] Overhaul multi-page lookups for THP

On Thu, 2020-12-03 at 18:27 +0100, Marek Szyprowski wrote:
> Hi
>
> On 03.12.2020 16:46, Marek Szyprowski wrote:
> > On 25.11.2020 03:32, Matthew Wilcox wrote:
> > > On Tue, Nov 17, 2020 at 11:43:02PM +0000, Matthew Wilcox wrote:
> > > > On Tue, Nov 17, 2020 at 07:15:13PM +0000, Matthew Wilcox wrote:
> > > > > I find both of these functions exceptionally confusing. Does this
> > > > > make it easier to understand?
> > > > Never mind, this is buggy. I'll send something better tomorrow.
> > > That took a week, not a day. *sigh*. At least this is shorter.
> > >
> > > commit 1a02863ce04fd325922d6c3db6d01e18d55f966b
> > > Author: Matthew Wilcox (Oracle) <[email protected]>
> > > Date: Tue Nov 17 10:45:18 2020 -0500
> > >
> > > fix mm-truncateshmem-handle-truncates-that-split-thps.patch
> >
> > This patch landed in today's linux-next (20201203) as commit
> > 8678b27f4b8b ("8678b27f4b8bfc130a13eb9e9f27171bcd8c0b3b"). Sadly it
> > breaks booting of ANY of my ARM 32-bit test systems, which use initrd.
> > ARM 64-bit based systems boot fine. Here is an example of the crash:
>
> One more thing. Reverting those two:
>
> 1b1aa968b0b6 mm-truncateshmem-handle-truncates-that-split-thps-fix-fix
>
> 8678b27f4b8b mm-truncateshmem-handle-truncates-that-split-thps-fix
>
> on top of linux next-20201203 fixes the boot issues.

We have to revert those two patches as well, to fix an issue where one process
keeps running at 100% CPU in find_get_entries() and all other threads block on
the i_mutex almost forever.

[ 380.735099] INFO: task trinity-c58:2143 can't die for more than 125 seconds.
[ 380.742923] task:trinity-c58 state:R running task stack:26056 pid: 2143 ppid: 1914 flags:0x00004006
[ 380.753640] Call Trace:
[ 380.756811] ? find_get_entries+0x339/0x790
find_get_entry at mm/filemap.c:1848
(inlined by) find_get_entries at mm/filemap.c:1904
[ 380.761723] ? __lock_page_or_retry+0x3f0/0x3f0
[ 380.767009] ? shmem_undo_range+0x3bf/0xb60
[ 380.771944] ? unmap_mapping_pages+0x96/0x230
[ 380.777036] ? find_held_lock+0x33/0x1c0
[ 380.781688] ? shmem_write_begin+0x1b0/0x1b0
[ 380.786703] ? unmap_mapping_pages+0xc2/0x230
[ 380.791796] ? down_write+0xe0/0x150
[ 380.796114] ? do_wp_page+0xc60/0xc60
[ 380.800507] ? shmem_truncate_range+0x14/0x80
[ 380.805618] ? shmem_setattr+0x827/0xc70
[ 380.810274] ? notify_change+0x6cf/0xc30
[ 380.814941] ? do_truncate+0xe2/0x180
[ 380.819335] ? do_truncate+0xe2/0x180
[ 380.823741] ? do_sys_openat2+0x5c0/0x5c0
[ 380.828484] ? do_sys_ftruncate+0x2e2/0x4e0
[ 380.833417] ? trace_hardirqs_on+0x1c/0x150
[ 380.838335] ? do_syscall_64+0x33/0x40
[ 380.842828] ? entry_SYSCALL_64_after_hwframe+0x44/0xa9
[ 380.848870]

2020-12-03 21:50:05

by Hugh Dickins

[permalink] [raw]
Subject: Re: [PATCH v4 00/16] Overhaul multi-page lookups for THP

On Thu, 3 Dec 2020, Marek Szyprowski wrote:
> On 03.12.2020 16:46, Marek Szyprowski wrote:
> > On 25.11.2020 03:32, Matthew Wilcox wrote:
> >> On Tue, Nov 17, 2020 at 11:43:02PM +0000, Matthew Wilcox wrote:
> >>> On Tue, Nov 17, 2020 at 07:15:13PM +0000, Matthew Wilcox wrote:
> >>>> I find both of these functions exceptionally confusing.  Does this
> >>>> make it easier to understand?
> >>> Never mind, this is buggy.  I'll send something better tomorrow.
> >> That took a week, not a day.  *sigh*.  At least this is shorter.
> >>
> >> commit 1a02863ce04fd325922d6c3db6d01e18d55f966b
> >> Author: Matthew Wilcox (Oracle) <[email protected]>
> >> Date:   Tue Nov 17 10:45:18 2020 -0500
> >>
> >>      fix mm-truncateshmem-handle-truncates-that-split-thps.patch
> >
> > This patch landed in today's linux-next (20201203) as commit
> > 8678b27f4b8b ("8678b27f4b8bfc130a13eb9e9f27171bcd8c0b3b"). Sadly it
> > breaks booting of ANY of my ARM 32-bit test systems, which use initrd.
> > ARM 64-bit based systems boot fine. Here is an example of the crash:

kernel BUG at fs/inode.c:531
evict() hitting clear_inode()'s BUG_ON(inode->i_data.nr_pages)
Same here on i386 on mmotm (slightly different line number on mmotm).

>
> One more thing. Reverting those two:
>
> 1b1aa968b0b6 mm-truncateshmem-handle-truncates-that-split-thps-fix-fix
>
> 8678b27f4b8b mm-truncateshmem-handle-truncates-that-split-thps-fix
>
> on top of linux next-20201203 fixes the boot issues.

Thanks a lot for the report, Marek. Yes, reverting those two
(of which "-fix" amounts to a rewrite, and "-fix-fix" is far from
complete) takes the linux-next tree back to how truncate was before it
took in yesterday's mmotm: not crashing on 32-bit, but still not good.

The 32-bit breakage may turn out to be a simple one-liner like a
missing cast, or overflow from 0 to -1, somewhere in the rewritten
truncate_inode_pages_range(); but it did not stand out to me, and
it does not immediately matter, since other fixes are needed to
that patch. I'm afraid it's proving to be work in progress.
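
For what it's worth, one candidate for that kind of one-liner, read purely
from the "end = lend / PAGE_SIZE" hunk quoted earlier and not a confirmed
diagnosis: PAGE_SIZE is an unsigned long, so the whole-file case of
lend == (loff_t)-1 divides differently depending on the word size. A tiny
userspace illustration of the arithmetic, nothing kernel-specific:

#include <stdio.h>

int main(void)
{
	long long lend = -1;			/* stands in for loff_t */
	unsigned long page_size = 1UL << 12;	/* PAGE_SIZE is (1UL << PAGE_SHIFT) */

	/*
	 * On a 64-bit build, unsigned long is as wide as long long, so lend
	 * converts to unsigned before the division and the quotient is
	 * enormous: effectively the old "end = -1" special case.  On a
	 * 32-bit build, page_size promotes to long long instead, the
	 * division is signed, and -1 / 4096 is 0, so 'end' would be 0 and
	 * nothing past index 0 would be truncated, which is at least
	 * consistent with clear_inode()'s BUG_ON firing only on 32-bit.
	 */
	printf("end = %llu\n", (unsigned long long)(lend / page_size));
	return 0;
}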

I did ask Andrew to revert these earlier in the thread, but it looks
like that got lost in the jungle of his inbox: I'll send a better
targetted mail, but what we need is to revert these *four* patches,
until we have a better tested and stable version.

mm-truncateshmem-handle-truncates-that-split-thps.patch
mm-truncateshmem-handle-truncates-that-split-thps-fix.patch
mm-truncateshmem-handle-truncates-that-split-thps-fix-fix.patch
mm-filemap-return-only-head-pages-from-find_get_entries.patch

Hugh

2020-12-03 22:24:49

by Hugh Dickins

[permalink] [raw]
Subject: Re: [PATCH v4 00/16] Overhaul multi-page lookups for THP

On Thu, 3 Dec 2020, Qian Cai wrote:
> On Thu, 2020-12-03 at 18:27 +0100, Marek Szyprowski wrote:
> > On 03.12.2020 16:46, Marek Szyprowski wrote:
> > > On 25.11.2020 03:32, Matthew Wilcox wrote:
> > > > On Tue, Nov 17, 2020 at 11:43:02PM +0000, Matthew Wilcox wrote:
> > > > > On Tue, Nov 17, 2020 at 07:15:13PM +0000, Matthew Wilcox wrote:
> > > > > > I find both of these functions exceptionally confusing. Does this
> > > > > > make it easier to understand?
> > > > > Never mind, this is buggy. I'll send something better tomorrow.
> > > > That took a week, not a day. *sigh*. At least this is shorter.
> > > >
> > > > commit 1a02863ce04fd325922d6c3db6d01e18d55f966b
> > > > Author: Matthew Wilcox (Oracle) <[email protected]>
> > > > Date: Tue Nov 17 10:45:18 2020 -0500
> > > >
> > > > fix mm-truncateshmem-handle-truncates-that-split-thps.patch
> > >
> > > This patch landed in today's linux-next (20201203) as commit
> > > 8678b27f4b8b ("8678b27f4b8bfc130a13eb9e9f27171bcd8c0b3b"). Sadly it
> > > breaks booting of ANY of my ARM 32-bit test systems, which use initrd.
> > > ARM 64-bit based systems boot fine. Here is an example of the crash:
> >
> > One more thing. Reverting those two:
> >
> > 1b1aa968b0b6 mm-truncateshmem-handle-truncates-that-split-thps-fix-fix
> >
> > 8678b27f4b8b mm-truncateshmem-handle-truncates-that-split-thps-fix
> >
> > on top of linux next-20201203 fixes the boot issues.
>
> We have to revert those two patches as well, to fix an issue where one process
> keeps running at 100% CPU in find_get_entries() and all other threads block on
> the i_mutex almost forever.
>
> [ 380.735099] INFO: task trinity-c58:2143 can't die for more than 125 seconds.
> [ 380.742923] task:trinity-c58 state:R running task stack:26056 pid: 2143 ppid: 1914 flags:0x00004006
> [ 380.753640] Call Trace:
> [ 380.756811] ? find_get_entries+0x339/0x790
> find_get_entry at mm/filemap.c:1848
> (inlined by) find_get_entries at mm/filemap.c:1904
> [ 380.761723] ? __lock_page_or_retry+0x3f0/0x3f0
> [ 380.767009] ? shmem_undo_range+0x3bf/0xb60
> [ 380.771944] ? unmap_mapping_pages+0x96/0x230
> [ 380.777036] ? find_held_lock+0x33/0x1c0
> [ 380.781688] ? shmem_write_begin+0x1b0/0x1b0
> [ 380.786703] ? unmap_mapping_pages+0xc2/0x230
> [ 380.791796] ? down_write+0xe0/0x150
> [ 380.796114] ? do_wp_page+0xc60/0xc60
> [ 380.800507] ? shmem_truncate_range+0x14/0x80
> [ 380.805618] ? shmem_setattr+0x827/0xc70
> [ 380.810274] ? notify_change+0x6cf/0xc30
> [ 380.814941] ? do_truncate+0xe2/0x180
> [ 380.819335] ? do_truncate+0xe2/0x180
> [ 380.823741] ? do_sys_openat2+0x5c0/0x5c0
> [ 380.828484] ? do_sys_ftruncate+0x2e2/0x4e0
> [ 380.833417] ? trace_hardirqs_on+0x1c/0x150
> [ 380.838335] ? do_syscall_64+0x33/0x40
> [ 380.842828] ? entry_SYSCALL_64_after_hwframe+0x44/0xa9

Thanks for trinitizing. If you have time, please would you try
replacing the shmem_undo_range() in mm/shmem.c by the version I gave in

https://lore.kernel.org/linux-mm/[email protected]/T/#mc15d60a2166f80fe284a18d4758eb4c04cc3255d

That will not help at all with the 32-bit booting issue,
but it does have a good chance of placating trinity.

Thanks,
Hugh

2021-02-24 01:06:55

by Matthew Wilcox

[permalink] [raw]
Subject: Re: [PATCH v4 00/16] Overhaul multi-page lookups for THP

On Tue, Feb 23, 2021 at 02:58:36PM -0800, Andrew Morton wrote:
> Do you feel this patchset is ready to merge up?

I think so.

> https://lore.kernel.org/linux-mm/[email protected]/
> was addressed by 0060ef3b4e6dd ("mm: support THPs in
> zero_user_segments"), yes?

Correct.

> "mm/truncate,shmem: Handle truncates that split THPs" and "mm/filemap:
> Return only head pages from find_get_entries" were dropped. I guess
> you'll be having another go at those sometime?

That's right.

> https://lore.kernel.org/linux-fsdevel/[email protected]/
> wasn't responded to?

It didn't really seem worth replying to ... I'm not particularly
sympathetic to "please refactor the patch series" when there's no
compelling reason, and having a large function inlined really isn't a
problem when it only has one caller.

2021-02-24 05:25:43

by Andrew Morton

[permalink] [raw]
Subject: Re: [PATCH v4 00/16] Overhaul multi-page lookups for THP

Do you feel this patchset is ready to merge up?

https://lore.kernel.org/linux-mm/[email protected]/
was addressed by 0060ef3b4e6dd ("mm: support THPs in
zero_user_segments"), yes?

"mm/truncate,shmem: Handle truncates that split THPs" and "mm/filemap:
Return only head pages from find_get_entries" were dropped. I guess
you'll be having another go at those sometime?

https://lore.kernel.org/linux-fsdevel/[email protected]/
wasn't responded to?