2024-02-09 14:29:54

by Daniel Gomez

Subject: [RFC PATCH 0/9] shmem: fix llseek in hugepages

Hi,

The following series fixes the generic/285 and generic/436 fstests for huge
pages (huge=always). These are tests for llseek (SEEK_HOLE and SEEK_DATA).
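
To make the failure mode concrete, here is a minimal sketch of the kind of
probe these tests perform (a simplified illustration, not the actual
seek_sanity_test code; the mount point is only an example and error handling
is omitted). Today, with huge=always, a small write marks the whole 2M folio
uptodate, so SEEK_HOLE lands on a folio boundary; with per-block tracking it
can land right after the written block:

    #define _GNU_SOURCE
    #include <fcntl.h>
    #include <stdio.h>
    #include <string.h>
    #include <unistd.h>

    int main(void)
    {
            char buf[4096];
            /* assumes a tmpfs mounted with huge=always at /mnt/tmpfs */
            int fd = open("/mnt/tmpfs/testfile", O_RDWR | O_CREAT | O_TRUNC, 0644);

            memset(buf, 0xaa, sizeof(buf));
            ftruncate(fd, 16 << 20);                /* 16M sparse file */
            pwrite(fd, buf, sizeof(buf), 0);        /* data only in the first 4k */

            /* data offset: expected 0 in both cases */
            printf("data: %lld\n", (long long)lseek(fd, 0, SEEK_DATA));
            /* hole offset: 2M today (whole huge page marked uptodate),
             * 4k (right after the written block) with per-block tracking */
            printf("hole: %lld\n", (long long)lseek(fd, 0, SEEK_HOLE));

            close(fd);
            return 0;
    }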

The implementation that fixes the above tests is based on iomap's per-block
tracking of the uptodate and dirty states, but applied only to the shmem
uptodate flag.

The motivation is to avoid any regressions in tmpfs once it gets support for
large folios.

Testing with kdevops
Testing has been performed using fstests with kdevops for the v6.8-rc2 tag.
There are currently different profiles supported [1], and for each of them a
baseline of 20 loops has been run. The hugepage profiles show the following
failures: generic/080, generic/126, generic/193, generic/245, generic/285,
generic/436, generic/551, generic/619 and generic/732.

If anyone is interested, all of the failures can be found in the expunges directory:
https://github.com/linux-kdevops/kdevops/tree/master/workflows/fstests/expunges/6.8.0-rc2/tmpfs/unassigned

[1] tmpfs profiles supported in kdevops: default, tmpfs_noswap_huge_never,
tmpfs_noswap_huge_always, tmpfs_noswap_huge_within_size,
tmpfs_noswap_huge_advise, tmpfs_huge_always, tmpfs_huge_within_size and
tmpfs_huge_advise.

More information:
https://github.com/linux-kdevops/kdevops/tree/master/workflows/fstests/expunges/6.8.0-rc2/tmpfs/unassigned

All the patches have been tested on top of v6.8-rc2 and rebased onto the latest
next tag available (next-20240209).

Daniel

Daniel Gomez (8):
shmem: add per-block uptodate tracking for hugepages
shmem: move folio zero operation to write_begin()
shmem: exit shmem_get_folio_gfp() if block is uptodate
shmem: clear_highpage() if block is not uptodate
shmem: set folio uptodate when reclaim
shmem: check if a block is uptodate before splice into pipe
shmem: clear uptodate blocks after PUNCH_HOLE
shmem: enable per-block uptodate

Pankaj Raghav (1):
splice: don't check for uptodate if partially uptodate is impl

fs/splice.c | 17 ++-
mm/shmem.c | 340 ++++++++++++++++++++++++++++++++++++++++++++++++----
2 files changed, 332 insertions(+), 25 deletions(-)

--
2.43.0


2024-02-09 14:30:13

by Daniel Gomez

Subject: [RFC PATCH 2/9] shmem: add per-block uptodate tracking for hugepages

Based on iomap's per-block dirty and uptodate state tracking, add a
shmem_folio_state struct to track the uptodate state per block for
large folios (hugepages).

Signed-off-by: Daniel Gomez <[email protected]>
---
mm/shmem.c | 195 +++++++++++++++++++++++++++++++++++++++++++++++++++--
1 file changed, 189 insertions(+), 6 deletions(-)

diff --git a/mm/shmem.c b/mm/shmem.c
index d7c84ff62186..5980d7b94f65 100644
--- a/mm/shmem.c
+++ b/mm/shmem.c
@@ -131,6 +131,124 @@ struct shmem_options {
#define SHMEM_SEEN_QUOTA 32
};

+/*
+ * Structure allocated for each folio to track per-block uptodate state.
+ *
+ * Like the buffered-io iomap_folio_state struct, but only for uptodate.
+ */
+struct shmem_folio_state {
+ spinlock_t state_lock;
+ unsigned long state[];
+};
+
+static inline bool sfs_is_fully_uptodate(struct folio *folio)
+{
+ struct inode *inode = folio->mapping->host;
+ struct shmem_folio_state *sfs = folio->private;
+
+ return bitmap_full(sfs->state, i_blocks_per_folio(inode, folio));
+}
+
+static inline bool sfs_is_block_uptodate(struct shmem_folio_state *sfs,
+ unsigned int block)
+{
+ return test_bit(block, sfs->state);
+}
+
+/**
+ * sfs_get_last_block_uptodate - find the index of the last uptodate block
+ * within a specified range
+ * @folio: The folio
+ * @first: The starting block of the range to search
+ * @last: The ending block of the range to search
+ *
+ * Returns the index of the last uptodate block within the specified range. If
+ * a non-uptodate block is found at the start, it returns UINT_MAX.
+ */
+static unsigned int sfs_get_last_block_uptodate(struct folio *folio,
+ unsigned int first,
+ unsigned int last)
+{
+ struct inode *inode = folio->mapping->host;
+ struct shmem_folio_state *sfs = folio->private;
+ unsigned int nr_blocks = i_blocks_per_folio(inode, folio);
+ unsigned int aux = find_next_zero_bit(sfs->state, nr_blocks, first);
+
+ /*
+ * Return UINT_MAX (beyond any possible last block) if a non-uptodate
+ * block is found at the beginning of the scan.
+ */
+ if (aux == first)
+ return UINT_MAX;
+
+ return min_t(unsigned int, aux - 1, last);
+}
+
+static void sfs_set_range_uptodate(struct folio *folio,
+ struct shmem_folio_state *sfs, size_t off,
+ size_t len)
+{
+ struct inode *inode = folio->mapping->host;
+ unsigned int first_blk = off >> inode->i_blkbits;
+ unsigned int last_blk = (off + len - 1) >> inode->i_blkbits;
+ unsigned int nr_blks = last_blk - first_blk + 1;
+ unsigned long flags;
+
+ spin_lock_irqsave(&sfs->state_lock, flags);
+ bitmap_set(sfs->state, first_blk, nr_blks);
+ if (sfs_is_fully_uptodate(folio))
+ folio_mark_uptodate(folio);
+ spin_unlock_irqrestore(&sfs->state_lock, flags);
+}
+
+static struct shmem_folio_state *sfs_alloc(struct inode *inode,
+ struct folio *folio)
+{
+ struct shmem_folio_state *sfs = folio->private;
+ unsigned int nr_blocks = i_blocks_per_folio(inode, folio);
+ gfp_t gfp = GFP_KERNEL;
+
+ if (sfs || nr_blocks <= 1)
+ return sfs;
+
+ /*
+ * sfs->state tracks uptodate flag when the block size is smaller
+ * than the folio size.
+ */
+ sfs = kzalloc(struct_size(sfs, state, BITS_TO_LONGS(nr_blocks)), gfp);
+ if (!sfs)
+ return sfs;
+
+ spin_lock_init(&sfs->state_lock);
+ if (folio_test_uptodate(folio))
+ bitmap_set(sfs->state, 0, nr_blocks);
+ folio_attach_private(folio, sfs);
+
+ return sfs;
+}
+
+static void sfs_free(struct folio *folio, bool force)
+{
+ if (!folio_test_private(folio))
+ return;
+
+ if (!force)
+ WARN_ON_ONCE(sfs_is_fully_uptodate(folio) !=
+ folio_test_uptodate(folio));
+
+ kfree(folio_detach_private(folio));
+}
+
+static void shmem_set_range_uptodate(struct folio *folio, size_t off,
+ size_t len)
+{
+ struct shmem_folio_state *sfs = folio->private;
+
+ if (sfs)
+ sfs_set_range_uptodate(folio, sfs, off, len);
+ else
+ folio_mark_uptodate(folio);
+}
#ifdef CONFIG_TMPFS
static unsigned long shmem_default_max_blocks(void)
{
@@ -1487,7 +1605,7 @@ static int shmem_writepage(struct page *page, struct writeback_control *wbc)
}
folio_zero_range(folio, 0, folio_size(folio));
flush_dcache_folio(folio);
- folio_mark_uptodate(folio);
+ shmem_set_range_uptodate(folio, 0, folio_size(folio));
}

swap = folio_alloc_swap(folio);
@@ -1769,13 +1887,16 @@ static int shmem_replace_folio(struct folio **foliop, gfp_t gfp,
if (!new)
return -ENOMEM;

+ if (folio_get_private(old))
+ folio_attach_private(new, folio_detach_private(old));
+
folio_get(new);
folio_copy(new, old);
flush_dcache_folio(new);

__folio_set_locked(new);
__folio_set_swapbacked(new);
- folio_mark_uptodate(new);
+ shmem_set_range_uptodate(new, 0, folio_size(new));
new->swap = entry;
folio_set_swapcache(new);

@@ -2060,6 +2181,12 @@ static int shmem_get_folio_gfp(struct inode *inode, pgoff_t index,

alloced:
alloced = true;
+
+ if (!sfs_alloc(inode, folio) && folio_test_large(folio)) {
+ error = -ENOMEM;
+ goto unlock;
+ }
+
if (folio_test_pmd_mappable(folio) &&
DIV_ROUND_UP(i_size_read(inode), PAGE_SIZE) <
folio_next_index(folio) - 1) {
@@ -2101,7 +2228,7 @@ static int shmem_get_folio_gfp(struct inode *inode, pgoff_t index,
for (i = 0; i < n; i++)
clear_highpage(folio_page(folio, i));
flush_dcache_folio(folio);
- folio_mark_uptodate(folio);
+ shmem_set_range_uptodate(folio, 0, folio_size(folio));
}

/* Perhaps the file has been truncated since we checked */
@@ -2746,8 +2873,8 @@ shmem_write_end(struct file *file, struct address_space *mapping,
folio_zero_segments(folio, 0, from,
from + copied, folio_size(folio));
}
- folio_mark_uptodate(folio);
}
+ shmem_set_range_uptodate(folio, 0, folio_size(folio));
folio_mark_dirty(folio);
folio_unlock(folio);
folio_put(folio);
@@ -2755,6 +2882,59 @@ shmem_write_end(struct file *file, struct address_space *mapping,
return copied;
}

+static void shmem_invalidate_folio(struct folio *folio, size_t offset,
+ size_t len)
+{
+ /*
+ * If we're invalidating the entire folio, clear the dirty state
+ * from it and release it to avoid unnecessary buildup of the LRU.
+ */
+ if (offset == 0 && len == folio_size(folio)) {
+ WARN_ON_ONCE(folio_test_writeback(folio));
+ folio_cancel_dirty(folio);
+ sfs_free(folio, true);
+ }
+}
+
+static bool shmem_release_folio(struct folio *folio, gfp_t gfp_flags)
+{
+ if (folio_test_dirty(folio) && !sfs_is_fully_uptodate(folio))
+ return false;
+
+ sfs_free(folio, false);
+ return true;
+}
+
+/*
+ * shmem_is_partially_uptodate checks whether blocks within a folio are
+ * uptodate or not.
+ *
+ * Returns true if all blocks which correspond to the specified part
+ * of the folio are uptodate.
+ */
+static bool shmem_is_partially_uptodate(struct folio *folio, size_t from,
+ size_t count)
+{
+ struct shmem_folio_state *sfs = folio->private;
+ struct inode *inode = folio->mapping->host;
+ unsigned int first, last;
+
+ if (!sfs)
+ return false;
+
+ /* Caller's range may extend past the end of this folio */
+ count = min(folio_size(folio) - from, count);
+
+ /* First and last blocks in range within folio */
+ first = from >> inode->i_blkbits;
+ last = (from + count - 1) >> inode->i_blkbits;
+
+ if (sfs_get_last_block_uptodate(folio, first, last) != last)
+ return false;
+
+ return true;
+}
+
static ssize_t shmem_file_read_iter(struct kiocb *iocb, struct iov_iter *to)
{
struct file *file = iocb->ki_filp;
@@ -3506,7 +3686,7 @@ static int shmem_symlink(struct mnt_idmap *idmap, struct inode *dir,
inode->i_mapping->a_ops = &shmem_aops;
inode->i_op = &shmem_symlink_inode_operations;
memcpy(folio_address(folio), symname, len);
- folio_mark_uptodate(folio);
+ shmem_set_range_uptodate(folio, 0, folio_size(folio));
folio_mark_dirty(folio);
folio_unlock(folio);
folio_put(folio);
@@ -4476,7 +4656,10 @@ const struct address_space_operations shmem_aops = {
#ifdef CONFIG_MIGRATION
.migrate_folio = migrate_folio,
#endif
- .error_remove_folio = shmem_error_remove_folio,
+ .error_remove_folio = shmem_error_remove_folio,
+ .invalidate_folio = shmem_invalidate_folio,
+ .release_folio = shmem_release_folio,
+ .is_partially_uptodate = shmem_is_partially_uptodate,
};
EXPORT_SYMBOL(shmem_aops);

--
2.43.0

2024-02-09 14:31:43

by Daniel Gomez

Subject: [RFC PATCH 5/9] shmem: clear_highpage() if block is not uptodate

clear_highpage() is currently called for all the subpages (blocks) in a
large folio when the folio is not uptodate. Instead, clear only the
subpages that are not uptodate.

Signed-off-by: Daniel Gomez <[email protected]>
---
mm/shmem.c | 3 ++-
1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/mm/shmem.c b/mm/shmem.c
index 614cda767298..b6f9a60b179b 100644
--- a/mm/shmem.c
+++ b/mm/shmem.c
@@ -2253,7 +2253,8 @@ static int shmem_get_folio_gfp(struct inode *inode, pgoff_t index,
long i, n = folio_nr_pages(folio);

for (i = 0; i < n; i++)
- clear_highpage(folio_page(folio, i));
+ if (!shmem_is_block_uptodate(folio, i))
+ clear_highpage(folio_page(folio, i));
flush_dcache_folio(folio);
shmem_set_range_uptodate(folio, 0, folio_size(folio));
}
--
2.43.0

2024-02-09 14:31:53

by Daniel Gomez

Subject: [RFC PATCH 8/9] shmem: clear uptodate blocks after PUNCH_HOLE

In the fallocate path with PUNCH_HOLE mode flag enabled, clear the
uptodate flag for those blocks covered by the punch. Skip all partial
blocks as they may still contain data.

Signed-off-by: Daniel Gomez <[email protected]>
---
mm/shmem.c | 78 +++++++++++++++++++++++++++++++++++++++++++++++++-----
1 file changed, 72 insertions(+), 6 deletions(-)

diff --git a/mm/shmem.c b/mm/shmem.c
index 2d2eeb40f19b..2157a87b2e4b 100644
--- a/mm/shmem.c
+++ b/mm/shmem.c
@@ -209,6 +209,28 @@ static void sfs_set_range_uptodate(struct folio *folio,
spin_unlock_irqrestore(&sfs->state_lock, flags);
}

+static void sfs_clear_range_uptodate(struct folio *folio,
+ struct shmem_folio_state *sfs, size_t off,
+ size_t len)
+{
+ struct inode *inode = folio->mapping->host;
+ unsigned int first_blk, last_blk;
+ unsigned long flags;
+
+ first_blk = DIV_ROUND_UP_ULL(off, 1 << inode->i_blkbits);
+ last_blk = DIV_ROUND_DOWN_ULL(off + len, 1 << inode->i_blkbits) - 1;
+ if (last_blk == UINT_MAX)
+ return;
+
+ if (first_blk > last_blk)
+ return;
+
+ spin_lock_irqsave(&sfs->state_lock, flags);
+ bitmap_clear(sfs->state, first_blk, last_blk - first_blk + 1);
+ folio_clear_uptodate(folio);
+ spin_unlock_irqrestore(&sfs->state_lock, flags);
+}
+
static struct shmem_folio_state *sfs_alloc(struct inode *inode,
struct folio *folio)
{
@@ -276,6 +298,19 @@ static void shmem_set_range_uptodate(struct folio *folio, size_t off,
else
folio_mark_uptodate(folio);
}
+
+static void shmem_clear_range_uptodate(struct folio *folio, size_t off,
+ size_t len)
+{
+ struct shmem_folio_state *sfs = folio->private;
+
+ if (sfs)
+ sfs_clear_range_uptodate(folio, sfs, off, len);
+ else
+ folio_clear_uptodate(folio);
+
+}
+
#ifdef CONFIG_TMPFS
static unsigned long shmem_default_max_blocks(void)
{
@@ -1103,12 +1138,33 @@ static struct folio *shmem_get_partial_folio(struct inode *inode, pgoff_t index)
return folio;
}

+static void shmem_clear(struct folio *folio, loff_t start, loff_t end, int mode)
+{
+ loff_t pos = folio_pos(folio);
+ unsigned int offset, length;
+
+ if (!(mode & FALLOC_FL_PUNCH_HOLE) || !(folio_test_large(folio)))
+ return;
+
+ if (pos < start)
+ offset = start - pos;
+ else
+ offset = 0;
+ length = folio_size(folio);
+ if (pos + length <= (u64)end)
+ length = length - offset;
+ else
+ length = end + 1 - pos - offset;
+
+ shmem_clear_range_uptodate(folio, offset, length);
+}
+
/*
* Remove range of pages and swap entries from page cache, and free them.
* If !unfalloc, truncate or punch hole; if unfalloc, undo failed fallocate.
*/
static void shmem_undo_range(struct inode *inode, loff_t lstart, loff_t lend,
- bool unfalloc)
+ bool unfalloc, int mode)
{
struct address_space *mapping = inode->i_mapping;
struct shmem_inode_info *info = SHMEM_I(inode);
@@ -1166,6 +1222,7 @@ static void shmem_undo_range(struct inode *inode, loff_t lstart, loff_t lend,
if (folio) {
same_folio = lend < folio_pos(folio) + folio_size(folio);
folio_mark_dirty(folio);
+ shmem_clear(folio, lstart, lend, mode);
if (!truncate_inode_partial_folio(folio, lstart, lend)) {
start = folio_next_index(folio);
if (same_folio)
@@ -1255,9 +1312,17 @@ static void shmem_undo_range(struct inode *inode, loff_t lstart, loff_t lend,
shmem_recalc_inode(inode, 0, -nr_swaps_freed);
}

+static void shmem_truncate_range_mode(struct inode *inode, loff_t lstart,
+ loff_t lend, int mode)
+{
+ shmem_undo_range(inode, lstart, lend, false, mode);
+ inode_set_mtime_to_ts(inode, inode_set_ctime_current(inode));
+ inode_inc_iversion(inode);
+}
+
void shmem_truncate_range(struct inode *inode, loff_t lstart, loff_t lend)
{
- shmem_undo_range(inode, lstart, lend, false);
+ shmem_undo_range(inode, lstart, lend, false, 0);
inode_set_mtime_to_ts(inode, inode_set_ctime_current(inode));
inode_inc_iversion(inode);
}
@@ -3315,7 +3380,7 @@ static long shmem_fallocate(struct file *file, int mode, loff_t offset,
if ((u64)unmap_end > (u64)unmap_start)
unmap_mapping_range(mapping, unmap_start,
1 + unmap_end - unmap_start, 0);
- shmem_truncate_range(inode, offset, offset + len - 1);
+ shmem_truncate_range_mode(inode, offset, offset + len - 1, mode);
/* No need to unmap again: hole-punching leaves COWed pages */

spin_lock(&inode->i_lock);
@@ -3381,9 +3446,10 @@ static long shmem_fallocate(struct file *file, int mode, loff_t offset,
info->fallocend = undo_fallocend;
/* Remove the !uptodate folios we added */
if (index > start) {
- shmem_undo_range(inode,
- (loff_t)start << PAGE_SHIFT,
- ((loff_t)index << PAGE_SHIFT) - 1, true);
+ shmem_undo_range(
+ inode, (loff_t)start << PAGE_SHIFT,
+ ((loff_t)index << PAGE_SHIFT) - 1, true,
+ 0);
}
goto undone;
}
--
2.43.0

2024-02-09 14:32:59

by Daniel Gomez

Subject: [RFC PATCH 3/9] shmem: move folio zero operation to write_begin()

Simplify the zero-out operation by moving it from write_end() to
write_begin(). If a large folio does not have any uptodate block when we
first get it, zero it out entirely.

Signed-off-by: Daniel Gomez <[email protected]>
---
mm/shmem.c | 27 ++++++++++++++++++++-------
1 file changed, 20 insertions(+), 7 deletions(-)

diff --git a/mm/shmem.c b/mm/shmem.c
index 5980d7b94f65..3bddf7a89c18 100644
--- a/mm/shmem.c
+++ b/mm/shmem.c
@@ -149,6 +149,14 @@ static inline bool sfs_is_fully_uptodate(struct folio *folio)
return bitmap_full(sfs->state, i_blocks_per_folio(inode, folio));
}

+static inline bool sfs_is_any_uptodate(struct folio *folio)
+{
+ struct inode *inode = folio->mapping->host;
+ struct shmem_folio_state *sfs = folio->private;
+
+ return !bitmap_empty(sfs->state, i_blocks_per_folio(inode, folio));
+}
+
static inline bool sfs_is_block_uptodate(struct shmem_folio_state *sfs,
unsigned int block)
{
@@ -239,6 +247,15 @@ static void sfs_free(struct folio *folio, bool force)
kfree(folio_detach_private(folio));
}

+static inline bool shmem_is_any_uptodate(struct folio *folio)
+{
+ struct shmem_folio_state *sfs = folio->private;
+
+ if (folio_test_large(folio) && sfs)
+ return sfs_is_any_uptodate(folio);
+ return folio_test_uptodate(folio);
+}
+
static void shmem_set_range_uptodate(struct folio *folio, size_t off,
size_t len)
{
@@ -2845,6 +2862,9 @@ shmem_write_begin(struct file *file, struct address_space *mapping,
if (ret)
return ret;

+ if (!shmem_is_any_uptodate(folio))
+ folio_zero_range(folio, 0, folio_size(folio));
+
*pagep = folio_file_page(folio, index);
if (PageHWPoison(*pagep)) {
folio_unlock(folio);
@@ -2867,13 +2887,6 @@ shmem_write_end(struct file *file, struct address_space *mapping,
if (pos + copied > inode->i_size)
i_size_write(inode, pos + copied);

- if (!folio_test_uptodate(folio)) {
- if (copied < folio_size(folio)) {
- size_t from = offset_in_folio(folio, pos);
- folio_zero_segments(folio, 0, from,
- from + copied, folio_size(folio));
- }
- }
shmem_set_range_uptodate(folio, 0, folio_size(folio));
folio_mark_dirty(folio);
folio_unlock(folio);
--
2.43.0

2024-02-09 14:33:07

by Daniel Gomez

Subject: [RFC PATCH 6/9] shmem: set folio uptodate when reclaim

When reclaiming space by splitting a large folio through
shmem_unused_huge_shrink(), the folio is split regardless of its
uptodate status. Mark all of its blocks as uptodate in the reclaim path
so split_folio() can release the folio's private struct
(shmem_folio_state).

Signed-off-by: Daniel Gomez <[email protected]>
---
mm/shmem.c | 1 +
1 file changed, 1 insertion(+)

diff --git a/mm/shmem.c b/mm/shmem.c
index b6f9a60b179b..9fa86cb82da9 100644
--- a/mm/shmem.c
+++ b/mm/shmem.c
@@ -836,6 +836,7 @@ static unsigned long shmem_unused_huge_shrink(struct shmem_sb_info *sbinfo,
goto move_back;
}

+ shmem_set_range_uptodate(folio, 0, folio_size(folio));
ret = split_folio(folio);
folio_unlock(folio);
folio_put(folio);
--
2.43.0

2024-02-09 14:41:16

by Daniel Gomez

Subject: [RFC PATCH 1/9] splice: don't check for uptodate if partially uptodate is impl

From: Pankaj Raghav <[email protected]>

When huge=always is set in tmpfs, the whole huge page is zeroed out even
if only a small part of it is written, and the uptodate flag is updated
for the whole huge page.

Once per-block uptodate tracking is implemented for tmpfs hugepages,
pipe_buf_confirm() only needs to check that the range it is about to
splice is uptodate, rather than the whole folio, since the uptodate flag
is no longer set for partial writes.

Signed-off-by: Pankaj Raghav <[email protected]>
Signed-off-by: Daniel Gomez <[email protected]>
---

Another option here is to have a separate implementation of
page_cache_pipe_buf_ops for tmpfs instead of changing
page_cache_pipe_buf_confirm().

fs/splice.c | 17 ++++++++++++++---
1 file changed, 14 insertions(+), 3 deletions(-)

diff --git a/fs/splice.c b/fs/splice.c
index 218e24b1ac40..e6ac57795590 100644
--- a/fs/splice.c
+++ b/fs/splice.c
@@ -120,7 +120,9 @@ static int page_cache_pipe_buf_confirm(struct pipe_inode_info *pipe,
struct pipe_buffer *buf)
{
struct folio *folio = page_folio(buf->page);
+ const struct address_space_operations *ops;
int err;
+ off_t off = folio_page_idx(folio, buf->page) * PAGE_SIZE + buf->offset;

if (!folio_test_uptodate(folio)) {
folio_lock(folio);
@@ -134,12 +136,21 @@ static int page_cache_pipe_buf_confirm(struct pipe_inode_info *pipe,
goto error;
}

+ ops = folio->mapping->a_ops;
/*
* Uh oh, read-error from disk.
*/
- if (!folio_test_uptodate(folio)) {
- err = -EIO;
- goto error;
+ if (!ops->is_partially_uptodate) {
+ if (!folio_test_uptodate(folio)) {
+ err = -EIO;
+ goto error;
+ }
+ } else {
+ if (!ops->is_partially_uptodate(folio, off,
+ buf->len)) {
+ err = -EIO;
+ goto error;
+ }
}

/* Folio is ok after all, we are done */
--
2.43.0

2024-02-09 14:41:45

by Daniel Gomez

Subject: [RFC PATCH 4/9] shmem: exit shmem_get_folio_gfp() if block is uptodate

When we get a folio from the page cache with filemap_get_entry() and it
is uptodate, we exit from shmem_get_folio_gfp(). Replicate the same
behaviour if the block at the index we are operating on is uptodate.

Signed-off-by: Daniel Gomez <[email protected]>
---
mm/shmem.c | 12 +++++++++++-
1 file changed, 11 insertions(+), 1 deletion(-)

diff --git a/mm/shmem.c b/mm/shmem.c
index 3bddf7a89c18..614cda767298 100644
--- a/mm/shmem.c
+++ b/mm/shmem.c
@@ -256,6 +256,16 @@ static inline bool shmem_is_any_uptodate(struct folio *folio)
return folio_test_uptodate(folio);
}

+static inline bool shmem_is_block_uptodate(struct folio *folio,
+ unsigned int block)
+{
+ struct shmem_folio_state *sfs = folio->private;
+
+ if (folio_test_large(folio) && sfs)
+ return sfs_is_block_uptodate(sfs, block);
+ return folio_test_uptodate(folio);
+}
+
static void shmem_set_range_uptodate(struct folio *folio, size_t off,
size_t len)
{
@@ -2143,7 +2153,7 @@ static int shmem_get_folio_gfp(struct inode *inode, pgoff_t index,
}
if (sgp == SGP_WRITE)
folio_mark_accessed(folio);
- if (folio_test_uptodate(folio))
+ if (shmem_is_block_uptodate(folio, index - folio_index(folio)))
goto out;
/* fallocated folio */
if (sgp != SGP_READ)
--
2.43.0

2024-02-09 14:43:18

by Daniel Gomez

Subject: [RFC PATCH 9/9] shmem: enable per-block uptodate

In the write_end() function, mark only the blocks that are being written
as uptodate.

Signed-off-by: Daniel Gomez <[email protected]>
---
mm/shmem.c | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/mm/shmem.c b/mm/shmem.c
index 2157a87b2e4b..8ff2d190a9e4 100644
--- a/mm/shmem.c
+++ b/mm/shmem.c
@@ -2964,7 +2964,7 @@ shmem_write_end(struct file *file, struct address_space *mapping,
if (pos + copied > inode->i_size)
i_size_write(inode, pos + copied);

- shmem_set_range_uptodate(folio, 0, folio_size(folio));
+ shmem_set_range_uptodate(folio, offset_in_folio(folio, pos), len);
folio_mark_dirty(folio);
folio_unlock(folio);
folio_put(folio);
--
2.43.0

2024-02-09 14:43:41

by Daniel Gomez

Subject: [RFC PATCH 7/9] shmem: check if a block is uptodate before splice into pipe

The splice_read() path assumes folios are always uptodate. Make sure
all blocks in the given range are uptodate; otherwise, splice the zero
page into the pipe. Maximize the number of blocks that can be spliced
into the pipe at once by extending 'part' to the last uptodate block
found.

Signed-off-by: Daniel Gomez <[email protected]>
---
mm/shmem.c | 24 +++++++++++++++++++++++-
1 file changed, 23 insertions(+), 1 deletion(-)

diff --git a/mm/shmem.c b/mm/shmem.c
index 9fa86cb82da9..2d2eeb40f19b 100644
--- a/mm/shmem.c
+++ b/mm/shmem.c
@@ -3196,8 +3196,30 @@ static ssize_t shmem_file_splice_read(struct file *in, loff_t *ppos,
if (unlikely(*ppos >= isize))
break;
part = min_t(loff_t, isize - *ppos, len);
+ if (folio && folio_test_large(folio) &&
+ folio_test_private(folio)) {
+ unsigned long from = offset_in_folio(folio, *ppos);
+ unsigned int bfirst = from >> inode->i_blkbits;
+ unsigned int blast, blast_upd;
+
+ len = min(folio_size(folio) - from, len);
+ blast = (from + len - 1) >> inode->i_blkbits;
+
+ blast_upd = sfs_get_last_block_uptodate(folio, bfirst,
+ blast);
+ if (blast_upd <= blast) {
+ unsigned int bsize = 1 << inode->i_blkbits;
+ unsigned int blks = blast_upd - bfirst + 1;
+ unsigned int bbytes = blks << inode->i_blkbits;
+ unsigned int boff = (*ppos % bsize);
+
+ part = min_t(loff_t, bbytes - boff, len);
+ }
+ }

- if (folio) {
+ if (folio && shmem_is_block_uptodate(
+ folio, offset_in_folio(folio, *ppos) >>
+ inode->i_blkbits)) {
/*
* If users can be writing to this page using arbitrary
* virtual addresses, take care about potential aliasing
--
2.43.0

2024-02-14 19:59:21

by Daniel Gomez

Subject: Re: [RFC PATCH 0/9] shmem: fix llseek in hugepages

On Fri, Feb 09, 2024 at 02:29:01PM +0000, Daniel Gomez wrote:
> Hi,
>
> The following series fixes the generic/285 and generic/436 fstests for huge
> pages (huge=always). These are tests for llseek (SEEK_HOLE and SEEK_DATA).
>
> The implementation to fix above tests is based on iomap per-block tracking for
> uptodate and dirty states but applied to shmem uptodate flag.

Hi Hugh, Andrew,

Could you kindly provide feedback on these patches/fixes? I'd appreciate your
input on whether we're headed in the right direction, or maybe not.

Thanks,
Daniel

>
> The motivation is to avoid any regressions in tmpfs once it gets support for
> large folios.
>
> Testing with kdevops
> Testing has been performed using fstests with kdevops for the v6.8-rc2 tag.
> There are currently different profiles supported [1] and for each of these,
> a baseline of 20 loops has been performed with the following failures for
> hugepages profiles: generic/080, generic/126, generic/193, generic/245,
> generic/285, generic/436, generic/551, generic/619 and generic/732.
>
> If anyone interested, please find all of the failures in the expunges directory:
> https://github.com/linux-kdevops/kdevops/tree/master/workflows/fstests/expunges/6.8.0-rc2/tmpfs/unassigned
>
> [1] tmpfs profiles supported in kdevops: default, tmpfs_noswap_huge_never,
> tmpfs_noswap_huge_always, tmpfs_noswap_huge_within_size,
> tmpfs_noswap_huge_advise, tmpfs_huge_always, tmpfs_huge_within_size and
> tmpfs_huge_advise.
>
> More information:
> https://github.com/linux-kdevops/kdevops/tree/master/workflows/fstests/expunges/6.8.0-rc2/tmpfs/unassigned
>
> All the patches has been tested on top of v6.8-rc2 and rebased onto latest next
> tag available (next-20240209).
>
> Daniel
>
> Daniel Gomez (8):
> shmem: add per-block uptodate tracking for hugepages
> shmem: move folio zero operation to write_begin()
> shmem: exit shmem_get_folio_gfp() if block is uptodate
> shmem: clear_highpage() if block is not uptodate
> shmem: set folio uptodate when reclaim
> shmem: check if a block is uptodate before splice into pipe
> shmem: clear uptodate blocks after PUNCH_HOLE
> shmem: enable per-block uptodate
>
> Pankaj Raghav (1):
> splice: don't check for uptodate if partially uptodate is impl
>
> fs/splice.c | 17 ++-
> mm/shmem.c | 340 ++++++++++++++++++++++++++++++++++++++++++++++++----
> 2 files changed, 332 insertions(+), 25 deletions(-)
>
> --
> 2.43.0

2024-02-19 10:16:49

by Hugh Dickins

Subject: Re: [RFC PATCH 0/9] shmem: fix llseek in hugepages

On Wed, 14 Feb 2024, Daniel Gomez wrote:
> On Fri, Feb 09, 2024 at 02:29:01PM +0000, Daniel Gomez wrote:
> > Hi,
> >
> > The following series fixes the generic/285 and generic/436 fstests for huge
> > pages (huge=always). These are tests for llseek (SEEK_HOLE and SEEK_DATA).
> >
> > The implementation to fix above tests is based on iomap per-block tracking for
> > uptodate and dirty states but applied to shmem uptodate flag.
>
> Hi Hugh, Andrew,
>
> Could you kindly provide feedback on these patches/fixes? I'd appreciate your
> input on whether we're headed in the right direction, or maybe not.

I am sorry, Daniel, but I see this series as misdirected effort.

We do not want to add overhead to tmpfs and the kernel, just to pass two
tests which were (very reasonably) written for fixed block size, before
the huge page possibility ever came in.

If one opts for transparent huge pages in the filesystem, then of course
the dividing line between hole and data becomes more elastic than before.

It would be a serious bug if lseek ever reported an area of non-0 data as
in a hole; but I don't think that is what generic/285 or generic/436 find.

Beyond that, "man 2 lseek" is very forgiving of filesystem implementation.

I'll send you my stack of xfstests patches (which, as usual, I cannot
afford the time now to re-review and post): there are several tweaks to
seek_sanity_test in there for tmpfs huge pages, along with other fixes
for tmpfs (and some fixes to suit an old 32-bit build environment).

With those tweaks, generic/285 and generic/436 and others (but not all)
have been passing on huge tmpfs for several years. If you see something
you'd like to add your name to in that stack, or can improve upon, please
go ahead and post to the fstests list (Cc me).

Thanks,
Hugh

>
> Thanks,
> Daniel
>
> >
> > The motivation is to avoid any regressions in tmpfs once it gets support for
> > large folios.
> >
> > Testing with kdevops
> > Testing has been performed using fstests with kdevops for the v6.8-rc2 tag.
> > There are currently different profiles supported [1] and for each of these,
> > a baseline of 20 loops has been performed with the following failures for
> > hugepages profiles: generic/080, generic/126, generic/193, generic/245,
> > generic/285, generic/436, generic/551, generic/619 and generic/732.
> >
> > If anyone interested, please find all of the failures in the expunges directory:
> > https://github.com/linux-kdevops/kdevops/tree/master/workflows/fstests/expunges/6.8.0-rc2/tmpfs/unassigned
> >
> > [1] tmpfs profiles supported in kdevops: default, tmpfs_noswap_huge_never,
> > tmpfs_noswap_huge_always, tmpfs_noswap_huge_within_size,
> > tmpfs_noswap_huge_advise, tmpfs_huge_always, tmpfs_huge_within_size and
> > tmpfs_huge_advise.
> >
> > More information:
> > https://github.com/linux-kdevops/kdevops/tree/master/workflows/fstests/expunges/6.8.0-rc2/tmpfs/unassigned
> >
> > All the patches has been tested on top of v6.8-rc2 and rebased onto latest next
> > tag available (next-20240209).
> >
> > Daniel
> >
> > Daniel Gomez (8):
> > shmem: add per-block uptodate tracking for hugepages
> > shmem: move folio zero operation to write_begin()
> > shmem: exit shmem_get_folio_gfp() if block is uptodate
> > shmem: clear_highpage() if block is not uptodate
> > shmem: set folio uptodate when reclaim
> > shmem: check if a block is uptodate before splice into pipe
> > shmem: clear uptodate blocks after PUNCH_HOLE
> > shmem: enable per-block uptodate
> >
> > Pankaj Raghav (1):
> > splice: don't check for uptodate if partially uptodate is impl
> >
> > fs/splice.c | 17 ++-
> > mm/shmem.c | 340 ++++++++++++++++++++++++++++++++++++++++++++++++----
> > 2 files changed, 332 insertions(+), 25 deletions(-)
> >
> > --
> > 2.43.0

2024-02-20 10:27:50

by Daniel Gomez

Subject: Re: [RFC PATCH 0/9] shmem: fix llseek in hugepages

On Mon, Feb 19, 2024 at 02:15:47AM -0800, Hugh Dickins wrote:
> On Wed, 14 Feb 2024, Daniel Gomez wrote:
> > On Fri, Feb 09, 2024 at 02:29:01PM +0000, Daniel Gomez wrote:
> > > Hi,
> > >
> > > The following series fixes the generic/285 and generic/436 fstests for huge
> > > pages (huge=always). These are tests for llseek (SEEK_HOLE and SEEK_DATA).
> > >
> > > The implementation to fix above tests is based on iomap per-block tracking for
> > > uptodate and dirty states but applied to shmem uptodate flag.
> >
> > Hi Hugh, Andrew,
> >
> > Could you kindly provide feedback on these patches/fixes? I'd appreciate your
> > input on whether we're headed in the right direction, or maybe not.
>
> I am sorry, Daniel, but I see this series as misdirected effort.
>
> We do not want to add overhead to tmpfs and the kernel, just to pass two
> tests which were (very reasonably) written for fixed block size, before
> the huge page possibility ever came in.

Is this overhead a performance concern? Can you clarify what you mean?

I guess it is a matter of which granularity we want for a filesystem. Then, we
can either adapt the tests to work with different block sizes or change the
filesystem to support this fixed, minimum block size.

I believe the tests should remain unchanged if we still want to operate at this
fixed block size, regardless of how the memory is managed on the filesystem side
(whether it is a huge page or a large folio of arbitrary order).

>
> If one opts for transparent huge pages in the filesystem, then of course
> the dividing line between hole and data becomes more elastic than before.

I'm uncertain when we may want to be more elastic. In the case of XFS with iomap
and support for large folios, for instance, we are 'less' elastic than here. So,
what exactly is the rationale behind wanting shmem to be 'more elastic'?

If we ever move shmem to large folios [1], and we use them in an opportunistic
way, then we are going to be more elastic in the default path.

[1] https://lore.kernel.org/all/[email protected]

In addition, I think that having this block granularity can benefit quota
support and the reclaim path. For example, in the generic/100 fstest, around
26M of data is reported as 1G of used disk when using tmpfs with huge pages.
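
To illustrate what I am comparing there, a trivial sketch (not the fstest
itself; error handling is omitted) that prints the apparent size next to the
space actually accounted to a given file:

    #include <stdio.h>
    #include <sys/stat.h>

    int main(int argc, char **argv)
    {
            struct stat st;

            if (argc < 2 || stat(argv[1], &st) != 0)
                    return 1;

            /* apparent size vs. space accounted to the file; st_blocks is in
             * 512-byte units and is what ends up reported as "used disk" */
            printf("st_size:   %lld bytes\n", (long long)st.st_size);
            printf("st_blocks: %lld bytes\n", (long long)st.st_blocks * 512);
            return 0;
    }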

>
> It would be a serious bug if lseek ever reported an area of non-0 data as
> in a hole; but I don't think that is what generic/285 or generic/436 find.

I agree this is not the case here. We mark the entire folio (huge page) as
uptodate, hence we report that full area as data, in steps of 2M.

>
> Beyond that, "man 2 lseek" is very forgiving of filesystem implementation.

Thanks for bringing that up. This got me thinking along the same lines as
before, wanting to understand where we want to draw the line and the reasons
behind it.

>
> I'll send you my stack of xfstests patches (which, as usual, I cannot
> afford the time now to re-review and post): there are several tweaks to
> seek_sanity_test in there for tmpfs huge pages, along with other fixes
> for tmpfs (and some fixes to suit an old 32-bit build environment).
>
> With those tweaks, generic/285 and generic/436 and others (but not all)
> have been passing on huge tmpfs for several years. If you see something
> you'd like to add your name to in that stack, or can improve upon, please
> go ahead and post to the fstests list (Cc me).

Thanks for the patches, Hugh. I see how you are making the seek tests a bit
more 'elastic'. I will post them shortly and see if we can minimize the number
of failures [2].

In kdevops [3], we are discussing the possibility of adding tmpfs to 0-day and
tracking any regressions.

[2] https://github.com/linux-kdevops/kdevops/tree/master/workflows/fstests/expunges/6.8.0-rc2/tmpfs/unassigned
[3] https://github.com/linux-kdevops/kdevops

>
> Thanks,
> Hugh
>
> >
> > Thanks,
> > Daniel
> >
> > >
> > > The motivation is to avoid any regressions in tmpfs once it gets support for
> > > large folios.
> > >
> > > Testing with kdevops
> > > Testing has been performed using fstests with kdevops for the v6.8-rc2 tag.
> > > There are currently different profiles supported [1] and for each of these,
> > > a baseline of 20 loops has been performed with the following failures for
> > > hugepages profiles: generic/080, generic/126, generic/193, generic/245,
> > > generic/285, generic/436, generic/551, generic/619 and generic/732.
> > >
> > > If anyone interested, please find all of the failures in the expunges directory:
> > > https://github.com/linux-kdevops/kdevops/tree/master/workflows/fstests/expunges/6.8.0-rc2/tmpfs/unassigned
> > >
> > > [1] tmpfs profiles supported in kdevops: default, tmpfs_noswap_huge_never,
> > > tmpfs_noswap_huge_always, tmpfs_noswap_huge_within_size,
> > > tmpfs_noswap_huge_advise, tmpfs_huge_always, tmpfs_huge_within_size and
> > > tmpfs_huge_advise.
> > >
> > > More information:
> > > https://github.com/linux-kdevops/kdevops/tree/master/workflows/fstests/expunges/6.8.0-rc2/tmpfs/unassigned
> > >
> > > All the patches has been tested on top of v6.8-rc2 and rebased onto latest next
> > > tag available (next-20240209).
> > >
> > > Daniel
> > >
> > > Daniel Gomez (8):
> > > shmem: add per-block uptodate tracking for hugepages
> > > shmem: move folio zero operation to write_begin()
> > > shmem: exit shmem_get_folio_gfp() if block is uptodate
> > > shmem: clear_highpage() if block is not uptodate
> > > shmem: set folio uptodate when reclaim
> > > shmem: check if a block is uptodate before splice into pipe
> > > shmem: clear uptodate blocks after PUNCH_HOLE
> > > shmem: enable per-block uptodate
> > >
> > > Pankaj Raghav (1):
> > > splice: don't check for uptodate if partially uptodate is impl
> > >
> > > fs/splice.c | 17 ++-
> > > mm/shmem.c | 340 ++++++++++++++++++++++++++++++++++++++++++++++++----
> > > 2 files changed, 332 insertions(+), 25 deletions(-)
> > >
> > > --
> > > 2.43.0

2024-02-20 12:39:24

by Jan Kara

Subject: Re: [RFC PATCH 0/9] shmem: fix llseek in hugepages

On Tue 20-02-24 10:26:48, Daniel Gomez wrote:
> On Mon, Feb 19, 2024 at 02:15:47AM -0800, Hugh Dickins wrote:
> I'm uncertain when we may want to be more elastic. In the case of XFS with iomap
> and support for large folios, for instance, we are 'less' elastic than here. So,
> what exactly is the rationale behind wanting shmem to be 'more elastic'?

Well, but if you allocate space in larger chunks - as is the case with
ext4 and the bigalloc feature - you will be similarly 'elastic' as tmpfs with
large folio support... So simply the granularity of allocation of the
underlying space is what matters here. And for tmpfs the underlying space
happens to be the page cache.

> If we ever move shmem to large folios [1], and we use them in an oportunistic way,
> then we are going to be more elastic in the default path.
>
> [1] https://lore.kernel.org/all/[email protected]
>
> In addition, I think that having this block granularity can benefit quota
> support and the reclaim path. For example, in the generic/100 fstest, around
> ~26M of data are reported as 1G of used disk when using tmpfs with huge pages.

And I'd argue this is a desirable thing. If 1G worth of pages is attached
to the inode, then quota should be accounting 1G usage even though you've
written just 26MB of data to the file. Quota is about constraining used
resources, not about "how much did I write to the file".

Honza
--
Jan Kara <[email protected]>
SUSE Labs, CR

2024-02-27 11:57:56

by Daniel Gomez

Subject: Re: [RFC PATCH 0/9] shmem: fix llseek in hugepages

On Tue, Feb 20, 2024 at 01:39:05PM +0100, Jan Kara wrote:
> On Tue 20-02-24 10:26:48, Daniel Gomez wrote:
> > On Mon, Feb 19, 2024 at 02:15:47AM -0800, Hugh Dickins wrote:
> > I'm uncertain when we may want to be more elastic. In the case of XFS with iomap
> > and support for large folios, for instance, we are 'less' elastic than here. So,
> > what exactly is the rationale behind wanting shmem to be 'more elastic'?
>
> Well, but if you allocated space in larger chunks - as is the case with
> ext4 and bigalloc feature, you will be similarly 'elastic' as tmpfs with
> large folio support... So simply the granularity of allocation of
> underlying space is what matters here. And for tmpfs the underlying space
> happens to be the page cache.

But it seems like the underlying space 'behaves' differently when we talk about
large folios and huge pages. Is that correct? And this is reflected in fstat's
st_blksize. The first is always based on the host base page size, regardless of
the order we get. The second is always based on the configured host huge page
size (at the moment I've tested 2MiB and 1GiB for x86-64, and 2MiB, 512MiB and
16GiB for ARM64).
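
For reference, this is how I am checking it (a trivial sketch; error handling
is omitted):

    #include <stdio.h>
    #include <sys/stat.h>

    int main(int argc, char **argv)
    {
            struct stat st;

            if (argc < 2 || stat(argv[1], &st) != 0)
                    return 1;

            /* block size the filesystem reports to userspace; this is the
             * granularity the seek/step discussion above revolves around */
            printf("st_blksize: %ld\n", (long)st.st_blksize);
            return 0;
    }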

If that is the case, I'd agree this is not needed for huge pages but only when
we adopt large folios. Otherwise, we won't have a way to determine the step/
granularity for seeking data/holes as it could be anything from order-0 to
order-9. Note: order-1 support is currently in the LBS v1 thread here [1].

Regarding large folio adoption, we have the following implementations [2] that
have been sent to the mailing list. Would it make sense, then, to have this
block tracking for the large folio case? Note that my last attempt includes a
partial implementation of the block tracking discussed here.

[1] https://lore.kernel.org/all/[email protected]/

[2] shmem: high order folios support in write path
v1: https://lore.kernel.org/all/[email protected]/
v2: https://lore.kernel.org/all/[email protected]/
v3 (RFC): https://lore.kernel.org/all/[email protected]/

>
> > If we ever move shmem to large folios [1], and we use them in an oportunistic way,
> > then we are going to be more elastic in the default path.
> >
> > [1] https://lore.kernel.org/all/[email protected]
> >
> > In addition, I think that having this block granularity can benefit quota
> > support and the reclaim path. For example, in the generic/100 fstest, around
> > ~26M of data are reported as 1G of used disk when using tmpfs with huge pages.
>
> And I'd argue this is a desirable thing. If 1G worth of pages is attached
> to the inode, then quota should be accounting 1G usage even though you've
> written just 26MB of data to the file. Quota is about constraining used
> resources, not about "how much did I write to the file".

But these are two separate values. I get that the system wants to track how
many pages are attached to the inode, so is there a way to report (in addition)
the actual usage of these pages?

>
> Honza
> --
> Jan Kara <[email protected]>
> SUSE Labs, CR

2024-02-28 15:58:04

by Daniel Gomez

Subject: Re: [RFC PATCH 0/9] shmem: fix llseek in hugepages

On Tue, Feb 27, 2024 at 11:42:01AM +0000, Daniel Gomez wrote:
> On Tue, Feb 20, 2024 at 01:39:05PM +0100, Jan Kara wrote:
> > On Tue 20-02-24 10:26:48, Daniel Gomez wrote:
> > > On Mon, Feb 19, 2024 at 02:15:47AM -0800, Hugh Dickins wrote:
> > > I'm uncertain when we may want to be more elastic. In the case of XFS with iomap
> > > and support for large folios, for instance, we are 'less' elastic than here. So,
> > > what exactly is the rationale behind wanting shmem to be 'more elastic'?
> >
> > Well, but if you allocated space in larger chunks - as is the case with
> > ext4 and bigalloc feature, you will be similarly 'elastic' as tmpfs with
> > large folio support... So simply the granularity of allocation of
> > underlying space is what matters here. And for tmpfs the underlying space
> > happens to be the page cache.
>
> But it seems like the underlying space 'behaves' differently when we talk about
> large folios and huge pages. Is that correct? And this is reflected in the fstat
> st_blksize. The first one is always based on the host base page size, regardless
> of the order we get. The second one is always based on the host huge page size
> configured (at the moment I've tested 2MiB, and 1GiB for x86-64 and 2MiB, 512
> MiB and 16GiB for ARM64).

Apologies, I was mixing the values available in HugeTLB and those supported in
THP (pmd-size only). Thus, it is 2MiB for x86-64, and 2MiB, 32 MiB and 512 MiB
for ARM64 with 4k, 16k and 64k Base Page Size, respectively.
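
(For completeness, and assuming the usual page table geometry with 8-byte
entries, those PMD sizes follow directly from base page size x entries per
page table page: 4k x 512 = 2MiB, 16k x 2048 = 32MiB and 64k x 8192 = 512MiB.)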

>
> If that is the case, I'd agree this is not needed for huge pages but only when
> we adopt large folios. Otherwise, we won't have a way to determine the step/
> granularity for seeking data/holes as it could be anything from order-0 to
> order-9. Note: order-1 support currently in LBS v1 thread here [1].
>
> Regarding large folios adoption, we have the following implementations [2] being
> sent to the mailing list. Would it make sense then, to have this block tracking
> for the large folios case? Notice that my last attempt includes a partial
> implementation of block tracking discussed here.
>
> [1] https://lore.kernel.org/all/[email protected]/
>
> [2] shmem: high order folios support in write path
> v1: https://lore.kernel.org/all/[email protected]/
> v2: https://lore.kernel.org/all/[email protected]/
> v3 (RFC): https://lore.kernel.org/all/[email protected]/
>
> >
> > > If we ever move shmem to large folios [1], and we use them in an oportunistic way,
> > > then we are going to be more elastic in the default path.
> > >
> > > [1] https://lore.kernel.org/all/[email protected]
> > >
> > > In addition, I think that having this block granularity can benefit quota
> > > support and the reclaim path. For example, in the generic/100 fstest, around
> > > ~26M of data are reported as 1G of used disk when using tmpfs with huge pages.
> >
> > And I'd argue this is a desirable thing. If 1G worth of pages is attached
> > to the inode, then quota should be accounting 1G usage even though you've
> > written just 26MB of data to the file. Quota is about constraining used
> > resources, not about "how much did I write to the file".
>
> But these are two separate values. I get that the system wants to track how many
> pages are attached to the inode, so is there a way to report (in addition) the
> actual use of these pages being consumed?
>
> >
> > Honza
> > --
> > Jan Kara <[email protected]>
> > SUSE Labs, CR