2021-04-13 14:46:36

by Jan Kara

Subject: [PATCH 0/7 RFC v3] fs: Hole punch vs page cache filling races

Hello,

here is another version of my patches to address races between hole punching
and page cache filling functions for ext4 and other filesystems. Since the last
posting I've added more documentation and comments regarding the lock ordering
of the new lock (i_mapping_sem) and how the new lock is supposed to be used.
I've also added conversions of ext2, xfs, and zonefs to give a better idea of
how the conversion looks for most filesystems. Obviously, there are many more
filesystems to convert but I don't want to do that work unless we have a
consensus this is indeed the right approach. Also, as a result of spelling out
the locking rules more precisely, I have realized there's no need to use
i_mapping_sem in .page_mkwrite handlers, so I've added a bonus patch removing
those uses - not sure we want to actually do that together with the rest of the
series (maybe we can do this cleanup later when the rest of the conversion has
settled down).

Also, when writing the documentation I came across one question: Do we mandate
i_mapping_sem for truncate + hole punch for all filesystems, or just for
filesystems that support hole punching (or other complex fallocate operations)?
I wrote the documentation so that we require every filesystem to use
i_mapping_sem. This makes the locking rules simpler, and we can also add
asserts once all filesystems are converted. The downside is that simple
filesystems now pay the overhead of locking that is unnecessary for them. The
overhead is small (an uncontended rwsem acquisition for truncate), so I don't
think we care and the simplicity is worth it, but I wanted to spell this out.

What do people think about this?

Changes since v2:
* Added documentation and comments regarding lock ordering and how the lock is
supposed to be used
* Added conversions of ext2, xfs, zonefs
* Added patch removing i_mapping_sem protection from .page_mkwrite handlers

Changes since v1:
* Moved to using inode->i_mapping_sem instead of aops handler to acquire
appropriate lock

---
Motivation:

Amir has reported [1] that ext4 has a potential issue when reads can race
with hole punching, possibly exposing stale data from freed blocks or even
corrupting the filesystem when stale mapping data gets used for writeout. The
problem is that during hole punching, new page cache pages can get instantiated
and their block mapping looked up in the punched range after
truncate_inode_pages() has run but before the filesystem removes blocks from
the file. In principle any filesystem implementing hole punching thus needs to
implement a mechanism to block instantiating page cache pages during hole
punching to avoid this race. This is further complicated by the fact that there
are multiple places that can instantiate pages in the page cache. We can have a
regular read(2) or a page fault doing this, but fadvise(2) or madvise(2) can
also result in reading in page cache pages through force_page_cache_readahead().

There are a couple of ways to fix this. The first way (currently implemented by
XFS) is to protect read(2) and *advise(2) calls with i_rwsem so that they are
serialized with hole punching. This is easy to do but, as a result, all reads
would then be serialized with writes, and thus mixed read-write workloads would
suffer heavily on ext4. So this series instead introduces inode->i_mapping_sem
and uses it when creating new pages in the page cache and looking up their
corresponding block mapping. We also replace EXT4_I(inode)->i_mmap_sem with
this new rwsem, which provides the necessary serialization with hole punching
for ext4.

Honza

[1] https://lore.kernel.org/linux-fsdevel/[email protected]om/

Previous versions:
Link: https://lore.kernel.org/linux-fsdevel/[email protected]/


2021-04-13 14:46:36

by Jan Kara

Subject: [PATCH 2/7] mm: Protect operations adding pages to page cache with i_mapping_sem

Currently, serializing operations such as page fault, read, or readahead
against hole punching is rather difficult. The basic race scheme is
like:

fallocate(FALLOC_FL_PUNCH_HOLE)         read / fault / ..
  truncate_inode_pages_range()
                                        <create pages in page
                                         cache here>
  <update fs block mapping and free blocks>

The problem is that in this way read / page fault / readahead can
instantiate pages in the page cache with potentially stale data (if the
blocks get quickly reused). Avoiding this race is not simple - page locks
do not work because we want to make sure there are *no* pages in the given
range. inode->i_rwsem does not work because the page fault happens under
mmap_sem, which ranks below inode->i_rwsem. Also, using it for reads makes
the performance of mixed read-write workloads suffer.

So create a new rw_semaphore in the inode - i_mapping_sem - that
protects the addition of pages to the page cache for page faults /
reads / readahead.

Signed-off-by: Jan Kara <[email protected]>
---
Documentation/filesystems/locking.rst | 38 ++++++++++++++------
fs/inode.c | 3 ++
include/linux/fs.h | 2 ++
mm/filemap.c | 51 ++++++++++++++++++++++++---
mm/readahead.c | 2 ++
mm/rmap.c | 37 +++++++++----------
mm/truncate.c | 2 +-
7 files changed, 101 insertions(+), 34 deletions(-)

diff --git a/Documentation/filesystems/locking.rst b/Documentation/filesystems/locking.rst
index b7dcc86c92a4..67ba0a81301a 100644
--- a/Documentation/filesystems/locking.rst
+++ b/Documentation/filesystems/locking.rst
@@ -266,19 +266,19 @@ prototypes::
locking rules:
All except set_page_dirty and freepage may block

-====================== ======================== =========
-ops                    PageLocked(page)         i_rwsem
-====================== ======================== =========
+====================== ======================== ========= ===============
+ops                    PageLocked(page)         i_rwsem   i_mapping_sem
+====================== ======================== ========= ===============
 writepage:             yes, unlocks (see below)
-readpage:              yes, unlocks
+readpage:              yes, unlocks                       shared
 writepages:
 set_page_dirty         no
-readahead:             yes, unlocks
-readpages:             no
+readahead:             yes, unlocks                       shared
+readpages:             no                                 shared
 write_begin:           locks the page           exclusive
 write_end:             yes, unlocks             exclusive
 bmap:
-invalidatepage:        yes
+invalidatepage:        yes                                exclusive
 releasepage:           yes
 freepage:              yes
 direct_IO:
@@ -373,7 +373,10 @@ keep it that way and don't breed new callers.
->invalidatepage() is called when the filesystem must attempt to drop
some or all of the buffers from the page when it is being truncated. It
returns zero on success. If ->invalidatepage is zero, the kernel uses
-block_invalidatepage() instead.
+block_invalidatepage() instead. The filesystem should exclusively acquire
+i_mapping_sem before invalidating page cache in truncate / hole punch path (and
+thus calling into ->invalidatepage) to block races between page cache
+invalidation and page cache filling functions (fault, read, ...).

->releasepage() is called when the kernel is about to try to drop the
buffers from the page in preparation for freeing it. It returns zero to
@@ -567,6 +570,19 @@ in sys_read() and friends.
the lease within the individual filesystem to record the result of the
operation

+->fallocate implementation must be really careful to maintain page cache
+consistency when punching holes or performing other operations that invalidate
+page cache contents. Usually the filesystem needs to call
+truncate_inode_pages_range() to invalidate relevant range of the page cache.
+However the filesystem usually also needs to update its internal (and on disk)
+view of file offset -> disk block mapping. Until this update is finished, the
+filesystem needs to block page faults and reads from reloading now-stale page
+cache contents from the disk. VFS provides inode->i_mapping_sem for this and
+acquires it in shared mode in paths loading pages from disk (filemap_fault(),
+filemap_read(), readahead paths). The filesystem is responsible for taking this
+lock in its fallocate implementation and generally whenever the page cache
+contents needs to be invalidated because a block is moving from under a page.
+
dquot_operations
================

@@ -628,9 +644,9 @@ access: yes
to be faulted in. The filesystem must find and return the page associated
with the passed in "pgoff" in the vm_fault structure. If it is possible that
the page may be truncated and/or invalidated, then the filesystem must lock
-the page, then ensure it is not already truncated (the page lock will block
-subsequent truncate), and then return with VM_FAULT_LOCKED, and the page
-locked. The VM will unlock the page.
+i_mapping_sem, then ensure the page is not already truncated (i_mapping_sem
+will block subsequent truncate), and then return with VM_FAULT_LOCKED, and the
+page locked. The VM will unlock the page.

->map_pages() is called when VM asks to map easy accessible pages.
Filesystem should find and map pages associated with offsets from "start_pgoff"
diff --git a/fs/inode.c b/fs/inode.c
index a047ab306f9a..e23e707a507d 100644
--- a/fs/inode.c
+++ b/fs/inode.c
@@ -175,6 +175,9 @@ int inode_init_always(struct super_block *sb, struct inode *inode)

init_rwsem(&inode->i_rwsem);
lockdep_set_class(&inode->i_rwsem, &sb->s_type->i_mutex_key);
+ init_rwsem(&inode->i_mapping_sem);
+ lockdep_set_class(&inode->i_mapping_sem,
+ &sb->s_type->i_mapping_sem_key);

atomic_set(&inode->i_dio_count, 0);

diff --git a/include/linux/fs.h b/include/linux/fs.h
index ec8f3ddf4a6a..c020c105d2d9 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -660,6 +660,7 @@ struct inode {
/* Misc */
unsigned long i_state;
struct rw_semaphore i_rwsem;
+ struct rw_semaphore i_mapping_sem;

unsigned long dirtied_when; /* jiffies of first dirtying */
unsigned long dirtied_time_when;
@@ -2351,6 +2352,7 @@ struct file_system_type {

struct lock_class_key i_lock_key;
struct lock_class_key i_mutex_key;
+ struct lock_class_key i_mapping_sem_key;
struct lock_class_key i_mutex_dir_key;
};

diff --git a/mm/filemap.c b/mm/filemap.c
index bd7c50e060a9..bc82a7856d3e 100644
--- a/mm/filemap.c
+++ b/mm/filemap.c
@@ -77,7 +77,8 @@
 *        ->i_pages lock
 *
 *  ->i_rwsem
- *    ->i_mmap_rwsem            (truncate->unmap_mapping_range)
+ *    ->i_mapping_sem           (acquired by fs in truncate path)
+ *      ->i_mmap_rwsem          (truncate->unmap_mapping_range)
 *
* ->mmap_lock
* ->i_mmap_rwsem
@@ -85,7 +86,8 @@
 *        ->i_pages lock        (arch-dependent flush_dcache_mmap_lock)
 *
 *  ->mmap_lock
- *    ->lock_page               (access_process_vm)
+ *    ->i_mapping_sem           (filemap_fault)
+ *      ->lock_page             (filemap_fault, access_process_vm)
 *
 *  ->i_rwsem                   (generic_perform_write)
 *    ->mmap_lock               (fault_in_pages_readable->do_page_fault)
@@ -2276,16 +2278,28 @@ static int filemap_update_page(struct kiocb *iocb,
{
int error;

+ if (iocb->ki_flags & IOCB_NOWAIT) {
+ if (!down_read_trylock(&mapping->host->i_mapping_sem))
+ return -EAGAIN;
+ } else {
+ down_read(&mapping->host->i_mapping_sem);
+ }
+
if (!trylock_page(page)) {
- if (iocb->ki_flags & (IOCB_NOWAIT | IOCB_NOIO))
+ if (iocb->ki_flags & (IOCB_NOWAIT | IOCB_NOIO)) {
+ up_read(&mapping->host->i_mapping_sem);
return -EAGAIN;
+ }
if (!(iocb->ki_flags & IOCB_WAITQ)) {
+ up_read(&mapping->host->i_mapping_sem);
put_and_wait_on_page_locked(page, TASK_KILLABLE);
return AOP_TRUNCATED_PAGE;
}
error = __lock_page_async(page, iocb->ki_waitq);
- if (error)
+ if (error) {
+ up_read(&mapping->host->i_mapping_sem);
return error;
+ }
}

if (!page->mapping)
@@ -2302,6 +2316,7 @@ static int filemap_update_page(struct kiocb *iocb,
error = filemap_read_page(iocb->ki_filp, mapping, page);
if (error == AOP_TRUNCATED_PAGE)
put_page(page);
+ up_read(&mapping->host->i_mapping_sem);
return error;
truncated:
unlock_page(page);
@@ -2309,6 +2324,7 @@ static int filemap_update_page(struct kiocb *iocb,
return AOP_TRUNCATED_PAGE;
unlock:
unlock_page(page);
+ up_read(&mapping->host->i_mapping_sem);
return error;
}

@@ -2323,6 +2339,19 @@ static int filemap_create_page(struct file *file,
if (!page)
return -ENOMEM;

+ /*
+ * Protect against truncate / hole punch. Grabbing i_mapping_sem here
+ * assures we cannot instantiate and bring uptodate new pagecache pages
+ * after evicting page cache during truncate and before actually
+ * freeing blocks. Note that we could release i_mapping_sem after
+ * inserting the page into page cache as the locked page would then be
+ * enough to synchronize with hole punching. But there are code paths
+ * such as filemap_update_page() filling in partially uptodate pages or
+ * ->readpages() that need to hold i_mapping_sem while mapping blocks
+ * for IO so let's hold the lock here as well to keep locking rules
+ * simple.
+ */
+ down_read(&mapping->host->i_mapping_sem);
error = add_to_page_cache_lru(page, mapping, index,
mapping_gfp_constraint(mapping, GFP_KERNEL));
if (error == -EEXIST)
@@ -2334,9 +2363,11 @@ static int filemap_create_page(struct file *file,
if (error)
goto error;

+ up_read(&mapping->host->i_mapping_sem);
pagevec_add(pvec, page);
return 0;
error:
+ up_read(&mapping->host->i_mapping_sem);
put_page(page);
return error;
}
@@ -2896,6 +2927,13 @@ vm_fault_t filemap_fault(struct vm_fault *vmf)
count_memcg_event_mm(vmf->vma->vm_mm, PGMAJFAULT);
ret = VM_FAULT_MAJOR;
fpin = do_sync_mmap_readahead(vmf);
+ }
+
+ /*
+ * See comment in filemap_create_page() why we need i_mapping_sem
+ */
+ down_read(&inode->i_mapping_sem);
+ if (!page) {
retry_find:
page = pagecache_get_page(mapping, offset,
FGP_CREAT|FGP_FOR_MMAP,
@@ -2903,6 +2941,7 @@ vm_fault_t filemap_fault(struct vm_fault *vmf)
if (!page) {
if (fpin)
goto out_retry;
+ up_read(&inode->i_mapping_sem);
return VM_FAULT_OOM;
}
}
@@ -2943,9 +2982,11 @@ vm_fault_t filemap_fault(struct vm_fault *vmf)
if (unlikely(offset >= max_off)) {
unlock_page(page);
put_page(page);
+ up_read(&inode->i_mapping_sem);
return VM_FAULT_SIGBUS;
}

+ up_read(&inode->i_mapping_sem);
vmf->page = page;
return ret | VM_FAULT_LOCKED;

@@ -2971,6 +3012,7 @@ vm_fault_t filemap_fault(struct vm_fault *vmf)
if (!error || error == AOP_TRUNCATED_PAGE)
goto retry_find;

+ up_read(&inode->i_mapping_sem);
shrink_readahead_size_eio(ra);
return VM_FAULT_SIGBUS;

@@ -2982,6 +3024,7 @@ vm_fault_t filemap_fault(struct vm_fault *vmf)
*/
if (page)
put_page(page);
+ up_read(&inode->i_mapping_sem);
if (fpin)
fput(fpin);
return ret | VM_FAULT_RETRY;
diff --git a/mm/readahead.c b/mm/readahead.c
index c5b0457415be..ac5bb50b3a4c 100644
--- a/mm/readahead.c
+++ b/mm/readahead.c
@@ -192,6 +192,7 @@ void page_cache_ra_unbounded(struct readahead_control *ractl,
*/
unsigned int nofs = memalloc_nofs_save();

+ down_read(&mapping->host->i_mapping_sem);
/*
* Preallocate as many pages as we will need.
*/
@@ -236,6 +237,7 @@ void page_cache_ra_unbounded(struct readahead_control *ractl,
* will then handle the error.
*/
read_pages(ractl, &page_pool, false);
+ up_read(&mapping->host->i_mapping_sem);
memalloc_nofs_restore(nofs);
}
EXPORT_SYMBOL_GPL(page_cache_ra_unbounded);
diff --git a/mm/rmap.c b/mm/rmap.c
index dba8cb8a5578..37e5dceb4351 100644
--- a/mm/rmap.c
+++ b/mm/rmap.c
@@ -22,24 +22,25 @@
*
 * inode->i_rwsem       (while writing or truncating, not reading or faulting)
 *   mm->mmap_lock
- *     page->flags PG_locked (lock_page)   * (see hugetlbfs below)
- *       hugetlbfs_i_mmap_rwsem_key (in huge_pmd_share)
- *         mapping->i_mmap_rwsem
- *           hugetlb_fault_mutex (hugetlbfs specific page fault mutex)
- *           anon_vma->rwsem
- *             mm->page_table_lock or pte_lock
- *               swap_lock (in swap_duplicate, swap_info_get)
- *                 mmlist_lock (in mmput, drain_mmlist and others)
- *                 mapping->private_lock (in __set_page_dirty_buffers)
- *                   lock_page_memcg move_lock (in __set_page_dirty_buffers)
- *                     i_pages lock (widely used)
- *                       lruvec->lru_lock (in lock_page_lruvec_irq)
- *                 inode->i_lock (in set_page_dirty's __mark_inode_dirty)
- *                 bdi.wb->list_lock (in set_page_dirty's __mark_inode_dirty)
- *                   sb_lock (within inode_lock in fs/fs-writeback.c)
- *                   i_pages lock (widely used, in set_page_dirty,
- *                             in arch-dependent flush_dcache_mmap_lock,
- *                             within bdi.wb->list_lock in __sync_single_inode)
+ *     inode->i_mapping_sem (in filemap_fault)
+ *       page->flags PG_locked (lock_page)   * (see hugetlbfs below)
+ *         hugetlbfs_i_mmap_rwsem_key (in huge_pmd_share)
+ *           mapping->i_mmap_rwsem
+ *             hugetlb_fault_mutex (hugetlbfs specific page fault mutex)
+ *             anon_vma->rwsem
+ *               mm->page_table_lock or pte_lock
+ *                 swap_lock (in swap_duplicate, swap_info_get)
+ *                   mmlist_lock (in mmput, drain_mmlist and others)
+ *                   mapping->private_lock (in __set_page_dirty_buffers)
+ *                     lock_page_memcg move_lock (in __set_page_dirty_buffers)
+ *                       i_pages lock (widely used)
+ *                         lruvec->lru_lock (in lock_page_lruvec_irq)
+ *                   inode->i_lock (in set_page_dirty's __mark_inode_dirty)
+ *                   bdi.wb->list_lock (in set_page_dirty's __mark_inode_dirty)
+ *                     sb_lock (within inode_lock in fs/fs-writeback.c)
+ *                     i_pages lock (widely used, in set_page_dirty,
+ *                               in arch-dependent flush_dcache_mmap_lock,
+ *                               within bdi.wb->list_lock in __sync_single_inode)
*
* anon_vma->rwsem,mapping->i_mmap_rwsem (memory_failure, collect_procs_anon)
* ->tasklist_lock
diff --git a/mm/truncate.c b/mm/truncate.c
index 2cf71d8c3c62..464ad70a081f 100644
--- a/mm/truncate.c
+++ b/mm/truncate.c
@@ -416,7 +416,7 @@ EXPORT_SYMBOL(truncate_inode_pages_range);
* @mapping: mapping to truncate
* @lstart: offset from which to truncate
*
- * Called under (and serialised by) inode->i_rwsem.
+ * Called under (and serialised by) inode->i_rwsem and inode->i_mapping_sem.
*
* Note: When this function returns, there can be a page in the process of
* deletion (inside __delete_from_page_cache()) in the specified range. Thus
--
2.31.0.99.g0d91da736d9f.dirty

2021-04-13 14:46:39

by Jan Kara

Subject: [PATCH 7/7] fs: Remove i_mapping_sem protection from .page_mkwrite handlers

Using i_mapping_sem in .page_mkwrite handlers is generally unnecessary
since these handlers do not create any new pages in the page cache. Thus
if the handler locks the page and it is still attached to the appropriate
mapping, we are certain there cannot be a truncate or hole punch in
progress past its page cache invalidation step (remember that a page
cannot be created in the page cache from the moment hole punch acquires
i_mapping_sem and starts invalidating page cache to the moment the
operation is finished and i_mapping_sem is released), and any page cache
invalidation will wait until we release the page lock. So just remove
i_mapping_sem usage from .page_mkwrite handlers.

Signed-off-by: Jan Kara <[email protected]>
---
fs/ext4/inode.c | 3 ---
fs/xfs/xfs_file.c | 2 --
fs/zonefs/super.c | 3 ---
3 files changed, 8 deletions(-)

diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c
index d76803eba884..4779b6084a30 100644
--- a/fs/ext4/inode.c
+++ b/fs/ext4/inode.c
@@ -6064,8 +6064,6 @@ vm_fault_t ext4_page_mkwrite(struct vm_fault *vmf)
sb_start_pagefault(inode->i_sb);
file_update_time(vma->vm_file);

- down_read(&inode->i_mapping_sem);
-
err = ext4_convert_inline_data(inode);
if (err)
goto out_ret;
@@ -6177,7 +6175,6 @@ vm_fault_t ext4_page_mkwrite(struct vm_fault *vmf)
out_ret:
ret = block_page_mkwrite_return(err);
out:
- up_read(&inode->i_mapping_sem);
sb_end_pagefault(inode->i_sb);
return ret;
out_error:
diff --git a/fs/xfs/xfs_file.c b/fs/xfs/xfs_file.c
index a76404708312..416a5c8ccdda 100644
--- a/fs/xfs/xfs_file.c
+++ b/fs/xfs/xfs_file.c
@@ -1316,10 +1316,8 @@ __xfs_filemap_fault(
xfs_iunlock(XFS_I(inode), XFS_MMAPLOCK_SHARED);
} else {
if (write_fault) {
- xfs_ilock(XFS_I(inode), XFS_MMAPLOCK_SHARED);
ret = iomap_page_mkwrite(vmf,
&xfs_buffered_write_iomap_ops);
- xfs_iunlock(XFS_I(inode), XFS_MMAPLOCK_SHARED);
} else
ret = filemap_fault(vmf);
}
diff --git a/fs/zonefs/super.c b/fs/zonefs/super.c
index ace6f3a223da..a90ef0f09a9f 100644
--- a/fs/zonefs/super.c
+++ b/fs/zonefs/super.c
@@ -594,10 +594,7 @@ static vm_fault_t zonefs_filemap_page_mkwrite(struct vm_fault *vmf)
sb_start_pagefault(inode->i_sb);
file_update_time(vmf->vma->vm_file);

- /* Serialize against truncates */
- down_read(&inode->i_mapping_sem);
ret = iomap_page_mkwrite(vmf, &zonefs_iomap_ops);
- up_read(&inode->i_mapping_sem);

sb_end_pagefault(inode->i_sb);
return ret;
--
2.31.0.99.g0d91da736d9f.dirty

2021-04-13 14:46:40

by Jan Kara

Subject: [PATCH 3/7] ext4: Convert to use inode->i_mapping_sem

Convert ext4 to use inode->i_mapping_sem instead of its private
EXT4_I(inode)->i_mmap_sem. This is mostly search-and-replace. By this
conversion we fix a long-standing race between hole punching and the
read(2) / readahead(2) paths that can lead to stale page cache contents.

Signed-off-by: Jan Kara <[email protected]>
---
fs/ext4/ext4.h | 10 ----------
fs/ext4/extents.c | 18 +++++++++---------
fs/ext4/file.c | 12 ++++++------
fs/ext4/inode.c | 47 +++++++++++++++++-----------------------------
fs/ext4/ioctl.c | 4 ++--
fs/ext4/super.c | 11 ++++-------
fs/ext4/truncate.h | 4 ++--
7 files changed, 40 insertions(+), 66 deletions(-)

diff --git a/fs/ext4/ext4.h b/fs/ext4/ext4.h
index 826a56e3bbd2..2ae365458dca 100644
--- a/fs/ext4/ext4.h
+++ b/fs/ext4/ext4.h
@@ -1081,15 +1081,6 @@ struct ext4_inode_info {
* by other means, so we have i_data_sem.
*/
struct rw_semaphore i_data_sem;
- /*
- * i_mmap_sem is for serializing page faults with truncate / punch hole
- * operations. We have to make sure that new page cannot be faulted in
- * a section of the inode that is being punched. We cannot easily use
- * i_data_sem for this since we need protection for the whole punch
- * operation and i_data_sem ranks below transaction start so we have
- * to occasionally drop it.
- */
- struct rw_semaphore i_mmap_sem;
struct inode vfs_inode;
struct jbd2_inode *jinode;

@@ -2908,7 +2899,6 @@ extern int ext4_chunk_trans_blocks(struct inode *, int nrblocks);
extern int ext4_zero_partial_blocks(handle_t *handle, struct inode *inode,
loff_t lstart, loff_t lend);
extern vm_fault_t ext4_page_mkwrite(struct vm_fault *vmf);
-extern vm_fault_t ext4_filemap_fault(struct vm_fault *vmf);
extern qsize_t *ext4_get_reserved_space(struct inode *inode);
extern int ext4_get_projid(struct inode *inode, kprojid_t *projid);
extern void ext4_da_release_space(struct inode *inode, int to_free);
diff --git a/fs/ext4/extents.c b/fs/ext4/extents.c
index 77c84d6f1af6..a315fe6a0929 100644
--- a/fs/ext4/extents.c
+++ b/fs/ext4/extents.c
@@ -4553,17 +4553,17 @@ static long ext4_zero_range(struct file *file, loff_t offset,
* Prevent page faults from reinstantiating pages we have
* released from page cache.
*/
- down_write(&EXT4_I(inode)->i_mmap_sem);
+ down_write(&inode->i_mapping_sem);

ret = ext4_break_layouts(inode);
if (ret) {
- up_write(&EXT4_I(inode)->i_mmap_sem);
+ up_write(&inode->i_mapping_sem);
goto out_mutex;
}

ret = ext4_update_disksize_before_punch(inode, offset, len);
if (ret) {
- up_write(&EXT4_I(inode)->i_mmap_sem);
+ up_write(&inode->i_mapping_sem);
goto out_mutex;
}
/* Now release the pages and zero block aligned part of pages */
@@ -4572,7 +4572,7 @@ static long ext4_zero_range(struct file *file, loff_t offset,

ret = ext4_alloc_file_blocks(file, lblk, max_blocks, new_size,
flags);
- up_write(&EXT4_I(inode)->i_mmap_sem);
+ up_write(&inode->i_mapping_sem);
if (ret)
goto out_mutex;
}
@@ -5267,7 +5267,7 @@ static int ext4_collapse_range(struct inode *inode, loff_t offset, loff_t len)
* Prevent page faults from reinstantiating pages we have released from
* page cache.
*/
- down_write(&EXT4_I(inode)->i_mmap_sem);
+ down_write(&inode->i_mapping_sem);

ret = ext4_break_layouts(inode);
if (ret)
@@ -5288,7 +5288,7 @@ static int ext4_collapse_range(struct inode *inode, loff_t offset, loff_t len)
/*
* Write data that will be shifted to preserve them when discarding
* page cache below. We are also protected from pages becoming dirty
- * by i_mmap_sem.
+ * by i_mapping_sem.
*/
ret = filemap_write_and_wait_range(inode->i_mapping, offset + len,
LLONG_MAX);
@@ -5343,7 +5343,7 @@ static int ext4_collapse_range(struct inode *inode, loff_t offset, loff_t len)
ext4_journal_stop(handle);
ext4_fc_stop_ineligible(sb);
out_mmap:
- up_write(&EXT4_I(inode)->i_mmap_sem);
+ up_write(&inode->i_mapping_sem);
out_mutex:
inode_unlock(inode);
return ret;
@@ -5418,7 +5418,7 @@ static int ext4_insert_range(struct inode *inode, loff_t offset, loff_t len)
* Prevent page faults from reinstantiating pages we have released from
* page cache.
*/
- down_write(&EXT4_I(inode)->i_mmap_sem);
+ down_write(&inode->i_mapping_sem);

ret = ext4_break_layouts(inode);
if (ret)
@@ -5519,7 +5519,7 @@ static int ext4_insert_range(struct inode *inode, loff_t offset, loff_t len)
ext4_journal_stop(handle);
ext4_fc_stop_ineligible(sb);
out_mmap:
- up_write(&EXT4_I(inode)->i_mmap_sem);
+ up_write(&inode->i_mapping_sem);
out_mutex:
inode_unlock(inode);
return ret;
diff --git a/fs/ext4/file.c b/fs/ext4/file.c
index 194f5d00fa32..93fab87f0fff 100644
--- a/fs/ext4/file.c
+++ b/fs/ext4/file.c
@@ -692,17 +692,17 @@ static vm_fault_t ext4_dax_huge_fault(struct vm_fault *vmf,
if (write) {
sb_start_pagefault(sb);
file_update_time(vmf->vma->vm_file);
- down_read(&EXT4_I(inode)->i_mmap_sem);
+ down_read(&inode->i_mapping_sem);
retry:
handle = ext4_journal_start_sb(sb, EXT4_HT_WRITE_PAGE,
EXT4_DATA_TRANS_BLOCKS(sb));
if (IS_ERR(handle)) {
- up_read(&EXT4_I(inode)->i_mmap_sem);
+ up_read(&inode->i_mapping_sem);
sb_end_pagefault(sb);
return VM_FAULT_SIGBUS;
}
} else {
- down_read(&EXT4_I(inode)->i_mmap_sem);
+ down_read(&inode->i_mapping_sem);
}
result = dax_iomap_fault(vmf, pe_size, &pfn, &error, &ext4_iomap_ops);
if (write) {
@@ -714,10 +714,10 @@ static vm_fault_t ext4_dax_huge_fault(struct vm_fault *vmf,
/* Handling synchronous page fault? */
if (result & VM_FAULT_NEEDDSYNC)
result = dax_finish_sync_fault(vmf, pe_size, pfn);
- up_read(&EXT4_I(inode)->i_mmap_sem);
+ up_read(&inode->i_mapping_sem);
sb_end_pagefault(sb);
} else {
- up_read(&EXT4_I(inode)->i_mmap_sem);
+ up_read(&inode->i_mapping_sem);
}

return result;
@@ -739,7 +739,7 @@ static const struct vm_operations_struct ext4_dax_vm_ops = {
#endif

static const struct vm_operations_struct ext4_file_vm_ops = {
- .fault = ext4_filemap_fault,
+ .fault = filemap_fault,
.map_pages = filemap_map_pages,
.page_mkwrite = ext4_page_mkwrite,
};
diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c
index 0948a43f1b3d..d76803eba884 100644
--- a/fs/ext4/inode.c
+++ b/fs/ext4/inode.c
@@ -3952,20 +3952,19 @@ int ext4_update_disksize_before_punch(struct inode *inode, loff_t offset,
return ret;
}

-static void ext4_wait_dax_page(struct ext4_inode_info *ei)
+static void ext4_wait_dax_page(struct inode *inode)
{
- up_write(&ei->i_mmap_sem);
+ up_write(&inode->i_mapping_sem);
schedule();
- down_write(&ei->i_mmap_sem);
+ down_write(&inode->i_mapping_sem);
}

int ext4_break_layouts(struct inode *inode)
{
- struct ext4_inode_info *ei = EXT4_I(inode);
struct page *page;
int error;

- if (WARN_ON_ONCE(!rwsem_is_locked(&ei->i_mmap_sem)))
+ if (WARN_ON_ONCE(!rwsem_is_locked(&inode->i_mapping_sem)))
return -EINVAL;

do {
@@ -3976,7 +3975,7 @@ int ext4_break_layouts(struct inode *inode)
error = ___wait_var_event(&page->_refcount,
atomic_read(&page->_refcount) == 1,
TASK_INTERRUPTIBLE, 0, 0,
- ext4_wait_dax_page(ei));
+ ext4_wait_dax_page(inode));
} while (error == 0);

return error;
@@ -4007,9 +4006,9 @@ int ext4_punch_hole(struct inode *inode, loff_t offset, loff_t length)

ext4_clear_inode_state(inode, EXT4_STATE_MAY_INLINE_DATA);
if (ext4_has_inline_data(inode)) {
- down_write(&EXT4_I(inode)->i_mmap_sem);
+ down_write(&inode->i_mapping_sem);
ret = ext4_convert_inline_data(inode);
- up_write(&EXT4_I(inode)->i_mmap_sem);
+ up_write(&inode->i_mapping_sem);
if (ret)
return ret;
}
@@ -4060,7 +4059,7 @@ int ext4_punch_hole(struct inode *inode, loff_t offset, loff_t length)
* Prevent page faults from reinstantiating pages we have released from
* page cache.
*/
- down_write(&EXT4_I(inode)->i_mmap_sem);
+ down_write(&inode->i_mapping_sem);

ret = ext4_break_layouts(inode);
if (ret)
@@ -4133,7 +4132,7 @@ int ext4_punch_hole(struct inode *inode, loff_t offset, loff_t length)
out_stop:
ext4_journal_stop(handle);
out_dio:
- up_write(&EXT4_I(inode)->i_mmap_sem);
+ up_write(&inode->i_mapping_sem);
out_mutex:
inode_unlock(inode);
return ret;
@@ -5428,11 +5427,11 @@ int ext4_setattr(struct user_namespace *mnt_userns, struct dentry *dentry,
inode_dio_wait(inode);
}

- down_write(&EXT4_I(inode)->i_mmap_sem);
+ down_write(&inode->i_mapping_sem);

rc = ext4_break_layouts(inode);
if (rc) {
- up_write(&EXT4_I(inode)->i_mmap_sem);
+ up_write(&inode->i_mapping_sem);
goto err_out;
}

@@ -5508,7 +5507,7 @@ int ext4_setattr(struct user_namespace *mnt_userns, struct dentry *dentry,
error = rc;
}
out_mmap_sem:
- up_write(&EXT4_I(inode)->i_mmap_sem);
+ up_write(&inode->i_mapping_sem);
}

if (!error) {
@@ -5985,10 +5984,10 @@ int ext4_change_inode_journal_flag(struct inode *inode, int val)
* data (and journalled aops don't know how to handle these cases).
*/
if (val) {
- down_write(&EXT4_I(inode)->i_mmap_sem);
+ down_write(&inode->i_mapping_sem);
err = filemap_write_and_wait(inode->i_mapping);
if (err < 0) {
- up_write(&EXT4_I(inode)->i_mmap_sem);
+ up_write(&inode->i_mapping_sem);
return err;
}
}
@@ -6021,7 +6020,7 @@ int ext4_change_inode_journal_flag(struct inode *inode, int val)
percpu_up_write(&sbi->s_writepages_rwsem);

if (val)
- up_write(&EXT4_I(inode)->i_mmap_sem);
+ up_write(&inode->i_mapping_sem);

/* Finally we can mark the inode as dirty. */

@@ -6065,7 +6064,7 @@ vm_fault_t ext4_page_mkwrite(struct vm_fault *vmf)
sb_start_pagefault(inode->i_sb);
file_update_time(vma->vm_file);

- down_read(&EXT4_I(inode)->i_mmap_sem);
+ down_read(&inode->i_mapping_sem);

err = ext4_convert_inline_data(inode);
if (err)
@@ -6178,7 +6177,7 @@ vm_fault_t ext4_page_mkwrite(struct vm_fault *vmf)
out_ret:
ret = block_page_mkwrite_return(err);
out:
- up_read(&EXT4_I(inode)->i_mmap_sem);
+ up_read(&inode->i_mapping_sem);
sb_end_pagefault(inode->i_sb);
return ret;
out_error:
@@ -6186,15 +6185,3 @@ vm_fault_t ext4_page_mkwrite(struct vm_fault *vmf)
ext4_journal_stop(handle);
goto out;
}
-
-vm_fault_t ext4_filemap_fault(struct vm_fault *vmf)
-{
- struct inode *inode = file_inode(vmf->vma->vm_file);
- vm_fault_t ret;
-
- down_read(&EXT4_I(inode)->i_mmap_sem);
- ret = filemap_fault(vmf);
- up_read(&EXT4_I(inode)->i_mmap_sem);
-
- return ret;
-}
diff --git a/fs/ext4/ioctl.c b/fs/ext4/ioctl.c
index a2cf35066f46..7a9f24596401 100644
--- a/fs/ext4/ioctl.c
+++ b/fs/ext4/ioctl.c
@@ -147,7 +147,7 @@ static long swap_inode_boot_loader(struct super_block *sb,
goto journal_err_out;
}

- down_write(&EXT4_I(inode)->i_mmap_sem);
+ down_write(&inode->i_mapping_sem);
err = filemap_write_and_wait(inode->i_mapping);
if (err)
goto err_out;
@@ -255,7 +255,7 @@ static long swap_inode_boot_loader(struct super_block *sb,
ext4_double_up_write_data_sem(inode, inode_bl);

err_out:
- up_write(&EXT4_I(inode)->i_mmap_sem);
+ up_write(&inode->i_mapping_sem);
journal_err_out:
unlock_two_nondirectories(inode, inode_bl);
iput(inode_bl);
diff --git a/fs/ext4/super.c b/fs/ext4/super.c
index b9693680463a..ec38f3673ad2 100644
--- a/fs/ext4/super.c
+++ b/fs/ext4/super.c
@@ -90,11 +90,8 @@ static struct inode *ext4_get_journal_inode(struct super_block *sb,
/*
* Lock ordering
*
- * Note the difference between i_mmap_sem (EXT4_I(inode)->i_mmap_sem) and
- * i_mmap_rwsem (inode->i_mmap_rwsem)!
- *
* page fault path:
- * mmap_lock -> sb_start_pagefault -> i_mmap_sem (r) -> transaction start ->
+ * mmap_lock -> sb_start_pagefault -> i_mapping_sem (r) -> transaction start ->
* page lock -> i_data_sem (rw)
*
* buffered write path:
@@ -103,8 +100,9 @@ static struct inode *ext4_get_journal_inode(struct super_block *sb,
* i_data_sem (rw)
*
* truncate:
- * sb_start_write -> i_mutex -> i_mmap_sem (w) -> i_mmap_rwsem (w) -> page lock
- * sb_start_write -> i_mutex -> i_mmap_sem (w) -> transaction start ->
+ * sb_start_write -> i_mutex -> i_mapping_sem (w) -> i_mmap_rwsem (w) ->
+ * page lock
+ * sb_start_write -> i_mutex -> i_mapping_sem (w) -> transaction start ->
* i_data_sem (rw)
*
* direct IO:
@@ -1349,7 +1347,6 @@ static void init_once(void *foo)
INIT_LIST_HEAD(&ei->i_orphan);
init_rwsem(&ei->xattr_sem);
init_rwsem(&ei->i_data_sem);
- init_rwsem(&ei->i_mmap_sem);
inode_init_once(&ei->vfs_inode);
ext4_fc_init_inode(&ei->vfs_inode);
}
diff --git a/fs/ext4/truncate.h b/fs/ext4/truncate.h
index bcbe3668c1d4..4fe34ccc74e0 100644
--- a/fs/ext4/truncate.h
+++ b/fs/ext4/truncate.h
@@ -15,10 +15,10 @@ static inline void ext4_truncate_failed_write(struct inode *inode)
* We don't need to call ext4_break_layouts() because the blocks we
* are truncating were never visible to userspace.
*/
- down_write(&EXT4_I(inode)->i_mmap_sem);
+ down_write(&inode->i_mapping_sem);
truncate_inode_pages(inode->i_mapping, inode->i_size);
ext4_truncate(inode);
- up_write(&EXT4_I(inode)->i_mmap_sem);
+ up_write(&inode->i_mapping_sem);
}

/*
--
2.31.0.99.g0d91da736d9f.dirty

2021-04-13 14:46:42

by Jan Kara

Subject: [PATCH 6/7] zonefs: Convert to using i_mapping_sem

Use i_mapping_sem instead of zonefs' private i_mmap_sem. The intended
purpose is exactly the same.

Signed-off-by: Jan Kara <[email protected]>
---
fs/zonefs/super.c | 23 +++++------------------
fs/zonefs/zonefs.h | 7 +++----
2 files changed, 8 insertions(+), 22 deletions(-)

diff --git a/fs/zonefs/super.c b/fs/zonefs/super.c
index 049e36c69ed7..ace6f3a223da 100644
--- a/fs/zonefs/super.c
+++ b/fs/zonefs/super.c
@@ -462,7 +462,7 @@ static int zonefs_file_truncate(struct inode *inode, loff_t isize)
inode_dio_wait(inode);

/* Serialize against page faults */
- down_write(&zi->i_mmap_sem);
+ down_write(&inode->i_mapping_sem);

/* Serialize against zonefs_iomap_begin() */
mutex_lock(&zi->i_truncate_mutex);
@@ -500,7 +500,7 @@ static int zonefs_file_truncate(struct inode *inode, loff_t isize)

unlock:
mutex_unlock(&zi->i_truncate_mutex);
- up_write(&zi->i_mmap_sem);
+ up_write(&inode->i_mapping_sem);

return ret;
}
@@ -575,18 +575,6 @@ static int zonefs_file_fsync(struct file *file, loff_t start, loff_t end,
return ret;
}

-static vm_fault_t zonefs_filemap_fault(struct vm_fault *vmf)
-{
- struct zonefs_inode_info *zi = ZONEFS_I(file_inode(vmf->vma->vm_file));
- vm_fault_t ret;
-
- down_read(&zi->i_mmap_sem);
- ret = filemap_fault(vmf);
- up_read(&zi->i_mmap_sem);
-
- return ret;
-}
-
static vm_fault_t zonefs_filemap_page_mkwrite(struct vm_fault *vmf)
{
struct inode *inode = file_inode(vmf->vma->vm_file);
@@ -607,16 +595,16 @@ static vm_fault_t zonefs_filemap_page_mkwrite(struct vm_fault *vmf)
file_update_time(vmf->vma->vm_file);

/* Serialize against truncates */
- down_read(&zi->i_mmap_sem);
+ down_read(&inode->i_mapping_sem);
ret = iomap_page_mkwrite(vmf, &zonefs_iomap_ops);
- up_read(&zi->i_mmap_sem);
+ up_read(&inode->i_mapping_sem);

sb_end_pagefault(inode->i_sb);
return ret;
}

static const struct vm_operations_struct zonefs_file_vm_ops = {
- .fault = zonefs_filemap_fault,
+ .fault = filemap_fault,
.map_pages = filemap_map_pages,
.page_mkwrite = zonefs_filemap_page_mkwrite,
};
@@ -1158,7 +1146,6 @@ static struct inode *zonefs_alloc_inode(struct super_block *sb)

inode_init_once(&zi->i_vnode);
mutex_init(&zi->i_truncate_mutex);
- init_rwsem(&zi->i_mmap_sem);
zi->i_wr_refcnt = 0;

return &zi->i_vnode;
diff --git a/fs/zonefs/zonefs.h b/fs/zonefs/zonefs.h
index 51141907097c..d00626442ab5 100644
--- a/fs/zonefs/zonefs.h
+++ b/fs/zonefs/zonefs.h
@@ -70,12 +70,11 @@ struct zonefs_inode_info {
* and changes to the inode private data, and in particular changes to
* a sequential file size on completion of direct IO writes.
* Serialization of mmap read IOs with truncate and syscall IO
- * operations is done with i_mmap_sem in addition to i_truncate_mutex.
- * Only zonefs_seq_file_truncate() takes both lock (i_mmap_sem first,
- * i_truncate_mutex second).
+ * operations is done with i_mapping_sem in addition to
+ * i_truncate_mutex. Only zonefs_seq_file_truncate() takes both lock
+ * (i_mapping_sem first, i_truncate_mutex second).
*/
struct mutex i_truncate_mutex;
- struct rw_semaphore i_mmap_sem;

/* guarded by i_truncate_mutex */
unsigned int i_wr_refcnt;
--
2.31.0.99.g0d91da736d9f.dirty

2021-04-13 14:48:11

by Jan Kara

Subject: [PATCH 4/7] ext2: Convert to using i_mapping_sem

Ext2 has its private dax_sem used for synchronizing page faults and
truncation. Use inode->i_mapping_sem instead as it is meant for this
purpose.

Signed-off-by: Jan Kara <[email protected]>
---
fs/ext2/ext2.h | 11 -----------
fs/ext2/file.c | 7 +++----
fs/ext2/inode.c | 12 ++++++------
fs/ext2/super.c | 3 ---
4 files changed, 9 insertions(+), 24 deletions(-)

diff --git a/fs/ext2/ext2.h b/fs/ext2/ext2.h
index 3309fb2d327a..96d5ea5df78b 100644
--- a/fs/ext2/ext2.h
+++ b/fs/ext2/ext2.h
@@ -671,9 +671,6 @@ struct ext2_inode_info {
struct rw_semaphore xattr_sem;
#endif
rwlock_t i_meta_lock;
-#ifdef CONFIG_FS_DAX
- struct rw_semaphore dax_sem;
-#endif

/*
* truncate_mutex is for serialising ext2_truncate() against
@@ -689,14 +686,6 @@ struct ext2_inode_info {
#endif
};

-#ifdef CONFIG_FS_DAX
-#define dax_sem_down_write(ext2_inode) down_write(&(ext2_inode)->dax_sem)
-#define dax_sem_up_write(ext2_inode) up_write(&(ext2_inode)->dax_sem)
-#else
-#define dax_sem_down_write(ext2_inode)
-#define dax_sem_up_write(ext2_inode)
-#endif
-
/*
* Inode dynamic state flags
*/
diff --git a/fs/ext2/file.c b/fs/ext2/file.c
index 96044f5dbc0e..70f0242b7167 100644
--- a/fs/ext2/file.c
+++ b/fs/ext2/file.c
@@ -81,7 +81,7 @@ static ssize_t ext2_dax_write_iter(struct kiocb *iocb, struct iov_iter *from)
*
* mmap_lock (MM)
* sb_start_pagefault (vfs, freeze)
- * ext2_inode_info->dax_sem
+ * inode->i_mapping_sem
* address_space->i_mmap_rwsem or page_lock (mutually exclusive in DAX)
* ext2_inode_info->truncate_mutex
*
@@ -91,7 +91,6 @@ static ssize_t ext2_dax_write_iter(struct kiocb *iocb, struct iov_iter *from)
static vm_fault_t ext2_dax_fault(struct vm_fault *vmf)
{
struct inode *inode = file_inode(vmf->vma->vm_file);
- struct ext2_inode_info *ei = EXT2_I(inode);
vm_fault_t ret;
bool write = (vmf->flags & FAULT_FLAG_WRITE) &&
(vmf->vma->vm_flags & VM_SHARED);
@@ -100,11 +99,11 @@ static vm_fault_t ext2_dax_fault(struct vm_fault *vmf)
sb_start_pagefault(inode->i_sb);
file_update_time(vmf->vma->vm_file);
}
- down_read(&ei->dax_sem);
+ down_read(&inode->i_mapping_sem);

ret = dax_iomap_fault(vmf, PE_SIZE_PTE, NULL, NULL, &ext2_iomap_ops);

- up_read(&ei->dax_sem);
+ up_read(&inode->i_mapping_sem);
if (write)
sb_end_pagefault(inode->i_sb);
return ret;
diff --git a/fs/ext2/inode.c b/fs/ext2/inode.c
index 68178b2234bd..018c6d62b744 100644
--- a/fs/ext2/inode.c
+++ b/fs/ext2/inode.c
@@ -1175,7 +1175,7 @@ static void ext2_free_branches(struct inode *inode, __le32 *p, __le32 *q, int de
ext2_free_data(inode, p, q);
}

-/* dax_sem must be held when calling this function */
+/* inode->i_mapping_sem must be held when calling this function */
static void __ext2_truncate_blocks(struct inode *inode, loff_t offset)
{
__le32 *i_data = EXT2_I(inode)->i_data;
@@ -1192,7 +1192,7 @@ static void __ext2_truncate_blocks(struct inode *inode, loff_t offset)
iblock = (offset + blocksize-1) >> EXT2_BLOCK_SIZE_BITS(inode->i_sb);

#ifdef CONFIG_FS_DAX
- WARN_ON(!rwsem_is_locked(&ei->dax_sem));
+ WARN_ON(!rwsem_is_locked(&inode->i_mapping_sem));
#endif

n = ext2_block_to_path(inode, iblock, offsets, NULL);
@@ -1274,9 +1274,9 @@ static void ext2_truncate_blocks(struct inode *inode, loff_t offset)
if (ext2_inode_is_fast_symlink(inode))
return;

- dax_sem_down_write(EXT2_I(inode));
+ down_write(&inode->i_mapping_sem);
__ext2_truncate_blocks(inode, offset);
- dax_sem_up_write(EXT2_I(inode));
+ up_write(&inode->i_mapping_sem);
}

static int ext2_setsize(struct inode *inode, loff_t newsize)
@@ -1306,10 +1306,10 @@ static int ext2_setsize(struct inode *inode, loff_t newsize)
if (error)
return error;

- dax_sem_down_write(EXT2_I(inode));
+ down_write(&inode->i_mapping_sem);
truncate_setsize(inode, newsize);
__ext2_truncate_blocks(inode, newsize);
- dax_sem_up_write(EXT2_I(inode));
+ up_write(&inode->i_mapping_sem);

inode->i_mtime = inode->i_ctime = current_time(inode);
if (inode_needs_sync(inode)) {
diff --git a/fs/ext2/super.c b/fs/ext2/super.c
index 6c4753277916..143e72c95acf 100644
--- a/fs/ext2/super.c
+++ b/fs/ext2/super.c
@@ -206,9 +206,6 @@ static void init_once(void *foo)
init_rwsem(&ei->xattr_sem);
#endif
mutex_init(&ei->truncate_mutex);
-#ifdef CONFIG_FS_DAX
- init_rwsem(&ei->dax_sem);
-#endif
inode_init_once(&ei->vfs_inode);
}

--
2.31.0.99.g0d91da736d9f.dirty

2021-04-13 14:48:55

by Jan Kara

Subject: [PATCH 5/7] xfs: Convert to use i_mapping_sem

Use i_mapping_sem instead of XFS internal i_mmap_lock. The intended
purpose of i_mapping_sem is exactly the same.

Signed-off-by: Jan Kara <[email protected]>
---
fs/xfs/xfs_file.c | 12 ++++++-----
fs/xfs/xfs_inode.c | 51 +++++++++++++++++++++++++---------------------
fs/xfs/xfs_inode.h | 1 -
fs/xfs/xfs_super.c | 2 --
4 files changed, 35 insertions(+), 31 deletions(-)

diff --git a/fs/xfs/xfs_file.c b/fs/xfs/xfs_file.c
index a007ca0711d9..a76404708312 100644
--- a/fs/xfs/xfs_file.c
+++ b/fs/xfs/xfs_file.c
@@ -1282,7 +1282,7 @@ xfs_file_llseek(
*
* mmap_lock (MM)
* sb_start_pagefault(vfs, freeze)
- * i_mmaplock (XFS - truncate serialisation)
+ * i_mapping_sem (XFS - truncate serialisation)
* page_lock (MM)
* i_lock (XFS - extent map serialisation)
*/
@@ -1303,24 +1303,26 @@ __xfs_filemap_fault(
file_update_time(vmf->vma->vm_file);
}

- xfs_ilock(XFS_I(inode), XFS_MMAPLOCK_SHARED);
if (IS_DAX(inode)) {
pfn_t pfn;

+ xfs_ilock(XFS_I(inode), XFS_MMAPLOCK_SHARED);
ret = dax_iomap_fault(vmf, pe_size, &pfn, NULL,
(write_fault && !vmf->cow_page) ?
&xfs_direct_write_iomap_ops :
&xfs_read_iomap_ops);
if (ret & VM_FAULT_NEEDDSYNC)
ret = dax_finish_sync_fault(vmf, pe_size, pfn);
+ xfs_iunlock(XFS_I(inode), XFS_MMAPLOCK_SHARED);
} else {
- if (write_fault)
+ if (write_fault) {
+ xfs_ilock(XFS_I(inode), XFS_MMAPLOCK_SHARED);
ret = iomap_page_mkwrite(vmf,
&xfs_buffered_write_iomap_ops);
- else
+ xfs_iunlock(XFS_I(inode), XFS_MMAPLOCK_SHARED);
+ } else
ret = filemap_fault(vmf);
}
- xfs_iunlock(XFS_I(inode), XFS_MMAPLOCK_SHARED);

if (write_fault)
sb_end_pagefault(inode->i_sb);
diff --git a/fs/xfs/xfs_inode.c b/fs/xfs/xfs_inode.c
index f93370bd7b1e..bffceaec11a9 100644
--- a/fs/xfs/xfs_inode.c
+++ b/fs/xfs/xfs_inode.c
@@ -134,7 +134,7 @@ xfs_ilock_attr_map_shared(

/*
* In addition to i_rwsem in the VFS inode, the xfs inode contains 2
- * multi-reader locks: i_mmap_lock and the i_lock. This routine allows
+ * multi-reader locks: i_mapping_sem and the i_lock. This routine allows
* various combinations of the locks to be obtained.
*
* The 3 locks should always be ordered so that the IO lock is obtained first,
@@ -142,23 +142,23 @@ xfs_ilock_attr_map_shared(
*
* Basic locking order:
*
- * i_rwsem -> i_mmap_lock -> page_lock -> i_ilock
+ * i_rwsem -> i_mapping_sem -> page_lock -> i_ilock
*
* mmap_lock locking order:
*
* i_rwsem -> page lock -> mmap_lock
- * mmap_lock -> i_mmap_lock -> page_lock
+ * mmap_lock -> i_mapping_sem -> page_lock
*
* The difference in mmap_lock locking order mean that we cannot hold the
- * i_mmap_lock over syscall based read(2)/write(2) based IO. These IO paths can
- * fault in pages during copy in/out (for buffered IO) or require the mmap_lock
- * in get_user_pages() to map the user pages into the kernel address space for
- * direct IO. Similarly the i_rwsem cannot be taken inside a page fault because
- * page faults already hold the mmap_lock.
+ * i_mapping_sem over syscall based read(2)/write(2) based IO. These IO paths
+ * can fault in pages during copy in/out (for buffered IO) or require the
+ * mmap_lock in get_user_pages() to map the user pages into the kernel address
+ * space for direct IO. Similarly the i_rwsem cannot be taken inside a page
+ * fault because page faults already hold the mmap_lock.
*
* Hence to serialise fully against both syscall and mmap based IO, we need to
- * take both the i_rwsem and the i_mmap_lock. These locks should *only* be both
- * taken in places where we need to invalidate the page cache in a race
+ * take both the i_rwsem and the i_mapping_sem. These locks should *only* be
+ * both taken in places where we need to invalidate the page cache in a race
* free manner (e.g. truncate, hole punch and other extent manipulation
* functions).
*/
@@ -190,10 +190,13 @@ xfs_ilock(
XFS_IOLOCK_DEP(lock_flags));
}

- if (lock_flags & XFS_MMAPLOCK_EXCL)
- mrupdate_nested(&ip->i_mmaplock, XFS_MMAPLOCK_DEP(lock_flags));
- else if (lock_flags & XFS_MMAPLOCK_SHARED)
- mraccess_nested(&ip->i_mmaplock, XFS_MMAPLOCK_DEP(lock_flags));
+ if (lock_flags & XFS_MMAPLOCK_EXCL) {
+ down_write_nested(&VFS_I(ip)->i_mapping_sem,
+ XFS_MMAPLOCK_DEP(lock_flags));
+ } else if (lock_flags & XFS_MMAPLOCK_SHARED) {
+ down_read_nested(&VFS_I(ip)->i_mapping_sem,
+ XFS_MMAPLOCK_DEP(lock_flags));
+ }

if (lock_flags & XFS_ILOCK_EXCL)
mrupdate_nested(&ip->i_lock, XFS_ILOCK_DEP(lock_flags));
@@ -242,10 +245,10 @@ xfs_ilock_nowait(
}

if (lock_flags & XFS_MMAPLOCK_EXCL) {
- if (!mrtryupdate(&ip->i_mmaplock))
+ if (!down_write_trylock(&VFS_I(ip)->i_mapping_sem))
goto out_undo_iolock;
} else if (lock_flags & XFS_MMAPLOCK_SHARED) {
- if (!mrtryaccess(&ip->i_mmaplock))
+ if (!down_read_trylock(&VFS_I(ip)->i_mapping_sem))
goto out_undo_iolock;
}

@@ -260,9 +263,9 @@ xfs_ilock_nowait(

out_undo_mmaplock:
if (lock_flags & XFS_MMAPLOCK_EXCL)
- mrunlock_excl(&ip->i_mmaplock);
+ up_write(&VFS_I(ip)->i_mapping_sem);
else if (lock_flags & XFS_MMAPLOCK_SHARED)
- mrunlock_shared(&ip->i_mmaplock);
+ up_read(&VFS_I(ip)->i_mapping_sem);
out_undo_iolock:
if (lock_flags & XFS_IOLOCK_EXCL)
up_write(&VFS_I(ip)->i_rwsem);
@@ -309,9 +312,9 @@ xfs_iunlock(
up_read(&VFS_I(ip)->i_rwsem);

if (lock_flags & XFS_MMAPLOCK_EXCL)
- mrunlock_excl(&ip->i_mmaplock);
+ up_write(&VFS_I(ip)->i_mapping_sem);
else if (lock_flags & XFS_MMAPLOCK_SHARED)
- mrunlock_shared(&ip->i_mmaplock);
+ up_read(&VFS_I(ip)->i_mapping_sem);

if (lock_flags & XFS_ILOCK_EXCL)
mrunlock_excl(&ip->i_lock);
@@ -337,7 +340,7 @@ xfs_ilock_demote(
if (lock_flags & XFS_ILOCK_EXCL)
mrdemote(&ip->i_lock);
if (lock_flags & XFS_MMAPLOCK_EXCL)
- mrdemote(&ip->i_mmaplock);
+ downgrade_write(&VFS_I(ip)->i_mapping_sem);
if (lock_flags & XFS_IOLOCK_EXCL)
downgrade_write(&VFS_I(ip)->i_rwsem);

@@ -358,8 +361,10 @@ xfs_isilocked(

if (lock_flags & (XFS_MMAPLOCK_EXCL|XFS_MMAPLOCK_SHARED)) {
if (!(lock_flags & XFS_MMAPLOCK_SHARED))
- return !!ip->i_mmaplock.mr_writer;
- return rwsem_is_locked(&ip->i_mmaplock.mr_lock);
+ return !debug_locks ||
+ lockdep_is_held_type(&VFS_I(ip)->i_mapping_sem,
+ 0);
+ return rwsem_is_locked(&VFS_I(ip)->i_mapping_sem);
}

if (lock_flags & (XFS_IOLOCK_EXCL|XFS_IOLOCK_SHARED)) {
diff --git a/fs/xfs/xfs_inode.h b/fs/xfs/xfs_inode.h
index 88ee4c3930ae..147537be751c 100644
--- a/fs/xfs/xfs_inode.h
+++ b/fs/xfs/xfs_inode.h
@@ -40,7 +40,6 @@ typedef struct xfs_inode {
/* Transaction and locking information. */
struct xfs_inode_log_item *i_itemp; /* logging information */
mrlock_t i_lock; /* inode lock */
- mrlock_t i_mmaplock; /* inode mmap IO lock */
atomic_t i_pincount; /* inode pin count */

/*
diff --git a/fs/xfs/xfs_super.c b/fs/xfs/xfs_super.c
index e5e0713bebcd..a1536a02aaa5 100644
--- a/fs/xfs/xfs_super.c
+++ b/fs/xfs/xfs_super.c
@@ -715,8 +715,6 @@ xfs_fs_inode_init_once(
atomic_set(&ip->i_pincount, 0);
spin_lock_init(&ip->i_flags_lock);

- mrlock_init(&ip->i_mmaplock, MRLOCK_ALLOW_EQUAL_PRI|MRLOCK_BARRIER,
- "xfsino", ip->i_ino);
mrlock_init(&ip->i_lock, MRLOCK_ALLOW_EQUAL_PRI|MRLOCK_BARRIER,
"xfsino", ip->i_ino);
}
--
2.31.0.99.g0d91da736d9f.dirty

2021-04-13 15:22:38

by Christoph Hellwig

Subject: Re: [PATCH 2/7] mm: Protect operations adding pages to page cache with i_mapping_lock

> if (error == AOP_TRUNCATED_PAGE)
> put_page(page);
> + up_read(&mapping->host->i_mapping_sem);
> return error;

Please add an unlock_mapping label above this up_read and consolidate
most of the other unlocks by jumping there (put_and_wait_on_page_locked
probablt can't use it).

> truncated:
> unlock_page(page);
> @@ -2309,6 +2324,7 @@ static int filemap_update_page(struct kiocb *iocb,
> return AOP_TRUNCATED_PAGE;

The truncated case actually seems to miss the unlock.

Similarly I think filemap_fault would benefit from a common
unlock path.

2021-04-13 15:23:12

by Christoph Hellwig

Subject: Re: [PATCH 5/7] xfs: Convert to use i_mapping_sem

On Tue, Apr 13, 2021 at 01:28:49PM +0200, Jan Kara wrote:
> Use i_mapping_sem instead of XFS internal i_mmap_lock. The intended
> purpose of i_mapping_sem is exactly the same.

Might be worth mentioning here that the locking in __xfs_filemap_fault
changes because filemap_fault already takes i_mapping_sem?

> * mmap_lock (MM)
> * sb_start_pagefault(vfs, freeze)
> - * i_mmaplock (XFS - truncate serialisation)
> + * i_mapping_sem (XFS - truncate serialisation)

This is sort of VFS now, isn't it?

Otherwise looks good:

Reviewed-by: Christoph Hellwig <[email protected]>

2021-04-13 15:23:18

by Christoph Hellwig

Subject: Re: [PATCH 0/7 RFC v3] fs: Hole punch vs page cache filling races

> Also when writing the documentation I came across one question: Do we mandate
> i_mapping_sem for truncate + hole punch for all filesystems or just for
> filesystems that support hole punching (or other complex fallocate operations)?
> I wrote the documentation so that we require every filesystem to use
> i_mapping_sem. This makes locking rules simpler, we can also add asserts when
> all filesystems are converted. The downside is that simple filesystems now pay
> the overhead of the locking unnecessary for them. The overhead is small
> (uncontended rwsem acquisition for truncate) so I don't think we care and the
> simplicity is worth it but I wanted to spell this out.

I think requiring it for all filesystems makes for rules that are much
easier to understand and document, so I'd shoot for that eventually.

Btw, what about locking for DAX faults? XFS seems to take
the mmap sem for those as well currently.

2021-04-13 15:29:36

by Jan Kara

Subject: Re: [PATCH 5/7] xfs: Convert to use i_mapping_sem

On Tue 13-04-21 14:05:12, Christoph Hellwig wrote:
> On Tue, Apr 13, 2021 at 01:28:49PM +0200, Jan Kara wrote:
> > Use i_mapping_sem instead of XFS internal i_mmap_lock. The intended
> > purpose of i_mapping_sem is exactly the same.
>
> Might be worth mentioning here that the locking in __xfs_filemap_fault
> changes because filemap_fault already takes i_mapping_sem?

Sure, will add.

>
> > * mmap_lock (MM)
> > * sb_start_pagefault(vfs, freeze)
> > - * i_mmaplock (XFS - truncate serialisation)
> > + * i_mapping_sem (XFS - truncate serialisation)
>
> This is sort of VFS now, isn't it?

Right, I'll update the comment.

> Otherwise looks good:
>
> Reviewed-by: Christoph Hellwig <[email protected]>

Thanks!

Honza

--
Jan Kara <[email protected]>
SUSE Labs, CR

2021-04-13 15:33:45

by Jan Kara

Subject: Re: [PATCH 2/7] mm: Protect operations adding pages to page cache with i_mapping_lock

On Tue 13-04-21 13:57:46, Christoph Hellwig wrote:
> > if (error == AOP_TRUNCATED_PAGE)
> > put_page(page);
> > + up_read(&mapping->host->i_mapping_sem);
> > return error;
>
> Please add an unlock_mapping label above this up_read and consolidate
> most of the other unlocks by jumping there (put_and_wait_on_page_locked
probably can't use it).

Yeah, I've actually simplified the labels even a bit more like:

...
error = filemap_read_page(iocb->ki_filp, mapping, page);
goto unlock_mapping;
unlock:
unlock_page(page);
unlock_mapping:
up_read(&mapping->host->i_mapping_sem);
if (error == AOP_TRUNCATED_PAGE)
put_page(page);
return error;

and everything now jumps to either unlock or unlock_mapping (except for
the put_and_wait_on_page_locked() case).

> > truncated:
> > unlock_page(page);
> > @@ -2309,6 +2324,7 @@ static int filemap_update_page(struct kiocb *iocb,
> > return AOP_TRUNCATED_PAGE;
>
> The truncated case actually seems to miss the unlock.
>
> Similarly I think filemap_fault would benefit from a common
> unlock path.

Right, thanks for catching that!

Honza
--
Jan Kara <[email protected]>
SUSE Labs, CR

2021-04-13 18:36:53

by Jan Kara

Subject: Re: [PATCH 0/7 RFC v3] fs: Hole punch vs page cache filling races

On Tue 13-04-21 14:09:50, Christoph Hellwig wrote:
> > Also when writing the documentation I came across one question: Do we mandate
> > i_mapping_sem for truncate + hole punch for all filesystems or just for
> > filesystems that support hole punching (or other complex fallocate operations)?
> > I wrote the documentation so that we require every filesystem to use
> > i_mapping_sem. This makes locking rules simpler, we can also add asserts when
> > all filesystems are converted. The downside is that simple filesystems now pay
> > the overhead of the locking unnecessary for them. The overhead is small
> > (uncontended rwsem acquisition for truncate) so I don't think we care and the
> > simplicity is worth it but I wanted to spell this out.
>
> I think all makes for much better to understand and document rules,
> so I'd shoot for that eventually.

OK.

> Btw, what about locking for DAX faults? XFS seems to take
> the mmap sem for those as well currently.

Yes, I've mechanically converted all those uses to i_mapping_sem for XFS,
ext4, and ext2 as well. Longer term we may be able to move some locking
into generic DAX code now that the lock is in struct inode. But I want to
leave that for later since DAX locking is different enough that it needs
some careful thinking and justification...

Honza
--
Jan Kara <[email protected]>
SUSE Labs, CR

2021-04-14 07:50:32

by Dave Chinner

Subject: Re: [PATCH 2/7] mm: Protect operations adding pages to page cache with i_mapping_lock

On Tue, Apr 13, 2021 at 01:28:46PM +0200, Jan Kara wrote:
> Currently, serializing operations such as page fault, read, or readahead
> against hole punching is rather difficult. The basic race scheme is
> like:
>
> fallocate(FALLOC_FL_PUNCH_HOLE) read / fault / ..
> truncate_inode_pages_range()
> <create pages in page
> cache here>
> <update fs block mapping and free blocks>
>
> Now the problem is in this way read / page fault / readahead can
> instantiate pages in page cache with potentially stale data (if blocks
> get quickly reused). Avoiding this race is not simple - page locks do
> not work because we want to make sure there are *no* pages in given
> range. inode->i_rwsem does not work because page fault happens under
> mmap_sem which ranks below inode->i_rwsem. Also using it for reads makes
> the performance for mixed read-write workloads suffer.
>
> So create a new rw_semaphore in the inode - i_mapping_sem - that
> protects adding of pages to page cache for page faults / reads /
> readahead.
>
> Signed-off-by: Jan Kara <[email protected]>
....
> diff --git a/mm/filemap.c b/mm/filemap.c
> index bd7c50e060a9..bc82a7856d3e 100644
> --- a/mm/filemap.c
> +++ b/mm/filemap.c
> @@ -77,7 +77,8 @@
> * ->i_pages lock
> *
> * ->i_rwsem
> - * ->i_mmap_rwsem (truncate->unmap_mapping_range)
> + * ->i_mapping_sem (acquired by fs in truncate path)
> + * ->i_mmap_rwsem (truncate->unmap_mapping_range)

This is now officially a confusing mess.

We have inode->i_mapping that points to an address space, which has
an internal i_mmap variable and an i_mmap_rwsem that serialises
access to the i_mmap tree.

Now we have an inode->i_mapping_sem (which is actually an rwsem, not a
sem) that sorta serialises additions to the mapping tree
(inode->i_mapping->i_pages) against truncate, but it does not
serialise accesses to the rest of the inode->i_mapping structure
itself despite the similarities in naming.

Then we have the inode_lock() API and the i_mmap_lock() API that
wrap around the i_rwsem and i_mmap_rwsem, but there's no API for
inode_mapping_lock()...

Then we have the mmap_lock in the page fault path as well, which is
also an rwsem despite the name, and that protects something
completely different to the i_mmap and the i_mapping.

IOWs, we have 4 layers of entwined structures and locking here
that pretty much all have the same name but protect different things
and not always the obvious thing the name suggests. This makes it
really difficult to actually read the code and understand that the
correct lock is being used in the correct place...


> *
> * ->mmap_lock
> * ->i_mmap_rwsem
> @@ -85,7 +86,8 @@
> * ->i_pages lock (arch-dependent flush_dcache_mmap_lock)
> *
> * ->mmap_lock
> - * ->lock_page (access_process_vm)
> + * ->i_mapping_sem (filemap_fault)
> + * ->lock_page (filemap_fault, access_process_vm)
> *
> * ->i_rwsem (generic_perform_write)
> * ->mmap_lock (fault_in_pages_readable->do_page_fault)
> @@ -2276,16 +2278,28 @@ static int filemap_update_page(struct kiocb *iocb,
> {
> int error;
>
> + if (iocb->ki_flags & IOCB_NOWAIT) {
> + if (!down_read_trylock(&mapping->host->i_mapping_sem))
> + return -EAGAIN;
> + } else {
> + down_read(&mapping->host->i_mapping_sem);
> + }

We really need a lock primitive for this. The number of times this
exact lock pattern is being replicated all through the IO path is
getting out of hand.

static inline bool
down_read_try_or_lock(struct rw_semaphore *sem, bool try)
{
if (try) {
if (!down_read_trylock(sem))
return false;
} else {
down_read(sem);
}
return true;
}

and the callers become:

if (!down_read_try_or_lock(sem, (iocb->ki_flags & IOCB_NOWAIT)))
return -EAGAIN;

We can do the same with mutex_try_or_lock(), down_try_or_lock(), etc
and we don't need to rely on cargo cult knowledge to propagate this
pattern anymore. Because I'm betting relatively few people actually
know why the code is written this way because the only place it is
documented is in an XFS commit message....

Doing this is a separate cleanup, though, and not something that
needs to be done in this patchset.

> index c5b0457415be..ac5bb50b3a4c 100644
> --- a/mm/readahead.c
> +++ b/mm/readahead.c
> @@ -192,6 +192,7 @@ void page_cache_ra_unbounded(struct readahead_control *ractl,
> */
> unsigned int nofs = memalloc_nofs_save();
>
> + down_read(&mapping->host->i_mapping_sem);
> /*
> * Preallocate as many pages as we will need.
> */

I can't say I'm a great fan of having the mapping reach back up to
the host to lock the host. This seems the wrong way around to me
given that most of the locking in the IO path is in "host locks
mapping" and "mapping locks internal mapping structures" order...

I also come back to the naming confusion here, in that when we look
at this in long hand from the inode perspective, this chain actually
looks like:

lock(inode->i_mapping->inode->i_mapping_sem)

i.e. the mapping is reaching back up outside its scope to lock
itself against other inode->i_mapping operations. Smells of layering
violations to me.

So, next question: should this truncate semaphore actually be part
of the address space, not the inode? This patch is actually moving
the page fault serialisation from the inode into the address space
operations when page faults and page cache operations are done, so
maybe the lock should also make that move? That would help clear up
the naming problem, because now we can name it based around what it
serialises in the address space, not the address space as a whole...

Cheers,

Dave.
--
Dave Chinner
[email protected]

2021-04-14 15:42:58

by Jan Kara

Subject: Re: [PATCH 2/7] mm: Protect operations adding pages to page cache with i_mapping_lock

On Wed 14-04-21 10:01:13, Dave Chinner wrote:
> On Tue, Apr 13, 2021 at 01:28:46PM +0200, Jan Kara wrote:
> > *
> > * ->mmap_lock
> > * ->i_mmap_rwsem
> > @@ -85,7 +86,8 @@
> > * ->i_pages lock (arch-dependent flush_dcache_mmap_lock)
> > *
> > * ->mmap_lock
> > - * ->lock_page (access_process_vm)
> > + * ->i_mapping_sem (filemap_fault)
> > + * ->lock_page (filemap_fault, access_process_vm)
> > *
> > * ->i_rwsem (generic_perform_write)
> > * ->mmap_lock (fault_in_pages_readable->do_page_fault)
> > @@ -2276,16 +2278,28 @@ static int filemap_update_page(struct kiocb *iocb,
> > {
> > int error;
> >
> > + if (iocb->ki_flags & IOCB_NOWAIT) {
> > + if (!down_read_trylock(&mapping->host->i_mapping_sem))
> > + return -EAGAIN;
> > + } else {
> > + down_read(&mapping->host->i_mapping_sem);
> > + }
>
> We really need a lock primitive for this. The number of times this
> exact lock pattern is being replicated all through the IO path is
> getting out of hand.
>
> static inline bool
> down_read_try_or_lock(struct rw_semaphore *sem, bool try)
> {
> if (try) {
> if (!down_read_trylock(sem))
> return false;
> } else {
> down_read(sem);
> }
> return true;
> }
>
> and the callers become:
>
> if (!down_read_try_or_lock(sem, (iocb->ki_flags & IOCB_NOWAIT)))
> return -EAGAIN;
>
> We can do the same with mutex_try_or_lock(), down_try_or_lock(), etc
> and we don't need to rely on cargo cult knowledge to propagate this
> pattern anymore. Because I'm betting relatively few people actually
> know why the code is written this way because the only place it is
> documented is in an XFS commit message....
>
> Doing this is a separate cleanup, though, and not something that
> needs to be done in this patchset.

Yep, good idea but let's do it in a separate patch set.

> > index c5b0457415be..ac5bb50b3a4c 100644
> > --- a/mm/readahead.c
> > +++ b/mm/readahead.c
> > @@ -192,6 +192,7 @@ void page_cache_ra_unbounded(struct readahead_control *ractl,
> > */
> > unsigned int nofs = memalloc_nofs_save();
> >
> > + down_read(&mapping->host->i_mapping_sem);
> > /*
> > * Preallocate as many pages as we will need.
> > */
>
> I can't say I'm a great fan of having the mapping reach back up to
> the host to lock the host. THis seems the wrong way around to me
> given that most of the locking in the IO path is in "host locks
> mapping" and "mapping locks internal mapping structures" order...
>
> I also come back to the naming confusion here, in that when we look
> at this in long hand from the inode perspective, this chain actually
> looks like:
>
> lock(inode->i_mapping->inode->i_mapping_sem)
>
> i.e. the mapping is reaching back up outside it's scope to lock
> itself against other inode->i_mapping operations. Smells of layering
> violations to me.
>
> So, next question: should this truncate semanphore actually be part
> of the address space, not the inode? This patch is actually moving
> the page fault serialisation from the inode into the address space
> operations when page faults and page cache operations are done, so
> maybe the lock should also make that move? That would help clear up
> the naming problem, because now we can name it based around what it
> serialises in the address space, not the address space as a whole...

I think that moving the lock to address_space makes some sense although the
lock actually protects consistency of inode->i_mapping->i_pages with
whatever the filesystem has in its file_offset->disk_block mapping
structures (which are generally associated with the inode). So it is not
only about inode->i_mapping contents but I agree that struct address_space
is probably a bit more logical place than struct inode.

Regarding the name: How about i_pages_rwsem? The lock is protecting
invalidation of mapping->i_pages and needs to be held until insertion of
pages into i_pages is safe again...

Honza
--
Jan Kara <[email protected]>
SUSE Labs, CR

2021-04-15 00:48:24

by Dave Chinner

Subject: Re: [PATCH 2/7] mm: Protect operations adding pages to page cache with i_mapping_lock

On Wed, Apr 14, 2021 at 02:23:19PM +0200, Jan Kara wrote:
> On Wed 14-04-21 10:01:13, Dave Chinner wrote:
> > On Tue, Apr 13, 2021 at 01:28:46PM +0200, Jan Kara wrote:
> > > index c5b0457415be..ac5bb50b3a4c 100644
> > > --- a/mm/readahead.c
> > > +++ b/mm/readahead.c
> > > @@ -192,6 +192,7 @@ void page_cache_ra_unbounded(struct readahead_control *ractl,
> > > */
> > > unsigned int nofs = memalloc_nofs_save();
> > >
> > > + down_read(&mapping->host->i_mapping_sem);
> > > /*
> > > * Preallocate as many pages as we will need.
> > > */
> >
> > I can't say I'm a great fan of having the mapping reach back up to
> > the host to lock the host. This seems the wrong way around to me
> > given that most of the locking in the IO path is in "host locks
> > mapping" and "mapping locks internal mapping structures" order...
> >
> > I also come back to the naming confusion here, in that when we look
> > at this in long hand from the inode perspective, this chain actually
> > looks like:
> >
> > lock(inode->i_mapping->inode->i_mapping_sem)
> >
> > i.e. the mapping is reaching back up outside its scope to lock
> > itself against other inode->i_mapping operations. Smells of layering
> > violations to me.
> >
> > So, next question: should this truncate semaphore actually be part
> > of the address space, not the inode? This patch is actually moving
> > the page fault serialisation from the inode into the address space
> > operations when page faults and page cache operations are done, so
> > maybe the lock should also make that move? That would help clear up
> > the naming problem, because now we can name it based around what it
> > serialises in the address space, not the address space as a whole...
>
> I think that moving the lock to address_space makes some sense although the
> lock actually protects consistency of inode->i_mapping->i_pages with
> whatever the filesystem has in its file_offset->disk_block mapping
> structures (which are generally associated with the inode).

Well, I look at it as a mechanism that the filesystem uses to ensure
coherency of the page cache accesses w.r.t. physical layout changes.
The layout is a property of the inode, but changes to the physical
layout of the inode are serialised by other inode based mechanisms.
The page cache isn't part of the inode - it's part of the address
space - but coherency with the inode is required. Hence inode
operations need to be able to ensure coherency of the address space
content and accesses w.r.t. physical layout changes of the inode,
but the address space really knows nothing about the physical layout
of the inode or how it gets changed...

Hence it's valid for the inode operations to lock the address space
to ensure coherency of the page cache when making physical layout
changes, but locking the address space, by itself, is not sufficient
to safely serialise against physical changes to the inode layout.

> So it is not
> only about inode->i_mapping contents but I agree that struct address_space
> is probably a bit more logical place than struct inode.

Yup. Remember that the XFS_MMAPLOCK arose at the inode level because
that was the only way the filesystem could achieve the necessary
serialisation of page cache accesses whilst doing physical layout
changes. So the lock became an "inode property" because of
implementation constraints, not because it was the best way to
implement the necessary coherency hooks.

> Regarding the name: How about i_pages_rwsem? The lock is protecting
> invalidation of mapping->i_pages and needs to be held until insertion of
> pages into i_pages is safe again...

I don't actually have a good name for this right now. :(

The i_pages structure has its own internal locking, so
i_pages_rwsem implies things that aren't necessarily true, and
taking a read lock for insertion for something that is named like a
structure protection lock creates cognitive dissonance...

I keep wanting to say "lock for invalidation" and "lock to exclude
invalidation" because those are the two actions that we need for
coherency of operations. But they are way too verbose for an actual
API...

So I want to call this an "invalidation lock" of some kind (no need
to encode the type in the name!), but haven't worked out a good
shorthand for "address space invalidation coherency mechanism"...

Naming is hard. :/

Cheers,

Dave.
--
Dave Chinner
[email protected]

2021-04-15 00:49:35

by Matthew Wilcox

[permalink] [raw]
Subject: Re: [PATCH 2/7] mm: Protect operations adding pages to page cache with i_mapping_lock

On Wed, Apr 14, 2021 at 10:01:13AM +1000, Dave Chinner wrote:
> > + if (iocb->ki_flags & IOCB_NOWAIT) {
> > + if (!down_read_trylock(&mapping->host->i_mapping_sem))
> > + return -EAGAIN;
> > + } else {
> > + down_read(&mapping->host->i_mapping_sem);
> > + }
>
> We really need a lock primitive for this. The number of times this
> exact lock pattern is being replicated all through the IO path is
> getting out of hand.
>
> static inline bool
> down_read_try_or_lock(struct rw_semaphore *sem, bool try)
> {
> if (try) {
> if (!down_read_trylock(sem))
> return false;
> } else {
> down_read(sem);
> }
> return true;
> }
>
> and the callers become:
>
> if (!down_read_try_or_lock(sem, (iocb->ki_flags & IOCB_NOWAIT)))
> return -EAGAIN;

I think that should be written:

ret = iocb_read_lock(iocb, &rwsem);
if (ret)
return ret;

and implemented as:

static inline int iocb_read_lock(struct kiocb *iocb, struct rw_semaphore *sem)
{
if (iocb->ki_flags & IOCB_NOWAIT)
return down_read_trylock(sem) ? 0 : -EAGAIN;
return down_read_killable(sem);
}

2021-04-15 02:05:56

by Dave Chinner

[permalink] [raw]
Subject: Re: [PATCH 2/7] mm: Protect operations adding pages to page cache with i_mapping_lock

On Wed, Apr 14, 2021 at 11:25:31PM +0100, Matthew Wilcox wrote:
> On Wed, Apr 14, 2021 at 10:01:13AM +1000, Dave Chinner wrote:
> > > + if (iocb->ki_flags & IOCB_NOWAIT) {
> > > + if (!down_read_trylock(&mapping->host->i_mapping_sem))
> > > + return -EAGAIN;
> > > + } else {
> > > + down_read(&mapping->host->i_mapping_sem);
> > > + }
> >
> > We really need a lock primitive for this. The number of times this
> > exact lock pattern is being replicated all through the IO path is
> > getting out of hand.
> >
> > static inline bool
> > down_read_try_or_lock(struct rw_semaphore *sem, bool try)
> > {
> > if (try) {
> > if (!down_read_trylock(sem))
> > return false;
> > } else {
> > down_read(sem);
> > }
> > return true;
> > }
> >
> > and the callers become:
> >
> > if (!down_read_try_or_lock(sem, (iocb->ki_flags & IOCB_NOWAIT)))
> > return -EAGAIN;
>
> I think that should be written:
>
> ret = iocb_read_lock(iocb, &rwsem);
> if (ret)
> return ret;
>
> and implemented as:
>
> static inline int iocb_read_lock(struct kiocb *iocb, struct rw_semaphore *sem)
> {
> if (iocb->ki_flags & IOCB_NOWAIT)
> return down_read_trylock(sem) ? 0 : -EAGAIN;
> return down_read_killable(sem);
> }

Yup, we have already done that with xfs_ilock_iocb(), but my point
is that this "non-blocking try lock or lock" pattern is slowly being
used in more places than just IOCB_NOWAIT situations. e.g. we use
it for IOMAP_NOWAIT locking in XFS, too, and ISTR other places where
optimistic locking is used are replicating it, too.

Hence my suggestion that this is moved up into the locking primitives,
not merely have context-specific wrappers added...

Cheers,

Dave.
--
Dave Chinner
[email protected]

2021-04-15 13:12:33

by Jan Kara

[permalink] [raw]
Subject: Re: [PATCH 2/7] mm: Protect operations adding pages to page cache with i_mapping_lock

On Thu 15-04-21 07:57:39, Dave Chinner wrote:
> On Wed, Apr 14, 2021 at 02:23:19PM +0200, Jan Kara wrote:
> > Regarding the name: How about i_pages_rwsem? The lock is protecting
> > invalidation of mapping->i_pages and needs to be held until insertion of
> > pages into i_pages is safe again...
>
> I don't actually have a good name for this right now. :(
>
> The i_pages structure has it's own internal locking, so
> i_pages_rwsem implies things that aren't necessarily true, and
> taking a read lock for insertion for something that is named like a
> structure protection lock creates cognitive dissonance...
>
> I keep wanting to say "lock for invalidation" and "lock to exclude
> invalidation" because those are the two actions that we need for
> coherency of operations. But they are way too verbose for an actual
> API...
>
> So I want to call this an "invalidation lock" of some kind (no need
> to encode the type in the name!), but haven't worked out a good
> shorthand for "address space invalidation coherency mechanism"...

So "invalidate_lock" was just next on my list of things to suggest so I'm
fine with that name. Or maybe block_invalidate_lock, block_remove_lock,
map_remove_lock, ... Dunno :).

Honza
--
Jan Kara <[email protected]>
SUSE Labs, CR

2021-04-19 16:11:52

by Matthew Wilcox

[permalink] [raw]
Subject: Re: [PATCH 0/7 RFC v3] fs: Hole punch vs page cache filling races

On Tue, Apr 13, 2021 at 01:28:44PM +0200, Jan Kara wrote:
> Also when writing the documentation I came across one question: Do we mandate
> i_mapping_sem for truncate + hole punch for all filesystems or just for
> filesystems that support hole punching (or other complex fallocate operations)?
> I wrote the documentation so that we require every filesystem to use
> i_mapping_sem. This makes locking rules simpler, we can also add asserts when
> all filesystems are converted. The downside is that simple filesystems now pay
> the overhead of the locking unnecessary for them. The overhead is small
> (uncontended rwsem acquisition for truncate) so I don't think we care and the
> simplicity is worth it but I wanted to spell this out.

What do we actually get in return for supporting these complex fallocate
operations? Someone added them for a reason, but does that reason
actually benefit me? Other than running xfstests, how many times has
holepunch been called on your laptop in the last week? I don't want to
incur even one extra instruction per I/O operation to support something
that happens twice a week; that's a bad tradeoff.

Can we implement holepunch as a NOP? Or return -ENOTTY? Those both
seem like better solutions than adding an extra rwsem to every inode.
Failing that, is there a bigger hammer we can use on the holepunch side
(eg preventing all concurrent accesses while the holepunch is happening)
to reduce the overhead on the read side?

2021-04-19 16:34:13

by Jan Kara

[permalink] [raw]
Subject: Re: [PATCH 0/7 RFC v3] fs: Hole punch vs page cache filling races

On Mon 19-04-21 16:20:08, Matthew Wilcox wrote:
> On Tue, Apr 13, 2021 at 01:28:44PM +0200, Jan Kara wrote:
> > Also when writing the documentation I came across one question: Do we mandate
> > i_mapping_sem for truncate + hole punch for all filesystems or just for
> > filesystems that support hole punching (or other complex fallocate operations)?
> > I wrote the documentation so that we require every filesystem to use
> > i_mapping_sem. This makes locking rules simpler, we can also add asserts when
> > all filesystems are converted. The downside is that simple filesystems now pay
> > the overhead of the locking unnecessary for them. The overhead is small
> > (uncontended rwsem acquisition for truncate) so I don't think we care and the
> > simplicity is worth it but I wanted to spell this out.
>
> What do we actually get in return for supporting these complex fallocate
> operations? Someone added them for a reason, but does that reason
> actually benefit me? Other than running xfstests, how many times has
> holepunch been called on your laptop in the last week? I don't want to
> incur even one extra instruction per I/O operation to support something
> that happens twice a week; that's a bad tradeoff.

I agree hole punch is relatively rare compared to normal operations, but
when it is used, it is used rather frequently - e.g. by VMs to manage their
filesystem images. So if we regress holepunch either by not freeing blocks
or by slowing it down significantly, I'm pretty sure some people will
complain. That being said, I fully understand your reluctance to add a lock
to the read path, but note that it is taken only when we need to fill data
from the storage, and it should be taken once per readahead request, so I
actually doubt the extra acquisition will be visible in the profiles. But I
can profile it to be sure.

> Can we implement holepunch as a NOP? Or return -ENOTTY? Those both
> seem like better solutions than adding an extra rwsem to every inode.

We already have that rwsem there today for most major filesystems. This
work just lifts it from the fs-private inode area into the VFS inode. So in
terms of memory usage we are not losing that much.

> Failing that, is there a bigger hammer we can use on the holepunch side
> (eg preventing all concurrent accesses while the holepunch is happening)
> to reduce the overhead on the read side?

I'm open to other solutions but frankly this was the best I could come up
with. Holepunch already uses a pretty big-hammer approach - take all
the locks there are on the inode in exclusive mode, block DIO, unmap
everything and then do its dirty deeds... I don't think we want hole punch
to block anything on an fs-wide basis (that's a DoS recipe) and besides that
I don't see how the hammer could be bigger ;).

Honza
--
Jan Kara <[email protected]>
SUSE Labs, CR

2021-04-20 22:13:48

by Dave Chinner

[permalink] [raw]
Subject: Re: [PATCH 0/7 RFC v3] fs: Hole punch vs page cache filling races

On Mon, Apr 19, 2021 at 04:20:08PM +0100, Matthew Wilcox wrote:
> On Tue, Apr 13, 2021 at 01:28:44PM +0200, Jan Kara wrote:
> > Also when writing the documentation I came across one question: Do we mandate
> > i_mapping_sem for truncate + hole punch for all filesystems or just for
> > filesystems that support hole punching (or other complex fallocate operations)?
> > I wrote the documentation so that we require every filesystem to use
> > i_mapping_sem. This makes locking rules simpler, we can also add asserts when
> > all filesystems are converted. The downside is that simple filesystems now pay
> > the overhead of the locking unnecessary for them. The overhead is small
> > (uncontended rwsem acquisition for truncate) so I don't think we care and the
> > simplicity is worth it but I wanted to spell this out.
>
> What do we actually get in return for supporting these complex fallocate
> operations? Someone added them for a reason, but does that reason
> actually benefit me? Other than running xfstests, how many times has
> holepunch been called on your laptop in the last week?

Quite a lot, actually.

> I don't want to
> incur even one extra instruction per I/O operation to support something
> that happens twice a week; that's a bad tradeoff.

Hole punching gets into all sorts of interesting places. For
example, did you know that issuing fstrim (discards) or "write
zeroes" on a file-backed loopback device will issue hole punches to
the underlying file? nvmet does the same. Userspace iscsi server
implementations (e.g. TGT) do the same thing and have for a long
time. NFSv4 servers issue hole punching based on client side
requests, too.

Then there's Kubernetes management tools. Samba. Qemu. Libvirt.
Mysql. Network-Manager. Gluster. Chromium. RocksDB. Swift. Systemd.
The list of core system infrastructure we have that uses hole
punching is quite large...

So, really, hole punching is something that happens a lot and in
many unexpected places. You can argue that your laptop doesn't use
it, but that really doesn't matter in the bigger scheme of things.
Hole punching is something applications expect to work and not
corrupt data....

> Can we implement holepunch as a NOP? Or return -ENOTTY? Those both
> seem like better solutions than adding an extra rwsem to every inode.

We've already added this extra rwsem to ext4 and XFS - it's a sunk
cost for almost every production machine out there in the wild. It
needs to be made generic so we can optimise the implementation and
not have to implement a unicorn in every filesystem to work around
the fact the page cache and page faults have no internal
serialisation mechanism against filesystem operations that directly
manipulate and invalidate large ranges of the backing storage the
page cache sits over.

> Failing that, is there a bigger hammer we can use on the holepunch side
> (eg preventing all concurrent accesses while the holepunch is happening)
> to reduce the overhead on the read side?

That's what we currently do and what Jan is trying to refine....

Cheers,

Dave.
--
Dave Chinner
[email protected]