2019-04-08 09:02:48

by Amir Goldstein

Subject: Re: [POC][PATCH] xfs: reduce ilock contention on buffered randrw workload

On Mon, Apr 8, 2019 at 2:27 AM Dave Chinner <[email protected]> wrote:
>
> On Fri, Apr 05, 2019 at 05:02:33PM +0300, Amir Goldstein wrote:
> > On Fri, Apr 5, 2019 at 12:17 AM Dave Chinner <[email protected]> wrote:
> > >
> > > On Thu, Apr 04, 2019 at 07:57:37PM +0300, Amir Goldstein wrote:
> > > > This patch improves performance of mixed random rw workload
> > > > on xfs without relaxing the atomic buffered read/write guarantee
> > > > that xfs has always provided.
> > > >
> > > > We achieve that by calling generic_file_read_iter() twice:
> > > > once with a discard iterator to warm up the page cache before taking
> > > > the shared ilock, and once again under the shared ilock.
> > >
> > > This will race with things like truncate, hole punching, etc. that
> > > serialise IO and invalidate the page cache for data integrity
> > > reasons under the IOLOCK. These rely on there being no IO to the
> > > inode in progress at all to work correctly, which this patch
> > > violates. IOWs, while this is fast, it is not safe and so not a
> > > viable approach to solving the problem.
> > >
> >
> > This statement leaves me wondering: if ext4 does not take
> > i_rwsem on generic_file_read_iter(), how does ext4 (or any other
> > fs for that matter) guarantee buffered read synchronization with
> > truncate, hole punching etc?
> > The answer in ext4 case is i_mmap_sem, which is read locked
> > in the page fault handler.
>
> Nope, the i_mmap_sem is for serialisation of /page faults/ against
> truncate, holepunching, etc. Completely irrelevant to the read()
> path.
>

I'm at a loss here. Why are page faults completely irrelevant to the read()
path? Aren't full pages supposed to be faulted in on read() after
truncate_pagecache_range()? And aren't partial pages supposed
to be partially zeroed and uptodate after truncate_pagecache_range()?

> > And xfs does the same type of synchronization with MMAPLOCK,
> > so while my patch may not be safe, I cannot follow why from your
> > explanation; please explain what I am missing.
>
> mmap_sem inversions require independent locks for IO path and page
> faults - the MMAPLOCK does not protect anything in the
> read()/write() IO path.
>
[...]
>
> All you see is this:
>
> truncate:                                read()
>
> IOLOCK_EXCL
>   flush relevant cached data
>   truncate page cache
>                                          pre-read page cache between
>                                          new eof and old eof
>                                          IOLOCK_SHARED
>                                          <blocks>
>   start transaction
>   ILOCK_EXCL
>     update isize
>     remove extents
>     ....
>   commit xactn
> IOLOCK unlock
>                                          <gets lock>
>                                          sees beyond EOF, returns 0
>
>
> So you see the read() doing the right thing (detect EOF, returning
> short read). Great.
>
> But what I see is uptodate pages containing stale data being left in
> the page cache beyond EOF. That is the problem here - truncate must
> not leave stale pages beyond EOF behind - it's the landmine that
> causes future things to go wrong.
>
> e.g. now the app does post-eof preallocation, so the range those
> pages are cached over is allocated as unwritten - the filesystem
> will do this without even looking at the page cache because it's
> beyond EOF. Now we extend the file past those cached pages, and
> iomap_zero() sees the range as unwritten and so does not write zeros
> to the blocks between the old EOF and the new EOF. Now the app reads
> from that range (say it does a sub-page write, triggering a page
> cache RMW cycle). The read goes to instantiate the page cache page,
> finds a page already in the cache that is uptodate, and uses it
> without zeroing or reading from disk.
>
> And now we have stale data exposure and/or data corruption.
>
> I can come up with quite a few scenarios where this particular
> "populate cache after invalidation" race can cause similar problems
> for XFS. Hole punch and most of the other fallocate extent
> manipulations have the same serialisation requirements - no IO over
> the range of the operation can be *initiated* between the /start/ of
> the page cache invalidation and the end of the specific extent
> manipulation operation.
>
> So how does ext4 avoid this problem on truncate?
>
> History lesson: truncate in Linux (and hence ext4) has traditionally
> been serialised by the hacky post-page-lock checks that are strewn
> all through the page cache and mm/ subsystem. i.e. every time you
> look up and lock a page cache page, you have to check the
> page->mapping and page->index to ensure that the lookup-and-lock
> hasn't raced with truncate. This only works because truncate
> requires the inode size to be updated before invalidating the page
> cache - that's the "hacky" part of it.
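
(For reference, the post-page-lock check being described looks roughly like
this in the buffered read path - a from-memory sketch, not exact mainline
code:

        page = find_get_page(mapping, index);
        lock_page(page);
        if (!page->mapping) {
                /* raced with truncate - page was removed while we waited */
                unlock_page(page);
                put_page(page);
                goto retry_find;
        }
        /* page still belongs to this mapping, safe to use its data */

The fault path does an equivalent recheck after locking the page.)
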
>
> IOWs, the burden of detecting truncate races is strewn throughout
> the mm/ subsystem, rather than being the responsibility of the
> filesystem. This is made worse by the fact this mechanism simply
> doesn't work for hole punching because there is no file size change
> to indicate that the page lookup is racing with an in-progress
> invalidation.
>
> That means the mm/ and page cache code is unable to detect hole
> punch races, and so the serialisation of invalidation vs page cache
> instantiation has to be done in the filesystem. And no Linux native
> filesystem had the infrastructure for such serialisation because
> they never had to implement anything to ensure truncate was
> serialised against new and in-progress IO.
>
> The result of this is that, AFAICT, ext4 does not protect against
> read() vs hole punch races - its hole punching code does:
>
> Hole Punch:                              read():
>
> inode_lock()
> inode_dio_wait(inode);
> down_write(i_mmap_sem)
> truncate_pagecache_range()
>                                          ext4_file_iter_read()
>                                            ext4_map_blocks()
>                                              down_read(i_data_sem)
>                                              <gets mapping>
>                                          <populates page cache over hole>
>                                          <reads stale data into cache>
> .....
> down_write(i_data_sem)
>   remove extents
>
> IOWs, ext4 is safe against truncate because of the
> change-inode-size-before-invalidation hacks, but the lack of
> serialisation of buffered reads means that hole punch and other similar
> fallocate based extent manipulations can race against reads....
>

Adding some ext4 folks to comment on the above.
Could it be that those races were already addressed by Lukas' work:
https://lore.kernel.org/patchwork/cover/371861/

Thanks,
Amir.


2019-04-08 14:11:17

by Jan Kara

Subject: Re: [POC][PATCH] xfs: reduce ilock contention on buffered randrw workload

On Mon 08-04-19 12:02:34, Amir Goldstein wrote:
> On Mon, Apr 8, 2019 at 2:27 AM Dave Chinner <[email protected]> wrote:
> >
> > On Fri, Apr 05, 2019 at 05:02:33PM +0300, Amir Goldstein wrote:
> > > [...]
> > > This statement leaves me wondering: if ext4 does not take
> > > i_rwsem on generic_file_read_iter(), how does ext4 (or any other
> > > fs for that matter) guarantee buffered read synchronization with
> > > truncate, hole punching etc?
> > > The answer in ext4 case is i_mmap_sem, which is read locked
> > > in the page fault handler.
> >
> > Nope, the i_mmap_sem is for serialisation of /page faults/ against
> > truncate, holepunching, etc. Completely irrelevant to the read()
> > path.
> >
>
> I'm at a loss here. Why are page faults completely irrelevant to the read()
> path? Aren't full pages supposed to be faulted in on read() after
> truncate_pagecache_range()?

During read(2), pages are not "faulted in". Just look at
what generic_file_buffered_read() does. It uses completely separate code to
add page to page cache, trigger readahead, and possibly call ->readpage() to
fill the page with data. "fault" path (handled by filemap_fault()) applies
only to accesses from userspace to mmaps.
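
Roughly, the two paths look like this (a simplified from-memory sketch, not
the exact mainline call chains):

        read(2):
          generic_file_read_iter()
            generic_file_buffered_read()
              find_get_page() / page_cache_sync_readahead()
              ->readpage() if the page is not uptodate
              copy_page_to_iter()

        mmap'ed access:
          page fault
            ->fault, e.g. ext4_filemap_fault() / xfs_filemap_fault()
              takes i_mmap_sem / XFS_MMAPLOCK_SHARED
              filemap_fault()
                find_get_page() / readahead / ->readpage()

Only the second path goes through the fault handler, and hence through
i_mmap_sem.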

> And aren't partial pages supposed to be partially zeroed and uptodate
> after truncate_pagecache_range()?

truncate_pagecache_range() does take care of page zeroing, that is correct
(if those pages are in page cache in the first place). I'm not sure how
this relates to the above discussion though.

Honza
--
Jan Kara <[email protected]>
SUSE Labs, CR

2019-04-08 17:41:23

by Amir Goldstein

Subject: Re: [POC][PATCH] xfs: reduce ilock contention on buffered randrw workload

On Mon, Apr 8, 2019 at 5:11 PM Jan Kara <[email protected]> wrote:
>
> On Mon 08-04-19 12:02:34, Amir Goldstein wrote:
> > [...]
> > I'm at a loss here. Why are page faults completely irrelevant to the read()
> > path? Aren't full pages supposed to be faulted in on read() after
> > truncate_pagecache_range()?
>
> During read(2), pages are not "faulted in". Just look at
> what generic_file_buffered_read() does. It uses completely separate code to
> add page to page cache, trigger readahead, and possibly call ->readpage() to
> fill the page with data. "fault" path (handled by filemap_fault()) applies
> only to accesses from userspace to mmaps.
>

Oh! thanks for fixing my blind spot.
So if you agree with Dave that ext4, and who knows what other fs,
are vulnerable to populating page cache with stale "uptodate" data,
then it seems to me that also xfs is vulnerable via readahead(2) and
posix_fadvise().
Mind you, I recently added an fadvise f_op, so it could be used by
xfs to synchronize with IOLOCK.
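
Something like this, as a rough sketch (untested, names from memory, not a
real patch):

        static int
        xfs_file_fadvise(
                struct file     *file,
                loff_t          offset,
                loff_t          len,
                int             advice)
        {
                struct xfs_inode        *ip = XFS_I(file_inode(file));
                int                     ret;

                /*
                 * POSIX_FADV_WILLNEED instantiates page cache pages, so it
                 * needs the same protection from hole punch / truncate that
                 * a buffered read gets, i.e. IOLOCK_SHARED.
                 */
                if (advice == POSIX_FADV_WILLNEED)
                        xfs_ilock(ip, XFS_IOLOCK_SHARED);
                ret = generic_fadvise(file, offset, len, advice);
                if (advice == POSIX_FADV_WILLNEED)
                        xfs_iunlock(ip, XFS_IOLOCK_SHARED);
                return ret;
        }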

Perhaps a better solution would be for truncate_pagecache_range()
to leave zeroed or Unwritten (i.e. lazy zeroed by read) pages in page
cache. When we have shared pages for files, these pages could be
deduped.

Thanks,
Amir.

2019-04-09 08:26:09

by Jan Kara

Subject: Re: [POC][PATCH] xfs: reduce ilock contention on buffered randrw workload

On Mon 08-04-19 20:41:09, Amir Goldstein wrote:
> On Mon, Apr 8, 2019 at 5:11 PM Jan Kara <[email protected]> wrote:
> >
> > [...]
> >
> > During read(2), pages are not "faulted in". Just look at
> > what generic_file_buffered_read() does. It uses completely separate code to
> > add page to page cache, trigger readahead, and possibly call ->readpage() to
> > fill the page with data. "fault" path (handled by filemap_fault()) applies
> > only to accesses from userspace to mmaps.
> >
>
> Oh! thanks for fixing my blind spot.
> So if you agree with Dave that ext4, and who knows what other fs,
> are vulnerable to populating page cache with stale "uptodate" data,

Not that many filesystems support punching holes but you're right.

> then it seems to me that also xfs is vulnerable via readahead(2) and
> posix_fadvise().

Yes, this is correct AFAICT.

> Mind you, I recently added an fadvise f_op, so it could be used by
> xfs to synchronize with IOLOCK.

And yes, this should work.

> Perhaps a better solution would be for truncate_pagecache_range()
> to leave zeroed or Unwritten (i.e. lazy zeroed by read) pages in page
> cache. When we have shared pages for files, these pages could be
> deduped.

No, I wouldn't really mess with sharing pages due to this. It would be hard
to make that scale reasonably and would be rather complex. We really need a
proper and reasonably simple synchronization mechanism between operations
removing blocks from an inode and operations filling in the page cache of
that inode. The page lock was supposed to provide this but doesn't quite
work, because hole punching first removes pagecache pages and only then goes
on to remove the blocks.

So I agree with Dave that going for a range lock is really the cleanest way
forward here without causing big regressions for mixed rw workloads. I'm
just thinking how best to do that without introducing a lot of boilerplate
code into each filesystem.

Honza
--
Jan Kara <[email protected]>
SUSE Labs, CR

2022-06-17 14:49:35

by Amir Goldstein

Subject: Re: [POC][PATCH] xfs: reduce ilock contention on buffered randrw workload

On Tue, Apr 9, 2019 at 11:26 AM Jan Kara <[email protected]> wrote:
>
> On Mon 08-04-19 20:41:09, Amir Goldstein wrote:
> > [...]
> > Oh! thanks for fixing my blind spot.
> > So if you agree with Dave that ext4, and who knows what other fs,
> > are vulnerable to populating page cache with stale "uptodate" data,
>
> Not that many filesystems support punching holes but you're right.
>
> > then it seems to me that also xfs is vulnerable via readahead(2) and
> > posix_fadvise().
>
> Yes, this is correct AFAICT.
>
> > Mind you, I recently added an fadvise f_op, so it could be used by
> > xfs to synchronize with IOLOCK.
>
> And yes, this should work.
>
> > Perhaps a better solution would be for truncate_pagecache_range()
> > to leave zeroed or Unwritten (i.e. lazy zeroed by read) pages in page
> > cache. When we have shared pages for files, these pages could be
> > deduped.
>
> No, I wouldn't really mess with sharing pages due to this. It would be hard
> to make that scale reasonably and would be rather complex. We really need a
> proper and reasonably simple synchronization mechanism between operations
> removing blocks from an inode and operations filling in the page cache of
> that inode. The page lock was supposed to provide this but doesn't quite
> work, because hole punching first removes pagecache pages and only then goes
> on to remove the blocks.
>
> So I agree with Dave that going for a range lock is really the cleanest way
> forward here without causing big regressions for mixed rw workloads. I'm
> just thinking how best to do that without introducing a lot of boilerplate
> code into each filesystem.

Hi Jan, Dave,

Trying to circle back to this after 3 years!
There has been no progress with range locks, and the mixed rw workload
performance issue still very much exists.

Is the situation now different than 3 years ago with invalidate_lock?
Would my approach of pre-warming the page cache before taking IOLOCK
be safe if the page cache is pre-warmed with invalidate_lock held?
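
I.e. something along these lines in xfs_file_buffered_read() (hand-wavy
pseudo code; iocb/iter setup omitted, and discard_iter stands for whatever
the POC patch uses to throw away the pre-read data):

        /*
         * Pre-warm the page cache outside the IOLOCK. Holding
         * invalidate_lock shared means a concurrent hole punch or truncate
         * cannot run its page cache invalidation while we are populating
         * pages, so we cannot instantiate stale pages over a hole.
         */
        filemap_invalidate_lock_shared(mapping);
        generic_file_read_iter(iocb, &discard_iter);    /* throw-away read */
        filemap_invalidate_unlock_shared(mapping);

        /* the real, atomic buffered read - pages are hopefully cached now */
        xfs_ilock(ip, XFS_IOLOCK_SHARED);
        ret = generic_file_read_iter(iocb, to);
        xfs_iunlock(ip, XFS_IOLOCK_SHARED);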

For the pNFS leases issue, as I wrote back in the pre-COVID era,
I intend to opt out of this optimization with
#ifndef CONFIG_EXPORTFS_BLOCK_OPS

Thanks,
Amir.

2022-06-17 15:12:46

by Jan Kara

Subject: Re: [POC][PATCH] xfs: reduce ilock contention on buffered randrw workload

On Fri 17-06-22 17:48:08, Amir Goldstein wrote:
> On Tue, Apr 9, 2019 at 11:26 AM Jan Kara <[email protected]> wrote:
> >
> > On Mon 08-04-19 20:41:09, Amir Goldstein wrote:
> > > [...]
> > >
> > > Oh! thanks for fixing my blind spot.
> > > So if you agree with Dave that ext4, and who knows what other fs,
> > > are vulnerable to populating page cache with stale "uptodate" data,
> >
> > Not that many filesystems support punching holes but you're right.
> >
> > > then it seems to me that also xfs is vulnerable via readahead(2) and
> > > posix_fadvise().
> >
> > Yes, this is correct AFAICT.
> >
> > > Mind you, I recently added an fadvise f_op, so it could be used by
> > > xfs to synchronize with IOLOCK.
> >
> > And yes, this should work.
> >
> > > Perhaps a better solution would be for truncate_pagecache_range()
> > > to leave zeroed or Unwritten (i.e. lazy zeroed by read) pages in page
> > > cache. When we have shared pages for files, these pages could be
> > > deduped.
> >
> > No, I wouldn't really mess with sharing pages due to this. It would be hard
> > to make that scale reasonably and would be rather complex. We really need a
> > proper and reasonably simple synchronization mechanism between operations
> > removing blocks from an inode and operations filling in the page cache of
> > that inode. The page lock was supposed to provide this but doesn't quite
> > work, because hole punching first removes pagecache pages and only then goes
> > on to remove the blocks.
> >
> > So I agree with Dave that going for a range lock is really the cleanest way
> > forward here without causing big regressions for mixed rw workloads. I'm
> > just thinking how best to do that without introducing a lot of boilerplate
> > code into each filesystem.
>
> Hi Jan, Dave,
>
> Trying to circle back to this after 3 years!
> There has been no progress with range locks, and the mixed rw workload
> performance issue still very much exists.
>
> Is the situation now different than 3 years ago with invalidate_lock?

Yes, I've implemented invalidate_lock exactly to fix the issues you've
pointed out without regressing the mixed rw workloads (because
invalidate_lock is taken in shared mode only for reads and usually not at
all for writes).
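
Roughly, the contract is (simplified):

        hole punch / truncate / collapse range etc.:
                filemap_invalidate_lock(mapping)        /* exclusive */
                truncate_pagecache_range()
                remove or convert the blocks
                filemap_invalidate_unlock(mapping)

        paths that add pages to the page cache (readahead, page fault,
        buffered read instantiating a page):
                filemap_invalidate_lock_shared(mapping)
                instantiate the page and read data into it
                filemap_invalidate_unlock_shared(mapping)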

> Would my approach of pre-warming the page cache before taking IOLOCK
> be safe if the page cache is pre-warmed with invalidate_lock held?

Why would it be needed? But yes, with invalidate_lock you could presumably
make that idea safe...

Honza
--
Jan Kara <[email protected]>
SUSE Labs, CR