2022-03-21 22:38:56

by Jaegeuk Kim

[permalink] [raw]
Subject: [GIT PULL] f2fs for 5.18

Hi Linus,

Could you please consider this pull request?

Thanks,

The following changes since commit dd81e1c7d5fb126e5fbc5c9e334d7b3ec29a16a0:

Merge tag 'powerpc-5.17-2' of git://git.kernel.org/pub/scm/linux/kernel/git/powerpc/linux (2022-01-23 17:52:42 +0200)

are available in the Git repository at:

git://git.kernel.org/pub/scm/linux/kernel/git/jaegeuk/f2fs.git tags/f2fs-for-5.18

for you to fetch changes up to 5b5b4f85b01604389f7a0f11ef180a725bf0e2d4:

f2fs: fix to do sanity check on .cp_pack_total_block_count (2022-03-21 09:10:21 -0700)

----------------------------------------------------------------
f2fs-for-5.18

In this cycle, f2fs has some performance improvements for Android workloads such
as using read-unfair rwsems and adding some sysfs entries to control GCs and
discard commands in more details. In addtiion, it has some tunings to improve
the recovery speed after sudden power-cut.

Enhancement:
- add reader-unfair rwsems with F2FS_UNFAIR_RWSEM
: will replace with generic API support
- adjust to make the readahead/recovery flow more efficiently
- sysfs entries to control issue speeds of GCs and Discard commands
- enable idmapped mounts

Bug fix:
- correct wrong error handling routines
- fix missing conditions in quota
- fix a potential deadlock between writeback and block plug routines
- fix a deadlock btween freezefs and evict_inode

We've added some boundary checks to avoid kernel panics on corrupted images,
and several minor code clean-ups.

----------------------------------------------------------------
Bart Van Assche (1):
f2fs: Restore rwsem lockdep support

Chao Yu (11):
f2fs: fix to enable ATGC correctly via gc_idle sysfs interface
f2fs: fix to unlock page correctly in error path of is_alive()
f2fs: adjust readahead block number during recovery
f2fs: introduce F2FS_IPU_HONOR_OPU_WRITE ipu policy
f2fs: support idmapped mounts
f2fs: fix to avoid potential deadlock
f2fs: fix to do sanity check on curseg->alloc_type
f2fs: compress: fix to print raw data size in error path of lz4 decompression
f2fs: initialize sbi->gc_mode explicitly
f2fs: use aggressive GC policy during f2fs_disable_checkpoint()
f2fs: fix to do sanity check on .cp_pack_total_block_count

Daeho Jeong (2):
f2fs: introduce gc_urgent_mid mode
f2fs: make gc_urgent and gc_segment_mode sysfs node readable

Fengnan Chang (1):
f2fs: fix compressed file start atomic write may cause data corruption

Jaegeuk Kim (6):
f2fs: add a way to limit roll forward recovery time
f2fs: fix missing free nid in f2fs_handle_failed_inode
f2fs: avoid an infinite loop in f2fs_sync_dirty_inodes
f2fs: introduce F2FS_UNFAIR_RWSEM to support unfair rwsem
f2fs: don't get FREEZE lock in f2fs_evict_inode in frozen fs
f2fs: use spin_lock to avoid hang

Jia Yang (1):
f2fs: remove unnecessary read for F2FS_FITS_IN_INODE

Juhyung Park (1):
f2fs: quota: fix loop condition at f2fs_quota_sync()

Konstantin Vyshetsky (2):
f2fs: move discard parameters into discard_cmd_control
f2fs: expose discard related parameters in sysfs

Tim Murray (1):
f2fs: move f2fs to use reader-unfair rwsems

Wang Xiaojun (1):
f2fs: remove redundant parameter judgment

Documentation/ABI/testing/sysfs-fs-f2fs | 54 ++++++++--
fs/f2fs/Kconfig | 7 ++
fs/f2fs/acl.c | 21 ++--
fs/f2fs/checkpoint.c | 58 +++++++----
fs/f2fs/compress.c | 11 +-
fs/f2fs/data.c | 76 ++++++++------
fs/f2fs/debug.c | 25 +++--
fs/f2fs/dir.c | 12 +--
fs/f2fs/f2fs.h | 154 +++++++++++++++++++++++-----
fs/f2fs/file.c | 175 ++++++++++++++++----------------
fs/f2fs/gc.c | 53 +++++-----
fs/f2fs/inline.c | 4 +-
fs/f2fs/inode.c | 7 +-
fs/f2fs/namei.c | 78 +++++++-------
fs/f2fs/node.c | 92 +++++++++--------
fs/f2fs/node.h | 3 +
fs/f2fs/recovery.c | 35 ++++++-
fs/f2fs/segment.c | 73 +++++++------
fs/f2fs/segment.h | 5 +-
fs/f2fs/super.c | 91 ++++++++++-------
fs/f2fs/sysfs.c | 40 +++++++-
fs/f2fs/verity.c | 4 +-
fs/f2fs/xattr.c | 12 +--
23 files changed, 699 insertions(+), 391 deletions(-)


2022-03-22 21:38:44

by Linus Torvalds

[permalink] [raw]
Subject: Re: [GIT PULL] f2fs for 5.18

On Mon, Mar 21, 2022 at 1:39 PM Jaegeuk Kim <[email protected]> wrote:
>
> In this cycle, f2fs has some performance improvements for Android workloads such
> as using read-unfair rwsems [...]

I've pulled this, but that read-unfair rwsem code looks incredibly
dodgy. Doing your own locking is always a bad sign, and it ahs
traditionally come back to bite us pretty much every time. At least it
uses real lock primitives, just in a really odd way.

The whole notion of making an rwsem unfair to readers sounds really
really odd. I mean, the whole and only _point_ of an rwsem is to
allow concurrent readers, and traditionally if it's unfair it's unfair
to _writers_ because that tends to be better for throughput (but
unfairness can cause horrible latency).

So it smells like there's something bad going on in f2fs.

That said, I'm adding Waiman to the cc here in case he would have
ideas at least for a cleaner interface. Our rw_semaphores are
explicitly trying to be fair, because unfairness (the other way) was
such a big problem.

I'm wondering it the optimistic read lock stealing is bothering f2fs?

Linus

2022-03-22 22:15:49

by pr-tracker-bot

[permalink] [raw]
Subject: Re: [f2fs-dev] [GIT PULL] f2fs for 5.18

The pull request you sent on Mon, 21 Mar 2022 13:39:32 -0700:

> git://git.kernel.org/pub/scm/linux/kernel/git/jaegeuk/f2fs.git tags/f2fs-for-5.18

has been merged into torvalds/linux.git:
https://git.kernel.org/torvalds/c/ef510682af3dbe2f9cdae7126a1461c94e010967

Thank you!

--
Deet-doot-dot, I am a bot.
https://korg.docs.kernel.org/prtracker.html

2022-03-22 23:29:06

by Waiman Long

[permalink] [raw]
Subject: Re: [GIT PULL] f2fs for 5.18

On 3/22/22 13:22, Linus Torvalds wrote:
> On Mon, Mar 21, 2022 at 1:39 PM Jaegeuk Kim <[email protected]> wrote:
>> In this cycle, f2fs has some performance improvements for Android workloads such
>> as using read-unfair rwsems [...]
> I've pulled this, but that read-unfair rwsem code looks incredibly
> dodgy. Doing your own locking is always a bad sign, and it ahs
> traditionally come back to bite us pretty much every time. At least it
> uses real lock primitives, just in a really odd way.
>
> The whole notion of making an rwsem unfair to readers sounds really
> really odd. I mean, the whole and only _point_ of an rwsem is to
> allow concurrent readers, and traditionally if it's unfair it's unfair
> to _writers_ because that tends to be better for throughput (but
> unfairness can cause horrible latency).
>
> So it smells like there's something bad going on in f2fs.
>
> That said, I'm adding Waiman to the cc here in case he would have
> ideas at least for a cleaner interface. Our rw_semaphores are
> explicitly trying to be fair, because unfairness (the other way) was
> such a big problem.
>
> I'm wondering it the optimistic read lock stealing is bothering f2fs?

I don't believe it is the optimistic read lock stealing code that is
bothering f2fs.

AFAICS, the read-unfair rwsem code is created to resolve a potential
lock starvation problem that they found on linux-5.10.y stable tree. I
believe I have fixed that in the v5.11 kernel, see commit 2f06f702925
("locking/rwsem: Prevent potential lock starvation"). However that
commit is not in the stable tree. In fact, I have moved forward and
taken out reader optimistic spinning but added just optimistic lock
stealing instead. I believe the problem would have solved by including
that patch series in their build. I haven't gotten any response as to
whether they had tested this or not.

Apparently they prefer to upstream this stop-gap solution.

Cheers,
Longman

2022-03-23 03:28:29

by Jaegeuk Kim

[permalink] [raw]
Subject: Re: [GIT PULL] f2fs for 5.18

Hi Linus,

On 03/22, Linus Torvalds wrote:
> On Tue, Mar 22, 2022 at 10:37 AM Waiman Long <[email protected]> wrote:
> >
> > AFAICS, the read-unfair rwsem code is created to resolve a potential
> > lock starvation problem that they found on linux-5.10.y stable tree. I
> > believe I have fixed that in the v5.11 kernel, see commit 2f06f702925
> > ("locking/rwsem: Prevent potential lock starvation").
>
> Ahh.
>
> Adding Tim Murray to the cc, since he was the source of that odd
> reader-unfair thing.
>
> I really *really* dislike people thinking they can do locking
> primitives, because history has taught us that they are wrong.
>
> Even when people get the semantics and memory ordering right (which is
> not always the case, but at least the f2fs code uses real lock
> primitives - just oddly - and should thus be ok), it invariably tends
> to be a sign of something else being very wrong.
>
> And I can easily believe that in this case it's due to a rmsem issue
> that was already fixed long long ago as per Waiman.
>
> Can people please test with the actual modern rwsem code and with the
> odd reader-unfair locks disabled?

The pain point is 1) we don't have a specific test to reproduce the issue,
but got some foundings from field only, 2) in order to test the patches, we
need to merge the patches into Android kernel [1] through LTS, 3) but, LTS
wants to see any test results [2].

[1] https://android-review.googlesource.com/q/topic:rwsem_unfair
[2] https://lore.kernel.org/stable/[email protected]/

So, I thought applying it in f2fs could avoid kernel version issues without
any risk of updating rwsem. Meanwhile, agreed that we should use the right APIs,
I'm going to disable this f2fs change in the next device having newer kernel to
see whether or not uptodate rwsem can really fix the issue.

>
> Linus

2022-03-23 03:53:06

by Linus Torvalds

[permalink] [raw]
Subject: Re: [GIT PULL] f2fs for 5.18

On Tue, Mar 22, 2022 at 10:37 AM Waiman Long <[email protected]> wrote:
>
> AFAICS, the read-unfair rwsem code is created to resolve a potential
> lock starvation problem that they found on linux-5.10.y stable tree. I
> believe I have fixed that in the v5.11 kernel, see commit 2f06f702925
> ("locking/rwsem: Prevent potential lock starvation").

Ahh.

Adding Tim Murray to the cc, since he was the source of that odd
reader-unfair thing.

I really *really* dislike people thinking they can do locking
primitives, because history has taught us that they are wrong.

Even when people get the semantics and memory ordering right (which is
not always the case, but at least the f2fs code uses real lock
primitives - just oddly - and should thus be ok), it invariably tends
to be a sign of something else being very wrong.

And I can easily believe that in this case it's due to a rmsem issue
that was already fixed long long ago as per Waiman.

Can people please test with the actual modern rwsem code and with the
odd reader-unfair locks disabled?

Linus

2022-03-23 09:07:20

by Linus Torvalds

[permalink] [raw]
Subject: Re: [GIT PULL] f2fs for 5.18

On Tue, Mar 22, 2022 at 5:34 PM Tim Murray <[email protected]> wrote:
>
> AFAICT, what's happening is that rwsem_down_read_slowpath
> modifies sem->count to indicate that there's a pending reader while
> f2fs_ckpt holds the write lock, and when f2fs_ckpt releases the write
> lock, it wakes pending readers and hands the lock over to readers.
> This means that any subsequent attempt to grab the write lock from
> f2fs_ckpt will stall until the newly-awakened reader releases the read
> lock, which depends on the readers' arbitrarily long scheduling
> delays.

Ugh.

So I'm looking at some of this, and you have things like this:

f2fs_down_read(&F2FS_I(inode)->i_sem);
cp_reason = need_do_checkpoint(inode);
f2fs_up_read(&F2FS_I(inode)->i_sem);

which really doesn't seem to want a sleeping lock at all.

In fact, it's not clear that it has any business serializing with IO
at all. It seems to just check very basic inode state. Very strange.
It's the kind of thing that the VFS layer tends to use te i_lock
*spinlock* for.

And perhaps equally oddly, then when you do f2fs_issue_checkpoint(),
_that_ code uses fancy lockless lists.

I'm probably mis-reading it.

Linus

2022-03-23 13:43:54

by Christoph Hellwig

[permalink] [raw]
Subject: Re: [GIT PULL] f2fs for 5.18

On Tue, Mar 22, 2022 at 10:22:50AM -0700, Linus Torvalds wrote:
> On Mon, Mar 21, 2022 at 1:39 PM Jaegeuk Kim <[email protected]> wrote:
> >
> > In this cycle, f2fs has some performance improvements for Android workloads such
> > as using read-unfair rwsems [...]
>
> I've pulled this, but that read-unfair rwsem code looks incredibly
> dodgy. Doing your own locking is always a bad sign, and it ahs
> traditionally come back to bite us pretty much every time. At least it
> uses real lock primitives, just in a really odd way.

FYI, Peter and I both pointed this out when the patches were posted
and NAKed the patch, but the feedback was ignored.

2022-03-24 12:19:30

by Jaegeuk Kim

[permalink] [raw]
Subject: Re: [GIT PULL] f2fs for 5.18

On 03/23, Linus Torvalds wrote:
> On Wed, Mar 23, 2022 at 9:26 AM Jaegeuk Kim <[email protected]> wrote:
> >
> > OTOH, I was suspecting the major contetion would be
> > f2fs_lock_op -> f2fs_down_read(&sbi->cp_rwsem);
> > , which was used for most of filesystem operations.
>
> Very possible, I was just looking at a random one in f2fs/file.c
> obviously with no actual numbers in hand.
>
> In general, I really hate seeing specialized locks, but this f2fs use
> case is in some ways worse than other ad-hoc locks I've seen - simply
> because it's been one whole-sale conversion of "down_read/write()" to
> "f2fs_down_read/write()" - regardless of _which_ lock is being locked.
>
> (Now, it's not all bad news - in other respects it's much better than
> some ad-hoc locking: at least you still will participate in lockdep,
> and you use the actual low-level locking primitives instead of making
> up your own and getting memory ordering wrong).
>
> But basically I think it would have been much nicer if you would have
> done this for just the _one_ lock that mattered, whichever lock that
> might be. Partly as documentation, and partly so that maybe some day
> you can split that lock up (or maybe notice cases where you can avoid
> it entirely).
>
> For example, if it's really just f2fs_lock_op() that needs this, the
> special "wait_event(trylock)" hack could have been entirely local to
> just *that*, rather than affecting all the other locks too.
>
> And the very first f2fs_lock_op() case I find, I see that the lock is
> pointless. Again, that's unlikely to be the *cause* of any of these
> problems, but the fact that I've now looked at two of the f2fs locks,
> and gone "the locking seems to be pointlessly badly done" does imply
> that the problem isn't "down_read()", it's the use.
>
> That other lock I reacted to was the f2fs_lock_op(sbi) at the top of
> f2fs_new_inode().
>
> Look, you have a new inode that you just allocated, that nobody else
> can yet access.
>
> And the only thing that that f2fs_lock_op(sbi) -> f2fs_unlock_op(sbi)
> sequence protects is the f2fs_alloc_nid() for that new inode.
>
> Ok, so maybe f2fs_alloc_nid() needs that lock?
>
> No it doesn't. It already has
>
> - &nm_i->nid_list_lock spinlock for its own in-memory internal NID caches
>
> *and* when that fails
>
> - &NM_I(sbi)->build_lock for protecting all of f2fs_build_free_nids()
>
> *and* inside of that lock
>
> - f2fs_down_read(&nm_i->nat_tree_lock) for protecting the NAT tree structures.
>
> So I see two major issues in the very first user of that
> f2fs_lock_op() that I look at:
>
> (a) it seems to be entirely unnecessary

Actually, when I took a look at the above path, indeed, f2fs_lock_op in
f2fs_new_inode may be unnecessary at all aligned to your points. Even, that
might hurt performance since we get f2fs_lock_op twice before dealing with
dentries like f2fs_add_link. Let me test a bit whether there's any regression
if I remove f2fs_lock_op in f2fs_new_inode.

>
> (b) it is a classic case of "multiple nested locks".
>
> Now, it's possible that I'm wrong on (a) and there's some odd reason
> that lock is needed (maybe there is a lock ordering problem for one of
> the other locks between readers and writers, and the op-lock acts as a
> mutual exclusion for that).
>
> But (b) really is a classic problem case for locking: nested locks are
> *much* more likely to cause horrible contention, because not any
> contention in any of the locks will end up affecting the others (and
> you easily get "bunching up" of different processes when they get
> synchronized with each other thanks to the inner lock).
>
> Nested locking is often required, but it's one of those things where
> you just need to be aware that they can be horribly bad for
> performance, _particularly_ if an inner lock sees contention and
> essentially "transfers" that contention to an outer lock.
>
> Maybe I've been unlucky. Maybe the two cases I happened to look at
> were just completely harmless, and very unusual. But the fact that I'm
> two-for-two and go "that locking looks like a prime candidate to be
> fixed" makes me suspect there's a lot of low-hanging fruit in there.

Thank you so much for taking the time to write this great advise. Let me dig
more whether there's anything that I can relax the lock use-cases further.
(tbh, I haven't reviewed them for a long time due to focusing on stability
issues mostly.)

>
> And that whole "wait_event(trylock)" thing is a symptom of problematic
> f2fs locking, rather than a solution to it.

Understood. If I can avoid lock contention upfront, definitely it wouldn't
need to apply rwsem change at all. Let me take some time to think about how to
move forward.

>
> Linus

2022-03-24 12:45:46

by Jaegeuk Kim

[permalink] [raw]
Subject: Re: [GIT PULL] f2fs for 5.18

On 03/23, Christoph Hellwig wrote:
> On Wed, Mar 23, 2022 at 09:48:17AM -0700, Jaegeuk Kim wrote:
> > Christoph, I proposed,
> >
> > "I've been waiting for a generic solution as suggested here. Until then, I'd like
> > to keep this in f2fs *only* in order to ship the fix in products. Once there's
> > a right fix, let me drop or revise this patch again."
> >
> > https://lore.kernel.org/linux-f2fs-devel/[email protected]/
>
> That counts as ignoring the advice to me.

My apologies that you felt like that. Thanks,

2022-03-24 17:16:18

by Jaegeuk Kim

[permalink] [raw]
Subject: Re: [GIT PULL] f2fs for 5.18

On 03/23, Waiman Long wrote:
> On 3/23/22 12:48, Jaegeuk Kim wrote:
> > On 03/23, Christoph Hellwig wrote:
> > > On Tue, Mar 22, 2022 at 10:22:50AM -0700, Linus Torvalds wrote:
> > > > On Mon, Mar 21, 2022 at 1:39 PM Jaegeuk Kim <[email protected]> wrote:
> > > > > In this cycle, f2fs has some performance improvements for Android workloads such
> > > > > as using read-unfair rwsems [...]
> > > > I've pulled this, but that read-unfair rwsem code looks incredibly
> > > > dodgy. Doing your own locking is always a bad sign, and it ahs
> > > > traditionally come back to bite us pretty much every time. At least it
> > > > uses real lock primitives, just in a really odd way.
> > > FYI, Peter and I both pointed this out when the patches were posted
> > > and NAKed the patch, but the feedback was ignored.
> > Christoph, I proposed,
> >
> > "I've been waiting for a generic solution as suggested here. Until then, I'd like
> > to keep this in f2fs *only* in order to ship the fix in products. Once there's
> > a right fix, let me drop or revise this patch again."
> >
> > https://lore.kernel.org/linux-f2fs-devel/[email protected]/
> >
> I suspect f2fs may also need the 617f3ef95177 ("locking/rwsem: Remove reader
> optimistic spinning") to give higher priority to writer. Please let me know
> the test result when you are able to test v5.15 LTS to see if these commits
> are able to address the f2fs issue.

Sure, I'll keep an eye on it, but next kernel is likely to be applied to the
products in next year. It may take some time. :(

>
> I have some ideas of making a reader-unfair rwsem, but that requires either
> the introduction of a set of new down_read() variants or keeping the unfair
> state in the rwsem itself. I would like to make sure that there is really a
> need for such a thing before working on it.
>
> Cheers,
> Longman
>

2022-03-24 23:45:15

by Christoph Hellwig

[permalink] [raw]
Subject: Re: [GIT PULL] f2fs for 5.18

On Wed, Mar 23, 2022 at 09:48:17AM -0700, Jaegeuk Kim wrote:
> Christoph, I proposed,
>
> "I've been waiting for a generic solution as suggested here. Until then, I'd like
> to keep this in f2fs *only* in order to ship the fix in products. Once there's
> a right fix, let me drop or revise this patch again."
>
> https://lore.kernel.org/linux-f2fs-devel/[email protected]/

That counts as ignoring the advice to me.

2022-03-25 00:09:13

by Jaegeuk Kim

[permalink] [raw]
Subject: Re: [GIT PULL] f2fs for 5.18

On 03/22, Linus Torvalds wrote:
> On Tue, Mar 22, 2022 at 5:34 PM Tim Murray <[email protected]> wrote:
> >
> > AFAICT, what's happening is that rwsem_down_read_slowpath
> > modifies sem->count to indicate that there's a pending reader while
> > f2fs_ckpt holds the write lock, and when f2fs_ckpt releases the write
> > lock, it wakes pending readers and hands the lock over to readers.
> > This means that any subsequent attempt to grab the write lock from
> > f2fs_ckpt will stall until the newly-awakened reader releases the read
> > lock, which depends on the readers' arbitrarily long scheduling
> > delays.
>
> Ugh.
>
> So I'm looking at some of this, and you have things like this:
>
> f2fs_down_read(&F2FS_I(inode)->i_sem);
> cp_reason = need_do_checkpoint(inode);
> f2fs_up_read(&F2FS_I(inode)->i_sem);
>
> which really doesn't seem to want a sleeping lock at all.
>
> In fact, it's not clear that it has any business serializing with IO
> at all. It seems to just check very basic inode state. Very strange.
> It's the kind of thing that the VFS layer tends to use te i_lock
> *spinlock* for.

Um.. let me check this i_sem, introduced by
d928bfbfe77a ("f2fs: introduce fi->i_sem to protect fi's info").

OTOH, I was suspecting the major contetion would be
f2fs_lock_op -> f2fs_down_read(&sbi->cp_rwsem);
, which was used for most of filesystem operations.

And, when we need to do checkpoint, we'd like to block internal operations by
f2fs_lock_all -> f2fs_down_write(&sbi->cp_rwsem);

So, what I expected was giving the highest priority to the checkpoint thread
by grabbing down_write to block all the other readers.

>
> And perhaps equally oddly, then when you do f2fs_issue_checkpoint(),
> _that_ code uses fancy lockless lists.
>
> I'm probably mis-reading it.
>
> Linus

2022-03-25 01:34:57

by Tim Murray

[permalink] [raw]
Subject: Re: [GIT PULL] f2fs for 5.18

Hi Linus,

On Tue, Mar 22, 2022 at 10:50 AM Linus Torvalds
<[email protected]> wrote:
>
> Even when people get the semantics and memory ordering right (which is
> not always the case, but at least the f2fs code uses real lock
> primitives - just oddly - and should thus be ok), it invariably tends
> to be a sign of something else being very wrong.
>
> And I can easily believe that in this case it's due to a rmsem issue
> that was already fixed long long ago as per Waiman.
>
> Can people please test with the actual modern rwsem code and with the
> odd reader-unfair locks disabled?

I did try the patch from Waiman (backported to a local Android device
running 5.10 without the unfair-rwsem patch) when we were first
discussing this, and as far as I could tell, his patch did not make a
difference for this problem. I think I can explain why it solves a
different problem, and it is related to how f2fs uses locking
primitives and worker threads. In this case, it's important to know
that f2fs's checkpoint thread coalesces checkpoints on behalf of all
userspace threads calling fsync(), userspace threads calling many f2fs
functions take the read lock on the rwsem, and the checkpoint thread
takes the write lock. The userspace threads can be any CFS priority
and may be throttled for various reasons (constrained cpuset or cpuctl
runtime limits).

Here's one example from a trace associated with a very slow app start
from a 5.10 device before the unfair-rwsem patch:

1. pid 6711, a random userspace thread, is running on a CPU at time 0ms.
2. After 2.6ms of running, pid 6711 enters state D, deschedules, and
is blocked in f2fs_new_inode according to sched_blocked_reason (a
tracepoint we added in the Android kernel that emits the function
where a thread is blocked in uninterruptible sleep).
3. The f2fs_checkpoint thread runs 5ms after pid 6711 deschedules.
f2fs_ckpt runs for 63us, makes pid 6711 runnable, and then deschedules
in state D blocked at rwsem_down_write_slowpath.
4. pid 6711 remains runnable for 341ms--it's a low priority thread in
a throttled process and the system is busy. Meanwhile, f2fs_ckpt
remains in D for 341ms.
5. pid 6711 finally runs and makes f2fs_ckpt runnable after 11us.

If the problem were related to optimistic spinning that would be
addressed by Waiman's patch, I'd expect to see pid 6711 running
concurrently with f2fs_ckpt. However, that's not what happens; there's
a 5ms gap in between pid 6711 blocking on the read lock and f2fs_ckpt
running. AFAICT, what's happening is that rwsem_down_read_slowpath
modifies sem->count to indicate that there's a pending reader while
f2fs_ckpt holds the write lock, and when f2fs_ckpt releases the write
lock, it wakes pending readers and hands the lock over to readers.
This means that any subsequent attempt to grab the write lock from
f2fs_ckpt will stall until the newly-awakened reader releases the read
lock, which depends on the readers' arbitrarily long scheduling
delays.

AFAICT, any solution for this problem that doesn't require an overhaul
of f2fs locking requires that the newly-awakened readers can't block
f2fs_ckpt until they are scheduled and grab the read lock. Either we
could do something trylock-related like this patch implemented, or we
could move f2fs to percpu_rwsem. However, moving to percpu_rwsem means
no optimistic spinning, which also seems bad given how short the
read-locked critical sections are. We have seen similar problems with
contention on the cgroup percpu_rwsem that spin-on-owner could have
alleviated, so I'm hesitant to get rid of optimistic spinning for f2fs
by switching to percpu_rwsem.

The most obvious objection to the patch is that f2fs_ckpt will still
stall if a reader deschedules while holding the lock. That is
absolutely true! However, in practice, we were hitting the worst case
every time because any attempt to grab the read lock while f2fs_ckpt
holds the write lock exposes f2fs_ckpt to arbitrary scheduling delay
from the reader. The reader-unfair rwsem patch prevents this problem.
We have evidence this is true in practice, too. I just ran a quick
comparison on 150 slow app start traces from our internal population
comparing kernels pre- and post-unfair-rwsem patch, and the percentage
of time the thread responsible for UI was stalled on f2fs locks went
from ~50% of all uninterruptible sleep time on that thread to less
than 5%.

I think this is a special case where fairness to readers is explicitly
harmful because any thread at any priority can block on the only
writer (f2fs_ckpt) but any low priority thread could be a reader
blocking the only writer. I don't know if it's worth explicit support
in rw_semaphore for this kind of use case, but I don't think an
existing rwsem patch addresses this issue because it is so unusual.

2022-03-25 18:41:22

by Linus Torvalds

[permalink] [raw]
Subject: Re: [GIT PULL] f2fs for 5.18

On Wed, Mar 23, 2022 at 9:26 AM Jaegeuk Kim <[email protected]> wrote:
>
> OTOH, I was suspecting the major contetion would be
> f2fs_lock_op -> f2fs_down_read(&sbi->cp_rwsem);
> , which was used for most of filesystem operations.

Very possible, I was just looking at a random one in f2fs/file.c
obviously with no actual numbers in hand.

In general, I really hate seeing specialized locks, but this f2fs use
case is in some ways worse than other ad-hoc locks I've seen - simply
because it's been one whole-sale conversion of "down_read/write()" to
"f2fs_down_read/write()" - regardless of _which_ lock is being locked.

(Now, it's not all bad news - in other respects it's much better than
some ad-hoc locking: at least you still will participate in lockdep,
and you use the actual low-level locking primitives instead of making
up your own and getting memory ordering wrong).

But basically I think it would have been much nicer if you would have
done this for just the _one_ lock that mattered, whichever lock that
might be. Partly as documentation, and partly so that maybe some day
you can split that lock up (or maybe notice cases where you can avoid
it entirely).

For example, if it's really just f2fs_lock_op() that needs this, the
special "wait_event(trylock)" hack could have been entirely local to
just *that*, rather than affecting all the other locks too.

And the very first f2fs_lock_op() case I find, I see that the lock is
pointless. Again, that's unlikely to be the *cause* of any of these
problems, but the fact that I've now looked at two of the f2fs locks,
and gone "the locking seems to be pointlessly badly done" does imply
that the problem isn't "down_read()", it's the use.

That other lock I reacted to was the f2fs_lock_op(sbi) at the top of
f2fs_new_inode().

Look, you have a new inode that you just allocated, that nobody else
can yet access.

And the only thing that that f2fs_lock_op(sbi) -> f2fs_unlock_op(sbi)
sequence protects is the f2fs_alloc_nid() for that new inode.

Ok, so maybe f2fs_alloc_nid() needs that lock?

No it doesn't. It already has

- &nm_i->nid_list_lock spinlock for its own in-memory internal NID caches

*and* when that fails

- &NM_I(sbi)->build_lock for protecting all of f2fs_build_free_nids()

*and* inside of that lock

- f2fs_down_read(&nm_i->nat_tree_lock) for protecting the NAT tree structures.

So I see two major issues in the very first user of that
f2fs_lock_op() that I look at:

(a) it seems to be entirely unnecessary

(b) it is a classic case of "multiple nested locks".

Now, it's possible that I'm wrong on (a) and there's some odd reason
that lock is needed (maybe there is a lock ordering problem for one of
the other locks between readers and writers, and the op-lock acts as a
mutual exclusion for that).

But (b) really is a classic problem case for locking: nested locks are
*much* more likely to cause horrible contention, because not any
contention in any of the locks will end up affecting the others (and
you easily get "bunching up" of different processes when they get
synchronized with each other thanks to the inner lock).

Nested locking is often required, but it's one of those things where
you just need to be aware that they can be horribly bad for
performance, _particularly_ if an inner lock sees contention and
essentially "transfers" that contention to an outer lock.

Maybe I've been unlucky. Maybe the two cases I happened to look at
were just completely harmless, and very unusual. But the fact that I'm
two-for-two and go "that locking looks like a prime candidate to be
fixed" makes me suspect there's a lot of low-hanging fruit in there.

And that whole "wait_event(trylock)" thing is a symptom of problematic
f2fs locking, rather than a solution to it.

Linus

2022-03-25 19:34:07

by Waiman Long

[permalink] [raw]
Subject: Re: [GIT PULL] f2fs for 5.18

On 3/23/22 12:48, Jaegeuk Kim wrote:
> On 03/23, Christoph Hellwig wrote:
>> On Tue, Mar 22, 2022 at 10:22:50AM -0700, Linus Torvalds wrote:
>>> On Mon, Mar 21, 2022 at 1:39 PM Jaegeuk Kim <[email protected]> wrote:
>>>> In this cycle, f2fs has some performance improvements for Android workloads such
>>>> as using read-unfair rwsems [...]
>>> I've pulled this, but that read-unfair rwsem code looks incredibly
>>> dodgy. Doing your own locking is always a bad sign, and it ahs
>>> traditionally come back to bite us pretty much every time. At least it
>>> uses real lock primitives, just in a really odd way.
>> FYI, Peter and I both pointed this out when the patches were posted
>> and NAKed the patch, but the feedback was ignored.
> Christoph, I proposed,
>
> "I've been waiting for a generic solution as suggested here. Until then, I'd like
> to keep this in f2fs *only* in order to ship the fix in products. Once there's
> a right fix, let me drop or revise this patch again."
>
> https://lore.kernel.org/linux-f2fs-devel/[email protected]/
>
I suspect f2fs may also need the 617f3ef95177 ("locking/rwsem: Remove
reader optimistic spinning") to give higher priority to writer. Please
let me know the test result when you are able to test v5.15 LTS to see
if these commits are able to address the f2fs issue.

I have some ideas of making a reader-unfair rwsem, but that requires
either the introduction of a set of new down_read() variants or keeping
the unfair state in the rwsem itself. I would like to make sure that
there is really a need for such a thing before working on it.

Cheers,
Longman


2022-03-25 19:53:49

by Jaegeuk Kim

[permalink] [raw]
Subject: Re: [GIT PULL] f2fs for 5.18

On 03/23, Christoph Hellwig wrote:
> On Tue, Mar 22, 2022 at 10:22:50AM -0700, Linus Torvalds wrote:
> > On Mon, Mar 21, 2022 at 1:39 PM Jaegeuk Kim <[email protected]> wrote:
> > >
> > > In this cycle, f2fs has some performance improvements for Android workloads such
> > > as using read-unfair rwsems [...]
> >
> > I've pulled this, but that read-unfair rwsem code looks incredibly
> > dodgy. Doing your own locking is always a bad sign, and it ahs
> > traditionally come back to bite us pretty much every time. At least it
> > uses real lock primitives, just in a really odd way.
>
> FYI, Peter and I both pointed this out when the patches were posted
> and NAKed the patch, but the feedback was ignored.

Christoph, I proposed,

"I've been waiting for a generic solution as suggested here. Until then, I'd like
to keep this in f2fs *only* in order to ship the fix in products. Once there's
a right fix, let me drop or revise this patch again."

https://lore.kernel.org/linux-f2fs-devel/[email protected]/

2022-06-15 20:22:53

by Pavel Machek

[permalink] [raw]
Subject: Re: [GIT PULL] f2fs for 5.18

Hi!

> > > AFAICS, the read-unfair rwsem code is created to resolve a potential
> > > lock starvation problem that they found on linux-5.10.y stable tree. I
> > > believe I have fixed that in the v5.11 kernel, see commit 2f06f702925
> > > ("locking/rwsem: Prevent potential lock starvation").
> >
> > Ahh.
> >
> > Adding Tim Murray to the cc, since he was the source of that odd
> > reader-unfair thing.
> >
> > I really *really* dislike people thinking they can do locking
> > primitives, because history has taught us that they are wrong.
> >
> > Even when people get the semantics and memory ordering right (which is
> > not always the case, but at least the f2fs code uses real lock
> > primitives - just oddly - and should thus be ok), it invariably tends
> > to be a sign of something else being very wrong.
> >
> > And I can easily believe that in this case it's due to a rmsem issue
> > that was already fixed long long ago as per Waiman.
> >
> > Can people please test with the actual modern rwsem code and with the
> > odd reader-unfair locks disabled?
>
> The pain point is 1) we don't have a specific test to reproduce the issue,
> but got some foundings from field only, 2) in order to test the patches, we
> need to merge the patches into Android kernel [1] through LTS, 3) but, LTS
> wants to see any test results [2].
>
> [1] https://android-review.googlesource.com/q/topic:rwsem_unfair
> [2] https://lore.kernel.org/stable/[email protected]/

Wait, what? Normally, patches are tested before going to mainline, and especially
before being backported to stable.

If you can't reproduce issue on mainline kernel, there's something very wrong
with trying to fix it on mainline kernel. You should not be merging untested fixes
so that they can make it into mainline and then into stable and then into android kernel
to be tested.

If there's no other way, you should be able to backport those patches to android kernel and
test them _before_ merging them. Android phones are rather cheap. Some should even run mainline
kernels -- see for example Oneplus 4T -- if you don't need all the features.

It looks hch was right NAKing the patches.

Best regards,

Pavel
--
(english) http://www.livejournal.com/~pavelmachek
(cesky, pictures) http://atrey.karlin.mff.cuni.cz/~pavel/picture/horses/blog.html

2022-06-16 17:03:57

by Jaegeuk Kim

[permalink] [raw]
Subject: Re: [GIT PULL] f2fs for 5.18

On 06/15, Pavel Machek wrote:
> Hi!
>
> > > > AFAICS, the read-unfair rwsem code is created to resolve a potential
> > > > lock starvation problem that they found on linux-5.10.y stable tree. I
> > > > believe I have fixed that in the v5.11 kernel, see commit 2f06f702925
> > > > ("locking/rwsem: Prevent potential lock starvation").
> > >
> > > Ahh.
> > >
> > > Adding Tim Murray to the cc, since he was the source of that odd
> > > reader-unfair thing.
> > >
> > > I really *really* dislike people thinking they can do locking
> > > primitives, because history has taught us that they are wrong.
> > >
> > > Even when people get the semantics and memory ordering right (which is
> > > not always the case, but at least the f2fs code uses real lock
> > > primitives - just oddly - and should thus be ok), it invariably tends
> > > to be a sign of something else being very wrong.
> > >
> > > And I can easily believe that in this case it's due to a rmsem issue
> > > that was already fixed long long ago as per Waiman.
> > >
> > > Can people please test with the actual modern rwsem code and with the
> > > odd reader-unfair locks disabled?
> >
> > The pain point is 1) we don't have a specific test to reproduce the issue,
> > but got some foundings from field only, 2) in order to test the patches, we
> > need to merge the patches into Android kernel [1] through LTS, 3) but, LTS
> > wants to see any test results [2].
> >
> > [1] https://android-review.googlesource.com/q/topic:rwsem_unfair
> > [2] https://lore.kernel.org/stable/[email protected]/
>
> Wait, what? Normally, patches are tested before going to mainline, and especially
> before being backported to stable.
>
> If you can't reproduce issue on mainline kernel, there's something very wrong
> with trying to fix it on mainline kernel. You should not be merging untested fixes
> so that they can make it into mainline and then into stable and then into android kernel
> to be tested.

What do you mean "untested fixes" here? As Tim mentioned [1], this F2FS patch
resolved the issue in our Pixel devices.

[1] https://lore.kernel.org/lkml/CAEe=Sxmcn5+YUXBQhxDpzZVJu_T6S6+EURDqrP9uUS-PHGyuSg@mail.gmail.com/

>
> If there's no other way, you should be able to backport those patches to android kernel and
> test them _before_ merging them. Android phones are rather cheap. Some should even run mainline
> kernels -- see for example Oneplus 4T -- if you don't need all the features.

IIUC, the point here was whether we need another generic rwsem API to address
the issue or not. [1] said some rwsem fixes couldn't resolve our issue,
and Waiman wanted to test another patch [2]. In order to avoid endless
tests, I decided to get some results from Pixel using v5.15 (at least) by
turning CONFIG_F2FS_UNFAIR_RWSEM off as of now. If we can see v5.15
works, I'm happy to revert the F2FS patch. Otherwise, we need it for
our production.

[2] https://lore.kernel.org/lkml/[email protected]/#t

>
> It looks hch was right NAKing the patches.
>
> Best regards,
>
> Pavel
> --
> (english) http://www.livejournal.com/~pavelmachek
> (cesky, pictures) http://atrey.karlin.mff.cuni.cz/~pavel/picture/horses/blog.html