by Brian Foster

[permalink] [raw]

Subject: Re: [PATCH 2/2] xfs: make sure link path does not go away at access

On Wed, Nov 17, 2021 at 11:22:51AM +1100, Dave Chinner wrote:
> On Tue, Nov 16, 2021 at 10:59:05AM -0500, Brian Foster wrote:
> > On Tue, Nov 16, 2021 at 02:01:20PM +1100, Dave Chinner wrote:
> > > On Tue, Nov 16, 2021 at 09:03:31AM +0800, Ian Kent wrote:
> > > > On Tue, 2021-11-16 at 09:24 +1100, Dave Chinner wrote:
> > > > > If it isn't safe for ext4 to do that, then we have a general
> > > > > pathwalk problem, not an XFS issue. But, as you say, it is safe
> > > > > to do this zeroing, so the fix to xfs_ifree() is to zero the
> > > > > link buffer instead of freeing it, just like ext4 does.
> > > > >
> > > > > As a side issue, we really don't want to move what XFS does in
> > > > > .destroy_inode to .free_inode because that then means we need to
> > > > > add synchronise_rcu() calls everywhere in XFS that might need to
> > > > > wait on inodes being inactivated and/or reclaimed. And because
> > > > > inode reclaim uses lockless rcu lookups, there's substantial
> > > > > danger of adding rcu callback related deadlocks to XFS here.
> > > > > That's just not a direction we should be moving in.
> > > >
> > > > Another reason I decided to use the ECHILD return instead is that
> > > > I thought synchronise_rcu() might add an unexpected delay.
> > >
> > > It depends where you put the synchronise_rcu() call. :)
> > >
> > > > Since synchronise_rcu() will only wait for processes that
> > > > currently have the rcu read lock do you think that could actually
> > > > be a problem in this code path?
> > >
> > > No, I don't think it will. The inode recycle case in XFS inode
> > > lookup can trigger in two cases:
> > >
> > > 1. VFS cache eviction followed by immediate lookup
> > > 2. Inode has been unlinked and evicted, then free and reallocated by
> > > the filesytsem.
> > >
> > > In case #1, that's a cold cache lookup and hence delays are
> > > acceptible (e.g. a slightly longer delay might result in having to
> > > fetch the inode from disk again). Calling synchronise_rcu() in this
> > > case is not going to be any different from having to fetch the inode
> > > from disk...
> > >
> > > In case #2, there's a *lot* of CPU work being done to modify
> > > metadata (inode btree updates, etc), and so the operations can block
> > > on journal space, metadata IO, etc. Delays are acceptible, and could
> > > be in the order of hundreds of milliseconds if the transaction
> > > subsystem is bottlenecked. waiting for an RCU grace period when we
> > > reallocate an indoe immediately after freeing it isn't a big deal.
> > >
> > > IOWs, if synchronize_rcu() turns out to be a problem, we can
> > > optimise that separately - we need to correct the inode reuse
> > > behaviour w.r.t. VFS RCU expectations, then we can optimise the
> > > result if there are perf problems stemming from correct behaviour.
> > >
> >
> > FWIW, with a fairly crude test on a high cpu count system, it's not that
> > difficult to reproduce an observable degradation in inode allocation
> > rate with a synchronous grace period in the inode reuse path, caused
> > purely by a lookup heavy workload on a completely separate filesystem.
> >
> > The following is a 5m snapshot of the iget stats from a filesystem doing
> > allocs/frees with an external/heavy lookup workload (which not included
> > in the stats), with and without a sync grace period wait in the reuse
> > path:
> >
> > baseline: ig 1337026 1331541 4 5485 0 5541 1337026
> > sync_rcu_test: ig 2955 2588 0 367 0 383 2955
>
> The alloc/free part of the workload is a single threaded
> create/unlink in a tight loop, yes?
>
> This smells like a side effect of agressive reallocation of
> just-freed XFS_IRECLAIMABLE inodes from the finobt that haven't had
> their unlink state written back to disk yet. i.e. this is a corner
> case in #2 above where a small set of inodes is being repeated
> allocated and freed by userspace and hence being agressively reused
> and never needing to wait for IO. i.e. a tempfile workload
> optimisation...
>

Yes, that was the point of the test.. to stress inode reuse against
known rcu activity.

> > I think this is kind of the nature of RCU and why I'm not sure it's a
> > great idea to rely on update side synchronization in a codepath that
> > might want to scale/perform in certain workloads.
>
> The problem here is not update side synchronisation. Root cause is
> aggressive reallocation of recently freed VFS inodes via physical
> inode allocation algorithms. Unfortunately, the RCU grace period
> requirements of the VFS inode life cycle dictate that we can't
> aggressively re-allocate and reuse freed inodes like this. i.e.
> reallocation of a just-freed inode also has to wait for an RCU grace
> period to pass before the in memory inode can be re-instantiated as
> a newly allocated inode.
>

I'm just showing that insertion of an synchronous rcu grace period wait
in the iget codepath is not without side effect, because that was the
proposal.

> (Hmmmm - I wonder if of the other filesystems might have similar
> problems with physical inode reallocation inside a RCU grace period?
> i.e. without inode instance re-use, the VFS could potentially see
> multiple in-memory instances of the same physical inode at the same
> time.)
>
> > I'm not totally sure
> > if this will be a problem for real users running real workloads or not,
> > or if this can be easily mitigated, whether it's all rcu or a cascading
> > effect, etc. This is just a quick test so that all probably requires
> > more test and analysis to discern.
>
> This looks like a similar problem to what busy extents address - we
> can't reuse a newly freed extent until the transaction containing
> the EFI/EFD hit stable storage (and the discard operation on the
> range is complete). Hence while a newly freed extent is
> marked free in the allocbt, they can't be reused until they are
> released from the busy extent tree.
>
> I can think of several ways to address this, but let me think on it
> a bit more. I suspect there's a trick we can use to avoid needing
> synchronise_rcu() completely by using the spare radix tree tag and
> rcu grace period state checks with get_state_synchronize_rcu() and
> poll_state_synchronize_rcu() to clear the radix tree tags via a
> periodic radix tree tag walk (i.e. allocation side polling for "can
> we use this inode" rather than waiting for the grace period to
> expire once an inode has been selected and allocated.)
>

Yeah, and same. It's just a matter of how to break things down. I can
sort of see where you're going with the above, though I'm not totally
convinced that rcu gp polling is an advantage over explicit use of
existing infrastructure/apis. It seems more important that we avoid
overly crude things like sync waits in the alloc path vs. optimize away
potentially multiple async grace periods in the free path. Of course,
it's worth thinking about options regardless.

That said, is deferred inactivation still a thing? If so, then we've
already decided to defer/batch inactivations from the point the vfs
calls our ->destroy_inode() based on our own hueristic (which is likely
longer than a grace period already in most cases, making this even less
of an issue). That includes deferral of the physical free and inobt
updates, which means inode reuse can't occur until the inactivation
workqueue task runs. Only a single grace period is required to cover
(from the rcuwalk perspective) the entire set of inodes queued for
inactivation. That leaves at least a few fairly straightforward options:

1. Use queue_rcu_work() to schedule the inactivation task. We'd probably
have to isolate the list to process first from the queueing context
rather than from workqueue context to ensure we don't process recently
added inodes that haven't sat for a grace period.

2. Drop a synchronize_rcu() in the workqueue task before it starts
processing items.

3. Incorporate something like the above with an rcu grace period cookie
to selectively process inodes (or batches thereof).

Options 1 and 2 both seem rather simple and unlikely to noticeably
impact behavior/performance. Option 3 seems a bit overkill to me, but is
certainly an option if the previous assertion around performance doesn't
hold true (particularly if we keep the tracking simple, such as
recording/enforcing the most recent gp of the set). Thoughts?

Brian

> > > >
> > > > Sorry, I don't understand what you mean by the root cause not
> > > > being identified?
> > >
> > > The whole approach of "we don't know how to fix the inode reuse case
> > > so disable it" implies that nobody has understood where in the reuse
> > > case the problem lies. i.e. "inode reuse" by itself is not the root
> > > cause of the problem.
> > >
> >
> > I don't think anybody suggested to disable inode reuse.
>
> Nobody did, so that's not what I was refering to. I was refering to
> the patches for and comments advocating disabling .get_link for RCU
> pathwalk because of the apparently unsolved problems stemming from
> inode reuse...
>
> > > The root cause is "allowing an inode to be reused without waiting
> > > for an RCU grace period to expire". This might seem pedantic, but
> > > "without waiting for an rcu grace period to expire" is the important
> > > part of the problem (i.e. the bug), not the "allowing an inode to be
> > > reused" bit.
> > >
> > > Once the RCU part of the problem is pointed out, the solution
> > > becomes obvious. As nobody had seen the obvious (wait for an RCU
> > > grace period when recycling an inode) it stands to reason that
> > > nobody really understood what the root cause of the inode reuse
> > > problem.
> > >
> >
> > The synchronize_rcu() approach was one of the first options discussed in
> > the bug report once a reproducer was available.
>
> What bug report would that be? :/
>
> It's not one that I've read, and I don't recall seeing a pointer to
> it anywhere in the path posting. IOWs, whatever discussion happened
> in a private distro bug report can't be assumed as "general
> knowledge" in an upstream discussion...
>
> > AIUI, this is not currently a reproducible problem even before patch 1,
> > which reduces the race window even further. Given that and the nak on
> > the current patch (the justification for which I don't really
> > understand), I'm starting to agree with Ian's earlier statement that
> > perhaps it is best to separate this one so we can (hopefully) move patch
> > 1 along on its own merit..
>
> *nod*
>
> The problem seems pretty rare, the pathwalk patch makes it
> even rarer, so I think they can be separated just fine.
>
> Cheers,
>
> Dave.
> --
> Dave Chinner
> [email protected]
>

2021-11-17 20:39:24

On Wed, Dec 15, 2021 at 11:54:11AM +0800, Ian Kent wrote:
> On Wed, 2021-11-24 at 15:56 -0500, Brian Foster wrote:
> > On Tue, Nov 23, 2021 at 10:26:57AM +1100, Dave Chinner wrote:
> > > On Mon, Nov 22, 2021 at 02:27:59PM -0500, Brian Foster wrote:
> > > > On Mon, Nov 22, 2021 at 11:08:51AM +1100, Dave Chinner wrote:
> > > > > On Fri, Nov 19, 2021 at 02:44:21PM -0500, Brian Foster wrote:
> > > > > > In
> > > > > > any event, another experiment I ran in light of the above
> > > > > > results that
> > > > > > might be similar was to put the inode queueing component of
> > > > > > destroy_inode() behind an rcu callback. This reduces the
> > > > > > single threaded
> > > > > > perf hit from the previous approach by about 50%. So not
> > > > > > entirely
> > > > > > baseline performance, but it's back closer to baseline if I
> > > > > > double the
> > > > > > throttle threshold (and actually faster at 4x). Perhaps my
> > > > > > crude
> > > > > > prototype logic could be optimized further to not rely on
> > > > > > percpu
> > > > > > threshold changes to match the baseline.
> > > > > >
> > > > > > My overall takeaway from these couple hacky experiments is
> > > > > > that the
> > > > > > unconditional synchronous rcu wait is indeed probably too
> > > > > > heavy weight,
> > > > > > as you point out. The polling or callback (or perhaps your
> > > > > > separate
> > > > > > queue) approach seems to be in the ballpark of viability,
> > > > > > however,
> > > > > > particularly when we consider the behavior of scaled or mixed
> > > > > > workloads
> > > > > > (since inactive queue processing seems to be size driven vs.
> > > > > > latency
> > > > > > driven).
> > > > > >
> > > > > > So I dunno.. if you consider the various severity and
> > > > > > complexity
> > > > > > tradeoffs, this certainly seems worth more consideration to
> > > > > > me. I can
> > > > > > think of other potentially interesting ways to experiment
> > > > > > with
> > > > > > optimizing the above or perhaps tweak queueing to better
> > > > > > facilitate
> > > > > > taking advantage of grace periods, but it's not worth going
> > > > > > too far down
> > > > > > that road if you're wedded to the "busy inodes" approach.
> > > > >
> > > > > I'm not wedded to "busy inodes" but, as your experiments are
> > > > > indicating, trying to handle rcu grace periods into the
> > > > > deferred
> > > > > inactivation path isn't completely mitigating the impact of
> > > > > having
> > > > > to wait for a grace period for recycling of inodes.
> > > > >
> > > >
> > > > What I'm seeing so far is that the impact seems to be limited to
> > > > the
> > > > single threaded workload and largely mitigated by an increase in
> > > > the
> > > > percpu throttle limit. IOW, it's not completely free right out of
> > > > the
> > > > gate, but the impact seems isolated and potentially mitigated by
> > > > adjustment of the pipeline.
> > > >
> > > > I realize the throttle is a percpu value, so that is what has me
> > > > wondering about some potential for gains in efficiency to try and
> > > > get
> > > > more of that single-threaded performance back in other ways, or
> > > > perhaps
> > > > enhancements that might be more broadly beneficial to deferred
> > > > inactivations in general (i.e. some form of adaptive throttling
> > > > thresholds to balance percpu thresholds against a global
> > > > threshold).
> > >
> > > I ran experiments on queue depth early on. Once we go over a few
> > > tens of inodes we start to lose the "hot cache" effect and
> > > performance starts to go backwards. By queue depths of hundreds,
> > > we've lost all the hot cache and nothing else gets that performance
> > > back because we can't avoid the latency of all the memory writes
> > > from cache eviction and the followup memory loads that result.
> > >
> >
> > Admittedly my testing is simple/crude as I'm just exploring the
> > potential viability of a concept, not fine tuning a workload, etc.
> > That
> > said, I'm curious to know what your tests for this look like because
> > I
> > suspect I'm running into different conditions. My tests frequently
> > hit
> > the percpu throttle threshold (256 inodes), which is beyond your
> > ideal
> > tens of inodes range (and probably more throttle limited than cpu
> > cache
> > limited).
> >
> > > Making the per-cpu queues longer or shorter based on global state
> > > won't gain us anything. ALl it will do is slow down local
> > > operations
> > > that don't otherwise need slowing down....
> > >
> >
> > This leaves out context. The increase in throttle threshold mitigates
> > the delays I've introduced via the rcu callback. That happens to
> > produce
> > observable results comparable to my baseline test, but it's more of a
> > measure of the impact of the delay than a direct proposal. If there's
> > a
> > more fine grained test worth running here (re: above), please
> > describe
> > it.
> >
> > > > > I suspect a rethink on the inode recycling mechanism is needed.
> > > > > THe
> > > > > way it is currently implemented was a brute force solution - it
> > > > > is
> > > > > simple and effective. However, we need more nuance in the
> > > > > recycling
> > > > > logic now.? That is, if we are recycling an inode that is
> > > > > clean, has
> > > > > nlink >=1 and has not been unlinked, it means the VFS evicted
> > > > > it too
> > > > > soon and we are going to re-instantiate it as the identical
> > > > > inode
> > > > > that was evicted from cache.
> > > > >
> > > >
> > > > Probably. How advantageous is inode memory reuse supposed to be
> > > > in the
> > > > first place? I've more considered it a necessary side effect of
> > > > broader
> > > > architecture (i.e. deferred reclaim, etc.) as opposed to a
> > > > primary
> > > > optimization.
> > >
> > > Yes, it's an architectural feature resulting from the filesystem
> > > inode life cycle being different to the VFS inode life cycle. This
> > > was inherited from Irix - it had separate inactivation vs reclaim
> > > states and action steps for vnodes - inactivation occurred when the
> > > vnode refcount went to zero, reclaim occurred when the vnode was to
> > > be freed.
> > >
> > > Architecturally, Linux doesn't have this two-step infrastructure;
> > > it
> > > just has evict() that runs everything when the inode needs to be
> > > reclaimed. Hence we hide the two-phase reclaim architecture of XFS
> > > behind that, and so we always had this troublesome impedence
> > > mismatch on Linux.
> > >
> >
> > Ok, that was generally how I viewed it.
> >
> > > Thinking a bit about this, maybe there is a more integrated way to
> > > handle this life cycle impedence mismatch by making the way we
> > > interact with the linux inode cache to be more ...? Irix like.
> > > Linux
> > > does actually give us a a callback when the last reference to an
> > > inode goes away: .drop_inode()
> > >
> > > i.e. Maybe we should look to triggering inactivation work from
> > > ->drop_inode instead of ->destroy_inode and hence always leaving
> > > unreferenced, reclaimable inodes in the VFS cache on the LRU. i.e.
> > > rather than hiding the two-phase XFS inode inactivation+reclaim
> > > algorithm from the VFS, move it up into the VFS.? If we prevent
> > > inodes from being reclaimed from the LRU until they have finished
> > > inactivation and are clean (easy enough just by marking the inode
> > > as
> > > dirty), that would allow us to immediately reclaim and free inodes
> > > in evict() context. Integration with __wait_on_freeing_inode()
> > > would
> > > like solve the RCU reuse/recycling issues.
> > >
> >
> > Hmm.. this is the point where we decide whether the inode remains
> > cached, which is currently basically whether the inode has a link
> > count
> > or not. That makes me curious what (can) happens with an
> > unlinked/inactivated inode on the lru. I'm not sure any other fs' do
> > anything like that currently..?
> >
> > > There's more to it than just this, but perhaps the longer term
> > > direction should be closer integration with the Linux inode cache
> > > life cycle rather than trying to solve all these problems
> > > underneath
> > > the VFS cache whilst still trying to play by it's rules...
> > >
> >
> > Yeah. Caching logic details aside, I think that makes sense.
> >
> > > > I still see reuse occur with deferred inactivation, we
> > > > just end up cycling through the same set of inodes as they fall
> > > > through
> > > > the queue rather than recycling the same one over and over. I'm
> > > > sort of
> > > > wondering what the impact would be if we didn't do this at all
> > > > (for the
> > > > new allocation case at least).
> > >
> > > We end up with a larger pool of free inodes in the finobt. This is
> > > basically what my "busy inode check" proposal is based on - inodes
> > > that we can't allocate without recycling just remain on the finobt
> > > for longer before they can be used. This would be awful if we
> > > didn't
> > > have the finobt to efficiently locate free inodes - the finobt
> > > record iteration makes it pretty low overhead to scan inodes here.
> > >
> >
> > I get the idea. That last bit is what I'm skeptical about. The finobt
> > is
> > based on the premise that free inode lookup becomes a predictable
> > tree
> > lookup instead of the old searching algorithm on the inobt, which we
> > still support and can be awful in its own right under worst case
> > conditions. I agree that this would be bad on the inobt (which raises
> > the question on how we'd provide these recycling correctness
> > guarantees
> > on !finobt fs'). What I'm more concerned about is whether this could
> > make finobt enabled fs' (transiently) just as poor as the old algo
> > under
> > certain workloads/conditions.
> >
> > I think there needs to be at least some high level description of the
> > search algorithm before we can sufficiently reason about it's
> > behavior..
> >
> > > > > So how much re-initialisation do we actually need for that
> > > > > inode?
> > > > > Almost everything in the inode is still valid; the problems
> > > > > come
> > > > > from inode_init_always() resetting the entire internal inode
> > > > > state
> > > > > and XFS then having to set them up again.? The internal state
> > > > > is
> > > > > already largely correct when we start recycling, and the
> > > > > identity of
> > > > > the recycled inode does not change when nlink >= 1. Hence
> > > > > eliding
> > > > > inode_init_always() would also go a long way to avoiding the
> > > > > need
> > > > > for a RCU grace period to pass before we can make the inode
> > > > > visible
> > > > > to the VFS again.
> > > > >
> > > > > If we can do that, then the only indoes that need a grace
> > > > > period
> > > > > before they can be recycled are unlinked inodes as they change
> > > > > identity when being recycled. That identity change absolutely
> > > > > requires a grace period to expire before the new instantiation
> > > > > can
> > > > > be made visible.? Given the arbitrary delay that this can
> > > > > introduce
> > > > > for an inode allocation operation, it seems much better suited
> > > > > to
> > > > > detecting busy inodes than waiting for a global OS state change
> > > > > to
> > > > > occur...
> > > > >
> > > >
> > > > Maybe..? The experiments I've been doing are aimed at simplicity
> > > > and
> > > > reducing the scope of the changes. Part of the reason for this is
> > > > tbh
> > > > I'm not totally convinced we really need to do anything more
> > > > complex
> > > > than preserve the inline symlink buffer one way or another (for
> > > > example,
> > > > see the rfc patch below for an alternative to the inline symlink
> > > > rcuwalk
> > > > disable patch). Maybe we should consider something like this
> > > > anyways.
> > > >
> > > > With busy inodes, we need to alter inode allocation to some
> > > > degree to
> > > > accommodate. We can have (tens of?) thousands of inodes under the
> > > > grace
> > > > period at any time based on current batching behavior, so it's
> > > > not
> > > > totally evident to me that we won't end up with some of the same
> > > > fundamental issues to deal with here, just needing to be
> > > > accommodated in
> > > > the inode allocation algorithm rather than the teardown sequence.
> > >
> > > Sure, but the purpose of the allocation selection
> > > policy is to select the best inode to allocate for the current
> > > context.? The cost of not being able to use an inode immediately
> > > needs to be factored into that allocation policy. i.e. if the
> > > selected inode has an associated delay with it before it can be
> > > reused and other free inodes don't, then we should not be selecting
> > > the inode with a delay associcated with it.
> > >
> >
> > We still have to find those "no delay" inodes. AFAICT the worst case
> > conditions on the system I've been playing with can have something
> > like
> > 20k free && busy inodes. That could cover all or most of the finobt
> > at
> > the time of an inode allocation. What happens from there depends on
> > the
> > search algorithm.
> >
> > > This is exactly the reasoning and logic we use for busy extents.?
> > > We
> > > only take the blocking penalty for resolving busy extent state if
> > > we
> > > run out of free extents to search before we've found an allocation
> > > candidate. I think it makes sense for inode allocation, too.
> > >
> >
> > Sure, the idea makes sense and it's worth looking into. But there are
> > enough contextual differences that I wouldn't just assume the same
> > logic
> > translates over to the finobt without potential for performance
> > impact.
> > For example, extent allocation has some advantages with things like
> > delalloc (physical block allocation occurs async from buffered write
> > syscall time) and the fact that metadata allocs can reuse busy
> > blocks.
> > The finobt only tracks existing chunks with free inodes, so it's
> > easily
> > possible to have conditions where the finobt is 100% (or majority)
> > populated with busy inodes (whether it be one inode or several
> > thousand).
> >
> > This raises questions like at what point does search cost become a
> > factor? At what point and with what frequency do we suffer the
> > blocking
> > penalty? Do we opt to allocate new chunks based on gp state?
> > Something
> > else? We don't need to answer these questions here (this thread is
> > long
> > enough :P). I'm just trying to say that it's one thing to consider
> > the
> > approach a viable option, but it isn't automatically preferable just
> > because we use it for extents. Further details beyond "detect busy
> > inodes" would be nice to objectively reason about.
> >
> > > > --- 8< ---
> > > >
> > > > diff --git a/fs/xfs/xfs_inode.c b/fs/xfs/xfs_inode.c
> > > > index 64b9bf334806..058e3fc69ff7 100644
> > > > --- a/fs/xfs/xfs_inode.c
> > > > +++ b/fs/xfs/xfs_inode.c
> > > > @@ -2644,7 +2644,7 @@ xfs_ifree(
> > > > ???????? * already been freed by xfs_attr_inactive.
> > > > ???????? */
> > > > ????????if (ip->i_df.if_format == XFS_DINODE_FMT_LOCAL) {
> > > > -???????????????kmem_free(ip->i_df.if_u1.if_data);
> > > > +???????????????kfree_rcu(ip->i_df.if_u1.if_data);
> > > > ????????????????ip->i_df.if_u1.if_data = NULL;
> > >
> > > That would need to be rcu_assign_pointer(ip->i_df.if_u1.if_data,
> > > NULL) to put the correct memory barriers in place, right? Also, I
> > > think ip->i_df.if_u1.if_data needs to be set to NULL before calling
> > > kfree_rcu() so racing lookups will always see NULL before
> > > the object is freed.
> > >
> >
> > I think rcu_assign_pointer() is intended to be paired with the
> > associated rcu deref and for scenarios like making sure an object
> > isn't
> > made available until it's completely initialized (i.e. such as for
> > rcu
> > protected list traversals, etc.).
> >
> > With regard to ordering, we no longer access if_data in rcuwalk mode
> > with this change. Thus I think all we need here is the
> > WRITE_ONCE(i_link, NULL) that pairs with the READ_ONCE() in the vfs,
> > and
> > that happens earlier in xfs_inactive_symlink() before we rcu free the
> > memory here. With that, ISTM a racing lookup should either see an rcu
> > protected i_link or NULL, the latter of which calls into ->get_link()
> > and triggers refwalk mode. Hm?
> >
> > > But again, as I asked up front, why do we even need to free this
> > > memory buffer here? It will be freed in xfs_inode_free_callback()
> > > after the current RCU grace period expires, so what do we gain by
> > > freeing it separately here?
> > >
>
> The thing that's been bugging me is not knowing if the VFS has
> finished using the link string.
>
> The link itself can be removed while the link path could still
> be valid and the VFS will then walk that path. So rcu grace time
> might not be sufficient but stashing the pointer and rcu freeing
> it ensures the pointer won't go away while the walk is under way
> without any grace period guessing. Relating this to the current
> path walk via an rcu delayed free is a reliable way to do what's
> needed.
>
> The only problem here is that path string remaining while it is
> being used. If there aren't any other known problems with the
> xfs inode re-use sub system I don't see any worth in complicating
> it to cater for this special case.
>
> Brian's patch is a variation on the original patch and is all
> that's really needed. IMHO going this way (whatever we end up
> with) is the sensible thing to do.

Why not simply change xfs_readlink to memcpy ip->i_df.if_u1.if_data into
the caller's link buffer? We shouldn't be exposing internal XFS
metadata buffers to the VFS to scribble on without telling us; this gets
rid of the dual inode_operations for symlinks; and we're probably
breaking a locking rule somewhere by not taking any locks AFAICT. That
seems a lot less complex than adding rcu freeing rules to understand how
to handle local format file forks.

(I say this from the vantage point of online repair, which will try to
salvage damaged symlinks, for which we actually /do/ want to be able to
lock out readers and change the data fork after symlink creation... but
I was saving that for 2022 because I'm too overwhelmed to try to send
that again.)

--D

>
> Ian
> >
> > One prevented memory leak? ;)
> >
> > It won't be freed in xfs_inode_free_callback() because we change the
> > data fork format type (and clear i_mode) in this path. Perhaps that
> > could use an audit, but that's a separate issue.
> >
> > > > ????????????????ip->i_df.if_bytes = 0;
> > > > ????????}
> > > > diff --git a/fs/xfs/xfs_iops.c b/fs/xfs/xfs_iops.c
> > > > index a607d6aca5c4..e98d7f10ba7d 100644
> > > > --- a/fs/xfs/xfs_iops.c
> > > > +++ b/fs/xfs/xfs_iops.c
> > > > @@ -511,27 +511,6 @@ xfs_vn_get_link(
> > > > ????????return ERR_PTR(error);
> > > > ?}
> > > > ?
> > > > -STATIC const char *
> > > > -xfs_vn_get_link_inline(
> > > > -???????struct dentry???????????*dentry,
> > > > -???????struct inode????????????*inode,
> > > > -???????struct delayed_call?????*done)
> > > > -{
> > > > -???????struct xfs_inode????????*ip = XFS_I(inode);
> > > > -???????char????????????????????*link;
> > > > -
> > > > -???????ASSERT(ip->i_df.if_format == XFS_DINODE_FMT_LOCAL);
> > > > -
> > > > -???????/*
> > > > -??????? * The VFS crashes on a NULL pointer, so return -
> > > > EFSCORRUPTED if
> > > > -??????? * if_data is junk.
> > > > -??????? */
> > > > -???????link = ip->i_df.if_u1.if_data;
> > > > -???????if (XFS_IS_CORRUPT(ip->i_mount, !link))
> > > > -???????????????return ERR_PTR(-EFSCORRUPTED);
> > > > -???????return link;
> > > > -}
> > > > -
> > > > ?static uint32_t
> > > > ?xfs_stat_blksize(
> > > > ????????struct xfs_inode????????*ip)
> > > > @@ -1250,14 +1229,6 @@ static const struct inode_operations
> > > > xfs_symlink_inode_operations = {
> > > > ????????.update_time????????????= xfs_vn_update_time,
> > > > ?};
> > > > ?
> > > > -static const struct inode_operations
> > > > xfs_inline_symlink_inode_operations = {
> > > > -???????.get_link???????????????= xfs_vn_get_link_inline,
> > > > -???????.getattr????????????????= xfs_vn_getattr,
> > > > -???????.setattr????????????????= xfs_vn_setattr,
> > > > -???????.listxattr??????????????= xfs_vn_listxattr,
> > > > -???????.update_time????????????= xfs_vn_update_time,
> > > > -};
> > > > -
> > > > ?/* Figure out if this file actually supports DAX. */
> > > > ?static bool
> > > > ?xfs_inode_supports_dax(
> > > > @@ -1409,9 +1380,8 @@ xfs_setup_iops(
> > > > ????????????????break;
> > > > ????????case S_IFLNK:
> > > > ????????????????if (ip->i_df.if_format == XFS_DINODE_FMT_LOCAL)
> > > > -???????????????????????inode->i_op =
> > > > &xfs_inline_symlink_inode_operations;
> > > > -???????????????else
> > > > -???????????????????????inode->i_op =
> > > > &xfs_symlink_inode_operations;
> > > > +???????????????????????inode->i_link = ip->i_df.if_u1.if_data;
> > > > +???????????????inode->i_op = &xfs_symlink_inode_operations;
> > >
> > > This still needs corruption checks - ip->i_df.if_u1.if_data can be
> > > null if there's some kind of inode corruption detected.
> > >
> >
> > It's fine for i_link to be NULL. We'd just fall into the get_link()
> > call
> > and have to handle it there like the current callback does.
> >
> > However, this does need to restore some of the code removed from
> > xfs_vn_get_link() in commit 30ee052e12b9 ("xfs: optimize inline
> > symlinks") to handle the local format case. If if_data can be NULL
> > we'll
> > obviously need to handle it there anyways.
> >
> > If there's no fundamental objection I'll address these issues, give
> > it
> > some proper testing and send a real patch..
> >
> > Brian
> >
> > > > ????????????????break;
> > > > ????????default:
> > > > ????????????????inode->i_op = &xfs_inode_operations;
> > > > diff --git a/fs/xfs/xfs_symlink.c b/fs/xfs/xfs_symlink.c
> > > > index fc2c6a404647..20ec2f450c56 100644
> > > > --- a/fs/xfs/xfs_symlink.c
> > > > +++ b/fs/xfs/xfs_symlink.c
> > > > @@ -497,6 +497,7 @@ xfs_inactive_symlink(
> > > > ???????? * do here in that case.
> > > > ???????? */
> > > > ????????if (ip->i_df.if_format == XFS_DINODE_FMT_LOCAL) {
> > > > +???????????????WRITE_ONCE(VFS_I(ip)->i_link, NULL);
> > >
> > > Again, rcu_assign_pointer(), yes?
> > >
> > > > ????????????????xfs_iunlock(ip, XFS_ILOCK_EXCL);
> > > > ????????????????return 0;
> > > > ????????}
> > > >
> > > >
> > >
> > > --
> > > Dave Chinner
> > > [email protected]
> > >
> >
>
>

2021-12-16 02:45:30

by Ian Kent

[permalink] [raw]

Subject: Re: [PATCH 2/2] xfs: make sure link path does not go away at access

On Tue, 2021-12-14 at 21:06 -0800, Darrick J. Wong wrote:
> On Wed, Dec 15, 2021 at 11:54:11AM +0800, Ian Kent wrote:
> > On Wed, 2021-11-24 at 15:56 -0500, Brian Foster wrote:
> > > On Tue, Nov 23, 2021 at 10:26:57AM +1100, Dave Chinner wrote:
> > > > On Mon, Nov 22, 2021 at 02:27:59PM -0500, Brian Foster wrote:
> > > > > On Mon, Nov 22, 2021 at 11:08:51AM +1100, Dave Chinner wrote:
> > > > > > On Fri, Nov 19, 2021 at 02:44:21PM -0500, Brian Foster
> > > > > > wrote:
> > > > > > > In
> > > > > > > any event, another experiment I ran in light of the above
> > > > > > > results that
> > > > > > > might be similar was to put the inode queueing component
> > > > > > > of
> > > > > > > destroy_inode() behind an rcu callback. This reduces the
> > > > > > > single threaded
> > > > > > > perf hit from the previous approach by about 50%. So not
> > > > > > > entirely
> > > > > > > baseline performance, but it's back closer to baseline if
> > > > > > > I
> > > > > > > double the
> > > > > > > throttle threshold (and actually faster at 4x). Perhaps
> > > > > > > my
> > > > > > > crude
> > > > > > > prototype logic could be optimized further to not rely on
> > > > > > > percpu
> > > > > > > threshold changes to match the baseline.
> > > > > > >
> > > > > > > My overall takeaway from these couple hacky experiments
> > > > > > > is
> > > > > > > that the
> > > > > > > unconditional synchronous rcu wait is indeed probably too
> > > > > > > heavy weight,
> > > > > > > as you point out. The polling or callback (or perhaps
> > > > > > > your
> > > > > > > separate
> > > > > > > queue) approach seems to be in the ballpark of viability,
> > > > > > > however,
> > > > > > > particularly when we consider the behavior of scaled or
> > > > > > > mixed
> > > > > > > workloads
> > > > > > > (since inactive queue processing seems to be size driven
> > > > > > > vs.
> > > > > > > latency
> > > > > > > driven).
> > > > > > >
> > > > > > > So I dunno.. if you consider the various severity and
> > > > > > > complexity
> > > > > > > tradeoffs, this certainly seems worth more consideration
> > > > > > > to
> > > > > > > me. I can
> > > > > > > think of other potentially interesting ways to experiment
> > > > > > > with
> > > > > > > optimizing the above or perhaps tweak queueing to better
> > > > > > > facilitate
> > > > > > > taking advantage of grace periods, but it's not worth
> > > > > > > going
> > > > > > > too far down
> > > > > > > that road if you're wedded to the "busy inodes" approach.
> > > > > >
> > > > > > I'm not wedded to "busy inodes" but, as your experiments
> > > > > > are
> > > > > > indicating, trying to handle rcu grace periods into the
> > > > > > deferred
> > > > > > inactivation path isn't completely mitigating the impact of
> > > > > > having
> > > > > > to wait for a grace period for recycling of inodes.
> > > > > >
> > > > >
> > > > > What I'm seeing so far is that the impact seems to be limited
> > > > > to
> > > > > the
> > > > > single threaded workload and largely mitigated by an increase
> > > > > in
> > > > > the
> > > > > percpu throttle limit. IOW, it's not completely free right
> > > > > out of
> > > > > the
> > > > > gate, but the impact seems isolated and potentially mitigated
> > > > > by
> > > > > adjustment of the pipeline.
> > > > >
> > > > > I realize the throttle is a percpu value, so that is what has
> > > > > me
> > > > > wondering about some potential for gains in efficiency to try
> > > > > and
> > > > > get
> > > > > more of that single-threaded performance back in other ways,
> > > > > or
> > > > > perhaps
> > > > > enhancements that might be more broadly beneficial to
> > > > > deferred
> > > > > inactivations in general (i.e. some form of adaptive
> > > > > throttling
> > > > > thresholds to balance percpu thresholds against a global
> > > > > threshold).
> > > >
> > > > I ran experiments on queue depth early on. Once we go over a
> > > > few
> > > > tens of inodes we start to lose the "hot cache" effect and
> > > > performance starts to go backwards. By queue depths of
> > > > hundreds,
> > > > we've lost all the hot cache and nothing else gets that
> > > > performance
> > > > back because we can't avoid the latency of all the memory
> > > > writes
> > > > from cache eviction and the followup memory loads that result.
> > > >
> > >
> > > Admittedly my testing is simple/crude as I'm just exploring the
> > > potential viability of a concept, not fine tuning a workload,
> > > etc.
> > > That
> > > said, I'm curious to know what your tests for this look like
> > > because
> > > I
> > > suspect I'm running into different conditions. My tests
> > > frequently
> > > hit
> > > the percpu throttle threshold (256 inodes), which is beyond your
> > > ideal
> > > tens of inodes range (and probably more throttle limited than cpu
> > > cache
> > > limited).
> > >
> > > > Making the per-cpu queues longer or shorter based on global
> > > > state
> > > > won't gain us anything. ALl it will do is slow down local
> > > > operations
> > > > that don't otherwise need slowing down....
> > > >
> > >
> > > This leaves out context. The increase in throttle threshold
> > > mitigates
> > > the delays I've introduced via the rcu callback. That happens to
> > > produce
> > > observable results comparable to my baseline test, but it's more
> > > of a
> > > measure of the impact of the delay than a direct proposal. If
> > > there's
> > > a
> > > more fine grained test worth running here (re: above), please
> > > describe
> > > it.
> > >
> > > > > > I suspect a rethink on the inode recycling mechanism is
> > > > > > needed.
> > > > > > THe
> > > > > > way it is currently implemented was a brute force solution
> > > > > > - it
> > > > > > is
> > > > > > simple and effective. However, we need more nuance in the
> > > > > > recycling
> > > > > > logic now. That is, if we are recycling an inode that is
> > > > > > clean, has
> > > > > > nlink >=1 and has not been unlinked, it means the VFS
> > > > > > evicted
> > > > > > it too
> > > > > > soon and we are going to re-instantiate it as the identical
> > > > > > inode
> > > > > > that was evicted from cache.
> > > > > >
> > > > >
> > > > > Probably. How advantageous is inode memory reuse supposed to
> > > > > be
> > > > > in the
> > > > > first place? I've more considered it a necessary side effect
> > > > > of
> > > > > broader
> > > > > architecture (i.e. deferred reclaim, etc.) as opposed to a
> > > > > primary
> > > > > optimization.
> > > >
> > > > Yes, it's an architectural feature resulting from the
> > > > filesystem
> > > > inode life cycle being different to the VFS inode life cycle.
> > > > This
> > > > was inherited from Irix - it had separate inactivation vs
> > > > reclaim
> > > > states and action steps for vnodes - inactivation occurred when
> > > > the
> > > > vnode refcount went to zero, reclaim occurred when the vnode
> > > > was to
> > > > be freed.
> > > >
> > > > Architecturally, Linux doesn't have this two-step
> > > > infrastructure;
> > > > it
> > > > just has evict() that runs everything when the inode needs to
> > > > be
> > > > reclaimed. Hence we hide the two-phase reclaim architecture of
> > > > XFS
> > > > behind that, and so we always had this troublesome impedence
> > > > mismatch on Linux.
> > > >
> > >
> > > Ok, that was generally how I viewed it.
> > >
> > > > Thinking a bit about this, maybe there is a more integrated way
> > > > to
> > > > handle this life cycle impedence mismatch by making the way we
> > > > interact with the linux inode cache to be more ... Irix like.
> > > > Linux
> > > > does actually give us a a callback when the last reference to
> > > > an
> > > > inode goes away: .drop_inode()
> > > >
> > > > i.e. Maybe we should look to triggering inactivation work from
> > > > ->drop_inode instead of ->destroy_inode and hence always
> > > > leaving
> > > > unreferenced, reclaimable inodes in the VFS cache on the LRU.
> > > > i.e.
> > > > rather than hiding the two-phase XFS inode inactivation+reclaim
> > > > algorithm from the VFS, move it up into the VFS. If we prevent
> > > > inodes from being reclaimed from the LRU until they have
> > > > finished
> > > > inactivation and are clean (easy enough just by marking the
> > > > inode
> > > > as
> > > > dirty), that would allow us to immediately reclaim and free
> > > > inodes
> > > > in evict() context. Integration with __wait_on_freeing_inode()
> > > > would
> > > > like solve the RCU reuse/recycling issues.
> > > >
> > >
> > > Hmm.. this is the point where we decide whether the inode remains
> > > cached, which is currently basically whether the inode has a link
> > > count
> > > or not. That makes me curious what (can) happens with an
> > > unlinked/inactivated inode on the lru. I'm not sure any other fs'
> > > do
> > > anything like that currently..?
> > >
> > > > There's more to it than just this, but perhaps the longer term
> > > > direction should be closer integration with the Linux inode
> > > > cache
> > > > life cycle rather than trying to solve all these problems
> > > > underneath
> > > > the VFS cache whilst still trying to play by it's rules...
> > > >
> > >
> > > Yeah. Caching logic details aside, I think that makes sense.
> > >
> > > > > I still see reuse occur with deferred inactivation, we
> > > > > just end up cycling through the same set of inodes as they
> > > > > fall
> > > > > through
> > > > > the queue rather than recycling the same one over and over.
> > > > > I'm
> > > > > sort of
> > > > > wondering what the impact would be if we didn't do this at
> > > > > all
> > > > > (for the
> > > > > new allocation case at least).
> > > >
> > > > We end up with a larger pool of free inodes in the finobt. This
> > > > is
> > > > basically what my "busy inode check" proposal is based on -
> > > > inodes
> > > > that we can't allocate without recycling just remain on the
> > > > finobt
> > > > for longer before they can be used. This would be awful if we
> > > > didn't
> > > > have the finobt to efficiently locate free inodes - the finobt
> > > > record iteration makes it pretty low overhead to scan inodes
> > > > here.
> > > >
> > >
> > > I get the idea. That last bit is what I'm skeptical about. The
> > > finobt
> > > is
> > > based on the premise that free inode lookup becomes a predictable
> > > tree
> > > lookup instead of the old searching algorithm on the inobt, which
> > > we
> > > still support and can be awful in its own right under worst case
> > > conditions. I agree that this would be bad on the inobt (which
> > > raises
> > > the question on how we'd provide these recycling correctness
> > > guarantees
> > > on !finobt fs'). What I'm more concerned about is whether this
> > > could
> > > make finobt enabled fs' (transiently) just as poor as the old
> > > algo
> > > under
> > > certain workloads/conditions.
> > >
> > > I think there needs to be at least some high level description of
> > > the
> > > search algorithm before we can sufficiently reason about it's
> > > behavior..
> > >
> > > > > > So how much re-initialisation do we actually need for that
> > > > > > inode?
> > > > > > Almost everything in the inode is still valid; the problems
> > > > > > come
> > > > > > from inode_init_always() resetting the entire internal
> > > > > > inode
> > > > > > state
> > > > > > and XFS then having to set them up again. The internal
> > > > > > state
> > > > > > is
> > > > > > already largely correct when we start recycling, and the
> > > > > > identity of
> > > > > > the recycled inode does not change when nlink >= 1. Hence
> > > > > > eliding
> > > > > > inode_init_always() would also go a long way to avoiding
> > > > > > the
> > > > > > need
> > > > > > for a RCU grace period to pass before we can make the inode
> > > > > > visible
> > > > > > to the VFS again.
> > > > > >
> > > > > > If we can do that, then the only indoes that need a grace
> > > > > > period
> > > > > > before they can be recycled are unlinked inodes as they
> > > > > > change
> > > > > > identity when being recycled. That identity change
> > > > > > absolutely
> > > > > > requires a grace period to expire before the new
> > > > > > instantiation
> > > > > > can
> > > > > > be made visible. Given the arbitrary delay that this can
> > > > > > introduce
> > > > > > for an inode allocation operation, it seems much better
> > > > > > suited
> > > > > > to
> > > > > > detecting busy inodes than waiting for a global OS state
> > > > > > change
> > > > > > to
> > > > > > occur...
> > > > > >
> > > > >
> > > > > Maybe..? The experiments I've been doing are aimed at
> > > > > simplicity
> > > > > and
> > > > > reducing the scope of the changes. Part of the reason for
> > > > > this is
> > > > > tbh
> > > > > I'm not totally convinced we really need to do anything more
> > > > > complex
> > > > > than preserve the inline symlink buffer one way or another
> > > > > (for
> > > > > example,
> > > > > see the rfc patch below for an alternative to the inline
> > > > > symlink
> > > > > rcuwalk
> > > > > disable patch). Maybe we should consider something like this
> > > > > anyways.
> > > > >
> > > > > With busy inodes, we need to alter inode allocation to some
> > > > > degree to
> > > > > accommodate. We can have (tens of?) thousands of inodes under
> > > > > the
> > > > > grace
> > > > > period at any time based on current batching behavior, so
> > > > > it's
> > > > > not
> > > > > totally evident to me that we won't end up with some of the
> > > > > same
> > > > > fundamental issues to deal with here, just needing to be
> > > > > accommodated in
> > > > > the inode allocation algorithm rather than the teardown
> > > > > sequence.
> > > >
> > > > Sure, but the purpose of the allocation selection
> > > > policy is to select the best inode to allocate for the current
> > > > context. The cost of not being able to use an inode
> > > > immediately
> > > > needs to be factored into that allocation policy. i.e. if the
> > > > selected inode has an associated delay with it before it can be
> > > > reused and other free inodes don't, then we should not be
> > > > selecting
> > > > the inode with a delay associcated with it.
> > > >
> > >
> > > We still have to find those "no delay" inodes. AFAICT the worst
> > > case
> > > conditions on the system I've been playing with can have
> > > something
> > > like
> > > 20k free && busy inodes. That could cover all or most of the
> > > finobt
> > > at
> > > the time of an inode allocation. What happens from there depends
> > > on
> > > the
> > > search algorithm.
> > >
> > > > This is exactly the reasoning and logic we use for busy
> > > > extents.
> > > > We
> > > > only take the blocking penalty for resolving busy extent state
> > > > if
> > > > we
> > > > run out of free extents to search before we've found an
> > > > allocation
> > > > candidate. I think it makes sense for inode allocation, too.
> > > >
> > >
> > > Sure, the idea makes sense and it's worth looking into. But there
> > > are
> > > enough contextual differences that I wouldn't just assume the
> > > same
> > > logic
> > > translates over to the finobt without potential for performance
> > > impact.
> > > For example, extent allocation has some advantages with things
> > > like
> > > delalloc (physical block allocation occurs async from buffered
> > > write
> > > syscall time) and the fact that metadata allocs can reuse busy
> > > blocks.
> > > The finobt only tracks existing chunks with free inodes, so it's
> > > easily
> > > possible to have conditions where the finobt is 100% (or
> > > majority)
> > > populated with busy inodes (whether it be one inode or several
> > > thousand).
> > >
> > > This raises questions like at what point does search cost become
> > > a
> > > factor? At what point and with what frequency do we suffer the
> > > blocking
> > > penalty? Do we opt to allocate new chunks based on gp state?
> > > Something
> > > else? We don't need to answer these questions here (this thread
> > > is
> > > long
> > > enough :P). I'm just trying to say that it's one thing to
> > > consider
> > > the
> > > approach a viable option, but it isn't automatically preferable
> > > just
> > > because we use it for extents. Further details beyond "detect
> > > busy
> > > inodes" would be nice to objectively reason about.
> > >
> > > > > --- 8< ---
> > > > >
> > > > > diff --git a/fs/xfs/xfs_inode.c b/fs/xfs/xfs_inode.c
> > > > > index 64b9bf334806..058e3fc69ff7 100644
> > > > > --- a/fs/xfs/xfs_inode.c
> > > > > +++ b/fs/xfs/xfs_inode.c
> > > > > @@ -2644,7 +2644,7 @@ xfs_ifree(
> > > > >          * already been freed by xfs_attr_inactive.
> > > > >          */
> > > > >         if (ip->i_df.if_format == XFS_DINODE_FMT_LOCAL) {
> > > > > -               kmem_free(ip->i_df.if_u1.if_data);
> > > > > +               kfree_rcu(ip->i_df.if_u1.if_data);
> > > > >                 ip->i_df.if_u1.if_data = NULL;
> > > >
> > > > That would need to be rcu_assign_pointer(ip-
> > > > >i_df.if_u1.if_data,
> > > > NULL) to put the correct memory barriers in place, right? Also,
> > > > I
> > > > think ip->i_df.if_u1.if_data needs to be set to NULL before
> > > > calling
> > > > kfree_rcu() so racing lookups will always see NULL before
> > > > the object is freed.
> > > >
> > >
> > > I think rcu_assign_pointer() is intended to be paired with the
> > > associated rcu deref and for scenarios like making sure an object
> > > isn't
> > > made available until it's completely initialized (i.e. such as
> > > for
> > > rcu
> > > protected list traversals, etc.).
> > >
> > > With regard to ordering, we no longer access if_data in rcuwalk
> > > mode
> > > with this change. Thus I think all we need here is the
> > > WRITE_ONCE(i_link, NULL) that pairs with the READ_ONCE() in the
> > > vfs,
> > > and
> > > that happens earlier in xfs_inactive_symlink() before we rcu free
> > > the
> > > memory here. With that, ISTM a racing lookup should either see an
> > > rcu
> > > protected i_link or NULL, the latter of which calls into -
> > > >get_link()
> > > and triggers refwalk mode. Hm?
> > >
> > > > But again, as I asked up front, why do we even need to free
> > > > this
> > > > memory buffer here? It will be freed in
> > > > xfs_inode_free_callback()
> > > > after the current RCU grace period expires, so what do we gain
> > > > by
> > > > freeing it separately here?
> > > >
> >
> > The thing that's been bugging me is not knowing if the VFS has
> > finished using the link string.
> >
> > The link itself can be removed while the link path could still
> > be valid and the VFS will then walk that path. So rcu grace time
> > might not be sufficient but stashing the pointer and rcu freeing
> > it ensures the pointer won't go away while the walk is under way
> > without any grace period guessing. Relating this to the current
> > path walk via an rcu delayed free is a reliable way to do what's
> > needed.
> >
> > The only problem here is that path string remaining while it is
> > being used. If there aren't any other known problems with the
> > xfs inode re-use sub system I don't see any worth in complicating
> > it to cater for this special case.
> >
> > Brian's patch is a variation on the original patch and is all
> > that's really needed. IMHO going this way (whatever we end up
> > with) is the sensible thing to do.
>
> Why not simply change xfs_readlink to memcpy ip->i_df.if_u1.if_data
> into
> the caller's link buffer? We shouldn't be exposing internal XFS
> metadata buffers to the VFS to scribble on without telling us; this
> gets
> rid of the dual inode_operations for symlinks; and we're probably
> breaking a locking rule somewhere by not taking any locks AFAICT.
> That
> seems a lot less complex than adding rcu freeing rules to understand
> how
> to handle local format file forks.

Ahhhaa ... I didn't/don't understand what these inline symlinks
are.

But it seems they aren't too different from the usual symlinks
and treating them as such makes the problem go away.

I'm pretty sure I was missing the point of some of this discussion
too because I just didn't get it.

Thanks for helping me get what's going on here Darrick.

Ian
>
> (I say this from the vantage point of online repair, which will try
> to
> salvage damaged symlinks, for which we actually /do/ want to be able
> to
> lock out readers and change the data fork after symlink creation...
> but
> I was saving that for 2022 because I'm too overwhelmed to try to send
> that again.)
>
> --D
>
> >
> > Ian
> > >
> > > One prevented memory leak? ;)
> > >
> > > It won't be freed in xfs_inode_free_callback() because we change
> > > the
> > > data fork format type (and clear i_mode) in this path. Perhaps
> > > that
> > > could use an audit, but that's a separate issue.
> > >
> > > > >                 ip->i_df.if_bytes = 0;
> > > > >         }
> > > > > diff --git a/fs/xfs/xfs_iops.c b/fs/xfs/xfs_iops.c
> > > > > index a607d6aca5c4..e98d7f10ba7d 100644
> > > > > --- a/fs/xfs/xfs_iops.c
> > > > > +++ b/fs/xfs/xfs_iops.c
> > > > > @@ -511,27 +511,6 @@ xfs_vn_get_link(
> > > > >         return ERR_PTR(error);
> > > > > }
> > > > >
> > > > > -STATIC const char *
> > > > > -xfs_vn_get_link_inline(
> > > > > -       struct dentry           *dentry,
> > > > > -       struct inode            *inode,
> > > > > -       struct delayed_call     *done)
> > > > > -{
> > > > > -       struct xfs_inode        *ip = XFS_I(inode);
> > > > > -       char                    *link;
> > > > > -
> > > > > -       ASSERT(ip->i_df.if_format == XFS_DINODE_FMT_LOCAL);
> > > > > -
> > > > > -       /*
> > > > > -        * The VFS crashes on a NULL pointer, so return -
> > > > > EFSCORRUPTED if
> > > > > -        * if_data is junk.
> > > > > -        */
> > > > > -       link = ip->i_df.if_u1.if_data;
> > > > > -       if (XFS_IS_CORRUPT(ip->i_mount, !link))
> > > > > -               return ERR_PTR(-EFSCORRUPTED);
> > > > > -       return link;
> > > > > -}
> > > > > -
> > > > > static uint32_t
> > > > > xfs_stat_blksize(
> > > > >         struct xfs_inode        *ip)
> > > > > @@ -1250,14 +1229,6 @@ static const struct inode_operations
> > > > > xfs_symlink_inode_operations = {
> > > > >         .update_time            = xfs_vn_update_time,
> > > > > };
> > > > >
> > > > > -static const struct inode_operations
> > > > > xfs_inline_symlink_inode_operations = {
> > > > > -       .get_link               = xfs_vn_get_link_inline,
> > > > > -       .getattr                = xfs_vn_getattr,
> > > > > -       .setattr                = xfs_vn_setattr,
> > > > > -       .listxattr              = xfs_vn_listxattr,
> > > > > -       .update_time            = xfs_vn_update_time,
> > > > > -};
> > > > > -
> > > > > /* Figure out if this file actually supports DAX. */
> > > > > static bool
> > > > > xfs_inode_supports_dax(
> > > > > @@ -1409,9 +1380,8 @@ xfs_setup_iops(
> > > > >                 break;
> > > > >         case S_IFLNK:
> > > > >                 if (ip->i_df.if_format ==
> > > > > XFS_DINODE_FMT_LOCAL)
> > > > > -                       inode->i_op =
> > > > > &xfs_inline_symlink_inode_operations;
> > > > > -               else
> > > > > -                       inode->i_op =
> > > > > &xfs_symlink_inode_operations;
> > > > > +                       inode->i_link = ip-
> > > > > >i_df.if_u1.if_data;
> > > > > +               inode->i_op = &xfs_symlink_inode_operations;
> > > >
> > > > This still needs corruption checks - ip->i_df.if_u1.if_data can
> > > > be
> > > > null if there's some kind of inode corruption detected.
> > > >
> > >
> > > It's fine for i_link to be NULL. We'd just fall into the
> > > get_link()
> > > call
> > > and have to handle it there like the current callback does.
> > >
> > > However, this does need to restore some of the code removed from
> > > xfs_vn_get_link() in commit 30ee052e12b9 ("xfs: optimize inline
> > > symlinks") to handle the local format case. If if_data can be
> > > NULL
> > > we'll
> > > obviously need to handle it there anyways.
> > >
> > > If there's no fundamental objection I'll address these issues,
> > > give
> > > it
> > > some proper testing and send a real patch..
> > >
> > > Brian
> > >
> > > > >                 break;
> > > > >         default:
> > > > >                 inode->i_op = &xfs_inode_operations;
> > > > > diff --git a/fs/xfs/xfs_symlink.c b/fs/xfs/xfs_symlink.c
> > > > > index fc2c6a404647..20ec2f450c56 100644
> > > > > --- a/fs/xfs/xfs_symlink.c
> > > > > +++ b/fs/xfs/xfs_symlink.c
> > > > > @@ -497,6 +497,7 @@ xfs_inactive_symlink(
> > > > >          * do here in that case.
> > > > >          */
> > > > >         if (ip->i_df.if_format == XFS_DINODE_FMT_LOCAL) {
> > > > > +               WRITE_ONCE(VFS_I(ip)->i_link, NULL);
> > > >
> > > > Again, rcu_assign_pointer(), yes?
> > > >
> > > > >                 xfs_iunlock(ip, XFS_ILOCK_EXCL);
> > > > >                 return 0;
> > > > >         }
> > > > >
> > > > >
> > > >
> > > > --
> > > > Dave Chinner
> > > > [email protected]
> > > >
> > >
> >
> >