2023-08-08 17:37:35

by Mateusz Guzik

Subject: new_inode_pseudo vs locked inode->i_state = 0

Hello,

new_inode_pseudo is:
struct inode *inode = alloc_inode(sb);

if (inode) {
	spin_lock(&inode->i_lock);
	inode->i_state = 0;
	spin_unlock(&inode->i_lock);
}

I'm trying to understand:
1. why is it zeroing i_state (as opposed to having it happen in inode_init_always)
2. why is zeroing taking place with i_lock held

The inode is freshly allocated, not yet added to the hash -- I would
expect that nobody else can see it.

Moreover, another consumer of alloc_inode zeroes without bothering to
lock -- see iget5_locked:
[snip]
struct inode *new = alloc_inode(sb);

if (new) {
	new->i_state = 0;
[/snip]

I tried to find justification for it in git; the pre-history-wipe repo
(git://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux-2.6.git)
says it came in "Import 2.1.45pre1" in 1997. This is where my digging
stopped.

As is, I strongly suspect this is a leftover waiting for clean up.
Moving i_state = 0 back to inode_init_always would result in a few
simplifications in the area. I'm happy to make them, provided this is
indeed safe.

If the lock is required, then shouldn't it also be added to iget5_locked?
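
Purely for illustration (a sketch, not something I'm proposing), adding
the lock there would mirror the new_inode_pseudo pattern above:

	struct inode *new = alloc_inode(sb);

	if (new) {
		spin_lock(&new->i_lock);
		new->i_state = 0;
		spin_unlock(&new->i_lock);
		inode = inode_insert5(new, hashval, test, set, data);
		...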

UNRELATED:

While here, new_inode starts with: spin_lock_prefetch(&sb->s_inode_list_lock)

This also goes *way* back to a huge commit; since then the line has only
been patched to remain compilable.

This is the only remaining spin_lock_prefetch use in the tree.

I don't know the original justification nor whether it made sense at
the time, but it is definitely problematic today in the heavily
multicore era -- there is a ton of work happening between the prefetch
and actually taking s_inode_list_lock, meaning that if there is
contention, the cacheline is going to be marked invalid by the time
spin_lock is called on it. At that point the prefetch only adds to
cacheline bouncing.

Personally I would just remove this line without even trying to benchmark.
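
For reference, with the prefetch dropped new_inode would be reduced to
just (sketch, everything else unchanged):

	struct inode *new_inode(struct super_block *sb)
	{
		struct inode *inode;

		inode = new_inode_pseudo(sb);
		if (inode)
			inode_sb_list_add(inode);
		return inode;
	}
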
--
Mateusz Guzik <mjguzik gmail.com>


2023-08-09 01:16:47

by Mateusz Guzik

Subject: Re: new_inode_pseudo vs locked inode->i_state = 0

On 8/9/23, Dave Chinner <[email protected]> wrote:
> On Tue, Aug 08, 2023 at 06:05:33PM +0200, Mateusz Guzik wrote:
>> Hello,
>>
>> new_inode_pseudo is:
>> struct inode *inode = alloc_inode(sb);
>>
>> if (inode) {
>> spin_lock(&inode->i_lock);
>> inode->i_state = 0;
>> spin_unlock(&inode->i_lock);
>> }
>>
>> I'm trying to understand:
>> 1. why is it zeroing i_state (as opposed to have it happen in
>> inode_init_always)
>> 2. why is zeroing taking place with i_lock held
>>
>> The inode is freshly allocated, not yet added to the hash -- I would
>> expect that nobody else can see it.
>
> Maybe not at this point, but as soon as the function returns with
> the new inode, it could be published in some list that can be
> accessed concurrently and then the i_state visible on other CPUs
> better be correct.
>
> I'll come back to this, because the answer lies in this code:
>
>> Moreover, another consumer of alloc_inode zeroes without bothering to
>> lock -- see iget5_locked:
>> [snip]
>> struct inode *new = alloc_inode(sb);
>>
>> if (new) {
>> new->i_state = 0;
>> [/snip]
>
> Yes, that one is fine because the inode has not been published yet.
> The actual i_state serialisation needed to publish the inode happens
> in the function called in the very next line - inode_insert5().
>
> That does:
>
> spin_lock(&inode_hash_lock);
>
> .....
> /*
> * Return the locked inode with I_NEW set, the
> * caller is responsible for filling in the contents
> */
> spin_lock(&inode->i_lock);
> inode->i_state |= I_NEW;
> hlist_add_head_rcu(&inode->i_hash, head);
> spin_unlock(&inode->i_lock);
> .....
>
> spin_unlock(&inode_hash_lock);
>
> The i_lock is held across the inode state initialisation and hash
> list insert so that if anything finds the inode in the hash
> immediately after insert, they should see an initialised value.
>
> Don't be fooled by the inode_hash_lock here. We have
> find_inode_rcu() which walks hash lists without holding the hash
> lock, hence if anything needs to do a state check on the found
> inode, they are guaranteed to see I_NEW after grabbing the i_lock....
>
> Further, inode_insert5() adds the inode to the superblock inode
> list, which means concurrent sb inode list walkers can also see this
> inode whilst the inode_hash_lock is still held by inode_insert5().
> Those inode list walkers *must* see I_NEW at this point, and they
> are guaranteed to do so by taking i_lock before checking i_state....
>
> IOWs, the initialisation of inode->i_state for normal inodes must be
> done under i_lock so that lookups that occur after hash/sb list
> insert are guaranteed to see the correct value.
>
> If we now go back to new_inode_pseudo(), we see one of the callers
> is new_inode(), and it does this:
>
> struct inode *new_inode(struct super_block *sb)
> {
> struct inode *inode;
>
> spin_lock_prefetch(&sb->s_inode_list_lock);
>
> inode = new_inode_pseudo(sb);
> if (inode)
> inode_sb_list_add(inode);
> return inode;
> }
>
> IOWs, the inode is immediately published on the superblock inode
> list, and so inode list walkers can see it immediately. As per
> inode_insert5(), this requires the inode state to be fully
> initialised and memory barriers in place such that any walker will
> see the correct value of i_state. The simplest, safest way to do
> this is to initialise i_state under the i_lock....
>

Thanks for the detailed answer; I do think you have a valid point, but
I don't think it works with the given example. ;)

inode_sb_list_add is:
spin_lock(&inode->i_sb->s_inode_list_lock);
list_add(&inode->i_sb_list, &inode->i_sb->s_inodes);
spin_unlock(&inode->i_sb->s_inode_list_lock);

... thus i_state is published by the time it unlocks.

According to my grep all iterations over the list hold the
s_inode_list_lock, thus they are guaranteed to see the update, making
the release fence in new_inode_pseudo redundant for this case.
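
For reference, the iterators follow a pattern along these lines
(abbreviated sketch, modelled on the evict_inodes() loop):

	spin_lock(&sb->s_inode_list_lock);
	list_for_each_entry(inode, &sb->s_inodes, i_sb_list) {
		/* inspect the inode, typically taking i_lock for i_state */
	}
	spin_unlock(&sb->s_inode_list_lock);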

With this in mind I'm assuming the fence was there as a safety
measure, for consumers which would maybe need it.

Then the code can:
	struct inode *inode = alloc_inode(sb);

	if (inode) {
		inode->i_state = 0;
		/*
		 * make sure i_state update will be visible before we insert
		 * the inode anywhere
		 */
		smp_wmb();
	}

Upshots:
- replaces 2 atomics with a mere release fence, which is way cheaper
to do everywhere and virtually free on x86-64
- people reading the code don't wonder who on earth we are locking against

All that said, if the (possibly redundant) fence is literally the only
reason for the lock trip, I would once more propose zeroing in
inode_init_always:
diff --git a/fs/inode.c b/fs/inode.c
index 8fefb69e1f84..ce9664c4efe9 100644
--- a/fs/inode.c
+++ b/fs/inode.c
@@ -232,6 +232,13 @@ int inode_init_always(struct super_block *sb, struct inode *inode)
 		return -ENOMEM;
 	this_cpu_inc(nr_inodes);
 
+	inode->i_state = 0;
+	/*
+	 * Make sure i_state update is visible before this inode gets inserted
+	 * anywhere.
+	 */
+	smp_wmb();
+
 	return 0;
 }
 EXPORT_SYMBOL(inode_init_always);

This is more in the spirit of making sure everybody has published
i_state = 0 and facilitates cleanup.
- new_inode_pseudo is now just alloc_inode
- confusing unlocked/unfenced i_state = 0 disappears from iget5_locked

And probably some more tidyups.
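
To be concrete, after such a cleanup new_inode_pseudo could become just
(sketch):

	struct inode *new_inode_pseudo(struct super_block *sb)
	{
		return alloc_inode(sb);
	}

or go away entirely in favour of calling alloc_inode directly.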

Now, I'm not going to get into a flame war with anyone over doing
smp_wmb instead of the lock trip (it looks like a no-brainer to me, but
I got flamed for another one earlier today ;>).

I am however going to /strongly suggest/ that a comment explaining
what's going on is added there, if the current state is to remain.

As far as I'm concerned *locking* when a mere smp_wmb would suffice
is heavily misleading and should be whacked if only for that reason.

Cheers,
--
Mateusz Guzik <mjguzik gmail.com>

2023-08-09 09:53:30

by Mateusz Guzik

Subject: Re: new_inode_pseudo vs locked inode->i_state = 0

On 8/9/23, Dave Chinner <[email protected]> wrote:
> On Wed, Aug 09, 2023 at 02:23:59AM +0200, Mateusz Guzik wrote:
>> On 8/9/23, Dave Chinner <[email protected]> wrote:
>> > On Tue, Aug 08, 2023 at 06:05:33PM +0200, Mateusz Guzik wrote:
>> >> Hello,
>> >>
>> >> new_inode_pseudo is:
>> >> struct inode *inode = alloc_inode(sb);
>> >>
>> >> if (inode) {
>> >> spin_lock(&inode->i_lock);
>> >> inode->i_state = 0;
>> >> spin_unlock(&inode->i_lock);
>> >> }
>> >>
>> >> I'm trying to understand:
>> >> 1. why is it zeroing i_state (as opposed to have it happen in
>> >> inode_init_always)
>> >> 2. why is zeroing taking place with i_lock held
>> >>
>> >> The inode is freshly allocated, not yet added to the hash -- I would
>> >> expect that nobody else can see it.
>> >
>> > Maybe not at this point, but as soon as the function returns with
>> > the new inode, it could be published in some list that can be
>> > accessed concurrently and then the i_state visible on other CPUs
>> > better be correct.
>> >
>> > I'll come back to this, because the answer lies in this code:
>> >
>> >> Moreover, another consumer of alloc_inode zeroes without bothering to
>> >> lock -- see iget5_locked:
>> >> [snip]
>> >> struct inode *new = alloc_inode(sb);
>> >>
>> >> if (new) {
>> >> new->i_state = 0;
>> >> [/snip]
>> >
>> > Yes, that one is fine because the inode has not been published yet.
>> > The actual i_state serialisation needed to publish the inode happens
>> > in the function called in the very next line - inode_insert5().
>> >
>> > That does:
>> >
>> > spin_lock(&inode_hash_lock);
>> >
>> > .....
>> > /*
>> > * Return the locked inode with I_NEW set, the
>> > * caller is responsible for filling in the contents
>> > */
>> > spin_lock(&inode->i_lock);
>> > inode->i_state |= I_NEW;
>> > hlist_add_head_rcu(&inode->i_hash, head);
>> > spin_unlock(&inode->i_lock);
>> > .....
>> >
>> > spin_unlock(&inode_hash_lock);
>> >
>> > The i_lock is held across the inode state initialisation and hash
>> > list insert so that if anything finds the inode in the hash
>> > immediately after insert, they should see an initialised value.
>> >
>> > Don't be fooled by the inode_hash_lock here. We have
>> > find_inode_rcu() which walks hash lists without holding the hash
>> > lock, hence if anything needs to do a state check on the found
>> > inode, they are guaranteed to see I_NEW after grabbing the i_lock....
>> >
>> > Further, inode_insert5() adds the inode to the superblock inode
>> > list, which means concurrent sb inode list walkers can also see this
>> > inode whilst the inode_hash_lock is still held by inode_insert5().
>> > Those inode list walkers *must* see I_NEW at this point, and they
>> > are guaranteed to do so by taking i_lock before checking i_state....
>> >
>> > IOWs, the initialisation of inode->i_state for normal inodes must be
>> > done under i_lock so that lookups that occur after hash/sb list
>> > insert are guaranteed to see the correct value.
>> >
>> > If we now go back to new_inode_pseudo(), we see one of the callers
>> > is new_inode(), and it does this:
>> >
>> > struct inode *new_inode(struct super_block *sb)
>> > {
>> > struct inode *inode;
>> >
>> > spin_lock_prefetch(&sb->s_inode_list_lock);
>> >
>> > inode = new_inode_pseudo(sb);
>> > if (inode)
>> > inode_sb_list_add(inode);
>> > return inode;
>> > }
>> >
>> > IOWs, the inode is immediately published on the superblock inode
>> > list, and so inode list walkers can see it immediately. As per
>> > inode_insert5(), this requires the inode state to be fully
>> > initialised and memory barriers in place such that any walker will
>> > see the correct value of i_state. The simplest, safest way to do
>> > this is to initialise i_state under the i_lock....
>> >
>>
>> Thanks for the detailed answer, I do think you have a valid point but
>> I don't think it works with the given example. ;)
>>
>> inode_sb_list_add is:
>> spin_lock(&inode->i_sb->s_inode_list_lock);
>> list_add(&inode->i_sb_list, &inode->i_sb->s_inodes);
>> spin_unlock(&inode->i_sb->s_inode_list_lock);
>>
>> ... thus i_state is published by the time it unlocks.
>>
>> According to my grep all iterations over the list hold the
>> s_inode_list_lock, thus they are guaranteed to see the update, making
>> the release fence in new_inode_pseudo redundant for this case.
>
> I don't believe that is the case - the i_state modification is not
> within the critical region the s_inode_list_lock covers, nor is the
> cacheline i_state lies on referenced within the critical section.
> Hence there is no explicit ordering dependency created by
> inode_sb_list_add on the value of i_state.
>
> Your argument seems to be that we can rely on the side effect of
> some unrelated lock to provide ordered memory access - that's just
> really poor locking design and will result in unmaintainable code
> that gets broken without realising in the future.
>

In that spirit, *normally* when you add an object to some
list/whatever, there are probably other changes you made past i_state
= 0 (i.e. another fence is ultimately needed just before insertion).

The least error-prone approach as far as I'm concerned would be to make
sure the inserting routine issues the relevant fences -- that way
consumers which sneak in some updates prior to calling it are still
covered just fine, and as it happens the sb list iteration *is* covered
in this manner. But I guess that's not the expected way here.

I feel compelled to note that not issuing the fence in iget5_locked
and instead relying on hash insertion to sort it out is implementing
this approach, albeit internally it is explicitly done with i_lock.

> That's why the spin lock. It's *obviously correct*, and it doesn't
> require any of the other code that checks i_state to have to care
> about any lock or memory barrier mechanism other than taking
> i_lock...
>

It is obviously not *wrong* in the sense of not breaking code, just
deeply confusing for the reader -- I don't know about you, but if I see
a lock taken and released on what should be an object invisible to
other threads, the question immediately pops up whether the inode *is*
in fact visible somewhere. Making sure the i_state update is published
by starting out with acquiring the lock on a freshly allocated inode is
not something I had thought of.

>> With this in mind I'm assuming the fence was there as a safety
>> measure, for consumers which would maybe need it.
>>
>> Then the code can:
>> struct inode *inode = alloc_inode(sb);
>>
>> if (inode) {
>> inode->i_state = 0;
>> /* make sure i_state update will be visible before we insert
>> * the inode anywhere */
>> smp_wmb();
>> }
>
> AFAIA, that doesn't work by itself without a matching smp_rmb()
> prior to the i_state reader - memory barriers need to be paired for
> ordering to be valid. Hence this also seems to assume that we can
> rely on some other unrelated lock pairing to actually order memory
> accesses to i_state....
>

It has to be either smp_rmb or a consume barrier.

Spelling it out instead of having it done implicitly through
spin_unlock makes it apparent that it is redundant for the new_inode vs
sb allocation case; people just don't have a strong reaction to plain
lock/unlock trips, whereas they get very worried about any explicitly
mentioned barriers (most justified for the latter, but they should also
be this way for the former).
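
To spell out the pairing being discussed (a sketch only, with a
hypothetical lockless reader -- the in-tree sb list walkers take the
spinlock instead):

	/* writer: proposed new_inode_pseudo + inode_sb_list_add */
	inode->i_state = 0;
	smp_wmb();	/* publish i_state before making the inode reachable */
	list_add(&inode->i_sb_list, &sb->s_inodes);

	/* hypothetical reader which found the inode without taking any lock */
	smp_rmb();	/* pairs with the writer's smp_wmb() */
	state = READ_ONCE(inode->i_state);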

>> Upshots:
>> - replaces 2 atomics with a mere release fence, which is way cheaper
>> to do everywhere and virtually free on x86-64
>> - people reading the code don't wonder who on earth are we locking
>> against
>
> Downsides:
> - memory barriers are hard to get right,
> - nobody will be able to look at the code and say "this is obviously
> correct".
> - random unpaired memory barriers in code end up making it
> unmaintainable.
> - impossible to test for correctness
>
[snip]
>> I am however going to /strongly suggest/ that a comment explaining
>> what's going on is added there, if the current state is to remain.
>
> If you really think that improves the code, send a patch....
>
>> As far as I'm concerned *locking* when a mere smp_wmb would suffice
>> is heavily misleading and should be whacked if only for that reason.
>
> If it ain't broke, don't fix it.
>
> The code we have now is widely used (there are over a hundred
> callers of new_inode()) and we know it works correctly. Nobody is
> complaining that it is too slow, and generally speaking the overhead
> of this lock traversal is lost in the noise of all the other
> operations needed to be performed to initialise a new inode.
>

Maybe you have been around this code too long to see it from an
outsider's perspective, or maybe I did not do a good job spelling it out.

I'm not complaining about speed; I'm complaining that a standalone
lock trip there is *confusing*, especially when another consumer in
the same file does not do it and filesystems have *custom* allocation
routines -- is the inode visible to other threads?

I also noted that the reasoning you initially gave is trivially
satisfied with smp_wmb, but I'm definitely not going to push for it.

However, after more poking around, what I think you were trying to say
is that inode lookups/traversals always take the inode lock to inspect
i_state, so zeroing it under that lock makes it easier to reason about
(but if that's the case, why is iget5_locked standing out, and why do
you seem fine with it?). Now that is a justification I can understand
at least.
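
That is, the lookup side is generally shaped like this (rough sketch of
the pattern, not any particular function):

	spin_lock(&inode->i_lock);
	if (inode->i_state & (I_FREEING | I_WILL_FREE)) {
		/* wait for or skip the dying inode */
	}
	/* ... act on the inode ... */
	spin_unlock(&inode->i_lock);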

All that said, how about this (modulo wording):
diff --git a/fs/inode.c b/fs/inode.c
index 8fefb69e1f84..977e72942706 100644
--- a/fs/inode.c
+++ b/fs/inode.c
@@ -1018,6 +1018,13 @@ struct inode *new_inode_pseudo(struct super_block *sb)
 	struct inode *inode = alloc_inode(sb);
 
 	if (inode) {
+		/*
+		 * Make sure i_state update is visible before this inode gets
+		 * inserted anywhere.
+		 *
+		 * Safe way for consumers to inspect the field is with i_lock
+		 * held, match this for simplicity.
+		 */
 		spin_lock(&inode->i_lock);
 		inode->i_state = 0;
 		spin_unlock(&inode->i_lock);
@@ -1285,6 +1292,11 @@ struct inode *iget5_locked(struct super_block *sb, unsigned long hashval,
 	struct inode *new = alloc_inode(sb);
 
 	if (new) {
+		/*
+		 * new_inode_pseudo takes a lock to zero i_state, here
+		 * we safely skip it because hash insertion will take
+		 * care of it
+		 */
 		new->i_state = 0;
 		inode = inode_insert5(new, hashval, test, set, data);
 		if (unlikely(inode != new))


or patch iget5_locked to use new_inode_pseudo, but then I don't want
to be on the lookout for someone's microbench changing.
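
For completeness, that alternative would look roughly like this (sketch
only):

	struct inode *new = new_inode_pseudo(sb);

	if (new) {
		inode = inode_insert5(new, hashval, test, set, data);
		if (unlikely(inode != new))
			...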

--
Mateusz Guzik <mjguzik gmail.com>