Hello,
so I finally got to looking into your patches for reducing contention
on s_orphan_lock. AFAICT these two patches (the first one is just a
small cleanup) should have the same speed gain as the patches you wrote
and they are simpler. Can you give them a try please? Thanks!
Honza
Use the sbi pointer consistently in ext4_orphan_del() instead of
open-coding it in some places.
Signed-off-by: Jan Kara <[email protected]>
---
fs/ext4/namei.c | 10 ++++------
1 file changed, 4 insertions(+), 6 deletions(-)
diff --git a/fs/ext4/namei.c b/fs/ext4/namei.c
index 1cb84f78909e..411957326827 100644
--- a/fs/ext4/namei.c
+++ b/fs/ext4/namei.c
@@ -2622,22 +2622,20 @@ int ext4_orphan_del(handle_t *handle, struct inode *inode)
{
struct list_head *prev;
struct ext4_inode_info *ei = EXT4_I(inode);
- struct ext4_sb_info *sbi;
+ struct ext4_sb_info *sbi = EXT4_SB(inode->i_sb);
__u32 ino_next;
struct ext4_iloc iloc;
int err = 0;
- if ((!EXT4_SB(inode->i_sb)->s_journal) &&
- !(EXT4_SB(inode->i_sb)->s_mount_state & EXT4_ORPHAN_FS))
+ if (!sbi->s_journal && !(sbi->s_mount_state & EXT4_ORPHAN_FS))
return 0;
- mutex_lock(&EXT4_SB(inode->i_sb)->s_orphan_lock);
+ mutex_lock(&sbi->s_orphan_lock);
if (list_empty(&ei->i_orphan))
goto out;
ino_next = NEXT_ORPHAN(inode);
prev = ei->i_orphan.prev;
- sbi = EXT4_SB(inode->i_sb);
jbd_debug(4, "remove inode %lu from orphan list\n", inode->i_ino);
@@ -2683,7 +2681,7 @@ int ext4_orphan_del(handle_t *handle, struct inode *inode)
out_err:
ext4_std_error(inode->i_sb, err);
out:
- mutex_unlock(&EXT4_SB(inode->i_sb)->s_orphan_lock);
+ mutex_unlock(&sbi->s_orphan_lock);
return err;
out_brelse:
--
1.8.1.4
Shuffle code around in ext4_orphan_add() and ext4_orphan_del() so that
we avoid taking global s_orphan_lock in some cases and hold it for
shorter time in other cases.
Signed-off-by: Jan Kara <[email protected]>
---
fs/ext4/namei.c | 39 +++++++++++++++++++++++++++++----------
1 file changed, 29 insertions(+), 10 deletions(-)
diff --git a/fs/ext4/namei.c b/fs/ext4/namei.c
index 411957326827..4253df8af9ef 100644
--- a/fs/ext4/namei.c
+++ b/fs/ext4/namei.c
@@ -2539,13 +2539,17 @@ static int empty_dir(struct inode *inode)
return 1;
}
-/* ext4_orphan_add() links an unlinked or truncated inode into a list of
+/*
+ * ext4_orphan_add() links an unlinked or truncated inode into a list of
* such inodes, starting at the superblock, in case we crash before the
* file is closed/deleted, or in case the inode truncate spans multiple
* transactions and the last transaction is not recovered after a crash.
*
* At filesystem recovery time, we walk this list deleting unlinked
* inodes and truncating linked inodes in ext4_orphan_cleanup().
+ *
+ * Orphan list manipulation functions must be called under i_mutex unless
+ * we are just creating the inode or deleting it.
*/
int ext4_orphan_add(handle_t *handle, struct inode *inode)
{
@@ -2556,9 +2560,14 @@ int ext4_orphan_add(handle_t *handle, struct inode *inode)
if (!EXT4_SB(sb)->s_journal)
return 0;
- mutex_lock(&EXT4_SB(sb)->s_orphan_lock);
- if (!list_empty(&EXT4_I(inode)->i_orphan))
- goto out_unlock;
+ WARN_ON_ONCE(!(inode->i_state & (I_NEW | I_FREEING)) &&
+ !mutex_is_locked(&inode->i_mutex));
+ /*
+ * Exit early if inode already is on orphan list. This is a big speedup
+ * since we don't have to contend on the global s_orphan_lock.
+ */
+ if (!list_empty(&EXT4_I(inode)->i_orphan))
+ return 0;
/*
* Orphan handling is only valid for files with data blocks
@@ -2577,6 +2586,8 @@ int ext4_orphan_add(handle_t *handle, struct inode *inode)
err = ext4_reserve_inode_write(handle, inode, &iloc);
if (err)
goto out_unlock;
+
+ mutex_lock(&EXT4_SB(sb)->s_orphan_lock);
/*
* Due to previous errors inode may be already a part of on-disk
* orphan list. If so skip on-disk list modification.
@@ -2630,10 +2641,22 @@ int ext4_orphan_del(handle_t *handle, struct inode *inode)
if (!sbi->s_journal && !(sbi->s_mount_state & EXT4_ORPHAN_FS))
return 0;
- mutex_lock(&sbi->s_orphan_lock);
+ WARN_ON_ONCE(!(inode->i_state & (I_NEW | I_FREEING)) &&
+ !mutex_is_locked(&inode->i_mutex));
+ /*
+ * Do this quick and racy check before taking global s_orphan_lock.
+ */
if (list_empty(&ei->i_orphan))
- goto out;
+ return 0;
+ /* Grab inode buffer early before taking global s_orphan_lock */
+ err = ext4_reserve_inode_write(handle, inode, &iloc);
+ if (err) {
+ list_del_init(&ei->i_orphan);
+ return err;
+ }
+
+ mutex_lock(&sbi->s_orphan_lock);
ino_next = NEXT_ORPHAN(inode);
prev = ei->i_orphan.prev;
@@ -2648,10 +2671,6 @@ int ext4_orphan_del(handle_t *handle, struct inode *inode)
if (!handle)
goto out;
- err = ext4_reserve_inode_write(handle, inode, &iloc);
- if (err)
- goto out_err;
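The restructuring in this patch follows a common locking pattern: do an unlocked (racy) fast-path check, do the expensive preparation outside the lock, and only take the global lock for the list manipulation itself, rechecking state under the lock. A rough user-space sketch of that pattern with pthreads (the names here are invented for illustration and are not the ext4 code):

```c
#include <assert.h>
#include <pthread.h>
#include <stdbool.h>

static pthread_mutex_t global_lock = PTHREAD_MUTEX_INITIALIZER;
static bool on_list;	/* stands in for the orphan-list linkage */
static int prepared;	/* counts the expensive preparation step */

/* Analogue of ext4_orphan_add(): add the item unless already present. */
static int list_add_once(void)
{
	/* Racy fast path: already on the list, skip the global lock. */
	if (on_list)
		return 0;

	/* Expensive work (cf. ext4_reserve_inode_write) done lock-free. */
	prepared++;

	pthread_mutex_lock(&global_lock);
	/* Recheck under the lock in case we raced with another adder. */
	if (!on_list)
		on_list = true;
	pthread_mutex_unlock(&global_lock);
	return 0;
}
```

The crucial part is the recheck under the lock: the unlocked test only avoids contention, it never decides the final state. In the actual patch the per-inode i_mutex additionally keeps the fast path from racing for a single inode, which is why the comment calls the check "quick and racy" rather than wrong.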
On 04/29/2014 05:32 PM, Jan Kara wrote:
> Shuffle code around in ext4_orphan_add() and ext4_orphan_del() so that
> we avoid taking global s_orphan_lock in some cases and hold it for
> shorter time in other cases.
>
> Signed-off-by: Jan Kara <[email protected]>
> ---
> fs/ext4/namei.c | 39 +++++++++++++++++++++++++++++----------
> 1 file changed, 29 insertions(+), 10 deletions(-)
>
> diff --git a/fs/ext4/namei.c b/fs/ext4/namei.c
> index 411957326827..4253df8af9ef 100644
> --- a/fs/ext4/namei.c
> +++ b/fs/ext4/namei.c
> @@ -2539,13 +2539,17 @@ static int empty_dir(struct inode *inode)
> return 1;
> }
>
> -/* ext4_orphan_add() links an unlinked or truncated inode into a list of
> +/*
> + * ext4_orphan_add() links an unlinked or truncated inode into a list of
> * such inodes, starting at the superblock, in case we crash before the
> * file is closed/deleted, or in case the inode truncate spans multiple
> * transactions and the last transaction is not recovered after a crash.
> *
> * At filesystem recovery time, we walk this list deleting unlinked
> * inodes and truncating linked inodes in ext4_orphan_cleanup().
> + *
> + * Orphan list manipulation functions must be called under i_mutex unless
> + * we are just creating the inode or deleting it.
> */
> int ext4_orphan_add(handle_t *handle, struct inode *inode)
> {
> @@ -2556,9 +2560,14 @@ int ext4_orphan_add(handle_t *handle, struct inode *inode)
> if (!EXT4_SB(sb)->s_journal)
> return 0;
>
> - mutex_lock(&EXT4_SB(sb)->s_orphan_lock);
> - if (!list_empty(&EXT4_I(inode)->i_orphan))
> - goto out_unlock;
> + WARN_ON_ONCE(!(inode->i_state & (I_NEW | I_FREEING)) &&
> + !mutex_is_locked(&inode->i_mutex));
> + /*
> + * Exit early if inode already is on orphan list. This is a big speedup
> + * since we don't have to contend on the global s_orphan_lock.
> + */
> + if (!list_empty(&EXT4_I(inode)->i_orphan))
> + return 0;
>
> /*
> * Orphan handling is only valid for files with data blocks
> @@ -2577,6 +2586,8 @@ int ext4_orphan_add(handle_t *handle, struct inode *inode)
> err = ext4_reserve_inode_write(handle, inode, &iloc);
> if (err)
> goto out_unlock;
> +
> + mutex_lock(&EXT4_SB(sb)->s_orphan_lock);
> /*
> * Due to previous errors inode may be already a part of on-disk
> * orphan list. If so skip on-disk list modification.
I believe we also need to call brelse() on the iloc that we reserved earlier if we are only updating the in-memory orphan list. This is also missing in the original code.
> @@ -2630,10 +2641,22 @@ int ext4_orphan_del(handle_t *handle, struct inode *inode)
> if (!sbi->s_journal && !(sbi->s_mount_state & EXT4_ORPHAN_FS))
> return 0;
>
> - mutex_lock(&sbi->s_orphan_lock);
> + WARN_ON_ONCE(!(inode->i_state & (I_NEW | I_FREEING)) &&
> + !mutex_is_locked(&inode->i_mutex));
> + /*
> + * Do this quick and racy check before taking global s_orphan_lock.
> + */
> if (list_empty(&ei->i_orphan))
> - goto out;
> + return 0;
>
> + /* Grab inode buffer early before taking global s_orphan_lock */
> + err = ext4_reserve_inode_write(handle, inode, &iloc);
> + if (err) {
You need to hold the s_orphan_lock mutex before modifying the orphan list.
> + list_del_init(&ei->i_orphan);
> + return err;
> + }
> +
> + mutex_lock(&sbi->s_orphan_lock);
> ino_next = NEXT_ORPHAN(inode);
> prev = ei->i_orphan.prev;
>
> @@ -2648,10 +2671,6 @@ int ext4_orphan_del(handle_t *handle, struct inode *inode)
> if (!handle)
> goto out;
In the error path, with a NULL handle, you still call ext4_reserve_inode_write(), which is not required. This may be one of the reasons you did not achieve better performance. In the original code and my patch, this path should be quick.
BTW, I realized that my patch does not remove the inode from the in-memory orphan list if we do not have write access to the inode.
>
> - err = ext4_reserve_inode_write(handle, inode, &iloc);
> - if (err)
> - goto out_err;
> -
> if (prev == &sbi->s_orphan) {
> jbd_debug(4, "superblock will point to %u\n", ino_next);
> BUFFER_TRACE(sbi->s_sbh, "get_write_access");
>
In my patch, there are things I do after updating the orphan list without holding the mutex, which you do not do. Again, this may be the reason your patch does not achieve the same performance.
Yes, there is some code cleanup that should have been done when I reverted back to using a single mutex. I doubt it would make much difference if we are trying to minimize the time we hold the s_orphan_lock.
As mentioned earlier, I still believe the hashed mutex is a good thing to have, though with the existing code it may not be required. I'll resubmit a new version of my patch with some code cleanup and a fix for the problem with orphan delete mentioned above.
Thanks,
Mak.
On 04/29/2014 05:32 PM, Jan Kara wrote:
>
> Hello,
>
> so I finally got to looking into your patches for reducing contention
> on s_orphan_lock. AFAICT these two patches (the first one is just a
> small cleanup) should have the same speed gain as the patches you wrote
> and they are simpler. Can you give them a try please? Thanks!
>
> Honza
>
I applied your patch and ran aim7 on both the 80 and 120 core count machines. There are aim7 workloads where your patch does show some performance improvement. Unfortunately, in general it does not reach the same performance level as my original patch, especially with high user counts, 500 or more.
As for the hashed mutex used in my patch to serialize orphan operations within a single node, even though I agree with you that with the existing code it is not required, I don't believe that you can count on that in the future. I believe that is also your concern, as you added a comment indicating the requirement of i_mutex in your patch.
In terms of maintainability, I do not believe simply relying on a warning and a comment is sufficient. On top of that, with this new requirement we are unnecessarily coupling the orphan operations to i_mutex, adding more contention to it. This would likely cause a performance regression, as my experiment in response to your earlier comment on my patch did show some regression when using i_mutex for serialization of orphan operations within a single node (adding code to take i_mutex if it is not already held).
I still believe having a different mutex for orphan operation serialization is a better and safer alternative. From my experiment so far (I'm still verifying this), it may even help improve performance by spreading out the contention on s_orphan_lock.
Please also see my response to patch 2/2 for comments on the actual code.
Thanks,
Mak.
On Fri 02-05-14 15:56:56, Thavatchai Makphaibulchoke wrote:
> On 04/29/2014 05:32 PM, Jan Kara wrote:
> >
> > Hello,
> >
> > so I finally got to looking into your patches for reducing contention
> > on s_orphan_lock. AFAICT these two patches (the first one is just a
> > small cleanup) should have the same speed gain as the patches you wrote
> > and they are simpler. Can you give them a try please? Thanks!
>
> I applied your patch and ran aim7 on both the 80 and 120 core count
> machines. There are aim7 workloads where your patch does show some
> performance improvement. Unfortunately, in general it does not reach the
> same performance level as my original patch, especially with high user
> counts, 500 or more.
Thanks for running the benchmarks!
> As for the hashed mutex used in my patch to serialize orphan operations
> within a single node, even though I agree with you that with the existing
> code it is not required, I don't believe that you can count on that in the
> future. I believe that is also your concern, as you added a comment
> indicating the requirement of i_mutex in your patch.
So I believe it is reasonable to require i_mutex to be held when orphan
operations are called. Adding / removing inode from orphan list is needed
when extending or truncating inode and these operations generally require
i_mutex in ext4. I've added the assertion so that we can catch a situation
when we either by mistake or intentionally grow a call site which won't
hold i_mutex. If such situation happens and we have good reasons why
i_mutex shouldn't be used there, then we can think about some dedicated
locks (either hashed or per inode ones).
> In terms of maintainability, I do not believe simply relying on a warning
> and a comment is sufficient. On top of that with this new requirement, we
> are unnecessarily coupling the orphan operations to i_mutex, adding more
> contention to it. This would likely cause a performance regression, as
We aren't adding *any* contention. I didn't have to add a single
additional locking of i_mutex because in all the necessary cases we are
holding i_mutex over the orphan operations anyway (or the inode isn't
reachable by other processes). So regarding per-inode locking, we are doing
strictly less work with my patch than with your patch. However you are
correct in your comments to my patch 2/2 that in your patch you handle
operations under the global s_orphan_lock better and that's probably what
causes the difference. I need to improve that for sure.
> my experiment in responding to your earlier comment on my patch did show
> some regression when using i_mutex for serialization of orphan operations
> within a single node (adding code to take i_mutex if it is not already
> held).
>
> I still believe having a different mutex for orphan operation
> serialization is a better and safer alternative.
Frankly I don't see why they would be better. They are marginally safer
by keeping the locking inside the orphan functions but the WARN_ON I have
added pretty much mitigates this concern and as I wrote above I actually
think using i_mutex not only works but also makes logical sense.
> From my experiment so
> far (I'm still verifying this), it may even help improve the
> performance by spreading out the contention on s_orphan_lock.
So orphan operations on a single inode are serialized by i_mutex both
with your and with my patch. You add additional serialization by hashed
mutexes so now orphan operations for independent inodes get serialized as
well. It may in theory improve the performance by effectively making the
access to the global s_orphan_lock less fair, but for now I believe that other
differences in our patches are what makes the difference.
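The hashed-mutex scheme being debated maps each inode number onto one slot in a small fixed array of mutexes, so operations on different inodes usually, but not always, take different locks. A user-space sketch of that idea (the table size and all names here are invented for illustration; this is not the code from either patch):

```c
#include <pthread.h>

#define NR_ORPHAN_LOCKS 16	/* illustrative table size */

static pthread_mutex_t orphan_locks[NR_ORPHAN_LOCKS];

static void orphan_locks_init(void)
{
	for (int i = 0; i < NR_ORPHAN_LOCKS; i++)
		pthread_mutex_init(&orphan_locks[i], NULL);
}

/* Hash an inode number to one slot in the lock table. */
static unsigned int orphan_lock_index(unsigned long ino)
{
	return ino % NR_ORPHAN_LOCKS;
}

/*
 * Orphan ops on the same inode always serialize on the same hashed
 * mutex; ops on different inodes usually take different ones, but
 * inodes whose numbers collide modulo the table size share a lock.
 */
static pthread_mutex_t *orphan_lock(unsigned long ino)
{
	return &orphan_locks[orphan_lock_index(ino)];
}
```

The collisions are exactly the extra serialization of independent inodes described above: inode numbers 5 and 21 share a slot in this sketch, while 5 and 6 do not.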
Honza
--
Jan Kara <[email protected]>
SUSE Labs, CR
On 05/06/2014 05:49 AM, Jan Kara wrote:
> On Fri 02-05-14 15:56:56, Thavatchai Makphaibulchoke wrote:
> So I believe it is reasonable to require i_mutex to be held when orphan
> operations are called. Adding / removing inode from orphan list is needed
> when extending or truncating inode and these operations generally require
> i_mutex in ext4. I've added the assertion so that we can catch a situation
> when we either by mistake or intentionally grow a call site which won't
> hold i_mutex. If such situation happens and we have good reasons why
> i_mutex shouldn't be used there, then we can think about some dedicated
> locks (either hashed or per inode ones).
>
In an earlier version of the patch, it was suggested that a per-inode mutex is too much overhead. In addition, adding a mutex to the ext4_inode struct would increase its size beyond the cache size.
> We aren't adding *any* contention. I didn't have to add a single
> additional locking of i_mutex because in all the necessary cases we are
> holding i_mutex over the orphan operations anyway (or the inode isn't
> reachable by other processes). So regarding per-inode locking, we are doing
> strictly less work with my patch than with your patch. However you are
> correct in your comments to my patch 2/2 that in your patch you handle
> operations under the global s_orphan_lock better and that's probably what
> causes the difference. I need to improve that for sure.
>
Sorry for not being clear. Yes, I agree that with the existing code you are not adding any contention. What I'm trying to say is that with your patch making i_mutex an implicit requirement of the orphan operations, any future ext4 operation requiring an orphan add/delete will also have to acquire i_mutex regardless of whether it otherwise needs it. Though this may not be likely to happen, it could potentially add unnecessary contention on i_mutex.
Again, this may not be likely to happen, but any change in i_mutex usage in the callers could have a direct impact on your change.
> Frankly I don't see why they would be better. They are marginally safer
> by keeping the locking inside the orphan functions but the WARN_ON I have
> added pretty much mitigates this concern and as I wrote above I actually
> think using i_mutex not only works but also makes logical sense.
>
Yes, the WARN_ON is a good safeguard against any i_mutex violation. Personally I would prefer a BUG_ON. But as mentioned above, any change in the callers or some future usage of the orphan operations may require a revisit of the WARN_ON.
Though my patch is doing more work and is only marginally safer with the existing code, it would be safer with regard to any further change. In addition, since i_mutex is not required, there may be a chance (sorry, I have not had a chance to verify this against the source yet) for the caller to drop i_mutex earlier, for a possible further improvement.
Despite what I've said, I would agree it makes sense to take advantage of the fact that either the caller already holds i_mutex or no other access to the inode is possible, for performance reasons. Unfortunately (it may be coincidental), with the aim7 results from my patch with the hashed mutex acquire/release removed, not only is no performance improvement detected, but there is a significant regression in several workloads on two of the higher thread count platforms I tested.
With all the performance numbers I'm getting, plus the fact that it is at least marginally safer, I still believe the hashed mutex is the way to go.
Thanks,
Mak.
> So orphan operations on a single inode are serialized by i_mutex both
> with your and with my patch. You add additional serialization by hashed
> mutexes so now orphan operations for independent inodes get serialized as
> well. It may in theory improve the performance by effectively making the
> access to the global s_orphan_lock less fair, but for now I believe that other
> differences in our patches are what makes the difference.
>
> Honza
>