2014-05-15 20:17:15

by Jan Kara

Subject: [PATCH 0/2 v2] Improve orphan list scaling


Hello,

this is my version of the patches to improve orphan list scaling by
reducing the amount of work done under the global s_orphan_mutex. We are
in disagreement with Thavatchai about whose patches are better (see the
thread at http://www.spinics.net/lists/linux-ext4/msg43220.html), so I guess
it's up to Ted or other people on this list to decide.

When running code stressing orphan list operations [1] with these
patches, I see s_orphan_lock move from the number one spot in the lock_stat
report to unmeasurable. So with the patches there are other, much more
problematic locks (the superblock buffer lock and bh_state lock,
j_list_lock, buffer locks for inode buffers when several inodes share a
block...). The average times over 10 runs of the test program on my
48-way box with ext4 on a ramdisk are:
                 Vanilla                  Patched
Procs        Avg       Stddev         Avg       Stddev
    1     2.769200    0.056194     2.890700    0.061727
    2     5.756500    0.313268     4.383500    0.161629
    4    11.852500    0.130221     6.542900    0.160039
   10    33.590900    0.394888    27.749500    0.615517
   20    71.035400    0.320914    76.368700    3.734557
   40   236.671100    2.856885   228.236800    2.361391

So we can see the biggest speedup was for 2, 4, and 10 threads. For
higher thread counts the contention and cache bouncing prevented any
significant speedup (we can even see a barely-out-of-noise performance
drop for 20 threads).

Changes since v1:
* Fixed up various bugs in error handling pointed out by Thavatchai and
some others as well
* Somewhat reduced critical sections under s_orphan_lock

[1] The test program runs a given number of processes; each process keeps
truncating a 4k file by 1 byte until the file reaches a size of 1 byte, and
then the file is extended back to 4k again.
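
For illustration, a minimal sketch of one such worker's loop (the file name
argument and the COUNT value here are made up; Jan's actual test program,
stress-orphan.c, is attached later in the thread and forks the requested
number of these workers):

#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>

#define COUNT 100	/* number of shrink/extend cycles per process */

int main(int argc, char **argv)
{
	char wbuf[4096] = { 0 };
	int fd, i, j;

	if (argc != 2) {
		fprintf(stderr, "usage: %s <file>\n", argv[0]);
		return 1;
	}
	fd = open(argv[1], O_CREAT | O_TRUNC | O_WRONLY, 0644);
	if (fd < 0) {
		perror("open");
		return 1;
	}
	for (i = 0; i < COUNT; i++) {
		/* Extend the file back to 4k... */
		if (pwrite(fd, wbuf, 4096, 0) != 4096) {
			perror("pwrite");
			return 1;
		}
		/* ...and shrink it byte by byte; each size-reducing
		 * truncate exercises the orphan add/delete paths. */
		for (j = 4095; j >= 1; j--) {
			if (ftruncate(fd, j) < 0) {
				perror("ftruncate");
				return 1;
			}
		}
	}
	return 0;
}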

Honza


2014-05-15 20:17:16

by Jan Kara

Subject: [PATCH 2/2] ext4: Reduce contention on s_orphan_lock

Shuffle code around in ext4_orphan_add() and ext4_orphan_del() so that
we avoid taking global s_orphan_lock in some cases and hold it for
shorter time in other cases.

Signed-off-by: Jan Kara <[email protected]>
---
fs/ext4/namei.c | 109 +++++++++++++++++++++++++++++++++-----------------------
1 file changed, 65 insertions(+), 44 deletions(-)

diff --git a/fs/ext4/namei.c b/fs/ext4/namei.c
index 5fcaa85b6dc5..0486fbafb808 100644
--- a/fs/ext4/namei.c
+++ b/fs/ext4/namei.c
@@ -2539,13 +2539,17 @@ static int empty_dir(struct inode *inode)
return 1;
}

-/* ext4_orphan_add() links an unlinked or truncated inode into a list of
+/*
+ * ext4_orphan_add() links an unlinked or truncated inode into a list of
* such inodes, starting at the superblock, in case we crash before the
* file is closed/deleted, or in case the inode truncate spans multiple
* transactions and the last transaction is not recovered after a crash.
*
* At filesystem recovery time, we walk this list deleting unlinked
* inodes and truncating linked inodes in ext4_orphan_cleanup().
+ *
+ * Orphan list manipulation functions must be called under i_mutex unless
+ * we are just creating the inode or deleting it.
*/
int ext4_orphan_add(handle_t *handle, struct inode *inode)
{
@@ -2553,13 +2557,19 @@ int ext4_orphan_add(handle_t *handle, struct inode *inode)
struct ext4_sb_info *sbi = EXT4_SB(sb);
struct ext4_iloc iloc;
int err = 0, rc;
+ bool dirty = false;

if (!sbi->s_journal)
return 0;

- mutex_lock(&sbi->s_orphan_lock);
+ WARN_ON_ONCE(!(inode->i_state & (I_NEW | I_FREEING)) &&
+ !mutex_is_locked(&inode->i_mutex));
+ /*
+ * Exit early if inode already is on orphan list. This is a big speedup
+ * since we don't have to contend on the global s_orphan_lock.
+ */
if (!list_empty(&EXT4_I(inode)->i_orphan))
- goto out_unlock;
+ return 0;

/*
* Orphan handling is only valid for files with data blocks
@@ -2573,44 +2583,47 @@ int ext4_orphan_add(handle_t *handle, struct inode *inode)
BUFFER_TRACE(sbi->s_sbh, "get_write_access");
err = ext4_journal_get_write_access(handle, sbi->s_sbh);
if (err)
- goto out_unlock;
+ goto out;

err = ext4_reserve_inode_write(handle, inode, &iloc);
if (err)
- goto out_unlock;
+ goto out;
+
+ mutex_lock(&sbi->s_orphan_lock);
/*
* Due to previous errors inode may be already a part of on-disk
* orphan list. If so skip on-disk list modification.
*/
- if (NEXT_ORPHAN(inode) && NEXT_ORPHAN(inode) <=
- (le32_to_cpu(sbi->s_es->s_inodes_count)))
- goto mem_insert;
-
- /* Insert this inode at the head of the on-disk orphan list... */
- NEXT_ORPHAN(inode) = le32_to_cpu(sbi->s_es->s_last_orphan);
- sbi->s_es->s_last_orphan = cpu_to_le32(inode->i_ino);
- err = ext4_handle_dirty_super(handle, sb);
- rc = ext4_mark_iloc_dirty(handle, inode, &iloc);
- if (!err)
- err = rc;
-
- /* Only add to the head of the in-memory list if all the
- * previous operations succeeded. If the orphan_add is going to
- * fail (possibly taking the journal offline), we can't risk
- * leaving the inode on the orphan list: stray orphan-list
- * entries can cause panics at unmount time.
- *
- * This is safe: on error we're going to ignore the orphan list
- * anyway on the next recovery. */
-mem_insert:
- if (!err)
- list_add(&EXT4_I(inode)->i_orphan, &sbi->s_orphan);
+ if (!NEXT_ORPHAN(inode) || NEXT_ORPHAN(inode) >
+ (le32_to_cpu(sbi->s_es->s_inodes_count))) {
+ /* Insert this inode at the head of the on-disk orphan list */
+ NEXT_ORPHAN(inode) = le32_to_cpu(sbi->s_es->s_last_orphan);
+ sbi->s_es->s_last_orphan = cpu_to_le32(inode->i_ino);
+ dirty = true;
+ }
+ list_add(&EXT4_I(inode)->i_orphan, &sbi->s_orphan);
+ mutex_unlock(&sbi->s_orphan_lock);

+ if (dirty) {
+ err = ext4_handle_dirty_super(handle, sb);
+ rc = ext4_mark_iloc_dirty(handle, inode, &iloc);
+ if (!err)
+ err = rc;
+ if (err) {
+ /*
+ * We have to remove inode from in-memory list if
+ * addition to on disk orphan list failed. Stray orphan
+ * list entries can cause panics at unmount time.
+ */
+ mutex_lock(&sbi->s_orphan_lock);
+ list_del(&EXT4_I(inode)->i_orphan);
+ mutex_unlock(&sbi->s_orphan_lock);
+ }
+ }
jbd_debug(4, "superblock will point to %lu\n", inode->i_ino);
jbd_debug(4, "orphan inode %lu will point to %d\n",
inode->i_ino, NEXT_ORPHAN(inode));
-out_unlock:
- mutex_unlock(&sbi->s_orphan_lock);
+out:
ext4_std_error(sb, err);
return err;
}
@@ -2631,13 +2644,18 @@ int ext4_orphan_del(handle_t *handle, struct inode *inode)
if (!sbi->s_journal && !(sbi->s_mount_state & EXT4_ORPHAN_FS))
return 0;

- mutex_lock(&sbi->s_orphan_lock);
+ WARN_ON_ONCE(!(inode->i_state & (I_NEW | I_FREEING)) &&
+ !mutex_is_locked(&inode->i_mutex));
+ /* Do this quick check before taking global s_orphan_lock. */
if (list_empty(&ei->i_orphan))
- goto out;
+ return 0;

- ino_next = NEXT_ORPHAN(inode);
- prev = ei->i_orphan.prev;
+ if (handle) {
+ /* Grab inode buffer early before taking global s_orphan_lock */
+ err = ext4_reserve_inode_write(handle, inode, &iloc);
+ }

+ mutex_lock(&sbi->s_orphan_lock);
jbd_debug(4, "remove inode %lu from orphan list\n", inode->i_ino);

list_del_init(&ei->i_orphan);
@@ -2646,20 +2664,23 @@ int ext4_orphan_del(handle_t *handle, struct inode *inode)
* transaction handle with which to update the orphan list on
* disk, but we still need to remove the inode from the linked
* list in memory. */
- if (!handle)
- goto out;
-
- err = ext4_reserve_inode_write(handle, inode, &iloc);
- if (err)
+ if (!handle || err) {
+ mutex_unlock(&sbi->s_orphan_lock);
goto out_err;
+ }

+ ino_next = NEXT_ORPHAN(inode);
+ prev = ei->i_orphan.prev;
if (prev == &sbi->s_orphan) {
jbd_debug(4, "superblock will point to %u\n", ino_next);
BUFFER_TRACE(sbi->s_sbh, "get_write_access");
err = ext4_journal_get_write_access(handle, sbi->s_sbh);
- if (err)
+ if (err) {
+ mutex_unlock(&sbi->s_orphan_lock);
goto out_brelse;
+ }
sbi->s_es->s_last_orphan = cpu_to_le32(ino_next);
+ mutex_unlock(&sbi->s_orphan_lock);
err = ext4_handle_dirty_super(handle, inode->i_sb);
} else {
struct ext4_iloc iloc2;
@@ -2669,20 +2690,20 @@ int ext4_orphan_del(handle_t *handle, struct inode *inode)
jbd_debug(4, "orphan inode %lu will point to %u\n",
i_prev->i_ino, ino_next);
err = ext4_reserve_inode_write(handle, i_prev, &iloc2);
- if (err)
+ if (err) {
+ mutex_unlock(&sbi->s_orphan_lock);
goto out_brelse;
+ }
NEXT_ORPHAN(i_prev) = ino_next;
+ mutex_unlock(&sbi->s_orphan_lock);
err = ext4_mark_iloc_dirty(handle, i_prev, &iloc2);
}
if (err)
goto out_brelse;
NEXT_ORPHAN(inode) = 0;
err = ext4_mark_iloc_dirty(handle, inode, &iloc);

2014-05-15 20:17:16

by Jan Kara

Subject: [PATCH 1/2] ext4: Use sbi in ext4_orphan_{add|del}()

Use sbi pointer consistently in ext4_orphan_del() instead of opencoding
it sometimes. Also ext4_orphan_add() uses EXT4_SB(sb) often so create
sbi variable for it as well and use it.

Signed-off-by: Jan Kara <[email protected]>
---
fs/ext4/namei.c | 31 +++++++++++++++----------------
1 file changed, 15 insertions(+), 16 deletions(-)

diff --git a/fs/ext4/namei.c b/fs/ext4/namei.c
index 1cb84f78909e..5fcaa85b6dc5 100644
--- a/fs/ext4/namei.c
+++ b/fs/ext4/namei.c
@@ -2550,13 +2550,14 @@ static int empty_dir(struct inode *inode)
int ext4_orphan_add(handle_t *handle, struct inode *inode)
{
struct super_block *sb = inode->i_sb;
+ struct ext4_sb_info *sbi = EXT4_SB(sb);
struct ext4_iloc iloc;
int err = 0, rc;

- if (!EXT4_SB(sb)->s_journal)
+ if (!sbi->s_journal)
return 0;

- mutex_lock(&EXT4_SB(sb)->s_orphan_lock);
+ mutex_lock(&sbi->s_orphan_lock);
if (!list_empty(&EXT4_I(inode)->i_orphan))
goto out_unlock;

@@ -2569,8 +2570,8 @@ int ext4_orphan_add(handle_t *handle, struct inode *inode)
J_ASSERT((S_ISREG(inode->i_mode) || S_ISDIR(inode->i_mode) ||
S_ISLNK(inode->i_mode)) || inode->i_nlink == 0);

- BUFFER_TRACE(EXT4_SB(sb)->s_sbh, "get_write_access");
- err = ext4_journal_get_write_access(handle, EXT4_SB(sb)->s_sbh);
+ BUFFER_TRACE(sbi->s_sbh, "get_write_access");
+ err = ext4_journal_get_write_access(handle, sbi->s_sbh);
if (err)
goto out_unlock;

@@ -2582,12 +2583,12 @@ int ext4_orphan_add(handle_t *handle, struct inode *inode)
* orphan list. If so skip on-disk list modification.
*/
if (NEXT_ORPHAN(inode) && NEXT_ORPHAN(inode) <=
- (le32_to_cpu(EXT4_SB(sb)->s_es->s_inodes_count)))
+ (le32_to_cpu(sbi->s_es->s_inodes_count)))
goto mem_insert;

/* Insert this inode at the head of the on-disk orphan list... */
- NEXT_ORPHAN(inode) = le32_to_cpu(EXT4_SB(sb)->s_es->s_last_orphan);
- EXT4_SB(sb)->s_es->s_last_orphan = cpu_to_le32(inode->i_ino);
+ NEXT_ORPHAN(inode) = le32_to_cpu(sbi->s_es->s_last_orphan);
+ sbi->s_es->s_last_orphan = cpu_to_le32(inode->i_ino);
err = ext4_handle_dirty_super(handle, sb);
rc = ext4_mark_iloc_dirty(handle, inode, &iloc);
if (!err)
@@ -2603,14 +2604,14 @@ int ext4_orphan_add(handle_t *handle, struct inode *inode)
* anyway on the next recovery. */
mem_insert:
if (!err)
- list_add(&EXT4_I(inode)->i_orphan, &EXT4_SB(sb)->s_orphan);
+ list_add(&EXT4_I(inode)->i_orphan, &sbi->s_orphan);

jbd_debug(4, "superblock will point to %lu\n", inode->i_ino);
jbd_debug(4, "orphan inode %lu will point to %d\n",
inode->i_ino, NEXT_ORPHAN(inode));
out_unlock:
- mutex_unlock(&EXT4_SB(sb)->s_orphan_lock);
- ext4_std_error(inode->i_sb, err);
+ mutex_unlock(&sbi->s_orphan_lock);
+ ext4_std_error(sb, err);
return err;
}

@@ -2622,22 +2623,20 @@ int ext4_orphan_del(handle_t *handle, struct inode *inode)
{
struct list_head *prev;
struct ext4_inode_info *ei = EXT4_I(inode);
- struct ext4_sb_info *sbi;
+ struct ext4_sb_info *sbi = EXT4_SB(inode->i_sb);
__u32 ino_next;
struct ext4_iloc iloc;
int err = 0;

- if ((!EXT4_SB(inode->i_sb)->s_journal) &&
- !(EXT4_SB(inode->i_sb)->s_mount_state & EXT4_ORPHAN_FS))
+ if (!sbi->s_journal && !(sbi->s_mount_state & EXT4_ORPHAN_FS))
return 0;

- mutex_lock(&EXT4_SB(inode->i_sb)->s_orphan_lock);
+ mutex_lock(&sbi->s_orphan_lock);
if (list_empty(&ei->i_orphan))
goto out;

ino_next = NEXT_ORPHAN(inode);
prev = ei->i_orphan.prev;
- sbi = EXT4_SB(inode->i_sb);

jbd_debug(4, "remove inode %lu from orphan list\n", inode->i_ino);

@@ -2683,7 +2682,7 @@ int ext4_orphan_del(handle_t *handle, struct inode *inode)
out_err:
ext4_std_error(inode->i_sb, err);
out:
- mutex_unlock(&EXT4_SB(inode->i_sb)->s_orphan_lock);
+ mutex_unlock(&sbi->s_orphan_lock);
return err;

out_brelse:
--
1.8.1.4


2014-05-19 14:50:37

by Theodore Ts'o

Subject: Re: [PATCH 0/2 v2] Improve orphan list scaling

On Thu, May 15, 2014 at 10:17:04PM +0200, Jan Kara wrote:
> this is my version of patches to improve orphan list scaling by
> reducing amount of work done under global s_orphan_mutex. We are
> in disagreement with Thavatchai whose patches are better (see thread
> http://www.spinics.net/lists/linux-ext4/msg43220.html) so I guess it's
> up to Ted or other people on this list to decide.

I've started testing Jan's patches, mainly because of the simplicity
issue, and because I agree with him that it's not clear that the hashed
mutexes are worth the extra complexity, given that only a vanishingly
small number of systems would show any benefit from them, and even
then it's not clear the gains are that much greater.

For the purposes of doing further optimizations, I'd prefer to start
with Jan's changes as a starting point and as a performance baseline
for further improvements.

If adding more mutexes or spinlocks, even 128 more mutexes per
filesystem, can be justified by a really huge improvement, it's
something that I won't completely rule out at this stage. But I'd
rather avoid that unless the comparative benchmarks and lock_stat
reports make for an extremely convincing case.

Cheers,

- Ted

2014-05-20 03:23:56

by Theodore Ts'o

Subject: Re: [PATCH 2/2] ext4: Reduce contention on s_orphan_lock

On Thu, May 15, 2014 at 10:17:06PM +0200, Jan Kara wrote:
> Shuffle code around in ext4_orphan_add() and ext4_orphan_del() so that
> we avoid taking global s_orphan_lock in some cases and hold it for
> shorter time in other cases.
>
> Signed-off-by: Jan Kara <[email protected]>

Hi Jan,

With this patch applied, xfstests is hanging after ext4/001 runs:

BEGIN TEST: Ext4 4k block Tue May 20 03:18:49 UTC 2014
Device: /dev/vdb
mk2fs options: -q
mount options: -o block_validity
FSTYP -- ext4
PLATFORM -- Linux/i686 candygram 3.15.0-rc2-00052-g0f430fc
MKFS_OPTIONS -- -q /dev/vdc
MOUNT_OPTIONS -- -o acl,user_xattr -o block_validity /dev/vdc /vdc

ext4/001 4s ... [03:19:05] [03:19:09] 4s

It just stops here, and never runs the next test.

If I revert this commit, the test run continues with ext4/002, and then
onto the rest of the tests.

Could you take a look?

Thanks!!

- Ted

Subject: Re: [PATCH 2/2] ext4: Reduce contention on s_orphan_lock

Please see my one comment below.

BTW, I've run aim7 on yours before I noticed what I commented on below. There are workloads where my patch outperforms yours and vice versa. I will have to redo it all over again.

On 05/15/2014 02:17 PM, Jan Kara wrote:
> Shuffle code around in ext4_orphan_add() and ext4_orphan_del() so that
> we avoid taking global s_orphan_lock in some cases and hold it for
> shorter time in other cases.
>
> Signed-off-by: Jan Kara <[email protected]>
> ---
> fs/ext4/namei.c | 109 +++++++++++++++++++++++++++++++++-----------------------
> 1 file changed, 65 insertions(+), 44 deletions(-)
>
> diff --git a/fs/ext4/namei.c b/fs/ext4/namei.c
> index 5fcaa85b6dc5..0486fbafb808 100644
> --- a/fs/ext4/namei.c
> +++ b/fs/ext4/namei.c
> @@ -2539,13 +2539,17 @@ static int empty_dir(struct inode *inode)
> return 1;
> }
>
> -/* ext4_orphan_add() links an unlinked or truncated inode into a list of
> +/*
> + * ext4_orphan_add() links an unlinked or truncated inode into a list of
> * such inodes, starting at the superblock, in case we crash before the
> * file is closed/deleted, or in case the inode truncate spans multiple
> * transactions and the last transaction is not recovered after a crash.
> *
> * At filesystem recovery time, we walk this list deleting unlinked
> * inodes and truncating linked inodes in ext4_orphan_cleanup().
> + *
> + * Orphan list manipulation functions must be called under i_mutex unless
> + * we are just creating the inode or deleting it.
> */
> int ext4_orphan_add(handle_t *handle, struct inode *inode)
> {
> @@ -2553,13 +2557,19 @@ int ext4_orphan_add(handle_t *handle, struct inode *inode)
> struct ext4_sb_info *sbi = EXT4_SB(sb);
> struct ext4_iloc iloc;
> int err = 0, rc;
> + bool dirty = false;
>
> if (!sbi->s_journal)
> return 0;
>
> - mutex_lock(&sbi->s_orphan_lock);
> + WARN_ON_ONCE(!(inode->i_state & (I_NEW | I_FREEING)) &&
> + !mutex_is_locked(&inode->i_mutex));
> + /*
> + * Exit early if inode already is on orphan list. This is a big speedup
> + * since we don't have to contend on the global s_orphan_lock.
> + */
> if (!list_empty(&EXT4_I(inode)->i_orphan))
> - goto out_unlock;
> + return 0;
>
> /*
> * Orphan handling is only valid for files with data blocks
> @@ -2573,44 +2583,47 @@ int ext4_orphan_add(handle_t *handle, struct inode *inode)
> BUFFER_TRACE(sbi->s_sbh, "get_write_access");
> err = ext4_journal_get_write_access(handle, sbi->s_sbh);
> if (err)
> - goto out_unlock;
> + goto out;
>
> err = ext4_reserve_inode_write(handle, inode, &iloc);
> if (err)
> - goto out_unlock;
> + goto out;
> +
> + mutex_lock(&sbi->s_orphan_lock);
> /*
> * Due to previous errors inode may be already a part of on-disk
> * orphan list. If so skip on-disk list modification.
> */
> - if (NEXT_ORPHAN(inode) && NEXT_ORPHAN(inode) <=
> - (le32_to_cpu(sbi->s_es->s_inodes_count)))
> - goto mem_insert;
> -
> - /* Insert this inode at the head of the on-disk orphan list... */
> - NEXT_ORPHAN(inode) = le32_to_cpu(sbi->s_es->s_last_orphan);
> - sbi->s_es->s_last_orphan = cpu_to_le32(inode->i_ino);
> - err = ext4_handle_dirty_super(handle, sb);
> - rc = ext4_mark_iloc_dirty(handle, inode, &iloc);
> - if (!err)
> - err = rc;
> -
> - /* Only add to the head of the in-memory list if all the
> - * previous operations succeeded. If the orphan_add is going to
> - * fail (possibly taking the journal offline), we can't risk
> - * leaving the inode on the orphan list: stray orphan-list
> - * entries can cause panics at unmount time.
> - *
> - * This is safe: on error we're going to ignore the orphan list
> - * anyway on the next recovery. */
> -mem_insert:
> - if (!err)
> - list_add(&EXT4_I(inode)->i_orphan, &sbi->s_orphan);
> + if (!NEXT_ORPHAN(inode) || NEXT_ORPHAN(inode) >
> + (le32_to_cpu(sbi->s_es->s_inodes_count))) {
> + /* Insert this inode at the head of the on-disk orphan list */
> + NEXT_ORPHAN(inode) = le32_to_cpu(sbi->s_es->s_last_orphan);
> + sbi->s_es->s_last_orphan = cpu_to_le32(inode->i_ino);
> + dirty = true;
> + }
> + list_add(&EXT4_I(inode)->i_orphan, &sbi->s_orphan);
> + mutex_unlock(&sbi->s_orphan_lock);
>
> + if (dirty) {
> + err = ext4_handle_dirty_super(handle, sb);
> + rc = ext4_mark_iloc_dirty(handle, inode, &iloc);
> + if (!err)
> + err = rc;
> + if (err) {
> + /*
> + * We have to remove inode from in-memory list if
> + * addition to on disk orphan list failed. Stray orphan
> + * list entries can cause panics at unmount time.
> + */
> + mutex_lock(&sbi->s_orphan_lock);
> + list_del(&EXT4_I(inode)->i_orphan);
> + mutex_unlock(&sbi->s_orphan_lock);
> + }
> + }
> jbd_debug(4, "superblock will point to %lu\n", inode->i_ino);
> jbd_debug(4, "orphan inode %lu will point to %d\n",
> inode->i_ino, NEXT_ORPHAN(inode));
> -out_unlock:
> - mutex_unlock(&sbi->s_orphan_lock);
> +out:
> ext4_std_error(sb, err);
> return err;
> }
> @@ -2631,13 +2644,18 @@ int ext4_orphan_del(handle_t *handle, struct inode *inode)
> if (!sbi->s_journal && !(sbi->s_mount_state & EXT4_ORPHAN_FS))
> return 0;
>
> - mutex_lock(&sbi->s_orphan_lock);
> + WARN_ON_ONCE(!(inode->i_state & (I_NEW | I_FREEING)) &&
> + !mutex_is_locked(&inode->i_mutex));
> + /* Do this quick check before taking global s_orphan_lock. */
> if (list_empty(&ei->i_orphan))
> - goto out;
> + return 0;
>
> - ino_next = NEXT_ORPHAN(inode);
> - prev = ei->i_orphan.prev;
> + if (handle) {
> + /* Grab inode buffer early before taking global s_orphan_lock */
> + err = ext4_reserve_inode_write(handle, inode, &iloc);
> + }
>
> + mutex_lock(&sbi->s_orphan_lock);
> jbd_debug(4, "remove inode %lu from orphan list\n", inode->i_ino);
>

Should set prev = ei->i_orphan.prev; here, instead of down below where it has already been removed from the list.

Thanks,
Mak.

> list_del_init(&ei->i_orphan);
> @@ -2646,20 +2664,23 @@ int ext4_orphan_del(handle_t *handle, struct inode *inode)
> * transaction handle with which to update the orphan list on
> * disk, but we still need to remove the inode from the linked
> * list in memory. */
> - if (!handle)
> - goto out;
> -
> - err = ext4_reserve_inode_write(handle, inode, &iloc);
> - if (err)
> + if (!handle || err) {
> + mutex_unlock(&sbi->s_orphan_lock);
> goto out_err;
> + }
>
> + ino_next = NEXT_ORPHAN(inode);
> + prev = ei->i_orphan.prev;


2014-05-20 09:18:28

by Jan Kara

Subject: Re: [PATCH 2/2] ext4: Reduce contention on s_orphan_lock

On Tue 20-05-14 02:33:23, Thavatchai Makphaibulchoke wrote:
> Please see my one comment below.
>
> BTW, I've run aim7 on your before I notice what I commented below. There
> are workloads that my patch outperform yours and vice versa. I will have
> to redo it over again.
>
> On 05/15/2014 02:17 PM, Jan Kara wrote:
> > @@ -2631,13 +2644,18 @@ int ext4_orphan_del(handle_t *handle, struct inode *inode)
> > if (!sbi->s_journal && !(sbi->s_mount_state & EXT4_ORPHAN_FS))
> > return 0;
> >
> > - mutex_lock(&sbi->s_orphan_lock);
> > + WARN_ON_ONCE(!(inode->i_state & (I_NEW | I_FREEING)) &&
> > + !mutex_is_locked(&inode->i_mutex));
> > + /* Do this quick check before taking global s_orphan_lock. */
> > if (list_empty(&ei->i_orphan))
> > - goto out;
> > + return 0;
> >
> > - ino_next = NEXT_ORPHAN(inode);
> > - prev = ei->i_orphan.prev;
> > + if (handle) {
> > + /* Grab inode buffer early before taking global s_orphan_lock */
> > + err = ext4_reserve_inode_write(handle, inode, &iloc);
> > + }
> >
> > + mutex_lock(&sbi->s_orphan_lock);
> > jbd_debug(4, "remove inode %lu from orphan list\n", inode->i_ino);
> >
>
> Should set prev = ei->i_orphan.prev; here, instead of down below where it
> has already been removed from the list.
Bah, I'm sorry for causing you extra work. That's a really stupid
mistake. Thanks for finding that! Also I've realized we cannot really
reliably do
ext4_mark_iloc_dirty(handle, i_prev, &iloc2);
outside of s_orphan_lock because that inode may be reclaimed the moment
after we drop s_orphan_lock. So I had to remove that optimization as well
and move the call back under s_orphan_lock. I'm running xfstests now
(updated so that they also include the test Ted found failing) to verify
correctness again, and then I'll get my performance numbers with the fixed
patches.
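
Put together, a rough sketch of the tail of ext4_orphan_del() with both
fixes applied (prev read before list_del_init(), and the previous inode
dirtied under the lock) would look something like the fragment below -
this is illustrative only, not the actual v3 patch:

	mutex_lock(&sbi->s_orphan_lock);
	jbd_debug(4, "remove inode %lu from orphan list\n", inode->i_ino);

	/* Read the list neighbour before the entry is unlinked. */
	prev = ei->i_orphan.prev;
	list_del_init(&ei->i_orphan);

	if (!handle || err) {
		mutex_unlock(&sbi->s_orphan_lock);
		goto out_err;
	}

	ino_next = NEXT_ORPHAN(inode);
	if (prev == &sbi->s_orphan) {
		/* First on the list: update the superblock as in the
		 * posted patch. */
		...
	} else {
		struct ext4_iloc iloc2;
		struct inode *i_prev =
			&list_entry(prev, struct ext4_inode_info,
				    i_orphan)->vfs_inode;

		err = ext4_reserve_inode_write(handle, i_prev, &iloc2);
		if (err) {
			mutex_unlock(&sbi->s_orphan_lock);
			goto out_brelse;
		}
		NEXT_ORPHAN(i_prev) = ino_next;
		/*
		 * Dirty the previous inode while still holding
		 * s_orphan_lock; once the lock is dropped, i_prev may
		 * be reclaimed.
		 */
		err = ext4_mark_iloc_dirty(handle, i_prev, &iloc2);
		mutex_unlock(&sbi->s_orphan_lock);
	}
	/* ... then clear NEXT_ORPHAN(inode) and dirty the inode's own
	 * iloc as before. */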

Honza
--
Jan Kara <[email protected]>
SUSE Labs, CR

2014-05-20 13:57:33

by Theodore Ts'o

Subject: Re: [PATCH 2/2] ext4: Reduce contention on s_orphan_lock

On Tue, May 20, 2014 at 02:33:23AM -0600, Thavatchai Makphaibulchoke wrote:
> Please see my one comment below.
>
> BTW, I've run aim7 on your before I notice what I commented below. There are workloads that my patch outperform yours and vice versa. I will have to redo it over again.

Thavatchai, it would be really great if you could do lock_stat runs
with both Jan's latest patches as well as yours. We need to
understand where the differences are coming from.

As I understand things, there are two differences between Jan's approach
and yours. The first is that Jan is using the implicit locking of
i_mutex to avoid needing to keep a hashed array of mutexes to
synchronize an individual inode's being added to or removed from the
orphan list.

The second is that you've split the orphan mutex into an on-disk mutex
and an in-memory spinlock.

Is it possible to split up your patch so we can measure the benefits
of each of these two changes? More interestingly, is there a way we
can use your second change in concert with Jan's changes?

Regards,

- Ted

Subject: Re: [PATCH 2/2] ext4: Reduce contention on s_orphan_lock

On 05/20/2014 07:57 AM, Theodore Ts'o wrote:
> Thavatchai, it would be really great if you could do lock_stat runs
> with both Jan's latest patches as well as yours. We need to
> understand where the differences are coming from.
>
> As I understand things, there are two differences between Jan and your
> approaches. The first is that Jan is using the implicit locking of
> i_mutex to avoid needing to keep a hashed array of mutexes to
> synchronize an individual inode's being added or removed to the orphan
> list.
>
> The second is that you've split the orphan mutex into an on-disk mutex
> and a in-memory spinlock.
>
> Is it possible to split up your patch so we can measure the benefits
> of each of these two changes? More interestingly, is there a way we
> can use the your second change in concert with Jan's changes?
>
> Regards,
>
> - Ted
>

Ted, depending on what Jan does with my latest comment regarding the dirty optimization, his patch and my patch are functionally identical, with the exception that mine uses an explicit hashed lock to serialize add/delete within a single inode, while his uses the fact that either i_mutex is already held or the inode is not accessible to other threads.


My assumption is that with the hashed mutexes, the contention for s_orphan_lock will spread out more, increasing its chance of hitting the mutex_lock fast path. Anyway, I will do lock_stat runs to get more info.
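
As a rough userspace analogue of that idea (illustrative only; the names and
the 128-entry hash are assumptions, not the actual kernel patch), the pattern
is to take a hashed per-inode lock first and only then the global one:

#include <pthread.h>

#define ORPHAN_OP_HASH_SZ	128

/* Per-inode hashed locks plus one global lock for the shared list. */
static pthread_mutex_t orphan_op_mutex[ORPHAN_OP_HASH_SZ];
static pthread_mutex_t orphan_lock = PTHREAD_MUTEX_INITIALIZER;

static void orphan_locks_init(void)
{
	int i;

	for (i = 0; i < ORPHAN_OP_HASH_SZ; i++)
		pthread_mutex_init(&orphan_op_mutex[i], NULL);
}

static void orphan_op(unsigned long ino)
{
	pthread_mutex_t *m = &orphan_op_mutex[ino % ORPHAN_OP_HASH_SZ];

	/*
	 * Operations on different inodes hash to different mutexes, so
	 * threads queue on many small locks first and arrive at the
	 * global lock spread out in time, making it more likely that
	 * the global lock is taken via the uncontended fast path.
	 */
	pthread_mutex_lock(m);
	pthread_mutex_lock(&orphan_lock);
	/* ... manipulate the shared orphan list here ... */
	pthread_mutex_unlock(&orphan_lock);
	pthread_mutex_unlock(m);
}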

Thanks,
Mak.


Subject: Re: [PATCH 2/2] ext4: Reduce contention on s_orphan_lock

On 05/20/2014 07:57 AM, Theodore Ts'o wrote:
> On Tue, May 20, 2014 at 02:33:23AM -0600, Thavatchai Makphaibulchoke wrote:
>
> Thavatchai, it would be really great if you could do lock_stat runs
> with both Jan's latest patches as well as yours. We need to
> understand where the differences are coming from.
>
> As I understand things, there are two differences between Jan and your
> approaches. The first is that Jan is using the implicit locking of
> i_mutex to avoid needing to keep a hashed array of mutexes to
> synchronize an individual inode's being added or removed to the orphan
> list.
>
> The second is that you've split the orphan mutex into an on-disk mutex
> and a in-memory spinlock.
>
> Is it possible to split up your patch so we can measure the benefits
> of each of these two changes? More interestingly, is there a way we
> can use the your second change in concert with Jan's changes?
>
> Regards,
>
> - Ted
>

Thanks to Jan, who pointed out one optimization in ext4_orphan_add() that I had missed.

After integrating that into my patch, I've rerun the following aim7 workloads: alltests, custom, dbase, disk, fserver, new_fserver, shared and short. Here are the results.

On an 8 core (16 thread) machine, both my revised patch (with the additional optimization from Jan's orphan_add()) and version 3 of Jan's patch give about the same results for most of the workloads, except fserver and new_fserver, where Jan's outperforms mine by about 9% and 16%, respectively.

Here are the lock_stat output for disk,
Jan's patch,
lock_stat version 0.4
-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
class name con-bounces contentions waittime-min waittime-max waittime-total waittime-avg acq-bounces acquisitions holdtime-min holdtime-max holdtime-total holdtime-avg
-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
&sbi->s_orphan_lock: 80189 80246 3.94 464489.22 77219615.47 962.29 503289 809004 0.10 476537.44 3424587.77 4.23
Mine,
&sbi->s_orphan_lock: 82215 82259 3.61 640876.19 15098561.09 183.55 541422 794254 0.10 640368.86 4425140.61 5.57
&sbi->s_orphan_op_mutex[n]: 102507 104880 4.21 1335087.21 1392773487.19 13279.69 398328 840120 0.11 1334711.17 397596137.90 473.26

For new_fserver,
Jan's patch,
&sbi->s_orphan_lock: 1063059 1063369 5.57 1073325.95 59535205188.94 55987.34 4525570 8446052 0.10 75625.72 10700844.58 1.27
Mine,
&sbi->s_orphan_lock: 1171433 1172220 3.02 349678.21 553168029.92 471.90 5517262 8446052 0.09 254108.75 16504015.29 1.95
&sbi->s_orphan_op_mutex[n]: 2176760 2202674 3.44 633129.10 55206091750.06 25063.21 3259467 8452918 0.10 349687.82 605683982.34 71.65


On an 80 core (160 thread) machine, mine outperforms Jan's in alltests, custom, fserver, new_fserver and shared by about the same margin it did over the baseline, around 20%. For all these workloads, Jan's patch does not seem to show any noticeable improvement over the baseline kernel. I'm getting about the same performance with the rest of the workloads.

Here are the lock_stat output for alltests,
Jan's,
&sbi->s_orphan_lock: 2762871 2763355 4.46 49043.39 1763499587.40 638.17 5878253 6475844 0.15 20508.98 70827300.79 10.94
Mine,
&sbi->s_orphan_lock: 1171433 1172220 3.02 349678.21 553168029.92 471.90 5517262 8446052 0.09 254108.75 16504015.29 1.95
&sbi->s_orphan_op_mutex[n]: 783176 785840 4.95 30358.58 432279688.66 550.09 2899889 6505883 0.16 30254.12 1668330140.08 256.43

For custom,
Jan's,
&sbi->s_orphan_lock: 5706466 5707069 4.54 44063.38 3312864313.18 580.48 11942088 13175060 0.15 15944.34 142660367.51 10.83
Mine,
&sbi->s_orphan_lock: 5518186 5518558 4.84 32040.05 2436898419.22 441.58 12290996 13175234 0.17 23160.65 141234888.88 10.72
&sbi->s_orphan_op_mutex[n]: 1565216 1569333 4.50 32527.02 788215876.94 502.26 5894074 13196979 0.16 71073.57 3128766227.92 237.08

For dbase,
Jan's,
&sbi->s_orphan_lock: 14453 14489 5.84 39442.57 8678179.21 598.95 119847 153686 0.17 4390.25 1406816.03 9.15
Mine,
&sbi->s_orphan_lock: 13847 13868 6.23 31314.03 7982386.22 575.60 120332 153542 0.17 9354.86 1458061.28 9.50
&sbi->s_orphan_op_mutex[n]: 1700 1717 22.00 50566.24 1225749.82 713.89 85062 189435 0.16 31374.44 14476217.56 76.42

In case the line-wrapping makes it hard to read, I've also attached the results as a text file.

The lock_stat data seems to show that with my patch s_orphan_lock performs better across the board. But on a smaller machine, the hashed mutexes seem to offset the performance gain on s_orphan_lock; increasing the hashed mutex array size would likely make it perform better.

Jan, if you could send me your orphan stress test, I could run lock_stat for more performance comparison.

Ted, please let me know if there is anything else you'd like me to experiment with. If you'd like, I could also resubmit my revised patch for you to take a look at.



Thanks,
Mak.





Attachments:
lock_stat.txt (4.80 kB)

2014-06-03 08:52:10

by Jan Kara

Subject: Re: [PATCH 2/2] ext4: Reduce contention on s_orphan_lock

On Mon 02-06-14 11:45:32, Thavatchai Makphaibulchoke wrote:
> On 05/20/2014 07:57 AM, Theodore Ts'o wrote:
> > On Tue, May 20, 2014 at 02:33:23AM -0600, Thavatchai Makphaibulchoke wrote:
> >
> > Thavatchai, it would be really great if you could do lock_stat runs
> > with both Jan's latest patches as well as yours. We need to
> > understand where the differences are coming from.
> >
> > As I understand things, there are two differences between Jan and your
> > approaches. The first is that Jan is using the implicit locking of
> > i_mutex to avoid needing to keep a hashed array of mutexes to
> > synchronize an individual inode's being added or removed to the orphan
> > list.
> >
> > The second is that you've split the orphan mutex into an on-disk mutex
> > and a in-memory spinlock.
> >
> > Is it possible to split up your patch so we can measure the benefits
> > of each of these two changes? More interestingly, is there a way we
> > can use the your second change in concert with Jan's changes?
> >
> > Regards,
> >
> > - Ted
> >
>
> Thanks to Jan, as she pointed out one optimization in orphan_addr() that
> I've missed.
>
> After integrated that into my patch, I've rerun the following aim7
> workloads; alltests, custom, dbase, disk, fserver, new_fserver, shared
> and short. Here are the results.
>
> On an 8 core (16 thread) machine, both my revised patch (with additional
> optimization from Jan's oprhan_add()) and version 3 of Jan's patch give
> about the same results, for most of the workloads, except fserver and
> new_fserver, which Jan's outperforms about 9% and 16%, respectively.
>
> Here are the lock_stat output for disk,
> Jan's patch,
> lock_stat version 0.4
> -----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
> class name con-bounces contentions waittime-min waittime-max waittime-total waittime-avg acq-bounces acquisitions holdtime-min holdtime-max holdtime-total holdtime-avg
> -----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
> &sbi->s_orphan_lock: 80189 80246 3.94 464489.22 77219615.47 962.29 503289 809004 0.10 476537.44 3424587.77 4.23
> Mine,
> &sbi->s_orphan_lock: 82215 82259 3.61 640876.19 15098561.09 183.55 541422 794254 0.10 640368.86 4425140.61 5.57
> &sbi->s_orphan_op_mutex[n]: 102507 104880 4.21 1335087.21 1392773487.19 13279.69 398328 840120 0.11 1334711.17 397596137.90 473.26
>
> For new_fserver,
> Jan's patch,
> &sbi->s_orphan_lock: 1063059 1063369 5.57 1073325.95 59535205188.94 55987.34 4525570 8446052 0.10 75625.72 10700844.58 1.27
> Mine,
> &sbi->s_orphan_lock: 1171433 1172220 3.02 349678.21 553168029.92 471.90 5517262 8446052 0.09 254108.75 16504015.29 1.95
> &sbi->s_orphan_op_mutex[n]: 2176760 2202674 3.44 633129.10 55206091750.06 25063.21 3259467 8452918 0.10 349687.82 605683982.34 71.65
>
>
> On an 80 core (160 thread) machine, mine outpeforms Jan's in alltests,
> custom, fserver, new_fserver and shared about the same margin it did over
> the baseline, around 20% For all these workloads, Jan's patch does not
> seem to show any noticeable improvement over baseline kernel. I'm
> getting about the same performance with the rest of the workloads.
>
> Here are the lock_stat output for alltests,
> Jan;'s,
> &sbi->s_orphan_lock: 2762871 2763355 4.46 49043.39 1763499587.40 638.17 5878253 6475844 0.15 20508.98 70827300.79 10.94
> Mine,
> &sbi->s_orphan_lock: 1171433 1172220 3.02 349678.21 553168029.92 471.90 5517262 8446052 0.09 254108.75 16504015.29 1.95
> &sbi->s_orphan_op_mutex[n]: 783176 785840 4.95 30358.58 432279688.66 550.09 2899889 6505883 0.16 30254.12 1668330140.08 256.43
>
> For custom,
> Jan's,
> &sbi->s_orphan_lock: 5706466 5707069 4.54 44063.38 3312864313.18 580.48 11942088 13175060 0.15 15944.34 142660367.51 10.83
> Mine,
> &sbi->s_orphan_lock: 5518186 5518558 4.84 32040.05 2436898419.22 441.58 12290996 13175234 0.17 23160.65 141234888.88 10.72
> &sbi->s_orphan_op_mutex[n]: 1565216 1569333 4.50 32527.02 788215876.94 502.26 5894074 13196979 0.16 71073.57 3128766227.92 237.08
>
> For dbase,
> Jan's,
> &sbi->s_orphan_lock: 14453 14489 5.84 39442.57 8678179.21 598.95 119847 153686 0.17 4390.25 1406816.03 9.15
> Mine,
> &sbi->s_orphan_lock: 13847 13868 6.23 31314.03 7982386.22 575.60 120332 153542 0.17 9354.86 1458061.28 9.50
> &sbi->s_orphan_op_mutex[n]: 1700 1717 22.00 50566.24 1225749.82 713.89 85062 189435 0.16 31374.44 14476217.56 76.42
>
> In case the line-wrap making it hard to read, I've also attached the
> results as a text file.
>
> The lock_stat seems to show that with my patch the s_orphan_lock performs
> better across the board. But on a smaller machine, the hashed mutex
> seems to offset out the performance gain in the s_oprhan_lock and
> increase the hashed mutex size likely to make it perform better.
I'd interpret the data a bit differently :) With your patch the
contention for the resource - access to the orphan list - is split between
s_orphan_lock and s_orphan_op_mutex. For the smaller machine, contending
directly on s_orphan_lock is a win and we spend less time waiting in total.
For the large machine it seems beneficial to contend on the hashed mutex
first and only after that on the global lock. Likely that reduces the amount
of cacheline bouncing, or maybe the mutex is more often acquired during the
spinning phase, which reduces the acquisition latency.

> Jan, if you could send me your orphan stress test, I could run lock_stat
> for more performance comparison.
Sure, it is attached.

Honza
--
Jan Kara <[email protected]>
SUSE Labs, CR


Attachments:
(No filename) (7.24 kB)
stress-orphan.c (1.27 kB)
Subject: Re: [PATCH 2/2] ext4: Reduce contention on s_orphan_lock

On 06/03/2014 02:52 AM, Jan Kara wrote:
> I'd interpret the data a bit differently :) With your patch the
> contention for resource - access to orphan list - is split between
> s_orphan_lock and s_orphan_op_mutex. For the smaller machine contending
> directly on s_orphan_lock is a win and we spend less time waiting in total.
> For the large machine it seems beneficial to contend on the hashed mutex
> first and only after that on global lock. Likely that reduces amount of
> cacheline bouncing, or maybe the mutex is more often acquired during the
> spinning phase which reduces the acquisition latency.
>
> Sure, it is attached.
>
> Honza
>

Thanks Jan for the test program.

Anyway, I did modify the test a little so that we can also run multiple incarnations of the test simultaneously, that is, generate the orphan stress operations on multiple files. I have also attached the modified test, just in case.

These are the results that I got.

All values are real time in seconds, computed over ten runs with journaling disabled. "w/o" stands for without hashed mutexes, "with" for with hashed mutexes, "Proc" for the number of processes, and "Files" for the number of files.


With only 1 file,

On an 8 core (16 thread) platform,


Proc | 1 | 20 | 40 | 80 | 160 | 400 | 800
-----------------------------------------------------------------------------------------------------
| Avg | SD | Avg | SD | Avg | SD | Avg | SD | Avg | SD | Avg | SD | Avg | SD
-----------------------------------------------------------------------------------------------------
w/o |.7921|.0467|7.1342|.0316|12.4026|.3552|19.3930|.6917|22.7519|.7017|35.9374|1.658|66.7374|.4716
-----------------------------------------------------------------------------------------------------
with |.7819|.0362|6.3302|.2119|12.0933|.6589|18.7514|.9698|24.1351|1.659|38.6480|.6809|67.4810|.2749

On a 80 core (160 thread) platform,

Proc | 40 | 80 | 100 | 400 | 800 | 1600
-----------------------------------------------------------------------------------------------------
| Avg | SD | Avg | SD | Avg | SD | Avg | SD | Avg | SD | Avg | SD
-----------------------------------------------------------------------------------------------------
w/o |44.8532|3.4991|67.8351|1.7534| 73.0616|2.4733|252.5798|1.1949|485.3289|5.7320|952.8874|2.0911
-----------------------------------------------------------------------------------------------------
with |46.1134|3.3228|99.1550|1.4894|109.0272|1.3617|259.6128|2.5247|284.4386|4.6767|266.8664|7.7726

With only one file, we would expect the version without hashed mutexes to perform better than the one with them. The results do show this on the 80 core machine with 80 up to 400 processes. Surprisingly, there is no difference across all process ranges tested on the 8 core machine. Also, on the 80 core machine, with hashed mutexes the time seems to steady out in the high two hundreds with 400 or more processes, and it outperforms without significantly with 800 or more processes.


With multiple files and only 1 process per file,

On an 8 core (16 thread) platform,


Files | 40 | 80 | 150 | 400 | 800
-------------------------------------------------------------------------
| Avg | SD | Avg | SD | Avg | SD | Avg | SD | Avg | SD
--------------------------------------------------------------------------
w/o |3.3578|.0363|6.4533|.1204|12.1925|.2528|31.5862|.6016|63.9913|.3528
-------------------------------------------------------------------------
with |3.2583|.0384|6.3228|.1075|11.8328|.2290|30.5394|.3220|62.7672|.3802

On a 80 core (160 thread) platform,

Files| 40 | 80 | 100 | 200 | 400 | 800 | 1200 | 1600
-------------------------------------------------------------------------------------------------------------------
 | Avg | SD | Avg | SD | Avg | SD | Avg | SD | Avg | SD | Avg | SD | Avg | SD | Avg | SD |
-------------------------------------------------------------------------------------------------------------------
w/o |43.6507|2.9979|57.0404|1.8684|068.5557|1.2902|144.745|1.7939|52.7491|1.3585|487.8996|1.3997|715.1978|1.1224|942.5605|2.9629
-------------------------------------------------------------------------------------------------------------------
with|52.8003|2.1949|69.2455|1.2902|106.5026|1.8813|130.2995|7.8020|150.3648|3.4153|184.7233|6.0525|270.1533|3.2261|298.5705|3.1318

Again, there is not much difference on the 8 core machine. On the 80 core machine, without hashed mutexes performs better than with hashed mutexes for file counts from 40 up to somewhere below 200. With hashed mutexes outperforms without significantly with 400 or more files.

Overall there seems to be no performance difference on the 8 core machine. On the 80 core machine, the version with hashed mutexes, while performing worse than the one without in the lower ranges of both processes and files, seems to scale better with both higher process and file counts.

Again, the change with hashed mutexes does include the additional optimization in orphan_add() introduced by your patch. Please let me know if you need a copy of the modified patch with hashed mutexes for verification.

Thanks,
Mak.




Attachments:
stress_orphan.c (3.09 kB)

2014-06-17 09:29:37

by Jan Kara

Subject: Re: [PATCH 2/2] ext4: Reduce contention on s_orphan_lock

On Mon 16-06-14 13:20:32, Thavatchai Makphaibulchoke wrote:
> On 06/03/2014 02:52 AM, Jan Kara wrote:
> > I'd interpret the data a bit differently :) With your patch the
> > contention for resource - access to orphan list - is split between
> > s_orphan_lock and s_orphan_op_mutex. For the smaller machine contending
> > directly on s_orphan_lock is a win and we spend less time waiting in total.
> > For the large machine it seems beneficial to contend on the hashed mutex
> > first and only after that on global lock. Likely that reduces amount of
> > cacheline bouncing, or maybe the mutex is more often acquired during the
> > spinning phase which reduces the acquisition latency.
> >
> > Sure, it is attached.
> >
> > Honza
> >
>
> Thanks Jan for the test program.
>
> Anyway I did modify the test a little so that we could also actually run
> multiple incarnations of the test simultaneously, that is to generate the
> orphan stress operations on multiple files. I have also attached the
> modified test, just in case.
Hum, looking at your test program source I'm not sure what you mean.
Your program first forks 'niterations' times and each process starts using a
different directory. Then each of these processes forks 'procs' times and
each of these processes will use a different file in the directory
belonging to the parent. So what's the difference from just running
'niterations' * 'procs' processes? After some thought I guess the
difference is in how the time to run on each individual file contributes to
the total average -
(\sum_{i=1}^{procs} t_i)/procs
in the first case, where t_i is the time to run the test for file i, and
(\max_{i=1}^{procs} t_i)
in the second case. But what's the point?

> These are the results that I got.
>
> All values are real time in seconds, computed over ten runs with
> journaling disabled. "w/o" stand for without hashed mutexes and "with"
> for with mutexes, "Proc" number of processes, and "Files" number of
> files.
What exactly do you mean by 'journaling disabled'? Did you run ext4 in
nojournal mode? That wouldn't really make sense because in nojournal mode
all orphan list operations are skipped... So what did you really test?

> With only 1 file,
>
> On an 8 core (16 thread) platform,
>
>
> Proc | 1 | 20 | 40 | 80 | 160 | 400 | 800
> -----------------------------------------------------------------------------------------------------
> | Avg | SD | Avg | SD | Avg | SD | Avg | SD | Avg | SD | Avg | SD | Avg | SD
> -----------------------------------------------------------------------------------------------------
> w/o |.7921|.0467|7.1342|.0316|12.4026|.3552|19.3930|.6917|22.7519|.7017|35.9374|1.658|66.7374|.4716
> -----------------------------------------------------------------------------------------------------
> with |.7819|.0362|6.3302|.2119|12.0933|.6589|18.7514|.9698|24.1351|1.659|38.6480|.6809|67.4810|.2749
>
> On a 80 core (160 thread) platform,
>
> Proc | 40 | 80 | 100 | 400 | 800 | 1600
> -----------------------------------------------------------------------------------------------------
> | Avg | SD | Avg | SD | Avg | SD | Avg | SD | Avg | SD | Avg | SD
> -----------------------------------------------------------------------------------------------------
> w/o |44.8532|3.4991|67.8351|1.7534| 73.0616|2.4733|252.5798|1.1949|485.3289|5.7320|952.8874|2.0911
> -----------------------------------------------------------------------------------------------------
> with |46.1134|3.3228|99.1550|1.4894|109.0272|1.3617|259.6128|2.5247|284.4386|4.6767|266.8664|7.7726
>
> With one file, we would expect without hashed mutexes would perform
> better than with. The results do show so on 80 core machine with 80 up
> to 400 processes. Surprisingly there is no differences across all
> process ranges tested on 8 core. Also on 80 core, with hashed mutexes
> the time seems to be steadying out at around high two hundred something
> with 400 or more processes and outperform without significantly with 800
> or more processes.
>
> With multiple files and only 1 process per file,
>
> On an 8 core (16 thread) platform,
>
>
> Proc | 40 | 80 | 150 | 400 | 800
> -------------------------------------------------------------------------
> | Avg | SD | Avg | SD | Avg | SD | Avg | SD | Avg | SD
> --------------------------------------------------------------------------
> w/o |3.3578|.0363|6.4533|.1204|12.1925|.2528|31.5862|.6016|63.9913|.3528
> -------------------------------------------------------------------------
> with |3.2583|.0384|6.3228|.1075|11.8328|.2290|30.5394|.3220|62.7672|.3802
>
> On a 80 core (160 thread) platform,
>
> Proc| 40 | 80 | 100 | 200 | 400 | 800 | 1200 | 1600
> ------------------------------------------------------------------------------------------------------------------
> | Avg | SD | Avg | SD | Avg | SD | Avg | SD | Avg | SD | Avg | SD | Avg | SD |
> -------------------------------------------------------------------------------------------------------------------
> w/o_|43.6507|2.9979|57.0404|1.8684|068.5557|1.2902|144.745|1.7939|52.7491|1.3585|487.8996|1.3997|715.1978|1.1224|942.5605|2.9629
^^^^
this number is strange
> -------------------------------------------------------------------------------------------------------------------
> with|52.8003|2.1949|69.2455|1.2902|106.5026|1.8813|130.2995|7.8020150.3648|3.4153|184.7233|6.0525|270.1533|3.2261|298.5705|3.1318
>
> Again, there is not much difference on 8 core. On 80 core, without
> hashed mutexes performs better than with hashed mutexes with the number
> of files between 40 to less than 200. With hashed mutexes outperorm
> without significantly with 400 or more files.
>
> Overall there seems to be no performance difference on 8 core. On 80
> core with hashed mutexes, while performs worse than with in the lower
> ranges of both processes and files, seems to be scaling better with both
> higher processes and files.
Your numbers are interesting and seem to confirm that with really high
contention it is advantageous to contend on smaller locks first (your
hashed mutexes) and only after that on the global lock. But I'd like to
hear answers to my previous questions before drawing any conclusions...

> Again, the change with hashed mutexes does include the additional
> optimization in orphan_add() introduced by your patch. Please let me
> know if you need a copy of the modified patch with hashed mutexes for
> verification.


Honza
--
Jan Kara <[email protected]>
SUSE Labs, CR

Subject: Re: [PATCH 2/2] ext4: Reduce contention on s_orphan_lock

On 06/17/2014 03:29 AM, Jan Kara wrote:
> Hum, looking at your test program source I'm not sure what do you mean.
> Your program first forks 'niteration' times and each process starts using a
> different directory. Then each of these processes forks 'procs' times and
> each of these processes will use a different file in the directory
> belonging to the parent. So what's the difference to just running
> 'niterations' * 'procs' processes? After some thought I guess the
> difference is in how time to run on each individual file contributes to the
> total average -
> (\sum_{i=1}^{procs} t_i)/procs
> in the first case you ran, where t_i is the time to run test for file i, and
> (max^{i=1}^{procs} t_i)
> in the second case. But what's the point?
>

The original test program generates orphan traffic on only a single file, which not only does not seem to be a fair comparison for the WM (with hashed mutexes) algorithm, but also does not seem to represent a realistic load. The modified test's intention is to compare the performance of WM and WO (without hashed mutexes) under heavy orphan traffic on more than a single file.

Instead of running multiple copies of the original test in the background (for a short job we may not actually get overlapping traffic on different inodes), the modified test runs multiple copies of the original test and starts their execution as close to each other as possible.

The results of my test are the average, over ten runs, of the maximum time the original test takes to complete for a given number of processes and files. Of course, with one copy (niterations of 1), we get the equivalent of the original test.

> What do you exactly mean by 'journaling disabled'? Did you run ext4 in
> nojournal mode? That wouldn't really make sense because in nojournal mode
> all orphan list operations are skipped... So what did you really test?
>

Yes, sorry, my fault: disabling journaling should also inhibit the orphan activities. Therefore there should not be any difference in performance between the two.

I just discovered that the two branches I used do not have the same baseline. Let me recreate the two branches and redo my tests. I'll get back to you with the results.

Sorry for the confusion.

Thanks,
Mak.


> Your numbers are interesting and seem to confirm that with really high
> contention it is advantageous to contend on smaller locks first (your
> hashed mutexes) and only after that on the global lock. But I'd like to
> hear answers to my previous questions before drawing any conclusions...
>
>
>
> Honza
>



2014-06-18 10:37:44

by Jan Kara

Subject: Re: [PATCH 2/2] ext4: Reduce contention on s_orphan_lock

On Tue 17-06-14 22:38:32, Thavatchai Makphaibulchoke wrote:
> On 06/17/2014 03:29 AM, Jan Kara wrote:
> > Hum, looking at your test program source I'm not sure what do you mean.
> > Your program first forks 'niteration' times and each process starts using a
> > different directory. Then each of these processes forks 'procs' times and
> > each of these processes will use a different file in the directory
> > belonging to the parent. So what's the difference to just running
> > 'niterations' * 'procs' processes? After some thought I guess the
> > difference is in how time to run on each individual file contributes to the
> > total average -
> > (\sum_{i=1}^{procs} t_i)/procs
> > in the first case you ran, where t_i is the time to run test for file i, and
> > (max^{i=1}^{procs} t_i)
> > in the second case. But what's the point?
> >
>
> The original test program generates orphan traffic on only a single file,
> which not only does not seem to be a fair comparison for the WM (with
> hashed mutexes) algorithm, does not seem to represent realistic load.
> The modified test's intention is to compare performance between the WM
> and WO (without hashed mutexes) under heavy orphan traffic on more than a
> single file.
That's not true. The original test program creates one file per process:
void run_test(char *base, int count)
{
...
sprintf(pbuf, "%s/file-%d", base, count);
fd = open(pbuf, O_CREAT | O_TRUNC | O_WRONLY, 0644);
...
and forks a given number of processes. Ah, maybe I see what you are getting
at. My original test program doesn't bother with synchronizing the start of
the processes - I have manually set the number of repetitions (COUNT) so that
each process runs long enough on my machine that the differences in start
time are negligible. Maybe you thought these processes run sequentially (and
maybe they really did if you had a fast enough setup). So I agree that
synchronizing the start using shared memory as you did is a good thing
and gets more reliable numbers, especially for high numbers of processes
(which I didn't really test because of my limited HW); it's just that there's
no need to add another level of iteration...

> Instead of running multiple copies of the original test in the background
> (for a short job we may not actually get overlapping traffic on different
> inodes), the modified test runs and start the excution, as close as
> possible to each other, of multiple copies of the original test.
>
> The results of my test is the average, over ten runs, of the maximum time
> the original test takes to complete under certain number of processes and
> files. Of course, with one copy (niteration of 1), we get the
> equivalence of the original test.
>
> > What do you exactly mean by 'journaling disabled'? Did you run ext4 in
> > nojournal mode? That wouldn't really make sense because in nojournal mode
> > all orphan list operations are skipped... So what did you really test?
> >
>
> Yes, sorry my fault disabling journaling, should also inhibit the orphan
> activities. Therefore there should not be any difference in performance
> between the two.
Exactly, it's strange you saw a difference.

> I just discovered that the two different branches I used do not have the
> same baseline. Let me recreate the two branches and redo my test. I'll
> get back with you with the results.
OK, looking forward to the results.

Honza

--
Jan Kara <[email protected]>
SUSE Labs, CR

Subject: Re: [PATCH 2/2] ext4: Reduce contention on s_orphan_lock

On 06/18/2014 04:37 AM, Jan Kara wrote:
> That's not true. The original test program creates one file per process:
> void run_test(char *base, int count)
> {
> ...
> sprintf(pbuf, "%s/file-%d", base, count);
> fd = open(pbuf, O_CREAT | O_TRUNC | O_WRONLY, 0644);
> ...
> and forks given number of processes. Ah, maybe I see what you are getting
> at. My original test program doesn't bother with synchronizing start of the
> processes - I have manually set number of repetitions (COUNT) so that each
> process runs long enough on my machine that the differences in start time
> are negligible. Maybe you thought these processes run sequentially (and
> maybe they really did if you had fast enough setup). So I agree
> synchronization of start using shared memory as you did it is a good thing
> and gets more reliable numbers especially for high numbers of processes
> (which I didn't really test because of my limited HW) just there's no need
> to add another level of iteration...
>
First, sorry for taking a little while to get back to you.

Your original main is now my new run_main() function. The old main and the new run_main() function do practically the same thing: fork the number of child processes indicated by argv[1], each of which immediately runs the original run_test() function. By converting it into a run_main() function, I'm trying to start each incarnation of the old main as close to the others as possible, making sure I have some overlapping orphan activity on different files. Actually, I should have done the synchronization in the run_test() function instead; that way both tests would be closer to each other.
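
For reference, this kind of start synchronization through shared memory
looks roughly like the sketch below (illustrative only; run_test() and the
exact details are placeholders, not necessarily the actual modified test):

#define _GNU_SOURCE
#include <stdio.h>
#include <stdlib.h>
#include <sys/mman.h>
#include <sys/wait.h>
#include <unistd.h>

int main(int argc, char **argv)
{
	int nprocs = argc > 1 ? atoi(argv[1]) : 4;
	int i;
	/* One int in anonymous shared memory, visible to all children. */
	volatile int *go = mmap(NULL, sizeof(int), PROT_READ | PROT_WRITE,
				MAP_SHARED | MAP_ANONYMOUS, -1, 0);

	if (go == MAP_FAILED) {
		perror("mmap");
		return 1;
	}
	*go = 0;
	for (i = 0; i < nprocs; i++) {
		if (fork() == 0) {
			while (!*go)	/* spin until the parent releases us */
				;
			/* run_test(dir, i);  <- per-file orphan stress loop */
			_exit(0);
		}
	}
	*go = 1;	/* release all workers at (nearly) the same time */
	for (i = 0; i < nprocs; i++)
		wait(NULL);
	return 0;
}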

> Exactly, it's strange you saw a difference.
>
>> I just discovered that the two different branches I used do not have the
>> same baseline. Let me recreate the two branches and redo my test. I'll
>> get back with you with the results.
> OK, looking forward to the results.
>
> Honza
>

Here are the results with the 3.15-rc5 baseline kernel, using your original test. Each result is an average over ten runs, except for the longer runs, marked with an '*', which are an average of only 5 runs.

On a 2 socket 16 thread machine,

----------------------------------------------
|Num files  |     40|    100|    400|    1000|
----------------------------------------------
|W/O Mutexes|118.593|178.118|595.017|1480.165|
|W Mutexes  |129.212|194.728|620.412|1430.908|
----------------------------------------------

On an 8 socket 160 thread machine,

--------------------------------------------------------
|Num files  |     40|    100|     400|   1000*|   1500*|
--------------------------------------------------------
|W/O Mutexes|387.257|644.270|1692.018|4254.766|7362.064|
|W Mutexes  |328.490|557.508|1967.676|4721.060|7013.618|
--------------------------------------------------------

From the above data, it looks like the without-mutexes (WOM) version performs better across the board on the smaller machine, except at 1000 files. For the larger machine, I give the edge to the with-mutexes (WM) version for smaller numbers of files, until around 400 to 100-something files. It looks like WOM starts performing better again at 1500 files.

I've noticed that your test seems to generate a high volume of both add and delete orphan operations for inodes that are already on, or already off, the orphan chain, favoring WM. I've also done some data collection to verify that. It turns out that about half of both the delete and the add orphan operations do indeed fall into that category.
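
For reference, collecting those counts takes no more than something like the fragment below (illustrative only, not the exact instrumentation I used) near the top of ext4_orphan_add(), plus a matching pair of counters in ext4_orphan_del():

/* Illustrative debug counters, dumped later via printk or debugfs. */
static atomic64_t orphan_add_calls = ATOMIC64_INIT(0);
static atomic64_t orphan_add_noops = ATOMIC64_INIT(0);

        /* In ext4_orphan_add(): count adds for inodes already on the list */
        atomic64_inc(&orphan_add_calls);
        if (!list_empty(&EXT4_I(inode)->i_orphan))
                atomic64_inc(&orphan_add_noops);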

I did a further experiment by changing the decrement in the ftruncate() loop in the run_test() function from 1 to a variable ('ndec') configurable from the command line, as shown here:

        /* fd, wbuf, COUNT and the decrement 'ndec' come from run_test() */
        for (i = 0; i < COUNT; i++) {
                /* rewrite the file back to 4k ... */
                if (pwrite(fd, wbuf, 4096, 0) != 4096) {
                        perror("pwrite");
                        exit(1);
                }

                /* ... then shrink it by 'ndec' bytes per ftruncate() call */
                for (j = 4096 - ndec; j >= 1; j -= ndec) {
                        if (ftruncate(fd, j) < 0) {
                                perror("ftruncate");
                                exit(1);
                        }
                }
        }

Here are more results with different decrement values.

On a 2 socket 16 thread machine,
with decrement of 32,

--------------------------------------------------------------
|Num files  |   40|  100|   400|  1000|  1500|   2000|   2500|
--------------------------------------------------------------
|W/O Mutexes|4.077|5.928|19.005|46.056|95.767|514.050|621.227|
|W Mutexes  |4.441|6.354|19.837|44.771|63.068| 85.997|284.084|
--------------------------------------------------------------

with decrement of 256,

-----------------------------------------------------------------
|Num files  |   40|  100|  400| 1000|  1500|  2000|  2500|  3000|
-----------------------------------------------------------------
|W/O Mutexes|0.482|0.708|2.486|6.072|11.111|54.994|69.695|81.657|
|W Mutexes  |0.540|0.823|2.449|5.650| 8.146|11.179|26.453|34.705|
-----------------------------------------------------------------

with decrement of 4095,

-------------------------------------------------------------------------
|Num files  |   40|  100|  400| 1000| 1500| 2000| 2500| 3000| 4000| 5000|
-------------------------------------------------------------------------
|W/O Mutexes|0.063|0.010|0.235|0.479|0.665|1.701|2.620|3.812|4.986|7.047|
|W Mutexes  |0.064|0.011|0.217|0.462|0.682|0.886|1.374|2.091|3.602|4.650|
-------------------------------------------------------------------------

On an 8 socket 160 thread machine,
with decrement of 32,

--------------------------------------------------------------------
|Num files  |    40|   100|   400|   1000|   1500|    2000|   2500*|
--------------------------------------------------------------------
|W/O Mutexes|15.461|21.501|54.009|131.591|219.795|2375.105|6401.232|
|W Mutexes  |13.376|18.557|61.573|146.369|217.394| 375.293|2626.881|
--------------------------------------------------------------------

with decrement of 256,

-------------------------------------------------------------
|Num files  |   40|  100|  400|  1000|  1500|   2000|   2500|
-------------------------------------------------------------
|W/O Mutexes|1.903|2.709|7.212|16.573|26.634|260.666|745.265|
|W Mutexes  |1.643|2.319|7.429|17.641|26.059| 45.333|302.096|
-------------------------------------------------------------

with decrement of 4095,

---------------------------------------------------------
|Num files  |   40|  100|  400| 1000| 1500|  2000|  2500|
---------------------------------------------------------
|W/O Mutexes|0.160|0.231|0.746|1.640|2.307|13.586|49.986|
|W Mutexes  |0.134|0.192|0.579|1.306|1.931| 3.332|15.481|
---------------------------------------------------------

It looks like the results are similar to those with a decrement of 1, with the exception that on the smaller machine, with higher decrement values, WM and WOM perform closer to each other for small to medium numbers of files. WM does show better scaling for higher numbers of files on both the small and the large core count machines.

Here are also aim7 results on an 8 socket 160 thread machine, at 2000 users. The numbers are the % change in average jobs per minute of WM compared to WOM:

-----------------------
|Workload   |WM vs WOM|
-----------------------
|all_tests  |  +42.02%|
|custom     |  +56.22%|
|dbase      |       0%|
|disk       |       0%|
|fserver    |  +51.23%|
|new_dbase  |       0%|
|new_fserver|  +53.83%|
|shared     |  +44.37%|
-----------------------

Please let me know if you need any additional info or have any questions, and also if you would like a copy of the patch for testing.

Thanks,
Mak.


2014-07-23 08:15:44

by Jan Kara

[permalink] [raw]
Subject: Re: [PATCH 2/2] ext4: Reduce contention on s_orphan_lock

Hello,

On Mon 21-07-14 22:35:24, Thavatchai Makphaibulchoke wrote:
> On 06/18/2014 04:37 AM, Jan Kara wrote:
> > That's not true. The original test program creates one file per process:
> > void run_test(char *base, int count)
> > {
> > ...
> > sprintf(pbuf, "%s/file-%d", base, count);
> > fd = open(pbuf, O_CREAT | O_TRUNC | O_WRONLY, 0644);
> > ...
> > and forks given number of processes. Ah, maybe I see what you are getting
> > at. My original test program doesn't bother with synchronizing start of the
> > processes - I have manually set number of repetitions (COUNT) so that each
> > process runs long enough on my machine that the differences in start time
> > are negligible. Maybe you thought these processes run sequentially (and
> > maybe they really did if you had fast enough setup). So I agree
> > synchronization of start using shared memory as you did it is a good thing
> > and gets more reliable numbers especially for high numbers of processes
> > (which I didn't really test because of my limited HW) just there's no need
> > to add another level of iteration...
> >
> First, sorry for taking a while to get back.
>
> Your original main is now my new run_main() function. The old main and
> the new run_main() function do practically the same thing: fork all the
> child processes, as indicated by argv[1], and immediately run the
> original run_test() function. By converting it to a run_main() function,
> I'm trying to start each incarnation of the old main as close to the
> others as possible, making sure I have some overlapping orphan activity
> on different files. Actually, I should have done the synchronization in
> the run_test() function instead; that way both tests would be closer to
> each other.
>
> Here are the results with the 3.15-rc5 baseline kernel, using your
> original test. Each result is an average over ten runs, except for the
> longer runs, marked with an '*', which are averages of only 5 runs.
>
> On a 2 socket 16 thread machine,
>
> ----------------------------------
> Num files | 40 | 100 | 400 | 1000 |
> ----------------------------------------------
> |W/O Mutexes|118.593|178.118|595.017|1480.165|
> -----------------------------------------------
> |W Mutexes |129.212|194.728|620.412|1430.908|
> ----------------------------------------------
>
> On an 8 socket 160 thread machine,
>
> --------------------------------------------
> Num files | 40 | 100 | 400 | 1000* | 1500* |
> -------------------------------------------------------
> |W/O Mutexes|387.257|644.270|1692.018|4254.766|7362.064|
> -------------------------------------------------------
> |W Mutexes |328.490|557.508|1967.676|4721.060|7013.618|
> --------------------------------------------------------
>
> From the above data, looks like without mutexes (WOM) performs better
> across the board on a smaller machine, except at 1000 files. For a
> larger machine, I give the edge to with mutexes (WM) for smaller number
> of files, until around 400 to 100 something files. Looks like with WOM
> starts performing better again at 1500 files.
>

<snip results for larger truncates as those don't seem to show anything
new>

> Here are also aim7 results on an 8 socket 160 threads machine, at 2000
> users,
>
> | % of Average job per minutes of WM compared to WOM
> ----------------------------------------------------------------
> all_tests | +42.02%
> ----------------------------------------------------------------
> custom | +56.22%
> ----------------------------------------------------------------
> dbase | 0%
> ----------------------------------------------------------------
> disk | 0%
> ----------------------------------------------------------------
> fserver | +51.23%
> ----------------------------------------------------------------
> new_dbase | 0%
> ----------------------------------------------------------------
> new_fserver|+53.83%
> ----------------------------------------------------------------
> shared |+44.37%
> ----------------------------------------------------------------
>
> Please let me know if you have need any additional info or have any
> question and also if you would like a copy of the patch for testing.
Thanks for all the measurements! I don't have further questions, I think.
All your tests seem to point to the fact that when contention on the orphan
list is high, your hashed mutexes improve performance (especially when we
are bouncing cache between sockets) while they have some cost for low
contention cases. I'm not sure it is a worthwhile tradeoff, but it's up to
Ted as a maintainer to decide.
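
For the archives, by hashed mutexes I mean a scheme roughly along the lines
of the sketch below (an illustration of the idea only, the actual patch may
differ in the details), where the single global lock is replaced by an
array of mutexes indexed by a hash of the inode number, so that operations
on inodes hashing to different buckets no longer contend with each other:

/* Illustration only, not the actual patch. */
#define ORPHAN_LOCK_BITS        6       /* 64 locks; the count is a tunable */

/* mutex_init() each of these at fill_super time */
static struct mutex orphan_locks[1 << ORPHAN_LOCK_BITS];

static inline struct mutex *orphan_lock(struct inode *inode)
{
        return &orphan_locks[hash_long(inode->i_ino, ORPHAN_LOCK_BITS)];
}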

Honza
--
Jan Kara <[email protected]>
SUSE Labs, CR