2017-08-20 11:54:59

by Wang Shilong

[permalink] [raw]
Subject: [PATCH 3/3 v5] ext4: reduce lock contention in __ext4_new_inode

From: Wang Shilong <[email protected]>

While running number of creating file threads concurrently,
we found heavy lock contention on group spinlock:

FUNC TOTAL_TIME(us) COUNT AVG(us)
ext4_create 1707443399 1440000 1185.72
_raw_spin_lock 1317641501 180899929 7.28
jbd2__journal_start 287821030 1453950 197.96
jbd2_journal_get_write_access 33441470 73077185 0.46
ext4_add_nondir 29435963 1440000 20.44
ext4_add_entry 26015166 1440049 18.07
ext4_dx_add_entry 25729337 1432814 17.96
ext4_mark_inode_dirty 12302433 5774407 2.13

most of cpu time blames to _raw_spin_lock, here is some testing
numbers with/without patch.

Test environment:
Server : SuperMicro Sever (2 x E5-2690 [email protected], 128GB 2133MHz
DDR4 Memory, 8GbFC)
Storage : 2 x RAID1 (DDN SFA7700X, 4 x Toshiba PX02SMU020 200GB
Read Intensive SSD)

format command:
mkfs.ext4 -J size=4096

test command:
mpirun -np 48 mdtest -n 30000 -d /ext4/mdtest.out -F -C \
-r -i 1 -v -p 10 -u #first run to load inode

mpirun -np 48 mdtest -n 30000 -d /ext4/mdtest.out -F -C \
-r -i 3 -v -p 10 -u

Kernel version: 4.13.0-rc3

Test 1,440,000 files with 48 directories by 48 processes:

Without patch:

File Creation File removal
79,033 289,569 ops/per second
81,463 285,359
79,875 288,475

With patch:
File Creation File removal
810669 301694
812805 302711
813965 297670

Creation performance is improved more than 10X with large
journal size. The main problem here is we test bitmap
and do some check and journal operations which could be
slept, then we test and set with lock hold, this could
be racy, and make 'inode' steal by other process.

However, after first try, we could confirm handle has
been started and inode bitmap journaled too, then
we could find and set bit with lock hold directly, this
will mostly gurateee success with second try.

Tested-by: Shuichi Ihara <[email protected]>
Signed-off-by: Wang Shilong <[email protected]>
---
v4->v5: better refactor codes as Jan Kara suggested.
---
fs/ext4/ialloc.c | 50 +++++++++++++++++++++++++++++++++++++++-----------
1 file changed, 39 insertions(+), 11 deletions(-)

diff --git a/fs/ext4/ialloc.c b/fs/ext4/ialloc.c
index 9e6eb1c..fb83a36 100644
--- a/fs/ext4/ialloc.c
+++ b/fs/ext4/ialloc.c
@@ -730,6 +730,27 @@ static int recently_deleted(struct super_block *sb, ext4_group_t group, int ino)
return ret;
}

+static int find_inode_bit(struct super_block *sb, ext4_group_t group,
+ struct buffer_head *bitmap, unsigned long *ino)
+{
+next:
+ *ino = ext4_find_next_zero_bit((unsigned long *)
+ bitmap->b_data,
+ EXT4_INODES_PER_GROUP(sb), *ino);
+ if (*ino >= EXT4_INODES_PER_GROUP(sb))
+ return 0;
+
+ if ((EXT4_SB(sb)->s_journal == NULL) &&
+ recently_deleted(sb, group, *ino)) {
+ *ino = *ino + 1;
+ if (*ino < EXT4_INODES_PER_GROUP(sb))
+ goto next;
+ return 0;
+ }
+
+ return 1;
+}
+
/*
* There are two policies for allocating an inode. If the new inode is
* a directory, then a forward search is made for a block group with both
@@ -910,21 +931,16 @@ struct inode *__ext4_new_inode(handle_t *handle, struct inode *dir,
}

repeat_in_this_group:
- ino = ext4_find_next_zero_bit((unsigned long *)
- inode_bitmap_bh->b_data,
- EXT4_INODES_PER_GROUP(sb), ino);
- if (ino >= EXT4_INODES_PER_GROUP(sb))
+ ret2 = find_inode_bit(sb, group, inode_bitmap_bh, &ino);
+ if (!ret2)
goto next_group;
- if (group == 0 && (ino+1) < EXT4_FIRST_INO(sb)) {
+
+ if (group == 0 && (ino + 1) < EXT4_FIRST_INO(sb)) {
ext4_error(sb, "reserved inode found cleared - "
"inode=%lu", ino + 1);
goto next_group;
}
- if ((EXT4_SB(sb)->s_journal == NULL) &&
- recently_deleted(sb, group, ino)) {
- ino++;
- goto next_inode;
- }
+
if (!handle) {
BUG_ON(nblocks <= 0);
handle = __ext4_journal_start_sb(dir->i_sb, line_no,
@@ -944,11 +960,23 @@ struct inode *__ext4_new_inode(handle_t *handle, struct inode *dir,
}
ext4_lock_group(sb, group);
ret2 = ext4_test_and_set_bit(ino, inode_bitmap_bh->b_data);
+ if (ret2) {
+ /* Someone already took the bit. Repeat the search
+ * with lock held.
+ */
+ ret2 = find_inode_bit(sb, group, inode_bitmap_bh, &ino);
+ if (ret2) {
+ ext4_set_bit(ino, inode_bitmap_bh->b_data);
+ ret2 = 0;
+ } else {
+ ret2 = 1; /* we didn't grab the inode */
+ }
+ }
ext4_unlock_group(sb, group);
ino++; /* the inode bitmap is zero-based */
if (!ret2)
goto got; /* we grabbed the inode! */
-next_inode:
+
if (ino < EXT4_INODES_PER_GROUP(sb))
goto repeat_in_this_group;
next_group:
--
2.9.3


2017-08-21 08:02:23

by Jan Kara

[permalink] [raw]
Subject: Re: [PATCH 3/3 v5] ext4: reduce lock contention in __ext4_new_inode

On Sun 20-08-17 19:54:37, Wang Shilong wrote:
> From: Wang Shilong <[email protected]>
>
> While running number of creating file threads concurrently,
> we found heavy lock contention on group spinlock:
>
> FUNC TOTAL_TIME(us) COUNT AVG(us)
> ext4_create 1707443399 1440000 1185.72
> _raw_spin_lock 1317641501 180899929 7.28
> jbd2__journal_start 287821030 1453950 197.96
> jbd2_journal_get_write_access 33441470 73077185 0.46
> ext4_add_nondir 29435963 1440000 20.44
> ext4_add_entry 26015166 1440049 18.07
> ext4_dx_add_entry 25729337 1432814 17.96
> ext4_mark_inode_dirty 12302433 5774407 2.13
>
> most of cpu time blames to _raw_spin_lock, here is some testing
> numbers with/without patch.
>
> Test environment:
> Server : SuperMicro Sever (2 x E5-2690 [email protected], 128GB 2133MHz
> DDR4 Memory, 8GbFC)
> Storage : 2 x RAID1 (DDN SFA7700X, 4 x Toshiba PX02SMU020 200GB
> Read Intensive SSD)
>
> format command:
> mkfs.ext4 -J size=4096
>
> test command:
> mpirun -np 48 mdtest -n 30000 -d /ext4/mdtest.out -F -C \
> -r -i 1 -v -p 10 -u #first run to load inode
>
> mpirun -np 48 mdtest -n 30000 -d /ext4/mdtest.out -F -C \
> -r -i 3 -v -p 10 -u
>
> Kernel version: 4.13.0-rc3
>
> Test 1,440,000 files with 48 directories by 48 processes:
>
> Without patch:
>
> File Creation File removal
> 79,033 289,569 ops/per second
> 81,463 285,359
> 79,875 288,475
>
> With patch:
> File Creation File removal
> 810669 301694
> 812805 302711
> 813965 297670
>
> Creation performance is improved more than 10X with large
> journal size. The main problem here is we test bitmap
> and do some check and journal operations which could be
> slept, then we test and set with lock hold, this could
> be racy, and make 'inode' steal by other process.
>
> However, after first try, we could confirm handle has
> been started and inode bitmap journaled too, then
> we could find and set bit with lock hold directly, this
> will mostly gurateee success with second try.
>
> Tested-by: Shuichi Ihara <[email protected]>
> Signed-off-by: Wang Shilong <[email protected]>

I like the result! You can add:

Reviewed-by: Jan Kara <[email protected]>

Honza

> ---
> v4->v5: better refactor codes as Jan Kara suggested.
> ---
> fs/ext4/ialloc.c | 50 +++++++++++++++++++++++++++++++++++++++-----------
> 1 file changed, 39 insertions(+), 11 deletions(-)
>
> diff --git a/fs/ext4/ialloc.c b/fs/ext4/ialloc.c
> index 9e6eb1c..fb83a36 100644
> --- a/fs/ext4/ialloc.c
> +++ b/fs/ext4/ialloc.c
> @@ -730,6 +730,27 @@ static int recently_deleted(struct super_block *sb, ext4_group_t group, int ino)
> return ret;
> }
>
> +static int find_inode_bit(struct super_block *sb, ext4_group_t group,
> + struct buffer_head *bitmap, unsigned long *ino)
> +{
> +next:
> + *ino = ext4_find_next_zero_bit((unsigned long *)
> + bitmap->b_data,
> + EXT4_INODES_PER_GROUP(sb), *ino);
> + if (*ino >= EXT4_INODES_PER_GROUP(sb))
> + return 0;
> +
> + if ((EXT4_SB(sb)->s_journal == NULL) &&
> + recently_deleted(sb, group, *ino)) {
> + *ino = *ino + 1;
> + if (*ino < EXT4_INODES_PER_GROUP(sb))
> + goto next;
> + return 0;
> + }
> +
> + return 1;
> +}
> +
> /*
> * There are two policies for allocating an inode. If the new inode is
> * a directory, then a forward search is made for a block group with both
> @@ -910,21 +931,16 @@ struct inode *__ext4_new_inode(handle_t *handle, struct inode *dir,
> }
>
> repeat_in_this_group:
> - ino = ext4_find_next_zero_bit((unsigned long *)
> - inode_bitmap_bh->b_data,
> - EXT4_INODES_PER_GROUP(sb), ino);
> - if (ino >= EXT4_INODES_PER_GROUP(sb))
> + ret2 = find_inode_bit(sb, group, inode_bitmap_bh, &ino);
> + if (!ret2)
> goto next_group;
> - if (group == 0 && (ino+1) < EXT4_FIRST_INO(sb)) {
> +
> + if (group == 0 && (ino + 1) < EXT4_FIRST_INO(sb)) {
> ext4_error(sb, "reserved inode found cleared - "
> "inode=%lu", ino + 1);
> goto next_group;
> }
> - if ((EXT4_SB(sb)->s_journal == NULL) &&
> - recently_deleted(sb, group, ino)) {
> - ino++;
> - goto next_inode;
> - }
> +
> if (!handle) {
> BUG_ON(nblocks <= 0);
> handle = __ext4_journal_start_sb(dir->i_sb, line_no,
> @@ -944,11 +960,23 @@ struct inode *__ext4_new_inode(handle_t *handle, struct inode *dir,
> }
> ext4_lock_group(sb, group);
> ret2 = ext4_test_and_set_bit(ino, inode_bitmap_bh->b_data);
> + if (ret2) {
> + /* Someone already took the bit. Repeat the search
> + * with lock held.
> + */
> + ret2 = find_inode_bit(sb, group, inode_bitmap_bh, &ino);
> + if (ret2) {
> + ext4_set_bit(ino, inode_bitmap_bh->b_data);
> + ret2 = 0;
> + } else {
> + ret2 = 1; /* we didn't grab the inode */
> + }
> + }
> ext4_unlock_group(sb, group);
> ino++; /* the inode bitmap is zero-based */
> if (!ret2)
> goto got; /* we grabbed the inode! */
> -next_inode:
> +
> if (ino < EXT4_INODES_PER_GROUP(sb))
> goto repeat_in_this_group;
> next_group:
> --
> 2.9.3
>
--
Jan Kara <[email protected]>
SUSE Labs, CR

2017-08-24 17:07:17

by Theodore Ts'o

[permalink] [raw]
Subject: Re: [PATCH 3/3 v5] ext4: reduce lock contention in __ext4_new_inode

On Mon, Aug 21, 2017 at 10:02:20AM +0200, Jan Kara wrote:
> On Sun 20-08-17 19:54:37, Wang Shilong wrote:
> > From: Wang Shilong <[email protected]>
> >
> > While running number of creating file threads concurrently,
> > we found heavy lock contention on group spinlock:

Thanks, applied. Many thanks for all of your work!!

- Ted