2016-08-15 12:23:47

by Dmitry Monakhov

Subject: [PATCH] ext4: improve ext4lazyinit scalability V2

ext4lazyinit is global thread. This thread performs itable initalization
under li_list_mtx mutex.

It basically does following:
ext4_lazyinit_thread
->mutex_lock(&eli->li_list_mtx);
->ext4_run_li_request(elr)
->ext4_init_inode_table-> Do a lot of IO if the list is large

And when new mount/umount arrive they have to block on ->li_list_mtx
because lazy_thread holds it during full walk procedure.
ext4_fill_super
->ext4_register_li_request
->mutex_lock(&ext4_li_info->li_list_mtx);
->list_add(&elr->lr_request, &ext4_li_info >li_request_list);
In my case mount takes 40minutes on server with 36 * 4Tb HDD.
Common user may face this in case of very slow dev ( /dev/mmcblkXXX)
Even more. If one of filesystems was frozen lazyinit_thread will simply
blocks on sb_start_write() so other mount/umount will be suck forever.

This patch changes logic like follows:
- grap ->s_umount read sem before processing new li_request.
After that it is safe to drop li_list_mtx because all callers of
li_remove_request are holding ->s_umount for write.
- li_thread skips frozen SB's

Locking order:
Order is asserted by umout path like follows: s_umount ->li_list_mtx so
the only way to to grab ->s_mount inside li_thread is via down_read_trylock
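
To make the ordering concrete, here is a rough sketch of the two paths
(illustration only; the unregister helper name is assumed from the
description above and the exact call chain may differ):

umount path (VFS already holds ->s_umount for write):
  down_write(&sb->s_umount)                      /* taken by the VFS */
    ->ext4_unregister_li_request(sb)
      ->mutex_lock(&ext4_li_info->li_list_mtx)
      ->ext4_remove_li_request(elr)

lazyinit thread:
  mutex_lock(&eli->li_list_mtx)
    ->down_read_trylock(&elr->lr_super->s_umount) /* must be trylock: a blocking
                                                     down_read() here would invert
                                                     the order above and deadlock */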

xfstests:ext4/023
#PSBM-49658

Changes from V1
- spell fixes according to jack@ comments
- do not use temporal list.


Signed-off-by: Dmitry Monakhov <[email protected]>
---
fs/ext4/super.c | 43 +++++++++++++++++++++++++++++++------------
1 file changed, 31 insertions(+), 12 deletions(-)

diff --git a/fs/ext4/super.c b/fs/ext4/super.c
index 3822a5a..0e45344 100644
--- a/fs/ext4/super.c
+++ b/fs/ext4/super.c
@@ -2635,7 +2635,6 @@ static int ext4_run_li_request(struct ext4_li_request *elr)
sb = elr->lr_super;
ngroups = EXT4_SB(sb)->s_groups_count;

- sb_start_write(sb);
for (group = elr->lr_next_group; group < ngroups; group++) {
gdp = ext4_get_group_desc(sb, group, NULL);
if (!gdp) {
@@ -2662,8 +2661,6 @@ static int ext4_run_li_request(struct ext4_li_request *elr)
elr->lr_next_sched = jiffies + elr->lr_timeout;
elr->lr_next_group = group + 1;
}
- sb_end_write(sb);
-
return ret;
}

@@ -2728,21 +2725,43 @@ cont_thread:
mutex_unlock(&eli->li_list_mtx);
goto exit_thread;
}
-
list_for_each_safe(pos, n, &eli->li_request_list) {
+ int err = 0;
+ int progress = 0;
elr = list_entry(pos, struct ext4_li_request,
lr_request);

- if (time_after_eq(jiffies, elr->lr_next_sched)) {
- if (ext4_run_li_request(elr) != 0) {
- /* error, remove the lazy_init job */
- ext4_remove_li_request(elr);
- continue;
+ if (time_before(jiffies, elr->lr_next_sched)) {
+ if (time_before(elr->lr_next_sched, next_wakeup))
+ next_wakeup = elr->lr_next_sched;
+ continue;
+ }
+ if (down_read_trylock(&elr->lr_super->s_umount)) {
+ if (sb_start_write_trylock(elr->lr_super)) {
+ progress = 1;
+ /*
+ * We hold sb->s_umount, sb can not
+ * be removed from the list, it is
+ * now safe to drop li_list_mtx
+ */
+ mutex_unlock(&eli->li_list_mtx);
+ err = ext4_run_li_request(elr);
+ sb_end_write(elr->lr_super);
+ mutex_lock(&eli->li_list_mtx);
+ n = pos->next;
}
+ up_read((&elr->lr_super->s_umount));
+ }
+ /* error, remove the lazy_init job */
+ if (err) {
+ ext4_remove_li_request(elr);
+ continue;
+ }
+ if (!progress) {
+ elr->lr_next_sched = jiffies +
+ (prandom_u32()
+ % (EXT4_DEF_LI_MAX_START_DELAY * HZ));
}


2016-08-15 15:05:22

by Jan Kara

Subject: Re: [PATCH] ext4: improve ext4lazyinit scalability V2

Hello,

Thanks for the patch. Couple of spelling fixes below and one functional
comment...

On Mon 15-08-16 16:23:35, Dmitry Monakhov wrote:
> ext4lazyinit is global thread. This thread performs itable initalization
^^^ a global thread

> under li_list_mtx mutex.
>
> It basically does following:
^ the

> ext4_lazyinit_thread
> ->mutex_lock(&eli->li_list_mtx);
> ->ext4_run_li_request(elr)
> ->ext4_init_inode_table-> Do a lot of IO if the list is large
>
> And when new mount/umount arrive they have to block on ->li_list_mtx
> because lazy_thread holds it during full walk procedure.
> ext4_fill_super
> ->ext4_register_li_request
> ->mutex_lock(&ext4_li_info->li_list_mtx);
> ->list_add(&elr->lr_request, &ext4_li_info >li_request_list);
> In my case mount takes 40minutes on server with 36 * 4Tb HDD.
> Common user may face this in case of very slow dev ( /dev/mmcblkXXX)
> Even more. If one of filesystems was frozen lazyinit_thread will simply
> blocks on sb_start_write() so other mount/umount will be suck forever.
^^^ block ^^ stuck

> This patch changes logic like follows:
> - grap ->s_umount read sem before processing new li_request.
^^^ grab

> After that it is safe to drop li_list_mtx because all callers of
> li_remove_request are holding ->s_umount for write.
> - li_thread skips frozen SB's
>
> Locking order:
> Order is asserted by umout path like follows: s_umount ->li_list_mtx so
^^^ umount

> the only way to to grab ->s_mount inside li_thread is via down_read_trylock
>
> xfstests:ext4/023
> #PSBM-49658
>
> Changes from V1
> - spell fixes according to jack@ comments
> - do not use temporal list.
>
>
> Signed-off-by: Dmitry Monakhov <[email protected]>
> ---
> fs/ext4/super.c | 43 +++++++++++++++++++++++++++++++------------
> 1 file changed, 31 insertions(+), 12 deletions(-)
...
> + if (!progress) {
> + elr->lr_next_sched = jiffies +
> + (prandom_u32()
> + % (EXT4_DEF_LI_MAX_START_DELAY * HZ));
> }

I think we need to update next_wakeup here based on updated value of
lr_next_sched and also in case ext4_run_li_request() didn't complete the
request but ended up rescheduling it. Otherwise the patch looks fine.
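
To make the suggestion concrete, a minimal sketch (illustrative only, not
the posted fix) would keep the pre-patch update unconditional at the end of
the loop body, so both the !progress reschedule and a reschedule done by
ext4_run_li_request() feed into next_wakeup:

		if (!progress)
			elr->lr_next_sched = jiffies +
				(prandom_u32() %
				 (EXT4_DEF_LI_MAX_START_DELAY * HZ));
		/* unconditional, as in the pre-patch code */
		if (time_before(elr->lr_next_sched, next_wakeup))
			next_wakeup = elr->lr_next_sched;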

Honza

--
Jan Kara <[email protected]>
SUSE Labs, CR

2016-09-06 03:40:25

by Theodore Ts'o

Subject: Re: [PATCH] ext4: improve ext4lazyinit scalability V2

Dmitry, thanks for the patch, and Jan, thanks for the review.

This is what I've added to the ext4 tree, which should reflect all of
the comments which Jan made.

- Ted

commit 933ef679501f124bc5eb762f6068d099a0579937
Author: Dmitry Monakhov <[email protected]>
Date: Mon Sep 5 23:38:36 2016 -0400

ext4: improve ext4lazyinit scalability

ext4lazyinit is a global thread. This thread performs itable
initialization under the li_list_mtx mutex.

It basically does the following:
ext4_lazyinit_thread
->mutex_lock(&eli->li_list_mtx);
->ext4_run_li_request(elr)
->ext4_init_inode_table-> Do a lot of IO if the list is large

And when a new mount or umount arrives it has to block on ->li_list_mtx
because lazy_thread holds it during the full walk procedure.
ext4_fill_super
->ext4_register_li_request
->mutex_lock(&ext4_li_info->li_list_mtx);
->list_add(&elr->lr_request, &ext4_li_info->li_request_list);
In my case mount takes 40 minutes on a server with 36 * 4Tb HDDs.
A common user may face this in the case of a very slow device (/dev/mmcblkXXX).
Even more: if one of the filesystems is frozen, lazyinit_thread will simply
block on sb_start_write(), so other mount/umount operations will be stuck forever.

This patch changes the logic as follows:
- grab ->s_umount read sem before processing new li_request.
After that it is safe to drop li_list_mtx because all callers of
li_remove_request are holding ->s_umount for write.
- li_thread skips frozen SB's

Locking order:
The order is asserted by the umount path as follows: s_umount -> li_list_mtx, so
the only way to grab ->s_umount inside li_thread is via down_read_trylock().

xfstests:ext4/023
#PSBM-49658

Signed-off-by: Dmitry Monakhov <[email protected]>
Signed-off-by: Theodore Ts'o <[email protected]>

diff --git a/fs/ext4/super.c b/fs/ext4/super.c
index 5819b0e..ebeebf5 100644
--- a/fs/ext4/super.c
+++ b/fs/ext4/super.c
@@ -2749,7 +2749,6 @@ static int ext4_run_li_request(struct ext4_li_request *elr)
sb = elr->lr_super;
ngroups = EXT4_SB(sb)->s_groups_count;

- sb_start_write(sb);
for (group = elr->lr_next_group; group < ngroups; group++) {
gdp = ext4_get_group_desc(sb, group, NULL);
if (!gdp) {
@@ -2776,8 +2775,6 @@ static int ext4_run_li_request(struct ext4_li_request *elr)
elr->lr_next_sched = jiffies + elr->lr_timeout;
elr->lr_next_group = group + 1;
}
- sb_end_write(sb);
-
return ret;
}

@@ -2842,21 +2839,46 @@ cont_thread:
mutex_unlock(&eli->li_list_mtx);
goto exit_thread;
}
-
list_for_each_safe(pos, n, &eli->li_request_list) {
+ int err = 0;
+ int progress = 0;
elr = list_entry(pos, struct ext4_li_request,
lr_request);

- if (time_after_eq(jiffies, elr->lr_next_sched)) {
- if (ext4_run_li_request(elr) != 0) {
- /* error, remove the lazy_init job */
- ext4_remove_li_request(elr);
- continue;
+ if (time_before(jiffies, elr->lr_next_sched)) {
+ if (time_before(elr->lr_next_sched, next_wakeup))
+ next_wakeup = elr->lr_next_sched;
+ continue;
+ }
+ if (down_read_trylock(&elr->lr_super->s_umount)) {
+ if (sb_start_write_trylock(elr->lr_super)) {
+ progress = 1;
+ /*
+ * We hold sb->s_umount, sb can not
+ * be removed from the list, it is
+ * now safe to drop li_list_mtx
+ */
+ mutex_unlock(&eli->li_list_mtx);
+ err = ext4_run_li_request(elr);
+ sb_end_write(elr->lr_super);
+ mutex_lock(&eli->li_list_mtx);
+ n = pos->next;
}
+ up_read((&elr->lr_super->s_umount));
+ }
+ /* error, remove the lazy_init job */
+ if (err) {
+ ext4_remove_li_request(elr);
+ continue;
+ }
+ if (!progress) {
+ elr->lr_next_sched = jiffies +
+ (prandom_u32()
+ % (EXT4_DEF_LI_MAX_START_DELAY * HZ));
+ if (time_before(elr->lr_next_sched,
+ next_wakeup))
+ next_wakeup = elr->lr_next_sched;
}

2016-09-06 08:36:11

by Jan Kara

Subject: Re: [PATCH] ext4: improve ext4lazyinit scalability V2

On Mon 05-09-16 23:39:54, Ted Tso wrote:
> Dmitry, thanks for the patch, and Jan, thanks for the review.
>
> This is what I've added to the ext4 tree, which should reflect all of
> the comments which Jan made.

Hum, one comment still:

> + if (down_read_trylock(&elr->lr_super->s_umount)) {
> + if (sb_start_write_trylock(elr->lr_super)) {
> + progress = 1;
> + /*
> + * We hold sb->s_umount, sb can not
> + * be removed from the list, it is
> + * now safe to drop li_list_mtx
> + */
> + mutex_unlock(&eli->li_list_mtx);
> + err = ext4_run_li_request(elr);
> + sb_end_write(elr->lr_super);
> + mutex_lock(&eli->li_list_mtx);
> + n = pos->next;
> }
> + up_read((&elr->lr_super->s_umount));
> + }
> + /* error, remove the lazy_init job */
> + if (err) {
> + ext4_remove_li_request(elr);
> + continue;
> + }
> + if (!progress) {
> + elr->lr_next_sched = jiffies +
> + (prandom_u32()
> + % (EXT4_DEF_LI_MAX_START_DELAY * HZ));
> + if (time_before(elr->lr_next_sched,
> + next_wakeup))
> + next_wakeup = elr->lr_next_sched;
> }
> -
> - if (time_before(elr->lr_next_sched, next_wakeup))
> - next_wakeup = elr->lr_next_sched;
> }

ext4_run_li_request() can also update elr->lr_next_sched. So we need to
update next_wakeup even in progress == 1 case (i.e., I'd leave the update
of next_wakeup as is in the old code...). Otherwise the patch looks good.

Honza
--
Jan Kara <[email protected]>
SUSE Labs, CR

2016-09-06 09:49:19

by Dmitry Monakhov

Subject: Re: [PATCH] ext4: improve ext4lazyinit scalability V2

Jan Kara <[email protected]> writes:

> On Mon 05-09-16 23:39:54, Ted Tso wrote:
>> Dmitry, thanks for the patch, and Jan, thanks for the review.
>>
>> This is what I've added to the ext4 tree, which should reflect all of
>> the comments which Jan made.
>
> Hum, one comment still:
>
>> + if (down_read_trylock(&elr->lr_super->s_umount)) {
>> + if (sb_start_write_trylock(elr->lr_super)) {
>> + progress = 1;
>> + /*
>> + * We hold sb->s_umount, sb can not
>> + * be removed from the list, it is
>> + * now safe to drop li_list_mtx
>> + */
>> + mutex_unlock(&eli->li_list_mtx);
>> + err = ext4_run_li_request(elr);
>> + sb_end_write(elr->lr_super);
>> + mutex_lock(&eli->li_list_mtx);
>> + n = pos->next;
>> }
>> + up_read((&elr->lr_super->s_umount));
>> + }
>> + /* error, remove the lazy_init job */
>> + if (err) {
>> + ext4_remove_li_request(elr);
>> + continue;
>> + }
>> + if (!progress) {
>> + elr->lr_next_sched = jiffies +
>> + (prandom_u32()
>> + % (EXT4_DEF_LI_MAX_START_DELAY * HZ));
>> + if (time_before(elr->lr_next_sched,
>> + next_wakeup))
>> + next_wakeup = elr->lr_next_sched;
>> }
>> -
>> - if (time_before(elr->lr_next_sched, next_wakeup))
>> - next_wakeup = elr->lr_next_sched;
>> }
>
> ext4_run_li_request() can also update elr->lr_next_sched. So we need to
> update next_wakeup even in progress == 1 case (i.e., I'd leave the update
> of next_wakeup as is in the old code...). Otherwise the patch looks good.
Yes, that is correct. Simply the last hunk should not be deleted.

Theodore, please fold the attached patch into the original one.


Attachments:
signature.asc (472.00 B)
lazy-init-v2-update-next-sched.patch (361.00 B)

2016-09-15 15:22:57

by Theodore Ts'o

Subject: Re: [PATCH] ext4: improve ext4lazyinit scalability V2

On Tue, Sep 06, 2016 at 12:49:09PM +0300, Dmitry Monakhov wrote:
> > ext4_run_li_request() can also update elr->lr_next_sched. So we need to
> > update next_wakeup even in progress == 1 case (i.e., I'd leave the update
> > of next_wakeup as is in the old code...). Otherwise the patch looks good.
> Yes, that is correct. Simply the last hunk should not be deleted.
>
> Theodore, please fold the attached patch into the original one.

Done, thanks!!

- Ted