2018-06-12 08:06:10

by Jing Xia

Subject: [PATCH] dm bufio: Reduce dm_bufio_lock contention

A performance test on Android reports that the phone sometimes hangs
and shows a black screen for several minutes. The sysdump shows:
1. kswapd and other tasks that enter the direct-reclaim path are waiting
on the dm_bufio_lock;
2. the task that holds the dm_bufio_lock is stalled waiting for IO
completions; the relevant stack trace is:

PID: 22920 TASK: ffffffc0120f1a00 CPU: 1 COMMAND: "kworker/u8:2"
#0 [ffffffc0282af3d0] __switch_to at ffffff8008085e48
#1 [ffffffc0282af3f0] __schedule at ffffff8008850cc8
#2 [ffffffc0282af450] schedule at ffffff8008850f4c
#3 [ffffffc0282af470] schedule_timeout at ffffff8008853a0c
#4 [ffffffc0282af520] schedule_timeout_uninterruptible at ffffff8008853aa8
#5 [ffffffc0282af530] wait_iff_congested at ffffff8008181b40
#6 [ffffffc0282af5b0] shrink_inactive_list at ffffff8008177c80
#7 [ffffffc0282af680] shrink_lruvec at ffffff8008178510
#8 [ffffffc0282af790] mem_cgroup_shrink_node_zone at ffffff80081793bc
#9 [ffffffc0282af840] mem_cgroup_soft_limit_reclaim at ffffff80081b6040

This patch aims to reduce dm_bufio_lock contention when multiple
tasks do shrink_slab() at the same time. It is acceptable for a task
to reclaim from other shrinkers, or to reclaim from dm-bufio next
time, rather than stall on the dm_bufio_lock.

Signed-off-by: Jing Xia <[email protected]>
Signed-off-by: Jing Xia <[email protected]>
---
drivers/md/dm-bufio.c | 13 +++++++++++--
1 file changed, 11 insertions(+), 2 deletions(-)

diff --git a/drivers/md/dm-bufio.c b/drivers/md/dm-bufio.c
index c546b56..402a028 100644
--- a/drivers/md/dm-bufio.c
+++ b/drivers/md/dm-bufio.c
@@ -1647,10 +1647,19 @@ static unsigned long __scan(struct dm_bufio_client *c, unsigned long nr_to_scan,
static unsigned long
dm_bufio_shrink_count(struct shrinker *shrink, struct shrink_control *sc)
{
+ unsigned long count;
+ unsigned long retain_target;
+
struct dm_bufio_client *c = container_of(shrink, struct dm_bufio_client, shrinker);
- unsigned long count = READ_ONCE(c->n_buffers[LIST_CLEAN]) +
+
+ if (!dm_bufio_trylock(c))
+ return 0;
+
+ count = READ_ONCE(c->n_buffers[LIST_CLEAN]) +
READ_ONCE(c->n_buffers[LIST_DIRTY]);
- unsigned long retain_target = get_retain_buffers(c);
+ retain_target = get_retain_buffers(c);
+
+ dm_bufio_unlock(c);

return (count < retain_target) ? 0 : (count - retain_target);
}
--
1.9.1



2018-06-12 21:20:51

by Mike Snitzer

Subject: Re: dm bufio: Reduce dm_bufio_lock contention

On Tue, Jun 12 2018 at 4:03am -0400,
Jing Xia <[email protected]> wrote:

> A performance test on Android reports that the phone sometimes hangs
> and shows a black screen for several minutes. The sysdump shows:
> 1. kswapd and other tasks that enter the direct-reclaim path are waiting
> on the dm_bufio_lock;

Do you have an understanding of where they are waiting? Is it in
dm_bufio_shrink_scan()?

> 2. the task that holds the dm_bufio_lock is stalled waiting for IO
> completions; the relevant stack trace is:
>
> PID: 22920 TASK: ffffffc0120f1a00 CPU: 1 COMMAND: "kworker/u8:2"
> #0 [ffffffc0282af3d0] __switch_to at ffffff8008085e48
> #1 [ffffffc0282af3f0] __schedule at ffffff8008850cc8
> #2 [ffffffc0282af450] schedule at ffffff8008850f4c
> #3 [ffffffc0282af470] schedule_timeout at ffffff8008853a0c
> #4 [ffffffc0282af520] schedule_timeout_uninterruptible at ffffff8008853aa8
> #5 [ffffffc0282af530] wait_iff_congested at ffffff8008181b40
> #6 [ffffffc0282af5b0] shrink_inactive_list at ffffff8008177c80
> #7 [ffffffc0282af680] shrink_lruvec at ffffff8008178510
> #8 [ffffffc0282af790] mem_cgroup_shrink_node_zone at ffffff80081793bc
> #9 [ffffffc0282af840] mem_cgroup_soft_limit_reclaim at ffffff80081b6040

Understanding the root cause for why the IO isn't completing quick
enough would be nice. Is the backing storage just overwhelmed?

> This patch aims to reduce dm_bufio_lock contention when multiple
> tasks do shrink_slab() at the same time. It is acceptable for a task
> to reclaim from other shrinkers, or to reclaim from dm-bufio next
> time, rather than stall on the dm_bufio_lock.

Your patch just looks to be papering over the issue. Like you're
treating the symptom rather than the problem.

> Signed-off-by: Jing Xia <[email protected]>
> Signed-off-by: Jing Xia <[email protected]>

You only need one Signed-off-by.

> ---
> drivers/md/dm-bufio.c | 13 +++++++++++--
> 1 file changed, 11 insertions(+), 2 deletions(-)
>
> diff --git a/drivers/md/dm-bufio.c b/drivers/md/dm-bufio.c
> index c546b56..402a028 100644
> --- a/drivers/md/dm-bufio.c
> +++ b/drivers/md/dm-bufio.c
> @@ -1647,10 +1647,19 @@ static unsigned long __scan(struct dm_bufio_client *c, unsigned long nr_to_scan,
> static unsigned long
> dm_bufio_shrink_count(struct shrinker *shrink, struct shrink_control *sc)
> {
> + unsigned long count;
> + unsigned long retain_target;
> +
> struct dm_bufio_client *c = container_of(shrink, struct dm_bufio_client, shrinker);
> - unsigned long count = READ_ONCE(c->n_buffers[LIST_CLEAN]) +
> +
> + if (!dm_bufio_trylock(c))
> + return 0;
> +
> + count = READ_ONCE(c->n_buffers[LIST_CLEAN]) +
> READ_ONCE(c->n_buffers[LIST_DIRTY]);
> - unsigned long retain_target = get_retain_buffers(c);
> + retain_target = get_retain_buffers(c);
> +
> + dm_bufio_unlock(c);
>
> return (count < retain_target) ? 0 : (count - retain_target);
> }
> --
> 1.9.1
>

The reality of your patch is, on a heavily used bufio-backed volume,
you're effectively disabling the ability to reclaim bufio memory via the
shrinker.

Because chances are the bufio lock will always be contended for a
heavily used bufio client.

But after a quick look, I'm left wondering why dm_bufio_shrink_scan()'s
dm_bufio_trylock() isn't sufficient to short-circuit the shrinker for
your use-case?
Maybe __GFP_FS is set so dm_bufio_shrink_scan() only ever uses
dm_bufio_lock()?

Is a shrinker able to be reentered by the VM subsystem
(e.g. shrink_slab() calls down into the same shrinker from multiple tasks
that hit direct reclaim)?
If so, a better fix could be to add a flag to the bufio client so we can
know if the same client is being re-entered via the shrinker (though
it'd likely be a bug for the shrinker to do that!).. and have
dm_bufio_shrink_scan() check that flag and return SHRINK_STOP if set.
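
As a rough sketch of that flag idea (hypothetical and untested; the
"scanning" bit would be a new unsigned long field in struct
dm_bufio_client, and the body only approximates the upstream
dm_bufio_shrink_scan()):

static unsigned long
dm_bufio_shrink_scan(struct shrinker *shrink, struct shrink_control *sc)
{
        struct dm_bufio_client *c;
        unsigned long freed;

        c = container_of(shrink, struct dm_bufio_client, shrinker);

        /* The same client is already being shrunk by another task:
         * back off instead of queueing up on the mutex. */
        if (test_and_set_bit(0, &c->scanning))
                return SHRINK_STOP;

        if (sc->gfp_mask & __GFP_FS)
                dm_bufio_lock(c);
        else if (!dm_bufio_trylock(c)) {
                clear_bit(0, &c->scanning);
                return SHRINK_STOP;
        }

        freed = __scan(c, sc->nr_to_scan, sc->gfp_mask);
        dm_bufio_unlock(c);

        clear_bit(0, &c->scanning);
        return freed;
}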

That said, it could be that other parts of dm-bufio are monopolizing the
lock as part of issuing normal IO (to your potentially slow
backend).. in which case just taking the lock from the shrinker even
once will block like you've reported.

It does seem like additional analysis is needed to pinpoint exactly what
is occurring. Or some additional clarification is needed (e.g. are the
multiple tasks waiting for the bufio lock, as you reported with "1"
above, waiting for the same exact shrinker's ability to get the same
bufio lock?)

But Mikulas, please have a look at this reported issue and let us know
your thoughts.

Thanks,
Mike

2018-06-13 14:05:23

by Mikulas Patocka

Subject: Re: dm bufio: Reduce dm_bufio_lock contention



On Tue, 12 Jun 2018, Mike Snitzer wrote:

> On Tue, Jun 12 2018 at 4:03am -0400,
> Jing Xia <[email protected]> wrote:
>
> > A performance test on Android reports that the phone sometimes hangs
> > and shows a black screen for several minutes. The sysdump shows:
> > 1. kswapd and other tasks that enter the direct-reclaim path are waiting
> > on the dm_bufio_lock;
>
> Do you have an understanding of where they are waiting? Is it in
> dm_bufio_shrink_scan()?
>
> > 2. the task that holds the dm_bufio_lock is stalled waiting for IO
> > completions; the relevant stack trace is:
> >
> > PID: 22920 TASK: ffffffc0120f1a00 CPU: 1 COMMAND: "kworker/u8:2"
> > #0 [ffffffc0282af3d0] __switch_to at ffffff8008085e48
> > #1 [ffffffc0282af3f0] __schedule at ffffff8008850cc8
> > #2 [ffffffc0282af450] schedule at ffffff8008850f4c
> > #3 [ffffffc0282af470] schedule_timeout at ffffff8008853a0c
> > #4 [ffffffc0282af520] schedule_timeout_uninterruptible at ffffff8008853aa8
> > #5 [ffffffc0282af530] wait_iff_congested at ffffff8008181b40
> > #6 [ffffffc0282af5b0] shrink_inactive_list at ffffff8008177c80
> > #7 [ffffffc0282af680] shrink_lruvec at ffffff8008178510
> > #8 [ffffffc0282af790] mem_cgroup_shrink_node_zone at ffffff80081793bc
> > #9 [ffffffc0282af840] mem_cgroup_soft_limit_reclaim at ffffff80081b6040

Please send the full stacktrace of this task.

Then we can see why it is waiting here.

Mikulas

2018-06-14 07:18:47

by Jing Xia

Subject: Re: dm bufio: Reduce dm_bufio_lock contention

Thanks for your comment, I appreciate it.

On Wed, Jun 13, 2018 at 5:20 AM, Mike Snitzer <[email protected]> wrote:
> On Tue, Jun 12 2018 at 4:03am -0400,
> Jing Xia <[email protected]> wrote:
>
>> A performance test on Android reports that the phone sometimes hangs
>> and shows a black screen for several minutes. The sysdump shows:
>> 1. kswapd and other tasks that enter the direct-reclaim path are waiting
>> on the dm_bufio_lock;
>
> Do you have an understanding of where they are waiting? Is it in
> dm_bufio_shrink_scan()?
>

I'm sorry I didn't make the issue clear. The kernel version on our
platform is 4.4, and there the implementation of dm_bufio_shrink_count()
still needs to take the dm_bufio_lock. So kswapd and other tasks in
direct reclaim are blocked in dm_bufio_shrink_count().
According to the sysdump:
1. those tasks are doing the shrink through the same dm_bufio shrinker
(the first dm_bufio shrinker in the shrink_list), and waiting for the
same mutex lock;
2. __GFP_FS is set for all of those tasks.

In the latest version the dm_bufio_lock is no longer needed in
dm_bufio_shrink_count(), but the issue may still happen, because the
lock is still needed in dm_bufio_shrink_scan().
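
For reference, the 4.4-era dm_bufio_shrink_count() looks roughly like
this (reproduced from memory, so treat it as approximate):

static unsigned long
dm_bufio_shrink_count(struct shrinker *shrink, struct shrink_control *sc)
{
        struct dm_bufio_client *c;
        unsigned long count;

        c = container_of(shrink, struct dm_bufio_client, shrinker);
        if (sc->gfp_mask & __GFP_FS)
                dm_bufio_lock(c);       /* blocking: this is where we wait */
        else if (!dm_bufio_trylock(c))
                return 0;

        count = c->n_buffers[LIST_CLEAN] + c->n_buffers[LIST_DIRTY];
        dm_bufio_unlock(c);
        return count;
}

Because __GFP_FS is set for all of those tasks, they all take the
blocking dm_bufio_lock() path above rather than the trylock path.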

For the relevant stack traces, please refer to:

PID: 855 TASK: ffffffc07844a700 CPU: 1 COMMAND: "kswapd0"
#0 [ffffffc078357920] __switch_to at ffffff8008085e48
#1 [ffffffc078357940] __schedule at ffffff8008850cc8
#2 [ffffffc0783579a0] schedule at ffffff8008850f4c
#3 [ffffffc0783579c0] schedule_preempt_disabled at ffffff8008851284
#4 [ffffffc0783579e0] __mutex_lock_slowpath at ffffff8008852898
#5 [ffffffc078357a50] mutex_lock at ffffff8008852954
#6 [ffffffc078357a70] dm_bufio_shrink_count at ffffff8008591d08
#7 [ffffffc078357ab0] do_shrink_slab at ffffff80081753e0
#8 [ffffffc078357b30] shrink_slab at ffffff8008175f3c
#9 [ffffffc078357bd0] shrink_zone at ffffff8008178860
#10 [ffffffc078357c70] balance_pgdat at ffffff8008179838
#11 [ffffffc078357d70] kswapd at ffffff8008179e48
#12 [ffffffc078357e20] kthread at ffffff80080bae34

PID: 15538 TASK: ffffffc006833400 CPU: 3 COMMAND: "com.qiyi.video"
#0 [ffffffc027637660] __switch_to at ffffff8008085e48
#1 [ffffffc027637680] __schedule at ffffff8008850cc8
#2 [ffffffc0276376e0] schedule at ffffff8008850f4c
#3 [ffffffc027637700] schedule_preempt_disabled at ffffff8008851284
#4 [ffffffc027637720] __mutex_lock_slowpath at ffffff8008852898
#5 [ffffffc027637790] mutex_lock at ffffff8008852954
#6 [ffffffc0276377b0] dm_bufio_shrink_count at ffffff8008591d08
#7 [ffffffc0276377f0] do_shrink_slab at ffffff80081753e0
#8 [ffffffc027637870] shrink_slab at ffffff8008175f3c
#9 [ffffffc027637910] shrink_zone at ffffff8008178860
#10 [ffffffc0276379b0] do_try_to_free_pages at ffffff8008178bc4
#11 [ffffffc027637a50] try_to_free_pages at ffffff8008178f3c
#12 [ffffffc027637af0] __perform_reclaim at ffffff8008169130
#13 [ffffffc027637b70] __alloc_pages_nodemask at ffffff800816c9b8
#14 [ffffffc027637c90] handle_mm_fault at ffffff800819025c
#15 [ffffffc027637d80] do_page_fault at ffffff80080949f8
#16 [ffffffc027637df0] do_mem_abort at ffffff80080812f8


>> 2. the task that holds the dm_bufio_lock is stalled waiting for IO
>> completions; the relevant stack trace is:
>>
>> PID: 22920 TASK: ffffffc0120f1a00 CPU: 1 COMMAND: "kworker/u8:2"
>> #0 [ffffffc0282af3d0] __switch_to at ffffff8008085e48
>> #1 [ffffffc0282af3f0] __schedule at ffffff8008850cc8
>> #2 [ffffffc0282af450] schedule at ffffff8008850f4c
>> #3 [ffffffc0282af470] schedule_timeout at ffffff8008853a0c
>> #4 [ffffffc0282af520] schedule_timeout_uninterruptible at ffffff8008853aa8
>> #5 [ffffffc0282af530] wait_iff_congested at ffffff8008181b40
>> #6 [ffffffc0282af5b0] shrink_inactive_list at ffffff8008177c80
>> #7 [ffffffc0282af680] shrink_lruvec at ffffff8008178510
>> #8 [ffffffc0282af790] mem_cgroup_shrink_node_zone at ffffff80081793bc
>> #9 [ffffffc0282af840] mem_cgroup_soft_limit_reclaim at ffffff80081b6040
>
> Understanding the root cause for why the IO isn't completing quick
> enough would be nice. Is the backing storage just overwhelmed?
>

Some information we can get from the sysdump:
1. swap space is exhausted;
2. many tasks need to allocate a lot of memory;
3. IOwait is 100%;

>> This patch aims to reduce dm_bufio_lock contention when multiple
>> tasks do shrink_slab() at the same time. It is acceptable for a task
>> to reclaim from other shrinkers, or to reclaim from dm-bufio next
>> time, rather than stall on the dm_bufio_lock.
>
> Your patch just looks to be papering over the issue. Like you're
> treating the symptom rather than the problem.
>
Yes, maybe you are right.

>> Signed-off-by: Jing Xia <[email protected]>
>> Signed-off-by: Jing Xia <[email protected]>
>
> You only need one Signed-off-by.
>
Thanks, I got it.

>> ---
>> drivers/md/dm-bufio.c | 13 +++++++++++--
>> 1 file changed, 11 insertions(+), 2 deletions(-)
>>
>> diff --git a/drivers/md/dm-bufio.c b/drivers/md/dm-bufio.c
>> index c546b56..402a028 100644
>> --- a/drivers/md/dm-bufio.c
>> +++ b/drivers/md/dm-bufio.c
>> @@ -1647,10 +1647,19 @@ static unsigned long __scan(struct dm_bufio_client *c, unsigned long nr_to_scan,
>> static unsigned long
>> dm_bufio_shrink_count(struct shrinker *shrink, struct shrink_control *sc)
>> {
>> + unsigned long count;
>> + unsigned long retain_target;
>> +
>> struct dm_bufio_client *c = container_of(shrink, struct dm_bufio_client, shrinker);
>> - unsigned long count = READ_ONCE(c->n_buffers[LIST_CLEAN]) +
>> +
>> + if (!dm_bufio_trylock(c))
>> + return 0;
>> +
>> + count = READ_ONCE(c->n_buffers[LIST_CLEAN]) +
>> READ_ONCE(c->n_buffers[LIST_DIRTY]);
>> - unsigned long retain_target = get_retain_buffers(c);
>> + retain_target = get_retain_buffers(c);
>> +
>> + dm_bufio_unlock(c);
>>
>> return (count < retain_target) ? 0 : (count - retain_target);
>> }
>> --
>> 1.9.1
>>
>
> The reality of your patch is, on a heavily used bufio-backed volume,
> you're effectively disabling the ability to reclaim bufio memory via the
> shrinker.
>
> Because chances are the bufio lock will always be contended for a
> heavily used bufio client.
>

If the bufio lock will always be contended for a heavily used bufio
client, then this patch may not be a good way to solve the problem.

> But after a quick look, I'm left wondering why dm_bufio_shrink_scan()'s
> dm_bufio_trylock() isn't sufficient to short-circuit the shrinker for
> your use-case?
> Maybe __GFP_FS is set so dm_bufio_shrink_scan() only ever uses
> dm_bufio_lock()?
>
> Is a shrinker able to be reentered by the VM subsystem
> (e.g. shrink_slab() calls down into the same shrinker from multiple tasks
> that hit direct reclaim)?
> If so, a better fix could be to add a flag to the bufio client so we can
> know if the same client is being re-entered via the shrinker (though
> it'd likely be a bug for the shrinker to do that!).. and have
> dm_bufio_shrink_scan() check that flag and return SHRINK_STOP if set.
>
> That said, it could be that other parts of dm-bufio are monopolizing the
> lock as part of issuing normal IO (to your potentially slow
> backend).. in which case just taking the lock from the shrinker even
> once will block like you've reported.
>
> It does seem like additional analysis is needed to pinpoint exactly what
> is occurring. Or some additional clarification is needed (e.g. are the
> multiple tasks waiting for the bufio lock, as you reported with "1"
> above, waiting for the same exact shrinker's ability to get the same
> bufio lock?)
>
> But Mikulas, please have a look at this reported issue and let us know
> your thoughts.
>
> Thanks,
> Mike

2018-06-14 07:19:43

by Jing Xia

Subject: Re: dm bufio: Reduce dm_bufio_lock contention

Thanks for your comment.

On Wed, Jun 13, 2018 at 10:02 PM, Mikulas Patocka <[email protected]> wrote:
>
>
> On Tue, 12 Jun 2018, Mike Snitzer wrote:
>
>> On Tue, Jun 12 2018 at 4:03am -0400,
>> Jing Xia <[email protected]> wrote:
>>
>> > A performance test on Android reports that the phone sometimes hangs
>> > and shows a black screen for several minutes. The sysdump shows:
>> > 1. kswapd and other tasks that enter the direct-reclaim path are waiting
>> > on the dm_bufio_lock;
>>
>> Do you have an understanding of where they are waiting? Is it in
>> dm_bufio_shrink_scan()?
>>
>> > 2. the task that holds the dm_bufio_lock is stalled waiting for IO
>> > completions; the relevant stack trace is:
>> >
>> > PID: 22920 TASK: ffffffc0120f1a00 CPU: 1 COMMAND: "kworker/u8:2"
>> > #0 [ffffffc0282af3d0] __switch_to at ffffff8008085e48
>> > #1 [ffffffc0282af3f0] __schedule at ffffff8008850cc8
>> > #2 [ffffffc0282af450] schedule at ffffff8008850f4c
>> > #3 [ffffffc0282af470] schedule_timeout at ffffff8008853a0c
>> > #4 [ffffffc0282af520] schedule_timeout_uninterruptible at ffffff8008853aa8
>> > #5 [ffffffc0282af530] wait_iff_congested at ffffff8008181b40
>> > #6 [ffffffc0282af5b0] shrink_inactive_list at ffffff8008177c80
>> > #7 [ffffffc0282af680] shrink_lruvec at ffffff8008178510
>> > #8 [ffffffc0282af790] mem_cgroup_shrink_node_zone at ffffff80081793bc
>> > #9 [ffffffc0282af840] mem_cgroup_soft_limit_reclaim at ffffff80081b6040
>
> Please send the full stacktrace of this task.
>
> Then we can see why it is waiting here.
>
Please refer to:

PID: 22920 TASK: ffffffc0120f1a00 CPU: 1 COMMAND: "kworker/u8:2"
#0 [ffffffc0282af3d0] __switch_to at ffffff8008085e48
#1 [ffffffc0282af3f0] __schedule at ffffff8008850cc8
#2 [ffffffc0282af450] schedule at ffffff8008850f4c
#3 [ffffffc0282af470] schedule_timeout at ffffff8008853a0c
#4 [ffffffc0282af520] schedule_timeout_uninterruptible at ffffff8008853aa8
#5 [ffffffc0282af530] wait_iff_congested at ffffff8008181b40
#6 [ffffffc0282af5b0] shrink_inactive_list at ffffff8008177c80
#7 [ffffffc0282af680] shrink_lruvec at ffffff8008178510
#8 [ffffffc0282af790] mem_cgroup_shrink_node_zone at ffffff80081793bc
#9 [ffffffc0282af840] mem_cgroup_soft_limit_reclaim at ffffff80081b6040
#10 [ffffffc0282af8f0] do_try_to_free_pages at ffffff8008178b6c
#11 [ffffffc0282af990] try_to_free_pages at ffffff8008178f3c
#12 [ffffffc0282afa30] __perform_reclaim at ffffff8008169130
#13 [ffffffc0282afab0] __alloc_pages_nodemask at ffffff800816c9b8
#14 [ffffffc0282afbd0] __get_free_pages at ffffff800816cd6c
#15 [ffffffc0282afbe0] alloc_buffer at ffffff8008591a94
#16 [ffffffc0282afc20] __bufio_new at ffffff8008592e94
#17 [ffffffc0282afc70] dm_bufio_prefetch at ffffff8008593198
#18 [ffffffc0282afd20] verity_prefetch_io at ffffff8008598384
#19 [ffffffc0282afd70] process_one_work at ffffff80080b5b3c
#20 [ffffffc0282afdc0] worker_thread at ffffff80080b64fc
#21 [ffffffc0282afe20] kthread at ffffff80080bae34

> Mikulas

2018-06-14 07:32:37

by Michal Hocko

Subject: Re: dm bufio: Reduce dm_bufio_lock contention

On Thu 14-06-18 15:18:58, jing xia wrote:
[...]
> PID: 22920 TASK: ffffffc0120f1a00 CPU: 1 COMMAND: "kworker/u8:2"
> #0 [ffffffc0282af3d0] __switch_to at ffffff8008085e48
> #1 [ffffffc0282af3f0] __schedule at ffffff8008850cc8
> #2 [ffffffc0282af450] schedule at ffffff8008850f4c
> #3 [ffffffc0282af470] schedule_timeout at ffffff8008853a0c
> #4 [ffffffc0282af520] schedule_timeout_uninterruptible at ffffff8008853aa8
> #5 [ffffffc0282af530] wait_iff_congested at ffffff8008181b40

This trace doesn't provide the full picture unfortunately. Waiting in
the direct reclaim means that the underlying bdi is congested. The real
question is why it doesn't flush IO in time.
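
For context, the throttling point in question is this test in
shrink_node() (paraphrased from mm/vmscan.c; the exact condition varies
between kernel versions):

        /* Stall direct reclaimers, but not kswapd, when the node is
         * congested; this sleeps for up to 100ms per iteration. */
        if (!sc->hibernation_mode && !current_is_kswapd() &&
            current_may_throttle() && pgdat_memcg_congested(pgdat, root))
                wait_iff_congested(BLK_RW_ASYNC, HZ/10);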

> #6 [ffffffc0282af5b0] shrink_inactive_list at ffffff8008177c80
> #7 [ffffffc0282af680] shrink_lruvec at ffffff8008178510
> #8 [ffffffc0282af790] mem_cgroup_shrink_node_zone at ffffff80081793bc
> #9 [ffffffc0282af840] mem_cgroup_soft_limit_reclaim at ffffff80081b6040
> #10 [ffffffc0282af8f0] do_try_to_free_pages at ffffff8008178b6c
> #11 [ffffffc0282af990] try_to_free_pages at ffffff8008178f3c
> #12 [ffffffc0282afa30] __perform_reclaim at ffffff8008169130
> #13 [ffffffc0282afab0] __alloc_pages_nodemask at ffffff800816c9b8
> #14 [ffffffc0282afbd0] __get_free_pages at ffffff800816cd6c
> #15 [ffffffc0282afbe0] alloc_buffer at ffffff8008591a94
> #16 [ffffffc0282afc20] __bufio_new at ffffff8008592e94
> #17 [ffffffc0282afc70] dm_bufio_prefetch at ffffff8008593198
> #18 [ffffffc0282afd20] verity_prefetch_io at ffffff8008598384
> #19 [ffffffc0282afd70] process_one_work at ffffff80080b5b3c
> #20 [ffffffc0282afdc0] worker_thread at ffffff80080b64fc
> #21 [ffffffc0282afe20] kthread at ffffff80080bae34
>
> > Mikulas

--
Michal Hocko
SUSE Labs

2018-06-14 18:34:52

by Mikulas Patocka

Subject: Re: dm bufio: Reduce dm_bufio_lock contention



On Thu, 14 Jun 2018, Michal Hocko wrote:

> On Thu 14-06-18 15:18:58, jing xia wrote:
> [...]
> > PID: 22920 TASK: ffffffc0120f1a00 CPU: 1 COMMAND: "kworker/u8:2"
> > #0 [ffffffc0282af3d0] __switch_to at ffffff8008085e48
> > #1 [ffffffc0282af3f0] __schedule at ffffff8008850cc8
> > #2 [ffffffc0282af450] schedule at ffffff8008850f4c
> > #3 [ffffffc0282af470] schedule_timeout at ffffff8008853a0c
> > #4 [ffffffc0282af520] schedule_timeout_uninterruptible at ffffff8008853aa8
> > #5 [ffffffc0282af530] wait_iff_congested at ffffff8008181b40
>
> This trace doesn't provide the full picture unfortunately. Waiting in
> the direct reclaim means that the underlying bdi is congested. The real
> question is why it doesn't flush IO in time.

I pointed this out two years ago and you just refused to fix it:
http://lkml.iu.edu/hypermail/linux/kernel/1608.1/04507.html

I'm sure you'll come up with another creative excuse why GFP_NORETRY
allocations need to incur deliberate 100ms delays in block device drivers.

Mikulas

> > #6 [ffffffc0282af5b0] shrink_inactive_list at ffffff8008177c80
> > #7 [ffffffc0282af680] shrink_lruvec at ffffff8008178510
> > #8 [ffffffc0282af790] mem_cgroup_shrink_node_zone at ffffff80081793bc
> > #9 [ffffffc0282af840] mem_cgroup_soft_limit_reclaim at ffffff80081b6040
> > #10 [ffffffc0282af8f0] do_try_to_free_pages at ffffff8008178b6c
> > #11 [ffffffc0282af990] try_to_free_pages at ffffff8008178f3c
> > #12 [ffffffc0282afa30] __perform_reclaim at ffffff8008169130
> > #13 [ffffffc0282afab0] __alloc_pages_nodemask at ffffff800816c9b8
> > #14 [ffffffc0282afbd0] __get_free_pages at ffffff800816cd6c
> > #15 [ffffffc0282afbe0] alloc_buffer at ffffff8008591a94
> > #16 [ffffffc0282afc20] __bufio_new at ffffff8008592e94
> > #17 [ffffffc0282afc70] dm_bufio_prefetch at ffffff8008593198
> > #18 [ffffffc0282afd20] verity_prefetch_io at ffffff8008598384
> > #19 [ffffffc0282afd70] process_one_work at ffffff80080b5b3c
> > #20 [ffffffc0282afdc0] worker_thread at ffffff80080b64fc
> > #21 [ffffffc0282afe20] kthread at ffffff80080bae34
> >
> > > Mikulas
>
> --
> Michal Hocko
> SUSE Labs
>

2018-06-15 07:34:15

by Michal Hocko

Subject: Re: dm bufio: Reduce dm_bufio_lock contention

On Thu 14-06-18 14:34:06, Mikulas Patocka wrote:
>
>
> On Thu, 14 Jun 2018, Michal Hocko wrote:
>
> > On Thu 14-06-18 15:18:58, jing xia wrote:
> > [...]
> > > PID: 22920 TASK: ffffffc0120f1a00 CPU: 1 COMMAND: "kworker/u8:2"
> > > #0 [ffffffc0282af3d0] __switch_to at ffffff8008085e48
> > > #1 [ffffffc0282af3f0] __schedule at ffffff8008850cc8
> > > #2 [ffffffc0282af450] schedule at ffffff8008850f4c
> > > #3 [ffffffc0282af470] schedule_timeout at ffffff8008853a0c
> > > #4 [ffffffc0282af520] schedule_timeout_uninterruptible at ffffff8008853aa8
> > > #5 [ffffffc0282af530] wait_iff_congested at ffffff8008181b40
> >
> > This trace doesn't provide the full picture unfortunately. Waiting in
> > the direct reclaim means that the underlying bdi is congested. The real
> > question is why it doesn't flush IO in time.
>
> I pointed this out two years ago and you just refused to fix it:
> http://lkml.iu.edu/hypermail/linux/kernel/1608.1/04507.html

Let me be evil again and let me quote the old discussion:
: > I agree that mempool_alloc should _primarily_ sleep on their own
: > throttling mechanism. I am not questioning that. I am just saying that
: > the page allocator has its own throttling which it relies on and that
: > cannot be just ignored because that might have other undesirable side
: > effects. So if the right approach is really to never throttle certain
: > requests then we have to bail out from a congested nodes/zones as soon
: > as the congestion is detected.
: >
: > Now, I would like to see that something like that is _really_ necessary.
:
: Currently, it is not a problem - device mapper reports the device as
: congested only if the underlying physical disks are congested.
:
: But once we change it so that device mapper reports congested state on its
: own (when it has too many bios in progress), this starts being a problem.

So has this changed since then? If yes then we can think of a proper
solution but that would require to actually describe why we see the
congestion, why it does help to wait on the caller rather than the
allocator etc...

Throwing statements like ...

> I'm sure you'll come up with another creative excuse why GFP_NORETRY
> allocations need to incur deliberate 100ms delays in block device drivers.

... is not really productive. I've tried to explain why I am not _sure_ what
possible side effects such a change might have and your hand waving
didn't really convince me. MD is not the only user of the page
allocator...

E.g. why did 41c73a49df31 ("dm bufio: drop the lock when doing GFP_NOIO
allocation") even add a GFP_NOIO request in the first place when you
keep retrying and sleeping yourself? The changelog only describes what but
doesn't explain why. Or did I misread the code, and this is not the
allocation that is stalling due to congestion?
--
Michal Hocko
SUSE Labs

2018-06-15 11:36:09

by Mikulas Patocka

Subject: Re: dm bufio: Reduce dm_bufio_lock contention



On Fri, 15 Jun 2018, Michal Hocko wrote:

> On Thu 14-06-18 14:34:06, Mikulas Patocka wrote:
> >
> >
> > On Thu, 14 Jun 2018, Michal Hocko wrote:
> >
> > > On Thu 14-06-18 15:18:58, jing xia wrote:
> > > [...]
> > > > PID: 22920 TASK: ffffffc0120f1a00 CPU: 1 COMMAND: "kworker/u8:2"
> > > > #0 [ffffffc0282af3d0] __switch_to at ffffff8008085e48
> > > > #1 [ffffffc0282af3f0] __schedule at ffffff8008850cc8
> > > > #2 [ffffffc0282af450] schedule at ffffff8008850f4c
> > > > #3 [ffffffc0282af470] schedule_timeout at ffffff8008853a0c
> > > > #4 [ffffffc0282af520] schedule_timeout_uninterruptible at ffffff8008853aa8
> > > > #5 [ffffffc0282af530] wait_iff_congested at ffffff8008181b40
> > >
> > > This trace doesn't provide the full picture unfortunately. Waiting in
> > > the direct reclaim means that the underlying bdi is congested. The real
> > > question is why it doesn't flush IO in time.
> >
> > I pointed this out two years ago and you just refused to fix it:
> > http://lkml.iu.edu/hypermail/linux/kernel/1608.1/04507.html
>
> Let me be evil again and let me quote the old discussion:
> : > I agree that mempool_alloc should _primarily_ sleep on their own
> : > throttling mechanism. I am not questioning that. I am just saying that
> : > the page allocator has its own throttling which it relies on and that
> : > cannot be just ignored because that might have other undesirable side
> : > effects. So if the right approach is really to never throttle certain
> : > requests then we have to bail out from a congested nodes/zones as soon
> : > as the congestion is detected.
> : >
> : > Now, I would like to see that something like that is _really_ necessary.
> :
> : Currently, it is not a problem - device mapper reports the device as
> : congested only if the underlying physical disks are congested.
> :
> : But once we change it so that device mapper reports congested state on its
> : own (when it has too many bios in progress), this starts being a problem.
>
> So has this changed since then? If yes then we can think of a proper
> solution but that would require to actually describe why we see the
> congestion, why it does help to wait on the caller rather than the
> allocator etc...

Device mapper doesn't report congested state - but something else does
(perhaps the user inserted a cheap, slow USB stick or SD card?). And device
mapper is just a victim of that.

Why should device mapper sleep because some other random block device is
congested?

> Throwing statements like ...
>
> > I'm sure you'll come up with another creative excuse why GFP_NORETRY
> > allocations need to incur deliberate 100ms delays in block device drivers.
>
> ... is not really productive. I've tried to explain why I am not _sure_ what
> possible side effects such a change might have and your hand waving
> didn't really convince me. MD is not the only user of the page
> allocator...
>
> E.g. why did 41c73a49df31 ("dm bufio: drop the lock when doing GFP_NOIO
> allocation") even add a GFP_NOIO request in the first place when you
> keep retrying and sleeping yourself?

Because mempool uses it. Mempool uses allocations with "GFP_NOIO |
__GFP_NORETRY | __GFP_NOMEMALLOC | __GFP_NOWARN". And so dm-bufio uses
these flags too. dm-bufio is just a big mempool.
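
For reference, mempool_alloc() forces those flags on before calling the
underlying allocator (paraphrased from mm/mempool.c):

        gfp_mask |= __GFP_NOMEMALLOC;   /* don't allocate emergency reserves */
        gfp_mask |= __GFP_NORETRY;      /* don't loop in __alloc_pages */
        gfp_mask |= __GFP_NOWARN;       /* failures are OK */

        /* The first attempt doesn't even enter direct reclaim or do IO. */
        gfp_temp = gfp_mask & ~(__GFP_DIRECT_RECLAIM|__GFP_IO);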

If you argue that these flags are incorrect - then fix mempool_alloc.

> The changelog only describes what but
> doesn't explain why. Or did I misread the code, and this is not the
> allocation that is stalling due to congestion?

Mikulas

2018-06-15 11:56:36

by Michal Hocko

Subject: Re: dm bufio: Reduce dm_bufio_lock contention

On Fri 15-06-18 07:35:07, Mikulas Patocka wrote:
>
>
> On Fri, 15 Jun 2018, Michal Hocko wrote:
>
> > On Thu 14-06-18 14:34:06, Mikulas Patocka wrote:
> > >
> > >
> > > On Thu, 14 Jun 2018, Michal Hocko wrote:
> > >
> > > > On Thu 14-06-18 15:18:58, jing xia wrote:
> > > > [...]
> > > > > PID: 22920 TASK: ffffffc0120f1a00 CPU: 1 COMMAND: "kworker/u8:2"
> > > > > #0 [ffffffc0282af3d0] __switch_to at ffffff8008085e48
> > > > > #1 [ffffffc0282af3f0] __schedule at ffffff8008850cc8
> > > > > #2 [ffffffc0282af450] schedule at ffffff8008850f4c
> > > > > #3 [ffffffc0282af470] schedule_timeout at ffffff8008853a0c
> > > > > #4 [ffffffc0282af520] schedule_timeout_uninterruptible at ffffff8008853aa8
> > > > > #5 [ffffffc0282af530] wait_iff_congested at ffffff8008181b40
> > > >
> > > > This trace doesn't provide the full picture unfortunately. Waiting in
> > > > the direct reclaim means that the underlying bdi is congested. The real
> > > > question is why it doesn't flush IO in time.
> > >
> > > I pointed this out two years ago and you just refused to fix it:
> > > http://lkml.iu.edu/hypermail/linux/kernel/1608.1/04507.html
> >
> > Let me be evil again and let me quote the old discussion:
> > : > I agree that mempool_alloc should _primarily_ sleep on their own
> > : > throttling mechanism. I am not questioning that. I am just saying that
> > : > the page allocator has its own throttling which it relies on and that
> > : > cannot be just ignored because that might have other undesirable side
> > : > effects. So if the right approach is really to never throttle certain
> > : > requests then we have to bail out from a congested nodes/zones as soon
> > : > as the congestion is detected.
> > : >
> > : > Now, I would like to see that something like that is _really_ necessary.
> > :
> > : Currently, it is not a problem - device mapper reports the device as
> > : congested only if the underlying physical disks are congested.
> > :
> > : But once we change it so that device mapper reports congested state on its
> > : own (when it has too many bios in progress), this starts being a problem.
> >
> > So has this changed since then? If yes then we can think of a proper
> > solution but that would require to actually describe why we see the
> > congestion, why it does help to wait on the caller rather than the
> > allocator etc...
>
> Device mapper doesn't report congested state - but something else does
> (perhaps the user inserted a cheap slow usb stick or sdcard?). And device
> mapper is just a victim of that.

Maybe yes and that would require some more debugging to find out,
analyze and think of a proper solution.

> Why should device mapper sleep because some other random block device is
> congested?

Well, the direct reclaim is a way to throttle memory allocations. There
is no real concept of who is asking for the memory. If you do not want
to get blocked then use GFP_NOWAIT.

> > Throwing statements like ...
> >
> > > I'm sure you'll come up with another creative excuse why GFP_NORETRY
> > > allocations need to incur deliberate 100ms delays in block device drivers.
> >
> > ... is not really productive. I've tried to explain why I am not _sure_ what
> > possible side effects such a change might have and your hand waving
> > didn't really convince me. MD is not the only user of the page
> > allocator...
> >
> > E.g. why has 41c73a49df31 ("dm bufio: drop the lock when doing GFP_NOIO
> > allocation") even added GFP_NOIO request in the first place when you
> > keep retrying and sleep yourself?
>
> Because mempool uses it. Mempool uses allocations with "GFP_NOIO |
> __GFP_NORETRY | __GFP_NOMEMALLOC | __GFP_NOWARN". An so dm-bufio uses
> these flags too. dm-bufio is just a big mempool.

This doesn't answer my question though. "Somebody else is doing it" is not
an explanation. Prior to your 41c73a49df31 there was no GFP_NOIO
allocation AFAICS. So why do you really need it now? Why cannot you
simply keep retrying GFP_NOWAIT with your own throttling?

Note that I am not trying to say that 41c73a49df31 is wrong, I am merely
trying to understand why this blocking allocation is done in the first place.

> If you argue that these flags are incorrect - then fix mempool_alloc.

AFAICS there is no report about mempool_alloc stalling here. Maybe this
is the same class of problem, honestly, I dunno. And I've already said
that stalling __GFP_NORETRY might be a good way around that but that
needs much more consideration and examination of existing users. I am not
aware that anybody has done that. Doing changes like that based on a single
user is certainly risky.
--
Michal Hocko
SUSE Labs

2018-06-15 12:50:06

by Mikulas Patocka

Subject: Re: dm bufio: Reduce dm_bufio_lock contention



On Fri, 15 Jun 2018, Michal Hocko wrote:

> On Fri 15-06-18 07:35:07, Mikulas Patocka wrote:
> >
> > Because mempool uses it. Mempool uses allocations with "GFP_NOIO |
> > __GFP_NORETRY | __GFP_NOMEMALLOC | __GFP_NOWARN". And so dm-bufio uses
> > these flags too. dm-bufio is just a big mempool.
>
> This doesn't answer my question though. "Somebody else is doing it" is not
> an explanation. Prior to your 41c73a49df31 there was no GFP_NOIO
> allocation AFAICS. So why do you really need it now? Why cannot you

dm-bufio has always used "GFP_NOIO | __GFP_NORETRY | __GFP_NOMEMALLOC |
__GFP_NOWARN" since kernel 3.2, when it was introduced.

In kernel 4.10, dm-bufio was changed so that it does a GFP_NOWAIT
allocation, then drops the lock and does GFP_NOIO with the lock dropped
(because someone was likely experiencing the same issue that is reported
in this thread) - there are two commits that change it - 9ea61cac0 and
41c73a49df31.
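
In other words, the allocation path after those two commits looks
roughly like this (a simplified sketch, not the exact upstream code):

        struct dm_buffer *b;

        /* First attempt: opportunistic, no reclaim, bufio lock still held. */
        b = alloc_buffer(c, GFP_NOWAIT | __GFP_NOMEMALLOC | __GFP_NOWARN);
        if (b)
                return b;

        /* Second attempt: drop the lock so the direct reclaim done by this
         * GFP_NOIO allocation doesn't stall other users of the client. */
        dm_bufio_unlock(c);
        b = alloc_buffer(c, GFP_NOIO | __GFP_NORETRY | __GFP_NOMEMALLOC |
                         __GFP_NOWARN);
        dm_bufio_lock(c);
        if (b)
                return b;

        /* Otherwise sleep until some buffer is freed, then retry. */
        __wait_for_free_buffer(c);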

> simply keep retrying GFP_NOWAIT with your own throttling?
>
> Note that I am not trying to say that 41c73a49df31 is wrong, I am merely
> trying to understand why this blocking allocation is done in the first place.
>
> > If you argue that these flags are incorrect - then fix mempool_alloc.
>
> AFAICS there is no report about mempool_alloc stalling here. Maybe this

If the page allocator can stall dm-bufio, it can stall mempool_alloc as
well. dm-bufio is just bigger, so it will hit this bug sooner.

> is the same class of problem, honestly, I dunno. And I've already said
> that stalling __GFP_NORETRY might be a good way around that but that
> needs much more consideration and examination of existing users. I am not
> aware that anybody has done that. Doing changes like that based on a single
> user is certainly risky.

Why don't you set any rules for how these flags should be used?

If you use GFP_NOIO | __GFP_NORETRY in your own code and blame other
people for doing so - you are as evil as Linus, who praised people
for reverse-engineering hardware and blamed them for reverse-engineering
bitkeeper :-)

Mikulas

2018-06-15 13:11:29

by Michal Hocko

Subject: Re: dm bufio: Reduce dm_bufio_lock contention

On Fri 15-06-18 08:47:52, Mikulas Patocka wrote:
>
>
> On Fri, 15 Jun 2018, Michal Hocko wrote:
>
> > On Fri 15-06-18 07:35:07, Mikulas Patocka wrote:
> > >
> > > Because mempool uses it. Mempool uses allocations with "GFP_NOIO |
> > > __GFP_NORETRY | __GFP_NOMEMALLOC | __GFP_NOWARN". And so dm-bufio uses
> > > these flags too. dm-bufio is just a big mempool.
> >
> > This doesn't answer my question though. "Somebody else is doing it" is not
> > an explanation. Prior to your 41c73a49df31 there was no GFP_NOIO
> > allocation AFAICS. So why do you really need it now? Why cannot you
>
> dm-bufio has always used "GFP_NOIO | __GFP_NORETRY | __GFP_NOMEMALLOC |
> __GFP_NOWARN" since kernel 3.2, when it was introduced.
>
> In kernel 4.10, dm-bufio was changed so that it does a GFP_NOWAIT
> allocation, then drops the lock and does GFP_NOIO with the lock dropped
> (because someone was likely experiencing the same issue that is reported
> in this thread) - there are two commits that change it - 9ea61cac0 and
> 41c73a49df31.

OK, I see. Is there any fundamental reason why this path has to do one
round of GFP_NOIO, or whether it can keep NOWAIT, drop the lock, sleep
and retry again?

[...]
> > is the same class of problem, honestly, I dunno. And I've already said
> > that stalling __GFP_NORETRY might be a good way around that but that
> > needs much more consideration and existing users examination. I am not
> > aware anybody has done that. Doing changes like that based on a single
> > user is certainly risky.
>
> Why don't you set any rules for how these flags should be used?

It is really hard to change rules during the game. You basically have to
examine all existing users and that is well beyond my time scope. I've
tried that where it was possible. E.g. I took __GFP_REPEAT and turned it
into a well-defined semantic. __GFP_NORETRY is a bigger beast.

Anyway, I believe that it would be much safer to look at the problem
from a highlevel perspective. You seem to be focused on __GFP_NORETRY
little bit too much IMHO. We are not throttling callers which explicitly
do not want to or cannot - see current_may_throttle. Is it possible that
both your md and mempool allocators can either (ab)use PF_LESS_THROTTLE
or use other means? E.g. do you have backing_dev_info at that time?
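
A minimal sketch of the PF_LESS_THROTTLE variant, modeled on how nfsd
and the loop driver use the flag (where exactly to set and clear it in
dm-bufio is the open question):

        unsigned int pflags = current->flags & PF_LESS_THROTTLE;

        current->flags |= PF_LESS_THROTTLE;
        /* With the flag set, current_may_throttle() only throttles us if
         * current->backing_dev_info is NULL or its bdi is write-congested
         * - hence the question about backing_dev_info above. */
        b = alloc_buffer(c, GFP_NOIO | __GFP_NORETRY | __GFP_NOMEMALLOC |
                         __GFP_NOWARN);
        current_restore_flags(pflags, PF_LESS_THROTTLE);
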
--
Michal Hocko
SUSE Labs

2018-06-18 22:12:25

by Mikulas Patocka

Subject: Re: dm bufio: Reduce dm_bufio_lock contention



On Fri, 15 Jun 2018, Michal Hocko wrote:

> On Fri 15-06-18 08:47:52, Mikulas Patocka wrote:
> >
> >
> > On Fri, 15 Jun 2018, Michal Hocko wrote:
> >
> > > On Fri 15-06-18 07:35:07, Mikulas Patocka wrote:
> > > >
> > > > Because mempool uses it. Mempool uses allocations with "GFP_NOIO |
> > > > __GFP_NORETRY | __GFP_NOMEMALLOC | __GFP_NOWARN". And so dm-bufio uses
> > > > these flags too. dm-bufio is just a big mempool.
> > >
> > > This doesn't answer my question though. "Somebody else is doing it" is not
> > > an explanation. Prior to your 41c73a49df31 there was no GFP_NOIO
> > > allocation AFAICS. So why do you really need it now? Why cannot you
> >
> > dm-bufio has always used "GFP_NOIO | __GFP_NORETRY | __GFP_NOMEMALLOC |
> > __GFP_NOWARN" since kernel 3.2, when it was introduced.
> >
> > In kernel 4.10, dm-bufio was changed so that it does a GFP_NOWAIT
> > allocation, then drops the lock and does GFP_NOIO with the lock dropped
> > (because someone was likely experiencing the same issue that is reported
> > in this thread) - there are two commits that change it - 9ea61cac0 and
> > 41c73a49df31.
>
> OK, I see. Is there any fundamental reason why this path has to do one
> round of GFP_NOIO, or whether it can keep NOWAIT, drop the lock, sleep
> and retry again?

If the process is woken up, there was some buffer added to the freelist,
or the refcount of some buffer was dropped to 0. In this case, we don't want
to drop the lock and use GFP_NOIO, because the freed buffer may disappear
when we drop the lock.

> [...]
> > > is the same class of problem, honestly, I dunno. And I've already said
> > > that stalling __GFP_NORETRY might be a good way around that but that
> > > needs much more consideration and examination of existing users. I am not
> > > aware that anybody has done that. Doing changes like that based on a single
> > > user is certainly risky.
> >
> > Why don't you set any rules for how these flags should be used?
>
> It is really hard to change rules during the game. You basically have to
> examine all existing users and that is well beyond my time scope. I've
> tried that where it was possible. E.g. I took __GFP_REPEAT and turned it
> into a well-defined semantic. __GFP_NORETRY is a bigger beast.
>
> Anyway, I believe that it would be much safer to look at the problem
> from a highlevel perspective. You seem to be focused on __GFP_NORETRY
> little bit too much IMHO. We are not throttling callers which explicitly
> do not want to or cannot - see current_may_throttle. Is it possible that
> both your md and mempool allocators can either (ab)use PF_LESS_THROTTLE
> or use other means? E.g. do you have backing_dev_info at that time?
> --
> Michal Hocko
> SUSE Labs

I grepped the kernel for __GFP_NORETRY and triaged them. I found 16 cases
without a fallback - those are bugs that make various functions randomly
return -ENOMEM.

Most of the callers provide a fallback.

There is another strange flag - __GFP_RETRY_MAYFAIL - it provides two
different behaviors - if the allocation is larger than
PAGE_ALLOC_COSTLY_ORDER, it retries the allocation as if it were smaller.
If the allocation is smaller than PAGE_ALLOC_COSTLY_ORDER,
__GFP_RETRY_MAYFAIL will avoid the oom killer (larger order allocations
don't trigger the oom killer at all).

So, perhaps __GFP_RETRY_MAYFAIL could be used instead of __GFP_NORETRY in
the cases where the caller wants to avoid triggering the oom killer (the
problem is that __GFP_NORETRY causes random failure even in no-oom
situations but __GFP_RETRY_MAYFAIL doesn't).
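
As an illustrative (hypothetical) conversion:

        /* Avoid the oom killer but still let reclaim retry, instead of
         * failing almost immediately as __GFP_NORETRY can; a failure
         * path is still required. */
        buffer = kmalloc(size, GFP_KERNEL | __GFP_RETRY_MAYFAIL);
        if (!buffer)
                return -ENOMEM;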


So my suggestion is - fix these obvious bugs when someone allocates memory
with __GFP_NORETRY without any fallback - and then, __GFP_NORETRY could be
just changed to return NULL instead of sleeping.




arch/arm/mm/dma-mapping.c - fallback to a smaller size without __GFP_NORETRY

arch/mips/mm/dma-default.c - says that it uses __GFP_NORETRY to avoid the
oom killer, provides no fallback - it seems to be a BUG

arch/sparc/mm/tsb.c - fallback to a smaller size without __GFP_NORETRY

arch/x86/include/asm/floppy.h - __GFP_NORETRY doesn't seem to serve any
purpose, it may cause random failures during initialization, can be
removed - BUG

arch/powerpc/mm/mmu_context_iommu.c - uses it just during moving pages,
there's no problem with failure

arch/powerpc/platforms/pseries/cmm.c - a vm balloon driver, no problem
with failure

block/bio.c - falls back to mempool

block/blk-mq.c - erroneous use of __GFP_NORETRY during initialization, it
falls back to a smaller size, but doesn't drop the __GFP_NORETRY flag
(BUG)

drivers/gpu/drm/i915/i915_gem.c - it starts with __GFP_NORETRY and on
failure, it ORs it with __GFP_RETRY_MAYFAIL (which of these conflicting
flags wins?)

drivers/gpu/drm/i915/i915_gem_gtt.c - __GFP_NORETRY is used during
initialization (BUG), it shouldn't be used

drivers/gpu/drm/i915/i915_gem_execbuffer.c - fallback to a smaller size without __GFP_NORETRY

drivers/gpu/drm/i915/i915_gem_internal.c - fallback to a smaller size without __GFP_NORETRY

drivers/gpu/drm/i915/i915_gem_userptr.c - seems to provide fallback

drivers/gpu/drm/i915/i915_gpu_error.c - fallback to a smaller size without __GFP_NORETRY

drivers/gpu/drm/etnaviv/etnaviv_dump.c - coredump on error path, no
problem if it fails

drivers/gpu/drm/ttm/ttm_page_alloc.c - uses __GFP_NORETRY for transparent
hugepages, no problem with failure

drivers/gpu/drm/ttm/ttm_page_alloc_dma.c - uses __GFP_NORETRY for
transparent hugepages, no problem with failure

drivers/gpu/drm/msm/msm_gem_submit.c - uses __GFP_NORETRY to process ioctl
and lacks a fallback - it is a BUG - __GFP_NORETRY should be dropped

drivers/hv/hv_balloon.c - a vm balloon driver, no problem with failure

drivers/crypto/chelsio/chtls/chtls_io.c - fallback to a smaller size without __GFP_NORETRY

drivers/xen/balloon.c - a vm balloon driver, no problem with failure

drivers/mtd/mtdcore.c - fallback to a smaller size without __GFP_NORETRY

drivers/md/dm-verity-target.c - skips prefetch on failure, no problem

drivers/md/dm-writecache.c - falls back to a smaller i/os on failure

drivers/md/dm-bufio.c - reserves some buffers on creation and falls back
to them

drivers/md/dm-integrity.c - falls back to sector-by-sector verification

drivers/md/dm-kcopyd.c - falls back to reserved pages

drivers/md/dm-crypt.c - falls back to mempool

drivers/iommu/dma-iommu.c - fallback to a smaller size without __GFP_NORETRY

drivers/mmc/core/mmc_test.c - fallback to a smaller size, but doesn't drop
__GFP_NORETRY - BUG

drivers/staging/android/ion/ion_system_heap.c - fallback to a smaller size without __GFP_NORETRY

fs/cachefiles/ - uses __GFP_NORETRY extensively - since this is just a
cache, failure supposedly shouldn't be a problem - but it's hard to
verify that the whole driver handles failures properly

fs/xfs/xfs_buf.c - uses __GFP_NORETRY only on readahead

fs/fscache/cookie.c - no fallback, but if it fails, the caller will just
invalidate the cache entry - no problem

fs/fscache/page.c - like above - failure will just inhibit caching

fs/nfs/write.c - fails only if allowed by the argument never_fail

include/linux/kexec.h - no fallback - it seems to be a BUG

include/linux/pagemap.h - uses __GFP_NORETRY only on readahead

kernel/bpf/syscall.c - it says it needs __GFP_NORETRY to avoid the oom killer -
provides no fallback, so it seems to be a BUG

kernel/events/ring_buffer.c - falls back to a smaller size, but doesn't drop
__GFP_NORETRY - BUG

kernel/groups.c - falls back to vmalloc (should it use kvmalloc?)

kernel/trace/ring_buffer.c - no fallback - BUG

kernel/trace/trace.c - no fallback - BUG

kernel/power/swap.c - falls back, but there is useless WARN_ON_ONCE(1) on
the fallback path

lib/debugobjects.c - turns off debugging on allocation failure

lib/rhashtable.c - uses __GFP_NORETRY only with GFP_ATOMIC - __GFP_NORETRY
is useless

mm/hugetlb.c - no problem if hugepage allocation fails

mm/mempool.c - falls back to mempool

mm/kmemleak.c - disables kmemleak on allocation failure

mm/page_alloc.c/__page_frag_cache_refill - fallback to a smaller size without __GFP_NORETRY

mm/zswap.c/zswap_pool_create - used during pool creation - probably a BUG
mm/zswap.c/zswap_frontswap_store - used when compressing a page - probably
ok to fail

mm/memcontrol.c - I don't know if it is a bug

mm/shmem.c - used during hugepage allocation - ok to fail

mm/slub.c - fallback to a smaller size without __GFP_NORETRY

mm/util.c - fallback to vmalloc

net/packet/af_packet.c - fallback without __GFP_NORETRY

net/xdp/xsk_queue.c - no fallback - BUG

net/core/skbuff.c - fallback to a smaller size without __GFP_NORETRY

net/core/sock.c - fallback to a smaller size without __GFP_NORETRY

net/netlink/af_netlink.c - it's ok to fail when decreasing the size of skb

net/smc/smc_core.c - falls back to a smaller size, but doesn't drop
__GFP_NORETRY - BUG

net/netfilter/x_tables.c - __GFP_NORETRY is used to avoid the oom killer
- provides no fallback - it seems to be a BUG

security/integrity/ima/ima_crypto.c - fallback to a smaller size without __GFP_NORETRY

sound/core/memalloc.c - __GFP_NORETRY is used to avoid the oom killer
- provides no fallback - it seems to be a BUG

Mikulas

2018-06-19 10:44:08

by Michal Hocko

Subject: Re: dm bufio: Reduce dm_bufio_lock contention

On Mon 18-06-18 18:11:26, Mikulas Patocka wrote:
[...]
> I grepped the kernel for __GFP_NORETRY and triaged them. I found 16 cases
> without a fallback - those are bugs that make various functions randomly
> return -ENOMEM.

Well, maybe those are just optimistic attempts to allocate memory and
have a fallback somewhere else. So I would be careful calling them
outright bugs. But maybe you are right.

> Most of the callers provide a fallback.
>
> There is another strange flag - __GFP_RETRY_MAYFAIL - it provides two
> different behaviors - if the allocation is larger than
> PAGE_ALLOC_COSTLY_ORDER, it retries the allocation as if it were smaller.
> If the allocation is smaller than PAGE_ALLOC_COSTLY_ORDER,
> __GFP_RETRY_MAYFAIL will avoid the oom killer (larger order allocations
> don't trigger the oom killer at all).

Well, the primary purpose of this flag is to provide a consistent
failure behavior for all requests regardless of the size.

> So, perhaps __GFP_RETRY_MAYFAIL could be used instead of __GFP_NORETRY in
> the cases where the caller wants to avoid triggering the oom killer (the
> problem is that __GFP_NORETRY causes random failure even in no-oom
> situations but __GFP_RETRY_MAYFAIL doesn't).

Maybe yes.

> So my suggestion is - fix these obvious bugs when someone allocates memory
> with __GFP_NORETRY without any fallback - and then, __GFP_NORETRY could be
> just changed to return NULL instead of sleeping.

No real objection to fixing wrong __GFP_NORETRY usage. But __GFP_NORETRY
can sleep. Nothing will really change in that regard. It does a
reclaim and that _might_ sleep.

But seriously, isn't the best way around the throttling issue to use
PF_LESS_THROTTLE?
--
Michal Hocko
SUSE Labs

2018-06-22 01:18:25

by Mikulas Patocka

Subject: Re: dm bufio: Reduce dm_bufio_lock contention



On Tue, 19 Jun 2018, Michal Hocko wrote:

> On Mon 18-06-18 18:11:26, Mikulas Patocka wrote:
> [...]
> > I grepped the kernel for __GFP_NORETRY and triaged them. I found 16 cases
> > without a fallback - those are bugs that make various functions randomly
> > return -ENOMEM.
>
> Well, maybe those are just optimistic attempts to allocate memory and
> have a fallback somewhere else. So I would be careful calling them
> outright bugs. But maybe you are right.

I was trying to find the fallback code when I triaged them and marked as
"BUG" those cases where I didn't find it. You can search harder and
perhaps you'll find something that I didn't.

> > Most of the callers provide a fallback.
> >
> > There is another strange flag - __GFP_RETRY_MAYFAIL - it provides two
> > different behaviors - if the allocation is larger than
> > PAGE_ALLOC_COSTLY_ORDER, it retries the allocation as if it were smaller.
> > If the allocation is smaller than PAGE_ALLOC_COSTLY_ORDER,
> > __GFP_RETRY_MAYFAIL will avoid the oom killer (larger order allocations
> > don't trigger the oom killer at all).
>
> Well, the primary purpose of this flag is to provide a consistent
> failure behavior for all requests regardless of the size.
>
> > So, perhaps __GFP_RETRY_MAYFAIL could be used instead of __GFP_NORETRY in
> > the cases where the caller wants to avoid triggering the oom killer (the
> > problem is that __GFP_NORETRY causes random failure even in no-oom
> > situations but __GFP_RETRY_MAYFAIL doesn't).
>
> Maybe yes.
>
> > So my suggestion is - fix these obvious bugs when someone allocates memory
> > with __GFP_NORETRY without any fallback - and then, __GFP_NORETRY could be
> > just changed to return NULL instead of sleeping.
>
> No real objection to fixing wrong __GFP_NORETRY usage. But __GFP_NORETRY
> can sleep. Nothing will really change in that regard. It does a
> reclaim and that _might_ sleep.
>
> But seriously, isn't the best way around the throttling issue to use
> PF_LESS_THROTTLE?

Yes - it could be done by setting PF_LESS_THROTTLE. But I think it would
be better to change it just in one place than to add PF_LESS_THROTTLE to
every block device driver (because adding it to every block driver results
in more code).

What about this patch? If __GFP_NORETRY is set and __GFP_FS is not set
(i.e. the request comes from a block device driver or a filesystem), we
should not sleep.

Signed-off-by: Mikulas Patocka <[email protected]>
Cc: [email protected]

Index: linux-2.6/mm/vmscan.c
===================================================================
--- linux-2.6.orig/mm/vmscan.c
+++ linux-2.6/mm/vmscan.c
@@ -2674,6 +2674,7 @@ static bool shrink_node(pg_data_t *pgdat
* the LRU too quickly.
*/
if (!sc->hibernation_mode && !current_is_kswapd() &&
+ (sc->gfp_mask & (__GFP_NORETRY | __GFP_FS)) != __GFP_NORETRY &&
current_may_throttle() && pgdat_memcg_congested(pgdat, root))
wait_iff_congested(BLK_RW_ASYNC, HZ/10);


2018-06-22 09:03:13

by Michal Hocko

Subject: Re: dm bufio: Reduce dm_bufio_lock contention

On Thu 21-06-18 21:17:24, Mikulas Patocka wrote:
[...]
> > But seriously, isn't the best way around the throttling issue to use
> > PF_LESS_THROTTLE?
>
> Yes - it could be done by setting PF_LESS_THROTTLE. But I think it would
> be better to change it just in one place than to add PF_LESS_THROTTLE to
> every block device driver (because adding it to every block driver results
> in more code).

Why would every block device need this? I thought we were talking about
mempool allocator and the md variant of it. They are explicitly doing
their own back off so PF_LESS_THROTTLE sounds appropriate to me.

> What about this patch? If __GFP_NORETRY is set and __GFP_FS is not set
> (i.e. the request comes from a block device driver or a filesystem), we
> should not sleep.

Why? How are you going to audit all the callers to check that the
behavior makes sense, and moreover how are you going to ensure that
future usage will still make sense? The more subtle side effects gfp
flags have, the harder they are to maintain.

> Signed-off-by: Mikulas Patocka <[email protected]>
> Cc: [email protected]
>
> Index: linux-2.6/mm/vmscan.c
> ===================================================================
> --- linux-2.6.orig/mm/vmscan.c
> +++ linux-2.6/mm/vmscan.c
> @@ -2674,6 +2674,7 @@ static bool shrink_node(pg_data_t *pgdat
> * the LRU too quickly.
> */
> if (!sc->hibernation_mode && !current_is_kswapd() &&
> + (sc->gfp_mask & (__GFP_NORETRY | __GFP_FS)) != __GFP_NORETRY &&
> current_may_throttle() && pgdat_memcg_congested(pgdat, root))
> wait_iff_congested(BLK_RW_ASYNC, HZ/10);
>

--
Michal Hocko
SUSE Labs

2018-06-22 09:10:40

by Michal Hocko

Subject: Re: dm bufio: Reduce dm_bufio_lock contention

On Fri 22-06-18 11:01:51, Michal Hocko wrote:
> On Thu 21-06-18 21:17:24, Mikulas Patocka wrote:
[...]
> > What about this patch? If __GFP_NORETRY is set and __GFP_FS is not set
> > (i.e. the request comes from a block device driver or a filesystem), we
> > should not sleep.
>
> Why? How are you going to audit all the callers to check that the
> behavior makes sense, and moreover how are you going to ensure that
> future usage will still make sense? The more subtle side effects gfp
> flags have, the harder they are to maintain.

So just as an exercise, try to explain the above semantic to users. We
currently have the following.

* __GFP_NORETRY: The VM implementation will try only very lightweight
* memory direct reclaim to get some memory under memory pressure (thus
* it can sleep). It will avoid disruptive actions like OOM killer. The
* caller must handle the failure which is quite likely to happen under
* heavy memory pressure. The flag is suitable when failure can easily be
* handled at small cost, such as reduced throughput

* __GFP_FS can call down to the low-level FS. Clearing the flag avoids the
* allocator recursing into the filesystem which might already be holding
* locks.

So how are you going to explain gfp & (__GFP_NORETRY | ~__GFP_FS)? What
is the actual semantic, without explaining the whole reclaim path or
forcing users to look into the code to understand it? What about
GFP_NOIO | __GFP_NORETRY? What does "should not sleep" mean there? Do
all shrinkers have to follow that as well?
--
Michal Hocko
SUSE Labs

2018-06-22 12:46:01

by Mikulas Patocka

Subject: Re: dm bufio: Reduce dm_bufio_lock contention



On Fri, 22 Jun 2018, Michal Hocko wrote:

> On Thu 21-06-18 21:17:24, Mikulas Patocka wrote:
> [...]
> > > But seriously, isn't the best way around the throttling issue to use
> > > PF_LESS_THROTTLE?
> >
> > Yes - it could be done by setting PF_LESS_THROTTLE. But I think it would
> > be better to change it just in one place than to add PF_LESS_THROTTLE to
> > every block device driver (because adding it to every block driver results
> > in more code).
>
> Why would every block device need this? I thought we were talking about
> mempool allocator and the md variant of it. They are explicitly doing
> their own back off so PF_LESS_THROTTLE sounds appropriate to me.

Because every block driver is susceptible to this problem. Two years ago,
there was a bug where, when the user was swapping to a dm-crypt device, memory
management would stall the allocations done by dm-crypt by 100ms - that
slowed down the swapping rate and made the machine unusable.

Then, people are complaining about the same problem in dm-bufio.

Next time, it will be something else.

(you will answer: "we will not fix bugs unless people are reporting them" :-)

> > What about this patch? If __GFP_NORETRY is set and __GFP_FS is not set (i.e. the
> > request comes from a block device driver or a filesystem), we should not
> > sleep.
>
> Why? How are you going to audit all the callers to check that the behavior
> makes sense, and moreover how are you going to ensure that future usage will
> still make sense? The more subtle side effects gfp flags have, the harder
> they are to maintain.

I did audit them - see the previous email in this thread:
https://www.redhat.com/archives/dm-devel/2018-June/thread.html

Mikulas

> > Signed-off-by: Mikulas Patocka <[email protected]>
> > Cc: [email protected]
> >
> > Index: linux-2.6/mm/vmscan.c
> > ===================================================================
> > --- linux-2.6.orig/mm/vmscan.c
> > +++ linux-2.6/mm/vmscan.c
> > @@ -2674,6 +2674,7 @@ static bool shrink_node(pg_data_t *pgdat
> > * the LRU too quickly.
> > */
> > if (!sc->hibernation_mode && !current_is_kswapd() &&
> > + (sc->gfp_mask & (__GFP_NORETRY | __GFP_FS)) != __GFP_NORETRY &&
> > current_may_throttle() && pgdat_memcg_congested(pgdat, root))
> > wait_iff_congested(BLK_RW_ASYNC, HZ/10);
> >
>
> --
> Michal Hocko
> SUSE Labs
>

2018-06-22 12:53:17

by Mikulas Patocka

[permalink] [raw]
Subject: Re: dm bufio: Reduce dm_bufio_lock contention



On Fri, 22 Jun 2018, Michal Hocko wrote:

> On Fri 22-06-18 11:01:51, Michal Hocko wrote:
> > On Thu 21-06-18 21:17:24, Mikulas Patocka wrote:
> [...]
> > > What about this patch? If __GFP_NORETRY is set and __GFP_FS is not set (i.e. the
> > > request comes from a block device driver or a filesystem), we should not
> > > sleep.
> >
> > Why? How are you going to audit all the callers to check that the behavior
> > makes sense, and moreover how are you going to ensure that future usage will
> > still make sense? The more subtle side effects gfp flags have, the harder
> > they are to maintain.
>
> So just as an exercise, try to explain the above semantics to users. We
> currently have the following.
>
> * __GFP_NORETRY: The VM implementation will try only very lightweight
> * memory direct reclaim to get some memory under memory pressure (thus
> * it can sleep). It will avoid disruptive actions like OOM killer. The
> * caller must handle the failure which is quite likely to happen under
> * heavy memory pressure. The flag is suitable when failure can easily be
> * handled at small cost, such as reduced throughput
>
> * __GFP_FS can call down to the low-level FS. Clearing the flag avoids the
> * allocator recursing into the filesystem which might already be holding
> * locks.
>
> So how are you going to explain gfp & (__GFP_NORETRY | ~__GFP_FS)? What
> are the actual semantics without explaining the whole reclaim or forcing
> users to look into the code to understand that? What about GFP_NOIO |
> __GFP_NORETRY? What does that mean for "should not sleep"? Do all
> shrinkers have to follow that as well?

My reasoning was that there is broken code that uses __GFP_NORETRY and
assumes that it can't fail - so conditioning the change on !__GFP_FS would
minimize the disruption to the broken code.

Anyway - if you want to test only on __GFP_NORETRY (and fix those 16
broken cases that assume that __GFP_NORETRY can't fail), I'm OK with that.

Mikulas


Index: linux-2.6/mm/vmscan.c
===================================================================
--- linux-2.6.orig/mm/vmscan.c
+++ linux-2.6/mm/vmscan.c
@@ -2674,6 +2674,7 @@ static bool shrink_node(pg_data_t *pgdat
* the LRU too quickly.
*/
if (!sc->hibernation_mode && !current_is_kswapd() &&
+ !(sc->gfp_mask & __GFP_NORETRY) &&
current_may_throttle() && pgdat_memcg_congested(pgdat, root))
wait_iff_congested(BLK_RW_ASYNC, HZ/10);


2018-06-22 13:06:52

by Michal Hocko

[permalink] [raw]
Subject: Re: dm bufio: Reduce dm_bufio_lock contention

On Fri 22-06-18 08:52:09, Mikulas Patocka wrote:
>
>
> On Fri, 22 Jun 2018, Michal Hocko wrote:
>
> > On Fri 22-06-18 11:01:51, Michal Hocko wrote:
> > > On Thu 21-06-18 21:17:24, Mikulas Patocka wrote:
> > [...]
> > > > What about this patch? If __GFP_NORETRY is set and __GFP_FS is not set (i.e. the
> > > > request comes from a block device driver or a filesystem), we should not
> > > > sleep.
> > >
> > > Why? How are you going to audit all the callers to check that the behavior
> > > makes sense, and moreover how are you going to ensure that future usage will
> > > still make sense? The more subtle side effects gfp flags have, the harder
> > > they are to maintain.
> >
> > So just as an exercise, try to explain the above semantics to users. We
> > currently have the following.
> >
> > * __GFP_NORETRY: The VM implementation will try only very lightweight
> > * memory direct reclaim to get some memory under memory pressure (thus
> > * it can sleep). It will avoid disruptive actions like OOM killer. The
> > * caller must handle the failure which is quite likely to happen under
> > * heavy memory pressure. The flag is suitable when failure can easily be
> > * handled at small cost, such as reduced throughput
> >
> > * __GFP_FS can call down to the low-level FS. Clearing the flag avoids the
> > * allocator recursing into the filesystem which might already be holding
> > * locks.
> >
> > So how are you going to explain gfp & (__GFP_NORETRY | ~__GFP_FS)? What
> > are the actual semantics without explaining the whole reclaim or forcing
> > users to look into the code to understand that? What about GFP_NOIO |
> > __GFP_NORETRY? What does that mean for "should not sleep"? Do all
> > shrinkers have to follow that as well?
>
> My reasoning was that there is broken code that uses __GFP_NORETRY and
> assumes that it can't fail - so conditioning the change on !__GFP_FS would
> minimize the disruption to the broken code.
>
> Anyway - if you want to test only on __GFP_NORETRY (and fix those 16
> broken cases that assume that __GFP_NORETRY can't fail), I'm OK with that.

As I've already said, this is a subtle change which is really hard to
reason about. Throttling on congestion has its meaning and reason. Look
at why we are doing that in the first place. You cannot simply say this
is ok based on your specific usecase. We do have means to achieve that.
It is explicit and thus it will be applied only where it makes sense.
You keep repeating that implicit behavior change for everybody is
better. I guess we will not agree on that part. I consider it a hack
rather than a systematic solution. I can easily imagine that we just
find out other call sites that would cause over reclaim or similar
problems because they are not throttled on too many dirty pages due to
congested storage. What are we going to do then? Add another magic gfp
flag? Really, try to not add even more subtle side effects for gfp
flags. We _do_ have ways to accomplish what your particular usecase
requires.
--
Michal Hocko
SUSE Labs

2018-06-22 13:11:24

by Michal Hocko

[permalink] [raw]
Subject: Re: dm bufio: Reduce dm_bufio_lock contention

On Fri 22-06-18 08:44:52, Mikulas Patocka wrote:
> On Fri, 22 Jun 2018, Michal Hocko wrote:
[...]
> > Why? How are you going to audit all the callers to check that the behavior
> > makes sense, and moreover how are you going to ensure that future usage will
> > still make sense? The more subtle side effects gfp flags have, the harder
> > they are to maintain.
>
> I did audit them - see the previous email in this thread:
> https://www.redhat.com/archives/dm-devel/2018-June/thread.html

I do not see any mention of throttling expectations for those users.
You have focused only on the allocation failure fallback, AFAIR.
--
Michal Hocko
SUSE Labs

2018-06-22 18:48:00

by Mikulas Patocka

[permalink] [raw]
Subject: Re: dm bufio: Reduce dm_bufio_lock contention



On Fri, 22 Jun 2018, Michal Hocko wrote:

> On Fri 22-06-18 08:44:52, Mikulas Patocka wrote:
> > On Fri, 22 Jun 2018, Michal Hocko wrote:
> [...]
> > > Why? How are you going to audit all the callers to check that the behavior
> > > makes sense, and moreover how are you going to ensure that future usage will
> > > still make sense? The more subtle side effects gfp flags have, the harder
> > > they are to maintain.
> >
> > I did audit them - see the previous email in this thread:
> > https://www.redhat.com/archives/dm-devel/2018-June/thread.html
>
> I do not see any mention of throttling expectations for those users.
> You have focused only on the allocation failure fallback, AFAIR.

How should the callers be analyzed with respect to throttling?

Mikulas

2018-06-22 18:58:35

by Mikulas Patocka

[permalink] [raw]
Subject: Re: dm bufio: Reduce dm_bufio_lock contention



On Fri, 22 Jun 2018, Michal Hocko wrote:

> On Fri 22-06-18 08:52:09, Mikulas Patocka wrote:
> >
> >
> > On Fri, 22 Jun 2018, Michal Hocko wrote:
> >
> > > On Fri 22-06-18 11:01:51, Michal Hocko wrote:
> > > > On Thu 21-06-18 21:17:24, Mikulas Patocka wrote:
> > > [...]
> > > > > What about this patch? If __GFP_NORETRY is set and __GFP_FS is not set (i.e. the
> > > > > request comes from a block device driver or a filesystem), we should not
> > > > > sleep.
> > > >
> > > > Why? How are you going to audit all the callers to check that the behavior
> > > > makes sense, and moreover how are you going to ensure that future usage will
> > > > still make sense? The more subtle side effects gfp flags have, the harder
> > > > they are to maintain.
> > >
> > > So just as an exercise, try to explain the above semantics to users. We
> > > currently have the following.
> > >
> > > * __GFP_NORETRY: The VM implementation will try only very lightweight
> > > * memory direct reclaim to get some memory under memory pressure (thus
> > > * it can sleep). It will avoid disruptive actions like OOM killer. The
> > > * caller must handle the failure which is quite likely to happen under
> > > * heavy memory pressure. The flag is suitable when failure can easily be
> > > * handled at small cost, such as reduced throughput
> > >
> > > * __GFP_FS can call down to the low-level FS. Clearing the flag avoids the
> > > * allocator recursing into the filesystem which might already be holding
> > > * locks.
> > >
> > > So how are you going to explain gfp & (__GFP_NORETRY | ~__GFP_FS)? What
> > > are the actual semantics without explaining the whole reclaim or forcing
> > > users to look into the code to understand that? What about GFP_NOIO |
> > > __GFP_NORETRY? What does that mean for "should not sleep"? Do all
> > > shrinkers have to follow that as well?
> >
> > My reasoning was that there is broken code that uses __GFP_NORETRY and
> > assumes that it can't fail - so conditioning the change on !__GFP_FS would
> > minimize the disruption to the broken code.
> >
> > Anyway - if you want to test only on __GFP_NORETRY (and fix those 16
> > broken cases that assume that __GFP_NORETRY can't fail), I'm OK with that.
>
> As I've already said, this is a subtle change which is really hard to
> reason about. Throttling on congestion has its meaning and reason. Look
> at why we are doing that in the first place. You cannot simply say this

So - explain why throttling is needed. You support throttling, I don't, so
you have to explain it :)

> is ok based on your specific usecase. We do have means to achieve that.
> It is explicit and thus it will be applied only where it makes sense.
> You keep repeating that implicit behavior change for everybody is
> better.

I don't want to change it for everybody. I want to change it for block
device drivers. I don't care what you do with non-block drivers.

> I guess we will not agree on that part. I consider it a hack
> rather than a systematic solution. I can easily imagine that we just
> find out other call sites that would cause over reclaim or similar

If a __GFP_NORETRY allocation does overreclaim - it could be fixed by
returning NULL instead of doing overreclaim. The specification says that
callers must handle failure of __GFP_NORETRY allocations.

So yes - if you think that just skipping throttling on __GFP_NORETRY could
cause excessive CPU consumption trying to reclaim unreclaimable pages or
something like that - then you can add more points where the __GFP_NORETRY
is failed with NULL to avoid the excessive CPU consumption.
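
The expected caller idiom is an opportunistic __GFP_NORETRY attempt backed
by a fallback that cannot fail - the same pattern crypt_map() uses for its
tag allocation in the patch further down. A minimal sketch (names
illustrative):

	/* opportunistic: may fail quickly under memory pressure */
	buf = kmalloc(len, GFP_NOIO | __GFP_NORETRY | __GFP_NOWARN);
	if (!buf)
		/* guaranteed fallback from a preallocated reserve */
		buf = mempool_alloc(&pool, GFP_NOIO);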

> problems because they are not throttled on too many dirty pages due to
> congested storage. What are we going to do then? Add another magic gfp
> flag? Really, try to not add even more subtle side effects for gfp
> flags. We _do_ have ways to accomplish what your particular usecase
> requires.
>
> --
> Michal Hocko
> SUSE Labs

Mikulas

2018-06-25 09:12:13

by Michal Hocko

[permalink] [raw]
Subject: Re: dm bufio: Reduce dm_bufio_lock contention

On Fri 22-06-18 14:57:10, Mikulas Patocka wrote:
>
>
> On Fri, 22 Jun 2018, Michal Hocko wrote:
>
> > On Fri 22-06-18 08:52:09, Mikulas Patocka wrote:
> > >
> > >
> > > On Fri, 22 Jun 2018, Michal Hocko wrote:
> > >
> > > > On Fri 22-06-18 11:01:51, Michal Hocko wrote:
> > > > > On Thu 21-06-18 21:17:24, Mikulas Patocka wrote:
> > > > [...]
> > > > > > What about this patch? If __GFP_NORETRY is set and __GFP_FS is not set (i.e. the
> > > > > > request comes from a block device driver or a filesystem), we should not
> > > > > > sleep.
> > > > >
> > > > > Why? How are you going to audit all the callers to check that the behavior
> > > > > makes sense, and moreover how are you going to ensure that future usage will
> > > > > still make sense? The more subtle side effects gfp flags have, the harder
> > > > > they are to maintain.
> > > >
> > > > So just as an exercise, try to explain the above semantics to users. We
> > > > currently have the following.
> > > >
> > > > * __GFP_NORETRY: The VM implementation will try only very lightweight
> > > > * memory direct reclaim to get some memory under memory pressure (thus
> > > > * it can sleep). It will avoid disruptive actions like OOM killer. The
> > > > * caller must handle the failure which is quite likely to happen under
> > > > * heavy memory pressure. The flag is suitable when failure can easily be
> > > > * handled at small cost, such as reduced throughput
> > > >
> > > > * __GFP_FS can call down to the low-level FS. Clearing the flag avoids the
> > > > * allocator recursing into the filesystem which might already be holding
> > > > * locks.
> > > >
> > > > So how are you going to explain gfp & (__GFP_NORETRY | ~__GFP_FS)? What
> > > > are the actual semantics without explaining the whole reclaim or forcing
> > > > users to look into the code to understand that? What about GFP_NOIO |
> > > > __GFP_NORETRY? What does that mean for "should not sleep"? Do all
> > > > shrinkers have to follow that as well?
> > >
> > > My reasoning was that there is broken code that uses __GFP_NORETRY and
> > > assumes that it can't fail - so conditioning the change on !__GFP_FS would
> > > minimize the disruption to the broken code.
> > >
> > > Anyway - if you want to test only on __GFP_NORETRY (and fix those 16
> > > broken cases that assume that __GFP_NORETRY can't fail), I'm OK with that.
> >
> > As I've already said, this is a subtle change which is really hard to
> > reason about. Throttling on congestion has its meaning and reason. Look
> > at why we are doing that in the first place. You cannot simply say this
>
> So - explain why throttling is needed. You support throttling, I don't, so
> you have to explain it :)
>
> > is ok based on your specific usecase. We do have means to achieve that.
> > It is explicit and thus it will be applied only where it makes sense.
> > You keep repeating that implicit behavior change for everybody is
> > better.
>
> I don't want to change it for everybody. I want to change it for block
> device drivers. I don't care what you do with non-block drivers.

Well, it is usually the onus of the patch submitter to justify any change.
But let me be nice to you, for once. This throttling is triggered only
if all the pages we have encountered during the reclaim attempt are
dirty, and that means that we are rushing through the LRU list quicker
than the flushers are able to clean. If we didn't throttle we could hit
stronger reclaim priorities (aka scan more to reclaim memory) and
reclaim more pages as a result.
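
For reference, the stall under discussion is the tail of shrink_node(),
quoted in the patches above:

	if (!sc->hibernation_mode && !current_is_kswapd() &&
	    current_may_throttle() && pgdat_memcg_congested(pgdat, root))
		wait_iff_congested(BLK_RW_ASYNC, HZ/10);	/* up to 100ms */

current_may_throttle() is where PF_LESS_THROTTLE takes effect, which is why
setting the flag is the existing way to opt an allocation context out of
this stall.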

> > I guess we will not agree on that part. I consider it a hack
> > rather than a systematic solution. I can easily imagine that we just
> > find out other call sites that would cause over reclaim or similar
>
> If a __GFP_NORETRY allocation does overreclaim - it could be fixed by
> returning NULL instead of doing overreclaim. The specification says that
> callers must handle failure of __GFP_NORETRY allocations.
>
> So yes - if you think that just skipping throttling on __GFP_NORETRY could
> cause excessive CPU consumption trying to reclaim unreclaimable pages or
> something like that - then you can add more points where the __GFP_NORETRY
> is failed with NULL to avoid the excessive CPU consumption.

Which is exactly something I do not want to do. Spread __GFP_NORETRY all
over the reclaim code. Especially for something we already have means
for...
--
Michal Hocko
SUSE Labs

2018-06-25 13:54:32

by Mikulas Patocka

[permalink] [raw]
Subject: Re: dm bufio: Reduce dm_bufio_lock contention


On Mon, 25 Jun 2018, Michal Hocko wrote:

> On Fri 22-06-18 14:57:10, Mikulas Patocka wrote:
> >
> >
> > On Fri, 22 Jun 2018, Michal Hocko wrote:
> >
> > > On Fri 22-06-18 08:52:09, Mikulas Patocka wrote:
> > > >
> > > >
> > > > On Fri, 22 Jun 2018, Michal Hocko wrote:
> > > >
> > > > > On Fri 22-06-18 11:01:51, Michal Hocko wrote:
> > > > > > On Thu 21-06-18 21:17:24, Mikulas Patocka wrote:
> > > > > [...]
> > > > > > > What about this patch? If __GFP_NORETRY is set and __GFP_FS is not set (i.e. the
> > > > > > > request comes from a block device driver or a filesystem), we should not
> > > > > > > sleep.
> > > > > >
> > > > > > Why? How are you going to audit all the callers to check that the behavior
> > > > > > makes sense, and moreover how are you going to ensure that future usage will
> > > > > > still make sense? The more subtle side effects gfp flags have, the harder
> > > > > > they are to maintain.
> > > > >
> > > > > So just as an exercise, try to explain the above semantics to users. We
> > > > > currently have the following.
> > > > >
> > > > > * __GFP_NORETRY: The VM implementation will try only very lightweight
> > > > > * memory direct reclaim to get some memory under memory pressure (thus
> > > > > * it can sleep). It will avoid disruptive actions like OOM killer. The
> > > > > * caller must handle the failure which is quite likely to happen under
> > > > > * heavy memory pressure. The flag is suitable when failure can easily be
> > > > > * handled at small cost, such as reduced throughput
> > > > >
> > > > > * __GFP_FS can call down to the low-level FS. Clearing the flag avoids the
> > > > > * allocator recursing into the filesystem which might already be holding
> > > > > * locks.
> > > > >
> > > > > So how are you going to explain gfp & (__GFP_NORETRY | ~__GFP_FS)? What
> > > > > are the actual semantics without explaining the whole reclaim or forcing
> > > > > users to look into the code to understand that? What about GFP_NOIO |
> > > > > __GFP_NORETRY? What does that mean for "should not sleep"? Do all
> > > > > shrinkers have to follow that as well?
> > > >
> > > > My reasoning was that there is broken code that uses __GFP_NORETRY and
> > > > assumes that it can't fail - so conditioning the change on !__GFP_FS would
> > > > minimize the disruption to the broken code.
> > > >
> > > > Anyway - if you want to test only on __GFP_NORETRY (and fix those 16
> > > > broken cases that assume that __GFP_NORETRY can't fail), I'm OK with that.
> > >
> > > As I've already said, this is a subtle change which is really hard to
> > > reason about. Throttling on congestion has its meaning and reason. Look
> > > at why we are doing that in the first place. You cannot simply say this
> >
> > So - explain why throttling is needed. You support throttling, I don't, so
> > you have to explain it :)
> >
> > > is ok based on your specific usecase. We do have means to achieve that.
> > > It is explicit and thus it will be applied only where it makes sense.
> > > You keep repeating that implicit behavior change for everybody is
> > > better.
> >
> > I don't want to change it for everybody. I want to change it for block
> > device drivers. I don't care what you do with non-block drivers.
>
> Well, it is usually the onus of the patch submitter to justify any change.
> But let me be nice to you, for once. This throttling is triggered only
> if all the pages we have encountered during the reclaim attempt are
> dirty, and that means that we are rushing through the LRU list quicker
> than the flushers are able to clean. If we didn't throttle we could hit
> stronger reclaim priorities (aka scan more to reclaim memory) and
> reclaim more pages as a result.

And the throttling in dm-bufio prevents kswapd from making forward
progress, causing this situation...

> > I'm sure you'll come up with another creative excuse why GFP_NORETRY
> > allocations need to incur deliberate 100ms delays in block device drivers.
>
> ... is not really productive. I've tried to explain why I am not _sure_ what
> possible side effects such a change might have and your hand waving
> didn't really convince me. MD is not the only user of the page
> allocator...

But you are just doing that now - you're just coming up with another great
excuse why block device drivers need to sleep for 100ms. The system slows to a
crawl when block device requests take 100ms, and you - instead of fixing it
- are just arguing that it is needed.

> > > I guess we will not agree on that part. I consider it a hack
> > > rather than a systematic solution. I can easily imagine that we just
> > > find out other call sites that would cause over reclaim or similar
> >
> > If a __GFP_NORETRY allocation does overreclaim - it could be fixed by
> > returning NULL instead of doing overreclaim. The specification says that
> > callers must handle failure of __GFP_NORETRY allocations.
> >
> > So yes - if you think that just skipping throttling on __GFP_NORETRY could
> > cause excessive CPU consumption trying to reclaim unreclaimable pages or
> > something like that - then you can add more points where the __GFP_NORETRY
> > is failed with NULL to avoid the excessive CPU consumption.
>
> Which is exactly something I do not want to do. Spread __GFP_NORETRY all
> over the reclaim code. Especially for something we already have means
> for...

And so what do you want to do to prevent block drivers from sleeping?

Mikulas

2018-06-25 14:17:25

by Michal Hocko

[permalink] [raw]
Subject: Re: dm bufio: Reduce dm_bufio_lock contention

On Mon 25-06-18 09:53:34, Mikulas Patocka wrote:
>
> On Mon, 25 Jun 2018, Michal Hocko wrote:
>
> > On Fri 22-06-18 14:57:10, Mikulas Patocka wrote:
> > >
> > >
> > > On Fri, 22 Jun 2018, Michal Hocko wrote:
> > >
> > > > On Fri 22-06-18 08:52:09, Mikulas Patocka wrote:
> > > > >
> > > > >
> > > > > On Fri, 22 Jun 2018, Michal Hocko wrote:
> > > > >
> > > > > > On Fri 22-06-18 11:01:51, Michal Hocko wrote:
> > > > > > > On Thu 21-06-18 21:17:24, Mikulas Patocka wrote:
> > > > > > [...]
> > > > > > > > What about this patch? If __GFP_NORETRY is set and __GFP_FS is not set (i.e. the
> > > > > > > > request comes from a block device driver or a filesystem), we should not
> > > > > > > > sleep.
> > > > > > >
> > > > > > > Why? How are you going to audit all the callers to check that the behavior
> > > > > > > makes sense, and moreover how are you going to ensure that future usage will
> > > > > > > still make sense? The more subtle side effects gfp flags have, the harder
> > > > > > > they are to maintain.
> > > > > >
> > > > > > So just as an exercise, try to explain the above semantics to users. We
> > > > > > currently have the following.
> > > > > >
> > > > > > * __GFP_NORETRY: The VM implementation will try only very lightweight
> > > > > > * memory direct reclaim to get some memory under memory pressure (thus
> > > > > > * it can sleep). It will avoid disruptive actions like OOM killer. The
> > > > > > * caller must handle the failure which is quite likely to happen under
> > > > > > * heavy memory pressure. The flag is suitable when failure can easily be
> > > > > > * handled at small cost, such as reduced throughput
> > > > > >
> > > > > > * __GFP_FS can call down to the low-level FS. Clearing the flag avoids the
> > > > > > * allocator recursing into the filesystem which might already be holding
> > > > > > * locks.
> > > > > >
> > > > > > So how are you going to explain gfp & (__GFP_NORETRY | ~__GFP_FS)? What
> > > > > > are the actual semantics without explaining the whole reclaim or forcing
> > > > > > users to look into the code to understand that? What about GFP_NOIO |
> > > > > > __GFP_NORETRY? What does that mean for "should not sleep"? Do all
> > > > > > shrinkers have to follow that as well?
> > > > >
> > > > > My reasoning was that there is broken code that uses __GFP_NORETRY and
> > > > > assumes that it can't fail - so conditioning the change on !__GFP_FS would
> > > > > minimize the disruption to the broken code.
> > > > >
> > > > > Anyway - if you want to test only on __GFP_NORETRY (and fix those 16
> > > > > broken cases that assume that __GFP_NORETRY can't fail), I'm OK with that.
> > > >
> > > > As I've already said, this is a subtle change which is really hard to
> > > > reason about. Throttling on congestion has its meaning and reason. Look
> > > > at why we are doing that in the first place. You cannot simply say this
> > >
> > > So - explain why throttling is needed. You support throttling, I don't, so
> > > you have to explain it :)
> > >
> > > > is ok based on your specific usecase. We do have means to achieve that.
> > > > It is explicit and thus it will be applied only where it makes sense.
> > > > You keep repeating that implicit behavior change for everybody is
> > > > better.
> > >
> > > I don't want to change it for everybody. I want to change it for block
> > > device drivers. I don't care what you do with non-block drivers.
> >
> > Well, it is usually the onus of the patch submitter to justify any change.
> > But let me be nice to you, for once. This throttling is triggered only
> > if all the pages we have encountered during the reclaim attempt are
> > dirty, and that means that we are rushing through the LRU list quicker
> > than the flushers are able to clean. If we didn't throttle we could hit
> > stronger reclaim priorities (aka scan more to reclaim memory) and
> > reclaim more pages as a result.
>
> And the throttling in dm-bufio prevents kswapd from making forward
> progress, causing this situation...

Which is what we have PF_LESS_THROTTLE for. Geez, do we have to go in
circles like that? Are you even listening?

[...]

> And so what do you want to do to prevent block drivers from sleeping?

use the existing means we have.
--
Michal Hocko
SUSE Labs

2018-06-25 14:43:35

by Mikulas Patocka

[permalink] [raw]
Subject: Re: dm bufio: Reduce dm_bufio_lock contention



On Mon, 25 Jun 2018, Michal Hocko wrote:

> > And the throttling in dm-bufio prevents kswapd from making forward
> > progress, causing this situation...
>
> Which is what we have PF_LESS_THROTTLE for. Geez, do we have to go in
> circles like that? Are you even listening?
>
> [...]
>
> > And so what do you want to do to prevent block drivers from sleeping?
>
> use the existing means we have.
> --
> Michal Hocko
> SUSE Labs

So - do you want this patch?

There is no behavior difference between changing the allocator (so that it
implies PF_LESS_THROTTLE for block drivers) and changing all the block
drivers to explicitly set PF_LESS_THROTTLE.

But if you insist that the allocator can't be changed, we have to repeat
the same code over and over again in the block drivers.
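
If the open-coded save/set/restore really has to appear in every driver, it
could at least be factored into helpers modeled on memalloc_noio_save() /
memalloc_noio_restore(). A hypothetical sketch (these helpers do not exist
in the tree):

	static inline unsigned int memalloc_less_throttle_save(void)
	{
		unsigned int flags = current->flags & PF_LESS_THROTTLE;
		current->flags |= PF_LESS_THROTTLE;
		return flags;
	}

	static inline void memalloc_less_throttle_restore(unsigned int flags)
	{
		current->flags = (current->flags & ~PF_LESS_THROTTLE) | flags;
	}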

Mikulas



---
block/bio.c | 4 ++++
drivers/md/dm-bufio.c | 14 +++++++++++---
drivers/md/dm-crypt.c | 8 ++++++++
drivers/md/dm-integrity.c | 4 ++++
drivers/md/dm-kcopyd.c | 3 +++
drivers/md/dm-verity-target.c | 4 ++++
drivers/md/dm-writecache.c | 4 ++++
mm/mempool.c | 4 ++++
8 files changed, 42 insertions(+), 3 deletions(-)

Index: linux-2.6/mm/mempool.c
===================================================================
--- linux-2.6.orig/mm/mempool.c 2018-06-25 16:32:19.210000000 +0200
+++ linux-2.6/mm/mempool.c 2018-06-25 16:32:19.200000000 +0200
@@ -369,6 +369,7 @@ void *mempool_alloc(mempool_t *pool, gfp
unsigned long flags;
wait_queue_entry_t wait;
gfp_t gfp_temp;
+ unsigned old_flags;

VM_WARN_ON_ONCE(gfp_mask & __GFP_ZERO);
might_sleep_if(gfp_mask & __GFP_DIRECT_RECLAIM);
@@ -381,7 +382,10 @@ void *mempool_alloc(mempool_t *pool, gfp

repeat_alloc:

+ old_flags = current->flags & PF_LESS_THROTTLE;
+ current->flags |= PF_LESS_THROTTLE;
element = pool->alloc(gfp_temp, pool->pool_data);
+ current->flags = (current->flags & ~PF_LESS_THROTTLE) | old_flags;
if (likely(element != NULL))
return element;

Index: linux-2.6/block/bio.c
===================================================================
--- linux-2.6.orig/block/bio.c 2018-06-25 16:32:19.210000000 +0200
+++ linux-2.6/block/bio.c 2018-06-25 16:32:19.200000000 +0200
@@ -217,6 +217,7 @@ fallback:
} else {
struct biovec_slab *bvs = bvec_slabs + *idx;
gfp_t __gfp_mask = gfp_mask & ~(__GFP_DIRECT_RECLAIM | __GFP_IO);
+ unsigned old_flags;

/*
* Make this allocation restricted and don't dump info on
@@ -229,7 +230,10 @@ fallback:
* Try a slab allocation. If this fails and __GFP_DIRECT_RECLAIM
* is set, retry with the 1-entry mempool
*/
+ old_flags = current->flags & PF_LESS_THROTTLE;
+ current->flags |= PF_LESS_THROTTLE;
bvl = kmem_cache_alloc(bvs->slab, __gfp_mask);
+ current->flags = (current->flags & ~PF_LESS_THROTTLE) | old_flags;
if (unlikely(!bvl && (gfp_mask & __GFP_DIRECT_RECLAIM))) {
*idx = BVEC_POOL_MAX;
goto fallback;
Index: linux-2.6/drivers/md/dm-bufio.c
===================================================================
--- linux-2.6.orig/drivers/md/dm-bufio.c 2018-06-25 16:32:19.210000000 +0200
+++ linux-2.6/drivers/md/dm-bufio.c 2018-06-25 16:32:19.200000000 +0200
@@ -356,6 +356,7 @@ static void __cache_size_refresh(void)
static void *alloc_buffer_data(struct dm_bufio_client *c, gfp_t gfp_mask,
unsigned char *data_mode)
{
+ void *ptr;
if (unlikely(c->slab_cache != NULL)) {
*data_mode = DATA_MODE_SLAB;
return kmem_cache_alloc(c->slab_cache, gfp_mask);
@@ -363,9 +364,14 @@ static void *alloc_buffer_data(struct dm

if (c->block_size <= KMALLOC_MAX_SIZE &&
gfp_mask & __GFP_NORETRY) {
+ unsigned old_flags;
*data_mode = DATA_MODE_GET_FREE_PAGES;
- return (void *)__get_free_pages(gfp_mask,
+ old_flags = current->flags & PF_LESS_THROTTLE;
+ current->flags |= PF_LESS_THROTTLE;
+ ptr = (void *)__get_free_pages(gfp_mask,
c->sectors_per_block_bits - (PAGE_SHIFT - SECTOR_SHIFT));
+ current->flags = (current->flags & ~PF_LESS_THROTTLE) | old_flags;
+ return ptr;
}

*data_mode = DATA_MODE_VMALLOC;
@@ -381,8 +387,10 @@ static void *alloc_buffer_data(struct dm
*/
if (gfp_mask & __GFP_NORETRY) {
unsigned noio_flag = memalloc_noio_save();
- void *ptr = __vmalloc(c->block_size, gfp_mask, PAGE_KERNEL);
-
+ unsigned old_flags = current->flags & PF_LESS_THROTTLE;
+ current->flags |= PF_LESS_THROTTLE;
+ ptr = __vmalloc(c->block_size, gfp_mask, PAGE_KERNEL);
+ current->flags = (current->flags & ~PF_LESS_THROTTLE) | old_flags;
memalloc_noio_restore(noio_flag);
return ptr;
}
Index: linux-2.6/drivers/md/dm-integrity.c
===================================================================
--- linux-2.6.orig/drivers/md/dm-integrity.c 2018-06-25 16:32:19.210000000 +0200
+++ linux-2.6/drivers/md/dm-integrity.c 2018-06-25 16:32:52.600000000 +0200
@@ -1317,6 +1317,7 @@ static void integrity_metadata(struct wo
int r;

if (ic->internal_hash) {
+ unsigned old_flags;
struct bvec_iter iter;
struct bio_vec bv;
unsigned digest_size = crypto_shash_digestsize(ic->internal_hash);
@@ -1330,8 +1331,11 @@ static void integrity_metadata(struct wo
if (unlikely(ic->mode == 'R'))
goto skip_io;

+ old_flags = current->flags & PF_LESS_THROTTLE;
+ current->flags |= PF_LESS_THROTTLE;
checksums = kmalloc((PAGE_SIZE >> SECTOR_SHIFT >> ic->sb->log2_sectors_per_block) * ic->tag_size + extra_space,
GFP_NOIO | __GFP_NORETRY | __GFP_NOWARN);
+ current->flags = (current->flags & ~PF_LESS_THROTTLE) | old_flags;
if (!checksums)
checksums = checksums_onstack;

Index: linux-2.6/drivers/md/dm-kcopyd.c
===================================================================
--- linux-2.6.orig/drivers/md/dm-kcopyd.c 2018-06-25 16:32:19.210000000 +0200
+++ linux-2.6/drivers/md/dm-kcopyd.c 2018-06-25 16:32:19.200000000 +0200
@@ -245,7 +245,10 @@ static int kcopyd_get_pages(struct dm_kc
*pages = NULL;

do {
+ unsigned old_flags = current->flags & PF_LESS_THROTTLE;
+ current->flags |= PF_LESS_THROTTLE;
pl = alloc_pl(__GFP_NOWARN | __GFP_NORETRY | __GFP_KSWAPD_RECLAIM);
+ current->flags = (current->flags & ~PF_LESS_THROTTLE) | old_flags;
if (unlikely(!pl)) {
/* Use reserved pages */
pl = kc->pages;
Index: linux-2.6/drivers/md/dm-verity-target.c
===================================================================
--- linux-2.6.orig/drivers/md/dm-verity-target.c 2018-06-25 16:32:19.210000000 +0200
+++ linux-2.6/drivers/md/dm-verity-target.c 2018-06-25 16:32:19.200000000 +0200
@@ -596,9 +596,13 @@ no_prefetch_cluster:
static void verity_submit_prefetch(struct dm_verity *v, struct dm_verity_io *io)
{
struct dm_verity_prefetch_work *pw;
+ unsigned old_flags;

+ old_flags = current->flags & PF_LESS_THROTTLE;
+ current->flags |= PF_LESS_THROTTLE;
pw = kmalloc(sizeof(struct dm_verity_prefetch_work),
GFP_NOIO | __GFP_NORETRY | __GFP_NOMEMALLOC | __GFP_NOWARN);
+ current->flags = (current->flags & ~PF_LESS_THROTTLE) | old_flags;

if (!pw)
return;
Index: linux-2.6/drivers/md/dm-writecache.c
===================================================================
--- linux-2.6.orig/drivers/md/dm-writecache.c 2018-06-25 16:32:19.210000000 +0200
+++ linux-2.6/drivers/md/dm-writecache.c 2018-06-25 16:32:19.200000000 +0200
@@ -1467,6 +1467,7 @@ static void __writecache_writeback_pmem(
unsigned max_pages;

while (wbl->size) {
+ unsigned old_flags;
wbl->size--;
e = container_of(wbl->list.prev, struct wc_entry, lru);
list_del(&e->lru);
@@ -1480,6 +1481,8 @@ static void __writecache_writeback_pmem(
bio_set_dev(&wb->bio, wc->dev->bdev);
wb->bio.bi_iter.bi_sector = read_original_sector(wc, e);
wb->page_offset = PAGE_SIZE;
+ old_flags = current->flags & PF_LESS_THROTTLE;
+ current->flags |= PF_LESS_THROTTLE;
if (max_pages <= WB_LIST_INLINE ||
unlikely(!(wb->wc_list = kmalloc(max_pages * sizeof(struct wc_entry *),
GFP_NOIO | __GFP_NORETRY |
@@ -1487,6 +1490,7 @@ static void __writecache_writeback_pmem(
wb->wc_list = wb->wc_list_inline;
max_pages = WB_LIST_INLINE;
}
+ current->flags = (current->flags & ~PF_LESS_THROTTLE) | old_flags;

BUG_ON(!wc_add_block(wb, e, GFP_NOIO));

Index: linux-2.6/drivers/md/dm-crypt.c
===================================================================
--- linux-2.6.orig/drivers/md/dm-crypt.c 2018-06-25 16:32:19.210000000 +0200
+++ linux-2.6/drivers/md/dm-crypt.c 2018-06-25 16:33:55.810000000 +0200
@@ -2181,12 +2181,16 @@ static void *crypt_page_alloc(gfp_t gfp_
{
struct crypt_config *cc = pool_data;
struct page *page;
+ unsigned old_flags;

if (unlikely(percpu_counter_compare(&cc->n_allocated_pages, dm_crypt_pages_per_client) >= 0) &&
likely(gfp_mask & __GFP_NORETRY))
return NULL;

+ old_flags = current->flags & PF_LESS_THROTTLE;
+ current->flags |= PF_LESS_THROTTLE;
page = alloc_page(gfp_mask);
+ current->flags = (current->flags & ~PF_LESS_THROTTLE) | old_flags;
if (likely(page != NULL))
percpu_counter_add(&cc->n_allocated_pages, 1);

@@ -2893,7 +2897,10 @@ static int crypt_map(struct dm_target *t

if (cc->on_disk_tag_size) {
unsigned tag_len = cc->on_disk_tag_size * (bio_sectors(bio) >> cc->sector_shift);
+ unsigned old_flags;

+ old_flags = current->flags & PF_LESS_THROTTLE;
+ current->flags |= PF_LESS_THROTTLE;
if (unlikely(tag_len > KMALLOC_MAX_SIZE) ||
unlikely(!(io->integrity_metadata = kmalloc(tag_len,
GFP_NOIO | __GFP_NORETRY | __GFP_NOMEMALLOC | __GFP_NOWARN)))) {
@@ -2902,6 +2909,7 @@ static int crypt_map(struct dm_target *t
io->integrity_metadata = mempool_alloc(&cc->tag_pool, GFP_NOIO);
io->integrity_metadata_from_pool = true;
}
+ current->flags = (current->flags & ~PF_LESS_THROTTLE) | old_flags;
}

if (crypt_integrity_aead(cc))

2018-06-25 14:59:01

by Michal Hocko

[permalink] [raw]
Subject: Re: dm bufio: Reduce dm_bufio_lock contention

On Mon 25-06-18 10:42:30, Mikulas Patocka wrote:
>
>
> On Mon, 25 Jun 2018, Michal Hocko wrote:
>
> > > And the throttling in dm-bufio prevents kswapd from making forward
> > > progress, causing this situation...
> >
> > Which is what we have PF_LESS_THROTTLE for. Geez, do we have to go in
> > circles like that? Are you even listening?
> >
> > [...]
> >
> > > And so what do you want to do to prevent block drivers from sleeping?
> >
> > use the existing means we have.
> > --
> > Michal Hocko
> > SUSE Labs
>
> So - do you want this patch?
>
> There is no behavior difference between changing the allocator (so that it
> implies PF_LESS_THROTTLE for block drivers) and changing all the block
> drivers to explicitly set PF_LESS_THROTTLE.

As long as you can reliably detect those users. And using gfp_mask is
about the worst way to achieve that because users tend to be creative
when it comes to using gfp mask. PF_LESS_THROTTLE in general is a
way to tell the allocator that _you_ are the one to help the reclaim by
cleaning data.

> But if you insist that the allocator can't be changed, we have to repeat
> the same code over and over again in the block drivers.

I am not familiar with the patched code but the mempool change at least
makes sense (bvec_alloc seems to fall back to mempool, which then makes
sense as well). If others in md/ do the same thing, the same reasoning applies.

I would just use current_restore_flags rather than open code it.
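
For reference, current_restore_flags() (include/linux/sched.h) restores just
the named bits, roughly:

	static inline void current_restore_flags(unsigned long orig_flags,
						 unsigned long flags)
	{
		current->flags &= ~flags;		/* clear the bits */
		current->flags |= orig_flags & flags;	/* restore saved state */
	}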

Thanks!
--
Michal Hocko
SUSE Labs

2018-06-29 06:44:53

by Mikulas Patocka

[permalink] [raw]
Subject: Re: dm bufio: Reduce dm_bufio_lock contention



On Mon, 25 Jun 2018, Michal Hocko wrote:

> On Mon 25-06-18 10:42:30, Mikulas Patocka wrote:
> >
> >
> > On Mon, 25 Jun 2018, Michal Hocko wrote:
> >
> > > > And the throttling in dm-bufio prevents kswapd from making forward
> > > > progress, causing this situation...
> > >
> > > Which is what we have PF_LESS_THROTTLE for. Geez, do we have to go in
> > > circles like that? Are you even listening?
> > >
> > > [...]
> > >
> > > > And so what do you want to do to prevent block drivers from sleeping?
> > >
> > > use the existing means we have.
> > > --
> > > Michal Hocko
> > > SUSE Labs
> >
> > So - do you want this patch?
> >
> > There is no behavior difference between changing the allocator (so that it
> > implies PF_LESS_THROTTLE for block drivers) and changing all the block
> > drivers to explicitly set PF_LESS_THROTTLE.
>
> As long as you can reliably detect those users. And using gfp_mask is

You can detect them if __GFP_IO is not set and __GFP_NORETRY is set. You
can grep the kernel for __GFP_NORETRY to find all the users.
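
In other words, the detection criterion amounts to a predicate like this
(illustrative only; no such helper exists in the tree):

	static inline bool io_path_alloc(gfp_t gfp_mask)
	{
		/* block-driver allocation per the criterion above */
		return !(gfp_mask & __GFP_IO) && (gfp_mask & __GFP_NORETRY);
	}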

> about the worst way to achieve that because users tend to be creative
> when it comes to using gfp mask. PF_LESS_THROTTLE in general is a
> way to tell the allocator that _you_ are the one to help the reclaim by
> cleaning data.

But using PF_LESS_THROTTLE explicitly adds more lines of code than
implying PF_LESS_THROTTLE in the allocator.

> > But if you insist that the allocator can't be changed, we have to repeat
> > the same code over and over again in the block drivers.
>
> I am not familiar with the patched code but the mempool change at least
> makes sense (bvec_alloc seems to fall back to mempool, which then makes
> sense as well). If others in md/ do the same thing, the same reasoning applies.
>
> I would just use current_restore_flags rather than open code it.
>
> Thanks!
> --
> Michal Hocko
> SUSE Labs

So, do you accept this patch?

Mikulas



From: Mikulas Patocka <[email protected]>
Subject: [PATCH] mm: set PF_LESS_THROTTLE when allocating memory for i/o

When doing __GFP_NORETRY allocation, the system may sleep in
wait_iff_congested if there are too many dirty pages. Unfortunately this
sleeping may slow down kswapd, preventing it from doing writeback and
resolving the congestion.

This patch fixes it by setting PF_LESS_THROTTLE when allocating memory for
block device drivers.

Signed-off-by: Mikulas Patocka <[email protected]>
Cc: [email protected]

---
block/bio.c | 4 ++++
drivers/md/dm-bufio.c | 14 +++++++++++---
drivers/md/dm-crypt.c | 8 ++++++++
drivers/md/dm-integrity.c | 4 ++++
drivers/md/dm-kcopyd.c | 3 +++
drivers/md/dm-verity-target.c | 4 ++++
drivers/md/dm-writecache.c | 4 ++++
mm/mempool.c | 4 ++++
8 files changed, 42 insertions(+), 3 deletions(-)

Index: linux-2.6/mm/mempool.c
===================================================================
--- linux-2.6.orig/mm/mempool.c 2018-06-29 03:47:16.290000000 +0200
+++ linux-2.6/mm/mempool.c 2018-06-29 03:47:16.270000000 +0200
@@ -369,6 +369,7 @@ void *mempool_alloc(mempool_t *pool, gfp
unsigned long flags;
wait_queue_entry_t wait;
gfp_t gfp_temp;
+ unsigned old_flags;

VM_WARN_ON_ONCE(gfp_mask & __GFP_ZERO);
might_sleep_if(gfp_mask & __GFP_DIRECT_RECLAIM);
@@ -381,7 +382,10 @@ void *mempool_alloc(mempool_t *pool, gfp

repeat_alloc:

+ old_flags = current->flags & PF_LESS_THROTTLE;
+ current->flags |= PF_LESS_THROTTLE;
element = pool->alloc(gfp_temp, pool->pool_data);
+ current_restore_flags(old_flags, PF_LESS_THROTTLE);
if (likely(element != NULL))
return element;

Index: linux-2.6/block/bio.c
===================================================================
--- linux-2.6.orig/block/bio.c 2018-06-29 03:47:16.290000000 +0200
+++ linux-2.6/block/bio.c 2018-06-29 03:47:16.270000000 +0200
@@ -217,6 +217,7 @@ fallback:
} else {
struct biovec_slab *bvs = bvec_slabs + *idx;
gfp_t __gfp_mask = gfp_mask & ~(__GFP_DIRECT_RECLAIM | __GFP_IO);
+ unsigned old_flags;

/*
* Make this allocation restricted and don't dump info on
@@ -229,7 +230,10 @@ fallback:
* Try a slab allocation. If this fails and __GFP_DIRECT_RECLAIM
* is set, retry with the 1-entry mempool
*/
+ old_flags = current->flags & PF_LESS_THROTTLE;
+ current->flags |= PF_LESS_THROTTLE;
bvl = kmem_cache_alloc(bvs->slab, __gfp_mask);
+ current_restore_flags(old_flags, PF_LESS_THROTTLE);
if (unlikely(!bvl && (gfp_mask & __GFP_DIRECT_RECLAIM))) {
*idx = BVEC_POOL_MAX;
goto fallback;
Index: linux-2.6/drivers/md/dm-bufio.c
===================================================================
--- linux-2.6.orig/drivers/md/dm-bufio.c 2018-06-29 03:47:16.290000000 +0200
+++ linux-2.6/drivers/md/dm-bufio.c 2018-06-29 03:47:16.270000000 +0200
@@ -356,6 +356,7 @@ static void __cache_size_refresh(void)
static void *alloc_buffer_data(struct dm_bufio_client *c, gfp_t gfp_mask,
unsigned char *data_mode)
{
+ void *ptr;
if (unlikely(c->slab_cache != NULL)) {
*data_mode = DATA_MODE_SLAB;
return kmem_cache_alloc(c->slab_cache, gfp_mask);
@@ -363,9 +364,14 @@ static void *alloc_buffer_data(struct dm

if (c->block_size <= KMALLOC_MAX_SIZE &&
gfp_mask & __GFP_NORETRY) {
+ unsigned old_flags;
*data_mode = DATA_MODE_GET_FREE_PAGES;
- return (void *)__get_free_pages(gfp_mask,
+ old_flags = current->flags & PF_LESS_THROTTLE;
+ current->flags |= PF_LESS_THROTTLE;
+ ptr = (void *)__get_free_pages(gfp_mask,
c->sectors_per_block_bits - (PAGE_SHIFT - SECTOR_SHIFT));
+ current_restore_flags(old_flags, PF_LESS_THROTTLE);
+ return ptr;
}

*data_mode = DATA_MODE_VMALLOC;
@@ -381,8 +387,10 @@ static void *alloc_buffer_data(struct dm
*/
if (gfp_mask & __GFP_NORETRY) {
unsigned noio_flag = memalloc_noio_save();
- void *ptr = __vmalloc(c->block_size, gfp_mask, PAGE_KERNEL);
-
+ unsigned old_flags = current->flags & PF_LESS_THROTTLE;
+ current->flags |= PF_LESS_THROTTLE;
+ ptr = __vmalloc(c->block_size, gfp_mask, PAGE_KERNEL);
+ current_restore_flags(old_flags, PF_LESS_THROTTLE);
memalloc_noio_restore(noio_flag);
return ptr;
}
Index: linux-2.6/drivers/md/dm-integrity.c
===================================================================
--- linux-2.6.orig/drivers/md/dm-integrity.c 2018-06-29 03:47:16.290000000 +0200
+++ linux-2.6/drivers/md/dm-integrity.c 2018-06-29 03:47:16.270000000 +0200
@@ -1318,6 +1318,7 @@ static void integrity_metadata(struct wo
int r;

if (ic->internal_hash) {
+ unsigned old_flags;
struct bvec_iter iter;
struct bio_vec bv;
unsigned digest_size = crypto_shash_digestsize(ic->internal_hash);
@@ -1331,8 +1332,11 @@ static void integrity_metadata(struct wo
if (unlikely(ic->mode == 'R'))
goto skip_io;

+ old_flags = current->flags & PF_LESS_THROTTLE;
+ current->flags |= PF_LESS_THROTTLE;
checksums = kmalloc((PAGE_SIZE >> SECTOR_SHIFT >> ic->sb->log2_sectors_per_block) * ic->tag_size + extra_space,
GFP_NOIO | __GFP_NORETRY | __GFP_NOWARN);
+ current_restore_flags(old_flags, PF_LESS_THROTTLE);
if (!checksums)
checksums = checksums_onstack;

Index: linux-2.6/drivers/md/dm-kcopyd.c
===================================================================
--- linux-2.6.orig/drivers/md/dm-kcopyd.c 2018-06-29 03:47:16.290000000 +0200
+++ linux-2.6/drivers/md/dm-kcopyd.c 2018-06-29 03:47:16.270000000 +0200
@@ -245,7 +245,10 @@ static int kcopyd_get_pages(struct dm_kc
*pages = NULL;

do {
+ unsigned old_flags = current->flags & PF_LESS_THROTTLE;
+ current->flags |= PF_LESS_THROTTLE;
pl = alloc_pl(__GFP_NOWARN | __GFP_NORETRY | __GFP_KSWAPD_RECLAIM);
+ current_restore_flags(old_flags, PF_LESS_THROTTLE);
if (unlikely(!pl)) {
/* Use reserved pages */
pl = kc->pages;
Index: linux-2.6/drivers/md/dm-verity-target.c
===================================================================
--- linux-2.6.orig/drivers/md/dm-verity-target.c 2018-06-29 03:47:16.290000000 +0200
+++ linux-2.6/drivers/md/dm-verity-target.c 2018-06-29 03:47:16.280000000 +0200
@@ -596,9 +596,13 @@ no_prefetch_cluster:
static void verity_submit_prefetch(struct dm_verity *v, struct dm_verity_io *io)
{
struct dm_verity_prefetch_work *pw;
+ unsigned old_flags;

+ old_flags = current->flags & PF_LESS_THROTTLE;
+ current->flags |= PF_LESS_THROTTLE;
pw = kmalloc(sizeof(struct dm_verity_prefetch_work),
GFP_NOIO | __GFP_NORETRY | __GFP_NOMEMALLOC | __GFP_NOWARN);
+ current_restore_flags(old_flags, PF_LESS_THROTTLE);

if (!pw)
return;
Index: linux-2.6/drivers/md/dm-writecache.c
===================================================================
--- linux-2.6.orig/drivers/md/dm-writecache.c 2018-06-29 03:47:16.290000000 +0200
+++ linux-2.6/drivers/md/dm-writecache.c 2018-06-29 03:47:16.280000000 +0200
@@ -1473,6 +1473,7 @@ static void __writecache_writeback_pmem(
unsigned max_pages;

while (wbl->size) {
+ unsigned old_flags;
wbl->size--;
e = container_of(wbl->list.prev, struct wc_entry, lru);
list_del(&e->lru);
@@ -1486,6 +1487,8 @@ static void __writecache_writeback_pmem(
bio_set_dev(&wb->bio, wc->dev->bdev);
wb->bio.bi_iter.bi_sector = read_original_sector(wc, e);
wb->page_offset = PAGE_SIZE;
+ old_flags = current->flags & PF_LESS_THROTTLE;
+ current->flags |= PF_LESS_THROTTLE;
if (max_pages <= WB_LIST_INLINE ||
unlikely(!(wb->wc_list = kmalloc(max_pages * sizeof(struct wc_entry *),
GFP_NOIO | __GFP_NORETRY |
@@ -1493,6 +1496,7 @@ static void __writecache_writeback_pmem(
wb->wc_list = wb->wc_list_inline;
max_pages = WB_LIST_INLINE;
}
+ current_restore_flags(old_flags, PF_LESS_THROTTLE);

BUG_ON(!wc_add_block(wb, e, GFP_NOIO));

Index: linux-2.6/drivers/md/dm-crypt.c
===================================================================
--- linux-2.6.orig/drivers/md/dm-crypt.c 2018-06-29 03:47:16.290000000 +0200
+++ linux-2.6/drivers/md/dm-crypt.c 2018-06-29 03:47:16.280000000 +0200
@@ -2181,12 +2181,16 @@ static void *crypt_page_alloc(gfp_t gfp_
{
struct crypt_config *cc = pool_data;
struct page *page;
+ unsigned old_flags;

if (unlikely(percpu_counter_compare(&cc->n_allocated_pages, dm_crypt_pages_per_client) >= 0) &&
likely(gfp_mask & __GFP_NORETRY))
return NULL;

+ old_flags = current->flags & PF_LESS_THROTTLE;
+ current->flags |= PF_LESS_THROTTLE;
page = alloc_page(gfp_mask);
+ current_restore_flags(old_flags, PF_LESS_THROTTLE);
if (likely(page != NULL))
percpu_counter_add(&cc->n_allocated_pages, 1);

@@ -2893,7 +2897,10 @@ static int crypt_map(struct dm_target *t

if (cc->on_disk_tag_size) {
unsigned tag_len = cc->on_disk_tag_size * (bio_sectors(bio) >> cc->sector_shift);
+ unsigned old_flags;

+ old_flags = current->flags & PF_LESS_THROTTLE;
+ current->flags |= PF_LESS_THROTTLE;
if (unlikely(tag_len > KMALLOC_MAX_SIZE) ||
unlikely(!(io->integrity_metadata = kmalloc(tag_len,
GFP_NOIO | __GFP_NORETRY | __GFP_NOMEMALLOC | __GFP_NOWARN)))) {
@@ -2902,6 +2909,7 @@ static int crypt_map(struct dm_target *t
io->integrity_metadata = mempool_alloc(&cc->tag_pool, GFP_NOIO);
io->integrity_metadata_from_pool = true;
}
+ current_restore_flags(old_flags, PF_LESS_THROTTLE);
}

if (crypt_integrity_aead(cc))

2018-06-29 09:49:29

by Michal Hocko

[permalink] [raw]
Subject: Re: dm bufio: Reduce dm_bufio_lock contention

On Thu 28-06-18 22:43:29, Mikulas Patocka wrote:
>
>
> On Mon, 25 Jun 2018, Michal Hocko wrote:
>
> > On Mon 25-06-18 10:42:30, Mikulas Patocka wrote:
> > >
> > >
> > > On Mon, 25 Jun 2018, Michal Hocko wrote:
> > >
> > > > > And the throttling in dm-bufio prevents kswapd from making forward
> > > > > progress, causing this situation...
> > > >
> > > > Which is what we have PF_LESS_THROTTLE for. Geez, do we have to go in
> > > > circles like that? Are you even listening?
> > > >
> > > > [...]
> > > >
> > > > > And so what do you want to do to prevent block drivers from sleeping?
> > > >
> > > > use the existing means we have.
> > > > --
> > > > Michal Hocko
> > > > SUSE Labs
> > >
> > > So - do you want this patch?
> > >
> > > There is no behavior difference between changing the allocator (so that it
> > > implies PF_LESS_THROTTLE for block drivers) and changing all the block
> > > drivers to explicitly set PF_LESS_THROTTLE.
> >
> > As long as you can reliably detect those users. And using gfp_mask is
>
> You can detect them if __GFP_IO is not set and __GFP_NORETRY is set. You
> can grep the kernel for __GFP_NORETRY to find all the users.

It seems that arguing doesn't make much sense here. I will not repeat
myself...

> > about the worst way to achieve that because users tend to be creative
> > when it comes to using gfp mask. PF_LESS_THROTTLE in general is a
> > way to tell the allocator that _you_ are the one to help the reclaim by
> > cleaning data.
>
> But using PF_LESS_THROTTLE explicitly adds more lines of code than
> implying PF_LESS_THROTTLE in the allocator.

Yes and it will also make the code more explicit about the intention and
so it will be easier to maintain in the long term.

> From: Mikulas Patocka <[email protected]>
> Subject: [PATCH] mm: set PF_LESS_THROTTLE when allocating memory for i/o
>
> When doing __GFP_NORETRY allocation, the system may sleep in
> wait_iff_congested if there are too many dirty pages. Unfortunately this
> sleeping may slow down kswapd, preventing it from doing writeback and
> resolving the congestion.

This description is misleading at best if not outright wrong.
if (!sc->hibernation_mode && !current_is_kswapd() &&
current_may_throttle() && pgdat_memcg_congested(pgdat, root))
wait_iff_congested(BLK_RW_ASYNC, HZ/10);

so this is an explicit throttling of the direct reclaim.

So I would use the following wording instead:
"
It has been noticed that congestion throttling can slow down allocation
paths that participate in the IO and thus help the memory reclaim.
Stalling those allocations is therefore not productive. Moreover the mempool
allocator and the md variants of the same already implement their own
throttling, which is feedback driven and therefore better placed. Stalling
at the page allocator is therefore even counterproductive.

PF_LESS_THROTTLE is a task flag denoting an allocation context that is
participating in the memory reclaim, which fits the model of these IO
paths, so use the flag and make the page allocator aware that they are
special and do not really want any dirty data throttling.

<HERE GOES YOUR STORAGE CONFIGURATION AND DATA ABOUT STALLS>
"

with a clearer patch description and some data to back it up, you
can add

Acked-by: Michal Hocko <[email protected]> # mempool_alloc and bvec_alloc

I cannot really comment on other md allocators though because I am not
familiar with those.

> This patch fixes it by setting PF_LESS_THROTTLE when allocating memory for
> block device drivers.
>
> Signed-off-by: Mikulas Patocka <[email protected]>
> Cc: [email protected]
>
> ---
> block/bio.c | 4 ++++
> drivers/md/dm-bufio.c | 14 +++++++++++---
> drivers/md/dm-crypt.c | 8 ++++++++
> drivers/md/dm-integrity.c | 4 ++++
> drivers/md/dm-kcopyd.c | 3 +++
> drivers/md/dm-verity-target.c | 4 ++++
> drivers/md/dm-writecache.c | 4 ++++
> mm/mempool.c | 4 ++++
> 8 files changed, 42 insertions(+), 3 deletions(-)
>
> Index: linux-2.6/mm/mempool.c
> ===================================================================
> --- linux-2.6.orig/mm/mempool.c 2018-06-29 03:47:16.290000000 +0200
> +++ linux-2.6/mm/mempool.c 2018-06-29 03:47:16.270000000 +0200
> @@ -369,6 +369,7 @@ void *mempool_alloc(mempool_t *pool, gfp
> unsigned long flags;
> wait_queue_entry_t wait;
> gfp_t gfp_temp;
> + unsigned old_flags;
>
> VM_WARN_ON_ONCE(gfp_mask & __GFP_ZERO);
> might_sleep_if(gfp_mask & __GFP_DIRECT_RECLAIM);
> @@ -381,7 +382,10 @@ void *mempool_alloc(mempool_t *pool, gfp
>
> repeat_alloc:
>
> + old_flags = current->flags & PF_LESS_THROTTLE;
> + current->flags |= PF_LESS_THROTTLE;
> element = pool->alloc(gfp_temp, pool->pool_data);
> + current_restore_flags(old_flags, PF_LESS_THROTTLE);
> if (likely(element != NULL))
> return element;
>
> Index: linux-2.6/block/bio.c
> ===================================================================
> --- linux-2.6.orig/block/bio.c 2018-06-29 03:47:16.290000000 +0200
> +++ linux-2.6/block/bio.c 2018-06-29 03:47:16.270000000 +0200
> @@ -217,6 +217,7 @@ fallback:
> } else {
> struct biovec_slab *bvs = bvec_slabs + *idx;
> gfp_t __gfp_mask = gfp_mask & ~(__GFP_DIRECT_RECLAIM | __GFP_IO);
> + unsigned old_flags;
>
> /*
> * Make this allocation restricted and don't dump info on
> @@ -229,7 +230,10 @@ fallback:
> * Try a slab allocation. If this fails and __GFP_DIRECT_RECLAIM
> * is set, retry with the 1-entry mempool
> */
> + old_flags = current->flags & PF_LESS_THROTTLE;
> + current->flags |= PF_LESS_THROTTLE;
> bvl = kmem_cache_alloc(bvs->slab, __gfp_mask);
> + current_restore_flags(old_flags, PF_LESS_THROTTLE);
> if (unlikely(!bvl && (gfp_mask & __GFP_DIRECT_RECLAIM))) {
> *idx = BVEC_POOL_MAX;
> goto fallback;
> Index: linux-2.6/drivers/md/dm-bufio.c
> ===================================================================
> --- linux-2.6.orig/drivers/md/dm-bufio.c 2018-06-29 03:47:16.290000000 +0200
> +++ linux-2.6/drivers/md/dm-bufio.c 2018-06-29 03:47:16.270000000 +0200
> @@ -356,6 +356,7 @@ static void __cache_size_refresh(void)
> static void *alloc_buffer_data(struct dm_bufio_client *c, gfp_t gfp_mask,
> unsigned char *data_mode)
> {
> + void *ptr;
> if (unlikely(c->slab_cache != NULL)) {
> *data_mode = DATA_MODE_SLAB;
> return kmem_cache_alloc(c->slab_cache, gfp_mask);
> @@ -363,9 +364,14 @@ static void *alloc_buffer_data(struct dm
>
> if (c->block_size <= KMALLOC_MAX_SIZE &&
> gfp_mask & __GFP_NORETRY) {
> + unsigned old_flags;
> *data_mode = DATA_MODE_GET_FREE_PAGES;
> - return (void *)__get_free_pages(gfp_mask,
> + old_flags = current->flags & PF_LESS_THROTTLE;
> + current->flags |= PF_LESS_THROTTLE;
> + ptr = (void *)__get_free_pages(gfp_mask,
> c->sectors_per_block_bits - (PAGE_SHIFT - SECTOR_SHIFT));
> + current_restore_flags(old_flags, PF_LESS_THROTTLE);
> + return ptr;
> }
>
> *data_mode = DATA_MODE_VMALLOC;
> @@ -381,8 +387,10 @@ static void *alloc_buffer_data(struct dm
> */
> if (gfp_mask & __GFP_NORETRY) {
> unsigned noio_flag = memalloc_noio_save();
> - void *ptr = __vmalloc(c->block_size, gfp_mask, PAGE_KERNEL);
> -
> + unsigned old_flags = current->flags & PF_LESS_THROTTLE;
> + current->flags |= PF_LESS_THROTTLE;
> + ptr = __vmalloc(c->block_size, gfp_mask, PAGE_KERNEL);
> + current_restore_flags(old_flags, PF_LESS_THROTTLE);
> memalloc_noio_restore(noio_flag);
> return ptr;
> }
> Index: linux-2.6/drivers/md/dm-integrity.c
> ===================================================================
> --- linux-2.6.orig/drivers/md/dm-integrity.c 2018-06-29 03:47:16.290000000 +0200
> +++ linux-2.6/drivers/md/dm-integrity.c 2018-06-29 03:47:16.270000000 +0200
> @@ -1318,6 +1318,7 @@ static void integrity_metadata(struct wo
> int r;
>
> if (ic->internal_hash) {
> + unsigned old_flags;
> struct bvec_iter iter;
> struct bio_vec bv;
> unsigned digest_size = crypto_shash_digestsize(ic->internal_hash);
> @@ -1331,8 +1332,11 @@ static void integrity_metadata(struct wo
> if (unlikely(ic->mode == 'R'))
> goto skip_io;
>
> + old_flags = current->flags & PF_LESS_THROTTLE;
> + current->flags |= PF_LESS_THROTTLE;
> checksums = kmalloc((PAGE_SIZE >> SECTOR_SHIFT >> ic->sb->log2_sectors_per_block) * ic->tag_size + extra_space,
> GFP_NOIO | __GFP_NORETRY | __GFP_NOWARN);
> + current_restore_flags(old_flags, PF_LESS_THROTTLE);
> if (!checksums)
> checksums = checksums_onstack;
>
> Index: linux-2.6/drivers/md/dm-kcopyd.c
> ===================================================================
> --- linux-2.6.orig/drivers/md/dm-kcopyd.c 2018-06-29 03:47:16.290000000 +0200
> +++ linux-2.6/drivers/md/dm-kcopyd.c 2018-06-29 03:47:16.270000000 +0200
> @@ -245,7 +245,10 @@ static int kcopyd_get_pages(struct dm_kc
> *pages = NULL;
>
> do {
> + unsigned old_flags = current->flags & PF_LESS_THROTTLE;
> + current->flags |= PF_LESS_THROTTLE;
> pl = alloc_pl(__GFP_NOWARN | __GFP_NORETRY | __GFP_KSWAPD_RECLAIM);
> + current_restore_flags(old_flags, PF_LESS_THROTTLE);
> if (unlikely(!pl)) {
> /* Use reserved pages */
> pl = kc->pages;
> Index: linux-2.6/drivers/md/dm-verity-target.c
> ===================================================================
> --- linux-2.6.orig/drivers/md/dm-verity-target.c 2018-06-29 03:47:16.290000000 +0200
> +++ linux-2.6/drivers/md/dm-verity-target.c 2018-06-29 03:47:16.280000000 +0200
> @@ -596,9 +596,13 @@ no_prefetch_cluster:
> static void verity_submit_prefetch(struct dm_verity *v, struct dm_verity_io *io)
> {
> struct dm_verity_prefetch_work *pw;
> + unsigned old_flags;
>
> + old_flags = current->flags & PF_LESS_THROTTLE;
> + current->flags |= PF_LESS_THROTTLE;
> pw = kmalloc(sizeof(struct dm_verity_prefetch_work),
> GFP_NOIO | __GFP_NORETRY | __GFP_NOMEMALLOC | __GFP_NOWARN);
> + current_restore_flags(old_flags, PF_LESS_THROTTLE);
>
> if (!pw)
> return;
> Index: linux-2.6/drivers/md/dm-writecache.c
> ===================================================================
> --- linux-2.6.orig/drivers/md/dm-writecache.c 2018-06-29 03:47:16.290000000 +0200
> +++ linux-2.6/drivers/md/dm-writecache.c 2018-06-29 03:47:16.280000000 +0200
> @@ -1473,6 +1473,7 @@ static void __writecache_writeback_pmem(
> unsigned max_pages;
>
> while (wbl->size) {
> + unsigned old_flags;
> wbl->size--;
> e = container_of(wbl->list.prev, struct wc_entry, lru);
> list_del(&e->lru);
> @@ -1486,6 +1487,8 @@ static void __writecache_writeback_pmem(
> bio_set_dev(&wb->bio, wc->dev->bdev);
> wb->bio.bi_iter.bi_sector = read_original_sector(wc, e);
> wb->page_offset = PAGE_SIZE;
> + old_flags = current->flags & PF_LESS_THROTTLE;
> + current->flags |= PF_LESS_THROTTLE;
> if (max_pages <= WB_LIST_INLINE ||
> unlikely(!(wb->wc_list = kmalloc(max_pages * sizeof(struct wc_entry *),
> GFP_NOIO | __GFP_NORETRY |
> @@ -1493,6 +1496,7 @@ static void __writecache_writeback_pmem(
> wb->wc_list = wb->wc_list_inline;
> max_pages = WB_LIST_INLINE;
> }
> + current_restore_flags(old_flags, PF_LESS_THROTTLE);
>
> BUG_ON(!wc_add_block(wb, e, GFP_NOIO));
>
> Index: linux-2.6/drivers/md/dm-crypt.c
> ===================================================================
> --- linux-2.6.orig/drivers/md/dm-crypt.c 2018-06-29 03:47:16.290000000 +0200
> +++ linux-2.6/drivers/md/dm-crypt.c 2018-06-29 03:47:16.280000000 +0200
> @@ -2181,12 +2181,16 @@ static void *crypt_page_alloc(gfp_t gfp_
> {
> struct crypt_config *cc = pool_data;
> struct page *page;
> + unsigned old_flags;
>
> if (unlikely(percpu_counter_compare(&cc->n_allocated_pages, dm_crypt_pages_per_client) >= 0) &&
> likely(gfp_mask & __GFP_NORETRY))
> return NULL;
>
> + old_flags = current->flags & PF_LESS_THROTTLE;
> + current->flags |= PF_LESS_THROTTLE;
> page = alloc_page(gfp_mask);
> + current_restore_flags(old_flags, PF_LESS_THROTTLE);
> if (likely(page != NULL))
> percpu_counter_add(&cc->n_allocated_pages, 1);
>
> @@ -2893,7 +2897,10 @@ static int crypt_map(struct dm_target *t
>
> if (cc->on_disk_tag_size) {
> unsigned tag_len = cc->on_disk_tag_size * (bio_sectors(bio) >> cc->sector_shift);
> + unsigned old_flags;
>
> + old_flags = current->flags & PF_LESS_THROTTLE;
> + current->flags |= PF_LESS_THROTTLE;
> if (unlikely(tag_len > KMALLOC_MAX_SIZE) ||
> unlikely(!(io->integrity_metadata = kmalloc(tag_len,
> GFP_NOIO | __GFP_NORETRY | __GFP_NOMEMALLOC | __GFP_NOWARN)))) {
> @@ -2902,6 +2909,7 @@ static int crypt_map(struct dm_target *t
> io->integrity_metadata = mempool_alloc(&cc->tag_pool, GFP_NOIO);
> io->integrity_metadata_from_pool = true;
> }
> + current_restore_flags(old_flags, PF_LESS_THROTTLE);
> }
>
> if (crypt_integrity_aead(cc))

--
Michal Hocko
SUSE Labs
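
A note on the shape of the patch quoted above: every call site
open-codes the same save/set/restore sequence around a single
allocation. As a minimal sketch of how that idiom could be factored
out - the helper names are hypothetical, not part of the patch or of
any existing kernel API:

static inline unsigned memalloc_less_throttle_save(void)
{
	unsigned old_flags = current->flags & PF_LESS_THROTTLE;

	/* mark this task as part of the writeback/congestion machinery */
	current->flags |= PF_LESS_THROTTLE;
	return old_flags;
}

static inline void memalloc_less_throttle_restore(unsigned old_flags)
{
	current_restore_flags(old_flags, PF_LESS_THROTTLE);
}

Each hunk would then collapse to save/allocate/restore, mirroring the
memalloc_noio_save()/memalloc_noio_restore() pair already used in
alloc_buffer_data().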

2018-08-01 02:49:03

by jing xia

[permalink] [raw]
Subject: Re: dm bufio: Reduce dm_bufio_lock contention

We reproduced this issue again and found out the root cause.
dm_bufio_prefetch() enters direct reclaim while holding the
dm_bufio_lock, and it takes a long time in soft_limit_reclaim because
the memcg is hugely over its soft limit.
Then all the tasks that run shrink_slab() wait for the dm_bufio_lock.

Any suggestions for this? Thanks.

On Thu, Jun 14, 2018 at 3:31 PM, Michal Hocko <[email protected]> wrote:
> On Thu 14-06-18 15:18:58, jing xia wrote:
> [...]
>> PID: 22920 TASK: ffffffc0120f1a00 CPU: 1 COMMAND: "kworker/u8:2"
>> #0 [ffffffc0282af3d0] __switch_to at ffffff8008085e48
>> #1 [ffffffc0282af3f0] __schedule at ffffff8008850cc8
>> #2 [ffffffc0282af450] schedule at ffffff8008850f4c
>> #3 [ffffffc0282af470] schedule_timeout at ffffff8008853a0c
>> #4 [ffffffc0282af520] schedule_timeout_uninterruptible at ffffff8008853aa8
>> #5 [ffffffc0282af530] wait_iff_congested at ffffff8008181b40
>
> This trace doesn't provide the full picture unfortunately. Waiting in
> the direct reclaim means that the underlying bdi is congested. The real
> question is why it doesn't flush IO in time.
>
>> #6 [ffffffc0282af5b0] shrink_inactive_list at ffffff8008177c80
>> #7 [ffffffc0282af680] shrink_lruvec at ffffff8008178510
>> #8 [ffffffc0282af790] mem_cgroup_shrink_node_zone at ffffff80081793bc
>> #9 [ffffffc0282af840] mem_cgroup_soft_limit_reclaim at ffffff80081b6040
>> #10 [ffffffc0282af8f0] do_try_to_free_pages at ffffff8008178b6c
>> #11 [ffffffc0282af990] try_to_free_pages at ffffff8008178f3c
>> #12 [ffffffc0282afa30] __perform_reclaim at ffffff8008169130
>> #13 [ffffffc0282afab0] __alloc_pages_nodemask at ffffff800816c9b8
>> #14 [ffffffc0282afbd0] __get_free_pages at ffffff800816cd6c
>> #15 [ffffffc0282afbe0] alloc_buffer at ffffff8008591a94
>> #16 [ffffffc0282afc20] __bufio_new at ffffff8008592e94
>> #17 [ffffffc0282afc70] dm_bufio_prefetch at ffffff8008593198
>> #18 [ffffffc0282afd20] verity_prefetch_io at ffffff8008598384
>> #19 [ffffffc0282afd70] process_one_work at ffffff80080b5b3c
>> #20 [ffffffc0282afdc0] worker_thread at ffffff80080b64fc
>> #21 [ffffffc0282afe20] kthread at ffffff80080bae34
>>
>> > Mikulas
>
> --
> Michal Hocko
> SUSE Labs
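
To make the reported stall concrete: one task holds the client mutex
across an allocation that is allowed to enter direct reclaim. A
simplified sketch of the pattern (illustrative pseudo-kernel code, not
the actual dm-bufio source):

/* task A: verity_prefetch_io -> dm_bufio_prefetch -> alloc_buffer */
dm_bufio_lock(c);
/*
 * GFP_NOIO still permits direct reclaim and sleeping; with the memcg
 * hugely over its soft limit, reclaim can spend a very long time in
 * mem_cgroup_soft_limit_reclaim() and wait_iff_congested() - all while
 * the bufio lock is held.
 */
buf = (void *)__get_free_pages(GFP_NOIO, order);
dm_bufio_unlock(c);

/*
 * Meanwhile, every task whose shrink_slab() reaches the dm-bufio
 * shrinker blocks on the same lock, which is what the sysdump showed.
 */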

2018-08-01 07:04:17

by Michal Hocko

[permalink] [raw]
Subject: Re: dm bufio: Reduce dm_bufio_lock contention

On Wed 01-08-18 10:48:00, jing xia wrote:
> We reproduced this issue again and found out the root cause.
> dm_bufio_prefetch() enters direct reclaim while holding the
> dm_bufio_lock, and it takes a long time in soft_limit_reclaim because
> the memcg is hugely over its soft limit.
> Then all the tasks that run shrink_slab() wait for the dm_bufio_lock.
>
> Any suggestions for this? Thanks.

Do not use the soft limit?
--
Michal Hocko
SUSE Labs

2018-09-03 22:25:45

by Mikulas Patocka

[permalink] [raw]
Subject: Re: dm bufio: Reduce dm_bufio_lock contention



On Wed, 1 Aug 2018, jing xia wrote:

> We reproduced this issue again and found out the root cause.
> dm_bufio_prefetch() enters direct reclaim while holding the
> dm_bufio_lock, and it takes a long time in soft_limit_reclaim because
> the memcg is hugely over its soft limit.
> Then all the tasks that run shrink_slab() wait for the dm_bufio_lock.
>
> Any suggestions for this? Thanks.

There's hardly any solution because Michal Hocko refuses to change
__GFP_NORETRY behavior.

The patches 41c73a49df31151f4ff868f28fe4f129f113fa2c and
d12067f428c037b4575aaeb2be00847fc214c24a could reduce contention on the
dm-bufio lock - they don't fix the high CPU consumption inside the
memory allocation, but the kernel code should wait less on the bufio
lock.

Mikulas


> On Thu, Jun 14, 2018 at 3:31 PM, Michal Hocko <[email protected]> wrote:
> > On Thu 14-06-18 15:18:58, jing xia wrote:
> > [...]
> >> PID: 22920 TASK: ffffffc0120f1a00 CPU: 1 COMMAND: "kworker/u8:2"
> >> #0 [ffffffc0282af3d0] __switch_to at ffffff8008085e48
> >> #1 [ffffffc0282af3f0] __schedule at ffffff8008850cc8
> >> #2 [ffffffc0282af450] schedule at ffffff8008850f4c
> >> #3 [ffffffc0282af470] schedule_timeout at ffffff8008853a0c
> >> #4 [ffffffc0282af520] schedule_timeout_uninterruptible at ffffff8008853aa8
> >> #5 [ffffffc0282af530] wait_iff_congested at ffffff8008181b40
> >
> > This trace doesn't provide the full picture unfortunately. Waiting in
> > the direct reclaim means that the underlying bdi is congested. The real
> > question is why it doesn't flush IO in time.
> >
> >> #6 [ffffffc0282af5b0] shrink_inactive_list at ffffff8008177c80
> >> #7 [ffffffc0282af680] shrink_lruvec at ffffff8008178510
> >> #8 [ffffffc0282af790] mem_cgroup_shrink_node_zone at ffffff80081793bc
> >> #9 [ffffffc0282af840] mem_cgroup_soft_limit_reclaim at ffffff80081b6040
> >> #10 [ffffffc0282af8f0] do_try_to_free_pages at ffffff8008178b6c
> >> #11 [ffffffc0282af990] try_to_free_pages at ffffff8008178f3c
> >> #12 [ffffffc0282afa30] __perform_reclaim at ffffff8008169130
> >> #13 [ffffffc0282afab0] __alloc_pages_nodemask at ffffff800816c9b8
> >> #14 [ffffffc0282afbd0] __get_free_pages at ffffff800816cd6c
> >> #15 [ffffffc0282afbe0] alloc_buffer at ffffff8008591a94
> >> #16 [ffffffc0282afc20] __bufio_new at ffffff8008592e94
> >> #17 [ffffffc0282afc70] dm_bufio_prefetch at ffffff8008593198
> >> #18 [ffffffc0282afd20] verity_prefetch_io at ffffff8008598384
> >> #19 [ffffffc0282afd70] process_one_work at ffffff80080b5b3c
> >> #20 [ffffffc0282afdc0] worker_thread at ffffff80080b64fc
> >> #21 [ffffffc0282afe20] kthread at ffffff80080bae34
> >>
> >> > Mikulas
> >
> > --
> > Michal Hocko
> > SUSE Labs
>
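
Returning to the two dm-bufio commits mentioned above, their direction
can be sketched as follows - an approximation of the upstream
dm_bufio_shrink_scan() of that era, not a literal quote of either
commit:

static unsigned long
dm_bufio_shrink_scan(struct shrinker *shrink, struct shrink_control *sc)
{
	struct dm_bufio_client *c =
		container_of(shrink, struct dm_bufio_client, shrinker);
	unsigned long freed;

	if (sc->gfp_mask & __GFP_FS)
		dm_bufio_lock(c);
	else if (!dm_bufio_trylock(c))
		return SHRINK_STOP;	/* don't sleep on the bufio lock */

	freed = __scan(c, sc->nr_to_scan, sc->gfp_mask);
	dm_bufio_unlock(c);
	return freed;
}

Returning SHRINK_STOP lets a task that cannot take the lock move on and
reclaim from other shrinkers instead of stalling.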

2018-09-04 07:09:56

by Michal Hocko

[permalink] [raw]
Subject: Re: dm bufio: Reduce dm_bufio_lock contention

On Mon 03-09-18 18:23:17, Mikulas Patocka wrote:
>
>
> On Wed, 1 Aug 2018, jing xia wrote:
>
> > We reproduced this issue again and found out the root cause.
> > dm_bufio_prefetch() enters direct reclaim while holding the
> > dm_bufio_lock, and it takes a long time in soft_limit_reclaim because
> > the memcg is hugely over its soft limit.
> > Then all the tasks that run shrink_slab() wait for the dm_bufio_lock.
> >
> > Any suggestions for this? Thanks.
>
> There's hardly any solution because Michal Hocko refuses to change
> __GFP_NORETRY behavior.
>
> The patches 41c73a49df31151f4ff868f28fe4f129f113fa2c and
> d12067f428c037b4575aaeb2be00847fc214c24a could reduce contention on the
> dm-bufio lock - they don't fix the high CPU consumption inside the
> memory allocation, but the kernel code should wait less on the bufio
> lock.

If you actually looked at the bottom line of the problem, you would
quickly find out that the dm-bufio lock is the least of the problems
with soft limit reclaim. This is a misfeature which has been merged and
we have to live with it. All we can do is discourage people from using
it and use the much saner low limit instead.

So please stop this stupid blaming and try to understand the reasoning
behind my arguments.
--
Michal Evil Hocko
SUSE Labs

2018-09-04 15:20:46

by Mike Snitzer

[permalink] [raw]
Subject: Re: dm bufio: Reduce dm_bufio_lock contention

On Tue, Sep 04 2018 at 3:08am -0400,
Michal Hocko <[email protected]> wrote:

> On Mon 03-09-18 18:23:17, Mikulas Patocka wrote:
> >
> >
> > On Wed, 1 Aug 2018, jing xia wrote:
> >
> > > We reproduced this issue again and found out the root cause.
> > > dm_bufio_prefetch() enters direct reclaim while holding the
> > > dm_bufio_lock, and it takes a long time in soft_limit_reclaim because
> > > the memcg is hugely over its soft limit.
> > > Then all the tasks that run shrink_slab() wait for the dm_bufio_lock.
> > >
> > > Any suggestions for this? Thanks.
> >
> > There's hardly any solution because Michal Hocko refuses to change
> > __GFP_NORETRY behavior.
> >
> > The patches 41c73a49df31151f4ff868f28fe4f129f113fa2c and
> > d12067f428c037b4575aaeb2be00847fc214c24a could reduce contention on the
> > dm-bufio lock - they don't fix the high CPU consumption inside the
> > memory allocation, but the kernel code should wait less on the bufio
> > lock.
>
> If you actually looked at the bottom line of the problem, you would
> quickly find out that the dm-bufio lock is the least of the problems
> with soft limit reclaim. This is a misfeature which has been merged and
> we have to live with it. All we can do is discourage people from using
> it and use the much saner low limit instead.
>
> So please stop this stupid blaming and try to understand the reasoning
> behind my arguments.

Yes, this bickering isn't productive. Michal, your responses are pretty
hard to follow. I'm just trying to follow what you're saying should be
done. It isn't clear to me.

PLEASE, restate what we should be doing differently. Or what changes
need to happen outside of DM, etc.

Thanks,
Mike

2018-09-04 16:12:57

by Michal Hocko

[permalink] [raw]
Subject: Re: dm bufio: Reduce dm_bufio_lock contention

On Tue 04-09-18 11:18:44, Mike Snitzer wrote:
> On Tue, Sep 04 2018 at 3:08am -0400,
> Michal Hocko <[email protected]> wrote:
>
> > On Mon 03-09-18 18:23:17, Mikulas Patocka wrote:
> > >
> > >
> > > On Wed, 1 Aug 2018, jing xia wrote:
> > >
> > > > We reproduced this issue again and found out the root cause.
> > > > dm_bufio_prefetch() enters direct reclaim while holding the
> > > > dm_bufio_lock, and it takes a long time in soft_limit_reclaim because
> > > > the memcg is hugely over its soft limit.
> > > > Then all the tasks that run shrink_slab() wait for the dm_bufio_lock.
> > > >
> > > > Any suggestions for this? Thanks.
> > >
> > > There's hardly any solution because Michal Hocko refuses to change
> > > __GFP_NORETRY behavior.
> > >
> > > The patches 41c73a49df31151f4ff868f28fe4f129f113fa2c and
> > > d12067f428c037b4575aaeb2be00847fc214c24a could reduce contention on the
> > > dm-bufio lock - they don't fix the high CPU consumption inside the
> > > memory allocation, but the kernel code should wait less on the bufio
> > > lock.
> >
> > If you actually looked at the bottom line of the problem, you would
> > quickly find out that the dm-bufio lock is the least of the problems
> > with soft limit reclaim. This is a misfeature which has been merged and
> > we have to live with it. All we can do is discourage people from using
> > it and use the much saner low limit instead.
> >
> > So please stop this stupid blaming and try to understand the reasoning
> > behind my arguments.
>
> Yes, this bickering isn't productive. Michal, your responses are pretty
> hard to follow. I'm just trying to follow what you're saying should be
> done. It isn't clear to me.
>
> PLEASE, restate what we should be doing differently. Or what changes
> need to happen outside of DM, etc.

For this particular case I can only recommend not using the memcg soft
limit (memory.soft_limit_in_bytes in cgroup v1). It is guaranteed to
stall and there is no way around it, because that is the semantic of the
soft limit. Sad, I know.

Regarding other workloads: AFAIR the problem was due to
wait_iff_congested in the direct reclaim, and I've been arguing that
special-casing __GFP_NORETRY is not a proper way to handle that case.
We have PF_LESS_THROTTLE to handle cases where the caller cannot really
be throttled because it is a part of the congestion control. I dunno
what happened in that regard since then, though.
--
Michal Hocko
SUSE Labs
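
For context, PF_LESS_THROTTLE is consulted on the direct-reclaim
throttling path; roughly (paraphrasing the mm/vmscan.c of that era, so
details may differ):

static bool current_may_throttle(void)
{
	/*
	 * Tasks doing IO on behalf of reclaim (flushers etc.) set
	 * PF_LESS_THROTTLE; they are throttled only if their own
	 * backing device is congested.
	 */
	return !(current->flags & PF_LESS_THROTTLE) ||
		current->backing_dev_info == NULL ||
		bdi_write_congested(current->backing_dev_info);
}

shrink_inactive_list() calls wait_iff_congested() only when
current_may_throttle() returns true, which is why setting the flag
around block-layer allocations would avoid the stalls seen in the
traces above.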

2018-09-04 17:33:14

by Mikulas Patocka

[permalink] [raw]
Subject: Re: dm bufio: Reduce dm_bufio_lock contention



On Tue, 4 Sep 2018, Michal Hocko wrote:

> Regarding other workloads: AFAIR the problem was due to
> wait_iff_congested in the direct reclaim, and I've been arguing that
> special-casing __GFP_NORETRY is not a proper way to handle that case.
> We have PF_LESS_THROTTLE to handle cases where the caller cannot really
> be throttled because it is a part of the congestion control. I dunno
> what happened in that regard since then, though.
> --
> Michal Hocko
> SUSE Labs

So, could you take this patch https://patchwork.kernel.org/patch/10505523/
and commit it to the kernel?

You agreed with that patch, but you didn't commit it.

Mikulas

2018-09-04 17:47:05

by Michal Hocko

[permalink] [raw]
Subject: Re: dm bufio: Reduce dm_bufio_lock contention

On Tue 04-09-18 13:30:44, Mikulas Patocka wrote:
>
>
> On Tue, 4 Sep 2018, Michal Hocko wrote:
>
> > Regarding other workloads: AFAIR the problem was due to
> > wait_iff_congested in the direct reclaim, and I've been arguing that
> > special-casing __GFP_NORETRY is not a proper way to handle that case.
> > We have PF_LESS_THROTTLE to handle cases where the caller cannot really
> > be throttled because it is a part of the congestion control. I dunno
> > what happened in that regard since then, though.
> > --
> > Michal Hocko
> > SUSE Labs
>
> So, could you take this patch https://patchwork.kernel.org/patch/10505523/
> and commit it to the kernel?
>
> You agreed with that patch, but you didn't commit it.

Because I do not maintain any tree that Linus pulls from. You need to
involve Andrew Morton to get this to the mm tree and then to Linus.
--
Michal Hocko
SUSE Labs