2019-10-23 10:28:57

by Michal Hocko

Subject: [RFC PATCH 2/2] mm, vmstat: reduce zone->lock holding time by /proc/pagetypeinfo

From: Michal Hocko <[email protected]>

pagetypeinfo_showfree_print is called with zone->lock held in irq mode.
This is not really nice because it blocks both interrupts on that
cpu and the page allocator. On large machines this might even trigger
the hard lockup detector.

Considering the pagetypeinfo is a debugging tool we do not really need
exact numbers here. The primary reason to look at the output is to see
how pageblocks are spread among different migratetypes, therefore putting
a bound on the number of pages on the free_list sounds like a reasonable
tradeoff.

The new output will simply tell
[...]
Node 6, zone Normal, type Movable >100000 >100000 >100000 >100000 41019 31560 23996 10054 3229 983 648

instead of
Node 6, zone Normal, type Movable 399568 294127 221558 102119 41019 31560 23996 10054 3229 983 648

The limit has been chosen arbitrarily and is subject to future
change should there be a need for that.

Suggested-by: Andrew Morton <[email protected]>
Signed-off-by: Michal Hocko <[email protected]>
---
mm/vmstat.c | 19 ++++++++++++++++++-
1 file changed, 18 insertions(+), 1 deletion(-)

diff --git a/mm/vmstat.c b/mm/vmstat.c
index 4e885ecd44d1..762034fc3b83 100644
--- a/mm/vmstat.c
+++ b/mm/vmstat.c
@@ -1386,8 +1386,25 @@ static void pagetypeinfo_showfree_print(struct seq_file *m,

area = &(zone->free_area[order]);

- list_for_each(curr, &area->free_list[mtype])
+ list_for_each(curr, &area->free_list[mtype]) {
freecount++;
+ /*
+ * Cap the free_list iteration because it might
+ * be really large and we are under a spinlock
+ * so a long time spent here could trigger a
+ * hard lockup detector. Anyway this is a
+ * debugging tool so knowing there is a handful
+ * of pages in this order should be more than
+ * sufficient
+ */
+ if (freecount > 100000) {
+ seq_printf(m, ">%6lu ", freecount);
+ spin_unlock_irq(&zone->lock);
+ cond_resched();
+ spin_lock_irq(&zone->lock);
+ continue;
+ }
+ }
seq_printf(m, "%6lu ", freecount);
}
seq_putc(m, '\n');
--
2.20.1


2019-10-23 20:10:19

by Vlastimil Babka

Subject: Re: [RFC PATCH 2/2] mm, vmstat: reduce zone->lock holding time by /proc/pagetypeinfo

On 10/23/19 12:27 PM, Michal Hocko wrote:
> From: Michal Hocko <[email protected]>
>
> pagetypeinfo_showfree_print is called with zone->lock held in irq mode.
> This is not really nice because it blocks both interrupts on that
> cpu and the page allocator. On large machines this might even trigger
> the hard lockup detector.
>
> Considering the pagetypeinfo is a debugging tool we do not really need
> exact numbers here. The primary reason to look at the output is to see
> how pageblocks are spread among different migratetypes, therefore putting
> a bound on the number of pages on the free_list sounds like a reasonable
> tradeoff.
>
> The new output will simply tell
> [...]
> Node 6, zone Normal, type Movable >100000 >100000 >100000 >100000 41019 31560 23996 10054 3229 983 648
>
> instead of
> Node 6, zone Normal, type Movable 399568 294127 221558 102119 41019 31560 23996 10054 3229 983 648
>
> The limit has been chosen arbitrarily and is subject to future
> change should there be a need for that.
>
> Suggested-by: Andrew Morton <[email protected]>
> Signed-off-by: Michal Hocko <[email protected]>

Hmm dunno, I would rather e.g. hide the file behind some config or boot
option than do this. Or move it to /sys/kernel/debug ?

> ---
> mm/vmstat.c | 19 ++++++++++++++++++-
> 1 file changed, 18 insertions(+), 1 deletion(-)
>
> diff --git a/mm/vmstat.c b/mm/vmstat.c
> index 4e885ecd44d1..762034fc3b83 100644
> --- a/mm/vmstat.c
> +++ b/mm/vmstat.c
> @@ -1386,8 +1386,25 @@ static void pagetypeinfo_showfree_print(struct seq_file *m,
>
> area = &(zone->free_area[order]);
>
> - list_for_each(curr, &area->free_list[mtype])
> + list_for_each(curr, &area->free_list[mtype]) {
> freecount++;
> + /*
> + * Cap the free_list iteration because it might
> + * be really large and we are under a spinlock
> + * so a long time spent here could trigger a
> + * hard lockup detector. Anyway this is a
> + * debugging tool so knowing there is a handful
> + * of pages in this order should be more than
> + * sufficient
> + */
> + if (freecount > 100000) {
> + seq_printf(m, ">%6lu ", freecount);
> + spin_unlock_irq(&zone->lock);
> + cond_resched();
> + spin_lock_irq(&zone->lock);
> + continue;
> + }
> + }
> seq_printf(m, "%6lu ", freecount);
> }
> seq_putc(m, '\n');
>

2019-10-23 20:11:56

by Michal Hocko

Subject: Re: [RFC PATCH 2/2] mm, vmstat: reduce zone->lock holding time by /proc/pagetypeinfo

On Wed 23-10-19 15:32:05, Vlastimil Babka wrote:
> On 10/23/19 12:27 PM, Michal Hocko wrote:
> > From: Michal Hocko <[email protected]>
> >
> > pagetypeinfo_showfree_print is called with zone->lock held in irq mode.
> > This is not really nice because it blocks both interrupts on that
> > cpu and the page allocator. On large machines this might even trigger
> > the hard lockup detector.
> >
> > Considering the pagetypeinfo is a debugging tool we do not really need
> > exact numbers here. The primary reason to look at the output is to see
> > how pageblocks are spread among different migratetypes, therefore putting
> > a bound on the number of pages on the free_list sounds like a reasonable
> > tradeoff.
> >
> > The new output will simply tell
> > [...]
> > Node 6, zone Normal, type Movable >100000 >100000 >100000 >100000 41019 31560 23996 10054 3229 983 648
> >
> > instead of
> > Node 6, zone Normal, type Movable 399568 294127 221558 102119 41019 31560 23996 10054 3229 983 648
> >
> > The limit has been chosen arbitrarily and is subject to future
> > change should there be a need for that.
> >
> > Suggested-by: Andrew Morton <[email protected]>
> > Signed-off-by: Michal Hocko <[email protected]>
>
> Hmm dunno, I would rather e.g. hide the file behind some config or boot
> option than do this. Or move it to /sys/kernel/debug ?

But those wouldn't really help to prevent the lockup, right?
Besides that who would enable that config and how much of a difference
would root only vs. debugfs make?

Is the incomplete value a real problem?

> > ---
> > mm/vmstat.c | 19 ++++++++++++++++++-
> > 1 file changed, 18 insertions(+), 1 deletion(-)
> >
> > diff --git a/mm/vmstat.c b/mm/vmstat.c
> > index 4e885ecd44d1..762034fc3b83 100644
> > --- a/mm/vmstat.c
> > +++ b/mm/vmstat.c
> > @@ -1386,8 +1386,25 @@ static void pagetypeinfo_showfree_print(struct seq_file *m,
> >
> > area = &(zone->free_area[order]);
> >
> > - list_for_each(curr, &area->free_list[mtype])
> > + list_for_each(curr, &area->free_list[mtype]) {
> > freecount++;
> > + /*
> > + * Cap the free_list iteration because it might
> > + * be really large and we are under a spinlock
> > + * so a long time spent here could trigger a
> > + * hard lockup detector. Anyway this is a
> > + * debugging tool so knowing there is a handful
> > + * of pages in this order should be more than
> > + * sufficient
> > + */
> > + if (freecount > 100000) {
> > + seq_printf(m, ">%6lu ", freecount);
> > + spin_unlock_irq(&zone->lock);
> > + cond_resched();
> > + spin_lock_irq(&zone->lock);
> > + continue;
> > + }
> > + }
> > seq_printf(m, "%6lu ", freecount);
> > }
> > seq_putc(m, '\n');
> >

--
Michal Hocko
SUSE Labs

2019-10-23 20:18:48

by Rafael Aquini

Subject: Re: [RFC PATCH 2/2] mm, vmstat: reduce zone->lock holding time by /proc/pagetypeinfo

On Wed, Oct 23, 2019 at 03:32:05PM +0200, Vlastimil Babka wrote:
> On 10/23/19 12:27 PM, Michal Hocko wrote:
> > From: Michal Hocko <[email protected]>
> >
> > pagetypeinfo_showfree_print is called with zone->lock held in irq mode.
> > This is not really nice because it blocks both interrupts on that
> > cpu and the page allocator. On large machines this might even trigger
> > the hard lockup detector.
> >
> > Considering the pagetypeinfo is a debugging tool we do not really need
> > exact numbers here. The primary reason to look at the output is to see
> > how pageblocks are spread among different migratetypes, therefore putting
> > a bound on the number of pages on the free_list sounds like a reasonable
> > tradeoff.
> >
> > The new output will simply tell
> > [...]
> > Node 6, zone Normal, type Movable >100000 >100000 >100000 >100000 41019 31560 23996 10054 3229 983 648
> >
> > instead of
> > Node 6, zone Normal, type Movable 399568 294127 221558 102119 41019 31560 23996 10054 3229 983 648
> >
> > The limit has been chosen arbitrarily and is subject to future
> > change should there be a need for that.
> >
> > Suggested-by: Andrew Morton <[email protected]>
> > Signed-off-by: Michal Hocko <[email protected]>
>
> Hmm dunno, I would rather e.g. hide the file behind some config or boot
> option than do this. Or move it to /sys/kernel/debug ?
>

You beat me to it. I was going to suggest moving it to debugfs, as well.



> > ---
> > mm/vmstat.c | 19 ++++++++++++++++++-
> > 1 file changed, 18 insertions(+), 1 deletion(-)
> >
> > diff --git a/mm/vmstat.c b/mm/vmstat.c
> > index 4e885ecd44d1..762034fc3b83 100644
> > --- a/mm/vmstat.c
> > +++ b/mm/vmstat.c
> > @@ -1386,8 +1386,25 @@ static void pagetypeinfo_showfree_print(struct seq_file *m,
> >
> > area = &(zone->free_area[order]);
> >
> > - list_for_each(curr, &area->free_list[mtype])
> > + list_for_each(curr, &area->free_list[mtype]) {
> > freecount++;
> > + /*
> > + * Cap the free_list iteration because it might
> > + * be really large and we are under a spinlock
> > + * so a long time spent here could trigger a
> > + * hard lockup detector. Anyway this is a
> > + * debugging tool so knowing there is a handful
> > + * of pages in this order should be more than
> > + * sufficient
> > + */
> > + if (freecount > 100000) {
> > + seq_printf(m, ">%6lu ", freecount);
> > + spin_unlock_irq(&zone->lock);
> > + cond_resched();
> > + spin_lock_irq(&zone->lock);
> > + continue;
> > + }
> > + }
> > seq_printf(m, "%6lu ", freecount);
> > }
> > seq_putc(m, '\n');
> >
>

2019-10-23 20:19:04

by Vlastimil Babka

Subject: Re: [RFC PATCH 2/2] mm, vmstat: reduce zone->lock holding time by /proc/pagetypeinfo

On 10/23/19 3:37 PM, Michal Hocko wrote:
> On Wed 23-10-19 15:32:05, Vlastimil Babka wrote:
>> On 10/23/19 12:27 PM, Michal Hocko wrote:
>>> From: Michal Hocko <[email protected]>
>>>
>>> pagetypeinfo_showfree_print is called with zone->lock held in irq mode.
>>> This is not really nice because it blocks both interrupts on that
>>> cpu and the page allocator. On large machines this might even trigger
>>> the hard lockup detector.
>>>
>>> Considering the pagetypeinfo is a debugging tool we do not really need
>>> exact numbers here. The primary reason to look at the output is to see
>>> how pageblocks are spread among different migratetypes, therefore putting
>>> a bound on the number of pages on the free_list sounds like a reasonable
>>> tradeoff.
>>>
>>> The new output will simply tell
>>> [...]
>>> Node 6, zone Normal, type Movable >100000 >100000 >100000 >100000 41019 31560 23996 10054 3229 983 648
>>>
>>> instead of
>>> Node 6, zone Normal, type Movable 399568 294127 221558 102119 41019 31560 23996 10054 3229 983 648
>>>
>>> The limit has been chosen arbitrarily and is subject to future
>>> change should there be a need for that.
>>>
>>> Suggested-by: Andrew Morton <[email protected]>
>>> Signed-off-by: Michal Hocko <[email protected]>
>>
>> Hmm dunno, I would rather e.g. hide the file behind some config or boot
>> option than do this. Or move it to /sys/kernel/debug ?
>
> But those wouldn't really help to prevent the lockup, right?

No, but it would perhaps help ensure that only people who know what they
are doing (or been told so by a developer e.g. on linux-mm) will try to
collect the data, and not some automatic monitoring tools taking
periodic snapshots of stuff in /proc that looks interesting.

> Besides that who would enable that config and how much of a difference
> would root only vs. debugfs make?

I would hope those tools don't scrape debugfs as much as /proc, but I
might be wrong of course :)

> Is the incomplete value a real problem?

Hmm perhaps not. If the overflow happens only for one migratetype, one
can also use /proc/buddyinfo to get to the exact count, as was proposed
in this thread for Movable migratetype.

>>> ---
>>> mm/vmstat.c | 19 ++++++++++++++++++-
>>> 1 file changed, 18 insertions(+), 1 deletion(-)
>>>
>>> diff --git a/mm/vmstat.c b/mm/vmstat.c
>>> index 4e885ecd44d1..762034fc3b83 100644
>>> --- a/mm/vmstat.c
>>> +++ b/mm/vmstat.c
>>> @@ -1386,8 +1386,25 @@ static void pagetypeinfo_showfree_print(struct seq_file *m,
>>>
>>> area = &(zone->free_area[order]);
>>>
>>> - list_for_each(curr, &area->free_list[mtype])
>>> + list_for_each(curr, &area->free_list[mtype]) {
>>> freecount++;
>>> + /*
>>> + * Cap the free_list iteration because it might
>>> + * be really large and we are under a spinlock
>>> + * so a long time spent here could trigger a
>>> + * hard lockup detector. Anyway this is a
>>> + * debugging tool so knowing there is a handful
>>> + * of pages in this order should be more than
>>> + * sufficient
>>> + */
>>> + if (freecount > 100000) {
>>> + seq_printf(m, ">%6lu ", freecount);
>>> + spin_unlock_irq(&zone->lock);
>>> + cond_resched();
>>> + spin_lock_irq(&zone->lock);
>>> + continue;
>>> + }
>>> + }
>>> seq_printf(m, "%6lu ", freecount);
>>> }
>>> seq_putc(m, '\n');
>>>
>

2019-10-23 21:22:06

by Mel Gorman

Subject: Re: [RFC PATCH 2/2] mm, vmstat: reduce zone->lock holding time by /proc/pagetypeinfo

On Wed, Oct 23, 2019 at 12:27:37PM +0200, Michal Hocko wrote:
> From: Michal Hocko <[email protected]>
>
> pagetypeinfo_showfree_print is called with zone->lock held in irq mode.
> This is not really nice because it blocks both interrupts on that
> cpu and the page allocator. On large machines this might even trigger
> the hard lockup detector.
>
> Considering the pagetypeinfo is a debugging tool we do not really need
> exact numbers here. The primary reason to look at the output is to see
> how pageblocks are spread among different migratetypes, therefore putting
> a bound on the number of pages on the free_list sounds like a reasonable
> tradeoff.
>
> The new output will simply tell
> [...]
> Node 6, zone Normal, type Movable >100000 >100000 >100000 >100000 41019 31560 23996 10054 3229 983 648
>
> instead of
> Node 6, zone Normal, type Movable 399568 294127 221558 102119 41019 31560 23996 10054 3229 983 648
>
> The limit has been chosen arbitrarily and is subject to future
> change should there be a need for that.
>
> Suggested-by: Andrew Morton <[email protected]>
> Signed-off-by: Michal Hocko <[email protected]>

You could have used need_resched() instead of unconditionally dropping the
lock, but that's very minor for a proc file and it would allow a parallel
allocation to go ahead, so

Acked-by: Mel Gorman <[email protected]>

--
Mel Gorman
SUSE Labs

2019-10-24 00:48:49

by Michal Hocko

Subject: Re: [RFC PATCH 2/2] mm, vmstat: reduce zone->lock holding time by /proc/pagetypeinfo

On Wed 23-10-19 15:48:36, Vlastimil Babka wrote:
> On 10/23/19 3:37 PM, Michal Hocko wrote:
> > On Wed 23-10-19 15:32:05, Vlastimil Babka wrote:
> >> On 10/23/19 12:27 PM, Michal Hocko wrote:
> >>> From: Michal Hocko <[email protected]>
> >>>
> >>> pagetypeinfo_showfree_print is called with zone->lock held in irq mode.
> >>> This is not really nice because it blocks both interrupts on that
> >>> cpu and the page allocator. On large machines this might even trigger
> >>> the hard lockup detector.
> >>>
> >>> Considering the pagetypeinfo is a debugging tool we do not really need
> >>> exact numbers here. The primary reason to look at the output is to see
> >>> how pageblocks are spread among different migratetypes, therefore putting
> >>> a bound on the number of pages on the free_list sounds like a reasonable
> >>> tradeoff.
> >>>
> >>> The new output will simply tell
> >>> [...]
> >>> Node 6, zone Normal, type Movable >100000 >100000 >100000 >100000 41019 31560 23996 10054 3229 983 648
> >>>
> >>> instead of
> >>> Node 6, zone Normal, type Movable 399568 294127 221558 102119 41019 31560 23996 10054 3229 983 648
> >>>
> >>> The limit has been chosen arbitrarily and is subject to future
> >>> change should there be a need for that.
> >>>
> >>> Suggested-by: Andrew Morton <[email protected]>
> >>> Signed-off-by: Michal Hocko <[email protected]>
> >>
> >> Hmm dunno, I would rather e.g. hide the file behind some config or boot
> >> option than do this. Or move it to /sys/kernel/debug ?
> >
> > But those wouldn't really help to prevent the lockup, right?
>
> No, but it would perhaps help ensure that only people who know what they
> are doing (or been told so by a developer e.g. on linux-mm) will try to
> collect the data, and not some automatic monitoring tools taking
> periodic snapshots of stuff in /proc that looks interesting.

Well, we do trust root doesn't do harm, right?

> > Besides that who would enable that config and how much of a difference
> > would root only vs. debugfs make?
>
> I would hope those tools don't scrape debugfs as much as /proc, but I
> might be wrong of course :)
>
> > Is the incomplete value a real problem?
>
> Hmm perhaps not. If the overflow happens only for one migratetype, one
> can also use /proc/buddyinfo to get to the exact count, as was proposed
> in this thread for Movable migratetype.

Let's say this won't be the case. What is the worst case that the
imprecision would cause? In other words, does it really matter whether
we have 100k pages on the free list of a specific migrate type and
order, or say 200k?

--
Michal Hocko
SUSE Labs

2019-10-24 01:24:17

by Waiman Long

Subject: Re: [RFC PATCH 2/2] mm, vmstat: reduce zone->lock holding time by /proc/pagetypeinfo

On 10/23/19 6:27 AM, Michal Hocko wrote:
> From: Michal Hocko <[email protected]>
>
> pagetypeinfo_showfree_print is called with zone->lock held in irq mode.
> This is not really nice because it blocks both interrupts on that
> cpu and the page allocator. On large machines this might even trigger
> the hard lockup detector.
>
> Considering the pagetypeinfo is a debugging tool we do not really need
> exact numbers here. The primary reason to look at the output is to see
> how pageblocks are spread among different migratetypes, therefore putting
> a bound on the number of pages on the free_list sounds like a reasonable
> tradeoff.
>
> The new output will simply tell
> [...]
> Node 6, zone Normal, type Movable >100000 >100000 >100000 >100000 41019 31560 23996 10054 3229 983 648
>
> instead of
> Node 6, zone Normal, type Movable 399568 294127 221558 102119 41019 31560 23996 10054 3229 983 648
>
> The limit has been chosen arbitrarily and is subject to future
> change should there be a need for that.
>
> Suggested-by: Andrew Morton <[email protected]>
> Signed-off-by: Michal Hocko <[email protected]>
> ---
> mm/vmstat.c | 19 ++++++++++++++++++-
> 1 file changed, 18 insertions(+), 1 deletion(-)
>
> diff --git a/mm/vmstat.c b/mm/vmstat.c
> index 4e885ecd44d1..762034fc3b83 100644
> --- a/mm/vmstat.c
> +++ b/mm/vmstat.c
> @@ -1386,8 +1386,25 @@ static void pagetypeinfo_showfree_print(struct seq_file *m,
>
> area = &(zone->free_area[order]);
>
> - list_for_each(curr, &area->free_list[mtype])
> + list_for_each(curr, &area->free_list[mtype]) {
> freecount++;
> + /*
> + * Cap the free_list iteration because it might
> + * be really large and we are under a spinlock
> + * so a long time spent here could trigger a
> + * hard lockup detector. Anyway this is a
> + * debugging tool so knowing there is a handful
> + * of pages in this order should be more than
> + * sufficient
> + */
> + if (freecount > 100000) {
> + seq_printf(m, ">%6lu ", freecount);
> + spin_unlock_irq(&zone->lock);
> + cond_resched();
> + spin_lock_irq(&zone->lock);
> + continue;
list_for_each() is a for loop. The continue statement will just iterate
the rest with the possibility that curr will be stale. Should we use a
goto to jump after the seq_printf() below?
> + }
> + }
> seq_printf(m, "%6lu ", freecount);
> }
> seq_putc(m, '\n');

Cheers,
Longman

2019-10-24 03:33:24

by Waiman Long

Subject: Re: [RFC PATCH 2/2] mm, vmstat: reduce zone->lock holding time by /proc/pagetypeinfo

On 10/23/19 10:56 AM, Waiman Long wrote:
> On 10/23/19 6:27 AM, Michal Hocko wrote:
>> From: Michal Hocko <[email protected]>
>>
>> pagetypeinfo_showfree_print is called with zone->lock held in irq mode.
>> This is not really nice because it blocks both interrupts on that
>> cpu and the page allocator. On large machines this might even trigger
>> the hard lockup detector.
>>
>> Considering the pagetypeinfo is a debugging tool we do not really need
>> exact numbers here. The primary reason to look at the output is to see
>> how pageblocks are spread among different migratetypes, therefore putting
>> a bound on the number of pages on the free_list sounds like a reasonable
>> tradeoff.
>>
>> The new output will simply tell
>> [...]
>> Node 6, zone Normal, type Movable >100000 >100000 >100000 >100000 41019 31560 23996 10054 3229 983 648
>>
>> instead of
>> Node 6, zone Normal, type Movable 399568 294127 221558 102119 41019 31560 23996 10054 3229 983 648
>>
>> The limit has been chosen arbitrarily and is subject to future
>> change should there be a need for that.
>>
>> Suggested-by: Andrew Morton <[email protected]>
>> Signed-off-by: Michal Hocko <[email protected]>
>> ---
>> mm/vmstat.c | 19 ++++++++++++++++++-
>> 1 file changed, 18 insertions(+), 1 deletion(-)
>>
>> diff --git a/mm/vmstat.c b/mm/vmstat.c
>> index 4e885ecd44d1..762034fc3b83 100644
>> --- a/mm/vmstat.c
>> +++ b/mm/vmstat.c
>> @@ -1386,8 +1386,25 @@ static void pagetypeinfo_showfree_print(struct seq_file *m,
>>
>> area = &(zone->free_area[order]);
>>
>> - list_for_each(curr, &area->free_list[mtype])
>> + list_for_each(curr, &area->free_list[mtype]) {
>> freecount++;
>> + /*
>> + * Cap the free_list iteration because it might
>> + * be really large and we are under a spinlock
>> + * so a long time spent here could trigger a
>> + * hard lockup detector. Anyway this is a
>> + * debugging tool so knowing there is a handful
>> + * of pages in this order should be more than
>> + * sufficient
>> + */
>> + if (freecount > 100000) {
>> + seq_printf(m, ">%6lu ", freecount);

It will print ">100001", which seems a bit awkward and will be incorrect if
it is exactly 100001. Could you just hardcode ">100000" into seq_printf()?

Cheers,
Longman


2019-10-24 07:08:37

by Michal Hocko

Subject: Re: [RFC PATCH 2/2] mm, vmstat: reduce zone->lock holding time by /proc/pagetypeinfo

On Wed 23-10-19 10:56:30, Waiman Long wrote:
> On 10/23/19 6:27 AM, Michal Hocko wrote:
> > From: Michal Hocko <[email protected]>
> >
> > pagetypeinfo_showfree_print is called with zone->lock held in irq mode.
> > This is not really nice because it blocks both interrupts on that
> > cpu and the page allocator. On large machines this might even trigger
> > the hard lockup detector.
> >
> > Considering the pagetypeinfo is a debugging tool we do not really need
> > exact numbers here. The primary reason to look at the output is to see
> > how pageblocks are spread among different migratetypes, therefore putting
> > a bound on the number of pages on the free_list sounds like a reasonable
> > tradeoff.
> >
> > The new output will simply tell
> > [...]
> > Node 6, zone Normal, type Movable >100000 >100000 >100000 >100000 41019 31560 23996 10054 3229 983 648
> >
> > instead of
> > Node 6, zone Normal, type Movable 399568 294127 221558 102119 41019 31560 23996 10054 3229 983 648
> >
> > The limit has been chosen arbitrarily and is subject to future
> > change should there be a need for that.
> >
> > Suggested-by: Andrew Morton <[email protected]>
> > Signed-off-by: Michal Hocko <[email protected]>
> > ---
> > mm/vmstat.c | 19 ++++++++++++++++++-
> > 1 file changed, 18 insertions(+), 1 deletion(-)
> >
> > diff --git a/mm/vmstat.c b/mm/vmstat.c
> > index 4e885ecd44d1..762034fc3b83 100644
> > --- a/mm/vmstat.c
> > +++ b/mm/vmstat.c
> > @@ -1386,8 +1386,25 @@ static void pagetypeinfo_showfree_print(struct seq_file *m,
> >
> > area = &(zone->free_area[order]);
> >
> > - list_for_each(curr, &area->free_list[mtype])
> > + list_for_each(curr, &area->free_list[mtype]) {
> > freecount++;
> > + /*
> > + * Cap the free_list iteration because it might
> > + * be really large and we are under a spinlock
> > + * so a long time spent here could trigger a
> > + * hard lockup detector. Anyway this is a
> > + * debugging tool so knowing there is a handful
> > + * of pages in this order should be more than
> > + * sufficient
> > + */
> > + if (freecount > 100000) {
> > + seq_printf(m, ">%6lu ", freecount);
> > + spin_unlock_irq(&zone->lock);
> > + cond_resched();
> > + spin_lock_irq(&zone->lock);
> > + continue;
> list_for_each() is a for loop. The continue statement will just iterate
> the rest with the possibility that curr will be stale. Should we use a
> goto to jump after the seq_printf() below?

You are right. Kinda brown paper bag material. Sorry about that. What
about this on top?
---
diff --git a/mm/vmstat.c b/mm/vmstat.c
index 762034fc3b83..c156ce24a322 100644
--- a/mm/vmstat.c
+++ b/mm/vmstat.c
@@ -1383,11 +1383,11 @@ static void pagetypeinfo_showfree_print(struct seq_file *m,
unsigned long freecount = 0;
struct free_area *area;
struct list_head *curr;
+ bool overflow = false;

area = &(zone->free_area[order]);

list_for_each(curr, &area->free_list[mtype]) {
- freecount++;
/*
* Cap the free_list iteration because it might
* be really large and we are under a spinlock
@@ -1397,15 +1397,15 @@ static void pagetypeinfo_showfree_print(struct seq_file *m,
* of pages in this order should be more than
* sufficient
*/
- if (freecount > 100000) {
- seq_printf(m, ">%6lu ", freecount);
+ if (++freecount >= 100000) {
+ overflow = true;
spin_unlock_irq(&zone->lock);
cond_resched();
spin_lock_irq(&zone->lock);
- continue;
+ break;
}
}
- seq_printf(m, "%6lu ", freecount);
+ seq_printf(m, "%s%6lu ", overflow ? ">" : "", freecount);
}
seq_putc(m, '\n');
}

--
Michal Hocko
SUSE Labs

2019-10-24 07:16:54

by Vlastimil Babka

Subject: Re: [RFC PATCH 2/2] mm, vmstat: reduce zone->lock holding time by /proc/pagetypeinfo

+ linux-api

On 10/23/19 12:27 PM, Michal Hocko wrote:
> From: Michal Hocko <[email protected]>
>
> pagetypeinfo_showfree_print is called with zone->lock held in irq mode.
> This is not really nice because it blocks both interrupts on that
> cpu and the page allocator. On large machines this might even trigger
> the hard lockup detector.
>
> Considering the pagetypeinfo is a debugging tool we do not really need
> exact numbers here. The primary reason to look at the output is to see
> how pageblocks are spread among different migratetypes, therefore putting
> a bound on the number of pages on the free_list sounds like a reasonable
> tradeoff.
>
> The new output will simply tell
> [...]
> Node 6, zone Normal, type Movable >100000 >100000 >100000 >100000 41019 31560 23996 10054 3229 983 648
>
> instead of
> Node 6, zone Normal, type Movable 399568 294127 221558 102119 41019 31560 23996 10054 3229 983 648
>
> The limit has been chosen arbitrarily and is subject to future
> change should there be a need for that.
>
> Suggested-by: Andrew Morton <[email protected]>
> Signed-off-by: Michal Hocko <[email protected]>
> ---
> mm/vmstat.c | 19 ++++++++++++++++++-
> 1 file changed, 18 insertions(+), 1 deletion(-)
>
> diff --git a/mm/vmstat.c b/mm/vmstat.c
> index 4e885ecd44d1..762034fc3b83 100644
> --- a/mm/vmstat.c
> +++ b/mm/vmstat.c
> @@ -1386,8 +1386,25 @@ static void pagetypeinfo_showfree_print(struct seq_file *m,
>
> area = &(zone->free_area[order]);
>
> - list_for_each(curr, &area->free_list[mtype])
> + list_for_each(curr, &area->free_list[mtype]) {
> freecount++;
> + /*
> + * Cap the free_list iteration because it might
> + * be really large and we are under a spinlock
> + * so a long time spent here could trigger a
> + * hard lockup detector. Anyway this is a
> + * debugging tool so knowing there is a handful
> + * of pages in this order should be more than
> + * sufficient
> + */
> + if (freecount > 100000) {
> + seq_printf(m, ">%6lu ", freecount);
> + spin_unlock_irq(&zone->lock);
> + cond_resched();
> + spin_lock_irq(&zone->lock);
> + continue;
> + }
> + }
> seq_printf(m, "%6lu ", freecount);
> }
> seq_putc(m, '\n');
>

2019-10-24 07:19:13

by Waiman Long

Subject: Re: [RFC PATCH 2/2] mm, vmstat: reduce zone->lock holding time by /proc/pagetypeinfo

On 10/23/19 12:10 PM, Michal Hocko wrote:
> On Wed 23-10-19 10:56:30, Waiman Long wrote:
>> On 10/23/19 6:27 AM, Michal Hocko wrote:
>>> From: Michal Hocko <[email protected]>
>>>
>>> pagetypeinfo_showfree_print is called with zone->lock held in irq mode.
>>> This is not really nice because it blocks both interrupts on that
>>> cpu and the page allocator. On large machines this might even trigger
>>> the hard lockup detector.
>>>
>>> Considering the pagetypeinfo is a debugging tool we do not really need
>>> exact numbers here. The primary reason to look at the output is to see
>>> how pageblocks are spread among different migratetypes, therefore putting
>>> a bound on the number of pages on the free_list sounds like a reasonable
>>> tradeoff.
>>>
>>> The new output will simply tell
>>> [...]
>>> Node 6, zone Normal, type Movable >100000 >100000 >100000 >100000 41019 31560 23996 10054 3229 983 648
>>>
>>> instead of
>>> Node 6, zone Normal, type Movable 399568 294127 221558 102119 41019 31560 23996 10054 3229 983 648
>>>
>>> The limit has been chosen arbitrarily and is subject to future
>>> change should there be a need for that.
>>>
>>> Suggested-by: Andrew Morton <[email protected]>
>>> Signed-off-by: Michal Hocko <[email protected]>
>>> ---
>>> mm/vmstat.c | 19 ++++++++++++++++++-
>>> 1 file changed, 18 insertions(+), 1 deletion(-)
>>>
>>> diff --git a/mm/vmstat.c b/mm/vmstat.c
>>> index 4e885ecd44d1..762034fc3b83 100644
>>> --- a/mm/vmstat.c
>>> +++ b/mm/vmstat.c
>>> @@ -1386,8 +1386,25 @@ static void pagetypeinfo_showfree_print(struct seq_file *m,
>>>
>>> area = &(zone->free_area[order]);
>>>
>>> - list_for_each(curr, &area->free_list[mtype])
>>> + list_for_each(curr, &area->free_list[mtype]) {
>>> freecount++;
>>> + /*
>>> + * Cap the free_list iteration because it might
>>> + * be really large and we are under a spinlock
>>> + * so a long time spent here could trigger a
>>> + * hard lockup detector. Anyway this is a
>>> + * debugging tool so knowing there is a handful
>>> + * of pages in this order should be more than
>>> + * sufficient
>>> + */
>>> + if (freecount > 100000) {
>>> + seq_printf(m, ">%6lu ", freecount);
>>> + spin_unlock_irq(&zone->lock);
>>> + cond_resched();
>>> + spin_lock_irq(&zone->lock);
>>> + continue;
>> list_for_each() is a for loop. The continue statement will just iterate
>> the rest with the possibility that curr will be stale. Should we use a
>> goto to jump after the seq_printf() below?
> You are right. Kinda brown paper bag material. Sorry about that. What
> about this on top?
> ---
> diff --git a/mm/vmstat.c b/mm/vmstat.c
> index 762034fc3b83..c156ce24a322 100644
> --- a/mm/vmstat.c
> +++ b/mm/vmstat.c
> @@ -1383,11 +1383,11 @@ static void pagetypeinfo_showfree_print(struct seq_file *m,
> unsigned long freecount = 0;
> struct free_area *area;
> struct list_head *curr;
> + bool overflow = false;
>
> area = &(zone->free_area[order]);
>
> list_for_each(curr, &area->free_list[mtype]) {
> - freecount++;
> /*
> * Cap the free_list iteration because it might
> * be really large and we are under a spinlock
> @@ -1397,15 +1397,15 @@ static void pagetypeinfo_showfree_print(struct seq_file *m,
> * of pages in this order should be more than
> * sufficient
> */
> - if (freecount > 100000) {
> - seq_printf(m, ">%6lu ", freecount);
> + if (++freecount >= 100000) {
> + overflow = true;
> spin_unlock_irq(&zone->lock);
> cond_resched();
> spin_lock_irq(&zone->lock);
> - continue;
> + break;
> }
> }
> - seq_printf(m, "%6lu ", freecount);
> + seq_printf(m, "%s%6lu ", overflow ? ">" : "", freecount);
> }
> seq_putc(m, '\n');
> }
>
Yes, that looks good to me. There is still a small chance that the
description will be a bit off if the count is exactly 100,000. However,
it is not a big deal and I can live with that.

Thanks,
Longman

2019-10-24 07:23:11

by Vlastimil Babka

[permalink] [raw]
Subject: Re: [RFC PATCH 2/2] mm, vmstat: reduce zone->lock holding time by /proc/pagetypeinfo

On 10/23/19 4:31 PM, Michal Hocko wrote:
> On Wed 23-10-19 15:48:36, Vlastimil Babka wrote:
>> On 10/23/19 3:37 PM, Michal Hocko wrote:
>>>
>>> But those wouldn't really help to prevent from the lockup, right?
>>
>> No, but it would perhaps help ensure that only people who know what they
>> are doing (or been told so by a developer e.g. on linux-mm) will try to
>> collect the data, and not some automatic monitoring tools taking
>> periodic snapshots of stuff in /proc that looks interesting.
>
> Well, we do trust root doesn't do harm, right?

Perhaps too much :)

>>> Besides that who would enable that config and how much of a difference
>>> would root only vs. debugfs make?
>>
>> I would hope those tools don't scrap debugfs as much as /proc, but I
>> might be wrong of course :)
>>
>>> Is the incomplete value a real problem?
>>
>> Hmm perhaps not. If the overflow happens only for one migratetype, one
>> can use also /proc/buddyinfo to get to the exact count, as was proposed
>> in this thread for Movable migratetype.
>
> Let's say this won't be the case. What is the worst case that the
> imprecision would cause? In other words. Does it really matter whether
> we have 100k pages on the free list of the specific migrate type for
> order or say 200k?

Probably not, it rather matters for which order the count approaches zero.


2019-10-24 07:23:14

by Waiman Long

[permalink] [raw]
Subject: Re: [RFC PATCH 2/2] mm, vmstat: reduce zone->lock holding time by /proc/pagetypeinfo

On 10/23/19 12:17 PM, Waiman Long wrote:
> On 10/23/19 12:10 PM, Michal Hocko wrote:
>> On Wed 23-10-19 10:56:30, Waiman Long wrote:
>>> On 10/23/19 6:27 AM, Michal Hocko wrote:
>>>> From: Michal Hocko <[email protected]>
>>>>
>>>> pagetypeinfo_showfree_print is called with zone->lock held in irq mode.
>>>> This is not really nice because it blocks both any interrupts on that
>>>> cpu and the page allocator. On large machines this might even trigger
>>>> the hard lockup detector.
>>>>
>>>> Considering the pagetypeinfo is a debugging tool we do not really need
>>>> exact numbers here. The primary reason to look at the output is to see
>>>> how pageblocks are spread among different migratetypes, therefore putting
>>>> a bound on the number of pages on the free_list sounds like a reasonable
>>>> tradeoff.
>>>>
>>>> The new output will simply tell
>>>> [...]
>>>> Node 6, zone Normal, type Movable >100000 >100000 >100000 >100000 41019 31560 23996 10054 3229 983 648
>>>>
>>>> instead of
>>>> Node 6, zone Normal, type Movable 399568 294127 221558 102119 41019 31560 23996 10054 3229 983 648
>>>>
>>>> The limit has been chosen arbitrarily and is subject to future
>>>> change should there be a need for that.
>>>>
>>>> Suggested-by: Andrew Morton <[email protected]>
>>>> Signed-off-by: Michal Hocko <[email protected]>
>>>> ---
>>>> mm/vmstat.c | 19 ++++++++++++++++++-
>>>> 1 file changed, 18 insertions(+), 1 deletion(-)
>>>>
>>>> diff --git a/mm/vmstat.c b/mm/vmstat.c
>>>> index 4e885ecd44d1..762034fc3b83 100644
>>>> --- a/mm/vmstat.c
>>>> +++ b/mm/vmstat.c
>>>> @@ -1386,8 +1386,25 @@ static void pagetypeinfo_showfree_print(struct seq_file *m,
>>>>
>>>> area = &(zone->free_area[order]);
>>>>
>>>> - list_for_each(curr, &area->free_list[mtype])
>>>> + list_for_each(curr, &area->free_list[mtype]) {
>>>> freecount++;
>>>> + /*
>>>> + * Cap the free_list iteration because it might
>>>> + * be really large and we are under a spinlock
>>>> + * so a long time spent here could trigger a
>>>> + * hard lockup detector. Anyway this is a
>>>> + * debugging tool so knowing there is a handful
>>>> + * of pages in this order should be more than
>>>> + * sufficient
>>>> + */
>>>> + if (freecount > 100000) {
>>>> + seq_printf(m, ">%6lu ", freecount);
>>>> + spin_unlock_irq(&zone->lock);
>>>> + cond_resched();
>>>> + spin_lock_irq(&zone->lock);
>>>> + continue;
>>> list_for_each() is a for loop. The continue statement will just iterate
>>> the rest of the list with the possibility that curr will be stale. Should
>>> we use a goto to jump after the seq_printf() below?
>> You are right. Kinda brown paper bag material. Sorry about that. What
>> about this on top?
>> ---
>> diff --git a/mm/vmstat.c b/mm/vmstat.c
>> index 762034fc3b83..c156ce24a322 100644
>> --- a/mm/vmstat.c
>> +++ b/mm/vmstat.c
>> @@ -1383,11 +1383,11 @@ static void pagetypeinfo_showfree_print(struct seq_file *m,
>> unsigned long freecount = 0;
>> struct free_area *area;
>> struct list_head *curr;
>> + bool overflow = false;
>>
>> area = &(zone->free_area[order]);
>>
>> list_for_each(curr, &area->free_list[mtype]) {
>> - freecount++;
>> /*
>> * Cap the free_list iteration because it might
>> * be really large and we are under a spinlock
>> @@ -1397,15 +1397,15 @@ static void pagetypeinfo_showfree_print(struct seq_file *m,
>> * of pages in this order should be more than
>> * sufficient
>> */
>> - if (freecount > 100000) {
>> - seq_printf(m, ">%6lu ", freecount);
>> + if (++freecount >= 100000) {
>> + overflow = true;
>> spin_unlock_irq(&zone->lock);
>> cond_resched();
>> spin_lock_irq(&zone->lock);
>> - continue;
>> + break;
>> }
>> }
>> - seq_printf(m, "%6lu ", freecount);
>> + seq_printf(m, "%s%6lu ", overflow ? ">" : "", freecount);
>> }
>> seq_putc(m, '\n');
>> }
>>
> Yes, that looks good to me. There is still a small chance that the
> description will be a bit off if the count is exactly 100,000. However,
> it is not a big deal and I can live with that.

Alternatively, you can do

if (++freecount > 100000) {
        :
    freecount--;
    break;
}

Cheers,
Longman

2019-10-24 08:49:21

by Waiman Long

[permalink] [raw]
Subject: [PATCH 1/2] mm, vmstat: Release zone lock more frequently when reading /proc/pagetypeinfo

With a threshold of 100000, it is still possible that the zone lock
will be held for a very long time in the worst case scenario where all
the counts are just below the threshold. With up to 6 migration types
and 11 orders, that means up to 6.6 million list iterations.

Track the total number of list iterations done since the acquisition
of the zone lock and release it whenever 100000 iterations or more have
been completed. This will cap the lock hold time to no more than 200,000
list iterations.

Signed-off-by: Waiman Long <[email protected]>
---
mm/vmstat.c | 18 ++++++++++++++----
1 file changed, 14 insertions(+), 4 deletions(-)

diff --git a/mm/vmstat.c b/mm/vmstat.c
index 57ba091e5460..c5b82fdf54af 100644
--- a/mm/vmstat.c
+++ b/mm/vmstat.c
@@ -1373,6 +1373,7 @@ static void pagetypeinfo_showfree_print(struct seq_file *m,
pg_data_t *pgdat, struct zone *zone)
{
int order, mtype;
+ unsigned long iteration_count = 0;

for (mtype = 0; mtype < MIGRATE_TYPES; mtype++) {
seq_printf(m, "Node %4d, zone %8s, type %12s ",
@@ -1397,15 +1398,24 @@ static void pagetypeinfo_showfree_print(struct seq_file *m,
* of pages in this order should be more than
* sufficient
*/
- if (++freecount >= 100000) {
+ if (++freecount > 100000) {
overflow = true;
- spin_unlock_irq(&zone->lock);
- cond_resched();
- spin_lock_irq(&zone->lock);
+ freecount--;
break;
}
}
seq_printf(m, "%s%6lu ", overflow ? ">" : "", freecount);
+ /*
+ * Take a break and release the zone lock when
+ * 100000 or more entries have been iterated.
+ */
+ iteration_count += freecount;
+ if (iteration_count >= 100000) {
+ iteration_count = 0;
+ spin_unlock_irq(&zone->lock);
+ cond_resched();
+ spin_lock_irq(&zone->lock);
+ }
}
seq_putc(m, '\n');
}
--
2.18.1

2019-10-24 08:58:48

by Waiman Long

[permalink] [raw]
Subject: Re: [PATCH 1/2] mm, vmstat: Release zone lock more frequently when reading /proc/pagetypeinfo

On 10/23/19 2:01 PM, Michal Hocko wrote:
> On Wed 23-10-19 13:34:22, Waiman Long wrote:
>> With a threshold of 100000, it is still possible that the zone lock
>> will be held for a very long time in the worst case scenario where all
>> the counts are just below the threshold. With up to 6 migration types
>> and 11 orders, that means up to 6.6 million list iterations.
>>
>> Track the total number of list iterations done since the acquisition
>> of the zone lock and release it whenever 100000 iterations or more have
>> been completed. This will cap the lock hold time to no more than 200,000
>> list iterations.
>>
>> Signed-off-by: Waiman Long <[email protected]>
>> ---
>> mm/vmstat.c | 18 ++++++++++++++----
>> 1 file changed, 14 insertions(+), 4 deletions(-)
>>
>> diff --git a/mm/vmstat.c b/mm/vmstat.c
>> index 57ba091e5460..c5b82fdf54af 100644
>> --- a/mm/vmstat.c
>> +++ b/mm/vmstat.c
>> @@ -1373,6 +1373,7 @@ static void pagetypeinfo_showfree_print(struct seq_file *m,
>> pg_data_t *pgdat, struct zone *zone)
>> {
>> int order, mtype;
>> + unsigned long iteration_count = 0;
>>
>> for (mtype = 0; mtype < MIGRATE_TYPES; mtype++) {
>> seq_printf(m, "Node %4d, zone %8s, type %12s ",
>> @@ -1397,15 +1398,24 @@ static void pagetypeinfo_showfree_print(struct seq_file *m,
>> * of pages in this order should be more than
>> * sufficient
>> */
>> - if (++freecount >= 100000) {
>> + if (++freecount > 100000) {
>> overflow = true;
>> - spin_unlock_irq(&zone->lock);
>> - cond_resched();
>> - spin_lock_irq(&zone->lock);
>> + freecount--;
>> break;
>> }
>> }
>> seq_printf(m, "%s%6lu ", overflow ? ">" : "", freecount);
>> + /*
>> + * Take a break and release the zone lock when
>> + * 100000 or more entries have been iterated.
>> + */
>> + iteration_count += freecount;
>> + if (iteration_count >= 100000) {
>> + iteration_count = 0;
>> + spin_unlock_irq(&zone->lock);
>> + cond_resched();
>> + spin_lock_irq(&zone->lock);
>> + }
> Aren't you overengineering this a bit? If you are still worried then we
> can simply cond_resched for each order
> diff --git a/mm/vmstat.c b/mm/vmstat.c
> index c156ce24a322..ddb89f4e0486 100644
> --- a/mm/vmstat.c
> +++ b/mm/vmstat.c
> @@ -1399,13 +1399,13 @@ static void pagetypeinfo_showfree_print(struct seq_file *m,
> */
> if (++freecount >= 100000) {
> overflow = true;
> - spin_unlock_irq(&zone->lock);
> - cond_resched();
> - spin_lock_irq(&zone->lock);
> break;
> }
> }
> seq_printf(m, "%s%6lu ", overflow ? ">" : "", freecount);
> + spin_unlock_irq(&zone->lock);
> + cond_resched();
> + spin_lock_irq(&zone->lock);
> }
> seq_putc(m, '\n');
> }
>
> I do not have a strong opinion here but I can fold this into my patch 2.

If the free list is empty or is very short, there is probably no need to
release and reacquire the lock. How about adding a check for a lower
bound like:

if (freecount > 1000) {
    spin_unlock_irq(&zone->lock);
    cond_resched();
    spin_lock_irq(&zone->lock);
}

Cheers,
Longman

2019-10-24 11:42:26

by Michal Hocko

[permalink] [raw]
Subject: Re: [RFC PATCH 2/2] mm, vmstat: reduce zone->lock holding time by /proc/pagetypeinfo

With a brown paper bag bug fixed. I have also added a note about low
number of pages being more important as per Vlastimil's feedback

From 0282f604144a5c06fdf3cf0bb2df532411e7f8c9 Mon Sep 17 00:00:00 2001
From: Michal Hocko <[email protected]>
Date: Wed, 23 Oct 2019 12:13:02 +0200
Subject: [PATCH] mm, vmstat: reduce zone->lock holding time by
/proc/pagetypeinfo

pagetypeinfo_showfree_print is called with zone->lock held in irq mode.
This is not really nice because it blocks both any interrupts on that
cpu and the page allocator. On large machines this might even trigger
the hard lockup detector.

Considering the pagetypeinfo is a debugging tool we do not really need
exact numbers here. The primary reason to look at the output is to see
how pageblocks are spread among different migratetypes, and a low number
of pages is much more interesting, therefore putting a bound on the
number of pages on the free_list sounds like a reasonable tradeoff.

The new output will simply tell
[...]
Node 6, zone Normal, type Movable >100000 >100000 >100000 >100000 41019 31560 23996 10054 3229 983 648

instead of
Node 6, zone Normal, type Movable 399568 294127 221558 102119 41019 31560 23996 10054 3229 983 648

The limit has been chosen arbitrarily and is subject to future
change should there be a need for that.

Suggested-by: Andrew Morton <[email protected]>
Acked-by: Mel Gorman <[email protected]>
Signed-off-by: Michal Hocko <[email protected]>
---
mm/vmstat.c | 23 ++++++++++++++++++++---
1 file changed, 20 insertions(+), 3 deletions(-)

diff --git a/mm/vmstat.c b/mm/vmstat.c
index 4e885ecd44d1..c156ce24a322 100644
--- a/mm/vmstat.c
+++ b/mm/vmstat.c
@@ -1383,12 +1383,29 @@ static void pagetypeinfo_showfree_print(struct seq_file *m,
unsigned long freecount = 0;
struct free_area *area;
struct list_head *curr;
+ bool overflow = false;

area = &(zone->free_area[order]);

- list_for_each(curr, &area->free_list[mtype])
- freecount++;
- seq_printf(m, "%6lu ", freecount);
+ list_for_each(curr, &area->free_list[mtype]) {
+ /*
+ * Cap the free_list iteration because it might
+ * be really large and we are under a spinlock
+ * so a long time spent here could trigger a
+ * hard lockup detector. Anyway this is a
+ * debugging tool so knowing there is a handful
+ * of pages in this order should be more than
+ * sufficient
+ */
+ if (++freecount >= 100000) {
+ overflow = true;
+ spin_unlock_irq(&zone->lock);
+ cond_resched();
+ spin_lock_irq(&zone->lock);
+ break;
+ }
+ }
+ seq_printf(m, "%s%6lu ", overflow ? ">" : "", freecount);
}
seq_putc(m, '\n');
}
--
2.20.1

--
Michal Hocko
SUSE Labs

2019-10-24 11:42:56

by Waiman Long

[permalink] [raw]
Subject: Re: [RFC PATCH 2/2] mm, vmstat: reduce zone->lock holding time by /proc/pagetypeinfo

On 10/23/19 12:41 PM, Michal Hocko wrote:
> With a brown paper bag bug fixed. I have also added a note about low
> number of pages being more important as per Vlastimil's feedback
>
> From 0282f604144a5c06fdf3cf0bb2df532411e7f8c9 Mon Sep 17 00:00:00 2001
> From: Michal Hocko <[email protected]>
> Date: Wed, 23 Oct 2019 12:13:02 +0200
> Subject: [PATCH] mm, vmstat: reduce zone->lock holding time by
> /proc/pagetypeinfo
>
> pagetypeinfo_showfree_print is called with zone->lock held in irq mode.
> This is not really nice because it blocks both any interrupts on that
> cpu and the page allocator. On large machines this might even trigger
> the hard lockup detector.
>
> Considering the pagetypeinfo is a debugging tool we do not really need
> exact numbers here. The primary reason to look at the output is to see
> how pageblocks are spread among different migratetypes, and a low number
> of pages is much more interesting, therefore putting a bound on the
> number of pages on the free_list sounds like a reasonable tradeoff.
>
> The new output will simply tell
> [...]
> Node 6, zone Normal, type Movable >100000 >100000 >100000 >100000 41019 31560 23996 10054 3229 983 648
>
> instead of
> Node 6, zone Normal, type Movable 399568 294127 221558 102119 41019 31560 23996 10054 3229 983 648
>
> The limit has been chosen arbitrarily and is subject to future
> change should there be a need for that.
>
> Suggested-by: Andrew Morton <[email protected]>
> Acked-by: Mel Gorman <[email protected]>
> Signed-off-by: Michal Hocko <[email protected]>
> ---
> mm/vmstat.c | 23 ++++++++++++++++++++---
> 1 file changed, 20 insertions(+), 3 deletions(-)
>
> diff --git a/mm/vmstat.c b/mm/vmstat.c
> index 4e885ecd44d1..c156ce24a322 100644
> --- a/mm/vmstat.c
> +++ b/mm/vmstat.c
> @@ -1383,12 +1383,29 @@ static void pagetypeinfo_showfree_print(struct seq_file *m,
> unsigned long freecount = 0;
> struct free_area *area;
> struct list_head *curr;
> + bool overflow = false;
>
> area = &(zone->free_area[order]);
>
> - list_for_each(curr, &area->free_list[mtype])
> - freecount++;
> - seq_printf(m, "%6lu ", freecount);
> + list_for_each(curr, &area->free_list[mtype]) {
> + /*
> + * Cap the free_list iteration because it might
> + * be really large and we are under a spinlock
> + * so a long time spent here could trigger a
> + * hard lockup detector. Anyway this is a
> + * debugging tool so knowing there is a handful
> + * of pages in this order should be more than
> + * sufficient
> + */
> + if (++freecount >= 100000) {
> + overflow = true;
> + spin_unlock_irq(&zone->lock);
> + cond_resched();
> + spin_lock_irq(&zone->lock);
> + break;
> + }
> + }
> + seq_printf(m, "%s%6lu ", overflow ? ">" : "", freecount);
> }
> seq_putc(m, '\n');
> }

Reviewed-by: Waiman Long <[email protected]>

2019-10-24 11:49:56

by Waiman Long

[permalink] [raw]
Subject: [PATCH 2/2] mm, vmstat: List total free blocks for each order in /proc/pagetypeinfo

Now that the free block count for each migration type in
/proc/pagetypeinfo may not show the exact count if it exceeds 100,000,
users may not know how much larger the actual counts are. As the
free_area structure already tracks the total free block count in
nr_free, we may as well print it out at no additional cost. That will
give users a rough idea of where the upper bounds are.

If there is no overflow, the presence of the total counts will also
enable us to check if the nr_free counts match the total number of
entries in the free lists.

Signed-off-by: Waiman Long <[email protected]>
---
mm/vmstat.c | 13 ++++++++++++-
1 file changed, 12 insertions(+), 1 deletion(-)

diff --git a/mm/vmstat.c b/mm/vmstat.c
index c5b82fdf54af..172946d8f358 100644
--- a/mm/vmstat.c
+++ b/mm/vmstat.c
@@ -1373,6 +1373,7 @@ static void pagetypeinfo_showfree_print(struct seq_file *m,
pg_data_t *pgdat, struct zone *zone)
{
int order, mtype;
+ struct free_area *area;
unsigned long iteration_count = 0;

for (mtype = 0; mtype < MIGRATE_TYPES; mtype++) {
@@ -1382,7 +1383,6 @@ static void pagetypeinfo_showfree_print(struct seq_file *m,
migratetype_names[mtype]);
for (order = 0; order < MAX_ORDER; ++order) {
unsigned long freecount = 0;
- struct free_area *area;
struct list_head *curr;
bool overflow = false;

@@ -1419,6 +1419,17 @@ static void pagetypeinfo_showfree_print(struct seq_file *m,
}
seq_putc(m, '\n');
}
+
+ /*
+ * List total free blocks per order
+ */
+ seq_printf(m, "Node %4d, zone %8s, total ",
+ pgdat->node_id, zone->name);
+ for (order = 0; order < MAX_ORDER; ++order) {
+ area = &(zone->free_area[order]);
+ seq_printf(m, "%6lu ", area->nr_free);
+ }
+ seq_putc(m, '\n');
}

/* Print out the free pages at each order for each migatetype */
--
2.18.1

2019-10-24 11:53:35

by Michal Hocko

[permalink] [raw]
Subject: Re: [PATCH 1/2] mm, vmstat: Release zone lock more frequently when reading /proc/pagetypeinfo

On Wed 23-10-19 13:34:22, Waiman Long wrote:
> With a threshold of 100000, it is still possible that the zone lock
> will be held for a very long time in the worst case scenario where all
> the counts are just below the threshold. With up to 6 migration types
> and 11 orders, that means up to 6.6 million list iterations.
>
> Track the total number of list iterations done since the acquisition
> of the zone lock and release it whenever 100000 iterations or more have
> been completed. This will cap the lock hold time to no more than 200,000
> list iterations.
>
> Signed-off-by: Waiman Long <[email protected]>
> ---
> mm/vmstat.c | 18 ++++++++++++++----
> 1 file changed, 14 insertions(+), 4 deletions(-)
>
> diff --git a/mm/vmstat.c b/mm/vmstat.c
> index 57ba091e5460..c5b82fdf54af 100644
> --- a/mm/vmstat.c
> +++ b/mm/vmstat.c
> @@ -1373,6 +1373,7 @@ static void pagetypeinfo_showfree_print(struct seq_file *m,
> pg_data_t *pgdat, struct zone *zone)
> {
> int order, mtype;
> + unsigned long iteration_count = 0;
>
> for (mtype = 0; mtype < MIGRATE_TYPES; mtype++) {
> seq_printf(m, "Node %4d, zone %8s, type %12s ",
> @@ -1397,15 +1398,24 @@ static void pagetypeinfo_showfree_print(struct seq_file *m,
> * of pages in this order should be more than
> * sufficient
> */
> - if (++freecount >= 100000) {
> + if (++freecount > 100000) {
> overflow = true;
> - spin_unlock_irq(&zone->lock);
> - cond_resched();
> - spin_lock_irq(&zone->lock);
> + freecount--;
> break;
> }
> }
> seq_printf(m, "%s%6lu ", overflow ? ">" : "", freecount);
> + /*
> + * Take a break and release the zone lock when
> + * 100000 or more entries have been iterated.
> + */
> + iteration_count += freecount;
> + if (iteration_count >= 100000) {
> + iteration_count = 0;
> + spin_unlock_irq(&zone->lock);
> + cond_resched();
> + spin_lock_irq(&zone->lock);
> + }

Aren't you overengineering this a bit? If you are still worried then we
can simply cond_resched for each order
diff --git a/mm/vmstat.c b/mm/vmstat.c
index c156ce24a322..ddb89f4e0486 100644
--- a/mm/vmstat.c
+++ b/mm/vmstat.c
@@ -1399,13 +1399,13 @@ static void pagetypeinfo_showfree_print(struct seq_file *m,
*/
if (++freecount >= 100000) {
overflow = true;
- spin_unlock_irq(&zone->lock);
- cond_resched();
- spin_lock_irq(&zone->lock);
break;
}
}
seq_printf(m, "%s%6lu ", overflow ? ">" : "", freecount);
+ spin_unlock_irq(&zone->lock);
+ cond_resched();
+ spin_lock_irq(&zone->lock);
}
seq_putc(m, '\n');
}

I do not have a strong opinion here but I can fold this into my patch 2.
--
Michal Hocko
SUSE Labs

2019-10-24 14:13:08

by Michal Hocko

[permalink] [raw]
Subject: Re: [PATCH 1/2] mm, vmstat: Release zone lock more frequently when reading /proc/pagetypeinfo

On Wed 23-10-19 14:14:14, Waiman Long wrote:
> On 10/23/19 2:01 PM, Michal Hocko wrote:
> > On Wed 23-10-19 13:34:22, Waiman Long wrote:
> >> With a threshold of 100000, it is still possible that the zone lock
> >> will be held for a very long time in the worst case scenario where all
> >> the counts are just below the threshold. With up to 6 migration types
> >> and 11 orders, that means up to 6.6 million list iterations.
> >>
> >> Track the total number of list iterations done since the acquisition
> >> of the zone lock and release it whenever 100000 iterations or more have
> >> been completed. This will cap the lock hold time to no more than 200,000
> >> list iterations.
> >>
> >> Signed-off-by: Waiman Long <[email protected]>
> >> ---
> >> mm/vmstat.c | 18 ++++++++++++++----
> >> 1 file changed, 14 insertions(+), 4 deletions(-)
> >>
> >> diff --git a/mm/vmstat.c b/mm/vmstat.c
> >> index 57ba091e5460..c5b82fdf54af 100644
> >> --- a/mm/vmstat.c
> >> +++ b/mm/vmstat.c
> >> @@ -1373,6 +1373,7 @@ static void pagetypeinfo_showfree_print(struct seq_file *m,
> >> pg_data_t *pgdat, struct zone *zone)
> >> {
> >> int order, mtype;
> >> + unsigned long iteration_count = 0;
> >>
> >> for (mtype = 0; mtype < MIGRATE_TYPES; mtype++) {
> >> seq_printf(m, "Node %4d, zone %8s, type %12s ",
> >> @@ -1397,15 +1398,24 @@ static void pagetypeinfo_showfree_print(struct seq_file *m,
> >> * of pages in this order should be more than
> >> * sufficient
> >> */
> >> - if (++freecount >= 100000) {
> >> + if (++freecount > 100000) {
> >> overflow = true;
> >> - spin_unlock_irq(&zone->lock);
> >> - cond_resched();
> >> - spin_lock_irq(&zone->lock);
> >> + freecount--;
> >> break;
> >> }
> >> }
> >> seq_printf(m, "%s%6lu ", overflow ? ">" : "", freecount);
> >> + /*
> >> + * Take a break and release the zone lock when
> >> + * 100000 or more entries have been iterated.
> >> + */
> >> + iteration_count += freecount;
> >> + if (iteration_count >= 100000) {
> >> + iteration_count = 0;
> >> + spin_unlock_irq(&zone->lock);
> >> + cond_resched();
> >> + spin_lock_irq(&zone->lock);
> >> + }
> > Aren't you overengineering this a bit? If you are still worried then we
> > can simply cond_resched for each order
> > diff --git a/mm/vmstat.c b/mm/vmstat.c
> > index c156ce24a322..ddb89f4e0486 100644
> > --- a/mm/vmstat.c
> > +++ b/mm/vmstat.c
> > @@ -1399,13 +1399,13 @@ static void pagetypeinfo_showfree_print(struct seq_file *m,
> > */
> > if (++freecount >= 100000) {
> > overflow = true;
> > - spin_unlock_irq(&zone->lock);
> > - cond_resched();
> > - spin_lock_irq(&zone->lock);
> > break;
> > }
> > }
> > seq_printf(m, "%s%6lu ", overflow ? ">" : "", freecount);
> > + spin_unlock_irq(&zone->lock);
> > + cond_resched();
> > + spin_lock_irq(&zone->lock);
> > }
> > seq_putc(m, '\n');
> > }
> >
> > I do not have a strong opinion here but I can fold this into my patch 2.
>
> If the free list is empty or is very short, there is probably no need to
> release and reacquire the lock. How about adding a check for a lower
> bound like:

Again, does it really make any sense to micro-optimize something like
this? It is a debugging tool. I would rather go simple.
--
Michal Hocko
SUSE Labs