2019-10-22 21:24:00

by Waiman Long

Subject: [PATCH] mm/vmstat: Reduce zone lock hold time when reading /proc/pagetypeinfo

The pagetypeinfo_showfree_print() function prints out the number of
free blocks for each of the page orders and migrate types. The current
code just iterates each of the free lists to get the counts. There are
bug reports about hard lockup panics when reading the /proc/pagetypeinfo
file simply because it takes too long to iterate all the free lists within
a zone while holding the zone lock with irqs disabled.

Given that /proc/pagetypeinfo is readable by all, the possibility
of crashing a system by the simple act of reading /proc/pagetypeinfo
by any user is a security problem that needs to be addressed.

There is a free_area structure associated with each page order. There
is also a nr_free count within the free_area for all the different
migration types combined. Tracking the number of free list entries
for each migration type will probably add some overhead to the fast
paths like moving pages from one migration type to another which may
not be desirable.

We can actually skip iterating the list of one of the migration types
and use nr_free to compute the missing count. Since MIGRATE_MOVABLE
is usually the largest one on large memory systems, this is the one
to be skipped. Since the printing order is migration-type => order, we
will have to store the counts in an internal 2D array before printing
them out.

Even with the MIGRATE_MOVABLE pages skipped, we may still be holding the
zone lock for too long, blocking other zone lock waiters from running.
This can be problematic for systems with a large amount of memory.
So a check is added to temporarily release the lock and reschedule if
more than 64k list entries have been iterated for each order. With
a MAX_ORDER of 11, the worst case is iterating about 700k list
entries before releasing the lock.

Signed-off-by: Waiman Long <[email protected]>
---
mm/vmstat.c | 51 +++++++++++++++++++++++++++++++++++++++++----------
1 file changed, 41 insertions(+), 10 deletions(-)

diff --git a/mm/vmstat.c b/mm/vmstat.c
index 6afc892a148a..40c9a1494709 100644
--- a/mm/vmstat.c
+++ b/mm/vmstat.c
@@ -1373,23 +1373,54 @@ static void pagetypeinfo_showfree_print(struct seq_file *m,
pg_data_t *pgdat, struct zone *zone)
{
int order, mtype;
+ unsigned long nfree[MAX_ORDER][MIGRATE_TYPES];

- for (mtype = 0; mtype < MIGRATE_TYPES; mtype++) {
- seq_printf(m, "Node %4d, zone %8s, type %12s ",
- pgdat->node_id,
- zone->name,
- migratetype_names[mtype]);
- for (order = 0; order < MAX_ORDER; ++order) {
+ lockdep_assert_held(&zone->lock);
+ lockdep_assert_irqs_disabled();
+
+ /*
+ * MIGRATE_MOVABLE is usually the largest one in large memory
+ * systems. We skip iterating that list. Instead, we compute it by
+ * subtracting the total of the rest from free_area->nr_free.
+ */
+ for (order = 0; order < MAX_ORDER; ++order) {
+ unsigned long nr_total = 0;
+ struct free_area *area = &(zone->free_area[order]);
+
+ for (mtype = 0; mtype < MIGRATE_TYPES; mtype++) {
unsigned long freecount = 0;
- struct free_area *area;
struct list_head *curr;

- area = &(zone->free_area[order]);
-
+ if (mtype == MIGRATE_MOVABLE)
+ continue;
list_for_each(curr, &area->free_list[mtype])
freecount++;
- seq_printf(m, "%6lu ", freecount);
+ nfree[order][mtype] = freecount;
+ nr_total += freecount;
}
+ nfree[order][MIGRATE_MOVABLE] = area->nr_free - nr_total;
+
+ /*
+ * If we have already iterated more than 64k of list
+ * entries, we might have held the zone lock for too long.
+ * Temporarily release the lock and reschedule before
+ * continuing so that other lock waiters have a chance
+ * to run.
+ */
+ if (nr_total > (1 << 16)) {
+ spin_unlock_irq(&zone->lock);
+ cond_resched();
+ spin_lock_irq(&zone->lock);
+ }
+ }
+
+ for (mtype = 0; mtype < MIGRATE_TYPES; mtype++) {
+ seq_printf(m, "Node %4d, zone %8s, type %12s ",
+ pgdat->node_id,
+ zone->name,
+ migratetype_names[mtype]);
+ for (order = 0; order < MAX_ORDER; ++order)
+ seq_printf(m, "%6lu ", nfree[order][mtype]);
seq_putc(m, '\n');
}
}
--
2.18.1


2019-10-23 00:07:09

by Andrew Morton

Subject: Re: [PATCH] mm/vmstat: Reduce zone lock hold time when reading /proc/pagetypeinfo

On Tue, 22 Oct 2019 12:21:56 -0400 Waiman Long <[email protected]> wrote:

> The pagetypeinfo_showfree_print() function prints out the number of
> free blocks for each of the page orders and migrate types. The current
> code just iterates the each of the free lists to get counts. There are
> bug reports about hard lockup panics when reading the /proc/pagetyeinfo
> file just because it look too long to iterate all the free lists within
> a zone while holing the zone lock with irq disabled.
>
> Given the fact that /proc/pagetypeinfo is readable by all, the possiblity
> of crashing a system by the simple act of reading /proc/pagetypeinfo
> by any user is a security problem that needs to be addressed.

Yes.

> There is a free_area structure associated with each page order. There
> is also a nr_free count within the free_area for all the different
> migration types combined. Tracking the number of free list entries
> for each migration type will probably add some overhead to the fast
> paths like moving pages from one migration type to another which may
> not be desirable.
>
> we can actually skip iterating the list of one of the migration types
> and used nr_free to compute the missing count. Since MIGRATE_MOVABLE
> is usually the largest one on large memory systems, this is the one
> to be skipped. Since the printing order is migration-type => order, we
> will have to store the counts in an internal 2D array before printing
> them out.
>
> Even by skipping the MIGRATE_MOVABLE pages, we may still be holding the
> zone lock for too long blocking out other zone lock waiters from being
> run. This can be problematic for systems with large amount of memory.
> So a check is added to temporarily release the lock and reschedule if
> more than 64k of list entries have been iterated for each order. With
> a MAX_ORDER of 11, the worst case will be iterating about 700k of list
> entries before releasing the lock.
>
> ...
>
> --- a/mm/vmstat.c
> +++ b/mm/vmstat.c
> @@ -1373,23 +1373,54 @@ static void pagetypeinfo_showfree_print(struct seq_file *m,
> pg_data_t *pgdat, struct zone *zone)
> {
> int order, mtype;
> + unsigned long nfree[MAX_ORDER][MIGRATE_TYPES];

600+ bytes is a bit much. I guess it's OK in this situation.

> - for (mtype = 0; mtype < MIGRATE_TYPES; mtype++) {
> - seq_printf(m, "Node %4d, zone %8s, type %12s ",
> - pgdat->node_id,
> - zone->name,
> - migratetype_names[mtype]);
> - for (order = 0; order < MAX_ORDER; ++order) {
> + lockdep_assert_held(&zone->lock);
> + lockdep_assert_irqs_disabled();
> +
> + /*
> + * MIGRATE_MOVABLE is usually the largest one in large memory
> + * systems. We skip iterating that list. Instead, we compute it by
> + * subtracting the total of the rests from free_area->nr_free.
> + */
> + for (order = 0; order < MAX_ORDER; ++order) {
> + unsigned long nr_total = 0;
> + struct free_area *area = &(zone->free_area[order]);
> +
> + for (mtype = 0; mtype < MIGRATE_TYPES; mtype++) {
> unsigned long freecount = 0;
> - struct free_area *area;
> struct list_head *curr;
>
> - area = &(zone->free_area[order]);
> -
> + if (mtype == MIGRATE_MOVABLE)
> + continue;
> list_for_each(curr, &area->free_list[mtype])
> freecount++;
> - seq_printf(m, "%6lu ", freecount);
> + nfree[order][mtype] = freecount;
> + nr_total += freecount;
> }
> + nfree[order][MIGRATE_MOVABLE] = area->nr_free - nr_total;
> +
> + /*
> + * If we have already iterated more than 64k of list
> + * entries, we might have hold the zone lock for too long.
> + * Temporarily release the lock and reschedule before
> + * continuing so that other lock waiters have a chance
> + * to run.
> + */
> + if (nr_total > (1 << 16)) {
> + spin_unlock_irq(&zone->lock);
> + cond_resched();
> + spin_lock_irq(&zone->lock);
> + }
> + }
> +
> + for (mtype = 0; mtype < MIGRATE_TYPES; mtype++) {
> + seq_printf(m, "Node %4d, zone %8s, type %12s ",
> + pgdat->node_id,
> + zone->name,
> + migratetype_names[mtype]);
> + for (order = 0; order < MAX_ORDER; ++order)
> + seq_printf(m, "%6lu ", nfree[order][mtype]);
> seq_putc(m, '\n');

This is not exactly a thing of beauty :( Presumably there might still
be situations where the irq-off times remain excessive.

Why are we actually holding zone->lock so much? Can we get away with
holding it across the list_for_each() loop and nothing else? If so,
this still isn't a bulletproof fix. Maybe just terminate the list
walk if freecount reaches 1024. Would anyone really care?

Sigh. I wonder if anyone really uses this thing for anything
important. Can we just remove it all?

2019-10-23 12:25:35

by Michal Hocko

Subject: Re: [PATCH] mm/vmstat: Reduce zone lock hold time when reading /proc/pagetypeinfo

On Tue 22-10-19 14:59:02, Andrew Morton wrote:
> On Tue, 22 Oct 2019 12:21:56 -0400 Waiman Long <[email protected]> wrote:
[...]
> > - for (mtype = 0; mtype < MIGRATE_TYPES; mtype++) {
> > - seq_printf(m, "Node %4d, zone %8s, type %12s ",
> > - pgdat->node_id,
> > - zone->name,
> > - migratetype_names[mtype]);
> > - for (order = 0; order < MAX_ORDER; ++order) {
> > + lockdep_assert_held(&zone->lock);
> > + lockdep_assert_irqs_disabled();
> > +
> > + /*
> > + * MIGRATE_MOVABLE is usually the largest one in large memory
> > + * systems. We skip iterating that list. Instead, we compute it by
> > + * subtracting the total of the rests from free_area->nr_free.
> > + */
> > + for (order = 0; order < MAX_ORDER; ++order) {
> > + unsigned long nr_total = 0;
> > + struct free_area *area = &(zone->free_area[order]);
> > +
> > + for (mtype = 0; mtype < MIGRATE_TYPES; mtype++) {
> > unsigned long freecount = 0;
> > - struct free_area *area;
> > struct list_head *curr;
> >
> > - area = &(zone->free_area[order]);
> > -
> > + if (mtype == MIGRATE_MOVABLE)
> > + continue;
> > list_for_each(curr, &area->free_list[mtype])
> > freecount++;
> > - seq_printf(m, "%6lu ", freecount);
> > + nfree[order][mtype] = freecount;
> > + nr_total += freecount;
> > }
> > + nfree[order][MIGRATE_MOVABLE] = area->nr_free - nr_total;
> > +
> > + /*
> > + * If we have already iterated more than 64k of list
> > + * entries, we might have hold the zone lock for too long.
> > + * Temporarily release the lock and reschedule before
> > + * continuing so that other lock waiters have a chance
> > + * to run.
> > + */
> > + if (nr_total > (1 << 16)) {
> > + spin_unlock_irq(&zone->lock);
> > + cond_resched();
> > + spin_lock_irq(&zone->lock);
> > + }
> > + }
> > +
> > + for (mtype = 0; mtype < MIGRATE_TYPES; mtype++) {
> > + seq_printf(m, "Node %4d, zone %8s, type %12s ",
> > + pgdat->node_id,
> > + zone->name,
> > + migratetype_names[mtype]);
> > + for (order = 0; order < MAX_ORDER; ++order)
> > + seq_printf(m, "%6lu ", nfree[order][mtype]);
> > seq_putc(m, '\n');
>
> This is not exactly a thing of beauty :( Presumably there might still
> be situations where the irq-off times remain excessive.

Yes. It is the list_for_each over the free_list that needs the lock and
that is the actual problem here. This can be really large with a _lot_
of memory. And this is why I objected to the patch. Because it doesn't
really address this problem. I would like to hear from Mel and Vlastimil
how would they feel about making free_list fully migrate type aware
(including nr_free).

> Why are we actually holding zone->lock so much? Can we get away with
> holding it across the list_for_each() loop and nothing else? If so,
> this still isn't a bulletproof fix. Maybe just terminate the list
> walk if freecount reaches 1024. Would anyone really care?
>
> Sigh. I wonder if anyone really uses this thing for anything
> important. Can we just remove it all?

Vlastimil would know much better but I have seen this being used for
fragmentation-related debugging. That should imply that 0400 would be
sufficient, and it would be a quick and easily backportable fix for the
most pressing immediate problem.
--
Michal Hocko
SUSE Labs

2019-10-24 00:47:04

by Waiman Long

Subject: Re: [PATCH] mm/vmstat: Reduce zone lock hold time when reading /proc/pagetypeinfo

On 10/22/19 5:59 PM, Andrew Morton wrote:
> On Tue, 22 Oct 2019 12:21:56 -0400 Waiman Long <[email protected]> wrote:
>
>> The pagetypeinfo_showfree_print() function prints out the number of
>> free blocks for each of the page orders and migrate types. The current
>> code just iterates the each of the free lists to get counts. There are
>> bug reports about hard lockup panics when reading the /proc/pagetyeinfo
>> file just because it look too long to iterate all the free lists within
>> a zone while holing the zone lock with irq disabled.
>>
>> Given the fact that /proc/pagetypeinfo is readable by all, the possiblity
>> of crashing a system by the simple act of reading /proc/pagetypeinfo
>> by any user is a security problem that needs to be addressed.
> Yes.
>
>> There is a free_area structure associated with each page order. There
>> is also a nr_free count within the free_area for all the different
>> migration types combined. Tracking the number of free list entries
>> for each migration type will probably add some overhead to the fast
>> paths like moving pages from one migration type to another which may
>> not be desirable.
>>
>> we can actually skip iterating the list of one of the migration types
>> and used nr_free to compute the missing count. Since MIGRATE_MOVABLE
>> is usually the largest one on large memory systems, this is the one
>> to be skipped. Since the printing order is migration-type => order, we
>> will have to store the counts in an internal 2D array before printing
>> them out.
>>
>> Even by skipping the MIGRATE_MOVABLE pages, we may still be holding the
>> zone lock for too long blocking out other zone lock waiters from being
>> run. This can be problematic for systems with large amount of memory.
>> So a check is added to temporarily release the lock and reschedule if
>> more than 64k of list entries have been iterated for each order. With
>> a MAX_ORDER of 11, the worst case will be iterating about 700k of list
>> entries before releasing the lock.
>>
>> ...
>>
>> --- a/mm/vmstat.c
>> +++ b/mm/vmstat.c
>> @@ -1373,23 +1373,54 @@ static void pagetypeinfo_showfree_print(struct seq_file *m,
>> pg_data_t *pgdat, struct zone *zone)
>> {
>> int order, mtype;
>> + unsigned long nfree[MAX_ORDER][MIGRATE_TYPES];
> 600+ bytes is a bit much. I guess it's OK in this situation.
>
This function is called by reading /proc/pagetypeinfo. The call stack is
rather shallow:

PID: 58188  TASK: ffff938a4d4f1fa0  CPU: 2   COMMAND: "sosreport"
 #0 [ffff9483bf488e48] crash_nmi_callback at ffffffffb8c551d7
 #1 [ffff9483bf488e58] nmi_handle at ffffffffb931d8cc
 #2 [ffff9483bf488eb0] do_nmi at ffffffffb931dba8
 #3 [ffff9483bf488ef0] end_repeat_nmi at ffffffffb931cd69
    [exception RIP: pagetypeinfo_showfree_print+0x73]
    RIP: ffffffffb8db7173  RSP: ffff938b9fcbfda0  RFLAGS: 00000006
    RAX: fffff0c9946d7020  RBX: ffff96073ffd5528  RCX: 0000000000000000
    RDX: 00000000001c7764  RSI: ffffffffb9676ab1  RDI: 0000000000000000
    RBP: ffff938b9fcbfdd0   R8: 000000000000000a   R9: 00000000fffffffe
    R10: 0000000000000000  R11: ffff938b9fcbfc36  R12: ffff942b97758240
    R13: ffffffffb942f730  R14: ffff96073ffd5000  R15: ffff96073ffd5180
    ORIG_RAX: ffffffffffffffff  CS: 0010  SS: 0018
--- <NMI exception stack> ---
 #4 [ffff938b9fcbfda0] pagetypeinfo_showfree_print at ffffffffb8db7173
 #5 [ffff938b9fcbfdd8] walk_zones_in_node at ffffffffb8db74df
 #6 [ffff938b9fcbfe20] pagetypeinfo_show at ffffffffb8db7a29
 #7 [ffff938b9fcbfe48] seq_read at ffffffffb8e45c3c
 #8 [ffff938b9fcbfeb8] proc_reg_read at ffffffffb8e95070
 #9 [ffff938b9fcbfed8] vfs_read at ffffffffb8e1f2af
#10 [ffff938b9fcbff08] sys_read at ffffffffb8e2017f
#11 [ffff938b9fcbff50] system_call_fastpath at ffffffffb932579b

So we should not be in any risk of overflowing the stack.

>> - for (mtype = 0; mtype < MIGRATE_TYPES; mtype++) {
>> - seq_printf(m, "Node %4d, zone %8s, type %12s ",
>> - pgdat->node_id,
>> - zone->name,
>> - migratetype_names[mtype]);
>> - for (order = 0; order < MAX_ORDER; ++order) {
>> + lockdep_assert_held(&zone->lock);
>> + lockdep_assert_irqs_disabled();
>> +
>> + /*
>> + * MIGRATE_MOVABLE is usually the largest one in large memory
>> + * systems. We skip iterating that list. Instead, we compute it by
>> + * subtracting the total of the rests from free_area->nr_free.
>> + */
>> + for (order = 0; order < MAX_ORDER; ++order) {
>> + unsigned long nr_total = 0;
>> + struct free_area *area = &(zone->free_area[order]);
>> +
>> + for (mtype = 0; mtype < MIGRATE_TYPES; mtype++) {
>> unsigned long freecount = 0;
>> - struct free_area *area;
>> struct list_head *curr;
>>
>> - area = &(zone->free_area[order]);
>> -
>> + if (mtype == MIGRATE_MOVABLE)
>> + continue;
>> list_for_each(curr, &area->free_list[mtype])
>> freecount++;
>> - seq_printf(m, "%6lu ", freecount);
>> + nfree[order][mtype] = freecount;
>> + nr_total += freecount;
>> }
>> + nfree[order][MIGRATE_MOVABLE] = area->nr_free - nr_total;
>> +
>> + /*
>> + * If we have already iterated more than 64k of list
>> + * entries, we might have hold the zone lock for too long.
>> + * Temporarily release the lock and reschedule before
>> + * continuing so that other lock waiters have a chance
>> + * to run.
>> + */
>> + if (nr_total > (1 << 16)) {
>> + spin_unlock_irq(&zone->lock);
>> + cond_resched();
>> + spin_lock_irq(&zone->lock);
>> + }
>> + }
>> +
>> + for (mtype = 0; mtype < MIGRATE_TYPES; mtype++) {
>> + seq_printf(m, "Node %4d, zone %8s, type %12s ",
>> + pgdat->node_id,
>> + zone->name,
>> + migratetype_names[mtype]);
>> + for (order = 0; order < MAX_ORDER; ++order)
>> + seq_printf(m, "%6lu ", nfree[order][mtype]);
>> seq_putc(m, '\n');
> This is not exactly a thing of beauty :( Presumably there might still
> be situations where the irq-off times remain excessive.
Yes, that is still possible.
>
> Why are we actually holding zone->lock so much? Can we get away with
> holding it across the list_for_each() loop and nothing else? If so,

We can certainly do that with the risk that the counts will be less
reliable for a given order. I can send a v2 patch if you think this is
safer.


> this still isn't a bulletproof fix. Maybe just terminate the list
> walk if freecount reaches 1024. Would anyone really care?
>
> Sigh. I wonder if anyone really uses this thing for anything
> important. Can we just remove it all?
>
Removing it would break the kernel API.

Another alternative is to mark the migration type in the page structure
so that we can do per-migration type nr_free tracking. That will be a
major change to the mm code.

I consider this patch the lesser of the two evils.

Cheers,
Longman

2019-10-24 01:12:04

by Qian Cai

Subject: Re: [PATCH] mm/vmstat: Reduce zone lock hold time when reading /proc/pagetypeinfo

On Wed, 2019-10-23 at 10:30 -0400, Waiman Long wrote:
> On 10/22/19 5:59 PM, Andrew Morton wrote:
> > On Tue, 22 Oct 2019 12:21:56 -0400 Waiman Long <[email protected]> wrote:
> >
> > > The pagetypeinfo_showfree_print() function prints out the number of
> > > free blocks for each of the page orders and migrate types. The current
> > > code just iterates the each of the free lists to get counts. There are
> > > bug reports about hard lockup panics when reading the /proc/pagetyeinfo
> > > file just because it look too long to iterate all the free lists within
> > > a zone while holing the zone lock with irq disabled.
> > >
> > > Given the fact that /proc/pagetypeinfo is readable by all, the possiblity
> > > of crashing a system by the simple act of reading /proc/pagetypeinfo
> > > by any user is a security problem that needs to be addressed.
> >
> > Yes.
> >
> > > There is a free_area structure associated with each page order. There
> > > is also a nr_free count within the free_area for all the different
> > > migration types combined. Tracking the number of free list entries
> > > for each migration type will probably add some overhead to the fast
> > > paths like moving pages from one migration type to another which may
> > > not be desirable.
> > >
> > > we can actually skip iterating the list of one of the migration types
> > > and used nr_free to compute the missing count. Since MIGRATE_MOVABLE
> > > is usually the largest one on large memory systems, this is the one
> > > to be skipped. Since the printing order is migration-type => order, we
> > > will have to store the counts in an internal 2D array before printing
> > > them out.
> > >
> > > Even by skipping the MIGRATE_MOVABLE pages, we may still be holding the
> > > zone lock for too long blocking out other zone lock waiters from being
> > > run. This can be problematic for systems with large amount of memory.
> > > So a check is added to temporarily release the lock and reschedule if
> > > more than 64k of list entries have been iterated for each order. With
> > > a MAX_ORDER of 11, the worst case will be iterating about 700k of list
> > > entries before releasing the lock.
> > >
> > > ...
> > >
> > > --- a/mm/vmstat.c
> > > +++ b/mm/vmstat.c
> > > @@ -1373,23 +1373,54 @@ static void pagetypeinfo_showfree_print(struct seq_file *m,
> > > pg_data_t *pgdat, struct zone *zone)
> > > {
> > > int order, mtype;
> > > + unsigned long nfree[MAX_ORDER][MIGRATE_TYPES];
> >
> > 600+ bytes is a bit much. I guess it's OK in this situation.
> >
>
> This function is called by reading /proc/pagetypeinfo. The call stack is
> rather shallow:
>
> PID: 58188  TASK: ffff938a4d4f1fa0  CPU: 2   COMMAND: "sosreport"
>  #0 [ffff9483bf488e48] crash_nmi_callback at ffffffffb8c551d7
>  #1 [ffff9483bf488e58] nmi_handle at ffffffffb931d8cc
>  #2 [ffff9483bf488eb0] do_nmi at ffffffffb931dba8
>  #3 [ffff9483bf488ef0] end_repeat_nmi at ffffffffb931cd69
>     [exception RIP: pagetypeinfo_showfree_print+0x73]
>     RIP: ffffffffb8db7173  RSP: ffff938b9fcbfda0  RFLAGS: 00000006
>     RAX: fffff0c9946d7020  RBX: ffff96073ffd5528  RCX: 0000000000000000
>     RDX: 00000000001c7764  RSI: ffffffffb9676ab1  RDI: 0000000000000000
>     RBP: ffff938b9fcbfdd0   R8: 000000000000000a   R9: 00000000fffffffe
>     R10: 0000000000000000  R11: ffff938b9fcbfc36  R12: ffff942b97758240
>     R13: ffffffffb942f730  R14: ffff96073ffd5000  R15: ffff96073ffd5180
>     ORIG_RAX: ffffffffffffffff  CS: 0010  SS: 0018
> --- <NMI exception stack> ---
>  #4 [ffff938b9fcbfda0] pagetypeinfo_showfree_print at ffffffffb8db7173
>  #5 [ffff938b9fcbfdd8] walk_zones_in_node at ffffffffb8db74df
>  #6 [ffff938b9fcbfe20] pagetypeinfo_show at ffffffffb8db7a29
>  #7 [ffff938b9fcbfe48] seq_read at ffffffffb8e45c3c
>  #8 [ffff938b9fcbfeb8] proc_reg_read at ffffffffb8e95070
>  #9 [ffff938b9fcbfed8] vfs_read at ffffffffb8e1f2af
> #10 [ffff938b9fcbff08] sys_read at ffffffffb8e2017f
> #11 [ffff938b9fcbff50] system_call_fastpath at ffffffffb932579b
>
> So we should not be in any risk of overflowing the stack.
>
> > > - for (mtype = 0; mtype < MIGRATE_TYPES; mtype++) {
> > > - seq_printf(m, "Node %4d, zone %8s, type %12s ",
> > > - pgdat->node_id,
> > > - zone->name,
> > > - migratetype_names[mtype]);
> > > - for (order = 0; order < MAX_ORDER; ++order) {
> > > + lockdep_assert_held(&zone->lock);
> > > + lockdep_assert_irqs_disabled();
> > > +
> > > + /*
> > > + * MIGRATE_MOVABLE is usually the largest one in large memory
> > > + * systems. We skip iterating that list. Instead, we compute it by
> > > + * subtracting the total of the rests from free_area->nr_free.
> > > + */
> > > + for (order = 0; order < MAX_ORDER; ++order) {
> > > + unsigned long nr_total = 0;
> > > + struct free_area *area = &(zone->free_area[order]);
> > > +
> > > + for (mtype = 0; mtype < MIGRATE_TYPES; mtype++) {
> > > unsigned long freecount = 0;
> > > - struct free_area *area;
> > > struct list_head *curr;
> > >
> > > - area = &(zone->free_area[order]);
> > > -
> > > + if (mtype == MIGRATE_MOVABLE)
> > > + continue;
> > > list_for_each(curr, &area->free_list[mtype])
> > > freecount++;
> > > - seq_printf(m, "%6lu ", freecount);
> > > + nfree[order][mtype] = freecount;
> > > + nr_total += freecount;
> > > }
> > > + nfree[order][MIGRATE_MOVABLE] = area->nr_free - nr_total;
> > > +
> > > + /*
> > > + * If we have already iterated more than 64k of list
> > > + * entries, we might have hold the zone lock for too long.
> > > + * Temporarily release the lock and reschedule before
> > > + * continuing so that other lock waiters have a chance
> > > + * to run.
> > > + */
> > > + if (nr_total > (1 << 16)) {
> > > + spin_unlock_irq(&zone->lock);
> > > + cond_resched();
> > > + spin_lock_irq(&zone->lock);
> > > + }
> > > + }
> > > +
> > > + for (mtype = 0; mtype < MIGRATE_TYPES; mtype++) {
> > > + seq_printf(m, "Node %4d, zone %8s, type %12s ",
> > > + pgdat->node_id,
> > > + zone->name,
> > > + migratetype_names[mtype]);
> > > + for (order = 0; order < MAX_ORDER; ++order)
> > > + seq_printf(m, "%6lu ", nfree[order][mtype]);
> > > seq_putc(m, '\n');
> >
> > This is not exactly a thing of beauty :( Presumably there might still
> > be situations where the irq-off times remain excessive.
>
> Yes, that is still possible.
> >
> > Why are we actually holding zone->lock so much? Can we get away with
> > holding it across the list_for_each() loop and nothing else? If so,
>
> We can certainly do that with the risk that the counts will be less
> reliable for a given order. I can send a v2 patch if you think this is
> safer.
>
>
> > this still isn't a bulletproof fix. Maybe just terminate the list
> > walk if freecount reaches 1024. Would anyone really care?
> >
> > Sigh. I wonder if anyone really uses this thing for anything
> > important. Can we just remove it all?
> >
>
> Removing it will be a breakage of kernel API.

Who cares about breaking this part of the API when essentially nobody uses
this file?

>
> Another alternative is to mark the migration type in the page structure
> so that we can do per-migration type nr_free tracking. That will be a
> major change to the mm code.
>
> I consider this patch lesser of the two evils. 
>
> Cheers,
> Longman
>
>

2019-10-24 02:31:49

by Waiman Long

Subject: Re: [PATCH] mm/vmstat: Reduce zone lock hold time when reading /proc/pagetypeinfo

On 10/23/19 10:48 AM, Qian Cai wrote:
>>> this still isn't a bulletproof fix. Maybe just terminate the list
>>> walk if freecount reaches 1024. Would anyone really care?
>>>
>>> Sigh. I wonder if anyone really uses this thing for anything
>>> important. Can we just remove it all?
>>>
>> Removing it will be a breakage of kernel API.
> Who cares about breaking this part of the API that essentially nobody will use
> this file?
>
There are certainly tools that use /proc/pagetypeinfo, and that is how
the problem was found. I am not against removing it, but we have to be
careful and deprecate it in a way that minimizes user impact.

Cheers,
Longman

2019-10-24 03:00:04

by Qian Cai

Subject: Re: [PATCH] mm/vmstat: Reduce zone lock hold time when reading /proc/pagetypeinfo

On Wed, 2019-10-23 at 11:01 -0400, Waiman Long wrote:
> On 10/23/19 10:48 AM, Qian Cai wrote:
> > > > this still isn't a bulletproof fix. Maybe just terminate the list
> > > > walk if freecount reaches 1024. Would anyone really care?
> > > >
> > > > Sigh. I wonder if anyone really uses this thing for anything
> > > > important. Can we just remove it all?
> > > >
> > >
> > > Removing it will be a breakage of kernel API.
> >
> > Who cares about breaking this part of the API that essentially nobody will use
> > this file?
> >
>
> There are certainly tools that use /proc/pagetypeinfo and this is how
> the problem is found. I am not against removing it, but we have to be
> careful and deprecate it in way that minimize user impact.

What are those tools?

2019-10-24 03:48:22

by Rafael Aquini

Subject: Re: [PATCH] mm/vmstat: Reduce zone lock hold time when reading /proc/pagetypeinfo

On Wed, Oct 23, 2019 at 10:48:13AM -0400, Qian Cai wrote:
> On Wed, 2019-10-23 at 10:30 -0400, Waiman Long wrote:
> > On 10/22/19 5:59 PM, Andrew Morton wrote:
> > > On Tue, 22 Oct 2019 12:21:56 -0400 Waiman Long <[email protected]> wrote:
> > >
> > > > The pagetypeinfo_showfree_print() function prints out the number of
> > > > free blocks for each of the page orders and migrate types. The current
> > > > code just iterates the each of the free lists to get counts. There are
> > > > bug reports about hard lockup panics when reading the /proc/pagetyeinfo
> > > > file just because it look too long to iterate all the free lists within
> > > > a zone while holing the zone lock with irq disabled.
> > > >
> > > > Given the fact that /proc/pagetypeinfo is readable by all, the possiblity
> > > > of crashing a system by the simple act of reading /proc/pagetypeinfo
> > > > by any user is a security problem that needs to be addressed.
> > >
> > > Yes.
> > >
> > > > There is a free_area structure associated with each page order. There
> > > > is also a nr_free count within the free_area for all the different
> > > > migration types combined. Tracking the number of free list entries
> > > > for each migration type will probably add some overhead to the fast
> > > > paths like moving pages from one migration type to another which may
> > > > not be desirable.
> > > >
> > > > we can actually skip iterating the list of one of the migration types
> > > > and used nr_free to compute the missing count. Since MIGRATE_MOVABLE
> > > > is usually the largest one on large memory systems, this is the one
> > > > to be skipped. Since the printing order is migration-type => order, we
> > > > will have to store the counts in an internal 2D array before printing
> > > > them out.
> > > >
> > > > Even by skipping the MIGRATE_MOVABLE pages, we may still be holding the
> > > > zone lock for too long blocking out other zone lock waiters from being
> > > > run. This can be problematic for systems with large amount of memory.
> > > > So a check is added to temporarily release the lock and reschedule if
> > > > more than 64k of list entries have been iterated for each order. With
> > > > a MAX_ORDER of 11, the worst case will be iterating about 700k of list
> > > > entries before releasing the lock.
> > > >
> > > > ...
> > > >
> > > > --- a/mm/vmstat.c
> > > > +++ b/mm/vmstat.c
> > > > @@ -1373,23 +1373,54 @@ static void pagetypeinfo_showfree_print(struct seq_file *m,
> > > > pg_data_t *pgdat, struct zone *zone)
> > > > {
> > > > int order, mtype;
> > > > + unsigned long nfree[MAX_ORDER][MIGRATE_TYPES];
> > >
> > > 600+ bytes is a bit much. I guess it's OK in this situation.
> > >
> >
> > This function is called by reading /proc/pagetypeinfo. The call stack is
> > rather shallow:
> >
> > PID: 58188  TASK: ffff938a4d4f1fa0  CPU: 2   COMMAND: "sosreport"
> >  #0 [ffff9483bf488e48] crash_nmi_callback at ffffffffb8c551d7
> >  #1 [ffff9483bf488e58] nmi_handle at ffffffffb931d8cc
> >  #2 [ffff9483bf488eb0] do_nmi at ffffffffb931dba8
> >  #3 [ffff9483bf488ef0] end_repeat_nmi at ffffffffb931cd69
> >     [exception RIP: pagetypeinfo_showfree_print+0x73]
> >     RIP: ffffffffb8db7173  RSP: ffff938b9fcbfda0  RFLAGS: 00000006
> >     RAX: fffff0c9946d7020  RBX: ffff96073ffd5528  RCX: 0000000000000000
> >     RDX: 00000000001c7764  RSI: ffffffffb9676ab1  RDI: 0000000000000000
> >     RBP: ffff938b9fcbfdd0   R8: 000000000000000a   R9: 00000000fffffffe
> >     R10: 0000000000000000  R11: ffff938b9fcbfc36  R12: ffff942b97758240
> >     R13: ffffffffb942f730  R14: ffff96073ffd5000  R15: ffff96073ffd5180
> >     ORIG_RAX: ffffffffffffffff  CS: 0010  SS: 0018
> > --- <NMI exception stack> ---
> >  #4 [ffff938b9fcbfda0] pagetypeinfo_showfree_print at ffffffffb8db7173
> >  #5 [ffff938b9fcbfdd8] walk_zones_in_node at ffffffffb8db74df
> >  #6 [ffff938b9fcbfe20] pagetypeinfo_show at ffffffffb8db7a29
> >  #7 [ffff938b9fcbfe48] seq_read at ffffffffb8e45c3c
> >  #8 [ffff938b9fcbfeb8] proc_reg_read at ffffffffb8e95070
> >  #9 [ffff938b9fcbfed8] vfs_read at ffffffffb8e1f2af
> > #10 [ffff938b9fcbff08] sys_read at ffffffffb8e2017f
> > #11 [ffff938b9fcbff50] system_call_fastpath at ffffffffb932579b
> >
> > So we should not be at any risk of overflowing the stack.
> >
> > > > - for (mtype = 0; mtype < MIGRATE_TYPES; mtype++) {
> > > > - seq_printf(m, "Node %4d, zone %8s, type %12s ",
> > > > - pgdat->node_id,
> > > > - zone->name,
> > > > - migratetype_names[mtype]);
> > > > - for (order = 0; order < MAX_ORDER; ++order) {
> > > > + lockdep_assert_held(&zone->lock);
> > > > + lockdep_assert_irqs_disabled();
> > > > +
> > > > + /*
> > > > + * MIGRATE_MOVABLE is usually the largest one in large memory
> > > > + * systems. We skip iterating that list. Instead, we compute it by
> > > > + * subtracting the total of the rest from free_area->nr_free.
> > > > + */
> > > > + for (order = 0; order < MAX_ORDER; ++order) {
> > > > + unsigned long nr_total = 0;
> > > > + struct free_area *area = &(zone->free_area[order]);
> > > > +
> > > > + for (mtype = 0; mtype < MIGRATE_TYPES; mtype++) {
> > > > unsigned long freecount = 0;
> > > > - struct free_area *area;
> > > > struct list_head *curr;
> > > >
> > > > - area = &(zone->free_area[order]);
> > > > -
> > > > + if (mtype == MIGRATE_MOVABLE)
> > > > + continue;
> > > > list_for_each(curr, &area->free_list[mtype])
> > > > freecount++;
> > > > - seq_printf(m, "%6lu ", freecount);
> > > > + nfree[order][mtype] = freecount;
> > > > + nr_total += freecount;
> > > > }
> > > > + nfree[order][MIGRATE_MOVABLE] = area->nr_free - nr_total;
> > > > +
> > > > + /*
> > > > + * If we have already iterated more than 64k of list
> > > > + * entries, we may have held the zone lock for too long.
> > > > + * Temporarily release the lock and reschedule before
> > > > + * continuing so that other lock waiters have a chance
> > > > + * to run.
> > > > + */
> > > > + if (nr_total > (1 << 16)) {
> > > > + spin_unlock_irq(&zone->lock);
> > > > + cond_resched();
> > > > + spin_lock_irq(&zone->lock);
> > > > + }
> > > > + }
> > > > +
> > > > + for (mtype = 0; mtype < MIGRATE_TYPES; mtype++) {
> > > > + seq_printf(m, "Node %4d, zone %8s, type %12s ",
> > > > + pgdat->node_id,
> > > > + zone->name,
> > > > + migratetype_names[mtype]);
> > > > + for (order = 0; order < MAX_ORDER; ++order)
> > > > + seq_printf(m, "%6lu ", nfree[order][mtype]);
> > > > seq_putc(m, '\n');
> > >
> > > This is not exactly a thing of beauty :( Presumably there might still
> > > be situations where the irq-off times remain excessive.
> >
> > Yes, that is still possible.
> > >
> > > Why are we actually holding zone->lock so much? Can we get away with
> > > holding it across the list_for_each() loop and nothing else? If so,
> >
> > We can certainly do that with the risk that the counts will be less
> > reliable for a given order. I can send a v2 patch if you think this is
> > safer.
> >
> >
> > > this still isn't a bulletproof fix. Maybe just terminate the list
> > > walk if freecount reaches 1024. Would anyone really care?
> > >
> > > Sigh. I wonder if anyone really uses this thing for anything
> > > important. Can we just remove it all?
> > >
> >
> > Removing it would break the kernel API.
>
> Who cares about breaking this part of the API when essentially nobody
> uses this file?
>

Seems that _some_ are using it, aren't they? Otherwise we wouldn't be
seeing complaints. Moving it out of /proc to /sys/kernel/debug looks
like a reasonable compromise with those that care about the interface.

2019-10-24 06:20:27

by Qian Cai

[permalink] [raw]
Subject: Re: [PATCH] mm/vmstat: Reduce zone lock hold time when reading /proc/pagetypeinfo

On Wed, 2019-10-23 at 11:03 -0400, Rafael Aquini wrote:
> > > > this still isn't a bulletproof fix. Maybe just terminate the list
> > > > walk if freecount reaches 1024. Would anyone really care?
> > > >
> > > > Sigh. I wonder if anyone really uses this thing for anything
> > > > important. Can we just remove it all?
> > > >
> > >
> > > Removing it would break the kernel API.
> >
> > Who cares about breaking this part of the API when essentially nobody
> > uses this file?
> >
>
> Seems that _some_ are using it, aren't they? Otherwise we wouldn't be
> seeing complaints. Moving it out of /proc to /sys/kernel/debug looks
> like a reasonable compromise with those that care about the interface.

No, there are known testing tools that will blindly read this file just
because it exists; that is not a reason to keep the file.

2019-10-24 16:12:36

by Andrew Morton

[permalink] [raw]
Subject: Re: [PATCH] mm/vmstat: Reduce zone lock hold time when reading /proc/pagetypeinfo

On Wed, 23 Oct 2019 11:01:04 -0400 Waiman Long <[email protected]> wrote:

> On 10/23/19 10:48 AM, Qian Cai wrote:
> >>> this still isn't a bulletproof fix. Maybe just terminate the list
> >>> walk if freecount reaches 1024. Would anyone really care?
> >>>
> >>> Sigh. I wonder if anyone really uses this thing for anything
> >>> important. Can we just remove it all?
> >>>
> >> Removing it would break the kernel API.
> > Who cares about breaking this part of the API when essentially nobody
> > uses this file?
> >
> There are certainly tools that use /proc/pagetypeinfo, and this is how
> the problem was found. I am not against removing it, but we have to be
> careful and deprecate it in a way that minimizes user impact.

Yes, removing things is hard. One approach is to add printk_once(this
is going away, please email us if you use it) then wait a few years.
Backport that one-liner into -stable kernels to hopefully speed up the
process.

Meanwhile, we need to fix the DoS opportunity. How about my suggestion
that we limit the count to 1024, see if anyone notices? I bet they
don't!

2019-10-24 19:52:21

by Feng Tang

[permalink] [raw]
Subject: Re: [PATCH] mm/vmstat: Reduce zone lock hold time when reading /proc/pagetypeinfo

On Wed, Oct 23, 2019 at 11:01:04PM +0800, Waiman Long wrote:
> On 10/23/19 10:48 AM, Qian Cai wrote:
> >>> this still isn't a bulletproof fix. Maybe just terminate the list
> >>> walk if freecount reaches 1024. Would anyone really care?
> >>>
> >>> Sigh. I wonder if anyone really uses this thing for anything
> >>> important. Can we just remove it all?
> >>>
> >> Removing it would break the kernel API.
> > Who cares about breaking this part of the API when essentially nobody
> > uses this file?
> >
> There are certainly tools that use /proc/pagetypeinfo, and this is how
> the problem was found. I am not against removing it, but we have to be
> careful and deprecate it in a way that minimizes user impact.

We have been using /proc/pagetypeinfo for debugging, mainly on
client platforms like phones/tablets. We have run into problems like
very slow system response or OOMs, and many of them could be related to
memory pressure or fragmentation issues, where monitoring
/proc/pagetypeinfo is very useful for debugging.

So I think Michal's idea of changing it to 0400 is a good one.

Thanks,
Feng


> Cheers,
> Longman
>
>

2019-10-24 19:59:55

by Qian Cai

[permalink] [raw]
Subject: Re: [PATCH] mm/vmstat: Reduce zone lock hold time when reading /proc/pagetypeinfo



> On Oct 23, 2019, at 11:33 PM, Feng Tang <[email protected]> wrote:
>
> We have been using /proc/pagetypeinfo for debugging, mainly on
> client platforms like phones/tablets. We have run into problems like
> very slow system response or OOMs, and many of them could be related to
> memory pressure or fragmentation issues, where monitoring
> /proc/pagetypeinfo is very useful for debugging.

This description of the use case is rather vague. Which part of the information in that file is critical for debugging an OOM or a slow system, and not readily available in places like /proc/zoneinfo, /proc/buddyinfo, sysrq-m, or OOM traces?

2019-10-24 20:30:08

by Qian Cai

[permalink] [raw]
Subject: Re: [PATCH] mm/vmstat: Reduce zone lock hold time when reading /proc/pagetypeinfo



> On Oct 23, 2019, at 6:30 PM, Andrew Morton <[email protected]> wrote:
>
> Yes, removing things is hard. One approach is to add printk_once(this
> is going away, please email us if you use it) then wait a few years.
> Backport that one-liner into -stable kernels to hopefully speed up the
> process.

Although it still looks like overkill to me given that,

1) Mel has given a green light to remove it.
2) Nobody has justified any sensible reason to keep it, apart from it probably only being triggered by some testing tools blindly reading procfs entries.

it is still better than wasting developers’ time beating the “dead” horse.

>
> Meanwhile, we need to fix the DoS opportunity. How about my suggestion
> that we limit the count to 1024, see if anyone notices? I bet they
> don't!

The DoS has probably been there since the file was introduced almost 10 years ago, so I suspect it is not that easily exploitable.

2019-10-24 20:30:12

by Feng Tang

[permalink] [raw]
Subject: Re: [PATCH] mm/vmstat: Reduce zone lock hold time when reading /proc/pagetypeinfo

On Thu, Oct 24, 2019 at 12:34:41PM +0800, Qian Cai wrote:
>
>
> > On Oct 23, 2019, at 11:33 PM, Feng Tang <[email protected]> wrote:
> >
> > We have been using /proc/pagetypeinfo for debugging, mainly on
> > client platforms like phones/tablets. We have run into problems like
> > very slow system response or OOMs, and many of them could be related to
> > memory pressure or fragmentation issues, where monitoring
> > /proc/pagetypeinfo is very useful for debugging.
>
> This description of the use case is rather vague. Which part of the information in that file is critical for debugging an OOM or a slow system, and not readily available in places like /proc/zoneinfo, /proc/buddyinfo, sysrq-m, or OOM traces?

One example is: there was a platform with limited DRAM, so it preset
some CMA memory for the camera's big buffer allocations, while letting
that memory be shared by others when the camera was not in use. After
running for a long time, the CMA region got fragmented and the camera
app started to fail due to buffer allocation failures. During
debugging, we kept monitoring the buddy info for the different migrate
types.

Thanks,
Feng

2019-10-24 22:22:41

by Michal Hocko

[permalink] [raw]
Subject: Re: [PATCH] mm/vmstat: Reduce zone lock hold time when reading /proc/pagetypeinfo

On Thu 24-10-19 01:33:01, Qian Cai wrote:
>
>
> > On Oct 23, 2019, at 6:30 PM, Andrew Morton <[email protected]> wrote:
> >
> > Yes, removing things is hard. One approach is to add printk_once(this
> > is going away, please email us if you use it) then wait a few years.
> > Backport that one-liner into -stable kernels to hopefully speed up the
> > process.
>
> Although it still looks like overkill to me given that,
>
> 1) Mel has given a green light to remove it.
> 2) Nobody has justified any sensible reason to keep it, apart from it
> probably only being triggered by some testing tools blindly reading
> procfs entries.

It's been useful for debugging memory fragmentation problems and we do
not have anything that would provide similar information. Considering
that making it root-only is trivial, and reducing the lock hold times
likewise, I do not really see any strong reason to dump it at this
moment.

> it is still better than wasting developers’ time beating the “dead” horse.
>
> >
> > Meanwhile, we need to fix the DoS opportunity. How about my suggestion
> > that we limit the count to 1024, see if anyone notices? I bet they
> > don't!
>
> The DoS has probably been there since the file was introduced almost 10
> years ago, so I suspect it is not that easily exploitable.

Yes, you need _tons_ of memory. Reading the file on my 3TB system takes:
sys 0m3.673s

The situation might be worse if the system were terribly fragmented, which
is not the case here.
--
Michal Hocko
SUSE Labs

2019-10-25 08:47:30

by Qian Cai

[permalink] [raw]
Subject: Re: [PATCH] mm/vmstat: Reduce zone lock hold time when reading /proc/pagetypeinfo



> On Oct 24, 2019, at 1:34 AM, Feng Tang <[email protected]> wrote:
>
> One example is: there was a platform with limited DRAM, so it preset
> some CMA memory for the camera's big buffer allocations, while letting
> that memory be shared by others when the camera was not in use. After
> running for a long time, the CMA region got fragmented and the camera
> app started to fail due to buffer allocation failures. During
> debugging, we kept monitoring the buddy info for the different migrate
> types.

For CMA-specific debugging, why CMA debugfs is not enough?

2019-10-25 09:50:09

by Qian Cai

[permalink] [raw]
Subject: Re: [PATCH] mm/vmstat: Reduce zone lock hold time when reading /proc/pagetypeinfo



> On Oct 24, 2019, at 3:42 AM, Michal Hocko <[email protected]> wrote:
>
> It's been useful for debugging memory fragmentation problems and we do
> not have anything that would provide similar information. Considering

Actually, zoneinfo and buddyinfo are somewhat similar to it. Why is the extra information in pagetypeinfo still useful for debugging today’s real-world scenarios?

> that making it root only is trivial and reducing the lock hold times
> likewise I do not really see any strong reason to dump it at this
> moment.

There is no need to hurry this, and it is clearly a good opportunity to discuss consolidating memory fragmentation debugging to ease maintenance in the long term.

2019-10-25 13:56:57

by Michal Hocko

[permalink] [raw]
Subject: Re: [PATCH] mm/vmstat: Reduce zone lock hold time when reading /proc/pagetypeinfo

On Thu 24-10-19 07:11:52, Qian Cai wrote:
>
>
> > On Oct 24, 2019, at 3:42 AM, Michal Hocko <[email protected]> wrote:
> >
> > It's been useful for debugging memory fragmentation problems and we do
> > not have anything that would provide similar information. Considering
>
> Actually, zoneinfo and buddyinfo are somewhat similar to it. Why is
> the extra information in pagetypeinfo still useful for debugging
> today’s real-world scenarios?

Because the migrate type is the information that helps in understanding
fragmentation-related issues. I am pretty sure Vlastimil would tell you
much more.

> > that making it root only is trivial and reducing the lock hold times
> > likewise I do not really see any strong reason to dump it at this
> > moment.
>
> There is no need to hurry this, and it is clearly a good opportunity
> to discuss consolidating memory fragmentation debugging to ease
> maintenance in the long term.

Right. I think we should address the issue at hand first and then
handle any possible consolidation.

--
Michal Hocko
SUSE Labs

2019-10-25 14:03:25

by Qian Cai

[permalink] [raw]
Subject: Re: [PATCH] mm/vmstat: Reduce zone lock hold time when reading /proc/pagetypeinfo



> On Oct 24, 2019, at 9:39 AM, Michal Hocko <[email protected]> wrote:
>
> Because the migrate type is the information that helps in understanding
> fragmentation-related issues. I am pretty sure Vlastimil would tell you
> much more.

Then I don’t see the point of having two files (pagetypeinfo and buddyinfo) exporting similar information; figuring that out is probably more important than fixing the 10-year-old DoS.

2019-10-25 19:20:49

by Feng Tang

[permalink] [raw]
Subject: Re: [PATCH] mm/vmstat: Reduce zone lock hold time when reading /proc/pagetypeinfo

On Thu, Oct 24, 2019 at 06:51:54PM +0800, Qian Cai wrote:
>
>
> > On Oct 24, 2019, at 1:34 AM, Feng Tang <[email protected]> wrote:
> >
> > One example is: there was a platform with limited DRAM, so it preset
> > some CMA memory for the camera's big buffer allocations, while letting
> > that memory be shared by others when the camera was not in use. After
> > running for a long time, the CMA region got fragmented and the camera
> > app started to fail due to buffer allocation failures. During
> > debugging, we kept monitoring the buddy info for the different migrate
> > types.
>
> For CMA-specific debugging, why CMA debugfs is not enough?

We care about how the counts for the different orders change over time
for CMA and the other migrate types; sometimes I just use one simple
command,
watch -d 'cat /proc/pagetypeinfo'
which shows the changes directly and dynamically.

Thanks,
Feng