2020-04-01 19:44:41

by Pasha Tatashin

Subject: [PATCH] mm: initialize deferred pages with interrupts enabled

Initializing struct pages is a long task and keeping interrupts disabled
for the duration of this operation introduces a number of problems.

1. jiffies are not updated for long period of time, and thus incorrect time
is reported. See proposed solution and discussion here:
lkml/[email protected]
2. It prevents farther improving deferred page initialization by allowing
inter-node multi-threading.

We are keeping interrupts disabled to solve a rather theoretical problem
that was never observed in the real world (see commit 3a2d7fa8a3d5).

Let's keep interrupts enabled. In case we ever encounter a scenario where
an interrupt thread wants to allocate a large amount of memory this early
in boot, we can deal with that by growing the zone (see
deferred_grow_zone()) by the needed amount before starting the
deferred_init_memmap() threads.

Before:
[ 1.232459] node 0 initialised, 12058412 pages in 1ms

After:
[ 1.632580] node 0 initialised, 12051227 pages in 436ms

Signed-off-by: Pavel Tatashin <[email protected]>
---
mm/page_alloc.c | 21 +++++++--------------
1 file changed, 7 insertions(+), 14 deletions(-)

diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 3c4eb750a199..4498a13b372d 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -1792,6 +1792,13 @@ static int __init deferred_init_memmap(void *data)
BUG_ON(pgdat->first_deferred_pfn > pgdat_end_pfn(pgdat));
pgdat->first_deferred_pfn = ULONG_MAX;

+ /*
+ * Once we unlock here, the zone cannot be grown anymore, thus if an
+ * interrupt thread must allocate this early in boot, zone must be
+ * pre-grown prior to start of deferred page initialization.
+ */
+ pgdat_resize_unlock(pgdat, &flags);
+
/* Only the highest zone is deferred so find it */
for (zid = 0; zid < MAX_NR_ZONES; zid++) {
zone = pgdat->node_zones + zid;
@@ -1812,8 +1819,6 @@ static int __init deferred_init_memmap(void *data)
while (spfn < epfn)
nr_pages += deferred_init_maxorder(&i, zone, &spfn, &epfn);
zone_empty:
- pgdat_resize_unlock(pgdat, &flags);
-
/* Sanity check that the next zone really is unpopulated */
WARN_ON(++zid < MAX_NR_ZONES && populated_zone(++zone));

@@ -1854,18 +1859,6 @@ deferred_grow_zone(struct zone *zone, unsigned int order)
return false;

pgdat_resize_lock(pgdat, &flags);
-
- /*
- * If deferred pages have been initialized while we were waiting for
- * the lock, return true, as the zone was grown. The caller will retry
- * this zone. We won't return to this function since the caller also
- * has this static branch.
- */
- if (!static_branch_unlikely(&deferred_pages)) {
- pgdat_resize_unlock(pgdat, &flags);
- return true;
- }
-
/*
* If someone grew this zone while we were waiting for spinlock, return
* true, as there might be enough pages already.
--
2.17.1


2020-04-01 19:58:15

by Michal Hocko

Subject: Re: [PATCH] mm: initialize deferred pages with interrupts enabled

On Wed 01-04-20 15:32:38, Pavel Tatashin wrote:
> Initializing struct pages is a long task and keeping interrupts disabled
> for the duration of this operation introduces a number of problems.
>
> 1. jiffies are not updated for long period of time, and thus incorrect time
> is reported. See proposed solution and discussion here:
> lkml/[email protected]

http://lkml.kernel.org/r/[email protected]

> 2. It prevents farther improving deferred page initialization by allowing
> inter-node multi-threading.
>
> We are keeping interrupts disabled to solve a rather theoretical problem
> that was never observed in real world (See 3a2d7fa8a3d5).
>
> Lets keep interrupts enabled. In case we ever encounter a scenario where
> an interrupt thread wants to allocate large amount of memory this early in
> boot we can deal with that by growing zone (see deferred_grow_zone()) by
> the needed amount before starting deferred_init_memmap() threads.
>
> Before:
> [ 1.232459] node 0 initialised, 12058412 pages in 1ms
>
> After:
> [ 1.632580] node 0 initialised, 12051227 pages in 436ms
>

Fixes: 3a2d7fa8a3d5 ("mm: disable interrupts while initializing deferred pages")
> Signed-off-by: Pavel Tatashin <[email protected]>

I would much rather see pgdat_resize_lock taken out of both the
allocator and the deferred init path altogether, but this can be done in
a separate patch. This one looks slightly safer for stable backports.

To be completely honest I would love to see the resize lock go away
completely. That might need a deeper thought but I believe it is
something that has never been done properly.

Acked-by: Michal Hocko <[email protected]>

Thanks!

> ---
> mm/page_alloc.c | 21 +++++++--------------
> 1 file changed, 7 insertions(+), 14 deletions(-)
>
> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> index 3c4eb750a199..4498a13b372d 100644
> --- a/mm/page_alloc.c
> +++ b/mm/page_alloc.c
> @@ -1792,6 +1792,13 @@ static int __init deferred_init_memmap(void *data)
> BUG_ON(pgdat->first_deferred_pfn > pgdat_end_pfn(pgdat));
> pgdat->first_deferred_pfn = ULONG_MAX;
>
> + /*
> + * Once we unlock here, the zone cannot be grown anymore, thus if an
> + * interrupt thread must allocate this early in boot, zone must be
> + * pre-grown prior to start of deferred page initialization.
> + */
> + pgdat_resize_unlock(pgdat, &flags);
> +
> /* Only the highest zone is deferred so find it */
> for (zid = 0; zid < MAX_NR_ZONES; zid++) {
> zone = pgdat->node_zones + zid;
> @@ -1812,8 +1819,6 @@ static int __init deferred_init_memmap(void *data)
> while (spfn < epfn)
> nr_pages += deferred_init_maxorder(&i, zone, &spfn, &epfn);
> zone_empty:
> - pgdat_resize_unlock(pgdat, &flags);
> -
> /* Sanity check that the next zone really is unpopulated */
> WARN_ON(++zid < MAX_NR_ZONES && populated_zone(++zone));
>
> @@ -1854,18 +1859,6 @@ deferred_grow_zone(struct zone *zone, unsigned int order)
> return false;
>
> pgdat_resize_lock(pgdat, &flags);
> -
> - /*
> - * If deferred pages have been initialized while we were waiting for
> - * the lock, return true, as the zone was grown. The caller will retry
> - * this zone. We won't return to this function since the caller also
> - * has this static branch.
> - */
> - if (!static_branch_unlikely(&deferred_pages)) {
> - pgdat_resize_unlock(pgdat, &flags);
> - return true;
> - }
> -
> /*
> * If someone grew this zone while we were waiting for spinlock, return
> * true, as there might be enough pages already.
> --
> 2.17.1
>

--
Michal Hocko
SUSE Labs

2020-04-01 20:01:40

by Michal Hocko

Subject: Re: [PATCH] mm: initialize deferred pages with interrupts enabled

btw. Cc Vlastimil

On Wed 01-04-20 15:32:38, Pavel Tatashin wrote:
> Initializing struct pages is a long task and keeping interrupts disabled
> for the duration of this operation introduces a number of problems.
>
> 1. jiffies are not updated for long period of time, and thus incorrect time
> is reported. See proposed solution and discussion here:
> lkml/[email protected]
> 2. It prevents farther improving deferred page initialization by allowing
> inter-node multi-threading.
>
> We are keeping interrupts disabled to solve a rather theoretical problem
> that was never observed in real world (See 3a2d7fa8a3d5).
>
> Lets keep interrupts enabled. In case we ever encounter a scenario where
> an interrupt thread wants to allocate large amount of memory this early in
> boot we can deal with that by growing zone (see deferred_grow_zone()) by
> the needed amount before starting deferred_init_memmap() threads.
>
> Before:
> [ 1.232459] node 0 initialised, 12058412 pages in 1ms
>
> After:
> [ 1.632580] node 0 initialised, 12051227 pages in 436ms
>
> Signed-off-by: Pavel Tatashin <[email protected]>
> ---
> mm/page_alloc.c | 21 +++++++--------------
> 1 file changed, 7 insertions(+), 14 deletions(-)
>
> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> index 3c4eb750a199..4498a13b372d 100644
> --- a/mm/page_alloc.c
> +++ b/mm/page_alloc.c
> @@ -1792,6 +1792,13 @@ static int __init deferred_init_memmap(void *data)
> BUG_ON(pgdat->first_deferred_pfn > pgdat_end_pfn(pgdat));
> pgdat->first_deferred_pfn = ULONG_MAX;
>
> + /*
> + * Once we unlock here, the zone cannot be grown anymore, thus if an
> + * interrupt thread must allocate this early in boot, zone must be
> + * pre-grown prior to start of deferred page initialization.
> + */
> + pgdat_resize_unlock(pgdat, &flags);
> +
> /* Only the highest zone is deferred so find it */
> for (zid = 0; zid < MAX_NR_ZONES; zid++) {
> zone = pgdat->node_zones + zid;
> @@ -1812,8 +1819,6 @@ static int __init deferred_init_memmap(void *data)
> while (spfn < epfn)
> nr_pages += deferred_init_maxorder(&i, zone, &spfn, &epfn);
> zone_empty:
> - pgdat_resize_unlock(pgdat, &flags);
> -
> /* Sanity check that the next zone really is unpopulated */
> WARN_ON(++zid < MAX_NR_ZONES && populated_zone(++zone));
>
> @@ -1854,18 +1859,6 @@ deferred_grow_zone(struct zone *zone, unsigned int order)
> return false;
>
> pgdat_resize_lock(pgdat, &flags);
> -
> - /*
> - * If deferred pages have been initialized while we were waiting for
> - * the lock, return true, as the zone was grown. The caller will retry
> - * this zone. We won't return to this function since the caller also
> - * has this static branch.
> - */
> - if (!static_branch_unlikely(&deferred_pages)) {
> - pgdat_resize_unlock(pgdat, &flags);
> - return true;
> - }
> -
> /*
> * If someone grew this zone while we were waiting for spinlock, return
> * true, as there might be enough pages already.
> --
> 2.17.1
>

--
Michal Hocko
SUSE Labs

2020-04-01 20:13:01

by Daniel Jordan

Subject: Re: [PATCH] mm: initialize deferred pages with interrupts enabled

On Wed, Apr 01, 2020 at 04:00:27PM -0400, Daniel Jordan wrote:
> On Wed, Apr 01, 2020 at 03:32:38PM -0400, Pavel Tatashin wrote:
> > Initializing struct pages is a long task and keeping interrupts disabled
> > for the duration of this operation introduces a number of problems.
> >
> > 1. jiffies are not updated for long period of time, and thus incorrect time
> > is reported. See proposed solution and discussion here:
> > lkml/[email protected]
> > 2. It prevents farther improving deferred page initialization by allowing
>
> not allowing
> > inter-node multi-threading.
>
> intra-node
>
> ...
> > After:
> > [ 1.632580] node 0 initialised, 12051227 pages in 436ms
>
> Fixes: 3a2d7fa8a3d5 ("mm: disable interrupts while initializing deferred pages")
> Reported-by: Shile Zhang <[email protected]>
>
> > Signed-off-by: Pavel Tatashin <[email protected]>
>
> Freezing jiffies for a while during boot sounds like stable to me, so
>
> Cc: <[email protected]> [4.17.x+]
>
>
> Can you please add a comment to mmzone.h above node_size_lock, something like
>
> * Must be held any time you expect node_start_pfn,
> * node_present_pages, node_spanned_pages or nr_zones to stay constant.
> + * Also synchronizes pgdat->first_deferred_pfn during deferred page
> + * init.
> ...
> spinlock_t node_size_lock;
>
> > @@ -1854,18 +1859,6 @@ deferred_grow_zone(struct zone *zone, unsigned int order)
> > return false;
> >
> > pgdat_resize_lock(pgdat, &flags);
> > -
> > - /*
> > - * If deferred pages have been initialized while we were waiting for
> > - * the lock, return true, as the zone was grown. The caller will retry
> > - * this zone. We won't return to this function since the caller also
> > - * has this static branch.
> > - */
> > - if (!static_branch_unlikely(&deferred_pages)) {
> > - pgdat_resize_unlock(pgdat, &flags);
> > - return true;
> > - }
> > -
>
> Huh, looks like this wasn't needed even before this change.
>
>
> The rest looks fine.
>
> Reviewed-by: Daniel Jordan <[email protected]>

...except I forgot about the touch_nmi_watchdog() calls. I think you'd
need something like this before your patch.

---8<---

From: Daniel Jordan <[email protected]>
Date: Fri, 27 Mar 2020 17:29:05 -0400
Subject: [PATCH] mm: call touch_nmi_watchdog() on max order boundaries in
deferred init

deferred_init_memmap() disables interrupts the entire time, so it calls
touch_nmi_watchdog() periodically to avoid soft lockup splats. Soon it
will run with interrupts enabled, at which point cond_resched() should
be used instead.

deferred_grow_zone() makes the same watchdog calls through code shared
with deferred init but will continue to run with interrupts disabled, so
it can't call cond_resched().

Pull the watchdog calls up to these two places to allow the first to be
changed later, independently of the second. The frequency reduces from
twice per pageblock (init and free) to once per max order block.

Signed-off-by: Daniel Jordan <[email protected]>
---
mm/page_alloc.c | 7 ++++---
1 file changed, 4 insertions(+), 3 deletions(-)

diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 212734c4f8b0..4cf18c534233 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -1639,7 +1639,6 @@ static void __init deferred_free_pages(unsigned long pfn,
} else if (!(pfn & nr_pgmask)) {
deferred_free_range(pfn - nr_free, nr_free);
nr_free = 1;
- touch_nmi_watchdog();
} else {
nr_free++;
}
@@ -1669,7 +1668,6 @@ static unsigned long __init deferred_init_pages(struct zone *zone,
continue;
} else if (!page || !(pfn & nr_pgmask)) {
page = pfn_to_page(pfn);
- touch_nmi_watchdog();
} else {
page++;
}
@@ -1813,8 +1811,10 @@ static int __init deferred_init_memmap(void *data)
* that we can avoid introducing any issues with the buddy
* allocator.
*/
- while (spfn < epfn)
+ while (spfn < epfn) {
nr_pages += deferred_init_maxorder(&i, zone, &spfn, &epfn);
+ touch_nmi_watchdog();
+ }
zone_empty:
pgdat_resize_unlock(pgdat, &flags);

@@ -1908,6 +1908,7 @@ deferred_grow_zone_locked(pg_data_t *pgdat, struct zone *zone,
first_deferred_pfn = spfn;

nr_pages += deferred_init_maxorder(&i, zone, &spfn, &epfn);
+ touch_nmi_watchdog();

/* We should only stop along section boundaries */
if ((first_deferred_pfn ^ spfn) < PAGES_PER_SECTION)
--
2.25.0

2020-04-01 20:28:02

by Daniel Jordan

Subject: Re: [PATCH] mm: initialize deferred pages with interrupts enabled

On Wed, Apr 01, 2020 at 03:32:38PM -0400, Pavel Tatashin wrote:
> Initializing struct pages is a long task and keeping interrupts disabled
> for the duration of this operation introduces a number of problems.
>
> 1. jiffies are not updated for long period of time, and thus incorrect time
> is reported. See proposed solution and discussion here:
> lkml/[email protected]
> 2. It prevents farther improving deferred page initialization by allowing

not allowing
> inter-node multi-threading.

intra-node

...
> After:
> [ 1.632580] node 0 initialised, 12051227 pages in 436ms

Fixes: 3a2d7fa8a3d5 ("mm: disable interrupts while initializing deferred pages")
Reported-by: Shile Zhang <[email protected]>

> Signed-off-by: Pavel Tatashin <[email protected]>

Freezing jiffies for a while during boot sounds like stable to me, so

Cc: <[email protected]> [4.17.x+]


Can you please add a comment to mmzone.h above node_size_lock, something like

* Must be held any time you expect node_start_pfn,
* node_present_pages, node_spanned_pages or nr_zones to stay constant.
+ * Also synchronizes pgdat->first_deferred_pfn during deferred page
+ * init.
...
spinlock_t node_size_lock;

> @@ -1854,18 +1859,6 @@ deferred_grow_zone(struct zone *zone, unsigned int order)
> return false;
>
> pgdat_resize_lock(pgdat, &flags);
> -
> - /*
> - * If deferred pages have been initialized while we were waiting for
> - * the lock, return true, as the zone was grown. The caller will retry
> - * this zone. We won't return to this function since the caller also
> - * has this static branch.
> - */
> - if (!static_branch_unlikely(&deferred_pages)) {
> - pgdat_resize_unlock(pgdat, &flags);
> - return true;
> - }
> -

Huh, looks like this wasn't needed even before this change.


The rest looks fine.

Reviewed-by: Daniel Jordan <[email protected]>

2020-04-01 20:30:19

by Pasha Tatashin

Subject: Re: [PATCH] mm: initialize deferred pages with interrupts enabled

On Wed, Apr 1, 2020 at 3:57 PM Michal Hocko <[email protected]> wrote:
>
> On Wed 01-04-20 15:32:38, Pavel Tatashin wrote:
> > Initializing struct pages is a long task and keeping interrupts disabled
> > for the duration of this operation introduces a number of problems.
> >
> > 1. jiffies are not updated for long period of time, and thus incorrect time
> > is reported. See proposed solution and discussion here:
> > lkml/[email protected]
>
> http://lkml.kernel.org/r/[email protected]
>
> > 2. It prevents farther improving deferred page initialization by allowing
> > inter-node multi-threading.
> >
> > We are keeping interrupts disabled to solve a rather theoretical problem
> > that was never observed in real world (See 3a2d7fa8a3d5).
> >
> > Lets keep interrupts enabled. In case we ever encounter a scenario where
> > an interrupt thread wants to allocate large amount of memory this early in
> > boot we can deal with that by growing zone (see deferred_grow_zone()) by
> > the needed amount before starting deferred_init_memmap() threads.
> >
> > Before:
> > [ 1.232459] node 0 initialised, 12058412 pages in 1ms
> >
> > After:
> > [ 1.632580] node 0 initialised, 12051227 pages in 436ms
> >
>
> Fixes: 3a2d7fa8a3d5 ("mm: disable interrupts while initializing deferred pages")
> > Signed-off-by: Pavel Tatashin <[email protected]>
>
> I would much rather see pgdat_resize_lock completely out of both the
> allocator and deferred init path altogether but this can be done in a
> separate patch. This one looks slightly safer for stable backports.

This is what I wanted to do, but after studying deferred_grow_zone(),
I do not see a simple way to solve this. It is one thing to fail an
allocation, and it is another thing to have a corruption because of a
race.

> To be completely honest I would love to see the resize lock go away
> completely. That might need a deeper thought but I believe it is
> something that has never been done properly.
>
> Acked-by: Michal Hocko <[email protected]>

Thank you,
Pasha


>
> Thanks!
>
> > ---
> > mm/page_alloc.c | 21 +++++++--------------
> > 1 file changed, 7 insertions(+), 14 deletions(-)
> >
> > diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> > index 3c4eb750a199..4498a13b372d 100644
> > --- a/mm/page_alloc.c
> > +++ b/mm/page_alloc.c
> > @@ -1792,6 +1792,13 @@ static int __init deferred_init_memmap(void *data)
> > BUG_ON(pgdat->first_deferred_pfn > pgdat_end_pfn(pgdat));
> > pgdat->first_deferred_pfn = ULONG_MAX;
> >
> > + /*
> > + * Once we unlock here, the zone cannot be grown anymore, thus if an
> > + * interrupt thread must allocate this early in boot, zone must be
> > + * pre-grown prior to start of deferred page initialization.
> > + */
> > + pgdat_resize_unlock(pgdat, &flags);
> > +
> > /* Only the highest zone is deferred so find it */
> > for (zid = 0; zid < MAX_NR_ZONES; zid++) {
> > zone = pgdat->node_zones + zid;
> > @@ -1812,8 +1819,6 @@ static int __init deferred_init_memmap(void *data)
> > while (spfn < epfn)
> > nr_pages += deferred_init_maxorder(&i, zone, &spfn, &epfn);
> > zone_empty:
> > - pgdat_resize_unlock(pgdat, &flags);
> > -
> > /* Sanity check that the next zone really is unpopulated */
> > WARN_ON(++zid < MAX_NR_ZONES && populated_zone(++zone));
> >
> > @@ -1854,18 +1859,6 @@ deferred_grow_zone(struct zone *zone, unsigned int order)
> > return false;
> >
> > pgdat_resize_lock(pgdat, &flags);
> > -
> > - /*
> > - * If deferred pages have been initialized while we were waiting for
> > - * the lock, return true, as the zone was grown. The caller will retry
> > - * this zone. We won't return to this function since the caller also
> > - * has this static branch.
> > - */
> > - if (!static_branch_unlikely(&deferred_pages)) {
> > - pgdat_resize_unlock(pgdat, &flags);
> > - return true;
> > - }
> > -
> > /*
> > * If someone grew this zone while we were waiting for spinlock, return
> > * true, as there might be enough pages already.
> > --
> > 2.17.1
> >
>
> --
> Michal Hocko
> SUSE Labs

2020-04-01 20:33:26

by Pasha Tatashin

Subject: Re: [PATCH] mm: initialize deferred pages with interrupts enabled

On Wed, Apr 1, 2020 at 4:10 PM Daniel Jordan <[email protected]> wrote:
>
> On Wed, Apr 01, 2020 at 04:00:27PM -0400, Daniel Jordan wrote:
> > On Wed, Apr 01, 2020 at 03:32:38PM -0400, Pavel Tatashin wrote:
> > > Initializing struct pages is a long task and keeping interrupts disabled
> > > for the duration of this operation introduces a number of problems.
> > >
> > > 1. jiffies are not updated for long period of time, and thus incorrect time
> > > is reported. See proposed solution and discussion here:
> > > lkml/[email protected]
> > > 2. It prevents farther improving deferred page initialization by allowing
> >
> > not allowing
> > > inter-node multi-threading.
> >
> > intra-node
> >
> > ...
> > > After:
> > > [ 1.632580] node 0 initialised, 12051227 pages in 436ms
> >
> > Fixes: 3a2d7fa8a3d5 ("mm: disable interrupts while initializing deferred pages")
> > Reported-by: Shile Zhang <[email protected]>
> >
> > > Signed-off-by: Pavel Tatashin <[email protected]>
> >
> > Freezing jiffies for a while during boot sounds like stable to me, so
> >
> > Cc: <[email protected]> [4.17.x+]
> >
> >
> > Can you please add a comment to mmzone.h above node_size_lock, something like
> >
> > * Must be held any time you expect node_start_pfn,
> > * node_present_pages, node_spanned_pages or nr_zones to stay constant.
> > + * Also synchronizes pgdat->first_deferred_pfn during deferred page
> > + * init.
> > ...
> > spinlock_t node_size_lock;
> >
> > > @@ -1854,18 +1859,6 @@ deferred_grow_zone(struct zone *zone, unsigned int order)
> > > return false;
> > >
> > > pgdat_resize_lock(pgdat, &flags);
> > > -
> > > - /*
> > > - * If deferred pages have been initialized while we were waiting for
> > > - * the lock, return true, as the zone was grown. The caller will retry
> > > - * this zone. We won't return to this function since the caller also
> > > - * has this static branch.
> > > - */
> > > - if (!static_branch_unlikely(&deferred_pages)) {
> > > - pgdat_resize_unlock(pgdat, &flags);
> > > - return true;
> > > - }
> > > -
> >
> > Huh, looks like this wasn't needed even before this change.
> >
> >
> > The rest looks fine.
> >
> > Reviewed-by: Daniel Jordan <[email protected]>
>
> ...except for I forgot about the touch_nmi_watchdog() calls. I think you'd
> need something kind of like this before your patch.

Thank you for the review. You are right, I will add your patch and
modify mine to change touch_nmi_watchdog() to cond_resched().

Pasha

2020-04-02 07:34:48

by Michal Hocko

Subject: Re: [PATCH] mm: initialize deferred pages with interrupts enabled

On Wed 01-04-20 16:27:33, Pavel Tatashin wrote:
> On Wed, Apr 1, 2020 at 3:57 PM Michal Hocko <[email protected]> wrote:
> >
> > On Wed 01-04-20 15:32:38, Pavel Tatashin wrote:
> > > Initializing struct pages is a long task and keeping interrupts disabled
> > > for the duration of this operation introduces a number of problems.
> > >
> > > 1. jiffies are not updated for long period of time, and thus incorrect time
> > > is reported. See proposed solution and discussion here:
> > > lkml/[email protected]
> >
> > http://lkml.kernel.org/r/[email protected]
> >
> > > 2. It prevents farther improving deferred page initialization by allowing
> > > inter-node multi-threading.
> > >
> > > We are keeping interrupts disabled to solve a rather theoretical problem
> > > that was never observed in real world (See 3a2d7fa8a3d5).
> > >
> > > Lets keep interrupts enabled. In case we ever encounter a scenario where
> > > an interrupt thread wants to allocate large amount of memory this early in
> > > boot we can deal with that by growing zone (see deferred_grow_zone()) by
> > > the needed amount before starting deferred_init_memmap() threads.
> > >
> > > Before:
> > > [ 1.232459] node 0 initialised, 12058412 pages in 1ms
> > >
> > > After:
> > > [ 1.632580] node 0 initialised, 12051227 pages in 436ms
> > >
> >
> > Fixes: 3a2d7fa8a3d5 ("mm: disable interrupts while initializing deferred pages")
> > > Signed-off-by: Pavel Tatashin <[email protected]>
> >
> > I would much rather see pgdat_resize_lock completely out of both the
> > allocator and deferred init path altogether but this can be done in a
> > separate patch. This one looks slightly safer for stable backports.
>
> This is what I wanted to do, but after studying deferred_grow_zone(),
> I do not see a simple way to solve this. It is one thing to fail an
> allocation, and it is another thing to have a corruption because of
> race.

Let's discuss deferred_grow_zone() after this all settles down. I still
have to study it, because I wasn't aware that this is actually a page
allocator path relying on the resize lock. My recollection was that the
resize lock is only about memory hotplug. Your patches flew by and I
didn't have time to review them back then. So I have to admit I have
seen the resize lock as simpler than it is.
--
Michal Hocko
SUSE Labs

2020-04-02 07:37:51

by Michal Hocko

Subject: Re: [PATCH] mm: initialize deferred pages with interrupts enabled

On Wed 01-04-20 16:08:55, Daniel Jordan wrote:
[...]
> From: Daniel Jordan <[email protected]>
> Date: Fri, 27 Mar 2020 17:29:05 -0400
> Subject: [PATCH] mm: call touch_nmi_watchdog() on max order boundaries in
> deferred init
>
> deferred_init_memmap() disables interrupts the entire time, so it calls
> touch_nmi_watchdog() periodically to avoid soft lockup splats. Soon it
> will run with interrupts enabled, at which point cond_resched() should
> be used instead.
>
> deferred_grow_zone() makes the same watchdog calls through code shared
> with deferred init but will continue to run with interrupts disabled, so
> it can't call cond_resched().
>
> Pull the watchdog calls up to these two places to allow the first to be
> changed later, independently of the second. The frequency reduces from
> twice per pageblock (init and free) to once per max order block.

This makes sense but I am not really sure this is necessary for the
stable backport.

> Signed-off-by: Daniel Jordan <[email protected]>

Acked-by: Michal Hocko <[email protected]>

> ---
> mm/page_alloc.c | 7 ++++---
> 1 file changed, 4 insertions(+), 3 deletions(-)
>
> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> index 212734c4f8b0..4cf18c534233 100644
> --- a/mm/page_alloc.c
> +++ b/mm/page_alloc.c
> @@ -1639,7 +1639,6 @@ static void __init deferred_free_pages(unsigned long pfn,
> } else if (!(pfn & nr_pgmask)) {
> deferred_free_range(pfn - nr_free, nr_free);
> nr_free = 1;
> - touch_nmi_watchdog();
> } else {
> nr_free++;
> }
> @@ -1669,7 +1668,6 @@ static unsigned long __init deferred_init_pages(struct zone *zone,
> continue;
> } else if (!page || !(pfn & nr_pgmask)) {
> page = pfn_to_page(pfn);
> - touch_nmi_watchdog();
> } else {
> page++;
> }
> @@ -1813,8 +1811,10 @@ static int __init deferred_init_memmap(void *data)
> * that we can avoid introducing any issues with the buddy
> * allocator.
> */
> - while (spfn < epfn)
> + while (spfn < epfn) {
> nr_pages += deferred_init_maxorder(&i, zone, &spfn, &epfn);
> + touch_nmi_watchdog();
> + }
> zone_empty:
> pgdat_resize_unlock(pgdat, &flags);
>
> @@ -1908,6 +1908,7 @@ deferred_grow_zone_locked(pg_data_t *pgdat, struct zone *zone,
> first_deferred_pfn = spfn;
>
> nr_pages += deferred_init_maxorder(&i, zone, &spfn, &epfn);
> + touch_nmi_watchdog();
>
> /* We should only stop along section boundaries */
> if ((first_deferred_pfn ^ spfn) < PAGES_PER_SECTION)
> --
> 2.25.0
>

--
Michal Hocko
SUSE Labs