2013-04-10 18:24:20

by Cody P Schafer

[permalink] [raw]
Subject: [PATCH v3 00/11] mm: fixup changers of per cpu pageset's ->high and ->batch

"Problems" with the current code:
1. there is a lack of synchronization in setting ->high and ->batch in
percpu_pagelist_fraction_sysctl_handler()
2. stop_machine() in zone_pcp_update() is unnecissary.
3. zone_pcp_update() does not consider the case where percpu_pagelist_fraction is non-zero

To fix:
1. add memory barriers, a safe ->batch value, an update side mutex when
updating ->high and ->batch, and use ACCESS_ONCE() for ->batch users that
expect a stable value.
2. avoid draining pages in zone_pcp_update(), rely upon the memory barriers added to fix #1
3. factor out quite a few functions, and then call the appropriate one.

Note that it results in a change to the behavior of zone_pcp_update(), which is
used by memory_hotplug. I'm rather certain that I've diserned (and preserved)
the essential behavior (changing ->high and ->batch), and only eliminated
unneeded actions (draining the per cpu pages), but this may not be the case.

Further note that the draining of pages that previously took place in
zone_pcp_update() occured after repeated draining when attempting to offline a
page, and after the offline has "succeeded". It appears that the draining was
added to zone_pcp_update() to avoid refactoring setup_pageset() into 2
funtions.

--

Since v2: https://lkml.org/lkml/2013/4/9/718

- note ACCESS_ONCE() in fix #1 above.
- consolidate ->batch & ->high update protocol into a single funtion (Gilad).
- add missing ACCESS_ONCE() on ->batch

Since v1: https://lkml.org/lkml/2013/4/5/444

- instead of using on_each_cpu(), use memory barriers (Gilad) and an update side mutex.
- add "Problem" #3 above, and fix.
- rename function to match naming style of similar function
- move unrelated comment

--

Cody P Schafer (11):
mm/page_alloc: factor out setting of pcp->high and pcp->batch.
mm/page_alloc: prevent concurrent updaters of pcp ->batch and ->high
mm/page_alloc: insert memory barriers to allow async update of pcp
batch and high
mm/page_alloc: protect pcp->batch accesses with ACCESS_ONCE
mm/page_alloc: convert zone_pcp_update() to rely on memory barriers
instead of stop_machine()
mm/page_alloc: when handling percpu_pagelist_fraction, don't unneedly
recalulate high
mm/page_alloc: factor setup_pageset() into pageset_init() and
pageset_set_batch()
mm/page_alloc: relocate comment to be directly above code it refers
to.
mm/page_alloc: factor zone_pageset_init() out of setup_zone_pageset()
mm/page_alloc: in zone_pcp_update(), uze zone_pageset_init()
mm/page_alloc: rename setup_pagelist_highmark() to match naming of
pageset_set_batch()

mm/page_alloc.c | 151 +++++++++++++++++++++++++++++++++-----------------------
1 file changed, 90 insertions(+), 61 deletions(-)

--
1.8.2.1


2013-04-10 18:24:09

by Cody P Schafer

[permalink] [raw]
Subject: [PATCH v3 02/11] mm/page_alloc: prevent concurrent updaters of pcp ->batch and ->high

Because we are going to rely upon a careful transision between old and
new ->high and ->batch values using memory barriers and will remove
stop_machine(), we need to prevent multiple updaters from interweaving
their memory writes.

Add a simple mutex to protect both update loops.

Signed-off-by: Cody P Schafer <[email protected]>
---
mm/page_alloc.c | 8 ++++++++
1 file changed, 8 insertions(+)

diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 5877cf0..d259599 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -64,6 +64,9 @@
#include <asm/div64.h>
#include "internal.h"

+/* prevent >1 _updater_ of zone percpu pageset ->high and ->batch fields */
+static DEFINE_MUTEX(pcp_batch_high_lock);
+
#ifdef CONFIG_USE_PERCPU_NUMA_NODE_ID
DEFINE_PER_CPU(int, numa_node);
EXPORT_PER_CPU_SYMBOL(numa_node);
@@ -5491,6 +5494,8 @@ int percpu_pagelist_fraction_sysctl_handler(ctl_table *table, int write,
ret = proc_dointvec_minmax(table, write, buffer, length, ppos);
if (!write || (ret < 0))
return ret;
+
+ mutex_lock(&pcp_batch_high_lock);
for_each_populated_zone(zone) {
for_each_possible_cpu(cpu) {
unsigned long high;
@@ -5499,6 +5504,7 @@ int percpu_pagelist_fraction_sysctl_handler(ctl_table *table, int write,
per_cpu_ptr(zone->pageset, cpu), high);
}
}
+ mutex_unlock(&pcp_batch_high_lock);
return 0;
}

@@ -6012,7 +6018,9 @@ static int __meminit __zone_pcp_update(void *data)

void __meminit zone_pcp_update(struct zone *zone)
{
+ mutex_lock(&pcp_batch_high_lock);
stop_machine(__zone_pcp_update, zone, NULL);
+ mutex_unlock(&pcp_batch_high_lock);
}
#endif

--
1.8.2.1

2013-04-10 18:24:19

by Cody P Schafer

[permalink] [raw]
Subject: [PATCH v3 03/11] mm/page_alloc: insert memory barriers to allow async update of pcp batch and high

Introduce pageset_update() to perform a safe transision from one set of
pcp->{batch,high} to a new set using memory barriers.

This ensures that batch is always set to a safe value (1) prior to
updating high, and ensure that high is fully updated before setting the
real value of batch. It avoids ->batch ever rising above ->high.

Suggested by Gilad Ben-Yossef <[email protected]> in these threads:

https://lkml.org/lkml/2013/4/9/23
https://lkml.org/lkml/2013/4/10/49

Also reproduces his proposed comment.

Reviewed-by: Gilad Ben-Yossef <[email protected]>
Signed-off-by: Cody P Schafer <[email protected]>
---
mm/page_alloc.c | 41 ++++++++++++++++++++++++++++++++---------
1 file changed, 32 insertions(+), 9 deletions(-)

diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index d259599..f2929df 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -4007,12 +4007,37 @@ static int __meminit zone_batchsize(struct zone *zone)
#endif
}

+/*
+ * pcp->high and pcp->batch values are related and dependent on one another:
+ * ->batch must never be higher then ->high.
+ * The following function updates them in a safe manner without read side
+ * locking.
+ *
+ * Any new users of pcp->batch and pcp->high should ensure they can cope with
+ * those fields changing asynchronously (acording the the above rule).
+ *
+ * mutex_is_locked(&pcp_batch_high_lock) required when calling this function
+ * outside of boot time (or some other assurance that no concurrent updaters
+ * exist).
+ */
+static void pageset_update(struct per_cpu_pages *pcp, unsigned long high,
+ unsigned long batch)
+{
+ /* start with a fail safe value for batch */
+ pcp->batch = 1;
+ smp_wmb();
+
+ /* Update high, then batch, in order */
+ pcp->high = high;
+ smp_wmb();
+
+ pcp->batch = batch;
+}
+
/* a companion to setup_pagelist_highmark() */
static void pageset_set_batch(struct per_cpu_pageset *p, unsigned long batch)
{
- struct per_cpu_pages *pcp = &p->pcp;
- pcp->high = 6 * batch;
- pcp->batch = max(1UL, 1 * batch);
+ pageset_update(&p->pcp, 6 * batch, max(1UL, 1 * batch));
}

static void setup_pageset(struct per_cpu_pageset *p, unsigned long batch)
@@ -4036,13 +4061,11 @@ static void setup_pageset(struct per_cpu_pageset *p, unsigned long batch)
static void setup_pagelist_highmark(struct per_cpu_pageset *p,
unsigned long high)
{
- struct per_cpu_pages *pcp;
+ unsigned long batch = max(1UL, high / 4);
+ if ((high / 4) > (PAGE_SHIFT * 8))
+ batch = PAGE_SHIFT * 8;

- pcp = &p->pcp;
- pcp->high = high;
- pcp->batch = max(1UL, high/4);
- if ((high/4) > (PAGE_SHIFT * 8))
- pcp->batch = PAGE_SHIFT * 8;
+ pageset_update(&p->pcp, high, batch);
}

static void __meminit setup_zone_pageset(struct zone *zone)
--
1.8.2.1

2013-04-10 18:24:37

by Cody P Schafer

[permalink] [raw]
Subject: [PATCH v3 10/11] mm/page_alloc: in zone_pcp_update(), uze zone_pageset_init()

Previously, zone_pcp_update() called pageset_set_batch() directly,
essentially assuming that percpu_pagelist_fraction == 0. Correct this by
calling zone_pageset_init(), which chooses the appropriate ->batch and
->high calculations.

Signed-off-by: Cody P Schafer <[email protected]>
---
mm/page_alloc.c | 4 +---
1 file changed, 1 insertion(+), 3 deletions(-)

diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 749b6e1..5ee5ce9 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -6032,11 +6032,9 @@ void free_contig_range(unsigned long pfn, unsigned nr_pages)
void __meminit zone_pcp_update(struct zone *zone)
{
unsigned cpu;
- unsigned long batch;
mutex_lock(&pcp_batch_high_lock);
- batch = zone_batchsize(zone);
for_each_possible_cpu(cpu)
- pageset_set_batch(per_cpu_ptr(zone->pageset, cpu), batch);
+ zone_pageset_init(zone, cpu);
mutex_unlock(&pcp_batch_high_lock);
}
#endif
--
1.8.2.1

2013-04-10 18:24:53

by Cody P Schafer

[permalink] [raw]
Subject: [PATCH v3 09/11] mm/page_alloc: factor zone_pageset_init() out of setup_zone_pageset()

Signed-off-by: Cody P Schafer <[email protected]>
---
mm/page_alloc.c | 27 +++++++++++++++------------
1 file changed, 15 insertions(+), 12 deletions(-)

diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index b0762c7..749b6e1 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -4076,22 +4076,25 @@ static void setup_pagelist_highmark(struct per_cpu_pageset *p,
pageset_update(&p->pcp, high, batch);
}

+static void __meminit zone_pageset_init(struct zone *zone, int cpu)
+{
+ struct per_cpu_pageset *pcp = per_cpu_ptr(zone->pageset, cpu);
+
+ pageset_init(pcp);
+ if (percpu_pagelist_fraction)
+ setup_pagelist_highmark(pcp,
+ (zone->managed_pages /
+ percpu_pagelist_fraction));
+ else
+ pageset_set_batch(pcp, zone_batchsize(zone));
+}
+
static void __meminit setup_zone_pageset(struct zone *zone)
{
int cpu;
-
zone->pageset = alloc_percpu(struct per_cpu_pageset);
-
- for_each_possible_cpu(cpu) {
- struct per_cpu_pageset *pcp = per_cpu_ptr(zone->pageset, cpu);
-
- setup_pageset(pcp, zone_batchsize(zone));
-
- if (percpu_pagelist_fraction)
- setup_pagelist_highmark(pcp,
- (zone->managed_pages /
- percpu_pagelist_fraction));
- }
+ for_each_possible_cpu(cpu)
+ zone_pageset_init(zone, cpu);
}

/*
--
1.8.2.1

2013-04-10 18:25:17

by Cody P Schafer

[permalink] [raw]
Subject: [PATCH v3 05/11] mm/page_alloc: convert zone_pcp_update() to rely on memory barriers instead of stop_machine()

zone_pcp_update()'s goal is to adjust the ->high and ->mark members of a
percpu pageset based on a zone's ->managed_pages. We don't need to drain
the entire percpu pageset just to modify these fields.

This lets us avoid calling setup_pageset() (and the draining required to
call it) and instead allows simply setting the fields' values (with some
attention paid to memory barriers to prevent the relationship between
->batch and ->high from being thrown off).

This does change the behavior of zone_pcp_update() as the percpu
pagesets will not be drained when zone_pcp_update() is called (they will
end up being shrunk, not completely drained, later when a 0-order page
is freed in free_hot_cold_page()).

Signed-off-by: Cody P Schafer <[email protected]>
---
mm/page_alloc.c | 33 +++++++++------------------------
1 file changed, 9 insertions(+), 24 deletions(-)

diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 9dd0dc0..5c54a08 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -6019,33 +6019,18 @@ void free_contig_range(unsigned long pfn, unsigned nr_pages)
#endif

#ifdef CONFIG_MEMORY_HOTPLUG
-static int __meminit __zone_pcp_update(void *data)
-{
- struct zone *zone = data;
- int cpu;
- unsigned long batch = zone_batchsize(zone), flags;
-
- for_each_possible_cpu(cpu) {
- struct per_cpu_pageset *pset;
- struct per_cpu_pages *pcp;
-
- pset = per_cpu_ptr(zone->pageset, cpu);
- pcp = &pset->pcp;
-
- local_irq_save(flags);
- if (pcp->count > 0)
- free_pcppages_bulk(zone, pcp->count, pcp);
- drain_zonestat(zone, pset);
- setup_pageset(pset, batch);
- local_irq_restore(flags);
- }
- return 0;
-}
-
+/*
+ * The zone indicated has a new number of managed_pages; batch sizes and percpu
+ * page high values need to be recalulated.
+ */
void __meminit zone_pcp_update(struct zone *zone)
{
+ unsigned cpu;
+ unsigned long batch;
mutex_lock(&pcp_batch_high_lock);
- stop_machine(__zone_pcp_update, zone, NULL);
+ batch = zone_batchsize(zone);
+ for_each_possible_cpu(cpu)
+ pageset_set_batch(per_cpu_ptr(zone->pageset, cpu), batch);
mutex_unlock(&pcp_batch_high_lock);
}
#endif
--
1.8.2.1

2013-04-10 18:25:22

by Cody P Schafer

[permalink] [raw]
Subject: [PATCH v3 06/11] mm/page_alloc: when handling percpu_pagelist_fraction, don't unneedly recalulate high

Simply moves calculation of the new 'high' value outside the
for_each_possible_cpu() loop, as it does not depend on the cpu.

Signed-off-by: Cody P Schafer <[email protected]>
---
mm/page_alloc.c | 10 ++++------
1 file changed, 4 insertions(+), 6 deletions(-)

diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 5c54a08..3447a4b 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -5509,7 +5509,6 @@ int lowmem_reserve_ratio_sysctl_handler(ctl_table *table, int write,
* cpu. It is the fraction of total pages in each zone that a hot per cpu pagelist
* can have before it gets flushed back to buddy allocator.
*/
-
int percpu_pagelist_fraction_sysctl_handler(ctl_table *table, int write,
void __user *buffer, size_t *length, loff_t *ppos)
{
@@ -5523,12 +5522,11 @@ int percpu_pagelist_fraction_sysctl_handler(ctl_table *table, int write,

mutex_lock(&pcp_batch_high_lock);
for_each_populated_zone(zone) {
- for_each_possible_cpu(cpu) {
- unsigned long high;
- high = zone->managed_pages / percpu_pagelist_fraction;
+ unsigned long high;
+ high = zone->managed_pages / percpu_pagelist_fraction;
+ for_each_possible_cpu(cpu)
setup_pagelist_highmark(
- per_cpu_ptr(zone->pageset, cpu), high);
- }
+ per_cpu_ptr(zone->pageset, cpu), high);
}
mutex_unlock(&pcp_batch_high_lock);
return 0;
--
1.8.2.1

2013-04-10 18:24:45

by Cody P Schafer

[permalink] [raw]
Subject: [PATCH v3 01/11] mm/page_alloc: factor out setting of pcp->high and pcp->batch.

Creates pageset_set_batch() for use in setup_pageset().
pageset_set_batch() imitates the functionality of
setup_pagelist_highmark(), but uses the boot time
(percpu_pagelist_fraction == 0) calculations for determining ->high
based on ->batch.

Signed-off-by: Cody P Schafer <[email protected]>
---
mm/page_alloc.c | 12 +++++++++---
1 file changed, 9 insertions(+), 3 deletions(-)

diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 8fcced7..5877cf0 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -4004,6 +4004,14 @@ static int __meminit zone_batchsize(struct zone *zone)
#endif
}

+/* a companion to setup_pagelist_highmark() */
+static void pageset_set_batch(struct per_cpu_pageset *p, unsigned long batch)
+{
+ struct per_cpu_pages *pcp = &p->pcp;
+ pcp->high = 6 * batch;
+ pcp->batch = max(1UL, 1 * batch);
+}
+
static void setup_pageset(struct per_cpu_pageset *p, unsigned long batch)
{
struct per_cpu_pages *pcp;
@@ -4013,8 +4021,7 @@ static void setup_pageset(struct per_cpu_pageset *p, unsigned long batch)

pcp = &p->pcp;
pcp->count = 0;
- pcp->high = 6 * batch;
- pcp->batch = max(1UL, 1 * batch);
+ pageset_set_batch(p, batch);
for (migratetype = 0; migratetype < MIGRATE_PCPTYPES; migratetype++)
INIT_LIST_HEAD(&pcp->lists[migratetype]);
}
@@ -4023,7 +4030,6 @@ static void setup_pageset(struct per_cpu_pageset *p, unsigned long batch)
* setup_pagelist_highmark() sets the high water mark for hot per_cpu_pagelist
* to the value high for the pageset p.
*/
-
static void setup_pagelist_highmark(struct per_cpu_pageset *p,
unsigned long high)
{
--
1.8.2.1

2013-04-10 18:24:43

by Cody P Schafer

[permalink] [raw]
Subject: [PATCH v3 11/11] mm/page_alloc: rename setup_pagelist_highmark() to match naming of pageset_set_batch()

Signed-off-by: Cody P Schafer <[email protected]>
---
mm/page_alloc.c | 12 ++++++------
1 file changed, 6 insertions(+), 6 deletions(-)

diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 5ee5ce9..038e9d2 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -4037,7 +4037,7 @@ static void pageset_update(struct per_cpu_pages *pcp, unsigned long high,
pcp->batch = batch;
}

-/* a companion to setup_pagelist_highmark() */
+/* a companion to pageset_set_high() */
static void pageset_set_batch(struct per_cpu_pageset *p, unsigned long batch)
{
pageset_update(&p->pcp, 6 * batch, max(1UL, 1 * batch));
@@ -4063,10 +4063,10 @@ static void setup_pageset(struct per_cpu_pageset *p, unsigned long batch)
}

/*
- * setup_pagelist_highmark() sets the high water mark for hot per_cpu_pagelist
+ * pageset_set_high() sets the high water mark for hot per_cpu_pagelist
* to the value high for the pageset p.
*/
-static void setup_pagelist_highmark(struct per_cpu_pageset *p,
+static void pageset_set_high(struct per_cpu_pageset *p,
unsigned long high)
{
unsigned long batch = max(1UL, high / 4);
@@ -4082,7 +4082,7 @@ static void __meminit zone_pageset_init(struct zone *zone, int cpu)

pageset_init(pcp);
if (percpu_pagelist_fraction)
- setup_pagelist_highmark(pcp,
+ pageset_set_high(pcp,
(zone->managed_pages /
percpu_pagelist_fraction));
else
@@ -5533,8 +5533,8 @@ int percpu_pagelist_fraction_sysctl_handler(ctl_table *table, int write,
unsigned long high;
high = zone->managed_pages / percpu_pagelist_fraction;
for_each_possible_cpu(cpu)
- setup_pagelist_highmark(
- per_cpu_ptr(zone->pageset, cpu), high);
+ pageset_set_high(per_cpu_ptr(zone->pageset, cpu),
+ high);
}
mutex_unlock(&pcp_batch_high_lock);
return 0;
--
1.8.2.1

2013-04-10 18:26:42

by Cody P Schafer

[permalink] [raw]
Subject: [PATCH v3 07/11] mm/page_alloc: factor setup_pageset() into pageset_init() and pageset_set_batch()

Signed-off-by: Cody P Schafer <[email protected]>
---
mm/page_alloc.c | 9 +++++++--
1 file changed, 7 insertions(+), 2 deletions(-)

diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 3447a4b..352c279a 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -4043,7 +4043,7 @@ static void pageset_set_batch(struct per_cpu_pageset *p, unsigned long batch)
pageset_update(&p->pcp, 6 * batch, max(1UL, 1 * batch));
}

-static void setup_pageset(struct per_cpu_pageset *p, unsigned long batch)
+static void pageset_init(struct per_cpu_pageset *p)
{
struct per_cpu_pages *pcp;
int migratetype;
@@ -4052,11 +4052,16 @@ static void setup_pageset(struct per_cpu_pageset *p, unsigned long batch)

pcp = &p->pcp;
pcp->count = 0;
- pageset_set_batch(p, batch);
for (migratetype = 0; migratetype < MIGRATE_PCPTYPES; migratetype++)
INIT_LIST_HEAD(&pcp->lists[migratetype]);
}

+static void setup_pageset(struct per_cpu_pageset *p, unsigned long batch)
+{
+ pageset_init(p);
+ pageset_set_batch(p, batch);
+}
+
/*
* setup_pagelist_highmark() sets the high water mark for hot per_cpu_pagelist
* to the value high for the pageset p.
--
1.8.2.1

2013-04-10 18:27:59

by Cody P Schafer

[permalink] [raw]
Subject: [PATCH v3 04/11] mm/page_alloc: protect pcp->batch accesses with ACCESS_ONCE

pcp->batch could change at any point, avoid relying on it being a stable value.

Signed-off-by: Cody P Schafer <[email protected]>
---
mm/page_alloc.c | 11 +++++++----
1 file changed, 7 insertions(+), 4 deletions(-)

diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index f2929df..9dd0dc0 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -1181,10 +1181,12 @@ void drain_zone_pages(struct zone *zone, struct per_cpu_pages *pcp)
{
unsigned long flags;
int to_drain;
+ unsigned long batch;

local_irq_save(flags);
- if (pcp->count >= pcp->batch)
- to_drain = pcp->batch;
+ batch = ACCESS_ONCE(pcp->batch);
+ if (pcp->count >= batch)
+ to_drain = batch;
else
to_drain = pcp->count;
if (to_drain > 0) {
@@ -1352,8 +1354,9 @@ void free_hot_cold_page(struct page *page, int cold)
list_add(&page->lru, &pcp->lists[migratetype]);
pcp->count++;
if (pcp->count >= pcp->high) {
- free_pcppages_bulk(zone, pcp->batch, pcp);
- pcp->count -= pcp->batch;
+ unsigned long batch = ACCESS_ONCE(pcp->batch);
+ free_pcppages_bulk(zone, batch, pcp);
+ pcp->count -= batch;
}

out:
--
1.8.2.1

2013-04-10 18:28:52

by Cody P Schafer

[permalink] [raw]
Subject: [PATCH v3 08/11] mm/page_alloc: relocate comment to be directly above code it refers to.

Signed-off-by: Cody P Schafer <[email protected]>
---
mm/page_alloc.c | 4 ++--
1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 352c279a..b0762c7 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -3683,12 +3683,12 @@ void __ref build_all_zonelists(pg_data_t *pgdat, struct zone *zone)
mminit_verify_zonelist();
cpuset_init_current_mems_allowed();
} else {
- /* we have to stop all cpus to guarantee there is no user
- of zonelist */
#ifdef CONFIG_MEMORY_HOTPLUG
if (zone)
setup_zone_pageset(zone);
#endif
+ /* we have to stop all cpus to guarantee there is no user
+ of zonelist */
stop_machine(__build_all_zonelists, pgdat, NULL);
/* cpuset refresh routine should be here */
}
--
1.8.2.1

2013-04-10 21:23:57

by Andrew Morton

[permalink] [raw]
Subject: Re: [PATCH v3 00/11] mm: fixup changers of per cpu pageset's ->high and ->batch

On Wed, 10 Apr 2013 11:23:28 -0700 Cody P Schafer <[email protected]> wrote:

> "Problems" with the current code:
> 1. there is a lack of synchronization in setting ->high and ->batch in
> percpu_pagelist_fraction_sysctl_handler()
> 2. stop_machine() in zone_pcp_update() is unnecissary.
> 3. zone_pcp_update() does not consider the case where percpu_pagelist_fraction is non-zero
>
> To fix:
> 1. add memory barriers, a safe ->batch value, an update side mutex when
> updating ->high and ->batch, and use ACCESS_ONCE() for ->batch users that
> expect a stable value.
> 2. avoid draining pages in zone_pcp_update(), rely upon the memory barriers added to fix #1
> 3. factor out quite a few functions, and then call the appropriate one.
>
> Note that it results in a change to the behavior of zone_pcp_update(), which is
> used by memory_hotplug. I'm rather certain that I've diserned (and preserved)
> the essential behavior (changing ->high and ->batch), and only eliminated
> unneeded actions (draining the per cpu pages), but this may not be the case.
>
> Further note that the draining of pages that previously took place in
> zone_pcp_update() occured after repeated draining when attempting to offline a
> page, and after the offline has "succeeded". It appears that the draining was
> added to zone_pcp_update() to avoid refactoring setup_pageset() into 2
> funtions.

There hasn't been a ton of review activity for this patchset :(

I'm inclined to duck it until after 3.9. Do the patches fix any
noticeably bad userspace behavior?

2013-04-10 21:26:10

by Cody P Schafer

[permalink] [raw]
Subject: Re: [PATCH v3 00/11] mm: fixup changers of per cpu pageset's ->high and ->batch

On 04/10/2013 02:23 PM, Andrew Morton wrote:
> On Wed, 10 Apr 2013 11:23:28 -0700 Cody P Schafer <[email protected]> wrote:
>
>> "Problems" with the current code:
>> 1. there is a lack of synchronization in setting ->high and ->batch in
>> percpu_pagelist_fraction_sysctl_handler()
>> 2. stop_machine() in zone_pcp_update() is unnecissary.
>> 3. zone_pcp_update() does not consider the case where percpu_pagelist_fraction is non-zero
>>
>> To fix:
>> 1. add memory barriers, a safe ->batch value, an update side mutex when
>> updating ->high and ->batch, and use ACCESS_ONCE() for ->batch users that
>> expect a stable value.
>> 2. avoid draining pages in zone_pcp_update(), rely upon the memory barriers added to fix #1
>> 3. factor out quite a few functions, and then call the appropriate one.
>>
>> Note that it results in a change to the behavior of zone_pcp_update(), which is
>> used by memory_hotplug. I'm rather certain that I've diserned (and preserved)
>> the essential behavior (changing ->high and ->batch), and only eliminated
>> unneeded actions (draining the per cpu pages), but this may not be the case.
>>
>> Further note that the draining of pages that previously took place in
>> zone_pcp_update() occured after repeated draining when attempting to offline a
>> page, and after the offline has "succeeded". It appears that the draining was
>> added to zone_pcp_update() to avoid refactoring setup_pageset() into 2
>> funtions.
>
> There hasn't been a ton of review activity for this patchset :(
>
> I'm inclined to duck it until after 3.9. Do the patches fix any
> noticeably bad userspace behavior?

No, all the bugs are theoretical. Waiting should be fine.

2013-05-01 23:53:54

by Cody P Schafer

[permalink] [raw]
Subject: Re: [PATCH v3 00/11] mm: fixup changers of per cpu pageset's ->high and ->batch

On 04/10/2013 02:25 PM, Cody P Schafer wrote:
> On 04/10/2013 02:23 PM, Andrew Morton wrote:
>> On Wed, 10 Apr 2013 11:23:28 -0700 Cody P Schafer
>> <[email protected]> wrote:
>>
>>> "Problems" with the current code:
>>> 1. there is a lack of synchronization in setting ->high and ->batch in
>>> percpu_pagelist_fraction_sysctl_handler()
>>> 2. stop_machine() in zone_pcp_update() is unnecissary.
>>> 3. zone_pcp_update() does not consider the case where
>>> percpu_pagelist_fraction is non-zero
>>>
>>> To fix:
>>> 1. add memory barriers, a safe ->batch value, an update side mutex
>>> when
>>> updating ->high and ->batch, and use ACCESS_ONCE() for ->batch
>>> users that
>>> expect a stable value.
>>> 2. avoid draining pages in zone_pcp_update(), rely upon the memory
>>> barriers added to fix #1
>>> 3. factor out quite a few functions, and then call the appropriate
>>> one.
>>>
>>> Note that it results in a change to the behavior of
>>> zone_pcp_update(), which is
>>> used by memory_hotplug. I'm rather certain that I've diserned (and
>>> preserved)
>>> the essential behavior (changing ->high and ->batch), and only
>>> eliminated
>>> unneeded actions (draining the per cpu pages), but this may not be
>>> the case.
>>>
>>> Further note that the draining of pages that previously took place in
>>> zone_pcp_update() occured after repeated draining when attempting to
>>> offline a
>>> page, and after the offline has "succeeded". It appears that the
>>> draining was
>>> added to zone_pcp_update() to avoid refactoring setup_pageset() into 2
>>> funtions.
>>
>> There hasn't been a ton of review activity for this patchset :(
>>
>> I'm inclined to duck it until after 3.9. Do the patches fix any
>> noticeably bad userspace behavior?
>
> No, all the bugs are theoretical. Waiting should be fine.
>

Andrew, do you want me to resend this patch set in the hope of obtaining
more review? If so, when?

2013-05-07 21:24:49

by Andrew Morton

[permalink] [raw]
Subject: Re: [PATCH v3 00/11] mm: fixup changers of per cpu pageset's ->high and ->batch

On Wed, 01 May 2013 16:53:42 -0700 Cody P Schafer <[email protected]> wrote:

> >> There hasn't been a ton of review activity for this patchset :(
> >>
> >> I'm inclined to duck it until after 3.9. Do the patches fix any
> >> noticeably bad userspace behavior?
> >
> > No, all the bugs are theoretical. Waiting should be fine.
> >
>
> Andrew, do you want me to resend this patch set in the hope of obtaining
> more review?

Yes please. Resending is basically always the thing to do - if the
patches aren't resent, nobody has anything to look at or think about.

> If so, when?

After -rc1?