2013-04-09 23:28:47

by Cody P Schafer

Subject: [PATCH v2 00/10] mm: fixup changers of per cpu pageset's ->high and ->batch

"Problems" with the current code:
1. there is a lack of synchronization in setting ->high and ->batch in percpu_pagelist_fraction_sysctl_handler()
2. stop_machine() in zone_pcp_update() is unnecessary.
3. zone_pcp_update() does not consider the case where percpu_pagelist_fraction is non-zero

To fix:
1. add memory barriers, a safe intermediate ->batch value, and an update-side mutex when updating ->high and ->batch
2. avoid draining pages in zone_pcp_update(); rely upon the memory barriers added to fix #1
3. factor out the ->high/->batch setup into helper functions, and have zone_pcp_update() call the appropriate one

Note that this results in a change to the behavior of zone_pcp_update(), which is
used by memory_hotplug. I'm rather certain that I've discerned (and preserved)
the essential behavior (changing ->high and ->batch), and only eliminated
unneeded actions (draining the per-cpu pages), but this may not be the case.

Further note that the draining of pages that previously took place in
zone_pcp_update() occurred after repeated draining when attempting to offline a
page, and after the offline had "succeeded". It appears that the draining was
added to zone_pcp_update() to avoid refactoring setup_pageset() into two
functions.
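
For reference, the barrier-based update that this series converges on (v3
consolidates it into a single pageset_update() helper, following Gilad's
review later in this thread) looks roughly like the following sketch:

	/*
	 * pcp->high and pcp->batch values are related and dependent on one
	 * another: ->batch must never be higher than ->high.
	 * Update them in a safe manner, without a costly atomic transaction.
	 */
	static void pageset_update(struct per_cpu_pages *pcp, unsigned long high,
				   unsigned long batch)
	{
		/* start with a fail-safe value for batch */
		pcp->batch = 1;
		smp_wmb();

		/* update high, then batch, in order */
		pcp->high = high;
		smp_wmb();
		pcp->batch = batch;
	}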

--
Changes since v1:

- instead of using on_each_cpu(), use memory barriers (Gilad) and an update-side mutex.
- add "Problem" #3 above, and fix it.
- rename function to match naming style of similar function
- move unrelated comment

Cody P Schafer (10):
mm/page_alloc: factor out setting of pcp->high and pcp->batch.
mm/page_alloc: prevent concurrent updaters of pcp ->batch and ->high
mm/page_alloc: insert memory barriers to allow async update of pcp
batch and high
mm/page_alloc: convert zone_pcp_update() to rely on memory barriers
instead of stop_machine()
mm/page_alloc: when handling percpu_pagelist_fraction, don't needlessly
recalculate high
mm/page_alloc: factor setup_pageset() into pageset_init() and
pageset_set_batch()
mm/page_alloc: relocate comment to be directly above code it refers
to.
mm/page_alloc: factor zone_pageset_init() out of setup_zone_pageset()
mm/page_alloc: in zone_pcp_update(), use zone_pageset_init()
mm/page_alloc: rename setup_pagelist_highmark() to match naming of
pageset_set_batch()

mm/page_alloc.c | 124 +++++++++++++++++++++++++++++++++-----------------------
1 file changed, 73 insertions(+), 51 deletions(-)

--
1.8.2


2013-04-09 23:28:48

by Cody P Schafer

Subject: [PATCH v2 01/10] mm/page_alloc: factor out setting of pcp->high and pcp->batch.

Creates pageset_set_batch() for use in setup_pageset().
pageset_set_batch() imitates the functionality of
setup_pagelist_highmark(), but uses the boot-time
(percpu_pagelist_fraction == 0) calculation for determining ->high
based on ->batch.
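
For comparison, the body of the existing setup_pagelist_highmark() (visible
in context in patch 03 below) goes the other way, deriving ->batch from a
caller-chosen high watermark, roughly:

	pcp->high = high;
	pcp->batch = max(1UL, high / 4);
	if ((high / 4) > (PAGE_SHIFT * 8))
		pcp->batch = PAGE_SHIFT * 8;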

Signed-off-by: Cody P Schafer <[email protected]>
---
mm/page_alloc.c | 12 +++++++++---
1 file changed, 9 insertions(+), 3 deletions(-)

diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 8fcced7..5877cf0 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -4004,6 +4004,14 @@ static int __meminit zone_batchsize(struct zone *zone)
#endif
}

+/* a companion to setup_pagelist_highmark() */
+static void pageset_set_batch(struct per_cpu_pageset *p, unsigned long batch)
+{
+ struct per_cpu_pages *pcp = &p->pcp;
+ pcp->high = 6 * batch;
+ pcp->batch = max(1UL, 1 * batch);
+}
+
static void setup_pageset(struct per_cpu_pageset *p, unsigned long batch)
{
struct per_cpu_pages *pcp;
@@ -4013,8 +4021,7 @@ static void setup_pageset(struct per_cpu_pageset *p, unsigned long batch)

pcp = &p->pcp;
pcp->count = 0;
- pcp->high = 6 * batch;
- pcp->batch = max(1UL, 1 * batch);
+ pageset_set_batch(p, batch);
for (migratetype = 0; migratetype < MIGRATE_PCPTYPES; migratetype++)
INIT_LIST_HEAD(&pcp->lists[migratetype]);
}
@@ -4023,7 +4030,6 @@ static void setup_pageset(struct per_cpu_pageset *p, unsigned long batch)
* setup_pagelist_highmark() sets the high water mark for hot per_cpu_pagelist
* to the value high for the pageset p.
*/
-
static void setup_pagelist_highmark(struct per_cpu_pageset *p,
unsigned long high)
{
--
1.8.2

2013-04-09 23:29:20

by Cody P Schafer

Subject: [PATCH v2 04/10] mm/page_alloc: convert zone_pcp_update() to rely on memory barriers instead of stop_machine()

zone_pcp_update()'s goal is to adjust the ->high and ->batch members of a
percpu pageset based on a zone's ->managed_pages. We don't need to drain
the entire percpu pageset just to modify these fields.

This lets us avoid calling setup_pageset() (and the draining required to
call it) and instead allows simply setting the fields' values (with some
attention paid to memory barriers to prevent the relationship between
->batch and ->high from being thrown off).

This does change the behavior of zone_pcp_update() as the percpu
pagesets will not be drained when zone_pcp_update() is called (they will
end up being shrunk, not completely drained, later when a 0-order page
is freed in free_hot_cold_page()).
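
For illustration (a rough sketch of the then-current consumer, from memory
of the era's code and not part of this patch): free_hot_cold_page() re-reads
both fields on every free, so an updater only has to ensure that ->batch
never exceeds ->high at any instant, which the intermediate ->batch = 1
introduced in patch 03 guarantees:

	/* sketch of the consumer side in free_hot_cold_page() */
	pcp = &this_cpu_ptr(zone->pageset)->pcp;
	list_add(&page->lru, &pcp->lists[migratetype]);
	pcp->count++;
	if (pcp->count >= pcp->high) {
		free_pcppages_bulk(zone, pcp->batch, pcp);
		pcp->count -= pcp->batch;
	}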

Signed-off-by: Cody P Schafer <[email protected]>
---
mm/page_alloc.c | 33 +++++++++------------------------
1 file changed, 9 insertions(+), 24 deletions(-)

diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index a07bd4c..4a03c56 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -6012,33 +6012,18 @@ void free_contig_range(unsigned long pfn, unsigned nr_pages)
#endif

#ifdef CONFIG_MEMORY_HOTPLUG
-static int __meminit __zone_pcp_update(void *data)
-{
- struct zone *zone = data;
- int cpu;
- unsigned long batch = zone_batchsize(zone), flags;
-
- for_each_possible_cpu(cpu) {
- struct per_cpu_pageset *pset;
- struct per_cpu_pages *pcp;
-
- pset = per_cpu_ptr(zone->pageset, cpu);
- pcp = &pset->pcp;
-
- local_irq_save(flags);
- if (pcp->count > 0)
- free_pcppages_bulk(zone, pcp->count, pcp);
- drain_zonestat(zone, pset);
- setup_pageset(pset, batch);
- local_irq_restore(flags);
- }
- return 0;
-}
-
+/*
+ * The zone indicated has a new number of managed_pages; batch sizes and percpu
+ * page high values need to be recalculated.
+ */
void __meminit zone_pcp_update(struct zone *zone)
{
+ unsigned cpu;
+ unsigned long batch;
mutex_lock(&pcp_batch_high_lock);
- stop_machine(__zone_pcp_update, zone, NULL);
+ batch = zone_batchsize(zone);
+ for_each_possible_cpu(cpu)
+ pageset_set_batch(per_cpu_ptr(zone->pageset, cpu), batch);
mutex_unlock(&pcp_batch_high_lock);
}
#endif
--
1.8.2

2013-04-09 23:29:27

by Cody P Schafer

Subject: [PATCH v2 07/10] mm/page_alloc: relocate comment to be directly above code it refers to.

Signed-off-by: Cody P Schafer <[email protected]>
---
mm/page_alloc.c | 4 ++--
1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index a2f2207..6e52e67 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -3680,12 +3680,12 @@ void __ref build_all_zonelists(pg_data_t *pgdat, struct zone *zone)
mminit_verify_zonelist();
cpuset_init_current_mems_allowed();
} else {
- /* we have to stop all cpus to guarantee there is no user
- of zonelist */
#ifdef CONFIG_MEMORY_HOTPLUG
if (zone)
setup_zone_pageset(zone);
#endif
+ /* we have to stop all cpus to guarantee there is no user
+ of zonelist */
stop_machine(__build_all_zonelists, pgdat, NULL);
/* cpuset refresh routine should be here */
}
--
1.8.2

2013-04-09 23:29:34

by Cody P Schafer

Subject: [PATCH v2 10/10] mm/page_alloc: rename setup_pagelist_highmark() to match naming of pageset_set_batch()

Signed-off-by: Cody P Schafer <[email protected]>
---
mm/page_alloc.c | 12 ++++++------
1 file changed, 6 insertions(+), 6 deletions(-)

diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 334387e..66b8bc2 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -4018,7 +4018,7 @@ static void pageset_update_prep(struct per_cpu_pages *pcp)
smp_wmb();
}

-/* a companion to setup_pagelist_highmark() */
+/* a companion to pageset_set_high() */
static void pageset_set_batch(struct per_cpu_pageset *p, unsigned long batch)
{
struct per_cpu_pages *pcp = &p->pcp;
@@ -4050,10 +4050,10 @@ static void setup_pageset(struct per_cpu_pageset *p, unsigned long batch)
}

/*
- * setup_pagelist_highmark() sets the high water mark for hot per_cpu_pagelist
+ * pageset_set_high() sets the high water mark for hot per_cpu_pagelist
* to the value high for the pageset p.
*/
-static void setup_pagelist_highmark(struct per_cpu_pageset *p,
+static void pageset_set_high(struct per_cpu_pageset *p,
unsigned long high)
{
struct per_cpu_pages *pcp;
@@ -4075,7 +4075,7 @@ static void __meminit zone_pageset_init(struct zone *zone, int cpu)

pageset_init(pcp);
if (percpu_pagelist_fraction)
- setup_pagelist_highmark(pcp,
+ pageset_set_high(pcp,
(zone->managed_pages /
percpu_pagelist_fraction));
else
@@ -5526,8 +5526,8 @@ int percpu_pagelist_fraction_sysctl_handler(ctl_table *table, int write,
unsigned long high;
high = zone->managed_pages / percpu_pagelist_fraction;
for_each_possible_cpu(cpu)
- setup_pagelist_highmark(
- per_cpu_ptr(zone->pageset, cpu), high);
+ pageset_set_high(per_cpu_ptr(zone->pageset, cpu),
+ high);
}
mutex_unlock(&pcp_batch_high_lock);
return 0;
--
1.8.2

2013-04-09 23:29:31

by Cody P Schafer

Subject: [PATCH v2 08/10] mm/page_alloc: factor zone_pageset_init() out of setup_zone_pageset()

Signed-off-by: Cody P Schafer <[email protected]>
---
mm/page_alloc.c | 27 +++++++++++++++------------
1 file changed, 15 insertions(+), 12 deletions(-)

diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 6e52e67..c663e62 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -4069,22 +4069,25 @@ static void setup_pagelist_highmark(struct per_cpu_pageset *p,
pcp->batch = PAGE_SHIFT * 8;
}

+static void __meminit zone_pageset_init(struct zone *zone, int cpu)
+{
+ struct per_cpu_pageset *pcp = per_cpu_ptr(zone->pageset, cpu);
+
+ pageset_init(pcp);
+ if (percpu_pagelist_fraction)
+ setup_pagelist_highmark(pcp,
+ (zone->managed_pages /
+ percpu_pagelist_fraction));
+ else
+ pageset_set_batch(pcp, zone_batchsize(zone));
+}
+
static void __meminit setup_zone_pageset(struct zone *zone)
{
int cpu;
-
zone->pageset = alloc_percpu(struct per_cpu_pageset);
-
- for_each_possible_cpu(cpu) {
- struct per_cpu_pageset *pcp = per_cpu_ptr(zone->pageset, cpu);
-
- setup_pageset(pcp, zone_batchsize(zone));
-
- if (percpu_pagelist_fraction)
- setup_pagelist_highmark(pcp,
- (zone->managed_pages /
- percpu_pagelist_fraction));
- }
+ for_each_possible_cpu(cpu)
+ zone_pageset_init(zone, cpu);
}

/*
--
1.8.2

2013-04-09 23:29:29

by Cody P Schafer

Subject: [PATCH v2 09/10] mm/page_alloc: in zone_pcp_update(), use zone_pageset_init()

Previously, zone_pcp_update() called pageset_set_batch() directly,
essentially assuming that percpu_pagelist_fraction == 0. Correct this by
calling zone_pageset_init(), which chooses the appropriate ->batch and
->high calculations.

Signed-off-by: Cody P Schafer <[email protected]>
---
mm/page_alloc.c | 4 +---
1 file changed, 1 insertion(+), 3 deletions(-)

diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index c663e62..334387e 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -6025,11 +6025,9 @@ void free_contig_range(unsigned long pfn, unsigned nr_pages)
void __meminit zone_pcp_update(struct zone *zone)
{
unsigned cpu;
- unsigned long batch;
mutex_lock(&pcp_batch_high_lock);
- batch = zone_batchsize(zone);
for_each_possible_cpu(cpu)
- pageset_set_batch(per_cpu_ptr(zone->pageset, cpu), batch);
+ zone_pageset_init(zone, cpu);
mutex_unlock(&pcp_batch_high_lock);
}
#endif
--
1.8.2

2013-04-09 23:29:25

by Cody P Schafer

Subject: [PATCH v2 06/10] mm/page_alloc: factor setup_pageset() into pageset_init() and pageset_set_batch()

Signed-off-by: Cody P Schafer <[email protected]>
---
mm/page_alloc.c | 9 +++++++--
1 file changed, 7 insertions(+), 2 deletions(-)

diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 50a277a..a2f2207 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -4030,7 +4030,7 @@ static void pageset_set_batch(struct per_cpu_pageset *p, unsigned long batch)
pcp->batch = max(1UL, 1 * batch);
}

-static void setup_pageset(struct per_cpu_pageset *p, unsigned long batch)
+static void pageset_init(struct per_cpu_pageset *p)
{
struct per_cpu_pages *pcp;
int migratetype;
@@ -4039,11 +4039,16 @@ static void setup_pageset(struct per_cpu_pageset *p, unsigned long batch)

pcp = &p->pcp;
pcp->count = 0;
- pageset_set_batch(p, batch);
for (migratetype = 0; migratetype < MIGRATE_PCPTYPES; migratetype++)
INIT_LIST_HEAD(&pcp->lists[migratetype]);
}

+static void setup_pageset(struct per_cpu_pageset *p, unsigned long batch)
+{
+ pageset_init(p);
+ pageset_set_batch(p, batch);
+}
+
/*
* setup_pagelist_highmark() sets the high water mark for hot per_cpu_pagelist
* to the value high for the pageset p.
--
1.8.2

2013-04-09 23:29:23

by Cody P Schafer

Subject: [PATCH v2 05/10] mm/page_alloc: when handling percpu_pagelist_fraction, don't needlessly recalculate high

Simply moves calculation of the new 'high' value outside the
for_each_possible_cpu() loop, as it does not depend on the cpu.

Signed-off-by: Cody P Schafer <[email protected]>
---
mm/page_alloc.c | 10 ++++------
1 file changed, 4 insertions(+), 6 deletions(-)

diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 4a03c56..50a277a 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -5502,7 +5502,6 @@ int lowmem_reserve_ratio_sysctl_handler(ctl_table *table, int write,
* cpu. It is the fraction of total pages in each zone that a hot per cpu pagelist
* can have before it gets flushed back to buddy allocator.
*/
-
int percpu_pagelist_fraction_sysctl_handler(ctl_table *table, int write,
void __user *buffer, size_t *length, loff_t *ppos)
{
@@ -5516,12 +5515,11 @@ int percpu_pagelist_fraction_sysctl_handler(ctl_table *table, int write,

mutex_lock(&pcp_batch_high_lock);
for_each_populated_zone(zone) {
- for_each_possible_cpu(cpu) {
- unsigned long high;
- high = zone->managed_pages / percpu_pagelist_fraction;
+ unsigned long high;
+ high = zone->managed_pages / percpu_pagelist_fraction;
+ for_each_possible_cpu(cpu)
setup_pagelist_highmark(
- per_cpu_ptr(zone->pageset, cpu), high);
- }
+ per_cpu_ptr(zone->pageset, cpu), high);
}
mutex_unlock(&pcp_batch_high_lock);
return 0;
--
1.8.2

2013-04-09 23:29:18

by Cody P Schafer

Subject: [PATCH v2 03/10] mm/page_alloc: insert memory barriers to allow async update of pcp batch and high

In pageset_set_batch() and setup_pagelist_highmark(), ensure that batch
is always set to a safe value (1) prior to updating high, and ensure
that high is fully updated before setting the real value of batch.

Suggested by Gilad Ben-Yossef <[email protected]> in this thread:

https://lkml.org/lkml/2013/4/9/23

Also reproduces his proposed comment.

Signed-off-by: Cody P Schafer <[email protected]>
---
mm/page_alloc.c | 19 +++++++++++++++++++
1 file changed, 19 insertions(+)

diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index d259599..a07bd4c 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -4007,11 +4007,26 @@ static int __meminit zone_batchsize(struct zone *zone)
#endif
}

+static void pageset_update_prep(struct per_cpu_pages *pcp)
+{
+ /*
+ * We're about to mess with PCP in a non-atomic fashion. Put an
+ * intermediate safe value of batch and make sure it is visible before
+ * any other change
+ */
+ pcp->batch = 1;
+ smp_wmb();
+}
+
/* a companion to setup_pagelist_highmark() */
static void pageset_set_batch(struct per_cpu_pageset *p, unsigned long batch)
{
struct per_cpu_pages *pcp = &p->pcp;
+ pageset_update_prep(pcp);
+
pcp->high = 6 * batch;
+ smp_wmb();
+
pcp->batch = max(1UL, 1 * batch);
}

@@ -4039,7 +4054,11 @@ static void setup_pagelist_highmark(struct per_cpu_pageset *p,
struct per_cpu_pages *pcp;

pcp = &p->pcp;
+ pageset_update_prep(pcp);
+
pcp->high = high;
+ smp_wmb();
+
pcp->batch = max(1UL, high/4);
if ((high/4) > (PAGE_SHIFT * 8))
pcp->batch = PAGE_SHIFT * 8;
--
1.8.2

2013-04-09 23:29:15

by Cody P Schafer

Subject: [PATCH v2 02/10] mm/page_alloc: prevent concurrent updaters of pcp ->batch and ->high

Because we are going to rely upon a careful transition between old and
new ->high and ->batch values using memory barriers, and will remove
stop_machine(), we need to prevent multiple updaters from interleaving
their memory writes.

Add a simple mutex to protect both update loops.
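
A hypothetical interleaving (illustrative only, using this series' update
sites) shows the hazard: without the mutex, the final ->high/->batch pair
can mix values from two concurrent updates:

	updater A (pageset_set_batch)      updater B (setup_pagelist_highmark)
	pcp->high = 6 * batch;
	                                   pcp->high = high;
	                                   pcp->batch = max(1UL, high / 4);
	pcp->batch = max(1UL, 1 * batch);  /* may now exceed B's ->high */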

Signed-off-by: Cody P Schafer <[email protected]>
---
mm/page_alloc.c | 8 ++++++++
1 file changed, 8 insertions(+)

diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 5877cf0..d259599 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -64,6 +64,9 @@
#include <asm/div64.h>
#include "internal.h"

+/* prevent >1 _updater_ of zone percpu pageset ->high and ->batch fields */
+static DEFINE_MUTEX(pcp_batch_high_lock);
+
#ifdef CONFIG_USE_PERCPU_NUMA_NODE_ID
DEFINE_PER_CPU(int, numa_node);
EXPORT_PER_CPU_SYMBOL(numa_node);
@@ -5491,6 +5494,8 @@ int percpu_pagelist_fraction_sysctl_handler(ctl_table *table, int write,
ret = proc_dointvec_minmax(table, write, buffer, length, ppos);
if (!write || (ret < 0))
return ret;
+
+ mutex_lock(&pcp_batch_high_lock);
for_each_populated_zone(zone) {
for_each_possible_cpu(cpu) {
unsigned long high;
@@ -5499,6 +5504,7 @@ int percpu_pagelist_fraction_sysctl_handler(ctl_table *table, int write,
per_cpu_ptr(zone->pageset, cpu), high);
}
}
+ mutex_unlock(&pcp_batch_high_lock);
return 0;
}

@@ -6012,7 +6018,9 @@ static int __meminit __zone_pcp_update(void *data)

void __meminit zone_pcp_update(struct zone *zone)
{
+ mutex_lock(&pcp_batch_high_lock);
stop_machine(__zone_pcp_update, zone, NULL);
+ mutex_unlock(&pcp_batch_high_lock);
}
#endif

--
1.8.2

2013-04-10 06:19:44

by Gilad Ben-Yossef

Subject: Re: [PATCH v2 03/10] mm/page_alloc: insert memory barriers to allow async update of pcp batch and high

On Wed, Apr 10, 2013 at 2:28 AM, Cody P Schafer <[email protected]> wrote:
> In pageset_set_batch() and setup_pagelist_highmark(), ensure that batch
> is always set to a safe value (1) prior to updating high, and ensure
> that high is fully updated before setting the real value of batch.
>
> Suggested by Gilad Ben-Yossef <[email protected]> in this thread:
>
> https://lkml.org/lkml/2013/4/9/23
>
> Also reproduces his proposed comment.
>
> Signed-off-by: Cody P Schafer <[email protected]>
> ---
> mm/page_alloc.c | 19 +++++++++++++++++++
> 1 file changed, 19 insertions(+)
>
> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> index d259599..a07bd4c 100644
> --- a/mm/page_alloc.c
> +++ b/mm/page_alloc.c
> @@ -4007,11 +4007,26 @@ static int __meminit zone_batchsize(struct zone *zone)
> #endif
> }
>
> +static void pageset_update_prep(struct per_cpu_pages *pcp)
> +{
> + /*
> + * We're about to mess with PCP in a non-atomic fashion. Put an
> + * intermediate safe value of batch and make sure it is visible before
> + * any other change
> + */
> + pcp->batch = 1;
> + smp_wmb();
> +}
> +
> /* a companion to setup_pagelist_highmark() */
> static void pageset_set_batch(struct per_cpu_pageset *p, unsigned long batch)
> {
> struct per_cpu_pages *pcp = &p->pcp;
> + pageset_update_prep(pcp);
> +
> pcp->high = 6 * batch;
> + smp_wmb();
> +
> pcp->batch = max(1UL, 1 * batch);
> }
>
> @@ -4039,7 +4054,11 @@ static void setup_pagelist_highmark(struct per_cpu_pageset *p,
> struct per_cpu_pages *pcp;
>
> pcp = &p->pcp;
> + pageset_update_prep(pcp);
> +
> pcp->high = high;
> + smp_wmb();
> +
> pcp->batch = max(1UL, high/4);
> if ((high/4) > (PAGE_SHIFT * 8))
> pcp->batch = PAGE_SHIFT * 8;
> --
> 1.8.2
>

That is very good.
However, now we've created a "protocol" for updating ->high and ->batch:

1. Call pageset_update_prep(pcp)
2. Update ->high
3. smp_wmb()
4. Update ->batch

But that protocol is not documented anywhere, and someone who reads the
code two years from now will not be aware of it or why it is done like that.

How about if we create:

/*
 * pcp->high and pcp->batch values are related and dependent on one another:
 * ->batch must never be higher than ->high.
 * The following function updates them in a safe manner without a costly
 * atomic transaction.
 */
static void pageset_update(struct per_cpu_pages *pcp, unsigned int high,
			   unsigned int batch)
{
	/* start with a fail safe value for batch */
	pcp->batch = 1;
	smp_wmb();

	/* Update high, then batch, in order */
	pcp->high = high;
	smp_wmb();
	pcp->batch = batch;
}

And use that at the update sites? Then the protocol becomes explicit.
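
For concreteness, the two update sites would then collapse to single calls,
roughly like this sketch (Cody's reply further down confirms the helper was
adopted for v3):

	static void pageset_set_batch(struct per_cpu_pageset *p, unsigned long batch)
	{
		pageset_update(&p->pcp, 6 * batch, max(1UL, 1 * batch));
	}

	static void pageset_set_high(struct per_cpu_pageset *p, unsigned long high)
	{
		unsigned long batch = max(1UL, high / 4);

		if ((high / 4) > (PAGE_SHIFT * 8))
			batch = PAGE_SHIFT * 8;
		pageset_update(&p->pcp, high, batch);
	}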

Thanks,
Gilad.

--
Gilad Ben-Yossef
Chief Coffee Drinker
[email protected]
Israel Cell: +972-52-8260388
US Cell: +1-973-8260388
http://benyossef.com

"If you take a class in large-scale robotics, can you end up in a
situation where the homework eats your dog?"
-- Jean-Baptiste Queru

2013-04-10 06:22:36

by Gilad Ben-Yossef

Subject: Re: [PATCH v2 03/10] mm/page_alloc: insert memory barriers to allow async update of pcp batch and high

On Wed, Apr 10, 2013 at 9:19 AM, Gilad Ben-Yossef <[email protected]> wrote:
> On Wed, Apr 10, 2013 at 2:28 AM, Cody P Schafer <[email protected]> wrote:
> [full quote of Cody's patch and of the pageset_update() proposal trimmed;
> see the previous message]
>
> And use that at the update sites? Then the protocol becomes explicit.

Oh, and other than that it looks good to me, so assuming you do that -

Reviewed-By: Gilad Ben-Yossef <[email protected]>

Many thanks,
Gilad



2013-04-10 18:32:21

by Cody P Schafer

Subject: Re: [PATCH v2 03/10] mm/page_alloc: insert memory barriers to allow async update of pcp batch and high

On 04/09/2013 11:22 PM, Gilad Ben-Yossef wrote:
> On Wed, Apr 10, 2013 at 9:19 AM, Gilad Ben-Yossef <[email protected]> wrote:
>> On Wed, Apr 10, 2013 at 2:28 AM, Cody P Schafer <[email protected]> wrote:
>>> [patch body quoted in full; trimmed]
>>
>> That is very good.
>> However, now we've created a "protocol" for updating ->high and ->batch:
>>
>> 1. Call pageset_update_prep(pcp)
>> 2. Update ->high
>> 3. smp_wmb()
>> 4. Update ->batch
>>
>> But that protocol is not documented anywhere, and someone who reads the
>> code two years from now will not be aware of it or why it is done like that.
>>
>> How about if we create:
>>
>> [the pageset_update() helper, quoted in full in the previous messages]
>>
>> And use that at the update sites? Then the protocol becomes explicit.

Yep, this looks like exactly the right thing.

>
> Oh, and other than that it looks good to me, so assuming you do that -
>
> Reviewed-By: Gilad Ben-Yossef <[email protected]>

I've added it only to the patch with pageset_update() in it; if you meant
to apply it to more patches, feel free to reply to the v3 posting.