2013-04-05 20:34:19

by Cody P Schafer

[permalink] [raw]
Subject: [PATCH 0/3] mm: fixup changers of per cpu pageset's ->high and ->batch

In one case, while modifying the ->high and ->batch fields of per cpu pagesets,
we're needlessly using stop_machine() (patches 1 & 2); in another, we don't have
any synchronization at all (patch 3).

This patchset fixes both of them.

Note that it results in a change to the behavior of zone_pcp_update(), which is
used by memory_hotplug. I _think_ that I've discerned (and preserved) the
essential behavior (changing ->high and ->batch), and only eliminated unneeded
actions (draining the per cpu pages), but this may not be the case.

--
mm/page_alloc.c | 63 +++++++++++++++++++++++++++------------------------------
1 file changed, 30 insertions(+), 33 deletions(-)


2013-04-05 20:34:22

by Cody P Schafer

[permalink] [raw]
Subject: [PATCH 1/3] mm/page_alloc: factor out setting of pcp->high and pcp->batch.

Creates pageset_set_batch() for use in setup_pageset().
pageset_set_batch() imitates the functionality of
setup_pagelist_highmark(), but uses the boot time
(percpu_pagelist_fraction == 0) calculations for determining ->high
based on ->batch.

Signed-off-by: Cody P Schafer <[email protected]>
---
mm/page_alloc.c | 12 +++++++++---
1 file changed, 9 insertions(+), 3 deletions(-)

diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 8fcced7..5877cf0 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -4004,6 +4004,14 @@ static int __meminit zone_batchsize(struct zone *zone)
#endif
}

+/* a companion to setup_pagelist_highmark() */
+static void pageset_set_batch(struct per_cpu_pageset *p, unsigned long batch)
+{
+ struct per_cpu_pages *pcp = &p->pcp;
+ pcp->high = 6 * batch;
+ pcp->batch = max(1UL, 1 * batch);
+}
+
static void setup_pageset(struct per_cpu_pageset *p, unsigned long batch)
{
struct per_cpu_pages *pcp;
@@ -4013,8 +4021,7 @@ static void setup_pageset(struct per_cpu_pageset *p, unsigned long batch)

pcp = &p->pcp;
pcp->count = 0;
- pcp->high = 6 * batch;
- pcp->batch = max(1UL, 1 * batch);
+ pageset_set_batch(p, batch);
for (migratetype = 0; migratetype < MIGRATE_PCPTYPES; migratetype++)
INIT_LIST_HEAD(&pcp->lists[migratetype]);
}
@@ -4023,7 +4030,6 @@ static void setup_pageset(struct per_cpu_pageset *p, unsigned long batch)
* setup_pagelist_highmark() sets the high water mark for hot per_cpu_pagelist
* to the value high for the pageset p.
*/
-
static void setup_pagelist_highmark(struct per_cpu_pageset *p,
unsigned long high)
{
--
1.8.2

2013-04-05 20:34:26

by Cody P Schafer

[permalink] [raw]
Subject: [PATCH 2/3] mm/page_alloc: convert zone_pcp_update() to use on_each_cpu() instead of stop_machine()

No off-cpu users of the percpu pagesets exist.

zone_pcp_update()'s goal is to adjust the ->high and ->batch members of a
percpu pageset based on a zone's ->managed_pages. We don't need to drain
the entire percpu pageset just to modify these fields. Avoid calling
setup_pageset() (and the draining required to call it) and instead just
set the fields' values.

This does change the behavior of zone_pcp_update() as the percpu
pagesets will not be drained when zone_pcp_update() is called (they will
end up being shrunk, not completely drained, later when a 0-order page
is freed in free_hot_cold_page()).

Signed-off-by: Cody P Schafer <[email protected]>
---
mm/page_alloc.c | 30 ++++++++++--------------------
1 file changed, 10 insertions(+), 20 deletions(-)

diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 5877cf0..48f2faa 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -5987,32 +5987,22 @@ void free_contig_range(unsigned long pfn, unsigned nr_pages)
#endif

#ifdef CONFIG_MEMORY_HOTPLUG
-static int __meminit __zone_pcp_update(void *data)
+static void __meminit __zone_pcp_update(void *data)
{
struct zone *zone = data;
- int cpu;
- unsigned long batch = zone_batchsize(zone), flags;
-
- for_each_possible_cpu(cpu) {
- struct per_cpu_pageset *pset;
- struct per_cpu_pages *pcp;
-
- pset = per_cpu_ptr(zone->pageset, cpu);
- pcp = &pset->pcp;
-
- local_irq_save(flags);
- if (pcp->count > 0)
- free_pcppages_bulk(zone, pcp->count, pcp);
- drain_zonestat(zone, pset);
- setup_pageset(pset, batch);
- local_irq_restore(flags);
- }
- return 0;
+ unsigned long batch = zone_batchsize(zone);
+ struct per_cpu_pageset *pset =
+ per_cpu_ptr(zone->pageset, smp_processor_id());
+ pageset_set_batch(pset, batch);
}

+/*
+ * The zone indicated has a new number of managed_pages; batch sizes and percpu
+ * page high values need to be recalculated.
+ */
void __meminit zone_pcp_update(struct zone *zone)
{
- stop_machine(__zone_pcp_update, zone, NULL);
+ on_each_cpu(__zone_pcp_update, zone, true);
}
#endif

--
1.8.2

2013-04-05 20:34:31

by Cody P Schafer

[permalink] [raw]
Subject: [PATCH 3/3] mm: when handling percpu_pagelist_fraction, use on_each_cpu() to set percpu pageset fields.

In free_hot_cold_page(), we rely on pcp->batch remaining stable.
Updating it without being on the cpu owning the percpu pageset
potentially destroys this stability.

Change for_each_cpu() to on_each_cpu() to fix.

Signed-off-by: Cody P Schafer <[email protected]>
---
mm/page_alloc.c | 21 +++++++++++----------
1 file changed, 11 insertions(+), 10 deletions(-)

diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 48f2faa..507db31 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -5475,30 +5475,31 @@ int lowmem_reserve_ratio_sysctl_handler(ctl_table *table, int write,
return 0;
}

+static void _zone_set_pageset_highmark(void *data)
+{
+ struct zone *zone = data;
+ unsigned long high;
+ high = zone->managed_pages / percpu_pagelist_fraction;
+ setup_pagelist_highmark(
+ per_cpu_ptr(zone->pageset, smp_processor_id()), high);
+}
+
/*
* percpu_pagelist_fraction - changes the pcp->high for each zone on each
* cpu. It is the fraction of total pages in each zone that a hot per cpu pagelist
* can have before it gets flushed back to buddy allocator.
*/
-
int percpu_pagelist_fraction_sysctl_handler(ctl_table *table, int write,
void __user *buffer, size_t *length, loff_t *ppos)
{
struct zone *zone;
- unsigned int cpu;
int ret;

ret = proc_dointvec_minmax(table, write, buffer, length, ppos);
if (!write || (ret < 0))
return ret;
- for_each_populated_zone(zone) {
- for_each_possible_cpu(cpu) {
- unsigned long high;
- high = zone->managed_pages / percpu_pagelist_fraction;
- setup_pagelist_highmark(
- per_cpu_ptr(zone->pageset, cpu), high);
- }
- }
+ for_each_populated_zone(zone)
+ on_each_cpu(_zone_set_pageset_highmark, zone, true);
return 0;
}

--
1.8.2

2013-04-07 01:32:12

by Simon Jeons

[permalink] [raw]
Subject: Re: [PATCH 0/3] mm: fixup changers of per cpu pageset's ->high and ->batch

Hi Cody,
On 04/06/2013 04:33 AM, Cody P Schafer wrote:
> In one case, while modifying the ->high and ->batch fields of per cpu pagesets,
> we're needlessly using stop_machine() (patches 1 & 2); in another, we don't have
> any synchronization at all (patch 3).

Do you mean stop_machine() is used for synchronization between each
online cpu?

>
> This patchset fixes both of them.
>
> Note that it results in a change to the behavior of zone_pcp_update(), which is
> used by memory_hotplug. I _think_ that I've discerned (and preserved) the
> essential behavior (changing ->high and ->batch), and only eliminated unneeded
> actions (draining the per cpu pages), but this may not be the case.
>
> --
> mm/page_alloc.c | 63 +++++++++++++++++++++++++++------------------------------
> 1 file changed, 30 insertions(+), 33 deletions(-)
>

2013-04-07 01:37:34

by Simon Jeons

[permalink] [raw]
Subject: Re: [PATCH 1/3] mm/page_alloc: factor out setting of pcp->high and pcp->batch.

Hi Cody,
On 04/06/2013 04:33 AM, Cody P Schafer wrote:
> Creates pageset_set_batch() for use in setup_pageset().
> pageset_set_batch() imitates the functionality of
> setup_pagelist_highmark(), but uses the boot time
> (percpu_pagelist_fraction == 0) calculations for determining ->high

Why do pcp->high and pcp->batch need to be adjusted while the system is
running? What's the requirement?

> based on ->batch.
>
> Signed-off-by: Cody P Schafer <[email protected]>
> ---
> mm/page_alloc.c | 12 +++++++++---
> 1 file changed, 9 insertions(+), 3 deletions(-)
>
> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> index 8fcced7..5877cf0 100644
> --- a/mm/page_alloc.c
> +++ b/mm/page_alloc.c
> @@ -4004,6 +4004,14 @@ static int __meminit zone_batchsize(struct zone *zone)
> #endif
> }
>
> +/* a companion to setup_pagelist_highmark() */
> +static void pageset_set_batch(struct per_cpu_pageset *p, unsigned long batch)
> +{
> + struct per_cpu_pages *pcp = &p->pcp;
> + pcp->high = 6 * batch;
> + pcp->batch = max(1UL, 1 * batch);
> +}
> +
> static void setup_pageset(struct per_cpu_pageset *p, unsigned long batch)
> {
> struct per_cpu_pages *pcp;
> @@ -4013,8 +4021,7 @@ static void setup_pageset(struct per_cpu_pageset *p, unsigned long batch)
>
> pcp = &p->pcp;
> pcp->count = 0;
> - pcp->high = 6 * batch;
> - pcp->batch = max(1UL, 1 * batch);
> + pageset_set_batch(p, batch);
> for (migratetype = 0; migratetype < MIGRATE_PCPTYPES; migratetype++)
> INIT_LIST_HEAD(&pcp->lists[migratetype]);
> }
> @@ -4023,7 +4030,6 @@ static void setup_pageset(struct per_cpu_pageset *p, unsigned long batch)
> * setup_pagelist_highmark() sets the high water mark for hot per_cpu_pagelist
> * to the value high for the pageset p.
> */
> -
> static void setup_pagelist_highmark(struct per_cpu_pageset *p,
> unsigned long high)
> {

2013-04-07 01:56:26

by Simon Jeons

[permalink] [raw]
Subject: Re: [PATCH 3/3] mm: when handling percpu_pagelist_fraction, use on_each_cpu() to set percpu pageset fields.

Hi Cody,
On 04/06/2013 04:33 AM, Cody P Schafer wrote:
> In free_hot_cold_page(), we rely on pcp->batch remaining stable.
> Updating it without being on the cpu owning the percpu pageset
> potentially destroys this stability.

If a cpu is off, can its pcp pageset be used in free_hot_cold_page()?

>
> Change for_each_cpu() to on_each_cpu() to fix.
>
> Signed-off-by: Cody P Schafer <[email protected]>
> ---
> mm/page_alloc.c | 21 +++++++++++----------
> 1 file changed, 11 insertions(+), 10 deletions(-)
>
> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> index 48f2faa..507db31 100644
> --- a/mm/page_alloc.c
> +++ b/mm/page_alloc.c
> @@ -5475,30 +5475,31 @@ int lowmem_reserve_ratio_sysctl_handler(ctl_table *table, int write,
> return 0;
> }
>
> +static void _zone_set_pageset_highmark(void *data)
> +{
> + struct zone *zone = data;
> + unsigned long high;
> + high = zone->managed_pages / percpu_pagelist_fraction;
> + setup_pagelist_highmark(
> + per_cpu_ptr(zone->pageset, smp_processor_id()), high);
> +}
> +
> /*
> * percpu_pagelist_fraction - changes the pcp->high for each zone on each
> * cpu. It is the fraction of total pages in each zone that a hot per cpu pagelist
> * can have before it gets flushed back to buddy allocator.
> */
> -
> int percpu_pagelist_fraction_sysctl_handler(ctl_table *table, int write,
> void __user *buffer, size_t *length, loff_t *ppos)
> {
> struct zone *zone;
> - unsigned int cpu;
> int ret;
>
> ret = proc_dointvec_minmax(table, write, buffer, length, ppos);
> if (!write || (ret < 0))
> return ret;
> - for_each_populated_zone(zone) {
> - for_each_possible_cpu(cpu) {
> - unsigned long high;
> - high = zone->managed_pages / percpu_pagelist_fraction;
> - setup_pagelist_highmark(
> - per_cpu_ptr(zone->pageset, cpu), high);
> - }
> - }
> + for_each_populated_zone(zone)
> + on_each_cpu(_zone_set_pageset_highmark, zone, true);
> return 0;
> }
>

2013-04-07 15:23:04

by KOSAKI Motohiro

[permalink] [raw]
Subject: Re: [PATCH 0/3] mm: fixup changers of per cpu pageset's ->high and ->batch

(4/5/13 4:33 PM), Cody P Schafer wrote:
> In one case, while modifying the ->high and ->batch fields of per cpu pagesets,
> we're needlessly using stop_machine() (patches 1 & 2); in another, we don't have
> any synchronization at all (patch 3).
>
> This patchset fixes both of them.
>
> Note that it results in a change to the behavior of zone_pcp_update(), which is
> used by memory_hotplug. I _think_ that I've discerned (and preserved) the
> essential behavior (changing ->high and ->batch), and only eliminated unneeded
> actions (draining the per cpu pages), but this may not be the case.

At least memory hotplug needs to drain.

2013-04-07 15:39:03

by KOSAKI Motohiro

[permalink] [raw]
Subject: Re: [PATCH 2/3] mm/page_alloc: convert zone_pcp_update() to use on_each_cpu() instead of stop_machine()

(4/5/13 4:33 PM), Cody P Schafer wrote:
> No off-cpu users of the percpu pagesets exist.
>
> zone_pcp_update()'s goal is to adjust the ->high and ->batch members of a
> percpu pageset based on a zone's ->managed_pages. We don't need to drain
> the entire percpu pageset just to modify these fields. Avoid calling
> setup_pageset() (and the draining required to call it) and instead just
> set the fields' values.
>
> This does change the behavior of zone_pcp_update() as the percpu
> pagesets will not be drained when zone_pcp_update() is called (they will
> end up being shrunk, not completely drained, later when a 0-order page
> is freed in free_hot_cold_page()).
>
> Signed-off-by: Cody P Schafer <[email protected]>

NAK.

1) zone_pcp_update() is only used from memory hotplug, and it requires a page drain.
2) stop_machine() is used to avoid races; just removing it is insane.

2013-04-07 15:41:45

by KOSAKI Motohiro

[permalink] [raw]
Subject: Re: [PATCH 3/3] mm: when handling percpu_pagelist_fraction, use on_each_cpu() to set percpu pageset fields.

(4/5/13 4:33 PM), Cody P Schafer wrote:
> In free_hot_cold_page(), we rely on pcp->batch remaining stable.
> Updating it without being on the cpu owning the percpu pageset
> potentially destroys this stability.
>
> Change for_each_cpu() to on_each_cpu() to fix.
>
> Signed-off-by: Cody P Schafer <[email protected]>

Acked-by: KOSAKI Motohiro <[email protected]>

2013-04-08 12:20:48

by Gilad Ben-Yossef

[permalink] [raw]
Subject: Re: [PATCH 3/3] mm: when handling percpu_pagelist_fraction, use on_each_cpu() to set percpu pageset fields.

On Fri, Apr 5, 2013 at 11:33 PM, Cody P Schafer <[email protected]> wrote:
> In free_hot_cold_page(), we rely on pcp->batch remaining stable.
> Updating it without being on the cpu owning the percpu pageset
> potentially destroys this stability.
>
> Change for_each_cpu() to on_each_cpu() to fix.

Are you referring to this? -

1329 if (pcp->count >= pcp->high) {
1330 free_pcppages_bulk(zone, pcp->batch, pcp);
1331 pcp->count -= pcp->batch;
1332 }

I'm probably missing the obvious but won't it be simpler to do this in
free_hot_cold_page() -

1329 if (pcp->count >= pcp->high) {
1330 unsigned int batch = ACCESS_ONCE(pcp->batch);
1331 free_pcppages_bulk(zone, batch, pcp);
1332 pcp->count -= batch;
1333 }

Now the batch value used is stable and you don't have to IPI every CPU
in the system just to change a config knob...

Thanks,
Gilad



--
Gilad Ben-Yossef
Chief Coffee Drinker
[email protected]
Israel Cell: +972-52-8260388
US Cell: +1-973-8260388
http://benyossef.com

"If you take a class in large-scale robotics, can you end up in a situation
where the homework eats your dog?"
-- Jean-Baptiste Queru

2013-04-08 17:17:07

by Cody P Schafer

[permalink] [raw]
Subject: Re: [PATCH 0/3] mm: fixup changers of per cpu pageset's ->high and ->batch

On 04/07/2013 08:23 AM, KOSAKI Motohiro wrote:
> (4/5/13 4:33 PM), Cody P Schafer wrote:
>> In one case, while modifying the ->high and ->batch fields of per cpu pagesets,
>> we're needlessly using stop_machine() (patches 1 & 2); in another, we don't have
>> any synchronization at all (patch 3).
>>
>> This patchset fixes both of them.
>>
>> Note that it results in a change to the behavior of zone_pcp_update(), which is
>> used by memory_hotplug. I _think_ that I've discerned (and preserved) the
>> essential behavior (changing ->high and ->batch), and only eliminated unneeded
>> actions (draining the per cpu pages), but this may not be the case.
>
> At least memory hotplug needs to drain.

Could you explain why the drain is required here? From what I can tell,
after the stop_machine() completes, the per cpu page sets could be
repopulated at any point, making the combination of draining and
modifying ->batch & ->high unneeded.

2013-04-08 17:29:18

by Cody P Schafer

[permalink] [raw]
Subject: Re: [PATCH 3/3] mm: when handling percpu_pagelist_fraction, use on_each_cpu() to set percpu pageset fields.

On 04/08/2013 05:20 AM, Gilad Ben-Yossef wrote:
> On Fri, Apr 5, 2013 at 11:33 PM, Cody P Schafer <[email protected]> wrote:
>> In free_hot_cold_page(), we rely on pcp->batch remaining stable.
>> Updating it without being on the cpu owning the percpu pageset
>> potentially destroys this stability.
>>
>> Change for_each_cpu() to on_each_cpu() to fix.
>
> Are you referring to this? -

This was the case I noticed.

>
> 1329 if (pcp->count >= pcp->high) {
> 1330 free_pcppages_bulk(zone, pcp->batch, pcp);
> 1331 pcp->count -= pcp->batch;
> 1332 }
>
> I'm probably missing the obvious but won't it be simpler to do this in
> free_hot_cold_page() -
>
> 1329 if (pcp->count >= pcp->high) {
> 1330 unsigned int batch = ACCESS_ONCE(pcp->batch);
> 1331 free_pcppages_bulk(zone, batch, pcp);
> 1332 pcp->count -= batch;
> 1333 }
>

Potentially, yes. Note that this was simply the one case I noticed,
rather than certainly the only case.

I also wonder whether there could be unexpected interactions between
->high and ->batch not changing together atomically. For example, could
adjusting this knob cause ->batch to rise enough that it is greater than
the previous ->high? If the code above then runs with the previous
->high, ->count wouldn't be correct (checking this inside
free_pcppages_bulk() might help on this one issue).

> Now the batch value used is stable and you don't have to IPI every CPU
> in the system just to change a config knob...

Is this really considered an issue? I wouldn't have expected someone to
adjust the config knob often enough (or even more than once) to cause
problems. Of course as a "It'd be nice" thing, I completely agree.

2013-04-08 17:33:10

by Cody P Schafer

[permalink] [raw]
Subject: Re: [PATCH 2/3] mm/page_alloc: convert zone_pcp_update() to use on_each_cpu() instead of stop_machine()

On 04/07/2013 08:39 AM, KOSAKI Motohiro wrote:
> (4/5/13 4:33 PM), Cody P Schafer wrote:
>> No off-cpu users of the percpu pagesets exist.
>>
>> zone_pcp_update()'s goal is to adjust the ->high and ->batch members of a
>> percpu pageset based on a zone's ->managed_pages. We don't need to drain
>> the entire percpu pageset just to modify these fields. Avoid calling
>> setup_pageset() (and the draining required to call it) and instead just
>> set the fields' values.
>>
>> This does change the behavior of zone_pcp_update() as the percpu
>> pagesets will not be drained when zone_pcp_update() is called (they will
>> end up being shrunk, not completely drained, later when a 0-order page
>> is freed in free_hot_cold_page()).
>>
>> Signed-off-by: Cody P Schafer <[email protected]>
>
> NAK.
>
>> 1) zone_pcp_update() is only used from memory hotplug, and it requires a page drain.

I'm looking at this code because I'm currently working on a patchset
which adds another interface which modifies zone sizes, so "only used
from memory hotplug" is a temporary thing (unless I discover that
zone_pcp_update() is not intended to do what I want it to do).

>> 2) stop_machine() is used to avoid races; just removing it is insane.

What race? Is there a cross cpu access to ->high & ->batch that makes
using on_each_cpu() instead of stop_machine() inappropriate? It is
absolutely not just being removed.

2013-04-08 17:35:24

by Cody P Schafer

[permalink] [raw]
Subject: Re: [PATCH 3/3] mm: when handling percpu_pagelist_fraction, use on_each_cpu() to set percpu pageset fields.

On 04/06/2013 06:56 PM, Simon Jeons wrote:
> Hi Cody,
> On 04/06/2013 04:33 AM, Cody P Schafer wrote:
>> In free_hot_cold_page(), we rely on pcp->batch remaining stable.
>> Updating it without being on the cpu owning the percpu pageset
>> potentially destroys this stability.
>
> If a cpu is off, can its pcp pageset be used in free_hot_cold_page()?
>

I don't expect it to be as we use this_cpu_ptr() to access the pcp
pageset. Unless there is some crazy mode where we can override the cpu a
task believes it is running on...

2013-04-08 17:39:48

by Cody P Schafer

[permalink] [raw]
Subject: Re: [PATCH 1/3] mm/page_alloc: factor out setting of pcp->high and pcp->batch.

On 04/06/2013 06:37 PM, Simon Jeons wrote:
> Hi Cody,
> On 04/06/2013 04:33 AM, Cody P Schafer wrote:
>> Creates pageset_set_batch() for use in setup_pageset().
>> pageset_set_batch() imitates the functionality of
>> setup_pagelist_highmark(), but uses the boot time
>> (percpu_pagelist_fraction == 0) calculations for determining ->high
>
> Why do pcp->high and pcp->batch need to be adjusted while the system is
> running? What's the requirement?
>

There is currently a sysctl (which I patch later in this series) which
allows adjusting the ->high mark (and, indirectly, ->batch).
Additionally, memory hotplug changes ->high and ->batch due to the zone
size changing (essentially, zone->managed_pages and zone->present_pages
have changed), meaning that zone_batchsize(), which is used at boot to
set ->batch and (indirectly) ->high, has a different output.

Note that in addition to the 2 users of this functionality mentioned
here, I'm currently working on another resizer of zones (runtime NUMA
reconfiguration).

2013-04-08 19:13:42

by KOSAKI Motohiro

[permalink] [raw]
Subject: Re: [PATCH 0/3] mm: fixup changers of per cpu pageset's ->high and ->batch

(4/8/13 1:16 PM), Cody P Schafer wrote:
> On 04/07/2013 08:23 AM, KOSAKI Motohiro wrote:
>> (4/5/13 4:33 PM), Cody P Schafer wrote:
>>> In one case, while modifying the ->high and ->batch fields of per cpu pagesets,
>>> we're needlessly using stop_machine() (patches 1 & 2); in another, we don't have
>>> any synchronization at all (patch 3).
>>>
>>> This patchset fixes both of them.
>>>
>>> Note that it results in a change to the behavior of zone_pcp_update(), which is
>>> used by memory_hotplug. I _think_ that I've discerned (and preserved) the
>>> essential behavior (changing ->high and ->batch), and only eliminated unneeded
>>> actions (draining the per cpu pages), but this may not be the case.
>>
>> At least memory hotplug needs to drain.
>
> Could you explain why the drain is required here? From what I can tell,
> after the stop_machine() completes, the per cpu page sets could be
> repopulated at any point, making the combination of draining and
> modifying ->batch & ->high unneeded.

Then memory hotplug tries to drain again and again. Moreover, hotplug
prevents repopulation by using MIGRATE_ISOLATE. The pcp's page count
never reaches 0, and that prevents memory hot remove.

2013-04-08 19:17:01

by KOSAKI Motohiro

[permalink] [raw]
Subject: Re: [PATCH 3/3] mm: when handling percpu_pagelist_fraction, use on_each_cpu() to set percpu pageset fields.

(4/8/13 8:20 AM), Gilad Ben-Yossef wrote:
> On Fri, Apr 5, 2013 at 11:33 PM, Cody P Schafer <[email protected]> wrote:
>> In free_hot_cold_page(), we rely on pcp->batch remaining stable.
>> Updating it without being on the cpu owning the percpu pageset
>> potentially destroys this stability.
>>
>> Change for_each_cpu() to on_each_cpu() to fix.
>
> Are you referring to this? -
>
> 1329 if (pcp->count >= pcp->high) {
> 1330 free_pcppages_bulk(zone, pcp->batch, pcp);
> 1331 pcp->count -= pcp->batch;
> 1332 }
>
> I'm probably missing the obvious but won't it be simpler to do this in
> free_hot_cold_page() -
>
> 1329 if (pcp->count >= pcp->high) {
> 1330 unsigned int batch = ACCESS_ONCE(pcp->batch);
> 1331 free_pcppages_bulk(zone, batch, pcp);
> 1332 pcp->count -= batch;
> 1333 }
>
> Now the batch value used is stable and you don't have to IPI every CPU
> in the system just to change a config knob...

OK, right. Your approach is much better.

2013-04-08 19:26:59

by KOSAKI Motohiro

[permalink] [raw]
Subject: Re: [PATCH 2/3] mm/page_alloc: convert zone_pcp_update() to use on_each_cpu() instead of stop_machine()

(4/8/13 1:32 PM), Cody P Schafer wrote:
> On 04/07/2013 08:39 AM, KOSAKI Motohiro wrote:
>> (4/5/13 4:33 PM), Cody P Schafer wrote:
>>> No off-cpu users of the percpu pagesets exist.
>>>
>>> zone_pcp_update()'s goal is to adjust the ->high and ->batch members of a
>>> percpu pageset based on a zone's ->managed_pages. We don't need to drain
>>> the entire percpu pageset just to modify these fields. Avoid calling
>>> setup_pageset() (and the draining required to call it) and instead just
>>> set the fields' values.
>>>
>>> This does change the behavior of zone_pcp_update() as the percpu
>>> pagesets will not be drained when zone_pcp_update() is called (they will
>>> end up being shrunk, not completely drained, later when a 0-order page
>>> is freed in free_hot_cold_page()).
>>>
>>> Signed-off-by: Cody P Schafer <[email protected]>
>>
>> NAK.
>>
>>> 1) zone_pcp_update() is only used from memory hotplug, and it requires a page drain.
>
> I'm looking at this code because I'm currently working on a patchset
> which adds another interface which modifies zone sizes, so "only used
> from memory hotplug" is a temporary thing (unless I discover that
> zone_pcp_update() is not intended to do what I want it to do).

Maybe yes, maybe no; I don't know whether it's temporary. However, the fact
is you must not break anything. You always need to look at all callers.


>>> 2) stop_machine() is used to avoid races; just removing it is insane.
>
> What race? Is there a cross cpu access to ->high & ->batch that makes
> using on_each_cpu() instead of stop_machine() inappropriate? It is
> absolutely not just being removed.

OK, I missed that. However, your code is still wrong: you can't call
free_pcppages_bulk() from interrupt context, and therefore you can't use
on_each_cpu() anyway.

2013-04-08 19:37:36

by Cody P Schafer

[permalink] [raw]
Subject: Re: [PATCH 0/3] mm: fixup changers of per cpu pageset's ->high and ->batch

On 04/06/2013 06:32 PM, Simon Jeons wrote:
> Hi Cody,
> On 04/06/2013 04:33 AM, Cody P Schafer wrote:
>> In one case, while modifying the ->high and ->batch fields of per cpu
>> pagesets, we're needlessly using stop_machine() (patches 1 & 2), and in
>> another we don't have any synchronization at all (patch 3).
>
> Do you mean stop_machine() is used for synchronization between each
> online cpu?
>

I mean that it looks like synchronization between cpus is unneeded
because of how per cpu pagesets are used, so stop_machine() (which does
provide synchronization between all cpus) is unnecessarily "strong".

2013-04-08 19:49:47

by Cody P Schafer

[permalink] [raw]
Subject: Re: [PATCH 2/3] mm/page_alloc: convert zone_pcp_update() to use on_each_cpu() instead of stop_machine()

On 04/08/2013 12:26 PM, KOSAKI Motohiro wrote:
> (4/8/13 1:32 PM), Cody P Schafer wrote:
>> On 04/07/2013 08:39 AM, KOSAKI Motohiro wrote:
>>> (4/5/13 4:33 PM), Cody P Schafer wrote:
>>>> No off-cpu users of the percpu pagesets exist.
>>>>
>>>> zone_pcp_update()'s goal is to adjust the ->high and ->batch members of a
>>>> percpu pageset based on a zone's ->managed_pages. We don't need to drain
>>>> the entire percpu pageset just to modify these fields. Avoid calling
>>>> setup_pageset() (and the draining required to call it) and instead just
>>>> set the fields' values.
>>>>
>>>> This does change the behavior of zone_pcp_update() as the percpu
>>>> pagesets will not be drained when zone_pcp_update() is called (they will
>>>> end up being shrunk, not completely drained, later when a 0-order page
>>>> is freed in free_hot_cold_page()).
>>>>
>>>> Signed-off-by: Cody P Schafer <[email protected]>
>>>
>>> NAK.
>>>
>>> 1) zone_pcp_update() is only used from memory hotplug, and it requires a page drain.
>>
>> I'm looking at this code because I'm currently working on a patchset
>> which adds another interface which modifies zone sizes, so "only used
>> from memory hotplug" is a temporary thing (unless I discover that
>> zone_pcp_update() is not intended to do what I want it to do).
>
> Maybe yes, maybe no; I don't know whether it's temporary. However, the fact
> is you must not break anything. You always need to look at all callers.

Right, which is why I want to understand memory hotplug's actual
requirements.

>>> 2) stop_machine() is used to avoid races; just removing it is insane.
>>
>> What race? Is there a cross cpu access to ->high & ->batch that makes
>> using on_each_cpu() instead of stop_machine() inappropriate? It is
>> absolutely not just being removed.
>
> OK, I missed that. However, your code is still wrong: you can't call
> free_pcppages_bulk() from interrupt context, and therefore you can't use
> on_each_cpu() anyway.

Given drain_pages()'s implementation, I find that hard to believe (it uses
on_each_cpu_mask() and eventually calls free_pcppages_bulk()).

Can you provide a reference backing up your statement?

If this turns out to be an issue, schedule_on_each_cpu() could be an
alternative.

2013-04-08 19:50:46

by Cody P Schafer

[permalink] [raw]
Subject: Re: [PATCH 3/3] mm: when handling percpu_pagelist_fraction, use on_each_cpu() to set percpu pageset fields.

On 04/08/2013 10:28 AM, Cody P Schafer wrote:
> On 04/08/2013 05:20 AM, Gilad Ben-Yossef wrote:
>> On Fri, Apr 5, 2013 at 11:33 PM, Cody P Schafer
>> <[email protected]> wrote:
>>> In free_hot_cold_page(), we rely on pcp->batch remaining stable.
>>> Updating it without being on the cpu owning the percpu pageset
>>> potentially destroys this stability.
>>>
>>> Change for_each_cpu() to on_each_cpu() to fix.
>>
>> Are you referring to this? -
>
> This was the case I noticed.
>
>>
>> 1329 if (pcp->count >= pcp->high) {
>> 1330 free_pcppages_bulk(zone, pcp->batch, pcp);
>> 1331 pcp->count -= pcp->batch;
>> 1332 }
>>
>> I'm probably missing the obvious but won't it be simpler to do this in
>> free_hot_cold_page() -
>>
>> 1329 if (pcp->count >= pcp->high) {
>> 1330 unsigned int batch = ACCESS_ONCE(pcp->batch);
>> 1331 free_pcppages_bulk(zone, batch, pcp);
>> 1332 pcp->count -= batch;
>> 1333 }
>>
>
> Potentially, yes. Note that this was simply the one case I noticed,
> rather than certainly the only case.
>
> I also wonder whether there could be unexpected interactions between
> ->high and ->batch not changing together atomically. For example, could
> adjusting this knob cause ->batch to rise enough that it is greater than
> the previous ->high? If the code above then runs with the previous
> ->high, ->count wouldn't be correct (checking this inside
> free_pcppages_bulk() might help on this one issue).
>
>> Now the batch value used is stable and you don't have to IPI every CPU
>> in the system just to change a config knob...
>
> Is this really considered an issue? I wouldn't have expected someone to
> adjust the config knob often enough (or even more than once) to cause
> problems. Of course as a "It'd be nice" thing, I completely agree.

Would using schedule_on_each_cpu() instead of on_each_cpu() be an
improvement, in your opinion?

2013-04-08 21:17:53

by Cody P Schafer

[permalink] [raw]
Subject: Re: [PATCH 0/3] mm: fixup changers of per cpu pageset's ->high and ->batch

On 04/08/2013 12:08 PM, KOSAKI Motohiro wrote:
> (4/8/13 1:16 PM), Cody P Schafer wrote:
>> On 04/07/2013 08:23 AM, KOSAKI Motohiro wrote:
>>> (4/5/13 4:33 PM), Cody P Schafer wrote:
>>>> In one case, while modifying the ->high and ->batch fields of per cpu pagesets,
>>>> we're needlessly using stop_machine() (patches 1 & 2); in another, we don't have
>>>> any synchronization at all (patch 3).
>>>>
>>>> This patchset fixes both of them.
>>>>
>>>> Note that it results in a change to the behavior of zone_pcp_update(), which is
>>>> used by memory_hotplug. I _think_ that I've discerned (and preserved) the
>>>> essential behavior (changing ->high and ->batch), and only eliminated unneeded
>>>> actions (draining the per cpu pages), but this may not be the case.
>>>
>>> at least, memory hotplug need to drain.
>>
>> Could you explain why the drain is required here? From what I can tell,
>> after the stop_machine() completes, the per cpu page sets could be
>> repopulated at any point, making the combination of draining and
>> modifying ->batch & ->high unneeded.
>
> Then memory hotplug tries to drain again and again. Moreover, hotplug prevents
> repopulation by using MIGRATE_ISOLATE.
> The pcp page count would never reach 0, and that prevents memory hot-remove.

zone_pcp_update() is not part of the drain retry loop; in the offline
pages case it is only called after "removal" succeeds (online pages
also calls zone_pcp_update(), and I don't see how it could benefit in
any way from draining the per cpu pagesets).

So I still don't see where the need for draining pages combined with
modifying ->high and ->batch came from.

2013-04-08 22:18:51

by KOSAKI Motohiro

[permalink] [raw]
Subject: Re: [PATCH 2/3] mm/page_alloc: convert zone_pcp_update() to use on_each_cpu() instead of stop_machine()

(4/8/13 3:49 PM), Cody P Schafer wrote:
> On 04/08/2013 12:26 PM, KOSAKI Motohiro wrote:
>> (4/8/13 1:32 PM), Cody P Schafer wrote:
>>> On 04/07/2013 08:39 AM, KOSAKI Motohiro wrote:
>>>> (4/5/13 4:33 PM), Cody P Schafer wrote:
>>>>> No off-cpu users of the percpu pagesets exist.
>>>>>
>>>>> zone_pcp_update()'s goal is to adjust the ->high and ->batch members of a
>>>>> percpu pageset based on a zone's ->managed_pages. We don't need to drain
>>>>> the entire percpu pageset just to modify these fields. Avoid calling
>>>>> setup_pageset() (and the draining required to call it) and instead just
>>>>> set the fields' values.
>>>>>
>>>>> This does change the behavior of zone_pcp_update() as the percpu
>>>>> pagesets will not be drained when zone_pcp_update() is called (they will
>>>>> end up being shrunk, not completely drained, later when a 0-order page
>>>>> is freed in free_hot_cold_page()).
>>>>>
>>>>> Signed-off-by: Cody P Schafer <[email protected]>
>>>>
>>>> NAK.
>>>>
>>>> 1) zone_pcp_update() is only used from memory hotplug and it require page drain.
>>>
>>> I'm looking at this code because I'm currently working on a patchset
>>> which adds another interface which modifies zone sizes, so "only used
>>> from memory hotplug" is a temporary thing (unless I discover that
>>> zone_pcp_update() is not intended to do what I want it to do).
>>
>> Maybe yes, maybe no. I don't know whether it is temporary or not. However,
>> the fact is you must not break anything. You always need to look at all callers.
>
> Right, which is why I want to understand memory hotplug's actual
> requirements.
>
>>>> 2) stop_machin is used for avoiding race. just removing it is insane.
>>>
>>> What race? Is there a cross cpu access to ->high & ->batch that makes
>>> using on_each_cpu() instead of stop_machine() inappropriate? It is
>>> absolutely not just being removed.
>>
>> OK, I missed that. However, your code is still wrong:
>> you can't call free_pcppages_bulk() from interrupt context, and
>> therefore you can't use on_each_cpu() anyway.
>
> Given drain_pages() implementation, I find that hard to believe (It uses
> on_each_cpu_mask() and eventually calls free_pcppages_bulk()).
>
> Can you provide a reference backing up your statement?

Grr. I missed again. OK you are right. go ahead.



> If this turns out to be an issue, schedule_on_each_cpu() could be an
> alternative.

No way. schedule_on_each_cpu() is more problematic and should be removed
in the future.
schedule_on_each_cpu() can only be used when the caller task doesn't hold
any locks; otherwise it may deadlock.






2013-04-08 22:23:50

by KOSAKI Motohiro

[permalink] [raw]
Subject: Re: [PATCH 3/3] mm: when handling percpu_pagelist_fraction, use on_each_cpu() to set percpu pageset fields.

(4/8/13 3:50 PM), Cody P Schafer wrote:
> On 04/08/2013 10:28 AM, Cody P Schafer wrote:
>> On 04/08/2013 05:20 AM, Gilad Ben-Yossef wrote:
>>> On Fri, Apr 5, 2013 at 11:33 PM, Cody P Schafer
>>> <[email protected]> wrote:
>>>> In free_hot_cold_page(), we rely on pcp->batch remaining stable.
>>>> Updating it without being on the cpu owning the percpu pageset
>>>> potentially destroys this stability.
>>>>
>>>> Change for_each_cpu() to on_each_cpu() to fix.
>>>
>>> Are you referring to this? -
>>
>> This was the case I noticed.
>>
>>>
>>> 1329 if (pcp->count >= pcp->high) {
>>> 1330 free_pcppages_bulk(zone, pcp->batch, pcp);
>>> 1331 pcp->count -= pcp->batch;
>>> 1332 }
>>>
>>> I'm probably missing the obvious but won't it be simpler to do this in
>>> free_hot_cold_page() -
>>>
>>> 1329 if (pcp->count >= pcp->high) {
>>> 1330 unsigned int batch = ACCESS_ONCE(pcp->batch);
>>> 1331 free_pcppages_bulk(zone, batch, pcp);
>>> 1332 pcp->count -= batch;
>>> 1333 }
>>>
>>
>> Potentially, yes. Note that this was simply the one case I noticed,
>> rather than certainly the only case.
>>
>> I also wonder whether there could be unexpected interactions between
>> ->high and ->batch not changing together atomically. For example, could
>> adjusting this knob cause ->batch to rise enough that it is greater than
>> the previous ->high? If the code above then runs with the previous
>> ->high, ->count wouldn't be correct (checking this inside
>> free_pcppages_bulk() might help on this one issue).
>>
>>> Now the batch value used is stable and you don't have to IPI every CPU
>>> in the system just to change a config knob...
>>
>> Is this really considered an issue? I wouldn't have expected someone to
>> adjust the config knob often enough (or even more than once) to cause
>> problems. Of course as a "It'd be nice" thing, I completely agree.
>
> Would using schedule_on_each_cpu() instead of on_each_cpu() be an
> improvement, in your opinion?

No. As long as a lightweight solution works, we should never introduce
heavyweight code. on_each_cpu() is really heavyweight, especially when the
number of cpus is more than a thousand.

2013-04-09 01:53:05

by Cody P Schafer

[permalink] [raw]
Subject: Re: [PATCH 2/3] mm/page_alloc: convert zone_pcp_update() to use on_each_cpu() instead of stop_machine()

On 04/08/2013 03:18 PM, KOSAKI Motohiro wrote:
> (4/8/13 3:49 PM), Cody P Schafer wrote:>
>> If this turns out to be an issue, schedule_on_each_cpu() could be an
>> alternative.
>
> No way. schedule_on_each_cpu() is more problematic and should be removed
> in the future.
> schedule_on_each_cpu() can only be used when the caller task doesn't hold
> any locks; otherwise it may deadlock.

I wasn't aware of that. Just to be clear, the deadlock you're referring
to isn't the same one referred to in

commit b71ab8c2025caef8db719aa41af0ed735dc543cd
Author: Tejun Heo <[email protected]>
Date: Tue Jun 29 10:07:14 2010 +0200
workqueue: increase max_active of keventd and kill current_is_keventd()

and

commit 65a64464349883891e21e74af16c05d6e1eeb4e9
Author: Andi Kleen <[email protected]>
Date: Wed Oct 14 06:22:47 2009 +0200
HWPOISON: Allow schedule_on_each_cpu() from keventd

If you're referencing some other deadlock, could you please provide a
link to the relevant discussion? (I'd really like to add a note to
schedule_on_each_cpu()'s doc comment about it so others can avoid that
pitfall).

2013-04-09 05:42:58

by Simon Jeons

[permalink] [raw]
Subject: Re: [PATCH 1/3] mm/page_alloc: factor out setting of pcp->high and pcp->batch.

Hi Cody,
On 04/09/2013 01:39 AM, Cody P Schafer wrote:
> On 04/06/2013 06:37 PM, Simon Jeons wrote:
>> Hi Cody,
>> On 04/06/2013 04:33 AM, Cody P Schafer wrote:
>>> Creates pageset_set_batch() for use in setup_pageset().
>>> pageset_set_batch() imitates the functionality of
>>> setup_pagelist_highmark(), but uses the boot time
>>> (percpu_pagelist_fraction == 0) calculations for determining ->high
>>
>> Why do we need to adjust pcp->high and pcp->batch while the system is
>> running? What's the requirement?
>>
>
> There is currently a sysctl (which I patch later in this series) which
> allows adjusting the ->high mark (and, indirectly, ->batch).
> Additionally, memory hotplug changes ->high and ->batch due to the
> zone size changing (essentially, zone->managed_pages and
> zone->present_pages have changed), meaning that zone_batchsize(),
> which is used at boot to set ->batch and (indirectly) ->high has a
> different output.

Thanks for your explanation. I'm curious about this sysctl: when does
->high/->batch need adjusting while the system is running, other than for
memory hotplug, which changes the zone size?

>
> Note that in addition to the 2 users of this functionality mentioned
> here, I'm currently working on another resizer of zones (runtime NUMA
> reconfiguration).
>

2013-04-09 06:03:07

by Gilad Ben-Yossef

[permalink] [raw]
Subject: Re: [PATCH 3/3] mm: when handling percpu_pagelist_fraction, use on_each_cpu() to set percpu pageset fields.

On Mon, Apr 8, 2013 at 8:28 PM, Cody P Schafer <[email protected]> wrote:
> On 04/08/2013 05:20 AM, Gilad Ben-Yossef wrote:
>>
>> On Fri, Apr 5, 2013 at 11:33 PM, Cody P Schafer <[email protected]>
>> wrote:
>>>
>>> In free_hot_cold_page(), we rely on pcp->batch remaining stable.
>>> Updating it without being on the cpu owning the percpu pageset
>>> potentially destroys this stability.
>>>
>>> Change for_each_cpu() to on_each_cpu() to fix.
>>
>>
>> Are you referring to this? -
>
>
> This was the case I noticed.
>
>
>>
>> 1329 if (pcp->count >= pcp->high) {
>> 1330 free_pcppages_bulk(zone, pcp->batch, pcp);
>> 1331 pcp->count -= pcp->batch;
>> 1332 }
>>
>> I'm probably missing the obvious but won't it be simpler to do this in
>> free_hot_cold_page() -
>>
>> 1329 if (pcp->count >= pcp->high) {
>> 1330 unsigned int batch = ACCESS_ONCE(pcp->batch);
>> 1331 free_pcppages_bulk(zone, batch, pcp);
>> 1332 pcp->count -= batch;
>> 1333 }
>>
>
> Potentially, yes. Note that this was simply the one case I noticed, rather
> than certainly the only case.

OK, so perhaps the right thing to do is to understand what are (some of) the
other cases so that we may choose the right solution.

> I also wonder whether there could be unexpected interactions between ->high
> and ->batch not changing together atomically. For example, could adjusting
> this knob cause ->batch to rise enough that it is greater than the previous
> ->high? If the code above then runs with the previous ->high, ->count
> wouldn't be correct (checking this inside free_pcppages_bulk() might help on
> this one issue).

You are right, but that can be treated in setup_pagelist_highmark() e.g.:

3993 static void setup_pagelist_highmark(struct per_cpu_pageset *p,
3994 unsigned long high)
3995 {
3996 struct per_cpu_pages *pcp;
unsigned int batch;
3997
3998 pcp = &p->pcp;
                 /* We're about to mess with the PCP in a non-atomic fashion.
                    Put an intermediate safe value of batch in place and make
                    sure it is visible before any other change */
                 pcp->batch = 1UL;
                 smp_mb();

3999 pcp->high = high;

4000 batch = max(1UL, high/4);
4001 if ((high/4) > (PAGE_SHIFT * 8))
4002 batch = PAGE_SHIFT * 8;

pcp->batch = batch;
4003 }

Or we could use RCU here, but that might be overkill.
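
The same publication order can be sketched with C11 atomics. This is an
illustrative stand-in only: the type and function names are hypothetical,
the seq_cst stores play the role of the smp_mb() calls in the sketch
above, and the PAGE_SHIFT-based cap on batch is omitted for brevity.

```c
#include <stdatomic.h>

/* Simplified stand-in for the per-cpu pageset fields being published. */
struct pcp_pub {
	atomic_uint high;
	atomic_uint batch;
};

/* Publish a new high/batch pair without ever exposing a window where
 * batch exceeds the visible high: park batch at the safe floor of 1,
 * write the new high, then install the final batch.  The seq_cst
 * stores keep the three writes globally ordered. */
static void set_highmark_sim(struct pcp_pub *p, unsigned int high)
{
	unsigned int batch = high / 4;

	if (batch < 1)
		batch = 1;

	atomic_store(&p->batch, 1);	/* safe intermediate value */
	atomic_store(&p->high, high);	/* ordered after the store above */
	atomic_store(&p->batch, batch);	/* final value, ordered last */
}
```

A reader taking a snapshot of batch at any point in this sequence sees
either the old batch, the safe floor of 1, or the new batch, never a
value larger than the high mark it observes afterwards.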

>
>
>> Now the batch value used is stable and you don't have to IPI every CPU
>> in the system just to change a config knob...
>
>
> Is this really considered an issue? I wouldn't have expected someone to
> adjust the config knob often enough (or even more than once) to cause
> problems. Of course as a "It'd be nice" thing, I completely agree.

Well, interfering unconditionally with other CPUs either via IPIs or
scheduling work
on them is a major headache for people that run work on machines with 4k CPUs,
especially the HPC or RT or combos from the finance and networking
users.

If this was the only little knob or trigger that does this, then maybe it
wouldn't be so bad, but the problem is there is a list of these little
knobs and items that potentially cause cross-machine interference, and the
poor sysadmin has to keep them all in his or her head: "Now, is it ok to
pull this knob now, or will it cause an IPI s**t storm?"

We can never get rid of them all, but I'd really prefer to keep them down
to a minimum if at all possible. Here, it looks to me that it is possible
and that the price is not great - that is, the resulting code is not too
hairy or unmaintainable. At least, that is how it looks to me.

Thanks,
Gilad

--
Gilad Ben-Yossef
Chief Coffee Drinker
[email protected]
Israel Cell: +972-52-8260388
US Cell: +1-973-8260388
http://benyossef.com

"If you take a class in large-scale robotics, can you end up in a
situation where the homework eats your dog?"
-- Jean-Baptiste Queru

2013-04-09 06:06:19

by Gilad Ben-Yossef

[permalink] [raw]
Subject: Re: [PATCH 3/3] mm: when handling percpu_pagelist_fraction, use on_each_cpu() to set percpu pageset fields.

On Tue, Apr 9, 2013 at 9:03 AM, Gilad Ben-Yossef <[email protected]> wrote:

>
>> I also wonder whether there could be unexpected interactions between ->high
>> and ->batch not changing together atomically. For example, could adjusting
>> this knob cause ->batch to rise enough that it is greater than the previous
>> ->high? If the code above then runs with the previous ->high, ->count
>> wouldn't be correct (checking this inside free_pcppages_bulk() might help on
>> this one issue).
>
> You are right, but that can be treated in setup_pagelist_highmark() e.g.:
>
> 3993 static void setup_pagelist_highmark(struct per_cpu_pageset *p,
> 3994 unsigned long high)
> 3995 {
> 3996 struct per_cpu_pages *pcp;
> unsigned int batch;
> 3997
> 3998 pcp = &p->pcp;
> /* We're about to mess with the PCP in a non-atomic fashion.
> Put an intermediate safe value of batch in place and make sure
> it is visible before any other change */
> pcp->batch = 1UL;
> smp_mb();
>
> 3999 pcp->high = high;

and I think I missed another needed barrier here:
smp_mb();

>
> 4000 batch = max(1UL, high/4);
> 4001 if ((high/4) > (PAGE_SHIFT * 8))
> 4002 batch = PAGE_SHIFT * 8;
>
> pcp->batch = batch;
> 4003 }
>

--
Gilad Ben-Yossef
Chief Coffee Drinker
[email protected]
Israel Cell: +972-52-8260388
US Cell: +1-973-8260388
http://benyossef.com

"If you take a class in large-scale robotics, can you end up in a
situation where the homework eats your dog?"
-- Jean-Baptiste Queru

2013-04-09 19:27:56

by Cody P Schafer

[permalink] [raw]
Subject: Re: [PATCH 3/3] mm: when handling percpu_pagelist_fraction, use on_each_cpu() to set percpu pageset fields.

On 04/08/2013 11:06 PM, Gilad Ben-Yossef wrote:
> On Tue, Apr 9, 2013 at 9:03 AM, Gilad Ben-Yossef <[email protected]> wrote:
>
>>
>>> I also wonder whether there could be unexpected interactions between ->high
>>> and ->batch not changing together atomically. For example, could adjusting
>>> this knob cause ->batch to rise enough that it is greater than the previous
>>> ->high? If the code above then runs with the previous ->high, ->count
>>> wouldn't be correct (checking this inside free_pcppages_bulk() might help on
>>> this one issue).
>>
>> You are right, but that can be treated in setup_pagelist_highmark() e.g.:
>>
>> 3993 static void setup_pagelist_highmark(struct per_cpu_pageset *p,
>> 3994 unsigned long high)
>> 3995 {
>> 3996 struct per_cpu_pages *pcp;
>> unsigned int batch;
>> 3997
>> 3998 pcp = &p->pcp;
>> /* We're about to mess with the PCP in a non-atomic fashion.
>> Put an intermediate safe value of batch in place and make sure
>> it is visible before any other change */
>> pcp->batch = 1UL;
>> smp_mb();
>>
>> 3999 pcp->high = high;
>
> and I think I missed another needed barrier here:
> smp_mb();
>
>>
>> 4000 batch = max(1UL, high/4);
>> 4001 if ((high/4) > (PAGE_SHIFT * 8))
>> 4002 batch = PAGE_SHIFT * 8;
>>
>> pcp->batch = batch;
>> 4003 }
>>
>

Yep, that appears to work, provided no additional users of ->batch and
->high show up. It seems we'll also need some locking to prevent
concurrent updaters, but that is relatively lightweight.

I'll roll up a new patchset that uses this methodology.
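
The updater-side locking mentioned above could be as simple as one mutex
around the whole publication sequence. A minimal pthread sketch follows;
the names are hypothetical and the kernel would of course use its own
mutex primitives rather than pthreads.

```c
#include <pthread.h>

/* Hypothetical shared configuration guarded by a single update lock. */
struct pcp_cfg {
	unsigned int high;
	unsigned int batch;
};

static pthread_mutex_t pcp_update_lock = PTHREAD_MUTEX_INITIALIZER;

/* Serialize updaters: the multi-step batch/high publication sequence
 * must not interleave between two concurrent writers, or one writer's
 * "safe intermediate" batch could be overwritten mid-sequence. */
static void pcp_update_cfg(struct pcp_cfg *cfg, unsigned int high)
{
	unsigned int batch = high / 4;

	if (batch < 1)
		batch = 1;

	pthread_mutex_lock(&pcp_update_lock);
	cfg->batch = 1;		/* safe intermediate value */
	cfg->high = high;
	cfg->batch = batch;
	pthread_mutex_unlock(&pcp_update_lock);
}
```

Readers are unaffected by the mutex; it only prevents two writers from
interleaving their three-store sequences.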