2012-06-19 18:32:23

by Aaditya Kumar

[permalink] [raw]
Subject: [RFC][RT][PATCH] mm: Do not use stop_machine() for __zone_pcp_update() for CONFIG_PREEMPT_RT_FULL

The code path of __zone_pcp_update() takes the following locks, which with
CONFIG_PREEMPT_RT_FULL=y are rt-mutexes:
 - pa_lock, taken by cpu_lock_irqsave()
 - zone->lock, taken by free_pcppages_bulk()

Since __zone_pcp_update() is called from stop_machine(), with
CONFIG_PREEMPT_RT_FULL=y we get the following backtrace when
__zone_pcp_update() is called during memory hot-plugging while doing
heavy file I/O.

I think stop_machine() may not be required for calling __zone_pcp_update()
in the case of CONFIG_PREEMPT_RT_FULL=y, as acquiring pa_lock in
__zone_pcp_update() should be sufficient to isolate pcp pages and to set up
the per-cpu pagesets.

Can someone please let me know if I am missing anything here?


The backtrace that this patch fixes:
BUG: scheduling while atomic: migration/0/7/0x00000002
Modules linked in: v2p
Backtrace:
[<800111a0>] (dump_backtrace+0x0/0x10c) from [<802d7b7c>]
(dump_stack+0x18/0x1c)
r6:80c8fc28 r5:80c8f9a0 r4:00000000 r3:60000013
[<802d7b64>] (dump_stack+0x0/0x1c) from [<8001e81c>] (__schedule_bug+0x64/0x74)
[<8001e7b8>] (__schedule_bug+0x0/0x74) from [<802d7fa0>]
(__schedule+0x68/0x604)
r4:8051bf00 r3:00000000
[<802d7f38>] (__schedule+0x0/0x604) from [<802d8a78>] (schedule+0x98/0xbc)
[<802d89e0>] (schedule+0x0/0xbc) from [<802d9e14>]
(rt_spin_lock_slowlock+0x168/0x240)
r4:805228f4 r3:00000000
[<802d9cac>] (rt_spin_lock_slowlock+0x0/0x240) from [<802da234>]
(rt_spin_lock+0x10/0x14)
[<802da224>] (rt_spin_lock+0x0/0x14) from [<8008694c>]
(__zone_pcp_update+0x58/0xd8)
[<800868f4>] (__zone_pcp_update+0x0/0xd8) from [<800603ec>]
(stop_machine_cpu_stop+0xb0/0x104)
[<8006033c>] (stop_machine_cpu_stop+0x0/0x104) from [<80060200>]
(cpu_stopper_thread+0xd4/0x188)


Signed-off-by: Aaditya Kumar <[email protected]>

---
mm/page_alloc.c | 4 4 + 0 - 0 !
1 file changed, 4 insertions(+)

Index: b/mm/page_alloc.c
===================================================================
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -3868,7 +3868,11 @@ static int __zone_pcp_update(void *data)

void zone_pcp_update(struct zone *zone)
{
+#ifndef CONFIG_PREEMPT_RT_FULL
stop_machine(__zone_pcp_update, zone, NULL);
+#else
+ __zone_pcp_update(zone);
+#endif
}

static __meminit void zone_pcp_init(struct zone *zone)


2012-06-19 19:47:35

by Frank Rowand

[permalink] [raw]
Subject: Re: [RFC][RT][PATCH] mm: Do not use stop_machine() for __zone_pcp_update() for CONFIG_PREEMPT_RT_FULL

Adding to the distribution list: the author of the patch that included
the call to stop_machine() (commit 112067f0905b2de862c607ee62411cf47d2fe5c4).

On 06/19/12 11:32, Aaditya Kumar wrote:
> The code path of __zone_pcp_update() has following locks, which in
> CONFIG_PREEMPT_RT_FULL=y are rt-mutex.
> - pa_lock locked by cpu_lock_irqsave()
> - zone->lock locked by free_pcppages_bulk()
>
> Since __zone_pcp_update() is called from stop_machine(), so with
> CONFIG_PREEMPT_RT_FULL=y
> we get following backtrace when __zone_pcp_update() is called during
> memory hot plugging while
> doing heavy file I/O.
>
> I think stop_machine() may not be required for calling __zone_pcp_update()
> in case of CONFIG_PREEMPT_RT_FULL=y as acquiring pa_lock in __zone_pcp_update()
> should be sufficient to isolate pcp pages and to setup per cpu pagesets.
>
> Can someone please let me know if am missing anything here?
>
>
> The backtrace that this patch fixes:
> BUG: scheduling while atomic: migration/0/7/0x00000002
> Modules linked in: v2p
> Backtrace:
> [<800111a0>] (dump_backtrace+0x0/0x10c) from [<802d7b7c>]
> (dump_stack+0x18/0x1c)
> r6:80c8fc28 r5:80c8f9a0 r4:00000000 r3:60000013
> [<802d7b64>] (dump_stack+0x0/0x1c) from [<8001e81c>] (__schedule_bug+0x64/0x74)
> [<8001e7b8>] (__schedule_bug+0x0/0x74) from [<802d7fa0>]
> (__schedule+0x68/0x604)
> r4:8051bf00 r3:00000000
> [<802d7f38>] (__schedule+0x0/0x604) from [<802d8a78>] (schedule+0x98/0xbc)
> [<802d89e0>] (schedule+0x0/0xbc) from [<802d9e14>]
> (rt_spin_lock_slowlock+0x168/0x240)
> r4:805228f4 r3:00000000
> [<802d9cac>] (rt_spin_lock_slowlock+0x0/0x240) from [<802da234>]
> (rt_spin_lock+0x10/0x14)
> [<802da224>] (rt_spin_lock+0x0/0x14) from [<8008694c>]
> (__zone_pcp_update+0x58/0xd8)
> [<800868f4>] (__zone_pcp_update+0x0/0xd8) from [<800603ec>]
> (stop_machine_cpu_stop+0xb0/0x104)
> [<8006033c>] (stop_machine_cpu_stop+0x0/0x104) from [<80060200>]
> (cpu_stopper_thread+0xd4/0x188)
>
>
> Signed-off-by: Aaditya Kumar <[email protected]>
>
> ---
> mm/page_alloc.c | 4 4 + 0 - 0 !
> 1 file changed, 4 insertions(+)
>
> Index: b/mm/page_alloc.c
> ===================================================================
> --- a/mm/page_alloc.c
> +++ b/mm/page_alloc.c
> @@ -3868,7 +3868,11 @@ static int __zone_pcp_update(void *data)
>
> void zone_pcp_update(struct zone *zone)
> {
> +#ifndef CONFIG_PREEMPT_RT_FULL
> stop_machine(__zone_pcp_update, zone, NULL);
> +#else
> + __zone_pcp_update(zone);
> +#endif
> }
>
> static __meminit void zone_pcp_init(struct zone *zone)
>
> .
>

2012-06-19 21:54:03

by Frank Rowand

[permalink] [raw]
Subject: Re: [RFC][RT][PATCH] mm: Do not use stop_machine() for __zone_pcp_update() for CONFIG_PREEMPT_RT_FULL

On 06/19/12 12:46, Frank Rowand wrote:
> Adding to the distribution list: the author of the patch that included
> the call to stop_machine() (commit 112067f0905b2de862c607ee62411cf47d2fe5c4).
>
> On 06/19/12 11:32, Aaditya Kumar wrote:
>> The code path of __zone_pcp_update() has following locks, which in
>> CONFIG_PREEMPT_RT_FULL=y are rt-mutex.
>> - pa_lock locked by cpu_lock_irqsave()
>> - zone->lock locked by free_pcppages_bulk()
>>
>> Since __zone_pcp_update() is called from stop_machine(), so with

< snip >

... and just for the record, adding the author (Shaohua Li) did not
work; I got an email bounce from Intel.

-Frank

2012-06-23 03:38:16

by KOSAKI Motohiro

[permalink] [raw]
Subject: Re: [RFC][RT][PATCH] mm: Do not use stop_machine() for __zone_pcp_update() for CONFIG_PREEMPT_RT_FULL

(6/19/12 2:32 PM), Aaditya Kumar wrote:
> The code path of __zone_pcp_update() has following locks, which in
> CONFIG_PREEMPT_RT_FULL=y are rt-mutex.
> - pa_lock locked by cpu_lock_irqsave()
> - zone->lock locked by free_pcppages_bulk()
>
> Since __zone_pcp_update() is called from stop_machine(), so with
> CONFIG_PREEMPT_RT_FULL=y
> we get following backtrace when __zone_pcp_update() is called during
> memory hot plugging while
> doing heavy file I/O.
>
> I think stop_machine() may not be required for calling __zone_pcp_update()
> in case of CONFIG_PREEMPT_RT_FULL=y as acquiring pa_lock in __zone_pcp_update()
> should be sufficient to isolate pcp pages and to setup per cpu pagesets.
>
> Can someone please let me know if am missing anything here?

First off, you should cc the memory hotplug experts when discussing a
memory hotplug topic.

Second, stop_machine() is required because zone->pageset is normally a
per-cpu variable. The regular access rules are:
1) the owner cpu can always access its own pcp;
2) an offlined cpu's pcp can be accessed from any cpu, because there is
   no chance of a race;
3) otherwise, the caller must use stop_machine() to prevent the owner
   cpu from accessing its pcp.

stop_machine() and sending IPIs are common techniques for per-cpu area hacks.

2012-06-24 16:55:43

by Aaditya Kumar

[permalink] [raw]
Subject: Re: [RFC][RT][PATCH] mm: Do not use stop_machine() for __zone_pcp_update() for CONFIG_PREEMPT_RT_FULL

On Sat, Jun 23, 2012 at 9:07 AM, KOSAKI Motohiro
<[email protected]> wrote:
> (6/19/12 2:32 PM), Aaditya Kumar wrote:
>> The code path of __zone_pcp_update() has following locks, which in
>> CONFIG_PREEMPT_RT_FULL=y are rt-mutex.
>>  - pa_lock locked by cpu_lock_irqsave()
>>  - zone->lock locked by free_pcppages_bulk()
>>
>> Since __zone_pcp_update() is called from stop_machine(), so with
>> CONFIG_PREEMPT_RT_FULL=y
>> we get following backtrace when __zone_pcp_update() is called during
>> memory hot plugging while
>> doing heavy file I/O.
>>
>> I think stop_machine() may not be required for calling __zone_pcp_update()
>> in case of CONFIG_PREEMPT_RT_FULL=y as acquiring pa_lock in __zone_pcp_update()
>> should be sufficient to isolate pcp pages and to setup per cpu pagesets.
>>
>> Can someone please let me know if am missing anything here?
>

Hello Kosaki-san,

> First off, you should cc memory hotplug experts when discussing memory
> hotplug topic.

Sorry for that.

> Second, stop_machine() is required because usually zone->pageset is
> per-cpu variable.
> the regular access rule is, 1) owner cpu can always access their own
> pcp, 2) offlined cpu's
> pcp can be accessed from any cpus because is has no race chance 3)
> otherwise caller must
> use stop_machine for preventing owner cpu accesses pcp.

Thank you very much for your explanation, yes, my approach was not correct.

Since "mm: page_alloc: rt-friendly per-cpu pages" from the RT patch set
(http://www.kernel.org/pub/linux/kernel/projects/rt/3.4/patch-3.4.3-rt12-rc1.patch.xz)
introduces a preemptible lock (pa_lock, which becomes an rt-mutex) for
accessing the pcp, when calling zone_pcp_update() (with the RT patch set
applied and CONFIG_PREEMPT_RT_FULL=y) we have two possible options to fix
the BUG caused by taking a sleeping lock inside stop_machine():
 Option 1. Revert the patch which introduces the sleeping lock (pa_lock).
 Option 2. Fix the calling framework (use another framework).

Since memory hot plug is usually not a frequent activity in the system,
I think the small additional overhead incurred by option 2 is acceptable.

My approach in the patch below for zone_pcp_update() is:
1. For each online cpu, set up that cpu's pageset by scheduling work on that cpu.
2. For each offline cpu, set up that cpu's pageset from the current cpu.
3. Flush all the work spawned in step 1.

I will re-send this as a formal patch if there are no objections to
this approach.


---
mm/page_alloc.c | 74 74 + 0 - 0 !
1 file changed, 74 insertions(+)

Index: b/mm/page_alloc.c
===================================================================
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -111,6 +111,16 @@ unsigned long totalreserve_pages __read_
int percpu_pagelist_fraction;
gfp_t gfp_allowed_mask __read_mostly = GFP_BOOT_MASK;

+#ifdef CONFIG_PREEMPT_RT_FULL
+struct zone_pcp_work {
+ int cpu;
+ struct zone *zone;
+ struct work_struct work;
+};
+
+static DEFINE_PER_CPU(struct zone_pcp_work, zone_pcp_update_work);
+#endif
+
#ifdef CONFIG_PM_SLEEP
/*
* The following functions are used by the suspend/hibernate code to temporarily
@@ -3926,6 +3936,7 @@ int zone_wait_table_init(struct zone *zo
return 0;
}

+#ifndef CONFIG_PREEMPT_RT_FULL
static int __zone_pcp_update(void *data)
{
struct zone *zone = data;
@@ -3949,13 +3960,69 @@ static int __zone_pcp_update(void *data)
return 0;
}

+#else
+
+static void __zone_cpu_pcp_update(struct work_struct *work)
+{
+ struct zone_pcp_work *work_data =
+ container_of(work, struct zone_pcp_work, work);
+ struct zone *zone = work_data->zone;
+ int cpu = work_data->cpu;
+ unsigned long batch = zone_batchsize(zone), flags;
+ struct per_cpu_pageset *pset;
+ struct per_cpu_pages *pcp;
+ LIST_HEAD(dst);
+
+ pset = per_cpu_ptr(zone->pageset, cpu);
+ pcp = &pset->pcp;
+
+ cpu_lock_irqsave(cpu, flags);
+ isolate_pcp_pages(pcp->count, pcp, &dst);
+ free_pcppages_bulk(zone, pcp->count, &dst);
+ setup_pageset(pset, batch);
+ cpu_unlock_irqrestore(cpu, flags);
+
+}
+#endif
+
void zone_pcp_update(struct zone *zone)
{
+
+#ifdef CONFIG_PREEMPT_RT_FULL
+ int cpu;
+
+ get_online_cpus();
+ for_each_possible_cpu(cpu) {
+ struct zone_pcp_work *zone_pcp_work =
+ &per_cpu(zone_pcp_update_work, cpu);
+ zone_pcp_work->zone = zone;
+ zone_pcp_work->cpu = cpu;
+
+ if (cpu_online(cpu))
+ schedule_work_on(cpu, &zone_pcp_work->work);
+ else
+ __zone_cpu_pcp_update(&zone_pcp_work->work);
+ }
+
+ for_each_online_cpu(cpu) {
+ struct zone_pcp_work *zone_pcp_work =
+ &per_cpu(zone_pcp_update_work, cpu);
+
+ flush_work(&zone_pcp_work->work);
+ }
+ put_online_cpus();
+
+#else
+
stop_machine(__zone_pcp_update, zone, NULL);
+#endif
}

static __meminit void zone_pcp_init(struct zone *zone)
{
+#ifdef CONFIG_PREEMPT_RT_FULL
+ int cpu;
+#endif
/*
* per cpu subsystem is not up at this point. The following code
* relies on the ability of the linker to provide the
@@ -3967,6 +4034,13 @@ static __meminit void zone_pcp_init(stru
printk(KERN_DEBUG " %s zone: %lu pages, LIFO batch:%u\n",
zone->name, zone->present_pages,
zone_batchsize(zone));
+#ifdef CONFIG_PREEMPT_RT_FULL
+ for_each_possible_cpu(cpu) {
+ struct zone_pcp_work *zone_pcp_work =
+ &per_cpu(zone_pcp_update_work, cpu);
+ INIT_WORK(&zone_pcp_work->work, __zone_cpu_pcp_update);
+ }
+#endif
}

__meminit int init_currently_empty_zone(struct zone *zone,





>
> stop_machine and send ipi are common technique for per-cpu area hack.

2012-06-25 08:00:45

by Aaditya Kumar

[permalink] [raw]
Subject: Re: [RFC][RT][PATCH] mm: Do not use stop_machine() for __zone_pcp_update() for CONFIG_PREEMPT_RT_FULL

On Sun, Jun 24, 2012 at 10:25 PM, Aaditya Kumar
<[email protected]> wrote:
> On Sat, Jun 23, 2012 at 9:07 AM, KOSAKI Motohiro
> <[email protected]> wrote:
>> (6/19/12 2:32 PM), Aaditya Kumar wrote:
>>> The code path of __zone_pcp_update() has following locks, which in
>>> CONFIG_PREEMPT_RT_FULL=y are rt-mutex.
>>>  - pa_lock locked by cpu_lock_irqsave()
>>>  - zone->lock locked by free_pcppages_bulk()
>>>
>>> Since __zone_pcp_update() is called from stop_machine(), so with
>>> CONFIG_PREEMPT_RT_FULL=y
>>> we get following backtrace when __zone_pcp_update() is called during
>>> memory hot plugging while
>>> doing heavy file I/O.
>>>
>>> I think stop_machine() may not be required for calling __zone_pcp_update()
>>> in case of CONFIG_PREEMPT_RT_FULL=y as acquiring pa_lock in __zone_pcp_update()
>>> should be sufficient to isolate pcp pages and to setup per cpu pagesets.
>>>
>>> Can someone please let me know if am missing anything here?
>>
>
> Hello Kosaki-san,
>
>> First off, you should cc memory hotplug experts when discussing memory
>> hotplug topic.
>
> Sorry for that.
>
>> Second, stop_machine() is required because usually zone->pageset is
>> per-cpu variable.
>> the regular access rule is, 1) owner cpu can always access their own
>> pcp, 2) offlined cpu's
>> pcp can be accessed from any cpus because is has no race chance 3)
>> otherwise caller must
>> use stop_machine for preventing owner cpu accesses pcp.
>
> Thank you very much for your explanation, yes, my approach was not correct.

Hello Kosaki-san,

I revisited my first approach of simply calling __zone_pcp_update()
directly (without stop_machine()).

The RT patch set seems to follow this protocol for accessing per-cpu
pageset pages:
(http://www.kernel.org/pub/linux/kernel/projects/rt/3.4/patch-3.4.3-rt12-rc1.patch.xz)
- Each cpu's pcp is protected by a per-cpu spin lock
  (DEFINE_LOCAL_IRQ_LOCK(pa_lock)).
- So, in brief, to access a particular cpu's pcp we just need to take
  the spinlock corresponding to that cpu.
- cpu_lock_irqsave() in __zone_pcp_update() locks the pcp spinlock of
  the cpu whose pcp it accesses.

So I feel that my first approach should work; please let me know if I
am missing something.

--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -3868,7 +3868,11 @@ static int __zone_pcp_update(void *data)

void zone_pcp_update(struct zone *zone)
{
+#ifndef CONFIG_PREEMPT_RT_FULL
stop_machine(__zone_pcp_update, zone, NULL);
+#else
+ __zone_pcp_update(zone);
+#endif
}


> Since "mm: page_alloc: rt-friendly per-cpu pages" from RT patch set
> introduces a preemptible lock
> (pa_lock which becomes an rt-mutex) for accessing pcp,
> (http://www.kernel.org/pub/linux/kernel/projects/rt/3.4/patch-3.4.3-rt12-rc1.patch.xz)
>
> So while calling zone_pcp_update() (with RT-patch set applied and with
> CONFIG_PREEMPT_RT_FULL=y),
> we have possibly two options to fix the BUG caused by taking a
> sleeping lock in stop_machine.
> Option 1. Revert the patch which introduces the sleeping(pa_lock) lock.
> Option 2. Fix the calling frame work.(Use another framework)
>
> Since usually memory hot plug is not that frequent an activity in the system,
> So a little more overhead occurred while taking option 2, I think is acceptable.
>
> My approach in my below patch for zone_pcp_update() is:
> 1. For each online cpu, setup pageset of a cpu by scheduling work on that cpu.
> 2. For each offline cpu, setup pageset of a cpu from current cpu.
> 3. Flush the all the work spawned in step1.
>
> I will re-send this as a formal patch if there are no objections to
> this approach.
>
>
< snip >
>>
>> stop_machine and send ipi are common technique for per-cpu area hack.

2013-03-05 18:40:08

by Aaditya Kumar

[permalink] [raw]
Subject: [RFC][RT][PATCH RESEND] mm: Do not use stop_machine() for __zone_pcp_update() for CONFIG_PREEMPT_RT_FULL

The code path of __zone_pcp_update() takes the following locks, which with
CONFIG_PREEMPT_RT_FULL=y are rt-mutexes:
 - pa_lock, taken by cpu_lock_irqsave()
 - zone->lock, taken by free_pcppages_bulk()

Since __zone_pcp_update() is called from stop_machine(), with
CONFIG_PREEMPT_RT_FULL=y we get the following backtrace when
__zone_pcp_update() is called during memory hot-plugging while doing
heavy file I/O.

stop_machine() may not be required for calling __zone_pcp_update() in the
case of CONFIG_PREEMPT_RT_FULL=y, as acquiring pa_lock in
__zone_pcp_update() should be sufficient to isolate pcp pages and to set up
the per-cpu pagesets.


The backtrace that this patch fixes:
BUG: scheduling while atomic: migration/0/7/0x00000002
Modules linked in: v2p
Backtrace:
[<800111a0>] (dump_backtrace+0x0/0x10c) from [<802d7b7c>]
(dump_stack+0x18/0x1c)
r6:80c8fc28 r5:80c8f9a0 r4:00000000 r3:60000013
[<802d7b64>] (dump_stack+0x0/0x1c) from [<8001e81c>] (__schedule_bug+0x64/0x74)
[<8001e7b8>] (__schedule_bug+0x0/0x74) from [<802d7fa0>]
(__schedule+0x68/0x604)
r4:8051bf00 r3:00000000
[<802d7f38>] (__schedule+0x0/0x604) from [<802d8a78>] (schedule+0x98/0xbc)
[<802d89e0>] (schedule+0x0/0xbc) from [<802d9e14>]
(rt_spin_lock_slowlock+0x168/0x240)
r4:805228f4 r3:00000000
[<802d9cac>] (rt_spin_lock_slowlock+0x0/0x240) from [<802da234>]
(rt_spin_lock+0x10/0x14)
[<802da224>] (rt_spin_lock+0x0/0x14) from [<8008694c>]
(__zone_pcp_update+0x58/0xd8)
[<800868f4>] (__zone_pcp_update+0x0/0xd8) from [<800603ec>]
(stop_machine_cpu_stop+0xb0/0x104)
[<8006033c>] (stop_machine_cpu_stop+0x0/0x104) from [<80060200>]
(cpu_stopper_thread+0xd4/0x188)


Signed-off-by: Aaditya Kumar <[email protected]>

---
mm/page_alloc.c | 4 4 + 0 - 0 !
1 file changed, 4 insertions(+)

Index: b/mm/page_alloc.c
===================================================================
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -3868,7 +3868,11 @@ static int __zone_pcp_update(void *data)

void zone_pcp_update(struct zone *zone)
{
+#ifndef CONFIG_PREEMPT_RT_FULL
stop_machine(__zone_pcp_update, zone, NULL);
+#else
+ __zone_pcp_update(zone);
+#endif
}

static __meminit void zone_pcp_init(struct zone *zone)

2013-03-05 19:21:56

by Steven Rostedt

[permalink] [raw]
Subject: Re: [RFC][RT][PATCH RESEND] mm: Do not use stop_machine() for __zone_pcp_update() for CONFIG_PREEMPT_RT_FULL

On Wed, 2013-03-06 at 00:10 +0530, Aaditya Kumar wrote:
> The code path of __zone_pcp_update() has following locks, which in
> CONFIG_PREEMPT_RT_FULL=y are rt-mutex.
> - pa_lock locked by cpu_lock_irqsave()
> - zone->lock locked by free_pcppages_bulk()
>
> Since __zone_pcp_update() is called from stop_machine(), so with
> CONFIG_PREEMPT_RT_FULL=y
> we get following backtrace when __zone_pcp_update() is called during
> memory hot plugging while
> doing heavy file I/O.
>
> stop_machine() may not be required for calling __zone_pcp_update()

"may not be required" is not a technically sufficient reason for a change.

Why is this called from stop_machine() in mainline, and what exactly
makes it "OK" not to use it in PREEMPT_RT? Just because the routine uses
mutexes doesn't mean that it's safe.

Actually, spinlocks are meaningless when used inside stop_machine(), so
one can ask why the routine takes spinlocks in a stop_machine() context
in the first place. As stop_machine() stops all other CPUs from running,
there should not be any need for spinlocks. Is it just because the
routine uses code paths that are also used in normal operation?

Note, stop_machine() synchronizes things outside of locks. Which means
if it's needed for mainline it is most likely needed for PREEMPT_RT as
well.

The real solution is to figure out why stop_machine() is required in the
first place, and remove it completely if possible. Both from PREEMPT_RT
*and* mainline!

-- Steve

> in case of CONFIG_PREEMPT_RT_FULL=y as acquiring pa_lock in __zone_pcp_update()
> should be sufficient to isolate pcp pages and to setup per cpu pagesets.
>
>
> The backtrace that this patch fixes:
> BUG: scheduling while atomic: migration/0/7/0x00000002
> Modules linked in: v2p
> Backtrace:
> [<800111a0>] (dump_backtrace+0x0/0x10c) from [<802d7b7c>]
> (dump_stack+0x18/0x1c)
> r6:80c8fc28 r5:80c8f9a0 r4:00000000 r3:60000013
> [<802d7b64>] (dump_stack+0x0/0x1c) from [<8001e81c>] (__schedule_bug+0x64/0x74)
> [<8001e7b8>] (__schedule_bug+0x0/0x74) from [<802d7fa0>]
> (__schedule+0x68/0x604)
> r4:8051bf00 r3:00000000
> [<802d7f38>] (__schedule+0x0/0x604) from [<802d8a78>] (schedule+0x98/0xbc)
> [<802d89e0>] (schedule+0x0/0xbc) from [<802d9e14>]
> (rt_spin_lock_slowlock+0x168/0x240)
> r4:805228f4 r3:00000000
> [<802d9cac>] (rt_spin_lock_slowlock+0x0/0x240) from [<802da234>]
> (rt_spin_lock+0x10/0x14)
> [<802da224>] (rt_spin_lock+0x0/0x14) from [<8008694c>]
> (__zone_pcp_update+0x58/0xd8)
> [<800868f4>] (__zone_pcp_update+0x0/0xd8) from [<800603ec>]
> (stop_machine_cpu_stop+0xb0/0x104)
> [<8006033c>] (stop_machine_cpu_stop+0x0/0x104) from [<80060200>]
> (cpu_stopper_thread+0xd4/0x188)
>
>
> Signed-off-by: Aaditya Kumar <[email protected]>
>
> ---
> mm/page_alloc.c | 4 4 + 0 - 0 !
> 1 file changed, 4 insertions(+)
>
> Index: b/mm/page_alloc.c
> ===================================================================
> --- a/mm/page_alloc.c
> +++ b/mm/page_alloc.c
> @@ -3868,7 +3868,11 @@ static int __zone_pcp_update(void *data)
>
> void zone_pcp_update(struct zone *zone)
> {
> +#ifndef CONFIG_PREEMPT_RT_FULL
> stop_machine(__zone_pcp_update, zone, NULL);
> +#else
> + __zone_pcp_update(zone);
> +#endif
> }
>
> static __meminit void zone_pcp_init(struct zone *zone)