2010-12-29 22:55:19

by Hugh Dickins

Subject: PowerPC BUG: using smp_processor_id() in preemptible code

With recent 2.6.37-rc, with CONFIG_PREEMPT=y CONFIG_DEBUG_PREEMPT=y
on the PowerPC G5, I get spammed by BUG warnings each time I swapoff:

BUG: using smp_processor_id() in preemptible [00000000] code: swapoff/3974
caller is .hpte_need_flush+0x4c/0x2e8
Call Trace:
[c0000001b4a3f830] [c00000000000f3cc] .show_stack+0x6c/0x16c (unreliable)
[c0000001b4a3f8e0] [c00000000023eda0] .debug_smp_processor_id+0xe4/0x11c
[c0000001b4a3f970] [c00000000002f2f4] .hpte_need_flush+0x4c/0x2e8
[c0000001b4a3fa30] [c0000000000e7ef8] .vunmap_pud_range+0x148/0x200
[c0000001b4a3fb10] [c0000000000e8058] .vunmap_page_range+0xa8/0xd4
[c0000001b4a3fbb0] [c0000000000e80a4] .free_unmap_vmap_area+0x20/0x38
[c0000001b4a3fc40] [c0000000000e8138] .remove_vm_area+0x7c/0xb4
[c0000001b4a3fcd0] [c0000000000e8308] .__vunmap+0x50/0x104
[c0000001b4a3fd60] [c0000000000ef3fc] .SyS_swapoff+0x59c/0x6a8
[c0000001b4a3fe30] [c0000000000075a8] syscall_exit+0x0/0x40

I notice hpte_need_flush() itself acknowledges
* Must be called from within some kind of spinlock/non-preempt region...
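
(For reference, I assume what trips DEBUG_PREEMPT is the per-cpu batch
lookup at the top of hpte_need_flush() - a from-memory sketch, abridged:

	/* arch/powerpc/mm/tlb_hash64.c, abridged */
	void hpte_need_flush(struct mm_struct *mm, unsigned long addr,
			     pte_t *ptep, unsigned long pte, int huge)
	{
		/*
		 * __get_cpu_var() resolves the per-cpu area via
		 * smp_processor_id(), hence the warning when we arrive
		 * here with preemption still enabled.
		 */
		struct ppc64_tlb_batch *batch = &__get_cpu_var(ppc64_tlb_batch);
		...
	}

so the complaint is really about touching the per-cpu flush batch while
the task is still free to migrate.)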

Though I didn't actually bisect, I believe this is since Jeremy's
64141da587241301ce8638cc945f8b67853156ec "vmalloc: eagerly clear ptes
on vunmap", which moves a call to vunmap_page_range() from one place
(which happened to be inside a spinlock) to another (where it's not).

I guess my warnings would be easily silenced by moving that call to
vunmap_page_range() down just inside the spinlock below it; but I'm
dubious that that's the right fix - it looked as if there are other
paths through vmalloc.c where vunmap_page_range() has been getting
called without preemption disabled, long before Jeremy's change,
just paths that I never happen to go down in my limited testing.

For the moment I'm using the obvious patch below to keep it quiet;
but I doubt that this is the right patch either. I'm hoping that
ye who understand the importance of hpte_need_flush() will be best
able to judge what to do. Or might there be other architectures
expecting to be unpreemptible there?

Thanks,
Hugh

--- a/mm/vmalloc.c
+++ b/mm/vmalloc.c
@@ -37,11 +37,13 @@ static void vunmap_pte_range(pmd_t *pmd,
 {
 	pte_t *pte;
 
+	preempt_disable();	/* Stop __vunmap() triggering smp_processor_id() in preemptible from hpte_need_flush() */
 	pte = pte_offset_kernel(pmd, addr);
 	do {
 		pte_t ptent = ptep_get_and_clear(&init_mm, addr, pte);
 		WARN_ON(!pte_none(ptent) && !pte_present(ptent));
 	} while (pte++, addr += PAGE_SIZE, addr != end);
+	preempt_enable();
 }
 
 static void vunmap_pmd_range(pud_t *pud, unsigned long addr, unsigned long end)


2010-12-30 01:26:31

by Jeremy Fitzhardinge

Subject: Re: PowerPC BUG: using smp_processor_id() in preemptible code

On 12/30/2010 09:54 AM, Hugh Dickins wrote:
> With recent 2.6.37-rc, with CONFIG_PREEMPT=y CONFIG_DEBUG_PREEMPT=y
> on the PowerPC G5, I get spammed by BUG warnings each time I swapoff:
>
> BUG: using smp_processor_id() in preemptible [00000000] code: swapoff/3974
> caller is .hpte_need_flush+0x4c/0x2e8
> Call Trace:
> [c0000001b4a3f830] [c00000000000f3cc] .show_stack+0x6c/0x16c (unreliable)
> [c0000001b4a3f8e0] [c00000000023eda0] .debug_smp_processor_id+0xe4/0x11c
> [c0000001b4a3f970] [c00000000002f2f4] .hpte_need_flush+0x4c/0x2e8
> [c0000001b4a3fa30] [c0000000000e7ef8] .vunmap_pud_range+0x148/0x200
> [c0000001b4a3fb10] [c0000000000e8058] .vunmap_page_range+0xa8/0xd4
> [c0000001b4a3fbb0] [c0000000000e80a4] .free_unmap_vmap_area+0x20/0x38
> [c0000001b4a3fc40] [c0000000000e8138] .remove_vm_area+0x7c/0xb4
> [c0000001b4a3fcd0] [c0000000000e8308] .__vunmap+0x50/0x104
> [c0000001b4a3fd60] [c0000000000ef3fc] .SyS_swapoff+0x59c/0x6a8
> [c0000001b4a3fe30] [c0000000000075a8] syscall_exit+0x0/0x40
>
> I notice hpte_need_flush() itself acknowledges
> * Must be called from within some kind of spinlock/non-preempt region...
>
> Though I didn't actually bisect, I believe this is since Jeremy's
> 64141da587241301ce8638cc945f8b67853156ec "vmalloc: eagerly clear ptes
> on vunmap", which moves a call to vunmap_page_range() from one place
> (which happened to be inside a spinlock) to another (where it's not).

Yes. I was a bit worried by the interaction between vb->lock and the
pte locks, and thought it safer to keep it outside. Though I suppose
kernel mappings don't have pte locks, so that's a non-issue.

> I guess my warnings would be easily silenced by moving that call to
> vunmap_page_range() down just inside the spinlock below it; but I'm
> dubious that that's the right fix - it looked as if there are other
> paths through vmalloc.c where vunmap_page_range() has been getting
> called without preemption disabled, long before Jeremy's change,
> just paths that I never happen to go down in my limited testing.
>
> For the moment I'm using the obvious patch below to keep it quiet;
> but I doubt that this is the right patch either. I'm hoping that
> ye who understand the importance of hpte_need_flush() will be best
> able to judge what to do. Or might there be other architectures
> expecting to be unpreemptible there?

Since kernel mappings don't have a formal pte lock to rely on for
synchronization, each subsystem defines its own ad-hoc one. In this
case that's probably vb->lock anyway, so the fix is to move
vunmap_page_range() back into its protection.
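
Roughly this, in vb_free() - an untested sketch, assuming the call can
simply move below the existing spin_lock():

	spin_lock(&vb->lock);
	/*
	 * Clearing the ptes under vb->lock keeps preemption disabled
	 * across the pte updates, which is what hpte_need_flush() (and
	 * anything else relying on "some kind of spinlock/non-preempt
	 * region") expects.
	 */
	vunmap_page_range((unsigned long)addr, (unsigned long)addr + size);

i.e. no extra preempt_disable() needed; the vb->lock critical section
already provides it.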

> --- a/mm/vmalloc.c
> +++ b/mm/vmalloc.c
> @@ -37,11 +37,13 @@ static void vunmap_pte_range(pmd_t *pmd,
> {
> pte_t *pte;
>
> + preempt_disable(); /* Stop __vunmap() triggering smp_processor_id() in preemptible from hpte_need_flush() */
> pte = pte_offset_kernel(pmd, addr);
> do {
> pte_t ptent = ptep_get_and_clear(&init_mm, addr, pte);
> WARN_ON(!pte_none(ptent) && !pte_present(ptent));
> } while (pte++, addr += PAGE_SIZE, addr != end);
> + preempt_enable();
> }
>
> static void vunmap_pmd_range(pud_t *pud, unsigned long addr, unsigned long end)

That looks OK, but it interferes with my plans to use
apply_to_page_range(_batch) to replace all this code.

J

2010-12-30 10:45:44

by Benjamin Herrenschmidt

Subject: Re: PowerPC BUG: using smp_processor_id() in preemptible code

On Wed, 2010-12-29 at 14:54 -0800, Hugh Dickins wrote:
> With recent 2.6.37-rc, with CONFIG_PREEMPT=y CONFIG_DEBUG_PREEMPT=y
> on the PowerPC G5, I get spammed by BUG warnings each time I swapoff:
>
> BUG: using smp_processor_id() in preemptible [00000000] code: swapoff/3974
> caller is .hpte_need_flush+0x4c/0x2e8
> Call Trace:
> [c0000001b4a3f830] [c00000000000f3cc] .show_stack+0x6c/0x16c (unreliable)
> [c0000001b4a3f8e0] [c00000000023eda0] .debug_smp_processor_id+0xe4/0x11c
> [c0000001b4a3f970] [c00000000002f2f4] .hpte_need_flush+0x4c/0x2e8
> [c0000001b4a3fa30] [c0000000000e7ef8] .vunmap_pud_range+0x148/0x200
> [c0000001b4a3fb10] [c0000000000e8058] .vunmap_page_range+0xa8/0xd4
> [c0000001b4a3fbb0] [c0000000000e80a4] .free_unmap_vmap_area+0x20/0x38
> [c0000001b4a3fc40] [c0000000000e8138] .remove_vm_area+0x7c/0xb4
> [c0000001b4a3fcd0] [c0000000000e8308] .__vunmap+0x50/0x104
> [c0000001b4a3fd60] [c0000000000ef3fc] .SyS_swapoff+0x59c/0x6a8
> [c0000001b4a3fe30] [c0000000000075a8] syscall_exit+0x0/0x40
>
> I notice hpte_need_flush() itself acknowledges
> * Must be called from within some kind of spinlock/non-preempt region...

Yes, we assume that the PTE lock is always held when modifying page
tables...

> Though I didn't actually bisect, I believe this is since Jeremy's
> 64141da587241301ce8638cc945f8b67853156ec "vmalloc: eagerly clear ptes
> on vunmap", which moves a call to vunmap_page_range() from one place
> (which happened to be inside a spinlock) to another (where it's not).
>
> I guess my warnings would be easily silenced by moving that call to
> vunmap_page_range() down just inside the spinlock below it; but I'm
> dubious that that's the right fix - it looked as if there are other
> paths through vmalloc.c where vunmap_page_range() has been getting
> called without preemption disabled, long before Jeremy's change,
> just paths that I never happen to go down in my limited testing.
>
> For the moment I'm using the obvious patch below to keep it quiet;
> but I doubt that this is the right patch either. I'm hoping that
> ye who understand the importance of hpte_need_flush() will be best
> able to judge what to do. Or might there be other architectures
> expecting to be unpreemptible there?

Well, it looks like our kernel mappings tend to take some nasty
shortcuts with the PTE locking, which I suppose are legit but do break
some of my assumptions there. I need to have a closer look. Thanks for
the report !

Cheers,
Ben.

> Thanks,
> Hugh
>
> --- a/mm/vmalloc.c
> +++ b/mm/vmalloc.c
> @@ -37,11 +37,13 @@ static void vunmap_pte_range(pmd_t *pmd,
> {
> pte_t *pte;
>
> + preempt_disable(); /* Stop __vunmap() triggering smp_processor_id() in preemptible from hpte_need_flush() */
> pte = pte_offset_kernel(pmd, addr);
> do {
> pte_t ptent = ptep_get_and_clear(&init_mm, addr, pte);
> WARN_ON(!pte_none(ptent) && !pte_present(ptent));
> } while (pte++, addr += PAGE_SIZE, addr != end);
> + preempt_enable();
> }
>
> static void vunmap_pmd_range(pud_t *pud, unsigned long addr, unsigned long end)

2011-02-24 20:47:58

by Hugh Dickins

Subject: Re: PowerPC BUG: using smp_processor_id() in preemptible code

On Thu, 30 Dec 2010, Benjamin Herrenschmidt wrote:
> On Wed, 2010-12-29 at 14:54 -0800, Hugh Dickins wrote:
> > With recent 2.6.37-rc, with CONFIG_PREEMPT=y CONFIG_DEBUG_PREEMPT=y
> > on the PowerPC G5, I get spammed by BUG warnings each time I swapoff:
> >
> > BUG: using smp_processor_id() in preemptible [00000000] code: swapoff/3974
> > caller is .hpte_need_flush+0x4c/0x2e8
> > Call Trace:
> > [c0000001b4a3f830] [c00000000000f3cc] .show_stack+0x6c/0x16c (unreliable)
> > [c0000001b4a3f8e0] [c00000000023eda0] .debug_smp_processor_id+0xe4/0x11c
> > [c0000001b4a3f970] [c00000000002f2f4] .hpte_need_flush+0x4c/0x2e8
> > [c0000001b4a3fa30] [c0000000000e7ef8] .vunmap_pud_range+0x148/0x200
> > [c0000001b4a3fb10] [c0000000000e8058] .vunmap_page_range+0xa8/0xd4
> > [c0000001b4a3fbb0] [c0000000000e80a4] .free_unmap_vmap_area+0x20/0x38
> > [c0000001b4a3fc40] [c0000000000e8138] .remove_vm_area+0x7c/0xb4
> > [c0000001b4a3fcd0] [c0000000000e8308] .__vunmap+0x50/0x104
> > [c0000001b4a3fd60] [c0000000000ef3fc] .SyS_swapoff+0x59c/0x6a8
> > [c0000001b4a3fe30] [c0000000000075a8] syscall_exit+0x0/0x40
> >
> > I notice hpte_need_flush() itself acknowledges
> > * Must be called from within some kind of spinlock/non-preempt region...
>
> Yes, we assume that the PTE lock is always held when modifying page
> tables...

Right, not for the kernel page tables. I remember, when doing the
pte_offset_map_lock() stuff, it appeared that the kernel would already
be in trouble if it did not have higher-level serialization for its
pte updates, so there was no point in using a page_table_lock for them.

>
> > Though I didn't actually bisect, I believe this is since Jeremy's
> > 64141da587241301ce8638cc945f8b67853156ec "vmalloc: eagerly clear ptes
> > on vunmap", which moves a call to vunmap_page_range() from one place
> > (which happened to be inside a spinlock) to another (where it's not).
> >
> > I guess my warnings would be easily silenced by moving that call to
> > vunmap_page_range() down just inside the spinlock below it; but I'm
> > dubious that that's the right fix - it looked as if there are other
> > paths through vmalloc.c where vunmap_page_range() has been getting
> > called without preemption disabled, long before Jeremy's change,
> > just paths that I never happen to go down in my limited testing.
> >
> > For the moment I'm using the obvious patch below to keep it quiet;
> > but I doubt that this is the right patch either. I'm hoping that
> > ye who understand the importance of hpte_need_flush() will be best
> > able to judge what to do. Or might there be other architectures
> > expecting to be unpreemptible there?
>
> Well, it looks like our kernel mappings tend to take some nasty
> shortcuts with the PTE locking, which I suppose are legit but do break
> some of my assumptions there. I need to have a closer look. Thanks for
> the report !

None of us have progressed this since late in 2.6.37-rc, and now
it's late in 2.6.38-rc and we're still in the same situation.

The patch I'm currently using to suppress all the noise (I suppose
I could just turn off DEBUG_PREEMPT but that would be cheating) is an
extract from Peter's preemptible mmu_gather patches, below, but I don't
really know if it's valid to extract it in this way.

Reading back, I see Jeremy suggested moving vb_free()'s call to
vunmap_page_range() back inside vb->lock: it certainly was his moving
the call out from under that lock that brought the issue to my notice;
but it looked as if there were other paths which would give preemptible
PowerPC the same issue, just paths I happen not to go down myself. I'm
not sure, I didn't take the time to follow it up properly, expecting
further insight to arrive shortly from Ben!

And, as threatened, Jeremy has further vmalloc changes queued up in
mmotm, which certainly make the patch below inadequate, and I imagine
the vunmap_page_range() movement too. I'm currently (well, I think most
recent mmotm doesn't even boot on my ppc) having to disable preemption
in the kernel case of apply_to_pte_range().
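
(Concretely, a sketch of that local hack - nothing I'd propose for
merging:

	/* mm/memory.c: apply_to_pte_range(), sketch of my local hack only */
	if (mm == &init_mm)
		preempt_disable();	/* keep hpte_need_flush() quiet */
	...
	if (mm == &init_mm)
		preempt_enable();

around applying fn() to the kernel ptes; it just silences the warning,
same as before, without fixing anything.)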

What would be better for 2.6.38 and 2.6.37-stable? Moving that call to
vunmap_page_range back under vb->lock, or the partial-Peter-patch below?
And then what should be done for 2.6.39?

Hugh

--- 2.6.38-rc5/arch/powerpc/mm/tlb_hash64.c 2010-02-24 10:52:17.000000000 -0800
+++ linux/arch/powerpc/mm/tlb_hash64.c 2011-02-15 23:27:21.000000000 -0800
@@ -38,13 +38,11 @@ DEFINE_PER_CPU(struct ppc64_tlb_batch, p
  * neesd to be flushed. This function will either perform the flush
  * immediately or will batch it up if the current CPU has an active
  * batch on it.
- *
- * Must be called from within some kind of spinlock/non-preempt region...
  */
 void hpte_need_flush(struct mm_struct *mm, unsigned long addr,
 		     pte_t *ptep, unsigned long pte, int huge)
 {
-	struct ppc64_tlb_batch *batch = &__get_cpu_var(ppc64_tlb_batch);
+	struct ppc64_tlb_batch *batch = &get_cpu_var(ppc64_tlb_batch);
 	unsigned long vsid, vaddr;
 	unsigned int psize;
 	int ssize;
@@ -99,6 +97,7 @@ void hpte_need_flush(struct mm_struct *m
 	 */
 	if (!batch->active) {
 		flush_hash_page(vaddr, rpte, psize, ssize, 0);
+		put_cpu_var(ppc64_tlb_batch);
 		return;
 	}
 
@@ -127,6 +126,7 @@ void hpte_need_flush(struct mm_struct *m
 	batch->index = ++i;
 	if (i >= PPC64_TLB_BATCH_NR)
 		__flush_tlb_pending(batch);
+	put_cpu_var(ppc64_tlb_batch);
 }
 
 /*

2011-02-24 21:06:44

by Peter Zijlstra

Subject: Re: PowerPC BUG: using smp_processor_id() in preemptible code

On Thu, 2011-02-24 at 12:47 -0800, Hugh Dickins wrote:

Lovely problem :-), benh mentioned it on IRC, but I never got around to
finding the email thread, thanks for the CC.

> What would be better for 2.6.38 and 2.6.37-stable? Moving that call to
> vunmap_page_range back under vb->lock, or the partial-Peter-patch below?
> And then what should be done for 2.6.39?

I think you'll also need the arch/powerpc/kernel/process.c changes that
cause context switches to flush the tlb_batch queues.
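
Something along these lines, I mean (a sketch of the idea only, not the
actual patch): drain any pending batch in __switch_to(), before the task
can be migrated:

	/* arch/powerpc/kernel/process.c: __switch_to(), idea only */
	struct ppc64_tlb_batch *batch = &__get_cpu_var(ppc64_tlb_batch);

	if (batch->active && batch->index)
		__flush_tlb_pending(batch);	/* drain before we can move CPUs */

otherwise a preempted task could carry half-filled batch state across to
another CPU.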

> --- 2.6.38-rc5/arch/powerpc/mm/tlb_hash64.c 2010-02-24 10:52:17.000000000 -0800
> +++ linux/arch/powerpc/mm/tlb_hash64.c 2011-02-15 23:27:21.000000000 -0800
> @@ -38,13 +38,11 @@ DEFINE_PER_CPU(struct ppc64_tlb_batch, p
> * neesd to be flushed. This function will either perform the flush
> * immediately or will batch it up if the current CPU has an active
> * batch on it.
> - *
> - * Must be called from within some kind of spinlock/non-preempt region...
> */
> void hpte_need_flush(struct mm_struct *mm, unsigned long addr,
> pte_t *ptep, unsigned long pte, int huge)
> {
> - struct ppc64_tlb_batch *batch = &__get_cpu_var(ppc64_tlb_batch);
> + struct ppc64_tlb_batch *batch = &get_cpu_var(ppc64_tlb_batch);
> unsigned long vsid, vaddr;
> unsigned int psize;
> int ssize;
> @@ -99,6 +97,7 @@ void hpte_need_flush(struct mm_struct *m
> */
> if (!batch->active) {
> flush_hash_page(vaddr, rpte, psize, ssize, 0);
> + put_cpu_var(ppc64_tlb_batch);
> return;
> }
>
> @@ -127,6 +126,7 @@ void hpte_need_flush(struct mm_struct *m
> batch->index = ++i;
> if (i >= PPC64_TLB_BATCH_NR)
> __flush_tlb_pending(batch);
> + put_cpu_var(ppc64_tlb_batch);
> }
>
> /*

2011-02-24 21:12:24

by Benjamin Herrenschmidt

Subject: Re: PowerPC BUG: using smp_processor_id() in preemptible code

On Thu, 2011-02-24 at 12:47 -0800, Hugh Dickins wrote:

> Reading back, I see Jeremy suggested moving vb_free()'s call to
> vunmap_page_range() back inside vb->lock: it certainly was his moving
> the call out from under that lock that brought the issue to my notice;
> but it looked as if there were other paths which would give preemptible
> PowerPC the same issue, just paths I happen not to go down myself. I'm
> not sure, I didn't take the time to follow it up properly, expecting
> further insight to arrive shortly from Ben!

Yeah, sorry, I've been too overextended lately...

> And, as threatened, Jeremy has further vmalloc changes queued up in
> mmotm, which certainly make the patch below inadequate, and I imagine
> the vunmap_page_range() movement too. I'm currently (well, I think most
> recent mmotm doesn't even boot on my ppc) having to disable preemption
> in the kernel case of apply_to_pte_range().
>
> What would be better for 2.6.38 and 2.6.37-stable? Moving that call to
> vunmap_page_range back under vb->lock, or the partial-Peter-patch below?
> And then what should be done for 2.6.39?

The patch is fine. I should send it to Linus. It's not like we have a
batch on the vmalloc space anyway, since it doesn't do the arch_lazy_mmu
stuff, so it's really about protecting the per-cpu variable.

Cheers,
Ben.

> --- 2.6.38-rc5/arch/powerpc/mm/tlb_hash64.c 2010-02-24 10:52:17.000000000 -0800
> +++ linux/arch/powerpc/mm/tlb_hash64.c 2011-02-15 23:27:21.000000000 -0800
> @@ -38,13 +38,11 @@ DEFINE_PER_CPU(struct ppc64_tlb_batch, p
> * neesd to be flushed. This function will either perform the flush
> * immediately or will batch it up if the current CPU has an active
> * batch on it.
> - *
> - * Must be called from within some kind of spinlock/non-preempt region...
> */
> void hpte_need_flush(struct mm_struct *mm, unsigned long addr,
> pte_t *ptep, unsigned long pte, int huge)
> {
> - struct ppc64_tlb_batch *batch = &__get_cpu_var(ppc64_tlb_batch);
> + struct ppc64_tlb_batch *batch = &get_cpu_var(ppc64_tlb_batch);
> unsigned long vsid, vaddr;
> unsigned int psize;
> int ssize;
> @@ -99,6 +97,7 @@ void hpte_need_flush(struct mm_struct *m
> */
> if (!batch->active) {
> flush_hash_page(vaddr, rpte, psize, ssize, 0);
> + put_cpu_var(ppc64_tlb_batch);
> return;
> }
>
> @@ -127,6 +126,7 @@ void hpte_need_flush(struct mm_struct *m
> batch->index = ++i;
> if (i >= PPC64_TLB_BATCH_NR)
> __flush_tlb_pending(batch);
> + put_cpu_var(ppc64_tlb_batch);
> }
>
> /*

2011-02-24 21:23:44

by Benjamin Herrenschmidt

Subject: Re: PowerPC BUG: using smp_processor_id() in preemptible code

On Thu, 2011-02-24 at 22:07 +0100, Peter Zijlstra wrote:
>
> Lovely problem :-), benh mentioned it on IRC, but I never got around to
> finding the email thread, thanks for the CC.
>
> > What would be better for 2.6.38 and 2.6.37-stable? Moving that call to
> > vunmap_page_range back under vb->lock, or the partial-Peter-patch below?
> > And then what should be done for 2.6.39?
>
> I think you'll also need the arch/powerpc/kernel/process.c changes that
> cause context switches to flush the tlb_batch queues.

I don't think that's needed here as there shall be no batching happening
on the vmalloc space, but it can't hurt to merge it regardless :-)

Cheers,
Ben.

2011-02-24 21:26:43

by Peter Zijlstra

Subject: Re: PowerPC BUG: using smp_processor_id() in preemptible code

On Fri, 2011-02-25 at 08:23 +1100, Benjamin Herrenschmidt wrote:
>
> I don't think that's needed here as there shall be no batching happening
> on the vmalloc space, but it can't hurt to merge it regardless :-)

Ah, due to the !batch->active thing? OK, then yeah Hugh's bit is
sufficient.