2011-02-03 02:49:12

by Andrea Arcangeli

Subject: Re: [PATCH] x86: hold mm->page_table_lock while doing vmalloc_sync

Hello,

Larry (CC'ed) found a problem with the patch in subject. When
USE_SPLIT_PTLOCKS is not defined (NR_CPUS == 2) it will deadlock in
ptep_clear_flush_notify in rmap.c, because it's sending IPIs with the
page_table_lock already held, and the other CPUs now spin on the
page_table_lock with irqs disabled, so the IPI never runs. With
CONFIG_TRANSPARENT_HUGEPAGE=y this deadlock happens even with
USE_SPLIT_PTLOCKS defined, so it becomes visible, but it needs to be
fixed regardless (for NR_CPUS == 2).
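
To sketch the interleaving (illustrative only):

	CPU A (rmap)				CPU B (vmalloc_sync_all)
	spin_lock(&mm->page_table_lock);
						spin_lock_irqsave(&pgd_lock, flags);
						/* irqs now disabled */
	ptep_clear_flush_notify();
	/* sends IPIs, waits for CPU B to ack */
						spin_lock(&mm->page_table_lock);
						/* spins: CPU A holds it */
	/* never completes: CPU B can't run
	   the IPI handler with irqs disabled */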

I'd like to understand why the pgd_lock needs irq disabled, it sounds
too easy that I can just remove the _irqsave, doesn't it?

A pgd_free comment says it can run from irq. page_table_lock having to
be taken there is for Xen only, but other archs also use
spin_lock_irqsave(pgd_lock), so I guess it's either common code, or
it's superfluous and not another Xen special requirement.

If we could remove that _irqsave like below it'd solve it... But
clearly something must be taking the pgd_lock from irq. (using a
rwlock would also be possible as long as nobody takes it in write mode
during irq, but if it's pgd_free that really runs in irq, that would
need the write_lock so it wouldn't be a solution).
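
For reference, the rwlock variant would look roughly like this sketch,
with the pgd_list writers confined to regular kernel context:

	static DEFINE_RWLOCK(pgd_lock);

	/* pgd_list readers like vmalloc_sync_all: irqs stay enabled,
	   so they can still nest the page_table_lock */
	read_lock(&pgd_lock);
	...
	read_unlock(&pgd_lock);

	/* pgd_list writers, pgd_alloc/pgd_dtor: process context only */
	write_lock_irqsave(&pgd_lock, flags);
	...
	write_unlock_irqrestore(&pgd_lock, flags);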

I'm trying the simple _irqsave-removal fix below and the VM_BUG_ON never
triggered yet.

In short: who takes the pgd_lock from an irq? (__mmdrop shouldn't but
maybe I'm overlooking some aio bit?)

======
Subject: fix pgd_lock deadlock

From: Andrea Arcangeli <[email protected]>

It's forbidden to take the page_table_lock with the irq disabled or if there's
contention the IPIs (for tlb flushes) sent with the page_table_lock held will
never run leading to a deadlock.

Apparently nobody takes the pgd_lock from irq so the _irqsave can be removed.

Signed-off-by: Andrea Arcangeli <[email protected]>
---
arch/x86/mm/fault.c | 7 ++++---
arch/x86/mm/init_64.c | 7 ++++---
arch/x86/mm/pageattr.c | 21 +++++++++++----------
arch/x86/mm/pgtable.c | 10 ++++++----
arch/x86/xen/mmu.c | 10 ++++------
5 files changed, 29 insertions(+), 26 deletions(-)

--- a/arch/x86/mm/fault.c
+++ b/arch/x86/mm/fault.c
@@ -230,14 +230,15 @@ void vmalloc_sync_all(void)
address >= TASK_SIZE && address < FIXADDR_TOP;
address += PMD_SIZE) {

- unsigned long flags;
struct page *page;

- spin_lock_irqsave(&pgd_lock, flags);
+ VM_BUG_ON(in_interrupt());
+ spin_lock(&pgd_lock);
list_for_each_entry(page, &pgd_list, lru) {
spinlock_t *pgt_lock;
pmd_t *ret;

+ /* the pgt_lock only for Xen */
pgt_lock = &pgd_page_get_mm(page)->page_table_lock;

spin_lock(pgt_lock);
@@ -247,7 +248,7 @@ void vmalloc_sync_all(void)
if (!ret)
break;
}
- spin_unlock_irqrestore(&pgd_lock, flags);
+ spin_unlock(&pgd_lock, flags);
}
}

--- a/arch/x86/mm/init_64.c
+++ b/arch/x86/mm/init_64.c
@@ -105,18 +105,19 @@ void sync_global_pgds(unsigned long star

for (address = start; address <= end; address += PGDIR_SIZE) {
const pgd_t *pgd_ref = pgd_offset_k(address);
- unsigned long flags;
struct page *page;

if (pgd_none(*pgd_ref))
continue;

- spin_lock_irqsave(&pgd_lock, flags);
+ VM_BUG_ON(in_interrupt());
+ spin_lock(&pgd_lock);
list_for_each_entry(page, &pgd_list, lru) {
pgd_t *pgd;
spinlock_t *pgt_lock;

pgd = (pgd_t *)page_address(page) + pgd_index(address);
+ /* the pgt_lock only for Xen */
pgt_lock = &pgd_page_get_mm(page)->page_table_lock;
spin_lock(pgt_lock);

@@ -128,7 +129,7 @@ void sync_global_pgds(unsigned long star

spin_unlock(pgt_lock);
}
- spin_unlock_irqrestore(&pgd_lock, flags);
+ spin_unlock(&pgd_lock);
}
}

--- a/arch/x86/mm/pageattr.c
+++ b/arch/x86/mm/pageattr.c
@@ -57,12 +57,11 @@ static unsigned long direct_pages_count[

void update_page_count(int level, unsigned long pages)
{
- unsigned long flags;
-
/* Protect against CPA */
- spin_lock_irqsave(&pgd_lock, flags);
+ VM_BUG_ON(in_interrupt());
+ spin_lock(&pgd_lock);
direct_pages_count[level] += pages;
- spin_unlock_irqrestore(&pgd_lock, flags);
+ spin_unlock(&pgd_lock);
}

static void split_page_count(int level)
@@ -402,7 +401,7 @@ static int
try_preserve_large_page(pte_t *kpte, unsigned long address,
struct cpa_data *cpa)
{
- unsigned long nextpage_addr, numpages, pmask, psize, flags, addr, pfn;
+ unsigned long nextpage_addr, numpages, pmask, psize, addr, pfn;
pte_t new_pte, old_pte, *tmp;
pgprot_t old_prot, new_prot, req_prot;
int i, do_split = 1;
@@ -411,7 +410,8 @@ try_preserve_large_page(pte_t *kpte, uns
if (cpa->force_split)
return 1;

- spin_lock_irqsave(&pgd_lock, flags);
+ VM_BUG_ON(in_interrupt());
+ spin_lock(&pgd_lock);
/*
* Check for races, another CPU might have split this page
* up already:
@@ -506,14 +506,14 @@ try_preserve_large_page(pte_t *kpte, uns
}

out_unlock:
- spin_unlock_irqrestore(&pgd_lock, flags);
+ spin_unlock(&pgd_lock);

return do_split;
}

static int split_large_page(pte_t *kpte, unsigned long address)
{
- unsigned long flags, pfn, pfninc = 1;
+ unsigned long pfn, pfninc = 1;
unsigned int i, level;
pte_t *pbase, *tmp;
pgprot_t ref_prot;
@@ -527,7 +527,8 @@ static int split_large_page(pte_t *kpte,
if (!base)
return -ENOMEM;

- spin_lock_irqsave(&pgd_lock, flags);
+ VM_BUG_ON(in_interrupt());
+ spin_lock(&pgd_lock);
/*
* Check for races, another CPU might have split this page
* up for us already:
@@ -599,7 +600,7 @@ out_unlock:
*/
if (base)
__free_page(base);
- spin_unlock_irqrestore(&pgd_lock, flags);
+ spin_unlock(&pgd_lock);

return 0;
}
--- a/arch/x86/mm/pgtable.c
+++ b/arch/x86/mm/pgtable.c
@@ -126,9 +126,10 @@ static void pgd_dtor(pgd_t *pgd)
if (SHARED_KERNEL_PMD)
return;

- spin_lock_irqsave(&pgd_lock, flags);
+ VM_BUG_ON(in_interrupt());
+ spin_lock(&pgd_lock);
pgd_list_del(pgd);
- spin_unlock_irqrestore(&pgd_lock, flags);
+ spin_unlock(&pgd_lock);
}

/*
@@ -280,12 +281,13 @@ pgd_t *pgd_alloc(struct mm_struct *mm)
* respect to anything walking the pgd_list, so that they
* never see a partially populated pgd.
*/
- spin_lock_irqsave(&pgd_lock, flags);
+ VM_BUG_ON(in_interrupt());
+ spin_lock(&pgd_lock);

pgd_ctor(mm, pgd);
pgd_prepopulate_pmd(mm, pgd, pmds);

- spin_unlock_irqrestore(&pgd_lock, flags);
+ spin_unlock(&pgd_lock);

return pgd;

--- a/arch/x86/xen/mmu.c
+++ b/arch/x86/xen/mmu.c
@@ -986,10 +986,9 @@ static void xen_pgd_pin(struct mm_struct
*/
void xen_mm_pin_all(void)
{
- unsigned long flags;
struct page *page;

- spin_lock_irqsave(&pgd_lock, flags);
+ spin_lock(&pgd_lock);

list_for_each_entry(page, &pgd_list, lru) {
if (!PagePinned(page)) {
@@ -998,7 +997,7 @@ void xen_mm_pin_all(void)
}
}

- spin_unlock_irqrestore(&pgd_lock, flags);
+ spin_unlock(&pgd_lock);
}

/*
@@ -1099,10 +1098,9 @@ static void xen_pgd_unpin(struct mm_stru
*/
void xen_mm_unpin_all(void)
{
- unsigned long flags;
struct page *page;

- spin_lock_irqsave(&pgd_lock, flags);
+ spin_lock(&pgd_lock);

list_for_each_entry(page, &pgd_list, lru) {
if (PageSavePinned(page)) {
@@ -1112,7 +1110,7 @@ void xen_mm_unpin_all(void)
}
}

- spin_unlock_irqrestore(&pgd_lock, flags);
+ spin_unlock(&pgd_lock);
}

void xen_activate_mm(struct mm_struct *prev, struct mm_struct *next)


2011-02-03 20:44:09

by Jeremy Fitzhardinge

Subject: Re: [PATCH] x86: hold mm->page_table_lock while doing vmalloc_sync

On 02/02/2011 06:48 PM, Andrea Arcangeli wrote:
> Hello,
>
> Larry (CC'ed) found a problem with the patch in subject. When
> USE_SPLIT_PTLOCKS is not defined (NR_CPUS == 2) it will deadlock in
> ptep_clear_flush_notify in rmap.c because it's sending IPIs with the
> page_table_lock already held, and the other CPUs now spins on the
> page_table_lock with irq disabled, so the IPI never runs. With
> CONFIG_TRANSPARENT_HUGEPAGE=y this deadlocks happens even with
> USE_SPLIT_PTLOCKS defined so it become visible but it needs to be
> fixed regardless (for NR_CPUS == 2).

What's "it" here? Do you mean vmalloc_sync_all? vmalloc_sync_one?
What's the callchain?

> I'd like to understand why the pgd_lock needs irq disabled, it sounds
> too easy that I can just remove the _irqsave, doesn't it?
>
> A pgd_free comment says it can run from irq. page_table_lock having to
> be taken there is for Xen only, but other archs also uses
> spin_lock_irqsave(pgd_lock) so I guess it's either common code, or
> it's superfluous and not another Xen special requirement.

There's no special Xen requirement here.

> If we could remove that _irqsave like below it'd solve it... But
> clearly something must be taking the pgd_lock from irq. (using a
> rwlock would also be possible as long as nobody takes it in write mode
> during irq, but if it's pgd_free that really runs in irq, that would
> need the write_lock so it wouldn't be a solution).

mmdrop() can be called from interrupt context, but I don't know if it
will ever drop the last reference from interrupt, so maybe you can get
away with it.

> I'm trying this fix and the VM_BUG_ON never triggered yet.
>
> In short: who takes the pgd_lock from an irq? (__mmdrop shouldn't but
> maybe I'm overlooking some aio bit?)
>
> ======
> Subject: fix pgd_lock deadlock
>
> From: Andrea Arcangeli <[email protected]>
>
> It's forbidden to take the page_table_lock with the irq disabled or if there's
> contention the IPIs (for tlb flushes) sent with the page_table_lock held will
> never run leading to a deadlock.
>
> Apparently nobody takes the pgd_lock from irq so the _irqsave can be removed.

I'm pretty sure this is OK from a Xen perspective.

> Signed-off-by: Andrea Arcangeli <[email protected]>
> ---
> arch/x86/mm/fault.c | 7 ++++---
> arch/x86/mm/init_64.c | 7 ++++---
> arch/x86/mm/pageattr.c | 21 +++++++++++----------
> arch/x86/mm/pgtable.c | 10 ++++++----
> arch/x86/xen/mmu.c | 10 ++++------
> 5 files changed, 29 insertions(+), 26 deletions(-)
>
> --- a/arch/x86/mm/fault.c
> +++ b/arch/x86/mm/fault.c
> @@ -230,14 +230,15 @@ void vmalloc_sync_all(void)
> address >= TASK_SIZE && address < FIXADDR_TOP;
> address += PMD_SIZE) {
>
> - unsigned long flags;
> struct page *page;
>
> - spin_lock_irqsave(&pgd_lock, flags);
> + VM_BUG_ON(in_interrupt());
> + spin_lock(&pgd_lock);
> list_for_each_entry(page, &pgd_list, lru) {
> spinlock_t *pgt_lock;
> pmd_t *ret;
>
> + /* the pgt_lock only for Xen */
> pgt_lock = &pgd_page_get_mm(page)->page_table_lock;
>
> spin_lock(pgt_lock);
> @@ -247,7 +248,7 @@ void vmalloc_sync_all(void)
> if (!ret)
> break;
> }
> - spin_unlock_irqrestore(&pgd_lock, flags);
> + spin_unlock(&pgd_lock, flags);

Urp. Did this compile?

Thanks,
J

> }
> }
>
> --- a/arch/x86/mm/init_64.c
> +++ b/arch/x86/mm/init_64.c
> @@ -105,18 +105,19 @@ void sync_global_pgds(unsigned long star
>
> for (address = start; address <= end; address += PGDIR_SIZE) {
> const pgd_t *pgd_ref = pgd_offset_k(address);
> - unsigned long flags;
> struct page *page;
>
> if (pgd_none(*pgd_ref))
> continue;
>
> - spin_lock_irqsave(&pgd_lock, flags);
> + VM_BUG_ON(in_interrupt());
> + spin_lock(&pgd_lock);
> list_for_each_entry(page, &pgd_list, lru) {
> pgd_t *pgd;
> spinlock_t *pgt_lock;
>
> pgd = (pgd_t *)page_address(page) + pgd_index(address);
> + /* the pgt_lock only for Xen */
> pgt_lock = &pgd_page_get_mm(page)->page_table_lock;
> spin_lock(pgt_lock);
>
> @@ -128,7 +129,7 @@ void sync_global_pgds(unsigned long star
>
> spin_unlock(pgt_lock);
> }
> - spin_unlock_irqrestore(&pgd_lock, flags);
> + spin_unlock(&pgd_lock);
> }
> }
>
> --- a/arch/x86/mm/pageattr.c
> +++ b/arch/x86/mm/pageattr.c
> @@ -57,12 +57,11 @@ static unsigned long direct_pages_count[
>
> void update_page_count(int level, unsigned long pages)
> {
> - unsigned long flags;
> -
> /* Protect against CPA */
> - spin_lock_irqsave(&pgd_lock, flags);
> + VM_BUG_ON(in_interrupt());
> + spin_lock(&pgd_lock);
> direct_pages_count[level] += pages;
> - spin_unlock_irqrestore(&pgd_lock, flags);
> + spin_unlock(&pgd_lock);
> }
>
> static void split_page_count(int level)
> @@ -402,7 +401,7 @@ static int
> try_preserve_large_page(pte_t *kpte, unsigned long address,
> struct cpa_data *cpa)
> {
> - unsigned long nextpage_addr, numpages, pmask, psize, flags, addr, pfn;
> + unsigned long nextpage_addr, numpages, pmask, psize, addr, pfn;
> pte_t new_pte, old_pte, *tmp;
> pgprot_t old_prot, new_prot, req_prot;
> int i, do_split = 1;
> @@ -411,7 +410,8 @@ try_preserve_large_page(pte_t *kpte, uns
> if (cpa->force_split)
> return 1;
>
> - spin_lock_irqsave(&pgd_lock, flags);
> + VM_BUG_ON(in_interrupt());
> + spin_lock(&pgd_lock);
> /*
> * Check for races, another CPU might have split this page
> * up already:
> @@ -506,14 +506,14 @@ try_preserve_large_page(pte_t *kpte, uns
> }
>
> out_unlock:
> - spin_unlock_irqrestore(&pgd_lock, flags);
> + spin_unlock(&pgd_lock);
>
> return do_split;
> }
>
> static int split_large_page(pte_t *kpte, unsigned long address)
> {
> - unsigned long flags, pfn, pfninc = 1;
> + unsigned long pfn, pfninc = 1;
> unsigned int i, level;
> pte_t *pbase, *tmp;
> pgprot_t ref_prot;
> @@ -527,7 +527,8 @@ static int split_large_page(pte_t *kpte,
> if (!base)
> return -ENOMEM;
>
> - spin_lock_irqsave(&pgd_lock, flags);
> + VM_BUG_ON(in_interrupt());
> + spin_lock(&pgd_lock);
> /*
> * Check for races, another CPU might have split this page
> * up for us already:
> @@ -599,7 +600,7 @@ out_unlock:
> */
> if (base)
> __free_page(base);
> - spin_unlock_irqrestore(&pgd_lock, flags);
> + spin_unlock(&pgd_lock);
>
> return 0;
> }
> --- a/arch/x86/mm/pgtable.c
> +++ b/arch/x86/mm/pgtable.c
> @@ -126,9 +126,10 @@ static void pgd_dtor(pgd_t *pgd)
> if (SHARED_KERNEL_PMD)
> return;
>
> - spin_lock_irqsave(&pgd_lock, flags);
> + VM_BUG_ON(in_interrupt());
> + spin_lock(&pgd_lock);
> pgd_list_del(pgd);
> - spin_unlock_irqrestore(&pgd_lock, flags);
> + spin_unlock(&pgd_lock);
> }
>
> /*
> @@ -280,12 +281,13 @@ pgd_t *pgd_alloc(struct mm_struct *mm)
> * respect to anything walking the pgd_list, so that they
> * never see a partially populated pgd.
> */
> - spin_lock_irqsave(&pgd_lock, flags);
> + VM_BUG_ON(in_interrupt());
> + spin_lock(&pgd_lock);
>
> pgd_ctor(mm, pgd);
> pgd_prepopulate_pmd(mm, pgd, pmds);
>
> - spin_unlock_irqrestore(&pgd_lock, flags);
> + spin_unlock(&pgd_lock);
>
> return pgd;
>
> --- a/arch/x86/xen/mmu.c
> +++ b/arch/x86/xen/mmu.c
> @@ -986,10 +986,9 @@ static void xen_pgd_pin(struct mm_struct
> */
> void xen_mm_pin_all(void)
> {
> - unsigned long flags;
> struct page *page;
>
> - spin_lock_irqsave(&pgd_lock, flags);
> + spin_lock(&pgd_lock);
>
> list_for_each_entry(page, &pgd_list, lru) {
> if (!PagePinned(page)) {
> @@ -998,7 +997,7 @@ void xen_mm_pin_all(void)
> }
> }
>
> - spin_unlock_irqrestore(&pgd_lock, flags);
> + spin_unlock(&pgd_lock);
> }
>
> /*
> @@ -1099,10 +1098,9 @@ static void xen_pgd_unpin(struct mm_stru
> */
> void xen_mm_unpin_all(void)
> {
> - unsigned long flags;
> struct page *page;
>
> - spin_lock_irqsave(&pgd_lock, flags);
> + spin_lock(&pgd_lock);
>
> list_for_each_entry(page, &pgd_list, lru) {
> if (PageSavePinned(page)) {
> @@ -1112,7 +1110,7 @@ void xen_mm_unpin_all(void)
> }
> }
>
> - spin_unlock_irqrestore(&pgd_lock, flags);
> + spin_unlock(&pgd_lock);
> }
>
> void xen_activate_mm(struct mm_struct *prev, struct mm_struct *next)
>

2011-02-03 20:59:42

by Larry Woodman

Subject: Re: [PATCH] x86: hold mm->page_table_lock while doing vmalloc_sync

On 02/02/2011 09:48 PM, Andrea Arcangeli wrote:
> Hello,
>
> Larry (CC'ed) found a problem with the patch in subject. When
> USE_SPLIT_PTLOCKS is not defined (NR_CPUS == 2) it will deadlock in
> ptep_clear_flush_notify in rmap.c because it's sending IPIs with the
> page_table_lock already held, and the other CPUs now spins on the
> page_table_lock with irq disabled, so the IPI never runs. With
> CONFIG_TRANSPARENT_HUGEPAGE=y this deadlocks happens even with
> USE_SPLIT_PTLOCKS defined so it become visible but it needs to be
> fixed regardless (for NR_CPUS == 2).
>
> I'd like to understand why the pgd_lock needs irq disabled, it sounds
> too easy that I can just remove the _irqsave, doesn't it?
>
> A pgd_free comment says it can run from irq. page_table_lock having to
> be taken there is for Xen only, but other archs also uses
> spin_lock_irqsave(pgd_lock) so I guess it's either common code, or
> it's superfluous and not another Xen special requirement.
>
> If we could remove that _irqsave like below it'd solve it... But
> clearly something must be taking the pgd_lock from irq. (using a
> rwlock would also be possible as long as nobody takes it in write mode
> during irq, but if it's pgd_free that really runs in irq, that would
> need the write_lock so it wouldn't be a solution).
>
> I'm trying this fix and the VM_BUG_ON never triggered yet.
>
> In short: who takes the pgd_lock from an irq? (__mmdrop shouldn't but
> maybe I'm overlooking some aio bit?)
>
>
Here are the specifics:

The problem is with THP. The page reclaim code calls page_referenced_one()
which takes the mm->page_table_lock on one CPU before sending an IPI to other
CPU(s):

On CPU1 we take the mm->page_table_lock, send IPIs and wait for a response:
page_referenced_one(...)
	if (unlikely(PageTransHuge(page))) {
		pmd_t *pmd;

		spin_lock(&mm->page_table_lock);
		pmd = page_check_address_pmd(page, mm, address,
					     PAGE_CHECK_ADDRESS_PMD_FLAG);
		if (pmd && !pmd_trans_splitting(*pmd) &&
		    pmdp_clear_flush_young_notify(vma, address, pmd))
			referenced++;
		spin_unlock(&mm->page_table_lock);
	} else {


CPU2 can race in vmalloc_sync_all() because it disables interrupts
(preventing a response to the IPI from CPU1) and takes the pgd_lock,
then spins on the mm->page_table_lock which is already held on CPU1.

	spin_lock_irqsave(&pgd_lock, flags);
	list_for_each_entry(page, &pgd_list, lru) {
		pgd_t *pgd;
		spinlock_t *pgt_lock;

		pgd = (pgd_t *)page_address(page) + pgd_index(address);

		pgt_lock = &pgd_page_get_mm(page)->page_table_lock;
		spin_lock(pgt_lock);


At this point the system is deadlocked. The pmdp_clear_flush_young_notify
needs to do its PMD business with the page_table_lock held, then release
that lock before sending the IPIs to the other CPUs.
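
Something along these lines would be the idea (just an illustrative
sketch, using the non-flushing pmdp_test_and_clear_young() helper and
leaving the mmu notifier hook out):

	spin_lock(&mm->page_table_lock);
	pmd = page_check_address_pmd(page, mm, address,
				     PAGE_CHECK_ADDRESS_PMD_FLAG);
	young = pmd && !pmd_trans_splitting(*pmd) &&
		pmdp_test_and_clear_young(vma, address, pmd);
	spin_unlock(&mm->page_table_lock);
	if (young) {
		/* the IPIs go out only after the lock is dropped */
		flush_tlb_range(vma, address, address + HPAGE_PMD_SIZE);
		referenced++;
	}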

2011-02-04 01:21:26

by Andrea Arcangeli

Subject: Re: [PATCH] x86: hold mm->page_table_lock while doing vmalloc_sync

On Thu, Feb 03, 2011 at 12:44:02PM -0800, Jeremy Fitzhardinge wrote:
> On 02/02/2011 06:48 PM, Andrea Arcangeli wrote:
> > Hello,
> >
> > Larry (CC'ed) found a problem with the patch in subject. When
> > USE_SPLIT_PTLOCKS is not defined (NR_CPUS == 2) it will deadlock in
> > ptep_clear_flush_notify in rmap.c because it's sending IPIs with the
> > page_table_lock already held, and the other CPUs now spins on the
> > page_table_lock with irq disabled, so the IPI never runs. With
> > CONFIG_TRANSPARENT_HUGEPAGE=y this deadlocks happens even with
> > USE_SPLIT_PTLOCKS defined so it become visible but it needs to be
> > fixed regardless (for NR_CPUS == 2).
>
> What's "it" here? Do you mean vmalloc_sync_all? vmalloc_sync_one?
> What's the callchain?

Larry just answered that. If something is unclear let me know. I
never reproduced it, but it can also happen without THP enabled; you
just need to set NR_CPUS to 2 during "make menuconfig".

> > spin_lock_irqsave(pgd_lock) so I guess it's either common code, or
> > it's superfluous and not another Xen special requirement.
>
> There's no special Xen requirement here.

That was my thought too considering the other archs...

> mmdrop() can be called from interrupt context, but I don't know if it
> will ever drop the last reference from interrupt, so maybe you can get
> away with it.

Yes, the issue is __mmdrop, so it'd be nice to figure out whether
__mmdrop can also run from irq (or only the mmdrop fast path, which
would be safe even without _irqsave).
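
(The chain I have in mind being, roughly:

	mmdrop(mm)		/* if it drops the last reference */
	  __mmdrop(mm)
	    mm_free_pgd(mm)
	      pgd_free(mm, mm->pgd)
	        pgd_dtor(pgd)	/* takes pgd_lock */

so the question is whether that top-level mmdrop ever happens in irq
context.)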

Is this a Xen-only thing? Or is mmdrop called from regular
Linux? Considering other archs also use _irqsave I assume it's common
code calling mmdrop (otherwise it would mean they cut-and-pasted a Xen
dependency). This comment doesn't really tell me much.

static void pgd_dtor(pgd_t *pgd)
{
unsigned long flags; /* can be called from interrupt context */

if (SHARED_KERNEL_PMD)
return;

VM_BUG_ON(in_interrupt());
spin_lock(&pgd_lock);

This comment implies that __mmdrop itself can be called from irq context,
not just mmdrop. But I didn't find where yet... Can you tell me?

> > @@ -247,7 +248,7 @@ void vmalloc_sync_all(void)
> > if (!ret)
> > break;
> > }
> > - spin_unlock_irqrestore(&pgd_lock, flags);
> > + spin_unlock(&pgd_lock, flags);
>
> Urp. Did this compile?

Yes, it builds and it also still runs fine (I left it running since I
posted the email and no problems yet, but this may not be reproducible
and we really need to know who calls __mmdrop from irq context to
tell). The above is under CONFIG_X86_32 and I did a 64bit build ;).

I'm not reposting a version that builds for 32bit x86 too until we
figure out the mmdrop thing...

Thanks,
Andrea

2011-02-04 21:27:41

by Jeremy Fitzhardinge

Subject: Re: [PATCH] x86: hold mm->page_table_lock while doing vmalloc_sync

On 02/03/2011 05:21 PM, Andrea Arcangeli wrote:
> On Thu, Feb 03, 2011 at 12:44:02PM -0800, Jeremy Fitzhardinge wrote:
>> On 02/02/2011 06:48 PM, Andrea Arcangeli wrote:
>>> Hello,
>>>
>>> Larry (CC'ed) found a problem with the patch in subject. When
>>> USE_SPLIT_PTLOCKS is not defined (NR_CPUS == 2) it will deadlock in
>>> ptep_clear_flush_notify in rmap.c because it's sending IPIs with the
>>> page_table_lock already held, and the other CPUs now spins on the
>>> page_table_lock with irq disabled, so the IPI never runs. With
>>> CONFIG_TRANSPARENT_HUGEPAGE=y this deadlocks happens even with
>>> USE_SPLIT_PTLOCKS defined so it become visible but it needs to be
>>> fixed regardless (for NR_CPUS == 2).
>> What's "it" here? Do you mean vmalloc_sync_all? vmalloc_sync_one?
>> What's the callchain?
> Larry just answered to that. If something is unclear let me know. I
> never reproduced it, but it also can happen without THP enabled, you
> just need to set NR_CPUS to 2 during "make menuconfig".
>
>>> spin_lock_irqsave(pgd_lock) so I guess it's either common code, or
>>> it's superfluous and not another Xen special requirement.
>> There's no special Xen requirement here.
> That was my thought too considering the other archs...
>
>> mmdrop() can be called from interrupt context, but I don't know if it
>> will ever drop the last reference from interrupt, so maybe you can get
>> away with it.
> Yes the issue is __mmdrop, so it'd be nice to figure if __mmdrop can
> also run from irq (or only mmdrop fast path which would be safe even
> without _irqsave).
>
> Is this a Xen only thing? Or is mmdrop called from regular
> linux. Considering other archs also _irqsave I assume it's common code
> calling mmdrop (otherwise it means they cut-and-pasted a Xen
> dependency). This comment doesn't really tell me much.

No, I don't think there's any xen-specific code which calls mmdrop (at
all, let alone in interrupt context). Erm, but I'm not sure where it
does. I had a thinko that "schedule" would be one of those places, but
calling that from interrupt context would cause much bigger problems :/
> static void pgd_dtor(pgd_t *pgd)
> {
> unsigned long flags; /* can be called from interrupt context */
>
> if (SHARED_KERNEL_PMD)
> return;
>
> VM_BUG_ON(in_interrupt());
> spin_lock(&pgd_lock);
>
> This comment tells the very __mmdrop can be called from irq context,
> not just mmdrop. But I didn't find where yet... Can you tell me?

No. I don't think I wrote that comment. It's possibly just some ancient
lore that could have been correct at one point, or perhaps it was never true.

>>> @@ -247,7 +248,7 @@ void vmalloc_sync_all(void)
>>> if (!ret)
>>> break;
>>> }
>>> - spin_unlock_irqrestore(&pgd_lock, flags);
>>> + spin_unlock(&pgd_lock, flags);
>> Urp. Did this compile?
> Yes it builds

(spin_unlock() shouldn't take a "flags" arg.)


> I'm not reposting a version that builds for 32bit x86 too until we
> figure out the mmdrop thing...

Stick it in next and look for explosion reports?

J

2011-02-07 23:20:54

by Andrea Arcangeli

Subject: Re: [PATCH] x86: hold mm->page_table_lock while doing vmalloc_sync

On Fri, Feb 04, 2011 at 01:27:33PM -0800, Jeremy Fitzhardinge wrote:
> No, I don't think there's any xen-specific code which calls mmdrop (at
> all, let alone in interrupt context). Erm, but I'm not sure where it
> does. I had a thinko that "schedule" would be one of those places, but
> calling that from interrupt context would cause much bigger problems :/
> > static void pgd_dtor(pgd_t *pgd)

I checked again and I don't see exactly where mmdrop or __mmdrop are
called from irq context.

> No. I don't think I wrote that comment. It possibly just some ancient
> lore that could have been correct at one point, or perhaps it never true.

I agree with that. But it'd be nice if more people could look into
that so we can at least remove the misleading comment.

Where else can the pgd_lock be taken from irq context? Can we fix the
deadlock with NR_CPUS < 4 with my patch (with the ",flags" removed from below)?


>
> >>> @@ -247,7 +248,7 @@ void vmalloc_sync_all(void)
> >>> if (!ret)
> >>> break;
> >>> }
> >>> - spin_unlock_irqrestore(&pgd_lock, flags);
> >>> + spin_unlock(&pgd_lock, flags);
> >> Urp. Did this compile?
> > Yes it builds
>
> (spin_unlock() shouldn't take a "flags" arg.)
>
>
> > I'm not reposting a version that builds for 32bit x86 too until we
> > figure out the mmdrop thing...
>
> Stick it in next and look for explosion reports?

I intended to correct that of course, I just meant it is no problem
for 64bit builds and that's why I didn't notice the build failure
before posting the patch. Clearly a 32bit build would have failed ;).

2011-02-15 19:07:57

by Andrea Arcangeli

Subject: [PATCH] fix pgd_lock deadlock

Hello,

Without this patch we can deadlock in the page_table_lock with NR_CPUS
< 4 or THP on; with this patch we hopefully won't deadlock in the
pgd_lock (if taken from irq). I can't see anything taking it from irq
(maybe aio? to check, I also tried the libaio testsuite with no apparent
VM_BUG_ON triggering), so unless somebody sees it, I think we should
apply it. I've been running for a while with this patch applied
without apparent problems. Other archs may follow suit if it's proven
that there's nothing taking the pgd_lock from irq.

===
Subject: fix pgd_lock deadlock

From: Andrea Arcangeli <[email protected]>

It's forbidden to take the page_table_lock with the irq disabled or if there's
contention the IPIs (for tlb flushes) sent with the page_table_lock held will
never run leading to a deadlock.

Apparently nobody takes the pgd_lock from irq so the _irqsave can be removed.

Signed-off-by: Andrea Arcangeli <[email protected]>
---
arch/x86/mm/fault.c | 8 ++++----
arch/x86/mm/init_64.c | 7 ++++---
arch/x86/mm/pageattr.c | 21 +++++++++++----------
arch/x86/mm/pgtable.c | 13 ++++++-------
arch/x86/xen/mmu.c | 10 ++++------
5 files changed, 29 insertions(+), 30 deletions(-)

--- a/arch/x86/mm/fault.c
+++ b/arch/x86/mm/fault.c
@@ -229,15 +229,15 @@ void vmalloc_sync_all(void)
for (address = VMALLOC_START & PMD_MASK;
address >= TASK_SIZE && address < FIXADDR_TOP;
address += PMD_SIZE) {
-
- unsigned long flags;
struct page *page;

- spin_lock_irqsave(&pgd_lock, flags);
+ VM_BUG_ON(in_interrupt());
+ spin_lock(&pgd_lock);
list_for_each_entry(page, &pgd_list, lru) {
spinlock_t *pgt_lock;
pmd_t *ret;

+ /* the pgt_lock only for Xen */
pgt_lock = &pgd_page_get_mm(page)->page_table_lock;

spin_lock(pgt_lock);
@@ -247,7 +247,7 @@ void vmalloc_sync_all(void)
if (!ret)
break;
}
- spin_unlock_irqrestore(&pgd_lock, flags);
+ spin_unlock(&pgd_lock);
}
}

--- a/arch/x86/mm/init_64.c
+++ b/arch/x86/mm/init_64.c
@@ -105,18 +105,19 @@ void sync_global_pgds(unsigned long star

for (address = start; address <= end; address += PGDIR_SIZE) {
const pgd_t *pgd_ref = pgd_offset_k(address);
- unsigned long flags;
struct page *page;

if (pgd_none(*pgd_ref))
continue;

- spin_lock_irqsave(&pgd_lock, flags);
+ VM_BUG_ON(in_interrupt());
+ spin_lock(&pgd_lock);
list_for_each_entry(page, &pgd_list, lru) {
pgd_t *pgd;
spinlock_t *pgt_lock;

pgd = (pgd_t *)page_address(page) + pgd_index(address);
+ /* the pgt_lock only for Xen */
pgt_lock = &pgd_page_get_mm(page)->page_table_lock;
spin_lock(pgt_lock);

@@ -128,7 +129,7 @@ void sync_global_pgds(unsigned long star

spin_unlock(pgt_lock);
}
- spin_unlock_irqrestore(&pgd_lock, flags);
+ spin_unlock(&pgd_lock);
}
}

--- a/arch/x86/mm/pageattr.c
+++ b/arch/x86/mm/pageattr.c
@@ -57,12 +57,11 @@ static unsigned long direct_pages_count[

void update_page_count(int level, unsigned long pages)
{
- unsigned long flags;
-
/* Protect against CPA */
- spin_lock_irqsave(&pgd_lock, flags);
+ VM_BUG_ON(in_interrupt());
+ spin_lock(&pgd_lock);
direct_pages_count[level] += pages;
- spin_unlock_irqrestore(&pgd_lock, flags);
+ spin_unlock(&pgd_lock);
}

static void split_page_count(int level)
@@ -394,7 +393,7 @@ static int
try_preserve_large_page(pte_t *kpte, unsigned long address,
struct cpa_data *cpa)
{
- unsigned long nextpage_addr, numpages, pmask, psize, flags, addr, pfn;
+ unsigned long nextpage_addr, numpages, pmask, psize, addr, pfn;
pte_t new_pte, old_pte, *tmp;
pgprot_t old_prot, new_prot, req_prot;
int i, do_split = 1;
@@ -403,7 +402,8 @@ try_preserve_large_page(pte_t *kpte, uns
if (cpa->force_split)
return 1;

- spin_lock_irqsave(&pgd_lock, flags);
+ VM_BUG_ON(in_interrupt());
+ spin_lock(&pgd_lock);
/*
* Check for races, another CPU might have split this page
* up already:
@@ -498,14 +498,14 @@ try_preserve_large_page(pte_t *kpte, uns
}

out_unlock:
- spin_unlock_irqrestore(&pgd_lock, flags);
+ spin_unlock(&pgd_lock);

return do_split;
}

static int split_large_page(pte_t *kpte, unsigned long address)
{
- unsigned long flags, pfn, pfninc = 1;
+ unsigned long pfn, pfninc = 1;
unsigned int i, level;
pte_t *pbase, *tmp;
pgprot_t ref_prot;
@@ -519,7 +519,8 @@ static int split_large_page(pte_t *kpte,
if (!base)
return -ENOMEM;

- spin_lock_irqsave(&pgd_lock, flags);
+ VM_BUG_ON(in_interrupt());
+ spin_lock(&pgd_lock);
/*
* Check for races, another CPU might have split this page
* up for us already:
@@ -591,7 +592,7 @@ out_unlock:
*/
if (base)
__free_page(base);
- spin_unlock_irqrestore(&pgd_lock, flags);
+ spin_unlock(&pgd_lock);

return 0;
}
--- a/arch/x86/mm/pgtable.c
+++ b/arch/x86/mm/pgtable.c
@@ -121,14 +121,13 @@ static void pgd_ctor(struct mm_struct *m

static void pgd_dtor(pgd_t *pgd)
{
- unsigned long flags; /* can be called from interrupt context */
-
if (SHARED_KERNEL_PMD)
return;

- spin_lock_irqsave(&pgd_lock, flags);
+ VM_BUG_ON(in_interrupt());
+ spin_lock(&pgd_lock);
pgd_list_del(pgd);
- spin_unlock_irqrestore(&pgd_lock, flags);
+ spin_unlock(&pgd_lock);
}

/*
@@ -260,7 +259,6 @@ pgd_t *pgd_alloc(struct mm_struct *mm)
{
pgd_t *pgd;
pmd_t *pmds[PREALLOCATED_PMDS];
- unsigned long flags;

pgd = (pgd_t *)__get_free_page(PGALLOC_GFP);

@@ -280,12 +278,13 @@ pgd_t *pgd_alloc(struct mm_struct *mm)
* respect to anything walking the pgd_list, so that they
* never see a partially populated pgd.
*/
- spin_lock_irqsave(&pgd_lock, flags);
+ VM_BUG_ON(in_interrupt());
+ spin_lock(&pgd_lock);

pgd_ctor(mm, pgd);
pgd_prepopulate_pmd(mm, pgd, pmds);

- spin_unlock_irqrestore(&pgd_lock, flags);
+ spin_unlock(&pgd_lock);

return pgd;

--- a/arch/x86/xen/mmu.c
+++ b/arch/x86/xen/mmu.c
@@ -986,10 +986,9 @@ static void xen_pgd_pin(struct mm_struct
*/
void xen_mm_pin_all(void)
{
- unsigned long flags;
struct page *page;

- spin_lock_irqsave(&pgd_lock, flags);
+ spin_lock(&pgd_lock);

list_for_each_entry(page, &pgd_list, lru) {
if (!PagePinned(page)) {
@@ -998,7 +997,7 @@ void xen_mm_pin_all(void)
}
}

- spin_unlock_irqrestore(&pgd_lock, flags);
+ spin_unlock(&pgd_lock);
}

/*
@@ -1099,10 +1098,9 @@ static void xen_pgd_unpin(struct mm_stru
*/
void xen_mm_unpin_all(void)
{
- unsigned long flags;
struct page *page;

- spin_lock_irqsave(&pgd_lock, flags);
+ spin_lock(&pgd_lock);

list_for_each_entry(page, &pgd_list, lru) {
if (PageSavePinned(page)) {
@@ -1112,7 +1110,7 @@ void xen_mm_unpin_all(void)
}
}

- spin_unlock_irqrestore(&pgd_lock, flags);
+ spin_unlock(&pgd_lock);
}

void xen_activate_mm(struct mm_struct *prev, struct mm_struct *next)

2011-02-15 19:28:11

by Thomas Gleixner

Subject: Re: [PATCH] fix pgd_lock deadlock

On Tue, 15 Feb 2011, Andrea Arcangeli wrote:

> Hello,
>
> Without this patch we can deadlock in the page_table_lock with NR_CPUS
> < 4 or THP on, with this patch we hopefully won't deadlock in the
> pgd_lock (if taken from irq). I can't see anything taking it from irq
> (maybe aio? to check I also tried the libaio testuite with no apparent
> VM_BUG_ON triggering), so unless somebody sees it, I think we should
> apply it. I've been running for a while with this patch applied
> without apparent problems. Other archs may follow suit if it's proven
> that there's nothing taking the pgd_lock from irq.
>
> ===
> Subject: fix pgd_lock deadlock
>
> From: Andrea Arcangeli <[email protected]>
>
> It's forbidden to take the page_table_lock with the irq disabled or if there's
> contention the IPIs (for tlb flushes) sent with the page_table_lock held will
> never run leading to a deadlock.

I really read this thing 5 times and still cannot make any sense of it.

You talk about page_table_lock and then fiddle with pgd_lock.

-ENOSENSE

tglx

2011-02-15 19:55:46

by Andrea Arcangeli

Subject: Re: [PATCH] fix pgd_lock deadlock

On Tue, Feb 15, 2011 at 08:26:51PM +0100, Thomas Gleixner wrote:
> On Tue, 15 Feb 2011, Andrea Arcangeli wrote:
>
> > Hello,
> >
> > Without this patch we can deadlock in the page_table_lock with NR_CPUS
> > < 4 or THP on, with this patch we hopefully won't deadlock in the
> > pgd_lock (if taken from irq). I can't see anything taking it from irq
> > (maybe aio? to check I also tried the libaio testuite with no apparent
> > VM_BUG_ON triggering), so unless somebody sees it, I think we should
> > apply it. I've been running for a while with this patch applied
> > without apparent problems. Other archs may follow suit if it's proven
> > that there's nothing taking the pgd_lock from irq.
> >
> > ===
> > Subject: fix pgd_lock deadlock
> >
> > From: Andrea Arcangeli <[email protected]>
> >
> > It's forbidden to take the page_table_lock with the irq disabled or if there's
> > contention the IPIs (for tlb flushes) sent with the page_table_lock held will
> > never run leading to a deadlock.
>
> I really read this thing 5 times and still cannot make any sense of it.
>
> You talk about page_table_lock and then fiddle with pgd_lock.
>
> -ENOSENSE

With NR_CPUS < 4, or with THP enabled, rmap.c will do
spin_lock(&mm->page_table_lock) (or pte_offset_map_lock, where the lock
is still mm->page_table_lock and not the PT lock). Then it will send
IPIs to flush the tlb of the other CPUs.

But the other CPU is running vmalloc_sync_all, and it is trying to
take the page_table_lock with irqs disabled. It will never take the
lock because the CPU waiting for the IPI delivery holds it. And it will
never run the IPI because it has irqs disabled.

Now the big question is whether anything takes the pgd_lock from
irqs. Normal testing could never reveal it, as even if it happens it
has only a slim chance of happening while the pgd_lock is already held
by normal kernel context. But the VM_BUG_ON(in_interrupt()) should
hopefully have revealed it already if it ever happened.

Clearly we could try to fix it in other ways, but still, if there's no
reason to do the _irqsave, applying my fix sounds like a good idea
anyway.

2011-02-15 20:06:39

by Thomas Gleixner

Subject: Re: [PATCH] fix pgd_lock deadlock

On Tue, 15 Feb 2011, Andrea Arcangeli wrote:
> On Tue, Feb 15, 2011 at 08:26:51PM +0100, Thomas Gleixner wrote:
>
> With NR_CPUs < 4, or with THP enabled, rmap.c will do
> spin_lock(&mm->page_table_lock) (or pte_offset_map_lock where the lock
> is still mm->page_table_lock and not the PT lock). Then it will send
> IPIs to flush the tlb of the other CPUs.
>
> But the other CPU is running the vmalloc_sync_all, and it is trying to
> take the page_table_lock with irq disabled. It will never take the
> lock because the CPU waiting the IPI delivery holds it. And it will
> never run the IPI because it has irqs disabled.

Ok, that makes sense :)

> Now the big question is if anything is taking the pgd_lock from
> irqs. Normal testing could never reveal it as even if it happens it
> has a slim chance to happen while the pgd_lock is already hold by
> normal kernel context. But the VM_BUG_ON(in_interrupt()) should
> hopefully have revealed it already if it ever happened, I hope.
>
> Clearly we could try to fix it in other ways, but still if there's no
> reason to do the _irqsave this sounds a good idea to apply my fix
> anyway.

Did you try with DEBUG_PAGEALLOC, which is calling into cpa quite a
lot?

Thanks,

tglx

2011-02-15 20:27:45

by Thomas Gleixner

Subject: Re: [PATCH] fix pgd_lock deadlock

On Tue, 15 Feb 2011, Thomas Gleixner wrote:

> On Tue, 15 Feb 2011, Andrea Arcangeli wrote:
> > On Tue, Feb 15, 2011 at 08:26:51PM +0100, Thomas Gleixner wrote:
> >
> > With NR_CPUs < 4, or with THP enabled, rmap.c will do
> > spin_lock(&mm->page_table_lock) (or pte_offset_map_lock where the lock
> > is still mm->page_table_lock and not the PT lock). Then it will send
> > IPIs to flush the tlb of the other CPUs.
> >
> > But the other CPU is running the vmalloc_sync_all, and it is trying to
> > take the page_table_lock with irq disabled. It will never take the
> > lock because the CPU waiting the IPI delivery holds it. And it will
> > never run the IPI because it has irqs disabled.
>
> Ok, that makes sense :)
>
> > Now the big question is if anything is taking the pgd_lock from
> > irqs. Normal testing could never reveal it as even if it happens it
> > has a slim chance to happen while the pgd_lock is already hold by
> > normal kernel context. But the VM_BUG_ON(in_interrupt()) should
> > hopefully have revealed it already if it ever happened, I hope.
> >
> > Clearly we could try to fix it in other ways, but still if there's no
> > reason to do the _irqsave this sounds a good idea to apply my fix
> > anyway.
>
> Did you try with DEBUG_PAGEALLOC, which is calling into cpa quite a
> lot?

Another thing. You check for in_interrupt(), but what makes sure that
the code which takes pgd_lock is never taken with interrupts disabled
except during early boot ?

Thanks,

tglx

2011-02-15 22:53:29

by Andrea Arcangeli

Subject: Re: [PATCH] fix pgd_lock deadlock

On Tue, Feb 15, 2011 at 09:26:35PM +0100, Thomas Gleixner wrote:
> Another thing. You check for in_interrupt(), but what makes sure that
> the code which takes pgd_lock is never taken with interrupts disabled
> except during early boot ?

It's perfectly fine to take pgd_lock with irqs disabled, as long as you
don't then try to take the page_table_lock after that. So that's
not a concern.

I removed _irqsave from all pgd_lock, and I doubt there's any code
protected by pgd_lock that runs with irq disabled, but if there is,
it's still ok and it especially shouldn't have used _irqsave.

The only real issue here to sort out is whether pgd_lock is ever taken
from irq or not, and to me it looks like in_interrupt() should trigger
if it is ever taken from irq, so it won't go unnoticed for long if
this isn't ok.
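
For example (just a sketch):

	/* fine: pgd_lock with irqs disabled, nothing nested under it */
	spin_lock_irqsave(&pgd_lock, flags);
	/* ... walk pgd_list ... */
	spin_unlock_irqrestore(&pgd_lock, flags);

	/* not fine: spinning on page_table_lock with irqs disabled,
	   while its holder may be waiting for this CPU to ack a tlb
	   flush IPI */
	spin_lock_irqsave(&pgd_lock, flags);
	spin_lock(&mm->page_table_lock);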

2011-02-15 23:04:40

by Thomas Gleixner

Subject: Re: [PATCH] fix pgd_lock deadlock

On Tue, 15 Feb 2011, Andrea Arcangeli wrote:

> On Tue, Feb 15, 2011 at 09:26:35PM +0100, Thomas Gleixner wrote:
> > Another thing. You check for in_interrupt(), but what makes sure that
> > the code which takes pgd_lock is never taken with interrupts disabled
> > except during early boot ?
>
> It's perfectly fine to take pgd_lock with irq disabled, as long as you
> don't pretend to take the page_table_lock too after that. So that's
> not a concern.
>
> I removed _irqsave from all pgd_lock, and I doubt there's any code
> protected by pgd_lock that runs with irq disabled, but if there is,
> it's still ok and it especially shouldn't have used _irqsave.
>
> The only real issue here to sort out, is if pgd_lock is ever taken
> from irq or not, and to me it looks like in_interrupt() should trigger
> if it is ever taken from irq, so it won't go unnoticed for long if
> this isn't ok.

I assume you run it with a lockdep enabled kernel as well, right ?

Thanks,

tglx

2011-02-15 23:18:22

by Andrea Arcangeli

Subject: Re: [PATCH] fix pgd_lock deadlock

On Wed, Feb 16, 2011 at 12:03:30AM +0100, Thomas Gleixner wrote:
> I assume you run it with a lockdep enabled kernel as well, right ?

Yes, I always run with lockdep and prove locking enabled on my test
box; I'm not sure how it's meant to trigger more bugs in this case. The
debug check that should be relevant for this is DEBUG_VM, and that is
enabled too of course. I didn't try DEBUG_PAGEALLOC yet.

2011-02-16 09:58:51

by Peter Zijlstra

Subject: Re: [PATCH] fix pgd_lock deadlock

On Wed, 2011-02-16 at 00:17 +0100, Andrea Arcangeli wrote:
> On Wed, Feb 16, 2011 at 12:03:30AM +0100, Thomas Gleixner wrote:
> > I assume you run it with a lockdep enabled kernel as well, right ?
>
> Yes, I always run with lockdep and prove locking enabled on my test
> box, not sure how it's meant to trigger more bugs in this case, the
> debug check that should be relevant for this is DEBUG_VM and that is
> enabled too of course. I didn't try DEBUG_PAGEALLOC yet.

I think what Thomas tried to tell you is that your
VM_BUG_ON(in_interrupt()) is fully redundant if you have lockdep
enabled.

Lockdep will warn you if a !irqsave lock is taken from IRQ context,
since that is a clear inversion problem.
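
I.e. the pattern lockdep catches is something like this sketch:

	/* process context, irqs left enabled */
	spin_lock(&pgd_lock);
	...
	spin_unlock(&pgd_lock);

	/* later, from a hardirq handler */
	spin_lock(&pgd_lock);	/* lockdep warns: this lock was also
				   taken without irq protection */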

2011-02-16 10:16:19

by Andrea Arcangeli

Subject: Re: [PATCH] fix pgd_lock deadlock

On Wed, Feb 16, 2011 at 10:58:14AM +0100, Peter Zijlstra wrote:
> On Wed, 2011-02-16 at 00:17 +0100, Andrea Arcangeli wrote:
> > On Wed, Feb 16, 2011 at 12:03:30AM +0100, Thomas Gleixner wrote:
> > > I assume you run it with a lockdep enabled kernel as well, right ?
> >
> > Yes, I always run with lockdep and prove locking enabled on my test
> > box, not sure how it's meant to trigger more bugs in this case, the
> > debug check that should be relevant for this is DEBUG_VM and that is
> > enabled too of course. I didn't try DEBUG_PAGEALLOC yet.
>
> I think what Thomas tried to tell you is that your
> VM_BUG_ON(in_interrupt()) is fully redundant if you have lockdep
> enabled.
>
> Lockdep will warn you if a !irqsave lock is taken from IRQ context,
> since that is a clear inversion problem.

Ah I get it now, but I prefer to have it on all my builds, and
I don't keep lockdep on for all builds (but I keep DEBUG_VM on). It's
still only debug code that no production system will ever deal with,
so it should be good to exercise it in more than one debug .config,
considering it's very low overhead (pgd_lock is never taken in fast
paths), so it's suitable for a VM_BUG_ON.

2011-02-16 10:28:56

by Ingo Molnar

Subject: Re: [PATCH] fix pgd_lock deadlock


* Andrea Arcangeli <[email protected]> wrote:

> On Wed, Feb 16, 2011 at 10:58:14AM +0100, Peter Zijlstra wrote:
> > On Wed, 2011-02-16 at 00:17 +0100, Andrea Arcangeli wrote:
> > > On Wed, Feb 16, 2011 at 12:03:30AM +0100, Thomas Gleixner wrote:
> > > > I assume you run it with a lockdep enabled kernel as well, right ?
> > >
> > > Yes, I always run with lockdep and prove locking enabled on my test
> > > box, not sure how it's meant to trigger more bugs in this case, the
> > > debug check that should be relevant for this is DEBUG_VM and that is
> > > enabled too of course. I didn't try DEBUG_PAGEALLOC yet.
> >
> > I think what Thomas tried to tell you is that your
> > VM_BUG_ON(in_interrupt()) is fully redundant if you have lockdep
> > enabled.
> >
> > Lockdep will warn you if a !irqsave lock is taken from IRQ context,
> > since that is a clear inversion problem.
>
> Ah I get it now, but I prefer to have it on an all my builds, and
> I don't keep lockdep on for all builds (but I keep DEBUG_VM on). It's
> still only debug code that no production system will ever deal with,
> so it should be good to exercise it in more than on debug .config
> considering it's very low overhead (pgd_lock is never taken in fast
> paths) so it's suitable for a VM_BUG_ON.

The point both Thomas and Peter tried to make to you is that adding 7 instances
of redundant debug checks:

+ VM_BUG_ON(in_interrupt());
+ VM_BUG_ON(in_interrupt());
+ VM_BUG_ON(in_interrupt());
+ VM_BUG_ON(in_interrupt());
+ VM_BUG_ON(in_interrupt());
+ VM_BUG_ON(in_interrupt());
+ VM_BUG_ON(in_interrupt());

to arch/x86/ is not acceptable, for the reasons stated. Please remove it from the
patch.

Thanks,

Ingo

2011-02-16 14:50:27

by Andrea Arcangeli

Subject: Re: [PATCH] fix pgd_lock deadlock

On Wed, Feb 16, 2011 at 11:28:01AM +0100, Ingo Molnar wrote:
> The point both Thomas and Peter tried to point out to you, that adding 7 instances
> of redundant debug checks:
>
> + VM_BUG_ON(in_interrupt());
> + VM_BUG_ON(in_interrupt());
> + VM_BUG_ON(in_interrupt());
> + VM_BUG_ON(in_interrupt());
> + VM_BUG_ON(in_interrupt());
> + VM_BUG_ON(in_interrupt());
> + VM_BUG_ON(in_interrupt());
>
> to arch/x86/ is not acceptable, for the reasons stated. Please remove it from the
> patch.

Ok sorry, but whenever I'm asked if I tested my patch with lockdep, I
think of whether it was tested for other bugs, not about removing debug
code that overlaps with some lockdep feature.

===
Subject: fix pgd_lock deadlock

From: Andrea Arcangeli <[email protected]>

It's forbidden to take the page_table_lock with the irq disabled or if there's
contention the IPIs (for tlb flushes) sent with the page_table_lock held will
never run leading to a deadlock.

Apparently nobody takes the pgd_lock from irq so the _irqsave can be removed.

Signed-off-by: Andrea Arcangeli <[email protected]>
---
arch/x86/mm/fault.c | 7 +++----
arch/x86/mm/init_64.c | 6 +++---
arch/x86/mm/pageattr.c | 18 ++++++++----------
arch/x86/mm/pgtable.c | 11 ++++-------
arch/x86/xen/mmu.c | 10 ++++------
5 files changed, 22 insertions(+), 30 deletions(-)

--- a/arch/x86/mm/fault.c
+++ b/arch/x86/mm/fault.c
@@ -229,15 +229,14 @@ void vmalloc_sync_all(void)
for (address = VMALLOC_START & PMD_MASK;
address >= TASK_SIZE && address < FIXADDR_TOP;
address += PMD_SIZE) {
-
- unsigned long flags;
struct page *page;

- spin_lock_irqsave(&pgd_lock, flags);
+ spin_lock(&pgd_lock);
list_for_each_entry(page, &pgd_list, lru) {
spinlock_t *pgt_lock;
pmd_t *ret;

+ /* the pgt_lock only for Xen */
pgt_lock = &pgd_page_get_mm(page)->page_table_lock;

spin_lock(pgt_lock);
@@ -247,7 +246,7 @@ void vmalloc_sync_all(void)
if (!ret)
break;
}
- spin_unlock_irqrestore(&pgd_lock, flags);
+ spin_unlock(&pgd_lock);
}
}

--- a/arch/x86/mm/init_64.c
+++ b/arch/x86/mm/init_64.c
@@ -105,18 +105,18 @@ void sync_global_pgds(unsigned long star

for (address = start; address <= end; address += PGDIR_SIZE) {
const pgd_t *pgd_ref = pgd_offset_k(address);
- unsigned long flags;
struct page *page;

if (pgd_none(*pgd_ref))
continue;

- spin_lock_irqsave(&pgd_lock, flags);
+ spin_lock(&pgd_lock);
list_for_each_entry(page, &pgd_list, lru) {
pgd_t *pgd;
spinlock_t *pgt_lock;

pgd = (pgd_t *)page_address(page) + pgd_index(address);
+ /* the pgt_lock only for Xen */
pgt_lock = &pgd_page_get_mm(page)->page_table_lock;
spin_lock(pgt_lock);

@@ -128,7 +128,7 @@ void sync_global_pgds(unsigned long star

spin_unlock(pgt_lock);
}
- spin_unlock_irqrestore(&pgd_lock, flags);
+ spin_unlock(&pgd_lock);
}
}

--- a/arch/x86/mm/pageattr.c
+++ b/arch/x86/mm/pageattr.c
@@ -57,12 +57,10 @@ static unsigned long direct_pages_count[

void update_page_count(int level, unsigned long pages)
{
- unsigned long flags;
-
/* Protect against CPA */
- spin_lock_irqsave(&pgd_lock, flags);
+ spin_lock(&pgd_lock);
direct_pages_count[level] += pages;
- spin_unlock_irqrestore(&pgd_lock, flags);
+ spin_unlock(&pgd_lock);
}

static void split_page_count(int level)
@@ -394,7 +392,7 @@ static int
try_preserve_large_page(pte_t *kpte, unsigned long address,
struct cpa_data *cpa)
{
- unsigned long nextpage_addr, numpages, pmask, psize, flags, addr, pfn;
+ unsigned long nextpage_addr, numpages, pmask, psize, addr, pfn;
pte_t new_pte, old_pte, *tmp;
pgprot_t old_prot, new_prot, req_prot;
int i, do_split = 1;
@@ -403,7 +401,7 @@ try_preserve_large_page(pte_t *kpte, uns
if (cpa->force_split)
return 1;

- spin_lock_irqsave(&pgd_lock, flags);
+ spin_lock(&pgd_lock);
/*
* Check for races, another CPU might have split this page
* up already:
@@ -498,14 +496,14 @@ try_preserve_large_page(pte_t *kpte, uns
}

out_unlock:
- spin_unlock_irqrestore(&pgd_lock, flags);
+ spin_unlock(&pgd_lock);

return do_split;
}

static int split_large_page(pte_t *kpte, unsigned long address)
{
- unsigned long flags, pfn, pfninc = 1;
+ unsigned long pfn, pfninc = 1;
unsigned int i, level;
pte_t *pbase, *tmp;
pgprot_t ref_prot;
@@ -519,7 +517,7 @@ static int split_large_page(pte_t *kpte,
if (!base)
return -ENOMEM;

- spin_lock_irqsave(&pgd_lock, flags);
+ spin_lock(&pgd_lock);
/*
* Check for races, another CPU might have split this page
* up for us already:
@@ -591,7 +589,7 @@ out_unlock:
*/
if (base)
__free_page(base);
- spin_unlock_irqrestore(&pgd_lock, flags);
+ spin_unlock(&pgd_lock);

return 0;
}
--- a/arch/x86/mm/pgtable.c
+++ b/arch/x86/mm/pgtable.c
@@ -121,14 +121,12 @@ static void pgd_ctor(struct mm_struct *m

static void pgd_dtor(pgd_t *pgd)
{
- unsigned long flags; /* can be called from interrupt context */
-
if (SHARED_KERNEL_PMD)
return;

- spin_lock_irqsave(&pgd_lock, flags);
+ spin_lock(&pgd_lock);
pgd_list_del(pgd);
- spin_unlock_irqrestore(&pgd_lock, flags);
+ spin_unlock(&pgd_lock);
}

/*
@@ -260,7 +258,6 @@ pgd_t *pgd_alloc(struct mm_struct *mm)
{
pgd_t *pgd;
pmd_t *pmds[PREALLOCATED_PMDS];
- unsigned long flags;

pgd = (pgd_t *)__get_free_page(PGALLOC_GFP);

@@ -280,12 +277,12 @@ pgd_t *pgd_alloc(struct mm_struct *mm)
* respect to anything walking the pgd_list, so that they
* never see a partially populated pgd.
*/
- spin_lock_irqsave(&pgd_lock, flags);
+ spin_lock(&pgd_lock);

pgd_ctor(mm, pgd);
pgd_prepopulate_pmd(mm, pgd, pmds);

- spin_unlock_irqrestore(&pgd_lock, flags);
+ spin_unlock(&pgd_lock);

return pgd;

--- a/arch/x86/xen/mmu.c
+++ b/arch/x86/xen/mmu.c
@@ -986,10 +986,9 @@ static void xen_pgd_pin(struct mm_struct
*/
void xen_mm_pin_all(void)
{
- unsigned long flags;
struct page *page;

- spin_lock_irqsave(&pgd_lock, flags);
+ spin_lock(&pgd_lock);

list_for_each_entry(page, &pgd_list, lru) {
if (!PagePinned(page)) {
@@ -998,7 +997,7 @@ void xen_mm_pin_all(void)
}
}

- spin_unlock_irqrestore(&pgd_lock, flags);
+ spin_unlock(&pgd_lock);
}

/*
@@ -1099,10 +1098,9 @@ static void xen_pgd_unpin(struct mm_stru
*/
void xen_mm_unpin_all(void)
{
- unsigned long flags;
struct page *page;

- spin_lock_irqsave(&pgd_lock, flags);
+ spin_lock(&pgd_lock);

list_for_each_entry(page, &pgd_list, lru) {
if (PageSavePinned(page)) {
@@ -1112,7 +1110,7 @@ void xen_mm_unpin_all(void)
}
}

- spin_unlock_irqrestore(&pgd_lock, flags);
+ spin_unlock(&pgd_lock);
}

void xen_activate_mm(struct mm_struct *prev, struct mm_struct *next)

2011-02-16 16:25:08

by Rik van Riel

Subject: Re: [PATCH] fix pgd_lock deadlock

On 02/16/2011 09:49 AM, Andrea Arcangeli wrote:

> Subject: fix pgd_lock deadlock
>
> From: Andrea Arcangeli<[email protected]>
>
> It's forbidden to take the page_table_lock with the irq disabled or if there's
> contention the IPIs (for tlb flushes) sent with the page_table_lock held will
> never run leading to a deadlock.
>
> Apparently nobody takes the pgd_lock from irq so the _irqsave can be removed.
>
> Signed-off-by: Andrea Arcangeli<[email protected]>


Acked-by: Rik van Riel <[email protected]>

2011-02-16 18:34:02

by Andrea Arcangeli

Subject: Re: [PATCH] fix pgd_lock deadlock

On Tue, Feb 15, 2011 at 09:05:20PM +0100, Thomas Gleixner wrote:
> Did you try with DEBUG_PAGEALLOC, which is calling into cpa quite a
> lot?

I tried DEBUG_PAGEALLOC and it seems to work fine (in addition to
lockdep); it doesn't trigger any debug check.

In addition to testing it (both the previous patch and the one below) I
looked into the code, and the free_pages calling into
pageattr->split_large_page apparently all happens at boot time.

Now one doubt remains: do we need change_page_attr to run from irqs
(not from DEBUG_PAGEALLOC though)? Is change_page_attr really sane
to run from irqs? I thought __flush_tlb_all was delivering IPIs (in
that case it also wouldn't have been safe in the first place to run
with irq disabled), but of course the "__" version is local, so after
all maybe it's safe to run from interrupts too (I'd be amazed if
somebody is calling it from irq, if not even DEBUG_PAGEALLOC does), but
with the below patch it will remain safe from irq as far as the
pgd_lock is concerned.

I think the previous patch was safe too though; avoiding VM
manipulations from interrupts makes everything simpler. Normally only
gart drivers should call it, at init time, to avoid prefetching of
cachelines in the next 2m page with different (writeback) cache
attributes for the pages physically aliased in the gart and mapped with
a different cache attribute; that init stuff happening from interrupt
context sounds weird. Anyway I post the below patch too as an
alternative that still allows pageattr from irq.

With both patches the big dependency remains on __mmdrop not running
from irq. The alternative approach is to remove the page_table_lock
from vmalloc_sync_all (which is only needed by Xen paravirt guests
AFAIK) and solve that problem in a different way, but I don't even know
why they need it exactly, so I tried not to impact that.

===
Subject: fix pgd_lock deadlock

From: Andrea Arcangeli <[email protected]>

It's forbidden to take the page_table_lock with the irq disabled or if there's
contention the IPIs (for tlb flushes) sent with the page_table_lock held will
never run leading to a deadlock.

pageattr also seems to take advantage of the pgd_lock to serialize
split_large_page, which isn't necessary, so split that off to a separate
spinlock (as a read_lock wouldn't be enough to serialize
split_large_page). Then we can use a read-write lock to allow pageattr
to keep taking it from interrupts, but without having to always clear
interrupts for readers. Writers are only safe from regular kernel
context and they must clear irqs before taking it.

Signed-off-by: Andrea Arcangeli <[email protected]>
---
arch/x86/include/asm/pgtable.h | 2 +-
arch/x86/mm/fault.c | 23 ++++++++++++++++++-----
arch/x86/mm/init_64.c | 6 +++---
arch/x86/mm/pageattr.c | 17 ++++++++++-------
arch/x86/mm/pgtable.c | 10 +++++-----
arch/x86/xen/mmu.c | 10 ++++------
6 files changed, 41 insertions(+), 27 deletions(-)

--- a/arch/x86/mm/fault.c
+++ b/arch/x86/mm/fault.c
@@ -179,7 +179,21 @@ force_sig_info_fault(int si_signo, int s
force_sig_info(si_signo, &info, tsk);
}

-DEFINE_SPINLOCK(pgd_lock);
+/*
+ * The pgd_lock only protects the pgd_list. It can be taken from
+ * interrupt context but only in read mode (there's no need to clear
+ * interrupts before taking it in read mode). Write mode can only be
+ * taken from regular kernel context (not from irqs) and it must
+ * disable irqs before taking it. This way it can still be taken in
+ * read mode from interrupt context (in case
+ * pageattr->split_large_page is ever called from interrupt context),
+ * but without forcing the pgd_list readers to clear interrupts before
+ * taking it (like vmalloc_sync_all() that then wants to take the
+ * page_table_lock while holding the pgd_lock, which can only be taken
+ * while irqs are left enabled to avoid deadlocking against the IPI
+ * delivery of an SMP TLB flush run with the page_table_lock hold).
+ */
+DEFINE_RWLOCK(pgd_lock);
LIST_HEAD(pgd_list);

#ifdef CONFIG_X86_32
@@ -229,15 +243,14 @@ void vmalloc_sync_all(void)
for (address = VMALLOC_START & PMD_MASK;
address >= TASK_SIZE && address < FIXADDR_TOP;
address += PMD_SIZE) {
-
- unsigned long flags;
struct page *page;

- spin_lock_irqsave(&pgd_lock, flags);
+ read_lock(&pgd_lock);
list_for_each_entry(page, &pgd_list, lru) {
spinlock_t *pgt_lock;
pmd_t *ret;

+ /* the pgt_lock only for Xen */
pgt_lock = &pgd_page_get_mm(page)->page_table_lock;

spin_lock(pgt_lock);
@@ -247,7 +260,7 @@ void vmalloc_sync_all(void)
if (!ret)
break;
}
- spin_unlock_irqrestore(&pgd_lock, flags);
+ read_unlock(&pgd_lock);
}
}

--- a/arch/x86/mm/init_64.c
+++ b/arch/x86/mm/init_64.c
@@ -105,18 +105,18 @@ void sync_global_pgds(unsigned long star

for (address = start; address <= end; address += PGDIR_SIZE) {
const pgd_t *pgd_ref = pgd_offset_k(address);
- unsigned long flags;
struct page *page;

if (pgd_none(*pgd_ref))
continue;

- spin_lock_irqsave(&pgd_lock, flags);
+ read_lock(&pgd_lock);
list_for_each_entry(page, &pgd_list, lru) {
pgd_t *pgd;
spinlock_t *pgt_lock;

pgd = (pgd_t *)page_address(page) + pgd_index(address);
+ /* the pgt_lock only for Xen */
pgt_lock = &pgd_page_get_mm(page)->page_table_lock;
spin_lock(pgt_lock);

@@ -128,7 +128,7 @@ void sync_global_pgds(unsigned long star

spin_unlock(pgt_lock);
}
- spin_unlock_irqrestore(&pgd_lock, flags);
+ read_unlock(&pgd_lock);
}
}

--- a/arch/x86/mm/pageattr.c
+++ b/arch/x86/mm/pageattr.c
@@ -47,6 +47,7 @@ struct cpa_data {
* splitting a large page entry along with changing the attribute.
*/
static DEFINE_SPINLOCK(cpa_lock);
+static DEFINE_SPINLOCK(pgattr_lock);

#define CPA_FLUSHTLB 1
#define CPA_ARRAY 2
@@ -60,9 +61,9 @@ void update_page_count(int level, unsign
unsigned long flags;

/* Protect against CPA */
- spin_lock_irqsave(&pgd_lock, flags);
+ spin_lock_irqsave(&pgattr_lock, flags);
direct_pages_count[level] += pages;
- spin_unlock_irqrestore(&pgd_lock, flags);
+ spin_unlock_irqrestore(&pgattr_lock, flags);
}

static void split_page_count(int level)
@@ -376,6 +377,7 @@ static void __set_pmd_pte(pte_t *kpte, u
if (!SHARED_KERNEL_PMD) {
struct page *page;

+ read_lock(&pgd_lock);
list_for_each_entry(page, &pgd_list, lru) {
pgd_t *pgd;
pud_t *pud;
@@ -386,6 +388,7 @@ static void __set_pmd_pte(pte_t *kpte, u
pmd = pmd_offset(pud, address);
set_pte_atomic((pte_t *)pmd, pte);
}
+ read_unlock(&pgd_lock);
}
#endif
}
@@ -403,7 +406,7 @@ try_preserve_large_page(pte_t *kpte, uns
if (cpa->force_split)
return 1;

- spin_lock_irqsave(&pgd_lock, flags);
+ spin_lock_irqsave(&pgattr_lock, flags);
/*
* Check for races, another CPU might have split this page
* up already:
@@ -498,7 +501,7 @@ try_preserve_large_page(pte_t *kpte, uns
}

out_unlock:
- spin_unlock_irqrestore(&pgd_lock, flags);
+ spin_unlock_irqrestore(&pgattr_lock, flags);

return do_split;
}
@@ -519,7 +522,7 @@ static int split_large_page(pte_t *kpte,
if (!base)
return -ENOMEM;

- spin_lock_irqsave(&pgd_lock, flags);
+ spin_lock_irqsave(&pgattr_lock, flags);
/*
* Check for races, another CPU might have split this page
* up for us already:
@@ -587,11 +590,11 @@ static int split_large_page(pte_t *kpte,
out_unlock:
/*
* If we dropped out via the lookup_address check under
- * pgd_lock then stick the page back into the pool:
+ * pgattr_lock then stick the page back into the pool:
*/
if (base)
__free_page(base);
- spin_unlock_irqrestore(&pgd_lock, flags);
+ spin_unlock_irqrestore(&pgattr_lock, flags);

return 0;
}
--- a/arch/x86/mm/pgtable.c
+++ b/arch/x86/mm/pgtable.c
@@ -121,14 +121,14 @@ static void pgd_ctor(struct mm_struct *m

static void pgd_dtor(pgd_t *pgd)
{
- unsigned long flags; /* can be called from interrupt context */
+ unsigned long flags;

if (SHARED_KERNEL_PMD)
return;

- spin_lock_irqsave(&pgd_lock, flags);
+ write_lock_irqsave(&pgd_lock, flags);
pgd_list_del(pgd);
- spin_unlock_irqrestore(&pgd_lock, flags);
+ write_unlock_irqrestore(&pgd_lock, flags);
}

/*
@@ -280,12 +280,12 @@ pgd_t *pgd_alloc(struct mm_struct *mm)
* respect to anything walking the pgd_list, so that they
* never see a partially populated pgd.
*/
- spin_lock_irqsave(&pgd_lock, flags);
+ write_lock_irqsave(&pgd_lock, flags);

pgd_ctor(mm, pgd);
pgd_prepopulate_pmd(mm, pgd, pmds);

- spin_unlock_irqrestore(&pgd_lock, flags);
+ write_unlock_irqrestore(&pgd_lock, flags);

return pgd;

--- a/arch/x86/xen/mmu.c
+++ b/arch/x86/xen/mmu.c
@@ -986,10 +986,9 @@ static void xen_pgd_pin(struct mm_struct
*/
void xen_mm_pin_all(void)
{
- unsigned long flags;
struct page *page;

- spin_lock_irqsave(&pgd_lock, flags);
+ read_lock(&pgd_lock);

list_for_each_entry(page, &pgd_list, lru) {
if (!PagePinned(page)) {
@@ -998,7 +997,7 @@ void xen_mm_pin_all(void)
}
}

- spin_unlock_irqrestore(&pgd_lock, flags);
+ read_unlock(&pgd_lock);
}

/*
@@ -1099,10 +1098,9 @@ static void xen_pgd_unpin(struct mm_stru
*/
void xen_mm_unpin_all(void)
{
- unsigned long flags;
struct page *page;

- spin_lock_irqsave(&pgd_lock, flags);
+ read_lock(&pgd_lock);

list_for_each_entry(page, &pgd_list, lru) {
if (PageSavePinned(page)) {
@@ -1112,7 +1110,7 @@ void xen_mm_unpin_all(void)
}
}

- spin_unlock_irqrestore(&pgd_lock, flags);
+ read_unlock(&pgd_lock);
}

void xen_activate_mm(struct mm_struct *prev, struct mm_struct *next)
--- a/arch/x86/include/asm/pgtable.h
+++ b/arch/x86/include/asm/pgtable.h
@@ -25,7 +25,7 @@
extern unsigned long empty_zero_page[PAGE_SIZE / sizeof(unsigned long)];
#define ZERO_PAGE(vaddr) (virt_to_page(empty_zero_page))

-extern spinlock_t pgd_lock;
+extern rwlock_t pgd_lock;
extern struct list_head pgd_list;

extern struct mm_struct *pgd_page_get_mm(struct page *page);

2011-02-16 20:16:06

by Ingo Molnar

[permalink] [raw]
Subject: Re: [PATCH] fix pgd_lock deadlock


* Andrea Arcangeli <[email protected]> wrote:

> Ok sorry, but whenever I'm asked if I tested my patch with lockdep, I think if it
> was tested for other bugs, not to remove debug code overlapping with some lockdep
> feature.

ah, ok. The thing is, if it was only a single VM_BUG_ON() then it wouldn't be a big
deal. Seven seems excessive - especially since we have that very check via lockdep.

Thanks for removing them!

Ingo

2011-02-16 21:36:17

by Konrad Rzeszutek Wilk

[permalink] [raw]
Subject: Re: [PATCH] fix pgd_lock deadlock

On Wed, Feb 16, 2011 at 07:33:04PM +0100, Andrea Arcangeli wrote:
> On Tue, Feb 15, 2011 at 09:05:20PM +0100, Thomas Gleixner wrote:
> > Did you try with DEBUG_PAGEALLOC, which is calling into cpa quite a
> > lot?
>
> I tried DEBUG_PAGEALLOC and it seems to work fine (in addition to
> lockdep), it doesn't spawn any debug check.
>
> In addition to testing it (both prev patch and below one) I looked
I checked this under Xen running as PV and Dom0 and had no trouble.

You can stick:
Tested-by: Konrad Rzeszutek Wilk <[email protected]>

> into the code and the free_pages calling into
> pageattr->split_large_page apparently happens all at boot time.
>
> Now one doubt remains if we need change_page_attr to run from irqs
> (not from DEBUG_PAGEALLOC though). Now is change_page_attr really sane
> to run from irqs? I thought __flush_tlb_all was delivering IPI (in
> that case it also wouldn't have been safe in the first place to run
> with irq disabled) but of course the "__" version is local, so after
> all maybe it's safe to run with interrupts too (I'd be amazed if
> somebody is calling it from irq, if not even DEBUG_PAGEALLOC does) but
> with the below patch it will remain safe from irq as far as the
> pgd_lock is concerned.
>
> I think the previous patch was safe too though, avoiding VM
> manipulations from interrupts makes everything simpler. Normally only
> gart drivers should call it at init time to avoid prefetching of
> cachelines in the next 2m page with different (writeback) cache
> attributes of the pages physically aliased in the gart and mapped with
> different cache attribute, that init stuff happening from interrupt
> sounds weird. Anyway I post the below patch too as an alternative to
> still allow pageattr from irq.
>
> With both patches the big dependency remains on __mmdrop not to run
> from irq. The alternative approach is to remove the page_table_lock
> from vmalloc_sync_all (which is only needed by Xen paravirt guest
> AFAIK) and solve that problem in a different way, but I don't even know
> why they need it exactly, I tried not to impact that.

2011-02-17 10:21:04

by Johannes Weiner

[permalink] [raw]
Subject: Re: [PATCH] fix pgd_lock deadlock

On Wed, Feb 16, 2011 at 07:33:04PM +0100, Andrea Arcangeli wrote:
> On Tue, Feb 15, 2011 at 09:05:20PM +0100, Thomas Gleixner wrote:
> > Did you try with DEBUG_PAGEALLOC, which is calling into cpa quite a
> > lot?
>
> I tried DEBUG_PAGEALLOC and it seems to work fine (in addition to
> lockdep), it doesn't spawn any debug check.
>
> In addition to testing it (both prev patch and below one) I looked
> into the code and the free_pages calling into
> pageattr->split_large_page apparently happens all at boot time.
>
> Now one doubt remains if we need change_page_attr to run from irqs
> (not from DEBUG_PAGEALLOC though). Now is change_page_attr really sane
> to run from irqs? I thought __flush_tlb_all was delivering IPI (in
> that case it also wouldn't have been safe in the first place to run
> with irq disabled) but of course the "__" version is local, so after
> all maybe it's safe to run with interrupts too (I'd be amazed if
> somebody is calling it from irq, if not even DEBUG_PAGEALLOC does) but
> with the below patch it will remain safe from irq as far as the
> pgd_lock is concerned.
>
> I think the previous patch was safe too though, avoiding VM
> manipulations from interrupts makes everything simpler. Normally only
> gart drivers should call it at init time to avoid prefetching of
> cachelines in the next 2m page with different (writeback) cache
> attributes of the pages physically aliased in the gart and mapped with
> different cache attribute, that init stuff happening from interrupt
> sounds weird. Anyway I post the below patch too as an alternative to
> still allow pageattr from irq.
>
> With both patches the big dependency remains on __mmdrop not to run
> from irq. The alternative approach is to remove the page_table_lock
> from vmalloc_sync_all (which is only needed by Xen paravirt guest
> AFAIK) and solve that problem in a different way, but I don't even know
> why they need it exactly, I tried not to impact that.

So Xen needs all page tables protected when pinning/unpinning and
extended page_table_lock to cover kernel range, which it does nowhere
else AFAICS. But the places it extended are also taking the pgd_lock,
so I wonder if Xen could just take the pgd_lock itself in these paths
and we could revert page_table_lock back to cover user va only?
Jeremy, could this work? Untested.

Hannes

---
arch/x86/include/asm/pgtable.h | 2 --
arch/x86/mm/fault.c | 14 ++------------
arch/x86/mm/init_64.c | 6 ------
arch/x86/mm/pgtable.c | 20 +++-----------------
arch/x86/xen/mmu.c | 8 ++++++++
5 files changed, 13 insertions(+), 37 deletions(-)

diff --git a/arch/x86/include/asm/pgtable.h b/arch/x86/include/asm/pgtable.h
index 18601c8..8c0335a 100644
--- a/arch/x86/include/asm/pgtable.h
+++ b/arch/x86/include/asm/pgtable.h
@@ -28,8 +28,6 @@ extern unsigned long empty_zero_page[PAGE_SIZE / sizeof(unsigned long)];
extern spinlock_t pgd_lock;
extern struct list_head pgd_list;

-extern struct mm_struct *pgd_page_get_mm(struct page *page);
-
#ifdef CONFIG_PARAVIRT
#include <asm/paravirt.h>
#else /* !CONFIG_PARAVIRT */
diff --git a/arch/x86/mm/fault.c b/arch/x86/mm/fault.c
index 7d90ceb..5da4155 100644
--- a/arch/x86/mm/fault.c
+++ b/arch/x86/mm/fault.c
@@ -234,19 +234,9 @@ void vmalloc_sync_all(void)
struct page *page;

spin_lock_irqsave(&pgd_lock, flags);
- list_for_each_entry(page, &pgd_list, lru) {
- spinlock_t *pgt_lock;
- pmd_t *ret;
-
- pgt_lock = &pgd_page_get_mm(page)->page_table_lock;
-
- spin_lock(pgt_lock);
- ret = vmalloc_sync_one(page_address(page), address);
- spin_unlock(pgt_lock);
-
- if (!ret)
+ list_for_each_entry(page, &pgd_list, lru)
+ if (!vmalloc_sync_one(page_address(page), address))
break;
- }
spin_unlock_irqrestore(&pgd_lock, flags);
}
}
diff --git a/arch/x86/mm/init_64.c b/arch/x86/mm/init_64.c
index 71a5929..9332f21 100644
--- a/arch/x86/mm/init_64.c
+++ b/arch/x86/mm/init_64.c
@@ -114,19 +114,13 @@ void sync_global_pgds(unsigned long start, unsigned long end)
spin_lock_irqsave(&pgd_lock, flags);
list_for_each_entry(page, &pgd_list, lru) {
pgd_t *pgd;
- spinlock_t *pgt_lock;

pgd = (pgd_t *)page_address(page) + pgd_index(address);
- pgt_lock = &pgd_page_get_mm(page)->page_table_lock;
- spin_lock(pgt_lock);
-
if (pgd_none(*pgd))
set_pgd(pgd, *pgd_ref);
else
BUG_ON(pgd_page_vaddr(*pgd)
!= pgd_page_vaddr(*pgd_ref));
-
- spin_unlock(pgt_lock);
}
spin_unlock_irqrestore(&pgd_lock, flags);
}
diff --git a/arch/x86/mm/pgtable.c b/arch/x86/mm/pgtable.c
index 500242d..72107ab 100644
--- a/arch/x86/mm/pgtable.c
+++ b/arch/x86/mm/pgtable.c
@@ -87,19 +87,7 @@ static inline void pgd_list_del(pgd_t *pgd)
#define UNSHARED_PTRS_PER_PGD \
(SHARED_KERNEL_PMD ? KERNEL_PGD_BOUNDARY : PTRS_PER_PGD)

-
-static void pgd_set_mm(pgd_t *pgd, struct mm_struct *mm)
-{
- BUILD_BUG_ON(sizeof(virt_to_page(pgd)->index) < sizeof(mm));
- virt_to_page(pgd)->index = (pgoff_t)mm;
-}
-
-struct mm_struct *pgd_page_get_mm(struct page *page)
-{
- return (struct mm_struct *)page->index;
-}
-
-static void pgd_ctor(struct mm_struct *mm, pgd_t *pgd)
+static void pgd_ctor(pgd_t *pgd)
{
/* If the pgd points to a shared pagetable level (either the
ptes in non-PAE, or shared PMD in PAE), then just copy the
@@ -113,10 +101,8 @@ static void pgd_ctor(struct mm_struct *mm, pgd_t *pgd)
}

/* list required to sync kernel mapping updates */
- if (!SHARED_KERNEL_PMD) {
- pgd_set_mm(pgd, mm);
+ if (!SHARED_KERNEL_PMD)
pgd_list_add(pgd);
- }
}

static void pgd_dtor(pgd_t *pgd)
@@ -282,7 +268,7 @@ pgd_t *pgd_alloc(struct mm_struct *mm)
*/
spin_lock_irqsave(&pgd_lock, flags);

- pgd_ctor(mm, pgd);
+ pgd_ctor(pgd);
pgd_prepopulate_pmd(mm, pgd, pmds);

spin_unlock_irqrestore(&pgd_lock, flags);
diff --git a/arch/x86/xen/mmu.c b/arch/x86/xen/mmu.c
index 5e22810..97fbfce 100644
--- a/arch/x86/xen/mmu.c
+++ b/arch/x86/xen/mmu.c
@@ -1021,7 +1021,11 @@ static void __xen_pgd_pin(struct mm_struct *mm, pgd_t *pgd)

static void xen_pgd_pin(struct mm_struct *mm)
{
+ unsigned long flags;
+
+ spin_lock_irqsave(&pgd_lock, flags);
__xen_pgd_pin(mm, mm->pgd);
+ spin_unlock_irqrestore(&pgd_lock, flags);
}

/*
@@ -1140,7 +1144,11 @@ static void __xen_pgd_unpin(struct mm_struct *mm, pgd_t *pgd)

static void xen_pgd_unpin(struct mm_struct *mm)
{
+ unsigned long flags;
+
+ spin_lock_irqsave(&pgd_lock, flags);
__xen_pgd_unpin(mm, mm->pgd);
+ spin_unlock_irqrestore(&pgd_lock, flags);
}

/*

2011-02-21 14:31:11

by Andrea Arcangeli

[permalink] [raw]
Subject: Re: [PATCH] fix pgd_lock deadlock

On Thu, Feb 17, 2011 at 11:19:41AM +0100, Johannes Weiner wrote:
> So Xen needs all page tables protected when pinning/unpinning and
> extended page_table_lock to cover kernel range, which it does nowhere
> else AFAICS. But the places it extended are also taking the pgd_lock,
> so I wonder if Xen could just take the pgd_lock itself in these paths
> and we could revert page_table_lock back to cover user va only?
> Jeremy, could this work? Untested.

If this works for Xen, I definitely prefer this.

Still there's no point to insist on _irqsave if nothing takes the
pgd_lock from irq, so my patch probably should be applied anyway or
it's confusing and there's even a comment saying pgd_dtor is called
from irq, if it's not it should be corrected. But then it becomes a
cleanup, not a fix.

2011-02-21 14:54:21

by Johannes Weiner

[permalink] [raw]
Subject: Re: [PATCH] fix pgd_lock deadlock

On Mon, Feb 21, 2011 at 03:30:23PM +0100, Andrea Arcangeli wrote:
> On Thu, Feb 17, 2011 at 11:19:41AM +0100, Johannes Weiner wrote:
> > So Xen needs all page tables protected when pinning/unpinning and
> > extended page_table_lock to cover kernel range, which it does nowhere
> > else AFAICS. But the places it extended are also taking the pgd_lock,
> > so I wonder if Xen could just take the pgd_lock itself in these paths
> > and we could revert page_table_lock back to cover user va only?
> > Jeremy, could this work? Untested.
>
> If this works for Xen, I definitely prefer this.

Below is the real submission, with changelog and sign-off and all (except
testing on Xen itself, sorry). I moved the pgd_lock acquisition in this
version to make the lock ordering perfectly clear. Xen people, could
you have a look at this?

> Still there's no point to insist on _irqsave if nothing takes the
> pgd_lock from irq, so my patch probably should be applied anyway or
> it's confusing and there's even a comment saying pgd_dtor is called
> from irq, if it's not it should be corrected. But then it becomes a
> cleanup, not a fix.

Absolutely agreed, I like your clean-up but would prefer it not being
a requirement for this fix.

---
From: Johannes Weiner <[email protected]>
Subject: [patch] x86, mm: fix mm->page_table_lock deadlock

'617d34d x86, mm: Hold mm->page_table_lock while doing vmalloc_sync'
made two paths grab mm->page_table_lock while having IRQs disabled.
This is not safe as rmap waits for IPI responses with this lock held
when clearing young bits and flushing the TLB.

What 617d34d wanted was to exclude any changes to the page tables,
including demand kernel page table propagation from init_mm, so that
Xen could pin and unpin page tables atomically.

Kernel page table propagation takes the pgd_lock to protect the list
of page directories, however, so instead of adding mm->page_table_lock
to this side, this patch has Xen exclude it by taking pgd_lock itself.

The pgd_lock will then nest within mm->page_table_lock to exclude rmap
before disabling IRQs, thus fixing the deadlock.

Signed-off-by: Johannes Weiner <[email protected]>
---
arch/x86/include/asm/pgtable.h | 2 --
arch/x86/mm/fault.c | 14 ++------------
arch/x86/mm/init_64.c | 6 ------
arch/x86/mm/pgtable.c | 20 +++-----------------
arch/x86/xen/mmu.c | 12 ++++++++++++
5 files changed, 17 insertions(+), 37 deletions(-)

diff --git a/arch/x86/include/asm/pgtable.h b/arch/x86/include/asm/pgtable.h
index 18601c8..8c0335a 100644
--- a/arch/x86/include/asm/pgtable.h
+++ b/arch/x86/include/asm/pgtable.h
@@ -28,8 +28,6 @@ extern unsigned long empty_zero_page[PAGE_SIZE / sizeof(unsigned long)];
extern spinlock_t pgd_lock;
extern struct list_head pgd_list;

-extern struct mm_struct *pgd_page_get_mm(struct page *page);
-
#ifdef CONFIG_PARAVIRT
#include <asm/paravirt.h>
#else /* !CONFIG_PARAVIRT */
diff --git a/arch/x86/mm/fault.c b/arch/x86/mm/fault.c
index 7d90ceb..5da4155 100644
--- a/arch/x86/mm/fault.c
+++ b/arch/x86/mm/fault.c
@@ -234,19 +234,9 @@ void vmalloc_sync_all(void)
struct page *page;

spin_lock_irqsave(&pgd_lock, flags);
- list_for_each_entry(page, &pgd_list, lru) {
- spinlock_t *pgt_lock;
- pmd_t *ret;
-
- pgt_lock = &pgd_page_get_mm(page)->page_table_lock;
-
- spin_lock(pgt_lock);
- ret = vmalloc_sync_one(page_address(page), address);
- spin_unlock(pgt_lock);
-
- if (!ret)
+ list_for_each_entry(page, &pgd_list, lru)
+ if (!vmalloc_sync_one(page_address(page), address))
break;
- }
spin_unlock_irqrestore(&pgd_lock, flags);
}
}
diff --git a/arch/x86/mm/init_64.c b/arch/x86/mm/init_64.c
index 71a5929..9332f21 100644
--- a/arch/x86/mm/init_64.c
+++ b/arch/x86/mm/init_64.c
@@ -114,19 +114,13 @@ void sync_global_pgds(unsigned long start, unsigned long end)
spin_lock_irqsave(&pgd_lock, flags);
list_for_each_entry(page, &pgd_list, lru) {
pgd_t *pgd;
- spinlock_t *pgt_lock;

pgd = (pgd_t *)page_address(page) + pgd_index(address);
- pgt_lock = &pgd_page_get_mm(page)->page_table_lock;
- spin_lock(pgt_lock);
-
if (pgd_none(*pgd))
set_pgd(pgd, *pgd_ref);
else
BUG_ON(pgd_page_vaddr(*pgd)
!= pgd_page_vaddr(*pgd_ref));
-
- spin_unlock(pgt_lock);
}
spin_unlock_irqrestore(&pgd_lock, flags);
}
diff --git a/arch/x86/mm/pgtable.c b/arch/x86/mm/pgtable.c
index 500242d..72107ab 100644
--- a/arch/x86/mm/pgtable.c
+++ b/arch/x86/mm/pgtable.c
@@ -87,19 +87,7 @@ static inline void pgd_list_del(pgd_t *pgd)
#define UNSHARED_PTRS_PER_PGD \
(SHARED_KERNEL_PMD ? KERNEL_PGD_BOUNDARY : PTRS_PER_PGD)

-
-static void pgd_set_mm(pgd_t *pgd, struct mm_struct *mm)
-{
- BUILD_BUG_ON(sizeof(virt_to_page(pgd)->index) < sizeof(mm));
- virt_to_page(pgd)->index = (pgoff_t)mm;
-}
-
-struct mm_struct *pgd_page_get_mm(struct page *page)
-{
- return (struct mm_struct *)page->index;
-}
-
-static void pgd_ctor(struct mm_struct *mm, pgd_t *pgd)
+static void pgd_ctor(pgd_t *pgd)
{
/* If the pgd points to a shared pagetable level (either the
ptes in non-PAE, or shared PMD in PAE), then just copy the
@@ -113,10 +101,8 @@ static void pgd_ctor(struct mm_struct *mm, pgd_t *pgd)
}

/* list required to sync kernel mapping updates */
- if (!SHARED_KERNEL_PMD) {
- pgd_set_mm(pgd, mm);
+ if (!SHARED_KERNEL_PMD)
pgd_list_add(pgd);
- }
}

static void pgd_dtor(pgd_t *pgd)
@@ -282,7 +268,7 @@ pgd_t *pgd_alloc(struct mm_struct *mm)
*/
spin_lock_irqsave(&pgd_lock, flags);

- pgd_ctor(mm, pgd);
+ pgd_ctor(pgd);
pgd_prepopulate_pmd(mm, pgd, pmds);

spin_unlock_irqrestore(&pgd_lock, flags);
diff --git a/arch/x86/xen/mmu.c b/arch/x86/xen/mmu.c
index 5e92b61..498e6ae 100644
--- a/arch/x86/xen/mmu.c
+++ b/arch/x86/xen/mmu.c
@@ -1117,15 +1117,23 @@ void xen_mm_unpin_all(void)

void xen_activate_mm(struct mm_struct *prev, struct mm_struct *next)
{
+ unsigned long flags;
+
spin_lock(&next->page_table_lock);
+ spin_lock_irqsave(&pgd_lock, flags);
xen_pgd_pin(next);
+ spin_unlock_irqrestore(&pgd_lock, flags);
spin_unlock(&next->page_table_lock);
}

void xen_dup_mmap(struct mm_struct *oldmm, struct mm_struct *mm)
{
+ unsigned long flags;
+
spin_lock(&mm->page_table_lock);
+ spin_lock_irqsave(&pgd_lock, flags);
xen_pgd_pin(mm);
+ spin_unlock_irqrestore(&pgd_lock, flags);
spin_unlock(&mm->page_table_lock);
}

@@ -1211,16 +1219,20 @@ static void xen_drop_mm_ref(struct mm_struct *mm)
*/
void xen_exit_mmap(struct mm_struct *mm)
{
+ unsigned long flags;
+
get_cpu(); /* make sure we don't move around */
xen_drop_mm_ref(mm);
put_cpu();

spin_lock(&mm->page_table_lock);
+ spin_lock_irqsave(&pgd_lock, flags);

/* pgd may not be pinned in the error exit path of execve */
if (xen_page_pinned(mm->pgd))
xen_pgd_unpin(mm);

+ spin_unlock_irqrestore(&pgd_lock, flags);
spin_unlock(&mm->page_table_lock);
}

--
1.7.4

2011-02-21 17:40:32

by Jeremy Fitzhardinge

[permalink] [raw]
Subject: Re: [PATCH] fix pgd_lock deadlock

On 02/17/2011 02:19 AM, Johannes Weiner wrote:
> So Xen needs all page tables protected when pinning/unpinning and
> extended page_table_lock to cover kernel range, which it does nowhere
> else AFAICS. But the places it extended are also taking the pgd_lock,
> so I wonder if Xen could just take the pgd_lock itself in these paths
> and we could revert page_table_lock back to cover user va only?
> Jeremy, could this work? Untested.

Yes, this looks pretty plausible, but I need to go back and check what
the original bug was to make sure. Oh, and test it I guess.

But xen_pgd_pin/unpin only operate on the usermode parts of the address
space (since the kernel part is shared and always pinned), so there
shouldn't be any contention there.

Hm, and I don't see why pin/unpin really care about pgd_lock either.
They're called at well-defined places (fork/exec/exit) on a single pgd.
pin/unpin_all are a different matter - since they walk the pgd list -
but they were taking the lock anyway.

Will need to think about this a bit.

J


2011-02-22 07:49:02

by Jan Beulich

[permalink] [raw]
Subject: Re: [PATCH] fix pgd_lock deadlock

>>> On 21.02.11 at 15:53, Johannes Weiner <[email protected]> wrote:
> On Mon, Feb 21, 2011 at 03:30:23PM +0100, Andrea Arcangeli wrote:
>> On Thu, Feb 17, 2011 at 11:19:41AM +0100, Johannes Weiner wrote:
>> > So Xen needs all page tables protected when pinning/unpinning and
>> > extended page_table_lock to cover kernel range, which it does nowhere
>> > else AFAICS. But the places it extended are also taking the pgd_lock,
>> > so I wonder if Xen could just take the pgd_lock itself in these paths
>> > and we could revert page_table_lock back to cover user va only?
>> > Jeremy, could this work? Untested.
>>
>> If this works for Xen, I definitely prefer this.
>
> Below is real submission, with changelog and sign-off and all (except
> testing on Xen itself, sorry). I moved pgd_lock acquisition in this
> version to make the lock ordering perfectly clear. Xen people, could
> you have a look at this?

While I think that it would be correct, it doesn't look like a
reasonable fix to me: It effectively serializes process (address
space) construction and destruction.

A possible alternative would be to acquire the page table lock
in vmalloc_sync_all() only in the Xen case (perhaps by storing
NULL into page->index in pgd_set_mm() when not running on
Xen). This is utilizing the fact that there aren't (supposed to
be - for non-pvops this is definitely the case) any TLB flush IPIs
under Xen, and hence the race you're trying to fix doesn't
exist there (while non-Xen doesn't need the extra locking).

Jan
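
A rough sketch of what Jan seems to be suggesting (illustrative only, not a
patch; Andrea posts a concrete version of this further down the thread):
record the mm backlink only when the extra locking is actually needed, and
have vmalloc_sync_all() skip the page_table_lock when there is no backlink.

static void pgd_set_mm(pgd_t *pgd, struct mm_struct *mm)
{
	BUILD_BUG_ON(sizeof(virt_to_page(pgd)->index) < sizeof(mm));
	/* no backlink on bare metal: only Xen PV needs the extra locking */
	virt_to_page(pgd)->index = xen_pv_domain() ? (pgoff_t)mm : 0;
}

/* ... and in vmalloc_sync_all(): */
	struct mm_struct *mm = pgd_page_get_mm(page);

	if (mm)				/* non-NULL only on Xen PV */
		spin_lock(&mm->page_table_lock);
	ret = vmalloc_sync_one(page_address(page), address);
	if (mm)
		spin_unlock(&mm->page_table_lock);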

2011-02-22 13:50:44

by Andrea Arcangeli

[permalink] [raw]
Subject: Re: [PATCH] fix pgd_lock deadlock

On Tue, Feb 22, 2011 at 07:48:54AM +0000, Jan Beulich wrote:
> A possible alternative would be to acquire the page table lock
> in vmalloc_sync_all() only in the Xen case (perhaps by storing
> NULL into page->index in pgd_set_mm() when not running on
> Xen). This is utilizing the fact that there aren't (supposed to
> be - for non-pvops this is definitely the case) any TLB flush IPIs
> under Xen, and hence the race you're trying to fix doesn't
> exist there (while non-Xen doesn't need the extra locking).

That's sure ok with me. Can we use a global runtime to check if the
guest is running under Xen paravirt, instead of passing that info
through page->something?

2011-02-22 14:23:00

by Jan Beulich

[permalink] [raw]
Subject: Re: [PATCH] fix pgd_lock deadlock

>>> On 22.02.11 at 14:49, Andrea Arcangeli <[email protected]> wrote:
> On Tue, Feb 22, 2011 at 07:48:54AM +0000, Jan Beulich wrote:
>> A possible alternative would be to acquire the page table lock
>> in vmalloc_sync_all() only in the Xen case (perhaps by storing
>> NULL into page->index in pgd_set_mm() when not running on
>> Xen). This is utilizing the fact that there aren't (supposed to
>> be - for non-pvops this is definitely the case) any TLB flush IPIs
>> under Xen, and hence the race you're trying to fix doesn't
>> exist there (while non-Xen doesn't need the extra locking).
>
> That's sure ok with me. Can we use a global runtime to check if the
> guest is running under Xen paravirt, instead of passing that info
> through page->something?

If everyone's okay with putting a couple of "if (xen_pv_domain())"
into mm/fault.c - sure. I would have thought that this wouldn't be
liked, hence the suggestion to make this depend on seeing the
backlink be non-NULL.

Jan

2011-02-22 14:34:48

by Andrea Arcangeli

[permalink] [raw]
Subject: Re: [PATCH] fix pgd_lock deadlock

On Tue, Feb 22, 2011 at 02:22:53PM +0000, Jan Beulich wrote:
> If everyone's okay with putting a couple of "if (xen_pv_domain())"
> into mm/fault.c - sure. I would have thought that this wouldn't be
> liked, hence the suggestion to make this depend on seeing the
> backlink be non-NULL.

I prefer the if (xen_pv_domain) so it also gets optimized away
at build time when CONFIG_XEN=n. I think it's also more self
explanatory to read.
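
The build-time claim rests on how the xen/xen.h predicates are defined;
roughly (paraphrased, not the exact header text) they look like the sketch
below, so with CONFIG_XEN=n xen_pv_domain() evaluates to a compile-time 0
and the guarded code is dropped entirely:

#ifdef CONFIG_XEN
extern enum xen_domain_type xen_domain_type;
#else
#define xen_domain_type		XEN_NATIVE	/* compile-time constant */
#endif

#define xen_domain()		(xen_domain_type != XEN_NATIVE)
#define xen_pv_domain()		(xen_domain() && \
				 xen_domain_type == XEN_PV_DOMAIN)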

2011-02-22 17:09:06

by Jeremy Fitzhardinge

[permalink] [raw]
Subject: Re: [PATCH] fix pgd_lock deadlock

On 02/22/2011 06:34 AM, Andrea Arcangeli wrote:
> On Tue, Feb 22, 2011 at 02:22:53PM +0000, Jan Beulich wrote:
>> If everyone's okay with putting a couple of "if (xen_pv_domain())"
>> into mm/fault.c - sure. I would have thought that this wouldn't be
>> liked, hence the suggestion to make this depend on seeing the
>> backlink be non-NULL.
> I prefer the if (xen_pv_domain) so it also gets optimized away
> at build time when CONFIG_XEN=n. I think it's also more self
> explanatory to read.

Eh, very not keen about that. I'd only consider it as a last resort.

In practice CONFIG_XEN is always enabled in distros, so the conditional
would always be done at runtime.

J

2011-02-22 17:14:01

by Andrea Arcangeli

[permalink] [raw]
Subject: Re: [PATCH] fix pgd_lock deadlock

On Tue, Feb 22, 2011 at 09:08:54AM -0800, Jeremy Fitzhardinge wrote:
> On 02/22/2011 06:34 AM, Andrea Arcangeli wrote:
> > On Tue, Feb 22, 2011 at 02:22:53PM +0000, Jan Beulich wrote:
> >> If everyone's okay with putting a couple of "if (xen_pv_domain())"
> >> into mm/fault.c - sure. I would have thought that this wouldn't be
> >> liked, hence the suggestion to make this depend on seeing the
> >> backlink be non-NULL.
> > I prefer the if (xen_pv_domain) so it also gets optimized away
> > at build time when CONFIG_XEN=n. I think it's also more self
> > explanatory to read.
>
> Eh, very not keen about that. I'd only consider it as a last resort.
>
> In practice CONFIG_XEN is always enabled in distros, so the conditional
> would always be done at runtime.

That's ok, it's still better than setting a page field IMHO, as the
page field would be set conditionally on xen_pv_domain() anyway. I
wouldn't like to make things more complicated than they already are,
and this also makes it more explicit why that spinlock is needed.
Unless we foresee other hypervisors needing a generic mechanism, we
don't need one.

2011-02-24 04:22:43

by Andrea Arcangeli

[permalink] [raw]
Subject: Re: [PATCH] fix pgd_lock deadlock

On Tue, Feb 22, 2011 at 02:22:53PM +0000, Jan Beulich wrote:
> >>> On 22.02.11 at 14:49, Andrea Arcangeli <[email protected]> wrote:
> > On Tue, Feb 22, 2011 at 07:48:54AM +0000, Jan Beulich wrote:
> >> A possible alternative would be to acquire the page table lock
> >> in vmalloc_sync_all() only in the Xen case (perhaps by storing
> >> NULL into page->index in pgd_set_mm() when not running on
> >> Xen). This is utilizing the fact that there aren't (supposed to
> >> be - for non-pvops this is definitely the case) any TLB flush IPIs
> >> under Xen, and hence the race you're trying to fix doesn't
> >> exist there (while non-Xen doesn't need the extra locking).
> >
> > That's sure ok with me. Can we use a global runtime to check if the
> > guest is running under Xen paravirt, instead of passing that info
> > through page->something?
>
> If everyone's okay with putting a couple of "if (xen_pv_domain())"
> into mm/fault.c - sure. I would have thought that this wouldn't be
> liked, hence the suggestion to make this depend on seeing the
> backlink be non-NULL.

What about this? The page->private logic gets optimized away at
compile time with XEN=n.

The removal of _irqsave from pgd_lock, I'll delay it as it's no bugfix
anymore.

===
Subject: xen: stop taking the page_table_lock with irq disabled

From: Andrea Arcangeli <[email protected]>

It's forbidden to take the page_table_lock with the irq disabled or if there's
contention the IPIs (for tlb flushes) sent with the page_table_lock held will
never run leading to a deadlock.

Only Xen needs the page_table_lock and Xen won't need IPI TLB flushes hence the
deadlock doesn't exist for Xen.

Signed-off-by: Andrea Arcangeli <[email protected]>
---
arch/x86/include/asm/pgtable.h | 5 +++--
arch/x86/mm/fault.c | 10 ++++++----
arch/x86/mm/init_64.c | 10 ++++++----
arch/x86/mm/pgtable.c | 9 +++------
4 files changed, 18 insertions(+), 16 deletions(-)

--- a/arch/x86/mm/fault.c
+++ b/arch/x86/mm/fault.c
@@ -235,14 +235,16 @@ void vmalloc_sync_all(void)

spin_lock_irqsave(&pgd_lock, flags);
list_for_each_entry(page, &pgd_list, lru) {
- spinlock_t *pgt_lock;
+ struct mm_struct *mm;
pmd_t *ret;

- pgt_lock = &pgd_page_get_mm(page)->page_table_lock;
+ mm = pgd_page_get_mm(page);

- spin_lock(pgt_lock);
+ if (mm)
+ spin_lock(&mm->page_table_lock);
ret = vmalloc_sync_one(page_address(page), address);
- spin_unlock(pgt_lock);
+ if (mm)
+ spin_unlock(&mm->page_table_lock);

if (!ret)
break;
--- a/arch/x86/mm/init_64.c
+++ b/arch/x86/mm/init_64.c
@@ -114,11 +114,12 @@ void sync_global_pgds(unsigned long star
spin_lock_irqsave(&pgd_lock, flags);
list_for_each_entry(page, &pgd_list, lru) {
pgd_t *pgd;
- spinlock_t *pgt_lock;
+ struct mm_struct *mm;

pgd = (pgd_t *)page_address(page) + pgd_index(address);
- pgt_lock = &pgd_page_get_mm(page)->page_table_lock;
- spin_lock(pgt_lock);
+ mm = pgd_page_get_mm(page);
+ if (mm)
+ spin_lock(&mm->page_table_lock);

if (pgd_none(*pgd))
set_pgd(pgd, *pgd_ref);
@@ -126,7 +127,8 @@ void sync_global_pgds(unsigned long star
BUG_ON(pgd_page_vaddr(*pgd)
!= pgd_page_vaddr(*pgd_ref));

- spin_unlock(pgt_lock);
+ if (mm)
+ spin_unlock(&mm->page_table_lock);
}
spin_unlock_irqrestore(&pgd_lock, flags);
}
--- a/arch/x86/mm/pgtable.c
+++ b/arch/x86/mm/pgtable.c
@@ -4,6 +4,7 @@
#include <asm/pgtable.h>
#include <asm/tlb.h>
#include <asm/fixmap.h>
+#include <xen/xen.h>

#define PGALLOC_GFP GFP_KERNEL | __GFP_NOTRACK | __GFP_REPEAT | __GFP_ZERO

@@ -91,12 +92,8 @@ static inline void pgd_list_del(pgd_t *p
static void pgd_set_mm(pgd_t *pgd, struct mm_struct *mm)
{
BUILD_BUG_ON(sizeof(virt_to_page(pgd)->index) < sizeof(mm));
- virt_to_page(pgd)->index = (pgoff_t)mm;
-}
-
-struct mm_struct *pgd_page_get_mm(struct page *page)
-{
- return (struct mm_struct *)page->index;
+ if (xen_pv_domain())
+ virt_to_page(pgd)->index = (pgoff_t)mm;
}

static void pgd_ctor(struct mm_struct *mm, pgd_t *pgd)
--- a/arch/x86/include/asm/pgtable.h
+++ b/arch/x86/include/asm/pgtable.h
@@ -28,8 +28,6 @@ extern unsigned long empty_zero_page[PAG
extern spinlock_t pgd_lock;
extern struct list_head pgd_list;

-extern struct mm_struct *pgd_page_get_mm(struct page *page);
-
#ifdef CONFIG_PARAVIRT
#include <asm/paravirt.h>
#else /* !CONFIG_PARAVIRT */
@@ -83,6 +81,9 @@ extern struct mm_struct *pgd_page_get_mm

#endif /* CONFIG_PARAVIRT */

+#define pgd_page_get_mm(__page) \
+ ((struct mm_struct *)(xen_pv_domain() ? (__page)->index : 0))
+
/*
* The following only work if pte_present() is true.
* Undefined behaviour if not..

2011-02-24 08:23:43

by Jan Beulich

[permalink] [raw]
Subject: Re: [PATCH] fix pgd_lock deadlock

>>> On 24.02.11 at 05:22, Andrea Arcangeli <[email protected]> wrote:
> On Tue, Feb 22, 2011 at 02:22:53PM +0000, Jan Beulich wrote:
>> >>> On 22.02.11 at 14:49, Andrea Arcangeli <[email protected]> wrote:
>> > On Tue, Feb 22, 2011 at 07:48:54AM +0000, Jan Beulich wrote:
>> >> A possible alternative would be to acquire the page table lock
>> >> in vmalloc_sync_all() only in the Xen case (perhaps by storing
>> >> NULL into page->index in pgd_set_mm() when not running on
>> >> Xen). This is utilizing the fact that there aren't (supposed to
>> >> be - for non-pvops this is definitely the case) any TLB flush IPIs
>> >> under Xen, and hence the race you're trying to fix doesn't
>> >> exist there (while non-Xen doesn't need the extra locking).
>> >
>> > That's sure ok with me. Can we use a global runtime to check if the
>> > guest is running under Xen paravirt, instead of passing that info
>> > through page->something?
>>
>> If everyone's okay with putting a couple of "if (xen_pv_domain())"
>> into mm/fault.c - sure. I would have thought that this wouldn't be
>> liked, hence the suggestion to make this depend on seeing the
>> backlink be non-NULL.
>
> What about this? The page->private logic gets optimized away at
> compile time with XEN=n.
>
> The removal of _irqsave from pgd_lock, I'll delay it as it's no bugfix
> anymore.
>
> ===
> Subject: xen: stop taking the page_table_lock with irq disabled
>
> From: Andrea Arcangeli <[email protected]>
>
> It's forbidden to take the page_table_lock with the irq disabled or if there's
> contention the IPIs (for tlb flushes) sent with the page_table_lock held will
> never run leading to a deadlock.
>
> Only Xen needs the page_table_lock and Xen won't need IPI TLB flushes hence
> the deadlock doesn't exist for Xen.

Looks reasonable to me, except for the implementation no longer
matching subject and description (the lock still gets taken with
IRQs disabled, just that - as far as we can tell so far - doesn't
matter for Xen).

With the conditional on the reader side I also wonder whether
the conditional on the writer side is really a good thing to have,
considering that generally distro kernels are likely to have XEN
enabled.

Jan

2011-02-24 14:12:44

by Andrea Arcangeli

[permalink] [raw]
Subject: Re: [PATCH] fix pgd_lock deadlock

On Thu, Feb 24, 2011 at 08:23:37AM +0000, Jan Beulich wrote:
> >>> On 24.02.11 at 05:22, Andrea Arcangeli <[email protected]> wrote:
> > On Tue, Feb 22, 2011 at 02:22:53PM +0000, Jan Beulich wrote:
> >> >>> On 22.02.11 at 14:49, Andrea Arcangeli <[email protected]> wrote:
> >> > On Tue, Feb 22, 2011 at 07:48:54AM +0000, Jan Beulich wrote:
> >> >> A possible alternative would be to acquire the page table lock
> >> >> in vmalloc_sync_all() only in the Xen case (perhaps by storing
> >> >> NULL into page->index in pgd_set_mm() when not running on
> >> >> Xen). This is utilizing the fact that there aren't (supposed to
> >> >> be - for non-pvops this is definitely the case) any TLB flush IPIs
> >> >> under Xen, and hence the race you're trying to fix doesn't
> >> >> exist there (while non-Xen doesn't need the extra locking).
> >> >
> >> > That's sure ok with me. Can we use a global runtime to check if the
> >> > guest is running under Xen paravirt, instead of passing that info
> >> > through page->something?
> >>
> >> If everyone's okay with putting a couple of "if (xen_pv_domain())"
> >> into mm/fault.c - sure. I would have thought that this wouldn't be
> >> liked, hence the suggestion to make this depend on seeing the
> >> backlink be non-NULL.
> >
> > What about this? The page->private logic gets optimized away at
> > compile time with XEN=n.
> >
> > The removal of _irqsave from pgd_lock, I'll delay it as it's no bugfix
> > anymore.
> >
> > ===
> > Subject: xen: stop taking the page_table_lock with irq disabled
> >
> > From: Andrea Arcangeli <[email protected]>
> >
> > It's forbidden to take the page_table_lock with the irq disabled or if there's
> > contention the IPIs (for tlb flushes) sent with the page_table_lock held will
> > never run leading to a deadlock.
> >
> > Only Xen needs the page_table_lock and Xen won't need IPI TLB flushes hence
> > the deadlock doesn't exist for Xen.
>
> Looks reasonable to me, except for the implementation no longer
> matching subject and description (the lock still gets taken with
> IRQs disabled, just that - as far as we can tell so far - doesn't
> matter for Xen).
>
> With the conditional on the reader side I also wonder whether
> the conditional on the writer side is really a good thing to have,
> considering that generally distro kernels are likely to have XEN
> enabled.

Well, there is no point in keeping the writer side functional. There
aren't only distro kernels out there, and I really like features to go
away completely at build time when they're not needed. Not because
it's Xen (I recently did the same thing for THP too, for example,
making sure every sign of it goes away with a =n setting, with the
exception perhaps of the put_page/get_page compound logic, but at least
the compound_lock goes away). It simply makes no sense to set
page->index = mm if nobody could possibly read it, so I prefer this.
Otherwise I wouldn't need to put in a macro for the reader side to work
around the fact that xen.h isn't available in pgtable.h, and I could
leave it in pgtable.c (and I didn't want to add it to pgtable.h). It
seems to build on i386, but it's better to re-verify it, because on
older kernels I had to add a xen/xen.h include to x86/mm/fault.c on
x86_64 too (upstream fault.c doesn't seem to need it). I'll try to
re-run some builds with XEN on and off, on x86 32/64, to be really sure
all the includes are ok.