The purpose of this series is:
Waiman Long reported 'pgd_lock' contention on high CPU count systems and proposed
moving pgd_lock to a separate cacheline to eliminate false sharing and to reduce
some of the lock bouncing overhead.
I think we can do much better: this series eliminates the pgd_list and makes
pgd_alloc()/pgd_free() lockless.
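The conversion pattern used throughout the series is roughly the following
(an illustrative sketch, not a literal hunk from any of the patches): instead
of walking the global pgd_list, walk the task list under pgd_lock and
task_lock():

    struct task_struct *g, *p;
    struct page *page;

    /* Old pattern: global list of PGD pages, maintained by pgd_alloc()/pgd_free(): */
    spin_lock(&pgd_lock);
    list_for_each_entry(page, &pgd_list, lru) {
            pgd_t *pgd = (pgd_t *)page_address(page);

            /* ... update this PGD ... */
    }
    spin_unlock(&pgd_lock);

    /* New pattern: walk the task list instead - no per-PGD list maintenance needed: */
    spin_lock(&pgd_lock);           /* Implies rcu_read_lock() for the task list iteration */
    for_each_process_thread(g, p) {
            struct mm_struct *mm;

            task_lock(p);
            mm = p->mm;
            if (mm) {
                    /* ... update mm->pgd ... */
            }
            task_unlock(p);
    }
    spin_unlock(&pgd_lock);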
This is the -v2 submission; it addresses all feedback received so far,
fixes bugs and cleans up details. There's a v1->v2 interdiff attached
further below: most of the changes relate to locking the task before looking
at tsk->mm.
The series is also available in -tip:master for easy testing:
git git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip.git master
It's not applied to any visible topic tree yet. (Its sha1 is bc40a788fc63
if you want to test it without pending non-x86 bits in -tip.)
Thanks,
Ingo
====================>
Ingo Molnar (12):
x86/mm/pat: Don't free PGD entries on memory unmap
x86/mm/hotplug: Remove pgd_list use from the memory hotplug code
x86/mm/hotplug: Don't remove PGD entries in remove_pagetable()
x86/mm/hotplug: Simplify sync_global_pgds()
mm: Introduce arch_pgd_init_late()
x86/mm: Enable and use the arch_pgd_init_late() method
x86/virt/guest/xen: Remove use of pgd_list from the Xen guest code
x86/mm: Remove pgd_list use from vmalloc_sync_all()
x86/mm/pat/32: Remove pgd_list use from the PAT code
x86/mm: Make pgd_alloc()/pgd_free() lockless
x86/mm: Remove pgd_list leftovers
x86/mm: Simplify pgd_alloc()
arch/Kconfig | 9 ++++
arch/x86/Kconfig | 1 +
arch/x86/include/asm/pgtable.h | 3 --
arch/x86/include/asm/pgtable_64.h | 3 +-
arch/x86/mm/fault.c | 31 +++++++++-----
arch/x86/mm/init_64.c | 80 ++++++++++++-----------------------
arch/x86/mm/pageattr.c | 41 +++++++++---------
arch/x86/mm/pgtable.c | 131 ++++++++++++++++++++++++++++++----------------------------
arch/x86/xen/mmu.c | 51 +++++++++++++++++------
fs/exec.c | 3 ++
include/linux/mm.h | 6 +++
kernel/fork.c | 16 +++++++
12 files changed, 209 insertions(+), 166 deletions(-)
-v1 => -v2 interdiff:
====
arch/x86/mm/fault.c | 17 +++++++++++------
arch/x86/mm/init_64.c | 23 ++++++++++++-----------
arch/x86/mm/pageattr.c | 27 ++++++++++++++-------------
arch/x86/mm/pgtable.c | 22 ++++++++++++----------
arch/x86/xen/mmu.c | 47 +++++++++++++++++++++++++----------------------
fs/exec.c | 2 +-
include/linux/mm.h | 4 ++--
kernel/fork.c | 4 ++--
8 files changed, 79 insertions(+), 67 deletions(-)
diff --git a/arch/x86/mm/fault.c b/arch/x86/mm/fault.c
index bebdd97f888b..14b39c3511fd 100644
--- a/arch/x86/mm/fault.c
+++ b/arch/x86/mm/fault.c
@@ -237,28 +237,33 @@ void vmalloc_sync_all(void)
struct task_struct *g, *p;
- rcu_read_lock();
- spin_lock(&pgd_lock);
+ spin_lock(&pgd_lock); /* Implies rcu_read_lock(): */
for_each_process_thread(g, p) {
+ struct mm_struct *mm;
spinlock_t *pgt_lock;
pmd_t *pmd_ret;
- if (!p->mm)
+ task_lock(p);
+ mm = p->mm;
+ if (!mm) {
+ task_unlock(p);
continue;
+ }
/* The pgt_lock is only used on Xen: */
- pgt_lock = &p->mm->page_table_lock;
+ pgt_lock = &mm->page_table_lock;
spin_lock(pgt_lock);
- pmd_ret = vmalloc_sync_one(p->mm->pgd, address);
+ pmd_ret = vmalloc_sync_one(mm->pgd, address);
spin_unlock(pgt_lock);
+ task_unlock(p);
+
if (!pmd_ret)
break;
}
spin_unlock(&pgd_lock);
- rcu_read_unlock();
}
}
diff --git a/arch/x86/mm/init_64.c b/arch/x86/mm/init_64.c
index 730560c4873e..dcb2f45caf0e 100644
--- a/arch/x86/mm/init_64.c
+++ b/arch/x86/mm/init_64.c
@@ -171,28 +171,28 @@ void sync_global_pgds(unsigned long start, unsigned long end)
const pgd_t *pgd_ref = pgd_offset_k(address);
struct task_struct *g, *p;
- /*
- * When this function is called after memory hot remove,
- * pgd_none() already returns true, but only the reference
- * kernel PGD has been cleared, not the process PGDs.
- *
- * So clear the affected entries in every process PGD as well:
- */
+ /* Only sync (potentially) newly added PGD entries: */
if (pgd_none(*pgd_ref))
continue;
- spin_lock(&pgd_lock);
+ spin_lock(&pgd_lock); /* Implies rcu_read_lock() for the task list iteration: */
for_each_process_thread(g, p) {
+ struct mm_struct *mm;
pgd_t *pgd;
spinlock_t *pgt_lock;
- if (!p->mm)
+ task_lock(p);
+ mm = p->mm;
+ if (!mm) {
+ task_unlock(p);
continue;
- pgd = p->mm->pgd;
+ }
+
+ pgd = mm->pgd;
/* The pgt_lock is only used by Xen: */
- pgt_lock = &p->mm->page_table_lock;
+ pgt_lock = &mm->page_table_lock;
spin_lock(pgt_lock);
if (!pgd_none(*pgd_ref) && !pgd_none(*pgd))
@@ -202,6 +202,7 @@ void sync_global_pgds(unsigned long start, unsigned long end)
set_pgd(pgd, *pgd_ref);
spin_unlock(pgt_lock);
+ task_unlock(p);
}
spin_unlock(&pgd_lock);
}
diff --git a/arch/x86/mm/pageattr.c b/arch/x86/mm/pageattr.c
index 93c134fdb398..4ff6a1808f1d 100644
--- a/arch/x86/mm/pageattr.c
+++ b/arch/x86/mm/pageattr.c
@@ -440,29 +440,30 @@ static void __set_pmd_pte(pte_t *kpte, unsigned long address, pte_t pte)
if (!SHARED_KERNEL_PMD) {
struct task_struct *g, *p;
- rcu_read_lock();
+ /* We are holding pgd_lock, which implies rcu_read_lock(): */
for_each_process_thread(g, p) {
+ struct mm_struct *mm;
spinlock_t *pgt_lock;
pgd_t *pgd;
pud_t *pud;
pmd_t *pmd;
- if (!p->mm)
- continue;
+ task_lock(p);
+ mm = p->mm;
+ if (mm) {
+ pgt_lock = &mm->page_table_lock;
+ spin_lock(pgt_lock);
- pgt_lock = &p->mm->page_table_lock;
- spin_lock(pgt_lock);
+ pgd = mm->pgd + pgd_index(address);
+ pud = pud_offset(pgd, address);
+ pmd = pmd_offset(pud, address);
+ set_pte_atomic((pte_t *)pmd, pte);
- pgd = p->mm->pgd + pgd_index(address);
- pud = pud_offset(pgd, address);
- pmd = pmd_offset(pud, address);
- set_pte_atomic((pte_t *)pmd, pte);
-
- spin_unlock(pgt_lock);
+ spin_unlock(pgt_lock);
+ }
+ task_unlock(p);
}
-
- rcu_read_unlock();
}
#endif
}
diff --git a/arch/x86/mm/pgtable.c b/arch/x86/mm/pgtable.c
index b1bd35f452ef..d7d341e57e33 100644
--- a/arch/x86/mm/pgtable.c
+++ b/arch/x86/mm/pgtable.c
@@ -348,15 +348,17 @@ pgd_t *pgd_alloc(struct mm_struct *mm)
* at the highest, PGD level. Thus any other task extending it
* will first update the reference PGD, then modify the task PGDs.
*/
-void arch_pgd_init_late(struct mm_struct *mm, pgd_t *pgd)
+void arch_pgd_init_late(struct mm_struct *mm)
{
/*
- * This is called after a new MM has been made visible
- * in fork() or exec().
+ * This function is called after a new MM has been made visible
+ * in fork() or exec() via:
+ *
+ * tsk->mm = mm;
*
* This barrier makes sure the MM is visible to new RCU
- * walkers before we initialize it, so that we don't miss
- * updates:
+ * walkers before we initialize the pagetables below, so that
+ * we don't miss updates:
*/
smp_wmb();
@@ -365,12 +367,12 @@ void arch_pgd_init_late(struct mm_struct *mm, pgd_t *pgd)
* ptes in non-PAE, or shared PMD in PAE), then just copy the
* references from swapper_pg_dir:
*/
- if (CONFIG_PGTABLE_LEVELS == 2 ||
+ if ( CONFIG_PGTABLE_LEVELS == 2 ||
(CONFIG_PGTABLE_LEVELS == 3 && SHARED_KERNEL_PMD) ||
- CONFIG_PGTABLE_LEVELS == 4) {
+ CONFIG_PGTABLE_LEVELS == 4) {
pgd_t *pgd_src = swapper_pg_dir + KERNEL_PGD_BOUNDARY;
- pgd_t *pgd_dst = pgd + KERNEL_PGD_BOUNDARY;
+ pgd_t *pgd_dst = mm->pgd + KERNEL_PGD_BOUNDARY;
int i;
for (i = 0; i < KERNEL_PGD_PTRS; i++, pgd_src++, pgd_dst++) {
@@ -387,8 +389,8 @@ void arch_pgd_init_late(struct mm_struct *mm, pgd_t *pgd)
* it won't change from under us. Parallel updates (new
* allocations) will modify our (already visible) PGD:
*/
- if (pgd_val(*pgd_src))
- WRITE_ONCE(*pgd_dst, *pgd_src);
+ if (!pgd_none(*pgd_src))
+ set_pgd(pgd_dst, *pgd_src);
}
}
}
diff --git a/arch/x86/xen/mmu.c b/arch/x86/xen/mmu.c
index 87a8354435f8..70a3df5b0b54 100644
--- a/arch/x86/xen/mmu.c
+++ b/arch/x86/xen/mmu.c
@@ -855,27 +855,28 @@ void xen_mm_pin_all(void)
{
struct task_struct *g, *p;
- rcu_read_lock();
- spin_lock(&pgd_lock);
+ spin_lock(&pgd_lock); /* Implies rcu_read_lock() for the task list iteration: */
for_each_process_thread(g, p) {
+ struct mm_struct *mm;
struct page *page;
pgd_t *pgd;
- if (!p->mm)
- continue;
-
- pgd = p->mm->pgd;
- page = virt_to_page(pgd);
+ task_lock(p);
+ mm = p->mm;
+ if (mm) {
+ pgd = mm->pgd;
+ page = virt_to_page(pgd);
- if (!PagePinned(page)) {
- __xen_pgd_pin(&init_mm, pgd);
- SetPageSavePinned(page);
+ if (!PagePinned(page)) {
+ __xen_pgd_pin(&init_mm, pgd);
+ SetPageSavePinned(page);
+ }
}
+ task_unlock(p);
}
spin_unlock(&pgd_lock);
- rcu_read_unlock();
}
/*
@@ -980,24 +981,26 @@ void xen_mm_unpin_all(void)
{
struct task_struct *g, *p;
- rcu_read_lock();
- spin_lock(&pgd_lock);
+ spin_lock(&pgd_lock); /* Implies rcu_read_lock() for the task list iteration: */
for_each_process_thread(g, p) {
+ struct mm_struct *mm;
struct page *page;
pgd_t *pgd;
- if (!p->mm)
- continue;
-
- pgd = p->mm->pgd;
- page = virt_to_page(pgd);
+ task_lock(p);
+ mm = p->mm;
+ if (mm) {
+ pgd = mm->pgd;
+ page = virt_to_page(pgd);
- if (PageSavePinned(page)) {
- BUG_ON(!PagePinned(page));
- __xen_pgd_unpin(&init_mm, pgd);
- ClearPageSavePinned(page);
+ if (PageSavePinned(page)) {
+ BUG_ON(!PagePinned(page));
+ __xen_pgd_unpin(&init_mm, pgd);
+ ClearPageSavePinned(page);
+ }
}
+ task_unlock(p);
}
spin_unlock(&pgd_lock);
diff --git a/fs/exec.c b/fs/exec.c
index c1d213c64fda..4ce1383d5bba 100644
--- a/fs/exec.c
+++ b/fs/exec.c
@@ -862,7 +862,7 @@ static int exec_mmap(struct mm_struct *mm)
active_mm = tsk->active_mm;
tsk->mm = mm;
- arch_pgd_init_late(mm, mm->pgd);
+ arch_pgd_init_late(mm);
tsk->active_mm = mm;
activate_mm(active_mm, mm);
diff --git a/include/linux/mm.h b/include/linux/mm.h
index 35d887d2b038..a3edc839e431 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -1135,9 +1135,9 @@ int generic_access_phys(struct vm_area_struct *vma, unsigned long addr,
void *buf, int len, int write);
#ifdef CONFIG_ARCH_HAS_PGD_INIT_LATE
-void arch_pgd_init_late(struct mm_struct *mm, pgd_t *pgd);
+void arch_pgd_init_late(struct mm_struct *mm);
#else
-static inline void arch_pgd_init_late(struct mm_struct *mm, pgd_t *pgd) { }
+static inline void arch_pgd_init_late(struct mm_struct *mm) { }
#endif
static inline void unmap_shared_mapping_range(struct address_space *mapping,
diff --git a/kernel/fork.c b/kernel/fork.c
index 1f83ceca6c6c..cfa84971fb52 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -1605,8 +1605,8 @@ static struct task_struct *copy_process(unsigned long clone_flags,
* NOTE: any user-space parts of the PGD are already initialized
* and must not be clobbered.
*/
- if (p->mm != current->mm)
- arch_pgd_init_late(p->mm, p->mm->pgd);
+ if (!(clone_flags & CLONE_VM))
+ arch_pgd_init_late(p->mm);
proc_fork_connector(p);
cgroup_post_fork(p);
So when we unmap a kernel memory range, we also check whether a PUD
is completely clear, and if so we free it and clear the PGD entry.
This complicates PGD management, so don't do this. We can keep the
PGD mapped and the PUD table all clear - it's only a single 4K page
per 512 GB of memory mapped.
Cc: Andrew Morton <[email protected]>
Cc: Andy Lutomirski <[email protected]>
Cc: Borislav Petkov <[email protected]>
Cc: Brian Gerst <[email protected]>
Cc: Denys Vlasenko <[email protected]>
Cc: H. Peter Anvin <[email protected]>
Cc: Linus Torvalds <[email protected]>
Cc: Oleg Nesterov <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Cc: Rik van Riel <[email protected]>
Cc: Thomas Gleixner <[email protected]>
Cc: Waiman Long <[email protected]>
Cc: [email protected]
Signed-off-by: Ingo Molnar <[email protected]>
---
arch/x86/mm/pageattr.c | 15 ---------------
1 file changed, 15 deletions(-)
diff --git a/arch/x86/mm/pageattr.c b/arch/x86/mm/pageattr.c
index 70d221fe2eb4..997fc97e9072 100644
--- a/arch/x86/mm/pageattr.c
+++ b/arch/x86/mm/pageattr.c
@@ -717,18 +717,6 @@ static bool try_to_free_pmd_page(pmd_t *pmd)
return true;
}
-static bool try_to_free_pud_page(pud_t *pud)
-{
- int i;
-
- for (i = 0; i < PTRS_PER_PUD; i++)
- if (!pud_none(pud[i]))
- return false;
-
- free_page((unsigned long)pud);
- return true;
-}
-
static bool unmap_pte_range(pmd_t *pmd, unsigned long start, unsigned long end)
{
pte_t *pte = pte_offset_kernel(pmd, start);
@@ -847,9 +835,6 @@ static void unmap_pgd_range(pgd_t *root, unsigned long addr, unsigned long end)
pgd_t *pgd_entry = root + pgd_index(addr);
unmap_pud_range(pgd_entry, addr, end);
-
- if (try_to_free_pud_page((pud_t *)pgd_page_vaddr(*pgd_entry)))
- pgd_clear(pgd_entry);
}
static int alloc_pte_page(pmd_t *pmd)
--
2.1.4
The memory hotplug code uses sync_global_pgds() to synchronize updates
to the global (&init_mm) kernel PGD and the task PGDs. It does this
by iterating over the pgd_list - a list which closely tracks task
creation/destruction via fork()/exit().
But we want to remove this list so that it does not have to be
maintained from fork()/exit(), so convert the memory hotplug code
to use the task list to iterate over all PGDs in the system.
Also improve the comments a bit, to make this function easier
to understand.
Only lightly tested, as I don't have a memory hotplug setup.
Cc: Andrew Morton <[email protected]>
Cc: Andy Lutomirski <[email protected]>
Cc: Borislav Petkov <[email protected]>
Cc: Brian Gerst <[email protected]>
Cc: Denys Vlasenko <[email protected]>
Cc: H. Peter Anvin <[email protected]>
Cc: Linus Torvalds <[email protected]>
Cc: Oleg Nesterov <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Cc: Thomas Gleixner <[email protected]>
Cc: Waiman Long <[email protected]>
Cc: [email protected]
Signed-off-by: Ingo Molnar <[email protected]>
---
arch/x86/mm/init_64.c | 38 +++++++++++++++++++++++++-------------
1 file changed, 25 insertions(+), 13 deletions(-)
diff --git a/arch/x86/mm/init_64.c b/arch/x86/mm/init_64.c
index 3fba623e3ba5..527d5d4d020c 100644
--- a/arch/x86/mm/init_64.c
+++ b/arch/x86/mm/init_64.c
@@ -160,8 +160,8 @@ static int __init nonx32_setup(char *str)
__setup("noexec32=", nonx32_setup);
/*
- * When memory was added/removed make sure all the processes MM have
- * suitable PGD entries in the local PGD level page.
+ * When memory was added/removed make sure all the process MMs have
+ * matching PGD entries in the local PGD level page as well.
*/
void sync_global_pgds(unsigned long start, unsigned long end, int removed)
{
@@ -169,29 +169,40 @@ void sync_global_pgds(unsigned long start, unsigned long end, int removed)
for (address = start; address <= end; address += PGDIR_SIZE) {
const pgd_t *pgd_ref = pgd_offset_k(address);
- struct page *page;
+ struct task_struct *g, *p;
/*
- * When it is called after memory hot remove, pgd_none()
- * returns true. In this case (removed == 1), we must clear
- * the PGD entries in the local PGD level page.
+ * When this function is called after memory hot remove,
+ * pgd_none() already returns true, but only the reference
+ * kernel PGD has been cleared, not the process PGDs.
+ *
+ * So clear the affected entries in every process PGD as well:
*/
if (pgd_none(*pgd_ref) && !removed)
continue;
- spin_lock(&pgd_lock);
- list_for_each_entry(page, &pgd_list, lru) {
+ spin_lock(&pgd_lock); /* Implies rcu_read_lock() for the task list iteration: */
+
+ for_each_process_thread(g, p) {
+ struct mm_struct *mm;
pgd_t *pgd;
spinlock_t *pgt_lock;
- pgd = (pgd_t *)page_address(page) + pgd_index(address);
- /* the pgt_lock only for Xen */
- pgt_lock = &pgd_page_get_mm(page)->page_table_lock;
+ task_lock(p);
+ mm = p->mm;
+ if (!mm) {
+ task_unlock(p);
+ continue;
+ }
+
+ pgd = mm->pgd;
+
+ /* The pgt_lock is only used by Xen: */
+ pgt_lock = &mm->page_table_lock;
spin_lock(pgt_lock);
if (!pgd_none(*pgd_ref) && !pgd_none(*pgd))
- BUG_ON(pgd_page_vaddr(*pgd)
- != pgd_page_vaddr(*pgd_ref));
+ BUG_ON(pgd_page_vaddr(*pgd) != pgd_page_vaddr(*pgd_ref));
if (removed) {
if (pgd_none(*pgd_ref) && !pgd_none(*pgd))
@@ -202,6 +213,7 @@ void sync_global_pgds(unsigned long start, unsigned long end, int removed)
}
spin_unlock(pgt_lock);
+ task_unlock(p);
}
spin_unlock(&pgd_lock);
}
--
2.1.4
So when memory hotplug removes a piece of physical memory from the pagetable
mappings, it also frees the underlying PUD page and clears the PGD entry.
This complicates PGD management, so don't do this. We can keep the
PGD mapped and the PUD table all clear - it's only a single 4K page
per 512 GB of memory hotplugged.
Cc: Andrew Morton <[email protected]>
Cc: Andy Lutomirski <[email protected]>
Cc: Borislav Petkov <[email protected]>
Cc: Brian Gerst <[email protected]>
Cc: Denys Vlasenko <[email protected]>
Cc: H. Peter Anvin <[email protected]>
Cc: Linus Torvalds <[email protected]>
Cc: Oleg Nesterov <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Cc: Rik van Riel <[email protected]>
Cc: Thomas Gleixner <[email protected]>
Cc: Waiman Long <[email protected]>
Cc: [email protected]
Signed-off-by: Ingo Molnar <[email protected]>
---
arch/x86/mm/init_64.c | 27 ---------------------------
1 file changed, 27 deletions(-)
diff --git a/arch/x86/mm/init_64.c b/arch/x86/mm/init_64.c
index 527d5d4d020c..7a988dbad240 100644
--- a/arch/x86/mm/init_64.c
+++ b/arch/x86/mm/init_64.c
@@ -778,27 +778,6 @@ static void __meminit free_pmd_table(pmd_t *pmd_start, pud_t *pud)
spin_unlock(&init_mm.page_table_lock);
}
-/* Return true if pgd is changed, otherwise return false. */
-static bool __meminit free_pud_table(pud_t *pud_start, pgd_t *pgd)
-{
- pud_t *pud;
- int i;
-
- for (i = 0; i < PTRS_PER_PUD; i++) {
- pud = pud_start + i;
- if (pud_val(*pud))
- return false;
- }
-
- /* free a pud table */
- free_pagetable(pgd_page(*pgd), 0);
- spin_lock(&init_mm.page_table_lock);
- pgd_clear(pgd);
- spin_unlock(&init_mm.page_table_lock);
-
- return true;
-}
-
static void __meminit
remove_pte_table(pte_t *pte_start, unsigned long addr, unsigned long end,
bool direct)
@@ -990,7 +969,6 @@ remove_pagetable(unsigned long start, unsigned long end, bool direct)
unsigned long addr;
pgd_t *pgd;
pud_t *pud;
- bool pgd_changed = false;
for (addr = start; addr < end; addr = next) {
next = pgd_addr_end(addr, end);
@@ -1001,13 +979,8 @@ remove_pagetable(unsigned long start, unsigned long end, bool direct)
pud = (pud_t *)pgd_page_vaddr(*pgd);
remove_pud_table(pud, addr, next, direct);
- if (free_pud_table(pud, pgd))
- pgd_changed = true;
}
- if (pgd_changed)
- sync_global_pgds(start, end - 1, 1);
-
flush_tlb_all();
}
--
2.1.4
Now that the memory hotplug code does not remove PGD entries anymore,
the only users of sync_global_pgds() use it after extending the
PGD.
So remove the 'removed' parameter and simplify the call sites.
Cc: Andrew Morton <[email protected]>
Cc: Andy Lutomirski <[email protected]>
Cc: Borislav Petkov <[email protected]>
Cc: Brian Gerst <[email protected]>
Cc: Denys Vlasenko <[email protected]>
Cc: H. Peter Anvin <[email protected]>
Cc: Linus Torvalds <[email protected]>
Cc: Oleg Nesterov <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Cc: Rik van Riel <[email protected]>
Cc: Thomas Gleixner <[email protected]>
Cc: Waiman Long <[email protected]>
Cc: [email protected]
Signed-off-by: Ingo Molnar <[email protected]>
---
arch/x86/include/asm/pgtable_64.h | 3 +--
arch/x86/mm/fault.c | 2 +-
arch/x86/mm/init_64.c | 27 ++++++++-------------------
3 files changed, 10 insertions(+), 22 deletions(-)
diff --git a/arch/x86/include/asm/pgtable_64.h b/arch/x86/include/asm/pgtable_64.h
index 2ee781114d34..f405fc3bb719 100644
--- a/arch/x86/include/asm/pgtable_64.h
+++ b/arch/x86/include/asm/pgtable_64.h
@@ -116,8 +116,7 @@ static inline void native_pgd_clear(pgd_t *pgd)
native_set_pgd(pgd, native_make_pgd(0));
}
-extern void sync_global_pgds(unsigned long start, unsigned long end,
- int removed);
+extern void sync_global_pgds(unsigned long start, unsigned long end);
/*
* Conversion functions: convert a page and protection to a page entry,
diff --git a/arch/x86/mm/fault.c b/arch/x86/mm/fault.c
index 181c53bac3a7..50342825f221 100644
--- a/arch/x86/mm/fault.c
+++ b/arch/x86/mm/fault.c
@@ -349,7 +349,7 @@ static void dump_pagetable(unsigned long address)
void vmalloc_sync_all(void)
{
- sync_global_pgds(VMALLOC_START & PGDIR_MASK, VMALLOC_END, 0);
+ sync_global_pgds(VMALLOC_START & PGDIR_MASK, VMALLOC_END);
}
/*
diff --git a/arch/x86/mm/init_64.c b/arch/x86/mm/init_64.c
index 7a988dbad240..dcb2f45caf0e 100644
--- a/arch/x86/mm/init_64.c
+++ b/arch/x86/mm/init_64.c
@@ -160,10 +160,10 @@ static int __init nonx32_setup(char *str)
__setup("noexec32=", nonx32_setup);
/*
- * When memory was added/removed make sure all the process MMs have
+ * When memory was added make sure all the process MMs have
* matching PGD entries in the local PGD level page as well.
*/
-void sync_global_pgds(unsigned long start, unsigned long end, int removed)
+void sync_global_pgds(unsigned long start, unsigned long end)
{
unsigned long address;
@@ -171,14 +171,8 @@ void sync_global_pgds(unsigned long start, unsigned long end, int removed)
const pgd_t *pgd_ref = pgd_offset_k(address);
struct task_struct *g, *p;
- /*
- * When this function is called after memory hot remove,
- * pgd_none() already returns true, but only the reference
- * kernel PGD has been cleared, not the process PGDs.
- *
- * So clear the affected entries in every process PGD as well:
- */
- if (pgd_none(*pgd_ref) && !removed)
+ /* Only sync (potentially) newly added PGD entries: */
+ if (pgd_none(*pgd_ref))
continue;
spin_lock(&pgd_lock); /* Implies rcu_read_lock() for the task list iteration: */
@@ -204,13 +198,8 @@ void sync_global_pgds(unsigned long start, unsigned long end, int removed)
if (!pgd_none(*pgd_ref) && !pgd_none(*pgd))
BUG_ON(pgd_page_vaddr(*pgd) != pgd_page_vaddr(*pgd_ref));
- if (removed) {
- if (pgd_none(*pgd_ref) && !pgd_none(*pgd))
- pgd_clear(pgd);
- } else {
- if (pgd_none(*pgd))
- set_pgd(pgd, *pgd_ref);
- }
+ if (pgd_none(*pgd))
+ set_pgd(pgd, *pgd_ref);
spin_unlock(pgt_lock);
task_unlock(p);
@@ -644,7 +633,7 @@ kernel_physical_mapping_init(unsigned long start,
}
if (pgd_changed)
- sync_global_pgds(addr, end - 1, 0);
+ sync_global_pgds(addr, end - 1);
__flush_tlb_all();
@@ -1284,7 +1273,7 @@ int __meminit vmemmap_populate(unsigned long start, unsigned long end, int node)
else
err = vmemmap_populate_basepages(start, end, node);
if (!err)
- sync_global_pgds(start, end - 1, 0);
+ sync_global_pgds(start, end - 1);
return err;
}
--
2.1.4
Add a late PGD init callback to places that allocate a new MM
with a new PGD: copy_process() and exec().
The purpose of this callback is to allow architectures to implement
lockless initialization of task PGDs, to remove the scalability
limit of pgd_list/pgd_lock.
Architectures can opt in to this callback via the ARCH_HAS_PGD_INIT_LATE
Kconfig flag. There's zero overhead on architectures that are not using it.
Cc: Andrew Morton <[email protected]>
Cc: Andy Lutomirski <[email protected]>
Cc: Borislav Petkov <[email protected]>
Cc: Brian Gerst <[email protected]>
Cc: Denys Vlasenko <[email protected]>
Cc: H. Peter Anvin <[email protected]>
Cc: Linus Torvalds <[email protected]>
Cc: Oleg Nesterov <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Cc: Rik van Riel <[email protected]>
Cc: Thomas Gleixner <[email protected]>
Cc: Waiman Long <[email protected]>
Cc: [email protected]
Cc: [email protected]
Signed-off-by: Ingo Molnar <[email protected]>
---
arch/Kconfig | 9 +++++++++
fs/exec.c | 3 +++
include/linux/mm.h | 6 ++++++
kernel/fork.c | 16 ++++++++++++++++
4 files changed, 34 insertions(+)
diff --git a/arch/Kconfig b/arch/Kconfig
index a65eafb24997..a8e866cd4247 100644
--- a/arch/Kconfig
+++ b/arch/Kconfig
@@ -491,6 +491,15 @@ config PGTABLE_LEVELS
int
default 2
+config ARCH_HAS_PGD_INIT_LATE
+ bool
+ help
+ Architectures that want a late PGD initialization can define
+ the arch_pgd_init_late() callback and it will be called
+ by the generic new task (fork()) code after a new task has
+ been made visible on the task list, but before it has been
+ first scheduled.
+
config ARCH_HAS_ELF_RANDOMIZE
bool
help
diff --git a/fs/exec.c b/fs/exec.c
index 1977c2a553ac..4ce1383d5bba 100644
--- a/fs/exec.c
+++ b/fs/exec.c
@@ -860,7 +860,10 @@ static int exec_mmap(struct mm_struct *mm)
}
task_lock(tsk);
active_mm = tsk->active_mm;
+
tsk->mm = mm;
+ arch_pgd_init_late(mm);
+
tsk->active_mm = mm;
activate_mm(active_mm, mm);
tsk->mm->vmacache_seqnum = 0;
diff --git a/include/linux/mm.h b/include/linux/mm.h
index 0755b9fd03a7..a3edc839e431 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -1134,6 +1134,12 @@ int follow_phys(struct vm_area_struct *vma, unsigned long address,
int generic_access_phys(struct vm_area_struct *vma, unsigned long addr,
void *buf, int len, int write);
+#ifdef CONFIG_ARCH_HAS_PGD_INIT_LATE
+void arch_pgd_init_late(struct mm_struct *mm);
+#else
+static inline void arch_pgd_init_late(struct mm_struct *mm) { }
+#endif
+
static inline void unmap_shared_mapping_range(struct address_space *mapping,
loff_t const holebegin, loff_t const holelen)
{
diff --git a/kernel/fork.c b/kernel/fork.c
index 03c1eaaa6ef5..cfa84971fb52 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -1592,6 +1592,22 @@ static struct task_struct *copy_process(unsigned long clone_flags,
syscall_tracepoint_update(p);
write_unlock_irq(&tasklist_lock);
+ /*
+ * If we have a new PGD then initialize it:
+ *
+ * This method is called after a task has been made visible
+ * on the task list already.
+ *
+ * Architectures that manage per task kernel pagetables
+ * might use this callback to initialize them after they
+ * are already visible to new updates.
+ *
+ * NOTE: any user-space parts of the PGD are already initialized
+ * and must not be clobbered.
+ */
+ if (!(clone_flags & CLONE_VM))
+ arch_pgd_init_late(p->mm);
+
proc_fork_connector(p);
cgroup_post_fork(p);
if (clone_flags & CLONE_THREAD)
--
2.1.4
Prepare for lockless PGD init: enable the arch_pgd_init_late() callback
and add a 'careful' implementation of PGD init to it, which only copies
over non-zero entries.
Since PGD entries only ever get added, this method catches any updates
to swapper_pg_dir[] that might have occurred between early PGD init
and late PGD init.
Note that this only matters for code that does not use the pgd_list but
the task list to find all PGDs in the system.
Subsequent patches will convert pgd_list users to task-list iterations.
[ This adds extra overhead in that we do the PGD initialization for a
second time - a later patch will simplify this, once we don't have
old pgd_list users. ]
Cc: Andrew Morton <[email protected]>
Cc: Andy Lutomirski <[email protected]>
Cc: Borislav Petkov <[email protected]>
Cc: Brian Gerst <[email protected]>
Cc: Denys Vlasenko <[email protected]>
Cc: H. Peter Anvin <[email protected]>
Cc: Linus Torvalds <[email protected]>
Cc: Oleg Nesterov <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Cc: Thomas Gleixner <[email protected]>
Cc: [email protected]
Signed-off-by: Ingo Molnar <[email protected]>
---
arch/x86/Kconfig | 1 +
arch/x86/mm/pgtable.c | 59 +++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
2 files changed, 60 insertions(+)
diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig
index 7e39f9b22705..15c19ce149f0 100644
--- a/arch/x86/Kconfig
+++ b/arch/x86/Kconfig
@@ -27,6 +27,7 @@ config X86
select ARCH_HAS_ELF_RANDOMIZE
select ARCH_HAS_FAST_MULTIPLIER
select ARCH_HAS_GCOV_PROFILE_ALL
+ select ARCH_HAS_PGD_INIT_LATE
select ARCH_HAS_SG_CHAIN
select ARCH_HAVE_NMI_SAFE_CMPXCHG
select ARCH_MIGHT_HAVE_ACPI_PDC if ACPI
diff --git a/arch/x86/mm/pgtable.c b/arch/x86/mm/pgtable.c
index fb0a9dd1d6e4..7a561b7cc01c 100644
--- a/arch/x86/mm/pgtable.c
+++ b/arch/x86/mm/pgtable.c
@@ -391,6 +391,65 @@ pgd_t *pgd_alloc(struct mm_struct *mm)
return NULL;
}
+/*
+ * Initialize the kernel portion of the PGD.
+ *
+ * This is done separately, because pgd_alloc() happens when
+ * the task is not on the task list yet - and PGD updates
+ * happen by walking the task list.
+ *
+ * No locking is needed here, as we just copy over the reference
+ * PGD. The reference PGD (pgtable_init) is only ever expanded
+ * at the highest, PGD level. Thus any other task extending it
+ * will first update the reference PGD, then modify the task PGDs.
+ */
+void arch_pgd_init_late(struct mm_struct *mm)
+{
+ /*
+ * This function is called after a new MM has been made visible
+ * in fork() or exec() via:
+ *
+ * tsk->mm = mm;
+ *
+ * This barrier makes sure the MM is visible to new RCU
+ * walkers before we initialize the pagetables below, so that
+ * we don't miss updates:
+ */
+ smp_wmb();
+
+ /*
+ * If the pgd points to a shared pagetable level (either the
+ * ptes in non-PAE, or shared PMD in PAE), then just copy the
+ * references from swapper_pg_dir:
+ */
+ if ( CONFIG_PGTABLE_LEVELS == 2 ||
+ (CONFIG_PGTABLE_LEVELS == 3 && SHARED_KERNEL_PMD) ||
+ CONFIG_PGTABLE_LEVELS == 4) {
+
+ pgd_t *pgd_src = swapper_pg_dir + KERNEL_PGD_BOUNDARY;
+ pgd_t *pgd_dst = mm->pgd + KERNEL_PGD_BOUNDARY;
+ int i;
+
+ for (i = 0; i < KERNEL_PGD_PTRS; i++, pgd_src++, pgd_dst++) {
+ /*
+ * This is lock-less, so it can race with PGD updates
+ * coming from vmalloc() or CPA methods, but it's safe,
+ * because:
+ *
+ * 1) this PGD is not in use yet, we have still not
+ * scheduled this task.
+ * 2) we only ever extend PGD entries
+ *
+ * So if we observe a non-zero PGD entry we can copy it,
+ * it won't change from under us. Parallel updates (new
+ * allocations) will modify our (already visible) PGD:
+ */
+ if (!pgd_none(*pgd_src))
+ set_pgd(pgd_dst, *pgd_src);
+ }
+ }
+}
+
void pgd_free(struct mm_struct *mm, pgd_t *pgd)
{
pgd_mop_up_pmds(mm, pgd);
--
2.1.4
xen_mm_pin_all()/unpin_all() are used to implement full guest instance
suspend/restore. It's a stop-all method that needs to iterate through
all allocated pgds in the system to fix them up for Xen's use.
This code uses pgd_list, probably because it was an easy interface.
But we want to remove the pgd_list, so convert the code over to walk
all tasks in the system. This is an equivalent method.
(As I don't use Xen, this was only build tested.)
Cc: Andrew Morton <[email protected]>
Cc: Andy Lutomirski <[email protected]>
Cc: Borislav Petkov <[email protected]>
Cc: Brian Gerst <[email protected]>
Cc: Denys Vlasenko <[email protected]>
Cc: H. Peter Anvin <[email protected]>
Cc: Linus Torvalds <[email protected]>
Cc: Oleg Nesterov <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Cc: Thomas Gleixner <[email protected]>
Cc: Waiman Long <[email protected]>
Cc: [email protected]
Signed-off-by: Ingo Molnar <[email protected]>
---
arch/x86/xen/mmu.c | 51 ++++++++++++++++++++++++++++++++++++++-------------
1 file changed, 38 insertions(+), 13 deletions(-)
diff --git a/arch/x86/xen/mmu.c b/arch/x86/xen/mmu.c
index dd151b2045b0..70a3df5b0b54 100644
--- a/arch/x86/xen/mmu.c
+++ b/arch/x86/xen/mmu.c
@@ -853,15 +853,27 @@ static void xen_pgd_pin(struct mm_struct *mm)
*/
void xen_mm_pin_all(void)
{
- struct page *page;
+ struct task_struct *g, *p;
- spin_lock(&pgd_lock);
+ spin_lock(&pgd_lock); /* Implies rcu_read_lock() for the task list iteration: */
- list_for_each_entry(page, &pgd_list, lru) {
- if (!PagePinned(page)) {
- __xen_pgd_pin(&init_mm, (pgd_t *)page_address(page));
- SetPageSavePinned(page);
+ for_each_process_thread(g, p) {
+ struct mm_struct *mm;
+ struct page *page;
+ pgd_t *pgd;
+
+ task_lock(p);
+ mm = p->mm;
+ if (mm) {
+ pgd = mm->pgd;
+ page = virt_to_page(pgd);
+
+ if (!PagePinned(page)) {
+ __xen_pgd_pin(&init_mm, pgd);
+ SetPageSavePinned(page);
+ }
}
+ task_unlock(p);
}
spin_unlock(&pgd_lock);
@@ -967,19 +979,32 @@ static void xen_pgd_unpin(struct mm_struct *mm)
*/
void xen_mm_unpin_all(void)
{
- struct page *page;
+ struct task_struct *g, *p;
- spin_lock(&pgd_lock);
+ spin_lock(&pgd_lock); /* Implies rcu_read_lock() for the task list iteration: */
- list_for_each_entry(page, &pgd_list, lru) {
- if (PageSavePinned(page)) {
- BUG_ON(!PagePinned(page));
- __xen_pgd_unpin(&init_mm, (pgd_t *)page_address(page));
- ClearPageSavePinned(page);
+ for_each_process_thread(g, p) {
+ struct mm_struct *mm;
+ struct page *page;
+ pgd_t *pgd;
+
+ task_lock(p);
+ mm = p->mm;
+ if (mm) {
+ pgd = mm->pgd;
+ page = virt_to_page(pgd);
+
+ if (PageSavePinned(page)) {
+ BUG_ON(!PagePinned(page));
+ __xen_pgd_unpin(&init_mm, pgd);
+ ClearPageSavePinned(page);
+ }
}
+ task_unlock(p);
}
spin_unlock(&pgd_lock);
+ rcu_read_unlock();
}
static void xen_activate_mm(struct mm_struct *prev, struct mm_struct *next)
--
2.1.4
The vmalloc() code uses vmalloc_sync_all() to synchronize changes to
the global reference kernel PGD to task PGDs in certain rare cases,
like register_die_notifier().
This use seems to be somewhat questionable, as most other vmalloc
page table fixups are vmalloc_fault() driven, but nevertheless
it's there and it's using the pgd_list.
But we don't need the global list, as we can walk the task list
under RCU.
Cc: Andrew Morton <[email protected]>
Cc: Andy Lutomirski <[email protected]>
Cc: Borislav Petkov <[email protected]>
Cc: Brian Gerst <[email protected]>
Cc: Denys Vlasenko <[email protected]>
Cc: H. Peter Anvin <[email protected]>
Cc: Linus Torvalds <[email protected]>
Cc: Oleg Nesterov <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Cc: Rik van Riel <[email protected]>
Cc: Thomas Gleixner <[email protected]>
Cc: Waiman Long <[email protected]>
Cc: [email protected]
Signed-off-by: Ingo Molnar <[email protected]>
---
arch/x86/mm/fault.c | 28 ++++++++++++++++++++--------
1 file changed, 20 insertions(+), 8 deletions(-)
diff --git a/arch/x86/mm/fault.c b/arch/x86/mm/fault.c
index 50342825f221..366b8232f4b3 100644
--- a/arch/x86/mm/fault.c
+++ b/arch/x86/mm/fault.c
@@ -235,23 +235,35 @@ void vmalloc_sync_all(void)
for (address = VMALLOC_START & PMD_MASK;
address >= TASK_SIZE && address < FIXADDR_TOP;
address += PMD_SIZE) {
- struct page *page;
- spin_lock(&pgd_lock);
- list_for_each_entry(page, &pgd_list, lru) {
+ struct task_struct *g, *p;
+
+ spin_lock(&pgd_lock); /* Implies rcu_read_lock(): */
+
+ for_each_process_thread(g, p) {
+ struct mm_struct *mm;
spinlock_t *pgt_lock;
- pmd_t *ret;
+ pmd_t *pmd_ret;
- /* the pgt_lock only for Xen */
- pgt_lock = &pgd_page_get_mm(page)->page_table_lock;
+ task_lock(p);
+ mm = p->mm;
+ if (!mm) {
+ task_unlock(p);
+ continue;
+ }
+ /* The pgt_lock is only used on Xen: */
+ pgt_lock = &mm->page_table_lock;
spin_lock(pgt_lock);
- ret = vmalloc_sync_one(page_address(page), address);
+ pmd_ret = vmalloc_sync_one(mm->pgd, address);
spin_unlock(pgt_lock);
- if (!ret)
+ task_unlock(p);
+
+ if (!pmd_ret)
break;
}
+
spin_unlock(&pgd_lock);
}
}
--
2.1.4
The 32-bit x86 PAT code uses __set_pmd_pte() to update pmds.
This uses pgd_list currently, but we don't need the global
list as we can walk the task list under RCU.
(This code already holds the pgd_lock.)
Cc: Andrew Morton <[email protected]>
Cc: Andy Lutomirski <[email protected]>
Cc: Borislav Petkov <[email protected]>
Cc: Brian Gerst <[email protected]>
Cc: Denys Vlasenko <[email protected]>
Cc: H. Peter Anvin <[email protected]>
Cc: Linus Torvalds <[email protected]>
Cc: Oleg Nesterov <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Cc: Rik van Riel <[email protected]>
Cc: Thomas Gleixner <[email protected]>
Cc: Waiman Long <[email protected]>
Cc: [email protected]
Signed-off-by: Ingo Molnar <[email protected]>
---
arch/x86/mm/pageattr.c | 26 ++++++++++++++++++++------
1 file changed, 20 insertions(+), 6 deletions(-)
diff --git a/arch/x86/mm/pageattr.c b/arch/x86/mm/pageattr.c
index 997fc97e9072..4ff6a1808f1d 100644
--- a/arch/x86/mm/pageattr.c
+++ b/arch/x86/mm/pageattr.c
@@ -438,17 +438,31 @@ static void __set_pmd_pte(pte_t *kpte, unsigned long address, pte_t pte)
set_pte_atomic(kpte, pte);
#ifdef CONFIG_X86_32
if (!SHARED_KERNEL_PMD) {
- struct page *page;
+ struct task_struct *g, *p;
- list_for_each_entry(page, &pgd_list, lru) {
+ /* We are holding pgd_lock, which implies rcu_read_lock(): */
+
+ for_each_process_thread(g, p) {
+ struct mm_struct *mm;
+ spinlock_t *pgt_lock;
pgd_t *pgd;
pud_t *pud;
pmd_t *pmd;
- pgd = (pgd_t *)page_address(page) + pgd_index(address);
- pud = pud_offset(pgd, address);
- pmd = pmd_offset(pud, address);
- set_pte_atomic((pte_t *)pmd, pte);
+ task_lock(p);
+ mm = p->mm;
+ if (mm) {
+ pgt_lock = &mm->page_table_lock;
+ spin_lock(pgt_lock);
+
+ pgd = mm->pgd + pgd_index(address);
+ pud = pud_offset(pgd, address);
+ pmd = pmd_offset(pud, address);
+ set_pte_atomic((pte_t *)pmd, pte);
+
+ spin_unlock(pgt_lock);
+ }
+ task_unlock(p);
}
}
#endif
--
2.1.4
The fork()/exit() code uses pgd_alloc()/pgd_free() to allocate/deallocate
the PGD, with platform specific code setting up kernel pagetables.
The x86 code uses a global pgd_list with an associated lock to update
all PGDs of all tasks in the system synchronously.
The lock is still kept to synchronize updates to all PGDs in the system,
but all users of the list have been migrated to use the task list.
So we can remove the pgd_list addition/removal from this code.
The new PGD is private while constructed, so it needs no extra
locking.
Cc: Andrew Morton <[email protected]>
Cc: Andy Lutomirski <[email protected]>
Cc: Borislav Petkov <[email protected]>
Cc: Brian Gerst <[email protected]>
Cc: Denys Vlasenko <[email protected]>
Cc: H. Peter Anvin <[email protected]>
Cc: Linus Torvalds <[email protected]>
Cc: Oleg Nesterov <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Cc: Rik van Riel <[email protected]>
Cc: Thomas Gleixner <[email protected]>
Cc: Waiman Long <[email protected]>
Cc: [email protected]
Signed-off-by: Ingo Molnar <[email protected]>
---
arch/x86/mm/pgtable.c | 27 +++------------------------
1 file changed, 3 insertions(+), 24 deletions(-)
diff --git a/arch/x86/mm/pgtable.c b/arch/x86/mm/pgtable.c
index 7a561b7cc01c..0ab56d13f24d 100644
--- a/arch/x86/mm/pgtable.c
+++ b/arch/x86/mm/pgtable.c
@@ -125,22 +125,6 @@ static void pgd_ctor(struct mm_struct *mm, pgd_t *pgd)
swapper_pg_dir + KERNEL_PGD_BOUNDARY,
KERNEL_PGD_PTRS);
}
-
- /* list required to sync kernel mapping updates */
- if (!SHARED_KERNEL_PMD) {
- pgd_set_mm(pgd, mm);
- pgd_list_add(pgd);
- }
-}
-
-static void pgd_dtor(pgd_t *pgd)
-{
- if (SHARED_KERNEL_PMD)
- return;
-
- spin_lock(&pgd_lock);
- pgd_list_del(pgd);
- spin_unlock(&pgd_lock);
}
/*
@@ -370,17 +354,13 @@ pgd_t *pgd_alloc(struct mm_struct *mm)
goto out_free_pmds;
/*
- * Make sure that pre-populating the pmds is atomic with
- * respect to anything walking the pgd_list, so that they
- * never see a partially populated pgd.
+ * No locking is needed here, as the PGD is still private,
+ * so no code walking the task list and looking at mm->pgd
+ * will be able to see it before it's fully constructed:
*/
- spin_lock(&pgd_lock);
-
pgd_ctor(mm, pgd);
pgd_prepopulate_pmd(mm, pgd, pmds);
- spin_unlock(&pgd_lock);
-
return pgd;
out_free_pmds:
@@ -453,7 +433,6 @@ void arch_pgd_init_late(struct mm_struct *mm)
void pgd_free(struct mm_struct *mm, pgd_t *pgd)
{
pgd_mop_up_pmds(mm, pgd);
- pgd_dtor(pgd);
paravirt_pgd_free(mm, pgd);
_pgd_free(pgd);
}
--
2.1.4
Nothing uses the pgd_list anymore - remove the list itself and its helpers.
Cc: Andrew Morton <[email protected]>
Cc: Andy Lutomirski <[email protected]>
Cc: Borislav Petkov <[email protected]>
Cc: Brian Gerst <[email protected]>
Cc: Denys Vlasenko <[email protected]>
Cc: H. Peter Anvin <[email protected]>
Cc: Linus Torvalds <[email protected]>
Cc: Oleg Nesterov <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Cc: Rik van Riel <[email protected]>
Cc: Thomas Gleixner <[email protected]>
Cc: Waiman Long <[email protected]>
Cc: [email protected]
Signed-off-by: Ingo Molnar <[email protected]>
---
arch/x86/include/asm/pgtable.h | 3 ---
arch/x86/mm/fault.c | 1 -
arch/x86/mm/pgtable.c | 26 --------------------------
3 files changed, 30 deletions(-)
diff --git a/arch/x86/include/asm/pgtable.h b/arch/x86/include/asm/pgtable.h
index fe57e7a98839..7e93438d8b53 100644
--- a/arch/x86/include/asm/pgtable.h
+++ b/arch/x86/include/asm/pgtable.h
@@ -29,9 +29,6 @@ extern unsigned long empty_zero_page[PAGE_SIZE / sizeof(unsigned long)]
#define ZERO_PAGE(vaddr) (virt_to_page(empty_zero_page))
extern spinlock_t pgd_lock;
-extern struct list_head pgd_list;
-
-extern struct mm_struct *pgd_page_get_mm(struct page *page);
#ifdef CONFIG_PARAVIRT
#include <asm/paravirt.h>
diff --git a/arch/x86/mm/fault.c b/arch/x86/mm/fault.c
index 366b8232f4b3..14b39c3511fd 100644
--- a/arch/x86/mm/fault.c
+++ b/arch/x86/mm/fault.c
@@ -186,7 +186,6 @@ force_sig_info_fault(int si_signo, int si_code, unsigned long address,
}
DEFINE_SPINLOCK(pgd_lock);
-LIST_HEAD(pgd_list);
#ifdef CONFIG_X86_32
static inline pmd_t *vmalloc_sync_one(pgd_t *pgd, unsigned long address)
diff --git a/arch/x86/mm/pgtable.c b/arch/x86/mm/pgtable.c
index 0ab56d13f24d..7cca42c01677 100644
--- a/arch/x86/mm/pgtable.c
+++ b/arch/x86/mm/pgtable.c
@@ -84,35 +84,9 @@ void ___pud_free_tlb(struct mmu_gather *tlb, pud_t *pud)
#endif /* CONFIG_PGTABLE_LEVELS > 3 */
#endif /* CONFIG_PGTABLE_LEVELS > 2 */
-static inline void pgd_list_add(pgd_t *pgd)
-{
- struct page *page = virt_to_page(pgd);
-
- list_add(&page->lru, &pgd_list);
-}
-
-static inline void pgd_list_del(pgd_t *pgd)
-{
- struct page *page = virt_to_page(pgd);
-
- list_del(&page->lru);
-}
-
#define UNSHARED_PTRS_PER_PGD \
(SHARED_KERNEL_PMD ? KERNEL_PGD_BOUNDARY : PTRS_PER_PGD)
-
-static void pgd_set_mm(pgd_t *pgd, struct mm_struct *mm)
-{
- BUILD_BUG_ON(sizeof(virt_to_page(pgd)->index) < sizeof(mm));
- virt_to_page(pgd)->index = (pgoff_t)mm;
-}
-
-struct mm_struct *pgd_page_get_mm(struct page *page)
-{
- return (struct mm_struct *)page->index;
-}
-
static void pgd_ctor(struct mm_struct *mm, pgd_t *pgd)
{
/* If the pgd points to a shared pagetable level (either the
--
2.1.4
Right now pgd_alloc() uses pgd_ctor(), which copies over the
current swapper_pg_dir[] to a new task's PGD.
This is not necessary; it's enough to clear it: the PGD will
then be properly updated by arch_pgd_init_late().
Cc: Andrew Morton <[email protected]>
Cc: Andy Lutomirski <[email protected]>
Cc: Borislav Petkov <[email protected]>
Cc: Brian Gerst <[email protected]>
Cc: Denys Vlasenko <[email protected]>
Cc: H. Peter Anvin <[email protected]>
Cc: Linus Torvalds <[email protected]>
Cc: Oleg Nesterov <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Cc: Rik van Riel <[email protected]>
Cc: Thomas Gleixner <[email protected]>
Cc: Waiman Long <[email protected]>
Cc: [email protected]
Signed-off-by: Ingo Molnar <[email protected]>
---
arch/x86/mm/pgtable.c | 27 +++++++++------------------
1 file changed, 9 insertions(+), 18 deletions(-)
diff --git a/arch/x86/mm/pgtable.c b/arch/x86/mm/pgtable.c
index 7cca42c01677..d7d341e57e33 100644
--- a/arch/x86/mm/pgtable.c
+++ b/arch/x86/mm/pgtable.c
@@ -87,20 +87,6 @@ void ___pud_free_tlb(struct mmu_gather *tlb, pud_t *pud)
#define UNSHARED_PTRS_PER_PGD \
(SHARED_KERNEL_PMD ? KERNEL_PGD_BOUNDARY : PTRS_PER_PGD)
-static void pgd_ctor(struct mm_struct *mm, pgd_t *pgd)
-{
- /* If the pgd points to a shared pagetable level (either the
- ptes in non-PAE, or shared PMD in PAE), then just copy the
- references from swapper_pg_dir. */
- if (CONFIG_PGTABLE_LEVELS == 2 ||
- (CONFIG_PGTABLE_LEVELS == 3 && SHARED_KERNEL_PMD) ||
- CONFIG_PGTABLE_LEVELS == 4) {
- clone_pgd_range(pgd + KERNEL_PGD_BOUNDARY,
- swapper_pg_dir + KERNEL_PGD_BOUNDARY,
- KERNEL_PGD_PTRS);
- }
-}
-
/*
* List of all pgd's needed for non-PAE so it can invalidate entries
* in both cached and uncached pgd's; not needed for PAE since the
@@ -328,11 +314,16 @@ pgd_t *pgd_alloc(struct mm_struct *mm)
goto out_free_pmds;
/*
- * No locking is needed here, as the PGD is still private,
- * so no code walking the task list and looking at mm->pgd
- * will be able to see it before it's fully constructed:
+ * Zero out the kernel portion here, we'll set them up in
+ * arch_pgd_init_late(), when the pgd is globally
+ * visible already per the task list, so that it cannot
+ * miss updates.
+ *
+ * We need to zero it here, to make sure arch_pgd_init_late()
+ * can initialize them without locking.
*/
- pgd_ctor(mm, pgd);
+ memset(pgd + KERNEL_PGD_BOUNDARY, 0, KERNEL_PGD_PTRS*sizeof(pgd_t));
+
pgd_prepopulate_pmd(mm, pgd, pmds);
return pgd;
--
2.1.4
I didn't read v2 yet, but I'd like to ask a question.
Why do we need vmalloc_sync_all()?
It has a single caller, register_die_notifier(), which calls it without
any explanation. IMO, this needs a comment at least.
I am not sure I understand the changelog in 101f12af correctly, but
at first glance vmalloc_sync_all() is no longer needed at least on x86:
do_page_fault() no longer does notify_die(DIE_PAGE_FAULT). And btw
DIE_PAGE_FAULT has no users. DIE_NMI too...
Perhaps we can simply kill it on x86?
As for other architectures I am not sure. arch/tile implements
vmalloc_sync_all() and uses notify_die() in do_page_fault().
And in any case register_die_notifier()->vmalloc_sync_all() looks strange.
If (say) arch/tile needs this to fix the problem with modules, perhaps
it should do vmalloc_sync_all() in do_init_module() paths?
Oleg.
On 06/13, Ingo Molnar wrote:
>
> @@ -169,29 +169,40 @@ void sync_global_pgds(unsigned long start, unsigned long end, int removed)
>
> for (address = start; address <= end; address += PGDIR_SIZE) {
> const pgd_t *pgd_ref = pgd_offset_k(address);
> - struct page *page;
> + struct task_struct *g, *p;
>
> /*
> - * When it is called after memory hot remove, pgd_none()
> - * returns true. In this case (removed == 1), we must clear
> - * the PGD entries in the local PGD level page.
> + * When this function is called after memory hot remove,
> + * pgd_none() already returns true, but only the reference
> + * kernel PGD has been cleared, not the process PGDs.
> + *
> + * So clear the affected entries in every process PGD as well:
> */
> if (pgd_none(*pgd_ref) && !removed)
> continue;
>
> - spin_lock(&pgd_lock);
> - list_for_each_entry(page, &pgd_list, lru) {
> + spin_lock(&pgd_lock); /* Implies rcu_read_lock() for the task list iteration: */
^^^^^^^^^^^^^^^^^^^^^^^
Hmm, but it doesn't if PREEMPT_RCU? No, no, I do not pretend I understand
how it actually works ;) But, say, rcu_check_callbacks() can be called from
irq and since spin_lock() doesn't increment current->rcu_read_lock_nesting
this can lead to rcu_preempt_qs()?
> + for_each_process_thread(g, p) {
> + struct mm_struct *mm;
> pgd_t *pgd;
> spinlock_t *pgt_lock;
>
> - pgd = (pgd_t *)page_address(page) + pgd_index(address);
> - /* the pgt_lock only for Xen */
> - pgt_lock = &pgd_page_get_mm(page)->page_table_lock;
> + task_lock(p);
> + mm = p->mm;
> + if (!mm) {
> + task_unlock(p);
> + continue;
> + }
Again, you can simplify this code and avoid for_each_process_thread() if
you use for_each_process() + find_lock_task_mm().
Oleg.
* Oleg Nesterov <[email protected]> wrote:
> On 06/13, Ingo Molnar wrote:
> >
> > @@ -169,29 +169,40 @@ void sync_global_pgds(unsigned long start, unsigned long end, int removed)
> >
> > for (address = start; address <= end; address += PGDIR_SIZE) {
> > const pgd_t *pgd_ref = pgd_offset_k(address);
> > - struct page *page;
> > + struct task_struct *g, *p;
> >
> > /*
> > - * When it is called after memory hot remove, pgd_none()
> > - * returns true. In this case (removed == 1), we must clear
> > - * the PGD entries in the local PGD level page.
> > + * When this function is called after memory hot remove,
> > + * pgd_none() already returns true, but only the reference
> > + * kernel PGD has been cleared, not the process PGDs.
> > + *
> > + * So clear the affected entries in every process PGD as well:
> > */
> > if (pgd_none(*pgd_ref) && !removed)
> > continue;
> >
> > - spin_lock(&pgd_lock);
> > - list_for_each_entry(page, &pgd_list, lru) {
> > + spin_lock(&pgd_lock); /* Implies rcu_read_lock() for the task list iteration: */
> ^^^^^^^^^^^^^^^^^^^^^^^
>
> Hmm, but it doesn't if PREEMPT_RCU? No, no, I do not pretend I understand how it
> actually works ;) But, say, rcu_check_callbacks() can be called from irq and
> since spin_lock() doesn't increment current->rcu_read_lock_nesting this can lead
> to rcu_preempt_qs()?
No, RCU grace periods are still defined by 'heavy' context boundaries such as
context switches, entering idle or user-space mode.
PREEMPT_RCU is like traditional RCU, except that blocking is allowed within the
RCU read critical section - that is why it uses a separate nesting counter
(current->rcu_read_lock_nesting), not the preempt count.
But if a piece of kernel code is non-preemptible, such as a spinlocked region or
an irqs-off region, then those are still natural RCU read lock regions, regardless
of the RCU model, and need no additional RCU locking.
rcu_check_callbacks() can be called from irq context, but only to observe whether
the current CPU is in a quiescent state. If it interrupts a spinlocked region it
won't register a quiescent state.
> > + for_each_process_thread(g, p) {
> > + struct mm_struct *mm;
> > pgd_t *pgd;
> > spinlock_t *pgt_lock;
> >
> > - pgd = (pgd_t *)page_address(page) + pgd_index(address);
> > - /* the pgt_lock only for Xen */
> > - pgt_lock = &pgd_page_get_mm(page)->page_table_lock;
> > + task_lock(p);
> > + mm = p->mm;
> > + if (!mm) {
> > + task_unlock(p);
> > + continue;
> > + }
>
> Again, you can simplify this code and avoid for_each_process_thread() if you use
> for_each_process() + find_lock_task_mm().
True!
So I looked at this when you first mentioned it but mis-read find_lock_task_mm(),
which as you insist is exactly what this iteration needs to become faster and
simpler. Thanks for the reminder - I have fixed it and it will be part of -v3.
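For the record, the simplified iteration will look something like this
(an illustrative sketch only, the actual change will be in the -v3 patches):

    struct task_struct *g, *p;

    spin_lock(&pgd_lock);           /* Implies rcu_read_lock() for the task list iteration */
    for_each_process(g) {
            struct mm_struct *mm;

            /*
             * find_lock_task_mm() (declared in <linux/oom.h>) returns a
             * thread of 'g' whose ->mm is non-NULL, with task_lock() held
             * on it, or NULL if the process has no user mm:
             */
            p = find_lock_task_mm(g);
            if (!p)
                    continue;

            mm = p->mm;
            /* ... update mm->pgd, taking mm->page_table_lock where needed ... */
            task_unlock(p);
    }
    spin_unlock(&pgd_lock);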
Thanks,
Ingo
* Oleg Nesterov <[email protected]> wrote:
> I didn't read v2 yet, but I'd like to ask a question.
>
> Why do we need vmalloc_sync_all()?
>
> It has a single caller, register_die_notifier() which calls it without
> any explanation. IMO, this needs a comment at least.
Yes, it's used to work around crashes in modular callbacks: if the callback
happens to be called from within the page fault path, before the vmalloc page
fault handler runs, then we have a catch-22 problem.
It's rare but not entirely impossible.
> I am not sure I understand the changelog in 101f12af correctly, but at first
> glance vmalloc_sync_all() is no longer needed at least on x86, do_page_fault()
> no longer does notify_die(DIE_PAGE_FAULT). And btw DIE_PAGE_FAULT has no users.
> DIE_NMI too...
>
> Perhaps we can simply kill it on x86?
So in theory we could still have it run from DIE_OOPS, and that could turn a
survivable kernel crash into a non-survivable one.
Note that all of this will go away if we also do the vmalloc fault handling
simplification that I discussed with Andy:
- this series already makes the set of kernel PGDs strictly monotonically
increasing during the lifetime of the x86 kernel
- if in a subsequent patch we can synchronize new PGD entries right after the vmalloc
code creates them, before the area is used - so we can remove vmalloc_fault()
altogether [or rather, turn it into a debug warning initially].
vmalloc_fault() is a clever but somewhat fragile complication.
- after that we can simply remove vmalloc_sync_all() from x86, because all active
vmalloc areas will be fully instantiated, all the time, on x86.
Thanks,
Ingo
* Ingo Molnar <[email protected]> wrote:
> @@ -967,19 +979,32 @@ static void xen_pgd_unpin(struct mm_struct *mm)
> */
> void xen_mm_unpin_all(void)
> {
> - struct page *page;
> + struct task_struct *g, *p;
>
> - spin_lock(&pgd_lock);
> + spin_lock(&pgd_lock); /* Implies rcu_read_lock() for the task list iteration: */
>
> - list_for_each_entry(page, &pgd_list, lru) {
> - if (PageSavePinned(page)) {
> - BUG_ON(!PagePinned(page));
> - __xen_pgd_unpin(&init_mm, (pgd_t *)page_address(page));
> - ClearPageSavePinned(page);
> + for_each_process_thread(g, p) {
> + struct mm_struct *mm;
> + struct page *page;
> + pgd_t *pgd;
> +
> + task_lock(p);
> + mm = p->mm;
> + if (mm) {
> + pgd = mm->pgd;
> + page = virt_to_page(pgd);
> +
> + if (PageSavePinned(page)) {
> + BUG_ON(!PagePinned(page));
> + __xen_pgd_unpin(&init_mm, pgd);
> + ClearPageSavePinned(page);
> + }
> }
> + task_unlock(p);
> }
>
> spin_unlock(&pgd_lock);
> + rcu_read_unlock();
> }
I also removed the leftover stray rcu_read_unlock() from -v3.
Thanks,
Ingo
On 06/14, Ingo Molnar wrote:
>
> * Oleg Nesterov <[email protected]> wrote:
>
> > > + spin_lock(&pgd_lock); /* Implies rcu_read_lock() for the task list iteration: */
> > ^^^^^^^^^^^^^^^^^^^^^^^
> >
> > Hmm, but it doesn't if PREEMPT_RCU? No, no, I do not pretend I understand how it
> > actually works ;) But, say, rcu_check_callbacks() can be called from irq and
> > since spin_lock() doesn't increment current->rcu_read_lock_nesting this can lead
> > to rcu_preempt_qs()?
>
> No, RCU grace periods are still defined by 'heavy' context boundaries such as
> context switches, entering idle or user-space mode.
>
> PREEMPT_RCU is like traditional RCU, except that blocking is allowed within the
> RCU read critical section - that is why it uses a separate nesting counter
> (current->rcu_read_lock_nesting), not the preempt count.
Yes.
> But if a piece of kernel code is non-preemptible, such as a spinlocked region or
> an irqs-off region, then those are still natural RCU read lock regions, regardless
> of the RCU model, and need no additional RCU locking.
I do not think so. Yes I understand that rcu_preempt_qs() itself doesn't
finish the gp, but if there are no other rcu-read-lock holders then it
seems synchronize_rcu() on another CPU can return _before_ spin_unlock(),
this CPU no longer needs rcu_preempt_note_context_switch().
OK, I can be easily wrong, I do not really understand the implementation
of PREEMPT_RCU. Perhaps preempt_disable() can actually act as rcu_read_lock()
with the _current_ implementation. Still this doesn't look right even if it
happens to work, and Documentation/RCU/checklist.txt says:
11. Note that synchronize_rcu() -only- guarantees to wait until
all currently executing rcu_read_lock()-protected RCU read-side
critical sections complete. It does -not- necessarily guarantee
that all currently running interrupts, NMIs, preempt_disable()
code, or idle loops will complete. Therefore, if your
read-side critical sections are protected by something other
than rcu_read_lock(), do -not- use synchronize_rcu().
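IOW, to stay within what the documentation guarantees, the task-list walk
would take an explicit rcu_read_lock() in addition to pgd_lock, so that it
does not depend on the RCU flavour. A minimal sketch (illustration only):

    rcu_read_lock();
    spin_lock(&pgd_lock);
    for_each_process_thread(g, p) {
            /* ... inspect/update p->mm->pgd as before ... */
    }
    spin_unlock(&pgd_lock);
    rcu_read_unlock();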
Oleg.
On 06/14, Oleg Nesterov wrote:
>
> On 06/14, Ingo Molnar wrote:
> >
> > * Oleg Nesterov <[email protected]> wrote:
> >
> > > > + spin_lock(&pgd_lock); /* Implies rcu_read_lock() for the task list iteration: */
> > > ^^^^^^^^^^^^^^^^^^^^^^^
> > >
> > > Hmm, but it doesn't if PREEMPT_RCU? No, no, I do not pretend I understand how it
> > > actually works ;) But, say, rcu_check_callbacks() can be called from irq and
> > > since spin_lock() doesn't increment current->rcu_read_lock_nesting this can lead
> > > to rcu_preempt_qs()?
> >
> > No, RCU grace periods are still defined by 'heavy' context boundaries such as
> > context switches, entering idle or user-space mode.
> >
> > PREEMPT_RCU is like traditional RCU, except that blocking is allowed within the
> > RCU read critical section - that is why it uses a separate nesting counter
> > (current->rcu_read_lock_nesting), not the preempt count.
>
> Yes.
>
> > But if a piece of kernel code is non-preemptible, such as a spinlocked region or
> > an irqs-off region, then those are still natural RCU read lock regions, regardless
> > of the RCU model, and need no additional RCU locking.
>
> I do not think so. Yes I understand that rcu_preempt_qs() itself doesn't
> finish the gp, but if there are no other rcu-read-lock holders then it
> seems synchronize_rcu() on another CPU can return _before_ spin_unlock(),
> this CPU no longer needs rcu_preempt_note_context_switch().
>
> OK, I can be easily wrong, I do not really understand the implementation
> of PREEMPT_RCU. Perhaps preempt_disable() can actually act as rcu_read_lock()
> with the _current_ implementation. Still this doesn't look right even if
> happens to work, and Documentation/RCU/checklist.txt says:
>
> 11. Note that synchronize_rcu() -only- guarantees to wait until
> all currently executing rcu_read_lock()-protected RCU read-side
> critical sections complete. It does -not- necessarily guarantee
> that all currently running interrupts, NMIs, preempt_disable()
> code, or idle loops will complete. Therefore, if your
> read-side critical sections are protected by something other
> than rcu_read_lock(), do -not- use synchronize_rcu().
I've even checked this ;) I applied the stupid patch below and then
$ taskset 2 perl -e 'syscall 157, 666, 5000' &
[1] 565
$ taskset 1 perl -e 'syscall 157, 777'
$
[1]+ Done taskset 2 perl -e 'syscall 157, 666, 5000'
$ dmesg -c
SPIN start
SYNC start
SYNC done!
SPIN done!
Oleg.
--- a/kernel/sys.c
+++ b/kernel/sys.c
@@ -2049,6 +2049,9 @@ static int prctl_get_tid_address(struct task_struct *me, int __user **tid_addr)
}
#endif
+#include <linux/delay.h>
+
+
SYSCALL_DEFINE5(prctl, int, option, unsigned long, arg2, unsigned long, arg3,
unsigned long, arg4, unsigned long, arg5)
{
@@ -2062,6 +2065,19 @@ SYSCALL_DEFINE5(prctl, int, option, unsigned long, arg2, unsigned long, arg3,
error = 0;
switch (option) {
+ case 666:
+ preempt_disable();
+ pr_crit("SPIN start\n");
+ while (arg2--)
+ mdelay(1);
+ pr_crit("SPIN done!\n");
+ preempt_enable();
+ break;
+ case 777:
+ pr_crit("SYNC start\n");
+ synchronize_rcu();
+ pr_crit("SYNC done!\n");
+ break;
case PR_SET_PDEATHSIG:
if (!valid_signal(arg2)) {
error = -EINVAL;
On 06/14, Ingo Molnar wrote:
>
> * Oleg Nesterov <[email protected]> wrote:
>
> > I didn't read v2 yet, but I'd like to ask a question.
> >
> > Why do we need vmalloc_sync_all()?
> >
> > It has a single caller, register_die_notifier() which calls it without
> > any explanation. IMO, this needs a comment at least.
>
> Yes, it's used to work around crashes in modular callbacks: if the callbacks
> happens to be called from within the page fault path, before the vmalloc page
> fault handler runs, then we have a catch-22 problem.
>
> It's rare but not entirely impossible.
But again, the kernel no longer does this? do_page_fault() does vmalloc_fault()
without notify_die(). If that fails, I do not see how/why a modular DIE_OOPS
handler could try to resolve this problem and trigger another fault.
> > I am not sure I understand the changelog in 101f12af correctly, but at first
> > glance vmalloc_sync_all() is no longer needed at least on x86, do_page_fault()
> > no longer does notify_die(DIE_PAGE_FAULT). And btw DIE_PAGE_FAULT has no users.
> > DIE_MNI too...
> >
> > Perhaps we can simply kill it on x86?
>
> So in theory we could still have it run from DIE_OOPS, and that could turn a
> survivable kernel crash into a non-survivable one.
I don't understand... But OK, my understanding of this magic is very limited,
please forget.
Thanks,
Oleg.
On Sun, Jun 14, 2015 at 09:38:25PM +0200, Oleg Nesterov wrote:
> On 06/14, Oleg Nesterov wrote:
> >
> > On 06/14, Ingo Molnar wrote:
> > >
> > > * Oleg Nesterov <[email protected]> wrote:
> > >
> > > > > + spin_lock(&pgd_lock); /* Implies rcu_read_lock() for the task list iteration: */
> > > > ^^^^^^^^^^^^^^^^^^^^^^^
> > > >
> > > > Hmm, but it doesn't if PREEMPT_RCU? No, no, I do not pretend I understand how it
> > > > actually works ;) But, say, rcu_check_callbacks() can be called from irq and
> > > > since spin_lock() doesn't increment current->rcu_read_lock_nesting this can lead
> > > > to rcu_preempt_qs()?
> > >
> > > No, RCU grace periods are still defined by 'heavy' context boundaries such as
> > > context switches, entering idle or user-space mode.
> > >
> > > PREEMPT_RCU is like traditional RCU, except that blocking is allowed within the
> > > RCU read critical section - that is why it uses a separate nesting counter
> > > (current->rcu_read_lock_nesting), not the preempt count.
> >
> > Yes.
> >
> > > But if a piece of kernel code is non-preemptible, such as a spinlocked region or
> > > an irqs-off region, then those are still natural RCU read lock regions, regardless
> > > of the RCU model, and need no additional RCU locking.
> >
> > I do not think so. Yes I understand that rcu_preempt_qs() itself doesn't
> > finish the gp, but if there are no other rcu-read-lock holders then it
> > seems synchronize_rcu() on another CPU can return _before_ spin_unlock(),
> > this CPU no longer needs rcu_preempt_note_context_switch().
> >
> > OK, I can be easily wrong, I do not really understand the implementation
> > of PREEMPT_RCU. Perhaps preempt_disable() can actually act as rcu_read_lock()
> > with the _current_ implementation. Still this doesn't look right even if
> > happens to work, and Documentation/RCU/checklist.txt says:
> >
> > 11. Note that synchronize_rcu() -only- guarantees to wait until
> > all currently executing rcu_read_lock()-protected RCU read-side
> > critical sections complete. It does -not- necessarily guarantee
> > that all currently running interrupts, NMIs, preempt_disable()
> > code, or idle loops will complete. Therefore, if your
> > read-side critical sections are protected by something other
> > than rcu_read_lock(), do -not- use synchronize_rcu().
>
>
> I've even checked this ;) I applied the stupid patch below and then
>
> $ taskset 2 perl -e 'syscall 157, 666, 5000' &
> [1] 565
>
> $ taskset 1 perl -e 'syscall 157, 777'
>
> $
> [1]+ Done taskset 2 perl -e 'syscall 157, 666, 5000'
>
> $ dmesg -c
> SPIN start
> SYNC start
> SYNC done!
> SPIN done!
Please accept my apologies for my late entry to this thread.
Youngest kid graduated from university this weekend, so my
attention has been elsewhere.
If you were to disable interrupts instead of preemption, I would expect
that the preemptible-RCU grace period would be blocked -- though I am
not particularly comfortable with people relying on disabled interrupts
blocking a preemptible-RCU grace period.
Here is what can happen if you try to block a preemptible-RCU grace
period by disabling preemption, assuming that there are at least two
online CPUs in the system:
1. CPU 0 does spin_lock(), which disables preemption.
2. CPU 1 starts a grace period.
3. CPU 0 takes a scheduling-clock interrupt. It raises softirq,
and the RCU_SOFTIRQ handler notes that there is a new grace
period and sets state so that a subsequent quiescent state on
this CPU will be noted.
4. CPU 0 takes another scheduling-clock interrupt, which checks
current->rcu_read_lock_nesting, and notes that there is no
preemptible-RCU read-side critical section in progress. It
again raises softirq, and the RCU_SOFTIRQ handler reports
the quiescent state to core RCU.
5. Once each of the other CPUs report a quiescent state, the
grace period can end, despite CPU 0 having preemption
disabled the whole time.
So Oleg's test is correct: disabling preemption is not sufficient
to block a preemptible-RCU grace period.
The usual suggestion would be to add rcu_read_lock() just after the
lock is acquired and rcu_read_unlock() just before each release of that
same lock. Putting the entire RCU read-side critical section under
the lock prevents RCU from having to invoke rcu_read_unlock_special()
due to preemption. (It might still invoke it if the RCU read-side
critical section was overly long, but that is much cheaper than the
preemption-handling case.)
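
For illustration, the suggested shape is simply this (a sketch only, not code
from the series; 'some_lock' stands in for whichever lock protects the walk):

        spin_lock(&some_lock);
        rcu_read_lock();        /* whole read-side section nested inside the lock */

        /* ... RCU-protected traversal, e.g. of the task list ... */

        rcu_read_unlock();      /* dropped just before the lock is released */
        spin_unlock(&some_lock);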
Thanx, Paul
> Oleg.
>
> --- a/kernel/sys.c
> +++ b/kernel/sys.c
> @@ -2049,6 +2049,9 @@ static int prctl_get_tid_address(struct task_struct *me, int __user **tid_addr)
> }
> #endif
>
> +#include <linux/delay.h>
> +
> +
> SYSCALL_DEFINE5(prctl, int, option, unsigned long, arg2, unsigned long, arg3,
> unsigned long, arg4, unsigned long, arg5)
> {
> @@ -2062,6 +2065,19 @@ SYSCALL_DEFINE5(prctl, int, option, unsigned long, arg2, unsigned long, arg3,
>
> error = 0;
> switch (option) {
> + case 666:
> + preempt_disable();
> + pr_crit("SPIN start\n");
> + while (arg2--)
> + mdelay(1);
> + pr_crit("SPIN done!\n");
> + preempt_enable();
> + break;
> + case 777:
> + pr_crit("SYNC start\n");
> + synchronize_rcu();
> + pr_crit("SYNC done!\n");
> + break;
> case PR_SET_PDEATHSIG:
> if (!valid_signal(arg2)) {
> error = -EINVAL;
>
Oleg Nesterov <[email protected]> writes:
>
> But again, the kernel no longer does this? do_page_fault() does vmalloc_fault()
> without notify_die(). If it fails, I do not see how/why a modular DIE_OOPS
> handler could try to resolve this problem and trigger another fault.
The same problem can happen from NMI handlers or machine check
handlers. It's not necessarily tied to page faults only.
-Andi
--
[email protected] -- Speaking for myself only
On Sun, Jun 14, 2015 at 7:47 PM, Andi Kleen <[email protected]> wrote:
> Oleg Nesterov <[email protected]> writes:
>>
>> But again, the kernel no longer does this? do_page_fault() does vmalloc_fault()
>> without notify_die(). If it fails, I do not see how/why a modular DIE_OOPS
>> handler could try to resolve this problem and trigger another fault.
>
> The same problem can happen from NMI handlers or machine check
> handlers. It's not necessarily tied to page faults only.
AIUI, the point of the one and only vmalloc_sync_all call is to
prevent infinitely recursive faults when we call a notify_die
callback. The only thing that it could realistically protect is
module text or static non-per-cpu module data, since that's the only
thing that's reliably already in the init pgd. I'm with Oleg: I don't
see how that can happen, since do_page_fault fixes up vmalloc faults
before it calls notify_die.
--Andy
On Sat, 2015-06-13 at 11:49 +0200, Ingo Molnar wrote:
> xen_mm_pin_all()/unpin_all() are used to implement full guest instance
> suspend/restore. It's a stop-all method that needs to iterate through
> all allocated pgds in the system to fix them up for Xen's use.
>
> This code uses pgd_list, probably because it was an easy interface.
>
> But we want to remove the pgd_list, so convert the code over to walk
> all tasks in the system. This is an equivalent method.
>
> (As I don't use Xen this was only build tested.)
In which case it seems extra important to copy the appropriate
maintainers, which I've done here.
Ian.
>
> Cc: Andrew Morton <[email protected]>
> Cc: Andy Lutomirski <[email protected]>
> Cc: Borislav Petkov <[email protected]>
> Cc: Brian Gerst <[email protected]>
> Cc: Denys Vlasenko <[email protected]>
> Cc: H. Peter Anvin <[email protected]>
> Cc: Linus Torvalds <[email protected]>
> Cc: Oleg Nesterov <[email protected]>
> Cc: Peter Zijlstra <[email protected]>
> Cc: Thomas Gleixner <[email protected]>
> Cc: Waiman Long <[email protected]>
> Cc: [email protected]
> Signed-off-by: Ingo Molnar <[email protected]>
> ---
> arch/x86/xen/mmu.c | 51 ++++++++++++++++++++++++++++++++++++++-------------
> 1 file changed, 38 insertions(+), 13 deletions(-)
>
> diff --git a/arch/x86/xen/mmu.c b/arch/x86/xen/mmu.c
> index dd151b2045b0..70a3df5b0b54 100644
> --- a/arch/x86/xen/mmu.c
> +++ b/arch/x86/xen/mmu.c
> @@ -853,15 +853,27 @@ static void xen_pgd_pin(struct mm_struct *mm)
> */
> void xen_mm_pin_all(void)
> {
> - struct page *page;
> + struct task_struct *g, *p;
>
> - spin_lock(&pgd_lock);
> + spin_lock(&pgd_lock); /* Implies rcu_read_lock() for the task list iteration: */
>
> - list_for_each_entry(page, &pgd_list, lru) {
> - if (!PagePinned(page)) {
> - __xen_pgd_pin(&init_mm, (pgd_t *)page_address(page));
> - SetPageSavePinned(page);
> + for_each_process_thread(g, p) {
> + struct mm_struct *mm;
> + struct page *page;
> + pgd_t *pgd;
> +
> + task_lock(p);
> + mm = p->mm;
> + if (mm) {
> + pgd = mm->pgd;
> + page = virt_to_page(pgd);
> +
> + if (!PagePinned(page)) {
> + __xen_pgd_pin(&init_mm, pgd);
> + SetPageSavePinned(page);
> + }
> }
> + task_unlock(p);
> }
>
> spin_unlock(&pgd_lock);
> @@ -967,19 +979,32 @@ static void xen_pgd_unpin(struct mm_struct *mm)
> */
> void xen_mm_unpin_all(void)
> {
> - struct page *page;
> + struct task_struct *g, *p;
>
> - spin_lock(&pgd_lock);
> + spin_lock(&pgd_lock); /* Implies rcu_read_lock() for the task list iteration: */
>
> - list_for_each_entry(page, &pgd_list, lru) {
> - if (PageSavePinned(page)) {
> - BUG_ON(!PagePinned(page));
> - __xen_pgd_unpin(&init_mm, (pgd_t *)page_address(page));
> - ClearPageSavePinned(page);
> + for_each_process_thread(g, p) {
> + struct mm_struct *mm;
> + struct page *page;
> + pgd_t *pgd;
> +
> + task_lock(p);
> + mm = p->mm;
> + if (mm) {
> + pgd = mm->pgd;
> + page = virt_to_page(pgd);
> +
> + if (PageSavePinned(page)) {
> + BUG_ON(!PagePinned(page));
> + __xen_pgd_unpin(&init_mm, pgd);
> + ClearPageSavePinned(page);
> + }
> }
> + task_unlock(p);
> }
>
> spin_unlock(&pgd_lock);
> + rcu_read_unlock();
> }
>
> static void xen_activate_mm(struct mm_struct *prev, struct mm_struct *next)
On 15/06/15 10:05, Ian Campbell wrote:
> On Sat, 2015-06-13 at 11:49 +0200, Ingo Molnar wrote:
>> xen_mm_pin_all()/unpin_all() are used to implement full guest instance
>> suspend/restore. It's a stop-all method that needs to iterate through
>> all allocated pgds in the system to fix them up for Xen's use.
>>
>> This code uses pgd_list, probably because it was an easy interface.
>>
>> But we want to remove the pgd_list, so convert the code over to walk
>> all tasks in the system. This is an equivalent method.
It is not equivalent because pgd_alloc() now populates entries in pgds
that are not visible to xen_mm_pin_all() (note how the original code
adds the pgd to the pgd_list in pgd_ctor() before calling
pgd_prepopulate_pmd()). These newly allocated page tables won't be
correctly converted on suspend/resume and the new process will die after
resume.
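
To spell out the ordering being referred to, here is a simplified sketch of the
old allocation path (not the literal upstream code; argument lists are trimmed
and allocate_pgd_page() is a made-up placeholder):

        pgd_t *pgd_alloc_old_ordering(struct mm_struct *mm)
        {
                pgd_t *pgd = allocate_pgd_page();       /* placeholder */

                spin_lock(&pgd_lock);
                pgd_ctor(mm, pgd);              /* links the pgd into pgd_list:
                                                 * from here on the old
                                                 * xen_mm_pin_all() walker can
                                                 * see and pin it */
                pgd_prepopulate_pmd(mm, pgd);   /* entries are filled in only
                                                 * after the pgd is already
                                                 * visible to that walker */
                spin_unlock(&pgd_lock);

                return pgd;
        }

A task-list walk, by contrast, only sees this pgd once some task's ->mm points
at the new mm, which is after pgd_alloc() has returned.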
David
>>
>> (As I don't use Xen this was only build tested.)
>
> In which case it seems extra important to copy the appropriate
> maintainers, which I've done here.
>
> Ian.
>
>>
>> Cc: Andrew Morton <[email protected]>
>> Cc: Andy Lutomirski <[email protected]>
>> Cc: Borislav Petkov <[email protected]>
>> Cc: Brian Gerst <[email protected]>
>> Cc: Denys Vlasenko <[email protected]>
>> Cc: H. Peter Anvin <[email protected]>
>> Cc: Linus Torvalds <[email protected]>
>> Cc: Oleg Nesterov <[email protected]>
>> Cc: Peter Zijlstra <[email protected]>
>> Cc: Thomas Gleixner <[email protected]>
>> Cc: Waiman Long <[email protected]>
>> Cc: [email protected]
>> Signed-off-by: Ingo Molnar <[email protected]>
>> ---
>> arch/x86/xen/mmu.c | 51 ++++++++++++++++++++++++++++++++++++++-------------
>> 1 file changed, 38 insertions(+), 13 deletions(-)
>>
>> diff --git a/arch/x86/xen/mmu.c b/arch/x86/xen/mmu.c
>> index dd151b2045b0..70a3df5b0b54 100644
>> --- a/arch/x86/xen/mmu.c
>> +++ b/arch/x86/xen/mmu.c
>> @@ -853,15 +853,27 @@ static void xen_pgd_pin(struct mm_struct *mm)
>> */
>> void xen_mm_pin_all(void)
>> {
>> - struct page *page;
>> + struct task_struct *g, *p;
>>
>> - spin_lock(&pgd_lock);
>> + spin_lock(&pgd_lock); /* Implies rcu_read_lock() for the task list iteration: */
>>
>> - list_for_each_entry(page, &pgd_list, lru) {
>> - if (!PagePinned(page)) {
>> - __xen_pgd_pin(&init_mm, (pgd_t *)page_address(page));
>> - SetPageSavePinned(page);
>> + for_each_process_thread(g, p) {
>> + struct mm_struct *mm;
>> + struct page *page;
>> + pgd_t *pgd;
>> +
>> + task_lock(p);
>> + mm = p->mm;
>> + if (mm) {
>> + pgd = mm->pgd;
>> + page = virt_to_page(pgd);
>> +
>> + if (!PagePinned(page)) {
>> + __xen_pgd_pin(&init_mm, pgd);
>> + SetPageSavePinned(page);
>> + }
>> }
>> + task_unlock(p);
>> }
>>
>> spin_unlock(&pgd_lock);
>> @@ -967,19 +979,32 @@ static void xen_pgd_unpin(struct mm_struct *mm)
>> */
>> void xen_mm_unpin_all(void)
>> {
>> - struct page *page;
>> + struct task_struct *g, *p;
>>
>> - spin_lock(&pgd_lock);
>> + spin_lock(&pgd_lock); /* Implies rcu_read_lock() for the task list iteration: */
>>
>> - list_for_each_entry(page, &pgd_list, lru) {
>> - if (PageSavePinned(page)) {
>> - BUG_ON(!PagePinned(page));
>> - __xen_pgd_unpin(&init_mm, (pgd_t *)page_address(page));
>> - ClearPageSavePinned(page);
>> + for_each_process_thread(g, p) {
>> + struct mm_struct *mm;
>> + struct page *page;
>> + pgd_t *pgd;
>> +
>> + task_lock(p);
>> + mm = p->mm;
>> + if (mm) {
>> + pgd = mm->pgd;
>> + page = virt_to_page(pgd);
>> +
>> + if (PageSavePinned(page)) {
>> + BUG_ON(!PagePinned(page));
>> + __xen_pgd_unpin(&init_mm, pgd);
>> + ClearPageSavePinned(page);
>> + }
>> }
>> + task_unlock(p);
>> }
>>
>> spin_unlock(&pgd_lock);
>> + rcu_read_unlock();
>> }
>>
>> static void xen_activate_mm(struct mm_struct *prev, struct mm_struct *next)
>
>
* Andy Lutomirski <[email protected]> wrote:
> On Sun, Jun 14, 2015 at 7:47 PM, Andi Kleen <[email protected]> wrote:
> > Oleg Nesterov <[email protected]> writes:
> >>
> >> But again, the kernel no longer does this? do_page_fault() does
> >> vmalloc_fault() without notify_die(). If it fails, I do not see how/why a
> >> modular DIE_OOPS handler could try to resolve this problem and trigger
> >> another fault.
> >
> > The same problem can happen from NMI handlers or machine check handlers. It's
> > not necessarily tied to page faults only.
>
> AIUI, the point of the one and only vmalloc_sync_all call is to prevent
> infinitely recursive faults when we call a notify_die callback. The only thing
> that it could realistically protect is module text or static non-per-cpu module
> data, since that's the only thing that's reliably already in the init pgd. I'm
> with Oleg: I don't see how that can happen, since do_page_fault fixes up vmalloc
> faults before it calls notify_die.
Yes, but what I meant is that it can happen if, due to an unrelated kernel bug and
unlucky timing, we have installed this new handler just when that other unrelated
kernel bug triggers: say a #GPF crash in kernel code.
In any case it should all be mooted with the removal of lazy PGD instantiation.
Thanks,
Ingo
* Paul E. McKenney <[email protected]> wrote:
> On Sun, Jun 14, 2015 at 09:38:25PM +0200, Oleg Nesterov wrote:
> > On 06/14, Oleg Nesterov wrote:
> > >
> > > On 06/14, Ingo Molnar wrote:
> > > >
> > > > * Oleg Nesterov <[email protected]> wrote:
> > > >
> > > > > > + spin_lock(&pgd_lock); /* Implies rcu_read_lock() for the task list iteration: */
> > > > > ^^^^^^^^^^^^^^^^^^^^^^^
> > > > >
> > > > > Hmm, but it doesn't if PREEMPT_RCU? No, no, I do not pretend I understand how it
> > > > > actually works ;) But, say, rcu_check_callbacks() can be called from irq and
> > > > > since spin_lock() doesn't increment current->rcu_read_lock_nesting this can lead
> > > > > to rcu_preempt_qs()?
> > > >
> > > > No, RCU grace periods are still defined by 'heavy' context boundaries such as
> > > > context switches, entering idle or user-space mode.
> > > >
> > > > PREEMPT_RCU is like traditional RCU, except that blocking is allowed within the
> > > > RCU read critical section - that is why it uses a separate nesting counter
> > > > (current->rcu_read_lock_nesting), not the preempt count.
> > >
> > > Yes.
> > >
> > > > But if a piece of kernel code is non-preemptible, such as a spinlocked region or
> > > > an irqs-off region, then those are still natural RCU read lock regions, regardless
> > > > of the RCU model, and need no additional RCU locking.
> > >
> > > I do not think so. Yes I understand that rcu_preempt_qs() itself doesn't
> > > finish the gp, but if there are no other rcu-read-lock holders then it
> > > seems synchronize_rcu() on another CPU can return _before_ spin_unlock(),
> > > this CPU no longer needs rcu_preempt_note_context_switch().
> > >
> > > OK, I can be easily wrong, I do not really understand the implementation
> > > of PREEMPT_RCU. Perhaps preempt_disable() can actually act as rcu_read_lock()
> > > with the _current_ implementation. Still this doesn't look right even if
> > > happens to work, and Documentation/RCU/checklist.txt says:
> > >
> > > 11. Note that synchronize_rcu() -only- guarantees to wait until
> > > all currently executing rcu_read_lock()-protected RCU read-side
> > > critical sections complete. It does -not- necessarily guarantee
> > > that all currently running interrupts, NMIs, preempt_disable()
> > > code, or idle loops will complete. Therefore, if your
> > > read-side critical sections are protected by something other
> > > than rcu_read_lock(), do -not- use synchronize_rcu().
> >
> >
> > I've even checked this ;) I applied the stupid patch below and then
> >
> > $ taskset 2 perl -e 'syscall 157, 666, 5000' &
> > [1] 565
> >
> > $ taskset 1 perl -e 'syscall 157, 777'
> >
> > $
> > [1]+ Done taskset 2 perl -e 'syscall 157, 666, 5000'
> >
> > $ dmesg -c
> > SPIN start
> > SYNC start
> > SYNC done!
> > SPIN done!
>
> Please accept my apologies for my late entry to this thread.
> Youngest kid graduated from university this weekend, so my
> attention has been elsewhere.
Congratulations! :-)
> If you were to disable interrupts instead of preemption, I would expect
> that the preemptible-RCU grace period would be blocked -- though I am
> not particularly comfortable with people relying on disabled interrupts
> blocking a preemptible-RCU grace period.
>
> Here is what can happen if you try to block a preemptible-RCU grace
> period by disabling preemption, assuming that there are at least two
> online CPUs in the system:
>
> 1. CPU 0 does spin_lock(), which disables preemption.
>
> 2. CPU 1 starts a grace period.
>
> 3. CPU 0 takes a scheduling-clock interrupt. It raises softirq,
> and the RCU_SOFTIRQ handler notes that there is a new grace
> period and sets state so that a subsequent quiescent state on
> this CPU will be noted.
>
> 4. CPU 0 takes another scheduling-clock interrupt, which checks
> current->rcu_read_lock_nesting, and notes that there is no
> preemptible-RCU read-side critical section in progress. It
> again raises softirq, and the RCU_SOFTIRQ handler reports
> the quiescent state to core RCU.
>
> 5. Once each of the other CPUs report a quiescent state, the
> grace period can end, despite CPU 0 having preemption
> disabled the whole time.
>
> So Oleg's test is correct, disabling preemption is not sufficient
> to block a preemptible-RCU grace period.
I stand corrected!
> The usual suggestion would be to add rcu_read_lock() just after the lock is
> acquired and rcu_read_unlock() just before each release of that same lock.
Will fix it that way.
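
Concretely, the converted task-walk loops in this series would then gain an
explicit read-side section nested inside pgd_lock, along these lines (a sketch
patterned on the xen_mm_pin_all() conversion quoted earlier in the thread, not
the final patch):

        struct task_struct *g, *p;

        spin_lock(&pgd_lock);
        rcu_read_lock();                        /* explicit read-side marker */

        for_each_process_thread(g, p) {
                struct mm_struct *mm;

                task_lock(p);
                mm = p->mm;
                if (mm) {
                        /* ... per-mm work on mm->pgd ... */
                }
                task_unlock(p);
        }

        rcu_read_unlock();                      /* before pgd_lock is dropped */
        spin_unlock(&pgd_lock);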
Thanks,
Ingo
* David Vrabel <[email protected]> wrote:
> On 15/06/15 10:05, Ian Campbell wrote:
> > On Sat, 2015-06-13 at 11:49 +0200, Ingo Molnar wrote:
> >> xen_mm_pin_all()/unpin_all() are used to implement full guest instance
> >> suspend/restore. It's a stop-all method that needs to iterate through all
> >> allocated pgds in the system to fix them up for Xen's use.
> >>
> >> This code uses pgd_list, probably because it was an easy interface.
> >>
> >> But we want to remove the pgd_list, so convert the code over to walk all
> >> tasks in the system. This is an equivalent method.
>
> It is not equivalent because pgd_alloc() now populates entries in pgds that are
> not visible to xen_mm_pin_all() (note how the original code adds the pgd to the
> pgd_list in pgd_ctor() before calling pgd_prepopulate_pmd()). These newly
> allocated page tables won't be correctly converted on suspend/resume and the new
> process will die after resume.
So how should the Xen logic be fixed for the new scheme? I can't say I can see
through the paravirt complexity here.
Thanks,
Ingo
On Mon, Jun 15, 2015 at 1:28 PM, Ingo Molnar <[email protected]> wrote:
>
> * Andy Lutomirski <[email protected]> wrote:
>
>> On Sun, Jun 14, 2015 at 7:47 PM, Andi Kleen <[email protected]> wrote:
>> > Oleg Nesterov <[email protected]> writes:
>> >>
>> >> But again, the kernel no longer does this? do_page_fault() does
>> >> vmalloc_fault() without notify_die(). If it fails, I do not see how/why a
>> >> modular DIE_OOPS handler could try to resolve this problem and trigger
>> >> another fault.
>> >
>> > The same problem can happen from NMI handlers or machine check handlers. It's
>> > not necessarily tied to page faults only.
>>
>> AIUI, the point of the one and only vmalloc_sync_all call is to prevent
>> infinitely recursive faults when we call a notify_die callback. The only thing
>> that it could realistically protect is module text or static non-per-cpu module
>> data, since that's the only thing that's reliably already in the init pgd. I'm
>> with Oleg: I don't see how that can happen, since do_page_fault fixes up vmalloc
>> faults before it calls notify_die.
>
> Yes, but what I meant is that it can happen if due to an unrelated kernel bug and
> unlucky timing we have installed this new handler just when that other unrelated
> kernel bug triggers: say a #GPF crash in kernel code.
I still don't see the problem.
CPU A: crash and start executing do_page_fault
CPU B: register_die_notifier
CPU A: notify_die
now we get a vmalloc fault, fix it up, and return to do_page_fault and
print the oops.
>
> In any case it should all be mooted with the removal of lazy PGD instantiation.
Agreed.
--Andy
On 15/06/15 21:35, Ingo Molnar wrote:
>
> * David Vrabel <[email protected]> wrote:
>
>> On 15/06/15 10:05, Ian Campbell wrote:
>>> On Sat, 2015-06-13 at 11:49 +0200, Ingo Molnar wrote:
>
>>>> xen_mm_pin_all()/unpin_all() are used to implement full guest instance
>>>> suspend/restore. It's a stop-all method that needs to iterate through all
>>>> allocated pgds in the system to fix them up for Xen's use.
>>>>
>>>> This code uses pgd_list, probably because it was an easy interface.
>>>>
>>>> But we want to remove the pgd_list, so convert the code over to walk all
>>>> tasks in the system. This is an equivalent method.
>>
>> It is not equivalent because pgd_alloc() now populates entries in pgds that are
>> not visible to xen_mm_pin_all() (note how the original code adds the pgd to the
>> pgd_list in pgd_ctor() before calling pgd_prepopulate_pmd()). These newly
>> allocated page tables won't be correctly converted on suspend/resume and the new
>> process will die after resume.
>
> So how should the Xen logic be fixed for the new scheme? I can't say I can see
> through the paravirt complexity here.
Actually, since we freeze_processes() before trying to pin page tables,
I think it should be ok as-is.
I'll put the patch through some tests.
David
On 06/16/2015 10:15 AM, David Vrabel wrote:
> On 15/06/15 21:35, Ingo Molnar wrote:
>> * David Vrabel <[email protected]> wrote:
>>
>>> On 15/06/15 10:05, Ian Campbell wrote:
>>>> On Sat, 2015-06-13 at 11:49 +0200, Ingo Molnar wrote:
>>>>> xen_mm_pin_all()/unpin_all() are used to implement full guest instance
>>>>> suspend/restore. It's a stop-all method that needs to iterate through all
>>>>> allocated pgds in the system to fix them up for Xen's use.
>>>>>
>>>>> This code uses pgd_list, probably because it was an easy interface.
>>>>>
>>>>> But we want to remove the pgd_list, so convert the code over to walk all
>>>>> tasks in the system. This is an equivalent method.
>>> It is not equivalent because pgd_alloc() now populates entries in pgds that are
>>> not visible to xen_mm_pin_all() (note how the original code adds the pgd to the
>>> pgd_list in pgd_ctor() before calling pgd_prepopulate_pmd()). These newly
>>> allocated page tables won't be correctly converted on suspend/resume and the new
>>> process will die after resume.
>> So how should the Xen logic be fixed for the new scheme? I can't say I can see
>> through the paravirt complexity here.
> Actually, since we freeze_processes() before trying to pin page tables,
> I think it should be ok as-is.
>
> I'll put the patch through some tests.
Actually, I just ran this through a couple of boot/suspend/resume tests
and didn't see any issues (with the one fix I mentioned to Ingo
earlier). On unstable Xen only.
-boris
On 16/06/15 15:19, Boris Ostrovsky wrote:
> On 06/16/2015 10:15 AM, David Vrabel wrote:
>> On 15/06/15 21:35, Ingo Molnar wrote:
>>> * David Vrabel <[email protected]> wrote:
>>>
>>>> On 15/06/15 10:05, Ian Campbell wrote:
>>>>> On Sat, 2015-06-13 at 11:49 +0200, Ingo Molnar wrote:
>>>>>> xen_mm_pin_all()/unpin_all() are used to implement full guest
>>>>>> instance
>>>>>> suspend/restore. It's a stop-all method that needs to iterate
>>>>>> through all
>>>>>> allocated pgds in the system to fix them up for Xen's use.
>>>>>>
>>>>>> This code uses pgd_list, probably because it was an easy interface.
>>>>>>
>>>>>> But we want to remove the pgd_list, so convert the code over to
>>>>>> walk all
>>>>>> tasks in the system. This is an equivalent method.
>>>> It is not equivalent because pgd_alloc() now populates entries in
>>>> pgds that are
>>>> not visible to xen_mm_pin_all() (note how the original code adds the
>>>> pgd to the
>>>> pgd_list in pgd_ctor() before calling pgd_prepopulate_pmd()). These
>>>> newly
>>>> allocated page tables won't be correctly converted on suspend/resume
>>>> and the new
>>>> process will die after resume.
>>> So how should the Xen logic be fixed for the new scheme? I can't say
>>> I can see
>>> through the paravirt complexity here.
>> Actually, since we freeze_processes() before trying to pin page tables,
>> I think it should be ok as-is.
>>
>> I'll put the patch through some tests.
>
> Actually, I just ran this through a couple of boot/suspend/resume tests
> and didn't see any issues (with the one fix I mentioned to Ingo
> earlier). On unstable Xen only.
In which case this can have a:
Reviewed-by: David Vrabel <[email protected]>
Thanks.
David