2015-06-11 14:07:37

by Ingo Molnar

Subject: [RFC PATCH 00/12] x86/mm: Implement lockless pgd_alloc()/pgd_free()

Waiman Long reported 'pgd_lock' contention on high CPU count systems and proposed
moving pgd_lock to a separate cacheline to eliminate false sharing and to reduce
some of the lock bouncing overhead.

I think we can do much better: this series eliminates the pgd_list and makes
pgd_alloc()/pgd_free() lockless.

Now the lockless initialization of the PGD has a few preconditions, which the
initial part of the series implements:

- no PGD clearing is allowed, only additions. This makes sense, as a single PGD
  entry covers 512 GB of RAM, so the 4K overhead of keeping an empty PUD table
  around per 0.5 TB of RAM mapped is minuscule.
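
  ( A quick check of that arithmetic, using the usual x86-64 4-level paging
    constants purely for illustration: one PGD entry spans
    PGDIR_SIZE = 1UL << 39 = 512 GB of address space, and the PUD table it
    points to has PTRS_PER_PUD = 512 entries of 8 bytes each, i.e. exactly
    one 4K page - which is what we now keep around even when all of its
    entries are clear. )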

The patches after that convert existing pgd_list users to walk the task list.

PGD locking is kept intact: coherency guarantees between the CPA, vmalloc,
hotplug, etc. code are unchanged.

The final patches eliminate the pgd_list and thus make pgd_alloc()/pgd_free()
lockless.

The patches have been boot tested on 64-bit and 32-bit x86 systems.

Architectures not making use of the new facility are unaffected.

Thanks,

Ingo

=====
Ingo Molnar (12):
x86/mm/pat: Don't free PGD entries on memory unmap
x86/mm/hotplug: Remove pgd_list use from the memory hotplug code
x86/mm/hotplug: Don't remove PGD entries in remove_pagetable()
x86/mm/hotplug: Simplify sync_global_pgds()
mm: Introduce arch_pgd_init_late()
x86/mm: Enable and use the arch_pgd_init_late() method
x86/virt/guest/xen: Remove use of pgd_list from the Xen guest code
x86/mm: Remove pgd_list use from vmalloc_sync_all()
x86/mm/pat/32: Remove pgd_list use from the PAT code
x86/mm: Make pgd_alloc()/pgd_free() lockless
x86/mm: Remove pgd_list leftovers
x86/mm: Simplify pgd_alloc()

arch/Kconfig | 9 ++++
arch/x86/Kconfig | 1 +
arch/x86/include/asm/pgtable.h | 3 --
arch/x86/include/asm/pgtable_64.h | 3 +-
arch/x86/mm/fault.c | 24 +++++++----
arch/x86/mm/init_64.c | 73 +++++++++++----------------------
arch/x86/mm/pageattr.c | 34 ++++++++--------
arch/x86/mm/pgtable.c | 129 +++++++++++++++++++++++++++++-----------------------------
arch/x86/xen/mmu.c | 34 +++++++++++++---
fs/exec.c | 3 ++
include/linux/mm.h | 6 +++
kernel/fork.c | 16 ++++++++
12 files changed, 183 insertions(+), 152 deletions(-)

--
2.1.4


2015-06-11 14:11:13

by Ingo Molnar

Subject: [PATCH 01/12] x86/mm/pat: Don't free PGD entries on memory unmap

So when we unmap a kernel memory range, we also check whether the PUD table
has become completely clear, and if so we free it and clear the PGD entry.

This complicates PGD management, so don't do this. We can keep the
PGD mapped and the PUD table all clear - it's only a single 4K page
per 512 GB of memory mapped.

Cc: Andrew Morton <[email protected]>
Cc: Andy Lutomirski <[email protected]>
Cc: Borislav Petkov <[email protected]>
Cc: Brian Gerst <[email protected]>
Cc: Denys Vlasenko <[email protected]>
Cc: H. Peter Anvin <[email protected]>
Cc: Linus Torvalds <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Cc: Rik van Riel <[email protected]>
Cc: Thomas Gleixner <[email protected]>
Cc: Waiman Long <[email protected]>
Cc: [email protected]
Signed-off-by: Ingo Molnar <[email protected]>
---
arch/x86/mm/pageattr.c | 15 ---------------
1 file changed, 15 deletions(-)

diff --git a/arch/x86/mm/pageattr.c b/arch/x86/mm/pageattr.c
index 70d221fe2eb4..997fc97e9072 100644
--- a/arch/x86/mm/pageattr.c
+++ b/arch/x86/mm/pageattr.c
@@ -717,18 +717,6 @@ static bool try_to_free_pmd_page(pmd_t *pmd)
return true;
}

-static bool try_to_free_pud_page(pud_t *pud)
-{
- int i;
-
- for (i = 0; i < PTRS_PER_PUD; i++)
- if (!pud_none(pud[i]))
- return false;
-
- free_page((unsigned long)pud);
- return true;
-}
-
static bool unmap_pte_range(pmd_t *pmd, unsigned long start, unsigned long end)
{
pte_t *pte = pte_offset_kernel(pmd, start);
@@ -847,9 +835,6 @@ static void unmap_pgd_range(pgd_t *root, unsigned long addr, unsigned long end)
pgd_t *pgd_entry = root + pgd_index(addr);

unmap_pud_range(pgd_entry, addr, end);
-
- if (try_to_free_pud_page((pud_t *)pgd_page_vaddr(*pgd_entry)))
- pgd_clear(pgd_entry);
}

static int alloc_pte_page(pmd_t *pmd)
--
2.1.4

2015-06-11 14:07:46

by Ingo Molnar

Subject: [PATCH 02/12] x86/mm/hotplug: Remove pgd_list use from the memory hotplug code

The memory hotplug code uses sync_global_pgds() to synchronize updates
to the global (&init_mm) kernel PGD and the task PGDs. It does this
by iterating over the pgd_list - which list closely tracks task
creation/destruction via fork/clone.

But we want to remove this list, so that it does not have to be
maintained from fork()/exit(), so convert the memory hotplug code
to use the task list to iterate over all pgds in the system.

Also improve the comments a bit, to make this function easier
to understand.

Only lightly tested, as I don't have a memory hotplug setup.

Cc: Andrew Morton <[email protected]>
Cc: Andy Lutomirski <[email protected]>
Cc: Borislav Petkov <[email protected]>
Cc: Brian Gerst <[email protected]>
Cc: Denys Vlasenko <[email protected]>
Cc: H. Peter Anvin <[email protected]>
Cc: Linus Torvalds <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Cc: Thomas Gleixner <[email protected]>
Cc: Waiman Long <[email protected]>
Signed-off-by: Ingo Molnar <[email protected]>
---
arch/x86/mm/init_64.c | 30 +++++++++++++++++-------------
1 file changed, 17 insertions(+), 13 deletions(-)

diff --git a/arch/x86/mm/init_64.c b/arch/x86/mm/init_64.c
index 3fba623e3ba5..1921acbd49fd 100644
--- a/arch/x86/mm/init_64.c
+++ b/arch/x86/mm/init_64.c
@@ -160,8 +160,8 @@ static int __init nonx32_setup(char *str)
__setup("noexec32=", nonx32_setup);

/*
- * When memory was added/removed make sure all the processes MM have
- * suitable PGD entries in the local PGD level page.
+ * When memory was added/removed make sure all the process MMs have
+ * matching PGD entries in the local PGD level page as well.
*/
void sync_global_pgds(unsigned long start, unsigned long end, int removed)
{
@@ -169,29 +169,33 @@ void sync_global_pgds(unsigned long start, unsigned long end, int removed)

for (address = start; address <= end; address += PGDIR_SIZE) {
const pgd_t *pgd_ref = pgd_offset_k(address);
- struct page *page;
+ struct task_struct *g, *p;

/*
- * When it is called after memory hot remove, pgd_none()
- * returns true. In this case (removed == 1), we must clear
- * the PGD entries in the local PGD level page.
+ * When this function is called after memory hot remove,
+ * pgd_none() already returns true, but only the reference
+ * kernel PGD has been cleared, not the process PGDs.
+ *
+ * So clear the affected entries in every process PGD as well:
*/
if (pgd_none(*pgd_ref) && !removed)
continue;

spin_lock(&pgd_lock);
- list_for_each_entry(page, &pgd_list, lru) {
- pgd_t *pgd;
+
+ for_each_process_thread(g, p) {
+ pgd_t *pgd = p->mm->pgd;
spinlock_t *pgt_lock;

- pgd = (pgd_t *)page_address(page) + pgd_index(address);
- /* the pgt_lock only for Xen */
- pgt_lock = &pgd_page_get_mm(page)->page_table_lock;
+ if (!p->mm)
+ continue;
+
+ /* The pgt_lock is only used by Xen: */
+ pgt_lock = &p->mm->page_table_lock;
spin_lock(pgt_lock);

if (!pgd_none(*pgd_ref) && !pgd_none(*pgd))
- BUG_ON(pgd_page_vaddr(*pgd)
- != pgd_page_vaddr(*pgd_ref));
+ BUG_ON(pgd_page_vaddr(*pgd) != pgd_page_vaddr(*pgd_ref));

if (removed) {
if (pgd_none(*pgd_ref) && !pgd_none(*pgd))
--
2.1.4

2015-06-11 14:07:43

by Ingo Molnar

Subject: [PATCH 03/12] x86/mm/hotplug: Don't remove PGD entries in remove_pagetable()

So when memory hotplug removes a piece of physical memory from the pagetable
mappings, it also frees the underlying PUD table and clears the PGD entry.

This complicates PGD management, so don't do this. We can keep the
PGD mapped and the PUD table all clear - it's only a single 4K page
per 512 GB of memory hotplugged.

Cc: Andrew Morton <[email protected]>
Cc: Andy Lutomirski <[email protected]>
Cc: Borislav Petkov <[email protected]>
Cc: Brian Gerst <[email protected]>
Cc: Denys Vlasenko <[email protected]>
Cc: H. Peter Anvin <[email protected]>
Cc: Linus Torvalds <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Cc: Rik van Riel <[email protected]>
Cc: Thomas Gleixner <[email protected]>
Cc: Waiman Long <[email protected]>
Cc: [email protected]
Signed-off-by: Ingo Molnar <[email protected]>
---
arch/x86/mm/init_64.c | 27 ---------------------------
1 file changed, 27 deletions(-)

diff --git a/arch/x86/mm/init_64.c b/arch/x86/mm/init_64.c
index 1921acbd49fd..2d5931d343cd 100644
--- a/arch/x86/mm/init_64.c
+++ b/arch/x86/mm/init_64.c
@@ -770,27 +770,6 @@ static void __meminit free_pmd_table(pmd_t *pmd_start, pud_t *pud)
spin_unlock(&init_mm.page_table_lock);
}

-/* Return true if pgd is changed, otherwise return false. */
-static bool __meminit free_pud_table(pud_t *pud_start, pgd_t *pgd)
-{
- pud_t *pud;
- int i;
-
- for (i = 0; i < PTRS_PER_PUD; i++) {
- pud = pud_start + i;
- if (pud_val(*pud))
- return false;
- }
-
- /* free a pud table */
- free_pagetable(pgd_page(*pgd), 0);
- spin_lock(&init_mm.page_table_lock);
- pgd_clear(pgd);
- spin_unlock(&init_mm.page_table_lock);
-
- return true;
-}
-
static void __meminit
remove_pte_table(pte_t *pte_start, unsigned long addr, unsigned long end,
bool direct)
@@ -982,7 +961,6 @@ remove_pagetable(unsigned long start, unsigned long end, bool direct)
unsigned long addr;
pgd_t *pgd;
pud_t *pud;
- bool pgd_changed = false;

for (addr = start; addr < end; addr = next) {
next = pgd_addr_end(addr, end);
@@ -993,13 +971,8 @@ remove_pagetable(unsigned long start, unsigned long end, bool direct)

pud = (pud_t *)pgd_page_vaddr(*pgd);
remove_pud_table(pud, addr, next, direct);
- if (free_pud_table(pud, pgd))
- pgd_changed = true;
}

- if (pgd_changed)
- sync_global_pgds(start, end - 1, 1);
-
flush_tlb_all();
}

--
2.1.4

2015-06-11 14:10:51

by Ingo Molnar

Subject: [PATCH 04/12] x86/mm/hotplug: Simplify sync_global_pgds()

Now that the memory hotplug code does not remove PGD entries anymore,
the only users of sync_global_pgds() use it after extending the
PGD.

So remove the 'removed' parameter and simplify the call sites.

Cc: Andrew Morton <[email protected]>
Cc: Andy Lutomirski <[email protected]>
Cc: Borislav Petkov <[email protected]>
Cc: Brian Gerst <[email protected]>
Cc: Denys Vlasenko <[email protected]>
Cc: H. Peter Anvin <[email protected]>
Cc: Linus Torvalds <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Cc: Rik van Riel <[email protected]>
Cc: Thomas Gleixner <[email protected]>
Cc: Waiman Long <[email protected]>
Cc: [email protected]
Signed-off-by: Ingo Molnar <[email protected]>
---
arch/x86/include/asm/pgtable_64.h | 3 +--
arch/x86/mm/fault.c | 2 +-
arch/x86/mm/init_64.c | 19 +++++++------------
3 files changed, 9 insertions(+), 15 deletions(-)

diff --git a/arch/x86/include/asm/pgtable_64.h b/arch/x86/include/asm/pgtable_64.h
index 2ee781114d34..f405fc3bb719 100644
--- a/arch/x86/include/asm/pgtable_64.h
+++ b/arch/x86/include/asm/pgtable_64.h
@@ -116,8 +116,7 @@ static inline void native_pgd_clear(pgd_t *pgd)
native_set_pgd(pgd, native_make_pgd(0));
}

-extern void sync_global_pgds(unsigned long start, unsigned long end,
- int removed);
+extern void sync_global_pgds(unsigned long start, unsigned long end);

/*
* Conversion functions: convert a page and protection to a page entry,
diff --git a/arch/x86/mm/fault.c b/arch/x86/mm/fault.c
index 181c53bac3a7..50342825f221 100644
--- a/arch/x86/mm/fault.c
+++ b/arch/x86/mm/fault.c
@@ -349,7 +349,7 @@ static void dump_pagetable(unsigned long address)

void vmalloc_sync_all(void)
{
- sync_global_pgds(VMALLOC_START & PGDIR_MASK, VMALLOC_END, 0);
+ sync_global_pgds(VMALLOC_START & PGDIR_MASK, VMALLOC_END);
}

/*
diff --git a/arch/x86/mm/init_64.c b/arch/x86/mm/init_64.c
index 2d5931d343cd..beee532b76a7 100644
--- a/arch/x86/mm/init_64.c
+++ b/arch/x86/mm/init_64.c
@@ -160,10 +160,10 @@ static int __init nonx32_setup(char *str)
__setup("noexec32=", nonx32_setup);

/*
- * When memory was added/removed make sure all the process MMs have
+ * When memory was added make sure all the process MMs have
* matching PGD entries in the local PGD level page as well.
*/
-void sync_global_pgds(unsigned long start, unsigned long end, int removed)
+void sync_global_pgds(unsigned long start, unsigned long end)
{
unsigned long address;

@@ -178,7 +178,7 @@ void sync_global_pgds(unsigned long start, unsigned long end, int removed)
*
* So clear the affected entries in every process PGD as well:
*/
- if (pgd_none(*pgd_ref) && !removed)
+ if (pgd_none(*pgd_ref))
continue;

spin_lock(&pgd_lock);
@@ -197,13 +197,8 @@ void sync_global_pgds(unsigned long start, unsigned long end, int removed)
if (!pgd_none(*pgd_ref) && !pgd_none(*pgd))
BUG_ON(pgd_page_vaddr(*pgd) != pgd_page_vaddr(*pgd_ref));

- if (removed) {
- if (pgd_none(*pgd_ref) && !pgd_none(*pgd))
- pgd_clear(pgd);
- } else {
- if (pgd_none(*pgd))
- set_pgd(pgd, *pgd_ref);
- }
+ if (pgd_none(*pgd))
+ set_pgd(pgd, *pgd_ref);

spin_unlock(pgt_lock);
}
@@ -636,7 +631,7 @@ kernel_physical_mapping_init(unsigned long start,
}

if (pgd_changed)
- sync_global_pgds(addr, end - 1, 0);
+ sync_global_pgds(addr, end - 1);

__flush_tlb_all();

@@ -1276,7 +1271,7 @@ int __meminit vmemmap_populate(unsigned long start, unsigned long end, int node)
else
err = vmemmap_populate_basepages(start, end, node);
if (!err)
- sync_global_pgds(start, end - 1, 0);
+ sync_global_pgds(start, end - 1);
return err;
}

--
2.1.4

2015-06-11 14:10:12

by Ingo Molnar

Subject: [PATCH 05/12] mm: Introduce arch_pgd_init_late()

Add a late PGD init callback to places that allocate a new MM
with a new PGD: copy_process() and exec().

The purpose of this callback is to allow architectures to implement
lockless initialization of task PGDs, to remove the scalability
limit of pgd_list/pgd_lock.

Architectures can opt in to this callback via the ARCH_HAS_PGD_INIT_LATE
Kconfig flag. There's zero overhead on architectures that are not using it.

Cc: Andrew Morton <[email protected]>
Cc: Andy Lutomirski <[email protected]>
Cc: Borislav Petkov <[email protected]>
Cc: Brian Gerst <[email protected]>
Cc: Denys Vlasenko <[email protected]>
Cc: H. Peter Anvin <[email protected]>
Cc: Linus Torvalds <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Cc: Rik van Riel <[email protected]>
Cc: Thomas Gleixner <[email protected]>
Cc: Waiman Long <[email protected]>
Cc: [email protected]
Cc: [email protected]
Signed-off-by: Ingo Molnar <[email protected]>
---
arch/Kconfig | 9 +++++++++
fs/exec.c | 3 +++
include/linux/mm.h | 6 ++++++
kernel/fork.c | 16 ++++++++++++++++
4 files changed, 34 insertions(+)

diff --git a/arch/Kconfig b/arch/Kconfig
index a65eafb24997..a8e866cd4247 100644
--- a/arch/Kconfig
+++ b/arch/Kconfig
@@ -491,6 +491,15 @@ config PGTABLE_LEVELS
int
default 2

+config ARCH_HAS_PGD_INIT_LATE
+ bool
+ help
+ Architectures that want a late PGD initialization can define
+ the arch_pgd_init_late() callback and it will be called
+ by the generic new task (fork()) code after a new task has
+ been made visible on the task list, but before it has been
+ first scheduled.
+
config ARCH_HAS_ELF_RANDOMIZE
bool
help
diff --git a/fs/exec.c b/fs/exec.c
index 1977c2a553ac..c1d213c64fda 100644
--- a/fs/exec.c
+++ b/fs/exec.c
@@ -860,7 +860,10 @@ static int exec_mmap(struct mm_struct *mm)
}
task_lock(tsk);
active_mm = tsk->active_mm;
+
tsk->mm = mm;
+ arch_pgd_init_late(mm, mm->pgd);
+
tsk->active_mm = mm;
activate_mm(active_mm, mm);
tsk->mm->vmacache_seqnum = 0;
diff --git a/include/linux/mm.h b/include/linux/mm.h
index 0755b9fd03a7..35d887d2b038 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -1134,6 +1134,12 @@ int follow_phys(struct vm_area_struct *vma, unsigned long address,
int generic_access_phys(struct vm_area_struct *vma, unsigned long addr,
void *buf, int len, int write);

+#ifdef CONFIG_ARCH_HAS_PGD_INIT_LATE
+void arch_pgd_init_late(struct mm_struct *mm, pgd_t *pgd);
+#else
+static inline void arch_pgd_init_late(struct mm_struct *mm, pgd_t *pgd) { }
+#endif
+
static inline void unmap_shared_mapping_range(struct address_space *mapping,
loff_t const holebegin, loff_t const holelen)
{
diff --git a/kernel/fork.c b/kernel/fork.c
index 03c1eaaa6ef5..1f83ceca6c6c 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -1592,6 +1592,22 @@ static struct task_struct *copy_process(unsigned long clone_flags,
syscall_tracepoint_update(p);
write_unlock_irq(&tasklist_lock);

+ /*
+ * If we have a new PGD then initialize it:
+ *
+ * This method is called after a task has been made visible
+ * on the task list already.
+ *
+ * Architectures that manage per task kernel pagetables
+ * might use this callback to initialize them after they
+ * are already visible to new updates.
+ *
+ * NOTE: any user-space parts of the PGD are already initialized
+ * and must not be clobbered.
+ */
+ if (p->mm != current->mm)
+ arch_pgd_init_late(p->mm, p->mm->pgd);
+
proc_fork_connector(p);
cgroup_post_fork(p);
if (clone_flags & CLONE_THREAD)
--
2.1.4

2015-06-11 14:09:47

by Ingo Molnar

Subject: [PATCH 06/12] x86/mm: Enable and use the arch_pgd_init_late() method

Prepare for lockless PGD init: enable the arch_pgd_init_late() callback
and add a 'careful' implementation of PGD init to it: only copy over
non-zero entries.

Since PGD entries only ever get added, this method catches any updates
to swapper_pg_dir[] that might have occurred between early PGD init
and late PGD init.

Note that this only matters for code that uses the task list, rather than
the pgd_list, to find all PGDs in the system.

Subsequent patches will convert pgd_list users to task-list iterations.

[ This adds extra overhead in that we do the PGD initialization for a
second time - a later patch will simplify this, once we don't have
old pgd_list users. ]

Cc: Andrew Morton <[email protected]>
Cc: Andy Lutomirski <[email protected]>
Cc: Borislav Petkov <[email protected]>
Cc: Brian Gerst <[email protected]>
Cc: Denys Vlasenko <[email protected]>
Cc: H. Peter Anvin <[email protected]>
Cc: Linus Torvalds <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Cc: Thomas Gleixner <[email protected]>
Signed-off-by: Ingo Molnar <[email protected]>
---
arch/x86/Kconfig | 1 +
arch/x86/mm/pgtable.c | 57 +++++++++++++++++++++++++++++++++++++++++++++++++++++++++
2 files changed, 58 insertions(+)

diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig
index 7e39f9b22705..15c19ce149f0 100644
--- a/arch/x86/Kconfig
+++ b/arch/x86/Kconfig
@@ -27,6 +27,7 @@ config X86
select ARCH_HAS_ELF_RANDOMIZE
select ARCH_HAS_FAST_MULTIPLIER
select ARCH_HAS_GCOV_PROFILE_ALL
+ select ARCH_HAS_PGD_INIT_LATE
select ARCH_HAS_SG_CHAIN
select ARCH_HAVE_NMI_SAFE_CMPXCHG
select ARCH_MIGHT_HAVE_ACPI_PDC if ACPI
diff --git a/arch/x86/mm/pgtable.c b/arch/x86/mm/pgtable.c
index fb0a9dd1d6e4..e0bf90470d70 100644
--- a/arch/x86/mm/pgtable.c
+++ b/arch/x86/mm/pgtable.c
@@ -391,6 +391,63 @@ pgd_t *pgd_alloc(struct mm_struct *mm)
return NULL;
}

+/*
+ * Initialize the kernel portion of the PGD.
+ *
+ * This is done separately, because pgd_alloc() happens when
+ * the task is not on the task list yet - and PGD updates
+ * happen by walking the task list.
+ *
+ * No locking is needed here, as we just copy over the reference
+ * PGD. The reference PGD (pgtable_init) is only ever expanded
+ * at the highest, PGD level. Thus any other task extending it
+ * will first update the reference PGD, then modify the task PGDs.
+ */
+void arch_pgd_init_late(struct mm_struct *mm, pgd_t *pgd)
+{
+ /*
+ * This is called after a new MM has been made visible
+ * in fork() or exec().
+ *
+ * This barrier makes sure the MM is visible to new RCU
+ * walkers before we initialize it, so that we don't miss
+ * updates:
+ */
+ smp_wmb();
+
+ /*
+ * If the pgd points to a shared pagetable level (either the
+ * ptes in non-PAE, or shared PMD in PAE), then just copy the
+ * references from swapper_pg_dir:
+ */
+ if (CONFIG_PGTABLE_LEVELS == 2 ||
+ (CONFIG_PGTABLE_LEVELS == 3 && SHARED_KERNEL_PMD) ||
+ CONFIG_PGTABLE_LEVELS == 4) {
+
+ pgd_t *pgd_src = swapper_pg_dir + KERNEL_PGD_BOUNDARY;
+ pgd_t *pgd_dst = pgd + KERNEL_PGD_BOUNDARY;
+ int i;
+
+ for (i = 0; i < KERNEL_PGD_PTRS; i++, pgd_src++, pgd_dst++) {
+ /*
+ * This is lock-less, so it can race with PGD updates
+ * coming from vmalloc() or CPA methods, but it's safe,
+ * because:
+ *
+ * 1) this PGD is not in use yet, we have still not
+ * scheduled this task.
+ * 2) we only ever extend PGD entries
+ *
+ * So if we observe a non-zero PGD entry we can copy it,
+ * it won't change from under us. Parallel updates (new
+ * allocations) will modify our (already visible) PGD:
+ */
+ if (pgd_val(*pgd_src))
+ WRITE_ONCE(*pgd_dst, *pgd_src);
+ }
+ }
+}
+
void pgd_free(struct mm_struct *mm, pgd_t *pgd)
{
pgd_mop_up_pmds(mm, pgd);
--
2.1.4

2015-06-11 14:07:55

by Ingo Molnar

Subject: [PATCH 07/12] x86/virt/guest/xen: Remove use of pgd_list from the Xen guest code

xen_mm_pin_all()/unpin_all() are used to implement full guest instance
suspend/restore. It's a stop-all method that needs to iterate through
all allocated pgds in the system to fix them up for Xen's use.

This code uses pgd_list, probably because it was an easy interface.

But we want to remove the pgd_list, so convert the code over to walk
all tasks in the system. This is an equivalent method.

(As I don't use Xen, this was only build tested.)

Cc: Andrew Morton <[email protected]>
Cc: Andy Lutomirski <[email protected]>
Cc: Borislav Petkov <[email protected]>
Cc: Brian Gerst <[email protected]>
Cc: Denys Vlasenko <[email protected]>
Cc: H. Peter Anvin <[email protected]>
Cc: Linus Torvalds <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Cc: Thomas Gleixner <[email protected]>
Cc: Waiman Long <[email protected]>
Signed-off-by: Ingo Molnar <[email protected]>
---
arch/x86/xen/mmu.c | 34 ++++++++++++++++++++++++++++------
1 file changed, 28 insertions(+), 6 deletions(-)

diff --git a/arch/x86/xen/mmu.c b/arch/x86/xen/mmu.c
index dd151b2045b0..87a8354435f8 100644
--- a/arch/x86/xen/mmu.c
+++ b/arch/x86/xen/mmu.c
@@ -853,18 +853,29 @@ static void xen_pgd_pin(struct mm_struct *mm)
*/
void xen_mm_pin_all(void)
{
- struct page *page;
+ struct task_struct *g, *p;

+ rcu_read_lock();
spin_lock(&pgd_lock);

- list_for_each_entry(page, &pgd_list, lru) {
+ for_each_process_thread(g, p) {
+ struct page *page;
+ pgd_t *pgd;
+
+ if (!p->mm)
+ continue;
+
+ pgd = p->mm->pgd;
+ page = virt_to_page(pgd);
+
if (!PagePinned(page)) {
- __xen_pgd_pin(&init_mm, (pgd_t *)page_address(page));
+ __xen_pgd_pin(&init_mm, pgd);
SetPageSavePinned(page);
}
}

spin_unlock(&pgd_lock);
+ rcu_read_unlock();
}

/*
@@ -967,19 +978,30 @@ static void xen_pgd_unpin(struct mm_struct *mm)
*/
void xen_mm_unpin_all(void)
{
- struct page *page;
+ struct task_struct *g, *p;

+ rcu_read_lock();
spin_lock(&pgd_lock);

- list_for_each_entry(page, &pgd_list, lru) {
+ for_each_process_thread(g, p) {
+ struct page *page;
+ pgd_t *pgd;
+
+ if (!p->mm)
+ continue;
+
+ pgd = p->mm->pgd;
+ page = virt_to_page(pgd);
+
if (PageSavePinned(page)) {
BUG_ON(!PagePinned(page));
- __xen_pgd_unpin(&init_mm, (pgd_t *)page_address(page));
+ __xen_pgd_unpin(&init_mm, pgd);
ClearPageSavePinned(page);
}
}

spin_unlock(&pgd_lock);
+ rcu_read_unlock();
}

static void xen_activate_mm(struct mm_struct *prev, struct mm_struct *next)
--
2.1.4

2015-06-11 14:07:58

by Ingo Molnar

Subject: [PATCH 08/12] x86/mm: Remove pgd_list use from vmalloc_sync_all()

The vmalloc() code uses vmalloc_sync_all() to synchronize changes to
the global reference kernel PGD to task PGDs.

This uses pgd_list currently, but we don't need the global list,
as we can walk the task list under RCU.

Cc: Andrew Morton <[email protected]>
Cc: Andy Lutomirski <[email protected]>
Cc: Borislav Petkov <[email protected]>
Cc: Brian Gerst <[email protected]>
Cc: Denys Vlasenko <[email protected]>
Cc: H. Peter Anvin <[email protected]>
Cc: Linus Torvalds <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Cc: Rik van Riel <[email protected]>
Cc: Thomas Gleixner <[email protected]>
Cc: Waiman Long <[email protected]>
Cc: [email protected]
Signed-off-by: Ingo Molnar <[email protected]>
---
arch/x86/mm/fault.c | 21 ++++++++++++++-------
arch/x86/mm/init_64.c | 3 ++-
2 files changed, 16 insertions(+), 8 deletions(-)

diff --git a/arch/x86/mm/fault.c b/arch/x86/mm/fault.c
index 50342825f221..2d587e548b59 100644
--- a/arch/x86/mm/fault.c
+++ b/arch/x86/mm/fault.c
@@ -235,24 +235,31 @@ void vmalloc_sync_all(void)
for (address = VMALLOC_START & PMD_MASK;
address >= TASK_SIZE && address < FIXADDR_TOP;
address += PMD_SIZE) {
- struct page *page;

+ struct task_struct *g, *p;
+
+ rcu_read_lock();
spin_lock(&pgd_lock);
- list_for_each_entry(page, &pgd_list, lru) {
+
+ for_each_process_thread(g, p) {
spinlock_t *pgt_lock;
- pmd_t *ret;
+ pmd_t *pmd_ret;

- /* the pgt_lock only for Xen */
- pgt_lock = &pgd_page_get_mm(page)->page_table_lock;
+ if (!p->mm)
+ continue;

+ /* The pgt_lock is only used on Xen: */
+ pgt_lock = &p->mm->page_table_lock;
spin_lock(pgt_lock);
- ret = vmalloc_sync_one(page_address(page), address);
+ pmd_ret = vmalloc_sync_one(p->mm->pgd, address);
spin_unlock(pgt_lock);

- if (!ret)
+ if (!pmd_ret)
break;
}
+
spin_unlock(&pgd_lock);
+ rcu_read_unlock();
}
}

diff --git a/arch/x86/mm/init_64.c b/arch/x86/mm/init_64.c
index beee532b76a7..730560c4873e 100644
--- a/arch/x86/mm/init_64.c
+++ b/arch/x86/mm/init_64.c
@@ -184,11 +184,12 @@ void sync_global_pgds(unsigned long start, unsigned long end)
spin_lock(&pgd_lock);

for_each_process_thread(g, p) {
- pgd_t *pgd = p->mm->pgd;
+ pgd_t *pgd;
spinlock_t *pgt_lock;

if (!p->mm)
continue;
+ pgd = p->mm->pgd;

/* The pgt_lock is only used by Xen: */
pgt_lock = &p->mm->page_table_lock;
--
2.1.4

2015-06-11 14:09:22

by Ingo Molnar

Subject: [PATCH 09/12] x86/mm/pat/32: Remove pgd_list use from the PAT code

The 32-bit x86 PAT code uses __set_pmd_pte() to update pmds.

This uses pgd_list currently, but we don't need the global
list as we can walk the task list under RCU.

(This code already holds the pgd_lock.)

Cc: Andrew Morton <[email protected]>
Cc: Andy Lutomirski <[email protected]>
Cc: Borislav Petkov <[email protected]>
Cc: Brian Gerst <[email protected]>
Cc: Denys Vlasenko <[email protected]>
Cc: H. Peter Anvin <[email protected]>
Cc: Linus Torvalds <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Cc: Rik van Riel <[email protected]>
Cc: Thomas Gleixner <[email protected]>
Cc: Waiman Long <[email protected]>
Cc: [email protected]
Signed-off-by: Ingo Molnar <[email protected]>
---
arch/x86/mm/pageattr.c | 19 ++++++++++++++++---
1 file changed, 16 insertions(+), 3 deletions(-)

diff --git a/arch/x86/mm/pageattr.c b/arch/x86/mm/pageattr.c
index 997fc97e9072..93c134fdb398 100644
--- a/arch/x86/mm/pageattr.c
+++ b/arch/x86/mm/pageattr.c
@@ -438,18 +438,31 @@ static void __set_pmd_pte(pte_t *kpte, unsigned long address, pte_t pte)
set_pte_atomic(kpte, pte);
#ifdef CONFIG_X86_32
if (!SHARED_KERNEL_PMD) {
- struct page *page;
+ struct task_struct *g, *p;

- list_for_each_entry(page, &pgd_list, lru) {
+ rcu_read_lock();
+
+ for_each_process_thread(g, p) {
+ spinlock_t *pgt_lock;
pgd_t *pgd;
pud_t *pud;
pmd_t *pmd;

- pgd = (pgd_t *)page_address(page) + pgd_index(address);
+ if (!p->mm)
+ continue;
+
+ pgt_lock = &p->mm->page_table_lock;
+ spin_lock(pgt_lock);
+
+ pgd = p->mm->pgd + pgd_index(address);
pud = pud_offset(pgd, address);
pmd = pmd_offset(pud, address);
set_pte_atomic((pte_t *)pmd, pte);
+
+ spin_unlock(pgt_lock);
}
+
+ rcu_read_unlock();
}
#endif
}
--
2.1.4

2015-06-11 14:09:05

by Ingo Molnar

Subject: [PATCH 10/12] x86/mm: Make pgd_alloc()/pgd_free() lockless

The fork()/exit() code uses pgd_alloc()/pgd_free() to allocate/deallocate
the PGD, with platform specific code setting up kernel pagetables.

The x86 code uses a global pgd_list with an associated lock to update
all PGDs of all tasks in the system synchronously.

The lock is still kept to synchronize updates to all PGDs in the system,
but all users of the list have been migrated to use the task list.

So we can remove the pgd_list addition/removal from this code.

The new PGD is private while constructed, so it needs no extra
locking.

Cc: Andrew Morton <[email protected]>
Cc: Andy Lutomirski <[email protected]>
Cc: Borislav Petkov <[email protected]>
Cc: Brian Gerst <[email protected]>
Cc: Denys Vlasenko <[email protected]>
Cc: H. Peter Anvin <[email protected]>
Cc: Linus Torvalds <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Cc: Rik van Riel <[email protected]>
Cc: Thomas Gleixner <[email protected]>
Cc: Waiman Long <[email protected]>
Cc: [email protected]
Signed-off-by: Ingo Molnar <[email protected]>
---
arch/x86/mm/pgtable.c | 27 +++------------------------
1 file changed, 3 insertions(+), 24 deletions(-)

diff --git a/arch/x86/mm/pgtable.c b/arch/x86/mm/pgtable.c
index e0bf90470d70..d855010a111f 100644
--- a/arch/x86/mm/pgtable.c
+++ b/arch/x86/mm/pgtable.c
@@ -125,22 +125,6 @@ static void pgd_ctor(struct mm_struct *mm, pgd_t *pgd)
swapper_pg_dir + KERNEL_PGD_BOUNDARY,
KERNEL_PGD_PTRS);
}
-
- /* list required to sync kernel mapping updates */
- if (!SHARED_KERNEL_PMD) {
- pgd_set_mm(pgd, mm);
- pgd_list_add(pgd);
- }
-}
-
-static void pgd_dtor(pgd_t *pgd)
-{
- if (SHARED_KERNEL_PMD)
- return;
-
- spin_lock(&pgd_lock);
- pgd_list_del(pgd);
- spin_unlock(&pgd_lock);
}

/*
@@ -370,17 +354,13 @@ pgd_t *pgd_alloc(struct mm_struct *mm)
goto out_free_pmds;

/*
- * Make sure that pre-populating the pmds is atomic with
- * respect to anything walking the pgd_list, so that they
- * never see a partially populated pgd.
+ * No locking is needed here, as the PGD is still private,
+ * so no code walking the task list and looking at mm->pgd
+ * will be able to see it before it's fully constructed:
*/
- spin_lock(&pgd_lock);
-
pgd_ctor(mm, pgd);
pgd_prepopulate_pmd(mm, pgd, pmds);

- spin_unlock(&pgd_lock);
-
return pgd;

out_free_pmds:
@@ -451,7 +431,6 @@ void arch_pgd_init_late(struct mm_struct *mm, pgd_t *pgd)
void pgd_free(struct mm_struct *mm, pgd_t *pgd)
{
pgd_mop_up_pmds(mm, pgd);
- pgd_dtor(pgd);
paravirt_pgd_free(mm, pgd);
_pgd_free(pgd);
}
--
2.1.4

2015-06-11 14:08:49

by Ingo Molnar

Subject: [PATCH 11/12] x86/mm: Remove pgd_list leftovers

Nothing uses the pgd_list anymore - remove the list itself and its helpers.

Cc: Andrew Morton <[email protected]>
Cc: Andy Lutomirski <[email protected]>
Cc: Borislav Petkov <[email protected]>
Cc: Brian Gerst <[email protected]>
Cc: Denys Vlasenko <[email protected]>
Cc: H. Peter Anvin <[email protected]>
Cc: Linus Torvalds <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Cc: Rik van Riel <[email protected]>
Cc: Thomas Gleixner <[email protected]>
Cc: Waiman Long <[email protected]>
Cc: [email protected]
Signed-off-by: Ingo Molnar <[email protected]>
---
arch/x86/include/asm/pgtable.h | 3 ---
arch/x86/mm/fault.c | 1 -
arch/x86/mm/pgtable.c | 26 --------------------------
3 files changed, 30 deletions(-)

diff --git a/arch/x86/include/asm/pgtable.h b/arch/x86/include/asm/pgtable.h
index fe57e7a98839..7e93438d8b53 100644
--- a/arch/x86/include/asm/pgtable.h
+++ b/arch/x86/include/asm/pgtable.h
@@ -29,9 +29,6 @@ extern unsigned long empty_zero_page[PAGE_SIZE / sizeof(unsigned long)]
#define ZERO_PAGE(vaddr) (virt_to_page(empty_zero_page))

extern spinlock_t pgd_lock;
-extern struct list_head pgd_list;
-
-extern struct mm_struct *pgd_page_get_mm(struct page *page);

#ifdef CONFIG_PARAVIRT
#include <asm/paravirt.h>
diff --git a/arch/x86/mm/fault.c b/arch/x86/mm/fault.c
index 2d587e548b59..bebdd97f888b 100644
--- a/arch/x86/mm/fault.c
+++ b/arch/x86/mm/fault.c
@@ -186,7 +186,6 @@ force_sig_info_fault(int si_signo, int si_code, unsigned long address,
}

DEFINE_SPINLOCK(pgd_lock);
-LIST_HEAD(pgd_list);

#ifdef CONFIG_X86_32
static inline pmd_t *vmalloc_sync_one(pgd_t *pgd, unsigned long address)
diff --git a/arch/x86/mm/pgtable.c b/arch/x86/mm/pgtable.c
index d855010a111f..0666f7796806 100644
--- a/arch/x86/mm/pgtable.c
+++ b/arch/x86/mm/pgtable.c
@@ -84,35 +84,9 @@ void ___pud_free_tlb(struct mmu_gather *tlb, pud_t *pud)
#endif /* CONFIG_PGTABLE_LEVELS > 3 */
#endif /* CONFIG_PGTABLE_LEVELS > 2 */

-static inline void pgd_list_add(pgd_t *pgd)
-{
- struct page *page = virt_to_page(pgd);
-
- list_add(&page->lru, &pgd_list);
-}
-
-static inline void pgd_list_del(pgd_t *pgd)
-{
- struct page *page = virt_to_page(pgd);
-
- list_del(&page->lru);
-}
-
#define UNSHARED_PTRS_PER_PGD \
(SHARED_KERNEL_PMD ? KERNEL_PGD_BOUNDARY : PTRS_PER_PGD)

-
-static void pgd_set_mm(pgd_t *pgd, struct mm_struct *mm)
-{
- BUILD_BUG_ON(sizeof(virt_to_page(pgd)->index) < sizeof(mm));
- virt_to_page(pgd)->index = (pgoff_t)mm;
-}
-
-struct mm_struct *pgd_page_get_mm(struct page *page)
-{
- return (struct mm_struct *)page->index;
-}
-
static void pgd_ctor(struct mm_struct *mm, pgd_t *pgd)
{
/* If the pgd points to a shared pagetable level (either the
--
2.1.4

2015-06-11 14:08:00

by Ingo Molnar

Subject: [PATCH 12/12] x86/mm: Simplify pgd_alloc()

Right now pgd_alloc() uses pgd_ctor(), which copies over the
current swapper_pg_dir[] to a new task's PGD.

This is not necessary; it's enough if we clear it: the PGD will
then be properly updated by arch_pgd_init_late().

Cc: Andrew Morton <[email protected]>
Cc: Andy Lutomirski <[email protected]>
Cc: Borislav Petkov <[email protected]>
Cc: Brian Gerst <[email protected]>
Cc: Denys Vlasenko <[email protected]>
Cc: H. Peter Anvin <[email protected]>
Cc: Linus Torvalds <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Cc: Rik van Riel <[email protected]>
Cc: Thomas Gleixner <[email protected]>
Cc: Waiman Long <[email protected]>
Cc: [email protected]
Signed-off-by: Ingo Molnar <[email protected]>
---
arch/x86/mm/pgtable.c | 27 +++++++++------------------
1 file changed, 9 insertions(+), 18 deletions(-)

diff --git a/arch/x86/mm/pgtable.c b/arch/x86/mm/pgtable.c
index 0666f7796806..b1bd35f452ef 100644
--- a/arch/x86/mm/pgtable.c
+++ b/arch/x86/mm/pgtable.c
@@ -87,20 +87,6 @@ void ___pud_free_tlb(struct mmu_gather *tlb, pud_t *pud)
#define UNSHARED_PTRS_PER_PGD \
(SHARED_KERNEL_PMD ? KERNEL_PGD_BOUNDARY : PTRS_PER_PGD)

-static void pgd_ctor(struct mm_struct *mm, pgd_t *pgd)
-{
- /* If the pgd points to a shared pagetable level (either the
- ptes in non-PAE, or shared PMD in PAE), then just copy the
- references from swapper_pg_dir. */
- if (CONFIG_PGTABLE_LEVELS == 2 ||
- (CONFIG_PGTABLE_LEVELS == 3 && SHARED_KERNEL_PMD) ||
- CONFIG_PGTABLE_LEVELS == 4) {
- clone_pgd_range(pgd + KERNEL_PGD_BOUNDARY,
- swapper_pg_dir + KERNEL_PGD_BOUNDARY,
- KERNEL_PGD_PTRS);
- }
-}
-
/*
* List of all pgd's needed for non-PAE so it can invalidate entries
* in both cached and uncached pgd's; not needed for PAE since the
@@ -328,11 +314,16 @@ pgd_t *pgd_alloc(struct mm_struct *mm)
goto out_free_pmds;

/*
- * No locking is needed here, as the PGD is still private,
- * so no code walking the task list and looking at mm->pgd
- * will be able to see it before it's fully constructed:
+ * Zero out the kernel portion here, we'll set them up in
+ * arch_pgd_init_late(), when the pgd is globally
+ * visible already per the task list, so that it cannot
+ * miss updates.
+ *
+ * We need to zero it here, to make sure arch_pgd_init_late()
+ * can initialize them without locking.
*/
- pgd_ctor(mm, pgd);
+ memset(pgd + KERNEL_PGD_BOUNDARY, 0, KERNEL_PGD_PTRS*sizeof(pgd_t));
+
pgd_prepopulate_pmd(mm, pgd, pmds);

return pgd;
--
2.1.4

2015-06-11 14:14:00

by Ingo Molnar

Subject: Re: [RFC PATCH 00/12] x86/mm: Implement lockless pgd_alloc()/pgd_free()


[ I fat-fingered the linux-mm Cc:, so every reply will bounce on that,
sorry about that :-/ Fixed it in this mail's Cc: list. ]

* Ingo Molnar <[email protected]> wrote:

> Waiman Long reported 'pgd_lock' contention on high CPU count systems and
> proposed moving pgd_lock to a separate cacheline to eliminate false sharing and
> to reduce some of the lock bouncing overhead.

So 'pgd_lock' is a global lock, used for every new task creation:

arch/x86/mm/fault.c:DEFINE_SPINLOCK(pgd_lock);

which with a sufficiently high CPU count starts to hurt.

Thanks,

Ingo

2015-06-11 15:04:48

by Andy Lutomirski

Subject: Re: [PATCH 08/12] x86/mm: Remove pgd_list use from vmalloc_sync_all()

On Thu, Jun 11, 2015 at 7:07 AM, Ingo Molnar <[email protected]> wrote:
> The vmalloc() code uses vmalloc_sync_all() to synchronize changes to
> the global reference kernel PGD to task PGDs.

Does it? AFAICS the only caller is register_die_notifier, and it's
not really clear to me why that exists.

At some point I'd love to remove lazy kernel PGD sync from the kernel
entirely (or at least from x86) and just do it when we switch mms.
Now that you're removing all code that deletes kernel PGD entries, I
think all we'd need to do is to add a per-PGD or per-mm count of the
number of kernel entries populated and to fix it up when we switch to
an mm with fewer entries populated than init_mm.
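
( Purely as an illustration of that idea - a minimal sketch, where the
  'nr_kernel_pgds' counter is hypothetical and not something in this series: )

	static inline void sync_kernel_pgds_on_switch(struct mm_struct *next)
	{
		/*
		 * Kernel PGD entries are only ever added, never removed, so
		 * if 'next' has fewer populated kernel entries than init_mm,
		 * copy the whole kernel range over and catch up:
		 */
		if (next->nr_kernel_pgds < init_mm.nr_kernel_pgds) {
			clone_pgd_range(next->pgd      + KERNEL_PGD_BOUNDARY,
					swapper_pg_dir + KERNEL_PGD_BOUNDARY,
					KERNEL_PGD_PTRS);
			next->nr_kernel_pgds = init_mm.nr_kernel_pgds;
		}
	}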

--Andy

2015-06-11 15:50:08

by Ingo Molnar

Subject: Re: [PATCH 08/12] x86/mm: Remove pgd_list use from vmalloc_sync_all()


* Andy Lutomirski <[email protected]> wrote:

> On Thu, Jun 11, 2015 at 7:07 AM, Ingo Molnar <[email protected]> wrote:
> > The vmalloc() code uses vmalloc_sync_all() to synchronize changes to
> > the global reference kernel PGD to task PGDs.
>
> Does it? AFAICS the only caller is register_die_notifier, and it's
> not really clear to me why that exists.

Doh, indeed, got confused in that changelog - we are filling it in
opportunistically via vmalloc_fault().

> At some point I'd love to remove lazy kernel PGD sync from the kernel entirely
> (or at least from x86) and just do it when we switch mms. Now that you're
> removing all code that deletes kernel PGD entries, I think all we'd need to do
> is to add a per-PGD or per-mm count of the number of kernel entries populated
> and to fix it up when we switch to an mm with fewer entries populated than
> init_mm.

That would add a (cheap but nonzero) runtime check to every context switch. It's a
relative slow path, but in comparison vmalloc() is an even slower slowpath, so why
not do it there and just do synchronous updates and remove the vmalloc faults
altogether?

Also, on 64-bit it should not matter much: there the only change is the once in a
blue moon case where we allocate a new pgd for a 512 GB block of address space
that a single pgd entry covers.

I'd hate to add a check to every context switch, no matter how cheap, just for a
case that essentially never triggers...

So how about this solution instead:

- we add a generation counter to sync_global_pgds() so that it can detect when
the number of pgds populated in init_mm changes.

- we change vmalloc() to call sync_global_pgds(): this will be very cheap in the
overwhelming majority of cases.

- we eliminate vmalloc_fault(), on 64-bit at least. Yay! :-)
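
( Purely as an illustration of the idea - the names here are made up and the
  locking/ordering details are ignored: )

	static unsigned long pgd_populate_gen;	/* bumped whenever init_mm's PGD grows */
	static unsigned long pgd_synced_gen;	/* last generation vmalloc() synced to */

	/* Called from vmalloc() instead of relying on vmalloc_fault(): */
	static void vmalloc_sync_if_needed(void)
	{
		unsigned long gen = READ_ONCE(pgd_populate_gen);

		if (gen == READ_ONCE(pgd_synced_gen))
			return;

		sync_global_pgds(VMALLOC_START & PGDIR_MASK, VMALLOC_END);
		WRITE_ONCE(pgd_synced_gen, gen);
	}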

Thanks,

Ingo

2015-06-11 16:10:44

by Andy Lutomirski

Subject: Re: [PATCH 08/12] x86/mm: Remove pgd_list use from vmalloc_sync_all()

On Thu, Jun 11, 2015 at 8:49 AM, Ingo Molnar <[email protected]> wrote:
>
> * Andy Lutomirski <[email protected]> wrote:
>
>> On Thu, Jun 11, 2015 at 7:07 AM, Ingo Molnar <[email protected]> wrote:
>> > The vmalloc() code uses vmalloc_sync_all() to synchronize changes to
>> > the global reference kernel PGD to task PGDs.
>>
>> Does it? AFAICS the only caller is register_die_notifier, and it's
>> not really clear to me why that exists.
>
> Doh, indeed, got confused in that changelog - we are filling it in
> opportunistically via vmalloc_fault().
>
>> At some point I'd love to remove lazy kernel PGD sync from the kernel entirely
>> (or at least from x86) and just do it when we switch mms. Now that you're
>> removing all code that deletes kernel PGD entries, I think all we'd need to do
>> is to add a per-PGD or per-mm count of the number of kernel entries populated
>> and to fix it up when we switch to an mm with fewer entries populated than
>> init_mm.
>
> That would add a (cheap but nonzero) runtime check to every context switch. It's a
> relative slow path, but in comparison vmalloc() is an even slower slowpath, so why
> not do it there and just do synchronous updates and remove the vmalloc faults
> altogether?
>
> Also, on 64-bit it should not matter much: there the only change is the once in a
> blue moon case where we allocate a new pgd for a 512 GB block of address space
> that a single pgd entry covers.
>
> I'd hate to add a check to every context switch, no matter how cheap, just for a
> case that essentially never triggers...
>
> So how about this solution instead:
>
> - we add a generation counter to sync_global_pgds() so that it can detect when
> the number of pgds populated in init_mm changes.
>
> - we change vmalloc() to call sync_global_pgds(): this will be very cheap in the
> overwhelming majority of cases.
>
> - we eliminate vmalloc_fault(), on 64-bit at least. Yay! :-)

Should be good. The only real down side is that the update becomes a
walk over all tasks or over all mms, whereas if we checked on context
switches, then the update is just a single PGD entry copy.

That being said, the updates should be *really* rare. I think it's
completely impossible for it to happen more than 255 times per boot.
So I like your solution.

Then I can stop wondering what happens when we manage to take a
vmalloc fault early in entry_SYSCALL_64 or somewhere in the page_fault
prologue.

--Andy

2015-06-11 16:47:57

by H. Peter Anvin

Subject: Re: [RFC PATCH 00/12] x86/mm: Implement lockless pgd_alloc()/pgd_free()

On 06/11/2015 07:07 AM, Ingo Molnar wrote:
> Waiman Long reported 'pgd_lock' contention on high CPU count systems and proposed
> moving pgd_lock to a separate cacheline to eliminate false sharing and to reduce
> some of the lock bouncing overhead.
>
> I think we can do much better: this series eliminates the pgd_list and makes
> pgd_alloc()/pgd_free() lockless.

This ties into a slightly different issue, which is how to deal
*properly* with "special" PGDs like the 1:1 trampoline and the UEFI page
tables. These, too, should be able to incorporate, possibly more than
once, page tables further down.

-hpa

2015-06-11 18:23:53

by Andrew Morton

Subject: Re: [PATCH 05/12] mm: Introduce arch_pgd_init_late()

On Thu, 11 Jun 2015 16:07:10 +0200 Ingo Molnar <[email protected]> wrote:

> --- a/kernel/fork.c
> +++ b/kernel/fork.c
> @@ -1592,6 +1592,22 @@ static struct task_struct *copy_process(unsigned long clone_flags,
> syscall_tracepoint_update(p);
> write_unlock_irq(&tasklist_lock);
>
> + /*
> + * If we have a new PGD then initialize it:
> + *
> + * This method is called after a task has been made visible
> + * on the task list already.
> + *
> + * Architectures that manage per task kernel pagetables
> + * might use this callback to initialize them after they
> + * are already visible to new updates.
> + *
> + * NOTE: any user-space parts of the PGD are already initialized
> + * and must not be clobbered.
> + */
> + if (p->mm != current->mm)
> + arch_pgd_init_late(p->mm, p->mm->pgd);

Couldn't this be arch_pgd_init_late(p->mm) or arch_pgd_init_late(p)?

2015-06-11 20:05:25

by Boris Ostrovsky

Subject: Re: [PATCH 06/12] x86/mm: Enable and use the arch_pgd_init_late() method

On 06/11/2015 10:07 AM, Ingo Molnar wrote:

> diff --git a/arch/x86/mm/pgtable.c b/arch/x86/mm/pgtable.c
> index fb0a9dd1d6e4..e0bf90470d70 100644
> --- a/arch/x86/mm/pgtable.c
> +++ b/arch/x86/mm/pgtable.c
> @@ -391,6 +391,63 @@ pgd_t *pgd_alloc(struct mm_struct *mm)
> return NULL;
> }
>
> +/*
> + * Initialize the kernel portion of the PGD.
> + *
> + * This is done separately, because pgd_alloc() happens when
> + * the task is not on the task list yet - and PGD updates
> + * happen by walking the task list.
> + *
> + * No locking is needed here, as we just copy over the reference
> + * PGD. The reference PGD (pgtable_init) is only ever expanded
> + * at the highest, PGD level. Thus any other task extending it
> + * will first update the reference PGD, then modify the task PGDs.
> + */
> +void arch_pgd_init_late(struct mm_struct *mm, pgd_t *pgd)
> +{
> + /*
> + * This is called after a new MM has been made visible
> + * in fork() or exec().
> + *
> + * This barrier makes sure the MM is visible to new RCU
> + * walkers before we initialize it, so that we don't miss
> + * updates:
> + */
> + smp_wmb();
> +
> + /*
> + * If the pgd points to a shared pagetable level (either the
> + * ptes in non-PAE, or shared PMD in PAE), then just copy the
> + * references from swapper_pg_dir:
> + */
> + if (CONFIG_PGTABLE_LEVELS == 2 ||
> + (CONFIG_PGTABLE_LEVELS == 3 && SHARED_KERNEL_PMD) ||
> + CONFIG_PGTABLE_LEVELS == 4) {
> +
> + pgd_t *pgd_src = swapper_pg_dir + KERNEL_PGD_BOUNDARY;
> + pgd_t *pgd_dst = pgd + KERNEL_PGD_BOUNDARY;
> + int i;
> +
> + for (i = 0; i < KERNEL_PGD_PTRS; i++, pgd_src++, pgd_dst++) {
> + /*
> + * This is lock-less, so it can race with PGD updates
> + * coming from vmalloc() or CPA methods, but it's safe,
> + * because:
> + *
> + * 1) this PGD is not in use yet, we have still not
> + * scheduled this task.
> + * 2) we only ever extend PGD entries
> + *
> + * So if we observe a non-zero PGD entry we can copy it,
> + * it won't change from under us. Parallel updates (new
> + * allocations) will modify our (already visible) PGD:
> + */
> + if (pgd_val(*pgd_src))
> + WRITE_ONCE(*pgd_dst, *pgd_src);


This should be set_pgd(pgd_dst, *pgd_src) in order for it to work as a
Xen PV guest.

I don't know whether anything would need to be done wrt WRITE_ONCE.
Perhaps put it into native_set_pgd()?

-boris

2015-06-11 20:37:56

by Linus Torvalds

Subject: Re: [PATCH 07/12] x86/virt/guest/xen: Remove use of pgd_list from the Xen guest code

On Thu, Jun 11, 2015 at 7:07 AM, Ingo Molnar <[email protected]> wrote:
>
> But we want to remove the pgd_list, so convert the code over to walk
> all tasks in the system. This is an equivalent method.

This is not at all equivalent, and it looks stupid.

Don't use "for_each_process_thread(g, p)". You only care about each
mm, and threads all share the same mm, so just do

for_each_process(p)

instead of iterating over all threads too.

No?

Linus

2015-06-11 20:41:30

by Linus Torvalds

Subject: Re: [PATCH 07/12] x86/virt/guest/xen: Remove use of pgd_list from the Xen guest code

On Thu, Jun 11, 2015 at 1:37 PM, Linus Torvalds
<[email protected]> wrote:
>
> Don't use "for_each_process_thread(g, p)". You only care about each
> mm, and threads all share the same mm, so just do
>
> for_each_process(p)
>
> instead of iterating over all threads too.

Hmm. I may be wrong. It strikes me that one of the group leaders might
have exited but the subthreads live on. We'd see p->mm being NULL,
even though the mm was originally in use.

Ugh. So maybe the code really does need to iterate over all threads.

Linus

2015-06-12 07:23:10

by Ingo Molnar

Subject: Re: [PATCH 07/12] x86/virt/guest/xen: Remove use of pgd_list from the Xen guest code


* Linus Torvalds <[email protected]> wrote:

> On Thu, Jun 11, 2015 at 1:37 PM, Linus Torvalds
> <[email protected]> wrote:
> >
> > Don't use "for_each_process_thread(g, p)". You only care about each mm, and
> > threads all share the same mm, so just do
> >
> > for_each_process(p)
> >
> > instead of iterating over all threads too.
>
> Hmm. I may be wrong. It strikes me that one of the group leaders might have
> exited but the subthreads live on. We'd see p->mm being NULL, even though the mm
> was originally in use.
>
> Ugh. So maybe the code really does need to iterate over all threads.

Yeah, for_each_process() is indeed no guarantee that we iterate over all mm's.

We might make it so: but that would mean restricting certain clone_flags variants
- not sure that's possible with our current ABI usage?

Thanks,

Ingo

2015-06-12 08:04:41

by Ingo Molnar

Subject: Re: [PATCH 07/12] x86/virt/guest/xen: Remove use of pgd_list from the Xen guest code


* Linus Torvalds <[email protected]> wrote:

> On Jun 12, 2015 00:23, "Ingo Molnar" <[email protected]> wrote:
> >
> > We might make it so: but that would mean restricting certain clone_flags
> > variants - not sure that's possible with our current ABI usage?
>
> We already do that. You can't share signal info unless you share the mm. And a
> shared signal state is what defines a thread group.
>
> So I think the only issue is that ->mm can become NULL when the thread group
> leader dies - a non-NULL mm should always be shared among all threads.

Indeed, we do that in exit_mm().

So we could add tsk->mm_leader or so, which does not get cleared and which the
scheduler does not look at, but I'm not sure it's entirely safe that way: we don't
have a refcount, and when the last thread exits it becomes bogus for a small
window until the zombie leader is unlinked from the task list.

To close that race we'd have __mmdrop() or so clear out tsk->mm_leader - but the
task doing the mmdrop() might be a lazy thread totally unrelated to the original
thread group so we don't know which tsk->mm_leader to clear out.

To solve that we'd have to track the leader owning an MM in mm_struct - which gets
interesting for the exec() case where the thread group gets a new leader, so we'd
have to re-link the mm's leader pointer there.

So unless I missed some simpler solution, there are a good number of steps where
this could go wrong, in small-looking race windows - how about we just live with
iterating through all tasks instead of just all processes, once per 512 GB of
memory mapped?

Thanks,

Ingo

2015-06-12 08:16:19

by Ingo Molnar

Subject: Re: [PATCH 05/12] mm: Introduce arch_pgd_init_late()


* Andrew Morton <[email protected]> wrote:

> On Thu, 11 Jun 2015 16:07:10 +0200 Ingo Molnar <[email protected]> wrote:
>
> > --- a/kernel/fork.c
> > +++ b/kernel/fork.c
> > @@ -1592,6 +1592,22 @@ static struct task_struct *copy_process(unsigned long clone_flags,
> > syscall_tracepoint_update(p);
> > write_unlock_irq(&tasklist_lock);
> >
> > + /*
> > + * If we have a new PGD then initialize it:
> > + *
> > + * This method is called after a task has been made visible
> > + * on the task list already.
> > + *
> > + * Architectures that manage per task kernel pagetables
> > + * might use this callback to initialize them after they
> > + * are already visible to new updates.
> > + *
> > + * NOTE: any user-space parts of the PGD are already initialized
> > + * and must not be clobbered.
> > + */
> > + if (p->mm != current->mm)
> > + arch_pgd_init_late(p->mm, p->mm->pgd);
>
> Couldn't this be arch_pgd_init_late(p->mm) or arch_pgd_init_late(p)?

Indeed - fixed.

Thanks,

Ingo

2015-06-12 08:20:01

by Ingo Molnar

Subject: Re: [PATCH 06/12] x86/mm: Enable and use the arch_pgd_init_late() method


* Boris Ostrovsky <[email protected]> wrote:

> On 06/11/2015 10:07 AM, Ingo Molnar wrote:
>
> >diff --git a/arch/x86/mm/pgtable.c b/arch/x86/mm/pgtable.c
> >index fb0a9dd1d6e4..e0bf90470d70 100644
> >--- a/arch/x86/mm/pgtable.c
> >+++ b/arch/x86/mm/pgtable.c
> >@@ -391,6 +391,63 @@ pgd_t *pgd_alloc(struct mm_struct *mm)
> > return NULL;
> > }
> >
> >+/*
> >+ * Initialize the kernel portion of the PGD.
> >+ *
> >+ * This is done separately, because pgd_alloc() happens when
> >+ * the task is not on the task list yet - and PGD updates
> >+ * happen by walking the task list.
> >+ *
> >+ * No locking is needed here, as we just copy over the reference
> >+ * PGD. The reference PGD (pgtable_init) is only ever expanded
> >+ * at the highest, PGD level. Thus any other task extending it
> >+ * will first update the reference PGD, then modify the task PGDs.
> >+ */
> >+void arch_pgd_init_late(struct mm_struct *mm, pgd_t *pgd)
> >+{
> >+ /*
> >+ * This is called after a new MM has been made visible
> >+ * in fork() or exec().
> >+ *
> >+ * This barrier makes sure the MM is visible to new RCU
> >+ * walkers before we initialize it, so that we don't miss
> >+ * updates:
> >+ */
> >+ smp_wmb();
> >+
> >+ /*
> >+ * If the pgd points to a shared pagetable level (either the
> >+ * ptes in non-PAE, or shared PMD in PAE), then just copy the
> >+ * references from swapper_pg_dir:
> >+ */
> >+ if (CONFIG_PGTABLE_LEVELS == 2 ||
> >+ (CONFIG_PGTABLE_LEVELS == 3 && SHARED_KERNEL_PMD) ||
> >+ CONFIG_PGTABLE_LEVELS == 4) {
> >+
> >+ pgd_t *pgd_src = swapper_pg_dir + KERNEL_PGD_BOUNDARY;
> >+ pgd_t *pgd_dst = pgd + KERNEL_PGD_BOUNDARY;
> >+ int i;
> >+
> >+ for (i = 0; i < KERNEL_PGD_PTRS; i++, pgd_src++, pgd_dst++) {
> >+ /*
> >+ * This is lock-less, so it can race with PGD updates
> >+ * coming from vmalloc() or CPA methods, but it's safe,
> >+ * because:
> >+ *
> >+ * 1) this PGD is not in use yet, we have still not
> >+ * scheduled this task.
> >+ * 2) we only ever extend PGD entries
> >+ *
> >+ * So if we observe a non-zero PGD entry we can copy it,
> >+ * it won't change from under us. Parallel updates (new
> >+ * allocations) will modify our (already visible) PGD:
> >+ */
> >+ if (pgd_val(*pgd_src))
> >+ WRITE_ONCE(*pgd_dst, *pgd_src);
>
>
> This should be set_pgd(pgd_dst, *pgd_src) in order for it to work as a Xen
> PV guest.

Thanks, fixed.

> I don't know whether anything would need to be done wrt WRITE_ONCE. Perhaps put
> it into native_set_pgd()?

So this was just write-tearing paranoia at the raw pgd value copy I did which is
an unusual pattern - but I'll use set_pgd(), as it clearly works and has the
paravirt callback as well.

Thanks,

Ingo

2015-06-12 20:39:40

by Oleg Nesterov

Subject: Re: [PATCH 07/12] x86/virt/guest/xen: Remove use of pgd_list from the Xen guest code

On 06/12, Ingo Molnar wrote:
>
> * Linus Torvalds <[email protected]> wrote:
>
> > So I think the only issue is that ->mm can become NULL when the thread group
> > leader dies - a non-NULL mm should always be shared among all threads.
>
> Indeed, we do that in exit_mm().

Yes,

> So we could add tsk->mm_leader or so,

No, no, please do not. Just do something like

for_each_process(p) {

for_each_thread(p, t) {
if (t->mm) {
do_something(t->mm);
break;
}
}
}

But either way I don't understand what protects this ->mm. Perhaps this needs
find_lock_task_mm().
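
For example, just a sketch - assuming find_lock_task_mm() is usable here:

	for_each_process(p) {
		/*
		 * find_lock_task_mm() returns a live thread of 'p' that has a
		 * non-NULL ->mm, with task_lock() held so ->mm can't be cleared
		 * under us - or NULL if the whole thread group has exited:
		 */
		struct task_struct *t = find_lock_task_mm(p);

		if (!t)
			continue;

		do_something(t->mm);
		task_unlock(t);
	}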

Oleg.

2015-06-12 20:54:40

by Oleg Nesterov

Subject: Re: [PATCH 07/12] x86/virt/guest/xen: Remove use of pgd_list from the Xen guest code

On 06/12, Oleg Nesterov wrote:
>
> On 06/12, Ingo Molnar wrote:
> >
> > * Linus Torvalds <[email protected]> wrote:
> >
> > > So I think the only issue is that ->mm can become NULL when the thread group
> > > leader dies - a non-NULL mm should always be shared among all threads.
> >
> > Indeed, we do that in exit_mm().
>
> Yes,
>
> > So we could add tsk->mm_leader or so,
>
> No, no, please do not. Just do something like
>
> for_each_process(p) {
>
> for_each_thread(p, t) {
> if (t->mm) {
> do_something(t->mm);
> break;
> }
> }
> }
>
> But either way I don't understand what protects this ->mm. Perhaps this needs
> find_lock_task_mm().

And, I don't understand this code, probably this doesn't matter, but.

unpin_all() is probably fine, but xen_mm_pin_all() can race with fork()
and miss the new child. Is it OK?

Oleg.

2015-06-12 21:13:27

by Oleg Nesterov

Subject: Re: [PATCH 05/12] mm: Introduce arch_pgd_init_late()

On 06/11, Ingo Molnar wrote:
>
> @@ -1592,6 +1592,22 @@ static struct task_struct *copy_process(unsigned long clone_flags,
> syscall_tracepoint_update(p);
> write_unlock_irq(&tasklist_lock);
>
> + /*
> + * If we have a new PGD then initialize it:
> + *
> + * This method is called after a task has been made visible
> + * on the task list already.
> + *
> + * Architectures that manage per task kernel pagetables
> + * might use this callback to initialize them after they
> + * are already visible to new updates.
> + *
> + * NOTE: any user-space parts of the PGD are already initialized
> + * and must not be clobbered.
> + */
> + if (p->mm != current->mm)
> + arch_pgd_init_late(p->mm, p->mm->pgd);
> +

Cosmetic, but imo

if (!(clone_flags & CLONE_VM))
arch_pgd_init_late(...);

will look better and more consistent.

Oleg.

2015-06-12 22:23:35

by Oleg Nesterov

Subject: Re: [PATCH 02/12] x86/mm/hotplug: Remove pgd_list use from the memory hotplug code

On 06/11, Ingo Molnar wrote:
>
> void sync_global_pgds(unsigned long start, unsigned long end, int removed)
> {
> @@ -169,29 +169,33 @@ void sync_global_pgds(unsigned long start, unsigned long end, int removed)
>
> for (address = start; address <= end; address += PGDIR_SIZE) {
> const pgd_t *pgd_ref = pgd_offset_k(address);
> - struct page *page;
> + struct task_struct *g, *p;
>
> /*
> - * When it is called after memory hot remove, pgd_none()
> - * returns true. In this case (removed == 1), we must clear
> - * the PGD entries in the local PGD level page.
> + * When this function is called after memory hot remove,
> + * pgd_none() already returns true, but only the reference
> + * kernel PGD has been cleared, not the process PGDs.
> + *
> + * So clear the affected entries in every process PGD as well:
> */
> if (pgd_none(*pgd_ref) && !removed)
> continue;
>
> spin_lock(&pgd_lock);
> - list_for_each_entry(page, &pgd_list, lru) {
> - pgd_t *pgd;
> +
> + for_each_process_thread(g, p) {

Well, this looks obviously unsafe without rcu_read_lock(), at least.

The usage of ->mm doesn't look safe either, but this is fixable - see
my previous reply to 7/12.

And probably I am totally confused, but it seems that 06/12 should
come before this patch? Otherwise, what prevents us from racing with
fork() and missing the new process?

Oleg.

2015-06-12 22:29:38

by Oleg Nesterov

[permalink] [raw]
Subject: Re: [PATCH 04/12] x86/mm/hotplug: Simplify sync_global_pgds()

Yes I guess I am totally confused ;)

On 06/11, Ingo Molnar wrote:
>
> @@ -178,7 +178,7 @@ void sync_global_pgds(unsigned long start, unsigned long end, int removed)
> *
> * So clear the affected entries in every process PGD as well:
> */
> - if (pgd_none(*pgd_ref) && !removed)
> + if (pgd_none(*pgd_ref))

But doesn't this change invalidate the comment above?

Oleg.

2015-06-12 22:31:35

by Boris Ostrovsky

[permalink] [raw]
Subject: Re: [PATCH 07/12] x86/virt/guest/xen: Remove use of pgd_list from the Xen guest code

On 06/12/2015 04:53 PM, Oleg Nesterov wrote:
> On 06/12, Oleg Nesterov wrote:
>>
>> On 06/12, Ingo Molnar wrote:
>>>
>>> * Linus Torvalds <[email protected]> wrote:
>>>
>>>> So I think the only issue is that ->mm can become NULL when the thread group
>>>> leader dies - a non-NULL mm should always be shared among all threads.
>>>
>>> Indeed, we do that in exit_mm().
>>
>> Yes,
>>
>>> So we could add tsk->mm_leader or so,
>>
>> No, no, please do not. Just do something like
>>
>> for_each_process(p) {
>>
>> for_each_thread(p, t) {
>> if (t->mm) {
>> do_something(t->mm);
>> break;
>> }
>> }
>> }
>>
>> But either way I don't understand what protects this ->mm. Perhaps this needs
>> find_lock_task_mm().
>
> And, I don't understand this code, probably this doesn't matter, but.
>
> unpin_all() is probably fine, but xen_mm_pin_all() can race with fork()
> and miss the new child. Is it OK?


Currently xen_mm_pin_all() is only called in the suspend path, out of
stop_machine(), so presumably at that time fork is not possible.


-boris

2015-06-12 22:48:32

by Waiman Long

[permalink] [raw]
Subject: Re: [PATCH 02/12] x86/mm/hotplug: Remove pgd_list use from the memory hotplug code

On 06/11/2015 10:07 AM, Ingo Molnar wrote:
> The memory hotplug code uses sync_global_pgds() to synchronize updates
> to the global (&init_mm) kernel PGD and the task PGDs. It does this
> by iterating over the pgd_list - which list closely tracks task
> creation/destruction via fork/clone.
>
> But we want to remove this list, so that it does not have to be
> maintained from fork()/exit(), so convert the memory hotplug code
> to use the task list to iterate over all pgds in the system.
>
> Also improve the comments a bit, to make this function easier
> to understand.
>
> Only lightly tested, as I don't have a memory hotplug setup.
>
> Cc: Andrew Morton <[email protected]>
> Cc: Andy Lutomirski <[email protected]>
> Cc: Borislav Petkov <[email protected]>
> Cc: Brian Gerst <[email protected]>
> Cc: Denys Vlasenko <[email protected]>
> Cc: H. Peter Anvin <[email protected]>
> Cc: Linus Torvalds <[email protected]>
> Cc: Peter Zijlstra <[email protected]>
> Cc: Thomas Gleixner <[email protected]>
> Cc: Waiman Long <[email protected]>
> Signed-off-by: Ingo Molnar <[email protected]>
> ---
> arch/x86/mm/init_64.c | 30 +++++++++++++++++-------------
> 1 file changed, 17 insertions(+), 13 deletions(-)
>
> diff --git a/arch/x86/mm/init_64.c b/arch/x86/mm/init_64.c
> index 3fba623e3ba5..1921acbd49fd 100644
> --- a/arch/x86/mm/init_64.c
> +++ b/arch/x86/mm/init_64.c
> @@ -160,8 +160,8 @@ static int __init nonx32_setup(char *str)
> __setup("noexec32=", nonx32_setup);
>
> /*
> - * When memory was added/removed make sure all the processes MM have
> - * suitable PGD entries in the local PGD level page.
> + * When memory was added/removed make sure all the process MMs have
> + * matching PGD entries in the local PGD level page as well.
> */
> void sync_global_pgds(unsigned long start, unsigned long end, int removed)
> {
> @@ -169,29 +169,33 @@ void sync_global_pgds(unsigned long start, unsigned long end, int removed)
>
> for (address = start; address <= end; address += PGDIR_SIZE) {
> const pgd_t *pgd_ref = pgd_offset_k(address);
> - struct page *page;
> + struct task_struct *g, *p;
>
> /*
> - * When it is called after memory hot remove, pgd_none()
> - * returns true. In this case (removed == 1), we must clear
> - * the PGD entries in the local PGD level page.
> + * When this function is called after memory hot remove,
> + * pgd_none() already returns true, but only the reference
> + * kernel PGD has been cleared, not the process PGDs.
> + *
> + * So clear the affected entries in every process PGD as well:
> */
> if (pgd_none(*pgd_ref) && !removed)
> continue;
>
> spin_lock(&pgd_lock);
> - list_for_each_entry(page, &pgd_list, lru) {
> - pgd_t *pgd;
> +
> + for_each_process_thread(g, p) {
> + pgd_t *pgd = p->mm->pgd;
> spinlock_t *pgt_lock;
>
> - pgd = (pgd_t *)page_address(page) + pgd_index(address);
> - /* the pgt_lock only for Xen */
> - pgt_lock = &pgd_page_get_mm(page)->page_table_lock;
> + if (!p->mm)
> + continue;

pgd was initialized to p->mm->pgd before the "p->mm" check is done.
Shouldn't the initialization be moved after that?

Cheers,
Longman

2015-06-12 22:51:09

by Oleg Nesterov

[permalink] [raw]
Subject: Re: [PATCH 06/12] x86/mm: Enable and use the arch_pgd_init_late() method

On 06/11, Ingo Molnar wrote:
>
> +void arch_pgd_init_late(struct mm_struct *mm, pgd_t *pgd)
> +{
> + /*
> + * This is called after a new MM has been made visible
> + * in fork() or exec().
> + *
> + * This barrier makes sure the MM is visible to new RCU
> + * walkers before we initialize it, so that we don't miss
> + * updates:
> + */
> + smp_wmb();

I can't understand the comment and the barrier...

Afaics, we need to ensure that:

> + if (pgd_val(*pgd_src))
> + WRITE_ONCE(*pgd_dst, *pgd_src);

either we notice the recent update of this PGD, or (say) the subsequent
sync_global_pgds() can miss the child.

How the write barrier can help?

Oleg.

2015-06-12 22:58:17

by Oleg Nesterov

[permalink] [raw]
Subject: Re: [PATCH 06/12] x86/mm: Enable and use the arch_pgd_init_late() method

On 06/13, Oleg Nesterov wrote:
>
> Afaics, we need to ensure that:
>
> > + if (pgd_val(*pgd_src))
> > + WRITE_ONCE(*pgd_dst, *pgd_src);
>
> either we notice the recent update of this PGD, or (say) the subsequent
> sync_global_pgds() can miss the child.

and perhaps !pgd_none(pgd_src) will look better.

Oleg.

2015-06-12 23:16:25

by Oleg Nesterov

[permalink] [raw]
Subject: Re: [PATCH 02/12] x86/mm/hotplug: Remove pgd_list use from the memory hotplug code

On 06/11, Ingo Molnar wrote:
>
> void sync_global_pgds(unsigned long start, unsigned long end, int removed)
> {
> @@ -169,29 +169,33 @@ void sync_global_pgds(unsigned long start, unsigned long end, int removed)
>
> for (address = start; address <= end; address += PGDIR_SIZE) {
> const pgd_t *pgd_ref = pgd_offset_k(address);
> - struct page *page;
> + struct task_struct *g, *p;
>
> /*
> - * When it is called after memory hot remove, pgd_none()
> - * returns true. In this case (removed == 1), we must clear
> - * the PGD entries in the local PGD level page.
> + * When this function is called after memory hot remove,
> + * pgd_none() already returns true, but only the reference
> + * kernel PGD has been cleared, not the process PGDs.
> + *
> + * So clear the affected entries in every process PGD as well:
> */
> if (pgd_none(*pgd_ref) && !removed)
> continue;
>
> spin_lock(&pgd_lock);
> - list_for_each_entry(page, &pgd_list, lru) {
> - pgd_t *pgd;
> +
> + for_each_process_thread(g, p) {
> + pgd_t *pgd = p->mm->pgd;

and we use the same pgd later: set_pgd(pgd, *pgd_ref).
looks like this should be pgd_offset(p->mm, address) ?

And obviously you need to check ->mm != NULL first?

And perhaps it makes sense to swap "for (address)" and for_each_process()
loops...
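
Something like this for the loop body, I mean - a rough sketch only, with
the ->mm check and pgd_offset() fixes, just mirroring the existing
sync_global_pgds() logic; not a patch and not tested:

	rcu_read_lock();
	for_each_process_thread(g, p) {
		pgd_t *pgd;
		spinlock_t *pgt_lock;

		if (!p->mm)
			continue;

		/* Index this task's PGD at 'address': */
		pgd = pgd_offset(p->mm, address);

		/* The pgt_lock is only needed for Xen: */
		pgt_lock = &p->mm->page_table_lock;
		spin_lock(pgt_lock);

		if (!pgd_none(*pgd_ref) && !pgd_none(*pgd))
			BUG_ON(pgd_page_vaddr(*pgd) != pgd_page_vaddr(*pgd_ref));

		if (removed) {
			if (pgd_none(*pgd_ref) && !pgd_none(*pgd))
				pgd_clear(pgd);
		} else {
			if (pgd_none(*pgd))
				set_pgd(pgd, *pgd_ref);
		}

		spin_unlock(pgt_lock);
	}
	rcu_read_unlock();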

Oleg.

2015-06-12 23:37:27

by Oleg Nesterov

[permalink] [raw]
Subject: Re: [PATCH 07/12] x86/virt/guest/xen: Remove use of pgd_list from the Xen guest code

On 06/12, Boris Ostrovsky wrote:
>
> On 06/12/2015 04:53 PM, Oleg Nesterov wrote:
>>>
>>> for_each_process(p) {
>>>
>>> for_each_thread(p, t) {
>>> if (t->mm) {
>>> do_something(t->mm);
>>> break;
>>> }
>>> }
>>> }
>>>
>>> But either way I don't understand what protects this ->mm. Perhaps this needs
>>> find_lock_task_mm().
>>
>> And, I don't understand this code, probably this doesn't matter, but.
>>
>> unpin_all() is probably fine, but xen_mm_pin_all() can race with fork()
>> and miss the new child. Is it OK?
>
>
> Currently xen_mm_pin_all() is only called in the suspend path, out of
> stop_machine(), so presumably at that time fork is not possible.

OK, thanks, this also means that this code shouldn't worry about ->mm,
it should be stable.

But for_each_process() in sync_global_pgds() should, afaics.

Oleg.

2015-06-13 06:47:18

by Ingo Molnar

[permalink] [raw]
Subject: Re: [PATCH 06/12] x86/mm: Enable and use the arch_pgd_init_late() method


* Oleg Nesterov <[email protected]> wrote:

> On 06/11, Ingo Molnar wrote:
> >
> > +void arch_pgd_init_late(struct mm_struct *mm, pgd_t *pgd)
> > +{
> > + /*
> > + * This is called after a new MM has been made visible
> > + * in fork() or exec().
> > + *
> > + * This barrier makes sure the MM is visible to new RCU
> > + * walkers before we initialize it, so that we don't miss
> > + * updates:
> > + */
> > + smp_wmb();
>
> I can't understand the comment and the barrier...
>
> Afaics, we need to ensure that:
>
> > + if (pgd_val(*pgd_src))
> > + WRITE_ONCE(*pgd_dst, *pgd_src);
>
> either we notice the recent update of this PGD, or (say) the subsequent
> sync_global_pgds() can miss the child.
>
> How the write barrier can help?

So the real thing this pairs with is the earlier:

tsk->mm = mm;

plus the linking of the new task in the task list.

_that_ write must become visible to others before we do the (conditional) copy
ourselves.

Granted, it happens quite a bit earlier, and the task linking's use of locking is
a natural barrier - but since this is lockless I didn't want to leave a silent
assumption in.

Perhaps remove the barrier and just leave a comment there that describes the
assumption that task-linking is a full barrier?

Thanks,

Ingo

2015-06-13 06:51:42

by Ingo Molnar

[permalink] [raw]
Subject: Re: [PATCH 06/12] x86/mm: Enable and use the arch_pgd_init_late() method


* Oleg Nesterov <[email protected]> wrote:

> On 06/13, Oleg Nesterov wrote:
> >
> > Afaics, we need to ensure that:
> >
> > > + if (pgd_val(*pgd_src))
> > > + WRITE_ONCE(*pgd_dst, *pgd_src);
> >
> > either we notice the recent update of this PGD, or (say) the subsequent
> > sync_global_pgds() can miss the child.
>
> and perhaps !pgd_none(pgd_src) will look better.

Indeed - fixed.

Thanks,

Ingo

2015-06-13 06:53:06

by Ingo Molnar

[permalink] [raw]
Subject: Re: [PATCH 06/12] x86/mm: Enable and use the arch_pgd_init_late() method


* Ingo Molnar <[email protected]> wrote:

> * Oleg Nesterov <[email protected]> wrote:
>
> > On 06/11, Ingo Molnar wrote:
> > >
> > > +void arch_pgd_init_late(struct mm_struct *mm, pgd_t *pgd)
> > > +{
> > > + /*
> > > + * This is called after a new MM has been made visible
> > > + * in fork() or exec().
> > > + *
> > > + * This barrier makes sure the MM is visible to new RCU
> > > + * walkers before we initialize it, so that we don't miss
> > > + * updates:
> > > + */
> > > + smp_wmb();
> >
> > I can't understand the comment and the barrier...
> >
> > Afaics, we need to ensure that:
> >
> > > + if (pgd_val(*pgd_src))
> > > + WRITE_ONCE(*pgd_dst, *pgd_src);
> >
> > either we notice the recent update of this PGD, or (say) the subsequent
> > sync_global_pgds() can miss the child.
> >
> > How the write barrier can help?
>
> So the real thing this pairs with is the earlier:
>
> tsk->mm = mm;
>
> plus the linking of the new task in the task list.
>
> _that_ write must become visible to others before we do the (conditional) copy
> ourselves.
>
> Granted, it happens quite a bit earlier, and the task linking's use of locking
> is a natural barrier - but since this is lockless I didn't want to leave a
> silent assumption in.
>
> Perhaps remove the barrier and just leave a comment there that describes the
> assumption that task-linking is a full barrier?

Ah, there's another detail I forgot. This might handle the fork case, but in
exec() we have:

tsk->mm = mm;
arch_pgd_init_late(mm);

and since the task is already linked, here we need the barrier.

So how about I improve the comment to:

/*
* This function is called after a new MM has been made visible
* in fork() or exec() via:
*
* tsk->mm = mm;
*
* This barrier makes sure the MM is visible to new RCU
* walkers before we initialize the pagetables below, so that
* we don't miss updates:
*/
smp_wmb();

and leave the barrier there?

Thanks,

Ingo

2015-06-13 06:54:37

by Ingo Molnar

[permalink] [raw]
Subject: Re: [PATCH 05/12] mm: Introduce arch_pgd_init_late()


* Oleg Nesterov <[email protected]> wrote:

> On 06/11, Ingo Molnar wrote:
> >
> > @@ -1592,6 +1592,22 @@ static struct task_struct *copy_process(unsigned long clone_flags,
> > syscall_tracepoint_update(p);
> > write_unlock_irq(&tasklist_lock);
> >
> > + /*
> > + * If we have a new PGD then initialize it:
> > + *
> > + * This method is called after a task has been made visible
> > + * on the task list already.
> > + *
> > + * Architectures that manage per task kernel pagetables
> > + * might use this callback to initialize them after they
> > + * are already visible to new updates.
> > + *
> > + * NOTE: any user-space parts of the PGD are already initialized
> > + * and must not be clobbered.
> > + */
> > + if (p->mm != current->mm)
> > + arch_pgd_init_late(p->mm, p->mm->pgd);
> > +
>
> Cosmetic, but imo
>
> if (!(clone_flags & CLONE_VM))
> arch_pgd_init_late(...);
>
> will look better and more consistent.

Indeed - fixed.

Thanks,

Ingo

2015-06-13 07:26:59

by Ingo Molnar

[permalink] [raw]
Subject: Re: [PATCH 07/12] x86/virt/guest/xen: Remove use of pgd_list from the Xen guest code


* Oleg Nesterov <[email protected]> wrote:

> > So we could add tsk->mm_leader or so,
>
> No, no, please do not. Just do something like
>
> for_each_process(p) {
>
> for_each_thread(p, t) {

So far that's what the for_each_process_thread() iterations I added do, right?

> if (t->mm) {
> do_something(t->mm);
> break;
> }
> }
> }
>
> But either way I don't understand what protects this ->mm. Perhaps this needs
> find_lock_task_mm().

That's indeed a bug: I'll add task_lock()/unlock() before looking at ->mm.

find_lock_task_mm() is not needed IMHO: we have a stable reference to 't', as a
task can only go away via RCU expiry, and all the iterations I added are (supposed
to) be RCU safe.
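
I.e. roughly this shape - a sketch only; do_something() is a placeholder for
the per-mm work done at each call site, as in your example:

	rcu_read_lock();
	for_each_process_thread(g, p) {
		struct mm_struct *mm;

		task_lock(p);
		mm = p->mm;
		if (mm)
			do_something(mm);
		task_unlock(p);
	}
	rcu_read_unlock();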

Thanks,

Ingo

2015-06-13 07:47:09

by Ingo Molnar

[permalink] [raw]
Subject: Re: [PATCH 02/12] x86/mm/hotplug: Remove pgd_list use from the memory hotplug code


* Waiman Long <[email protected]> wrote:

> >@@ -169,29 +169,33 @@ void sync_global_pgds(unsigned long start, unsigned long end, int removed)
> >
> > for (address = start; address <= end; address += PGDIR_SIZE) {
> > const pgd_t *pgd_ref = pgd_offset_k(address);
> >- struct page *page;
> >+ struct task_struct *g, *p;
> >
> > /*
> >- * When it is called after memory hot remove, pgd_none()
> >- * returns true. In this case (removed == 1), we must clear
> >- * the PGD entries in the local PGD level page.
> >+ * When this function is called after memory hot remove,
> >+ * pgd_none() already returns true, but only the reference
> >+ * kernel PGD has been cleared, not the process PGDs.
> >+ *
> >+ * So clear the affected entries in every process PGD as well:
> > */
> > if (pgd_none(*pgd_ref) && !removed)
> > continue;
> >
> > spin_lock(&pgd_lock);
> >- list_for_each_entry(page, &pgd_list, lru) {
> >- pgd_t *pgd;
> >+
> >+ for_each_process_thread(g, p) {
> >+ pgd_t *pgd = p->mm->pgd;
> > spinlock_t *pgt_lock;
> >
> >- pgd = (pgd_t *)page_address(page) + pgd_index(address);
> >- /* the pgt_lock only for Xen */
> >- pgt_lock = &pgd_page_get_mm(page)->page_table_lock;
> >+ if (!p->mm)
> >+ continue;
>
> pgd was initialized to p->mm->pgd before the "p->mm" check is done.
> Shouldn't the initialization be moved after that?

Yes, already found this bug in testing and fixed it - will send out a new series
with all the feedback so far addressed.

Thanks,

Ingo

2015-06-13 07:48:58

by Ingo Molnar

[permalink] [raw]
Subject: Re: [PATCH 02/12] x86/mm/hotplug: Remove pgd_list use from the memory hotplug code


* Oleg Nesterov <[email protected]> wrote:

> On 06/11, Ingo Molnar wrote:
> >
> > void sync_global_pgds(unsigned long start, unsigned long end, int removed)
> > {
> > @@ -169,29 +169,33 @@ void sync_global_pgds(unsigned long start, unsigned long end, int removed)
> >
> > for (address = start; address <= end; address += PGDIR_SIZE) {
> > const pgd_t *pgd_ref = pgd_offset_k(address);
> > - struct page *page;
> > + struct task_struct *g, *p;
> >
> > /*
> > - * When it is called after memory hot remove, pgd_none()
> > - * returns true. In this case (removed == 1), we must clear
> > - * the PGD entries in the local PGD level page.
> > + * When this function is called after memory hot remove,
> > + * pgd_none() already returns true, but only the reference
> > + * kernel PGD has been cleared, not the process PGDs.
> > + *
> > + * So clear the affected entries in every process PGD as well:
> > */
> > if (pgd_none(*pgd_ref) && !removed)
> > continue;
> >
> > spin_lock(&pgd_lock);
> > - list_for_each_entry(page, &pgd_list, lru) {
> > - pgd_t *pgd;
> > +
> > + for_each_process_thread(g, p) {
> > + pgd_t *pgd = p->mm->pgd;
>
> and we use the same pgd later: set_pgd(pgd, *pgd_ref).
> looks like this should be pgd_offset(p->mm, address) ?
>
> And obviously you need to check ->mm != NULL first?

Yes.

> And perhaps it makes sense to swap "for (address)" and for_each_process()
> loops...

Yes, but I'd flip around the logic in a later, separate patch.

Note that PGDIR_SIZE is 512 GB here, so the outer loop will most likely execute at
most twice in practice.

Thanks,

Ingo

2015-06-13 07:51:13

by Ingo Molnar

[permalink] [raw]
Subject: Re: [PATCH 04/12] x86/mm/hotplug: Simplify sync_global_pgds()


* Oleg Nesterov <[email protected]> wrote:

> Yes I guess I am totally confused ;)
>
> On 06/11, Ingo Molnar wrote:
> >
> > @@ -178,7 +178,7 @@ void sync_global_pgds(unsigned long start, unsigned long end, int removed)
> > *
> > * So clear the affected entries in every process PGD as well:
> > */
> > - if (pgd_none(*pgd_ref) && !removed)
> > + if (pgd_none(*pgd_ref))
>
> But doesn't this change invalidate the comment above?

Indeed - I fixed the comment to now say:

/* Only sync (potentially) newly added PGD entries: */
if (pgd_none(*pgd_ref))
continue;

Thanks,

Ingo

2015-06-13 17:46:42

by Oleg Nesterov

[permalink] [raw]
Subject: Re: [PATCH 06/12] x86/mm: Enable and use the arch_pgd_init_late() method

On 06/13, Ingo Molnar wrote:
>
> * Ingo Molnar <[email protected]> wrote:
>
> > * Oleg Nesterov <[email protected]> wrote:
> >
> > >
> > > Afaics, we need to ensure that:
> > >
> > > > + if (pgd_val(*pgd_src))
> > > > + WRITE_ONCE(*pgd_dst, *pgd_src);
> > >
> > > either we notice the recent update of this PGD, or (say) the subsequent
> > > sync_global_pgds() can miss the child.
> > >
> > > How the write barrier can help?
> >
> > So the real thing this pairs with is the earlier:
> >
> > tsk->mm = mm;
> >
> > plus the linking of the new task in the task list.
> >
> > _that_ write must become visible to others before we do the (conditional) copy
> > ourselves.

Hmm. that write must be visible before we start to _read_ *pgd_src,
afaics.

> > Granted, it happens quite a bit earlier, and the task linking's use of locking
> > is a natural barrier - but since this is lockless I didn't want to leave a
> > silent assumption in.

I agree,

> Ah, there's another detail I forgot. This might handle the fork case, but in
> exec() we have:
>
> tsk->mm = mm;
> arch_pgd_init_late(mm);

Yes, this too.

But wmb() can't help. At least we need the full mb() to serialize the
STORE above (or list addition in copy_process) with the LOAD which
reads *pgd_src.

Plus we need another mb() in sync_global_pgds(), say, before the main
for_each_process() loop.
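
IOW, the two sides would pair up roughly like this - just a sketch, and
'new_entry' / sync_one_task_pgd() are placeholders:

	// fork()/exec() side, in arch_pgd_init_late():

	tsk->mm = mm;				// or: task-list insertion in copy_process()
	smp_mb();				// pairs with the mb() in sync_global_pgds()
	if (!pgd_none(*pgd_src))		// read the reference kernel PGD entry
		set_pgd(pgd_dst, *pgd_src);

	// memory-hotplug side:

	set_pgd(pgd_offset_k(address), new_entry);	// done by the caller of sync_global_pgds()
	smp_mb();					// in sync_global_pgds(), pairs with the mb() above
	for_each_process_thread(g, p)			// read the task list and fix up the task PGDs
		sync_one_task_pgd(p);

With full barriers on both sides either arch_pgd_init_late() sees the new
kernel PGD entry, or sync_global_pgds() sees the new task.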


it would be nice to make this more simple/clear somehow...

Oleg.

2015-06-13 18:02:08

by Oleg Nesterov

[permalink] [raw]
Subject: Re: [PATCH 07/12] x86/virt/guest/xen: Remove use of pgd_list from the Xen guest code

On 06/13, Ingo Molnar wrote:
>
> * Oleg Nesterov <[email protected]> wrote:
>
> > > So we could add tsk->mm_leader or so,
> >
> > No, no, please do not. Just do something like
> >
> > for_each_process(p) {
> >
> > for_each_thread(p, t) {
>
> So far that's what the for_each_process_thread() iterations I added do, right?

Not really,

> > if (t->mm) {
> > do_something(t->mm);
> > break;
^^^^^

Note this "break". We stop the inner loop right after we find a thread "t"
with ->mm != NULL. In the likely case t == p (group leader) so the inner
loop stops on the 1st iteration.


> > But either way I don't understand what protects this ->mm. Perhaps this needs
> > find_lock_task_mm().
>
> That's indeed a bug: I'll add task_lock()/unlock() before looking at ->mm.

Well, in this particular case we are safe. As Boris explained this is called
from stop_machine(). But sync_global_pgds() is not safe.

> find_lock_task_mm() is not needed IMHO: we have a stable reference to 't', as
> task can only go away via RCU expiry, and all the iterations I added are (supposed
> to) be RCU safe.

Sure, you can do lock/unlock by hand. But find_lock_task_mm() can simplify
the code because it checks subthreads if group_leader->mm == NULL. You can
simply do

rcu_read_lock();
for_each_process(p) {
t = find_lock_task_mm(p);
if (!t)
continue;

do_something(t->mm);
task_unlock(t);
}
rcu_read_unlock();

Oleg.

2015-06-14 08:14:10

by Ingo Molnar

[permalink] [raw]
Subject: Re: [PATCH 06/12] x86/mm: Enable and use the arch_pgd_init_late() method


* Oleg Nesterov <[email protected]> wrote:

> On 06/13, Ingo Molnar wrote:
> >
> > * Ingo Molnar <[email protected]> wrote:
> >
> > > * Oleg Nesterov <[email protected]> wrote:
> > >
> > > >
> > > > Afaics, we need to ensure that:
> > > >
> > > > > + if (pgd_val(*pgd_src))
> > > > > + WRITE_ONCE(*pgd_dst, *pgd_src);
> > > >
> > > > either we notice the recent update of this PGD, or (say) the subsequent
> > > > sync_global_pgds() can miss the child.
> > > >
> > > > How the write barrier can help?
> > >
> > > So the real thing this pairs with is the earlier:
> > >
> > > tsk->mm = mm;
> > >
> > > plus the linking of the new task in the task list.
> > >
> > > _that_ write must become visible to others before we do the (conditional) copy
> > > ourselves.
>
> Hmm. that write must be visible before we start to _read_ *pgd_src,
> afaics.
>
> > > Granted, it happens quite a bit earlier, and the task linking's use of locking
> > > is a natural barrier - but since this is lockless I didn't want to leave a
> > > silent assumption in.
>
> I agree,
>
> > Ah, there's another detail I forgot. This might handle the fork case, but in
> > exec() we have:
> >
> > tsk->mm = mm;
> > arch_pgd_init_late(mm);
>
> Yes, this too.
>
> But wmb() can't help. At least we need the full mb() to serialize the
> STORE above (or list addition in copy_process) with the LOAD which
> reads *pgd_src.

True.

> Plus we need another mb() in sync_global_pgds(), say, before the main
> for_each_process() loop.

True.

So since we have a spin_lock() there already, I tried to find the right primitive
to turn it into a full memory barrier but gave up. Is there a simple primitive for
that?

Also, since this is x86 specific code we could rely on the fact that
spinlock-acquire is a full memory barrier?

>
> it would be nice to make this more simple/clear somehow...

I'm afraid lockless is rarely simple, but I'm open to suggestions ...

Thanks,

Ingo

2015-06-14 20:55:26

by Oleg Nesterov

[permalink] [raw]
Subject: Re: [PATCH 06/12] x86/mm: Enable and use the arch_pgd_init_late() method

On 06/14, Ingo Molnar wrote:
>
> So since we have a spin_lock() there already,

Yeeeees, I thought about task_lock() or pgd_lock too.

> Also, since this is x86 specific code we could rely on the fact that
> spinlock-acquire is a full memory barrier?

We do not really need the full barrier if we rely on spinlock_t;
we can rely on acquire+release semantics.

Let's forget about exec_mmap(). If we add, say,

// or unlock_wait() + barriers
task_lock(current->group_leader);
task_unlock(current->group_leader);

at the start of arch_pgd_init_late(), we will fix the problems with
fork() even if the pgd_none() check below can leak into the critical section.

We rely on the fact that find_lock_task_mm() does lock/unlock too
and always starts with the group leader.

If sync_global_pgds() takes this lock first, we must see the change
in *PGD after task_unlock(). Actually right after task_lock().

Otherwise, sync_global_pgds() should see the result of the list addition
if it takes this (the same) ->group_leader->alloc_lock after us.

But this is not nice, and exec_mmap() calls arch_pgd_init_late() under
task_lock().


So, unless you are going to remove pgd_lock altogether perhaps we can
rely on it the same way

mb();
spin_unlock_wait(&pgd_lock);
rmb();


This avoids the barriers (and comments) on the other side, but I can't say
I really like this...


So I won't argue with 2 mb's on both sides.

Oleg.

2015-06-14 22:12:06

by Oleg Nesterov

[permalink] [raw]
Subject: Re: [PATCH 06/12] x86/mm: Enable and use the arch_pgd_init_late() method

On 06/14, Oleg Nesterov wrote:
>
> So, unless you are going to remove pgd_lock altogether perhaps we can
> rely on it the same way
>
> mb();
> spin_unlock_wait(&pgd_lock);
> rmb();
>
>
> This avoids the barriers (and comments) on the other side, but I can't say
> I really like this...
>
>
> So I won't argue with 2 mb's on both sides.

Or we can add

// A new child created before can miss the PGD updates,
// but we must see that child on the process list

read_lock(&tasklist_lock);
read_unlock(&tasklist_lock);

// We can miss a new child forked after read_unlock(),
// but then its parent must see all PGD updates right
// after it does write_unlock(tasklist);

for_each_process(p) {

before the main for_each_process() loop in sync_global_pgds().

As for exec_mmap(), we can rely on task_lock(); sync_global_pgds()
takes it too. The corner case is when exec changes the leader, so
exec_mmap/sync_global_pgds can take different locks. But in this
case we can rely on de_thread() (which takes tasklist_lock for write)
for the same reason: either sync_global_pgds() will see the new
leader, or the new leader must see the updates.

Oleg.