2012-10-18 21:21:17

by Rik van Riel

Subject: [PATCH 0/2] minor NUMA cleanups & documentation

Hi Ingo,

Here are some minor NUMA cleanups to start with.

I have some ideas for larger improvements, and things to port over
from autonuma, but I got caught up in some of the code and am
not sure about those changes yet.

--
All Rights Reversed


2012-10-18 21:21:15

by Rik van Riel

Subject: [PATCH 1/2] add credits for NUMA placement

The NUMA placement code has been rewritten several times, but
the basic ideas took a lot of work to develop. The people who
put in the work deserve credit for it. Thanks Andrea & Peter :)

The Documentation/scheduler/numa-problem.txt file should
probably be rewritten once we figure out the final details of
what the NUMA code needs to do, and why.

Signed-off-by: Rik van Riel <[email protected]>

----
This is against tip.git numa/core

diff --git a/CREDITS b/CREDITS
index d8fe12a..b4cdc8f 100644
--- a/CREDITS
+++ b/CREDITS
@@ -125,6 +125,7 @@ D: Author of pscan that helps to fix lp/parport bugs
D: Author of lil (Linux Interrupt Latency benchmark)
D: Fixed the shm swap deallocation at swapoff time (try_to_unuse message)
D: VM hacker
+D: NUMA task placement
D: Various other kernel hacks
S: Imola 40026
S: Italy
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 1e24aa1..e93032d 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -18,6 +18,9 @@
*
* Adaptive scheduling granularity, math enhancements by Peter Zijlstra
* Copyright (C) 2007 Red Hat, Inc., Peter Zijlstra <[email protected]>
+ *
+ * NUMA placement, statistics and algorithm by Andrea Arcangeli,
+ * CFS balancing changes by Peter Zijlstra. Copyright (C) 2012 Red Hat, Inc.
*/

#include <linux/latencytop.h>
diff --git a/mm/memory.c b/mm/memory.c
index fc48fe8..9e56a44 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -36,6 +36,8 @@
* ([email protected])
*
* Aug/Sep 2004 Changed to four level page tables (Andi Kleen)
+ *
+ * 2012 - NUMA placement page faults (Andrea Arcangeli, Peter Zijlstra)
*/

#include <linux/kernel_stat.h>

2012-10-18 21:21:14

by Rik van Riel

Subject: [PATCH 2/2] rename NUMA fault handling functions

Having the function name indicate what the function is used
for makes the code a little easier to read. Furthermore,
the fault handling code largely consists of do_...._page
functions.

Rename the NUMA fault handling functions to indicate what
they are used for.

Signed-off-by: Rik van Riel <[email protected]>
---
Against tip.git numa/core

include/linux/huge_mm.h | 8 ++++----
mm/huge_memory.c | 4 ++--
mm/memory.c | 18 ++++++++++--------
3 files changed, 16 insertions(+), 14 deletions(-)

diff --git a/include/linux/huge_mm.h b/include/linux/huge_mm.h
index ed60d79..9580e22 100644
--- a/include/linux/huge_mm.h
+++ b/include/linux/huge_mm.h
@@ -161,9 +161,9 @@ static inline struct page *compound_trans_head(struct page *page)
return page;
}

-extern bool pmd_prot_none(struct vm_area_struct *vma, pmd_t pmd);
+extern bool pmd_numa(struct vm_area_struct *vma, pmd_t pmd);

-extern void do_huge_pmd_prot_none(struct mm_struct *mm, struct vm_area_struct *vma,
+extern void do_huge_pmd_numa_page(struct mm_struct *mm, struct vm_area_struct *vma,
unsigned long address, pmd_t *pmd,
unsigned int flags, pmd_t orig_pmd);

@@ -204,12 +204,12 @@ static inline int pmd_trans_huge_lock(pmd_t *pmd,
return 0;
}

-static inline bool pmd_prot_none(struct vm_area_struct *vma, pmd_t pmd)
+static inline bool pmd_numa(struct vm_area_struct *vma, pmd_t pmd)
{
return false;
}

-static inline void do_huge_pmd_prot_none(struct mm_struct *mm, struct vm_area_struct *vma,
+static inline void do_huge_pmd_numa_page(struct mm_struct *mm, struct vm_area_struct *vma,
unsigned long address, pmd_t *pmd,
unsigned int flags, pmd_t orig_pmd)
{
diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index 5afd0d7..c25fd37 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -751,7 +751,7 @@ out:
return handle_pte_fault(mm, vma, address, pte, pmd, flags);
}

-bool pmd_prot_none(struct vm_area_struct *vma, pmd_t pmd)
+bool pmd_numa(struct vm_area_struct *vma, pmd_t pmd)
{
/*
* See pte_prot_none().
@@ -762,7 +762,7 @@ bool pmd_prot_none(struct vm_area_struct *vma, pmd_t pmd)
return pmd_same(pmd, pmd_modify(pmd, vma_prot_none(vma)));
}

-void do_huge_pmd_prot_none(struct mm_struct *mm, struct vm_area_struct *vma,
+void do_huge_pmd_numa_page(struct mm_struct *mm, struct vm_area_struct *vma,
unsigned long address, pmd_t *pmd,
unsigned int flags, pmd_t entry)
{
diff --git a/mm/memory.c b/mm/memory.c
index 9e56a44..c752379 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -3425,11 +3425,13 @@ static int do_nonlinear_fault(struct mm_struct *mm, struct vm_area_struct *vma,
return __do_fault(mm, vma, address, pmd, pgoff, flags, orig_pte);
}

-static bool pte_prot_none(struct vm_area_struct *vma, pte_t pte)
+static bool pte_numa(struct vm_area_struct *vma, pte_t pte)
{
/*
- * If we have the normal vma->vm_page_prot protections we're not a
- * 'special' PROT_NONE page.
+ * For NUMA page faults, we use PROT_NONE ptes in VMAs with
+ * "normal" vma->vm_page_prot protections. Genuine PROT_NONE
+ * VMAs should never get here, because the fault handling code
+ * will notice that the VMA has no read or write permissions.
*
* This means we cannot get 'special' PROT_NONE faults from genuine
* PROT_NONE maps, nor from PROT_WRITE file maps that do dirty
@@ -3444,7 +3446,7 @@ static bool pte_prot_none(struct vm_area_struct *vma, pte_t pte)
return pte_same(pte, pte_modify(pte, vma_prot_none(vma)));
}

-static int do_prot_none(struct mm_struct *mm, struct vm_area_struct *vma,
+static int do_numa_page(struct mm_struct *mm, struct vm_area_struct *vma,
unsigned long address, pte_t *ptep, pmd_t *pmd,
unsigned int flags, pte_t entry)
{
@@ -3541,8 +3543,8 @@ int handle_pte_fault(struct mm_struct *mm,
pte, pmd, flags, entry);
}

- if (pte_prot_none(vma, entry))
- return do_prot_none(mm, vma, address, pte, pmd, flags, entry);
+ if (pte_numa(vma, entry))
+ return do_numa_page(mm, vma, address, pte, pmd, flags, entry);

ptl = pte_lockptr(mm, pmd);
spin_lock(ptl);
@@ -3612,8 +3614,8 @@ retry:

barrier();
if (pmd_trans_huge(orig_pmd) && !pmd_trans_splitting(orig_pmd)) {
- if (pmd_prot_none(vma, orig_pmd)) {
- do_huge_pmd_prot_none(mm, vma, address, pmd,
+ if (pmd_numa(vma, orig_pmd)) {
+ do_huge_pmd_numa_page(mm, vma, address, pmd,
flags, orig_pmd);
}

2012-10-19 11:41:15

by Peter Zijlstra

Subject: Re: [PATCH 2/2] rename NUMA fault handling functions

On Thu, 2012-10-18 at 17:20 -0400, Rik van Riel wrote:
> Having the function name indicate what the function is used
> for makes the code a little easier to read. Furthermore,
> the fault handling code largely consists of do_...._page
> functions.

I don't much care either way, but I was thinking walken might want to
use something similar to do WSS estimation, in which case the NUMA name
is just as wrong.

2012-10-19 11:42:15

by Peter Zijlstra

Subject: Re: [PATCH 1/2] add credits for NUMA placement

On Thu, 2012-10-18 at 17:19 -0400, Rik van Riel wrote:
> The NUMA placement code has been rewritten several times, but
> the basic ideas took a lot of work to develop. The people who
> put in the work deserve credit for it. Thanks Andrea & Peter :)
>
> The Documentation/scheduler/numa-problem.txt file should
> probably be rewritten once we figure out the final details of
> what the NUMA code needs to do, and why.
>
> Signed-off-by: Rik van Riel <[email protected]>

Acked-by: Peter Zijlstra <[email protected]>

Thanks Rik!

2012-10-19 12:02:33

by Rik van Riel

Subject: [tip:numa/core] numa: Add credits for NUMA placement

Commit-ID: c1a305006e4dd428001852923c11806d754db9f1
Gitweb: http://git.kernel.org/tip/c1a305006e4dd428001852923c11806d754db9f1
Author: Rik van Riel <[email protected]>
AuthorDate: Thu, 18 Oct 2012 17:19:28 -0400
Committer: Ingo Molnar <[email protected]>
CommitDate: Fri, 19 Oct 2012 13:45:48 +0200

numa: Add credits for NUMA placement

The NUMA placement code has been rewritten several times, but
the basic ideas took a lot of work to develop. The people who
put in the work deserve credit for it. Thanks Andrea & Peter :)

[ The Documentation/scheduler/numa-problem.txt file should
probably be rewritten once we figure out the final details of
what the NUMA code needs to do, and why. ]

Signed-off-by: Rik van Riel <[email protected]>
Acked-by: Peter Zijlstra <[email protected]>
Cc: [email protected]
Cc: Linus Torvalds <[email protected]>
Cc: Andrew Morton <[email protected]>
Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: Ingo Molnar <[email protected]>
----
This is against tip.git numa/core
---
CREDITS | 1 +
kernel/sched/fair.c | 3 +++
mm/memory.c | 2 ++
3 files changed, 6 insertions(+), 0 deletions(-)

diff --git a/CREDITS b/CREDITS
index d8fe12a..b4cdc8f 100644
--- a/CREDITS
+++ b/CREDITS
@@ -125,6 +125,7 @@ D: Author of pscan that helps to fix lp/parport bugs
D: Author of lil (Linux Interrupt Latency benchmark)
D: Fixed the shm swap deallocation at swapoff time (try_to_unuse message)
D: VM hacker
+D: NUMA task placement
D: Various other kernel hacks
S: Imola 40026
S: Italy
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 1e24aa1..e93032d 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -18,6 +18,9 @@
*
* Adaptive scheduling granularity, math enhancements by Peter Zijlstra
* Copyright (C) 2007 Red Hat, Inc., Peter Zijlstra <[email protected]>
+ *
+ * NUMA placement, statistics and algorithm by Andrea Arcangeli,
+ * CFS balancing changes by Peter Zijlstra. Copyright (C) 2012 Red Hat, Inc.
*/

#include <linux/latencytop.h>
diff --git a/mm/memory.c b/mm/memory.c
index fc48fe8..9e56a44 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -36,6 +36,8 @@
* ([email protected])
*
* Aug/Sep 2004 Changed to four level page tables (Andi Kleen)
+ *
+ * 2012 - NUMA placement page faults (Andrea Arcangeli, Peter Zijlstra)
*/

#include <linux/kernel_stat.h>

2012-10-19 14:07:53

by Rik van Riel

Subject: Re: [PATCH 2/2] rename NUMA fault handling functions

On 10/19/2012 07:41 AM, Peter Zijlstra wrote:
> On Thu, 2012-10-18 at 17:20 -0400, Rik van Riel wrote:
>> Having the function name indicate what the function is used
>> for makes the code a little easier to read. Furthermore,
>> the fault handling code largely consists of do_...._page
>> functions.
>
> I don't much care either way, but I was thinking walken might want to
> use something similar to do WSS estimation, in which case the NUMA name
> is just as wrong.

That's a good point. I had not considered other uses of the
same code.

--
All rights reversed

2012-10-19 20:54:14

by Ingo Molnar

Subject: Re: [PATCH 2/2] rename NUMA fault handling functions


* Rik van Riel <[email protected]> wrote:

> On 10/19/2012 07:41 AM, Peter Zijlstra wrote:
> >On Thu, 2012-10-18 at 17:20 -0400, Rik van Riel wrote:
> >>Having the function name indicate what the function is used
> >>for makes the code a little easier to read. Furthermore,
> >>the fault handling code largely consists of do_...._page
> >>functions.
> >
> > I don't much care either way, but I was thinking walken
> > might want to use something similar to do WSS estimation, in
> > which case the NUMA name is just as wrong.
>
> That's a good point. I had not considered other uses of the
> same code.

Renaming the functions for more clarity still makes sense IMO:
we could give it a _wss or _working_set prefix/postfix?

Thanks,

Ingo

2012-10-20 10:15:31

by Michel Lespinasse

Subject: Re: [PATCH 2/2] rename NUMA fault handling functions

On Fri, Oct 19, 2012 at 4:41 AM, Peter Zijlstra <[email protected]> wrote:
> On Thu, 2012-10-18 at 17:20 -0400, Rik van Riel wrote:
>> Having the function name indicate what the function is used
>> for makes the code a little easier to read. Furthermore,
>> the fault handling code largely consists of do_...._page
>> functions.
>
> I don't much care either way, but I was thinking walken might want to
> use something similar to do WSS estimation, in which case the NUMA name
> is just as wrong.

Right now my working set estimation only uses A bits, so let's not
make that a concern here.

I think the _numa names are a bit better than _prot_none, but still a
bit confusing. I don't have any great suggestions but I think there
should at least be a comment above pte_numa() that explains what the
numa ptes are (the comment within the function doesn't qualify as it
only explains how the numa ptes are different from the ones in
PROT_NONE vmas...)
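
[ A sketch of what such a comment could look like, assuming the
mechanism as described in this thread; illustrative wording only,
not text from any tree: ]

/*
 * A "NUMA pte" is a pte that the NUMA working set scanner has made
 * PROT_NONE while the VMA keeps its normal vma->vm_page_prot.  The
 * next access faults, which lets the kernel see which task touched
 * the page and decide whether to migrate the page and/or the task.
 */
static bool pte_numa(struct vm_area_struct *vma, pte_t pte)
{
	/* A NUMA pte differs from the VMA's protections only by PROT_NONE. */
	return pte_same(pte, pte_modify(pte, vma_prot_none(vma)));
}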

--
Michel "Walken" Lespinasse
A program is never fully debugged until the last user dies.

2012-10-21 12:44:00

by Ingo Molnar

Subject: [PATCH 3/2] sched, numa, mm: Implement constant rate working set sampling


* Rik van Riel <[email protected]> wrote:

> Hi Ingo,
>
> Here are some minor NUMA cleanups to start with.
>
> I have some ideas for larger improvements, and things to port
> over from autonuma, but I got caught up in some of the code
> and am not sure about those changes yet.

To help out I picked up a couple of obvious ones:

cee8868763f8 x86, mm: Prevent gcc to re-read the pagetables
a860d4c7a1f4 mm: Check if PTE is already allocated during page fault
e9fe72334fb0 numa, mm: Fix NUMA hinting page faults from gup/gup_fast

I kept Andrea as the author, the patches needed only minimal
adaptation.

Plus I finally completed testing and applying Peter's
constant-rate WSS patch:

3d049f8a5398 sched, numa, mm: Implement constant, per task Working Set Sampling (WSS) rate

This is in part similar to AutoNUMA's hinting page fault rate
limiting feature (pages_to_scan et al), and in part an
improvement/extension of it. See the patch below for details.

Let me know if you have any questions!

Thanks,

Ingo

---------------->
From 3d049f8a5398d0050ab9978b3ac67402f337390f Mon Sep 17 00:00:00 2001
From: Peter Zijlstra <[email protected]>
Date: Sun, 14 Oct 2012 16:59:13 +0200
Subject: [PATCH] sched, numa, mm: Implement constant, per task Working Set Sampling (WSS) rate

Previously, to probe the working set of a task, we'd use
a very simple and crude method: mark all of its address
space PROT_NONE.

That method has various (obvious) disadvantages:

- it samples the working set at dissimilar rates,
giving some tasks a sampling quality advantage
over others

- it creates performance problems for tasks with very
large working sets

- it over-samples processes that have large address
spaces but only very rarely execute

Improve that method by keeping a rotating offset into the
address space that marks the current position of the scan,
and advancing it at a constant rate proportional to the CPU
cycles the task executes. If the offset reaches the last
mapped address of the mm, it starts over at the first
address.

The per-task nature of the working set sampling functionality
in this tree allows such constant-rate, per-task,
execution-weight proportional sampling of the working set,
with an adaptive sampling interval/frequency that goes from
once per 100 msecs up to just once per 1.6 seconds.
The current sampling volume is 256 MB per interval.

As tasks mature and their working sets converge, the
sampling rate slows down to just a trickle: 256 MB per 1.6
seconds of CPU time executed.

This, beyond being adaptive, also rate-limits rarely
executing tasks and does not over-sample on overloaded
systems.
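
[ For intuition, a minimal user-space C sketch of the scan bandwidth
these numbers imply, using the 256 MB / 100 ms / 1600 ms constants
from the patch below; illustrative only: ]

#include <stdio.h>

int main(void)
{
	const double scan_mb       = 256.0;      /* sysctl_sched_numa_scan_size */
	const double min_period_ms = 100.0;      /* ..._task_period_min */
	const double max_period_ms = 100.0 * 16; /* ..._task_period_max, 1.6s */

	/* 256 MB per 0.1s: at most 2560 MB/s of address space marked. */
	printf("peak scan rate:    %.0f MB/s\n", scan_mb * 1000.0 / min_period_ms);

	/* 256 MB per 1.6s: 160 MB/s once the sampling interval backs off. */
	printf("settled scan rate: %.0f MB/s\n", scan_mb * 1000.0 / max_period_ms);

	return 0;
}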

[ In AutoNUMA speak, this patch deals with the effective sampling
rate of the 'hinting page fault'. AutoNUMA's scanning is
currently rate-limited, but it is also fundamentally
single-threaded, executing in the knuma_scand kernel thread,
so the limit in AutoNUMA is global and does not scale up with
the number of CPUs, nor does it scan tasks in an execution
proportional manner.

So the idea of rate-limiting the scanning was first implemented
in the AutoNUMA tree via a global rate limit. This patch goes
beyond that by implementing an execution rate proportional
working set sampling rate that is not implemented via a single
global scanning daemon. ]

Based-on-idea-by: Andrea Arcangeli <[email protected]>
Signed-off-by: Peter Zijlstra <[email protected]>
Cc: Linus Torvalds <[email protected]>
Cc: Andrew Morton <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Cc: Andrea Arcangeli <[email protected]>
Cc: Rik van Riel <[email protected]>
Link: http://lkml.kernel.org/n/[email protected]
[ Wrote changelog and fixed bug. ]
Signed-off-by: Ingo Molnar <[email protected]>
---
include/linux/mempolicy.h | 2 --
include/linux/mm.h | 6 ++++++
include/linux/mm_types.h | 1 +
include/linux/sched.h | 1 +
kernel/sched/fair.c | 44 ++++++++++++++++++++++++++++++++++++++++----
kernel/sysctl.c | 7 +++++++
mm/mempolicy.c | 24 ------------------------
7 files changed, 55 insertions(+), 30 deletions(-)

diff --git a/include/linux/mempolicy.h b/include/linux/mempolicy.h
index a5bf9d6..d6b1ea1 100644
--- a/include/linux/mempolicy.h
+++ b/include/linux/mempolicy.h
@@ -199,8 +199,6 @@ static inline int vma_migratable(struct vm_area_struct *vma)

extern int mpol_misplaced(struct page *, struct vm_area_struct *, unsigned long);

-extern void lazy_migrate_process(struct mm_struct *mm);
-
#else /* CONFIG_NUMA */

struct mempolicy {};
diff --git a/include/linux/mm.h b/include/linux/mm.h
index 423464b..64ccf29 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -1581,6 +1581,12 @@ static inline pgprot_t vma_prot_none(struct vm_area_struct *vma)
return pgprot_modify(vma->vm_page_prot, vm_get_page_prot(vmflags));
}

+static inline void
+change_prot_none(struct vm_area_struct *vma, unsigned long start, unsigned long end)
+{
+ change_protection(vma, start, end, vma_prot_none(vma), 0);
+}
+
struct vm_area_struct *find_extend_vma(struct mm_struct *, unsigned long addr);
int remap_pfn_range(struct vm_area_struct *, unsigned long addr,
unsigned long pfn, unsigned long size, pgprot_t);
diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h
index bef4c5e..01c1d04 100644
--- a/include/linux/mm_types.h
+++ b/include/linux/mm_types.h
@@ -405,6 +405,7 @@ struct mm_struct {
#endif
#ifdef CONFIG_SCHED_NUMA
unsigned long numa_next_scan;
+ unsigned long numa_scan_offset;
int numa_scan_seq;
#endif
struct uprobes_state uprobes_state;
diff --git a/include/linux/sched.h b/include/linux/sched.h
index 9e726f0..63c011e 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -2022,6 +2022,7 @@ extern enum sched_tunable_scaling sysctl_sched_tunable_scaling;

extern unsigned int sysctl_sched_numa_task_period_min;
extern unsigned int sysctl_sched_numa_task_period_max;
+extern unsigned int sysctl_sched_numa_scan_size;
extern unsigned int sysctl_sched_numa_settle_count;

#ifdef CONFIG_SCHED_DEBUG
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index a66a1b6..9f7406e 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -827,8 +827,9 @@ static void account_numa_dequeue(struct rq *rq, struct task_struct *p)
/*
* numa task sample period in ms: 5s
*/
-unsigned int sysctl_sched_numa_task_period_min = 5000;
-unsigned int sysctl_sched_numa_task_period_max = 5000*16;
+unsigned int sysctl_sched_numa_task_period_min = 100;
+unsigned int sysctl_sched_numa_task_period_max = 100*16;
+unsigned int sysctl_sched_numa_scan_size = 256; /* MB */

/*
* Wait for the 2-sample stuff to settle before migrating again
@@ -902,6 +903,9 @@ void task_numa_work(struct callback_head *work)
unsigned long migrate, next_scan, now = jiffies;
struct task_struct *p = current;
struct mm_struct *mm = p->mm;
+ struct vm_area_struct *vma;
+ unsigned long offset, end;
+ long length;

WARN_ON_ONCE(p != container_of(work, struct task_struct, numa_work));

@@ -928,8 +932,40 @@ void task_numa_work(struct callback_head *work)
if (cmpxchg(&mm->numa_next_scan, migrate, next_scan) != migrate)
return;

- ACCESS_ONCE(mm->numa_scan_seq)++;
- lazy_migrate_process(mm);
+
+ offset = mm->numa_scan_offset;
+ length = sysctl_sched_numa_scan_size;
+ length <<= 20;
+
+ down_read(&mm->mmap_sem);
+ vma = find_vma(mm, offset);
+again:
+ if (!vma) {
+ ACCESS_ONCE(mm->numa_scan_seq)++;
+ offset = 0;
+ vma = mm->mmap;
+ }
+ while (vma && !vma_migratable(vma)) {
+ vma = vma->vm_next;
+ if (!vma)
+ goto again;
+ }
+
+ offset = max(offset, vma->vm_start);
+ end = min(ALIGN(offset + length, HPAGE_SIZE), vma->vm_end);
+ length -= end - offset;
+
+ change_prot_none(vma, offset, end);
+
+ offset = end;
+
+ if (length > 0) {
+ vma = vma->vm_next;
+ goto again;
+ }
+ mm->numa_scan_offset = offset;
+ up_read(&mm->mmap_sem);
+
}

/*
diff --git a/kernel/sysctl.c b/kernel/sysctl.c
index 14a1949..0f0cb60 100644
--- a/kernel/sysctl.c
+++ b/kernel/sysctl.c
@@ -365,6 +365,13 @@ static struct ctl_table kern_table[] = {
.proc_handler = proc_dointvec,
},
{
+ .procname = "sched_numa_scan_size_mb",
+ .data = &sysctl_sched_numa_scan_size,
+ .maxlen = sizeof(unsigned int),
+ .mode = 0644,
+ .proc_handler = proc_dointvec,
+ },
+ {
.procname = "sched_numa_settle_count",
.data = &sysctl_sched_numa_settle_count,
.maxlen = sizeof(unsigned int),
diff --git a/mm/mempolicy.c b/mm/mempolicy.c
index f0e3b28..d998810 100644
--- a/mm/mempolicy.c
+++ b/mm/mempolicy.c
@@ -581,12 +581,6 @@ static inline int check_pgd_range(struct vm_area_struct *vma,
return 0;
}

-static void
-change_prot_none(struct vm_area_struct *vma, unsigned long start, unsigned long end)
-{
- change_protection(vma, start, end, vma_prot_none(vma), 0);
-}
-
/*
* Check if all pages in a range are on a set of nodes.
* If pagelist != NULL then isolate pages from the LRU and
@@ -1259,24 +1253,6 @@ static long do_mbind(unsigned long start, unsigned long len,
return err;
}

-static void lazy_migrate_vma(struct vm_area_struct *vma)
-{
- if (!vma_migratable(vma))
- return;
-
- change_prot_none(vma, vma->vm_start, vma->vm_end);
-}
-
-void lazy_migrate_process(struct mm_struct *mm)
-{
- struct vm_area_struct *vma;
-
- down_read(&mm->mmap_sem);
- for (vma = mm->mmap; vma; vma = vma->vm_next)
- lazy_migrate_vma(vma);
- up_read(&mm->mmap_sem);
-}
-
/*
* User space interface with variable sized bitmaps for nodelists.
*/
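
[ A usage sketch, assuming the kern_table entry above surfaces the
new knob as /proc/sys/kernel/sched_numa_scan_size_mb; minimal
illustrative code, which needs root and a kernel with this patch: ]

#include <stdio.h>

int main(void)
{
	FILE *f = fopen("/proc/sys/kernel/sched_numa_scan_size_mb", "w");

	if (!f) {
		perror("fopen");
		return 1;
	}
	/* Scan 512 MB per sampling interval instead of the default 256. */
	fprintf(f, "%d\n", 512);
	fclose(f);

	return 0;
}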

2012-10-21 12:50:14

by Ingo Molnar

Subject: [PATCH 4/2] numa, mm: Rename the PROT_NONE fault handling functions


* Ingo Molnar <[email protected]> wrote:

> > > I don't much care either way, but I was thinking walken
> > > might want to use something similar to do WSS estimation,
> > > in which case the NUMA name is just as wrong.
> >
> > That's a good point. I had not considered other uses of the
> > same code.
>
> Renaming the functions for more clarity still makes sense IMO:
> we could give it a _wss or _working_set prefix/postfix?

So, to not drop your patch on the floor I've modified it as per
the patch below.

The _wss() names signal that these handlers are used for a
specific purpose; they are not related to the regular PROT_NONE
handling code.

Agreed?

Thanks,

Ingo

--------------->
From 7e426e0f6ffe228118e57a70ae402e21792a0456 Mon Sep 17 00:00:00 2001
From: Rik van Riel <[email protected]>
Date: Thu, 18 Oct 2012 17:20:21 -0400
Subject: [PATCH] numa, mm: Rename the PROT_NONE fault handling functions

Having the function name indicate what the function is used
for makes the code a little easier to read. Furthermore,
the fault handling code largely consists of do_...._page
functions.

Rename the Working-Set Sampling (WSS) fault handling functions
to indicate what they are used for.

Signed-off-by: Rik van Riel <[email protected]>
Cc: [email protected]
Cc: [email protected]
Link: http://lkml.kernel.org/r/[email protected]
[ Changed the naming pattern to 'working-set sampling (WSS)' wss_() ]
Signed-off-by: Ingo Molnar <[email protected]>
---
include/linux/huge_mm.h | 8 ++++----
mm/huge_memory.c | 4 ++--
mm/memory.c | 18 ++++++++++--------
3 files changed, 16 insertions(+), 14 deletions(-)

diff --git a/include/linux/huge_mm.h b/include/linux/huge_mm.h
index bcbe467..93c6ab5 100644
--- a/include/linux/huge_mm.h
+++ b/include/linux/huge_mm.h
@@ -160,9 +160,9 @@ static inline struct page *compound_trans_head(struct page *page)
return page;
}

-extern bool pmd_prot_none(struct vm_area_struct *vma, pmd_t pmd);
+extern bool pmd_wss(struct vm_area_struct *vma, pmd_t pmd);

-extern void do_huge_pmd_prot_none(struct mm_struct *mm, struct vm_area_struct *vma,
+extern void do_huge_pmd_wss_page(struct mm_struct *mm, struct vm_area_struct *vma,
unsigned long address, pmd_t *pmd,
unsigned int flags, pmd_t orig_pmd);

@@ -203,12 +203,12 @@ static inline int pmd_trans_huge_lock(pmd_t *pmd,
return 0;
}

-static inline bool pmd_prot_none(struct vm_area_struct *vma, pmd_t pmd)
+static inline bool pmd_wss(struct vm_area_struct *vma, pmd_t pmd)
{
return false;
}

-static inline void do_huge_pmd_prot_none(struct mm_struct *mm, struct vm_area_struct *vma,
+static inline void do_huge_pmd_wss_page(struct mm_struct *mm, struct vm_area_struct *vma,
unsigned long address, pmd_t *pmd,
unsigned int flags, pmd_t orig_pmd)
{
diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index c58a5f0..982f678 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -727,7 +727,7 @@ out:
return handle_pte_fault(mm, vma, address, pte, pmd, flags);
}

-bool pmd_prot_none(struct vm_area_struct *vma, pmd_t pmd)
+bool pmd_wss(struct vm_area_struct *vma, pmd_t pmd)
{
/*
* See pte_prot_none().
@@ -738,7 +738,7 @@ bool pmd_prot_none(struct vm_area_struct *vma, pmd_t pmd)
return pmd_same(pmd, pmd_modify(pmd, vma_prot_none(vma)));
}

-void do_huge_pmd_prot_none(struct mm_struct *mm, struct vm_area_struct *vma,
+void do_huge_pmd_wss_page(struct mm_struct *mm, struct vm_area_struct *vma,
unsigned long address, pmd_t *pmd,
unsigned int flags, pmd_t entry)
{
diff --git a/mm/memory.c b/mm/memory.c
index 2cc8a29..a3693e6 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -1471,11 +1471,13 @@ int zap_vma_ptes(struct vm_area_struct *vma, unsigned long address,
}
EXPORT_SYMBOL_GPL(zap_vma_ptes);

-static bool pte_prot_none(struct vm_area_struct *vma, pte_t pte)
+static bool pte_wss(struct vm_area_struct *vma, pte_t pte)
{
/*
- * If we have the normal vma->vm_page_prot protections we're not a
- * 'special' PROT_NONE page.
+ * For NUMA page faults, we use PROT_NONE ptes in VMAs with
+ * "normal" vma->vm_page_prot protections. Genuine PROT_NONE
+ * VMAs should never get here, because the fault handling code
+ * will notice that the VMA has no read or write permissions.
*
* This means we cannot get 'special' PROT_NONE faults from genuine
* PROT_NONE maps, nor from PROT_WRITE file maps that do dirty
@@ -3476,7 +3478,7 @@ static int do_nonlinear_fault(struct mm_struct *mm, struct vm_area_struct *vma,
return __do_fault(mm, vma, address, pmd, pgoff, flags, orig_pte);
}

-static int do_prot_none(struct mm_struct *mm, struct vm_area_struct *vma,
+static int do_wss_page(struct mm_struct *mm, struct vm_area_struct *vma,
unsigned long address, pte_t *ptep, pmd_t *pmd,
unsigned int flags, pte_t entry)
{
@@ -3601,8 +3603,8 @@ int handle_pte_fault(struct mm_struct *mm,
pte, pmd, flags, entry);
}

- if (pte_prot_none(vma, entry))
- return do_prot_none(mm, vma, address, pte, pmd, flags, entry);
+ if (pte_wss(vma, entry))
+ return do_wss_page(mm, vma, address, pte, pmd, flags, entry);

ptl = pte_lockptr(mm, pmd);
spin_lock(ptl);
@@ -3672,8 +3674,8 @@ retry:

barrier();
if (pmd_trans_huge(orig_pmd) && !pmd_trans_splitting(orig_pmd)) {
- if (pmd_prot_none(vma, orig_pmd)) {
- do_huge_pmd_prot_none(mm, vma, address, pmd,
+ if (pmd_wss(vma, orig_pmd)) {
+ do_huge_pmd_wss_page(mm, vma, address, pmd,
flags, orig_pmd);
}

2012-10-21 13:23:14

by Rik van Riel

Subject: Re: [PATCH 4/2] numa, mm: Rename the PROT_NONE fault handling functions

On 10/21/2012 08:50 AM, Ingo Molnar wrote:
>
> * Ingo Molnar <[email protected]> wrote:
>
>>>> I don't much care either way, but I was thinking walken
>>>> might want to use something similar to do WSS estimation,
>>>> in which case the NUMA name is just as wrong.
>>>
>>> That's a good point. I had not considered other uses of the
>>> same code.
>>
>> Renaming the functions for more clarity still makes sense IMO:
>> we could give it a _wss or _working_set prefix/postfix?
>
> So, to not drop your patch on the floor I've modified it as per
> the patch below.
>
> The _wss() names signal that these handlers are used for a
> specific purpose; they are not related to the regular PROT_NONE
> handling code.

Michel indicated that he does not use PROT_NONE for his
working set estimation code, but instead checks the
accessed bits in the page tables.

Since NUMA migration is the only user of PROT_NONE ptes
in normal vmas, maybe _numa is the right suffix after all?

--
All rights reversed

2012-10-21 13:29:27

by Ingo Molnar

Subject: Re: [PATCH 4/2] numa, mm: Rename the PROT_NONE fault handling functions


* Rik van Riel <[email protected]> wrote:

> On 10/21/2012 08:50 AM, Ingo Molnar wrote:
> >
> >* Ingo Molnar <[email protected]> wrote:
> >
> >>>>I don't much care either way, but I was thinking walken
> >>>>might want to use something similar to do WSS estimation,
> >>>>in which case the NUMA name is just as wrong.
> >>>
> >>>That's a good point. I had not considered other uses of the
> >>>same code.
> >>
> >>Renaming the functions for more clarity still makes sense IMO:
> >>we could give it a _wss or _working_set prefix/postfix?
> >
> >So, to not drop your patch on the floor I've modified it as per
> >the patch below.
> >
> >The _wss() names signal that these handlers are used for a
> >specific purpose; they are not related to the regular PROT_NONE
> >handling code.
>
> Michel indicated that he does not use PROT_NONE for his
> working set estimation code, but instead checks the accessed
> bits in the page tables.

The pte_young() WSS method has a couple of fundamental
limitations:

- it doesn't work very well with shared memory: the pte is per
mapping, not per page. The PROT_NONE method, in essence,
instruments the physical page.

- it does not tell us which task touched the pte in a
multi-threaded program

So, like Peter, I too expect these new WSS methods to
eventually be picked up for any serious WSS work.
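
[ To make the contrast concrete, a rough kernel-flavoured sketch of
the two probing styles; illustrative pseudocode using helpers from
this thread, not code from either tree: ]

/* A-bit sampling: per mapping, and anonymous as to the accessor. */
if (ptep_test_and_clear_young(vma, address, ptep)) {
	/* Referenced through this pte since the last scan, but
	 * we cannot tell which task did the touching. */
}

/* PROT_NONE sampling: the scanner makes the range PROT_NONE via
 * change_prot_none(), so the next touch faults in the context of
 * the accessing task; in do_numa_page(), 'current' is the toucher. */
change_prot_none(vma, start, end);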

Thanks,

Ingo

2012-10-21 13:43:48

by Ingo Molnar

Subject: Re: [PATCH 4/2] numa, mm: Rename the PROT_NONE fault handling functions


* Ingo Molnar <[email protected]> wrote:

> > Michel indicated that he does not use PROT_NONE for his
> > working set estimation code, but instead checks the accessed
> > bits in the page tables.
>
> The pte_young() WSS method has a couple of fundamental
> limitations:
>
> - it doesn't work very well with shared memory: the pte is per
> mapping, not per page. The PROT_NONE method, in essence,
> instruments the physical page.
>
> - it does not tell us which task touched the pte in a
> multi-threaded program
>
> So, like Peter, I too expect these new WSS methods to
> eventually be picked up for any serious WSS work.

Nevertheless let's wait and see until it actually happens - and
meanwhile the prot_none namings are confusing.

So I've applied your patch as-is, with two more (new) usage
sites converted as well. Will push it out after a bit of
testing.

Thanks,

Ingo

2012-10-21 15:20:35

by Rik van Riel

Subject: [tip:numa/core] numa, mm: Rename the PROT_NONE fault handling functions to *_numa()

Commit-ID: 2458840fddea542391d343dac734d149607db709
Gitweb: http://git.kernel.org/tip/2458840fddea542391d343dac734d149607db709
Author: Rik van Riel <[email protected]>
AuthorDate: Thu, 18 Oct 2012 17:20:21 -0400
Committer: Ingo Molnar <[email protected]>
CommitDate: Sun, 21 Oct 2012 15:41:26 +0200

numa, mm: Rename the PROT_NONE fault handling functions to *_numa()

Having the function name indicate what the function is used
for makes the code a little easier to read. Furthermore,
the fault handling code largely consists of do_...._page
functions.

Rename the NUMA working set sampling fault handling functions
to _numa() names, to indicate what they are used for.

This separates the naming from the regular PROT_NONE namings.

Signed-off-by: Rik van Riel <[email protected]>
Cc: [email protected]
Cc: [email protected]
Link: http://lkml.kernel.org/r/[email protected]
[ Converted two more usage sites ]
Signed-off-by: Ingo Molnar <[email protected]>
---
include/linux/huge_mm.h | 8 ++++----
mm/huge_memory.c | 4 ++--
mm/memory.c | 22 ++++++++++++----------
3 files changed, 18 insertions(+), 16 deletions(-)

diff --git a/include/linux/huge_mm.h b/include/linux/huge_mm.h
index bcbe467..4f0f948 100644
--- a/include/linux/huge_mm.h
+++ b/include/linux/huge_mm.h
@@ -160,9 +160,9 @@ static inline struct page *compound_trans_head(struct page *page)
return page;
}

-extern bool pmd_prot_none(struct vm_area_struct *vma, pmd_t pmd);
+extern bool pmd_numa(struct vm_area_struct *vma, pmd_t pmd);

-extern void do_huge_pmd_prot_none(struct mm_struct *mm, struct vm_area_struct *vma,
+extern void do_huge_pmd_numa_page(struct mm_struct *mm, struct vm_area_struct *vma,
unsigned long address, pmd_t *pmd,
unsigned int flags, pmd_t orig_pmd);

@@ -203,12 +203,12 @@ static inline int pmd_trans_huge_lock(pmd_t *pmd,
return 0;
}

-static inline bool pmd_prot_none(struct vm_area_struct *vma, pmd_t pmd)
+static inline bool pmd_numa(struct vm_area_struct *vma, pmd_t pmd)
{
return false;
}

-static inline void do_huge_pmd_prot_none(struct mm_struct *mm, struct vm_area_struct *vma,
+static inline void do_huge_pmd_numa_page(struct mm_struct *mm, struct vm_area_struct *vma,
unsigned long address, pmd_t *pmd,
unsigned int flags, pmd_t orig_pmd)
{
diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index c58a5f0..a8f6531 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -727,7 +727,7 @@ out:
return handle_pte_fault(mm, vma, address, pte, pmd, flags);
}

-bool pmd_prot_none(struct vm_area_struct *vma, pmd_t pmd)
+bool pmd_numa(struct vm_area_struct *vma, pmd_t pmd)
{
/*
* See pte_prot_none().
@@ -738,7 +738,7 @@ bool pmd_prot_none(struct vm_area_struct *vma, pmd_t pmd)
return pmd_same(pmd, pmd_modify(pmd, vma_prot_none(vma)));
}

-void do_huge_pmd_prot_none(struct mm_struct *mm, struct vm_area_struct *vma,
+void do_huge_pmd_numa_page(struct mm_struct *mm, struct vm_area_struct *vma,
unsigned long address, pmd_t *pmd,
unsigned int flags, pmd_t entry)
{
diff --git a/mm/memory.c b/mm/memory.c
index 2cc8a29..23d4bd4 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -1471,11 +1471,13 @@ int zap_vma_ptes(struct vm_area_struct *vma, unsigned long address,
}
EXPORT_SYMBOL_GPL(zap_vma_ptes);

-static bool pte_prot_none(struct vm_area_struct *vma, pte_t pte)
+static bool pte_numa(struct vm_area_struct *vma, pte_t pte)
{
/*
- * If we have the normal vma->vm_page_prot protections we're not a
- * 'special' PROT_NONE page.
+ * For NUMA page faults, we use PROT_NONE ptes in VMAs with
+ * "normal" vma->vm_page_prot protections. Genuine PROT_NONE
+ * VMAs should never get here, because the fault handling code
+ * will notice that the VMA has no read or write permissions.
*
* This means we cannot get 'special' PROT_NONE faults from genuine
* PROT_NONE maps, nor from PROT_WRITE file maps that do dirty
@@ -1543,7 +1545,7 @@ struct page *follow_page(struct vm_area_struct *vma, unsigned long address,
page = follow_huge_pmd(mm, address, pmd, flags & FOLL_WRITE);
goto out;
}
- if ((flags & FOLL_NUMA) && pmd_prot_none(vma, *pmd))
+ if ((flags & FOLL_NUMA) && pmd_numa(vma, *pmd))
goto no_page_table;
if (pmd_trans_huge(*pmd)) {
if (flags & FOLL_SPLIT) {
@@ -1574,7 +1576,7 @@ split_fallthrough:
pte = *ptep;
if (!pte_present(pte))
goto no_page;
- if ((flags & FOLL_NUMA) && pte_prot_none(vma, pte))
+ if ((flags & FOLL_NUMA) && pte_numa(vma, pte))
goto no_page;
if ((flags & FOLL_WRITE) && !pte_write(pte))
goto unlock;
@@ -3476,7 +3478,7 @@ static int do_nonlinear_fault(struct mm_struct *mm, struct vm_area_struct *vma,
return __do_fault(mm, vma, address, pmd, pgoff, flags, orig_pte);
}

-static int do_prot_none(struct mm_struct *mm, struct vm_area_struct *vma,
+static int do_numa_page(struct mm_struct *mm, struct vm_area_struct *vma,
unsigned long address, pte_t *ptep, pmd_t *pmd,
unsigned int flags, pte_t entry)
{
@@ -3601,8 +3603,8 @@ int handle_pte_fault(struct mm_struct *mm,
pte, pmd, flags, entry);
}

- if (pte_prot_none(vma, entry))
- return do_prot_none(mm, vma, address, pte, pmd, flags, entry);
+ if (pte_numa(vma, entry))
+ return do_numa_page(mm, vma, address, pte, pmd, flags, entry);

ptl = pte_lockptr(mm, pmd);
spin_lock(ptl);
@@ -3672,8 +3674,8 @@ retry:

barrier();
if (pmd_trans_huge(orig_pmd) && !pmd_trans_splitting(orig_pmd)) {
- if (pmd_prot_none(vma, orig_pmd)) {
- do_huge_pmd_prot_none(mm, vma, address, pmd,
+ if (pmd_numa(vma, orig_pmd)) {
+ do_huge_pmd_numa_page(mm, vma, address, pmd,
flags, orig_pmd);
}

2012-10-23 11:00:48

by Rik van Riel

Subject: [tip:numa/core] numa, mm: Rename the PROT_NONE fault handling functions to *_numa()

Commit-ID: b3c01da073d82c8aaf3aa12f6214b64d2d1d83f8
Gitweb: http://git.kernel.org/tip/b3c01da073d82c8aaf3aa12f6214b64d2d1d83f8
Author: Rik van Riel <[email protected]>
AuthorDate: Thu, 18 Oct 2012 17:20:21 -0400
Committer: Ingo Molnar <[email protected]>
CommitDate: Tue, 23 Oct 2012 11:53:51 +0200

numa, mm: Rename the PROT_NONE fault handling functions to *_numa()

Having the function name indicate what the function is used
for makes the code a little easier to read. Furthermore,
the fault handling code largely consists of do_...._page
functions.

Rename the NUMA working set sampling fault handling functions
to _numa() names, to indicate what they are used for.

This separates the naming from the regular PROT_NONE namings.

Signed-off-by: Rik van Riel <[email protected]>
Cc: [email protected]
Cc: [email protected]
Link: http://lkml.kernel.org/r/[email protected]
[ Converted two more usage sites ]
Signed-off-by: Ingo Molnar <[email protected]>
---
include/linux/huge_mm.h | 8 ++++----
mm/huge_memory.c | 4 ++--
mm/memory.c | 22 ++++++++++++----------
3 files changed, 18 insertions(+), 16 deletions(-)

diff --git a/include/linux/huge_mm.h b/include/linux/huge_mm.h
index bcbe467..4f0f948 100644
--- a/include/linux/huge_mm.h
+++ b/include/linux/huge_mm.h
@@ -160,9 +160,9 @@ static inline struct page *compound_trans_head(struct page *page)
return page;
}

-extern bool pmd_prot_none(struct vm_area_struct *vma, pmd_t pmd);
+extern bool pmd_numa(struct vm_area_struct *vma, pmd_t pmd);

-extern void do_huge_pmd_prot_none(struct mm_struct *mm, struct vm_area_struct *vma,
+extern void do_huge_pmd_numa_page(struct mm_struct *mm, struct vm_area_struct *vma,
unsigned long address, pmd_t *pmd,
unsigned int flags, pmd_t orig_pmd);

@@ -203,12 +203,12 @@ static inline int pmd_trans_huge_lock(pmd_t *pmd,
return 0;
}

-static inline bool pmd_prot_none(struct vm_area_struct *vma, pmd_t pmd)
+static inline bool pmd_numa(struct vm_area_struct *vma, pmd_t pmd)
{
return false;
}

-static inline void do_huge_pmd_prot_none(struct mm_struct *mm, struct vm_area_struct *vma,
+static inline void do_huge_pmd_numa_page(struct mm_struct *mm, struct vm_area_struct *vma,
unsigned long address, pmd_t *pmd,
unsigned int flags, pmd_t orig_pmd)
{
diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index e62d3c5..bcba184 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -726,7 +726,7 @@ out:
return handle_pte_fault(mm, vma, address, pte, pmd, flags);
}

-bool pmd_prot_none(struct vm_area_struct *vma, pmd_t pmd)
+bool pmd_numa(struct vm_area_struct *vma, pmd_t pmd)
{
/*
* See pte_prot_none().
@@ -737,7 +737,7 @@ bool pmd_prot_none(struct vm_area_struct *vma, pmd_t pmd)
return pmd_same(pmd, pmd_modify(pmd, vma_prot_none(vma)));
}

-void do_huge_pmd_prot_none(struct mm_struct *mm, struct vm_area_struct *vma,
+void do_huge_pmd_numa_page(struct mm_struct *mm, struct vm_area_struct *vma,
unsigned long address, pmd_t *pmd,
unsigned int flags, pmd_t entry)
{
diff --git a/mm/memory.c b/mm/memory.c
index b609354..7ff1905 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -1471,11 +1471,13 @@ int zap_vma_ptes(struct vm_area_struct *vma, unsigned long address,
}
EXPORT_SYMBOL_GPL(zap_vma_ptes);

-static bool pte_prot_none(struct vm_area_struct *vma, pte_t pte)
+static bool pte_numa(struct vm_area_struct *vma, pte_t pte)
{
/*
- * If we have the normal vma->vm_page_prot protections we're not a
- * 'special' PROT_NONE page.
+ * For NUMA page faults, we use PROT_NONE ptes in VMAs with
+ * "normal" vma->vm_page_prot protections. Genuine PROT_NONE
+ * VMAs should never get here, because the fault handling code
+ * will notice that the VMA has no read or write permissions.
*
* This means we cannot get 'special' PROT_NONE faults from genuine
* PROT_NONE maps, nor from PROT_WRITE file maps that do dirty
@@ -1543,7 +1545,7 @@ struct page *follow_page(struct vm_area_struct *vma, unsigned long address,
page = follow_huge_pmd(mm, address, pmd, flags & FOLL_WRITE);
goto out;
}
- if ((flags & FOLL_NUMA) && pmd_prot_none(vma, *pmd))
+ if ((flags & FOLL_NUMA) && pmd_numa(vma, *pmd))
goto no_page_table;
if (pmd_trans_huge(*pmd)) {
if (flags & FOLL_SPLIT) {
@@ -1574,7 +1576,7 @@ split_fallthrough:
pte = *ptep;
if (!pte_present(pte))
goto no_page;
- if ((flags & FOLL_NUMA) && pte_prot_none(vma, pte))
+ if ((flags & FOLL_NUMA) && pte_numa(vma, pte))
goto no_page;
if ((flags & FOLL_WRITE) && !pte_write(pte))
goto unlock;
@@ -3476,7 +3478,7 @@ static int do_nonlinear_fault(struct mm_struct *mm, struct vm_area_struct *vma,
return __do_fault(mm, vma, address, pmd, pgoff, flags, orig_pte);
}

-static int do_prot_none(struct mm_struct *mm, struct vm_area_struct *vma,
+static int do_numa_page(struct mm_struct *mm, struct vm_area_struct *vma,
unsigned long address, pte_t *ptep, pmd_t *pmd,
unsigned int flags, pte_t entry)
{
@@ -3573,8 +3575,8 @@ int handle_pte_fault(struct mm_struct *mm,
pte, pmd, flags, entry);
}

- if (pte_prot_none(vma, entry))
- return do_prot_none(mm, vma, address, pte, pmd, flags, entry);
+ if (pte_numa(vma, entry))
+ return do_numa_page(mm, vma, address, pte, pmd, flags, entry);

ptl = pte_lockptr(mm, pmd);
spin_lock(ptl);
@@ -3644,8 +3646,8 @@ retry:

barrier();
if (pmd_trans_huge(orig_pmd) && !pmd_trans_splitting(orig_pmd)) {
- if (pmd_prot_none(vma, orig_pmd)) {
- do_huge_pmd_prot_none(mm, vma, address, pmd,
+ if (pmd_numa(vma, orig_pmd)) {
+ do_huge_pmd_numa_page(mm, vma, address, pmd,
flags, orig_pmd);
}

2012-10-28 17:11:59

by Rik van Riel

Subject: [tip:numa/core] sched, numa, mm: Add credits for NUMA placement

Commit-ID: c2ef354e5ab9d06a6b914f1241ade0b681330ffb
Gitweb: http://git.kernel.org/tip/c2ef354e5ab9d06a6b914f1241ade0b681330ffb
Author: Rik van Riel <[email protected]>
AuthorDate: Thu, 18 Oct 2012 17:19:28 -0400
Committer: Ingo Molnar <[email protected]>
CommitDate: Sun, 28 Oct 2012 17:31:16 +0100

sched, numa, mm: Add credits for NUMA placement

The NUMA placement code has been rewritten several times, but
the basic ideas took a lot of work to develop. The people who
put in the work deserve credit for it. Thanks Andrea & Peter :)

[ The Documentation/scheduler/numa-problem.txt file should
probably be rewritten once we figure out the final details of
what the NUMA code needs to do, and why. ]

Signed-off-by: Rik van Riel <[email protected]>
Acked-by: Peter Zijlstra <[email protected]>
Cc: [email protected]
Cc: Linus Torvalds <[email protected]>
Cc: Andrew Morton <[email protected]>
Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: Ingo Molnar <[email protected]>
----
This is against tip.git numa/core
---
CREDITS | 1 +
kernel/sched/fair.c | 3 +++
mm/memory.c | 2 ++
3 files changed, 6 insertions(+), 0 deletions(-)

diff --git a/CREDITS b/CREDITS
index d8fe12a..b4cdc8f 100644
--- a/CREDITS
+++ b/CREDITS
@@ -125,6 +125,7 @@ D: Author of pscan that helps to fix lp/parport bugs
D: Author of lil (Linux Interrupt Latency benchmark)
D: Fixed the shm swap deallocation at swapoff time (try_to_unuse message)
D: VM hacker
+D: NUMA task placement
D: Various other kernel hacks
S: Imola 40026
S: Italy
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index bfe07cd..3e51cfd 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -18,6 +18,9 @@
*
* Adaptive scheduling granularity, math enhancements by Peter Zijlstra
* Copyright (C) 2007 Red Hat, Inc., Peter Zijlstra <[email protected]>
+ *
+ * NUMA placement, statistics and algorithm by Andrea Arcangeli,
+ * CFS balancing changes by Peter Zijlstra. Copyright (C) 2012 Red Hat, Inc.
*/

#include <linux/latencytop.h>
diff --git a/mm/memory.c b/mm/memory.c
index 3ecfeca..2c17d82 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -36,6 +36,8 @@
* ([email protected])
*
* Aug/Sep 2004 Changed to four level page tables (Andi Kleen)
+ *
+ * 2012 - NUMA placement page faults (Andrea Arcangeli, Peter Zijlstra)
*/

#include <linux/kernel_stat.h>

2012-11-13 15:32:24

by Rik van Riel

Subject: [tip:numa/core] sched, numa, mm: Add credits for NUMA placement

Commit-ID: 0c4a966d7968363c833e58068da58a121095b075
Gitweb: http://git.kernel.org/tip/0c4a966d7968363c833e58068da58a121095b075
Author: Rik van Riel <[email protected]>
AuthorDate: Thu, 18 Oct 2012 17:19:28 -0400
Committer: Ingo Molnar <[email protected]>
CommitDate: Tue, 13 Nov 2012 14:11:50 +0100

sched, numa, mm: Add credits for NUMA placement

The NUMA placement code has been rewritten several times, but
the basic ideas took a lot of work to develop. The people who
put in the work deserve credit for it. Thanks Andrea & Peter :)

[ The Documentation/scheduler/numa-problem.txt file should
probably be rewritten once we figure out the final details of
what the NUMA code needs to do, and why. ]

Signed-off-by: Rik van Riel <[email protected]>
Acked-by: Peter Zijlstra <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Cc: Andrea Arcangeli <[email protected]>
Cc: Rik van Riel <[email protected]>
Cc: Mel Gorman <[email protected]>
Cc: Linus Torvalds <[email protected]>
Cc: Andrew Morton <[email protected]>
Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: Ingo Molnar <[email protected]>
----
This is against tip.git numa/core
---
CREDITS | 1 +
kernel/sched/fair.c | 3 +++
mm/memory.c | 2 ++
3 files changed, 6 insertions(+), 0 deletions(-)

diff --git a/CREDITS b/CREDITS
index d8fe12a..b4cdc8f 100644
--- a/CREDITS
+++ b/CREDITS
@@ -125,6 +125,7 @@ D: Author of pscan that helps to fix lp/parport bugs
D: Author of lil (Linux Interrupt Latency benchmark)
D: Fixed the shm swap deallocation at swapoff time (try_to_unuse message)
D: VM hacker
+D: NUMA task placement
D: Various other kernel hacks
S: Imola 40026
S: Italy
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 93f4de4..309a254 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -18,6 +18,9 @@
*
* Adaptive scheduling granularity, math enhancements by Peter Zijlstra
* Copyright (C) 2007 Red Hat, Inc., Peter Zijlstra <[email protected]>
+ *
+ * NUMA placement, statistics and algorithm by Andrea Arcangeli,
+ * CFS balancing changes by Peter Zijlstra. Copyright (C) 2012 Red Hat, Inc.
*/

#include <linux/latencytop.h>
diff --git a/mm/memory.c b/mm/memory.c
index 1b9108c..ebd18fd 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -36,6 +36,8 @@
* ([email protected])
*
* Aug/Sep 2004 Changed to four level page tables (Andi Kleen)
+ *
+ * 2012 - NUMA placement page faults (Andrea Arcangeli, Peter Zijlstra)
*/

#include <linux/kernel_stat.h>

2012-11-13 17:24:05

by Rik van Riel

Subject: [tip:numa/core] sched, numa, mm: Add credits for NUMA placement

Commit-ID: b644797f6b72b9b9b0cf35bdf7981f5602725bea
Gitweb: http://git.kernel.org/tip/b644797f6b72b9b9b0cf35bdf7981f5602725bea
Author: Rik van Riel <[email protected]>
AuthorDate: Thu, 18 Oct 2012 17:19:28 -0400
Committer: Ingo Molnar <[email protected]>
CommitDate: Tue, 13 Nov 2012 18:09:25 +0100

sched, numa, mm: Add credits for NUMA placement

The NUMA placement code has been rewritten several times, but
the basic ideas took a lot of work to develop. The people who
put in the work deserve credit for it. Thanks Andrea & Peter :)

[ The Documentation/scheduler/numa-problem.txt file should
probably be rewritten once we figure out the final details of
what the NUMA code needs to do, and why. ]

Signed-off-by: Rik van Riel <[email protected]>
Acked-by: Peter Zijlstra <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Cc: Andrea Arcangeli <[email protected]>
Cc: Rik van Riel <[email protected]>
Cc: Mel Gorman <[email protected]>
Cc: Linus Torvalds <[email protected]>
Cc: Andrew Morton <[email protected]>
Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: Ingo Molnar <[email protected]>
----
This is against tip.git numa/core
---
CREDITS | 1 +
kernel/sched/fair.c | 3 +++
mm/memory.c | 2 ++
3 files changed, 6 insertions(+), 0 deletions(-)

diff --git a/CREDITS b/CREDITS
index d8fe12a..b4cdc8f 100644
--- a/CREDITS
+++ b/CREDITS
@@ -125,6 +125,7 @@ D: Author of pscan that helps to fix lp/parport bugs
D: Author of lil (Linux Interrupt Latency benchmark)
D: Fixed the shm swap deallocation at swapoff time (try_to_unuse message)
D: VM hacker
+D: NUMA task placement
D: Various other kernel hacks
S: Imola 40026
S: Italy
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 93f4de4..309a254 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -18,6 +18,9 @@
*
* Adaptive scheduling granularity, math enhancements by Peter Zijlstra
* Copyright (C) 2007 Red Hat, Inc., Peter Zijlstra <[email protected]>
+ *
+ * NUMA placement, statistics and algorithm by Andrea Arcangeli,
+ * CFS balancing changes by Peter Zijlstra. Copyright (C) 2012 Red Hat, Inc.
*/

#include <linux/latencytop.h>
diff --git a/mm/memory.c b/mm/memory.c
index 1b9108c..ebd18fd 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -36,6 +36,8 @@
* ([email protected])
*
* Aug/Sep 2004 Changed to four level page tables (Andi Kleen)
+ *
+ * 2012 - NUMA placement page faults (Andrea Arcangeli, Peter Zijlstra)
*/

#include <linux/kernel_stat.h>

2012-11-19 19:47:13

by Rik van Riel

Subject: [tip:numa/core] sched, numa, mm: Add credits for NUMA placement

Commit-ID: cb27d6087bc812d0624ef774a9ddee81f7cc0895
Gitweb: http://git.kernel.org/tip/cb27d6087bc812d0624ef774a9ddee81f7cc0895
Author: Rik van Riel <[email protected]>
AuthorDate: Thu, 18 Oct 2012 17:19:28 -0400
Committer: Ingo Molnar <[email protected]>
CommitDate: Mon, 19 Nov 2012 03:31:54 +0100

sched, numa, mm: Add credits for NUMA placement

The NUMA placement code has been rewritten several times, but
the basic ideas took a lot of work to develop. The people who
put in the work deserve credit for it. Thanks Andrea & Peter :)

[ The Documentation/scheduler/numa-problem.txt file should
probably be rewritten once we figure out the final details of
what the NUMA code needs to do, and why. ]

Signed-off-by: Rik van Riel <[email protected]>
Acked-by: Peter Zijlstra <[email protected]>
Cc: Andrea Arcangeli <[email protected]>
Cc: Rik van Riel <[email protected]>
Cc: Mel Gorman <[email protected]>
Cc: Linus Torvalds <[email protected]>
Cc: Andrew Morton <[email protected]>
Cc: Thomas Gleixner <[email protected]>
Cc: Hugh Dickins <[email protected]>
Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: Ingo Molnar <[email protected]>
----
This is against tip.git numa/core
---
CREDITS | 1 +
kernel/sched/fair.c | 3 +++
mm/memory.c | 2 ++
3 files changed, 6 insertions(+)

diff --git a/CREDITS b/CREDITS
index d8fe12a..b4cdc8f 100644
--- a/CREDITS
+++ b/CREDITS
@@ -125,6 +125,7 @@ D: Author of pscan that helps to fix lp/parport bugs
D: Author of lil (Linux Interrupt Latency benchmark)
D: Fixed the shm swap deallocation at swapoff time (try_to_unuse message)
D: VM hacker
+D: NUMA task placement
D: Various other kernel hacks
S: Imola 40026
S: Italy
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 511fbb8..8af0208 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -18,6 +18,9 @@
*
* Adaptive scheduling granularity, math enhancements by Peter Zijlstra
* Copyright (C) 2007 Red Hat, Inc., Peter Zijlstra <[email protected]>
+ *
+ * NUMA placement, statistics and algorithm by Andrea Arcangeli,
+ * CFS balancing changes by Peter Zijlstra. Copyright (C) 2012 Red Hat, Inc.
*/

#include <linux/latencytop.h>
diff --git a/mm/memory.c b/mm/memory.c
index 52ad29d..1f733dc 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -36,6 +36,8 @@
* ([email protected])
*
* Aug/Sep 2004 Changed to four level page tables (Andi Kleen)
+ *
+ * 2012 - NUMA placement page faults (Andrea Arcangeli, Peter Zijlstra)
*/

#include <linux/kernel_stat.h>