2014-02-25 18:16:54

by Davidlohr Bueso

[permalink] [raw]
Subject: [PATCH v2] mm: per-thread vma caching

From: Davidlohr Bueso <[email protected]>

This patch is a continuation of efforts trying to optimize find_vma(),
avoiding potentially expensive rbtree walks to locate a vma upon faults.
The original approach (https://lkml.org/lkml/2013/11/1/410), where the
largest vma was also cached, ended up being too specific and random, thus
further comparison with other approaches were needed. There are two things
to consider when dealing with this, the cache hit rate and the latency of
find_vma(). Improving the hit-rate does not necessarily translate in finding
the vma any faster, as the overhead of any fancy caching schemes can be too
high to consider.

We currently cache the last used vma for the whole address space, which
provides a nice optimization, reducing the total cycles in find_vma() by up
to 250%, for workloads with good locality. On the other hand, this simple
scheme is pretty much useless for workloads with poor locality. Analyzing
ebizzy runs shows that, no matter how many threads are running, the
mmap_cache hit rate is less than 2%, and in many situations below 1%.

The proposed approach is to keep the current cache and adding a small, per
thread, LRU cache. By keeping the mm->mmap_cache, programs with large heaps
or good locality can benefit by not having to deal with an additional cache
when the hit rate is good enough. Concretely, the following results are seen
on an 80 core, 8 socket x86-64 box:

1) System bootup: Most programs are single threaded, so the per-thread scheme
does improve ~50% hit rate by just adding a few more slots to the cache.

+----------------+----------+------------------+
| caching scheme | hit-rate | cycles (billion) |
+----------------+----------+------------------+
| baseline | 50.61% | 19.90 |
| patched | 73.45% | 13.58 |
+----------------+----------+------------------+

2) Kernel build: This one is already pretty good with the current approach
as we're dealing with good locality.

+----------------+----------+------------------+
| caching scheme | hit-rate | cycles (billion) |
+----------------+----------+------------------+
| baseline | 75.28% | 11.03 |
| patched | 88.09% | 9.31 |
+----------------+----------+------------------+

3) Oracle 11g Data Mining (4k pages): Similar to the kernel build workload.

+----------------+----------+------------------+
| caching scheme | hit-rate | cycles (billion) |
+----------------+----------+------------------+
| baseline | 70.66% | 17.14 |
| patched | 91.15% | 12.57 |
+----------------+----------+------------------+

4) Ebizzy: There's a fair amount of variation from run to run, but this
approach always shows nearly perfect hit rates, while baseline is just
about non-existent. The amounts of cycles can fluctuate between anywhere
from ~60 to ~116 for the baseline scheme, but this approach reduces it
considerably. For instance, with 80 threads:

+----------------+----------+------------------+
| caching scheme | hit-rate | cycles (billion) |
+----------------+----------+------------------+
| baseline | 1.06% | 91.54 |
| patched | 99.97% | 14.18 |
+----------------+----------+------------------+

Systems with !CONFIG_MMU get to keep the current logic.

Signed-off-by: Davidlohr Bueso <[email protected]>
---
Changes from v1 (https://lkml.org/lkml/2014/2/21/8):
- Removed the per-mm cache for CONFIG_MMU, only having the
per thread approach.

- Since lookups are always performed before updates, only
invalidate when searching for a vma. Simplify the updating.

- Hash based on the page# instead of the remaining two bits
of the offset, results show that both methods are pretty
much identical for hit-rates.

Please note that nommu and unicore32 arch are *untested*.
Thanks.

arch/unicore32/include/asm/mmu_context.h | 10 ++++-
fs/proc/task_mmu.c | 2 +-
include/linux/mm_types.h | 6 ++-
include/linux/sched.h | 5 +++
include/linux/vmacache.h | 24 +++++++++++
kernel/debug/debug_core.c | 6 ++-
kernel/fork.c | 7 +++-
mm/Makefile | 2 +-
mm/mmap.c | 54 +++++++++++++------------
mm/nommu.c | 12 +++---
mm/vmacache.c | 69 ++++++++++++++++++++++++++++++++
11 files changed, 157 insertions(+), 40 deletions(-)
create mode 100644 include/linux/vmacache.h
create mode 100644 mm/vmacache.c

diff --git a/arch/unicore32/include/asm/mmu_context.h b/arch/unicore32/include/asm/mmu_context.h
index fb5e4c6..c571759 100644
--- a/arch/unicore32/include/asm/mmu_context.h
+++ b/arch/unicore32/include/asm/mmu_context.h
@@ -56,6 +56,14 @@ switch_mm(struct mm_struct *prev, struct mm_struct *next,
#define deactivate_mm(tsk, mm) do { } while (0)
#define activate_mm(prev, next) switch_mm(prev, next, NULL)

+static inline void __vmacache_invalidate(struct mm_struct *mm)
+{
+#ifdef CONFIG_MMU
+ vmacache_invalidate(mm);
+#else
+ mm->vmacache = NULL;
+#endif
+}
/*
* We are inserting a "fake" vma for the user-accessible vector page so
* gdb and friends can get to it through ptrace and /proc/<pid>/mem.
@@ -73,7 +81,7 @@ do { \
else \
mm->mmap = NULL; \
rb_erase(&high_vma->vm_rb, &mm->mm_rb); \
- mm->mmap_cache = NULL; \
+ __vmacache_invalidate(mm); \
mm->map_count--; \
remove_vma(high_vma); \
} \
diff --git a/fs/proc/task_mmu.c b/fs/proc/task_mmu.c
index fb52b54..231c836 100644
--- a/fs/proc/task_mmu.c
+++ b/fs/proc/task_mmu.c
@@ -152,7 +152,7 @@ static void *m_start(struct seq_file *m, loff_t *pos)

/*
* We remember last_addr rather than next_addr to hit with
- * mmap_cache most of the time. We have zero last_addr at
+ * vmacache most of the time. We have zero last_addr at
* the beginning and also after lseek. We will have -1 last_addr
* after the end of the vmas.
*/
diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h
index 290901a..e8b90b0 100644
--- a/include/linux/mm_types.h
+++ b/include/linux/mm_types.h
@@ -342,13 +342,15 @@ struct mm_rss_stat {

struct kioctx_table;
struct mm_struct {
- struct vm_area_struct * mmap; /* list of VMAs */
+ struct vm_area_struct *mmap; /* list of VMAs */
struct rb_root mm_rb;
- struct vm_area_struct * mmap_cache; /* last find_vma result */
#ifdef CONFIG_MMU
+ u32 vmacache_seqnum; /* per-thread vmacache */
unsigned long (*get_unmapped_area) (struct file *filp,
unsigned long addr, unsigned long len,
unsigned long pgoff, unsigned long flags);
+#else
+ struct vm_area_struct *vmacache; /* last find_vma result */
#endif
unsigned long mmap_base; /* base of mmap area */
unsigned long mmap_legacy_base; /* base of mmap area in bottom-up allocations */
diff --git a/include/linux/sched.h b/include/linux/sched.h
index a781dec..09dd1ff 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -23,6 +23,7 @@ struct sched_param {
#include <linux/errno.h>
#include <linux/nodemask.h>
#include <linux/mm_types.h>
+#include <linux/vmacache.h>
#include <linux/preempt_mask.h>

#include <asm/page.h>
@@ -1228,6 +1229,10 @@ struct task_struct {
#ifdef CONFIG_COMPAT_BRK
unsigned brk_randomized:1;
#endif
+#ifdef CONFIG_MMU
+ u32 vmacache_seqnum;
+ struct vm_area_struct *vmacache[VMACACHE_SIZE];
+#endif
#if defined(SPLIT_RSS_COUNTING)
struct task_rss_stat rss_stat;
#endif
diff --git a/include/linux/vmacache.h b/include/linux/vmacache.h
new file mode 100644
index 0000000..4fb7841
--- /dev/null
+++ b/include/linux/vmacache.h
@@ -0,0 +1,24 @@
+#ifndef __LINUX_VMACACHE_H
+#define __LINUX_VMACACHE_H
+
+#include <linux/mm.h>
+
+#define VMACACHE_SIZE 4
+
+extern void vmacache_invalidate_all(void);
+
+static inline void vmacache_invalidate(struct mm_struct *mm)
+{
+ mm->vmacache_seqnum++;
+
+ /* deal with overflows */
+ if (unlikely(mm->vmacache_seqnum == 0))
+ vmacache_invalidate_all();
+}
+
+extern void vmacache_update(struct mm_struct *mm, unsigned long addr,
+ struct vm_area_struct *newvma);
+extern struct vm_area_struct *vmacache_find(struct mm_struct *mm,
+ unsigned long addr);
+
+#endif /* __LINUX_VMACACHE_H */
diff --git a/kernel/debug/debug_core.c b/kernel/debug/debug_core.c
index 334b398..fefc055 100644
--- a/kernel/debug/debug_core.c
+++ b/kernel/debug/debug_core.c
@@ -224,10 +224,12 @@ static void kgdb_flush_swbreak_addr(unsigned long addr)
if (!CACHE_FLUSH_IS_SAFE)
return;

- if (current->mm && current->mm->mmap_cache) {
- flush_cache_range(current->mm->mmap_cache,
+#ifndef CONFIG_MMU
+ if (current->mm && current->mm->vmacache) {
+ flush_cache_range(current->mm->vmacache,
addr, addr + BREAK_INSTR_SIZE);
}
+#endif
/* Force flush instruction cache if it was outside the mm */
flush_icache_range(addr, addr + BREAK_INSTR_SIZE);
}
diff --git a/kernel/fork.c b/kernel/fork.c
index a17621c..14396bf 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -363,7 +363,12 @@ static int dup_mmap(struct mm_struct *mm, struct mm_struct *oldmm)

mm->locked_vm = 0;
mm->mmap = NULL;
- mm->mmap_cache = NULL;
+ mm->vmacache_seqnum = oldmm->vmacache_seqnum + 1;
+
+ /* deal with overflows */
+ if (unlikely(mm->vmacache_seqnum == 0))
+ vmacache_invalidate_all();
+
mm->map_count = 0;
cpumask_clear(mm_cpumask(mm));
mm->mm_rb = RB_ROOT;
diff --git a/mm/Makefile b/mm/Makefile
index 310c90a..2bb5b6a 100644
--- a/mm/Makefile
+++ b/mm/Makefile
@@ -5,7 +5,7 @@
mmu-y := nommu.o
mmu-$(CONFIG_MMU) := fremap.o highmem.o madvise.o memory.o mincore.o \
mlock.o mmap.o mprotect.o mremap.o msync.o rmap.o \
- vmalloc.o pagewalk.o pgtable-generic.o
+ vmalloc.o pagewalk.o pgtable-generic.o vmacache.o

ifdef CONFIG_CROSS_MEMORY_ATTACH
mmu-$(CONFIG_MMU) += process_vm_access.o
diff --git a/mm/mmap.c b/mm/mmap.c
index 20ff0c3..dfd7fe7 100644
--- a/mm/mmap.c
+++ b/mm/mmap.c
@@ -681,8 +681,9 @@ __vma_unlink(struct mm_struct *mm, struct vm_area_struct *vma,
prev->vm_next = next = vma->vm_next;
if (next)
next->vm_prev = prev;
- if (mm->mmap_cache == vma)
- mm->mmap_cache = prev;
+
+ /* Kill the cache */
+ vmacache_invalidate(mm);
}

/*
@@ -1989,34 +1990,33 @@ EXPORT_SYMBOL(get_unmapped_area);
/* Look up the first VMA which satisfies addr < vm_end, NULL if none. */
struct vm_area_struct *find_vma(struct mm_struct *mm, unsigned long addr)
{
- struct vm_area_struct *vma = NULL;
+ struct rb_node *rb_node;
+ struct vm_area_struct *vma;

/* Check the cache first. */
- /* (Cache hit rate is typically around 35%.) */
- vma = ACCESS_ONCE(mm->mmap_cache);
- if (!(vma && vma->vm_end > addr && vma->vm_start <= addr)) {
- struct rb_node *rb_node;
+ vma = vmacache_find(mm, addr);
+ if (likely(vma))
+ return vma;

- rb_node = mm->mm_rb.rb_node;
- vma = NULL;
+ rb_node = mm->mm_rb.rb_node;
+ vma = NULL;

- while (rb_node) {
- struct vm_area_struct *vma_tmp;
-
- vma_tmp = rb_entry(rb_node,
- struct vm_area_struct, vm_rb);
-
- if (vma_tmp->vm_end > addr) {
- vma = vma_tmp;
- if (vma_tmp->vm_start <= addr)
- break;
- rb_node = rb_node->rb_left;
- } else
- rb_node = rb_node->rb_right;
- }
- if (vma)
- mm->mmap_cache = vma;
+ while (rb_node) {
+ struct vm_area_struct *tmp;
+
+ tmp = rb_entry(rb_node, struct vm_area_struct, vm_rb);
+
+ if (tmp->vm_end > addr) {
+ vma = tmp;
+ if (tmp->vm_start <= addr)
+ break;
+ rb_node = rb_node->rb_left;
+ } else
+ rb_node = rb_node->rb_right;
}
+
+ if (vma)
+ vmacache_update(mm, addr, vma);
return vma;
}

@@ -2388,7 +2388,9 @@ detach_vmas_to_be_unmapped(struct mm_struct *mm, struct vm_area_struct *vma,
} else
mm->highest_vm_end = prev ? prev->vm_end : 0;
tail_vma->vm_next = NULL;
- mm->mmap_cache = NULL; /* Kill the cache. */
+
+ /* Kill the cache */
+ vmacache_invalidate(mm);
}

/*
diff --git a/mm/nommu.c b/mm/nommu.c
index 8740213..c2d1b92 100644
--- a/mm/nommu.c
+++ b/mm/nommu.c
@@ -776,8 +776,8 @@ static void delete_vma_from_mm(struct vm_area_struct *vma)
protect_vma(vma, 0);

mm->map_count--;
- if (mm->mmap_cache == vma)
- mm->mmap_cache = NULL;
+ if (mm->vmacache == vma)
+ mm->vmacache = NULL;

/* remove the VMA from the mapping */
if (vma->vm_file) {
@@ -825,7 +825,7 @@ struct vm_area_struct *find_vma(struct mm_struct *mm, unsigned long addr)
struct vm_area_struct *vma;

/* check the cache first */
- vma = ACCESS_ONCE(mm->mmap_cache);
+ vma = ACCESS_ONCE(mm->vmacache);
if (vma && vma->vm_start <= addr && vma->vm_end > addr)
return vma;

@@ -835,7 +835,7 @@ struct vm_area_struct *find_vma(struct mm_struct *mm, unsigned long addr)
if (vma->vm_start > addr)
return NULL;
if (vma->vm_end > addr) {
- mm->mmap_cache = vma;
+ mm->vmacache = vma;
return vma;
}
}
@@ -874,7 +874,7 @@ static struct vm_area_struct *find_vma_exact(struct mm_struct *mm,
unsigned long end = addr + len;

/* check the cache first */
- vma = mm->mmap_cache;
+ vma = mm->vmacache;
if (vma && vma->vm_start == addr && vma->vm_end == end)
return vma;

@@ -886,7 +886,7 @@ static struct vm_area_struct *find_vma_exact(struct mm_struct *mm,
if (vma->vm_start > addr)
return NULL;
if (vma->vm_end == end) {
- mm->mmap_cache = vma;
+ mm->vmacache = vma;
return vma;
}
}
diff --git a/mm/vmacache.c b/mm/vmacache.c
new file mode 100644
index 0000000..d577ad3
--- /dev/null
+++ b/mm/vmacache.c
@@ -0,0 +1,69 @@
+#include <linux/sched.h>
+#include <linux/vmacache.h>
+
+void vmacache_invalidate_all(void)
+{
+ struct task_struct *g, *p;
+
+ rcu_read_lock();
+ for_each_process_thread(g, p) {
+ /*
+ * Only flush the vmacache pointers as the
+ * mm seqnum is already set and curr's will
+ * be set upon invalidation when the next
+ * lookup is done.
+ */
+ memset(p->vmacache, 0, sizeof(p->vmacache));
+ }
+ rcu_read_unlock();
+}
+
+static bool vmacache_valid(struct mm_struct *mm)
+{
+ struct task_struct *curr = current;
+
+ if (mm != curr->mm)
+ return false;
+
+ if (mm->vmacache_seqnum != curr->vmacache_seqnum) {
+ /*
+ * First attempt will always be invalid, initialize the
+ * new cache for this task here.
+ */
+ curr->vmacache_seqnum = mm->vmacache_seqnum;
+ memset(curr->vmacache, 0, sizeof(curr->vmacache));
+ return false;
+ }
+ return true;
+}
+
+struct vm_area_struct *vmacache_find(struct mm_struct *mm,
+ unsigned long addr)
+
+{
+ int i;
+
+ if (!vmacache_valid(mm))
+ return NULL;
+
+ for (i = 0; i < VMACACHE_SIZE; i++) {
+ struct vm_area_struct *vma = current->vmacache[i];
+
+ if (vma && vma->vm_start <= addr && vma->vm_end > addr)
+ return vma;
+ }
+
+ return NULL;
+}
+
+void vmacache_update(struct mm_struct *mm, unsigned long addr,
+ struct vm_area_struct *newvma)
+{
+ /*
+ * Hash based on the page number. Provides a good
+ * hit rate for workloads with good locality and
+ * those with random accesses as well.
+ */
+ int idx = (addr >> PAGE_SHIFT) & 3;
+ current->vmacache[idx] = newvma;
+}
--
1.8.1.4



2014-02-25 18:25:15

by Rik van Riel

[permalink] [raw]
Subject: Re: [PATCH v2] mm: per-thread vma caching

On 02/25/2014 01:16 PM, Davidlohr Bueso wrote:

> The proposed approach is to keep the current cache and adding a small, per
> thread, LRU cache. By keeping the mm->mmap_cache,

This bit of the changelog may want updating :)

> Changes from v1 (https://lkml.org/lkml/2014/2/21/8):
> - Removed the per-mm cache for CONFIG_MMU, only having the
> per thread approach.

The patch looks good.

Reviewed-by: Rik van Riel <[email protected]>

--
All rights reversed

2014-02-25 18:35:40

by Peter Zijlstra

[permalink] [raw]
Subject: Re: [PATCH v2] mm: per-thread vma caching

On Tue, Feb 25, 2014 at 10:16:46AM -0800, Davidlohr Bueso wrote:
> +void vmacache_update(struct mm_struct *mm, unsigned long addr,
> + struct vm_area_struct *newvma)
> +{
> + /*
> + * Hash based on the page number. Provides a good
> + * hit rate for workloads with good locality and
> + * those with random accesses as well.
> + */
> + int idx = (addr >> PAGE_SHIFT) & 3;

% VMACACHE_SIZE

perhaps? GCC should turn that into a mask for all sensible values I
would think.

Barring that I think something like:

#define VMACACHE_BITS 2
#define VMACACHE_SIZE (1U << VMACACHE_BITS)
#define VMACACHE_MASK (VMACACHE_SIZE - 1)

Might do I suppose.

> + current->vmacache[idx] = newvma;
> +}
> --
> 1.8.1.4
>
>
>

2014-02-25 18:36:35

by Davidlohr Bueso

[permalink] [raw]
Subject: Re: [PATCH v2] mm: per-thread vma caching

On Tue, 2014-02-25 at 13:24 -0500, Rik van Riel wrote:
> On 02/25/2014 01:16 PM, Davidlohr Bueso wrote:
>
> > The proposed approach is to keep the current cache and adding a small, per
> > thread, LRU cache. By keeping the mm->mmap_cache,
>
> This bit of the changelog may want updating :)

bah, yes thanks.

2014-02-25 18:37:39

by Linus Torvalds

[permalink] [raw]
Subject: Re: [PATCH v2] mm: per-thread vma caching

On Tue, Feb 25, 2014 at 10:16 AM, Davidlohr Bueso <[email protected]> wrote:
> index a17621c..14396bf 100644
> --- a/kernel/fork.c
> +++ b/kernel/fork.c
> @@ -363,7 +363,12 @@ static int dup_mmap(struct mm_struct *mm, struct mm_struct *oldmm)
>
> mm->locked_vm = 0;
> mm->mmap = NULL;
> - mm->mmap_cache = NULL;
> + mm->vmacache_seqnum = oldmm->vmacache_seqnum + 1;
> +
> + /* deal with overflows */
> + if (unlikely(mm->vmacache_seqnum == 0))
> + vmacache_invalidate_all();

Correct me if I'm wrong, but this can not possibly be correct.

vmacache_invalidate_all() walks over all the threads of the current
process, but "mm" here is the mm of the *new* process that is getting
created, and is unrelated in all ways to the threads of the old
process.

So it walks completely the wrong list of threads.

In fact, the sequence number of the old vm and the sequence number of
the new vm cannot in any way be related.

As far as I can tell, the only sane thing to do at fork/clone() time is to:

- clear all the cache entries (of the new 'struct task_struct'! - so
not in dup_mmap, but make sure it's zeroed when allocating!)(

- set vmcache_seqnum to 0 in dup_mmap (since any sequence number is
fine when it got invalidated, and 0 is best for "avoid overflow").

but I haven't thought deeply about this, but I pretty much guarantee
that the quoted sequence above is wrong as-is.

Linus

2014-02-25 18:37:55

by Davidlohr Bueso

[permalink] [raw]
Subject: Re: [PATCH v2] mm: per-thread vma caching

On Tue, 2014-02-25 at 19:35 +0100, Peter Zijlstra wrote:
> On Tue, Feb 25, 2014 at 10:16:46AM -0800, Davidlohr Bueso wrote:
> > +void vmacache_update(struct mm_struct *mm, unsigned long addr,
> > + struct vm_area_struct *newvma)
> > +{
> > + /*
> > + * Hash based on the page number. Provides a good
> > + * hit rate for workloads with good locality and
> > + * those with random accesses as well.
> > + */
> > + int idx = (addr >> PAGE_SHIFT) & 3;
>
> % VMACACHE_SIZE
>
> perhaps? GCC should turn that into a mask for all sensible values I
> would think.
>
> Barring that I think something like:
>
> #define VMACACHE_BITS 2
> #define VMACACHE_SIZE (1U << VMACACHE_BITS)
> #define VMACACHE_MASK (VMACACHE_SIZE - 1)

Hmm all that seems like an overkill.

2014-02-25 18:45:54

by Linus Torvalds

[permalink] [raw]
Subject: Re: [PATCH v2] mm: per-thread vma caching

On Tue, Feb 25, 2014 at 10:37 AM, Linus Torvalds
<[email protected]> wrote:
>
> - clear all the cache entries (of the new 'struct task_struct'! - so
> not in dup_mmap, but make sure it's zeroed when allocating!)(
>
> - set vmcache_seqnum to 0 in dup_mmap (since any sequence number is
> fine when it got invalidated, and 0 is best for "avoid overflow").

Btw, as far as I can tell, that also makes the per-thread vmacache
automatically do the right thing for the non-MMU case, so that you
could just remove the difference between CONFIG_MMU and NOMMU.

Basically, dup_mmap() should no longer have anything to do with the
vmacache, since it is now per-thread, not per-mm.

So :

- allocating a new "struct mm_struct" should clear the
vmacache_seqnum for that new mm, to try to minimize unnecessary future
overflow.

- thread allocation should just zero the cache entries, and set
"tsk->vmacache_seqnum = mm->vmacache_seqnum" (after dup_mm()) to avoid
future unnecessary flushes.

and as far as I can tell, the logic would be exactly the same on NOMMU
(the dup_mm just doesn't happen, since all forks are basically sharing
mm).

And maybe you'd want to make VMACACHE_SIZE be 1 on NOMMU (and make
sure to change the "& 3" to "& (VMACACHE_SIZE-1)". Just to keep the
size down on small systems that really don't need it.

Linus

2014-02-25 19:04:05

by Davidlohr Bueso

[permalink] [raw]
Subject: Re: [PATCH v2] mm: per-thread vma caching

On Tue, 2014-02-25 at 10:37 -0800, Linus Torvalds wrote:
> On Tue, Feb 25, 2014 at 10:16 AM, Davidlohr Bueso <[email protected]> wrote:
> > index a17621c..14396bf 100644
> > --- a/kernel/fork.c
> > +++ b/kernel/fork.c
> > @@ -363,7 +363,12 @@ static int dup_mmap(struct mm_struct *mm, struct mm_struct *oldmm)
> >
> > mm->locked_vm = 0;
> > mm->mmap = NULL;
> > - mm->mmap_cache = NULL;
> > + mm->vmacache_seqnum = oldmm->vmacache_seqnum + 1;
> > +
> > + /* deal with overflows */
> > + if (unlikely(mm->vmacache_seqnum == 0))
> > + vmacache_invalidate_all();
>
> Correct me if I'm wrong, but this can not possibly be correct.
>
> vmacache_invalidate_all() walks over all the threads of the current
> process, but "mm" here is the mm of the *new* process that is getting
> created, and is unrelated in all ways to the threads of the old
> process.

vmacache_invalidate_all() is actually a misleading name since we really
aren't invalidating but just clearing the cache. I'll rename it.
Anyways...

> So it walks completely the wrong list of threads.

But we still need to deal with the rest of the tasks in the system, so
anytime there's an overflow we need to nullify all cached vmas, not just
current's. Am I missing something special about fork?

> In fact, the sequence number of the old vm and the sequence number of
> the new vm cannot in any way be related.
>
> As far as I can tell, the only sane thing to do at fork/clone() time is to:
>
> - clear all the cache entries (of the new 'struct task_struct'! - so
> not in dup_mmap, but make sure it's zeroed when allocating!)(

Right, but that's done upon the first lookup, when vmacache_valid() is
false.

> - set vmcache_seqnum to 0 in dup_mmap (since any sequence number is
> fine when it got invalidated, and 0 is best for "avoid overflow").

Assuming your referring to curr->vmacache_seqnum (since mm's is already
set).. isn't it irrelevant since we set it anyways when the first lookup
fails?

Thanks,
Davidlohr

2014-02-25 19:30:06

by Davidlohr Bueso

[permalink] [raw]
Subject: Re: [PATCH v2] mm: per-thread vma caching

On Tue, 2014-02-25 at 11:04 -0800, Davidlohr Bueso wrote:
> On Tue, 2014-02-25 at 10:37 -0800, Linus Torvalds wrote:
> > On Tue, Feb 25, 2014 at 10:16 AM, Davidlohr Bueso <[email protected]> wrote:
> > > index a17621c..14396bf 100644
> > > --- a/kernel/fork.c
> > > +++ b/kernel/fork.c
> > > @@ -363,7 +363,12 @@ static int dup_mmap(struct mm_struct *mm, struct mm_struct *oldmm)
> > >
> > > mm->locked_vm = 0;
> > > mm->mmap = NULL;
> > > - mm->mmap_cache = NULL;
> > > + mm->vmacache_seqnum = oldmm->vmacache_seqnum + 1;
> > > +
> > > + /* deal with overflows */
> > > + if (unlikely(mm->vmacache_seqnum == 0))
> > > + vmacache_invalidate_all();
> >
> > Correct me if I'm wrong, but this can not possibly be correct.
> >
> > vmacache_invalidate_all() walks over all the threads of the current
> > process, but "mm" here is the mm of the *new* process that is getting
> > created, and is unrelated in all ways to the threads of the old
> > process.
>
> vmacache_invalidate_all() is actually a misleading name since we really
> aren't invalidating but just clearing the cache. I'll rename it.
> Anyways...
>
> > So it walks completely the wrong list of threads.
>
> But we still need to deal with the rest of the tasks in the system, so
> anytime there's an overflow we need to nullify all cached vmas, not just
> current's. Am I missing something special about fork?
>
> > In fact, the sequence number of the old vm and the sequence number of
> > the new vm cannot in any way be related.
> >
> > As far as I can tell, the only sane thing to do at fork/clone() time is to:
> >
> > - clear all the cache entries (of the new 'struct task_struct'! - so
> > not in dup_mmap, but make sure it's zeroed when allocating!)(
>
> Right, but that's done upon the first lookup, when vmacache_valid() is
> false.
>
> > - set vmcache_seqnum to 0 in dup_mmap (since any sequence number is
> > fine when it got invalidated, and 0 is best for "avoid overflow").
>
> Assuming your referring to curr->vmacache_seqnum (since mm's is already
> set).. isn't it irrelevant since we set it anyways when the first lookup
> fails?

Never mind, I see your referring to the mm seqnum. Sounds like it's an
interesting alternative to the CONFIG_MMU workaround. I will look into
it.

Thanks,
Davidlohr

2014-02-25 19:31:54

by Linus Torvalds

[permalink] [raw]
Subject: Re: [PATCH v2] mm: per-thread vma caching

On Tue, Feb 25, 2014 at 11:04 AM, Davidlohr Bueso <[email protected]> wrote:
>
>> So it walks completely the wrong list of threads.
>
> But we still need to deal with the rest of the tasks in the system, so
> anytime there's an overflow we need to nullify all cached vmas, not just
> current's. Am I missing something special about fork?

No, you're missing the much more fundamental issue: you are *not*
incrementing the current mm sequence number at all.

This code here:

mm->vmacache_seqnum = oldmm->vmacache_seqnum + 1;

doesn't change the "oldmm" sequence number.

So invalidating the threads involved with the current sequence number
is totally bogus. It's a completely nonsensical operation. You didn't
do anything to their sequence numbers.

Your argument that "we still need to deal with the rest of the tasks"
is bogus. There is no "rest of the tasks". You're creating a new mm,
WHICH DOESN'T HAVE ANY TASKS YET!

(Well, there is the one half-formed task that is also in the process
of being created, but that you don't even have access to in
"dup_mmap()").

It's also not sensible to make the new vmacache_seqnum have anything
to do with the *old* vmacache seqnum. They are two totally unrelated
things, since they are separate mm's, and they are tied to independent
threads. They basically have nothing in common, so initializing the
new mm sequence number with a value that is related to the old
sequence number is not a sensible operation anyway.

So dup_mmap() should just clear the mm->seqnum, and not touch any
vmacache entries. There are no current vmacache entries associated
with that mm anyway.

And then you should clear the new vmacache entries in the thread
structure, and set vmacache_seqnum to 0 there too. You can do that in
"dup_mm()", which has access to the new task-struct.

And then you're done, as far as fork() is concerned.

Now, the *clone* case (CLONE_VM) is different. See "copy_mm()". For
that case, you should probably just copy the vma cache from the old
thread, and set the sequence number to be the same one. That's
correct, since you are re-using the mm. But in practice, you don't
actually need to do anything, since "dup_task_struct()" should have
done this all. So the CLONE_VM case doesn't actually require any
explicit code, because copying the cache and the sequence number is
already the right thing to do.

Btw, that all reminds me:

The one case you should make sure to double-check is the "exec" case,
which also creates a new mm. So that should clear the vmcache entries
and the seqnum. Currently we clear mm->mmap_cache implicitly in
mm_alloc() because we do a memset(mm, 0) in it.

So in exec_mmap(), as you do the activate_mm(), you need to do that
same "clear vmacache and sequence number for *this* thread (and this
thread *only*)".

Linus

2014-02-25 19:47:09

by Peter Zijlstra

[permalink] [raw]
Subject: Re: [PATCH v2] mm: per-thread vma caching

On Tue, Feb 25, 2014 at 10:37:34AM -0800, Davidlohr Bueso wrote:
> On Tue, 2014-02-25 at 19:35 +0100, Peter Zijlstra wrote:
> > On Tue, Feb 25, 2014 at 10:16:46AM -0800, Davidlohr Bueso wrote:
> > > +void vmacache_update(struct mm_struct *mm, unsigned long addr,
> > > + struct vm_area_struct *newvma)
> > > +{
> > > + /*
> > > + * Hash based on the page number. Provides a good
> > > + * hit rate for workloads with good locality and
> > > + * those with random accesses as well.
> > > + */
> > > + int idx = (addr >> PAGE_SHIFT) & 3;
> >
> > % VMACACHE_SIZE
> >
> > perhaps? GCC should turn that into a mask for all sensible values I
> > would think.
> >
> > Barring that I think something like:
> >
> > #define VMACACHE_BITS 2
> > #define VMACACHE_SIZE (1U << VMACACHE_BITS)
> > #define VMACACHE_MASK (VMACACHE_SIZE - 1)
>
> Hmm all that seems like an overkill.

If GCC does the right thing with % VMACACHE_SIZE it gets rid of an ugly
constant. But the 3 VMACACHE_* things are 'better' in that its
impossible to set VMACACHE_SIZE to silly values.

2014-02-26 02:04:53

by Michel Lespinasse

[permalink] [raw]
Subject: Re: [PATCH v2] mm: per-thread vma caching

On Tue, Feb 25, 2014 at 10:16 AM, Davidlohr Bueso <[email protected]> wrote:
> This patch is a continuation of efforts trying to optimize find_vma(),
> avoiding potentially expensive rbtree walks to locate a vma upon faults.
> The original approach (https://lkml.org/lkml/2013/11/1/410), where the
> largest vma was also cached, ended up being too specific and random, thus
> further comparison with other approaches were needed. There are two things
> to consider when dealing with this, the cache hit rate and the latency of
> find_vma(). Improving the hit-rate does not necessarily translate in finding
> the vma any faster, as the overhead of any fancy caching schemes can be too
> high to consider.

Actually there is also the cost of keeping the cache up to date. I'm
not saying that it's an issue in your proposal - I like the proposal,
especially now that you are replacing the per-mm cache rather than
adding something on top - but it is a factor to consider.


> +static inline void __vmacache_invalidate(struct mm_struct *mm)
> +{
> +#ifdef CONFIG_MMU
> + vmacache_invalidate(mm);
> +#else
> + mm->vmacache = NULL;
> +#endif
> +}

Is there any reason why we can't use your proposal for !CONFIG_MMU as well ?
(I'm assuming that we could reduce preprocessor checks by doing so)


> +void vmacache_invalidate_all(void)
> +{
> + struct task_struct *g, *p;
> +
> + rcu_read_lock();
> + for_each_process_thread(g, p) {
> + /*
> + * Only flush the vmacache pointers as the
> + * mm seqnum is already set and curr's will
> + * be set upon invalidation when the next
> + * lookup is done.
> + */
> + memset(p->vmacache, 0, sizeof(p->vmacache));
> + }
> + rcu_read_unlock();
> +}

Two things:

- I believe we only need to invalidate vma caches for threads that
share a given mm ? we should probably pass in that mm in order to
avoid over-invalidation

- My understanding is that the operation is safe because the caller
has the mm's mmap_sem held for write, and other threads accessing the
vma cache will have mmap_sem held at least for read, so we don't need
extra locking to maintain the vma cache. Please 1- confirm this is the
intention, 2- document this, and 3- only invalidate vma caches for
threads that match the caller's mm so that mmap_sem locking can
actually apply.


> +struct vm_area_struct *vmacache_find(struct mm_struct *mm,
> + unsigned long addr)
> +
> +{
> + int i;
> +
> + if (!vmacache_valid(mm))
> + return NULL;
> +
> + for (i = 0; i < VMACACHE_SIZE; i++) {
> + struct vm_area_struct *vma = current->vmacache[i];
> +
> + if (vma && vma->vm_start <= addr && vma->vm_end > addr)
> + return vma;
> + }
> +
> + return NULL;
> +}
> +
> +void vmacache_update(struct mm_struct *mm, unsigned long addr,
> + struct vm_area_struct *newvma)
> +{
> + /*
> + * Hash based on the page number. Provides a good
> + * hit rate for workloads with good locality and
> + * those with random accesses as well.
> + */
> + int idx = (addr >> PAGE_SHIFT) & 3;
> + current->vmacache[idx] = newvma;
> +}

I did read the previous discussion about how to compute idx here. I
did not at the time realize that you are searching all 4 vmacache
entries on lookup - that is, we are only talking about eviction policy
here, not a lookup hash policy.

My understanding is that the reason both your current and your
previous idx computations work, is that a random eviction policy would
work too. Basically, what you do is pick some address bits that are
'random enough' to use as an eviction policy.

This is more of a question for Linus, but I am very surprised that I
can't find an existing LRU eviction policy scheme in Linux. What I
have in mind is to keep track of the order the cache entries have last
been used. With 4 entries, there are 4! = 24 possible orders, which
can be represented with an integer between 0 and 23. When
vmacache_find suceeds, that integer is updated using a table lookup
(table takes 24*4 = 96 bytes). In vmacache_update, the lru value
module 4 indicates which cache way to evict (i.e. it's the one that's
been least recently used).

I don't think it makes sense to introduce such an LRU facility for
this cache only, but it's so generically useful, I'm very surprised
that it's not somewhere already and instead we see people inventing
new eviction schemes over and over again...

--
Michel "Walken" Lespinasse
A program is never fully debugged until the last user dies.

2014-02-26 04:04:38

by Davidlohr Bueso

[permalink] [raw]
Subject: Re: [PATCH v2] mm: per-thread vma caching

On Tue, 2014-02-25 at 18:04 -0800, Michel Lespinasse wrote:
> On Tue, Feb 25, 2014 at 10:16 AM, Davidlohr Bueso <[email protected]> wrote:
> > This patch is a continuation of efforts trying to optimize find_vma(),
> > avoiding potentially expensive rbtree walks to locate a vma upon faults.
> > The original approach (https://lkml.org/lkml/2013/11/1/410), where the
> > largest vma was also cached, ended up being too specific and random, thus
> > further comparison with other approaches were needed. There are two things
> > to consider when dealing with this, the cache hit rate and the latency of
> > find_vma(). Improving the hit-rate does not necessarily translate in finding
> > the vma any faster, as the overhead of any fancy caching schemes can be too
> > high to consider.
>
> Actually there is also the cost of keeping the cache up to date. I'm
> not saying that it's an issue in your proposal - I like the proposal,
> especially now that you are replacing the per-mm cache rather than
> adding something on top - but it is a factor to consider.

True, although numbers show that the cost of maintaining the cache is
quite minimal. Invalidations are a free lunch (except in the rare event
of a seqnum overflow), so the updating part would consume the most
cycles, but then again, the hit rate is quite good so I'm not worried
about that either.

>
> > +static inline void __vmacache_invalidate(struct mm_struct *mm)
> > +{
> > +#ifdef CONFIG_MMU
> > + vmacache_invalidate(mm);
> > +#else
> > + mm->vmacache = NULL;
> > +#endif
> > +}
>
> Is there any reason why we can't use your proposal for !CONFIG_MMU as well ?
> (I'm assuming that we could reduce preprocessor checks by doing so)

Based on Linus' feedback today, I'm getting rid of this ugliness and
trying to have per-thread caches for both configs.

> > +void vmacache_invalidate_all(void)
> > +{
> > + struct task_struct *g, *p;
> > +
> > + rcu_read_lock();
> > + for_each_process_thread(g, p) {
> > + /*
> > + * Only flush the vmacache pointers as the
> > + * mm seqnum is already set and curr's will
> > + * be set upon invalidation when the next
> > + * lookup is done.
> > + */
> > + memset(p->vmacache, 0, sizeof(p->vmacache));
> > + }
> > + rcu_read_unlock();
> > +}
>
> Two things:
>
> - I believe we only need to invalidate vma caches for threads that
> share a given mm ? we should probably pass in that mm in order to
> avoid over-invalidation

I think you're right, since the overflows will always occur on
mm->seqnum, tasks that do not share the mm shouldn't be affected.

So the danger here is that when a lookup occurs, vmacache_valid() will
return true, having:

mm == curr->mm && mm->vmacache_seqnum == curr->vmacache_seqnum (both 0).

Then we just iterate the cache and potentially return some bugus vma.

However, since we're now going to reset the seqnum on every fork/clone
(before it was just the oldmm->seqnum + 1 thing), I doubt we'll ever
overflow.

> - My understanding is that the operation is safe because the caller
> has the mm's mmap_sem held for write, and other threads accessing the
> vma cache will have mmap_sem held at least for read, so we don't need
> extra locking to maintain the vma cache.

Yes, that's how I see things as well.

> Please 1- confirm this is the
> intention, 2- document this, and 3- only invalidate vma caches for
> threads that match the caller's mm so that mmap_sem locking can
> actually apply.

Will do.

> > +struct vm_area_struct *vmacache_find(struct mm_struct *mm,
> > + unsigned long addr)
> > +
> > +{
> > + int i;
> > +
> > + if (!vmacache_valid(mm))
> > + return NULL;
> > +
> > + for (i = 0; i < VMACACHE_SIZE; i++) {
> > + struct vm_area_struct *vma = current->vmacache[i];
> > +
> > + if (vma && vma->vm_start <= addr && vma->vm_end > addr)
> > + return vma;
> > + }
> > +
> > + return NULL;
> > +}
> > +
> > +void vmacache_update(struct mm_struct *mm, unsigned long addr,
> > + struct vm_area_struct *newvma)
> > +{
> > + /*
> > + * Hash based on the page number. Provides a good
> > + * hit rate for workloads with good locality and
> > + * those with random accesses as well.
> > + */
> > + int idx = (addr >> PAGE_SHIFT) & 3;
> > + current->vmacache[idx] = newvma;
> > +}
>
> I did read the previous discussion about how to compute idx here. I
> did not at the time realize that you are searching all 4 vmacache
> entries on lookup - that is, we are only talking about eviction policy
> here, not a lookup hash policy.

Right.

> My understanding is that the reason both your current and your
> previous idx computations work, is that a random eviction policy would
> work too. Basically, what you do is pick some address bits that are
> 'random enough' to use as an eviction policy.

What do you mean by random enough? I assume that would be something like
my original scheme were I used the last X bits of the offset within the
page.

>
> This is more of a question for Linus, but I am very surprised that I
> can't find an existing LRU eviction policy scheme in Linux. What I
> have in mind is to keep track of the order the cache entries have last
> been used. With 4 entries, there are 4! = 24 possible orders, which
> can be represented with an integer between 0 and 23. When
> vmacache_find suceeds, that integer is updated using a table lookup
> (table takes 24*4 = 96 bytes). In vmacache_update, the lru value
> module 4 indicates which cache way to evict (i.e. it's the one that's
> been least recently used).

While not completely related, I did play with a mod 4 hashing scheme
before I got to the one I'm proposing now. It was just not as effective.

Thanks,
Davidlohr

2014-02-26 07:52:36

by Michel Lespinasse

[permalink] [raw]
Subject: Re: [PATCH v2] mm: per-thread vma caching

On Tue, Feb 25, 2014 at 8:04 PM, Davidlohr Bueso <[email protected]> wrote:
> On Tue, 2014-02-25 at 18:04 -0800, Michel Lespinasse wrote:
>> On Tue, Feb 25, 2014 at 10:16 AM, Davidlohr Bueso <[email protected]> wrote:
>> > This patch is a continuation of efforts trying to optimize find_vma(),
>> > avoiding potentially expensive rbtree walks to locate a vma upon faults.
>> > The original approach (https://lkml.org/lkml/2013/11/1/410), where the
>> > largest vma was also cached, ended up being too specific and random, thus
>> > further comparison with other approaches were needed. There are two things
>> > to consider when dealing with this, the cache hit rate and the latency of
>> > find_vma(). Improving the hit-rate does not necessarily translate in finding
>> > the vma any faster, as the overhead of any fancy caching schemes can be too
>> > high to consider.
>>
>> Actually there is also the cost of keeping the cache up to date. I'm
>> not saying that it's an issue in your proposal - I like the proposal,
>> especially now that you are replacing the per-mm cache rather than
>> adding something on top - but it is a factor to consider.
>
> True, although numbers show that the cost of maintaining the cache is
> quite minimal. Invalidations are a free lunch (except in the rare event
> of a seqnum overflow), so the updating part would consume the most
> cycles, but then again, the hit rate is quite good so I'm not worried
> about that either.

Yes. I like your patch precisely because it keeps maintainance costs low.

>> > +void vmacache_invalidate_all(void)
>> > +{
>> > + struct task_struct *g, *p;
>> > +
>> > + rcu_read_lock();
>> > + for_each_process_thread(g, p) {
>> > + /*
>> > + * Only flush the vmacache pointers as the
>> > + * mm seqnum is already set and curr's will
>> > + * be set upon invalidation when the next
>> > + * lookup is done.
>> > + */
>> > + memset(p->vmacache, 0, sizeof(p->vmacache));
>> > + }
>> > + rcu_read_unlock();
>> > +}
>>
>> Two things:
>>
>> - I believe we only need to invalidate vma caches for threads that
>> share a given mm ? we should probably pass in that mm in order to
>> avoid over-invalidation
>
> I think you're right, since the overflows will always occur on
> mm->seqnum, tasks that do not share the mm shouldn't be affected.
>
> So the danger here is that when a lookup occurs, vmacache_valid() will
> return true, having:
>
> mm == curr->mm && mm->vmacache_seqnum == curr->vmacache_seqnum (both 0).
>
> Then we just iterate the cache and potentially return some bugus vma.
>
> However, since we're now going to reset the seqnum on every fork/clone
> (before it was just the oldmm->seqnum + 1 thing), I doubt we'll ever
> overflow.

I'm concerned about it *precisely* because it won't happen often, and
so it'd be hard to debug if we had a problem there. 64 bits would be
safe, but a 32-bit counter doesn't take that long to overflow, and I'm
sure it will happen once in a while in production.

Actually, I think there is a case for masking the seqnum with a
constant (all ones in production builds, but something shorter when
CONFIG_DEBUG_VM is enabled) so that this code is easier to exercise.

>> - My understanding is that the operation is safe because the caller
>> has the mm's mmap_sem held for write, and other threads accessing the
>> vma cache will have mmap_sem held at least for read, so we don't need
>> extra locking to maintain the vma cache.
>
> Yes, that's how I see things as well.
>
>> Please 1- confirm this is the
>> intention, 2- document this, and 3- only invalidate vma caches for
>> threads that match the caller's mm so that mmap_sem locking can
>> actually apply.
>
> Will do.

Thanks :)

>> > +struct vm_area_struct *vmacache_find(struct mm_struct *mm,
>> > + unsigned long addr)
>> > +
>> > +{
>> > + int i;
>> > +
>> > + if (!vmacache_valid(mm))
>> > + return NULL;
>> > +
>> > + for (i = 0; i < VMACACHE_SIZE; i++) {
>> > + struct vm_area_struct *vma = current->vmacache[i];
>> > +
>> > + if (vma && vma->vm_start <= addr && vma->vm_end > addr)
>> > + return vma;
>> > + }
>> > +
>> > + return NULL;
>> > +}
>> > +
>> > +void vmacache_update(struct mm_struct *mm, unsigned long addr,
>> > + struct vm_area_struct *newvma)
>> > +{
>> > + /*
>> > + * Hash based on the page number. Provides a good
>> > + * hit rate for workloads with good locality and
>> > + * those with random accesses as well.
>> > + */
>> > + int idx = (addr >> PAGE_SHIFT) & 3;
>> > + current->vmacache[idx] = newvma;
>> > +}
>>
>> I did read the previous discussion about how to compute idx here. I
>> did not at the time realize that you are searching all 4 vmacache
>> entries on lookup - that is, we are only talking about eviction policy
>> here, not a lookup hash policy.
>
> Right.
>
>> My understanding is that the reason both your current and your
>> previous idx computations work, is that a random eviction policy would
>> work too. Basically, what you do is pick some address bits that are
>> 'random enough' to use as an eviction policy.
>
> What do you mean by random enough? I assume that would be something like
> my original scheme were I used the last X bits of the offset within the
> page.

What I mean is that if you used a per-thread random number generator
to compute idx, such as prandom_u32_state() for example, you'd get
nice enough results already - after all, random eviction is not that
bad. So, my hunch is that the particular way you compute idx based on
the address works not because it catches on some particular property
of the access patterns, but just because the sequence of indexes it
ends up generating is as good as a random one.

>> This is more of a question for Linus, but I am very surprised that I
>> can't find an existing LRU eviction policy scheme in Linux. What I
>> have in mind is to keep track of the order the cache entries have last
>> been used. With 4 entries, there are 4! = 24 possible orders, which
>> can be represented with an integer between 0 and 23. When
>> vmacache_find suceeds, that integer is updated using a table lookup
>> (table takes 24*4 = 96 bytes). In vmacache_update, the lru value
>> module 4 indicates which cache way to evict (i.e. it's the one that's
>> been least recently used).
>
> While not completely related, I did play with a mod 4 hashing scheme
> before I got to the one I'm proposing now. It was just not as effective.

I believe you mean using idx = addr % 4 ? I can see that this wouldn't
work well, because index 0 would be chosen way too often (i.e. any
time the address is aligned, or the first access into a given page is
from reading a byte stream that crosses into it, etc).

But, what I propose above is entirely different, it is just picking
whatever cache index was least recently used for lookups. In this
context, the 'modulo 4' is only an implementation detail of how I
propose we track LRU order between the 4 cache slots.

By the way, I'm happy enough with your patch going in with your
proposed eviction scheme; LRU eviction is a refinement that is best
done as a separate patch. It just came up because I think it would be
applicable here.

--
Michel "Walken" Lespinasse
A program is never fully debugged until the last user dies.

2014-02-26 08:51:08

by Peter Zijlstra

[permalink] [raw]
Subject: Re: [PATCH v2] mm: per-thread vma caching

On Tue, Feb 25, 2014 at 10:16:46AM -0800, Davidlohr Bueso wrote:
> +void vmacache_invalidate_all(void)
> +{
> + struct task_struct *g, *p;
> +
> + rcu_read_lock();
> + for_each_process_thread(g, p) {
> + /*
> + * Only flush the vmacache pointers as the
> + * mm seqnum is already set and curr's will
> + * be set upon invalidation when the next
> + * lookup is done.
> + */
> + memset(p->vmacache, 0, sizeof(p->vmacache));
> + }
> + rcu_read_unlock();
> +}

With all the things being said on this particular piece already; I
wanted to add that the iteration there is incomplete; we can clone()
using CLONE_VM without using CLONE_THREAD.

Its not common, but it can be done. In that case the above iteration
will miss a task that shares the same mm.

2014-02-26 08:55:28

by Peter Zijlstra

[permalink] [raw]
Subject: Re: [PATCH v2] mm: per-thread vma caching

On Wed, Feb 26, 2014 at 09:50:48AM +0100, Peter Zijlstra wrote:
> On Tue, Feb 25, 2014 at 10:16:46AM -0800, Davidlohr Bueso wrote:
> > +void vmacache_invalidate_all(void)
> > +{
> > + struct task_struct *g, *p;
> > +
> > + rcu_read_lock();
> > + for_each_process_thread(g, p) {
> > + /*
> > + * Only flush the vmacache pointers as the
> > + * mm seqnum is already set and curr's will
> > + * be set upon invalidation when the next
> > + * lookup is done.
> > + */
> > + memset(p->vmacache, 0, sizeof(p->vmacache));
> > + }
> > + rcu_read_unlock();
> > +}
>
> With all the things being said on this particular piece already; I
> wanted to add that the iteration there is incomplete; we can clone()
> using CLONE_VM without using CLONE_THREAD.
>
> Its not common, but it can be done. In that case the above iteration
> will miss a task that shares the same mm.

Bugger; n/m that. I just spotted that it's yet another way to iterate
all tasks in the system.

Of course we need multiple macros to do this :/

Pretty horrifically expensive though.

2014-02-26 11:26:11

by Mel Gorman

[permalink] [raw]
Subject: Re: [PATCH v2] mm: per-thread vma caching

On Tue, Feb 25, 2014 at 10:16:46AM -0800, Davidlohr Bueso wrote:
> From: Davidlohr Bueso <[email protected]>
>
> This patch is a continuation of efforts trying to optimize find_vma(),
> avoiding potentially expensive rbtree walks to locate a vma upon faults.
> The original approach (https://lkml.org/lkml/2013/11/1/410), where the
> largest vma was also cached, ended up being too specific and random, thus
> further comparison with other approaches were needed. There are two things
> to consider when dealing with this, the cache hit rate and the latency of
> find_vma(). Improving the hit-rate does not necessarily translate in finding
> the vma any faster, as the overhead of any fancy caching schemes can be too
> high to consider.
>
> We currently cache the last used vma for the whole address space, which
> provides a nice optimization, reducing the total cycles in find_vma() by up
> to 250%, for workloads with good locality. On the other hand, this simple
> scheme is pretty much useless for workloads with poor locality. Analyzing
> ebizzy runs shows that, no matter how many threads are running, the
> mmap_cache hit rate is less than 2%, and in many situations below 1%.
>
> The proposed approach is to keep the current cache and adding a small, per
> thread, LRU cache. By keeping the mm->mmap_cache, programs with large heaps
> or good locality can benefit by not having to deal with an additional cache
> when the hit rate is good enough. Concretely, the following results are seen
> on an 80 core, 8 socket x86-64 box:
>
> 1) System bootup: Most programs are single threaded, so the per-thread scheme
> does improve ~50% hit rate by just adding a few more slots to the cache.
>
> +----------------+----------+------------------+
> | caching scheme | hit-rate | cycles (billion) |
> +----------------+----------+------------------+
> | baseline | 50.61% | 19.90 |
> | patched | 73.45% | 13.58 |
> +----------------+----------+------------------+
>
> 2) Kernel build: This one is already pretty good with the current approach
> as we're dealing with good locality.
>
> +----------------+----------+------------------+
> | caching scheme | hit-rate | cycles (billion) |
> +----------------+----------+------------------+
> | baseline | 75.28% | 11.03 |
> | patched | 88.09% | 9.31 |
> +----------------+----------+------------------+
>
> 3) Oracle 11g Data Mining (4k pages): Similar to the kernel build workload.
>
> +----------------+----------+------------------+
> | caching scheme | hit-rate | cycles (billion) |
> +----------------+----------+------------------+
> | baseline | 70.66% | 17.14 |
> | patched | 91.15% | 12.57 |
> +----------------+----------+------------------+
>
> 4) Ebizzy: There's a fair amount of variation from run to run, but this
> approach always shows nearly perfect hit rates, while baseline is just
> about non-existent. The amounts of cycles can fluctuate between anywhere
> from ~60 to ~116 for the baseline scheme, but this approach reduces it
> considerably. For instance, with 80 threads:
>
> +----------------+----------+------------------+
> | caching scheme | hit-rate | cycles (billion) |
> +----------------+----------+------------------+
> | baseline | 1.06% | 91.54 |
> | patched | 99.97% | 14.18 |
> +----------------+----------+------------------+
>
> Systems with !CONFIG_MMU get to keep the current logic.
>
> Signed-off-by: Davidlohr Bueso <[email protected]>
> ---
> Changes from v1 (https://lkml.org/lkml/2014/2/21/8):
> - Removed the per-mm cache for CONFIG_MMU, only having the
> per thread approach.
>
> - Since lookups are always performed before updates, only
> invalidate when searching for a vma. Simplify the updating.
>
> - Hash based on the page# instead of the remaining two bits
> of the offset, results show that both methods are pretty
> much identical for hit-rates.
>
> Please note that nommu and unicore32 arch are *untested*.
> Thanks.
>
> arch/unicore32/include/asm/mmu_context.h | 10 ++++-
> fs/proc/task_mmu.c | 2 +-
> include/linux/mm_types.h | 6 ++-
> include/linux/sched.h | 5 +++
> include/linux/vmacache.h | 24 +++++++++++
> kernel/debug/debug_core.c | 6 ++-
> kernel/fork.c | 7 +++-
> mm/Makefile | 2 +-
> mm/mmap.c | 54 +++++++++++++------------
> mm/nommu.c | 12 +++---
> mm/vmacache.c | 69 ++++++++++++++++++++++++++++++++
> 11 files changed, 157 insertions(+), 40 deletions(-)
> create mode 100644 include/linux/vmacache.h
> create mode 100644 mm/vmacache.c
>
> diff --git a/arch/unicore32/include/asm/mmu_context.h b/arch/unicore32/include/asm/mmu_context.h
> index fb5e4c6..c571759 100644
> --- a/arch/unicore32/include/asm/mmu_context.h
> +++ b/arch/unicore32/include/asm/mmu_context.h
> @@ -56,6 +56,14 @@ switch_mm(struct mm_struct *prev, struct mm_struct *next,
> #define deactivate_mm(tsk, mm) do { } while (0)
> #define activate_mm(prev, next) switch_mm(prev, next, NULL)
>
> +static inline void __vmacache_invalidate(struct mm_struct *mm)
> +{
> +#ifdef CONFIG_MMU
> + vmacache_invalidate(mm);
> +#else
> + mm->vmacache = NULL;
> +#endif
> +}

Unusual to see foo() being the internal helper and __foo() being the
inlined wrapped. Not wrong, it just does not match expectation.

> /*
> * We are inserting a "fake" vma for the user-accessible vector page so
> * gdb and friends can get to it through ptrace and /proc/<pid>/mem.
> @@ -73,7 +81,7 @@ do { \
> else \
> mm->mmap = NULL; \
> rb_erase(&high_vma->vm_rb, &mm->mm_rb); \
> - mm->mmap_cache = NULL; \
> + __vmacache_invalidate(mm); \
> mm->map_count--; \
> remove_vma(high_vma); \
> } \
> diff --git a/fs/proc/task_mmu.c b/fs/proc/task_mmu.c
> index fb52b54..231c836 100644
> --- a/fs/proc/task_mmu.c
> +++ b/fs/proc/task_mmu.c
> @@ -152,7 +152,7 @@ static void *m_start(struct seq_file *m, loff_t *pos)
>
> /*
> * We remember last_addr rather than next_addr to hit with
> - * mmap_cache most of the time. We have zero last_addr at
> + * vmacache most of the time. We have zero last_addr at
> * the beginning and also after lseek. We will have -1 last_addr
> * after the end of the vmas.
> */
> diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h
> index 290901a..e8b90b0 100644
> --- a/include/linux/mm_types.h
> +++ b/include/linux/mm_types.h
> @@ -342,13 +342,15 @@ struct mm_rss_stat {
>
> struct kioctx_table;
> struct mm_struct {
> - struct vm_area_struct * mmap; /* list of VMAs */
> + struct vm_area_struct *mmap; /* list of VMAs */
> struct rb_root mm_rb;
> - struct vm_area_struct * mmap_cache; /* last find_vma result */
> #ifdef CONFIG_MMU
> + u32 vmacache_seqnum; /* per-thread vmacache */

Why is this not a seqcount with a standard check for read_seqbegin()
read_seqretry() when it gets updated? It should have similar packing
(unsigned int) unless lockdep is enabled but barries to the ordering
of reads/writes to the counter are correct. There will be a performance
impact for mmap intensive workloads invaliding the cache but they already
are stuck behind mmap_sem for write anyway.

> unsigned long (*get_unmapped_area) (struct file *filp,
> unsigned long addr, unsigned long len,
> unsigned long pgoff, unsigned long flags);
> +#else
> + struct vm_area_struct *vmacache; /* last find_vma result */
> #endif
> unsigned long mmap_base; /* base of mmap area */
> unsigned long mmap_legacy_base; /* base of mmap area in bottom-up allocations */
> diff --git a/include/linux/sched.h b/include/linux/sched.h
> index a781dec..09dd1ff 100644
> --- a/include/linux/sched.h
> +++ b/include/linux/sched.h
> @@ -23,6 +23,7 @@ struct sched_param {
> #include <linux/errno.h>
> #include <linux/nodemask.h>
> #include <linux/mm_types.h>
> +#include <linux/vmacache.h>
> #include <linux/preempt_mask.h>
>
> #include <asm/page.h>
> @@ -1228,6 +1229,10 @@ struct task_struct {
> #ifdef CONFIG_COMPAT_BRK
> unsigned brk_randomized:1;
> #endif
> +#ifdef CONFIG_MMU
> + u32 vmacache_seqnum;
> + struct vm_area_struct *vmacache[VMACACHE_SIZE];
> +#endif

Comment that this is a per-thread cache of a number of VMAs to cache
lookups for workloads with poor locality.

> #if defined(SPLIT_RSS_COUNTING)
> struct task_rss_stat rss_stat;
> #endif
> diff --git a/include/linux/vmacache.h b/include/linux/vmacache.h
> new file mode 100644
> index 0000000..4fb7841
> --- /dev/null
> +++ b/include/linux/vmacache.h
> @@ -0,0 +1,24 @@
> +#ifndef __LINUX_VMACACHE_H
> +#define __LINUX_VMACACHE_H
> +
> +#include <linux/mm.h>
> +
> +#define VMACACHE_SIZE 4
> +

Why 4?

> +extern void vmacache_invalidate_all(void);
> +
> +static inline void vmacache_invalidate(struct mm_struct *mm)
> +{
> + mm->vmacache_seqnum++;
> +
> + /* deal with overflows */
> + if (unlikely(mm->vmacache_seqnum == 0))
> + vmacache_invalidate_all();
> +}

Why does an overflow require that all vmacaches be invalidated globally?
The cache is invalid in the event of a simple mismatch, overflow is
unimportant and I do not see why one mm seqcount overflowing would
affect every thread in the system.

> +
> +extern void vmacache_update(struct mm_struct *mm, unsigned long addr,
> + struct vm_area_struct *newvma);
> +extern struct vm_area_struct *vmacache_find(struct mm_struct *mm,
> + unsigned long addr);
> +
> +#endif /* __LINUX_VMACACHE_H */
> diff --git a/kernel/debug/debug_core.c b/kernel/debug/debug_core.c
> index 334b398..fefc055 100644
> --- a/kernel/debug/debug_core.c
> +++ b/kernel/debug/debug_core.c
> @@ -224,10 +224,12 @@ static void kgdb_flush_swbreak_addr(unsigned long addr)
> if (!CACHE_FLUSH_IS_SAFE)
> return;
>
> - if (current->mm && current->mm->mmap_cache) {
> - flush_cache_range(current->mm->mmap_cache,
> +#ifndef CONFIG_MMU
> + if (current->mm && current->mm->vmacache) {
> + flush_cache_range(current->mm->vmacache,
> addr, addr + BREAK_INSTR_SIZE);
> }
> +#endif

What's this CONFIG_MMU check about? It looks very suspicious.

> /* Force flush instruction cache if it was outside the mm */
> flush_icache_range(addr, addr + BREAK_INSTR_SIZE);
> }
> diff --git a/kernel/fork.c b/kernel/fork.c
> index a17621c..14396bf 100644
> --- a/kernel/fork.c
> +++ b/kernel/fork.c
> @@ -363,7 +363,12 @@ static int dup_mmap(struct mm_struct *mm, struct mm_struct *oldmm)
>
> mm->locked_vm = 0;
> mm->mmap = NULL;
> - mm->mmap_cache = NULL;
> + mm->vmacache_seqnum = oldmm->vmacache_seqnum + 1;
> +

This is going to be an unrelated mm, why not just set it to 0 and ignore
the oldmm values entirely?

> + /* deal with overflows */
> + if (unlikely(mm->vmacache_seqnum == 0))
> + vmacache_invalidate_all();
> +

Same comments about the other global invalidation apply.

> mm->map_count = 0;
> cpumask_clear(mm_cpumask(mm));
> mm->mm_rb = RB_ROOT;
> diff --git a/mm/Makefile b/mm/Makefile
> index 310c90a..2bb5b6a 100644
> --- a/mm/Makefile
> +++ b/mm/Makefile
> @@ -5,7 +5,7 @@
> mmu-y := nommu.o
> mmu-$(CONFIG_MMU) := fremap.o highmem.o madvise.o memory.o mincore.o \
> mlock.o mmap.o mprotect.o mremap.o msync.o rmap.o \
> - vmalloc.o pagewalk.o pgtable-generic.o
> + vmalloc.o pagewalk.o pgtable-generic.o vmacache.o
>
> ifdef CONFIG_CROSS_MEMORY_ATTACH
> mmu-$(CONFIG_MMU) += process_vm_access.o
> diff --git a/mm/mmap.c b/mm/mmap.c
> index 20ff0c3..dfd7fe7 100644
> --- a/mm/mmap.c
> +++ b/mm/mmap.c
> @@ -681,8 +681,9 @@ __vma_unlink(struct mm_struct *mm, struct vm_area_struct *vma,
> prev->vm_next = next = vma->vm_next;
> if (next)
> next->vm_prev = prev;
> - if (mm->mmap_cache == vma)
> - mm->mmap_cache = prev;
> +
> + /* Kill the cache */
> + vmacache_invalidate(mm);

Why not

if (mm->mmap_cache == vma)
vmacache_update(mm, vma->vm_start, prev)

or seeing as there was no actual hit just

vmacache_update(mm, vma->vm_start, NULL)

> }
>
> /*
> @@ -1989,34 +1990,33 @@ EXPORT_SYMBOL(get_unmapped_area);
> /* Look up the first VMA which satisfies addr < vm_end, NULL if none. */
> struct vm_area_struct *find_vma(struct mm_struct *mm, unsigned long addr)
> {
> - struct vm_area_struct *vma = NULL;
> + struct rb_node *rb_node;
> + struct vm_area_struct *vma;
>
> /* Check the cache first. */
> - /* (Cache hit rate is typically around 35%.) */
> - vma = ACCESS_ONCE(mm->mmap_cache);
> - if (!(vma && vma->vm_end > addr && vma->vm_start <= addr)) {
> - struct rb_node *rb_node;
> + vma = vmacache_find(mm, addr);
> + if (likely(vma))
> + return vma;
>
> - rb_node = mm->mm_rb.rb_node;
> - vma = NULL;
> + rb_node = mm->mm_rb.rb_node;
> + vma = NULL;
>
> - while (rb_node) {
> - struct vm_area_struct *vma_tmp;
> -
> - vma_tmp = rb_entry(rb_node,
> - struct vm_area_struct, vm_rb);
> -
> - if (vma_tmp->vm_end > addr) {
> - vma = vma_tmp;
> - if (vma_tmp->vm_start <= addr)
> - break;
> - rb_node = rb_node->rb_left;
> - } else
> - rb_node = rb_node->rb_right;
> - }
> - if (vma)
> - mm->mmap_cache = vma;
> + while (rb_node) {
> + struct vm_area_struct *tmp;
> +
> + tmp = rb_entry(rb_node, struct vm_area_struct, vm_rb);
> +
> + if (tmp->vm_end > addr) {
> + vma = tmp;
> + if (tmp->vm_start <= addr)
> + break;
> + rb_node = rb_node->rb_left;
> + } else
> + rb_node = rb_node->rb_right;
> }
> +
> + if (vma)
> + vmacache_update(mm, addr, vma);
> return vma;
> }
>
> @@ -2388,7 +2388,9 @@ detach_vmas_to_be_unmapped(struct mm_struct *mm, struct vm_area_struct *vma,
> } else
> mm->highest_vm_end = prev ? prev->vm_end : 0;
> tail_vma->vm_next = NULL;
> - mm->mmap_cache = NULL; /* Kill the cache. */
> +
> + /* Kill the cache */
> + vmacache_invalidate(mm);
> }
>
> /*
> diff --git a/mm/nommu.c b/mm/nommu.c
> index 8740213..c2d1b92 100644
> --- a/mm/nommu.c
> +++ b/mm/nommu.c
> @@ -776,8 +776,8 @@ static void delete_vma_from_mm(struct vm_area_struct *vma)
> protect_vma(vma, 0);
>
> mm->map_count--;
> - if (mm->mmap_cache == vma)
> - mm->mmap_cache = NULL;
> + if (mm->vmacache == vma)
> + mm->vmacache = NULL;
>
> /* remove the VMA from the mapping */
> if (vma->vm_file) {
> @@ -825,7 +825,7 @@ struct vm_area_struct *find_vma(struct mm_struct *mm, unsigned long addr)
> struct vm_area_struct *vma;
>
> /* check the cache first */
> - vma = ACCESS_ONCE(mm->mmap_cache);
> + vma = ACCESS_ONCE(mm->vmacache);
> if (vma && vma->vm_start <= addr && vma->vm_end > addr)
> return vma;
>
> @@ -835,7 +835,7 @@ struct vm_area_struct *find_vma(struct mm_struct *mm, unsigned long addr)
> if (vma->vm_start > addr)
> return NULL;
> if (vma->vm_end > addr) {
> - mm->mmap_cache = vma;
> + mm->vmacache = vma;
> return vma;
> }
> }
> @@ -874,7 +874,7 @@ static struct vm_area_struct *find_vma_exact(struct mm_struct *mm,
> unsigned long end = addr + len;
>
> /* check the cache first */
> - vma = mm->mmap_cache;
> + vma = mm->vmacache;
> if (vma && vma->vm_start == addr && vma->vm_end == end)
> return vma;
>
> @@ -886,7 +886,7 @@ static struct vm_area_struct *find_vma_exact(struct mm_struct *mm,
> if (vma->vm_start > addr)
> return NULL;
> if (vma->vm_end == end) {
> - mm->mmap_cache = vma;
> + mm->vmacache = vma;
> return vma;
> }
> }
> diff --git a/mm/vmacache.c b/mm/vmacache.c
> new file mode 100644
> index 0000000..d577ad3
> --- /dev/null
> +++ b/mm/vmacache.c
> @@ -0,0 +1,69 @@
> +#include <linux/sched.h>
> +#include <linux/vmacache.h>
> +
> +void vmacache_invalidate_all(void)
> +{
> + struct task_struct *g, *p;
> +
> + rcu_read_lock();
> + for_each_process_thread(g, p) {
> + /*
> + * Only flush the vmacache pointers as the
> + * mm seqnum is already set and curr's will
> + * be set upon invalidation when the next
> + * lookup is done.
> + */
> + memset(p->vmacache, 0, sizeof(p->vmacache));
> + }
> + rcu_read_unlock();
> +}
> +

Still don't get why this is necessary :(


> +static bool vmacache_valid(struct mm_struct *mm)
> +{
> + struct task_struct *curr = current;
> +
> + if (mm != curr->mm)
> + return false;
> +
> + if (mm->vmacache_seqnum != curr->vmacache_seqnum) {
> + /*
> + * First attempt will always be invalid, initialize the
> + * new cache for this task here.
> + */
> + curr->vmacache_seqnum = mm->vmacache_seqnum;
> + memset(curr->vmacache, 0, sizeof(curr->vmacache));
> + return false;
> + }
> + return true;
> +}

So if you converted to a standard seqcount, it would simply be a case that
instread of a retry loop that you'd return false without looping.

> +
> +struct vm_area_struct *vmacache_find(struct mm_struct *mm,
> + unsigned long addr)
> +
> +{
> + int i;
> +
> + if (!vmacache_valid(mm))
> + return NULL;
> +
> + for (i = 0; i < VMACACHE_SIZE; i++) {
> + struct vm_area_struct *vma = current->vmacache[i];
> +
> + if (vma && vma->vm_start <= addr && vma->vm_end > addr)
> + return vma;
> + }
> +
> + return NULL;
> +}

vmacache_update does a hashed lookup but the find does a linear search.
I expect this is necessary because the hashing address could have been
anywhere in the VMA as could the lookup. It's obvious but no harm in
adding a comment.

> +
> +void vmacache_update(struct mm_struct *mm, unsigned long addr,
> + struct vm_area_struct *newvma)
> +{
> + /*
> + * Hash based on the page number. Provides a good
> + * hit rate for workloads with good locality and
> + * those with random accesses as well.
> + */
> + int idx = (addr >> PAGE_SHIFT) & 3;
> + current->vmacache[idx] = newvma;
> +}

The changelog claims this is LRU. Where is the LRU? It looks more a
pseudo-random replacement policy. Nothing wrong with that as such, LRU
might take longer to calculate than is worthwhile.

3 looks wrong here. Should be & (VMACACHE_SIZE-1) and then either
express VMACACHE_SIZE in terms of a shift or hope someone does not set
it to 5 and wonder what went wrong.

I see you calculated hit rates for your changelog. How about adding
tracepoints for vmacache_find() hit and misses in a follow-up patch so it
can be recalculated with ftrace without debugging patches?

--
Mel Gorman
SUSE Labs

2014-02-26 19:11:51

by Davidlohr Bueso

[permalink] [raw]
Subject: Re: [PATCH v2] mm: per-thread vma caching

On Wed, 2014-02-26 at 11:26 +0000, Mel Gorman wrote:
> On Tue, Feb 25, 2014 at 10:16:46AM -0800, Davidlohr Bueso wrote:
> >
> > struct kioctx_table;
> > struct mm_struct {
> > - struct vm_area_struct * mmap; /* list of VMAs */
> > + struct vm_area_struct *mmap; /* list of VMAs */
> > struct rb_root mm_rb;
> > - struct vm_area_struct * mmap_cache; /* last find_vma result */
> > #ifdef CONFIG_MMU
> > + u32 vmacache_seqnum; /* per-thread vmacache */
>
> Why is this not a seqcount with a standard check for read_seqbegin()
> read_seqretry() when it gets updated? It should have similar packing
> (unsigned int) unless lockdep is enabled but barries to the ordering
> of reads/writes to the counter are correct. There will be a performance
> impact for mmap intensive workloads invaliding the cache but they already
> are stuck behind mmap_sem for write anyway.

We need invalidations to be super fast. This alternative defeats the
purpose. If we bloat the caching with such things is simply not worth it
as the latency of find_vma() increases compared to mm->mmap_cache.

>
> > unsigned long (*get_unmapped_area) (struct file *filp,
> > unsigned long addr, unsigned long len,
> > unsigned long pgoff, unsigned long flags);
> > +#else
> > + struct vm_area_struct *vmacache; /* last find_vma result */
> > #endif
> > unsigned long mmap_base; /* base of mmap area */
> > unsigned long mmap_legacy_base; /* base of mmap area in bottom-up allocations */
> > diff --git a/include/linux/sched.h b/include/linux/sched.h
> > index a781dec..09dd1ff 100644
> > --- a/include/linux/sched.h
> > +++ b/include/linux/sched.h
> > @@ -23,6 +23,7 @@ struct sched_param {
> > #include <linux/errno.h>
> > #include <linux/nodemask.h>
> > #include <linux/mm_types.h>
> > +#include <linux/vmacache.h>
> > #include <linux/preempt_mask.h>
> >
> > #include <asm/page.h>
> > @@ -1228,6 +1229,10 @@ struct task_struct {
> > #ifdef CONFIG_COMPAT_BRK
> > unsigned brk_randomized:1;
> > #endif
> > +#ifdef CONFIG_MMU
> > + u32 vmacache_seqnum;
> > + struct vm_area_struct *vmacache[VMACACHE_SIZE];
> > +#endif
>
> Comment that this is a per-thread cache of a number of VMAs to cache
> lookups for workloads with poor locality.

ok

>
> > #if defined(SPLIT_RSS_COUNTING)
> > struct task_rss_stat rss_stat;
> > #endif
> > diff --git a/include/linux/vmacache.h b/include/linux/vmacache.h
> > new file mode 100644
> > index 0000000..4fb7841
> > --- /dev/null
> > +++ b/include/linux/vmacache.h
> > @@ -0,0 +1,24 @@
> > +#ifndef __LINUX_VMACACHE_H
> > +#define __LINUX_VMACACHE_H
> > +
> > +#include <linux/mm.h>
> > +
> > +#define VMACACHE_SIZE 4
> > +
>
> Why 4?

Not too small, not too big. Testing showed it provided good enough
balance.

>
> > +extern void vmacache_invalidate_all(void);
> > +
> > +static inline void vmacache_invalidate(struct mm_struct *mm)
> > +{
> > + mm->vmacache_seqnum++;
> > +
> > + /* deal with overflows */
> > + if (unlikely(mm->vmacache_seqnum == 0))
> > + vmacache_invalidate_all();
> > +}
>
> Why does an overflow require that all vmacaches be invalidated globally?
> The cache is invalid in the event of a simple mismatch, overflow is
> unimportant and I do not see why one mm seqcount overflowing would
> affect every thread in the system.

Yes, this is only affects threads sharing the same mm, updating it.

> > +
> > +extern void vmacache_update(struct mm_struct *mm, unsigned long addr,
> > + struct vm_area_struct *newvma);
> > +extern struct vm_area_struct *vmacache_find(struct mm_struct *mm,
> > + unsigned long addr);
> > +
> > +#endif /* __LINUX_VMACACHE_H */
> > diff --git a/kernel/debug/debug_core.c b/kernel/debug/debug_core.c
> > index 334b398..fefc055 100644
> > --- a/kernel/debug/debug_core.c
> > +++ b/kernel/debug/debug_core.c
> > @@ -224,10 +224,12 @@ static void kgdb_flush_swbreak_addr(unsigned long addr)
> > if (!CACHE_FLUSH_IS_SAFE)
> > return;
> >
> > - if (current->mm && current->mm->mmap_cache) {
> > - flush_cache_range(current->mm->mmap_cache,
> > +#ifndef CONFIG_MMU
> > + if (current->mm && current->mm->vmacache) {
> > + flush_cache_range(current->mm->vmacache,
> > addr, addr + BREAK_INSTR_SIZE);
> > }
> > +#endif
>
> What's this CONFIG_MMU check about? It looks very suspicious.

This was part of the ugliness of two different schemes for different
configs. This is also out the window and v3 will provide per-thread
caching for both setups.

>
> > /* Force flush instruction cache if it was outside the mm */
> > flush_icache_range(addr, addr + BREAK_INSTR_SIZE);
> > }
> > diff --git a/kernel/fork.c b/kernel/fork.c
> > index a17621c..14396bf 100644
> > --- a/kernel/fork.c
> > +++ b/kernel/fork.c
> > @@ -363,7 +363,12 @@ static int dup_mmap(struct mm_struct *mm, struct mm_struct *oldmm)
> >
> > mm->locked_vm = 0;
> > mm->mmap = NULL;
> > - mm->mmap_cache = NULL;
> > + mm->vmacache_seqnum = oldmm->vmacache_seqnum + 1;
> > +
>
> This is going to be an unrelated mm, why not just set it to 0 and ignore
> the oldmm values entirely?

Yep, that will be in v3.

>
> > + /* deal with overflows */
> > + if (unlikely(mm->vmacache_seqnum == 0))
> > + vmacache_invalidate_all();
> > +
>
> Same comments about the other global invalidation apply.
>
> > mm->map_count = 0;
> > cpumask_clear(mm_cpumask(mm));
> > mm->mm_rb = RB_ROOT;
> > diff --git a/mm/Makefile b/mm/Makefile
> > index 310c90a..2bb5b6a 100644
> > --- a/mm/Makefile
> > +++ b/mm/Makefile
> > @@ -5,7 +5,7 @@
> > mmu-y := nommu.o
> > mmu-$(CONFIG_MMU) := fremap.o highmem.o madvise.o memory.o mincore.o \
> > mlock.o mmap.o mprotect.o mremap.o msync.o rmap.o \
> > - vmalloc.o pagewalk.o pgtable-generic.o
> > + vmalloc.o pagewalk.o pgtable-generic.o vmacache.o
> >
> > ifdef CONFIG_CROSS_MEMORY_ATTACH
> > mmu-$(CONFIG_MMU) += process_vm_access.o
> > diff --git a/mm/mmap.c b/mm/mmap.c
> > index 20ff0c3..dfd7fe7 100644
> > --- a/mm/mmap.c
> > +++ b/mm/mmap.c
> > @@ -681,8 +681,9 @@ __vma_unlink(struct mm_struct *mm, struct vm_area_struct *vma,
> > prev->vm_next = next = vma->vm_next;
> > if (next)
> > next->vm_prev = prev;
> > - if (mm->mmap_cache == vma)
> > - mm->mmap_cache = prev;
> > +
> > + /* Kill the cache */
> > + vmacache_invalidate(mm);
>
> Why not
>
> if (mm->mmap_cache == vma)
> vmacache_update(mm, vma->vm_start, prev)
>
> or seeing as there was no actual hit just
>
> vmacache_update(mm, vma->vm_start, NULL)
>

That would cost more than just incrementing the seqnum. Saving cycles is
paramount here.

> > }
> >
> > /*
> > @@ -1989,34 +1990,33 @@ EXPORT_SYMBOL(get_unmapped_area);
> > /* Look up the first VMA which satisfies addr < vm_end, NULL if none. */
> > struct vm_area_struct *find_vma(struct mm_struct *mm, unsigned long addr)
> > {
> > - struct vm_area_struct *vma = NULL;
> > + struct rb_node *rb_node;
> > + struct vm_area_struct *vma;
> >
> > /* Check the cache first. */
> > - /* (Cache hit rate is typically around 35%.) */
> > - vma = ACCESS_ONCE(mm->mmap_cache);
> > - if (!(vma && vma->vm_end > addr && vma->vm_start <= addr)) {
> > - struct rb_node *rb_node;
> > + vma = vmacache_find(mm, addr);
> > + if (likely(vma))
> > + return vma;
> >
> > - rb_node = mm->mm_rb.rb_node;
> > - vma = NULL;
> > + rb_node = mm->mm_rb.rb_node;
> > + vma = NULL;
> >
> > - while (rb_node) {
> > - struct vm_area_struct *vma_tmp;
> > -
> > - vma_tmp = rb_entry(rb_node,
> > - struct vm_area_struct, vm_rb);
> > -
> > - if (vma_tmp->vm_end > addr) {
> > - vma = vma_tmp;
> > - if (vma_tmp->vm_start <= addr)
> > - break;
> > - rb_node = rb_node->rb_left;
> > - } else
> > - rb_node = rb_node->rb_right;
> > - }
> > - if (vma)
> > - mm->mmap_cache = vma;
> > + while (rb_node) {
> > + struct vm_area_struct *tmp;
> > +
> > + tmp = rb_entry(rb_node, struct vm_area_struct, vm_rb);
> > +
> > + if (tmp->vm_end > addr) {
> > + vma = tmp;
> > + if (tmp->vm_start <= addr)
> > + break;
> > + rb_node = rb_node->rb_left;
> > + } else
> > + rb_node = rb_node->rb_right;
> > }
> > +
> > + if (vma)
> > + vmacache_update(mm, addr, vma);
> > return vma;
> > }
> >
> > @@ -2388,7 +2388,9 @@ detach_vmas_to_be_unmapped(struct mm_struct *mm, struct vm_area_struct *vma,
> > } else
> > mm->highest_vm_end = prev ? prev->vm_end : 0;
> > tail_vma->vm_next = NULL;
> > - mm->mmap_cache = NULL; /* Kill the cache. */
> > +
> > + /* Kill the cache */
> > + vmacache_invalidate(mm);
> > }
> >
> > /*
> > diff --git a/mm/nommu.c b/mm/nommu.c
> > index 8740213..c2d1b92 100644
> > --- a/mm/nommu.c
> > +++ b/mm/nommu.c
> > @@ -776,8 +776,8 @@ static void delete_vma_from_mm(struct vm_area_struct *vma)
> > protect_vma(vma, 0);
> >
> > mm->map_count--;
> > - if (mm->mmap_cache == vma)
> > - mm->mmap_cache = NULL;
> > + if (mm->vmacache == vma)
> > + mm->vmacache = NULL;
> >
> > /* remove the VMA from the mapping */
> > if (vma->vm_file) {
> > @@ -825,7 +825,7 @@ struct vm_area_struct *find_vma(struct mm_struct *mm, unsigned long addr)
> > struct vm_area_struct *vma;
> >
> > /* check the cache first */
> > - vma = ACCESS_ONCE(mm->mmap_cache);
> > + vma = ACCESS_ONCE(mm->vmacache);
> > if (vma && vma->vm_start <= addr && vma->vm_end > addr)
> > return vma;
> >
> > @@ -835,7 +835,7 @@ struct vm_area_struct *find_vma(struct mm_struct *mm, unsigned long addr)
> > if (vma->vm_start > addr)
> > return NULL;
> > if (vma->vm_end > addr) {
> > - mm->mmap_cache = vma;
> > + mm->vmacache = vma;
> > return vma;
> > }
> > }
> > @@ -874,7 +874,7 @@ static struct vm_area_struct *find_vma_exact(struct mm_struct *mm,
> > unsigned long end = addr + len;
> >
> > /* check the cache first */
> > - vma = mm->mmap_cache;
> > + vma = mm->vmacache;
> > if (vma && vma->vm_start == addr && vma->vm_end == end)
> > return vma;
> >
> > @@ -886,7 +886,7 @@ static struct vm_area_struct *find_vma_exact(struct mm_struct *mm,
> > if (vma->vm_start > addr)
> > return NULL;
> > if (vma->vm_end == end) {
> > - mm->mmap_cache = vma;
> > + mm->vmacache = vma;
> > return vma;
> > }
> > }
> > diff --git a/mm/vmacache.c b/mm/vmacache.c
> > new file mode 100644
> > index 0000000..d577ad3
> > --- /dev/null
> > +++ b/mm/vmacache.c
> > @@ -0,0 +1,69 @@
> > +#include <linux/sched.h>
> > +#include <linux/vmacache.h>
> > +
> > +void vmacache_invalidate_all(void)
> > +{
> > + struct task_struct *g, *p;
> > +
> > + rcu_read_lock();
> > + for_each_process_thread(g, p) {
> > + /*
> > + * Only flush the vmacache pointers as the
> > + * mm seqnum is already set and curr's will
> > + * be set upon invalidation when the next
> > + * lookup is done.
> > + */
> > + memset(p->vmacache, 0, sizeof(p->vmacache));
> > + }
> > + rcu_read_unlock();
> > +}
> > +
>
> Still don't get why this is necessary :(
>
>
> > +static bool vmacache_valid(struct mm_struct *mm)
> > +{
> > + struct task_struct *curr = current;
> > +
> > + if (mm != curr->mm)
> > + return false;
> > +
> > + if (mm->vmacache_seqnum != curr->vmacache_seqnum) {
> > + /*
> > + * First attempt will always be invalid, initialize the
> > + * new cache for this task here.
> > + */
> > + curr->vmacache_seqnum = mm->vmacache_seqnum;
> > + memset(curr->vmacache, 0, sizeof(curr->vmacache));
> > + return false;
> > + }
> > + return true;
> > +}
>
> So if you converted to a standard seqcount, it would simply be a case that
> instread of a retry loop that you'd return false without looping.
>
> > +
> > +struct vm_area_struct *vmacache_find(struct mm_struct *mm,
> > + unsigned long addr)
> > +
> > +{
> > + int i;
> > +
> > + if (!vmacache_valid(mm))
> > + return NULL;
> > +
> > + for (i = 0; i < VMACACHE_SIZE; i++) {
> > + struct vm_area_struct *vma = current->vmacache[i];
> > +
> > + if (vma && vma->vm_start <= addr && vma->vm_end > addr)
> > + return vma;
> > + }
> > +
> > + return NULL;
> > +}
>
> vmacache_update does a hashed lookup but the find does a linear search.
> I expect this is necessary because the hashing address could have been
> anywhere in the VMA as could the lookup. It's obvious but no harm in
> adding a comment.
>
> > +
> > +void vmacache_update(struct mm_struct *mm, unsigned long addr,
> > + struct vm_area_struct *newvma)
> > +{
> > + /*
> > + * Hash based on the page number. Provides a good
> > + * hit rate for workloads with good locality and
> > + * those with random accesses as well.
> > + */
> > + int idx = (addr >> PAGE_SHIFT) & 3;
> > + current->vmacache[idx] = newvma;
> > +}
>
> The changelog claims this is LRU. Where is the LRU? It looks more a
> pseudo-random replacement policy.

Fair enough, although it could be considered a sort of LRU on the page
num.

> Nothing wrong with that as such, LRU
> might take longer to calculate than is worthwhile.

Yes, a formal LRU was considered but keeping it sorted just caused too
much overhead.

>
> 3 looks wrong here. Should be & (VMACACHE_SIZE-1) and then either
> express VMACACHE_SIZE in terms of a shift or hope someone does not set
> it to 5 and wonder what went wrong.

Yes.

> I see you calculated hit rates for your changelog. How about adding
> tracepoints for vmacache_find() hit and misses in a follow-up patch so it
> can be recalculated with ftrace without debugging patches?

I was planning on a follow up patch to add some vmstat counters, which
is what I used for this patch, for some debug config. I can look into
tracepoints instead.

Thanks,
Davidlohr