When perf profiling a wide variety of workloads, it was found
that vmacache_find() had higher than expected cost: up to 0.08% of cpu
utilization in some cases. This was found to rival other core VM
functions such as alloc_pages_vma() with thp enabled and default
mempolicy, and the conditionals in __get_vma_policy().
VMACACHE_HASH() determines which of the four per-task_struct slots a vma
is cached in for a particular address. This currently depends on the pfn,
so pfn 5212 occupies a different vmacache slot than its neighboring
pfn 5213.
vmacache_find() iterates through all four of current's vmacache slots
when looking up an address. Hashing based on pfn, an address has
~1/VMACACHE_SIZE chance of being cached in the first vmacache slot, or
about 25%, *if* the vma is cached.
This patch hashes an address by its pmd instead of pte to optimize for
workloads with good spatial locality. This results in a higher
probability of vmas being cached in the first slot that is checked:
normally ~70% on the same workloads instead of 25%.
Signed-off-by: David Rientjes <[email protected]>
---
include/linux/vmacache.h | 6 ------
mm/vmacache.c | 32 ++++++++++++++++++++++----------
2 files changed, 22 insertions(+), 16 deletions(-)
diff --git a/include/linux/vmacache.h b/include/linux/vmacache.h
--- a/include/linux/vmacache.h
+++ b/include/linux/vmacache.h
@@ -5,12 +5,6 @@
#include <linux/sched.h>
#include <linux/mm.h>
-/*
- * Hash based on the page number. Provides a good hit rate for
- * workloads with good locality and those with random accesses as well.
- */
-#define VMACACHE_HASH(addr) ((addr >> PAGE_SHIFT) & VMACACHE_MASK)
-
static inline void vmacache_flush(struct task_struct *tsk)
{
memset(tsk->vmacache.vmas, 0, sizeof(tsk->vmacache.vmas));
diff --git a/mm/vmacache.c b/mm/vmacache.c
--- a/mm/vmacache.c
+++ b/mm/vmacache.c
@@ -7,6 +7,12 @@
#include <linux/mm.h>
#include <linux/vmacache.h>
+/*
+ * Hash based on the pmd of addr. Provides a good hit rate for workloads with
+ * spatial locality.
+ */
+#define VMACACHE_HASH(addr) ((addr >> PMD_SHIFT) & VMACACHE_MASK)
+
/*
* Flush vma caches for threads that share a given mm.
*
@@ -87,6 +93,7 @@ static bool vmacache_valid(struct mm_struct *mm)
struct vm_area_struct *vmacache_find(struct mm_struct *mm, unsigned long addr)
{
+ int idx = VMACACHE_HASH(addr);
int i;
count_vm_vmacache_event(VMACACHE_FIND_CALLS);
@@ -95,16 +102,18 @@ struct vm_area_struct *vmacache_find(struct mm_struct *mm, unsigned long addr)
return NULL;
for (i = 0; i < VMACACHE_SIZE; i++) {
- struct vm_area_struct *vma = current->vmacache.vmas[i];
-
- if (!vma)
- continue;
- if (WARN_ON_ONCE(vma->vm_mm != mm))
- break;
- if (vma->vm_start <= addr && vma->vm_end > addr) {
- count_vm_vmacache_event(VMACACHE_FIND_HITS);
- return vma;
+ struct vm_area_struct *vma = current->vmacache.vmas[idx];
+
+ if (vma) {
+ if (WARN_ON_ONCE(vma->vm_mm != mm))
+ break;
+ if (vma->vm_start <= addr && vma->vm_end > addr) {
+ count_vm_vmacache_event(VMACACHE_FIND_HITS);
+ return vma;
+ }
}
+ if (++idx == VMACACHE_SIZE)
+ idx = 0;
}
return NULL;
@@ -115,6 +124,7 @@ struct vm_area_struct *vmacache_find_exact(struct mm_struct *mm,
unsigned long start,
unsigned long end)
{
+	int idx = VMACACHE_HASH(start);
int i;
count_vm_vmacache_event(VMACACHE_FIND_CALLS);
@@ -123,12 +133,14 @@ struct vm_area_struct *vmacache_find_exact(struct mm_struct *mm,
return NULL;
for (i = 0; i < VMACACHE_SIZE; i++) {
- struct vm_area_struct *vma = current->vmacache.vmas[i];
+ struct vm_area_struct *vma = current->vmacache.vmas[idx];
if (vma && vma->vm_start == start && vma->vm_end == end) {
count_vm_vmacache_event(VMACACHE_FIND_HITS);
return vma;
}
+ if (++idx == VMACACHE_SIZE)
+ idx = 0;
}
return NULL;
On Mon, 9 Jul 2018 17:50:03 -0700 (PDT) David Rientjes <[email protected]> wrote:
> When perf profiling a wide variety of workloads, it was found
> that vmacache_find() had higher than expected cost: up to 0.08% of cpu
> utilization in some cases. This was found to rival other core VM
> functions such as alloc_pages_vma() with thp enabled and default
> mempolicy, and the conditionals in __get_vma_policy().
>
> VMACACHE_HASH() determines which of the four per-task_struct slots a vma
> is cached in for a particular address. This currently depends on the pfn,
> so pfn 5212 occupies a different vmacache slot than its neighboring
> pfn 5213.
>
> vmacache_find() iterates through all four of current's vmacache slots
> when looking up an address. Hashing based on pfn, an address has
> ~1/VMACACHE_SIZE chance of being cached in the first vmacache slot, or
> about 25%, *if* the vma is cached.
>
> This patch hashes an address by its pmd instead of pte to optimize for
> workloads with good spatial locality. This results in a higher
> probability of vmas being cached in the first slot that is checked:
> normally ~70% on the same workloads instead of 25%.
Was the improvement quantifiable?
Surprised. That little array will all be in CPU cache and that loop
should execute pretty quickly? If it's *that* sensitive then let's zap
the no-longer-needed WARN_ON. And we could hide all the event counting
behind some developer-only ifdef.
Did you consider LRU-sorting the array instead?
On Mon, 9 Jul 2018, Andrew Morton wrote:
> > When perf profiling a wide variety of workloads, it was found
> > that vmacache_find() had higher than expected cost: up to 0.08% of cpu
> > utilization in some cases. This was found to rival other core VM
> > functions such as alloc_pages_vma() with thp enabled and default
> > mempolicy, and the conditionals in __get_vma_policy().
> >
> > VMACACHE_HASH() determines which of the four per-task_struct slots a vma
> > is cached in for a particular address. This currently depends on the pfn,
> > so pfn 5212 occupies a different vmacache slot than its neighboring
> > pfn 5213.
> >
> > vmacache_find() iterates through all four of current's vmacache slots
> > when looking up an address. Hashing based on pfn, an address has
> > ~1/VMACACHE_SIZE chance of being cached in the first vmacache slot, or
> > about 25%, *if* the vma is cached.
> >
> > This patch hashes an address by its pmd instead of pte to optimize for
> > workloads with good spatial locality. This results in a higher
> > probability of vmas being cached in the first slot that is checked:
> > normally ~70% on the same workloads instead of 25%.
>
> Was the improvement quantifiable?
>
I've run page fault testing to answer this question on Haswell since the
initial profiling was done over a wide variety of user-controlled
workloads and there's no guarantee that such profiling would be a fair
comparison either way. For page faulting it either falls below our
testing threshold of 0.02% or is right at 0.02%. Running without the patch,
it's 0.05-0.06% overhead.
> Surprised. That little array will all be in CPU cache and that loop
> should execute pretty quickly? If it's *that* sensitive then let's zap
> the no-longer-needed WARN_ON. And we could hide all the event counting
> behind some developer-only ifdef.
>
Those vm events are only defined for CONFIG_DEBUG_VM_VMACACHE, so no change
is needed there. The WARN_ON() could be moved under the same config option.
I assume that if such a config option exists, at least somebody is
interested in debugging mm/vmacache.c once in a while.
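A sketch of what that could look like on top of this patch (untested, and
only meant to show where the check would move):

	for (i = 0; i < VMACACHE_SIZE; i++) {
		struct vm_area_struct *vma = current->vmacache.vmas[idx];

		if (vma) {
			/* Sanity check only when vmacache debugging is on. */
#ifdef CONFIG_DEBUG_VM_VMACACHE
			if (WARN_ON_ONCE(vma->vm_mm != mm))
				break;
#endif
			if (vma->vm_start <= addr && vma->vm_end > addr) {
				count_vm_vmacache_event(VMACACHE_FIND_HITS);
				return vma;
			}
		}
		if (++idx == VMACACHE_SIZE)
			idx = 0;
	}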
> Did you consider LRU-sorting the array instead?
>
It adds 40 bytes to struct task_struct, but I'm not sure the least
recently used is the first preferred check. If I do
madvise(MADV_DONTNEED) from a malloc implementation where I don't control
what is free()'d and I'm constantly freeing back to the same hugepages,
for example, I may always get first slot cache hits with this patch as
opposed to the 25% chance that the current implementation has (and perhaps
an lru would as well).
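To make that free()/reuse pattern concrete, here is a contrived userspace
sketch (the 2MB size is assumed to match a pmd on x86_64; it only shows the
access pattern, it is not a benchmark):

#include <string.h>
#include <sys/mman.h>

#define REGION	(2UL << 20)	/* assumed pmd-sized region on x86_64 */

int main(void)
{
	char *buf = mmap(NULL, REGION, PROT_READ | PROT_WRITE,
			 MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
	int i;

	if (buf == MAP_FAILED)
		return 1;
	for (i = 0; i < 1000; i++) {
		/*
		 * Touch every page: each fault looks up an address inside
		 * the same pmd, so the pmd hash points at the same vmacache
		 * slot every time.
		 */
		memset(buf, 0, REGION);
		/* "free" back to the kernel so the next pass faults again */
		madvise(buf, REGION, MADV_DONTNEED);
	}
	return 0;
}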
I'm sure that I could construct a workload where LRU would be better and
could show that the added footprint were worthwhile, but I could also
construct a workload where the current implementation based on pfn would
outperform all of these. It simply turns out that on the user-controlled
workloads I was profiling, hashing based on pmd was the win.
On Mon, 9 Jul 2018 18:37:37 -0700 (PDT) David Rientjes <[email protected]> wrote:
> > Did you consider LRU-sorting the array instead?
> >
>
> It adds 40 bytes to struct task_struct,
What does? LRU sort? It's a 4-entry array, just do it in place, like
bh_lru_install(). Confused.
> but I'm not sure the least
> recently used is the first preferred check. If I do
> madvise(MADV_DONTNEED) from a malloc implementation where I don't control
> what is free()'d and I'm constantly freeing back to the same hugepages,
> for example, I may always get first slot cache hits with this patch as
> opposed to the 25% chance that the current implementation has (and perhaps
> an lru would as well).
>
> I'm sure that I could construct a workload where LRU would be better and
> could show that the added footprint were worthwhile, but I could also
> construct a workload where the current implementation based on pfn would
> outperform all of these. It simply turns out that on the user-controlled
> workloads I was profiling, hashing based on pmd was the win.
That leaves us nowhere to go. Zapping the WARN_ON seems a no-brainer
though?
On Wed, 11 Jul 2018, Andrew Morton wrote:
> > > Did you consider LRU-sorting the array instead?
> > >
> >
> > It adds 40 bytes to struct task_struct,
>
> What does? LRU sort? It's a 4-entry array, just do it in place, like
> bh_lru_install(). Confused.
>
I was imagining an optimized sort, rather than adding to vmacache_update()
an iteration of the same form that causes vmacache_find() to show up
in my perf reports in the first place.
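If I follow, the in-place variant would be something like the below on the
lookup path (hypothetical sketch with a made-up vmacache_find_lru() name,
not something I've implemented or measured):

	/*
	 * Scan from slot 0 and, on a hit, move the vma to the front by
	 * shifting the intervening entries down one slot, in the style of
	 * bh_lru_install().
	 */
	static struct vm_area_struct *vmacache_find_lru(struct mm_struct *mm,
							unsigned long addr)
	{
		struct vm_area_struct **vmas = current->vmacache.vmas;
		int i;

		for (i = 0; i < VMACACHE_SIZE; i++) {
			struct vm_area_struct *vma = vmas[i];

			if (vma && vma->vm_mm == mm &&
			    vma->vm_start <= addr && vma->vm_end > addr) {
				while (i > 0) {
					vmas[i] = vmas[i - 1];
					i--;
				}
				vmas[0] = vma;
				return vma;
			}
		}
		return NULL;
	}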
> > but I'm not sure the least
> > recently used is the first preferred check. If I do
> > madvise(MADV_DONTNEED) from a malloc implementation where I don't control
> > what is free()'d and I'm constantly freeing back to the same hugepages,
> > for example, I may always get first slot cache hits with this patch as
> > opposed to the 25% chance that the current implementation has (and perhaps
> > an lru would as well).
> >
> > I'm sure that I could construct a workload where LRU would be better and
> > could show that the added footprint were worthwhile, but I could also
> > construct a workload where the current implementation based on pfn would
> > outperform all of these. It simply turns out that on the user-controlled
> > workloads I was profiling, hashing based on pmd was the win.
>
> That leaves us nowhere to go. Zapping the WARN_ON seems a no-brainer
> though?
>
I would suggest it goes under CONFIG_DEBUG_VM_VMACACHE.
My implementation for the optimized vmacache_find() is based on the
premise that spatial locality matters, and in practice on random
user-controlled workloads this yields a faster lookup than the current
implementation. Of course, any caching technique can be defeated by
workloads, artificial or otherwise, but I suggest that, as a general
principle, caching based on PMD_SHIFT rather than pfn has a greater
likelihood of avoiding the iteration in vmacache_find() because of spatial
locality for anything that iterates over a range of memory.
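To put rough numbers on the spatial locality argument, here is a standalone
sketch of the two hash choices over a sequential walk (the shift values are
assumed x86_64 defaults of PAGE_SHIFT == 12 and PMD_SHIFT == 21, with
VMACACHE_SIZE == 4; none of this is kernel code):

#include <stdio.h>

#define PAGE_SHIFT	12
#define PMD_SHIFT	21
#define VMACACHE_SIZE	4
#define VMACACHE_MASK	(VMACACHE_SIZE - 1)

/* The old per-page hash and the per-pmd hash this patch uses. */
#define HASH_PFN(addr)	(((addr) >> PAGE_SHIFT) & VMACACHE_MASK)
#define HASH_PMD(addr)	(((addr) >> PMD_SHIFT) & VMACACHE_MASK)

int main(void)
{
	unsigned long base = 0x7f0000000000UL;
	unsigned long addr;

	/* Walk eight consecutive pages, as a sequential workload would. */
	for (addr = base; addr < base + 8 * (1UL << PAGE_SHIFT);
	     addr += 1UL << PAGE_SHIFT)
		printf("addr %#lx: pfn slot %lu, pmd slot %lu\n",
		       addr, HASH_PFN(addr), HASH_PMD(addr));
	return 0;
}

The pfn-based hash cycles through all four slots every four pages, while the
pmd-based hash keeps the entire 2MB region in a single slot, which is why a
walk over a range of memory tends to hit the first slot checked.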