The purge_lock spinlock causes high latencies with non RT kernel. This has been
reported multiple times on lkml [1] [2] and affects applications like audio.
In this patch, I replace the spinlock with an atomic refcount so that
preemption is kept turned on during purge. This Ok to do since [3] builds the
lazy free list in advance and atomically retrieves the list so any instance of
purge will have its own list it is purging. Since the individual vmap area
frees are themselves protected by a lock, this is Ok.
The only thing left is the fact that previously it had trylock behavior, so
preserve that by using the refcount to keep track of in-progress purges and
abort like previously if there is an ongoing purge. Lastly, decrement
vmap_lazy_nr as the vmap areas are freed, and not in advance, so that the
vmap_lazy_nr is not reduced too soon as suggested in [2].
Tests:
x86_64 quad core machine on kernel 4.8, run: cyclictest -p 99 -n
Concurrently in a kernel module, run vmalloc and vfree 8K buffer in a loop.
Preemption configuration: CONFIG_PREEMPT__LL=y (low-latency desktop)
Without patch, cyclictest output:
policy: fifo: loadavg: 0.05 0.01 0.00 1/85 1272 Avg: 128 Max: 1177
policy: fifo: loadavg: 0.11 0.03 0.01 2/87 1447 Avg: 122 Max: 1897
policy: fifo: loadavg: 0.10 0.03 0.01 1/89 1656 Avg: 93 Max: 2886
With patch, cyclictest output:
policy: fifo: loadavg: 1.15 0.68 0.30 1/162 8399 Avg: 92 Max: 284
policy: fifo: loadavg: 1.21 0.71 0.32 2/164 9840 Avg: 94 Max: 296
policy: fifo: loadavg: 1.18 0.72 0.32 2/166 11253 Avg: 107 Max: 321
[1] http://lists.openwall.net/linux-kernel/2016/03/23/29
[2] https://lkml.org/lkml/2016/10/9/59
[3] https://lkml.org/lkml/2016/4/15/287
[thanks Chris for the cond_resched_lock ideas]
Cc: Chris Wilson <[email protected]>
Cc: Jisheng Zhang <[email protected]>
Cc: John Dias <[email protected]>
Cc: Andrew Morton <[email protected]>
Signed-off-by: Joel Fernandes <[email protected]>
---
v3 changes:
Fixed ordering of va pointer access and __free_vmap_area
v2 changes:
Updated test description in commit message, and typos.
mm/vmalloc.c | 25 +++++++++++++------------
1 file changed, 13 insertions(+), 12 deletions(-)
diff --git a/mm/vmalloc.c b/mm/vmalloc.c
index 613d1d9..0270723 100644
--- a/mm/vmalloc.c
+++ b/mm/vmalloc.c
@@ -626,11 +626,11 @@ void set_iounmap_nonlazy(void)
static void __purge_vmap_area_lazy(unsigned long *start, unsigned long *end,
int sync, int force_flush)
{
- static DEFINE_SPINLOCK(purge_lock);
+ static atomic_t purging;
struct llist_node *valist;
struct vmap_area *va;
struct vmap_area *n_va;
- int nr = 0;
+ int dofree = 0;
/*
* If sync is 0 but force_flush is 1, we'll go sync anyway but callers
@@ -638,10 +638,10 @@ static void __purge_vmap_area_lazy(unsigned long *start, unsigned long *end,
* the case that isn't actually used at the moment anyway.
*/
if (!sync && !force_flush) {
- if (!spin_trylock(&purge_lock))
+ if (atomic_cmpxchg(&purging, 0, 1))
return;
} else
- spin_lock(&purge_lock);
+ atomic_inc(&purging);
if (sync)
purge_fragmented_blocks_allcpus();
@@ -652,22 +652,23 @@ static void __purge_vmap_area_lazy(unsigned long *start, unsigned long *end,
*start = va->va_start;
if (va->va_end > *end)
*end = va->va_end;
- nr += (va->va_end - va->va_start) >> PAGE_SHIFT;
+ dofree = 1;
}
- if (nr)
- atomic_sub(nr, &vmap_lazy_nr);
-
- if (nr || force_flush)
+ if (dofree || force_flush)
flush_tlb_kernel_range(*start, *end);
- if (nr) {
+ if (dofree) {
spin_lock(&vmap_area_lock);
- llist_for_each_entry_safe(va, n_va, valist, purge_list)
+ llist_for_each_entry_safe(va, n_va, valist, purge_list) {
+ int nrfree = ((va->va_end - va->va_start) >> PAGE_SHIFT);
__free_vmap_area(va);
+ atomic_sub(nrfree, &vmap_lazy_nr);
+ cond_resched_lock(&vmap_area_lock);
+ }
spin_unlock(&vmap_area_lock);
}
- spin_unlock(&purge_lock);
+ atomic_dec(&purging);
}
/*
--
2.8.0.rc3.226.g39d4020
atomic_t magic for llocking is always a bit iffy.
The real question is: what is purge_lock even supposed to protect?
- purge_fragmented_blocks seems to have it's own local protection.
- all handling of of valist is implicity protected by the atomic
list deletion in llist_del_all
- the manipulation of vmap_lazy_nr already is atomic
- flush_tlb_kernel_range should not require any synchronization
- the calls to __free_vmap_area are sychronized by vmap_area_lock
- *start and *end always point to on-stack variables, never mind
that the caller never looks at the updated values anyway.
And once we use the cmpxchg in llist_del_all as the effectit protection
against multiple concurrent callers (slightly sloopy) there just be
nothing else to synchronize.
And while we're at it the code structure here is totally grotty.
So what about the patch below (only boot tested):
diff --git a/mm/vmalloc.c b/mm/vmalloc.c
index f2481cb..c2b9a98 100644
--- a/mm/vmalloc.c
+++ b/mm/vmalloc.c
@@ -613,82 +613,43 @@ void set_iounmap_nonlazy(void)
atomic_set(&vmap_lazy_nr, lazy_max_pages()+1);
}
-/*
- * Purges all lazily-freed vmap areas.
- *
- * If sync is 0 then don't purge if there is already a purge in progress.
- * If force_flush is 1, then flush kernel TLBs between *start and *end even
- * if we found no lazy vmap areas to unmap (callers can use this to optimise
- * their own TLB flushing).
- * Returns with *start = min(*start, lowest purged address)
- * *end = max(*end, highest purged address)
- */
-static void __purge_vmap_area_lazy(unsigned long *start, unsigned long *end,
- int sync, int force_flush)
+static bool __purge_vmap_area_lazy(unsigned long start, unsigned long end)
{
- static DEFINE_SPINLOCK(purge_lock);
struct llist_node *valist;
struct vmap_area *va;
struct vmap_area *n_va;
int nr = 0;
- /*
- * If sync is 0 but force_flush is 1, we'll go sync anyway but callers
- * should not expect such behaviour. This just simplifies locking for
- * the case that isn't actually used at the moment anyway.
- */
- if (!sync && !force_flush) {
- if (!spin_trylock(&purge_lock))
- return;
- } else
- spin_lock(&purge_lock);
-
- if (sync)
- purge_fragmented_blocks_allcpus();
-
valist = llist_del_all(&vmap_purge_list);
llist_for_each_entry(va, valist, purge_list) {
- if (va->va_start < *start)
- *start = va->va_start;
- if (va->va_end > *end)
- *end = va->va_end;
+ if (va->va_start < start)
+ start = va->va_start;
+ if (va->va_end > end)
+ end = va->va_end;
nr += (va->va_end - va->va_start) >> PAGE_SHIFT;
}
- if (nr)
- atomic_sub(nr, &vmap_lazy_nr);
-
- if (nr || force_flush)
- flush_tlb_kernel_range(*start, *end);
-
- if (nr) {
- spin_lock(&vmap_area_lock);
- llist_for_each_entry_safe(va, n_va, valist, purge_list)
- __free_vmap_area(va);
- spin_unlock(&vmap_area_lock);
- }
- spin_unlock(&purge_lock);
-}
+ if (!nr)
+ return false;
-/*
- * Kick off a purge of the outstanding lazy areas. Don't bother if somebody
- * is already purging.
- */
-static void try_purge_vmap_area_lazy(void)
-{
- unsigned long start = ULONG_MAX, end = 0;
+ atomic_sub(nr, &vmap_lazy_nr);
- __purge_vmap_area_lazy(&start, &end, 0, 0);
+ spin_lock(&vmap_area_lock);
+ llist_for_each_entry_safe(va, n_va, valist, purge_list)
+ __free_vmap_area(va);
+ spin_unlock(&vmap_area_lock);
+ return true;
}
/*
- * Kick off a purge of the outstanding lazy areas.
+ * Kick off a purge of the outstanding lazy areas, including the fragment
+ * blocks on the per-cpu lists.
*/
static void purge_vmap_area_lazy(void)
{
- unsigned long start = ULONG_MAX, end = 0;
+ purge_fragmented_blocks_allcpus();
+ __purge_vmap_area_lazy(ULONG_MAX, 0);
- __purge_vmap_area_lazy(&start, &end, 1, 0);
}
/*
@@ -706,8 +667,12 @@ static void free_vmap_area_noflush(struct vmap_area *va)
/* After this point, we may free va at any time */
llist_add(&va->purge_list, &vmap_purge_list);
+ /*
+ * Kick off a purge of the outstanding lazy areas. Don't bother to
+ * to purge the per-cpu lists of fragmented blocks.
+ */
if (unlikely(nr_lazy > lazy_max_pages()))
- try_purge_vmap_area_lazy();
+ __purge_vmap_area_lazy(ULONG_MAX, 0);
}
/*
@@ -1094,7 +1059,9 @@ void vm_unmap_aliases(void)
rcu_read_unlock();
}
- __purge_vmap_area_lazy(&start, &end, 1, flush);
+ purge_fragmented_blocks_allcpus();
+ if (!__purge_vmap_area_lazy(start, end) && flush)
+ flush_tlb_kernel_range(start, end);
}
EXPORT_SYMBOL_GPL(vm_unmap_aliases);
And now with a proper changelog, and the accidentall dropped call to
flush_tlb_kernel_range reinstated:
---
>From f720cc324498ab5e7931c7ccb1653bd9b8cddc63 Mon Sep 17 00:00:00 2001
From: Christoph Hellwig <[email protected]>
Date: Sat, 15 Oct 2016 18:39:44 +0200
Subject: mm: rewrite __purge_vmap_area_lazy
Remove the purge lock, there was nothing left to be protected:
- purge_fragmented_blocks seems to has it's own local protection.
- all handling of of valist is implicity protected by the atomic
list deletion in llist_del_all, which also avoids multiple callers
stomping on each other here.
- the manipulation of vmap_lazy_nr already is atomic
- flush_tlb_kernel_range does not require any synchronization
- the calls to __free_vmap_area are sychronized by vmap_area_lock
- *start and *end always point to on-stack variables, never mind
that the caller never looks at the updated values anyway.
Once that is done we can remove the sync argument by moving the calls
to purge_fragmented_blocks_allcpus into the callers that need it,
and the forced flush_tlb_kernel_range call even if no entries were
found into the one caller that cares, and we can also pass start and
end by reference.
Signed-off-by: Christoph Hellwig <[email protected]>
---
mm/vmalloc.c | 84 +++++++++++++++++++-----------------------------------------
1 file changed, 26 insertions(+), 58 deletions(-)
diff --git a/mm/vmalloc.c b/mm/vmalloc.c
index f2481cb..c3ca992 100644
--- a/mm/vmalloc.c
+++ b/mm/vmalloc.c
@@ -613,82 +613,44 @@ void set_iounmap_nonlazy(void)
atomic_set(&vmap_lazy_nr, lazy_max_pages()+1);
}
-/*
- * Purges all lazily-freed vmap areas.
- *
- * If sync is 0 then don't purge if there is already a purge in progress.
- * If force_flush is 1, then flush kernel TLBs between *start and *end even
- * if we found no lazy vmap areas to unmap (callers can use this to optimise
- * their own TLB flushing).
- * Returns with *start = min(*start, lowest purged address)
- * *end = max(*end, highest purged address)
- */
-static void __purge_vmap_area_lazy(unsigned long *start, unsigned long *end,
- int sync, int force_flush)
+static bool __purge_vmap_area_lazy(unsigned long start, unsigned long end)
{
- static DEFINE_SPINLOCK(purge_lock);
struct llist_node *valist;
struct vmap_area *va;
struct vmap_area *n_va;
int nr = 0;
- /*
- * If sync is 0 but force_flush is 1, we'll go sync anyway but callers
- * should not expect such behaviour. This just simplifies locking for
- * the case that isn't actually used at the moment anyway.
- */
- if (!sync && !force_flush) {
- if (!spin_trylock(&purge_lock))
- return;
- } else
- spin_lock(&purge_lock);
-
- if (sync)
- purge_fragmented_blocks_allcpus();
-
valist = llist_del_all(&vmap_purge_list);
llist_for_each_entry(va, valist, purge_list) {
- if (va->va_start < *start)
- *start = va->va_start;
- if (va->va_end > *end)
- *end = va->va_end;
+ if (va->va_start < start)
+ start = va->va_start;
+ if (va->va_end > end)
+ end = va->va_end;
nr += (va->va_end - va->va_start) >> PAGE_SHIFT;
}
- if (nr)
- atomic_sub(nr, &vmap_lazy_nr);
-
- if (nr || force_flush)
- flush_tlb_kernel_range(*start, *end);
-
- if (nr) {
- spin_lock(&vmap_area_lock);
- llist_for_each_entry_safe(va, n_va, valist, purge_list)
- __free_vmap_area(va);
- spin_unlock(&vmap_area_lock);
- }
- spin_unlock(&purge_lock);
-}
+ if (!nr)
+ return false;
-/*
- * Kick off a purge of the outstanding lazy areas. Don't bother if somebody
- * is already purging.
- */
-static void try_purge_vmap_area_lazy(void)
-{
- unsigned long start = ULONG_MAX, end = 0;
+ atomic_sub(nr, &vmap_lazy_nr);
+ flush_tlb_kernel_range(start, end);
- __purge_vmap_area_lazy(&start, &end, 0, 0);
+ spin_lock(&vmap_area_lock);
+ llist_for_each_entry_safe(va, n_va, valist, purge_list)
+ __free_vmap_area(va);
+ spin_unlock(&vmap_area_lock);
+ return true;
}
/*
- * Kick off a purge of the outstanding lazy areas.
+ * Kick off a purge of the outstanding lazy areas, including the fragment
+ * blocks on the per-cpu lists.
*/
static void purge_vmap_area_lazy(void)
{
- unsigned long start = ULONG_MAX, end = 0;
+ purge_fragmented_blocks_allcpus();
+ __purge_vmap_area_lazy(ULONG_MAX, 0);
- __purge_vmap_area_lazy(&start, &end, 1, 0);
}
/*
@@ -706,8 +668,12 @@ static void free_vmap_area_noflush(struct vmap_area *va)
/* After this point, we may free va at any time */
llist_add(&va->purge_list, &vmap_purge_list);
+ /*
+ * Kick off a purge of the outstanding lazy areas. Don't bother to
+ * to purge the per-cpu lists of fragmented blocks.
+ */
if (unlikely(nr_lazy > lazy_max_pages()))
- try_purge_vmap_area_lazy();
+ __purge_vmap_area_lazy(ULONG_MAX, 0);
}
/*
@@ -1094,7 +1060,9 @@ void vm_unmap_aliases(void)
rcu_read_unlock();
}
- __purge_vmap_area_lazy(&start, &end, 1, flush);
+ purge_fragmented_blocks_allcpus();
+ if (!__purge_vmap_area_lazy(start, end) && flush)
+ flush_tlb_kernel_range(start, end);
}
EXPORT_SYMBOL_GPL(vm_unmap_aliases);
--
2.1.4
Hi Christoph,
On Sat, Oct 15, 2016 at 9:54 AM, Christoph Hellwig <[email protected]> wrote:
> And now with a proper changelog, and the accidentall dropped call to
> flush_tlb_kernel_range reinstated:
>
> ---
> From f720cc324498ab5e7931c7ccb1653bd9b8cddc63 Mon Sep 17 00:00:00 2001
> From: Christoph Hellwig <[email protected]>
> Date: Sat, 15 Oct 2016 18:39:44 +0200
> Subject: mm: rewrite __purge_vmap_area_lazy
>
> Remove the purge lock, there was nothing left to be protected:
>
> - purge_fragmented_blocks seems to has it's own local protection.
> - all handling of of valist is implicity protected by the atomic
> list deletion in llist_del_all, which also avoids multiple callers
> stomping on each other here.
> - the manipulation of vmap_lazy_nr already is atomic
> - flush_tlb_kernel_range does not require any synchronization
> - the calls to __free_vmap_area are sychronized by vmap_area_lock
> - *start and *end always point to on-stack variables, never mind
> that the caller never looks at the updated values anyway.
>
> Once that is done we can remove the sync argument by moving the calls
> to purge_fragmented_blocks_allcpus into the callers that need it,
> and the forced flush_tlb_kernel_range call even if no entries were
> found into the one caller that cares, and we can also pass start and
> end by reference.
Your patch changes the behavior of the original code I think. With the
patch, for the case where you have 2 concurrent tasks executing
alloc_vmap_area function, say both hit the overflow label and enter
the __purge_vmap_area_lazy at the same time. The first task empties
the purge list and sets nr to the total number of pages of all the
vmap areas in the list. Say the first task has just emptied the list
but hasn't started freeing the vmap areas and is preempted at this
point. Now the second task runs and since the purge list is empty, the
second task doesn't have anything to do and immediately returns to
alloc_vmap_area. Once it returns, it sets purged to 1 in
alloc_vmap_area and retries. Say it hits overflow label again in the
retry path. Now because purged was set to 1, it goes to err_free.
Without your patch, it would have waited on the spin_lock (sync = 1)
instead of just erroring out, so your patch does change the behavior
of the original code by not using the purge_lock. I realize my patch
also changes the behavior, but in mine I think we can make it behave
like the original code by spinning until purging=0 (if sync = 1)
because I still have the purging variable..
Some more comments below:
> Signed-off-by: Christoph Hellwig <[email protected]>
> ---
> mm/vmalloc.c | 84 +++++++++++++++++++-----------------------------------------
> 1 file changed, 26 insertions(+), 58 deletions(-)
>
> diff --git a/mm/vmalloc.c b/mm/vmalloc.c
> index f2481cb..c3ca992 100644
> --- a/mm/vmalloc.c
> +++ b/mm/vmalloc.c
> @@ -613,82 +613,44 @@ void set_iounmap_nonlazy(void)
> atomic_set(&vmap_lazy_nr, lazy_max_pages()+1);
> }
>
> -/*
> - * Purges all lazily-freed vmap areas.
> - *
> - * If sync is 0 then don't purge if there is already a purge in progress.
> - * If force_flush is 1, then flush kernel TLBs between *start and *end even
> - * if we found no lazy vmap areas to unmap (callers can use this to optimise
> - * their own TLB flushing).
> - * Returns with *start = min(*start, lowest purged address)
> - * *end = max(*end, highest purged address)
> - */
> -static void __purge_vmap_area_lazy(unsigned long *start, unsigned long *end,
> - int sync, int force_flush)
> +static bool __purge_vmap_area_lazy(unsigned long start, unsigned long end)
> {
> - static DEFINE_SPINLOCK(purge_lock);
> struct llist_node *valist;
> struct vmap_area *va;
> struct vmap_area *n_va;
> int nr = 0;
>
> - /*
> - * If sync is 0 but force_flush is 1, we'll go sync anyway but callers
> - * should not expect such behaviour. This just simplifies locking for
> - * the case that isn't actually used at the moment anyway.
> - */
> - if (!sync && !force_flush) {
> - if (!spin_trylock(&purge_lock))
> - return;
> - } else
> - spin_lock(&purge_lock);
> -
> - if (sync)
> - purge_fragmented_blocks_allcpus();
> -
> valist = llist_del_all(&vmap_purge_list);
> llist_for_each_entry(va, valist, purge_list) {
> - if (va->va_start < *start)
> - *start = va->va_start;
> - if (va->va_end > *end)
> - *end = va->va_end;
> + if (va->va_start < start)
> + start = va->va_start;
> + if (va->va_end > end)
> + end = va->va_end;
> nr += (va->va_end - va->va_start) >> PAGE_SHIFT;
> }
>
> - if (nr)
> - atomic_sub(nr, &vmap_lazy_nr);
> -
> - if (nr || force_flush)
> - flush_tlb_kernel_range(*start, *end);
> -
> - if (nr) {
> - spin_lock(&vmap_area_lock);
> - llist_for_each_entry_safe(va, n_va, valist, purge_list)
> - __free_vmap_area(va);
> - spin_unlock(&vmap_area_lock);
> - }
> - spin_unlock(&purge_lock);
> -}
> + if (!nr)
> + return false;
>
> -/*
> - * Kick off a purge of the outstanding lazy areas. Don't bother if somebody
> - * is already purging.
> - */
> -static void try_purge_vmap_area_lazy(void)
> -{
> - unsigned long start = ULONG_MAX, end = 0;
> + atomic_sub(nr, &vmap_lazy_nr);
> + flush_tlb_kernel_range(start, end);
>
> - __purge_vmap_area_lazy(&start, &end, 0, 0);
> + spin_lock(&vmap_area_lock);
> + llist_for_each_entry_safe(va, n_va, valist, purge_list)
> + __free_vmap_area(va);
> + spin_unlock(&vmap_area_lock);
You should add a cond_resched_lock here as Chris Wilson suggested. I
tried your patch both with and without cond_resched_lock in this loop,
and without it I see the same problems my patch solves (high latencies
on cyclic test). With cond_resched_lock, your patch does solve my
problem although as I was worried above - that it changes the original
behavior.
Also, could you share your concerns about use of atomic_t in my patch?
I believe that since this is not a contented variable, the question of
lock fairness is not a concern. It is also not a lock really the way
I'm using it, it just keeps track of how many purges are in progress..
Thanks,
Joel
On Sat, Oct 15, 2016 at 03:59:34PM -0700, Joel Fernandes wrote:
> Your patch changes the behavior of the original code I think.
It does. And it does so as I don't think the existing behavior makes
sense, as mentioned in the changelog.
> With the
> patch, for the case where you have 2 concurrent tasks executing
> alloc_vmap_area function, say both hit the overflow label and enter
> the __purge_vmap_area_lazy at the same time. The first task empties
> the purge list and sets nr to the total number of pages of all the
> vmap areas in the list. Say the first task has just emptied the list
> but hasn't started freeing the vmap areas and is preempted at this
> point. Now the second task runs and since the purge list is empty, the
> second task doesn't have anything to do and immediately returns to
> alloc_vmap_area. Once it returns, it sets purged to 1 in
> alloc_vmap_area and retries. Say it hits overflow label again in the
> retry path. Now because purged was set to 1, it goes to err_free.
> Without your patch, it would have waited on the spin_lock (sync = 1)
> instead of just erroring out, so your patch does change the behavior
> of the original code by not using the purge_lock. I realize my patch
> also changes the behavior, but in mine I think we can make it behave
> like the original code by spinning until purging=0 (if sync = 1)
> because I still have the purging variable..
But for sync = 1 you don't spin on it in any way. This is the logic
in your patch:
if (!sync && !force_flush) {
if (atomic_cmpxchg(&purging, 0, 1))
return;
} else
atomic_inc(&purging);
So when called from free_vmap_area_noflush your skip the whole call
if anyone else is currently purging the list, but all other cases
we will always execute the code. So maybe my mistake with to follow
what your patch did as I just jumped onto it for seeing this atomic_t
"synchronization", but the change in behavior to your is very limited,
basically the only difference is that if free_vmap_area_noflush hits
the purge cases due to having lots of lazy pages on the purge list
it will execute despite some other purge being in progress.
> You should add a cond_resched_lock here as Chris Wilson suggested. I
> tried your patch both with and without cond_resched_lock in this loop,
> and without it I see the same problems my patch solves (high latencies
> on cyclic test). With cond_resched_lock, your patch does solve my
> problem although as I was worried above - that it changes the original
> behavior.
Yes, I actually pointed that out to "zhouxianrong", who sent a patch
just after yours. I just thought lock breaking is a different aspect
to improving this whole function. In fact I suspect even my patch
should probably be split up quite a bit more.
> Also, could you share your concerns about use of atomic_t in my patch?
> I believe that since this is not a contented variable, the question of
> lock fairness is not a concern. It is also not a lock really the way
> I'm using it, it just keeps track of how many purges are in progress..
atomic_t doesn't have any acquire/release semantics, and will require
off memory barrier dances to actually get the behavior you intended.
And from looking at the code I can't really see why we even would
want synchronization behavior - for the sort of problems where we
don't want multiple threads to run the same code at the same time
for effiency but not correctness reasons it's usually better to have
batch thresholds and/or splicing into local data structures before
operations. Both are techniques used in this code, and I'd rather
rely on them and if required improve on them then using very odd
hoc synchronization methods.
On Sat, Oct 15, 2016 at 11:10 PM, Christoph Hellwig <[email protected]> wrote:
> On Sat, Oct 15, 2016 at 03:59:34PM -0700, Joel Fernandes wrote:
>> Your patch changes the behavior of the original code I think.
>
> It does. And it does so as I don't think the existing behavior makes
> sense, as mentioned in the changelog.
>
>> With the
>> patch, for the case where you have 2 concurrent tasks executing
>> alloc_vmap_area function, say both hit the overflow label and enter
>> the __purge_vmap_area_lazy at the same time. The first task empties
>> the purge list and sets nr to the total number of pages of all the
>> vmap areas in the list. Say the first task has just emptied the list
>> but hasn't started freeing the vmap areas and is preempted at this
>> point. Now the second task runs and since the purge list is empty, the
>> second task doesn't have anything to do and immediately returns to
>> alloc_vmap_area. Once it returns, it sets purged to 1 in
>> alloc_vmap_area and retries. Say it hits overflow label again in the
>> retry path. Now because purged was set to 1, it goes to err_free.
>> Without your patch, it would have waited on the spin_lock (sync = 1)
>> instead of just erroring out, so your patch does change the behavior
>> of the original code by not using the purge_lock. I realize my patch
>> also changes the behavior, but in mine I think we can make it behave
>> like the original code by spinning until purging=0 (if sync = 1)
>> because I still have the purging variable..
>
> But for sync = 1 you don't spin on it in any way. This is the logic
> in your patch:
>
> if (!sync && !force_flush) {
> if (atomic_cmpxchg(&purging, 0, 1))
> return;
> } else
> atomic_inc(&purging);
I agree my patch doesn't spin too, I mentioned this above: "I realize my patch
also changes the behavior". However if you dont have too much of a problem with
my use of atomics, then my below new patch handles the case for sync=1 and is
identical in behavior to the original code while still fixing the preempt off
latency issues.
Your patch just got rid of sync completely, I'm not sure if that's Ok to do?
the alloc_vmap_area overflow case was assuming sync=1. The original
alloc_vmap_area code calls purge with sync=1 and waits for pending purges to
complete, instead of erroring out. I wanted to preserve this behavior.
>> Also, could you share your concerns about use of atomic_t in my patch?
>> I believe that since this is not a contented variable, the question of
>> lock fairness is not a concern. It is also not a lock really the way
>> I'm using it, it just keeps track of how many purges are in progress..
>
> atomic_t doesn't have any acquire/release semantics, and will require
> off memory barrier dances to actually get the behavior you intended.
> And from looking at the code I can't really see why we even would
> want synchronization behavior - for the sort of problems where we
> don't want multiple threads to run the same code at the same time
> for effiency but not correctness reasons it's usually better to have
> batch thresholds and/or splicing into local data structures before
> operations. Both are techniques used in this code, and I'd rather
> rely on them and if required improve on them then using very odd
> hoc synchronization methods.
Thanks for the explanation. If you know of a better way to handle the sync=1
case, let me know. In defense of atomics, even vmap_lazy_nr in the same code is
atomic_t :) I am also not using it as a lock really, but just to count how many
times something is in progress that's all - I added some more comments to my
last patch to make this clearer in the code and now I'm also handling the case
for sync=1.
-----------------8<------------
From: Joel Fernandes <[email protected]>
Date: Sat, 15 Oct 2016 01:04:14 -0700
Subject: [PATCH v4] mm: vmalloc: Replace purge_lock spinlock with atomic refcount
The purge_lock spinlock causes high latencies with non RT kernel. This has been
reported multiple times on lkml [1] [2] and affects applications like audio.
In this patch, I replace the spinlock with an atomic refcount so that
preemption is kept turned on during purge. This Ok to do since [3] builds the
lazy free list in advance and atomically retrieves the list so any instance of
purge will have its own list it is purging. Since the individual vmap area
frees are themselves protected by a lock, this is Ok.
The only thing left is the fact that previously it had trylock behavior, so
preserve that by using the refcount to keep track of in-progress purges and
abort like previously if there is an ongoing purge. Lastly, decrement
vmap_lazy_nr as the vmap areas are freed, and not in advance, so that the
vmap_lazy_nr is not reduced too soon as suggested in [2].
Tests:
x86_64 quad core machine on kernel 4.8, run: cyclictest -p 99 -n
Concurrently in a kernel module, run vmalloc and vfree 8K buffer in a loop.
Preemption configuration: CONFIG_PREEMPT__LL=y (low-latency desktop)
Without patch, cyclictest output:
policy: fifo: loadavg: 0.05 0.01 0.00 1/85 1272 Avg: 128 Max: 1177
policy: fifo: loadavg: 0.11 0.03 0.01 2/87 1447 Avg: 122 Max: 1897
policy: fifo: loadavg: 0.10 0.03 0.01 1/89 1656 Avg: 93 Max: 2886
With patch, cyclictest output:
policy: fifo: loadavg: 1.15 0.68 0.30 1/162 8399 Avg: 92 Max: 284
policy: fifo: loadavg: 1.21 0.71 0.32 2/164 9840 Avg: 94 Max: 296
policy: fifo: loadavg: 1.18 0.72 0.32 2/166 11253 Avg: 107 Max: 321
[1] http://lists.openwall.net/linux-kernel/2016/03/23/29
[2] https://lkml.org/lkml/2016/10/9/59
[3] https://lkml.org/lkml/2016/4/15/287
[thanks Chris for the cond_resched_lock ideas]
Cc: Chris Wilson <[email protected]>
Cc: Jisheng Zhang <[email protected]>
Cc: John Dias <[email protected]>
Cc: Andrew Morton <[email protected]>
Signed-off-by: Joel Fernandes <[email protected]>
---
v4 changes:
Handle case for when sync=1
Earlier changes:
Fixed ordering of va pointer access and __free_vmap_area
Updated test description in commit message, and typos.
mm/vmalloc.c | 35 +++++++++++++++++++++++------------
1 file changed, 23 insertions(+), 12 deletions(-)
diff --git a/mm/vmalloc.c b/mm/vmalloc.c
index 613d1d9..db2791a 100644
--- a/mm/vmalloc.c
+++ b/mm/vmalloc.c
@@ -626,11 +626,11 @@ void set_iounmap_nonlazy(void)
static void __purge_vmap_area_lazy(unsigned long *start, unsigned long *end,
int sync, int force_flush)
{
- static DEFINE_SPINLOCK(purge_lock);
+ static atomic_t purging;
struct llist_node *valist;
struct vmap_area *va;
struct vmap_area *n_va;
- int nr = 0;
+ int dofree = 0;
/*
* If sync is 0 but force_flush is 1, we'll go sync anyway but callers
@@ -638,10 +638,13 @@ static void __purge_vmap_area_lazy(unsigned long *start, unsigned long *end,
* the case that isn't actually used at the moment anyway.
*/
if (!sync && !force_flush) {
- if (!spin_trylock(&purge_lock))
+ /*
+ * Incase a purge is already in progress, just return.
+ */
+ if (atomic_cmpxchg(&purging, 0, 1))
return;
} else
- spin_lock(&purge_lock);
+ atomic_inc(&purging);
if (sync)
purge_fragmented_blocks_allcpus();
@@ -652,22 +655,30 @@ static void __purge_vmap_area_lazy(unsigned long *start, unsigned long *end,
*start = va->va_start;
if (va->va_end > *end)
*end = va->va_end;
- nr += (va->va_end - va->va_start) >> PAGE_SHIFT;
+ dofree = 1;
}
- if (nr)
- atomic_sub(nr, &vmap_lazy_nr);
-
- if (nr || force_flush)
+ if (dofree || force_flush)
flush_tlb_kernel_range(*start, *end);
- if (nr) {
+ if (dofree) {
spin_lock(&vmap_area_lock);
- llist_for_each_entry_safe(va, n_va, valist, purge_list)
+ llist_for_each_entry_safe(va, n_va, valist, purge_list) {
+ int nrfree = ((va->va_end - va->va_start) >> PAGE_SHIFT);
__free_vmap_area(va);
+ atomic_sub(nrfree, &vmap_lazy_nr);
+ cond_resched_lock(&vmap_area_lock);
+ }
spin_unlock(&vmap_area_lock);
}
- spin_unlock(&purge_lock);
+
+ atomic_dec(&purging);
+
+ /*
+ * If sync is 1, wait for pending purges to be completed.
+ */
+ while(sync && atomic_read(&purging))
+ cpu_relax();
}
/*
--
2.8.0.rc3.226.g39d4020
Hi Christoph,
On Sun, Oct 16, 2016 at 3:06 PM, Joel Fernandes <[email protected]> wrote:
> On Sat, Oct 15, 2016 at 11:10 PM, Christoph Hellwig <[email protected]> wrote:
>> On Sat, Oct 15, 2016 at 03:59:34PM -0700, Joel Fernandes wrote:
>>> Also, could you share your concerns about use of atomic_t in my patch?
>>> I believe that since this is not a contented variable, the question of
>>> lock fairness is not a concern. It is also not a lock really the way
>>> I'm using it, it just keeps track of how many purges are in progress..
>>
>> atomic_t doesn't have any acquire/release semantics, and will require
>> off memory barrier dances to actually get the behavior you intended.
>> And from looking at the code I can't really see why we even would
>> want synchronization behavior - for the sort of problems where we
>> don't want multiple threads to run the same code at the same time
>> for effiency but not correctness reasons it's usually better to have
>> batch thresholds and/or splicing into local data structures before
>> operations. Both are techniques used in this code, and I'd rather
>> rely on them and if required improve on them then using very odd
>> hoc synchronization methods.
>
> Thanks for the explanation. If you know of a better way to handle the sync=1
> case, let me know. In defense of atomics, even vmap_lazy_nr in the same code is
> atomic_t :) I am also not using it as a lock really, but just to count how many
> times something is in progress that's all - I added some more comments to my
> last patch to make this clearer in the code and now I'm also handling the case
> for sync=1.
Also, one more thing about the barrier dances you mentioned, this will
also be done by the spinlock which was there before my patch. So in
favor of my patch, it doesn't make things any worse than they were and
actually fixes the reported issue while preserving the original code
behavior. So I think it is a good thing to fix the issue considering
so many people are reporting it and any clean ups of the vmalloc code
itself can follow.
If you want I can looking into replacing the atomic_cmpxchg with an
atomic_inc and not do anything different for sync vs !sync except for
spinning when purges are pending. Would that make you feel a bit
better?
So instead of:
if (!sync && !force_flush) {
/*
* Incase a purge is already in progress, just return.
*/
if (atomic_cmpxchg(&purging, 0, 1))
return;
} else
atomic_inc(&purging);
,
Just do a:
atomic_inc(&purging);
This should be Ok to do since in the !sync case, we'll just return
anyway if another purge was in progress.
Thanks,
Joel
On Sun, Oct 16, 2016 at 03:48:42PM -0700, Joel Fernandes wrote:
> Also, one more thing about the barrier dances you mentioned, this will
> also be done by the spinlock which was there before my patch. So in
> favor of my patch, it doesn't make things any worse than they were and
> actually fixes the reported issue while preserving the original code
> behavior. So I think it is a good thing to fix the issue considering
> so many people are reporting it and any clean ups of the vmalloc code
> itself can follow.
I'm not worried about having barriers, we use them all over our
synchronization primitives. I'm worried about opencoding them
instead of having them in these well defined helpers.
So based on this discussion and the feedback from Nick I'll
propose a new patch (or rather a series to make it more understandable)
that adds a mutex, adds the lockbreak and gives sensible calling
conventions to __purge_vmap_area_lazy.