2017-03-15 08:59:58

by Aaron Lu

Subject: [PATCH v2 0/5] mm: support parallel free of memory

For regular processes, the time taken in the exit() path to free the
used memory is not a problem. But there are heavy ones that consume
several terabytes of memory, and the time taken to free that memory in
the exit() path can exceed ten minutes if THP is not used.

As Dave Hansen explained why this is worth doing in the kernel:
"
One of the places we saw this happen was when an app crashed and was
exit()'ing under duress without cleaning up nicely. The time that it
takes to unmap a few TB of 4k pages is pretty excessive.
"

To optimize this use case, a parallel free method is proposed here. It
is based on the current gather batch free (the following description is
taken from patch 2/5's changelog).

The current gather batch free works like this:
For each struct mmu_gather *tlb, there is a static buffer to store the
to-be-freed page pointers. Its size is MMU_GATHER_BUNDLE, which is
defined to be 8. So if a tlb teardown doesn't free more than 8 pages,
that is all we need. If more than 8 pages are to be freed, new pages
will need to be allocated to store those to-be-freed page pointers.

The structure used to describe the saved page pointers is
struct mmu_gather_batch, and tlb->local is of this type. tlb->local
differs from the other struct mmu_gather_batch(es) in that its page
pointer array points to the previously described static buffer, while
the other struct mmu_gather_batch(es)' page pointer arrays point to
dynamically allocated pages.

These batches will form a singly linked list, starting from &tlb->local.

tlb->local.pages => tlb->pages(8 pointers)
      \|/
      next => batch1->pages => about 510 pointers
                  \|/
                  next => batch2->pages => about 510 pointers
                              \|/
                              next => batch3->pages => about 510 pointers
                                          ... ...
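
For reference, the batch descriptor and its capacity are defined in
include/asm-generic/tlb.h roughly as follows at the time of this series
(the pages[] array fills the rest of the page, hence the ~510 pointers
per batch shown above):

struct mmu_gather_batch {
	struct mmu_gather_batch	*next;
	unsigned int		nr;
	unsigned int		max;
	struct page		*pages[0];
};

#define MAX_GATHER_BATCH \
	((PAGE_SIZE - sizeof(struct mmu_gather_batch)) / sizeof(void *))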

The proposed parallel free does this: if the process has many pages to be
freed, accumulate them in these struct mmu_gather_batch(es) one after
another until 256K pages are accumulated. Then take the singly linked
list starting from tlb->local.next off struct mmu_gather *tlb and free
it in a worker thread. The main thread can return to continue zapping
other pages (after freeing the pages pointed to by tlb->local.pages).
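
In simplified form, the hand-off added by patch 2/5 looks like the
sketch below (the per-tlb worker-list bookkeeping needed for the final
flush in tlb_finish_mmu is omitted here):

static void tlb_flush_mmu_free(struct mmu_gather *tlb)
{
	struct batch_free_struct *batch_free = NULL;

	/* only offload once enough pages (256K by default) are gathered */
	if (tlb->page_nr >= ASYNC_FREE_THRESHOLD)
		batch_free = kmalloc(sizeof(*batch_free),
				     GFP_NOWAIT | __GFP_NOWARN);

	if (batch_free) {
		/* hand everything after tlb->local to a worker */
		batch_free->batch_start = tlb->local.next;
		INIT_WORK(&batch_free->work, batch_free_work);
		queue_work(system_unbound_wq, &batch_free->work);
		tlb->local.next = NULL;
	}

	/* pages in tlb->local are still freed in the current context */
	tlb_flush_mmu_free_batches(&tlb->local, false);
	tlb->active = &tlb->local;
	tlb->page_nr = 0;
}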

A test program that does a single malloc() of 320G memory is used to see
how useful the proposed parallel free solution is; the time measured is
for the free() call. The test machine is a Haswell EX with
4 nodes/72 cores/144 threads and 512G memory. All tests are done with THP
disabled.

kernel                              time
v4.10                               10.8s  ±2.8%
this patch (with default setting)   5.795s ±5.8%

Patch 3/5 introduced a dedicated workqueue for the free workers and
here are more results when setting different values for max_active of
this workqueue:

max_active:   time
1             8.9s   ±0.5%
2             5.65s  ±5.5%
4             4.84s  ±0.16%
8             4.77s  ±0.97%
16            4.85s  ±0.77%
32            6.21s  ±0.46%

Comments are welcome and appreciated.

v2 changes: Nothing major, only minor ones.
- rebased on top of v4.11-rc2-mmotm-2017-03-14-15-41;
- use list_add_tail instead of list_add to add a worker to the tlb's
worker list so that when doing flush, the first queued worker gets
flushed first (based on the assumption that the first queued worker has
a better chance of finishing its job than the later queued workers);
- use bool instead of int for the variable free_batch_page in
tlb_flush_mmu_free_batches;
- style changes according to ./scripts/checkpatch;
- reword some of the changelogs to make them more readable.

v1 is here:
https://lkml.org/lkml/2017/2/24/245

Aaron Lu (5):
mm: add tlb_flush_mmu_free_batches
mm: parallel free pages
mm: use a dedicated workqueue for the free workers
mm: add force_free_pages in zap_pte_range
mm: add debugfs interface for parallel free tuning

include/asm-generic/tlb.h | 15 ++---
mm/memory.c | 141 +++++++++++++++++++++++++++++++++++++++-------
2 files changed, 128 insertions(+), 28 deletions(-)

--
2.7.4


2017-03-15 09:00:00

by Aaron Lu

Subject: [PATCH v2 1/5] mm: add tlb_flush_mmu_free_batches

There are two places doing page free related to struct mmu_gather_batch:
1 in tlb_flush_mmu_free, where the pages gathered in the mmu_gather_batch
list are freed;
2 in tlb_finish_mmu, where the pages backing the mmu_gather_batch
structures (let's call them the batch pages) are freed.

There will be yet another place in the parallel free worker thread
introduced in the following patch to free both the pages pointed to by
the mmu_gather_batch list and the batch pages themselves. To avoid code
duplication, add a new function named tlb_flush_mmu_free_batches for
this purpose.

Another reason to add this function is that, after the following patch,
cond_resched will need to be added at places where more than 10K pages
can be freed, i.e. in tlb_flush_mmu_free and the worker function.
Instead of adding cond_resched at multiple places, a single function is
used to reduce code duplication.

There should be no functionality change.

Signed-off-by: Aaron Lu <[email protected]>
---
mm/memory.c | 28 +++++++++++++++++-----------
1 file changed, 17 insertions(+), 11 deletions(-)

diff --git a/mm/memory.c b/mm/memory.c
index 14fc0b40f0bb..cdb2a53f251f 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -250,14 +250,25 @@ static void tlb_flush_mmu_tlbonly(struct mmu_gather *tlb)
__tlb_reset_range(tlb);
}

-static void tlb_flush_mmu_free(struct mmu_gather *tlb)
+static void tlb_flush_mmu_free_batches(struct mmu_gather_batch *batch_start,
+ bool free_batch_page)
{
- struct mmu_gather_batch *batch;
+ struct mmu_gather_batch *batch, *next;

- for (batch = &tlb->local; batch && batch->nr; batch = batch->next) {
- free_pages_and_swap_cache(batch->pages, batch->nr);
- batch->nr = 0;
+ for (batch = batch_start; batch; batch = next) {
+ next = batch->next;
+ if (batch->nr) {
+ free_pages_and_swap_cache(batch->pages, batch->nr);
+ batch->nr = 0;
+ }
+ if (free_batch_page)
+ free_pages((unsigned long)batch, 0);
}
+}
+
+static void tlb_flush_mmu_free(struct mmu_gather *tlb)
+{
+ tlb_flush_mmu_free_batches(&tlb->local, false);
tlb->active = &tlb->local;
}

@@ -273,17 +284,12 @@ void tlb_flush_mmu(struct mmu_gather *tlb)
*/
void tlb_finish_mmu(struct mmu_gather *tlb, unsigned long start, unsigned long end)
{
- struct mmu_gather_batch *batch, *next;
-
tlb_flush_mmu(tlb);

/* keep the page table cache within bounds */
check_pgt_cache();

- for (batch = tlb->local.next; batch; batch = next) {
- next = batch->next;
- free_pages((unsigned long)batch, 0);
- }
+ tlb_flush_mmu_free_batches(tlb->local.next, true);
tlb->local.next = NULL;
}

--
2.7.4

2017-03-15 09:00:07

by Aaron Lu

Subject: [PATCH v2 5/5] mm: add debugfs interface for parallel free tuning

Make it possible to set different values for async_free_threshold and
max_gather_batch_count through debugfs.

With this, we can do tests for different purposes (example commands are
shown below):
1 Restore vanilla kernel behaviour for performance comparison.
Set max_gather_batch_count to a small value like 20; this keeps the
number of gathered pages always below async_free_threshold and thus
effectively disables parallel free;
2 Debugging.
Set async_free_threshold to a very small value (like 128) to trigger
parallel free even for ordinary processes, which is ideal for debugging
in a virtual machine that doesn't have much memory assigned to it;
3 Performance tuning.
Use values for async_free_threshold and max_gather_batch_count other
than the defaults to test whether parallel free performs better or worse.
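
For example, assuming debugfs is mounted at /sys/kernel/debug, cases 1
and 2 above can be exercised like this:

# case 1: effectively disable parallel free
echo 20 > /sys/kernel/debug/parallel_free/max_gather_batch_count

# case 2: trigger parallel free even for small processes
echo 128 > /sys/kernel/debug/parallel_free/async_free_threshold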

Signed-off-by: Aaron Lu <[email protected]>
---
mm/memory.c | 33 +++++++++++++++++++++++++++++++--
1 file changed, 31 insertions(+), 2 deletions(-)

diff --git a/mm/memory.c b/mm/memory.c
index 83b38823aaba..3a971cc1fc3b 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -183,6 +183,35 @@ static void check_sync_rss_stat(struct task_struct *task)

#ifdef HAVE_GENERIC_MMU_GATHER

+static unsigned long async_free_threshold = ASYNC_FREE_THRESHOLD;
+static unsigned long max_gather_batch_count = MAX_GATHER_BATCH_COUNT;
+
+#ifdef CONFIG_DEBUG_FS
+static int __init tlb_mmu_parallel_free_debugfs(void)
+{
+ umode_t mode = 0644;
+ struct dentry *dir;
+
+ dir = debugfs_create_dir("parallel_free", NULL);
+ if (!dir)
+ return -ENOMEM;
+
+ if (!debugfs_create_ulong("async_free_threshold", mode, dir,
+ &async_free_threshold))
+ goto fail;
+ if (!debugfs_create_ulong("max_gather_batch_count", mode, dir,
+ &max_gather_batch_count))
+ goto fail;
+
+ return 0;
+
+fail:
+ debugfs_remove_recursive(dir);
+ return -ENOMEM;
+}
+late_initcall(tlb_mmu_parallel_free_debugfs);
+#endif
+
static bool tlb_next_batch(struct mmu_gather *tlb)
{
struct mmu_gather_batch *batch;
@@ -193,7 +222,7 @@ static bool tlb_next_batch(struct mmu_gather *tlb)
return true;
}

- if (tlb->batch_count == MAX_GATHER_BATCH_COUNT)
+ if (tlb->batch_count == max_gather_batch_count)
return false;

batch = (void *)__get_free_pages(GFP_NOWAIT | __GFP_NOWARN, 0);
@@ -307,7 +336,7 @@ static void tlb_flush_mmu_free(struct mmu_gather *tlb)
{
struct batch_free_struct *batch_free = NULL;

- if (tlb->page_nr >= ASYNC_FREE_THRESHOLD)
+ if (tlb->page_nr >= async_free_threshold)
batch_free = kmalloc(sizeof(*batch_free),
GFP_NOWAIT | __GFP_NOWARN);

--
2.7.4

2017-03-15 09:00:12

by Aaron Lu

Subject: [PATCH v2 4/5] mm: add force_free_pages in zap_pte_range

force_flush in zap_pte_range is set under the following 2 conditions:
1 When no more batches can be allocated (either due to lack of memory or
because MAX_GATHER_BATCH_COUNT has been reached) to store the
to-be-freed page pointers;
2 When a TLB-only flush is needed before dropping the PTE lock to avoid
a race condition, as explained in commit 1cf35d47712d ("mm: split
'tlb_flush_mmu()' into tlb flushing and memory freeing parts").

Once force_flush is set, the pages accumulated thus far are all freed.
Since there is no need to free pages for condition 2, add a new variable
named force_free_pages to decide whether page free should be done; it
will only be set for condition 1.

With this change, page accumulation will no longer be interrupted by
condition 2. In the meantime, rename force_flush to force_flush_tlb for
condition 2.

Signed-off-by: Aaron Lu <[email protected]>
---
mm/memory.c | 20 ++++++++------------
1 file changed, 8 insertions(+), 12 deletions(-)

diff --git a/mm/memory.c b/mm/memory.c
index 19b25bb5f45b..83b38823aaba 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -1199,7 +1199,7 @@ static unsigned long zap_pte_range(struct mmu_gather *tlb,
struct zap_details *details)
{
struct mm_struct *mm = tlb->mm;
- int force_flush = 0;
+ int force_flush_tlb = 0, force_free_pages = 0;
int rss[NR_MM_COUNTERS];
spinlock_t *ptl;
pte_t *start_pte;
@@ -1239,7 +1239,7 @@ static unsigned long zap_pte_range(struct mmu_gather *tlb,

if (!PageAnon(page)) {
if (pte_dirty(ptent)) {
- force_flush = 1;
+ force_flush_tlb = 1;
set_page_dirty(page);
}
if (pte_young(ptent) &&
@@ -1251,7 +1251,7 @@ static unsigned long zap_pte_range(struct mmu_gather *tlb,
if (unlikely(page_mapcount(page) < 0))
print_bad_pte(vma, addr, ptent, page);
if (unlikely(__tlb_remove_page(tlb, page))) {
- force_flush = 1;
+ force_free_pages = 1;
addr += PAGE_SIZE;
break;
}
@@ -1279,18 +1279,14 @@ static unsigned long zap_pte_range(struct mmu_gather *tlb,
arch_leave_lazy_mmu_mode();

/* Do the actual TLB flush before dropping ptl */
- if (force_flush)
+ if (force_flush_tlb) {
+ force_flush_tlb = 0;
tlb_flush_mmu_tlbonly(tlb);
+ }
pte_unmap_unlock(start_pte, ptl);

- /*
- * If we forced a TLB flush (either due to running out of
- * batch buffers or because we needed to flush dirty TLB
- * entries before releasing the ptl), free the batched
- * memory too. Restart if we didn't do everything.
- */
- if (force_flush) {
- force_flush = 0;
+ if (force_free_pages) {
+ force_free_pages = 0;
tlb_flush_mmu_free(tlb);
if (addr != end)
goto again;
--
2.7.4

2017-03-15 09:00:05

by Aaron Lu

Subject: [PATCH v2 2/5] mm: parallel free pages

For regular processes, the time taken in the exit() path to free the
used memory is not a problem. But there are heavy ones that consume
several terabytes of memory, and the time taken to free that memory can
exceed ten minutes.

To optimize this use case, a parallel free method is proposed, based on
the current gather batch free.

The current gather batch free works like this:
For each struct mmu_gather *tlb, there is a static buffer to store the
to-be-freed page pointers. Its size is MMU_GATHER_BUNDLE, which is
defined to be 8. So if a tlb teardown doesn't free more than 8 pages,
that is all we need. If more than 8 pages are to be freed, new pages
will need to be allocated to store those to-be-freed page pointers.

The structure used to describe the saved page pointers is
struct mmu_gather_batch, and tlb->local is of this type. tlb->local
differs from the other struct mmu_gather_batch(es) in that its page
pointer array points to the previously described static buffer, while
the other struct mmu_gather_batch(es)' page pointer arrays point to
dynamically allocated pages.

These batches will form a singly linked list, starting from &tlb->local.

tlb->local.pages => tlb->pages(8 pointers)
      \|/
      next => batch1->pages => about 510 pointers
                  \|/
                  next => batch2->pages => about 510 pointers
                              \|/
                              next => batch3->pages => about 510 pointers
                                          ... ...

The proposed parallel free does this: if the process has many pages to be
freed, accumulate them in these struct mmu_gather_batch(es) one after
another until 256K pages are accumulated. Then take the singly linked
list starting from tlb->local.next off struct mmu_gather *tlb and free
it in a worker thread. The main thread can return to continue zapping
other pages (after freeing the pages pointed to by tlb->local.pages).

Note that since we may now be accumulating as many as 256K pages, the
soft lockup on !CONFIG_PREEMPT issue fixed by commit 53a59fc67f97 ("mm:
limit mmu_gather batching to fix soft lockups on !CONFIG_PREEMPT") could
reappear. To address that, add cond_resched() in
tlb_flush_mmu_free_batches, where many pages can be freed.

Signed-off-by: Aaron Lu <[email protected]>
---
include/asm-generic/tlb.h | 15 +++++++------
mm/memory.c | 57 ++++++++++++++++++++++++++++++++++++++++++++++-
2 files changed, 64 insertions(+), 8 deletions(-)

diff --git a/include/asm-generic/tlb.h b/include/asm-generic/tlb.h
index 4329bc6ef04b..7c2ac179cc47 100644
--- a/include/asm-generic/tlb.h
+++ b/include/asm-generic/tlb.h
@@ -78,13 +78,10 @@ struct mmu_gather_batch {
#define MAX_GATHER_BATCH \
((PAGE_SIZE - sizeof(struct mmu_gather_batch)) / sizeof(void *))

-/*
- * Limit the maximum number of mmu_gather batches to reduce a risk of soft
- * lockups for non-preemptible kernels on huge machines when a lot of memory
- * is zapped during unmapping.
- * 10K pages freed at once should be safe even without a preemption point.
- */
-#define MAX_GATHER_BATCH_COUNT (10000UL/MAX_GATHER_BATCH)
+#define ASYNC_FREE_THRESHOLD (256*1024UL)
+#define MAX_GATHER_BATCH_COUNT \
+ DIV_ROUND_UP(ASYNC_FREE_THRESHOLD, MAX_GATHER_BATCH)
+#define PAGE_FREE_NR_TO_YIELD (10000UL)

/* struct mmu_gather is an opaque type used by the mm code for passing around
* any data needed by arch specific code for tlb_remove_page.
@@ -108,6 +105,10 @@ struct mmu_gather {
struct page *__pages[MMU_GATHER_BUNDLE];
unsigned int batch_count;
int page_size;
+ /* how many pages we have gathered to be freed */
+ unsigned int page_nr;
+ /* list for spawned workers that do the free jobs */
+ struct list_head worker_list;
};

#define HAVE_GENERIC_MMU_GATHER
diff --git a/mm/memory.c b/mm/memory.c
index cdb2a53f251f..001c7720d773 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -228,6 +228,9 @@ void tlb_gather_mmu(struct mmu_gather *tlb, struct mm_struct *mm, unsigned long
tlb->local.max = ARRAY_SIZE(tlb->__pages);
tlb->active = &tlb->local;
tlb->batch_count = 0;
+ tlb->page_nr = 0;
+
+ INIT_LIST_HEAD(&tlb->worker_list);

#ifdef CONFIG_HAVE_RCU_TABLE_FREE
tlb->batch = NULL;
@@ -254,22 +257,65 @@ static void tlb_flush_mmu_free_batches(struct mmu_gather_batch *batch_start,
bool free_batch_page)
{
struct mmu_gather_batch *batch, *next;
+ int nr = 0;

for (batch = batch_start; batch; batch = next) {
next = batch->next;
if (batch->nr) {
free_pages_and_swap_cache(batch->pages, batch->nr);
+ nr += batch->nr;
batch->nr = 0;
}
- if (free_batch_page)
+ if (free_batch_page) {
free_pages((unsigned long)batch, 0);
+ nr++;
+ }
+ if (nr >= PAGE_FREE_NR_TO_YIELD) {
+ cond_resched();
+ nr = 0;
+ }
}
}

+struct batch_free_struct {
+ struct work_struct work;
+ struct mmu_gather_batch *batch_start;
+ struct list_head list;
+};
+
+static void batch_free_work(struct work_struct *work)
+{
+ struct batch_free_struct *batch_free = container_of(work,
+ struct batch_free_struct, work);
+ tlb_flush_mmu_free_batches(batch_free->batch_start, true);
+}
+
static void tlb_flush_mmu_free(struct mmu_gather *tlb)
{
+ struct batch_free_struct *batch_free = NULL;
+
+ if (tlb->page_nr >= ASYNC_FREE_THRESHOLD)
+ batch_free = kmalloc(sizeof(*batch_free),
+ GFP_NOWAIT | __GFP_NOWARN);
+
+ if (batch_free) {
+ /*
+ * Start a worker to free pages stored
+ * in batches following tlb->local.
+ */
+ batch_free->batch_start = tlb->local.next;
+ INIT_WORK(&batch_free->work, batch_free_work);
+ list_add_tail(&batch_free->list, &tlb->worker_list);
+ queue_work(system_unbound_wq, &batch_free->work);
+
+ tlb->batch_count = 0;
+ tlb->local.next = NULL;
+ /* fall through to free pages stored in tlb->local */
+ }
+
tlb_flush_mmu_free_batches(&tlb->local, false);
tlb->active = &tlb->local;
+ tlb->page_nr = 0;
}

void tlb_flush_mmu(struct mmu_gather *tlb)
@@ -284,11 +330,18 @@ void tlb_flush_mmu(struct mmu_gather *tlb)
*/
void tlb_finish_mmu(struct mmu_gather *tlb, unsigned long start, unsigned long end)
{
+ struct batch_free_struct *batch_free, *n;
+
tlb_flush_mmu(tlb);

/* keep the page table cache within bounds */
check_pgt_cache();

+ list_for_each_entry_safe(batch_free, n, &tlb->worker_list, list) {
+ flush_work(&batch_free->work);
+ kfree(batch_free);
+ }
+
tlb_flush_mmu_free_batches(tlb->local.next, true);
tlb->local.next = NULL;
}
@@ -307,6 +360,8 @@ bool __tlb_remove_page_size(struct mmu_gather *tlb, struct page *page, int page_
VM_BUG_ON(!tlb->end);
VM_WARN_ON(tlb->page_size != page_size);

+ tlb->page_nr++;
+
batch = tlb->active;
/*
* Add the page and check if we are full. If so
--
2.7.4

2017-03-15 09:00:04

by Aaron Lu

Subject: [PATCH v2 3/5] mm: use a dedicated workqueue for the free workers

Introduce a workqueue for all the free workers so that users can fine
tune how many workers can be active through its sysfs interface:
max_active. More workers will normally lead to better performance, but
too many can cause severe lock contention.

Note that since the zone lock is global, the workqueue is also global
for all processes, i.e. if max_active is set to 8, there will be at most
8 workers active for all processes that are doing munmap()/exit()/etc.
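
Since the workqueue is created with WQ_SYSFS, max_active can be adjusted
at run time, for example:

# allow up to 8 free workers system-wide
echo 8 > /sys/devices/virtual/workqueue/batch_free_wq/max_active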

Signed-off-by: Aaron Lu <[email protected]>
---
mm/memory.c | 15 ++++++++++++++-
1 file changed, 14 insertions(+), 1 deletion(-)

diff --git a/mm/memory.c b/mm/memory.c
index 001c7720d773..19b25bb5f45b 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -253,6 +253,19 @@ static void tlb_flush_mmu_tlbonly(struct mmu_gather *tlb)
__tlb_reset_range(tlb);
}

+static struct workqueue_struct *batch_free_wq;
+static int __init batch_free_wq_init(void)
+{
+ batch_free_wq = alloc_workqueue("batch_free_wq",
+ WQ_UNBOUND | WQ_SYSFS, 0);
+ if (!batch_free_wq) {
+ pr_warn("failed to create workqueue batch_free_wq\n");
+ return -ENOMEM;
+ }
+ return 0;
+}
+subsys_initcall(batch_free_wq_init);
+
static void tlb_flush_mmu_free_batches(struct mmu_gather_batch *batch_start,
bool free_batch_page)
{
@@ -306,7 +319,7 @@ static void tlb_flush_mmu_free(struct mmu_gather *tlb)
batch_free->batch_start = tlb->local.next;
INIT_WORK(&batch_free->work, batch_free_work);
list_add_tail(&batch_free->list, &tlb->worker_list);
- queue_work(system_unbound_wq, &batch_free->work);
+ queue_work(batch_free_wq, &batch_free->work);

tlb->batch_count = 0;
tlb->local.next = NULL;
--
2.7.4

2017-03-15 09:43:18

by Hillf Danton

Subject: Re: [PATCH v2 2/5] mm: parallel free pages


On March 15, 2017 5:00 PM Aaron Lu wrote:
> void tlb_finish_mmu(struct mmu_gather *tlb, unsigned long start, unsigned long end)
> {
> + struct batch_free_struct *batch_free, *n;
> +
s/*n/*next/

> tlb_flush_mmu(tlb);
>
> /* keep the page table cache within bounds */
> check_pgt_cache();
>
> + list_for_each_entry_safe(batch_free, n, &tlb->worker_list, list) {
> + flush_work(&batch_free->work);

Not sure, list_del before free?

> + kfree(batch_free);
> + }
> +
> tlb_flush_mmu_free_batches(tlb->local.next, true);
> tlb->local.next = NULL;
> }

2017-03-15 11:54:37

by Aaron Lu

Subject: Re: [PATCH v2 2/5] mm: parallel free pages

On Wed, Mar 15, 2017 at 05:42:42PM +0800, Hillf Danton wrote:
>
> On March 15, 2017 5:00 PM Aaron Lu wrote:
> > void tlb_finish_mmu(struct mmu_gather *tlb, unsigned long start, unsigned long end)
> > {
> > + struct batch_free_struct *batch_free, *n;
> > +
> s/*n/*next/
>
> > tlb_flush_mmu(tlb);
> >
> > /* keep the page table cache within bounds */
> > check_pgt_cache();
> >
> > + list_for_each_entry_safe(batch_free, n, &tlb->worker_list, list) {
> > + flush_work(&batch_free->work);
>
> Not sure, list_del before free?

I think this is a good idea, it makes the code look saner.
I just did a search of list_for_each_entry_safe and found that list_del
is usually (I didn't check every one of them) used before free.

So I'll add that in the next revision, probably some days later in case
there are other comments.

Thanks for your time to review the patch.

Regards,
Aaron

> > + kfree(batch_free);
> > + }
> > +
> > tlb_flush_mmu_free_batches(tlb->local.next, true);
> > tlb->local.next = NULL;
> > }
>

2017-03-15 14:19:14

by Michal Hocko

Subject: Re: [PATCH v2 0/5] mm: support parallel free of memory

On Wed 15-03-17 16:59:59, Aaron Lu wrote:
[...]
> The proposed parallel free did this: if the process has many pages to be
> freed, accumulate them in these struct mmu_gather_batch(es) one after
> another till 256K pages are accumulated. Then take this singly linked
> list starting from tlb->local.next off struct mmu_gather *tlb and free
> them in a worker thread. The main thread can return to continue zap
> other pages(after freeing pages pointed by tlb->local.pages).

I didn't have a look at the implementation yet but there are two
concerns that arise from this description. Firstly, how are we going
to tune the number of workers? I assume there will be some upper bound
(one of the patch subjects mentions debugfs for tuning). Secondly, if
we offload the page freeing to the worker, then the original context
can consume many more cpu cycles than it was configured to via the cpu
controller. How are we going to handle that? Or is this considered
acceptable?
--
Michal Hocko
SUSE Labs

2017-03-15 14:56:35

by Vlastimil Babka

Subject: Re: [PATCH v2 0/5] mm: support parallel free of memory

On 03/15/2017 09:59 AM, Aaron Lu wrote:
> For regular processes, the time taken in its exit() path to free its
> used memory is not a problem. But there are heavy ones that consume
> several Terabytes memory and the time taken to free its memory in its
> exit() path could last more than ten minutes if THP is not used.
>
> As Dave Hansen explained why do this in kernel:
> "
> One of the places we saw this happen was when an app crashed and was
> exit()'ing under duress without cleaning up nicely. The time that it
> takes to unmap a few TB of 4k pages is pretty excessive.
> "

Yeah, it would be nice to improve such cases.

> To optimize this use case, a parallel free method is proposed here and
> it is based on the current gather batch free(the following description
> is taken from patch 2/5's changelog).
>
> The current gather batch free works like this:
> For each struct mmu_gather *tlb, there is a static buffer to store those
> to-be-freed page pointers. The size is MMU_GATHER_BUNDLE, which is
> defined to be 8. So if a tlb tear down doesn't free more than 8 pages,
> that is all we need. If 8+ pages are to be freed, new pages will need
> to be allocated to store those to-be-freed page pointers.
>
> The structure used to describe the saved page pointers is called
> struct mmu_gather_batch and tlb->local is of this type. tlb->local is
> different than other struct mmu_gather_batch(es) in that the page
> pointer array used by tlb->local points to the previouslly described
> static buffer while the other struct mmu_gather_batch(es) page pointer
> array points to the dynamically allocated pages.
>
> These batches will form a singly linked list, starting from &tlb->local.
>
> tlb->local.pages => tlb->pages(8 pointers)
>       \|/
>       next => batch1->pages => about 510 pointers
>                   \|/
>                   next => batch2->pages => about 510 pointers
>                               \|/
>                               next => batch3->pages => about 510 pointers
>                                           ... ...
>
> The proposed parallel free did this: if the process has many pages to be
> freed, accumulate them in these struct mmu_gather_batch(es) one after
> another till 256K pages are accumulated. Then take this singly linked
> list starting from tlb->local.next off struct mmu_gather *tlb and free
> them in a worker thread. The main thread can return to continue zap
> other pages(after freeing pages pointed by tlb->local.pages).
>
> A test program that did a single malloc() of 320G memory is used to see
> how useful the proposed parallel free solution is, the time calculated
> is for the free() call. Test machine is a Haswell EX which has
> 4nodes/72cores/144threads with 512G memory. All tests are done with THP
> disabled.
>
> kernel                              time
> v4.10                               10.8s  ±2.8%
> this patch(with default setting)    5.795s ±5.8%

I wonder if the difference would be larger if the parallelism was done
on a higher level, something around unmap_page_range(). IIUC the current
approach still leaves a lot of work to a single thread, right?
I assume it would be more complicated, but doable as we already have the
OOM reaper doing unmaps parallel to other activity? Has that been
considered?

Thanks, Vlastimil

>
> Patch 3/5 introduced a dedicated workqueue for the free workers and
> here are more results when setting different values for max_active of
> this workqueue:
>
> max_active:   time
> 1             8.9s   ±0.5%
> 2             5.65s  ±5.5%
> 4             4.84s  ±0.16%
> 8             4.77s  ±0.97%
> 16            4.85s  ±0.77%
> 32            6.21s  ±0.46%
>
> Comments are welcome and appreciated.
>
> v2 changes: Nothing major, only minor ones.
> - rebased on top of v4.11-rc2-mmotm-2017-03-14-15-41;
> - use list_add_tail instead of list_add to add worker to tlb's worker
> list so that when doing flush, the first queued worker gets flushed
> first(based on the comsumption that the first queued worker has a
> better chance of finishing its job than those later queued workers);
> - use bool instead of int for variable free_batch_page in function
> tlb_flush_mmu_free_batches;
> - style change according to ./scripts/checkpatch;
> - reword some of the changelogs to make it more readable.
>
> v1 is here:
> https://lkml.org/lkml/2017/2/24/245
>
> Aaron Lu (5):
> mm: add tlb_flush_mmu_free_batches
> mm: parallel free pages
> mm: use a dedicated workqueue for the free workers
> mm: add force_free_pages in zap_pte_range
> mm: add debugfs interface for parallel free tuning
>
> include/asm-generic/tlb.h | 15 ++---
> mm/memory.c | 141 +++++++++++++++++++++++++++++++++++++++-------
> 2 files changed, 128 insertions(+), 28 deletions(-)
>

2017-03-15 15:44:52

by Aaron Lu

Subject: Re: [PATCH v2 0/5] mm: support parallel free of memory

On Wed, Mar 15, 2017 at 03:18:14PM +0100, Michal Hocko wrote:
> On Wed 15-03-17 16:59:59, Aaron Lu wrote:
> [...]
> > The proposed parallel free did this: if the process has many pages to be
> > freed, accumulate them in these struct mmu_gather_batch(es) one after
> > another till 256K pages are accumulated. Then take this singly linked
> > list starting from tlb->local.next off struct mmu_gather *tlb and free
> > them in a worker thread. The main thread can return to continue zap
> > other pages(after freeing pages pointed by tlb->local.pages).
>
> I didn't have a look at the implementation yet but there are two
> concerns that raise up from this description. Firstly how are we going
> to tune the number of workers. I assume there will be some upper bound
> (one of the patch subject mentions debugfs for tuning) and secondly

The workers are put in a dedicated workqueue which is introduced in
patch 3/5 and the number of workers can be tuned through that workqueue's
sysfs interface: max_active.

> if we offload the page freeing to the worker then the original context
> can consume much more cpu cycles than it was configured via cpu
> controller. How are we going to handle that? Or is this considered
> acceptable?

I'll need to think about it and take a look at this subject (I'm not
familiar with the cpu controller).

Thanks.

2017-03-15 15:51:54

by Aaron Lu

Subject: Re: [PATCH v2 0/5] mm: support parallel free of memory

On Wed, Mar 15, 2017 at 03:56:02PM +0100, Vlastimil Babka wrote:
> On 03/15/2017 09:59 AM, Aaron Lu wrote:
> > For regular processes, the time taken in its exit() path to free its
> > used memory is not a problem. But there are heavy ones that consume
> > several Terabytes memory and the time taken to free its memory in its
> > exit() path could last more than ten minutes if THP is not used.
> >
> > As Dave Hansen explained why do this in kernel:
> > "
> > One of the places we saw this happen was when an app crashed and was
> > exit()'ing under duress without cleaning up nicely. The time that it
> > takes to unmap a few TB of 4k pages is pretty excessive.
> > "
>
> Yeah, it would be nice to improve such cases.

Glad to hear this.

>
> > To optimize this use case, a parallel free method is proposed here and
> > it is based on the current gather batch free(the following description
> > is taken from patch 2/5's changelog).
> >
> > The current gather batch free works like this:
> > For each struct mmu_gather *tlb, there is a static buffer to store those
> > to-be-freed page pointers. The size is MMU_GATHER_BUNDLE, which is
> > defined to be 8. So if a tlb tear down doesn't free more than 8 pages,
> > that is all we need. If 8+ pages are to be freed, new pages will need
> > to be allocated to store those to-be-freed page pointers.
> >
> > The structure used to describe the saved page pointers is called
> > struct mmu_gather_batch and tlb->local is of this type. tlb->local is
> > different than other struct mmu_gather_batch(es) in that the page
> > pointer array used by tlb->local points to the previouslly described
> > static buffer while the other struct mmu_gather_batch(es) page pointer
> > array points to the dynamically allocated pages.
> >
> > These batches will form a singly linked list, starting from &tlb->local.
> >
> > tlb->local.pages => tlb->pages(8 pointers)
> >       \|/
> >       next => batch1->pages => about 510 pointers
> >                   \|/
> >                   next => batch2->pages => about 510 pointers
> >                               \|/
> >                               next => batch3->pages => about 510 pointers
> >                                           ... ...
> >
> > The proposed parallel free did this: if the process has many pages to be
> > freed, accumulate them in these struct mmu_gather_batch(es) one after
> > another till 256K pages are accumulated. Then take this singly linked
> > list starting from tlb->local.next off struct mmu_gather *tlb and free
> > them in a worker thread. The main thread can return to continue zap
> > other pages(after freeing pages pointed by tlb->local.pages).
> >
> > A test program that did a single malloc() of 320G memory is used to see
> > how useful the proposed parallel free solution is, the time calculated
> > is for the free() call. Test machine is a Haswell EX which has
> > 4nodes/72cores/144threads with 512G memory. All tests are done with THP
> > disabled.
> >
> > kernel                              time
> > v4.10                               10.8s  ±2.8%
> > this patch(with default setting)    5.795s ±5.8%
>
> I wonder if the difference would be larger if the parallelism was done
> on a higher level, something around unmap_page_range(). IIUC the current

We have tried to do it at the VMA level, but there is a problem: suppose
a program has many VMAs but only one or two of them are big/huge ones,
then the parallelism is not good.

I also considered PUD-based parallel free; the potential issue with it
is that there could be very few physical pages actually present for a
given PUD, so the worker may have very little to do.
For the test case used here though, a PUD-based approach should work
better since all PTEs are faulted in.

> approach still leaves a lot of work to a single thread, right?

Yes, the main thread will be responsible for the page table walk, PTE
clearing and possibly flushing the TLB in the race condition case. But
considering the issues of the other two approaches mentioned, I chose
the current approach.

Perhaps I should also implement a PUD-based parallel free and then use a
program that has 2 huge VMAs of equal size, one with all pages faulted
in RAM while the other has none, and then compare the two approaches'
performance. Does this make sense?

> I assume it would be more complicated, but doable as we already have the
> OOM reaper doing unmaps parallel to other activity? Has that been
> considered?

Since the tlb structure is not meant to be accessed concurrently, I
assume there will be some trouble handling it when going the PUD-based
route. Will take a look at it tomorrow (it's late here).

Thanks.
-Aaron

>
> Thanks, Vlastimil
>
> >
> > Patch 3/5 introduced a dedicated workqueue for the free workers and
> > here are more results when setting different values for max_active of
> > this workqueue:
> >
> > max_active:   time
> > 1             8.9s   ±0.5%
> > 2             5.65s  ±5.5%
> > 4             4.84s  ±0.16%
> > 8             4.77s  ±0.97%
> > 16            4.85s  ±0.77%
> > 32            6.21s  ±0.46%
> >
> > Comments are welcome and appreciated.
> >
> > v2 changes: Nothing major, only minor ones.
> > - rebased on top of v4.11-rc2-mmotm-2017-03-14-15-41;
> > - use list_add_tail instead of list_add to add worker to tlb's worker
> > list so that when doing flush, the first queued worker gets flushed
> > first(based on the comsumption that the first queued worker has a
> > better chance of finishing its job than those later queued workers);
> > - use bool instead of int for variable free_batch_page in function
> > tlb_flush_mmu_free_batches;
> > - style change according to ./scripts/checkpatch;
> > - reword some of the changelogs to make it more readable.
> >
> > v1 is here:
> > https://lkml.org/lkml/2017/2/24/245
> >
> > Aaron Lu (5):
> > mm: add tlb_flush_mmu_free_batches
> > mm: parallel free pages
> > mm: use a dedicated workqueue for the free workers
> > mm: add force_free_pages in zap_pte_range
> > mm: add debugfs interface for parallel free tuning
> >
> > include/asm-generic/tlb.h | 15 ++---
> > mm/memory.c | 141 +++++++++++++++++++++++++++++++++++++++-------
> > 2 files changed, 128 insertions(+), 28 deletions(-)
> >
>

2017-03-15 16:28:49

by Michal Hocko

Subject: Re: [PATCH v2 0/5] mm: support parallel free of memory

On Wed 15-03-17 23:44:07, Aaron Lu wrote:
> On Wed, Mar 15, 2017 at 03:18:14PM +0100, Michal Hocko wrote:
> > On Wed 15-03-17 16:59:59, Aaron Lu wrote:
> > [...]
> > > The proposed parallel free did this: if the process has many pages to be
> > > freed, accumulate them in these struct mmu_gather_batch(es) one after
> > > another till 256K pages are accumulated. Then take this singly linked
> > > list starting from tlb->local.next off struct mmu_gather *tlb and free
> > > them in a worker thread. The main thread can return to continue zap
> > > other pages(after freeing pages pointed by tlb->local.pages).
> >
> > I didn't have a look at the implementation yet but there are two
> > concerns that raise up from this description. Firstly how are we going
> > to tune the number of workers. I assume there will be some upper bound
> > (one of the patch subject mentions debugfs for tuning) and secondly
>
> The workers are put in a dedicated workqueue which is introduced in
> patch 3/5 and the number of workers can be tuned through that workqueue's
> sysfs interface: max_active.

I suspect we cannot expect users to tune this. What do you consider a
reasonable default?

Moreover, and this is a more generic question, is this functionality
useful in general purpose workloads? After all, the amount of work to
be done is the same; we just risk more lock contention, unexpected CPU
usage etc. Which workloads will benefit from having a faster exit path?

> > if we offload the page freeing to the worker then the original context
> > can consume much more cpu cycles than it was configured via cpu

I was not precise here. I meant to say more cpu cycles per time unit
than it was allowed.

> > controller. How are we going to handle that? Or is this considered
> > acceptable?
>
> I'll need to think about and take a look at this subject(not familiar
> with cpu controller).

the main problem is that kworkers will not belong to the same cpu group
and so they will not be throttled properly.
--
Michal Hocko
SUSE Labs

2017-03-15 21:38:38

by Tim Chen

Subject: Re: [PATCH v2 0/5] mm: support parallel free of memory

On Wed, 2017-03-15 at 17:28 +0100, Michal Hocko wrote:
> On Wed 15-03-17 23:44:07, Aaron Lu wrote:
> >
> > On Wed, Mar 15, 2017 at 03:18:14PM +0100, Michal Hocko wrote:
> > >
> > > On Wed 15-03-17 16:59:59, Aaron Lu wrote:
> > > [...]
> > > >
> > > > The proposed parallel free did this: if the process has many pages to be
> > > > freed, accumulate them in these struct mmu_gather_batch(es) one after
> > > > another till 256K pages are accumulated. Then take this singly linked
> > > > list starting from tlb->local.next off struct mmu_gather *tlb and free
> > > > them in a worker thread. The main thread can return to continue zap
> > > > other pages(after freeing pages pointed by tlb->local.pages).
> > > I didn't have a look at the implementation yet but there are two
> > > concerns that raise up from this description. Firstly how are we going
> > > to tune the number of workers. I assume there will be some upper bound
> > > (one of the patch subject mentions debugfs for tuning) and secondly
> > The workers are put in a dedicated workqueue which is introduced in
> > patch 3/5 and the number of workers can be tuned through that workqueue's
> > sysfs interface: max_active.
> I suspect we cannot expect users to tune this. What do you consider a
> reasonable default?

From Aaron's data, it seems like 4 is a reasonable value for max_active:

max_active:   time
1             8.9s   ±0.5%
2             5.65s  ±5.5%
4             4.84s  ±0.16%
8             4.77s  ±0.97%
16            4.85s  ±0.77%
32            6.21s  ±0.46%


> Moreover, and this is a more generic question, is this functionality
> useful in general purpose workloads?

If we are running consecutive batch jobs, this optimization
should help start the next job sooner.

> After all the amount of the work to
> be done is the same we just risk more lock contentions, unexpected CPU
> usage etc. Which workloads will benefit from having exit path faster?
>  
> >
> > >
> > > if we offload the page freeing to the worker then the original context
> > > can consume much more cpu cycles than it was configured via cpu
> I was not precise here. I meant to say more cpu cycles per time unit
> that it was allowed.
>
> >
> > >
> > > controller. How are we going to handle that? Or is this considered
> > > acceptable?
> > I'll need to think about and take a look at this subject(not familiar
> > with cpu controller).
> the main problem is that kworkers will not belong to the same cpu group
> and so they will not be throttled properly.

You do have a point that these page freeing activities should strive to
affect other threads not in the same cgroup minimally.

On the other hand, we also don't do this throttling of kworkers
today (e.g. pdflush) according to the cgroup they are doing work for.


Thanks.

Tim

2017-03-16 06:54:37

by Aaron Lu

Subject: Re: [PATCH v2 0/5] mm: support parallel free of memory

On Wed, Mar 15, 2017 at 05:28:43PM +0100, Michal Hocko wrote:
> On Wed 15-03-17 23:44:07, Aaron Lu wrote:
> > On Wed, Mar 15, 2017 at 03:18:14PM +0100, Michal Hocko wrote:
> > > On Wed 15-03-17 16:59:59, Aaron Lu wrote:
> > > [...]
> > > > The proposed parallel free did this: if the process has many pages to be
> > > > freed, accumulate them in these struct mmu_gather_batch(es) one after
> > > > another till 256K pages are accumulated. Then take this singly linked
> > > > list starting from tlb->local.next off struct mmu_gather *tlb and free
> > > > them in a worker thread. The main thread can return to continue zap
> > > > other pages(after freeing pages pointed by tlb->local.pages).
> > >
> > > I didn't have a look at the implementation yet but there are two
> > > concerns that raise up from this description. Firstly how are we going
> > > to tune the number of workers. I assume there will be some upper bound
> > > (one of the patch subject mentions debugfs for tuning) and secondly
> >
> > The workers are put in a dedicated workqueue which is introduced in
> > patch 3/5 and the number of workers can be tuned through that workqueue's
> > sysfs interface: max_active.
>
> I suspect we cannot expect users to tune this. What do you consider a
> reasonable default?

I agree with Tim that 4 is a reasonable number for now.

>
> Moreover, and this is a more generic question, is this functionality
> useful in general purpose workloads? After all the amount of the work to

I'm not sure. The main motivation is to speed up the exit of the crashed
application as explained by Dave.

> be done is the same we just risk more lock contentions, unexpected CPU
> usage etc. Which workloads will benefit from having exit path faster?
>
> > > if we offload the page freeing to the worker then the original context
> > > can consume much more cpu cycles than it was configured via cpu
>
> I was not precise here. I meant to say more cpu cycles per time unit
> that it was allowed.
>
> > > controller. How are we going to handle that? Or is this considered
> > > acceptable?
> >
> > I'll need to think about and take a look at this subject(not familiar
> > with cpu controller).
>
> the main problem is that kworkers will not belong to the same cpu group
> and so they will not be throttled properly.

Looks like a fundamental problem as long as kworker is used.
With the default max_active of the workqueue set to 4, do you think this
is a blocking issue?

Thanks.

2017-03-16 07:34:29

by Aaron Lu

Subject: Re: [PATCH v2 0/5] mm: support parallel free of memory

On Wed, Mar 15, 2017 at 05:28:43PM +0100, Michal Hocko wrote:
... ...
> After all the amount of the work to be done is the same we just risk
> more lock contentions, unexpected CPU usage etc.

I start to realize this is a good question.

I guess the reason max_active=4 produced almost the best result
(max_active=8 is only slightly better) is that the test box is a 4-node
machine and therefore there are 4 zone->locks to contend for (let's
ignore those tiny zones only available in node 0).

I'm going to test on an EP to see if max_active=2 will suffice to produce
a good enough result. If so, the proper default number should be the
number of nodes.

Thanks.

2017-03-16 09:07:39

by Michal Hocko

Subject: Re: [PATCH v2 0/5] mm: support parallel free of memory

On Wed 15-03-17 14:38:34, Tim Chen wrote:
> On Wed, 2017-03-15 at 17:28 +0100, Michal Hocko wrote:
> > On Wed 15-03-17 23:44:07, Aaron Lu wrote:
> > >
> > > On Wed, Mar 15, 2017 at 03:18:14PM +0100, Michal Hocko wrote:
> > > >
> > > > On Wed 15-03-17 16:59:59, Aaron Lu wrote:
> > > > [...]
> > > > >
> > > > > The proposed parallel free did this: if the process has many pages to be
> > > > > freed, accumulate them in these struct mmu_gather_batch(es) one after
> > > > > another till 256K pages are accumulated. Then take this singly linked
> > > > > list starting from tlb->local.next off struct mmu_gather *tlb and free
> > > > > them in a worker thread. The main thread can return to continue zap
> > > > > other pages(after freeing pages pointed by tlb->local.pages).
> > > > I didn't have a look at the implementation yet but there are two
> > > > concerns that raise up from this description. Firstly how are we going
> > > > to tune the number of workers. I assume there will be some upper bound
> > > > (one of the patch subject mentions debugfs for tuning) and secondly
> > > The workers are put in a dedicated workqueue which is introduced in
> > > patch 3/5 and the number of workers can be tuned through that workqueue's
> > > sysfs interface: max_active.
> > I suspect we cannot expect users to tune this. What do you consider a
> > reasonable default?
>
> From Aaron's data, it seems like 4 is a reasonable value for max_active:
>
> max_active:   time
> 1             8.9s   ±0.5%
> 2             5.65s  ±5.5%
> 4             4.84s  ±0.16%
> 8             4.77s  ±0.97%
> 16            4.85s  ±0.77%
> 32            6.21s  ±0.46%

OK, but this will depend on the HW, right? Also, now that I am looking
at those numbers more closely: this was about unmapping a 320GB area,
and using 4 times more CPUs you managed to halve the run time. Is this
really worth it? Sure, if those CPUs were idle then this is a clear win,
but if the system is moderately busy then it doesn't look like a clear
win to me.

> > Moreover, and this is a more generic question, is this functionality
> > useful in general purpose workloads?
>
> If we are running consecutive batch jobs, this optimization
> should help start the next job sooner.

Is this sufficient justification to add a potentially hard to tune
optimization that can influence other workloads on the machine?

> > After all the amount of the work to
> > be done is the same we just risk more lock contentions, unexpected CPU
> > usage etc. Which workloads will benefit from having exit path faster?
> >
> > >
> > > >
> > > > if we offload the page freeing to the worker then the original context
> > > > can consume much more cpu cycles than it was configured via cpu
> > I was not precise here. I meant to say more cpu cycles per time unit
> > that it was allowed.
> >
> > >
> > > >
> > > > controller. How are we going to handle that? Or is this considered
> > > > acceptable?
> > > I'll need to think about and take a look at this subject(not familiar
> > > with cpu controller).
> > the main problem is that kworkers will not belong to the same cpu group
> > and so they will not be throttled properly.
>
> You do have a point that this page freeing activities should strive to
> affect other threads not in the same cgroup minimally.
>
> On the other hand, we also don't do this throttling of kworkers
> today (e.g. pdflush) according to the cgroup it is doing work for.

Yes, I am not saying this is a new problem. I just wanted to point out
that this is something to consider here. I believe this should be
fixable. The worker can attach to the same cgroup the initiator had, for
example (assuming the cgroup core allows that, which is something that
would have to be checked).
--
Michal Hocko
SUSE Labs

2017-03-16 13:51:22

by Aaron Lu

Subject: Re: [PATCH v2 0/5] mm: support parallel free of memory

On Thu, Mar 16, 2017 at 03:34:03PM +0800, Aaron Lu wrote:
> On Wed, Mar 15, 2017 at 05:28:43PM +0100, Michal Hocko wrote:
> ... ...
> > After all the amount of the work to be done is the same we just risk
> > more lock contentions, unexpected CPU usage etc.
>
> I start to realize this is a good question.
>
> I guess max_active=4 produced almost the best result(max_active=8 is
> only slightly better) is due to the test box is a 4 node machine and
> therefore, there are 4 zone->lock to contend(let's ignore those tiny
> zones only available in node 0).
>
> I'm going to test on a EP to see if max_active=2 will suffice to produce
> a good enough result. If so, the proper default number should be the
> number of nodes.

Here are test results on a 2-node EP with 128GiB memory, test size 100GiB.

max_active    time
vanilla       2.971s ±3.8%
2             1.699s ±13.7%
4             1.616s ±3.1%
8             1.642s ±0.9%

So 4 gives best result but 2 is probably good enough.

If the size each worker deals with is changed from 1G to 2G:

max_active    time
2             1.605s ±1.7%
4             1.639s ±1.2%
8             1.626s ±1.8%

Considering that we are mostly improving for memory intensive apps, the
default setting should probably be: max_active = node_number with each
worker freeing 2G memory.

2017-03-16 14:14:49

by Aaron Lu

Subject: Re: [PATCH v2 0/5] mm: support parallel free of memory

On Thu, Mar 16, 2017 at 09:51:22PM +0800, Aaron Lu wrote:
> Considering that we are mostly improving for memory intensive apps, the
> default setting should probably be: max_active = node_number with each
> worker freeing 2G memory.

In case people want to give this setting a try, here is what to do.

On 2-nodes EP:
# echo 2 > /sys/devices/virtual/workqueue/batch_free_wq/max_active
# echo 1030 > /sys/kernel/debug/parallel_free/max_gather_batch_count

On 4-nodes EX:
# echo 4 > /sys/devices/virtual/workqueue/batch_free_wq/max_active
# echo 1030 > /sys/kernel/debug/parallel_free/max_gather_batch_count

2017-03-16 18:36:33

by Tim Chen

Subject: Re: [PATCH v2 0/5] mm: support parallel free of memory

On Thu, 2017-03-16 at 10:07 +0100, Michal Hocko wrote:
> On Wed 15-03-17 14:38:34, Tim Chen wrote:
> >
> > On Wed, 2017-03-15 at 17:28 +0100, Michal Hocko wrote:
> > >
> > > On Wed 15-03-17 23:44:07, Aaron Lu wrote:
> > > >
> > > >
> > > > On Wed, Mar 15, 2017 at 03:18:14PM +0100, Michal Hocko wrote:
> > > > >
> > > > >
> > > > > On Wed 15-03-17 16:59:59, Aaron Lu wrote:
> > > > > [...]
> > > > > >
> > > > > >
> > > > > > The proposed parallel free did this: if the process has many pages to be
> > > > > > freed, accumulate them in these struct mmu_gather_batch(es) one after
> > > > > > another till 256K pages are accumulated. Then take this singly linked
> > > > > > list starting from tlb->local.next off struct mmu_gather *tlb and free
> > > > > > them in a worker thread. The main thread can return to continue zap
> > > > > > other pages(after freeing pages pointed by tlb->local.pages).
> > > > > I didn't have a look at the implementation yet but there are two
> > > > > concerns that raise up from this description. Firstly how are we going
> > > > > to tune the number of workers. I assume there will be some upper bound
> > > > > (one of the patch subject mentions debugfs for tuning) and secondly
> > > > The workers are put in a dedicated workqueue which is introduced in
> > > > patch 3/5 and the number of workers can be tuned through that workqueue's
> > > > sysfs interface: max_active.
> > > I suspect we cannot expect users to tune this. What do you consider a
> > > reasonable default?
> > From Aaron's data, it seems like 4 is a reasonable value for max_active:
> >
> > max_active:   time
> > 1             8.9s   ±0.5%
> > 2             5.65s  ±5.5%
> > 4             4.84s  ±0.16%
> > 8             4.77s  ±0.97%
> > 16            4.85s  ±0.77%
> > 32            6.21s  ±0.46%
> OK, but this will depend on the HW, right? Also now that I am looking at
> those numbers more closely. This was about unmapping 320GB area and
> using 4 times more CPUs you managed to half the run time. Is this really
> worth it? Sure if those CPUs were idle then this is a clear win but if
> the system is moderately busy then it doesn't look like a clear win to
> me.

It looks like we can cut the exit time in half by using only 2 workers,
which disturbs the system minimally.
Perhaps we can do this expedited exit only when there are idle cpus around.
We can use the root sched domain's overload indicator for such a quick check.

Tim

2017-03-16 19:39:28

by Alex Thorlton

Subject: Re: [PATCH v2 0/5] mm: support parallel free of memory

On Wed, Mar 15, 2017 at 04:59:59PM +0800, Aaron Lu wrote:
> v2 changes: Nothing major, only minor ones.
> - rebased on top of v4.11-rc2-mmotm-2017-03-14-15-41;
> - use list_add_tail instead of list_add to add worker to tlb's worker
> list so that when doing flush, the first queued worker gets flushed
> first(based on the comsumption that the first queued worker has a
> better chance of finishing its job than those later queued workers);
> - use bool instead of int for variable free_batch_page in function
> tlb_flush_mmu_free_batches;
> - style change according to ./scripts/checkpatch;
> - reword some of the changelogs to make it more readable.
>
> v1 is here:
> https://lkml.org/lkml/2017/2/24/245

I tested v1 on a Haswell system with 64 sockets/1024 cores/2048 threads
and 8TB of RAM, with a 1TB malloc. The average free() time for a 1TB
malloc on a vanilla kernel was 41.69s, the patched kernel averaged
21.56s for the same test.

I am testing v2 now and will report back with results in the next day or
so.

- Alex

2017-03-17 02:21:52

by Aaron Lu

Subject: Re: [PATCH v2 0/5] mm: support parallel free of memory

On Thu, Mar 16, 2017 at 02:38:44PM -0500, Alex Thorlton wrote:
> On Wed, Mar 15, 2017 at 04:59:59PM +0800, Aaron Lu wrote:
> > v2 changes: Nothing major, only minor ones.
> > - rebased on top of v4.11-rc2-mmotm-2017-03-14-15-41;
> > - use list_add_tail instead of list_add to add worker to tlb's worker
> > list so that when doing flush, the first queued worker gets flushed
> > first(based on the comsumption that the first queued worker has a
> > better chance of finishing its job than those later queued workers);
> > - use bool instead of int for variable free_batch_page in function
> > tlb_flush_mmu_free_batches;
> > - style change according to ./scripts/checkpatch;
> > - reword some of the changelogs to make it more readable.
> >
> > v1 is here:
> > https://lkml.org/lkml/2017/2/24/245
>
> I tested v1 on a Haswell system with 64 sockets/1024 cores/2048 threads
> and 8TB of RAM, with a 1TB malloc. The average free() time for a 1TB
> malloc on a vanilla kernel was 41.69s, the patched kernel averaged
> 21.56s for the same test.

Thanks a lot for the test result.

>
> I am testing v2 now and will report back with results in the next day or
> so.

Testing plain v2 shouldn't bring any surprises/differences; better to set
the following params before the test (I'm planning to make them the
defaults in the next version):
# echo 64 > /sys/devices/virtual/workqueue/batch_free_wq/max_active
# echo 1030 > /sys/kernel/debug/parallel_free/max_gather_batch_count

Regards,
Aaron

2017-03-17 03:11:13

by Aaron Lu

Subject: Re: [PATCH v2 0/5] mm: support parallel free of memory

On Wed, Mar 15, 2017 at 03:56:02PM +0100, Vlastimil Babka wrote:
> I wonder if the difference would be larger if the parallelism was done
> on a higher level, something around unmap_page_range(). IIUC the current

I guess I misunderstood you in my last email - doing it at the
unmap_page_range() level is essentially doing it at a per-VMA level
since it is the main function used in unmap_single_vma(). We have tried
that and felt that it's not as flexible as the proposed approach since
it wouldn't parallelize well for:
1 a workload that uses only 1 or very few huge VMAs;
2 a workload that has a lot of small VMAs.

The code is nice and easy though (developed in the v4.9 time frame):

From f6d5cfde888b9e0356719fabe8754fdfe6fe236b Mon Sep 17 00:00:00 2001
From: Aaron Lu <[email protected]>
Date: Wed, 11 Jan 2017 15:56:06 +0800
Subject: [PATCH] mm: async free vma

---
include/linux/mm_types.h | 6 ++++++
mm/memory.c | 23 ++++++++++++++++++++++-
2 files changed, 28 insertions(+), 1 deletion(-)

diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h
index 4a8acedf4b7d..d10d2ce8f8f4 100644
--- a/include/linux/mm_types.h
+++ b/include/linux/mm_types.h
@@ -358,6 +358,12 @@ struct vm_area_struct {
struct mempolicy *vm_policy; /* NUMA policy for the VMA */
#endif
struct vm_userfaultfd_ctx vm_userfaultfd_ctx;
+
+ struct vma_free_ctx {
+ unsigned long start_addr;
+ unsigned long end_addr;
+ struct work_struct work;
+ } free_ctx;
};

struct core_thread {
diff --git a/mm/memory.c b/mm/memory.c
index e18c57bdc75c..0fe4e45a044b 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -1345,6 +1345,17 @@ static void unmap_single_vma(struct mmu_gather *tlb,
}
}

+static void unmap_single_vma_work(struct work_struct *work)
+{
+ struct vma_free_ctx *ctx = container_of(work, struct vma_free_ctx, work);
+ struct vm_area_struct *vma = container_of(ctx, struct vm_area_struct, free_ctx);
+ struct mmu_gather tlb;
+
+ tlb_gather_mmu(&tlb, vma->vm_mm, ctx->start_addr, ctx->end_addr);
+ unmap_single_vma(&tlb, vma, ctx->start_addr, ctx->end_addr, NULL);
+ tlb_finish_mmu(&tlb, ctx->start_addr, ctx->end_addr);
+}
+
/**
* unmap_vmas - unmap a range of memory covered by a list of vma's
* @tlb: address of the caller's struct mmu_gather
@@ -1368,10 +1379,20 @@ void unmap_vmas(struct mmu_gather *tlb,
unsigned long end_addr)
{
struct mm_struct *mm = vma->vm_mm;
+ struct vma_free_ctx *ctx;
+ struct vm_area_struct *tmp = vma;

mmu_notifier_invalidate_range_start(mm, start_addr, end_addr);
+ for ( ; vma && vma->vm_start < end_addr; vma = vma->vm_next) {
+ ctx = &vma->free_ctx;
+ ctx->start_addr = start_addr;
+ ctx->end_addr = end_addr;
+ INIT_WORK(&ctx->work, unmap_single_vma_work);
+ queue_work(system_unbound_wq, &ctx->work);
+ }
+ vma = tmp;
for ( ; vma && vma->vm_start < end_addr; vma = vma->vm_next)
- unmap_single_vma(tlb, vma, start_addr, end_addr, NULL);
+ flush_work(&vma->free_ctx.work);
mmu_notifier_invalidate_range_end(mm, start_addr, end_addr);
}

--
2.9.3

2017-03-17 08:07:34

by Minchan Kim

[permalink] [raw]
Subject: Re: [PATCH v2 0/5] mm: support parallel free of memory

On Fri, Mar 17, 2017 at 08:47:08AM +0100, Michal Hocko wrote:
> On Thu 16-03-17 11:36:21, Tim Chen wrote:
> [...]
> > Perhaps we can only do this expedited exit only when there are idle cpus around.
> > We can use the root sched domain's overload indicator for such a quick check.
>
> This is not so easy, I am afraid. Those CPUs might be idle for a good
> reason (power saving etc.). You will never know by simply checking
> one metric. This is why doing these optimistic parallelization
> optimizations is far from trivial. This is not the first time somebody
> wants to do this. People are trying to make THP migration faster
> doing the similar thing. I guess we really need a help from the
> scheduler to do this properly, though. I've been thinking about an API
> (e.g. try_to_run_in_backgroun) which would evaluate all these nasty
> details and either return with -EBUSY or kick the background thread to
> accomplish the work if the system is reasonably idle. I am not really

I agree with Michal's opinion.

In fact, I had prototyped zram parallel write (i.e., if there are many
CPUs in the system, zram can compress a bio's pages in parallel via
multiple CPUs) and it seemed to work well, but my concern was the lack of
control over power, cpu load, wakeup latency and so on.

> sure whether such an API is viable though. Peter, what do you think?

If the scheduler can support such an API (i.e., return true and queue the
job if the new job can be scheduled onto another CPU right now because
there is an idle CPU in the system), it would be really great for things
which want to use multiple CPUs in parallel to complete the job asap.
Of course, it could sacrifice power, but that's a trade-off, IMHO.

Thanks.

2017-03-17 08:24:31

by Michal Hocko

[permalink] [raw]
Subject: Re: [PATCH v2 0/5] mm: support parallel free of memory

On Thu 16-03-17 11:36:21, Tim Chen wrote:
[...]
> Perhaps we can only do this expedited exit only when there are idle cpus around.
> We can use the root sched domain's overload indicator for such a quick check.

This is not so easy, I am afraid. Those CPUs might be idle for a good
reason (power saving etc.). You will never know by simply checking
one metric. This is why doing these optimistic parallelization
optimizations is far from trivial. This is not the first time somebody
wants to do this. People are trying to make THP migration faster by
doing a similar thing. I guess we really need help from the
scheduler to do this properly, though. I've been thinking about an API
(e.g. try_to_run_in_background) which would evaluate all these nasty
details and either return with -EBUSY or kick the background thread to
accomplish the work if the system is reasonably idle. I am not really
sure whether such an API is viable though. Peter, what do you think?
--
Michal Hocko
SUSE Labs

2017-03-17 12:38:44

by Aaron Lu

[permalink] [raw]
Subject: Re: [PATCH v2 0/5] mm: support parallel free of memory

On Fri, Mar 17, 2017 at 08:47:08AM +0100, Michal Hocko wrote:
> On Thu 16-03-17 11:36:21, Tim Chen wrote:
> [...]
> > Perhaps we can only do this expedited exit only when there are idle cpus around.
> > We can use the root sched domain's overload indicator for such a quick check.
>
> This is not so easy, I am afraid. Those CPUs might be idle for a good
> reason (power saving etc.). You will never know by simply checking

Is it that those CPUs are deliberately put into idle mode to save power?
IIRC, the idle injection driver could be used to do this, and if so, the
injected idle task is a realtime one, so the spawned kworker will not be
able to preempt (disturb) it.

> one metric. This is why doing these optimistic parallelization
> optimizations is far from trivial. This is not the first time somebody
> wants to do this. People are trying to make THP migration faster
> doing the similar thing. I guess we really need a help from the
> scheduler to do this properly, though. I've been thinking about an API
> (e.g. try_to_run_in_backgroun) which would evaluate all these nasty
> details and either return with -EBUSY or kick the background thread to
> accomplish the work if the system is reasonably idle. I am not really
> sure whether such an API is viable though. Peter, what do you think?

I would very much like to know what these nasty details are and what
'reasonably idle' actually means; I think they are useful for understanding
the problem and defining the API.

I totally agree that we shouldn't disturb the system by starting more
workers/threads to do spin work, or make a process utilize more CPU
or other resources than allowed by its cgroup.

Thanks.

2017-03-17 12:53:45

by Peter Zijlstra

[permalink] [raw]
Subject: Re: [PATCH v2 0/5] mm: support parallel free of memory

On Fri, Mar 17, 2017 at 08:47:08AM +0100, Michal Hocko wrote:
> On Thu 16-03-17 11:36:21, Tim Chen wrote:
> [...]
> > Perhaps we can only do this expedited exit only when there are idle cpus around.
> > We can use the root sched domain's overload indicator for such a quick check.
>
> This is not so easy, I am afraid. Those CPUs might be idle for a good
> reason (power saving etc.). You will never know by simply checking
> one metric. This is why doing these optimistic parallelization
> optimizations is far from trivial. This is not the first time somebody
> wants to do this. People are trying to make THP migration faster
> doing the similar thing. I guess we really need a help from the
> scheduler to do this properly, though. I've been thinking about an API
> (e.g. try_to_run_in_backgroun) which would evaluate all these nasty
> details and either return with -EBUSY or kick the background thread to
> accomplish the work if the system is reasonably idle. I am not really
> sure whether such an API is viable though.

> Peter, what do you think?

Much pain lies this way.

Also, -enocontext.

2017-03-17 12:59:44

by Michal Hocko

[permalink] [raw]
Subject: Re: [PATCH v2 0/5] mm: support parallel free of memory

On Fri 17-03-17 20:33:15, Aaron Lu wrote:
> On Fri, Mar 17, 2017 at 08:47:08AM +0100, Michal Hocko wrote:
> > On Thu 16-03-17 11:36:21, Tim Chen wrote:
> > [...]
> > > Perhaps we can only do this expedited exit only when there are idle cpus around.
> > > We can use the root sched domain's overload indicator for such a quick check.
> >
> > This is not so easy, I am afraid. Those CPUs might be idle for a good
> > reason (power saving etc.). You will never know by simply checking
>
> Is it that those CPUs are deliberately put into idle mode to save power?

I am not a scheduler expert. All I know is that there is strong pressure
to make the scheduler power aware, and so some cpus are kept idle while
the workload is spread over the other (currently active) cpus. And all I am
trying to say is that this will be hard to guess without any assistance
from the scheduler, especially if this should be maintainable long term.

> IIRC, idle injection driver could be used to do this and if so, the
> injected idle task is a realtime one so the spawned kworker will not be
> able to preempt(disturb) it.
>
> > one metric. This is why doing these optimistic parallelization
> > optimizations is far from trivial. This is not the first time somebody
> > wants to do this. People are trying to make THP migration faster
> > doing the similar thing. I guess we really need a help from the
> > scheduler to do this properly, though. I've been thinking about an API
> > (e.g. try_to_run_in_backgroun) which would evaluate all these nasty
> > details and either return with -EBUSY or kick the background thread to
> > accomplish the work if the system is reasonably idle. I am not really
> > sure whether such an API is viable though. Peter, what do you think?
>
> I would very much like to know what these nasty details are and what
> 'reasonably idle' actually means, I think they are useful to understand
> the problem and define the API.

I would love to give you more specific information but I am not sure
myself. All I know is that the scheduler is the only place where we
have at least some idea about the recent load characteristics and some
policies on top. And that is why I _think_ we need to have an API
which cooperates with the scheduler.
--
Michal Hocko
SUSE Labs

2017-03-17 13:16:57

by Peter Zijlstra

[permalink] [raw]
Subject: Re: [PATCH v2 0/5] mm: support parallel free of memory

On Fri, Mar 17, 2017 at 08:33:15PM +0800, Aaron Lu wrote:
> On Fri, Mar 17, 2017 at 08:47:08AM +0100, Michal Hocko wrote:
> > On Thu 16-03-17 11:36:21, Tim Chen wrote:
> > [...]
> > > Perhaps we can only do this expedited exit only when there are idle cpus around.
> > > We can use the root sched domain's overload indicator for such a quick check.
> >
> > This is not so easy, I am afraid. Those CPUs might be idle for a good
> > reason (power saving etc.). You will never know by simply checking
>
> Is it that those CPUs are deliberately put into idle mode to save power?

No, forced idle injection is an abomination.

> > one metric. This is why doing these optimistic parallelization
> > optimizations is far from trivial. This is not the first time somebody
> > wants to do this. People are trying to make THP migration faster
> > doing the similar thing. I guess we really need a help from the
> > scheduler to do this properly, though. I've been thinking about an API
> > (e.g. try_to_run_in_backgroun) which would evaluate all these nasty
> > details and either return with -EBUSY or kick the background thread to
> > accomplish the work if the system is reasonably idle. I am not really
> > sure whether such an API is viable though. Peter, what do you think?
>
> I would very much like to know what these nasty details are and what
> 'reasonably idle' actually means, I think they are useful to understand
> the problem and define the API.

A CPU being idle doesn't mean it'll be idle long enough to do your
additional work.

The CPU not being idle affects scheduling latency. It also increases
power usage and thermals.

If your workload wants peak single threaded throughput, making the other
CPUs do work will lower its turbo boost range for example.

An 'obvious' solution that doesn't work is an idle scheduler; it's an
instant priority inversion if you take locks there. Not to mention you
lose any forward progress guarantees for any work you put in.

2017-03-17 13:25:28

by Michal Hocko

[permalink] [raw]
Subject: Re: [PATCH v2 0/5] mm: support parallel free of memory

On Fri 17-03-17 13:53:33, Peter Zijlstra wrote:
> On Fri, Mar 17, 2017 at 08:47:08AM +0100, Michal Hocko wrote:
> > On Thu 16-03-17 11:36:21, Tim Chen wrote:
> > [...]
> > > Perhaps we can only do this expedited exit only when there are idle cpus around.
> > > We can use the root sched domain's overload indicator for such a quick check.
> >
> > This is not so easy, I am afraid. Those CPUs might be idle for a good
> > reason (power saving etc.). You will never know by simply checking
> > one metric. This is why doing these optimistic parallelization
> > optimizations is far from trivial. This is not the first time somebody
> > wants to do this. People are trying to make THP migration faster
> > doing the similar thing. I guess we really need a help from the
> > scheduler to do this properly, though. I've been thinking about an API
> > (e.g. try_to_run_in_backgroun) which would evaluate all these nasty
> > details and either return with -EBUSY or kick the background thread to
> > accomplish the work if the system is reasonably idle. I am not really
> > sure whether such an API is viable though.
>
> > Peter, what do you think?
>
> Much pain lies this way.

I somehow expected this answer ;)

> Also, -enocontext.

Well, the context is that there are more users emerging which would like
to move some part of a heavy operation (e.g. munmap in exit or THP
migration) to background threads because that operation can be split
and parallelized. The kworker API is used for this purpose currently, and I
believe that this is not the right approach because an optimization for one
workload might be too disruptive to everybody else. On the other hand,
larger machines which would benefit from these optimizations are more
likely to have idle CPUs to (ab)use. So the idea was to provide an API
which would tell whether kicking background worker(s) to accomplish
the task is feasible. The scheduler sounds to me like the best candidate
to ask this question. I might be wrong here of course, but a
centralized API sounds like a better approach than ad-hoc solutions
developed for each particular use case.
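
To put it concretely: a purely hypothetical sketch of such an API (nothing
like this exists today, and the scheduler-side helper below is a made-up
name) might look like:

#include <linux/workqueue.h>
#include <linux/errno.h>

/* Made-up helper: whatever policy the scheduler would export. */
bool sched_background_work_ok(void);

/*
 * Hypothetical sketch only. The point is that the "is the system
 * reasonably idle?" policy lives in one place (the scheduler), not in
 * every subsystem that wants to offload work.
 */
int try_to_run_in_background(struct work_struct *work)
{
	/*
	 * The scheduler would look at idle CPUs, power and thermal
	 * constraints, the caller's cgroup limits, etc.
	 */
	if (!sched_background_work_ok())
		return -EBUSY;

	queue_work(system_unbound_wq, work);
	return 0;
}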

--
Michal Hocko
SUSE Labs

2017-03-20 19:15:54

by Alex Thorlton

[permalink] [raw]
Subject: Re: [PATCH v2 0/5] mm: support parallel free of memory

On Fri, Mar 17, 2017 at 10:21:58AM +0800, Aaron Lu wrote:
> On Thu, Mar 16, 2017 at 02:38:44PM -0500, Alex Thorlton wrote:
> > On Wed, Mar 15, 2017 at 04:59:59PM +0800, Aaron Lu wrote:
> > > v2 changes: Nothing major, only minor ones.
> > > - rebased on top of v4.11-rc2-mmotm-2017-03-14-15-41;
> > > - use list_add_tail instead of list_add to add worker to tlb's worker
> > > list so that when doing flush, the first queued worker gets flushed
> > > first(based on the comsumption that the first queued worker has a
> > > better chance of finishing its job than those later queued workers);
> > > - use bool instead of int for variable free_batch_page in function
> > > tlb_flush_mmu_free_batches;
> > > - style change according to ./scripts/checkpatch;
> > > - reword some of the changelogs to make it more readable.
> > >
> > > v1 is here:
> > > https://lkml.org/lkml/2017/2/24/245
> >
> > I tested v1 on a Haswell system with 64 sockets/1024 cores/2048 threads
> > and 8TB of RAM, with a 1TB malloc. The average free() time for a 1TB
> > malloc on a vanilla kernel was 41.69s, the patched kernel averaged
> > 21.56s for the same test.
>
> Thanks a lot for the test result.
>
> >
> > I am testing v2 now and will report back with results in the next day or
> > so.
>
> Testing plain v2 shouldn't bring any surprise/difference

You're right! Not much difference here. v2 averaged a 23.17s free
time for a 1T allocation.

> better set the
> following param before the test(I'm planning to make them default in the
> next version):
> # echo 64 > /sys/devices/virtual/workqueue/batch_free_wq/max_active
> # echo 1030 > /sys/kernel/debug/parallel_free/max_gather_batch_count

10 test runs with these params set averaged 22.22s to free 1T.

So, we're still seeing a nearly 50% decrease in free time vs. the
unpatched kernel.

- Alex

2017-03-21 15:07:04

by Dave Hansen

[permalink] [raw]
Subject: Re: [PATCH v2 0/5] mm: support parallel free of memory

On 03/16/2017 02:07 AM, Michal Hocko wrote:
> On Wed 15-03-17 14:38:34, Tim Chen wrote:
>> max_active: time
>> 1 8.9s ±0.5%
>> 2 5.65s ±5.5%
>> 4 4.84s ±0.16%
>> 8 4.77s ±0.97%
>> 16 4.85s ±0.77%
>> 32 6.21s ±0.46%
>
> OK, but this will depend on the HW, right? Also now that I am looking at
> those numbers more closely. This was about unmapping 320GB area and
> using 4 times more CPUs you managed to half the run time. Is this really
> worth it? Sure if those CPUs were idle then this is a clear win but if
> the system is moderately busy then it doesn't look like a clear win to
> me.

This still suffers from zone lock contention. It scales much better if
we are freeing memory from more than one zone. We would expect any
other generic page allocator scalability improvements to really help
here, too.

Aaron, could you make sure that the memory being freed is
coming from multiple NUMA nodes? It might also be interesting to boot
with a fake NUMA configuration with a *bunch* of nodes to see what the
best case looks like when zone lock contention isn't even in play, where
each worker would be working on its own zone.

>>> Moreover, and this is a more generic question, is this functionality
>>> useful in general purpose workloads?
>>
>> If we are running consecutive batch jobs, this optimization
>> should help start the next job sooner.
>
> Is this sufficient justification to add a potentially hard to tune
> optimization that can influence other workloads on the machine?

The guys for whom a reboot is faster than a single exit() certainly
think so. :)

I have the feeling that we can find a pretty sane large process size to
be the floor where this feature gets activated. I doubt the systems
that really care about noise from other workloads are often doing
multi-gigabyte mapping teardowns.

2017-03-21 15:18:23

by Tim Chen

[permalink] [raw]
Subject: Re: [PATCH v2 0/5] mm: support parallel free of memory

On Thu, 2017-03-16 at 10:07 +0100, Michal Hocko wrote:

> > > the main problem is that kworkers will not belong to the same cpu group
> > > and so they will not be throttled properly.
> > You do have a point that this page freeing activities should strive to
> > affect other threads not in the same cgroup minimally.
> >
> > On the other hand, we also don't do this throttling of kworkers 
> > today (e.g. pdflush) according to the cgroup it is doing work for.
> Yes, I am not saying this a new problem. I just wanted to point out that
> this is something to consider here. I believe this should be fixable.
> Worker can attach to the same cgroup the initiator had for example
> (assuming the cgroup core allows that which is something would have to
> be checked).

Instead of attaching the kworkers to the cgroup of the initiator, I
wonder what people think about creating a separate kworker cgroup.
The administrator can set a limit on its cpu resource bandwidth
if he/she does not want such kworkers perturbing the system.

Tim

2017-03-22 06:34:51

by Minchan Kim

[permalink] [raw]
Subject: Re: [PATCH v2 3/5] mm: use a dedicated workqueue for the free workers

Hi,

On Wed, Mar 15, 2017 at 05:00:02PM +0800, Aaron Lu wrote:
> Introduce a workqueue for all the free workers so that user can fine
> tune how many workers can be active through sysfs interface: max_active.
> More workers will normally lead to better performance, but too many can
> cause severe lock contention.

Let me ask a question.

How well can the workqueue distribute the jobs across multiple CPUs?
I'm not asking about concurrency but about parallelism.
I guess the benefit you are seeing comes from the parallelism, and
for your goal, the unbound wq should spawn a thread per cpu and
do the work on each CPU. Does it work that way?

>
> Note that since the zone lock is global, the workqueue is also global
> for all processes, i.e. if we set 8 to max_active, we will have at most
> 8 workers active for all processes that are doing munmap()/exit()/etc.
>
> Signed-off-by: Aaron Lu <[email protected]>
> ---
> mm/memory.c | 15 ++++++++++++++-
> 1 file changed, 14 insertions(+), 1 deletion(-)
>
> diff --git a/mm/memory.c b/mm/memory.c
> index 001c7720d773..19b25bb5f45b 100644
> --- a/mm/memory.c
> +++ b/mm/memory.c
> @@ -253,6 +253,19 @@ static void tlb_flush_mmu_tlbonly(struct mmu_gather *tlb)
> __tlb_reset_range(tlb);
> }
>
> +static struct workqueue_struct *batch_free_wq;
> +static int __init batch_free_wq_init(void)
> +{
> + batch_free_wq = alloc_workqueue("batch_free_wq",
> + WQ_UNBOUND | WQ_SYSFS, 0);
> + if (!batch_free_wq) {
> + pr_warn("failed to create workqueue batch_free_wq\n");
> + return -ENOMEM;
> + }
> + return 0;
> +}
> +subsys_initcall(batch_free_wq_init);
> +
> static void tlb_flush_mmu_free_batches(struct mmu_gather_batch *batch_start,
> bool free_batch_page)
> {
> @@ -306,7 +319,7 @@ static void tlb_flush_mmu_free(struct mmu_gather *tlb)
> batch_free->batch_start = tlb->local.next;
> INIT_WORK(&batch_free->work, batch_free_work);
> list_add_tail(&batch_free->list, &tlb->worker_list);
> - queue_work(system_unbound_wq, &batch_free->work);
> + queue_work(batch_free_wq, &batch_free->work);
>
> tlb->batch_count = 0;
> tlb->local.next = NULL;
> --
> 2.7.4
>

2017-03-22 08:09:10

by Aaron Lu

[permalink] [raw]
Subject: Re: [PATCH v2 0/5] mm: support parallel free of memory

On Tue, Mar 21, 2017 at 07:54:37AM -0700, Dave Hansen wrote:
> On 03/16/2017 02:07 AM, Michal Hocko wrote:
> > On Wed 15-03-17 14:38:34, Tim Chen wrote:
> >> max_active: time
> >> 1 8.9s ±0.5%
> >> 2 5.65s ±5.5%
> >> 4 4.84s ±0.16%
> >> 8 4.77s ±0.97%
> >> 16 4.85s ±0.77%
> >> 32 6.21s ±0.46%
> >
> > OK, but this will depend on the HW, right? Also now that I am looking at
> > those numbers more closely. This was about unmapping 320GB area and
> > using 4 times more CPUs you managed to half the run time. Is this really
> > worth it? Sure if those CPUs were idle then this is a clear win but if
> > the system is moderately busy then it doesn't look like a clear win to
> > me.
>
> This still suffers from zone lock contention. It scales much better if
> we are freeing memory from more than one zone. We would expect any
> other generic page allocator scalability improvements to really help
> here, too.
>
> Aaron, could you make sure to make sure that the memory being freed is
> coming from multiple NUMA nodes? It might also be interesting to boot

The test machine has 4 nodes and each has 128G memory.
With the test size of 320G, at least 3 nodes are involved.

But since the test is done on an idle system, I *guess* the allocated
memory is physically contiguous. Then when it is freed in virtually
contiguous order, it's likely that one physically contiguous 1G chunk
after another is sent to the free kworkers. So roughly for the first
128 1G chunks, those workers will all be contending on the same zone.
(Well, it shouldn't be 128 kworkers all runnable and contending for the
same lock, since early launched kworkers will have exited after finishing
their jobs before some of the later launched kworkers start.)

> with a fake NUMA configuration with a *bunch* of nodes to see what the
> best case looks like when zone lock contention isn't even in play where
> one worker would be working on its own zone.

Good idea, will post results here once I've finished the test.

Thanks.

2017-03-22 08:44:25

by Aaron Lu

[permalink] [raw]
Subject: Re: [PATCH v2 3/5] mm: use a dedicated workqueue for the free workers

On Wed, Mar 22, 2017 at 03:33:35PM +0900, Minchan Kim wrote:
> Hi,
>
> On Wed, Mar 15, 2017 at 05:00:02PM +0800, Aaron Lu wrote:
> > Introduce a workqueue for all the free workers so that user can fine
> > tune how many workers can be active through sysfs interface: max_active.
> > More workers will normally lead to better performance, but too many can
> > cause severe lock contention.
>
> Let me ask a question.
>
> How well can workqueue distribute the jobs in multiple CPU?

I would say it's good enough for my needs.
After all, it doesn't need many kworkers to achieve the 50% time
decrease: 2-4 kworkers for EP and 4-8 kworkers for EX are enough according
to the previously attached data.

> I don't ask about currency but parallelism.
> I guess benefit you are seeing comes from the parallelism and
> for your goal, unbound wq should spawn a thread per cpu and
> doing the work in every each CPU. does it work?

I don't think an unbound workqueue will spawn a thread per CPU; that
seems too high a cost for an unbound workqueue.

My understanding of the unbound workqueue is that it will create a
thread pool for each node, versus one for each CPU as in the bound
workqueue case, and use threads from that pool (creating threads if there
are not enough) to do the work.

I guess you want to ask whether the unbound workqueue can spawn enough
threads to do the job? From the output of 'vmstat 1' during the free()
test, I can see some 70+ processes in the runnable state when I didn't
set an upper limit on max_active of the workqueue.

Thanks,
Aaron

2017-03-22 08:55:41

by Minchan Kim

[permalink] [raw]
Subject: Re: [PATCH v2 3/5] mm: use a dedicated workqueue for the free workers

On Wed, Mar 22, 2017 at 04:41:04PM +0800, Aaron Lu wrote:
> On Wed, Mar 22, 2017 at 03:33:35PM +0900, Minchan Kim wrote:
> > Hi,
> >
> > On Wed, Mar 15, 2017 at 05:00:02PM +0800, Aaron Lu wrote:
> > > Introduce a workqueue for all the free workers so that user can fine
> > > tune how many workers can be active through sysfs interface: max_active.
> > > More workers will normally lead to better performance, but too many can
> > > cause severe lock contention.
> >
> > Let me ask a question.
> >
> > How well can workqueue distribute the jobs in multiple CPU?
>
> I would say it's good enough for my needs.
> After all, it doesn't need many kworkers to achieve the 50% time
> decrease: 2-4 kworkers for EP and 4-8 kworkers for EX are enough from
> previous attched data.
>
> > I don't ask about currency but parallelism.
> > I guess benefit you are seeing comes from the parallelism and
> > for your goal, unbound wq should spawn a thread per cpu and
> > doing the work in every each CPU. does it work?
>
> I don't think a unbound workqueue will spawn a thread per CPU, that
> seems too much a cost to have a unbound workqueue.
>
> My understanding of the unbound workqueue is that it will create a
> thread pool for each node, versus each CPU as in the bound workqueue
> case, and use threads from the thread pool(create threads if not enough)
> to do the work.

Yes, that was my understanding, so I read the code and found this in
insert_work():

..
if (__need_more_worker(pool))
	wake_up_worker(pool);

so I thought that if there is already a running thread on that node, the
workqueue will not wake any other threads, so parallelism should be 2 at
most. AFAIK, if the work goes to sleep, the scheduler will spawn a new
worker thread, so there could be a lot of active workers, but I cannot see
any significant sleepable point in that work (i.e., batch_free_work).

>
> I guess you want to ask if the unbound workqueue can spawn enough
> threads to do the job? From the output of 'vmstat 1' during the free()
> test, I can see some 70+ processes in runnable state when I didn't
> set an upper limit for max_active of the workqueue.

Hmm, it seems I was wrong. I am really curious how we can have
70+ active workers in that case. Could you explain what I am missing?

Thanks.

>
> Thanks,
> Aaron
>

2017-03-22 13:43:00

by Aaron Lu

[permalink] [raw]
Subject: Re: [PATCH v2 3/5] mm: use a dedicated workqueue for the free workers

On Wed, Mar 22, 2017 at 05:55:12PM +0900, Minchan Kim wrote:
> On Wed, Mar 22, 2017 at 04:41:04PM +0800, Aaron Lu wrote:
> > My understanding of the unbound workqueue is that it will create a
> > thread pool for each node, versus each CPU as in the bound workqueue
> > case, and use threads from the thread pool(create threads if not enough)
> > to do the work.
>
> Yes, that was my understand so I read code and found that
>
> insert_work:
> ..
> if (__need_more_worker(pool))
> wake_up_worker(pool);
>
> so I thought if there is a running thread in that node, workqueue
> will not wake any other threads so parallelism should be max 2.
> AFAIK, if the work goes sleep, scheduler will spawn new worker
> thread so the active worker could be a lot but I cannot see any
> significant sleepable point in that work(ie, batch_free_work).

Looks like worker_thread() will spawn a new worker through manage_workers().

Note that pool->nr_running will always be zero for an unbound workqueue,
and thus need_more_worker() will return true as long as there are queued
work items in the pool.
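
For reference, the relevant checks in kernel/workqueue.c look roughly like
this (paraphrased from the v4.10-era source; the comments are mine):

static bool __need_more_worker(struct worker_pool *pool)
{
	/* nr_running stays 0 for unbound pools, so this is always true */
	return !atomic_read(&pool->nr_running);
}

static bool need_more_worker(struct worker_pool *pool)
{
	/* so any queued work item on an unbound pool wakes up a worker */
	return !list_empty(&pool->worklist) && __need_more_worker(pool);
}

The woken worker then creates additional workers via manage_workers() when
there is still queued work and no idle worker left.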

Thanks,
Aaron

2017-03-23 07:38:17

by Minchan Kim

[permalink] [raw]
Subject: Re: [PATCH v2 3/5] mm: use a dedicated workqueue for the free workers

On Wed, Mar 22, 2017 at 09:43:04PM +0800, Aaron Lu wrote:
> On Wed, Mar 22, 2017 at 05:55:12PM +0900, Minchan Kim wrote:
> > On Wed, Mar 22, 2017 at 04:41:04PM +0800, Aaron Lu wrote:
> > > My understanding of the unbound workqueue is that it will create a
> > > thread pool for each node, versus each CPU as in the bound workqueue
> > > case, and use threads from the thread pool(create threads if not enough)
> > > to do the work.
> >
> > Yes, that was my understand so I read code and found that
> >
> > insert_work:
> > ..
> > if (__need_more_worker(pool))
> > wake_up_worker(pool);
> >
> > so I thought if there is a running thread in that node, workqueue
> > will not wake any other threads so parallelism should be max 2.
> > AFAIK, if the work goes sleep, scheduler will spawn new worker
> > thread so the active worker could be a lot but I cannot see any
> > significant sleepable point in that work(ie, batch_free_work).
>
> Looks like worker_thread() will spawn new worker through manage_worker().
>
> Note that pool->nr_running will always be zero for an unbound workqueue
> and thus need_more_worker() will return true as long as there are queued
> work items in the pool.

Aha, that answers my question. Thanks a lot!

2017-03-23 15:38:46

by Dave Hansen

[permalink] [raw]
Subject: Re: [PATCH v2 3/5] mm: use a dedicated workqueue for the free workers

On 03/22/2017 01:41 AM, Aaron Lu wrote:
> On Wed, Mar 22, 2017 at 03:33:35PM +0900, Minchan Kim wrote:
>> On Wed, Mar 15, 2017 at 05:00:02PM +0800, Aaron Lu wrote:
>>> Introduce a workqueue for all the free workers so that user can fine
>>> tune how many workers can be active through sysfs interface: max_active.
>>> More workers will normally lead to better performance, but too many can
>>> cause severe lock contention.
>>
>> Let me ask a question.
>>
>> How well can workqueue distribute the jobs in multiple CPU?
>
> I would say it's good enough for my needs.
> After all, it doesn't need many kworkers to achieve the 50% time
> decrease: 2-4 kworkers for EP and 4-8 kworkers for EX are enough from
> previous attched data.

It's also worth noting that we'd *also* like to look into
increasing how scalable freeing pages to a given zone is.

2017-03-24 07:05:02

by Aaron Lu

[permalink] [raw]
Subject: Re: [PATCH v2 0/5] mm: support parallel free of memory

On Tue, Mar 21, 2017 at 07:54:37AM -0700, Dave Hansen wrote:
> On 03/16/2017 02:07 AM, Michal Hocko wrote:
> > On Wed 15-03-17 14:38:34, Tim Chen wrote:
> >> max_active: time
> >> 1 8.9s ±0.5%
> >> 2 5.65s ±5.5%
> >> 4 4.84s ±0.16%
> >> 8 4.77s ±0.97%
> >> 16 4.85s ±0.77%
> >> 32 6.21s ±0.46%
> >
> > OK, but this will depend on the HW, right? Also now that I am looking at
> > those numbers more closely. This was about unmapping 320GB area and
> > using 4 times more CPUs you managed to half the run time. Is this really
> > worth it? Sure if those CPUs were idle then this is a clear win but if
> > the system is moderately busy then it doesn't look like a clear win to
> > me.
>
> This still suffers from zone lock contention. It scales much better if
> we are freeing memory from more than one zone. We would expect any
> other generic page allocator scalability improvements to really help
> here, too.
>
> Aaron, could you make sure to make sure that the memory being freed is
> coming from multiple NUMA nodes? It might also be interesting to boot
> with a fake NUMA configuration with a *bunch* of nodes to see what the
> best case looks like when zone lock contention isn't even in play where
> one worker would be working on its own zone.

This fake NUMA configuration thing is great for this purpose; I didn't
know we had this support in the kernel.

So I added numa=fake=128 and also wrote a new test program (attached)
that mmap()s 321G of memory and makes sure it is distributed equally
across 107 nodes, i.e. 3G on each node. This is achieved by using mbind()
before touching the memory on each node.
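
For readers who don't want to open the attachment, the core of the test is
roughly the following; this is a sketch reconstructed from the description
above (the sizes and node count are simply the ones used in this test), not
the actual attached node_alloc.c:

/*
 * Sketch of the test logic, not the attached node_alloc.c.
 * Build with: gcc node_alloc_sketch.c -lnuma   (mbind() from <numaif.h>)
 * Run on a kernel booted with numa=fake=128.
 */
#define _GNU_SOURCE
#include <numaif.h>
#include <sys/mman.h>
#include <stdio.h>
#include <string.h>

#define NR_NODES	107
#define PER_NODE	(3UL << 30)		/* 3G per fake node */
#define BITS_PER_UL	(8 * sizeof(unsigned long))

int main(void)
{
	unsigned long len = NR_NODES * PER_NODE;	/* 321G in total */
	unsigned long nodemask[4];
	char *p, *chunk;
	int node;

	p = mmap(NULL, len, PROT_READ | PROT_WRITE,
		 MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
	if (p == MAP_FAILED) {
		perror("mmap");
		return 1;
	}

	for (node = 0; node < NR_NODES; node++) {
		chunk = p + (unsigned long)node * PER_NODE;

		/* bind this 3G chunk to one fake node, then fault it in */
		memset(nodemask, 0, sizeof(nodemask));
		nodemask[node / BITS_PER_UL] |= 1UL << (node % BITS_PER_UL);
		if (mbind(chunk, PER_NODE, MPOL_BIND, nodemask,
			  NR_NODES + 1, 0)) {
			perror("mbind");
			return 1;
		}
		memset(chunk, 1, PER_NODE);
	}

	/* freeing this mapping (munmap/exit) is what gets timed */
	munmap(p, len);
	return 0;
}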

Then I enlarged max_gather_batch_count to 1543 so that during zap, 3G of
memory is sent to a kworker to free instead of the default 1G (each batch
holds about 510 page pointers, so 1543 batches cover roughly 3G of 4K
pages). In this way, each kworker should be working on a different node.

With this change, time to free the 321G memory is reduced to:

3.23s ±13.7% (about a 70% decrease)

Lock contention is 1.81%:

19.60% [kernel.kallsyms] [k] release_pages
13.30% [kernel.kallsyms] [k] unmap_page_range
13.18% [kernel.kallsyms] [k] free_pcppages_bulk
8.34% [kernel.kallsyms] [k] __mod_zone_page_state
7.75% [kernel.kallsyms] [k] page_remove_rmap
7.37% [kernel.kallsyms] [k] free_hot_cold_page
6.06% [kernel.kallsyms] [k] free_pages_and_swap_cache
3.53% [kernel.kallsyms] [k] __list_del_entry_valid
3.09% [kernel.kallsyms] [k] __list_add_valid
1.81% [kernel.kallsyms] [k] native_queued_spin_lock_slowpath
1.79% [kernel.kallsyms] [k] uncharge_list
1.69% [kernel.kallsyms] [k] mem_cgroup_update_lru_size
1.60% [kernel.kallsyms] [k] vm_normal_page
1.46% [kernel.kallsyms] [k] __dec_node_state
1.41% [kernel.kallsyms] [k] __mod_node_page_state
1.20% [kernel.kallsyms] [k] __tlb_remove_page_size
0.85% [kernel.kallsyms] [k] mem_cgroup_page_lruvec

From 'vmstat 1', the number of runnable processes peaked at 6 during munmap():
procs -----------memory---------- ---swap-- -----io---- -system-- ------cpu-----
r b swpd free buff cache si so bi bo in cs us sy id wa st
0 0 0 189114560 0 761292 0 0 0 0 70 146 0 0 100 0 0
3 0 0 189099008 0 759932 0 0 0 0 2536 382 0 0 100 0 0
6 0 0 274378848 0 759972 0 0 0 0 11332 249 0 3 97 0 0
5 0 0 374426592 0 759972 0 0 0 0 13576 196 0 3 97 0 0
4 0 0 474990144 0 759972 0 0 0 0 13250 227 0 3 97 0 0
0 0 0 526039296 0 759972 0 0 0 0 6799 246 0 2 98 0 0
^C

This appears to be the best result from this approach.


Attachments:
node_alloc.c (1.92 kB)

2017-03-24 12:37:12

by Aaron Lu

[permalink] [raw]
Subject: Re: [PATCH v2 3/5] mm: use a dedicated workqueue for the free workers

On Thu, Mar 23, 2017 at 08:38:43AM -0700, Dave Hansen wrote:
> On 03/22/2017 01:41 AM, Aaron Lu wrote:
> > On Wed, Mar 22, 2017 at 03:33:35PM +0900, Minchan Kim wrote:
> >> On Wed, Mar 15, 2017 at 05:00:02PM +0800, Aaron Lu wrote:
> >>> Introduce a workqueue for all the free workers so that user can fine
> >>> tune how many workers can be active through sysfs interface: max_active.
> >>> More workers will normally lead to better performance, but too many can
> >>> cause severe lock contention.
> >>
> >> Let me ask a question.
> >>
> >> How well can workqueue distribute the jobs in multiple CPU?
> >
> > I would say it's good enough for my needs.
> > After all, it doesn't need many kworkers to achieve the 50% time
> > decrease: 2-4 kworkers for EP and 4-8 kworkers for EX are enough from
> > previous attched data.
>
> It's also worth noting that we'd like to *also* like to look into
> increasing how scalable freeing pages to a given zone is.

Still on EX, I restricted the allocation to be only on node 1, with
120G memory allocated there:

max_active           time            compared to base   lock from perf
base (no parallel)   3.81s ±3.3%     N/A                <1%
1                    3.10s ±7.7%     ↓18.6%             14.76%
2                    2.44s ±13.6%    ↓35.9%             36.95%
4                    2.07s ±13.6%    ↓45.6%             59.67%
8                    1.98s ±0.4%     ↓48.0%             62.59%
16                   2.01s ±2.4%     ↓47.2%             79.62%

If we can improve the scalability of freeing pages to a given zone, then
parallel free will be able to achieve more.

BTW, the lock is basically pgdat->lru_lock in release_pages and
zone->lock in free_pcppages_bulk:
62.59% 62.59% [kernel.kallsyms] [k] native_queued_spin_lock_slowpath
37.17% native_queued_spin_lock_slowpath;_raw_spin_lock_irqsave;free_pcppages_bulk;free_hot_cold_page;free_hot_cold_page_list;release_pages;free_pages_and_swap_cache;tlb_flush_mmu_free_batches;batch_free_work;process_one_work;worker_thread;kthread;ret_from_fork
25.27% native_queued_spin_lock_slowpath;_raw_spin_lock_irqsave;release_pages;free_pages_and_swap_cache;tlb_flush_mmu_free_batches;batch_free_work;process_one_work;worker_thread;kthread;ret_from_fork