2018-11-05 16:59:02

by Daniel Jordan

Subject: [RFC PATCH v4 00/13] ktask: multithread CPU-intensive kernel work

Hi,

This version addresses some of the feedback from Andrew and Michal last year
and describes the plan for tackling the rest. I'm posting now since I'll be
presenting ktask at Plumbers next week.

Andrew, you asked about parallelizing in more places[0]. This version adds
multithreading for VFIO page pinning, and there are more planned users listed
below.

Michal, you mentioned that ktask should be sensitive to CPU utilization[1].
ktask threads now run at the lowest priority on the system to avoid disturbing
busy CPUs (more details in patches 4 and 5). Does this address your concern?
The plan to address your other comments is explained below.

Alex, any thoughts about the VFIO changes in patches 6-9?

Tejun and Lai, what do you think of patch 5?

And for everyone, questions and comments welcome. Any suggestions for more
users?

Thanks,
Daniel

P.S. This series has grown large in order to address the above feedback, but
patches 7 and 8 can be sent separately.


TODO
----

- Implement cgroup-aware unbound workqueues in a separate series, picking up
Bandan Das's effort from two years ago[2]. This should hopefully address
Michal's comment about running ktask threads within the limits of the calling
context[1].

- Make ktask aware of power management. A starting point is to disable the
framework when energy-conscious cpufreq settings are enabled (e.g.
powersave, conservative scaling governors). This should address another
comment from Michal about keeping CPUs under power constraints idle[1].

- Add more users. On my list:
- __ib_umem_release in IB core, which Jason Gunthorpe mentioned[3]
- XFS quotacheck and online repair, as suggested by Darrick Wong
- vfs object teardown at umount time, as Andrew mentioned[0]
- page freeing in munmap/exit, as Aaron Lu posted[4]
- page freeing in shmem
The last three will benefit from scaling zone->lock and lru_lock.

- CPU hotplug support for ktask to adjust its per-CPU data and resource
limits.

- Check with IOMMU folks that iommu_map is safe for all IOMMU backend
implementations (it is for x86).


Summary
-------

A single CPU can spend an excessive amount of time in the kernel operating
on large amounts of data. Often these situations arise during initialization-
and destruction-related tasks, where the data involved scales with system size.
These long-running jobs can slow startup and shutdown of applications and the
system itself while extra CPUs sit idle.

To ensure that applications and the kernel continue to perform well as core
counts and memory sizes increase, harness these idle CPUs to complete such jobs
more quickly.

ktask is a generic framework for parallelizing CPU-intensive work in the
kernel. The API is generic enough to add concurrency to many different kinds
of tasks--for example, zeroing a range of pages or evicting a list of
inodes--and aims to save its clients the trouble of splitting up the work,
choosing the number of threads to use, maintaining an efficient concurrency
level, starting these threads, and load balancing the work between them.

The first patch has more documentation, and the second patch has the interface.
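
For a rough feel for the interface, here is a minimal sketch of a hypothetical
client that zeroes a range of pages. The names zero_pages_chunk and zero_pages
are made up for illustration and are not part of this series; the ktask calls
come from patch 2:

#include <linux/highmem.h>
#include <linux/ktask.h>

/* Thread function: ktask calls this once per chunk of the pfn range. */
static int zero_pages_chunk(unsigned long start_pfn, unsigned long end_pfn,
                            void *arg)
{
        unsigned long pfn;

        for (pfn = start_pfn; pfn < end_pfn; ++pfn)
                clear_highpage(pfn_to_page(pfn));

        return KTASK_RETURN_SUCCESS;
}

static int zero_pages(unsigned long start_pfn, unsigned long nr_pages)
{
        /* Chunks cover at least KTASK_PTE_MINCHUNK (128M worth) of pages. */
        DEFINE_KTASK_CTL(ctl, zero_pages_chunk, NULL, KTASK_PTE_MINCHUNK);

        /* ktask chooses the thread count, chunks the range, and waits. */
        return ktask_run((void *)start_pfn, nr_pages, &ctl);
}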

Current users:
1) VFIO page pinning before kvm guest startup (others hitting slowness too[5])
2) deferred struct page initialization at boot time
3) clearing gigantic pages
4) fallocate for HugeTLB pages

This patchset is based on the 2018-10-30 head of mmotm/master.

Changelog:

v3 -> v4:
- Added VFIO page pinning use case (Andrew's "more users" comment)
- Made ktask helpers run at the lowest priority on the system (Michal's
concern about sensitivity to CPU utilization)
- ktask learned to undo part of a task on error (required for VFIO)
- Changed mm->locked_vm to an atomic to improve performance for VFIO. This can
be split out into a smaller series (there wasn't time before posting this)
- Removed /proc/sys/debug/ktask_max_threads (it was a debug-only thing)
- Some minor improvements in the ktask code itself (shorter, cleaner, etc)
- Updated Documentation to cover these changes

v2 -> v3:
- Changed cpu to CPU in the ktask Documentation, as suggested by Randy Dunlap
- Saved more boot time now that Pavel Tatashin's deferred struct page init
patches are in mainline (https://lkml.org/lkml/2017/10/13/692). New
performance results in patch 7.
- Added resource limits, per-node and system-wide, to maintain efficient
concurrency levels (addresses a concern from my Plumbers talk)
- ktask no longer allocates memory internally during a task so it can be used
in sensitive contexts
- Added the option to run work anywhere on the system rather than always
confining it to a specific node
- Updated Documentation patch with these changes and reworked motivation
section

v1 -> v2:
- Added deferred struct page initialization use case.
- Explained the source of the performance improvement from parallelizing
clear_gigantic_page (comment from Dave Hansen).
- Fixed Documentation and build warnings from CONFIG_KTASK=n kernels.

ktask v3 RFC: lkml.kernel.org/r/[email protected]

[0] https://lkml.kernel.org/r/[email protected]
[1] https://lkml.kernel.org/r/[email protected]
[2] https://lkml.kernel.org/r/[email protected]
[3] https://lkml.kernel.org/r/[email protected]
[4] https://lkml.kernel.org/r/[email protected]
[5] https://www.redhat.com/archives/vfio-users/2018-April/msg00020.html

Daniel Jordan (13):
ktask: add documentation
ktask: multithread CPU-intensive kernel work
ktask: add undo support
ktask: run helper threads at MAX_NICE
workqueue, ktask: renice helper threads to prevent starvation
vfio: parallelize vfio_pin_map_dma
mm: change locked_vm's type from unsigned long to atomic_long_t
vfio: remove unnecessary mmap_sem writer acquisition around locked_vm
vfio: relieve mmap_sem reader cacheline bouncing by holding it longer
mm: enlarge type of offset argument in mem_map_offset and mem_map_next
mm: parallelize deferred struct page initialization within each node
mm: parallelize clear_gigantic_page
hugetlbfs: parallelize hugetlbfs_fallocate with ktask

Documentation/core-api/index.rst | 1 +
Documentation/core-api/ktask.rst | 213 +++++++++
arch/powerpc/kvm/book3s_64_vio.c | 15 +-
arch/powerpc/mm/mmu_context_iommu.c | 16 +-
drivers/fpga/dfl-afu-dma-region.c | 16 +-
drivers/vfio/vfio_iommu_spapr_tce.c | 14 +-
drivers/vfio/vfio_iommu_type1.c | 159 ++++---
fs/hugetlbfs/inode.c | 114 ++++-
fs/proc/task_mmu.c | 2 +-
include/linux/ktask.h | 267 ++++++++++++
include/linux/mm_types.h | 2 +-
include/linux/workqueue.h | 5 +
init/Kconfig | 11 +
init/main.c | 2 +
kernel/Makefile | 2 +-
kernel/fork.c | 2 +-
kernel/ktask.c | 646 ++++++++++++++++++++++++++++
kernel/workqueue.c | 106 ++++-
mm/debug.c | 3 +-
mm/internal.h | 7 +-
mm/memory.c | 32 +-
mm/mlock.c | 4 +-
mm/mmap.c | 18 +-
mm/mremap.c | 6 +-
mm/page_alloc.c | 91 +++-
25 files changed, 1599 insertions(+), 155 deletions(-)
create mode 100644 Documentation/core-api/ktask.rst
create mode 100644 include/linux/ktask.h
create mode 100644 kernel/ktask.c

--
2.19.1



2018-11-05 16:59:21

by Daniel Jordan

Subject: [RFC PATCH v4 10/13] mm: enlarge type of offset argument in mem_map_offset and mem_map_next

Changes the type of 'offset' from int to unsigned long in both
mem_map_offset and mem_map_next.

This lets ktask, which uses unsigned long types throughout, pass its
values to mem_map_next without silent truncation to int.

It also fixes the preexisting truncation of 'offset' from unsigned long
to int by the sole caller of mem_map_offset, follow_hugetlb_page.
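
As a tiny userspace illustration of the narrowing being fixed (made up for
this commit message, not part of the patch; the exact out-of-range conversion
is implementation-defined, but compilers on LP64 targets typically just drop
the high bits):

#include <stdio.h>

int main(void)
{
        unsigned long offset = 5UL << 30;       /* 5368709120 */
        int narrowed = offset;                  /* as with the old int parameter */

        printf("%lu %d\n", offset, narrowed);   /* prints 5368709120 1073741824 */
        return 0;
}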

Signed-off-by: Daniel Jordan <[email protected]>
---
mm/internal.h | 7 ++++---
1 file changed, 4 insertions(+), 3 deletions(-)

diff --git a/mm/internal.h b/mm/internal.h
index 3b1ec1412fd2..cc90de4d4e01 100644
--- a/mm/internal.h
+++ b/mm/internal.h
@@ -367,7 +367,8 @@ static inline void mlock_migrate_page(struct page *new, struct page *old) { }
* the maximally aligned gigantic page 'base'. Handle any discontiguity
* in the mem_map at MAX_ORDER_NR_PAGES boundaries.
*/
-static inline struct page *mem_map_offset(struct page *base, int offset)
+static inline struct page *mem_map_offset(struct page *base,
+ unsigned long offset)
{
if (unlikely(offset >= MAX_ORDER_NR_PAGES))
return nth_page(base, offset);
@@ -378,8 +379,8 @@ static inline struct page *mem_map_offset(struct page *base, int offset)
* Iterator over all subpages within the maximally aligned gigantic
* page 'base'. Handle any discontiguity in the mem_map.
*/
-static inline struct page *mem_map_next(struct page *iter,
- struct page *base, int offset)
+static inline struct page *mem_map_next(struct page *iter, struct page *base,
+ unsigned long offset)
{
if (unlikely((offset & (MAX_ORDER_NR_PAGES - 1)) == 0)) {
unsigned long pfn = page_to_pfn(base) + offset;
--
2.19.1


2018-11-05 17:00:10

by Daniel Jordan

Subject: [RFC PATCH v4 13/13] hugetlbfs: parallelize hugetlbfs_fallocate with ktask

hugetlbfs_fallocate preallocates huge pages to back a file in a
hugetlbfs filesystem. The time this function takes grows linearly with
the size being preallocated.

ktask performs well with its default thread count of 4; higher thread
counts are given for context only.

Machine: Intel(R) Xeon(R) CPU E7-8895 v3 @ 2.60GHz, 288 CPUs, 1T memory
Test: fallocate(1) a file on a hugetlbfs filesystem

nthread   speedup   size (GiB)   min time (s)   stdev
      1                    200         127.53    2.19
      2     3.09x          200          41.30    2.11
      4     5.72x          200          22.29    0.51
      8     9.45x          200          13.50    2.58
     16     9.74x          200          13.09    1.64

      1                    400         193.09    2.47
      2     2.14x          400          90.31    3.39
      4     3.84x          400          50.32    0.44
      8     5.11x          400          37.75    1.23
     16     6.12x          400          31.54    3.13

The primary bottleneck for better scaling at higher thread counts is
hugetlb_fault_mutex_table[hash]. perf showed L1-dcache-loads increase
with 8 threads and again sharply with 16 threads, and a CPU counter
profile showed that 31% of the L1d misses were on
hugetlb_fault_mutex_table[hash] in the 16-thread case.

Signed-off-by: Daniel Jordan <[email protected]>
---
fs/hugetlbfs/inode.c | 114 +++++++++++++++++++++++++++++++++++--------
1 file changed, 93 insertions(+), 21 deletions(-)

diff --git a/fs/hugetlbfs/inode.c b/fs/hugetlbfs/inode.c
index 762028994f47..a73548a96061 100644
--- a/fs/hugetlbfs/inode.c
+++ b/fs/hugetlbfs/inode.c
@@ -37,6 +37,7 @@
#include <linux/magic.h>
#include <linux/migrate.h>
#include <linux/uio.h>
+#include <linux/ktask.h>

#include <linux/uaccess.h>

@@ -104,11 +105,16 @@ static const struct fs_parameter_description hugetlb_fs_parameters = {
};

#ifdef CONFIG_NUMA
+static inline struct shared_policy *hugetlb_get_shared_policy(
+ struct inode *inode)
+{
+ return &HUGETLBFS_I(inode)->policy;
+}
+
static inline void hugetlb_set_vma_policy(struct vm_area_struct *vma,
- struct inode *inode, pgoff_t index)
+ struct shared_policy *policy, pgoff_t index)
{
- vma->vm_policy = mpol_shared_policy_lookup(&HUGETLBFS_I(inode)->policy,
- index);
+ vma->vm_policy = mpol_shared_policy_lookup(policy, index);
}

static inline void hugetlb_drop_vma_policy(struct vm_area_struct *vma)
@@ -116,8 +122,14 @@ static inline void hugetlb_drop_vma_policy(struct vm_area_struct *vma)
mpol_cond_put(vma->vm_policy);
}
#else
+static inline struct shared_policy *hugetlb_get_shared_policy(
+ struct inode *inode)
+{
+ return NULL;
+}
+
static inline void hugetlb_set_vma_policy(struct vm_area_struct *vma,
- struct inode *inode, pgoff_t index)
+ struct shared_policy *policy, pgoff_t index)
{
}

@@ -576,20 +588,30 @@ static long hugetlbfs_punch_hole(struct inode *inode, loff_t offset, loff_t len)
return 0;
}

+struct hf_args {
+ struct file *file;
+ struct task_struct *parent_task;
+ struct mm_struct *mm;
+ struct shared_policy *shared_policy;
+ struct hstate *hstate;
+ struct address_space *mapping;
+ int error;
+};
+
+static int hugetlbfs_fallocate_chunk(pgoff_t start, pgoff_t end,
+ struct hf_args *args);
+
static long hugetlbfs_fallocate(struct file *file, int mode, loff_t offset,
loff_t len)
{
struct inode *inode = file_inode(file);
struct hugetlbfs_inode_info *info = HUGETLBFS_I(inode);
- struct address_space *mapping = inode->i_mapping;
struct hstate *h = hstate_inode(inode);
- struct vm_area_struct pseudo_vma;
- struct mm_struct *mm = current->mm;
loff_t hpage_size = huge_page_size(h);
unsigned long hpage_shift = huge_page_shift(h);
- pgoff_t start, index, end;
+ pgoff_t start, end;
+ struct hf_args hf_args;
int error;
- u32 hash;

if (mode & ~(FALLOC_FL_KEEP_SIZE | FALLOC_FL_PUNCH_HOLE))
return -EOPNOTSUPP;
@@ -617,16 +639,66 @@ static long hugetlbfs_fallocate(struct file *file, int mode, loff_t offset,
goto out;
}

+ hf_args.file = file;
+ hf_args.parent_task = current;
+ hf_args.mm = current->mm;
+ hf_args.shared_policy = hugetlb_get_shared_policy(inode);
+ hf_args.hstate = h;
+ hf_args.mapping = inode->i_mapping;
+ hf_args.error = 0;
+
+ if (unlikely(hstate_is_gigantic(h))) {
+ /*
+ * Use multiple threads in clear_gigantic_page instead of here,
+ * so just do a 1-threaded hugetlbfs_fallocate_chunk.
+ */
+ error = hugetlbfs_fallocate_chunk(start, end, &hf_args);
+ } else {
+ DEFINE_KTASK_CTL(ctl, hugetlbfs_fallocate_chunk,
+ &hf_args, KTASK_PMD_MINCHUNK);
+
+ error = ktask_run((void *)start, end - start, &ctl);
+ }
+
+ if (error != KTASK_RETURN_SUCCESS && hf_args.error != -EINTR)
+ goto out;
+
+ if (!(mode & FALLOC_FL_KEEP_SIZE) && offset + len > inode->i_size)
+ i_size_write(inode, offset + len);
+ inode->i_ctime = current_time(inode);
+out:
+ inode_unlock(inode);
+ return error;
+}
+
+static int hugetlbfs_fallocate_chunk(pgoff_t start, pgoff_t end,
+ struct hf_args *args)
+{
+ struct file *file = args->file;
+ struct task_struct *parent_task = args->parent_task;
+ struct mm_struct *mm = args->mm;
+ struct shared_policy *shared_policy = args->shared_policy;
+ struct hstate *h = args->hstate;
+ struct address_space *mapping = args->mapping;
+ int error = 0;
+ pgoff_t index;
+ struct vm_area_struct pseudo_vma;
+ loff_t hpage_size;
+ u32 hash;
+
+ hpage_size = huge_page_size(h);
+
/*
* Initialize a pseudo vma as this is required by the huge page
* allocation routines. If NUMA is configured, use page index
- * as input to create an allocation policy.
+ * as input to create an allocation policy. Each thread gets its
+ * own pseudo vma because mempolicies can differ by page.
*/
vma_init(&pseudo_vma, mm);
pseudo_vma.vm_flags = (VM_HUGETLB | VM_MAYSHARE | VM_SHARED);
pseudo_vma.vm_file = file;

- for (index = start; index < end; index++) {
+ for (index = start; index < end; ++index) {
/*
* This is supposed to be the vaddr where the page is being
* faulted in, but we have no vaddr here.
@@ -641,13 +713,13 @@ static long hugetlbfs_fallocate(struct file *file, int mode, loff_t offset,
* fallocate(2) manpage permits EINTR; we may have been
* interrupted because we are using up too much memory.
*/
- if (signal_pending(current)) {
+ if (signal_pending(parent_task) || signal_pending(current)) {
error = -EINTR;
- break;
+ goto err;
}

/* Set numa allocation policy based on index */
- hugetlb_set_vma_policy(&pseudo_vma, inode, index);
+ hugetlb_set_vma_policy(&pseudo_vma, shared_policy, index);

/* addr is the offset within the file (zero based) */
addr = index * hpage_size;
@@ -672,7 +744,7 @@ static long hugetlbfs_fallocate(struct file *file, int mode, loff_t offset,
if (IS_ERR(page)) {
mutex_unlock(&hugetlb_fault_mutex_table[hash]);
error = PTR_ERR(page);
- goto out;
+ goto err;
}
clear_huge_page(page, addr, pages_per_huge_page(h));
__SetPageUptodate(page);
@@ -680,7 +752,7 @@ static long hugetlbfs_fallocate(struct file *file, int mode, loff_t offset,
if (unlikely(error)) {
put_page(page);
mutex_unlock(&hugetlb_fault_mutex_table[hash]);
- goto out;
+ goto err;
}

mutex_unlock(&hugetlb_fault_mutex_table[hash]);
@@ -693,11 +765,11 @@ static long hugetlbfs_fallocate(struct file *file, int mode, loff_t offset,
put_page(page);
}

- if (!(mode & FALLOC_FL_KEEP_SIZE) && offset + len > inode->i_size)
- i_size_write(inode, offset + len);
- inode->i_ctime = current_time(inode);
-out:
- inode_unlock(inode);
+ return KTASK_RETURN_SUCCESS;
+
+err:
+ args->error = error;
+
return error;
}

--
2.19.1


2018-11-05 17:00:19

by Daniel Jordan

Subject: [RFC PATCH v4 12/13] mm: parallelize clear_gigantic_page

Parallelize clear_gigantic_page, which zeroes any page size larger than
8M (e.g. 1G on x86).

Performance results (the default number of threads is 4; higher thread
counts shown for context only):

Machine: Intel(R) Xeon(R) CPU E7-8895 v3 @ 2.60GHz, 288 CPUs, 1T memory
Test: Clear a range of gigantic pages (triggered via fallocate)

nthread   speedup   size (GiB)   min time (s)   stdev
      1                    100          41.13    0.03
      2     2.03x          100          20.26    0.14
      4     4.28x          100           9.62    0.09
      8     8.39x          100           4.90    0.05
     16    10.44x          100           3.94    0.03

      1                    200          89.68    0.35
      2     2.21x          200          40.64    0.18
      4     4.64x          200          19.33    0.32
      8     8.99x          200           9.98    0.04
     16    11.27x          200           7.96    0.04

      1                    400         188.20    1.57
      2     2.30x          400          81.84    0.09
      4     4.63x          400          40.62    0.26
      8     8.92x          400          21.09    0.50
     16    11.78x          400          15.97    0.25

      1                    800         434.91    1.81
      2     2.54x          800         170.97    1.46
      4     4.98x          800          87.38    1.91
      8    10.15x          800          42.86    2.59
     16    12.99x          800          33.48    0.83

The speedups are mostly due to the fact that more threads can use more
memory bandwidth. The loop we're stressing on the x86 chip in this test
is clear_page_erms, which tops out at a bandwidth of 2550 MiB/s with one
thread. We get the same bandwidth per thread for 2, 4, or 8 threads,
but at 16 threads the per-thread bandwidth drops to 1420 MiB/s.
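
As a back-of-the-envelope cross-check of those numbers (my arithmetic, not
additional measurements): at 1 thread, 100 GiB / 41.13 s is about 2490 MiB/s,
right at the single-thread clear_page_erms limit, and at 4 threads,
100 GiB / 9.62 s is about 10.4 GiB/s, or roughly 4 * 2550 MiB/s, so the
speedups track aggregate bandwidth until the per-thread drop at 16 threads.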

The ktask threads' NUMA awareness also contributes to the improvement
over a single thread (ktask migrates worker threads to the node local to
the work being done). This becomes a bigger factor as the amount of
memory to zero grows to include multiple nodes, which is why the
speedups increase with size.

Signed-off-by: Daniel Jordan <[email protected]>
---
mm/memory.c | 32 ++++++++++++++++++++++++--------
1 file changed, 24 insertions(+), 8 deletions(-)

diff --git a/mm/memory.c b/mm/memory.c
index 15c417e8e31d..445d06537905 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -69,6 +69,7 @@
#include <linux/userfaultfd_k.h>
#include <linux/dax.h>
#include <linux/oom.h>
+#include <linux/ktask.h>

#include <asm/io.h>
#include <asm/mmu_context.h>
@@ -4415,19 +4416,28 @@ static inline void process_huge_page(
}
}

-static void clear_gigantic_page(struct page *page,
- unsigned long addr,
- unsigned int pages_per_huge_page)
+struct cgp_args {
+ struct page *base_page;
+ unsigned long addr;
+};
+
+static int clear_gigantic_page_chunk(unsigned long start, unsigned long end,
+ struct cgp_args *args)
{
- int i;
- struct page *p = page;
+ struct page *base_page = args->base_page;
+ struct page *p = base_page;
+ unsigned long addr = args->addr;
+ unsigned long i;

might_sleep();
- for (i = 0; i < pages_per_huge_page;
- i++, p = mem_map_next(p, page, i)) {
+ for (i = start; i < end; ++i) {
cond_resched();
clear_user_highpage(p, addr + i * PAGE_SIZE);
+
+ p = mem_map_next(p, base_page, i);
}
+
+ return KTASK_RETURN_SUCCESS;
}

static void clear_subpage(unsigned long addr, int idx, void *arg)
@@ -4444,7 +4454,13 @@ void clear_huge_page(struct page *page,
~(((unsigned long)pages_per_huge_page << PAGE_SHIFT) - 1);

if (unlikely(pages_per_huge_page > MAX_ORDER_NR_PAGES)) {
- clear_gigantic_page(page, addr, pages_per_huge_page);
+ struct cgp_args args = {page, addr};
+ struct ktask_node node = {0, pages_per_huge_page,
+ page_to_nid(page)};
+ DEFINE_KTASK_CTL(ctl, clear_gigantic_page_chunk, &args,
+ KTASK_MEM_CHUNK);
+
+ ktask_run_numa(&node, 1, &ctl);
return;
}

--
2.19.1


2018-11-05 17:00:22

by Daniel Jordan

Subject: [RFC PATCH v4 08/13] vfio: remove unnecessary mmap_sem writer acquisition around locked_vm

Now that mmap_sem is no longer required for modifying locked_vm, remove
it in the VFIO code.

[XXX Can be sent separately, along with similar conversions in the other
places mmap_sem was taken for locked_vm. While at it, could make
similar changes to pinned_vm.]

Signed-off-by: Daniel Jordan <[email protected]>
---
drivers/vfio/vfio_iommu_type1.c | 26 +++++++++-----------------
1 file changed, 9 insertions(+), 17 deletions(-)

diff --git a/drivers/vfio/vfio_iommu_type1.c b/drivers/vfio/vfio_iommu_type1.c
index f307dc9d5e19..9e52a24eb45b 100644
--- a/drivers/vfio/vfio_iommu_type1.c
+++ b/drivers/vfio/vfio_iommu_type1.c
@@ -258,7 +258,8 @@ static int vfio_iova_put_vfio_pfn(struct vfio_dma *dma, struct vfio_pfn *vpfn)
static int vfio_lock_acct(struct vfio_dma *dma, long npage, bool async)
{
struct mm_struct *mm;
- int ret;
+ long locked_vm;
+ int ret = 0;

if (!npage)
return 0;
@@ -267,24 +268,15 @@ static int vfio_lock_acct(struct vfio_dma *dma, long npage, bool async)
if (!mm)
return -ESRCH; /* process exited */

- ret = down_write_killable(&mm->mmap_sem);
- if (!ret) {
- if (npage > 0) {
- if (!dma->lock_cap) {
- unsigned long limit;
-
- limit = task_rlimit(dma->task,
- RLIMIT_MEMLOCK) >> PAGE_SHIFT;
+ locked_vm = atomic_long_add_return(npage, &mm->locked_vm);

- if (atomic_long_read(&mm->locked_vm) + npage > limit)
- ret = -ENOMEM;
- }
+ if (npage > 0 && !dma->lock_cap) {
+ unsigned long limit = task_rlimit(dma->task, RLIMIT_MEMLOCK) >>
+ PAGE_SHIFT;
+ if (locked_vm > limit) {
+ atomic_long_sub(npage, &mm->locked_vm);
+ ret = -ENOMEM;
}
-
- if (!ret)
- atomic_long_add(npage, &mm->locked_vm);
-
- up_write(&mm->mmap_sem);
}

if (async)
--
2.19.1


2018-11-05 17:00:22

by Daniel Jordan

Subject: [RFC PATCH v4 02/13] ktask: multithread CPU-intensive kernel work

A single CPU can spend an excessive amount of time in the kernel operating on
large amounts of data. Often these situations arise during initialization- and
destruction-related tasks, where the data involved scales with system size.
These long-running jobs can slow startup and shutdown of applications and the
system itself while extra CPUs sit idle.

To ensure that applications and the kernel continue to perform well as
core counts and memory sizes increase, harness these idle CPUs to
complete such jobs more quickly.

ktask is a generic framework for parallelizing CPU-intensive work in the
kernel. The API is generic enough to add concurrency to many different
kinds of tasks--for example, zeroing a range of pages or evicting a list
of inodes--and aims to save its clients the trouble of splitting up the
work, choosing the number of threads to use, maintaining an efficient
concurrency level, starting these threads, and load balancing the work
between them.

The Documentation patch earlier in this series, from which the above was
swiped, has more background.

Inspired by work from Pavel Tatashin, Steve Sistare, and Jonathan Adams.

Signed-off-by: Daniel Jordan <[email protected]>
Suggested-by: Pavel Tatashin <[email protected]>
Suggested-by: Steve Sistare <[email protected]>
Suggested-by: Jonathan Adams <[email protected]>
---
include/linux/ktask.h | 237 +++++++++++++++++++
init/Kconfig | 11 +
init/main.c | 2 +
kernel/Makefile | 2 +-
kernel/ktask.c | 526 ++++++++++++++++++++++++++++++++++++++++++
5 files changed, 777 insertions(+), 1 deletion(-)
create mode 100644 include/linux/ktask.h
create mode 100644 kernel/ktask.c

diff --git a/include/linux/ktask.h b/include/linux/ktask.h
new file mode 100644
index 000000000000..9c75a93b51b9
--- /dev/null
+++ b/include/linux/ktask.h
@@ -0,0 +1,237 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+/*
+ * ktask.h - framework to parallelize CPU-intensive kernel work
+ *
+ * For more information, see Documentation/core-api/ktask.rst.
+ *
+ * Copyright (c) 2018 Oracle Corporation
+ * Author: Daniel Jordan <[email protected]>
+ */
+#ifndef _LINUX_KTASK_H
+#define _LINUX_KTASK_H
+
+#include <linux/mm.h>
+#include <linux/types.h>
+
+#define KTASK_RETURN_SUCCESS 0
+
+/**
+ * struct ktask_node - Holds per-NUMA-node information about a task.
+ *
+ * @kn_start: An object that describes the start of the task on this NUMA node.
+ * @kn_task_size: size of this node's work (units are task-specific)
+ * @kn_nid: NUMA node id to run threads on
+ */
+struct ktask_node {
+ void *kn_start;
+ size_t kn_task_size;
+ int kn_nid;
+};
+
+/**
+ * typedef ktask_thread_func
+ *
+ * Called on each chunk of work that a ktask thread does. A thread may call
+ * this multiple times during one task.
+ *
+ * @start: An object that describes the start of the chunk.
+ * @end: An object that describes the end of the chunk.
+ * @arg: The thread function argument (provided with struct ktask_ctl).
+ *
+ * RETURNS:
+ * KTASK_RETURN_SUCCESS or a client-specific nonzero error code.
+ */
+typedef int (*ktask_thread_func)(void *start, void *end, void *arg);
+
+/**
+ * typedef ktask_iter_func
+ *
+ * An iterator function that advances the position by size units.
+ *
+ * @position: An object that describes the current position in the task.
+ * @size: The amount to advance in the task (in task-specific units).
+ *
+ * RETURNS:
+ * An object representing the new position.
+ */
+typedef void *(*ktask_iter_func)(void *position, size_t size);
+
+/**
+ * ktask_iter_range
+ *
+ * An iterator function for a contiguous range such as an array or address
+ * range. This is the default iterator; clients may override with
+ * ktask_ctl_set_iter_func.
+ *
+ * @position: An object that describes the current position in the task.
+ * Interpreted as an unsigned long.
+ * @size: The amount to advance in the task (in task-specific units).
+ *
+ * RETURNS:
+ * (position + size)
+ */
+void *ktask_iter_range(void *position, size_t size);
+
+/**
+ * struct ktask_ctl - Client-provided per-task control information.
+ *
+ * @kc_thread_func: A thread function that completes one chunk of the task per
+ * call.
+ * @kc_func_arg: An argument to be passed to the thread and undo functions.
+ * @kc_iter_func: An iterator function to advance the iterator by some number
+ * of task-specific units.
+ * @kc_min_chunk_size: The minimum chunk size in task-specific units. This
+ * allows the client to communicate the minimum amount of
+ * work that's appropriate for one worker thread to do at
+ * once.
+ * @kc_max_threads: max threads to use for the task, actual number may be less
+ * depending on CPU count, task size, and minimum chunk size.
+ */
+struct ktask_ctl {
+ /* Required arguments set with DEFINE_KTASK_CTL. */
+ ktask_thread_func kc_thread_func;
+ void *kc_func_arg;
+ size_t kc_min_chunk_size;
+
+ /* Optional, can set with ktask_ctl_set_*. Defaults on the right. */
+ ktask_iter_func kc_iter_func; /* ktask_iter_range */
+ size_t kc_max_threads; /* 0 (uses internal limit) */
+};
+
+#define KTASK_CTL_INITIALIZER(thread_func, func_arg, min_chunk_size) \
+ { \
+ .kc_thread_func = (ktask_thread_func)(thread_func), \
+ .kc_func_arg = (func_arg), \
+ .kc_min_chunk_size = (min_chunk_size), \
+ .kc_iter_func = (ktask_iter_range), \
+ .kc_max_threads = 0, \
+ }
+
+/*
+ * KTASK_CTL_INITIALIZER casts 'thread_func' to be of type ktask_thread_func so
+ * clients can write cleaner thread functions by relieving them of the need to
+ * cast the three void * arguments. Clients can just use the actual argument
+ * types instead.
+ */
+#define DEFINE_KTASK_CTL(ctl_name, thread_func, func_arg, min_chunk_size) \
+ struct ktask_ctl ctl_name = \
+ KTASK_CTL_INITIALIZER(thread_func, func_arg, min_chunk_size) \
+
+/**
+ * ktask_ctl_set_iter_func - Set a task-specific iterator
+ *
+ * Overrides the default iterator, ktask_iter_range.
+ *
+ * Casts the type of the iterator function so its arguments can be
+ * client-specific (see the comment above DEFINE_KTASK_CTL).
+ *
+ * @ctl: A control structure containing information about the task.
+ * @iter_func: Walks a given number of units forward in the task, returning
+ * an iterator corresponding to the new position.
+ */
+#define ktask_ctl_set_iter_func(ctl, iter_func) \
+ ((ctl)->kc_iter_func = (ktask_iter_func)(iter_func))
+
+/**
+ * ktask_ctl_set_max_threads - Set a task-specific maximum number of threads
+ *
+ * This overrides the default maximum, which is KTASK_DEFAULT_MAX_THREADS.
+ *
+ * @ctl: A control structure containing information about the task.
+ * @max_threads: The maximum number of threads to be started for this task.
+ * The actual number of threads may be less than this.
+ */
+static inline void ktask_ctl_set_max_threads(struct ktask_ctl *ctl,
+ size_t max_threads)
+{
+ ctl->kc_max_threads = max_threads;
+}
+
+/*
+ * The minimum chunk sizes for tasks that operate on ranges of memory. For
+ * now, say 128M.
+ */
+#define KTASK_MEM_CHUNK (1ul << 27)
+#define KTASK_PTE_MINCHUNK (KTASK_MEM_CHUNK / PAGE_SIZE)
+#define KTASK_PMD_MINCHUNK (KTASK_MEM_CHUNK / PMD_SIZE)
+
+#ifdef CONFIG_KTASK
+
+/**
+ * ktask_run - Runs one task.
+ *
+ * Starts threads to complete one task with the given thread function. Waits
+ * for the task to finish before returning.
+ *
+ * On a NUMA system, threads run on the current node. This is designed to
+ * mirror other parts of the kernel that favor locality, such as the default
+ * memory policy of allocating pages from the same node as the calling thread.
+ * ktask_run_numa may be used to get more control over where threads run.
+ *
+ * @start: An object that describes the start of the task. The client thread
+ * function interprets the object however it sees fit (e.g. an array
+ * index, a simple pointer, or a pointer to a more complicated
+ * representation of job position).
+ * @task_size: The size of the task (units are task-specific).
+ * @ctl: A control structure containing information about the task, including
+ * the client thread function.
+ *
+ * RETURNS:
+ * KTASK_RETURN_SUCCESS or a client-specific nonzero error code.
+ */
+int ktask_run(void *start, size_t task_size, struct ktask_ctl *ctl);
+
+/**
+ * ktask_run_numa - Runs one task while accounting for NUMA locality.
+ *
+ * Starts threads on the requested nodes to complete one task with the given
+ * thread function. The client is responsible for organizing the work along
+ * NUMA boundaries in the 'nodes' array. Waits for the task to finish before
+ * returning.
+ *
+ * In the special case of NUMA_NO_NODE, threads are allowed to run on any node.
+ * This is distinct from ktask_run, which runs threads on the current node.
+ *
+ * @nodes: An array of nodes.
+ * @nr_nodes: Length of the 'nodes' array.
+ * @ctl: Control structure containing information about the task.
+ *
+ * RETURNS:
+ * KTASK_RETURN_SUCCESS or a client-specific nonzero error code.
+ */
+int ktask_run_numa(struct ktask_node *nodes, size_t nr_nodes,
+ struct ktask_ctl *ctl);
+
+void ktask_init(void);
+
+#else /* CONFIG_KTASK */
+
+static inline int ktask_run(void *start, size_t task_size,
+ struct ktask_ctl *ctl)
+{
+ return ctl->kc_thread_func(start, ctl->kc_iter_func(start, task_size),
+ ctl->kc_func_arg);
+}
+
+static inline int ktask_run_numa(struct ktask_node *nodes, size_t nr_nodes,
+ struct ktask_ctl *ctl)
+{
+ size_t i;
+ int err = KTASK_RETURN_SUCCESS;
+
+ for (i = 0; i < nr_nodes; ++i) {
+ void *start = nodes[i].kn_start;
+ void *end = ctl->kc_iter_func(start, nodes[i].kn_task_size);
+
+ err = ctl->kc_thread_func(start, end, ctl->kc_func_arg);
+ if (err != KTASK_RETURN_SUCCESS)
+ break;
+ }
+
+ return err;
+}
+
+static inline void ktask_init(void) { }
+
+#endif /* CONFIG_KTASK */
+#endif /* _LINUX_KTASK_H */
diff --git a/init/Kconfig b/init/Kconfig
index 41583f468cb4..ed82f76ed0b7 100644
--- a/init/Kconfig
+++ b/init/Kconfig
@@ -346,6 +346,17 @@ config AUDIT_TREE
depends on AUDITSYSCALL
select FSNOTIFY

+config KTASK
+ bool "Multithread CPU-intensive kernel work"
+ depends on SMP
+ default y
+ help
+ Parallelize CPU-intensive kernel work. This feature is designed for
+ big machines that can take advantage of their extra CPUs to speed up
+ large kernel tasks. When enabled, kworker threads may occupy more
+ CPU time during these kernel tasks, but these threads are throttled
+ when other tasks on the system need CPU time.
+
source "kernel/irq/Kconfig"
source "kernel/time/Kconfig"
source "kernel/Kconfig.preempt"
diff --git a/init/main.c b/init/main.c
index ee147103ba1b..c689f00eab95 100644
--- a/init/main.c
+++ b/init/main.c
@@ -92,6 +92,7 @@
#include <linux/rodata_test.h>
#include <linux/jump_label.h>
#include <linux/mem_encrypt.h>
+#include <linux/ktask.h>

#include <asm/io.h>
#include <asm/bugs.h>
@@ -1145,6 +1146,7 @@ static noinline void __init kernel_init_freeable(void)

smp_init();
sched_init_smp();
+ ktask_init();

page_alloc_init_late();

diff --git a/kernel/Makefile b/kernel/Makefile
index 63b643eb7e70..ce238cb7add5 100644
--- a/kernel/Makefile
+++ b/kernel/Makefile
@@ -10,7 +10,7 @@ obj-y = fork.o exec_domain.o panic.o \
extable.o params.o \
kthread.o sys_ni.o nsproxy.o \
notifier.o ksysfs.o cred.o reboot.o \
- async.o range.o smpboot.o ucount.o
+ async.o range.o smpboot.o ucount.o ktask.o

obj-$(CONFIG_MODULES) += kmod.o
obj-$(CONFIG_MULTIUSER) += groups.o
diff --git a/kernel/ktask.c b/kernel/ktask.c
new file mode 100644
index 000000000000..a7b2b5a62737
--- /dev/null
+++ b/kernel/ktask.c
@@ -0,0 +1,526 @@
+// SPDX-License-Identifier: GPL-2.0+
+/*
+ * ktask.c - framework to parallelize CPU-intensive kernel work
+ *
+ * For more information, see Documentation/core-api/ktask.rst.
+ *
+ * Copyright (c) 2018 Oracle Corporation
+ * Author: Daniel Jordan <[email protected]>
+ */
+
+#define pr_fmt(fmt) "ktask: " fmt
+
+#include <linux/ktask.h>
+
+#ifdef CONFIG_KTASK
+
+#include <linux/cpu.h>
+#include <linux/cpumask.h>
+#include <linux/completion.h>
+#include <linux/init.h>
+#include <linux/kernel.h>
+#include <linux/list.h>
+#include <linux/mutex.h>
+#include <linux/printk.h>
+#include <linux/random.h>
+#include <linux/slab.h>
+#include <linux/spinlock.h>
+#include <linux/workqueue.h>
+
+/* Resource limits on the amount of workqueue items queued through ktask. */
+static DEFINE_SPINLOCK(ktask_rlim_lock);
+/* Work items queued on all nodes (includes NUMA_NO_NODE) */
+static size_t ktask_rlim_cur;
+static size_t ktask_rlim_max;
+/* Work items queued per node */
+static size_t *ktask_rlim_node_cur;
+static size_t *ktask_rlim_node_max;
+
+/* Allow only 80% of the cpus to be running additional ktask threads. */
+#define KTASK_CPUFRAC_NUMER 4
+#define KTASK_CPUFRAC_DENOM 5
+
+/* Used to pass ktask data to the workqueue API. */
+struct ktask_work {
+ struct work_struct kw_work;
+ struct ktask_task *kw_task;
+ int kw_ktask_node_i;
+ int kw_queue_nid;
+ struct list_head kw_list; /* ktask_free_works linkage */
+};
+
+static LIST_HEAD(ktask_free_works);
+static struct ktask_work *ktask_works;
+
+/* Represents one task. This is for internal use only. */
+struct ktask_task {
+ struct ktask_ctl kt_ctl;
+ size_t kt_total_size;
+ size_t kt_chunk_size;
+ /* protects this struct and struct ktask_work's of a running task */
+ struct mutex kt_mutex;
+ struct ktask_node *kt_nodes;
+ size_t kt_nr_nodes;
+ size_t kt_nr_nodes_left;
+ size_t kt_nworks;
+ size_t kt_nworks_fini;
+ int kt_error; /* first error from thread_func */
+ struct completion kt_ktask_done;
+};
+
+/*
+ * Shrink the size of each job by this shift amount to load balance between the
+ * worker threads.
+ */
+#define KTASK_LOAD_BAL_SHIFT 2
+
+#define KTASK_DEFAULT_MAX_THREADS 4
+
+/* Maximum number of threads for a single task. */
+int ktask_max_threads = KTASK_DEFAULT_MAX_THREADS;
+
+static struct workqueue_struct *ktask_wq;
+static struct workqueue_struct *ktask_nonuma_wq;
+
+static void ktask_thread(struct work_struct *work);
+
+static void ktask_init_work(struct ktask_work *kw, struct ktask_task *kt,
+ size_t ktask_node_i, size_t queue_nid)
+{
+ INIT_WORK(&kw->kw_work, ktask_thread);
+ kw->kw_task = kt;
+ kw->kw_ktask_node_i = ktask_node_i;
+ kw->kw_queue_nid = queue_nid;
+}
+
+static void ktask_queue_work(struct ktask_work *kw)
+{
+ struct workqueue_struct *wq;
+ int cpu;
+
+ if (kw->kw_queue_nid == NUMA_NO_NODE) {
+ /*
+ * If no node is specified, use ktask_nonuma_wq to
+ * allow the thread to run on any node, but fall back
+ * to ktask_wq if we couldn't allocate ktask_nonuma_wq.
+ */
+ cpu = WORK_CPU_UNBOUND;
+ wq = (ktask_nonuma_wq) ?: ktask_wq;
+ } else {
+ /*
+ * WQ_UNBOUND workqueues, such as the one ktask uses,
+ * execute work on some CPU from the node of the CPU we
+ * pass to queue_work_on, so just pick any CPU to stand
+ * for the node on NUMA systems.
+ *
+ * On non-NUMA systems, cpumask_of_node becomes
+ * cpu_online_mask.
+ */
+ cpu = cpumask_any(cpumask_of_node(kw->kw_queue_nid));
+ wq = ktask_wq;
+ }
+
+ WARN_ON(!queue_work_on(cpu, wq, &kw->kw_work));
+}
+
+/* Returns true if we're migrating this part of the task to another node. */
+static bool ktask_node_migrate(struct ktask_node *old_kn, struct ktask_node *kn,
+ size_t ktask_node_i, struct ktask_work *kw,
+ struct ktask_task *kt)
+{
+ int new_queue_nid;
+
+ /*
+ * Don't migrate a user thread, otherwise migrate only if we're going
+ * to a different node.
+ */
+ if (!IS_ENABLED(CONFIG_NUMA) || !(current->flags & PF_KTHREAD) ||
+ kn->kn_nid == old_kn->kn_nid || num_online_nodes() == 1)
+ return false;
+
+ /* Adjust resource limits. */
+ spin_lock(&ktask_rlim_lock);
+ if (kw->kw_queue_nid != NUMA_NO_NODE)
+ --ktask_rlim_node_cur[kw->kw_queue_nid];
+
+ if (kn->kn_nid != NUMA_NO_NODE &&
+ ktask_rlim_node_cur[kw->kw_queue_nid] <
+ ktask_rlim_node_max[kw->kw_queue_nid]) {
+ new_queue_nid = kn->kn_nid;
+ ++ktask_rlim_node_cur[new_queue_nid];
+ } else {
+ new_queue_nid = NUMA_NO_NODE;
+ }
+ spin_unlock(&ktask_rlim_lock);
+
+ ktask_init_work(kw, kt, ktask_node_i, new_queue_nid);
+ ktask_queue_work(kw);
+
+ return true;
+}
+
+static void ktask_thread(struct work_struct *work)
+{
+ struct ktask_work *kw = container_of(work, struct ktask_work, kw_work);
+ struct ktask_task *kt = kw->kw_task;
+ struct ktask_ctl *kc = &kt->kt_ctl;
+ struct ktask_node *kn = &kt->kt_nodes[kw->kw_ktask_node_i];
+ bool done;
+
+ mutex_lock(&kt->kt_mutex);
+
+ while (kt->kt_total_size > 0 && kt->kt_error == KTASK_RETURN_SUCCESS) {
+ void *start, *end;
+ size_t size;
+ int ret;
+
+ if (kn->kn_task_size == 0) {
+ /* The current node is out of work; pick a new one. */
+ size_t remaining_nodes_seen = 0;
+ size_t new_idx = prandom_u32_max(kt->kt_nr_nodes_left);
+ struct ktask_node *old_kn;
+ size_t i;
+
+ WARN_ON(kt->kt_nr_nodes_left == 0);
+ WARN_ON(new_idx >= kt->kt_nr_nodes_left);
+ for (i = 0; i < kt->kt_nr_nodes; ++i) {
+ if (kt->kt_nodes[i].kn_task_size == 0)
+ continue;
+
+ if (remaining_nodes_seen >= new_idx)
+ break;
+
+ ++remaining_nodes_seen;
+ }
+ /* We should have found work on another node. */
+ WARN_ON(i >= kt->kt_nr_nodes);
+
+ old_kn = kn;
+ kn = &kt->kt_nodes[i];
+
+ /* Start another worker on the node we've chosen. */
+ if (ktask_node_migrate(old_kn, kn, i, kw, kt)) {
+ mutex_unlock(&kt->kt_mutex);
+ return;
+ }
+ }
+
+ start = kn->kn_start;
+ size = min(kt->kt_chunk_size, kn->kn_task_size);
+ end = kc->kc_iter_func(start, size);
+ kn->kn_start = end;
+ kn->kn_task_size -= size;
+ WARN_ON(kt->kt_total_size < size);
+ kt->kt_total_size -= size;
+ if (kn->kn_task_size == 0) {
+ WARN_ON(kt->kt_nr_nodes_left == 0);
+ kt->kt_nr_nodes_left--;
+ }
+
+ mutex_unlock(&kt->kt_mutex);
+
+ ret = kc->kc_thread_func(start, end, kc->kc_func_arg);
+
+ mutex_lock(&kt->kt_mutex);
+
+ /* Save first error code only. */
+ if (kt->kt_error == KTASK_RETURN_SUCCESS && ret != kt->kt_error)
+ kt->kt_error = ret;
+ }
+
+ WARN_ON(kt->kt_nr_nodes_left > 0 &&
+ kt->kt_error == KTASK_RETURN_SUCCESS);
+
+ ++kt->kt_nworks_fini;
+ WARN_ON(kt->kt_nworks_fini > kt->kt_nworks);
+ done = (kt->kt_nworks_fini == kt->kt_nworks);
+ mutex_unlock(&kt->kt_mutex);
+
+ if (done)
+ complete(&kt->kt_ktask_done);
+}
+
+/*
+ * Returns the number of chunks to break this task into.
+ *
+ * The number of chunks will be at least the number of works, but in the common
+ * case of a large task, the number of chunks will be greater to load balance
+ * between the workqueue threads in case some of them finish more quickly than
+ * others.
+ */
+static size_t ktask_chunk_size(size_t task_size, size_t min_chunk_size,
+ size_t nworks)
+{
+ size_t chunk_size;
+
+ if (nworks == 1)
+ return task_size;
+
+ chunk_size = (task_size / nworks) >> KTASK_LOAD_BAL_SHIFT;
+
+ /*
+ * chunk_size should be a multiple of min_chunk_size for tasks that
+ * need to operate in fixed-size batches.
+ */
+ if (chunk_size > min_chunk_size)
+ chunk_size = rounddown(chunk_size, min_chunk_size);
+
+ return max(chunk_size, min_chunk_size);
+}
+
+/*
+ * Returns the number of works to be used in the task. This number includes
+ * the current thread, so a return value of 1 means no extra threads are
+ * started.
+ */
+static size_t ktask_init_works(struct ktask_node *nodes, size_t nr_nodes,
+ struct ktask_task *kt,
+ struct list_head *works_list)
+{
+ size_t i, nr_works, nr_works_check;
+ size_t min_chunk_size = kt->kt_ctl.kc_min_chunk_size;
+ size_t max_threads = kt->kt_ctl.kc_max_threads;
+
+ if (!ktask_wq)
+ return 1;
+
+ if (max_threads == 0)
+ max_threads = ktask_max_threads;
+
+ /* Ensure at least one thread when task_size < min_chunk_size. */
+ nr_works_check = DIV_ROUND_UP(kt->kt_total_size, min_chunk_size);
+ nr_works_check = min_t(size_t, nr_works_check, num_online_cpus());
+ nr_works_check = min_t(size_t, nr_works_check, max_threads);
+
+ /*
+ * Use at least the current thread for this task; check whether
+ * ktask_rlim allows additional work items to be queued.
+ */
+ nr_works = 1;
+ spin_lock(&ktask_rlim_lock);
+ for (i = nr_works; i < nr_works_check; ++i) {
+ /* Allocate works evenly over the task's given nodes. */
+ size_t ktask_node_i = i % nr_nodes;
+ struct ktask_node *kn = &nodes[ktask_node_i];
+ struct ktask_work *kw;
+ int nid = kn->kn_nid;
+ int queue_nid;
+
+ WARN_ON(ktask_rlim_cur > ktask_rlim_max);
+ if (ktask_rlim_cur == ktask_rlim_max)
+ break; /* No more work items allowed to be queued. */
+
+ /* Allowed to queue on requested node? */
+ if (nid != NUMA_NO_NODE &&
+ ktask_rlim_node_cur[nid] < ktask_rlim_node_max[nid]) {
+ WARN_ON(ktask_rlim_node_cur[nid] > ktask_rlim_cur);
+ ++ktask_rlim_node_cur[nid];
+ queue_nid = nid;
+ } else {
+ queue_nid = NUMA_NO_NODE;
+ }
+
+ WARN_ON(list_empty(&ktask_free_works));
+ kw = list_first_entry(&ktask_free_works, struct ktask_work,
+ kw_list);
+ list_move_tail(&kw->kw_list, works_list);
+ ktask_init_work(kw, kt, ktask_node_i, queue_nid);
+
+ ++ktask_rlim_cur;
+ ++nr_works;
+ }
+ spin_unlock(&ktask_rlim_lock);
+
+ return nr_works;
+}
+
+static void ktask_fini_works(struct ktask_task *kt,
+ struct list_head *works_list)
+{
+ struct ktask_work *work;
+
+ spin_lock(&ktask_rlim_lock);
+
+ /* Put the works back on the free list, adjusting rlimits. */
+ list_for_each_entry(work, works_list, kw_list) {
+ if (work->kw_queue_nid != NUMA_NO_NODE) {
+ WARN_ON(ktask_rlim_node_cur[work->kw_queue_nid] == 0);
+ --ktask_rlim_node_cur[work->kw_queue_nid];
+ }
+ WARN_ON(ktask_rlim_cur == 0);
+ --ktask_rlim_cur;
+ }
+ list_splice(works_list, &ktask_free_works);
+
+ spin_unlock(&ktask_rlim_lock);
+}
+
+int ktask_run_numa(struct ktask_node *nodes, size_t nr_nodes,
+ struct ktask_ctl *ctl)
+{
+ size_t i;
+ struct ktask_work kw;
+ struct ktask_work *work;
+ LIST_HEAD(works_list);
+ struct ktask_task kt = {
+ .kt_ctl = *ctl,
+ .kt_total_size = 0,
+ .kt_nodes = nodes,
+ .kt_nr_nodes = nr_nodes,
+ .kt_nr_nodes_left = nr_nodes,
+ .kt_nworks_fini = 0,
+ .kt_error = KTASK_RETURN_SUCCESS,
+ };
+
+ for (i = 0; i < nr_nodes; ++i) {
+ kt.kt_total_size += nodes[i].kn_task_size;
+ if (nodes[i].kn_task_size == 0)
+ kt.kt_nr_nodes_left--;
+
+ WARN_ON(nodes[i].kn_nid >= MAX_NUMNODES);
+ }
+
+ if (kt.kt_total_size == 0)
+ return KTASK_RETURN_SUCCESS;
+
+ mutex_init(&kt.kt_mutex);
+ init_completion(&kt.kt_ktask_done);
+
+ kt.kt_nworks = ktask_init_works(nodes, nr_nodes, &kt, &works_list);
+ kt.kt_chunk_size = ktask_chunk_size(kt.kt_total_size,
+ ctl->kc_min_chunk_size,
+ kt.kt_nworks);
+
+ list_for_each_entry(work, &works_list, kw_list)
+ ktask_queue_work(work);
+
+ /* Use the current thread, which saves starting a workqueue worker. */
+ ktask_init_work(&kw, &kt, 0, nodes[0].kn_nid);
+ ktask_thread(&kw.kw_work);
+
+ /* Wait for all the jobs to finish. */
+ wait_for_completion(&kt.kt_ktask_done);
+
+ ktask_fini_works(&kt, &works_list);
+ mutex_destroy(&kt.kt_mutex);
+
+ return kt.kt_error;
+}
+EXPORT_SYMBOL_GPL(ktask_run_numa);
+
+int ktask_run(void *start, size_t task_size, struct ktask_ctl *ctl)
+{
+ struct ktask_node node;
+
+ node.kn_start = start;
+ node.kn_task_size = task_size;
+ node.kn_nid = numa_node_id();
+
+ return ktask_run_numa(&node, 1, ctl);
+}
+EXPORT_SYMBOL_GPL(ktask_run);
+
+/*
+ * Initialize internal limits on work items queued. Work items submitted to
+ * cmwq capped at 80% of online cpus both system-wide and per-node to maintain
+ * an efficient level of parallelization at these respective levels.
+ */
+static bool __init ktask_rlim_init(void)
+{
+ int node, nr_cpus;
+ unsigned int nr_node_cpus;
+
+ nr_cpus = num_online_cpus();
+
+ /* XXX Handle CPU hotplug. */
+ if (nr_cpus == 1)
+ return false;
+
+ ktask_rlim_node_cur = kcalloc(num_possible_nodes(), sizeof(size_t),
+ GFP_KERNEL);
+
+ ktask_rlim_node_max = kmalloc_array(num_possible_nodes(),
+ sizeof(size_t), GFP_KERNEL);
+
+ ktask_rlim_max = mult_frac(nr_cpus, KTASK_CPUFRAC_NUMER,
+ KTASK_CPUFRAC_DENOM);
+ for_each_node(node) {
+ nr_node_cpus = cpumask_weight(cpumask_of_node(node));
+ ktask_rlim_node_max[node] = mult_frac(nr_node_cpus,
+ KTASK_CPUFRAC_NUMER,
+ KTASK_CPUFRAC_DENOM);
+ }
+
+ return true;
+}
+
+void __init ktask_init(void)
+{
+ struct workqueue_attrs *attrs;
+ int i, ret;
+
+ if (!ktask_rlim_init())
+ goto out;
+
+ ktask_works = kmalloc_array(ktask_rlim_max, sizeof(struct ktask_work),
+ GFP_KERNEL);
+ for (i = 0; i < ktask_rlim_max; ++i)
+ list_add_tail(&ktask_works[i].kw_list, &ktask_free_works);
+
+ ktask_wq = alloc_workqueue("ktask_wq", WQ_UNBOUND, 0);
+ if (!ktask_wq) {
+ pr_warn("disabled (failed to alloc ktask_wq)");
+ goto out;
+ }
+
+ /*
+ * Threads executing work from this workqueue can run on any node on
+ * the system. If we get any failures below, use ktask_wq in its
+ * place. It's better than nothing.
+ */
+ ktask_nonuma_wq = alloc_workqueue("ktask_nonuma_wq", WQ_UNBOUND, 0);
+ if (!ktask_nonuma_wq) {
+ pr_warn("disabled (failed to alloc ktask_nonuma_wq)");
+ goto alloc_fail;
+ }
+
+ attrs = alloc_workqueue_attrs(GFP_KERNEL);
+ if (!attrs) {
+ pr_warn("disabled (couldn't alloc wq attrs)");
+ goto alloc_fail;
+ }
+
+ attrs->no_numa = true;
+
+ ret = apply_workqueue_attrs(ktask_nonuma_wq, attrs);
+ if (ret != 0) {
+ pr_warn("disabled (couldn't apply attrs to ktask_nonuma_wq)");
+ goto apply_fail;
+ }
+
+ free_workqueue_attrs(attrs);
+out:
+ return;
+
+apply_fail:
+ free_workqueue_attrs(attrs);
+alloc_fail:
+ if (ktask_wq)
+ destroy_workqueue(ktask_wq);
+ if (ktask_nonuma_wq)
+ destroy_workqueue(ktask_nonuma_wq);
+ ktask_wq = NULL;
+ ktask_nonuma_wq = NULL;
+}
+
+#endif /* CONFIG_KTASK */
+
+/*
+ * This function is defined outside CONFIG_KTASK so it can be called in the
+ * !CONFIG_KTASK versions of ktask_run and ktask_run_numa.
+ */
+void *ktask_iter_range(void *position, size_t size)
+{
+ return (char *)position + size;
+}
+EXPORT_SYMBOL_GPL(ktask_iter_range);
--
2.19.1


2018-11-05 17:00:28

by Daniel Jordan

Subject: [RFC PATCH v4 09/13] vfio: relieve mmap_sem reader cacheline bouncing by holding it longer

Profiling shows significant time being spent on atomic ops in mmap_sem
reader acquisition. mmap_sem is taken and dropped for every single base
page during pinning, so this is not surprising.

Reduce the number of times mmap_sem is taken by holding it longer, which
relieves the atomic cacheline bouncing.

Results for all VFIO page pinning patches
-----------------------------------------

The test measures the time from qemu invocation to the start of guest
boot. The guest uses kvm with 320G memory backed with THP. 320G fits
in a node on the test machine used here, so there was no thrashing in
reclaim because of __GFP_THISNODE in THP allocations[1].

CPU: 2 nodes * 24 cores/node * 2 threads/core = 96 CPUs
Intel(R) Xeon(R) Platinum 8160 CPU @ 2.10GHz
memory: 754G split evenly between nodes
scaling_governor: performance

        patch 6                    patch 8                    patch 9 (this one)
        -----------------------    -----------------------    -----------------------
thr     speedup   average sec      speedup   average sec      speedup   average sec
  1               65.0 (± 0.6%)              65.2 (± 0.5%)              65.5 (± 0.4%)
  2      1.5x     42.8 (± 5.8%)     1.8x     36.0 (± 0.9%)     1.9x     34.4 (± 0.3%)
  3      1.9x     35.0 (±11.3%)     2.5x     26.4 (± 4.2%)     2.8x     23.7 (± 0.2%)
  4      2.3x     28.5 (± 1.3%)     3.1x     21.2 (± 2.8%)     3.6x     18.3 (± 0.3%)
  5      2.5x     26.2 (± 1.5%)     3.6x     17.9 (± 0.9%)     4.3x     15.1 (± 0.3%)
  6      2.7x     24.5 (± 1.8%)     4.0x     16.5 (± 3.0%)     5.1x     12.9 (± 0.1%)
  7      2.8x     23.5 (± 4.9%)     4.2x     15.4 (± 2.7%)     5.7x     11.5 (± 0.6%)
  8      2.8x     22.8 (± 1.8%)     4.2x     15.5 (± 4.7%)     6.4x     10.3 (± 0.8%)
 12      3.2x     20.2 (± 1.4%)     4.4x     14.7 (± 2.9%)     8.6x      7.6 (± 0.6%)
 16      3.3x     20.0 (± 0.7%)     4.3x     15.4 (± 1.3%)    10.2x      6.4 (± 0.6%)

At patch 6, lock_stat showed mmap_sem readers spending a long time
waiting on its writers, which led to patch 8.

At patch 8, profiling revealed the issue with mmap_sem described above.

Across all three patches, performance consistently improves as the
thread count increases. The one exception is the antiscaling with
nthr=16 in patch 8: those mmap_sem atomics are really bouncing around
the machine.

The performance with patch 9 looks pretty good overall. I'm working on
finding the next bottleneck, and here is where that search currently
stands: with nthr=16, the most obvious issue in the profile was
contention on the split PMD page table lock when pages are faulted in
during the pinning (>2% of the time).

A split PMD lock protects a PUD_SIZE-ed amount of page table mappings
(1G on x86), so if threads were operating on smaller chunks and
contending in the same PUD_SIZE range, this could be the source of
contention. However, when nthr=16, threads operate on 5G chunks (320G /
16 threads / (1<<KTASK_LOAD_BAL_SHIFT)), so this wasn't the cause, and
aligning the chunks on PUD_SIZE boundaries didn't help either.

The time is short (6.4 seconds), so the next theory was threads
finishing at different times, but probes showed the threads all returned
within less than a millisecond of each other.

Kernel probes turned up a few smaller VFIO page pin calls besides the
heavy 320G call. The minimum chunk size given (PMD_SIZE) could affect
the thread count and chunking for these smaller calls, so the chunk size
was increased from 2M to 1G. This made the split PMD contention
disappear, but with little change in the runtime. More digging required.

[1] lkml.kernel.org/r/[email protected]

Signed-off-by: Daniel Jordan <[email protected]>
---
drivers/vfio/vfio_iommu_type1.c | 28 +++++++++++++++++++---------
1 file changed, 19 insertions(+), 9 deletions(-)

diff --git a/drivers/vfio/vfio_iommu_type1.c b/drivers/vfio/vfio_iommu_type1.c
index 9e52a24eb45b..0d6ec0369de5 100644
--- a/drivers/vfio/vfio_iommu_type1.c
+++ b/drivers/vfio/vfio_iommu_type1.c
@@ -331,7 +331,7 @@ static int put_pfn(unsigned long pfn, int prot)
}

static int vaddr_get_pfn(struct mm_struct *mm, unsigned long vaddr,
- int prot, unsigned long *pfn)
+ int prot, unsigned long *pfn, bool handle_mmap_sem)
{
struct page *page[1];
struct vm_area_struct *vma;
@@ -342,7 +342,8 @@ static int vaddr_get_pfn(struct mm_struct *mm, unsigned long vaddr,
if (prot & IOMMU_WRITE)
flags |= FOLL_WRITE;

- down_read(&mm->mmap_sem);
+ if (handle_mmap_sem)
+ down_read(&mm->mmap_sem);
if (mm == current->mm) {
ret = get_user_pages_longterm(vaddr, 1, flags, page, vmas);
} else {
@@ -360,14 +361,16 @@ static int vaddr_get_pfn(struct mm_struct *mm, unsigned long vaddr,
put_page(page[0]);
}
}
- up_read(&mm->mmap_sem);
+ if (handle_mmap_sem)
+ up_read(&mm->mmap_sem);

if (ret == 1) {
*pfn = page_to_pfn(page[0]);
return 0;
}

- down_read(&mm->mmap_sem);
+ if (handle_mmap_sem)
+ down_read(&mm->mmap_sem);

vma = find_vma_intersection(mm, vaddr, vaddr + 1);

@@ -377,7 +380,8 @@ static int vaddr_get_pfn(struct mm_struct *mm, unsigned long vaddr,
ret = 0;
}

- up_read(&mm->mmap_sem);
+ if (handle_mmap_sem)
+ up_read(&mm->mmap_sem);
return ret;
}

@@ -399,9 +403,12 @@ static long vfio_pin_pages_remote(struct vfio_dma *dma, unsigned long vaddr,
if (!mm)
return -ENODEV;

- ret = vaddr_get_pfn(mm, vaddr, dma->prot, pfn_base);
- if (ret)
+ down_read(&mm->mmap_sem);
+ ret = vaddr_get_pfn(mm, vaddr, dma->prot, pfn_base, false);
+ if (ret) {
+ up_read(&mm->mmap_sem);
return ret;
+ }

pinned++;
rsvd = is_invalid_reserved_pfn(*pfn_base);
@@ -416,6 +423,7 @@ static long vfio_pin_pages_remote(struct vfio_dma *dma, unsigned long vaddr,
put_pfn(*pfn_base, dma->prot);
pr_warn("%s: RLIMIT_MEMLOCK (%ld) exceeded\n", __func__,
limit << PAGE_SHIFT);
+ up_read(&mm->mmap_sem);
return -ENOMEM;
}
lock_acct++;
@@ -427,7 +435,7 @@ static long vfio_pin_pages_remote(struct vfio_dma *dma, unsigned long vaddr,
/* Lock all the consecutive pages from pfn_base */
for (vaddr += PAGE_SIZE, iova += PAGE_SIZE; pinned < npage;
pinned++, vaddr += PAGE_SIZE, iova += PAGE_SIZE) {
- ret = vaddr_get_pfn(mm, vaddr, dma->prot, &pfn);
+ ret = vaddr_get_pfn(mm, vaddr, dma->prot, &pfn, false);
if (ret)
break;

@@ -444,6 +452,7 @@ static long vfio_pin_pages_remote(struct vfio_dma *dma, unsigned long vaddr,
pr_warn("%s: RLIMIT_MEMLOCK (%ld) exceeded\n",
__func__, limit << PAGE_SHIFT);
ret = -ENOMEM;
+ up_read(&mm->mmap_sem);
goto unpin_out;
}
lock_acct++;
@@ -451,6 +460,7 @@ static long vfio_pin_pages_remote(struct vfio_dma *dma, unsigned long vaddr,
}

out:
+ up_read(&mm->mmap_sem);
ret = vfio_lock_acct(dma, lock_acct, false);

unpin_out:
@@ -497,7 +507,7 @@ static int vfio_pin_page_external(struct vfio_dma *dma, unsigned long vaddr,
if (!mm)
return -ENODEV;

- ret = vaddr_get_pfn(mm, vaddr, dma->prot, pfn_base);
+ ret = vaddr_get_pfn(mm, vaddr, dma->prot, pfn_base, true);
if (!ret && do_accounting && !is_invalid_reserved_pfn(*pfn_base)) {
ret = vfio_lock_acct(dma, 1, true);
if (ret) {
--
2.19.1


2018-11-05 17:00:51

by Daniel Jordan

Subject: [RFC PATCH v4 11/13] mm: parallelize deferred struct page initialization within each node

Deferred struct page initialization currently runs one thread per node,
but this is a bottleneck during boot on big machines, so use ktask
within each pgdatinit thread to parallelize the struct page
initialization, allowing the system to take better advantage of its
memory bandwidth.

Because the system is not fully up yet and most CPUs are idle, use more
than the default maximum number of ktask threads. The kernel doesn't
know the memory bandwidth of a given system to get the most efficient
number of threads, so there's some guesswork involved. In testing, a
reasonable value turned out to be about a quarter of the CPUs on the
node.

__free_pages_core used to increase the zone's managed page count by the
number of pages being freed. To accommodate multiple threads, however,
account the number of freed pages with an atomic shared across the ktask
threads and bump the managed page count with it after ktask is finished.

Test: Boot the machine with deferred struct page init three times

Machine: Intel(R) Xeon(R) CPU E5-2699 v4 @ 2.20GHz, 88 CPUs, 503G memory,
2 sockets

kernel                speedup   max time per   stdev
                                node (ms)

baseline (4.15-rc2)                  5860        8.6
ktask                   9.56x         613       12.4

---

Machine: Intel(R) Xeon(R) CPU E7-8895 v3 @ 2.60GHz, 288 CPUs, 1T memory
8 sockets

kernel                speedup   max time per   stdev
                                node (ms)
baseline (4.15-rc2)                  1261        1.9
ktask                   3.88x         325        5.0

Signed-off-by: Daniel Jordan <[email protected]>
Suggested-by: Pavel Tatashin <[email protected]>
---
mm/page_alloc.c | 91 ++++++++++++++++++++++++++++++++++++++++++-------
1 file changed, 78 insertions(+), 13 deletions(-)

diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index ae31839874b8..fe7b681567ba 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -66,6 +66,7 @@
#include <linux/lockdep.h>
#include <linux/nmi.h>
#include <linux/psi.h>
+#include <linux/ktask.h>

#include <asm/sections.h>
#include <asm/tlbflush.h>
@@ -1275,7 +1276,6 @@ void __free_pages_core(struct page *page, unsigned int order)
set_page_count(p, 0);
}

- page_zone(page)->managed_pages += nr_pages;
set_page_refcounted(page);
__free_pages(page, order);
}
@@ -1340,6 +1340,7 @@ void __init memblock_free_pages(struct page *page, unsigned long pfn,
if (early_page_uninitialised(pfn))
return;
__free_pages_core(page, order);
+ page_zone(page)->managed_pages += 1UL << order;
}

/*
@@ -1477,23 +1478,31 @@ deferred_pfn_valid(int nid, unsigned long pfn,
return true;
}

+struct deferred_args {
+ int nid;
+ int zid;
+ atomic64_t nr_pages;
+};
+
/*
* Free pages to buddy allocator. Try to free aligned pages in
* pageblock_nr_pages sizes.
*/
-static void __init deferred_free_pages(int nid, int zid, unsigned long pfn,
- unsigned long end_pfn)
+static int __init deferred_free_pages(int nid, int zid, unsigned long pfn,
+ unsigned long end_pfn)
{
struct mminit_pfnnid_cache nid_init_state = { };
unsigned long nr_pgmask = pageblock_nr_pages - 1;
- unsigned long nr_free = 0;
+ unsigned long nr_free = 0, nr_pages = 0;

for (; pfn < end_pfn; pfn++) {
if (!deferred_pfn_valid(nid, pfn, &nid_init_state)) {
deferred_free_range(pfn - nr_free, nr_free);
+ nr_pages += nr_free;
nr_free = 0;
} else if (!(pfn & nr_pgmask)) {
deferred_free_range(pfn - nr_free, nr_free);
+ nr_pages += nr_free;
nr_free = 1;
touch_nmi_watchdog();
} else {
@@ -1502,16 +1511,27 @@ static void __init deferred_free_pages(int nid, int zid, unsigned long pfn,
}
/* Free the last block of pages to allocator */
deferred_free_range(pfn - nr_free, nr_free);
+ nr_pages += nr_free;
+
+ return nr_pages;
+}
+
+static int __init deferred_free_chunk(unsigned long pfn, unsigned long end_pfn,
+ struct deferred_args *args)
+{
+ unsigned long nr_pages = deferred_free_pages(args->nid, args->zid, pfn,
+ end_pfn);
+ atomic64_add(nr_pages, &args->nr_pages);
+ return KTASK_RETURN_SUCCESS;
}

/*
* Initialize struct pages. We minimize pfn page lookups and scheduler checks
* by performing it only once every pageblock_nr_pages.
- * Return number of pages initialized.
+ * Return number of pages initialized in deferred_args.
*/
-static unsigned long __init deferred_init_pages(int nid, int zid,
- unsigned long pfn,
- unsigned long end_pfn)
+static int __init deferred_init_pages(int nid, int zid, unsigned long pfn,
+ unsigned long end_pfn)
{
struct mminit_pfnnid_cache nid_init_state = { };
unsigned long nr_pgmask = pageblock_nr_pages - 1;
@@ -1531,7 +1551,17 @@ static unsigned long __init deferred_init_pages(int nid, int zid,
__init_single_page(page, pfn, zid, nid);
nr_pages++;
}
- return (nr_pages);
+
+ return nr_pages;
+}
+
+static int __init deferred_init_chunk(unsigned long pfn, unsigned long end_pfn,
+ struct deferred_args *args)
+{
+ unsigned long nr_pages = deferred_init_pages(args->nid, args->zid, pfn,
+ end_pfn);
+ atomic64_add(nr_pages, &args->nr_pages);
+ return KTASK_RETURN_SUCCESS;
}

/* Initialise remaining memory on a node */
@@ -1540,13 +1570,15 @@ static int __init deferred_init_memmap(void *data)
pg_data_t *pgdat = data;
int nid = pgdat->node_id;
unsigned long start = jiffies;
- unsigned long nr_pages = 0;
+ unsigned long nr_init = 0, nr_free = 0;
unsigned long spfn, epfn, first_init_pfn, flags;
phys_addr_t spa, epa;
int zid;
struct zone *zone;
const struct cpumask *cpumask = cpumask_of_node(pgdat->node_id);
u64 i;
+ unsigned long nr_node_cpus;
+ struct ktask_node kn;

/* Bind memory initialisation thread to a local node if possible */
if (!cpumask_empty(cpumask))
@@ -1560,6 +1592,14 @@ static int __init deferred_init_memmap(void *data)
return 0;
}

+ /*
+ * We'd like to know the memory bandwidth of the chip to calculate the
+ * most efficient number of threads to start, but we can't. In
+ * testing, a good value for a variety of systems was a quarter of the
+ * CPUs on the node.
+ */
+ nr_node_cpus = DIV_ROUND_UP(cpumask_weight(cpumask), 4);
+
/* Sanity check boundaries */
BUG_ON(pgdat->first_deferred_pfn < pgdat->node_start_pfn);
BUG_ON(pgdat->first_deferred_pfn > pgdat_end_pfn(pgdat));
@@ -1580,21 +1620,46 @@ static int __init deferred_init_memmap(void *data)
* page in __free_one_page()).
*/
for_each_free_mem_range(i, nid, MEMBLOCK_NONE, &spa, &epa, NULL) {
+ struct deferred_args args = { nid, zid, ATOMIC64_INIT(0) };
+ DEFINE_KTASK_CTL(ctl, deferred_init_chunk, &args,
+ KTASK_PTE_MINCHUNK);
+ ktask_ctl_set_max_threads(&ctl, nr_node_cpus);
+
spfn = max_t(unsigned long, first_init_pfn, PFN_UP(spa));
epfn = min_t(unsigned long, zone_end_pfn(zone), PFN_DOWN(epa));
- nr_pages += deferred_init_pages(nid, zid, spfn, epfn);
+
+ kn.kn_start = (void *)spfn;
+ kn.kn_task_size = (spfn < epfn) ? epfn - spfn : 0;
+ kn.kn_nid = nid;
+ (void) ktask_run_numa(&kn, 1, &ctl);
+
+ nr_init += atomic64_read(&args.nr_pages);
}
for_each_free_mem_range(i, nid, MEMBLOCK_NONE, &spa, &epa, NULL) {
+ struct deferred_args args = { nid, zid, ATOMIC64_INIT(0) };
+ DEFINE_KTASK_CTL(ctl, deferred_free_chunk, &args,
+ KTASK_PTE_MINCHUNK);
+ ktask_ctl_set_max_threads(&ctl, nr_node_cpus);
+
spfn = max_t(unsigned long, first_init_pfn, PFN_UP(spa));
epfn = min_t(unsigned long, zone_end_pfn(zone), PFN_DOWN(epa));
- deferred_free_pages(nid, zid, spfn, epfn);
+
+ kn.kn_start = (void *)spfn;
+ kn.kn_task_size = (spfn < epfn) ? epfn - spfn : 0;
+ kn.kn_nid = nid;
+ (void) ktask_run_numa(&kn, 1, &ctl);
+
+ nr_free += atomic64_read(&args.nr_pages);
}
pgdat_resize_unlock(pgdat, &flags);

/* Sanity check that the next zone really is unpopulated */
WARN_ON(++zid < MAX_NR_ZONES && populated_zone(++zone));
+ VM_BUG_ON(nr_init != nr_free);
+
+ zone->managed_pages += nr_free;

- pr_info("node %d initialised, %lu pages in %ums\n", nid, nr_pages,
+ pr_info("node %d initialised, %lu pages in %ums\n", nid, nr_free,
jiffies_to_msecs(jiffies - start));

pgdat_init_report_one_done();
--
2.19.1


2018-11-05 17:01:13

by Daniel Jordan

[permalink] [raw]
Subject: [RFC PATCH v4 03/13] ktask: add undo support

Tasks can fail midway through their work. To recover, the finished
chunks of work need to be undone in a task-specific way.

Allow ktask clients to pass an "undo" callback that is responsible for
undoing one chunk of work. To avoid multiple levels of error handling,
do not allow the callback to fail. For simplicity and because it's a
slow path, undoing is not multithreaded.
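
As a rough usage sketch (the pin/unpin helpers, chunk size, and setup of
args and kn are hypothetical; the real interface is in the header changes
below), a client that can undo its work would set the callback like this:

    /* Completes one chunk; cleans up after itself if it fails mid-chunk. */
    static int pin_range_thread(void *start, void *end, void *arg)
    {
            return my_pin_pages(start, end, arg);
    }

    /* Must not fail: releases one successfully completed chunk. */
    static void pin_range_undo(void *start, void *end, void *arg)
    {
            my_unpin_pages(start, end, arg);
    }

    ...
    DEFINE_KTASK_CTL(ctl, pin_range_thread, &args, MY_MIN_CHUNK);
    ktask_ctl_set_undo_func(&ctl, pin_range_undo);
    ret = ktask_run_numa(&kn, 1, &ctl);
    /* On failure, ktask has already run pin_range_undo() on every chunk
     * that completed successfully. */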

Signed-off-by: Daniel Jordan <[email protected]>
---
include/linux/ktask.h | 36 +++++++++++-
kernel/ktask.c | 125 +++++++++++++++++++++++++++++++++++-------
2 files changed, 138 insertions(+), 23 deletions(-)

diff --git a/include/linux/ktask.h b/include/linux/ktask.h
index 9c75a93b51b9..30a6a88e5dad 100644
--- a/include/linux/ktask.h
+++ b/include/linux/ktask.h
@@ -10,6 +10,7 @@
#ifndef _LINUX_KTASK_H
#define _LINUX_KTASK_H

+#include <linux/list.h>
#include <linux/mm.h>
#include <linux/types.h>

@@ -23,9 +24,14 @@
* @kn_nid: NUMA node id to run threads on
*/
struct ktask_node {
- void *kn_start;
- size_t kn_task_size;
- int kn_nid;
+ void *kn_start;
+ size_t kn_task_size;
+ int kn_nid;
+
+ /* Private fields below - do not touch these. */
+ void *kn_position;
+ size_t kn_remaining_size;
+ struct list_head kn_failed_works;
};

/**
@@ -43,6 +49,14 @@ struct ktask_node {
*/
typedef int (*ktask_thread_func)(void *start, void *end, void *arg);

+/**
+ * typedef ktask_undo_func
+ *
+ * The same as ktask_thread_func, with the exception that it must always
+ * succeed, so it doesn't return anything.
+ */
+typedef void (*ktask_undo_func)(void *start, void *end, void *arg);
+
/**
* typedef ktask_iter_func
*
@@ -77,6 +91,11 @@ void *ktask_iter_range(void *position, size_t size);
*
* @kc_thread_func: A thread function that completes one chunk of the task per
* call.
+ * @kc_undo_func: A function that undoes one chunk of the task per call.
+ * If non-NULL and error(s) occur during the task, this is
+ * called on all successfully completed chunks of work. The
+ * chunk(s) in which failure occurs should be handled in
+ * kc_thread_func.
* @kc_func_arg: An argument to be passed to the thread and undo functions.
* @kc_iter_func: An iterator function to advance the iterator by some number
* of task-specific units.
@@ -90,6 +109,7 @@ void *ktask_iter_range(void *position, size_t size);
struct ktask_ctl {
/* Required arguments set with DEFINE_KTASK_CTL. */
ktask_thread_func kc_thread_func;
+ ktask_undo_func kc_undo_func;
void *kc_func_arg;
size_t kc_min_chunk_size;

@@ -101,6 +121,7 @@ struct ktask_ctl {
#define KTASK_CTL_INITIALIZER(thread_func, func_arg, min_chunk_size) \
{ \
.kc_thread_func = (ktask_thread_func)(thread_func), \
+ .kc_undo_func = NULL, \
.kc_func_arg = (func_arg), \
.kc_min_chunk_size = (min_chunk_size), \
.kc_iter_func = (ktask_iter_range), \
@@ -132,6 +153,15 @@ struct ktask_ctl {
#define ktask_ctl_set_iter_func(ctl, iter_func) \
((ctl)->kc_iter_func = (ktask_iter_func)(iter_func))

+/**
+ * ktask_ctl_set_undo_func - Designate an undo function to unwind from error
+ *
+ * @ctl: A control structure containing information about the task.
+ * @undo_func: Undoes a piece of the task.
+ */
+#define ktask_ctl_set_undo_func(ctl, undo_func) \
+ ((ctl)->kc_undo_func = (ktask_undo_func)(undo_func))
+
/**
* ktask_ctl_set_max_threads - Set a task-specific maximum number of threads
*
diff --git a/kernel/ktask.c b/kernel/ktask.c
index a7b2b5a62737..b91c62f14dcd 100644
--- a/kernel/ktask.c
+++ b/kernel/ktask.c
@@ -20,6 +20,7 @@
#include <linux/init.h>
#include <linux/kernel.h>
#include <linux/list.h>
+#include <linux/list_sort.h>
#include <linux/mutex.h>
#include <linux/printk.h>
#include <linux/random.h>
@@ -46,7 +47,12 @@ struct ktask_work {
struct ktask_task *kw_task;
int kw_ktask_node_i;
int kw_queue_nid;
- struct list_head kw_list; /* ktask_free_works linkage */
+ /* task units from kn_start to kw_error_start */
+ size_t kw_error_offset;
+ void *kw_error_start;
+ void *kw_error_end;
+ /* ktask_free_works, kn_failed_works linkage */
+ struct list_head kw_list;
};

static LIST_HEAD(ktask_free_works);
@@ -170,11 +176,11 @@ static void ktask_thread(struct work_struct *work)
mutex_lock(&kt->kt_mutex);

while (kt->kt_total_size > 0 && kt->kt_error == KTASK_RETURN_SUCCESS) {
- void *start, *end;
- size_t size;
+ void *position, *end;
+ size_t size, position_offset;
int ret;

- if (kn->kn_task_size == 0) {
+ if (kn->kn_remaining_size == 0) {
/* The current node is out of work; pick a new one. */
size_t remaining_nodes_seen = 0;
size_t new_idx = prandom_u32_max(kt->kt_nr_nodes_left);
@@ -184,7 +190,7 @@ static void ktask_thread(struct work_struct *work)
WARN_ON(kt->kt_nr_nodes_left == 0);
WARN_ON(new_idx >= kt->kt_nr_nodes_left);
for (i = 0; i < kt->kt_nr_nodes; ++i) {
- if (kt->kt_nodes[i].kn_task_size == 0)
+ if (kt->kt_nodes[i].kn_remaining_size == 0)
continue;

if (remaining_nodes_seen >= new_idx)
@@ -205,27 +211,40 @@ static void ktask_thread(struct work_struct *work)
}
}

- start = kn->kn_start;
- size = min(kt->kt_chunk_size, kn->kn_task_size);
- end = kc->kc_iter_func(start, size);
- kn->kn_start = end;
- kn->kn_task_size -= size;
+ position = kn->kn_position;
+ position_offset = kn->kn_task_size - kn->kn_remaining_size;
+ size = min(kt->kt_chunk_size, kn->kn_remaining_size);
+ end = kc->kc_iter_func(position, size);
+ kn->kn_position = end;
+ kn->kn_remaining_size -= size;
WARN_ON(kt->kt_total_size < size);
kt->kt_total_size -= size;
- if (kn->kn_task_size == 0) {
+ if (kn->kn_remaining_size == 0) {
WARN_ON(kt->kt_nr_nodes_left == 0);
kt->kt_nr_nodes_left--;
}

mutex_unlock(&kt->kt_mutex);

- ret = kc->kc_thread_func(start, end, kc->kc_func_arg);
+ ret = kc->kc_thread_func(position, end, kc->kc_func_arg);

mutex_lock(&kt->kt_mutex);

- /* Save first error code only. */
- if (kt->kt_error == KTASK_RETURN_SUCCESS && ret != kt->kt_error)
- kt->kt_error = ret;
+ if (ret != KTASK_RETURN_SUCCESS) {
+ /* Save first error code only. */
+ if (kt->kt_error == KTASK_RETURN_SUCCESS)
+ kt->kt_error = ret;
+ /*
+ * If this task has an undo function, save information
+ * about where this thread failed for ktask_undo.
+ */
+ if (kc->kc_undo_func) {
+ list_move(&kw->kw_list, &kn->kn_failed_works);
+ kw->kw_error_start = position;
+ kw->kw_error_offset = position_offset;
+ kw->kw_error_end = end;
+ }
+ }
}

WARN_ON(kt->kt_nr_nodes_left > 0 &&
@@ -335,26 +354,85 @@ static size_t ktask_init_works(struct ktask_node *nodes, size_t nr_nodes,
}

static void ktask_fini_works(struct ktask_task *kt,
+ struct ktask_work *stack_work,
struct list_head *works_list)
{
- struct ktask_work *work;
+ struct ktask_work *work, *next;

spin_lock(&ktask_rlim_lock);

/* Put the works back on the free list, adjusting rlimits. */
- list_for_each_entry(work, works_list, kw_list) {
+ list_for_each_entry_safe(work, next, works_list, kw_list) {
+ if (work == stack_work) {
+ /* On this thread's stack, so not subject to rlimits. */
+ list_del(&work->kw_list);
+ continue;
+ }
if (work->kw_queue_nid != NUMA_NO_NODE) {
WARN_ON(ktask_rlim_node_cur[work->kw_queue_nid] == 0);
--ktask_rlim_node_cur[work->kw_queue_nid];
}
WARN_ON(ktask_rlim_cur == 0);
--ktask_rlim_cur;
+ list_move(&work->kw_list, &ktask_free_works);
}
- list_splice(works_list, &ktask_free_works);
-
spin_unlock(&ktask_rlim_lock);
}

+static int ktask_error_cmp(void *unused, struct list_head *a,
+ struct list_head *b)
+{
+ struct ktask_work *work_a = list_entry(a, struct ktask_work, kw_list);
+ struct ktask_work *work_b = list_entry(b, struct ktask_work, kw_list);
+
+ if (work_a->kw_error_offset < work_b->kw_error_offset)
+ return -1;
+ else if (work_a->kw_error_offset > work_b->kw_error_offset)
+ return 1;
+ return 0;
+}
+
+static void ktask_undo(struct ktask_node *nodes, size_t nr_nodes,
+ struct ktask_ctl *ctl, struct list_head *works_list)
+{
+ size_t i;
+
+ for (i = 0; i < nr_nodes; ++i) {
+ struct ktask_node *kn = &nodes[i];
+ struct list_head *failed_works = &kn->kn_failed_works;
+ struct ktask_work *failed_work;
+ void *undo_pos = kn->kn_start;
+ void *undo_end;
+
+ /* Sort so the failed ranges can be checked as we go. */
+ list_sort(NULL, failed_works, ktask_error_cmp);
+
+ /* Undo completed work on this node, skipping failed ranges. */
+ while (undo_pos != kn->kn_position) {
+ failed_work = list_first_entry_or_null(failed_works,
+ struct ktask_work,
+ kw_list);
+ if (failed_work)
+ undo_end = failed_work->kw_error_start;
+ else
+ undo_end = kn->kn_position;
+
+ if (undo_pos != undo_end) {
+ ctl->kc_undo_func(undo_pos, undo_end,
+ ctl->kc_func_arg);
+ }
+
+ if (failed_work) {
+ undo_pos = failed_work->kw_error_end;
+ list_move(&failed_work->kw_list, works_list);
+ } else {
+ undo_pos = undo_end;
+ }
+ }
+ WARN_ON(!list_empty(failed_works));
+ }
+}
+
int ktask_run_numa(struct ktask_node *nodes, size_t nr_nodes,
struct ktask_ctl *ctl)
{
@@ -374,6 +452,9 @@ int ktask_run_numa(struct ktask_node *nodes, size_t nr_nodes,

for (i = 0; i < nr_nodes; ++i) {
kt.kt_total_size += nodes[i].kn_task_size;
+ nodes[i].kn_position = nodes[i].kn_start;
+ nodes[i].kn_remaining_size = nodes[i].kn_task_size;
+ INIT_LIST_HEAD(&nodes[i].kn_failed_works);
if (nodes[i].kn_task_size == 0)
kt.kt_nr_nodes_left--;

@@ -396,12 +477,16 @@ int ktask_run_numa(struct ktask_node *nodes, size_t nr_nodes,

/* Use the current thread, which saves starting a workqueue worker. */
ktask_init_work(&kw, &kt, 0, nodes[0].kn_nid);
+ INIT_LIST_HEAD(&kw.kw_list);
ktask_thread(&kw.kw_work);

/* Wait for all the jobs to finish. */
wait_for_completion(&kt.kt_ktask_done);

- ktask_fini_works(&kt, &works_list);
+ if (kt.kt_error && ctl->kc_undo_func)
+ ktask_undo(nodes, nr_nodes, ctl, &works_list);
+
+ ktask_fini_works(&kt, &kw, &works_list);
mutex_destroy(&kt.kt_mutex);

return kt.kt_error;
--
2.19.1


2018-11-05 17:01:40

by Daniel Jordan

[permalink] [raw]
Subject: [RFC PATCH v4 05/13] workqueue, ktask: renice helper threads to prevent starvation

With ktask helper threads running at MAX_NICE, it's possible for one or
more of them to begin chunks of the task and then have their CPU time
constrained by higher priority threads. The main ktask thread, running
at normal priority, may finish all available chunks of the task and then
wait on the MAX_NICE helpers to finish the last in-progress chunks, for
longer than it would have if no helpers were used.

Avoid this by having the main thread assign its priority to each
unfinished helper, one at a time, so that on a heavily loaded system
exactly one thread in a given ktask call runs at the main thread's
priority: at least one thread to ensure forward progress, and at most
one to limit excessive multithreading.

Since the workqueue interface, on which ktask is built, does not provide
access to worker threads, ktask can't adjust their priorities directly,
so add a new interface to allow a previously-queued work item to run at
a different priority than the one controlled by the corresponding
workqueue's 'nice' attribute. The worker assigned to the work item will
run the work at the given priority, temporarily overriding the worker's
priority.

The interface is flush_work_at_nice, which ensures the given work item's
assigned worker runs at the specified nice level and waits for the work
item to finish.
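
Condensed from the ktask change below (mutex handling and the undo-list
bookkeeping omitted), the main thread uses it like this while waiting on
its helpers:

    /* Will the caller's priority to one unfinished helper at a time. */
    while (!list_empty(unfinished_works)) {
            work = list_first_entry(unfinished_works, struct ktask_work,
                                    kw_list);
            if (!(work->kw_flags & KTASK_WORK_FINISHED))
                    flush_work_at_nice(&work->kw_work, task_nice(current));
            list_move(&work->kw_list, finished_works);
    }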

An alternative choice would have been to simply requeue the work item to
a pool with workers of the new priority, but this doesn't seem feasible
because a worker may have already started executing the work and there's
currently no way to interrupt it midway through. The proposed interface
solves this issue because a worker's priority can be adjusted while it's
executing the work.

TODO: flush_work_at_nice is a proof-of-concept only, and it may be
desired to have the interface set the work's nice without also waiting
for it to finish. It's implemented in the flush path for this RFC
because it was fairly simple to write ;-)

I ran tests similar to the ones in the last patch with a couple of
differences:
- The non-ktask workload uses 8 CPUs instead of 7 to compete with the
main ktask thread as well as the ktask helpers, so that when the main
thread finishes, its CPU is completely occupied by the non-ktask
workload, meaning MAX_NICE helpers can't run as often.
- The non-ktask workload starts before the ktask workload, rather
than after, to maximize the chance that it starves helpers.

Runtimes in seconds.

Case 1: Synthetic, worst-case CPU contention

ktask_test - a tight loop doing integer multiplication to max out on CPU;
used for testing only, does not appear in this series
stress-ng - cpu stressor ("-c --cpu-method ackerman --cpu-ops 1200");

             8_ktask_thrs          8_ktask_thrs
             w/o_renice  (stdev)   with_renice  (stdev)   1_ktask_thr  (stdev)
           ------------------------------------------------------------------
ktask_test      41.98    ( 0.22)      25.15     ( 2.98)      30.40     ( 0.61)
stress-ng       44.79    ( 1.11)      46.37     ( 0.69)      53.29     ( 1.91)

Without renicing, ktask_test finishes just after stress-ng does because
stress-ng has to free up CPUs before the helpers can finish (ktask_test
shows a shorter runtime than stress-ng only because it was started
later).  Renicing lets ktask_test finish 40% sooner.  Running the same
amount of work in ktask_test with 1 thread instead of 8 takes a
comparable amount of time, though longer than "with_renice", because
the MAX_NICE threads still get some CPU time and that effect adds up
over 8 threads.

stress-ng's total runtime gets a little longer going from no renicing to
renicing, as expected, because each reniced ktask thread takes more CPU
time than before when the helpers were starved.

Running with one ktask thread, stress-ng's reported walltime goes up
because that single thread interferes with fewer stress-ng threads, but
with more impact on each, causing a greater spread in the time it takes
individual stress-ng threads to finish.  Averages of the per-thread
stress-ng times for "with_renice" and "1_ktask_thr" come out roughly the
same, though: 43.81 and 43.89 respectively.  So the total runtime of
stress-ng across all threads is unaffected, but the time stress-ng takes
to finish all of its threads actually improves when the ktask_test work
is spread over more threads.

Case 2: Real-world CPU contention

ktask_vfio - VFIO page pin a 32G kvm guest
usemem - faults in 86G of anonymous THP per thread, PAGE_SIZE stride;
used to mimic the page clearing that dominates in ktask_vfio
so that usemem competes for the same system resources

             8_ktask_thrs          8_ktask_thrs
             w/o_renice  (stdev)   with_renice  (stdev)   1_ktask_thr  (stdev)
           ------------------------------------------------------------------
ktask_vfio      18.59    ( 0.19)      14.62     ( 2.03)      16.24     ( 0.90)
usemem          47.54    ( 0.89)      48.18     ( 0.77)      49.70     ( 1.20)

These results are similar to case 1's, though the differences between
times are not quite as pronounced because ktask_vfio's runtime is short
compared to usemem's.

Signed-off-by: Daniel Jordan <[email protected]>
---
include/linux/workqueue.h | 5 ++
kernel/ktask.c | 81 ++++++++++++++++++-----------
kernel/workqueue.c | 106 +++++++++++++++++++++++++++++++++++---
3 files changed, 156 insertions(+), 36 deletions(-)

diff --git a/include/linux/workqueue.h b/include/linux/workqueue.h
index 60d673e15632..d2976547c9c3 100644
--- a/include/linux/workqueue.h
+++ b/include/linux/workqueue.h
@@ -95,6 +95,10 @@ enum {
WORK_BUSY_PENDING = 1 << 0,
WORK_BUSY_RUNNING = 1 << 1,

+ /* flags for flush_work and similar functions */
+ WORK_FLUSH_FROM_CANCEL = 1 << 0,
+ WORK_FLUSH_AT_NICE = 1 << 1,
+
/* maximum string length for set_worker_desc() */
WORKER_DESC_LEN = 24,
};
@@ -477,6 +481,7 @@ extern int schedule_on_each_cpu(work_func_t func);
int execute_in_process_context(work_func_t fn, struct execute_work *);

extern bool flush_work(struct work_struct *work);
+extern bool flush_work_at_nice(struct work_struct *work, long nice);
extern bool cancel_work_sync(struct work_struct *work);

extern bool flush_delayed_work(struct delayed_work *dwork);
diff --git a/kernel/ktask.c b/kernel/ktask.c
index 72293a0f50c3..9d2727ce430c 100644
--- a/kernel/ktask.c
+++ b/kernel/ktask.c
@@ -16,7 +16,6 @@

#include <linux/cpu.h>
#include <linux/cpumask.h>
-#include <linux/completion.h>
#include <linux/init.h>
#include <linux/kernel.h>
#include <linux/list.h>
@@ -24,6 +23,7 @@
#include <linux/mutex.h>
#include <linux/printk.h>
#include <linux/random.h>
+#include <linux/sched.h>
#include <linux/slab.h>
#include <linux/spinlock.h>
#include <linux/workqueue.h>
@@ -41,6 +41,11 @@ static size_t *ktask_rlim_node_max;
#define KTASK_CPUFRAC_NUMER 4
#define KTASK_CPUFRAC_DENOM 5

+enum ktask_work_flags {
+ KTASK_WORK_FINISHED = 1,
+ KTASK_WORK_UNDO = 2,
+};
+
/* Used to pass ktask data to the workqueue API. */
struct ktask_work {
struct work_struct kw_work;
@@ -53,6 +58,7 @@ struct ktask_work {
void *kw_error_end;
/* ktask_free_works, kn_failed_works linkage */
struct list_head kw_list;
+ enum ktask_work_flags kw_flags;
};

static LIST_HEAD(ktask_free_works);
@@ -68,10 +74,7 @@ struct ktask_task {
struct ktask_node *kt_nodes;
size_t kt_nr_nodes;
size_t kt_nr_nodes_left;
- size_t kt_nworks;
- size_t kt_nworks_fini;
int kt_error; /* first error from thread_func */
- struct completion kt_ktask_done;
};

/*
@@ -97,6 +100,7 @@ static void ktask_init_work(struct ktask_work *kw, struct ktask_task *kt,
kw->kw_task = kt;
kw->kw_ktask_node_i = ktask_node_i;
kw->kw_queue_nid = queue_nid;
+ kw->kw_flags = 0;
}

static void ktask_queue_work(struct ktask_work *kw)
@@ -171,7 +175,6 @@ static void ktask_thread(struct work_struct *work)
struct ktask_task *kt = kw->kw_task;
struct ktask_ctl *kc = &kt->kt_ctl;
struct ktask_node *kn = &kt->kt_nodes[kw->kw_ktask_node_i];
- bool done;

mutex_lock(&kt->kt_mutex);

@@ -239,6 +242,7 @@ static void ktask_thread(struct work_struct *work)
* about where this thread failed for ktask_undo.
*/
if (kc->kc_undo_func) {
+ kw->kw_flags |= KTASK_WORK_UNDO;
list_move(&kw->kw_list, &kn->kn_failed_works);
kw->kw_error_start = position;
kw->kw_error_offset = position_offset;
@@ -250,13 +254,8 @@ static void ktask_thread(struct work_struct *work)
WARN_ON(kt->kt_nr_nodes_left > 0 &&
kt->kt_error == KTASK_RETURN_SUCCESS);

- ++kt->kt_nworks_fini;
- WARN_ON(kt->kt_nworks_fini > kt->kt_nworks);
- done = (kt->kt_nworks_fini == kt->kt_nworks);
+ kw->kw_flags |= KTASK_WORK_FINISHED;
mutex_unlock(&kt->kt_mutex);
-
- if (done)
- complete(&kt->kt_ktask_done);
}

/*
@@ -294,7 +293,7 @@ static size_t ktask_chunk_size(size_t task_size, size_t min_chunk_size,
*/
static size_t ktask_init_works(struct ktask_node *nodes, size_t nr_nodes,
struct ktask_task *kt,
- struct list_head *works_list)
+ struct list_head *unfinished_works)
{
size_t i, nr_works, nr_works_check;
size_t min_chunk_size = kt->kt_ctl.kc_min_chunk_size;
@@ -342,7 +341,7 @@ static size_t ktask_init_works(struct ktask_node *nodes, size_t nr_nodes,
WARN_ON(list_empty(&ktask_free_works));
kw = list_first_entry(&ktask_free_works, struct ktask_work,
kw_list);
- list_move_tail(&kw->kw_list, works_list);
+ list_move_tail(&kw->kw_list, unfinished_works);
ktask_init_work(kw, kt, ktask_node_i, queue_nid);

++ktask_rlim_cur;
@@ -355,14 +354,14 @@ static size_t ktask_init_works(struct ktask_node *nodes, size_t nr_nodes,

static void ktask_fini_works(struct ktask_task *kt,
struct ktask_work *stack_work,
- struct list_head *works_list)
+ struct list_head *finished_works)
{
struct ktask_work *work, *next;

spin_lock(&ktask_rlim_lock);

/* Put the works back on the free list, adjusting rlimits. */
- list_for_each_entry_safe(work, next, works_list, kw_list) {
+ list_for_each_entry_safe(work, next, finished_works, kw_list) {
if (work == stack_work) {
/* On this thread's stack, so not subject to rlimits. */
list_del(&work->kw_list);
@@ -393,7 +392,7 @@ static int ktask_error_cmp(void *unused, struct list_head *a,
}

static void ktask_undo(struct ktask_node *nodes, size_t nr_nodes,
- struct ktask_ctl *ctl, struct list_head *works_list)
+ struct ktask_ctl *ctl, struct list_head *finished_works)
{
size_t i;

@@ -424,7 +423,8 @@ static void ktask_undo(struct ktask_node *nodes, size_t nr_nodes,

if (failed_work) {
undo_pos = failed_work->kw_error_end;
- list_move(&failed_work->kw_list, works_list);
+ list_move(&failed_work->kw_list,
+ finished_works);
} else {
undo_pos = undo_end;
}
@@ -433,20 +433,46 @@ static void ktask_undo(struct ktask_node *nodes, size_t nr_nodes,
}
}

+static void ktask_wait_for_completion(struct ktask_task *kt,
+ struct list_head *unfinished_works,
+ struct list_head *finished_works)
+{
+ struct ktask_work *work;
+
+ mutex_lock(&kt->kt_mutex);
+ while (!list_empty(unfinished_works)) {
+ work = list_first_entry(unfinished_works, struct ktask_work,
+ kw_list);
+ if (!(work->kw_flags & KTASK_WORK_FINISHED)) {
+ mutex_unlock(&kt->kt_mutex);
+ flush_work_at_nice(&work->kw_work, task_nice(current));
+ mutex_lock(&kt->kt_mutex);
+ WARN_ON_ONCE(!(work->kw_flags & KTASK_WORK_FINISHED));
+ }
+ /*
+ * Leave works used in ktask_undo on kn->kn_failed_works.
+ * ktask_undo will move them to finished_works.
+ */
+ if (!(work->kw_flags & KTASK_WORK_UNDO))
+ list_move(&work->kw_list, finished_works);
+ }
+ mutex_unlock(&kt->kt_mutex);
+}
+
int ktask_run_numa(struct ktask_node *nodes, size_t nr_nodes,
struct ktask_ctl *ctl)
{
- size_t i;
+ size_t i, nr_works;
struct ktask_work kw;
struct ktask_work *work;
- LIST_HEAD(works_list);
+ LIST_HEAD(unfinished_works);
+ LIST_HEAD(finished_works);
struct ktask_task kt = {
.kt_ctl = *ctl,
.kt_total_size = 0,
.kt_nodes = nodes,
.kt_nr_nodes = nr_nodes,
.kt_nr_nodes_left = nr_nodes,
- .kt_nworks_fini = 0,
.kt_error = KTASK_RETURN_SUCCESS,
};

@@ -465,14 +491,12 @@ int ktask_run_numa(struct ktask_node *nodes, size_t nr_nodes,
return KTASK_RETURN_SUCCESS;

mutex_init(&kt.kt_mutex);
- init_completion(&kt.kt_ktask_done);

- kt.kt_nworks = ktask_init_works(nodes, nr_nodes, &kt, &works_list);
+ nr_works = ktask_init_works(nodes, nr_nodes, &kt, &unfinished_works);
kt.kt_chunk_size = ktask_chunk_size(kt.kt_total_size,
- ctl->kc_min_chunk_size,
- kt.kt_nworks);
+ ctl->kc_min_chunk_size, nr_works);

- list_for_each_entry(work, &works_list, kw_list)
+ list_for_each_entry(work, &unfinished_works, kw_list)
ktask_queue_work(work);

/* Use the current thread, which saves starting a workqueue worker. */
@@ -480,13 +504,12 @@ int ktask_run_numa(struct ktask_node *nodes, size_t nr_nodes,
INIT_LIST_HEAD(&kw.kw_list);
ktask_thread(&kw.kw_work);

- /* Wait for all the jobs to finish. */
- wait_for_completion(&kt.kt_ktask_done);
+ ktask_wait_for_completion(&kt, &unfinished_works, &finished_works);

if (kt.kt_error && ctl->kc_undo_func)
- ktask_undo(nodes, nr_nodes, ctl, &works_list);
+ ktask_undo(nodes, nr_nodes, ctl, &finished_works);

- ktask_fini_works(&kt, &kw, &works_list);
+ ktask_fini_works(&kt, &kw, &finished_works);
mutex_destroy(&kt.kt_mutex);

return kt.kt_error;
diff --git a/kernel/workqueue.c b/kernel/workqueue.c
index 0280deac392e..9fbae3fc9cca 100644
--- a/kernel/workqueue.c
+++ b/kernel/workqueue.c
@@ -79,6 +79,7 @@ enum {
WORKER_CPU_INTENSIVE = 1 << 6, /* cpu intensive */
WORKER_UNBOUND = 1 << 7, /* worker is unbound */
WORKER_REBOUND = 1 << 8, /* worker was rebound */
+ WORKER_NICED = 1 << 9, /* worker's nice was adjusted */

WORKER_NOT_RUNNING = WORKER_PREP | WORKER_CPU_INTENSIVE |
WORKER_UNBOUND | WORKER_REBOUND,
@@ -2184,6 +2185,18 @@ __acquires(&pool->lock)
if (unlikely(cpu_intensive))
worker_clr_flags(worker, WORKER_CPU_INTENSIVE);

+ /*
+ * worker's nice level was adjusted (see flush_work_at_nice). Use the
+ * work's color to distinguish between the work that sets the nice
+ * level (== NO_COLOR) and the work for which the adjustment was made
+ * (!= NO_COLOR) to avoid prematurely restoring the nice level.
+ */
+ if (unlikely(worker->flags & WORKER_NICED &&
+ work_color != WORK_NO_COLOR)) {
+ set_user_nice(worker->task, worker->pool->attrs->nice);
+ worker_clr_flags(worker, WORKER_NICED);
+ }
+
/* we're done with it, release */
hash_del(&worker->hentry);
worker->current_work = NULL;
@@ -2846,8 +2859,53 @@ void drain_workqueue(struct workqueue_struct *wq)
}
EXPORT_SYMBOL_GPL(drain_workqueue);

+struct nice_work {
+ struct work_struct work;
+ long nice;
+};
+
+static void nice_work_func(struct work_struct *work)
+{
+ struct nice_work *nw = container_of(work, struct nice_work, work);
+ struct worker *worker = current_wq_worker();
+
+ if (WARN_ON_ONCE(!worker))
+ return;
+
+ set_user_nice(current, nw->nice);
+ worker->flags |= WORKER_NICED;
+}
+
+/**
+ * insert_nice_work - insert a nice_work into a pwq
+ * @pwq: pwq to insert nice_work into
+ * @nice_work: nice_work to insert
+ * @target: target work to attach @nice_work to
+ *
+ * @nice_work is linked to @target such that @target starts executing only
+ * after @nice_work finishes execution.
+ *
+ * @nice_work's only job is to ensure @target's assigned worker runs at the
+ * nice level contained in @nice_work.
+ *
+ * CONTEXT:
+ * spin_lock_irq(pool->lock).
+ */
+static void insert_nice_work(struct pool_workqueue *pwq,
+ struct nice_work *nice_work,
+ struct work_struct *target)
+{
+ /* see comment above similar code in insert_wq_barrier */
+ INIT_WORK_ONSTACK(&nice_work->work, nice_work_func);
+ __set_bit(WORK_STRUCT_PENDING_BIT, work_data_bits(&nice_work->work));
+
+ debug_work_activate(&nice_work->work);
+ insert_work(pwq, &nice_work->work, &target->entry,
+ work_color_to_flags(WORK_NO_COLOR) | WORK_STRUCT_LINKED);
+}
+
static bool start_flush_work(struct work_struct *work, struct wq_barrier *barr,
- bool from_cancel)
+ struct nice_work *nice_work, int flags)
{
struct worker *worker = NULL;
struct worker_pool *pool;
@@ -2868,11 +2926,19 @@ static bool start_flush_work(struct work_struct *work, struct wq_barrier *barr,
if (pwq) {
if (unlikely(pwq->pool != pool))
goto already_gone;
+
+ /* not yet started, insert linked work before work */
+ if (unlikely(flags & WORK_FLUSH_AT_NICE))
+ insert_nice_work(pwq, nice_work, work);
} else {
worker = find_worker_executing_work(pool, work);
if (!worker)
goto already_gone;
pwq = worker->current_pwq;
+ if (unlikely(flags & WORK_FLUSH_AT_NICE)) {
+ set_user_nice(worker->task, nice_work->nice);
+ worker->flags |= WORKER_NICED;
+ }
}

check_flush_dependency(pwq->wq, work);
@@ -2889,7 +2955,7 @@ static bool start_flush_work(struct work_struct *work, struct wq_barrier *barr,
* workqueues the deadlock happens when the rescuer stalls, blocking
* forward progress.
*/
- if (!from_cancel &&
+ if (!(flags & WORK_FLUSH_FROM_CANCEL) &&
(pwq->wq->saved_max_active == 1 || pwq->wq->rescuer)) {
lock_map_acquire(&pwq->wq->lockdep_map);
lock_map_release(&pwq->wq->lockdep_map);
@@ -2901,19 +2967,23 @@ static bool start_flush_work(struct work_struct *work, struct wq_barrier *barr,
return false;
}

-static bool __flush_work(struct work_struct *work, bool from_cancel)
+static bool __flush_work(struct work_struct *work, int flags, long nice)
{
struct wq_barrier barr;
+ struct nice_work nice_work;

if (WARN_ON(!wq_online))
return false;

- if (!from_cancel) {
+ if (!(flags & WORK_FLUSH_FROM_CANCEL)) {
lock_map_acquire(&work->lockdep_map);
lock_map_release(&work->lockdep_map);
}

- if (start_flush_work(work, &barr, from_cancel)) {
+ if (unlikely(flags & WORK_FLUSH_AT_NICE))
+ nice_work.nice = nice;
+
+ if (start_flush_work(work, &barr, &nice_work, flags)) {
wait_for_completion(&barr.done);
destroy_work_on_stack(&barr.work);
return true;
@@ -2935,10 +3005,32 @@ static bool __flush_work(struct work_struct *work, bool from_cancel)
*/
bool flush_work(struct work_struct *work)
{
- return __flush_work(work, false);
+ return __flush_work(work, 0, 0);
}
EXPORT_SYMBOL_GPL(flush_work);

+/**
+ * flush_work_at_nice - set a work's nice level and wait for it to finish
+ * @work: the target work
+ * @nice: nice level @work's assigned worker should run at
+ *
+ * Makes @work's assigned worker run at @nice for the duration of @work.
+ * Waits until @work has finished execution. @work is guaranteed to be idle
+ * on return if it hasn't been requeued since flush started.
+ *
+ * Avoids priority inversion where a high priority task queues @work on a
+ * workqueue with low priority workers and may wait indefinitely for @work's
+ * completion. That task can will its priority to @work.
+ *
+ * Return:
+ * %true if flush_work_at_nice() waited for the work to finish execution,
+ * %false if it was already idle.
+ */
+bool flush_work_at_nice(struct work_struct *work, long nice)
+{
+ return __flush_work(work, WORK_FLUSH_AT_NICE, nice);
+}
+
struct cwt_wait {
wait_queue_entry_t wait;
struct work_struct *work;
@@ -3001,7 +3093,7 @@ static bool __cancel_work_timer(struct work_struct *work, bool is_dwork)
* isn't executing.
*/
if (wq_online)
- __flush_work(work, true);
+ __flush_work(work, WORK_FLUSH_FROM_CANCEL, 0);

clear_work_data(work);

--
2.19.1


2018-11-05 17:01:45

by Daniel Jordan

[permalink] [raw]
Subject: [RFC PATCH v4 01/13] ktask: add documentation

Motivates and explains the ktask API for kernel clients.

Signed-off-by: Daniel Jordan <[email protected]>
---
Documentation/core-api/index.rst | 1 +
Documentation/core-api/ktask.rst | 213 +++++++++++++++++++++++++++++++
2 files changed, 214 insertions(+)
create mode 100644 Documentation/core-api/ktask.rst

diff --git a/Documentation/core-api/index.rst b/Documentation/core-api/index.rst
index 3adee82be311..c143a280a5b1 100644
--- a/Documentation/core-api/index.rst
+++ b/Documentation/core-api/index.rst
@@ -18,6 +18,7 @@ Core utilities
refcount-vs-atomic
cpu_hotplug
idr
+ ktask
local_ops
workqueue
genericirq
diff --git a/Documentation/core-api/ktask.rst b/Documentation/core-api/ktask.rst
new file mode 100644
index 000000000000..c3c00e1f802f
--- /dev/null
+++ b/Documentation/core-api/ktask.rst
@@ -0,0 +1,213 @@
+.. SPDX-License-Identifier: GPL-2.0+
+
+============================================
+ktask: parallelize CPU-intensive kernel work
+============================================
+
+:Date: November, 2018
+:Author: Daniel Jordan <[email protected]>
+
+
+Introduction
+============
+
+ktask is a generic framework for parallelizing CPU-intensive work in the
+kernel. The intended use is for big machines that can use their CPU power to
+speed up large tasks that can't otherwise be multithreaded in userland. The
+API is generic enough to add concurrency to many different kinds of tasks--for
+example, page clearing over an address range or freeing a list of pages--and
+aims to save its clients the trouble of splitting up the work, choosing the
+number of helper threads to use, maintaining an efficient concurrency level,
+starting these threads, and load balancing the work between them.
+
+
+Motivation
+==========
+
+A single CPU can spend an excessive amount of time in the kernel operating on
+large amounts of data. Often these situations arise during initialization- and
+destruction-related tasks, where the data involved scales with system size.
+These long-running jobs can slow startup and shutdown of applications and the
+system itself while extra CPUs sit idle.
+
+To ensure that applications and the kernel continue to perform well as core
+counts and memory sizes increase, the kernel harnesses these idle CPUs to
+complete such jobs more quickly.
+
+For example, when booting a large NUMA machine, ktask uses additional CPUs that
+would otherwise be idle until the machine is fully up to avoid a needless
+bottleneck during system boot and allow the kernel to take advantage of unused
+memory bandwidth. Similarly, when starting a large VM using VFIO, ktask takes
+advantage of the VM's idle CPUs during VFIO page pinning rather than have the
+VM's boot blocked on one thread doing all the work.
+
+ktask is not a substitute for single-threaded optimization. However, there is
+a point where a single CPU hits a wall despite performance tuning, so
+parallelize!
+
+
+Concept
+=======
+
+ktask is built on unbound workqueues to take advantage of the thread management
+facilities it provides: creation, destruction, flushing, priority setting, and
+NUMA affinity.
+
+A little terminology up front: A 'task' is the total work there is to do and a
+'chunk' is a unit of work given to a thread.
+
+To complete a task using the ktask framework, a client provides a thread
+function that is responsible for completing one chunk. The thread function is
+defined in a standard way, with start and end arguments that delimit the chunk
+as well as an argument that the client uses to pass data specific to the task.
+
+In addition, the client supplies an object representing the start of the task
+and an iterator function that knows how to advance some number of units in the
+task to yield another object representing the new task position. The framework
+uses the start object and iterator internally to divide the task into chunks.
+
+Finally, the client passes the total task size and a minimum chunk size to
+indicate the minimum amount of work that's appropriate to do in one chunk. The
+sizes are given in task-specific units (e.g. pages, inodes, bytes). The
+framework uses these sizes, along with the number of online CPUs and an
+internal maximum number of threads, to decide how many threads to start and how
+many chunks to divide the task into.
+
+For example, consider the task of clearing a gigantic page. This used to be
+done in a single thread with a for loop that calls a page clearing function for
+each constituent base page. To parallelize with ktask, the client first moves
+the for loop to the thread function, adapting it to operate on the range passed
+to the function. In this simple case, the thread function's start and end
+arguments are just addresses delimiting the portion of the gigantic page to
+clear. Then, where the for loop used to be, the client calls into ktask with
+the start address of the gigantic page, the total size of the gigantic page,
+and the thread function. Internally, ktask will divide the address range into
+an appropriate number of chunks and start an appropriate number of threads to
+complete these chunks.
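+
+A sketch of the resulting call might look like this (illustrative only; the
+interface section below is authoritative, and clear_subpages() stands in for
+the original per-page loop)::
+
+    static int clear_chunk(void *start, void *end, void *arg)
+    {
+            clear_subpages(start, end);     /* the old for loop's body */
+            return KTASK_RETURN_SUCCESS;
+    }
+
+    void clear_gigantic_page_by_chunks(void *base_addr, size_t bytes, int nid)
+    {
+            struct ktask_node kn = { .kn_start = base_addr,
+                                     .kn_task_size = bytes,
+                                     .kn_nid = nid };
+            DEFINE_KTASK_CTL(ctl, clear_chunk, NULL, PAGE_SIZE);
+
+            ktask_run_numa(&kn, 1, &ctl);
+    }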
+
+
+Configuration
+=============
+
+To use ktask, configure the kernel with CONFIG_KTASK=y.
+
+If CONFIG_KTASK=n, calls to the ktask API are simply #define'd to run the
+thread function that the client provides so that the task is completed without
+concurrency in the current thread.
+
+
+Interface
+=========
+
+.. kernel-doc:: include/linux/ktask.h
+
+
+Resource Limits
+===============
+
+ktask has resource limits on the number of work items it sends to workqueues.
+In ktask, a workqueue item is a thread that runs chunks of the task until the
+task is finished.
+
+These limits support the different ways ktask uses workqueues:
+ - ktask_run to run threads on the calling thread's node.
+ - ktask_run_numa to run threads on the node(s) specified.
+ - ktask_run_numa with nid=NUMA_NO_NODE to run threads on any node in the
+ system.
+
+To support these different ways of queueing work while maintaining an efficient
+concurrency level, we need both system-wide and per-node limits on the number
+of threads. Without per-node limits, a node might become oversubscribed
+despite ktask staying within the system-wide limit, and without a system-wide
+limit, we can't properly account for work that can run on any node.
+
+The system-wide limit is based on the total number of CPUs, and the per-node
+limit on the CPU count for each node. A per-node work item counts against the
+system-wide limit. Workqueue's max_active can't accommodate both types of
+limit, no matter how many workqueues are used, so ktask implements its own.
+
+If a per-node limit is reached, the work item is allowed to run anywhere on the
+machine to avoid overwhelming the node. If the global limit is also reached,
+ktask won't queue additional work items until we fall below the limit again.
+
+These limits apply only to workqueue items--that is, helper threads beyond the
+one starting the task. That way, one thread per task is always allowed to run.
+
+
+Scheduler Interaction
+=====================
+
+Even within the resource limits, ktask must take care to run a number of
+threads appropriate for the system's current CPU load. Under high CPU usage,
+starting excessive helper threads may disturb other tasks, unfairly taking CPU
+time away from them for the sake of an optimized kernel code path.
+
+ktask plays nicely in this case by setting helper threads to the lowest
+scheduling priority on the system (MAX_NICE). This way, helpers' CPU time is
+appropriately throttled on a busy system and other tasks are not disturbed.
+
+The main thread initiating the task remains at its original priority so that it
+still makes progress on a busy system.
+
+It is possible for a helper thread to start running and then be forced off-CPU
+by a higher priority thread. With the helper's CPU time curtailed by MAX_NICE,
+the main thread may wait longer for the task to finish than it would have had
+it not started any helpers, so to ensure forward progress at a single-threaded
+pace, once the main thread is finished with all outstanding work in the task,
+the main thread wills its priority to one helper thread at a time. At least
+one thread will then always be running at the priority of the calling thread.
+
+
+Cgroup Awareness
+================
+
+Given the potentially large amount of CPU time ktask threads may consume, they
+should be aware of the cgroup of the task that called into ktask and
+appropriately throttled.
+
+TODO: Implement cgroup-awareness in unbound workqueues.
+
+
+Power Management
+================
+
+Starting additional helper threads may cause the system to consume more energy,
+which is undesirable on energy-conscious devices. Therefore ktask needs to be
+aware of cpufreq policies and scaling governors.
+
+If an energy-conscious policy is in use (e.g. powersave, conservative) on any
+part of the system, that is a signal that the user has strong power management
+preferences, in which case ktask is disabled.
+
+TODO: Implement this.
+
+
+Backward Compatibility
+======================
+
+ktask is written so that existing calls to the API will be backwards compatible
+should the API gain new features in the future. This is accomplished by
+restricting API changes to members of struct ktask_ctl and having clients make
+an opaque initialization call (DEFINE_KTASK_CTL). This initialization can then
+be modified to include any new arguments so that existing call sites stay the
+same.
+
+
+Error Handling
+==============
+
+Calls to ktask fail only if the provided thread function fails. In particular,
+ktask avoids allocating memory internally during a task, so it's safe to use in
+sensitive contexts.
+
+Tasks can fail midway through their work. To recover, the finished chunks of
+work need to be undone in a task-specific way, so ktask allows clients to pass
+an "undo" callback that is responsible for undoing one chunk of work. To avoid
+multiple levels of error handling, this "undo" callback should not be allowed
+to fail. For simplicity and because it's a slow path, undoing is not
+multithreaded.
+
+Each call to ktask_run and ktask_run_numa returns a single value,
+KTASK_RETURN_SUCCESS or a client-specific value. Since threads can fail for
+different reasons, however, ktask may need the ability to return
+thread-specific error information. This can be added later if needed.
--
2.19.1


2018-11-05 17:02:25

by Daniel Jordan

[permalink] [raw]
Subject: [RFC PATCH v4 07/13] mm: change locked_vm's type from unsigned long to atomic_long_t

Currently, mmap_sem must be held as writer to modify the locked_vm field
in mm_struct.

This creates a bottleneck when multithreading VFIO page pinning because
each thread holds the mmap_sem as reader for the majority of the pinning
time but also takes mmap_sem as writer regularly, for short times, when
modifying locked_vm.

The problem gets worse when other workloads compete for CPU with ktask
threads doing page pinning because the other workloads force ktask
threads that hold mmap_sem as writer off the CPU, blocking ktask threads
trying to get mmap_sem as reader for an excessively long time (the
mmap_sem reader wait time grows linearly with the thread count).

Requiring mmap_sem for locked_vm also abuses mmap_sem by making it
protect data that could be synchronized separately.

So, decouple locked_vm from mmap_sem by making locked_vm an
atomic_long_t. locked_vm's old type was unsigned long, and changing it
to a signed type halves its capacity, but that's only a concern for
32-bit systems, where LONG_MAX * PAGE_SIZE still comes to 8T on x86, so
there's headroom.
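
The conversion at each call site is mechanical; condensed from the hunks
below, the accounting pattern becomes:

    /* Read the counter once for the limit check... */
    locked = atomic_long_read(&mm->locked_vm) + npages;
    lock_limit = rlimit(RLIMIT_MEMLOCK) >> PAGE_SHIFT;
    if (locked > lock_limit && !capable(CAP_IPC_LOCK))
            ret = -ENOMEM;
    else
            /* ...and update it with an atomic op instead of a plain +=. */
            atomic_long_add(npages, &mm->locked_vm);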

Now that mmap_sem is not taken as writer here, ktask threads holding
mmap_sem as reader can run more often. Performance results appear later
in the series.

On powerpc, this was only cross-compile tested.

[XXX Can send separately.]

Signed-off-by: Daniel Jordan <[email protected]>
---
arch/powerpc/kvm/book3s_64_vio.c | 15 ++++++++-------
arch/powerpc/mm/mmu_context_iommu.c | 16 ++++++++--------
drivers/fpga/dfl-afu-dma-region.c | 16 +++++++++-------
drivers/vfio/vfio_iommu_spapr_tce.c | 14 +++++++-------
drivers/vfio/vfio_iommu_type1.c | 11 ++++++-----
fs/proc/task_mmu.c | 2 +-
include/linux/mm_types.h | 2 +-
kernel/fork.c | 2 +-
mm/debug.c | 3 ++-
mm/mlock.c | 4 ++--
mm/mmap.c | 18 +++++++++---------
mm/mremap.c | 6 +++---
12 files changed, 57 insertions(+), 52 deletions(-)

diff --git a/arch/powerpc/kvm/book3s_64_vio.c b/arch/powerpc/kvm/book3s_64_vio.c
index 62a8d03ba7e9..b5637de6dde5 100644
--- a/arch/powerpc/kvm/book3s_64_vio.c
+++ b/arch/powerpc/kvm/book3s_64_vio.c
@@ -58,33 +58,34 @@ static unsigned long kvmppc_stt_pages(unsigned long tce_pages)

static long kvmppc_account_memlimit(unsigned long stt_pages, bool inc)
{
- long ret = 0;
+ long locked_vm, ret = 0;

if (!current || !current->mm)
return ret; /* process exited */

down_write(&current->mm->mmap_sem);

+ locked_vm = atomic_long_read(&current->mm->locked_vm);
if (inc) {
unsigned long locked, lock_limit;

- locked = current->mm->locked_vm + stt_pages;
+ locked = locked_vm + stt_pages;
lock_limit = rlimit(RLIMIT_MEMLOCK) >> PAGE_SHIFT;
if (locked > lock_limit && !capable(CAP_IPC_LOCK))
ret = -ENOMEM;
else
- current->mm->locked_vm += stt_pages;
+ atomic_long_add(stt_pages, &current->mm->locked_vm);
} else {
- if (WARN_ON_ONCE(stt_pages > current->mm->locked_vm))
- stt_pages = current->mm->locked_vm;
+ if (WARN_ON_ONCE(stt_pages > locked_vm))
+ stt_pages = locked_vm;

- current->mm->locked_vm -= stt_pages;
+ atomic_long_sub(stt_pages, &current->mm->locked_vm);
}

pr_debug("[%d] RLIMIT_MEMLOCK KVM %c%ld %ld/%ld%s\n", current->pid,
inc ? '+' : '-',
stt_pages << PAGE_SHIFT,
- current->mm->locked_vm << PAGE_SHIFT,
+ atomic_long_read(&current->mm->locked_vm) << PAGE_SHIFT,
rlimit(RLIMIT_MEMLOCK),
ret ? " - exceeded" : "");

diff --git a/arch/powerpc/mm/mmu_context_iommu.c b/arch/powerpc/mm/mmu_context_iommu.c
index 56c2234cc6ae..a8f66975bf53 100644
--- a/arch/powerpc/mm/mmu_context_iommu.c
+++ b/arch/powerpc/mm/mmu_context_iommu.c
@@ -41,31 +41,31 @@ struct mm_iommu_table_group_mem_t {
static long mm_iommu_adjust_locked_vm(struct mm_struct *mm,
unsigned long npages, bool incr)
{
- long ret = 0, locked, lock_limit;
+ long ret = 0, locked, lock_limit, locked_vm;

if (!npages)
return 0;

down_write(&mm->mmap_sem);
-
+ locked_vm = atomic_long_read(&mm->locked_vm);
if (incr) {
- locked = mm->locked_vm + npages;
+ locked = locked_vm + npages;
lock_limit = rlimit(RLIMIT_MEMLOCK) >> PAGE_SHIFT;
if (locked > lock_limit && !capable(CAP_IPC_LOCK))
ret = -ENOMEM;
else
- mm->locked_vm += npages;
+ atomic_long_add(npages, &mm->locked_vm);
} else {
- if (WARN_ON_ONCE(npages > mm->locked_vm))
- npages = mm->locked_vm;
- mm->locked_vm -= npages;
+ if (WARN_ON_ONCE(npages > locked_vm))
+ npages = locked_vm;
+ atomic_long_sub(npages, &mm->locked_vm);
}

pr_debug("[%d] RLIMIT_MEMLOCK HASH64 %c%ld %ld/%ld\n",
current ? current->pid : 0,
incr ? '+' : '-',
npages << PAGE_SHIFT,
- mm->locked_vm << PAGE_SHIFT,
+ atomic_long_read(&mm->locked_vm) << PAGE_SHIFT,
rlimit(RLIMIT_MEMLOCK));
up_write(&mm->mmap_sem);

diff --git a/drivers/fpga/dfl-afu-dma-region.c b/drivers/fpga/dfl-afu-dma-region.c
index 025aba3ea76c..1a7939c511a0 100644
--- a/drivers/fpga/dfl-afu-dma-region.c
+++ b/drivers/fpga/dfl-afu-dma-region.c
@@ -45,6 +45,7 @@ void afu_dma_region_init(struct dfl_feature_platform_data *pdata)
static int afu_dma_adjust_locked_vm(struct device *dev, long npages, bool incr)
{
unsigned long locked, lock_limit;
+ long locked_vm;
int ret = 0;

/* the task is exiting. */
@@ -53,24 +54,25 @@ static int afu_dma_adjust_locked_vm(struct device *dev, long npages, bool incr)

down_write(&current->mm->mmap_sem);

+ locked_vm = atomic_long_read(&current->mm->locked_vm);
if (incr) {
- locked = current->mm->locked_vm + npages;
+ locked = locked_vm + npages;
lock_limit = rlimit(RLIMIT_MEMLOCK) >> PAGE_SHIFT;

if (locked > lock_limit && !capable(CAP_IPC_LOCK))
ret = -ENOMEM;
else
- current->mm->locked_vm += npages;
+ atomic_long_add(npages, &current->mm->locked_vm);
} else {
- if (WARN_ON_ONCE(npages > current->mm->locked_vm))
- npages = current->mm->locked_vm;
- current->mm->locked_vm -= npages;
+ if (WARN_ON_ONCE(npages > locked_vm))
+ npages = locked_vm;
+ atomic_long_sub(npages, &current->mm->locked_vm);
}

dev_dbg(dev, "[%d] RLIMIT_MEMLOCK %c%ld %ld/%ld%s\n", current->pid,
incr ? '+' : '-', npages << PAGE_SHIFT,
- current->mm->locked_vm << PAGE_SHIFT, rlimit(RLIMIT_MEMLOCK),
- ret ? "- exceeded" : "");
+ atomic_long_read(&current->mm->locked_vm) << PAGE_SHIFT,
+ rlimit(RLIMIT_MEMLOCK), ret ? "- exceeded" : "");

up_write(&current->mm->mmap_sem);

diff --git a/drivers/vfio/vfio_iommu_spapr_tce.c b/drivers/vfio/vfio_iommu_spapr_tce.c
index b30926e11d87..c834bc2d1e6b 100644
--- a/drivers/vfio/vfio_iommu_spapr_tce.c
+++ b/drivers/vfio/vfio_iommu_spapr_tce.c
@@ -45,16 +45,16 @@ static long try_increment_locked_vm(struct mm_struct *mm, long npages)
return 0;

down_write(&mm->mmap_sem);
- locked = mm->locked_vm + npages;
+ locked = atomic_long_read(&mm->locked_vm) + npages;
lock_limit = rlimit(RLIMIT_MEMLOCK) >> PAGE_SHIFT;
if (locked > lock_limit && !capable(CAP_IPC_LOCK))
ret = -ENOMEM;
else
- mm->locked_vm += npages;
+ atomic_long_add(npages, &mm->locked_vm);

pr_debug("[%d] RLIMIT_MEMLOCK +%ld %ld/%ld%s\n", current->pid,
npages << PAGE_SHIFT,
- mm->locked_vm << PAGE_SHIFT,
+ atomic_long_read(&mm->locked_vm) << PAGE_SHIFT,
rlimit(RLIMIT_MEMLOCK),
ret ? " - exceeded" : "");

@@ -69,12 +69,12 @@ static void decrement_locked_vm(struct mm_struct *mm, long npages)
return;

down_write(&mm->mmap_sem);
- if (WARN_ON_ONCE(npages > mm->locked_vm))
- npages = mm->locked_vm;
- mm->locked_vm -= npages;
+ if (WARN_ON_ONCE(npages > atomic_long_read(&mm->locked_vm)))
+ npages = atomic_long_read(&mm->locked_vm);
+ atomic_long_sub(npages, &mm->locked_vm);
pr_debug("[%d] RLIMIT_MEMLOCK -%ld %ld/%ld\n", current->pid,
npages << PAGE_SHIFT,
- mm->locked_vm << PAGE_SHIFT,
+ atomic_long_read(&mm->locked_vm) << PAGE_SHIFT,
rlimit(RLIMIT_MEMLOCK));
up_write(&mm->mmap_sem);
}
diff --git a/drivers/vfio/vfio_iommu_type1.c b/drivers/vfio/vfio_iommu_type1.c
index e7cfbf0c8071..f307dc9d5e19 100644
--- a/drivers/vfio/vfio_iommu_type1.c
+++ b/drivers/vfio/vfio_iommu_type1.c
@@ -276,13 +276,13 @@ static int vfio_lock_acct(struct vfio_dma *dma, long npage, bool async)
limit = task_rlimit(dma->task,
RLIMIT_MEMLOCK) >> PAGE_SHIFT;

- if (mm->locked_vm + npage > limit)
+ if (atomic_long_read(&mm->locked_vm) + npage > limit)
ret = -ENOMEM;
}
}

if (!ret)
- mm->locked_vm += npage;
+ atomic_long_add(npage, &mm->locked_vm);

up_write(&mm->mmap_sem);
}
@@ -419,7 +419,8 @@ static long vfio_pin_pages_remote(struct vfio_dma *dma, unsigned long vaddr,
* pages are already counted against the user.
*/
if (!rsvd && !vfio_find_vpfn(dma, iova)) {
- if (!dma->lock_cap && mm->locked_vm + 1 > limit) {
+ if (!dma->lock_cap &&
+ atomic_long_read(&mm->locked_vm) + 1 > limit) {
put_pfn(*pfn_base, dma->prot);
pr_warn("%s: RLIMIT_MEMLOCK (%ld) exceeded\n", __func__,
limit << PAGE_SHIFT);
@@ -445,8 +446,8 @@ static long vfio_pin_pages_remote(struct vfio_dma *dma, unsigned long vaddr,
}

if (!rsvd && !vfio_find_vpfn(dma, iova)) {
- if (!dma->lock_cap &&
- mm->locked_vm + lock_acct + 1 > limit) {
+ if (!dma->lock_cap && atomic_long_read(&mm->locked_vm) +
+ lock_acct + 1 > limit) {
put_pfn(pfn, dma->prot);
pr_warn("%s: RLIMIT_MEMLOCK (%ld) exceeded\n",
__func__, limit << PAGE_SHIFT);
diff --git a/fs/proc/task_mmu.c b/fs/proc/task_mmu.c
index 39e96a21366e..f0468bfe022e 100644
--- a/fs/proc/task_mmu.c
+++ b/fs/proc/task_mmu.c
@@ -58,7 +58,7 @@ void task_mem(struct seq_file *m, struct mm_struct *mm)
swap = get_mm_counter(mm, MM_SWAPENTS);
SEQ_PUT_DEC("VmPeak:\t", hiwater_vm);
SEQ_PUT_DEC(" kB\nVmSize:\t", total_vm);
- SEQ_PUT_DEC(" kB\nVmLck:\t", mm->locked_vm);
+ SEQ_PUT_DEC(" kB\nVmLck:\t", atomic_long_read(&mm->locked_vm));
SEQ_PUT_DEC(" kB\nVmPin:\t", mm->pinned_vm);
SEQ_PUT_DEC(" kB\nVmHWM:\t", hiwater_rss);
SEQ_PUT_DEC(" kB\nVmRSS:\t", total_rss);
diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h
index 5ed8f6292a53..ee02933d10e5 100644
--- a/include/linux/mm_types.h
+++ b/include/linux/mm_types.h
@@ -399,7 +399,7 @@ struct mm_struct {
unsigned long hiwater_vm; /* High-water virtual memory usage */

unsigned long total_vm; /* Total pages mapped */
- unsigned long locked_vm; /* Pages that have PG_mlocked set */
+ atomic_long_t locked_vm; /* Pages that have PG_mlocked set */
unsigned long pinned_vm; /* Refcount permanently increased */
unsigned long data_vm; /* VM_WRITE & ~VM_SHARED & ~VM_STACK */
unsigned long exec_vm; /* VM_EXEC & ~VM_WRITE & ~VM_STACK */
diff --git a/kernel/fork.c b/kernel/fork.c
index 2f78d32eaa0f..e0d4d5b8f151 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -978,7 +978,7 @@ static struct mm_struct *mm_init(struct mm_struct *mm, struct task_struct *p,
mm->core_state = NULL;
mm_pgtables_bytes_init(mm);
mm->map_count = 0;
- mm->locked_vm = 0;
+ atomic_long_set(&mm->locked_vm, 0);
mm->pinned_vm = 0;
memset(&mm->rss_stat, 0, sizeof(mm->rss_stat));
spin_lock_init(&mm->page_table_lock);
diff --git a/mm/debug.c b/mm/debug.c
index cdacba12e09a..3d8db069176f 100644
--- a/mm/debug.c
+++ b/mm/debug.c
@@ -152,7 +152,8 @@ void dump_mm(const struct mm_struct *mm)
atomic_read(&mm->mm_count),
mm_pgtables_bytes(mm),
mm->map_count,
- mm->hiwater_rss, mm->hiwater_vm, mm->total_vm, mm->locked_vm,
+ mm->hiwater_rss, mm->hiwater_vm, mm->total_vm,
+ atomic_long_read(&mm->locked_vm),
mm->pinned_vm, mm->data_vm, mm->exec_vm, mm->stack_vm,
mm->start_code, mm->end_code, mm->start_data, mm->end_data,
mm->start_brk, mm->brk, mm->start_stack,
diff --git a/mm/mlock.c b/mm/mlock.c
index 41cc47e28ad6..b213ad97585c 100644
--- a/mm/mlock.c
+++ b/mm/mlock.c
@@ -562,7 +562,7 @@ static int mlock_fixup(struct vm_area_struct *vma, struct vm_area_struct **prev,
nr_pages = -nr_pages;
else if (old_flags & VM_LOCKED)
nr_pages = 0;
- mm->locked_vm += nr_pages;
+ atomic_long_add(nr_pages, &mm->locked_vm);

/*
* vm_flags is protected by the mmap_sem held in write mode.
@@ -687,7 +687,7 @@ static __must_check int do_mlock(unsigned long start, size_t len, vm_flags_t fla
if (down_write_killable(&current->mm->mmap_sem))
return -EINTR;

- locked += current->mm->locked_vm;
+ locked += atomic_long_read(&current->mm->locked_vm);
if ((locked > lock_limit) && (!capable(CAP_IPC_LOCK))) {
/*
* It is possible that the regions requested intersect with
diff --git a/mm/mmap.c b/mm/mmap.c
index 6c04292e16a7..5c4b39b99fb4 100644
--- a/mm/mmap.c
+++ b/mm/mmap.c
@@ -1339,7 +1339,7 @@ static inline int mlock_future_check(struct mm_struct *mm,
/* mlock MCL_FUTURE? */
if (flags & VM_LOCKED) {
locked = len >> PAGE_SHIFT;
- locked += mm->locked_vm;
+ locked += atomic_long_read(&mm->locked_vm);
lock_limit = rlimit(RLIMIT_MEMLOCK);
lock_limit >>= PAGE_SHIFT;
if (locked > lock_limit && !capable(CAP_IPC_LOCK))
@@ -1825,7 +1825,7 @@ unsigned long mmap_region(struct file *file, unsigned long addr,
vma == get_gate_vma(current->mm))
vma->vm_flags &= VM_LOCKED_CLEAR_MASK;
else
- mm->locked_vm += (len >> PAGE_SHIFT);
+ atomic_long_add(len >> PAGE_SHIFT, &mm->locked_vm);
}

if (file)
@@ -2291,7 +2291,7 @@ static int acct_stack_growth(struct vm_area_struct *vma,
if (vma->vm_flags & VM_LOCKED) {
unsigned long locked;
unsigned long limit;
- locked = mm->locked_vm + grow;
+ locked = atomic_long_read(&mm->locked_vm) + grow;
limit = rlimit(RLIMIT_MEMLOCK);
limit >>= PAGE_SHIFT;
if (locked > limit && !capable(CAP_IPC_LOCK))
@@ -2385,7 +2385,7 @@ int expand_upwards(struct vm_area_struct *vma, unsigned long address)
*/
spin_lock(&mm->page_table_lock);
if (vma->vm_flags & VM_LOCKED)
- mm->locked_vm += grow;
+ atomic_long_add(grow, &mm->locked_vm);
vm_stat_account(mm, vma->vm_flags, grow);
anon_vma_interval_tree_pre_update_vma(vma);
vma->vm_end = address;
@@ -2466,7 +2466,7 @@ int expand_downwards(struct vm_area_struct *vma,
*/
spin_lock(&mm->page_table_lock);
if (vma->vm_flags & VM_LOCKED)
- mm->locked_vm += grow;
+ atomic_long_add(grow, &mm->locked_vm);
vm_stat_account(mm, vma->vm_flags, grow);
anon_vma_interval_tree_pre_update_vma(vma);
vma->vm_start = address;
@@ -2787,11 +2787,11 @@ int __do_munmap(struct mm_struct *mm, unsigned long start, size_t len,
/*
* unlock any mlock()ed ranges before detaching vmas
*/
- if (mm->locked_vm) {
+ if (atomic_long_read(&mm->locked_vm)) {
struct vm_area_struct *tmp = vma;
while (tmp && tmp->vm_start < end) {
if (tmp->vm_flags & VM_LOCKED) {
- mm->locked_vm -= vma_pages(tmp);
+ atomic_long_sub(vma_pages(tmp), &mm->locked_vm);
munlock_vma_pages_all(tmp);
}

@@ -3050,7 +3050,7 @@ static int do_brk_flags(unsigned long addr, unsigned long len, unsigned long fla
mm->total_vm += len >> PAGE_SHIFT;
mm->data_vm += len >> PAGE_SHIFT;
if (flags & VM_LOCKED)
- mm->locked_vm += (len >> PAGE_SHIFT);
+ atomic_long_add(len >> PAGE_SHIFT, &mm->locked_vm);
vma->vm_flags |= VM_SOFTDIRTY;
return 0;
}
@@ -3122,7 +3122,7 @@ void exit_mmap(struct mm_struct *mm)
up_write(&mm->mmap_sem);
}

- if (mm->locked_vm) {
+ if (atomic_long_read(&mm->locked_vm)) {
vma = mm->mmap;
while (vma) {
if (vma->vm_flags & VM_LOCKED)
diff --git a/mm/mremap.c b/mm/mremap.c
index 7f9f9180e401..abcc58690f17 100644
--- a/mm/mremap.c
+++ b/mm/mremap.c
@@ -360,7 +360,7 @@ static unsigned long move_vma(struct vm_area_struct *vma,
}

if (vm_flags & VM_LOCKED) {
- mm->locked_vm += new_len >> PAGE_SHIFT;
+ atomic_long_add(new_len >> PAGE_SHIFT, &mm->locked_vm);
*locked = true;
}

@@ -411,7 +411,7 @@ static struct vm_area_struct *vma_to_resize(unsigned long addr,

if (vma->vm_flags & VM_LOCKED) {
unsigned long locked, lock_limit;
- locked = mm->locked_vm << PAGE_SHIFT;
+ locked = atomic_long_read(&mm->locked_vm) << PAGE_SHIFT;
lock_limit = rlimit(RLIMIT_MEMLOCK);
locked += new_len - old_len;
if (locked > lock_limit && !capable(CAP_IPC_LOCK))
@@ -600,7 +600,7 @@ SYSCALL_DEFINE5(mremap, unsigned long, addr, unsigned long, old_len,

vm_stat_account(mm, vma->vm_flags, pages);
if (vma->vm_flags & VM_LOCKED) {
- mm->locked_vm += pages;
+ atomic_long_add(pages, &mm->locked_vm);
locked = true;
new_addr = addr;
}
--
2.19.1
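
The conversion pattern the patch applies, restated as a minimal sketch
(this is just the diff above in condensed form, not additional code from
the series): plain reads and read-modify-writes of mm->locked_vm become
atomic_long_* operations.

    atomic_long_set(&mm->locked_vm, 0);                  /* mm_init()     */
    atomic_long_add(len >> PAGE_SHIFT, &mm->locked_vm);  /* mmap_region() */
    locked = atomic_long_read(&mm->locked_vm) + grow;    /* limit checks  */
    atomic_long_sub(vma_pages(tmp), &mm->locked_vm);     /* __do_munmap() */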


2018-11-05 17:02:28

by Daniel Jordan

[permalink] [raw]
Subject: [RFC PATCH v4 06/13] vfio: parallelize vfio_pin_map_dma

When starting a large-memory kvm guest, it takes an excessively long
time to start the boot process because qemu must pin all guest pages to
accommodate DMA when VFIO is in use. Currently just one CPU is
responsible for the page pinning, which usually boils down to page
clearing time-wise, so the ways to optimize this are buying a faster
CPU ;-) or using more of the CPUs you already have.

Parallelize with ktask. Refactor so workqueue workers pin with the mm
of the calling thread, and to enable an undo callback for ktask to
handle errors during page pinning.

Performance results appear later in the series.

Signed-off-by: Daniel Jordan <[email protected]>
---
drivers/vfio/vfio_iommu_type1.c | 106 +++++++++++++++++++++++---------
1 file changed, 76 insertions(+), 30 deletions(-)

diff --git a/drivers/vfio/vfio_iommu_type1.c b/drivers/vfio/vfio_iommu_type1.c
index d9fd3188615d..e7cfbf0c8071 100644
--- a/drivers/vfio/vfio_iommu_type1.c
+++ b/drivers/vfio/vfio_iommu_type1.c
@@ -41,6 +41,7 @@
#include <linux/notifier.h>
#include <linux/dma-iommu.h>
#include <linux/irqdomain.h>
+#include <linux/ktask.h>

#define DRIVER_VERSION "0.2"
#define DRIVER_AUTHOR "Alex Williamson <[email protected]>"
@@ -395,7 +396,7 @@ static int vaddr_get_pfn(struct mm_struct *mm, unsigned long vaddr,
*/
static long vfio_pin_pages_remote(struct vfio_dma *dma, unsigned long vaddr,
long npage, unsigned long *pfn_base,
- unsigned long limit)
+ unsigned long limit, struct mm_struct *mm)
{
unsigned long pfn = 0;
long ret, pinned = 0, lock_acct = 0;
@@ -403,10 +404,10 @@ static long vfio_pin_pages_remote(struct vfio_dma *dma, unsigned long vaddr,
dma_addr_t iova = vaddr - dma->vaddr + dma->iova;

/* This code path is only user initiated */
- if (!current->mm)
+ if (!mm)
return -ENODEV;

- ret = vaddr_get_pfn(current->mm, vaddr, dma->prot, pfn_base);
+ ret = vaddr_get_pfn(mm, vaddr, dma->prot, pfn_base);
if (ret)
return ret;

@@ -418,7 +419,7 @@ static long vfio_pin_pages_remote(struct vfio_dma *dma, unsigned long vaddr,
* pages are already counted against the user.
*/
if (!rsvd && !vfio_find_vpfn(dma, iova)) {
- if (!dma->lock_cap && current->mm->locked_vm + 1 > limit) {
+ if (!dma->lock_cap && mm->locked_vm + 1 > limit) {
put_pfn(*pfn_base, dma->prot);
pr_warn("%s: RLIMIT_MEMLOCK (%ld) exceeded\n", __func__,
limit << PAGE_SHIFT);
@@ -433,7 +434,7 @@ static long vfio_pin_pages_remote(struct vfio_dma *dma, unsigned long vaddr,
/* Lock all the consecutive pages from pfn_base */
for (vaddr += PAGE_SIZE, iova += PAGE_SIZE; pinned < npage;
pinned++, vaddr += PAGE_SIZE, iova += PAGE_SIZE) {
- ret = vaddr_get_pfn(current->mm, vaddr, dma->prot, &pfn);
+ ret = vaddr_get_pfn(mm, vaddr, dma->prot, &pfn);
if (ret)
break;

@@ -445,7 +446,7 @@ static long vfio_pin_pages_remote(struct vfio_dma *dma, unsigned long vaddr,

if (!rsvd && !vfio_find_vpfn(dma, iova)) {
if (!dma->lock_cap &&
- current->mm->locked_vm + lock_acct + 1 > limit) {
+ mm->locked_vm + lock_acct + 1 > limit) {
put_pfn(pfn, dma->prot);
pr_warn("%s: RLIMIT_MEMLOCK (%ld) exceeded\n",
__func__, limit << PAGE_SHIFT);
@@ -752,15 +753,15 @@ static size_t unmap_unpin_slow(struct vfio_domain *domain,
}

static long vfio_unmap_unpin(struct vfio_iommu *iommu, struct vfio_dma *dma,
+ dma_addr_t iova, dma_addr_t end,
bool do_accounting)
{
- dma_addr_t iova = dma->iova, end = dma->iova + dma->size;
struct vfio_domain *domain, *d;
LIST_HEAD(unmapped_region_list);
int unmapped_region_cnt = 0;
long unlocked = 0;

- if (!dma->size)
+ if (iova == end)
return 0;

if (!IS_IOMMU_CAP_DOMAIN_IN_CONTAINER(iommu))
@@ -777,7 +778,7 @@ static long vfio_unmap_unpin(struct vfio_iommu *iommu, struct vfio_dma *dma,
struct vfio_domain, next);

list_for_each_entry_continue(d, &iommu->domain_list, next) {
- iommu_unmap(d->domain, dma->iova, dma->size);
+ iommu_unmap(d->domain, iova, end - iova);
cond_resched();
}

@@ -818,8 +819,6 @@ static long vfio_unmap_unpin(struct vfio_iommu *iommu, struct vfio_dma *dma,
}
}

- dma->iommu_mapped = false;
-
if (unmapped_region_cnt)
unlocked += vfio_sync_unpin(dma, domain, &unmapped_region_list);

@@ -830,14 +829,21 @@ static long vfio_unmap_unpin(struct vfio_iommu *iommu, struct vfio_dma *dma,
return unlocked;
}

-static void vfio_remove_dma(struct vfio_iommu *iommu, struct vfio_dma *dma)
+static void vfio_remove_dma_finish(struct vfio_iommu *iommu,
+ struct vfio_dma *dma)
{
- vfio_unmap_unpin(iommu, dma, true);
+ dma->iommu_mapped = false;
vfio_unlink_dma(iommu, dma);
put_task_struct(dma->task);
kfree(dma);
}

+static void vfio_remove_dma(struct vfio_iommu *iommu, struct vfio_dma *dma)
+{
+ vfio_unmap_unpin(iommu, dma, dma->iova, dma->iova + dma->size, true);
+ vfio_remove_dma_finish(iommu, dma);
+}
+
static unsigned long vfio_pgsize_bitmap(struct vfio_iommu *iommu)
{
struct vfio_domain *domain;
@@ -1031,20 +1037,29 @@ static int vfio_iommu_map(struct vfio_iommu *iommu, dma_addr_t iova,
return ret;
}

-static int vfio_pin_map_dma(struct vfio_iommu *iommu, struct vfio_dma *dma,
- size_t map_size)
+struct vfio_pin_args {
+ struct vfio_iommu *iommu;
+ struct vfio_dma *dma;
+ unsigned long limit;
+ struct mm_struct *mm;
+};
+
+static int vfio_pin_map_dma_chunk(unsigned long start_vaddr,
+ unsigned long end_vaddr,
+ struct vfio_pin_args *args)
{
- dma_addr_t iova = dma->iova;
- unsigned long vaddr = dma->vaddr;
- size_t size = map_size;
+ struct vfio_dma *dma = args->dma;
+ dma_addr_t iova = dma->iova + (start_vaddr - dma->vaddr);
+ unsigned long unmapped_size = end_vaddr - start_vaddr;
+ unsigned long pfn, mapped_size = 0;
long npage;
- unsigned long pfn, limit = rlimit(RLIMIT_MEMLOCK) >> PAGE_SHIFT;
int ret = 0;

- while (size) {
+ while (unmapped_size) {
/* Pin a contiguous chunk of memory */
- npage = vfio_pin_pages_remote(dma, vaddr + dma->size,
- size >> PAGE_SHIFT, &pfn, limit);
+ npage = vfio_pin_pages_remote(dma, start_vaddr + mapped_size,
+ unmapped_size >> PAGE_SHIFT,
+ &pfn, args->limit, args->mm);
if (npage <= 0) {
WARN_ON(!npage);
ret = (int)npage;
@@ -1052,22 +1067,50 @@ static int vfio_pin_map_dma(struct vfio_iommu *iommu, struct vfio_dma *dma,
}

/* Map it! */
- ret = vfio_iommu_map(iommu, iova + dma->size, pfn, npage,
- dma->prot);
+ ret = vfio_iommu_map(args->iommu, iova + mapped_size, pfn,
+ npage, dma->prot);
if (ret) {
- vfio_unpin_pages_remote(dma, iova + dma->size, pfn,
+ vfio_unpin_pages_remote(dma, iova + mapped_size, pfn,
npage, true);
break;
}

- size -= npage << PAGE_SHIFT;
- dma->size += npage << PAGE_SHIFT;
+ unmapped_size -= npage << PAGE_SHIFT;
+ mapped_size += npage << PAGE_SHIFT;
}

+ return (ret == 0) ? KTASK_RETURN_SUCCESS : ret;
+}
+
+static void vfio_pin_map_dma_undo(unsigned long start_vaddr,
+ unsigned long end_vaddr,
+ struct vfio_pin_args *args)
+{
+ struct vfio_dma *dma = args->dma;
+ dma_addr_t iova = dma->iova + (start_vaddr - dma->vaddr);
+ dma_addr_t end = dma->iova + (end_vaddr - dma->vaddr);
+
+ vfio_unmap_unpin(args->iommu, args->dma, iova, end, true);
+}
+
+static int vfio_pin_map_dma(struct vfio_iommu *iommu, struct vfio_dma *dma,
+ size_t map_size)
+{
+ unsigned long limit = rlimit(RLIMIT_MEMLOCK) >> PAGE_SHIFT;
+ int ret = 0;
+ struct vfio_pin_args args = { iommu, dma, limit, current->mm };
+ /* Stay on PMD boundary in case THP is being used. */
+ DEFINE_KTASK_CTL(ctl, vfio_pin_map_dma_chunk, &args, PMD_SIZE);
+
+ ktask_ctl_set_undo_func(&ctl, vfio_pin_map_dma_undo);
+ ret = ktask_run((void *)dma->vaddr, map_size, &ctl);
+
dma->iommu_mapped = true;

if (ret)
- vfio_remove_dma(iommu, dma);
+ vfio_remove_dma_finish(iommu, dma);
+ else
+ dma->size += map_size;

return ret;
}
@@ -1229,7 +1272,8 @@ static int vfio_iommu_replay(struct vfio_iommu *iommu,

npage = vfio_pin_pages_remote(dma, vaddr,
n >> PAGE_SHIFT,
- &pfn, limit);
+ &pfn, limit,
+ current->mm);
if (npage <= 0) {
WARN_ON(!npage);
ret = (int)npage;
@@ -1497,7 +1541,9 @@ static void vfio_iommu_unmap_unpin_reaccount(struct vfio_iommu *iommu)
long locked = 0, unlocked = 0;

dma = rb_entry(n, struct vfio_dma, node);
- unlocked += vfio_unmap_unpin(iommu, dma, false);
+ unlocked += vfio_unmap_unpin(iommu, dma, dma->iova,
+ dma->iova + dma->size, false);
+ dma->iommu_mapped = false;
p = rb_first(&dma->pfn_list);
for (; p; p = rb_next(p)) {
struct vfio_pfn *vpfn = rb_entry(p, struct vfio_pfn,
--
2.19.1
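
Condensed, the ktask usage pattern the patch introduces looks like this
(a restatement of the diff above, not additional code from the series):
the DMA range is chunked on PMD boundaries, each chunk is pinned and
mapped by the thread function, and an undo callback lets ktask unwind
successfully completed chunks if any chunk fails.

    struct vfio_pin_args args = { iommu, dma, limit, current->mm };

    /* Chunk on PMD boundaries in case THP is being used. */
    DEFINE_KTASK_CTL(ctl, vfio_pin_map_dma_chunk, &args, PMD_SIZE);

    /* Called on completed chunks if the run fails partway through. */
    ktask_ctl_set_undo_func(&ctl, vfio_pin_map_dma_undo);

    ret = ktask_run((void *)dma->vaddr, map_size, &ctl);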


2018-11-05 17:03:04

by Daniel Jordan

[permalink] [raw]
Subject: [RFC PATCH v4 04/13] ktask: run helper threads at MAX_NICE

Multithreading may speed long-running kernel tasks, but overly
optimistic parallelization can go wrong if too many helper threads are
started on an already-busy system. Such helpers can degrade the
performance of other tasks, so they should be sensitive to current CPU
utilization[1].

To achieve this, run helpers at MAX_NICE so that their CPU time is
proportional to idle CPU time. The main thread that called into ktask
naturally runs at its original priority so that it can make progress on
a heavily loaded system, as it would if ktask were not in the picture.

I tested two different cases in which a non-ktask and a ktask workload
compete for the same CPUs with the goal of showing that normal priority
(i.e. nice=0) ktask helpers cause the non-ktask workload to run more
slowly, whereas MAX_NICE ktask helpers don't.

Testing notes:
- Each case was run using 8 CPUs on a large two-socket server, with a
cpumask allowing all test threads to run anywhere within the 8.
- The non-ktask workload used 7 threads and the ktask workload used 8
threads to evaluate how much ktask helpers, rather than the main ktask
thread, disturbed the non-ktask workload.
- The non-ktask workload was started after the ktask workload and run
for less time to maximize the chances that the non-ktask workload would
be disturbed.
- Runtimes in seconds.

Case 1: Synthetic, worst-case CPU contention

ktask_test - a tight loop doing integer multiplication to max out on CPU;
used for testing only, does not appear in this series
stress-ng - cpu stressor ("-c --cpu-method ackerman --cpu-ops 1200");

               stress-ng
               alone    (stdev)    max_nice  (stdev)    normal_prio  (stdev)
 ---------------------------------------------------------------------------
 ktask_test                        96.87     ( 1.09)    90.81        ( 0.29)
 stress-ng     43.04    ( 0.00)    43.58     ( 0.01)    75.86        ( 0.39)

This case shows MAX_NICE helpers make a significant difference compared
to normal priority helpers, with stress-ng taking 76% longer to finish
when competing with normal priority ktask threads than when run by
itself, but only 1% longer when run with MAX_NICE helpers. The 1% comes
from the small amount of CPU time MAX_NICE threads are given despite
their low priority.

Case 2: Real-world CPU contention

ktask_vfio - VFIO page pin a 175G kvm guest
usemem - faults in 25G of anonymous THP per thread, PAGE_SIZE stride;
used to mimic the page clearing that dominates in ktask_vfio
so that usemem competes for the same system resources

               usemem
               alone    (stdev)    max_nice  (stdev)    normal_prio  (stdev)
 ---------------------------------------------------------------------------
 ktask_vfio                        14.74     ( 0.04)     9.93        ( 0.09)
 usemem        10.45    ( 0.04)    10.75     ( 0.04)    14.14        ( 0.07)

In the more realistic case 2, the effect is similar although not as
pronounced. The usemem threads take 35% longer to finish with normal
priority ktask threads than when run alone, but only 3% longer when
MAX_NICE is used.

All ktask users outside of VFIO boil down to page clearing, so I imagine
the results would be similar for them.

[1] lkml.kernel.org/r/[email protected]

Signed-off-by: Daniel Jordan <[email protected]>
---
kernel/ktask.c | 12 ++++++++++++
1 file changed, 12 insertions(+)

diff --git a/kernel/ktask.c b/kernel/ktask.c
index b91c62f14dcd..72293a0f50c3 100644
--- a/kernel/ktask.c
+++ b/kernel/ktask.c
@@ -575,6 +575,18 @@ void __init ktask_init(void)
goto alloc_fail;
}

+ /*
+ * All ktask worker threads have the lowest priority on the system so
+ * they don't disturb other workloads.
+ */
+ attrs->nice = MAX_NICE;
+
+ ret = apply_workqueue_attrs(ktask_wq, attrs);
+ if (ret != 0) {
+ pr_warn("disabled (couldn't apply attrs to ktask_wq)");
+ goto apply_fail;
+ }
+
attrs->no_numa = true;

ret = apply_workqueue_attrs(ktask_nonuma_wq, attrs);
--
2.19.1


2018-11-05 17:31:41

by Michal Hocko

[permalink] [raw]
Subject: Re: [RFC PATCH v4 00/13] ktask: multithread CPU-intensive kernel work

On Mon 05-11-18 11:55:45, Daniel Jordan wrote:
> Michal, you mentioned that ktask should be sensitive to CPU utilization[1].
> ktask threads now run at the lowest priority on the system to avoid disturbing
> busy CPUs (more details in patches 4 and 5). Does this address your concern?
> The plan to address your other comments is explained below.

I have only glanced through the documentation patch and it looks like it
will be much less disruptive than the previous attempts. Now the obvious
question is how does this behave on a moderately or even busy system
when you compare that to a single threaded execution. Some numbers about
best/worst case execution would be really helpful.

I will look closer later.

--
Michal Hocko
SUSE Labs

2018-11-05 18:50:02

by Zi Yan

[permalink] [raw]
Subject: Re: [RFC PATCH v4 00/13] ktask: multithread CPU-intensive kernel work

Hi Daniel,

On 5 Nov 2018, at 11:55, Daniel Jordan wrote:

> Hi,
>
> This version addresses some of the feedback from Andrew and Michal last year
> and describes the plan for tackling the rest. I'm posting now since I'll be
> presenting ktask at Plumbers next week.
>
> Andrew, you asked about parallelizing in more places[0]. This version adds
> multithreading for VFIO page pinning, and there are more planned users listed
> below.
>
> Michal, you mentioned that ktask should be sensitive to CPU utilization[1].
> ktask threads now run at the lowest priority on the system to avoid disturbing
> busy CPUs (more details in patches 4 and 5). Does this address your concern?
> The plan to address your other comments is explained below.
>
> Alex, any thoughts about the VFIO changes in patches 6-9?
>
> Tejun and Lai, what do you think of patch 5?
>
> And for everyone, questions and comments welcome. Any suggestions for more
> users?
>
> Thanks,
> Daniel
>
> P.S. This series is big to address the above feedback, but I can send patches
> 7 and 8 separately.
>
>
> TODO
> ----
>
> - Implement cgroup-aware unbound workqueues in a separate series, picking up
> Bandan Das's effort from two years ago[2]. This should hopefully address
> Michal's comment about running ktask threads within the limits of the calling
> context[1].
>
> - Make ktask aware of power management. A starting point is to disable the
> framework when energy-conscious cpufreq settings are enabled (e.g.
> powersave, conservative scaling governors). This should address another
> comment from Michal about keeping CPUs under power constraints idle[1].
>
> - Add more users. On my list:
> - __ib_umem_release in IB core, which Jason Gunthorpe mentioned[3]
> - XFS quotacheck and online repair, as suggested by Darrick Wong
> - vfs object teardown at umount time, as Andrew mentioned[0]
> - page freeing in munmap/exit, as Aaron Lu posted[4]
> - page freeing in shmem
> The last three will benefit from scaling zone->lock and lru_lock.
>
> - CPU hotplug support for ktask to adjust its per-CPU data and resource
> limits.
>
> - Check with IOMMU folks that iommu_map is safe for all IOMMU backend
> implementations (it is for x86).
>
>
> Summary
> -------
>
> A single CPU can spend an excessive amount of time in the kernel operating
> on large amounts of data. Often these situations arise during initialization-
> and destruction-related tasks, where the data involved scales with system size.
> These long-running jobs can slow startup and shutdown of applications and the
> system itself while extra CPUs sit idle.
>
> To ensure that applications and the kernel continue to perform well as core
> counts and memory sizes increase, harness these idle CPUs to complete such jobs
> more quickly.
>
> ktask is a generic framework for parallelizing CPU-intensive work in the
> kernel. The API is generic enough to add concurrency to many different kinds
> of tasks--for example, zeroing a range of pages or evicting a list of
> inodes--and aims to save its clients the trouble of splitting up the work,
> choosing the number of threads to use, maintaining an efficient concurrency
> level, starting these threads, and load balancing the work between them.
>
> The first patch has more documentation, and the second patch has the interface.
>
> Current users:
> 1) VFIO page pinning before kvm guest startup (others hitting slowness too[5])
> 2) deferred struct page initialization at boot time
> 3) clearing gigantic pages
> 4) fallocate for HugeTLB pages

Do you think if it makes sense to use ktask for huge page migration (the data
copy part)?

I did some experiments back in 2016[1], which showed that migrating one 2MB page
with 8 threads could achieve 2.8x throughput of the existing single-threaded method.
The problem with my parallel page migration patchset at that time was that it
has no CPU-utilization awareness, which is solved by your patches now.

Thanks.

[1]https://lkml.org/lkml/2016/11/22/457

--
Best Regards
Yan Zi


2018-11-05 20:53:02

by Randy Dunlap

[permalink] [raw]
Subject: Re: [RFC PATCH v4 02/13] ktask: multithread CPU-intensive kernel work

On 11/5/18 8:55 AM, Daniel Jordan wrote:
> diff --git a/init/Kconfig b/init/Kconfig
> index 41583f468cb4..ed82f76ed0b7 100644
> --- a/init/Kconfig
> +++ b/init/Kconfig
> @@ -346,6 +346,17 @@ config AUDIT_TREE
> depends on AUDITSYSCALL
> select FSNOTIFY
>
> +config KTASK
> + bool "Multithread CPU-intensive kernel work"
> + depends on SMP
> + default y
> + help
> + Parallelize CPU-intensive kernel work. This feature is designed for
> + big machines that can take advantage of their extra CPUs to speed up
> + large kernel tasks. When enabled, kworker threads may occupy more
> + CPU time during these kernel tasks, but these threads are throttled
> + when other tasks on the system need CPU time.

Use tab + 2 spaces consistently for help text indentation, please.

> +
> source "kernel/irq/Kconfig"
> source "kernel/time/Kconfig"
> source "kernel/Kconfig.preempt"


--
~Randy

2018-11-05 21:22:34

by Randy Dunlap

[permalink] [raw]
Subject: Re: [RFC PATCH v4 01/13] ktask: add documentation

On 11/5/18 8:55 AM, Daniel Jordan wrote:
> Motivates and explains the ktask API for kernel clients.
>
> Signed-off-by: Daniel Jordan <[email protected]>
> ---
> Documentation/core-api/index.rst | 1 +
> Documentation/core-api/ktask.rst | 213 +++++++++++++++++++++++++++++++
> 2 files changed, 214 insertions(+)
> create mode 100644 Documentation/core-api/ktask.rst

Hi,

> diff --git a/Documentation/core-api/ktask.rst b/Documentation/core-api/ktask.rst
> new file mode 100644
> index 000000000000..c3c00e1f802f
> --- /dev/null
> +++ b/Documentation/core-api/ktask.rst
> @@ -0,0 +1,213 @@
> +.. SPDX-License-Identifier: GPL-2.0+
> +
> +============================================
> +ktask: parallelize CPU-intensive kernel work
> +============================================
> +
> +:Date: November, 2018
> +:Author: Daniel Jordan <[email protected]>
> +
> +
> +Introduction
> +============

[snip]


> +Resource Limits
> +===============
> +
> +ktask has resource limits on the number of work items it sends to workqueue.

to a workqueue.
or: to workqueues.

> +In ktask, a workqueue item is a thread that runs chunks of the task until the
> +task is finished.
> +
> +These limits support the different ways ktask uses workqueues:
> + - ktask_run to run threads on the calling thread's node.
> + - ktask_run_numa to run threads on the node(s) specified.
> + - ktask_run_numa with nid=NUMA_NO_NODE to run threads on any node in the
> + system.
> +
> +To support these different ways of queueing work while maintaining an efficient
> +concurrency level, we need both system-wide and per-node limits on the number

I would prefer to refer to ktask as ktask instead of "we", so
s/we need/ktask needs/


> +of threads. Without per-node limits, a node might become oversubscribed
> +despite ktask staying within the system-wide limit, and without a system-wide
> +limit, we can't properly account for work that can run on any node.

s/we/ktask/

> +
> +The system-wide limit is based on the total number of CPUs, and the per-node
> +limit on the CPU count for each node. A per-node work item counts against the
> +system-wide limit. Workqueue's max_active can't accommodate both types of
> +limit, no matter how many workqueues are used, so ktask implements its own.
> +
> +If a per-node limit is reached, the work item is allowed to run anywhere on the
> +machine to avoid overwhelming the node. If the global limit is also reached,
> +ktask won't queue additional work items until we fall below the limit again.

s/we fall/ktask falls/
or s/we fall/it falls/

> +
> +These limits apply only to workqueue items--that is, helper threads beyond the
> +one starting the task. That way, one thread per task is always allowed to run.


thanks.
--
~Randy

2018-11-05 21:54:01

by Alex Williamson

[permalink] [raw]
Subject: Re: [RFC PATCH v4 06/13] vfio: parallelize vfio_pin_map_dma

On Mon, 5 Nov 2018 11:55:51 -0500
Daniel Jordan <[email protected]> wrote:

> When starting a large-memory kvm guest, it takes an excessively long
> time to start the boot process because qemu must pin all guest pages to
> accommodate DMA when VFIO is in use. Currently just one CPU is
> responsible for the page pinning, which usually boils down to page
> clearing time-wise, so the ways to optimize this are buying a faster
> CPU ;-) or using more of the CPUs you already have.
>
> Parallelize with ktask. Refactor so workqueue workers pin with the mm
> of the calling thread, and to enable an undo callback for ktask to
> handle errors during page pinning.
>
> Performance results appear later in the series.
>
> Signed-off-by: Daniel Jordan <[email protected]>
> ---
> drivers/vfio/vfio_iommu_type1.c | 106 +++++++++++++++++++++++---------
> 1 file changed, 76 insertions(+), 30 deletions(-)
>
> diff --git a/drivers/vfio/vfio_iommu_type1.c b/drivers/vfio/vfio_iommu_type1.c
> index d9fd3188615d..e7cfbf0c8071 100644
> --- a/drivers/vfio/vfio_iommu_type1.c
> +++ b/drivers/vfio/vfio_iommu_type1.c
> @@ -41,6 +41,7 @@
> #include <linux/notifier.h>
> #include <linux/dma-iommu.h>
> #include <linux/irqdomain.h>
> +#include <linux/ktask.h>
>
> #define DRIVER_VERSION "0.2"
> #define DRIVER_AUTHOR "Alex Williamson <[email protected]>"
> @@ -395,7 +396,7 @@ static int vaddr_get_pfn(struct mm_struct *mm, unsigned long vaddr,
> */
> static long vfio_pin_pages_remote(struct vfio_dma *dma, unsigned long vaddr,
> long npage, unsigned long *pfn_base,
> - unsigned long limit)
> + unsigned long limit, struct mm_struct *mm)
> {
> unsigned long pfn = 0;
> long ret, pinned = 0, lock_acct = 0;
> @@ -403,10 +404,10 @@ static long vfio_pin_pages_remote(struct vfio_dma *dma, unsigned long vaddr,
> dma_addr_t iova = vaddr - dma->vaddr + dma->iova;
>
> /* This code path is only user initiated */
> - if (!current->mm)
> + if (!mm)
> return -ENODEV;
>
> - ret = vaddr_get_pfn(current->mm, vaddr, dma->prot, pfn_base);
> + ret = vaddr_get_pfn(mm, vaddr, dma->prot, pfn_base);
> if (ret)
> return ret;
>
> @@ -418,7 +419,7 @@ static long vfio_pin_pages_remote(struct vfio_dma *dma, unsigned long vaddr,
> * pages are already counted against the user.
> */
> if (!rsvd && !vfio_find_vpfn(dma, iova)) {
> - if (!dma->lock_cap && current->mm->locked_vm + 1 > limit) {
> + if (!dma->lock_cap && mm->locked_vm + 1 > limit) {
> put_pfn(*pfn_base, dma->prot);
> pr_warn("%s: RLIMIT_MEMLOCK (%ld) exceeded\n", __func__,
> limit << PAGE_SHIFT);
> @@ -433,7 +434,7 @@ static long vfio_pin_pages_remote(struct vfio_dma *dma, unsigned long vaddr,
> /* Lock all the consecutive pages from pfn_base */
> for (vaddr += PAGE_SIZE, iova += PAGE_SIZE; pinned < npage;
> pinned++, vaddr += PAGE_SIZE, iova += PAGE_SIZE) {
> - ret = vaddr_get_pfn(current->mm, vaddr, dma->prot, &pfn);
> + ret = vaddr_get_pfn(mm, vaddr, dma->prot, &pfn);
> if (ret)
> break;
>
> @@ -445,7 +446,7 @@ static long vfio_pin_pages_remote(struct vfio_dma *dma, unsigned long vaddr,
>
> if (!rsvd && !vfio_find_vpfn(dma, iova)) {
> if (!dma->lock_cap &&
> - current->mm->locked_vm + lock_acct + 1 > limit) {
> + mm->locked_vm + lock_acct + 1 > limit) {
> put_pfn(pfn, dma->prot);
> pr_warn("%s: RLIMIT_MEMLOCK (%ld) exceeded\n",
> __func__, limit << PAGE_SHIFT);
> @@ -752,15 +753,15 @@ static size_t unmap_unpin_slow(struct vfio_domain *domain,
> }
>
> static long vfio_unmap_unpin(struct vfio_iommu *iommu, struct vfio_dma *dma,
> + dma_addr_t iova, dma_addr_t end,
> bool do_accounting)
> {
> - dma_addr_t iova = dma->iova, end = dma->iova + dma->size;
> struct vfio_domain *domain, *d;
> LIST_HEAD(unmapped_region_list);
> int unmapped_region_cnt = 0;
> long unlocked = 0;
>
> - if (!dma->size)
> + if (iova == end)
> return 0;
>
> if (!IS_IOMMU_CAP_DOMAIN_IN_CONTAINER(iommu))
> @@ -777,7 +778,7 @@ static long vfio_unmap_unpin(struct vfio_iommu *iommu, struct vfio_dma *dma,
> struct vfio_domain, next);
>
> list_for_each_entry_continue(d, &iommu->domain_list, next) {
> - iommu_unmap(d->domain, dma->iova, dma->size);
> + iommu_unmap(d->domain, iova, end - iova);
> cond_resched();
> }
>
> @@ -818,8 +819,6 @@ static long vfio_unmap_unpin(struct vfio_iommu *iommu, struct vfio_dma *dma,
> }
> }
>
> - dma->iommu_mapped = false;
> -
> if (unmapped_region_cnt)
> unlocked += vfio_sync_unpin(dma, domain, &unmapped_region_list);
>
> @@ -830,14 +829,21 @@ static long vfio_unmap_unpin(struct vfio_iommu *iommu, struct vfio_dma *dma,
> return unlocked;
> }
>
> -static void vfio_remove_dma(struct vfio_iommu *iommu, struct vfio_dma *dma)
> +static void vfio_remove_dma_finish(struct vfio_iommu *iommu,
> + struct vfio_dma *dma)
> {
> - vfio_unmap_unpin(iommu, dma, true);
> + dma->iommu_mapped = false;
> vfio_unlink_dma(iommu, dma);
> put_task_struct(dma->task);
> kfree(dma);
> }
>
> +static void vfio_remove_dma(struct vfio_iommu *iommu, struct vfio_dma *dma)
> +{
> + vfio_unmap_unpin(iommu, dma, dma->iova, dma->iova + dma->size, true);
> + vfio_remove_dma_finish(iommu, dma);
> +}
> +
> static unsigned long vfio_pgsize_bitmap(struct vfio_iommu *iommu)
> {
> struct vfio_domain *domain;
> @@ -1031,20 +1037,29 @@ static int vfio_iommu_map(struct vfio_iommu *iommu, dma_addr_t iova,
> return ret;
> }
>
> -static int vfio_pin_map_dma(struct vfio_iommu *iommu, struct vfio_dma *dma,
> - size_t map_size)
> +struct vfio_pin_args {
> + struct vfio_iommu *iommu;
> + struct vfio_dma *dma;
> + unsigned long limit;
> + struct mm_struct *mm;
> +};
> +
> +static int vfio_pin_map_dma_chunk(unsigned long start_vaddr,
> + unsigned long end_vaddr,
> + struct vfio_pin_args *args)
> {
> - dma_addr_t iova = dma->iova;
> - unsigned long vaddr = dma->vaddr;
> - size_t size = map_size;
> + struct vfio_dma *dma = args->dma;
> + dma_addr_t iova = dma->iova + (start_vaddr - dma->vaddr);
> + unsigned long unmapped_size = end_vaddr - start_vaddr;
> + unsigned long pfn, mapped_size = 0;
> long npage;
> - unsigned long pfn, limit = rlimit(RLIMIT_MEMLOCK) >> PAGE_SHIFT;
> int ret = 0;
>
> - while (size) {
> + while (unmapped_size) {
> /* Pin a contiguous chunk of memory */
> - npage = vfio_pin_pages_remote(dma, vaddr + dma->size,
> - size >> PAGE_SHIFT, &pfn, limit);
> + npage = vfio_pin_pages_remote(dma, start_vaddr + mapped_size,
> + unmapped_size >> PAGE_SHIFT,
> + &pfn, args->limit, args->mm);
> if (npage <= 0) {
> WARN_ON(!npage);
> ret = (int)npage;
> @@ -1052,22 +1067,50 @@ static int vfio_pin_map_dma(struct vfio_iommu *iommu, struct vfio_dma *dma,
> }
>
> /* Map it! */
> - ret = vfio_iommu_map(iommu, iova + dma->size, pfn, npage,
> - dma->prot);
> + ret = vfio_iommu_map(args->iommu, iova + mapped_size, pfn,
> + npage, dma->prot);
> if (ret) {
> - vfio_unpin_pages_remote(dma, iova + dma->size, pfn,
> + vfio_unpin_pages_remote(dma, iova + mapped_size, pfn,
> npage, true);
> break;
> }
>
> - size -= npage << PAGE_SHIFT;
> - dma->size += npage << PAGE_SHIFT;
> + unmapped_size -= npage << PAGE_SHIFT;
> + mapped_size += npage << PAGE_SHIFT;
> }
>
> + return (ret == 0) ? KTASK_RETURN_SUCCESS : ret;

Overall I'm a big fan of this, but I think there's an undo problem
here. Per 03/13, kc_undo_func is only called for successfully
completed chunks and each kc_thread_func should handle cleanup of any
intermediate work before failure. That's not done here afaict. Should
we be calling the vfio_pin_map_dma_undo() manually on the completed
range before returning error?

> +}
> +
> +static void vfio_pin_map_dma_undo(unsigned long start_vaddr,
> + unsigned long end_vaddr,
> + struct vfio_pin_args *args)
> +{
> + struct vfio_dma *dma = args->dma;
> + dma_addr_t iova = dma->iova + (start_vaddr - dma->vaddr);
> + dma_addr_t end = dma->iova + (end_vaddr - dma->vaddr);
> +
> + vfio_unmap_unpin(args->iommu, args->dma, iova, end, true);
> +}
> +
> +static int vfio_pin_map_dma(struct vfio_iommu *iommu, struct vfio_dma *dma,
> + size_t map_size)
> +{
> + unsigned long limit = rlimit(RLIMIT_MEMLOCK) >> PAGE_SHIFT;
> + int ret = 0;
> + struct vfio_pin_args args = { iommu, dma, limit, current->mm };
> + /* Stay on PMD boundary in case THP is being used. */
> + DEFINE_KTASK_CTL(ctl, vfio_pin_map_dma_chunk, &args, PMD_SIZE);

PMD_SIZE chunks almost seems too convenient, I wonder a) is that really
enough work per thread, and b) is this really successfully influencing
THP? Thanks,

Alex

> +
> + ktask_ctl_set_undo_func(&ctl, vfio_pin_map_dma_undo);
> + ret = ktask_run((void *)dma->vaddr, map_size, &ctl);
> +
> dma->iommu_mapped = true;
>
> if (ret)
> - vfio_remove_dma(iommu, dma);
> + vfio_remove_dma_finish(iommu, dma);
> + else
> + dma->size += map_size;
>
> return ret;
> }
> @@ -1229,7 +1272,8 @@ static int vfio_iommu_replay(struct vfio_iommu *iommu,
>
> npage = vfio_pin_pages_remote(dma, vaddr,
> n >> PAGE_SHIFT,
> - &pfn, limit);
> + &pfn, limit,
> + current->mm);
> if (npage <= 0) {
> WARN_ON(!npage);
> ret = (int)npage;
> @@ -1497,7 +1541,9 @@ static void vfio_iommu_unmap_unpin_reaccount(struct vfio_iommu *iommu)
> long locked = 0, unlocked = 0;
>
> dma = rb_entry(n, struct vfio_dma, node);
> - unlocked += vfio_unmap_unpin(iommu, dma, false);
> + unlocked += vfio_unmap_unpin(iommu, dma, dma->iova,
> + dma->iova + dma->size, false);
> + dma->iommu_mapped = false;
> p = rb_first(&dma->pfn_list);
> for (; p; p = rb_next(p)) {
> struct vfio_pfn *vpfn = rb_entry(p, struct vfio_pfn,


2018-11-06 01:32:52

by Daniel Jordan

[permalink] [raw]
Subject: Re: [RFC PATCH v4 00/13] ktask: multithread CPU-intensive kernel work

On Mon, Nov 05, 2018 at 06:29:31PM +0100, Michal Hocko wrote:
> On Mon 05-11-18 11:55:45, Daniel Jordan wrote:
> > Michal, you mentioned that ktask should be sensitive to CPU utilization[1].
> > ktask threads now run at the lowest priority on the system to avoid disturbing
> > busy CPUs (more details in patches 4 and 5). Does this address your concern?
> > The plan to address your other comments is explained below.
>
> I have only glanced through the documentation patch and it looks like it
> will be much less disruptive than the previous attempts. Now the obvious
> question is how does this behave on a moderately or even busy system
> when you compare that to a single threaded execution. Some numbers about
> best/worst case execution would be really helpful.

Patches 4 and 5 have some numbers where a ktask and non-ktask workload compete
against each other. Those show either 8 ktask threads on 8 CPUs (worst case) or no ktask threads (best case).

By single threaded execution, I guess you mean 1 ktask thread. I'll run the
experiments that way too and post the numbers.

> I will look closer later.

Great! Thanks for your comment.

Daniel

2018-11-06 02:21:39

by Daniel Jordan

[permalink] [raw]
Subject: Re: [RFC PATCH v4 00/13] ktask: multithread CPU-intensive kernel work

Hi Zi,

On Mon, Nov 05, 2018 at 01:49:14PM -0500, Zi Yan wrote:
> On 5 Nov 2018, at 11:55, Daniel Jordan wrote:
>
> Do you think if it makes sense to use ktask for huge page migration (the data
> copy part)?

It certainly could.

> I did some experiments back in 2016[1], which showed that migrating one 2MB page
> with 8 threads could achieve 2.8x throughput of the existing single-threaded method.
> The problem with my parallel page migration patchset at that time was that it
> has no CPU-utilization awareness, which is solved by your patches now.

Did you run with fewer than 8 threads? I'd want a bigger speedup than 2.8x for
8, and a smaller thread count might improve thread utilization.

It would be nice to multithread at a higher granularity than 2M, too: a range
of THPs might also perform better than a single page.

Thanks for your comments.

> [1]https://lkml.org/lkml/2016/11/22/457

2018-11-06 02:26:58

by Daniel Jordan

[permalink] [raw]
Subject: Re: [RFC PATCH v4 02/13] ktask: multithread CPU-intensive kernel work

On Mon, Nov 05, 2018 at 12:51:33PM -0800, Randy Dunlap wrote:
> On 11/5/18 8:55 AM, Daniel Jordan wrote:
> > diff --git a/init/Kconfig b/init/Kconfig
> > index 41583f468cb4..ed82f76ed0b7 100644
> > --- a/init/Kconfig
> > +++ b/init/Kconfig
> > @@ -346,6 +346,17 @@ config AUDIT_TREE
> > depends on AUDITSYSCALL
> > select FSNOTIFY
> >
> > +config KTASK
> > + bool "Multithread CPU-intensive kernel work"
> > + depends on SMP
> > + default y
> > + help
> > + Parallelize CPU-intensive kernel work. This feature is designed for
> > + big machines that can take advantage of their extra CPUs to speed up
> > + large kernel tasks. When enabled, kworker threads may occupy more
> > + CPU time during these kernel tasks, but these threads are throttled
> > + when other tasks on the system need CPU time.
>
> Use tab + 2 spaces consistently for help text indentation, please.

Ok, will do. Thanks for pointing it out.

2018-11-06 02:28:53

by Daniel Jordan

[permalink] [raw]
Subject: Re: [RFC PATCH v4 01/13] ktask: add documentation

On Mon, Nov 05, 2018 at 01:19:50PM -0800, Randy Dunlap wrote:
> On 11/5/18 8:55 AM, Daniel Jordan wrote:
>
> Hi,
>
> > +Resource Limits
> > +===============
> > +
> > +ktask has resource limits on the number of work items it sends to workqueue.
>
> to a workqueue.
> or: to workqueues.

Ok, I'll do "to workqueues" since ktask uses two internally (NUMA-aware and
non-NUMA-aware).

>
> > +In ktask, a workqueue item is a thread that runs chunks of the task until the
> > +task is finished.
> > +
> > +These limits support the different ways ktask uses workqueues:
> > + - ktask_run to run threads on the calling thread's node.
> > + - ktask_run_numa to run threads on the node(s) specified.
> > + - ktask_run_numa with nid=NUMA_NO_NODE to run threads on any node in the
> > + system.
> > +
> > +To support these different ways of queueing work while maintaining an efficient
> > +concurrency level, we need both system-wide and per-node limits on the number
>
> I would prefer to refer to ktask as ktask instead of "we", so
> s/we need/ktask needs/

Good idea, I'll change it.

> > +of threads. Without per-node limits, a node might become oversubscribed
> > +despite ktask staying within the system-wide limit, and without a system-wide
> > +limit, we can't properly account for work that can run on any node.
>
> s/we/ktask/

Ok.

> > +
> > +The system-wide limit is based on the total number of CPUs, and the per-node
> > +limit on the CPU count for each node. A per-node work item counts against the
> > +system-wide limit. Workqueue's max_active can't accommodate both types of
> > +limit, no matter how many workqueues are used, so ktask implements its own.
> > +
> > +If a per-node limit is reached, the work item is allowed to run anywhere on the
> > +machine to avoid overwhelming the node. If the global limit is also reached,
> > +ktask won't queue additional work items until we fall below the limit again.
>
> s/we fall/ktask falls/
> or s/we fall/it falls/

'ktask.' Will change.

> > +
> > +These limits apply only to workqueue items--that is, helper threads beyond the
> > +one starting the task. That way, one thread per task is always allowed to run.
>
>
> thanks.

Appreciate the feedback!

2018-11-06 02:44:08

by Daniel Jordan

[permalink] [raw]
Subject: Re: [RFC PATCH v4 06/13] vfio: parallelize vfio_pin_map_dma

On Mon, Nov 05, 2018 at 02:51:41PM -0700, Alex Williamson wrote:
> On Mon, 5 Nov 2018 11:55:51 -0500
> Daniel Jordan <[email protected]> wrote:
> > +static int vfio_pin_map_dma_chunk(unsigned long start_vaddr,
> > + unsigned long end_vaddr,
> > + struct vfio_pin_args *args)
> > {
> > - dma_addr_t iova = dma->iova;
> > - unsigned long vaddr = dma->vaddr;
> > - size_t size = map_size;
> > + struct vfio_dma *dma = args->dma;
> > + dma_addr_t iova = dma->iova + (start_vaddr - dma->vaddr);
> > + unsigned long unmapped_size = end_vaddr - start_vaddr;
> > + unsigned long pfn, mapped_size = 0;
> > long npage;
> > - unsigned long pfn, limit = rlimit(RLIMIT_MEMLOCK) >> PAGE_SHIFT;
> > int ret = 0;
> >
> > - while (size) {
> > + while (unmapped_size) {
> > /* Pin a contiguous chunk of memory */
> > - npage = vfio_pin_pages_remote(dma, vaddr + dma->size,
> > - size >> PAGE_SHIFT, &pfn, limit);
> > + npage = vfio_pin_pages_remote(dma, start_vaddr + mapped_size,
> > + unmapped_size >> PAGE_SHIFT,
> > + &pfn, args->limit, args->mm);
> > if (npage <= 0) {
> > WARN_ON(!npage);
> > ret = (int)npage;
> > @@ -1052,22 +1067,50 @@ static int vfio_pin_map_dma(struct vfio_iommu *iommu, struct vfio_dma *dma,
> > }
> >
> > /* Map it! */
> > - ret = vfio_iommu_map(iommu, iova + dma->size, pfn, npage,
> > - dma->prot);
> > + ret = vfio_iommu_map(args->iommu, iova + mapped_size, pfn,
> > + npage, dma->prot);
> > if (ret) {
> > - vfio_unpin_pages_remote(dma, iova + dma->size, pfn,
> > + vfio_unpin_pages_remote(dma, iova + mapped_size, pfn,
> > npage, true);
> > break;
> > }
> >
> > - size -= npage << PAGE_SHIFT;
> > - dma->size += npage << PAGE_SHIFT;
> > + unmapped_size -= npage << PAGE_SHIFT;
> > + mapped_size += npage << PAGE_SHIFT;
> > }
> >
> > + return (ret == 0) ? KTASK_RETURN_SUCCESS : ret;
>
> Overall I'm a big fan of this, but I think there's an undo problem
> here. Per 03/13, kc_undo_func is only called for successfully
> completed chunks and each kc_thread_func should handle cleanup of any
> intermediate work before failure. That's not done here afaict. Should
> we be calling the vfio_pin_map_dma_undo() manually on the completed
> range before returning error?

Yes, we should be, thanks very much for catching this.

At least I documented what I didn't do? :)
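
A minimal sketch of the agreed fix, assuming vfio_pin_map_dma_chunk()
keeps its current structure (sketch only, not a posted patch): undo the
portion of the chunk that did succeed before returning the error, since
ktask's undo callback only runs for chunks that completed.

    /* At the end of vfio_pin_map_dma_chunk(), after the pin/map loop: */
    if (ret)
            /* ktask undoes only completed chunks, so clean up the part
             * of this chunk that succeeded before reporting failure. */
            vfio_pin_map_dma_undo(start_vaddr, start_vaddr + mapped_size,
                                  args);

    return (ret == 0) ? KTASK_RETURN_SUCCESS : ret;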

>
> > +}
> > +
> > +static void vfio_pin_map_dma_undo(unsigned long start_vaddr,
> > + unsigned long end_vaddr,
> > + struct vfio_pin_args *args)
> > +{
> > + struct vfio_dma *dma = args->dma;
> > + dma_addr_t iova = dma->iova + (start_vaddr - dma->vaddr);
> > + dma_addr_t end = dma->iova + (end_vaddr - dma->vaddr);
> > +
> > + vfio_unmap_unpin(args->iommu, args->dma, iova, end, true);
> > +}
> > +
> > +static int vfio_pin_map_dma(struct vfio_iommu *iommu, struct vfio_dma *dma,
> > + size_t map_size)
> > +{
> > + unsigned long limit = rlimit(RLIMIT_MEMLOCK) >> PAGE_SHIFT;
> > + int ret = 0;
> > + struct vfio_pin_args args = { iommu, dma, limit, current->mm };
> > + /* Stay on PMD boundary in case THP is being used. */
> > + DEFINE_KTASK_CTL(ctl, vfio_pin_map_dma_chunk, &args, PMD_SIZE);
>
> PMD_SIZE chunks almost seems too convenient, I wonder a) is that really
> enough work per thread, and b) is this really successfully influencing
> THP? Thanks,

Yes, you're right on both counts. I'd been using PUD_SIZE for a while in
testing and meant to switch it back to KTASK_MEM_CHUNK (128M) but used PMD_SIZE
by mistake. PUD_SIZE chunks have made thread finishing times too spread out
in some cases, so 128M seems to be a reasonable compromise.
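
That is, the control definition in vfio_pin_map_dma() would become
something like the following (a sketch of the intended change, not a
posted patch):

    /* 128M chunks (KTASK_MEM_CHUNK) rather than PMD_SIZE. */
    DEFINE_KTASK_CTL(ctl, vfio_pin_map_dma_chunk, &args, KTASK_MEM_CHUNK);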

Thanks for the thorough and quick review.

Daniel

2018-11-06 02:49:40

by Zi Yan

[permalink] [raw]
Subject: Re: [RFC PATCH v4 00/13] ktask: multithread CPU-intensive kernel work

On 5 Nov 2018, at 21:20, Daniel Jordan wrote:

> Hi Zi,
>
> On Mon, Nov 05, 2018 at 01:49:14PM -0500, Zi Yan wrote:
>> On 5 Nov 2018, at 11:55, Daniel Jordan wrote:
>>
>> Do you think if it makes sense to use ktask for huge page migration (the data
>> copy part)?
>
> It certainly could.
>
>> I did some experiments back in 2016[1], which showed that migrating one 2MB page
>> with 8 threads could achieve 2.8x throughput of the existing single-threaded method.
>> The problem with my parallel page migration patchset at that time was that it
>> has no CPU-utilization awareness, which is solved by your patches now.
>
> Did you run with fewer than 8 threads? I'd want a bigger speedup than 2.8x for
> 8, and a smaller thread count might improve thread utilization.

Yes. When migrating one 2MB THP with migrate_pages() system call on a two-socket server
with 2 E5-2650 v3 CPUs (10 cores per socket) across two sockets, here are the page migration
throughput numbers:

             throughput     factor
 1 thread    2.15 GB/s      1x
 2 threads   3.05 GB/s      1.42x
 4 threads   4.50 GB/s      2.09x
 8 threads   5.98 GB/s      2.78x

>
> It would be nice to multithread at a higher granularity than 2M, too: a range
> of THPs might also perform better than a single page.

Sure. But the kernel currently does not copy multiple pages altogether even if a range
of THPs is migrated. Page copy function is interleaved with page table operations
for every single page.

I also did some study and modified the kernel to improve this, which I called
concurrent page migration in https://lwn.net/Articles/714991/. It further
improves page migration throughput.


—
Best Regards,
Yan Zi


2018-11-06 08:51:25

by Peter Zijlstra

[permalink] [raw]
Subject: Re: [RFC PATCH v4 01/13] ktask: add documentation

On Mon, Nov 05, 2018 at 11:55:46AM -0500, Daniel Jordan wrote:
> +Concept
> +=======
> +
> +ktask is built on unbound workqueues to take advantage of the thread management
> +facilities it provides: creation, destruction, flushing, priority setting, and
> +NUMA affinity.
> +
> +A little terminology up front: A 'task' is the total work there is to do and a
> +'chunk' is a unit of work given to a thread.

So I hate on the task naming. We already have a task, lets not overload
that name.

> +To complete a task using the ktask framework, a client provides a thread
> +function that is responsible for completing one chunk. The thread function is
> +defined in a standard way, with start and end arguments that delimit the chunk
> +as well as an argument that the client uses to pass data specific to the task.
> +
> +In addition, the client supplies an object representing the start of the task
> +and an iterator function that knows how to advance some number of units in the
> +task to yield another object representing the new task position. The framework
> +uses the start object and iterator internally to divide the task into chunks.
> +
> +Finally, the client passes the total task size and a minimum chunk size to
> +indicate the minimum amount of work that's appropriate to do in one chunk. The
> +sizes are given in task-specific units (e.g. pages, inodes, bytes). The
> +framework uses these sizes, along with the number of online CPUs and an
> +internal maximum number of threads, to decide how many threads to start and how
> +many chunks to divide the task into.
> +
> +For example, consider the task of clearing a gigantic page. This used to be
> +done in a single thread with a for loop that calls a page clearing function for
> +each constituent base page. To parallelize with ktask, the client first moves
> +the for loop to the thread function, adapting it to operate on the range passed
> +to the function. In this simple case, the thread function's start and end
> +arguments are just addresses delimiting the portion of the gigantic page to
> +clear. Then, where the for loop used to be, the client calls into ktask with
> +the start address of the gigantic page, the total size of the gigantic page,
> +and the thread function. Internally, ktask will divide the address range into
> +an appropriate number of chunks and start an appropriate number of threads to
> +complete these chunks.

I see no mention of padata anywhere; I also don't see mention of the
async init stuff. Both appear to me to share, at least in part, the same
reason for existence.

> +Scheduler Interaction
> +=====================
> +
> +Even within the resource limits, ktask must take care to run a number of
> +threads appropriate for the system's current CPU load. Under high CPU usage,
> +starting excessive helper threads may disturb other tasks, unfairly taking CPU
> +time away from them for the sake of an optimized kernel code path.
> +
> +ktask plays nicely in this case by setting helper threads to the lowest
> +scheduling priority on the system (MAX_NICE). This way, helpers' CPU time is
> +appropriately throttled on a busy system and other tasks are not disturbed.
> +
> +The main thread initiating the task remains at its original priority so that it
> +still makes progress on a busy system.
> +
> +It is possible for a helper thread to start running and then be forced off-CPU
> +by a higher priority thread. With the helper's CPU time curtailed by MAX_NICE,
> +the main thread may wait longer for the task to finish than it would have had
> +it not started any helpers, so to ensure forward progress at a single-threaded
> +pace, once the main thread is finished with all outstanding work in the task,
> +the main thread wills its priority to one helper thread at a time. At least
> +one thread will then always be running at the priority of the calling thread.

What isn't clear is if this calling thread is waiting or not. Only do
this inheritance trick if it is actually waiting on the work. If it is
not, nobody cares.

> +Cgroup Awareness
> +================
> +
> +Given the potentially large amount of CPU time ktask threads may consume, they
> +should be aware of the cgroup of the task that called into ktask and
> +appropriately throttled.
> +
> +TODO: Implement cgroup-awareness in unbound workqueues.

Yes.. that needs done.

> +Power Management
> +================
> +
> +Starting additional helper threads may cause the system to consume more energy,
> +which is undesirable on energy-conscious devices. Therefore ktask needs to be
> +aware of cpufreq policies and scaling governors.
> +
> +If an energy-conscious policy is in use (e.g. powersave, conservative) on any
> +part of the system, that is a signal that the user has strong power management
> +preferences, in which case ktask is disabled.
> +
> +TODO: Implement this.

No, don't do that, its broken. Also, we're trying to move to a single
cpufreq governor for all.

Sure we'll retain 'performance', but powersave and conservative and all
that nonsense should go away eventually.

That's not saying you don't need a knob for this; but don't look at
cpufreq for this.
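
To make the gigantic-page example in the documentation excerpt quoted
above concrete, a client might look roughly like this, using only the
interfaces visible elsewhere in this thread (DEFINE_KTASK_CTL, ktask_run,
KTASK_RETURN_SUCCESS, KTASK_MEM_CHUNK); the names clear_mem_chunk,
gp_vaddr, and gp_size are illustrative, not from the series:

    /* Thread function: clears one base page at a time within its chunk. */
    static int clear_mem_chunk(unsigned long start_addr,
                               unsigned long end_addr, void *unused)
    {
            unsigned long addr;

            for (addr = start_addr; addr < end_addr; addr += PAGE_SIZE)
                    clear_page((void *)addr);

            return KTASK_RETURN_SUCCESS;
    }

    /* In the caller, where the single-threaded for loop used to be: */
    DEFINE_KTASK_CTL(ctl, clear_mem_chunk, NULL, KTASK_MEM_CHUNK);
    ret = ktask_run((void *)gp_vaddr, gp_size, &ctl);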


2018-11-06 09:23:41

by Michal Hocko

[permalink] [raw]
Subject: Re: [RFC PATCH v4 00/13] ktask: multithread CPU-intensive kernel work

On Mon 05-11-18 17:29:55, Daniel Jordan wrote:
> On Mon, Nov 05, 2018 at 06:29:31PM +0100, Michal Hocko wrote:
> > On Mon 05-11-18 11:55:45, Daniel Jordan wrote:
> > > Michal, you mentioned that ktask should be sensitive to CPU utilization[1].
> > > ktask threads now run at the lowest priority on the system to avoid disturbing
> > > busy CPUs (more details in patches 4 and 5). Does this address your concern?
> > > The plan to address your other comments is explained below.
> >
> > I have only glanced through the documentation patch and it looks like it
> > will be much less disruptive than the previous attempts. Now the obvious
> > question is how does this behave on a moderately or even busy system
> > when you compare that to a single threaded execution. Some numbers about
> > best/worst case execution would be really helpful.
>
> Patches 4 and 5 have some numbers where a ktask and non-ktask workload compete
> against each other. Those show either 8 ktask threads on 8 CPUs (worst case) or no ktask threads (best case).
>
> By single threaded execution, I guess you mean 1 ktask thread. I'll run the
> experiments that way too and post the numbers.

I mean a comparison of how much time it takes to accomplish the same
amount of work if it was done single-threaded, compared to ktask-based
distribution, on an idle system (best case for both) and a fully
contended system (the worst case). It would also be great to get some
numbers on a partially contended system to see how well the priority
handover etc. works under different CPU contention.
--
Michal Hocko
SUSE Labs

2018-11-06 20:36:26

by Daniel Jordan

[permalink] [raw]
Subject: Re: [RFC PATCH v4 01/13] ktask: add documentation

On Tue, Nov 06, 2018 at 09:49:11AM +0100, Peter Zijlstra wrote:
> On Mon, Nov 05, 2018 at 11:55:46AM -0500, Daniel Jordan wrote:
> > +Concept
> > +=======
> > +
> > +ktask is built on unbound workqueues to take advantage of the thread management
> > +facilities it provides: creation, destruction, flushing, priority setting, and
> > +NUMA affinity.
> > +
> > +A little terminology up front: A 'task' is the total work there is to do and a
> > +'chunk' is a unit of work given to a thread.
>
> So I hate on the task naming. We already have a task, lets not overload
> that name.

Ok, agreed, it's a crowded field with 'task', 'work', 'thread'...

Maybe 'job', since nothing seems to have taken that in kernel/.

> I see no mention of padata anywhere; I also don't see mention of the
> async init stuff. Both appear to me to share, at least in part, the same
> reason for existence.

padata is news to me. From reading its doc, it comes with some special
requirements of its own, like softirqs disabled during the parallel callback,
and some ktask users need to sleep. I'll check whether it could be reworked to
handle this.

And yes, async shares the same basic infrastructure, but ktask callers need to
wait, so the two seem fundamentally at odds. I'll add this explanation in.

>
> > +Scheduler Interaction
> > +=====================
...
> > +It is possible for a helper thread to start running and then be forced off-CPU
> > +by a higher priority thread. With the helper's CPU time curtailed by MAX_NICE,
> > +the main thread may wait longer for the task to finish than it would have had
> > +it not started any helpers, so to ensure forward progress at a single-threaded
> > +pace, once the main thread is finished with all outstanding work in the task,
> > +the main thread wills its priority to one helper thread at a time. At least
> > +one thread will then always be running at the priority of the calling thread.
>
> What isn't clear is if this calling thread is waiting or not. Only do
> this inheritance trick if it is actually waiting on the work. If it is
> not, nobody cares.

The calling thread waits. Even if it didn't though, the inheritance trick
would still be desirable for timely completion of the job.

>
> > +Cgroup Awareness
> > +================
> > +
> > +Given the potentially large amount of CPU time ktask threads may consume, they
> > +should be aware of the cgroup of the task that called into ktask and
> > +appropriately throttled.
> > +
> > +TODO: Implement cgroup-awareness in unbound workqueues.
>
> Yes.. that needs done.

Great.

>
> > +Power Management
> > +================
> > +
> > +Starting additional helper threads may cause the system to consume more energy,
> > +which is undesirable on energy-conscious devices. Therefore ktask needs to be
> > +aware of cpufreq policies and scaling governors.
> > +
> > +If an energy-conscious policy is in use (e.g. powersave, conservative) on any
> > +part of the system, that is a signal that the user has strong power management
> > +preferences, in which case ktask is disabled.
> > +
> > +TODO: Implement this.
>
> No, don't do that, its broken. Also, we're trying to move to a single
> cpufreq governor for all.
>
> Sure we'll retain 'performance', but powersave and conservative and all
> that nonsense should go away eventually.

Ok, good to know.

> That's not saying you don't need a knob for this; but don't look at
> cpufreq for this.

Ok, I'll dig through power management to see what else is there. Maybe there's
some way to ask "is this machine energy conscious?"

Thanks for looking through this!

2018-11-06 21:18:08

by Daniel Jordan

[permalink] [raw]
Subject: Re: [RFC PATCH v4 00/13] ktask: multithread CPU-intensive kernel work

On Mon, Nov 05, 2018 at 09:48:56PM -0500, Zi Yan wrote:
> On 5 Nov 2018, at 21:20, Daniel Jordan wrote:
>
> > Hi Zi,
> >
> > On Mon, Nov 05, 2018 at 01:49:14PM -0500, Zi Yan wrote:
> >> On 5 Nov 2018, at 11:55, Daniel Jordan wrote:
> >>
> >> Do you think if it makes sense to use ktask for huge page migration (the data
> >> copy part)?
> >
> > It certainly could.
> >
> >> I did some experiments back in 2016[1], which showed that migrating one 2MB page
> >> with 8 threads could achieve 2.8x throughput of the existing single-threaded method.
> >> The problem with my parallel page migration patchset at that time was that it
> >> has no CPU-utilization awareness, which is solved by your patches now.
> >
> > Did you run with fewer than 8 threads? I'd want a bigger speedup than 2.8x for
> > 8, and a smaller thread count might improve thread utilization.
>
> Yes. When migrating one 2MB THP with migrate_pages() system call on a two-socket server
> with 2 E5-2650 v3 CPUs (10 cores per socket) across two sockets, here are the page migration
> throughput numbers:
>
> throughput factor
> 1 thread 2.15 GB/s 1x
> 2 threads 3.05 GB/s 1.42x
> 4 threads 4.50 GB/s 2.09x
> 8 threads 5.98 GB/s 2.78x

Thanks. It looks like in your patches you start a worker for every piece of the
huge page copy and have the main thread wait. I'm curious what the workqueue
overhead is like on your machine. On a newer Xeon it's ~50usec from queueing a
work to starting to execute it and another ~20usec to flush a work
(barrier_func), which could happen after the work is already done. That's a
pretty significant piece of the copy time for part of a THP.

bash 60728 [087] 155865.157116: probe:ktask_run: (ffffffffb7ee7a80)
bash 60728 [087] 155865.157119: workqueue:workqueue_queue_work: work struct=0xffff95fb73276000
bash 60728 [087] 155865.157119: workqueue:workqueue_activate_work: work struct 0xffff95fb73276000
kworker/u194:3- 86730 [095] 155865.157168: workqueue:workqueue_execute_start: work struct 0xffff95fb73276000: function ktask_thread
kworker/u194:3- 86730 [095] 155865.157170: workqueue:workqueue_execute_end: work struct 0xffff95fb73276000
kworker/u194:3- 86730 [095] 155865.157171: workqueue:workqueue_execute_start: work struct 0xffffa676995bfb90: function wq_barrier_func
kworker/u194:3- 86730 [095] 155865.157190: workqueue:workqueue_execute_end: work struct 0xffffa676995bfb90
bash 60728 [087] 155865.157207: probe:ktask_run_ret__return: (ffffffffb7ee7a80 <- ffffffffb7ee7b7b)

> >
> > It would be nice to multithread at a higher granularity than 2M, too: a range
> > of THPs might also perform better than a single page.
>
> Sure. But the kernel currently does not copy multiple pages altogether even if a range
> of THPs is migrated. Page copy function is interleaved with page table operations
> for every single page.
>
> I also did some study and modified the kernel to improve this, which I called
> concurrent page migration in https://lwn.net/Articles/714991/. It further
> improves page migration throughput.

Ok, over 4x with 8 threads for 16 THPs. Is 16 a typical number for migration,
or does it get larger? What workloads do you have in mind with this change?

2018-11-06 21:51:44

by Jason Gunthorpe

[permalink] [raw]
Subject: Re: [RFC PATCH v4 01/13] ktask: add documentation

On Tue, Nov 06, 2018 at 12:34:11PM -0800, Daniel Jordan wrote:

> > What isn't clear is if this calling thread is waiting or not. Only do
> > this inheritance trick if it is actually waiting on the work. If it is
> > not, nobody cares.
>
> The calling thread waits. Even if it didn't though, the inheritance trick
> would still be desirable for timely completion of the job.

Can you make lockdep aware that this is synchronous?

ie if I do

mutex_lock()
ktask_run()
mutex_lock()

Can lockdep know that all the workers are running under that lock?

I'm thinking particularly about rtnl_lock as a possible case, but
there could also make some sense to hold the read side of the mm_sem
or similar like the above.

Jason

2018-11-07 10:28:49

by Peter Zijlstra

[permalink] [raw]
Subject: Re: [RFC PATCH v4 01/13] ktask: add documentation

On Tue, Nov 06, 2018 at 08:51:54PM +0000, Jason Gunthorpe wrote:
> On Tue, Nov 06, 2018 at 12:34:11PM -0800, Daniel Jordan wrote:
>
> > > What isn't clear is if this calling thread is waiting or not. Only do
> > > this inheritance trick if it is actually waiting on the work. If it is
> > > not, nobody cares.
> >
> > The calling thread waits. Even if it didn't though, the inheritance trick
> > would still be desirable for timely completion of the job.
>
> Can you make lockdep aware that this is synchronous?
>
> ie if I do
>
> mutex_lock()
> ktask_run()
> mutex_lock()
>
> Can lockdep know that all the workers are running under that lock?
>
> I'm thinking particularly about rtnl_lock as a possible case, but
> there could also make some sense to hold the read side of the mm_sem
> or similar like the above.

Yes, the normal trick is adding a fake lock to ktask_run and holding
that over the actual job. See lock_map* in flush_workqueue() vs
process_one_work().



2018-11-07 10:36:54

by Peter Zijlstra

[permalink] [raw]
Subject: Re: [RFC PATCH v4 01/13] ktask: add documentation

On Tue, Nov 06, 2018 at 12:34:11PM -0800, Daniel Jordan wrote:
> On Tue, Nov 06, 2018 at 09:49:11AM +0100, Peter Zijlstra wrote:
> > On Mon, Nov 05, 2018 at 11:55:46AM -0500, Daniel Jordan wrote:
> > > +Concept
> > > +=======
> > > +
> > > +ktask is built on unbound workqueues to take advantage of the thread management
> > > +facilities it provides: creation, destruction, flushing, priority setting, and
> > > +NUMA affinity.
> > > +
> > > +A little terminology up front: A 'task' is the total work there is to do and a
> > > +'chunk' is a unit of work given to a thread.
> >
> > So I hate on the task naming. We already have a task, lets not overload
> > that name.
>
> Ok, agreed, it's a crowded field with 'task', 'work', 'thread'...
>
> Maybe 'job', since nothing seems to have taken that in kernel/.

Do we want to somehow convey the fundamentally parallel nature of the
thing?

> > I see no mention of padata anywhere; I also don't see mention of the
> > async init stuff. Both appear to me to share, at least in part, the same
> > reason for existence.
>
> padata is news to me. From reading its doc, it comes with some special
> requirements of its own, like softirqs disabled during the parallel callback,
> and some ktask users need to sleep. I'll check whether it could be reworked to
> handle this.

Right, padata is something that came from the network stack I think.
It's a bit of an odd thing, but it would be nice if we can fold it into
something larger.

> And yes, async shares the same basic infrastructure, but ktask callers need to
> wait, so the two seem fundamentally at odds. I'll add this explanation in.

Why does ktask have to be fundamentally async?

> > > +Scheduler Interaction
> > > +=====================
> ...
> > > +It is possible for a helper thread to start running and then be forced off-CPU
> > > +by a higher priority thread. With the helper's CPU time curtailed by MAX_NICE,
> > > +the main thread may wait longer for the task to finish than it would have had
> > > +it not started any helpers, so to ensure forward progress at a single-threaded
> > > +pace, once the main thread is finished with all outstanding work in the task,
> > > +the main thread wills its priority to one helper thread at a time. At least
> > > +one thread will then always be running at the priority of the calling thread.
> >
> > What isn't clear is if this calling thread is waiting or not. Only do
> > this inheritance trick if it is actually waiting on the work. If it is
> > not, nobody cares.
>
> The calling thread waits. Even if it didn't though, the inheritance trick
> would still be desirable for timely completion of the job.

No, if nobody is waiting on it, it really doesn't matter.

> > > +Power Management
> > > +================
> > > +
> > > +Starting additional helper threads may cause the system to consume more energy,
> > > +which is undesirable on energy-conscious devices. Therefore ktask needs to be
> > > +aware of cpufreq policies and scaling governors.
> > > +
> > > +If an energy-conscious policy is in use (e.g. powersave, conservative) on any
> > > +part of the system, that is a signal that the user has strong power management
> > > +preferences, in which case ktask is disabled.
> > > +
> > > +TODO: Implement this.
> >
> > No, don't do that, its broken. Also, we're trying to move to a single
> > cpufreq governor for all.
> >
> > Sure we'll retain 'performance', but powersave and conservative and all
> > that nonsense should go away eventually.
>
> Ok, good to know.
>
> > That's not saying you don't need a knob for this; but don't look at
> > cpufreq for this.
>
> Ok, I'll dig through power management to see what else is there. Maybe there's
> some way to ask "is this machine energy conscious?"

IIRC you're presenting at LPC, drop by the Power Management and
Energy-awareness MC.

2018-11-07 20:19:02

by Daniel Jordan

[permalink] [raw]
Subject: Re: [RFC PATCH v4 00/13] ktask: multithread CPU-intensive kernel work

On Tue, Nov 06, 2018 at 10:21:45AM +0100, Michal Hocko wrote:
> On Mon 05-11-18 17:29:55, Daniel Jordan wrote:
> > On Mon, Nov 05, 2018 at 06:29:31PM +0100, Michal Hocko wrote:
> > > On Mon 05-11-18 11:55:45, Daniel Jordan wrote:
> > > > Michal, you mentioned that ktask should be sensitive to CPU utilization[1].
> > > > ktask threads now run at the lowest priority on the system to avoid disturbing
> > > > busy CPUs (more details in patches 4 and 5). Does this address your concern?
> > > > The plan to address your other comments is explained below.
> > >
> > > I have only glanced through the documentation patch and it looks like it
> > > will be much less disruptive than the previous attempts. Now the obvious
> > > question is how does this behave on a moderately or even busy system
> > > when you compare that to a single threaded execution. Some numbers about
> > > best/worst case execution would be really helpful.
> >
> > Patches 4 and 5 have some numbers where a ktask and non-ktask workload compete
> > against each other. Those show either 8 ktask threads on 8 CPUs (worst case) or no ktask threads (best case).
> >
> > By single threaded execution, I guess you mean 1 ktask thread. I'll run the
> > experiments that way too and post the numbers.
>
> I mean a comparision of how much time it gets to accomplish the same
> amount of work if it was done singlethreaded to ktask based distribution
> on a idle system (best case for both) and fully contended system (the
> worst case). It would be also great to get some numbers on partially
> contended system to see how much the priority handover etc. acts under
> different CPU contention.

Ok, thanks for clarifying.

Testing notes
- The two workloads used were confined to run anywhere within an 8-CPU cpumask
- The vfio workload started a 64G VM using THP
- usemem was enlisted to create CPU load doing page clearing, just as the vfio
case is doing, so the two compete for the same system resources. usemem ran
four times with each of its threads allocating and freeing 30G of memory each
time. Four usemem threads simulate Michal's partially contended system
- ktask helpers always run at MAX_NICE
- renice?=yes means run with patch 5, renice?=no means without
- CPU: 2 nodes * 24 cores/node * 2 threads/core = 96 CPUs
Intel(R) Xeon(R) Platinum 8160 CPU @ 2.10GHz

      vfio  usemem
       thr     thr  renice?         ktask sec        usemem sec
     -----  ------  -------  ----------------  ----------------
                 4      n/a                     24.0 ( ± 0.1% )
                 8      n/a                     25.3 ( ± 0.0% )

         1       0      n/a   13.5 ( ± 0.0% )
         1       4      n/a   14.2 ( ± 0.4% )   24.1 ( ± 0.3% )
 ***     1       8      n/a  17.3 ( ± 10.4% )   29.7 ( ± 0.4% )

         8       0       no    2.8 ( ± 1.5% )
         8       4       no    4.7 ( ± 0.8% )   24.1 ( ± 0.2% )
         8       8       no   13.7 ( ± 8.8% )   27.2 ( ± 1.2% )

         8       0      yes    2.8 ( ± 1.0% )
         8       4      yes    4.7 ( ± 1.4% )   24.1 ( ± 0.0% )
 ***     8       8      yes    9.2 ( ± 2.2% )   27.0 ( ± 0.4% )

Renicing under partial contention (usemem nthr=4) doesn't affect vfio, but
renicing under heavy contention (usemem nthr=8) does: the 8-thread vfio case is
slower when the ktask master thread doesn't will its priority to one helper at
a time.

Comparing the ***'d lines, using 8 vfio threads instead of 1 causes the threads
of both workloads to finish sooner under heavy contention.

2018-11-07 20:23:24

by Daniel Jordan

[permalink] [raw]
Subject: Re: [RFC PATCH v4 01/13] ktask: add documentation

On Wed, Nov 07, 2018 at 11:27:52AM +0100, Peter Zijlstra wrote:
> On Tue, Nov 06, 2018 at 08:51:54PM +0000, Jason Gunthorpe wrote:
> > On Tue, Nov 06, 2018 at 12:34:11PM -0800, Daniel Jordan wrote:
> >
> > > > What isn't clear is if this calling thread is waiting or not. Only do
> > > > this inheritance trick if it is actually waiting on the work. If it is
> > > > not, nobody cares.
> > >
> > > The calling thread waits. Even if it didn't though, the inheritance trick
> > > would still be desirable for timely completion of the job.
> >
> > Can you make lockdep aware that this is synchronous?
> >
> > ie if I do
> >
> > mutex_lock()
> > ktask_run()
> > mutex_lock()
> >
> > Can lockdep know that all the workers are running under that lock?
> >
> > I'm thinking particularly about rtnl_lock as a possible case, but
> > there could also make some sense to hold the read side of the mm_sem
> > or similar like the above.
>
> Yes, the normal trick is adding a fake lock to ktask_run and holding
> that over the actual job. See lock_map* in flush_workqueue() vs
> process_one_work().

I'll add that for the next version.
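
Roughly, what I have in mind is something like the sketch below (the names and
placement are illustrative, not the final code), mirroring the
flush_workqueue()/process_one_work() pairing:

#include <linux/lockdep.h>

/* A fake lock representing "a ktask job is in flight". */
static struct lock_class_key ktask_lock_key;
static struct lockdep_map ktask_lockdep_map =
	STATIC_LOCKDEP_MAP_INIT("ktask_run", &ktask_lock_key);

/*
 * Caller side (ktask_run): record "locks held by the caller -> ktask job",
 * the way flush_workqueue() touches wq->lockdep_map.
 */
static void ktask_run_lockdep_hint(void)
{
	lock_map_acquire(&ktask_lockdep_map);
	lock_map_release(&ktask_lockdep_map);
}

/*
 * Helper side (per chunk): record "ktask job -> locks taken by the thread
 * function", the way process_one_work() does for each work item.
 */
static void ktask_chunk_run(void (*thread_func)(void *arg), void *arg)
{
	lock_map_acquire(&ktask_lockdep_map);
	thread_func(arg);
	lock_map_release(&ktask_lockdep_map);
}

With both halves in place, lockdep can connect a lock held around ktask_run()
to any lock taken inside the thread functions, which should catch the
mutex_lock()/ktask_run()/mutex_lock() case Jason describes.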

2018-11-07 21:21:41

by Daniel Jordan

[permalink] [raw]
Subject: Re: [RFC PATCH v4 01/13] ktask: add documentation

On Wed, Nov 07, 2018 at 11:35:54AM +0100, Peter Zijlstra wrote:
> On Tue, Nov 06, 2018 at 12:34:11PM -0800, Daniel Jordan wrote:
> > On Tue, Nov 06, 2018 at 09:49:11AM +0100, Peter Zijlstra wrote:
> > > On Mon, Nov 05, 2018 at 11:55:46AM -0500, Daniel Jordan wrote:
> > > > +Concept
> > > > +=======
> > > > +
> > > > +ktask is built on unbound workqueues to take advantage of the thread management
> > > > +facilities it provides: creation, destruction, flushing, priority setting, and
> > > > +NUMA affinity.
> > > > +
> > > > +A little terminology up front: A 'task' is the total work there is to do and a
> > > > +'chunk' is a unit of work given to a thread.
> > >
> > > So I hate on the task naming. We already have a task, lets not overload
> > > that name.
> >
> > Ok, agreed, it's a crowded field with 'task', 'work', 'thread'...
> >
> > Maybe 'job', since nothing seems to have taken that in kernel/.
>
> Do we want to somehow convey the fundamentally parallel nature of the
> thing?

Ok, I've consulted my thesaurus and everything. Best I can come up with is
just 'parallel', so for example parallel_run would be the interface. Or
para_run.

Going to think about this more, and I'm open to suggestions.

>
> > > I see no mention of padata anywhere; I also don't see mention of the
> > > async init stuff. Both appear to me to share, at least in part, the same
> > > reason for existence.
> >
> > padata is news to me. From reading its doc, it comes with some special
> > requirements of its own, like softirqs disabled during the parallel callback,
> > and some ktask users need to sleep. I'll check whether it could be reworked to
> > handle this.
>
> Right, padata is something that came from the network stack I think.
> It's a bit of an odd thing, but it would be nice if we can fold it into
> something larger.

Sure, I'll see how it goes.

> > And yes, async shares the same basic infrastructure, but ktask callers need to
> > wait, so the two seem fundamentally at odds. I'll add this explanation in.
>
> Why does ktask have to be fundamentally async?

Assuming you mean sync. It doesn't have to be synchronous always, but some
users need that. A caller that clears a page shouldn't return to userland
before the job is done.

Anyway, sure, it may come out clean to encapsulate the async parts of async.c
into their own paths and then find a new name for that file. I'll see how this
goes too.

>
> > > > +Scheduler Interaction
> > > > +=====================
> > ...
> > > > +It is possible for a helper thread to start running and then be forced off-CPU
> > > > +by a higher priority thread. With the helper's CPU time curtailed by MAX_NICE,
> > > > +the main thread may wait longer for the task to finish than it would have had
> > > > +it not started any helpers, so to ensure forward progress at a single-threaded
> > > > +pace, once the main thread is finished with all outstanding work in the task,
> > > > +the main thread wills its priority to one helper thread at a time. At least
> > > > +one thread will then always be running at the priority of the calling thread.
> > >
> > > What isn't clear is if this calling thread is waiting or not. Only do
> > > this inheritance trick if it is actually waiting on the work. If it is
> > > not, nobody cares.
> >
> > The calling thread waits. Even if it didn't though, the inheritance trick
> > would still be desirable for timely completion of the job.
>
> No, if nobody is waiting on it, it really doesn't matter.

Ok, I think I see what you meant: if nobody is waiting on it, it's not
desirable from a performance POV. Agree.

>
> > > > +Power Management
> > > > +================
> > > > +
> > > > +Starting additional helper threads may cause the system to consume more energy,
> > > > +which is undesirable on energy-conscious devices. Therefore ktask needs to be
> > > > +aware of cpufreq policies and scaling governors.
> > > > +
> > > > +If an energy-conscious policy is in use (e.g. powersave, conservative) on any
> > > > +part of the system, that is a signal that the user has strong power management
> > > > +preferences, in which case ktask is disabled.
> > > > +
> > > > +TODO: Implement this.
> > >
> > > No, don't do that, its broken. Also, we're trying to move to a single
> > > cpufreq governor for all.
> > >
> > > Sure we'll retain 'performance', but powersave and conservative and all
> > > that nonsense should go away eventually.
> >
> > Ok, good to know.
> >
> > > That's not saying you don't need a knob for this; but don't look at
> > > cpufreq for this.
> >
> > Ok, I'll dig through power management to see what else is there. Maybe there's
> > some way to ask "is this machine energy conscious?"
>
> IIRC you're presenting at LPC, drop by the Power Management and
> Energy-awareness MC.

Will do.

2018-11-08 17:27:50

by Jonathan Corbet

[permalink] [raw]
Subject: Re: [RFC PATCH v4 01/13] ktask: add documentation

On Mon, 5 Nov 2018 11:55:46 -0500
Daniel Jordan <[email protected]> wrote:

> Motivates and explains the ktask API for kernel clients.

A couple of quick thoughts:

- Agree with Peter on the use of "task"; something like "job" would be far
less likely to create confusion. Maybe you could even call it a "batch
job" to give us old-timers warm fuzzies...:)

- You have kerneldoc comments for the API functions, but you don't pull
those into the documentation itself. Adding some kernel-doc directives
could help to fill things out nicely with little effort.

Thanks,

jon

2018-11-08 19:19:52

by Daniel Jordan

[permalink] [raw]
Subject: Re: [RFC PATCH v4 01/13] ktask: add documentation

On Thu, Nov 08, 2018 at 10:26:38AM -0700, Jonathan Corbet wrote:
> On Mon, 5 Nov 2018 11:55:46 -0500
> Daniel Jordan <[email protected]> wrote:
>
> > Motivates and explains the ktask API for kernel clients.
>
> A couple of quick thoughts:
>
> - Agree with Peter on the use of "task"; something like "job" would be far
> less likely to create confusion. Maybe you could even call it a "batch
> job" to give us old-timers warm fuzzies...:)

smp_job? Or smp_batch, for that retro flavor? :)

>
> - You have kerneldoc comments for the API functions, but you don't pull
> those into the documentation itself. Adding some kernel-doc directives
> could help to fill things out nicely with little effort.

I thought this part of ktask.rst handled that, or am I not doing it right?

Interface
=========

.. kernel-doc:: include/linux/ktask.h

Thanks for the comments,
Daniel

2018-11-08 19:25:23

by Jonathan Corbet

[permalink] [raw]
Subject: Re: [RFC PATCH v4 01/13] ktask: add documentation

On Thu, 8 Nov 2018 11:15:53 -0800
Daniel Jordan <[email protected]> wrote:

> > - You have kerneldoc comments for the API functions, but you don't pull
> > those into the documentation itself. Adding some kernel-doc directives
> > could help to fill things out nicely with little effort.
>
> I thought this part of ktask.rst handled that, or am I not doing it right?
>
> Interface
> =========
>
> .. kernel-doc:: include/linux/ktask.h

Sigh, no, you're doing it just fine, and I clearly wasn't sufficiently
caffeinated. Apologies for the noise.

jon

Subject: RE: [RFC PATCH v4 11/13] mm: parallelize deferred struct page initialization within each node

> -----Original Message-----
> From: [email protected] <linux-kernel-
> [email protected]> On Behalf Of Daniel Jordan
> Sent: Monday, November 05, 2018 10:56 AM
> Subject: [RFC PATCH v4 11/13] mm: parallelize deferred struct page
> initialization within each node
>
> ... The kernel doesn't
> know the memory bandwidth of a given system to get the most efficient
> number of threads, so there's some guesswork involved.

The ACPI HMAT (Heterogeneous Memory Attribute Table) is designed to report
that kind of information, and could facilitate automatic tuning.

There was discussion last year about kernel support for it:
https://lore.kernel.org/lkml/[email protected]/


> In testing, a reasonable value turned out to be about a quarter of the
> CPUs on the node.
...
> + /*
> + * We'd like to know the memory bandwidth of the chip to
> calculate the
> + * most efficient number of threads to start, but we can't.
> + * In testing, a good value for a variety of systems was a
> quarter of the CPUs on the node.
> + */
> + nr_node_cpus = DIV_ROUND_UP(cpumask_weight(cpumask), 4);


You might want to base that calculation on and limit the threads to
physical cores, not hyperthreaded cores.

---
Robert Elliott, HPE Persistent Memory



2018-11-12 16:55:27

by Daniel Jordan

[permalink] [raw]
Subject: Re: [RFC PATCH v4 11/13] mm: parallelize deferred struct page initialization within each node

On Sat, Nov 10, 2018 at 03:48:14AM +0000, Elliott, Robert (Persistent Memory) wrote:
> > -----Original Message-----
> > From: [email protected] <linux-kernel-
> > [email protected]> On Behalf Of Daniel Jordan
> > Sent: Monday, November 05, 2018 10:56 AM
> > Subject: [RFC PATCH v4 11/13] mm: parallelize deferred struct page
> > initialization within each node
> >
> > ... The kernel doesn't
> > know the memory bandwidth of a given system to get the most efficient
> > number of threads, so there's some guesswork involved.
>
> The ACPI HMAT (Heterogeneous Memory Attribute Table) is designed to report
> that kind of information, and could facilitate automatic tuning.
>
> There was discussion last year about kernel support for it:
> https://lore.kernel.org/lkml/[email protected]/

Thanks for bringing this up. I'm traveling but will take a closer look when I
get back.

> > In testing, a reasonable value turned out to be about a quarter of the
> > CPUs on the node.
> ...
> > + /*
> > + * We'd like to know the memory bandwidth of the chip to
> > calculate the
> > + * most efficient number of threads to start, but we can't.
> > + * In testing, a good value for a variety of systems was a
> > quarter of the CPUs on the node.
> > + */
> > + nr_node_cpus = DIV_ROUND_UP(cpumask_weight(cpumask), 4);
>
>
> You might want to base that calculation on and limit the threads to
> physical cores, not hyperthreaded cores.

Why? Hyperthreads can be beneficial when waiting on memory. That said, I
don't have data that shows that in this case.

Subject: RE: [RFC PATCH v4 11/13] mm: parallelize deferred struct page initialization within each node



> -----Original Message-----
> From: Daniel Jordan <[email protected]>
> Sent: Monday, November 12, 2018 11:54 AM
> To: Elliott, Robert (Persistent Memory) <[email protected]>
> Subject: Re: [RFC PATCH v4 11/13] mm: parallelize deferred struct page
> initialization within each node
>
> On Sat, Nov 10, 2018 at 03:48:14AM +0000, Elliott, Robert (Persistent
> Memory) wrote:
> > > -----Original Message-----
> > > From: [email protected] <linux-kernel-
> > > [email protected]> On Behalf Of Daniel Jordan
> > > Sent: Monday, November 05, 2018 10:56 AM
> > > Subject: [RFC PATCH v4 11/13] mm: parallelize deferred struct page
> > > initialization within each node
> > >
...
> > > In testing, a reasonable value turned out to be about a quarter of the
> > > CPUs on the node.
> > ...
> > > + /*
> > > + * We'd like to know the memory bandwidth of the chip to
> > > calculate the
> > > + * most efficient number of threads to start, but we can't.
> > > + * In testing, a good value for a variety of systems was a
> > > quarter of the CPUs on the node.
> > > + */
> > > + nr_node_cpus = DIV_ROUND_UP(cpumask_weight(cpumask), 4);
> >
> >
> > You might want to base that calculation on and limit the threads to
> > physical cores, not hyperthreaded cores.
>
> Why? Hyperthreads can be beneficial when waiting on memory. That said, I
> don't have data that shows that in this case.

I think that's only if there are some register-based calculations to do while
waiting. If both threads are just doing memory accesses, they'll both stall, and
there doesn't seem to be any benefit in having two contexts generate the IOs
rather than one (at least on the systems I've used). I think it takes longer
to switch contexts than to just turn around the next IO.


---
Robert Elliott, HPE Persistent Memory




2018-11-13 16:35:44

by Tejun Heo

[permalink] [raw]
Subject: Re: [RFC PATCH v4 05/13] workqueue, ktask: renice helper threads to prevent starvation

Hello, Daniel.

On Mon, Nov 05, 2018 at 11:55:50AM -0500, Daniel Jordan wrote:
> static bool start_flush_work(struct work_struct *work, struct wq_barrier *barr,
> - bool from_cancel)
> + struct nice_work *nice_work, int flags)
> {
> struct worker *worker = NULL;
> struct worker_pool *pool;
> @@ -2868,11 +2926,19 @@ static bool start_flush_work(struct work_struct *work, struct wq_barrier *barr,
> if (pwq) {
> if (unlikely(pwq->pool != pool))
> goto already_gone;
> +
> + /* not yet started, insert linked work before work */
> + if (unlikely(flags & WORK_FLUSH_AT_NICE))
> + insert_nice_work(pwq, nice_work, work);

So, I'm not sure this works that well. e.g. what if the work item is
waiting for other work items which are at lower priority? Also, in
this case, it'd be a lot simpler to simply dequeue the work item and
execute it synchronously.

> } else {
> worker = find_worker_executing_work(pool, work);
> if (!worker)
> goto already_gone;
> pwq = worker->current_pwq;
> + if (unlikely(flags & WORK_FLUSH_AT_NICE)) {
> + set_user_nice(worker->task, nice_work->nice);
> + worker->flags |= WORKER_NICED;
> + }
> }

I'm not sure about this. Can you see whether canceling & executing
synchronously is enough to address the latency regression?

Thanks.

--
tejun

2018-11-19 16:05:31

by Daniel Jordan

[permalink] [raw]
Subject: Re: [RFC PATCH v4 11/13] mm: parallelize deferred struct page initialization within each node

On Mon, Nov 12, 2018 at 10:15:46PM +0000, Elliott, Robert (Persistent Memory) wrote:
>
>
> > -----Original Message-----
> > From: Daniel Jordan <[email protected]>
> > Sent: Monday, November 12, 2018 11:54 AM
> > To: Elliott, Robert (Persistent Memory) <[email protected]>
> > Subject: Re: [RFC PATCH v4 11/13] mm: parallelize deferred struct page
> > initialization within each node
> >
> > On Sat, Nov 10, 2018 at 03:48:14AM +0000, Elliott, Robert (Persistent
> > Memory) wrote:
> > > > -----Original Message-----
> > > > From: [email protected] <linux-kernel-
> > > > [email protected]> On Behalf Of Daniel Jordan
> > > > Sent: Monday, November 05, 2018 10:56 AM
> > > > Subject: [RFC PATCH v4 11/13] mm: parallelize deferred struct page
> > > > initialization within each node
> > > >
> ...
> > > > In testing, a reasonable value turned out to be about a quarter of the
> > > > CPUs on the node.
> > > ...
> > > > + /*
> > > > + * We'd like to know the memory bandwidth of the chip to
> > > > calculate the
> > > > + * most efficient number of threads to start, but we can't.
> > > > + * In testing, a good value for a variety of systems was a
> > > > quarter of the CPUs on the node.
> > > > + */
> > > > + nr_node_cpus = DIV_ROUND_UP(cpumask_weight(cpumask), 4);
> > >
> > >
> > > You might want to base that calculation on and limit the threads to
> > > physical cores, not hyperthreaded cores.
> >
> > Why? Hyperthreads can be beneficial when waiting on memory. That said, I
> > don't have data that shows that in this case.
>
> I think that's only if there are some register-based calculations to do while
> waiting. If both threads are just doing memory accesses, they'll both stall, and
> there doesn't seem to be any benefit in having two contexts generate the IOs
> rather than one (at least on the systems I've used). I think it takes longer
> to switch contexts than to just turnaround the next IO.

(Sorry for the delay, Plumbers is over now...)

I guess we're both just waving our hands without data. I've only got x86, so
using a quarter of the CPUs rules out HT on my end. Do you have a system that
you can test this on, where using a quarter of the CPUs will involve HT?

Thanks,
Daniel

2018-11-19 16:31:21

by Daniel Jordan

[permalink] [raw]
Subject: Re: [RFC PATCH v4 11/13] mm: parallelize deferred struct page initialization within each node

On Mon, Nov 12, 2018 at 08:54:12AM -0800, Daniel Jordan wrote:
> On Sat, Nov 10, 2018 at 03:48:14AM +0000, Elliott, Robert (Persistent Memory) wrote:
> > > -----Original Message-----
> > > From: [email protected] <linux-kernel-
> > > [email protected]> On Behalf Of Daniel Jordan
> > > Sent: Monday, November 05, 2018 10:56 AM
> > > Subject: [RFC PATCH v4 11/13] mm: parallelize deferred struct page
> > > initialization within each node
> > >
> > > ... The kernel doesn't
> > > know the memory bandwidth of a given system to get the most efficient
> > > number of threads, so there's some guesswork involved.
> >
> > The ACPI HMAT (Heterogeneous Memory Attribute Table) is designed to report
> > that kind of information, and could facilitate automatic tuning.
> >
> > There was discussion last year about kernel support for it:
> > https://lore.kernel.org/lkml/[email protected]/
>
> Thanks for bringing this up. I'm traveling but will take a closer look when I
> get back.

So this series would give the total bandwidth for a memory target, but there's
not a way to map that to a CPU count. In other words, it seems we couldn't
determine how many CPUs it takes to reach the max bandwidth. If I haven't
missed something, I'm going to remove that comment.

2018-11-19 16:48:18

by Daniel Jordan

[permalink] [raw]
Subject: Re: [RFC PATCH v4 05/13] workqueue, ktask: renice helper threads to prevent starvation

On Tue, Nov 13, 2018 at 08:34:00AM -0800, Tejun Heo wrote:
> Hello, Daniel.

Hi Tejun, sorry for the delay. Plumbers...

> On Mon, Nov 05, 2018 at 11:55:50AM -0500, Daniel Jordan wrote:
> > static bool start_flush_work(struct work_struct *work, struct wq_barrier *barr,
> > - bool from_cancel)
> > + struct nice_work *nice_work, int flags)
> > {
> > struct worker *worker = NULL;
> > struct worker_pool *pool;
> > @@ -2868,11 +2926,19 @@ static bool start_flush_work(struct work_struct *work, struct wq_barrier *barr,
> > if (pwq) {
> > if (unlikely(pwq->pool != pool))
> > goto already_gone;
> > +
> > + /* not yet started, insert linked work before work */
> > + if (unlikely(flags & WORK_FLUSH_AT_NICE))
> > + insert_nice_work(pwq, nice_work, work);
>
> So, I'm not sure this works that well. e.g. what if the work item is
> waiting for other work items which are at lower priority? Also, in
> this case, it'd be a lot simpler to simply dequeue the work item and
> execute it synchronously.

Good idea, that is much simpler (and shorter).

So doing it this way, the current task's nice level would be adjusted while
running the work synchronously.

>
> > } else {
> > worker = find_worker_executing_work(pool, work);
> > if (!worker)
> > goto already_gone;
> > pwq = worker->current_pwq;
> > + if (unlikely(flags & WORK_FLUSH_AT_NICE)) {
> > + set_user_nice(worker->task, nice_work->nice);
> > + worker->flags |= WORKER_NICED;
> > + }
> > }
>
> I'm not sure about this. Can you see whether canceling & executing
> synchronously is enough to address the latency regression?

In my testing, canceling was practically never successful because these are
long running jobs, so by the time the main ktask thread gets around to
flushing/nice'ing the works, worker threads have already started running them.
I had to write a no-op ktask to hit the first path where you suggest
dequeueing. So adjusting the priority of a running worker seems required to
address the latency issue.

So instead of flush_work_at_nice, how about this?:

void renice_work_sync(struct work_struct *work, long nice);

If a worker is running the work, renice the worker to 'nice' and wait for it to
finish (what this patch does now), and if the work isn't running, dequeue it
and run in the current thread, again at 'nice'.


Thanks for taking a look.

2018-11-20 17:05:42

by Daniel Jordan

[permalink] [raw]
Subject: Re: [RFC PATCH v4 05/13] workqueue, ktask: renice helper threads to prevent starvation

On Tue, Nov 20, 2018 at 08:33:19AM -0800, Tejun Heo wrote:
> On Mon, Nov 19, 2018 at 08:45:54AM -0800, Daniel Jordan wrote:
> > So instead of flush_work_at_nice, how about this?:
> >
> > void renice_work_sync(struct work_struct *work, long nice);
>
> Wouldn't renice_or_cancel make more sense?

I guess you mean renicing if the work has started and canceling if it
hasn't?

Then yes, it would in this case, since if a ktask work hasn't started, there's
no point in running it--there are no more chunks for it to work on.

Was attempting to generalize for other cases when the work did need to be run,
but designing for the future can be dicey, and I'm fine either way. So absent
other opinions, I'll do renice_or_cancel.

2018-11-20 17:57:03

by Tejun Heo

[permalink] [raw]
Subject: Re: [RFC PATCH v4 05/13] workqueue, ktask: renice helper threads to prevent starvation

On Mon, Nov 19, 2018 at 08:45:54AM -0800, Daniel Jordan wrote:
> So instead of flush_work_at_nice, how about this?:
>
> void renice_work_sync(struct work_struct *work, long nice);

Wouldn't renice_or_cancel make more sense?

Thanks.

--
tejun

Subject: RE: [RFC PATCH v4 11/13] mm: parallelize deferred struct page initialization within each node



> -----Original Message-----
> From: Daniel Jordan [mailto:[email protected]]
> Sent: Monday, November 19, 2018 10:02 AM
> On Mon, Nov 12, 2018 at 10:15:46PM +0000, Elliott, Robert (Persistent Memory) wrote:
> >
> > > -----Original Message-----
> > > From: Daniel Jordan <[email protected]>
> > > Sent: Monday, November 12, 2018 11:54 AM
> > >
> > > On Sat, Nov 10, 2018 at 03:48:14AM +0000, Elliott, Robert (Persistent
> > > Memory) wrote:
> > > > > -----Original Message-----
> > > > > From: [email protected] <linux-kernel-
> > > > > [email protected]> On Behalf Of Daniel Jordan
> > > > > Sent: Monday, November 05, 2018 10:56 AM
> > > > > Subject: [RFC PATCH v4 11/13] mm: parallelize deferred struct page
> > > > > initialization within each node
> > > > >
> > ...
> > > > > In testing, a reasonable value turned out to be about a quarter of the
> > > > > CPUs on the node.
> > > > ...
> > > > > + /*
> > > > > + * We'd like to know the memory bandwidth of the chip to
> > > > > calculate the
> > > > > + * most efficient number of threads to start, but we can't.
> > > > > + * In testing, a good value for a variety of systems was a
> > > > > quarter of the CPUs on the node.
> > > > > + */
> > > > > + nr_node_cpus = DIV_ROUND_UP(cpumask_weight(cpumask), 4);
> > > >
> > > >
> > > > You might want to base that calculation on and limit the threads to
> > > > physical cores, not hyperthreaded cores.
> > >
> > > Why? Hyperthreads can be beneficial when waiting on memory. That said, I
> > > don't have data that shows that in this case.
> >
> > I think that's only if there are some register-based calculations to do while
> > waiting. If both threads are just doing memory accesses, they'll both stall, and
> > there doesn't seem to be any benefit in having two contexts generate the IOs
> > rather than one (at least on the systems I've used). I think it takes longer
> > to switch contexts than to just turnaround the next IO.
>
> (Sorry for the delay, Plumbers is over now...)
>
> I guess we're both just waving our hands without data. I've only got x86, so
> using a quarter of the CPUs rules out HT on my end. Do you have a system that
> you can test this on, where using a quarter of the CPUs will involve HT?

I ran a short test with:
* HPE ProLiant DL360 Gen9 system
* Intel Xeon E5-2699 CPU with 18 physical cores (0-17) and
18 hyperthreaded cores (36-53)
* DDR4 NVDIMM-Ns (which run at regular DRAM DIMM speeds)
* fio workload generator
* cores on one CPU socket talking to a pmem device on the same CPU
* large (1 MiB) random writes (to minimize the threads getting CPU cache
hits from each other)

Results:
* 31.7 GB/s four threads, four physical cores (0,1,2,3)
* 22.2 GB/s four threads, two physical cores (0,1,36,37)
* 21.4 GB/s two threads, two physical cores (0,1)
* 12.1 GB/s two threads, one physical core (0,36)
* 11.2 GB/s one thread, one physical core (0)

So, I think it's important that the initialization threads run on
separate physical cores.

For the number of cores to use, one approach is:
memory bandwidth (number of interleaved channels * speed)
divided by
CPU core max sustained write bandwidth

For example, this 2133 MT/s system is roughly:
68 GB/s (4 * 17 GB/s nominal)
divided by
11.2 GB/s (one core's performance)
which is
6 cores

ACPI HMAT will report that 68 GB/s number. I'm not sure of
a good way to discover the 11.2 GB/s number.
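
Expressed as code, that heuristic would be something like the sketch below
(illustrative only; the helper and both bandwidth parameters are hypothetical,
since the per-core number has no generic source in the kernel today):

#include <linux/kernel.h>

/*
 * Hypothetical sizing helper: threads = node memory bandwidth divided by
 * one core's sustained write bandwidth, limited to the physical cores
 * available.  With 68 GB/s of interleaved channel bandwidth and ~11 GB/s
 * per core, this comes out to 6.
 */
static unsigned int deferred_init_nr_threads(unsigned int node_bw_gbps,
					     unsigned int core_bw_gbps,
					     unsigned int nr_node_cores)
{
	unsigned int nr;

	if (!core_bw_gbps)		/* unknown per-core bandwidth */
		return 1;

	nr = max(1U, node_bw_gbps / core_bw_gbps);
	return min(nr, nr_node_cores);
}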


fio job file:
[global]
direct=1
ioengine=sync
norandommap
randrepeat=0
bs=1M
runtime=20
time_based=1
group_reporting
thread
gtod_reduce=1
zero_buffers
cpus_allowed_policy=split
# pick the desired number of threads
numjobs=4
numjobs=2
numjobs=1

# CPU0: cores 0-17, hyperthreaded cores 36-53
[pmem0]
filename=/dev/pmem0
# pick the desired cpus_allowed list
cpus_allowed=0,1,2,3
cpus_allowed=0,1,36,37
cpus_allowed=0,36
cpus_allowed=0,1
cpus_allowed=0
rw=randwrite

Although most CPU time is in movnti instructions (non-temporal stores),
there is overhead in clearing the page cache and in the pmem block
driver; those won't be present in your initialization function.
perf top shows:
82.00% [kernel] [k] memcpy_flushcache
5.23% [kernel] [k] gup_pgd_range
3.41% [kernel] [k] __blkdev_direct_IO_simple
2.38% [kernel] [k] pmem_make_request
1.46% [kernel] [k] write_pmem
1.29% [kernel] [k] pmem_do_bvec


---
Robert Elliott, HPE Persistent Memory




2018-11-27 19:51:12

by Pavel Machek

[permalink] [raw]
Subject: Re: [RFC PATCH v4 01/13] ktask: add documentation

Hi!

> Motivates and explains the ktask API for kernel clients.
>
> Signed-off-by: Daniel Jordan <[email protected]>
> ---
> Documentation/core-api/index.rst | 1 +
> Documentation/core-api/ktask.rst | 213 +++++++++++++++++++++++++++++++
> 2 files changed, 214 insertions(+)
> create mode 100644 Documentation/core-api/ktask.rst
>
> diff --git a/Documentation/core-api/index.rst b/Documentation/core-api/index.rst
> index 3adee82be311..c143a280a5b1 100644
> --- a/Documentation/core-api/index.rst
> +++ b/Documentation/core-api/index.rst
> @@ -18,6 +18,7 @@ Core utilities
> refcount-vs-atomic
> cpu_hotplug
> idr
> + ktask
> local_ops
> workqueue
> genericirq
> diff --git a/Documentation/core-api/ktask.rst b/Documentation/core-api/ktask.rst
> new file mode 100644
> index 000000000000..c3c00e1f802f
> --- /dev/null
> +++ b/Documentation/core-api/ktask.rst
> @@ -0,0 +1,213 @@
> +.. SPDX-License-Identifier: GPL-2.0+
> +
> +============================================
> +ktask: parallelize CPU-intensive kernel work
> +============================================
> +
> +:Date: November, 2018
> +:Author: Daniel Jordan <[email protected]>


> +For example, consider the task of clearing a gigantic page. This used to be
> +done in a single thread with a for loop that calls a page clearing function for
> +each constituent base page. To parallelize with ktask, the client first moves
> +the for loop to the thread function, adapting it to operate on the range passed
> +to the function. In this simple case, the thread function's start and end
> +arguments are just addresses delimiting the portion of the gigantic page to
> +clear. Then, where the for loop used to be, the client calls into ktask with
> +the start address of the gigantic page, the total size of the gigantic page,
> +and the thread function. Internally, ktask will divide the address range into
> +an appropriate number of chunks and start an appropriate number of threads to
> +complete these chunks.

Great, so my little task is bound to CPUs 1-4 and uses gigantic
pages. Kernel clears them for me.

a) Do all the CPUs work for me, or just CPUs I was assigned to?

b) Will my time my_little_task show the system time including the
worker threads?

Best regards,
Pavel
--
(english) http://www.livejournal.com/~pavelmachek
(cesky, pictures) http://atrey.karlin.mff.cuni.cz/~pavel/picture/horses/blog.html




2018-11-27 20:25:24

by Daniel Jordan

[permalink] [raw]
Subject: Re: [RFC PATCH v4 11/13] mm: parallelize deferred struct page initialization within each node

On Tue, Nov 27, 2018 at 12:12:28AM +0000, Elliott, Robert (Persistent Memory) wrote:
> I ran a short test with:
> * HPE ProLiant DL360 Gen9 system
> * Intel Xeon E5-2699 CPU with 18 physical cores (0-17) and
> 18 hyperthreaded cores (36-53)
> * DDR4 NVDIMM-Ns (which run at regular DRAM DIMM speeds)
> * fio workload generator
> * cores on one CPU socket talking to a pmem device on the same CPU
> * large (1 MiB) random writes (to minimize the threads getting CPU cache
> hits from each other)
>
> Results:
> * 31.7 GB/s four threads, four physical cores (0,1,2,3)
> * 22.2 GB/s four threads, two physical cores (0,1,36,37)
> * 21.4 GB/s two threads, two physical cores (0,1)
> * 12.1 GB/s two threads, one physical core (0,36)
> * 11.2 GB/s one thread, one physical core (0)
>
> So, I think it's important that the initialization threads run on
> separate physical cores.

Thanks for running this. And fair enough, in this test using both siblings
gives only a 4-8% speedup over one, so it makes sense to use only cores in the
calculation.

As for how to actually do this, some arches have smp_num_siblings, but there
should be a generic interface to provide that.

It's also possible to calculate this from the existing
topology_sibling_cpumask, but the first option is better IMHO. Open to
suggestions.
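
To illustrate the second option, counting physical cores from
topology_sibling_cpumask() could look roughly like this (a sketch, not actual
ktask code; the function name is made up):

#include <linux/cpumask.h>
#include <linux/slab.h>
#include <linux/topology.h>

/*
 * Count physical cores in @mask: count a CPU only the first time any of
 * its SMT siblings is seen.
 */
static unsigned int nr_physical_cores(const struct cpumask *mask)
{
	cpumask_var_t seen;
	unsigned int cpu, cores = 0;

	if (!zalloc_cpumask_var(&seen, GFP_KERNEL))
		return cpumask_weight(mask);	/* fall back to all CPUs */

	for_each_cpu(cpu, mask) {
		if (cpumask_test_cpu(cpu, seen))
			continue;
		cores++;
		cpumask_or(seen, seen, topology_sibling_cpumask(cpu));
	}

	free_cpumask_var(seen);
	return cores;
}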

> For the number of cores to use, one approach is:
> memory bandwidth (number of interleaved channels * speed)
> divided by
> CPU core max sustained write bandwidth
>
> For example, this 2133 MT/s system is roughly:
> 68 GB/s (4 * 17 GB/s nominal)
> divided by
> 11.2 GB/s (one core's performance)
> which is
> 6 cores
>
> ACPI HMAT will report that 68 GB/s number. I'm not sure of
> a good way to discover the 11.2 GB/s number.

Yes, this would be nice to do if we could know the per-core number, with the
caveat that a single number like this would be most useful for the CPU-memory
pair it was calculated for, so the kernel could at least calculate it for jobs
operating on local memory.

Some BogoMIPS-like calibration may work, but I'll wait for ACPI HMAT support in
the kernel.

2018-11-28 16:57:48

by Daniel Jordan

[permalink] [raw]
Subject: Re: [RFC PATCH v4 01/13] ktask: add documentation

On Tue, Nov 27, 2018 at 08:50:08PM +0100, Pavel Machek wrote:
> Hi!

Hi, Pavel.

> > +============================================
> > +ktask: parallelize CPU-intensive kernel work
> > +============================================
> > +
> > +:Date: November, 2018
> > +:Author: Daniel Jordan <[email protected]>
>
>
> > +For example, consider the task of clearing a gigantic page. This used to be
> > +done in a single thread with a for loop that calls a page clearing function for
> > +each constituent base page. To parallelize with ktask, the client first moves
> > +the for loop to the thread function, adapting it to operate on the range passed
> > +to the function. In this simple case, the thread function's start and end
> > +arguments are just addresses delimiting the portion of the gigantic page to
> > +clear. Then, where the for loop used to be, the client calls into ktask with
> > +the start address of the gigantic page, the total size of the gigantic page,
> > +and the thread function. Internally, ktask will divide the address range into
> > +an appropriate number of chunks and start an appropriate number of threads to
> > +complete these chunks.
>
> Great, so my little task is bound to CPUs 1-4 and uses gigantic
> pages. Kernel clears them for me.
>
> a) Do all the CPUs work for me, or just CPUs I was assigned to?

In ktask's current form, all the CPUs. This is an existing limitation of
workqueues, which ktask is built on: unbound workqueue workers don't honor the
cpumask of the queueing task (...absent a wq user applying a cpumask wq attr
beforehand, which nobody in-tree does...).

But good point, the helper threads should only run on the CPUs the task is
bound to. I'm working on cgroup-aware workqueues but hadn't considered a
task's cpumask outside of cgroup/cpuset, so I'll try adding support for this
too.
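
For reference, the wq attr escape hatch I mean would look roughly like the
sketch below (assuming the current alloc_workqueue_attrs()/
apply_workqueue_attrs() interfaces; ktask doesn't do this today):

#include <linux/gfp.h>
#include <linux/workqueue.h>

/* Sketch: restrict an unbound workqueue's workers to @mask up front. */
static int restrict_wq_cpumask(struct workqueue_struct *wq,
			       const struct cpumask *mask)
{
	struct workqueue_attrs *attrs;
	int ret;

	attrs = alloc_workqueue_attrs(GFP_KERNEL);
	if (!attrs)
		return -ENOMEM;

	cpumask_copy(attrs->cpumask, mask);
	ret = apply_workqueue_attrs(wq, attrs);
	free_workqueue_attrs(attrs);
	return ret;
}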

> b) Will my time my_little_task show the system time including the
> worker threads?

No, system time of kworkers isn't accounted to the user tasks they're working
on behalf of. This time is already visible to userland in kworkers, and it
would be confusing to account it to a userland task instead.

Thanks for the questions.

Daniel

2018-11-30 19:20:53

by Tejun Heo

[permalink] [raw]
Subject: Re: [RFC PATCH v4 00/13] ktask: multithread CPU-intensive kernel work

Hello,

On Mon, Nov 05, 2018 at 11:55:45AM -0500, Daniel Jordan wrote:
> Michal, you mentioned that ktask should be sensitive to CPU utilization[1].
> ktask threads now run at the lowest priority on the system to avoid disturbing
> busy CPUs (more details in patches 4 and 5). Does this address your concern?
> The plan to address your other comments is explained below.

Have you tested what kind of impact this has on bandwidth of a system
in addition to latency? The thing is while this would make a better
use of a system which has idle capacity, it does so by doing more
total work. It'd be really interesting to see how this affects
bandwidth of a system too.

Thanks.

--
tejun

2018-12-01 00:15:01

by Daniel Jordan

[permalink] [raw]
Subject: Re: [RFC PATCH v4 00/13] ktask: multithread CPU-intensive kernel work

On Fri, Nov 30, 2018 at 11:18:19AM -0800, Tejun Heo wrote:
> Hello,
>
> On Mon, Nov 05, 2018 at 11:55:45AM -0500, Daniel Jordan wrote:
> > Michal, you mentioned that ktask should be sensitive to CPU utilization[1].
> > ktask threads now run at the lowest priority on the system to avoid disturbing
> > busy CPUs (more details in patches 4 and 5). Does this address your concern?
> > The plan to address your other comments is explained below.
>
> Have you tested what kind of impact this has on bandwidth of a system
> in addition to latency? The thing is while this would make a better
> use of a system which has idle capacity, it does so by doing more
> total work. It'd be really interesting to see how this affects
> bandwidth of a system too.

I guess you mean something like comparing aggregate CPU time across threads to
the base single thread time for some job or set of jobs? Then no, I haven't
measured that, but I can for next time.

2018-12-03 16:19:24

by Tejun Heo

[permalink] [raw]
Subject: Re: [RFC PATCH v4 00/13] ktask: multithread CPU-intensive kernel work

Hello,

On Fri, Nov 30, 2018 at 04:13:07PM -0800, Daniel Jordan wrote:
> On Fri, Nov 30, 2018 at 11:18:19AM -0800, Tejun Heo wrote:
> > Hello,
> >
> > On Mon, Nov 05, 2018 at 11:55:45AM -0500, Daniel Jordan wrote:
> > > Michal, you mentioned that ktask should be sensitive to CPU utilization[1].
> > > ktask threads now run at the lowest priority on the system to avoid disturbing
> > > busy CPUs (more details in patches 4 and 5). Does this address your concern?
> > > The plan to address your other comments is explained below.
> >
> > Have you tested what kind of impact this has on bandwidth of a system
> > in addition to latency? The thing is while this would make a better
> > use of a system which has idle capacity, it does so by doing more
> > total work. It'd be really interesting to see how this affects
> > bandwidth of a system too.
>
> I guess you mean something like comparing aggregate CPU time across threads to
> the base single thread time for some job or set of jobs? Then no, I haven't
> measured that, but I can for next time.

Yeah, I'm primarily curious how expensive this is on an already loaded
system, so sth like loading up the system with a workload which can
saturate the system and comparing the bw impacts of serial and
parallel page clearings at the same frequency.

Thanks.

--
tejun