2014-10-23 18:46:16

by Alex Thorlton

Subject: [RESEND] [PATCH 0/4] Convert khugepaged to a task_work function

Hey everyone,

Last week, while discussing possible fixes for some unexpected/unwanted behavior
from khugepaged (see: https://lkml.org/lkml/2014/10/8/515), several people
mentioned possibly converting khugepaged to a task_work function instead of a
kernel thread. This will give us finer-grained control over the page collapse
scans, eliminate some unnecessary scans (since tasks that are relatively
inactive will not be scanned often), and eliminate the unwanted behavior
described in the email thread mentioned above.
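
For anyone not familiar with the task_work machinery, the basic pattern this
series builds on looks roughly like the sketch below (a sketch only - the
pgcollapse_work field and the callback name come from the later patches, and
error handling is omitted):

    #include <linux/task_work.h>

    /* runs in the task's own context, on its way back to userspace */
    void task_pgcollapse_work(struct callback_head *work)
    {
            struct task_struct *tsk = container_of(work,
                            struct task_struct, pgcollapse_work);
            /* ... do one round of page collapse scanning for tsk ... */
    }

    /* somewhere in process context, queue the callback on ourselves */
    init_task_work(&current->pgcollapse_work, task_pgcollapse_work);
    task_work_add(current, &current->pgcollapse_work, true); /* true = notify */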

This initial patchset is fully functional, but there are quite a few areas that
will need to be polished up before it's ready to be considered for a merge. I
wanted to get this first version out quickly, along with some basic test
results, so that people can give their opinions and let me know if there's
anything they'd like to see done differently (and there probably is :). I'll
give details on the code in the individual patches.

I gathered some pretty rudimentary test data using a 48-thread NAMD simulation
pinned to a cpuset with 8 cpus and about 60g of memory. I'm checking to see if
I'm allowed to publish the input data so that others can replicate the test. In
the meantime, if somebody knows of a publicly available benchmark that stresses
khugepaged, that would be helpful.

The only metric I gathered was the number of pages collapsed, sampled every ten
seconds for the lifetime of the job. This one statistic gives a pretty decent
illustration of the difference in behavior between the two kernels, but I
intend to add some other counters to measure fully completed scans, failed
allocations, and possibly scans skipped due to timer constraints.

The data for the standard kernel (carrying only a very small patch to add the
stat counter I used to the task_struct) is available here:

http://oss.sgi.com/projects/memtests/pgcollapse/output-khpd

This was a fairly recent kernel (last Tuesday). Commit ID:
2d65a9f48fcdf7866aab6457bc707ca233e0c791. I'll send the patches I used for that
kernel as a reply to this message shortly.

The output from the modified kernel is stored here:

http://oss.sgi.com/projects/memtests/pgcollapse/output-pgcollapse

The output is stored in a pretty dumb format (*really* wide). Best viewed in a
simple text editor with word wrap off, just fyi.

Quick summary of what I found: both kernels performed about the same in terms
of overall runtime; my kernel was 22 seconds faster, with a total runtime of
4:13:07. That's not a significant difference, but it's important to note that
there was no apparent performance degradation. The most interesting result is
that my kernel completed the majority of the necessary page collapses for this
job in 2:04, whereas the mainline kernel took 29:05 to reach the same point.

Let me know what you think. Any suggestions are appreciated!

- Alex

Signed-off-by: Alex Thorlton <[email protected]>
Cc: Andrew Morton <[email protected]>
Cc: Bob Liu <[email protected]>
Cc: David Rientjes <[email protected]>
Cc: Eric W. Biederman <[email protected]>
Cc: Hugh Dickins <[email protected]>
Cc: Ingo Molnar <[email protected]>
Cc: Kees Cook <[email protected]>
Cc: Kirill A. Shutemov <[email protected]>
Cc: Mel Gorman <[email protected]>
Cc: Oleg Nesterov <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Cc: Rik van Riel <[email protected]>
Cc: Thomas Gleixner <[email protected]>
Cc: Vladimir Davydov <[email protected]>
Cc: [email protected]

Alex Thorlton (4):
Disable khugepaged thread
Add pgcollapse controls to task_struct
Convert khugepaged scan functions to work with task_work
Add /proc files to expose per-mm pgcollapse stats

fs/proc/base.c | 25 +++++++
include/linux/khugepaged.h | 10 ++-
include/linux/sched.h | 15 ++++
kernel/fork.c | 6 ++
kernel/sched/fair.c | 19 ++++++
mm/huge_memory.c | 167 ++++++++++++++++-----------------------------
6 files changed, 129 insertions(+), 113 deletions(-)

--
1.7.12.4


2014-10-23 18:46:23

by Alex Thorlton

Subject: [PATCH 1/4] Disable khugepaged thread

This patch just removes any call to start khugepaged, for now. If we decide to
go forward with this new approach, this patch will also be extended to
dismantle the other bits of khugepaged that we'll no longer need.

Signed-off-by: Alex Thorlton <[email protected]>
Cc: Andrew Morton <[email protected]>
Cc: Bob Liu <[email protected]>
Cc: David Rientjes <[email protected]>
Cc: Eric W. Biederman <[email protected]>
Cc: Hugh Dickins <[email protected]>
Cc: Ingo Molnar <[email protected]>
Cc: Kees Cook <[email protected]>
Cc: Kirill A. Shutemov <[email protected]>
Cc: Mel Gorman <[email protected]>
Cc: Oleg Nesterov <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Cc: Rik van Riel <[email protected]>
Cc: Thomas Gleixner <[email protected]>
Cc: Vladimir Davydov <[email protected]>
Cc: [email protected]

---
mm/huge_memory.c | 23 +++--------------------
1 file changed, 3 insertions(+), 20 deletions(-)

diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index 74c78aa..63abf52 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -295,24 +295,9 @@ static ssize_t enabled_store(struct kobject *kobj,
struct kobj_attribute *attr,
const char *buf, size_t count)
{
- ssize_t ret;
-
- ret = double_flag_store(kobj, attr, buf, count,
- TRANSPARENT_HUGEPAGE_FLAG,
- TRANSPARENT_HUGEPAGE_REQ_MADV_FLAG);
-
- if (ret > 0) {
- int err;
-
- mutex_lock(&khugepaged_mutex);
- err = start_khugepaged();
- mutex_unlock(&khugepaged_mutex);
-
- if (err)
- ret = err;
- }
-
- return ret;
+ return double_flag_store(kobj, attr, buf, count,
+ TRANSPARENT_HUGEPAGE_FLAG,
+ TRANSPARENT_HUGEPAGE_REQ_MADV_FLAG);
}
static struct kobj_attribute enabled_attr =
__ATTR(enabled, 0644, enabled_show, enabled_store);
@@ -655,8 +640,6 @@ static int __init hugepage_init(void)
if (totalram_pages < (512 << (20 - PAGE_SHIFT)))
transparent_hugepage_flags = 0;

- start_khugepaged();
-
return 0;
out:
hugepage_exit_sysfs(hugepage_kobj);
--
1.7.12.4

2014-10-23 18:46:30

by Alex Thorlton

Subject: [PATCH 3/4] Convert khugepaged scan functions to work with task_work

This patch implements the actual functionality changes that I'm proposing. For
now I've just repurposed the existing khugepaged functions to do what we need.
Eventually they'll need to be completely separated, probably using a config
option to pick which method we want. We might also want to change the way we
decide when to add task_pgcollapse_work to the task_works list. I have some
ideas to try for that, but haven't implemented any of them yet.
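
To make the throttling concrete, the core of the check in __khugepaged_enter()
boils down to the following (a sketch mirroring the code in the patch below):

    unsigned long period =
            msecs_to_jiffies(current->pgcollapse_scan_sleep_millisecs);

    /* time_after() handles jiffies wraparound correctly */
    if (time_after(jiffies, current->pgcollapse_last_scan + period)) {
            current->pgcollapse_last_scan = jiffies;
            /* re-arm the work so a scan runs on the next return to userspace */
            init_task_work(&current->pgcollapse_work, task_pgcollapse_work);
            task_work_add(current, &current->pgcollapse_work, true);
    }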

Signed-off-by: Alex Thorlton <[email protected]>
Cc: Andrew Morton <[email protected]>
Cc: Bob Liu <[email protected]>
Cc: David Rientjes <[email protected]>
Cc: Eric W. Biederman <[email protected]>
Cc: Hugh Dickins <[email protected]>
Cc: Ingo Molnar <[email protected]>
Cc: Kees Cook <[email protected]>
Cc: Kirill A. Shutemov <[email protected]>
Cc: Mel Gorman <[email protected]>
Cc: Oleg Nesterov <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Cc: Rik van Riel <[email protected]>
Cc: Thomas Gleixner <[email protected]>
Cc: Vladimir Davydov <[email protected]>
Cc: [email protected]

---
include/linux/khugepaged.h | 10 +++-
include/linux/sched.h | 3 +
kernel/sched/fair.c | 19 ++++++
mm/huge_memory.c | 144 +++++++++++++++++----------------------------
4 files changed, 83 insertions(+), 93 deletions(-)

diff --git a/include/linux/khugepaged.h b/include/linux/khugepaged.h
index 6b394f0..27e0ff4 100644
--- a/include/linux/khugepaged.h
+++ b/include/linux/khugepaged.h
@@ -2,11 +2,14 @@
#define _LINUX_KHUGEPAGED_H

#include <linux/sched.h> /* MMF_VM_HUGEPAGE */
+#include <linux/task_work.h>

#ifdef CONFIG_TRANSPARENT_HUGEPAGE
-extern int __khugepaged_enter(struct mm_struct *mm);
+extern int __khugepaged_enter(void);
extern void __khugepaged_exit(struct mm_struct *mm);
extern int khugepaged_enter_vma_merge(struct vm_area_struct *vma);
+extern void task_pgcollapse_work(struct callback_head *);
+extern void khugepaged_do_scan(void);

#define khugepaged_enabled() \
(transparent_hugepage_flags & \
@@ -24,8 +27,9 @@ extern int khugepaged_enter_vma_merge(struct vm_area_struct *vma);

static inline int khugepaged_fork(struct mm_struct *mm, struct mm_struct *oldmm)
{
+ /* this will add task_pgcollapse_work to task_works */
if (test_bit(MMF_VM_HUGEPAGE, &oldmm->flags))
- return __khugepaged_enter(mm);
+ return __khugepaged_enter();
return 0;
}

@@ -42,7 +46,7 @@ static inline int khugepaged_enter(struct vm_area_struct *vma)
(khugepaged_req_madv() &&
vma->vm_flags & VM_HUGEPAGE)) &&
!(vma->vm_flags & VM_NOHUGEPAGE))
- if (__khugepaged_enter(vma->vm_mm))
+ if (__khugepaged_enter())
return -ENOMEM;
return 0;
}
diff --git a/include/linux/sched.h b/include/linux/sched.h
index 50f67d5..7e9946b 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -3044,3 +3044,6 @@ static inline unsigned long rlimit_max(unsigned int limit)
}

#endif
+#ifdef CONFIG_TRANSPARENT_HUGEPAGE
+extern void task_pgcollapse_work(struct callback_head *);
+#endif
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index b78280c..cd1cd6c 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -30,6 +30,8 @@
#include <linux/mempolicy.h>
#include <linux/migrate.h>
#include <linux/task_work.h>
+#include <linux/types.h>
+#include <linux/khugepaged.h>

#include <trace/events/sched.h>

@@ -2060,6 +2062,23 @@ static inline void account_numa_dequeue(struct rq *rq, struct task_struct *p)
}
#endif /* CONFIG_NUMA_BALANCING */

+#ifdef CONFIG_TRANSPARENT_HUGEPAGE
+void task_pgcollapse_work(struct callback_head *work)
+{
+ WARN_ON_ONCE(current != container_of(work, struct task_struct, pgcollapse_work));
+
+ work->next = work; /* protect against double add */
+
+ pr_info("!!! debug - INFO: task: %s/%d in task_pgcollapse_work\n",
+ current->comm, (int) current->pid);
+ khugepaged_do_scan();
+}
+#else
+void task_pgcollapse_work(struct callback_head *work)
+{
+}
+#endif
+
static void
account_entity_enqueue(struct cfs_rq *cfs_rq, struct sched_entity *se)
{
diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index 63abf52..d4ee953 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -786,6 +786,7 @@ int do_huge_pmd_anonymous_page(struct mm_struct *mm, struct vm_area_struct *vma,
return VM_FAULT_FALLBACK;
if (unlikely(anon_vma_prepare(vma)))
return VM_FAULT_OOM;
+ /* this will add task_pgcollapse_work to task_works */
if (unlikely(khugepaged_enter(vma)))
return VM_FAULT_OOM;
if (!(flags & FAULT_FLAG_WRITE) &&
@@ -2021,35 +2022,28 @@ static inline int khugepaged_test_exit(struct mm_struct *mm)
return atomic_read(&mm->mm_users) == 0;
}

-int __khugepaged_enter(struct mm_struct *mm)
+int __khugepaged_enter(void)
{
- struct mm_slot *mm_slot;
- int wakeup;
+ unsigned long period = msecs_to_jiffies(current->pgcollapse_scan_sleep_millisecs);

- mm_slot = alloc_mm_slot();
- if (!mm_slot)
- return -ENOMEM;
+ pr_info("!!! debug - INFO: task: %s/%d jiffies: %lu period: %lu last_scan: %lu\n",
+ current->comm, (int) current->pid, jiffies, period,
+ current->pgcollapse_last_scan);

- /* __khugepaged_exit() must not run from under us */
- VM_BUG_ON_MM(khugepaged_test_exit(mm), mm);
- if (unlikely(test_and_set_bit(MMF_VM_HUGEPAGE, &mm->flags))) {
- free_mm_slot(mm_slot);
- return 0;
- }
+ /* may want to move this up to where we actually do the scan... */
+ if (time_after(jiffies, current->pgcollapse_last_scan + period)) {
+ current->pgcollapse_last_scan = jiffies;

- spin_lock(&khugepaged_mm_lock);
- insert_to_mm_slots_hash(mm, mm_slot);
- /*
- * Insert just behind the scanning cursor, to let the area settle
- * down a little.
- */
- wakeup = list_empty(&khugepaged_scan.mm_head);
- list_add_tail(&mm_slot->mm_node, &khugepaged_scan.mm_head);
- spin_unlock(&khugepaged_mm_lock);
+ pr_info("!!! debug - INFO: task: %s/%d adding pgcollapse work\n",
+ current->comm, (int) current->pid);

- atomic_inc(&mm->mm_count);
- if (wakeup)
- wake_up_interruptible(&khugepaged_wait);
+ /* debug - actual new code */
+ init_task_work(&current->pgcollapse_work, task_pgcollapse_work);
+ task_work_add(current, &current->pgcollapse_work, true);
+ } else {
+ pr_info("!!! debug - INFO: task: %s/%d skipping pgcollapse_scan\n",
+ current->comm, (int) current->pid);
+ }

return 0;
}
@@ -2069,6 +2063,8 @@ int khugepaged_enter_vma_merge(struct vm_area_struct *vma)
VM_BUG_ON_VMA(vma->vm_flags & VM_NO_THP, vma);
hstart = (vma->vm_start + ~HPAGE_PMD_MASK) & HPAGE_PMD_MASK;
hend = vma->vm_end & HPAGE_PMD_MASK;
+
+ /* this will add task_pgcollapse_work to task_works */
if (hstart < hend)
return khugepaged_enter(vma);
return 0;
@@ -2417,6 +2413,9 @@ static void collapse_huge_page(struct mm_struct *mm,
if (!new_page)
return;

+ pr_info("!!! debug - INFO: task: %s/%d pgcollapse alloc: %lx on node %d\n",
+ current->comm, (int) current->pid, address, node);
+
if (unlikely(mem_cgroup_try_charge(new_page, mm,
GFP_TRANSHUGE, &memcg)))
return;
@@ -2514,7 +2513,7 @@ static void collapse_huge_page(struct mm_struct *mm,

*hpage = NULL;

- khugepaged_pages_collapsed++;
+ current->pgcollapse_pages_collapsed++;
out_up_write:
up_write(&mm->mmap_sem);
return;
@@ -2616,44 +2615,34 @@ static void collect_mm_slot(struct mm_slot *mm_slot)
}

static unsigned int khugepaged_scan_mm_slot(unsigned int pages,
- struct page **hpage)
- __releases(&khugepaged_mm_lock)
- __acquires(&khugepaged_mm_lock)
+ struct page **hpage)
{
- struct mm_slot *mm_slot;
struct mm_struct *mm;
struct vm_area_struct *vma;
int progress = 0;
+ /*
+ * grab this pointer here to avoid dereferencing
+ * the task struct multiple times later
+ */
+ unsigned long *scan_addr_p = &current->pgcollapse_scan_address;

VM_BUG_ON(!pages);
- VM_BUG_ON(NR_CPUS != 1 && !spin_is_locked(&khugepaged_mm_lock));

- if (khugepaged_scan.mm_slot)
- mm_slot = khugepaged_scan.mm_slot;
- else {
- mm_slot = list_entry(khugepaged_scan.mm_head.next,
- struct mm_slot, mm_node);
- khugepaged_scan.address = 0;
- khugepaged_scan.mm_slot = mm_slot;
- }
- spin_unlock(&khugepaged_mm_lock);
+ pr_info("!!! debug - task: %s/%d starting scan at %lx\n",
+ current->comm, (int) current->pid, *scan_addr_p);

- mm = mm_slot->mm;
+ mm = current->mm;
down_read(&mm->mmap_sem);
if (unlikely(khugepaged_test_exit(mm)))
vma = NULL;
else
- vma = find_vma(mm, khugepaged_scan.address);
+ vma = find_vma(mm, *scan_addr_p);

progress++;
for (; vma; vma = vma->vm_next) {
unsigned long hstart, hend;

cond_resched();
- if (unlikely(khugepaged_test_exit(mm))) {
- progress++;
- break;
- }
if (!hugepage_vma_check(vma)) {
skip:
progress++;
@@ -2663,26 +2652,24 @@ skip:
hend = vma->vm_end & HPAGE_PMD_MASK;
if (hstart >= hend)
goto skip;
- if (khugepaged_scan.address > hend)
+ if (*scan_addr_p > hend)
goto skip;
- if (khugepaged_scan.address < hstart)
- khugepaged_scan.address = hstart;
- VM_BUG_ON(khugepaged_scan.address & ~HPAGE_PMD_MASK);
+ if (*scan_addr_p < hstart)
+ *scan_addr_p = hstart;
+ VM_BUG_ON(*scan_addr_p & ~HPAGE_PMD_MASK);

- while (khugepaged_scan.address < hend) {
+ while (*scan_addr_p < hend) {
int ret;
cond_resched();
if (unlikely(khugepaged_test_exit(mm)))
goto breakouterloop;

- VM_BUG_ON(khugepaged_scan.address < hstart ||
- khugepaged_scan.address + HPAGE_PMD_SIZE >
- hend);
- ret = khugepaged_scan_pmd(mm, vma,
- khugepaged_scan.address,
+ VM_BUG_ON(*scan_addr_p < hstart ||
+ *scan_addr_p + HPAGE_PMD_SIZE > hend);
+ ret = khugepaged_scan_pmd(mm, vma, *scan_addr_p,
hpage);
/* move to next address */
- khugepaged_scan.address += HPAGE_PMD_SIZE;
+ *scan_addr_p += HPAGE_PMD_SIZE;
progress += HPAGE_PMD_NR;
if (ret)
/* we released mmap_sem so break loop */
@@ -2694,30 +2681,14 @@ skip:
breakouterloop:
up_read(&mm->mmap_sem); /* exit_mmap will destroy ptes after this */
breakouterloop_mmap_sem:
-
- spin_lock(&khugepaged_mm_lock);
- VM_BUG_ON(khugepaged_scan.mm_slot != mm_slot);
/*
* Release the current mm_slot if this mm is about to die, or
- * if we scanned all vmas of this mm.
+ * if we scanned all vmas of this mm. Don't think we need the
+ * khugepaged_test_exit for the task_work style scan...
*/
if (khugepaged_test_exit(mm) || !vma) {
- /*
- * Make sure that if mm_users is reaching zero while
- * khugepaged runs here, khugepaged_exit will find
- * mm_slot not pointing to the exiting mm.
- */
- if (mm_slot->mm_node.next != &khugepaged_scan.mm_head) {
- khugepaged_scan.mm_slot = list_entry(
- mm_slot->mm_node.next,
- struct mm_slot, mm_node);
- khugepaged_scan.address = 0;
- } else {
- khugepaged_scan.mm_slot = NULL;
- khugepaged_full_scans++;
- }
-
- collect_mm_slot(mm_slot);
+ *scan_addr_p = 0;
+ current->pgcollapse_full_scans++;
}

return progress;
@@ -2735,11 +2706,11 @@ static int khugepaged_wait_event(void)
kthread_should_stop();
}

-static void khugepaged_do_scan(void)
+void khugepaged_do_scan(void)
{
struct page *hpage = NULL;
- unsigned int progress = 0, pass_through_head = 0;
- unsigned int pages = khugepaged_pages_to_scan;
+ unsigned int progress = 0;
+ unsigned int pages = current->pgcollapse_pages_to_scan;
bool wait = true;

barrier(); /* write khugepaged_pages_to_scan to local stack */
@@ -2750,19 +2721,12 @@ static void khugepaged_do_scan(void)

cond_resched();

- if (unlikely(kthread_should_stop() || freezing(current)))
+ if (unlikely(freezing(current)))
break;

- spin_lock(&khugepaged_mm_lock);
- if (!khugepaged_scan.mm_slot)
- pass_through_head++;
- if (khugepaged_has_work() &&
- pass_through_head < 2)
- progress += khugepaged_scan_mm_slot(pages - progress,
- &hpage);
- else
- progress = pages;
- spin_unlock(&khugepaged_mm_lock);
+ progress += khugepaged_scan_mm_slot(pages - progress, &hpage);
+ pr_info("!!! debug - INFO: task: %s/%d scan iteration progress %u\n",
+ current->comm, (int) current->pid, progress);
}

if (!IS_ERR_OR_NULL(hpage))
--
1.7.12.4

2014-10-23 18:46:25

by Alex Thorlton

Subject: [PATCH 2/4] Add pgcollapse controls to task_struct

This patch just adds the necessary bits to the task_struct so that the scans can
eventually be controlled on a per-mm basis. As I mentioned previously, we might
want to add some more counters here.
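
For context, these fields are per-task analogues of the existing global
khugepaged tunables in mm/huge_memory.c, which look roughly like this (quoted
from memory, so treat as approximate):

    /* default scan 8*512 pte (or vmas) every 30 second */
    static unsigned int khugepaged_pages_to_scan __read_mostly = HPAGE_PMD_NR * 8;
    static unsigned int khugepaged_pages_collapsed;
    static unsigned int khugepaged_full_scans;
    static unsigned int khugepaged_scan_sleep_millisecs __read_mostly = 10000;
    /* during fragmentation poll the hugepage allocator once every minute */
    static unsigned int khugepaged_alloc_sleep_millisecs __read_mostly = 60000;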

Signed-off-by: Alex Thorlton <[email protected]>
Cc: Andrew Morton <[email protected]>
Cc: Bob Liu <[email protected]>
Cc: David Rientjes <[email protected]>
Cc: Eric W. Biederman <[email protected]>
Cc: Hugh Dickins <[email protected]>
Cc: Ingo Molnar <[email protected]>
Cc: Kees Cook <[email protected]>
Cc: Kirill A. Shutemov <[email protected]>
Cc: Mel Gorman <[email protected]>
Cc: Oleg Nesterov <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Cc: Rik van Riel <[email protected]>
Cc: Thomas Gleixner <[email protected]>
Cc: Vladimir Davydov <[email protected]>
Cc: [email protected]

---
include/linux/sched.h | 12 ++++++++++++
kernel/fork.c | 6 ++++++
2 files changed, 18 insertions(+)

diff --git a/include/linux/sched.h b/include/linux/sched.h
index 5e344bb..50f67d5 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -1661,6 +1661,18 @@ struct task_struct {
unsigned int sequential_io;
unsigned int sequential_io_avg;
#endif
+#ifdef CONFIG_TRANSPARENT_HUGEPAGE
+ struct callback_head pgcollapse_work;
+ /* default scan 8*512 pte (or vmas) every 30 second */
+ unsigned int pgcollapse_pages_to_scan;
+ unsigned int pgcollapse_pages_collapsed;
+ unsigned int pgcollapse_full_scans;
+ unsigned int pgcollapse_scan_sleep_millisecs;
+ /* during fragmentation poll the hugepage allocator once every minute */
+ unsigned int pgcollapse_alloc_sleep_millisecs;
+ unsigned long pgcollapse_last_scan;
+ unsigned long pgcollapse_scan_address;
+#endif
};

/* Future-safe accessor for struct task_struct's cpus_allowed. */
diff --git a/kernel/fork.c b/kernel/fork.c
index 9b7d746..8c55309 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -1355,6 +1355,12 @@ static struct task_struct *copy_process(unsigned long clone_flags,
p->sequential_io = 0;
p->sequential_io_avg = 0;
#endif
+#ifdef CONFIG_TRANSPARENT_HUGEPAGE
+ /* need to pull these values from sysctl or something */
+ p->pgcollapse_pages_to_scan = HPAGE_PMD_NR * 8;
+ p->pgcollapse_scan_sleep_millisecs = 10000;
+ p->pgcollapse_alloc_sleep_millisecs = 60000;
+#endif

/* Perform scheduler related setup. Assign this task to a CPU. */
retval = sched_fork(clone_flags, p);
--
1.7.12.4

2014-10-23 18:46:36

by Alex Thorlton

Subject: [PATCH 4/4] Add /proc files to expose per-mm pgcollapse stats

This patch adds a /proc file to read out the information that we've added to
the task_struct. I'll still need to split the information out into separate
files, probably in a subdirectory, change a few of the files to allow their
values to be modified, and add appropriate locking.
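
Once this is in place, the stats can be read straight out of /proc. A
hypothetical example follows (the values here are invented, and assume x86_64,
where HPAGE_PMD_NR is 512, so pages_to_scan defaults to 4096):

    $ cat /proc/<pid>/pgcollapse
    pages_to_scan: 4096
    pages_collapsed: 317
    full_scans: 2
    scan_sleep_millisecs: 10000
    alloc_sleep_millisecs: 60000
    last_scan: 4295401283
    scan_address: 0x7f52ac000000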

Signed-off-by: Alex Thorlton <[email protected]>
Cc: Andrew Morton <[email protected]>
Cc: Bob Liu <[email protected]>
Cc: David Rientjes <[email protected]>
Cc: Eric W. Biederman <[email protected]>
Cc: Hugh Dickins <[email protected]>
Cc: Ingo Molnar <[email protected]>
Cc: Kees Cook <[email protected]>
Cc: Kirill A. Shutemov <[email protected]>
Cc: Mel Gorman <[email protected]>
Cc: Oleg Nesterov <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Cc: Rik van Riel <[email protected]>
Cc: Thomas Gleixner <[email protected]>
Cc: Vladimir Davydov <[email protected]>
Cc: [email protected]

---
fs/proc/base.c | 25 +++++++++++++++++++++++++
1 file changed, 25 insertions(+)

diff --git a/fs/proc/base.c b/fs/proc/base.c
index 772efa4..a6603e5 100644
--- a/fs/proc/base.c
+++ b/fs/proc/base.c
@@ -2466,6 +2466,25 @@ static const struct file_operations proc_projid_map_operations = {
};
#endif /* CONFIG_USER_NS */

+#ifdef CONFIG_TRANSPARENT_HUGEPAGE
+int proc_pgcollapse_show(struct seq_file *m, struct pid_namespace *ns,
+ struct pid *pid, struct task_struct *tsk)
+{
+ /* need locks here */
+ seq_printf(m, "pages_to_scan: %u\n", tsk->pgcollapse_pages_to_scan);
+ seq_printf(m, "pages_collapsed: %u\n", tsk->pgcollapse_pages_collapsed);
+ seq_printf(m, "full_scans: %u\n", tsk->pgcollapse_full_scans);
+ seq_printf(m, "scan_sleep_millisecs: %u\n",
+ tsk->pgcollapse_scan_sleep_millisecs);
+ seq_printf(m, "alloc_sleep_millisecs: %u\n",
+ tsk->pgcollapse_alloc_sleep_millisecs);
+ seq_printf(m, "last_scan: %lu\n", tsk->pgcollapse_last_scan);
+ seq_printf(m, "scan_address: 0x%0lx\n", tsk->pgcollapse_scan_address);
+
+ return 0;
+}
+#endif /* CONFIG_TRANSPARENT_HUGEPAGE */
+
static int proc_pid_personality(struct seq_file *m, struct pid_namespace *ns,
struct pid *pid, struct task_struct *task)
{
@@ -2576,6 +2595,9 @@ static const struct pid_entry tgid_base_stuff[] = {
#ifdef CONFIG_CHECKPOINT_RESTORE
REG("timers", S_IRUGO, proc_timers_operations),
#endif
+#ifdef CONFIG_TRANSPARENT_HUGEPAGE
+ ONE("pgcollapse", S_IRUGO, proc_pgcollapse_show),
+#endif
};

static int proc_tgid_base_readdir(struct file *file, struct dir_context *ctx)
@@ -2914,6 +2936,9 @@ static const struct pid_entry tid_base_stuff[] = {
REG("gid_map", S_IRUGO|S_IWUSR, proc_gid_map_operations),
REG("projid_map", S_IRUGO|S_IWUSR, proc_projid_map_operations),
#endif
+#ifdef CONFIG_TRANSPARENT_HUGEPAGE
+ ONE("pgcollapse", S_IRUGO, proc_pgcollapse_show),
+#endif
};

static int proc_tid_base_readdir(struct file *file, struct dir_context *ctx)
--
1.7.12.4

2014-10-23 19:36:03

by Oleg Nesterov

Subject: Re: [PATCH 3/4] Convert khugepaged scan functions to work with task_work

I don't understand this patch, but...

On 10/23, Alex Thorlton wrote:
>
> static inline int khugepaged_fork(struct mm_struct *mm, struct mm_struct *oldmm)
> {
> + /* this will add task_pgcollapse_work to task_works */
> if (test_bit(MMF_VM_HUGEPAGE, &oldmm->flags))
> - return __khugepaged_enter(mm);
> + return __khugepaged_enter();

This certainly looks wrong, or I am totally confused. __khugepaged_enter()
does task_work_add(current), but we want to kick the child, not the parent.

> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -30,6 +30,8 @@
> #include <linux/mempolicy.h>
> #include <linux/migrate.h>
> #include <linux/task_work.h>
> +#include <linux/types.h>
> +#include <linux/khugepaged.h>
>
> #include <trace/events/sched.h>
>
> @@ -2060,6 +2062,23 @@ static inline void account_numa_dequeue(struct rq *rq, struct task_struct *p)
> }
> #endif /* CONFIG_NUMA_BALANCING */
>
> +#ifdef CONFIG_TRANSPARENT_HUGEPAGE
> +void task_pgcollapse_work(struct callback_head *work)
> +{
> + WARN_ON_ONCE(current != container_of(work, struct task_struct, pgcollapse_work));
> +
> + work->next = work; /* protect against double add */
> +
> + pr_info("!!! debug - INFO: task: %s/%d in task_pgcollapse_work\n",
> + current->comm, (int) current->pid);
> + khugepaged_do_scan();
> +}
> +#else

Why do you add it into kernel/sched/ ??

> -int __khugepaged_enter(struct mm_struct *mm)
> +int __khugepaged_enter(void)
> {
> - struct mm_slot *mm_slot;
> - int wakeup;
> + unsigned long period = msecs_to_jiffies(current->pgcollapse_scan_sleep_millisecs);
>
> - mm_slot = alloc_mm_slot();
> - if (!mm_slot)
> - return -ENOMEM;
> + pr_info("!!! debug - INFO: task: %s/%d jiffies: %lu period: %lu last_scan: %lu\n",
> + current->comm, (int) current->pid, jiffies, period,
> + current->pgcollapse_last_scan);
>
> - /* __khugepaged_exit() must not run from under us */
> - VM_BUG_ON_MM(khugepaged_test_exit(mm), mm);
> - if (unlikely(test_and_set_bit(MMF_VM_HUGEPAGE, &mm->flags))) {
> - free_mm_slot(mm_slot);
> - return 0;
> - }
> + /* may want to move this up to where we actually do the scan... */
> + if (time_after(jiffies, current->pgcollapse_last_scan + period)) {
> + current->pgcollapse_last_scan = jiffies;
>
> - spin_lock(&khugepaged_mm_lock);
> - insert_to_mm_slots_hash(mm, mm_slot);
> - /*
> - * Insert just behind the scanning cursor, to let the area settle
> - * down a little.
> - */
> - wakeup = list_empty(&khugepaged_scan.mm_head);
> - list_add_tail(&mm_slot->mm_node, &khugepaged_scan.mm_head);
> - spin_unlock(&khugepaged_mm_lock);
> + pr_info("!!! debug - INFO: task: %s/%d adding pgcollapse work\n",
> + current->comm, (int) current->pid);
>
> - atomic_inc(&mm->mm_count);
> - if (wakeup)
> - wake_up_interruptible(&khugepaged_wait);
> + /* debug - actual new code */
> + init_task_work(&current->pgcollapse_work, task_pgcollapse_work);
> + task_work_add(current, &current->pgcollapse_work, true);
> + } else {
> + pr_info("!!! debug - INFO: task: %s/%d skipping pgcollapse_scan\n",
> + current->comm, (int) current->pid);
> + }
>
> return 0;
> }

so mm_slots_hash becomes unused?

Oleg.

> @@ -2069,6 +2063,8 @@ int khugepaged_enter_vma_merge(struct vm_area_struct *vma)
> VM_BUG_ON_VMA(vma->vm_flags & VM_NO_THP, vma);
> hstart = (vma->vm_start + ~HPAGE_PMD_MASK) & HPAGE_PMD_MASK;
> hend = vma->vm_end & HPAGE_PMD_MASK;
> +
> + /* this will add task_pgcollapse_work to task_works */
> if (hstart < hend)
> return khugepaged_enter(vma);
> return 0;
> @@ -2417,6 +2413,9 @@ static void collapse_huge_page(struct mm_struct *mm,
> if (!new_page)
> return;
>
> + pr_info("!!! debug - INFO: task: %s/%d pgcollapse alloc: %lx on node %d\n",
> + current->comm, (int) current->pid, address, node);
> +
> if (unlikely(mem_cgroup_try_charge(new_page, mm,
> GFP_TRANSHUGE, &memcg)))
> return;
> @@ -2514,7 +2513,7 @@ static void collapse_huge_page(struct mm_struct *mm,
>
> *hpage = NULL;
>
> - khugepaged_pages_collapsed++;
> + current->pgcollapse_pages_collapsed++;
> out_up_write:
> up_write(&mm->mmap_sem);
> return;
> @@ -2616,44 +2615,34 @@ static void collect_mm_slot(struct mm_slot *mm_slot)
> }
>
> static unsigned int khugepaged_scan_mm_slot(unsigned int pages,
> - struct page **hpage)
> - __releases(&khugepaged_mm_lock)
> - __acquires(&khugepaged_mm_lock)
> + struct page **hpage)
> {
> - struct mm_slot *mm_slot;
> struct mm_struct *mm;
> struct vm_area_struct *vma;
> int progress = 0;
> + /*
> + * grab this pointer here to avoid dereferencing
> + * the task struct multiple times later
> + */
> + unsigned long *scan_addr_p = &current->pgcollapse_scan_address;
>
> VM_BUG_ON(!pages);
> - VM_BUG_ON(NR_CPUS != 1 && !spin_is_locked(&khugepaged_mm_lock));
>
> - if (khugepaged_scan.mm_slot)
> - mm_slot = khugepaged_scan.mm_slot;
> - else {
> - mm_slot = list_entry(khugepaged_scan.mm_head.next,
> - struct mm_slot, mm_node);
> - khugepaged_scan.address = 0;
> - khugepaged_scan.mm_slot = mm_slot;
> - }
> - spin_unlock(&khugepaged_mm_lock);
> + pr_info("!!! debug - task: %s/%d starting scan at %lx\n",
> + current->comm, (int) current->pid, *scan_addr_p);
>
> - mm = mm_slot->mm;
> + mm = current->mm;
> down_read(&mm->mmap_sem);
> if (unlikely(khugepaged_test_exit(mm)))
> vma = NULL;
> else
> - vma = find_vma(mm, khugepaged_scan.address);
> + vma = find_vma(mm, *scan_addr_p);
>
> progress++;
> for (; vma; vma = vma->vm_next) {
> unsigned long hstart, hend;
>
> cond_resched();
> - if (unlikely(khugepaged_test_exit(mm))) {
> - progress++;
> - break;
> - }
> if (!hugepage_vma_check(vma)) {
> skip:
> progress++;
> @@ -2663,26 +2652,24 @@ skip:
> hend = vma->vm_end & HPAGE_PMD_MASK;
> if (hstart >= hend)
> goto skip;
> - if (khugepaged_scan.address > hend)
> + if (*scan_addr_p > hend)
> goto skip;
> - if (khugepaged_scan.address < hstart)
> - khugepaged_scan.address = hstart;
> - VM_BUG_ON(khugepaged_scan.address & ~HPAGE_PMD_MASK);
> + if (*scan_addr_p < hstart)
> + *scan_addr_p = hstart;
> + VM_BUG_ON(*scan_addr_p & ~HPAGE_PMD_MASK);
>
> - while (khugepaged_scan.address < hend) {
> + while (*scan_addr_p < hend) {
> int ret;
> cond_resched();
> if (unlikely(khugepaged_test_exit(mm)))
> goto breakouterloop;
>
> - VM_BUG_ON(khugepaged_scan.address < hstart ||
> - khugepaged_scan.address + HPAGE_PMD_SIZE >
> - hend);
> - ret = khugepaged_scan_pmd(mm, vma,
> - khugepaged_scan.address,
> + VM_BUG_ON(*scan_addr_p < hstart ||
> + *scan_addr_p + HPAGE_PMD_SIZE > hend);
> + ret = khugepaged_scan_pmd(mm, vma, *scan_addr_p,
> hpage);
> /* move to next address */
> - khugepaged_scan.address += HPAGE_PMD_SIZE;
> + *scan_addr_p += HPAGE_PMD_SIZE;
> progress += HPAGE_PMD_NR;
> if (ret)
> /* we released mmap_sem so break loop */
> @@ -2694,30 +2681,14 @@ skip:
> breakouterloop:
> up_read(&mm->mmap_sem); /* exit_mmap will destroy ptes after this */
> breakouterloop_mmap_sem:
> -
> - spin_lock(&khugepaged_mm_lock);
> - VM_BUG_ON(khugepaged_scan.mm_slot != mm_slot);
> /*
> * Release the current mm_slot if this mm is about to die, or
> - * if we scanned all vmas of this mm.
> + * if we scanned all vmas of this mm. Don't think we need the
> + * khugepaged_test_exit for the task_work style scan...
> */
> if (khugepaged_test_exit(mm) || !vma) {
> - /*
> - * Make sure that if mm_users is reaching zero while
> - * khugepaged runs here, khugepaged_exit will find
> - * mm_slot not pointing to the exiting mm.
> - */
> - if (mm_slot->mm_node.next != &khugepaged_scan.mm_head) {
> - khugepaged_scan.mm_slot = list_entry(
> - mm_slot->mm_node.next,
> - struct mm_slot, mm_node);
> - khugepaged_scan.address = 0;
> - } else {
> - khugepaged_scan.mm_slot = NULL;
> - khugepaged_full_scans++;
> - }
> -
> - collect_mm_slot(mm_slot);
> + *scan_addr_p = 0;
> + current->pgcollapse_full_scans++;
> }
>
> return progress;
> @@ -2735,11 +2706,11 @@ static int khugepaged_wait_event(void)
> kthread_should_stop();
> }
>
> -static void khugepaged_do_scan(void)
> +void khugepaged_do_scan(void)
> {
> struct page *hpage = NULL;
> - unsigned int progress = 0, pass_through_head = 0;
> - unsigned int pages = khugepaged_pages_to_scan;
> + unsigned int progress = 0;
> + unsigned int pages = current->pgcollapse_pages_to_scan;
> bool wait = true;
>
> barrier(); /* write khugepaged_pages_to_scan to local stack */
> @@ -2750,19 +2721,12 @@ static void khugepaged_do_scan(void)
>
> cond_resched();
>
> - if (unlikely(kthread_should_stop() || freezing(current)))
> + if (unlikely(freezing(current)))
> break;
>
> - spin_lock(&khugepaged_mm_lock);
> - if (!khugepaged_scan.mm_slot)
> - pass_through_head++;
> - if (khugepaged_has_work() &&
> - pass_through_head < 2)
> - progress += khugepaged_scan_mm_slot(pages - progress,
> - &hpage);
> - else
> - progress = pages;
> - spin_unlock(&khugepaged_mm_lock);
> + progress += khugepaged_scan_mm_slot(pages - progress, &hpage);
> + pr_info("!!! debug - INFO: task: %s/%d scan iteration progress %u\n",
> + current->comm, (int) current->pid, progress);
> }
>
> if (!IS_ERR_OR_NULL(hpage))
> --
> 1.7.12.4
>

2014-10-23 19:41:03

by Oleg Nesterov

Subject: Re: [PATCH 2/4] Add pgcollapse controls to task_struct

On 10/23, Alex Thorlton wrote:
>
> --- a/include/linux/sched.h
> +++ b/include/linux/sched.h
> @@ -1661,6 +1661,18 @@ struct task_struct {
> unsigned int sequential_io;
> unsigned int sequential_io_avg;
> #endif
> +#ifdef CONFIG_TRANSPARENT_HUGEPAGE
> + struct callback_head pgcollapse_work;
> + /* default scan 8*512 pte (or vmas) every 30 second */
> + unsigned int pgcollapse_pages_to_scan;
> + unsigned int pgcollapse_pages_collapsed;
> + unsigned int pgcollapse_full_scans;
> + unsigned int pgcollapse_scan_sleep_millisecs;
> + /* during fragmentation poll the hugepage allocator once every minute */
> + unsigned int pgcollapse_alloc_sleep_millisecs;
> + unsigned long pgcollapse_last_scan;
> + unsigned long pgcollapse_scan_address;
> +#endif

Shouldn't this all live in mm_struct?

Except pgcollapse_work can't, since exit_mm() is called before
exit_task_work(). Probably it can be allocated.

Oleg.